# ENHANCING HEALTHCARE WITH AI: TRANSFORMER MODELS FOR PREDICTION IN MATERNAL HEALTH


## Project Overview
I conducted this project to analyze maternal health data and identify key risk factors influencing pregnancy outcomes. The overall objective was to build reliable and interpretable predictive models that can support early risk identification and informed healthcare decision-making. The workflow includes data preprocessing, exploratory data analysis, model development, evaluation, and explainable AI techniques.

Maternal health remains a critical concern in many rural regions of India, where limited access to timely medical care and inadequate healthcare infrastructure often lead to preventable complications during pregnancy. Through this project, I aimed to contribute toward addressing this challenge by developing a data-driven and interpretable predictive framework for maternal health risk assessment.

I worked with a publicly available dataset from the Open Government Data Platform (India), consisting of approximately 30,000 records and 24 features such as age, blood pressure, blood sugar levels, and pregnancy history. Extensive preprocessing was performed, including handling missing values, feature engineering, and exploratory data analysis to ensure data quality and reliability.

I developed and compared three predictive models: Logistic Regression, Random Forest, and a TabTransformer-based deep learning model. Logistic Regression and Random Forest were evaluated with accuracy and weighted precision, recall, and F1 score, while the TabTransformer-based model was monitored through accuracy during training. Among them, the TabTransformer demonstrated strong performance by effectively capturing complex relationships between features.

To enhance model interpretability and trustworthiness, I applied SHAP (SHapley Additive exPlanations) to understand feature contributions. The results highlighted that factors such as blood pressure, maternal age, and blood sugar levels were among the most influential predictors, aligning well with established clinical knowledge.

By integrating explainable AI techniques with predictive modeling, this project demonstrates how data-driven approaches can support transparent and informed healthcare decision-making. The findings highlight the potential of interpretable deep learning models in improving maternal healthcare outcomes, particularly in resource-constrained and rural settings. Future work will focus on incorporating real-time monitoring, expanding the dataset, and enhancing model generalizability to improve real-world applicability.


## Step 1: Data Preprocessing
### In this step, we import all required libraries, load the dataset, and perform basic inspection to understand its structure, size, and missing values.

In [None]:
#Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Load Dataset
data = pd.read_csv("/content/filtered ds_sdp.csv")
data

In [None]:
#Dataset Size
data.size

In [None]:
#Dataset Shape
data.shape

In [None]:
#Dataset Columns
data.columns

In [None]:
#Statistical Summary
data.describe()

In [None]:
#Checking Missing Values
missing = data.isnull().sum()
missing = missing[missing > 0]
print(missing)
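
The raw counts above are easier to act on when expressed as a share of rows. The following is a minimal sketch on a toy frame, not the project's actual selection rule: the values and the 0.5 cutoff (`DROP_THRESHOLD`) are assumptions chosen only to illustrate how columns with mostly missing data could be flagged.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "age": [22, 31, np.nan, 27],
    "pregnant_month": [np.nan, np.nan, np.nan, 5.0],
    "religion": ["A", None, "B", "A"],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than half empty.
DROP_THRESHOLD = 0.5
drop_candidates = missing_share[missing_share > DROP_THRESHOLD].index.tolist()

print(missing_share)
print("Candidates to drop:", drop_candidates)
```

On the real data, the same two lines can be run on `data` instead of `toy`.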

## Step 2: Feature Engineering
### This step focuses on handling missing values, removing unnecessary columns, and preparing the dataset for modeling.

In [None]:
# Dropping Columns with High Missing Values
columns_to_drop = [
    'pregnant_month', 'is_anc_registered', 'aware_of_haf',
    'is_any_fp_methos_used', 'fp_method_used',
    'reason_for_not_using_fp_method',
    'currently_attending_school', 'reason_for_not_attending_school'
]

data.drop(columns=columns_to_drop, inplace=True)

In [None]:
#Handling Missing Values
#Assign results back to the column; calling fillna(inplace=True) on a column selection is deprecated in recent pandas versions
#Filling categorical values using mode:
for col in ['religion', 'social_group_code', 'sex', 'relation_to_head']:
    data[col] = data[col].fillna(data[col].mode()[0])

#Filling with "Unknown" where appropriate:
for col in ['delivered_any_baby', 'outcome_pregnancy', 'is_currently_pregnant', 'usual_residance']:
    data[col] = data[col].fillna('Unknown')

#Filling numerical columns with zero:
for col in ['born_alive_total', 'surviving_total', 'mother_age_when_baby_was_born']:
    data[col] = data[col].fillna(0)

#Awareness columns:
for col in ['aware_abt_rti', 'aware_abt_hiv', 'aware_of_the_danger_signs']:
    data[col] = data[col].fillna('Not aware')
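
Filling a numerical column with zero makes a missing value indistinguishable from a true zero. One possible refinement, which this project does not use and which is shown here only as an assumption-laden sketch, is to record a missing-value flag before filling. The `_was_missing` suffix is a hypothetical naming choice.

```python
import numpy as np
import pandas as pd

# Toy column standing in for the real data (hypothetical values).
toy = pd.DataFrame({"mother_age_when_baby_was_born": [24.0, np.nan, 30.0, np.nan]})

col = "mother_age_when_baby_was_born"

# Record which rows were missing before overwriting them.
toy[col + "_was_missing"] = toy[col].isnull().astype(int)

# Fill with zero, assigning the result back to the column.
toy[col] = toy[col].fillna(0)

print(toy)
```

A model can then learn separately from the flag and from the filled value.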


In [None]:
#Save Cleaned Dataset
data.to_csv('cleaned_filtered_ds_sdp.csv', index=False)
print("Data cleaned and saved successfully!")

## Step 3: Encoding Categorical Variables
### Machine learning models require numerical input. Here, categorical features are encoded using Label Encoding.

In [None]:

from sklearn.preprocessing import LabelEncoder

data1 = pd.read_csv("/content/cleaned_filtered_ds_sdp.csv")

le = LabelEncoder()

columns_to_encode = [
    'religion', 'social_group_code', 'sex', 'delivered_any_baby',
    'outcome_pregnancy', 'is_currently_pregnant',
    'usual_residance', 'relation_to_head',
    'aware_abt_rti', 'aware_abt_hiv',
    'aware_of_the_danger_signs', 'marital_status'
]

for col in columns_to_encode:
    data1[col] = le.fit_transform(data1[col])

# Save Encoded Dataset
data1.to_csv('final_dataset_encoded.csv', index=False)
print("Final dataset saved.")
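
Because a single `LabelEncoder` is refit on every column, only the mapping of the last column remains available afterwards. If the integer codes need to be translated back to labels later (for example when reading plots), one option is to keep a separate encoder per column. This is a sketch on toy data; the `encoders` dictionary is a hypothetical helper, not part of the original pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the cleaned dataset (hypothetical values).
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "religion": ["B", "A", "B"],
})

# Keep one fitted encoder per column so each mapping can be recovered.
encoders = {}
for col in ["sex", "religion"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Recover the original labels for one column.
decoded = encoders["sex"].inverse_transform(toy["sex"])

print(toy)
print(list(encoders["sex"].classes_), list(decoded))
```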


## Step 4: Exploratory Data Analysis

In [None]:
#Distribution of Age
sns.histplot(data=data1, x='age', kde=True)
plt.title("Distribution of Age")


In [None]:
#Target Variable Distribution
sns.countplot(x='outcome_pregnancy', data=data1)


In [None]:
# Boxplot Analysis
sns.boxplot(x='delivered_any_baby', y='age', data=data1)


In [None]:
# Correlation Heatmap
plt.figure(figsize=(12,8))
# numeric_only=True skips any column that is still non-numeric after encoding
sns.heatmap(data1.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")


## Step 5: Train Test Split

In [None]:
#Separating Features and Target
from sklearn.model_selection import train_test_split

X = data1.drop('outcome_pregnancy', axis=1)
y = data1['outcome_pregnancy']

#Splitting into Train and Test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training size:", X_train.shape)
print("Testing size:", X_test.shape)
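
The `stratify=y` argument is meant to keep the class proportions of the target the same in both splits. A quick way to confirm this is to compare normalized class counts. The sketch below uses synthetic labels with an assumed 160/30/10 class balance, which is not taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical class balance).
y_toy = pd.Series([0] * 160 + [1] * 30 + [2] * 10)
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares in the full data and in each split, side by side.
shares = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()

print(shares)
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above.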


## Step 6: Model Building
### Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

#Creating Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

#Training the model
log_reg.fit(X_train, y_train)

#Predicting on Test data
y_pred = log_reg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))
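
Logistic Regression is sensitive to feature scale, and unscaled inputs are a common reason for needing a high `max_iter`. The sketch below shows one way to add standardization in a scikit-learn pipeline. It is an optional variation, not the configuration used above, and it runs on synthetic data from `make_classification` rather than the maternal health dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the encoded dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused at predict time.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

demo_pred = scaled_log_reg.predict(X_te)
demo_accuracy = accuracy_score(y_te, demo_pred)
print("Accuracy on synthetic data:", demo_accuracy)
```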


### Random Forest Model

In [None]:
#Creating Random Forest model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

#Training the model
rf.fit(X_train, y_train)

#Predicting on Test data
y_pred_rf = rf.predict(X_test)

#Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf, average='weighted'))
print("Recall:", recall_score(y_test, y_pred_rf, average='weighted'))
print("F1 Score:", f1_score(y_test, y_pred_rf, average='weighted'))
print(classification_report(y_test, y_pred_rf))
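
Weighted averages can hide poor performance on a minority class. A confusion matrix shows where each class is being misclassified. The following sketch uses small hand-written label arrays, which are illustrative values and not model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only (not real predictions).
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Per-class recall: correct predictions divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("Per-class recall:", per_class_recall)
```

With the real models, `confusion_matrix(y_test, y_pred_rf)` gives the equivalent table for the Random Forest.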


## Step 7: Feature Importance

In [None]:
#Get feature importances
import matplotlib.pyplot as plt
importances = rf.feature_importances_

#Creating DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

#plot
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.gca().invert_yaxis()
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.show()
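
The `feature_importances_` attribute is impurity-based and is computed on the training data, which can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below assumes synthetic data and hypothetical feature names (`f0` to `f5`); it is not a result from this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test split and measure the drop in score.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

perm_df = pd.DataFrame({
    "Feature": X_te.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)

print(perm_df)
```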


## Step 8: Tabular Transformer Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Shape details
n_features = X_train.shape[1]
n_classes = len(np.unique(y_train))

# Input Layer
inputs = keras.Input(shape=(n_features,), name='Input')

# Dense projection to simulate embedding
x = layers.Dense(64, activation='relu')(inputs)
# Add a sequence axis of length 1 so the attention layer receives (batch, 1, 64)
x = layers.Reshape((1, 64))(x)

# Transformer Encoder Block
attention_output = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
attention_output = layers.Flatten()(attention_output)

# Feed Forward Network on the flattened attention output
x = layers.Dense(128, activation='relu')(attention_output)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation='relu')(x)

outputs = layers.Dense(n_classes, activation='softmax', name='Output')(x)

# Define and Compile Model
transformer_model = keras.Model(inputs=inputs, outputs=outputs)

transformer_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Fit Model
history = transformer_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=30,
    batch_size=32,
    verbose=1
)
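
The softmax output layer returns one probability per class, so class labels have to be derived with `argmax` before the scikit-learn metrics used for the other models can be applied. The sketch below uses a hand-written probability array in place of `transformer_model.predict(X_test)`; the numbers are illustrative and not results from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Stand-in for transformer_model.predict(X_test): one row of class
# probabilities per sample (hypothetical values).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
y_true_toy = np.array([0, 1, 2, 1])

# Pick the most probable class for each sample.
y_pred_toy = np.argmax(probs, axis=1)

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted")

print("Predicted classes:", y_pred_toy)
print("Accuracy:", toy_accuracy)
print("Weighted F1:", toy_f1)
```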


## Step 10: Explainable AI (SHAP)

In [None]:
import shap

# Use KernelExplainer
explainer = shap.KernelExplainer(transformer_model.predict, shap.sample(X_train, 100))

# Compute SHAP values
shap_values = explainer.shap_values(X_test[:100])


In [None]:
# Derive the number of classes from the model's output layer
num_classes = transformer_model.output_shape[-1]

# Iterate through each class and plot SHAP values separately
for class_index in range(num_classes):
    # Assumes shap_values is an array of shape (n_samples, n_features, n_classes)
    class_shap_values = shap_values[:, :, class_index]

    # Draw the summary plot for the current class, then add the title before showing it
    shap.summary_plot(
        class_shap_values,
        X_test[:100],
        feature_names=list(X_train.columns),
        show=False
    )
    plt.title(f"SHAP Values for Class {class_index}")
    plt.show()
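
In addition to the per-class plots, a single global ranking can be obtained by averaging the absolute SHAP values over samples and classes. The sketch below uses a random placeholder array with the assumed shape `(n_samples, n_features, n_classes)`; the values are synthetic and the three feature names are a small subset chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature_names = ["age", "born_alive_total", "surviving_total"]

# Placeholder for real SHAP output: (n_samples, n_features, n_classes).
fake_shap = rng.normal(size=(50, 3, 3))
# Inflate the first feature so the ranking has a clear leader.
fake_shap[:, 0, :] *= 5.0

# Mean absolute contribution per feature, across samples and classes.
mean_abs = np.abs(fake_shap).mean(axis=(0, 2))
ranking = pd.Series(mean_abs, index=feature_names).sort_values(ascending=False)

print(ranking)
```

With real output, `fake_shap` would be replaced by `shap_values`, provided it has the same three-dimensional layout.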
