# Project: Stroke Prediction Binary Classification
---
This part of the project performs binary classification on the [Kaggle Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) using a neural network. It is recommended to run this notebook on Windows to benefit from GPU-accelerated TensorFlow computations (it also runs on Linux if you have the patience to deal with driver problems). The notebook is structured as follows:
* Import necessary Libraries
* Load the pre-downloaded dataset
* Data processing and clean up
* Analyze features and build different models


**Note**: The pre-downloaded data was generated on 12/13/2024.

## Import necessary Libraries

In [1]:
import tensorflow as tf

In [2]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [3]:
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Activation
import tensorflow.keras.backend as K

In [4]:
import pandas as pd
import seaborn as sns

In [5]:
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.utils import resample

In [6]:
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout, Input

## Load the pre-downloaded dataset
As in the non-neural-network part of the project, the dataset is loaded from the `Dataset` folder.

In [5]:
df = pd.read_csv('Dataset/healthcare-dataset-stroke-data.csv')
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (5110, 12)


| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


## Data processing and clean up
The data is analyzed and preprocessed in the same way as in the non-neural-network part of the project.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [7]:
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

The two cells above show that 201 samples are missing `bmi` values. Rather than dropping them, we fill each one with the mean `bmi` of the samples that share its gender, marital status, and age group.

In [8]:
df=df.drop(columns="id")

In [10]:
# Bucket age into contiguous life-stage groups (0-2, 2-12, 12-18, 18-35, 35-60, 60+)
df["age_group"]=df["age"].apply(lambda x:"Infant" if (x>=0)&(x<=2)
                                  else ("Child" if (x>2)&(x<=12)
                                  else ("Adolescent"if (x>12)&(x<=18)
                                  else ("Young Adults"if (x>18)&(x<=35)
                                  else ("Middle Aged Adults" if (x>35)&(x<=60)
                                  else "Old Aged Adults")))))
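
A similar bucketing can also be expressed with `pd.cut`, which keeps all bin edges in one place. The following is a minimal sketch on made-up ages, assuming contiguous right-closed bins (0-2, 2-12, 12-18, 18-35, 35-60, 60+); the `ages` series and the `AGE_BINS` / `AGE_LABELS` names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative bin edges and labels (assumed contiguous, right-closed intervals)
AGE_BINS = [0, 2, 12, 18, 35, 60, np.inf]
AGE_LABELS = ["Infant", "Child", "Adolescent", "Young Adults",
              "Middle Aged Adults", "Old Aged Adults"]

ages = pd.Series([0.08, 2, 12, 18, 19, 35, 60, 82])  # made-up sample ages

# include_lowest=True closes the first bin on the left, so age 0 counts as Infant
age_group = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, include_lowest=True)
print(age_group.astype(str).tolist())
```

Each boundary age (2, 12, 18, 35, 60) falls into the lower group, matching the `<=` comparisons in the lambda.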

In [11]:
df['bmi'] = df['bmi'].fillna(df.groupby(["gender","ever_married","age_group"])["bmi"].transform('mean'))
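
To illustrate how the group-wise fill works, here is a small sketch on a made-up frame with a single grouping key. The `toy` data and the fallback to a global mean (for groups with no observed `bmi` at all) are assumptions added for illustration; the notebook itself groups by three columns and has no such fallback.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female", "Other"],
    "bmi": [20.0, np.nan, 30.0, 34.0, np.nan, np.nan],
})

# Mean of the observed values, computed before any filling
global_mean = toy["bmi"].mean()

# transform("mean") returns one value per row: the mean bmi of that row's group
group_mean = toy.groupby("gender")["bmi"].transform("mean")

# Fill from the group mean first; groups with no observed bmi fall back to the global mean
toy["bmi_filled"] = toy["bmi"].fillna(group_mean).fillna(global_mean)
print(toy)
```

The `Other` group has no observed value, so its group mean is itself `NaN`; without the second `fillna` that row would stay missing.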

In [13]:
df = df[(df["bmi"]<66) & (df["bmi"]>12)]
df = df[(df["avg_glucose_level"]>56) & (df["avg_glucose_level"]<250)]
df=df.drop(df[df["gender"]=="Other"].index)

In [15]:
had_stroke = df[df["stroke"]==1]
no_stroke = df[df["stroke"]==0]
upsampled_had_stroke = resample(had_stroke,replace=True , n_samples=no_stroke.shape[0] , random_state=123 )
upsampled_data = pd.concat([no_stroke,upsampled_had_stroke])
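
Because the minority class is resampled with replacement before any train/test split, copies of the same stroke sample can end up on both sides of a later split. One common alternative, sketched here on synthetic data rather than on the notebook's dataframe, is to split first and upsample only the training portion. The `toy` frame and its `feature` column are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "stroke": [1] * 20 + [0] * 180,   # synthetic 10% minority class
})

# Split first, keeping the class ratio the same in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["stroke"], random_state=7
)

# Upsample the minority class inside the training part only
minority = train[train["stroke"] == 1]
majority = train[train["stroke"] == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["stroke"].value_counts())
print("test size:", len(test))
```

With this ordering the test set keeps the original class ratio, so its accuracy has to be read alongside recall for the stroke class.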

In [16]:
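# One-hot encode the categorical columns. hypertension and heart_disease are already numeric, so
# get_dummies passes them through unchanged and the drop removes both copies from model_data.
cols = ['gender','hypertension','heart_disease', 'ever_married', 'work_type', 'Residence_type','smoking_status']
dums = pd.get_dummies(upsampled_data[cols],dtype=int)
model_data = pd.concat([upsampled_data,dums],axis=1).drop(columns=cols)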
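
The way `pd.get_dummies` treats numeric columns is easy to miss, so here is a small sketch on a made-up frame. It first repeats the pattern from the cell above, then shows one possible variant that encodes only the text columns and therefore keeps the numeric flag; the `toy` frame and the `categorical` list are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "hypertension": [0, 1, 0],          # already numeric
    "age": [67.0, 61.0, 80.0],
})

# Same pattern as above: numeric columns in `cols` pass through get_dummies unchanged
cols = ["gender", "hypertension"]
dums = pd.get_dummies(toy[cols], dtype=int)
print(list(dums.columns))

# After the concat, "hypertension" exists twice; drop(columns=cols) removes both copies
as_in_notebook = pd.concat([toy, dums], axis=1).drop(columns=cols)
print(list(as_in_notebook.columns))

# Variant: encode only the text columns so numeric flags stay in the frame
categorical = ["gender"]
keeps_numeric = pd.concat(
    [toy.drop(columns=categorical), pd.get_dummies(toy[categorical], dtype=int)],
    axis=1,
)
print(list(keeps_numeric.columns))
```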

In [17]:
encoder = LabelEncoder()
model_data["age_group"] = encoder.fit_transform(model_data["age_group"])

In [18]:
scaler = MinMaxScaler()
for col in ['age','avg_glucose_level','bmi']:
    scaler.fit(model_data[[col]])
    model_data[col]=scaler.transform(model_data[[col]])
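
For reference, min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. The sketch below, on made-up values, checks a manual computation against `MinMaxScaler` and shows that the scaler can process several columns in a single call; the `toy` frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({"age": [20.0, 40.0, 80.0], "bmi": [18.0, 24.0, 30.0]})  # made-up values
cols = ["age", "bmi"]

# Manual min-max scaling, computed per column
manual = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())

# MinMaxScaler applies the same formula to every column it is given
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[cols]), columns=cols, index=toy.index
)

print(scaled)
print("matches manual computation:", np.allclose(manual.values, scaled.values))
```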

In [19]:
model_data

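| | age | avg_glucose_level | bmi | stroke | age_group | gender_Female | gender_Male | ever_married_No | ever_married_Yes | work_type_Govt_job | work_type_Never_worked | work_type_Private | work_type_Self-employed | work_type_children | Residence_type_Rural | Residence_type_Urban | smoking_status_Unknown | smoking_status_formerly smoked | smoking_status_never smoked | smoking_status_smokes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 249 | 0.035645 | 0.202080 | 0.108571 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 250 | 0.707031 | 0.165028 | 0.512381 | 0 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 251 | 0.096680 | 0.283689 | 0.100952 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 252 | 0.853516 | 0.067119 | 0.449524 | 0 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 253 | 0.169922 | 0.544452 | 0.129524 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9 | 0.951172 | 0.012937 | 0.226667 | 1 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 229 | 0.975586 | 0.051542 | 0.440000 | 1 | 4 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 29 | 0.719238 | 0.805786 | 0.374660 | 1 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 27 | 0.707031 | 0.692248 | 0.374660 | 1 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |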


## Analyze features and build different models:
---



### Current Plan:
Neural networks are trained with different optimizers and learning rates to find the best-performing combination. The optimizers used are:
* Adam
* SGD_Momentum

---

"I have a plan. We just need time and money." - Dutch poet: Van Der Linde

### Splitting the data and initializing models

In [20]:
X = model_data.drop(columns="stroke")
y = model_data["stroke"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=7,shuffle=True)

### Define a function to build models with different optimizers and learning rates

In [29]:
def build_model(optimizer, learning_rate):
    model = Sequential(name="Stroke_Prediction_Model")
    model.add(Input(shape=(X_train.shape[1],), name='Input_Layer'))
    model.add(Dense(128, activation='relu', name='Hidden_Layer_1'))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation='relu', name='Hidden_Layer_2'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid', name='Output_Layer'))
    
    optimizer_instance = optimizer(learning_rate=learning_rate)
    model.compile(optimizer=optimizer_instance, 
                  loss='binary_crossentropy', 
                  metrics=['accuracy'])
    return model
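
To get a feel for the size of this network, the trainable parameter count can be worked out by hand. The sketch below assumes 19 input features (the columns of `model_data` in the preview above, minus `stroke`); `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 19            # assumption: model_data columns without 'stroke'
layer_units = [128, 64, 1]  # Hidden_Layer_1, Hidden_Layer_2, Output_Layer

total = 0
fan_in = n_features
for units in layer_units:
    total += dense_params(fan_in, units)
    fan_in = units  # Dropout adds no parameters and keeps the layer width

print(total)  # 2560 + 8256 + 65 = 10881
```

If the feature count matches, `model.summary()` on a built model should report the same total.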

### Hyperparameter grid

In [None]:
learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [16, 32, 64]
epochs_list = [20, 50, 100]
# build_model only passes learning_rate, so SGD runs with Keras' default momentum of 0.0
optimizers = {"Adam": Adam, "SGD_Momentum": SGD}
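
The size of this grid determines how many models get trained. As a quick sketch, the combinations can be enumerated with `itertools.product`; optimizer names stand in for the Keras classes here so the snippet runs without TensorFlow.

```python
from itertools import product

learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [16, 32, 64]
epochs_list = [20, 50, 100]
optimizer_names = ["Adam", "SGD_Momentum"]  # names only; no Keras objects needed here

# Every (optimizer, learning rate, batch size, epochs) combination, in nested-loop order
grid = list(product(optimizer_names, learning_rates, batch_sizes, epochs_list))

print(len(grid))  # 2 * 3 * 3 * 3 = 54
print(grid[0])    # ('Adam', 0.01, 16, 20)
```

Each of the 54 entries corresponds to one full training run, which is why early stopping matters for the total runtime.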

### Start Training

In [None]:
# Grid search to find the best combination
best_combination = None
best_accuracy = 0
best_model = None

for optimizer_name, optimizer in optimizers.items():
    for lr in learning_rates:
        for batch_size in batch_sizes:
            for epochs in epochs_list:
                print(f"\nTesting Optimizer: {optimizer_name}, LR: {lr}, Batch Size: {batch_size}, Epochs: {epochs}")
                model = build_model(optimizer, lr)
                early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
                
                history = model.fit(
                    X_train, y_train, 
                    validation_split=0.2, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks=[early_stop],
                    verbose=0  # Suppress output for clarity
                )
                
                # Evaluate on the test set
                loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
                print(f"Accuracy: {accuracy:.4f}")
                
                # Save the best model and hyperparameters
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_combination = (optimizer_name, lr, batch_size, epochs)
                    best_model = model

# Final results
print("\nBest Hyperparameter Combination:")
print(f"Optimizer: {best_combination[0]}, Learning Rate: {best_combination[1]}, Batch Size: {best_combination[2]}, Epochs: {best_combination[3]}")
print(f"Best Test Accuracy: {best_accuracy:.4f}")

# Evaluate the best model
y_pred = (best_model.predict(X_test) > 0.5).astype("int32")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Testing Optimizer: Adam, LR: 0.01, Batch Size: 16, Epochs: 10
Accuracy: 0.8016

Testing Optimizer: Adam, LR: 0.01, Batch Size: 16, Epochs: 20
Accuracy: 0.7964

Testing Optimizer: Adam, LR: 0.01, Batch Size: 16, Epochs: 50
Accuracy: 0.8021

Testing Optimizer: Adam, LR: 0.01, Batch Size: 32, Epochs: 10
Accuracy: 0.7979

Testing Optimizer: Adam, LR: 0.01, Batch Size: 32, Epochs: 20
Accuracy: 0.8062

Testing Optimizer: Adam, LR: 0.01, Batch Size: 32, Epochs: 50
Accuracy: 0.8292

Testing Optimizer: Adam, LR: 0.01, Batch Size: 64, Epochs: 10
Accuracy: 0.7937

Testing Optimizer: Adam, LR: 0.01, Batch Size: 64, Epochs: 20
Accuracy: 0.8073

Testing Optimizer: Adam, LR: 0.01, Batch Size: 64, Epochs: 50
Accuracy: 0.8370

Testing Optimizer: Adam, LR: 0.001, Batch Size: 16, Epochs: 10
Accuracy: 0.8078

Testing Optimizer: Adam, LR: 0.001, Batch Size: 16, Epochs: 20
Accuracy: 0.8255

Testing Optimizer: Adam, LR: 0.001, Batch Size: 16, Epochs: 50
Accuracy: 0.8781

Testing Optimizer: Adam, LR: 0.001, 
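
The final evaluation step thresholds the sigmoid outputs at 0.5 and summarizes the predictions in a confusion matrix. The sketch below shows how to read that matrix, using made-up labels and probabilities instead of the model's real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])            # made-up ground-truth labels
probs = np.array([0.1, 0.7, 0.4, 0.8, 0.9])   # made-up sigmoid outputs

# Same thresholding as in the evaluation cell: probability > 0.5 means "stroke"
y_pred = (probs > 0.5).astype("int32")

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)  # share of actual stroke cases that were caught
print(cm)
print(f"recall for class 1: {recall:.2f}")
```

For a screening task like stroke prediction, the false negatives in the bottom-left cell are usually the most costly errors.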

In [7]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0
