## Supervised ML Models for Cerebral Stroke Prediction

In [2]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Importing required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold 
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

# Loading the dataset
data = pd.read_csv('Cerebral Stroke Prediction.csv')

# Data Preprocessing

# Handle missing values (using mean for numerical, mode for categorical variables)

for col in ['age', 'avg_glucose_level', 'bmi']:
    data[col] = data[col].fillna(data[col].mean())
for col in ['gender', 'work_type', 'Residence_type', 'smoking_status', 'ever_married']:
    data[col] = data[col].fillna(data[col].mode()[0])

# Encoding categorical variables as integer labels

categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

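# Outlier treatment: cap bmi at its 5th and 95th percentiles

from scipy.stats.mstats import winsorize
# winsorize returns a masked array, so convert it back to a plain ndarray
data['bmi'] = np.asarray(winsorize(data['bmi'].to_numpy(), limits=[0.05, 0.05]))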

# Splitting features (X) and target (y)

X = data.drop(['id', 'stroke'], axis=1)
y = data['stroke']

# Handling class imbalance with SMOTE (applied to the full dataset, before the split below)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=100, stratify=y_res)


In [4]:
# Logistic Regression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Decision Tree

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Random Forest

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# XGBoost

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)


In [5]:
# Evaluating the models

print("Logistic Regression: ", classification_report(y_test, y_pred_lr))
print("Decision Tree: ", classification_report(y_test, y_pred_dt))
print("Random Forest: ", classification_report(y_test, y_pred_rf))
print("XGBoost: ", classification_report(y_test, y_pred_xgb))

# Cross-validation 

skf = StratifiedKFold(n_splits=5)
for model in [lr, dt, rf, xgb]:
    scores = []
    for train_idx, test_idx in skf.split(X_res, y_res):
        X_cv_train, X_cv_test = X_res.iloc[train_idx], X_res.iloc[test_idx]
        y_cv_train, y_cv_test = y_res.iloc[train_idx], y_res.iloc[test_idx]
        model.fit(X_cv_train, y_cv_train)
        y_cv_pred = model.predict(X_cv_test)
        scores.append(f1_score(y_cv_test, y_cv_pred))
    print(f"{model.__class__.__name__} CV F1: ", np.mean(scores))


Logistic Regression:                precision    recall  f1-score   support

           0       0.81      0.80      0.81      8524
           1       0.80      0.82      0.81      8523

    accuracy                           0.81     17047
   macro avg       0.81      0.81      0.81     17047
weighted avg       0.81      0.81      0.81     17047

Decision Tree:                precision    recall  f1-score   support

           0       0.97      0.94      0.96      8524
           1       0.95      0.97      0.96      8523

    accuracy                           0.96     17047
   macro avg       0.96      0.96      0.96     17047
weighted avg       0.96      0.96      0.96     17047

Random Forest:                precision    recall  f1-score   support

           0       0.99      0.97      0.98      8524
           1       0.97      0.99      0.98      8523

    accuracy                           0.98     17047
   macro avg       0.98      0.98      0.98     17047
weighted avg       0

## Conclusion

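Based on both the test-set metrics and the cross-validation F1 scores, Random Forest is the best-performing model. Both sets of scores were computed on the SMOTE-resampled data, so they describe performance on a balanced class distribution:

Random Forest Classifier

Accuracy = 0.98

Precision and recall = 0.97 to 0.99 for both classes; F1-score = 0.98 for both classes

Cross-validation F1 score = 0.97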
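
A single averaged F1 value hides how much the individual folds differ. The sketch below reports the per-fold F1 scores together with their mean and standard deviation using `cross_val_score`. It runs on synthetic data with illustrative names (`X_demo`, `y_demo`, `clf`), and shuffling the folds with a fixed seed is an added choice here, not something the notebook's `StratifiedKFold(n_splits=5)` does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption), not the stroke dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)

# Shuffled, seeded folds so the result is reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# One F1 score per fold; the estimator is cloned and refit for each fold
fold_f1 = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")
print("Per-fold F1:", np.round(fold_f1, 3))
print(f"Mean F1: {fold_f1.mean():.3f} (std {fold_f1.std():.3f})")
```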


Strengths of Random Forest for Imbalanced Data:

Robust to Overfitting: Random Forest averages many decision trees, each trained on a bootstrap sample with random feature subsets, which lowers the variance (and so the overfitting risk) of a single tree.

Good Performance on Imbalanced Data: Combined with a resampling technique such as SMOTE, as done here, it reached a recall of 0.99 on the stroke class of the resampled test set.

Feature Importance: Provides impurity-based feature importances that indicate which features are most influential for stroke prediction.

Consistency: Accuracy, precision, recall, and F1 all fall between 0.97 and 0.99, and the cross-validation F1 (0.97) is close to the test-set F1 (0.98).
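
The feature-importance point above can be sketched as follows. This is a minimal example on synthetic data with generic column names (`feature_0` ... `feature_4` are placeholders, not the stroke dataset's columns); it reads the impurity-based `feature_importances_` attribute of a fitted `RandomForestClassifier` and sorts it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 3 informative, 1 redundant, 1 noise column
X_arr, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per column, normalized to sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern should carry over to the fitted `rf` with `X_train.columns` as the index.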



Weaknesses:

Computationally Intensive: Training and prediction cost grows with the number of trees and their depth, so large datasets can require significant processing power and time.

Interpretability: While feature importance is available, the overall model is harder to interpret than simpler models such as Logistic Regression.

Sensitive to Noisy Data: Irrelevant or noisy features can still degrade performance if they are not removed during preprocessing.
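
One way to look for irrelevant features is permutation importance: shuffle one column at a time on held-out data and measure how much the score drops. The sketch below is an illustration under assumptions, not part of the original analysis. It appends two pure-noise columns to synthetic data and calls `sklearn.inspection.permutation_importance`; all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 informative columns
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=4, n_redundant=0, random_state=0
)

# Append two pure-noise columns (indices 4 and 5)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X_demo, rng.normal(size=(1000, 2))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on the held-out set and record the F1 drop
result = permutation_importance(
    forest, X_te, y_te, scoring="f1", n_repeats=10, random_state=0
)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"column {i}: {mean:.3f} +/- {std:.3f}")
```

Columns whose mean importance is close to zero are candidates for removal before retraining.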