In [None]:
import pandas as pd
import numpy as np


df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")

# Basic checks
print(df.shape)
df.head()
df.isnull().sum()


(5110, 12)


id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

## Stroke Risk Prediction 
This project aims to build a machine learning model that estimates the risk of stroke based on demographic and clinical features such as age, glucose level, BMI, hypertension, heart disease, and lifestyle indicators.

The objective is not medical diagnosis, but to demonstrate how machine learning can support early risk identification using historical healthcare data.


In [3]:
df['stroke'].value_counts()

stroke
0    4861
1     249
Name: count, dtype: int64

## Data Preprocessing

The following preprocessing steps were applied:

- Dropped the `id` column as it has no predictive value
- Missing BMI values were imputed using the median to avoid data loss
- Categorical variables were converted using one-hot encoding
- A stratified train-test split was used so that the roughly 5% stroke rate (249 of 5,110 records) is preserved in both subsets
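
The steps above can be sketched end to end on a small synthetic frame. This is a minimal illustration, not the notebook's own pipeline: the `toy` data and its column values are hypothetical, and the sketch assumes one variation, namely that the BMI median is learned from the training rows only, whereas the cells below impute on the full frame before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the stroke dataset (not real records).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "id": np.arange(n),
    "age": rng.integers(20, 90, size=n).astype(float),
    "bmi": rng.normal(28, 5, size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "stroke": np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})
toy.loc[::10, "bmi"] = np.nan  # inject missing BMI values

# 1. Drop the identifier. 2. One-hot encode the categorical columns.
features = pd.get_dummies(toy.drop(columns=["id", "stroke"]), drop_first=True)
target = toy["stroke"]

# 3. Stratified split: both subsets keep the 10% positive rate of the toy data.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

# 4. Median imputation, with the median computed on the training rows only
#    (an assumption of this sketch; it avoids using test rows for the statistic).
bmi_median = X_tr["bmi"].median()
X_tr = X_tr.fillna({"bmi": bmi_median})
X_te = X_te.fillna({"bmi": bmi_median})

print(y_tr.mean(), y_te.mean())
```

Computing the median after the split keeps test rows out of the imputation statistic. With a single median over 5,110 rows the practical difference is likely small, but the pattern carries over to preprocessing steps that are more sensitive to leakage.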


In [None]:
# Drop the identifier column; it carries no predictive signal
df = df.drop(columns=['id'], errors='ignore')
# Fill the 201 missing BMI values with the median so that no rows are discarded
df['bmi'] = df['bmi'].fillna(df['bmi'].median())
df.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [8]:
df_encoded = pd.get_dummies(df, drop_first=True)
print(df_encoded.shape)
df_encoded.head()

(5110, 17)


,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Male,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,True,False,True,False,True,False,False,True,True,False,False
1,61.0,0,0,202.21,28.1,1,False,False,True,False,False,True,False,False,False,True,False
2,80.0,0,1,105.92,32.5,1,True,False,True,False,True,False,False,False,False,True,False
3,49.0,0,0,171.23,34.4,1,False,False,True,False,True,False,False,True,False,False,True
4,79.0,1,0,174.12,24.0,1,False,False,True,False,False,True,False,False,False,True,False


In [9]:
x = df_encoded.drop('stroke', axis=1)
y = df_encoded['stroke']
print(x.shape, y.shape)

(5110, 16) (5110,)


In [10]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
print("Train:", x_train.shape)
print("Test :", x_test.shape)


Train: (4088, 16)
Test : (1022, 16)


In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

lr = LogisticRegression(
    max_iter=1000,
    class_weight='balanced'
)

lr.fit(x_train, y_train)

y_pred_lr = lr.predict(x_test)
y_prob_lr = lr.predict_proba(x_test)[:, 1]

print(classification_report(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_lr))


              precision    recall  f1-score   support

           0       0.99      0.74      0.85       972
           1       0.14      0.80      0.24        50

    accuracy                           0.75      1022
   macro avg       0.56      0.77      0.54      1022
weighted avg       0.94      0.75      0.82      1022

ROC-AUC: 0.8437860082304527


## Baseline Model: Logistic Regression

A Logistic Regression model with class balancing was used as a baseline to:

- Validate data quality
- Establish reference performance
- Provide an interpretable comparison before applying more complex models
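
To make the class balancing concrete, the sketch below reproduces the weights that `class_weight='balanced'` assigns, using scikit-learn's `compute_class_weight`. The class counts are inferred from the outputs in this notebook (4,861 − 972 = 3,889 non-stroke and 249 − 50 = 199 stroke rows in the training split); the label array `y_counts` is a stand-in built from those counts, not the real `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the training-set class counts implied by the notebook
# outputs: 3889 negatives and 199 positives (4088 rows in total).
y_counts = np.array([0] * 3889 + [1] * 199)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

# The same formula written out by hand for comparison.
manual = len(y_counts) / (2 * np.bincount(y_counts))

print(weights)
print(manual)
```

Each stroke case therefore counts roughly 10.3 times in the loss while each non-stroke case counts about 0.53 times, which is consistent with the baseline's high recall and low precision on the stroke class.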


In [13]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score


## Final Model: XGBoost

XGBoost was selected as the final model due to its ability to:

- Capture non-linear relationships
- Handle structured tabular data effectively
- Learn complex feature interactions
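- Compensate for class imbalance through the `scale_pos_weight` parameter

The model was trained with `scale_pos_weight` set to the ratio of negative to positive training samples, which up-weights stroke cases during boosting.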
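
As a quick worked example of the weighting, the snippet below computes the negative-to-positive ratio for the training split. The counts (3,889 non-stroke, 199 stroke) are derived from the outputs shown in this notebook, and `y_train_demo` is a hypothetical stand-in for `y_train`.

```python
import pandas as pd

# Stand-in for y_train with the class counts implied by the notebook outputs.
y_train_demo = pd.Series([0] * 3889 + [1] * 199, name="stroke")

counts = y_train_demo.value_counts()
ratio = counts.loc[0] / counts.loc[1]  # negatives divided by positives

print(round(ratio, 2))  # 19.54
```

A ratio near 19.5 means each stroke case contributes to the gradient about as much as 19 to 20 non-stroke cases.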


In [14]:
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    eval_metric='logloss',
    random_state=42
)

In [15]:
xgb_model.fit(x_train, y_train)


parameter,value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,0.8
device,
early_stopping_rounds,
enable_categorical,False


In [16]:
y_pred_xgb = xgb_model.predict(x_test)
y_prob_xgb = xgb_model.predict_proba(x_test)[:, 1]

In [17]:
print(classification_report(y_test, y_pred_xgb))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_xgb))



              precision    recall  f1-score   support

           0       0.97      0.92      0.94       972
           1       0.20      0.38      0.27        50

    accuracy                           0.90      1022
   macro avg       0.59      0.65      0.61      1022
weighted avg       0.93      0.90      0.91      1022

ROC-AUC: 0.8130246913580247


In [18]:
from sklearn.metrics import classification_report

for threshold in [0.5, 0.4, 0.3, 0.25, 0.2]:
    y_pred_thresh = (y_prob_xgb >= threshold).astype(int)
    print(f"\nThreshold: {threshold}")
    print(classification_report(y_test, y_pred_thresh))



Threshold: 0.5
              precision    recall  f1-score   support

           0       0.97      0.92      0.94       972
           1       0.20      0.38      0.27        50

    accuracy                           0.90      1022
   macro avg       0.59      0.65      0.61      1022
weighted avg       0.93      0.90      0.91      1022


Threshold: 0.4
              precision    recall  f1-score   support

           0       0.97      0.90      0.93       972
           1       0.20      0.50      0.29        50

    accuracy                           0.88      1022
   macro avg       0.59      0.70      0.61      1022
weighted avg       0.93      0.88      0.90      1022


Threshold: 0.3
              precision    recall  f1-score   support

           0       0.98      0.85      0.91       972
           1       0.17      0.60      0.26        50

    accuracy                           0.84      1022
   macro avg       0.57      0.72      0.59      1022
weighted avg       0.94   

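## Threshold Tuning

At the default decision threshold of 0.5, the XGBoost model detected only 38% of the stroke cases in the test set, which is too conservative for a screening use case.

To improve medical relevance:
- Predicted probabilities were evaluated at thresholds of 0.5, 0.4, 0.3, 0.25, and 0.2
- The decision threshold was set to 0.4
- This raised stroke recall from 0.38 to 0.50 while precision stayed at 0.20 and accuracy fell only from 0.90 to 0.88

Lowering the threshold further to 0.3 increased recall to 0.60 but reduced precision to 0.17, which illustrates the recall-precision trade-off typical of healthcare risk prediction.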
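
One way to make the threshold choice reproducible is to pick it programmatically against a target recall. The sketch below is only an illustration under stated assumptions: `pick_threshold` is a hypothetical helper, the labels and scores are made up, and it assumes the threshold is selected on a separate validation split rather than on the test set used in this notebook.

```python
import numpy as np

def pick_threshold(y_true, y_prob, min_recall):
    """Return the highest threshold whose recall is at least min_recall, else None."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_pos = (y_true == 1).sum()
    # Candidate thresholds are the observed scores, tried from high to low.
    for t in np.unique(y_prob)[::-1]:
        recall = ((y_prob >= t) & (y_true == 1)).sum() / n_pos
        if recall >= min_recall:
            return float(t)
    return None

# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.45, 0.35, 0.30, 0.60, 0.80]

print(pick_threshold(y_val, p_val, min_recall=0.6))  # 0.6
print(pick_threshold(y_val, p_val, min_recall=1.0))  # 0.35
```

Taking the highest threshold that meets the recall target keeps precision as high as that target allows.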


In [19]:
# Final threshold selection based on recall–precision tradeoff
final_threshold = 0.4


y_pred_final = (y_prob_xgb >= final_threshold).astype(int)

# Final evaluation
print(classification_report(y_test, y_pred_final))


              precision    recall  f1-score   support

           0       0.97      0.90      0.93       972
           1       0.20      0.50      0.29        50

    accuracy                           0.88      1022
   macro avg       0.59      0.70      0.61      1022
weighted avg       0.93      0.88      0.90      1022



## Model Evaluation

Model performance was evaluated using:
- Precision, Recall, and F1-score
- ROC-AUC for overall class separation
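
For reference, the precision, recall, and F1 figures follow directly from confusion-matrix counts. The counts below are illustrative only: they were chosen to be consistent with the rounded report at threshold 0.4 and are not the notebook's actual confusion matrix, which is not shown. `binary_metrics` is a hypothetical helper and assumes no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 (positive class) and overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts for 50 stroke and 972 non-stroke test rows.
# They match the rounded report but are an assumption, not measured values.
metrics = binary_metrics(tp=25, fp=100, fn=25, tn=872)

print([round(v, 2) for v in metrics])  # [0.2, 0.5, 0.29, 0.88]
```

With only 50 positive cases, each additional detected stroke moves recall by 2 percentage points, so small differences between models or thresholds should be read with caution.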

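At a threshold of 0.4, the final XGBoost model reached a stroke recall of 0.50 with a precision of 0.20 (F1-score 0.29), up from a recall of 0.38 at the default threshold, and its ROC-AUC on the test set was 0.813. For comparison, the class-balanced Logistic Regression baseline reached a stroke recall of 0.80 at a precision of 0.14, with a ROC-AUC of 0.844.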

This model estimates stroke risk probabilities based on input features.  
It is intended as a decision-support or early screening aid, not as a clinical diagnostic tool.

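All metrics were computed on the held-out 20% test split (1,022 records, 50 stroke cases). Because the decision threshold was also selected on this split, the reported figures are likely somewhat optimistic as estimates of performance on new data.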

## Conclusion

This project demonstrates a complete and responsible machine learning workflow for healthcare data, including preprocessing, imbalance handling, baseline comparison, advanced modeling, and threshold tuning.

The results highlight the importance of metric selection and domain-aware decision-making in medical machine learning applications.
