A project for CSL2050 - Pattern Recognition and Machine Learning at IIT Jodhpur that develops machine learning models to predict stroke occurrences based on patient demographics, lifestyle, and health features. Our analysis across multiple clinical scenarios highlights the importance of context-specific model selection, with different algorithms showing strengths in different healthcare settings.
Stroke is a leading cause of death and disability worldwide. Early detection is critical for timely intervention. However, predicting stroke is challenging due to severe class imbalance: stroke cases are rare (19.44:1 ratio in our test set). Our goal is to build a system that:
- Handles imbalanced data using techniques like SMOTE
- Uses feature engineering, selection, and classification models
- Evaluates performance across different clinical scenarios
- Recommends optimal models for specific healthcare contexts
- Source: Healthcare Dataset Stroke Data on Kaggle
- Raw Data: 5,110 patient records
- Target Column: `stroke` (1 = stroke, 0 = no stroke)
- Features: `gender`, `age`, `hypertension`, `heart_disease`, `ever_married`, `work_type`, `Residence_type`, `avg_glucose_level`, `bmi`, `smoking_status`
- Engineered Features: `age_group`, `bmi_category`, `glucose_category`, `age_bmi_interaction`, `hypertension_heart_disease`
- Cleaning: Dropped `id`, imputed `bmi` with the median
- Encoding: Label-encoded categorical features
- Feature Engineering:
  - `age_group`: Young, Middle-Aged, Senior, Elderly
  - `bmi_category`: Underweight, Normal, Overweight, Obese
  - `glucose_category`: Low, Normal, Prediabetes, Diabetes, High Risk
  - Interaction terms
- Scaling: Standardized numerical features
- Class Imbalance Handling: Applied SMOTE to the 70% training split while preserving the natural class distribution in the 30% test set (see the sketch below)
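A minimal sketch of this preprocessing flow. The bin edges, random seeds, and stratified split are illustrative assumptions, not necessarily the exact choices in `02_data_preprocessing.ipynb`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv("data/raw/healthcare-dataset-stroke-data.csv")
df = df.drop(columns=["id"])
df["bmi"] = df["bmi"].fillna(df["bmi"].median())          # median imputation

# Engineered categorical features (bin edges assumed for illustration);
# glucose_category and the interaction terms follow the same pattern.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 65, 120],
                         labels=["Young", "Middle-Aged", "Senior", "Elderly"])
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                            labels=["Underweight", "Normal", "Overweight", "Obese"])

# Label-encode all categorical columns, including the engineered ones
for col in df.select_dtypes(include=["object", "category"]).columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X, y = df.drop(columns=["stroke"]), df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# SMOTE is fit on the training split only; the test set keeps its natural 19.44:1 ratio
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```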
- Stroke: 249 cases (4.87%) in full dataset
- No Stroke: 4,861 cases (95.13%) in full dataset
- Test set imbalance ratio: 19.44:1
- Stroke cases more common among:
- Married individuals (5.97%)
- Self-employed (7.8%)
- Former smokers (7.63%)
- Positive correlations:
  - `age` & stroke: 0.25
  - `hypertension` & `heart_disease`: 0.13
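These findings can be reproduced with a few lines of pandas (sketch; the category labels are those used in the raw CSV):

```python
import pandas as pd

df = pd.read_csv("data/raw/healthcare-dataset-stroke-data.csv")

# Stroke rate within each subgroup
print(df.groupby("ever_married")["stroke"].mean())      # ~5.97% for "Yes"
print(df.groupby("work_type")["stroke"].mean())         # ~7.8% for "Self-employed"
print(df.groupby("smoking_status")["stroke"].mean())    # ~7.63% for "formerly smoked"

# Pairwise correlations between numeric features and the target
print(df[["age", "hypertension", "heart_disease", "stroke"]].corr())
```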
Notebooks available:

- `01_exploratory_data_analysis.ipynb`
- `02_data_preprocessing.ipynb`
- `03_model_comparison.ipynb`
We implemented and optimized seven machine learning algorithms:
- **Logistic Regression**:
  - PowerTransformer normalization, L1 regularization, 'liblinear' solver
  - Best parameters: `{'classifier__solver': 'liblinear', 'classifier__penalty': 'l1', 'classifier__class_weight': 'balanced', 'classifier__C': 0.1}`
  - Best overall performer (3 out of 4 scenarios)
- **Random Forest**:
  - PowerTransformer, SelectKBest feature selection
  - Best parameters: `{'feature_selection__score_func': f_classif, 'classifier__n_estimators': 100, 'classifier__criterion': 'entropy', 'classifier__max_depth': 10}`
  - Consistent second-best performer
- **Artificial Neural Network (ANN)**:
  - MLPClassifier with SMOTE balancing
  - Best parameters: `{'feature_selection__k': 'all', 'ann__hidden_layer_sizes': (256,), 'ann__activation': 'relu', 'ann__alpha': 0.01, 'ann__learning_rate': 'constant'}`
  - Strong performance in limited-resource settings
- **Support Vector Machine (SVM)**:
  - RBF kernel, feature selection (k=25)
  - Best parameters: `{'feature_selection__k': 25, 'classifier__kernel': 'rbf', 'classifier__gamma': 0.1, 'classifier__C': 100}`
  - Good generalization but lower recall
- **K-Nearest Neighbors (KNN)**:
  - RobustScaler, mutual information feature selection
  - Best parameters: `{'knn__weights': 'distance', 'knn__n_neighbors': 11, 'knn__metric': 'minkowski'}`
  - Moderate performance across scenarios
- **Decision Tree**:
  - Mutual information feature selection
  - Best parameters: `{'feature_selection__percentile': 85, 'classifier__max_depth': 5, 'classifier__criterion': 'log_loss'}`
  - Second-best for high-risk screening
- **Gaussian Naïve Bayes (GNB)**:
  - StandardScaler, mutual information feature selection
  - Best parameters: `{'preprocessor__num__scaler': StandardScaler(), 'gnb__var_smoothing': 1e-11}`
  - Highest sensitivity (0.80), ideal for high-risk screening
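The `step__parameter` keys in the dictionaries above come from scikit-learn pipelines tuned with grid search. A representative sketch using the logistic-regression variant (the search grid, scoring metric, and CV settings here are assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# The step name "classifier" yields the `classifier__...` keys; models that
# report `feature_selection__...` keys add a SelectKBest / SelectPercentile
# step named "feature_selection" in the same way.
pipe = Pipeline([
    ("transform", PowerTransformer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "classifier__solver": ["liblinear"],
    "classifier__penalty": ["l1", "l2"],
    "classifier__class_weight": [None, "balanced"],
    "classifier__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train_res, y_train_res)   # SMOTE-resampled training split (see preprocessing sketch)
print(search.best_params_)             # e.g. {'classifier__C': 0.1, ...}

# The Gaussian NB entry nests its scaler inside a ColumnTransformer named
# "preprocessor" with a "num" sub-pipeline, hence 'preprocessor__num__scaler'.
```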
Composite scores were computed for each of the four clinical scenarios; within each scenario, models are ranked from best to worst.

**Scenario 1**
- Logistic Regression (Score: 0.7086)
- Random Forest (Score: 0.6888)
- Bayesian (Score: 0.6158)
- Decision Tree (Score: 0.6119)
- ANN (Score: 0.6040)
- KNN (Score: 0.5881)
- SVM (Score: 0.5668)

**Scenario 2**
- Bayesian (Score: 0.8128)
- Decision Tree (Score: 0.8092)
- Logistic Regression (Score: 0.7479)
- Random Forest (Score: 0.7100)
- KNN (Score: 0.6296)
- ANN (Score: 0.5436)
- SVM (Score: 0.4805)

**Scenario 3**
- Logistic Regression (Score: 0.5941)
- Random Forest (Score: 0.5916)
- ANN (Score: 0.5852)
- SVM (Score: 0.5722)
- KNN (Score: 0.4960)
- Bayesian (Score: 0.4198)
- Decision Tree (Score: 0.4154)

**Scenario 4**
- Logistic Regression (Score: 0.5532)
- Random Forest (Score: 0.5339)
- ANN (Score: 0.4770)
- Bayesian (Score: 0.4404)
- SVM (Score: 0.4335)
- KNN (Score: 0.4197)
- Decision Tree (Score: 0.4102)
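A composite score of this kind is typically a weighted combination of per-model metrics, with weights chosen to match the scenario's priorities. The weights below are placeholders to show the shape of such a score, not the actual formula used in the notebooks:

```python
# Illustrative only: scenario-dependent weighting of evaluation metrics.
def composite_score(metrics: dict, weights: dict) -> float:
    return sum(weights[name] * metrics[name] for name in weights)

# Hypothetical weighting for a sensitivity-critical (emergency) scenario.
emergency_weights = {"sensitivity": 0.5, "roc_auc": 0.3, "specificity": 0.2}
score = composite_score(
    {"sensitivity": 0.80, "roc_auc": 0.83, "specificity": 0.87},
    emergency_weights,
)
```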
Due to the class imbalance, we focused on balanced metrics:
Logistic Regression:
- Accuracy: 0.8584
- Sensitivity: 0.6000
- ROC AUC: 0.8332
- Precision: 0.1940
- Specificity: 0.8717
Bayesian Model:
- Sensitivity: 0.8000
- NPV: 0.9833
- Highest recall for high-risk scenarios
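For reference, these metrics follow directly from the confusion matrix on the held-out test set (sketch; `model` stands for any fitted pipeline above):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall for the stroke class
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
roc_auc     = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```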
All model metrics are saved in `model_metrics.json`; prediction history is stored in `prediction_history.json`.
Based on our analysis, we recommend:

- **For Emergency Screening (High Sensitivity Required):** Use the Bayesian model (0.80 sensitivity, 0.9833 NPV)
- **For General Hospital Screening:** Use Logistic Regression (balanced sensitivity/specificity)
- **For Limited Resource Settings:** Use Logistic Regression (efficient, good specificity)
- **For Balanced Clinical Decision Support:** Use Random Forest (good ROC AUC, interpretable)
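These recommendations map naturally onto the demo app's scenario-based model selection. A hypothetical lookup (names are illustrative, not taken from the codebase):

```python
# Scenario-to-model mapping mirroring the recommendations above.
RECOMMENDED_MODEL = {
    "emergency_screening": "bayesian",
    "general_hospital_screening": "logistic_regression",
    "limited_resource_settings": "logistic_regression",
    "balanced_decision_support": "random_forest",
}

def pick_model(scenario: str) -> str:
    return RECOMMENDED_MODEL.get(scenario, "logistic_regression")
```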
git clone <repo-url>
cd stroke-prediction-project/demo
pip install -r requirements.txt
python app.py
Then go to http://localhost:5000
- Model selection based on clinical scenario
- Home page for input
- Displays top risk factors
- Visual results and prediction history
- Robust error handling and logging (`app.log`)
Templates used: `index.html`, `result.html`, `about.html`, `error.html`
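A rough outline of how `app.py` can serve predictions from `stroke_model.pkl` (route names, form handling, and feature ordering here are assumptions, not the actual implementation):

```python
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

with open("model/stroke_model.pkl", "rb") as f:
    model = pickle.load(f)

# Must match the column order the model was trained on
# (engineered features omitted here for brevity).
FEATURE_ORDER = ["gender", "age", "hypertension", "heart_disease", "ever_married",
                 "work_type", "Residence_type", "avg_glucose_level", "bmi",
                 "smoking_status"]

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Assumes the form already submits encoded numeric values for each feature.
    x = np.array([[float(request.form[name]) for name in FEATURE_ORDER]])
    risk = float(model.predict_proba(x)[0, 1])
    return render_template("result.html", risk=round(risk, 3))

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```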
stroke-prediction-project
├── data/
│   ├── raw/healthcare-dataset-stroke-data.csv
│   └── processed/
│       ├── cleaned_dataset.csv
│       ├── X_test_processed.csv
│       ├── X_train_resampled.csv
│       ├── y_test.csv
│       ├── y_trained_resampled.csv
│       └── feature_statistics.txt
├── demo/
│   ├── data/
│   │   └── prediction_history.json
│   ├── model/
│   │   ├── model_metrics.json
│   │   └── stroke_model.pkl
│   ├── templates/
│   │   ├── about.html
│   │   ├── error.html
│   │   ├── index.html
│   │   └── result.html
│   ├── app.py
│   └── requirements.txt
├── notebooks/
│   ├── 01_exploratory_data_analysis.ipynb
│   ├── 02_data_preprocessing.ipynb
│   └── 03_model_comparison.ipynb
├── reports/
│   ├── mid_reports
│   ├── final_report
│   └── images....
├── src/
│   ├── models/
│   │   ├── ann.py
│   │   ├── ann_model.pkl
│   │   ├── bayesian.py
│   │   ├── bayesian_model.pkl
│   │   ├── decision_tree.py
│   │   ├── decision_tree_model.pkl
│   │   ├── knn.py
│   │   ├── knn_model.pkl
│   │   ├── logistic_regression.py
│   │   ├── logistic_regression_model.pkl
│   │   ├── random_forest.py
│   │   ├── rf_model.pkl
│   │   ├── svm.py
│   │   └── svm_model.pkl
│   └── data_processing/
│       └── cleaning.py
└── README.md
- Develop ensemble systems combining logistic regression and Bayesian models
- Implement adaptive threshold selection based on clinical context
- Explore deep learning models with attention mechanisms
- Conduct external validation across diverse patient populations
- Integrate with electronic health record systems
- Perform prospective clinical trials to assess real-world impact
- Anuranjani (B23EE1083)
- Pratheeksha Bangarapu (B23EE1058)
- Polimetla Eshikha (B23CS1053)
- Sai Pragathi Lagudu (B23CM1021)
- Pragna Sree Muvva (B23CM1025)