This project predicts the likelihood of a stroke using machine learning models trained on the healthcare-dataset-stroke-data.csv dataset. The dataset records health attributes that may contribute to stroke occurrence.
- Clone the repository:
  git clone https://github.com/Elilora/stroke-prediction.git
  cd stroke-prediction
- Install the required libraries:
  pip install numpy pandas seaborn matplotlib scikit-learn xgboost shap
- Ensure the dataset file healthcare-dataset-stroke-data.csv is in the /kaggle/input/stroke-prediction-dataset/ directory.
- Run the script to train and evaluate the models:
  python stroke_prediction.py
The dataset used for this project is healthcare-dataset-stroke-data.csv, which contains the following columns:
id
gender
age
hypertension
heart_disease
ever_married
work_type
Residence_type
avg_glucose_level
bmi
smoking_status
stroke
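Before preprocessing, it can help to confirm that a loaded file actually has the columns listed above. The following is a minimal sketch, not the project's own code: the helper `check_columns` is hypothetical, and the two-row in-memory CSV uses made-up values that stand in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "id", "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "Residence_type", "avg_glucose_level", "bmi",
    "smoking_status", "stroke",
]


def check_columns(df):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny stand-in for the real CSV; the values are illustrative only.
sample_csv = io.StringIO(
    "id,gender,age,hypertension,heart_disease,ever_married,work_type,"
    "Residence_type,avg_glucose_level,bmi,smoking_status,stroke\n"
    "1,Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1\n"
    "2,Female,61,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1\n"
)

df = pd.read_csv(sample_csv)
missing = check_columns(df)
bmi_missing = int(df["bmi"].isna().sum())  # an empty bmi field is read as NaN

print("Missing columns:", missing)
print("Rows with missing bmi:", bmi_missing)
```

To run the same check against the real data, replace `sample_csv` with the path to healthcare-dataset-stroke-data.csv.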
- Handle missing values:
  - The bmi column had missing values, which were imputed using the mean strategy.
- Encode categorical variables:
  - Categorical columns were encoded using LabelEncoder.
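The two preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code. The small DataFrame holds made-up values, the choice of scikit-learn's `SimpleImputer` for the mean strategy is an assumption, and the list of categorical columns is cut down for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Made-up rows that mimic a few of the dataset's columns.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [30.0, np.nan, 20.0, 25.0],
    "smoking_status": ["smokes", "never smoked", "Unknown", "never smoked"],
    "stroke": [1, 0, 0, 0],
})

# Replace missing bmi values with the mean of the observed values.
imputer = SimpleImputer(strategy="mean")
df["bmi"] = imputer.fit_transform(df[["bmi"]]).ravel()

# Encode each categorical column as integer codes.
# Assumption: the columns are named explicitly here for the example.
categorical_cols = ["gender", "smoking_status"]
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

`LabelEncoder` assigns codes by sorted class order, so the integers carry no meaning of their own. Tree-based models such as the ones used here generally tolerate this, while linear models may call for one-hot encoding instead.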
The following models were used to predict stroke occurrence:
- Decision Tree Classifier
- Random Forest Classifier
- Split the data into training and testing sets (70% train, 30% test).
- Train the models on the training set.
- Evaluate the models on the testing set using accuracy and F1-score metrics.
- Generate classification reports and confusion matrices.
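The steps above can be sketched as below. This is a hedged example, not the project's script: it uses synthetic, imbalanced data from `make_classification` in place of the stroke dataset. The `random_state` values, the `stratify=y` argument, and default model hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a rare positive class (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# 70% train, 30% test; stratify=y is an assumption that keeps the
# class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted.
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, results[name])
    print(classification_report(y_test, y_pred, zero_division=0))
    print(confusion_matrix(y_test, y_pred))
```

The numbers printed by this sketch come from synthetic data and are not comparable to the project's reported results.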
Decision Tree Classifier:
- Accuracy: 91.06%
- F1-score: 10.46%
- Detailed classification report and confusion matrix are generated in the script output.
Random Forest Classifier:
- Accuracy: 95.50%
- F1-score: 0.00%
- Detailed classification report and confusion matrix are generated in the script output.
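An accuracy of 95.50% alongside an F1-score of 0.00% can look contradictory. The pattern is what you would expect when the positive class is rare and a model predicts almost no positives. The sketch below illustrates the arithmetic with made-up labels (5 positives out of 100), not the project's data, and the helper `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and the positive-class F1-score from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Illustrative labels: 5 positive cases, 95 negative cases.
y_true = [1] * 5 + [0] * 95
# A model that always predicts the majority (negative) class.
y_pred = [0] * 100

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2%}, f1={f1:.2%}")
```

Because every correct prediction here is a true negative, accuracy stays high while the F1-score for the positive class is zero. For this reason the confusion matrix is usually more informative than accuracy alone on imbalanced data.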
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License - see the LICENSE file for details.