<a href="https://colab.research.google.com/github/AlessandroConte/stroke-prediction/blob/main/Stroke_Prediction_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stroke Prediction using Machine Learning: A Predictive Modeling Project

## Objective
This project develops a machine learning model that predicts an individual's likelihood of stroke from health-related features. The dataset includes attributes such as age, gender, smoking habits, hypertension, and heart disease, along with other factors that may contribute to stroke risk.

## Goal
The primary goal is to build accurate and reliable predictive models that can identify high-risk individuals, enabling early intervention strategies to reduce the incidence of strokes. The focus is on **reducing false negatives**, as failing to identify individuals at risk for a stroke can have serious health consequences.

## Approach

Throughout this project, several machine learning models were tested:

* **Initial Baseline Models**:
  * Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC) were trained on the original (imbalanced) dataset.
  
* **Performance Evaluation**:
  * Models were evaluated with accuracy, precision, recall, F1-score, and the confusion matrix.
  * Special focus was placed on **recall for the stroke class (stroke = 1)** to ensure minimal false negatives.

* **Handling Class Imbalance**:
  * Given the highly skewed class distribution, various sampling techniques were applied:
    * **SMOTE** (Synthetic Minority Over-sampling Technique)
    * **SMOTEENN** (SMOTE + Edited Nearest Neighbors)
    * **SMOTETomek** (SMOTE + Tomek links)
    * **Manual undersampling** to create a more balanced dataset while preserving data integrity.

* **Model Retraining & Tuning**:
  * After resampling, models were retrained and tuned using **GridSearchCV**.
  * The performance of algorithms such as Random Forest and SVC varied with the balancing method and with the metric being optimized.

* **Advanced Models Evaluated**:
  * **MLPClassifier** (Multi-Layer Perceptron, a feed-forward neural network)
  * **XGBClassifier** (XGBoost, Extreme Gradient Boosting)
  * Although more computationally intensive, these models were tested to compare their performance under the various resampling strategies.

* **Imbalance-Aware Ensemble Models**:
  * In a final step, ensemble models that rebalance the classes internally were applied:
    * **BalancedRandomForestClassifier**
    * **EasyEnsembleClassifier**
  * These models achieved the highest **recall** for the minority class (stroke), even without external sampling techniques.
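
As a rough illustration of the undersampling and recall-focused evaluation described above, the sketch below trains the same classifier on imbalanced and on undersampled training data and compares recall for the positive class. It is not the notebook's code: the data is synthetic (`make_classification`), and the helper `undersample_majority` and its `ratio` parameter are hypothetical names introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def undersample_majority(X, y, ratio=1.0, seed=0):
    """Hypothetical helper: keep all minority rows and a random
    subset of majority rows (`ratio` majority rows per minority row)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([minority_idx, kept_majority]))
    return X[idx], y[idx]


# Resample the training split only; the test split keeps its real imbalance.
X_bal, y_bal = undersample_majority(X_train, y_train)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

recall_baseline = recall_score(y_test, baseline.predict(X_test), pos_label=1)
recall_balanced = recall_score(y_test, balanced.predict(X_test), pos_label=1)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_test, balanced.predict(X_test)).ravel()
print(f"recall (imbalanced training):   {recall_baseline:.2f}")
print(f"recall (undersampled training): {recall_balanced:.2f}")
print(f"undersampled model: FN={fn}, TP={tp}, FP={fp}, TN={tn}")
```

Resampling only the training split matters here: balancing the test data would report recall and precision on a class distribution the model would never see in practice.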

## Process Overview

1. **Data Cleaning & Preprocessing**
   * Handled missing values (especially in BMI)
   * Converted categorical features
   * Binned age to explore its correlation with BMI and inform BMI imputation

2. **Exploratory Data Analysis (EDA)**
   * Plotted distributions and feature relationships
   * Assessed class imbalance visually

3. **Model Training & Evaluation**
   * Baseline models vs tuned models
   * Evaluation based on confusion matrix, precision, recall, and F1-score

4. **Class Rebalancing Strategies**
   * Oversampling, undersampling, hybrid methods
   * Ensemble models with internal balancing mechanisms

5. **Final Comparison**
   * Focus on minimizing **false negatives**
   * SVC, Balanced Random Forest, and Easy Ensemble emerged as the most effective under this criterion
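
One way the age-bin idea from step 1 could feed into BMI imputation is to fill each missing BMI with the median of its age group. The sketch below assumes that approach; the toy data, the bin edges, and the column names `age_group` and `bmi_imputed` are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): `age` and `bmi` mirror features named in the text.
df = pd.DataFrame({
    "age": [5, 12, 34, 38, 61, 67, 72, 45],
    "bmi": [16.0, np.nan, 27.5, 29.5, np.nan, 31.0, 28.0, np.nan],
})

# Hypothetical bin edges, chosen for illustration only.
bins = [0, 18, 40, 60, 120]
labels = ["0-18", "19-40", "41-60", "61+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Median BMI of each row's age group, aligned to the original index.
group_median = df.groupby("age_group", observed=True)["bmi"].transform("median")

# Fill from the group median first; fall back to the overall median
# when a group has no observed BMI at all (here: "41-60").
df["bmi_imputed"] = df["bmi"].fillna(group_median).fillna(df["bmi"].median())

print(df[["age", "age_group", "bmi", "bmi_imputed"]])
```

In a modeling pipeline, the group medians would be computed on the training split and then reused for the test split, so that the imputation does not leak test information.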

## Key Learnings & Skills Demonstrated

* End-to-end **data preprocessing** and feature engineering
* Applying **various classification algorithms**
* Addressing **imbalanced classification problems**
* Conducting **model tuning** using GridSearchCV
* Comparing baseline vs advanced ensemble methods
* Interpreting results with emphasis on **real-world implications**
* Communicating limitations and justifying modeling choices
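
To make the tuning step concrete, here is a minimal sketch of a `GridSearchCV` run that selects hyperparameters by recall on the positive class. Several things are assumptions made for illustration: the data is synthetic, the parameter grid is arbitrary, and class weighting (`class_weight="balanced"`) stands in for the resampling methods used in the project so that the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=1500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),
])

# Illustrative grid only; not the grid used in the notebook.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="recall",  # recall of the positive class (label 1)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print(f"cross-validated recall: {search.best_score_:.2f}")
print(f"held-out recall:        {test_recall:.2f}")
```

Optimizing recall alone can favor models that flag nearly everyone as high risk, so it is worth checking precision or the confusion matrix for the selected model as well.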

## Conclusion
This project applies machine learning to a healthcare problem, showing how to handle an imbalanced dataset and how different techniques can be combined to optimize predictive models. Through model evaluation and iterative tuning, it offers insight into building reliable models for high-stakes prediction tasks such as stroke risk detection. It also serves as a portfolio piece that demonstrates my skills in machine learning, data analysis, and problem-solving on a real-world, data-driven task.
