# Stroke Prediction Model Comparison

This notebook aims to explore, preprocess, and model a dataset to predict the likelihood of a stroke. We will compare multiple algorithms to find the best performer based on accuracy, precision, recall, and F1 score.
The dataset used in this notebook is from Kaggle and can be found [here](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

## Introduction

A stroke is a medical condition in which poor blood flow to the brain results in cell death. There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both result in parts of the brain not functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. A stroke is a medical emergency, and treatment must be sought as quickly as possible. The longer a stroke goes untreated, the greater the potential for brain damage and disability.

## Importing Libraries

In [197]:
# Core data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, model selection, and metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, classification_report)
from imblearn.over_sampling import SMOTE

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


## Data Loading

In [198]:
# Load the dataset
stroke_data = pd.read_excel('../Datasets/StrokeData.xlsx')

# Display the first 5 rows of the dataframe
stroke_data.head()

## Data Exploration

In [199]:
#Checking the shape of the dataset
stroke_data.shape

In [200]:
#Basic information about the dataset
stroke_data.info()

In [201]:
#Summary statistics of the dataset
stroke_data.describe()

In [202]:
#Checking for unique values in the dataset
stroke_data.nunique()

In [203]:
#Checking Data Types
stroke_data.dtypes

In [204]:
#Checking for missing values
stroke_data.isnull().sum()

### Data Visualization

In [205]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

#### Initial Data Exploration

In [206]:
# Prepare the figure layout
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Initial Data Exploration')

# Distribution of Age
sns.histplot(stroke_data['age'], bins=30, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Age')

# Distribution of BMI
sns.histplot(stroke_data[stroke_data['bmi'].notnull()]['bmi'], bins=30, kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Distribution of BMI')

# Count of Strokes vs. Non-Strokes
sns.countplot(x='stroke', data=stroke_data, ax=axes[1, 0])
axes[1, 0].set_title('Count of Strokes vs. Non-Strokes')

# Correlation Heatmap of Numerical Features
# Calculate correlations
corr = stroke_data[['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'stroke']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', ax=axes[1, 1])
axes[1, 1].set_title('Correlation Heatmap')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


## About the Dataset

Distribution of Age: The age distribution is relatively broad, covering the entire spectrum from young to old, with a notable number of entries in the older age brackets. This is relevant since stroke risk typically increases with age.

Distribution of BMI: The BMI distribution is roughly bell-shaped with a right skew, indicating that some individuals have substantially higher BMI values. BMI is another crucial factor in stroke risk.

Count of Strokes vs. Non-Strokes: The dataset is clearly imbalanced, with far more non-stroke instances than stroke instances. This imbalance must be addressed during model training to avoid bias towards the majority class.

Correlation Heatmap: The heatmap shows pairwise correlations among the numerical features and the target variable (stroke). Notably, age shows a moderate correlation with stroke, which aligns with medical understanding. Hypertension, heart disease, and avg_glucose_level also show some correlation with stroke occurrence.
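
The class imbalance seen in the count plot can also be quantified directly. The following is a minimal sketch on hypothetical toy labels (`toy_labels` is made up for illustration; the real proportions should be read from `stroke_data['stroke']`):

```python
import pandas as pd

# Hypothetical toy labels for illustration only; not the real dataset
toy_labels = pd.Series([0] * 19 + [1], name="stroke")

# Absolute counts and relative frequencies of each class
class_counts = toy_labels.value_counts()
class_share = toy_labels.value_counts(normalize=True)

# Ratio of majority-class size to minority-class size
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

The same two `value_counts` calls applied to the real target column give the figures needed to judge how aggressive any resampling has to be.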

### Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
12. stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

## Data Preprocessing

### Handling Missing Values

In [207]:
# Imputing missing values in 'bmi' based on the average BMI per gender
stroke_data['bmi'] = stroke_data.groupby('gender')['bmi'].transform(lambda x: x.fillna(x.mean()))

# Verifying the imputation
stroke_data[stroke_data['bmi'].isnull()].head(), stroke_data.head()

#### Check if BMI is imputed correctly

In [208]:
# Check if BMI is imputed correctly
stroke_data.isnull().sum()
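
To make the group-wise imputation easier to follow, here is a small sketch of the same `groupby(...).transform(...)` pattern on a hypothetical toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing BMI per gender group
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "bmi": [30.0, np.nan, 26.0, 22.0, 24.0, np.nan],
})

# Replace each missing BMI with the mean BMI of that row's gender group
toy["bmi_filled"] = toy.groupby("gender")["bmi"].transform(
    lambda s: s.fillna(s.mean())
)

print(toy)
```

Because `transform` returns a result aligned with the original index, the filled values can be assigned straight back to the frame.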

### Finding Outliers

In [209]:
# Prepare the figure layout
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Box Plots for Identifying Outliers')

# Box plot for Age
sns.boxplot(x=stroke_data['age'], ax=axes[0])
axes[0].set_title('Age')

# Box plot for Average Glucose Level
sns.boxplot(x=stroke_data['avg_glucose_level'], ax=axes[1])
axes[1].set_title('Average Glucose Level')

# Box plot for BMI
sns.boxplot(x=stroke_data[stroke_data['bmi'].notnull()]['bmi'], ax=axes[2])
axes[2].set_title('BMI')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


#### Age

Outliers: There don't appear to be any extreme outliers, which is expected because age varies naturally across a population.

Context: If there are values beyond the expected age range (e.g., >100 years), verify whether they are accurate. Ages that fall within a typical human lifespan, even if high, are plausible and should likely be retained.

#### Average Glucose Level

Outliers: There are many points beyond the upper whisker, which indicate unusually high glucose levels.

Context: Elevated glucose levels can be indicative of medical conditions like diabetes, which are risk factors for stroke. Unless these values are impossible (e.g., due to data entry errors), they may represent important risk factors and should be kept.

#### BMI

Outliers: Similar to glucose levels, there are several points beyond the upper whisker, indicating very high BMI values.

Context: While high BMI values could represent cases of extreme obesity, they are clinically plausible and relevant for stroke prediction. However, verify whether any BMI values are beyond physiological feasibility (e.g., BMI > 60 could be a potential data entry error).
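
The points drawn beyond the whiskers in a box plot follow the usual 1.5 × IQR rule, so they can also be counted programmatically. A minimal sketch, assuming a hypothetical helper `iqr_outlier_mask` and invented toy values:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return a boolean mask marking points outside the box-plot whiskers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Hypothetical BMI values for illustration only
toy_bmi = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 33.5, 97.6])
mask = iqr_outlier_mask(toy_bmi)

print(f"Flagged {mask.sum()} of {len(toy_bmi)} values")
print(toy_bmi[mask])
```

Flagging is only a starting point: as discussed above, whether a flagged value is removed should depend on clinical plausibility rather than on the rule alone.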

### Encoding Categorical Variables

In [210]:
# Categorical columns to encode
categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

In [211]:
# Encoding categorical variables using Label Encoding
label_encoder = LabelEncoder()
for column in categorical_columns:
    stroke_data[column] = label_encoder.fit_transform(stroke_data[column])

# Display the updated DataFrame after imputation and encoding
stroke_data.head()
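
`LabelEncoder` maps each category to an integer, which can suggest an ordering that nominal features such as `work_type` do not have; this matters mostly for linear and distance-based models. One-hot encoding is a common alternative. A minimal sketch on a hypothetical toy frame (not the notebook's actual pipeline):

```python
import pandas as pd

# Hypothetical rows using category names from the attribute list
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
    "smoking_status": ["never smoked", "smokes", "Unknown", "never smoked"],
})

# One indicator column per category; drop_first=True removes one level
# per feature so the indicator columns are not perfectly collinear
encoded = pd.get_dummies(
    toy, columns=["work_type", "smoking_status"], drop_first=True
)

print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice, so the difference is most likely to show up for logistic regression, SVM, k-NN, and the discriminant analysis models.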

### Normalizing Numerical Features

In [212]:
# Standardizing numerical features (zero mean, unit variance)
scaler = StandardScaler()
numerical_columns = ['age', 'avg_glucose_level', 'bmi']
stroke_data[numerical_columns] = scaler.fit_transform(stroke_data[numerical_columns])
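
Fitting a scaler on the full dataset lets test-set statistics influence the training features. A common alternative is to split first and fit `StandardScaler` on the training portion only. The sketch below uses hypothetical random data; `toy_X`, `toy_y`, and `toy_scaler` are illustrative names, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
toy_y = rng.integers(0, 2, size=200)                      # hypothetical labels

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

toy_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)
# Apply the training statistics to the test rows without refitting
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print(toy_X_train_scaled.mean(axis=0))
print(toy_X_test_scaled.mean(axis=0))
```

The test-set means will generally not be exactly zero, which is expected: the test data is scaled with statistics it did not contribute to.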

### Splitting the Data and Applying SMOTE

In [213]:
# Define your features and target variable
X = stroke_data.drop(['stroke', 'id'], axis=1)  # Dropping 'id' as it's not a relevant feature
y = stroke_data['stroke']

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Verify the class distribution after applying SMOTE
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

In [214]:
# Convert the resampled target data to a DataFrame for easier plotting
y_train_res_df = pd.DataFrame(y_train_res, columns=['stroke'])

# Plot the distribution of the target variable after SMOTE
sns.countplot(x='stroke', data=y_train_res_df)
plt.title('Class Distribution after SMOTE')
plt.show()
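
SMOTE balances the classes by creating synthetic minority samples that lie on the line segment between a minority point and one of its nearest minority neighbours. The following is a simplified NumPy illustration of that idea, not the `imblearn` implementation; `smote_like_samples` and the toy points are hypothetical:

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Interpolate between minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = minority[rng.integers(len(minority))]
        # Distances from the base point to every minority point
        dists = np.linalg.norm(minority - base, axis=1)
        # Position 0 of the sorted order is the base point itself, so skip it
        neighbour_idx = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(neighbour_idx)]
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic[i] = base + lam * (neighbour - base)
    return synthetic

# Hypothetical minority-class points in two dimensions
toy_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_like_samples(toy_minority, n_new=5)
print(new_points)
```

Because every synthetic point is an interpolation, it always falls inside the region already spanned by the minority class; SMOTE adds density, not new information.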

In [215]:
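def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model, print test-set metrics, and plot the confusion matrix."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # ROC-AUC is a ranking metric, so use continuous scores rather than hard labels
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_score)
    # Cross-validated accuracy on the training data passed in; when that data
    # has been resampled with SMOTE, this score may be optimistic
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()

    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1}")
    print(f"ROC-AUC Score: {roc_auc}")
    print(f"Cross-validation Score: {cv_score}")

    # Rows are true classes, columns are predicted classes
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d')
    plt.title('Confusion Matrix')
    plt.show()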
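
ROC-AUC measures how well a model ranks positives above negatives, so it is most informative when computed from continuous scores (probabilities or decision values) rather than thresholded 0/1 predictions. A small sketch with hypothetical scores shows that the two inputs can give different values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores for six patients
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.45, 0.40, 0.42, 0.80])

# Hard predictions obtained with the usual 0.5 threshold
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}")
```

Thresholding collapses all scores into two values, so any ranking information among cases on the same side of the threshold is lost.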

## Model Building

### Logistic Regression

In [216]:
print("Logistic Regression:")
lr = LogisticRegression()
evaluate_model(lr, X_train_res, y_train_res, X_test, y_test)

Logistic Regression: The model has a decent accuracy of 76.5% and a high recall of 74.2%, indicating that it identifies most positive cases. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score, which is the harmonic mean of precision and recall, is therefore also relatively low.
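
To see why a high recall combined with a low precision still yields a low F1 score, the three metrics can be computed by hand from confusion-matrix counts. The counts below are hypothetical and chosen only for illustration; they are not the notebook's results:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 45 true positives, 225 false positives, 15 false negatives
precision, recall, f1 = precision_recall_f1(tp=45, fp=225, fn=15)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

The harmonic mean is dominated by the smaller of the two values, so F1 stays close to the low precision even when recall is high.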

### Decision Tree

In [217]:
print("Decision Trees:")
dt = DecisionTreeClassifier()
evaluate_model(dt, X_train_res, y_train_res, X_test, y_test)

Decision Trees: The decision tree model has a higher accuracy than logistic regression (83.8%), but its recall is significantly lower (40.3%), indicating that it is not as good at identifying positive cases. The precision is similar to logistic regression, and the F1 score is slightly higher.

### Random Forest

In [218]:
print("Random Forest:")
rf = RandomForestClassifier()
evaluate_model(rf, X_train_res, y_train_res, X_test, y_test)

Random Forest: The random forest model has a high accuracy of 87.8%, but its recall is only 20.9%, indicating that it is not very good at identifying positive cases. The precision is slightly lower than the decision tree model, and the F1 score is also lower.

### Gradient Boosting Machine (GBM)

In [219]:
print("Gradient Boosting Machines:")
gbm = GradientBoostingClassifier()
evaluate_model(gbm, X_train_res, y_train_res, X_test, y_test)

Gradient Boosting Machines (GBM): The GBM model has an accuracy of 80.4% and a high recall of 58.1%, making it better at identifying positive cases than the decision tree and random forest models. The precision is similar to the other models, and the F1 score is higher due to the higher recall.

### XGBoost

In [220]:
print("XGBoost:")
xgb = XGBClassifier()
evaluate_model(xgb, X_train_res, y_train_res, X_test, y_test)

XGBoost: The XGBoost model has a high accuracy of 87.3%, but its recall is only 16.1%, indicating that it is not very good at identifying positive cases. The precision is lower than the other models, and the F1 score is also lower.

### LightGBM

In [221]:
print("LightGBM:")
lgbm = LGBMClassifier()
evaluate_model(lgbm, X_train_res, y_train_res, X_test, y_test)

LightGBM: The LightGBM model has a high accuracy of 91.2%, but its recall is only 14.5%, indicating that it is not very good at identifying positive cases. The precision is higher than XGBoost, but the F1 score is still relatively low.
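
High accuracy combined with low recall is typical on imbalanced data, because a model that never predicts a stroke is already right most of the time. A minimal sketch with a hypothetical test set (the class counts are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 950 non-stroke and 50 stroke cases
y_true = np.array([0] * 950 + [1] * 50)

# A trivial baseline that predicts "no stroke" for everyone
y_pred_all_negative = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred_all_negative)
baseline_recall = recall_score(y_true, y_pred_all_negative)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall:   {baseline_recall:.3f}")
```

Accuracy figures are therefore best read against this kind of majority-class baseline rather than on their own.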

### CatBoost

In [222]:
print("CatBoost:")
cat = CatBoostClassifier(verbose=0)
evaluate_model(cat, X_train_res, y_train_res, X_test, y_test)

CatBoost: The CatBoost model has a high accuracy of 89.1%, but its recall is only 9.7%, indicating that it is not very good at identifying positive cases. The precision is lower than LightGBM, and the F1 score is also lower.

### Support Vector Machine (SVM)

In [223]:
print("Support Vector Machines:")
svc = SVC()
evaluate_model(svc, X_train_res, y_train_res, X_test, y_test)

Support Vector Machines (SVM): The SVM model has a decent accuracy of 76.0% and a comparatively high recall of 58.1%, so it identifies more than half of the positive cases. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### k-Nearest Neighbors (k-NN)

In [224]:
print("k-Nearest Neighbors:")
knn = KNeighborsClassifier()
evaluate_model(knn, X_train_res, y_train_res, X_test, y_test)

k-Nearest Neighbors (k-NN): The k-NN model has a decent accuracy of 78.6% and a moderate recall of 50.0%, so it identifies only half of the positive cases. The precision is also quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### Naive Bayes (GaussianNB)

In [225]:
print("Naive Bayes:")
nb = GaussianNB()
evaluate_model(nb, X_train_res, y_train_res, X_test, y_test)

Naive Bayes: The Naive Bayes model has a lower accuracy of 68.4% but a very high recall of 88.7%, the highest of all models compared here. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### Neural Network (MLP)

In [226]:
print("Neural Networks:")
nn = MLPClassifier()
evaluate_model(nn, X_train_res, y_train_res, X_test, y_test)

Neural Network (MLP): The neural network model has a decent accuracy of 78.6% and a moderate recall of 48.4%, so it misses about half of the positive cases. The precision is also quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### AdaBoost

In [227]:
print("AdaBoost:")
ab = AdaBoostClassifier()
evaluate_model(ab, X_train_res, y_train_res, X_test, y_test)

AdaBoost: The AdaBoost model has a decent accuracy of 76.6% and a high recall of 69.4%, making it good at identifying positive cases. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### Quadratic Discriminant Analysis (QDA)

In [228]:
print("Quadratic Discriminant Analysis:")
qda = QuadraticDiscriminantAnalysis()
evaluate_model(qda, X_train_res, y_train_res, X_test, y_test)

Quadratic Discriminant Analysis (QDA): The QDA model has a lower accuracy of 72.2% but a very high recall of 80.6%, second only to Naive Bayes. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

### Linear Discriminant Analysis (LDA)

In [229]:
print("Linear Discriminant Analysis:")
lda = LinearDiscriminantAnalysis()
evaluate_model(lda, X_train_res, y_train_res, X_test, y_test)

Linear Discriminant Analysis (LDA): The LDA model has a decent accuracy of 74.6% and a high recall of 74.2%, the same recall as logistic regression. However, the precision is quite low, meaning that many of its positive predictions are false positives. The F1 score is relatively low due to the low precision.

In summary, the models with the highest accuracy (LightGBM, CatBoost, Random Forest, and XGBoost) have low recall, so they miss most positive cases. Models with high recall, such as Naive Bayes, AdaBoost, QDA, and LDA, have low precision, meaning that many of their positive predictions are false positives. This trade-off between precision and recall is a common challenge in machine learning, and the right balance depends on the specific requirements of the task. For example, in a medical context, high recall might be more important so that as few positive cases as possible are missed, even if it means more false positives.
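
One common way to move along this precision-recall trade-off without retraining is to change the probability threshold used to call a case positive. The sketch below uses synthetic data from `make_classification` (not the stroke dataset), and the thresholds are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

results = {}
for threshold in (0.5, 0.3, 0.1):
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision typically falls; where to stop depends on how costly a missed case is relative to a false alarm.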

