# 2 - Data preparation

In [None]:
df.head()

We are going to encode some categorical columns as numeric values.

In [None]:
df['ever_married'] = [ 0 if i !='Yes' else 1 for i in df['ever_married'] ]
df['gender'] = [0 if i != 'Female' else 1 for i in df['gender']]

In [None]:
# Use get_dummies to one-hot encode the categorical variable
df=pd.get_dummies(df,columns=['smoking_status'])


And remove variables that seem irrelevant

In [None]:
df=df.drop(['work_type'],axis=1)
df=df.drop(['Residence_type'],axis=1)

In [None]:
df.head()  # Check that the changes were applied

# 3 - Machine Learning

Importing the required libraries

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE


**Separating our dataset into train set and test set**

In [None]:
from sklearn.model_selection import train_test_split

X=df.drop(['stroke'],axis=1)
y=df['stroke']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3, random_state=3)


Decision trees are usually good candidates for this type of classification problem.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dt_clf=DecisionTreeClassifier(criterion='gini',random_state=3,max_depth=5)
dt_clf.fit(X_train,y_train)
y_pred=dt_clf.predict(X_test)

Let's take a look at the result:

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy : {accuracy:.2f}")

Seems pretty good! Now the confusion matrix...

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# The matrix
confusion_mat = confusion_matrix(y_test, y_pred)

# Plotting
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix')
plt.xlabel('Prediction')
plt.ylabel('Real Value')
plt.show()


OK... Is it really that good?
Let's look at other metrics.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy : {accuracy:.2f}")

precision = precision_score(y_test, y_pred)
print(f"Precision : {precision:.2f}")

recall = recall_score(y_test, y_pred)
print(f"Recall : {recall:.2f}")

f1 = f1_score(y_test, y_pred)
print(f"F1 score : {f1:.2f}")


The accuracy is very good, **which might lead us to believe that the model is performing well**. The other metrics tell a different story: the model classifies almost all observations as 'Non-stroke', so it misses most of the actual stroke cases.

Why is this? Probably because our model does not take into account the large imbalance between our classes: as mentioned earlier, the dataset contains many more people labelled 'Non-stroke' than 'Stroke'.

In [None]:
stroke_counts = df['stroke'].value_counts()

# Create a bar chart to visualize the distribution
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='stroke')
plt.title('Stroke Distribution')
plt.xlabel('Stroke')
plt.ylabel('Count')
plt.show()

print(stroke_counts)
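To see why accuracy alone is misleading here, consider a toy case. The sketch below uses hypothetical labels (95 'Non-stroke', 5 'Stroke'; these numbers are an assumption, not our dataset) and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 95 non-stroke (0) and 5 stroke (1) patients.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Count correct predictions, true positives and false negatives by hand.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = correct / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy : {accuracy:.2f}")
print(f"Recall   : {recall:.2f}")
```

The accuracy looks excellent even though not a single stroke case is detected, which is the pattern we observe with our first decision tree.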

**Let's try techniques that reduce the effect of this imbalance, in particular giving different weights to our labels.**

**Data separation and stratification**

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3, stratify = y, random_state=3)

Just one small change: "stratify = y".
With this, we apply "stratification" to our data. Stratification splits the data into a training set and a test set in such a way that **the distribution of classes is maintained in both sets**.
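The idea can be illustrated without any library. The sketch below is a simplified, dependency-free version of a stratified split; the helper `stratified_split` and the toy labels are hypothetical, and this is not scikit-learn's implementation. It splits each class separately, so both sets keep the same share of positives.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=3):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Send the same fraction of every class to the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels with 5% positives.
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Positive rate in train : {train_rate:.3f}")
print(f"Positive rate in test  : {test_rate:.3f}")
```

Stratification only guarantees that the test set is representative. It does not change how rare the minority class is, so on its own it is not expected to fix the imbalance problem.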

Let's see what it changes:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dt_clf=DecisionTreeClassifier(criterion='gini',random_state=3,max_depth=5)
dt_clf.fit(X_train,y_train)
y_pred=dt_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy : {accuracy:.2f}")

# Compute precision
precision = precision_score(y_test, y_pred)
print(f"Precision : {precision:.2f}")

# Compute recall
recall = recall_score(y_test, y_pred)
print(f"Recall : {recall:.2f}")

# Compute the F1 score
f1 = f1_score(y_test, y_pred)
print(f"F1 score : {f1:.2f}")


Hmm... That's disappointing.
We will have to try something else.

**Managing imbalance with class weights**

This method involves assigning different weights to classes according to their frequency.
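One common way to derive such weights from class frequencies is the heuristic behind scikit-learn's `class_weight='balanced'` option: `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula by hand on hypothetical labels (the helper name and the 960/40 split are assumptions for illustration, not our dataset).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 960 majority samples and 40 minority samples.
labels = [0] * 960 + [1] * 40
weights = balanced_class_weights(labels)

# Relative weight of the minority class compared with the majority class.
ratio = weights[1] / weights[0]
print(weights)
print(f"Minority / majority weight ratio : {ratio:.1f}")
```

The ratio of the two weights equals the ratio of the class counts, which gives a reasonable starting point before adjusting the weights by hand.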

In [None]:
# Class weighting

class_weights = {0: 1, 1: 25}  # Adjust the weights according to the imbalance in our data
dt_clf_weighted = DecisionTreeClassifier(criterion='gini', random_state=3, max_depth=5, class_weight=class_weights)

dt_clf_weighted.fit(X_train, y_train)
y_pred_weighted = dt_clf_weighted.predict(X_test)

# Result
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
recall_weighted = recall_score(y_test, y_pred_weighted)
f1_weighted = f1_score(y_test, y_pred_weighted)
print(f"Accuracy with class weighting : {accuracy_weighted:.2f}")

Our accuracy is far less flattering, but let's see what we have for our other metrics:

In [None]:
print(f"Recall with class weighting : {recall_weighted:.2f}")
print(f"F1 with class weighting : {f1_weighted:.2f}")

A much more acceptable recall score!

As for the confusion matrix:

In [None]:
confusion_mat2 = confusion_matrix(y_test, y_pred_weighted)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat2, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix')
plt.xlabel('Prediction')
plt.ylabel('Real Value')
plt.show()

At least we now have a more interesting basis for improving our model.

**Work on hyperparameters**

In [None]:
class_weights = {0: 1, 1: 25}
# Change max_depth from 5 to 20
dt_clf_weighted = DecisionTreeClassifier(criterion='gini', random_state=3, max_depth=20, class_weight=class_weights)

# Train the model
dt_clf_weighted.fit(X_train, y_train)

# Make predictions
y_pred_weighted = dt_clf_weighted.predict(X_test)

# Recompute the metrics for this model (recall must not be reused from the previous cell)
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
recall_weighted = recall_score(y_test, y_pred_weighted)
print(f"Accuracy with class weighting : {accuracy_weighted:.2f}")
print(f"Recall with class weighting : {recall_weighted:.2f}")

In [None]:
confusion_mat3 = confusion_matrix(y_test, y_pred_weighted)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat3, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix with class weighting')
plt.xlabel('Prediction')
plt.ylabel('Real Value')
plt.show()

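With just a change in the depth of our model, accuracy increases significantly. Whether the good recall is retained should be read from the confusion matrix above, using a recall value computed for this deeper tree rather than the one from the previous model.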


Now let's use GridSearchCV to determine the best parameters:

In [None]:
rf = RandomForestClassifier(class_weight=class_weights, random_state=3)
param_grid = {
    'n_estimators': [100,200],
    'max_depth': [None,5,10],
    'min_samples_split': [2,5],
    'min_samples_leaf': [1,2]
}

grid_rf = GridSearchCV(rf, param_grid, scoring='f1', cv=5, n_jobs=-1, verbose=1)
grid_rf.fit(X_train, y_train)

best_rf = grid_rf.best_estimator_
print("Best params:", grid_rf.best_params_)
print("Best F1 (train CV):", grid_rf.best_score_)

# Evaluation on the test set
y_pred_rf = best_rf.predict(X_test)
y_prob_rf = best_rf.predict_proba(X_test)[:,1]
print(classification_report(y_test, y_pred_rf))
print("ROC-AUC (Test set):", roc_auc_score(y_test, y_prob_rf))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Oranges')
plt.title("Confusion Matrix - Best RF")
plt.show()
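To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below copies `param_grid` and `cv=5` from the cell above and simply enumerates the combinations with `itertools.product`; it illustrates what an exhaustive grid search iterates over and does not train anything.

```python
from itertools import product

# Same grid and number of folds as in the GridSearchCV cell.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv = 5

# Build every combination of parameter values as a dict.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv
print(f"Candidates : {len(candidates)}")
print(f"Model fits during cross-validation : {n_fits}")
print("First candidate :", candidates[0])
```

With the default `refit=True`, GridSearchCV trains one more model on the whole training set using the best combination, so adding values to the grid quickly multiplies the training time.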


# 4 - Model comparison

In [None]:
# Rebalancing with SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print("Counts after SMOTE:\n", pd.Series(y_train_res).value_counts())
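SMOTE creates synthetic minority samples by interpolating between a minority point and another minority point close to it. The sketch below shows only that interpolation step on hypothetical 2-D points. As a simplification, it picks the partner at random instead of among the k nearest neighbours, so it is an illustration of the idea and not imblearn's implementation.

```python
import random


def interpolate_sample(point, neighbor, rng):
    """Return a new point on the segment between point and neighbor."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]

synthetic = []
for _ in range(5):
    # Simplification: pick two distinct minority points at random.
    point, neighbor = rng.sample(minority, 2)
    synthetic.append(interpolate_sample(point, neighbor, rng))

for sample in synthetic:
    print([round(value, 2) for value in sample])
```

Every synthetic point lies between two existing minority samples, which is why the resampling is applied to the training set only: the test set must keep its real class distribution.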


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve, auc
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight='balanced'),
    "Decision Tree": DecisionTreeClassifier(class_weight='balanced', random_state=3),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=3),
    # Note: with SMOTE's default settings the resampled classes have equal counts, so this ratio evaluates to 1
    "XGBoost": XGBClassifier(scale_pos_weight=(len(y_train_res)-sum(y_train_res))/sum(y_train_res), use_label_encoder=False, eval_metric='logloss')
}

results = {}
for name, model in models.items():
    model.fit(X_train_res, y_train_res)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:,1]
    
    results[name] = {
        "F1": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob),
        "classification_report": classification_report(y_test, y_pred)
    }

# Summary table
summary = pd.DataFrame(results).T
display(summary)


Visualizing the ROC and Precision-Recall curves

In [None]:
plt.figure(figsize=(8,6))
for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:,1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=name)
plt.plot([0,1],[0,1],'k--')
plt.title("ROC Curves")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

plt.figure(figsize=(8,6))
for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:,1]
    prec, rec, _ = precision_recall_curve(y_test, y_prob)
    plt.plot(rec, prec, label=name)
plt.title("Precision-Recall Curves")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()


The ROC curves and Precision-Recall curves allow us to visually compare the performance of different models in distinguishing between stroke and non-stroke cases.

ROC Curves: The ROC curve plots the True Positive Rate against the False Positive Rate at different thresholds. A model with a curve closer to the top-left corner demonstrates better discrimination. We can see that the Random Forest and XGBoost models outperform the baseline Decision Tree in terms of ROC-AUC.
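To make the construction of the curve concrete, here is a dependency-free sketch that computes the ROC points and the area under the curve by hand. The labels and scores are hypothetical toy values, and the helper names are made up for this illustration. `roc_curve` from scikit-learn may return fewer points because it can drop intermediate thresholds.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step: more samples get predicted positive.
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
roc_auc = auc_trapezoid(points)
print("ROC points :", points)
print(f"ROC-AUC : {roc_auc:.2f}")
```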

Precision-Recall Curves: Given the class imbalance in the dataset, Precision-Recall curves are particularly informative. They show the trade-off between precision (positive predictive value) and recall (sensitivity) for each threshold. Models trained with class weighting and SMOTE demonstrate improved recall without sacrificing too much precision, indicating they are better at correctly identifying stroke cases.
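The trade-off can be seen on a small example. The sketch below uses hypothetical labels and scores (not our model's output) and a made-up helper to compute precision and recall at three thresholds. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and t == 1:
            tp += 1
        elif predicted == 1 and t == 0:
            fp += 1
        elif predicted == 0 and t == 1:
            fn += 1
    # Convention for this sketch: precision is 1.0 if nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical imbalanced data: 3 positives out of 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

pr_by_threshold = {}
for threshold in (0.3, 0.5, 0.8):
    pr_by_threshold[threshold] = precision_recall_at(y_true, y_score, threshold)
    precision, recall = pr_by_threshold[threshold]
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold removes false alarms at the cost of missed positives, so the right operating point depends on how costly a missed stroke case is compared with a false alert.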

Overall, these curves confirm that handling class imbalance and tuning hyperparameters significantly improves model performance, especially for the minority class.