### <center>**Supervised Classification and Model Evaluation**</center>

#### **1. Defining the Target Variable and the Features :**

In [45]:
import pandas as pd

data = pd.read_csv("../data/cleaned/data.csv")

y = data['Cluster']
X = data.drop(columns=['Cluster', 'risk_category'])

X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.825781,0.863078,0.055277,0.711208,0.839238,0.266103,0.612059,1.437767
1,-0.802604,-1.206933,-0.439687,0.169719,-0.889534,-0.83982,-0.324994,-0.050575
2,1.152449,2.013084,-0.61458,0.209057,1.840419,-1.462846,0.749586,0.047687
3,-0.802604,-1.075504,-0.439687,-0.49301,-0.475488,-0.580889,-1.063014,-1.247065
4,-1.703581,0.501647,-3.273965,0.711208,0.479229,1.453088,4.158488,0.143015


#### **2. Splitting the Data :**

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

The **``stratify``** argument ensures that the proportion of each cluster (0/1) stays the same in the training and test sets. This matters whenever the classes may be imbalanced.
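
To make the effect of ``stratify`` concrete, here is a minimal sketch on a synthetic label vector (90 samples of class 0 and 10 of class 1; these values are illustrative and are not taken from the dataset). It splits the data with the same ``test_size`` as above and compares the share of class 1 in each part. The names ``X_demo`` and ``y_demo`` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 90 samples of class 0 and 10 of class 1.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Stratified split: both parts keep the 10 % share of class 1 as closely as possible.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_share = y_tr.mean()  # fraction of class 1 in the training part
test_share = y_te.mean()   # fraction of class 1 in the test part
print(f"class-1 share: train={train_share:.2f}, test={test_share:.2f}")
```

Without ``stratify``, the test part could by chance contain very few samples of the rare class, which makes the reported metrics less reliable.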

<br>

#### **4. Handling Class Imbalance :**

##### **4.1. What Is Class Imbalance?**

In classification, we usually work with a dataset whose target variable (``y``) contains several classes.

In our dataset:

- ``0`` = diabetes

- ``1`` = no diabetes

Suppose the dataset contained:

- ``90`` % of class ``0``.

- ``10`` % of class ``1``.

Such a dataset is **_imbalanced_**.

The model then risks learning to predict the majority class (0) almost every time, because that class dominates the data.

In [47]:
from collections import Counter

print("Before oversampling :", Counter(y_train))

Before oversampling : Counter({0: 320, 1: 294})
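
The counts printed above can be summarised as a single number. The sketch below computes the ratio of the majority class to the minority class; ``imbalance_ratio`` is a hypothetical helper written for this illustration, not a library function. For comparison it also evaluates the 90 % / 10 % example from the text.

```python
from collections import Counter

def imbalance_ratio(class_counts):
    """Return majority count divided by minority count (1.0 means perfectly balanced)."""
    counts = class_counts.values()
    return max(counts) / min(counts)

# Class counts reported for y_train in the output above.
train_counts = Counter({0: 320, 1: 294})
ratio = imbalance_ratio(train_counts)

# The hypothetical 90 % / 10 % split used as an example in the text.
example_ratio = imbalance_ratio(Counter({0: 90, 1: 10}))

print(f"Training set ratio : {ratio:.2f}")
print(f"90/10 example ratio : {example_ratio:.2f}")
```

A ratio close to 1, as for this training set, indicates only a mild imbalance compared with the 90/10 example.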


##### **4.2. RandomOverSampler :**

- ``RandomOverSampler`` is a resampling technique provided by the ``imbalanced-learn`` library (imported as ``imblearn``).

- It balances the classes by randomly duplicating examples of the minority class.

- It first identifies the minority class (the one with the fewest examples).

- It then randomly duplicates samples of that class until both classes have the same number of examples.

It is a simple way to give the model more examples of the rare class. Because the added rows are exact copies, it adds no new information, and it is applied to the training set only.
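
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not the ``imbalanced-learn`` implementation; ``random_oversample`` and the toy data are hypothetical names and values introduced here.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    """Duplicate randomly chosen samples of smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # Draw the missing samples with replacement from the existing ones.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels

# Toy data: four samples of class 0 and two of class 1.
toy_X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
toy_y = [0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(toy_X, toy_y)
print(Counter(y_bal))
```

Every added row is a copy of an existing minority sample, so the set of distinct samples does not change.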

In [48]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

print("After oversampling :", Counter(y_resampled))

After oversampling : Counter({0: 320, 1: 320})


<br>

#### **5. Training the Models :**

##### **5.1. RandomForest :**



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score

rf = RandomForestClassifier(random_state=42)

rf.fit(X_resampled, y_resampled)

y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
prec_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=0)
rec_rf = recall_score(y_test, y_pred_rf, average='weighted', zero_division=0)
f1_rf = f1_score(y_test, y_pred_rf, average='weighted', zero_division=0)

print("RandomForestClassifier : ")

print(f"- Accuracy : {acc_rf}")
print(f"- Precision : {prec_rf}")
print(f"- Recall : {rec_rf}")
print(f"- F1 Score : {f1_rf}")

RandomForestClassifier : 
- Accuracy : 0.9415584415584416
- Precision : 0.941603466717622
- Recall : 0.9415584415584416
- F1 Score : 0.9415411562705263
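
The metrics in this section are computed with ``average='weighted'``: the score is computed per class and then averaged, each class weighted by its number of true samples. The sketch below reproduces this for precision in plain Python on toy labels (the labels and the helper name ``weighted_precision`` are illustrative assumptions, not part of the notebook).

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        predicted_as_cls = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        # A class that is never predicted contributes 0, mirroring zero_division=0.
        precision = correct / predicted_as_cls if predicted_as_cls else 0.0
        score += precision * n_true / total
    return score

# Toy labels: one sample of class 0 is wrongly predicted as class 1.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1]

demo_score = weighted_precision(y_true_demo, y_pred_demo)
print(f"Weighted precision : {demo_score:.4f}")
```

With this weighting, the weighted recall is always equal to the accuracy, which is why those two values coincide in every output of this section.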


##### **5.2. SVM :**

In [None]:
from sklearn.svm import SVC

svm = SVC(probability=True, random_state=42)

svm.fit(X_resampled, y_resampled)

y_pred_svm = svm.predict(X_test)

acc_svm = accuracy_score(y_test, y_pred_svm)
prec_svm = precision_score(y_test, y_pred_svm, average='weighted', zero_division=0)
rec_svm = recall_score(y_test, y_pred_svm, average='weighted', zero_division=0)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted', zero_division=0)

print("SVC : ")

print(f"- Accuracy : {acc_svm}")
print(f"- Precision : {prec_svm}")
print(f"- Recall : {rec_svm}")
print(f"- F1 Score : {f1_svm}")

SVC : 
- Accuracy : 0.974025974025974
- Precision : 0.9743207334670749
- Recall : 0.974025974025974
- F1 Score : 0.9740084032321477


##### **5.3. Gradient Boosting :**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=42)

gbc.fit(X_resampled, y_resampled)

y_pred_gbc = gbc.predict(X_test)

acc_gbc = accuracy_score(y_test, y_pred_gbc)
prec_gbc = precision_score(y_test, y_pred_gbc, average='weighted', zero_division=0)
rec_gbc = recall_score(y_test, y_pred_gbc, average='weighted', zero_division=0)
f1_gbc = f1_score(y_test, y_pred_gbc, average='weighted', zero_division=0)

print("GradientBoostingClassifier : ")

print(f"- Accuracy : {acc_gbc}")
print(f"- Precision : {prec_gbc}")
print(f"- Recall : {rec_gbc}")
print(f"- F1 Score : {f1_gbc}")

GradientBoostingClassifier : 
- Accuracy : 0.948051948051948
- Precision : 0.9483027135466161
- Recall : 0.948051948051948
- F1 Score : 0.948016806464295


##### **5.4. Decision Tree :**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_resampled, y_resampled)

y_pred_dt = dt.predict(X_test)

acc_dt = accuracy_score(y_test, y_pred_dt)
prec_dt = precision_score(y_test, y_pred_dt, average='weighted', zero_division=0)
rec_dt = recall_score(y_test, y_pred_dt, average='weighted', zero_division=0)
f1_dt = f1_score(y_test, y_pred_dt, average='weighted', zero_division=0)

print("DecisionTreeClassifier : ")

print(f"- Accuracy : {acc_dt}")
print(f"- Precision : {prec_dt}")
print(f"- Recall : {rec_dt}")
print(f"- F1 Score : {f1_dt}")

DecisionTreeClassifier : 
- Accuracy : 0.8376623376623377
- Precision : 0.8430473310839848
- Recall : 0.8376623376623377
- F1 Score : 0.8374774506296843


##### **5.5. Logistic Regression :**

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42)

lr.fit(X_resampled, y_resampled)

y_pred_lr = lr.predict(X_test)

acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr, average='weighted', zero_division=0)
rec_lr = recall_score(y_test, y_pred_lr, average='weighted', zero_division=0)
f1_lr = f1_score(y_test, y_pred_lr, average='weighted', zero_division=0)

print("LogisticRegression : ")

print(f"- Accuracy : {acc_lr}")
print(f"- Precision : {prec_lr}")
print(f"- Recall : {rec_lr}")
print(f"- F1 Score : {f1_lr}")

LogisticRegression : 
- Accuracy : 0.987012987012987
- Precision : 0.987012987012987
- Recall : 0.987012987012987
- F1 Score : 0.987012987012987


##### **5.6. XGB :**

In [None]:
from xgboost import XGBClassifier

# use_label_encoder is omitted: the XGBoost version that produced this output
# reported it as unused, so it had no effect on the model.
xgb = XGBClassifier(eval_metric='logloss', random_state=42)

xgb.fit(X_resampled, y_resampled)

y_pred_xgb = xgb.predict(X_test)

acc_xgb = accuracy_score(y_test, y_pred_xgb)
prec_xgb = precision_score(y_test, y_pred_xgb, average='weighted', zero_division=0)
rec_xgb = recall_score(y_test, y_pred_xgb, average='weighted', zero_division=0)
f1_xgb = f1_score(y_test, y_pred_xgb, average='weighted', zero_division=0)

print("XGBClassifier : ")

print(f"- Accuracy : {acc_xgb}")
print(f"- Precision : {prec_xgb}")
print(f"- Recall : {rec_xgb}")
print(f"- F1 Score : {f1_xgb}")

XGBClassifier : 
- Accuracy : 0.948051948051948
- Precision : 0.9483027135466161
- Recall : 0.948051948051948
- F1 Score : 0.948016806464295


##### **5.7. Recording the Performances :**

In [60]:
models = ['RandomForestClassifier', 'SVM', 'GradientBoostingClassifier', 'DecisionTreeClassifier', 'LogisticRegression', 'XGBClassifier']

accuracies = [acc_rf, acc_svm, acc_gbc, acc_dt, acc_lr, acc_xgb]

precisions = [prec_rf, prec_svm, prec_gbc, prec_dt, prec_lr, prec_xgb]

recalls = [rec_rf, rec_svm, rec_gbc, rec_dt, rec_lr, rec_xgb]

f1_s = [f1_rf, f1_svm, f1_gbc, f1_dt, f1_lr, f1_xgb]

performances = []

# Pair each model name with its four scores: [accuracy, precision, recall, F1].
for model, acc, prec, rec, f1 in zip(models, accuracies, precisions, recalls, f1_s):
    performances.append({model: [acc, prec, rec, f1]})

print(performances)

[{'RandomForestClassifier': [0.9415584415584416, 0.941603466717622, 0.9415584415584416, 0.9415411562705263]}, {'SVM': [0.974025974025974, 0.9743207334670749, 0.974025974025974, 0.9740084032321477]}, {'GradientBoostingClassifier': [0.948051948051948, 0.9483027135466161, 0.948051948051948, 0.948016806464295]}, {'DecisionTreeClassifier': [0.8376623376623377, 0.8430473310839848, 0.8376623376623377, 0.8374774506296843]}, {'LogisticRegression': [0.987012987012987, 0.987012987012987, 0.987012987012987, 0.987012987012987]}, {'XGBClassifier': [0.948051948051948, 0.9483027135466161, 0.948051948051948, 0.948016806464295]}]
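
The list of single-key dictionaries printed above is hard to compare by eye. As a sketch, the same scores (copied from that output) can be held in one dictionary and printed as a table sorted by F1 score. The name ``scores_by_model`` is introduced here for illustration only.

```python
# Scores copied from the recorded output above: [accuracy, precision, recall, F1].
scores_by_model = {
    'RandomForestClassifier': [0.9415584415584416, 0.941603466717622, 0.9415584415584416, 0.9415411562705263],
    'SVM': [0.974025974025974, 0.9743207334670749, 0.974025974025974, 0.9740084032321477],
    'GradientBoostingClassifier': [0.948051948051948, 0.9483027135466161, 0.948051948051948, 0.948016806464295],
    'DecisionTreeClassifier': [0.8376623376623377, 0.8430473310839848, 0.8376623376623377, 0.8374774506296843],
    'LogisticRegression': [0.987012987012987, 0.987012987012987, 0.987012987012987, 0.987012987012987],
    'XGBClassifier': [0.948051948051948, 0.9483027135466161, 0.948051948051948, 0.948016806464295],
}

# Sort the models from best to worst F1 score (index 3 of each score list).
ranked = sorted(scores_by_model.items(), key=lambda item: item[1][3], reverse=True)

print(f"{'Model':<28}{'Acc':>8}{'Prec':>8}{'Rec':>8}{'F1':>8}")
for name, (acc, prec, rec, f1) in ranked:
    print(f"{name:<28}{acc:>8.4f}{prec:>8.4f}{rec:>8.4f}{f1:>8.4f}")
```

On these recorded test-set scores, ``LogisticRegression`` ranks first and ``DecisionTreeClassifier`` last.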


#### **6. GridSearchCV :**

In [63]:
from sklearn.model_selection import GridSearchCV

models = {
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'SVC': SVC(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'LogisticRegression': LogisticRegression()
}

param_grids = {
    'RandomForestClassifier': {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    },
    'GradientBoostingClassifier': {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7]
    },
    'SVC': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto', 0.1]
    },
    'DecisionTreeClassifier': {
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
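    'LogisticRegression': {
        # Not every penalty/solver pair is valid: combinations that scikit-learn
        # rejects (e.g. 'l1' with 'lbfgs', or the string 'none' in recent releases)
        # fail to fit and are scored as NaN.
        'penalty': ['l1', 'l2', 'none'],
        'C': [0.1, 1, 10],
        'solver': ['liblinear', 'lbfgs', 'saga']
    }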
}

for name, model in models.items():
    grid_search = GridSearchCV(estimator=model, param_grid=param_grids[name], cv=5, scoring='accuracy')
    grid_search.fit(X_resampled, y_resampled)

    print(f"Model : {name}")
    print(f"Best parameters : {grid_search.best_params_}")
    print(f"Best cross-validation score : {grid_search.best_score_:.4f}\n")

Model : RandomForestClassifier
Best parameters : {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 100}
Best cross-validation score : 0.9469

Model : GradientBoostingClassifier
Best parameters : {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 100}
Best cross-validation score : 0.9437

Model : SVC
Best parameters : {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Best cross-validation score : 0.9922

Model : DecisionTreeClassifier
Best parameters : {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best cross-validation score : 0.8766





Model : LogisticRegression
Best parameters : {'C': 10, 'penalty': 'l1', 'solver': 'saga'}
Best cross-validation score : 0.9906
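
The warning emitted by this cell ("60 fits failed out of a total of 135") is consistent with invalid penalty/solver pairs in the ``LogisticRegression`` grid. The sketch below counts those pairs in plain Python, assuming a recent scikit-learn release in which the string ``'none'`` is no longer accepted (``None`` is used instead) and ``lbfgs`` does not support ``'l1'``. The names ``SUPPORTED`` and ``valid_lr_grid`` are introduced here for illustration.

```python
from itertools import product

# Penalties accepted by each solver, under the assumption stated above.
SUPPORTED = {
    'liblinear': {'l1', 'l2'},
    'lbfgs': {'l2', None},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

# The grid values used for LogisticRegression in the cell above.
penalties = ['l1', 'l2', 'none']
Cs = [0.1, 1, 10]
solvers = ['liblinear', 'lbfgs', 'saga']
cv_folds = 5

# Every (penalty, C, solver) combination whose penalty the solver rejects.
invalid = [(p, c, s) for p, c, s in product(penalties, Cs, solvers)
           if p not in SUPPORTED[s]]
total_fits = len(penalties) * len(Cs) * len(solvers) * cv_folds
print(f"{len(invalid) * cv_folds} failing fits out of {total_fits}")

# A grid restricted to the combinations that did fit; GridSearchCV accepts
# a list of dictionaries as param_grid.
valid_lr_grid = [
    {'penalty': ['l1'], 'C': Cs, 'solver': ['liblinear', 'saga']},
    {'penalty': ['l2'], 'C': Cs, 'solver': ['liblinear', 'lbfgs', 'saga']},
]
```

Passing ``valid_lr_grid`` as ``param_grid`` for ``LogisticRegression`` would evaluate the same combinations that succeeded in the original run, without the failed fits. To also try an unpenalised model, a third dictionary with ``'penalty': [None]`` and the ``lbfgs`` or ``saga`` solver could be added.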



60 fits failed out of a total of 135.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\abdel\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abdel\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\abdel\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1218, in fit
    solver = _check_solver(se