# Étude de la classification sur le dataset du diabète
## Objectif : 
Comparer plusieurs modèles de classification et tester différentes méthodes d'imputation des données manquantes

## 1. Importation des bibliothèques et chargement des données

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE

In [24]:
data = pd.read_csv("diabetes.csv")

## 2. Exploration et nettoyage des données

In [25]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [26]:
# Remplacer les 0 dans certaines colonnes
cols_to_replace = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in cols_to_replace:
    data[col] = data[col].replace(0, np.nan)

In [28]:
features_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
cible_name = "Outcome"

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


Nous observons des valeurs manquantes pour les variables : Glucose, BloodPressure, SkinThickness, Insulin et BMI

In [30]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885,0.348958
std,3.369578,30.535641,12.382158,10.476982,118.775855,6.924988,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,22.0,76.25,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.0,29.0,125.0,32.3,0.3725,29.0,0.0
75%,6.0,141.0,80.0,36.0,190.0,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Remplacement des valeurs nulles de glucose, bloodpressure et BMI par leur médiane

In [7]:
data['Glucose'] = data['Glucose'].replace(np.nan, data['Glucose'].median())
data['BloodPressure'] = data['BloodPressure'].replace(0, data['BloodPressure'].median())
data['BMI'] = data['BMI'].replace(0, data['BMI'].median())

Remplacement des valeurs nulles de SkinThickness avec une régression linéaire avec BMI et Insulin en features.

In [9]:
# On isole les lignes où SkinThickness est différent de 0 pour entraîner le modèle
df_train = data[data['SkinThickness'] != 0]
df_test = data[data['SkinThickness'] == 0]

X_train = df_train[['BMI', 'Insulin']]
y_train = df_train['SkinThickness']

model = LinearRegression()
model.fit(X_train, y_train)

# Prédire et remplacer les valeurs manquantes
data.loc[data['SkinThickness'] == 0, 'SkinThickness'] = model.predict(df_test[['BMI', 'Insulin']]).astype(int)


Remplacement des valeurs nulles d'Insulin avec une régression linéaire avec BMI et Glucose en features.

In [10]:
df_train = data[data['Insulin'] != 0]
df_test = data[data['Insulin'] == 0]

X_train = df_train[['BMI', 'Glucose']]
y_train = df_train['Insulin']

model = LinearRegression()
model.fit(X_train, y_train)

data.loc[data['Insulin'] == 0, 'Insulin' ] = model.predict(df_test[['BMI', 'Glucose']]).astype(int)

In [None]:
plt.figure(figsize=(15,30))
for i, name in enumerate(features_names):
    plt.subplot(len(features_names), 1, i+1)
    plt.title(name)
    plt.boxplot(data[name], vert=False)

Vérification de l'équilibre des données

In [11]:
data.loc[data["Outcome"] == 1, "Outcome"].count()

268

In [11]:
data.loc[data["Outcome"] == 0, "Outcome"].count()

500

Il y a deux fois moins de données pour les cas de diabète par rapport aux cas négatifs ce qui peux biaser le résultat. La solution est d'enrichir la classe minoritaire ou d'appauvrir la majoritaire. Ici au choisi le 1er cas pour ne pas perdre de données.

### Séparation en jeu d'entrainement et de test et création de nouveaux exemples synthétiques pour la classe minoritaire (ici outcome=1 donc cas diabétiques)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(data[features_names], data["Outcome"], test_size=0.2, random_state=24 )
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

### Pour que la classification se passe bien nous allons normaliser et standardiser les données.

In [13]:
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)


# Utilisant la validation croisée puis entrainement du modèle
## 1 Modèle de régression logistique

In [14]:
model = LogisticRegression()
scores = cross_val_score(model, X_resampled_scaled, y_resampled, cv=5)
print(f'Mean accurancy: {scores.mean()}')
print(f'Accurancy: {scores}')

model.fit(X_resampled_scaled, y_resampled)

Mean accurancy: 0.7500155279503107
Accurancy: [0.7515528  0.82608696 0.67701863 0.73291925 0.7625    ]


In [15]:
y_pred = model.predict(X_test_scaled)

### Affichage des différents scores

In [16]:
def print_metric(y_test, y_pred):
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print(f"Precision score: {precision}")
    print(f"Recall score: {recall}")
    print(f"f1 score: {f1}")
    print(f"Confusion matrix: \n{cm}")

    print("Rapport de classification :\n", classification_report(y_test, y_pred))

print_metric(y_test, y_pred)

Precision score: 0.6271186440677966
Recall score: 0.6607142857142857
f1 score: 0.6434782608695652
Confusion matrix: 
[[76 22]
 [19 37]]
Rapport de classification :
               precision    recall  f1-score   support

           0       0.80      0.78      0.79        98
           1       0.63      0.66      0.64        56

    accuracy                           0.73       154
   macro avg       0.71      0.72      0.72       154
weighted avg       0.74      0.73      0.74       154



## 2 Algorithme des K-Nearest Neighbors (KNN)

In [17]:
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [2, 3, 4, 5, 7, 9]}
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_resampled_scaled, y_resampled)
print(f"Best k: {grid_search.best_params_}")


Best k: {'n_neighbors': 7}


In [18]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_resampled_scaled, y_resampled)
y_knn_pred = knn.predict(X_test_scaled)

### Affichage des différents scores

In [19]:
print_metric(y_test, y_knn_pred)

Precision score: 0.5846153846153846
Recall score: 0.6785714285714286
f1 score: 0.628099173553719
Confusion matrix: 
[[71 27]
 [18 38]]
Rapport de classification :
               precision    recall  f1-score   support

           0       0.80      0.72      0.76        98
           1       0.58      0.68      0.63        56

    accuracy                           0.71       154
   macro avg       0.69      0.70      0.69       154
weighted avg       0.72      0.71      0.71       154



L'algorithme des k plus proches voisins a plus de précision mais un moins bon rappel, il y a moins de faux positifs mais plus de faux négatifs. 

In [20]:
param_grid = {
    'n_estimators': [50, 100, 150, 200, 300],  
    'max_depth': [2, 40, 50, 70, 100, 150], 
    'max_features': [None, 'sqrt', 'log2']  
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=24), param_grid, cv=5, scoring='recall')
grid_search.fit(X_resampled_scaled, y_resampled)
print(f'Best parameters: {grid_search.best_params_}')



Best parameters: {'max_depth': 40, 'max_features': 'sqrt', 'n_estimators': 200}


In [21]:
rf = RandomForestClassifier(n_estimators=200, random_state=24, max_features='sqrt', max_depth=40)
rf.fit(X_resampled_scaled, y_resampled)
y_rf_pred = rf.predict(X_test_scaled)

In [22]:
print_metric(y_test, y_rf_pred)

Precision score: 0.6
Recall score: 0.5892857142857143
f1 score: 0.5945945945945946
Confusion matrix: 
[[76 22]
 [23 33]]
Rapport de classification :
               precision    recall  f1-score   support

           0       0.77      0.78      0.77        98
           1       0.60      0.59      0.59        56

    accuracy                           0.71       154
   macro avg       0.68      0.68      0.68       154
weighted avg       0.71      0.71      0.71       154



Le modèle des arbres a aussi plus de précision mais un moins bon rappel, il y a moins de faux positifs mais plus de faux négatifs.