## K-Nearest Neighbors
K-Nearest Neighbors (K-NN) is a simple, intuitive machine learning algorithm used for classification and regression. It makes each prediction from the closest neighbors of a point in the feature space. Despite its simplicity it is often effective, particularly when the structure of the data is complex and hard to model analytically.
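To make the idea concrete, here is a minimal sketch of K-NN classification written directly in NumPy. The function name `knn_predict` and the toy points are illustrative assumptions; the notebook itself relies on scikit-learn's `KNeighborsClassifier`, which is far more efficient.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbors' labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (hypothetical): two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[0.5, 0.5], [5.5, 5.5]], k=3)
print(toy_pred)
```

Each query point takes the label of the group it sits in, because its three nearest neighbors all belong to that group.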

In [1]:
import pandas as pd
import numpy as np
import time
import shap
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import seaborn as sns

To judge the model we select, we need a scoring function that penalizes wrong predictions, and False Negatives in particular. A False Negative is a case where the model predicts 0 while the true label is 1. This is the more costly error: if we predict that a loan will probably be repaid and it is not, the bank's profitability suffers directly. We therefore define a scoring function that penalizes each False Negative by 10 and each False Positive by 1. The total penalty is divided by the number of samples, scaled by 0.1, and subtracted from 1, so a higher score is better.

In [2]:
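def custom_penalty(y_test, y_pred):
    """Return 1 - 0.1 * (10 * FN + 1 * FP) / n_samples; higher is better."""
    # Accept DataFrames, Series, lists or arrays and flatten them to 1-D
    y_test = np.asarray(y_test).ravel()
    y_pred = np.asarray(y_pred).ravel()
    penalty = 0.0

    for yt, yp in zip(y_test, y_pred):
        if yt == 1 and yp == 0:    # False Negative: costly error
            penalty += 10
        elif yt == 0 and yp == 1:  # False Positive: mild error
            penalty += 1

    # Average penalty per sample, scaled by 0.1 so that the worst case
    # (every sample a False Negative) gives a score of 0
    score = (penalty / len(y_test)) * 0.1
    return 1 - score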
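The same rule can be written without an explicit loop. The sketch below is an assumed equivalent, not the notebook's own function: it reads the False Negative and False Positive counts from scikit-learn's `confusion_matrix` and applies the formula `1 - 0.1 * (10 * FN + FP) / n`. The name `penalty_score` and the small label lists are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def penalty_score(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Vectorized variant of the penalty rule: 1 - 0.1 * weighted errors per sample."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 1 - 0.1 * (fn_cost * fn + fp_cost * fp) / len(y_true)

# Worked example: 10 samples, one False Negative and one False Positive
y_true_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

demo_score = penalty_score(y_true_demo, y_pred_demo)
print(demo_score)
```

Here the total penalty is 10 + 1 = 11 over 10 samples, so the score is 1 - 0.1 * 1.1, roughly 0.89.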

In [15]:
import pandas as pd
import numpy as np
from numpy import nan
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

We load the training and test sets, which are stored in separate files. This part covers the Machine Learning model trained on the imbalanced data processed with a Hybrid-resampling method.

In [16]:
X_train_hybrid = pd.read_csv('X_train_CSV_hybrid.csv')
X_test_hybrid = pd.read_csv('X_test_CSV_hybrid.csv')
y_train_hybrid = pd.read_csv('y_train_CSV_hybrid.csv')
y_test_hybrid = pd.read_csv('y_test_CSV_hybrid.csv')

We standardize the features because some Machine Learning models are sensitive to the scale of the variables. K-NN is one of them, since it relies on distances between points. This step also helps the performance of our models.

In [17]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_hybrid)
X_test_scaled = scaler.transform(X_test_hybrid)
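As an illustration of what this cell does, the sketch below standardizes two hypothetical features with very different scales. The feature values and variable names are assumptions made for the example. The scaler learns the mean and standard deviation on the training rows only and reuses them on the test rows, which mirrors the `fit_transform` / `transform` split above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: an income (large scale) and a ratio between 0 and 1
X_train_demo = np.array([[30000.0, 0.10],
                         [45000.0, 0.80],
                         [60000.0, 0.30],
                         [75000.0, 0.60]])
X_test_demo = np.array([[50000.0, 0.50]])

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std on train
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuse train statistics

print(X_train_demo_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_demo_scaled.std(axis=0))   # close to 1 for each column
print(X_test_demo_scaled)
```

Without this step, the income column would dominate every distance computation simply because its values are thousands of times larger.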

GridSearchCV tries each combination of hyperparameters with cross-validation and keeps the one that maximizes the model's score. Here it is meant to maximize our penalty-based score, which weighs FALSE NEGATIVE results most heavily.

In [18]:
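# Candidate hyperparameters: number of neighbors and Minkowski power p
# (p=1: Manhattan distance, p=2: Euclidean distance)
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'p': [1, 2]}

knn = KNeighborsClassifier()
# Caution: `scoring` expects a scorer called as scorer(estimator, X, y).
# custom_penalty(y_test, y_pred) is a plain metric, so it would need wrapping
# with make_scorer(custom_penalty); passed as is, every fold scores nan.
grid_search = GridSearchCV(knn, param_grid, scoring=custom_penalty, cv=5)
grid_search.fit(X_train_scaled, y_train_hybrid)
grid_search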

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/Users/gaeldelescluse/.local/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
TypeError: custom_penalty() takes 2 positional arguments but 3 were given
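The `TypeError` above arises because GridSearchCV calls its `scoring` object as `scorer(estimator, X, y)`, while `custom_penalty` takes `(y_test, y_pred)`. One possible fix, sketched below on synthetic data, is to wrap the metric with `make_scorer`. The names `penalty_metric`, `X_demo` and `y_demo` are assumptions for this example, and the synthetic set stands in for the credit data, so the printed values say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def penalty_metric(y_true, y_pred):
    """Same rule as custom_penalty: cost 10 per False Negative, 1 per False Positive."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return 1 - 0.1 * (10 * fn + fp) / len(y_true)

# Synthetic stand-in for the scaled training data (not the credit dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.8, 0.2], random_state=0
)

# make_scorer turns a (y_true, y_pred) metric into a (estimator, X, y) scorer;
# greater_is_better defaults to True, which matches this score
penalty_scorer = make_scorer(penalty_metric)

search_demo = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "p": [1, 2]},
    scoring=penalty_scorer,
    cv=5,
)
search_demo.fit(X_demo, y_demo)  # y_demo is 1-D, so no column-vector warning

print(search_demo.best_params_)
print(search_demo.best_score_)
```

In the notebook, the equivalent change would be `scoring=make_scorer(custom_penalty)`. The column-vector warning is a separate issue: it comes from passing the target as a one-column DataFrame, and flattening it (for example with `y_train_hybrid.to_numpy().ravel()`) should silence it.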


In [19]:
best_params = grid_search.best_params_
best_score = grid_search.best_score_
best_model = grid_search.best_estimator_
best_model_score = best_model.score(X_test_scaled, y_test_hybrid)
print(best_params)
print(best_score)
print(best_model)
print(best_model_score)

{'n_neighbors': 3, 'p': 1}
nan
KNeighborsClassifier(n_neighbors=3, p=1)
0.672


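We build the model with the parameters returned by GridSearchCV. Because scoring failed on every fold (the best score is `nan`), these values are most likely just the first combination in the grid rather than a validated optimum.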

In [20]:
# Note: p is only used by the Minkowski metric, so metric='euclidean' overrides p=1
classifier = KNeighborsClassifier(n_neighbors=3, p=1, metric='euclidean')
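The `p` parameter is the power of the Minkowski distance: `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. The short sketch below computes both for one pair of points; the helper `minkowski_distance` is written for this illustration and is not a scikit-learn function.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a, point_b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The two distances differ (7 versus 5 here), so the choice between them can change which neighbors count as nearest.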

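Cross-validation is used to check how consistent performance is across different training subsamples of our data. With binary 0/1 labels, `neg_mean_squared_error` is the negative of the misclassification rate, so a score of -0.40 corresponds to about 60% accuracy.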

In [21]:
scores = cross_val_score(classifier, X_train_scaled, y_train_hybrid, cv=5,\
                         scoring='neg_mean_squared_error')
print("Scores de validation croisée:", scores)

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().


Scores de validation croisée: [-0.42912946 -0.41964286 -0.39676339 -0.375      -0.40424344]
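To read these values, it helps to see that on 0/1 labels the mean squared error is exactly the share of wrong predictions. The sketch below checks this on a small hypothetical pair of label arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical labels: 3 of the 10 predictions are wrong
y_true_cv = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred_cv = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

mse = mean_squared_error(y_true_cv, y_pred_cv)
accuracy = accuracy_score(y_true_cv, y_pred_cv)

# For binary labels each squared error is 0 or 1, so MSE equals 1 - accuracy
print(mse, 1 - accuracy)
```

Under this reading, `scoring='accuracy'` would report the same information more directly.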




This step trains the model on the training data. We also time it, since training time is one of the metrics we use to compare our Machine Learning models.

In [22]:
start_time = time.time()
classifier.fit(X_train_scaled, y_train_hybrid)
end_time = time.time()
execution_time = end_time - start_time
print(f"Le temps de traitement est de : {execution_time} secondes")

Le temps de traitement est de : 0.006713151931762695 secondes


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
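For K-NN, fitting mostly stores or indexes the training data, and most of the computation happens at prediction time. A comparison between models may therefore benefit from timing both steps. The sketch below does so on random data with `time.perf_counter`, which is generally better suited to short intervals than `time.time`; the data and variable names are assumptions for the example.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data (not the credit dataset)
rng = np.random.default_rng(0)
X_time = rng.normal(size=(2000, 10))
y_time = rng.integers(0, 2, size=2000)

model_time = KNeighborsClassifier(n_neighbors=3)

# Time the fit step
start = time.perf_counter()
model_time.fit(X_time, y_time)
fit_seconds = time.perf_counter() - start

# Time the predict step, where the neighbor search actually runs
start = time.perf_counter()
pred_time = model_time.predict(X_time)
predict_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s")
```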


This step predicts the `TARGET` for the test set.

In [23]:
y_pred_hybrid = classifier.predict(X_test_scaled)

We compute our custom score, which evaluates the relevance of the Machine Learning model according to its FALSE NEGATIVE and FALSE POSITIVE counts.

In [24]:
custom_penalty(y_test_hybrid, y_pred_hybrid)

0.92659
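A score like this is easier to interpret next to trivial baselines. The sketch below applies the same penalty rule to two constant predictors on a hypothetical test set with 8% positives. That rate is an assumption for the example, not a figure measured from the notebook's data.

```python
import numpy as np

def penalty_rule(y_true, y_pred):
    """Penalty rule of this section: 1 - 0.1 * (10 * FN + FP) / n."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return float(1 - 0.1 * (10 * fn + fp) / len(y_true))

# Hypothetical test labels: 80 positives out of 1000 (8%)
y_base = np.zeros(1000, dtype=int)
y_base[:80] = 1

score_all_zero = penalty_rule(y_base, np.zeros(1000, dtype=int))  # every positive is a FN
score_all_one = penalty_rule(y_base, np.ones(1000, dtype=int))    # every negative is a FP
print(score_all_zero, score_all_one)
```

Under this rule, a model that always predicts 0 scores 1 minus the positive rate, and a model that always predicts 1 scores 1 minus a tenth of the negative rate. Comparing the model's score with these two values on the real test set would show how much it adds over a constant answer.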

The F1-score gives another view of the model's relevance: it combines Precision and Recall, as their harmonic mean, into a single number.

In [25]:
f1 = f1_score(y_test_hybrid, y_pred_hybrid)
print(f"F1-score : {f1}")

F1-score : 0.15273103805332644
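As a reminder of how this number is built, the sketch below computes Precision, Recall and their harmonic mean by hand on a small hypothetical example and compares the result with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 true positive, 1 false positive, 2 false negatives
y_true_f1 = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_f1 = [1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true_f1, y_pred_f1)  # TP / (TP + FP) = 1/2
recall = recall_score(y_true_f1, y_pred_f1)        # TP / (TP + FN) = 1/3

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true_f1, y_pred_f1)
print(f1_manual, f1_sklearn)
```

Because the harmonic mean is pulled toward the smaller of the two values, a low F1-score indicates that Precision, Recall, or both are low for the positive class.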


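The ROC AUC measures how well the model separates the two classes. It takes FALSE NEGATIVES and FALSE POSITIVES into account but applies no penalty to either. Note that it is computed here from hard 0/1 predictions rather than predicted probabilities, which reduces the ROC curve to a single operating point.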

In [26]:
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test_hybrid, y_pred_hybrid)
auc_score_hybrid = roc_auc_score(y_test_hybrid, y_pred_hybrid)
auc_score_hybrid

0.5450194451931434
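ROC AUC is usually computed from scores or probabilities, so that every threshold contributes to the curve. The sketch below, on synthetic data, contrasts the two usages: with hard labels the AUC coincides with balanced accuracy, while `predict_proba` yields a ranking-based AUC. The dataset, the choice of 15 neighbors and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (not the credit dataset)
X_auc, y_auc = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_auc, y_auc, test_size=0.3, stratify=y_auc, random_state=0
)

knn_auc = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)

hard_labels = knn_auc.predict(X_te)                 # 0/1 predictions
positive_proba = knn_auc.predict_proba(X_te)[:, 1]  # share of neighbors in class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, positive_proba)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(auc_from_labels, balanced_acc, auc_from_proba)
```

In the notebook, the corresponding change would be to pass `classifier.predict_proba(X_test_scaled)[:, 1]` to `roc_curve` and `roc_auc_score` instead of `y_pred_hybrid`.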

This part covers the Machine Learning model trained on the imbalanced data processed with an Undersampling method.

In [27]:
X_train_under = pd.read_csv('X_train_CSV_under.csv')
X_test_under = pd.read_csv('X_test_CSV_under.csv')
y_train_under = pd.read_csv('y_train_CSV_under.csv')
y_test_under = pd.read_csv('y_test_CSV_under.csv')

In [28]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_under)
X_test_scaled = scaler.transform(X_test_under)

In [29]:
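# Same grid as for the hybrid data: number of neighbors and Minkowski power p
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'p': [1, 2]}

knn = KNeighborsClassifier()
# Caution: `scoring` expects a scorer called as scorer(estimator, X, y).
# custom_penalty(y_test, y_pred) is a plain metric, so it would need wrapping
# with make_scorer(custom_penalty); passed as is, every fold scores nan.
grid_search = GridSearchCV(knn, param_grid, scoring=custom_penalty, cv=5)
grid_search.fit(X_train_scaled, y_train_under)
grid_search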

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/Users/gaeldelescluse/.local/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
TypeError: custom_penalty() takes 2 positional arguments but 3 were given


In [30]:
best_params = grid_search.best_params_
best_score = grid_search.best_score_
best_model = grid_search.best_estimator_
best_model_score = best_model.score(X_test_scaled, y_test_under)
print(best_params)
print(best_score)
print(best_model)
print(best_model_score)

{'n_neighbors': 3, 'p': 1}
nan
KNeighborsClassifier(n_neighbors=3, p=1)
0.5623


In [31]:
# Note: p is only used by the Minkowski metric, so metric='euclidean' overrides p=1
classifier = KNeighborsClassifier(n_neighbors=3, p=1, metric='euclidean')

In [32]:
scores = cross_val_score(classifier, X_train_scaled, y_train_under, cv=5,\
                         scoring='neg_mean_squared_error')
print("Scores de validation croisée:", scores)

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().


Scores de validation croisée: [-0.42512077 -0.44766506 -0.44041868 -0.41982272 -0.44238517]




In [33]:
start_time = time.time()
classifier.fit(X_train_scaled, y_train_under)
end_time = time.time()
execution_time = end_time - start_time
print(f"Le temps de traitement est de : {execution_time} secondes")

Le temps de traitement est de : 0.009671926498413086 secondes


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().


In [34]:
y_pred_under = classifier.predict(X_test_scaled)

We display the model's custom score, which applies a penalty whenever the model produces False Negatives.

In [35]:
custom_penalty(y_test_under, y_pred_under)

0.92428

In [36]:
f1 = f1_score(y_test_under, y_pred_under)
print(f"F1-score : {f1}")

F1-score : 0.15247148288973383


In [37]:
fpr, tpr, thresholds = roc_curve(y_test_under, y_pred_under)
auc_score_under = roc_auc_score(y_test_under, y_pred_under)
auc_score_under

0.5462087551907981
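The hybrid and undersampling parts repeat the same steps: scale, fit, time, predict, score. One way to avoid the duplication is to wrap those steps in a helper, as sketched below. The function `evaluate_knn` and the synthetic data are assumptions made for this example, and the AUC is computed from hard labels only to mirror the cells above.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def evaluate_knn(X_train, X_test, y_train, y_test, n_neighbors=3):
    """Scale, fit, predict and return the metrics used in this section."""
    y_train = np.asarray(y_train).ravel()  # avoids the column-vector warning
    y_test = np.asarray(y_test).ravel()

    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    start = time.perf_counter()
    model.fit(X_train_s, y_train)
    fit_seconds = time.perf_counter() - start

    y_pred = model.predict(X_test_s)
    fn = np.sum((y_test == 1) & (y_pred == 0))
    fp = np.sum((y_test == 0) & (y_pred == 1))
    return {
        "fit_seconds": fit_seconds,
        "penalty_score": float(1 - 0.1 * (10 * fn + fp) / len(y_test)),
        "f1": float(f1_score(y_test, y_pred)),
        "auc": float(roc_auc_score(y_test, y_pred)),
    }

# Synthetic stand-in for one resampled dataset (not the credit data)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=1
)

results = evaluate_knn(X_a, X_b, y_a, y_b)
print(results)
```

Calling the helper once with the hybrid files and once with the undersampling files would produce two dictionaries that can be compared side by side.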