## Final Project - Road Accidents in France in 2019
## No. 2 / 'Imbalanced-Learn' Module

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import joblib

import sklearn
from sklearn import svm
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import balanced_accuracy_score

from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier, RUSBoostClassifier, EasyEnsembleClassifier


### Importing the data files

In [2]:
acc = pd.read_csv('../Final-Project/data/victime_clean_dummies.csv')

In [3]:
acc.shape

(130901, 141)

## Machine Learning

In [4]:
# Separate the target column 'grav' from the features
# (train_test_split is already imported in the first cell)
y = acc.pop('grav')
X = acc

#### Selecting and training several models with default settings

To evaluate our models, we use the balanced accuracy score rather than plain accuracy, because the target classes differ widely in size.
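
As a minimal sketch of why this choice matters: balanced accuracy is the unweighted mean of the recall obtained on each class, so a model that only ever predicts the majority class is not rewarded. The labels below are made-up toy values, not taken from the accident data, and `balanced_accuracy` is a hand-written helper that mirrors what `sklearn.metrics.balanced_accuracy_score` computes.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for cls in np.unique(y_true):
        mask = y_true == cls                      # samples that truly belong to cls
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.mean(recalls))

# Toy labels (made up): 8 samples of class 0, 2 samples of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that always predicts the majority class
y_pred = np.zeros(10, dtype=int)

plain_accuracy = float(np.mean(y_true == y_pred))
print("accuracy          =", plain_accuracy)
print("balanced accuracy =", balanced_accuracy(y_true, y_pred))
```

Plain accuracy looks good here only because class 0 dominates; balanced accuracy gives each class the same weight regardless of its size.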

We try several classification models with their default settings.
https://scikit-learn.org/stable/modules/multiclass.html

Next, we use the approach offered by the imblearn.ensemble module: ensemble models in which each training step uses a sample that is automatically rebalanced across the classes. This removes the need to resample the data before training.
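
The rebalancing idea can be sketched with NumPy alone. This is a simplified illustration and not imblearn's actual implementation: it assumes plain random undersampling of every class down to the minority class size, and the helper name `balanced_undersample_indices` and the toy class counts are hypothetical.

```python
import numpy as np

def balanced_undersample_indices(y, rng):
    """Return row indices with every class cut down to the minority class size."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    picked = [
        rng.choice(np.flatnonzero(y == cls), size=n_min, replace=False)
        for cls in classes
    ]
    return np.concatenate(picked)

rng = np.random.default_rng(0)
# Toy imbalanced target (made up): 90 / 25 / 5 samples for classes 0 / 1 / 2
y_toy = np.repeat([0, 1, 2], [90, 25, 5])

# Each base learner of the ensemble would be fitted on its own rebalanced sample
for i in range(3):
    idx = balanced_undersample_indices(y_toy, rng)
    print(f"learner {i}: class counts =", np.bincount(y_toy[idx]))
```

Because every learner draws a different subset of the majority classes, the ensemble as a whole still sees most of the data even though each member trains on a small balanced sample.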

In [5]:
# Model: BALANCED FOREST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
brf1 = BalancedRandomForestClassifier()
brf1.fit(X_train, y_train) 

BalancedRandomForestClassifier()

In [6]:
#brf1_scores = cross_val_score(brf1, X_train, y_train, scoring='balanced_accuracy', cv=10)
#display_scores(brf1_scores)

In [7]:
brf1_pred_train = brf1.predict(X_train)
brf1_train_score = balanced_accuracy_score(y_train, brf1_pred_train)
print("Final Balanced Accuracy Score on Train Set =", round(brf1_train_score,3))

Final Balanced Accuracy Score on Train Set = 0.736


In [8]:
brf1_pred_test = brf1.predict(X_test)
brf1_test_score = balanced_accuracy_score(y_test, brf1_pred_test)
print("Final Balanced Accuracy Score on Test Set =", round(brf1_test_score,3))

Final Balanced Accuracy Score on Test Set = 0.561


In [9]:
# Model: BALANCED BAGGING
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

bbc1 = BalancedBaggingClassifier()
bbc1.fit(X_train, y_train) 

BalancedBaggingClassifier()

In [10]:
bbc1_pred_train = bbc1.predict(X_train)
bbc1_train_score = balanced_accuracy_score(y_train, bbc1_pred_train)
print("Final Balanced Accuracy Score on Train Set =", round(bbc1_train_score,3))

Final Balanced Accuracy Score on Train Set = 0.722


In [11]:
bbc1_pred_test = bbc1.predict(X_test)
bbc1_test_score = balanced_accuracy_score(y_test, bbc1_pred_test)
print("Final Balanced Accuracy Score on Test Set =", round(bbc1_test_score,3))

Final Balanced Accuracy Score on Test Set = 0.539


In [12]:
# Model: RUS BOOST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rusboost1 = RUSBoostClassifier()
rusboost1.fit(X_train, y_train) 

RUSBoostClassifier()

In [13]:
rusboost1_pred_train = rusboost1.predict(X_train)
rusboost1_train_score = balanced_accuracy_score(y_train, rusboost1_pred_train)
print("Final Balanced Accuracy Score on Train Set =", round(rusboost1_train_score,3))

Final Balanced Accuracy Score on Train Set = 0.494


In [14]:
rusboost1_pred_test = rusboost1.predict(X_test)
rusboost1_test_score = balanced_accuracy_score(y_test, rusboost1_pred_test)
print("Final Balanced Accuracy Score on Test Set =", round(rusboost1_test_score,3))

Final Balanced Accuracy Score on Test Set = 0.496


In [15]:
# Model: EASY ENSEMBLE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ee1 = EasyEnsembleClassifier()
ee1.fit(X_train, y_train) 

EasyEnsembleClassifier()

In [16]:
ee1_pred_train = ee1.predict(X_train)
ee1_train_score = balanced_accuracy_score(y_train, ee1_pred_train)
print("Final Balanced Accuracy Score on Train Set =", round(ee1_train_score,3))

Final Balanced Accuracy Score on Train Set = 0.532


In [17]:
ee1_pred_test = ee1.predict(X_test)
ee1_test_score = balanced_accuracy_score(y_test, ee1_pred_test)
print("Final Balanced Accuracy Score on Test Set =", round(ee1_test_score,3))

Final Balanced Accuracy Score on Test Set = 0.529


**Observations:**
With a balanced accuracy of 56% on the test set, the "Balanced Forest Classifier" model is the most promising, although the gap with its training score (0.736) suggests overfitting.

In [18]:
print("finished!")

finished!


#### Cross-validating the models

The training set is split into 10 subsets ("folds"). The model is trained 10 times in a row, each time on 9 of the folds, and evaluated on the remaining fold (the "validation fold").
This yields 10 distinct validation scores, from which we compute the mean and the standard deviation.
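
The fold rotation described above can be sketched as follows. This is an illustration under simplifying assumptions: the helper `kfold_indices` and the sample count are hypothetical, and the split is a plain shuffled one, whereas scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds for classifiers.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold serves once as validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train_idx, val_idx

n_samples = 1000  # hypothetical training-set size
splits = list(kfold_indices(n_samples))
for k, (train_idx, val_idx) in enumerate(splits[:3]):
    print(f"fold {k}: train={len(train_idx)}, validation={len(val_idx)}")
```

Every sample is used for validation exactly once and for training nine times, which is why the spread of the 10 scores gives a useful estimate of how stable the model is.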

In [19]:
# Cross-validation using balanced accuracy score

from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Balanced Accuracy Scores:", scores)
    print("Mean Balanced Accuracy Score:", round(scores.mean(),3))
    print("Standard deviation:", round(scores.std(),5))

#### Hyperparameter tuning with RandomizedSearchCV

Let's try to improve the Gradient Boosting model by tuning its hyperparameters.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

gbc2 = GradientBoostingClassifier()

# Candidate values for the random search
parameters = {"learning_rate": [0.001, 0.01, 0.1, 0.2],
              "n_estimators" : [100, 500, 1000, 1500],
              "subsample"    : [0.5, 0.7, 1.0],  # subsample must lie in (0, 1]
              "max_features" : ['sqrt','log2',2,50,140],
              #'min_samples_split':[2,4,6],
              #'min_samples_leaf':[3,5,7],
              "max_depth"    : [2, 3, 10, 15, 20]
              }

# Defaults apply: n_iter=10 sampled candidates, 5-fold cross-validation
randm = RandomizedSearchCV(gbc2, parameters, n_jobs=-1, scoring='balanced_accuracy')
randm.fit(X_train, y_train)
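
To get a feel for how much of the search space a random search covers, the sketch below counts the candidate combinations and draws 10 of them, which is the default `n_iter` of `RandomizedSearchCV`. The dictionary repeats the search space with `subsample` restricted to its valid range (0, 1]. Drawing with the standard library `random` module is only a rough stand-in for what the search does internally.

```python
import itertools
import random

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
    "n_estimators": [100, 500, 1000, 1500],
    "subsample": [0.5, 0.7, 1.0],
    "max_features": ["sqrt", "log2", 2, 50, 140],
    "max_depth": [2, 3, 10, 15, 20],
}

# Total number of distinct combinations in the grid
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

n_iter = 10  # default number of candidates tried by RandomizedSearchCV
random.seed(0)
all_combos = list(itertools.product(*param_grid.values()))
sampled = [dict(zip(param_grid, combo)) for combo in random.sample(all_combos, n_iter)]

print("combinations in the grid:", n_combinations)
print(f"share explored with n_iter={n_iter}: {n_iter / n_combinations:.2%}")
print("first sampled candidate:", sampled[0])
```

With so small a share of the grid explored, raising `n_iter` or narrowing the value lists are the two obvious levers if the search result is disappointing.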

In [None]:
randm.best_estimator_

In [None]:
print(f"The mean cross-validated score of the best estimator is: {randm.best_score_}")

#### Final evaluation on the test set

In [None]:
final_model = randm.best_estimator_

final_pred_test = final_model.predict(X_test)

final_score_test = balanced_accuracy_score(y_test, final_pred_test)

print("Final Balanced Accuracy Score on Test Set =", round(final_score_test,3))



**Conclusion:**
Our best-performing model is disappointing, with a balanced accuracy score below 50%.
Possible improvements: undersampling the 2 majority classes, adding features to enrich the training set, and trying other, more complex algorithms.

#### Most important features
Let's look at which features weigh most heavily in our model's classification.
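
Fitted tree ensembles in scikit-learn expose a `feature_importances_` array aligned with the training columns. A minimal sketch of ranking such an array is shown here; the feature names and importance values are hypothetical placeholders, not results from this project.

```python
import numpy as np

# Hypothetical feature names and importance values, for illustration only
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
importances = np.array([0.10, 0.45, 0.05, 0.40])

# Sort from most to least important
order = np.argsort(importances)[::-1]
ranking = [(feature_names[i], float(importances[i])) for i in order]

for name, value in ranking:
    print(f"{name:<10} {value:.2f}")
```

In the notebook, `importances` would come from `randm.best_estimator_.feature_importances_` and `feature_names` from `X_train.columns`.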

In [None]:
#feature_importances = randm.best_estimator_.feature_importances_
#list_feat = list(feature_importances)
#list_col = list(X_train.columns)
#sorted([t for t in zip(list_feat, list_col)], key=lambda t: t[0], reverse=True)

In [None]:
# To save the model
#joblib.dump(randm.best_estimator_, "my_model_2021-08-05.pkl")
# To reload it later
#my_model_loaded = joblib.load("my_model_2021-08-05.pkl")

In [None]:
print("Done!!")