# CUNEF MUCD 2021/2022  
## Machine Learning
## Automobile Accident Analysis

### Authors:
- Andrés Mahía Morado
- Antonio Tello Gómez


In [4]:
import pandas as pd
import numpy as np

from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, auc, classification_report,
    confusion_matrix, f1_score, log_loss, make_scorer, precision_recall_curve,
    precision_score, recall_score, roc_auc_score, roc_curve, silhouette_score,
)
from sklearn import metrics
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.

import lightgbm as lgb
from sklearn.pipeline import Pipeline
import pickle
import warnings
warnings.filterwarnings('ignore')
%load_ext autotime

from aux_func import evaluate_model, cargar_modelo

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 0 ns (started: 2021-12-18 19:14:34 +01:00)


In [2]:
xtrain = pd.read_parquet("../data/xtrain.parquet")
ytrain = pd.read_parquet("../data/ytrain.parquet")['fatality']
xtest = pd.read_parquet("../data/xtest.parquet")
ytest = pd.read_parquet("../data/ytest.parquet")['fatality']

time: 1.72 s (started: 2021-12-18 19:14:20 +01:00)


In [5]:
# Load the fitted preprocessing pipeline
preprocessor = cargar_modelo('../models/preprocessor.pickle')

time: 187 ms (started: 2021-12-18 19:14:35 +01:00)


# LightGBM

LightGBM is a gradient boosting framework built on techniques similar to those of XGBoost, but optimized for faster training and greater efficiency.
Both libraries can train in parallel; what mainly sets LightGBM apart is leaf-wise (best-first) tree growth, whereas XGBoost grows trees level by level by default.

![Highway](https://programmerclick.com/images/609/b84dc6b1590fb03af6971b4761aecc19.png)

In [7]:
clf = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', lgb.LGBMClassifier(n_jobs=-1, random_state=0))])

time: 15 ms (started: 2021-12-18 19:15:38 +01:00)


In [8]:
clf.fit(xtrain, ytrain)

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved',
                                                   'year']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                

time: 37.5 s (started: 2021-12-18 19:15:40 +01:00)


In [9]:
with open('../models/LightGBM.pickle', 'wb') as f:
    pickle.dump(clf, f)

time: 0 ns (started: 2021-12-18 19:16:17 +01:00)


In [10]:
# To avoid retraining, skip the fit cell and run from here
with open('../models/LightGBM.pickle', 'rb') as f:
    clf = pickle.load(f)

time: 16 ms (started: 2021-12-18 19:16:17 +01:00)


We generate predictions on the held-out test data and evaluate the model.

In [11]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest,ypred,ypred_proba)

ROC-AUC score of the model: 0.8473477473418336
Accuracy of the model: 0.9848270886453092

Classification report: 
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    799946
           1       0.46      0.02      0.03     12291

    accuracy                           0.98    812237
   macro avg       0.72      0.51      0.51    812237
weighted avg       0.98      0.98      0.98    812237


Confusion matrix: 
[[799710    236]
 [ 12088    203]]

time: 9.22 s (started: 2021-12-18 19:16:17 +01:00)
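`evaluate_model` comes from the project's `aux_func` module, whose source is not shown here. As a rough guide to what it computes, the sketch below reproduces the same printed sections with scikit-learn. The name `evaluate_model_sketch`, its return value, and the assumption that the ROC-AUC is computed from the positive-class column of the probability array are illustrative choices, not taken from `aux_func`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)


def evaluate_model_sketch(y_true, y_pred, y_proba):
    """Hypothetical stand-in for aux_func.evaluate_model (real source not shown)."""
    y_proba = np.asarray(y_proba)
    # Assumption: column 1 holds the probability of the positive class.
    roc_auc = roc_auc_score(y_true, y_proba[:, 1])
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    print("ROC-AUC score of the model:", roc_auc)
    print("Accuracy of the model:", acc)
    print("\nClassification report:")
    print(classification_report(y_true, y_pred))
    print("\nConfusion matrix:")
    print(cm)
    return {"roc_auc": roc_auc, "accuracy": acc, "confusion_matrix": cm}


# Toy data, only to show the call signature.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
toy_proba = [[0.9, 0.1], [0.4, 0.6], [0.35, 0.65], [0.2, 0.8]]
toy_result = evaluate_model_sketch(toy_true, toy_pred, toy_proba)
```

Note that the ROC-AUC depends only on the probabilities, so it stays the same when the decision threshold changes, while accuracy and the confusion matrix do not.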


## Tuning the prediction threshold

We now adjust the decision threshold to obtain higher recall on the minority class.

In [12]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.014130, G-Mean=0.763
ROC-AUC score of the model: 0.8473477473418336
Accuracy of the model: 0.7555910405460475

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.76      0.86    799946
           1       0.05      0.77      0.09     12291

    accuracy                           0.76    812237
   macro avg       0.52      0.76      0.47    812237
weighted avg       0.98      0.76      0.85    812237


Confusion matrix: 
[[604238 195708]
 [  2810   9481]]

time: 1.58 s (started: 2021-12-18 19:16:26 +01:00)


Adjusting the threshold gives the model much higher recall on the minority class (0.77, up from 0.02), which is what matters from a practical point of view, at the cost of lower precision (0.05, down from 0.46) and lower accuracy (0.76, down from 0.98).
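The trade-off described above can be re-derived from the confusion matrix printed for the adjusted threshold (rows are true classes, columns are predicted classes, following the scikit-learn convention). A short check, with variable names chosen for illustration:

```python
import math

# Confusion matrix printed above for the adjusted threshold.
tn, fp = 604238, 195708
fn, tp = 2810, 9481

recall = tp / (tp + fn)        # true positive rate on class 1
precision = tp / (tp + fp)     # share of predicted positives that are correct
specificity = tn / (tn + fp)   # true negative rate on class 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
g_mean = math.sqrt(recall * specificity)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"accuracy={accuracy:.3f} g_mean={g_mean:.3f}")
# → recall=0.771 precision=0.046 accuracy=0.756 g_mean=0.763
```

These values agree with the printed classification report and with the reported G-Mean of 0.763.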

The LightGBM model performs well, with a test ROC-AUC of 0.847. Training was also very fast (37.5 s), a fraction of the time required by the previous models.

## Overfitting check

We check whether the model overfits by generating predictions on the training set.

In [13]:
ypred = clf.predict(xtrain)
ypred_proba = clf.predict_proba(xtrain)

# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytrain, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytrain,ypred_new_threshold,ypred_proba)

Best Threshold=0.014689, G-Mean=0.768
ROC-AUC score of the model: 0.8534891520550415
Accuracy of the model: 0.7663592335358094

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.77      0.87   3200049
           1       0.05      0.77      0.09     48896

    accuracy                           0.77   3248945
   macro avg       0.52      0.77      0.48   3248945
weighted avg       0.98      0.77      0.85   3248945


Confusion matrix: 
[[2452206  747843]
 [  11243   37653]]

time: 37.8 s (started: 2021-12-18 19:16:28 +01:00)
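The threshold search is written out twice above, once per split. As a sketch, it could be factored into a helper; `best_gmean_threshold` and `apply_threshold` are hypothetical names, not part of `aux_func`. One detail is worth noting: `roc_curve` counts a score equal to the threshold as positive, so the helper binarizes with `>=` rather than `>`. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve


def best_gmean_threshold(y_true, scores):
    """Return (threshold, g_mean) maximizing sqrt(TPR * (1 - FPR)) on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    gmeans = np.sqrt(tpr * (1 - fpr))
    ix = np.argmax(gmeans)
    return thresholds[ix], gmeans[ix]


def apply_threshold(scores, threshold):
    """Binarize scores using the same >= convention as roc_curve."""
    return (np.asarray(scores) >= threshold).astype(int)


# Toy data (illustrative only).
toy_y = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_threshold, toy_gmean = best_gmean_threshold(toy_y, toy_scores)
toy_labels = apply_threshold(toy_scores, toy_threshold)
print(toy_threshold, round(float(toy_gmean), 3), toy_labels)
```

With such a helper, a threshold chosen on `xtrain` could also be applied unchanged to `xtest`, which would avoid selecting the threshold on the same data used to report the final metrics.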


The LightGBM model performed surprisingly well, meeting three conditions that are essential when building a model:

- Good results (test ROC-AUC of 0.847)
- Short execution time (37.5 s to fit)
- No sign of overfitting (training ROC-AUC of 0.853 versus 0.847 on test)
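The overfitting conclusion rests on the training and test runs above producing nearly identical metrics. A quick comparison using the printed values (the dictionary layout is an illustrative choice):

```python
# Values copied from the outputs above.
reported = {
    "roc_auc": {"train": 0.8534891520550415, "test": 0.8473477473418336},
    "g_mean": {"train": 0.768, "test": 0.763},
    "accuracy": {"train": 0.7663592335358094, "test": 0.7555910405460475},
}

# Gap between training and test for each metric.
gaps = {name: vals["train"] - vals["test"] for name, vals in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:.4f}")
```

Every gap is about 0.01 or smaller, which is consistent with the absence of marked overfitting. The accuracy figures use each split's own threshold, so that comparison is only approximate.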