# CUNEF MUCD 2021/2022  
## Machine Learning
## Automobile Accident Rate Analysis

### Authors:
- Andrés Mahía Morado
- Antonio Tello Gómez


In [3]:
import pickle
import warnings

import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, auc, classification_report,
    confusion_matrix, f1_score, log_loss, make_scorer, precision_recall_curve,
    precision_score, recall_score, roc_auc_score, roc_curve, silhouette_score,
)

# Project-local helper that prints ROC-AUC, accuracy and the classification report
from aux_func import evaluate_model

# Silence library warnings to keep the cell outputs readable
warnings.filterwarnings('ignore')
# Print the execution time of every cell
%load_ext autotime

In [4]:
xtrain = pd.read_parquet("../data/xtrain.parquet")
ytrain = pd.read_parquet("../data/ytrain.parquet")['fatality']
xtest = pd.read_parquet("../data/xtest.parquet")
ytest = pd.read_parquet("../data/ytest.parquet")['fatality']

time: 375 ms (started: 2021-12-12 22:51:34 +01:00)


# AdaBoost

AdaBoost is an ensemble classifier that trains a sequence of weak learners iteratively. The first learner is fitted on the full dataset with equal sample weights; each subsequent iteration increases the weights of the samples that were misclassified, so the next learner concentrates on the regions of the data where the model has performed poorly. Each learner is assigned a weight according to its accuracy, and the weighted combination of all learners makes up the final model.

![Highway](https://programmerclick.com/images/649/93a1dcc89731b8e5fc4dd19b7967f169.png)

![Ensemble boosting diagram](https://editor.analyticsvidhya.com/uploads/626591024px-Ensemble_Boosting.svg.png)
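To make the reweighting idea concrete, here is a minimal from-scratch sketch of discrete AdaBoost on a ten-point toy dataset. It is not scikit-learn's implementation: the function names (`fit_stump`, `adaboost_fit`, `adaboost_predict`), the 1-D threshold stumps and the -1/+1 label encoding are all assumptions chosen for illustration.

```python
import math


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump.

    A stump predicts `polarity` when x > threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x > threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds=3):
    """Train `n_rounds` stumps, reweighting the samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n  # every sample starts with the same weight
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this learner
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x > threshold else -polarity for x in xs]
        # Misclassified samples (y * p == -1) gain weight, correct ones lose it
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps; labels are -1 / +1."""
    score = sum(alpha * (polarity if x > threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
n_errors = sum(p != y for p, y in zip(predictions, ys))
print(f"Training errors after {len(ensemble)} rounds: {n_errors}")
```

The best single stump misclassifies three of the ten points, while the weighted vote of three stumps separates them all, which is the behaviour the description above relies on.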

In [5]:
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(xtrain, ytrain)

AdaBoostClassifier(n_estimators=100, random_state=0)

time: 9min 22s (started: 2021-12-12 22:51:35 +01:00)


In [6]:
with open('../models/AdaBoost.pickle', 'wb') as f:
    pickle.dump(clf, f)

time: 16 ms (started: 2021-12-12 23:00:58 +01:00)


In [7]:
# To avoid retraining, skip the fit cell and run from here
with open('../models/AdaBoost.pickle', 'rb') as f:
    clf = pickle.load(f)

time: 15 ms (started: 2021-12-12 23:00:58 +01:00)


We generate predictions on the held-out test data and evaluate the model.

In [8]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest,ypred,ypred_proba)

ROC-AUC score of the model: 0.8292701921721066
Accuracy of the model: 0.9845652876974086

Classification report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    797650
           1       0.44      0.01      0.02     12472

    accuracy                           0.98    810122
   macro avg       0.71      0.50      0.51    810122
weighted avg       0.98      0.98      0.98    810122


time: 29.4 s (started: 2021-12-12 23:00:58 +01:00)
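The accuracy above should be read against the class imbalance. As a quick check, the sketch below computes the accuracy of a trivial classifier that always predicts class 0, using the support counts printed in the report; the variable names are illustrative only.

```python
# Support counts and accuracy copied from the evaluation output above
n_negative = 797_650
n_positive = 12_472
reported_accuracy = 0.9845652876974086

# A classifier that always predicts the majority class is right on every negative
baseline_accuracy = n_negative / (n_negative + n_positive)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.6f}")
print(f"AdaBoost accuracy (default threshold): {reported_accuracy:.6f}")
```

At the default threshold the model does not beat this baseline, which is consistent with its recall of 0.01 on the positive class and is the reason accuracy alone is not a useful metric here.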


## Prediction threshold tuning

We now adjust the decision threshold to obtain a higher recall on the minority class.
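For reference, the criterion used in the next cell to choose the threshold is the geometric mean of sensitivity and specificity, evaluated at every candidate threshold $t$ returned by `roc_curve` (the symbols $t$ and $t^{*}$ are notation introduced here):

$$
\text{G-mean}(t) = \sqrt{\mathrm{TPR}(t)\,\bigl(1 - \mathrm{FPR}(t)\bigr)}, \qquad t^{*} = \arg\max_{t}\ \text{G-mean}(t)
$$

Maximizing this quantity favors thresholds at which recall is high on both classes at once, which suits a heavily imbalanced target.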

In [9]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.489303, G-Mean=0.746
ROC-AUC score of the model: 0.8292701921721066
Accuracy of the model: 0.7428572486613128

Classification report: 
              precision    recall  f1-score   support

           0       0.99      0.74      0.85    797650
           1       0.04      0.75      0.08     12472

    accuracy                           0.74    810122
   macro avg       0.52      0.75      0.47    810122
weighted avg       0.98      0.74      0.84    810122


time: 1.56 s (started: 2021-12-12 23:01:27 +01:00)


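Adjusting the threshold has had an effect similar to that observed previously for the other models: recall on the positive class rises from 0.01 to 0.75, while its precision falls from 0.44 to 0.04 and overall accuracy drops from 0.98 to 0.74. The results are slightly lower than those of the other models.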
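The low precision on the positive class at the tuned threshold follows largely from the class imbalance. The rough check below rebuilds it from the per-class recalls and support counts in the report; because those recalls are rounded to two decimals, the result is only approximate.

```python
# Values copied from the classification report at the tuned threshold
n_negative, n_positive = 797_650, 12_472
recall_negative = 0.74  # specificity
recall_positive = 0.75  # sensitivity

true_positives = recall_positive * n_positive
false_positives = (1 - recall_negative) * n_negative

precision_positive = true_positives / (true_positives + false_positives)
print(f"Approximate precision on class 1: {precision_positive:.3f}")
```

With roughly 64 negatives for every positive, even a 26% false positive rate produces far more false alarms than true detections.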

## Overfitting check

We check whether the model overfits by generating predictions on the training set.

In [10]:
ypred = clf.predict(xtrain)
ypred_proba = clf.predict_proba(xtrain)

# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytrain, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytrain,ypred_new_threshold,ypred_proba)

Best Threshold=0.489399, G-Mean=0.746
ROC-AUC score of the model: 0.8279357331624063
Accuracy of the model: 0.7516188919693577

Classification report: 
              precision    recall  f1-score   support

           0       0.99      0.75      0.86   3192113
           1       0.04      0.74      0.08     48375

    accuracy                           0.75   3240488
   macro avg       0.52      0.75      0.47   3240488
weighted avg       0.98      0.75      0.84   3240488


time: 2min 10s (started: 2021-12-12 23:01:29 +01:00)


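On the training set, the AdaBoost model obtains a recall of 75% and 74% for the negative and positive classes respectively. These figures closely match the test results (74% and 75%, with a ROC-AUC of 0.828 on training versus 0.829 on test), so the model shows no sign of overfitting.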
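One way to make the overfitting check explicit is to put the training and test metrics side by side. The sketch below uses the figures printed in the outputs above; a large positive train-minus-test gap would suggest overfitting. The 0.01 tolerance is an arbitrary illustrative choice, not a standard cutoff.

```python
# Metrics copied from the two evaluation outputs at the tuned threshold
test_metrics = {"roc_auc": 0.8292701921721066, "accuracy": 0.7428572486613128}
train_metrics = {"roc_auc": 0.8279357331624063, "accuracy": 0.7516188919693577}

TOLERANCE = 0.01  # hypothetical cutoff for a "negligible" gap

gaps = {name: train_metrics[name] - test_metrics[name] for name in test_metrics}
for name, gap in gaps.items():
    print(f"{name}: train={train_metrics[name]:.4f} "
          f"test={test_metrics[name]:.4f} gap={gap:+.4f}")

overfits = any(gap > TOLERANCE for gap in gaps.values())
print("Possible overfitting" if overfits else "No sign of overfitting")
```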