# CUNEF MUCD 2021/2022  
## Machine Learning
## Automobile Accident Analysis

### Authors:
- Andrés Mahía Morado
- Antonio Tello Gómez


In [1]:
import pandas as pd
import numpy as np

# All scikit-learn metric imports in one place (duplicates removed).
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay is its replacement.
from sklearn import metrics
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, auc, classification_report,
    confusion_matrix, f1_score, log_loss, make_scorer,
    precision_recall_curve, precision_score, recall_score,
    roc_auc_score, roc_curve, silhouette_score,
)

from catboost import CatBoostClassifier 


import pickle
import warnings
warnings.filterwarnings('ignore')

from aux_func import evaluate_model

In [3]:
xtrain = pd.read_parquet("../data/xtrain.parquet")
ytrain = pd.read_parquet("../data/ytrain.parquet")['fatality']
xtest = pd.read_parquet("../data/xtest.parquet")
ytest = pd.read_parquet("../data/ytest.parquet")['fatality']

# CatBoost

CatBoost is a classifier that builds models from gradient-boosted decision trees, much like LightGBM, XGBoost and many others.
Internally, CatBoost grows decision trees one after another, and each new tree reduces the error left by the previous ones. It is an option worth considering because it can produce accurate models even from small amounts of data, which is not true of every prediction algorithm. It also lets us train the model on our GPU (`task_type="GPU"`) and monitor the evolution of the loss function in real time with the parameter `plot=True`.

![Highway](https://miro.medium.com/max/1400/1*AjrRnwvBuu-zK8CvEfM29w.png)
![Highway](https://miro.medium.com/max/512/1*jBxqPlcaq61Q7EFhSMLkkw.jpeg)
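
To make the idea of "trees that keep reducing the error" concrete, here is a minimal sketch of generic gradient boosting on log loss, using one-split trees (stumps) and NumPy only. It is not CatBoost's actual algorithm, which adds its own tree structure and boosting scheme. The toy data and the helper names (`fit_stump`, `boost`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_stump(x, residual):
    """Return the split (threshold, left value, right value) that best
    fits the residuals in the least-squares sense."""
    best = None
    for thr in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left = x <= thr
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    _, thr, lv, rv = best
    return thr, lv, rv

def boost(x, y, n_rounds=20, learning_rate=0.5):
    raw = np.zeros_like(x, dtype=float)    # raw scores (log-odds), start at 0
    losses = [log_loss(y, sigmoid(raw))]
    for _ in range(n_rounds):
        residual = y - sigmoid(raw)        # negative gradient of the log loss
        thr, lv, rv = fit_stump(x, residual)
        raw += learning_rate * np.where(x <= thr, lv, rv)
        losses.append(log_loss(y, sigmoid(raw)))
    return losses

# Hypothetical toy data: one noisy feature, binary target.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = (x + rng.normal(0, 1, size=200) > 0).astype(float)

losses = boost(x, y)
print(f"log loss before boosting: {losses[0]:.4f}, after 20 rounds: {losses[-1]:.4f}")
```

The per-iteration `learn` / `test` values in the training log play the same role as `losses` here: the log loss after each additional tree.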

We train the model:

In [4]:

clf = CatBoostClassifier(random_state=0, task_type="GPU")
clf.fit(xtrain, ytrain, plot=True, eval_set=(xtest, ytest))

Learning rate set to 0.036017
0:	learn: 0.6159795	test: 0.6160314	best: 0.6160314 (0)	total: 244ms	remaining: 4m 3s
1:	learn: 0.5486773	test: 0.5487770	best: 0.5487770 (1)	total: 427ms	remaining: 3m 32s
2:	learn: 0.4908710	test: 0.4910118	best: 0.4910118 (2)	total: 617ms	remaining: 3m 24s
3:	learn: 0.4399656	test: 0.4401467	best: 0.4401467 (3)	total: 797ms	remaining: 3m 18s
4:	learn: 0.3938944	test: 0.3941122	best: 0.3941122 (4)	total: 984ms	remaining: 3m 15s
5:	learn: 0.3544750	test: 0.3547342	best: 0.3547342 (5)	total: 1.17s	remaining: 3m 13s
6:	learn: 0.3207768	test: 0.3210703	best: 0.3210703 (6)	total: 1.36s	remaining: 3m 13s
7:	learn: 0.2913898	test: 0.2917192	best: 0.2917192 (7)	total: 1.57s	remaining: 3m 14s
8:	learn: 0.2651543	test: 0.2655206	best: 0.2655206 (8)	total: 1.77s	remaining: 3m 14s
9:	learn: 0.2429154	test: 0.2433104	best: 0.2433104 (9)	total: 1.96s	remaining: 3m 14s
10:	learn: 0.2228282	test: 0.2232515	best: 0.2232515 (10)	total: 2.16s	remaining: 3m 14s
11:	learn: 0

<catboost.core.CatBoostClassifier at 0x2c5058b0bb0>

In [5]:
with open('../models/CatBoost.pickle', 'wb') as f:
    pickle.dump(clf, f)

In [6]:
# To avoid retraining, skip the fit cell and run from here
with open('../models/CatBoost.pickle', 'rb') as f:
    clf = pickle.load(f)

We generate predictions on the test data and evaluate the model.

In [7]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest,ypred,ypred_proba)

ROC-AUC score of the model: 0.8474571754713867
Accuracy of the model: 0.9847647713481321

Classification report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    997197
           1       0.57      0.01      0.01     15456

    accuracy                           0.98   1012653
   macro avg       0.78      0.50      0.50   1012653
weighted avg       0.98      0.98      0.98   1012653


Confusion matrix: 
[[997112     85]
 [ 15343    113]]

Wall time: 4.12 s
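
The 0.98 accuracy above is easier to interpret when the headline metrics are recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed above. The variable names are illustrative, and the layout assumed is the scikit-learn convention (rows are true classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 997112, 85      # true class 0: predicted 0 / predicted 1
fn, tp = 15343, 113      # true class 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_pos = tp / (tp + fn)            # share of fatalities that are detected
precision_pos = tp / (tp + fp)         # share of predicted fatalities that are real
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
print(f"recall (class 1)  = {recall_pos:.4f}")
print(f"precision (cl. 1) = {precision_pos:.4f}")
```

Accuracy is almost identical to what a model that always predicts class 0 would achieve, while recall on class 1 stays below 1%. This is why the decision threshold is revisited in the next section.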


## Adjusting the prediction threshold

We now adjust the decision threshold to obtain a higher recall on the minority class.
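
The selection criterion in the following cell is the geometric mean of the true positive rate and the true negative rate, evaluated at every candidate threshold $t$ returned by `roc_curve`. Written out, with $t^{*}$ as our own notation for the selected threshold:

$$
G(t) = \sqrt{\mathrm{TPR}(t)\,\bigl(1 - \mathrm{FPR}(t)\bigr)}, \qquad t^{*} = \arg\max_{t} G(t)
$$

Because $G$ is a product of two rates, it is only high when both classes are classified reasonably well, so it does not reward predicting the majority class everywhere.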

In [8]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.014660, G-Mean=0.763
ROC-AUC score of the model: 0.8474571754713867
Accuracy of the model: 0.7640257817830984

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.76      0.86    997197
           1       0.05      0.76      0.09     15456

    accuracy                           0.76   1012653
   macro avg       0.52      0.76      0.48   1012653
weighted avg       0.98      0.76      0.85   1012653


Confusion matrix: 
[[761904 235293]
 [  3667  11789]]
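
As a sanity check, the reported G-mean can be recomputed from the confusion matrix above. This sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes), and the variable names are illustrative.

```python
import math

# Counts copied from the confusion matrix at the adjusted threshold.
tn, fp = 761904, 235293
fn, tp = 3667, 11789

tpr = tp / (tp + fn)             # recall on class 1 (sensitivity)
tnr = tn / (tn + fp)             # recall on class 0 (specificity), i.e. 1 - FPR
g_mean = math.sqrt(tpr * tnr)
precision_pos = tp / (tp + fp)   # the price paid for the higher recall

print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}, G-mean = {g_mean:.3f}")
print(f"precision (class 1) = {precision_pos:.3f}")
```

Both rates land near 0.76, consistent with the printed G-mean. Precision on class 1 drops to about 5%, so most of the positive predictions at this threshold are false alarms.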



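CatBoost has returned good results (a ROC-AUC of 0.847 and, after adjusting the threshold, a recall of 0.76 on the minority class, at the cost of a precision of 0.05), and the execution time was short. The option to run on the GPU helps the project workflow.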

## Overfitting check

In [9]:
ypred = clf.predict(xtrain)
ypred_proba = clf.predict_proba(xtrain)

# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytrain, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytrain,ypred_new_threshold,ypred_proba)

Best Threshold=0.015896, G-Mean=0.766
ROC-AUC score of the model: 0.8508918379506188
Accuracy of the model: 0.7829603908152749

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.78      0.88   2992566
           1       0.05      0.75      0.09     45391

    accuracy                           0.78   3037957
   macro avg       0.52      0.77      0.49   3037957
weighted avg       0.98      0.78      0.87   3037957


Confusion matrix: 
[[2344644  647922]
 [  11435   33956]]



CatBoost shows no sign of overfitting: the results are very similar on the train and test sets (ROC-AUC of 0.851 versus 0.847, and G-mean of 0.766 versus 0.763).
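
The comparison can be made explicit by putting the two sets of reported metrics side by side. The numbers below are copied from the outputs above. The dictionary layout is illustrative, and how large a gap should count as overfitting is a judgment call that this notebook does not fix.

```python
# Metrics copied from the evaluation outputs above.
train = {"roc_auc": 0.8508918379506188, "g_mean": 0.766, "recall_pos": 0.75}
test = {"roc_auc": 0.8474571754713867, "g_mean": 0.763, "recall_pos": 0.76}

# Positive gap: the model does better on the data it was trained on.
gaps = {name: train[name] - test[name] for name in train}

for name, gap in gaps.items():
    print(f"{name:<10} train={train[name]:.3f}  test={test[name]:.3f}  gap={gap:+.3f}")
```

All gaps are at most about 0.01 in absolute value, and recall on the positive class is slightly higher on the test set.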