# CUNEF MUCD 2021/2022  
## Machine Learning
## Automobile Accident Analysis

### Authors:
- Andrés Mahía Morado
- Antonio Tello Gómez


In [1]:
import pandas as pd
import numpy as np

from sklearn import metrics
from sklearn.metrics import (
    ConfusionMatrixDisplay,  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
    accuracy_score, auc, classification_report, confusion_matrix, f1_score,
    log_loss, make_scorer, precision_recall_curve, precision_score,
    recall_score, roc_auc_score, roc_curve, silhouette_score,
)

from catboost import CatBoostClassifier 
from sklearn.pipeline import Pipeline


import pickle
import warnings
warnings.filterwarnings('ignore')

from aux_func import evaluate_model, cargar_modelo

In [2]:
xtrain = pd.read_parquet("../data/xtrain.parquet")
ytrain = pd.read_parquet("../data/ytrain.parquet")['fatality']
xtest = pd.read_parquet("../data/xtest.parquet")
ytest = pd.read_parquet("../data/ytest.parquet")['fatality']

In [3]:
# Load the preprocessing pipeline
preprocessor = cargar_modelo('../models/preprocessor.pickle')

# CatBoost

CatBoost is a classifier that builds models from gradient-boosted decision trees, much like LightGBM, XGBoost and many others.
Internally, CatBoost builds decision trees one after another, each one reducing the error left by the previous ones as the process repeats. CatBoost is worth considering because it tends to produce accurate models even with small amounts of data, where other prediction algorithms can struggle. It also lets us train the model on the GPU (`task_type="GPU"`) and monitor the evolution of the loss function in real time by passing `plot=True` to `fit`.

![Highway](https://miro.medium.com/max/1400/1*AjrRnwvBuu-zK8CvEfM29w.png)
![Highway](https://miro.medium.com/max/512/1*jBxqPlcaq61Q7EFhSMLkkw.jpeg)
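To make the idea of "trees that keep reducing the error" concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps) on a toy regression problem. It is not CatBoost's actual algorithm, which uses symmetric trees, ordered boosting and a log-loss objective for classification; the functions `fit_stump` and `boost`, the toy data and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Return the split on x (and the two leaf values) that best fits the residual."""
    best = None
    for split in np.unique(x)[:-1]:  # drop the max so the right leaf is never empty
        left = x <= split
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        sse = np.sum((residual - np.where(left, left_val, right_val)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, split, left_val, right_val)
    return best[1:]

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Toy gradient boosting for squared error: each stump fits the current residual."""
    pred = np.full(len(y), y.mean(), dtype=float)
    losses = []
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of the squared error
        split, left_val, right_val = fit_stump(x, residual)
        pred += learning_rate * np.where(x <= split, left_val, right_val)
        losses.append(float(np.mean((y - pred) ** 2)))
    return pred, losses

# Hypothetical toy data, unrelated to the accident dataset
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

pred, losses = boost(x, y)
print(f"MSE after round 1: {losses[0]:.4f}, after round 50: {losses[-1]:.4f}")
```

The training loss never increases from one round to the next, which is the same pattern visible in the `learn:` column of the CatBoost training log.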

We train the model:

In [4]:
clf = Pipeline(steps=[
    ('preprocesador', preprocessor),
    ('clasificador', CatBoostClassifier(random_state=0, task_type="GPU"))])

In [7]:
clf.fit(xtrain, ytrain)

Learning rate set to 0.021609
0:	learn: 0.6465198	total: 87.9ms	remaining: 1m 27s
1:	learn: 0.6055305	total: 159ms	remaining: 1m 19s
2:	learn: 0.5655048	total: 233ms	remaining: 1m 17s
3:	learn: 0.5307370	total: 311ms	remaining: 1m 17s
4:	learn: 0.4971188	total: 382ms	remaining: 1m 15s
5:	learn: 0.4653611	total: 458ms	remaining: 1m 15s
6:	learn: 0.4355967	total: 531ms	remaining: 1m 15s
7:	learn: 0.4105136	total: 604ms	remaining: 1m 14s
8:	learn: 0.3863316	total: 683ms	remaining: 1m 15s
9:	learn: 0.3641747	total: 762ms	remaining: 1m 15s
10:	learn: 0.3433301	total: 840ms	remaining: 1m 15s
11:	learn: 0.3239731	total: 924ms	remaining: 1m 16s
12:	learn: 0.3051099	total: 999ms	remaining: 1m 15s
13:	learn: 0.2880246	total: 1.08s	remaining: 1m 15s
14:	learn: 0.2723246	total: 1.16s	remaining: 1m 15s
15:	learn: 0.2585236	total: 1.23s	remaining: 1m 15s
16:	learn: 0.2450944	total: 1.31s	remaining: 1m 15s
17:	learn: 0.2326002	total: 1.39s	remaining: 1m 15s
18:	learn: 0.2216228	total: 1.46s	remaining

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved',
                                                   'year']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                

In [8]:
with open('../models/CatBoost.pickle', 'wb') as f:
    pickle.dump(clf, f)

In [9]:
# To avoid retraining, skip the fit cell and run from here
with open('../models/CatBoost.pickle', 'rb') as f:
    clf = pickle.load(f)
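The save and load cells above can be wrapped in two small helpers. The implementation of `aux_func.cargar_modelo` is not shown in this notebook, so the sketch below is only a guess at its shape; the names `guardar_modelo` and `cargar_modelo_sketch` are hypothetical, and a plain dictionary stands in for the fitted pipeline.

```python
import os
import pickle
import tempfile

def guardar_modelo(model, path):
    """Hypothetical helper: serialize any picklable object to `path`."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def cargar_modelo_sketch(path):
    """Hypothetical stand-in for aux_func.cargar_modelo: load a pickled object."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in object instead of the real pipeline
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pickle")
    guardar_modelo({"threshold": 0.5}, path)
    restored = cargar_modelo_sketch(path)

print(restored)
```

Pickle files should only be loaded from trusted sources, and a pickled pipeline generally needs the same library versions at load time as at save time.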

We generate predictions on the held-out test data and evaluate the model.

In [10]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest,ypred,ypred_proba)

ROC-AUC score of the model: 0.8421572963538151
Accuracy of the model: 0.9848775665230715

Classification report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    799946
           1       0.53      0.01      0.01     12291

    accuracy                           0.98    812237
   macro avg       0.76      0.50      0.50    812237
weighted avg       0.98      0.98      0.98    812237


Confusion matrix: 
[[799877     69]
 [ 12214     77]]
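The figures in the report can be re-derived from the confusion matrix alone. The short check below uses only the four counts printed above; the comparison against a majority-class baseline is an addition for illustration, not something `evaluate_model` is known to compute.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 799877, 69
fn, tp = 12214, 77
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted fatalities, how many were real
recall = tp / (tp + fn)      # of the real fatalities, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f}  baseline={baseline_accuracy:.4f}")
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

With the default 0.5 threshold the model's accuracy is almost identical to that of always predicting class 0, which is why accuracy says little here and recall on class 1 is the number to watch.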



## Adjusting the prediction threshold

We now adjust the prediction threshold to obtain a higher recall on the minority class.
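The criterion used in the next cell is the geometric mean of sensitivity and specificity. For each candidate threshold $t$ returned by `roc_curve`:

$$
\text{G-mean}(t) = \sqrt{\mathrm{TPR}(t)\,\bigl(1 - \mathrm{FPR}(t)\bigr)},
\qquad
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
$$

The selected threshold is $t^{*} = \arg\max_t \text{G-mean}(t)$. Because the G-mean is high only when both classes are recalled well, it favors a balanced operating point rather than the one that maximizes accuracy.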

In [11]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.014383, G-Mean=0.759
ROC-AUC score of the model: 0.8421572963538151
Accuracy of the model: 0.7612556433651755

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.76      0.86    799946
           1       0.05      0.76      0.09     12291

    accuracy                           0.76    812237
   macro avg       0.52      0.76      0.48    812237
weighted avg       0.98      0.76      0.85    812237


Confusion matrix: 
[[609032 190914]
 [  3003   9288]]
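As a sanity check, the reported G-mean and class-1 metrics can be recomputed from the confusion matrix above. This is only arithmetic on the printed counts.

```python
import math

# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 609032, 190914
fn, tp = 3003, 9288

tpr = tp / (tp + fn)            # recall on class 1 (sensitivity)
tnr = tn / (tn + fp)            # recall on class 0 (specificity), equal to 1 - FPR
g_mean = math.sqrt(tpr * tnr)

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)
false_alarms_per_hit = fp / tp  # false positives raised for each true positive

print(f"TPR={tpr:.3f}  TNR={tnr:.3f}  G-mean={g_mean:.4f}")
print(f"precision={precision:.3f}  accuracy={accuracy:.4f}")
print(f"false positives per true positive: {false_alarms_per_hit:.1f}")
```

Lowering the threshold raises recall on class 1 from about 0.01 to about 0.76, but each correctly flagged case now comes with roughly twenty false alarms.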



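CatBoost gives good results, with a ROC-AUC of 0.84 on the test set and a short training time. At the adjusted threshold it recovers 76% of the class-1 cases, although precision on that class is only 0.05. Being able to run on the GPU also helps the project's workflow.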

## Overfitting check

In [12]:
ypred = clf.predict(xtrain)
ypred_proba = clf.predict_proba(xtrain)

# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytrain, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytrain,ypred_new_threshold,ypred_proba)

Best Threshold=0.014561, G-Mean=0.759
ROC-AUC score of the model: 0.8446832622555605
Accuracy of the model: 0.7636291165285962

Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.76      0.86   3200049
           1       0.05      0.75      0.09     48896

    accuracy                           0.76   3248945
   macro avg       0.52      0.76      0.48   3248945
weighted avg       0.98      0.76      0.85   3248945


Confusion matrix: 
[[2444101  755948]
 [  12008   36888]]



CatBoost shows no sign of overfitting: the results on the train and test sets are very similar (ROC-AUC of 0.845 versus 0.842, and a G-mean of 0.759 on both).
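The comparison can be made explicit by computing the train-minus-test gap for each metric printed above. The 0.01 tolerance used here is an arbitrary assumption for illustration, not a standard cut-off.

```python
# Values copied from the evaluation outputs printed in this notebook
train = {"roc_auc": 0.8446832622555605, "accuracy": 0.7636291165285962,
         "best_threshold": 0.014561}
test = {"roc_auc": 0.8421572963538151, "accuracy": 0.7612556433651755,
        "best_threshold": 0.014383}

TOLERANCE = 0.01  # assumed: gaps below this are treated as negligible

gaps = {name: train[name] - test[name] for name in train}
for name, gap in gaps.items():
    status = "ok" if abs(gap) < TOLERANCE else "check"
    print(f"{name:>15}: train-test gap = {gap:+.6f} ({status})")

overfitting_suspected = any(abs(gap) >= TOLERANCE for gap in gaps.values())
print("Overfitting suspected:", overfitting_suspected)
```

A model that memorized the training data would typically score noticeably higher on train than on test; here every gap is well under the assumed tolerance.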