# CUNEF MUCD 2021/2022  
## Machine Learning
## Automobile Accident Analysis

### Authors:
- Andrés Mahía Morado
- Antonio Tello Gómez


# Logistic Regression

Logistic regression is the standard regression model for classification tasks. It belongs to the family of Generalized Linear Models (GLM) and uses the logit function as its link function.  

$\operatorname{logit}\left(p_{i}\right)=\ln \left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1} x_{1, i}+\cdots+\beta_{k} x_{k, i}$

![logistic](https://miro.medium.com/max/1280/1*CYAn9ACXrWX3IneHSoMVOQ.gif)
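
As a small illustration of the link function above, the sketch below implements the logit and its inverse (the sigmoid) with NumPy. The function names and the coefficient values are assumptions chosen for illustration; they are not taken from the models fitted in this notebook.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps a linear predictor to a probability."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (beta_0, beta_1) and a few values of one feature
beta = np.array([-4.0, 0.5])
x = np.array([0.0, 4.0, 8.0, 12.0])

z = beta[0] + beta[1] * x   # linear predictor: beta_0 + beta_1 * x
p = sigmoid(z)              # predicted probability of the positive class

print(p)
print(np.allclose(logit(p), z))  # applying the logit recovers the linear predictor
```

The probability crosses 0.5 exactly where the linear predictor is zero, which is why 0.5 is the default decision threshold.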

# Regularization

Regularization penalizes model complexity to prevent overfitting: a penalty term is added to the cost function, which shrinks the coefficients.  

Ridge (L2) regularization adds the sum of the squared coefficients as the penalty term in the cost function (shown here for a least-squares cost):   

$\sum_{i=1}^{n}\left(y_{i}-\sum_{j=1}^{p} x_{i j} \beta_{j}\right)^{2}+\lambda \sum_{j=1}^{p} \beta_{j}^{2}$  

Lasso (L1) regularization adds the sum of the absolute values of the coefficients as the penalty term in the cost function: 

$\sum_{i=1}^{n}\left(y_{i}-\sum_{j=1}^{p} x_{i j} \beta_{j}\right)^{2}+\lambda \sum_{j=1}^{p}\left|\beta_{j}\right|$  

The parameter $\lambda\geq 0$ controls the strength of the penalty.
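
To make the two penalties concrete, here is a minimal NumPy sketch that evaluates both cost functions exactly as written above, that is, with a squared-error term. The data, the coefficient vector and the function names are made up for illustration.

```python
import numpy as np

def rss(X, y, beta):
    """Residual sum of squares: sum_i (y_i - sum_j x_ij * beta_j)^2."""
    residuals = y - X @ beta
    return float(np.sum(residuals ** 2))

def ridge_cost(X, y, beta, lam):
    """RSS plus the L2 penalty lam * sum_j beta_j^2."""
    return rss(X, y, beta) + lam * float(np.sum(beta ** 2))

def lasso_cost(X, y, beta, lam):
    """RSS plus the L1 penalty lam * sum_j |beta_j|."""
    return rss(X, y, beta) + lam * float(np.sum(np.abs(beta)))

# Toy data (assumed values, not the accident dataset)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])

for lam in (0.0, 0.5, 2.0):
    print(lam, ridge_cost(X, y, beta, lam), lasso_cost(X, y, beta, lam))
```

With `lam = 0` both costs reduce to the unpenalized sum of squares; as `lam` grows, large coefficients become increasingly expensive.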


In [1]:
# Libraries
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, ConfusionMatrixDisplay, f1_score,
                             log_loss, make_scorer, precision_recall_curve,
                             precision_score, recall_score, roc_auc_score,
                             roc_curve, silhouette_score)
from sklearn.linear_model import Lasso, LogisticRegression, LogisticRegressionCV
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Project helpers from the local aux_func module
from aux_func import evaluate_model, cargar_modelo

# Report the execution time of each cell (requires the ipython-autotime package)
%load_ext autotime

warnings.filterwarnings('ignore')

In [2]:
xtrain = pd.read_parquet("../data/xtrain.parquet")
ytrain = pd.read_parquet("../data/ytrain.parquet")['fatality']
xtest = pd.read_parquet("../data/xtest.parquet")
ytest = pd.read_parquet("../data/ytest.parquet")['fatality']

time: 2.3 s


In [3]:
# Load the preprocessing pipeline
preprocessor = cargar_modelo('../models/preprocessor.pickle')

time: 430 ms


# Logistic Regression (Ridge)

In [4]:
clf = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    ('clasificador', LogisticRegressionCV(cv=8, n_jobs=4, penalty='l2', random_state=0))])

time: 0 ns (started: 2021-12-18 20:11:55 +01:00)
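
In scikit-learn the penalty strength is set through `C`, the inverse of the regularization strength, so a smaller `C` plays the role of a larger $\lambda$. `LogisticRegressionCV` picks `C` from a grid by cross-validation. The sketch below shows this on a small synthetic dataset; the data, the grid size (`Cs=5`) and the number of folds (`cv=3`) are assumptions made to keep the example fast and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic, imbalanced toy data (an assumption; not the accident dataset)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# L2 penalty is the default; Cs=5 builds a grid of 5 values on a log scale
model = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000, random_state=0)
model.fit(X, y)

candidate_Cs = model.Cs_                 # grid of C values that was searched
best_C = float(np.ravel(model.C_)[0])    # value selected by cross-validation

print("Candidate C values:", candidate_Cs)
print("Selected C:", best_C)
```

Inspecting `C_` after fitting is a quick way to see how much regularization the cross-validation ended up preferring.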


In [5]:
clf.fit(xtrain, ytrain)

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved',
                                                   'year']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                

time: 11min 31s (started: 2021-12-18 20:11:56 +01:00)


In [6]:
with open('../models/LR.pickle', 'wb') as f:
    pickle.dump(clf, f)

time: 94 ms (started: 2021-12-18 20:23:28 +01:00)


In [7]:
# To avoid refitting, skip the fit cell and run from here
clf = cargar_modelo('../models/LR.pickle')
clf

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved',
                                                   'year']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                

time: 62 ms (started: 2021-12-18 20:23:28 +01:00)


In [9]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest,ypred, ypred_proba)

ROC-AUC score of the model: 0.6638737043132967
Accuracy of the model: 0.984867717181069

Classification report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    799946
           1       0.00      0.00      0.00     12291

    accuracy                           0.98    812237
   macro avg       0.49      0.50      0.50    812237
weighted avg       0.97      0.98      0.98    812237


Confusion matrix: 
[[799946      0]
 [ 12291      0]]

time: 7.19 s (started: 2021-12-18 20:29:58 +01:00)
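
The `evaluate_model` helper comes from the project's `aux_func` module, whose source is not shown here. Judging only from the printed output, it might look roughly like the sketch below; the function name `evaluate_model_sketch`, its return value and the use of column 1 as the positive-class score are all assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

def evaluate_model_sketch(y_true, y_pred, y_pred_proba):
    """Hypothetical stand-in for aux_func.evaluate_model (assumed behavior)."""
    # Assumes one probability column per class; column 1 is the positive class
    positive_scores = np.asarray(y_pred_proba)[:, 1]
    results = {
        "roc_auc": roc_auc_score(y_true, positive_scores),
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
    print("ROC-AUC score of the model:", results["roc_auc"])
    print("Accuracy of the model:", results["accuracy"])
    print("\nClassification report:")
    print(classification_report(y_true, y_pred, zero_division=0))
    print("\nConfusion matrix:")
    print(results["confusion_matrix"])
    return results

# Tiny made-up example: one positive among four observations
y_true = np.array([0, 0, 0, 1])
y_proba = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
y_pred = y_proba.argmax(axis=1)  # equivalent to a 0.5 threshold: all zeros here
results = evaluate_model_sketch(y_true, y_pred, y_proba)
```

Even in this toy case the classifier never predicts the positive class, so accuracy looks acceptable while recall for class 1 is zero, the same pattern as in the report above.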


## Prediction threshold adjustment

In [10]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.015444, G-Mean=0.620
ROC-AUC score of the model: 0.6638737043132967
Accuracy of the model: 0.6292387566683123

Classification report: 
              precision    recall  f1-score   support

           0       0.99      0.63      0.77    799946
           1       0.02      0.61      0.05     12291

    accuracy                           0.63    812237
   macro avg       0.51      0.62      0.41    812237
weighted avg       0.98      0.63      0.76    812237


Confusion matrix: 
[[503598 296348]
 [  4798   7493]]

time: 1.64 s (started: 2021-12-18 20:30:38 +01:00)
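
As a sanity check, the headline numbers can be recomputed by hand from the confusion matrix printed above. The variable names are illustrative; the four counts are copied from that output.

```python
import numpy as np

# Confusion matrix reported above for the Ridge model at the adjusted threshold
tn, fp = 503598, 296348
fn, tp = 4798, 7493

recall = tp / (tp + fn)         # true positive rate (sensitivity)
specificity = tn / (tn + fp)    # true negative rate, i.e. 1 - false positive rate
precision = tp / (tp + fp)      # share of predicted positives that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
g_mean = np.sqrt(recall * specificity)

print(f"recall={recall:.4f}  specificity={specificity:.4f}")
print(f"precision={precision:.4f}  accuracy={accuracy:.4f}  g_mean={g_mean:.4f}")
```

The G-mean balances recall and specificity, which is why the selected threshold gives up a large amount of accuracy and precision in exchange for detecting most of the positive class.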


# Logistic Regression (Lasso)

In [10]:
clf = Pipeline(steps=[
    ('preprocesador', preprocessor), 
    ('clasificador', LogisticRegression(C=1.5,random_state=0, n_jobs=2, penalty='l1', solver='liblinear', tol= 0.0005))])

time: 36.9 ms


In [11]:
clf.fit(xtrain, ytrain)

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                                 strategy='constant')),
                   

time: 2min 43s


In [12]:
with open('../models/LRlasso.pickle', 'wb') as f:
    pickle.dump(clf, f)

time: 271 ms


In [13]:
clf = cargar_modelo('../models/LRlasso.pickle')
clf

Pipeline(steps=[('preprocesador',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['vehicle_age',
                                                   'passenger_age',
                                                   'vehicles_involved']),
                                                 ('fcat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value=nan,
                                                                                 strategy='constant')),
                   

time: 289 ms


In [14]:
ypred = clf.predict(xtest)
ypred_proba = clf.predict_proba(xtest)
evaluate_model(ytest, ypred, ypred_proba)

ROC-AUC score of the model: 0.6762134387281621
Accuracy of the model: 0.984867717181069

Classification report: 
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    799946
           1       0.00      0.00      0.00     12291

    accuracy                           0.98    812237
   macro avg       0.49      0.50      0.50    812237
weighted avg       0.97      0.98      0.98    812237


Confusion matrix: 
[[799946      0]
 [ 12291      0]]

time: 24 s
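
The confusion matrix above shows that, at the default 0.5 threshold, every test row is assigned to class 0. The short calculation below, using only the counts from that output, shows that the reported accuracy is simply the share of the majority class. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above (default 0.5 threshold)
tn, fp = 799946, 0
fn, tp = 12291, 0

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
majority_share = (tn + fp) / total   # proportion of class 0 in the test set
recall = tp / (tp + fn)              # proportion of class 1 rows detected

print(f"total={total}")
print(f"accuracy={accuracy:.6f}  majority_share={majority_share:.6f}")
print(f"recall={recall:.2f}")
```

With a positive class this rare, accuracy alone says little about the model, which motivates looking at ROC-AUC and adjusting the decision threshold.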


## Prediction threshold adjustment

In [15]:
# keep probabilities for the positive outcome only
yhat = ypred_proba[:, 1]
# calculate roc curves
fpr, tpr, thresholds = roc_curve(ytest, yhat)

gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

ypred_new_threshold = (ypred_proba[:,1]>thresholds[ix]).astype(int)
evaluate_model(ytest,ypred_new_threshold,ypred_proba)

Best Threshold=0.014502, G-Mean=0.628
ROC-AUC score of the model: 0.6762134387281621
Accuracy of the model: 0.6299516027957357

Classification report: 
              precision    recall  f1-score   support

           0       0.99      0.63      0.77    799946
           1       0.03      0.63      0.05     12291

    accuracy                           0.63    812237
   macro avg       0.51      0.63      0.41    812237
weighted avg       0.98      0.63      0.76    812237


Confusion matrix: 
[[503965 295981]
 [  4586   7705]]

time: 8.74 s
