Compute the metrics for your model.

Interpret the metrics you obtain: is it a good model? Is there overfitting or underfitting?

In [1]:

# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd

# Plots
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns


# Modeling and metrics
# ------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, cohen_kappa_score,
                             roc_curve, roc_auc_score)


# Warning handling
# ------------------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

In [4]:
df_travel_balanceado = pd.read_csv("data/df_travel_balanceado.csv", index_col=0)
df_travel_balanceado.sample()

Unnamed: 0,Duration,Net Sales,Age,products,agency,country,Commision_oe,Agency Type_oe,Distribution Channel_Offline,Distribution Channel_Online,Claim
63509,-0.311019,0.943225,0.032428,1,2,0,1,1,0.0,1.0,1


In [5]:

X = df_travel_balanceado.drop("Claim", axis = 1)
y = df_travel_balanceado["Claim"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [6]:
# define the logistic regression

log_reg_esta = LogisticRegression(n_jobs=-1, max_iter = 1000)

# fit the model
log_reg_esta.fit(X_train,y_train)

# get the predictions for the training set
y_pred_train_esta = log_reg_esta.predict(X_train)

# get the predictions for the test set
y_pred_test_esta = log_reg_esta.predict(X_test)

In [8]:
train_df_esta = pd.DataFrame({'Real': y_train, 'Predicted': y_pred_train_esta, 'Set': ['Train']*len(y_train)})
test_df_esta  = pd.DataFrame({'Real': y_test,  'Predicted': y_pred_test_esta,  'Set': ['Test']*len(y_test)})
resultados = pd.concat([train_df_esta,test_df_esta], axis = 0)

In [10]:
def metricas(clases_reales_test, clases_predichas_test, clases_reales_train, clases_predichas_train, modelo):
    
    # test
    accuracy_test = accuracy_score(clases_reales_test, clases_predichas_test)
    precision_test = precision_score(clases_reales_test, clases_predichas_test)
    recall_test = recall_score(clases_reales_test, clases_predichas_test)
    f1_test = f1_score(clases_reales_test, clases_predichas_test)
    kappa_test = cohen_kappa_score(clases_reales_test, clases_predichas_test)

    # train
    accuracy_train = accuracy_score(clases_reales_train, clases_predichas_train)
    precision_train = precision_score(clases_reales_train, clases_predichas_train)
    recall_train = recall_score(clases_reales_train, clases_predichas_train)
    f1_train = f1_score(clases_reales_train, clases_predichas_train)
    kappa_train = cohen_kappa_score(clases_reales_train, clases_predichas_train)
    

    
    df = pd.DataFrame({"accuracy": [accuracy_test, accuracy_train], 
                       "precision": [precision_test, precision_train],
                       "recall": [recall_test, recall_train], 
                       "f1": [f1_test, f1_train],
                       "kapppa": [kappa_test, kappa_train],
                       "set": ["test", "train"]})
    
    df["modelo"] = modelo
    return df

In [18]:
reglogistic = metricas(y_test, y_pred_test_esta, y_train, y_pred_train_esta, "Regresión logistica")
reglogistic

Unnamed: 0,accuracy,precision,recall,f1,kapppa,set,modelo
0,0.730374,0.786565,0.631983,0.700851,0.460698,test,Regresión logistica
1,0.731097,0.784787,0.636922,0.703165,0.462207,train,Regresión logistica


CONCLUSION:

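Our kappa (about 0.46) indicates only moderate agreement beyond chance, so it is not good enough to call this model fully valid. Train and test metrics are almost identical (accuracy 0.731 vs. 0.730), so there is no sign of overfitting; the modest scores point, if anything, to underfitting.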
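To make the kappa figure easier to read, here is a minimal sketch of how Cohen's kappa is built from observed agreement and chance agreement. The labels are made up for illustration and are not the notebook's data. With roughly balanced classes, chance agreement is close to 0.5, which is why an accuracy near 0.73 lands at a kappa near 0.46.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels, only to illustrate the formula
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Observed agreement: the plain accuracy
p_o = accuracy_score(y_true_demo, y_pred_demo)

# Agreement expected by chance, from the marginal class frequencies
p_e = sum(np.mean(y_true_demo == c) * np.mean(y_pred_demo == c) for c in (0, 1))

# kappa = (p_o - p_e) / (1 - p_e)
kappa_manual = (p_o - p_e) / (1 - p_e)
kappa_sklearn = cohen_kappa_score(y_true_demo, y_pred_demo)

print(p_o, p_e, kappa_manual, kappa_sklearn)
```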

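Since we care about false positives (predicting that a customer will claim when they then do not), the metric to watch is precision rather than recall: precision penalizes false positives, while recall penalizes false negatives. From that perspective the model is not bad, with a test precision of about 0.79; recall is lower, at about 0.63.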
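The link between error types and metrics can be checked directly from a confusion matrix. The sketch below uses made-up labels (not the notebook's predictions) and shows that false positives enter precision, while false negatives enter recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, only to illustrate the definitions
y_true_cm = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_cm = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()

precision_manual = tp / (tp + fp)  # lowered by false positives
recall_manual = tp / (tp + fn)     # lowered by false negatives

print(tn, fp, fn, tp)
print(precision_manual, precision_score(y_true_cm, y_pred_cm))
print(recall_manual, recall_score(y_true_cm, y_pred_cm))
```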

Can the data be balanced only partially? That is, can one class be given more weight than the other?
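One possible answer, sketched here under assumptions, is the `class_weight` parameter of `LogisticRegression`, which accepts `"balanced"` or a dictionary of per-class weights, so an intermediate weighting is possible without resampling. The synthetic data, the variable names and the 1:3 weighting below are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 10% positives); purely illustrative
X_demo, y_demo = make_classification(n_samples=2000, n_features=6,
                                     weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

# No weighting, full re-weighting, and an intermediate 1:3 weighting
options = {"none": None, "balanced": "balanced", "custom 1:3": {0: 1, 1: 3}}

rows = []
for name, weights in options.items():
    clf = LogisticRegression(max_iter=1000, class_weight=weights)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    rows.append({"class_weight": name,
                 "precision": precision_score(y_te, pred, zero_division=0),
                 "recall": recall_score(y_te, pred, zero_division=0)})

for row in rows:
    print(row)
```

Heavier weights on the positive class typically trade precision for recall, so the weighting would need to be chosen according to which error is more costly.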

In [12]:
df_travel_estenc = pd.read_csv("data/df_travel_estenc.csv", index_col=0)
df_travel_estenc.sample()

Unnamed: 0,Claim,Duration,Net Sales,Age,products,agency,country,Commision_oe,Agency Type_oe,Distribution Channel_Offline,Distribution Channel_Online
60907,0,-0.271394,-0.553277,-0.371102,0,0,0,0,0,0.0,1.0


In [21]:

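# NOTE: this split is still built from df_travel_balanceado;
# df_travel_estenc (loaded in the previous cell) is not used here
A = df_travel_balanceado.drop("Claim", axis = 1)
B = df_travel_balanceado["Claim"]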

A_train, A_test, B_train, B_test = train_test_split(A, B, test_size = 0.2, random_state = 42)

In [22]:
# define the logistic regression

log_reg_esta1 = LogisticRegression(n_jobs=-1, max_iter = 1000)

# fit the model
log_reg_esta1.fit(A_train,B_train)

# get the predictions for the training set
B_pred_train_esta = log_reg_esta1.predict(A_train)

# get the predictions for the test set
B_pred_test_esta = log_reg_esta1.predict(A_test)

In [23]:
reglogistic_1 = metricas(B_test, B_pred_test_esta, B_train, B_pred_train_esta, "Regresión logistica 1")
reglogistic_1

Unnamed: 0,accuracy,precision,recall,f1,kapppa,set,modelo
0,0.730374,0.786565,0.631983,0.700851,0.460698,test,Regresión logistica 1
1,0.731097,0.784787,0.636922,0.703165,0.462207,train,Regresión logistica 1


CONCLUSION:

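Although the intent was to use the original (unbalanced) response variable, the results are identical to those of the first model. This is expected: the split above is built from df_travel_balanceado rather than df_travel_estenc, so the same model was fitted on the same data. The comparison needs to be rerun with df_travel_estenc before drawing any conclusion about the effect of balancing.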
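Before comparing models trained on different versions of the data, it can help to confirm that the versions really differ in class balance. A minimal sketch with small made-up frames follows; the names `balanced`, `original` and `class_share` are hypothetical, and the values are not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for a balanced and an unbalanced version of the data
balanced = pd.DataFrame({"Claim": [0, 1, 0, 1, 0, 1]})
original = pd.DataFrame({"Claim": [0, 0, 0, 0, 0, 1]})

def class_share(df, target="Claim"):
    """Return the proportion of each class in the target column."""
    return df[target].value_counts(normalize=True).sort_index()

# If the two frames had the same class balance this would be True
same_balance = class_share(balanced).equals(class_share(original))

print(class_share(balanced))
print(class_share(original))
print("same class balance:", same_balance)
```

The same check could be applied to the frame actually passed to `train_test_split`, to verify which version of the data a model was trained on.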