### Pair Programming VI: Random Forest

The goals of this pair programming session:
- Fit a Random Forest model to our data.
- Compute the metrics for the new model.
- Compare its metrics with those of the models built so far. Which one is best?

In [1]:
# Data handling
# ------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from tqdm import tqdm

# Plots
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation
# ------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score, roc_curve, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Warning configuration
# ------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_pickle('data/airline_estand_encod.pkl')
df.head(2)

Unnamed: 0,satisfaction,gender,customer_type,age,type_of_travel,class,flight_distance,seat_comfort,food_and_drink,gate_location,...,checkin_service,cleanliness,online_boarding,departure_delay_in_minutes,dep_conv_0,dep_conv_1,dep_conv_2,dep_conv_3,dep_conv_4,dep_conv_5
0,1,1,1,0.25,1,0,-0.122137,4,4,3,...,2,4,3,0.0,0,0,0,0,0,1
1,0,1,1,0.583333,1,0,-0.715013,0,0,2,...,0,0,0,0.0,0,0,1,0,0,0


In [4]:
# split the data into predictors (X) and target (y)

X1 = df.drop("satisfaction", axis = 1)
y1 = df["satisfaction"]

x_train, x_test, y_train, y_test = train_test_split(X1, y1, test_size = 0.2, random_state = 42)

In [6]:
param = {"max_depth": [2, 4, 6, 10, 12, 14],
        "max_features": [1, 2, 3, 4, 5],
        "min_samples_split": [50, 100, 200, 350],
        "min_samples_leaf": [50, 100, 200]}

gs = GridSearchCV(
            estimator=RandomForestClassifier(random_state= 42),
            param_grid= param,
            cv=10,
            verbose=0)

gs.fit(x_train, y_train)
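
Before launching a search like this one, it can help to estimate its cost. The sketch below counts the candidates in the grid above and the number of model fits that `cv=10` implies. It uses only the standard library, and the names `cv_folds`, `n_candidates` and `n_fits` are illustrative, not part of the notebook.

```python
from math import prod

# Same grid as in the cell above
param = {"max_depth": [2, 4, 6, 10, 12, 14],
         "max_features": [1, 2, 3, 4, 5],
         "min_samples_split": [50, 100, 200, 350],
         "min_samples_leaf": [50, 100, 200]}

cv_folds = 10

# One candidate per combination of values
n_candidates = prod(len(values) for values in param.values())

# Every candidate is fitted once per fold; GridSearchCV then refits the best
# candidate on the full training set (refit=True is the default)
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)  # 360 3601
```

With the default of 100 trees per forest this is a long run; `GridSearchCV` accepts `n_jobs=-1` to spread the fits across all available CPU cores.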

In [8]:
bosque = gs.best_estimator_
bosque
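
Besides `best_estimator_`, a fitted `GridSearchCV` also exposes the winning hyperparameters and their mean cross-validated score. The following is a minimal sketch on synthetic data (the airline dataset is not available here), with a deliberately tiny grid; `X_demo`, `y_demo`, `small_grid` and `gs_demo` are hypothetical names used only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption: binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Tiny grid so the example runs in seconds
small_grid = {"max_depth": [2, 4], "max_features": [2, 3]}

gs_demo = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid=small_grid,
    cv=3)
gs_demo.fit(X_demo, y_demo)

best_params = gs_demo.best_params_        # dict with the selected values
best_cv_accuracy = gs_demo.best_score_    # mean CV accuracy of that candidate

print(best_params, round(best_cv_accuracy, 3))
```

In the notebook, the same attributes would be read from `gs` once `gs.fit(x_train, y_train)` has finished.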

In [10]:
y_pred_test = bosque.predict(x_test)
y_pred_train = bosque.predict(x_train)
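
`roc_curve` and `roc_auc_score` are imported above but not used in the cells shown. As a possible complement to the class predictions, the sketch below computes a ROC AUC from predicted probabilities. It runs on synthetic data, and all names in it (`forest_demo`, `proba_test`, `auc_test`, and so on) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: binary target)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

forest_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(Xtr, ytr)

# Probability of the positive class (column 1), not the hard 0/1 prediction
proba_test = forest_demo.predict_proba(Xte)[:, 1]

auc_test = roc_auc_score(yte, proba_test)
fpr, tpr, thresholds = roc_curve(yte, proba_test)

print(round(auc_test, 3))
```

For the notebook's model, the equivalent call would be `bosque.predict_proba(x_test)[:, 1]`.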

In [20]:
def metricas(clases_reales_test, clases_predichas_test, clases_reales_train, clases_predichas_train, modelo):
    """Return a two-row DataFrame (test and train) with the classification metrics of a model."""

    # test set
    accuracy_test = accuracy_score(clases_reales_test, clases_predichas_test)
    precision_test = precision_score(clases_reales_test, clases_predichas_test)
    recall_test = recall_score(clases_reales_test, clases_predichas_test)
    f1_test = f1_score(clases_reales_test, clases_predichas_test)
    kappa_test = cohen_kappa_score(clases_reales_test, clases_predichas_test)

    # train set
    accuracy_train = accuracy_score(clases_reales_train, clases_predichas_train)
    precision_train = precision_score(clases_reales_train, clases_predichas_train)
    recall_train = recall_score(clases_reales_train, clases_predichas_train)
    f1_train = f1_score(clases_reales_train, clases_predichas_train)
    kappa_train = cohen_kappa_score(clases_reales_train, clases_predichas_train)
    

    
    df = pd.DataFrame({"accuracy": [accuracy_test, accuracy_train], 
                       "precision": [precision_test, precision_train],
                       "recall": [recall_test, recall_train], 
                       "f1": [f1_test, f1_train],
                       "kappa": [kappa_test, kappa_train],
                       "set": ["test", "train"]})
    
    df["modelo"] = modelo
    return df
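
Of the five metrics, Cohen's kappa is the least self-explanatory: it measures agreement between real and predicted classes, corrected for the agreement expected by chance. As a worked sketch on made-up labels (not the airline data), the manual formula can be checked against `cohen_kappa_score`:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels, invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
n = len(y_true)

# Observed agreement: fraction of matching labels (this is the accuracy)
p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Chance agreement: for each class, P(real = c) * P(predicted = c)
p_expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in (0, 1))

kappa_manual = (p_observed - p_expected) / (1 - p_expected)
kappa_sklearn = cohen_kappa_score(y_true, y_pred)

print(p_observed, p_expected, kappa_manual)  # 0.75 0.5 0.5
```

Here an accuracy of 0.75 corresponds to a kappa of only 0.5, because half of the agreement would be expected by chance alone.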

In [21]:
rf_results = metricas(y_test, y_pred_test, y_train, y_pred_train, "Random Forest Esta")
rf_results

Unnamed: 0,accuracy,precision,recall,f1,kappa,set,modelo
0,0.9182,0.90991,0.943579,0.926439,0.83439,test,Random Forest Esta
1,0.9209,0.912988,0.94546,0.92894,0.83981,train,Random Forest Esta


In [43]:
# Load the metrics we already had:

resultados = pd.read_csv('data/metricas.csv')

In [41]:
resultados = pd.concat([resultados, rf_results], axis=0, ignore_index=True)
resultados

Unnamed: 0,accuracy,precision,recall,f1,kappa,set,modelo
0,0.874159,0.871902,0.877096,0.874491,0.748319,test,Regresión logistica Bal
1,0.875948,0.872365,0.880783,0.876554,0.751895,train,Regresión logistica Bal
2,0.8714,0.88054,0.884411,0.882471,0.740499,test,Regresión logistica Esta
3,0.878975,0.883505,0.896955,0.890179,0.755421,train,Regresión logistica Esta
4,0.8701,0.873966,0.890456,0.882134,0.737491,test,Regresión logistica
5,0.875725,0.875239,0.901207,0.888033,0.748477,train,Regresión logistica
6,0.8949,0.905744,0.901264,0.903498,0.788119,test,Decision tree Esta II
7,0.904475,0.914193,0.910807,0.912497,0.807332,train,Decision tree Esta II
8,0.900187,0.901661,0.898279,0.899967,0.800375,test,Decision tree Bal
9,0.905038,0.907517,0.902013,0.904757,0.810075,train,Decision tree Bal


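Once again, although all our models give good results, the Random Forest clearly predicts customer satisfaction best: its test accuracy is almost 92% (0.918) with a kappa of 0.834, ahead of the best decision tree (0.900 accuracy, 0.800 kappa). Its train and test metrics are also nearly identical (0.921 vs. 0.918 accuracy), which suggests little overfitting.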
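
The comparison can also be made explicit by ranking the models on their test-set metrics. The sketch below rebuilds a small table from the test rows shown above (accuracy and kappa only); `test_metrics`, `ranking` and `best_model` are illustrative names.

```python
import pandas as pd

# Test-set rows copied from the results tables above
test_metrics = pd.DataFrame({
    "modelo": ["Regresión logistica Bal", "Regresión logistica Esta",
               "Regresión logistica", "Decision tree Esta II",
               "Decision tree Bal", "Random Forest Esta"],
    "accuracy": [0.874159, 0.8714, 0.8701, 0.8949, 0.900187, 0.9182],
    "kappa": [0.748319, 0.740499, 0.737491, 0.788119, 0.800375, 0.83439],
})

# Best model first
ranking = test_metrics.sort_values("accuracy", ascending=False).reset_index(drop=True)
best_model = ranking.loc[0, "modelo"]

print(ranking)
print(best_model)  # Random Forest Esta
```

On the full `resultados` DataFrame, the same ranking could be obtained by first filtering with `resultados["set"] == "test"`.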

In [62]:
resultados.to_csv('data/metricas.csv', index=False)