# 4. Modelos de ML

### Clasificación binaria

Datos sin correlación lineal con el target.

| **Modelos ML Seleccionados** |
| :---|
| Random Forest |
| Support Vector Machine |
| Bagging Random Forest |
| Gradient Boosting Classifier |
| XGBoost |


----

| **Métricas Seleccionadas**|
| :---: |


| **Accuracy**|
| :---|
|Accuracy = (TP + TN) / Total|

- Rango de 0 a 1.

- Calcula el % de acierto teniendo en cuenta todas las clases del algoritmo de clasificación.

| **Precisión** |
| :---|
| TP / (TP + FP) |

- El rango de precisión va de 0 a 1.

- De los que ha predicho como 1, cuántos ha acertado. Minimiza los falsos positivos (FP).

- Nos interesa predecir bien los 1 (Cancela), ya que penaliza más al hotel decir que no cancela (0) y que al final cancela: la habitación se queda libre. En el caso de que predigamos que cancele, pero al final no cancele, el hotel no se queda con una habitación libre.

| **F1-Score** |
| :---|
| 2 * Precision * Recall / (Precision + Recall) |

- El rango de F1-score va de 0 a 1.

- Combinación de las métricas Precision y Recall. 

- Se utiliza para comparar clasificadores. 

| **AUC** |
| :---|

- Área bajo la 'Roc Curve'

- Cuanto más cercano a 1, mejor será el clasificador. Si la curva es una línea recta estaremos ante un clasificador muy malo, que no sería muy diferente a un clasificador aleatorio.

---

In [1]:
# FUNCIÓN PARA IMPRIMIR POR PANTALLA LAS MÉTRICAS EN TEST Y TRAIN

def metricas_print(prediccion, y):

    print("Accuracy", accuracy_score(y, prediccion))
    print("Precision", precision_score(y, prediccion))
    print("F1", f1_score(y, prediccion))
    print("AUC", roc_auc_score(y, prediccion))

---

| **Librerías**|
| :---: |

In [2]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import pickle
pd.options.display.max_columns = 36
np.random.seed(42) 

from sklearn.feature_selection import SelectFromModel 

from sklearn.svm import SVC 
from sklearn.ensemble import BaggingClassifier          
from sklearn.ensemble import RandomForestClassifier  
from sklearn.ensemble import GradientBoostingClassifier 
from xgboost import XGBClassifier

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection   
from sklearn.metrics import accuracy_score, f1_score, precision_score, roc_auc_score

----

| **Datos**|
| :---: |

In [3]:
X_train = pd.read_csv('../src:data/src:data:processed/X_train.csv', index_col=0 )
print(X_train.shape)
X_train.head()

(59404, 33)


Unnamed: 0,lead_time,arrival_date_month,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,country,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,assigned_room_type,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,meal_FB,meal_HB,meal_SC,customer_Group,customer_Transient,customer_Transient-Party,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA/TO
32205,464,7,25,0,1,2,0,0,0.990232,0,0,0,0.221831,0,0,90.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
18699,34,10,15,2,3,2,0,0,-1.065651,0,0,0,-0.597884,0,0,166.8,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
53232,130,7,17,2,1,2,0,0,-0.416522,0,0,0,0.221831,0,0,105.3,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
77553,242,8,5,1,1,2,2,0,-0.565356,0,0,0,-0.218601,0,0,202.5,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
28820,30,5,22,1,1,1,0,0,0.990232,0,0,0,0.221831,0,0,144.0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0


In [4]:
y_train = pd.read_csv('../src:data/src:data:processed/y_train.csv', index_col=0 )
print(y_train.shape)
y_train.head()

(59404, 1)


Unnamed: 0,is_canceled
32205,1
18699,1
53232,0
77553,0
28820,1


In [5]:
# Pasamos y_test a series:
#       Me salía este error cuando era df
#       /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:680: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
#       no uso ravel porque no me funcionaba

y_train = y_train.squeeze()
y_train.head()

32205    1
18699    1
53232    0
77553    0
28820    1
Name: is_canceled, dtype: int64

In [6]:
X_test = pd.read_csv('../src:data/src:data:processed/X_test.csv', index_col=0 )
print(X_test.shape)
X_test.head()

(14852, 33)


Unnamed: 0,lead_time,arrival_date_month,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,country,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,assigned_room_type,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,meal_FB,meal_HB,meal_SC,customer_Group,customer_Transient,customer_Transient-Party,market_segment_Complementary,market_segment_Corporate,market_segment_Direct,market_segment_Groups,market_segment_Offline TA/TO,market_segment_Online TA,distribution_channel_Direct,distribution_channel_GDS,distribution_channel_TA/TO
36434,414,12,5,2,1,2,0,0,0.990232,0,1,0,0.221831,0,0,62.0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
58903,0,10,6,0,1,2,2,0,-0.882612,0,0,0,-0.218601,0,0,216.0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
49997,106,5,29,2,1,2,0,0,-1.239021,0,0,0,0.221831,0,0,80.75,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1
51498,35,6,21,0,2,1,0,0,0.990232,0,0,0,-0.77822,1,0,107.1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
37112,344,9,26,0,1,1,0,0,0.990232,0,1,0,0.221831,1,0,170.0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1


In [7]:
y_test = pd.read_csv('../src:data/src:data:processed/y_test.csv', index_col=0 )
print(y_test.shape)
y_test.head()

(14852, 1)


Unnamed: 0,is_canceled
36434,1
58903,0
49997,0
51498,0
37112,1


In [8]:
# Pasamos df a series
y_test = y_test.squeeze()
y_test.head()

36434    1
58903    0
49997    0
51498    0
37112    1
Name: is_canceled, dtype: int64

----

| **Random Forest**|
| :---: |

In [16]:
rand_forest = RandomForestClassifier() 

rand_forest_params_1 =  {
                "n_estimators": [200, 300, 400],
                "max_depth": [5, 10, 15, 20, 25, 30],
                "max_features": [1, 2, 3, 4, 5]
}

# GridSearch
grid_rand_forest_1 = GridSearchCV(rand_forest, rand_forest_params_1,
                                cv = 10, # cross validation
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_rand_forest_1.fit(X_train, y_train)       

print(grid_rand_forest_1.best_estimator_) # RandomForestClassifier(max_depth=30, max_features=5, n_estimators=400)
print(grid_rand_forest_1.best_score_) # 0.8777689309636795



RandomForestClassifier(max_depth=30, max_features=5, n_estimators=400)
0.8777689309636795


In [17]:
rand_forest_estimador1 = grid_rand_forest_1.best_estimator_

# MÉTRICAS

y_train_preds_rand_forest_1 = rand_forest_estimador1.predict(X_train)
y_test_preds_rand_forest_1 = rand_forest_estimador1.predict(X_test)

# Vemos diferencias entre train y test --> pequeño overfitting

print("Train:")
metricas_print(y_train_preds_rand_forest_1, y_train)
print()
print("Test:")
metricas_print(y_test_preds_rand_forest_1, y_test)

# Vemos un poco de overfitting: precision --> Train: 0.99 / Test:0.89

Train:
Accuracy 0.9925089219581174
Precision 0.9955801544955005
F1 0.9911795603655031
AUC 0.9917797061472738

Test:
Accuracy 0.8833153784002155
Precision 0.8937315506164265
F1 0.8559075413652616
AUC 0.8749298200121208


In [47]:
# Como vemos un poco de overfitting, probamos con 400 árboles pero reduciendo un poco la profundida y features:
'''
Después de probar con diferentes combinaciones de profundidad y atributos, 
elegimos esta (aunque sigue habiendo un poquito de overfitting):
'''

rand_forest_params_2 =  {
                "n_estimators": [400],
                "max_depth": [20],
                "max_features": [4]}

grid_rand_forest_2 = GridSearchCV(RandomForestClassifier(),
                                rand_forest_params_2,
                                cv = 10, 
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_rand_forest_2.fit(X_train, y_train)       

print(grid_rand_forest_2.best_estimator_)
print(grid_rand_forest_2.score(X_train, y_train))


A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.



RandomForestClassifier(max_depth=20, max_features=4, n_estimators=400)
0.9198538818934752


In [48]:
rand_forest_estimador2 = grid_rand_forest_2.best_estimator_

# MÉTRICAS

y_train_preds_rand_forest_2 = rand_forest_estimador2.predict(X_train)
y_test_preds_rand_forest_2 = rand_forest_estimador2.predict(X_test)

# Vemos diferencias entre train y test --> pequeño overfitting

print("Train:")
metricas_print(y_train_preds_rand_forest_2, y_train)
print()
print("Test:")
metricas_print(y_test_preds_rand_forest_2, y_test)

Train:
Accuracy 0.9198538818934752
Precision 0.930929044148446
F1 0.9032533376684075
AUC 0.9143855287529805

Test:
Accuracy 0.872811742526259
Precision 0.8895214374666429
F1 0.8411136344520145
AUC 0.8626793604224121


In [49]:
# FEATURE IMPORTANCE

feat_importance_rand_f = grid_rand_forest_2.best_estimator_.feature_importances_

feat_importance_rand_f_df = pd.DataFrame(feat_importance_rand_f,
                             index = X_train.columns,
                            columns = ["Importances"]).sort_values('Importances', ascending=False)
feat_importance_rand_f_df.head()

Unnamed: 0,Importances
country,0.183273
lead_time,0.135662
total_of_special_requests,0.107819
adr,0.077459
previous_cancellations,0.047581


In [50]:
fig = go.Figure()
fig.add_trace(go.Bar(name='Feature Importance Random Forest', y=feat_importance_rand_f_df['Importances'], 
                    x=feat_importance_rand_f_df.index, marker_color = '#4620ff'))

fig.update_layout(barmode='group', uniformtext_minsize=12, uniformtext_mode='hide', font_family = 'Arial', title_font_family='Arial', 
                        plot_bgcolor='#efefef', paper_bgcolor='#efefef',showlegend=False)

fig.update_yaxes(visible = True)

fig.show()

----

| **Feature Selection**|
| :---: |



In [13]:
model = SelectFromModel(grid_rand_forest_2.best_estimator_, prefit=True)

# X_train
X_train_rfe = model.transform(X_train) 
X_train_rfe = pd.DataFrame(X_train_rfe)
feature_names = model.get_support()
X_train_rfe.columns = X_train.columns[feature_names]

#_test
X_test_rfe = model.transform(X_test) 
X_test_rfe = pd.DataFrame(X_test_rfe)
X_test_rfe.columns = X_train_rfe.columns

print("Features seleccionadas:", X_train_rfe.columns)

Features seleccionadas: Index(['lead_time', 'arrival_date_month', 'arrival_date_day_of_month',
       'country', 'previous_cancellations', 'assigned_room_type',
       'booking_changes', 'adr', 'total_of_special_requests',
       'customer_Transient', 'customer_Transient-Party',
       'market_segment_Groups'],
      dtype='object')



X has feature names, but SelectFromModel was fitted without feature names


X has feature names, but SelectFromModel was fitted without feature names



In [14]:
X_train_rfe.head()

Unnamed: 0,lead_time,arrival_date_month,arrival_date_day_of_month,country,previous_cancellations,assigned_room_type,booking_changes,adr,total_of_special_requests,customer_Transient,customer_Transient-Party,market_segment_Groups
0,464.0,7.0,25.0,0.990232,0.0,0.221831,0.0,90.0,0.0,1.0,0.0,0.0
1,34.0,10.0,15.0,-1.065651,0.0,-0.597884,0.0,166.8,2.0,1.0,0.0,0.0
2,130.0,7.0,17.0,-0.416522,0.0,0.221831,0.0,105.3,0.0,1.0,0.0,0.0
3,242.0,8.0,5.0,-0.565356,0.0,-0.218601,0.0,202.5,0.0,1.0,0.0,0.0
4,30.0,5.0,22.0,0.990232,0.0,0.221831,0.0,144.0,0.0,0.0,1.0,0.0


In [15]:
X_test_rfe.head()

Unnamed: 0,lead_time,arrival_date_month,arrival_date_day_of_month,country,previous_cancellations,assigned_room_type,booking_changes,adr,total_of_special_requests,customer_Transient,customer_Transient-Party,market_segment_Groups
0,414.0,12.0,5.0,0.990232,1.0,0.221831,0.0,62.0,0.0,1.0,0.0,1.0
1,0.0,10.0,6.0,-0.882612,0.0,-0.218601,0.0,216.0,0.0,1.0,0.0,0.0
2,106.0,5.0,29.0,-1.239021,0.0,0.221831,0.0,80.75,1.0,0.0,1.0,0.0
3,35.0,6.0,21.0,0.990232,0.0,-0.77822,1.0,107.1,1.0,1.0,0.0,0.0
4,344.0,9.0,26.0,0.990232,1.0,0.221831,1.0,170.0,0.0,1.0,0.0,1.0


---

| **SVM**|
| :---: |

Modelo sin aplicar Feature Selection:

In [24]:
# Intenté poner más opciones de parámetros (incluyendo un kernel polinómico), pero después de dejarlo toda la noche, seguía sin darme un resultado.
# Best params --> {'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'}

svm = Pipeline([("scaler", StandardScaler()),
                ("svm", SVC())])  

svm_params = { "svm__C": [0.2, 0.4, 0.6, 0.8, 10],# parámetro de regularización
            "svm__kernel": ["rbf"],          # kernel
            "svm__gamma": ["auto"]}                 

grid_svm = GridSearchCV(svm, svm_params,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_svm.fit(X_train, y_train)
print(grid_svm.best_params_)
print(grid_svm.best_score_)

{'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'}
0.8448082803006217


In [17]:
# El mejor estimador de arriba (para no volver a ejecutar la celda de arriba)

svm = Pipeline([("scaler", StandardScaler()),
                ("svm", SVC())])  

svm_params = { "svm__C": [10],
            "svm__kernel": ["rbf"],       
            "svm__gamma": ["auto"]}                 

grid_svm = GridSearchCV(svm, svm_params,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_svm.fit(X_train, y_train)
print(grid_svm.best_params_)
print(grid_svm.best_score_)

{'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'}
0.8448082803006217


In [18]:
# Hay un poco de overfitting --> probamos con el dataset de feature selection (quitamos ruifo)

svm_train_preds = grid_svm.predict(X_train)
svm_test_preds = grid_svm.predict(X_test)

print("Train:")
metricas_print(svm_train_preds, y_train)
print()
print("Test:")
metricas_print(svm_test_preds, y_test)

Train:
Accuracy 0.8610362938522659
Precision 0.8695163104611924
F1 0.8296216796350951
AUC 0.8523479363388324

Test:
Accuracy 0.8477646108268246
Precision 0.8491026311204043
F1 0.8116931789789289
AUC 0.8382778009398171


Modelo aplicando Feature Selection:

In [26]:
# Best params --> {'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'} / Mismos que sin feature selection

svm_rfe = Pipeline([("scaler", StandardScaler()),
                ("svm", SVC())])  

svm_params_rfe = { "svm__C": [0.2, 0.4, 0.6, 0.8, 10],# parámetro de regularización
            "svm__kernel": ["rbf"],          # kernel
            "svm__gamma": ["auto"]}                 

grid_svm_rfe = GridSearchCV(svm_rfe, svm_params_rfe,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_svm_rfe.fit(X_train_rfe, y_train)
print(grid_svm_rfe.best_params_)
print(grid_svm_rfe.best_score_) # 0.02 puntos menos que sin feature selection

{'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'}
0.8448082803006217


In [20]:
# Para no volver a ejecutar lo anterior, ejecutamos el mejor estimador:

svm_rfe = Pipeline([("scaler", StandardScaler()),
                ("svm", SVC())])  

svm_params_rfe = { "svm__C": [10],
            "svm__kernel": ["rbf"],          # kernel
            "svm__gamma": ["auto"]}                 

grid_svm_rfe = GridSearchCV(svm_rfe, svm_params_rfe,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

grid_svm_rfe.fit(X_train_rfe, y_train)
print(grid_svm_rfe.best_params_)
print(grid_svm_rfe.best_score_)

{'svm__C': 10, 'svm__gamma': 'auto', 'svm__kernel': 'rbf'}
0.821223883904409


In [21]:
# Empeoran un poco los scorings con feature selection, pero mejora el overfitting (scorings más similares entre train y test)

svm_train_preds_rfe = grid_svm_rfe.predict(X_train_rfe)
svm_test_preds_rfe = grid_svm_rfe.predict(X_test_rfe)

print("Train:")
metricas_print(svm_train_preds_rfe, y_train)
print()
print("Test:")
metricas_print(svm_test_preds_rfe, y_test)

Train:
Accuracy 0.8286310686149081
Precision 0.8526031731261341
F1 0.782617979927397
AUC 0.8151286814540766

Test:
Accuracy 0.8279692970643684
Precision 0.8462973325872039
F1 0.7802906526786483
AUC 0.8139214049557777


-----

| **Ensembles** |
| :---:|

Funcionan muy bien si hay overfitting

| **Bagging Random Forest** |
|:---:|

Sin Feature Selection:

In [113]:
# No me di cuenta y sobreescribí esto, no lo vuelvo a ejecutar porque nos da lo mismo el código sobreescrito

bagging_rf = BaggingClassifier()

bagging_rf_params =  { "n_estimators": [100, 200, 300, 400], # n árboles en paralelo
                       }

grid_bagging_rf = GridSearchCV(bagging_rf, bagging_rf_params,   # Por defecto, base_estimator: DecisionTreeClassifie
                                cv = 10, # cross validation
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

# Entrenamos
grid_bagging_rf.fit(X_train, y_train)    

print(grid_bagging_rf.best_estimator_)
print(grid_bagging_rf.best_params_) 
print(grid_bagging_rf.best_score_)



BaggingClassifier(n_estimators=300)
{'n_estimators': 300}
0.882398328229838


In [115]:
bagging_train_preds = grid_bagging_rf.best_estimator_.predict(X_train)
bagging_test_preds = grid_bagging_rf.best_estimator_.predict(X_test)

print("Train:")
metricas_print(bagging_train_preds, y_train)
print()
print("Test:")
metricas_print(bagging_test_preds, y_test)

# Diferencia entre train y test (OVERFITTING) --> probamos a añadir parámetro 'max_samples'

Train:
Accuracy 0.9950340044441451
Precision 0.9958811881188119
F1 0.994168462253148
AUC 0.9947044048475068

Test:
Accuracy 0.889307837328306
Precision 0.8822751322751323
F1 0.8665151023059435
AUC 0.8841816091557526


In [23]:
# ELEGIMOS EL ESTIMADOR DE ESTE GRIDSEARCH PARA BAGGING --> mejor resultados entre train y test, aunque el scoring sea más bajo
'''
BaggingClassifier(max_samples=125, n_estimators=300)
{'max_samples': 125, 'n_estimators': 300}
0.8021176161548154
'''

bagging_rf = BaggingClassifier()

bagging_rf_params_2 =  { "n_estimators": [100, 150, 200, 250, 300],
                        "max_samples": [25, 50, 75, 100, 125] # probamos añadiendo un máximo de muestras --> bajan las métricas, pero más similares entre train y test
}

grid_bagging_rf_2 = GridSearchCV(bagging_rf, bagging_rf_params_2,   # Por defecto, base_estimator: DecisionTreeClassifie
                                cv = 10, # cross validation
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

# Entrenamos
grid_bagging_rf_2.fit(X_train, y_train)   

print(grid_bagging_rf_2.best_estimator_)
print(grid_bagging_rf_2.best_params_) 
print(grid_bagging_rf_2.best_score_)

BaggingClassifier(max_samples=125, n_estimators=300)
{'max_samples': 125, 'n_estimators': 300}
0.8021176161548154


In [24]:
bagging_train_preds_2 = grid_bagging_rf_2.best_estimator_.predict(X_train)
bagging_test_preds_2 = grid_bagging_rf_2.best_estimator_.predict(X_test)

print("Train:")
metricas_print(bagging_train_preds_2, y_train)
print()
print("Test:")
metricas_print(bagging_test_preds_2, y_test)

# No hay diferencias casi entre train y test 

Train:
Accuracy 0.8045754494646825
Precision 0.8374299478910628
F1 0.7458569583397185
AUC 0.7876317422261601

Test:
Accuracy 0.8074333423107999
Precision 0.8358297201418999
F1 0.7478398871451244
AUC 0.7897851763807466


In [26]:
bagging_train_preds_2 = grid_bagging_rf_2.best_estimator_.predict(X_train)
bagging_test_preds_2 = grid_bagging_rf_2.best_estimator_.predict(X_test)

print("Train:")
metricas_print(bagging_train_preds_2, y_train)
print()
print("Test:")
metricas_print(bagging_test_preds_2, y_test)

# No hay diferencias casi entre train y test 

Train:
Accuracy 0.8045754494646825
Precision 0.8374299478910628
F1 0.7458569583397185
AUC 0.7876317422261601

Test:
Accuracy 0.8074333423107999
Precision 0.8358297201418999
F1 0.7478398871451244
AUC 0.7897851763807466


Aplicamos feature selection para ver si mejora:

In [25]:
bagging_rf_params_3 =  { "n_estimators": [100, 150, 200, 250, 300],
                        "max_samples": [25, 50, 75, 100, 125] 
}

grid_bagging_rf_3 = GridSearchCV(bagging_rf, bagging_rf_params_3,   
                                cv = 10, 
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

# Entrenamos
grid_bagging_rf_3.fit(X_train_rfe, y_train)   

print(grid_bagging_rf_3.best_estimator_) # mismo estimador
print(grid_bagging_rf_3.best_params_) 
print(grid_bagging_rf_3.best_score_) # baja el score

BaggingClassifier(max_samples=125, n_estimators=300)
{'max_samples': 125, 'n_estimators': 300}
0.7926064805605286


In [28]:
# Scoring casi igual que sin feature selection --> aunque este un poco más bajo, seleccionamos el anterior (sin feature selection)

bagging_train_preds_3 = grid_bagging_rf_3.best_estimator_.predict(X_train_rfe)
bagging_test_preds_3 = grid_bagging_rf_3.best_estimator_.predict(X_test_rfe)

print("Train:")
metricas_print(bagging_train_preds_3, y_train)
print()
print("Test:")
metricas_print(bagging_test_preds_3, y_test) 

Train:
Accuracy 0.7951653087334186
Precision 0.8319971764231332
F1 0.7306176665928713
AUC 0.7767268697686306

Test:
Accuracy 0.8021815243738217
Precision 0.8382771231206827
F1 0.7374441465594281
AUC 0.782766768705197


| **Gradient Boosting Classifier** |
|:---:|

In [43]:
# MEJOR ESTIMADOR:
'''
GradientBoostingClassifier(n_estimators=400)
{'loss': 'deviance', 'n_estimators': 400}
'''

gbc = GradientBoostingClassifier()

gbc_params = {'loss': ['deviance', 'exponential'],
                'n_estimators': [200, 300, 400]}

gbc = GridSearchCV(gbc, gbc_params,
                                cv = 10, # cross validation
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

# Entrenamos
gbc.fit(X_train, y_train)  

print(gbc.best_estimator_)
print(gbc.best_params_) 
print(gbc.best_score_)

GradientBoostingClassifier(n_estimators=400)
{'loss': 'deviance', 'n_estimators': 400}
0.8427712744342941


In [44]:
gbc_train_preds = gbc.best_estimator_.predict(X_train)
gbc_test_preds = gbc.best_estimator_.predict(X_test)

print("Train:")
metricas_print(gbc_train_preds, y_train)
print()
print("Test:")
metricas_print(gbc_test_preds, y_test)

Train:
Accuracy 0.8489495656858124
Precision 0.8479925144607009
F1 0.8163114905115765
AUC 0.8410007548592813

Test:
Accuracy 0.846215997845408
Precision 0.842503438789546
F1 0.8109897384971864
AUC 0.8375192028504971


In [45]:
# FEATURE IMPORTANCE

feat_importance_gbc = gbc.best_estimator_.feature_importances_

feat_importance_gbc_df = pd.DataFrame(feat_importance_gbc,
                             index = X_train.columns,
                            columns = ["Importances"]).sort_values('Importances', ascending=False)
feat_importance_gbc_df.head()

Unnamed: 0,Importances
country,0.286585
lead_time,0.202057
total_of_special_requests,0.131312
market_segment_Online TA,0.083671
previous_cancellations,0.045082


In [46]:
# GRÁFICO PLOTLY FEATURE IMPORTANCE

fig = go.Figure()
fig.add_trace(go.Bar(name='Feature Importance Gradient Boosting Classifier', y=feat_importance_gbc_df['Importances'], 
                    x=feat_importance_gbc_df.index, marker_color = '#4620ff'))

fig.update_layout(barmode='group', uniformtext_minsize=12, uniformtext_mode='hide', font_family = 'Arial', title_font_family='Arial', 
                        plot_bgcolor='#efefef', paper_bgcolor='#efefef',showlegend=False)

fig.update_yaxes(visible = True)

fig.show()

Probamos con feature selection:



In [37]:
gbc = GradientBoostingClassifier()

gbc_params_fs = {'loss': ['deviance', 'exponential'],
                'n_estimators': [200, 300, 400]}

gbc_fs = GridSearchCV(gbc, gbc_params_fs,
                                cv = 10, # cross validation
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

# Entrenamos
gbc_fs.fit(X_train_rfe, y_train)  

print(gbc_fs.best_estimator_)
print(gbc_fs.best_params_) 
print(gbc_fs.best_score_) # Baja 0.02 respecto a sin feature selection

GradientBoostingClassifier(n_estimators=400)
{'loss': 'deviance', 'n_estimators': 400}
0.8278397706515868


In [38]:
# muy similar que sin feature selection, pero mejor sin

gbc_fs_train_preds = gbc_fs.best_estimator_.predict(X_train_rfe)
gbc_fs_test_preds = gbc_fs.best_estimator_.predict(X_test_rfe)

print("Train:")
metricas_print(gbc_fs_train_preds, y_train)
print()
print("Test:")
metricas_print(gbc_fs_test_preds, y_test)

Train:
Accuracy 0.834876439297017
Precision 0.8413647555399226
F1 0.7959901000395166
AUC 0.8246750815474947

Test:
Accuracy 0.8330864530029626
Precision 0.8368
F1 0.7915580593626503
AUC 0.8220070465465796


----

| **XG BOOST** |
|:---:|

In [16]:
xgbc = XGBClassifier()

xgbc_params = {
            "n_estimators": [200, 300, 400],
            "max_depth": [5, 10, 15, 20, 25],
            "use_label_encoder": [False],
            "eval_metric": ['logloss'],
            "verbosity": [1]
}

xgbc_grid = GridSearchCV(xgbc, xgbc_params,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

xgbc_grid.fit(X_train, y_train) # n_estimators=100

GridSearchCV(cv=10,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_c...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                        

In [17]:
xgbc_train_preds = xgbc_grid.predict(X_train)
xgbc_test_preds = xgbc_grid.predict(X_test)

print("Train:")
metricas_print(xgbc_train_preds, y_train)
print()
print("Test:")
metricas_print(xgbc_test_preds, y_test) # vemos overfitting

Train:
Accuracy 0.995017170560905
Precision 0.99576321520491
F1 0.9941492726122707
AUC 0.9946998419389235

Test:
Accuracy 0.8795448424454619
Precision 0.8723192019950124
F1 0.8543515427827079
AUC 0.8738201732969904


In [27]:
# Reducimos tamaño de parámetros por si mejorase el overfitting:

xgbc = XGBClassifier()

xgbc_2_params = {
            "n_estimators": [100, 150, 200],
            "max_depth": [1, 3, 5],
            "use_label_encoder": [False],
            "eval_metric": ['logloss'],
            "verbosity": [1]
}

xgbc_2_grid = GridSearchCV(xgbc, xgbc_params,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

xgbc_2_grid.fit(X_train, y_train)


A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.



GridSearchCV(cv=10,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_c...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                        

In [28]:
xgbc_2_train_preds = xgbc_2_grid.predict(X_train)
xgbc_2_test_preds = xgbc_2_grid.predict(X_test)

print("Train:")
metricas_print(xgbc_2_train_preds, y_train)
print()
print("Test:")
metricas_print(xgbc_2_test_preds, y_test)   # sigue el overfitting

Train:
Accuracy 0.995017170560905
Precision 0.99576321520491
F1 0.9941492726122707
AUC 0.9946998419389235

Test:
Accuracy 0.8795448424454619
Precision 0.8723192019950124
F1 0.8543515427827079
AUC 0.8738201732969904


In [37]:
feat_importance_xgbc = xgbc_grid.best_estimator_.feature_importances_

feat_importance_xgbc_df = pd.DataFrame(feat_importance_xgbc,
                             index = X_train.columns,
                            columns = ["Importances"]).sort_values('Importances', ascending=False)
feat_importance_xgbc_df.head()

Unnamed: 0,Importances
previous_cancellations,0.205479
required_car_parking_spaces,0.159493
market_segment_Online TA,0.10614
customer_Transient-Party,0.10006
meal_FB,0.040376


In [38]:
# GRÁFICO PLOTLY FEATURE IMPORTANCE

fig = go.Figure()
fig.add_trace(go.Bar(name='Feature Importance XGBoost', y=feat_importance_xgbc_df['Importances'], 
                    x=feat_importance_xgbc_df.index, marker_color = '#4620ff'))

fig.update_layout(barmode='group', uniformtext_minsize=12, uniformtext_mode='hide', font_family = 'Arial', title_font_family='Arial', 
                        plot_bgcolor='#efefef', paper_bgcolor='#efefef',showlegend=False)

fig.update_yaxes(visible = True)

fig.show()

Probamos con feature selection:

In [39]:
xgbc_3_params = {
            "n_estimators": [200, 300, 400],
            "max_depth": [5, 10, 15, 20, 25],
            "use_label_encoder": [False],
            "eval_metric": ['logloss'],
            "verbosity": [1]
}

xgbc_3_grid = GridSearchCV(xgbc, xgbc_3_params,
                                cv = 10,
                                scoring = "accuracy",
                                verbose = 0,
                                n_jobs = -1)

xgbc_3_grid.fit(X_train_rfe, y_train)


A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.



GridSearchCV(cv=10,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_c...
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                        

In [40]:
xgbc_3_train_preds = xgbc_3_grid.predict(X_train_rfe)
xgbc_3_test_preds = xgbc_3_grid.predict(X_test_rfe)

print("Train:")
metricas_print(xgbc_3_train_preds, y_train)
print()
print("Test:")
metricas_print(xgbc_3_test_preds, y_test) # Reducir el número de variables, no reduce el overfitting

Train:
Accuracy 0.9913137162480641
Precision 0.9922262324991076
F1 0.9897922848664689
AUC 0.9908084403402785

Test:
Accuracy 0.8641260436304875
Precision 0.8558271935699933
F1 0.8351307189542484
AUC 0.8575543627270517


-----

In [51]:
# GRÁFICO FEATURE IMPORTANCE

fig = go.Figure()

fig.add_trace(go.Bar(name='Feature Importance Gradient Boosting Classifier', y=feat_importance_gbc_df['Importances'][:10], 
                    x=feat_importance_gbc_df.index[:10], marker_color = '#4A3B8F'))   

fig.add_trace(go.Bar(name='Feature Importance XGBoost', y=feat_importance_xgbc_df['Importances'][:10], 
                    x=feat_importance_xgbc_df.index[:10], marker_color = '#4620ff'))                    

fig.add_trace(go.Bar(name='Feature Importance Random Forest', y=feat_importance_rand_f_df['Importances'][:10], 
                    x=feat_importance_rand_f_df.index[:10], marker_color = '#D7DCFF'))
                 

fig.update_layout(barmode='group', uniformtext_minsize=12, uniformtext_mode='hide', font_family = 'Arial', title_font_family='Arial', 
                        plot_bgcolor='#efefef', paper_bgcolor='#efefef',showlegend=True)

fig.update_yaxes(visible = True)

fig.show()

---