Universidad Torcuato Di Tella

Bachelor's Degree in Digital Technology\
**Digital Technology VI: Artificial Intelligence**

# **XGBoost**

In this *notebook*, we build a model with XGBoost to try to predict customers' banking decisions.

XGBoost stands for eXtreme Gradient Boosting; it is a library for gradient-boosted decision trees.
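
To make "gradient-boosted" concrete, here is a minimal sketch of plain gradient boosting for squared loss, built from scikit-learn regression trees on synthetic data. It is not XGBoost's actual algorithm (which also uses second-order information and regularization); the names `X_toy`, `y_toy` and `mse_history` are made up for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption, only for illustration).
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())  # Start from a constant model.
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    # For squared loss, the negative gradient is simply the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Each new tree corrects part of the error left by the previous ones.
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f'MSE before boosting: {mse_history[0]:.4f}')
print(f'MSE after 20 trees:  {mse_history[-1]:.4f}')
```

Each tree is small and weak on its own; the ensemble improves because every round fits what the previous rounds still get wrong.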

XGBoost is imported from the `xgboost` package, conventionally under the alias `xgb`.

In [4]:
import xgboost as xgb


We also import the other utilities we need:

In [5]:

import pandas as pd # To load the data and do one-hot encoding (OHE).
import numpy as np  # To deal with NaNs.
import time
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer
from sklearn.model_selection import ParameterSampler
from sklearn.metrics import confusion_matrix

random_state = 42
np.random.seed(random_state)

### **Loading the *data frame***

We load a CSV file with banking data and a yes/no target variable, the `y` column, which indicates whether the client subscribed to a term deposit. Source: https://archive.ics.uci.edu/dataset/222/bank+marketing. It is already available on the Campus Virtual, in the `Datasets` section. Note that the cell below reads a different file, `ctr_15.csv`, whose binary target column is `Label`.

In [24]:
df = pd.read_csv('/Users/ionikullock/Desktop/UTDT-Tecnología Digital/TD VI/Trabajo práctico 2/El-Bosque/ctr_15.csv', sep = ',')
df = df.drop(['device_id', 'creative_categorical_10', 'auction_categorical_10','auction_categorical_6','auction_categorical_8','auction_categorical_7', 'auction_age', 'timezone_offset' ], axis=1)

In [48]:
# Distinct values of `device_id_type` and how often each one occurs.
valores_unicos = df['device_id_type'].unique()
print(valores_unicos)
frecuencia_valores = df['device_id_type'].value_counts()
print("Frecuencia de los valores únicos:")
print(frecuencia_valores)

# value_counts() is indexed by category label, so use .iloc for positional access;
# plain frecuencia_valores[0] relies on a deprecated positional fallback.
print(frecuencia_valores.iloc[0])
print(frecuencia_valores.iloc[1])
print(frecuencia_valores.iloc[2])

# Rows of each device type that have Label == 0.
condiciones_1 = (df['device_id_type'] == '6324b367') & (df['Label'] == 0)
condiciones_2 = (df['device_id_type'] == 'c1d12c8e') & (df['Label'] == 0)
condiciones_3 = (df['device_id_type'] == '42080e25') & (df['Label'] == 0)
cantidad_1 = df[condiciones_1].shape[0]
cantidad_2 = df[condiciones_2].shape[0]
cantidad_3 = df[condiciones_3].shape[0]

# Share of rows with Label == 0 within each device type.
print(cantidad_1 / frecuencia_valores['6324b367'])
print(cantidad_2 / frecuencia_valores['c1d12c8e'])
print(cantidad_3 / frecuencia_valores['42080e25'])

['6324b367' 'c1d12c8e' '42080e25']
Frecuencia de los valores únicos:
device_id_type
6324b367    706243
c1d12c8e    530148
42080e25      3133
Name: count, dtype: int64
706243
530148
3133


697108
0.9870653585239075
0.9935697201536174
0.965209064794127



In [26]:
df.head()

Unnamed: 0,Label,action_categorical_0,action_categorical_1,action_categorical_2,action_categorical_3,action_categorical_4,action_categorical_5,action_categorical_6,action_categorical_7,action_list_0,...,creative_categorical_5,creative_categorical_6,creative_categorical_7,creative_categorical_8,creative_categorical_9,creative_height,creative_width,device_id_type,gender,has_video
0,0,c2e4f717,e709bbc0,5f2b3eb9,e7329a92,3b148f0b,6bc0e29c,59638795,e2538fca,IAB20-6,...,,,,b6910b48,65dcab89,50.0,320.0,6324b367,m,False
1,0,9915ffee,dc24b79b,d2f34a41,7ce4e1a3,b55cb32e,6bc0e29c,59638795,e2538fca,IAB22-2,...,654a0207,356a814d,b98125c8,b00371d3,65dcab89,,,c1d12c8e,m,False
2,0,9915ffee,dc24b79b,8b9c34de,7ce4e1a3,4a601fd1,6bc0e29c,59638795,e2538fca,IAB22-2,...,,,,b6910b48,65dcab89,50.0,320.0,c1d12c8e,,False
3,0,11b7af3d,ac0f362d,2fb5fd3f,cb80abab,b228749f,6bc0e29c,59638795,31b31f57,IAB22,...,,,,b6910b48,65dcab89,50.0,320.0,6324b367,,False
4,0,c2e4f717,3074db21,fa245e46,62c903fc,4fc27436,6bc0e29c,59638795,e2538fca,IAB20-6,...,,,,b6910b48,43c867fd,480.0,320.0,c1d12c8e,,True


We look at the numeric columns:

In [9]:
df.describe()

Unnamed: 0,Label,auction_age,auction_bidfloor,auction_time,creative_height,creative_width,timezone_offset
count,1239524.0,321140.0,1239524.0,1239524.0,1099192.0,1099192.0,1238871.0
mean,0.01020795,31.105972,0.4479788,1516025000.0,118.3017,319.6399,1.836801
std,0.1005174,9.34378,0.9098876,21936.63,117.7602,40.44806,1.926477
min,0.0,-1.0,0.0,1515974000.0,50.0,300.0,1.0
25%,0.0,25.0,0.09,1516006000.0,50.0,320.0,1.0
50%,0.0,29.0,0.1,1516029000.0,50.0,320.0,1.0
75%,0.0,35.0,0.52,1516043000.0,250.0,320.0,1.0
max,1.0,121.0,33.62,1516061000.0,1024.0,1024.0,10.0


We look at the `object`-typed columns (in this case, all of them are categorical, encoded as *strings*):

In [10]:
df.describe(include = 'object')

Unnamed: 0,action_categorical_0,action_categorical_1,action_categorical_2,action_categorical_3,action_categorical_4,action_categorical_5,action_categorical_6,action_categorical_7,action_list_0,action_list_1,...,creative_categorical_3,creative_categorical_4,creative_categorical_5,creative_categorical_6,creative_categorical_7,creative_categorical_8,creative_categorical_9,device_id,device_id_type,gender
count,1239524,1239524,1239524,1239524,1239524,1239524,1239524,1239524,1239524,942136,...,79938,1103725,135441,131910,128359,1239524,1222026,1239524,1239524,339591
unique,9,13,90,14,551,2,2,2,6,54347,...,12,3,12,1143,7,5,2,219557,3,3
top,9915ffee,f71d2f9b,9e4f5826,2c66682b,c3ab0db6,6bc0e29c,59638795,e2538fca,IAB22-2,[-6779],...,095cc02c,7f1dcf83,654a0207,356a814d,b98125c8,b6910b48,65dcab89,9cd0f5e4,6324b367,m
freq,320637,205951,205951,205951,122069,1026984,1235026,785420,320637,49725,...,36742,557491,127570,19910,39769,1100077,1162890,1066,706243,291895


We inspect the type of each column:

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1239524 entries, 0 to 1239523
Data columns (total 52 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   Label                    1239524 non-null  int64  
 1   action_categorical_0     1239524 non-null  object 
 2   action_categorical_1     1239524 non-null  object 
 3   action_categorical_2     1239524 non-null  object 
 4   action_categorical_3     1239524 non-null  object 
 5   action_categorical_4     1239524 non-null  object 
 6   action_categorical_5     1239524 non-null  object 
 7   action_categorical_6     1239524 non-null  object 
 8   action_categorical_7     1239524 non-null  object 
 9   action_list_0            1239524 non-null  object 
 10  action_list_1            942136 non-null   object 
 11  action_list_2            1026984 non-null  object 
 12  auction_age              321140 non-null   float64
 13  auction_bidfloor         1239524 non-null 

### **Missing values**

XGBoost can work with missing values. We add some to the data frame to demonstrate it.

In [12]:
probability = 0.2
mask = np.random.rand(*df.shape) < probability
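# Intent: do not inject missing values into the target variable.
# Caveat: the next line clears the mask for the LAST column, but the target `Label` is the
# first column, so `Label` still receives NaNs (visible in the outputs below).
# Protecting it would require: mask[:, df.columns.get_loc('Label')] = False
mask[:, mask.shape[1] - 1] = False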
df[mask] = np.nan
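
As a hedged sketch of the same idea on a toy data frame, the snippet below injects missing values at random while looking up the target column by name rather than by position. The frame `toy` and its feature columns are made up for illustration, and `DataFrame.mask` is used here in place of the boolean assignment above.

```python
import numpy as np
import pandas as pd

# Toy data frame (an assumption): the target `Label` is the FIRST column.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Label': rng.integers(0, 2, size=1000),
    'feature_a': rng.normal(size=1000),
    'feature_b': rng.choice(['x', 'y', 'z'], size=1000),
})

probability = 0.2
mask = rng.random(toy.shape) < probability
# Locate the target by name, so it is protected wherever it sits in the frame.
mask[:, toy.columns.get_loc('Label')] = False

# DataFrame.mask replaces the cells where the condition is True with NaN.
toy_missing = toy.mask(mask)

# Fraction of missing values per column: 0 for `Label`, about 0.2 elsewhere.
print(toy_missing.isna().mean())
```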


We can see the missing data as NaN values:

In [13]:
df.head()

Unnamed: 0,Label,action_categorical_0,action_categorical_1,action_categorical_2,action_categorical_3,action_categorical_4,action_categorical_5,action_categorical_6,action_categorical_7,action_list_0,...,creative_categorical_7,creative_categorical_8,creative_categorical_9,creative_height,creative_width,device_id,device_id_type,gender,has_video,timezone_offset
0,0.0,c2e4f717,e709bbc0,5f2b3eb9,,,,59638795.0,e2538fca,IAB20-6,...,,b6910b48,65dcab89,50.0,320.0,19503756,6324b367,,False,1.0
1,0.0,9915ffee,dc24b79b,d2f34a41,,,,59638795.0,e2538fca,IAB22-2,...,b98125c8,b00371d3,65dcab89,,,,,m,False,1.0
2,0.0,9915ffee,dc24b79b,8b9c34de,7ce4e1a3,,6bc0e29c,,e2538fca,IAB22-2,...,,b6910b48,,50.0,320.0,4490bb8c,,,False,1.0
3,0.0,11b7af3d,ac0f362d,2fb5fd3f,cb80abab,b228749f,6bc0e29c,59638795.0,,IAB22,...,,b6910b48,65dcab89,,,e08693f0,6324b367,,,1.0
4,,,3074db21,fa245e46,62c903fc,4fc27436,6bc0e29c,59638795.0,e2538fca,IAB20-6,...,,b6910b48,43c867fd,480.0,,502762e3,c1d12c8e,,True,1.0


XGBoost did not use to support categorical variables, but it now supports them experimentally (source: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html).

In this case, however, we will use one-hot encoding.

In [14]:
# Important: list only the categorical columns and exclude the variable to predict!
pd_ohe = pd.get_dummies(df,
                        columns = [
        'action_categorical_0', 'action_categorical_1', 'action_categorical_2', 'action_categorical_3', 
        'action_categorical_4', 'action_categorical_5', 'action_categorical_6', 'action_categorical_7', 
        'action_list_0', 'action_list_1', 'action_list_2', 'auction_boolean_0', 'auction_boolean_1', 
        'auction_boolean_2', 'auction_categorical_0', 'auction_categorical_1', 'auction_categorical_2', 
        'auction_categorical_3', 'auction_categorical_4', 'auction_categorical_5', 'auction_categorical_6', 
        'auction_categorical_7', 'auction_categorical_8', 'auction_categorical_9', 'auction_categorical_10', 
        'auction_categorical_11', 'auction_categorical_12', 'auction_list_0', 'creative_categorical_0', 
        'creative_categorical_1', 'creative_categorical_10', 'creative_categorical_11', 
        'creative_categorical_12', 'creative_categorical_2', 'creative_categorical_3', 'creative_categorical_4', 
        'creative_categorical_5', 'creative_categorical_6', 'creative_categorical_7', 'creative_categorical_8', 
        'creative_categorical_9', 'device_id_type', 'gender'],
                        sparse = True,    # Return a sparse matrix.
                        dummy_na = False, # Do not add a column for NaNs.
                        dtype = int       # XGBoost does not work with 'object'; we need numeric columns.
                       )
pd_ohe

Unnamed: 0,Label,auction_age,auction_bidfloor,auction_time,creative_height,creative_width,device_id,has_video,timezone_offset,action_categorical_0_11b7af3d,...,creative_categorical_8_b6910b48,creative_categorical_8_d9d53fe0,creative_categorical_9_43c867fd,creative_categorical_9_65dcab89,device_id_type_42080e25,device_id_type_6324b367,device_id_type_c1d12c8e,gender_f,gender_m,gender_o
0,0.0,32.0,0.478927,,50.0,320.0,19503756,False,1.0,0,...,1,0,0,1,0,1,0,0,0,0
1,0.0,37.0,1.350000,,,,,False,1.0,0,...,0,0,0,1,0,0,0,0,1,0
2,0.0,,,1.516044e+09,50.0,320.0,4490bb8c,False,1.0,0,...,1,0,0,0,0,0,0,0,0,0
3,0.0,,0.610000,1.516044e+09,,,e08693f0,,1.0,1,...,1,0,0,1,0,1,0,0,0,0
4,,,4.000000,,480.0,,502762e3,True,1.0,0,...,1,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239519,,,0.090000,1.516042e+09,,300.0,56b24817,False,4.0,0,...,1,0,0,1,0,1,0,0,0,0
1239520,0.0,,0.090000,1.516042e+09,,320.0,9b13d7ae,False,1.0,0,...,1,0,0,1,0,0,0,0,0,0
1239521,,,0.090000,,50.0,,,,1.0,0,...,1,0,0,0,0,0,1,0,0,0
1239522,0.0,,0.090000,1.516042e+09,250.0,300.0,23bae9f4,,4.0,0,...,1,0,0,1,0,1,0,0,0,0


In [15]:
len(pd_ohe.columns)

344410
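
The column count grows this much because one-hot encoding creates one column per distinct category of every encoded feature. The toy sketch below (made-up values, reusing two column names from the data only for readability) shows that behavior, and also what `dummy_na = False` implies: a row whose category is missing gets zeros in all of that feature's columns.

```python
import numpy as np
import pandas as pd

# Toy frame (an assumption): two categorical columns with NaNs, one numeric column.
toy = pd.DataFrame({
    'gender': ['m', 'f', np.nan, 'm'],
    'device_id_type': ['a', 'b', 'a', np.nan],
    'creative_height': [50.0, np.nan, 480.0, 50.0],
})

toy_ohe = pd.get_dummies(toy,
                         columns=['gender', 'device_id_type'],
                         sparse=True,     # Store the 0/1 columns as sparse arrays.
                         dummy_na=False,  # No extra column for missing categories.
                         dtype=int)

# One column per distinct category, plus the untouched numeric column.
print(toy_ohe.columns.tolist())

# Row 2 had a missing `gender`, so both gender columns are 0 there.
gender_flags = toy_ohe[['gender_f', 'gender_m']].sparse.to_dense().sum(axis=1)
print(gender_flags.tolist())  # [1, 1, 0, 1]
```

Numeric columns such as `creative_height` keep their NaNs, which is what XGBoost then treats as missing values.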

In [16]:
pd_ohe.info()
print(pd_ohe.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1239524 entries, 0 to 1239523
Columns: 344410 entries, Label to gender_o
dtypes: Sparse[int64, 0](344401), float64(7), object(2)
memory usage: 480.8+ MB
Index(['Label', 'auction_age', 'auction_bidfloor', 'auction_time',
       'creative_height', 'creative_width', 'device_id', 'has_video',
       'timezone_offset', 'action_categorical_0_11b7af3d',
       ...
       'creative_categorical_8_b6910b48', 'creative_categorical_8_d9d53fe0',
       'creative_categorical_9_43c867fd', 'creative_categorical_9_65dcab89',
       'device_id_type_42080e25', 'device_id_type_6324b367',
       'device_id_type_c1d12c8e', 'gender_f', 'gender_m', 'gender_o'],
      dtype='object', length=344410)


### **Preparing the training, validation (*hold-out*) and test sets**

For this example, we use fixed *train*, *validation* and *test* sets; that is, for validation we use a hold-out *set*. We chose this because *cross-validation* is more expensive.

In [17]:
y = pd_ohe[['Label']].copy() # We use copy to avoid modifying a view below, which raises a warning.
y
unique_labels = pd_ohe['Label'].unique()
print(f'Valores únicos en la columna Label: {unique_labels}')

Valores únicos en la columna Label: [ 0. nan  1.]


In [18]:
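# Normalize the labels: 0.0 -> 0 and 1.0 -> 1.
y[y['Label'] == 0. ] = 0
y[y['Label'] == 1.] = 1
# Labels blanked by the NaN injection above are treated as the negative class (0).
y['Label'] = y['Label'].fillna(0)
y['Label'] = y['Label'].astype(int)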

In [19]:
y['Label'].unique()

array([0, 1])

In [20]:
X = pd_ohe.drop('Label', axis = 1)
X

Unnamed: 0,auction_age,auction_bidfloor,auction_time,creative_height,creative_width,device_id,has_video,timezone_offset,action_categorical_0_11b7af3d,action_categorical_0_604d011f,...,creative_categorical_8_b6910b48,creative_categorical_8_d9d53fe0,creative_categorical_9_43c867fd,creative_categorical_9_65dcab89,device_id_type_42080e25,device_id_type_6324b367,device_id_type_c1d12c8e,gender_f,gender_m,gender_o
0,32.0,0.478927,,50.0,320.0,19503756,False,1.0,0,0,...,1,0,0,1,0,1,0,0,0,0
1,37.0,1.350000,,,,,False,1.0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,,,1.516044e+09,50.0,320.0,4490bb8c,False,1.0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,,0.610000,1.516044e+09,,,e08693f0,,1.0,1,0,...,1,0,0,1,0,1,0,0,0,0
4,,4.000000,,480.0,,502762e3,True,1.0,0,0,...,1,0,1,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239519,,0.090000,1.516042e+09,,300.0,56b24817,False,4.0,0,1,...,1,0,0,1,0,1,0,0,0,0
1239520,,0.090000,1.516042e+09,,320.0,9b13d7ae,False,1.0,0,0,...,1,0,0,1,0,0,0,0,0,0
1239521,,0.090000,,50.0,,,,1.0,0,0,...,1,0,0,0,0,0,1,0,0,0
1239522,,0.090000,1.516042e+09,250.0,300.0,23bae9f4,,4.0,0,1,...,1,0,0,1,0,1,0,0,0,0


In [21]:

pd_ohe2 = pd.get_dummies(df,
                       columns = [
        'action_categorical_0', 'action_categorical_1', 'action_categorical_2', 'action_categorical_3', 
        'action_categorical_4', 'action_categorical_5', 'action_categorical_6', 'action_categorical_7', 
        'action_list_0', 'action_list_1', 'action_list_2', 'auction_boolean_0', 'auction_boolean_1', 
        'auction_boolean_2', 'auction_categorical_0', 'auction_categorical_1', 'auction_categorical_2', 
        'auction_categorical_3', 'auction_categorical_4', 'auction_categorical_5', 'auction_categorical_6', 
        'auction_categorical_7', 'auction_categorical_8', 'auction_categorical_9', 'auction_categorical_10', 
        'auction_categorical_11', 'auction_categorical_12', 'auction_list_0', 'creative_categorical_0', 
        'creative_categorical_1', 'creative_categorical_10', 'creative_categorical_11', 
        'creative_categorical_12', 'creative_categorical_2', 'creative_categorical_3', 'creative_categorical_4', 
        'creative_categorical_5', 'creative_categorical_6', 'creative_categorical_7', 'creative_categorical_8', 
        'creative_categorical_9', 'device_id_type', 'gender'],
                       sparse=True,    # Return a sparse matrix.
                       dummy_na=False, # Do not add a column for NaNs.
                       dtype=int)      # XGBoost does not work with 'object'; we need numeric columns.

# Split into X and y.
X = pd_ohe2.drop('Label', axis=1)
y = pd_ohe2[['Label']].copy()

# Make sure y is an integer column, filling NaNs with 0.
y['Label'] = y['Label'].fillna(0).astype(int)

# Check the resulting DataFrame.
print("DataFrame con One-Hot Encoding:")
print(pd_ohe2.head())

DataFrame con One-Hot Encoding:
   Label  auction_age  auction_bidfloor  auction_time  creative_height  \
0    0.0         32.0          0.478927           NaN             50.0   
1    0.0         37.0          1.350000           NaN              NaN   
2    0.0          NaN               NaN  1.516044e+09             50.0   
3    0.0          NaN          0.610000  1.516044e+09              NaN   
4    NaN          NaN          4.000000           NaN            480.0   

   creative_width device_id has_video  timezone_offset  \
0           320.0  19503756     False              1.0   
1             NaN       NaN     False              1.0   
2           320.0  4490bb8c     False              1.0   
3             NaN  e08693f0       NaN              1.0   
4             NaN  502762e3      True              1.0   

   action_categorical_0_11b7af3d  ...  creative_categorical_8_b6910b48  \
0                              0  ...                                1   
1                         

In [22]:
# Keep 70% for training, stratified by label. The targets are named Y_* (capital Y)
# to match the training cells further down.
X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, y,
                                                  train_size=0.7,
                                                  random_state=42,
                                                  stratify=y)

# Split the remainder evenly into validation and test.
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp,
                                                train_size=0.5,
                                                random_state=42,
                                                stratify=Y_tmp)

print(f'Cantidad de datos de entrenamiento: {len(X_train)}')
print(f'Cantidad de datos de validación: {len(X_val)}')
print(f'Cantidad de datos de prueba: {len(X_test)}')

KeyboardInterrupt: 

In [1]:
val_test_size = 0.3 # Combined share of the validation and test sets.
X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, y,
                                                  train_size = 1 - val_test_size,
                                                  random_state = random_state,
                                                  stratify = y)

NameError: name 'train_test_split' is not defined

In [43]:
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp,
                                                train_size=0.5,
                                                random_state=random_state,
                                                stratify=Y_tmp)

NameError: name 'X_tmp' is not defined

In [44]:
print(f'Cantidad de datos de train: {len(X_train)}')
print(f'Cantidad de datos de validación: {len(X_val)}')
print(f'Cantidad de datos de test: {len(X_test)}')

NameError: name 'X_train' is not defined
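
The two-step split can be sketched end to end on synthetic data. This is only an illustration under assumed data (`X_toy`, `y_toy`, with roughly 5% positives); it shows that the procedure yields a 70/15/15 partition and that `stratify` keeps the positive rate nearly identical across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

random_state = 42
rng = np.random.default_rng(random_state)

# Synthetic, imbalanced binary problem (an assumption, only for illustration).
X_toy = rng.normal(size=(10_000, 3))
y_toy = (rng.random(10_000) < 0.05).astype(int)

val_test_size = 0.3  # Combined share of validation and test.
X_train, X_tmp, Y_train, Y_tmp = train_test_split(
    X_toy, y_toy, train_size=1 - val_test_size,
    random_state=random_state, stratify=y_toy)

# Split the held-out 30% evenly into validation and test.
X_val, X_test, Y_val, Y_test = train_test_split(
    X_tmp, Y_tmp, train_size=0.5,
    random_state=random_state, stratify=Y_tmp)

print(len(X_train), len(X_val), len(X_test))
print(Y_train.mean(), Y_val.mean(), Y_test.mean())
```

With a rare positive class, stratifying matters: an unstratified split could leave the validation or test set with noticeably fewer positives, making the AUC estimate noisier.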

### **Now, on to XGBoost**

To use XGBoost, we create an instance of the `XGBClassifier` class and set the learning task through the `objective` parameter; here it is `binary:logistic`, that is, binary classification with a probability output. We can also set other typical XGBoost parameters. Link to the documentation: https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-tree-booster.

In [22]:
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic',
                            seed = random_state,
                            eval_metric = 'auc')

To train the model, we call the `fit` method, as with other kinds of models.

In [23]:
clf_xgb.fit(X_train, Y_train, verbose = True, eval_set = [(X_val, Y_val)])

[0]	validation_0-auc:0.82470
[1]	validation_0-auc:0.84480
[2]	validation_0-auc:0.86397
[3]	validation_0-auc:0.86478
[4]	validation_0-auc:0.87343
[5]	validation_0-auc:0.88123
[6]	validation_0-auc:0.88373
[7]	validation_0-auc:0.88562
[8]	validation_0-auc:0.88861
[9]	validation_0-auc:0.88933
[10]	validation_0-auc:0.89103
[11]	validation_0-auc:0.89199
[12]	validation_0-auc:0.89284
[13]	validation_0-auc:0.89238
[14]	validation_0-auc:0.89273
[15]	validation_0-auc:0.89371
[16]	validation_0-auc:0.89430
[17]	validation_0-auc:0.89491
[18]	validation_0-auc:0.89708
[19]	validation_0-auc:0.89729
[20]	validation_0-auc:0.89744
[21]	validation_0-auc:0.89793
[22]	validation_0-auc:0.89872
[23]	validation_0-auc:0.89881
[24]	validation_0-auc:0.89882
[25]	validation_0-auc:0.89914
[26]	validation_0-auc:0.89883
[27]	validation_0-auc:0.89891
[28]	validation_0-auc:0.89902
[29]	validation_0-auc:0.89944
[30]	validation_0-auc:0.89968
[31]	validation_0-auc:0.89960
[32]	validation_0-auc:0.89950
[33]	validation_0-au

### **Hyperparameter search**

We do not want to search hyperparameters with cross-validation, because it would take too long. Instead, we run a *random search* manually with the `ParameterSampler` class. Note: there is also a `ParameterGrid` class, in case we wanted to do the same with *grid search*. We define the candidate values:

In [24]:
from scipy.stats import uniform
params = {'max_depth': list(range(1, 40)),
          'learning_rate': uniform(scale = 0.2),
          'gamma': uniform(scale = 2),
          'reg_lambda': uniform(scale = 5),        # Regularization parameter.
          'subsample': uniform(0.5, 0.5),          # Between 0.5 and 1.
          'min_child_weight': uniform(scale = 5),
          'colsample_bytree': uniform(0.75, 0.25), # Between 0.75 and 1.
          'n_estimators': list(range(1, 1000))
         }

Recall that the hyperparameter definitions are available here: https://xgboost.readthedocs.io/en/stable/parameter.html.
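
To see what `ParameterSampler` actually produces, here is a small sketch with a reduced, hypothetical search space (`toy_params`). Lists are sampled uniformly among their elements, while SciPy's `uniform(loc, scale)` draws from the interval from `loc` to `loc + scale`.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Reduced search space (an assumption, only for illustration).
toy_params = {'max_depth': list(range(1, 40)),       # Discrete: one of 1..39.
              'learning_rate': uniform(scale=0.2),   # Continuous, in [0, 0.2].
              'subsample': uniform(0.5, 0.5)}        # Continuous, in [0.5, 1].

# Each sample is a plain dict that can be unpacked into a model with **sample.
samples = list(ParameterSampler(toy_params, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

Because every draw is independent, the number of configurations tried is set by `n_iter` alone and does not grow with the number of hyperparameters.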

In [25]:
start = time.time()
best_score = 0
best_grid = None
best_estimator = None
iterations = 100
for g in ParameterSampler(params, n_iter = iterations, random_state = random_state):
    clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic', seed = random_state, eval_metric = 'auc', **g)
    clf_xgb.fit(X_train, Y_train, eval_set = [(X_val, Y_val)], verbose = False)

    y_pred = clf_xgb.predict_proba(X_val)[:, 1] # Probability of the positive class (column 1).
    auc_roc = sklearn.metrics.roc_auc_score(Y_val, y_pred)
    # Keep this configuration if it is the best so far.
    if auc_roc > best_score:
        print(f'Mejor valor de ROC-AUC encontrado: {auc_roc}')
        best_score = auc_roc
        best_grid = g
        best_estimator = clf_xgb

end = time.time()
print('ROC-AUC: %0.5f' % best_score)
print('Grilla:', best_grid)
print(f'Tiempo transcurrido: {str(end - start)} segundos')
print(f'Tiempo de entrenamiento por iteración: {str(round((end - start) / iterations, 2))} segundos')

Mejor valor de ROC-AUC encontrado: 0.8871468798217762
Mejor valor de ROC-AUC encontrado: 0.8966518259020139
Mejor valor de ROC-AUC encontrado: 0.901884162952269
Mejor valor de ROC-AUC encontrado: 0.9035085283917962
ROC-AUC: 0.90351
Grilla: {'colsample_bytree': 0.7626921327598493, 'gamma': 1.7732342979013198, 'learning_rate': 0.005523354374740941, 'max_depth': 34, 'min_child_weight': 0.469909699204345, 'n_estimators': 727, 'reg_lambda': 3.360130676475997, 'subsample': 0.664076333737366}
Tiempo transcurrido: 2191.9723739624023 segundos
Tiempo de entrenamiento por iteración: 21.92 segundos


How long would a grid search at the same scale take?

If we had only 5 fixed options for each hyperparameter, the 8 hyperparameters sampled above would give 5^8 = 390,625 combinations, each costing one training run. Even if every run took just 1 second, that would be (without parallelism) about 4.5 days; at the roughly 21.92 seconds per iteration measured above, it would be about 99 days.
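
A quick check of this arithmetic, assuming 5 candidate values for each of the 8 keys in the `params` dictionary defined earlier, and no parallelism:

```python
n_options = 5
n_hyperparameters = 8  # Number of keys in the `params` dictionary.
combinations = n_options ** n_hyperparameters

seconds_per_day = 24 * 60 * 60
# 1 s is a hypothetical best case; 21.92 s is the per-iteration time reported above.
for seconds_per_fit in (1, 21.92):
    days = combinations * seconds_per_fit / seconds_per_day
    print(f'{combinations} fits at {seconds_per_fit} s each: {days:.1f} days')
# 390625 fits at 1 s each: 4.5 days
# 390625 fits at 21.92 s each: 99.1 days
```

The cost of a grid grows exponentially with the number of hyperparameters, whereas the random search above is capped at the 100 iterations we chose.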

In [30]:
best_grid = {'colsample_bytree': 0.7626921327598493,
             'gamma': 1.7732342979013198,
             'learning_rate': 0.005523354374740941,
             'max_depth': 34,
             'min_child_weight': 0.469909699204345,
             'n_estimators': 727,
             'reg_lambda': 3.360130676475997,
             'subsample': 0.664076333737366
            }
# Note: this is hard-coded here, but it would be better to reuse the `best_grid` found by the search above.

best_estimator = xgb.XGBClassifier(objective = 'binary:logistic',
                                   seed = random_state,
                                   eval_metric = 'auc',
                                   **best_grid)

best_estimator.fit(X_train, Y_train, verbose = True,  eval_set = [(X_val, Y_val)])

# roc_auc_score needs a 1-D array of scores for the positive class, so we take column 1 (column 0 would yield 1 - AUC).
y_pred = best_estimator.predict_proba(X_val)[:, 1]
auc_roc = sklearn.metrics.roc_auc_score(Y_val, y_pred)
print('AUC-ROC validación: %0.5f' % auc_roc)

[0]	validation_0-auc:0.84266
[1]	validation_0-auc:0.85949
[2]	validation_0-auc:0.86934
[3]	validation_0-auc:0.87212
[4]	validation_0-auc:0.87541
[5]	validation_0-auc:0.87659
[6]	validation_0-auc:0.87717
[7]	validation_0-auc:0.88326
[8]	validation_0-auc:0.88335
[9]	validation_0-auc:0.88287
[10]	validation_0-auc:0.88236
[11]	validation_0-auc:0.88298
[12]	validation_0-auc:0.88353
[13]	validation_0-auc:0.88361
[14]	validation_0-auc:0.88399
[15]	validation_0-auc:0.88351
[16]	validation_0-auc:0.88380
[17]	validation_0-auc:0.88381
[18]	validation_0-auc:0.88371
[19]	validation_0-auc:0.88446
[20]	validation_0-auc:0.88594
[21]	validation_0-auc:0.88642
[22]	validation_0-auc:0.88842
[23]	validation_0-auc:0.88892
[24]	validation_0-auc:0.88930
[25]	validation_0-auc:0.88911
[26]	validation_0-auc:0.88921
[27]	validation_0-auc:0.88913
[28]	validation_0-auc:0.88956
[29]	validation_0-auc:0.88969
[30]	validation_0-auc:0.88983
[31]	validation_0-auc:0.88984
[32]	validation_0-auc:0.88991
[33]	validation_0-au
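
The choice of column in `predict_proba(...)[:, 1]` matters for ROC-AUC. The sketch below uses a hand-written, hypothetical probability matrix (not output from the model above) to show that scoring with column 0 instead of column 1 gives the complement of the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
# Hypothetical predict_proba output: column 0 is P(class 0), column 1 is P(class 1).
proba = np.array([[0.90, 0.10],
                  [0.60, 0.40],
                  [0.65, 0.35],
                  [0.20, 0.80]])

auc_positive = roc_auc_score(y_true, proba[:, 1])  # Scores for the positive class.
auc_negative = roc_auc_score(y_true, proba[:, 0])  # Scores for the negative class.
print(auc_positive, auc_negative)  # 0.75 0.25
```

Since the two columns sum to 1 in every row, ranking by column 0 reverses the ranking by column 1, which is why the two values add up to 1.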

Once trained, we can inspect the importance of each feature under several criteria, for example `cover`:

In [31]:
bst = best_estimator.get_booster()
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s: ' % importance_type, bst.get_score(importance_type = importance_type))
    print('--------------')

weight:  {'age': 20313.0, 'balance': 25582.0, 'day': 20270.0, 'duration': 23876.0, 'campaign': 9372.0, 'pdays': 10966.0, 'previous': 6781.0, 'job_admin.': 1395.0, 'job_blue-collar': 1331.0, 'job_entrepreneur': 554.0, 'job_housemaid': 468.0, 'job_management': 1816.0, 'job_retired': 895.0, 'job_self-employed': 649.0, 'job_services': 746.0, 'job_student': 1009.0, 'job_technician': 1667.0, 'job_unemployed': 654.0, 'job_unknown': 62.0, 'marital_divorced': 1348.0, 'marital_married': 1730.0, 'marital_single': 1905.0, 'education_primary': 945.0, 'education_secondary': 1709.0, 'education_tertiary': 2038.0, 'education_unknown': 922.0, 'default_no': 1620.0, 'default_yes': 279.0, 'housing_no': 1988.0, 'housing_yes': 1565.0, 'loan_no': 1852.0, 'loan_yes': 1068.0, 'contact_cellular': 2082.0, 'contact_telephone': 977.0, 'contact_unknown': 1233.0, 'month_apr': 1944.0, 'month_aug': 1935.0, 'month_dec': 900.0, 'month_feb': 2406.0, 'month_jan': 926.0, 'month_jul': 1418.0, 'month_jun': 1824.0, 'month_mar'

### **Test set**

Finally, we run the model on the test set to get a better estimate of how it would behave in production, on data the model has never seen and that was not used to make any decisions either.

In [32]:
y_pred = best_estimator.predict_proba(X_test)[:, 1]
auc_roc = sklearn.metrics.roc_auc_score(Y_test, y_pred)
print('AUC-ROC test: %0.5f' % auc_roc)

AUC-ROC test: 0.89681


Indeed, performance is slightly lower on the test set (0.89681, versus 0.90351 on validation).