In [1]:
import pandas as pd
import numpy as np

**Main reference**: https://www.youtube.com/watch?v=oF58IQ5W0cg

Dataset information: https://archive.ics.uci.edu/ml/datasets/adult

This dataset contains information about individuals and is used to train a model that predicts whether a person earns more than 50K a year.

### 1. Reading the dataset

In [2]:
Dataset = pd.read_csv('adult.data', sep = ',', header = None)
columnas = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']
Dataset.columns = columnas
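As a side note, the raw file separates fields with a comma followed by a space, which is why the values later appear as ' <=50K' or ' Male' with a leading blank. A minimal sketch of an alternative read, using two rows typed in by hand instead of the real file, passes the column names through `names=` and strips that blank with `skipinitialspace=True`. The names `sample` and `sample_df` are illustrative only. If this option were used, the string comparisons in the rest of the notebook would need to drop the leading space.

```python
import io
import pandas as pd

# Two rows in the same layout as adult.data (comma + space separated, no header).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, "
    "Wife, Black, Female, 0, 0, 40, Cuba, <=50K\n"
)

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'target']

# names= assigns the header while reading; skipinitialspace=True drops the
# blank after each comma, so values come out as '<=50K' rather than ' <=50K'.
sample_df = pd.read_csv(sample, header=None, names=columns, skipinitialspace=True)
print(sample_df['target'].unique())
print(sample_df.shape)
```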

In [3]:
Dataset.head() # Show the first rows

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### 2. Convert the target to binary (0 or 1)

In [4]:
Dataset['target'].unique() # Show the distinct values in this column

array([' <=50K', ' >50K'], dtype=object)

This makes sense, since the goal is to know whether a person earns more than 50K or not. However, the values ' <=50K' and ' >50K' are strings, so to train the model we must rewrite them as zeros and ones.

In [5]:
Dataset['target'] = np.where(Dataset['target'] == ' <=50K', 0, 1) # 0 where the value is ' <=50K', 1 otherwise

Now:

In [6]:
Dataset['target'].unique() # Show the distinct values in this column

array([0, 1])
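One thing to keep in mind with a single `np.where` comparison is that every value that is not exactly ' <=50K' becomes 1, including a label with a typo or different formatting. A hedged alternative, shown here on a small hand-made Series rather than the real column, strips whitespace and uses an explicit mapping so that unexpected labels surface as NaN. The names `labels`, `mapping`, `encoded` and `unexpected` are illustrative only.

```python
import pandas as pd

# Toy labels; the last one is deliberately malformed.
labels = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K.'])

mapping = {'<=50K': 0, '>50K': 1}
encoded = labels.str.strip().map(mapping)

# Labels missing from the mapping become NaN instead of silently turning into 1.
unexpected = labels[encoded.isna()]
print(encoded.tolist())
print(unexpected.tolist())
```

Checking that `unexpected` is empty before training is a cheap way to confirm the target column contains only the two expected classes.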

### 3. Building extra columns

Note that the dataset currently has 15 columns:

In [7]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


Let's build more columns from the ones we already have, with the aim of increasing the predictive value of the model we will build.

For example, let's count how many hours per week people work:

In [8]:
Dataset['hours-per-week'].value_counts()

40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
82        1
92        1
87        1
74        1
94        1
Name: hours-per-week, Length: 94, dtype: int64

Here we see that 40 hours is by far the most common value, so we can add a column that holds 1 if the person works 40 hours or fewer and 0 if they work more than 40 hours.

In [9]:
Dataset['m-i-40'] = np.where(Dataset['hours-per-week'] <= 40, 1, 0) # m-i-40 stands for "menor o igual que 40" (less than or equal to 40)

In [10]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,m-i-40
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,1
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,1
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,1
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,1


Another column we can create is capital gain minus capital loss:

In [11]:
Dataset['dif-capital'] = Dataset['capital-gain'] - Dataset['capital-loss']

In [12]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,m-i-40,dif-capital
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,1,2174
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,1,0


We also have the sex column, which holds Male or Female:

In [13]:
Dataset['sex'].unique() 

array([' Male', ' Female'], dtype=object)

Let's rewrite it so that it holds 1 for Female and 0 for Male:

In [14]:
Dataset['sex'] = np.where(Dataset['sex'] == ' Female', 1, 0)

In [15]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,m-i-40,dif-capital
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,0,1,2174
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,0,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,0,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,0,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,0,1,0


Similarly, let's add another column that indicates whether the person's native country is the United States:

In [16]:
Dataset['native-country'].unique()

array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
       ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
       ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
       ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
       ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
       ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
       ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
       ' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
       ' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)

In [17]:
Dataset['usa'] = np.where(Dataset['native-country'] == ' United-States', 1, 0)

In [18]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,m-i-40,dif-capital,usa
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,0,1,2174,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,0,1,0,1
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,0,1,0,1
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,0,1,0,1
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,0,1,0,0


In the same way one could build columns from more complex combinations of variables. Informative extra features can improve the predictive power of the model we train, although adding more columns is not guaranteed to help.
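As an illustration of what such extra columns could look like, here is a small sketch on a toy DataFrame. The column names `age-group` and `has-capital`, and the bin edges, are hypothetical choices and are not part of the original notebook; whether they help would have to be checked on held-out data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 50, 28, 17],
    'capital-gain': [2174, 0, 0, 0],
    'capital-loss': [0, 0, 1902, 0],
})

# Hypothetical feature 1: coarse age bands (0-25, 25-40, 40-60, 60-120]
# encoded as integer codes 0..3.
toy['age-group'] = pd.cut(toy['age'], bins=[0, 25, 40, 60, 120], labels=False)

# Hypothetical feature 2: 1 if the person reports any capital gain or loss.
toy['has-capital'] = np.where(
    (toy['capital-gain'] > 0) | (toy['capital-loss'] > 0), 1, 0)

print(toy)
```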

### 4. Replace categorical variables with their frequency

A categorical variable assigns each row to a category. For example, in the 'workclass' column the value 'Private' labels the person's job as private-sector. Instead of keeping that label, we replace it with the number of times 'Private' appears in the whole Dataset.

In [19]:
a = Dataset.groupby('workclass')['workclass'].transform('count')
a

0         1298
1         2541
2        22696
3        22696
4        22696
         ...  
32556    22696
32557    22696
32558    22696
32559    22696
32560     1116
Name: workclass, Length: 32561, dtype: int64

This gives a column in which each category has been replaced by its frequency. The idea is to do this for every column that represents a categorical variable.

In [20]:
cat_col = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']

In [21]:
for variable_categorica in cat_col:
    Dataset[variable_categorica] = Dataset.groupby(variable_categorica)[variable_categorica].transform('count')

In [22]:
Dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,m-i-40,dif-capital,usa
0,39,1298,77516,5355,13,10683,3770,8305,27816,0,2174,0,40,29170,0,1,2174,1
1,50,2541,83311,5355,13,14976,4066,13193,27816,0,0,0,13,29170,0,1,0,1
2,38,22696,215646,10501,9,4443,1370,8305,27816,0,0,0,40,29170,0,1,0,1
3,53,22696,234721,1175,7,14976,1370,13193,3124,0,0,0,40,29170,0,1,0,1
4,28,22696,338409,5355,13,14976,4140,1568,3124,1,0,0,40,95,0,1,0,0


In this way we turn a dataset that contained strings and categorical variables into a matrix of numbers that the model can work with.
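The frequency encoding above is computed on the full Dataset, before any train/test split. A hedged variation, sketched on toy data, learns the frequencies from the training rows only and reuses them for new rows, so that categories never seen during training have a defined value. The names `train`, `new_rows` and `freq` are illustrative only, and filling unseen categories with 0 is one possible choice among several.

```python
import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'Private', 'State-gov', 'Private']})
new_rows = pd.DataFrame({'workclass': ['State-gov', 'Never-worked']})

# Frequencies learned from the training rows only.
freq = train['workclass'].value_counts()

# Mapping through value_counts gives the same numbers as
# groupby(...).transform('count') on the training rows.
train_encoded = train['workclass'].map(freq)
via_groupby = train.groupby('workclass')['workclass'].transform('count')

# Categories never seen in training get 0 instead of NaN.
new_encoded = new_rows['workclass'].map(freq).fillna(0).astype(int)

print(train_encoded.tolist(), via_groupby.tolist(), new_encoded.tolist())
```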

### 5. Split the data into X and Y

X: input

Y: output

In [23]:
X = Dataset.loc[:, Dataset.columns != 'target'] # All rows, every column except 'target'
Y = Dataset.loc[:, 'target']  # All rows of the 'target' column

In [24]:
X.shape, Y.shape, Dataset.shape

((32561, 17), (32561,), (32561, 18))
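For reference, the boolean-mask selection used for X can also be written with `DataFrame.drop`. The sketch below, on a toy DataFrame with made-up values, shows that both forms give the same result; `toy`, `X_toy` and `y_toy` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'age': [39, 50], 'sex': [0, 1], 'target': [0, 1]})

X_toy = toy.drop(columns='target')  # every column except the label
y_toy = toy['target']               # the label as a Series

# Same result as the boolean-mask form used in the notebook.
same = X_toy.equals(toy.loc[:, toy.columns != 'target'])
print(list(X_toy.columns), same)
```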

In [25]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,m-i-40,dif-capital,usa
0,39,1298,77516,5355,13,10683,3770,8305,27816,0,2174,0,40,29170,1,2174,1
1,50,2541,83311,5355,13,14976,4066,13193,27816,0,0,0,13,29170,1,0,1
2,38,22696,215646,10501,9,4443,1370,8305,27816,0,0,0,40,29170,1,0,1
3,53,22696,234721,1175,7,14976,1370,13193,3124,0,0,0,40,29170,1,0,1
4,28,22696,338409,5355,13,14976,4140,1568,3124,1,0,0,40,95,1,0,0


In [26]:
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

### 6. Split the data into "train" and "test" sets

In [27]:
from sklearn.model_selection import train_test_split

Let's set aside 50% of the Dataset to train the model, 25% to test it and 25% to validate it.

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.5, random_state = 1) # random_state makes the split identical every time the notebook is run

In [29]:
X_train.shape, X_test.shape

((16280, 17), (16281, 17))

Here we first split 50-50. The next step splits the held-out half (X_test, Y_test) 50-50 again, which leaves 25% of the data for validation and 25% for testing.

In [30]:
X_valid, X_test, Y_valid, Y_test = train_test_split(X_test, Y_test, test_size = 0.5, random_state = 1)

This leaves us with:

In [31]:
X_train.shape, X_test.shape, X_valid.shape # 50% train, 25% test, 25% validation

((16280, 17), (8141, 17), (8140, 17))

In [32]:
Y_train.shape, Y_test.shape, Y_valid.shape # 50% train, 25% test, 25% validation

((16280,), (8141,), (8140,))
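The two-step split can be wrapped in a small helper. The sketch below is one way to do it on toy arrays; the function name `split_50_25_25` is hypothetical, and the `stratify` argument is an addition that the notebook does not use (it keeps the class proportions equal across the three parts).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_50_25_25(X, y, seed=1):
    """Hypothetical helper: 50% train, 25% validation, 25% test."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

# Toy data: 100 rows, 24 of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 76 + [1] * 24)

(train_X, train_y), (valid_X, valid_y), (test_X, test_y) = split_50_25_25(X_toy, y_toy)
print(len(train_X), len(valid_X), len(test_X))
```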

### 7. Create the Machine Learning model

In [33]:
import xgboost

In [34]:
xgb = xgboost.XGBClassifier() # Gradient boosting classifier

The model has a set of hyperparameters, and the goal is to find the values that predict the data best. To do that we give a list of options for each one and let the search evaluate every possible combination and pick the best:

In [35]:
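parameters = {'nthreads': [1],  # XGBoost reports this key as unused (see the warning in the training output); the scikit-learn wrapper's thread setting is 'n_jobs'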
             'objective': ['binary:logistic'],
             'learning_rate': [0.05,0.1],
             'n_estimators': [100,200]}
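To see what "every possible combination" means for a grid like this one, scikit-learn's `ParameterGrid` can enumerate it. The sketch below leaves out the thread setting (it has a single value and does not change the count) and assumes 3-fold cross-validation when counting fits; `grid`, `combos` and `n_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'objective': ['binary:logistic'],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 200],
}

# One dict per combination: 1 x 2 x 2 = 4 in total.
combos = list(ParameterGrid(grid))
for params in combos:
    print(params)

n_folds = 3  # assumption: 3-fold cross-validation
print(len(combos), "combinations,", len(combos) * n_folds, "cross-validation fits")
```

The number of fits grows multiplicatively with every list in the grid, which is why the search can become slow.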

In [36]:
from sklearn.model_selection import GridSearchCV # Searches over the hyperparameter combinations

GridSearchCV trains the model with every combination of values in parameters and keeps the one that scores best. This can take a long time. To shorten it we use the 25% of the data set aside earlier as X_test and Y_test: a cost function is evaluated on that 25% after each boosting round, and if it does not improve for a given number of rounds, training for that combination stops and the search moves on to the next one.

In [37]:
fit_params = {'early_stopping_rounds': 10,
             'eval_metric': 'logloss',
             'eval_set': [(X_test, Y_test)] }

This dictionary tells XGBoost to evaluate the log loss on X_test and Y_test after every boosting round and to stop training if it has not improved in 10 consecutive rounds. This reduces the time GridSearchCV needs to find the best combination in parameters.
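The stopping rule itself is simple. The following plain-Python sketch illustrates the idea on a list of per-round validation losses; it is not XGBoost's implementation, and the function name `early_stopping_round` and the toy loss values are made up for the example.

```python
def early_stopping_round(losses, patience=10):
    """Return (stop_round, best_round) for per-round validation losses.

    Training halts at the first round where the best loss seen so far is
    `patience` rounds old. If that never happens, stop_round is the last round.
    """
    best_loss = float('inf')
    best_round = -1
    for round_idx, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    return len(losses) - 1, best_round

improving = [0.66, 0.64, 0.61, 0.59]       # keeps improving: never stops early
stalled = [0.50, 0.40, 0.41, 0.42, 0.43]   # best at round 1, then no progress

print(early_stopping_round(improving, patience=2))
print(early_stopping_round(stalled, patience=2))
```

With `patience=2`, the stalled sequence stops at round 3 because the best loss (round 1) is by then two rounds old.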

In [38]:
clf = GridSearchCV(xgb, parameters, cv=3, scoring='accuracy') # cv=3 means 3-fold cross-validation
clf.fit(X_train, Y_train, **fit_params)

Parameters: { "nthreads" } are not used.

[0]	validation_0-logloss:0.66442
[1]	validation_0-logloss:0.63840
[2]	validation_0-logloss:0.61483
[3]	validation_0-logloss:0.59342
[4]	validation_0-logloss:0.57382
[5]	validation_0-logloss:0.55571
[6]	validation_0-logloss:0.53916
[7]	validation_0-logloss:0.52378
[8]	validation_0-logloss:0.50958
[9]	validation_0-logloss:0.49645
[10]	validation_0-logloss:0.48434
[11]	validation_0-logloss:0.47305
[12]	validation_0-logloss:0.46260
[13]	validation_0-logloss:0.45278
[14]	validation_0-logloss:0.44379
[15]	validation_0-logloss:0.43508
[16]	validation_0-logloss:0.42719
[17]	validation_0-logloss:0.41983




[18]	validation_0-logloss:0.41269
[19]	validation_0-logloss:0.40615
[20]	validation_0-logloss:0.40019
[21]	validation_0-logloss:0.39433
[22]	validation_0-logloss:0.38909
[23]	validation_0-logloss:0.38407
[24]	validation_0-logloss:0.37937
[25]	validation_0-logloss:0.37486
[26]	validation_0-logloss:0.37057
[27]	validation_0-logloss:0.36650
[28]	validation_0-logloss:0.36281
[29]	validation_0-logloss:0.35935
[30]	validation_0-logloss:0.35601
[31]	validation_0-logloss:0.35284
[32]	validation_0-logloss:0.34978
[33]	validation_0-logloss:0.34692
[34]	validation_0-logloss:0.34438
[35]	validation_0-logloss:0.34180
[36]	validation_0-logloss:0.33954
[37]	validation_0-logloss:0.33734
[38]	validation_0-logloss:0.33517
[39]	validation_0-logloss:0.33314
[40]	validation_0-logloss:0.33127
[41]	validation_0-logloss:0.32957
[42]	validation_0-logloss:0.32783
[43]	validation_0-logloss:0.32615
[44]	validation_0-logloss:0.32464
[45]	validation_0-logloss:0.32313
[46]	validation_0-logloss:0.32168
[47]	validatio



[19]	validation_0-logloss:0.40733
[20]	validation_0-logloss:0.40112
[21]	validation_0-logloss:0.39549
[22]	validation_0-logloss:0.39015
[23]	validation_0-logloss:0.38519
[24]	validation_0-logloss:0.38036
[25]	validation_0-logloss:0.37590
[26]	validation_0-logloss:0.37161
[27]	validation_0-logloss:0.36760
[28]	validation_0-logloss:0.36373
[29]	validation_0-logloss:0.36032
[30]	validation_0-logloss:0.35712
[31]	validation_0-logloss:0.35387
[32]	validation_0-logloss:0.35093
[33]	validation_0-logloss:0.34804
[34]	validation_0-logloss:0.34550
[35]	validation_0-logloss:0.34277
[36]	validation_0-logloss:0.34051
[37]	validation_0-logloss:0.33839
[38]	validation_0-logloss:0.33619
[39]	validation_0-logloss:0.33401
[40]	validation_0-logloss:0.33199
[41]	validation_0-logloss:0.33029
[42]	validation_0-logloss:0.32854
[43]	validation_0-logloss:0.32677
[44]	validation_0-logloss:0.32501
[45]	validation_0-logloss:0.32338
[46]	validation_0-logloss:0.32207
[47]	validation_0-logloss:0.32073
[48]	validatio



[18]	validation_0-logloss:0.41308
[19]	validation_0-logloss:0.40648
[20]	validation_0-logloss:0.40053
[21]	validation_0-logloss:0.39480
[22]	validation_0-logloss:0.38934
[23]	validation_0-logloss:0.38427
[24]	validation_0-logloss:0.37961
[25]	validation_0-logloss:0.37524
[26]	validation_0-logloss:0.37095
[27]	validation_0-logloss:0.36696
[28]	validation_0-logloss:0.36320
[29]	validation_0-logloss:0.35976
[30]	validation_0-logloss:0.35646
[31]	validation_0-logloss:0.35329
[32]	validation_0-logloss:0.35038
[33]	validation_0-logloss:0.34765
[34]	validation_0-logloss:0.34514
[35]	validation_0-logloss:0.34256
[36]	validation_0-logloss:0.34031
[37]	validation_0-logloss:0.33823
[38]	validation_0-logloss:0.33603
[39]	validation_0-logloss:0.33397
[40]	validation_0-logloss:0.33206
[41]	validation_0-logloss:0.33034
[42]	validation_0-logloss:0.32866
[43]	validation_0-logloss:0.32696
[44]	validation_0-logloss:0.32541
[45]	validation_0-logloss:0.32401
[46]	validation_0-logloss:0.32249
[47]	validatio



[17]	validation_0-logloss:0.41983
[18]	validation_0-logloss:0.41269
[19]	validation_0-logloss:0.40615
[20]	validation_0-logloss:0.40019
[21]	validation_0-logloss:0.39433
[22]	validation_0-logloss:0.38909
[23]	validation_0-logloss:0.38407
[24]	validation_0-logloss:0.37937
[25]	validation_0-logloss:0.37486
[26]	validation_0-logloss:0.37057
[27]	validation_0-logloss:0.36650
[28]	validation_0-logloss:0.36281
[29]	validation_0-logloss:0.35935
[30]	validation_0-logloss:0.35601
[31]	validation_0-logloss:0.35284
[32]	validation_0-logloss:0.34978
[33]	validation_0-logloss:0.34692
[34]	validation_0-logloss:0.34438
[35]	validation_0-logloss:0.34180
[36]	validation_0-logloss:0.33954
[37]	validation_0-logloss:0.33734
[38]	validation_0-logloss:0.33517
[39]	validation_0-logloss:0.33314
[40]	validation_0-logloss:0.33127
[41]	validation_0-logloss:0.32957
[42]	validation_0-logloss:0.32783
[43]	validation_0-logloss:0.32615
[44]	validation_0-logloss:0.32464
[45]	validation_0-logloss:0.32313
[46]	validatio



[18]	validation_0-logloss:0.41368
[19]	validation_0-logloss:0.40733
[20]	validation_0-logloss:0.40112
[21]	validation_0-logloss:0.39549
[22]	validation_0-logloss:0.39015
[23]	validation_0-logloss:0.38519
[24]	validation_0-logloss:0.38036
[25]	validation_0-logloss:0.37590
[26]	validation_0-logloss:0.37161
[27]	validation_0-logloss:0.36760
[28]	validation_0-logloss:0.36373
[29]	validation_0-logloss:0.36032
[30]	validation_0-logloss:0.35712
[31]	validation_0-logloss:0.35387
[32]	validation_0-logloss:0.35093
[33]	validation_0-logloss:0.34804
[34]	validation_0-logloss:0.34550
[35]	validation_0-logloss:0.34277
[36]	validation_0-logloss:0.34051
[37]	validation_0-logloss:0.33839
[38]	validation_0-logloss:0.33619
[39]	validation_0-logloss:0.33401
[40]	validation_0-logloss:0.33199
[41]	validation_0-logloss:0.33029
[42]	validation_0-logloss:0.32854
[43]	validation_0-logloss:0.32677
[44]	validation_0-logloss:0.32501
[45]	validation_0-logloss:0.32338
[46]	validation_0-logloss:0.32207
[47]	validatio



[18]	validation_0-logloss:0.41308
[19]	validation_0-logloss:0.40648
[20]	validation_0-logloss:0.40053
[21]	validation_0-logloss:0.39480
[22]	validation_0-logloss:0.38934
[23]	validation_0-logloss:0.38427
[24]	validation_0-logloss:0.37961
[25]	validation_0-logloss:0.37524
[26]	validation_0-logloss:0.37095
[27]	validation_0-logloss:0.36696
[28]	validation_0-logloss:0.36320
[29]	validation_0-logloss:0.35976
[30]	validation_0-logloss:0.35646
[31]	validation_0-logloss:0.35329
[32]	validation_0-logloss:0.35038
[33]	validation_0-logloss:0.34765
[34]	validation_0-logloss:0.34514
[35]	validation_0-logloss:0.34256
[36]	validation_0-logloss:0.34031
[37]	validation_0-logloss:0.33823
[38]	validation_0-logloss:0.33603
[39]	validation_0-logloss:0.33397
[40]	validation_0-logloss:0.33206
[41]	validation_0-logloss:0.33034
[42]	validation_0-logloss:0.32866
[43]	validation_0-logloss:0.32696
[44]	validation_0-logloss:0.32541
[45]	validation_0-logloss:0.32401
[46]	validation_0-logloss:0.32249
[47]	validatio



[18]	validation_0-logloss:0.33637
[19]	validation_0-logloss:0.33200
[20]	validation_0-logloss:0.32845
[21]	validation_0-logloss:0.32511
[22]	validation_0-logloss:0.32201
[23]	validation_0-logloss:0.31926
[24]	validation_0-logloss:0.31690
[25]	validation_0-logloss:0.31472
[26]	validation_0-logloss:0.31293
[27]	validation_0-logloss:0.31107
[28]	validation_0-logloss:0.30930
[29]	validation_0-logloss:0.30793
[30]	validation_0-logloss:0.30672
[31]	validation_0-logloss:0.30536
[32]	validation_0-logloss:0.30426
[33]	validation_0-logloss:0.30354
[34]	validation_0-logloss:0.30271
[35]	validation_0-logloss:0.30170
[36]	validation_0-logloss:0.30096
[37]	validation_0-logloss:0.30038
[38]	validation_0-logloss:0.29977
[39]	validation_0-logloss:0.29917
[40]	validation_0-logloss:0.29870
[41]	validation_0-logloss:0.29797
[42]	validation_0-logloss:0.29751
[43]	validation_0-logloss:0.29692
[44]	validation_0-logloss:0.29650
[45]	validation_0-logloss:0.29587
[46]	validation_0-logloss:0.29530
[47]	validatio



[17]	validation_0-logloss:0.34176
[18]	validation_0-logloss:0.33751
[19]	validation_0-logloss:0.33313
[20]	validation_0-logloss:0.32920
[21]	validation_0-logloss:0.32623
[22]	validation_0-logloss:0.32296
[23]	validation_0-logloss:0.32009
[24]	validation_0-logloss:0.31764
[25]	validation_0-logloss:0.31566
[26]	validation_0-logloss:0.31409
[27]	validation_0-logloss:0.31215
[28]	validation_0-logloss:0.31065
[29]	validation_0-logloss:0.30962
[30]	validation_0-logloss:0.30803
[31]	validation_0-logloss:0.30696
[32]	validation_0-logloss:0.30613
[33]	validation_0-logloss:0.30544
[34]	validation_0-logloss:0.30462
[35]	validation_0-logloss:0.30396
[36]	validation_0-logloss:0.30285
[37]	validation_0-logloss:0.30187
[38]	validation_0-logloss:0.30126
[39]	validation_0-logloss:0.30074
[40]	validation_0-logloss:0.30047
[41]	validation_0-logloss:0.29970
[42]	validation_0-logloss:0.29927
[43]	validation_0-logloss:0.29882
[44]	validation_0-logloss:0.29855
[45]	validation_0-logloss:0.29793
[46]	validatio



[16]	validation_0-logloss:0.34641
[17]	validation_0-logloss:0.34155
[18]	validation_0-logloss:0.33695
[19]	validation_0-logloss:0.33272
[20]	validation_0-logloss:0.32897
[21]	validation_0-logloss:0.32567
[22]	validation_0-logloss:0.32266
[23]	validation_0-logloss:0.31992
[24]	validation_0-logloss:0.31770
[25]	validation_0-logloss:0.31561
[26]	validation_0-logloss:0.31359
[27]	validation_0-logloss:0.31185
[28]	validation_0-logloss:0.30983
[29]	validation_0-logloss:0.30864
[30]	validation_0-logloss:0.30746
[31]	validation_0-logloss:0.30643
[32]	validation_0-logloss:0.30559
[33]	validation_0-logloss:0.30434
[34]	validation_0-logloss:0.30324
[35]	validation_0-logloss:0.30252
[36]	validation_0-logloss:0.30194
[37]	validation_0-logloss:0.30105
[38]	validation_0-logloss:0.30014
[39]	validation_0-logloss:0.29937
[40]	validation_0-logloss:0.29876
[41]	validation_0-logloss:0.29835
[42]	validation_0-logloss:0.29777
[43]	validation_0-logloss:0.29716
[44]	validation_0-logloss:0.29692
[45]	validatio



[18]	validation_0-logloss:0.33637
[19]	validation_0-logloss:0.33200
[20]	validation_0-logloss:0.32845
[21]	validation_0-logloss:0.32511
[22]	validation_0-logloss:0.32201
[23]	validation_0-logloss:0.31926
[24]	validation_0-logloss:0.31690
[25]	validation_0-logloss:0.31472
[26]	validation_0-logloss:0.31293
[27]	validation_0-logloss:0.31107
[28]	validation_0-logloss:0.30930
[29]	validation_0-logloss:0.30793
[30]	validation_0-logloss:0.30672
[31]	validation_0-logloss:0.30536
[32]	validation_0-logloss:0.30426
[33]	validation_0-logloss:0.30354
[34]	validation_0-logloss:0.30271
[35]	validation_0-logloss:0.30170
[36]	validation_0-logloss:0.30096
[37]	validation_0-logloss:0.30038
[38]	validation_0-logloss:0.29977
[39]	validation_0-logloss:0.29917
[40]	validation_0-logloss:0.29870
[41]	validation_0-logloss:0.29797
[42]	validation_0-logloss:0.29751
[43]	validation_0-logloss:0.29692
[44]	validation_0-logloss:0.29650
[45]	validation_0-logloss:0.29587
[46]	validation_0-logloss:0.29530
[47]	validatio



[18]	validation_0-logloss:0.33751
[19]	validation_0-logloss:0.33313
[20]	validation_0-logloss:0.32920
[21]	validation_0-logloss:0.32623
[22]	validation_0-logloss:0.32296
[23]	validation_0-logloss:0.32009
[24]	validation_0-logloss:0.31764
[25]	validation_0-logloss:0.31566
[26]	validation_0-logloss:0.31409
[27]	validation_0-logloss:0.31215
[28]	validation_0-logloss:0.31065
[29]	validation_0-logloss:0.30962
[30]	validation_0-logloss:0.30803
[31]	validation_0-logloss:0.30696
[32]	validation_0-logloss:0.30613
[33]	validation_0-logloss:0.30544
[34]	validation_0-logloss:0.30462
[35]	validation_0-logloss:0.30396
[36]	validation_0-logloss:0.30285
[37]	validation_0-logloss:0.30187
[38]	validation_0-logloss:0.30126
[39]	validation_0-logloss:0.30074
[40]	validation_0-logloss:0.30047
[41]	validation_0-logloss:0.29970
[42]	validation_0-logloss:0.29927
[43]	validation_0-logloss:0.29882
[44]	validation_0-logloss:0.29855
[45]	validation_0-logloss:0.29793
[46]	validation_0-logloss:0.29772
[47]	validatio



[15]	validation_0-logloss:0.35201
[16]	validation_0-logloss:0.34641
[17]	validation_0-logloss:0.34155
[18]	validation_0-logloss:0.33695
[19]	validation_0-logloss:0.33272
[20]	validation_0-logloss:0.32897
[21]	validation_0-logloss:0.32567
[22]	validation_0-logloss:0.32266
[23]	validation_0-logloss:0.31992
[24]	validation_0-logloss:0.31770
[25]	validation_0-logloss:0.31561
[26]	validation_0-logloss:0.31359
[27]	validation_0-logloss:0.31185
[28]	validation_0-logloss:0.30983
[29]	validation_0-logloss:0.30864
[30]	validation_0-logloss:0.30746
[31]	validation_0-logloss:0.30643
[32]	validation_0-logloss:0.30559
[33]	validation_0-logloss:0.30434
[34]	validation_0-logloss:0.30324
[35]	validation_0-logloss:0.30252
[36]	validation_0-logloss:0.30194
[37]	validation_0-logloss:0.30105
[38]	validation_0-logloss:0.30014
[39]	validation_0-logloss:0.29937
[40]	validation_0-logloss:0.29876
[41]	validation_0-logloss:0.29835
[42]	validation_0-logloss:0.29777
[43]	validation_0-logloss:0.29716
[44]	validatio



[16]	validation_0-logloss:0.42644
[17]	validation_0-logloss:0.41884
[18]	validation_0-logloss:0.41187
[19]	validation_0-logloss:0.40531
[20]	validation_0-logloss:0.39902
[21]	validation_0-logloss:0.39337
[22]	validation_0-logloss:0.38786
[23]	validation_0-logloss:0.38270
[24]	validation_0-logloss:0.37804
[25]	validation_0-logloss:0.37346
[26]	validation_0-logloss:0.36915
[27]	validation_0-logloss:0.36517
[28]	validation_0-logloss:0.36151
[29]	validation_0-logloss:0.35794
[30]	validation_0-logloss:0.35467
[31]	validation_0-logloss:0.35158
[32]	validation_0-logloss:0.34860
[33]	validation_0-logloss:0.34565
[34]	validation_0-logloss:0.34298
[35]	validation_0-logloss:0.34052
[36]	validation_0-logloss:0.33825
[37]	validation_0-logloss:0.33583
[38]	validation_0-logloss:0.33377
[39]	validation_0-logloss:0.33163
[40]	validation_0-logloss:0.32958
[41]	validation_0-logloss:0.32771
[42]	validation_0-logloss:0.32603
[43]	validation_0-logloss:0.32450
[44]	validation_0-logloss:0.32285
[45]	validatio

In [39]:
clf.best_estimator_

That is the model with the best combination of parameters.

In [40]:
clf.best_score_ # Highest mean cross-validated accuracy among all parameter combinations

0.8678133611474648

### 8. Validating the model

In [41]:
from sklearn.metrics import accuracy_score

In [42]:
best_xgb = clf.best_estimator_ # This is the model we will evaluate

In [43]:
Y_predict = best_xgb.predict(X_valid)

Let's compare the predicted Y with the real one:

In [44]:
comp = pd.DataFrame({'real': Y_valid, 'predicción' : Y_predict})

In [45]:
comp.head(20) # Look at the first 20

Unnamed: 0,real,predicción
22206,0,0
15112,0,0
4426,0,0
4960,0,0
32157,0,0
17247,0,0
12322,1,0
625,0,0
13594,1,1
27391,0,0


As expected, some are predicted correctly but not all.

Finally, let's evaluate its accuracy:

In [46]:
accuracy_score(Y_valid, Y_predict)

0.8788697788697789
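Accuracy is simply the fraction of matching predictions, and a confusion matrix shows which class the errors fall on. The sketch below uses only the ten (real, prediction) pairs visible in the comparison table above, so its numbers describe that small sample and not the full validation set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# The ten (real, prediction) pairs shown in the comparison table.
real = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])
pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# Accuracy by hand: share of rows where prediction equals the real label.
manual_accuracy = (real == pred).mean()
print(manual_accuracy, accuracy_score(real, pred))

# Rows are the real class, columns the predicted class.
cm = confusion_matrix(real, pred)
print(cm)
```

In this sample the single error is a real 1 predicted as 0, which accuracy alone would not reveal.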

This might improve with a larger number of trees (n_estimators); let's try 500.

In [47]:
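parameters = {'nthreads': [1],  # XGBoost reports this key as unused (see the warning in the training output); the scikit-learn wrapper's thread setting is 'n_jobs'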
             'objective': ['binary:logistic'],
             'learning_rate': [0.05,0.1],
             'n_estimators': [500]}

clf = GridSearchCV(xgb, parameters, cv=3, scoring='accuracy') # cv=3 means 3-fold cross-validation
clf.fit(X_train, Y_train, **fit_params)

best_xgb = clf.best_estimator_ # This is the model we will evaluate
Y_predict = best_xgb.predict(X_valid)

Parameters: { "nthreads" } are not used.

[0]	validation_0-logloss:0.66442
[1]	validation_0-logloss:0.63840
[2]	validation_0-logloss:0.61483
[3]	validation_0-logloss:0.59342
[4]	validation_0-logloss:0.57382
[5]	validation_0-logloss:0.55571
[6]	validation_0-logloss:0.53916
[7]	validation_0-logloss:0.52378
[8]	validation_0-logloss:0.50958
[9]	validation_0-logloss:0.49645
[10]	validation_0-logloss:0.48434
[11]	validation_0-logloss:0.47305
[12]	validation_0-logloss:0.46260
[13]	validation_0-logloss:0.45278
[14]	validation_0-logloss:0.44379
[15]	validation_0-logloss:0.43508
[16]	validation_0-logloss:0.42719




[17]	validation_0-logloss:0.41983
[18]	validation_0-logloss:0.41269
[19]	validation_0-logloss:0.40615
[20]	validation_0-logloss:0.40019
[21]	validation_0-logloss:0.39433
[22]	validation_0-logloss:0.38909
[23]	validation_0-logloss:0.38407
[24]	validation_0-logloss:0.37937
[25]	validation_0-logloss:0.37486
[26]	validation_0-logloss:0.37057
[27]	validation_0-logloss:0.36650
[28]	validation_0-logloss:0.36281
[29]	validation_0-logloss:0.35935
[30]	validation_0-logloss:0.35601
[31]	validation_0-logloss:0.35284
[32]	validation_0-logloss:0.34978
[33]	validation_0-logloss:0.34692
[34]	validation_0-logloss:0.34438
[35]	validation_0-logloss:0.34180
[36]	validation_0-logloss:0.33954
[37]	validation_0-logloss:0.33734
[38]	validation_0-logloss:0.33517
[39]	validation_0-logloss:0.33314
[40]	validation_0-logloss:0.33127
[41]	validation_0-logloss:0.32957
[42]	validation_0-logloss:0.32783
[43]	validation_0-logloss:0.32615
[44]	validation_0-logloss:0.32464
[45]	validation_0-logloss:0.32313
[46]	validatio



[19]	validation_0-logloss:0.40733
[20]	validation_0-logloss:0.40112
[21]	validation_0-logloss:0.39549
[22]	validation_0-logloss:0.39015
[23]	validation_0-logloss:0.38519
[24]	validation_0-logloss:0.38036
[25]	validation_0-logloss:0.37590
[26]	validation_0-logloss:0.37161
[27]	validation_0-logloss:0.36760
[28]	validation_0-logloss:0.36373
[29]	validation_0-logloss:0.36032
[30]	validation_0-logloss:0.35712
[31]	validation_0-logloss:0.35387
[32]	validation_0-logloss:0.35093
[33]	validation_0-logloss:0.34804
[34]	validation_0-logloss:0.34550
[35]	validation_0-logloss:0.34277
[36]	validation_0-logloss:0.34051
[37]	validation_0-logloss:0.33839
[38]	validation_0-logloss:0.33619
[39]	validation_0-logloss:0.33401
[40]	validation_0-logloss:0.33199
[41]	validation_0-logloss:0.33029
[42]	validation_0-logloss:0.32854
[43]	validation_0-logloss:0.32677
[44]	validation_0-logloss:0.32501
[45]	validation_0-logloss:0.32338
[46]	validation_0-logloss:0.32207
[47]	validation_0-logloss:0.32073
[48]	validatio



[18]	validation_0-logloss:0.41308
[19]	validation_0-logloss:0.40648
[20]	validation_0-logloss:0.40053
[21]	validation_0-logloss:0.39480
[22]	validation_0-logloss:0.38934
[23]	validation_0-logloss:0.38427
[24]	validation_0-logloss:0.37961
[25]	validation_0-logloss:0.37524
[26]	validation_0-logloss:0.37095
[27]	validation_0-logloss:0.36696
[28]	validation_0-logloss:0.36320
[29]	validation_0-logloss:0.35976
[30]	validation_0-logloss:0.35646
[31]	validation_0-logloss:0.35329
[32]	validation_0-logloss:0.35038
[33]	validation_0-logloss:0.34765
[34]	validation_0-logloss:0.34514
[35]	validation_0-logloss:0.34256
[36]	validation_0-logloss:0.34031
[37]	validation_0-logloss:0.33823
[38]	validation_0-logloss:0.33603
[39]	validation_0-logloss:0.33397
[40]	validation_0-logloss:0.33206
[41]	validation_0-logloss:0.33034
[42]	validation_0-logloss:0.32866
[43]	validation_0-logloss:0.32696
[44]	validation_0-logloss:0.32541
[45]	validation_0-logloss:0.32401
[46]	validation_0-logloss:0.32249

[20]	validation_0-logloss:0.32845
[21]	validation_0-logloss:0.32511
[22]	validation_0-logloss:0.32201
[23]	validation_0-logloss:0.31926
[24]	validation_0-logloss:0.31690
[25]	validation_0-logloss:0.31472
[26]	validation_0-logloss:0.31293
[27]	validation_0-logloss:0.31107
[28]	validation_0-logloss:0.30930
[29]	validation_0-logloss:0.30793
[30]	validation_0-logloss:0.30672
[31]	validation_0-logloss:0.30536
[32]	validation_0-logloss:0.30426
[33]	validation_0-logloss:0.30354
[34]	validation_0-logloss:0.30271
[35]	validation_0-logloss:0.30170
[36]	validation_0-logloss:0.30096
[37]	validation_0-logloss:0.30038
[38]	validation_0-logloss:0.29977
[39]	validation_0-logloss:0.29917
[40]	validation_0-logloss:0.29870
[41]	validation_0-logloss:0.29797
[42]	validation_0-logloss:0.29751
[43]	validation_0-logloss:0.29692
[44]	validation_0-logloss:0.29650
[45]	validation_0-logloss:0.29587
[46]	validation_0-logloss:0.29530
[47]	validation_0-logloss:0.29476
[48]	validation_0-logloss:0.29427

[17]	validation_0-logloss:0.34176
[18]	validation_0-logloss:0.33751
[19]	validation_0-logloss:0.33313
[20]	validation_0-logloss:0.32920
[21]	validation_0-logloss:0.32623
[22]	validation_0-logloss:0.32296
[23]	validation_0-logloss:0.32009
[24]	validation_0-logloss:0.31764
[25]	validation_0-logloss:0.31566
[26]	validation_0-logloss:0.31409
[27]	validation_0-logloss:0.31215
[28]	validation_0-logloss:0.31065
[29]	validation_0-logloss:0.30962
[30]	validation_0-logloss:0.30803
[31]	validation_0-logloss:0.30696
[32]	validation_0-logloss:0.30613
[33]	validation_0-logloss:0.30544
[34]	validation_0-logloss:0.30462
[35]	validation_0-logloss:0.30396
[36]	validation_0-logloss:0.30285
[37]	validation_0-logloss:0.30187
[38]	validation_0-logloss:0.30126
[39]	validation_0-logloss:0.30074
[40]	validation_0-logloss:0.30047
[41]	validation_0-logloss:0.29970
[42]	validation_0-logloss:0.29927
[43]	validation_0-logloss:0.29882
[44]	validation_0-logloss:0.29855
[45]	validation_0-logloss:0.29793

[18]	validation_0-logloss:0.33695
[19]	validation_0-logloss:0.33272
[20]	validation_0-logloss:0.32897
[21]	validation_0-logloss:0.32567
[22]	validation_0-logloss:0.32266
[23]	validation_0-logloss:0.31992
[24]	validation_0-logloss:0.31770
[25]	validation_0-logloss:0.31561
[26]	validation_0-logloss:0.31359
[27]	validation_0-logloss:0.31185
[28]	validation_0-logloss:0.30983
[29]	validation_0-logloss:0.30864
[30]	validation_0-logloss:0.30746
[31]	validation_0-logloss:0.30643
[32]	validation_0-logloss:0.30559
[33]	validation_0-logloss:0.30434
[34]	validation_0-logloss:0.30324
[35]	validation_0-logloss:0.30252
[36]	validation_0-logloss:0.30194
[37]	validation_0-logloss:0.30105
[38]	validation_0-logloss:0.30014
[39]	validation_0-logloss:0.29937
[40]	validation_0-logloss:0.29876
[41]	validation_0-logloss:0.29835
[42]	validation_0-logloss:0.29777
[43]	validation_0-logloss:0.29716
[44]	validation_0-logloss:0.29692
[45]	validation_0-logloss:0.29650
[46]	validation_0-logloss:0.29621

[14]	validation_0-logloss:0.44298
[15]	validation_0-logloss:0.43434
[16]	validation_0-logloss:0.42644
[17]	validation_0-logloss:0.41884
[18]	validation_0-logloss:0.41187
[19]	validation_0-logloss:0.40531
[20]	validation_0-logloss:0.39902
[21]	validation_0-logloss:0.39337
[22]	validation_0-logloss:0.38786
[23]	validation_0-logloss:0.38270
[24]	validation_0-logloss:0.37804
[25]	validation_0-logloss:0.37346
[26]	validation_0-logloss:0.36915
[27]	validation_0-logloss:0.36517
[28]	validation_0-logloss:0.36151
[29]	validation_0-logloss:0.35794
[30]	validation_0-logloss:0.35467
[31]	validation_0-logloss:0.35158
[32]	validation_0-logloss:0.34860
[33]	validation_0-logloss:0.34565
[34]	validation_0-logloss:0.34298
[35]	validation_0-logloss:0.34052
[36]	validation_0-logloss:0.33825
[37]	validation_0-logloss:0.33583
[38]	validation_0-logloss:0.33377
[39]	validation_0-logloss:0.33163
[40]	validation_0-logloss:0.32958
[41]	validation_0-logloss:0.32771
[42]	validation_0-logloss:0.32603

In [48]:
accuracy_score(Y_valid, Y_predict) #Fraction of validation samples whose predicted class matches the true class

0.878992628992629

The validation accuracy did indeed improve slightly, reaching about 0.879.
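To make the number above concrete, here is a minimal sketch of what an accuracy score measures for class labels: the fraction of predictions that equal the true label. The arrays and the `manual_accuracy` helper are hypothetical and are not taken from the notebook's data or model.

```python
import numpy as np

# Hypothetical toy labels, not the notebook's Y_valid / Y_predict
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

def manual_accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # The element-wise comparison gives booleans; their mean is the share of correct predictions
    return float(np.mean(y_true == y_pred))

acc = manual_accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 toy predictions match
```

Read this way, a score of about 0.879 means that roughly 88 out of every 100 validation samples were classified correctly.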