## Machine Learning 
### Meta-classifiers and Parameter Optimization


This example trains a Random Forest classifier and tunes its
hyperparameters with a cross-validated search.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We will use the dataset on telephone marketing of banking products. For simplicity, preprocessing is done 
directly with the *get_dummies* function.

In [2]:
bank_marketing = pd.read_csv('../data/bank.csv', sep=';')

In [3]:
from sklearn import preprocessing

In [4]:
bank_marketing.head()

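|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|-----|-----|---------|-----------|---------|---------|---------|------|---------|-----|-------|----------|----------|-------|----------|----------|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |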


In [5]:
raw_features = bank_marketing.drop(columns='y')
features = pd.get_dummies(raw_features)
target = bank_marketing.y

In [6]:
features.columns

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')
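To make the effect of *get_dummies* easier to see, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not taken from `bank.csv`; the point is only that numeric columns pass through unchanged while each category of a text column becomes its own indicator column, which is why the column list above grows from 16 raw features to 51.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "single", "married"],
})

# Numeric columns are kept as-is; every category of a text column
# is expanded into a separate indicator column named <column>_<category>.
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```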

For parameter optimization we will run a grid search over a 
parameter space, evaluating a Random Forest on each candidate with cross-validation.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score

In [8]:
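# Note: test_size=0.7 holds out 70% of the rows for testing,
# so the grid search below only sees the remaining 30%.
# stratify keeps the class proportions of the target in both splits.
train_x, test_x, train_y, test_y = train_test_split(features.values,
                                                    target.values,
                                                    test_size=0.7,
                                                    stratify=target.values,
                                                    random_state=11
                                                    )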
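The `stratify` argument matters when the target is imbalanced. The following sketch, which uses synthetic labels rather than the bank data (an assumption made so that it runs on its own), splits with the same `test_size` and `random_state` as above and compares the share of the minority class in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical): 90 "no" and 10 "yes".
labels = np.array(["no"] * 90 + ["yes"] * 10)
dummy_x = np.arange(100).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    dummy_x, labels, test_size=0.7, stratify=labels, random_state=11
)

# With stratify, both splits keep roughly the original share of "yes".
train_yes_share = np.mean(y_train == "yes")
test_yes_share = np.mean(y_test == "yes")
print(len(y_train), len(y_test))
print(train_yes_share, test_yes_share)
```

Without `stratify`, a small training split could by chance contain very few positive examples, which would make the cross-validation scores less reliable.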

We build the estimator, choosing the number of trees. The available parameters can be inspected with the *get_params* method.

In [9]:
rforest = RandomForestClassifier(n_estimators=20)

In [10]:
rforest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 20,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We define a parameter space. Each trial corresponds to one combination of these candidate values.

In [11]:
param_grid = {
    'max_features': [2, 3, 5, 8],
    'n_estimators': [20, 50, 100]
}
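As a quick check of how many trials this grid implies, the combinations can be enumerated with scikit-learn's `ParameterGrid`. This is only a sketch for inspection; `GridSearchCV` builds the same Cartesian product internally, so the step is not required.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_features': [2, 3, 5, 8],
    'n_estimators': [20, 50, 100],
}

# ParameterGrid enumerates the Cartesian product of the value lists:
# 4 values of max_features x 3 values of n_estimators.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))
for candidate in candidates[:3]:
    print(candidate)
```

With 12 candidates and 5-fold cross-validation, the search fits 60 models, plus one final refit of the best candidate on the full training split.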

In [12]:
grid_cv = GridSearchCV(estimator = rforest, 
                       param_grid = param_grid, 
                       cv = 5, 
                       verbose=3)

In [13]:
grid_cv.fit(train_x, train_y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 1/5] END ................max_features=2, n_estimators=20; total time=   0.0s
[CV 2/5] END ................max_features=2, n_estimators=20; total time=   0.0s
[CV 3/5] END ................max_features=2, n_estimators=20; total time=   0.0s
[CV 4/5] END ................max_features=2, n_estimators=20; total time=   0.0s
[CV 5/5] END ................max_features=2, n_estimators=20; total time=   0.0s
[CV 1/5] END ................max_features=2, n_estimators=50; total time=   0.1s
[CV 2/5] END ................max_features=2, n_estimators=50; total time=   0.1s
[CV 3/5] END ................max_features=2, n_estimators=50; total time=   0.1s
[CV 4/5] END ................max_features=2, n_estimators=50; total time=   0.1s
[CV 5/5] END ................max_features=2, n_estimators=50; total time=   0.1s
[CV 1/5] END ...............max_features=2, n_estimators=100; total time=   0.1s
[CV 2/5] END ...............max_features=2, n_es

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_estimators=20),
             param_grid={'max_features': [2, 3, 5, 8],
                         'n_estimators': [20, 50, 100]},
             verbose=3)

We identify the best combination of parameters:

In [14]:
grid_cv.best_params_

{'max_features': 5, 'n_estimators': 50}
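Beyond the single best combination, the fitted search object exposes per-candidate scores in `cv_results_`, which helps judge how close the alternatives were. The sketch below uses synthetic data from `make_classification` and a smaller grid (both assumptions, so that it runs without `bank.csv`); the variable name `search` is illustrative and plays the role of `grid_cv`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the bank marketing dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_features': [2, 5], 'n_estimators': [10, 20]},
    cv=3,
)
search.fit(X, y)

# One row per candidate, sorted so the best-ranked combination comes first.
results = pd.DataFrame(search.cv_results_)[
    ['param_max_features', 'param_n_estimators',
     'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

If several candidates have mean scores within one standard deviation of each other, the cheaper one (fewer trees) may be a reasonable choice.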

The best model, refit on the whole training split, can be extracted as follows:

In [15]:
best_rf = grid_cv.best_estimator_

Evaluation on the test set

In [16]:
pred_y = best_rf.predict(test_x)

In [17]:
accuracy_score(test_y, pred_y)

0.8938388625592417
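Accuracy alone can look high when one class dominates the target, so it is worth comparing it with a majority-class baseline and with a ranking metric such as `roc_auc_score`, which the notebook imports but does not use. The sketch below is not part of the original analysis: it uses synthetic imbalanced data (an assumption) and illustrative variable names, and only shows how the three numbers could be computed side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption): about 90% negatives.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Accuracy uses hard labels; ROC AUC uses the predicted probability of class 1.
acc = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Baseline: accuracy obtained by always predicting the majority class.
majority_baseline = max((y_test == 0).mean(), (y_test == 1).mean())
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}  baseline={majority_baseline:.3f}")
```

On the bank data the same idea would apply to `best_rf`, using `best_rf.predict_proba(test_x)` and the column that corresponds to the positive label in `best_rf.classes_`.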

___