# *Grid Search & Pipelines*
*Grid Search* is an optimization method for tuning the hyperparameters of predictive algorithms. We define the set of values to search for each parameter (the *grid*), and the search selects the combination that performs best on our data.

## Method 1
Evaluates a single algorithm over a grid of hyperparameters using cross-validation: for every combination, the dataset is repeatedly split into *train* and validation folds, the score on each validation fold is recorded, and the combination with the best mean metric is kept.
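
To make the procedure concrete, here is a minimal sketch of what a grid search does internally, written as an explicit loop. The grid values (`small_grid`) and the 5-fold setting are illustrative assumptions, not the configuration used in this notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import ParameterGrid, cross_val_score

iris = datasets.load_iris()

# Small illustrative grid (values chosen for this sketch only)
small_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

best_score, best_params = -1.0, None
for params in ParameterGrid(small_grid):
    # Mean accuracy over 5 stratified folds for this combination
    score = cross_val_score(svm.SVC(**params), iris.data, iris.target, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 4))
```

`GridSearchCV` wraps this same loop and additionally refits the winning combination on the full data passed to `fit`.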

In [33]:
import warnings

warnings.filterwarnings("ignore")  # silences every warning, including failed-fit warnings from the searches

In [34]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()

parameters = {
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly'],
    'C': [0.001, 0.1, 0.5, 1, 5, 10, 100],
    'degree': [1,2,3,4,5,6,7],
    'gamma': ['scale', 'auto']
}

svc = svm.SVC()

clf = GridSearchCV(estimator = svc,
                  param_grid = parameters,
                  n_jobs = -1,
                  cv = 10,
                  scoring="accuracy")

clf.fit(iris.data, iris.target)

#### How many algorithms (models) are we training?

In [35]:
print(len(parameters['kernel']), 'distinct values for the kernel parameter.')
print(len(parameters['C']), 'distinct values for the C parameter.')
print(len(parameters['degree']), 'distinct values for the degree parameter.')
print(len(parameters['gamma']), 'distinct values for the gamma parameter.')

print('\nIn total, we are training',len(parameters['kernel'])*len(parameters['C'])*len(parameters['degree'])*len(parameters['gamma']), 'distinct models.')

4 distinct values for the kernel parameter.
7 distinct values for the C parameter.
7 distinct values for the degree parameter.
2 distinct values for the gamma parameter.

In total, we are training 392 distinct models.
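
The count above is the number of parameter combinations; with `cv = 10`, each combination is fitted ten times. The sketch below counts both with `ParameterGrid`. It also shows one possible way to shrink the grid: `degree` only affects the `poly` kernel and `gamma` is not used by the `linear` kernel, so a list of dictionaries avoids redundant combinations. The name `reduced_grid` is an assumption introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

full_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "C": [0.001, 0.1, 0.5, 1, 5, 10, 100],
    "degree": [1, 2, 3, 4, 5, 6, 7],
    "gamma": ["scale", "auto"],
}

C_values = full_grid["C"]
gammas = full_grid["gamma"]

# One dictionary per kernel family, listing only the parameters that kernel uses
reduced_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf", "sigmoid"], "C": C_values, "gamma": gammas},
    {"kernel": ["poly"], "C": C_values, "gamma": gammas, "degree": full_grid["degree"]},
]

cv = 10
n_full = len(ParameterGrid(full_grid))
n_reduced = len(ParameterGrid(reduced_grid))

print(n_full, n_full * cv)        # 392 combinations, 3920 cross-validation fits
print(n_reduced, n_reduced * cv)  # 133 combinations, 1330 cross-validation fits
```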


In [36]:
clf.best_estimator_

In [37]:
print(clf.best_params_)
print(clf.best_score_)

{'C': 0.1, 'degree': 2, 'gamma': 'auto', 'kernel': 'poly'}
0.9866666666666667


In [38]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(C=0.1, degree=2, gamma='auto', kernel='poly')
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
scores

array([1.        , 0.93333333, 1.        , 1.        , 1.        ,
       1.        , 0.93333333, 1.        , 1.        , 1.        ])

In [39]:
import numpy as np
print(np.mean(scores))
print(np.std(scores))

0.9866666666666667
0.026666666666666658
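
The per-fold scores do not have to be recomputed with `cross_val_score`: a fitted `GridSearchCV` keeps them in `cv_results_`. The following sketch uses a deliberately small grid (an assumption made to keep it fast) and reads the mean and standard deviation of the best candidate directly from the search object.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Small illustrative grid; the notebook's grid is larger
search = GridSearchCV(svm.SVC(), {"C": [0.1, 1, 10]}, cv=10, scoring="accuracy")
search.fit(iris.data, iris.target)

i = search.best_index_  # row of cv_results_ that holds the best candidate
mean_best = search.cv_results_["mean_test_score"][i]
std_best = search.cv_results_["std_test_score"][i]

# Individual fold scores of the best candidate
fold_scores = np.array(
    [search.cv_results_[f"split{k}_test_score"][i] for k in range(10)]
)

print(mean_best, std_best)
print(fold_scores)
```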


## Method 2

Let's make it slightly more complex. We now build a single grid search that iterates over several models, each with its own hyperparameters, again using cross-validation.

In [40]:
# Libraries
import numpy as np
from sklearn import datasets, svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split 
# Seed
np.random.seed(0)

In [41]:
# Data
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

In [43]:
pipe = Pipeline(steps=[
    ('classifier', RandomForestClassifier())  # placeholder step; each grid below replaces it through its 'classifier' entry
])

logistic_params = {
    'classifier': [LogisticRegression(max_iter=1000, solver='liblinear')],
    'classifier__penalty': ['l1', 'l2']
}

random_forest_params = {
    'classifier': [RandomForestClassifier()],
    'classifier__max_features': [1,2,3]
}

svm_param = {
    'classifier': [svm.SVC()],
    'classifier__C': [0.001, 0.1, 0.5, 1, 5, 10, 100],
}

# Each of the three algorithms is searched over its own grid
search_space = [
    logistic_params,
    random_forest_params,
    svm_param
]

clf = GridSearchCV(estimator = pipe,  # pipeline whose 'classifier' step gets swapped
                  param_grid = search_space,  # search space: every candidate combination
                  cv = 10)

clf.fit(X_train, y_train)

In [22]:
print(clf.best_estimator_)
print(clf.best_score_)
print(clf.best_params_)

Pipeline(steps=[('classifier',
                 LogisticRegression(max_iter=1000, penalty='l1',
                                    solver='liblinear'))])
0.95
{'classifier': LogisticRegression(max_iter=1000, solver='liblinear'), 'classifier__penalty': 'l1'}
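
`best_estimator_` only reports the overall winner. To see how each model family performed, the candidates in `cv_results_` can be grouped by classifier type. This is a minimal sketch under simplified assumptions: smaller grids, 5 folds, the default logistic regression solver, and a fixed `random_state` for the forest, none of which match the cell above exactly.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = datasets.load_iris(return_X_y=True)

# The initial classifier is only a placeholder; each grid swaps it out
pipe = Pipeline(steps=[("classifier", LogisticRegression())])
search_space = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1, 10]},
    {"classifier": [RandomForestClassifier(n_estimators=20, random_state=0)],
     "classifier__max_features": [1, 2, 3]},
    {"classifier": [svm.SVC()],
     "classifier__C": [0.1, 1, 10]},
]
search = GridSearchCV(pipe, search_space, cv=5).fit(X, y)

# One row per candidate, labelled with the class name of its classifier
results = pd.DataFrame({
    "model": [type(p["classifier"]).__name__ for p in search.cv_results_["params"]],
    "mean_test_score": search.cv_results_["mean_test_score"],
})

# Best cross-validated score reached by each model family
summary = results.groupby("model")["mean_test_score"].max().sort_values(ascending=False)
print(summary)
```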


In [23]:
clf.best_estimator_.predict(X_test)  # predictions: the class with the highest probability

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

In [14]:
clf.best_estimator_.predict_proba(X_test)  # class probabilities

array([[3.99666467e-03, 7.87553967e-01, 2.08449368e-01],
       [8.86897049e-01, 1.13102904e-01, 4.72927392e-08],
       [2.61845526e-06, 4.28513620e-01, 5.71483762e-01],
       [9.79799172e-03, 6.55398347e-01, 3.34803661e-01],
       [2.91039741e-03, 8.67417336e-01, 1.29672267e-01],
       [8.50549122e-01, 1.49450668e-01, 2.09718288e-07],
       [1.55717640e-01, 8.04582109e-01, 3.97002510e-02],
       [1.78034895e-03, 3.02148791e-01, 6.96070860e-01],
       [8.75483583e-04, 5.72790139e-01, 4.26334377e-01],
       [3.40511221e-02, 9.06100672e-01, 5.98482060e-02],
       [2.65343623e-03, 3.10978086e-01, 6.86368478e-01],
       [7.85393077e-01, 2.14606377e-01, 5.45236525e-07],
       [8.49703488e-01, 1.50296480e-01, 3.13954237e-08],
       [7.95690338e-01, 2.04309202e-01, 4.60423345e-07],
       [9.22826437e-01, 7.71734663e-02, 9.67398194e-08],
       [2.31405170e-02, 7.08027746e-01, 2.68831737e-01],
       [1.67712727e-04, 2.60568980e-01, 7.39263307e-01],
       [1.79685886e-02, 8.89521

In [29]:
clf.best_estimator_.score(X_test,y_test)

1.0
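
The relationship between `predict` and `predict_proba` can be checked directly. The sketch below uses a plain `LogisticRegression` with its default solver (an assumption; the search above selected a different configuration) and verifies that the predicted label is the class whose probability is highest.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)  # one row per sample, one column per class
# Map the column with the highest probability back to its class label
labels_from_proba = model.classes_[proba.argmax(axis=1)]

print(np.array_equal(labels_from_proba, model.predict(X_test)))
```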

## Method 3

Another option is to build a dedicated pipeline for each type of model, with its own preprocessing steps.

In [24]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [17]:
reg_log = Pipeline(steps = [
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("reglog", LogisticRegression())
])

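# Parameters for each step, addressed as <step name>__<parameter>
reg_log_param = {
    "imputer__strategy": ['mean', 'median'],  # these values are passed to the pipeline's imputer step
    "reglog__penalty": ['l1', 'l2'],  # note: the default lbfgs solver rejects 'l1'; those fits fail and score NaN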
    "reglog__C": np.logspace(0, 4, 10)
}
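
With the default `lbfgs` solver, `LogisticRegression` does not support the `'l1'` penalty, so candidates that use it cannot be fitted. One possible variant, sketched here as an assumption rather than as the notebook's configuration, is to choose a solver that accepts both penalties (`saga`) and to set `error_score="raise"` so that any incompatible combination fails loudly instead of being scored as NaN. The reduced `C` range and 5 folds are only there to keep the example fast.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    # saga supports both the 'l1' and 'l2' penalties
    ("reglog", LogisticRegression(solver="saga", max_iter=5000)),
])

param_grid = {
    "imputer__strategy": ["mean", "median"],
    "reglog__penalty": ["l1", "l2"],
    "reglog__C": np.logspace(0, 2, 3),
}

# error_score="raise" turns a failed fit into an exception instead of a NaN score
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", error_score="raise")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```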

In [46]:
rand_forest = RandomForestClassifier()
rand_forest_param = {
    "n_estimators": [10, 100, 1000],
    "max_features": [1,2,3]
}


svm = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selectkbest", SelectKBest()),
    ("svm", SVC())
])


svm_param = {
    'selectkbest__k': [2, 3, 4],
    'svm__kernel': ['linear', 'rbf', 'sigmoid', 'poly'],
    'svm__C': [0.001, 0.1, 0.5, 1, 5, 10, 100],
    'svm__degree': [1,2,3,4],
    'svm__gamma': ['scale', 'auto']
}

gs_reg_log = GridSearchCV(reg_log,
                         reg_log_param,
                         cv = 10,
                         scoring = 'accuracy',
                         verbose = 1,
                         n_jobs = -1)

gs_rand_forest = GridSearchCV(rand_forest,
                         rand_forest_param,
                         cv = 10,
                         scoring = 'accuracy',
                         verbose = 1,
                         n_jobs = -1)

gs_svm = GridSearchCV(svm,
                         svm_param,
                         cv = 10,
                         scoring = 'accuracy',
                         verbose = 1,
                         n_jobs = -1)

grids = {"gs_reg_log": gs_reg_log,
        "gs_rand_forest": gs_rand_forest,
        "gs_svm": gs_svm}

In [47]:
from sklearn.model_selection import train_test_split 
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
grids

{'gs_reg_log': GridSearchCV(cv=10,
              estimator=Pipeline(steps=[('imputer', SimpleImputer()),
                                        ('scaler', StandardScaler()),
                                        ('reglog', LogisticRegression())]),
              n_jobs=-1,
              param_grid={'imputer__strategy': ['mean', 'median'],
                          'reglog__C': array([1.00000000e+00, 2.78255940e+00, 7.74263683e+00, 2.15443469e+01,
        5.99484250e+01, 1.66810054e+02, 4.64158883e+02, 1.29154967e+03,
        3.59381366e+03, 1.00000000e+04]),
                          'reglog__penalty': ['l1', 'l2']},
              scoring='accuracy', verbose=1),
 'gs_rand_forest': GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
              param_grid={'max_features': [1, 2, 3],
                          'n_estimators': [10, 100, 1000]},
              scoring='accuracy', verbose=1),
 'gs_svm': GridSearchCV(cv=10,
              estimator=Pipeline(steps=[('scaler', 

In [48]:
for nombre, grid_search in grids.items():
    grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 40 candidates, totalling 400 fits
Fitting 10 folds for each of 9 candidates, totalling 90 fits
Fitting 10 folds for each of 672 candidates, totalling 6720 fits


In [51]:
print(gs_reg_log.best_score_)
print(gs_reg_log.best_params_)
print(gs_reg_log.best_estimator_)
print(gs_reg_log.best_estimator_['reglog'])

0.9583333333333334
{'imputer__strategy': 'mean', 'reglog__C': 7.742636826811269, 'reglog__penalty': 'l2'}
Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('reglog', LogisticRegression(C=7.742636826811269))])
LogisticRegression(C=7.742636826811269)


In [55]:
print(gs_rand_forest.best_score_)
print(gs_rand_forest.best_params_)
print(gs_rand_forest.best_estimator_)

0.9333333333333332
{'max_features': 2, 'n_estimators': 100}
RandomForestClassifier(max_features=2)


In [56]:
print(gs_svm.best_score_)
print(gs_svm.best_params_)
print(gs_svm.best_estimator_)
print(gs_svm.best_estimator_['svm'])

0.9666666666666668
{'selectkbest__k': 4, 'svm__C': 5, 'svm__degree': 1, 'svm__gamma': 'scale', 'svm__kernel': 'linear'}
Pipeline(steps=[('scaler', StandardScaler()), ('selectkbest', SelectKBest(k=4)),
                ('svm', SVC(C=5, degree=1, kernel='linear'))])
SVC(C=5, degree=1, kernel='linear')


In [57]:
best_grids = [(i, j.best_score_) for i, j in grids.items()]

best_grids = pd.DataFrame(best_grids, columns=["Grid", "Best score"]).sort_values(by="Best score", ascending=False)
best_grids

             Grid  Best score
2          gs_svm    0.966667
0      gs_reg_log    0.958333
1  gs_rand_forest    0.933333


In [58]:
gs_svm.best_estimator_

In [59]:
preds = gs_svm.best_estimator_.predict(X_test)
accuracy_score(y_test, preds)

0.9666666666666667

In [60]:
gs_reg_log.best_estimator_

In [61]:
preds = gs_reg_log.best_estimator_.predict(X_test)
accuracy_score(y_test, preds)

1.0

In [62]:
preds = gs_rand_forest.best_estimator_.predict(X_test)
accuracy_score(y_test, preds)

1.0

Both the logistic regression pipeline and the random forest reach an accuracy of 1.0 on the test set, so they are the models that generalize best here.
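
The cross-validation score and the test score can be collected side by side in one loop instead of one cell per model. This is a sketch with two small, illustrative grids (the grid contents, 5 folds, and `random_state` values are assumptions); `GridSearchCV.score` evaluates the refitted best estimator.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grids = {
    "gs_rand_forest": GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=0),
        {"max_features": [1, 2, 3]},
        cv=5,
    ),
    "gs_svm": GridSearchCV(
        Pipeline([("scaler", StandardScaler()), ("svm", SVC())]),
        {"svm__C": [0.1, 1, 10]},
        cv=5,
    ),
}

rows = []
for name, gs in grids.items():
    gs.fit(X_train, y_train)
    rows.append({
        "Grid": name,
        "CV score": gs.best_score_,               # mean validation accuracy
        "Test score": gs.score(X_test, y_test),   # accuracy on held-out data
    })

comparison = pd.DataFrame(rows).sort_values(by="CV score", ascending=False)
print(comparison)
```

Keeping both columns visible makes it easier to notice when the ranking by validation score and the ranking by test score disagree, as happens in the results above.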

In [63]:
gs_svm.best_estimator_

In [64]:
gs_svm.best_estimator_['svm']

In [65]:
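# Caution: this calls the SVC step alone, skipping the scaler and the feature selector
preds = gs_svm.best_estimator_['svm'].predict(X_test)
accuracy_score(y_test, preds)  # test accuracy on unscaled inputs, hence the low score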

0.36666666666666664
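
The low accuracy comes from feeding raw features to a classifier that was trained on scaled ones. The sketch below, which fits a fixed pipeline instead of running the full search (an assumption made for brevity), shows that the final step reproduces the pipeline's predictions once the preceding steps are applied first via `pipe[:-1]`.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selectkbest", SelectKBest(k=4)),
    ("svm", SVC(kernel="linear", C=5)),
])
pipe.fit(X_train, y_train)

# Full pipeline: scaling and selection happen before the SVC sees the data
full_preds = pipe.predict(X_test)

# Same result by hand: transform with every step except the last, then predict
X_test_prepared = pipe[:-1].transform(X_test)
step_preds = pipe["svm"].predict(X_test_prepared)

# Calling the step on raw data skips the preprocessing it was trained with
raw_preds = pipe["svm"].predict(X_test)

print(accuracy_score(y_test, full_preds))
print(accuracy_score(y_test, raw_preds))
```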

In [67]:
# The best model was the logistic regression pipeline
best_model = gs_reg_log.best_estimator_
best_model.score(X_test, y_test)

1.0

In [68]:
gs_reg_log.best_params_

{'imputer__strategy': 'mean',
 'reglog__C': 7.742636826811269,
 'reglog__penalty': 'l2'}

In [83]:
import pickle

filename = 'finished_model'

with open(filename, 'wb') as archivo_salida:
    pickle.dump(best_model, archivo_salida)

In [84]:
with open(filename, 'rb') as archivo_entrada:
    modelo_importado = pickle.load(archivo_entrada)

In [85]:
modelo_importado.score(X_test, y_test)*100

100.0

In [86]:
modelo_importado.predict(X_test)

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])
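
A quick way to confirm that a saved model survived the round trip is to compare its predictions with those of the original object. The sketch below serializes in memory with `pickle.dumps` and stores the scikit-learn version next to the model; the dictionary layout (`"model"`, `"sklearn_version"`) is a hypothetical convention for this example, not something the notebook defines. Pickle files should only be loaded from trusted sources.

```python
import pickle

import numpy as np
import sklearn
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical payload layout: the model plus the library version used to train it
payload = pickle.dumps({"model": model, "sklearn_version": sklearn.__version__})

restored = pickle.loads(payload)
restored_model = restored["model"]

# The restored model should behave exactly like the original one
same_predictions = np.array_equal(model.predict(X), restored_model.predict(X))
print(same_predictions, restored["sklearn_version"])
```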

In [87]:
modelo_importado

In [None]:
# modelo_importado.predict(X_new)

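We have now chosen a model using the validation data. The next step is to train that model on ALL of the *train* data. With the default `refit=True`, `GridSearchCV` already does this: `best_estimator_` is refitted on the full data passed to `fit`.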
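
As a minimal sketch of that last step: `best_estimator_` is already trained on the whole training split, and a fresh copy with the same hyperparameters can be fitted on a larger set if desired. Fitting the copy on all labeled data (`X`, `y`) is an optional choice assumed here for illustration; it leaves no held-out data to evaluate that final model.

```python
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

# refit=True (the default): best_estimator_ was retrained on all of X_train
print(search.refit, search.best_estimator_.score(X_test, y_test))

# clone() copies the hyperparameters but not the fitted state
final_model = clone(search.best_estimator_).fit(X, y)
print(final_model.get_params()["C"])
```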

##  *Random Search*
The problem with GridSearchCV is that it becomes computationally very expensive when the hyperparameter space is large, because the number of combinations is the product of the number of values of every parameter.

With RandomizedSearchCV, not every combination is tried; only a fixed number of them, sampled at random. It works well with datasets that have few features. [There are even *papers*](https://www.jmlr.org/papers/v13/bergstra12a.html) reporting that random search is more efficient than grid search.

![imagen](https://miro.medium.com/proxy/1*ZTlQm_WRcrNqL-nLnx6GJA.png)
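
Random search can also sample from continuous distributions instead of a fixed list, which a grid cannot do. The sketch below draws `C` from a log-uniform distribution with `scipy.stats.loguniform`; the estimator, bounds, `n_iter`, and `random_state` are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

param_distributions = {
    "C": loguniform(1e-3, 1e2),    # continuous: sampled anew for every candidate
    "kernel": ["linear", "rbf"],   # discrete: sampled uniformly from the list
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=10,          # number of sampled candidates, independent of grid size
    cv=5,
    scoring="accuracy",
    random_state=0,     # makes the sampled candidates reproducible
)
search.fit(X, y)

sampled_C = [p["C"] for p in search.cv_results_["params"]]
print(search.best_params_, search.best_score_)
```

The cost is controlled by `n_iter` alone, so adding another hyperparameter does not multiply the number of fits.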

In [88]:
from sklearn.model_selection import RandomizedSearchCV

reg_log = Pipeline(steps=[
                          ("imputer",SimpleImputer()),
                          ("scaler",StandardScaler()),
                          ("reglog",LogisticRegression())
                         ])

reg_log_param = {    
                 "imputer__strategy": ['mean', 'median', 'most_frequent'],
                 "reglog__penalty": ["l1","l2"], 
                 "reglog__C": np.logspace(0, 4, 10)
                }


search = RandomizedSearchCV(reg_log,
                           reg_log_param,
                           n_iter = 50,
                           scoring='accuracy',
                           n_jobs=-1,
                           cv=10)

# Run the search
result = search.fit(X_train, y_train)

# Summary of results
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
print('Best Estimator: %s' % result.best_estimator_)

Best Score: 0.9583333333333334
Best Hyperparameters: {'reglog__penalty': 'l2', 'reglog__C': 7.742636826811269, 'imputer__strategy': 'most_frequent'}
Best Estimator: Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                ('scaler', StandardScaler()),
                ('reglog', LogisticRegression(C=7.742636826811269))])

