### GridSearch & Pipelines
GridSearch is a tool for tuning hyperparameters. We define a grid of candidate values for each hyperparameter, every combination in the grid is evaluated with cross-validation, and the combination that scores best on our data is selected.

# 1 - One way
Iterate an algorithm over a set of hyperparameters.
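To make that iteration concrete, here is a minimal sketch of the loop a grid search performs, written by hand with `ParameterGrid` and `cross_val_score`. The small grid and the 5-fold setting are illustrative assumptions, not the values used in the notebook. `GridSearchCV` does the same job but also parallelizes the fits and refits the best candidate on the full training data.

```python
from sklearn import datasets, svm
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Small illustrative grid (values chosen for this example only)
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):  # every combination: 2 kernels * 3 C values
    model = svm.SVC(**params)
    # Mean accuracy over 5 cross-validation folds for this combination
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print("Best params:", best_params)
print("Best CV accuracy:", round(best_score, 4))
```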

In [27]:
import warnings
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore", category=DeprecationWarning)

In [9]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.2,
                                                   random_state=42)
svc = svm.SVC()

parameters = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
    'gamma': ['scale', 'auto'],                  # has no effect on the linear kernel
    'coef0': [-10, -1, 0, 0.1, 0.5, 1, 10, 100]  # only affects the sigmoid kernel here
}

grid = GridSearchCV(estimator = svc,
                   param_grid = parameters,
                   n_jobs = -1,
                   scoring = 'accuracy',
                   cv = 10)

grid.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100],
                         'coef0': [-10, -1, 0, 0.1, 0.5, 1, 10, 100],
                         'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'rbf', 'sigmoid']},
             scoring='accuracy')

In [17]:
print("Best estimator:", grid.best_estimator_)
print("Best params:", grid.best_params_)
print("Best score:", grid.best_score_)

Best estimator: SVC(C=0.1, coef0=-10, kernel='linear')
Best params: {'C': 0.1, 'coef0': -10, 'gamma': 'scale', 'kernel': 'linear'}
Best score: 0.9583333333333334
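The best parameters above pair `coef0=-10` with a linear kernel, although `coef0` is only significant for the `poly` and `sigmoid` kernels and `gamma` is not used by the linear kernel. Many candidates in the grid are therefore duplicates of each other. One possible way to avoid this, sketched below, is to pass `param_grid` as a list of dictionaries, one per kernel. The names `full_grid` and `reduced_grid` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100]
gamma_values = ["scale", "auto"]
coef0_values = [-10, -1, 0, 0.1, 0.5, 1, 10, 100]

# The single-dictionary grid from the notebook: every value crossed with every other
full_grid = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": C_values,
    "gamma": gamma_values,
    "coef0": coef0_values,
}

# One sub-grid per kernel, listing only the parameters that kernel actually uses
reduced_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf"], "C": C_values, "gamma": gamma_values},
    {"kernel": ["sigmoid"], "C": C_values, "gamma": gamma_values, "coef0": coef0_values},
]

n_full = len(ParameterGrid(full_grid))        # 3 * 8 * 2 * 8 = 384 candidates
n_reduced = len(ParameterGrid(reduced_grid))  # 8 + 16 + 128 = 152 candidates
print("Candidates in full grid:", n_full)
print("Candidates in reduced grid:", n_reduced)
```

`reduced_grid` can be passed to `GridSearchCV` as `param_grid` in the same way as the original dictionary.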


In [18]:
best_estimator = grid.best_estimator_
best_estimator.score(X_test, y_test)

1.0
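A test accuracy of 1.0 on a held-out set of only 30 samples (20% of the 150 iris rows) is a coarse estimate, so it can help to look at how all candidates ranked during cross-validation rather than at the winner alone. The sketch below assumes a smaller grid than the one above to keep it quick. It loads `cv_results_` into a pandas DataFrame and sorts by rank.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller grid than in the notebook, for illustration only
small_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(svm.SVC(), small_grid, scoring="accuracy", cv=10)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = results[columns].sort_values("rank_test_score")
print(ranking.to_string(index=False))
```

Candidates whose mean scores differ by less than the fold-to-fold standard deviation are hard to tell apart on a dataset this small.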

# 2 - Almost-Pro way

The pro way does the same thing while also recording the training and validation errors. It can stop the process when required, save the model locally once finished if it is better than the previous one, and load the previous model to continue retraining.
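Part of this workflow can be sketched with tools the notebook already relies on. The example below covers only two of the ideas: recording training and validation scores (`return_train_score=True`) and saving the model locally only when it beats the previously stored one. It does not cover stopping the process or resuming training. The helper `search_and_save`, the file names, and the temporary directory are hypothetical choices made for this illustration.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical storage location; a real project would use a fixed path
model_dir = Path(tempfile.mkdtemp())
model_path = model_dir / "best_model.joblib"
score_path = model_dir / "best_score.json"


def search_and_save(param_grid):
    """Run a grid search; persist the winner only if it beats the stored score."""
    search = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring="accuracy",
                          return_train_score=True)  # also record training scores
    search.fit(X_train, y_train)

    if score_path.exists():
        previous = json.loads(score_path.read_text())["score"]
    else:
        previous = float("-inf")  # nothing stored yet

    saved = search.best_score_ > previous
    if saved:
        joblib.dump(search.best_estimator_, model_path)
        score_path.write_text(json.dumps({"score": float(search.best_score_)}))
    return search, saved


search, saved_first = search_and_save({"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]})

# Training vs. validation accuracy for each candidate
for params, train, val in zip(search.cv_results_["params"],
                              search.cv_results_["mean_train_score"],
                              search.cv_results_["mean_test_score"]):
    print(params, "train:", round(train, 3), "validation:", round(val, 3))

# A later search only overwrites the stored model if its CV score is higher
_, saved_second = search_and_save({"kernel": ["sigmoid"], "C": [0.001]})
print("First search saved:", saved_first, "| second search saved:", saved_second)

# Load whichever model is currently stored and evaluate it on the test set
restored = joblib.load(model_path)
print("Restored model test accuracy:", restored.score(X_test, y_test))
```

A large gap between the training and validation columns is a common sign of overfitting for that candidate.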

# 3 - Another way

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

In [29]:
np.arange(0, 4, 0.5).tolist()

[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

In [31]:
reg_log = Pipeline(steps = [
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("reglog", LogisticRegression())
])

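# Keys follow the "<step name>__<parameter>" convention to reach into the pipeline.
re_log_param = {
    "imputer__strategy": ['mean', 'median', 'most_frequent'],
    # The default lbfgs solver does not support "l1", so those candidates fail to fit.
    "reglog__penalty": ["l1", "l2"],
    # The range starts at 0, but C (inverse regularization strength) must be positive.
    "reglog__C": np.arange(0, 4, 0.5)
}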

gs_reg_log = GridSearchCV(reg_log,
                         re_log_param,
                         cv = 10,
                         scoring='accuracy',
                         n_jobs=-1,
                         verbose=1)

gs_reg_log.fit(X_train, y_train)

print("Best estimator:", gs_reg_log.best_estimator_)
print("Best params:", gs_reg_log.best_params_)
print("Best score:", gs_reg_log.best_score_)

Fitting 10 folds for each of 48 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.2s


Best estimator: Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('reglog', LogisticRegression(C=0.5))])
Best params: {'imputer__strategy': 'mean', 'reglog__C': 0.5, 'reglog__penalty': 'l2'}
Best score: 0.9416666666666667


[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:    7.8s finished
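The grid above contains candidates that cannot be fitted: the `l1` penalty is not supported by the default `lbfgs` solver, and `C` must be positive, so `C=0` is not a valid value. The sketch below is one possible variant in which every candidate is valid. It assumes the `saga` solver (which accepts both `l1` and `l2`), a higher `max_iter`, and a `C` range shifted to start at 0.5. These choices are not in the original notebook, and the scores may differ from the output shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    # Assumption: saga is swapped in because it handles both l1 and l2 penalties
    ("reglog", LogisticRegression(solver="saga", max_iter=5000)),
])

param_grid = {
    "imputer__strategy": ["mean", "median", "most_frequent"],
    "reglog__penalty": ["l1", "l2"],
    "reglog__C": np.arange(0.5, 4.5, 0.5).tolist(),  # 0.5, 1.0, ..., 4.0 (all positive)
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)
# The fitted search delegates score() to the refitted best pipeline
print("Test accuracy:", search.score(X_test, y_test))
```

The grid still has 3 × 2 × 8 = 48 candidates, but none of them is skipped because of an invalid parameter combination.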
