In [1]:
# initial setup
%run "../../../common/0_notebooks_base_setup.py"


/Users/csuarezgurruchaga/Desktop/Digital-House/CLASE_30/dsad_2021/common
default checking
Running command `conda list`... ok
jupyterlab=2.2.6 already installed
pandas=1.1.5 already installed
bokeh=2.2.3 already installed
seaborn=0.11.0 already installed
matplotlib=3.3.2 already installed
ipywidgets=7.5.1 already installed
pytest=6.2.1 already installed
chardet=4.0.0 already installed
psutil=5.7.2 already installed
scipy=1.5.2 already installed
statsmodels=0.12.1 already installed
scikit-learn=0.23.2 already installed
xlrd=2.0.1 already installed
Running command `conda install --yes nltk=3.5.0`... ok
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


unidecode=1.1.1 already installed
pydotplus=2.0.2 already installed
pandas-datareader=0.9.0 already installed
flask=1.1.2 already installed


---


# LAB: Tuning Hyperparameters with `GridSearchCV` for Logistic Regression and KNN

## Introduction

The goal of this lab is to start tuning hyperparameters with cross-validation. For that, we will use `GridSearchCV`.

We will use a breast cancer dataset containing information from clinical and cellular studies. The goal is to predict whether a tumor is benign (`class_t = 0`) or malignant (`class_t = 1`) from a set of cell-level predictors.

+ `class_t` is the target variable
+ the remaining variables take values normalized to the range 1 to 10

More information about the dataset is available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names).

**Note:** 16 cases with missing values in some fields were removed from the original dataset.

## Tasks

In this lab you will need to:

1. Build two classifiers: Logistic Regression and K-Nearest Neighbors (KNN)
2. Tune the hyperparameters of each model

    - 2.1 **LogReg:** tune a model with the `'saga'` solver, `C` values of 1, 10, 100, and 1000, and L1 and L2 regularization
    - 2.2 **KNN:** tune both the parameter `k` and the weighting applied to the k neighbors (uniform or distance). You can also try the parameter `p`, which sets the type of distance used to find the nearest neighbors.


3. Fit the final models
4. Evaluate which of the two performs better

**Important:** remember to design the validation strategy carefully for each stage of model fitting.

Import the required packages


In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import normalize

Load the dataset

In [3]:
# The CSV has no header row, so column names are assigned by hand
df = pd.read_csv('../Data/breast-cancer.csv', header=None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin', 'norm_nucleoli', 'mitoses', 'class_t']

In [4]:
df.shape

(683, 11)

In [5]:
df.head()

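| | ID | clump_Thickness | unif_cell_size | unif_cell_shape | adhesion | epith_cell_Size | bare_nuclei | bland_chromatin | norm_nucleoli | mitoses | class_t |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |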


Recode the classes as 0 and 1

In [6]:
# Recode in one step (2 = benign -> 0, 4 = malignant -> 1); this avoids chained assignment
df['class_t'] = df['class_t'].map({2: 0, 4: 1})

In [7]:
df['class_t'].value_counts(normalize=True)

0    0.650073
1    0.349927
Name: class_t, dtype: float64
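
The proportions above set the bar for accuracy: a classifier that always predicts the majority class (benign) would already score about 65%. A minimal sketch of that arithmetic, with the class counts reconstructed from the proportions shown and the 683 rows reported by `df.shape` (the variable names are illustrative):

```python
# Class proportions copied from the value_counts(normalize=True) output above.
n_rows = 683
proportions = {0: 0.650073, 1: 0.349927}

# Recover approximate class counts from the proportions.
counts = {label: round(p * n_rows) for label, p in proportions.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / n_rows

print(counts)
print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.4f}")
```

Any tuned model should be judged against this baseline, not against 50%.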

Separate the features from the target and split into training and test sets

In [8]:
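# Note: iloc[:, 1:9] keeps columns 1 through 8, so `mitoses` (column 9) is left out; use 1:10 to include all nine features
X = df.iloc[:, 1:9]
y = df['class_t']
# Stratify on the target so both splits keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y)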

Do we need to standardize in this case?

In [9]:
# Use sklearn to standardize the feature matrix (the scaler is fitted on the training set only)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
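
One caveat with the cell above: the scaler is fitted on all of `X_train` before cross-validation, so each validation fold has already influenced the scaling statistics. A common alternative is to wrap the scaler and the model in a `Pipeline`, so that scaling is re-fitted inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab's dataset; the step names `scaler` and `clf` and the small grid are illustrative choices, not part of the original lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the breast-cancer dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)

# Scaling is a pipeline step, so it is re-fitted on the training part of each fold.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as "<step>__<parameter>".
param_grid = {"clf__n_neighbors": [3, 5, 7], "clf__weights": ["uniform", "distance"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=19)
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=cv)
search.fit(X_tr, y_tr)

# The fitted search scales the test set with statistics learned on the training set.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

On a dataset where every feature already lies in the range 1 to 10, the practical effect of this leakage is probably small, but the pipeline form removes the question entirely.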

**Hint 1:** It helps to build two lists: one with the model estimators and another with the parameter grid to search for each model.

**Hint 2:** You can then iterate over those lists to tune the hyperparameters of each model.

In [10]:
models = [LogisticRegression(),
          KNeighborsClassifier()]

In [11]:
params = [
    {'C': [1, 10, 100, 1000],
     'penalty': ['l1', 'l2',],
     'solver': ['saga']},
    {'n_neighbors': range(1,200),
     'weights' : ['uniform', 'distance'],
     'p' : [1, 2, 3]}
]
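
To get a feel for the cost of the search, it helps to count the candidates: `GridSearchCV` evaluates every combination in the grid, and each combination is fitted once per fold. A rough count for the two grids above, assuming 10-fold cross-validation (the helper `n_candidates` is hypothetical, written only for this illustration):

```python
from math import prod

def n_candidates(grid):
    """Number of parameter combinations in an exhaustive grid."""
    return prod(len(values) for values in grid.values())

logreg_grid = {"C": [1, 10, 100, 1000], "penalty": ["l1", "l2"], "solver": ["saga"]}
knn_grid = {"n_neighbors": range(1, 200), "weights": ["uniform", "distance"], "p": [1, 2, 3]}

n_folds = 10  # assumption: 10-fold cross-validation
for name, grid in [("LogReg", logreg_grid), ("KNN", knn_grid)]:
    combos = n_candidates(grid)
    print(f"{name}: {combos} combinations, {combos * n_folds} fits")
```

The KNN grid is about 150 times larger than the logistic regression grid, which is why it dominates the running time.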

In [12]:
from sklearn.model_selection import StratifiedKFold
folds=StratifiedKFold(n_splits=10, random_state=19, shuffle=True)

In [13]:
grids = []
for model, grid in zip(models, params):
    gs = GridSearchCV(estimator=model, param_grid=grid, scoring='accuracy', cv=folds, n_jobs=4)
    print(gs)
    gs.fit(X_train, y_train)
    grids.append(gs)

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=19, shuffle=True),
             estimator=LogisticRegression(), n_jobs=4,
             param_grid={'C': [1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                         'solver': ['saga']},
             scoring='accuracy')




GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=19, shuffle=True),
             estimator=KNeighborsClassifier(), n_jobs=4,
             param_grid={'n_neighbors': range(1, 200), 'p': [1, 2, 3],
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')


In [36]:
grids

[GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=19, shuffle=True),
              estimator=LogisticRegression(), n_jobs=4,
              param_grid={'C': [1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                          'solver': ['saga']},
              scoring='accuracy'),
 GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=19, shuffle=True),
              estimator=KNeighborsClassifier(), n_jobs=4,
              param_grid={'n_neighbors': range(1, 200), 'p': [1, 2, 3],
                          'weights': ['uniform', 'distance']},
              scoring='accuracy')]

In [14]:
for i in grids:
    print (i.best_score_)
    print (i.best_estimator_)
    print (i.best_params_)

0.9707317073170731
LogisticRegression(C=10, penalty='l1', solver='saga')
{'C': 10, 'penalty': 'l1', 'solver': 'saga'}
0.975609756097561
KNeighborsClassifier(n_neighbors=4, p=1, weights='distance')
{'n_neighbors': 4, 'p': 1, 'weights': 'distance'}


In [15]:
pd.DataFrame(grids[0].cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,param_solver,params,split0_test_score,split1_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006239,0.003058,0.000274,6.6e-05,1,l1,saga,"{'C': 1, 'penalty': 'l1', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,0.97561,1.0,0.926829,0.97561,0.95122,1.0,0.968293,0.024512,6
1,0.002619,0.000278,0.000239,6.6e-05,1,l2,saga,"{'C': 1, 'penalty': 'l2', 'solver': 'saga'}",0.97561,0.95122,...,0.97561,0.97561,1.0,0.95122,0.97561,0.95122,1.0,0.968293,0.021951,6
2,0.004399,0.000125,0.000208,1.3e-05,10,l1,saga,"{'C': 10, 'penalty': 'l1', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,1.0,1.0,0.926829,0.97561,0.95122,1.0,0.970732,0.026269,1
3,0.003741,0.000251,0.000193,1e-05,10,l2,saga,"{'C': 10, 'penalty': 'l2', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,0.97561,1.0,0.926829,0.97561,0.95122,1.0,0.968293,0.024512,6
4,0.004365,2.1e-05,0.000192,1e-05,100,l1,saga,"{'C': 100, 'penalty': 'l1', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,1.0,1.0,0.926829,0.97561,0.95122,1.0,0.970732,0.026269,1
5,0.003959,0.000403,0.000237,0.000119,100,l2,saga,"{'C': 100, 'penalty': 'l2', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,1.0,1.0,0.926829,0.97561,0.95122,1.0,0.970732,0.026269,1
6,0.004433,5.9e-05,0.00019,1.2e-05,1000,l1,saga,"{'C': 1000, 'penalty': 'l1', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,1.0,1.0,0.926829,0.97561,0.95122,1.0,0.970732,0.026269,1
7,0.003725,0.000341,0.000184,4e-06,1000,l2,saga,"{'C': 1000, 'penalty': 'l2', 'solver': 'saga'}",0.97561,0.97561,...,0.97561,1.0,1.0,0.926829,0.97561,0.95122,1.0,0.970732,0.026269,1
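
Note that several configurations tie for rank 1 in the table above. When scores tie, `best_params_` should report the first tied candidate in grid order, so `C=10` with L1 is not meaningfully better than the others. A small sketch that lists the tied candidates, using the `mean_test_score` values copied from the table (the list `results` is typed in by hand for illustration):

```python
# (C, penalty, mean_test_score) copied from the cv_results_ table above.
results = [
    (1, "l1", 0.968293),
    (1, "l2", 0.968293),
    (10, "l1", 0.970732),
    (10, "l2", 0.968293),
    (100, "l1", 0.970732),
    (100, "l2", 0.970732),
    (1000, "l1", 0.970732),
    (1000, "l2", 0.970732),
]

best_score = max(score for _, _, score in results)
tied = [(c, penalty) for c, penalty, score in results if score == best_score]

print(f"best mean CV accuracy: {best_score}")
print(f"{len(tied)} tied candidates: {tied}")
```

Since the remaining candidates trail by less than one percentage point, the logistic regression results look largely insensitive to `C` and the penalty over this range.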


In [16]:
pd.DataFrame(grids[1].cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,param_p,param_weights,params,split0_test_score,split1_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000628,0.000178,0.001416,0.000237,1,1,uniform,"{'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}",0.951220,0.951220,...,0.951220,0.975610,1.00000,0.951220,0.975610,0.926829,1.000,0.965854,0.022354,91
1,0.000507,0.000048,0.000700,0.000103,1,1,distance,"{'n_neighbors': 1, 'p': 1, 'weights': 'distance'}",0.951220,0.951220,...,0.951220,0.975610,1.00000,0.951220,0.975610,0.926829,1.000,0.965854,0.022354,91
2,0.000459,0.000015,0.001151,0.000033,1,2,uniform,"{'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}",0.926829,0.951220,...,0.951220,0.975610,1.00000,0.951220,0.975610,0.926829,0.975,0.958476,0.021905,192
3,0.000440,0.000015,0.000534,0.000023,1,2,distance,"{'n_neighbors': 1, 'p': 2, 'weights': 'distance'}",0.926829,0.951220,...,0.951220,0.975610,1.00000,0.951220,0.975610,0.926829,0.975,0.958476,0.021905,192
4,0.000486,0.000031,0.001932,0.000058,1,3,uniform,"{'n_neighbors': 1, 'p': 3, 'weights': 'uniform'}",0.926829,0.951220,...,0.926829,0.975610,1.00000,0.951220,0.951220,0.926829,0.975,0.946280,0.032303,423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1189,0.000317,0.000007,0.001120,0.000017,199,1,distance,"{'n_neighbors': 199, 'p': 1, 'weights': 'dista...",0.951220,0.878049,...,0.853659,0.902439,0.97561,0.902439,0.975610,0.902439,0.925,0.916890,0.038056,1082
1190,0.000319,0.000007,0.001703,0.000020,199,2,uniform,"{'n_neighbors': 199, 'p': 2, 'weights': 'unifo...",0.902439,0.878049,...,0.804878,0.878049,0.97561,0.902439,0.926829,0.878049,0.900,0.894878,0.040848,1183
1191,0.000302,0.000011,0.001147,0.000023,199,2,distance,"{'n_neighbors': 199, 'p': 2, 'weights': 'dista...",0.951220,0.878049,...,0.878049,0.902439,1.00000,0.926829,0.951220,0.878049,0.925,0.919329,0.037832,1046
1192,0.000330,0.000011,0.004130,0.000061,199,3,uniform,"{'n_neighbors': 199, 'p': 3, 'weights': 'unifo...",0.902439,0.853659,...,0.804878,0.878049,0.97561,0.902439,0.902439,0.878049,0.900,0.890000,0.041110,1189


In [17]:
X_test = scaler.transform(X_test)

In [18]:
y_preds_log = grids[0].predict(X_test)
y_preds_knn = grids[1].predict(X_test)

In [19]:
print (classification_report(y_test, y_preds_log))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       178
           1       0.96      0.94      0.95        96

    accuracy                           0.96       274
   macro avg       0.96      0.96      0.96       274
weighted avg       0.96      0.96      0.96       274



In [20]:
confusion_matrix(y_test, y_preds_log)

array([[174,   4],
       [  6,  90]])

In [21]:
print (classification_report(y_test, y_preds_knn))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       178
           1       0.95      0.96      0.95        96

    accuracy                           0.97       274
   macro avg       0.96      0.97      0.96       274
weighted avg       0.97      0.97      0.97       274



In [22]:
confusion_matrix(y_test, y_preds_knn)

array([[173,   5],
       [  4,  92]])
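
The figures in the classification reports can be recomputed by hand from the confusion matrices, which makes it easier to see where the two models differ. In scikit-learn's layout, rows are true classes and columns are predicted classes. A short sketch for the malignant class (label 1), with the matrices copied from the outputs above; the helper `summarize` is hypothetical:

```python
def summarize(cm):
    """Accuracy, plus precision and recall for class 1, from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }

cm_log = [[174, 4], [6, 90]]   # Logistic Regression
cm_knn = [[173, 5], [4, 92]]   # KNN

for name, cm in [("LogReg", cm_log), ("KNN", cm_knn)]:
    metrics = summarize(cm)
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

KNN misses 4 malignant cases against 6 for logistic regression, at the cost of one extra false positive.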

## Performance Difference between Random Search and Grid Search

Given the following set of parameters:

        param_dist = {
                    'n_neighbors': range(1,200),
                    'weights' : ['uniform', 'distance'],
                    'p' : [1, 2, 3]
                    }

Implement a search for the best set of hyperparameters with both `GridSearchCV` and `RandomizedSearchCV`.
Compare the two approaches on:

1. Running time (using the `time` library)
2. The best combination of parameters found
3. The accuracy of the best model from each search on the test set we held out earlier


In [23]:
from sklearn.model_selection import RandomizedSearchCV

In [24]:
def busquedaGridsearch(params_):
    """Exhaustive grid search for KNN over params_ (uses the global X_train, y_train)."""
    folds = StratifiedKFold(n_splits=10, random_state=19, shuffle=True)
    gs = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params_, scoring='accuracy', cv=folds, n_jobs=4)
    gs.fit(X_train, y_train)
    return gs

In [25]:
def busquedaRandomSearch(params_, iter_):
    """Randomized search for KNN: samples iter_ combinations from params_ (uses the global X_train, y_train)."""
    folds = StratifiedKFold(n_splits=10, random_state=19, shuffle=True)
    rs = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=params_, scoring='accuracy', cv=folds, n_jobs=4, n_iter=iter_)
    rs.fit(X_train, y_train)
    return rs

In [26]:
param_dist = {
    'n_neighbors': range(1,200),
    'weights' : ['uniform', 'distance'],
    'p' : [1, 2, 3]
}


In [27]:
import time

In [28]:
tic = time.time()
gs_random_search = busquedaRandomSearch(param_dist, 100)
toc = time.time()
print(str(toc - tic) + ' seconds')

0.6224310398101807 seconds


In [29]:
tic = time.time()
gs_grid_search = busquedaGridsearch(param_dist)
toc = time.time()
print(str(toc - tic) + ' seconds')

6.986302852630615 seconds
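
The gap between the two timings is roughly what the number of model fits predicts: the randomized search samples 100 of the 1194 combinations, so it does about one twelfth of the work. A back-of-the-envelope check using the timings printed above (which will vary from machine to machine):

```python
n_folds = 10
grid_candidates = 199 * 2 * 3   # n_neighbors x weights x p
random_candidates = 100         # n_iter passed to the randomized search

grid_fits = grid_candidates * n_folds
random_fits = random_candidates * n_folds

# Wall-clock times copied from the two cells above (seconds).
t_random, t_grid = 0.6224310398101807, 6.986302852630615

print(f"fit ratio:  {grid_fits / random_fits:.2f}")
print(f"time ratio: {t_grid / t_random:.2f}")
```

The two ratios are close but not identical, which is expected: per-fit cost depends on the sampled parameters, and parallel scheduling adds overhead.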


In [30]:
gs_random_search.best_params_

{'weights': 'distance', 'p': 2, 'n_neighbors': 15}

In [31]:
gs_grid_search.best_params_

{'n_neighbors': 4, 'p': 1, 'weights': 'distance'}

In [32]:
from sklearn.metrics import accuracy_score

def obtener_performance(estimator):
    """Accuracy of a fitted estimator on the held-out test set (uses the global X_test, y_test)."""
    y_pred = estimator.predict(X_test)
    return accuracy_score(y_test, y_pred, normalize=True)

In [33]:
obtener_performance(gs_grid_search.best_estimator_)

0.9671532846715328

In [34]:
obtener_performance(gs_random_search.best_estimator_)

0.9598540145985401
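
With only 274 test samples, the gap between these two accuracies amounts to two predictions, which is smaller than the sampling noise one would expect for a single accuracy estimate. A rough check using the binomial standard error $\sqrt{p(1-p)/n}$ (a simplifying assumption: it treats test samples as independent and ignores that both models are evaluated on the same test set):

```python
from math import sqrt

n_test = 274
acc_grid, acc_random = 0.9671532846715328, 0.9598540145985401

# Difference expressed as a number of test samples.
diff = acc_grid - acc_random
diff_in_samples = round(diff * n_test)

# Binomial standard error of a single accuracy estimate.
std_err = sqrt(acc_grid * (1 - acc_grid) / n_test)

print(f"difference: {diff:.4f} ({diff_in_samples} samples)")
print(f"standard error of one estimate: {std_err:.4f}")
```

Under these assumptions, the randomized search reaches essentially the same test performance as the exhaustive search at a fraction of the cost.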