# Estimating hyperparameters with GridSearch

The goal of this exercise is to get you started tuning hyperparameters with cross-validation. To do that, we will use GridSearchCV.

We will use a breast cancer dataset. It contains information from clinical and cellular studies. The goal is to predict whether the cancer is benign (class_t = 0) or malignant (class_t = 1) from a set of cell-level predictors.

+ class_t is the target variable
+ the remaining columns (apart from ID) are variables with values normalized from 1 to 10

For this exercise you must build a Logistic Regression classifier. Tune a model with solver 'saga', C values 1, 10, 100 and 1000, and L1 and L2 regularization.
Remember to design carefully the validation strategy for each stage of model estimation.
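To build some intuition for the grid requested above: in scikit-learn, C is the inverse of the regularization strength, so smaller values penalize the coefficients more, and the L1 penalty can set coefficients exactly to zero. The sketch below illustrates this on synthetic data from make_classification (an assumption for illustration, not the breast cancer data). The C values 0.001 and 1000 and max_iter=5000 are likewise illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def count_zero_coefs(C):
    """Fit an L1-penalized logistic regression and count coefficients equal to zero."""
    clf = LogisticRegression(solver="saga", penalty="l1", C=C, max_iter=5000, random_state=0)
    clf.fit(X_demo, y_demo)
    return int(np.sum(clf.coef_ == 0))

zeros_strong = count_zero_coefs(0.001)  # small C: strong regularization
zeros_weak = count_zero_coefs(1000)     # large C: weak regularization
print("zero coefficients with C=0.001:", zeros_strong)
print("zero coefficients with C=1000: ", zeros_weak)
```

The L2 penalty shrinks coefficients as well, but it does not usually drive them to exactly zero, which is one reason to compare both penalties in the grid.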

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import normalize

In [3]:
df = pd.read_csv('./data/breast-cancer.csv', header = None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin', 'norm_nucleoli', 'mitoses', 'class_t']

In [4]:
df.shape

(683, 11)

In [5]:
df.head()

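| | ID | clump_Thickness | unif_cell_size | unif_cell_shape | adhesion | epith_cell_Size | bare_nuclei | bland_chromatin | norm_nucleoli | mitoses | class_t |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |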


In [6]:
# Recode the target: 2 -> 0 (benign), 4 -> 1 (malignant).
# .loc assigns in a single step and avoids pandas' chained-assignment warning.
df.loc[df['class_t'] == 2, 'class_t'] = 0
df.loc[df['class_t'] == 4, 'class_t'] = 1

In [7]:
df['class_t'].value_counts(normalize=True)

class_t
0    0.650073
1    0.349927
Name: proportion, dtype: float64
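These proportions also give a reference point for the accuracy scores later on: a classifier that always predicts the majority class would already be right about 65% of the time. A minimal sketch with DummyClassifier, assuming label counts of 444 and 239 (reconstructed from the proportions above and the 683 rows, not read from the file):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Assumed label counts, reconstructed from the reported proportions (683 rows)
y_all = np.array([0] * 444 + [1] * 239)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print(round(baseline_accuracy, 6))
```

Any tuned model should be judged against this baseline rather than against 50%.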

In [8]:
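# Features: columns 1 to 8 (skips ID; note that this slice also leaves out 'mitoses', column 9)
X = df.iloc[:, 1:9]
y = df['class_t']
# Hold out 40% as a test set, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y)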
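With test_size=0.4 on 683 rows, the split should leave 409 rows for training and 274 for testing, and stratify keeps the class proportions nearly the same in both parts. The sketch below checks this on placeholder labels (the 444/239 class counts are an assumption reconstructed from the proportions shown earlier) and fixes random_state, a value chosen here only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the assumed class counts (not the real feature matrix)
y_all = np.array([0] * 444 + [1] * 239)
X_all = np.arange(len(y_all)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.4, stratify=y_all, random_state=0
)
print("train rows:", len(y_tr), "test rows:", len(y_te))
print("share of class 1 in train:", round(y_tr.mean(), 3))
print("share of class 1 in test: ", round(y_te.mean(), 3))
```

Without a fixed random_state, each run of the notebook produces a different split, so the scores reported further down will vary slightly between runs.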

In [9]:
# Use sklearn to standardize the feature matrix (the scaler is fit on the training set only)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
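One detail of the validation design: here the scaler is fit on the whole training set before cross-validation, so each validation fold has already contributed to the scaling statistics. A common alternative is to put the scaler inside a Pipeline so that it is refit within every fold. The following is a minimal sketch on synthetic data, not the notebook's method; the step names scaler and clf, the 5 folds and max_iter=5000 are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)

# The scaler is a pipeline step, so it is fit only on the training part of each fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000)),
])

# Parameters of a step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [1, 10, 100, 1000], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=19)

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=cv)
search.fit(X_tr, y_tr)

# The refitted pipeline scales the test set with statistics learned from X_tr
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

With this layout the raw test matrix is passed straight to the search object, and no separate transform call is needed.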

In [10]:
from sklearn.model_selection import StratifiedKFold
folds=StratifiedKFold(n_splits=10, random_state=19, shuffle=True)
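StratifiedKFold is used so that every validation fold keeps roughly the same class balance as the full training set. The sketch below illustrates this on placeholder labels; the counts 266 and 143 are an assumption that approximates a 409-row training split with a 65/35 class mix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels with an assumed 65/35 class mix (not the real y_train)
y_demo = np.array([0] * 266 + [1] * 143)
X_demo = np.zeros((len(y_demo), 1))  # split() needs X only for its length

folds_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=19)
overall_rate = y_demo.mean()

# Share of class 1 in the validation part of each fold
fold_rates = [y_demo[val_idx].mean() for _, val_idx in folds_demo.split(X_demo, y_demo)]
for k, rate in enumerate(fold_rates):
    print(f"fold {k}: share of class 1 = {rate:.3f} (overall {overall_rate:.3f})")
```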

In [11]:
# Define the models
models = [LogisticRegression()]

# Define the parameter grid for GridSearchCV
params = [{
    'solver': ['saga'],       # Optimization method
    'C': [1, 10, 100, 1000],  # Values of C
    'penalty': ['l1', 'l2']   # L1 and L2 regularization
}]
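To see how much work this grid implies, the candidate combinations can be listed with ParameterGrid. This is a small sketch; the fold count of 10 is taken from the StratifiedKFold object defined earlier, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'solver': ['saga'],
    'C': [1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
}

# Every combination of the listed values is one candidate model
candidates = list(ParameterGrid(grid))
for candidate in candidates:
    print(candidate)

n_folds = 10
n_cv_fits = len(candidates) * n_folds
print(len(candidates), "candidates x", n_folds, "folds =", n_cv_fits, "fits")
```

With the default refit=True, GridSearchCV performs one additional fit of the best candidate on the full training set once the search is done.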

In [12]:
grids = []
for i in range(len(models)):
    gs = GridSearchCV(estimator=models[i], param_grid=params[i], scoring='accuracy', cv=folds, n_jobs=4)
    print (gs)
    fit = gs.fit(X_train, y_train)
    grids.append(fit)

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=19, shuffle=True),
             estimator=LogisticRegression(), n_jobs=4,
             param_grid={'C': [1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                         'solver': ['saga']},
             scoring='accuracy')




In [13]:
for i in grids:
    print (i.best_score_)
    print (i.best_estimator_)
    print (i.best_params_)

0.9755487804878049
LogisticRegression(C=10, penalty='l1', solver='saga')
{'C': 10, 'penalty': 'l1', 'solver': 'saga'}


In [14]:
pd.DataFrame(grids[0].cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_penalty | param_solver | params | split0_test_score | split1_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.018732 | 0.011494 | 0.002601 | 0.000918 | 1 | l1 | saga | {'C': 1, 'penalty': 'l1', 'solver': 'saga'} | 0.97561 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.97311 | 0.013124 | 7 |
| 1 | 0.011598 | 0.002798 | 0.002101 | 0.001044 | 1 | l2 | saga | {'C': 1, 'penalty': 'l2', 'solver': 'saga'} | 0.97561 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.97561 | 0.97561 | 0.95 | 0.973049 | 0.013342 | 8 |
| 2 | 0.019882 | 0.004172 | 0.002254 | 0.000982 | 10 | l1 | saga | {'C': 10, 'penalty': 'l1', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
| 3 | 0.016603 | 0.002374 | 0.002254 | 0.000604 | 10 | l2 | saga | {'C': 10, 'penalty': 'l2', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
| 4 | 0.019062 | 0.002633 | 0.002771 | 0.00137 | 100 | l1 | saga | {'C': 100, 'penalty': 'l1', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
| 5 | 0.017673 | 0.004353 | 0.001954 | 0.001598 | 100 | l2 | saga | {'C': 100, 'penalty': 'l2', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
| 6 | 0.019153 | 0.002493 | 0.002101 | 0.00083 | 1000 | l1 | saga | {'C': 1000, 'penalty': 'l1', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
| 7 | 0.016306 | 0.002359 | 0.001494 | 0.000493 | 1000 | l2 | saga | {'C': 1000, 'penalty': 'l2', 'solver': 'saga'} | 1.0 | 0.97561 | ... | 0.97561 | 0.97561 | 0.97561 | 1.0 | 0.95122 | 0.97561 | 0.975 | 0.975549 | 0.015427 | 1 |
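The full cv_results_ table is wide, and only a few columns are needed to compare candidates. Note that in the table above six candidates tie at rank 1, and the best_params_ reported earlier (C=10, l1) is the first of them. The sketch below shows one way to build a compact, ranked view; it runs its own small search on synthetic data, so the data, the 5 folds and max_iter=5000 are assumptions and the scores will not match the ones above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (hypothetical, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),
    param_grid={"C": [1, 10, 100, 1000], "penalty": ["l1", "l2"]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=19),
)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)

# Keep only the columns needed to compare candidates, best rank first
columns = ["param_C", "param_penalty", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score", kind="stable")
print(summary.to_string(index=False))

# best_index_ is the row of cv_results_ that best_params_ refers to
print("best row:", search.best_index_, search.best_params_)
```

When several candidates share the top score, as they do here, a reasonable tie-breaker is to prefer the more strongly regularized model (the smallest C), though that is a modelling choice rather than something GridSearchCV does for you.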


In [16]:
X_test = scaler.transform(X_test)

In [17]:
y_preds_log = grids[0].predict(X_test)

In [18]:
print (classification_report(y_test, y_preds_log))

              precision    recall  f1-score   support

           0       0.97      0.96      0.97       178
           1       0.93      0.95      0.94        96

    accuracy                           0.96       274
   macro avg       0.95      0.95      0.95       274
weighted avg       0.96      0.96      0.96       274



In [19]:
confusion_matrix(y_test, y_preds_log)

array([[171,   7],
       [  5,  91]])
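The figures in the classification report can be recovered by hand from this matrix. In scikit-learn's convention, rows are the true classes and columns the predicted ones, so the matrix reads as 171 true negatives, 7 false positives, 5 false negatives and 91 true positives. A short worked check (variable names are illustrative):

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[171, 7],
               [5, 91]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # 91 / 98: share of predicted malignant that are malignant
recall_1 = tp / (tp + fn)      # 91 / 96: share of malignant cases that were detected
accuracy = (tp + tn) / cm.sum()  # 262 / 274

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```

For this problem the 5 false negatives (malignant cases predicted as benign) are the costlier error, so recall for class 1 deserves at least as much attention as overall accuracy.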