<a href="https://colab.research.google.com/github/Rogerio-mack/IMT_CD_2025/blob/main/IMT_CV_GridSearchCV_solucao.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<head>
  <meta name="author" content="Rogério de Oliveira">
  <meta institution="author" content="ITM">
</head>

<img src="https://maua.br/images/selo-60-anos-maua.svg" width=300, align="right">


# Cross-Validation and GridSearchCV

[Alternative scikit-learn course](https://inria.github.io/scikit-learn-mooc/toc.html)

In [1]:
from IPython.display import IFrame

IFrame('https://allisonhorst.github.io/palmerpenguins/', width=800, height=600)


In [2]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, StratifiedKFold, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


# Data

In [3]:
df = sns.load_dataset('penguins')
df.dropna(inplace=True)

df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male


# Training and Test

In [4]:
# The features (X) and the target (y)
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['species']

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Models

In [5]:
# Creating the dictionary of models we will test
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=10000),
    'KNN': KNeighborsClassifier(),
}

# Cross Validation Scores


## Exercise 1.
Obtain the mean accuracy of the models using 5-fold cross-validation (`cv=5`). Explore the output of the `cross_val_score()` function.

In [6]:
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}, score = {scores.mean():.4f}")

RandomForest, score = 0.9736
DecisionTree, score = 0.9623
LogisticRegression, score = 0.9774
KNN, score = 0.7705
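
To explore what `cross_val_score()` returns, the minimal sketch below prints the raw result instead of only its mean. It assumes the built-in iris dataset as a stand-in for the penguins training set, and the names `X_demo` and `y_demo` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the penguins training set
X_demo, y_demo = load_iris(return_X_y=True)

scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5)

# The result is a NumPy array holding one score per fold
print(type(scores).__name__, scores.shape)
print(f"per-fold scores: {scores}")
print(f"mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

Reporting the standard deviation next to the mean gives a rough idea of how much the score varies from fold to fold.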


### Understand

For `cv=` `<int/None>`, if the estimator is a classifier and `y` is binary or multiclass, `StratifiedKFold` is used. In all other cases, `KFold` is used. These splitters are instantiated with `shuffle=False`, so the splits are the same across calls, but this may not be desirable because the folds are then contiguous *slices* of the data (taken within each class in the case of `StratifiedKFold`).

See how `StratifiedKFold` and `KFold` behave with `shuffle=False` and `shuffle=True`.



In [7]:
import numpy as np
from sklearn.model_selection import KFold

# Creating a simple dataset
X_simple = np.arange(12).reshape(-1, 1)
y_simple = np.repeat([0, 1], 6)

print("--- KFold with shuffle=False ---")
kf_no_shuffle = KFold(n_splits=3, shuffle=False)
for fold, (train_index, test_index) in enumerate(kf_no_shuffle.split(X_simple)):
    print(f"Fold {fold+1}:")
    print(f"  Train indices: {train_index}")
    print(f"  Test indices: {test_index}")
    print("-" * 20)

print("\n--- KFold with shuffle=True ---")
kf_shuffle = KFold(n_splits=3, shuffle=True, random_state=42) # random_state added for reproducibility
for fold, (train_index, test_index) in enumerate(kf_shuffle.split(X_simple)):
    print(f"Fold {fold+1}:")
    print(f"  Train indices: {train_index}")
    print(f"  Test indices: {test_index}")
    print("-" * 20)

--- KFold with shuffle=False ---
Fold 1:
  Train indices: [ 4  5  6  7  8  9 10 11]
  Test indices: [0 1 2 3]
--------------------
Fold 2:
  Train indices: [ 0  1  2  3  8  9 10 11]
  Test indices: [4 5 6 7]
--------------------
Fold 3:
  Train indices: [0 1 2 3 4 5 6 7]
  Test indices: [ 8  9 10 11]
--------------------

--- KFold with shuffle=True ---
Fold 1:
  Train indices: [ 1  2  3  4  5  6  7 11]
  Test indices: [ 0  8  9 10]
--------------------
Fold 2:
  Train indices: [ 0  3  4  6  7  8  9 10]
  Test indices: [ 1  2  5 11]
--------------------
Fold 3:
  Train indices: [ 0  1  2  5  8  9 10 11]
  Test indices: [3 4 6 7]
--------------------


In [8]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Creating a simple dataset
X_simple = np.arange(12).reshape(-1, 1)
y_simple = np.repeat([0, 1], 6)

print("--- StratifiedKFold with shuffle=False ---")
skf_no_shuffle = StratifiedKFold(n_splits=3, shuffle=False)
for fold, (train_index, test_index) in enumerate(skf_no_shuffle.split(X_simple, y_simple)):
    print(f"Fold {fold+1}:")
    print(f"  Train indices: {train_index}")
    print(f"  Test indices: {test_index}")
    print("-" * 20)

print("\n--- StratifiedKFold with shuffle=True ---")
skf_shuffle = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) # random_state added for reproducibility
for fold, (train_index, test_index) in enumerate(skf_shuffle.split(X_simple, y_simple)):
    print(f"Fold {fold+1}:")
    print(f"  Train indices: {train_index}")
    print(f"  Test indices: {test_index}")
    print("-" * 20)

--- StratifiedKFold with shuffle=False ---
Fold 1:
  Train indices: [ 2  3  4  5  8  9 10 11]
  Test indices: [0 1 6 7]
--------------------
Fold 2:
  Train indices: [ 0  1  4  5  6  7 10 11]
  Test indices: [2 3 8 9]
--------------------
Fold 3:
  Train indices: [0 1 2 3 6 7 8 9]
  Test indices: [ 4  5 10 11]
--------------------

--- StratifiedKFold with shuffle=True ---
Fold 1:
  Train indices: [ 2  3  4  5  6  9 10 11]
  Test indices: [0 1 7 8]
--------------------
Fold 2:
  Train indices: [ 0  1  2  4  7  8 10 11]
  Test indices: [3 5 6 9]
--------------------
Fold 3:
  Train indices: [0 1 3 5 6 7 8 9]
  Test indices: [ 2  4 10 11]
--------------------
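
The rule for `cv=<int>` can also be checked directly. The sketch below uses scikit-learn's `check_cv` helper to see which splitter an integer `cv` resolves to. The continuous target `y_reg` is a hypothetical example added only to trigger the non-classification branch.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.repeat([0, 1], 6)       # binary target, as in the toy example
y_reg = np.linspace(0.0, 1.0, 12)    # continuous target (hypothetical)

# Resolve cv=5 the way scikit-learn does for a classifier and for a non-classifier
cv_clf = check_cv(5, y_class, classifier=True)
cv_reg = check_cv(5, y_reg, classifier=False)

print(type(cv_clf).__name__, "shuffle =", cv_clf.shuffle)
print(type(cv_reg).__name__, "shuffle =", cv_reg.shuffle)
```

Passing an explicit splitter object, such as `KFold(n_splits=5, shuffle=True, random_state=1)`, skips this default and gives control over shuffling.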


## Exercise 2.
The partitions used above were the same for every model, but they were contiguous *slices* of the data. Use the `KFold` splitter (or, alternatively, `StratifiedKFold`) to fix random partitions, and obtain the models' new scores, now selecting by the `f1_macro` metric.

In [9]:
kf = KFold(n_splits=5, shuffle=True, random_state=1)

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='f1_macro')
    print(f"{name}, score = {scores.mean():.4f}")

RandomForest, score = 0.9678
DecisionTree, score = 0.9571
LogisticRegression, score = 0.9769
KNN, score = 0.7143
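
The exercise mentions `StratifiedKFold` as an alternative. A minimal sketch of that variant follows, assuming the built-in iris dataset as a stand-in for the penguins training set (`X_demo` and `y_demo` are illustrative names). It also prints the class counts of each test fold to show that stratification preserves the class proportions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data (assumption): iris has 50 samples in each of its 3 classes
X_demo, y_demo = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Class counts per test fold: stratification keeps them balanced
fold_counts = [np.bincount(y_demo[test_idx]) for _, test_idx in skf.split(X_demo, y_demo)]
for counts in fold_counts:
    print(counts)

scores = cross_val_score(LogisticRegression(max_iter=10000), X_demo, y_demo,
                         cv=skf, scoring='f1_macro')
print(f"f1_macro = {scores.mean():.4f}")
```

Because `random_state` is fixed, every model evaluated with `skf` sees the same shuffled partitions.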


## Exercise 3.
Use the previous code to *programmatically* select the best model, train it, apply it to the test set, and obtain the accuracy on that set.

In [10]:
kf = KFold(n_splits=5, shuffle=True, random_state=1)

models_scores = {}

for name, model in models.items():
    models_scores[model] = cross_val_score(model,  X_train, y_train, cv=kf, scoring='f1_macro').mean()

models_scores

{RandomForestClassifier(random_state=42): np.float64(0.9678167490132183),
 DecisionTreeClassifier(random_state=42): np.float64(0.9571106487387709),
 LogisticRegression(max_iter=10000): np.float64(0.9768825688059621),
 KNeighborsClassifier(): np.float64(0.7142803233154111)}

In [11]:
best_model = max(models_scores, key=models_scores.get)
print(best_model)


LogisticRegression(max_iter=10000)


In [12]:
model = best_model
y_pred = model.fit(X_train,y_train).predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.99
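
The selection above uses the estimator objects themselves as dictionary keys. A possible variation, sketched below, keys the scores by model name and refits a fresh `clone` of the winner. The iris dataset and the names `candidates`, `X_tr`, `X_te`, `y_tr` and `y_te` are assumptions made so the example is self-contained.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the penguins features and target
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42)

candidates = {
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=10000),
    'KNN': KNeighborsClassifier(),
}

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Mean cross-validated f1_macro per model name
mean_scores = {
    name: cross_val_score(est, X_tr, y_tr, cv=kf, scoring='f1_macro').mean()
    for name, est in candidates.items()
}

best_name = max(mean_scores, key=mean_scores.get)

# Refit an unfitted copy of the winner on the full training set
best_model = clone(candidates[best_name]).fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print(f"Best model: {best_name} (CV f1_macro = {mean_scores[best_name]:.4f})")
print(f"Test accuracy: {test_accuracy:.2f}")
```

String keys make the score table easier to read and sort, and `clone` leaves the original entries in `candidates` unfitted.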


In [13]:
model.get_params()


{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 10000,
 'multi_class': 'deprecated',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

## Exercise 4.
Using `GridSearchCV`. `GridSearchCV` automates all of these operations.

Use the example code below and, as we did before, fix the use of different partitions when evaluating the models (use `KFold`).

### `GridSearchCV`, hyperparameter selection (a single base model)

In [14]:
kf = KFold(n_splits=5, shuffle=True, random_state=1)

baseline_model = KNeighborsClassifier()
param_grid = {'n_neighbors': [3, 5, 7, 9]}

# Using GridSearchCV to find the best model
grid_search = GridSearchCV(baseline_model, param_grid, cv=kf, scoring='accuracy')

grid_search.fit(X_train, y_train)

# Showing the best model found
print(f"Best model: {grid_search.best_estimator_}")

# Evaluating performance on the test set
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")


Best model: KNeighborsClassifier(n_neighbors=3)
Test accuracy: 0.76
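
One plausible reason for the low KNN score is that the features live on very different scales (for example, `body_mass_g` in the thousands versus `bill_depth_mm` around 20), which distorts distance computations. The sketch below compares a grid search over a plain KNN with one over a `StandardScaler` + KNN pipeline. It assumes the built-in wine dataset as a stand-in, since that dataset also mixes feature scales.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): wine features have very different magnitudes
X_demo, y_demo = load_wine(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
neighbors = [3, 5, 7, 9]

# Grid search on the unscaled features
raw_search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': neighbors},
                          cv=kf, scoring='accuracy')
raw_search.fit(X_demo, y_demo)

# Same search, with scaling fitted inside each CV training fold
scaled_pipeline = Pipeline([('scaler', StandardScaler()),
                            ('model', KNeighborsClassifier())])
scaled_search = GridSearchCV(scaled_pipeline, {'model__n_neighbors': neighbors},
                             cv=kf, scoring='accuracy')
scaled_search.fit(X_demo, y_demo)

print(f"raw:    {raw_search.best_params_}, CV accuracy = {raw_search.best_score_:.4f}")
print(f"scaled: {scaled_search.best_params_}, CV accuracy = {scaled_search.best_score_:.4f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the preprocessing.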


### `pipeline + GridSearchCV`, base model selection

In [16]:
# Creating the pipeline with preprocessing and model
pipeline = Pipeline([('model', None)])
# pipeline = Pipeline([('scaler', StandardScaler()), ('model', None)])

# Defining the parameter grid for GridSearchCV (different models only)
param_grid = [
    {'model': [RandomForestClassifier(random_state=42)]},
    {'model': [models['DecisionTree']]},
    {'model': [models['LogisticRegression']]},
    {'model': [models['KNN']]}
]

# Using GridSearchCV to find the best model
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
# grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')

grid_search.fit(X_train, y_train)

# Showing the best model found
print(f"Best model: {grid_search.best_estimator_['model']}")

# Evaluating performance on the test set
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")


Best model: LogisticRegression(max_iter=10000)
Test accuracy: 0.99


### `pipeline + GridSearchCV`, selection of base models and hyperparameters

In [17]:
# Creating the pipeline with preprocessing and model
pipeline = Pipeline([('model', None)])
# pipeline = Pipeline([('scaler', StandardScaler()), ('model', None)])

# Defining the parameter grid for GridSearchCV (different models and their hyperparameters)
param_grid = [
    {'model': [RandomForestClassifier(random_state=42)]},
    {'model': [models['DecisionTree']], 'model__max_depth': [5,6,7]},
    {'model': [models['LogisticRegression']]},
    {'model': [models['KNN']], 'model__n_neighbors': [3, 5, 7, 9]}
]

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Using GridSearchCV to find the best model
grid_search = GridSearchCV(pipeline, param_grid, cv=kf, scoring='accuracy')
# grid_search = GridSearchCV(pipeline, param_grid, cv=kf, scoring='f1_macro')

grid_search.fit(X_train, y_train)

# Showing the best model found
print(f"Best model: {grid_search.best_estimator_['model']}")

# Evaluating performance on the test set
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")


Best model: LogisticRegression(max_iter=10000)
Test accuracy: 0.99
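
When `param_grid` is a list of dictionaries, each dictionary is expanded on its own, so hyperparameters are only combined with the model they belong to. The sketch below uses `ParameterGrid` to enumerate the candidates that a grid like the one above would generate. The estimators are re-created here so the example is self-contained.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

param_grid = [
    {'model': [RandomForestClassifier(random_state=42)]},
    {'model': [DecisionTreeClassifier(random_state=42)], 'model__max_depth': [5, 6, 7]},
    {'model': [LogisticRegression(max_iter=10000)]},
    {'model': [KNeighborsClassifier()], 'model__n_neighbors': [3, 5, 7, 9]},
]

# Each sub-grid expands independently: 1 + 3 + 1 + 4 candidates
candidates = list(ParameterGrid(param_grid))
print(f"Number of candidates: {len(candidates)}")

for candidate in candidates:
    model_name = type(candidate['model']).__name__
    hyperparams = {k: v for k, v in candidate.items() if k != 'model'}
    print(model_name, hyperparams)
```

With 5 folds, `GridSearchCV` would fit each candidate 5 times, plus one final refit of the best candidate on the whole training set.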


# Exercise 5 (Extra)

Adapt one of the code examples above to perform model selection that also includes:

1. A neural network model (MLP).
2. One other new classification model of your choice.

Try to find MLP hyperparameters that yield an accuracy > 0.99.

In [18]:
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

In [20]:
MLPClassifier().get_params()

{'activation': 'relu',
 'alpha': 0.0001,
 'batch_size': 'auto',
 'beta_1': 0.9,
 'beta_2': 0.999,
 'early_stopping': False,
 'epsilon': 1e-08,
 'hidden_layer_sizes': (100,),
 'learning_rate': 'constant',
 'learning_rate_init': 0.001,
 'max_fun': 15000,
 'max_iter': 200,
 'momentum': 0.9,
 'n_iter_no_change': 10,
 'nesterovs_momentum': True,
 'power_t': 0.5,
 'random_state': None,
 'shuffle': True,
 'solver': 'adam',
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': False,
 'warm_start': False}

In [21]:
SVC().get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [23]:
# pipeline = Pipeline([('scaler', StandardScaler()), ('model', None)]) # better for neural networks
pipeline = Pipeline([('model', None)])

# Defining the parameter grid for GridSearchCV (different models and their hyperparameters)
param_grid = [
    {'model': [RandomForestClassifier(random_state=42)]},
    {'model': [DecisionTreeClassifier(random_state=42)], 'model__max_depth': [5,6,7]},
    {'model': [LogisticRegression(max_iter=1000)]},
    {'model': [KNeighborsClassifier()], 'model__n_neighbors': [3, 5, 7, 9]},
    {'model': [MLPClassifier(max_iter=10000,random_state=42)],
     'model__hidden_layer_sizes': [(8,16,8),(10,),(32, 128, 32)],
     'model__activation': ['tanh','logistic']},
    {'model': [SVC()], 'model__C': [0.1, 1], 'model__kernel': ['linear', 'rbf']}
]

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# Using GridSearchCV to find the best model
grid_search = GridSearchCV(pipeline, param_grid, cv=kf, scoring='accuracy')
# grid_search = GridSearchCV(pipeline, param_grid, cv=kf, scoring='f1_macro')

grid_search.fit(X_train, y_train)

# Showing the best model found
print(f"Best model: {grid_search.best_estimator_['model']}")

# Evaluating performance on the test set
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best model: SVC(C=1, kernel='linear')
Test accuracy: 1.00
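
The convergence warning in the output suggests either raising `max_iter` or scaling the data. As a rough illustration of the second option, the sketch below compares how many solver iterations `LogisticRegression` needs with and without a `StandardScaler` step. It assumes the built-in wine dataset as a stand-in, so the numbers will differ from those for the penguins data.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): wine features have very different magnitudes
X_demo, y_demo = load_wine(return_X_y=True)

# Unscaled features: the lbfgs solver usually needs many iterations
raw_model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# Standardized features: the same solver typically converges much faster
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=10000)),
]).fit(X_demo, y_demo)

raw_iters = int(raw_model.n_iter_.max())
scaled_iters = int(scaled_model['model'].n_iter_.max())

print(f"iterations without scaling: {raw_iters}")
print(f"iterations with scaling:    {scaled_iters}")
```

In the grid search above, the same effect could be obtained by enabling the commented-out pipeline that starts with a `('scaler', StandardScaler())` step.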


In [24]:
grid_search.cv_results_

{'mean_fit_time': array([0.16182437, 0.00311627, 0.00298452, 0.00335274, 0.2507442 ,
        0.00385423, 0.00369849, 0.00285182, 0.00239892, 0.05064211,
        0.13665161, 0.68514433, 0.1225739 , 0.03852873, 0.1273634 ,
        0.11641841, 0.00472894, 0.16746988, 0.00385542]),
 'std_fit_time': array([6.61290478e-03, 8.41496137e-05, 3.52862719e-05, 4.04124389e-04,
        1.19789944e-02, 1.02982196e-03, 3.10062701e-04, 3.93803195e-04,
        9.31723701e-05, 1.18970333e-02, 3.44698108e-02, 5.01790025e-01,
        1.35339720e-02, 1.04241788e-02, 3.47665337e-02, 1.16148433e-01,
        4.31375805e-04, 1.32665798e-01, 8.97819159e-05]),
 'mean_score_time': array([0.00952802, 0.00214462, 0.00205121, 0.00239348, 0.0035358 ,
        0.00494227, 0.00533829, 0.003969  , 0.00353994, 0.00300002,
        0.00340424, 0.00468969, 0.00278301, 0.00350285, 0.00380716,
        0.00267859, 0.0031724 , 0.00270758, 0.00255256]),
 'std_score_time': array([5.67430640e-04, 5.81231832e-05, 7.37573221e-06, 2.57

In [25]:
grid_search_results = grid_search.cv_results_

# Get the mean test score for each parameter combination
mean_test_scores = grid_search_results['mean_test_score']
params = grid_search_results['params']

# Print the mean test score for each model
for mean_score, param in zip(mean_test_scores, params):
    model_name = param # ['model'].__class__.__name__
    print(f"Model: {model_name}, Mean Test Accuracy: {mean_score:.4f}")

Model: {'model': RandomForestClassifier(random_state=42)}, Mean Test Accuracy: 0.9737
Model: {'model': DecisionTreeClassifier(random_state=42), 'model__max_depth': 5}, Mean Test Accuracy: 0.9623
Model: {'model': DecisionTreeClassifier(random_state=42), 'model__max_depth': 6}, Mean Test Accuracy: 0.9623
Model: {'model': DecisionTreeClassifier(random_state=42), 'model__max_depth': 7}, Mean Test Accuracy: 0.9623
Model: {'model': LogisticRegression(max_iter=1000)}, Mean Test Accuracy: 0.9812
Model: {'model': KNeighborsClassifier(), 'model__n_neighbors': 3}, Mean Test Accuracy: 0.7819
Model: {'model': KNeighborsClassifier(), 'model__n_neighbors': 5}, Mean Test Accuracy: 0.7705
Model: {'model': KNeighborsClassifier(), 'model__n_neighbors': 7}, Mean Test Accuracy: 0.7706
Model: {'model': KNeighborsClassifier(), 'model__n_neighbors': 9}, Mean Test Accuracy: 0.7632
Model: {'model': MLPClassifier(max_iter=10000, random_state=42), 'model__activation': 'tanh', 'model__hidden_layer_sizes': (8, 16, 
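
The raw `cv_results_` dictionary is hard to read. One option, sketched below, is to load it into a pandas `DataFrame` and sort by `rank_test_score`. The example assumes the built-in iris dataset and a small KNN grid as stand-ins, and the names `search` and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and grid (assumption), kept small so the example runs quickly
X_demo, y_demo = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 9]},
                      cv=kf, scoring='accuracy')
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')

print(summary.to_string(index=False))
```

The same approach applies to the multi-model `grid_search` above, where the `params` column would also show which estimator each row refers to.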