# *Cross validation II: Gridsearch*

### Table of contents <a name="topo"></a>
- 1. [Introduction](#1)
- 2. [Loading the data](#2)
- 3. [Setting aside the test set](#3)
- 4. [Obtaining the pruning paths via ccp_alpha](#4)
- 5. [*k-fold*, with *holdout*](#5)
- 6. [Final model](#6)
- 7. [Conclusions](#7)
- 8. [References](#8)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

## 1. Introduction <a name="1"></a>
[Back to the table of contents](#topo)

Picking up where the last class left off:

- What would have happened if we had drawn a different test set?
- Could this particular test set be giving us a higher or lower accuracy purely by chance?
- We are already 'losing' 20% of the data to the test set and another 20% to validation; can we reduce that loss?

We will address these questions with the cross-validation techniques we have just seen.

The strategy is as follows:
- Set aside a test set (also called the *holdout*)
- Obtain the pruning paths, via ```ccp_alpha```, of the largest possible tree
- Find the best pruning with the *k-fold* *cross-validation* strategy, for each *ccp_alpha*:
    - split the training set into k groups (called *folds*)
    - for each group j among the k groups:
        - train a tree with the current ccp_alpha, using all observations except those in group j
        - evaluate the metric of that tree using group j as the validation set
    - choose the ccp_alpha whose trees have the best average metric on the validation sets
- After choosing the best model configuration, refit the model with that configuration on all the data (except the test set)
- Evaluate the model on the test set.

That's it: this is our final model.
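The strategy above can be written out "by hand" in a few lines. The sketch below is only an illustration under stated assumptions: it uses the iris dataset bundled with scikit-learn as a stand-in for the penguins table, 5 folds, and `random_state=0`; none of these choices come from the lesson itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the penguins table used in this lesson.
X, y = load_iris(return_X_y=True)

# 1. Set the holdout aside; it is not touched again until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Candidate alphas from the pruning path of the largest tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])

# 3. k-fold: for each alpha, train on k-1 folds and validate on the remaining one.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = []
for alpha in alphas:
    fold_scores = []
    for train_idx, valid_idx in kfold.split(X_train):
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        tree.fit(X_train[train_idx], y_train[train_idx])
        fold_scores.append(tree.score(X_train[valid_idx], y_train[valid_idx]))
    mean_scores.append(np.mean(fold_scores))

# 4. Refit the best configuration on all the training data, then test once.
best_alpha = alphas[int(np.argmax(mean_scores))]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
test_accuracy = final_tree.score(X_test, y_test)
print(f"best ccp_alpha={best_alpha:.5f}, test accuracy={test_accuracy:.3f}")
```

The holdout appears only in the last two lines, which is what keeps the final accuracy an honest estimate.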

### 2. Loading the data<a name="2"></a>
[Back to the table of contents](#topo)

In this class we load the dataset already cleaned in the previous class, with the missing values of the ```sex``` variable filled in.

In [4]:
pg1 = pd.read_csv(r'C:\Users\Paulo Roberto\Downloads\Módulo 17 - Árvores ll\Exercício 02\pg1.csv')
pg1.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


### 3. Setting aside the test set<a name="3"></a>
[Back to the table of contents](#topo)

This time we set aside only a test set (called the *holdout* because it is "held out" of the whole process).

In [5]:
X = pd.get_dummies(pg1.drop(columns=['island','species']), drop_first=True)
y = pg1.species

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=2360873)

### 4. Obtaining the pruning paths via ccp_alpha<a name="4"></a>
[Back to the table of contents](#topo)

Recalling the strategy: we compute the ```ccp_alpha``` values using all the data except the test set.

In [6]:
clf = DecisionTreeClassifier(random_state=2360873)
caminho = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = caminho.ccp_alphas, caminho.impurities

In [7]:
# ensure the ccp_alphas are unique and non-negative
ccp_alphas = np.unique(ccp_alphas[ccp_alphas>=0])

In [8]:
ccp_alphas

array([0.        , 0.00362909, 0.00549451, 0.00752508, 0.01046572,
       0.01391639, 0.03817082, 0.19371502, 0.33744331])
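To get a feel for what these values mean, the sketch below fits one tree per candidate `ccp_alpha` and prints how many leaves survive the pruning. It is an illustration only: the iris dataset and `random_state=0` are assumptions standing in for the penguins data and the seed used in this lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris instead of the penguins table.
X, y = load_iris(return_X_y=True)

# Candidate alphas from the pruning path of the unpruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])

# One tree per alpha: the larger the alpha, the more aggressive the pruning.
leaves = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    leaves.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha:.5f} -> {leaves[-1]} leaves")
```

The number of leaves never grows as `ccp_alpha` increases, so each candidate corresponds to a progressively simpler tree.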

## 5. *k-fold*, with *holdout*<a name="5"></a>
[Back to the table of contents](#topo)

You could certainly run the *grid search* "by hand", but we will use a scikit-learn function that does all of it for us: ```GridSearchCV()```. The CV in the name stands for *cross-validation*. This function does exactly what we need: a *grid search* in which each candidate is evaluated with *cross-validation*. Let's go through its parameters:

```
GridSearchCV(
    estimator,
    param_grid,
    scoring=None,
    cv=None
)
```
- **estimator**: the model we want to use.
- **param_grid**: a dictionary whose keys are hyperparameter names and whose values are lists of the candidate values we want to test.
- **scoring**: the metric used to evaluate the model's performance. Here we use accuracy, which is also what a classifier's own ```score``` method returns when ```scoring``` is left as ```None```.
- **cv**: when given as an integer, this parameter is the *k* of the *k-fold*.
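To make the role of `param_grid` concrete, the sketch below expands a small grid with scikit-learn's `ParameterGrid`. The grid is hypothetical: `max_depth` and the specific values are there only for illustration, since this lesson varies `ccp_alpha` alone.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration; the lesson itself only varies ccp_alpha.
example_grid = {
    "ccp_alpha": [0.0, 0.01, 0.05],
    "max_depth": [3, 5],
}

# Every combination of the listed values becomes one candidate configuration.
candidates = list(ParameterGrid(example_grid))
for params in candidates:
    print(params)

# Each candidate is fitted once per fold.
n_folds = 5
print(f"{len(candidates)} candidates x {n_folds} folds = {len(candidates) * n_folds} fits")
```

The same arithmetic explains the fitting log in this lesson: 9 candidate values of `ccp_alpha` times 15 folds gives 135 fits.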

In [9]:
clf = DecisionTreeClassifier(random_state=2360873)
clf

In [10]:
grid_parametros = {'ccp_alpha':ccp_alphas}
grid_parametros

{'ccp_alpha': array([0.        , 0.00362909, 0.00549451, 0.00752508, 0.01046572,
        0.01391639, 0.03817082, 0.19371502, 0.33744331])}

In [11]:
grid = GridSearchCV(estimator=clf, param_grid=grid_parametros, cv=15, verbose=100)
grid.fit(X_train, y_train)

Fitting 15 folds for each of 9 candidates, totalling 135 fits
[CV 1/15; 1/9] START ccp_alpha=0.0..............................................
[CV 1/15; 1/9] END ...............ccp_alpha=0.0;, score=0.842 total time=   0.0s
[CV 2/15; 1/9] START ccp_alpha=0.0..............................................
[CV 2/15; 1/9] END ...............ccp_alpha=0.0;, score=0.789 total time=   0.0s
[CV 3/15; 1/9] START ccp_alpha=0.0..............................................
[CV 3/15; 1/9] END ...............ccp_alpha=0.0;, score=0.947 total time=   0.0s
[CV 4/15; 1/9] START ccp_alpha=0.0..............................................
[CV 4/15; 1/9] END ...............ccp_alpha=0.0;, score=0.944 total time=   0.0s
[CV 5/15; 1/9] START ccp_alpha=0.0..............................................
[CV 5/15; 1/9] END ...............ccp_alpha=0.0;, score=0.944 total time=   0.0s
[CV 6/15; 1/9] START ccp_alpha=0.0..............................................
[CV 6/15; 1/9] END ...............ccp_alpha=0.0

In [12]:
grid


In [13]:
resultados = pd.DataFrame(grid.cv_results_)
resultados.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ccp_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | split8_test_score | split9_test_score | split10_test_score | split11_test_score | split12_test_score | split13_test_score | split14_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005354 | 0.003384 | 0.002266 | 0.001689 | 0.0 | {'ccp_alpha': 0.0} | 0.842105 | 0.789474 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.953411 | 0.060496 | 4 |
| 1 | 0.006902 | 0.002001 | 0.004081 | 0.002077 | 0.003629 | {'ccp_alpha': 0.003629086962420302} | 0.842105 | 0.789474 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.953411 | 0.060496 | 4 |
| 2 | 0.007353 | 0.001428 | 0.004518 | 0.001175 | 0.005495 | {'ccp_alpha': 0.005494505494505495} | 0.842105 | 0.894737 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960429 | 0.045259 | 1 |
| 3 | 0.007737 | 0.001462 | 0.003643 | 0.001591 | 0.007525 | {'ccp_alpha': 0.0075250836120401374} | 0.894737 | 0.894737 | 0.947368 | 0.888889 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960234 | 0.041555 | 2 |
| 4 | 0.006442 | 0.002123 | 0.003805 | 0.002312 | 0.010466 | {'ccp_alpha': 0.010465724751439037} | 0.894737 | 0.894737 | 0.947368 | 0.888889 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960234 | 0.041555 | 2 |


In [14]:
grid.best_score_

np.float64(0.9604288499025342)

In [22]:
resultados

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ccp_alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | split8_test_score | split9_test_score | split10_test_score | split11_test_score | split12_test_score | split13_test_score | split14_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00508 | 0.001168 | 0.002519 | 0.000726 | 0.0 | {'ccp_alpha': 0.0} | 0.842105 | 0.789474 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.953411 | 0.060496 | 4 |
| 1 | 0.004293 | 0.000614 | 0.002045 | 0.000424 | 0.00362909 | {'ccp_alpha': 0.003629086962420302} | 0.842105 | 0.789474 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.953411 | 0.060496 | 4 |
| 2 | 0.004154 | 0.000694 | 0.001912 | 0.000307 | 0.00549451 | {'ccp_alpha': 0.005494505494505495} | 0.842105 | 0.894737 | 0.947368 | 0.944444 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960429 | 0.045259 | 1 |
| 3 | 0.003955 | 0.000603 | 0.001921 | 0.000345 | 0.00752508 | {'ccp_alpha': 0.0075250836120401374} | 0.894737 | 0.894737 | 0.947368 | 0.888889 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960234 | 0.041555 | 2 |
| 4 | 0.005237 | 0.000767 | 0.002276 | 0.000302 | 0.0104657 | {'ccp_alpha': 0.010465724751439037} | 0.894737 | 0.894737 | 0.947368 | 0.888889 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.960234 | 0.041555 | 2 |
| 5 | 0.004855 | 0.000667 | 0.002452 | 0.000437 | 0.0139164 | {'ccp_alpha': 0.013916392767933371} | 0.894737 | 0.947368 | 0.947368 | 0.888889 | ... | 1.0 | 0.944444 | 0.944444 | 1.0 | 0.944444 | 0.944444 | 0.888889 | 0.952632 | 0.039232 | 6 |
| 6 | 0.004299 | 0.000465 | 0.00284 | 0.00025 | 0.0381708 | {'ccp_alpha': 0.03817082388510958} | 0.842105 | 0.894737 | 0.894737 | 0.833333 | ... | 1.0 | 0.944444 | 0.944444 | 0.944444 | 0.944444 | 0.944444 | 0.888889 | 0.930994 | 0.053892 | 7 |
| 7 | 0.004402 | 0.00046 | 0.002505 | 0.000587 | 0.193715 | {'ccp_alpha': 0.19371501717135625} | 0.842105 | 0.894737 | 0.894737 | 0.833333 | ... | 0.833333 | 0.944444 | 0.944444 | 0.777778 | 0.777778 | 0.777778 | 0.888889 | 0.853216 | 0.052796 | 8 |
| 8 | 0.004496 | 0.00051 | 0.002155 | 0.000316 | 0.337443 | {'ccp_alpha': 0.33744330557517377} | 0.736842 | 0.789474 | 0.789474 | 0.777778 | ... | 0.444444 | 0.444444 | 0.444444 | 0.777778 | 0.444444 | 0.444444 | 0.722222 | 0.572904 | 0.158207 | 9 |
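With this many columns, it can help to keep only the ones that drive the decision. The sketch below shows one possible way to do that; the iris dataset, the four candidate values and `cv=5` are assumptions chosen so the example runs on its own, not the settings used in this lesson.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and grid (assumptions), so the example is self-contained.
X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.01, 0.05, 0.3]},
    cv=5,
)
grid.fit(X, y)

# Keep the parameter, the mean and spread of the fold scores, and the rank,
# then sort so that the best candidate comes first.
summary = (
    pd.DataFrame(grid.cv_results_)
    [["param_ccp_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(summary)
```

Looking at `std_test_score` next to `mean_test_score` shows whether the gap between two candidates is large compared with the fold-to-fold variation.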


### 6. Final model<a name="6"></a>
[Back to the table of contents](#topo)

Now we refit the final model with the ```ccp_alpha``` value that gave the best result and compute its accuracy on the test set.

In [15]:
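# best ccp_alpha found by the search (same value as column 4, param_ccp_alpha, of resultados)
melhor_ccp = grid.best_params_['ccp_alpha']

# refit on the whole training set with the chosen ccp_alpha
clf = DecisionTreeClassifier(random_state=2360873, ccp_alpha=melhor_ccp).fit(X_train, y_train)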

In [16]:
clf.score(X_test, y_test)

0.9855072463768116
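By default `GridSearchCV` uses `refit=True`, so after the search it has already refitted the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`. The sketch below compares that estimator with a manual refit; it assumes the iris dataset, a small hypothetical grid and `random_state=0` in place of the lesson's data and seed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris instead of the penguins table.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.01, 0.05]},  # hypothetical candidates
    cv=5,
)
grid.fit(X_train, y_train)

# Manual refit with the winning ccp_alpha, as done in this section.
best_alpha = grid.best_params_["ccp_alpha"]
manual = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)

# The automatically refitted estimator should score the same on the holdout.
score_manual = manual.score(X_test, y_test)
score_refit = grid.best_estimator_.score(X_test, y_test)
print(score_manual, score_refit)
```

Because the seed and the training data are the same, both trees are built identically; the manual refit mainly makes the final step of the strategy explicit.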

### 7. Conclusions<a name="7"></a>
[Back to the table of contents](#topo)

- The final result on the test set (accuracy of about 0.986) is quite reasonable and consistent with what *cross-validation* led us to expect: it lies within one standard deviation across folds (about 0.045) of the mean cross-validated accuracy (about 0.960).

- These cross-validation techniques can be computationally demanding. Depending on the size of the dataset and the number of variables, they can take a considerable amount of time and may call for more careful planning.

### 8. References<a name="8"></a>
[Back to the table of contents](#topo)

- [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html)

- "The Elements of Statistical Learning", J. H. Friedman, R. Tibshirani and T. Hastie (available [here](https://web.stanford.edu/~hastie/Papers/ESLII.pdf))

- "An Introduction to Statistical Learning", Gareth M. James, Daniela Witten, Trevor Hastie, R. J. Tibshirani (available [here](https://www.statlearning.com/))