# Specialization in Artificial Intelligence

**Machine Learning - Lecture 2.3: Sampling Techniques**

Example code developed by instructor [Adriano Rivolli](mailto:rivolli@utpfr.edu.br)

*The code shows how to apply the different sampling strategies*

In [None]:
# Imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample
from sklearn.model_selection import LeaveOneOut

In [None]:
# Load the diabetes dataset to use as an example
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


## Sampling Strategies

#### Holdout

scikit-learn provides the `train_test_split` function for this strategy; its documentation is available at:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

To create a validation set, pass `X_train` and `y_train` to `train_test_split` a second time.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, diabetes.target, test_size=0.3, random_state=0, shuffle=True)

print("Dataset", df.shape, diabetes.target.shape)
print("Train dataset", X_train.shape, y_train.shape)
print("Test dataset", X_test.shape, y_test.shape)

X_test

Dataset (442, 10) (442,)
Train dataset (309, 10) (309,)
Test dataset (133, 10) (133,)


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
362,0.019913,0.050680,0.104809,0.070072,-0.035968,-0.026679,-0.024993,-0.002592,0.003709,0.040343
249,-0.012780,-0.044642,0.060618,0.052858,0.047965,0.029375,-0.017629,0.034309,0.070207,0.007207
271,0.038076,0.050680,0.008883,0.042529,-0.042848,-0.021042,-0.039719,-0.002592,-0.018114,0.007207
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
400,-0.023677,-0.044642,0.045529,0.090729,-0.018080,-0.035447,0.070730,-0.039493,-0.034522,-0.009362
...,...,...,...,...,...,...,...,...,...,...
328,-0.038207,-0.044642,0.067085,-0.060756,-0.029088,-0.023234,-0.010266,-0.002592,-0.001496,0.019633
414,0.081666,0.050680,0.006728,-0.004534,0.109883,0.117056,-0.032356,0.091875,0.054720,0.007207
421,0.038076,0.050680,0.016428,0.021872,0.039710,0.045032,-0.043401,0.071210,0.049770,0.015491
361,0.041708,-0.044642,-0.007284,0.028758,-0.042848,-0.048286,0.052322,-0.076395,-0.072133,0.023775
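
The validation split described above can be sketched by calling `train_test_split` twice. This is a minimal example rather than part of the original lecture code: the 60/20/20 proportions, the `random_state` values and the names `X_rest` and `X_val` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

# First split: hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 80% becomes validation (about 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print("Train:", X_train.shape, "Validation:", X_val.shape, "Test:", X_test.shape)
```

The three subsets are disjoint, so the test set stays untouched while the validation set is used for model selection.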


#### Bootstrap

scikit-learn has no built-in splitter for this strategy, so a helper function was written for this purpose.

In [None]:
def bootstrap(X, y, random_state=0):
  # Draw len(X) instances with replacement to form the training set
  X_train, y_train = resample(X, y, random_state=random_state)
  # Instances that were never drawn (out-of-bag) form the test set
  X_test = X.drop(np.unique(X_train.index))
  y_test = y[X_test.index]
  return (X_train, X_test, y_train, y_test)

X_train, X_test, y_train, y_test = bootstrap(df, diabetes.target)

print("Unique indices:\n", np.unique(X_train.index), "\n")
print("Total unique indices:", len(np.unique(X_train.index)))
print("Training data:",  X_train.shape, y_train.shape)
print("Test data:", X_test.shape, y_test.shape)

X_test

Unique indices:
 [  0   2   3   4   5   9  11  13  16  19  22  23  24  25  27  28  29  31
  32  33  35  36  38  39  40  41  42  43  44  47  48  50  51  53  57  58
  61  62  63  67  69  70  72  73  74  77  79  80  81  82  83  84  86  87
  88  91  93  94  95  98  99 104 105 106 109 110 111 114 115 116 117 119
 120 121 123 125 127 128 129 130 131 133 135 136 138 139 141 143 145 146
 147 148 149 150 151 152 156 158 160 161 163 164 165 166 168 169 172 174
 176 177 178 180 182 183 184 185 187 189 191 192 193 195 197 198 199 201
 202 203 204 207 209 211 212 214 215 216 217 218 219 220 221 222 223 226
 227 228 232 234 237 238 241 242 243 244 248 251 253 254 255 256 257 258
 259 260 262 265 266 267 269 270 273 274 275 276 277 279 280 281 282 284
 285 286 287 288 289 290 291 292 294 295 296 297 300 302 304 305 307 308
 309 311 314 317 320 321 322 323 324 326 327 328 329 331 333 334 335 337
 338 340 341 345 347 348 349 352 353 356 357 358 359 360 361 363 364 368
 369 370 371 372 373 374 376 377 3

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
6,-0.045472,0.050680,-0.047163,-0.015999,-0.040096,-0.024800,0.000779,-0.039493,-0.062917,-0.038357
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
8,0.041708,0.050680,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014960,0.011349
10,-0.096328,-0.044642,-0.083808,0.008101,-0.103389,-0.090561,-0.013948,-0.076395,-0.062917,-0.034215
...,...,...,...,...,...,...,...,...,...,...
428,0.048974,0.050680,0.088642,0.087287,0.035582,0.021546,-0.024993,0.034309,0.066051,0.131470
432,0.009016,-0.044642,0.055229,-0.005670,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
436,-0.056370,-0.044642,-0.074108,-0.050427,-0.024960,-0.047034,0.092820,-0.076395,-0.061176,-0.046641
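
As a rough check on the size of the bootstrap test set: when n instances are drawn with replacement, the probability that a given instance is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. Roughly a third of the data should therefore end up out-of-bag. The sketch below compares that value with the fraction measured over a few resamples; the number of repetitions and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.utils import resample

n = 442  # number of instances in the diabetes dataset
indices = np.arange(n)

# Probability that a given instance is never drawn in n draws with replacement.
expected_oob = (1 - 1 / n) ** n

# Measure the out-of-bag fraction over several bootstrap resamples.
oob_fractions = []
for seed in range(20):
    sampled = resample(indices, random_state=seed)
    oob = np.setdiff1d(indices, sampled)  # instances left out of the sample
    oob_fractions.append(len(oob) / n)

mean_oob = float(np.mean(oob_fractions))
print(f"Expected OOB fraction: {expected_oob:.3f}")
print(f"Mean measured OOB fraction: {mean_oob:.3f}")
```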


#### Cross-validation

scikit-learn provides the `KFold` class for this strategy; its documentation is available at:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold

The `split` method returns a generator that yields one pair of train and test index arrays per fold. Unlike the previous examples, it returns indices rather than data, and those indices must be used to select the instances from the original dataset.
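
To make the index-based behavior concrete, the sketch below runs `KFold` on a toy array of six values (made up for illustration) without shuffling, so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.array([10, 20, 30, 40, 50, 60])  # toy data, not from the dataset

kf_toy = KFold(n_splits=3)  # no shuffling: test folds are contiguous blocks
folds = list(kf_toy.split(X_toy))  # materialize the generator

for train_index, test_index in folds:
    # split yields positions; the data is selected by indexing with them
    print("train idx:", train_index, "test idx:", test_index,
          "test values:", X_toy[test_index])
```

Without shuffling, the test folds are the contiguous position blocks [0, 1], [2, 3] and [4, 5], and every instance appears in exactly one test fold.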

To create a validation set, pass the training data to `train_test_split` or split it again with a new `KFold` object.

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=35)
for i, (train_index, test_index) in enumerate(kf.split(df, diabetes.target)):
  # split yields positional indices, so rows are selected with iloc
  X_train = df.iloc[train_index]
  y_train = diabetes.target[train_index]
  X_test = df.iloc[test_index]
  y_test = diabetes.target[test_index]

  print("Fold ", i, X_train.shape, y_train.shape, X_test.shape, y_test.shape)
  print()

Fold  0 (353, 10) (353,) (89, 10) (89,)

Fold  1 (353, 10) (353,) (89, 10) (89,)

Fold  2 (354, 10) (354,) (88, 10) (88,)

Fold  3 (354, 10) (354,) (88, 10) (88,)

Fold  4 (354, 10) (354,) (88, 10) (88,)



Other related classes and functions may also be useful:


*   Stratified K-fold for classification: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold
*   Evaluation scores using CV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
*   Predictions using CV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict
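
As an illustration of the first two items, the sketch below passes a `StratifiedKFold` splitter to `cross_val_score`. Stratification needs class labels, so it uses a classification dataset (`load_breast_cancer`) instead of the diabetes regression data. The dataset, the `DecisionTreeClassifier` model and the parameter values are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each fold keeps approximately the same class proportions as the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# One accuracy score per fold (accuracy is the default scorer for classifiers).
scores = cross_val_score(model, X, y, cv=skf)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

`cross_val_score` handles the fit and score loop internally, so there is no need to index the data by hand as in the `KFold` cell.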



#### Leave-one-out

scikit-learn provides the `LeaveOneOut` class for this strategy; its documentation is available at:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut

The `split` method works the same way as in the previous example.
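
One way to see the similarity, as a small sketch on made-up data: leave-one-out produces the same splits as an unshuffled `KFold` whose number of folds equals the number of instances.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(5).reshape(-1, 1)  # five toy instances, for illustration only

loo_splits = list(LeaveOneOut().split(X_toy))
kf_splits = list(KFold(n_splits=len(X_toy)).split(X_toy))

# Both strategies hold out exactly one instance per iteration, in the same order.
same = all(
    np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
    for (tr_a, te_a), (tr_b, te_b) in zip(loo_splits, kf_splits)
)
print("Number of splits:", len(loo_splits))
print("Same splits as KFold(n_splits=n):", same)
```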

In [None]:
loo = LeaveOneOut()
for i, (train_index, test_index) in enumerate(loo.split(df, diabetes.target)):
  # split yields positional indices, so rows are selected with iloc
  X_train = df.iloc[train_index]
  y_train = diabetes.target[train_index]
  X_test = df.iloc[test_index]
  y_test = diabetes.target[test_index]

  print((i, X_train.shape, y_train.shape, X_test.index))

(0, (441, 10), (441,), Int64Index([0], dtype='int64'))
(1, (441, 10), (441,), Int64Index([1], dtype='int64'))
(2, (441, 10), (441,), Int64Index([2], dtype='int64'))
(3, (441, 10), (441,), Int64Index([3], dtype='int64'))
(4, (441, 10), (441,), Int64Index([4], dtype='int64'))
(5, (441, 10), (441,), Int64Index([5], dtype='int64'))
(6, (441, 10), (441,), Int64Index([6], dtype='int64'))
(7, (441, 10), (441,), Int64Index([7], dtype='int64'))
(8, (441, 10), (441,), Int64Index([8], dtype='int64'))
(9, (441, 10), (441,), Int64Index([9], dtype='int64'))
(10, (441, 10), (441,), Int64Index([10], dtype='int64'))
(11, (441, 10), (441,), Int64Index([11], dtype='int64'))
(12, (441, 10), (441,), Int64Index([12], dtype='int64'))
(13, (441, 10), (441,), Int64Index([13], dtype='int64'))
(14, (441, 10), (441,), Int64Index([14], dtype='int64'))
(15, (441, 10), (441,), Int64Index([15], dtype='int64'))
(16, (441, 10), (441,), Int64Index([16], dtype='int64'))
(17, (441, 10), (441,), Int64Index([17], dtype='int