<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#k-Nearest-Neighbors" data-toc-modified-id="k-Nearest-Neighbors-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>k-Nearest Neighbors</a></span></li><li><span><a href="#Shifting-MNIST-(Data-Argumentation)" data-toc-modified-id="Shifting-MNIST-(Data-Argumentation)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Shifting MNIST (Data Augmentation)</a></span></li></ul></div>

# Exercises

## k-Nearest Neighbors

1 - Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the ```KNeighborsClassifier``` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the ```weights``` and ```n_neighbors``` hyperparameters).
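
As a minimal sketch of the grid search suggested in the hint, the snippet below tunes `weights` and `n_neighbors` with `GridSearchCV`. To keep it quick to run, it uses scikit-learn's small bundled `load_digits` dataset (8x8 images) as a stand-in for MNIST; the candidate values in `param_grid` and the 25% test split are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 1,797 small digit images instead of MNIST
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=42, stratify=y_small)

# Illustrative candidate values for the two hyperparameters named in the hint
param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}

knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="accuracy")
knn_search.fit(X_tr, y_tr)

print(knn_search.best_params_)
# score() evaluates the best model, refit on the whole training split
test_accuracy = knn_search.score(X_te, y_te)
print(test_accuracy)
```

The same search can be pointed at the MNIST arrays by passing `X_train` and `y_train` to `fit`, although it will then take considerably longer.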

In [None]:
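# Load the dataset from OpenML
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML stores the labels as strings, so cast them to integers
X, y = mnist['data'], mnist['target'].astype(np.uint8)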

# Sanity-check the import by plotting a few random digits
fig, ax = plt.subplots(1, 5, figsize=(16, 6))
for axe in ax:
    another_digit = X[np.random.randint(70000)]
    another_digit_image = another_digit.reshape(28, 28)
    axe.imshow(another_digit_image, cmap = matplotlib.cm.binary, interpolation = 'nearest')
    axe.axis('off')

In [None]:
# Plot more samples
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
    
plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=10)

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

print(f'Training set shape: {X_train.shape}')
print(f'Training target shape: {y_train.shape}')

In [None]:
# Shuffle the training set
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

In [None]:
# Import and train a classifier
from sklearn.neighbors import KNeighborsClassifier

# knn = KNeighborsClassifier()
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=1)
knn.fit(X_train, y_train)

In [None]:
# Check the label of one digit
y_train[1000]

In [None]:
# Predict that same digit
knn.predict([X_train[1000]])

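The model appears to be working well on this sample, although a training instance is a weak check for KNN because the instance itself is among its own nearest neighbors. First, let's measure its accuracy without applying any transformation.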

In [None]:
# Measure accuracy
from sklearn.model_selection import cross_val_score

cross_val_score(knn, X_train, y_train, cv=3, scoring='accuracy')

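Unfortunately, ```cross_val_score``` is very expensive with KNN: the model defers all distance computations to prediction time, and every prediction compares the sample against the whole training set, so each fold takes a long time.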
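
One way to keep the cost manageable is to cross-validate on a random subsample first. The helper below is a hypothetical sketch (the name `subsample_cv_accuracy` and its defaults are not part of the original notebook), demonstrated on the small bundled `load_digits` dataset as a stand-in for MNIST. A score obtained this way is only a rough estimate of the full-data score.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def subsample_cv_accuracy(clf, X, y, n_samples=1000, cv=3, seed=0, n_jobs=None):
    """Cross-validate clf on a random subset of (X, y). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    # Passing n_jobs=-1 would evaluate the folds in parallel on all cores
    return cross_val_score(clf, X[idx], y[idx], cv=cv,
                           scoring="accuracy", n_jobs=n_jobs)

# Stand-in data (assumption): bundled 8x8 digits instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
scores = subsample_cv_accuracy(KNeighborsClassifier(n_neighbors=3), X_demo, y_demo)
print(scores.mean())
```

With the notebook's data, the call would be `subsample_cv_accuracy(knn, X_train, y_train, n_samples=10000)`, where the subsample size is again an arbitrary choice.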

## Shifting MNIST (Data Augmentation)

In [None]:
# Check the shapes
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')

In [None]:
# Train a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest_clf = RandomForestClassifier()

# Evaluate performance
cross_val_score(forest_clf, X_train, y_train, cv=5, scoring='accuracy')

In [None]:
# Try to improve performance with StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

# Evaluate performance
cross_val_score(forest_clf, X_train_scaled, y_train, cv=5, scoring='accuracy')

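There was no significant improvement. This is expected: decision trees split on per-feature thresholds, so a random forest is largely insensitive to feature scaling.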
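
To check that scaling has little effect on a random forest, the sketch below trains two forests with the same seed, one on raw features and one on standardized features, and compares their test accuracy. It uses the bundled `load_digits` dataset as a stand-in for MNIST, and the 50 trees and the fixed seeds are arbitrary assumptions; the two accuracies should come out nearly identical.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): bundled 8x8 digits instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_tr)

# Same seed for both forests so that scaling is the only difference
raw_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
scaled_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

acc_raw = raw_forest.score(X_te, y_te)
acc_scaled = scaled_forest.score(scaler.transform(X_te), y_te)
print(acc_raw, acc_scaled)
```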

In [None]:
# Check accuracy directly on the test set
from sklearn.metrics import accuracy_score

forest_clf.fit(X_train, y_train)
forest_pred = forest_clf.predict(X_test)
accuracy_score(y_test, forest_pred)

In [None]:
# Shift the images
from scipy.ndimage import shift

def shift_digit(digit_array, dx, dy, new=0):
    return shift(digit_array.reshape(28, 28), [dy, dx], cval=new).reshape(784)
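
To make the direction of the shift concrete, the sketch below applies the same function to a toy image with a single bright pixel and reports where that pixel ends up. The toy image is an assumption for illustration; with `dx=1, dy=0` the pixel should move one column to the right and stay in the same row.

```python
import numpy as np
from scipy.ndimage import shift

def shift_digit(digit_array, dx, dy, new=0):
    # Same logic as the cell above, repeated so this example is self-contained
    return shift(digit_array.reshape(28, 28), [dy, dx], cval=new).reshape(784)

# Toy image (assumption): one bright pixel at row 14, column 14
toy = np.zeros((28, 28), dtype=np.float64)
toy[14, 14] = 255.0

# Shift by one pixel along the x axis (columns)
moved = shift_digit(toy.reshape(784), dx=1, dy=0).reshape(28, 28)

# Locate the brightest pixel after the shift
row, col = np.unravel_index(np.argmax(moved), moved.shape)
print(row, col)
```

Pixels that enter from outside the image are filled with the `new` value, which defaults to 0 (the background in MNIST).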

In [None]:
# Augment the training set with copies shifted by one pixel in each direction
X_train_expanded = [X_train]
y_train_expanded = [y_train]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    shifted_images = np.apply_along_axis(shift_digit, axis=1, arr=X_train, dx=dx, dy=dy)
    X_train_expanded.append(shifted_images)
    y_train_expanded.append(y_train)

X_train_expanded = np.concatenate(X_train_expanded)
y_train_expanded = np.concatenate(y_train_expanded)
X_train_expanded.shape, y_train_expanded.shape

In [None]:
# Check whether the accuracy changed
forest_clf.fit(X_train_expanded, y_train_expanded)

forest_pred_ex = forest_clf.predict(X_test)
accuracy_score(y_test, forest_pred_ex)

Wow!

In [None]:
# Set up a grid search
from sklearn.model_selection import GridSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' meant the same as 'sqrt' for classifiers; recent scikit-learn releases reject it)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

forest_clf_grid = RandomForestClassifier()
# The search is only configured here; call grid_search_forest.fit(...) to run it
grid_search_forest = GridSearchCV(forest_clf_grid, random_grid, cv=5, verbose=3)
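
The grid defined above contains 10 × 2 × 12 × 3 × 3 × 2 = 4,320 combinations, which means 21,600 fits with `cv=5`. A randomized search, which evaluates only a fixed number of sampled combinations, is a common way to make this tractable. The sketch below swaps in `RandomizedSearchCV`, which is not what the notebook itself uses; the reduced search space, `n_iter=5`, and the `load_digits` stand-in dataset are all assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data (assumption): bundled 8x8 digits instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)

# Deliberately small, illustrative search space (assumption)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,            # only 5 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_, random_search.best_score_)
```

The `random_grid` dictionary from the cell above can be passed as `param_distributions` unchanged, with `n_iter` controlling how much of it is explored.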