## Goal: reach 97% accuracy with an MNIST classifier
   - Hint 1: use KNN
   - Hint 2: GridSearchCV (tune the weights and n_neighbors hyperparameters)
   - Hint 3: Data augmentation: enlarge the training set!

In [1]:
from sklearn.datasets import fetch_openml

# as_frame=True returns pandas objects, which the .iloc indexing below relies on
mnist = fetch_openml('mnist_784', version=1, as_frame=True)

In [2]:
X, y = mnist['data'], mnist['target']
X.shape, y.shape

((70000, 784), (70000,))

### Train/test split

In [11]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, random_state=156, test_size=0.2)
for tr_idx, te_idx in split.split(X, y):
    x_train, y_train = X.iloc[tr_idx], y.iloc[tr_idx]
    x_test, y_test = X.iloc[te_idx], y.iloc[te_idx]
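
To see what the stratified split guarantees, here is a minimal sketch on toy labels. The 80/20 class mix and the names `X_toy` and `y_toy` are assumptions for illustration, not part of the MNIST data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (assumed for illustration): 80 samples of class 0, 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=156)
tr_idx, te_idx = next(split.split(X_toy, y_toy))

# Fraction of class-1 samples in each subset; both should match the
# fraction in the full label array.
train_ratio = y_toy[tr_idx].mean()
test_ratio = y_toy[te_idx].mean()
print(train_ratio, test_ratio)
```

With a plain random split the two ratios could drift apart, so rare classes might end up under-represented in the test set.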

### Hint Model: KNN

In [14]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Baseline: KNN with default hyperparameters (n_neighbors=5, weights='uniform')
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
print(f"knn train acc: {knn.score(x_train, y_train)}")
knn_pred = knn.predict(x_test)
print(f"knn acc: {accuracy_score(y_test, knn_pred)}")

knn train acc: 0.9810357142857142
knn acc: 0.9701428571428572


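After stratified sampling by class, the default model already reached 97%, which caught me off guard.  
Following the intent of the exercise, GridSearchCV is run as well.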

### GridSearchCV
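   - KNN hyperparameters
       1. weights
           - uniform: every neighbor gets equal weight (default)
           - distance: neighbors are weighted by the inverse of their distance when classifying (closer neighbors count more)
           
       2. n_neighbors
           - number of neighbors (default 5)
               - fewer neighbors > more complex decision boundary > overfitting
               - more neighbors > simpler decision boundary > underfitting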
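
The difference between the two `weights` settings can be sketched on a tiny example. The 1-D points and the query below are assumptions chosen so that the two settings disagree; they have nothing to do with MNIST.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (assumed for illustration): one class-0 point close to the
# query, two class-1 points farther away.
X_toy = np.array([[0.0], [3.0], [3.5]])
y_toy = np.array([0, 1, 1])
query = np.array([[1.0]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# uniform: every neighbor has one vote, so class 1 wins 2 to 1.
uniform_pred = uniform_knn.predict(query)[0]
# distance: class 0 gets weight 1/1 = 1.0, class 1 gets 1/2 + 1/2.5 = 0.9,
# so the single close neighbor wins.
distance_pred = distance_knn.predict(query)[0]
print(uniform_pred, distance_pred)
```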

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'weights':["uniform","distance"],
               'n_neighbors':[3,4,5]}]

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, verbose=3)
grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END .................n_neighbors=3, weights=uniform; total time=  12.6s
[CV 2/5] END .................n_neighbors=3, weights=uniform; total time=  12.6s
[CV 3/5] END .................n_neighbors=3, weights=uniform; total time=  12.7s
[CV 4/5] END .................n_neighbors=3, weights=uniform; total time=  12.4s
[CV 5/5] END .................n_neighbors=3, weights=uniform; total time=  12.5s
[CV 1/5] END ................n_neighbors=3, weights=distance; total time=  12.5s
[CV 2/5] END ................n_neighbors=3, weights=distance; total time=  12.5s
[CV 3/5] END ................n_neighbors=3, weights=distance; total time=  12.4s
[CV 4/5] END ................n_neighbors=3, weights=distance; total time=  12.3s
[CV 5/5] END ................n_neighbors=3, weights=distance; total time=  12.4s
[CV 1/5] END .................n_neighbors=4, weights=uniform; total time=  14.6s
[CV 2/5] END .................n_neighbors=4, weig

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid=[{'n_neighbors': [3, 4, 5],
                          'weights': ['uniform', 'distance']}],
             verbose=3)

In [19]:
best_pred = grid_search.best_estimator_.predict(x_test)
print(f"best train acc: {grid_search.best_estimator_.score(x_train,y_train)}")
print(f"best test acc: {accuracy_score(y_test, best_pred)}")

best train acc: 1.0
best test acc: 0.9727857142857143


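Test accuracy went up, but training accuracy is 1.0: a sign of overfitting? Not necessarily. If the selected model uses weights='distance', each training sample is its own nearest neighbor at distance 0 and dominates the vote, so a training accuracy of 1.0 is expected and says little about generalization. `grid_search.best_params_` shows which setting was chosen.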
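
One way to judge how much a perfect training score means for KNN is to fit on labels that carry no signal at all. The sketch below uses synthetic random data (an assumption for illustration, not the MNIST setup above) and compares the training accuracy of the two `weights` settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic data (assumed for illustration): random features and random
# binary labels, so there is nothing real to learn.
X_rand = rng.normal(size=(200, 5))
y_rand = rng.integers(0, 2, size=200)

uniform_knn = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X_rand, y_rand)
distance_knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_rand, y_rand)

# Score each model on the same data it was fitted on.
uniform_acc = uniform_knn.score(X_rand, y_rand)
distance_acc = distance_knn.score(X_rand, y_rand)
print(uniform_acc, distance_acc)
```

With `weights="distance"` each training point matches itself at distance 0, so the training score is perfect even on noise. The cross-validated score (`grid_search.best_score_`) and the test score are the more informative numbers.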

### Data Augmentation
   - Write a function that shifts an image by one pixel in any direction
   - Create four shifted copies of each image (one per direction)

In [51]:
from scipy.ndimage import shift  # the scipy.ndimage.interpolation namespace is deprecated
import numpy as np

# Shift a flattened 28x28 image by dx pixels horizontally and dy pixels vertically
def shift_image(image, dx, dy):
    image = image.reshape((28,28))
    shifted_image = shift(image, [dy, dx], cval=0, mode='constant')
    return shifted_image.reshape([-1])

X_train_augmented = [image for image in x_train.values]
y_train_augmented = [label for label in y_train]

for dx, dy in ((1,0), (-1,0), (0,1), (0,-1)):
    for image, label in zip(x_train.values, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)
        
X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
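
A quick sanity check of the shift direction can be sketched with a toy image. The single-pixel image below is an assumption for illustration; `shift_image` is repeated so the snippet runs on its own.

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    # Same logic as the notebook's function: reshape, shift, flatten.
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode='constant')
    return shifted_image.reshape([-1])

# Toy image (assumed for illustration): one bright pixel at row 10, column 10.
toy = np.zeros((28, 28))
toy[10, 10] = 255.0

right = shift_image(toy.reshape(-1), dx=1, dy=0).reshape(28, 28)
down = shift_image(toy.reshape(-1), dx=0, dy=1).reshape(28, 28)

# Locate the brightest pixel after each shift.
right_pos = np.unravel_index(np.argmax(right), right.shape)
down_pos = np.unravel_index(np.argmax(down), down.shape)
print(right_pos, down_pos)
```

A positive `dx` moves the content toward higher column indices (right) and a positive `dy` toward higher row indices (down), so the four `(dx, dy)` pairs in the loop cover right, left, down, and up.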

In [52]:
knn.fit(X_train_augmented, y_train_augmented)
print("train acc:", knn.score(X_train_augmented,y_train_augmented))
aug_pred = knn.predict(x_test.values)  # pass an array, matching the array used in fit
print("test acc:", accuracy_score(y_test, aug_pred))

train acc: 0.9909071428571429
test acc: 0.9772142857142857


Training accuracy is now 0.99 rather than 1, and test accuracy is 0.977, about 0.004 (0.44 percentage points) higher than with GridSearchCV!  

However, the training set alone is now 280,000 images and takes far too long, so I'm skipping a further GridSearchCV pass here!! Running away~~
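
If the search cost is the obstacle, one possible workaround is to tune on a stratified subsample and then refit the chosen configuration on the full data. The sketch below is not what this notebook ran: it assumes scikit-learn's small bundled digits dataset as a stand-in, and the subsample size of 500 is arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's small 8x8 digits set instead of
# the 280,000 augmented MNIST images.
X_digits, y_digits = load_digits(return_X_y=True)

# Keep a stratified subsample of 500 images for the search.
X_sub, _, y_sub, _ = train_test_split(
    X_digits, y_digits, train_size=500, stratify=y_digits, random_state=156
)

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
# n_jobs=-1 could be added here to run the fits in parallel across CPU cores.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X_sub, y_sub)
print(search.best_params_, search.best_score_)
```

The best hyperparameters on a subsample may differ from those on the full set, so treat the result as a starting point and confirm it on the held-out test data.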