### Exercises

**Q1**: Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the KNeighborsClassifier works quite well for this task; you just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters).

**A1**:

In [1]:
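# Get the data.
# Note: fetch_mldata was deprecated in scikit-learn 0.20 and later removed;
# on newer versions, fetch_openml('mnist_784', version=1) is the replacement.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home='./tmp')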

In [2]:
X, y = mnist["data"], mnist["target"]
print(mnist.DESCR)
print(X.shape)
print(y.shape)

mldata.org dataset: mnist-original
(70000, 784)
(70000,)


The MNIST dataset is already split into training (60,000) and test (10,000) entries, so let's separate those.

Also, the data is organized by digit, but we should randomize the order as some algorithms are sensitive to pre-sorted data:

In [3]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
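
As a small illustration of why a single permutation index is used for both arrays, here is a sketch on toy data (the names `X_demo`, `y_demo` and the seed are made up for this example and are not part of the notebook). Indexing features and labels with the same index keeps every row paired with its label:

```python
import numpy as np

# Toy stand-in for MNIST: 6 "images" with 2 features each, sorted by label.
# Each row's first feature equals its label, so alignment is easy to check.
X_demo = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y_demo = np.array([0, 1, 2, 3, 4, 5])

rng = np.random.default_rng(42)  # seeded only so the shuffle is reproducible
demo_index = rng.permutation(len(X_demo))

# The same index reorders both arrays, so rows and labels stay in sync.
X_shuffled, y_shuffled = X_demo[demo_index], y_demo[demo_index]

print(np.array_equal(X_shuffled[:, 0], y_shuffled))
```

Shuffling `X_train` and `y_train` with two separate calls to a shuffle function would break this pairing, which is why the cell above builds the index once and reuses it.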

In [4]:
# Use a 5,000-sample subset so the grid search finishes quickly.
X_train_small = X_train[:5000]
y_train_small = y_train[:5000]

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = [
    {'weights': ['uniform', 'distance'], 'n_neighbors': [2,4,6]}
  ]

knn_clf = KNeighborsClassifier()
knn_grid_search = GridSearchCV(knn_clf, param_grid, cv=3,
                           scoring='accuracy', n_jobs=-1, verbose=3)

knn_grid_search.fit(X_train_small, y_train_small)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=distance .................................
[CV] n_neighbors=2, weights=distance .................................
[CV] n_neighbors=2, weights=distance .................................
[CV] ... n_neighbors=2, weights=uniform, score=0.917365, total=  10.6s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] ... n_neighbors=2, weights=uniform, score=0.915966, total=  10.7s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] ... n_neighbors=2, weights=uniform, score=0.917067, total=  10.7s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] .. n_neighbors=2, weights=distance, score=0.937725, total=  10.7s
[CV] n_neighbors=

[Parallel(n_jobs=-1)]: Done  14 out of  18 | elapsed:  1.6min remaining:   26.8s


[CV] ... n_neighbors=6, weights=uniform, score=0.925481, total=  11.5s
[CV] .. n_neighbors=6, weights=distance, score=0.919568, total=  11.5s
[CV] .. n_neighbors=6, weights=distance, score=0.931490, total=  11.5s
[CV] .. n_neighbors=6, weights=distance, score=0.939521, total=  11.8s


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  1.6min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'weights': ['uniform', 'distance'], 'n_neighbors': [2, 4, 6]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=3)

In [6]:
knn_grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}
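
To make the difference between the two `weights` settings concrete, here is a minimal pure-Python sketch of the voting step. It is an illustration only, not scikit-learn's implementation; the helper `knn_vote` and the neighbor list are hypothetical. With `'uniform'`, each of the k neighbors gets one vote; with `'distance'`, each vote is weighted by the inverse of the neighbor's distance:

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs for the k nearest points."""
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0             # every neighbor counts equally
        else:
            votes[label] += 1.0 / distance  # closer neighbors count more (assumes distance > 0)
    return max(votes, key=votes.get)

# Hypothetical 4 nearest neighbors of one query point: (distance, label).
neighbors = [(0.5, 3), (2.0, 8), (2.5, 8), (4.0, 5)]

print(knn_vote(neighbors, "uniform"))   # label 8 wins with two votes
print(knn_vote(neighbors, "distance"))  # label 3 wins: weight 2.0 vs 0.9 for label 8
```

In this toy case the two schemes disagree: the majority label wins under uniform voting, while the single very close neighbor wins under distance weighting.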

In [10]:
best_knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=4, n_jobs=-1)

In [11]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(best_knn_clf, X_train, y_train, cv=3, n_jobs=-1, verbose=3)

[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ................................. , score=0.971396, total=16.2min
[CV] ................................. , score=0.972199, total=16.3min
[CV] ................................. , score=0.971956, total=16.3min


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 16.3min finished


In [12]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.97 (+/- 0.00)
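
The exercise asks for accuracy on the test set, whereas the score above is cross-validated on the training set. The final check would be to fit the chosen classifier on the full training set and score it on the held-out test set. Below is a sketch of that step; to keep it quick and self-contained it uses scikit-learn's small bundled `load_digits` dataset as a stand-in for MNIST, and the split parameters are assumptions for the example. With the notebook's variables, the equivalent would be fitting on `X_train`, `y_train` and scoring on `X_test`, `y_test`.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for MNIST: 1,797 8x8 digit images bundled with scikit-learn.
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Same hyperparameters the grid search selected above.
final_clf = KNeighborsClassifier(weights="distance", n_neighbors=4)
final_clf.fit(X_tr, y_tr)

# Score once on data the model has never seen.
test_accuracy = accuracy_score(y_te, final_clf.predict(X_te))
print("Test accuracy: %0.4f" % test_accuracy)
```

The test set should only be used once, after hyperparameters are fixed, so that the reported number is not biased by the search.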


Huzzah, just over 97% cross-validated accuracy on the full training set! It takes a while to run, though: each of the three folds took about 16 minutes.
We sped up the grid search by running it on a 5,000-sample subset of the training data, then applied the best hyperparameters to the full training set with cross-validation, which kept the total run time reasonable. Dimensionality reduction might speed things up further.
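
One way the dimensionality-reduction idea could be tried is to put PCA in front of the classifier in a pipeline. The sketch below is an untested assumption about what might help, not a result from this notebook: it uses the small bundled `load_digits` dataset instead of MNIST, and the 95% explained-variance threshold is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Small stand-in for MNIST: 1,797 samples with 64 features each.
X_digits, y_digits = load_digits(return_X_y=True)

# PCA keeps enough components to explain 95% of the variance (assumed threshold),
# then k-NN runs in the reduced space, where distance computations are cheaper.
pca_knn = make_pipeline(
    PCA(n_components=0.95),
    KNeighborsClassifier(weights="distance", n_neighbors=4),
)
pca_scores = cross_val_score(pca_knn, X_digits, y_digits, cv=3, scoring="accuracy")

# Fit once on all data just to inspect how many components were kept.
pca_knn.fit(X_digits, y_digits)
n_kept = pca_knn.named_steps["pca"].n_components_
print("Components kept: %d of %d" % (n_kept, X_digits.shape[1]))
print("CV accuracy: %0.4f" % pca_scores.mean())
```

Because PCA sits inside the pipeline, it is refit on each training fold during cross-validation, so no information from the validation fold leaks into the projection. Whether accuracy holds up on MNIST would need to be measured.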