The task is to build a KNN Classifier for the MNIST dataset that achieves over 97% accuracy.

### Obtaining the MNIST dataset

In [12]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", as_frame=False)

In [13]:
mnist_X, mnist_y = mnist.data, mnist.target

In [14]:
import numpy as np

print(f"Shape of the data: {mnist_X.shape}")
print(f"Shape of the labels: {mnist_y.shape}")

print(f"The dataset contains {mnist_X.shape[0]} instances, each with {mnist_X.shape[1]} features.")
print(f"The possible values for the labels are: {np.unique(mnist_y)}")

Shape of the data: (70000, 784)
Shape of the labels: (70000,)
The dataset contains 70000 instances, each with 784 features.
The possible values for the labels are: ['0' '1' '2' '3' '4' '5' '6' '7' '8' '9']


### Separating data into train and test sets

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(mnist_X, mnist_y, test_size=0.2, stratify=mnist_y, random_state=42)

In [26]:
# Check the label distribution (in percent) in both sets
from collections import Counter

y_train_label_percentages = {label: round(count / len(y_train) * 100, 2) for label, count in sorted(Counter(y_train).items())}
y_test_label_percentages = {label: round(count / len(y_test) * 100, 2) for label, count in sorted(Counter(y_test).items())}

print(f"Label distribution for the training set: {y_train_label_percentages}")
print(f"Label distribution for the test set: {y_test_label_percentages}")



Label distribution for the training set: {'0': 9.86, '1': 11.25, '2': 9.99, '3': 10.2, '4': 9.75, '5': 9.02, '6': 9.82, '7': 10.42, '8': 9.75, '9': 9.94}
Label distribution for the test set: {'0': 9.86, '1': 11.25, '2': 9.99, '3': 10.2, '4': 9.75, '5': 9.02, '6': 9.82, '7': 10.42, '8': 9.75, '9': 9.94}
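
The identical percentages confirm that `stratify=mnist_y` preserved the class balance in both sets. The two dictionary comprehensions above repeat the same logic, so they could be folded into one small helper. The following is a minimal sketch using only the standard library; `label_percentages` and the toy labels are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage of instances}, sorted by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(count / total * 100, 2) for label, count in sorted(counts.items())}


# Toy labels standing in for y_train / y_test (assumption: string labels, as in MNIST).
toy_labels = ["0", "1", "1", "2", "2", "2", "2", "1"]
print(label_percentages(toy_labels))
```

In the notebook, the same helper could be called once with `y_train` and once with `y_test`.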


### Training a KNN Multiclass Classifier

In [27]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

In [28]:
# Accuracy of the baseline KNN Classifier
from sklearn.model_selection import cross_val_score

knn_cross_val_acc_scores = cross_val_score(knn_clf, X_train, y_train, cv=3, scoring="accuracy")


In [29]:
print(f"Across 3 folds, the KNN Classifier has the accuracy scores: {knn_cross_val_acc_scores}")
print(f"The average accuracy score is: {round(knn_cross_val_acc_scores.mean() * 100, 2)}%")

Across 3 folds, the KNN Classifier has the accuracy scores: [0.96614346 0.96823271 0.96903461]
The average accuracy score is: 96.78%


In [31]:
print(f"These are the current hyperparameters being used in our KNN Classifier that achieves 96.78% accuracy. {knn_clf.get_params()}")

These are the current hyperparameters being used in our KNN Classifier that achieves 96.78% accuracy. {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
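
The defaults that matter most here are `n_neighbors=5`, `weights='uniform'`, and the Minkowski metric with `p=2`, which is the Euclidean distance. To make these concrete, here is a pure-Python sketch of how a KNN vote can work. It is an illustration under simplifying assumptions (brute-force search, tiny 2D points), not scikit-learn's implementation; `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, n_neighbors=5, weights="uniform"):
    """Classify `query` by a vote among its nearest training points (Euclidean distance)."""
    # Distance from the query to every training point, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    neighbors = distances[:n_neighbors]

    votes = defaultdict(float)
    has_exact_match = any(dist == 0.0 for dist, _ in neighbors)
    for dist, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        elif has_exact_match:
            votes[label] += 1.0 if dist == 0.0 else 0.0  # exact matches decide alone
        else:
            votes[label] += 1.0 / dist  # closer neighbors count more
    return max(votes, key=votes.get)


# Toy data: one "b" point sits very close to the query, two "a" points are farther away.
points = [(0, 0), (0, 1), (1, 0), (2, 2), (5, 5)]
labels = ["a", "a", "a", "b", "b"]
query = (1.9, 1.9)

print(knn_predict(points, labels, query, n_neighbors=3, weights="uniform"))
print(knn_predict(points, labels, query, n_neighbors=3, weights="distance"))
```

With uniform weights the two farther "a" neighbors outvote the single close "b" neighbor; with distance weights the close neighbor dominates. This is why `weights` is worth tuning alongside `n_neighbors`.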


### Adjusting the hyperparameters using GridSearch

In [35]:
from sklearn.model_selection import GridSearchCV

# Search over the number of neighbours and the voting scheme
param_grid = {
    "n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "weights": ["uniform", "distance"],
}

knn_grid_search = GridSearchCV(estimator=knn_clf, param_grid=param_grid, cv=3, scoring="accuracy")
knn_grid_search.fit(X_train, y_train)

In [37]:
print(f"The best hyperparameters that were found: {knn_grid_search.best_params_}")
print(f"These hyperparameters resulted in the model having an accuracy score of: {knn_grid_search.best_score_}")

The best hyperparameters that were found: {'n_neighbors': 6, 'weights': 'distance'}
These hyperparameters resulted in the model having an accuracy score of: 0.969607167072886
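
Grid search is exhaustive: it fits one model per hyperparameter combination per fold, which adds up quickly for KNN on 56,000 training instances. The sketch below enumerates the same grid with `itertools.product` to show how many fits the search above implies. It only mirrors the counting and is not how `GridSearchCV` is implemented; `candidates`, `cv_folds`, and `total_fits` are illustrative names.

```python
from itertools import product

param_grid = {
    "n_neighbors": list(range(1, 11)),
    "weights": ["uniform", "distance"],
}
cv_folds = 3

# Every combination of values, one dict per candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training set, which is the model exposed as `best_estimator_`.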


### Training the model using the best hyperparameters

In [38]:
knn_clf_after_grid = knn_grid_search.best_estimator_
knn_grid_cross_val_acc_scores = cross_val_score(knn_clf_after_grid, X_train, y_train, cv=3, scoring="accuracy")
print(f"Across 3 folds, the best estimator after doing grid search achieves an accuracy of: {knn_grid_cross_val_acc_scores}")

Across 3 folds, the best estimator after doing grid search achieves an accuracy of: [0.96764343 0.97021482 0.97096325]


### Evaluating the best model on the test set

In [41]:
from sklearn.metrics import accuracy_score

y_pred = knn_clf_after_grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)


In [42]:
print(f"The accuracy score on the test set is: {round(accuracy * 100, 2)}%")

The accuracy score on the test set is: 97.09%
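
A single accuracy figure does not show which digits get confused with each other. As a rough sketch of how one might look into the errors, the snippet below counts (actual, predicted) pairs with the standard library. `confusion_counts`, `most_common_errors`, and the toy labels are hypothetical; in the notebook, `y_test` and `y_pred` would take their place, and scikit-learn's `confusion_matrix` from `sklearn.metrics` offers the same information as an array.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs; pairs whose two entries differ are mistakes."""
    return Counter(zip(y_true, y_pred))


def most_common_errors(y_true, y_pred, top=3):
    """Return the most frequent (actual, predicted) pairs where the two differ."""
    errors = Counter(
        {pair: n for pair, n in confusion_counts(y_true, y_pred).items() if pair[0] != pair[1]}
    )
    return errors.most_common(top)


# Toy labels (assumption: string labels, as in MNIST).
toy_true = ["4", "4", "9", "9", "7", "1", "4"]
toy_pred = ["4", "9", "9", "9", "1", "1", "9"]
print(most_common_errors(toy_true, toy_pred))
```

In this toy example the most frequent mistake is a "4" predicted as "9", followed by a "7" predicted as "1".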


### Using our final classifier to predict the value of some random digits

In [49]:
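# Note: these samples come from the training set, so the model has already seen them.
# With weights="distance", each one is its own nearest neighbour at distance 0, so
# correct predictions are expected here. Sample from X_test for an unbiased check.
rand_idxs = np.random.randint(0, len(X_train), size=10)
for rand_idx in rand_idxs:
    actual_digit_label = y_train[rand_idx]
    model_label_pred = knn_clf_after_grid.predict([X_train[rand_idx]])[0]

    print(f"Actual digit label: {actual_digit_label}")
    print(f"Model prediction: {model_label_pred}")
    print(f"Outcome: {'Correct!' if actual_digit_label == model_label_pred else 'Wrong!'}")
    print("*" * 50)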

Actual digit label: 2
Model prediction: 2
Outcome: Correct!
**************************************************
Actual digit label: 7
Model prediction: 7
Outcome: Correct!
**************************************************
Actual digit label: 7
Model prediction: 7
Outcome: Correct!
**************************************************
Actual digit label: 1
Model prediction: 1
Outcome: Correct!
**************************************************
Actual digit label: 4
Model prediction: 4
Outcome: Correct!
**************************************************
Actual digit label: 4
Model prediction: 4
Outcome: Correct!
**************************************************
Actual digit label: 1
Model prediction: 1
Outcome: Correct!
**************************************************
Actual digit label: 9
Model prediction: 9
Outcome: Correct!
**************************************************
Actual digit label: 9
Model prediction: 9
Outcome: Correct!
**************************************************
A