# Hands-on Project: Digit classification with K-Nearest Neighbors and Data Augmentation
From Machine Learning Algorithms course

In this hands-on project, we'll apply the K-Nearest Neighbors (KNN) algorithm to handwritten digit classification. Our main objectives are to: a) learn how to experiment with various hyperparameters, b) introduce two evaluation tools, classification accuracy and the confusion matrix, c) develop intuition about how KNN works, and d) use this intuition, together with data augmentation, to improve classification accuracy further.

## Overview
This project guides you through using KNN for handwritten digit classification.

In [1]:
import pickle, gzip
import pandas as pd
import numpy as np

## Dataset
MNIST is a widely used machine learning dataset for handwritten digit recognition. The file loaded below, mnist_1000.pkl.gz, can be downloaded from https://www.dropbox.com/s/d3hz2dli4z6imfl/mnist_1000.pkl.gz?dl=1

### Loading the dataset

In [2]:
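# Load the train/validation/test splits from the gzipped pickle file.
# The with-block closes the file automatically.
with gzip.open('mnist_1000.pkl.gz', 'rb') as f:
    trainData, trainLabels, valData, valLabels, testData, testLabels = pd.read_pickle(f)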
 
print("training data points: {}".format(len(trainLabels)))
print("validation data points: {}".format(len(valLabels)))
print("testing data points: {}".format(len(testLabels)))

training data points: 1000
validation data points: 200
testing data points: 200


### Looking at the images
You can use the following snippet to look at some specific images. 
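
A minimal sketch of such a snippet follows. It assumes each row of the data is a flattened 28x28 grayscale image with non-negative pixel values (the usual MNIST layout, not confirmed by this notebook); `render_digit` is a hypothetical helper name, and the demo image is synthetic so the block runs without the dataset file.

```python
import numpy as np

def render_digit(flat_pixels, side=28, chars=" .:*#"):
    """Return a text rendering of one flattened square grayscale image."""
    img = np.asarray(flat_pixels, dtype=float).reshape(side, side)
    # Scale intensities to [0, 1]; skip scaling for an all-zero image.
    peak = img.max()
    if peak > 0:
        img = img / peak
    # Map each pixel to a character, from blank (dark) to '#' (bright).
    levels = np.clip((img * len(chars)).astype(int), 0, len(chars) - 1)
    return "\n".join("".join(chars[v] for v in row) for row in levels)

# Synthetic stand-in for one image: a vertical bar, roughly like the digit 1.
demo = np.zeros((28, 28))
demo[4:24, 13:15] = 255
art = render_digit(demo.ravel())
print(art)
```

With the real data loaded, something like `print(render_digit(trainData[0]))` next to `trainLabels[0]` would show the first training image and its label, assuming `trainData` can be indexed by row. In a notebook, matplotlib's `imshow` on the reshaped array is the more common way to display the image.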

### Choosing the best hyperparameters

In [4]:
from sklearn.neighbors import KNeighborsClassifier

In [5]:
# Try several values of K and note the classification accuracy
# on the validation data for each.
for k in [1, 3, 5, 9, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(trainData, trainLabels)
    # All other settings stay at their defaults: metric='minkowski' with p=2
    # (Euclidean distance) and weights='uniform'.

    # score() returns the mean accuracy on the given data and labels
    score = model.score(valData, valLabels)
    print(k, score)

1 0.88
3 0.88
5 0.86
9 0.85
15 0.81
25 0.795
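
To build intuition for what the classifier is doing in the loop above, here is a minimal from-scratch sketch of KNN prediction with Euclidean distance and an unweighted majority vote. It is an illustration, not scikit-learn's implementation: `knn_predict` and the toy data are hypothetical, and the tie-breaking rule (smallest label wins) is an assumption of this sketch.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3):
    """Label each row of query_X by majority vote among its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    predictions = []
    for q in query_X:
        # Euclidean distance from the query to every training point
        dists = np.sqrt(((train_X - q) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # Majority vote; np.unique returns sorted labels and argmax takes
        # the first maximum, so ties go to the smallest label.
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy 2-D data: two well-separated clusters labelled 0 and 1.
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(toy_X, toy_y, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
print(toy_pred)  # [0 1]
```

Because each prediction depends only on the k closest training images, a larger K pulls in neighbors that are farther away and more likely to belong to other digits. That is one plausible reading of the falling validation accuracy above, given only 1000 training points.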


In [6]:
# K = 1 and K = 3 tie for the best validation accuracy (0.88); we use K = 3.
# Re-train the classifier with that value and evaluate it on the test data.

model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainData, trainLabels)
accuracy = model.score(testData, testLabels)
print("Achieved accuracy of %.2f%% on test data" % (accuracy*100))

Achieved accuracy of 85.50% on test data


In [7]:
# Inspect the performance per class, i.e. precision, recall and f-score for each digit.
from sklearn.metrics import classification_report

In [8]:
# show a final classification report with the precision, recall and
# f1-score of the classifier for each of the digits

predictions = model.predict(testData)
print("Classification Report")
print(classification_report(testLabels, predictions))

Classification Report
              precision    recall  f1-score   support

           0       0.85      1.00      0.92        17
           1       0.80      1.00      0.89        28
           2       0.80      0.75      0.77        16
           3       0.92      0.75      0.83        16
           4       0.95      0.68      0.79        28
           5       0.94      0.85      0.89        20
           6       1.00      0.90      0.95        20
           7       0.88      0.92      0.90        24
           8       1.00      0.70      0.82        10
           9       0.66      0.90      0.76        21

    accuracy                           0.85       200
   macro avg       0.88      0.84      0.85       200
weighted avg       0.87      0.85      0.85       200
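
The f1-score column is the harmonic mean of precision and recall. A short sketch that reproduces one entry from the report above; `f1_from` is a hypothetical helper, and the inputs are the rounded values printed in the report, so only agreement after rounding should be expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Digit 4 in the report: precision 0.95, recall 0.68
f1_digit4 = f1_from(0.95, 0.68)
print(round(f1_digit4, 2))  # 0.79
```

The harmonic mean stays close to the smaller of the two values, which is why digit 4 scores only 0.79 despite its high precision.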



In [9]:
# Inspect the confusion matrix, i.e. when the correct label was digit I,
# how many times did the model predict J.

from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(testLabels, predictions))

Confusion Matrix
[[17  0  0  0  0  0  0  0  0  0]
 [ 0 28  0  0  0  0  0  0  0  0]
 [ 1  2 12  0  0  0  0  1  0  0]
 [ 0  2  1 12  0  0  0  0  0  1]
 [ 0  1  0  0 19  0  0  0  0  8]
 [ 0  1  0  1  1 17  0  0  0  0]
 [ 2  0  0  0  0  0 18  0  0  0]
 [ 0  1  0  0  0  0  0 22  0  1]
 [ 0  0  2  0  0  1  0  0  7  0]
 [ 0  0  0  0  0  0  0  2  0 19]]
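
The confusion matrix contains everything needed to recompute the earlier metrics. The sketch below copies the matrix from the output above and derives accuracy, per-digit precision and recall, and the most frequent single confusion. In scikit-learn's convention, rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 28,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 1,  2, 12,  0,  0,  0,  0,  1,  0,  0],
    [ 0,  2,  1, 12,  0,  0,  0,  0,  0,  1],
    [ 0,  1,  0,  0, 19,  0,  0,  0,  0,  8],
    [ 0,  1,  0,  1,  1, 17,  0,  0,  0,  0],
    [ 2,  0,  0,  0,  0,  0, 18,  0,  0,  0],
    [ 0,  1,  0,  0,  0,  0,  0, 22,  0,  1],
    [ 0,  0,  2,  0,  0,  1,  0,  0,  7,  0],
    [ 0,  0,  0,  0,  0,  0,  0,  2,  0, 19],
])

correct = np.diag(cm)                  # correctly classified count per digit
recall = correct / cm.sum(axis=1)      # share of each true digit that was found
precision = correct / cm.sum(axis=0)   # share of each predicted digit that was right
accuracy = correct.sum() / cm.sum()

print("accuracy:", accuracy)
for digit in range(10):
    print(digit, round(precision[digit], 2), round(recall[digit], 2))

# Largest off-diagonal entry: the most frequent single mistake.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, predicted_digit = np.unravel_index(off_diag.argmax(), off_diag.shape)
print("most common confusion: true", true_digit, "predicted", predicted_digit)
```

The largest off-diagonal entry is in row 4, column 9: eight of the 28 true 4s were predicted as 9. This accounts for both the low recall of digit 4 (0.68) and the low precision of digit 9 (0.66) in the classification report.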
