In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import joblib

In [2]:
# Read in the data. Label is the numeric label (0-9). Other columns in X data represent the pixel intensity (0-255) of the image
# at the designated pixel. The MNIST data is already quite clean and well pre-processed, so we can just feed the data into
# our model.

X_train = pd.read_csv("mnist_train.csv")
y_train = X_train.label
X_train.drop('label', axis = 1, inplace = True)

X_test = pd.read_csv("mnist_test.csv")
y_test = X_test.label
X_test.drop('label', axis = 1, inplace = True)

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

In [3]:
from scipy.ndimage import shift

# Function that shifts the original images by dy in the y direction and dx in the x direction (using image coordinates)
# i.e. the origin is in the top left corner of the image.
def shift_digit(image, dx, dy):
    image = image.reshape(28,28)
    shift_image = np.array(shift(image, [dy, dx], cval = 0, mode = "constant"))
    return shift_image.reshape([-1])


# Convert to a list, to make it less of a hassle to append to the data set
X_train_expanded = X_train.tolist()
y_train_expanded = y_train.tolist()

# Augment the data by adding four shifted copies of each image: four pixels to the right, four pixels to the left,
# four pixels down, then four pixels up. The shifted copies should help the model generalize to off-center digits.
for dx, dy in ((4, 0), (-4, 0), (0, 4), (0, -4)):
    for ix in range(len(X_train)):
        shifted_image = shift_digit(X_train[ix], dx, dy)
        X_train_expanded.append(shifted_image.tolist())
        y_train_expanded.append(y_train[ix])

In [4]:
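# # Testing shift_digit function (entries of X_train_expanded are plain lists, so convert them back to arrays first)
# original = np.array(X_train_expanded[1])
# shifted = np.array(X_train_expanded[60001])  # image 1 shifted four pixels to the right

# plt.imshow(original.reshape(28,28), cmap = "Greys")
# plt.show()
# plt.imshow(shifted.reshape(28,28), cmap = "Greys")
# plt.show()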
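
The shift logic can also be checked without plotting. The sketch below is a NumPy-only stand-in for the scipy-based `shift_digit`, not the function used in this notebook: `shift_digit_np` is a hypothetical helper that assumes whole-pixel shifts and zero fill, and uses the same image coordinates (positive `dx` moves right, positive `dy` moves down).

```python
import numpy as np

def shift_digit_np(image, dx, dy, size=28):
    """Shift a flattened size x size image by whole pixels, filling with zeros.

    Hypothetical helper for illustration; dx > 0 moves right, dy > 0 moves down.
    """
    src = np.asarray(image).reshape(size, size)
    out = np.zeros_like(src)
    # Rows and columns that remain inside the frame after the shift.
    dst_rows = slice(max(dy, 0), size + min(dy, 0))
    src_rows = slice(max(-dy, 0), size + min(-dy, 0))
    dst_cols = slice(max(dx, 0), size + min(dx, 0))
    src_cols = slice(max(-dx, 0), size + min(-dx, 0))
    out[dst_rows, dst_cols] = src[src_rows, src_cols]
    return out.reshape(-1)

# One bright pixel at row 10, column 10 of an otherwise empty image.
demo = np.zeros((28, 28), dtype=np.uint8)
demo[10, 10] = 255

shifted_right = shift_digit_np(demo.reshape(-1), dx=4, dy=0)
row, col = divmod(int(np.argmax(shifted_right)), 28)
print(row, col)  # 10 14
```

Tracking a single pixel makes the direction convention easy to confirm: the row stays the same and the column grows by four for `dx=4`.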

In [5]:
from sklearn.neighbors import KNeighborsClassifier

image_clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance', n_jobs = 8)
image_clf.fit(X_train_expanded, y_train_expanded)

KNeighborsClassifier(n_jobs=8, weights='distance')

In [6]:
from sklearn.metrics import accuracy_score
y_test_pred = image_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
print(accuracy)

0.9672


In [7]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(image_clf, X_train, y_train, cv = 2, n_jobs = 8)
conf_matrix = confusion_matrix(y_train, y_train_pred)
print(conf_matrix)

[[5881    4    5    2    0    5   19    2    1    4]
 [   2 6699   12    3    3    0    2   14    2    5]
 [  55   74 5646   27    6    8   11  100   21   10]
 [  10   16   36 5884    0   60    4   40   51   30]
 [   5   57    1    1 5591    0   18   13    1  155]
 [  18   19    4   81    9 5170   71    5   13   31]
 [  17   13    1    0    5   23 5857    0    2    0]
 [   3   70   13    2   21    2    0 6079    2   73]
 [  25   92   15   80   30   99   20   18 5399   73]
 [  18   11    6   43   50   13    4   85   13 5706]]


## K Nearest Neighbors Classifier Model Evaluation:

Model_v1 had approximately 97% cross-validation accuracy on the training set, which is an excellent score. Since this version only adds shifted copies of the same images, we can assume its cross-validation accuracy will be very similar (I don't really want to rerun cross-validation on the expanded data set; it would yield nearly the same results but take far longer). Like version 1, the model reaches about 97% accuracy on the test set (0.9672 in the run above), which makes sense since the expanded data set is built from the original images, and suggests that it generalizes quite well and can provide accurate predictions on unseen data.

Using 5 neighbors and distance weighting, together with the augmented data set, the model achieves a test accuracy of about 97%. That is quite good and suggests that the model generalizes well, is not badly overfit, and can perform well on unseen data.
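
To make the role of distance weighting concrete, here is a minimal NumPy sketch of a distance-weighted vote among the k nearest neighbors. It is an illustration only, not scikit-learn's implementation: `predict_distance_weighted` and the toy data are hypothetical, and giving exact matches all of the weight (to avoid dividing by zero) is an assumption of this sketch.

```python
import numpy as np

def predict_distance_weighted(X, y, query, k=5):
    """Predict a label by letting each of the k nearest points vote with weight 1/distance."""
    distances = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    d = distances[nearest]
    if np.any(d == 0):
        # Assumption: exact matches take all of the weight, avoiding division by zero.
        weights = (d == 0).astype(float)
    else:
        weights = 1.0 / d
    votes = {}
    for label, w in zip(y[nearest], weights):
        votes[int(label)] = votes.get(int(label), 0.0) + w
    return max(votes, key=votes.get)

# Toy 1-D data: two far-away points of class 0 and one close point of class 1.
X_toy = np.array([[0.0], [0.5], [2.0]])
y_toy = np.array([0, 0, 1])
query = np.array([1.8])

pred = predict_distance_weighted(X_toy, y_toy, query, k=3)
print(pred)  # 1
```

With uniform weights the two class-0 points would win the vote 2 to 1. With inverse-distance weights the single nearby class-1 point outweighs them, which is the behavior `weights = 'distance'` is meant to provide.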

The confusion matrix from the 2-fold cross-validation on the original training set (rows are the true digits 0-9, columns are the predicted digits 0-9):


    [[5881    4    5    2    0    5   19    2    1    4]
     [   2 6699   12    3    3    0    2   14    2    5]
     [  55   74 5646   27    6    8   11  100   21   10]
     [  10   16   36 5884    0   60    4   40   51   30]
     [   5   57    1    1 5591    0   18   13    1  155]
     [  18   19    4   81    9 5170   71    5   13   31]
     [  17   13    1    0    5   23 5857    0    2    0]
     [   3   70   13    2   21    2    0 6079    2   73]
     [  25   92   15   80   30   99   20   18 5399   73]
     [  18   11    6   43   50   13    4   85   13 5706]]

As evidenced by the confusion matrix, model performance on 8's, 2's, and 5's is the worst (about 92%, 95%, and 95% of each are classified correctly). The largest single confusions are 4's classified as 9's (155), 2's classified as 7's (100), and 8's classified as 5's (99); 5's and 8's are also often classified as 3's (81 and 80). These all make sense since these digits are structurally similar, so this type of error will always exist to some degree (and sometimes even humans can't tell the difference).
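
Reading the hardest digits off the matrix by eye is error-prone, so a short check helps. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the cross-validation output above.
conf_matrix = np.array([
    [5881,    4,    5,    2,    0,    5,   19,    2,    1,    4],
    [   2, 6699,   12,    3,    3,    0,    2,   14,    2,    5],
    [  55,   74, 5646,   27,    6,    8,   11,  100,   21,   10],
    [  10,   16,   36, 5884,    0,   60,    4,   40,   51,   30],
    [   5,   57,    1,    1, 5591,    0,   18,   13,    1,  155],
    [  18,   19,    4,   81,    9, 5170,   71,    5,   13,   31],
    [  17,   13,    1,    0,    5,   23, 5857,    0,    2,    0],
    [   3,   70,   13,    2,   21,    2,    0, 6079,    2,   73],
    [  25,   92,   15,   80,   30,   99,   20,   18, 5399,   73],
    [  18,   11,    6,   43,   50,   13,    4,   85,   13, 5706],
])

# Per-digit recall: correctly classified count divided by the number of true examples.
row_totals = conf_matrix.sum(axis=1)
recall = np.diag(conf_matrix) / row_totals
hardest = np.argsort(recall)[:3]  # three digits with the lowest recall

# Largest off-diagonal cells as (true digit, predicted digit, count).
errors = conf_matrix.copy()
np.fill_diagonal(errors, 0)
flat = np.argsort(errors, axis=None)[::-1][:5]
top_confusions = [
    (int(t), int(p), int(errors[t, p]))
    for t, p in zip(*np.unravel_index(flat, errors.shape))
]

for digit in hardest:
    print(f"digit {digit}: recall {recall[digit]:.3f}")
for true_digit, predicted_digit, count in top_confusions:
    print(f"{true_digit} classified as {predicted_digit}: {count}")
```

Sorting by recall ranks digits by how often they are missed, while the off-diagonal ranking shows which specific pairs are confused most often.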

In [8]:
# Save model to file called "KNN_MNIST_ImageClassifier_v2"
joblib.dump(image_clf, "KNN_MNIST_ImageClassifier_v2")


['KNN_MNIST_ImageClassifier_v2']