In [13]:
import pandas as pd
import sklearn
import joblib

# Show all columns when displaying .head() or .describe()
pd.set_option('display.max_columns', None)

In [2]:
# Read in the data. The `label` column holds the digit (0-9). Every other column holds the pixel intensity (0-255) of the image
# at the designated pixel. The MNIST data is already clean and well pre-processed, so we can skip that step (for now).

X_train = pd.read_csv("mnist_train.csv")
y_train = X_train.label
X_train.drop('label', axis = 1, inplace = True)

X_test = pd.read_csv("mnist_test.csv")
y_test = X_test.label  # keep as a Series, consistent with y_train
X_test.drop('label', axis = 1, inplace = True)

In [4]:
from sklearn.neighbors import KNeighborsClassifier

image_clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')
image_clf.fit(X_train, y_train)

KNeighborsClassifier(weights='distance')

In [5]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(image_clf, X_train, y_train, cv = 3)


In [9]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_train, y_train_pred)
conf_matrix

array([[5881,    3,    5,    0,    1,    5,   20,    1,    3,    4],
       [   2, 6701,   11,    3,    3,    1,    3,   11,    2,    5],
       [  46,   54, 5691,   20,    9,    4,   11,   92,   21,   10],
       [   7,   11,   33, 5899,    1,   65,    6,   43,   39,   27],
       [   3,   51,    0,    1, 5622,    0,   14,   13,    2,  136],
       [  13,   10,    4,   60,    6, 5217,   61,    7,   16,   27],
       [  20,   13,    1,    0,    5,   25, 5851,    0,    3,    0],
       [   3,   65,   12,    3,   13,    2,    0, 6093,    4,   70],
       [  18,   82,   10,   73,   29,   87,   27,   16, 5431,   78],
       [  15,   10,    5,   39,   44,   11,    5,   77,   13, 5730]],
      dtype=int64)

In [15]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(image_clf, X_train, y_train, cv = 3)
print(cv_scores)

[0.9688  0.96795 0.96905]


## K Nearest Neighbors Classifier Model Evaluation:

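The model reaches approximately 97% cross-validation accuracy on the training set, which is a strong score for an untuned classifier. By precision it performs worst on 9s: 357 other digits are misclassified as 9 (about 94% precision), most often 4s (136), 8s (78), and 7s (70), which makes sense since these digits share strokes with a 9. By recall it performs worst on 8s, with 5431 of 5851 (about 93%) classified correctly.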

With 5 neighbors and distance weighting, the three cross-validation folds score 0.9688, 0.96795, and 0.96905 (a mean of about 0.9686).
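
To make "distance weighting" concrete, here is a minimal NumPy sketch of an inverse-distance-weighted vote. It is an illustration only, not scikit-learn's implementation; the function name `knn_predict_distance_weighted` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict_distance_weighted(X_train, y_train, x, k=5):
    """Predict the label of x by an inverse-distance-weighted vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Exact matches (distance 0) decide the vote on their own
    exact = nearest[dists[nearest] == 0]
    if exact.size > 0:
        nearest = exact
        weights = np.ones(exact.size)
    else:
        weights = 1.0 / dists[nearest]
    # Sum the weights per class and return the class with the largest total
    classes = np.unique(y_train)
    totals = np.array([weights[y_train[nearest] == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Toy data (hypothetical): two class-0 points next to the query, three class-1 points farther away
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [1.1, 1.0]])
y_toy = np.array([0, 0, 1, 1, 1])
query = np.array([0.05, 0.05])
prediction = knn_predict_distance_weighted(X_toy, y_toy, query, k=5)
```

With uniform weights the three class-1 neighbors would outvote the two class-0 neighbors. With inverse-distance weights the two nearby class-0 points dominate, so `prediction` is 0.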

The confusion matrix for the training set (rows are the true digits 0-9, columns are the classified digits 0-9):


    [[5881,    3,    5,    0,    1,    5,   20,    1,    3,    4]
    [   2, 6701,   11,    3,    3,    1,    3,   11,    2,    5]
    [  46,   54, 5691,   20,    9,    4,   11,   92,   21,   10]
    [   7,   11,   33, 5899,    1,   65,    6,   43,   39,   27]
    [   3,   51,    0,    1, 5622,    0,   14,   13,    2,  136]
    [  13,   10,    4,   60,    6, 5217,   61,    7,   16,   27]
    [  20,   13,    1,    0,    5,   25, 5851,    0,    3,    0]
    [   3,   65,   12,    3,   13,    2,    0, 6093,    4,   70]
    [  18,   82,   10,   73,   29,   87,   27,   16, 5431,   78]
    [  15,   10,    5,   39,   44,   11,    5,   77,   13, 5730]]
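
To check which digits are hardest, per-class precision and recall can be computed directly from the matrix. The sketch below copies the matrix into a NumPy array and assumes scikit-learn's convention of true labels on rows and predicted labels on columns. Variable names other than `conf_matrix` are illustrative.

```python
import numpy as np

# Confusion matrix copied from the cross-validated predictions (rows: true digit, columns: predicted digit)
conf_matrix = np.array([
    [5881,    3,    5,    0,    1,    5,   20,    1,    3,    4],
    [   2, 6701,   11,    3,    3,    1,    3,   11,    2,    5],
    [  46,   54, 5691,   20,    9,    4,   11,   92,   21,   10],
    [   7,   11,   33, 5899,    1,   65,    6,   43,   39,   27],
    [   3,   51,    0,    1, 5622,    0,   14,   13,    2,  136],
    [  13,   10,    4,   60,    6, 5217,   61,    7,   16,   27],
    [  20,   13,    1,    0,    5,   25, 5851,    0,    3,    0],
    [   3,   65,   12,    3,   13,    2,    0, 6093,    4,   70],
    [  18,   82,   10,   73,   29,   87,   27,   16, 5431,   78],
    [  15,   10,    5,   39,   44,   11,    5,   77,   13, 5730],
])

correct = np.diag(conf_matrix)                 # correctly classified count per digit
recall = correct / conf_matrix.sum(axis=1)     # share of each true digit that was found
precision = correct / conf_matrix.sum(axis=0)  # share of each predicted digit that was right
accuracy = correct.sum() / conf_matrix.sum()   # overall accuracy across all folds

worst_recall_digit = int(np.argmin(recall))
worst_precision_digit = int(np.argmin(precision))

# Largest single confusion: zero out the diagonal and locate the biggest remaining cell
off_diag = conf_matrix.copy()
np.fill_diagonal(off_diag, 0)
true_digit, pred_digit = np.unravel_index(np.argmax(off_diag), off_diag.shape)

for digit in range(10):
    print(f"{digit}: precision={precision[digit]:.3f} recall={recall[digit]:.3f}")
```

On this matrix, recall is lowest for 8, precision is lowest for 9, and the largest single confusion is a true 4 classified as 9.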


A potential future improvement is image pre-processing that centers each digit within its frame. Centered inputs represent the digit more consistently, so the model should be more likely to produce accurate and consistent predictions.
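
One possible way to center an image is to shift it so that its intensity center of mass lands on the middle of the frame. The sketch below is an assumption about how such a step could look, not something used in this notebook. The function name `center_image` is hypothetical, and it assumes each row is a flattened 28x28 image.

```python
import numpy as np

def center_image(pixels, size=28):
    """Shift a flattened size*size image so its intensity center of mass sits at the frame center."""
    img = np.asarray(pixels, dtype=float).reshape(size, size)
    total = img.sum()
    if total == 0:
        return img.reshape(-1)  # blank image: nothing to center
    rows, cols = np.indices(img.shape)
    # Intensity-weighted mean row and column
    row_com = (rows * img).sum() / total
    col_com = (cols * img).sum() / total
    center = (size - 1) / 2
    shift_r = int(round(center - row_com))
    shift_c = int(round(center - col_com))
    # np.roll wraps pixels around the edges, which is harmless when the borders are empty
    shifted = np.roll(img, (shift_r, shift_c), axis=(0, 1))
    return shifted.reshape(-1)

# Hypothetical example: a 2x2 bright block in the top-left corner
corner = np.zeros((28, 28))
corner[0:2, 0:2] = 255
centered = center_image(corner.reshape(-1)).reshape(28, 28)
```

After the call, the bright block occupies rows 13-14 and columns 13-14. To use this with the model, the same transformation would have to be applied to every row of both the training and the test data, for example with `np.apply_along_axis(center_image, 1, X_train.to_numpy())`.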

In [34]:
# Save model to file called "KNN_MNIST_ImageClassifier_v1"
joblib.dump(image_clf, "KNN_MNIST_ImageClassifier_v1")

['KNN_MNIST_ImageClassifier_v1']