# K-Nearest Neighbors MNIST Classifier
### By Shah Zafrani
1. For each sample in “MNIST_test.csv”, compute its distance to every sample in the training data.
2. Find the K nearest neighbors and take the majority class among them as the prediction.
3. Compare the prediction with the ground truth:
    a. Correctly classified if the predicted label and the ground truth are identical.
    b. Incorrectly classified if the predicted label and the ground truth are NOT identical.
4. Repeat steps 1-3 for every sample in the test data.
5. Count how many test samples are correctly classified and how many are incorrectly classified.
6. Show the accuracy of your KNN. Compute accuracy by dividing the number of correctly classified test samples by the total number of test samples.
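
Written as a formula, with $N_{\text{correct}}$ and $N_{\text{test}}$ as shorthand introduced here for the number of correctly classified test samples and the total number of test samples, the percentage reported later in this notebook is:

$$
\text{accuracy} = \frac{N_{\text{correct}}}{N_{\text{test}}} \times 100
$$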

#### Steps:
+ Import data
+ Write a distance function
+ Write a loop to iterate through test cases
+ Write a way to get the k nearest items to a test case
+ Compute and output the accuracy


In [4]:
import pandas as pd
import numpy as np
# import the libraries we need
print("Libraries {} and {} have been imported".format(np.__name__, pd.__name__))

Libraries numpy and pandas have been imported


In [5]:
# read in the data
training = pd.read_csv('MNIST_training.csv')
# display the data so that we can visualize what we're working with
print(training.head())
# take the first column of values containing the labels
training_labels = training.iloc[:, 0]
# take only the pixel data and output it as a numpy array
training_data = training.drop('label', axis=1).to_numpy()
# load the test set the same way: labels first, then pixel data as a numpy array
test = pd.read_csv('MNIST_test.csv')
test_labels = test.iloc[:, 0]
test_data = test.drop('label', axis=1).to_numpy()

   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \
0      0       0       0       0       0       0       0       0       0   
1      0       0       0       0       0       0       0       0       0   
2      0       0       0       0       0       0       0       0       0   
3      0       0       0       0       0       0       0       0       0   
4      0       0       0       0       0       0       0       0       0   

   pixel8    ...     pixel774  pixel775  pixel776  pixel777  pixel778  \
0       0    ...            0         0         0         0         0   
1       0    ...            0         0         0         0         0   
2       0    ...            0         0         0         0         0   
3       0    ...            0         0         0         0         0   
4       0    ...            0         0         0         0         0   

   pixel779  pixel780  pixel781  pixel782  pixel783  
0         0         0         0         0         

In [12]:
def distance(training, test):
    # Euclidean distance, vectorized with numpy so it runs faster than an element-by-element Python loop
    total = np.sum((training - test)**2)
    # equivalent explicit loop, kept for reference:
    # total = 0
    # for i in range(len(test)):
    #     diff = training[i] - test[i]
    #     total += diff**2
    return np.sqrt(total)
test_x = np.array([0, 5])
test_y = np.array([0, 0])
# the distance between (0, 5) and (0, 0) is 5, so a vector of 2 values makes a quick sanity check;
#  the same function works unchanged on the 784 pixel values of an image
print(distance(test_x, test_y))

5.0
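
The same numpy idea extends to measuring the distance from one test case to every training row in a single call. The sketch below is an illustration, not part of the original notebook: `distances_to_all` and the toy arrays are names and values made up here, and it assumes the training data is a 2-D array with one sample per row.

```python
import numpy as np

def distances_to_all(training_rows, test_row):
    """Euclidean distance from one test vector to every row of a 2-D array."""
    # Broadcasting subtracts test_row from each training row;
    # summing over axis=1 adds up the squared differences per row.
    diffs = np.asarray(training_rows, dtype=float) - test_row
    return np.sqrt(np.sum(diffs ** 2, axis=1))

# Three toy "training" points in two dimensions (made-up values).
toy_training = np.array([[0, 5], [3, 4], [0, 0]])
toy_test = np.array([0, 0])
print(distances_to_all(toy_training, toy_test))  # [5. 5. 0.]
```

This removes one Python-level loop per test case, which is usually where most of the running time goes.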


In [7]:
def get_most_common(array):
    digits = np.zeros(10, dtype=int)
    for num in array:
        digits[num] +=1
    return digits.argmax()
# We will test this with a simple array
print(get_most_common([0, 1, 2, 2, 3, 3, 3]))

3
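
The counting loop can also be expressed with `np.bincount`, which tallies non-negative integers in one call. This is an optional sketch; `most_common_bincount` and its `num_classes` parameter are hypothetical names introduced here. As with `argmax` in the function above, a tie goes to the smallest label.

```python
import numpy as np

def most_common_bincount(labels, num_classes=10):
    """Return the most frequent label; ties go to the smallest label."""
    # minlength keeps the count array at one slot per digit class.
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=num_classes)
    return int(counts.argmax())

print(most_common_bincount([0, 1, 2, 2, 3, 3, 3]))  # 3
print(most_common_bincount([7, 7, 4, 4]))           # 4 (tie resolved to the smaller label)
```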


In [8]:
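# simple accuracy evaluation function: returns the number of correct predictions
# (the caller divides by the number of test labels to get a percentage)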
def evaluate_accuracy(knn_predictions):
    truth = np.array(test_labels) == np.array(knn_predictions)
    return sum(truth)

### Now we're ready to implement our KNN classifier, which takes a set of training data, a set of test data, and a value for k

In [9]:
def knn(training_data, test_data, k):
    knn_predictions = []
    for test_case in test_data:
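        # pair each training label with that sample's distance to the test case
        dists = []
        for index, d in enumerate(training_data):
            dists.append([training_labels[index], distance(d, test_case)])
        # sort by distance so the nearest training samples come first
        sorted_dists = sorted(dists, key=lambda tup: tup[1])
        # keep only the labels of the k nearest samples
        k_nearest = [pair[0] for pair in sorted_dists[:k]]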
        # gather consensus from top k elements from list
        knn_predictions.append(get_most_common(k_nearest))
    return knn_predictions
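
For larger data sets, the inner loop and the full sort are the expensive parts. A possible variant, sketched here under the assumption that labels are digits 0-9 and that `k` is no larger than the number of training samples, computes all distances for a test case at once and uses `np.argpartition` to pick the k smallest without sorting everything. `knn_vectorized` and the toy data are hypothetical and not part of the original notebook. When several training samples are at exactly the same distance, it may pick different neighbors than the sorted version.

```python
import numpy as np

def knn_vectorized(train_x, train_y, test_x, k):
    """Predict a label for each test row by majority vote of its k nearest training rows."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=int)
    predictions = []
    for test_case in np.asarray(test_x, dtype=float):
        # Distance from this test case to every training row at once.
        dists = np.sqrt(np.sum((train_x - test_case) ** 2, axis=1))
        # Indices of the k smallest distances (in no particular order).
        nearest = np.argpartition(dists, k - 1)[:k]
        # Majority vote among the neighbors' labels.
        votes = np.bincount(train_y[nearest], minlength=10)
        predictions.append(int(votes.argmax()))
    return predictions

# Toy data (made-up values): label 0 clusters near the origin, label 1 near (10, 10).
toy_x = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
toy_y = [0, 0, 0, 1, 1]
print(knn_vectorized(toy_x, toy_y, [[0.2, 0.2], [9, 10]], k=3))  # [0, 1]
```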

### Finally we put it all together:

In [10]:

def optimize_k_value(k_range):
    print("testing odd values up to {} for k. This may take a while, go read some xkcd comics and come back later...".format(k_range))
    k_results = []
    for kval in range(1, k_range, 2):
        accuracy = (evaluate_accuracy(knn(training_data, test_data, kval)) / float(len(test_labels))) * 100
        k_results.append(["kval: {}, accuracy: {} percent".format(kval, accuracy)])
    for result in k_results:
        print(result)

In [11]:

optimize_k_value(14)

testing odd values up to 14 for k. This may take a while, go read some xkcd comics and come back later...
['kval: 1, accuracy: 84.0 percent']
['kval: 3, accuracy: 88.0 percent']
['kval: 5, accuracy: 86.0 percent']
['kval: 7, accuracy: 90.0 percent']
['kval: 9, accuracy: 90.0 percent']
['kval: 11, accuracy: 84.0 percent']
['kval: 13, accuracy: 82.0 percent']
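
To read off the best value of k from these results, the accuracies can be collected in a dictionary and compared. The snippet below is a small sketch: the numbers are copied from the run above, while the `reported` dictionary and variable names are introduced here for illustration.

```python
# Accuracies (percent) copied from the output above, keyed by k.
reported = {1: 84.0, 3: 88.0, 5: 86.0, 7: 90.0, 9: 90.0, 11: 84.0, 13: 82.0}

best_accuracy = max(reported.values())
# More than one k can share the top accuracy, so collect all of them.
best_ks = sorted(k for k, acc in reported.items() if acc == best_accuracy)
print(best_ks, best_accuracy)  # [7, 9] 90.0
```

On this run, k = 7 and k = 9 tie for the highest accuracy.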
