## Building a Faster KNN Classifier

In this notebook, we will build a parsimonious K-NN model that uses cosine similarity as its similarity measure to classify MNIST images, in an attempt to find a speed and/or accuracy improvement over the Scikit-Learn K-NN model.
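
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms, which is the quantity Scikit-Learn's `cosine_similarity` computes for every pair of rows. A minimal sketch of that definition in plain NumPy, using made-up vectors (the arrays `stored` and `query` are illustrative, not MNIST pixels):

```python
import numpy as np

# illustrative data (hypothetical values): three "stored" points and one query
stored = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 4.0, 6.0]])
query = np.array([[1.0, 2.0, 3.5]])

# scale every row to unit length, so a plain dot product is the cosine similarity
stored_unit = stored / np.linalg.norm(stored, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)

# one row per query point, one column per stored point
cosim = query_unit @ stored_unit.T
print(cosim.shape)
print(cosim)
```

Rows 0 and 2 of `stored` point in the same direction, so they get the same similarity to the query even though their lengths differ. Cosine similarity ignores magnitude and compares direction only.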

Start by importing required libraries, and building the same data sets as in the Scikit-Learn K-NN notebook.

In [11]:
import numpy as np
import heapq
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import datasets, model_selection
from sklearn.metrics import classification_report
from keras.models import load_model
from google.colab import drive
from keras.datasets import mnist
import h5py

# load the 60,000 training and 10,000 test MNIST images through Keras
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# combine both splits into one 70,000-image set, flattening each 28x28 image
# into a 784-length vector so cosine_similarity can treat it as one row
data = np.concatenate([x_train, x_test]).reshape(-1, 28 * 28)
target = np.concatenate([y_train, y_test])

# make sure everything was correctly imported
data.shape, target.shape

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Set up the exact same data sets with the same method as in the Scikit-Learn K-NN notebook.

In [0]:
# make an array of indices the size of MNIST to use for making the data sets.
# This array is in random order, so we can use it to scramble up the MNIST data
indx = np.random.choice(len(target), 70000, replace=False)

# method for building datasets to test with
def mk_dataset(size):
    """Makes a dataset of size `size` and returns that dataset's images and targets.
    This is used to make the dataset that will be stored by a model and used in
    experimenting with different stored dataset sizes.
    """
    train_img = [data[i] for i in indx[:size]]
    train_img = np.array(train_img)
    train_target = [target[i] for i in indx[:size]]
    train_target = np.array(train_target)
    
    return train_img, train_target

In [0]:
# let's make a dataset of size 50,000, meaning the model will have 50,000 data points
# to compare each new point against when classifying it
fifty_x, fifty_y = mk_dataset(50000)
fifty_x.shape, fifty_y.shape

In [0]:
# let's make one more of size 20,000 and see how classification accuracy changes when we use that one
twenty_x, twenty_y = mk_dataset(20000)
twenty_x.shape, twenty_y.shape

In [0]:
# build model testing dataset
test_img = [data[i] for i in indx[60000:70000]]
test_img1 = np.array(test_img)
test_target = [target[i] for i in indx[60000:70000]]
test_target1 = np.array(test_target)
test_img1.shape, test_target1.shape
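
Because every set is sliced from the same shuffled index array, the test images never appear in either stored set, and the 20,000-image set is a subset of the 50,000-image set. A small sketch of that property, using a hypothetical 70-element index array (`indx_demo`) as a stand-in for the 70,000 MNIST indices:

```python
import numpy as np

# toy stand-in for the shuffled MNIST indices (hypothetical size of 70 for illustration)
rng = np.random.default_rng(0)
indx_demo = rng.choice(70, 70, replace=False)

stored_idx = indx_demo[:50]    # plays the role of indx[:50000]
smaller_idx = indx_demo[:20]   # plays the role of indx[:20000]
test_idx = indx_demo[60:70]    # plays the role of indx[60000:70000]

# indices shared by the stored set and the test set (there should be none)
overlap = np.intersect1d(stored_idx, test_idx)
print(len(overlap))

# True when every index of the smaller stored set also appears in the larger one
is_subset = bool(np.isin(smaller_idx, stored_idx).all())
print(is_subset)
```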

### Building the Model

Below we will create the function `cos_knn()` that will act as our latest and greatest K-NN classifier for MNIST.  Follow the comments in the function for details on how it works.

In [0]:
def cos_knn(k, test_data, test_target, stored_data, stored_target):
    """k: number of neighbors to use for voting
    test_data: a set of unobserved images to classify
    test_target: the labels for the test_data (for calculating accuracy)
    stored_data: the images already observed and available to the model
    stored_target: labels for stored_data
    """
    
    # find the cosine similarity between every point in test_data and every point in stored_data
    cosim = cosine_similarity(test_data, stored_data)
    
    # get the indices of the k images in stored_data that are most similar to each test_data point
    top = [(heapq.nlargest((k), range(len(i)), i.take)) for i in cosim]
    # convert those indices to digit labels using stored_target
    top = [[stored_target[j] for j in i[:k]] for i in top]
    
    # vote, and return prediction for every image in test_data
    pred = [max(set(i), key=i.count) for i in top]
    pred = np.array(pred)
    
    # print table giving classifier accuracy using test_target
    print(classification_report(test_target, pred))
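
To see the similarity, top-k, and voting steps on data small enough to check by hand, here is a sketch of a variant that returns its predictions instead of printing a report. The name `cos_knn_predict` and the toy arrays are hypothetical, and it swaps in NumPy's `argpartition` and `Counter` for the `heapq` and `max(set(...))` steps used above. It also assumes no row is all zeros, since such a row has no direction to normalize.

```python
import numpy as np
from collections import Counter

def cos_knn_predict(k, test_data, stored_data, stored_target):
    """Hypothetical variant of cos_knn() that returns predictions instead of printing."""
    # L2-normalize rows so a plain dot product equals cosine similarity
    test_unit = test_data / np.linalg.norm(test_data, axis=1, keepdims=True)
    stored_unit = stored_data / np.linalg.norm(stored_data, axis=1, keepdims=True)
    cosim = test_unit @ stored_unit.T  # shape: (n_test, n_stored)

    # indices of the k most similar stored points per test point (unordered within the k)
    top = np.argpartition(-cosim, k - 1, axis=1)[:, :k]

    # majority vote over the neighbors' labels
    return np.array([Counter(stored_target[row]).most_common(1)[0][0] for row in top])

# toy 2-D data: class 0 points lie near the x-axis, class 1 points near the y-axis
stored_data = np.array([[1.0, 0.1], [2.0, 0.0], [3.0, 0.3],
                        [0.1, 1.0], [0.0, 2.0], [0.2, 3.0]])
stored_target = np.array([0, 0, 0, 1, 1, 1])
test_data = np.array([[5.0, 0.5], [0.4, 4.0]])

pred = cos_knn_predict(3, test_data, stored_data, stored_target)
print(pred)
```

Returning the predictions makes it possible to compute other metrics afterwards or to inspect individual misclassifications.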

### Testing the Model
Now, just as with the Scikit-Learn K-NN model, we will test the `cos_knn()` model on the two data sets and see how it stacks up against the Scikit-Learn K-NN model.

In [0]:
%%time
# stored data set size of 50,000
cos_knn(5, test_img1, test_target1, fifty_x, fifty_y)

In [0]:
%%time
# stored data set size of 20,000
cos_knn(5, test_img1, test_target1, twenty_x, twenty_y)
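
One practical note on the 50,000-image run: the full similarity matrix has 10,000 x 50,000 = 5 x 10^8 entries, roughly 4 GB if stored as 64-bit floats. If that does not fit in memory, one option is to process the test set in batches. The sketch below is an assumption-laden illustration rather than part of the original model: `top_k_indices_batched` and `batch_size` are hypothetical names, the data is random rather than MNIST, and rows are assumed to be non-zero.

```python
import numpy as np

def top_k_indices_batched(k, test_data, stored_data, batch_size=1000):
    """Hypothetical helper: top-k most similar stored indices per test row, batch by batch."""
    stored_unit = stored_data / np.linalg.norm(stored_data, axis=1, keepdims=True)
    chunks = []
    for start in range(0, len(test_data), batch_size):
        batch = test_data[start:start + batch_size].astype(float)
        batch_unit = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        # only a (batch_size, n_stored) block of similarities exists at any one time
        cosim = batch_unit @ stored_unit.T
        chunks.append(np.argpartition(-cosim, k - 1, axis=1)[:, :k])
    return np.vstack(chunks)

# random stand-in data (not MNIST): 25 test rows, 40 stored rows, 8 features
rng = np.random.default_rng(0)
stored = rng.random((40, 8)) + 0.1
test = rng.random((25, 8)) + 0.1

batched = top_k_indices_batched(5, test, stored, batch_size=10)
full = top_k_indices_batched(5, test, stored, batch_size=25)
print(batched.shape)
```

The neighbor sets found in batches match those from a single pass, so the voting step can run on the returned indices unchanged.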

Fantastic! The cosine similarity model we built ourselves outperformed the Scikit-Learn K-NN in both classification speed (by a sizeable margin) and accuracy, and yet the model is so simple!

For further analysis of how the model works and how it stacked up against the Scikit-Learn K-NN in many different situations, see [this GitHub repository](https://github.com/samgrassi01/Cosine-Similarity-Classifier).