# K-Nearest Neighbor
* Let's now discuss a very simple and very intuitive method for machine learning: K-Nearest Neighbor (KNN)
* The basic premise is to make a prediction using the closest known data points 

![r_knn_concept.png](attachment:r_knn_concept.png)

# Implementation 
## KNN - 1 
* Although the concept for K-Nearest Neighbor is simple, the implementation can be complex
* 1-nearest neighbor is simple, since we only need to keep track of the single minimum distance 

![knn%201.png](attachment:knn%201.png)
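
As a minimal sketch of the 1-nearest-neighbor case, the loop below keeps only the smallest distance seen so far and the label that goes with it. The function name `nearest_neighbor` and the toy data are hypothetical, chosen only for illustration.

```python
import math

def nearest_neighbor(train_points, train_labels, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    best_dist = math.inf
    best_label = None
    for point, label in zip(train_points, train_labels):
        # squared Euclidean distance; the square root is not needed for ranking
        dist = sum((a - b) ** 2 for a, b in zip(point, query))
        if dist < best_dist:
            best_dist = dist
            best_label = label
    return best_label

# hypothetical toy data: two small clusters in 2-D
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "b", "b"]
print(nearest_neighbor(points, labels, (4.8, 5.1)))  # closest point is (5.0, 5.0)
```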

## KNN - Multiple neighbors
* Keeping track of an arbitrary number of closest distances is not trivial 
* For example: 
  * Say k = 3 and we have found 3 data points that are [1, 2, 3] away
  * We then see the next training point, and it is at a distance of 1.5
  * How do we check whether it should go into the set of the 3 closest data points that we have so far?
* First, consider that if we have N training points, we need to look through all of them for every prediction we want to make. This is **O(N)**
* If we take the naive approach and scan all K of the closest points found so far, to see whether we should remove the farthest one and add the new, closer point, that is **O(K)**
* This gives a total of **O(NK)** per prediction
* However, searching a list does not need to be **O(K)**
* If we keep the list sorted, we can make the search **O(log K)**, for a total of **O(N log K)**, which is a bit better
* There are even better ways to optimize this, such as the Ball Tree and K-D Tree, but they are outside the scope of this course
* Once we have our k nearest neighbors, the next step is to turn them into votes 
* This means that we need to store not only the closest distances but also their corresponding classes
* This is like a dictionary, or a list of tuples, where the key is the distance and the value is the class
  * {dist1: class1, dist2: class2, ...} or [(dist1, class1), (dist2, class2), ...]
* Once we have collected the k nearest neighbors, all we need to do is create another dictionary, where the class is the key and the number of times it appears is the value
* Finally, we choose the class with the maximum number of votes 
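
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the implementation used later in this notebook: it keeps a sorted Python list with `bisect.insort` (the search is O(log k), although inserting into a plain list still shifts elements), and the names `k_nearest` and `majority_vote` are hypothetical. The toy data mirrors the example of distances [1, 2, 3] followed by a point at 1.5.

```python
import bisect

def k_nearest(train_points, train_labels, query, k):
    """Return the k closest (squared distance, label) pairs, sorted by distance."""
    nearest = []  # kept sorted ascending, never longer than k
    for point, label in zip(train_points, train_labels):
        dist = sum((a - b) ** 2 for a, b in zip(point, query))
        if len(nearest) < k:
            bisect.insort(nearest, (dist, label))
        elif dist < nearest[-1][0]:
            nearest.pop()  # drop the farthest neighbor kept so far
            bisect.insort(nearest, (dist, label))
    return nearest

def majority_vote(neighbors):
    """Count one vote per neighbor and return the first class with the most votes."""
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    best_label, best_count = None, 0
    for label, count in votes.items():
        if count > best_count:
            best_label, best_count = label, count
    return best_label

# 1-D toy data: points at distances 1, 2, 3 and then 1.5 from the query at 0
points = [(1.0,), (2.0,), (3.0,), (1.5,)]
labels = ["x", "y", "y", "x"]
neighbors = k_nearest(points, labels, (0.0,), k=3)
print(neighbors)                 # the point at distance 3 has been replaced
print(majority_vote(neighbors))
```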
  
### Breaking Ties
* One question that may arise is: what if we end up with a tie? 
* In this case we have several options:
  * The simplest is to take whatever argmax(votes) gives us, so the result depends on how argmax() is implemented (NumPy's `argmax`, for example, returns the first maximum)
  * Another is to pick at random among the tied classes
  * A final option is to weight each vote by the distance from the neighbor to the test point (more difficult)
* In our case we are going to loop through the dictionary manually and choose the first max
* The winner of a tie therefore depends on the order in which Python iterates over the dictionary's keys (insertion order since Python 3.7)
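
To make the options concrete, here is a small sketch of three tie-breaking strategies on a hypothetical list of `(distance, class)` neighbors. The function names are made up for this example, and inverse-distance weighting is only one possible weighting scheme, chosen here as an assumption.

```python
import random

def count_votes(neighbors):
    """Build a class -> count dictionary from (distance, class) pairs."""
    counts = {}
    for _, label in neighbors:
        counts[label] = counts.get(label, 0) + 1
    return counts

def vote_first_max(neighbors):
    """Loop through the dictionary and keep the first class with the max count."""
    best_label, best_count = None, 0
    for label, count in count_votes(neighbors).items():
        if count > best_count:
            best_label, best_count = label, count
    return best_label

def vote_random_tiebreak(neighbors, rng=random):
    """Pick uniformly at random among the classes tied for the max count."""
    counts = count_votes(neighbors)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    return rng.choice(tied)

def vote_distance_weighted(neighbors, eps=1e-9):
    """Weight each vote by 1 / distance, so closer neighbors count more."""
    weights = {}
    for dist, label in neighbors:
        weights[label] = weights.get(label, 0.0) + 1.0 / (dist + eps)
    return max(weights, key=weights.get)

# hypothetical neighbors, sorted by distance: two votes each for "cat" and "dog"
neighbors = [(1.0, "cat"), (1.1, "dog"), (1.2, "dog"), (5.0, "cat")]
print(vote_first_max(neighbors))          # "cat" was inserted first
print(vote_random_tiebreak(neighbors))    # either class
print(vote_distance_weighted(neighbors))  # the two "dog" neighbors are closer overall
```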

## How is K chosen?
* No easy answer
* K is called a "hyperparameter"
* We will want to use cross-validation to choose it
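
One way to do this is sketched below: try several values of K and keep the one with the best cross-validated accuracy. Everything here is an assumption made for illustration, including the 5 folds, the candidate values of K, the synthetic two-cluster data, and the helper names `knn_predict` and `cross_val_accuracy`.

```python
import random

def knn_predict(train, query, k):
    """train is a list of (point, label); return the majority label of the k nearest."""
    by_dist = sorted(
        train,
        key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], query)),
    )
    votes = {}
    for _, label in by_dist[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def cross_val_accuracy(data, k, n_folds=5):
    """Average accuracy over n_folds, holding out one fold at a time."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += sum(knn_predict(train, p, k) == label for p, label in held_out)
    return correct / len(data)

# synthetic toy data: two Gaussian clusters in 2-D, labeled 0 and 1
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(50)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(50)]
rng.shuffle(data)

scores = {k: cross_val_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```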

# Lazy Classifier 
* KNN is an example of a lazy classifier 
* train(X,Y) doesn't do anything! It just stores X and Y
* predict(X') does all the work by looking through the stored X and Y 

----
# KNN in code with MNIST
* Let's now take some time to implement KNN in code and apply it to the MNIST data set
* Before getting into that, though, let's quickly look through the data set to get a feel for what it looks like

![data.png](attachment:data.png)

In [25]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
df.shape

(42000, 785)

We can see that this data consists of 42,000 examples and 785 columns. We should think about the problem in the following way:
* Each image (42,000 total) is a single training point
* Each image (1 training point) is composed of 784 input features
* These input features are pixel intensity values from 0 to 255
* The first column is the label associated with each image
* 784 + 1 = 785, which is why we have 785 columns
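
As a small sketch of this layout, the snippet below builds one hypothetical row (label first, then 784 pixel values), splits it into label and features, and reshapes the features into an image. It assumes the usual MNIST convention of 28 x 28 images stored row by row, which the text above does not state explicitly.

```python
# hypothetical row in the same layout as train.csv: label first, then 784 pixels
row = [7] + [0] * 784
row[1 + 10 * 28 + 14] = 255  # light up the pixel at image row 10, column 14

label, pixels = row[0], row[1:]
assert len(pixels) == 784

# assumed layout: 28 x 28 image, stored row by row
image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]

# scale intensities from 0..255 down to 0..1
scaled = [[p / 255.0 for p in line] for line in image]
print(label, len(image), len(image[0]), scaled[10][14])
```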

## Time to implement...
* For this we are going to need the `sortedcontainers` library
* Quickly, let's look over the `get_data` function
* It uses pandas so that reading the csv input takes just one line
* We then convert the DataFrame into a numpy array
* We shuffle the data so that when we sample it later, it is in a random order
* We normalize the pixel values to the range 0-1
* X is all the image data
* Y is the first column, the labels

In [7]:
import numpy as np
import pandas as pd

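def get_data(limit=None):
    print("Reading in and transforming data...")
    df = pd.read_csv('data/train.csv')
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    # copy=True guarantees a writable array for the in-place shuffle below.
    data = df.to_numpy(copy=True)
    np.random.shuffle(data)      # shuffles the rows in place
    X = data[:, 1:] / 255.0      # pixel data is from 0..255, scale to 0..1
    Y = data[:, 0]               # first column holds the labels
    if limit is not None:
        X, Y = X[:limit], Y[:limit]
    return X, Y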

In [4]:
from sortedcontainers import SortedList
from datetime import datetime

## Create KNN class

In [27]:
class KNN(object):
    def __init__(self, k):
        self.k = k

    # This is our train function: it takes the inputs X and the targets y.
    # Because KNN is a lazy classifier, all fit does is save X and y.
    def fit(self, X, y):
        self.X = X
        self.y = y

    # Takes only the inputs X and returns one predicted label per row.
    def predict(self, X):
        # we need one prediction for every input row
        y = np.zeros(len(X))

        # each x is one image we want to classify: a 1-D array of shape (784,)
        for i, x in enumerate(X):
            # sorted list of (distance, class) tuples; it holds at most k entries
            sl = SortedList()

            # X is the data we want predictions for; self.X is the stored
            # training data. For every x we scan all training points xt,
            # where j is the index of xt.
            for j, xt in enumerate(self.X):
                # (784,) - (784,) gives (784,); the dot product of diff with
                # itself is the squared Euclidean distance
                diff = x - xt
                d = diff.dot(diff)

                # Assumed completion: the steps below follow the notes above
                # (sorted list of (distance, class), votes, first max).
                if len(sl) < self.k:
                    sl.add((d, self.y[j]))
                elif d < sl[-1][0]:
                    # closer than the farthest neighbor kept so far: swap it in
                    del sl[-1]
                    sl.add((d, self.y[j]))

            # turn the k nearest neighbors into votes: class -> count
            votes = {}
            for _, v in sl:
                votes[v] = votes.get(v, 0) + 1

            # choose the first class that reaches the maximum vote count
            max_votes = 0
            max_votes_class = -1
            for v, count in votes.items():
                if count > max_votes:
                    max_votes = count
                    max_votes_class = v
            y[i] = max_votes_class
        return y

In [30]:
X, y = get_data()

Reading in and transforming data...
