# k-Nearest-Neighbors from Scratch

Let's code up one of the simplest machine learning (ML) algorithms: k-Nearest-Neighbors (kNN). kNN classifies an object by measuring its distance to a set of labeled objects and letting the k closest ones vote on its label.

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

In [2]:
# First we need a function to compute the distance metric. We could use several different metrics, but let's
# just start with the standard Euclidean distance. Here x is a pair [a, b] of feature vectors.
euc_dist = lambda x: np.linalg.norm(x[1]-x[0])
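
Since other metrics are an option, here is a minimal sketch of two alternatives that follow the same calling convention as `euc_dist` (a single argument holding a pair of vectors). The names `manhattan_dist` and `chebyshev_dist` are hypothetical and not part of the original notebook; they only rely on the `ord` argument of `np.linalg.norm`.

```python
import numpy as np

# Hypothetical alternatives to euc_dist; each takes a pair [a, b] of vectors.
def manhattan_dist(x):
    # L1 norm: sum of absolute coordinate differences
    return np.linalg.norm(np.asarray(x[1]) - np.asarray(x[0]), ord=1)

def chebyshev_dist(x):
    # L-infinity norm: largest absolute coordinate difference
    return np.linalg.norm(np.asarray(x[1]) - np.asarray(x[0]), ord=np.inf)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(manhattan_dist([a, b]))  # 7.0
print(chebyshev_dist([a, b]))  # 4.0
```

Either function could be passed as the `distance_metric` argument in place of the Euclidean default.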

In [131]:
# great! now let's build a classifier
def test_set(train, test, k=5, distance_metric = euc_dist):
    correct = 0
    for index, row in test.iterrows():
        # pass the caller's metric through instead of always using euc_dist
        prediction = predict(train, row.iloc[:-1], k, distance_metric = distance_metric)
        # the last column of each row holds the true class label
        if (prediction[-1] == row.iloc[-1]):
            correct += 1
    print("Congrats, you have {:.2%} accuracy!\n\n".format(correct/len(test)))
    
    return correct/len(test)

def predict(train, test_vec, k=5, distance_metric = euc_dist):
    distance = []

    # compute distance metric for every instance of the training set
    for index, row in train.iterrows():
        dist = distance_metric([row[:-1],test_vec])
        distance.append([row,dist])
    distance.sort(key = lambda x: x[1])
    
    # now look at the k-nearest-neighbors and have them vote on the label!
    pred = vote(distance, k)
    
    return (test_vec,pred[0])


def vote(distance, k):
    knn = distance[:k]
    # count the class labels among the k nearest rows, most frequent first
    occurrences = Counter(row[0]['class'] for row in knn).most_common()
    
    # if we have a tie we want to reduce k until we break the tie!
    if ((len(occurrences) > 1) and (occurrences[0][1] == occurrences[1][1])):
        print('tie!\n')
        answer = vote(distance, k-1)
    else:
        answer = occurrences[0]
    
    return answer
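
To see the tie-breaking rule in isolation, here is a small standalone sketch that works on a plain list of labels already sorted nearest-first. The helper `vote_labels` is hypothetical; unlike `vote` above it returns only the winning label rather than a (label, count) pair, but it shrinks k in the same way.

```python
from collections import Counter

def vote_labels(sorted_labels, k):
    """Majority vote among the first k labels; shrink k on a tie for first place."""
    counts = Counter(sorted_labels[:k]).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # tie for first place: drop the farthest neighbor and vote again
        return vote_labels(sorted_labels, k - 1)
    return counts[0][0]

labels = [1, 2, 2, 1, 0]       # toy labels, nearest neighbor first
print(vote_labels(labels, 4))  # 2-2 tie at k=4, resolved at k=3 -> 2
print(vote_labels(labels, 5))  # still tied at k=5 and k=4, resolved at k=3 -> 2
```

The recursion always terminates, because at k = 1 only one label is left and no tie is possible.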

In [132]:
# here I will be using sklearn but only to grab the data and split it into a train and test set. 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_data = load_iris()
Y_iris = iris_data.target
iris_data = pd.DataFrame(iris_data.data,
                            columns = iris_data.feature_names)
iris_data = pd.concat([iris_data,pd.Series(Y_iris)],axis=1)
iris_data.rename(columns = {0:'class'},inplace=True)
train, test = train_test_split(iris_data, test_size = 0.2, shuffle = True)
train.head()

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
72                 6.3               2.5                4.9               1.5      1
87                 6.3               2.3                4.4               1.3      1
104                6.5               3.0                5.8               2.2      2
34                 4.9               3.1                1.5               0.2      0
40                 5.0               3.5                1.3               0.3      0


In [133]:
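# now let's test different values of k for some parameter tuning
# odd k makes ties less likely, but with three classes they can still happen
# (e.g. 1-1-1 at k=3); vote() handles those by shrinking k
k_list = [3,5,7,9,11,13,15]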

for k in k_list:
    print('Testing k = {}...\n'.format(k))
    test_set(train, test, k)

Testing k = 3...

Congrats, you have 96.67% accuracy!


Testing k = 5...

Congrats, you have 96.67% accuracy!


Testing k = 7...

Congrats, you have 100.00% accuracy!


Testing k = 9...

Congrats, you have 100.00% accuracy!


Testing k = 11...

Congrats, you have 100.00% accuracy!


Testing k = 13...

Congrats, you have 100.00% accuracy!


Testing k = 15...

Congrats, you have 100.00% accuracy!




In [None]:
# This is pretty good. If we were being rigorous about ML we would perform some cross-validation, but we are really
# just demonstrating the method here. I would like to see how these results differ with different metrics!
# Let's test out 
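
The cross-validation mentioned above could look roughly like the sketch below. It is a self-contained illustration and not the notebook's own code: `knn_predict` and `k_fold_accuracy` are hypothetical helpers, distances are computed with NumPy arrays rather than DataFrame rows, ties go to whichever label `Counter.most_common` lists first (not the shrinking-k rule), and the data is a seeded synthetic two-cluster toy set rather than iris.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance from x to every training row, then majority vote
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

def k_fold_accuracy(X, y, k=5, n_folds=5, seed=0):
    # shuffle the row indices once and cut them into n_folds disjoint folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # every fold except the i-th one is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = [knn_predict(X[train_idx], y[train_idx], x, k) for x in X[test_idx]]
        scores.append(np.mean(np.array(preds) == y[test_idx]))
    return float(np.mean(scores))

# synthetic toy data (an assumption for illustration): two Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

acc = k_fold_accuracy(X, y, k=5, n_folds=5)
print("mean accuracy over 5 folds: {:.2%}".format(acc))
```

Averaging over folds gives a steadier estimate than a single split, since every row is used for testing exactly once.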