# K-Nearest Neighbors

The k-Nearest Neighbors algorithm, or KNN for short, is a very simple technique.
The entire training dataset is stored. When a prediction is required, the k records in the training dataset that are most similar to the new record are located, and a summarized prediction is made from these neighbors.
The summary can be the most common outcome among the neighbors (classification) or the average of their outcomes (regression). This walkthrough uses KNN for classification.
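
To make the two kinds of summary concrete, here is a minimal sketch using Python's standard `statistics` module. The function names and the sample neighbor outcomes are hypothetical and only illustrate the idea.

```python
from statistics import mean, mode

def summarize_classification(neighbor_labels):
    # Most common class label among the neighbors
    return mode(neighbor_labels)

def summarize_regression(neighbor_values):
    # Average of the neighbors' numeric outcomes
    return mean(neighbor_values)

print(summarize_classification(["setosa", "versicolor", "setosa"]))  # setosa
print(summarize_regression([1.0, 2.0, 3.0]))  # 2.0
```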

KNN can be broken into three parts.

1. Calculate Euclidean Distance
2. Get Nearest Neighbors
3. Make Predictions

### Euclidean Distance
This calculates the distance between two rows in a dataset.
The formula is as follows:

$D(x^{1}, x^{2}) = \sqrt{\sum_{i=1}^{N}(x_{i}^{2}-x_{i}^{1})^{2}}$

where $x^{1}$ is the first row of data, $x^{2}$ is the second row of data (the superscripts label the rows and are not exponents), and $i$ is the index of a specific column as we sum across all $N$ columns. Below, the formula is implemented in Python.

In [41]:

from math import sqrt
 
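# Calculate the Euclidean distance between two vectors, or rows in a dataset.
# row1 must hold feature values only; row2 may carry a trailing class label,
# which is ignored because only the first len(row1) columns are compared.
def euclidean_distance(row1, row2):
    d = 0.0
    
    for i in range(len(row1)): 
        d += (row1[i] - row2[i])**2
    return sqrt(d)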
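
As a quick sanity check of the formula, the sketch below computes the distance between two hypothetical feature-only rows by hand and compares it with `math.dist` from the standard library (Python 3.8+). The helper name `feature_distance` and the sample values are illustrative and not part of the notebook.

```python
from math import dist, sqrt

def feature_distance(a, b):
    # Sum the squared differences column by column, then take the square root
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [4.0, 6.0]
print(feature_distance(a, b))  # 5.0, since sqrt(3**2 + 4**2) = 5
print(dist(a, b))              # 5.0
```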

### Nearest Neighbors
Neighbors for a new piece of data in the dataset are the k closest instances, as defined by our distance measure.
1. Calculate the distance between the new record and every record in the training dataset using the Euclidean distance
2. Sort all of the records in the training dataset by their distance to the new data
3. Keep the first k records as the nearest neighbors

In [46]:
# Locate the most similar neighbors
def neighbors(train, test_row, k):
    distance_list = list() # Initialize empty list of distances
    
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distance_list.append((train_row, dist))
        
    distance_list.sort(key=lambda tup: tup[1]) #ensures the second item of the tuple is used in the sort
    neighbor_list = list()
    for i in range(k):
        neighbor_list.append(distance_list[i][0])
    return neighbor_list
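
Sorting the full list works well for a small dataset. As an alternative sketch, `heapq.nsmallest` picks the k closest rows without sorting everything, and it simply returns fewer rows when k is larger than the training set, whereas indexing a sorted list past its end raises an `IndexError`. The function name `k_nearest` and the toy rows are hypothetical, and the sketch assumes each training row ends with its class label while the query row holds features only.

```python
import heapq
from math import dist

def k_nearest(train, query, k):
    # Compare the query against the feature columns only (row[:-1] drops the label)
    return heapq.nsmallest(k, train, key=lambda row: dist(row[:-1], query))

train = [[1.0, 1.0, 0], [1.5, 2.0, 0], [5.0, 5.0, 1], [6.0, 5.5, 1]]
query = [1.2, 1.1]

print(k_nearest(train, query, 2))  # [[1.0, 1.0, 0], [1.5, 2.0, 0]]
```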

### Making Predictions
The most similar neighbors collected from the training dataset can be used to make predictions. The most represented class among the neighbors can be found using the max() function in Python, applied to the set of neighbor labels with each label's count as the key.

In [47]:
# Make a classification prediction with neighbors
def knn(train, test_row, k):
    
    neighbor = neighbors(train, test_row, k)
    output_values = [row[-1] for row in neighbor]
    prediction = max(set(output_values), key=output_values.count) # most frequent class label among the neighbors
    
    return prediction
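
With an even k, two classes can tie. `max(set(...), key=...)` then returns whichever tied label the set happens to iterate first, which is unrelated to distance. One possible alternative, sketched below, uses `collections.Counter`: `most_common` orders equal counts by first appearance, so if the labels are passed nearest first, a tie goes to the class of the closest neighbor. The function name `vote` and the sample labels are hypothetical.

```python
from collections import Counter

def vote(neighbor_labels):
    # neighbor_labels is assumed to be ordered from nearest to farthest
    return Counter(neighbor_labels).most_common(1)[0][0]

print(vote([0.0, 1.0, 1.0]))       # 1.0 (clear majority)
print(vote([1.0, 2.0, 2.0, 1.0]))  # 1.0 (tie, resolved toward the nearest neighbor)
```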

### Utilizing Iris Data Set

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.utils import shuffle

data = pd.read_csv('iris_data.csv')
data_num = data.drop(['Species'], axis=1)
df = shuffle(data_num) # We need a shuffled dataset, not one ordered by class
iris = np.array(df)

training = iris[0:113] # 75% of the data
testing = iris[113:151] # Remaining 25%

test_data = np.delete(testing, np.s_[4], axis=1) # Remove the last column (the class label) from the testing data
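
Because the shuffle is random, the split, and therefore the accuracy, changes from run to run. A minimal standard-library sketch of a seeded shuffle-and-split follows; `split_rows`, its parameters, and the placeholder rows are hypothetical. With scikit-learn, passing `random_state` to `shuffle` serves the same purpose.

```python
import random

def split_rows(rows, train_fraction=0.75, seed=0):
    # Copy the rows so the caller's list is left untouched
    shuffled = list(rows)
    # A seeded generator produces the same order on every run
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = [[i, i % 3] for i in range(150)]  # placeholder rows: [feature, label]
train, test = split_rows(rows)
print(len(train), len(test))  # 112 38
```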

In [53]:
k = 5
counter = 0

for i in range(len(test_data)): # Loop over every test row
    testrow = test_data[i]
    label = knn(training, testrow, k) # Predict the class of this row
    if label == testing[i, 4]: # Count correctly labeled rows
        counter += 1

print('Accuracy of the model:', (counter/len(test_data))*100, '%')

Accuracy of the model: 97.2972972972973 %
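
The printed figure is consistent with 36 of 37 test rows being classified correctly. The counting loop can also be wrapped in a small helper, sketched here with a hypothetical function name and made-up labels.

```python
def accuracy(actual, predicted):
    # Percentage of positions where the predicted label equals the true label
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 75.0
```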
