# K-Nearest Neighbors (KNN)

KNN is a simple algorithm that can be used for both regression and classification tasks.  
The basic principle behind KNN is to calculate the 'Distance' between the data instance whose class has to be predicted and every data point in the training data.  
The main parameter that goes into the algorithm, apart from the choice of distance metric, is the value of 'K', which is the number of nearest neighbors of the data instance that the algorithm considers when making a prediction.

## KNN Algorithm

1. Memorize all of the training data.
2. Take the data instance for prediction.
3. Calculate the 'Distance' between the data instance and every data point in the training data.
4. Find the 'K' nearest neighbors.
5. Make the prediction by aggregating the neighbors' labels or values: generally a majority vote for classification, or an average for regression.
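
The five steps can be sketched compactly with NumPy. This is only an illustrative sketch: the function name `knn_predict` and the toy data are made up for the example, and ties in the vote are resolved in favor of the smallest label, which is an arbitrary choice.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Hypothetical helper: classify one query point by majority vote."""
    # Steps 1-2: the training data is kept as-is; `query` is the instance to classify.
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    query = np.asarray(query, dtype=float)

    # Step 3: Euclidean distance from the query to every training point.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

    # Step 4: indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]

    # Step 5: majority vote among the labels of those neighbors.
    # np.unique returns sorted labels, so argmax picks the smallest label on a tie.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (made up for illustration): two small clusters in 2-D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [0.3, 0.3], k=3))  # near the first cluster
print(knn_predict(X_toy, y_toy, [4.8, 5.0], k=3))  # near the second cluster
```

Computing all distances in one array operation avoids an explicit Python loop, at the cost of holding the full difference matrix in memory.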

### How to Calculate Distance

Various distance metrics can be used in the KNN algorithm, but the most commonly used one is the Euclidean distance (the Minkowski distance with p=2).
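
To make the "Minkowski, p=2" remark concrete, here is a minimal sketch of the Minkowski distance, of which the Euclidean (p=2) and Manhattan (p=1) distances are special cases. The function name `minkowski_distance` is an illustrative choice, not part of any library used in this notebook.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Hypothetical helper: Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the p-th powers of the absolute coordinate differences, then take the p-th root.
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

print(minkowski_distance([0, 0], [3, 4], p=1))  # Manhattan distance: 7.0
print(minkowski_distance([0, 0], [3, 4], p=2))  # Euclidean distance: 5.0
```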

The Euclidean distance in 2-D is the ordinary straight-line distance between two points in the Cartesian plane.

![Euc dist](https://www.tutorialexample.com/wp-content/uploads/2020/05/Euclidean-distance-in-tensorflow.png)

The Euclidean distance is easy to visualize in 2-D but harder to picture in a real-world problem where the dataset has hundreds of features. To calculate the distance in a feature space with $h$ features, we can generalize the equation shown above as:

$$Euclidean~Distance(d) = \left( \sum_{i=1}^h (x_{2i}  - x_{1i})^2\right)^{0.5}$$

Where: 

$$x_1 = \left[ x_{11}, x_{12}, x_{13}, \dots,x_{1h} \right]  ~~\&~~ x_2 = \left[ x_{21}, x_{22}, x_{23}, \dots,x_{2h} \right]$$
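
As a quick worked example of the formula, take two points with four features each, $x_1 = [5.6, 3.0, 4.1, 1.3]$ and $x_2 = [6.9, 3.1, 5.1, 2.3]$:

$$d = \left( (6.9 - 5.6)^2 + (3.1 - 3.0)^2 + (5.1 - 4.1)^2 + (2.3 - 1.3)^2 \right)^{0.5} = \left( 1.69 + 0.01 + 1.00 + 1.00 \right)^{0.5} = 3.70^{0.5} \approx 1.92$$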

# Implementation of KNN Classification

In this implementation, KNN is modeled as a classification algorithm. The prediction is calculated by a simple majority vote among the nearest neighbors.

In [125]:
#First we need a sample dataset on which we can test our algorithm
#We can use the iris dataset that is built into the sklearn library
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

#Splitting the dataset into test and training sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)

#looking at the generated Data
X_train[:5,:] #First Five rows

array([[6.9, 3.1, 5.1, 2.3],
       [5.6, 3. , 4.1, 1.3],
       [4.9, 3.6, 1.4, 0.1],
       [5. , 3.5, 1.3, 0.3],
       [5. , 3. , 1.6, 0.2]])

In [126]:
#Creating the KNN Algorithm
import numpy as np

class MyKNNClassification:

  def __init__(self, k=3):
    
    self.k = k #Initialize the variable to hold the value of k

  def fit(self, X, y):

    self.X = X #Store the training data on memory
    self.y = y

  def _euclidean_distance(self, a, b): #Method to calculate the euclidean distance
    
    total = 0 #Named total to avoid shadowing the built-in sum()
    for i in range(len(a)):
      total += (b[i] - a[i])**2

    return total**0.5

  def _k_neighbors(self, pt): #Method to find k nearest neighbors
    
    distances = []
    for i in range(self.X.shape[0]):
      dist = self._euclidean_distance(self.X[i], pt)
      distances.append((i, dist, self.y[i]))

    #Sorting the distances once, after all of them have been collected
    distances.sort(key = lambda q: q[1], reverse=False)

    return distances[0:self.k]

  def predict(self, pt):

    k_neighbors = self._k_neighbors(pt)
    vote_counts = {}
    for neighbor in k_neighbors: #Counting votes of k neighbors
      response = neighbor[2]
      vote_counts[response] = vote_counts.get(response, 0) + 1

    #Sort the votes in descending order
    #sorted() is stable, so on a tie the class of the nearer neighbor wins
    sorted_votes = sorted(vote_counts.items(),key=lambda q: q[1], reverse=True)

    return sorted_votes[0][0] #return the majority vote


In [127]:
#Now, we can test our KNN Algorithm
model = MyKNNClassification(k = 5) #Creating an instance
model.fit(X_train,y_train) #Fitting the model

#Let's try to make a prediction
print('The Predicted class of the data point is : {}'.format(model.predict(X_test[42])))

The Predicted class of the data point is : 0


In [128]:
#We can also calculate the accuracy of model

#Helper function to calculate accuracy
def accuracy(y_true,y_pred):
  #Fraction of predictions that match the true labels
  return np.mean(y_true == y_pred)

#Getting predictions
y_preds = []
for x in X_test:
  y_preds.append(model.predict(x))
y_preds = np.array(y_preds)

#Evaluating Performance on Test set
acc = accuracy(y_test,y_preds)
print('The Accuracy gained by our Algorithm on Test Set = {:.2f}%'.format(acc*100))

The Accuracy gained by our Algorithm on Test Set = 97.78%
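
As a sanity check, the result can be compared against scikit-learn's own `KNeighborsClassifier`, which by default also uses the Minkowski metric with p=2. The sketch below assumes the same split (`test_size=0.3`, `random_state=24`) and `k = 5`. The variable names `reference` and `reference_acc` are illustrative, and the two implementations may disagree on individual points because they can break ties differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the same train/test split used above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)

# Reference model with 5 neighbors and the default Euclidean metric.
reference = KNeighborsClassifier(n_neighbors=5)
reference.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data.
reference_acc = reference.score(X_test, y_test)
print('Reference accuracy on the test set = {:.2f}%'.format(reference_acc * 100))
```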


In [130]:
#Our model reached 97.78% accuracy on the test set, let's look at 10 random predictions

#Sample from the actual number of test rows instead of a hard-coded 45
choices = np.random.choice(X_test.shape[0], size=10, replace=False)
batch_test = X_test[choices, :]
y_actual = y_test[choices].reshape(-1,1)

#Extracting predictions
y_preds = []

for x in batch_test:
  y_preds.append(model.predict(x))
y_preds = np.array(y_preds)

#Making a dataframe of actual and predicted results
y_preds = y_preds.reshape(-1,1)
import pandas as pd
df = pd.DataFrame(np.concatenate((y_actual,y_preds), axis = 1), columns = ['Actual', 'Predicted'])
df['Remark'] = 'Incorrect Prediction'
df.loc[df['Actual'] == df['Predicted'], 'Remark'] = 'Correct Prediction'
df

| | Actual | Predicted | Remark |
|---|---|---|---|
| 0 | 2 | 2 | Correct Prediction |
| 1 | 0 | 0 | Correct Prediction |
| 2 | 0 | 0 | Correct Prediction |
| 3 | 1 | 1 | Correct Prediction |
| 4 | 1 | 1 | Correct Prediction |
| 5 | 0 | 0 | Correct Prediction |
| 6 | 1 | 1 | Correct Prediction |
| 7 | 2 | 2 | Correct Prediction |
| 8 | 2 | 2 | Correct Prediction |
| 9 | 0 | 0 | Correct Prediction |
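
The same neighbor search also covers the regression case mentioned at the start: instead of voting, the prediction is the mean of the K nearest target values. The following is a minimal sketch under that assumption; `knn_regress` and the toy data are hypothetical and not part of the class above.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    """Hypothetical helper: predict a value as the mean of the k nearest targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    query = np.asarray(query, dtype=float)

    # Euclidean distance from the query to every training point.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

    # Average the targets of the k closest points.
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(y_train[nearest].mean())

# Toy data (made up for illustration): one feature, one numeric target.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

print(knn_regress(X_toy, y_toy, [2.1], k=3))  # mean of 10, 20 and 30
```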
