# KNN

The k-nearest neighbors (KNN) classifier is given by

$g(x) = \operatorname{mode}_{i \in N_x} y_i$,

where $N_x$ is the set of the $k$ observations nearest to $x$.

The first step of the algorithm calculates the distance from the point $x$ to every observation in the training set. I use the Euclidean distance to find the observations nearest to the point of interest $x$.
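For reference, the Euclidean distance between the point $x$ and a training observation $x_i$ can be written as follows, assuming $p$ features indexed by $j$ (the symbols $d$, $p$, and $j$ are notation introduced here for illustration):

$$ d(x, x_i) = \sqrt{\sum_{j=1}^{p} (x_j - x_{ij})^2} $$

For the iris data loaded further down, $p = 4$ (the columns X1 to X4).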

The second step selects the $k$ nearest neighbors, that is, the $k$ observations with the smallest distances.

The last step calculates the mode of the labels of those $k$ neighbors.
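The three steps above can be sketched compactly in plain Python. This is a minimal illustration, not the implementation built in the rest of the notebook: the function name `knn_classify` and the toy data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, x, k):
    """Classify x by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x to every training observation
    scored = [(math.dist(p, x), label)
              for p, label in zip(train_points, train_labels)]
    # Step 2: keep the k smallest distances
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    # Step 3: most frequent label among the neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data, made up for illustration
toy_points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 4.9]]
toy_labels = [1, 1, 1, 2, 2]
print(knn_classify(toy_points, toy_labels, [0.05, 0.05], k=3))
```

The query point sits next to the three observations labeled 1, so the vote returns 1.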

Reference
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/


In [None]:
import pandas as pd
import numpy as np

## Loading the Iris dataset

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv', names = ('X1', 'X2', 'X3', 'X4', 'y'))

In [None]:
data.head(5)

| | X1 | X2 | X3 | X4 | y |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [None]:
data['y'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [None]:
target = {'y': {"Iris-setosa": 1, 'Iris-versicolor': 2, 'Iris-virginica':3}}

In [None]:
data = data.replace(target)
data['y'].unique()

array([1, 2, 3])

In [None]:
X = data.iloc[:,0:4].values
y = data['y'].values

In [None]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

Now that the data is pre-processed and split, I test the algorithm on a single observation. I call its features 'xteste' and its label 'yteste'.

In [None]:
xteste = X_test[0]
yteste = y_test[0]

First, I calculate the distance from 'xteste' to each point in the training data (X_train) and save each (label, distance) pair in a list called 'distances'.

In [None]:
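distances = list()
for row, label in zip(X_train, y_train):
  # Sum of squared feature differences between the training row and xteste
  squared_sum = 0.0
  for i in range(len(row)):
    squared_sum += (row[i] - xteste[i])**2
  # Store the (label, Euclidean distance) pair
  distances.append((label, np.sqrt(squared_sum)))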



Now I sort the list in ascending order of distance.

In [None]:
def sortv(tup):
  # Sort key: the distance stored in each (label, distance) pair
  return tup[1]

In [None]:
distances.sort(key = sortv)

The next step is to get the nearest neighbors (the lowest distances). For this test, I take the seven lowest distances.

In [None]:
neighbors = distances[0:7]

In [None]:
neighbors

[(1, 0.14142135623730948),
 (1, 0.244948974278318),
 (1, 0.29999999999999954),
 (1, 0.2999999999999996),
 (1, 0.3605551275463989),
 (1, 0.424264068711928),
 (1, 0.424264068711928)]
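Sorting the whole list is fine for a dataset this small. As an alternative sketch, the standard-library function `heapq.nsmallest` can pick the k smallest distances without a full sort. The helper name `k_nearest` and the toy pairs below are hypothetical, chosen only for illustration.

```python
import heapq

def k_nearest(pairs, k):
    """Return the k (label, distance) pairs with the smallest distance, nearest first."""
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1])

# Toy (label, distance) pairs, made up for illustration
toy_pairs = [(1, 0.5), (2, 0.1), (1, 0.3), (2, 0.9)]
print(k_nearest(toy_pairs, 2))
```

The result has the same layout as the 'neighbors' list above, so the voting step would not need to change.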

Finally, I calculate the mode of the labels of the seven neighbors.

In [None]:
out = [row[0] for row in neighbors]
max(set(out), key=out.count)

1
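One caveat with `max(set(out), key=out.count)`: when two labels are tied, the winner depends on the iteration order of the set. A possible alternative, sketched here under the assumption that the labels are listed from nearest to farthest, uses `collections.Counter`, whose `most_common` keeps tied elements in the order first encountered (Python 3.7 and later), so a tie goes to the label of the nearest neighbor. The function name `vote` is hypothetical.

```python
from collections import Counter

def vote(labels_nearest_first):
    """Return the most frequent label; ties go to the label seen first (the nearest)."""
    return Counter(labels_nearest_first).most_common(1)[0][0]

# Labels 2 and 1 both appear twice; label 2 belongs to the nearest neighbor
print(vote([2, 1, 1, 2]))
```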

In [None]:
yteste

1

The KNN prediction is correct. Next, I wrap the code in three functions: 'distance' calculates the Euclidean distance from one point in the test set to every point in the training set; 'neighbors' gets the k nearest neighbors and predicts the label of that point; and 'predictKNN' applies the previous two functions to the entire test set.

In [None]:
def distance(xtrain, ytrain, rowtest):
  # Return a (label, Euclidean distance) pair for every training row
  distances = list()
  for row, label in zip(xtrain, ytrain):
    squared_sum = 0.0
    for i in range(len(row)):
      squared_sum += (row[i] - rowtest[i])**2
    distances.append((label, np.sqrt(squared_sum)))
  return distances



In [None]:
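def sortv(tup):
  # Sort key: the distance stored in each (label, distance) pair
  return tup[1]

def neighbors(xtrain, ytrain, rowtest, k):
  # Distances from rowtest to every training point, in ascending order
  distances = distance(xtrain, ytrain, rowtest)
  distances.sort(key=sortv)
  # Labels of the k nearest points
  out = [row[0] for row in distances[0:k]]
  # Mode of those labels
  return max(set(out), key=out.count)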

In [None]:
def predictKNN(xtrain, ytrain, xtest, k):
  # Predict a label for every row of the test set
  predicts = list()
  for rowtest in xtest:
    predicts.append(neighbors(xtrain, ytrain, rowtest, k))
  return predicts
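The loops above are easy to follow but run in pure Python. As a rough sketch of the same idea with NumPy broadcasting, all pairwise distances can be computed at once. The function name `predict_knn_vectorized` and the toy data are hypothetical, and the tie-breaking differs from the functions above: `np.unique` sorts the labels, so a tie goes to the smallest label.

```python
import numpy as np

def predict_knn_vectorized(X_train, y_train, X_test, k):
    """Predict a label for each test row by majority vote among k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Euclidean distances: shape (n_test, n_train)
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k nearest training rows for each test row
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote per test row
    preds = []
    for labels in neighbor_labels:
        values, counts = np.unique(labels, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# Toy data, made up for illustration
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [1, 1, 1, 2, 2, 2]
print(predict_knn_vectorized(X_toy, y_toy, [[0.2, 0.2], [5.5, 5.5]], k=3))
```

The intermediate array holds one entry per (test row, training row, feature) triple, which is negligible for iris but can become large for bigger datasets.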

I test the functions on the entire test set with $k = 5$.

In [None]:
y_pred = predictKNN(X_train, y_train, X_test, 5)

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f} %".format(accuracy*100))

Accuracy: 100.00 %
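With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. A minimal hand-written equivalent, shown only to make the metric explicit (the function name `accuracy_fraction` and the toy labels are hypothetical), could look like this:

```python
def accuracy_fraction(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels, made up for illustration: three of four predictions match
print("Accuracy: {:.2f} %".format(accuracy_fraction([1, 2, 3, 3], [1, 2, 3, 1]) * 100))
```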


I compare the result with the implementation available in scikit-learn.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f} %".format(accuracy*100))

Accuracy: 100.00 %
