# Implementing the K-nearest neighbors classification algorithm

## Preparing the algorithm

The K-nearest neighbors (KNN) classifier is a fairly straightforward algorithm. It memorizes the training set, including the class each point belongs to. To predict the class of a new datapoint, it finds the K training points closest to it under a distance metric, e.g. Euclidean distance. It then counts how often each class appears among those K neighbors and assigns the datapoint to the class with the most appearances.
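The description above can be sketched for a single query point using only the standard library. The function name `knn_vote` and the toy points are made up for illustration and are not part of the notebook's implementation, which follows further down.

```python
import math
from collections import Counter

def knn_vote(train_points, train_labels, query, k):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every memorized training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Count how often each class appears among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (illustrative only): two well-separated clusters with labels 0 and 1
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_vote(train_points, train_labels, (0.5, 0.5), k=3)
near_far_cluster = knn_vote(train_points, train_labels, (5.5, 5.5), k=3)
print(near_origin, near_far_cluster)
```

Each query lands in the middle of one cluster, so its three nearest neighbors all carry that cluster's label.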

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sl
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In the case of the K-nearest neighbors algorithm, fitting is trivial: training simply means storing the X and y sets so that they are available at prediction time.

In [8]:
def fit(X, y):
    # Store the training set in module-level variables that predict() reads later
    global X_f, y_f
    X_f = X
    y_f = y

Now we construct the function that lies at the core of the algorithm. This function will take the datapoints to be classified and, based on the KNN algorithm, will assign a class to each one of them.

In [None]:
def predict(X, neighbors):
    global cl_p
    # Integer array, so the predicted classes are not silently cast to float
    cl_p = np.array([], dtype=int)
    for i in range(X.shape[0]):
        # Euclidean distance from test point i to every point of the training set
        a = X_f - X[i]
        length_vector = np.linalg.norm(a, axis=1)
        # Indices of the k nearest training points (kth = neighbors - 1 also works when k equals the training set size)
        partition = np.argpartition(length_vector, neighbors - 1)
        y_nn = y_f[partition[:neighbors]]
        # Majority vote; np.bincount requires non-negative integer labels, and ties go to the smaller label
        cl_p = np.append(cl_p, np.argmax(np.bincount(y_nn)))
    return cl_p
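As a possible alternative to module-level globals, the same fit and predict logic can be kept on an object, so that several models can exist side by side. This is only a sketch: the class name `SimpleKNN` and the toy arrays are hypothetical, and it assumes non-negative integer class labels, as `np.bincount` requires.

```python
import numpy as np

class SimpleKNN:  # hypothetical name, not part of the original notebook
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" only memorizes the data on the instance
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        k = self.n_neighbors
        preds = np.empty(X.shape[0], dtype=self.y_.dtype)
        for i, x in enumerate(X):
            # Euclidean distance from x to every training point
            dist = np.linalg.norm(self.X_ - x, axis=1)
            # Indices of the k smallest distances (unordered among themselves)
            nearest = np.argpartition(dist, k - 1)[:k]
            # Majority vote among the neighbors' labels
            preds[i] = np.argmax(np.bincount(self.y_[nearest]))
        return preds

# Toy data (illustrative only)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = SimpleKNN(n_neighbors=3).fit(X_toy, y_toy).predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(toy_pred)
```

Returning `self` from `fit` allows the chained call shown in the usage lines, mirroring the convention scikit-learn estimators follow.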

## Testing the algorithm and comparing to the sklearn version

We load a simple binary (yes/no) dataset to test our algorithm. We will also compare the results to the KNN implementation in scikit-learn.

In [10]:
data = pd.read_csv('Social_Network_Ads.csv')
y=data.Purchased.values

We need a LabelEncoder because the dataset has one categorical column, Gender, which must be converted to integers before distances can be computed.

In [11]:
from sklearn.preprocessing import LabelEncoder

LE=LabelEncoder()

data['Gender'] = LE.fit_transform(data['Gender'])
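To see what the encoder does, here is a small sketch on a made-up column. The values "Male" and "Female" are an assumption about what the Gender column contains; the real file may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (illustrative only); the category values are assumed
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

encoder = LabelEncoder()
toy["Gender_code"] = encoder.fit_transform(toy["Gender"])

# classes_ holds the original labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Because the codes are plain integers, the encoded column enters the distance computation like any other numeric feature.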

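Prepare the dataset for the algorithm. The features are every column except the first and the last, and 10% of the rows are held out for validation: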

In [None]:
X = data.iloc[:,1:-1].values
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.1, random_state = 0)

We now fit and proceed to predict the classes of our validation set. We will use the mean absolute error as the error metric on this dataset:

In [13]:
fit(X_train, y_train)

In [14]:
pr= predict(X_valid, 10 )

In [15]:
mean_absolute_error(y_valid, pr)

0.1
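Assuming the labels are coded as 0 and 1, each absolute error is either 0 or 1, so the mean absolute error is simply the fraction of misclassified points. The arrays below are made up to illustrate this and are not the notebook's actual predictions.

```python
import numpy as np

# Made-up labels for illustration: one mistake out of ten
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1])

# Mean absolute error, computed by hand
mae = np.mean(np.abs(y_true - y_pred))
# Fraction of points where the prediction differs from the truth
error_rate = np.mean(y_true != y_pred)
accuracy = 1.0 - error_rate
print(mae, error_rate, accuracy)
```

Read this way, an error of 0.1 corresponds to 90% accuracy. With more than two classes the equivalence breaks down, because the absolute difference between two class codes can exceed 1.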

We import the KNeighborsClassifier from sklearn and proceed to measure its error on the same dataset:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN= KNeighborsClassifier(n_neighbors = 10)
KNN.fit(X_train, y_train)

In [17]:
mean_absolute_error(y_valid, KNN.predict(X_valid))

0.1
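Equal error rates do not by themselves show that two classifiers make the same predictions, so an element-wise comparison can be a useful extra check. The sketch below does this on synthetic data rather than the notebook's dataset; the helper `knn_predict` and all variable names are hypothetical. It uses an odd number of neighbors so that a two-class vote cannot tie.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic two-class data (illustrative only): two Gaussian blobs
X_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_b = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
X_syn = np.vstack([X_a, X_b])
y_syn = np.array([0] * 100 + [1] * 100)
X_query = rng.normal(loc=1.0, scale=1.5, size=(50, 2))

k = 5  # odd, so a binary majority vote cannot tie

def knn_predict(X_train, y_train, X_new, k):
    """Hand-written KNN vote, same logic as the predict function above."""
    preds = []
    for x in X_new:
        dist = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argpartition(dist, k - 1)[:k]
        preds.append(np.argmax(np.bincount(y_train[nearest])))
    return np.array(preds)

ours = knn_predict(X_syn, y_syn, X_query, k)
theirs = KNeighborsClassifier(n_neighbors=k).fit(X_syn, y_syn).predict(X_query)

# Fraction of query points on which both implementations agree
agreement = np.mean(ours == theirs)
print(agreement)
```

An agreement close to 1 would suggest the two implementations select the same neighbors; any disagreement would most likely come from how distance ties are broken.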

We see that our algorithm performs equally well on this dataset! In the future it would be worth checking its accuracy on a dataset containing more than two classes. This will be added soon.