# K Nearest Neighbour

## The KNN Algorithm
1) Load the data

2) Initialize K to your chosen number of neighbors

3) For each example in the data

    1) Calculate the distance between the query example and the current example from the data.
    2) Add the distance and the index of the example to an ordered collection

4) Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances

5) Pick the first K entries from the sorted collection

6) Get the labels of the selected K entries

7) If regression, return the mean of the K labels

8) If classification, return the mode of the K labels
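
The steps above can be sketched as one small function. This is a minimal illustration rather than the implementation used later in this notebook: the function name `knn_predict`, its `task` parameter, and the toy data are all assumptions made for the example, and Euclidean distance is assumed as the distance measure.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(points, labels, query, k, task="classification"):
    """Predict a label for `query` from its k nearest neighbours."""
    # Step 3: distance from the query to every example, paired with its index
    distances = [(dist(p, query), i) for i, p in enumerate(points)]
    # Step 4: sort in ascending order of distance
    distances.sort()
    # Steps 5-6: take the first k entries and look up their labels
    k_labels = [labels[i] for _, i in distances[:k]]
    if task == "regression":
        # Step 7: mean of the k labels
        return sum(k_labels) / len(k_labels)
    # Step 8: mode (most common value) of the k labels
    return Counter(k_labels).most_common(1)[0][0]


# Toy data, invented for illustration only
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["A", "A", "A", "B", "B"]
values = [1.0, 2.0, 3.0, 10.0, 11.0]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, values, (1.1, 1.0), k=3, task="regression"))
```

The three points closest to the query all carry the label `A`, so the classification call returns `A`, and the regression call averages their values 1.0, 2.0 and 3.0.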

## Implementation

In [1]:
"K-Nearest Neighbour"

import pandas as pd             #The Pandas Library for data analysis
from statistics import mode     #The statistics library for calculating modes


In [2]:
train=pd.read_csv('trainingdata.csv')      # read the csv file of training data
train.head()                               #Displays the first 5 lines of train

Unnamed: 0,rWC,rCh,Atom,Type
0,0.78,0.5,B,PT
1,0.9,0.67,Si,PT
2,0.97,0.65,Ga,PT
3,1.04,0.76,Al,PT
4,1.1,0.79,Ir,PT


We train on the rWC and rCh values and classify the atoms according to their Type.
For x and y we therefore choose the first and second columns, whose zero-based Python indices are 0 and 1.
The class labels come from the fourth column, whose index is 3.
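
As a small check of the column positions, the sketch below selects the same columns by position and by name. It assumes the file's columns are rWC, rCh, Atom and Type in that order, with the leading unnamed column in the preview being the row index; the three rows are copied from the `train.head()` output, and the variable names are chosen only for this example.

```python
import io
import pandas as pd

# The first rows shown by train.head(), inlined so the sketch is self-contained.
# Assumption: the CSV holds exactly these four columns in this order.
csv_text = """rWC,rCh,Atom,Type
0.78,0.5,B,PT
0.9,0.67,Si,PT
0.97,0.65,Ga,PT
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Selecting by position and by name should give the same columns
x_by_position = sample.iloc[:, 0]
x_by_name = sample["rWC"]
c_by_position = sample.iloc[:, 3]
c_by_name = sample["Type"]

print(x_by_position.equals(x_by_name))
print(c_by_position.equals(c_by_name))
```

Selecting by name is less fragile if the column order of the file ever changes, while `iloc` keeps the code independent of the column names.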

In [3]:
# We select the training columns by position using iloc

x=train.iloc[:,0]            # the colon in the row position selects all rows
y=train.iloc[:,1]
c=train.iloc[:,3]            # class labels (Type)


In [4]:
# The training data is now ready, so we import the testing data, which is arranged in the same way

test=pd.read_csv('testdata.csv')
xtest=test.iloc[:,0]
ytest=test.iloc[:,1]

In [5]:
# Here we implement the K-nearest neighbour algorithm. For initial testing we choose k=5, but its value may be changed depending on the data


k=5
success=0          #This is just a counter of successful testing

for j in range(len(xtest)):        # This loop runs for number of classifications we have to make (Rows of Testing data)
    dist=[[0]*len(x), [0]*len(x)]  # we reset the matrix dist every time we classify new point
    npx=xtest[j]                   # x-value of current point
    npy=ytest[j]                   # y-value of current point
    
    
    for i in range(len(x)):       # This loop calculates the distance of the current point from each point of the training data, starting at row 0 so that no training point is skipped.
        dist[0][i]=(npx-float(x[i]))**2+(npy-float(y[i]))**2     # Squared Euclidean distance (the square root is omitted because it does not change the ordering)
        dist[1][i]=str(c.iloc[i])                                # The atom type is stored alongside the calculated distance
    
    
    list1, list2 = zip(*sorted(zip(dist[0], dist[1])))           # sorts the two sublists according to the ascending order of the values in dist[0] list
    klist=list2[:k]                                              # stores the first k classifications of the lists arranged in ascending order
    npc=mode(klist)                                              # calculates the mode of the k classifications
    print('Atom: ',test.iloc[j,2],'Classified as: ',npc)
    print('The original classification is: ',test.iloc[j,3])
    if (npc==test.iloc[j,3]):
        print('Successfull Prediction')
        success=success+1
    else:
        print('Failed Prediction')
Accuracy=success/len(xtest)
print('Thus Accuracy of our model is :',Accuracy*100,'%')
if (Accuracy<0.9):
    print('We need more training data to train our model accurately')
else:
    print('Our Model is Accurate enough!')

Atom:  X1 Classified as:  Alk
The original classification is:  Alk
Successfull Prediction
Atom:  X2 Classified as:  TM
The original classification is:  TM
Successfull Prediction
Atom:  X3 Classified as:  TM
The original classification is:  PT
Failed Prediction
Atom:  X4 Classified as:  TM
The original classification is:  TM
Successfull Prediction
Atom:  X5 Classified as:  Alk
The original classification is:  Alk
Successfull Prediction
Thus Accuracy of our model is : 80.0 %
We need more training data to train our model accurately
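
The value k=5 above was chosen only for initial testing. One simple way to compare candidate values of k without a separate test file is leave-one-out evaluation: classify each training point using all the others and count the hits. The sketch below is an illustration under stated assumptions, not part of the original notebook: the helper names `predict` and `leave_one_out_accuracy` and the toy points are invented for the example.

```python
from collections import Counter


def predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), i)  # squared distance, index
        for i, p in enumerate(train_points)
    )
    votes = Counter(train_labels[i] for _, i in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Classify each point using all the others and return the fraction correct."""
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(rest_points, rest_labels, points[i], k) == labels[i]:
            hits += 1
    return hits / len(points)


# Invented toy data: two well-separated groups of three points each
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 3.2), (3.2, 3.1)]
labels = ["PT", "PT", "PT", "TM", "TM", "TM"]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(points, labels, k))
```

On this toy data, k=1 and k=3 classify every point correctly, while k=5 fails on every point, because the five remaining neighbours always contain three points of the other group. A k that is large relative to the class sizes can therefore hurt accuracy.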


## Example

Here we test our algorithm on a dataset to predict whether a student will pass or fail, based on the number of hours the student spends on self-study and the number of hours the student spends in tuition.

In [6]:
data=pd.read_csv('Student-Pass-Fail-Data.csv')
data.head()

Unnamed: 0,Self_Study_Daily,Tution_Monthly,Pass_Or_Fail
0,7,27,1
1,2,43,0
2,7,26,1
3,8,29,1
4,3,42,0


The data has been imported, and we first need to separate it into training and test sets.
For this purpose we use the train_test_split function from sklearn.model_selection.
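
To make clear what such a split does, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation: the helper `simple_train_test_split` is hypothetical, and its rounding of the test-set size is an assumption that may differ from the library. The five rows are the ones shown by `data.head()`.

```python
import random


def simple_train_test_split(rows, labels, test_size=0.33, seed=42):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(rows) * test_size))    # assumed rounding rule
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test


# (Self_Study_Daily, Tution_Monthly) pairs and Pass_Or_Fail labels from data.head()
rows = [(7, 27), (2, 43), (7, 26), (8, 29), (3, 42)]
labels = [1, 0, 1, 1, 0]

rows_train, rows_test, labels_train, labels_test = simple_train_test_split(rows, labels)
print(len(rows_train), len(rows_test))
```

Fixing the seed plays the same role as `random_state` in the library call: the same rows land in the test set on every run, so accuracy figures stay comparable.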

In [7]:
x=data.iloc[:,0:2]
c=data.iloc[:,2]
from sklearn.model_selection import train_test_split
xtrain1, xtest1, ctrain, ctest = train_test_split(x, c, test_size=0.33, random_state=42)

In [51]:
xtrain=xtrain1.iloc[:,0]
ytrain=xtrain1.iloc[:,1]
xtest=xtest1.iloc[:,0]
ytest=xtest1.iloc[:,1]

In [59]:
k=15
success=0          #This is just a counter of successful testing

for j in range(len(xtest)):        # This loop runs for number of classifications we have to make (Rows of Testing data)
    dist=[[0]*len(xtrain), [0]*len(xtrain)]  # we reset the matrix dist every time we classify new point
    npx=xtest.iloc[j]                   # x-value of current point
    npy=ytest.iloc[j]                   # y-value of current point
    
    
    for i in range(len(xtrain)):     # This loop calculates the distance of the current point from each point of the training data.
        dist[0][i]=(float(npx)-float(xtrain.iloc[i]))**2+(float(npy)-float(ytrain.iloc[i]))**2     # Squared Euclidean distance (the square root is omitted because it does not change the ordering)
        dist[1][i]=str(ctrain.iloc[i])                                # The pass/fail label is stored alongside the calculated distance
    
    
    list1, list2 = zip(*sorted(zip(dist[0], dist[1])))           # sorts the two sublists according to the ascending order of the values in dist[0] list
    klist=list2[:k]                                              # stores the first k classifications of the lists arranged in ascending order
    npc=mode(klist)                                              # calculates the mode of the k classifications
    if (int(npc)==int(ctest.iloc[j])):
        #print('Successful Prediction')
        success=success+1
Accuracy=success/len(xtest)
print('The Accuracy of our model is :',Accuracy*100,'%')
if (Accuracy<0.9):
    print('We need more training data to train our model accurately')
else:
    print('Our Model is Accurate enough!')

The Accuracy of our model is : 97.87878787878788 %
Our Model is Accurate enough!
