# How to implement KNN using Python (From Scratch)

### We can implement a KNN model by following the steps below:

##### 1) Load the data
##### 2) Initialise the value of k
##### 3) To get the predicted class, iterate over every training data point
    3.1: Calculate the distance between the test data and each row of the training data. We use Euclidean distance here because it is the most commonly used metric; Chebyshev and cosine distance are possible alternatives.
    3.2: Sort the calculated distances in ascending order
    3.3: Take the first k rows of the sorted list (the k nearest neighbors)
    3.4: Find the most frequent class among these rows
    3.5: Return that class as the prediction
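
Step 3.1 names Euclidean, Chebyshev and cosine as possible metrics. As a minimal sketch of how the three differ, the helpers below compute each one for two plain Python lists of equal length. The names `euclidean`, `chebyshev` and `cosine_distance` are illustrative only and are not used in the walkthrough that follows.

```python
import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    # Largest absolute difference along any single coordinate
    return max(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    # 1 minus the cosine similarity; assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

print(euclidean([0.0, 0.0], [3.0, 4.0]))        # 5.0
print(chebyshev([0.0, 0.0], [3.0, 4.0]))        # 4.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Euclidean and Chebyshev distance depend on the magnitude of the coordinate differences, while cosine distance depends only on the angle between the two vectors.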

In [103]:
#Importing Libraries
import pandas as pd
import numpy as np
import math
import operator

# Start Step 1
# Load the Data
df = pd.read_csv("C:\\Users\\nilesh.s.mandge\\Documents\\Data Science\\DataSet\\iris.csv")
print(df.head())
# End of step 1


# Function that calculates the Euclidean distance between two data points
def euclideanDist(d_point1, d_point2, length):
    # Sum the squared differences over the first `length` coordinates
    distance = 0
    for x in range(length):
        distance += np.square(d_point1[x] - d_point2[x])
    # The square root of that sum is the Euclidean distance
    return np.sqrt(distance)

# Defining KNN Model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]  # number of feature columns in the test instance
    testPoint = testInstance.iloc[0].to_numpy()  # the single test row as a 1-D array

    ### Start of Step 3
    # Calculate the Euclidean distance between the test point and each row of the training data
    for x in range(len(trainingSet)):
        ### Start of step 3.1
        # Use only the first `length` columns so the class label is excluded
        trainPoint = trainingSet.iloc[x, :length].to_numpy()
        distances[x] = euclideanDist(testPoint, trainPoint, length)
        ### End of step 3.1

    ### Start of step 3.2
    # Sort the (row index, distance) pairs by distance, ascending
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    ### End of step 3.2

    neighbors = []

    ### Start of step 3.3
    # Extract the row indices of the k nearest neighbors
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    ### End of step 3.3

    classVotes = {}
    print("before: ", classVotes)
    ### Start of step 3.4
    # Count how often each class occurs among the neighbors
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x], -1]  # the class label is in the last column
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    ### End of step 3.4
    print("After: ", classVotes)

    ### Start of step 3.5
    # Sort the classes by vote count, highest first, and return the winner
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedVotes)
    return (sortedVotes[0][0], neighbors)
    ### End of step 3.5
    
testSet = [[1.2, 5.6, 2.1, 6.5]]
test = pd.DataFrame(testSet)

   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
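
For comparison, steps 3.1 to 3.5 can be condensed using only the standard library. This is a sketch under stated assumptions, not a replacement for the `knn` function above: the name `knn_predict` and the tiny two-class data set are made up for illustration, `math.dist` requires Python 3.8 or later, and each row is assumed to be a list whose last element is the class label.

```python
import math
from collections import Counter

def knn_predict(training_rows, test_point, k):
    # Step 3.1: distance from the test point to the features of every training row
    distances = [
        (math.dist(row[:-1], test_point), index)
        for index, row in enumerate(training_rows)
    ]
    # Step 3.2: sort ascending by distance (equal distances fall back to row index)
    distances.sort()
    # Step 3.3: keep the row indices of the k nearest neighbors
    neighbors = [index for _, index in distances[:k]]
    # Step 3.4: count the class labels of those neighbors
    votes = Counter(training_rows[i][-1] for i in neighbors)
    # Step 3.5: return the most frequent class and the neighbor indices
    return votes.most_common(1)[0][0], neighbors

# Hypothetical toy data: two features followed by a class label
toy_rows = [
    [1.0, 1.0, "A"],
    [1.2, 0.8, "A"],
    [5.0, 5.0, "B"],
    [5.5, 4.5, "B"],
    [4.8, 5.2, "B"],
]
print(knn_predict(toy_rows, [1.1, 1.0], k=3))  # ('A', [0, 1, 2])
```

Like `knn`, the sketch returns both the predicted class and the indices of the neighbors that voted for it.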


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Name           150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [56]:
df.shape

(150, 5)

In [104]:
### Start of step 2
# setting number of neighbors = 1
print("\n\n with 1 nearest k neighbor")

k = 1
### End of step 2

# Running knn model
result, neigh = knn(df, test, k)
print("\npredicted class of the datapoint: ", result)
print("\nNearest neighbor of the datapoint: ", neigh)



 with 1 nearest k neighbor
before:  {}
After:  {'Iris-virginica': 1}
[('Iris-virginica', 1)]

predicted class of the datapoint:  Iris-virginica

Nearest neighbor of the datapoint:  [106]


In [105]:
### Start of step 2
# setting number of neighbors = 3
print("\n\nwith 3 nearest k neighbor")

k = 3
### End of step 2

# Running knn model
result, neigh = knn(df, test, k)
print("\npredicted class of the datapoint: ", result)
print("\nNearest neighbor of the datapoint: ", neigh)



with 3 nearest k neighbor
before:  {}
After:  {'Iris-virginica': 1, 'Iris-versicolor': 1, 'Iris-setosa': 1}
[('Iris-virginica', 1), ('Iris-versicolor', 1), ('Iris-setosa', 1)]

predicted class of the datapoint:  Iris-virginica

Nearest neighbor of the datapoint:  [106, 59, 43]
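
With k = 3 each class receives exactly one vote, yet `Iris-virginica` is returned. A likely explanation, assuming Python 3.7 or later: dictionaries preserve insertion order and `sorted` is stable, so among tied classes the one inserted first (the class of the nearest neighbor) stays in front. The snippet below reproduces this using the vote counts printed above.

```python
import operator

# Vote counts in the order the neighbors were visited (nearest first)
classVotes = {"Iris-virginica": 1, "Iris-versicolor": 1, "Iris-setosa": 1}

# sorted() is stable, even with reverse=True, so tied entries keep their original order
sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
print(sortedVotes[0][0])  # Iris-virginica
```

In other words, a tie is effectively resolved in favor of the closest neighbor's class, which is why an odd k alone does not rule out ties when there are more than two classes.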


In [106]:
### Start of step 2
# setting number of neighbors = 5
print("\n\nwith 5 nearest k neighbor")

k = 5
### End of step 2

# Running knn model
result, neigh = knn(df, test, k)
print("\npredicted class of the datapoint: ", result)
print("\nNearest neighbor of the datapoint: ", neigh)



with 5 nearest k neighbor
before:  {}
After:  {'Iris-virginica': 2, 'Iris-versicolor': 2, 'Iris-setosa': 1}
[('Iris-virginica', 2), ('Iris-versicolor', 2), ('Iris-setosa', 1)]

predicted class of the datapoint:  Iris-virginica

Nearest neighbor of the datapoint:  [106, 59, 43, 98, 114]
