# KNN (K-Nearest Neighbors) Algorithm

*“Tell me who you walk with and I’ll tell you who you are.”*

The K-Nearest Neighbors algorithm, commonly referred to as KNN, stands out as one of the simplest yet widely used supervised learning classification methods. KNN relies on the concept of feature similarity. This similarity is quantified by measuring the distances between the new data point and its k closest neighbors. Thus, the new data is classified by comparing it to its nearest neighbors from the training data. The closer they are, the more similar they are considered. This code presents one of the possibilities of developing the algorithm without using an external library.

### So lets get to work!

The dataset used in this code is the well-known Iris dataset, a classic dataset for practicing machine learning techniques. It consists of measurements taken from 150 iris flowers from 3 different species: Iris setosa, Iris versicolor, and Iris virginica. For each flower, the dataset includes 4 features (measurements in cm):

- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

The KNN algorithm will predict the classification of a new plant based on these 4 features.

Part of the code was already provided by DSA and the development of KNN algorithm is in the space indicated by "Write solution here" in the cells below.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statistics import *
from scipy import stats as s

In [5]:
# Data loading
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'classe']
iris_data = pd.read_csv('iris_data', names = names)
iris_data.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,classe
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


In [6]:
iris_data.shape

(150, 5)

In [7]:
# Instantiating the predictor variables and target variable
X = iris_data.iloc[:,:4].values
y = iris_data.iloc[:,4]

# Labels for target variable
target_class = pd.get_dummies(iris_data['classe']).columns
target_names = np.array(target_class)

In [8]:
# Converting classes to corresponding numeric values
y = y.replace(target_names[0], 0)
y = y.replace(target_names[1], 1)
y = y.replace(target_names[2], 2)
y = np.array(y)

In [9]:
# Separating data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 33)
print(X_train.shape, y_train.shape)

(105, 4) (105,)


In [10]:
# Function to calculate Euclidean distance
def euclidian_distance(att1, att2):
    dist = 0
    for i in range(len(att1)):
        dist += pow((att1[i] - att2[i]),2)
    return np.sqrt(dist)

### KNN Algorithm 

In [11]:
def KNN(array, k):
    # Empty list to compute the predictions for each new data point:
    predictions = []
    
    # Computing the distances between each new data point and the points of the training dataset:
    for i in range(len(array)):
        
    # Empty list to compute the distances and the corresponding label: 
        distances_label = []
        for j in range(len(X_train)):
            dis = euclidian_distance(array[i], X_train[j])
            distances_label.append([dis, y_train[j]])

        # Sorting the distances in ascending order for having the shortest ones first:
        distances_label.sort()
        # Empty list to compute the corresponding labels related to the shortest distances:
        labels = []
        for l in range(k):
            labels.append(distances_label[k][1])

        # Voting:
        result = int(s.mode(labels)[0])
        
        # Adding the resulting label of the new data point under analysis:
        predictions.append(result)
        
    return predictions  

### Model Evaluation

In [12]:
y_test_pred = KNN(X_test, 5)
y_test_prediction = np.asarray(y_test_pred)

  result = int(s.mode(labels)[0])


### Accuracy

In [13]:
# As we have the observed values for the predictions we made, we can compare them and see how accurate the model is:
acc = y_test - y_test_prediction
err = np.count_nonzero(acc)
accuracy = ((len(y_test) - err) / len(y_test)) * 100
print("The achieved accuracy is: {}%".format(round(accuracy,2)))

The achieved accuracy is: 91.11%
