In [1]:
# KNN Algorithm


# Introduction

Machine learning, in general, works by learning from past data and using what it has learned to make predictions about new data.
To separate data into groups, a model uses the patterns in the features that distinguish one group from another.

KNN (k-nearest neighbors) is a supervised machine learning algorithm based on feature similarity. A KNN classifier is used for classifying data (the most common use), and a KNN regressor is used for regression.
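
To make the classifier/regressor distinction concrete: both variants look at the K training points closest to a new point, but a classifier takes a majority vote among them while a regressor averages their target values. The sketch below illustrates the regression case in plain Python. The function name `knn_regress` and the toy data are made up for illustration, and this is not how scikit-learn implements it.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a numeric value as the mean target of the k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its target.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k

# Toy data (hypothetical): one feature, target equal to twice the feature.
train_X = [(1.0,), (2.0,), (3.0,), (4.0,)]
train_y = [2.0, 4.0, 6.0, 8.0]

# The two nearest neighbors of x = 2.4 are x = 2 and x = 3.
prediction = knn_regress(train_X, train_y, query=(2.4,), k=2)
print(prediction)  # mean of the targets 4.0 and 6.0
```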


Two properties characterize KNN best:

### Lazy Learning
KNN is a lazy learning algorithm: it has no specialized training phase. It simply stores all the training data and defers the computation until a prediction is requested.

### Non-parametric learning
It makes no assumptions about the underlying distribution of the data, so it is classified as a non-parametric method.


KNN classifiers are mainly used in recommendation systems (for example, product advertisements). Other uses include searching for similar documents and recognizing handwriting in images or video.


## KNN Algorithm

KNN uses "feature similarity" to predict the values of new data points: the value assigned to a new data point depends on how closely it matches the points in the training set.

### Steps

1. Load the data sets (both training and test). If needed, prune and/or clean the data. If there is only one data set, split it into training and test sets.


2. Choose the value of K. A common rule of thumb is the square root of the total number of data points. Prefer an odd K, which avoids tied votes when there are two classes.


3. For each point in the test data set:

    i. Calculate the distance between the test point and every point in the training data set. The distance can be Euclidean (most commonly used), Manhattan, or Hamming.
    
    ii. Sort the distances in ascending order.
    
    iii. Select the top K entries from the sorted list, i.e. the K nearest neighbors.
    
    iv. Assign the test point the class held by the majority of those K neighbors.
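
The steps above can be sketched in plain Python as follows. This is a minimal illustration under simple assumptions (small in-memory lists, no tie-breaking policy), not the scikit-learn implementation. The names `suggest_k` and `knn_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the values differ (suits categorical features).
    return sum(x != y for x, y in zip(a, b))

def suggest_k(n_points):
    """Step 2: square root of the number of points, bumped to the next odd number."""
    k = max(1, round(math.sqrt(n_points)))
    return k if k % 2 == 1 else k + 1

def knn_classify(train_X, train_y, query, k, distance=euclidean):
    # Step 3.i: distance from the query to every training point.
    distances = [(distance(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3.ii: sort by distance, ascending.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.iii: keep the labels of the k nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3.iv: majority vote among those neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in 2-D.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

k = suggest_k(len(train_X))  # sqrt(6) is about 2.45, rounds to 2, bumped to 3
print(k, knn_classify(train_X, train_y, (1.1, 1.0), k))
print(k, knn_classify(train_X, train_y, (4.9, 5.0), k, distance=manhattan))
```

With an even K or more than two classes the vote can tie. This sketch then returns whichever tied label appears first among the nearest neighbors, which is one possible policy among several.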
    
    


## The following example is based on the tutorial at the following link
https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm#


In [2]:
# import the basic libraries used below
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd


In [3]:
# path of iris data-set
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# assigning column names to the dataset 
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# read the data (iris data-set) using the pandas data frame
dataset = pd.read_csv(path, names = headernames) 

# display the first five rows of the data set
dataset.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [4]:
# pre-processing the data
# the following code splits the data set into features (numeric) and class labels (non-numeric)
X = dataset.iloc[:, :-1].values  # first four columns: the numeric features
y = dataset.iloc[:, 4].values    # last column: the class label



In [5]:
# dividing the data into a training part and a testing part
# the following code splits the data set into 80% training data and 20% testing data

from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
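
For intuition, an 80/20 split amounts to shuffling the row indices and cutting the list at the 20% mark. The sketch below does that in plain Python. `simple_split` is a hypothetical helper, not scikit-learn's implementation, which offers more options (stratification, for example).

```python
import random

def simple_split(rows, test_size=0.20, seed=0):
    """Shuffle row indices and hold out the given fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(150))  # stand-in for the 150 Iris rows
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))
```

The `train_test_split` call above does not pass `random_state`, so it produces a different split, and possibly slightly different scores, on every run.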




In [6]:

# data scaling section
from sklearn.preprocessing import StandardScaler 

scaler = StandardScaler() 
scaler.fit(X_train) 

X_train = scaler.transform(X_train) 
X_test = scaler.transform(X_test)


# train the model with the KNeighborsClassifier class of sklearn
# K is set to 8 here; the square-root rule of thumb would give about 11 for the 120 training points

from sklearn.neighbors import KNeighborsClassifier 
classifier = KNeighborsClassifier(n_neighbors = 8)  
classifier.fit(X_train, y_train)

# predicting by using the testing data
y_pred = classifier.predict(X_test)
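
The scaling step matters because KNN is distance-based: a feature measured on a larger scale would otherwise dominate the distance. Standardization rescales each feature to zero mean and unit variance, z = (x - mean) / std, with the mean and standard deviation learned from the training data only and then reused for the test data. The sketch below illustrates this for a single feature in plain Python. The helper names and the numbers are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and (population) standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned from the training set."""
    return [(v - mean) / std for v in values]

train_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers
test_feature = [5.0, 11.0]

mean, std = fit_scaler(train_feature)
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)  # reuses the training statistics
print(mean, std, test_scaled)
```

Fitting on the training data alone, as the cell above does with `scaler.fit(X_train)`, keeps information about the test set out of the model.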


In [7]:
# getting the results

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Confusion Matrix:
[[10  0  0]
 [ 0  6  0]
 [ 0  1 13]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       0.86      1.00      0.92         6
 Iris-virginica       1.00      0.93      0.96        14

       accuracy                           0.97        30
      macro avg       0.95      0.98      0.96        30
   weighted avg       0.97      0.97      0.97        30

Accuracy: 0.9666666666666667
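
The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn's convention, rows are true classes and columns are predicted classes. The sketch below uses the matrix printed above. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [10, 0, 0],
    [0, 6, 0],
    [0, 1, 13],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total  # 29 correct out of 30

report = {}
for i, label in enumerate(labels):
    predicted_as_i = sum(row[i] for row in matrix)  # column sum
    actually_i = sum(matrix[i])                     # row sum (the "support")
    precision = matrix[i][i] / predicted_as_i
    recall = matrix[i][i] / actually_i
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actually_i)

print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall, f1, support) in report.items():
    print(f"{label:>15}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}")
```

The single off-diagonal entry is one Iris-virginica flower predicted as Iris-versicolor, which accounts for both the 0.86 precision of Iris-versicolor and the 0.93 recall of Iris-virginica.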


## Advantages

1. Simple to understand
2. Works on non-linear data (it makes no prior assumption about the data)
3. Can be used for both classification and regression
4. Can reach good accuracy (about 97% on the Iris test split above)


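## Disadvantages

1. Computationally expensive at prediction time (distances to all stored training points must be computed)
2. High memory requirement (all training data is stored)
3. Slow prediction for large data sets (large N)
4. Can be sensitive to irrelevant features, since every feature contributes to the distance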
