# k-Nearest Neighbors (kNN)

$k$-Nearest Neighbors (kNN) is a non-parametric supervised learning algorithm that can be used for both classification and regression. The prediction for an unlabeled data point is determined by the labels of its $k$ nearest labeled data points.

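* For classification, we can either take a probabilistic approach or assign the class by majority vote. In the probabilistic approach, class $c$ is assigned with probability

$$p(y=c|x, D) = \frac{1}{k}\sum_{n \in N_k(x,D)} \mathbb{I}(y_n = c) $$

where $N_k(x,D)$ is the set of indices of the $k$ points in the labeled data $D$ that are nearest to $x$, and $\mathbb{I}$ is the indicator function. With a majority vote, the class with the highest probability is assigned:

$$ y = \underset{c}{\mathrm{argmax}}\; p(y = c|x,D) $$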

* For regression, the output is the average of the target values of the $k$ nearest neighbors:

$$ y = \frac{1}{k}\sum_{n \in N_k(x,D)} y_n$$

In both cases, we need to define a distance metric between data points. 
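
As a rough illustration of the formulas above, here is a minimal NumPy sketch of both the classification and the regression rule. It assumes the Euclidean distance as the metric; the function names and the toy data are made up for this example and are not part of scikit-learn.

```python
import numpy as np

def k_nearest_indices(x, X_train, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(distances)[:k]

def knn_class_probabilities(x, X_train, y_train, k, classes):
    """Fraction of the k nearest neighbors that belong to each class."""
    neighbors = k_nearest_indices(x, X_train, k)
    return np.array([np.mean(y_train[neighbors] == c) for c in classes])

def knn_regress(x, X_train, y_train, k):
    """Average target value of the k nearest neighbors."""
    neighbors = k_nearest_indices(x, X_train, k)
    return y_train[neighbors].mean()

# Hypothetical toy data: five labeled points in two dimensions
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_class = np.array([0, 0, 1, 1, 1])                # class labels
y_value = np.array([1.0, 2.0, 3.0, 10.0, 12.0])    # regression targets
query = np.array([0.2, 0.1])

classes = np.array([0, 1])
probs = knn_class_probabilities(query, X_toy, y_class, k=3, classes=classes)
predicted_class = classes[np.argmax(probs)]        # majority vote
predicted_value = knn_regress(query, X_toy, y_value, k=3)

print(probs, predicted_class, predicted_value)
```

The three nearest neighbors of the query are the first three points, two of which belong to class 0, so the majority vote picks class 0 and the regression output is the mean of their three target values.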

## k-Nearest Neighbors in Scikit-Learn on the Iris Dataset
Here we learn by practice, fitting a kNN classifier to the Iris dataset.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# url for Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read in the dataset
df = pd.read_csv(url, names=names)

In [3]:
df.head()

|   | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [8]:
# defining the features and labels
X = df.iloc[:, :-1].values
y = df.iloc[:, 4].values

In [13]:
# splitting the data into training and test sets (20% held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [16]:
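# scaling features; fit the scaler on the training set only so that
# test-set statistics do not leak into the preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)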

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
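
Scaling matters for kNN because distances are otherwise dominated by the features with the largest numeric range. As a small check of what StandardScaler does, the sketch below compares its output with the manual formula (x - mean) / std, where mean and std come from the rows the scaler was fitted on. The tiny arrays and variable names here are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_demo_test = np.array([[4.0, 40.0]])

# Fit on the training rows, then apply the same shift and scale to the test row
demo_scaler = StandardScaler().fit(X_demo_train)
scaled_by_sklearn = demo_scaler.transform(X_demo_test)

# Manual equivalent: StandardScaler uses the population standard deviation (ddof=0)
mean = X_demo_train.mean(axis=0)
std = X_demo_train.std(axis=0)
scaled_by_hand = (X_demo_test - mean) / std

print(scaled_by_sklearn)
print(scaled_by_hand)
```

After scaling, both features of the test row take the same value, so neither dominates a distance computation.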

In [24]:
# fit a kNN classifier using the class sklearn.neighbors.KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=4)
classifier.fit(X_train, y_train)

### More on KNeighborsClassifier
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

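* __n_neighbors__: the number of neighbors $k$ used for each prediction
* __weights__: how the neighbors are weighted, either uniform (all neighbors count equally) or distance (each neighbor is weighted by the inverse of its distance)
* __algorithm__: the method used to find the nearest neighbors: ball_tree, kd_tree, brute (brute-force search), or auto (chosen automatically based on the data passed to fit)
* __leaf_size__: the leaf size used by the ball_tree and kd_tree algorithms
* __p__: the power parameter of the Minkowski metric; p=2 is the Euclidean distance and p=1 is the Manhattan distance
* __metric__: the distance metric; the default, minkowski, reduces to the Euclidean distance when p=2
* __metric_params__: additional keyword arguments for the metric function
* __n_jobs__: the number of parallel jobs for the neighbor search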
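
To get a feel for these parameters, the following sketch tries a few combinations of weights and p on the Iris data. It is only an illustration under some assumptions: it loads the copy of the dataset bundled with scikit-learn through load_iris instead of the URL used above, fixes random_state=0 for the split, and uses its own variable names so it does not interfere with the cells of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Bundled copy of Iris (assumption: used here to avoid a download)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=0
)

# Scale with statistics from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# Compare a few parameter settings by test accuracy
settings = [
    {"weights": "uniform", "p": 2},   # plain vote, Euclidean distance
    {"weights": "distance", "p": 2},  # inverse-distance vote, Euclidean distance
    {"weights": "distance", "p": 1},  # inverse-distance vote, Manhattan distance
]
scores = {}
for params in settings:
    model = KNeighborsClassifier(n_neighbors=4, **params).fit(X_tr, y_tr)
    scores[(params["weights"], params["p"])] = model.score(X_te, y_te)
    print(params, scores[(params["weights"], params["p"])])

# With uniform weights, predict_proba returns the fraction of the
# 4 neighbors that fall in each class, as in the probabilistic formula
uniform_model = KNeighborsClassifier(n_neighbors=4, weights="uniform").fit(X_tr, y_tr)
neighbor_fractions = uniform_model.predict_proba(X_te[:3])
print(neighbor_fractions)
```

The accuracies depend on the particular split, so differences between settings on a dataset this small should be read with caution.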

In [22]:
# predict the class of each sample in the test set
y_pred = classifier.predict(X_test)
y_pred

array(['Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor',
       'Iris-setosa', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa'], dtype=object)

In [23]:
# evaluating prediction
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[ 8  0  0]
 [ 0  8  1]
 [ 0  2 11]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         8
Iris-versicolor       0.80      0.89      0.84         9
 Iris-virginica       0.92      0.85      0.88        13

       accuracy                           0.90        30
      macro avg       0.91      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30
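
To see how the report follows from the confusion matrix, the short sketch below recomputes the per-class metrics by hand. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes, and it copies the matrix values from the output above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[8, 0, 0],
               [0, 8, 1],
               [0, 2, 11]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

# Rounded to two decimals, these reproduce the columns of the report
print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 2))
```

For example, 10 samples were predicted as Iris-versicolor and 8 of them were correct, which gives the precision of 0.80 in the report.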

