<img src="../../../imgs/CampQMIND_banner.png">

# K-Nearest Neighbours (KNN)

K-nearest neighbours is a simple yet powerful machine learning model. In this notebook, you will gain some intuition for how it works and learn how to implement it.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#K-Nearest-Neighbours-(KNN)" data-toc-modified-id="K-Nearest-Neighbours-(KNN)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>K-Nearest Neighbours (KNN)</a></span></li><li><span><a href="#Learning-outcomes" data-toc-modified-id="Learning-outcomes-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Learning outcomes</a></span></li><li><span><a href="#Video" data-toc-modified-id="Video-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Video</a></span></li><li><span><a href="#The-problem" data-toc-modified-id="The-problem-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The problem</a></span></li><li><span><a href="#The-Algorithm" data-toc-modified-id="The-Algorithm-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>The Algorithm</a></span></li><li><span><a href="#Some-distance-measures-we-can-use" data-toc-modified-id="Some-distance-measures-we-can-use-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Some distance measures we can use</a></span></li><li><span><a href="#An-interactive-Example-on-the-Decision-Boundaries" data-toc-modified-id="An-interactive-Example-on-the-Decision-Boundaries-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>An interactive Example on the Decision Boundaries</a></span></li><li><span><a href="#Example" data-toc-modified-id="Example-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Example</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Resources</a></span></li></ul></div>

# Video

In [1]:
from IPython.display import IFrame
IFrame('https://www.youtube.com/embed/HVXime0nQeI',560,315)

# The problem

Our dataset has __n__ observations and __m__ features.

We want to assign each observation a class label, here $y \in \{0,1\}$, according to a rule inferred from the data. The same idea extends to more than two classes.

# The Algorithm

We can think of the observations as points in an __m__-dimensional space. For each new observation, we look at the __k__ geometrically closest training points to decide whether it is classified as 1 or 0.

Let __x__ denote a training example and __x'__ denote the new (test) example.

1. For each training example x:
    1. Compute the distance between x and x'.
2. Sort the distances in ascending order and keep the k smallest.
3. Return the most frequent class among those k nearest neighbours.
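
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the toy arrays `X_toy` and `y_toy`, and the choice of Euclidean distance are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: Euclidean distance from x_new to every training example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # Step 3: most frequent label among those k neighbours
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters in 2 dimensions
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [0.5, 0.5], k=3))  # prints 0
print(knn_predict(X_toy, y_toy, [5.5, 5.5], k=3))  # prints 1
```

Note that there is no training step in this sketch: all the work happens at prediction time, when the new point is compared against every stored example.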

Note: KNN also works for regression. The only difference is that in step 3, the mean of the k neighbours' target values is returned instead of the most frequent class (the mode).
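
To make the regression variant concrete, here is a small sketch under the same assumptions as a plain majority-vote KNN (Euclidean distance, brute-force search). The name `knn_regress` and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_regress(X_train, y_train, x_new, k=5):
    """Predict a numeric target as the mean over the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Distance from x_new to every training example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k closest examples
    nearest = np.argsort(distances)[:k]

    # Average their targets instead of taking a vote
    return y_train[nearest].mean()


# Toy data (made up): one feature, numeric target
X_reg = np.array([[1.0], [2.0], [3.0], [10.0]])
y_reg = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest neighbours of 2.1 have targets 20, 30 and 10
print(knn_regress(X_reg, y_reg, [2.1], k=3))  # prints 20.0
```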


# Some distance measures we can use

1. Euclidean distance: $\sqrt{\sum_{i=1}^{m}(x_i - x'_i)^2}$
2. Manhattan distance: $\sum_{i=1}^{m}|x_i - x'_i|$

Here $x_i$ is the value of the $i$-th feature and the sums run over all __m__ features.
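
As a quick check of the two measures, the sketch below computes both for one pair of points. Euclidean distance squares each feature difference before summing and taking the square root, while Manhattan distance sums the absolute differences. The function names and the sample points are assumptions made for this example.

```python
import numpy as np


def euclidean_distance(x, x_prime):
    """Square root of the summed squared feature differences."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    return np.sqrt(np.sum((x - x_prime) ** 2))


def manhattan_distance(x, x_prime):
    """Sum of the absolute feature differences."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    return np.sum(np.abs(x - x_prime))


a = [0.0, 0.0]
b = [3.0, 4.0]

print(euclidean_distance(a, b))  # prints 5.0
print(manhattan_distance(a, b))  # prints 7.0
```

In scikit-learn, `KNeighborsClassifier` uses the Minkowski metric with `p=2` by default, which is the Euclidean distance; passing `p=1` should give the Manhattan distance instead.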

# An interactive Example on the Decision Boundaries

In [2]:
%%html
<iframe src="http://vision.stanford.edu/teaching/cs231n-demos/knn/" width="900" height="600"></iframe>

# Example

In [3]:
from sklearn.datasets import load_digits
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [4]:
data = load_digits()
features = data["feature_names"]
df = pd.DataFrame(data["data"],columns = features)
df["target"] = data["target"]
df.head()

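| | pixel_0_0 | pixel_0_1 | pixel_0_2 | pixel_0_3 | pixel_0_4 | pixel_0_5 | pixel_0_6 | pixel_0_7 | pixel_1_0 | pixel_1_1 | ... | pixel_6_7 | pixel_7_0 | pixel_7_1 | pixel_7_2 | pixel_7_3 | pixel_7_4 | pixel_7_5 | pixel_7_6 | pixel_7_7 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 |
| 3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 | 3 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 | 4 |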


In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [6]:
# Hold out 25% of the rows (the train_test_split default) as a test set
X_train, X_test, y_train, y_test = train_test_split(df.drop("target", axis=1), df["target"], random_state=1)

In [7]:
knn = KNeighborsClassifier()  # default k (n_neighbors) is 5
knn.fit(X_train, y_train)     # KNN simply stores the training data
preds = knn.predict(X_test)   # majority vote among the 5 nearest neighbours
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        53
           1       1.00      1.00      1.00        42
           2       1.00      0.98      0.99        41
           3       1.00      1.00      1.00        52
           4       1.00      1.00      1.00        47
           5       1.00      0.97      0.99        39
           6       1.00      1.00      1.00        43
           7       0.98      0.98      0.98        48
           8       1.00      1.00      1.00        37
           9       0.96      1.00      0.98        48

    accuracy                           0.99       450
   macro avg       0.99      0.99      0.99       450
weighted avg       0.99      0.99      0.99       450



# Resources

https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

