<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#K-Nearest-Neighbors" data-toc-modified-id="K-Nearest-Neighbors-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>K-Nearest Neighbors</a></span><ul class="toc-item"><li><span><a href="#Summary" data-toc-modified-id="Summary-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#Advantages" data-toc-modified-id="Advantages-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Advantages</a></span></li><li><span><a href="#Disadvantages" data-toc-modified-id="Disadvantages-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Disadvantages</a></span></li><li><span><a href="#Determining-K" data-toc-modified-id="Determining-K-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Determining K</a></span><ul class="toc-item"><li><span><a href="#Overfitting-&amp;-Underfitting" data-toc-modified-id="Overfitting-&amp;-Underfitting-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Overfitting &amp; Underfitting</a></span></li></ul></li><li><span><a href="#Implementing-via-sklearn" data-toc-modified-id="Implementing-via-sklearn-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Implementing via sklearn</a></span></li></ul></li></ul></div>

In [None]:
import numpy as np
%matplotlib inline

# K-Nearest Neighbors

Classification / Supervised Learning

## Summary

Store the training data (this is the only "learning"), then predict a test point from the labels of its $k$ nearest training points: the majority class wins.
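
A minimal from-scratch sketch of that idea, assuming Euclidean distance and a plain majority vote. The function name `knn_predict` and the toy points are made up for illustration; this is not how sklearn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (hypothetical): two small, well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([2.0, 2.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([6.0, 7.0]), k=3))  # 1
```

Note there is no `fit` step at all: the "model" is just the stored arrays, which is why KNN is called a lazy learner.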

![(From curriculum)](images/knn.png)

<img src='http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final1_ibdm8a.png'/>

> From Datacamp: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn

## Advantages

- lazy learning (no training phase)
- easy to interpret

## Disadvantages

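- the whole training set has to be kept in memory (works best on small data with few features)
- not robust: sensitive to noisy points and to feature scale, so it may not generalize well
- soft (overlapping) class boundaries are troublesome, since points near the boundary have mixed neighbors
- curse of dimensionality: distances become less informative as the number of features grows
    + PCA can reduce the number of dimensions first (learn this in time)
    + in high dimensions, cosine similarity is often used instead of Euclidean distance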
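
Both workarounds can be sketched with sklearn. The digits dataset (64 features), `n_components=16`, and `n_neighbors=5` are arbitrary choices for illustration, not recommendations, and the variable names are made up here.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 64-dimensional data: 8x8 images of handwritten digits
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0, stratify=y_digits
)

# option 1: reduce to 16 dimensions with PCA before the neighbor search
pca_knn = make_pipeline(PCA(n_components=16), KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_tr, y_tr)

# option 2: keep all 64 dimensions but compare points by cosine distance
cos_knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
cos_knn.fit(X_tr, y_tr)

print(f"PCA + euclidean: {pca_knn.score(X_te, y_te):.3f}")
print(f"cosine:          {cos_knn.score(X_te, y_te):.3f}")
```

Which option helps more depends on the data, so it is worth comparing both against plain Euclidean KNN on a held-out set.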

## Determining K

How many neighbors ($k$) should vote when we classify a point?

![](images/best_k.png)

Plot the test error for a range of $k$ values (an elbow plot) and pick the $k$ where the error stops dropping.

$k$ is usually between 1 and 19.

![](images/k_elbow_plot.png)
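
A rough sketch of how such an elbow plot could be produced, assuming the iris data and a single held-out test split. The split size and `random_state` are arbitrary, and a single split is noisy; cross-validation would give a smoother curve.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

k_values = list(range(1, 20))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # test error = 1 - accuracy on the held-out points
    errors.append(1 - knn.score(X_test, y_test))

# first k that reaches the lowest test error
best_k = k_values[errors.index(min(errors))]
print(f"lowest test error at k={best_k}")

plt.plot(k_values, errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("test error")
plt.xticks(k_values)
plt.show()
```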

### Overfitting & Underfitting

![](images/underfit_vs_overfit.png)
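
For KNN, a very small $k$ tends to overfit (each prediction follows a single, possibly noisy, neighbor) and a very large $k$ tends to underfit (every prediction drifts toward the overall majority). A small sketch on iris that compares training and held-out accuracy at both extremes; the split and the middle value `15` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

# k=1 memorizes the training set; k=len(X_fit) lets every training point vote
scores = {}
for k in (1, 15, len(X_fit)):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    scores[k] = (model.score(X_fit, y_fit), model.score(X_val, y_val))
    print(f"k={k:>3}: train={scores[k][0]:.2f}  held-out={scores[k][1]:.2f}")
```

A large gap between training and held-out accuracy points to overfitting; low accuracy on both points to underfitting.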

## Implementing via sklearn

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets

# load the iris data set
iris = datasets.load_iris()
# keep only the first three features so the points can be drawn in 3D
X = iris.data[:, :3]
y = iris.target

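# plot the three features, colored by class
fig = plt.figure(1, figsize=(8, 6))
# Axes3D(fig, ...) no longer attaches itself to the figure in recent matplotlib
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.Set1, s=40)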

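# label each axis with the name of the feature it shows
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_zlabel(iris.feature_names[2])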

plt.show()

`KNeighborsClassifier` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
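# vote among the 6 nearest training points, measured by Euclidean distance
neigh = KNeighborsClassifier(n_neighbors=6, metric='euclidean')
# "fitting" only stores the training data (lazy learning)
neigh.fit(X, y)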

In [None]:
# new points to classify, in the same three features as X
pred_pts = np.array([
    [7, 3, 7],
    [8, 4, 7],
    [7, 3, 6],
    [7, 4, 6],
    [4, 4, 1],
    [5, 4, 3],
    [5, 4, 5],
    [4, 4, 5],
    [3, 3, 3],
])

pred_y = neigh.predict(pred_pts)
print(pred_y)

In [None]:
# predicted class next to the share of neighbor votes each class received
for p, prob in zip(pred_y, neigh.predict_proba(pred_pts)):
    print(f'{p}: {prob}')
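
With the default uniform weights, the probabilities printed above are simply the fraction of the 6 neighbors that belong to each class. A small self-contained check of that, assuming the same iris features and settings as above; the `_demo` names and the query point are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data[:, :3], iris_demo.target
clf = KNeighborsClassifier(n_neighbors=6, metric="euclidean").fit(X_demo, y_demo)

query = np.array([[5.0, 4.0, 3.0]])
# indices of the 6 nearest training points to the query
_, neighbor_idx = clf.kneighbors(query)
neighbor_labels = y_demo[neighbor_idx[0]]

# share of the 6 neighbors that belong to each of the 3 classes
vote_share = np.bincount(neighbor_labels, minlength=3) / clf.n_neighbors
print(vote_share)
print(clf.predict_proba(query)[0])
```

The two printed arrays should match; passing `weights='distance'` to the classifier would instead weight each neighbor's vote by the inverse of its distance.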

In [None]:
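# training points (small) together with the newly classified points (large)
fig = plt.figure(1, figsize=(8, 6))
# Axes3D(fig, ...) no longer attaches itself to the figure in recent matplotlib
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)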

ax.scatter(
    X[:, 0], 
    X[:, 1], 
    X[:, 2], 
    c=y,
    cmap=plt.cm.Set1,
    s=40
)

ax.scatter(
    pred_pts[:, 0], 
    pred_pts[:, 1], 
    pred_pts[:, 2], 
    c=pred_y,
    cmap=plt.cm.Set1,
    s=400
)

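# label each axis with the name of the feature it shows
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_zlabel(iris.feature_names[2])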

plt.show()