# K Nearest Neighbors
Classification.   
Lazy (no training).   
Local.  
Predictive.  
Supervised.   
Instance-based learning (model-free).  
Non-parametric (no normality assumption, but K is a critical parameter).   
Discrete/classification or continuous/regression implementations.  
Deductive (deduce a specific from the general): predicting one label from several.  
Not inductive (infer a rule or pattern), unless you use it to draw decision boundaries.    
Depends on a proximity measure.  
Makes non-linear (completely arbitrary) decision boundaries.  
Susceptible to scaling (big features dominate the distance).  
Susceptible to noise (especially with small K).  
Difficult to use with missing values.  
Robust to feature interaction effects.  
Susceptible to large numbers of irrelevant or redundant features (swamps distance measure).  
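
The scaling sensitivity listed above is commonly handled by standardizing the features before any distance is computed. A minimal sketch, assuming scikit-learn's `StandardScaler` and `make_pipeline` (neither appears in these notes) and an arbitrary `random_state=0` so the split is repeatable:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fit on the training fold only, then applied to both folds,
# so every feature contributes to the distance on a comparable scale.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
scaled_accuracy = scaled_knn.score(X_test, y_test)
print('scaled KNN accuracy: %4.2f' % scaled_accuracy)
```

On iris all four features are in centimeters, so the effect is small; scaling matters more when features have very different units or ranges.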

Basic ideas:    

Project points into a (possibly lower) dimensional space.
Reduce noise by converting the input point to a more representative one.
Or, convert an unlabeled given point to a labeled one from the training set.
Or, infer a label using the majority class of the K nearest neighbors.  
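
The last idea, majority vote among the K nearest neighbors, can be sketched from scratch in a few lines. This is an illustrative version only: the function name `knn_predict` and the toy data are made up here, and Euclidean distance is assumed.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Label x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # query near the second cluster
```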

Variations:  

The choice of K is important yet largely guesswork.  

The distance metric is usually Euclidean but could be Manhattan, etc.   

Inference could be uniform (all K nearest neighbors contribute equally)
or otherwise (apply Gaussian weight so closest neighbors contribute more).  
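
A sketch of uniform versus weighted inference, assuming the `weights` parameter of scikit-learn's `KNeighborsClassifier`, which accepts `'uniform'`, `'distance'`, or a callable. The helper `gaussian_weights` and its bandwidth `sigma` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(distances, sigma=1.0):
    # sigma is an illustrative bandwidth, not a recommended value.
    return np.exp(-(distances ** 2) / (2 * sigma ** 2))

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

weight_accuracy = {}
for name, weights in [('uniform', 'uniform'),
                      ('inverse distance', 'distance'),
                      ('gaussian', gaussian_weights)]:
    knn = KNeighborsClassifier(n_neighbors=7, weights=weights)
    knn.fit(X_train, y_train)
    weight_accuracy[name] = knn.score(X_test, y_test)
    print('%-16s %4.2f' % (name, weight_accuracy[name]))
```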

The decision boundaries are non-linear at any K (most jagged at small K).
Thus the results are sensitive to distributions, sampling, noise, unequal scaling.  

Speed. Instead of using all the points, 
use a reduced set of prototypes (called condensed nearest neighbors).
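
One way to build such a reduced set is a condensing loop in the spirit of Hart's condensed nearest neighbor rule: keep a point only if the prototypes kept so far would misclassify it under 1-NN. The sketch below is an assumption about how this might be coded, not a reference implementation; `condense` and the synthetic blobs are invented for the example.

```python
import numpy as np

def condense(X, y):
    """Return indices of a prototype subset that 1-NN-classifies all of X correctly."""
    store = [0]                      # seed the store with the first point
    changed = True
    while changed:                   # repeat passes until nothing is added
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN prediction for point i using only the stored prototypes.
            d = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(d))]
            if y[nearest] != y[i]:   # misclassified: keep it as a prototype
                store.append(i)
                changed = True
    return np.array(store)

rng = np.random.default_rng(0)
# Two synthetic, well-separated clusters of 50 points each.
X_blobs = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                     rng.normal(10.0, 1.0, size=(50, 2))])
y_blobs = np.array([0] * 50 + [1] * 50)

prototypes = condense(X_blobs, y_blobs)
print('kept %d of %d points' % (len(prototypes), len(X_blobs)))
```

The result depends on the order in which points are visited, so different orderings can yield different prototype sets.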

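Speed. Instead of naive instance-vs-all search, which is linear in the number of training points for each query,
do pre-processing to make each query roughly logarithmic.
Precompute pair-wise distances, sort the data, organize as a tree (e.g. a k-d tree or ball tree).    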
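
In scikit-learn this choice is exposed through the `algorithm` parameter of `KNeighborsClassifier`. The sketch below, on synthetic data invented for the example, checks that a brute-force search and a k-d tree search yield the same predictions; it does not measure speed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_big = rng.normal(size=(2000, 3))        # synthetic points, for illustration only
y_big = (X_big[:, 0] > 0).astype(int)     # label by the sign of the first feature
X_query = rng.normal(size=(200, 3))

# 'brute' compares each query against every stored point;
# 'kd_tree' builds a k-d tree at fit time so that queries can prune most of them.
brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute').fit(X_big, y_big)
tree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree').fit(X_big, y_big)

agreement = np.mean(brute.predict(X_query) == tree.predict(X_query))
print('fraction of identical predictions: %4.2f' % agreement)
```

The tree changes how neighbors are found, not which neighbors are found, so the two searches should agree except where distances are tied.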

Accuracy guarantees.
These have been worked out for various special cases:
K=1; binary classification on infinite data; etc.

Decision rule.
(1) Winner-take-all i.e. use the class of the majority of neighbors.
(2) The centroid i.e. a representative value, a point in space.
This is analogous to the mean. 
This is used for continuous data and regression.
(3) Medoid i.e. a representative instance, one that actually exists. 
This is analogous to the median. 
This is used for discrete data and classification.
(4) Weighted mean i.e. weight each neighbor by 1/distance and compute the mean.   
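
A small worked example of rules (1), (2) and (4), using three made-up neighbors. The 1/distance weighting assumes no neighbor sits at distance zero; a real implementation would need to handle that case.

```python
from collections import Counter
import numpy as np

# Made-up neighbors of one query point: labels, target values, distances.
neighbor_labels = ['a', 'b', 'a']
neighbor_values = np.array([1.0, 2.0, 4.0])
neighbor_distances = np.array([1.0, 2.0, 4.0])

# (1) Winner-take-all: the most frequent label among the neighbors.
majority = Counter(neighbor_labels).most_common(1)[0][0]

# (2) Centroid-style estimate: the plain mean of the neighbor values.
plain_mean = neighbor_values.mean()

# (4) Weighted mean: weight each neighbor by 1/distance.
weights = 1.0 / neighbor_distances
weighted_mean = np.sum(weights * neighbor_values) / np.sum(weights)

print(majority)                  # a
print(round(plain_mean, 3))      # 2.333
print(round(weighted_mean, 3))   # 1.714
```

The weighted mean is pulled toward the closest neighbor's value (1.0), while the plain mean treats all three equally.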

## Choice of K
Done by heuristics. 
Or by bootstrap (testing various K on random subsets).
Choose odd K for binary classification (just to avoid tie votes).
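
A sketch of choosing K by resampling. Note the swap: instead of the bootstrap mentioned above, this uses 5-fold cross-validation via scikit-learn's `cross_val_score`, a related way of testing each K on held-out subsets. The candidate range is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# Odd candidates only, echoing the tie-avoidance heuristic
# (iris has three classes, so ties are still possible).
candidate_ks = range(1, 16, 2)
mean_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print('best K by 5-fold cross-validation:', best_k)
```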

## Choice of distance metric
Conceive of points or vectors in N-dimensional space.  
Usually use Euclidean distance on real-valued data and
Hamming distance on discrete data, e.g. word frequencies of documents. 
The best distance metric can be learned.
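
For concreteness, here are three common metrics computed by hand on two made-up vectors. Hamming distance is shown in its normalized form (the fraction of coordinates that differ), which is one of several conventions.

```python
import numpy as np

# Two made-up vectors.
a = np.array([1.0, 0.0, 2.0])
b = np.array([1.0, 3.0, 0.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of per-coordinate differences
hamming = np.mean(a != b)                   # fraction of coordinates that differ

print(round(euclidean, 3))   # 3.606
print(manhattan)             # 5.0
print(round(hamming, 3))     # 0.667
```

In scikit-learn the metric is selected with the `metric` parameter, for example `KNeighborsClassifier(metric='manhattan')`.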


## Relations to other algorithms
Self organizing maps (SOM): same idea on a modified search space. 

KNN is often the classifier used after feature extraction
by Haar face detection, or mean-shift, or PCA, etc.  
KNN by itself performs poorly on high-dimensional data due to 
curse of dimensionality.
For real-time forecasting on high-dimensional big data,
KNN is applied to sketches or locality-sensitive hashing.

KNN can be used for outlier detection.
Choose a K, find instances that are misclassified.
Remove outliers with more than R neighbors of another class.
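
A sketch of that procedure on synthetic data, assuming scikit-learn's `NearestNeighbors`. The values of `K` and `R`, the two clusters, and the deliberately mislabeled point are all invented for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters, 20 points each.
X_two = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                   rng.normal(10.0, 1.0, size=(20, 2))])
y_two = np.array([0] * 20 + [1] * 20)
y_two[0] = 1                      # deliberately mislabel one point

K, R = 5, 2                       # illustrative values, not recommendations
# Calling kneighbors() with no query argument returns each training point's
# K neighbors without counting the point itself.
nn = NearestNeighbors(n_neighbors=K).fit(X_two)
neighbor_idx = nn.kneighbors(return_distance=False)

# For each point, count neighbors whose label differs from its own.
other_class = (y_two[neighbor_idx] != y_two[:, None]).sum(axis=1)
flagged = np.where(other_class > R)[0]
print(flagged)
```

Only the mislabeled point should be flagged here, since every one of its neighbors carries the other label.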

K Nearest Neighbors is not related to K Means Clustering.

## Sample data: Iris

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

In [2]:
iris.data.shape

(150, 4)

In [3]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
X = iris.data
y = iris.target

## Demo run
See model parameters at 
[sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import metrics

In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [8]:
for neighbors in range(1,10):
    knn = KNN(neighbors)
    knn.fit(X_train,y_train)   # no model is trained; the points and labels are just stored!
    y_pred = knn.predict(X_test)    # labels each test point from its nearest training points!
    score = metrics.accuracy_score(y_test,y_pred)   # fraction correct, between 0 and 1
    print('%d neighbors, %4.2f accuracy' % (neighbors,score) )

1 neighbors, 0.95 accuracy
2 neighbors, 0.89 accuracy
3 neighbors, 0.95 accuracy
4 neighbors, 0.92 accuracy
5 neighbors, 0.89 accuracy
6 neighbors, 0.92 accuracy
7 neighbors, 0.92 accuracy
8 neighbors, 0.95 accuracy
9 neighbors, 0.92 accuracy


## Variance
We see accuracy is highly variable. 
Run this notebook again and you get different results. 
But this is attributable to the random train/test split.
If you use one train/test split, and just rerun the KNN cell, the results are consistent.
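
One way to see this is to seed the split. The sketch below assumes the `random_state` parameter of `train_test_split`; the helper `split_accuracy` and the seed values are made up for the example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

def split_accuracy(seed, k=5):
    """Accuracy of a K-neighbor classifier on one seeded train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    return KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)

# Different seeds give different splits, so accuracy may vary between them.
per_seed = [split_accuracy(seed) for seed in range(5)]
print(['%4.2f' % a for a in per_seed])

# The same seed reproduces the same split and therefore the same accuracy.
print(split_accuracy(0) == split_accuracy(0))
```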