# K Nearest Neighbors
Classification.   
Lazy: no training required.   
Local, not global.  
Predictive, not generative.  
Supervised, requires labeled data.   
Instance-based learning: model-free.  
Non-parametric: no normality assumption (but K is a critical parameter).   
Discrete/classification or continuous/regression implementations.  
Deductive: deduces a specific label from the general labeled data.
Not inductive (infer a rule or pattern), unless you use it to induce decision boundaries.    
Depends on a proximity measure.  
Makes non-linear, completely arbitrary, decision boundaries.  
Sensitive to scale. Big features dominate the proximity. Normalize each feature first.  
Susceptible to noise, especially when K is small.  
Difficult to make predictions on data points missing some feature values.  
Robust to feature interaction effects, though the proximity function could weight these features.  
Large numbers of irrelevant or redundant features can swamp the proximity measure.   
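
The scale sensitivity noted in the list above can be seen in a small sketch. The data are made up (height in meters, income in dollars), and z-score standardization is just one common way to normalize, assumed here for illustration.

```python
import math

# Hypothetical data: (height in meters, income in dollars).
points = [(1.60, 30000.0), (1.90, 30500.0), (1.61, 32000.0)]
query = (1.62, 30400.0)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(points, query):
    """Return the index of the point closest to query."""
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))

def standardize(rows):
    """Z-score each column; return scaled rows plus per-column (mean, std)."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        stats.append((mean, std))
    scaled = [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]
    return scaled, stats

# On raw values, income differences swamp height differences.
raw_nearest = nearest(points, query)

# After scaling, both features contribute on comparable terms.
scaled_points, stats = standardize(points)
scaled_query = tuple((v - m) / s for v, (m, s) in zip(query, stats))
scaled_nearest = nearest(scaled_points, scaled_query)

print(raw_nearest, scaled_nearest)
```

On the raw values the nearest neighbor is the 1.90 m person, because only income matters; after scaling it is the 1.60 m person.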

### Algorithm
Project points into a (possibly lower-dimensional) space.
Given an unlabeled point P, find its K nearest neighbors.
Label P according to the average or mode of these neighbors.
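
The steps above can be sketched in plain Python. This is a minimal brute-force version, assuming Euclidean distance and a majority vote; the function name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label query by majority vote among its k nearest training points."""
    # Distance from the query to every stored point (brute force, O(n)).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Mode of the neighbors' labels.
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # a
print(knn_predict(train_points, train_labels, (4.9, 5.0)))  # b
```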

### Decision rules
Choose one
1. Winner-take-all.
Predict the class label of the majority of neighbors. 
This is used for classification of discrete data.
1. Predict the medoid of the neighbors.
This is a representative instance that actually exists.
This is analogous to the median and is less susceptible to outliers.
This is used for classification of discrete data.
1. Predict the centroid of the neighbors.
This is a point in space analogous to the mean. 
This is used for regression on continuous data.
1. Apply weights to the neighbors.
Weight each neighbor by 1/distance from the given point.
Then, use one of the predictors above.
This is less susceptible to outliers.
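
The four rules above can be compared on one small neighbor set. This is a sketch only: the neighbors, their labels, and their distances to the query are invented, and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical neighbors of some query point:
# (coordinates, class label, distance to the query).
neighbors = [
    ((1.0, 1.0), "a", 0.5),
    ((1.2, 0.9), "a", 0.7),
    ((3.0, 3.0), "b", 0.2),
]

def majority_label(neighbors):
    """Winner-take-all: the most common label."""
    return Counter(label for _, label, _ in neighbors).most_common(1)[0][0]

def medoid(neighbors):
    """The neighbor with the smallest total distance to the other neighbors."""
    coords = [c for c, _, _ in neighbors]
    return min(coords, key=lambda c: sum(math.dist(c, other) for other in coords))

def centroid(neighbors):
    """Coordinate-wise mean of the neighbors."""
    coords = [c for c, _, _ in neighbors]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

def weighted_label(neighbors, eps=1e-12):
    """Vote with weight 1/distance; eps guards against a zero distance."""
    votes = defaultdict(float)
    for _, label, dist in neighbors:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)

print(majority_label(neighbors))   # a
print(medoid(neighbors))           # (1.2, 0.9)
print(centroid(neighbors))
print(weighted_label(neighbors))   # b
```

Here the unweighted vote picks "a" (two votes to one), while the 1/distance vote picks "b" because the single "b" neighbor is much closer.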

### Discussion

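At K=1, kNN reduces noise in the input by snapping the given point to its nearest representative in the training set.
In doing so, it converts an unlabeled given point to a labeled one from the training set.
The prediction then rests on a single neighbor, so noise in the training labels passes straight through.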

At K>=1, kNN applies the majority class label among the K nearest neighbors.  

The distance metric is usually Euclidean but could be Manhattan, etc.   

The inference could be uniform/unweighted, so the K nearest neighbors contribute equally.
Or it could be Gaussian/weighted so closest neighbors contribute the most.  
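
In scikit-learn, the uniform versus weighted choice maps onto the `weights` parameter of `KNeighborsClassifier`, which accepts `"uniform"`, `"distance"` (inverse distance), or a callable. The sketch below assumes the Iris data and K=5; `gaussian_weights` and its `bandwidth` are hypothetical, not part of the library.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def gaussian_weights(distances, bandwidth=1.0):
    """Hypothetical helper: weight falls off as a Gaussian of the distance."""
    return np.exp(-(distances ** 2) / (2 * bandwidth ** 2))

models = {
    "uniform": KNeighborsClassifier(n_neighbors=5, weights="uniform"),
    "inverse distance": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "gaussian": KNeighborsClassifier(n_neighbors=5, weights=gaussian_weights),
}

for name, model in models.items():
    model.fit(X, y)
    # The first three Iris rows are setosa (class 0).
    print(name, model.predict(X[:3]))
```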

The decision boundaries are non-linear for any K, though larger K smooths them.
Thus the results are sensitive to distributions, sampling, noise, unequal scaling.  

kNN performs poorly when the number of dimensions exceeds the number of data points.

Accuracy guarantees are possible for various special cases, such as:
K=1; binary classification on infinite data; etc.
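
For the K=1 binary case, the classical asymptotic bound (Cover and Hart, 1967) can be written as below. The symbols are notation introduced here, not from these notes: $R^*$ is the Bayes error rate and $R_{1\mathrm{NN}}$ is the error rate of the 1-nearest-neighbor rule as the number of training points goes to infinity.

```latex
R^* \;\le\; R_{1\mathrm{NN}} \;\le\; 2R^*(1 - R^*) \;\le\; 2R^*
```

In words: with unlimited data, 1-NN is at most twice as bad as the best possible classifier.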

### Optimizations

For speed,
use a reduced set of points called prototypes.
This optimization is called condensed nearest neighbors.
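
A minimal sketch of the condensing idea, loosely following Hart's procedure: seed the prototype set with one point, then add any training point the current prototypes misclassify under 1-NN, and repeat until a full pass adds nothing. The function names and toy data are assumptions for illustration.

```python
import math

def nearest_label(store, query):
    """1-NN label of query using only the (point, label) pairs in store."""
    point, label = min(store, key=lambda item: math.dist(item[0], query))
    return label

def condense(points, labels):
    """Keep only the points that the current prototypes misclassify."""
    data = list(zip(points, labels))
    store = [data[0]]                      # seed with the first point
    changed = True
    while changed:                         # repeat until a full pass adds nothing
        changed = False
        for item in data:
            if item in store:
                continue
            if nearest_label(store, item[0]) != item[1]:
                store.append(item)
                changed = True
    return store

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
          (3.0, 3.0), (3.1, 3.2), (3.2, 3.1), (2.9, 3.3)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

prototypes = condense(points, labels)
print(len(points), "->", len(prototypes))   # 8 -> 2
```

With two well-separated clusters, one prototype per cluster is enough to classify every training point correctly.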

For speed, 
do some pre-processing to avoid computing point-to-all distances.
Precompute the pair-wise distances, sort the data, or organize it as a graph or tree.
This can reduce prediction from O(n) to roughly O(log(n)) per query, at least in low dimensions.
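
One way to try the tree idea is scikit-learn's `KDTree`. The sketch below uses synthetic 3-dimensional points (an assumption, chosen because trees help most in low dimensions) and checks the tree's answer against a brute-force search.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))          # synthetic points in 3 dimensions

tree = KDTree(X)                   # built once, reused for every query
query = X[:1]                      # queries are 2-D: (n_queries, n_features)
dist, ind = tree.query(query, k=3) # distances and indices of the 3 nearest

# Brute-force check: distance from the query to all points, then sort.
brute = np.argsort(np.linalg.norm(X - query, axis=1))[:3]
print(ind[0], brute)
```

The query point is itself in the tree, so its nearest neighbor is itself at distance zero.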

### Choice of K
The best value for the critical parameter K is anybody's guess.  
It is chosen by heuristics,
or by bootstrap (testing various K on random subsets).
Choose odd K for binary classification (just to avoid tie votes).
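
A sketch of searching for K by resampling. Note the swap: this uses 5-fold cross-validation (`cross_val_score`) rather than the bootstrap named above, since it is the more common tool in scikit-learn. The Iris data and the range of K tried are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

mean_accuracy = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on each of 5 folds
    mean_accuracy[k] = scores.mean()

# Pick the K with the highest mean accuracy across folds.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, round(mean_accuracy[best_k], 3))
```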

### Choice of distance metric
Conceive of points or vectors in N-dimensional space.  
Usually use Euclidean distance on real-valued data, and
Hamming distance on discrete data, e.g. word frequencies of documents.
The best distance metric can be learned.
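
For reference, the metrics named in these notes (Euclidean, Manhattan, Hamming) are each a few lines of plain Python. The function names are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))              # 5.0
print(manhattan((0, 0), (3, 4)))              # 7
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))    # 2
```

In scikit-learn the metric is selected with the `metric` parameter of `KNeighborsClassifier`, for example `metric="manhattan"`.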


## kNN and Other Algorithms
Self-organizing maps (SOM)
use the same induction rule as kNN, but use a modified search space.

KNN by itself performs poorly on high-dimensional data due to
the curse of dimensionality.
For real-time forecasting on high-dimensional big data, KNN is applied to:
* sketches
* locality sensitive hashing codes
* the data after feature extraction
* the data after Haar face detection
* clusters labeled by mean-shift segmentation
* data transformed by PCA  
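
As a sketch of the last item in the list, PCA and kNN can be chained in a scikit-learn pipeline. The digits dataset (64 pixel features), the 16 components, and K=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)            # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce 64 dimensions to 16, then classify with kNN in the reduced space.
model = make_pipeline(PCA(n_components=16), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(round(accuracy, 3))
```

The pipeline fits PCA on the training data only and applies the same projection to the test points before the neighbor search.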

KNN can be used for outlier detection.
Choose a K and R. Visit every point. 
Define an outlier as any point where more than R of its K nearest neighbors are labeled differently.
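
A sketch of this outlier rule, under the assumption that it means "more than R of the K nearest neighbors carry a different label". The function name, the toy data, and the values K=3, R=1 are made up.

```python
import math

def label_outliers(points, labels, k=3, r=1):
    """Indices of points with more than r of their k nearest neighbors labeled differently."""
    outliers = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Sort the other points by distance to p and keep the k closest.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        disagreeing = sum(labels[j] != label for j in others)
        if disagreeing > r:
            outliers.append(i)
    return outliers

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.0), (5.1, 5.1), (5.2, 5.0), (5.1, 4.9),
          (0.15, 0.05)]
# The last point is labeled "b" but sits inside the "a" cluster.
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

print(label_outliers(points, labels))   # [8]
```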

The K Nearest Neighbors classifier is not related to K Means Clustering.

## Example: Iris dataset

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

In [2]:
iris.data.shape

(150, 4)

In [3]:
# Four numerical features
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
# Categorical labels
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
X = iris.data
y = iris.target

### Demo run
See model parameters at 
[sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import metrics

In [7]:
# Random partition into train/test
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [8]:
# Try various values of K
for neighbors in range(1,10):
    knn = KNN(n_neighbors=neighbors)
    knn.fit(X_train,y_train)   # no model is trained; fit only stores (and may index) the points and labels!
    y_pred = knn.predict(X_test)    # labels each test point by vote of its nearest training points
    score = metrics.accuracy_score(y_test,y_pred)   # a fraction in [0,1]
    print('%d neighbors, %.0f %% accuracy' % (neighbors,100*score) )

1 neighbors, 100 % accuracy
2 neighbors, 97 % accuracy
3 neighbors, 100 % accuracy
4 neighbors, 97 % accuracy
5 neighbors, 97 % accuracy
6 neighbors, 97 % accuracy
7 neighbors, 97 % accuracy
8 neighbors, 95 % accuracy
9 neighbors, 95 % accuracy


### Variance
We see that accuracy varies from run to run.
Run this notebook again and you get different results.
This is attributable to the random train/test split, not to kNN itself.
If you keep one train/test split and just rerun the KNN cell, the results are consistent.
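
One way to check this is to pin the split with the `random_state` argument of `train_test_split`. The sketch below assumes K=5 and a handful of arbitrary seeds; `accuracy_for_seed` is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def accuracy_for_seed(seed, k=5):
    """Accuracy of kNN on the one train/test split determined by seed."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same seed, same split, same score.
print(accuracy_for_seed(0), accuracy_for_seed(0))

# Different seeds give different splits, and possibly different scores.
seed_scores = [accuracy_for_seed(s) for s in range(5)]
print([round(s, 3) for s in seed_scores])
```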