# K Nearest Neighbors
Classification. Lazy (no training). Supervised. Non-parametric. Requires scaling.

Basic idea: pool information from neighboring data points to be robust to noise. 
At prediction time, given an instance, find its K nearest neighbors and 
predict the majority class among them.

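Naive search, O(N) per query: compute the distance from the instance to every stored point.    
With pre-processing, roughly O(lg(N)) per query: sort or compute pair-wise distances once, then organize the points as a tree (e.g. a k-d tree).    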
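
A minimal sketch of the two search strategies above, assuming random made-up points and using `KDTree` from scikit-learn as one possible tree structure. Both searches should return the same neighbors; only the cost per query differs.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # made-up stored points
query = rng.normal(size=(1, 3))       # one made-up query instance
k = 5

# Naive search: distance from the query to every stored point, O(N) per query.
dists = np.linalg.norm(points - query, axis=1)
brute_idx = np.argsort(dists)[:k]

# Tree search: build the tree once, then each query visits only part of it.
tree = KDTree(points)
tree_dist, tree_idx = tree.query(query, k=k)

print(sorted(brute_idx.tolist()), sorted(tree_idx[0].tolist()))
```

The tree pays off mainly in low dimensions and when many queries reuse the same stored points.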

Problems:  
The decision boundaries are highly non-linear at any K and follow individual training points.
Thus the results are sensitive to noise and, because neighbors are chosen by distance, to unequal feature scaling.
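
A small sketch of the scaling problem, using made-up data (height in meters, income in dollars) and a hypothetical helper `nearest_other`. On raw values the large-valued feature dominates the distance; after standardizing with `StandardScaler`, the nearest neighbor of the first point changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is height (m), column 1 is income ($).
data = np.array([[1.60, 30000.0],
                 [1.95, 30500.0],
                 [1.61, 33000.0],
                 [1.90, 36000.0],
                 [1.65, 27000.0]])

def nearest_other(points, i=0):
    """Index of the point closest to points[i], excluding i itself."""
    d = np.linalg.norm(points - points[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

raw_nn = nearest_other(data)                                     # income dominates
scaled_nn = nearest_other(StandardScaler().fit_transform(data))  # both features count

print(raw_nn, scaled_nn)
```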

There are accuracy guarantees for various special cases
(K=1; binary classification on infinite data; etc.).

In condensed nearest neighbors, 
use a reduced set of prototypes instead of all the data.
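
One way to build such a prototype set is sketched below on Iris. It follows the usual condensing loop (keep a point only if the current prototypes misclassify it with 1-NN); the random visiting order and the seed are arbitrary choices for illustration, and other variants exist.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
order = [int(i) for i in rng.permutation(len(X))]   # arbitrary visiting order

keep = [order[0]]          # start the prototype set with a single point
changed = True
while changed:             # repeat passes until a full pass adds nothing
    changed = False
    for i in order:
        if i in keep:
            continue
        # 1-NN prediction for point i using only the current prototypes
        d = np.linalg.norm(X[keep] - X[i], axis=1)
        predicted = y[keep][np.argmin(d)]
        if predicted != y[i]:
            keep.append(i)  # prototypes get this point wrong, so keep it
            changed = True

print(len(keep), "prototypes out of", len(X))
```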

## Decision rule
Possible return values:  
The winner-take-all class of the majority of neighbors.  
The centroid: a representative value, point in space, mean. 
For continuous data and regression.   
Medoid: a representative instance that actually exists in the data, analogous to the median. 
For discrete data and classification.  
A weighted mean: weight each neighbor by 1/distance.    
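
A sketch contrasting the plain majority vote with 1/distance weighting, on made-up 1-D data. It uses the `weights` parameter of scikit-learn's `KNeighborsClassifier` (`"uniform"` or `"distance"`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: one class-1 point near the query, two class-0 points farther away.
X_toy = np.array([[0.0], [3.0], [4.0]])
y_toy = np.array([1, 0, 0])
query = np.array([[1.0]])   # distances to the three points: 1, 2, 3

majority = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Plain vote: two of the three neighbors are class 0.
print(majority.predict(query))
# Weighted vote: 1/1 for class 1 against 1/2 + 1/3 for class 0.
print(weighted.predict(query))
```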

## Choice of K
Done by heuristics. 
Or by bootstrap (testing various K on random subsets).
Choose odd K for binary classification to avoid tied votes. 
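
A rough sketch of the bootstrap approach on Iris: train on a sample drawn with replacement, score on the points that were not drawn, and average over rounds. The candidate K values, the number of rounds, and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
n_rounds = 30                      # arbitrary number of bootstrap rounds

mean_acc = {}
for k in (1, 3, 5, 7, 9):          # arbitrary candidate values of K
    accs = []
    for _ in range(n_rounds):
        boot = rng.integers(0, n, size=n)        # sample with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # points left out of the sample
        model = KNeighborsClassifier(n_neighbors=k).fit(X[boot], y[boot])
        accs.append(model.score(X[oob], y[oob]))
    mean_acc[k] = float(np.mean(accs))

best_k = max(mean_acc, key=mean_acc.get)
print(mean_acc, best_k)
```

On a dataset this small the differences between K values are often within the noise of the resampling.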

## Choice of distance metric
Conceive of points or vectors in N-dimensional space.  
Usually use Euclidean distances on real data,
Hamming distance on discrete data e.g. word frequencies of documents. 
The best distance metric can be learned.
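
A sketch of switching the metric via the `metric` parameter of `KNeighborsClassifier`. The data are made-up binary word-presence vectors (an assumption for illustration, simpler than word frequencies), compared with Hamming distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: each row marks which of four words occur in a document.
docs = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 0],
                 [0, 0, 1, 1],
                 [0, 1, 1, 1]])
labels = np.array([0, 0, 1, 1])

knn_hamming = KNeighborsClassifier(n_neighbors=1, metric="hamming").fit(docs, labels)

# The new document differs from row 0 in one position and from the others in more.
new_doc = np.array([[1, 0, 0, 0]])
print(knn_hamming.predict(new_doc))
```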


## Relations to other algorithms
Self organizing maps (SOM): same idea on a modified search space. 

KNN is often the classifier used after feature extraction
by Haar face detection, or mean-shift, or PCA, etc.  
KNN by itself performs poorly on high-dimensional data due to 
curse of dimensionality.
For real-time forecasting on high-dimensional big data,
KNN is applied to sketches or locality-sensitive hashing.

KNN can be used for outlier detection.
Choose a K, find instances that are misclassified.
Remove outliers with more than R neighbors of another class.
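
A sketch of this filter on Iris, with arbitrary K and R. It relies on `NearestNeighbors.kneighbors()` called without arguments, which returns the neighbors of each stored point and does not count the point as its own neighbor.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X, y = load_iris(return_X_y=True)
K, R = 5, 2                        # arbitrary choices for illustration

nn = NearestNeighbors(n_neighbors=K).fit(X)
_, idx = nn.kneighbors()           # K neighbors of every stored point, self excluded

# For each point, count neighbors whose class differs from its own.
n_other = (y[idx] != y[:, None]).sum(axis=1)
outliers = np.flatnonzero(n_other > R)

keep_mask = n_other <= R
X_clean, y_clean = X[keep_mask], y[keep_mask]
print(len(outliers), "flagged;", len(X_clean), "kept")
```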

## Sample data: Iris

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

In [2]:
iris.data.shape

(150, 4)

In [3]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
X = iris.data
y = iris.target

## Demo run
See model parameters at 
[sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import metrics

In [9]:
X_train,X_test,y_train,y_test = train_test_split(X,y)

In [12]:
for neighbors in range(1,10):
    knn = KNN(n_neighbors=neighbors)
    knn.fit(X_train,y_train)   # lazy learner: only stores (and indexes) the points and labels
    y_pred = knn.predict(X_test)    # finds the nearest training points for each test point
    score = metrics.accuracy_score(y_test,y_pred)   # fraction correct, between 0 and 1
    print('%d neighbors, %4.2f accuracy' % (neighbors,score) )

1 neighbors, 0.95 accuracy
2 neighbors, 0.95 accuracy
3 neighbors, 0.95 accuracy
4 neighbors, 0.97 accuracy
5 neighbors, 0.95 accuracy
6 neighbors, 0.92 accuracy
7 neighbors, 0.95 accuracy
8 neighbors, 0.92 accuracy
9 neighbors, 0.95 accuracy


The accuracy varies from run to run. Run it again and you get different results. 

But this is merely due to the random train/test split. KNN itself is deterministic, so for a given train/test split the results are consistent.
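
A short sketch of that point, fixing the split with the `random_state` parameter of `train_test_split`. The helper name `accuracy_for_seed` and the choice of K=5 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def accuracy_for_seed(seed, k=5):
    """Accuracy of K-NN on the train/test split determined by `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same seed, same split, same accuracy.
print(accuracy_for_seed(0), accuracy_for_seed(0))
# Different seeds give different splits, so the accuracy may differ.
print([round(accuracy_for_seed(s), 2) for s in range(5)])
```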