# K-Nearest Neighbors

The k-Nearest Neighbor (kNN) method makes predictions by locating the cases most similar to a given data instance (using a similarity function) and returning the average (for regression) or the majority class (for classification) of those most similar instances. The kNN algorithm can be used for classification or regression.
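The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, the choice of Euclidean distance as the similarity function, and the toy data are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point (hypothetical helper, Euclidean distance assumed)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Distance from the query to every stored training instance
    diffs = X_train - np.asarray(x_query, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k closest instances
    nearest = np.argsort(distances, kind="stable")[:k]
    if task == "regression":
        # Regression: average the neighbors' target values
        return float(y_train[nearest].mean())
    # Classification: majority vote among the neighbors' labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data invented for illustration: two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8]]
y_class = [0, 0, 0, 1, 1]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0]

print(knn_predict(X_toy, y_class, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_value, [1.1, 1.0], k=3, task="regression"))
```

The same neighbor search serves both tasks; only the final aggregation step (vote or mean) differs.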

This recipe uses the kNN algorithm to make predictions for the iris dataset (classification).

In [1]:
from sklearn import datasets 
from sklearn import metrics 
from sklearn.neighbors import KNeighborsClassifier 

Load the Iris dataset

Iris flower dataset (150 samples x 4 features, reals, multi-class classification)

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa = 0
-- Iris Versicolour = 1
-- Iris Virginica = 2

In [4]:
dataset = datasets.load_iris() 
print(dataset.data[0:10])
print(dataset.target[0:50])

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]


###  Fit a k-nearest neighbor model to the data

In [7]:
model = KNeighborsClassifier(n_neighbors=13) 
model.fit(dataset.data, dataset.target) 
print(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=13, p=2,
           weights='uniform')


Make predictions

In [6]:
expected = dataset.target 
predicted = model.predict(dataset.data)

Summarize the fit of the model 

In [5]:
print(metrics.classification_report(expected, predicted)) 
print(metrics.confusion_matrix(expected, predicted)) 

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.96      0.94      0.95        50
          2       0.94      0.96      0.95        50

avg / total       0.97      0.97      0.97       150

[[50  0  0]
 [ 0 47  3]
 [ 0  2 48]]


The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
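Both ratios can be read directly off the confusion matrix printed above. The sketch below copies that matrix by hand and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[50, 0, 0],
               [0, 47, 3],
               [0, 2, 48]])

tp = np.diag(cm)            # samples of each class predicted correctly
fp = cm.sum(axis=0) - tp    # predicted as the class but belonging to another
fn = cm.sum(axis=1) - tp    # belonging to the class but predicted as another

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(np.round(precision, 2))
print(np.round(recall, 2))
```

Rounded to two decimals, the per-class values agree with the precision and recall columns of the classification report.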

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
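Written out, the score is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with P the precision and R the recall. A small sketch follows; the helper name `f_beta` is hypothetical, and the class 1 inputs are taken from the confusion matrix above (47 true positives, 2 false positives, 3 false negatives).

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (hypothetical helper)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention chosen here for the degenerate case
    return (1 + b2) * precision * recall / denom

# Class 1 from the confusion matrix above
p = 47 / (47 + 2)
r = 47 / (47 + 3)
print(round(f_beta(p, r), 2))  # beta = 1 gives the f1-score column

# With beta = 2, a high-recall classifier scores above a high-precision one
print(f_beta(0.5, 1.0, beta=2.0) > f_beta(1.0, 0.5, beta=2.0))
```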

The support is the number of occurrences of each class in y_true.
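The report above scores the model on the same 150 rows it was fitted to, so it measures fit rather than performance on unseen flowers. One possible way to estimate the latter is to hold out part of the data. The sketch below assumes a 70/30 stratified split and `random_state=0`, both arbitrary choices that are not part of the original recipe.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out 30% of the rows; stratify keeps the three classes balanced in both parts
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0)

# Same setting as the recipe, but fitted on the training part only
held_out_model = KNeighborsClassifier(n_neighbors=13)
held_out_model.fit(X_train, y_train)
y_pred = held_out_model.predict(X_test)

print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
```

With the held-out rows, the support column counts test samples only (15 per class under this split).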

### This recipe uses the kNN algorithm to make predictions for the diabetes dataset (regression).

In [6]:
import numpy as np 
from sklearn import datasets 
from sklearn.neighbors import KNeighborsRegressor

Load the diabetes dataset.

Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients.

The target is a quantitative measure of disease progression one year after baseline.
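The copy of this dataset shipped with scikit-learn stores the ten features already mean-centered and rescaled so that each column's squared values sum to 1, which is why the feature values are small and signed rather than in natural units. Since kNN relies on distances, it helps to know the columns share a common scale. A quick check, assuming the default `load_diabetes()` settings:

```python
import numpy as np
from sklearn import datasets

X = datasets.load_diabetes().data

col_means = X.mean(axis=0)          # expected to be close to 0 for every column
col_sq_sums = (X ** 2).sum(axis=0)  # expected to be close to 1 for every column

print(X.shape)
print(np.round(col_means, 6))
print(np.round(col_sq_sums, 6))
```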

In [57]:
dataset = datasets.load_diabetes()
print(dataset.data[0:10])
print(dataset.target[0:10])

[[ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990842 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632783 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06832974 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567061 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286377 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665645  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02269202 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187235  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03199144 -0.04664087]
 [-0.09269548 -0.04464164 -0.04069594 -0.01944209 -0.06899065 -0.07928784
   0.04127682 -0.0763945  -0.04118039 -0.09634616]
 [-0.04547248  0.05068012 -0.04716281 -0.01599922 -0.04009564 -0.02480001
   0.00077881 -0.03949338 -0.06291295 -0.03835666]
 [ 0.06350368  0.05068012 -0.00189471  0.06662967  0.09061988  0.10891438
   0.02286863  0.01770335 -0.03581673  0.00306441]


Fit a model to the data 

In [7]:
model = KNeighborsRegressor() 
model.fit(dataset.data, dataset.target) 
print(model) 

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')


Make predictions

In [8]:
expected = dataset.target 
predicted = model.predict(dataset.data) 

Summarize the fit of the model

In [9]:
mse = np.mean((predicted-expected)**2) 
print(mse) 
print(model.score(dataset.data, dataset.target))

0.02
0.97
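The two numbers are different kinds of quantities: the first is the mean squared error, in squared units of the target, and the second is what `score` returns for a scikit-learn regressor, the coefficient of determination R² = 1 - SS_res / SS_tot. The sketch below recomputes both by hand and compares them with the library calls. It refits the same default model under new names (`diabetes`, `reg`) so it can run on its own.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.neighbors import KNeighborsRegressor

diabetes = datasets.load_diabetes()
reg = KNeighborsRegressor().fit(diabetes.data, diabetes.target)

y_true = diabetes.target
y_hat = reg.predict(diabetes.data)

# Mean squared error: average squared residual
mse = np.mean((y_hat - y_true) ** 2)

# R^2: share of the target's variance explained by the predictions
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, metrics.mean_squared_error(y_true, y_hat))
print(r2, reg.score(diabetes.data, diabetes.target))
```

Each printed pair should agree up to floating-point error. As with the classifier, both values here describe fit on the training rows.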
