## *k*-NN in scikit-learn using correlation
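A simple notebook to show the merits of using `'correlation'` as the metric with *k*-NN: it matches examples by the shape of their feature profile rather than by the magnitude of their values.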

In [None]:
import pandas as pd
import numpy as np

from sklearn.neighbors import NearestNeighbors

In [None]:
cols = ['Ex','X0','X1','X2','X3','X4','X5','X6','X7','X8','X9']
cd = [['A',3,4,5,5,4,3,2,3,4,3],
      ['B',9,9,8,8,9,10,10,9,8,9],
      ['Q',8,8,9,9,8,7,7,8,9,8]]
corr = pd.DataFrame(cd, columns = cols)
corr = corr.set_index('Ex')
corr

In [None]:
%matplotlib inline
pl = corr.T.plot(linewidth=2,legend = False)
pl.set_xlabel('Feature',fontsize = 12)
pl.set_ylabel('Value',fontsize = 12)
pl.text(0.5,3, 'A', fontsize=12)
pl.text(0.5,7.5, 'Q', fontsize=12)
pl.text(0.5,9.2, 'B', fontsize=12)

In [None]:
X = corr.values
X

In [None]:
q = X[2]    # the query example, Q
X = X[:2]   # the training examples, A and B
X

Set up two `NearestNeighbors` objects, one using Euclidean distance and one using correlation distance.  
*These are not classifiers; they only retrieve neighbours.*

In [None]:
eNN = NearestNeighbors(metric='euclidean').fit(X)
corrNN = NearestNeighbors(metric='correlation').fit(X)

In [None]:
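# kneighbors returns (distances, indices); [1][0][0] is the index of the nearest row
e_nn = eNN.kneighbors([q], n_neighbors=1)
print('Nearest Neighbour by Euclidean dist. is', corr.index[e_nn[1][0][0]])
c_nn = corrNN.kneighbors([q], n_neighbors=1)
print('Nearest Neighbour by correlation is', corr.index[c_nn[1][0][0]])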
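
Because `NearestNeighbors` only retrieves neighbours, turning this into a classifier needs labels. The following is a minimal sketch, assuming hypothetical class labels `'low'` for A and `'high'` for B (the data in this notebook has no labels). `KNeighborsClassifier` takes the same `metric` argument; `algorithm='brute'` is set explicitly as a cautious choice, on the assumption that the tree-based indexes do not support correlation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as above: rows A and B are the training examples, Q is the query.
train = np.array([[3, 4, 5, 5, 4, 3, 2, 3, 4, 3],       # A
                  [9, 9, 8, 8, 9, 10, 10, 9, 8, 9]],    # B
                 dtype=float)
query = np.array([[8, 8, 9, 9, 8, 7, 7, 8, 9, 8]], dtype=float)  # Q

# Hypothetical labels, invented for illustration only.
labels = np.array(['low', 'high'])

predictions = {}
for metric in ('euclidean', 'correlation'):
    clf = KNeighborsClassifier(n_neighbors=1, metric=metric, algorithm='brute')
    clf.fit(train, labels)
    predictions[metric] = clf.predict(query)[0]
    print(metric, '->', predictions[metric])
```

With a single neighbour, each prediction is simply the label of the nearest training row under that metric.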

### How is correlation used?
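`1 - correlation(q,x)` is used as the distance measure, so perfectly correlated vectors are at distance 0 and perfectly anti-correlated vectors are at distance 2.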
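
To make the formula concrete, here is a small sketch that computes the distance by hand: centre both vectors on their means, take the cosine of the angle between the centred vectors (Pearson's r), and subtract it from 1. The helper name `correlation_distance` and the variable names are illustrative and not part of scikit-learn.

```python
import numpy as np

def correlation_distance(u, v):
    """Return 1 - Pearson correlation between two 1-D sequences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()   # centre each vector on its own mean
    vc = v - v.mean()
    r = (uc @ vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))
    return 1.0 - r

a = [3, 4, 5, 5, 4, 3, 2, 3, 4, 3]        # example A
b = [9, 9, 8, 8, 9, 10, 10, 9, 8, 9]      # example B
q_vec = [8, 8, 9, 9, 8, 7, 7, 8, 9, 8]    # query Q

d_qa = correlation_distance(q_vec, a)
d_qb = correlation_distance(q_vec, b)
print('distance Q-A:', d_qa)
print('distance Q-B:', d_qb)
```

In this data B is an exact mirror image of Q about their means, so the pair is perfectly anti-correlated and sits at the maximum distance.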

In [None]:
# The Euclidean distances:
dist, nns = eNN.kneighbors([q], 2)
print('Neighbours:', nns)
print('Distances:', dist)


In [None]:
# The correlation distances:
dist, nns = corrNN.kneighbors([q], 2)
print('Neighbours:', nns)
print('Distances:', dist)

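NumPy has a function for calculating correlation, `np.corrcoef`. It returns the 2×2 correlation matrix, so the off-diagonal entry `[0,1]` is the correlation between the two vectors.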

In [None]:
rq0 = np.corrcoef(q, X[0])
print(rq0)

In [None]:
rq1 = np.corrcoef(q, X[1])
print(rq1)

The distance used in `NearestNeighbors` for correlation is `1 - correlation(q,x)`, which reproduces the distances returned by `kneighbors` above.

In [None]:
1-rq0[0,1]

In [None]:
1-rq1[0,1]
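
One consequence of using `1 - correlation` is that the distance ignores offset and positive scaling, whereas Euclidean distance does not. The sketch below illustrates this with SciPy's `correlation` and `euclidean` functions; the shift and scale factors are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import correlation, euclidean

a = np.array([3, 4, 5, 5, 4, 3, 2, 3, 4, 3], dtype=float)      # example A
q_vec = np.array([8, 8, 9, 9, 8, 7, 7, 8, 9, 8], dtype=float)  # query Q

base_corr = correlation(q_vec, a)
base_eucl = euclidean(q_vec, a)

# Rescale and shift the query; the shape of its profile is unchanged.
moved = 3.0 * q_vec + 10.0

moved_corr = correlation(moved, a)
moved_eucl = euclidean(moved, a)

print('correlation distance:', base_corr, '->', moved_corr)
print('Euclidean distance:  ', base_eucl, '->', moved_eucl)
```

The correlation distance is unchanged up to floating-point error while the Euclidean distance grows, which is why Q matches A under correlation even though B is closer in value.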