## *k*-NN in scikit-learn using Cosine
A simple notebook to show the merits of using 'cosine' as the metric with *k*-NN.

In [None]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.lines as mlines

Three data points: C and D are the stored examples and Q is the query.

In [None]:
cosi = pd.DataFrame(
    [['C',1.0,2.0],
     ['D',4.5,2.5],
     ['Q',2.0,1.5]],
    columns=['Ex','X1','X2'])

cosi = cosi.set_index('Ex')
cosi

Create a new dataframe with normalised data.  
The data is normalised so that the rows (data vectors) have unit length. 

In [None]:
from sklearn.preprocessing import Normalizer
cosi_n = cosi.copy()
cosi_n.iloc[:,:] = Normalizer(norm='l2').fit_transform(cosi)
cosi_n
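The row normalisation above can also be checked by hand. The following is a minimal NumPy sketch, not part of the original notebook; the names `points`, `row_norms` and `points_unit` are illustrative. It divides each row by its L2 norm and confirms that every row ends up with length 1.

```python
import numpy as np

# Same three points as in the notebook: C, D and the query Q.
points = np.array([[1.0, 2.0],   # C
                   [4.5, 2.5],   # D
                   [2.0, 1.5]])  # Q

# L2 norm of each row, kept as a column so it broadcasts across the features.
row_norms = np.linalg.norm(points, axis=1, keepdims=True)
points_unit = points / row_norms

print(points_unit)
print(np.linalg.norm(points_unit, axis=1))  # length of each normalised row
```

This should agree with the `cosi_n` dataframe produced by `Normalizer(norm='l2')`.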

Plot the normalised data and draw in the original vectors. 

In [None]:
%matplotlib inline
pl = cosi_n.plot.scatter('X1','X2',figsize=(5,3))
pl.set_xlim(0,5)
pl.set_ylim(0,3)
c1, c2 = [0,1], [0,2]
d1, d2 = [0,4.5], [0,2.5]
q1, q2 = [0,2], [0,1.5]
pl.plot(c1,c2,d1,d2,q1,q2, marker = 'o', markersize = 10,linewidth = 2)
pl.text(1.2,2, 'C', fontsize=12)
pl.text(4.7,2.3, 'D', fontsize=12)
pl.text(2.2,1.5, 'Q', fontsize=12)
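One reason to look at the unit-length rows: for unit vectors, the squared Euclidean distance equals `2 * (1 - cosine similarity)`, so ranking by Euclidean distance on the normalised data gives the same order as ranking by cosine distance on the raw data. The sketch below checks this with NumPy for the three points; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

points = np.array([[1.0, 2.0],   # C
                   [4.5, 2.5],   # D
                   [2.0, 1.5]])  # Q

# Scale every row to unit length.
unit = points / np.linalg.norm(points, axis=1, keepdims=True)
c_u, d_u, q_u = unit

# For unit vectors the dot product is the cosine similarity.
cos_sim = np.array([q_u @ c_u, q_u @ d_u])

# Euclidean distance from Q to C and from Q to D on the unit-length rows.
euclid = np.array([np.linalg.norm(q_u - c_u), np.linalg.norm(q_u - d_u)])

print('Squared Euclidean (unit rows):', euclid ** 2)
print('2 * (1 - cosine similarity):  ', 2 * (1 - cos_sim))
```

On the normalised points, D is closer to Q than C is, which is what the plot shows.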

In [None]:
X = cosi.values
q = X[2]     # the query Q is the last row
X = X[:2]    # keep C (index 0) and D (index 1) as the stored examples

In [None]:
NN = NearestNeighbors(metric='euclidean')
eNN = NN.fit(X) 
NN = NearestNeighbors(metric='cosine')
cosNN = NN.fit(X) 

Nearest neighbour for `q` using Euclidean distance is `C` (index 0). 

In [None]:
dist, nns = eNN.kneighbors([q], 2)
print('Neighbours:', nns)
print('Distances:', dist)

Nearest neighbour for `q` using Cosine distance is `D` (index 1). 

In [None]:
dist, nns = cosNN.kneighbors([q], 2)
print('Neighbours:', nns)
print('Distances:', dist)

We can use the `cosine_similarity` metric from `scikit-learn` to check these distances.  
The Cosine Distance is `1 - cosine_similarity`.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
c0 = cosine_similarity([q], [X[0]])
c1 = cosine_similarity([q], [X[1]])
# Printed in neighbour order: D (c1) first, then C (c0).
print('Cosine Similarity', c1, c0)
print('1 - Cosine Similarity', 1 - c1, 1 - c0)
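The same quantity can be computed directly from its definition, `1 - (a · b) / (‖a‖ ‖b‖)`. The following is a minimal sketch; the helper `cosine_distance` and the variable names are hypothetical and not part of scikit-learn or the notebook.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

q_vec = np.array([2.0, 1.5])   # query Q
c_vec = np.array([1.0, 2.0])   # example C
d_vec = np.array([4.5, 2.5])   # example D

dist_qc = cosine_distance(q_vec, c_vec)
dist_qd = cosine_distance(q_vec, d_vec)

print('Cosine distance Q-C:', dist_qc)
print('Cosine distance Q-D:', dist_qd)
```

The Q-D distance is the smaller of the two, which matches the cosine nearest-neighbour result above.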

## Cosine Distance and Text Classification
CNAE-9 is a dataset from the UCI ML Repository.  
https://archive.ics.uci.edu/ml/index.php  
Compare Cosine Distance with Euclidean Distance using 10-fold cross validation.  
*Spoiler alert:* Cosine Distance wins. 

In [None]:
cosi = pd.read_csv('CNAE-9.csv', header = None)
y = cosi.pop(0).values   # column 0 is popped off and used as the class label
X = cosi.values
X.shape, y.shape

Set up two *k*-NN classifiers, one using Euclidean distance and one using Cosine.  

In [None]:
NN_e = KNeighborsClassifier(metric = 'euclidean')
NN_c = KNeighborsClassifier(metric = 'cosine')

In [None]:
scores = cross_val_score(NN_e, X, y, cv=10)
print("10x CV Accuracy (Euclidean Distance): {0:.2f}".format(scores.mean())) 
scores = cross_val_score(NN_c, X, y, cv=10)
print("10x CV Accuracy (Cosine): {0:.2f}".format(scores.mean())) 