# *k*-NN Basics in scikit-learn

### First use `NearestNeighbors` to identify neighbours.  

#### Two objects:   
- **KNeighborsClassifier** is the classifier object (fit, predict, etc)  
- **NearestNeighbors** is an object for returning NNs (not a classifier)

Athlete Selection Data  
First load dataset into a data frame.  
The `AthleteSelection.csv` file needs to be in the same directory as the notebook.

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import sys

from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
athlete = pd.read_csv('AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

In [None]:
names = athlete.index
names

#### Store features and labels in numpy arrays X and y
`X` is a numpy array containing the training features.  
`y` contains the class labels.   
`q` is a query example.

In [None]:
y = athlete.pop('Selected').values
X = athlete.values
q = [5.0,7.5]

#### Plot the data

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

color= ['red' if l == 'No' else 'green' for l in y]
x1 = X[:,0]
x2 = X[:,1]
plt.figure(figsize=(7,6))
plt.scatter(x1,x2, color=color)
plt.scatter(q[0],q[1],color='black')
plt.annotate('q',(q[0]+0.05,q[1]))
plt.title("Athlete Selection")
plt.xlabel("Speed")
plt.ylabel("Agility")
plt.grid()
red_patch = mpatches.Patch(color='red', label='Not Selected')
green_patch = mpatches.Patch(color='green', label='Selected')
plt.legend(handles=[red_patch, green_patch],loc=4)
for i, txt in enumerate(names):
    plt.annotate(txt, (x1[i]+0.05, x2[i]))

## Data Normalization
Features may be measured on very different scales.  
(Not really an issue here.)  
Rescale the data so that all features have the same scale. There are two options:
- N(0,1) - rescale each feature to zero mean and unit variance
- MinMax scaling - typically in the range (0,1)
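
As a minimal sketch of the two options side by side, the cell below applies `StandardScaler` and `MinMaxScaler` to a small made-up array (`X_toy` is an assumption for illustration, not the athlete data) and checks the resulting column statistics.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# N(0,1): each column ends up with zero mean and unit variance.
X_std = StandardScaler().fit_transform(X_toy)

# MinMax: each column is mapped onto the default range [0, 1].
X_mm = MinMaxScaler().fit_transform(X_toy)

print("StandardScaler mean/std:", X_std.mean(axis=0), X_std.std(axis=0))
print("MinMaxScaler min/max:   ", X_mm.min(axis=0), X_mm.max(axis=0))
```

Both scalers work column by column, so after either transform the two features contribute on comparable scales to a distance calculation.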

### N(0,1)

In [None]:
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
q_scaled = scaler.transform([q])
q_scaled

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
color= ['red' if l == 'No' else 'green' for l in y]
x1 = X_scaled[:,0]
x2 = X_scaled[:,1]
plt.figure(figsize=(7,6))
plt.scatter(x1,x2, color=color)
plt.scatter(q_scaled[0,0],q_scaled[0,1],color='black')
plt.annotate('q',(q_scaled[0,0]+0.05,q_scaled[0,1]))
plt.title("Athlete Selection (Normalized)")
plt.xlabel("Speed N(0,1)")
plt.ylabel("Agility N(0,1)")
plt.grid()
red_patch = mpatches.Patch(color='red', label='Not Selected')
green_patch = mpatches.Patch(color='green', label='Selected')
plt.legend(handles=[red_patch, green_patch],loc=4)
for i, txt in enumerate(names):
    plt.annotate(txt, (x1[i]+0.05, x2[i]))

#### Finding Neighbours
Find the first two NNs for `q`.

In [None]:
athlete_neigh = NearestNeighbors(n_neighbors=2)
athlete_neigh.fit(X_scaled) 

The distances and the indexes of the two NNs for `q`.

In [None]:
athlete_neigh.kneighbors(q_scaled, 2, return_distance=True)
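
`kneighbors` returns a tuple of two arrays, distances first and indices second, each with one row per query. A small sketch on made-up points (`X_toy` and `q_toy` are assumptions for illustration) shows how the result can be unpacked:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy points and a single query (made up for illustration).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
q_toy = np.array([[0.0, 1.0]])  # 2D: one row per query

nn = NearestNeighbors(n_neighbors=2).fit(X_toy)

# Unpack the tuple: both arrays have shape (n_queries, n_neighbors).
distances, indices = nn.kneighbors(q_toy)

for d, i in zip(distances[0], indices[0]):
    print("index {0}: distance {1:.3f}".format(i, d))
```

The neighbours are ordered from nearest to farthest, so `indices[0][0]` is the row of the closest training example.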

The three NNs for `q`

In [None]:
# Find three nearest neighbours for q
q3n = athlete_neigh.kneighbors(q_scaled, 3)[1][0]
# q3n contains the 'index' of the nearest neighbours
for n in q3n:
    print(names[n], end = ' ')

## *k*-NN Classifier
Use `KNeighborsClassifier` to build a *k*-NN classifier.  
Two methods:
- `fit` sets up the classifier with the training data, takes two arguments, the features and the labels. 
- `predict` produces the output for the test set (just one test example in this case).
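
With the default uniform weights, `predict` amounts to a majority vote among the *k* nearest neighbours. The sketch below, on made-up data (`X_toy`, `y_toy` and `q_toy` are assumptions for illustration), does the vote by hand with `NearestNeighbors` and compares it with `KNeighborsClassifier`:

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Toy training set (made up for illustration).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array(['No', 'No', 'Yes', 'Yes', 'Yes'])
q_toy = np.array([[0.5, 0.5]])

# Manual k-NN: find the 3 nearest neighbours, then take the most common label.
_, idx = NearestNeighbors(n_neighbors=3).fit(X_toy).kneighbors(q_toy)
vote = Counter(y_toy[idx[0]]).most_common(1)[0][0]

# The classifier object does the same thing in fit/predict form.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
pred = clf.predict(q_toy)[0]

print("manual vote:", vote, "| classifier:", pred)
```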


In [None]:
kNN = KNeighborsClassifier(n_neighbors = 3)
kNN = kNN.fit(X_scaled,y)

In [None]:
kNN.predict(q_scaled)

### Breast Cancer example
The `scikit-learn` distribution contains a number of example datasets.  
`load_breast_cancer()` loads the dataset as a dictionary-like object. 

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data["data"]
y = data["target"]
X.shape

In [None]:
data["feature_names"]

Let's look at the relative size of the features to see if normalisation is required.  
It is.

In [None]:
f_min = sys.maxsize
f_max = 0
# Loop over the feature columns; X[:, i] selects column i (X[i] would be a row)
for i in range(X.shape[1]):
    if X[:, i].mean() < f_min:
        f_min = X[:, i].mean()
    if X[:, i].mean() > f_max:
        f_max = X[:, i].mean()
print("Smallest Mean: {0:.2f}".format(f_min)) 
print("Largest Mean: {0:.2f}".format(f_max)) 
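
The same check can be written without a loop. This is a sketch only: it reloads the data under the assumed name `X_bc` so that the cell stands alone, and uses `mean(axis=0)` to get one mean per feature column.

```python
from sklearn.datasets import load_breast_cancer

# Reload the features under a separate name so this cell is self-contained.
X_bc = load_breast_cancer()["data"]

# axis=0 averages down the rows, giving one mean per feature column.
feature_means = X_bc.mean(axis=0)

print("Smallest Mean: {0:.2f}".format(feature_means.min()))
print("Largest Mean: {0:.2f}".format(feature_means.max()))
```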

Scale the data and use `train_test_split` to carve off 1/3 of the data to use as a test set. 

In [None]:
B_scaler = preprocessing.StandardScaler().fit(X)
X_scaled = B_scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=1/3)
X_train.shape, X_test.shape
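
A common variation is to split first and fit the scaler on the training portion only, so that the test set has no influence on the scaling statistics. The sketch below is one way to do that, not the approach used in this notebook. The variable names and `random_state=0` are assumptions added for illustration and repeatability.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split the unscaled data; random_state is fixed only to make the sketch repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=1/3, random_state=0)

# Fit the scaler on the training rows only, then apply it to both parts.
sc = StandardScaler().fit(X_tr)
X_tr_s = sc.transform(X_tr)
X_te_s = sc.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```

With this ordering the training columns have exactly zero mean after scaling, while the test columns are only approximately centred.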

`accuracy_score` calculates the accuracy of the predicted labels `y_dash` against the true labels `y_test`.

In [None]:
kNN = KNeighborsClassifier(n_neighbors = 3)
kNN = kNN.fit(X_train,y_train)
y_dash = kNN.predict(X_test)
accuracy_score(y_test, y_dash)

Compare the first 35 predictions with the actuals:

In [None]:
print(y_test[:35])
print(y_dash[:35])
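
Rather than comparing the two printouts by eye, the disagreements can be located programmatically. The sketch below uses short made-up label arrays (`y_true` and `y_pred` are assumptions standing in for `y_test` and `y_dash`) to find the mismatched positions and summarise them with `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels standing in for y_test and y_dash.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# Positions where the prediction differs from the actual label.
mismatches = np.flatnonzero(y_true != y_pred)

# Accuracy is the fraction of positions that agree.
acc = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print("Mismatched positions:", mismatches)
print("Accuracy: {0:.3f}".format(acc))
print(cm)
```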