# k-NN Tutorial
The default similarity metric for k-NN is Euclidean distance.  
In some circumstances another metric (or measure), such as correlation, is more appropriate.  
## Household Budget  
In this example, households are classified by how they allocate their budget, so correlation is a better measure of similarity than Euclidean distance: it compares the pattern of spending across categories rather than the absolute amounts.   
The objective is to replace Euclidean distance with correlation when selecting neighbours.
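
To see why the choice of measure matters, here is a minimal sketch on made-up spending profiles. The vectors `a`, `b` and `c` are illustrative assumptions and are not taken from `Household.csv`. Households `a` and `b` split their budget in the same proportions at different budget levels, while `c` has a budget close to `a` but a different spending pattern.

```python
import numpy as np
from scipy.spatial.distance import correlation, euclidean

# Hypothetical three-category budgets (illustrative values only)
a = np.array([1000.0, 2000.0, 1500.0])
b = 3 * a                                # same pattern as a, three times the budget
c = np.array([2000.0, 1000.0, 1500.0])   # same total as a, different pattern

d_euc_ab = euclidean(a, b)
d_euc_ac = euclidean(a, c)
d_cor_ab = correlation(a, b)   # correlation distance = 1 - Pearson correlation
d_cor_ac = correlation(a, c)

print(f"Euclidean:   a-b {d_euc_ab:.1f}, a-c {d_euc_ac:.1f}")
print(f"Correlation: a-b {d_cor_ab:.3f}, a-c {d_cor_ac:.3f}")
```

Euclidean distance ranks `c` as the closer neighbour of `a`, whereas correlation distance ranks `b` as closer because it ignores the overall scale of the budget.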

In [None]:
import pandas as pd
import numpy as np
import time

from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

house = pd.read_csv('Household.csv',index_col = 'Household')
house.head()

In [None]:
y = house.pop('Category').values
X = house.values
X[0]

In [None]:
y

In [None]:
q = [2500,3500,2000]
house2 = house.copy()
house2.loc['query'] = q

In [None]:
house2

In [None]:
%matplotlib inline
house2.T.plot()

In [None]:
house_kNN = KNeighborsClassifier(n_neighbors=1) 
house_kNN.fit(X,y)

In [None]:
print('Query is classified as',house_kNN.predict([q])[0] )

---
**Q4**   
Change the metric used by k-NN to correlation to see if it will predict the other class.

In [None]:
# Use correlation distance (1 - Pearson correlation) instead of the default Euclidean metric
house_kNN = KNeighborsClassifier(n_neighbors=1, metric='correlation')
house_kNN.fit(X, y)
print('Query is classified as', house_kNN.predict([q])[0])
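
The effect of switching the metric can also be checked on a small, self-contained example. The sketch below uses made-up data (`X_toy`, `y_toy` and `q_toy` are hypothetical and unrelated to the household file), chosen so that the query follows the spending pattern of class `A` but has a budget closer in size to class `B`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: class A spends most on the second category,
# class B spends most on the first (illustrative values only)
X_toy = np.array([[1000, 2000, 1500],
                  [1200, 2400, 1700],
                  [5000, 3000, 4000],
                  [5500, 3200, 4500]], dtype=float)
y_toy = np.array(['A', 'A', 'B', 'B'])

# Query with the class A pattern but a large overall budget
q_toy = [[3500.0, 7000.0, 5000.0]]

preds = {}
for metric in ('euclidean', 'correlation'):
    knn = KNeighborsClassifier(n_neighbors=1, metric=metric)
    knn.fit(X_toy, y_toy)
    preds[metric] = knn.predict(q_toy)[0]
    print(metric, '->', preds[metric])
```

On this toy data the Euclidean neighbour belongs to class `B` and the correlation neighbour to class `A`. Whether the same flip occurs for the household query depends on the data in `Household.csv`.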

**Q5**   
In the Data Normalisation example in the 02-kNN Notebook, replace the N(0,1) scaler with a min-max scaler.
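
The 02-kNN Notebook is not reproduced here, so the following is only a sketch of the swap on made-up data (`X_tr_demo` and `X_ts_demo` are hypothetical). It assumes the scaler is used through the same `fit`/`transform` pattern as elsewhere in this notebook; `preprocessing.MinMaxScaler` rescales each feature to the range [0, 1] based on the training data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical train/test features (illustrative values only)
X_tr_demo = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_ts_demo = np.array([[2.5, 700.0]])

# Fit on the training data only, then apply the same mapping to the test data
mm_scaler = preprocessing.MinMaxScaler().fit(X_tr_demo)
X_tr_mm = mm_scaler.transform(X_tr_demo)
X_ts_mm = mm_scaler.transform(X_ts_demo)

print(X_tr_mm)
print(X_ts_mm)   # test values can fall outside [0, 1]
```

Because the minimum and maximum come from the training set, a test value larger than the training maximum maps to a value above 1.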

**Q6**  
The code below loads the Sepsis dataset from the UCI repository (https://archive.ics.uci.edu/dataset/827/sepsis+survival+minimal+clinical+records).    
This dataset is divided into train and test sets and scaled.  
Then a *k*-NN classifier is trained and tested. The time to classify the test data is also recorded.   
`scikit-learn` provides two strategies to speed up *k*-NN, `ball_tree` and `kd_tree`.  
Compare the performance of these two algorithms with that of brute-force search (`brute`) by changing the `algorithm` parameter.

In [None]:
#pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch Sepsis dataset 
sepsis_survival_minimal_clinical_records = fetch_ucirepo(id=827) 
  
# data (as pandas dataframes) 
X_raw = sepsis_survival_minimal_clinical_records.data.features 
y = sepsis_survival_minimal_clinical_records.data.targets.values.ravel()
X_raw.shape, y.shape

In [None]:
X_tr_raw, X_ts_raw, y_train, y_test = train_test_split(X_raw, y, test_size=1/3)

scaler = preprocessing.StandardScaler().fit(X_tr_raw)  # fit the N(0,1) scaler on the training data only
X_train = scaler.transform(X_tr_raw)
X_test = scaler.transform(X_ts_raw)
X_train.shape, X_test.shape

In [None]:
Sep_kNN = KNeighborsClassifier(n_neighbors=3, algorithm = 'brute') 
Sep_kNN.fit(X_train,y_train)

In [None]:
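# Time the classification of the test set; perf_counter is better suited to short intervals
t_start = time.perf_counter()
acc = Sep_kNN.score(X_test, y_test)
t = time.perf_counter() - t_start
print('Time: %5.2f Accuracy: %5.2f' % (t, acc))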

In [None]:
Sep_kNN.get_params()
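
One possible way to run the comparison is to loop over the three values of `algorithm` and time `score` for each. The sketch below uses synthetic stand-in data from `make_classification` so that it runs without downloading anything; the data sizes and the names `X_syn`, `y_syn` and `results` are assumptions for illustration, not part of the exercise.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the Sepsis dataset)
X_syn, y_syn = make_classification(n_samples=3000, n_features=3, n_informative=3,
                                   n_redundant=0, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X_syn, y_syn, test_size=1/3, random_state=0)

results = {}
for alg in ('brute', 'ball_tree', 'kd_tree'):
    knn = KNeighborsClassifier(n_neighbors=3, algorithm=alg).fit(X_tr, y_tr)
    t_start = time.perf_counter()          # time only the classification step
    acc = knn.score(X_ts, y_ts)
    results[alg] = (time.perf_counter() - t_start, acc)
    print('%-9s Time: %6.3f s  Accuracy: %5.2f' % (alg, *results[alg]))
```

To apply the same loop to the Sepsis data, substitute `X_train`, `y_train`, `X_test` and `y_test` from the cells above. Timings vary between runs and machines, so repeating the measurement a few times gives a fairer comparison.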