# k-NN with sklearn

This notebook was written by Gael Lederrey and Tim Hillel (tim.hillel@epfl.ch) for the Decision-aid methodologies in transportation course at EPFL (http://edu.epfl.ch/coursebook/en/decision-aid-methodologies-in-transportation-CIVIL-557).

Please contact the authors before distributing or reusing the material below.

## Overview

Now that we've implemented the k-NN algorithm, we will see how to use it with the `scikit-learn` library. In this notebook, we will learn to:

1. Scale data using scikit-learn
2. Use classifiers from scikit-learn
3. Test different model hyperparameters
4. Use different metrics to assess the performance of your model


## Set-up

We start by loading the dataset and the libraries required for the exercises.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
df_full = pd.read_csv('data/dataset.csv')

# We subsample the dataset to reduce the computational cost
df = df_full.sample(10000, random_state=123)

We use the multinomial version of the dataset.

In [None]:
# Dictionary used to transform the string in 
# the travel_mode to an integer
str_to_val = {
    'walk': 0,
    'cycle': 1,
    'pt': 2,
    'drive': 3
}

# Output
y = df['travel_mode'].map(str_to_val).values

# Features (4 are selected)
x = df[['age', 'car_ownership', 'distance', 'female']].values

# We split the output and features into a train and a test set by 
# an (approximate) ratio of 0.8
np.random.seed(123)
msk = np.random.rand(len(df)) < 0.8

x_train_unscaled = x[msk]
x_test_unscaled = x[~msk]

y_train = y[msk]
y_test = y[~msk]

## k-NN with scikit-learn

In this section, we will use the $k$-NN from ``sklearn`` and compare the results to our implementation in the previous notebook.

The `KNeighborsClassifier` is in the ``neighbors`` submodule of sklearn. Try importing it directly.

Notice the *CamelCase*? This tells us we are using a class! (Like `DataFrame` in Pandas)

In [None]:
# Enter your code below


sklearn classes all behave in a very similar way (including classifiers, regressors, scalers, etc.).

Firstly, we *instantiate* the class (i.e. create an instance). Try using the *help* functionality to investigate the hyperparameters and default values.

In [None]:
knn = KNeighborsClassifier()
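
As a complement to `help(KNeighborsClassifier)`, the sketch below lists the default hyperparameters with `get_params()`, a method available on scikit-learn estimators. The variable names `knn_default` and `params` are illustrative only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate with default hyperparameters (nothing is fitted yet)
knn_default = KNeighborsClassifier()

# get_params() returns the hyperparameters as a dictionary
params = knn_default.get_params()
for name, value in sorted(params.items()):
    print("{}: {}".format(name, value))
```

The entries to look for first are `n_neighbors` (the value of $k$), `weights`, and the distance settings `metric` and `p`.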

Next, we `fit` the class to the training data. (Note: for $k$-NN this involves no real training, as our model is simply the data! Depending on the `algorithm` setting, sklearn may also index the data for faster neighbour search.)

In [None]:
knn.fit(x_train_unscaled, y_train)

Finally, we use the class on new data. For classifiers, this means using them to **predict** labels for new data.

In [None]:
y_pred = knn.predict(x_test_unscaled)
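
To see the instantiate, `fit`, `predict` sequence in isolation, here is a minimal sketch on a made-up one-feature dataset. The arrays `x_toy` and `y_toy` are invented for illustration and are not part of the course data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): one feature, two well-separated classes
x_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# 1. Instantiate, 2. fit, 3. predict
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(x_toy, y_toy)
y_toy_pred = knn_toy.predict(np.array([[1.5], [10.5]]))

print(y_toy_pred)
```

The point at 1.5 has its three nearest neighbours in class 0 and the point at 10.5 has them in class 1, so the predictions follow the majority vote in each neighbourhood.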

Let's define our accuracy score function again, and use it to compare the results of the sklearn model to our implementation!

In [None]:
def compute_accuracy(y_true, y_pred):
    # Percentage of predictions that match the true labels
    return np.mean(np.array(y_true) == np.array(y_pred)) * 100

print("Accuracy: {:.3f}%".format(compute_accuracy(y_test, y_pred)))
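
As a cross-check, scikit-learn provides `accuracy_score` in its `metrics` submodule. The sketch below compares it with our function on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative). Note that `accuracy_score` returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def compute_accuracy(y_true, y_pred):
    # Percentage of predictions that match the true labels
    return np.mean(np.array(y_true) == np.array(y_pred)) * 100


# Made-up labels: three of the four predictions are correct
y_true_toy = [0, 1, 2, 2]
y_pred_toy = [0, 1, 1, 2]

ours = compute_accuracy(y_true_toy, y_pred_toy)
theirs = accuracy_score(y_true_toy, y_pred_toy) * 100  # fraction -> percentage

print("Ours: {:.1f}%, sklearn: {:.1f}%".format(ours, theirs))
```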

Great, we get exactly the same score (for the multinomial case)!

This is because we used the same *hyperparameters*!

## Scaling data

As discussed, $k$-NN is highly sensitive to data scaling. As such, we will use the standard scaler from scikit-learn to scale the data to zero-mean unit-variance.

The `StandardScaler` is in the preprocessing submodule of sklearn. Try importing it directly, instantiating it, and fitting it to `x_train_unscaled`.

In [None]:
# enter your code below


Instead of `predict`, we use the scaler to `transform` data.

Use the fitted scaler to transform `x_train_unscaled` and `x_test_unscaled`, and save the results as `x_train` and `x_test` respectively.

In [None]:
# enter your code below
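
One possible way to approach the two scaling steps is sketched below on synthetic data, so that it runs on its own. The arrays `x_train_demo` and `x_test_demo` are made-up stand-ins for `x_train_unscaled` and `x_test_unscaled`. The scaler is fitted on the training set only and then reused for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data: two features on very different scales
x_train_demo = rng.normal(loc=[40.0, 5000.0], scale=[15.0, 3000.0], size=(200, 2))
x_test_demo = rng.normal(loc=[40.0, 5000.0], scale=[15.0, 3000.0], size=(50, 2))

# Fit learns the per-feature mean and standard deviation of the training set
scaler = StandardScaler()
scaler.fit(x_train_demo)

# Transform applies the training statistics to both sets
x_train_scaled = scaler.transform(x_train_demo)
x_test_scaled = scaler.transform(x_test_demo)

print("Train mean:", x_train_scaled.mean(axis=0).round(3))
print("Train std: ", x_train_scaled.std(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is scaled with the statistics of the training set.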


Now try fitting `knn` to the scaled training data and predicting on the scaled test data, and see how the results have changed!

You should get an accuracy score of 65.457%!

In [None]:
# enter your code below
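
The effect of scaling can be exaggerated on synthetic data to make it visible. In the sketch below (all data and names are made up), one feature carries the class signal on a small scale and the other is pure noise on a large scale. As an alternative to scaling by hand, it uses `make_pipeline` from `sklearn.pipeline` to chain the scaler and the classifier. This is a design choice for the sketch, not something the exercise requires.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 600
labels = rng.randint(0, 2, size=n)

# Feature 0 is informative but small; feature 1 is noise on a large scale
informative = labels + rng.normal(scale=0.05, size=n)
noise = rng.normal(scale=1000.0, size=n)
x_demo = np.column_stack([informative, noise])

x_tr, x_te = x_demo[:400], x_demo[400:]
y_tr, y_te = labels[:400], labels[400:]

# Without scaling, distances are dominated by the noisy feature
unscaled_acc = KNeighborsClassifier().fit(x_tr, y_tr).score(x_te, y_te)

# With scaling, both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_acc = scaled_knn.fit(x_tr, y_tr).score(x_te, y_te)

print("Unscaled accuracy: {:.3f}".format(unscaled_acc))
print("Scaled accuracy:   {:.3f}".format(scaled_acc))
```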


## Accuracy, Precision, and Recall

As discussed, accuracy is not always the best policy. 

Let's investigate some other metrics, based on the `confusion_matrix`.

Firstly, import the function `confusion_matrix` from the `metrics` submodule of `sklearn`, and display the confusion matrix for the predicted values `y_pred` and ground truth values `y_test` for the test data.

In [None]:
# enter your code below
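
To make the layout of the matrix concrete, here is a small sketch on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative). In scikit-learn, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Ground truth first, predictions second
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```

Entry `cm[i, j]` counts the samples of true class `i` that were predicted as class `j`, so the diagonal holds the correct predictions.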


We recall that (for a single class):

`precision = TP / (TP + FP)`

`recall = TP / (TP + FN)`

Write functions to compute the precision and recall for a given class, from the confusion matrix.


In [None]:
def compute_precision(y_true, y_pred, c):
    # Enter your code below


In [None]:
def compute_recall(y_true, y_pred, c):
    # Enter your code below


Try printing the precision and recall for the `pt` mode. You should get 0.685 and 0.669 respectively.

In [None]:
# enter your code below


Use a list comprehension to get the precision and recall for each class.

In [None]:
# enter your code below
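
One possible way to write these functions is sketched below, shown on made-up labels so that it runs on its own. It assumes the classes are encoded as the integers 0, 1, 2, ... and that all of them appear in the data, so that class `c` sits in row and column `c` of the confusion matrix. It does not guard against a class that is never predicted, where the precision denominator would be zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support


def compute_precision(y_true, y_pred, c):
    cm = confusion_matrix(y_true, y_pred)
    # Column c holds everything predicted as class c: TP + FP
    return cm[c, c] / cm[:, c].sum()


def compute_recall(y_true, y_pred, c):
    cm = confusion_matrix(y_true, y_pred)
    # Row c holds everything that truly is class c: TP + FN
    return cm[c, c] / cm[c, :].sum()


# Made-up labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
classes = [0, 1, 2]

# List comprehensions give the metric for every class
precisions = [compute_precision(y_true_toy, y_pred_toy, c) for c in classes]
recalls = [compute_recall(y_true_toy, y_pred_toy, c) for c in classes]

print("Precision:", np.round(precisions, 3))
print("Recall:   ", np.round(recalls, 3))
```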


We can use the function `precision_recall_fscore_support` to verify our answer.

In [None]:
from sklearn.metrics import precision_recall_fscore_support

prec, rec, fscore, supp = precision_recall_fscore_support(y_test, y_pred)

print("Precision: {}".format(prec))
print("Recall: {}".format(rec))

## Model optimisation

We can investigate the effects of the hyperparameters on model performance. 

### $k$

Let's investigate the effect of `k` on the model. Try testing multiple values of `k` (e.g. between 1 and 50) and comment on the results.

For now, focus on accuracy as the performance metric(!)

*Hint*: It could be useful to plot a graph.

What was your best value of $k$? Is it the same as others in the class?

In [None]:
# Enter your code below
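
A minimal sketch of such a sweep is given below. It runs on a synthetic four-class dataset from `make_classification` and splits it with `train_test_split`. Both are stand-ins chosen so the example is self-contained, and the resulting numbers say nothing about the course dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in for the course dataset
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=123)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=123)

# Scale using the training statistics only
scaler = StandardScaler().fit(x_tr)
x_tr, x_te = scaler.transform(x_tr), scaler.transform(x_te)

# Fit one model per value of k and record its test accuracy
ks = list(range(1, 51))
accuracies = []
for k in ks:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr)
    accuracies.append(knn_k.score(x_te, y_te))

best_k = ks[int(np.argmax(accuracies))]
print("Best k: {} (accuracy {:.3f})".format(best_k, max(accuracies)))
```

Plotting `accuracies` against `ks`, for example with `plt.plot(ks, accuracies)`, makes the trend much easier to read than the printed numbers.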


### (Bonus) Repeated trials

In order to estimate confidence intervals for our performance estimates, we could test the model on multiple draws from the dataset, *i.e.* draw the train and test sets multiple times.

*Note, there are other ways to do this, which we will discuss later in the course*
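
A rough sketch of repeated trials is given below, again on synthetic stand-in data. Each trial draws a new train/test split with a different `random_state`, and the scaler is refitted inside a pipeline for every split. The percentile range at the end is only an empirical spread over the trials, not a formal confidence interval. The number of trials and the value of $k$ are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in for the course dataset
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=123)

n_trials = 20  # arbitrary choice for illustration
scores = []
for seed in range(n_trials):
    # A different seed gives a different train/test draw
    x_tr, x_te, y_tr, y_te = train_test_split(
        x_demo, y_demo, test_size=0.2, random_state=seed)
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores.append(model.fit(x_tr, y_tr).score(x_te, y_te))

scores = np.array(scores)
low, high = np.percentile(scores, [2.5, 97.5])
print("Mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
print("Empirical 95% range: [{:.3f}, {:.3f}]".format(low, high))
```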

### (Bonus) Other hyperparameters 

Try experimenting with the other hyperparameters in $k$-NN (use the documentation!). 

E.g. what happens if we use distance-based weighting? Does the optimal value of $k$ change?
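
As a starting point, the sketch below compares the two built-in options of the `weights` hyperparameter, `'uniform'` and `'distance'`, over a range of $k$. It uses the same kind of synthetic stand-in data as a self-contained example, so the best $k$ it reports is not meaningful for the course dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in for the course dataset
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=123)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=123)

scaler = StandardScaler().fit(x_tr)
x_tr, x_te = scaler.transform(x_tr), scaler.transform(x_te)

ks = list(range(1, 51))
curves = {}
for weights in ["uniform", "distance"]:
    # 'uniform': every neighbour votes equally
    # 'distance': closer neighbours get a larger vote
    curves[weights] = [
        KNeighborsClassifier(n_neighbors=k, weights=weights)
        .fit(x_tr, y_tr)
        .score(x_te, y_te)
        for k in ks
    ]
    best = int(np.argmax(curves[weights]))
    print("{}: best k = {}, accuracy = {:.3f}".format(
        weights, ks[best], curves[weights][best]))
```

With a single neighbour the weighting cannot matter, so the two curves start at the same value for $k = 1$ and can only differ for larger $k$.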