# Implementation of k-NN

This notebook was written by Gael Lederrey and Tim Hillel (tim.hillel@epfl.ch) for the Decision-aid methodologies in transportation course at EPFL (http://edu.epfl.ch/coursebook/en/decision-aid-methodologies-in-transportation-CIVIL-557).

Please contact the authors before distributing or reusing the material below.

## Overview

In this notebook, you will learn how to implement the $k$-NN algorithm. The notebook is divided into two parts:

1. Implementing the $k$-NN classifier for the binary case
2. Implementing the $k$-NN classifier for the multinomial case

### Setup

We start by loading the dataset and the libraries required for the exercises. We randomly sample 10,000 rows from the original dataset to make computation faster.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df_full = pd.read_csv('data/dataset.csv')

# We sample the dataset to decrease computation time
df = df_full.sample(10000, random_state = 123)

## Implementation of k-NN for the binary case

### Data preparation

We start this exercise by selecting the features and the output. In this case, we want a binary label that indicates whether a trip is made by driving (*1*) or not (*0*).

The label is currently a string, so we will map it to numeric data.

We will also select only four **features** from the data (again, in order to reduce computation time).

In [None]:
# Dictionary mapping each travel_mode string
# to the binary output (1 = drive, 0 = any other mode)
str_to_val = {
    'drive': 1,
    'pt': 0,
    'cycle': 0,
    'walk': 0
}

# Output
y = df['travel_mode'].replace(str_to_val).values

# Features (4 were selected)
x = df[['age', 'car_ownership', 'distance', 'female']].values

### Train-test split

In [None]:
# We split the output and features into a train and a test set by an (approximate) ratio of 0.8
np.random.seed(123)
msk = np.random.rand(len(df)) < 0.8

x_train = x[msk]
x_test = x[~msk]

y_train = y[msk]
y_test = y[~msk]
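
The boolean-mask split above can be illustrated on a small toy array. This is only a sketch: the names `toy_x`, `toy_mask`, `toy_train` and `toy_test` and the seed are made up for the illustration and are not part of the exercise. It shows that the mask and its negation partition the rows, and that the 0.8 ratio is only approximate because each row is assigned independently at random.

```python
import numpy as np

# Hypothetical toy data: 10 rows with 2 features each
toy_x = np.arange(20).reshape(10, 2)

# Seeded generator (seed chosen arbitrarily for this sketch)
toy_rng = np.random.RandomState(0)

# Each row is kept for training with probability 0.8
toy_mask = toy_rng.rand(len(toy_x)) < 0.8

toy_train = toy_x[toy_mask]    # rows where the mask is True
toy_test = toy_x[~toy_mask]    # remaining rows

# Every row lands in exactly one of the two sets
print(len(toy_train), len(toy_test), len(toy_x))
```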

### Distance

We need to compute the distance between two points in the space.

Make sure that your function also works when one of the arrays is a 2D **array** of multiple **points**, in which case it should return one distance per point.

In [None]:
def euclidean_distance(array1, array2):
    # Enter your code below
    

**Verify your answer here:**

`euclidean_distance([3, -2, -5, 9], [-1, 4, 2, 0]) = 13.491`

In [None]:
arr1 = [3, -2, -5, 9]
arr2 = [-1, 4, 2, 0]

euclidean_distance(arr1, arr2)

    euclidean_distance([3, -2, -5, 9], 
                       [[-1, 4, 2, 0],[10,7,6,-7]]) = array([13.491, 22.517])

In [None]:
euclidean_distance([3, -2, -5, 9], 
                   [[-1, 4, 2, 0],[10,7,6,-7]])
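
One possible way to write this function is sketched here. It is not necessarily the reference solution; it assumes that relying on NumPy broadcasting is acceptable, and it sums over the last axis so that the same code handles a single point and a 2D array of points.

```python
import numpy as np

def euclidean_distance(array1, array2):
    # Convert the inputs (lists or arrays) to float arrays
    a = np.asarray(array1, dtype=float)
    b = np.asarray(array2, dtype=float)
    # Broadcasting subtracts a 1D point from every row of a 2D array;
    # summing over the last axis gives one squared distance per point
    return np.sqrt(np.sum((a - b) ** 2, axis=-1))

single = euclidean_distance([3, -2, -5, 9], [-1, 4, 2, 0])
several = euclidean_distance([3, -2, -5, 9],
                             [[-1, 4, 2, 0], [10, 7, 6, -7]])
print(np.round(single, 3), np.round(several, 3))
```

With two 1D inputs the result is a scalar; with one 2D input it is a 1D array with one entry per row.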

### Find the k nearest neighbours

For this function, we need to find the `k` nearest neighbours out of the `known_points` for a single `candidate`. The `candidate` is an array (with $n$ features), and it needs to be compared to each of the `known_points` (in the training set).

The function should return the indices of the `k` closest neighbours.

In [None]:
def get_k_nearest_neighbours(candidate, known_points, k):
    # Enter your code below


**Verify your answer here:**

`get_k_nearest_neighbours(x_test[0], x_train, 5) = array([2648, 3071, 7607, 7686, 3466])`

In [None]:
neighbours = get_k_nearest_neighbours(x_test[0], x_train, 5)
neighbours
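
A minimal sketch of one way to find the neighbours, under the assumption that sorting all distances with `np.argsort` is fast enough for this dataset size. The toy points are invented for illustration. When several points are equally distant, the order of the returned indices depends on the sort, so the indices may differ from the expected output above in that case.

```python
import numpy as np

def get_k_nearest_neighbours(candidate, known_points, k):
    known = np.asarray(known_points, dtype=float)
    cand = np.asarray(candidate, dtype=float)
    # Euclidean distance from the candidate to every known point
    distances = np.sqrt(np.sum((known - cand) ** 2, axis=1))
    # argsort orders the indices from nearest to farthest;
    # a stable sort keeps the original order among ties
    return np.argsort(distances, kind='stable')[:k]

# Hypothetical toy points, not taken from the dataset
toy_points = [[0, 0], [5, 5], [1, 0], [0, 2], [10, 10]]
toy_neighbours = get_k_nearest_neighbours([0, 0], toy_points, 3)
print(toy_neighbours)  # [0 2 3]
```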

### Find the class

Now that we know the closest neighbours, we need to compute the class (output) for a given instance. To do this, we need to count the number of times the class of a neighbour appears. Then, we can select the class with the highest count.

In [None]:
def get_class(neighbours, y_true):
    # Enter your code below


**Verify your answer here:**

`get_class(neighbours, y_train) = 1`

In [None]:
get_class(neighbours, y_train)

You can verify that the majority of the five neighbours have the class `1`.

In [None]:
y_train[neighbours]
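
The majority vote can be sketched as follows. This is one possible approach, assuming `np.unique` with `return_counts=True` is used for the counting; the toy labels are invented. Note that `np.argmax` returns the first maximum, so in this sketch a tie goes to the smallest label.

```python
import numpy as np

def get_class(neighbours, y_true):
    # Labels of the neighbours, selected by index
    labels = np.asarray(y_true)[np.asarray(neighbours)]
    # Distinct labels and how often each one appears
    values, counts = np.unique(labels, return_counts=True)
    # Label with the highest count (first one in case of a tie)
    return values[np.argmax(counts)]

# Hypothetical toy labels, not taken from the dataset
toy_labels = np.array([0, 1, 1, 0, 1, 0])
toy_class = get_class([1, 2, 3], toy_labels)
print(toy_class)  # 1
```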

### Compute the accuracy

We need to implement the computation of the accuracy. Recall that the accuracy is the proportion of the predictions we get right (i.e. where our **prediction** is equal to the **ground truth**). Here, it is expressed as a percentage.

In [None]:
def compute_accuracy(y_true, y_pred):
    # Enter your code below


**Verify your answer here:**

`compute_accuracy([0, 1, 1, 0, 1], [0, 1, 1, 1, 0]) = 60.0`

In [None]:
compute_accuracy([0, 1, 1, 0, 1], [0, 1, 1, 1, 0])
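
A short sketch of the accuracy computation, assuming the result should be a percentage as in the expected value above. Comparing the two arrays element-wise gives booleans, and their mean is the proportion of correct predictions.

```python
import numpy as np

def compute_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Mean of the element-wise matches, scaled to a percentage
    return 100.0 * np.mean(y_true == y_pred)

toy_acc = compute_accuracy([0, 1, 1, 0, 1], [0, 1, 1, 1, 0])
print(toy_acc)
```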

## Test your implementation of k-NN

We can now test our implementation of k-NN. To do so, we need to combine the functions we developed in the following order:

```
For each observation in the test set:
    Get the k nearest neighbours
    Get the class of the current observation
    Add the prediction to a vector

Compute the accuracy
```

With `k=5`, you should get an accuracy of `60.399%`.
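
The loop in the pseudocode can be sketched end to end on toy data. Everything here is an assumption made for illustration: the helper name `predict_knn`, the toy points and their labels are hypothetical, and the distance, neighbour search and vote are inlined so that the block runs on its own.

```python
import numpy as np

def predict_knn(x_train, y_train, x_test, k):
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for candidate in np.asarray(x_test, dtype=float):
        # Get the k nearest neighbours of the current observation
        distances = np.sqrt(np.sum((x_train - candidate) ** 2, axis=1))
        nearest = np.argsort(distances, kind='stable')[:k]
        # Get the class by majority vote among those neighbours
        values, counts = np.unique(y_train[nearest], return_counts=True)
        # Add the prediction to the vector
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

# Two well-separated toy clusters (hypothetical data)
toy_x_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y_train = [0, 0, 0, 1, 1, 1]
toy_x_test = [[0.5, 0.5], [5.5, 5.5]]
toy_y_test = [0, 1]

toy_pred = predict_knn(toy_x_train, toy_y_train, toy_x_test, k=3)
# Compute the accuracy as a percentage
toy_accuracy = 100.0 * np.mean(toy_pred == np.asarray(toy_y_test))
print(toy_pred, toy_accuracy)  # [0 1] 100.0
```

On the real data, the same loop would run over `x_test` with `x_train` and `y_train` as the known points.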


In [None]:
# Enter your code below


## k-NN for multinomial cases

First, we need to change the data so that the output is multinomial: we would like to see whether we can distinguish between all four transport modes.

In [None]:
# Dictionary mapping each travel_mode string
# to an integer class
str_to_val = {
    'walk': 0,
    'cycle': 1,
    'pt': 2,
    'drive': 3
}

# Output
y = df['travel_mode'].replace(str_to_val).values

In [None]:
y_train_mult = y[msk]
y_test_mult = y[~msk]

### Modification of the code

Modify any functions that you implemented earlier to take into account the multinomial outputs.

In [None]:
# Enter your code below


**Verify your answer here:**

With `k=5`, you should now get an accuracy of `53.756%`.
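
With four classes and `k=5`, the vote no longer has a guaranteed strict majority, so ties become possible (for example two neighbours each for two classes). The sketch below shows one way to count votes over integer classes with `np.bincount`. The function name `get_class_multinomial`, the `n_classes` parameter and the toy labels are assumptions made for this illustration; ties are broken in favour of the smallest class index, which is a design choice and may not match the reference solution.

```python
import numpy as np

def get_class_multinomial(neighbours, y_true, n_classes=4):
    # Integer labels (0 .. n_classes - 1) of the neighbours
    labels = np.asarray(y_true)[np.asarray(neighbours)]
    # One vote count per class, including classes with zero votes
    counts = np.bincount(labels, minlength=n_classes)
    # argmax returns the first maximum, i.e. the smallest class on a tie
    return int(np.argmax(counts))

# Hypothetical toy labels using the mapping walk=0, cycle=1, pt=2, drive=3
toy_labels_mult = np.array([3, 0, 2, 3, 1, 2, 3])

clear_vote = get_class_multinomial([0, 2, 3, 4, 6], toy_labels_mult)
tied_vote = get_class_multinomial([0, 1, 2, 3, 5], toy_labels_mult)
print(clear_vote, tied_vote)  # 3 2
```

In the second call, classes `2` and `3` each receive two votes, and the sketch returns `2`.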

In [None]:
# Enter your code below
