# SLU19 - k-Nearest Neighbors (kNN) - Exercise notebook

In the first part of the notebook you will be implementing things from scratch, so you understand what's going on under the hood. Later you'll get to use the sklearn implementation.

![numpy-function-implementation](media/numpy-function-implementation.png)

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets

import json
import hashlib

## Exercise 1 - Distances

We've talked about the Euclidean distance in the learning notebook, but what is a Euclidean norm?

The norm is the length of a vector, that is, the distance from the start of the vector to its tip. It is written as

$$|\mathbf{x}|$$ 

To calculate the norm, we need to choose a distance function for it. For the Euclidean norm, that distance function is the Euclidean distance.

$$d(\mathbf{p}, \mathbf{q}) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} = \sqrt{ \sum_{i=1}^n (q_i - p_i)^2} = |\mathbf{q} - \mathbf{p}|$$

The Euclidean distance of two points is the same as the norm of the difference between the vectors q and p (vectors from the origin to each point).

$$ \sqrt{ \sum_{i=1}^n (q_i - p_i)^2} = |\mathbf{q} - \mathbf{p}|$$

To calculate the norm of any vector, we just take point p as the origin and point q as the tip of our vector `x` and the equation becomes

$$ \sqrt{ \sum_{i=1}^n (x_i)^2} = |\mathbf{x}|$$

And we get our norm definition!
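
As a quick numeric check of the identity above, here is a minimal sketch that uses only the standard library. The points `p` and `q` are made-up values, chosen so that the result is a round number; it illustrates the formula and is not the solution expected in the exercise below.

```python
import math

# Hypothetical example points, chosen so the result is a round number.
p = [1.0, 2.0]
q = [4.0, 6.0]

# Difference vector q - p, computed component by component.
diff = [qi - pi for pi, qi in zip(p, q)]

# Euclidean norm of the difference: square root of the sum of squares.
distance = math.sqrt(sum(d ** 2 for d in diff))

print(diff)      # [3.0, 4.0]
print(distance)  # 5.0
```

The distance between `p` and `q` is the norm of the vector `[3.0, 4.0]`, which is the familiar 3-4-5 right triangle.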

### Exercise 1.1 - Vector norms

Implement the Euclidean norm definition in the function below.

$$|\mathbf{x}| = \sqrt{ \sum_{i=1}^n (x_i)^2} = \sqrt{(x_1)^2 + (x_2)^2 + ... + (x_n)^2}$$

In [None]:
def euclidean_norm(x):
    """
    Return the euclidean norm of a vector
        
    Parameters
    ----------
    x: numpy array with shape (N,)
    
    Returns
    ----------
    norm: float
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    return norm

In [None]:
np.testing.assert_almost_equal(euclidean_norm(np.array([1, 2, 4])), 4.5825, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([-1, 0, 4])), 4.1231, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([1])), 1.0, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([-1])), 1.0, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([0, 0])), 0.0, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([0, 1, 2, 3, 4])), 5.4772, 2)
np.testing.assert_almost_equal(euclidean_norm(np.array([0, -1, -2, -3, -4])), 5.4772, 2)

### Exercise 1.2 - Distances

Let's diversify. Implement a function called `distance_function` which calculates different kinds of distances between two points.

This function receives two arguments, `a` and `b`, which are the n-dimensional coordinates of the two points. Additionally, it receives the argument `distance_type`, which tells you which distance to use.

The `distance_type` argument can have one of three values, which define how to compute the distance:

* `euclidean`

$$d_{euclidean}(\mathbf{a}, \mathbf{b}) = |\mathbf{b} - \mathbf{a}|$$

* `dot_product`

$$d_{dot}(\mathbf{a}, \mathbf{b}) = a_1b_1 + a_2b_2 + ... + a_nb_n$$

* `cosine`

$$d_{cosine}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| \; |\mathbf{b}|}$$

In [None]:
def distance_function(a, b, distance_type="euclidean"):
    """
    Return the distance between two vectors, which can be 
        'euclidean', 'dot_product' or 'cosine'. 
        
    Return None if:
     - distance type is not any of the supported types
     - if the shape of a and b do not match

    Parameters
    ----------
    a: numpy array with shape (N,)
    b: numpy array with shape (N,)
    distance_type: str - can be one of 'euclidean', 'dot_product'
        or 'cosine'
    
    Returns
    ----------
    distance: float
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return distance

In [None]:
# Test Euclidean Distance
np.testing.assert_almost_equal(distance_function(np.array([1, 2, 4]), np.array([-1, 0, 4]), distance_type="euclidean"), 2.8284, 2)
np.testing.assert_almost_equal(distance_function(np.array([1]), np.array([-1]), distance_type="euclidean"), 2.0000, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 0]), np.array([2, 3]), distance_type="euclidean"), 3.6055, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 1, 2, 3, 4]), np.array([0, -1, -2, -3, -4]), distance_type="euclidean"), 10.9544, 2)

# Test Dot product
np.testing.assert_almost_equal(distance_function(np.array([1, 2, 4]), np.array([-1, 0, 4]), distance_type="dot_product"), 15.0, 2)
np.testing.assert_almost_equal(distance_function(np.array([1]), np.array([-1]), distance_type="dot_product"), -1.0, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 0]), np.array([2, 3]), distance_type="dot_product"), 0.0, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 1, 2, 3, 4]), np.array([0, -1, -2, -3, -4]), distance_type="dot_product"), -30.0, 2)

# Test Cosine distance
np.testing.assert_almost_equal(distance_function(np.array([1, 2, 4]), np.array([-1, 0, 4]), distance_type="cosine"), 0.2061, 2)
np.testing.assert_almost_equal(distance_function(np.array([1]), np.array([-1]), distance_type="cosine"), 2.0, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 1]), np.array([2, 3]), distance_type="cosine"), 0.1679, 2)
np.testing.assert_almost_equal(distance_function(np.array([0, 1, 2, 3, 4]), np.array([0, -1, -2, -3, -4]), distance_type="cosine"), 2.0, 2)

# Test cases where distance can't be computed
assert distance_function(np.array([1, 2]), np.array([-1, 0, 4]), distance_type="euclidean") is None
assert distance_function(np.array([1, 2]), np.array([-1, 0, 4]), distance_type="dot_product") is None
assert distance_function(np.array([1, 2]), np.array([-1, 0, 4]), distance_type="cosine") is None
assert distance_function(np.array([1, 2, 3]), np.array([-1, 0, 4]), distance_type="no_distance") is None

You probably know that numpy and scipy have functions to calculate these quantities, but we wanted you to implement them yourself and understand what is happening.
* [numpy.linalg.norm](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html)
* [numpy.dot](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html)
* [scipy.spatial.distance.cosine](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html)
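
For reference, here is a small sketch of how the library functions can be combined to get the three values for one pair of vectors. It uses only `numpy.linalg.norm` and `numpy.dot`, and builds the cosine distance from them rather than calling scipy. The vectors `a` and `b` are the ones from the first test case above.

```python
import numpy as np

a = np.array([1, 2, 4])
b = np.array([-1, 0, 4])

# Euclidean distance: the norm of the difference vector.
euclidean = np.linalg.norm(b - a)

# Dot product: sum of the element-wise products.
dot = np.dot(a, b)

# Cosine distance: 1 minus the cosine of the angle between a and b.
cosine = 1 - dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(float(euclidean), 4))  # 2.8284
print(int(dot))                    # 15
print(round(float(cosine), 4))     # 0.2061
```

These are the same values the tests above check for, so the sketch can serve as a cross-check of your own implementation.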

## Exercise 2 - Implementing the kNN algorithm

Now that we have the distance functions, we'll implement the kNN algorithm.

And we'll do it by hand! Let's do this!

![lets_do_this](media/lets_do_this.gif)

### Exercise 2.1 - Find the closest neighbors 

The first step is to find the nearest data points. For that purpose, implement a function called `find_nearest_neighbors`, that:

* receives four arguments:
    * `x`, the coordinates of the point for which to find the nearest neighbors
    * `dataset`, the coordinates of N other points
    * `distance_type`, distance type to use in finding the nearest neighbors
    * `k`, the number of nearest neighbors to find

Build your function in these steps:
* iterate through the dataset and compute the distance between `x` and every dataset point
* get the dataset indexes of the k nearest points
* return a numpy array of shape (k,) with those indexes

Hint: check [numpy.argsort](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html).
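
To see what `numpy.argsort` does, here is a minimal sketch on a made-up array of distances (the values are hypothetical and unrelated to the iris dataset used in the tests).

```python
import numpy as np

# Hypothetical distances from a query point to five dataset points.
distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])

# argsort returns the indexes that would sort the array in ascending order.
order = np.argsort(distances)

# The first k entries are the indexes of the k smallest distances.
k = 3
nearest = order[:k]

print(order)    # [1 3 2 4 0]
print(nearest)  # [1 3 2]
```

Note that the result contains positions in the original array, not the distances themselves.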

In [None]:
def find_nearest_neighbors(x, dataset, distance_type="euclidean", k=5):
    """
    Finds the k nearest neighbors of point x in the dataset
    
    Parameters
    ----------
    x: numpy array with shape (d,)
    dataset: numpy array with shape (N, d)
    distance_type: str - can be one of 'euclidean', 'dot_product'
        or 'cosine'
    k: int, the number of nearest neighbors to find

    Returns
    ----------
    indexes: numpy array with shape (k,)
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    return indexes

In [None]:
dataset = datasets.load_iris().data
x = np.array([4.9, 3.0, 6.1, 2.2])

knn_1 = find_nearest_neighbors(x, dataset, 'euclidean', 3)
assert isinstance(knn_1,np.ndarray), 'The output should be an array.'
assert knn_1.shape == (3,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_1])).encode()).hexdigest() == \
'87ae3848357c73240c2916c4468305edfc12b37c6a72212ec3b4f2b5c8778b0e', 'The indexes are not correct.'

knn_2 = find_nearest_neighbors(x, dataset, 'euclidean', 10)
assert knn_2.shape == (10,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_2])).encode()).hexdigest() == \
'90def25b7642b0c9dd10b3f64b1804549fbbcabcc4cd5e8c3fc850510b5917f5', 'The indexes are not correct.'

knn_3 = find_nearest_neighbors(x, dataset, 'dot_product', 3)
assert knn_3.shape == (3,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_3])).encode()).hexdigest() == \
'4bf23ee8c86461d4a9f5ad7da978deef5c71e1275f38adfd343b352ad3e5071b', 'The indexes are not correct.'

knn_4 = find_nearest_neighbors(x, dataset, 'dot_product', 10)
assert knn_4.shape == (10,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_4])).encode()).hexdigest() == \
'3d484aa7dbcf027bebb27cf9860fa96fc3789413ad4a7b4da724b9f8184800ef', 'The indexes are not correct.'

knn_5 = find_nearest_neighbors(x, dataset, 'cosine', 3)
assert knn_5.shape == (3,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_5])).encode()).hexdigest() == \
'17ab712a8ae8c59cd97bc3193325c62213d25424249dc37292e566208362e0a5', 'The indexes are not correct.'

knn_6 = find_nearest_neighbors(x, dataset, 'cosine', 10)
assert knn_6.shape == (10,), 'The shape of the result is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in knn_6])).encode()).hexdigest() == \
'68e2496b762f09fdb34f31dfa584498315d6c4326caef822fff14ee663fabc18', 'The indexes are not correct.'

### Exercise 2.2 - Classifying from nearest neighbors

Now that we have the indexes of the k nearest neighbors, we need to get the labels of those neighbors, so that we can predict the label for our point. We'll use the **most common label** for the prediction.

Implement a function called `get_knn_class` which:
* receives two arguments:
    * `y`, the labels of all points in the dataset
    * `neighbor_indexes`, which are the dataset indexes of the k nearest neighbors
* returns the most frequent label among the nearest neighbors
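
One possible way to take a majority vote is sketched below on made-up data, using `numpy.unique` with `return_counts=True`. This is only one approach among several (for example, `collections.Counter` would also work), and the names `toy_y` and `toy_indexes` are hypothetical.

```python
import numpy as np

# Hypothetical labels for six points and the indexes of three neighbors.
toy_y = np.array([0, 1, 1, 2, 1, 0])
toy_indexes = np.array([1, 2, 5])

# Select the labels of the neighbors: [1, 1, 0].
neighbor_labels = toy_y[toy_indexes]

# Count how often each distinct label occurs.
values, counts = np.unique(neighbor_labels, return_counts=True)

# Pick the label with the highest count and convert it to a plain int.
label = int(values[np.argmax(counts)])

print(label)  # 1
```

The explicit `int(...)` matters here because numpy integer scalars are not instances of Python's `int`. If two labels are tied, `np.argmax` returns the first maximum, which in this sketch is the smallest tied label.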

In [None]:
def get_knn_class(y, neighbor_indexes):
    """
    Returns the most frequent label of the nearest neighbors
    
    Parameters
    ----------
    y: numpy array with shape (N,) - labels of all points in the dataset
    neighbor_indexes: numpy array with shape (k,) - dataset indexes of the nearest neighbors
    
    Returns
    ----------
    knn_label: int
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return knn_label

In [None]:
np.random.seed(42) 

# Test case 1
n=150
k=5
c=3
neighbor_indexes=np.random.randint(0,n-1,k)
y= np.random.randint(0, c, n)
answer = get_knn_class(y, neighbor_indexes)
assert isinstance(answer, int), 'The output should be an integer.'
assert answer == 0, f'The predicted label {answer} is not correct.'

# Test case 2
n=150
k=7
c=5
neighbor_indexes=np.random.randint(0,n-1,k)
y= np.random.randint(0, c, n)
answer = get_knn_class(y, neighbor_indexes)
assert isinstance(answer, int), 'The output should be an integer.'
assert answer == 2, f'The predicted label {answer} is not correct.'

### Exercise 2.3 - Classification with kNN (putting everything together)

Finally we can put everything together and implement the kNN classifier!

Implement a function called `knn_classifier` that:

* receives five arguments:
    * `x`, the coordinates of the point for which to make a prediction
    * `dataset`, coordinates of N other points
    * `targets`, labels for each point in the dataset
    * `k`, number of nearest neighbors to consider
    * `distance_function`, distance type to use, can be 'euclidean', 'cosine' or 'dot_product'
* uses the functions that we implemented above in order to implement a kNN classifier
* returns the predicted label for point x

In [None]:
def knn_classifier(x, dataset, targets, k, distance_function):
    """
    Predicts the label for a single point using the kNN algorithm
    
    Parameters
    ----------
    x: numpy array with shape (d,)
    dataset: numpy array with shape (N, d)
    targets: numpy array with shape (N,)
    k: int
    distance_function: string
    
    Returns
    ----------
    label: int
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return label

In [None]:
dataset = datasets.load_iris().data
targets = datasets.load_iris().target
x = np.array([4.9, 3.0, 6.1, 2.2])

tests = [
    {
        'input': [x, dataset, targets, 3, 'euclidean'],
        'expected_value': 2
    },
    {
        'input': [x, dataset, targets, 5, 'dot_product'],
        'expected_value': 0
    },
    {
        'input': [x, dataset, targets, 1, 'cosine'],
        'expected_value': 2
    }
]

for test in tests:
    pred_label = knn_classifier(*test['input'])
    assert isinstance(pred_label, int), "The function should return an integer."
    assert pred_label == test['expected_value'], "The returned label is not correct."

Great job! You now have a working kNN classifier!

![its-alive](media/its-alive.png)

Now that you've implemented a kNN classifier, let's go a bit further and implement a kNN regressor.

## Exercise 3 - Regression with kNN

As we explained in the learning notebook, the main difference between a kNN classifier and a kNN regressor is the way we choose the predicted label from the labels of the nearest neighbors. So we can reuse the first step of retrieving the neighbors.

For the classifier case we used a majority vote. In the regressor case, we want to use the average value of the neighbors' labels.
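
The averaging step can be illustrated with a short sketch on made-up targets (the names `toy_targets` and `toy_indexes` are hypothetical).

```python
import numpy as np

# Hypothetical target values for four points and the indexes of two neighbors.
toy_targets = np.array([10.0, 20.0, 30.0, 40.0])
toy_indexes = np.array([0, 2])

# Average the targets of the selected neighbors: (10 + 30) / 2.
prediction = float(toy_targets[toy_indexes].mean())

print(prediction)  # 20.0
```

Compared to the classifier, only this last step changes: the vote over labels is replaced by a mean over target values.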

### Exercise 3.1 - Calculate the prediction from the nearest neighbors

Implement a function called `get_knn_value`, that:

* receives two arguments:
    * `y`, the targets of all the points from the dataset
    * `neighbor_indexes`, which are the indexes of the k nearest neighbors
* returns the average of the nearest neighbors' target values

In [None]:
def get_knn_value(y, neighbor_indexes):
    """
    Returns the average value of the nearest neighbors targets
    
    Parameters
    ----------
    y: numpy array with shape (N,) - targets of all the points in the dataset
    neighbor_indexes: numpy array with shape (k,) - indexes of the nearest neighbors in the dataset
    
    Returns
    ----------
    knn_prediction: float, average of the nearest neighbors targets
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return knn_prediction

In [None]:
np.random.seed(42) 

# Test case 1
answer = get_knn_value(np.random.rand(150), np.random.randint(0, 3, 3))
assert isinstance(answer, float), 'The answer should be a float.'
np.testing.assert_almost_equal(answer, 0.4937, 2, err_msg='The predicted value is not correct.')

# Test case 2
answer = get_knn_value(np.random.rand(10), np.random.randint(1, 5, 7))
assert isinstance(answer, float), 'The answer should be a float.'
np.testing.assert_almost_equal(answer, 0.5192, 2, err_msg='The predicted value is not correct.')

And we're ready to implement the kNN regressor! Keep up the good work, we're almost there!

![almost_there](media/almost_there.gif)

### Exercise 3.2 - kNN regressor, put it all together

Implement a function called `knn_regressor` that:

* receives five arguments:
    * `x`, the coordinates of a point for which to make a prediction
    * `dataset`, coordinates of N other points
    * `targets`, targets for each point in the dataset
    * `k`, the number of nearest neighbors the kNN algorithm should consider
    * `distance_function`, distance type, can be 'euclidean', 'cosine' or 'dot_product'
* uses the functions that we implemented above
* returns the prediction for point x

In [None]:
def knn_regressor(x, dataset, targets, k, distance_function):
    """
    Predicts the value for a single point using the kNN regression algorithm
    
    Parameters
    ----------
    x: numpy array with shape (d,), coordinates of the point
    dataset: numpy array with shape (N, d), coordinates of N other points
    targets: numpy array with shape (N,), targets of the points in the dataset
    k: int, number of nearest neighbors
    distance_function: string, type of distance function to use
    
    Returns
    ----------
    prediction: float
    """

    # YOUR CODE HERE
    raise NotImplementedError()

    return prediction

In [None]:
np.random.seed(42)
dataset = datasets.load_diabetes().data
targets = datasets.load_diabetes().target
x = np.random.rand(10)

prediction = knn_regressor(x, dataset, targets, 3, 'euclidean')
assert isinstance(prediction, float), 'The output should be a float.'
np.testing.assert_almost_equal(prediction, 265.6666, 2, err_msg='The predicted value is not correct.')

prediction = knn_regressor(x, dataset, targets, 5, 'dot_product')
assert isinstance(prediction, float), 'The output should be a float.'
np.testing.assert_almost_equal(prediction, 92.8, err_msg='The predicted value is not correct.')

prediction = knn_regressor(x, dataset, targets, 1, 'cosine')
assert isinstance(prediction, float), 'The output should be a float.'
np.testing.assert_almost_equal(prediction, 264.0, err_msg='The predicted value is not correct.')

**Well done!!!**

![we_did_it](media/we_did_it.gif)

Finally, let's wrap this up with a couple of exercises on how to use scikit-learn's kNN models.

## Exercise 4 - Using scikit-learn's kNN models

In [None]:
from scipy.spatial.distance import cosine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score

### Exercise 4.1 - kNN with Euclidean distance

Use a `KNeighborsClassifier` to create predictions for the [breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer).

The dataset contains features extracted from microscopic images of breast cancer cell nuclei. The target is the malignancy of the cancer.

Follow the instructions in the comments in the notebook cells below.
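
Before working on the real dataset, here is a minimal sketch of the fit/predict/score workflow on a tiny made-up dataset. The `toy_` names and the data are hypothetical, and the two clusters are deliberately far apart so the outcome is easy to verify by hand.

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three points near the origin (class 0)
# and three points near (5, 5) (class 1).
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

# k=3 with the default 'minkowski' metric.
toy_clf = KNeighborsClassifier(n_neighbors=3, metric="minkowski")
toy_clf.fit(X_toy, y_toy)

# Predict two new points, one inside each cluster.
toy_pred = toy_clf.predict([[0.5, 0.5], [5.5, 5.5]])

# Compare against the labels we expect for those two points.
toy_accuracy = accuracy_score([0, 1], toy_pred)

print(toy_pred)      # [0 1]
print(toy_accuracy)  # 1.0
```

The same three calls (`fit`, `predict`, `accuracy_score`) apply to the train and test splits in the cells below.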

In [None]:
# We start by importing the dataset
data = datasets.load_breast_cancer()

# Now do a train test split, using the train_test_split function from scikit
# Use a test_size of 0.25 and a random_state of 42
# X_train, X_test, y_train, y_test = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
tests = [
    {
        'dataset_type': 'X_train',
        'dataset': X_train,
        'shape_hash': '31ffabcaf98971831a5f8ad05ba70049a86bd60bda0a971ca9691388f9f72f8b'
    },
    {
        'dataset_type': 'X_test',
        'dataset': X_test,
        'shape_hash': '747c580b9756b4741bfbe812b8ca9fd8d047a5d6f9e3ebe53d4d15117f42ec2a'
    },
    {
        'dataset_type': 'y_train',
        'dataset': y_train,
        'shape_hash': '23a4f6ee909897142105a6577ac39ff86c353b8ad0ded0bece87829bb1953a58'
    },
    {
        'dataset_type': 'y_test',
        'dataset': y_test,
        'shape_hash': '40957487610d92ca4dd2d37ec155c40d20091a504bf65270a3cd28e6863ef633'
    },
]

for test in tests:
    shape_hash = hashlib.sha256(json.dumps(test['dataset'].shape).encode()).hexdigest()

    assert isinstance(test['dataset'], np.ndarray), f"{test['dataset_type']} should be a numpy array!"
    assert shape_hash == test['shape_hash'], "The returned numpy array has the wrong shape!"

In [None]:
# Instantiate a kNN Classifier with k=3 and Euclidean distance as the distance function
# In scikit-learn, the default metric is 'minkowski' with p=2, which is the Euclidean distance
# (the Minkowski distance is a generalisation of the Euclidean distance)
# clf = ...
# YOUR CODE HERE
raise NotImplementedError()


# Get predictions for the test dataset
# y_pred = ...
# YOUR CODE HERE
raise NotImplementedError()

# Calculate the accuracy of the solution
# accuracy = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clf, KNeighborsClassifier)
assert hashlib.sha256(json.dumps(clf.get_params()).encode()).hexdigest() == \
'c4a05083bfa540d2686546998f18e59fcd2e8bee8ddba53a396fd3719360fb06', 'The parameters of the classifier are not correct.'

assert isinstance(y_pred, np.ndarray), 'The prediction should be a numpy array.'
assert y_pred.shape == (143,), 'The shape of the prediction is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in y_pred])).encode()).hexdigest() == \
'ccbde511468d2997594b55f52a4e5a94065faab6251da19adda788f003fcf9f2', 'The prediction is not correct.'

assert isinstance(accuracy, float), 'The accuracy should be a float.'
np.testing.assert_almost_equal(accuracy, 0.930, 3, err_msg='The accuracy value is not correct.')

### Exercise 4.2 - kNN with cosine distance

Now we want to see the difference if we use the cosine distance instead of the Euclidean distance.

Go through the same steps as the previous exercise, but use the cosine distance as the distance metric in the kNN classifier.
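
As a rough illustration of how the cosine metric behaves, the sketch below fits a classifier with `metric="cosine"` on made-up data where the classes differ by direction rather than by position. The `toy_` names and the data are hypothetical.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: class 0 points lie close to the x-axis,
# class 1 points lie close to the y-axis.
X_toy = [[1, 0], [2, 0.1], [3, 0], [0, 1], [0.1, 2], [0, 3]]
y_toy = [0, 0, 0, 1, 1, 1]

# k=3 with the cosine distance as the metric.
toy_clf = KNeighborsClassifier(n_neighbors=3, metric="cosine")
toy_clf.fit(X_toy, y_toy)

# The query points are far from the data but point along each axis.
toy_pred = toy_clf.predict([[10, 1], [1, 10]])

print(toy_pred)  # [0 1]
```

The cosine distance depends only on the angle between vectors, so the query points are matched to the class that points in a similar direction, regardless of their length.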

In [None]:
# Instantiate a kNN Classifier with k=3 and cosine distance as the distance function
# clf = ...
# YOUR CODE HERE
raise NotImplementedError()

# Get predictions for the test dataset
# y_pred = ...
# YOUR CODE HERE
raise NotImplementedError()

# Calculate the accuracy of the prediction
# accuracy = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clf, KNeighborsClassifier)
assert hashlib.sha256(json.dumps(clf.get_params()).encode()).hexdigest() == \
'1f6450a10287ae3aecbfe3be2415e00ef911eaf1a57033545b5fb62a558a174e', \
'The parameters of the classifier are not correct.'

assert isinstance(y_pred, np.ndarray), 'The prediction should be a numpy array.'
assert y_pred.shape == (143,), 'The shape of the prediction is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in y_pred])).encode()).hexdigest() == \
'6ba8a065a1384c98195fb765f17e005a6bb5a0094f25badb5654d9a3b91c165b', 'The prediction is not correct.'

assert isinstance(accuracy, float), 'The accuracy should be a float.'
np.testing.assert_almost_equal(accuracy, 0.937, 3, err_msg='The accuracy value is not correct.')

### Exercise 4.3 - Test different k and metrics

And the last exercise. 

Try different combinations of n_neighbors and metrics and choose the option with the highest accuracy:

1. n_neighbors = 7, metric = 'minkowski'
2. n_neighbors = 9, metric = 'cosine'
3. n_neighbors = 11, metric = 'minkowski'
4. n_neighbors = 11, metric = 'cosine'

Assign the answer to the variable `best_parameters` as an integer (1, 2, 3 or 4).

In [None]:
# Find the best combination of n_neighbors and metric
# best_parameters = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(best_parameters).encode()).hexdigest() == \
'4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce', 'Not the correct option.'

And we're done! Nice job ;)

![were_done](media/were_done.gif)