# K Nearest Neighbour (KNN)



## Introduction: the Algorithm



> KNN is an extremely simple algorithm, which consists of the following steps:

- Take the `K` nearest neighbours (the nearest data points) according to some metric (usually Euclidean); `K` is a hyperparameter.
- Take the average of their respective regression values (for regression tasks) __or__ perform __majority voting__ for labels.
- Return the result as the prediction.
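
The steps above can be sketched for a single query point as follows. This is a minimal illustration, not the implementation developed later in this notebook; the function name `knn_predict_one` and the toy data are made up for the example.

```python
import numpy as np


def knn_predict_one(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 2: majority voting, i.e. count the labels and pick the most frequent
    votes = np.bincount(y_train[nearest])
    # Step 3: return the winning label as the prediction
    return int(np.argmax(votes))


# Hypothetical toy data: two well-separated groups of three points
X_train = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict_one(X_train, y_train, np.array([5.5, 5.5])))  # 1
```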



## The Nearest Neighbour

Here, we discuss what it means to be the nearest neighbour.



<p align=center><img width=1000 src=images/knn_data_distances.jpg></p>

__Note that the neighbourhood of an example in the train set includes itself.__ 
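
A small sketch of what this note implies, using made-up points: the distance from a training example to itself is zero, so it is always its own closest neighbour.

```python
import numpy as np

# Hypothetical training points (all distinct)
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])

# Pairwise Euclidean distances between all training points, shape (4, 4)
diff = X_train[:, None, :] - X_train[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# The diagonal holds the distance of each point to itself, which is zero
print(np.diag(distances))

# Hence the single nearest neighbour of each training point is the point itself
nearest = np.argmin(distances, axis=1)
print(nearest)  # [0 1 2 3]
```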



## The Special Model



In this case, our model is quite special in ML for the following reasons:
- __It has no parameters to learn__; hence, it is a __non-parametric model.__
- __No learning phase is required,__ i.e. it is a __lazy predictor.__
- All the data must be kept __at all times__; hence, it is not the most memory-efficient model.
- Although there is no training cost, every prediction requires computing distances to the stored examples, and the predictions might be prone to overfitting when `K` is small.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets

X, y = datasets.make_blobs(
    n_samples=300, centers=6, cluster_std=0.5, random_state=0
)
data = pd.DataFrame(
    data=np.concatenate((X, y.reshape(-1, 1)), axis=1),
    columns=["X1", "X2", "labels"],
)

sns.lmplot(x='X1', y='X2', hue='labels', data=data, fit_reg=False)

## Implementing Distance Calculation

<p align=center><img width=900 src=images/knn_distance_measures.jpg></p>

Conventionally, the Euclidean distance is used; however, we may choose to run the algorithm using different distance metrics.

We will use `scipy` to increase the computation speed:

In [None]:
import scipy.spatial.distance  # a bare `import scipy` does not load the submodule on older versions

distances = scipy.spatial.distance.cdist(X, X)
print(X.shape)
print(distances.shape)

This metric will be passed to our `KNN` model as a hyperparameter.
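
To make "different distance metrics" concrete, here is a sketch of two metrics written in plain `numpy`, both with the same `(A, B) -> distance matrix` calling convention as `cdist`. The function names `euclidean_cdist` and `manhattan_cdist` are made up for this illustration.

```python
import numpy as np


def euclidean_cdist(A, B):
    """Pairwise Euclidean (L2) distances, shape (len(A), len(B))."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def manhattan_cdist(A, B):
    """Pairwise Manhattan (L1) distances, shape (len(A), len(B))."""
    diff = A[:, None, :] - B[None, :, :]
    return np.abs(diff).sum(axis=-1)


A = np.array([[0.0, 0.0], [1.0, 1.0]])
B = np.array([[3.0, 4.0]])

print(euclidean_cdist(A, B))  # first entry is 5.0 (a 3-4-5 triangle)
print(manhattan_cdist(A, B))  # [[7.], [5.]]
```

Any callable with this signature could serve as the metric. With `scipy`, the same effect can be obtained through the `metric` argument of `cdist`, e.g. `metric="cityblock"` for the Manhattan distance.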

### Example


#### Implementation



As the first step, we implement `KNN`.

- Create a `KNN` class taking `k` and `metric` as hyperparameters (assign `None` to `self.X` and `self.y`).
- Create a `fit` method taking `X` and `y`.
- Create a `predict` method taking `X` and predicting the respective labels. To achieve that, we do the following:
    - Calculate the distances between `self.X` and `X` using `self.metric`.
    - Perform a sorting-by-index operation ([`np.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)) along a specific axis. The output shape from this step should be `(self.X.shape[0], X.shape[0])`, i.e. for each point in `X`, the indices of the points in `self.X` ordered by their distance to it.
    - Choose at most `k` samples from `self.X` (__tip: simply slice with [: self.k]__). The output shape from this step should be `(k, X.shape[0])` (i.e. the number of neighbours, number of examples in `X`).
    - Use `numpy`'s fancy indexing on `labels` (`self.y`) using the sorted indices. __Tip: try the simplest solution__. The output shape should be the same as in the previous step.
    - Count how many times each label occurs among the `k` neighbours of each example using `bincount2d`. The output shape should be `(X.shape[0], classes)`, where `classes` is the number of unique classes in `y`. Can you pass the output from the previous step directly, or do we have to transform it for the shapes to be right?
    - Finally, return the most frequently occurring label using [`np.argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) along a specific axis. The output shape should be `(X.shape[0],)` (a vector containing the label for each example).
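
The sorting and fancy-indexing steps can be traced on a hand-made distance matrix. This is only a shape walk-through with made-up numbers; it stops before the counting step.

```python
import numpy as np

k = 2
y_train = np.array([0, 0, 1, 1])  # labels of 4 hypothetical training points

# distances[i, j] is the distance from training point i to query point j
distances = np.array(
    [
        [0.1, 5.0, 2.0],
        [0.2, 4.0, 2.5],
        [3.0, 0.3, 0.4],
        [4.0, 0.1, 0.5],
    ]
)  # shape (4, 3): 4 training points, 3 query points

# Sort the training indices by distance for every query (column-wise)
order = np.argsort(distances, axis=0)  # shape (4, 3)
# Keep only the k closest training points per query
nearest = order[:k]  # shape (2, 3)
# Fancy indexing replaces every training index with its label
labels = y_train[nearest]  # shape (2, 3)

print(nearest)  # [[0 3 2], [1 2 3]]
print(labels)   # [[0 1 1], [0 1 1]]
```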


    
#### Analysis



Determine the accuracy on the training dataset to ensure that everything works correctly (you can use the `sklearn.metrics` module).

- What is the accuracy in this case and what is the reason for your answer?
- How can it be changed __for the worse__ (by varying only the hyperparameters)?

In [None]:
import typing
import dataclasses


def bincount2d(x):
    N = x.max() + 1
    ids = x + (N * np.arange(x.shape[0]))[:, None]
    return np.bincount(ids.ravel(), minlength=N * x.shape[0]).reshape(-1, N)

@dataclasses.dataclass
class KNN:
    k: int
    metric: typing.Callable[[np.ndarray, np.ndarray], np.ndarray]

    def fit(self, X, y):
        self.X = X
        self.y = y

    def predict(self, X):
        assert hasattr(self, "X"), "fit method should be called before predicting!"
        distances = self.metric(self.X, X)
        labels_indices = np.argsort(distances, axis=0)[: self.k]
        labels = self.y[labels_indices]  # use the fitted labels, not the global `y`
        frequencies = bincount2d(labels.T)
        return np.argmax(frequencies, axis=1)


        
clf = KNN(k=3, metric=scipy.spatial.distance.cdist)
clf.fit(X, y)
clf.predict(X)
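
The offset trick inside `bincount2d` may be easier to follow on a small input. The sketch below repeats the same computation step by step on made-up neighbour labels (one row per query point).

```python
import numpy as np

# Hypothetical neighbour labels for 3 query points with k=3
labels = np.array(
    [
        [0, 0, 2],
        [1, 1, 1],
        [2, 0, 2],
    ]
)
n_classes = labels.max() + 1  # 3

# Shift row i by i * n_classes so that every (row, label) pair gets a unique id
offsets = n_classes * np.arange(labels.shape[0])[:, None]
ids = labels + offsets  # [[0 0 2], [4 4 4], [8 6 8]]

# A single flat bincount, reshaped back to one row of counts per query
counts = np.bincount(ids.ravel(), minlength=n_classes * labels.shape[0])
counts = counts.reshape(-1, n_classes)

print(counts)                     # [[2 0 1], [0 3 0], [1 0 2]]
print(np.argmax(counts, axis=1))  # [0 1 2]
```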

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y, clf.predict(X))  # signature is (y_true, y_pred)

## Numba

[`Numba`](https://numba.pydata.org/) is a simple Python framework, which the authors describe as follows:

> Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

Its goal is to make code as fast as `numpy` (or even faster) while allowing the use of native Python constructs (such as loops, `if` statements, etc.).

In [None]:
!pip install numba

In [None]:
import contextlib
import time

import numba
import numpy as np


@contextlib.contextmanager
def timer(function):
    start = time.time()
    yield
    print(f"Elapsed time for {function.__name__}: {(time.time() - start)}")


@numba.jit(nopython=True)  # @njit is the same
def numba_trace(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace


def python_trace(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace


def numpy_trace(a):
    # Vectorised equivalent of the loops above: tanh of the diagonal, summed
    return a + np.tanh(np.diagonal(a)).sum()


x = np.arange(1000000).reshape(1000, 1000)

# Pure Python run
with timer(python_trace):
    python_trace(x)

# Pure numpy run
with timer(numpy_trace):
    numpy_trace(x)

# First run is slow because of compilation
with timer(numba_trace):
    numba_trace(x)

# Now it is fast
with timer(numba_trace):
    numba_trace(x)

### About numba

`Numba` mostly involves the use of decorators over functions (or classes in some cases); hence, it is easy to use.

__Occasionally, you need to exert extra effort to understand why a code snippet does not work as intended; however, it is usually worth it.__

### Compilation phase

- The first time a function decorated with `njit` is called, `numba` reads its Python bytecode, conducts analyses/optimisations and finally compiles it using [LLVM](https://llvm.org/).
- The generated machine code is tailored to your specific CPU architecture (specific low-level instructions).

### Tips

- Use `numba` when it is difficult to vectorise the `numpy` code (__note that this should be your last resort; always attempt to realise the vectorised solution first__).
- Use `numba` for functions that either take long to run (so the compilation time does not impact the runtime) or are run many times.
- Be careful with arguments and their type specification (next notebook).
- Use `njit` whenever possible.
- Numba provides the `parallel` argument for decorators (for `njit` as well). Use it when a single loop iteration takes a long time and is independent of the other iterations.

## Voting



What we have seen above is called __majority voting__.

> In majority voting, the label with the highest occurrence frequency is chosen.

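This explains why `K` is usually chosen to be an odd number: it avoids ties (e.g. `2` votes for one label and `2` for another). Note that an odd `K` only guarantees a winner when there are two classes; with more classes, ties can still occur.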
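
A short sketch of what happens on a tie when votes are counted with `np.bincount` and resolved with `np.argmax`, as in the implementation above. The neighbour labels are made up for the example.

```python
import numpy as np

# K = 4 neighbours: two votes for label 0 and two for label 1, i.e. a tie
votes_even = np.bincount(np.array([1, 0, 1, 0]))
print(votes_even)             # [2 2]
# np.argmax returns the first maximum, so the tie silently goes to label 0
print(np.argmax(votes_even))  # 0

# K = 3 cannot produce a tie between two classes
votes_odd = np.bincount(np.array([1, 0, 1]))
print(votes_odd)              # [1 2]
print(np.argmax(votes_odd))   # 1
```

The tie is not reported in any way; the lower label simply wins, which is why it is worth avoiding ties through the choice of `K`.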



### Weighted majority voting



> Weighted majority voting occurs when we assign a weight for each example and take them into account.

Weights can be assigned based on different factors, depending on the end goal. __For KNN, it is reasonable to use weights based on the similarity of the provided `X` examples to the ones we have trained on__.

We have calculated the similarity based on the Euclidean distance; however, __please note that these distances are not used directly during voting:__ a smaller distance must translate into a larger weight (e.g. by taking the inverse of the distance).



### Theoretical example



Let us assume that we have set `K=5` and consider a single `test` example:
- We assume that one example from the training set has a Euclidean distance to our `test` example equal to `0.1`.
- We assume that this example has label `0`.
- Now, let us imagine the distances for `4` other training samples to be, e.g. `1000` (so the samples are not similar).
- Let us assume that these examples have label `1`.
- __Majority voting would assign this example a label of `1`.__

If we were to do 'weighted voting', the weight for a single example would probably be large enough (in comparison) to change the `test` example label to `0` (which is most probably correct for this example).
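
The numbers from this example can be checked directly. The sketch below assumes inverse-distance weights (`1 / distance`), which is one common choice rather than the only possible one.

```python
import numpy as np

# The K=5 neighbours from the example: one very close, four very far away
distances = np.array([0.1, 1000.0, 1000.0, 1000.0, 1000.0])
labels = np.array([0, 1, 1, 1, 1])

# Plain majority voting: every neighbour counts as one vote
majority = np.bincount(labels)
print(majority, np.argmax(majority))  # [1 4] 1

# Weighted voting (assumed weights: inverse distance)
weights = 1.0 / distances
weighted = np.bincount(labels, weights=weights)
print(weighted, np.argmax(weighted))  # approximately [10. 0.004] and 0
```

The single close neighbour contributes a weight of `10`, while the four distant ones contribute only `0.004` in total, so the weighted vote selects label `0`.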



### Example



Here, we explore the steps required to implement weighted `KNN`.

__We will use `numba` for convenience.__

- __Tip:__ Take specific routines out of the class and implement them separately as helpers, as demonstrated below.



#### Implement `_weighted_frequencies`



`_weighted_frequencies` takes three arguments:
- `result_array` of shape `(M, L)`, where `M` is the number of examples, and `L` is the number of unique labels. It is filled with `zeros`.
- `labels` of shape `(M, K)`, where `K` is the number of neighbours. Each value in the `K` dimension is the respective `KNN` label.
- `weights` of shape `(M, K)`. Each value in the `K` dimension is the weight given to the `K`-th neighbour.

Now, using two nested loops (over the examples and over their neighbours), add each neighbour's weight to the entry of `result_array` that corresponds to its label, and return the array (__tip:__ as the array is initialised with zeros, you can simply add the appropriate weights at the appropriate index).



#### Analysis



- How does the performance change when `njit` is changed to `jit` or when the decorator is removed?
- What can be done to achieve a non-`1.0` accuracy when evaluating on the `training` dataset (note that you cannot sabotage the implementation; you can only vary the hyperparameters)?

In [None]:
from sklearn.metrics import accuracy_score


@numba.njit
def _weighted_frequencies(result_array, labels, weights):
    for row in range(labels.shape[0]):
        for column in range(labels.shape[1]):
            result_array[row, labels[row, column]] += weights[row, column]

    return result_array


class WeightedKNN(KNN):
    def predict(self, X):
        distances = self.metric(self.X, X)
        labels_indices = np.argsort(distances, axis=0)[: self.k]
        labels = self.y[labels_indices].T  # use the fitted labels, not the global `y`
        weights = 1 / (np.sort(distances, axis=0)[: self.k] + 1e-7).T
        result_array = np.zeros((labels.shape[0], np.max(labels) + 1))
        w_frequencies = _weighted_frequencies(result_array, labels, weights)
        return np.argmax(w_frequencies, axis=1)


clf = WeightedKNN(k=3, metric=scipy.spatial.distance.cdist)
clf.fit(X, y)

print("With compilation phase:")
with timer(WeightedKNN):
    clf.predict(X)

print("Compiled predict:")
with timer(WeightedKNN):
    clf.predict(X)


accuracy_score(y, clf.predict(X))  # signature is (y_true, y_pred)

# No njit

# With compilation phase:
# Elapsed time for WeightedKNN: 0.07559394836425781
# Compiled predict:
# Elapsed time for WeightedKNN: 0.054709672927856445

# njit

# With compilation phase:
# Elapsed time for WeightedKNN: 0.23263049125671387
# Compiled predict:
# Elapsed time for WeightedKNN: 0.0071125030517578125

## Limitations of KNN

- To make a prediction, we need to find the distance between each query point and every stored training point. The time complexity of the algorithm is dominated by this process.
- Examples that are close in the feature space are not necessarily close in the label space.
    - For example, if examples have similar values for features that do not influence the output label, they will be close in the feature space, but not in the label space.
    - KNN relies on the proximity assumption, i.e. that nearby examples have similar labels, and it performs poorly when this assumption does not hold.
- When working with high-dimensional data, it is difficult to visualise the data and hand-pick a suitable `K` (however, we can still use `grid search` or similar hyperparameter tuning methods).
- When making predictions, we need to store the whole dataset in the model, which is inefficient in terms of memory.
- For the best results, we should scale our features so that features with large numeric ranges do not disproportionately influence the distances. Note that, with KNN, scaling changes the distances between examples along each dimension of the feature space and can therefore result in different nearest neighbours. Conduct experiments with and without feature scaling.
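
The effect of feature scaling on the nearest neighbour can be seen on a tiny example. The data below are made up (an age in years and an income in dollars), and standardisation is used as one possible scaling method.

```python
import numpy as np


def nearest_index(X_train, x_query):
    """Index of the training point closest to x_query (Euclidean distance)."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    return int(np.argmin(distances))


# Hypothetical data: column 0 is an age in years, column 1 an income in dollars
X_train = np.array([[25.0, 50_000.0], [60.0, 50_500.0]])
x_query = np.array([27.0, 50_400.0])

# Without scaling, the income column dominates the distance
raw_nearest = nearest_index(X_train, x_query)

# Standardise both features using the training mean and standard deviation
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
scaled_nearest = nearest_index((X_train - mean) / std, (x_query - mean) / std)

print(raw_nearest, scaled_nearest)  # 1 0
```

Before scaling, the query is matched to the 60-year-old because the incomes are numerically closer; after scaling, the similar age matters as well and the nearest neighbour changes.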

## KNN for Regression



KNN can be employed for regression as well as classification, with the following differences:

- Labels are not integer-class labels, but consist of continuous values.
- Instead of majority voting, we simply take the mean of values (possibly weighted mean).
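
A minimal sketch of KNN regression under these two differences. The function name `knn_regress`, the inverse-distance weights and the toy data are assumptions made for this illustration.

```python
import numpy as np


def knn_regress(X_train, y_train, X_query, k=3, weighted=False):
    """Predict continuous targets as the (optionally weighted) mean of k neighbours."""
    diff = X_train[:, None, :] - X_query[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))  # (n_train, n_query)
    nearest = np.argsort(distances, axis=0)[:k]    # (k, n_query)
    targets = y_train[nearest]                     # (k, n_query)
    if not weighted:
        return targets.mean(axis=0)
    # Assumed weighting: inverse distance, with a small constant against division by zero
    nearest_distances = np.take_along_axis(distances, nearest, axis=0)
    weights = 1.0 / (nearest_distances + 1e-7)
    return (weights * targets).sum(axis=0) / weights.sum(axis=0)


# Hypothetical one-dimensional data where the target equals the feature
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
X_query = np.array([[0.4]])

plain = knn_regress(X_train, y_train, X_query, k=3)
weighted = knn_regress(X_train, y_train, X_query, k=3, weighted=True)
print(plain)     # mean of 0, 1 and 2, i.e. [1.]
print(weighted)  # pulled towards the targets of the closest neighbours
```

With plain averaging, the three neighbours contribute equally; with weighting, the prediction moves towards the target of the closest training point.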

## Conclusion

At this point, you should have a good understanding of

- the KNN algorithm and its limitations.
- Numba and voting.
- how to implement distance calculation.