<a href="https://colab.research.google.com/github/AdelaideUniversityMathSciences/MathsForAI/blob/main/Code/Copy_of_assignment_nearest_neighbor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this sheet, we will investigate the effect of using different distance metrics for nearest neighbour classification.

You may need to refer to the [PyTorch documentation](https://pytorch.org/docs/stable/index.html).
It's good to familiarise yourself with the different modules so that you know which functions exist!

In [None]:
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

The MNIST handwritten digit dataset may be too simple for our investigation.

Modify the code below to use the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), which contains low-resolution colour photos for 10 classes.

In [None]:
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize(mean=0.5, std=1.0)])

train_dataset = datasets.MNIST(root='../data', train=True,
                               download=True, transform=transform)
test_dataset = datasets.MNIST(root='../data', train=False,
                              download=True, transform=transform)

Construct large tensors that contain all images and labels in each set.

In [None]:
x_train = torch.stack([x for x, y in train_dataset])
y_train = torch.tensor([y for x, y in train_dataset])
x_test = torch.stack([x for x, y in test_dataset])
y_test = torch.tensor([y for x, y in test_dataset])

Flatten the last 3 dimensions of each tensor to obtain vectors of dimension width\*height\*channels (3 channels for CIFAR-10, so width\*height\*3).

In [None]:
x_train = torch.flatten(x_train, start_dim=-3)
x_test = torch.flatten(x_test, start_dim=-3)
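
As a quick check of what `start_dim=-3` does, here is a minimal sketch on a dummy batch. The batch size of 4 and the name `images` are made up for illustration; the 3x32x32 image shape assumes CIFAR-10.

```python
import torch

# Dummy batch: 4 images, 3 channels, 32x32 pixels (sizes chosen for illustration).
images = torch.zeros(4, 3, 32, 32)

# start_dim=-3 merges the last three dimensions and keeps the batch dimension.
vectors = torch.flatten(images, start_dim=-3)
print(vectors.shape)  # torch.Size([4, 3072])
```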

We now need to compute the distance between `x_test[i]` and `x_train[j]` for all (i, j).

To do this, it is critical to understand "broadcasting", which works the same way in PyTorch as in NumPy.
When we try to apply an element-wise operation to two arrays with different shapes, each array is implicitly repeated along any dimension which has length 1.
For example, if we add a row vector and a column vector, we get a matrix:
```python
>>> x
tensor([[1, 2, 3]])
>>> y
tensor([[10],
        [20],
        [30]])
>>> x.shape
torch.Size([1, 3])
>>> y.shape
torch.Size([3, 1])
>>> x + y
tensor([[11, 12, 13],
        [21, 22, 23],
        [31, 32, 33]])
```
If the two arrays have a different order (number of dimensions, `ndim`), then broadcasting aligns their shapes starting at the last dimension and works backward. (If you're familiar with Matlab, note that it does the opposite: it starts at the first dimension and proceeds forward.)
Any missing dimensions are treated as length 1.
Note that scalars have order 0.
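
A small sketch of these rules, using made-up values: a vector of shape `[3]` and a scalar of shape `[]` are each broadcast against a matrix of shape `[2, 3]`.

```python
import torch

matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])     # shape [2, 3]
vector = torch.tensor([10, 20, 30])    # shape [3], treated as [1, 3]
scalar = torch.tensor(100)             # shape [], treated as [1, 1]

row_shifted = matrix + vector   # the vector is added to every row
all_shifted = matrix + scalar   # the scalar is added to every element
print(row_shifted)  # tensor([[11, 22, 33], [14, 25, 36]])
print(all_shifted)  # tensor([[101, 102, 103], [104, 105, 106]])
```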

It's often useful to insert a dimension using `torch.unsqueeze()`.
For example:
```python
>>> x
tensor([1, 2, 3])
>>> y
tensor([10, 20, 30])
>>> x + torch.unsqueeze(y, 1)
tensor([[11, 12, 13],
        [21, 22, 23],
        [31, 32, 33]])
```
It is not necessary to unsqueeze `x` because broadcasting starts at the last dimension and missing dimensions are treated as length 1.

If you like, you can read more about broadcasting (in numpy) in [this guide](https://numpy.org/doc/stable/user/basics.broadcasting.html).

We could use broadcasting to obtain `x_train[i] - x_test[j]` as shown below.
However, this will instantiate an array with shape `[n_train, n_test, width*height*3]`, which for the full datasets would require terabytes of memory, far more RAM than is available.
(The following code will crash Colab!)

In [None]:
# difference = torch.unsqueeze(x_train, 1) - torch.unsqueeze(x_test, 0)
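
To see why, here is a rough back-of-the-envelope estimate. It assumes the full CIFAR-10 split (50,000 training and 10,000 test images of 32x32x3) stored as 4-byte `float32` values; the variable names are illustrative only.

```python
# Sizes assumed here: CIFAR-10 has 50,000 training and 10,000 test images,
# each 32 x 32 pixels with 3 channels, stored as 4-byte float32 values.
n_train, n_test = 50_000, 10_000
dim = 32 * 32 * 3
bytes_per_float = 4

# Full difference tensor of shape [n_train, n_test, dim].
difference_bytes = n_train * n_test * dim * bytes_per_float
# Distance matrix of shape [n_train, n_test] only.
distance_bytes = n_train * n_test * bytes_per_float

print(f"difference tensor: {difference_bytes / 1e12:.1f} TB")  # 6.1 TB
print(f"distance matrix:   {distance_bytes / 1e9:.1f} GB")     # 2.0 GB
```

The distance matrix alone is a factor of `dim` smaller than the full difference tensor.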

Instead, we will use `einsum()` to obtain the dot product and then use broadcasting to compute

$$
\|x_{i} - x_{j}\|^2 = \|x_{i}\|^2 + \|x_{j}\|^2 - 2 \langle x_{i}, x_{j} \rangle
$$

(Technically this is the squared distance but this will have no impact on the ordering.)

In [None]:
dot = torch.einsum('id,jd->ij', x_train, x_test)
norm_train = torch.sum(x_train ** 2, dim=1)
norm_test = torch.sum(x_test ** 2, dim=1)
dist = (
    torch.unsqueeze(norm_train, 1)
    + torch.unsqueeze(norm_test, 0)
    - 2 * dot)
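
As a sanity check, the expanded form can be compared against the direct broadcasting approach on tensors small enough to fit in memory. This is only a sketch: the tensors `a` and `b` and their sizes are made up for illustration.

```python
import torch

torch.manual_seed(0)
a = torch.randn(5, 8)   # 5 "training" vectors of dimension 8
b = torch.randn(3, 8)   # 3 "test" vectors of dimension 8

# Direct computation via broadcasting: shape [5, 3, 8], then sum over the last axis.
direct = torch.sum((a.unsqueeze(1) - b.unsqueeze(0)) ** 2, dim=-1)

# Expanded form: squared norms plus cross term, never building the [5, 3, 8] tensor.
dot_ab = torch.einsum('id,jd->ij', a, b)
expanded = (
    torch.sum(a ** 2, dim=1).unsqueeze(1)
    + torch.sum(b ** 2, dim=1).unsqueeze(0)
    - 2 * dot_ab)

# The two agree up to floating-point rounding.
print(torch.allclose(direct, expanded, atol=1e-4))
```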

For each testing example, find the index of the nearest training example (arg min of distance).
Use the label of that example as the prediction.

In [None]:
index_nearest = torch.argmin(dist, dim=0)
pred = y_train[index_nearest]
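
To make the role of `dim=0` concrete, here is a toy example with a made-up distance matrix of shape `[n_train=3, n_test=2]`: each column corresponds to one test example, and the arg min runs over the training examples in that column.

```python
import torch

# Toy distance matrix with shape [n_train=3, n_test=2] (values made up).
toy_dist = torch.tensor([[4.0, 0.5],
                         [1.0, 3.0],
                         [2.0, 7.0]])
toy_labels = torch.tensor([7, 2, 9])   # one label per training example

toy_index = torch.argmin(toy_dist, dim=0)  # one index per test example (column)
toy_pred = toy_labels[toy_index]           # look up the label of each nearest neighbour
print(toy_index)  # tensor([1, 0])
print(toy_pred)   # tensor([2, 7])
```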

To measure the accuracy, we check if the prediction is equal to the label and take the mean after converting to a float (0 if not equal, 1 if equal).

In [None]:
torch.mean((pred == y_test).float())
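
A tiny illustration with made-up predictions and labels: the comparison yields a boolean tensor, which must be converted to float before averaging because `torch.mean()` does not accept boolean input.

```python
import torch

toy_pred = torch.tensor([3, 1, 4, 1])
toy_labels = torch.tensor([3, 1, 4, 5])

correct = (toy_pred == toy_labels)    # tensor([True, True, True, False])
accuracy = correct.float().mean()     # fraction of matching entries
print(accuracy.item())  # 0.75
```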

Before proceeding, let's tell Python that we no longer need those large tensors.
To do this, we simply set `dot` and `dist` to something else (in this case, to `None`).
Python automatically frees an object's memory once no variable refers to it any longer.

In [None]:
dot = None
dist = None

Your task is to fill in the functions below.
First define a function that measures accuracy.
It should return a scalar in [0, 1].

In [None]:
def compute_accuracy(labels, pred):
  ...

Now define each of the functions below using the L2 distance (as above), L1 distance, L-infinity distance and the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) (L2 inner product after normalizing to unit norm, i.e. cosine of angle between vectors).

For the purpose of this exercise, you are _not_ permitted to use `torch.cdist()` or any equivalent function (although it's a good function to know about).
You should not use a `for` loop to iterate over the examples in either set.
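
For reference, the four measures can be illustrated on a single pair of vectors. This sketch only shows the definitions, with made-up vectors `u` and `v`; it is not the batched implementation the exercise asks for. Note that cosine similarity is a similarity rather than a distance, so the nearest neighbour is presumably the training example with the largest value, not the smallest.

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 0.0, 3.0])
diff = u - v                                # tensor([-1., 2., 0.])

l2 = torch.sqrt(torch.sum(diff ** 2))       # sqrt(1 + 4 + 0) = sqrt(5)
l1 = torch.sum(torch.abs(diff))             # 1 + 2 + 0 = 3
linf = torch.max(torch.abs(diff))           # max(1, 2, 0) = 2

# Cosine similarity: inner product divided by the product of the L2 norms.
norm_u = torch.sqrt(torch.sum(u ** 2))      # sqrt(14)
norm_v = torch.sqrt(torch.sum(v ** 2))      # sqrt(13)
cos = torch.dot(u, v) / (norm_u * norm_v)   # 11 / sqrt(182)

print(l2.item(), l1.item(), linf.item(), cos.item())
```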

In [None]:
def nearest_neighbor_l2(x_train, y_train, x_test):
  ...

In [None]:
def nearest_neighbor_l1(x_train, y_train, x_test):
  ...

In [None]:
def nearest_neighbor_linf(x_train, y_train, x_test):
  ...

In [None]:
def nearest_neighbor_cos(x_train, y_train, x_test):
  ...

After defining these functions, it should be possible to run the following.

(We will take a random subset of 1000 examples to make it run faster and fit more easily into memory.)

In [None]:
torch.manual_seed(0)
subset_train = torch.randperm(len(train_dataset))[:1000]
subset_test = torch.randperm(len(test_dataset))[:1000]

x_train, y_train = x_train[subset_train], y_train[subset_train]
x_test, y_test = x_test[subset_test], y_test[subset_test]

In [None]:
pred = nearest_neighbor_l2(x_train, y_train, x_test)
compute_accuracy(y_test, pred)

In [None]:
pred = nearest_neighbor_l1(x_train, y_train, x_test)
compute_accuracy(y_test, pred)

In [None]:
pred = nearest_neighbor_linf(x_train, y_train, x_test)
compute_accuracy(y_test, pred)

In [None]:
pred = nearest_neighbor_cos(x_train, y_train, x_test)
compute_accuracy(y_test, pred)