# Neural Networks

## Prerequisites
- [Fundamental Python]()
- [Gradient based optimisation]()
- [Linear models]()

## Why do simple models struggle with meaningful tasks?

Whereas the size of a house and its price might be linearly correlated, the pixel intensities of an image are certainly not linearly correlated with whether it contains a dog or a cat.

![](./images/complex-fn.png)

We need to build much more complex models to solve harder problems.

## Can we build more complex models by combining many simple transformations?

The models we have just seen apply a single transformation to the data. However, most problems of practical interest can't be solved using such simple models.

Models with greater **capacity** are those which are able to model more complicated functions.

A single linear transformation (multiplication by the model's weights) stretches the input space by a certain factor in some direction, and adding a constant (the bias) shifts it.
We call models which apply only a single such transformation **shallow models**.
What if we applied more than one **layer** of transformations to our inputs to create a **deep model**? Would we be able to increase the capacity of our model and make it able to model more complex input-output relationships, particularly non-linear ones?

![](./images/shallow-vs-deep.png)

...well, not quite yet.

...if we repeatedly apply linear transformations, the input can be factorised out of the output, showing that any stack of linear transformations is equivalent to a single linear transformation.

![](./images/factor-proof.png)
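
As a worked version of this argument, consider just two stacked linear layers. The symbols $W_1, b_1, W_2, b_2$ are notation assumed here for the weights and biases of each layer; the figure may use different names.

```latex
\begin{aligned}
h &= W_1 x + b_1 \\
y &= W_2 h + b_2 \\
  &= W_2 (W_1 x + b_1) + b_2 \\
  &= (W_2 W_1)\,x + (W_2 b_1 + b_2) \\
  &= W' x + b', \qquad W' = W_2 W_1, \quad b' = W_2 b_1 + b_2
\end{aligned}
```

The two layers collapse into one linear transformation with weights $W'$ and bias $b'$, and the same step can be repeated for any number of layers.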

## So how can we increase the capacity of our models?

We want to be able to model non-linear functions, so let's try to throw in some non-linear transformations into our model.

![](./images/activation.png)

These non-linear functions prevent the input from being factorised out of the model. Hence the overall transformation can represent non-linear input-output relationships.

We call these non-linear functions **activation functions**.

However, we don't want to introduce really complicated functions into our model; ideally we would keep things as simple as possible. So let's complicate things only a minimal amount by keeping our activation functions very simple.

Here are some common activation functions. ReLU (Rectified Linear Unit) is by far the most widely used.

![](./images/activ-fns.png)
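
To make this concrete, here is a minimal sketch of three widely used activation functions applied to a small tensor. The choice of ReLU, sigmoid and tanh is an assumption for illustration; the figure above may show a different set.

```python
import torch

def relu(x):
    # ReLU: keep positive values, replace negative values with zero
    return torch.clamp(x, min=0)

def sigmoid(x):
    # Sigmoid: squash any real number into the range (0, 1)
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-2.0, 0.0, 3.0])

relu_out = relu(x)
sigmoid_out = sigmoid(x)
tanh_out = torch.tanh(x)  # tanh squashes values into the range (-1, 1)

print('ReLU:   ', relu_out)
print('Sigmoid:', sigmoid_out)
print('Tanh:   ', tanh_out)
```

In practice you would normally use the built-in versions (`torch.relu`, `torch.sigmoid`, `torch.tanh`) rather than writing your own.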

Now we have all of the ingredients to fully understand how we can model more complicated functions. Let's look at that all together:

![](./images/full-nn.png)

Guess what? That is a **neural network**. Surprise.

It's just repeated simple linear transformations followed by simple non-linear transformations (activation functions). Simple.
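
As a rough sketch of what that means in code, here is a forward pass through a tiny two-layer network written out by hand. The sizes (3 input features, 5 hidden units, 2 outputs) and the random weights are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)  # make the random weights reproducible

x = torch.randn(4, 3)        # a batch of 4 examples with 3 features each

# Layer 1: a linear transformation followed by a ReLU activation
W1 = torch.randn(3, 5)
b1 = torch.randn(5)
h = torch.relu(x @ W1 + b1)  # hidden representation, shape (4, 5)

# Layer 2: another linear transformation producing 2 outputs per example
W2 = torch.randn(5, 2)
b2 = torch.randn(2)
y = h @ W2 + b2              # network output, shape (4, 2)

print(y.shape)
```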

Let's learn some jargon.

![](./images/nn.png)

Neural networks have two additional hyperparameters: the depth of the model (the number of layers) and the width of each layer (the number of units in it).

## What can neural networks do?

The motivation that led us to derive neural networks was that we wanted to model more complex functions. But what functions can a neural network actually represent? As illustrated below, a large enough network can approximate almost any continuous function. Neural networks are **universal function approximators**.

![](./images/univ-approx.png)
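
As a small illustration (not a proof of the general result), the non-linear function $|x|$ can be represented exactly by a network with one hidden layer of two ReLU units. The function name `abs_network` and the hand-picked weights are assumptions made for this sketch.

```python
import torch

def abs_network(x):
    # Hidden layer: two units with weights +1 and -1, no bias, ReLU activation
    hidden = torch.relu(torch.stack([x, -x], dim=-1))
    # Output layer: add the two hidden units together (weights of 1, no bias)
    return hidden.sum(dim=-1)

x = torch.linspace(-3, 3, 7)
print('Input: ', x)
print('Output:', abs_network(x))
```

No single linear transformation can produce this V-shaped function, but one linear layer, one ReLU and a second linear layer can.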

## How can our neural networks learn to model some function?

As we did in the optimisation notebook, we can adjust our model's parameters using gradient descent as follows:
1. Pass the input data forward through the model to output a prediction
2. Calculate the loss between the predicted output and the label
3. Find the direction in which moving the model's parameters will reduce the loss
4. Move the model's parameters (weights) a small amount in that direction
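
The four steps above can be sketched on a toy problem with a single weight. The data (targets that follow `y = 2x`), the learning rate and the number of steps are assumptions chosen only to keep the example small.

```python
import torch

# Toy data (hypothetical): the targets follow y = 2x exactly
inputs = torch.linspace(-1, 1, 20).unsqueeze(1)
labels = 2 * inputs

weight = torch.zeros(1, requires_grad=True)  # the single model parameter
learning_rate = 0.1

losses = []
for step in range(100):
    prediction = inputs * weight                 # 1. forward pass
    loss = ((prediction - labels) ** 2).mean()   # 2. loss between prediction and label
    loss.backward()                              # 3. gradient of the loss w.r.t. the weight
    with torch.no_grad():
        weight -= learning_rate * weight.grad    # 4. small step against the gradient
    weight.grad.zero_()                          # clear the gradient before the next step
    losses.append(loss.item())

print('Learned weight:', weight.item())
print('First loss:', losses[0], 'Last loss:', losses[-1])
```

The learned weight should end up close to 2, and the loss should shrink as training proceeds.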

![](./images/backprop.png)

Here you can see that many terms reappear when computing the gradients of preceding layers.
By caching those terms, we avoid recomputing them for the layers nearer the input. This makes finding the gradient of the loss with respect to each weight in the model much faster, at the cost of the memory needed to store the cached terms.
This process of computing these gradients efficiently is called **backpropagation**.
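
To see the reuse of terms in a small case, the sketch below applies the chain rule by hand to a two-weight model and compares the result with PyTorch's automatic differentiation. The model (`y = w2 * relu(w1 * x)` with a squared-error loss) and all of the numbers are assumptions made for illustration.

```python
import torch

x, target = torch.tensor(1.5), torch.tensor(2.0)
w1 = torch.tensor(0.8, requires_grad=True)
w2 = torch.tensor(-1.2, requires_grad=True)

# Forward pass
h = torch.relu(w1 * x)
y = w2 * h
loss = (y - target) ** 2

# Backward pass done automatically by PyTorch
loss.backward()

# Backward pass done by hand with the chain rule
with torch.no_grad():
    dloss_dy = 2 * (y - target)                        # computed once, reused twice below
    dloss_dw2 = dloss_dy * h                           # gradient for the layer nearest the output
    dloss_dh = dloss_dy * w2                           # passed back to the earlier layer
    dloss_dw1 = dloss_dh * (w1 * x > 0).float() * x    # ReLU's derivative is 1 where its input is positive

print('w1 gradient:', w1.grad.item(), dloss_dw1.item())
print('w2 gradient:', w2.grad.item(), dloss_dw2.item())
```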

## Let's prepare our data

Today we are going to look at a dataset called MNIST (em-nist). It consists of 70,000 images of handwritten digits from 0 to 9.



In [None]:
import torch
from torchvision import datasets, transforms

# GET THE TRAINING DATASET
train_data = datasets.MNIST(
    root='MNIST-data',                        # where is the data (going to be) stored
    transform=transforms.ToTensor(),          # transform the data from a PIL image to a tensor
    train=True,                               # is this training data?
    download=True                             # should i download it if it's not already here?
)

# GET THE TEST DATASET
test_data = datasets.MNIST(
    root='MNIST-data', # where is the data (going to be) stored
    transform=transforms.ToTensor(), # transform the data from a PIL image to a tensor
    train=False,    # this is not the training set
)

# PRINT THEIR LENGTHS AND VISUALISE AN EXAMPLE
print('Training examples:', len(train_data))
print('Test examples:', len(test_data))
ex = train_data[0] # get the first example
x = ex[0] # get the features (the actual tensor data - the first thing in the example)
y = ex[1] # get the label (the second thing in the example)
print('Features:', x)
print('Label:', y)
t = transforms.ToPILImage() # create the transform that can be called to convert the tensor into a PIL Image
img = t(x)    # call the transform on the tensor
img.show()    # show the image

## Make training & validation sets using PyTorch's `random_split()`

In [None]:
# FURTHER SPLIT THE TRAINING INTO TRAINING AND VALIDATION
train_data, val_data = torch.utils.data.random_split(train_data, [50000, 10000])    # split into 50K training & 10K validation
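
If you want the same split every time you run the notebook, `random_split` accepts a `generator` argument. Here is a minimal sketch on a toy dataset; the dataset, the split sizes and the seed value 42 are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset (hypothetical): 10 examples with one feature each
toy_data = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

generator = torch.Generator().manual_seed(42)  # fixing the seed makes the split reproducible
toy_train, toy_val = random_split(toy_data, [8, 2], generator=generator)

print(len(toy_train), len(toy_val))
print('Validation indices:', toy_val.indices)
```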

## PyTorch's `DataLoader` 
PyTorch has a handy utility called a `DataLoader` which can pass us our data in mini-batches of a specified batch size. It can also shuffle them for us.

Let's use `torch.utils.data.DataLoader` to create data loaders from our train, validation and test datasets now. Hint: look at the [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

In [None]:
batch_size = 256

# MAKE TRAINING DATALOADER
train_loader = torch.utils.data.DataLoader( # create a data loader
    train_data, # what dataset should it sample from?
    shuffle=True, # should it shuffle the examples?
    batch_size=batch_size # how large should the batches that it samples be?
)

# MAKE VALIDATION DATALOADER
val_loader = torch.utils.data.DataLoader(
    val_data,
    shuffle=False, # no need to shuffle: we only evaluate on this data
    batch_size=batch_size
)

# MAKE TEST DATALOADER
test_loader = torch.utils.data.DataLoader(
    test_data,
    shuffle=False, # no need to shuffle: we only evaluate on this data
    batch_size=batch_size
)
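
A quick way to check what a `DataLoader` yields is to loop over it and look at the shapes. The sketch below uses random tensors in place of MNIST so that it runs without downloading anything; `fake_images`, `fake_labels` and the dataset size of 1000 are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (hypothetical): 1000 random "images" shaped like MNIST examples
fake_images = torch.rand(1000, 1, 28, 28)
fake_labels = torch.randint(0, 10, (1000,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=256, shuffle=True)

# Collect the shape of every mini-batch of inputs and labels
batch_shapes = [(tuple(inputs.shape), tuple(labels.shape)) for inputs, labels in fake_loader]

print('Number of batches:', len(fake_loader))
print('First batch:', batch_shapes[0])
print('Last batch: ', batch_shapes[-1])
```

Notice that the last batch is smaller than the rest whenever the dataset size is not a multiple of the batch size.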

## Making a neural network with PyTorch

PyTorch makes it really easy for us to build complex models that can be improved via gradient based optimisation. It does this by providing a class named `torch.nn.Module`. Our model classes should inherit from this class because it does a few very useful things for us:

1. `torch.nn.Module` keeps track of every `torch.nn.Parameter` that is created within it. So when we add a linear layer to our model, the parameters (the weight matrix and the bias vector) in that layer will be added to our model's parameters. We can retrieve all parameters of our model using its `parameters()` method. We will later pass this (`mymodel.parameters()`) to our optimiser to tell it that *this* is what it should be optimising.


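2. `torch.nn.Module` treats the `forward` method of any child class specially: calling an instance of the class runs `forward` via the `__call__` method. That means that `mymodel(some_data)` returns the same result as `mymodel.forward(some_data)`. Prefer calling the instance directly, because `__call__` also runs any hooks registered on the module.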


It also contains many more useful tools.

[More detail](https://pytorch.org/tutorials/beginner/nn_tutorial.html) on `torch.nn.Module`
Check out the docs [here]()
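
Both behaviours can be seen in a minimal sketch. `TinyModel` is a hypothetical class used only for illustration; it is not the model we build in this notebook.

```python
import torch

class TinyModel(torch.nn.Module):  # hypothetical model used only for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(3, 2)  # registers a weight (2x3) and a bias (2)

    def forward(self, x):
        return self.layer(x)

tiny = TinyModel()

# 1. The module has tracked the parameters of the linear layer for us
param_shapes = {name: tuple(p.shape) for name, p in tiny.named_parameters()}
print(param_shapes)

# 2. Calling the instance runs the forward method
some_data = torch.randn(4, 3)
same = torch.equal(tiny(some_data), tiny.forward(some_data))
print('Same output:', same)
```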

Once we have created a class to represent our model, we need to define how it performs the forward pass. What layers of transformations do we need to give it? 
Check out these [docs](https://pytorch.org/docs/stable/nn.html#linear-layers) to look at all the layers PyTorch provides.
Hint: what layer have I linked to?

After we've defined some layers for our model, we should implement the `forward` method, which defines what happens when we call an instance of our class. It should pass the argument (our input data) through each of the layers in turn, applying an activation function between them, before returning the transformed input as the output. The output should represent a categorical probability distribution over the classes that the input could belong to. What shape does it need to be? What function does it need to have applied to it?

Note that `torch.nn.CrossEntropyLoss`, which we use as our criterion later, expects raw scores (logits) and applies the softmax itself, so the model below returns logits rather than probabilities.

In [None]:
import torch.nn.functional as F # functional versions of activation functions such as ReLU

class NN(torch.nn.Module): # create a neural network class
    def __init__(self): # initialiser
        super().__init__() # initialise the parent class
        self.layer1 = torch.nn.Linear(784, 1024) # create our first linear layer
        self.layer2 = torch.nn.Linear(1024, 256) # create our second linear layer
        self.layer3 = torch.nn.Linear(256, 10) # create our third linear layer
        
    def forward(self, x): # define the forward pass
        x = x.view(-1, 784) # flatten out our image features into vectors
        x = self.layer1(x) # pass through the first linear layer
        x = F.relu(x) # apply activation function
        x = self.layer2(x) # pass through the second linear layer
        x = F.relu(x) # apply activation function
        x = self.layer3(x) # pass through the third linear layer to get one raw score (logit) per class
        return x # return logits; apply F.softmax(x, dim=1) if you need probabilities
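
The relationship between raw scores, softmax and the cross-entropy loss can be checked on a small example. The logits and labels below are made-up values used only for illustration.

```python
import torch
import torch.nn.functional as F

# Made-up raw scores (logits) for 2 examples over 3 classes, with their true labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

# Softmax turns each row of logits into a probability distribution
probabilities = F.softmax(logits, dim=1)
row_sums = probabilities.sum(dim=1)

# CrossEntropyLoss takes the logits directly and applies the softmax internally
loss_from_logits = torch.nn.CrossEntropyLoss()(logits, labels)

# The same loss by hand: mean negative log-probability of the correct class
loss_by_hand = -torch.log(probabilities[torch.arange(2), labels]).mean()

print(row_sums)
print(loss_from_logits.item(), loss_by_hand.item())
```

Because the softmax is already applied inside `torch.nn.CrossEntropyLoss`, passing it probabilities instead of logits would apply the softmax twice.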

## Training the neural network and visualising its performance

Now that we've made a template for our model, we can:
- instantiate it by creating an instance of it from our class template
- define how we will improve it by specifying an optimiser
- define how we will measure its performance by specifying a criterion
- train it
- write its loss to a graph and see how this changes as it continues to train

Let's code that up

In [None]:
myNeuralNetwork = NN() # initialise our model

# CREATE OUR OPTIMISER
optimiser = torch.optim.Adam(              # what optimiser should we use?
    myNeuralNetwork.parameters(),          # what should it optimise?
)
        
# CREATE OUR CRITERION
criterion = torch.nn.CrossEntropyLoss() # returns a callable object that compares our predictions to our labels and returns our loss

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our model's performance on a graph
    
# TRAINING LOOP
def train(model, epochs):
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader): # for each mini-batch sampled from the training dataloader
            inputs, labels = minibatch # unpack the inputs and labels from the minibatch
            prediction = model(inputs) # pass the data forward through the model
            loss = criterion(prediction, labels) # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss.item())
            optimiser.zero_grad() # reset the gradients attribute of each of the model's params to zero
            loss.backward() # backward pass to compute and set all of the model params' gradients
            optimiser.step() # update the model's parameters
            writer.add_scalar('Loss/Train', loss.item(), epoch*len(train_loader) + idx) # write loss to a graph
            
            
train(myNeuralNetwork, 8) # train for 8 epochs

### What does the loss actually mean practically?

The absolute value of the loss doesn't mean much on its own; it's just a way of continuously evaluating the relative performance of the model whilst it trains. The real metric of performance that we care about is the proportion of ***unseen*** examples that our neural network can correctly classify. These unseen examples are what the test loader provides.

Let's write the code to calculate that now.

In [None]:
def test(model):
    num_correct = 0
    num_examples = len(test_data) # test DATA not test LOADER
    with torch.no_grad(): # we are only evaluating, so there is no need to track gradients
        for inputs, labels in test_loader: # for all examples, over all mini-batches in the test dataset
            predictions = model(inputs) # make prediction
            predictions = torch.argmax(predictions, dim=1) # index of the highest score in each row is the predicted class
            num_correct += (predictions == labels).sum().item() # count the correct predictions in this batch
    percent_correct = num_correct / num_examples * 100 # compute percentage
    print('Accuracy:', percent_correct)
    
test(myNeuralNetwork)

## Exercises
1. Compare the loss curves generated by using different batch sizes. Which is best? As you change the batch size, what variable do you need to change to give those curves the same domain over the x-axis (the number of writes to the summary writer)?
2. It would be good to validate our model as we go along to ensure that we don't overfit. Write a training loop that computes the loss on the validation set after each epoch. Plot the validation loss alongside the training loss. What can you see on the graphs that indicates overfitting?
3. What is the best accuracy you can achieve? Can you implement a grid search and a random search to try different hyperparameters automatically? Record all combinations that you try.
4. What feature of the input data is our standard neural network not taking advantage of? Hint: '************* neural networks' take this into account.

## Congratulations you boss, you've finished the notebook!

Please provide your feedback [here](https://docs.google.com/forms/d/e/1FAIpQLSdZSxvkAE19vjDN4jpp0VvUBPGr_wdtayGAcRNfFGH7e7jQDQ/viewform?usp=sf_link). It means a lot to us.

Next, you might want to check out:
- [Convolutional Neural Networks](https://github.com/AI-Core/Convolutional-Neural-Networks)