# Neural Networks

## Prerequisites
- [Fundamental Python]()
- [Linear models and optimisation]()

## Why do the simple models struggle with meaningful tasks?

Whereas the size of a house and its price might be linearly correlated, the pixel intensities of an image are certainly not linearly correlated with whether it contains a dog or a cat.

![](./images/complex-fn.png)

We need to build much more complex models to solve harder problems.

## Can we build more complex models by combining many simple transformations?

The models we have just seen apply a single transformation to the data. However, most problems of practical interest can't be solved using such simple models.

Models with greater **capacity** are those which are able to model more complicated functions.

A single linear transformation (multiplication by the model's weights) stretches the input space by a certain factor in some direction, and adding a constant (the bias) shifts it.
We call models which apply only a single such transformation **shallow** models.
What if we applied more than one **layer** of transformations to our inputs, to create a **deep model**? Would we be able to increase the capacity of our model and make it able to model more complex input-output relationships, particularly non-linear ones?

![](./images/shallow-vs-deep.png)

...well, not quite yet.

...if we repeatedly apply a linear transformation, the input can be factorised out of the output, showing that any number of stacked linear transformations is equivalent to a single linear transformation.

![](./images/factor-proof.png)
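
To make the factorisation concrete, here is a minimal sketch for the scalar case. The names `w1`, `b1`, `w2`, `b2` and the two helper functions are illustrative assumptions and are not taken from the figure.

```python
# Two stacked linear layers (scalar case): y = w2 * (w1 * x + b1) + b2
w1, b1 = 3.0, 1.0
w2, b2 = -2.0, 0.5

def two_layers(x):
    hidden = w1 * x + b1        # first linear transformation
    return w2 * hidden + b2     # second linear transformation

# Expanding the brackets gives a single linear layer with these parameters
w_combined = w2 * w1            # combined weight
b_combined = w2 * b1 + b2       # combined bias

def one_layer(x):
    return w_combined * x + b_combined

# Both functions agree on every input we try
for x in [-2.0, 0.0, 1.5, 10.0]:
    assert abs(two_layers(x) - one_layer(x)) < 1e-9
    print(x, two_layers(x), one_layer(x))
```

However many linear layers we stack, the same expansion applies, so the extra depth buys no extra capacity.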

## So how can we increase the capacity of our models?

We want to be able to model non-linear functions, so let's try to throw in some non-linear transformations into our model.

![](./images/activation.png)

These non-linear functions prevent the input from being factorised out of the model. Hence the overall transformation can represent non-linear input-output relationships.

We call these non-linear functions **activation functions**.

However, it's not as if we want to introduce really complicated functions into our model. Ideally we wouldn't have to, and we could keep things simple. So let's complicate things only a minimal amount by keeping our activation functions very simple.

Here are some common activation functions. ReLU (Rectified Linear Unit) is among the most widely used, especially in hidden layers.

![](./images/activ-fns.png)
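
As a rough sketch, three common activation functions can be written in a few lines of plain Python. Which functions the figure actually shows is an assumption here; ReLU, sigmoid and tanh are simply the usual candidates.

```python
import math

def relu(z):
    """Rectified Linear Unit: passes positive values through, clips negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)

# Evaluate each function on a negative, zero and positive input
for z in [-2.0, 0.0, 2.0]:
    print(z, relu(z), round(sigmoid(z), 4), round(tanh(z), 4))
```

Each of these is applied element-wise to the output of a linear layer.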

Now we have all of the ingredients to fully understand how we can model more complicated functions. Let's look at that all together:

![](./images/full-nn.png)

Guess what? That is a **neural network**. Surprise.

It's just repeated simple linear transformations followed by simple non-linear transformations (activation functions). Simple.

Let's learn some jargon.

![](./images/nn.png)

Neural networks have additional hyperparameters: the depth of the model (the number of layers) and the width of each layer (the number of nodes in it).

## What can neural networks do?

The motivation that led us to derive neural networks was that we wanted to model more complex functions. But what functions can a neural network actually represent? As we show below, a neural network with enough hidden units can approximate almost any continuous function as closely as we like. Neural networks are **general function approximators** (often called universal function approximators).

![](./images/univ-approx.png)
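
As a very small illustration of this idea, the sketch below hand-picks the weights of a network with two hidden ReLU units so that it computes the absolute value function, which no single linear transformation can represent. The function name `tiny_network` and the chosen weights are assumptions made for this example only.

```python
def relu(z):
    return max(0.0, z)

def tiny_network(x):
    # Hidden layer: two units with weights +1 and -1 (no bias), each followed by ReLU
    h1 = relu(1.0 * x)
    h2 = relu(-1.0 * x)
    # Output layer: add the two hidden units together (weights 1 and 1, no bias)
    return 1.0 * h1 + 1.0 * h2

# The network reproduces |x| exactly on every input we try
for x in [-3.0, -0.5, 0.0, 2.0]:
    assert tiny_network(x) == abs(x)
    print(x, tiny_network(x))
```

Adding more hidden units adds more "kinks", which is what lets wider networks follow more complicated curves.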

## How can our neural networks learn to model some function?

As we did in the optimisation notebook, we can adjust our model parameters using gradient descent as follows:
1. Pass the input data forward through the model to output a prediction
2. Calculate the loss between the predicted output and the label
3. Find the direction in which moving the model parameters will reduce the loss
4. Move the model parameters (weights) a small amount in that direction
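
The four steps above can be sketched in plain Python for the simplest possible model, a single weight `w`. The toy data, the mean squared error loss and the learning rate are assumptions chosen for illustration.

```python
# Toy data generated by y = 2x; the model is a single weight w (no bias)
inputs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]

w = 0.0                 # initial parameter value
learning_rate = 0.05

for step in range(100):
    # 1. forward pass: compute a prediction for every input
    predictions = [w * x for x in inputs]
    # 2. loss: mean squared error between predictions and labels
    loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(inputs)
    # 3. gradient of the loss with respect to w (the direction of steepest increase)
    grad = sum(2 * (p - y) * x for p, y, x in zip(predictions, labels, inputs)) / len(inputs)
    # 4. move w a small amount in the opposite direction
    w -= learning_rate * grad

print(w, loss)
```

After enough updates `w` settles very close to 2, the value that generated the data.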

![](./images/backprop.png)

Here you can see that many terms reappear when computing the gradients of preceding layers.
By caching those terms, we avoid having to recompute them for the layers nearer the input. This makes finding the gradients of the loss with respect to each weight in the model much faster, at the cost of the memory needed to store the cached terms.
This process of computing the gradients efficiently is called **backpropagation**.
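
Here is a minimal sketch of that reuse for a network with one hidden ReLU unit and a squared error loss. The network, the variable names and the finite-difference check are assumptions made for illustration, not the exact setup in the figure.

```python
def relu(z):
    return max(0.0, z)

def forward(w1, w2, x):
    z = w1 * x          # pre-activation of the hidden unit
    h = relu(z)         # hidden activation
    y_hat = w2 * h      # prediction
    return z, h, y_hat

def loss_fn(w1, w2, x, y):
    return (forward(w1, w2, x)[2] - y) ** 2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
z, h, y_hat = forward(w1, w2, x)

# Backward pass, starting at the loss and moving towards the input
dL_dy_hat = 2 * (y_hat - y)                 # computed once, reused by both gradients below
dL_dw2 = dL_dy_hat * h                      # gradient for the last layer
dL_dh = dL_dy_hat * w2                      # reuses the cached dL_dy_hat
dL_dz = dL_dh * (1.0 if z > 0 else 0.0)     # derivative of ReLU is 1 for positive inputs, else 0
dL_dw1 = dL_dz * x                          # gradient for the first layer

# Sanity check against central finite differences
eps = 1e-6
numeric_dw1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
numeric_dw2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)
print(dL_dw1, numeric_dw1)
print(dL_dw2, numeric_dw2)
```

The term `dL_dy_hat` is the kind of shared factor that backpropagation computes once and passes backwards, rather than recomputing it for every layer.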

## Let's prepare our data

Today we are going to look at a dataset called MNIST (em-nist). It consists of 70,000 images of hand-drawn digits from 0 to 9: 60,000 for training and 10,000 for testing. Each image is 28 x 28 pixels in greyscale.



In [7]:
import torch
from torchvision import datasets, transforms

# GET THE TRAINING DATASET
train_data = datasets.MNIST(root='MNIST-data',                        # where is the data (going to be) stored
                            transform=transforms.ToTensor(),          # transform the data from a PIL image to a tensor
                            train=True,                               # is this training data?
                            download=True                             # should i download it if it's not already here?
                           )

# GET THE TEST DATASET
test_data = datasets.MNIST(root='MNIST-data',
                           transform=transforms.ToTensor(),
                           train=False,
                          )

# VISUALISE AN EXAMPLE
x = train_data[0][0]    # each example is an (image, label) pair; take the image tensor of the first example
print(x)
t = transforms.ToPILImage() # create the transform that can be called to convert the tensor into a PIL Image
img = t(x)    # call the transform on the tensor
img.show()    # show the image

tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         ...



## Why should we not tune our hyperparameters based on our model's score on the test set?

If we adjust our model's hyperparameters so that it performs well on the test set, then we are effectively fitting the model to the test set, and its test score stops being an honest estimate of performance on unseen data.

This is like training for a test and evaluating your performance based on how well you can answer the exact questions that come up.
In real life you are unlikely to encounter exactly the same challenges, and so by training on them you will overfit, and not be able to generalise to *different* unseen questions.

You may find that a certain set of hyperparameters performs well on the test set, but then fails to perform as well in the wild.
Analogously, you may find that a particular

## What else can we test them on? 

We can take some of the data that we plan to train the neural network's weights on and separate it from that main training set. 
We can then use this split-off data to validate that the current hyperparameters make our model perform well on unseen data (neither the validation set nor the test set is used to update the weights).

PyTorch has a utility function `torch.utils.data.random_split()` that makes it easy to randomly split a dataset. Check out the [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split) here.

In [16]:
# FURTHER SPLIT THE TRAINING INTO TRAINING AND VALIDATION
train_data, val_data = torch.utils.data.random_split(train_data, [50000, 10000])    # split into 50K training & 10K validation

If you see the error below, the cell has most likely been run more than once: after the first run `train_data` holds only 50,000 examples, so it can no longer be split into 50,000 + 10,000. Re-run the data-loading cell above before splitting again.

ValueError: Sum of input lengths does not equal the length of the input dataset!


## Why should we not pass the whole dataset through the model for each update?
We know that to perform gradient-based optimisation we need to pass inputs through the model (forward pass), and then compute the loss and find how it changes with respect to each of our model's parameters (backward pass). Modern datasets can be absolutely huge. This means that the forward pass can take a long time, as the function which our model represents has to be applied to each and every input given to it, and the whole dataset may not even fit in memory at once.

## Why not just pass a single datapoint to the model for each update?
We want our model to perform well on all examples, not just single examples. So we want to compute the loss and associated gradients over several examples to get an average, which is a less noisy estimate of the gradient over the whole dataset.

## Mini-batch training
The modern way to do training is neither full-batch (whole dataset) nor fully stochastic (single datapoint). Instead we use mini-batch training, where we sample several (but not all) datapoints to compute an estimate of the gradient, which we then use to update the model. The size of the mini-batch is called the **batch size**. Mini-batches are commonly incorrectly referred to as batches, but it's not that deep.

We will experiment with the effect of batch size on the training later.
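
To show what "sampling several but not all datapoints" looks like in practice, here is a minimal plain-Python sketch. The helper `minibatches` and the stand-in dataset are hypothetical and exist only for this example.

```python
import random

def minibatches(dataset, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the dataset exactly once (one epoch)."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)             # shuffle the order of the examples
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_indices]

dataset = list(range(10))                            # stand-in for 10 examples
batches = list(minibatches(dataset, batch_size=4))
print([len(b) for b in batches])                     # the last mini-batch may be smaller than the rest
```

One pass over all of the mini-batches is one epoch, and the model is updated once per mini-batch.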

## PyTorch's `DataLoader` 
PyTorch has a handy utility called a `DataLoader` which can pass us our data in mini-batches of a specified batch size. It can also shuffle them for us.

Let's use `torch.utils.data.DataLoader` to create data loaders from our train, validation and test datasets now. Hint: look at the [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

In [17]:
batch_size = 256

# MAKE TRAINING DATALOADER
train_loader = torch.utils.data.DataLoader(
    train_data,
    shuffle=True,
    batch_size=batch_size
)

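# MAKE VALIDATION DATALOADER
val_loader = torch.utils.data.DataLoader(
    val_data,
    shuffle=False,            # the order of examples does not matter when we only evaluate
    batch_size=batch_size
)

# MAKE TEST DATALOADER
test_loader = torch.utils.data.DataLoader(
    test_data,
    shuffle=False,            # the order of examples does not matter when we only evaluate
    batch_size=batch_size
)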
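
To see what a `DataLoader` hands back, the sketch below builds one over random stand-in data with MNIST-like shapes, so it runs without downloading anything. The names `fake_images`, `fake_labels`, `fake_data` and `loader` are assumptions made for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 1000 "images" of shape 1x28x28 and integer labels from 0 to 9
fake_images = torch.rand(1000, 1, 28, 28)
fake_labels = torch.randint(0, 10, (1000,))
fake_data = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_data, shuffle=True, batch_size=256)

inputs, labels = next(iter(loader))     # grab the first mini-batch
print(inputs.shape)                     # (batch size, channels, height, width)
print(labels.shape)                     # one label per example in the mini-batch
print(len(loader))                      # number of mini-batches per epoch
```

Note that `len(loader)` counts mini-batches, whereas `len(fake_data)` counts individual examples.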

## Binary classification vs multiclass classification

In binary classification the output must be either true or false. Either the example falls into this class, or it doesn't. We have seen that we can represent this by our model having a single output node whose value is forced between 0 and 1, and as such represents the model's confidence that the example belongs to the positive class. Alternatively, still for binary classification, we could have two output nodes, where the value of the first represents the confidence that the input belongs to the positive class (true/class 1) and the value of the second represents the confidence that the input belongs to the negative class (false/class 2). In this case, the values of each output node must be positive and they must sum to 1, because this output layer represents a probability distribution over the output classes.

(Diagram: single vs double output)

In the case where we have two nodes to represent true and false, we can think about it as having trained two models, which have exactly the same weights in every layer except for the last. 

(Diagram: shared weights)

Treating true and false as separate classes with separate output nodes shows us how we can extend this idea to do multiclass classification; we simply add more nodes and ensure that their values are positive and sum to one.

(Diagram: multiclass output)

### What function can we use to convert the output layer into a distribution over classes?

The **softmax function** exponentiates each value in a vector to make it positive and then divides each of them by their sum to normalise them (make them sum to 1). This ensures that the vector then can be interpreted as a probability distribution.

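For a vector $z$ with $K$ values, the $i$-th output of the softmax is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$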
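
A minimal plain-Python sketch of the softmax function follows. Subtracting the largest value before exponentiating is a common numerical-stability trick that is added here as an assumption; it does not change the result.

```python
import math

def softmax(values):
    """Turn a list of real numbers into positive numbers that sum to 1."""
    largest = max(values)                                    # subtracting the max avoids overflow
    exponentials = [math.exp(v - largest) for v in values]   # exponentiate to make every value positive
    total = sum(exponentials)
    return [e / total for e in exponentials]                 # normalise so the values sum to 1

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

The largest input always receives the largest probability, so the predicted class is the index of the largest output.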

## Making a neural network with PyTorch

PyTorch makes it really easy for us to build complex models that can be improved via gradient based optimisation. It does this by providing a class named `torch.nn.Module`. Our model classes should inherit from this class because it does a few very useful things for us:

1. `torch.nn.Module` keeps track of all `torch.nn.Parameter` objects that are created within it. So when we add a linear layer to our model, the parameters of that layer (its weight matrix and bias vector) are registered as parameters of our model. We can retrieve all parameters of our model using its `parameters()` method. We will later pass this (`mymodel.parameters()`) to our optimiser when we tell it that *this* is what it should be optimising.


2. `torch.nn.Module` treats the `forward` method (function) of any child class specially: its `__call__` method runs `forward` for us. That means that running `mymodel(some_data)` gives the same result as `mymodel.forward(some_data)` (it also runs any registered hooks, so calling the model directly is the preferred form).


It contains many more useful tools.

[More detail](https://pytorch.org/tutorials/beginner/nn_tutorial.html) on `torch.nn.Module`
Check out the docs [here]()
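
The two points above can be checked on a very small model. `TinyModel` below is a hypothetical example class, not the model we will build for MNIST.

```python
import torch

class TinyModel(torch.nn.Module):       # hypothetical example model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)      # registers a 2x4 weight matrix and a bias of size 2

    def forward(self, x):
        return self.layer(x)

model = TinyModel()

# Every parameter created inside the module has been registered automatically
for name, parameter in model.named_parameters():
    print(name, tuple(parameter.shape))

x = torch.rand(3, 4)                    # a mini-batch of 3 examples with 4 features each
# Calling the model runs its forward method
assert torch.equal(model(x), model.forward(x))
print(model(x).shape)
```

The iterator returned by `model.parameters()` is what gets passed to an optimiser.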

Once we have created a class to represent our model, we need to define how it performs the forward pass. What layers of transformations do we need to give it? 
Check out these [docs](https://pytorch.org/docs/stable/nn.html#linear-layers) to look at all the layers PyTorch provides.
Hint: what layer have I linked to?

After we've defined some layers for our model we should implement the forward function that will define what happens when we call an instance of our class. This should pass the argument (our input data) through each of the layers, and apply an activation function to them between each, before returning the transformed input as the output. The output should represent a categorical probability distribution over which class the input belongs to. What shape does it need to be? What function does it need to have applied to it?

In [19]:
import torch.nn.functional as F 

class NeuralNetworkClass(torch.nn.Module):
    def __init__(self):
        super().__init__()    # initialise parent module
        self.layer1 = torch.nn.Linear(784, 1024)
        self.layer2 = torch.nn.Linear(1024, 256)
        self.layer3 = torch.nn.Linear(256, 10)
        
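    def forward(self, x):
        x = x.view(-1, 784)         # flatten each 1x28x28 image into a vector of 784 pixel values
        x = self.layer1(x)          # first linear transformation
        x = F.relu(x)               # non-linear activation
        x = self.layer2(x)
        x = F.relu(x)
        x = self.layer3(x)          # 10 outputs, one per digit class
        x = F.softmax(x, dim=1)     # normalise across the class dimension so each row sums to 1
        return x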
    
    

## Training the neural network and visualising its performance

Now that we've made a template for our model, we can
- instantiate it by creating an instance of it from our class template
- define how we will improve it by specifying an optimiser
- define how we will measure its performance by specifying a criterion
- train it
- write its loss to a graph and see how this changes as it continues to train

Let's code that up

In [23]:
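learning_rate = 0.0001
myNeuralNetwork = NeuralNetworkClass()

# CREATE OUR OPTIMISER
optimiser = torch.optim.Adam(              # what optimiser should we use?
    myNeuralNetwork.parameters(),          # what should it optimise?
#     lr=learning_rate                       # using what learning rate? (commented out, so Adam's default of 0.001 is used)
)

# TOY SKETCH OF WHAT AN OPTIMISER OBJECT LOOKS LIKE (illustration only, this is NOT the real Adam update rule)
class ToyOptimiser():
    def __init__(self, params, lr=0.1):
        self.params = list(params)         # keep a reference to the parameters we are responsible for updating
        self.lr = lr

    def get_next_update(self):
        # plain gradient descent: step against the gradient (zero if no gradient has been computed yet)
        return [torch.zeros_like(p) if p.grad is None else -self.lr * p.grad for p in self.params]

    def step(self):
        with torch.no_grad():              # parameter updates must not be tracked by autograd
            for p, update in zip(self.params, self.get_next_update()):
                p += update

my_toy_optimiser = ToyOptimiser(myNeuralNetwork.parameters())
my_toy_optimiser.step()                    # no backward pass has run yet, so this leaves the weights unchanged
the_update = my_toy_optimiser.get_next_update()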
        
# CREATE OUR CRITERION
criterion = torch.nn.CrossEntropyLoss()             # callable class that compares our predictions to our labels and returns our loss

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()                            # we will use this to show our models performance on a graph
    
# TRAINING LOOP
def train(model, epochs):
    model.train()                                  # put the model into training mode (more on this later)
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader):
            inputs, labels = minibatch
            prediction = model(inputs)             # pass the data forward through the model
            loss = criterion(prediction, labels)   # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss)
            optimiser.zero_grad()                  # reset the gradients attribute of each of the model's params to zero
            loss.backward()                        # backward pass to compute and set all of the model param's gradients
            optimiser.step()                       # update the model's parameters
            writer.add_scalar('Loss/Train', loss, epoch*len(train_loader) + idx)    # write loss to a graph
            
            
train(myNeuralNetwork, 8)

Epoch: 0 	Batch: 0 	Loss: tensor(2.3024, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 1 	Loss: tensor(2.3019, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 2 	Loss: tensor(2.3005, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 3 	Loss: tensor(2.2998, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 4 	Loss: tensor(2.2987, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 5 	Loss: tensor(2.2973, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 6 	Loss: tensor(2.2959, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 7 	Loss: tensor(2.2947, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 8 	Loss: tensor(2.2942, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 9 	Loss: tensor(2.2913, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 10 	Loss: tensor(2.2926, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 11 	Loss: tensor(2.2909, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 12 	Loss: tensor(2.2880, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 13 	Loss: tensor(2.2884, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 14 	Loss: tensor(2.2863, grad_fn=<NllLossBackward>)
...

Epoch: 0 	Batch: 119 	Loss: tensor(1.7646, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 120 	Loss: tensor(1.7736, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 121 	Loss: tensor(1.8058, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 122 	Loss: tensor(1.7548, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 123 	Loss: tensor(1.7686, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 124 	Loss: tensor(1.7435, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 125 	Loss: tensor(1.7681, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 126 	Loss: tensor(1.7573, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 127 	Loss: tensor(1.7509, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 128 	Loss: tensor(1.7350, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 129 	Loss: tensor(1.7267, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 130 	Loss: tensor(1.7699, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 131 	Loss: tensor(1.7198, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 132 	Loss: tensor(1.7153, grad_fn=<NllLossBackward>)
...

Epoch: 1 	Batch: 42 	Loss: tensor(1.5726, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 43 	Loss: tensor(1.5929, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 44 	Loss: tensor(1.6125, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 45 	Loss: tensor(1.6056, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 46 	Loss: tensor(1.6026, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 47 	Loss: tensor(1.6117, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 48 	Loss: tensor(1.5969, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 49 	Loss: tensor(1.5981, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 50 	Loss: tensor(1.5878, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 51 	Loss: tensor(1.6286, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 52 	Loss: tensor(1.5991, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 53 	Loss: tensor(1.6057, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 54 	Loss: tensor(1.5865, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 55 	Loss: tensor(1.6172, grad_fn=<NllLossBackward>)
...

Epoch: 1 	Batch: 160 	Loss: tensor(1.5651, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 161 	Loss: tensor(1.5874, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 162 	Loss: tensor(1.5763, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 163 	Loss: tensor(1.5699, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 164 	Loss: tensor(1.5762, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 165 	Loss: tensor(1.5566, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 166 	Loss: tensor(1.5386, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 167 	Loss: tensor(1.5776, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 168 	Loss: tensor(1.5780, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 169 	Loss: tensor(1.5788, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 170 	Loss: tensor(1.5461, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 171 	Loss: tensor(1.5542, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 172 	Loss: tensor(1.5914, grad_fn=<NllLossBackward>)
Epoch: 1 	Batch: 173 	Loss: tensor(1.5886, grad_fn=<NllLossBackward>)
...

Epoch: 2 	Batch: 83 	Loss: tensor(1.5542, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 84 	Loss: tensor(1.5508, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 85 	Loss: tensor(1.5699, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 86 	Loss: tensor(1.5454, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 87 	Loss: tensor(1.5842, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 88 	Loss: tensor(1.5820, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 89 	Loss: tensor(1.5707, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 90 	Loss: tensor(1.5438, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 91 	Loss: tensor(1.5476, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 92 	Loss: tensor(1.5567, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 93 	Loss: tensor(1.5603, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 94 	Loss: tensor(1.5539, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 95 	Loss: tensor(1.5445, grad_fn=<NllLossBackward>)
Epoch: 2 	Batch: 96 	Loss: tensor(1.5318, grad_fn=<NllLossBackward>)
...

Epoch: 3 	Batch: 11 	Loss: tensor(1.5529, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 12 	Loss: tensor(1.5634, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 13 	Loss: tensor(1.5461, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 14 	Loss: tensor(1.5640, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 15 	Loss: tensor(1.5402, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 16 	Loss: tensor(1.5705, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 17 	Loss: tensor(1.5383, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 18 	Loss: tensor(1.5448, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 19 	Loss: tensor(1.5566, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 20 	Loss: tensor(1.5616, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 21 	Loss: tensor(1.5528, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 22 	Loss: tensor(1.5355, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 23 	Loss: tensor(1.5323, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 24 	Loss: tensor(1.5493, grad_fn=<NllLossBackward>)
...

Epoch: 3 	Batch: 136 	Loss: tensor(1.5393, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 137 	Loss: tensor(1.5244, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 138 	Loss: tensor(1.5755, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 139 	Loss: tensor(1.5533, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 140 	Loss: tensor(1.5334, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 141 	Loss: tensor(1.5238, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 142 	Loss: tensor(1.5565, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 143 	Loss: tensor(1.5884, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 144 	Loss: tensor(1.5440, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 145 	Loss: tensor(1.5166, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 146 	Loss: tensor(1.5351, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 147 	Loss: tensor(1.5384, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 148 	Loss: tensor(1.5419, grad_fn=<NllLossBackward>)
Epoch: 3 	Batch: 149 	Loss: tensor(1.5297, grad_fn=<NllLossBackward>)
...

Epoch: 4 	Batch: 59 	Loss: tensor(1.5630, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 60 	Loss: tensor(1.5182, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 61 	Loss: tensor(1.5394, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 62 	Loss: tensor(1.5489, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 63 	Loss: tensor(1.5481, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 64 	Loss: tensor(1.5400, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 65 	Loss: tensor(1.5476, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 66 	Loss: tensor(1.5055, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 67 	Loss: tensor(1.5464, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 68 	Loss: tensor(1.5276, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 69 	Loss: tensor(1.5553, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 70 	Loss: tensor(1.5314, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 71 	Loss: tensor(1.5481, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 72 	Loss: tensor(1.5695, grad_fn=<NllLossBackward>)
...

Epoch: 4 	Batch: 179 	Loss: tensor(1.5347, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 180 	Loss: tensor(1.5466, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 181 	Loss: tensor(1.5467, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 182 	Loss: tensor(1.5324, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 183 	Loss: tensor(1.5463, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 184 	Loss: tensor(1.5260, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 185 	Loss: tensor(1.5491, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 186 	Loss: tensor(1.5230, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 187 	Loss: tensor(1.5114, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 188 	Loss: tensor(1.5280, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 189 	Loss: tensor(1.5356, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 190 	Loss: tensor(1.5551, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 191 	Loss: tensor(1.5460, grad_fn=<NllLossBackward>)
Epoch: 4 	Batch: 192 	Loss: tensor(1.5413, grad_fn=<NllLossBackward>)
...

Epoch: 5 	Batch: 108 	Loss: tensor(1.5171, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 109 	Loss: tensor(1.5344, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 110 	Loss: tensor(1.5488, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 111 	Loss: tensor(1.5362, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 112 	Loss: tensor(1.5528, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 113 	Loss: tensor(1.5384, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 114 	Loss: tensor(1.5393, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 115 	Loss: tensor(1.5236, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 116 	Loss: tensor(1.5193, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 117 	Loss: tensor(1.5249, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 118 	Loss: tensor(1.5244, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 119 	Loss: tensor(1.5093, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 120 	Loss: tensor(1.5362, grad_fn=<NllLossBackward>)
Epoch: 5 	Batch: 121 	Loss: tensor(1.5257, grad_fn=<NllLossBackward>)
...

Epoch: 6 	Batch: 30 	Loss: tensor(1.5140, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 31 	Loss: tensor(1.5417, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 32 	Loss: tensor(1.5126, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 33 	Loss: tensor(1.5192, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 34 	Loss: tensor(1.5056, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 35 	Loss: tensor(1.5369, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 36 	Loss: tensor(1.5215, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 37 	Loss: tensor(1.5211, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 38 	Loss: tensor(1.5541, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 39 	Loss: tensor(1.5482, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 40 	Loss: tensor(1.5497, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 41 	Loss: tensor(1.5212, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 42 	Loss: tensor(1.5191, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 43 	Loss: tensor(1.5210, grad_fn=<NllLossBackward>)
...

Epoch: 6 	Batch: 149 	Loss: tensor(1.5197, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 150 	Loss: tensor(1.5053, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 151 	Loss: tensor(1.5333, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 152 	Loss: tensor(1.5130, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 153 	Loss: tensor(1.5340, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 154 	Loss: tensor(1.5122, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 155 	Loss: tensor(1.5325, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 156 	Loss: tensor(1.5366, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 157 	Loss: tensor(1.5429, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 158 	Loss: tensor(1.5269, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 159 	Loss: tensor(1.5405, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 160 	Loss: tensor(1.5218, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 161 	Loss: tensor(1.5406, grad_fn=<NllLossBackward>)
Epoch: 6 	Batch: 162 	Loss: tensor(1.5558, grad_fn=<NllLossBackward>)
...

Epoch: 7 	Batch: 74 	Loss: tensor(1.5061, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 75 	Loss: tensor(1.5064, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 76 	Loss: tensor(1.5196, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 77 	Loss: tensor(1.5426, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 78 	Loss: tensor(1.5228, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 79 	Loss: tensor(1.5408, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 80 	Loss: tensor(1.5022, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 81 	Loss: tensor(1.5101, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 82 	Loss: tensor(1.5215, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 83 	Loss: tensor(1.5063, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 84 	Loss: tensor(1.5189, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 85 	Loss: tensor(1.5174, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 86 	Loss: tensor(1.5126, grad_fn=<NllLossBackward>)
Epoch: 7 	Batch: 87 	Loss: tensor(1.5364, grad_fn=<NllLossBackward>)
...

### What does the loss actually mean practically?

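The absolute value of the loss doesn't mean much on its own; it is a way of continuously evaluating the relative performance of the model whilst it trains. (Here it levels off near 1.5 rather than approaching 0 because the model's `forward` already applies softmax and `torch.nn.CrossEntropyLoss` applies its own log-softmax on top. With 10 classes the loss then cannot fall below about 1.46. In practice one would usually return the raw outputs of the last layer and let the criterion handle the softmax.) The real metric of performance that we care about is the proportion of ***unseen*** examples that our neural network can correctly classify. These unseen examples are what the test loader provides.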

Let's write the code to calculate that now.

In [21]:
def test(model):
    model.eval()                                        # put the model into evaluation mode
    num_correct = 0
    num_examples = len(test_data)                       # test DATA not test LOADER
    with torch.no_grad():                               # no gradients are needed for evaluation
        for inputs, labels in test_loader:              # for all examples, over all mini-batches in the test dataset
            predictions = model(inputs)                 # shape: (batch size, 10)
            predictions = torch.argmax(predictions, dim=1)    # index of the most probable class for each example
            num_correct += (predictions == labels).sum().item()
    percent_correct = num_correct / num_examples * 100
    print('Accuracy:', percent_correct)

test(myNeuralNetwork)

Accuracy: 97.39999999999999


## Exercises
1. Compare the loss curves generated by using different batch sizes. Which is best? As you change the batch size, what variable do you need to change to give those curves the same domain over the x-axis (number of writes to the summary writer)?
2. It would be good to validate our model as we go along to ensure that we don't overfit. Write a training loop that evaluates the loss on the validation set after each epoch. Plot the validation loss alongside the training loss. What can you see on the graphs that indicates overfitting?
3. What is the best accuracy you can achieve? Can you implement a grid search and a random search to try different hyperparameters automatically? Record every combination that you try.
4. What feature of the input data is our standard neural network not taking advantage of? Hint: '************* neural networks' take this into account.

## Congratulations you boss, you've finished the notebook!

Please provide your feedback [here](https://docs.google.com/forms/d/e/1FAIpQLSdZSxvkAE19vjDN4jpp0VvUBPGr_wdtayGAcRNfFGH7e7jQDQ/viewform?usp=sf_link). It means a lot to us.

Next, you might want to check out:
- [Convolutional Neural Networks](https://github.com/AI-Core/Convolutional-Neural-Networks)