# Convolutional Neural Networks

__Prerequisites__

- [Neural Networks](https://github.com/AI-Core/Neural-Networks/blob/master/Neural%20Networks.ipynb)

## What's wrong with how neural networks process images?

The fully connected neural network we looked at in the previous lesson takes a vector as input. To use it for image classification, we flattened each image into a vector by stacking its rows.

#### Spatially structured data

For some problems, the order of the features in each example does not matter (e.g. age, height, hair length). But this isn't the case for images. If we randomly reorder the pixels in an image, then it will likely be unrecognisable. Much of the useful information in images comes not from the values of the individual features (pixels), but from their relative positions. The same is true for any other **spatially structured** data such as videos, soundwaves, medical scans, 3D models etc.

The spatial relationships between the different pixels are crucial to our understanding of an image. When we flatten the image, this structure is no longer explicit in the input: pixels that were vertical neighbours end up far apart in the vector, and the network has to rediscover which features are close together.
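
To make the effect of flattening concrete, here is a minimal sketch using a made-up 4x4 "image" (the array and index names are illustrative, not part of the lesson's code). It shows that two vertically adjacent pixels end up a full row width apart once the rows are stacked into a vector.

```python
import numpy as np

# A tiny 4x4 "image" whose pixel values are simply their row-major index.
image = np.arange(16).reshape(4, 4)

# Flatten by stacking the rows, as a fully connected network requires.
flat = image.flatten()

# Pixel (1, 2) and the pixel directly below it, (2, 2), are vertical neighbours.
width = image.shape[1]
idx_top = 1 * width + 2       # position of pixel (1, 2) in the flattened vector
idx_bottom = 2 * width + 2    # position of pixel (2, 2) in the flattened vector

print(flat[idx_top], flat[idx_bottom])   # same values as image[1, 2] and image[2, 2]
print(idx_bottom - idx_top)              # the neighbours are now `width` positions apart
```

Nothing in the flattened vector itself tells the model that these two positions used to touch.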

#### Weight sharing across space

Regardless of where I see something interesting in my field of view, it can often be processed in the same way. 

Neural networks have individual weights for each input feature because they expect each feature to represent a totally different thing (e.g. age, height, weight). In other domains like computer vision however, different features (pixels) can represent the same thing just in different locations (e.g. money on my left, money on my right).

Instead of learning separate weights to detect the same feature at every position where it might appear, we should share the same learnt weights over all positions of the input. This saves both time and memory, because we no longer compute and store these duplicate weights.
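
As a rough illustration of the saving, the sketch below counts the learnable parameters of a fully connected layer and of a convolutional layer that both produce 16 features from a 28x28 single-channel image. The layer sizes are assumptions chosen for the example, and the two layers are not equivalent models. The point is only that the convolutional layer's parameter count does not depend on the image size.

```python
import torch

# Fully connected: each of the 28*28 input pixels has its own weight
# for each of the 16 output features.
linear = torch.nn.Linear(28 * 28, 16)

# Convolutional: 16 filters of size 5x5 over 1 input channel,
# with the same weights reused at every position of the image.
conv = torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5)

def count_params(module):
    """Total number of learnable values (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

print(count_params(linear))  # 784*16 weights + 16 biases = 12560
print(count_params(conv))    # 16*1*5*5 weights + 16 biases = 416
```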

#### So what

Using our prior understanding of how image data should be processed spatially, we'd like to find some kind of model that can retain the spatial structure of an input, and look for the same features over the whole of this space. This is what we will use convolutional neural networks for.

## Images as data

Images are not naturally vectors. They obviously have a height and a width rather than just a length - so they need to at least be a matrix. 

#### Channels

On a screen, colors are made by combining 3 primary colors in different proportions.
As such, as well as height and width, color images have another axis called the **channels**, which specifies the intensity (contribution) of each of these primary colors.
Red, green and blue are the (arbitrary) standard primary colors. 
So most images that we will work with have a red channel, a green channel and a blue channel.
This is illustrated below.

![image](images/CNN_RGB.JPG)

Some images can also have transparent backgrounds, in which case they might have a fourth channel to represent the opacity at each pixel location.
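
A minimal sketch of how this looks as a tensor, assuming PyTorch's usual (channels, height, width) layout and a made-up 4x6 image:

```python
import torch

# A made-up 4x6 color image in (channels, height, width) layout.
rgb = torch.zeros(3, 4, 6)
rgb[0] = 1.0                      # set the red channel to full intensity everywhere

red, green, blue = rgb            # unpacking iterates over the channel axis
print(rgb.shape)                  # torch.Size([3, 4, 6])
print(red.shape)                  # torch.Size([4, 6]) - one intensity per pixel

# A fourth channel holding the opacity at each pixel location.
alpha = torch.ones(1, 4, 6)
rgba = torch.cat([rgb, alpha], dim=0)
print(rgba.shape)                 # torch.Size([4, 4, 6])
```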

## How was computer vision done before deep learning?

In the past, people would try to draw patterns that they thought would appear in images and be useful for the problem that they were trying to solve. This was a painstakingly long process, and was obviously susceptible to a lot of bias by these feature designers.

## Filters/Kernels
These supposedly useful patterns mentioned above are known as **filters** or **kernels**. 
Each filter looks for a particular pattern.
E.g. a filter that looks for circles would have high values in a circle and low values in other locations.

![title](images/kernels.jpg)

Filters *look* for the patterns they represent by measuring how closely the pixels at any particular location match the values that they contain. A mathematically convenient way to do this is by taking a **dot product** between the filter's values and the input values which it covers - an element-wise multiplication followed by a sum of the results. **This produces a single value** which should be larger when the input better matches the feature that the filter looks for.
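
The dot product described above can be sketched in a few lines. The filter and the two patches are hypothetical values invented for this example: a filter that looks for a vertical line, one patch that contains a vertical line and one that contains a horizontal line.

```python
import numpy as np

# Hypothetical 3x3 filter that looks for a vertical line.
vertical_filter = np.array([[0, 1, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# A patch of the input that contains a vertical line...
patch_match = np.array([[0, 1, 0],
                        [0, 1, 0],
                        [0, 1, 0]])

# ...and one that contains a horizontal line instead.
patch_mismatch = np.array([[1, 1, 1],
                           [0, 0, 0],
                           [0, 0, 0]])

def filter_response(filt, patch):
    """Element-wise multiplication followed by a sum: a single matching score."""
    return np.sum(filt * patch)

print(filter_response(vertical_filter, patch_match))     # 3
print(filter_response(vertical_filter, patch_mismatch))  # 1
```

The matching patch gives the larger response, which is the behaviour we want from a pattern detector.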

It is standard for filters to always convolve through the full depth of the input. So if we have an input with 3 channels (e.g. a color image), our kernel will also have a depth of 3 - where each channel of the filter is what it looks for from that corresponding color channel. If our input has 54 channels, then so will our filter. 

The width and height of our kernels is up to us (they are hyperparameters). It's standard to have kernels with equal width and height.
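
One way to check the depth rule is to inspect the weights that PyTorch creates for a convolutional layer. This is a small sketch, with kernel sizes picked arbitrarily for the example. The weight tensor has shape (number of filters, input channels, kernel height, kernel width).

```python
import torch

# One 5x5 filter over a 3-channel (color) input: the filter gets a depth of 3.
color_conv = torch.nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
print(color_conv.weight.shape)   # torch.Size([1, 3, 5, 5])

# One 3x3 filter over a 54-channel input: the filter gets a depth of 54.
deep_conv = torch.nn.Conv2d(in_channels=54, out_channels=1, kernel_size=3)
print(deep_conv.weight.shape)    # torch.Size([1, 54, 3, 3])
```

We only choose the width and height. The depth follows from the number of input channels.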

## The convolution operation

In machine learning, convolution is the process of moving a filter across every possible position of the input and computing a value for how well it is matched at each location.

This pattern matching over the spatially structured input produces a similarly spatially structured output. We call this output an **activation map** or a **feature map** because it represents the activations in the next layer that should represent some higher level (more complex) features than the feature maps in the input.

The animation below shows how a 1x3x3 filter is applied to a 1x5x5 image (for simplicity, input channels = 1). 
On the left is the filter that we will convolve over the input. In the centre is the input being convolved over. On the right is the output activation map produced by convolving this filter over this input.

Notice how the output has high values when the filter is passed over locations where there is an X shape in the input image. This is because the values of the filter are such that it is performing pattern matching for the X shape.

![image](images/convolution_animation.gif)
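
The sliding process can be written out by hand. The sketch below is a deliberately naive NumPy version (stride 1, no padding, single channel), and the X-shaped filter and 5x5 image are made-up values rather than the exact ones in the animation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over every valid position of `image` (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]        # the region the filter covers
            output[i, j] = np.sum(patch * kernel)    # dot product = matching score
    return output

# A filter that looks for an X shape.
x_filter = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]])

# A 5x5 image with an X drawn in its centre.
image = np.zeros((5, 5))
image[1:4, 1:4] = x_filter

activation_map = convolve2d(image, x_filter)
print(activation_map.shape)   # (3, 3)
print(activation_map)         # the largest value is at the centre, where the X is
```

In practice we would use an optimised library implementation instead of Python loops, but the computation is the same idea.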

The convolution operation has a few possible parameters:

### Stride
The stride is the number of pixels we shift our kernel along by to compute the next value in the output activation map. A larger stride means that fewer values are computed for the output activation map. This reduces computation time and memory requirements, at the cost of a lower output resolution.
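
To see how quickly the output shrinks, here is a small sketch of the usual output-size calculation for an unpadded convolution along one axis. The helper name `strided_output_size` is made up for this example.

```python
def strided_output_size(input_size, kernel_size, stride):
    """Number of positions a kernel can take along one axis (no padding)."""
    return (input_size - kernel_size) // stride + 1

# A 5-pixel-wide kernel moved across a 28-pixel-wide input.
for stride in (1, 2, 3):
    print('stride', stride, '->', strided_output_size(28, 5, stride), 'output values per row')
# stride 1 -> 24, stride 2 -> 12, stride 3 -> 8
```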

### Padding
We can *pad* the input with a border of extra pixels around the edge. Why might we want to do this?

##### Model depth limitations

When we use a kernel size larger than one, the filter fits in fewer positions than there are pixels along each side, so without padding the output of the convolution is smaller than the input (at stride 1, a kernel of size k removes k - 1 pixels from both the height and the width). As such, there is a limit to the number of successive convolutions that we can apply. Eventually the feature map becomes so small that the filter fits in only one location, and the output then has a height and width of 1. Such an output cannot be usefully convolved over again (convolution with a 1x1 filter is equivalent to multiplication).

##### Equal input from each pixel

When we use a kernel size larger than one, the corner pixels will only contribute to a single output value because they only enter the kernel at its very extreme positions. As such they contribute less to the final predictions than the other pixels. The same is true for pixels near the edge, but to a lesser extent.
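
As a quick check of the size argument, the sketch below passes a random 28x28 input through a 5x5 convolution with and without padding. The input size and kernel size are assumptions for the example. With a border of 2 pixels on every side, the output keeps the height and width of the input.

```python
import torch

x = torch.randn(1, 1, 28, 28)   # (batch, channels, height, width)

no_padding = torch.nn.Conv2d(1, 1, kernel_size=5, padding=0)
with_padding = torch.nn.Conv2d(1, 1, kernel_size=5, padding=2)   # 2 extra pixels on every side

print(no_padding(x).shape)     # torch.Size([1, 1, 24, 24]) - shrinks by kernel_size - 1
print(with_padding(x).shape)   # torch.Size([1, 1, 28, 28]) - same size as the input
```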

#### Different padding modes

We can use different "padding modes" to specify what we pad the image with. Options include padding it with zeros, continuing the last color outwards, reflecting the inwards colors. See options provided by PyTorch [here](

![image](images/CNN_diagram.JPG)

For convolution, each computed value in the output feature map is a linear function of the pixels in a local region of the input as opposed to fully connected nets where each computed feature is a linear function of all the values in the input.

## The convolutional layer

In practice, we want to look for more than just one feature in any input. When we used a neural network, each layer had multiple outputs corresponding to different learnt features. Similarly, instead of convolving just a single filter over the input to produce a single activation map, we convolve many filters over the input to produce many activation maps. This produces a stack of activation maps as the output. The output then has an extra dimension, in addition to the spatial ones, which corresponds to which output activation map you're looking at. This dimension is the convolutional analogue of the number of outputs from a linear layer.

Also just like linear layers, convolutional layers apply a simple linear transformation to their input and can be applied successively with activation functions to represent very complex non-linear transformations. Models with such layers are **convolutional neural networks**. These are appropriate for tackling problems like object detection and image segmentation. These convolutional layers have values for each weight within each filter and also include biases to shift each output feature. 
Just like before, each successive layer in the network learns successively higher level abstract features from the inputs.

These convolutional layers are also provided by PyTorch. In this notebook we will use `torch.nn.Conv2d` to convolve over our input in 2 directions (width and height).
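
A short sketch of a single convolutional layer with several filters, assuming a made-up batch of 8 color images of size 32x32. Each of the 16 filters produces its own activation map, and each has one bias.

```python
import torch

images = torch.randn(8, 3, 32, 32)   # (batch, channels, height, width)

# 16 filters, each 3x3 and spanning all 3 input channels.
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
activation_maps = conv(images)

print(activation_maps.shape)   # torch.Size([8, 16, 30, 30]) - one map per filter
print(conv.bias.shape)         # torch.Size([16]) - one bias per output map
```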

![image](images/CNN_FNN_comparison.JPG)

## What does each filter look for?
Engineers used to have to tune filter values manually. Now, just like the weights and biases in linear layers of neural networks, they can be learnt automatically by backpropagation and gradient descent.

## Pooling layers
Immediately after a convolutional layer, it is common to apply some form of **pooling**. Pooling is a technique that summarises/downsamples the values in a local region of its input. This reduces the number of values in its output, which reduces the computation needed by later layers and the number of parameters that a later linear layer needs to learn.

Because pooling summarises values in local spatial regions it can help models to be robust under translation of the input, making them more **translation invariant**.

Pooling layers also slide kernels over their input, and reduce the values within that grid location to a single value. But they perform different operations than a linear combination like in convolution (see below).

**Max pooling** replaces the values at each grid location with their maximum.

**Average Pooling** replaces the values at each grid location with their average.

See the PyTorch [docs](https://pytorch.org/docs/stable/nn.html#pooling-layers) for more pooling layers.
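
Here is a minimal sketch of both kinds of pooling on a made-up 4x4 feature map, using 2x2 windows. It uses PyTorch's functional interface, where the stride defaults to the window size.

```python
import torch
import torch.nn.functional as F

# A made-up 4x4 feature map, reshaped to (batch, channels, height, width).
feature_map = torch.tensor([[1., 2., 5., 6.],
                            [3., 4., 7., 8.],
                            [9., 10., 13., 14.],
                            [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

max_pooled = F.max_pool2d(feature_map, kernel_size=2)   # maximum of each 2x2 window
avg_pooled = F.avg_pool2d(feature_map, kernel_size=2)   # average of each 2x2 window

print(max_pooled.squeeze())   # [[ 4.,  8.], [12., 16.]]
print(avg_pooled.squeeze())   # [[ 2.5,  6.5], [10.5, 14.5]]
```

In both cases the 4x4 input is reduced to a 2x2 output.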

## The output of convolutional neural networks

Unless we keep applying convolutional layers to our data until it is reduced to a height and width of 1, the output will still retain some spatial dimensions. This means that as well as our input, our output can also be an image for example. This can be useful for problems such as image segmentation, where the output is a pixelwise classification mask of everything in the scene. In this case the output is the same shape as the input image, but with each pixel location taking the value of a class label (e.g. all pixels of cars in the image have value=1, all roads have value=2 etc).
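
As a toy illustration of an image-shaped output (not a real segmentation model), the sketch below uses a single padded, untrained convolutional layer to produce one score per class at every pixel, then takes the highest-scoring class at each location. The number of classes and the image size are assumptions for the example.

```python
import torch

num_classes = 3
image = torch.randn(1, 3, 64, 64)   # (batch, channels, height, width)

# Padding of 1 with a 3x3 kernel keeps the height and width unchanged.
pixel_classifier = torch.nn.Conv2d(3, num_classes, kernel_size=3, padding=1)
per_pixel_scores = pixel_classifier(image)

# Pick the class with the highest score at each pixel location.
mask = per_pixel_scores.argmax(dim=1)

print(per_pixel_scores.shape)   # torch.Size([1, 3, 64, 64]) - one score map per class
print(mask.shape)               # torch.Size([1, 64, 64]) - one class label per pixel
```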

In our case though, we want to perform image classification for 10 classes. It is common practice to flatten the output of the convolutional layers of a network into a vector, and then transform it into a vector of the desired output shape by applying a final linear layer. This is what we do below.
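
The flatten-then-linear step on its own can be sketched as follows. The random tensor stands in for the output of some convolutional layers. Its shape (32 maps of 20x20 for a batch of 128) is what two unpadded 5x5 convolutions with 32 final filters would give on 28x28 inputs, since 28 - 4 - 4 = 20.

```python
import torch

# Stand-in for the output of the convolutional layers: (batch, channels, height, width).
feature_maps = torch.randn(128, 32, 20, 20)

# Flatten everything except the batch dimension into one vector per example.
flat = feature_maps.view(feature_maps.shape[0], -1)

# A final linear layer maps each vector to one score per class.
classifier = torch.nn.Linear(32 * 20 * 20, 10)
scores = classifier(flat)

print(flat.shape)     # torch.Size([128, 12800])
print(scores.shape)   # torch.Size([128, 10])
```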

## Let's implement a convolutional neural network

The first cell is just the same boilerplate we've used before. Make sure you understand it and then run it.

In [None]:
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.tensorboard import SummaryWriter

# GET THE TRAINING DATASET
train_data = datasets.MNIST(root='MNIST-data',                        # where is the data (going to be) stored
                            transform=transforms.ToTensor(),          # transform the data from a PIL image to a tensor
                            train=True,                               # is this training data?
                            download=True                             # should i download it if it's not already here?
                           )

# GET THE TEST DATASET
test_data = datasets.MNIST(root='MNIST-data',
                           transform=transforms.ToTensor(),
                           train=False,
                          )

x = train_data[np.random.randint(0, 300)][0]    # get a random example
#print(x)
plt.imshow(x[0].numpy(),cmap='gray')
plt.show()

# FURTHER SPLIT THE TRAINING INTO TRAINING AND VALIDATION
train_data, val_data = torch.utils.data.random_split(train_data, [50000, 10000])    # split into 50K training & 10K validation

batch_size = 128

# MAKE TRAINING DATALOADER
train_loader = torch.utils.data.DataLoader(
    train_data,
    shuffle=True,
    batch_size=batch_size
)

# MAKE VALIDATION DATALOADER
val_loader = torch.utils.data.DataLoader(
    val_data,
    shuffle=True,
    batch_size=batch_size
)

# MAKE TEST DATALOADER
test_loader = torch.utils.data.DataLoader(
    test_data,
    shuffle=True,
    batch_size=batch_size
)

In [None]:
import torch.nn.functional as F

class ConvNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Conv2d(in_channels, out_channels, kernel_size, stride)
        # in_channels is the number of channels in the input (i.e. number of color channels in the 1st layer)
        # out_channels is the number of different filters that we use
        # kernel_size is the width and height of the kernel (its depth always equals in_channels)
        # stride is how many pixels we shift the kernel by each time
        self.conv_layers = torch.nn.Sequential( # put your convolutional architecture here using torch.nn.Sequential 
            torch.nn.Conv2d(1, 16, kernel_size=5, stride=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, kernel_size=5, stride=1),
            torch.nn.ReLU()
        )
        self.fc_layers = torch.nn.Sequential(
            torch.nn.Linear(32*20*20, 10) # put your linear architecture here using torch.nn.Sequential 
        )
    def forward(self, x):
        x = self.conv_layers(x)       # pass through conv layers
        x = x.view(x.shape[0], -1)    # flatten output ready for fully connected layer
        x = self.fc_layers(x)         # pass through fully connected layer
        # return the raw scores (logits): torch.nn.CrossEntropyLoss applies log-softmax itself
        return x

In [None]:
use_cuda = torch.cuda.is_available() # checks if gpu is available
device = torch.device("cuda" if use_cuda else "cpu")
learning_rate = 0.0005 # set learning rate
epochs = 5 # set number of epochs

cnn = ConvNet().to(device) # instantiate the model and move it to the chosen device
criterion = torch.nn.CrossEntropyLoss() #use cross entropy loss function
optimiser = torch.optim.Adam(cnn.parameters(), lr=learning_rate) # use Adam optimizer, passing it the parameters of your model and the learning rate

# SET UP TRAINING VISUALISATION
writer = SummaryWriter() # we will use this to show our models performance on a graph

In [None]:
def train(model, epochs, verbose=True, tag='Loss/Train'):
    for epoch in range(epochs):
        for idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            # pass x through your model to get a prediction
            prediction = model(inputs)             # pass the data forward through the model
            loss = criterion(prediction, labels)   # compute the cost
            if verbose: print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss.item())
            optimiser.zero_grad()                  # reset the gradients attribute of all of the model's params to zero
            loss.backward()                        # backward pass to compute and store all of the model's param's gradients
            optimiser.step()                       # update the model's parameters
            
            writer.add_scalar(tag, loss.item(), epoch*len(train_loader) + idx)    # write loss to a graph
    print('Training Complete. Final loss =',loss.item())
    
train(cnn, epochs)

In [None]:
import numpy as np
            
def calc_accuracy(model, dataloader):
    num_correct = 0
    num_examples = len(dataloader.dataset)                  # size of the DATASET, not the number of batches in the LOADER
    with torch.no_grad():                                   # gradients are not needed for evaluation
        for inputs, labels in dataloader:                   # for all examples, over all mini-batches in the dataset
            inputs, labels = inputs.to(device), labels.to(device)    # move the batch to the same device as the model
            predictions = model(inputs)
            predictions = torch.argmax(predictions, dim=1)  # index of the highest score for each example
            num_correct += (predictions == labels).sum().item()
    percent_correct = num_correct / num_examples * 100
    return percent_correct

print('Train Accuracy:', calc_accuracy(cnn, train_loader))
print('Validation Accuracy:', calc_accuracy(cnn, val_loader))
print('Test Accuracy:', calc_accuracy(cnn, test_loader))

## It's done
You should now understand
- the advantages of using CNNs vs vanilla neural networks
- how an image is represented as data, including its channels
- what convolution is in the context of machine learning
- the convolutional layers that we have used in this notebook, and the pooling layers that are often applied after them

## Next steps
- [Custom Datasets](https://github.com/AI-Core/Convolutional-Neural-Networks/blob/master/Custom%20Datasets.ipynb)

## Appendix
- [Empirical Benchmarking of Fully Connected vs Convolutional Architecture on MNIST](https://github.com/AI-Core/Convolutional-Neural-Networks/blob/master/Empirical%20Benchmarking%20of%20Fully%20Connected%20vs%20Convolutional%20Architecture%20on%20MNIST.ipynb)