# Week 7

## Key Concepts from This Week

- Convolutional neuron
- Convolutional layer
- Channel
- Kernel
- Kernel size
- Stride
- Max-pooling layer
- Convolutional neural network
- Inception layer
- 1D convolution
- Regularization
---
## Convolutional Neural Networks

- A convolutional neural network (CNN) is one of the most frequently used deep learning architectures.
- CNNs are often used when we need to process data with spatial relations between individual inputs.
- For example, in images there is a spatial relation between individual pixels -- we cannot arbitrarily mix them, because their position relative to each other is important.

### Convolutional neuron

Consider a greyscale image $\mathbf{X}$. We can represent it as a $height \times width$ matrix of pixel values. With a traditional neuron, we would combine all these values into one scalar, the neuron activation:

$$
\sigma\big(b + \sum_{i=1}^{height}\sum_{j=1}^{width} w_{ij}x_{ij}\big) = \sigma\big(b + \sum\mathbf{W} \odot \mathbf{X}\big)
$$

where $x_{ij}$ is a specific pixel and $w_{ij}$ is the weight for this pixel. We need a weight for each pixel. _Convolutional neurons_, on the other hand, look only at a small window of the image at a time, e.g. a $3 \times 3$ window. The neuron slides its window across the image and returns a value for each position. Observe the illustration below.


Each value of the output matrix $\mathbf{Z}$ is calculated by the same neuron, i.e. the weights used for the calculation are the same and only the pixel values in the window differ. The neuron output for each window $\mathbf{P}$ is defined similarly to the general neuron above:

$$
\sigma\big(b + \sum\mathbf{W} \odot \mathbf{P}\big)
$$

Compared to the general neuron, a convolutional neuron has fewer weights, e.g. for a $3 \times 3$ window we have only 9 weights (and 1 bias), and we reuse these weights for all the windows. The weight matrix $\mathbf{W}$ is also called a _kernel_.

<img src="images/neuron.svg" alt="Convolutional neuron computation" style="width: 60%; margin: 5em auto 2em;"/>
<center><small>Illustration of how one pixel of the output is calculated for one neuron. We look at the window $\mathbf{P}$ and combine the pixel values with our kernel weights $\mathbf{W}$. The weights are trainable parameters of the neuron.</small></center>
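
The sliding-window computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `conv_neuron`, the choice of ReLU as the activation $\sigma$, the averaging kernel and the toy image are all assumptions made for the example. No padding is used and the stride is 1.

```python
import numpy as np


def relu(z):
    """Assumed activation function; any other sigma could be used instead."""
    return np.maximum(z, 0.0)


def conv_neuron(image, kernel, bias=0.0):
    """Slide `kernel` over a 2D `image` (no padding, stride 1)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + k_h, j:j + k_w]           # the window P
            output[i, j] = bias + np.sum(kernel * window)  # b + sum(W * P)
    return relu(output)


image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 greyscale image
kernel = np.ones((3, 3)) / 9.0                    # hypothetical averaging kernel
z = conv_neuron(image, kernel)
print(z.shape)  # (3, 3)
```

The same nine weights are reused for all nine window positions, which is exactly the weight sharing described above.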

#### Multiple input channels

Instead of a greyscale image, we can have an RGB image. Such an image is in fact a 3D array with dimensions $height \times width \times 3$. When we process a window, we take into account the values from each input channel. Instead of 9 weights, we would now have 27 weights -- 9 weights for each channel:

$$
\sigma\big(b + \sum_{i=1}^{3}\sum\mathbf{W_i} \odot \mathbf{P_i}\big)
$$

where $\mathbf{W_i}$ is the matrix of weights for the $i$-th channel (e.g. for RGB we would have one weight matrix for the red channel) and $\mathbf{P_i}$ holds the $i$-th channel values of the sliding window. We can process data with an arbitrary number of channels this way.
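
As a rough sketch of the multi-channel case, the kernel simply gains a third axis and the sum runs over all channels of the window. The function name `conv_neuron_multichannel`, the ReLU activation and the all-ones inputs are assumptions chosen to keep the example easy to check by hand.

```python
import numpy as np


def conv_neuron_multichannel(image, kernel, bias=0.0):
    """image: (height, width, channels); kernel: (k_h, k_w, channels)."""
    k_h, k_w, _ = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + k_h, j:j + k_w, :]        # all channels of P
            output[i, j] = bias + np.sum(kernel * window)  # sum over i of W_i * P_i
    return np.maximum(output, 0.0)                         # assumed ReLU activation


rgb = np.ones((6, 6, 3))         # toy RGB image
kernel_rgb = np.ones((3, 3, 3))  # 27 weights: 9 per channel
out = conv_neuron_multichannel(rgb, kernel_rgb)
print(kernel_rgb.size, out.shape)  # 27 (4, 4)
```

Note that the neuron still produces a single output channel, no matter how many input channels it reads.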

#### Design decisions

There are multiple design decisions that can be made for a convolutional neuron:

- __Kernel size:__ How large the sliding window that moves across the image is. We usually use square windows with an _odd_ and relatively small size, e.g. $3 \times 3$ or $5 \times 5$. We use an odd size for the sake of symmetry, so that the window has a well-defined center pixel. $1 \times 1$ is a special case called $1 \times 1$ convolution. In that case each output value depends only on the values of one specific pixel. However, we still combine the channels of said input pixel.

- __Stride:__ Stride is the size of the step that the sliding window takes, i.e. how far the window moves. Usually we use a $1 \times 1$ stride, so we move by one pixel in both directions. With this stride (and zero padding, see below) the resolution of the output is the same as the resolution of the input. With a bigger stride the resolution gets smaller.

- __Padding:__ We need to decide how to handle the edges of the image. If we center the $3 \times 3$ window on the top-left pixel, parts of the window will overhang. Two basic strategies are used:

  1. We pad the image with zero values.
  2. We only move the window so that it fits the image.
  
  In the first case the output has the same dimensions as the input (for stride 1). In the second case the output is slightly smaller.
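
The effect of these design decisions on the output resolution can be sketched with a small helper. The function name `conv_output_size` is hypothetical; the labels `"same"` (zero padding) and `"valid"` (window must fit the image) follow the Keras naming of the two padding strategies, and the formulas are the ones Keras uses for them.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding="same"):
    """Output length along one axis (height or width) of a convolution."""
    if padding == "same":
        # Zero padding: output is ceil(input_size / stride).
        return -(-input_size // stride)
    if padding == "valid":
        # No padding: the window must fit entirely inside the image.
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


print(conv_output_size(256, 3, stride=1, padding="same"))   # 256
print(conv_output_size(256, 3, stride=1, padding="valid"))  # 254
print(conv_output_size(256, 3, stride=2, padding="same"))   # 128
```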

### Convolutional layer

- Convolutional layers consist of multiple convolutional neurons. 
- Each neuron creates its own output matrix - called a _channel_. If we start with a $256 \times 256$ image with 3 channels (RGB), the shape of this image is $256 \times 256 \times 3$. After we process it with 100 convolutional neurons, we get 100 new channels. The shape of the output of such convolutional layer is $256 \times 256 \times 100$. Each neuron looks at the same windows, but each neuron has its own parameters (weights and bias) and the activation values they calculate for their channels are therefore different.

The number of weight parameters in a convolutional layer can be calculated as $K \times C_i \times C_o$, where $K$ is the number of weights in one kernel channel (e.g. 9 for a $3 \times 3$ kernel), $C_i$ is the number of input channels and $C_o$ is the number of output channels (which is the same as the number of neurons in the layer). If we have a $3 \times 3$ kernel with 100 neurons over an RGB image, the number of parameters in that layer is $3 \times 3 \times 3 \times 100 = 2,700$. The next layer with 100 neurons would have $3 \times 3 \times 100 \times 100 = 90,000$ parameters. We omitted the bias parameters, but each neuron has one.
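
The arithmetic above can be checked with a short helper. The function name `conv_layer_params` and its `include_bias` flag are assumptions made for this sketch.

```python
def conv_layer_params(kernel_h, kernel_w, in_channels, out_channels,
                      include_bias=False):
    """Number of trainable parameters in a 2D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels  # K * C_i * C_o
    biases = out_channels if include_bias else 0                # one per neuron
    return weights + biases


print(conv_layer_params(3, 3, 3, 100))                     # 2700
print(conv_layer_params(3, 3, 100, 100))                   # 90000
print(conv_layer_params(3, 3, 3, 100, include_bias=True))  # 2800
```

Note that the count does not depend on the image resolution, only on the kernel size and the channel counts.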

### Max-pooling layer

Apart from convolutional layers, convolutional neural networks also contain max-pooling layers. These layers are very simple: we split each channel into small windows called _pools_, e.g. with a size of $2 \times 2$. From each pool we select only the maximum value. We construct a new channel from these maxima:

<img src="images/max.svg" alt="Max-pooling" style="width: 30%; margin: 2em auto;"/>
<center><small>2x2 max-pooling applied over 8x8 input. Red border shows individual pools. From each pool the maximum value is selected for output.</small></center>

Max-pooling layers also have a stride, similarly to convolutional layers. When the stride is the same as the pool size, we reduce the dimensionality of the tensors in the network, e.g. we reduce $256 \times 256 \times 100$ to $128 \times 128 \times 100$ with $2 \times 2$ pools and $2 \times 2$ strides. This also effectively increases the receptive field of the neurons. In deep networks, the receptive fields of neurons in the later layers can cover the entire input image. Because each such max-pooling layer halves the width and height, we usually use images whose dimensions are powers of two.

<img src="images/receptive_field.svg" alt="Receptive field" style="width: 100%; margin: 2em auto 1em;"/>
<center><small>Receptive field of neuron calculating a pixel value shaded in blue. We can see that only after three layers (convolutional 3x3, max-pooling 2x2, convolutional 3x3), the receptive field w.r.t. original image is quite large. By chaining multiple max-pooling layers, the receptive field would only grow. Red and orange mark the receptive fields across neighbouring layers.</small></center>
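
The max-pooling step itself can be sketched in NumPy as follows. This is a minimal sketch for the common case of $2 \times 2$ pools with $2 \times 2$ strides on a single channel; the function name `max_pool_2x2` and the toy input are assumptions.

```python
import numpy as np


def max_pool_2x2(channel):
    """2x2 max-pooling with stride 2 over one (height, width) channel."""
    h, w = channel.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError("height and width must be even for this sketch")
    # Axes 1 and 3 index the position inside each 2x2 pool.
    pools = channel.reshape(h // 2, 2, w // 2, 2)
    return pools.max(axis=(1, 3))


channel = np.array([[1, 3, 2, 0],
                    [4, 2, 1, 1],
                    [0, 0, 5, 6],
                    [7, 1, 2, 8]])
pooled = max_pool_2x2(channel)
print(pooled)  # [[4 2]
               #  [7 8]]
```

The layer has no trainable parameters; it only halves the width and height of each channel.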

### Convolutional neural network

CNNs consist of convolutional and max-pooling layers. The final layers are then _dense_, i.e. the layers we used in MLP. The illustration below shows what a convolutional neural network usually looks like. This is an architecture called VGG-16 [1]:

<img src="images/vgg16.jpg" alt="VGG-16" style="width: 60%; margin: 5em auto 0;"/>
<center><small>VGG-16 architecture as depicted by <a href="https://neurohive.io/en/popular-networks/vgg16/">Muneeb ul Hassan for Neurohive.</a></small></center>

You can see that CNNs are quite deep compared to MLP. Notice how the number of channels grows as the size of the image shrinks. This pyramid shape is how CNNs are usually designed. VGG-16 is a quite popular architecture that is still used in practice as a baseline solution.

### Training convolutional neural network

- Convolutional neural networks are trained in the same way as the other neural networks we have seen so far, i.e. via _stochastic gradient descent_. 
- Each neuron has a set of parameters and we calculate the derivative of the defined loss function w.r.t. each of these parameters.
- The training is not so different from MLP training, because convolution can in fact be [rewritten as matrix multiplication](https://github.com/alisaaalehi/convolution_as_multiplication), which is the basic operation of MLP as well.

- It is interesting to think about what exactly the neurons in CNNs are learning.
- The convolutional layers are quite small compared to dense layers and the neurons are always looking at just small windows.
- How is it possible that they can process images so well? The illustration below shows what the neurons at individual layers are learning:

<img src="images/neuron_vis.png" alt="Feature visualization" style="margin: 3em auto 0;"/>
<center><small>Visualization of what various neurons in CNN layers learned. Adapted from a <a href="https://distill.pub/2017/feature-visualization">Distill paper by C. Olah et al.</a> [2].</small></center>

We can see that the bottom-most layers are learning only very rudimentary patterns - edges, corners, etc. As we go further in the network, the neurons combine what the neurons from previous layers learned and become sensitive to more and more complex patterns. Finally, the neurons in the final layers are sensitive to very high-level patterns, e.g. pictures of dogs. I recommend checking the paper [2] to see some additional visualizations.

Then we do some light pre-processing. First, we scale the input from integers in $\langle 0, 255 \rangle$ to floats in $\langle 0.0, 1.0 \rangle$. Then we add a dimension to the images, so instead of $28 \times 28$ they have a shape of $28 \times 28 \times 1$. CNNs expect inputs with channels, so even if we have only one channel, we need to represent it explicitly.
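
A minimal sketch of these two pre-processing steps in NumPy is shown below. The random `images` array is only a stand-in for a real batch of $28 \times 28$ greyscale images, and the batch size of 4 is an arbitrary assumption.

```python
import numpy as np

# Stand-in for a real batch of four 28x28 greyscale images (values 0..255).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# 1. Scale integers in <0, 255> to floats in <0.0, 1.0>.
images = images.astype("float32") / 255.0

# 2. Add an explicit channel axis: (4, 28, 28) -> (4, 28, 28, 1).
images = images[..., np.newaxis]

print(images.shape, images.dtype)  # (4, 28, 28, 1) float32
```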

### Combining layers

Above we defined a typical feed-forward neural network. Each layer sees only the previous layer and nothing else. Sometimes we combine multiple layers to improve the results. For example, we might take two convolutional layers, one with $3 \times 3$ kernel size and the other with $5 \times 5$ kernel size and combine their channels:

```python
from tensorflow.keras.layers import Concatenate, Conv2D

# In CNN __init__

    self.conv1 = Conv2D(
        filters=64,
        kernel_size=3,
        padding='same',
        activation='relu')
    self.conv2 = Conv2D(
        filters=64,
        kernel_size=5,
        padding='same',
        activation='relu')

# In CNN call

    x = ...                      # (height, width, 32) <- shape
    c1 = self.conv1(x)           # (height, width, 64)
    c2 = self.conv2(x)           # (height, width, 64)
    # Concatenate is a layer: create it first, then call it on the list of tensors.
    x = Concatenate()([c1, c2])  # (height, width, 128)
```

If the original `x` has dimensions $256 \times 256 \times 32$, each convolutional layer creates its own set of channels with dimensions $256 \times 256 \times 64$. When we concatenate them, we simply stack them on top of each other into a $256 \times 256 \times 128$ shape.

Similarly, we can concatenate the output of the layer with its input. This technique is a form of _skip_ connection. It is closely related to _residual_ connections, which add the input to the output instead of concatenating them. The main idea behind this technique is that we do not want to lose the information that was learned by the previous layers, so we keep it untouched for further layers.

```python
from tensorflow.keras.layers import Concatenate, Conv2D

# In CNN __init__

    self.conv1 = Conv2D(
        filters=64,
        kernel_size=3,
        padding='same',
        activation='relu')

# In CNN call

    x = ...                     # (height, width, 32)
    c1 = self.conv1(x)          # (height, width, 64)
    # The input channels are kept untouched next to the new ones.
    x = Concatenate()([x, c1])  # (height, width, 96)
```

### Inception Module

The _Inception module_ is a specific layer used in convolutional neural networks. It was designed at Google and has gone through several versions over the years. It uses the concept of combining convolutional layers with different kernel sizes. The first version of the Inception module, proposed in 2014, is depicted below:

<img src="images/inception_layer.png" alt="Inception layer" style="width: 70%; margin: 3em auto 0;"/>
<center><small>Inception layer v1. Adapted from a <a href="https://ai.google/research/pubs/pub43022">CVPR paper by Ch. Szegedy et al.</a> [3].</small></center>

This layer was then used in a network architecture called GoogLeNet, which was quite successful at the time. You can see the individual Inception modules in the illustration below:

<img src="images/googlenet.png" alt="GoogLeNet" style="width: 70%; margin: 3em auto 0;"/>
<center><small>GoogLeNet. Adapted from a <a href="https://ai.google/research/pubs/pub43022">CVPR paper by Ch. Szegedy et al.</a> [3].</small></center>

### 1D and 3D convolution

- So far we have discussed only 2D convolution. The input has two dimensions - width and height - and for each pixel we have several channels. 
- We can use the same principle in 1D case, e.g. we can have a time series where each point in time has several features - channels. 
- The 1D convolution is very similar to 2D convolution, but instead of 2D convolutional kernels, 2D channels and 2D pools we have all these concepts in 1D. For example, a kernel of size 5 sees 5 consecutive steps of the time series.
- 1D convolution can be used to model time series, sounds, text and other sequential data. Similarly, we can have CNNs that process 3D data. These are used, for example, in biomedicine for drug analysis.
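
A 1D convolutional neuron over a multi-feature time series can be sketched as follows. The function name `conv1d_neuron`, the ReLU activation, the averaging kernel and the toy series are assumptions made for illustration. No padding is used and the stride is 1.

```python
import numpy as np


def conv1d_neuron(series, kernel, bias=0.0):
    """series: (steps, channels); kernel: (kernel_size, channels)."""
    k = kernel.shape[0]
    n_positions = series.shape[0] - k + 1
    out = np.array([
        bias + np.sum(kernel * series[t:t + k])  # window of k time steps
        for t in range(n_positions)
    ])
    return np.maximum(out, 0.0)                  # assumed ReLU activation


series = np.arange(20, dtype=float).reshape(10, 2)  # 10 steps, 2 features
kernel = np.ones((5, 2)) / 10.0                     # kernel size 5, averaging
out = conv1d_neuron(series, kernel)
print(out.shape)  # (6,)
```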

## Regularization

### Bias
Bias in this context is not a neuron parameter. Bias is the train set error. High bias means that we cannot fit the train data very well. In Exercise 7.3 we saw a training run that was able to fit the training data with almost 100% accuracy. The bias of this model is very low.

### Variance
Variance, on the other hand, shows up in the test set error. It is the difference between train and test set performance. In Exercise 7.3 the model achieved an accuracy of only about 91% on the test set. The 9% difference between train and test set performance is the variance of the model.

_Regularization_ is a term used for various techniques that reduce the _variance_ of the model. By reducing variance we often increase the bias - this is often referred to as a _bias-variance trade-off_. We try to tune the variance and bias so that we get the best possible test set results.
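
Using the informal definitions above, bias and variance can be read directly from the two accuracies. The sketch below uses the rounded numbers quoted for Exercise 7.3 (100% train, 91% test); the variable names are assumptions.

```python
train_accuracy = 1.00  # "almost 100%" in Exercise 7.3, rounded
test_accuracy = 0.91   # "about 91%" in Exercise 7.3

bias = 1.0 - train_accuracy                # train set error
variance = train_accuracy - test_accuracy  # gap between train and test

print(f"bias: {bias:.2f}, variance: {variance:.2f}")  # bias: 0.00, variance: 0.09
```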

### L2 and L1 regularization

Both L2 and L1 aim to reduce the magnitude of parameters. At each training step, the optimizer calculates the gradient for each parameter. On top of that, both L2 and L1 slightly push each parameter towards zero. This is based on the observation that parameters with high magnitude lead to unstable results.
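
The push towards zero can be sketched as an extra term in a plain gradient descent step. This is a simplified sketch, assuming the L2 penalty $\lambda \sum w^2$ and the L1 penalty $\lambda \sum |w|$; the function name `regularized_step` and the chosen learning rate and lambda are hypothetical.

```python
import numpy as np


def regularized_step(weights, grad, learning_rate=0.1, lam=0.01, kind="l2"):
    """One gradient descent step with an added L2 or L1 penalty gradient."""
    if kind == "l2":
        penalty_grad = 2.0 * lam * weights     # derivative of lam * sum(w^2)
    elif kind == "l1":
        penalty_grad = lam * np.sign(weights)  # derivative of lam * sum(|w|)
    else:
        raise ValueError(f"unknown regularizer: {kind!r}")
    return weights - learning_rate * (grad + penalty_grad)


weights = np.array([2.0, -3.0, 0.5])
zero_grad = np.zeros_like(weights)  # isolate the effect of the penalty
after_l2 = regularized_step(weights, zero_grad, kind="l2")
after_l1 = regularized_step(weights, zero_grad, kind="l1")
print(after_l2)  # every weight shrinks proportionally to its size
print(after_l1)  # every weight shrinks by the same constant amount
```

With the loss gradient set to zero, only the penalty acts: L2 shrinks large weights more strongly, while L1 moves every weight by the same step.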

L2 and L1 regularizers are applied to layers with trainable parameters, e.g. Dense and ConvXD layers. You can apply them simply by using the `x_regularizer` parameters during their initialization, e.g. `kernel_regularizer`. We usually apply these regularizers only to weights (which Keras calls `kernel`) and not to biases. L2 is used more often than L1.

### Dropout

Dropout randomly turns off a certain number of neurons in a layer during training. This helps with generalization, because the model cannot rely on specific values from specific neurons. Instead, each neuron needs to be able to contribute to the result. Dropout is usually used only after _dense_ layers.
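
A minimal NumPy sketch of dropout at training time is shown below. It assumes the so-called inverted dropout variant, where the surviving activations are scaled up by $1 / (1 - rate)$ so that their expected sum stays the same; this scaling is also what the Keras `Dropout` layer does. The function name `dropout` and the toy activations are assumptions.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero out each activation with probability `rate` (training time only)."""
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors so the expected activation stays unchanged.
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((4, 10))  # toy output of a dense layer with 10 neurons
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (turned off) and 2.0 (kept and rescaled)
```

At inference time no neurons are dropped and the activations are used as they are.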

### Batch normalization

- Batch normalization empirically improves the results, but researchers are not quite sure why.
- We simply normalize each activation value within a batch, i.e. when we have 200 neurons in a dense layer, we calculate the mean and variance of the output of each neuron over the batch.
- Then we use these quantities to normalize the output. We can use batch normalization after both dense and convolutional layers.
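
The normalization step described above can be sketched in NumPy. This is a simplified sketch of the training-time computation only: the learned scale and shift parameters and the moving averages used at inference time are omitted. The function name `batch_normalize`, the `epsilon` value and the random batch are assumptions.

```python
import numpy as np


def batch_normalize(activations, epsilon=1e-3):
    """activations: (batch_size, neurons). Normalize each neuron over the batch."""
    mean = activations.mean(axis=0)     # one mean per neuron
    variance = activations.var(axis=0)  # one variance per neuron
    return (activations - mean) / np.sqrt(variance + epsilon)


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 200))  # 32 samples, 200 neurons
normalized = batch_normalize(batch)
print(normalized.shape)  # (32, 200)
```

After this step each of the 200 neurons has approximately zero mean and unit variance within the batch.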

### Hyperparameters

- Lambda `l` is usually set to a small number. I would consider searching exponentially through the $\langle 10^{-4}, 10^{-2} \rangle$ range.
- For the dropout rate, a number from the $\langle 0.1, 0.5 \rangle$ range is usually used.
- I would typically use the default values for batch normalization, but you can tune them as well. 
- L2/L1 is not as popular as it once was, because it is quite sensitive to how we set the lambda hyperparameter. 
- Nowadays most people use either dropout or batch normalization as the single regularization technique for their model.
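
Candidate values for such a search can be generated as follows. This is only a sketch: taking five values per hyperparameter is an arbitrary assumption, and the variable names are hypothetical.

```python
import numpy as np

# Exponential (log-spaced) grid for lambda between 1e-4 and 1e-2.
lambdas = np.logspace(-4, -2, num=5)

# Evenly spaced grid for the dropout rate between 0.1 and 0.5.
dropout_rates = np.linspace(0.1, 0.5, num=5)

print(lambdas)
print(dropout_rates)
```

Neighbouring lambda values differ by a constant factor rather than a constant step, which is what searching exponentially means here.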