# Convolutional Neural Networks

<img src="images/cnn-sequence.png" width="100%">


<img src="images/convolutional-network-components.png">

For regular neural networks, the input layer is fed through a series of hidden layers until reaching the output layer, with every layer being fully connected.

With convolutional neural networks, the neurons in every layer are arranged in 3 dimensions (width, height and depth). Neurons in one layer are not fully connected to the neurons in the next layer.
- Each layer of a convolutional neural network maps a 3D input 'volume' to a 3D output 'volume' of activations.
- CNNs have 2 components: the feature extraction part and the classification part (consisting of fully connected layers you'd see for regular networks).
    - Feature extraction: passing the input image through a series of convolution, activation and pooling operations
    - Classification: based on the extracted features, assigns probabilities to each class

### Convolution Layers:
The most well-known and widespread structure used in neural networks is called a convolution, and when used in a hidden layer, it's called a convolutional layer. The development of convolutional layers and the reusing of weights is one of the most important innovations in deep learning.
- Just like regular hidden layers in a network, convolutional layers take in input from the previous layer's nodes, operate on it, then pass the output on to the next layer. They apply *linear convolutions* to the input instead of a general *matrix multiplication*
- Convolutions revolve around 'reusing a piece of intelligence' in multiple places
- Convolving is the process of 'sliding' across the input image and sampling a different subsection of the image
- Convolutional layers can be stacked &mdash; this allows for hierarchical decomposition of the input 


<table>
    <tr>
        <td>            
            <img src="images/convolving-3d.gif">
        </td>
        <td>
            <img src="images/convolving-compute.gif">
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>
            <img src="images/convolution-mapping.png">
        </td>
        <td>
            <img src="images/convolutional-layer-size.png">
        </td>
    </tr>
</table>


- *Convolutional filter/kernel* &mdash; a small 3D matrix that is used to scan across subsections of the image, computing convolutions and producing a *feature map*
    - Also called a 'feature detector'
    - *Receptive field* &mdash; the region of the input that the filter covers at a single position
    - With RGB images, the kernel will have a 3rd dimension for scanning over each colour channel
    - Several different filters can be used to generate different feature maps, all of which are put together as the final output of a convolution layer
    <img src="images/convolutional-filters-visualised.png" width="50%">
    <strong><p style="text-align: center;">Weights of the convolutional filters visualised</p></strong>
    <strong><p style="text-align: center;">Some filters detect lines at certain angles, others detect colour gradients, etc.</p></strong>

    
- *Stride* &mdash; how many pixels the kernel or pooling window moves between successive positions

- *Feature map* or *activation map* &mdash; the output activations for a given filter

- *Padding* &mdash; adding layers of zero-value pixels to surround the input, thereby allowing the sliding filter to go beyond the normal image boundaries. This applies to pooling filters as well

    - Often used to make the convolutional layer have the same dimension as the input
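    - Number of outputs along one dimension of the convolutional layer: $n_{out}=\big\lfloor \frac{n_{in}+2p-k}{s} \big\rfloor + 1$ 
        - Where $n_{in}$ is the number of inputs (pixels) along that dimension, $k$ is the kernel size, $p$ is the padding size, $s$ is the stride size, and $\lfloor \cdot \rfloor$ rounds down to the nearest integer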


Kernel size, number of kernels, stride and padding are all hyperparameters that we need to decide on.
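
The output-size formula above can be checked with a few lines of Python. This is only a sketch for one spatial dimension; the function name `conv_output_size` and its argument names are made up for illustration.

```python
def conv_output_size(n_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """Outputs along one spatial dimension: floor((n_in + 2p - k) / s) + 1."""
    if k > n_in + 2 * p:
        raise ValueError("kernel is larger than the padded input")
    return (n_in + 2 * p - k) // s + 1  # // is floor division


print(conv_output_size(32, k=5))        # 28: no padding, stride 1
print(conv_output_size(32, k=5, p=2))   # 32: padding keeps the input size
print(conv_output_size(28, k=2, s=2))   # 14: 2x2 window with a stride of 2
print(conv_output_size(7, k=3, s=2))    # 3
```

The second call shows why padding is often chosen as $p = (k-1)/2$ for odd $k$ with a stride of 1: the output then has the same size as the input.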

<img src="images/convolving-demo.gif" width="75%">


#### Convolution:
The convolution operation involves taking the elementwise product of the kernel and the current input elements it's scanning, then summing those values to a single scalar.

More generally, a convolution is the combination of two functions to produce a third function, which can be thought of as merging two sets of information.
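
A minimal sketch of the sliding "elementwise product, then sum" operation in plain Python, assuming a single-channel image and a square kernel. The names `conv2d`, `image` and `kernel` are illustrative only. Like most deep learning libraries, this does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a 2D `image` (list of rows) and return the feature map."""
    k = len(kernel)  # assumes a square k x k kernel
    # Surround the image with `padding` rings of zero-value pixels.
    width = len(image[0]) + 2 * padding
    padded = (
        [[0] * width for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [[0] * width for _ in range(padding)]
    )
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Elementwise product of the kernel and the patch under it,
            # summed to a single scalar.
            row.append(sum(
                kernel[a][b] * padded[top + a][left + b]
                for a in range(k) for b in range(k)
            ))
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 0, 1, 0],
    [1, 3, 2, 1],
]
kernel = [
    [1, 0],
    [0, -1],
]
print(conv2d(image, kernel))            # [[0, -1, -2], [0, 0, 3], [-1, -2, 0]]
print(conv2d(image, kernel, stride=2))  # [[0, -2], [-1, 0]]
padded_out = conv2d(image, kernel, padding=1)
print(len(padded_out), len(padded_out[0]))  # 5 5
```

For an RGB input, the same idea applies with a third loop over the colour channels, with all channel contributions summed into the same scalar.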


### Pooling Layer:
After the convolutional layer, we may pass the output through a *pooling layer*.

The purpose of pooling is to reduce the spatial dimensions of the feature maps, and with them the number of parameters and the amount of computation in later layers, which helps to combat overfitting and long training times.
- The most frequently used type of pooling is *max pooling* which extracts the max value in each subsection of the convolutional layer's feature map. 
- Other types are *average pooling* and *sum pooling*
- Pooling reduces the width and height of each feature map produced by a convolutional layer; the number of feature maps is unchanged


<table>
    <tr>
        <td>            
            <img src="images/max-pooling-downsampling.png">
        </td>
        <td>
            <img src="images/pooling-demos.gif">
            <p style="text-align: left;">
                Using $2\times 2$ filters with a stride of $2$
            </p>
        </td>
    </tr>
</table>
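
The pooling variants described above can be sketched with one helper that takes the reduction as an argument. The names `pool2d` and `mean` are illustrative, and the defaults match the $2\times 2$ filters with a stride of $2$ from the demo.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Apply `reducer` (max, sum, mean, ...) to each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size)
            ]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
print(pool2d(feature_map))                # max pooling: [[6, 4], [7, 9]]
print(pool2d(feature_map, reducer=mean))  # average pooling: [[3.75, 2.25], [3.5, 5.0]]
print(pool2d(feature_map, reducer=sum))   # sum pooling: [[15, 9], [14, 20]]
```

Note that the helper has no weights at all: pooling is a fixed operation, so nothing in it is learned during training.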


### Fully Connected Layer:
Why have fully connected layers instead of going straight to the output layer? The output of the convolutional layers represents high-level features in the data. Passing these through fully connected layers before the output layer allows the network to learn non-linear combinations of those features.

To 'connect' the output of the pooling layer as input to the fully connected layers, you just have to *flatten* or reshape the feature map into a straight vector.

<img src="images/pooling-to-fully-connected-flattening.png" width="50%">
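
Flattening is just a reshape. A small sketch, assuming the pooled output is stored as nested lists in channels × height × width order (the ordering is an assumption here; any fixed ordering works as long as it is used consistently):

```python
def flatten(volume):
    """Flatten a channels x height x width volume into one long vector."""
    return [value for channel in volume for row in channel for value in row]


pooled = [
    [[6, 4], [7, 9]],   # feature map from filter 1
    [[0, 2], [5, 1]],   # feature map from filter 2
]
vector = flatten(pooled)
print(vector)       # [6, 4, 7, 9, 0, 2, 5, 1]
print(len(vector))  # 8 = 2 channels * 2 rows * 2 columns
```

The length of this vector is the number of inputs to the first fully connected layer.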


### Softmax in the Output Layer:

The $\texttt{softmax}$ activation function takes in a vector of $N$ real numbers, applies $\texttt{exp}$ to each element, then divides each result by the sum of all the exponentiated elements to return a vector of $N$ numbers between 0 and 1 which sum to 1 in total. The input numbers can be anywhere from $-\infty$ to $\infty$.

Suppose there are $N$ classes and $z_j$ is the network's raw output (the activation before $\texttt{softmax}$) for class $j$, where $1 \leq j \leq N$. The predicted probability for class $i$ and its logarithm are then

$$\texttt{Prob} (i) = \frac{\texttt{exp}(z_i)}{\sum_{j=1}^{N} \texttt{exp}(z_j)},$$
$$\texttt{logProb} (i) = z_i - \log \sum_{j=1}^{N} \texttt{exp}(z_j).$$

If the correct class is $i$, we can use $-\texttt{logProb}(i)$ as the loss/cost function. Minimising it pushes up the correct class $i$ through the first term $z_i$, while the second term pushes down the activations of all classes, most strongly the class $j$ that had the highest activation. The idea is that every incorrect class is pushed down in proportion to the network's predicted probability for it.
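
A sketch of these formulas in plain Python, with hypothetical function names. Subtracting the maximum activation before exponentiating is a standard trick to avoid overflow and does not change the result. The gradient of the loss with respect to $z_j$ works out to $\texttt{Prob}(j) - 1$ for the correct class and $\texttt{Prob}(j)$ for every other class, which is the "pushed down in proportion to its probability" behaviour described above. Class indices are 0-based in the code.

```python
import math


def log_softmax(z):
    # Subtracting max(z) leaves the result unchanged but avoids overflow in exp.
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]   # logProb(i) = z_i - log(sum_j exp(z_j))


def softmax(z):
    return [math.exp(lp) for lp in log_softmax(z)]


def nll_loss(z, correct):
    """Loss for one example: -logProb(correct)."""
    return -log_softmax(z)[correct]


def nll_gradient(z, correct):
    """d(loss)/dz_j: Prob(j) - 1 for the correct class, Prob(j) otherwise."""
    return [p - (1.0 if j == correct else 0.0) for j, p in enumerate(softmax(z))]


z = [2.0, 1.0, -1.0]
probs = softmax(z)
print(probs, sum(probs))           # three probabilities that sum to 1
print(nll_loss(z, correct=0))      # small loss: class 0 already has the largest activation
print(nll_gradient(z, correct=0))  # negative for class 0, positive for the others
```

Gradient descent moves each $z_j$ against its gradient, so the correct class goes up and each incorrect class goes down by an amount proportional to its predicted probability.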


## Example: LeNet (1998)
The LeNet architecture is a convolutional network structure introduced by Yann LeCun in 1998:

<img src="images/LeNet.png" width="100%">


#### Analysing the structure:

Suppose the input is $32 \times 32$ with 3 colour channels and we use 6 convolutional filters of size $5 \times 5$, with no padding and a stride of 1. 

- For each of the 6 convolutional filters, there are $25$ weights for each of the 3 colour channels, plus 1 bias, giving $3 \times 25 + 1 = 76$ tunable parameters
- The convolution layer output will have dimensions $28 \times 28$ for each filter, since $32 - 5 + 1 = 28$. With 6 individual convolutional filters, we generate 6 feature maps, giving an output of dimension $28 \times 28 \times 6 = 4704$
- Each convolutional filter applies its 76 parameters (75 weights and 1 bias) at each of the $28 \times 28$ positions on the input image. Repeated for each of the 6 filters, this gives a total of $76 \times 28 \times 28 \times 6 = 357504$ different connections 
- There are only $6 \times 76 = 456$ individual parameters to tune within the convolution layer
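
The arithmetic above can be reproduced in a few lines. The variable names are illustrative, and the 3 colour channels are the assumption behind the figure of 76 parameters per filter.

```python
input_size, channels = 32, 3      # 3 colour channels is the assumption behind "76"
kernel_size, num_filters = 5, 6

params_per_filter = kernel_size * kernel_size * channels + 1  # weights + 1 bias
output_size = input_size - kernel_size + 1                    # no padding, stride 1
activations = output_size * output_size * num_filters
connections = params_per_filter * output_size * output_size * num_filters
total_params = params_per_filter * num_filters

print(params_per_filter)  # 76
print(output_size)        # 28
print(activations)        # 4704
print(connections)        # 357504
print(total_params)       # 456
```

The gap between 357504 connections and 456 parameters is the effect of weight reuse: the same 76 numbers are applied at every position of the image.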

## Example: AlexNet (2012)

A convolutional network that was state of the art in 2012, mapping images to 1000 different classes with a top-5 error rate of $\approx 15\%$.
<img src="images/AlexNet.png" width="100%">


#### Interesting notes:

- Contains 62.3 million parameters to tune
- Trained for roughly 90 epochs over five to six days on two GTX 580 GPUs
- Uses $\texttt{ReLU}$ instead of $\texttt{tanh}$ for non-linearity (reported to speed up training by about 6 times at the same accuracy)
- Uses a dropout of $0.5$ to deal with overfitting
- Trained using SGD with an initial learning rate $\eta = 0.01$, momentum 0.9 and weight decay 0.0005
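
To make the last note concrete, here is a sketch of one SGD step with momentum and weight decay, using the hyperparameters listed above as defaults. The update form (velocity first, then weights) is one common formulation and, as far as I can tell, matches the rule described for AlexNet. The function name `sgd_step` and the toy quadratic loss are purely illustrative.

```python
def sgd_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    new_velocity = [
        momentum * v - weight_decay * lr * w - lr * g
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem (not from AlexNet): minimise 0.5 * sum(w^2), whose gradient is w itself.
weights, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grads = list(weights)
    weights, velocity = sgd_step(weights, grads, velocity)
print(weights)  # both entries end up close to 0
```

Momentum lets the update keep moving in a consistent direction across steps, while weight decay adds a small pull of every weight towards zero.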


### Resources:
- <a href="https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1">Convolutional neural networks from first principles</a>
- <a href="https://www.freecodecamp.org/news/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/#:~:text=The%20main%20special%20technique%20in,describe%20an%20image%20with%20text.">Intuitive guide to convolutional neural nets</a>
- <a href="https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/">Convolutional layers for deep learning neural networks</a>
- <a href="https://medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637">AlexNet architecture</a>