# Module 21 - Convolutional Neural Networks

## Module Overview

In this module, you will learn about convolutional neural networks (also known as CNNs or ConvNets). These networks have applications in a variety of domains, including computer vision, natural language processing and financial time series. You’ll learn how to construct CNNs, weigh the trade-offs between different architectures and identify potential applications across industries.

## Learning outcomes

- LO 1: Describe the anatomy of convolutions and how they work.
- LO 2: Describe how convolutions are used for computer vision applications.
- LO 3: Analyse the trade-off between different network architectures as they relate to domain applications.
- LO 4: Describe how a network architecture would be applied to your industry use case.
- LO 5: Refine a codebase for machine learning competitions. 


---
## CNN Formula Cheat Sheet

### 1. Output Size (Convolution Layer)

Given:
- Input size: $n$, Filter size: $f$, Padding: $p$, Stride: $s$

**Output size:**
$$
\text{Output} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1
$$

Apply this formula to **width and height** separately.
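
As a quick check of the formula, here is a minimal Python sketch. The function name `conv_output_size` and its argument names are illustrative choices, not part of any library.

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1

# Width and height are handled independently, so non-square inputs work too.
width = conv_output_size(n=100, f=5, p=2, s=1)
height = conv_output_size(n=64, f=3, p=0, s=2)
print(width, height)
```

Integer division (`//`) plays the role of the floor in the formula, which matters whenever the stride does not divide $n + 2p - f$ exactly.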

### 2. Output Size (Pooling Layer)

Same formula as convolution, typically with:
- No padding ($p = 0$)
- Non-overlapping regions (stride = filter size)

**Output size:**
$$
\text{Output} = \left\lfloor \frac{n - f}{s} \right\rfloor + 1
$$

### 3. Number of Parameters in a Convolutional Layer

Given:
- Filter size: $f \times f$, Number of input channels: $c$, Number of filters (output channels): $k$

**Total parameters:**
$$
\text{Parameters} = k \cdot (f \cdot f \cdot c + 1)
$$

- The **+1** is for the bias term per filter.
- For grayscale images, $c = 1$; for RGB, $c = 3$.
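
The count can be sketched in a few lines of Python. The helper `conv_params` and its optional `bias` flag are hypothetical names introduced only for illustration.

```python
def conv_params(k: int, f: int, c: int, bias: bool = True) -> int:
    """Learnable parameters in a conv layer with k filters of size f x f x c."""
    weights_per_filter = f * f * c
    return k * (weights_per_filter + (1 if bias else 0))

print(conv_params(k=32, f=5, c=3))   # RGB input
print(conv_params(k=16, f=3, c=1))   # grayscale input
```

Note that the count does not depend on the width or height of the input, only on the filter size and the channel counts.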

### 4. Number of Parameters in a Fully Connected (Dense) Layer

Given:
- Input size: $n$, Output size: $m$

**Total parameters:**
$$
\text{Parameters} = n \cdot m + m
$$

- The **+m** accounts for the bias for each output unit.


### 5. Receptive Field Size

The receptive field is the area of the input image that influences a particular neuron in the network. For a stack of convolutional layers:

**General recursive formula:**
$$
R_{\text{out}} = R_{\text{in}} + (k - 1) \cdot \text{jump}
$$

Where:
- $R$ = receptive field size
- $k$ = filter size
- $\text{jump}$ = product of the strides of all preceding layers (1 for the first layer)
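
The recursion is easy to apply by hand, but a short sketch can help when the stack is deep. It assumes each layer is described by a `(filter_size, stride)` pair and that the recursion starts from $R = 1$ at the input. The function name `receptive_field` is illustrative.

```python
def receptive_field(layers):
    """layers: list of (filter_size, stride) pairs, ordered input to output."""
    r = 1      # a single input pixel sees only itself
    jump = 1   # product of the strides of all earlier layers
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Two 3x3 convs (stride 1), a 2x2 pool (stride 2), then another 3x3 conv.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))
```

Pooling layers are treated like any other layer here, since they also have a window size and a stride.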

### 6. Total Output Volume

Given:
- Output width: $W$, Output height: $H$, Number of filters (channels): $D$

**Total volume:**
$$
\text{Volume} = W \cdot H \cdot D
$$



### Example: Convolution Layer Output

**Input image:** $100 \times 100 \times 3$  
**Filter:** $5 \times 5$, stride $1$, padding $2$, 32 filters

**Output size:**
$$
\text{Width} = \left\lfloor \frac{100 + 2 \cdot 2 - 5}{1} \right\rfloor + 1 = 100
$$
$$
\text{Height} = 100
$$
$$
\text{Depth} = 32
$$

**Total output volume:** $100 \times 100 \times 32 = 320,000$  
**Total parameters:** $32 \cdot (5 \cdot 5 \cdot 3 + 1) = 32 \cdot 76 = 2,432$
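
The same bookkeeping can be chained across layers. The sketch below reproduces the numbers above and then, purely as an assumption for illustration, appends a 2×2 max-pooling layer with stride 2 that is not part of the example.

```python
def out_size(n, f, p, s):
    return (n + 2 * p - f) // s + 1

h, w, c = 100, 100, 3          # input image from the example

# Convolution: 32 filters of size 5x5, stride 1, padding 2.
k, f = 32, 5
params = k * (f * f * c + 1)
h, w, c = out_size(h, f, 2, 1), out_size(w, f, 2, 1), k
conv_shape, conv_volume = (h, w, c), h * w * c

# Hypothetical follow-up layer (not in the example): 2x2 max pooling, stride 2.
h, w = out_size(h, 2, 0, 2), out_size(w, 2, 0, 2)
pool_shape, pool_volume = (h, w, c), h * w * c

print(conv_shape, conv_volume, params)
print(pool_shape, pool_volume)
```

Pooling changes width and height but leaves the depth untouched and adds no learnable parameters.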



---

# Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a specialised type of neural network designed to process data with a grid-like topology. This includes time series data (1D grids) and image data (2D grids of pixels). The core operation of a CNN is called **convolution**, a linear operation that allows the network to extract spatial hierarchies in data.

## Key Characteristics

- Commonly used in **computer vision** for image analysis
- Also applied to **natural language processing** and **financial time series**
- Exploit spatial correlations and hierarchical structures in data
- Tend to **overfit less** than fully connected networks due to fewer parameters
- Employ **convolutional filters** to learn feature representations

## Convolutional Filters

A convolutional filter (or kernel) scans across an image and produces a new representation that highlights specific patterns or features, such as edges or textures. The output of a convolution operation at position $(i, j)$ is computed as:

$$
y[i, j] = \sum_{m} \sum_{n} x[i + m, j + n] \cdot w[m, n]
$$

Where:
- $x$ is the input image
- $w$ is the filter
- $y$ is the resulting feature map

Filters are not predefined — CNNs **learn** the appropriate filters from the data during training.
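
To make the sum concrete, here is a minimal pure-Python sketch of the operation above (no padding, stride 1). The filter values are hand-picked for illustration only. In a trained CNN they would be learned.

```python
def correlate2d(x, w):
    """y[i][j] = sum_m sum_n x[i+m][j+n] * w[m][n], no padding, stride 1."""
    fh, fw = len(w), len(w[0])
    out_h = len(x) - fh + 1
    out_w = len(x[0]) - fw + 1
    return [
        [
            sum(x[i + m][j + n] * w[m][n] for m in range(fh) for n in range(fw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 4x4 image with a dark left half and a bright right half.
image = [[0, 0, 1, 1]] * 4
# Hand-picked vertical-edge filter (a trained CNN would learn its own weights).
edge_filter = [[-1, 1],
               [-1, 1]]
feature_map = correlate2d(image, edge_filter)
print(feature_map)
```

The feature map is non-zero only in the column where the window straddles the dark-to-bright boundary, which is the sense in which a filter "highlights" a pattern.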

## Activation Functions

To introduce non-linearity, activation functions are applied after each convolution. A common choice is the **Rectified Linear Unit (ReLU)**:

$$
\text{ReLU}(x) = \max(0, x)
$$

This enables the network to model complex, non-linear patterns in the data.

## Feature Maps and Depth

Each convolutional filter produces a **feature map**. If a convolutional layer uses $k$ filters, the resulting output will have a **depth** of $k$ — one feature map per filter.

## Pooling Layers

Pooling layers reduce the spatial dimensions of feature maps and make the representation approximately invariant to small translations of the input.

- **Max pooling**: Takes the maximum value from each region
- **Average pooling**: Computes the average of values in a region

For example, with a pooling region of size $f \times f$ and stride $s$, the output size is:

$$
\text{Output size} = \left\lfloor \frac{n - f}{s} \right\rfloor + 1
$$
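
Both pooling variants can be sketched with one helper that takes the reduction as an argument. The names `pool2d`, `reduce` and `mean` are illustrative, and the feature map values are made up.

```python
def pool2d(x, f, s, reduce=max):
    """Apply `reduce` to each f x f window of x, moving s steps at a time."""
    out_h = (len(x) - f) // s + 1
    out_w = (len(x[0]) - f) // s + 1
    return [
        [
            reduce([x[i * s + m][j * s + n] for m in range(f) for n in range(f)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def mean(values):
    return sum(values) / len(values)

fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [2, 0, 6, 2]]
print(pool2d(fmap, f=2, s=2))               # max pooling
print(pool2d(fmap, f=2, s=2, reduce=mean))  # average pooling
```

With $n = 4$, $f = 2$ and $s = 2$ the formula gives an output size of 2, which matches the 2×2 result.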

## Local Connectivity and Parameter Sharing

CNNs are efficient because of two key ideas:

- **Local connectivity**: Each neuron is connected only to a small region of the input
- **Parameter sharing**: The same filter is used across all positions in the input

Together, these drastically reduce the number of parameters compared to fully connected layers.
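
A rough back-of-the-envelope comparison shows the scale of the saving. The sizes below (a 100×100 RGB input and 32 filters of size 5×5 with an output of the same width and height) are assumptions chosen for illustration.

```python
h, w, c = 100, 100, 3      # assumed input: a 100x100 RGB image
k, f = 32, 5               # assumed layer: 32 filters of size 5x5, "same" padding

# Convolution: each filter has f*f*c weights plus one bias, reused everywhere.
conv = k * (f * f * c + 1)

# Dense layer producing the same 100x100x32 output: every output unit has
# its own weight for every input value, plus its own bias.
n_in = h * w * c
n_out = h * w * k
dense = n_in * n_out + n_out

print(conv)
print(dense)
print(dense // conv)
```

Under these assumptions the dense layer needs several million times more parameters than the convolutional layer for the same output shape.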

## Hyperparameters in CNNs

CNNs are controlled by several important hyperparameters:

### Padding

Padding controls the spatial size of the output. Zero-padding adds zeros around the input's border, typically so that the output keeps the same size as the input. The output size with padding $p$ and stride $s$ is given by:

$$
o = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1
$$

Where:
- $n$ = input size  
- $f$ = filter size  
- $p$ = padding  
- $s$ = stride
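
A small sketch of zero-padding on a 2D list, followed by a check that $p = (f - 1)/2$ preserves the input size for an odd filter with stride 1. The helper name `zero_pad` is illustrative.

```python
def zero_pad(x, p):
    """Surround a 2D list with a border of p zeros on every side."""
    width = len(x[0]) + 2 * p
    blank = [0] * width
    body = [[0] * p + list(row) + [0] * p for row in x]
    return [blank[:] for _ in range(p)] + body + [blank[:] for _ in range(p)]

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
padded = zero_pad(x, p=1)
for row in padded:
    print(row)

# With f = 3 and s = 1, choosing p = (f - 1) // 2 keeps the output at size n.
n, f, s = len(x), 3, 1
p = (f - 1) // 2
print((n + 2 * p - f) // s + 1)
```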

### Stride

The stride determines how far the filter moves across the input at each step. A larger stride results in a smaller output:

$$
\text{Output size} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1
$$

## Learnable Parameters in a Convolutional Layer

For a convolutional layer with $k$ filters, each of size $f \times f$, and $c$ input channels, the total number of learnable parameters is:

$$
\text{Parameters} = k \cdot (f \cdot f \cdot c + 1)
$$

The $+1$ accounts for a bias term per filter.

## Summary

CNNs are powerful tools for tasks that require spatial awareness and pattern recognition. By leveraging convolutional operations, non-linear activations, and pooling, they learn hierarchical feature representations efficiently and accurately. Their architectural design allows them to scale well to large input sizes while maintaining a manageable number of parameters.

## Common CNN Architectures

Some notable CNN architectures include:

- **LeNet-5**: Early CNN for digit classification
- **AlexNet**: Deep CNN that popularised deep learning after winning ImageNet 2012
- **ResNet**: Introduced residual connections to train very deep networks effectively


---

The notes below are taken directly from an informative example sheet; all credit belongs to ICL.

#### The Convolution Operation  

Let's start with an example. Suppose we're tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Both $x$ and $t$ are real-valued, that is, we can get a different reading from the laser sensor at any instant in time.

Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a declining weighting function $w(a)$, where $a$ is the age of the measurement. The older the measurement, the smaller the weight (e.g. $w(a) \propto 1/a$). If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship: 
$$s(t) = \int_0^{\infty} x(a) w(t - a) da.$$

This operation is called a **convolution**, and it is typically denoted by 
$$s(t) = (x \ast w)(t).$$

In our example, $w$ needs to be a valid probability density function, or the output will not be a weighted average. Here and throughout this session, $x$ is called the **input**, $w$ is called the **kernel** and the output is sometimes referred to as the **feature map**.
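
A discretised version of the spaceship example can be sketched as follows. Several choices here are assumptions made for illustration: the weights are proportional to $1/(\text{age} + 1)$ so that the newest reading (age 0) has a finite weight, only the last `max_age` readings are used, and the weights are renormalised to sum to 1 so that the result is a genuine weighted average. The index `t - a` in the code corresponds to the change of variable between measurement time and age.

```python
def smooth(readings, max_age=3):
    """Weighted average of each reading with its predecessors.

    Weight is proportional to 1 / (age + 1), so the newest reading (age 0)
    counts most. Weights are renormalised to sum to 1 at every step.
    """
    out = []
    for t in range(len(readings)):
        ages = range(min(max_age, t + 1))
        weights = [1.0 / (a + 1) for a in ages]
        total = sum(weights)
        out.append(sum(w * readings[t - a] for a, w in zip(ages, weights)) / total)
    return out

# Hypothetical noisy position readings around a steady value of 10.
noisy = [10.0, 12.0, 8.0, 11.0, 9.0, 10.0]
smoothed = smooth(noisy)
print([round(v, 2) for v in smoothed])
```

The smoothed sequence fluctuates less than the raw readings, which is the purpose of the weighted average.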

In machine learning applications, we often work with discretised signals (e.g. image pixels or finite samples from a sensor). The input is usually a multidimensional array of data, and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. In one dimension, the discrete convolution is:

$$S_{i} = (X \ast W)_{i} = \sum_n X(n) W(i - n).$$

For computer vision, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image $X$ as our input, we probably also want to use a two-dimensional kernel $W$:

$$S_{i, j} = (X \ast W)_{i, j} = \sum_m \sum_n X(m, n) W(i - m, j - n).$$
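
The index pattern $W(i - n)$ means the kernel is flipped relative to the input. The one-dimensional sketch below implements the sum directly and compares it with a sliding dot product that does not flip the kernel (cross-correlation). The function names and the sample values are illustrative.

```python
def convolve1d(x, w):
    """Full discrete convolution: S[i] = sum_n x[n] * w[i - n]."""
    out = [0] * (len(x) + len(w) - 1)
    for i in range(len(out)):
        for n in range(len(x)):
            if 0 <= i - n < len(w):   # w is treated as zero outside its support
                out[i] += x[n] * w[i - n]
    return out

def correlate1d(x, w):
    """Sliding dot product without flipping: y[i] = sum_m x[i + m] * w[m]."""
    return [
        sum(x[i + m] * w[m] for m in range(len(w)))
        for i in range(len(x) - len(w) + 1)
    ]

x = [1, 2, 3, 4]
w = [1, 0, -1]
full = convolve1d(x, w)
print(full)
# The interior of the convolution equals cross-correlation with a flipped kernel.
print(full[len(w) - 1 : len(x)], correlate1d(x, w[::-1]))
```

Because the kernel weights are learned, the flip makes no practical difference to what a network can represent; it only changes which weight ends up at which index.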

The summations, as we shall see shortly, run over a finite range of values (in practice, our multidimensional arrays, or *tensors*, are zero beyond a finite range of indices). Let's see a concrete example of the discrete convolution operator in one dimension using *Toeplitz* or *circulant matrices*.

Consider a signal $v = (v_0, v_1, \dots, v_5)$ and a three-weight filter $(x_1, x_0, x_{-1})$. The convolution can be written as $y = Av = (y_1, y_2, y_3, y_4)$, where

$$A =  \begin{bmatrix}
    x_1 & x_0 & x_{-1} & 0 & 0 & 0 \\
    0 & x_1 & x_0 & x_{-1} & 0 & 0 \\
    0 & 0 & x_1 & x_0 & x_{-1} & 0 \\
    0 & 0 & 0 & x_1 & x_0 & x_{-1} \\
  \end{bmatrix}
$$

Even though the discrete convolution operation can be represented as a matrix, each row of the matrix is constrained to be equal to the row above shifted by one element. In two dimensions, a convolution corresponds to a **doubly block circulant matrix** (the matrix is sparse and has only a few distinct elements). The 2D case is slightly more complicated: we have to consider block matrices in which each block is a Toeplitz matrix, and we have to flatten the 2D image into one long 1D vector, which brings us back to the 1D case above. In practical terms, we slide the 2D filter across the image to create a new image: at each position, we multiply the entries of the filter by the corresponding pixels in the moving window and add them up.
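
The equivalence between the matrix form and the moving window can be checked numerically. In the sketch below, the numeric values chosen for the filter weights and the signal are arbitrary, and the helper names are illustrative.

```python
def toeplitz_conv_matrix(taps, n):
    """Rows are copies of `taps`, each shifted one column right of the last.

    taps = (x_1, x_0, x_{-1}) in the order they appear along a row of A.
    """
    k = len(taps)
    rows = n - k + 1
    return [[0] * r + list(taps) + [0] * (n - k - r) for r in range(rows)]

def matvec(a, v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

taps = (1, 2, 3)             # illustrative values for x_1, x_0, x_{-1}
v = [1, 0, 2, 1, 0, 3]       # illustrative signal v_0 ... v_5
A = toeplitz_conv_matrix(taps, len(v))
for row in A:
    print(row)

y = matvec(A, v)
# Same result from sliding the taps along v as a moving window.
sliding = [sum(t * v[i + j] for j, t in enumerate(taps)) for i in range(len(A))]
print(y, sliding)
```

The matrix has 24 entries but only three distinct non-zero values, which is the parameter sharing of a convolution seen from the linear-algebra side.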


**References:**

- http://deeplearning.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/

- https://dsp.stackexchange.com/questions/35373/2d-convolution-as-a-doubly-block-circulant-matrix-operating-on-a-vector

#### The Stride of a Convolutional Filter

Another important concept for convolution filters is that of the *stride* length. The filters we've seen so far have stride $S=1$. For a larger stride, the *moving window takes longer steps* as it moves across the signal or image. Here is the matrix $A$ for a 1D, 3-weight filter with stride 2.  

$$\text{$S=2$: }A =
\begin{bmatrix}
    x_1 & x_0 & x_{-1} & 0 & 0 \\
    0 & 0 & x_1 & x_0 & x_{-1} \\
 \end{bmatrix}
$$
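
The strided matrix can be generated by shifting each row by the stride instead of by one column. This sketch uses string labels for the weights so that the printed layout can be compared with $A$ directly. The function name is illustrative.

```python
def strided_conv_matrix(taps, n, stride):
    """Like the stride-1 matrix, but each row is shifted `stride` columns."""
    k = len(taps)
    rows = (n - k) // stride + 1
    return [
        [0] * (r * stride) + list(taps) + [0] * (n - k - r * stride)
        for r in range(rows)
    ]

taps = ("x1", "x0", "x-1")   # symbolic labels so the layout is easy to read
for row in strided_conv_matrix(taps, n=5, stride=2):
    print(row)
```

The number of rows follows the usual output-size formula with $p = 0$: $\lfloor (5 - 3)/2 \rfloor + 1 = 2$.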

#### Pooling 

A typical layer of a convolutional network consists of three stages: 

1. Convolution: in this stage, the layer performs several convolutions in parallel to produce a set of linear activations.
2. Detector: next, each linear activation is run through a nonlinear activation function (e.g. ReLU).
3. Pooling: in the third and last stage, a pooling function modifies the output of the layer further.
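
The three stages can be sketched end to end on a one-dimensional signal. Everything concrete here is an assumption for illustration: the signal, the two hand-picked kernels, the use of a sliding dot product for the convolution stage, and non-overlapping max pooling of width 2.

```python
def conv_stage(signal, kernels):
    """Stage 1: one sliding dot product per kernel (linear activations)."""
    return [
        [sum(signal[i + m] * k[m] for m in range(len(k)))
         for i in range(len(signal) - len(k) + 1)]
        for k in kernels
    ]

def detector_stage(maps):
    """Stage 2: element-wise ReLU."""
    return [[max(0, v) for v in fmap] for fmap in maps]

def pooling_stage(maps, size=2):
    """Stage 3: non-overlapping max pooling over windows of `size`."""
    return [
        [max(fmap[i:i + size]) for i in range(0, len(fmap) - size + 1, size)]
        for fmap in maps
    ]

signal = [0, 1, 3, 2, 0, 1, 4, 1, 0]
kernels = [[-1, 1], [1, -1]]     # hand-picked: rising-edge and falling-edge detectors

linear = conv_stage(signal, kernels)
activated = detector_stage(linear)
pooled = pooling_stage(activated)
print(linear)
print(activated)
print(pooled)
```

Each kernel yields its own feature map, so the output of the layer has one pooled map per kernel.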

A commonly used pooling function is **max pooling**, which reports the maximum output within a rectangular patch of an image. Other popular choices include average pooling.

Pooling has several advantages: it reduces the dimension of the output and *reduces the possibility of overfitting*. One way to see this is that pooling helps make the representation approximately **invariant** to small translations of the input. Invariance to translation means that, when determining whether an image contains a face, we do not need to know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. Moreover, pooling may add further **robustness** to the prediction models (intuitively, in average pooling we average the values in those rectangular patches, so that the noise tends to cancel out).
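
A tiny sketch can illustrate the approximate invariance. The detector outputs below are hypothetical, and the pooling window of width 4 is an arbitrary choice.

```python
def max_pool1d(x, size):
    """Non-overlapping max pooling over windows of `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

# Hypothetical detector outputs: one strong response, then the same
# response shifted one position to the right.
original = [0, 9, 0, 0, 0, 0, 5, 0]
shifted  = [0, 0, 9, 0, 0, 0, 0, 5]

print(max_pool1d(original, 4), max_pool1d(shifted, 4))
changed = sum(a != b for a, b in zip(original, shifted))
print(changed)
```

Half of the input positions change under the shift, yet the pooled outputs are identical. The invariance is only approximate: a shift that moves a response across a window boundary would change the pooled output.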

**References:**

- [HOML](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/), chapter 13
- http://deeplearning.stanford.edu/tutorial/supervised/Pooling/