> **Convolutional Networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.** ~ Deep Learning Book


# 1. The Convolution Operation
Convolution is an operation on two functions of a real-valued argument.

Example:
- Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Both $x$ and $t$ are real-valued. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement. Applying this weighted average at every moment gives a new function $s$, a smoothed estimate of the position:

$$s(t) = \int x(a)w(t-a)da$$

This operation is called **convolution**. The convolution operation is typically denoted with an asterisk:

$$s(t) = (x*w)(t)$$

- $x$ is the input
- $w$ is the kernel
- $s$, the output, is referred to as the feature map

If we assume that $x$ and $w$ are defined only on integer $t$, we get the discrete convolution:

$$s(t) = (x*w)(t) = \sum_{a=-\infty }^{\infty }x(a)w(t-a)$$
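
To make the discrete sum concrete, here is a minimal sketch in plain Python. It assumes both $x$ and $w$ are finite lists that are zero outside their stored indices, so the infinite sum reduces to a finite one. The function name `discrete_convolution` and the sample values are illustrative only.

```python
def discrete_convolution(x, w):
    """Full discrete convolution s(t) = sum_a x(a) * w(t - a).

    x and w are finite lists, treated as zero outside their indices.
    """
    n = len(x) + len(w) - 1          # length of the full output
    s = [0.0] * n
    for t in range(n):
        for a in range(len(x)):
            k = t - a                # index into the kernel (the "age")
            if 0 <= k < len(w):
                s[t] += x[a] * w[k]
    return s


# Hypothetical position readings and a 3-tap weighting kernel.
# w[0] weights the newest measurement (age 0), w[2] the oldest (age 2).
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 0.25]
s = discrete_convolution(x, w)
print(s)  # [0.5, 1.25, 2.25, 3.25, 1.75, 1.0]
```

Swapping the arguments gives the same result, since convolution is commutative.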

# 2. Motivation
Convolution leverages three important ideas that can help improve a machine learning system:
- Sparse interactions
- Parameter sharing
- Equivariant representations


Traditional NN layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. CNNs, however, typically have **sparse interactions**. This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels.

Therefore, we can store fewer parameters and reduce memory requirements of the model.

**Parameter sharing** refers to using the same parameter for more than one function in a model. In a CNN, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). Instead of learning a separate set of parameters for every location, the layer learns only one set, which reduces its storage requirements to $k$ parameters, where $k$ is the number of elements in the kernel.
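
The savings from sparse interactions and parameter sharing can be checked with simple arithmetic. The sketch below compares three ways of connecting $m$ inputs to $n$ outputs. The sizes are hypothetical, and biases are ignored for simplicity.

```python
def parameter_counts(m, n, k):
    """Parameter counts for three ways of connecting m inputs to n outputs."""
    dense = m * n    # fully connected: one weight per input/output pair
    sparse = k * n   # sparse interactions only: k weights per output, not shared
    shared = k       # convolution: the same k kernel weights at every position
    return dense, sparse, shared


# Hypothetical sizes: 1000 inputs, 1000 outputs, a kernel with 9 elements.
dense, sparse, shared = parameter_counts(m=1000, n=1000, k=9)
print(dense, sparse, shared)  # 1000000 9000 9
```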


In a CNN, the particular form of parameter sharing causes the layer to have a property called **equivariance** to translation. A function is equivariant if, when the input changes, the output changes in the same way: $f$ is equivariant to $g$ when $f(g(x)) = g(f(x))$. For example, take a picture with a kitten in the middle. If the kitten is moved to a corner of the picture, the features that respond to it move by the same amount in the feature map. Because the same kernel detects the kitten wherever it appears, the later layers of the network can still classify it as a kitten.

**Note:**
- Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
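
Translation equivariance can be checked numerically. This is a minimal 1D sketch, assuming finite signals that are zero outside their stored values. Shifting the input and then convolving gives the same result as convolving and then shifting. The helper names `convolve` and `shift_right` are illustrative.

```python
def convolve(x, w):
    """Full discrete convolution of two finite lists."""
    n = len(x) + len(w) - 1
    return [sum(x[a] * w[t - a] for a in range(len(x)) if 0 <= t - a < len(w))
            for t in range(n)]


def shift_right(signal, d):
    """Translate a signal by d steps, padding the front with zeros."""
    return [0.0] * d + list(signal)


x = [0.0, 1.0, 3.0, 1.0, 0.0]   # hypothetical 1D "image" with a bump in the middle
w = [1.0, -1.0]                  # difference kernel, a simple edge detector

shift_then_conv = convolve(shift_right(x, 2), w)
conv_then_shift = shift_right(convolve(x, w), 2)
print(shift_then_conv == conv_then_shift)  # True
```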

# 3. Architecture Overview

Convolutional Neural Networks (CNNs) take advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth.**

1) ![](assets/neural_net2.jpeg)
__________________________________
______________________________________
2) ![](assets/cnn.jpeg)

> 1: A regular 3-layer Neural Network. 2: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).


# 4. Layers used to build ConvNets
A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through 
a differentiable function. 

Three main types of layers are used to build ConvNet architectures:
- **Convolutional Layer**
- **Pooling Layer**
- **Fully-Connected Layer**

*Example Architecture:*

A simple ConvNet for [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) Classification could have the architecture 
**[INPUT - CONV - RELU - POOL - FC].**

- INPUT[32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R, G, B.
- CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters. (Filters are explained in a later section.)
- RELU layer will apply an elementwise activation function, such as the $\max(0,x)$ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
- FC (fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, one for each of the 10 categories of [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html).
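
The volume sizes above can be traced with a short calculation. The sketch below uses the standard output-size formula $(W - F + 2P)/S + 1$ for input width $W$, filter size $F$, padding $P$, and stride $S$. The filter size, padding, and stride are assumptions chosen so that the shapes match the example: 3x3 filters with padding 1 and stride 1 for CONV, and 2x2 windows with stride 2 for POOL.

```python
def conv_output_shape(shape, num_filters, field, stride, pad):
    """Output volume of a CONV layer: spatial size from (W - F + 2P) / S + 1."""
    w, h, _ = shape
    out_w = (w - field + 2 * pad) // stride + 1
    out_h = (h - field + 2 * pad) // stride + 1
    return (out_w, out_h, num_filters)


def pool_output_shape(shape, field, stride):
    """Output volume of a POOL layer: depth is unchanged."""
    w, h, d = shape
    return ((w - field) // stride + 1, (h - field) // stride + 1, d)


shape = (32, 32, 3)
trace = [("INPUT", shape)]
# Assumed hyperparameters: 12 filters of size 3x3, stride 1, zero-padding 1.
shape = conv_output_shape(shape, num_filters=12, field=3, stride=1, pad=1)
trace.append(("CONV", shape))
trace.append(("RELU", shape))      # elementwise, so the shape is unchanged
# Assumed hyperparameters: 2x2 pooling windows with stride 2.
shape = pool_output_shape(shape, field=2, stride=2)
trace.append(("POOL", shape))
trace.append(("FC", (1, 1, 10)))   # one score per CIFAR-10 class

for name, s in trace:
    print(f"{name:5s} {s}")
```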

Notice that some layers contain parameters and others don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons).

On the other hand, RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.

In summary:


- A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
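
As a rough illustration of which layers carry parameters, the sketch below counts the weights and biases in the CIFAR-10 example. The 3x3 filter size is an assumption (the text only fixes the number of filters at 12), and the FC layer is assumed to read the full [16x16x12] pooled volume.

```python
def conv_params(field, in_depth, num_filters):
    """Each filter has field*field*in_depth weights plus one bias."""
    return num_filters * (field * field * in_depth + 1)


def fc_params(num_inputs, num_outputs):
    """One weight per input/output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


params = {
    "CONV": conv_params(field=3, in_depth=3, num_filters=12),  # assumed 3x3 filters
    "RELU": 0,                                                 # fixed function
    "POOL": 0,                                                 # fixed function
    "FC": fc_params(num_inputs=16 * 16 * 12, num_outputs=10),
}
print(params)  # {'CONV': 336, 'RELU': 0, 'POOL': 0, 'FC': 30730}
```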

![](assets/convnet.jpeg)

# 5. Convolutional Layer
The Convolutional layer is the core building block of a ConvNet that does most of the computational heavy lifting.

The CONV layer's parameters consist of a set of learnable filters. Every filter 

# 6. Pooling
![](assets/pooling.jpg)

A typical layer of a convolutional network consists of three stages. 
- In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. 
- In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called **the detector stage**. 
- In the third stage, we use a **pooling function** to modify the output of the layer further.
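
The three stages can be sketched end to end on a 1D signal. This is a simplified illustration with a single kernel, not a full layer. The input values, the kernel, and the pooling width are all hypothetical.

```python
def convolution_stage(x, w):
    """Stage 1: 'valid' discrete convolution, producing linear activations."""
    k = len(w)
    return [sum(x[t + k - 1 - a] * w[a] for a in range(k))
            for t in range(len(x) - k + 1)]


def detector_stage(activations):
    """Stage 2: rectified linear activation, max(0, a), applied elementwise."""
    return [max(0, a) for a in activations]


def pooling_stage(values, width, stride):
    """Stage 3: max pooling over windows of `width` values."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, stride)]


x = [1, 2, 4, 3, 1, 0]     # hypothetical input signal
w = [1, -1]                # hypothetical kernel: difference of neighbours

linear = convolution_stage(x, w)
detected = detector_stage(linear)
pooled = pooling_stage(detected, width=2, stride=2)
print(linear)    # [1, 2, -1, -2, -1]
print(detected)  # [1, 2, 0, 0, 0]
print(pooled)    # [2, 0]
```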

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the *max pooling* operation reports the maximum output within a rectangular neighborhood. 

Pooling helps to make the representation become approximately *invariant* to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
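
A small numerical check makes this approximate invariance visible. In the sketch below, hypothetical detector outputs are shifted right by one position. Every detector value changes, but only some of the max-pooled values do.

```python
def max_pool_1d(values, width, stride=1):
    """Report the maximum of each window of `width` neighbouring values."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, stride)]


def count_changed(before, after):
    """Number of positions whose value differs between two lists."""
    return sum(1 for b, a in zip(before, after) if b != a)


detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]   # hypothetical detector outputs
shifted = [0.3] + detector[:-1]              # same outputs shifted right by one

pooled = max_pool_1d(detector, width=3)
pooled_shifted = max_pool_1d(shifted, width=3)

print(pooled)          # [1.0, 1.0, 0.2, 0.1]
print(pooled_shifted)  # [1.0, 1.0, 1.0, 0.2]
print(count_changed(detector, shifted))       # 6 of 6 detector values changed
print(count_changed(pooled, pooled_shifted))  # 2 of 4 pooled values changed
```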

Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced $k$ pixels apart rather than $1$ pixel apart.
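
The effect of the spacing $k$ can be seen in a minimal 2D max-pooling sketch. The feature map values are hypothetical. With 2x2 windows, a stride of 1 produces a 3x3 output from a 4x4 map, while a stride of 2 produces only 2x2.

```python
def max_pool_2d(fmap, size, stride):
    """Max pooling over size x size windows spaced `stride` pixels apart."""
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, stride)]
            for r in range(0, rows - size + 1, stride)]


# Hypothetical 4x4 map of detector outputs.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 2],
        [2, 2, 3, 4]]

pooled_dense = max_pool_2d(fmap, size=2, stride=1)    # regions 1 pixel apart
pooled_strided = max_pool_2d(fmap, size=2, stride=2)  # regions k = 2 pixels apart

print(pooled_dense)    # [[4, 3, 2], [4, 5, 5], [2, 5, 5]]
print(pooled_strided)  # [[4, 2], [2, 5]]
```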