## Statistical invariance

Weight sharing: when you know that two inputs can contain the same kind of information (i.e., there is statistical invariance), you share the weights $w$ between them and train those weights jointly on both inputs.
- For images, this leads to convolutional networks.
- For text and sequences in general, it leads to embeddings and recurrent neural networks.

## Convolutional networks (ConvNets)

- [Link](https://www.youtube.com/watch?time_continue=65&v=ISHGyvsT0QY) to original video.
- [Link](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/) to a very helpful blogpost explaining the concept.

ConvNets are neural networks that share their parameters across space.

Illustration of the Convolution operation (source: the blogpost linked above):
<img src="https://ujwlkarn.files.wordpress.com/2016/07/convolution_schematic.gif?w=536&h=392" alt="alt text" width="400">

In CNN terminology,
- the 3×3 matrix is called a "filter", "kernel", or "feature detector".
- the matrix formed by sliding the filter over the image and computing the dot product at each position is called the "convolved feature", "activation map", or "feature map".
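
The sliding dot product described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how a deep learning library implements it: the function name `convolve2d_valid` and the 5×5 image and 3×3 filter values are assumptions chosen for the example, the stride is fixed at 1, and no padding is applied. As in most deep learning libraries, the filter is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter and the patch under it, then sum.
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Illustrative values (an assumption for this sketch, not taken from the notes).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

The same nine filter weights are reused at every position of the image, which is the weight sharing across space that defines a ConvNet.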

Different values of the filter matrix will produce different Feature Maps for the same input image:
<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-05-at-11-03-00-pm.png" alt="alt text" width="400">

Parameters:
- Depth: the number of filters we use for the convolution operation, which corresponds to the number of feature maps produced.
- Stride: the number of pixels by which we slide the filter matrix over the input matrix. With a stride of 1, the filter moves one pixel at a time; with a stride of 2, it jumps two pixels at a time. A larger stride produces smaller feature maps.
- Padding: how the filter is handled at the edges of the image.
    - Valid padding: the filter never goes past the edge of the image, so the output is slightly smaller than the input.
    - Same padding: the filter goes off the edge and the image is padded with 0s so that, with a stride of 1, the output is the same size as the input.
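
To make the effect of stride and padding concrete, here is a small sketch that computes the output size along one spatial dimension. The helper name `conv_output_size` is hypothetical, and the formulas assume the convention used by TensorFlow's `"VALID"` and `"SAME"` modes; other libraries may round differently.

```python
import math

def conv_output_size(input_size, filter_size, stride, padding):
    """Output size along one dimension (assumes TensorFlow-style padding modes)."""
    if padding == "valid":
        # The filter must fit entirely inside the input.
        return (input_size - filter_size) // stride + 1
    if padding == "same":
        # Zero padding is added so only the stride shrinks the output.
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding mode: {padding!r}")

# A 28-pixel-wide input with a 3-pixel-wide filter (illustrative sizes).
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=2, padding="valid"))  # 13
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
```

Note that "same" padding only preserves the input size exactly when the stride is 1; with a larger stride the output still shrinks by roughly the stride factor.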

The general idea of ConvNets is to apply convolutions that progressively *squeeze out* the spatial dimensions while increasing the depth, which corresponds roughly to the semantic complexity of the representation. Once all the spatial information has been squeezed out, only parameters that map to the content of the image remain.

<img src="https://raw.githubusercontent.com/Runze/ud730-deep-learning-class-notes/master/screenshots/lesson-4-covnets.png" alt="alt text" width="700">
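
The "squeezing" can be followed by tracing the shape of the representation through a stack of convolutions. The sketch below is only illustrative: the function `trace_shapes`, the 28×28×1 input, and the layer depths of 16 and 32 are all hypothetical choices, and each layer is assumed to use same padding, so the spatial size is divided by the stride (rounded up).

```python
import math

def trace_shapes(height, width, depth, layers):
    """Return the (height, width, depth) after each layer.

    `layers` is a list of (output_depth, stride) pairs; same padding is assumed.
    """
    shapes = [(height, width, depth)]
    for out_depth, stride in layers:
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
        shapes.append((height, width, out_depth))
    return shapes

# Hypothetical network: two stride-2 convolutions with 16 and 32 filters.
shapes = trace_shapes(28, 28, 1, layers=[(16, 2), (32, 2)])
for shape in shapes:
    print(shape)
# (28, 28, 1)
# (14, 14, 16)
# (7, 7, 32)
```

Each layer halves the spatial dimensions while the depth grows, so the final 7×7×32 volume carries little positional detail and many feature channels.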

### Pooling

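Pooling reduces the dimensionality of each feature map while retaining the most important information. It does so by combining the values in each small neighborhood into a single value, for example by taking their maximum (max pooling).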

<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-3-38-39-am.png?w=988" alt="alt text" width="400">
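
As a rough sketch of one common variant, max pooling, the code below takes the largest value in each 2×2 window of a feature map, moving with a stride of 2. The function name `max_pool2d`, the window size, and the input values are assumptions made for illustration; average pooling would replace `max()` with `mean()`.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each `size` x `size` window, moving by `stride`."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

# Illustrative 4x4 feature map (values are an assumption for this sketch).
feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

pooled = max_pool2d(feature_map)
print(pooled)
# [[6 8]
#  [3 4]]
```

The 4×4 map shrinks to 2×2, and each output keeps only the strongest response in its neighborhood.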

A small network with two convolutional layers, followed by one fully connected layer:

<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-08-at-2-26-09-am.png?w=1496" alt="alt text" width="700">

### Inception module

<img src="https://github.com/Runze/ud730-deep-learning-class-notes/blob/master/screenshots/lesson-4-inception.png?raw=true" alt="alt text" width="700">