# Convolutional Neural Networks Explained

In the pursuit of accuracy in machine learning models, adding layers is the natural path. Several problems emerge, however, when implementing this type of solution. The vanishing gradient caused by saturating activation functions can be mitigated by using ReLU-type functions. The bigger problem is the rapidly growing number of trainable parameters.

Convolutional Neural Networks partially solve this problem by exploiting correlations between adjacent inputs. They are among the main methods currently employed in the study of images and time series. To understand how such a network works, we must first understand its components:


## Convolution

<img src="images/Convolution.JPG" alt="convolution" style="width: 500px;"/>

A convolution is essentially a filter that is applied to the input and slides across it according to certain parameters. In the example above, we have a 2x2 moving filter in which every weight is 0.5, and which performs the following operation:

$\begin{eqnarray}
{\rm out_1}&=&0.5{\rm in_1} + 0.5{\rm in_2} + 0.5{\rm in_6} + 0.5{\rm in_7} = 4.25\\
{\rm out_2}&=&0.5{\rm in_2} + 0.5{\rm in_3} + 0.5{\rm in_7} + 0.5{\rm in_8} = 2.5
\end{eqnarray}$
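
The sliding-filter operation can be sketched in NumPy. This is a minimal illustration rather than an optimized implementation: the function name `conv2d_valid` and the 5x5 input values are assumptions made for the example (they are not the values from the figure). The row width of 5 is inferred from the indices in the equations, where in_6 sits directly below in_1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch currently under the filter.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x5 input: in_1..in_5 form the first row, in_6..in_10 the second, etc.
image = np.arange(1, 26, dtype=float).reshape(5, 5)
kernel = np.full((2, 2), 0.5)  # 2x2 filter with every weight equal to 0.5

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (4, 4)
print(feature_map[0, 0])   # out_1 = 0.5 * (1 + 2 + 6 + 7) = 8.0
print(feature_map[0, 1])   # out_2 = 0.5 * (2 + 3 + 7 + 8) = 10.0
```

Note that, like most deep learning libraries, this sketch does not flip the filter, so strictly speaking it computes a cross-correlation.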

The convolution can also be visualized as a standard node diagram with sparse connections (not every node is connected to every other) and shared weights (the same filter is reused at every position). In the example above the filter weights were all equal, [0.5, 0.5, 0.5, 0.5], but they could just as well be, for example, [0.1, 0.0001, 0.9, 0.34]. These two properties drastically reduce the number of necessary parameters.

<img src="images/Convolution2.JPG" alt="convolution as a node diagram" style="width: 350px;"/>
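
To get a feeling for the savings, the following back-of-the-envelope count compares a fully connected layer with a sparsely connected one and with a convolution. The 5x5 input and 4x4 output sizes are assumptions chosen to match a 2x2 filter moving one place at a time; biases are ignored.

```python
# Weight counts for mapping a 5x5 input to a 4x4 output (biases ignored).
n_inputs = 5 * 5
n_outputs = 4 * 4

dense_weights = n_inputs * n_outputs   # every output connected to every input
sparse_weights = n_outputs * (2 * 2)   # each output sees only a 2x2 patch
shared_weights = 2 * 2                 # the same 2x2 filter reused at every position

print(dense_weights, sparse_weights, shared_weights)  # 400 64 4
```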

After going through the convolution, we need to pass the output into an activation function. We will be using ReLU functions.
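
As a quick illustration, ReLU keeps positive values and replaces negative ones with zero. The input values below are hypothetical and only serve to show the effect.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, replace negatives with zero."""
    return np.maximum(x, 0.0)

# Hypothetical convolution outputs, including negative values.
conv_output = np.array([[4.25, -1.0],
                        [-0.5, 2.5]])
activated = relu(conv_output)
print(activated.tolist())  # [[4.25, 0.0], [0.0, 2.5]]
```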

## Feature mapping and multiple channels

A single convolution filter can only detect a single feature, so we need multiple filters to detect different features. We therefore end up with multiple convolutions, which can be thought of as adding one dimension to the original data set. 2D images now become 3D data sets, and RGB images (one channel each for R, G and B) would have a 4D output if every channel were filtered separately. In practice, a standard convolutional layer sums each filter over the input channels, so its output remains 3D.

<img src="images/Featuremaps.JPG" alt="feature maps" style="width: 500px;"/>

Finally, each output of each convolution will be passed through an activation function.
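
A rough sketch of how several filters turn one 2D image into a stack of feature maps is given below. The three filters and the input are hypothetical; in a real network the filter weights would be learned during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Three hypothetical 2x2 filters, each sensitive to a different pattern.
filters = np.array([
    [[0.5, 0.5], [0.5, 0.5]],     # local average
    [[1.0, -1.0], [1.0, -1.0]],   # change along the horizontal direction
    [[1.0, 1.0], [-1.0, -1.0]],   # change along the vertical direction
])

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 image

# One feature map per filter, stacked along a new leading axis.
feature_maps = np.stack([conv2d_valid(image, f) for f in filters])
activated = np.maximum(feature_maps, 0.0)  # ReLU applied to every feature map

print(feature_maps.shape)  # (3, 4, 4)
```

The 2D input has become a 3D array whose first axis indexes the filters.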


## Pooling

After the convolutional map, we add a second filter called pooling. It reduces the size of the feature maps (down-sampling), and hence the number of parameters needed in later layers, and makes the representation more robust to transformations such as local rotations and scale changes.

<img src="images/maxpooling.JPG" alt="max pooling" style="width: 500px;"/>

Pooling is another moving filter, like the convolution, but instead of applying trainable weights it applies a statistical function. The most common is *max pooling*, which applies the max() function over the inputs in each window. Another popular choice is mean pooling.
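
Both variants can be sketched with one small helper. The name `pool2d`, its parameters and the 4x4 input are assumptions made for this illustration, not values taken from the figure.

```python
import numpy as np

def pool2d(x, size=2, stride=2, func=np.max):
    """Apply `func` (e.g. np.max or np.mean) over sliding windows of `x`."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = func(x[r:r + size, c:c + size])
    return out

# Hypothetical 4x4 feature map.
x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])

max_pooled = pool2d(x, func=np.max)
mean_pooled = pool2d(x, func=np.mean)
print(max_pooled.tolist())   # [[4.0, 5.0], [3.0, 7.0]]
print(mean_pooled.tolist())  # [[2.5, 2.0], [1.5, 4.75]]
```

Unlike the convolution, nothing here is trainable: the window size, the stride and the function are fixed choices.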

## Strides and down-sampling

In the above example, each pooling window shifts by 2 places. This is called a _stride of 2_. In general, the stride needs to be specified in both the x and y directions. Whenever the stride is greater than 1, the output is smaller than the input by roughly that factor; this is called down-sampling.
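
The relation between input size, window size and stride can be written as a one-line formula. The sketch below assumes no padding; the helper name `output_shape` is made up for this example.

```python
def output_shape(shape, window, stride):
    """Number of window positions along each axis (no padding)."""
    return tuple((n - w) // s + 1 for n, w, s in zip(shape, window, stride))

# 2x2 window with a stride of 2 in both directions halves each side.
print(output_shape((4, 4), window=(2, 2), stride=(2, 2)))  # (2, 2)

# The stride may differ between the two directions.
print(output_shape((4, 6), window=(2, 2), stride=(2, 1)))  # (2, 5)
```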

## Invariances

Pooling not only introduces down-sampling, it also allows the detection of certain features regardless of changes in scale and local rotations that only affect some values.
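
A small experiment makes this concrete. In the sketch below, one 2x2 block of a hypothetical feature map is rotated by 90 degrees, and the max-pooled output does not change. The invariance is exact here only because the change stays inside a single pooling window; changes that cross window boundaries are only absorbed approximately.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an array whose sides are even."""
    h, w = x.shape
    # Split into 2x2 blocks and take the maximum of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 feature map.
x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])

# Rotate only the top-left 2x2 block by 90 degrees (a local change).
x_rot = x.copy()
x_rot[:2, :2] = np.rot90(x[:2, :2])

same = np.array_equal(max_pool_2x2(x), max_pool_2x2(x_rot))
print(same)  # True
```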

To also detect global rotations, we first apply several convolutional filters, each responding to a different orientation, followed by ReLU activations, so that each activation depends on the orientation of the input, as can be seen in the image below.

<img src="images/invariance.JPG" alt="invariance" style="width: 500px;"/>
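
One way to read this idea is sketched below, under the assumption that the oriented filters are 90-degree rotations of one another and that the maximum is then taken across their ReLU responses. The filter and the patch are hypothetical; in a trained network the orientations would be learned rather than constructed by hand.

```python
import numpy as np

# Hypothetical 3x3 filter sensitive to one orientation, plus its 90-degree rotations.
base = np.array([[ 1.,  2.,  1.],
                 [ 0.,  0.,  0.],
                 [-1., -2., -1.]])
filters = [np.rot90(base, k) for k in range(4)]

def pooled_response(patch):
    """ReLU response of every oriented filter, then the maximum over the filters."""
    responses = [max(float(np.sum(patch * f)), 0.0) for f in filters]
    return max(responses)

# Hypothetical 3x3 input patch and a copy rotated by 90 degrees.
patch = np.array([[9., 8., 7.],
                  [1., 2., 1.],
                  [0., 1., 0.]])
rotated = np.rot90(patch)

print(pooled_response(patch), pooled_response(rotated))  # 30.0 30.0
```

Each individual filter responds differently to the two patches, but some filter in the set always matches the orientation, so the pooled response is the same.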

## Final picture

<img src="images/cnn.png" alt="complete CNN" style="width: 500px;"/>

The convolutional part of the network produces an information-rich representation of the initial input. The final stage simply applies a standard classifier to this output. To do so, the output of the CNN must be flattened: from 100 channels with 2x2 matrices, we produce a single flattened vector of 2x2x100 = 400 entries.
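
The flattening step and a classifier on top of it might look like the sketch below. The random feature values, the 10 output classes and the single dense softmax layer are assumptions made for illustration; the figure does not specify the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CNN output: 100 channels, each a 2x2 feature map.
features = rng.standard_normal((100, 2, 2))

# Flatten into a single vector of 100 * 2 * 2 = 400 entries.
flat = features.reshape(-1)
print(flat.shape)  # (400,)

# Illustrative dense softmax layer with random (untrained) weights.
n_classes = 10  # assumed number of classes
W = rng.standard_normal((n_classes, flat.size)) * 0.01
b = np.zeros(n_classes)

logits = W @ flat + b
probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs /= probs.sum()
print(probs.shape)  # (10,)
```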