# Image Analysis with MLP
Recall that in a traditional neural network, each neuron is stimulated by the nodes connected to it and is activated only when a certain threshold value is reached.
![Multilayer Perceptron](images/mlp.png)

Here are the drawbacks of MLP when it comes to image processing:
1. MLPs use one perceptron for each input (e.g. pixel in an image, multiplied by 3 in the RGB case). The number of weights rapidly becomes unmanageable for large images. A 224 x 224 pixel image with 3 color channels has 150,528 input values, so every single neuron in the first fully connected layer needs around 150,000 weights that must be trained! As a result, training becomes difficult and overfitting can occur.
2. MLPs are not translation invariant: they react differently to an image and to a shifted version of it. For example, if a cat appears in the top left of one picture and in the bottom right of another, the MLP will try to correct itself and assume that a cat will always appear in the section of the image where it was seen.
3. Spatial information is lost when the image is flattened to be fed into an MLP. Pixels that are close together are important because they help to define the features of an image. We thus need a way to leverage the spatial correlation of the image features (pixels) so that we can see the cat in our picture no matter where it appears.
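
To make the first point concrete, here is a rough back-of-the-envelope sketch of the parameter counts. The hidden layer size (1000 neurons) and the convolutional layer shape (64 filters of size 3 x 3) are assumptions chosen purely for illustration; they do not come from any particular architecture.

```python
# Input size from the text: a 224 x 224 RGB image.
height, width, channels = 224, 224, 3
n_inputs = height * width * channels  # values after flattening

# Hypothetical fully connected layer with 1000 neurons (assumed size).
hidden_units = 1000
mlp_params = n_inputs * hidden_units + hidden_units  # weights + biases

# Hypothetical convolutional layer: 64 filters of size 3 x 3 x 3 (assumed).
n_filters, k = 64, 3
conv_params = n_filters * (k * k * channels) + n_filters  # weights + biases

print(n_inputs)     # 150528
print(mlp_params)   # 150529000
print(conv_params)  # 1792
```

The convolutional layer needs far fewer parameters because its weights depend on the filter size and not on the image size.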

The above mentioned problems can be solved with CNNs.

# A brief introduction to CNNs
Convolutional neural networks (ConvNets or CNNs) are one of the main categories of neural networks used for image recognition, image classification, object detection, and face recognition. CNNs are at the core of most Computer Vision systems today, from Facebook’s automated photo tagging to self-driving cars.

CNNs are a special kind of multi-layer neural network. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer. However, ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network. As a result, they can recognize patterns with extreme variability (such as handwritten characters), with robustness to distortions and simple geometric transformations.
__CNNs come in various architectures, such as LeNet, AlexNet, VGG, GoogLeNet, and ResNet.__

CNNs leverage the fact that __nearby pixels are more strongly related than distant ones.__
A CNN analyzes the influence of nearby pixels by using something called a filter. We take a filter of a size specified by the user (3x3 and 5x5 are common choices) and move it across the image from top left to bottom right. For each position on the image, a value is calculated from the filter and the pixels under it using a convolution operation. A filter can correspond to any kind of feature. Its output indicates how strongly that feature appears in the image, how many times, and at which locations. Because the same filter weights are reused at every position, the network has to learn far fewer weights than an MLP, and a change in the location of a feature does not throw the network off.



# What is Convolution?
In mathematics (and, in particular, functional analysis) convolution is an operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. This operation is used in several areas such as probability, statistics, computer vision, natural language processing, image and signal processing, engineering, and differential equations.
__In image processing, convolution is a technique that changes the intensity of a pixel to reflect the intensities of the surrounding pixels. A common use of convolution is to create image filters. Using convolution, you can get popular image effects like blur, sharpen, and edge detection.__


# Convolution operation in Deep Learning
A "convolution" operator, or simply a "filter", is an easy way to perform quite different operations just by changing the convolution kernel. Apply a Gaussian blur kernel and the image is smoothed. Apply an edge-detection kernel such as Sobel and you'll see the edges. Apply a Gabor kernel to extract oriented edge and texture features.
![Convolution Animation](images/convolution.gif)
__The convolution operation, simply put, is an element-wise product of two matrices (the kernel and the image patch under it) followed by a sum of the results.__ 
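
The operation described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration that assumes a single-channel image, no padding, and a stride of 1; the function name `conv2d` is chosen here for illustration. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product of patch and kernel, then sum.
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0  # simple 3x3 averaging (box blur) kernel
result = conv2d(image, kernel)
print(result.shape)  # (2, 2)
print(result)        # approximately [[5, 6], [9, 10]]
```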


# Convolution Kernels
The image below is shown before and after processing with a vImage convolution function that produces the emboss effect.
![Embossing](images/emboss.jpg)
The kernel used is a 3x3 matrix:
![Kernel](images/kernel.jpg)

The numbers inside the kernel are what impact the overall effect of convolution (in this case, the kernel encodes the emboss effect). The kernel (or more specifically, the values held within the kernel) is what determines how to transform the pixels from the original image into the pixels of the processed image.
__The height and width of the kernel do not have to be the same, though by convention both are odd numbers. An odd-sized kernel has a well-defined center (anchor) pixel with an equal number (n) of pixels on each of its four sides (the kernel is symmetric in shape, not necessarily in its values). Whatever this number of pixels may be, the length of each side of the kernel is 2*n+1 (n pixels on each side of the anchor plus the anchor pixel itself), which is why filters/kernels are almost always odd sized.__

The example below shows the results of convolving an image with different types of filters (kernels).
![Kernel](images/filters.png)
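
As a rough illustration of how the kernel values determine the effect, the sketch below applies a few widely used 3x3 kernels to a toy image. The kernels and the toy image are assumptions chosen for illustration and are not necessarily the ones used in the figure above; `conv2d` is a hypothetical helper written for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution without kernel flip (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernels = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "box_blur": np.ones((3, 3)) / 9.0,
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
}

# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

outputs = {name: conv2d(image, k) for name, k in kernels.items()}
print(outputs["sobel_x"][0])  # [0. 4. 4. 0.]: strong response at the edge
```

The identity kernel returns the central part of the image unchanged, while the Sobel kernel responds only where the intensity changes from left to right.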


# How does CNN work?
![CNN layers](images/layers.png)

CNNs are basically just several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That’s also called a fully connected layer, or affine layer. In CNNs we don’t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters and combines their results. There are also pooling (subsampling) layers, which reduce the spatial size of the feature maps. During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in image classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.
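
The convolution, non-linearity, and pooling steps described above can be sketched as a single forward pass. This is only an illustration under simplifying assumptions: one channel, one filter with random values (in a real CNN the values are learned during training), and 2x2 max pooling. The helper names are hypothetical.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution without kernel flip (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise ReLU non-linearity."""
    return np.maximum(x, 0)

def max_pool2x2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/columns are dropped)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))  # stand-in for a learned filter

features = max_pool2x2(relu(conv2d(image, kernel)))
print(features.shape)  # (3, 3): 8x8 -> 6x6 after conv -> 3x3 after pooling
```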

## CNN Hyperparameters

### Stride Size
The stride size defines how much you want to shift your filter at each step. A larger stride size leads to fewer applications of the filter and a smaller output size. In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network.
![Strides](images/strides.png)
![Stride 1](images/stride1.gif)
![Stride 2](images/stride2.gif)

Note that you can use different strides horizontally and vertically.
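
A small sketch of how the stride changes the output size, including different vertical and horizontal strides. The function `conv2d_strided` and its parameter names are assumptions made for this example, not part of any library.

```python
import numpy as np

def conv2d_strided(image, kernel, stride_h=1, stride_w=1):
    """Naive 2D convolution with separate vertical/horizontal strides."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride_h + 1
    out_w = (image.shape[1] - kw) // stride_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride_h, j * stride_w  # top-left corner of the patch
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.ones((7, 7))
kernel = np.ones((3, 3))

print(conv2d_strided(image, kernel, 1, 1).shape)  # (5, 5)
print(conv2d_strided(image, kernel, 2, 2).shape)  # (3, 3)
print(conv2d_strided(image, kernel, 2, 1).shape)  # (3, 5)
```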


### Padding
What happens when you apply three 5 x 5 x 3 filters to a 32 x 32 x 3 input volume? The output volume would be 28 x 28 x 3. Notice that the spatial dimensions decrease. As we keep applying conv layers, the size of the volume will decrease faster than we would like. In the early layers of our network, we want to preserve as much information about the original input volume as possible so that we can extract those low-level features. Let’s say we want to apply the same conv layer but we want the output volume to remain 32 x 32 x 3. To do this, we can apply a zero padding of size 2 to that layer. Zero padding pads the input volume with zeros around the border. With a zero padding of two, the input volume becomes 36 x 36 x 3.
![Padding](images/padding.png)
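
The padding step from the example above can be reproduced with NumPy's `np.pad`. This sketch assumes the volume is stored in height x width x channels order and pads only the two spatial dimensions.

```python
import numpy as np

volume = np.ones((32, 32, 3))  # 32 x 32 x 3 input volume

# Pad 2 zeros on every side of height and width; leave the channels alone.
padded = np.pad(volume, pad_width=((2, 2), (2, 2), (0, 0)),
                mode="constant", constant_values=0)

print(padded.shape)  # (36, 36, 3)

# A 5 x 5 filter with stride 1 on the padded volume gives 36 - 5 + 1 = 32.
out_size = padded.shape[0] - 5 + 1
print(out_size)  # 32
```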

Use the following equations to calculate the exact size of the convolution output for an input with the size of (width = W, height = H) and a Filter with the size of (width = $F_w$, height = $F_h$):
$$
\begin{align}
output\ width =\cfrac{W - F_{w} + 2P}{S_{w}} + 1
\end{align}
$$

$$
\begin{align}
output\ height =\cfrac{H - F_{h} + 2P}{S_{h}} + 1
\end{align}
$$
where $S_w$ = horizontal stride, $S_h$ = vertical stride, and P = amount of zero padding added to each side of the image. When the division is not exact, the result is rounded down.
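
The formula can be checked with a small helper. The function name `conv_output_size` is hypothetical, and rounding down with integer division is an assumption that matches the usual convention.

```python
def conv_output_size(size, filter_size, padding=0, stride=1):
    """Output length along one dimension: (W - F + 2P) / S + 1, rounded down."""
    return (size - filter_size + 2 * padding) // stride + 1

# Examples from the padding discussion: 32-pixel input, 5-pixel filter.
print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 5, padding=2))            # 32
print(conv_output_size(32, 5, padding=2, stride=2))  # 16
```

Call it once with the width, $F_w$ and $S_w$, and once with the height, $F_h$ and $S_h$, to get both output dimensions.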


### Depth (also known as channel)
Depth corresponds to the number of filters we would like to use. Generally, a convolution layer can have multiple input channels (each a 2D matrix) and multiple output channels (again each a 2D matrix). The most tangible example of a multi-channel input is a color image, which has 3 RGB channels. Let's feed it to a convolution layer with 3 input channels and 1 output channel. How does the layer calculate the output? The short answer is that it has 3 filters (one for each input channel) instead of one filter. It calculates the convolution of each filter with its corresponding input channel (first filter with first channel, second filter with second channel, and so on). The stride is the same for all channels, so the resulting matrices all have the same size. Finally, it sums up these matrices and outputs a single matrix, which is the only channel at the output of the convolution layer.
![RGB convolution](images/rgb.gif)
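
The per-channel convolution followed by a sum can be sketched as follows. This is a minimal illustration assuming a height x width x channels layout, no padding, and a stride of 1; the random values stand in for pixel data and learned filter weights, and the function names are hypothetical.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive single-channel 2D convolution without kernel flip."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_multichannel(image, filt):
    """Convolve each input channel with its own kernel slice, then sum."""
    per_channel = [conv2d(image[:, :, c], filt[:, :, c])
                   for c in range(image.shape[2])]
    return np.sum(per_channel, axis=0)  # one 2D output channel

rng = np.random.default_rng(0)
image = rng.random((5, 5, 3))         # toy RGB image
filt = rng.standard_normal((3, 3, 3))  # one kernel slice per input channel

out = conv2d_multichannel(image, filt)
print(out.shape)  # (3, 3): a single output channel
```

To get several output channels, the same procedure would be repeated with a separate set of kernel slices for each output channel.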