# Convolutional Neural Networks
Images from the MNIST dataset have a fixed size of $28*28$ pixels. In the previous notebook we tackled the classification problem with a fully connected feed-forward neural network: we reshaped each input image into a 1-D tensor with $28*28 = 784$ elements. The input layer was fed to a hidden layer with 512 units, so the number of parameters (weights and biases) of that layer is $784*512+512=401920$.



Now, consider a **high-resolution image** with $1280*720$ pixels. A fully connected approach would require a huge number of parameters. Suppose that the first hidden layer is again made up of 512 units; then the number of parameters would be $1280*720*512+512=471,859,712$: *almost half a billion parameters!* In such a situation, **the dense, fully connected approach is practically infeasible**. Furthermore, after the *reshape* operation, the spatial structure of the input is lost.
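
The two counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `dense_layer_params` is introduced here for illustration and is not part of any library.

```python
def dense_layer_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

# MNIST image flattened to 784 values, 512 hidden units
mnist_params = dense_layer_params(28 * 28, 512)
# 1280x720 single-channel image flattened, 512 hidden units
hd_params = dense_layer_params(1280 * 720, 512)

print(mnist_params)  # 401920
print(hd_params)     # 471859712
```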



**Convolutional Neural Networks (CNNs)** are a particular class of deep feed-forward neural networks that overcome the aforementioned limitations and have proved to be particularly suitable for computer vision applications.
There are two main advantages in using CNNs: 
- thanks to their architecture, they can take into account the spatial structure of the input; this is a desired property when the input neurons are the pixels of an image. 
- they require fewer parameters than fully-connected networks. This means that they are faster to train, less prone to overfitting, and that deeper and more powerful models can be designed.




## Architecture and Properties
These properties originate from the following ideas behind CNNs:
- Local Receptive fields
- Shared weights
- Pooling

### Local Receptive Fields
In a fully-connected network, every neuron of a layer is connected to every neuron in the following layer.

The expression *local receptive fields* indicates a sparse interaction, where only a limited number of neighboring neurons in the $i^{th}$ layer are connected to each neuron in the $(i+1)^{th}$ layer.

![local receptive fields](https://miro.medium.com/max/600/1*N7SyP4OvPB8-YpbEUMOK7Q.png)
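
To get a feeling for how sparse this interaction is, the sketch below counts connections for one hypothetical setting: a $28*28$ input, a $3*3$ receptive field moved one pixel at a time, and no padding. The sizes and the helper name `count_connections` are assumptions chosen for illustration.

```python
def count_connections(input_side, field_side):
    """Compare local and dense connectivity for a square input.

    Assumes the receptive field moves one pixel at a time and never
    leaves the image, so the hidden layer has
    (input_side - field_side + 1) neurons per side.
    """
    hidden_side = input_side - field_side + 1
    n_hidden = hidden_side ** 2
    local = n_hidden * field_side ** 2   # each hidden neuron sees one field
    dense = n_hidden * input_side ** 2   # each hidden neuron sees every pixel
    return local, dense

local, dense = count_connections(28, 3)
print(local)  # 6084
print(dense)  # 529984
```

With the same number of hidden neurons, the locally connected layer uses roughly 87 times fewer connections than the dense one in this setting.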


### Shared Weights: Convolution Operation
In the figure above, the weights used to connect the first receptive field with the neuron in the hidden layer are the same for every connection between receptive fields and corresponding hidden neurons. This is the core of the convolution operation: the first step to obtain the $(i+1)^{th}$ layer is the convolution between the $i^{th}$ layer and a **kernel** (or **filter**). The kernel size corresponds to the size of the receptive field, while the values are the shared parameters. Since this filtering extracts a feature of the $i^{th}$ layer, the $(i+1)^{th}$ layer is often referred to as a feature map.
The animated GIF below shows the convolution operation between a two-dimensional input I and a kernel K.

<img src="https://cdn-images-1.medium.com/max/1600/1*VVvdh-BUKFh2pwDD0kPeRA@2x.gif" width="400"/>

- Input I: blue matrix $5*5$
- Kernel K: green matrix $3*3$
- Feature map: pink matrix $3*3$
- Stride (step of the convolution operation) = 1
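
The sliding-window computation can be sketched in plain Python as follows. The input and kernel values are illustrative and not necessarily those of the animation; `conv2d` is a hypothetical helper written for this notebook. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists), without padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (shared weights).
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```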

Given an input image of shape $H*W$, a kernel K of shape $KH*KW$, and a stride $S$, the convolution output has shape:
> output width = $\dfrac{W-KW}{S}+1$

> output height = $\dfrac{H-KH}{S}+1$

A **stride** value greater than 1 is typically used to reduce the computational burden of a convolutional layer.

In order to preserve the size of the original image, we can adopt zero **padding**, which consists of adding zeros around the border of the image. It is typically used in very deep architectures to preserve the resolution across many layers. The result of zero padding (P=1) is shown in the animated figure below.

Considering zero padding, the convolution output has shape: 
> output width = $\dfrac{W-KW+2P}{S}+1$

> output height = $\dfrac{H-KH+2P}{S}+1$

<img src="https://cdn-images-1.medium.com/max/1600/1*W2D564Gkad9lj3_6t9I2PA@2x.gif" width="400"/>
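
The output-size formulas can be wrapped in a small helper, sketched below. The function name is hypothetical, and rounding down when the division is not exact is an assumption of this sketch (it corresponds to dropping the positions where the kernel would not fit entirely).

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output size along one dimension: (size - kernel + 2 * padding) / stride + 1."""
    return (size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))             # 3: no padding, the map shrinks
print(conv_output_size(5, 3, padding=1))  # 5: P=1 preserves the size for a 3x3 kernel
print(conv_output_size(5, 3, stride=2))   # 2: a larger stride gives a smaller map
print(conv_output_size(28, 3))            # 26
```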




At each convolutional layer, several different feature maps are obtained by using several kernels (or filters). 

When the network is trained from scratch, the kernels are randomly initialized. The training procedure then adjusts the kernel weights so that they extract meaningful features from the images.

![alt text](https://ujwlkarn.files.wordpress.com/2016/08/giphy.gif?w=1000)
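
A minimal sketch of this idea: several randomly initialized kernels applied to the same input produce one feature map each. The uniform range $[-0.1, 0.1]$ is an arbitrary assumption made for illustration, not the initializer used by Keras, and the helper names are hypothetical.

```python
import random

def conv2d(image, kernel):
    """Convolution with stride 1 and no padding, on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def random_kernel(size, rng):
    """Square kernel with small random weights (illustrative range)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(size)]

rng = random.Random(0)  # fixed seed so that the run is reproducible
image = [[rng.random() for _ in range(6)] for _ in range(6)]

# Four different kernels give four different feature maps of the same input.
kernels = [random_kernel(3, rng) for _ in range(4)]
feature_maps = [conv2d(image, k) for k in kernels]

print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))  # 4 4 4
```

Before training, these feature maps carry no particular meaning; it is the optimization of the kernel weights that turns them into useful feature detectors.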

### Pooling
A convolutional layer is generally composed of three stages: a linear stage (convolution), a non-linear stage (activation function) and a pooling stage.
The pooling stage consists of replacing a group of contiguous neurons with a single neuron that represents a summary statistic of the group.

The pooling operation has two main consequences:
- it reduces the size of a layer: max-pooling with a 2x2 kernel, for example, halves each spatial dimension of the layer by keeping the maximum value of each non-overlapping 2x2 window of neurons. Fewer neurons means fewer connections and, consequently, fewer parameters in the fully connected layers that follow.
- it increases robustness to small translations, because it maps the information of a group of neurons onto a single neuron of the next layer; knowing whether a feature is present is often more important than knowing its exact location.

Typical pooling functions are max-pooling and average-pooling.

The figure below shows the max-pooling operation applied with 2x2 non-overlapping kernels:
![pooling](https://upload.wikimedia.org/wikipedia/commons/e/e9/Max_pooling.png)
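
Max-pooling over non-overlapping windows can be sketched in a few lines. The input values below are made up for illustration (they are not read from the figure), and `max_pool2d` is a hypothetical helper.

```python
def max_pool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling: the stride equals the window size."""
    out_h = len(feature_map) // pool_size
    out_w = len(feature_map[0]) // pool_size
    return [
        [
            max(
                feature_map[i * pool_size + a][j * pool_size + b]
                for a in range(pool_size)
                for b in range(pool_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

Replacing `max` with an average over the window would give average-pooling.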


## To sum up:
The building blocks of a hidden (convolutional-pooling) layer are the following:
- Linear Convolutional Stage
- Non Linear Activation Stage
- Pooling Stage
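
The three stages can be chained in a toy example, sketched here in plain Python. The input (a bright vertical stripe) and the hand-written edge-detecting kernel are assumptions chosen so that each stage has a visible effect; in a real CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Linear stage: convolution with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]

def relu(feature_map):
    """Non-linear stage: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool2d(feature_map, size=2):
    """Pooling stage: maximum over non-overlapping size x size windows."""
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(len(feature_map[0]) // size)
        ]
        for i in range(len(feature_map) // size)
    ]

# 6x6 image with a bright vertical stripe in the middle
image = [[0, 0, 1, 1, 0, 0] for _ in range(6)]
# Kernel that responds positively to dark-to-bright transitions (left to right)
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

linear = conv2d(image, kernel)   # 4x4, contains negative values
activated = relu(linear)         # 4x4, negatives removed
pooled = max_pool2d(activated)   # 2x2

print(linear[0])     # [3, 3, -3, -3]
print(activated[0])  # [3, 3, 0, 0]
print(pooled)        # [[3, 0], [3, 0]]
```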


The figure below shows a simple convolutional neural network architecture:

![alt text](https://cdn-images-1.medium.com/max/1600/1*N4h1SgwbWNmtrRhszM9EJg.png)

A typical CNN architecture consists of an input layer, one or more hidden layers, and an output layer. As shown in the figure above, one or more fully connected layers typically process the feature maps extracted by the last convolutional layer.


# A simple CNN for the MNIST problem
- Chapter 5, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). 


In [38]:
from tensorflow import keras


We will use our CNN to classify MNIST digits, a task that you've already been through in previous notebooks using a fully-connected network. Even though our CNN will be very basic, its accuracy will exceed that of the fully-connected model from previous notebooks.


## Download and prepare the dataset

Importantly, a CNN takes as input tensors of shape `(image_height, image_width, image_channels)` (not including the batch dimension; recall that we use the *channels-last* convention of the TensorFlow backend).
In our case, we will configure our CNN to process inputs of size `(28, 28, 1)`, which is the format of MNIST images. We do this by passing the argument `input_shape=(28, 28, 1)` to our first layer.

In [39]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
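
To make the two preprocessing steps above concrete, here is a plain-Python sketch of what they do to a single sample: pixel values are rescaled from $[0, 255]$ to $[0, 1]$, and each integer label becomes a one-hot vector of length 10. The helpers `normalize` and `one_hot` are hypothetical stand-ins written for illustration; the cell above relies on NumPy and `to_categorical` instead.

```python
def normalize(pixels):
    """Rescale 8-bit pixel intensities to the range [0, 1]."""
    return [p / 255 for p in pixels]

def one_hot(label, num_classes=10):
    """Vector of zeros with a single 1.0 at the position of the label."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(3))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```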

## Building and training the CNN model


The backbone of our basic CNN will be a stack of `Conv2D` and `MaxPooling2D` layers. 

Take a look at the signatures of the `Conv2D` and `MaxPooling2D` layers. Which arguments must we specify?

In [40]:
keras.layers.Conv2D?

In [41]:
keras.layers.MaxPooling2D?

In [55]:
from tensorflow.keras import layers
from tensorflow.keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

In [43]:
# Let's evaluate the number of parameters of the second convolutional layer
# (32 input channels, 64 filters, 3x3 kernels, plus one bias per filter)

In [44]:
32*3*3*64+64

18496

- $I$ = input dimensionality (number of channels / feature maps)
- $N$ = output dimensionality (number of feature maps, specified in Conv2D)
- $K$ = kernel size (side of the square kernel)

number of trainable parameters = $I*N*K^2+N$
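
Applying this formula to the three convolutional layers defined above gives the following counts. The helper name `conv2d_params` is hypothetical, introduced only for this check.

```python
def conv2d_params(in_channels, n_filters, kernel_size):
    """I * N * K^2 weights plus one bias per output feature map."""
    return in_channels * n_filters * kernel_size ** 2 + n_filters

# (input channels, number of filters, kernel size) for each Conv2D layer
layers_spec = [(1, 32, 3), (32, 64, 3), (64, 64, 3)]
counts = [conv2d_params(*spec) for spec in layers_spec]

print(counts)       # [320, 18496, 36928]
print(sum(counts))  # 55744
```

Note that the counts do not depend on the height and width of the input: the same kernels are reused at every position.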


Let's display the architecture of our convnet so far:

In [45]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_15 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_16 (Conv2D)           (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_17 (Conv2D)           (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


You can see above that the output of every `Conv2D` and `MaxPooling2D` layer is a 3D tensor of shape `(height, width, channels)`. The width 
and height dimensions tend to shrink as we go deeper in the network. The number of channels is controlled by the first argument passed to 
the `Conv2D` layers (e.g. 32 or 64).
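
The spatial sizes in the summary can be reproduced by hand. The sketch below assumes what the model above uses implicitly: convolutions with stride 1 and no padding, and pooling over non-overlapping windows with the size rounded down. The helper `trace_shapes` is hypothetical.

```python
def trace_shapes(side, layers):
    """Follow the side length of a square feature map through a stack."""
    shapes = [side]
    for kind, size in layers:
        if kind == "conv":
            side = side - size + 1   # stride 1, no padding
        elif kind == "pool":
            side = side // size      # non-overlapping windows, rounded down
        shapes.append(side)
    return shapes

stack = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3)]
print(trace_shapes(28, stack))  # [28, 26, 13, 11, 5, 3]
```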




The next step would be to feed our last output tensor (of shape `(3, 3, 64)`) into a densely-connected classifier network like those you are 
already familiar with: a stack of `Dense` layers. These classifiers process vectors, which are 1D, whereas our current output is a 3D tensor. 
So first, we will have to flatten our 3D outputs to 1D, and then add a few `Dense` layers on top:

In [56]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

We are going to do 10-way classification, so we use a final layer with 10 outputs and a softmax activation. Now here's what our network 
looks like:

In [54]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_15 (Conv2D)           (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_16 (Conv2D)           (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_17 (Conv2D)           (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_3 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)               

As you can see, our `(3, 3, 64)` outputs were flattened into vectors of shape `(576,)`, before going through two `Dense` layers.
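
The size of the flattened vector and the parameter counts of the two `Dense` layers follow from the same arithmetic used for the fully connected network at the beginning of this notebook. A short sketch, where the value 55744 is the convolutional total reported by the first summary:

```python
flattened = 3 * 3 * 64              # length of the vector produced by Flatten
dense_hidden = flattened * 64 + 64  # Dense(64): weights + biases
dense_output = 64 * 10 + 10         # Dense(10): weights + biases
conv_total = 55744                  # from the summary of the convolutional stack

total = conv_total + dense_hidden + dense_output
print(flattened, dense_hidden, dense_output)  # 576 36928 650
print(total)                                  # 93322
```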

Now, let's train our convnet on the MNIST digits. We will reuse the code we have already covered in our first example.

In [49]:
model.fit?

In [57]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f80090d6668>

# Exercise
- Select the first 6000 samples of the training set.
- Assess the performance of the CNN model on a validation set after varying its hyperparameters (e.g. number of layers / filters, size of kernels for convolution and pooling).
You may choose to use a function!

> ```python
> def build_model(param1, param2):  # add further hyperparameters as needed
>     model = models.Sequential()
>     # [TODO] add layers configured by the hyperparameters
>     return model
> ```



- Optional: plot the val_accuracy values against one or two hyperparameters.
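
As a possible starting point for the bookkeeping part of the exercise, the sketch below splits the 6000 selected samples into training and validation indices and enumerates a small grid of hyperparameter combinations. Everything here is an assumption made for illustration: the 80/20 split, the hyperparameter names and their values. Building and training the models is left to you.

```python
from itertools import product

def holdout_split(n_samples, val_fraction=0.2):
    """Return (train_indices, val_indices) for a simple hold-out split."""
    n_val = round(n_samples * val_fraction)
    n_train = n_samples - n_val
    return range(0, n_train), range(n_train, n_samples)

# Hypothetical search space: names and values are illustrative only.
search_space = {
    "n_filters": [16, 32],
    "kernel_size": [3, 5],
    "pool_size": [2],
}

train_idx, val_idx = holdout_split(6000)
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(train_idx), len(val_idx))  # 4800 1200
print(len(configs))                  # 4
print(configs[0])  # {'n_filters': 16, 'kernel_size': 3, 'pool_size': 2}
```

Each dictionary in `configs` could then be passed to your own `build_model`, and the validation accuracy recorded for each configuration.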


## Evaluating the model on the test set

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels)

In [None]:
test_acc

While our densely-connected network from previous notebooks had a test accuracy of ~98%, our basic convnet has a test accuracy of ~99%.