# TP 4 - Deep learning for computer vision

---
This notebook contains the code samples found in Chapter 5 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff).

---

Convolutional neural networks (ConvNets) are very similar to regular neural networks: they are made up of layers that have trainable parameters. So what changes? ConvNets make the explicit assumption that the inputs are images, which allows them to encode certain properties into the architecture. This makes forward propagation more efficient to implement and vastly reduces the number of parameters in the network.

#### Fully-connected layers

A regular neural network transforms its input data through a series of layers. Each layer is made up of neurons, and each neuron is fully connected to all neurons in the previous layer. Because of the large number of neurons and connections, this fully-connected structure does not scale well to images. For example, an image of respectable size, say 200x200x3, would require every neuron in the first layer to have 200x200x3 = 120,000 weights. We would almost certainly want several such neurons, so the parameters would add up quickly! This full connectivity is wasteful, and the huge number of parameters makes the network prone to overfitting.
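
To make the arithmetic above concrete, here is a minimal sketch in plain Python. The layer width of 1000 neurons is a hypothetical value chosen only for illustration; it does not come from this notebook.

```python
# Weights needed by fully-connected neurons that look at a 200x200x3 image.
height, width, channels = 200, 200, 3
weights_per_neuron = height * width * channels  # one weight per input value

# Hypothetical layer width, chosen only for illustration.
num_neurons = 1000
layer_weights = weights_per_neuron * num_neurons  # biases are ignored here

print(weights_per_neuron)  # 120000
print(layer_weights)       # 120000000
```

Even this single hypothetical layer would already need 120 million weights, before any further layers are added.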

![fcn.jpeg](https://www.oreilly.com/library/view/tensorflow-for-deep/9781491980446/assets/tfdl_0402.png)

#### Convolutional layers

ConvNets take advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way. To do so, they make use of convolutional layers. The fundamental difference is this: fully-connected layers learn global patterns in their inputs, whereas convolutional layers learn local patterns in small 2D windows of their inputs. More specifically, a convolutional layer operates over 3D tensors with two spatial axes and a depth axis (height, width, channels). The convolution operation extracts patches from its input and applies the same transformation to all of these patches. The output is still a 3D tensor, but its dimensions depend on the layer's hyper-parameters, in particular the *kernel size* and the *number of kernels*.
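
The patch-based operation described above can be sketched in NumPy as follows. This is a deliberately naive illustration, not how Keras implements it: it assumes no padding and a stride of 1, it omits the bias and the activation, and the function name `conv2d_valid` and the random inputs are made up for the example.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Naive convolution without padding, stride 1 (illustrative only).

    image:   array of shape (height, width, channels)
    kernels: array of shape (k_h, k_w, channels, num_kernels)
    returns: array of shape (height - k_h + 1, width - k_w + 1, num_kernels)
    """
    h, w, _ = image.shape
    k_h, k_w, _, num_kernels = kernels.shape
    out = np.zeros((h - k_h + 1, w - k_w + 1, num_kernels))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Extract one local patch of shape (k_h, k_w, channels).
            patch = image[i:i + k_h, j:j + k_w, :]
            # Apply the same kernels to every patch: one dot product per kernel.
            out[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))      # made-up input with the MNIST image shape
kernels = rng.random((3, 3, 1, 32))  # 32 kernels of size 3x3
feature_maps = conv2d_valid(image, kernels)
print(feature_maps.shape)  # (26, 26, 32)
```

The output depth equals the number of kernels, and the spatial size shrinks by `kernel size - 1` along each axis because no padding is used.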

![conv2.jpeg](https://www.jeremyjordan.me/content/images/2017/07/Screen-Shot-2017-07-26-at-1.44.58-PM.png)

#### ConvNet architecture

ConvNets mainly use three types of layers: convolutional (CONV), pooling (POOL), and fully-connected (FC). The figure below shows a concrete example of a ConvNet architecture. The first layer (left) stores the raw image pixels, whereas the last layer (right) stores the class probabilities. The activation of each hidden layer along the processing path is shown as a column.

![convnet.jpeg](https://editor.analyticsvidhya.com/uploads/90650dnn2.jpeg)

# 1. Introduction to ConvNets

Let's take a practical look at a very simple ConvNet for MNIST digit classification, a task that you have already tackled with a fully-connected network.

#### Input layer

A ConvNet takes as input a tensor of shape `(image_height, image_width, image_channels)`. In our case, we configure our ConvNet to process inputs of size `(28, 28, 1)`, which is the format of MNIST images. We do this by passing the argument `input_shape=(28, 28, 1)` to our first layer.

In [None]:
input_dim = (28, 28, 1)

#### Convolutional layers

A ConvNet typically starts off with convolutional and pooling layers. In our case, we stack three convolutional layers, with a pooling layer after each of the first two.

In [None]:
from keras import layers
from keras import models

model = models.Sequential()
model.add( layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_dim) )
model.add( layers.MaxPooling2D((2, 2)) )
model.add( layers.Conv2D(64, (3, 3), activation='relu') )
model.add( layers.MaxPooling2D((2, 2)) )
model.add( layers.Conv2D(64, (3, 3), activation='relu') )
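
To see what the `MaxPooling2D((2, 2))` layers in the cell above do to a feature map, here is a minimal NumPy sketch. It assumes non-overlapping 2x2 windows (stride 2) and drops any leftover row or column, which is consistent with the output shapes in this notebook. The helper name `max_pool_2x2` and the toy input are made up for the example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width, channels) array."""
    h, w, c = feature_map.shape
    h_out, w_out = h // 2, w // 2  # an odd trailing row/column is dropped
    cropped = feature_map[:h_out * 2, :w_out * 2, :]
    # Group pixels into 2x2 blocks, then keep the maximum of each block.
    blocks = cropped.reshape(h_out, 2, w_out, 2, c)
    return blocks.max(axis=(1, 3))

# Toy 4x4 feature map with one channel, holding the values 0..15 row by row.
x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[:, :, 0])  # block maxima: 5 and 7 on the first row, 13 and 15 on the second

# An odd-sized input such as (11, 11, 64) is reduced to (5, 5, 64).
print(max_pool_2x2(np.zeros((11, 11, 64))).shape)  # (5, 5, 64)
```

Pooling has no trainable parameters: it only halves the spatial resolution while keeping the number of channels unchanged.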

Let's display the architecture of our ConvNet so far.

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


You can see above that the output of every `Conv2D` and `MaxPooling2D` layer is a 3D tensor of shape `(height, width, channels)`. The width and height dimensions tend to shrink as we go deeper in the network. The number of channels is controlled by the first argument passed to the `Conv2D` layers.
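
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes what the model definition implies: 3x3 kernels without padding and with stride 1, non-overlapping 2x2 pooling, and one bias per kernel. The helper functions are hypothetical names introduced for this illustration.

```python
def conv_output_size(size, kernel):
    """Spatial size after a convolution without padding, stride 1."""
    return size - kernel + 1

def pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling (leftovers are dropped)."""
    return size // pool

def conv_params(kernel, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

size, channels = 28, 1  # MNIST input: (28, 28, 1)
shapes, params = [], []
for kind, value in [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 64)]:
    if kind == "conv":
        params.append(conv_params(3, channels, value))
        size, channels = conv_output_size(size, 3), value
    else:
        size = pool_output_size(size, value)
    shapes.append((size, size, channels))

print(shapes)       # [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 64)]
print(params)       # [320, 18496, 36928]
print(sum(params))  # 55744
```

These values match the `Output Shape` and `Param #` columns printed by `model.summary()` above.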

#### Fully-connected layers

The next step is to feed our last layer's output, a tensor of shape `(3, 3, 64)`, into a fully-connected classifier. However, such a classifier processes 1D vectors, whereas our current output is a 3D tensor. So first, we have to flatten our 3D outputs to 1D, and then add a few `Dense` layers on top. We are going to do 10-way classification, so we use a final layer with 10 outputs and a `softmax` activation. 

In [None]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

Now here's what our network looks like:

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                3

As you can see, our `(3, 3, 64)` outputs were flattened into vectors of shape `(576,)`, before going through two `Dense` layers.
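
As a quick check of the flattening step and of the size of the two `Dense` layers, here is a small calculation in plain Python. It assumes each `Dense` layer has one weight per input-unit pair plus one bias per unit; the helper `dense_params` is a hypothetical name used only here.

```python
# Number of values in one (3, 3, 64) output once it is flattened.
flattened_size = 3 * 3 * 64

def dense_params(inputs, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units

hidden_params = dense_params(flattened_size, 64)  # Dense(64, activation='relu')
output_params = dense_params(64, 10)              # Dense(10, activation='softmax')

print(flattened_size)  # 576
print(hidden_params)   # 36928
print(output_params)   # 650
```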

#### Training

Now, let's train our ConvNet on the MNIST digits. We will reuse much of the code we have already covered in the previous MNIST example.

First, we load and preprocess the data. Remember that you must **ALWAYS normalize** your data.

In [None]:
from keras.datasets import mnist
from keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))
test_images  =  test_images.reshape((10000, 28, 28, 1))

train_images = train_images.astype('float32') / 255
test_images  =  test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels  = to_categorical( test_labels)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
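
The two preprocessing steps in the cell above, scaling pixel values to the range [0, 1] and one-hot encoding the labels, can be illustrated on a tiny made-up example. The `one_hot` helper below is a hypothetical stand-in that mimics what `to_categorical` produces for integer labels; unlike `to_categorical`, it takes the number of classes as an explicit argument.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Made-up data standing in for MNIST pixels and labels.
pixels = np.array([0, 128, 255], dtype="uint8")
labels = [5, 0, 4]

scaled = pixels.astype("float32") / 255  # values now lie in [0, 1]
encoded = one_hot(labels, num_classes=10)

print(scaled.min(), scaled.max())  # 0.0 1.0
print(encoded[0])                  # 1.0 at index 5, zeros elsewhere
```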


Second, we train the network using the cross-entropy loss function.
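
Before running the training cell, it may help to see what the cross-entropy loss measures for a single sample. The sketch below is a simplified plain-Python version for one one-hot target and one vector of predicted probabilities. It uses three made-up classes instead of ten and assumes all probabilities are strictly positive; library implementations usually guard against a probability of exactly zero.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs))

target = [0, 0, 1]                # the true class is the third one
confident = [0.05, 0.05, 0.90]    # made-up prediction, mostly correct
unsure = [0.40, 0.30, 0.30]       # made-up prediction, mostly wrong

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(round(loss_confident, 3))  # 0.105
print(round(loss_unsure, 3))     # 1.204
```

The loss is small when the network assigns a high probability to the correct class and grows as that probability drops. The cell below relies on Keras' built-in `categorical_crossentropy` rather than on this sketch.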

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f2fb3842208>

Let's evaluate the model on the test data:

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels)

print("Test accuracy:", test_acc*100, "%")

Test accuracy: 99.1100013256073 %


While our previous fully-connected network had a test accuracy of 97%, our basic ConvNet reaches a test accuracy of about 99.1%.

## ===== Exercise =====

Grab the relevant functions from TP 2, and test your convolutional network for handwritten digit recognition. *Do you see any difference in the performance?*

In [None]:
# ADD CODE HERE