In [6]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from tensorflow.keras.layers import BatchNormalization, Dropout, Activation, Input
from tensorflow.keras.layers import Conv2DTranspose, concatenate, UpSampling2D

from IPython.display import HTML
HTML("""
<style>
.column {
  float: left;
  width: 50%;
}

/* Clear floats after the columns */
.row:after {
  content: "";
  display: table;
  clear: both;
}</style>
""")

# Convolutional Neural Networks for Image Classification

## CNNs are used everywhere
<center>
    <img src="illustrations/vision.png" style="max-height:700px;width:auto;"/>
</center>

## CNN for image classification

### CNN = Convolutional Neural Networks (or ConvNets)

<img src="illustrations/LeNet.png"/>

<small>LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.</small>

## Outline of the lecture

 - Convolutions
 - Convolutions in Neural Networks
  - Motivations
  - Layers
 - Architectures
  - Classic CNN Architecture
  - AlexNet
  - VGG16
  - ResNet
 - Data augmentation
 - Grad-CAM ?

## Convolution
 - A mathematical operation that combines two functions to form a third function.
 - The feature map (or input data) and the kernel are combined to form a transformed feature map.
 - Often interpreted as a filter: the kernel filters the feature map for certain information (edges, etc.)
 
<center>
    <img src="illustrations/convolution-1.png" style="width:450px;"/>
    <small>Figure 1: Convolving an image with an edge detector kernel.</small>
</center>

The convolution of two functions $f$ and $x$, evaluated at $t$, is defined as:
<center>
    $y(t) = (f \otimes x)(t) = \int_{-\infty}^{\infty} f(k) \cdot x(t-k) \, \mathrm{d}k$
</center>

where the symbol ⊗ denotes convolution.

<small>https://developer.nvidia.com/discover/convolution</small>
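For sampled signals the integral becomes a sum, $y[t] = \sum_k f[k] \cdot x[t-k]$. The sketch below is a minimal NumPy illustration of that sum; the function name `conv1d` and the toy kernel and signal are assumptions made for this example, and the result is compared against `np.convolve`.

```python
import numpy as np

def conv1d(f, x):
    """Full discrete convolution: y[t] = sum_k f[k] * x[t - k]."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(f) + len(x) - 1)
    for t in range(len(y)):
        for k in range(len(f)):
            # Terms where t - k falls outside the signal contribute zero.
            if 0 <= t - k < len(x):
                y[t] += f[k] * x[t - k]
    return y

kernel = [1.0, 0.0, -1.0]            # difference kernel (responds to changes)
signal = [0.0, 0.0, 1.0, 1.0, 1.0]   # a step signal
y = conv1d(kernel, signal)
reference = np.convolve(kernel, signal)  # NumPy's built-in full convolution
```

The output is non-zero only around the places where the signal changes, which is the 1D analogue of the edge detector in Figure 1.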

## Convolution

Convolutional filters can be interpreted as feature detectors:
 - The input (feature map) is filtered for a certain feature (the kernel).
 - The output is large if the feature is detected in the image.

<center>
    <img src="illustrations/convolution-3.png" style="width:450px;"/>
    <small>The kernel can be interpreted as a feature detector: a detected feature results in large outputs (white), and the absence of the feature results in small outputs (black).</small>
</center>

## Convolution in a neural network
<center>
<img src="illustrations/numerical_no_padding_no_strides.gif" style="width:450px;"/>
</center>

- $x$ is a $3 \times 3$ chunk (yellow area) of the image (green array)

- Each output neuron is parametrized with the $3 \times 3$ weight matrix $w$ (small numbers)
 
The activation is obtained by sliding the $3 \times 3$ window over the image and computing, at each position:
<center>
$z(x) = \mathrm{relu}(\mathbf{w}^T x + b)$
</center>
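The sliding-window computation can be sketched in a few lines of NumPy. This is only an illustration: the function name `conv2d_relu`, the $5 \times 5$ image and the kernel values are assumptions chosen for the example, not the values shown in the animation. As in deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation.

```python
import numpy as np

def conv2d_relu(image, w, b=0.0):
    """Slide w over image (no padding, stride 1) and apply relu(w . x + b)."""
    H, W = image.shape
    F = w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            x = image[i:i + F, j:j + F]              # the F x F chunk of the image
            out[i, j] = max(0.0, np.sum(w * x) + b)  # relu of the weighted sum
    return out

# Illustrative values (assumed for this example).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
w = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)

activation = conv2d_relu(image, w, b=0.0)
```

A $5 \times 5$ input and a $3 \times 3$ kernel give a $3 \times 3$ activation map, since the window fits in three positions along each axis.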

## Motivations

Standard Dense Layer for an image input:
```Python
x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
x = Flatten()(x)
# shape of x is: (None, 640 x 480 x 3)
z = Dense(1000)(x)
```

$640 \times 480 \times 3 \times 1000 + 1000 \approx 922M$ parameters

No spatial organization of the input

Dense layers are never used directly on large images. The standard solution is to use <b>convolution layers</b>.
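To make the gap concrete, the short sketch below counts the parameters of the dense layer above and of a small convolution layer on the same input. The helper names and the choice of 64 filters of size $3 \times 3$ are assumptions made for illustration; the convolution count is $(F \times F \times C^i + 1) \times C^o$.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(kernel_size, in_channels, out_channels):
    """(F * F * C_in + 1) * C_out: one F x F x C_in kernel and one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

n_dense = dense_params(640 * 480 * 3, 1000)  # Dense(1000) on the flattened image
n_conv = conv_params(3, 3, 64)               # 64 filters of 3x3 (illustrative choice)

print(f"dense: {n_dense:,} parameters")
print(f"conv:  {n_conv:,} parameters")
```

The convolution layer's parameter count does not depend on the image size at all, only on the kernel size and the number of channels.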

## Motivations
### Local connectivity
 - A neuron depends only on a few local input neurons
 - Translation equivariance: the same feature is detected wherever it appears in the image

### Comparison to Fully connected
 - Parameter sharing: reduce overfitting
 - Make use of spatial structure: strong prior for vision!

### Animal Vision Analogy
 - <i>Hubel & Wiesel, RECEPTIVE FIELDS OF SINGLE NEURONES IN THE CAT'S STRIATE CORTEX (1959)</i>

## Channels

Colored image = tensor of shape (height, width, channels)

Convolutions are usually computed for each channel, and summed:

<center>
<img src="illustrations/convmap1_dims.svg" width="500px"/>
</center>

<center>$(k \star im^{color}) = \sum\limits_{c=0}^2 k^c \star im^c$</center>
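The per-channel sum can be sketched as follows. The helper `correlate2d_valid`, the random $6 \times 6$ image and the random kernel are assumptions made for this example; the point is only that one kernel slice is applied to each channel and the three results are added into a single feature map.

```python
import numpy as np

def correlate2d_valid(im, k):
    """Single-channel sliding window, no padding, stride 1."""
    H, W = im.shape
    F = k.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(im[i:i + F, j:j + F] * k)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))    # (height, width, channels)
kernel = rng.random((3, 3, 3))   # one 3x3 slice per input channel

# Apply each kernel slice to its channel, then sum over the channels.
per_channel = [correlate2d_valid(image[:, :, c], kernel[:, :, c]) for c in range(3)]
feature_map = sum(per_channel)
```

Three input channels therefore still produce one output map per kernel; the number of output channels is set by the number of kernels, not by the number of input channels.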

## Multiple convolutions

<center>
<img src="illustrations/convmap_dims.svg" width="500px"/>
</center>

 - Kernel size aka receptive field (usually 1, 3, 5, 7, 11)
 - Output dimension (no padding, stride 1): length - kernel_size + 1

## Strides

- Strides: increment step size for the convolution operator
- Reduces the size of the output map

<center>
          <img src="illustrations/no_padding_strides.gif" style="width: 260px;" />
</center>

<center><small>
Example with kernel size $3 \times 3$ and a stride of $2$ (image in blue)
</small></center>
<br/><br/>
<small><i>Convolution visualization by V. Dumoulin https://github.com/vdumoulin/conv_arithmetic</i></small>

## Padding

- Padding: artificially fill borders of image
- Useful to keep spatial dimension constant across filters
- Useful with strides and large receptive fields
- Usually: fill with 0s

<center>
          <img src="illustrations/same_padding_no_strides.gif" style="width: 260px;" />
</center>
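Zero padding is straightforward to reproduce with `np.pad`. In this sketch the $5 \times 5$ image is an assumption made for illustration, and the padding $P = (F - 1) / 2$ is the amount that keeps the spatial size constant for an odd kernel size $F$ and a stride of 1.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # illustrative 5x5 image
F = 3                                             # kernel size
P = (F - 1) // 2                                  # padding on each side

# Surround the image with P rows/columns of zeros on every border.
padded = np.pad(image, P, mode="constant", constant_values=0)

# A 3x3 window sliding over the padded image with stride 1:
out_size = padded.shape[0] - F + 1

print(padded.shape, out_size)
```

The padded image is $7 \times 7$, so the output has the same $5 \times 5$ size as the original input.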

## Shapes of convolution layers


<div class="row">
    <div class="column">
        <br/><b>Kernel</b> or <b>Filter</b> shape $(F, F, C^i, C^o)$</div>
    <div class="column">
        <img src="illustrations/kernel.svg" style="width: 100px;"/>
    </div>
</div>

 - $F \times F$ kernel size,
 - $C^i$ input channels
 - $C^o$ output channels

Number of parameters: $(F \times F \times C^i + 1) \times C^o$

**Activations** or **Feature maps** shape:
- Input $(W^i, H^i, C^i)$
- Output $(W^o, H^o, C^o)$

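$W^o = (W^i - F + 2P) / S + 1$, where $P$ is the padding added on each side and $S$ is the stride (the same formula gives $H^o$ from $H^i$).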
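The output-size formula is easy to check numerically. The function below is a small sketch: its name is hypothetical, and the use of integer (floor) division for cases where the stride does not divide evenly is an assumption, although it matches what common frameworks do. The input sizes are chosen for illustration.

```python
def conv_output_size(w_in, kernel_size, padding=0, stride=1):
    """W_out = (W_in - F + 2P) / S + 1, rounded down when S does not divide evenly."""
    return (w_in - kernel_size + 2 * padding) // stride + 1

no_padding = conv_output_size(5, 3)                       # plain 3x3 kernel on a 5x5 input
with_stride = conv_output_size(5, 3, stride=2)            # stride 2 shrinks the map
same_padding = conv_output_size(5, 3, padding=1)          # P = 1 keeps the size
same_5x5 = conv_output_size(28, 5, padding=2, stride=1)   # 5x5 kernel, P = 2

print(no_padding, with_stride, same_padding, same_5x5)
```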


In [23]:
hide_code_in_slideshow()
from IPython.display import IFrame
IFrame('https://cs231n.github.io/assets/conv-demo/index.html', width="100%", height=700)

$W^o = (W^i - F + 2P) / S + 1$

## Pooling

 - Spatial dimension reduction
 - Local invariance
 - No parameters: max or average of 2x2 units

<br/><br/>

<center>
          <img src="illustrations/max-pooling.png" style="width: 560px;" />
</center>

## Pooling

- Spatial dimension reduction
- Local invariance
- No parameters: max or average of 2x2 units

<center>
          <img src="illustrations/maxpool.svg" style="width: 380px;" />
</center>
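A $2 \times 2$ max pooling with stride 2 can be written compactly with a reshape. This is a minimal NumPy sketch; the function name and the input values are assumptions made for the example.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single (H, W) feature map."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]   # drop a trailing row/column if the size is odd
    # Split each axis into (blocks, 2) and take the max inside every 2x2 block.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool_2x2(x)
```

The $4 \times 4$ input becomes a $2 \times 2$ output, and no parameter is learned. Replacing `max` with `mean` gives average pooling.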

## In Keras

#### Fully Connected Network: Multilayer Perceptron

```Python
input_image = Input(shape=(28, 28, 1))
x = Flatten()(input_image)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
mlp = Model(inputs=input_image, outputs=x)
```

## In Keras

#### Convolutional Network

```Python
input_image = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=5, padding='same', activation='relu')(input_image)
x = MaxPooling2D(2, strides=2)(x)
x = Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
```

**2D spatial organization of features preserved until Flatten.**
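As a sanity check, the parameter counts of the two Keras models can be tallied by hand. The sketch below uses hypothetical helper functions and assumes the layer settings shown in the two code cells: `'same'` padding keeps the $28 \times 28$ size and each $2 \times 2$ pooling halves it.

```python
def conv_params(kernel_size, c_in, c_out):
    """(F * F * C_in + 1) * C_out"""
    return (kernel_size * kernel_size * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus biases."""
    return n_in * n_out + n_out

# MLP: Flatten(28 * 28 * 1) -> Dense(256) -> Dense(10)
mlp_total = dense_params(28 * 28 * 1, 256) + dense_params(256, 10)

# ConvNet
conv1 = conv_params(5, 1, 32)     # output 28x28x32, pooled to 14x14x32
conv2 = conv_params(3, 32, 64)    # output 14x14x64, pooled to 7x7x64
flat = 7 * 7 * 64                 # size of the vector produced by Flatten
dense1 = dense_params(flat, 256)
dense2 = dense_params(256, 10)
convnet_total = conv1 + conv2 + dense1 + dense2

print(f"MLP:     {mlp_total:,}")
print(f"ConvNet: {convnet_total:,} (of which {dense1:,} in the first Dense layer)")
```

These totals should match what `mlp.summary()` and `convnet.summary()` report. Note that the two convolution layers hold only a small fraction of the ConvNet's parameters; most of them sit in the first Dense layer after Flatten.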

<br/><br/><br/><br/><br/>
<center>
    <h1>Architectures</h1>
</center>

## Classic ConvNet Architecture

### Input

### Conv blocks

- Convolution + activation (relu)
- Convolution + activation (relu)
- ...
- Maxpooling 2x2

### Output

- Fully connected layers
- Softmax

## AlexNet

<center>
          <img src="illustrations/alexNet.jpg" style="width: 800px;" />
</center>

<small>
Simplified version of Krizhevsky, A., Sutskever, I., and Hinton, G. E. "Imagenet classification with deep convolutional neural networks." NIPS 2012
</small>

Input: 227x227x3 image<br/>
First conv layer: kernel 11x11x3x96, stride 4

- Kernel shape: `(11,11,3,96)`
- Output shape: `(55,55,96)`
- Number of parameters: `34,944`
- Equivalent MLP parameters: `44.9 x 1e9` (`43.7 x 1e9` if the input is taken as 224x224x3)
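The figures for the first AlexNet layer can be re-derived with a few lines of arithmetic. This is a sketch with variable names chosen for the example; the "equivalent MLP" is assumed to be a dense layer connecting every input value to every output value, counting weights only.

```python
F, stride, c_in, c_out = 11, 4, 3, 96
w_in = 227

# Output size without padding: (W_in - F) / S + 1
w_out = (w_in - F) // stride + 1

# Convolution parameters: (F * F * C_in + 1) * C_out
conv_params = (F * F * c_in + 1) * c_out

# Dense layer mapping all inputs to all outputs (weights only)
n_out = w_out * w_out * c_out
dense_weights_227 = (227 * 227 * c_in) * n_out
dense_weights_224 = (224 * 224 * c_in) * n_out   # with a 224x224x3 input instead

print(w_out, conv_params)
print(f"{dense_weights_227 / 1e9:.1f}e9, {dense_weights_224 / 1e9:.1f}e9")
```

The dense equivalent needs over a million times more parameters than the convolution layer for the same input and output shapes.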

In [None]:
# TODO : Add other networks

## Benchmarks
<center>
          <img src="illustrations/accuracy_vs_speed.png" style="width: 50%;" />
</center>
<small>
    Top-1 accuracy vs. number of images processed per second (with batch size 1) using the Titan Xp <i>(S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative deep neural network architectures,” IEEE Access, vol. 6, 2018.)</i>
</small>