In [1]:
# Setup for Keras

# Common imports
import numpy as np
import os
import pandas as pd
import sklearn

# Keras reads KERAS_BACKEND at import time, so set it before importing keras
os.environ["KERAS_BACKEND"] = "tensorflow"
#os.environ["KERAS_BACKEND"] = "torch"

import tensorflow as tf
import keras #requirement: keras 3

# to make this notebook's output stable across runs
np.random.seed(42)
keras.utils.set_random_seed(42)

print(tf.__version__) #requirement: >= 2.15

# Where to save the models
PROJECT_ROOT_DIR = "."
MODEL_PATH = os.path.join(PROJECT_ROOT_DIR, "models")
os.makedirs(MODEL_PATH, exist_ok=True)


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Utility functions to plot grayscale and RGB
def plot_image(image):
    plt.imshow(image, cmap="gray", interpolation="nearest")
    plt.axis("off")

def plot_color_image(image):
    plt.imshow(image, interpolation="nearest")
    plt.axis("off")



2.16.1


# 3. DL for Images: CNNs/ConvNets

In 2009, the ImageNet database was published. By now, it contains more than 14 million hand-annotated images in over 20,000 categories.

<img src="../../assets/Image_ImageNet.jpg" alt="Image_ImageNet.jpg" style="width: 500px" title="ImageNet"/><br></br>





## 3.1 CNN/ConvNet Base Model

## Agenda

0. Data Augmentation for images
1. Introduction: From MLPs to CNNs, and from classical filters to trained ones
2. Convolution to extract image features
3. Variants of convolution: 
   1. Same convolution
   2. Convolution with stride
   3. Dilated convolution
   4. Deconvolution = Transpose convolution
   5. 3d- and 1d- convolution
4. Pooling Layers: maxpool, avgpool
5. Architecture of a CNN
6. Case Study: AlexNet

### 3.1.0 Data Augmentation (esp. for images)

**Data augmentation** (Krizhevsky et al., 2012) = Increasing the amount and variety of the training data by varying the given training instances to produce new ones (that are still viable). In particular for images: flip, crop, rotate, resize, shift, tint, change contrast:

<br></br><img src="../../assets/Image_050_Image_Augmentation.png" alt="Data Augmentation" width="400" title = "Hands-on-ML"/>   <br></br>

<br></br><img src="../../assets/Image_Augmentation_Frog_Original.png" alt="Data Augmentation Frog Original" width="200" title = "UVA DL Course"/>   <img src="../../assets/Image_Augmentation_Frog_Flip.png" alt="Data Augmentation Frog Flip" width="200" title = "UVA DL Course"/>  
<br></br><img src="../../assets/Image_Augmentation_Frog_RandomCrop.png" alt="Data Augmentation Frog Crop" width="200" title = "UVA DL Course"/>  <img src="../../assets/Image_Augmentation_Frog_Tint.png" alt="Data Augmentation Frog Tint" width="200" title = "UVA DL Course"/>   <br></br>
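
A minimal NumPy sketch of three of these augmentations (flip, random crop, tint). The helper name `augment`, the 90% crop ratio and the tint range are assumptions chosen for illustration; in practice one would typically rely on the augmentation utilities of the deep learning framework.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip, crop and tint an (H, W, 3) float image with values in [0, 1]."""
    out = image
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # random crop to 90% of the height and width (assumed ratio)
    h, w, _ = out.shape
    crop_h, crop_w = int(0.9 * h), int(0.9 * w)
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    out = out[top:top + crop_h, left:left + crop_w, :]
    # tint: scale each colour channel by a random factor close to 1
    tint = rng.uniform(0.9, 1.1, size=(1, 1, 3))
    return np.clip(out * tint, 0.0, 1.0)

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))   # random stand-in for a real photo
augmented = augment(image, rng)
print(image.shape, "->", augmented.shape)
```

Each call produces a different variant of the same training instance, so the label can be reused unchanged.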

### 3.1.1. Introduction

**The "basic" tasks for image data**

**Image Classification:**

<img src="../../assets/Image_Image_Classification_Task.jpg" alt="Image_Image_Classification_Task.jpg" style="width: 500px" title = "UDL"/>   <br></br>

**Regression Tasks like Depth Estimation:**

<img src="../../assets/Image_Depth_Estimation_Task.jpg" alt="Image_Depth_Estimation_Task.jpg" style="width: 500px" title = "UDL"/>   <br></br>


**What makes images different?**

<img src="../../assets/Image_Tiger_in_Water.jpg" alt="Image_Tiger_in_Water.jpg" style="width: 500px" title="https://www.wallpaperflare.com/tiger-pc-backgrounds-hd-water-animal-animal-themes-animal-wildlife-wallpaper-qmanr/download"/>

- huge **input dimensionality**: a Full HD RGB image has $1920\times 1080\times 3=6{,}220{,}800$ input variables!
- comparison: a single fully connected layer with only 1000 neurons would already need about $6.2$ billion weights $\Rightarrow$ a normal MLP would have far too many parameters to be trainable!
- variations can change the input vectors significantly without changing the meaning of a picture:

**Depth and point of view**

<img src="../../assets/Image_Tiger_in_Water_Depth.png" alt="Image_Tiger_in_Water_Depth.png" style="width: 500px" title="https://www.wallpaperflare.com/tiger-pc-backgrounds-hd-water-animal-animal-themes-animal-wildlife-wallpaper-qmanr/download"/>

An image depicts only two dimensions, although its content is 3-dimensional! Small changes in the point of view completely change the pixel values.

**Shift/Translation**

<img src="../../assets/Image_Tiger_in_Water_Shifted.png" alt="Image_Tiger_in_Water_Shifted.png" style="width: 500px" title="https://www.wallpaperflare.com/tiger-pc-backgrounds-hd-water-animal-animal-themes-animal-wildlife-wallpaper-qmanr/download"/>

This image has been shifted slightly! The first $5\times 5$ values of the original image (left) and of the shifted image (right) are quite different!
<br></br><img src="../../assets/Image_Tiger_in_Water_first_5times5.png" alt="Image_Tiger_in_Water_first_5times5.png" style="height: 65px" /><img src="../../assets/Image_Tiger_in_Water_Shifted_first_5times5.png" alt="Image_Tiger_in_Water_Shifted_first_5times5.png" style="height: 65px" />
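
The effect of a shift on the raw input vector can be reproduced with a few lines of NumPy. This is only a sketch: the random noise image is an assumption standing in for a textured photo, and `np.roll` wraps pixels around the border instead of cutting them off.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # stand-in image

# shift the image by one pixel to the right
shifted = np.roll(image, shift=1, axis=1)

# fraction of positions whose pixel value differs after the shift
changed = np.mean(image != shifted)
print(f"fraction of changed pixel values: {changed:.2f}")
```

Although the content is the same, almost every entry of the input vector changes, which is why an MLP working on raw pixel positions struggles with translations.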

**Size of pixel values:** Small visual changes (even if invisible to the naked eye), e.g. in colour temperature, colour tone or colour saturation, change the pixel values considerably!


**What a NN for images should be able to deal with:**

1. spatial structure information + translation invariance
2. huge input dimensionalities (i.e. reduce number of parameters)
3. local variations (e.g. some "bad" pixels on an image)


**From MLPs to NNs for Images**

Data for MLPs were 1-d vectors and each layer a 1-d vector of neurons.  

Black-and-white images are matrices, i.e. 2-d objects = 2-d layers of neurons:

<img src="../../assets/Image_Neurons.png" width="250" title = "Murphy"/>   <br></br>

RGB images are 3 stacked matrices, i.e. a 3-d object, a so-called **3d-tensor**. The stacked matrices in a 3d-tensor are called **channels**: 

<img src="../../assets/Image_Neurons_channels.png" width="250" title = "Murphy"/>   <br></br>



**Features for Images**

- For tabular data, a **feature** is one (meaningful) input dimension
- For images, a **feature** isn't one pixel value, since it doesn't contain meaning, but: a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may be specific structures in the image such as points, edges or objects. 

**Aim:** Extract these features with a NN. 


A NN for images doesn't consist of functions between 1d-layers of neurons (left), but 3d-layers of neurons (right): 


<img src="../../assets/MLP_vs_CNN.jpg" width="500" title = "Murphy"/>   <br></br>

What we need to determine is what the function $f$ should look like (it shouldn't contain too many parameters!).


You can think of image features as things you can detect with some sort of template by sliding it across all positions of an image. $\rightarrow$ **filters or kernels**.

<img src="../../assets/Image_Grumpy_Cat_Ears.png" alt="Image_Grumpy_Cat_Ears.png" style="width: 200px"/><img src="../../assets/Image_Grumpy_Cat_Head.png" alt="Image_Grumpy_Cat_Head.png" style="width: 200px"/><img src="../../assets/Image_Grumpy_Cat_Texture.png" alt="Image_Grumpy_Cat_Texture.png" style="width: 200px"/>

**Classical Filters**

You already know **filters or kernels** from image processing, e.g.
- Sobel filters extracting edges
- Gaussian filter for blurring and noise reduction
  
**Question:** How are these filters applied to an image?

See this animation for an example: https://en.wikipedia.org/wiki/Kernel_(image_processing)#/media/File:2D_Convolution_Animation.gif
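
As a small illustration of how such a filter is applied, the sketch below slides a horizontal Sobel kernel over a synthetic image with one vertical edge. The test image is an assumption, and the kernel is applied without flipping it (as CNN layers do), which differs from the strict mathematical definition of convolution only by the sign of the response for this kernel.

```python
import numpy as np

# Sobel kernel that responds to vertical edges (horizontal intensity changes)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# synthetic 5x6 image: dark left half, bright right half
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# slide the 3x3 kernel over every position where it fits completely
out_h = image.shape[0] - 3 + 1
out_w = image.shape[1] - 3 + 1
response = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

print(response)
```

The response is non-zero only for the windows that contain the edge and zero in the flat regions.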

**From Classical Kernels to NNs**

**Question:** How should we choose the filters to extract features in a NN?

**Idea:** the map in an Image NN could be convolution with kernels/filters: 

<img src="../../assets/Convolution_Layer_Units.jpg" width="350" title = "Murphy"/>   <br></br>


### 3.1.2. Convolution (Conv2D) to extract image features

**Convolution (Conv2D) with one channel (e.g. BW):**
Let $\mathbf{W}$ be a filter/kernel of size $H_W\times B_W$, and $\mathbf{X}\in \mathbb{R}^{H\times B}$ an image matrix. Then **the convolution of $\mathbf{X}$ with $\mathbf{W}$** is defined as the $(?)\times (?)$-matrix $\mathbf{X}\circledast\mathbf{W}$ with $i,j$-th entry (=similarity measure for the section of the image whose left upper corner is at $x_{ij}$):
$$(\mathbf{X}\circledast\mathbf{W})_{ij}=\sum_{h=0}^{H_W-1}\sum_{b=0}^{B_W-1}x_{i+h,j+b}\cdot w_{h,b}$$

**Example:**

<img src="../../assets/Image_CNN_Convolution.png" alt="convolution" width="350" title = "Murphy"/>   <br></br>

**Question:** Compute $A\circledast F, B\circledast F$ for 
$F=\left(\begin{array}{cc}
1&0\\
0&1
\end{array}\right), A= \left(\begin{array}{cc}
1&0\\
0&1
\end{array}\right), B=\left(\begin{array}{cc}
0&1\\
1&0
\end{array}\right)$ 

**Question**: Calculate
$$\left(\begin{array}{ccc}
1&2&3\\
4&5&6\\
7&8&9
\end{array}\right)\circledast \left(\begin{array}{cc}
1&0\\
0&1
\end{array}\right)$$
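
The definition can be translated directly into NumPy to check such calculations by hand. This is a deliberately simple loop-based sketch (no padding, stride 1); the function name `conv2d` is chosen for illustration.

```python
import numpy as np

def conv2d(X, W):
    """Convolution of image matrix X with filter W as defined above (no padding, stride 1)."""
    H_X, B_X = X.shape
    H_W, B_W = W.shape
    out = np.zeros((H_X - H_W + 1, B_X - B_W + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the window with upper left corner (i, j) entrywise with W and sum up
            out[i, j] = np.sum(X[i:i + H_W, j:j + B_W] * W)
    return out

F = np.array([[1, 0], [0, 1]])
A = np.array([[1, 0], [0, 1]])
B = np.array([[0, 1], [1, 0]])
X = np.arange(1, 10).reshape(3, 3)

print(conv2d(A, F))   # large value: A matches the pattern in F
print(conv2d(B, F))   # zero: B does not match the pattern in F
print(conv2d(X, F))
```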

**Convolution with several input channels**

In general an image (e.g. RGB) consists of several matrices (channels) stacked on top of each other = $3$-dimensional tensor. 
<br></br><img src="../../assets/Image_Cat_Colour.jpg" alt="Image_Cat_Colour.jpg" style="width: 150px"/>
<img src="../../assets/Image_Cat_Red.png" alt="Image_Cat_Red.png" style="width: 150px"/><img src="../../assets/Image_Cat_Green.png" alt="Image_Cat_Green.png" style="width: 150px"/><img src="../../assets/Image_Cat_Blue.png" alt="Image_Cat_Blue.png" style="width: 150px"/><br></br>

One filter then also consists of the same number of channels $\mathbf{W}=(W_{ijc})$. 


**2D-Convolution for $C$ input channels, 1 kernel Conv2D(C, 1)**:
- for each image channel, there is one filter channel
- per channel, apply the above convolution
- sum up the resulting matrices for all channels.
- output: ONE matrix = one channel.

Also, one can allow a scalar bias term $\beta$ (written $\beta$ here because $b$ is already the column index; for several filters, the $d$-th bias is denoted $b_d$). Formula:
$$(\mathbf{X}\circledast\mathbf{W})_{ij}=\beta+ \sum_{c=1}^C\sum_{h=0}^{H_W-1}\sum_{b=0}^{B_W-1}x_{i+h,j+b, c}\cdot w_{h,b,c}$$

<img src="../../assets/conv_one_filter.jpg" style="width:300pt"/>

One kernel learns one feature $\rightarrow$ use several filters to learn several features in one step:

**2D-Convolution for $C$ input channels with $D$ filters/kernels Conv2D(C, D)**:
- for each kernel, apply Conv2D(C, 1) to get one output channel
- stack the channels on top of each other to get $D$ output channels

<img src="../../assets/convolution_input_C_output_D.jpg" style="width:370pt"/><img src="../../assets/Image_CNN_Convolution_channels_filters.png" alt="Convolution with several channels and filters" style="width:300pt" title = "Murphy"/>

**Question:** Calculate $\mathbf{X}\circledast\mathbf{W}$ for 
$$X_1=\left(\begin{array}{ccc}
1&2&3\\
4&5&6\\
7&8&9
\end{array}\right), X_2= \left(\begin{array}{ccc}
0&1&2\\
3&4&5\\
6&7&8
\end{array}\right), W_1=\left(\begin{array}{cc}
1&2\\
3&4
\end{array}\right), W_2=\left(\begin{array}{cc}
0&1\\
2&3
\end{array}\right)$$
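
A loop-based NumPy sketch of Conv2D(C, D), assuming that $X_1, X_2$ are the two channels of the input $\mathbf{X}$ and $W_1, W_2$ the two channels of a single filter $\mathbf{W}$. The function name `conv2d_layer` and the array layout (channels last, filters in the fourth axis) are choices made for this illustration.

```python
import numpy as np

def conv2d_layer(X, W, bias=None):
    """X: (H, B, C) input, W: (f_h, f_w, C, D) filters, bias: (D,) or None -> (H', B', D) output."""
    H, B, C = X.shape
    f_h, f_w, C_w, D = W.shape
    assert C == C_w, "filter and input must have the same number of channels"
    if bias is None:
        bias = np.zeros(D)
    out = np.zeros((H - f_h + 1, B - f_w + 1, D))
    for d in range(D):                      # one output channel per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                window = X[i:i + f_h, j:j + f_w, :]
                # sum over height, width AND all input channels
                out[i, j, d] = bias[d] + np.sum(window * W[:, :, :, d])
    return out

X1 = np.arange(1, 10).reshape(3, 3)
X2 = np.arange(0, 9).reshape(3, 3)
W1 = np.array([[1, 2], [3, 4]])
W2 = np.array([[0, 1], [2, 3]])

X = np.stack([X1, X2], axis=-1)                    # shape (3, 3, 2)
W = np.stack([W1, W2], axis=-1)[..., np.newaxis]   # shape (2, 2, 2, 1): one filter

Y = conv2d_layer(X, W)
print(Y[:, :, 0])
```

With $D$ filters, `W` simply gets $D$ entries along its last axis and the output gets $D$ channels.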



**Convolution Layer: Dimensions** ($C_{in}$, $C_{out}$, `filter_size`=$f$):

|<div style="width:200px">input</div>|<div style="width:200px">parameters</div>|<div style="width:200px">output</div>|
|:---|:---|:-------------|
|$(H,B,C_{in})$-tensor<br><br>|- $C_{out}$ learnable filter tensors $W^{(d)}$ of size $f\times f\times C_{in}$ <br>- (possibly) $C_{out}$ bias terms $b_1,\ldots, b_{C_{out}}$<br><br>|$$(H-f+1)\times (B-f+1)\times C_{out}-\text{tensor}$$ <br> d-th channel: $b_d+\mathbf{X}\circledast W^{(d)}$|


<br> <img src="../../assets/convolution_input_C_output_D.jpg" style="width:300pt"/>


**Question:** one output channel, three input channels, filter_size=7

<img src="../../assets/Image_Weights_for_Neuron_CNN_Gavves.png" alt="Gavves, UVA Deep Learning Course" style="width: 400px"/>


**Question:** 3 input channels, five output channels i.e. filters, filter_size= 7

<img src="../../assets/Image_Weights_CNN_Gavves_more_filters.png" alt="Gavves, UVA Deep Learning Course" style="width: 500px"/>
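
The dimension table above can be turned into a small helper for checking output shapes and parameter counts. The $32\times 32$ input size and the function names are assumptions made for this sketch; the fully connected layer is included only for comparison.

```python
def conv_layer_summary(H, B, C_in, C_out, f, use_bias=True):
    """Output shape and number of learnable parameters of a convolution layer (no padding, stride 1)."""
    output_shape = (H - f + 1, B - f + 1, C_out)
    weights_per_filter = f * f * C_in
    n_params = C_out * (weights_per_filter + (1 if use_bias else 0))
    return output_shape, n_params

def dense_layer_params(n_inputs, n_units, use_bias=True):
    """Number of learnable parameters of a fully connected layer."""
    return n_units * (n_inputs + (1 if use_bias else 0))

# one filter of size 7 on 3 input channels
print(conv_layer_summary(32, 32, C_in=3, C_out=1, f=7))
# five filters of size 7 on 3 input channels
print(conv_layer_summary(32, 32, C_in=3, C_out=5, f=7))
# for comparison: dense layer with 5 units on the flattened 32x32x3 image
print(dense_layer_params(32 * 32 * 3, 5))
```

Note that the number of parameters of the convolution layer does not depend on the image size $H\times B$, whereas the dense layer grows with it.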


### 3.1.3. Variants of Convolution

**Padding and Same convolution: To keep size equal**

If you stack several convolution layers, the output gets smaller and smaller. 
<br></br><img src="../../assets/Image_CNN_Images_get_smaller_Gavves.png" alt="Gavves, UVA Deep Learning Course" style="width: 500px"/><br></br>


If you don't want that, use **zero padding**: Surround the image with 0's and apply convolution to this bigger matrix. 

If you add $p_H$ rows at top and bottom and $p_B$ columns to the sides, you get an output of size 

$(?)\times (?)$. 

If $p_H=(H_W-1)/2$ and $p_B=(B_W-1)/2$ (which requires odd filter sizes) $\Rightarrow$ output-size = input-size. This is called **Same Convolution**.

<img src="../../assets/Image_CNN_Padding.png" alt="convolution" style="width: 400px" title = "Murphy"/>   

**Question:**
Apply same convolution to the matrix

$\left(\begin{array}{cccc}
0&1&0&0\\
1&1&1&0\\
0&1&0&0\\
0&0&0&0
\end{array}\right)$
with the filter 
$\left(\begin{array}{ccc}
0&1&0\\
1&1&1\\
0&1&0
\end{array}\right)$
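
A sketch of same convolution in NumPy, using `np.pad` for the zero padding. It assumes odd filter sizes and stride 1; the function name `same_conv2d` is chosen for illustration.

```python
import numpy as np

def same_conv2d(X, W):
    """Same convolution: zero-pad X so that the output has the same shape as X (odd filter sizes)."""
    H_W, B_W = W.shape
    p_H, p_B = (H_W - 1) // 2, (B_W - 1) // 2
    # surround the image with p_H rows and p_B columns of zeros on each side
    X_pad = np.pad(X, ((p_H, p_H), (p_B, p_B)), mode="constant", constant_values=0)
    out = np.zeros(X.shape)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(X_pad[i:i + H_W, j:j + B_W] * W)
    return out

X = np.array([[0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
W = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])

print(same_conv2d(X, W))
```

The largest output value appears where the cross in the image and the cross in the filter coincide.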


**Convolution with Stride: to reduce size more quickly**

**Strided Convolution** with **stride** $s$ is convolution applied to every $s$th window. 
Output size: ?

**Attention:** always choose $s$ so the above dimensions are positive integers!

<img src="../../assets/Image_CNN_Stride.png" alt="convolution" style="width: 400px" title = "Murphy"/>  
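
A sketch of strided convolution together with a helper for the output size along one axis. The helper uses the floor convention $\lfloor (n+2p-f)/s\rfloor+1$ that is common in deep learning frameworks; this convention and the function names are assumptions of this sketch.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis: input length n, filter size f, padding p per side, stride s."""
    return (n + 2 * p - f) // s + 1

def strided_conv2d(X, W, s):
    """Convolution without padding that only evaluates every s-th window."""
    H_W, B_W = W.shape
    out_h = conv_output_size(X.shape[0], H_W, s=s)
    out_w = conv_output_size(X.shape[1], B_W, s=s)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the window for output entry (i, j) starts at (i * s, j * s)
            out[i, j] = np.sum(X[i * s:i * s + H_W, j * s:j * s + B_W] * W)
    return out

X = np.arange(25).reshape(5, 5)
W = np.ones((3, 3))
print(conv_output_size(5, 3, s=2))
print(strided_conv2d(X, W, s=2))
```

If $(n+2p-f)$ is not divisible by $s$, the last rows or columns of the input are simply never covered by a window, which is why the stride should be chosen with care.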

**Dilated convolution: to enlarge the receptive field without additional parameters**

Often, the values of two neighboring pixels are quite similar. Idea: let the filter look not at every pixel, but only at every second or third pixel.

**Dilated convolution with dilation factor $d$** only considers every $d$th input element when performing convolution (equivalent to convolution with a zero-filled filter of size $d\cdot(f-1)+1$, where $f$ is the actual filter size).

<img src="../../assets/Image_Dilation_Murphy.png" alt="dilated convolution" style="width: 500px" title = "Murphy"/>  
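
One way to see the effect of dilation is to build the equivalent zero-filled filter explicitly: for a filter of size $f$ and dilation factor $d$ it has size $d\cdot(f-1)+1$. The helper name `dilate_filter` is an assumption of this sketch; frameworks implement dilation without materialising the zeros.

```python
import numpy as np

def dilate_filter(W, d):
    """Insert d - 1 zeros between neighbouring filter entries."""
    f_h, f_w = W.shape
    W_dilated = np.zeros((d * (f_h - 1) + 1, d * (f_w - 1) + 1))
    W_dilated[::d, ::d] = W   # original weights land on every d-th position
    return W_dilated

W = np.arange(1, 10).reshape(3, 3)
print(dilate_filter(W, d=2))
```

The dilated $3\times 3$ filter covers a $5\times 5$ region of the input but still has only 9 learnable weights.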

**Transpose convolution = Deconvolution: to make input bigger**

**Deconvolution** is the "opposite" of convolution. Multiply each matrix entry with the entire filter and put the result in a window of a bigger matrix, fill the rest up with 0's. Add up all the resulting matrices:

<img src="../../assets/Image_CNN_Transpose_Convolution.png" alt="convolution" style="width: 500px" title = "Murphy"/>   

```python
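import numpy as np

def trans_conv(X, F):
    """Transpose convolution of matrix X with filter F (stride 1, no padding)."""
    h, w = F.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # add the filter, scaled by X[i, j], into the window starting at (i, j)
            Y[i:i + h, j:j + w] += X[i, j] * F
    return Y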
```

**3d- and 1d-convolution**

The method of convolution for images extends naturally to tensors of any dimension:
- Input = $C$ channels of an $n$-dimensional tensor
- Filter = $C$ channels of an $n$-dimensional tensor (e.g. in 1-dim: a vector, in 3-dim: a cube)
- Let the filter slide over all positions of the input, multiply the entries "on top of" each other and sum all of them up.

E.g. **1D convolution:**

<br></br><img src="../../assets/Image_CNN_Convolution_1d.png" alt="1d Convolution" style="width: 500px" title = "Murphy"/>   <br></br>

3D convolution can for example be used on 3D-data like LIDAR point clouds in autonomous driving or CAD-models...
However: 3D convolution is computationally expensive!

**Stacking Convolution Layers: Receptive Field**

**Definition:** The set of input pixels that can influence one neuron inside a convolutional NN (its pre-image) is called its **receptive field**.

**Question:** Depict what happens if we stack 3 $2\times 2$-filters on top of each other vs using one $4\times 4$-filter. What's the number of parameters in each case?

<br></br><img src="../../assets/Image_Convolution_Receptive_Field_Depth.jpg" alt="Image_Convolution_Receptive_Field_Depth.jpg" style="width: 400px"/><br></br>
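
The comparison can be checked with a few lines of plain Python. The sketch assumes stride 1, a single channel per layer and no bias terms; the function names are chosen for illustration.

```python
def receptive_field(filter_sizes):
    """Side length of the receptive field after stacking stride-1 convolutions with the given filter sizes."""
    size = 1
    for f in filter_sizes:
        size += f - 1   # each layer widens the field by f - 1 pixels
    return size

def n_weights(filter_sizes):
    """Number of weights for single-channel filters without bias."""
    return sum(f * f for f in filter_sizes)

print(receptive_field([2, 2, 2]), n_weights([2, 2, 2]))   # three stacked 2x2 filters
print(receptive_field([4]), n_weights([4]))               # one 4x4 filter
```

Both variants see the same $4\times 4$ region of the input, but the stacked version needs fewer weights and can apply a non-linearity after each layer.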


**Convolutional Layers in Code**

#### with Keras
The option `filters` is the number of output channels: `keras.layers.Conv2D(filters=2, kernel_size=7)`. 

The output is a 4D tensor (batch size, height, width, channels). 

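#### with PyTorch

Like all layers, you can access 2D convolution layers via `torch.nn` (usually imported as `nn`): `nn.Conv2d(c_in, c_out, kernel_size=kernel_size)`.
1D and 3D convolution layers work similarly (`nn.Conv1d`, `nn.Conv3d`). Unlike Keras, PyTorch expects the input as (batch size, channels, height, width).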

### 3.1.4. Pooling Layers 

Aim: reduce the size of the feature maps (and thus the computation and the parameters of later layers), local translation invariance

- Two adjacent windows overlap $\Rightarrow$ similar values $\Rightarrow$ don't keep all, but aggregate e.g. a 2x2 window 
- ways of aggregation: average or maximum.
- aggregating makes the NN forget the exact positions (translation inv.)

**Definition:** A **max-pooling layer** (resp. **avg-pooling layer**) with `filter_size` $a\in \mathbb{N}$ divides each channel of the input tensor into $a\times a$-windows and aggregates the values in each such window to the maximum value (resp. the average). 

**Examples:** 

<img src="../../assets/Image_Maxpool2.jpg" alt="Image_Maxpool2.jpg" style="width: 400px"/>



<br></br><img src="../../assets/Image_CNN_MaxPooling.png" alt="convolution" style="width: 400px" title = "Murphy"/>  

**Good Practice:** In practice, max pooling often works better than average pooling.

**Global Average Pooling**: replace an entire channel by one single neuron by computing the average of all entries in the channel. 
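
For a single channel whose side lengths are divisible by the window size, pooling can be sketched in NumPy with a reshape. The function name `pool2d` and the non-overlapping windows (stride = window size) are assumptions of this sketch.

```python
import numpy as np

def pool2d(X, a, mode="max"):
    """Pool one channel X of shape (H, W) over non-overlapping a x a windows (H, W divisible by a)."""
    H, W = X.shape
    # axes 1 and 3 index the position inside each a x a window
    windows = X.reshape(H // a, a, W // a, a)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

X = np.arange(16).reshape(4, 4)
print(pool2d(X, 2, mode="max"))
print(pool2d(X, 2, mode="avg"))
print(X.mean())   # global average pooling: one value for the whole channel
```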

In [2]:
# Keras:
max_pool = keras.layers.MaxPool2D(pool_size=2)
avg_pool = keras.layers.AvgPool2D(pool_size=2)
global_avg_pool = keras.layers.GlobalAvgPool2D()

In [None]:
# PyTorch:
import torch

kernel_size = 2
max_pool = torch.nn.MaxPool2d(kernel_size)
avg_pool = torch.nn.AvgPool2d(kernel_size)
# global average pooling: average each channel down to a single value
global_avg_pool = torch.nn.AdaptiveAvgPool2d(output_size=1)

### 3.1.5. Architecture of Convolutional Neural Networks


<img src="../../assets/Image_CNN_Architecture_Simple.jpg" alt="Image_CNN_Architecture_Simple.jpg" style="width: 500px"/><br></br>

Similarly, one could define a **Deconvolution Network (Deconv Net)** with deconvolution instead of convolution. 




**Good practices for the architecture:** 
- use several convolution layers one after another to increase the receptive field
- filter size: $< 11$; modern architectures often use $3$ (fewer parameters) but stack several layers (see receptive field)
- then a non-linearity; most popular: ReLU
- then pooling; most popular: max pooling



How do CNNs address the three initial requirements?
1. spatial structure information + translation invariance: ?
2. huge input dimensionalities/many parameters: ?
3. local variations (e.g. some "bad" pixels): ?

In reality the architectures are more complicated and far deeper. We will see examples in the Computer Vision Section later on. 

**Example: CNNs in Practice**

Let's train a Convolutional Neural Network (CNN) in Keras on the Fashion-MNIST dataset:

In [5]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

X_mean = X_train.mean(axis=0, keepdims=True)
X_std = X_train.std(axis=0, keepdims=True) + 1e-7
X_train = (X_train - X_mean) / X_std
X_valid = (X_valid - X_mean) / X_std
X_test = (X_test - X_mean) / X_std

X_train = X_train[..., np.newaxis]
X_valid = X_valid[..., np.newaxis]
X_test = X_test[..., np.newaxis]

In [None]:
# you do not need partial, but if you use the same options in many layers, it can help save time
# partial simply "freezes" some of the options in a function so you don't have to repeat yourself
from functools import partial

#the following is like creating a keras.layers.Conv2D layer with fixed options 
# kernel_size=3, activation='relu', padding="SAME"

DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=3, activation='relu', padding="SAME")

model = keras.models.Sequential([
    keras.Input(shape=(28, 28, 1)),  # Keras 3 prefers an explicit Input over input_shape=...
    DefaultConv2D(filters=64, kernel_size=7),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation='softmax'),
])
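
To see how the feature maps shrink and where the parameters sit, the layer list of this model can be traced in plain Python. This is only a sketch with a hypothetical helper `trace_cnn`; it assumes "same" padding and stride 1 for the convolutions and non-overlapping pooling windows, as in the model above. `model.summary()` gives the authoritative numbers.

```python
def trace_cnn(input_size, in_channels, layers):
    """layers: list of ("conv", filters, kernel_size) or ("pool", pool_size) tuples.
    Assumes 'same' padding and stride 1 for conv, and stride = pool_size for pooling."""
    size, channels, n_params = input_size, in_channels, 0
    for layer in layers:
        if layer[0] == "conv":
            _, filters, kernel_size = layer
            # weights per filter: kernel_size^2 * input channels, plus one bias
            n_params += filters * (kernel_size * kernel_size * channels + 1)
            channels = filters
        else:
            size = size // layer[1]
        print(layer, "->", (size, size, channels))
    return size, channels, n_params

layers = [("conv", 64, 7), ("pool", 2),
          ("conv", 128, 3), ("conv", 128, 3), ("pool", 2),
          ("conv", 256, 3), ("conv", 256, 3), ("pool", 2)]
size, channels, conv_params = trace_cnn(28, 1, layers)

# dense part: Flatten -> 128 -> 64 -> 10 (dropout has no parameters)
flat = size * size * channels
dense_params = (flat + 1) * 128 + (128 + 1) * 64 + (64 + 1) * 10
print("flattened features:", flat)
print("total parameters:", conv_params + dense_params)
```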

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
score = model.evaluate(X_test, y_test)
X_new = X_test[:10] # pretend we have new images
y_pred = model.predict(X_new)

### 3.1.6 Case Study: AlexNet

See [AlexNet](https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)


<br></br><img src="../../assets/Image_AlexNet.png" alt="AlexNet" style="width:500px"/><br></br>

<img src="../../assets/Image_AlexNet_Version_1.png" alt="AlexNet Version 1" style="width:500px"/><br></br>

<img src="../../assets/Image_AlexNet_Version_2.png" alt="AlexNet Version 2" style="width:500px"/><br></br>

<img src="../../assets/Image_AlexNet_Version_3.png" alt="AlexNet Version 3" style="width:500px"/><br></br>

<img src="../../assets/Image_AlexNet_Version_4.png" alt="AlexNet Version 4" style="width:500px"/><br></br>



<img src="../../assets/Image_AlexNet_Version_5.png" alt="AlexNet Version 5" style="width:500px"/><br></br>

**Homework:** Compute the dimensions and number of parameters for all layers in the following CNN: 
- input: 3 channels size 32x32
- 1st Layer: SAME Convolution, 8 kernels of size 3, stride 2  
- 2nd layer: Convolution, padding = 1, 16 kernels of size 3, stride = 2
- 3rd layer: maxpool of size 2x2
- 4th layer: global average pooling
