# Image Classification and CNN

_Authors: Martin Outzen Berild & Jakob Gerhard Martinussen_

![](https://miro.medium.com/max/5856/1*Hz6t-tokG1niaUfmcysusw.jpeg)

## Image classification before deep learning

1. Feature extraction:
    * **Naive:** Flatten $n \times m$ RGB image into $\mathcal{R}^{3nm}$ vector.
    * **Manual feature extraction:** Custom feature extraction implemented by the programmer based on domain knowledge.
2. Implement model based on features:
    * **Explicit algorithm:** Pattern matching, etc.
    * **Statistical model:** Support vector machine (SVM), etc.
    
Drawbacks:
* Non-robust.
* Time-consuming.
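
The naive feature extraction above can be sketched in a few lines of NumPy. This is only an illustration: the image size and the random pixel values are assumptions, not data from this notebook.

```python
import numpy as np

# Hypothetical 4x5 RGB image with random pixel intensities in [0, 255].
n, m = 4, 5
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(n, m, 3), dtype=np.uint8)

# Naive feature extraction: unroll all pixels and channels into one vector
# and scale the values to [0, 1].
features = image.reshape(-1).astype(np.float64) / 255.0

print(features.shape)  # one entry per pixel and channel, i.e. 3 * n * m
```

Note that the flattening discards which pixels were neighbours in the image, which is one reason a model built on such features tends to be non-robust.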

## Deep learning approach to image classification

![](https://miro.medium.com/max/693/1*zUATaXMAmKof27rPyBRWsg.png)

# Deep Convolutional Neural Networks Theory (ConvNets)

_Keywords: receptive field, input, kernel, zero-padding, strides, output._

![](https://miro.medium.com/max/2510/1*vkQ0hXDaQv57sALXAJquxA.jpeg)

## Vectors vs. Matrices

**Key difference** - Inputs to CNNs are 2-D arrays instead of linearly indexed vectors.

![](https://www.jeremyjordan.me/content/images/2017/07/Screen-Shot-2017-07-26-at-4.26.01-PM.png)
_Source: [jeremyjordan.me](https://www.jeremyjordan.me)_

## Receptive Fields

_Resource: [A Guide to Receptive Field Arithmetic for Convolutional Neural Networks](https://syncedreview.com/2017/05/11/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks/)_

<img src="https://miro.medium.com/max/2340/1*Fw-ehcNBR9byHtho-Rxbtw.gif" width=300>

In [None]:
from tensorflow.keras import layers

# Number of filters to learn
NUMBER_OF_FILTERS = 1

# Size of receptive field
KERNEL_SIZE = (3, 3)

convolutional_layer = layers.Conv2D(
    filters=NUMBER_OF_FILTERS,
    kernel_size=KERNEL_SIZE,
)

## Convolution

The process of moving the receptive field over the image, weighting each pixel in order to form new pixel values.

This results in a so-called _"feature map"_, and can be thought of as a filtered image.

<img src="https://machinethink.net/images/vggnet-convolutional-neural-network-iphone/ConvolutionKernel@2x.png" width=200>

### Notes

* A single bias value is also included.
* The value is passed through an activation function as well.
* The same kernel weights and bias value are used over the entire image (_weight/parameter sharing_).
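
The notes above can be sketched in plain NumPy. The function `conv_layer`, the example image and the kernel are hypothetical and chosen for illustration; the sketch assumes no padding, a stride of 1, ReLU as the activation function, and a sliding window without kernel flipping, which is how deep learning libraries typically implement the operation.

```python
import numpy as np

def conv_layer(image, kernel, bias):
    """Slide one kernel over a 2-D image, add a single bias, apply ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            # Receptive field at this position.
            field = image[x:x + kh, y:y + kw]
            # The same kernel weights and bias are reused at every position.
            feature_map[x, y] = np.sum(field * kernel) + bias
    return np.maximum(feature_map, 0.0)  # ReLU activation

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])  # simple horizontal-gradient kernel
feature_map = conv_layer(image, kernel, bias=0.5)
print(feature_map.shape)  # (3, 3)
```

Because the example image increases by one per column, every receptive field gives the same response, so the resulting feature map is constant.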

## Kernel

_A set of weights defined over the receptive field._

![](https://mlnotebook.github.io/img/CNN/convExample.png)

![](img/convolution_kernels_1.png)

![](img/convolution_kernels_2.png)

## Stride

_Step size of the receptive field as it moves over the input image._

<img src="https://miro.medium.com/max/1185/1*L4T6IXRalWoseBncjRr4wQ@2x.gif" width="49%"> <img src="https://miro.medium.com/max/1082/1*4wZt9G7W7CchZO-5rVxl5g@2x.gif" width="49%">

## Padding

![](https://miro.medium.com/max/1595/1*W2D564Gkad9lj3_6t9I2PA@2x.gif)

## Examples

### Convolution without padding and with stride of 1

![](https://i0.wp.com/syncedreview.com/wp-content/uploads/2017/05/6.gif?resize=244%2C259&ssl=1)

### Convolution with zero-padding and with stride of 1

![](https://i2.wp.com/syncedreview.com/wp-content/uploads/2017/05/7.gif?resize=395%2C449&ssl=1)

### Convolution with zero-padding and with stride of 2

![](https://i2.wp.com/syncedreview.com/wp-content/uploads/2017/05/8.gif?resize=395%2C381&ssl=1)
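
The effect of kernel size, zero-padding and stride on the size of the feature map follows the standard formula $\lfloor (n + 2p - k) / s \rfloor + 1$. A minimal sketch, where `conv_output_size` is a hypothetical helper and the 5x5 input with a 3x3 kernel is an assumed example that need not match the sizes in the animations above:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one axis for a convolution with symmetric zero-padding."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed example: 5x5 input and 3x3 kernel along one axis.
no_padding = conv_output_size(5, 3)                           # no padding, stride 1
same_padding = conv_output_size(5, 3, padding=1)              # zero-padding, stride 1
strided = conv_output_size(5, 3, padding=1, stride=2)         # zero-padding, stride 2
print(no_padding, same_padding, strided)  # 3 5 3
```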

# Pooling layer

* A process which follows the convolution in CNNs.
* Also known as _subsampling_.
* Motivated by a model of the mammal visual cortex.

> [...] a reduction in spatial resolution appears to be responsible for achieving translational invariance.

* _Pooling_ is a way of modelling this reduction in dimensionality.
* Benefits: reduces the computational cost of training and helps reduce overfitting.

## Glossary

* **pooled feature map** - the result of subsampling the convolutional feature map.
* **pooling layer** - a collection of pooled feature maps.
* **pooling neighbourhood** - a subdivision of a feature map (usually 2x2) which is replaced by _one single_ value.
* **adjacent pooling neighbourhood** - non-overlapping, side-by-side neighbourhoods.
* **pooling method** - process of reducing a neighbourhood down to one single value.
    * **average pooling** - average of the values within the neighbourhood.
    * **max pooling** - maximum value within the neighbourhood.
    * **$L_2$ pooling** - square root of the sum of the squared values in the neighbourhood.

![](https://vernlium.github.io/2018/10/15/coursera-deeplearning-ai-c4-week1/maxpool_animation.gif)
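
The three pooling methods from the glossary can be sketched with NumPy. The helper `pool2d` and the example feature map are hypothetical; the sketch assumes adjacent (non-overlapping) square neighbourhoods and takes $L_2$ pooling to mean the square root of the sum of squared values.

```python
import numpy as np

def pool2d(feature_map, size=2, method="max"):
    """Reduce each adjacent size x size neighbourhood to one single value."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if method == "max":
        return blocks.max(axis=(1, 3))
    if method == "average":
        return blocks.mean(axis=(1, 3))
    if method == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError(f"unknown pooling method: {method}")

feature_map = np.array([[1.0, 2.0, 0.0, 1.0],
                        [3.0, 4.0, 1.0, 0.0],
                        [0.0, 0.0, 2.0, 2.0],
                        [0.0, 3.0, 2.0, 2.0]])

max_pooled = pool2d(feature_map, method="max")
avg_pooled = pool2d(feature_map, method="average")
l2_pooled = pool2d(feature_map, method="l2")
print(max_pooled)  # [[4. 1.] [3. 2.]]
```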

## Classification

Classification is performed by a fully connected neural network (FCNN).

* The FCNN's input comes from the last _pooled feature map_.
* The 2D feature map is vectorized (read: flattened) and fed into the input layer of the FCNN.

![](https://csdl-images.computer.org/trans/tg/2017/01/figures/23tvcg01-liu-2598831-fig-2-source.gif)
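
The flattening step and a single fully connected layer can be sketched as follows. Everything here is an assumption made for illustration: the number and size of the pooled feature maps, the randomly initialised weights `W`, and the use of a softmax output with 10 classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 4 pooled feature maps of size 3x3.
pooled = rng.random((3, 3, 4))

# Vectorize (flatten) the feature maps into the FCNN input.
x = pooled.reshape(-1)

# One fully connected layer with random (untrained) weights and 10 output classes.
num_classes = 10
W = rng.normal(scale=0.1, size=(num_classes, x.size))
b = np.zeros(num_classes)
logits = W @ x + b

# Softmax turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(np.argmax(probs))
print(x.shape, predicted_class)
```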

# Mathematical Formalization

## Convolution

* Let $w \in \mathcal{R}^{\texttt{KERNEL HEIGHT}~\times~\texttt{KERNEL WIDTH}}$ denote a given kernel.
* Let $a_{x, y} \in \mathcal{R}$ denote the image or pooled-feature pixel value at index $(x, y)$, where
    * $x = 1, ..., \texttt{LAYER HEIGHT}$.
    * $y = 1, ..., \texttt{LAYER WIDTH}$.

The _convolution operator_, $\circledast$, is then defined as:

\begin{equation*}
    w \circledast a_{x, y} = \sum_{i} \sum_{j} w_{i, j} ~ a_{x - i, y - j}
\end{equation*}

Adding a _bias_, $b$, we define a _weighted output_, $z$, of the convolution as:

\begin{equation*}
    z_{x,y} = w \circledast a_{x,y} + b
\end{equation*}
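
The definition can be evaluated directly for a small example. In this sketch the function `convolve_at` is hypothetical, indices are 0-based (unlike the 1-based indices in the text), and terms whose indices fall outside the input are treated as zero, which is an assumption the text does not state.

```python
import numpy as np

def convolve_at(w, a, x, y):
    """Evaluate sum_i sum_j w[i, j] * a[x - i, y - j], skipping out-of-range terms."""
    total = 0.0
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            if 0 <= x - i < a.shape[0] and 0 <= y - j < a.shape[1]:
                total += w[i, j] * a[x - i, y - j]
    return total

a = np.arange(16, dtype=float).reshape(4, 4)  # example input
w = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # example 2x2 kernel
b = 0.5                                       # example bias

x, y = 2, 2
z_xy = convolve_at(w, a, x, y) + b            # weighted output z_{x,y}

# Same value: element-wise product of the 180-degree rotated kernel
# with the window of the input that ends at (x, y).
window = a[x - 1:x + 1, y - 1:y + 1]
z_alt = np.sum(np.rot90(w, 2) * window) + b
print(z_xy, z_alt)  # 66.5 66.5
```

The second computation shows why a kernel rotated by 180 degrees appears whenever a convolution is rewritten as a sliding-window product.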

## Forward pass

* Let $l = 1, ..., L_{c}$ index the $L_c$ convolutional layers in the architecture. We then have:

\begin{equation*}
    z_{x,y}^{(l)} = w^{(l)} \circledast a_{x,y}^{(l - 1)} + b^{(l)}
\end{equation*}

* Let $f: \mathcal{R} \rightarrow \mathcal{R}$ be the _activation function_ of choice. We then have:

\begin{equation*}
    a_{x,y}^{(l)} = f ~ \left(z_{x,y}^{(l)}\right)
\end{equation*}

* $a_{x,y}^{(0)} = \{\text{values of pixels in the original input image}\}$.
* $a_{x,y}^{(L_c)} = \{\text{values of pooled features in last layer of the CNN}\}$.
* NB! The notation is a bit sloppy here, as $a_{x,y}^{(l)}$ is the result of a downsampling (pooling) in addition to an activation $f(...)$. Consider $f(...)$ to be the activation function and downsampler function from now on.

![](https://miro.medium.com/max/1522/1*32zCSTBi3giSApz1oQV-zA.gif)
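
The recursion above can be sketched as a loop over layers. This is a minimal single-channel illustration under several assumptions: the helpers `conv_valid`, `max_pool` and `forward` are hypothetical, the convolution is restricted to positions where the kernel fully overlaps the input, ReLU is used as the activation, and 2x2 max pooling plays the role of the downsampler folded into $f$.

```python
import numpy as np

def conv_valid(a, w):
    """Convolution of a with kernel w at positions where w fully overlaps a."""
    kh, kw = w.shape
    flipped = w[::-1, ::-1]  # a true convolution flips the kernel
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(flipped * a[x:x + kh, y:y + kw])
    return out

def max_pool(z, size=2):
    """Non-overlapping max pooling over size x size neighbourhoods."""
    h, w = z.shape[0] // size * size, z.shape[1] // size * size
    return z[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def forward(image, weights, biases):
    """Compute a^(L_c) from a^(0) = image."""
    a = image
    for w, b in zip(weights, biases):
        z = conv_valid(a, w) + b          # weighted output z^(l)
        a = max_pool(np.maximum(z, 0.0))  # activation followed by pooling
    return a

rng = np.random.default_rng(0)
image = rng.random((14, 14))
weights = [rng.normal(size=(3, 3)), rng.normal(size=(3, 3))]
biases = [0.1, -0.1]
output = forward(image, weights, biases)
print(output.shape)  # (2, 2): 14 -> 12 -> 6 -> 4 -> 2
```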

## Backpropagation

* Let $C$ denote the cost function.
* The error at position $(x, y)$ in the _pooled layer_ number $l$ is then defined as:

\begin{equation*}
    \delta_{x,y}^{(l)} = \frac{\partial C}{\partial z_{x,y}^{(l)}}
\end{equation*}

Let $\mathcal{I}$ and $\mathcal{J}$ be index sets which contain all the indices $i$ and $j$ of the outputs $z_{i,j}^{(l+1)}$ whose calculation involves $z_{x, y}^{(l)}$. Using the chain rule we can relate $\delta_{x,y}^{(l)}$ to $\delta_{x,y}^{(l+1)}$:

\begin{equation*}
    \delta_{x,y}^{(l)}
    =
    \frac{\partial C}{\partial z_{x,y}^{(l)}}
    =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \frac{\partial C}{\partial z_{i,j}^{(l+1)}} \frac{\partial z_{i,j}^{(l+1)}}{\partial z_{x,y}^{(l)}}
\end{equation*}

Now insert the definition of $\delta_{i,j}^{(l+1)}$:

\begin{equation*}
    ... =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}} \left[ z_{i,j}^{(l+1)} \right].
\end{equation*}

Write out $z_{i,j}^{(l+1)}$:

\begin{equation*}
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}} \left[ z_{i,j}^{(l+1)} \right]
    =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}}
        \left[ w^{(l+1)} \circledast a_{i,j}^{(l)} + b^{(l+1)} \right],
\end{equation*}

Write out $a_{i,j}^{(l)}$ and use $\frac{\partial b^{(l+1)}}{\partial z_{x,y}^{(l)}} = 0$:

\begin{equation*}
    ... =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}}
        \left[ w^{(l+1)} \circledast f~\left(z_{i,j}^{(l)}\right) \right],
\end{equation*}

Write out $w^{(l+1)} \circledast f~\left(z_{i,j}^{(l)}\right)$:

\begin{equation*}
    ... =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}}
        \left[ \sum \limits_{u} ~ \sum \limits_{v}
        \left( w_{u, v}^{(l+1)} ~ f~\left(z_{i - u, j - v}^{(l)}\right) \right) \right],
\end{equation*}

Nonzero derivatives satisfy: $i - u = x \wedge j - v = y$:

\begin{align*}
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)} \frac{\partial}{\partial z_{x,y}^{(l)}}
        \left[ \sum \limits_{u} ~ \sum \limits_{v}
        \left( w_{u, v}^{(l+1)} ~ f~\left(z_{i - u, j - v}^{(l)}\right) \right) \right]
    \\
    =
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)}
        w_{i - x, j - y}^{(l+1)} ~~ f~\prime\left(z_{x, y}^{(l)}\right),
\end{align*}

Put constant derivative outside the sum:

\begin{equation*}
    ... =
    f~\prime\left(z_{x, y}^{(l)}\right)
    \sum \limits_{i \in \mathcal{I}} ~ \sum \limits_{j \in \mathcal{J}} ~ 
    \delta_{i,j}^{(l+1)}
        w_{i - x, j - y}^{(l+1)},
\end{equation*}

Identify the double sum as a convolution of $\delta^{(l+1)}$ with the kernel $w^{(l+1)}$ flipped over both axes (rotated by 180 degrees), remembering that $w$ is independent of $(x, y)$:

\begin{equation*}
    ... =
    f~\prime\left(z_{x, y}^{(l)}\right)
    \left[ 
        \delta_{x,y}^{(l+1)}
        \circledast
        \texttt{rot180}\left( w_{x, y}^{(l+1)} \right)
    \right]
    =
    f~\prime\left(z_{x, y}^{(l)}\right)
    \left[ 
        \delta_{x,y}^{(l+1)}
        \circledast
        \texttt{rot180}\left( w^{(l+1)} \right)
    \right]   
\end{equation*}

## Error formula

We now have a formula for $\delta_{x,y}^{(l)}$:

\begin{equation*}
    \delta_{x,y}^{(l)}
    =
    f~\prime\left(z_{x, y}^{(l)}\right)
    \left[ 
        \delta_{x,y}^{(l+1)}
        \circledast
        \texttt{rot180}\left( w^{(l+1)} \right)
    \right]    
\end{equation*}
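
The error formula can be checked numerically against finite differences. The sketch below rests on assumptions that are not part of the derivation: `tanh` as the activation $f$, no pooling, a squared-error cost on $z^{(l+1)}$, a convolution restricted to fully overlapping positions in the forward pass, and zero-padding of $\delta^{(l+1)}$ in the backward pass. The helper names are hypothetical.

```python
import numpy as np

def conv_valid(a, w):
    """Convolution of a with kernel w at positions where w fully overlaps a."""
    kh, kw = w.shape
    flipped = w[::-1, ::-1]
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(flipped * a[x:x + kh, y:y + kw])
    return out

def conv_full(a, w):
    """Same convolution after zero-padding a, so every partial overlap is included."""
    kh, kw = w.shape
    padded = np.pad(a, ((kh - 1, kh - 1), (kw - 1, kw - 1)), mode="constant")
    return conv_valid(padded, w)

rng = np.random.default_rng(0)
z_l = rng.normal(size=(5, 5))      # weighted outputs of layer l
w_next = rng.normal(size=(3, 3))   # kernel of layer l + 1
b_next = 0.3
target = rng.normal(size=(3, 3))

def cost(z):
    z_next = conv_valid(np.tanh(z), w_next) + b_next
    return 0.5 * np.sum((z_next - target) ** 2)

# delta^(l+1) for the squared-error cost, then the error formula for delta^(l).
delta_next = conv_valid(np.tanh(z_l), w_next) + b_next - target
delta_l = (1.0 - np.tanh(z_l) ** 2) * conv_full(delta_next, np.rot90(w_next, 2))

# Central finite differences of the cost with respect to each z^(l)_{x,y}.
eps = 1e-5
numeric = np.zeros_like(z_l)
for x in range(z_l.shape[0]):
    for y in range(z_l.shape[1]):
        step = np.zeros_like(z_l)
        step[x, y] = eps
        numeric[x, y] = (cost(z_l + step) - cost(z_l - step)) / (2 * eps)

formula_matches = np.allclose(delta_l, numeric, atol=1e-6)
print(formula_matches)
```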

And we can now derive the derivative of the loss function with respect to the weights.
We begin by using the chain rule:

\begin{equation*}
    \frac{\partial C}{\partial w_{i,j}^{(l)}}
    =
    \sum \limits_{x} \sum \limits_{y}
    \frac{\partial C}{\partial z_{x,y}^{(l)}}
    \frac{\partial z_{x,y}^{(l)}}{\partial w_{i,j}^{(l)}}
\end{equation*}

Insert the definition of $\delta_{x,y}^{(l)}$ and write out $z_{x,y}^{(l)}$:

\begin{equation*}
    ... =
    \sum \limits_{x} \sum \limits_{y}
    \delta_{x,y}^{(l)}
    \frac{\partial}{\partial w_{i,j}^{(l)}} \left[
        \sum \limits_{i'} \sum \limits_{j'}
        w_{i',j'}^{(l)} f~\left(z_{x - i', y - j'}^{(l - 1)} \right) + b^{(l)}
    \right]
\end{equation*}

Use the same manipulation as before in order to remove all zero terms caused by the derivative:

\begin{equation*}
    ... =
    \sum \limits_{x} \sum \limits_{y}
    \delta_{x,y}^{(l)} ~ f~\left( z_{x - i, y - j}^{(l - 1)} \right)
    =
    \sum \limits_{x} \sum \limits_{y}
    \delta_{x,y}^{(l)} ~ a_{x - i, y - j}^{(l - 1)}
\end{equation*}

Again we have a convolution of $\delta_{x,y}^{(l)}$ over $\texttt{rot180}\left(a_{x, y}^{(l-1)}\right)$:

\begin{equation*}
    \frac{\partial C}{\partial w_{i,j}^{(l)}}
    =
    \delta_{i,j}^{(l)} \circledast \texttt{rot180}\left(a_{i, j}^{(l-1)}\right)
\end{equation*}

Repeating this logic with $\frac{\partial}{\partial b^{(l)}}$ instead of $\frac{\partial}{\partial w_{i,j}^{(l)}}$ we find that:

\begin{equation*}
    \frac{\partial C}{\partial b^{(l)}} = \sum \limits_{x} \sum \limits_{y} \delta_{x,y}^{(l)}
\end{equation*}
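
The gradients with respect to the kernel weights and the bias can be verified in the same way. This sketch uses the double-sum form $\sum_x \sum_y \delta_{x,y}^{(l)} a_{x-i, y-j}^{(l-1)}$ directly. It assumes `tanh` as activation, no pooling, a squared-error cost, and a convolution restricted to fully overlapping positions, so the array indices are shifted by the kernel size minus one relative to the formula. All names are hypothetical.

```python
import numpy as np

def conv_valid(a, w):
    """Convolution of a with kernel w at positions where w fully overlaps a."""
    kh, kw = w.shape
    flipped = w[::-1, ::-1]
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(flipped * a[x:x + kh, y:y + kw])
    return out

rng = np.random.default_rng(1)
a_prev = rng.normal(size=(5, 5))   # a^(l-1)
w = rng.normal(size=(3, 3))        # w^(l)
b = 0.2                            # b^(l)
target = rng.normal(size=(3, 3))

def cost(w, b):
    return 0.5 * np.sum((np.tanh(conv_valid(a_prev, w) + b) - target) ** 2)

# delta^(l) = dC/dz^(l) for the squared-error cost with tanh activation.
z = conv_valid(a_prev, w) + b
delta = (np.tanh(z) - target) * (1.0 - np.tanh(z) ** 2)

# Analytic gradients: sum of delta times the shifted input, and sum of delta.
kh, kw = w.shape
out_h, out_w = delta.shape
grad_w = np.zeros_like(w)
for i in range(kh):
    for j in range(kw):
        r, c = kh - 1 - i, kw - 1 - j  # index shift caused by the valid region
        grad_w[i, j] = np.sum(delta * a_prev[r:r + out_h, c:c + out_w])
grad_b = np.sum(delta)

# Central finite differences for comparison.
eps = 1e-5
numeric_w = np.zeros_like(w)
for i in range(kh):
    for j in range(kw):
        step = np.zeros_like(w)
        step[i, j] = eps
        numeric_w[i, j] = (cost(w + step, b) - cost(w - step, b)) / (2 * eps)
numeric_b = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)

print(np.allclose(grad_w, numeric_w, atol=1e-6), np.isclose(grad_b, numeric_b, atol=1e-6))
```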

## Gradient Descent Update Equations

With a constant learning rate of $\alpha$ we update the kernel weight at location $(i,j)$ in layer $l$:

\begin{align*}
    w_{i,j}^{(l)} \leftarrow w_{i,j}^{(l)} - \alpha \delta_{i,j}^{(l)} \circledast \texttt{rot180}\left(a_{i, j}^{(l-1)}\right)
\end{align*}

And likewise for the bias:

\begin{align*}
    b^{(l)} \leftarrow b^{(l)} - \alpha \sum \limits_{x} \sum \limits_{y} \delta_{x,y}^{(l)}
\end{align*}
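
The update equations can be tried out on a toy problem: recovering a kernel and a bias from a target feature map. The setup is an assumption made for illustration only: a single layer, identity activation, a mean squared-error cost, a convolution restricted to fully overlapping positions, and a learning rate of 0.1. The helper names are hypothetical.

```python
import numpy as np

def conv_valid(a, w):
    """Convolution of a with kernel w at positions where w fully overlaps a."""
    kh, kw = w.shape
    flipped = w[::-1, ::-1]
    out = np.zeros((a.shape[0] - kh + 1, a.shape[1] - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(flipped * a[x:x + kh, y:y + kw])
    return out

def gradients(a_prev, delta, kernel_shape):
    """Gradients of the cost with respect to the kernel weights and the bias."""
    kh, kw = kernel_shape
    out_h, out_w = delta.shape
    grad_w = np.zeros(kernel_shape)
    for i in range(kh):
        for j in range(kw):
            r, c = kh - 1 - i, kw - 1 - j
            grad_w[i, j] = np.sum(delta * a_prev[r:r + out_h, c:c + out_w])
    return grad_w, np.sum(delta)

rng = np.random.default_rng(2)
a_prev = rng.random((8, 8))
true_w = rng.normal(size=(3, 3))
target = conv_valid(a_prev, true_w) + 0.5  # feature map produced by the "true" kernel and bias

w, b = np.zeros((3, 3)), 0.0
alpha = 0.1  # constant learning rate
costs = []
for _ in range(200):
    z = conv_valid(a_prev, w) + b
    costs.append(0.5 * np.mean((z - target) ** 2))
    delta = (z - target) / z.size          # dC/dz for the mean squared-error cost
    grad_w, grad_b = gradients(a_prev, delta, w.shape)
    w = w - alpha * grad_w                 # kernel update
    b = b - alpha * grad_b                 # bias update

print(costs[0], costs[-1])
```

Both gradients are computed from the same $\delta^{(l)}$ before either parameter is changed, so the kernel and the bias are updated simultaneously.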

# CNN and the MNIST dataset (Numerical Example)

Following the [Convolutional Neural Networks TensorFlow 2.0 tutorial](https://www.tensorflow.org/beta/tutorials/images/intro_to_cnns).

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models, datasets

import matplotlib as mpl
from matplotlib import pyplot as plt

# Enable GPU support

To enable GPU support for TensorFlow v2.0 in JupyterLab, follow [this guide](https://github.com/tensorflow/tensorflow/issues/24828#issuecomment-464910864).

In [None]:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

## Importing the dataset

In [None]:
(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))
test_images = test_images.reshape((10000, 28, 28, 1))

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# Dense layers
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()
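
The layer shapes and parameter counts reported by `model.summary()` can be cross-checked by hand. A minimal sketch in plain Python, assuming the Keras defaults used above (no padding and stride 1 for `Conv2D`, non-overlapping 2x2 windows for `MaxPooling2D`); the variable names are illustrative.

```python
# (layer kind, number of filters) for the convolutional part of the model above.
layer_spec = [("conv", 32), ("pool", None), ("conv", 64), ("pool", None), ("conv", 64)]

size, channels = 28, 1  # MNIST images are 28x28 with one channel
total_params = 0
for kind, filters in layer_spec:
    if kind == "conv":
        # One 3x3 kernel per input channel plus one bias, for every filter.
        params = (3 * 3 * channels + 1) * filters
        size = size - 3 + 1
        channels = filters
    else:
        params = 0        # pooling has no trainable parameters
        size = size // 2
    total_params += params
    print(kind, (size, size, channels), params)

flat = size * size * channels      # input length of the first dense layer
dense_1 = (flat + 1) * 64          # weights plus biases of Dense(64)
dense_2 = (64 + 1) * 10            # weights plus biases of Dense(10)
total_params += dense_1 + dense_2
print(flat, total_params)  # 576 93322
```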

In [None]:
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

In [None]:
history = model.fit(train_images, train_labels, epochs=10, batch_size=100, validation_split=0.1)

In [None]:
model.evaluate(test_images, test_labels, verbose=False)

In [None]:
print("GPU Available: ", tf.test.is_gpu_available())

In [None]:
plt.plot(history.history["accuracy"])

In [None]:
plt.plot(history.history["val_accuracy"])