In [1]:
import numpy as np
import torch
from torch import nn
torch.__version__

'1.8.1'

## Convolutional Neural Network (CNN)

Convolutional Neural Networks were developed by Yann LeCun (1989, 1998). One of the first CNNs was LeNet-5, which recognized handwritten digits and words. It achieved 99.2% accuracy on the MNIST dataset.

https://en.wikipedia.org/wiki/LeNet

AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). AlexNet won the ImageNet Challenge, a major milestone in AI that got the attention of vision researchers who had previously been very critical of machine learning.

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

![](Hinton1.png)

The ImageNet Challenge consists of 1,000 categories with 1.2M training examples and 100,000 test examples. A model outputs its five most probable classes, and a prediction counts as correct if the true class is among them.

AlexNet won first place with a top-five error rate of 15.3%. The second-best entry had an error rate of 26.2%.
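
The top-five error rate can be sketched as follows. This is a minimal illustration, not the official ImageNet evaluation code; the helper name `top5_error` and the toy scores are assumptions made for the example.

```python
import torch

def top5_error(logits, labels):
    """Fraction of examples whose true label is not among the 5 highest scores."""
    top5 = logits.topk(5, dim=1).indices            # shape (N, 5)
    hit = (top5 == labels.unsqueeze(1)).any(dim=1)  # shape (N,)
    return 1.0 - hit.float().mean().item()

# Hypothetical scores for 2 examples over 10 classes
logits = torch.arange(20, dtype=torch.float32).reshape(2, 10)
labels = torch.tensor([9, 0])  # first label is in the top five, second is not
err = top5_error(logits, labels)
print(err)
```

With these toy values one of the two examples is a miss, so the error is 0.5.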

https://www.image-net.org/update-mar-11-2021.php


### CNN Architecture

![](CNN_Architecture.jpg)
![](FullCNN1.png)

Source: towardsdatascience.com

### Feature Learning (Representation Learning)

Representation (or feature) learning finds the best way to represent the data so that features can be detected.

Feature learning automatically learns which features are important rather than the features being manually engineered.

This was a major achievement since manual feature engineering is time-consuming and prone to human biases.

### Convolution

#### Why Convolution?

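Convolution is location invariant: the same kernel is applied at every position, so a feature is detected wherever it appears in the image. This is important in order to identify an object by its local context (i.e. locality).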

Convolution combines two functions of time into a third. The shape of one function is modified by the other.

$$(f*g)(t) = \int_{-\infty}^{\infty}f(\tau)\,g(t - \tau)\,d\tau$$

http://mathworld.wolfram.com/Convolution.html
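
For sampled signals the integral above becomes a sum. The sketch below is a discrete analogue under that assumption; the helper `conv1d_full` and the sample arrays are hypothetical, and `np.convolve` is used only as a cross-check.

```python
import numpy as np

def conv1d_full(f, g):
    """Discrete convolution: out[t] = sum over tau of f[tau] * g[t - tau]."""
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):  # skip terms where g is undefined
                out[t] += f[tau] * g[t - tau]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
manual = conv1d_full(f, g)
reference = np.convolve(f, g)  # NumPy's full discrete convolution
print(manual, reference)
```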



Technically the convolution operation in a CNN is Cross-Correlation. Cross-correlation slides a kernel over the image; convolution slides a flipped kernel over the image.

![](CrossCorr.png)

$\text{Comparison of Cross Correlation and Convolution. Source: Wikipedia}$
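
The difference between the two operations can be checked numerically. The sketch below assumes `torch.nn.functional.conv2d`, which computes cross-correlation, and obtains a true convolution by flipping the kernel along both spatial axes; the tensors are small illustrative values.

```python
import torch
import torch.nn.functional as F

# Input and kernel in (batch, channels, height, width) layout
X = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
K = torch.tensor([[0., 1.], [2., 3.]]).reshape(1, 1, 2, 2)

# What CNN layers compute: the kernel slides over the image as-is
cross_corr = F.conv2d(X, K)

# True convolution: the same sliding sum with the kernel flipped in both axes
true_conv = F.conv2d(X, torch.flip(K, dims=(2, 3)))

print(cross_corr.reshape(2, 2))
print(true_conv.reshape(2, 2))
```

Because the kernel values are learned, the network can learn either orientation, which is why the distinction rarely matters in practice.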

#### Input Image

An input image is typically:

- 256x256 for grayscale images
- 256x256x3 for color (RGB) images

![](Convolution1.png)

Stride is the number of pixels the feature detector moves sideways and down at each step. A stride of 2 is a popular choice. A stride greater than 1 reduces the spatial dimensions of the output.
    
Padding puts a border of zeros completely around the input image, usually one or two pixels wide.
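
The combined effect of stride and padding on the output size follows the usual formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$ and stride $s$. The sketch below checks it against `nn.Conv2d`; the helper `conv_out_size` and the chosen sizes are assumptions for illustration.

```python
import torch
from torch import nn

def conv_out_size(n, k, stride=1, padding=0):
    """Output length along one dimension for a convolution without dilation."""
    return (n + 2 * padding - k) // stride + 1

X = torch.zeros(1, 1, 256, 256)  # one 256x256 single-channel image
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)

out_shape = conv(X).shape
expected = conv_out_size(256, 3, stride=2, padding=1)
print(out_shape, expected)
```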

https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md

#### Feature Detector (Kernel, Filter)

A feature detector convolves the kernel with the image to detect different features in the image, such as edges or lines.
 
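Feature detectors are grouped into a convolutional layer (e.g. 32 feature detectors in a layer). Each feature detector produces its own feature map, so the layer's output is a stack of 32 feature maps.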
    
**CNNs learn the best feature detectors: the kernel values are weights, learned in the same way as any other weights.**

http://setosa.io/ev/image-kernels/
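
A bank of feature detectors can be sketched with `nn.Conv2d`, where `out_channels` sets how many kernels are applied to the image. The sizes below (32 detectors of size 3x3 on a 256x256 RGB image) are illustrative assumptions.

```python
import torch
from torch import nn

# 32 feature detectors, each a 3x3 kernel spanning the 3 input (RGB) channels
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)

image = torch.rand(1, 3, 256, 256)  # random stand-in for one RGB image
feature_maps = conv(image)

print(conv.weight.shape)   # one learnable kernel per feature detector
print(feature_maps.shape)  # one feature map per feature detector
```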
    
#### Feature Map

A feature map is a reduced form of the image. Some information is lost but the important information (e.g. the features) is preserved.

![](Convolution2.png)

#### Cross-Correlation Operator

In [2]:
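def corr2d(X, K):
    """Compute the 2-D cross-correlation of input X with kernel K."""
    h, w = K.shape
    # No padding, stride 1: the output shrinks by (kernel size - 1) per dimension
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    print(X.shape, K.shape, Y.shape)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the kernel element-wise with the window at (i, j), then sum
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y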

In [3]:
X = torch.Tensor([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = torch.Tensor([[0, 1], [2, 3]])
print(X)
print(K)
corr2d(X, K)

tensor([[0., 1., 2.],
        [3., 4., 5.],
        [6., 7., 8.]])
tensor([[0., 1.],
        [2., 3.]])
torch.Size([3, 3]) torch.Size([2, 2]) torch.Size([2, 2])


tensor([[19., 25.],
        [37., 43.]])

#### Edge Detection

In the image below, 1 is white and 0 is black. A change from 1 to 0 is a white-to-black edge, and a change from 0 to 1 is a black-to-white edge.

In [4]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

In [5]:
K = torch.Tensor([[1, -1]])  ## Detects vertical edges

In [6]:
Y = corr2d(X, K)
Y

torch.Size([6, 8]) torch.Size([1, 2]) torch.Size([6, 7])


tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

### Learn edge detection kernel K

In [7]:
# 1 input channel, 1 output channel, kernel shape of (1, 2)

conv2d = nn.Conv2d(1,1, kernel_size=(1, 2),bias=False) #Ignoring bias

# The 2-d convolutional layer uses four-dimensional input and output:
# (number of examples, number of channels, height, width)

X_ = X.reshape((1, 1, 6, 8))
Y_ = Y.reshape((1, 1, 6, 7))

alpha = 0.03  # Learning Rate
epochs = 10

for epoch in range(epochs):
    conv2d.zero_grad() # Zero gradients
    Y_hat = conv2d(X_) # Forward pass
    loss = ((Y_hat - Y_) ** 2).sum()  # Calc Loss
    loss.backward() # Calc Gradients
    conv2d.weight.data[:] -= alpha * conv2d.weight.grad # Gradient Descent
    if (epoch + 1) % 2 == 0:
        print('batch %d, loss %.3f' % (epoch + 1, loss))

batch 2, loss 5.215
batch 4, loss 1.196
batch 6, loss 0.332
batch 8, loss 0.110
batch 10, loss 0.040


In [8]:
print(conv2d.weight.data.shape)
conv2d.weight.data.reshape((1, 2))

torch.Size([1, 1, 1, 2])


tensor([[ 1.0097, -0.9698]])

### ReLU activation function

A ReLU activation function is applied to introduce non-linearity into the network, since convolution is a linear operation.

The ReLU is applied element-wise to the feature map and replaces negative values with zero, e.g. removing all the black pixels to sharpen the border between objects.
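
As a small illustration, ReLU can be applied to one row of the edge-detection output computed earlier. The values are copied by hand from that output, and `torch.relu` is used here as one of several equivalent ways to apply the function.

```python
import torch

# One row of the edge-detection feature map: +1 and -1 mark the two edges
feature_map = torch.tensor([[0., 1., 0., 0., 0., -1., 0.]])

# ReLU keeps positive values and replaces negative values with zero
activated = torch.relu(feature_map)
print(activated)
```

Only the positive response survives, so after ReLU this detector responds to one edge direction only.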

### Pooling (Downsampling)

Downsampling provides a degree of spatial invariance: a feature is still detected if it shifts slightly within the pooling window.

The types of Pooling are:

- Max
- Min
- Sum
- Mean
- Subsampling

#### Max Pooling 

Max pooling is the most popular. Below is a 2x2 filter with stride = 2.

![](MaxPooling1.png)


![](MaxPooling2.png)

Pooling removes information but preserves features. It makes the network more tolerant of small distortions and shrinks the feature maps, which reduces the number of parameters in the layers that follow. It also helps to reduce overfitting.


In [9]:
def pool2d(X, pool_size, mode='max'):
    """Pool X with a window of size pool_size, moving one element at a time (stride 1)."""
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [10]:
X = torch.tensor([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=torch.float32)
pool2d(X, (2, 2))

tensor([[4., 5.],
        [7., 8.]])

In [11]:
pool2d(X, (2, 2), 'avg')

tensor([[2., 3.],
        [5., 6.]])

### PyTorch MaxPool2d

https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html


#### Pooling with Stride

By default, the stride equals the pooling window size.

In [12]:
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
print(X)

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])


In [13]:
pool2d = nn.MaxPool2d(3) # Pooling and stride shape is 3x3, no padding
pool2d(X)

tensor([[[[10.]]]])

Note: this call may produce a warning in some PyTorch versions. See https://github.com/pytorch/pytorch/issues/60053

In [14]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])

In [15]:
pool2d = nn.MaxPool2d((2, 3), padding=(1, 1), stride=(2, 3)) # padding must be at most half of the kernel size
pool2d(X)

tensor([[[[ 1.,  3.],
          [ 9., 11.],
          [13., 15.]]]])

#### Multiple Channels

In [16]:
X = torch.cat((X, X + 1), dim=1)
print(X.shape)
print(X)

torch.Size([1, 2, 4, 4])
tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])


In [17]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]],

         [[ 6.,  8.],
          [14., 16.]]]])

#### Visualization of Convolution and Pooling

http://scs.ryerson.ca/~aharley/vis/conv/flat.html

### Flattening

Flattening converts the final pooled feature maps to a 1-d array, row by row.

![](flattening.png)
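
A minimal sketch of flattening, assuming a hypothetical pooled output of two 2x2 feature maps for a single example. `nn.Flatten` keeps the batch dimension and unrolls the rest row by row.

```python
import torch
from torch import nn

# One example, 2 pooled feature maps of size 2x2
pooled = torch.arange(8, dtype=torch.float32).reshape(1, 2, 2, 2)

# Flatten everything except the batch dimension
flat = nn.Flatten()(pooled)
print(flat.shape)
print(flat)
```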

### Fully Connected ANN

The next step is a fully connected network, which takes the flattened array as input and combines the features.

![](FullAnn2.png)
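
The stages described so far can be chained into one small model. This is a sketch under assumed sizes (28x28 single-channel inputs, 8 feature detectors, 10 output classes), not the architecture in the figures.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1x28x28 -> 8x28x28
    nn.ReLU(),                                  # non-linearity
    nn.MaxPool2d(2),                            # 8x28x28 -> 8x14x14
    nn.Flatten(),                               # 8x14x14 -> 1568
    nn.Linear(8 * 14 * 14, 10),                 # fully connected: 1568 -> 10 scores
)

batch = torch.rand(4, 1, 28, 28)  # random stand-in for 4 images
scores = model(batch)
print(scores.shape)
```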

#### Backpropagation

In the backpropagation step the **feature detectors are adjusted** as well as the weights of the fully connected layers.
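
A quick way to see this is to run one backward pass and check that both the convolution kernels and the fully connected weights receive gradients. The tiny model and the sum used as a loss are assumptions made only for this sketch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3),  # 1x6x6 -> 2x4x4
    nn.ReLU(),
    nn.Flatten(),                    # 2x4x4 -> 32
    nn.Linear(2 * 4 * 4, 1),
)

x = torch.rand(1, 1, 6, 6)
loss = model(x).sum()  # stand-in for a real loss
loss.backward()

kernel_grad = model[0].weight.grad  # gradient for the feature detectors
linear_grad = model[3].weight.grad  # gradient for the fully connected weights
print(kernel_grad.shape, linear_grad.shape)
```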

### Dropout

Dropout is a regularization technique. At each training step, individual units with their incoming and outgoing edges are randomly dropped from the network structure.

This lessens overfitting by reducing the interdependency of the units.

![](Dropout.png)
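
The behaviour can be sketched with `nn.Dropout`, assuming a drop probability of 0.5. In training mode PyTorch zeroes units at random and scales the survivors by 1/(1 - p); in evaluation mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()         # training mode: units are dropped at random
train_out = drop(x)  # each value is either 0 or 1 / (1 - 0.5) = 2

drop.eval()          # evaluation mode: dropout is disabled
eval_out = drop(x)

print(train_out[:10])
print(eval_out[:10])
```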
    