In [None]:
import numpy as np
import torch
from torch import nn

## Convolutional Neural Network (CNN)

* Yann LeCun, 1998
    - LeNet 5: recognizing hand-written digits and words
    - 99.2% accuracy on the MNIST dataset

* AlexNet: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, 2012

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

![](Hinton1.png)

* ImageNet Challenge
    - 1000 categories
    - 1.2M training images
    - 100,000 test images
* 1st place with a top-5 error rate of 15.3%
    - 26.2% error rate for the second-best entry


### CNN Architecture

![](CNN_Architecture.jpg)
![](FullCNN1.png)

Source: towardsdatascience.com

### Feature Learning

* Learns which features are important

* Replaced manual feature engineering
    - Manual feature engineering is time consuming and prone to human biases

### Convolution

#### Why Convolution?

* Location invariance
* Locality: An object is identified by its local context


#### Function
* Combines two functions in time
* Shape of one function modified by the other

$$(f*g)(t) = \int_{-\infty}^{\infty}f(\tau)\,g(t - \tau)\,d\tau$$

http://mathworld.wolfram.com/Convolution.html


https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolution.html

* Technically, the "convolution" operation in a CNN is cross-correlation: the kernel is not flipped before it is slid over the input
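The difference between the two operations can be seen in one dimension. The sketch below uses NumPy with a made-up signal `f` and kernel `g` (both are illustrative assumptions, not data from these notes): convolution flips the kernel, cross-correlation does not, so the two agree only once the kernel is reversed by hand.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical signal
g = np.array([1.0, 0.0, -1.0])       # hypothetical kernel

conv = np.convolve(f, g, mode="valid")                  # true convolution: g is flipped
corr = np.correlate(f, g, mode="valid")                 # cross-correlation: g is used as-is
corr_flipped = np.correlate(f, g[::-1], mode="valid")   # flip g manually first

print(conv)          # convolution result
print(corr)          # differs from conv because g is not symmetric
print(corr_flipped)  # matches conv
```

Because a CNN learns its kernels, it makes no practical difference whether the kernel is flipped: the network simply learns the flipped version.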

#### Input Image

* Typically:
    - 256x256 for grey scale images
    - 256x256x3 for color (RGB) images

![](Convolution1.png)

* Stride - how much to move the feature detector sideways and down
    - 2 is a popular choice
    - Reduce dimensionality
    
* Padding - Put a border of zeros around input image
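Stride and padding together determine the size of the output. A minimal sketch, assuming a square input of side `n`, kernel size `k`, stride `s` and padding `p` (and no dilation), where the output side is `floor((n + 2p - k) / s) + 1`. The helper `conv_output_size` and the 8x8 input are hypothetical names and values chosen for illustration.

```python
import torch
from torch import nn

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

X_demo = torch.zeros((1, 1, 8, 8))  # hypothetical 8x8 single-channel image
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
out = conv(X_demo)

print(out.shape)                                     # shape produced by PyTorch
print(conv_output_size(8, 3, stride=2, padding=1))   # shape predicted by the formula
```

With a stride of 2 the height and width are roughly halved, which is the dimensionality reduction mentioned above.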

#### Feature Detector (Kernel, Filter)

* Convolves the kernel with the image

* Detects different features in the image
    - Edge
    - Line
* Alters the image
    - Blur (averages)
    
* Feature detectors are organized in layers (e.g. a layer with 32 feature detectors, each producing its own feature map)
* **CNNs learn the best feature detectors in the same way the weights are learned**

http://setosa.io/ev/image-kernels/
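As a rough illustration of hand-crafted feature detectors, the sketch below applies a 3x3 mean (blur) kernel and a simple horizontal-difference (edge) kernel to a tiny synthetic image with `torch.nn.functional.conv2d`. The image and both kernels are assumptions made up for this example; a CNN would learn its kernels rather than have them specified.

```python
import torch
import torch.nn.functional as F

# Hypothetical 5x5 image: dark on the left (0), bright on the right (1)
img = torch.zeros((1, 1, 5, 5))
img[..., 3:] = 1.0

# 3x3 mean filter: each output pixel is the average of its neighbourhood
blur = torch.full((1, 1, 3, 3), 1.0 / 9.0)

# Horizontal difference: responds where brightness changes from left to right
edge = torch.tensor([[-1.0, 0.0, 1.0],
                     [-1.0, 0.0, 1.0],
                     [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

blurred = F.conv2d(img, blur)   # smooths the dark-to-bright transition
edges = F.conv2d(img, edge)     # non-zero only near the transition

print(blurred[0, 0])
print(edges[0, 0])
```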
    
#### Feature Map

* Reduced form of image
* Lost some information but extracted important information (e.g. the features)

![](Convolution2.png)

#### Cross-Correlation Operator

In [None]:
def corr2d(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    print(X.shape,K.shape,Y.shape)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i: i + h, j: j + w] * K).sum()
    return Y

In [None]:
X = torch.Tensor([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
K = torch.Tensor([[0, 1], [2, 3]])
print(X)
print(K)
corr2d(X, K)

#### Edge Detection

* In the image below, the adjacent values 1., 0. form a white-to-black edge and 0., 1. form a black-to-white edge

In [None]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

In [None]:
K = torch.Tensor([[1, -1]])  ## Detects vertical edges

In [None]:
Y = corr2d(X, K)
Y

### Learn edge detection kernel K

In [None]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
Y = corr2d(X, K)

# 1 input channel, 1 output channel, kernel shape of (1, 2)

conv2d = nn.Conv2d(1,1, kernel_size=(1, 2),bias=False) #Ignoring bias

# The 2-d convolutional layer uses four-dimensional input and output:
# (number of examples, number of channels, height, width)

X_ = X.reshape((1, 1, 6, 8))
Y_ = Y.reshape((1, 1, 6, 7))

alpha = 0.03  # Learning Rate
epochs = 10

for epoch in range(epochs):
    conv2d.zero_grad() # Zero gradients
    Y_hat = conv2d(X_) # Forward pass
    loss = ((Y_hat - Y_) ** 2).sum()  # Calc Loss
    loss.backward() # Calc Gradients
    conv2d.weight.data[:] -= alpha * conv2d.weight.grad # Gradient descent
    if (epoch + 1) % 2 == 0:
        print('epoch %d, loss %.3f' % (epoch + 1, loss.item()))

In [None]:
print(conv2d.weight.data.shape)
conv2d.weight.data.reshape((1, 2))
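The manual weight update can also be expressed with a built-in optimizer. This is a sketch under a few assumptions: it rebuilds the same 6x8 image, generates the target with `torch.nn.functional.conv2d` instead of `corr2d`, fixes the random seed, and runs 50 epochs (more than above) so the learned kernel settles close to `[1, -1]`. The variable names are chosen for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

# Same image as above, already in (examples, channels, height, width) layout
X_img = torch.ones((1, 1, 6, 8))
X_img[..., 2:6] = 0
true_K = torch.tensor([[[[1.0, -1.0]]]])   # the kernel we hope to recover
Y_target = F.conv2d(X_img, true_K)         # shape (1, 1, 6, 7)

layer = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.03)

for epoch in range(50):
    optimizer.zero_grad()                        # zero gradients
    loss = ((layer(X_img) - Y_target) ** 2).sum()  # squared-error loss
    loss.backward()                              # compute gradients
    optimizer.step()                             # gradient descent update

learned_K = layer.weight.data.reshape(1, 2)
print(learned_K)
```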

### Activation function

#### ReLU

* Applied to increase non-linearity in network
    - Convolution step is linear
* Applied element-wise to feature map to remove negative values
    - e.g. remove all the black pixels (sharpens the border between objects) 
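A small sketch of ReLU applied element-wise, using a made-up 2x2 feature map (the values are illustrative only): negative entries become zero and positive entries pass through unchanged.

```python
import torch

feature_map = torch.tensor([[-1.0, 2.0],
                            [0.5, -3.0]])   # hypothetical feature map values
activated = torch.relu(feature_map)         # max(0, x) applied to every element

print(activated)
```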

### Pooling (Downsampling)

* Spatial Invariance

* Types
    - Max
    - Min
    - Sum
    - Mean
    - Subsampling

#### Max Pooling 

* It is the most popular
    - 2x2
    - Stride = 2

![](MaxPooling1.png)


![](MaxPooling2.png)

* Removes information but preserves features
* Accounts for distortions
* Reduced size therefore fewer parameters
    - helps reduce overfitting
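The tolerance to small distortions can be illustrated with two made-up 4x4 inputs that contain the same single "feature" pixel, shifted by one position. With 2x2 max pooling and stride 2 both give the same pooled map. This is only a sketch: the invariance holds when the shift stays inside one pooling window, not for arbitrary shifts.

```python
import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

a = torch.zeros((1, 1, 4, 4))
a[0, 0, 0, 0] = 1.0   # feature at the top-left pixel

b = torch.zeros((1, 1, 4, 4))
b[0, 0, 1, 1] = 1.0   # same feature shifted one pixel down and right

pooled_a = pool(a)
pooled_b = pool(b)

print(pooled_a)
print(pooled_b)   # identical to pooled_a
```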


In [None]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [None]:
X = torch.tensor([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=torch.float32)
pool2d(X, (2, 2))

In [None]:
pool2d(X, (2, 2), 'avg')

### Pooling and Stride

In [None]:
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
print(X)

#### PyTorch MaxPool2d

* By default the stride has the same shape as the pooling window shape. 

In [None]:
pool2d = nn.MaxPool2d(3) # Pooling and stride shape is 3x3, no padding
pool2d(X)

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

In [None]:
pool2d = nn.MaxPool2d((2, 3), padding=(1, 1), stride=(2, 3)) # padding must be at most half of the kernel size
pool2d(X)
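The output shapes of the three pooling layers above follow the same size rule as convolution. A sketch, assuming the default settings of `nn.MaxPool2d` (no dilation, `ceil_mode=False`); the helper `pool_output_size` and the variable `x` are hypothetical names used only here.

```python
import torch
from torch import nn

def pool_output_size(n, k, stride, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

x = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))

# Height and width of each pooled output
shape_a = tuple(nn.MaxPool2d(3)(x).shape[2:])
shape_b = tuple(nn.MaxPool2d(3, padding=1, stride=2)(x).shape[2:])
shape_c = tuple(nn.MaxPool2d((2, 3), padding=(1, 1), stride=(2, 3))(x).shape[2:])

print(shape_a, pool_output_size(4, 3, stride=3))
print(shape_b, pool_output_size(4, 3, stride=2, padding=1))
print(shape_c, (pool_output_size(4, 2, stride=2, padding=1),
                pool_output_size(4, 3, stride=3, padding=1)))
```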

#### Multiple Channels

In [None]:
X = torch.cat((X, X + 1), dim=1)
print(X.shape)
print(X)

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

#### Visualization of Convolution and Pooling

http://scs.ryerson.ca/~aharley/vis/conv/flat.html

### Flattening

* Convert final pooled feature maps to a 1-d array
    - by row

![](flattening.png)
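In PyTorch, flattening is usually done per example so the batch dimension is kept. A minimal sketch with a made-up batch of 2 examples, each with 3 pooled feature maps of size 2x2 (the shapes are assumptions for illustration):

```python
import torch
from torch import nn

# Hypothetical pooled output: (examples, channels, height, width)
pooled = torch.arange(2 * 3 * 2 * 2, dtype=torch.float32).reshape(2, 3, 2, 2)

flat = nn.Flatten()(pooled)   # flattens everything except the batch dimension

print(flat.shape)   # one row of 3 * 2 * 2 values per example
print(flat[0])      # values appear in row-by-row order, channel after channel
```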

### Fully Connected ANN

![](FullAnn2.png)

* Input vector of features
* Combines features
* Activation in output layer: 
    - Softmax when doing classification

#### Loss function

* Cross-entropy

#### Backpropagation

* Weights adjusted

* **Feature detectors are adjusted**
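The point that backpropagation adjusts the feature detectors as well as the fully connected weights can be checked on a tiny network. The sketch below is an assumption-laden toy: the layer sizes, random 8x8 "images", random labels and learning rate are all invented for illustration. It runs one training step and compares the convolution kernels before and after.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed for repeatability

# Toy CNN: convolution -> ReLU -> max pooling -> flatten -> fully connected
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),   # 8x8 input -> 4 feature maps of 6x6
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 6x6 -> 3x3
    nn.Flatten(),                     # 4 * 3 * 3 = 36 features
    nn.Linear(4 * 3 * 3, 10),         # 10 class scores (logits)
)

images = torch.randn(8, 1, 8, 8)        # hypothetical batch of 8 images
labels = torch.randint(0, 10, (8,))     # hypothetical class labels

loss_fn = nn.CrossEntropyLoss()         # applies softmax + cross-entropy to the logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

kernels_before = model[0].weight.detach().clone()

optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()                         # gradients flow back into the conv kernels
optimizer.step()

kernels_changed = not torch.equal(kernels_before, model[0].weight.detach())
print(kernels_changed)
```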

### Softmax and Cross-Entropy

* Only for Classification

#### Softmax

For $j = 1, 2, \ldots, K$:

<div style="font-size: 115%;">
$$ f(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}$$
</div>

* Generalization of logistic (i.e. sigmoid) to K classes. 
    - Outputs probabilities of the classes
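The softmax formula can be written directly in a few lines. This sketch subtracts the maximum logit before exponentiating, a common numerical-stability step that does not change the result; the function name `softmax` and the example logits are assumptions for illustration.

```python
import torch

def softmax(z):
    """Softmax over the last dimension: exp(z_j) / sum_k exp(z_k)."""
    z = z - z.max(dim=-1, keepdim=True).values   # stability shift, result unchanged
    exp_z = torch.exp(z)
    return exp_z / exp_z.sum(dim=-1, keepdim=True)

logits = torch.tensor([[2.0, 1.0, 0.1]])   # hypothetical scores for 3 classes
probs = softmax(logits)

print(probs)               # class probabilities
print(probs.sum(dim=-1))   # sums to 1
```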
    
#### Cross-Entropy

<div style="font-size: 115%;">
$$H(p,q) = -\sum_x p(x)\log(q(x))$$
</div>

* The output layer represents a distribution (q above) from the softmax activation function

* Cross-entropy measures how far the predicted distribution q in the output layer is from the labeled distribution p.

* When the targets are 0 and 1, cross-entropy tends to allow errors to propagate backwards in order to change weights even when the error is small (because of the log term)
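As a worked check of the formula, the sketch below computes cross-entropy by hand with one-hot labels as p and softmax outputs as q, then compares it with `torch.nn.functional.cross_entropy`, which takes raw logits and class indices and averages over the batch by default. The logits and targets are made-up values.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])   # hypothetical scores, 2 examples x 3 classes
targets = torch.tensor([0, 1])             # hypothetical true class indices

q = torch.softmax(logits, dim=1)                  # predicted distribution
p = F.one_hot(targets, num_classes=3).float()     # labeled distribution (one-hot)

manual = -(p * torch.log(q)).sum(dim=1).mean()    # H(p, q) averaged over examples
builtin = F.cross_entropy(logits, targets)

print(manual.item(), builtin.item())
```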


### Dropout

* Regularization technique
* At each training step, individual units, with their incoming and outgoing edges, are randomly dropped from the network structure.

* This helps avoid overfitting by reducing the interdependency of the units.

![](Dropout.png)
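A short sketch of how `nn.Dropout` behaves, using an assumed drop probability of 0.5 and a made-up input of ones. In training mode each value is either zeroed or scaled by 1 / (1 - p) so the expected activation stays the same; in evaluation mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)           # assumption: fixed seed for repeatability
drop = nn.Dropout(p=0.5)       # each unit is dropped with probability 0.5
x = torch.ones(1, 10)          # hypothetical activations

drop.train()                   # training mode: dropout is active
train_out = drop(x)            # entries are 0 or 1 / (1 - 0.5) = 2

drop.eval()                    # evaluation mode: dropout is disabled
eval_out = drop(x)             # identical to x

print(train_out)
print(eval_out)
```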
    