# Convolutional Neural Networks: Training

Convolutional Neural Networks (CNNs) are neural network architectures that contain *filters*, which we convolve with the inputs to create a new type of layer.

<!-- This allows for *feature extraction*, such as the line finding done in our previous examples.-->


The convolutional neural network architecture was first described by Kunihiko Fukushima in 1980 (!). With the advent of GPUs in the 2000s, the method became far more popular, with the first implementation on GPUs with backpropagation published in 2011 by Dan C. Cireşan et al.
<!-- in grad school, this was the THING -->





# Convolutional layers

![simple CNN](CNN_simple.png)

We add convolution layers to our NN model to extract features, which are then passed to a fully connected neural network. The convolution layers are therefore typically placed *before* the fully connected layers. Our goal is to learn the weights of the fully connected layers **and** the filter values. In other words, we are not *predefining* the filters as before; we are *learning* the filters during training.


<!-- i hope we saw yesterday that automatic differentiation allows us to define our backprop in a modular way, and swap out portions of the computational graph far more easily than if we were writing out the entire function. -->

![backpropagation](backprop.png)

# Backpropagation

Our training procedure is exactly the same as previously described for neural networks. However, the mathematics changes because we now backpropagate through the convolutional layers, so let's take a closer look at that for a 1D convolutional filter with a stride of 1.


![1D convolution](1d_conv.png)

We can write this operation as a computational graph:

![computational graph](comp_graph.png)

representing:
$$\vec{o}=\vec{x}\circledast \vec{F}$$


Where
$$o[n]=(x\circledast F)[n]= \sum_{i=-\omega}^{\omega} x[i+n+\omega]\, F[i+\omega]$$
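
As a minimal sketch of the sum above, assuming NumPy and a hypothetical helper name `conv1d_valid`, the loop below evaluates $o[n]$ directly for a stride of 1 with no padding. The filter index is shifted to start at 0 instead of $-\omega$, which gives the same sum. Because the filter is not flipped in this definition, the result should match what NumPy calls `np.correlate`.

```python
import numpy as np

def conv1d_valid(x, F):
    """o[n] = sum_j x[n + j] * F[j], stride 1, no padding (hypothetical helper)."""
    K = len(F)              # filter length, 2*omega + 1 in the formula above
    M = len(x) - K + 1      # number of output positions
    o = np.zeros(M)
    for n in range(M):
        for j in range(K):
            o[n] += x[n + j] * F[j]
    return o

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
F = np.array([1.0, 0.0, -1.0])
o = conv1d_valid(x, F)
print(o)
# compare against NumPy's sliding dot product
print(np.allclose(o, np.correlate(x, F, mode="valid")))
```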

![1D convolution](1d_conv.png)

Our goal is to update each $F_i$:
$$F_i \leftarrow F_i - \alpha \frac{\partial \mathcal{L}}{\partial F_i}$$

where, by the chain rule, the sum runs over all $M$ outputs $o_0,\dots,o_{M-1}$:
$$\frac{\partial \mathcal{L}}{\partial F_i} = \sum_{k=0}^{M-1}\frac{\partial \mathcal{L}}{\partial o_k}\frac{\partial o_k}{\partial F_i}$$
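
To make the update rule concrete, here is a rough sketch that learns a 3-tap filter by gradient descent. Everything specific in it is an assumption for illustration: the random input `x`, the "unknown" filter `F_true`, the mean-squared-error loss, the learning rate `alpha`, and the number of steps. It uses $\partial o_k/\partial F_i = x_{k+i}$, which follows from the definition of $o$ when the filter is indexed from 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)               # input signal (made up)
F_true = np.array([1.0, -2.0, 0.5])    # filter we pretend not to know
K = len(F_true)
M = len(x) - K + 1                     # number of outputs o_k

def forward(x, F):
    # o_k = sum_i F_i * x_{k+i}
    return np.array([np.dot(F, x[k:k + len(F)])
                     for k in range(len(x) - len(F) + 1)])

target = forward(x, F_true)

F = np.zeros(K)    # initial guess for the learned filter
alpha = 0.1        # learning rate
for step in range(500):
    o = forward(x, F)
    # assumed loss: L = 1/(2M) * sum_k (o_k - target_k)^2
    dL_do = (o - target) / M
    # chain rule: dL/dF_i = sum_k dL/do_k * do_k/dF_i, with do_k/dF_i = x_{k+i}
    dL_dF = np.array([np.dot(dL_do, x[i:i + M]) for i in range(K)])
    F = F - alpha * dL_dF              # F_i <- F_i - alpha * dL/dF_i

print(F)  # should end up close to F_true
```

In a real CNN the quantity `dL_do` would come from the layers downstream of the convolution rather than from a loss applied directly to `o`.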





### Let's calculate $\frac{\partial \mathcal{L}}{\partial F_i}$ explicitly for $F_1$ on a smaller architecture:

![1D convolution](1d_conv_sm.png)

$$\frac{\partial \mathcal{L}}{\partial F_i} = \sum_{k=0}^{M-1}\frac{\partial \mathcal{L}}{\partial o_k}\frac{\partial o_k}{\partial F_i}$$

$$\frac{\partial \mathcal{L}}{\partial F_1} = \frac{\partial \mathcal{L}}{\partial o_0}\frac{\partial o_0}{\partial F_1} + \frac{\partial \mathcal{L}}{\partial o_1}\frac{\partial o_1}{\partial F_1}$$



![1D convolution](1d_conv_sm.png)

We only need to find the $\frac{\partial o_i}{\partial F_j}$ terms because, when using automatic differentiation, $\frac{\partial \mathcal{L}}{\partial o_i}$ is passed back to us from the layers to the right of this one in the computational graph.

$$o_0 = F_0x_0 + F_1x_1 $$
$$o_1 = F_0x_1 + F_1x_2 $$



$$\frac{\partial o_0}{\partial F_1} = x_1$$
$$\frac{\partial o_1}{\partial F_1} = x_2$$
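
A quick numerical check of these two derivatives, under assumed values: the numbers in `x` and `F` and the toy loss $\mathcal{L} = \frac{1}{2}\sum_k o_k^2$ are made up for illustration (the loss is chosen so that $\partial \mathcal{L}/\partial o_k = o_k$). The analytic gradient for $F_1$ is compared with a central finite difference.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # x_0, x_1, x_2 (made-up values)
F = np.array([0.5, -1.0])       # F_0, F_1 (made-up values)

def outputs(F):
    o0 = F[0] * x[0] + F[1] * x[1]
    o1 = F[0] * x[1] + F[1] * x[2]
    return np.array([o0, o1])

def loss(F):
    # assumed toy loss, chosen so that dL/do_k = o_k
    return 0.5 * np.sum(outputs(F) ** 2)

o = outputs(F)
dL_do = o                        # what would be "given from the right"
# dL/dF_1 = dL/do_0 * x_1 + dL/do_1 * x_2
dL_dF1 = dL_do[0] * x[1] + dL_do[1] * x[2]

# central finite difference in F_1 as an independent check
eps = 1e-6
step = np.array([0.0, eps])
dL_dF1_numeric = (loss(F + step) - loss(F - step)) / (2 * eps)

print(dL_dF1, dL_dF1_numeric)
```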





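In fact, backpropagation through a convolutional layer is itself a discrete convolution. The gradient with respect to the filter is the input convolved with the gradient arriving from the output:


$$\frac{\partial \mathcal{L}}{\partial \vec{F}} = \vec{x} \circledast \frac{\partial \mathcal{L}}{\partial \vec{o}}$$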
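
The sketch below checks this claim numerically on made-up data. The array sizes and the random stand-in for $\partial \mathcal{L}/\partial \vec{o}$ are assumptions. It compares the explicit chain-rule sum with `np.correlate`, which implements the same unflipped sliding product that $\circledast$ denotes in this section.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)          # layer input (made up)
K = 3                           # filter length
M = len(x) - K + 1              # number of outputs
dL_do = rng.normal(size=M)      # stand-in for the upstream gradient

# explicit chain-rule sum: dL/dF_i = sum_k dL/do_k * x_{k+i}
explicit = np.array([sum(dL_do[k] * x[k + i] for k in range(M))
                     for i in range(K)])

# the same numbers obtained by sliding dL/do along x
as_convolution = np.correlate(x, dL_do, mode="valid")

print(np.allclose(explicit, as_convolution))
```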

If we have layers *before* this one, we will also need to compute $\frac{\partial \mathcal{L}}{\partial x_i}$ to continue the backpropagation.

$$\frac{\partial \mathcal{L}}{\partial x_i} = \sum_{k=0}^{M-1}\frac{\partial \mathcal{L}}{\partial o_k}\frac{\partial o_k}{\partial x_i}$$
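
For the 1D, stride-1 case, $\partial o_k/\partial x_i = F_{i-k}$ when $0 \le i-k$ is a valid filter index, and 0 otherwise. The sketch below evaluates the sum explicitly and compares it with a zero-padded ("full") convolution of the upstream gradient with the filter. The data and the stand-in upstream gradient are made up. Note that in this direction the filter is flipped, so `np.convolve` is the matching NumPy call rather than `np.correlate`.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=6)          # layer input (only its length matters here)
F = rng.normal(size=3)          # filter
N, K = len(x), len(F)
M = N - K + 1                   # number of outputs
dL_do = rng.normal(size=M)      # stand-in for the upstream gradient

# explicit sum: do_k/dx_i = F_{i-k} when 0 <= i-k < K, and 0 otherwise
dL_dx = np.zeros(N)
for i in range(N):
    for k in range(M):
        if 0 <= i - k < K:
            dL_dx[i] += dL_do[k] * F[i - k]

# the same result as a "full" convolution of dL/do with the filter
dL_dx_conv = np.convolve(dL_do, F, mode="full")

print(np.allclose(dL_dx, dL_dx_conv))
```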




# Activation function

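Typically, an activation function is then applied to *each element* of the feature map. ReLU is the most common choice for CNNs: they are typically deep structures, and ReLU's gradient does not shrink for positive inputs, which helps against vanishing gradients.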
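
A minimal sketch of the element-wise operation and its backward pass, assuming NumPy. The helper names `relu` and `relu_backward` and the sample values are hypothetical.

```python
import numpy as np

def relu(z):
    # applied independently to every element of the feature map
    return np.maximum(z, 0.0)

def relu_backward(z, dL_da):
    # derivative is 1 where z > 0 and 0 elsewhere
    # (using 0 at exactly z == 0 is a convention)
    return dL_da * (z > 0)

feature_map = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(feature_map)
grad = relu_backward(feature_map, np.ones_like(feature_map))
print(activated)
print(grad)
```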

![ReLU](relu.png)



![simple CNN](CNN_simple.png)

# Pooling

![pooling](pooling.png)
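
As a rough sketch of one common variant, the code below applies 2×2 max pooling with a stride of 2 to a small feature map. The window size, the stride, the choice of max rather than average, and the helper name `max_pool_2x2` are all assumptions for illustration, not taken from the figure.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    H, W = feature_map.shape
    # split into non-overlapping 2x2 blocks, then take the max of each block
    blocks = feature_map.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each output value summarizes one 2×2 block, so the feature map shrinks by a factor of 2 in each dimension.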

# GoogLeNet

![GoogLeNet](googlenet.png)


<!-- depthconcat -->

# Umm, how do I pick an architecture?

How many layers? 

How many nodes per layer?

How many convolution layers?

What activation function(s)?

Dropout...Y/N?

# Pre-trained architectures

| CNN Architectures |
|---|
| AlexNet |
| VGG |
| ResNets |
| Inception |
| Xception |

```python
import matplotlib.pyplot as plt
import numpy as np
# VGG16 is not used below; other pre-trained architectures are loaded the same way
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# load ResNet50 with weights pre-trained on ImageNet
model = ResNet50(weights='imagenet')

# load the image at the 224x224 size used here for ResNet50
img_path = 'landis.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)  # float array with values in [0, 255]

# imshow expects floats in [0, 1] or integers in [0, 255], so cast for display
plt.imshow(x.astype('uint8'))

# add a batch dimension and apply the ResNet50-specific preprocessing
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

preds = model.predict(x)
# decode the results into a list of tuples (class, description, probability)
# (one such list for each sample in the batch)
print('Predicted:', decode_predictions(preds, top=3)[0])
```
