# Introduction to Convolutional Neural Networks

---
## Introduction
In this post, we are going to take a first look at convolutional neural networks, also known as _convnets_. Convnets are a deep-learning model particularly well suited for computer vision problems. Over the next few posts, we are going to look at how to create and train convnets using Keras. We will also look at some techniques for visualizing what convnets are learning, and we will be able to use convnets to generate new art. 

For additional information on convnets, take a look at these resources:
+ [Coursera – Deep Learning AI – Convolutional Neural Networks](https://www.coursera.org/learn/convolutional-neural-networks/home/welcome)
+ [Stanford – CS231n – Convolutional Neural Networks for Image Recognition](http://cs231n.stanford.edu/)

---
## Two Fundamental Operations
There are two main operations we will use with convnets: convolutions and pooling.

### Convolutions
Convolutions operate over 3D tensors – tensors with shape `(height, width, depth)` such as RGB images. The convolution operation takes a filter (or kernel), usually 3x3 or 5x5, and slides it over the input tensor computing a dot product at each step as shown in the GIF below.

![Convolution GIF](https://cdn-images-1.medium.com/max/1600/1*VVvdh-BUKFh2pwDD0kPeRA@2x.gif)
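To make the sliding dot product concrete, here is a minimal pure-Python sketch of a 2D convolution with no padding and a stride of 1. The function name `conv2d_valid`, the toy 4x4 input, and the vertical-edge kernel are all made up for illustration; this is not how Keras implements the operation internally.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the image patch under it
            total = bias
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 input with a vertical edge near the right side
image = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
# Hypothetical 3x3 kernel that responds to vertical edges
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0.0, 3.0], [0.0, 3.0]]
```

The output is 2x2 because a 3x3 kernel only fits in two positions along each axis of a 4x4 input. The kernel responds strongly only where the patch contains the edge.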

The filters contain the weights of our neural network. During training, each step of backpropagation updates the filter weights, which allows the convnet to learn filters that extract many different features from the input. The filter above would contain 9 weights since it is 3x3, and it would have a bias shared by the entire filter. Thus, this filter would add 10 _trainable parameters_ to our network. The example above shows a convolution over a 2D tensor, but usually we will be working with 3D tensors as shown below.

![3D Convolution GIF](https://i.stack.imgur.com/FjvuN.gif)

For 3D tensors, our filter will have the same depth as the input. So, in the example above, the filter has shape 4x4x3 and therefore has 48 weights plus a single bias. 
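The per-filter arithmetic can be checked with a tiny helper. The function name `filter_parameters` is made up for this sketch; it simply counts one weight per filter cell plus the single shared bias.

```python
def filter_parameters(height, width, depth):
    """Number of weights in one filter plus its single shared bias."""
    return height * width * depth + 1


print(filter_parameters(3, 3, 1))  # 10: the 3x3 filter over a 2D input
print(filter_parameters(4, 4, 3))  # 49: the 4x4x3 filter (48 weights + 1 bias)
```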

Because filters operate over patches of an image, they are able to learn local features such as the presence of an edge or a certain texture. Furthermore, because the same filter slides over the whole image, a pattern learned in the top-left corner can also be recognized in the bottom-right corner. This weight sharing is what makes convnets efficient for image processing.

### Pooling

The other primary operation used in a convnet is pooling. Pooling is used to aggressively downsample the input size. For example, in the GIF below we use a 2x2 pooling filter with a stride of 2 (the filter moves 2 steps at a time) to create an output with half the height and half the width of the input. At each step, the filter looks at its inputs and takes the maximum value. This is known as max-pooling, and it is the most commonly used pooling method. Note: max-pooling has no learnable weights – it is a fixed operation.

![Max Pool GIF](https://developers.google.com/machine-learning/practica/image-classification/images/maxpool_animation.gif)
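Here is a small pure-Python sketch of the same idea: a 2x2 max-pooling window with a stride of 2. The function name `max_pool2d` and the 4x4 input values are invented for this example.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, moving `stride` steps at a time."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 input
feature_map = [
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 1, 7, 2],
    [0, 6, 3, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [6, 8]]
```

The 4x4 input becomes a 2x2 output, and no weights are involved anywhere in the function.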

Pooling is useful because it reduces the size of the input to the next layer of the network. Smaller feature maps mean fewer weights in the dense layers that follow and less computation overall, which makes the network cheaper to train. Furthermore, pooling allows filters in the next layer to "see" more of the image at once. This enables deeper layers to learn increasingly abstract filters. For example, the first layer of a convnet may detect edges and the last layer of a convnet may detect faces and animals.
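One way to quantify how much of the image a filter "sees" is its receptive field: the width of the input region that influences a single output value. The sketch below uses the standard recurrence for stacked layers. The function name `receptive_field` and the two toy layer stacks are assumptions made for illustration.

```python
def receptive_field(layers):
    """Return the receptive field width after each (kernel_size, stride) layer."""
    size = 1  # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring outputs
    sizes = []
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes


conv = (3, 1)  # 3x3 convolution, stride 1
pool = (2, 2)  # 2x2 pooling, stride 2

with_pooling = receptive_field([conv, pool, conv])
without_pooling = receptive_field([conv, conv])

print(with_pooling)     # [3, 4, 8]
print(without_pooling)  # [3, 5]
```

With a pooling layer in between, the second 3x3 convolution covers an 8-pixel-wide region of the input instead of a 5-pixel-wide one.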

---
## Creating a Convnet
Let's start by creating a simple convnet with Keras.

In [31]:
from keras import layers 
from keras import models 

model = models.Sequential() 

# 1. Conv -> Max Pool, 16 filters, 3x3 
model.add(layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))

# 2. Conv -> Max Pool, 32 filters, 3x3
model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

# 3. Conv, 64 filters, 3x3 
model.add(layers.Conv2D(64, (3,3), activation='relu'))

# Flatten to 1D tensor for a FCNN 
model.add(layers.Flatten())

# 4. Dense Layer, 64 nodes
model.add(layers.Dense(64, activation='relu'))

# 5. Softmax Output
model.add(layers.Dense(10, activation='softmax'))

We'll output a model summary to look at this layer-by-layer.

In [32]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_16 (Conv2D)           (None, 26, 26, 16)        160       
_________________________________________________________________
max_pooling2d_15 (MaxPooling (None, 13, 13, 16)        0         
_________________________________________________________________
conv2d_17 (Conv2D)           (None, 11, 11, 32)        4640      
_________________________________________________________________
max_pooling2d_16 (MaxPooling (None, 5, 5, 32)          0         
_________________________________________________________________
conv2d_18 (Conv2D)           (None, 3, 3, 64)          18496     
_________________________________________________________________
flatten_7 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 64)                36928     
__________

### Trainable Parameters in a Convnet
Let's see how the number of parameters in each layer is computed.

First, note that our input size is `(28, 28, 1)`. Since the depth is 1, that means we are working with grayscale rather than RGB images. Furthermore, it means that our filters in the first layer, `conv2d_1` (listed as `conv2d_16` in the summary above, because Keras keeps counting layers created earlier in the same session), will also have depth 1. Thus, our first layer has 16 filters with dimension 3x3x1. Each filter also has a single, shared bias, so we have 16 biases in total. It follows that 

$$
\begin{align}
\text{# parameters} &= 16 \times (3\times3\times1) + 16 \\
&= 144 + 16 \\ 
&= 160
\end{align}
$$

Our second convolutional layer, `conv2d_2`, has inputs with shape `(13, 13, 16)`. Since filters always have the same depth as their input, each filter in this layer will have dimension 3x3x16. Each filter still only has one shared bias. Since there are 32 filters in this layer, it follows that

$$
\begin{align}
\text{# parameters} &= 32 \times (3\times3\times16) + 32 \\
&= 4608 + 32 \\ 
&= 4640
\end{align}
$$

Our third convolutional layer, `conv2d_3`, has inputs with shape `(5, 5, 32)`. Thus, each filter in this layer will have dimension 3x3x32. Since this layer has 64 filters, it follows that

$$
\begin{align}
\text{# parameters} &= 64 \times (3\times3\times32) + 64 \\
&= 18432 + 64 \\ 
&= 18496
\end{align}
$$

Then, we unroll our output into a 1D tensor like we would use with a densely connected network. Since the output of the previous operation is `(3, 3, 64)` our unrolled tensor will have dimension `(576,)`. We then feed this into a densely connected layer with 64 nodes. Since every node is connected to every input, and there is one bias per node, we get

$$
\begin{align}
\text{# parameters} &= 576 \times 64 + 64 \\
&= 36864 + 64 \\ 
&= 36928
\end{align}
$$

Finally, our dense layer is connected to a softmax layer with 10 nodes. Again, every input is connected to every output, and there is one bias per node in the softmax layer. Thus, we get

$$
\begin{align}
\text{# parameters} &= 64 \times 10 + 10 \\
&= 640 + 10 \\ 
&= 650
\end{align}
$$
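The five calculations above can be reproduced with two small helpers. The names `conv_params` and `dense_params` are made up for this sketch, and the total at the end is just the sum of the per-layer counts derived above.

```python
def conv_params(n_filters, kernel_h, kernel_w, input_depth):
    """Weights for every filter plus one shared bias per filter."""
    return n_filters * (kernel_h * kernel_w * input_depth) + n_filters


def dense_params(n_inputs, n_nodes):
    """One weight per input-node pair plus one bias per node."""
    return n_inputs * n_nodes + n_nodes


layer_params = {
    "conv 1": conv_params(16, 3, 3, 1),
    "conv 2": conv_params(32, 3, 3, 16),
    "conv 3": conv_params(64, 3, 3, 32),
    "dense": dense_params(576, 64),
    "softmax": dense_params(64, 10),
}

for name, count in layer_params.items():
    print(name, count)

total = sum(layer_params.values())
print("total", total)  # total 60874
```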

### Convnet Architecture 
The architecture above is representative of a typical convnet. It is common to see one or more convolutional layers followed by a max-pooling layer repeated throughout the network. Furthermore, as we go deeper into the network we typically see the `height` and `width` decrease while the `depth` increases. Finally, the last few layers are typically densely connected layers which learn from the features extracted in the convolutional base.
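The shrinking `height` and `width` and growing `depth` can be traced by hand. This sketch assumes what the model above uses: 3x3 convolutions with no padding and a stride of 1, and 2x2 pooling that rounds down. The helper names `conv_shape` and `pool_shape` are invented here.

```python
def conv_shape(shape, n_filters, kernel=3):
    """Output shape of an unpadded, stride-1 convolution."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, n_filters)


def pool_shape(shape, pool=2):
    """Output shape of non-overlapping pooling (odd sizes round down)."""
    height, width, depth = shape
    return (height // pool, width // pool, depth)


shapes = [(28, 28, 1)]
shapes.append(conv_shape(shapes[-1], 16))  # (26, 26, 16)
shapes.append(pool_shape(shapes[-1]))      # (13, 13, 16)
shapes.append(conv_shape(shapes[-1], 32))  # (11, 11, 32)
shapes.append(pool_shape(shapes[-1]))      # (5, 5, 32)
shapes.append(conv_shape(shapes[-1], 64))  # (3, 3, 64)

height, width, depth = shapes[-1]
flattened = height * width * depth
print(shapes)
print(flattened)  # 576
```

These shapes match the `Output Shape` column of the model summary, including the 576 inputs to the first dense layer.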

--- 
## Training a Convnet
Training a convnet is similar to training a fully connected neural network (FCNN). Let's load MNIST and train our convnet with it.

In [33]:
from keras.datasets import mnist
from keras.utils import to_categorical

# Load
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Scale: Note we reshape to 3D tensors with depth 1
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

# One-hot labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
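For intuition, the two preprocessing steps can be sketched in plain Python. The helpers `scale_pixels` and `one_hot` are hypothetical stand-ins written for this example. They mimic the effect of dividing by 255 and of `to_categorical`, but they are not the Keras implementation.

```python
def scale_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return [pixel / 255 for pixel in pixels]


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


scaled = scale_pixels([0, 51, 255])
encoded = one_hot([3, 0], num_classes=4)

print(scaled)   # [0.0, 0.2, 1.0]
print(encoded)  # [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```

For MNIST, `num_classes` would be 10, so each label becomes a row of ten values with a single 1.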

Training the network will take approximately 30 seconds per epoch on a CPU.

In [34]:
# First we compile
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Run 5 epochs on the entire training set
history = model.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
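The `categorical_crossentropy` loss used above compares the one-hot label with the predicted probabilities. A minimal sketch for a single example follows. The function here is a simplified stand-in, the `eps` clipping value is an assumption, and the prediction vectors are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one example: -sum(true * log(predicted))."""
    # eps (an assumed value) guards against taking log(0)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 0.0, 1.0]         # the correct class is index 2
confident = [0.05, 0.05, 0.90]   # hypothetical confident, correct prediction
unsure = [0.40, 0.30, 0.30]      # hypothetical uncertain prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident)
print(loss_unsure)
```

Only the probability assigned to the correct class contributes to the loss, so the confident prediction gets a much lower loss than the uncertain one. During training the loss is averaged over each batch.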


In [35]:
_, test_acc = model.evaluate(test_images, test_labels)
print('Test Accuracy:', test_acc)

Test Accuracy: 0.9891


### Summary
Training accuracy was 99.2% and test accuracy was 98.9%. Not bad! In the next section, we will put together a slightly more complicated architecture and see how we do. Just in case this model is better, we may want to save it. Keras makes it easy to save a model for future use.

In [36]:
model.save('cnn_mnist_v1.h5')

---
## A LeNet-5 Inspired Model

From [Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network)

> LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998, that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32 pixel images. The ability to process higher resolution images requires larger and more layers of convolutional neural networks, so this technique is constrained by the availability of computing resources

LeNet-5 is a fairly simple architecture for digit recognition. In this section we will recreate it using Keras. First, let's go over the architecture.

### LeNet-5 Architecture
Below is a diagram of the LeNet-5 architecture. 
![LeNet-5 Diagram](https://cdn-images-1.medium.com/max/2000/1*1TI1aGBZ4dybR6__DI9dzA.png)

#### Input Layer
MNIST digits are 28x28 pixels. LeNet-5 pads these images to be 32x32 – padding simply means zeros are added around the border to make the image 32x32. 
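A quick sketch of zero padding in pure Python is shown below. The function name `zero_pad` and the all-ones stand-in image are invented for this example. Padding by 2 on every side is what turns 28x28 into 32x32.

```python
def zero_pad(image, pad):
    """Surround a 2D image with `pad` rows and columns of zeros."""
    padded_width = len(image[0]) + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]         # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)      # left and right borders
    padded.extend([0] * padded_width for _ in range(pad))     # bottom border
    return padded


# Stand-in for a 28x28 MNIST digit (all ones so the border is easy to spot)
digit = [[1] * 28 for _ in range(28)]
padded = zero_pad(digit, 2)

print(len(padded), len(padded[0]))  # 32 32
```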

#### conv2d_1
The first convolutional layer uses six 5x5 filters to get a `(28, 28, 6)` output.

#### max_pooling2d_1
The actual LeNet-5 model uses an average pooling layer with a 2x2 filter and stride of 2. This means it halves the height and width of the input image. We will replace this with a max pooling layer for a `(14, 14, 6)` output. 
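To see how the two pooling choices differ, here is a small sketch that applies both to the same input. The function `pool2d`, the helper `mean`, and the 4x4 values are made up for illustration; only the choice of reduction changes between the two calls.

```python
def pool2d(feature_map, reduce, size=2, stride=2):
    """Apply `reduce` to each size x size window, moving `stride` steps at a time."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 input
feature_map = [
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 1, 7, 2],
    [0, 6, 3, 8],
]

average_pooled = pool2d(feature_map, mean)
max_pooled = pool2d(feature_map, max)

print(average_pooled)  # [[2.5, 1.0], [3.0, 5.0]]
print(max_pooled)      # [[4, 2], [6, 8]]
```

Both versions halve the height and width. Average pooling blends every value in the window, while max pooling keeps only the strongest response.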

#### conv2d_2
The second convolutional layer uses sixteen 5x5 filters for a `(10, 10, 16)` output. 

#### max_pooling2d_2 
Again, this was originally an average pooling layer, but we will use max pooling for a `(5, 5, 16)` output. 

#### flatten_1
We will unroll the previous output to a tensor with shape `(400,)`.

#### dense_1
Next, there is a dense layer with 120 nodes. 

#### dense_2
Then, there is a dense layer with 84 nodes.

#### softmax 
The original LeNet-5 model used an output layer of 10 Euclidean radial basis function (RBF) units, one per digit class. We will replace this with a dense layer with 10 nodes and softmax activation. 

#### Additional Notes
As mentioned already, the original LeNet-5 model used average pooling layers. It also applied non-linearities after each pooling layer, which is not so common. At the time, sigmoid and tanh non-linearities were more common, but we will use relu and softmax instead. 
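For reference, the non-linearities mentioned above are easy to write out. This is a plain-Python sketch for single values and short lists, not the vectorized versions Keras uses; the input scores are arbitrary.

```python
import math


def sigmoid(x):
    """Squash x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive values through and clamp negatives to zero."""
    return max(0.0, x)


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))    # 0.5
print(math.tanh(0.0))  # 0.0
print(relu(-2.0))      # 0.0
print(relu(3.0))       # 3.0

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Sigmoid and tanh flatten out for large inputs, while ReLU stays linear for positive values. Softmax is used on the output layer because its results can be read as class probabilities.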

### Creating the Model
Let's create the model described above. We'll name the model `LeNot` because it's like LeNet-5... but it's not.

In [27]:
from keras import layers 
from keras import models 

LeNot = models.Sequential() 

# Padding Layer: Force input to be 32x32x1
LeNot.add(layers.ZeroPadding2D(padding=(2, 2), input_shape=(28, 28, 1)))

# 1. Conv -> Pool, 6 filters, 5x5 each
LeNot.add(layers.Conv2D(6, (5, 5), activation='relu'))
LeNot.add(layers.MaxPooling2D((2, 2)))

# 2. Conv -> Pool, 16 filters, 5x5 each
LeNot.add(layers.Conv2D(16, (5, 5), activation='relu'))
LeNot.add(layers.MaxPooling2D((2, 2)))

# Flatten to 1D tensor for a FCNN 
LeNot.add(layers.Flatten())

# 3. Dense Layer with 120 nodes
LeNot.add(layers.Dense(120, activation='relu'))

# 4. Dense Layer with 84 nodes
LeNot.add(layers.Dense(84, activation='relu'))

# 5. Softmax Output
LeNot.add(layers.Dense(10, activation='softmax'))

In [28]:
LeNot.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
zero_padding2d_9 (ZeroPaddin (None, 32, 32, 1)         0         
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 28, 28, 6)         156       
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 14, 14, 6)         0         
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 10, 10, 16)        2416      
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 5, 5, 16)          0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 400)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 120)               48120     
__________
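As a cross-check on the summary, the output shapes and parameter counts can be derived from the architecture alone. The sketch below uses a made-up `trace` helper and a hand-written layer list that mirrors the `LeNot` definition. The figures for the last two dense layers are computed from the architecture rather than copied from the Keras output.

```python
def trace(input_shape, layers):
    """Return (kind, output_shape, n_params) for each layer spec."""
    shape = input_shape
    rows = []
    for spec in layers:
        kind = spec[0]
        params = 0
        if kind == "pad":
            height, width, depth = shape
            shape = (height + 2 * spec[1], width + 2 * spec[1], depth)
        elif kind == "conv":
            _, n_filters, k = spec
            height, width, depth = shape
            params = n_filters * (k * k * depth) + n_filters
            shape = (height - k + 1, width - k + 1, n_filters)
        elif kind == "pool":
            height, width, depth = shape
            shape = (height // spec[1], width // spec[1], depth)
        elif kind == "flatten":
            height, width, depth = shape
            shape = (height * width * depth,)
        elif kind == "dense":
            params = shape[0] * spec[1] + spec[1]
            shape = (spec[1],)
        rows.append((kind, shape, params))
    return rows


# Hand-written mirror of the LeNot layers defined above
lenot_layers = [
    ("pad", 2),
    ("conv", 6, 5),
    ("pool", 2),
    ("conv", 16, 5),
    ("pool", 2),
    ("flatten",),
    ("dense", 120),
    ("dense", 84),
    ("dense", 10),
]

rows = trace((28, 28, 1), lenot_layers)
for kind, shape, params in rows:
    print(kind, shape, params)

total = sum(params for _, _, params in rows)
print("total", total)  # total 61706
```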

Let's compile the model and run it for 10 epochs.

In [37]:
# First we compile
LeNot.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Run 10 epochs on the entire training set
history = LeNot.fit(train_images, train_labels, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Let's check our test set performance 😄

In [38]:
_, test_acc = LeNot.evaluate(test_images, test_labels)
print('Test Accuracy:', test_acc)

Test Accuracy: 0.9911


### Summary

With 10 epochs, we're able to get 99.7% training accuracy and 99.11% test accuracy. We can easily save the model for future use.

In [49]:
LeNot.save('LeNot-mnist.h5')

---
## Review

In this post you've learned how to 
+ Create a convnet from scratch in Keras 
+ Calculate the number of trainable parameters in a convnet 
+ Use `Conv2D`, `MaxPooling2D`, `Flatten`, and `ZeroPadding2D` layers
+ Save models for future use using Keras 

If you would like to see the original LeNet-5 paper:
+ [Gradient-based learning applied to document recognition](https://ieeexplore.ieee.org/document/726791) 

For further reference, check out the following
+ [Coursera – Deep Learning AI – Convolutional Neural Networks](https://www.coursera.org/learn/convolutional-neural-networks/home/welcome)
+ [Stanford – CS231n – Convolutional Neural Networks for Image Recognition](http://cs231n.stanford.edu/)