# How Convnets Work
Let's get our data and rebuild the convnet from cnn_intro.ipynb.

First the data:

In [1]:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

short = False
if short:
    train_images = train_images[:7000,:]
    train_labels = train_labels[:7000]
    test_images = test_images[:3000,:]
    test_labels = test_labels[:3000]
#
print("Train info",train_images.shape, train_labels.shape)
print("Test info",test_images.shape, test_labels.shape)
train_images = train_images.reshape((train_images.shape[0],28*28))
train_images = train_images.astype('float32')/255

test_images = test_images.reshape((test_images.shape[0],28*28))
test_images = test_images.astype('float32')/255
from keras.utils import to_categorical

train_labels_cat = to_categorical(train_labels)
test_labels_cat = to_categorical(test_labels)


Using TensorFlow backend.


Train info (60000, 28, 28) (60000,)
Test info (10000, 28, 28) (10000,)


## Next the network

In [2]:
from keras import models
from keras import layers
#
# Make sure the shape of the input is correct (the last ",1" is the number of "channels"=1 for grayscale)
train_images = train_images.reshape((train_images.shape[0],28,28,1))
test_images = test_images.reshape((test_images.shape[0],28,28,1))
#
cnn_network = models.Sequential()
#
# First convolutional layer
cnn_network.add(layers.Conv2D(30,(5,5),activation='relu',input_shape=(28,28,1)))
# Pool
cnn_network.add(layers.MaxPooling2D((2,2)))
#
# Second convolutional layer
cnn_network.add(layers.Conv2D(25,(5,5),activation='relu'))
# Pool
cnn_network.add(layers.MaxPooling2D((2,2)))
#
# Connect to a dense output layer - just like a fully connected network
cnn_network.add(layers.Flatten())
cnn_network.add(layers.Dense(64,activation='relu'))
cnn_network.add(layers.Dense(10,activation='softmax'))
#
# Compile
cnn_network.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])
#
# Fit/save/print summary
#history = cnn_network.fit(train_images,train_labels_cat,epochs=5,batch_size=256,validation_data=(test_images,test_labels_cat))
#cnn_network.save('fully_trained_model_cnn.h5')
#
# Instead of re-training, just load the network from disk!
# (This assumes 'fully_trained_model_cnn.h5' was saved by an earlier training run.)
cnn_network = models.load_model('fully_trained_model_cnn.h5')

# summary() prints the table itself and returns None, so no print() is needed
cnn_network.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 24, 24, 30)        780       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 30)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 25)          18775     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 25)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                25664     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                650       
Total para

# How the CNN Works
Much of the discussion here is based on this [article](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53).

Here is a picture of the above network:
![alt text](files/convet_graphic_mnist.jpeg "Title")

Let's walk through the network from left to right:
1.  The input image is 28x28x1.  The 28x28 is the height and width of the image in pixels.   The "x1" is the number of channels: typical color images have 3 channels (red, green, and blue), while our MNIST images are **grayscale** and have only 1 channel.
2.  The first **convolution layer** uses a 5x5 kernel.   Jump ahead to the next section to see how the kernel works.  In the image, you see that the output of the convolution layer is a tensor of size (24 x 24 x n1), where **n1** refers to the number of filters in the convolution layer.  In our case, n1 is equal to 30 (it could be any number).   You should think of this output as n1=30 images, each of size 24x24.   But the input to the convolutional layer is 28x28!   Why did we lose 4 pixels in each direction?   The next section on kernel application explains this!
    * The number of parameters of the first layer is given by this: (5x5) parameters for each filter, times 30 filters (or kernels), plus a bias unit for each filter: (5x5x30) + 30 = 780
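    * The purpose of the convolutional layer is to detect local features (such as edges or pieces of pen strokes) wherever they occur in the image.  Because the same kernel weights are applied at every position, a feature is found regardless of where it appears.
    * The output of each filter (when applied to a given 5x5 section of the input to that layer) is passed through the **activation** function for that layer.  In CNNs the activation is often "relu", which stands for rectified linear unit: relu(x) = max(0, x).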
    * The output of the convolutional layer is a tensor of size (24x24x30), where 24x24 is the image size, and 30 is the number of filters.   We can think of this as a 24x24 image with 30 channels.
3.  Next there is a **max pooling** layer, which uses a 2x2 kernel.   This passes a 2x2 kernel across the output of the previous layer and keeps the maximum of the 4 pixels it sees.   This results in the image being downscaled.  In our case, each 24x24 "image" becomes a "12x12" image.   Other pooling algorithms are possible, such as **average pooling**, but max pooling is the more common choice in networks like this one.
    * There are no parameters associated with the pooling layer.
    * The output of this max pooling layer is a tensor of shape (12x12x30).   To help understand this, imagine we put a **single** 28x28 image into our network.  The output of this max pooling layer will be a 12x12 "image", with 30 channels!
4. The second convolutional layer also uses a "5x5" kernel, and we have chosen (arbitrarily) to have 25 filters (or kernels).   What this notation hides is that each kernel **must** also have a 3rd dimension, equal to the number of output channels of the previous layer (in this case 30).   So each kernel is actually 5x5x30.  A single 5x5x30 kernel is convolved across the 12x12x30 image, producing an output which is 8x8x1.  Note how the output image again loses 4 pixels in both height and width.
    * In our second convolutional layer, we have 25 different filters, so the full output of the 2nd convolutional layer is 8x8x25 (an 8x8 image, this time with 25 channels).
    * Since this second convolutional layer is acting on a downscaled image, it is sensitive to larger features across the image (while the earlier convolutional layer was sensitive to smaller features in the image).
    * The number of parameters of this layer can be calculated as follows: (5x5)x30 = 750 parameters for each kernel, times 25 kernels, which gives 18750 parameters, plus a bias unit for each of the 25 filters, which yields 18775 parameters.
    * An excellent discussion of how this works can be found [here](https://www.youtube.com/watch?v=KTB_OFoAQcc&index=6&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF).
5.  Again there is a max pooling layer, which serves to reduce the dimensions from (8x8x25) to (4x4x25).
6.  Next we have a **flattening** layer, which simply converts the output of the previous layer from a tensor to an unwrapped vector of length 4x4x25=400.   There are no parameters associated with this operation.
7.  Next we have a fully connected (hidden) layer with 64 nodes (the 64 is arbitrary - it could be 100 or 200 or 50). 
    * The number of parameters associated with this layer is (400x64) weights plus another 64 bias units for a total of 25664 parameters.
    * The activation unit used is again **relu**.
8. Finally we have a fully connected softmax layer, with 10 outputs, one for each digit.
    * The number of parameters associated with this layer is (64x10) weights plus another 10 bias units for a total of 650 parameters.
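
The shape and parameter arithmetic in the walkthrough above can be checked with a few lines of plain Python. This is only a sketch: the helper names (`conv2d`, `max_pool`, `flatten`, `dense`, `describe_network`) are made up for illustration and are not keras functions, and the helpers assume the same settings as our network (no padding, convolution stride of 1, pooling stride equal to the pool size).

```python
def conv2d(shape, kernel, filters):
    """'valid' convolution with stride 1: returns (output shape, parameter count)."""
    height, width, channels = shape
    out_shape = (height - (kernel - 1), width - (kernel - 1), filters)
    # Each filter has kernel*kernel*channels weights plus one bias.
    params = (kernel * kernel * channels + 1) * filters
    return out_shape, params


def max_pool(shape, pool):
    """Pooling with stride equal to the pool size: no trainable parameters."""
    height, width, channels = shape
    return (height // pool, width // pool, channels), 0


def flatten(shape):
    """Unroll a (height, width, channels) tensor into a vector."""
    height, width, channels = shape
    return (height * width * channels,), 0


def dense(shape, units):
    """Fully connected layer: one weight per input per unit, plus one bias per unit."""
    (inputs,) = shape
    return (units,), (inputs + 1) * units


def describe_network():
    """Walk the layers in order and collect (name, output shape, parameters)."""
    layers = [
        ("conv2d_1", lambda s: conv2d(s, kernel=5, filters=30)),
        ("max_pooling2d_1", lambda s: max_pool(s, pool=2)),
        ("conv2d_2", lambda s: conv2d(s, kernel=5, filters=25)),
        ("max_pooling2d_2", lambda s: max_pool(s, pool=2)),
        ("flatten_1", flatten),
        ("dense_1", lambda s: dense(s, units=64)),
        ("dense_2", lambda s: dense(s, units=10)),
    ]
    shape = (28, 28, 1)  # one grayscale MNIST image
    rows = []
    for name, layer in layers:
        shape, params = layer(shape)
        rows.append((name, shape, params))
    return rows


rows = describe_network()
for name, shape, params in rows:
    print(f"{name:16s} {str(shape):16s} {params:6d}")
print("total parameters:", sum(params for _, _, params in rows))
```

Each printed row should agree with the corresponding row of the model summary shown earlier (apart from the leading `None` batch dimension, which this sketch leaves out).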



## The Convolution Kernel (Or Filter)
The picture below is a simple graphic showing a 3x3 kernel (the yellow moving square) applied to a (5x5x1) image (the stationary green square on the left).   All of the pixel values in the image are 1.0, as are the numbers in the kernel, to simplify the math.   Some things to note:
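* The output of the kernel at each position is obtained by multiplying each pixel value under the kernel by the corresponding kernel value, and adding up these products over all pixel:kernel pairs.  With every pixel and every kernel value equal to 1.0, each output value is 9 (nine products of 1.0 x 1.0).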
* The "stride" of the kernel is (1,1) by default for keras Conv2D layers, which simply means that the kernel moves over by 1 pixel after each operation, and then down by 1 pixel when it gets to the end of each row.  The movement of the kernel across the image is the **convolution** operation.
* Note that the output image is 3x3 and **not** 5x5.   This is because the 3x3 kernel runs out of pixels when it gets 2 pixels from the end of each row, as well as the end of each column.   So applying a 3x3 kernel to an (n by m) image results in an output image of size (n-2) by (m-2).  More generally, the output is (n-(p-1)) by (m-(p-1)), where (p by p) is the size of the kernel.
    
![alt text](files/anim_covnet.gif "Title")
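
The kernel operation described above can be written out in plain Python so that every multiply-and-add is visible. This is a minimal sketch, not how keras implements it: the function name `convolve2d_valid` is made up for illustration, it handles a single channel only, and it leaves out the bias and the activation. Like keras, it slides the kernel without flipping it.

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) with no padding."""
    n, m = len(image), len(image[0])
    p, q = len(kernel), len(kernel[0])
    output = []
    # The kernel's top-left corner can only go as far as row n-p and column m-q.
    for i in range(0, n - p + 1, stride):
        row = []
        for j in range(0, m - q + 1, stride):
            total = 0.0
            for a in range(p):
                for b in range(q):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# The setup from the graphic: a 5x5 image and a 3x3 kernel, all values 1.0.
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]

feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
print("output size:", len(feature_map), "x", len(feature_map[0]))
```

Calling the same function with a 28x28 image and a 5x5 kernel gives a 24x24 output, which is the 4-pixel loss seen in our first convolutional layer.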


## Pooling Layer
The graphic below is an example of a 3x3 max pooling kernel applied to a 5x5 image.   Note that it simply outputs the maximum pixel value found in the region covered by the kernel.
* The **strides** in this graphic are (1,1), which means that the kernel moves over by 1 pixel as it moves from left to right, and then down by 1 pixel as it moves from top to bottom.
* If you look at our MaxPooling2D layers, we only give the size of the kernel, which is (2,2).  By default, keras sets the strides of a MaxPooling2D layer to be the **same** as the pooling kernel size, so in our case the kernel moves over by 2 pixels, and down by 2 pixels.   The 2x2 windows tile the image exactly only if it has an even number of pixels in each direction; with the default 'valid' padding, a leftover row or column is dropped (see documentation).   The effect of this for our images is that max pooling **downscales** the images by a factor of 2.
![alt text](files/pooling.gif "Title")
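
Max pooling can be sketched the same way in plain Python. The function name `max_pool2d` is made up for illustration (it is not the keras layer), it works on a single channel, and it assumes no padding. Following the keras default described above, the stride falls back to the pool size when none is given.

```python
def max_pool2d(image, pool=2, stride=None):
    """Keep the maximum of each pool x pool window of `image` (a list of lists)."""
    if stride is None:
        stride = pool  # same default as described above for MaxPooling2D
    n, m = len(image), len(image[0])
    output = []
    for i in range(0, n - pool + 1, stride):
        row = []
        for j in range(0, m - pool + 1, stride):
            window = [image[i + a][j + b] for a in range(pool) for b in range(pool)]
            row.append(max(window))
        output.append(row)
    return output


# 2x2 pooling with stride 2 halves each dimension, as in our network.
image4 = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
pooled4 = max_pool2d(image4, pool=2)
print(pooled4)

# The case in the graphic: a 3x3 window with strides (1,1) on a 5x5 image.
image5 = [[r * 5 + c for c in range(5)] for r in range(5)]
pooled5 = max_pool2d(image5, pool=3, stride=1)
print(pooled5)

# An odd-sized image with 2x2 pooling: the last row and column are dropped.
pooled_odd = max_pool2d(image5, pool=2)
print(len(pooled_odd), "x", len(pooled_odd[0]))
```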
