## In this notebook:
- What is convolution and maxpooling? 
- What are convnets? 
- What do convnets learn? 

### In previous notebook: 
- MNIST Demo
    - Using Convolution Layers
- Code Overview

### In next notebook:
- Training your own small convnets from scratch
- Using data augmentation to mitigate overfitting
- Using a pre-trained convnet to do feature extraction
- Fine-tuning a pre-trained convnet

In [1]:
from datetime import date
date.today()

datetime.date(2017, 12, 10)

In [2]:
author = "NirantK. https://github.com/NirantK/keras-practice"
print(author)

NirantK. https://github.com/NirantK/keras-practice


In [3]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
import keras
keras.__version__

Using TensorFlow backend.


'2.0.8'

In [5]:
import os
if os.name=='nt':
    print('We are on Windows')

We are on Windows


**Let's start by understanding the code we saw in MNIST Demo**

Here is the code again for your reference:

```python
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
```

![](https://adeshpande3.github.io/assets/Cover.png)


### Line 1

```python
model = models.Sequential()
```
Keras supports two API flavours. The first is the ```Sequential``` model, which works best for linear stacks of layers, by far the most common network architecture. The second is the *functional API*, which handles directed acyclic graphs of layers and lets you build arbitrary architectures.

For the foreseeable examples, we will focus on ```Sequential``` models only, but for your reference, here are two code samples that build exactly the same model, first with the Sequential API and then with the functional API.

In [6]:
# Sequential Model
from keras import models
from keras import layers

model_sequential = models.Sequential()
model_sequential.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model_sequential.add(layers.Dense(10, activation='softmax'))

In [7]:
# Same model as above in Functional API
input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)

# Keras 2 names these keyword arguments `inputs` and `outputs`
model_functional_api = models.Model(inputs=input_tensor, outputs=output_tensor)

  


### Line 2
```python
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
```
We add layers to the model object using ```add()```.

Here, we are adding a layer object of type ```Conv2D```. We configure our convnet to process inputs of size ```(28, 28, 1)```, which is the format of MNIST images, by passing the argument ```input_shape=(28, 28, 1)``` to our first layer.

Importantly, a convnet takes as input tensors of shape ```(image_height, image_width, image_channels)``` (not including the batch dimension). 
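
To make the shape convention concrete, here is a minimal NumPy-only sketch. The array `images` is a made-up stand-in for a batch of MNIST digits (all zeros), not the real dataset, and the batch size of 5 is arbitrary.

```python
import numpy as np

# Hypothetical stand-in for a batch of 5 grayscale 28x28 digits.
images = np.zeros((5, 28, 28), dtype="float32")

# Add a trailing channel axis so each sample is (height, width, channels).
images = images.reshape((-1, 28, 28, 1))

batch_shape = images.shape       # includes the batch dimension
sample_shape = images.shape[1:]  # this is what input_shape describes
print(batch_shape, sample_shape)
```

The batch dimension is left out of ```input_shape``` because it can vary from call to call.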

**Why ```Conv2D```?**
Our densely-connected network earlier had a test accuracy of 97.8%, while our basic convnet has a test accuracy of 99%+. This highlights the practical importance of the convolution operation.

Let's dive deeper into ```Conv2D``` and ```MaxPooling2D``` operations. 
Convolution layers learn local patterns, i.e. in the case of images, patterns found in small 2D windows of the inputs. In our example above, these windows were all 3x3.

![](https://brohrer.github.io/images/cnn6.png)
*Figure from [How CNNs work?](https://brohrer.github.io/how_convolutional_neural_networks_work.html)*

Quick aside on Strides and Convolution Operation: 

![](http://deeplearning.net/software/theano/_images/numerical_padding_strides.gif)
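
The sliding-window idea and the effect of strides can be sketched in plain NumPy. This is an illustration only, assuming a single-channel image, a single filter, and no padding. The helper name `conv2d_single` is made up for this example and is not a Keras function. Like deep learning libraries, it does not flip the kernel (strictly speaking, this is cross-correlation).

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            # Element-wise product of window and kernel, summed to one number.
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(28, 28)   # random stand-in for one MNIST digit
kernel = np.random.rand(3, 3)    # random stand-in for one learned filter
print(conv2d_single(image, kernel).shape)            # (26, 26)
print(conv2d_single(image, kernel, stride=2).shape)  # (13, 13)
```

With a 3x3 window and stride 1, a 28x28 input gives a 26x26 map, because the window fits in only 26 positions along each axis.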

**Why is Convolution effective for images?**
- **Translation Invariance**: The patterns convolution layers learn are *translation-invariant*, i.e. after learning a certain pattern in the bottom right corner of a picture, a convnet is able to recognize it anywhere, e.g. in the top left corner.
- **Spatially Hierarchical Features**: 
A first convolution layer will learn small local patterns such as edges, but a second convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts.
![](https://dpzbhybb2pdcj.cloudfront.net/chollet/v-6/Figures/visual_hierarchy_rsd.jpg)

*Figure from Deep Learning with Python by F. Chollet*
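
The translation-invariance point can be checked by hand with a small NumPy sketch. Everything here is made up for illustration: the 3x3 "plus" pattern, the 10x10 images, and the helper `response_map`. The same filter is slid over two images that contain the pattern in opposite corners.

```python
import numpy as np

def response_map(image, kernel):
    """Unpadded, stride-1 sliding-window response of `kernel` on `image`."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-made "plus" pattern; the same array doubles as the filter.
pattern = np.array([[0., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 0.]])

top_left = np.zeros((10, 10))
top_left[0:3, 0:3] = pattern
bottom_right = np.zeros((10, 10))
bottom_right[7:10, 7:10] = pattern

peaks = {}
for name, img in [("top_left", top_left), ("bottom_right", bottom_right)]:
    resp = response_map(img, pattern)
    location = np.unravel_index(np.argmax(resp), resp.shape)
    peaks[name] = (resp.max(), tuple(int(v) for v in location))
    print(name, peaks[name])
```

The peak response is identical in both images and only its location changes, so a filter learned in one corner fires just as strongly anywhere else.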

Convolutions are defined by two key parameters:

- The size of the patches that are extracted from the inputs (typically 3x3 or 5x5). In our example it was always 3x3, which is a very common choice.

- The depth of the output feature map, i.e. the number of filters computed by the convolution. In our example, we started with a depth of 32 and ended with a depth of 64.

In Keras Conv2D layers, these parameters are the first arguments passed to the layer: ```Conv2D(output_depth, (window_height, window_width))```
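
As a rough check on how these two parameters shape the network, the sketch below walks the MNIST model by hand in plain Python. It assumes the Keras defaults of no padding and stride 1 for ```Conv2D```, and non-overlapping windows for ```MaxPooling2D```. The helper names are made up for this example.

```python
def conv_output(size, window=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - window + 1

def pool_output(size, pool=2):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool

def conv_params(in_depth, out_depth, window=3):
    """Weights per filter plus one bias per filter."""
    return (window * window * in_depth + 1) * out_depth

size, depth = 28, 1          # MNIST input: 28x28, one channel
shapes, params = [], []
# (output_depth, followed_by_max_pooling) for the three Conv2D layers
for out_depth, pooled in [(32, True), (64, True), (64, False)]:
    params.append(conv_params(depth, out_depth))
    size, depth = conv_output(size), out_depth
    shapes.append((size, size, depth))
    if pooled:
        size = pool_output(size)
        shapes.append((size, size, depth))

print(shapes)  # [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 64)]
print(params)  # [320, 18496, 36928]
```

The window size controls how much the spatial size shrinks. The output depth sets the number of channels in the next feature map and, with the input depth, the parameter count.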

---

Trivia: What do these layers actually learn? How can you check that? Look at layer/neuron activations across multiple images.

![](https://brohrer.github.io/images/cnn18.png)
Higher layers can represent increasingly sophisticated aspects of the image, such as shapes and patterns. These tend to be readily recognizable. For instance, in a CNN trained on human faces, the highest layers represent patterns that are clearly face-like.

*Figure from [Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations](http://web.eecs.umich.edu/~honglak/icml09-ConvolutionalDeepBeliefNetworks.pdf)*


**Max pooling operation**

In our convnet example, you may have noticed that the size of the feature maps gets halved after every ```MaxPooling2D``` layer. For instance, before the first ```MaxPooling2D``` layer, the feature map is 26x26, but the max pooling operation halves it to 13x13. That's the role of max pooling: to aggressively downsample feature maps, much like strided convolutions.
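
The halving can be reproduced with a few lines of NumPy. This is a sketch for a single 2D feature map with 2x2 windows. The helper `max_pool_2x2` is made up for this example and is not the Keras implementation.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Take the max over non-overlapping 2x2 windows of a 2D feature map."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop a trailing odd row/column, if any
    # Split into (row_block, row_in_block, col_block, col_in_block).
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

small = np.array([[1, 2, 0, 1],
                  [3, 4, 1, 0],
                  [0, 0, 5, 6],
                  [1, 0, 7, 8]])
pooled_small = max_pool_2x2(small)
print(pooled_small)                            # [[4 1]
                                               #  [1 8]]
print(max_pool_2x2(np.zeros((26, 26))).shape)  # (13, 13)
```

Each output value is the largest activation in its 2x2 window, so height and width both halve.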

![](https://adeshpande3.github.io/assets/LeNet.png)
*Figure Courtesy: [Adit Deshpande](https://adeshpande3.github.io), UCLA '19* 

What other downsampling strategies could we use? 
- Use strides in the previous convolution layer
- Use average pooling instead of max pooling, where each local input patch is transformed by taking the average value of each channel over the patch, rather than the max. 

However, max pooling tends to work better than these alternative solutions. 

In a nutshell, the reason for this is that features tend to encode the spatial "presence" of some pattern. These are learnt/encoded over the different tiles of the feature map (hence the term "feature map"). This makes it much more informative to look at the maximal presence of different features than at their average presence. 

So the most reasonable subsampling strategy is to first produce dense maps of features (via unstrided convolutions) and then look at the maximal activation of the features over small patches, rather than looking at sparser windows of the inputs (via strided convolutions) or averaging input patches, which could cause you to miss feature presence information or dilute it.
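
A toy comparison of the three strategies, using a made-up 4x4 feature map with two isolated activations. Taking every second row and column of a dense response map gives the same values a stride-2 convolution would produce, so slicing stands in for striding here.

```python
import numpy as np

# Hypothetical dense feature map: two isolated "pattern present" responses.
feature_map = np.array([[0., 0., 0., 0.],
                        [0., 8., 0., 0.],
                        [0., 0., 0., 0.],
                        [0., 0., 0., 4.]])

# Split into non-overlapping 2x2 blocks.
blocks = feature_map.reshape(2, 2, 2, 2)
max_pooled = blocks.max(axis=(1, 3))    # strongest response per block
avg_pooled = blocks.mean(axis=(1, 3))   # response diluted by the zeros
strided = feature_map[::2, ::2]         # only every second position is seen

print(max_pooled)  # [[8. 0.] [0. 4.]]
print(avg_pooled)  # [[2. 0.] [0. 1.]]
print(strided)     # [[0. 0.] [0. 0.]]
```

In this contrived case max pooling keeps both activations at full strength, averaging shrinks them, and striding misses them entirely.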

**Activation Function: ReLU or Rectified Linear Unit**
![](https://brohrer.github.io/images/cnn8.png)
*Figure from [How CNNs work?](https://brohrer.github.io/how_convolutional_neural_networks_work.html)*
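
ReLU itself is a one-liner. Here is a minimal NumPy sketch with made-up input values:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negatives become 0, everything else passes through."""
    return np.maximum(x, 0)

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activated = relu(values)
print(activated)
```

This is the function selected by ```activation='relu'``` in the layers above, applied element-wise to each layer's output.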

#### In this notebook you saw:
- What is convolution and maxpooling? 
    - What are convolution operations?
    - Why do they work for images? 
    - A vague notion of stride
- What are convnets?
    - Why do convolution layers work the way they do? Why do we change feature maps?
- What do convnets learn? 
    - Feature maps!
    - Frankly, I don't have a good enough intuition, but hope you do now ;)

#### In the previous notebook: 
- MNIST Demo
    - Using Convolution Layers
- Code Overview

### In next notebook:
- Training your own small convnets from scratch
- Using data augmentation to mitigate overfitting
- Using a pre-trained convnet to do feature extraction
- Fine-tuning a pre-trained convnet