# Going deep

This notebook tries to help you go deep with your neural net. To do so, one cannot simply increase the number of convolutional layers at will. It is important that the layers have a sufficiently high learning capacity while they should cover approximately 100% of the incoming image ([Xudong Cao 2015](https://kaggle2.blob.core.windows.net/forum-message-attachments/69182/2287/A%20practical%20theory%20for%20designing%20very%20deep%20convolutional%20neural%20networks.pdf?sv=2012-02-12&se=2015-04-19T12%3A13%3A19Z&sr=b&sp=r&sig=xXaPwlkUZjIUxRyVebSNkX9viGgDPNHHpCXJRbokxUQ%3D)).

The general approach is to go deep by stacking convolutional layers. If you chain too many convolutional layers, though, the learning capacity of the layers falls too low. At that point, you have to add a max pooling layer. Use too many max pooling layers, however, and the receptive field grows larger than the image itself, which is clearly pointless. The goal is to strike the right balance while maximizing the depth of the network.

## Imports

In [1]:
# -*- coding: utf-8 -*-
from __future__ import division
from copy import deepcopy

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import theano
import theano.tensor as T

Using gpu device 0: GeForce GT 630M


In [3]:
from netz.layers import InputLayer, DenseLayer, OutputLayer, Conv2DCCLayer, MaxPool2DLayer, DropoutLayer
from netz.neuralnet import NeuralNet
from netz.costfunctions import mse, crossentropy
from netz.nonlinearities import rectify
from netz.updaters import Momentum, Nesterov, SGD
from netz.visualize import plot_loss, plot_conv_weights, plot_conv_activity, plot_occlusion

In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Data

Make sure you have the MNIST data set ready. You can get it here: http://www.kaggle.com/c/digit-recognizer/data.

### Load data

In [5]:
df = pd.read_csv('../data/mnist/train.csv')

In [6]:
y = df.values[:, 0]
X = df.values[:, 1:] / 255

In [7]:
X = X.astype(theano.config.floatX)

In [8]:
X = (X - X.mean()) / X.std()

In [9]:
X2D = X.reshape(-1, 1, 28, 28)

## Useful information when going deep

It is generally a good idea to use small filter sizes for your convolutional layers, typically **3x3**. A stack of small filters covers the same receptive field of the image as a single larger filter while using fewer parameters than the larger filter would require. Moreover, deeper stacks of convolutional layers are more expressive (see [here](http://cs231n.github.io/convolutional-networks/) for more).
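To make the parameter argument concrete, the following sketch compares a stack of 3x3 layers with a single layer whose filter spans the same receptive field. It assumes the same number of channels `c` going in and out of every layer and ignores biases; the function names are illustrative and not part of netz.

```python
def stacked_3x3_weights(n_layers, c):
    """Weights in n_layers stacked 3x3 conv layers with c channels in and out."""
    return n_layers * (3 * 3 * c * c)


def single_filter_weights(n_layers, c):
    """Weights in one conv layer whose filter spans the same receptive field."""
    k = 2 * n_layers + 1  # n stacked 3x3 layers see a (2n+1) x (2n+1) patch
    return k * k * c * c


for n in (2, 3):
    # number of stacked layers, weights in the stack, weights in the single layer
    print(n, stacked_3x3_weights(n, 16), single_filter_weights(n, 16))
```

With 16 channels, two stacked 3x3 layers need 4608 weights against 6400 for one 5x5 layer, and the gap widens as the receptive field grows.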

### A shallow net

In [10]:
layers1 = [InputLayer(),
           Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
           MaxPool2DLayer(ds=(2, 2)),
           Conv2DCCLayer(32, (3, 3), nonlinearity=rectify),
           Conv2DCCLayer(32, (3, 3), nonlinearity=rectify),
           MaxPool2DLayer(),
           Conv2DCCLayer(32, (3, 3), nonlinearity=rectify),
           DropoutLayer(p=0.5),
           DenseLayer(100, nonlinearity=rectify),
           DenseLayer(100, nonlinearity=rectify),
           OutputLayer()]

In [11]:
net1 = NeuralNet(layers=layers1, updater=Nesterov(), verbose=1)

In [12]:
net1.initialize(X2D, y)

# Neural Network with 63306 learnable parameters


## Layer information
| name       | size     |   total |   cap. Y [%] |   cap. X [%] |   cov. Y [%] |   cov. X [%] |
|:-----------|:---------|--------:|-------------:|-------------:|-------------:|-------------:|
| input0     | 1x28x28  |     784 |       100.00 |       100.00 |       100.00 |       100.00 |
| conv2dcc0  | 16x26x26 |   10816 |       100.00 |       100.00 |        10.71 |        10.71 |
| maxpool2d0 | 16x13x13 |    2704 |       100.00 |       100.00 |        10.71 |        10.71 |
| conv2dcc1  | 32x11x11 |    3872 |        85.71 |        85.71 |        25.00 |        25.00 |
| conv2dcc2  | 32x9x9   |    2592 |        54.55 |        54.55 |        39.29 |        39.29 |
| maxpool2d1 | 32x5x5   |     800 |        54.55 |        54.55 |        39.29 |        39.29 |
| conv2dcc3  | 32x3x3   |     288 |        63.16 |        63.16 |        67.86 |        67.86 |
| dropout0   | 32x3x3   |     288 |       100.00 |       100.00 

This net is fine. The capacity never falls below 1/6, which would be 16.7%, and the coverage of the image never exceeds 100%. However, this net is not very deep, so let's try to go deeper.

What we also see is the role of max pooling. Look at 'maxpool2d1': in the convolutional layer that follows it, capacity rises again (from 54.55% to 63.16%). Max pooling thus helps to restore capacity should it dip too low. However, max pooling also significantly increases the coverage of the image. So if we use max pooling too often, the coverage will quickly exceed 100% and we cannot go sufficiently deep.

### Not enough max pooling

In [13]:
layers2 = [
    InputLayer(),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify),
    DenseLayer(100, nonlinearity=rectify),
    DenseLayer(100, nonlinearity=rectify),
    OutputLayer()
]

In [14]:
net2 = NeuralNet(layers=layers2, updater=Nesterov(), verbose=1)

In [15]:
net2.initialize(X2D, y)

# Neural Network with 45610 learnable parameters


## Layer information
| name       | size     |   total |   cap. Y [%] |   cap. X [%] |   cov. Y [%] |   cov. X [%] |
|:-----------|:---------|--------:|-------------:|-------------:|-------------:|-------------:|
| input0     | 1x28x28  |     784 |       100.00 |       100.00 |       100.00 |       100.00 |
| conv2dcc0  | 16x26x26 |   10816 |       100.00 |       100.00 |        10.71 |        10.71 |
| conv2dcc1  | 16x24x24 |    9216 |        60.00 |        60.00 |        17.86 |        17.86 |
| conv2dcc2  | 16x22x22 |    7744 |        42.86 |        42.86 |        25.00 |        25.00 |
| conv2dcc3  | 16x20x20 |    6400 |        33.33 |        33.33 |        32.14 |        32.14 |
| conv2dcc4  | 16x18x18 |    5184 |        27.27 |        27.27 |        39.29 |        39.29 |
| conv2dcc5  | 16x16x16 |    4096 |        23.08 |        23.08 |        46.43 |        46.43 |
| conv2dcc6  | 16x14x14 |    3136 |        20.00 |        20.00 

Here we have a very deep net but we have a problem: The lack of max pooling layers means that the capacity of the net dips below 16.7%. We need to find a better solution.
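The capacity values in the table are consistent with the ratio of filter size to receptive field size, which for `n` stacked 3x3 layers without pooling is 3 / (2n + 1). Taking that reading as an assumption (it is inferred from the printed numbers, not from the netz source), a short sketch finds how many such layers can be chained before capacity drops below 1/6. The helper name is hypothetical.

```python
from fractions import Fraction


def capacity_without_pooling(n_layers, filter_size=3):
    """Assumed capacity of the n-th conv layer in a stack without pooling."""
    field = n_layers * (filter_size - 1) + 1  # receptive field after n layers
    return Fraction(filter_size, field)


threshold = Fraction(1, 6)

# Keep adding layers for as long as the next one stays at or above the threshold.
n = 1
while capacity_without_pooling(n + 1) >= threshold:
    n += 1
max_layers = n

print(max_layers, float(capacity_without_pooling(max_layers)))
```

Under this assumption the ninth consecutive 3x3 layer is the first to fall below the threshold, so the 13 convolutional layers in `layers2` overshoot by a wide margin.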

### Too much max pooling

In [31]:
layers3 = [
    InputLayer(),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    MaxPool2DLayer(ds=(4, 4)),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    MaxPool2DLayer(ds=(2, 2)),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    MaxPool2DLayer(ds=(2, 2)),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    MaxPool2DLayer(ds=(2, 2)),
    DenseLayer(100, nonlinearity=rectify),
    DenseLayer(100, nonlinearity=rectify),
    OutputLayer()
]

In [32]:
net3 = NeuralNet(layers=layers3, updater=Nesterov(), verbose=1)

In [33]:
net3.initialize(X2D, y)

# Neural Network with 29210 learnable parameters


## Layer information
| name       | size     |   total |   cap. Y [%] |   cap. X [%] |   cov. Y [%] |   cov. X [%] |
|:-----------|:---------|--------:|-------------:|-------------:|-------------:|-------------:|
| input0     | 1x28x28  |     784 |       100.00 |       100.00 |       100.00 |       100.00 |
| conv2dcc0  | 16x28x28 |   12544 |       100.00 |       100.00 |        10.71 |        10.71 |
| conv2dcc1  | 16x28x28 |   12544 |        60.00 |        60.00 |        17.86 |        17.86 |
| maxpool2d0 | 16x7x7   |     784 |        60.00 |        60.00 |        17.86 |        17.86 |
| conv2dcc2  | 16x7x7   |     784 |        92.31 |        92.31 |        46.43 |        46.43 |
| conv2dcc3  | 16x7x7   |     784 |        57.14 |        57.14 |        75.00 |        75.00 |
| maxpool2d1 | 16x4x4   |     256 |        57.14 |        57.14 |        75.00 |        75.00 |
| conv2dcc4  | 16x4x4   |     256 |        64.86 |     

This net uses too much max pooling for too small an image. The later layers, colored in cyan in the output, would cover more than 100% of the image, so this network is also suboptimal.

### A good compromise

In [25]:
layers4 = [
    InputLayer(),
    Conv2DCCLayer(16, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(32, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(32, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(32, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(64, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(64, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(64, (3, 3), nonlinearity=rectify, pad=1),
    MaxPool2DLayer(ds=(2, 2)),
    Conv2DCCLayer(256, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(256, (3, 3), nonlinearity=rectify, pad=1),
    Conv2DCCLayer(256, (3, 3), nonlinearity=rectify, pad=1),
    DenseLayer(100, nonlinearity=rectify),
    DropoutLayer(),
    DenseLayer(100, nonlinearity=rectify),
    DropoutLayer(),
    OutputLayer()
]

In [26]:
net4 = NeuralNet(layers=layers4, updater=Nesterov(), verbose=1)

In [27]:
net4.initialize(X2D, y)

# Neural Network with 6472330 learnable parameters


## Layer information
| name       | size      |   total |   cap. Y [%] |   cap. X [%] |   cov. Y [%] |   cov. X [%] |
|:-----------|:----------|--------:|-------------:|-------------:|-------------:|-------------:|
| input0     | 1x28x28   |     784 |       100.00 |       100.00 |       100.00 |       100.00 |
| conv2dcc0  | 16x28x28  |   12544 |       100.00 |       100.00 |        10.71 |        10.71 |
| conv2dcc1  | 32x28x28  |   25088 |        60.00 |        60.00 |        17.86 |        17.86 |
| conv2dcc2  | 32x28x28  |   25088 |        42.86 |        42.86 |        25.00 |        25.00 |
| conv2dcc3  | 32x28x28  |   25088 |        33.33 |        33.33 |        32.14 |        32.14 |
| conv2dcc4  | 64x28x28  |   50176 |        27.27 |        27.27 |        39.29 |        39.29 |
| conv2dcc5  | 64x28x28  |   50176 |        23.08 |        23.08 |        46.43 |        46.43 |
| conv2dcc6  | 64x28x28  |   50176 |        20.00 |  

With 10 convolutional layers, this network is rather deep, given the small image size. Yet the learning capacity is always sufficiently large and the coverage never exceeds 100% of the image. This could well be a good solution.

Note 1: The MNIST images typically don't cover the whole of the 28x28 image size. Therefore, an image coverage of less than 100% is probably very acceptable. For other image data sets such as CIFAR or ImageNet, it is recommended to cover the whole image.

Note 2: This analysis does not tell us how many feature maps (i.e. the number of filters per convolutional layer) to use. Here we have to experiment with different values. Larger values allow the network to learn more types of features but also increase the risk of overfitting. In general, though, deeper layers (those farther down) should learn more complex features and should thus have more feature maps.
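To see how the number of feature maps drives model size, here is a minimal sketch that counts weights and biases for `layers4`. It assumes one bias per feature map and per dense unit, a 14x14 feature map after the single 2x2 pooling step, and 10 output classes; the helpers are illustrative and not netz functions. Under these assumptions the total agrees with the 6472330 parameters reported above.

```python
def conv_params(n_in, n_out, k=3):
    """Weights plus one bias per feature map for a k x k conv layer."""
    return n_out * (n_in * k * k + 1)


def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_out * (n_in + 1)


# Feature maps going into and out of each conv layer in layers4.
# Pooling and dropout layers add no parameters.
channels = [1, 16, 32, 32, 32, 64, 64, 64, 256, 256, 256]
conv_total = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))

first_dense = dense_params(256 * 14 * 14, 100)  # assumed 14x14 maps after pooling
total = conv_total + first_dense + dense_params(100, 100) + dense_params(100, 10)

print(conv_total, first_dense, total)
```

Most of the parameters sit in the first dense layer rather than in the convolutional stack, so widening the last convolutional layers is expensive mainly because of what follows them.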

### More details

It is possible to get more information by increasing the verbosity level beyond 1.

In [22]:
net4.verbose = 2

In [23]:
net4.initialize(X2D, y)

# Neural Network with 314922 learnable parameters


## Layer information
name        size        total    cap. Y [%]    cap. X [%]    cov. Y [%]    cov. X [%]    filter Y    filter X    field Y    field X
----------  --------  -------  ------------  ------------  ------------  ------------  ----------  ----------  ---------  ---------
input0      1x28x28       784        100.00        100.00        100.00        100.00          28          28         28         28
conv2d0     16x26x26    10816        100.00        100.00         10.71         10.71           3           3          3          3
maxpool2d0  16x13x13     2704        100.00        100.00         10.71         10.71           3           3          3          3
conv2d1     32x11x11     3872         85.71         85.71         25.00         25.00           6           6          7          7
conv2d2     32x9x9       2592         54.55         54.55         39.29         39.29           6           6         11         11
con

Here we get additional information about the real filter size of the convolutional layers, as well as their receptive field sizes. If the receptive field size grows too large compared to the real filter size, capacity dips too low. As receptive field size grows larger, more and more of the image is covered.
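The relationship between these columns can be sketched in a few lines. The recurrence below is reconstructed from the printed tables, not taken from the netz source: each pooling layer multiplies the step between neighbouring units, the real filter size is the nominal filter size times that step, capacity is real filter size divided by receptive field size, and coverage is receptive field size divided by image size. All names are hypothetical.

```python
def layer_stats(layers, image_size=28):
    """Return (real filter, field, capacity %, coverage %) per conv layer.

    layers is a list of ('conv', filter_size) or ('pool', ds) tuples.
    """
    field = 1  # receptive field size in input pixels
    step = 1   # distance in input pixels between neighbouring units
    rows = []
    for kind, size in layers:
        if kind == 'pool':
            step *= size  # pooling only widens the step in this reconstruction
            continue
        real_filter = size * step
        field += (size - 1) * step
        capacity = round(100.0 * real_filter / field, 2)
        coverage = round(100.0 * field / image_size, 2)
        rows.append((real_filter, field, capacity, coverage))
    return rows


# Same layout as layers1: conv, pool, conv, conv, pool, conv
shallow = [('conv', 3), ('pool', 2), ('conv', 3), ('conv', 3),
           ('pool', 2), ('conv', 3)]
for row in layer_stats(shallow):
    print(row)
```

For this layout the sketch gives capacities of 100.0, 85.71, 54.55 and 63.16 and coverages of 10.71, 25.0, 39.29 and 67.86, in line with the table for the shallow net. Passing `ds=4` for the first pooling step, as in `layers3`, pushes the coverage of the fifth convolutional layer past 100%.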

### Caveat

A caveat to the findings presented here is that capacity and coverage may not be calculated correctly if you use padding or strides other than 1 in the convolutional layers. Including this would make the calculation much more complicated. However, even if you want to use these parameters, the calculations shown here should not deviate too much and the results may still serve as a rough guideline.
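As a starting point for reasoning about padding and strides, the spatial output size of a convolution follows the standard formula floor((n + 2p - k) / s) + 1. The sketch below applies it; the function name is illustrative, and this is not the calculation netz performs for capacity or coverage.

```python
def conv_output_size(n, k, pad=0, stride=1):
    """Spatial output size of a convolution over an n x n input."""
    return (n + 2 * pad - k) // stride + 1


unpadded = conv_output_size(28, 3)           # no padding, as in layers1
padded = conv_output_size(28, 3, pad=1)      # pad=1, as in layers4
strided = conv_output_size(28, 3, stride=2)  # hypothetical stride of 2

print(unpadded, padded, strided)
```

The first two values match the 26x26 and 28x28 feature maps shown in the tables above. A stride of 2 roughly halves the output size, much as a 2x2 pooling layer does.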