In [1]:
from __future__ import print_function

import torch
import torch.nn as nn
from torch.autograd import Variable

## What happens inside your CNN: conv2d
### (discrete 2-D convolutions)
This short gist assumes some basic knowledge of fully-connected neural networks, the kind everyone is introduced to first. Here we try to understand in detail what happens when we use a 2-D convolutional layer in a neural network. Convolutional layers, and CNNs in general, are very popular in domains like computer vision, but they can also be applied to many other types of data with an intrinsic structure (e.g. sound clips or text).  

### Why?
One might ask: why should we even bother with convolutions in the first place?  
In short: with discrete convolutions we can exploit the *intrinsic structure* of the given data (if present). We'll use image inputs as an example. Their intrinsic structure is two-dimensional, so 2-D convolutions are useful in this case. Bonus: they are also the easiest to visualize.  
Another reason to use convolutions is that images possess *localized concepts*. This means pixels close to each other are correlated: they are likely to be similar unless there's an edge at that location, which is itself an important feature we don't want to miss. The same goes for text, for example, where words next to each other probably share the same context.

### How do discrete convolutions work?
Discrete convolutions can be described as linear transformations that preserve the ordering of the input ([from this nice paper on which a lot of this notebook is based](https://arxiv.org/pdf/1603.07285.pdf)).  
This is achieved by a **kernel** of a given size (here 2-dimensional, *height x width*) sliding across the **input feature map** (here one color channel of the image). At each position, every element of the kernel is multiplied with the input element it overlaps, and the products are summed up to obtain the output for that position. The kernel then moves on by one **stride**, and the process is repeated until the whole input has been covered, which produces the **output feature map**.  
Note that for a 3-D convolution the kernel would simply be a cuboid sliding across height, width and depth. The 1-D case is now rather obvious. The effect is the same in all cases: the convolution preserves the intrinsic structure of the input.
<img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif">
Here the blue map represents the input and the cyan map the output feature map.  
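To make the sliding-kernel description concrete, here is a minimal sketch for a single input feature map, without padding and with a stride of 1. The helper name `naive_conv2d` and the toy values are assumptions made for illustration. It is far slower than PyTorch's own implementation and is only meant to mirror the description above.

```python
import torch
import torch.nn.functional as F


def naive_conv2d(feature_map, kernel):
    """Slide `kernel` across `feature_map` (no padding, stride 1)."""
    in_h, in_w = feature_map.shape
    k_h, k_w = kernel.shape
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with the patch it overlaps, then sum up.
            patch = feature_map[i:i + k_h, j:j + k_w]
            out[i, j] = (patch * kernel).sum()
    return out


# A (4, 4) input and a (3, 3) kernel, as in the animation above.
feature_map = torch.arange(16, dtype=torch.float32).reshape(4, 4)
kernel = torch.ones(3, 3)
result = naive_conv2d(feature_map, kernel)
print(result.shape)  # torch.Size([2, 2])

# Compare with PyTorch's built-in, which expects an [N, C, H, W] input
# and a [C_out, C_in, kH, kW] weight tensor.
reference = F.conv2d(feature_map[None, None], kernel[None, None])[0, 0]
print(torch.allclose(result, reference))
```

Both results should agree, since `F.conv2d` performs the same multiply-and-sum at every kernel position.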
Without padding, any kernel size other than `(1, 1)` will result in output feature maps that are smaller than the input. As can be seen above, a kernel size of `(3, 3)` turns the `(4, 4)` input feature map into a `(2, 2)` output. If we want the input and output feature maps to be of the same size, we can simply *pad* the input with zeros.  
For example, a `(5, 5)` input feature map with a **zero-padding** of 1 pixel on every side keeps its spatial size even with a kernel size of `(3, 3)`.
<img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/same_padding_no_strides.gif">
The reasons why we may want to preserve the spatial size of the input are manifold:
* designing networks is easier since the tensor dimensions will simply fit
* allows for deeper networks (without padding the size will be reduced too quickly)
* can improve performance by keeping information at the borders
* some newer architectures need to concatenate the outputs of convolutional layers with `(1, 1)`, `(3, 3)` and `(5, 5)` kernels, which wouldn't be possible without padding since the dimensions wouldn't match (see [inception module](https://i.stack.imgur.com/ldTdM.png) for example)  
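The last point can be illustrated with a small sketch. The channel counts and the input size below are arbitrary assumptions, and this is not the actual inception module, only the concatenation idea: with a padding of `(kernel_size - 1) // 2` and a stride of 1, kernels of odd size all produce outputs with the same spatial size, so they can be concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 28, 28)  # [N, C_in, H, W], sizes chosen for illustration

# One branch per kernel size; the padding keeps H and W unchanged.
branches = [
    nn.Conv2d(3, 4, kernel_size=k, padding=(k - 1) // 2)
    for k in (1, 3, 5)
]
outputs = [branch(x) for branch in branches]
for out in outputs:
    print(out.shape)  # torch.Size([1, 4, 28, 28]) for every branch

# Concatenate along the channel dimension (dim=1).
merged = torch.cat(outputs, dim=1)
print(merged.shape)  # torch.Size([1, 12, 28, 28])
```

Without the padding, the three branches would yield `(28, 28)`, `(26, 26)` and `(24, 24)` feature maps, and `torch.cat` would raise an error.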

We can also have the kernel move across the input in larger steps by using a larger **stride**. As can be seen in the example below with a kernel size of `(3, 3)` and a stride of `(2, 2)`, larger strides quickly reduce the spatial size of the output and result in fewer computations per convolution compared to smaller strides, but they also lead to information loss. So there's a trade-off.
<img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_strides.gif">
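The effect of kernel size, padding and stride on the output size can be summarized for one spatial dimension as $o = \lfloor (i + 2p - k) / s \rfloor + 1$, a relationship given in the convolution arithmetic paper linked above. The helper below is a small sketch of that formula; its name and the `(5, 5)` input used for the strided case are assumptions for illustration.

```python
import torch
import torch.nn as nn


def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(4, 3))             # 2: no padding, stride 1
print(conv_output_size(5, 3, padding=1))  # 5: zero-padding of 1 preserves the size
print(conv_output_size(5, 3, stride=2))   # 2: stride 2 shrinks the output quickly

# Cross-check the strided case against an actual layer.
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2)
print(strided(torch.rand(1, 1, 5, 5)).shape)  # torch.Size([1, 1, 2, 2])
```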

### Where are the weights in convolutional layers?
When comparing convolutional layers to their 'classical' counterpart of fully-connected layers, instead of having weight matrices we now have kernels sliding across the input. The question may arise: Where are the trainable weights in this case?  
For convolutional layers the weights are contained in the collection of kernels we use. Biases aren't treated any differently from the way they are used in fully-connected layers.  
So a `(5, 5)` kernel contains only 25 weights, which really isn't a lot. After all, a big advantage of convolutional layers is that the same weights are applied at multiple locations in the input, so they get by with a much smaller number of weights.  
But usually we do not only have one kernel, but rather many more, depending on how many output feature maps we wish to create and how many input feature maps we have at hand. The choice of the number of output feature maps of a convolutional layer can be compared to the choice of the number of hidden nodes in a fully-connected layer.  
To see how this all comes together let's take a look at the formula given in the [PyTorch docs](http://pytorch.org/docs/master/nn.html#conv2d) on `torch.nn.Conv2d`:
$$ \text{out}(N_i, C_{\text{out}, j}) = \text{bias}(C_{\text{out}, j}) 
+ \sum_{k=0}^{C_{\text{in}} - 1} \text{weight}(C_{\text{out}, j}, k) \star \text{input}(N_i, k)$$
This formula maps our input tensor [$N, C_\text{in}, H, W$] to our output tensor [$N, C_\text{out}, H_\text{out}, W_\text{out}$]. The $\star$ operator can be seen as the operation of sliding the kernel across the input feature map (recall, as described above, there's another summation hidden in this operation). $N$ simply describes the number of inputs (also known as batch size) to take into account for the optimization step. Since this is not central to understanding convolutional layers, $N$ can be put aside for this consideration. $C_\text{in}$ and $C_\text{out}$ denote the number of input and output feature maps, $j$ and $k$ index one output and one input feature map respectively, $\text{weight}$ is our kernel and $\text{bias}$ should be self-explanatory.  
From this we can quickly infer that we have $C_\text{out} \cdot C_\text{in}$ kernels.  
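As a rough check of this reading of the formula, the sketch below rebuilds one output feature map by hand: it slides each of the $C_\text{in}$ kernels belonging to output map $j$ across its own input feature map, sums the results and adds the bias. The layer sizes and the chosen index `j` are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)

# C_out * C_in = 8 * 3 = 24 kernels of size (5, 5), plus one bias per output map.
print(conv.weight.shape)  # torch.Size([8, 3, 5, 5])
print(conv.bias.shape)    # torch.Size([8])

x = torch.rand(1, 3, 28, 28)
expected = conv(x)

j = 2  # output feature map to reconstruct by hand
manual = conv.bias[j] + sum(
    # Slide kernel (j, k) across input feature map k only.
    F.conv2d(x[:, k:k + 1], conv.weight[j:j + 1, k:k + 1])
    for k in range(conv.in_channels)
)
print(torch.allclose(manual, expected[:, j:j + 1], atol=1e-5))
```

The comparison uses a small tolerance because the manual version sums the per-channel results in a different order than the built-in layer.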

Let's look at an example of a convolutional layer as it might be used in a network for classifying handwritten digits on the MNIST dataset.

In [2]:
class OneLayerConv2d(nn.Module):
    def __init__(self):
        super(OneLayerConv2d, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5, 
                               stride=1, padding=0, bias=True)
        
    def forward(self, x):
        return self.conv1(x)

Our toy network consists of only one `nn.Conv2d` layer with *1 input feature map* and *10 output feature maps*. There's no pooling layer or activation function as there would be in an actual convolutional neural network used for training.  
We can print a summary of our network to confirm:

In [3]:
conv_layer = OneLayerConv2d()
print(conv_layer)

OneLayerConv2d(
  (conv1): Conv2d (1, 10, kernel_size=(5, 5), stride=(1, 1))
)


PyTorch lets us access all weights (here called *parameters*) of our network. The `.parameters()` method returns a generator, so let's convert it to a list and look at the result. Weights and biases are stored separately:

In [4]:
params = list(conv_layer.parameters())
print(len(params), "-> parameters for weights and biases")

2 -> parameters for weights and biases


Now let's check our expectations for the total number of parameters of our convolutional layer. With 1 input feature map, 10 output feature maps and a kernel size of `(5, 5)` we'd expect 10 kernels with 25 elements each (the stride has no influence on the number of weights). Including the 10 bias elements, that makes for a total of 260 parameters.
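The arithmetic behind this expectation can be sketched as a small helper. The function name `conv2d_param_count` is an assumption for illustration and not part of PyTorch; the result is compared against the number of elements PyTorch reports for an equivalent layer.

```python
import torch.nn as nn


def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of trainable parameters of a plain 2-D convolutional layer."""
    k_h, k_w = kernel_size
    # One (k_h, k_w) kernel per pair of input and output feature maps.
    weights = out_channels * in_channels * k_h * k_w
    # One bias element per output feature map.
    biases = out_channels if bias else 0
    return weights + biases


expected = conv2d_param_count(1, 10, (5, 5))
layer = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5)
actual = sum(p.numel() for p in layer.parameters())
print(expected, actual)  # 260 260
```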

In [5]:
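# Shape of the weight tensor: [out_channels, in_channels, kernel_height, kernel_width]
print(params[0].size())
# numel() returns the total number of elements in a tensor
print("Number of parameters (from kernels):\t", params[0].numel())
print("Number of parameters (from biases):\t", params[1].numel())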

torch.Size([10, 1, 5, 5])
Number of parameters (from kernels):	 250
Number of parameters (from biases):	 10


We can also print the whole kernel tensor to get a better feeling for the structure:

In [6]:
print("Tensor containing all kernel weights:\t", list(conv_layer.parameters())[0].size())
print("\n",list(conv_layer.parameters())[0])

Tensor containing all kernel weights:	 torch.Size([10, 1, 5, 5])

 Parameter containing:
(0 ,0 ,.,.) = 
  0.1965  0.0802 -0.1461  0.0131 -0.1233
 -0.0910  0.0738 -0.1316 -0.0076  0.0710
 -0.1218  0.1586  0.0425 -0.1391 -0.0013
  0.1792  0.0705 -0.0948 -0.0328  0.0845
 -0.0739  0.0163  0.0941 -0.1632 -0.0426

(1 ,0 ,.,.) = 
 -0.0847 -0.0284  0.1238  0.1300 -0.1828
  0.0823  0.1470 -0.0860  0.1757  0.0675
 -0.0816  0.1099  0.1546 -0.0650 -0.1576
 -0.0984  0.0536 -0.0089 -0.0841 -0.0489
  0.1224  0.0897 -0.1336 -0.1076  0.0706

(2 ,0 ,.,.) = 
 -0.0662 -0.0020 -0.1265 -0.0934 -0.1166
 -0.1951 -0.0393  0.1113 -0.1773  0.0455
  0.1373 -0.0544  0.1666 -0.1128  0.1679
  0.1410  0.1341  0.0930  0.0549 -0.0343
  0.0623 -0.0167 -0.1185  0.0468 -0.1171

(3 ,0 ,.,.) = 
 -0.0734 -0.0134  0.1997 -0.0012  0.0735
 -0.1987 -0.0150  0.0486 -0.1723 -0.0884
 -0.0332 -0.1228  0.0138 -0.1539  0.1722
  0.1726  0.0068 -0.1813 -0.0672 -0.1922
  0.1976 -0.0994 -0.1151  0.0470  0.0705

(4 ,0 ,.,.) = 
 -0.0590  0.

Now let's apply this convolutional layer to some dummy input. We'll use the dimensions that would be typical for training a neural network on MNIST data: a batch of 128 grayscale images of size 28x28.

In [7]:
x = Variable(torch.rand(128, 1, 28, 28))
output = conv_layer(x)
print('Input:\t\t\t', x.shape)
print('After Conv2d:\t\t', output.shape)

Input:			 torch.Size([128, 1, 28, 28])
After Conv2d:		 torch.Size([128, 10, 24, 24])


As we'd expect, sliding a `(5, 5)` kernel across a `(28, 28)` input feature map with a stride of `(1, 1)` leaves us with `(24, 24)` output feature maps if we don't use zero-padding, since 28 - 5 + 1 = 24.

That concludes this behind-the-scenes look at discrete 2-D convolutions. There are more options to `nn.Conv2d`, like dilation or groups, which were omitted here for the sake of brevity. For details on these, see the [PyTorch docs](http://pytorch.org/docs/master/nn.html#conv2d) on `torch.nn.Conv2d` and the following two papers:  
*A guide to convolution arithmetic for deep learning* ([*arXiv:1603.07285*](https://arxiv.org/pdf/1603.07285.pdf))  
*Multi-scale context aggregation by dilated convolutions* ([*arXiv:1511.07122*](https://arxiv.org/pdf/1511.07122.pdf))

The next step for taking a look behind the scenes of convolutional neural networks is to explore **pooling layers**, which will be covered in the next part.