In [7]:
import torch
from torch import nn
import torchvision
import sys
sys.path.append("../dlutils")
import importlib
import model
import dataset
import train
import loss
importlib.reload(model)
importlib.reload(dataset)
importlib.reload(train)
importlib.reload(loss)
from dataset import load_fashion_mnist_dataset
from train import train_3ch

## The Cross Correlation Operation
Suppose the input has shape $n_h \times n_w$ and the kernel has shape $k_h \times k_w$. After the cross-correlation operation, the output shape is:
$$(n_h-k_h+1) \times (n_w-k_w+1)$$

<img src="cross_correlation.png" alt="Cross Correlation Operation" width="200"/>
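As a minimal sketch of the operation described above, the helper below (named `corr2d` here for illustration; it is not part of the `model` module used in this notebook) slides the kernel over a 2D input and sums the elementwise products in each window. It assumes stride 1 and no padding.

```python
import torch

def corr2d(X, K):
    """Naive 2D cross-correlation of a 2D input X with a 2D kernel K."""
    k_h, k_w = K.shape
    # Output shape follows (n_h - k_h + 1) x (n_w - k_w + 1).
    Y = torch.zeros((X.shape[0] - k_h + 1, X.shape[1] - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the current window by the kernel and sum the result.
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
Y = corr2d(X, K)
print(Y.shape)  # torch.Size([2, 2])
print(Y)        # tensor([[19., 25.], [37., 43.]])
```

A $3 \times 3$ input with a $2 \times 2$ kernel gives a $2 \times 2$ output, as the formula predicts.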


## Convolutional Layer

In [31]:
from model import Conv2D
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
net = Conv2D(K.shape)
net.weight.data = K
net(X)

tensor([[[19., 25.],
         [37., 43.]]], grad_fn=<CopySlices>)

## Convolutional layer for multiple input channels
For a multi-channel input, the kernel must have the same number of channels as the input, so that a cross-correlation can be performed on each channel. The per-channel results are then summed to give the final output.

Here is the illustration:

<img src="conv-multi-in.svg" width="400"/>
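The per-channel sum can be sketched with `torch.nn.functional.conv2d`, which computes a cross-correlation. The function name `corr2d_multi_in` and the example tensors are illustrative choices, not taken from the `model` module. The last lines check that one call with a multi-channel kernel gives the same result.

```python
import torch
import torch.nn.functional as F

def corr2d_multi_in(X, K):
    """X: (c_i, n_h, n_w), K: (c_i, k_h, k_w). Returns a 2D output."""
    # Cross-correlate each input channel with its own kernel slice ...
    per_channel = [F.conv2d(x[None, None], k[None, None])[0, 0]
                   for x, k in zip(X, K)]
    # ... then add the per-channel results together.
    return torch.stack(per_channel).sum(dim=0)

X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]],
                  [[1.0, 2.0], [3.0, 4.0]]])

Y = corr2d_multi_in(X, K)
print(Y)  # tensor([[ 56.,  72.], [104., 120.]])

# A single conv2d call with a (1, c_i, k_h, k_w) weight does the same sum.
Y_builtin = F.conv2d(X[None], K[None])[0, 0]
print(torch.allclose(Y, Y_builtin))
```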

## Convolutional layer for multiple input and output channels
Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output channel dimension, so that **the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$.**
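A small shape check may make the stacking concrete. The sizes below ($c_i=3$, $c_o=4$, a $2 \times 2$ kernel, a $5 \times 5$ input) are arbitrary values picked for illustration. The sketch builds one $c_i \times k_h \times k_w$ kernel per output channel, stacks them, and compares the result with per-channel calls.

```python
import torch
import torch.nn.functional as F

c_i, c_o, k_h, k_w = 3, 4, 2, 2  # illustrative sizes
X = torch.rand(c_i, 5, 5)

# One (c_i, k_h, k_w) kernel per output channel, stacked on a new first axis.
kernels = [torch.rand(c_i, k_h, k_w) for _ in range(c_o)]
K = torch.stack(kernels, dim=0)
print(K.shape)  # torch.Size([4, 3, 2, 2]), i.e. (c_o, c_i, k_h, k_w)

# conv2d expects a batch dimension, so add one and remove it afterwards.
Y = F.conv2d(X.unsqueeze(0), K).squeeze(0)
print(Y.shape)  # torch.Size([4, 4, 4]), i.e. (c_o, n_h-k_h+1, n_w-k_w+1)

# Output channel o depends only on kernel o.
matches = [
    torch.allclose(Y[o], F.conv2d(X.unsqueeze(0), kernels[o].unsqueeze(0))[0, 0],
                   atol=1e-5)
    for o in range(c_o)
]
print(all(matches))
```

`torch.nn.Conv2d(c_i, c_o, (k_h, k_w)).weight` uses the same $c_o \times c_i \times k_h \times k_w$ layout.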

## $1 \times 1$ convolution layer
Because the minimum window is used, the $1 \times 1$ convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions.  **The only computation of the $1 \times 1$ convolution occurs on the channel dimension.**

You could think of the $1 \times 1$ convolutional layer as constituting a **fully-connected layer applied at every single pixel location** to transform the $c_i$ corresponding input values into $c_o$ output values.
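The fully-connected view can be checked numerically. In the sketch below, `conv1x1_as_matmul` is a hypothetical helper (it is not the `Conv2DOnebyOne` class from the `model` module) and it assumes no bias term. It flattens the spatial dimensions, applies one $c_o \times c_i$ weight matrix to every pixel, and compares the result with a $1 \times 1$ `conv2d`.

```python
import torch
import torch.nn.functional as F

def conv1x1_as_matmul(X, W):
    """X: (c_i, h, w), W: (c_o, c_i). Returns (c_o, h, w). No bias."""
    c_i, h, w = X.shape
    c_o = W.shape[0]
    # Each column of the reshaped input is one pixel's c_i channel values;
    # the same weight matrix W is applied to every column.
    Y = W @ X.reshape(c_i, h * w)
    return Y.reshape(c_o, h, w)

torch.manual_seed(0)
X = torch.rand(5, 4, 4)  # c_i = 5
W = torch.rand(8, 5)     # c_o = 8

Y_fc = conv1x1_as_matmul(X, W)
Y_conv = F.conv2d(X.unsqueeze(0), W.reshape(8, 5, 1, 1)).squeeze(0)

print(Y_fc.shape)  # torch.Size([8, 4, 4])
print(torch.allclose(Y_fc, Y_conv, atol=1e-5))
```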




In [8]:
from model import Conv2DOnebyOne
X = torch.rand(size=(5,4,4))
net = Conv2DOnebyOne(5, 8)
print(net(X))

tensor([[[-0.5076,  0.1259, -0.7726, -0.4522],
         [-0.3095, -0.1037, -0.2447, -0.6873],
         [-0.4602, -0.3451, -0.6279, -0.5538],
         [-0.0835, -0.2740, -0.2498, -0.3815]],

        [[-0.1522,  0.2037, -0.2166,  0.3916],
         [-0.1256,  0.4036,  0.0974,  0.1388],
         [ 0.2759,  0.4469,  0.2058,  0.1556],
         [ 0.5209,  0.1955,  0.5365, -0.2528]],

        [[-0.6011, -0.3757, -0.9705, -0.5223],
         [-0.7133, -0.6213, -0.5570, -0.8425],
         [-0.6292, -0.5921, -0.6994, -0.7364],
         [-0.7720, -0.5177, -0.5744, -0.8175]],

        [[-0.4625, -0.8049, -0.4650, -0.4133],
         [-0.6588, -0.5933, -0.5655, -0.4099],
         [-0.3993, -0.4365, -0.3109, -0.4797],
         [-0.6188, -0.4864, -0.4540, -0.6795]],

        [[-0.0365, -0.0435, -0.2070,  0.2668],
         [-0.2685, -0.1204,  0.0051,  0.0048],
         [ 0.0177,  0.2795,  0.1312,  0.1045],
         [-0.0955, -0.1245,  0.1675, -0.4195]],

        [[ 0.1964, -0.0450,  0.0950, -0.1164],
   

## Padding
In general, if we add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and a total of  $p_w$  columns of padding (roughly half on the left and half on the right), the output shape will be
$$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).$$
<img src="conv-pad.svg">

In many cases, we will want to set $p_h=k_h-1$  and $p_w = k_w-1$ to give the input and output the same height and width. 
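A quick shape check, using an arbitrary $8 \times 8$ input chosen for illustration. Note that the `padding` argument of `torch.nn.Conv2d` is the amount added to *each* side, so `padding=1` corresponds to a total of $p_h = p_w = 2$ in the notation above.

```python
import torch
from torch import nn

X = torch.rand(1, 1, 8, 8)  # (batch, channels, height, width)

# 3x3 kernel: p_h = p_w = k - 1 = 2 in total, i.e. padding=1 on each side.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
Y1 = conv(X)
print(Y1.shape)  # torch.Size([1, 1, 8, 8])

# 5x3 kernel: p_h = 4 and p_w = 2 in total, i.e. padding=(2, 1) per side.
conv = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
Y2 = conv(X)
print(Y2.shape)  # torch.Size([1, 1, 8, 8])
```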

## Stride
<img src="conv-stride.svg">

In general, when the stride for the height is $s_h$ and the stride for the width is  $s_w$ , the output shape is
$$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$

**Explanation:**
Take the width dimension as an example. The padded row has total width $n_w+p_w$. With zero-based indexing, every window starts at an index that is a multiple of $s_w$, and that index can be at most $n_w+p_w-k_w$ for the window to still fit. The number of windows is therefore $\lfloor(n_w+p_w-k_w)/s_w\rfloor+1$, which equals $\lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor$.
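The formula can be compared against the shapes PyTorch actually produces. `conv_out_size` is a helper introduced here for illustration; it takes the *total* padding $p$, while `torch.nn.Conv2d` takes the padding per side, so the per-side value is doubled before it is passed in. The input size and layer settings are arbitrary.

```python
import torch
from torch import nn

def conv_out_size(n, k, p, s):
    """Output length along one dimension; p is the total padding."""
    return (n - k + p + s) // s

X = torch.rand(1, 1, 8, 8)

# Stride 2 with a 3x3 kernel and 1 row/column of padding per side.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
Y1 = conv(X)
print(Y1.shape[2:], conv_out_size(8, 3, 2, 2))  # torch.Size([4, 4]) 4

# Different kernel size, padding and stride per dimension.
conv = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
Y2 = conv(X)
expected = (conv_out_size(8, 3, 0, 3), conv_out_size(8, 5, 2, 4))
print(Y2.shape[2:], expected)  # torch.Size([2, 2]) (2, 2)
```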

## Pooling Layer
The pooling layer reduces the spatial dimensions (height and width), but not the depth (number of channels), of the feature maps in a convolutional neural network. This gives you:
- Less spatial information to process, so less computation
- Smaller feature maps, so fewer parameters in the layers that follow and less chance of overfitting
- Some translation invariance

Illustration:
<img src="pooling.svg"/>

In [9]:
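def pool2d(X, pool_size, mode="max"):
    """Pool a 2D tensor X with a (p_h, p_w) window, stride 1, no padding."""
    if mode not in ('max', 'avg'):
        raise ValueError(f"unknown mode: {mode!r}")
    p_h, p_w = pool_size
    # Same output shape as cross-correlation: (n_h - p_h + 1, n_w - p_w + 1).
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i + p_h, j:j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i + p_h, j:j + p_w].mean()
    return Y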

In [42]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

tensor([[4., 5.],
        [7., 8.]])

## Pooling Layer with Padding


In [10]:
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

In [14]:
# The default stride for max pooling is kernel_size.
pool2d = torch.nn.MaxPool2d(3)
pool2d(X)

tensor([[[[10.]]]])

In [15]:
pool2d = torch.nn.MaxPool2d((2, 3), padding=(0, 1))
pool2d(X)

tensor([[[[ 5.,  7.],
          [13., 15.]]]])
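To see where the last output comes from, the padded max pooling can be reproduced by hand. This is a sketch that assumes the documented behaviour of `torch.nn.MaxPool2d`: padding is applied on both sides of each dimension with negative infinity, and the stride defaults to the kernel size. The variable names are chosen here for illustration.

```python
import torch
import torch.nn.functional as F

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
k_h, k_w = 2, 3
pad_h, pad_w = 0, 1   # padding per side
s_h, s_w = k_h, k_w   # stride defaults to the kernel size

# F.pad takes (left, right, top, bottom) for the last two dimensions.
Xp = F.pad(X, (pad_w, pad_w, pad_h, pad_h), value=float("-inf"))

out_h = (Xp.shape[2] - k_h) // s_h + 1
out_w = (Xp.shape[3] - k_w) // s_w + 1
Y = torch.zeros(1, 1, out_h, out_w)
for i in range(out_h):
    for j in range(out_w):
        # Take the maximum over each non-overlapping 2x3 window.
        window = Xp[0, 0, i * s_h:i * s_h + k_h, j * s_w:j * s_w + k_w]
        Y[0, 0, i, j] = window.max()

print(Y)  # tensor([[[[ 5.,  7.], [13., 15.]]]])
print(torch.equal(Y, torch.nn.MaxPool2d((2, 3), padding=(0, 1))(X)))
```

The padded input is $4 \times 6$, so a $2 \times 3$ window with stride $(2, 3)$ fits exactly twice in each direction.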