In [2]:
import torch

## The Cross-Correlation Operation
Suppose the input has shape $n_h \times n_w$ and the kernel has shape $k_h \times k_w$. With no padding and a stride of 1, the cross-correlation output has shape:
$$(n_h-k_h+1) \times (n_w-k_w+1)$$

Illustration:
<img src="./imgs/CNN/1.png" alt="Cross Correlation Operation" width="200"/>
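
The shape rule can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `corr2d_output_shape` is hypothetical and assumes no padding and a stride of 1.

```python
def corr2d_output_shape(input_shape, kernel_shape):
    """Output shape of a 2D cross-correlation with no padding and stride 1."""
    n_h, n_w = input_shape
    k_h, k_w = kernel_shape
    if k_h > n_h or k_w > n_w:
        raise ValueError("kernel must not be larger than the input")
    # The kernel can be placed at (n - k + 1) positions along each axis.
    return (n_h - k_h + 1, n_w - k_w + 1)


print(corr2d_output_shape((3, 3), (2, 2)))  # (2, 2)
print(corr2d_output_shape((6, 8), (1, 2)))  # (6, 7)
```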


In [4]:
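def corr2d(X, K):
    """Compute the 2D cross-correlation of input X with kernel K."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = torch.zeros((n_h - k_h + 1, n_w - k_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Multiply the current window elementwise by the kernel and sum.
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y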

In [5]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])

## Convolutional layer
A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output.

In [6]:
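class Conv2D(torch.nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        # Learnable kernel (uniform random init) and a learnable scalar bias.
        self.weight = torch.nn.Parameter(torch.rand(kernel_size))
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, X):
        # Cross-correlate the input with the kernel, then add the bias.
        return corr2d(X, self.weight) + self.bias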
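
For comparison, here is a small sketch using PyTorch's built-in `torch.nn.Conv2d`. This is the library layer, not the `Conv2D` class defined above. It takes a batched `(batch, channels, height, width)` input and stores its kernel with explicit output and input channel dimensions. The input tensor here is chosen only for illustration.

```python
import torch

# Built-in layer: one input channel, one output channel, a 2x2 kernel, with bias.
layer = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(2, 2), bias=True)

# A single 3x3 image, reshaped to (batch=1, channels=1, height=3, width=3).
X = torch.arange(9.0).reshape(1, 1, 3, 3)
Y = layer(X)

print(layer.weight.shape)  # torch.Size([1, 1, 2, 2])
print(layer.bias.shape)    # torch.Size([1])
print(Y.shape)             # torch.Size([1, 1, 2, 2])
```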

## Convolutional layer for multiple input channels
For a multi-channel input, the kernel must have the same number of channels as the input, so that each input channel is cross-correlated with its own kernel channel. The per-channel results are then added up to produce the final output.

Here is the illustration:
<img src="./imgs/CNN/conv-multi-in.svg" width="400"/>

In [11]:
def corr2d_multi_in(X, K):
    return sum(corr2d(x, k) for x, k in zip(X, K))

In [12]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])
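
Written out as a formula, with $\star$ denoting cross-correlation and $u, v$ indexing output positions (this notation is an assumption, not taken from the text above), the multi-input operation is:

$$Y = \sum_{c=0}^{c_i-1} X_c \star K_c, \qquad Y_{u,v} = \sum_{c=0}^{c_i-1} \sum_{a=0}^{k_h-1} \sum_{b=0}^{k_w-1} X_{c,\,u+a,\,v+b}\, K_{c,\,a,\,b}$$

As a check on the top-left entry of the output above:

$$Y_{0,0} = (0\cdot0 + 1\cdot1 + 3\cdot2 + 4\cdot3) + (1\cdot1 + 2\cdot2 + 4\cdot3 + 5\cdot4) = 19 + 37 = 56$$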

## Convolutional layer for multiple input and output channels
Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$.

In [13]:
def corr2d_multi_in_out(X, K):
    return torch.stack([corr2d_multi_in(X, k) for k in K])

In [14]:
K = torch.stack((K, K + 1, K + 2), 0)
corr2d_multi_in_out(X, K)

tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])
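
As a sanity check, the same result can be reproduced with PyTorch's built-in `torch.nn.functional.conv2d`, which also computes cross-correlation. The sketch below rebuilds `X` and `K` from the cells above and assumes the default stride of 1 and no padding. The built-in expects a leading batch dimension, which is added and removed here.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
K = torch.stack((K, K + 1, K + 2), 0)  # shape (c_o=3, c_i=2, k_h=2, k_w=2)

# Add a batch dimension for the built-in call, then drop it again.
Y = F.conv2d(X.unsqueeze(0), K).squeeze(0)

print(K.shape)  # torch.Size([3, 2, 2, 2])
print(Y.shape)  # torch.Size([3, 2, 2])
print(Y)
```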

## $1 \times 1$ convolutional layer
Because a $1 \times 1$ window covers only a single pixel, a $1 \times 1$ convolution cannot, unlike larger kernels, recognize patterns formed by interactions among adjacent elements in the height and width dimensions. **The only computation of the $1 \times 1$ convolution occurs on the channel dimension.**

You could think of the $1 \times 1$ convolutional layer as constituting a **fully-connected layer applied at every single pixel location** to transform the $c_i$ corresponding input values into $c_o$ output values.
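
In symbols, dropping the two singleton kernel dimensions and writing $K_{o,c}$ for the weight that connects input channel $c$ to output channel $o$ (this indexing is an assumption made for illustration), each output value is:

$$Y_{o,\,u,\,v} = \sum_{c=0}^{c_i-1} K_{o,\,c}\, X_{c,\,u,\,v}$$

Equivalently, at every pixel $(u, v)$ the output vector is $\mathbf{y}_{u,v} = W\,\mathbf{x}_{u,v}$ with the same weight matrix $W \in \mathbb{R}^{c_o \times c_i}$, which is a fully-connected layer without bias whose weights are shared across all pixel locations.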

In [None]:
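def corr2d_multi_in_out_1x1(X, K):
    # Assumed completion: flatten the pixels, apply a (c_o x c_i) matrix to each, reshape back.
    (c_i, h, w), c_o = X.shape, K.shape[0]
    return (K.reshape(c_o, c_i) @ X.reshape(c_i, h * w)).reshape(c_o, h, w)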