# 6.2. Convolutions for Images

## 6.2.1. The Cross-Correlation Operation

In [1]:
import torch
from torch import nn
from d2l import torch as d2l

def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [2]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])
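
As a sanity check, the loop-based `corr2d` can be compared with PyTorch's built-in `torch.nn.functional.conv2d`, which, despite its name, also computes a cross-correlation. The sketch below assumes a single example with a single channel, so the input and kernel are reshaped to four dimensions. The names `Y_loop` and `Y_builtin` are introduced here only for illustration.

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

# Output shape is (3 - 2 + 1, 3 - 2 + 1) = (2, 2)
Y_loop = corr2d(X, K)

# F.conv2d expects (batch, channel, height, width) for the input and
# (out_channels, in_channels, height, width) for the kernel
Y_builtin = F.conv2d(X.reshape(1, 1, 3, 3), K.reshape(1, 1, 2, 2)).reshape(2, 2)

print(torch.allclose(Y_loop, Y_builtin))  # True
```

The explicit loops are easy to read but slow. The built-in routine is what one would normally use in practice.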

## 6.2.2. Convolutional Layers

In [3]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias
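
A brief usage sketch of the `Conv2D` layer defined above may help: it instantiates the layer with a hypothetical kernel shape of `(1, 2)`, applies it to a `6 x 8` input, and lists the learnable parameters. The input values and the name `param_shapes` are assumptions made for illustration.

```python
import torch
from torch import nn

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

layer = Conv2D(kernel_size=(1, 2))
X = torch.ones((6, 8))
out = layer(X)

# A (1, 2) kernel shrinks the width by one: (6, 8) -> (6, 7)
print(out.shape)  # torch.Size([6, 7])

# The layer owns two learnable tensors: the kernel and a scalar bias
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(param_shapes)  # {'weight': (1, 2), 'bias': (1,)}
```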

## 6.2.3. Object Edge Detection in Images

In [4]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

In [5]:
K = torch.tensor([[1.0, -1.0]])

In [6]:
Y = corr2d(X, K)
Y

tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

In [7]:
corr2d(X.t(), K)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
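
The all-zero result shows that the kernel `K` responds only to changes along the horizontal direction, that is, to vertical edges. As a minimal sketch, assuming we also transpose the kernel to shape `(2, 1)`, the horizontal edges of the transposed image should be recovered. The name `Y_t` is introduced here for illustration.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

Y = corr2d(X, K)            # vertical edges of X, shape (6, 7)
Y_t = corr2d(X.t(), K.t())  # horizontal edges of X.t(), shape (7, 6)

# Transposing both the image and the kernel transposes the output
print(torch.equal(Y_t, Y.t()))  # True
```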

## 6.2.4. Learning a Kernel

In [8]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')

epoch 2, loss 11.444
epoch 4, loss 2.132
epoch 6, loss 0.444
epoch 8, loss 0.110
epoch 10, loss 0.033


In [9]:
conv2d.weight.data.reshape((1, 2))

tensor([[ 0.9669, -0.9993]])
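
The manual parameter update above can also be expressed with a built-in optimizer. The following is a sketch under stated assumptions, not the method used in this section: it swaps the hand-written update for `torch.optim.SGD` with the same learning rate, builds the target `Y` with `torch.nn.functional.conv2d`, and runs 50 iterations instead of 10 so that the kernel gets closer to `[1, -1]`. The seed and the name `learned` are illustrative choices.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)  # illustrative seed; any initialization should work

# Rebuild the edge-detection data in (example, channel, height, width) format
X = torch.ones((6, 8))
X[:, 2:6] = 0
X = X.reshape((1, 1, 6, 8))
K = torch.tensor([[1.0, -1.0]]).reshape((1, 1, 1, 2))
Y = F.conv2d(X, K)  # target edges, shape (1, 1, 6, 7)

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
optimizer = torch.optim.SGD(conv2d.parameters(), lr=3e-2)

for epoch in range(50):
    optimizer.zero_grad()
    loss = ((conv2d(X) - Y) ** 2).sum()  # summed squared error
    loss.backward()
    optimizer.step()  # plain gradient step: weight -= lr * grad

learned = conv2d.weight.detach().reshape((1, 2))
print(learned)  # should be close to [[1, -1]]
```

Using an optimizer object avoids touching `weight.data` directly and makes it easy to switch to another update rule later.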

## 6.2.5. Cross-Correlation and Convolution

Recall our observation from Section 6.1 of the correspondence between the cross-correlation and convolution operations. Here let us continue to consider two-dimensional convolutional layers. What if such layers perform strict convolution operations as defined in (6.1.6) instead of cross-correlations? In order to obtain the output of the strict convolution operation, we only need to flip the two-dimensional kernel tensor both horizontally and vertically, and then perform the cross-correlation operation with the input tensor.

It is noteworthy that, since kernels are learned from data in deep learning, the outputs of convolutional layers remain unaffected whether such layers perform strict convolution or cross-correlation.

To illustrate this, suppose that a convolutional layer performs cross-correlation and learns the kernel in Fig. 6.2.1, denoted here as the matrix K. Assuming that all other conditions remain unchanged, when this layer performs strict convolution instead, the learned kernel K′ will equal K once K′ is flipped both horizontally and vertically. That is, when the convolutional layer performs strict convolution on the input in Fig. 6.2.1 with K′, it produces the same output as in Fig. 6.2.1 (the cross-correlation of the input and K).
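
The flip relationship can be checked numerically. The sketch below writes strict convolution directly from its index definition (restricted to positions where the kernel fully overlaps the input) in a hypothetical helper `conv2d_strict`, and uses `torch.flip` for the horizontal and vertical flip. The input and kernel values are assumed for illustration.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_strict(X, K):
    """Hypothetical helper: strict convolution with reversed kernel indices."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    # The kernel index runs opposite to the input index
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K_prime = torch.flip(K, dims=(0, 1))  # flip vertically and horizontally

# Strict convolution equals cross-correlation with the flipped kernel
print(torch.allclose(conv2d_strict(X, K), corr2d(X, K_prime)))  # True

# A strict-convolution layer that learned K_prime yields the same output
# as a cross-correlation layer that learned K
print(torch.allclose(conv2d_strict(X, K_prime), corr2d(X, K)))  # True
```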

In keeping with standard terminology in the deep learning literature, we will continue to refer to the cross-correlation operation as a convolution even though, strictly speaking, it is slightly different. In addition, we use the term element to refer to an entry (or component) of any tensor representing a layer representation or a convolution kernel.

## 6.2.6. Feature Map and Receptive Field

As described in Section 6.1.4.1, the convolutional layer output in Fig. 6.2.1 is sometimes called a feature map, as it can be regarded as the learned representations (features) in the spatial dimensions (e.g., width and height) passed to the subsequent layer. In CNNs, for any element x of some layer, its receptive field refers to all the elements (from all the previous layers) that may affect the calculation of x during forward propagation. Note that the receptive field may be larger than the actual size of the input.

Let us continue to use Fig. 6.2.1 to explain the receptive field. Given the 2×2 convolution kernel, the receptive field of the shaded output element (of value 19) is the four elements in the shaded portion of the input. Now let us denote the 2×2 output as Y and consider a deeper CNN with an additional 2×2 convolutional layer that takes Y as its input, outputting a single element z. In this case, the receptive field of z on Y includes all four elements of Y, while its receptive field on the input includes all nine input elements. Thus, when any element in a feature map needs a larger receptive field to detect input features over a broader area, we can build a deeper network.
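
One way to see a receptive field concretely is to ask which input elements receive a nonzero gradient from a given output element. The sketch below assumes all-ones kernels and inputs (so no gradient vanishes by coincidence), stacks two 2×2 layers using `torch.nn.functional.conv2d`, and counts the input elements that influence one element of Y and the single element z. The names `grad_y00` and `grad_z` are introduced for illustration.

```python
import torch
import torch.nn.functional as F

# A 3x3 input in (example, channel, height, width) format
X = torch.ones((1, 1, 3, 3), requires_grad=True)
K = torch.ones((1, 1, 2, 2))  # illustrative 2x2 kernel of ones

Y = F.conv2d(X, K)  # first layer, shape (1, 1, 2, 2)
z = F.conv2d(Y, K)  # second layer, shape (1, 1, 1, 1)

# Gradient of the top-left element of Y with respect to the input
grad_y00 = torch.autograd.grad(Y[0, 0, 0, 0], X, retain_graph=True)[0]
# Gradient of the single element z with respect to the input
grad_z = torch.autograd.grad(z.sum(), X)[0]

print(int((grad_y00 != 0).sum()))  # 4: the top-left 2x2 patch of the input
print(int((grad_z != 0).sum()))    # 9: the whole 3x3 input
```

Elements with a zero gradient cannot affect the output, so the nonzero pattern traces out the receptive field for these particular kernels.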

## 6.2.7. Summary

The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation. In its simplest form, this performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.

We can design a kernel to detect edges in images.

We can learn the kernel’s parameters from data.

With kernels learned from data, the outputs of convolutional layers remain unaffected regardless of whether the layers perform strict convolution or cross-correlation.

When any element in a feature map needs a larger receptive field to detect broader features on the input, a deeper network can be considered.