## Convolutional Neural Networks 0
<br>
<h2> Data Science in Electron Microscopy </h2>

<hr>

<h3> Philipp Pelz </h3>

<h3> 2024 </h3>
<br>

<h3>  &nbsp; [https://github.com/ECLIPSE-Lab/SS24_DataScienceForEM](https://github.com/ECLIPSE-Lab/SS24_DataScienceForEM)
</h3>

## From Fully Connected Layers to Convolutions
:label:`sec_why-conv`

- models that we have discussed so far remain appropriate options when we are dealing with tabular data.
- tabular: data consist of rows corresponding to examples and columns corresponding to features. 
- With tabular data, we might anticipate that the patterns we seek could involve interactions among the features, but we do not assume any structure *a priori* concerning how the features interact.
- no prior knowledge about structure of data --> MLP
- for high-dimensional perceptual data, structure-less networks can grow unwieldy.

- Even if we could get away with one hundred thousand pixels, a hidden layer of size 1000 grossly underestimates the number of hidden units needed to learn good representations of images $\rightarrow$ a practical system will still require billions of parameters.
- learning a classifier by fitting so many parameters might require collecting an enormous dataset. 
- images exhibit rich structure that can be exploited. 
- Convolutional neural networks (CNNs) are one way to exploit some of the known structure in natural images.

## Invariance

- want to detect an object in an image. 
- seems reasonable that whatever method we use to recognize objects should not be overly concerned with the precise location of the object in the image. 
- system should exploit this knowledge. Pigs usually do not fly and planes usually do not swim. Nonetheless, we should still recognize a pig were one to appear at the top of the image. 
- We can draw some inspiration here from the children's game "Where's Waldo" (depicted in :numref:`img_waldo`).
- The game consists of a number of chaotic scenes bursting with activities.

![An image of the "Where's Waldo" game.](../img/where-wally-walker-books.jpg)

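- We can now make these intuitions more concrete by enumerating a few desiderata to guide our design of a neural network architecture suitable for computer vision:
    1. *Translation invariance*: in the earliest layers, the network should respond similarly to the same patch, regardless of where it appears in the image.
    1. *Locality*: the earliest layers should focus on local regions, without regard for the contents of the image in distant regions.
    1. Deeper layers should represent larger and more complex aspects of an image.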

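- To start off, consider an MLP with a two-dimensional image $\mathbf{X}$ as input and a hidden representation $\mathbf{H}$ of the same shape, where $[\mathbf{X}]_{i, j}$ and $[\mathbf{H}]_{i, j}$ denote the pixel at location $(i, j)$ in the input image and the hidden representation, respectively.
- Consequently, to have each of the hidden units receive input from each of the input pixels, we switch from weight matrices (as used previously in MLPs) to fourth-order weight tensors $\mathsf{W}$.
- Supposing that $\mathbf{U}$ contains the biases, we can formally express the fully connected layer as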

$$\begin{aligned} \left[\mathbf{H}\right]_{i, j} &= [\mathbf{U}]_{i, j} + \sum_k \sum_l[\mathsf{W}]_{i, j, k, l}  [\mathbf{X}]_{k, l}\\ &=  [\mathbf{U}]_{i, j} +
\sum_a \sum_b [\mathsf{V}]_{i, j, a, b}  [\mathbf{X}]_{i+a, j+b}.\end{aligned}$$

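- The switch from $\mathsf{W}$ to $\mathsf{V}$ is entirely cosmetic for now since there is a one-to-one correspondence between coefficients in both fourth-order tensors: we simply re-index with $k = i+a$ and $l = j+b$, i.e., $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}$.
- Now invoke the first principle, translation invariance: a shift in the input $\mathbf{X}$ should simply lead to a shift in the hidden representation $\mathbf{H}$. This is only possible if $\mathsf{V}$ and $\mathbf{U}$ do not depend on $(i, j)$, so that $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$ and $\mathbf{U}$ is a constant, say $u$:

$$[\mathbf{H}]_{i, j} = u + \sum_a\sum_b [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$

- This is a *convolution*! We are effectively weighting pixels at $(i+a, j+b)$ in the vicinity of location $(i, j)$ with coefficients $[\mathbf{V}]_{a, b}$ to obtain the value $[\mathbf{H}]_{i, j}$.
- Note that $[\mathbf{V}]_{a, b}$ needs many fewer coefficients than $[\mathsf{V}]_{i, j, a, b}$ since it no longer depends on the location within the image. Consequently, the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \cdot 10^6$: we still have the dependency on $a, b \in (-1000, 1000)$.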
- In short, we have made significant progress. Time-delay neural networks (TDNNs) are some of the first examples to exploit this idea :cite:`Waibel.Hanazawa.Hinton.ea.1989`.

##  Locality

- Now let's invoke the second principle: locality. As motivated above, we believe that we should not have to look very far away from location $(i, j)$ in order to glean relevant information to assess what is going on at $[\mathbf{H}]_{i, j}$. 
- This means that outside some range $|a|> \Delta$ or $|b| > \Delta$, we should set $[\mathbf{V}]_{a, b} = 0$. 
- Equivalently, we can rewrite $[\mathbf{H}]_{i, j}$ as 
$$[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b}  [\mathbf{X}]_{i+a, j+b}.$$
:eqlabel:`eq_conv-layer`

- While previously, we might have required billions of parameters to represent just a single layer in an image-processing network, we now typically need just a few hundred, without altering the dimensionality of either the inputs or the hidden representations.
- The price paid for this drastic reduction in parameters is that our features are now translation invariant and that our layer can only incorporate local information when determining the value of each hidden activation.
- All learning depends on imposing inductive bias. When that bias agrees with reality, we get sample-efficient models that generalize well to unseen data.
- But of course, if those biases do not agree with reality, e.g., if images turned out not to be translation invariant, our models might struggle even to fit our training data. 
- This dramatic reduction in parameters brings us to our last desideratum,  namely that deeper layers should represent larger and more complex aspects of an image. This can be achieved by interleaving nonlinearities and convolutional layers repeatedly. 
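
The parameter counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes a hypothetical $1000 \times 1000$ pixel input mapped to a hidden representation of the same size, and picks $\Delta = 10$ purely for illustration; the function names are ours and not part of any library.

```python
def fully_connected_params(n):
    """Fourth-order tensor W[i, j, k, l] for an n x n input and n x n hidden layer."""
    return n ** 4


def translation_invariant_params(n):
    """Kernel V[a, b] with roughly 2n possible offsets along each axis."""
    return (2 * n) ** 2


def local_params(delta):
    """Kernel V[a, b] restricted to |a| <= delta and |b| <= delta."""
    return (2 * delta + 1) ** 2


n = 1000      # assumed image side length (1000 x 1000 pixels)
delta = 10    # assumed locality radius, chosen only for illustration

print(fully_connected_params(n))        # 1000000000000  (10^12)
print(translation_invariant_params(n))  # 4000000        (4 * 10^6)
print(local_params(delta))              # 441            (a few hundred)
```

Bias terms are left out of all three counts; they do not change the orders of magnitude.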

## Convolutions

- Let's briefly review why :eqref:`eq_conv-layer` is called a convolution. 
- In mathematics, the *convolution* between two functions :cite:`Rudin.1973`, say $f, g: \mathbb{R}^d \to \mathbb{R}$ is defined as

$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x}-\mathbf{z}) d\mathbf{z}.$$

- That is, we measure the overlap between $f$ and $g$ when one function is "flipped" and shifted by $\mathbf{x}$. 
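- Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors from the set of square-summable infinite-dimensional vectors with index running over $\mathbb{Z}$ we obtain the following definition:

$$(f * g)(i) = \sum_a f(a) g(i-a).$$

- For two-dimensional tensors, we have a corresponding sum with indices $(a, b)$ for $f$ and $(i-a, j-b)$ for $g$, respectively:

$$(f * g)(i, j) = \sum_a\sum_b f(a, b) g(i-a, j-b).$$
:eqlabel:`eq_2d-conv-discrete`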

- This looks similar to :eqref:`eq_conv-layer`, with one major difference. 
- Rather than using $(i+a, j+b)$, we are using the difference instead. Note, though, that this distinction is mostly cosmetic since we can always match the notation between :eqref:`eq_conv-layer` and :eqref:`eq_2d-conv-discrete`.
- Our original definition in :eqref:`eq_conv-layer` more properly describes a *cross-correlation*. We will come back to this in the following section.
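
To make the distinction concrete, here is a minimal sketch of the one-dimensional discrete convolution $(f * g)(i) = \sum_a f(a) g(i-a)$ for finite sequences, treating every entry outside the stored range as zero, next to a cross-correlation that uses index sums instead of differences. The helper names `conv1d` and `xcorr1d` are hypothetical and the sequences are arbitrary examples.

```python
def conv1d(f, g):
    """Full discrete convolution: (f * g)(i) = sum_a f(a) g(i - a)."""
    out = [0] * (len(f) + len(g) - 1)
    for i in range(len(out)):
        for a in range(len(f)):
            if 0 <= i - a < len(g):
                out[i] += f[a] * g[i - a]
    return out


def xcorr1d(x, k):
    """Cross-correlation over valid positions: y(i) = sum_a k(a) x(i + a)."""
    return [sum(k[a] * x[i + a] for a in range(len(k)))
            for i in range(len(x) - len(k) + 1)]


x = [1, 2, 3, 4]
k = [1, 0, -1]

print(conv1d(x, k))          # [1, 2, 2, 2, -3, -4]
print(xcorr1d(x, k))         # [-2, -2]
print(xcorr1d(x, k[::-1]))   # [2, 2], the fully overlapping part of conv1d(x, k)
```

For this asymmetric kernel the two operations differ, but cross-correlating with the reversed kernel recovers the convolution wherever the kernel fully overlaps the input.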

## Channels

- Returning to Waldo detector, let's see what this looks like. 
- convolutional layer picks windows of a given size and weighs intensities according to the filter $\mathsf{V}$, as demonstrated in :numref:`fig_waldo_mask`. 
- aim to learn a model so that wherever the "waldoness" is highest, we should find a peak in the hidden layer representations. 

![Detect Waldo.](../img/waldo-mask.jpg){width=400px}

- Moreover, just as our input consists of a third-order tensor, it turns out to be a good idea to similarly formulate our hidden representations as third-order tensors $\mathsf{H}$.
- In other words, rather than just having a single hidden representation corresponding to each spatial location, we want an entire vector of hidden representations corresponding to each spatial location.
- We could think of the hidden representations as comprising a number of two-dimensional grids stacked on top of each other. As in the inputs, these are sometimes called *channels*.
- They are also sometimes called *feature maps*, as each provides a spatialized set of learned features to the subsequent layer. Intuitively, you might imagine that at lower layers that are closer to inputs, some channels could become specialized to recognize edges while others could recognize textures.
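
One way to write this down, offered as a sketch in the notation of :eqref:`eq_conv-layer` rather than as the only possible convention, is to give the input a channel index $c$ and the hidden representation a channel index $d$, so that the kernel becomes a fourth-order tensor $\mathsf{V}$ (the bias is omitted for brevity):

$$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c}.$$

Here $d$ indexes the output channels of $\mathsf{H}$, and each output channel sums contributions from all input channels $c$ within the local window.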

- still many operations that we need to address. 
- For instance, we need to figure out how to combine all the hidden representations to a single output, e.g., whether there is a Waldo *anywhere* in the image. 
- We also need to decide how to compute things efficiently, how to combine multiple layers, appropriate activation functions, and how to make reasonable design choices to yield networks that are effective in practice. We turn to these issues in the remainder of the chapter.

## Summary and Discussion

- derived the structure of convolutional neural networks from first principles. 
- While it is unclear whether this is what led to the invention of CNNs, it is satisfying to know that they are the *right* choice when applying reasonable principles to how image processing and computer vision algorithms should operate, at least at lower levels.
- In particular, translation invariance in images implies that all patches of an image will be treated in the same manner. 
- Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations. Some of the earliest references to CNNs are in the form of the Neocognitron :cite:`Fukushima.1982`. 
- second principle that we encountered in our reasoning is how to reduce the number of parameters in a function class without limiting its expressive power, at least, whenever certain assumptions on the model hold. 
- saw a dramatic reduction of complexity as a result of this restriction, turning computationally and statistically infeasible problems into tractable models. 

---

- Adding channels allowed us to bring back some of the complexity that was lost due to the restrictions imposed on the convolutional kernel by locality and translation invariance. Note that channels are quite a natural addition beyond red, green, and blue. 
- Many images have tens to hundreds of channels; such hyperspectral images report data on many different wavelengths. In the following we will see how to use convolutions effectively to manipulate the dimensionality of the images they operate on, how to move from location-based to channel-based representations, and how to deal with large numbers of categories efficiently.

## Exercises

1. Assume that the size of the convolution kernel is $\Delta = 0$.
   Show that in this case the convolution kernel
   implements an MLP independently for each set of channels. This leads to the Network in Network 
   architectures :cite:`Lin.Chen.Yan.2013`. 
1. Audio data is often represented as a one-dimensional sequence. 
    1. When might you want to impose locality and translation invariance for audio? 
    1. Derive the convolution operations for audio.
    1. Can you treat audio using the same tools as computer vision? Hint: use the spectrogram.
1. Why might translation invariance not be a good idea after all? Give an example. 
1. Do you think that convolutional layers might also be applicable for text data?
   Which problems might you encounter with language?
1. What happens with convolutions when an object is at the boundary of an image?
1. Prove that the convolution is symmetric, i.e., $f * g = g * f$.
1. Prove the convolution theorem, i.e., $f * g = \mathcal{F}^{-1}\left[\mathcal{F}[f] \cdot \mathcal{F}[g]\right]$. 
   Can you use it to accelerate convolutions? 

## Convolutions for Images

- Now that we understand how convolutional layers work in theory, we are ready to see how they work in practice.
- Building on our motivation of convolutional neural networks as efficient architectures for exploring structure in image data, we stick with images as our running example. 


In [None]:
from d2l import torch as d2l
import torch
from torch import nn

![Two-dimensional cross-correlation operation. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $0\times0+1\times1+3\times2+4\times3=19$.](../img/correlation.svg)

- In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the upper-left corner of the input tensor and slide it across the input tensor, both from left to right and top to bottom. 
- When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise and the resulting tensor is summed up yielding a single scalar value. 


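- Note that along each axis, the output is slightly smaller than the input: for an input of size $n_h \times n_w$ and a kernel of size $k_h \times k_w$, the output has size $(n_h-k_h+1) \times (n_w-k_w+1)$. This is the case since we need enough space to "shift" the convolution kernel across the image. Later we will see how to keep the size unchanged by padding the image with zeros around its boundary so that there is enough space to shift the kernel. Next, we implement this process in the `corr2d` function, which accepts an input tensor `X` and a kernel tensor `K` and returns an output tensor `Y`.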


In [None]:
def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = d2l.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = d2l.reduce_sum((X[i: i + h, j: j + w] * K))
    return Y

---


- We can construct the input tensor `X` and the kernel tensor `K` from :numref:`fig_correlation` to [**validate the output of the above implementation**] of the two-dimensional cross-correlation operation.


In [None]:
X = d2l.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = d2l.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
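
As an independent check, PyTorch's built-in `torch.nn.functional.conv2d` (which, despite its name, computes a cross-correlation) can be applied to the same `X` and `K`. This is only a sketch for comparison and does not rely on the `d2l` helpers.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

# F.conv2d expects (batch, channel, height, width) inputs and
# (out_channels, in_channels, height, width) kernels, so add two leading axes.
Y = F.conv2d(X[None, None], K[None, None])[0, 0]
print(Y.tolist())  # [[19.0, 25.0], [37.0, 43.0]]
```

The upper-left entry, 19, is the shaded output element of the cross-correlation figure.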

In [None]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

- In an $h \times w$ convolution or an $h \times w$ convolution kernel, the height and width of the convolution kernel are $h$ and $w$, respectively.
- We also refer to a convolutional layer with an $h \times w$ convolution kernel simply as an $h \times w$ convolutional layer.

## Object Edge Detection in Images

- Let's take a moment to parse [**a simple application of a convolutional layer: detecting the edge of an object in an image**] by finding the location of the pixel change.
- First, we construct an "image" of $6\times 8$ pixels. The middle four columns are black (0) and the rest are white (1).


In [None]:
X = d2l.ones((6, 8))
X[:, 2:6] = 0
X

- Next, we construct a kernel `K` with a height of 1 and a width of 2.
- When we perform the cross-correlation operation with the input, the output is 0 if the horizontally adjacent elements are the same. Otherwise, the output is nonzero.
- This kernel is a special case of a finite difference operator. At location $(i,j)$ it computes $x_{i,j} - x_{(i+1),j}$, i.e., it computes the difference between the values of horizontally adjacent pixels.
- This is a discrete approximation of the first derivative in the horizontal direction. After all, for a function $f(i,j)$ its derivative $-\partial_i f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i+\epsilon,j)}{\epsilon}$. Let's see how this works in practice.

---


In [None]:
K = d2l.tensor([[1.0, -1.0]])

- We are now ready to perform the cross-correlation operation with arguments `X` (our input) and `K` (our kernel). As you can see, [**we detect 1 for the edge from white to black and -1 for the edge from black to white.**] All other outputs take value 0.


In [None]:
Y = corr2d(X, K)
Y

- We can now apply the kernel to the transposed image. As expected, it vanishes. [**The kernel `K` only detects vertical edges.**]


In [None]:
corr2d(d2l.transpose(X), K)
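
To see that the orientation of the kernel is what matters, the sketch below applies both `K` and its transpose to the transposed image. It uses PyTorch's `torch.nn.functional.conv2d` instead of `corr2d` so that it runs on its own; the helper `xcorr` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

# Same 6 x 8 "image" as above: white (1) with four black (0) middle columns.
X = torch.ones((6, 8))
X[:, 2:6] = 0

K = torch.tensor([[1.0, -1.0]])  # 1 x 2 kernel: horizontal finite difference


def xcorr(X, K):
    """2D cross-correlation via F.conv2d, for single-channel 2D tensors."""
    return F.conv2d(X[None, None].contiguous(), K[None, None].contiguous())[0, 0]


# K on the transposed image: every row is constant, so the output is all zeros.
print(xcorr(X.T, K).abs().sum().item())    # 0.0
# The transposed kernel (2 x 1) takes differences along the height instead,
# so it responds to the horizontal edges of the transposed image.
print(xcorr(X.T, K.T).abs().sum().item())  # 12.0
```

The value 12 comes from two edges, each spanning the six columns of the transposed image, with responses of $+1$ and $-1$.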

In [None]:
## Construct a two-dimensional convolutional layer with 1 output channel and a
## kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

## The two-dimensional convolutional layer uses four-dimensional input and
## output in the format of (example, channel, height, width), where the batch
## size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  ## Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    ## Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')

- Note that the error has dropped to a small value after 10 iterations. Now we will **take a look at the kernel tensor we learned.**

---


In [None]:
d2l.reshape(conv2d.weight.data, (1, 2))

- suppose that a convolutional layer performs *cross-correlation* and learns the kernel in :numref:`fig_correlation`, which is denoted as the matrix $\mathbf{K}$ here. 
- Assuming that other conditions remain unchanged, when this layer performs strict *convolution* instead, the learned kernel $\mathbf{K}'$ will be the same as $\mathbf{K}$ after $\mathbf{K}'$ is flipped both horizontally and vertically.
- That is to say, when the convolutional layer performs strict *convolution* for the input in :numref:`fig_correlation` and $\mathbf{K}'$, the same output in :numref:`fig_correlation` (cross-correlation of the input and $\mathbf{K}$) will be obtained.
- We continue to refer to the cross-correlation operation as a convolution even though, strictly speaking, it is slightly different.
- Besides, we use the term *element* to refer to an entry (or component) of any tensor representing a layer representation or a convolution kernel.
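
The relationship between $\mathbf{K}$ and $\mathbf{K}'$ can be checked numerically. The following is a minimal sketch, assuming the $3 \times 3$ input and $2 \times 2$ kernel of the cross-correlation figure; `strict_convolve` is a hypothetical helper written with explicit loops for clarity, not a library function.

```python
import torch
import torch.nn.functional as F

X = torch.arange(9.0).reshape(3, 3)          # input of the running example
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])   # kernel of the running example


def cross_correlate(X, K):
    """Cross-correlation over valid positions (what deep learning calls convolution)."""
    return F.conv2d(X[None, None], K[None, None])[0, 0]


def strict_convolve(X, K):
    """Strict convolution over valid positions, using index differences."""
    h, w = K.shape
    Y = torch.zeros(X.shape[0] - h + 1, X.shape[1] - w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    # Offsets h - 1 and w - 1 keep the differences inside X.
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y


K_prime = torch.flip(K, dims=(0, 1))  # K flipped vertically and horizontally
print(cross_correlate(X, K).tolist())  # [[19.0, 25.0], [37.0, 43.0]]
print(torch.equal(strict_convolve(X, K_prime), cross_correlate(X, K)))  # True
```

Since kernels are learned from data, a layer built on either operation ends up computing the same outputs; only the stored kernel differs by a flip.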

## Feature Map and Receptive Field

- As described in :numref:`subsec_why-conv-channels`, the convolutional layer output in :numref:`fig_correlation` is sometimes called a *feature map*, as it can be regarded as the learned representations (features) in the spatial dimensions (e.g., width and height) to the subsequent layer.
- in CNNs, for any element $x$ of some layer, its *receptive field* refers to all the elements (from all the previous layers) that may affect the calculation of $x$ during the forward propagation.
- receptive field may be larger than the actual size of the input.
- continue to use :numref:`fig_correlation` to explain the receptive field. Given the $2 \times 2$ convolution kernel, the receptive field of the shaded output element (of value $19$) is the four elements in the shaded portion of the input.
- denote the $2 \times 2$ output as $\mathbf{Y}$ and consider a deeper CNN with an additional $2 \times 2$ convolutional layer that takes $\mathbf{Y}$ as its input, outputting a single element $z$.
- In this case, the receptive field of $z$ on $\mathbf{Y}$ includes all the four elements of $\mathbf{Y}$, while the receptive field on the input includes all the nine input elements.
- Thus, when any element in a feature map needs a larger receptive field to detect input features over a broader area, we can build a deeper network.
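
The growth of the receptive field with depth can be tallied with a short helper. This is a sketch under simplifying assumptions (square kernels, stride 1, no dilation), and `receptive_field` is a hypothetical name, not a library function.

```python
def receptive_field(kernel_sizes):
    """Side length of the input region seen by one output element.

    Assumes square kernels, stride 1 and no dilation, so every k x k layer
    widens the receptive field by k - 1.
    """
    size = 1
    for k in kernel_sizes:
        size += k - 1
    return size


print(receptive_field([2]))     # 2 -> the four shaded input elements
print(receptive_field([2, 2]))  # 3 -> all nine input elements
```

Under these assumptions the receptive field grows linearly with the number of layers, which is one reason strides and pooling are used when a much larger context is needed.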

- We reprint a key figure from :citet:`Field.1987` in :numref:`field_visual` to illustrate the striking similarities between such visual sensors and convolutional kernels.

![Figure and caption taken from :citet:`Field.1987`: An example of coding with six different channels. (Left) Examples of the six types of sensor associated with each channel. (Right) Convolution of the image in (Middle) with the six sensors shown in (Left). The response of the individual sensors is determined by sampling these filtered images at a distance proportional to the size of the sensor (shown with dots). This diagram shows the response of only the even symmetric sensors.](../img/field-visual.png){width=600px}


- *Padding* is the most popular tool for handling the loss of pixels at the image boundary when applying convolutional layers.
- In other cases, we may want to reduce the dimensionality drastically, e.g., if we find the original input resolution to be unwieldy. *Strided convolutions* are a popular technique that can help in these instances.


In [None]:
import torch
from torch import nn

## Padding

- As described above, one tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Consider :numref:`img_conv_reuse` that depicts the pixel utilization as a function of the convolution kernel size and the position within the image. The pixels in the corners are hardly used at all. 

![Pixel utilization for convolutions of size $1 \times 1$, $2 \times 2$, and $3 \times 3$ respectively.](../img/conv-reuse.svg)


![Two-dimensional cross-correlation with padding.](../img/conv-pad.svg){width=500px}


- In general, if we add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and a total of $p_w$ columns of padding (roughly half on the left and half on the right), the output shape will be

$$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).$$

- In the following example, we create a two-dimensional convolutional layer with a height and width of 3 and (**apply 1 pixel of padding on all sides.**) 
- Given an input with a height and width of 8, we find that the height and width of the output are also 8.


In [None]:
## We define a helper function to calculate convolutions. It initializes the
## convolutional layer weights and performs corresponding dimensionality
## elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    ## (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    ## Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

## 1 row and column is padded on either side, so a total of 2 rows or columns
## are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

- When the height and width of the convolution kernel are different, we can make the output and input have the same height and width by [**setting different padding numbers for height and width.**]


In [None]:
## We use a convolution kernel with height 5 and width 3. The padding on either
## side of the height and width are 2 and 1, respectively
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
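
The padding values used above follow a simple rule for odd kernel sizes: padding $(k-1)/2$ pixels on each side adds $k-1$ in total, which cancels the $-k+1$ in the output shape formula. The sketch below checks this with the regular `nn.Conv2d` (rather than `nn.LazyConv2d`) and also shows PyTorch's `padding="same"` shortcut; the kernel sizes are arbitrary examples.

```python
import torch
from torch import nn

X = torch.rand(1, 1, 8, 8)  # (batch, channel, height, width)

for k in (1, 3, 5, 7):
    # Odd kernel size k: padding (k - 1) // 2 on each side adds k - 1 in total,
    # so the output size n - k + p + 1 equals the input size n.
    conv = nn.Conv2d(1, 1, kernel_size=k, padding=(k - 1) // 2)
    assert conv(X).shape == X.shape

# PyTorch also accepts padding="same" for stride-1 convolutions.
conv_same = nn.Conv2d(1, 1, kernel_size=(5, 3), padding="same")
print(conv_same(X).shape)  # torch.Size([1, 1, 8, 8])
```

Note that the `padding` argument in PyTorch counts pixels per side, whereas $p_h$ and $p_w$ in the formula are totals over both sides.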

## Stride

- When computing the cross-correlation, we start with the convolution window at the upper-left corner of the input tensor, then slide it over all locations both down and to the right.
- Previously, we defaulted to sliding one element at a time.
- Sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one element at a time, skipping the intermediate locations.
- This is particularly useful if the convolution kernel is large since it captures a large area of the underlying image.
- We refer to the number of rows and columns traversed per slide as *stride*.
- So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. :numref:`img_conv_stride` shows a two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizontally. 

![Cross-correlation with strides of 3 and 2 for height and width, respectively.](../img/conv-stride.svg){width=300px}


- In general, when the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is

$$\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.$$

- If we set $p_h=k_h-1$ and $p_w=k_w-1$, then the output shape can be simplified to $\lfloor(n_h+s_h-1)/s_h\rfloor \times \lfloor(n_w+s_w-1)/s_w\rfloor$. 
- Going a step further, if the input height and width are divisible by the strides on the height and width, then the output shape will be $(n_h/s_h) \times (n_w/s_w)$. Below, we [**set the strides on both the height and width to 2**], thus halving the input height and width. 
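
The output shape formula is easy to evaluate along one axis at a time. The helper below is a sketch with a hypothetical name, `conv_output_size`; its `p` is the total padding over both sides, as in the formula, which is twice the per-side `padding` argument of PyTorch layers when padding is symmetric.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis: floor((n - k + p + s) / s).

    n: input size, k: kernel size, p: TOTAL padding (both sides), s: stride.
    """
    return (n - k + p + s) // s


# 8 x 8 input, 3 x 3 kernel, one pixel of padding per side (p = 2), stride 1
print(conv_output_size(8, 3, p=2, s=1))  # 8
# Same setting with stride 2: height and width are halved
print(conv_output_size(8, 3, p=2, s=2))  # 4
# No padding, stride 1: the plain n - k + 1 rule
print(conv_output_size(3, 2))            # 2
```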

---


In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape

- Let's look at (**a slightly more complicated example**).


In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape

- So far, all padding that we discussed simply extended images with zeros. This has significant computational benefit since it is trivial to accomplish.
- Moreover, operators can be engineered to take advantage of this padding implicitly without the need to allocate additional memory. 
- At the same time, it allows CNNs to encode implicit position information within an image, simply by learning where the "whitespace" is. 
- There are many alternatives to zero-padding. :citet:`Alsallakh.Kokhlikyan.Miglani.ea.2020` provided an extensive overview of alternatives (albeit without a clear case to use nonzero paddings unless artifacts occur). 

## Exercises

1. Given the last code example in this section with kernel size $(3, 5)$, padding $(0, 1)$, and stride $(3, 4)$, 
   calculate the output shape to check if it is consistent with the experimental result.
1. For audio signals, what does a stride of 2 correspond to?
1. Implement mirror padding, i.e., padding where the border values are simply mirrored to extend tensors. 
1. What are the computational benefits of a stride larger than 1?
1. What might be statistical benefits of a stride larger than 1?
1. How would you implement a stride of $\frac{1}{2}$? What does it correspond to? When would this be useful?

## Multiple Input and Multiple Output Channels
- While we described the multiple channels that comprise each image (e.g., color images have the standard RGB channels to indicate the amount of red, green and blue) and convolutional layers for multiple channels in :numref:`subsec_why-conv-channels`, until now, we simplified all of our numerical examples by working with just a single input and a single output channel.
- This allowed us to think of our inputs, convolution kernels, and outputs each as two-dimensional tensors.
- When we add channels into the mix, our inputs and hidden representations both become three-dimensional tensors. For example, each RGB input image has shape $3\times h\times w$. We refer to this axis, with a size of 3, as the *channel* dimension. The notion of channels is as old as CNNs themselves. For instance LeNet5 :cite:`LeCun.Jackel.Bottou.ea.1995` uses them.  In this section, we will take a deeper look at convolution kernels with multiple input and multiple output channels.


In [None]:
from d2l import torch as d2l
import torch

## Multiple Input Channels

- When the input data contains multiple channels, we need to construct a convolution kernel with the same number of input channels as the input data, so that it can perform cross-correlation with the input data.
- Assuming that the number of channels for the input data is $c_i$, the number of input channels of the convolution kernel also needs to be $c_i$. If our convolution kernel's window shape is $k_h\times k_w$, then when $c_i=1$, we can think of our convolution kernel as just a two-dimensional tensor of shape $k_h\times k_w$.
- However, when $c_i>1$, we need a kernel that contains a tensor of shape $k_h\times k_w$ for *every* input channel. Concatenating these $c_i$ tensors together yields a convolution kernel of shape $c_i\times k_h\times k_w$.
- Since the input and convolution kernel each have $c_i$ channels, we can perform a cross-correlation operation on the two-dimensional tensor of the input and the two-dimensional tensor of the convolution  kernel for each channel, adding the $c_i$ results together (summing over the channels) to yield a two-dimensional tensor.
- This is the result of a two-dimensional cross-correlation between a multi-channel input and a multi-input-channel convolution kernel. 

---

- :numref:`fig_conv_multi_in` provides an example of a two-dimensional cross-correlation with two input channels. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: $(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$.

![Cross-correlation computation with 2 input channels.](../img/conv-multi-in.svg)
:label:`fig_conv_multi_in`

- To make sure we really understand what is going on here, we can (**implement cross-correlation operations with multiple input channels**) ourselves. 
- Notice that all we are doing is performing a cross-correlation operation per channel and then adding up the results.


In [None]:
def corr2d_multi_in(X, K):
    ## Iterate through the 0th dimension (channel) of K first, then add them up
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

- We can construct the input tensor `X` and the kernel tensor `K` corresponding to the values in :numref:`fig_conv_multi_in` to (**validate the output**) of the cross-correlation operation.


In [None]:
X = d2l.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = d2l.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)
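
For comparison, the same multi-channel result can be obtained from PyTorch's built-in `torch.nn.functional.conv2d`, which sums over input channels automatically. This is a sketch that does not depend on the `d2l` helpers; the tensors are the ones from the figure.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

# Input: (batch=1, c_i=2, 3, 3); kernel: (c_o=1, c_i=2, 2, 2).
Y = F.conv2d(X[None], K[None])[0, 0]
print(Y.tolist())  # [[56.0, 72.0], [104.0, 120.0]]
```

The upper-left entry, 56, matches the hand computation for the shaded elements.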

## Multiple Output Channels
- Regardless of the number of input channels, so far we always ended up with one output channel. However, as we discussed in :numref:`subsec_why-conv-channels`, it turns out to be essential to have multiple channels at each layer. In the most popular neural network architectures, we actually increase the channel dimension as we go deeper in the neural network, typically downsampling to trade off spatial resolution for greater *channel depth*. 
- Intuitively, you could think of each channel as responding to a different set of features.
- The reality is a bit more complicated than this. A naive interpretation would suggest that representations are learned independently per pixel or per channel.
- Instead, channels are optimized to be jointly useful. This means that rather than mapping a single channel to an edge detector, it may simply mean  that some direction in channel space corresponds to detecting edges.
- Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i\times k_h\times k_w$ for *every* output channel. 

---

- We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o\times c_i\times k_h\times k_w$. In cross-correlation operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input tensor.
- We implement a cross-correlation function to [**calculate the output of multiple channels**] as shown below.


In [None]:
def corr2d_multi_in_out(X, K):
    ## Iterate through the 0th dimension of K, and each time, perform
    ## cross-correlation operations with input X. All of the results are
    ## stacked together
    return d2l.stack([corr2d_multi_in(X, k) for k in K], 0)

- We construct a trivial convolution kernel with 3 output channels by concatenating the kernel tensor for `K` with `K+1` and `K+2`.


In [None]:
K = d2l.stack((K, K + 1, K + 2), 0)
K.shape

- Below, we perform cross-correlation operations on the input tensor `X` with the kernel tensor `K`. Now the output contains 3 channels. The result of the first channel is consistent with the result of the previous input tensor `X` and the multi-input channel, single-output channel kernel.


In [None]:
corr2d_multi_in_out(X, K)
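
The kernel layout $c_o\times c_i\times k_h\times k_w$ is also the one PyTorch uses for the weights of its convolutional layers. A quick sketch, with channel counts chosen to match the example above:

```python
import torch
from torch import nn

c_i, c_o, k_h, k_w = 2, 3, 2, 2
conv = nn.Conv2d(c_i, c_o, kernel_size=(k_h, k_w), bias=False)

print(conv.weight.shape)  # torch.Size([3, 2, 2, 2]), i.e. (c_o, c_i, k_h, k_w)

X = torch.rand(1, c_i, 3, 3)   # one example with c_i input channels
print(conv(X).shape)           # torch.Size([1, 3, 2, 2]): c_o output channels
```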

## $1\times 1$ Convolutional Layer
- At first, a [**$1 \times 1$ convolution**], i.e., $k_h = k_w = 1$, does not seem to make much sense. 
- After all, a convolution correlates adjacent pixels. A $1 \times 1$ convolution obviously does not. Nonetheless, they are popular operations that are sometimes included in the designs of complex deep networks :cite:`Lin.Chen.Yan.2013,Szegedy.Ioffe.Vanhoucke.ea.2017`. Let's see in some detail what it actually does.
- Because the minimum window is used, the $1\times 1$ convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions. The only computation of the $1\times 1$ convolution occurs on the channel dimension.
- :numref:`fig_conv_1x1` shows the cross-correlation computation using the $1\times 1$ convolution kernel with 3 input channels and 2 output channels.
- Note that the inputs and outputs have the same height and width. Each element in the output is derived from a linear combination of elements *at the same position* in the input image.
- You could think of the $1\times 1$ convolutional layer as constituting a fully connected layer applied at every single pixel location to transform the $c_i$ corresponding input values into $c_o$ output values. 

In [None]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = d2l.reshape(X, (c_i, h * w))
    K = d2l.reshape(K, (c_o, c_i))
    ## Matrix multiplication in the fully connected layer
    Y = d2l.matmul(K, X)
    return d2l.reshape(Y, (c_o, h, w))

- When performing $1\times 1$ convolutions, the above function is equivalent to the previously implemented cross-correlation function `corr2d_multi_in_out`. Let's check this with some sample data.


In [None]:
X = d2l.normal(0, 1, (3, 3, 3))
K = d2l.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(d2l.reduce_sum(d2l.abs(Y1 - Y2))) < 1e-6

## Discussion

- Channels allow us to combine the best of both worlds: MLPs that allow for significant nonlinearities and convolutions that allow for *localized* analysis of features. In particular, channels allow the CNN to reason with multiple features, such as edge and shape detectors at the same time. 
- Channels also offer a practical trade-off between the drastic parameter reduction arising from translation invariance and locality, and the need for expressive and diverse models in computer vision. 
- flexibility comes at a price. Given an image of size $(h \times w)$, the cost for computing a $k \times k$ convolution is $\mathcal{O}(h \cdot w \cdot k^2)$. For $c_i$ and $c_o$ input and output channels respectively this increases to $\mathcal{O}(h \cdot w \cdot k^2 \cdot c_i \cdot c_o)$. 
- For a $256 \times 256$ pixel image with a $5 \times 5$ kernel and $128$ input and $128$ output channels, this amounts to over 53 billion operations (we count multiplications and additions separately).
-  Later: effective strategies to cut down on the cost, e.g., by requiring the channel-wise operations to be block-diagonal, leading to architectures such as ResNeXt :cite:`Xie.Girshick.Dollar.ea.2017`. 
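
The operation count quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an estimate under stated assumptions: the helper name `conv_cost` and its `groups` argument are hypothetical, the output is assumed to have the same height and width as the input, and bias terms are ignored. The `groups` argument models the block-diagonal channel mixing mentioned above, where channels are split into independent groups.

```python
def conv_cost(h, w, k, c_in, c_out, groups=1):
    """Rough operation count for a k x k convolution (hypothetical helper).

    Assumes the output has the same h x w size as the input and counts
    every multiplication and every addition as one operation.
    """
    # Each group mixes only c_in/groups inputs into c_out/groups outputs
    multiply_adds = h * w * k * k * (c_in // groups) * (c_out // groups) * groups
    return 2 * multiply_adds

dense_ops = conv_cost(h=256, w=256, k=5, c_in=128, c_out=128)
grouped_ops = conv_cost(h=256, w=256, k=5, c_in=128, c_out=128, groups=4)
print(f"dense:   {dense_ops:,} operations (~{dense_ops / 1e9:.1f} billion)")
print(f"grouped: {grouped_ops:,} operations (~{grouped_ops / 1e9:.1f} billion)")
```

With four groups the count drops by a factor of four, which is why restricting the channel mixing is an attractive way to reduce cost.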

## Exercises

1. Assume that we have two convolution kernels of size $k_1$ and $k_2$, respectively 
   (with no nonlinearity in-between).
    1. Prove that the result of the operation can be expressed by a single convolution.
    1. What is the dimensionality of the equivalent single convolution?
    1. Is the converse true, i.e., can you always decompose a convolution into two smaller ones?
1. Assume an input of shape $c_i\times h\times w$ and a convolution kernel of shape 
   $c_o\times c_i\times k_h\times k_w$, padding of $(p_h, p_w)$, and stride of $(s_h, s_w)$.
    1. What is the computational cost (multiplications and additions) for the forward propagation?
    1. What is the memory footprint?
    1. What is the memory footprint for the backward computation?
    1. What is the computational cost for the backpropagation?
1. By what factor does the number of calculations increase if we double the number of input channels 
   $c_i$ and the number of output channels $c_o$? What happens if we double the padding?


- If an image is shifted by one pixel, an edge detected by a convolutional layer will have shifted by one pixel as well. In reality, objects hardly ever occur exactly at the same place. In fact, even with a tripod and a stationary object, vibration of the camera due to the movement of the shutter might shift everything by a pixel or so (high-end cameras are loaded with special features to address this problem).
- This section introduces *pooling layers*, which serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.


In [None]:
from d2l import torch as d2l
import torch
from torch import nn

## Maximum Pooling and Average Pooling
- Like convolutional layers, *pooling* operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the *pooling window*).
- However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no *kernel*). 
- Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. 
- These operations are called *maximum pooling* (*max-pooling* for short) and *average pooling*, respectively.
- *Average pooling* is essentially as old as CNNs. The idea is akin to  downsampling an image. 
- Rather than just taking the value of every second (or third) pixel for the lower-resolution image, we can average over adjacent pixels to obtain an image with a better signal-to-noise ratio, since we are combining the information from multiple adjacent pixels. 

- output tensor in :numref:`fig_pooling`  has a height of 2 and a width of 2. The four elements are derived from the maximum value in each pooling window:

$$
\max(0, 1, 3, 4)=4,\\
\max(1, 2, 4, 5)=5,\\
\max(3, 4, 6, 7)=7,\\
\max(4, 5, 7, 8)=8.\\
$$

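- The `pool2d` function below implements the forward pass of the pooling layer. It slides a window over the input just as cross-correlation does. However, no kernel is needed: the output is either the maximum or the average of each region in the input.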


In [None]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = d2l.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

- can construct the input tensor `X` in :numref:`fig_pooling` to [**validate the output of the two-dimensional max-pooling layer**].


In [None]:
X = d2l.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

- Also, we experiment with (**the average pooling layer**).


In [None]:
pool2d(X, (2, 2), 'avg')
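
To illustrate how max-pooling mitigates sensitivity to small shifts, the following sketch uses plain Python lists rather than tensors. The helper `max_pool_1d` and the two "edge detector responses" are hypothetical toy data, chosen so that the same edge appears one pixel apart. It is a one-dimensional simplification with stride 1, not the `pool2d` function above.

```python
def max_pool_1d(row, size=2):
    """Max-pooling over a 1D list with stride 1 (illustrative helper)."""
    return [max(row[i:i + size]) for i in range(len(row) - size + 1)]

# Hypothetical edge-detector responses: the same edge, one pixel apart
edge = [0, 0, 1, 0, 0, 0]
edge_shifted = [0, 0, 0, 1, 0, 0]

pooled = max_pool_1d(edge)
pooled_shifted = max_pool_1d(edge_shifted)
print(pooled)          # [0, 1, 1, 0, 0]
print(pooled_shifted)  # [0, 0, 1, 1, 0]

# Positions where both versions respond, before and after pooling
raw_overlap = sum(a and b for a, b in zip(edge, edge_shifted))
pooled_overlap = sum(a and b for a, b in zip(pooled, pooled_shifted))
print(raw_overlap, pooled_overlap)  # 0 1
```

Before pooling, the two responses never fire at the same position. After pooling with a window of 2, both fire at index 2, so a later layer looking at that position sees the edge in either case.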

## Padding and Stride

- As with convolutional layers, pooling layers change the output shape. And as before, we can adjust the operation to achieve a desired output shape by padding the input and adjusting the stride.
- We can demonstrate the use of padding and strides in pooling layers via the built-in two-dimensional max-pooling layer from the deep learning framework. 
- We first construct an input tensor `X` whose shape has four dimensions, where the number of examples (batch size) and number of channels are both 1.


In [None]:
X = d2l.reshape(d2l.arange(16, dtype=d2l.float32), (1, 1, 4, 4))
X

---


- Since pooling aggregates information from an area, (**deep learning frameworks default to matching pooling window sizes and stride.**) For instance, if we use a pooling window of shape `(3, 3)` we get a stride shape of `(3, 3)` by default.


In [None]:
pool2d = nn.MaxPool2d(3)
## Pooling has no model parameters, hence it needs no initialization
pool2d(X)

- as expected, [**the stride and padding can be manually specified**] to override framework defaults if needed.


In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

Of course, we can specify an arbitrary rectangular pooling window with arbitrary height and width respectively, as the example below shows.


In [None]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)
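
The output shapes of the three pooling layers above can be predicted without running the framework. The sketch below assumes the usual floor formula $\lfloor (n + 2p - k)/s \rfloor + 1$ per axis (no dilation, no ceiling mode); the helper name `pool_output_size` is hypothetical. As described above, the stride falls back to the window size when it is not given.

```python
def pool_output_size(n, window, stride=None, padding=0):
    """Output length along one axis: floor((n + 2*padding - window) / stride) + 1."""
    if stride is None:
        stride = window  # default: stride matches the pooling window
    return (n + 2 * padding - window) // stride + 1

n = 4  # height and width of the 4 x 4 input X used above

# 3 x 3 window, default stride 3, no padding
default_shape = (pool_output_size(n, 3), pool_output_size(n, 3))
# 3 x 3 window, stride 2, padding 1
strided_shape = (pool_output_size(n, 3, stride=2, padding=1),
                 pool_output_size(n, 3, stride=2, padding=1))
# (2, 3) window, stride (2, 3), padding (0, 1)
rect_shape = (pool_output_size(n, 2, stride=2, padding=0),
              pool_output_size(n, 3, stride=3, padding=1))

print(default_shape, strided_shape, rect_shape)  # (1, 1) (2, 2) (2, 2)
```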

## Multiple Channels

- When processing multi-channel input data, [**the pooling layer pools each input channel separately**], rather than summing the inputs up over channels as in a convolutional layer. 
- This means that the number of output channels for the pooling layer is the same as the number of input channels. Below, we will concatenate tensors `X` and `X + 1` on the channel dimension to construct an input with 2 channels.


In [None]:
X = d2l.concat((X, X + 1), 1)
X

---


As we can see, the number of output channels is still 2 after pooling.


In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

## Exercises

2. What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size $c\times h\times w$, the pooling window has a shape of $p_h\times p_w$ with a padding of $(p_h, p_w)$ and a stride of $(s_h, s_w)$.
3. Why do you expect max-pooling and average pooling to work differently?
4. Do we need a separate minimum pooling layer? Can you replace it with another operation?
5. We could use the softmax operation for pooling. Why might it not be so popular?

## Convolutional Neural Networks (LeNet)

- We now have all the ingredients required to assemble a fully-functional CNN. In our earlier encounter with image data, we applied a linear model with softmax regression (:numref:`sec_softmax_scratch`) and an MLP (:numref:`sec_mlp-implementation`) to pictures of clothing in the Fashion-MNIST dataset. To make such data amenable we first flattened each image from a $28\times28$ matrix into a fixed-length $784$-dimensional vector, and thereafter processed them in fully connected layers. Now that we have a handle on convolutional layers, we can retain the spatial structure in our images. As an additional benefit of replacing fully connected layers with convolutional layers, we will enjoy more parsimonious models that require far fewer parameters. 
- In this section, we will introduce *LeNet*, among the first published CNNs to capture wide attention for its performance on computer vision tasks. The model was introduced by (and named for) Yann LeCun, then a researcher at AT&T Bell Labs, for the purpose of recognizing handwritten digits in images :cite:`LeCun.Bottou.Bengio.ea.1998`. This work represented the culmination of a decade of research developing the technology. 

---


- In 1989, LeCun's team published the first study to successfully train CNNs via backpropagation :cite:`LeCun.Boser.Denker.ea.1989`.
- At the time, LeNet achieved outstanding results matching the performance of support vector machines, then a dominant approach in supervised learning, with an error rate of less than 1% per digit. 
- LeNet was eventually adapted to recognize digits for processing deposits in ATM machines. To this day, some ATMs still run the code that Yann LeCun and his colleague Leon Bottou wrote in the 1990s!


In [None]:
from d2l import torch as d2l
import torch
from torch import nn

- The basic units in each convolutional block are a convolutional layer, a sigmoid activation function, and a subsequent average pooling operation. Note that while ReLUs and max-pooling work better, these discoveries had not yet been made at the time. Each convolutional layer uses a $5\times 5$ kernel and a sigmoid activation function. These layers map spatially arranged inputs to a number of two-dimensional feature maps, typically increasing the number of channels. The first convolutional layer has 6 output channels, while the second has 16. Each $2\times2$ pooling operation (stride 2) reduces dimensionality by a factor of $4$ via spatial downsampling. The convolutional block emits an output with shape given by (batch size, number of channels, height, width). 
- In order to pass output from the convolutional block to the dense block, we must flatten each example in the minibatch. In other words, we take this four-dimensional input and transform it into the two-dimensional input expected by fully connected layers, with one row per example and one flat feature vector per row. The dense block then has three fully connected layers with 120, 84, and `num_classes` outputs.

In [None]:
def init_cnn(module):  #@save
    """Initialize weights for CNNs."""
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        nn.init.xavier_uniform_(module.weight)

In [None]:
class LeNet(d2l.Classifier):  #@save
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

- We take some liberty in the reproduction of LeNet insofar as we replace the Gaussian activation layer by a softmax layer. This greatly simplifies the implementation, not least because the Gaussian decoder is rarely used nowadays. Other than that, this network matches the original LeNet-5 architecture.

---


- Let's see what happens inside the network. By passing a single-channel (black and white) $28 \times 28$ image through the network and printing the output shape at each layer, we can [**inspect the model**] to make sure that its operations line up with what we expect from :numref:`img_lenet_vert`.

![Compressed notation for LeNet-5.](../img/lenet-vert.svg)
:label:`img_lenet_vert`


In [None]:
@d2l.add_to_class(d2l.Classifier)  #@save
def layer_summary(self, X_shape):
    X = d2l.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)
        
model = LeNet()
model.layer_summary((1, 1, 28, 28))

- height and width of the representation are reduced by each layer of the convolutional block (compared with the previous layer), with the exception of the padded first convolution, which preserves them. 
- The first convolutional layer uses 2 pixels of padding to compensate for the reduction in height and width that would otherwise result from using a $5 \times 5$ kernel.
- As an aside, the image size of $28 \times 28$ pixels in the original MNIST OCR dataset is a result of *trimming* 2 pixel rows (and columns) from each side of the original scans that measured $32 \times 32$ pixels. 
- done primarily to save space (roughly a 23% reduction, from 1024 to 784 pixels per image) at a time when megabytes mattered.
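
The shapes printed by `layer_summary` can also be traced by hand, which doubles as a way to count parameters. The sketch below is plain Python and mirrors the layer settings of the `LeNet` class above for a single-channel $28 \times 28$ input and 10 classes. The helper `out_size` and the `layers` table are illustrative names, and the count assumes every convolutional and fully connected layer has a bias.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1

size, channels = 28, 1
# (name, out_channels or None for pooling, kernel, stride, padding)
layers = [
    ("conv1", 6, 5, 1, 2),
    ("pool1", None, 2, 2, 0),
    ("conv2", 16, 5, 1, 0),
    ("pool2", None, 2, 2, 0),
]

params = 0
shapes = []
for name, out_ch, k, s, p in layers:
    if out_ch is not None:
        # weights: out_ch * in_ch * k * k, plus one bias per output channel
        params += out_ch * (channels * k * k + 1)
        channels = out_ch
    size = out_size(size, k, s, p)
    shapes.append((name, channels, size, size))
    print(f"{name}: {channels} x {size} x {size}")

flat = channels * size * size  # length of the flattened feature vector
features = flat
for out_features in (120, 84, 10):
    params += out_features * (features + 1)  # weights plus biases
    features = out_features

print("flattened features:", flat)
print("total parameters:", params)
```

Under these assumptions the convolutional block ends at $16 \times 5 \times 5 = 400$ features and the whole network has 61,706 parameters, most of them in the first fully connected layer.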

- Just as with MLPs, our loss function is cross-entropy, and we minimize it via minibatch stochastic gradient descent.


In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)

## Summary

- moved from the MLPs of the 1980s to the CNNs of the 1990s and early 2000s. The architectures proposed, e.g., in the form of LeNet-5 remain meaningful, even to this day. It is worth comparing the error rates on Fashion-MNIST achievable with LeNet-5 both to the very best possible with MLPs (:numref:`sec_mlp-implementation`) and those with significantly more advanced architectures such as ResNet (:numref:`sec_resnet`). LeNet is much more similar to the latter than to the former. One of the primary differences, as we shall see, is that greater amounts of computation afforded significantly more complex architectures.
- second difference is the relative ease with which we were able to implement LeNet. What used to be an engineering challenge worth months of C++ and assembly code, engineering to improve SN, an early Lisp-based deep learning tool :cite:`Bottou.Le-Cun.1988`, and finally experimentation with models can now be accomplished in minutes. It is this incredible productivity boost that has democratized deep learning model development tremendously. In the next chapter we will follow this rabbit hole down to see where it takes us.

## Exercises

1. Let's modernize LeNet. Implement and test the following changes:
    1. Replace the average pooling with max-pooling.
    1. Replace the sigmoid activations with ReLU.
2. Try to change the size of the LeNet style network to improve its accuracy in addition to max-pooling and ReLU.
    1. Adjust the convolution window size.
    1. Adjust the number of output channels.
    1. Adjust the number of convolution layers.
    1. Adjust the number of fully connected layers.
    1. Adjust the learning rates and other training details (e.g., initialization and number of epochs).

---

3. Try out the improved network on the original MNIST dataset.
4. Display the activations of the first and second layer of LeNet for different inputs (e.g., sweaters and coats).
5. What happens to the activations when you feed significantly different images into the network (e.g., cats, cars, or even random noise)?
