CNNs vs. MLP
CNNs do not need to flatten the image, so they can exploit the spatial structure of images

The following explains the backbone structure of a classical CNN

Characterized by Locally-Connected Layers
- layers where neurons are connected to only a limited number of input pixels
- neurons share their weights, which drastically reduces the number of parameters
- capable of extracting spatial and color patterns that characterize different objects
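
To make the effect of weight sharing concrete, the sketch below counts parameters in a fully connected layer and in a convolutional layer applied to the same input. The image size (28x28 grayscale), the 16 outputs/filters, and the helper `count_parameters` are assumptions chosen for illustration. The two layers do not compute the same thing, so this is only a rough comparison.

```python
from torch import nn

# Assumed sizes for illustration: a 28x28 grayscale image.
height, width = 28, 28

# Fully connected layer: each of the 16 output neurons sees every pixel.
dense = nn.Linear(height * width, 16)

# Convolutional layer: 16 filters of size 3x3, each reused at every position.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)


def count_parameters(module: nn.Module) -> int:
    """Hypothetical helper: total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())


dense_params = count_parameters(dense)  # 784 * 16 weights + 16 biases
conv_params = count_parameters(conv)    # 16 * 1 * 3 * 3 weights + 16 biases
print(dense_params, conv_params)
```

The convolutional layer's parameter count does not depend on the image size, only on the kernel size and the number of channels.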

Filter/kernel
- "extract" the features of an object (for example, edges).
- By using multiple different filters the network can learn to recognize complex shapes and objects.
- relies on centering a pixel and looking at its surrounding neighbors.
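
As a small illustration of a filter extracting edges, the sketch below applies a hand-written Sobel-style kernel to a toy image with `torch.nn.functional.conv2d`. The 6x6 image and the kernel values are assumptions for the example. In a CNN the kernel values are learned rather than fixed by hand.

```python
import torch
import torch.nn.functional as F

# Toy 6x6 grayscale image: dark left half (0), bright right half (1).
# Shape is (batch, channels, height, width).
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# Sobel-style kernel that responds to left-to-right changes in brightness,
# i.e. vertical edges. Shape is (out_channels, in_channels, kH, kW).
sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# No padding, so the 6x6 input produces a 4x4 output.
edges = F.conv2d(image, sobel_x)
print(edges[0, 0])
```

Every output row is `[0, 4, 4, 0]`: the response is non-zero only where the 3x3 window straddles the boundary between the dark and bright halves.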

Frequency in Images
- frequency here means the rate of change of pixel intensity
- high-frequency image: one where the intensity changes a lot, i.e. the level of brightness changes quickly from one pixel to the next
- low-frequency image: one that is relatively uniform in brightness or changes very slowly
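
One rough way to see the difference is to measure how much brightness changes between neighboring pixels. The sketch below is only an illustration: the helper `mean_abs_change` and the two 8x8 test images are assumptions, and a proper frequency analysis would use a Fourier transform instead.

```python
import torch


def mean_abs_change(image: torch.Tensor) -> float:
    """Hypothetical helper: mean absolute difference between
    horizontally adjacent pixels."""
    return torch.diff(image, dim=-1).abs().mean().item()


# Low-frequency example: brightness rises slowly from left to right.
smooth = torch.linspace(0.0, 1.0, 8).repeat(8, 1)

# High-frequency example: a checkerboard that flips at every pixel.
rows = torch.arange(8).reshape(8, 1)
cols = torch.arange(8).reshape(1, 8)
checkerboard = ((rows + cols) % 2).float()

low = mean_abs_change(smooth)         # about 1/7 per step
high = mean_abs_change(checkerboard)  # 1.0 per step
print(low, high)
```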

Padding - The image is padded with a border of zeros (black pixels).

Stride: Amount by which a filter slides over an image.

Cropping - Any pixel in the output image that would require values from beyond the edge is skipped. This method can result in the output image being smaller than the input image, with the edges having been cropped.

Extension - The nearest border pixels are conceptually extended as far as necessary to provide values for the convolution. Corner pixels are extended in 90° wedges. Other edge pixels are extended in lines.
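
How padding, stride, and kernel size interact can be summarized by the standard output-size formula `floor((n + 2 * padding - kernel_size) / stride) + 1`. The sketch below checks it against `nn.Conv2d`. The helper `conv_output_size`, the 32x32 input, and the 5x5 kernel are assumptions for illustration. Mapping "cropping" to no padding and "extension" to `padding_mode="replicate"` is my reading of the definitions above, not something the notes state.

```python
import torch
from torch import nn


def conv_output_size(n: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Hypothetical helper: output height/width of a convolution along one axis."""
    return (n + 2 * padding - kernel_size) // stride + 1


x = torch.randn(1, 1, 32, 32)

# Cropping: no padding, so the output shrinks by kernel_size - 1.
cropped = nn.Conv2d(1, 1, kernel_size=5)(x)

# Zero padding of 2 pixels keeps a 5x5 convolution at the input size.
padded = nn.Conv2d(1, 1, kernel_size=5, padding=2)(x)

# A stride of 2 roughly halves the output size.
strided = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)(x)

# Extension: pad by repeating the nearest border pixel instead of zeros.
extended = nn.Conv2d(1, 1, kernel_size=5, padding=2, padding_mode="replicate")(x)

print(cropped.shape[-1], padded.shape[-1], strided.shape[-1], extended.shape[-1])
```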

Pooling
- compresses information from a layer by summarizing areas of the feature maps produced in that layer.
- compute a summary statistic
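
The idea of summarizing areas of a feature map can be sketched with plain tensor operations. The 4x4 feature map and the 2x2 block size below are assumptions for illustration. The reshape trick only works for non-overlapping blocks that divide the map evenly.

```python
import torch

feature_map = torch.tensor([[1.0, 2.0, 5.0, 6.0],
                            [3.0, 4.0, 7.0, 8.0],
                            [9.0, 1.0, 0.0, 2.0],
                            [1.0, 1.0, 4.0, 2.0]])

# Split the 4x4 map into non-overlapping 2x2 blocks. The dimensions become
# (block_row, row_in_block, block_col, col_in_block).
blocks = feature_map.reshape(2, 2, 2, 2)

# Summarize each block with one number: its maximum or its mean.
max_summary = blocks.amax(dim=(1, 3))
mean_summary = blocks.mean(dim=(1, 3))

print(max_summary)   # [[4, 8], [9, 4]]
print(mean_summary)  # [[2.5, 6.5], [3.0, 2.0]]
```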

Image -> conv1 -> maxpool -> conv2 -> maxpool -> conv3 -> maxpool ...
The spatial size (height and width) of the feature maps gets smaller as you go deeper, because each pooling step downsamples them
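
The shrinking can be observed by recording the tensor shape after each pooling step. In this sketch the 64x64 RGB input, the channel counts (16, 32, 64), the 3x3 kernels with `padding=1`, and the ReLU activations are all assumptions chosen for illustration.

```python
import torch
from torch import nn

backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 64, 64)
shapes = []
for layer in backbone:
    x = layer(x)
    # Record the shape after each pooling step only.
    if isinstance(layer, nn.MaxPool2d):
        shapes.append(tuple(x.shape))

print(shapes)  # [(1, 16, 32, 32), (1, 32, 16, 16), (1, 64, 8, 8)]
```

With these settings the height and width halve at every pooling step while the number of channels grows.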

In [None]:
from torch import nn

in_channels - The number of input feature maps. 
    - If this is the first layer, this is equivalent to the number of channels in the input image, i.e., 1 for grayscale images, or 3 for color images (RGB). 
    - Otherwise, it is equal to the output channels of the previous convolutional layer.

out_channels - The number of output feature maps (channels), i.e. the number of filtered "images" that will be produced by the layer. 
    - This corresponds to the number of distinct convolutional kernels that will be applied to the input, because each kernel produces one feature map/channel. 
    - Determining this number is an important decision to make when designing CNNs, just like deciding on the number of neurons is an important decision for an MLP.
    
kernel_size - Number specifying both the height and width of the (square) convolutional kernel.

In [None]:
conv1 = nn.Conv2d(in_channels, out_channels, kernel_size)
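
With concrete values filled in, the three arguments determine the shape of the layer's weight tensor and of its output. The numbers below (3 input channels, 16 output channels, a 5x5 kernel, a batch of 8 images of 32x32 pixels) are assumptions for illustration.

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

# One 5x5 kernel per (output channel, input channel) pair.
weight_shape = tuple(conv.weight.shape)

# A batch of 8 RGB images, 32x32 pixels each.
x = torch.randn(8, 3, 32, 32)
out_shape = tuple(conv(x).shape)

print(weight_shape)  # (16, 3, 5, 5)
print(out_shape)     # (8, 16, 28, 28)
```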

We also need an activation layer and a 2D dropout layer

In [None]:
conv1 = nn.Conv2d(in_channels, out_channels, kernel_size)
dropout1 = nn.Dropout2d(p=0.2)
relu1 = nn.ReLU()

In [None]:
result = relu1(dropout1(conv1(x)))

Alternatively, we can use nn.Sequential and stack the layers together

In [None]:
conv_block = nn.Sequential(
  nn.Conv2d(in_channels, out_channels, kernel_size),
  nn.ReLU(),
  nn.Dropout2d(p=0.2)
)

In [None]:
result = conv_block(x)
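
The behavior of the `nn.Dropout2d` layer used above can be checked in isolation. The sketch below assumes an all-ones input with 8 channels and uses `p=0.5` instead of the `p=0.2` above, only so that the rescaled values come out as round numbers.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout2d(p=0.5)
x = torch.ones(1, 8, 4, 4)

# Training mode: whole channels are zeroed, and the surviving channels
# are scaled by 1 / (1 - p) = 2.
dropout.train()
train_out = dropout(x)
channel_values = [train_out[0, c].unique().tolist() for c in range(8)]

# Evaluation mode: dropout is disabled and the input passes through unchanged.
dropout.eval()
eval_out = dropout(x)

print(channel_values)
print(torch.equal(eval_out, x))
```

Each channel is either all zeros or all twos, which is why the model should be switched to `eval()` mode before inference.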

Padding

padding="same": automatically computes the amount of padding needed so that output size = input size (PyTorch supports this only with stride 1)
padding="valid": no padding (padding = 0)
padding_mode="reflect": the padding pixels are filled with a copy of the values in the input image taken in opposite order, in a mirroring fashion.
padding_mode="replicate": the padding pixels are filled with the value of the closest pixel in the input image
padding_mode="circular": the image wraps around, so the padding pixels on one side are copied from the opposite side of the image, as if the image were tiled.
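
The padding modes are easiest to compare on a single row of pixels. The sketch below uses `torch.nn.functional.pad` as a stand-in for what a convolution layer does internally with each `padding_mode`. The input row and the padding width of 2 are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# One row of four pixels, shape (batch=1, channels=1, width=4).
row = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])

# Pad two values on the left and two on the right of the last dimension.
zeros = F.pad(row, (2, 2), mode="constant", value=0.0)
reflect = F.pad(row, (2, 2), mode="reflect")
replicate = F.pad(row, (2, 2), mode="replicate")
circular = F.pad(row, (2, 2), mode="circular")

print(zeros[0, 0])      # [0, 0, 1, 2, 3, 4, 0, 0]
print(reflect[0, 0])    # [3, 2, 1, 2, 3, 4, 3, 2]
print(replicate[0, 0])  # [1, 1, 1, 2, 3, 4, 4, 4]
print(circular[0, 0])   # [3, 4, 1, 2, 3, 4, 1, 2]
```

Note that reflect mirrors around the border pixel without repeating it, while circular continues with the values from the opposite end.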

kernel_size - The size of the max pooling window. The layer slides a window of this size over the input feature map and selects the maximum value in each window.
stride - The stride for the operation. By default the stride is the same as the kernel size (i.e., kernel_size), so the windows do not overlap.

In [None]:
from torch import nn
nn.MaxPool2d(kernel_size, stride)

pooling = nn.AvgPool2d(kernel_size, stride)
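
A runnable version with concrete values shows the default stride and the difference between max and average pooling. The 4x4 input and the 2x2 window are assumptions for illustration.

```python
import torch
from torch import nn

# A 4x4 feature map holding the values 0..15, shape (batch, channels, H, W).
x = torch.arange(16.0).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)               # stride defaults to kernel_size
avg_pool = nn.AvgPool2d(kernel_size=2)               # same default
overlapping = nn.MaxPool2d(kernel_size=2, stride=1)  # windows overlap

max_out = max_pool(x)
avg_out = avg_pool(x)
overlap_out = overlapping(x)

print(max_out[0, 0])      # [[5, 7], [13, 15]]
print(avg_out[0, 0])      # [[2.5, 4.5], [10.5, 12.5]]
print(overlap_out.shape)  # torch.Size([1, 1, 3, 3])
```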

The following explains the Head of a CNN

Flattening layer: sits after the backbone and before the head. It turns the output of the backbone (a stack of feature maps) into a 1d vector
feature vector/embedding: for each feature map the rows are concatenated into a 1d vector, then all these 1d vectors are concatenated to form one long 1d vector
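
The ordering described above can be checked with `nn.Flatten`, which flattens every dimension except the batch dimension by default. The two 2x2 feature maps below are assumptions for illustration.

```python
import torch
from torch import nn

# One image with two 2x2 feature maps, shape (batch=1, channels=2, H=2, W=2).
feature_maps = torch.tensor([[[[1.0, 2.0],
                               [3.0, 4.0]],
                              [[5.0, 6.0],
                               [7.0, 8.0]]]])

flatten = nn.Flatten()  # keeps the batch dimension, flattens the rest
embedding = flatten(feature_maps)

print(embedding.shape)  # torch.Size([1, 8])
print(embedding[0])     # [1, 2, 3, 4, 5, 6, 7, 8]
```

The rows of the first feature map come first, followed by the rows of the second.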

Head 
- a normal MLP that takes as input the feature vector and has the appropriate output for the task.
- It can have one or more hidden layers, as well as other types of layers as needed (like Dropout for regularization).
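
Putting the pieces together, a backbone, a flattening layer, and an MLP head could be assembled as in the sketch below. The class name `SmallCNN`, the 28x28 grayscale input, the layer sizes, and the 10 output classes are all assumptions for illustration, not a recommended architecture.

```python
import torch
from torch import nn


class SmallCNN(nn.Module):
    """Hypothetical example: two conv blocks followed by an MLP head."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.flatten = nn.Flatten()
        self.head = nn.Sequential(
            nn.Linear(32 * 7 * 7, 128),  # input size = channels * H * W
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(128, n_classes),   # one output per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.flatten(self.backbone(x)))


model = SmallCNN()
logits = model(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```

The input size of the first linear layer must match the length of the flattened backbone output, which is why it depends on the image size and the pooling steps.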