<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/08_modern_CNNs_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Convolutional Neural Networks

* **Convolution** (`nn.Conv2d()` & `nn.LazyConv2d()`): Given the input size ($n_h\times n_w$), and the kernel size ($k_h \times k_w$), the output size is as follows:
$$
(n_h - k_h + 1) \times (n_w - k_w + 1)
$$

* **Padding**: Given the input size ($n_h \times n_w$), the kernel size ($k_h \times k_w$), when adding a total of $p_h$ rows of padding and a total of $p_w$ columns of padding, the output size will be as follows:
$$
(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1)
$$

The `padding=1` argument in `nn.Conv2d()` adds one row at the top and one row at the bottom ($p_h=2$), and one column on the left and one column on the right ($p_w=2$).

* **Stride**: Given the input size ($n_h \times n_w$), the kernel size ($k_h \times k_w$), and the total padding ($p_h$ rows and $p_w$ columns), when the stride for the height is $s_h$ and the stride for the width is $s_w$, the output shape is:
$$
\left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor
$$
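The output-size formulas above can be checked numerically. The following is a minimal sketch; the helper name `conv_output_size` is hypothetical (not part of PyTorch), and `p` denotes the total padding along one axis, as in the formulas.

```python
import torch
from torch import nn


def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis (hypothetical helper).

    n: input length, k: kernel length,
    p: TOTAL padding along this axis (both sides combined), s: stride.
    """
    return (n - k + p + s) // s  # integer division implements the floor


X = torch.zeros(1, 1, 8, 8)  # (N, C, H, W)

# padding=1 pads one row/column on each side, so p_h = p_w = 2.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
predicted = (conv_output_size(8, 3, p=2, s=2),) * 2
actual = tuple(conv(X).shape[2:])
print(predicted, actual)

# Different kernel, padding, and stride for height and width.
conv_hw = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
predicted_hw = (conv_output_size(8, 3, p=0, s=3),
                conv_output_size(8, 5, p=2, s=4))
actual_hw = tuple(conv_hw(X).shape[2:])
print(predicted_hw, actual_hw)
```

With stride 1 and no padding, the helper reduces to $n - k + 1$, which matches the first formula.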



### Class `nn.AdaptiveAvgPool2d(output_size)`: 
Applies a 2D adaptive average pooling over an input signal composed of several input planes. 

* Input: ($N, C, H_{in}, W_{in}$) or ($C, H_{in}, W_{in}$)
* Output: ($N, C, S_0, S_1$) or ($C, S_0, S_1$), where $S$ = `output_size`.

In [1]:
import torch
from torch import nn

In [2]:
X = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.float32).reshape(1, 1, 2, 2)  # 1 x 1 x 2 x 2 (N, C, H, W)

net = nn.AdaptiveAvgPool2d(1)   # output: 1 x 1 x 1 x 1

net(X), X.mean()

(tensor([[[[2.5000]]]]), tensor(2.5000))
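To illustrate how `output_size` controls the result, here is a small sketch with a 4 x 4 input (the tensor values are made up for illustration). With `output_size=(2, 2)`, each output element is the average of one 2 x 2 block; with `output_size=1`, the layer reduces to a global average over height and width.

```python
import torch
from torch import nn

# One example, one channel, 4 x 4 spatial grid holding 0..15.
X = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Each output element averages one non-overlapping 2 x 2 block here.
pool_2x2 = nn.AdaptiveAvgPool2d((2, 2))
out_2x2 = pool_2x2(X)
print(out_2x2)

# output_size=1 is global average pooling over H and W.
pool_global = nn.AdaptiveAvgPool2d(1)
out_global = pool_global(X)
print(out_global.shape, out_global.item())
```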

## Batch Normalization

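$$
\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}
$$
where $\mathcal{B}$ is the minibatch and $\mathbf{x} \in \mathcal{B}$. $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is the sample standard deviation of the minibatch. $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are the learnable scale and shift parameters, respectively.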


### Fully Connected Layers

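When using a fully connected layer, batch normalization is applied between the affine transformation and the activation function $\phi$. The mean and variance are computed separately for each feature, across the examples in the minibatch (`dim=0`).
$$
\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}))
$$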


In [3]:
X = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.float32)

X.mean(dim=0, keepdim=True), X.mean(dim=1), X.mean(dim=0, keepdim=True).shape, X.mean(dim=1).shape   # per-column mean (over rows, dim=0), per-row mean (over columns, dim=1)

(tensor([[2., 3.]]),
 tensor([1.5000, 3.5000]),
 torch.Size([1, 2]),
 torch.Size([2]))

* Class `torch.nn.BatchNorm1d(num_features)`: Applies Batch Normalization over a 2D or 3D input.
* Input: ($N, C$) or ($N, C, L$) where $N$ is the batch size, $C$ is the number of features or channels, and $L$ is the length of sequence.
* Output: ($N, C$) or ($N, C, L$) (same shape as input).

In [4]:
X = torch.randn(50, 200)

net = nn.BatchNorm1d(num_features=200)

net(X).shape

torch.Size([50, 200])
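As a sanity check, the batch normalization formula can be reproduced by hand and compared with `nn.BatchNorm1d`. This is a sketch under a few assumptions: the layer is in training mode (the default after construction), $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ still have their initial values of 1 and 0, and the small constant `eps` that PyTorch adds to the variance for numerical stability is included, even though it does not appear in the formula above.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(50, 200)  # (N, C): 50 examples, 200 features

bn = nn.BatchNorm1d(num_features=200)  # training mode, gamma=1, beta=0

# Statistics per feature, computed across the batch dimension (dim=0).
mean = X.mean(dim=0, keepdim=True)
var = X.var(dim=0, unbiased=False, keepdim=True)  # biased variance
manual = (X - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(bn(X), manual, atol=1e-5)
print(matches)
```

In evaluation mode (`bn.eval()`), the layer uses its running estimates instead of the minibatch statistics, so this comparison only applies during training.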

* Class `torch.nn.LazyBatchNorm1d()`: A `torch.nn.BatchNorm1d` module with lazy initialization of the `num_features` argument.

In [5]:
X = torch.randn(50, 200)

net = nn.LazyBatchNorm1d() # num_features is inferred from the input.

net(X).shape



torch.Size([50, 200])

### Convolutional Layers

When using a two-dimensional convolutional layer, calculate the mean and variance separately for each channel (`dim=1`), across the batch, height, and width dimensions.

* Assume that each minibatch contains $m$ examples and that, for each channel, the output of the convolution has height $p$ and width $q$. For convolutional layers, batch normalization is carried out over the $m \cdot p \cdot q$ elements of each output channel simultaneously.

In [8]:
X = torch.ones((10, 10, 20, 20), dtype=torch.float32)

mean = X.mean(dim=(0, 2, 3), keepdim=True)  # One mean per channel, averaged over batch, height, and width
mean.shape

torch.Size([1, 10, 1, 1])

* Class `torch.nn.BatchNorm2d(num_features)`: Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension).
* Input: ($N, C, H, W$).
* Output: ($N, C, H, W$) (same shape as input).

In [9]:
X = torch.ones((10, 10, 20, 20), dtype=torch.float32)

net = nn.BatchNorm2d(10)
net(X).shape

torch.Size([10, 10, 20, 20])

* Class `torch.nn.LazyBatchNorm2d()`: A `torch.nn.BatchNorm2d` module with lazy initialization of the `num_features` argument.

In [10]:
X = torch.ones((10, 10, 20, 20), dtype=torch.float32)

net = nn.LazyBatchNorm2d()  # Infer number of features from the input!
net(X).shape



torch.Size([10, 10, 20, 20])
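The per-channel statistics can also be verified against `nn.BatchNorm2d`. The sketch below assumes training mode, the initial scale and shift (1 and 0), and PyTorch's `eps` term inside the square root; the tensor sizes are arbitrary illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
m, c, p, q = 10, 3, 20, 20
X = torch.randn(m, c, p, q)  # (N, C, H, W)

bn = nn.BatchNorm2d(c)  # training mode, gamma=1, beta=0

# Each channel is normalized over m * p * q elements.
elements_per_channel = X[:, 0].numel()

# Reduce over batch, height, and width; keep the channel dimension.
mean = X.mean(dim=(0, 2, 3), keepdim=True)
var = X.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (X - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(bn(X), manual, atol=1e-5)
print(elements_per_channel, mean.shape, matches)
```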

### Layer Normalization

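Layer normalization works like batch normalization, except that it is applied to one example at a time: the mean and variance are computed over the features of each example rather than over the minibatch, so the result does not depend on the batch size.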
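A brief sketch of this idea with `nn.LayerNorm` follows. The input values are made up for illustration, and the comparison assumes the layer's initial scale and shift (1 and 0) and includes PyTorch's `eps` term.

```python
import torch
from torch import nn

X = torch.tensor([[1., 2., 3., 4.],
                  [10., 20., 30., 40.]])  # 2 examples, 4 features

ln = nn.LayerNorm(4)  # normalizes over the last dimension (the features)
out = ln(X)

# Statistics per example, computed across the features (dim=1).
mean = X.mean(dim=1, keepdim=True)
var = X.var(dim=1, unbiased=False, keepdim=True)
manual = (X - mean) / torch.sqrt(var + ln.eps)
matches = torch.allclose(out, manual, atol=1e-5)

# Normalizing one example alone gives the same result as within the batch.
batch_independent = torch.allclose(ln(X[:1]), out[:1])
print(matches, batch_independent)
```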

## PyTorch Notes

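* `add_module(name, module)` of the `nn.Module` class: Adds a child module to the current module. The module can be accessed as an attribute using the given name.

The following two classes register the same layers as child modules: `Net` wraps them in an `nn.Sequential`, while `Net1` adds them one by one with `add_module()`.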

In [12]:
class Net(nn.Module):
  def __init__(self):
    super().__init__()

    self.net = nn.Sequential(
        nn.LazyLinear(10), nn.ReLU(),
        nn.LazyLinear(1))
    
class Net1(nn.Module):
  def __init__(self):
    super().__init__()

    self.add_module(name='linear1', module=nn.LazyLinear(10))
    self.add_module(name='relu1', module=nn.ReLU())
    self.add_module(name='linear2', module=nn.LazyLinear(1))

In [16]:
net = Net()
net1 = Net1()

net.modules, net1.modules



(<bound method Module.modules of Net(
   (net): Sequential(
     (0): LazyLinear(in_features=0, out_features=10, bias=True)
     (1): ReLU()
     (2): LazyLinear(in_features=0, out_features=1, bias=True)
   )
 )>, <bound method Module.modules of Net1(
   (linear1): LazyLinear(in_features=0, out_features=10, bias=True)
   (relu1): ReLU()
   (linear2): LazyLinear(in_features=0, out_features=1, bias=True)
 )>)
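To show that modules added with `add_module()` are reachable as attributes, here is a small sketch. It uses `nn.Linear` with an assumed input size of 4 instead of `nn.LazyLinear`, and the class name `Net2` and its `forward` method are additions for illustration.

```python
import torch
from torch import nn


class Net2(nn.Module):  # hypothetical variant of Net1 with a forward pass
    def __init__(self):
        super().__init__()
        self.add_module('linear1', nn.Linear(4, 10))
        self.add_module('relu1', nn.ReLU())
        self.add_module('linear2', nn.Linear(10, 1))

    def forward(self, x):
        # Children registered by name are available as attributes.
        return self.linear2(self.relu1(self.linear1(x)))


net2 = Net2()
child_names = [name for name, _ in net2.named_children()]
out_shape = tuple(net2(torch.randn(3, 4)).shape)
print(child_names, out_shape)
```

`self.add_module('linear1', ...)` has the same effect as the more common `self.linear1 = ...` assignment; the method form is mainly useful when the name is built at runtime.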

### `groups` argument in `torch.nn.Conv2d()` and `torch.nn.LazyConv2d()`:

`groups` controls the connections between inputs and outputs. `in_channels` and `out_channels` must both be divisible by `groups`.

* At groups=1, all inputs are convolved to all outputs.
* At groups=2, the operation becomes equivalent to having two convolution layers side by side, each seeing half the input channels and producing half the output channels, with both outputs subsequently concatenated.
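The `groups=2` case can be checked directly. In this sketch (channel counts and input size are arbitrary choices), the grouped layer's weights are split in half and applied to the two halves of the input channels with `torch.nn.functional.conv2d`; the concatenated result should match the grouped convolution.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(2, 4, 8, 8)  # 4 input channels

conv_g1 = nn.Conv2d(4, 6, kernel_size=3, groups=1)
conv_g2 = nn.Conv2d(4, 6, kernel_size=3, groups=2)

# Weight shape is (out_channels, in_channels / groups, k_h, k_w).
shape_g1 = tuple(conv_g1.weight.shape)
shape_g2 = tuple(conv_g2.weight.shape)

# Two independent convolutions: each sees 2 input channels
# and produces 3 output channels.
x_a, x_b = X[:, :2], X[:, 2:]
y_a = F.conv2d(x_a, conv_g2.weight[:3], conv_g2.bias[:3])
y_b = F.conv2d(x_b, conv_g2.weight[3:], conv_g2.bias[3:])
side_by_side = torch.cat([y_a, y_b], dim=1)

equivalent = torch.allclose(conv_g2(X), side_by_side, atol=1e-5)
print(shape_g1, shape_g2, equivalent)
```

Because each filter only sees `in_channels / groups` channels, the grouped layer here has half as many weights as the `groups=1` layer.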



## General Notes

### Python `super()`

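`super()` returns a proxy object that delegates method calls to the parent class. For example, `super().__init__()` inside a subclass of `nn.Module` runs the parent's initializer before the subclass adds its own attributes.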
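A minimal plain-Python sketch of this behavior follows. The class names `Base` and `Child` are hypothetical and chosen only for illustration.

```python
class Base:
    def __init__(self):
        self.initialized_by = ["Base"]

    def describe(self):
        return "Base"


class Child(Base):
    def __init__(self):
        super().__init__()  # run Base.__init__ first
        self.initialized_by.append("Child")

    def describe(self):
        # Extend the parent's method instead of replacing it entirely.
        return "Child -> " + super().describe()


child = Child()
print(child.initialized_by)
print(child.describe())
```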