### **Some concepts behind CNNs**
- **Sparse-connectivity:**
  - A single element in the feature map is connected to only a small patch of pixels
  - This is unlike multilayer perceptrons, where every unit is connected to every input
  - The underlying assumption is that neighbouring pixels are related
- **Parameter-sharing:**
  - The same weights are used for different patches of the input image
  - Essentially, the same filters work for different parts of the image
- **Many layers:**
  - Stacking layers combines the extracted local patterns into global patterns
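
The first two ideas can be illustrated with a tiny hand-written convolution. This is only a sketch in plain Python (no PyTorch), and the kernel weights, the 8x8 image, and the helper name `conv2d_valid` are made up for the illustration. It changes one input pixel and counts how many output elements react.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            # Each output element only sees a k x k patch (sparse connectivity),
            # and every patch is weighted by the same kernel (parameter sharing).
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


kernel = [[1.0, 2.0, 1.0], [0.5, 1.0, 0.5], [1.0, 2.0, 1.0]]  # made-up weights
blank = [[0.0] * 8 for _ in range(8)]
lit = [row[:] for row in blank]
lit[4][4] = 1.0  # change a single input pixel

before = conv2d_valid(blank, kernel)
after = conv2d_valid(lit, kernel)
changed = [
    (r, c)
    for r in range(len(after))
    for c in range(len(after[0]))
    if after[r][c] != before[r][c]
]
print(len(changed), "of", len(after) * len(after[0]), "outputs changed")  # 9 of 36
```

In a fully-connected layer the same pixel would feed into every output unit, so all outputs could change.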

These assumptions and restrictions are called **inductive biases**. They help CNNs learn more quickly and generalize better than fully-connected networks, and they reduce the amount of training data required.

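Convolutional layers also have far fewer parameters than fully-connected layers, because one small kernel is reused at every position of the input.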
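
To put rough numbers on that, the sketch below counts parameters for one convolutional layer and one fully-connected layer at several image sizes. The helper names and the choice of 100 output units for the fully-connected layer are assumptions made for the comparison, not part of the models in this notebook.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # One k x k kernel per (input channel, output channel) pair,
    # plus one bias per output channel.
    return in_channels * kernel_size * kernel_size * out_channels + out_channels


def linear_params(in_features, out_features):
    # One weight per (input, output) pair, plus one bias per output.
    return in_features * out_features + out_features


for side in (32, 224, 1024):
    in_features = 3 * side * side  # flattened RGB image
    print(
        f"{side}x{side}: conv = {conv2d_params(3, 8, 5):,}, "
        f"linear = {linear_params(in_features, 100):,}"
    )
```

The convolutional count stays at 608 regardless of image size, while the fully-connected count grows with the number of pixels.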

In [1]:
import torch

**What goes wrong in a fully-connected Multilayer Perceptron**

In [2]:
class MLP(torch.nn.Module):
  def __init__(self):
    super().__init__()

    self.layers = torch.nn.Sequential(
      # 1st hidden layer
      torch.nn.Linear(3 * 224 * 224, 10000),  # Input of 224x224 size (3 colour channels) [3*224*224*10000 + 10000(bias) = 1,505,290,000]
      torch.nn.ReLU(),

      # 2nd hidden layer
      torch.nn.Linear(10000, 1000),           # [10000 * 1000 + 1000 = 10,001,000]
      torch.nn.ReLU(),

      # 3rd hidden layer
      torch.nn.Linear(1000, 100),             # [1000 * 100 + 100 = 100,100]
      torch.nn.ReLU(),

      # output layer
      torch.nn.Linear(100, 10)                # [100 * 10 + 10 = 1,010]
    )
  
  def forward(self, x):                       # In total: 1,515,392,110 parameters
    x = torch.flatten(x, start_dim = 1)
    logits = self.layers(x)                   # the attribute defined in __init__ is `layers`
    return logits

In [3]:
import sys

size = 0
mlp = MLP()

for name, param in mlp.named_parameters():
  # bytes = number of elements * bytes per element (avoids the deprecated param.storage())
  size += param.numel() * param.element_size() / 1024 ** 3

print(f"Model size: {size:.3f}GB")

Model size: 5.645GB
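
The parameter counts in the comments and the printed size can be cross-checked without allocating the model. This is a small sketch that assumes every parameter is a 4-byte float32 (PyTorch's default dtype); the variable names are illustrative.

```python
layer_sizes = [3 * 224 * 224, 10000, 1000, 100, 10]

total_params = sum(
    n_in * n_out + n_out  # weights + biases of one Linear layer
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
size_gb = total_params * 4 / 1024 ** 3  # assuming 4 bytes per float32 parameter

print(f"{total_params:,} parameters, about {size_gb:.3f} GB")
# 1,515,392,110 parameters, about 5.645 GB
```

Almost all of the size comes from the first layer, which connects every one of the 150,528 input values to each of the 10,000 hidden units.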



**Implementing a CNN for the same input and output sizes**

In [5]:
class CNN(torch.nn.Module):
  def __init__(self):
    super().__init__()

    self.cnn_layers = torch.nn.Sequential(
      torch.nn.Conv2d(3, 8, kernel_size = 5, stride = 2),   # 3 * 5 * 5 * 8 + 8 = 608
      torch.nn.ReLU(),

      torch.nn.Conv2d(8, 24, kernel_size = 5, stride = 2),  # 8 * 5 * 5 * 24 + 24 = 4,824
      torch.nn.ReLU(),

      torch.nn.Conv2d(24, 32, kernel_size = 3, stride = 2), # 24 * 3 * 3 * 32 + 32 = 6,944
      torch.nn.ReLU(),

      torch.nn.Conv2d(32, 48, kernel_size = 3, stride = 2), # 32 * 3 * 3 * 48 + 48 = 13,872
      torch.nn.ReLU()
    )

    self.fc_layers = torch.nn.Sequential(
      torch.nn.Linear(48 * 12 * 12, 200),                   # 48 * 12 * 12 * 200 + 200 = 1,382,600
      torch.nn.ReLU(),
      torch.nn.Linear(200, 10)                              # 200 * 10 + 10 = 2,010
    )
  
  def forward(self, x):                                     # In total: 1,410,858
    x = self.cnn_layers(x)
    x = torch.flatten(x, start_dim = 1)
    logits = self.fc_layers(x)

    return logits
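
The `48 * 12 * 12` input size of the first fully-connected layer follows from how each strided convolution shrinks the image. A minimal sketch of that arithmetic, using the standard output-size formula for a convolution without dilation (the helper name `conv_output_size` is made up here):

```python
def conv_output_size(size, kernel_size, stride, padding=0):
    # floor((size + 2 * padding - kernel_size) / stride) + 1
    return (size + 2 * padding - kernel_size) // stride + 1


size = 224
sizes = [size]
# (kernel_size, stride) of the four Conv2d layers in the CNN above
for kernel_size, stride in [(5, 2), (5, 2), (3, 2), (3, 2)]:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)

print(sizes)             # [224, 110, 53, 26, 12]
print(48 * size * size)  # 6912 features after flattening
```

If the input resolution or any kernel size or stride changes, the first `Linear` layer's input size has to be recomputed the same way.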

In [6]:
size = 0
cnn = CNN()

for name, param in cnn.named_parameters():
  # bytes = number of elements * bytes per element (avoids the deprecated param.storage())
  size += param.numel() * param.element_size() / 1024 ** 3

print(f"Model size: {size:.3f}GB")

Model size: 0.005GB
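
As a cross-check of the per-layer comments, the CNN total can be recomputed by hand and compared with the MLP. This is a sketch in plain Python; the tuples simply restate the layer definitions above.

```python
conv_layers = [(3, 8, 5), (8, 24, 5), (24, 32, 3), (32, 48, 3)]  # (in, out, kernel)
fc_layers = [(48 * 12 * 12, 200), (200, 10)]                      # (in, out)

cnn_params = sum(c_in * k * k * c_out + c_out for c_in, c_out, k in conv_layers)
cnn_params += sum(n_in * n_out + n_out for n_in, n_out in fc_layers)

mlp_params = 1_515_392_110  # total from the MLP comments earlier in the notebook

print(f"CNN: {cnn_params:,} parameters")           # CNN: 1,410,858 parameters
print(f"MLP / CNN: {mlp_params / cnn_params:.0f}x")  # MLP / CNN: 1074x
```

Most of the CNN's parameters (1,382,600 of 1,410,858) still sit in the first fully-connected layer; the four convolutional layers together need only 26,248.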


**We do NOT design CNN architectures from scratch; instead, we adopt popular architectures**

### **AlexNet**
- Won ImageNet 2012 competition
- One of the first CNNs utilizing GPUs for efficient training

### **VGG16**
- Same basic architecture as AlexNet, but deeper (16 weight layers) and larger

### **GoogLeNet/Inception**
![](../images/googlenet.png)
- **Inception modules:** Use multiple convolution layers with smaller kernels in parallel
- Keeps the model smaller
- Extracts features at various scales
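
A rough way to see why parallel branches keep the model smaller is to count parameters. The sketch below is a simplification and not the actual GoogLeNet configuration: the input depth and the branch widths are made-up numbers, and a real Inception module contains additional pieces that are left out here.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # weights + biases of one convolutional layer
    return in_channels * kernel_size * kernel_size * out_channels + out_channels


in_channels = 192  # hypothetical input depth

# Option 1: a single wide 5x5 layer producing 256 channels.
single = conv2d_params(in_channels, 256, 5)

# Option 2: hypothetical parallel branches as (output channels, kernel size).
branches = [(64, 1), (128, 3), (64, 5)]
parallel = sum(conv2d_params(in_channels, out, k) for out, k in branches)
out_channels = sum(out for out, _ in branches)  # branch outputs are concatenated

print(f"single 5x5: {single:,} parameters")       # single 5x5: 1,229,056 parameters
print(f"parallel:   {parallel:,} parameters")     # parallel:   540,928 parameters
print(f"output channels: {out_channels}")         # output channels: 256
```

Both options produce 256 output channels, but the parallel version uses fewer than half the parameters and looks at the input with three different kernel sizes.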

![](../images/googlenet_loss.png)
- Inception uses auxiliary loss functions in addition to the main loss
- The total loss is basically the sum of all these losses
- Very large neural networks can be very hard to train
- In backpropagation, the signal sometimes degrades on its way from the output layer back to the far-away input layer
- With these auxiliary losses, we can ensure that each small sub-part of the network trains well, which in turn helps the overall training of the network
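
The combination of losses can be sketched as a one-line function. The name `total_loss` and the `aux_weight` parameter are assumptions for this illustration; with the default weight of 1.0 it is the plain sum described above, and a smaller weight would down-weight the auxiliary heads.

```python
def total_loss(main_loss, aux_losses, aux_weight=1.0):
    # aux_weight is a hypothetical knob: 1.0 gives the plain sum of all losses.
    return main_loss + aux_weight * sum(aux_losses)


# Made-up loss values for the main output and two auxiliary outputs.
print(total_loss(0.5, [0.75, 1.0]))                  # 2.25
print(total_loss(0.5, [0.75, 1.0], aux_weight=0.5))  # 1.375
```

Because each auxiliary loss is computed closer to the input, its gradient reaches the early layers through fewer intermediate layers than the gradient of the main loss.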

### **ResNet-34, 50, 101**
- Stands for Residual Neural Networks
- The number refers to the number of layers
- The key idea here is "skip connections"
  - They let the network bypass "bad" layers and carry a stronger signal during backpropagation
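
The effect of a skip connection can be sketched with scalars instead of tensors. Everything here is illustrative: `useless_layer` stands in for a "bad" layer that contributes nothing, and the gradient is estimated numerically rather than by backpropagation.

```python
def residual_block(x, layer):
    # Output = input + whatever the layer adds on top of it.
    return x + layer(x)


def useless_layer(x):
    return 0.0  # a "bad" layer that ignores its input


def numerical_grad(f, x, eps=1e-6):
    # Central-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 3.0
print(residual_block(x, useless_layer))  # 3.0: the input passes through unchanged

plain_grad = numerical_grad(useless_layer, x)
skip_grad = numerical_grad(lambda v: residual_block(v, useless_layer), x)
print(plain_grad)           # 0.0: no gradient flows through the bad layer alone
print(round(skip_grad, 6))  # 1.0: the skip connection still passes the gradient
```

Without the skip connection, the bad layer would block both the forward signal and the gradient; with it, the block falls back to the identity.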

### **Convolutional Vision Transformer Hybrid**
- Transformers can also be used for vision tasks
- The hybrid combines convolutional layers with transformer layers based on self-attention