# **4: Going Deeper - Architecting ResNet-18 from Scratch**

In this notebook, we will implement the ResNet-18 architecture from scratch using PyTorch. ResNet-18 is the 18-layer member of the ResNet family, a popular group of convolutional neural network architectures that introduced residual connections, which make much deeper networks trainable by mitigating vanishing gradients.

As networks get deeper, they can learn more complex features, but they also become harder to train due to issues like vanishing gradients. ResNet addresses this problem by introducing skip connections that allow gradients to flow directly through the network, enabling the training of much deeper architectures.

**Vanishing Gradients**: As we add more layers to a neural network, the gradients used for updating the weights of the early layers can become very small, because backpropagation multiplies together one factor per layer. With such small gradients, the early layers barely learn. This is known as the vanishing gradient problem.
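A rough back-of-the-envelope sketch can make this concrete. It assumes a toy model in which every layer scales the gradient by at most 0.25, the maximum slope of the sigmoid function; the helper name `gradient_scale` is hypothetical and the numbers are illustrative, not measurements from a real network.

```python
# Toy model (an assumption, not a real network): each layer multiplies the
# gradient by at most 0.25, the maximum slope of the sigmoid function.
MAX_SIGMOID_SLOPE = 0.25


def gradient_scale(num_layers: int, per_layer_factor: float = MAX_SIGMOID_SLOPE) -> float:
    """Upper bound on how much of the gradient survives num_layers layers."""
    return per_layer_factor ** num_layers


for depth in (2, 5, 10, 20):
    print(f"{depth:>2} layers -> gradient scale at most {gradient_scale(depth):.2e}")
```

Under this assumption, ten layers already shrink the gradient to less than one millionth of its original size.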

## **The Solution: Residual Connections**

ResNet (Residual Network) solves this problem with a brilliantly simple idea: instead of learning a direct mapping from input to output, it learns a residual mapping. This is achieved by adding `skip connections` or `shortcut connections` that bypass one or more layers. The input to these layers is added to their output, allowing the network to learn the difference (residual) between the input and the output.

Think of it like an express lane on a highway. If the main road (the deeper layers) is congested, the traffic (gradients) can take the express lane (skip connection) to reach the destination (output) faster.

- `If a layer is useful:` The network can learn to use it by adjusting the weights to produce a useful output.
- `If a layer is not useful:` The network can learn to ignore it by adjusting the weights to produce an output close to zero, effectively bypassing it.

The beauty of this design is that it allows the network to learn both simple and complex features without being hindered by the depth of the network. This is why ResNet architectures can be much deeper than traditional CNNs while still being trainable.

`The key insight:` By adding the original input `x` directly to the output of a convolutional stack (a `"residual block"`), the network can easily learn to preserve the input if the convolutional layers are not useful, or learn to modify it if they are useful. This flexibility is what allows ResNet to train very deep networks effectively.


Mathematically, a residual block can be expressed as:
```
y = F(x) + x
```

Where `F(x)` is the output of the convolutional layers in the block, and `x` is the original input. The network learns to optimize `F(x)` such that the overall output `y` is as close as possible to the desired output, while still allowing for the possibility of preserving the input if necessary.
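To see why the skip connection helps gradients, it can be useful to differentiate the block equation. This is a short derivation under the assumption that $\mathcal{L}$ denotes the training loss and $I$ the identity matrix (both symbols are introduced here for illustration):

$$
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial y}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F(x)}{\partial x} + I\right)
$$

The identity term means that part of the gradient reaches `x` unchanged, even when $\partial F(x) / \partial x$ is very small.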

We can also put it this way: instead of learning `H(x)`, the network learns `F(x) = H(x) - x`, which is the residual. Then the output can be rewritten as `y = F(x) + x`, which is the original input plus the residual. This allows the network to learn the identity function (if needed) by simply setting `F(x)` to zero, making it easier to train deeper networks.
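The idea can be sketched in a few lines of PyTorch. This is a deliberately simplified block, not the full ResNet-18 block: the class name `TinyResidualBlock` is hypothetical, and batch normalization and the final activation are omitted so that the `y = F(x) + x` structure stays visible.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Simplified illustration of y = F(x) + x, with F = conv -> ReLU -> conv."""

    def __init__(self, channels: int):
        super().__init__()
        # padding=1 with a 3x3 kernel keeps height and width unchanged,
        # so F(x) and x have the same shape and can be added.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def residual(self, x: torch.Tensor) -> torch.Tensor:
        """F(x): the part the network actually has to learn."""
        return self.conv2(self.relu(self.conv1(x)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.residual(x) + x  # skip connection


torch.manual_seed(0)
block = TinyResidualBlock(channels=3)
x = torch.randn(2, 3, 32, 32, requires_grad=True)

y = block(x)
print(y.shape)  # torch.Size([2, 3, 32, 32])

# Force F(x) = 0 by zeroing the last convolution: the block becomes the identity.
with torch.no_grad():
    block.conv2.weight.zero_()
y_identity = block(x)
print(torch.allclose(y_identity, x))  # True

# With F(x) = 0, the gradient still reaches x through the skip connection.
y_identity.sum().backward()
print(torch.allclose(x.grad, torch.ones_like(x)))  # True
```

Zeroing `conv2` is only a way to simulate a block whose layers are "not useful"; during real training the optimizer would push `F(x)` toward zero on its own if that reduced the loss.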

### **A More Challenging Dataset: CIFAR-10**

To demonstrate the power of ResNet-18, we will use the CIFAR-10 dataset, which consists of 60,000 32x32 color images in 10 different classes. This dataset is more complex than MNIST and will allow us to see the benefits of using a deeper architecture like ResNet.

**CIFAR-10 Characteristics:**
- `60,000` images in total: `50,000` for training and `10,000` for testing
- 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
- Each image is `32x32` pixels with 3 color channels (RGB)
- The dataset is more complex than `MNIST`, making it a good test case for deeper architectures like ResNet-18


**Input Shape: (Batch Size, Channels, Height, Width)**
- For CIFAR-10, the input shape is `(Batch Size, 3, 32, 32)`: `3` is the number of RGB color channels, and `32x32` is the height and width of each image in pixels.
- The batch size can be set to any value (e.g., 64, 128), depending on memory constraints and the desired training speed.
- For example, with a batch size of 128, a batch of CIFAR-10 images has shape `(128, 3, 32, 32)`.
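The shape convention can be checked with a quick sketch. A random tensor stands in for a real CIFAR-10 batch here, and the 64-filter convolution is an arbitrary choice for illustration rather than a layer taken from this notebook's model.

```python
import torch
import torch.nn as nn

# Random stand-in for a CIFAR-10 batch: (Batch Size, Channels, Height, Width).
batch = torch.randn(128, 3, 32, 32)

# Illustrative layer (an assumption): 3 input channels must match the RGB input.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv(batch)

print(batch.shape)     # torch.Size([128, 3, 32, 32])
print(features.shape)  # torch.Size([128, 64, 32, 32])
```

Only the channel dimension changes: the batch size passes through untouched, and `padding=1` with a 3x3 kernel preserves the 32x32 spatial size.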


This is a step up in difficulty from the previous notebook, where we used the MNIST dataset, which consists of grayscale images of handwritten digits. The CIFAR-10 dataset is more complex due to its color images and far more varied image content, making it a more suitable test case for demonstrating the capabilities of deeper architectures like ResNet-18.

- `Color channels`: We now have 3 channels (RGB) instead of 1 (grayscale), which adds complexity to the input data.
- `Harder classes`: CIFAR-10 has the same number of classes as MNIST (10), but the images within each class are more complex and varied, making classification more challenging.
- `Low resolution`: CIFAR-10 images (32x32) are only slightly larger than MNIST images (28x28), yet they depict real-world objects, so each object is described by very few pixels. This can make it more difficult for the network to learn meaningful features, especially in the early layers.
- `More complex features`: The images in CIFAR-10 contain more complex features and variations (e.g., different backgrounds, lighting conditions, and object orientations) compared to the relatively simple and uniform images in MNIST. This requires a more powerful architecture like ResNet-18 to effectively learn and classify the images.
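The difference in raw input size can be worked out directly from the image dimensions given above. The helper `values_per_image` is a hypothetical name used only for this comparison.

```python
def values_per_image(channels: int, height: int, width: int) -> int:
    """Number of raw pixel values the network receives for one image."""
    return channels * height * width


mnist_values = values_per_image(channels=1, height=28, width=28)
cifar_values = values_per_image(channels=3, height=32, width=32)

print(mnist_values)                            # 784
print(cifar_values)                            # 3072
print(round(cifar_values / mnist_values, 1))   # 3.9
```

Each CIFAR-10 image therefore carries roughly four times as many input values as an MNIST image, before accounting for the more varied content.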