# ResNet architecture

**What is ResNet?**

ResNet, or Residual Network, is a type of deep learning model that revolutionized computer vision by enabling the training of very deep networks. It was introduced in 2015 by Kaiming He and his colleagues, and it won that year's ImageNet (ILSVRC) classification challenge. ResNet models are widely used for tasks like image classification and object detection, and as backbones for many other vision models.


**The Problem with Deep Neural Networks**

Before ResNet, researchers found that simply making neural networks deeper (i.e., adding more layers) often made them perform worse. One cause is the vanishing gradient problem: the gradients used in backpropagation shrink as they are propagated back through many layers, so the layers closest to the input receive very small updates and learn slowly.

Another issue is degradation: adding more layers leads to higher training error, not just higher test error, so it cannot be explained by overfitting. This contradicts the expectation that a deeper network should do at least as well as a shallower one, since the extra layers could in principle simply learn the identity mapping.

**The Key Idea Behind ResNet: Residual Learning**

ResNet addresses these challenges with residual learning. Instead of asking a stack of layers to learn the desired mapping $H(x)$ directly, the layers learn the residual function $F(x)=H(x)-x$, where:
* $H(x)$ is the desired output of the block.
* $x$ is the input to the block.

In other words, the layers learn the difference between the desired output and the input (the "residual"). The **residual block** implements this with a shortcut or "skip connection" that bypasses one or more layers and adds the original input $x$ directly to the output of the residual function, so the block computes $F(x) + x$.

**Why Skip Connections Work**
* Easier to Train: By using skip connections, ResNet allows gradients to flow more directly through the network, making it easier to train very deep networks without suffering from vanishing gradients.
* Helps Prevent Degradation: Since skip connections allow the original input to be preserved, the model can avoid the problem of degradation. If deeper layers don't learn useful features, the network can still fall back on the input data.
* Flexibility in Layer Depth: Adding more layers doesn't necessarily make performance worse because the skip connections make it possible to pass information effectively through all layers.
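
The effect of skip connections on gradient flow can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions: every "layer" is a scalar function $f(x) = w x$, and the helper names `plain_chain_gradient` and `residual_chain_gradient` are made up for this illustration. For a plain stack the chain rule multiplies the local derivatives $w$; with a skip connection each layer becomes $x + w x$, so its local derivative is $1 + w$.

```python
def plain_chain_gradient(weights):
    """d(output)/d(input) for a plain stack of layers f(x) = w * x."""
    grad = 1.0
    for w in weights:
        grad *= w  # chain rule: multiply the local derivatives
    return grad


def residual_chain_gradient(weights):
    """Same stack, but each layer is wrapped as y = x + w * x."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w  # the skip connection contributes the "+ 1"
    return grad


weights = [0.1] * 20  # hypothetical small weights, 20 layers deep
print(f"plain stack:    {plain_chain_gradient(weights):.3e}")
print(f"residual stack: {residual_chain_gradient(weights):.3e}")
```

With these values the plain product is on the order of 1e-20, while the residual product stays above 1. Real networks are not scalar and linear, but the same "identity plus something" structure is what keeps a direct path open for the gradient.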

**Structure of a Residual Block**

A residual block typically consists of:
* Two or more convolutional layers: These layers learn features from the input data.
* Batch normalization and ReLU activation: These operations help stabilize and speed up training.
* Skip connection (shortcut): The original input is added to the output of the convolutional layers.

The skip connection is a simple identity mapping when the input and output dimensions match. When they differ, a linear projection (typically a 1x1 convolution) is applied to the input first.

**ResNet Variants**

ResNet comes in different depths, like ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the number counts the weighted layers in the network (the convolutional layers on the main path plus the final fully connected layer). ResNet-50 and deeper variants use bottleneck blocks, where a 1x1 convolution reduces the number of channels before a 3x3 convolution is applied, followed by another 1x1 convolution that restores the channel count. This reduces computational cost while retaining the ability to learn complex features.
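
The depth in each model name can be reproduced by counting layers. The sketch below assumes the standard per-stage block counts from the original ResNet design and counts one stem convolution, the convolutions inside every block (2 per basic block, 3 per bottleneck block), and the final fully connected layer. The names `CONFIGS` and `count_layers` are hypothetical helpers for this illustration, and the 1x1 convolutions on projection shortcuts are not counted.

```python
# Block type and number of blocks in each of the four stages.
CONFIGS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def count_layers(block_type, blocks_per_stage):
    """Stem convolution + convolutions inside the blocks + final linear layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1


for name, (block_type, blocks_per_stage) in CONFIGS.items():
    print(name, count_layers(block_type, blocks_per_stage))
```

For ResNet-18 this gives 1 + 2 * 8 + 1 = 18, and the other configurations match their names in the same way.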

**Why ResNet Works**

ResNet's success can be attributed to several key factors:
* Residual Learning Simplifies Optimization: By learning residual functions, the network optimizes the residual mappings, which are usually easier to learn, especially in deeper networks.
* Skip Connections Mitigate Vanishing Gradient Problems: These connections enable gradients to flow backward more effectively during backpropagation, ensuring that even the earlier layers receive meaningful gradient updates.
* Deeper Networks Capture More Complex Features: Because ResNet solves the issues that made deep networks difficult to train, it can successfully learn better representations in very deep architectures.

**Key Differences Compared to Traditional Networks**
* Skip Connections: The main difference is the presence of skip connections in ResNet, which directly pass the input to deeper layers.
* Residual Learning: Instead of learning the full mapping, ResNet learns the residual function, which simplifies training.
* Deeper Networks Become Practical: While earlier networks struggled beyond 20-30 layers, ResNet can have hundreds of layers (e.g., ResNet-152) without performance degradation.

![ResNet18 architecture](figures/Structure-of-a-ResNet-18-architecture-223748732.png)

In [1]:
import torch
import torch.nn as nn

In [2]:
class BasicBlock(nn.Module):
    """A basic building block for ResNet models.

    This block consists of two convolutional layers with a residual (skip) connection.
    It can optionally include a downsampling layer to match the dimensions
    when the input and output sizes are different.

    The residual connection helps to mitigate the vanishing gradient problem
    by providing a shortcut for the gradient to flow back through the network.
    """

    # Expansion factor (used for bottleneck blocks in deeper ResNets)
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        """Initialize the BasicBlock.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            stride (int, optional): Stride for the first convolutional layer. Defaults to 1.
            downsample (callable, optional): Downsampling layer to match dimensions. Defaults to None.
        """
        super(BasicBlock, self).__init__()

        # First convolutional layer:
        # - 3x3 convolution
        # - Applies stride (could be >1 for downsampling)
        # - Padding=1 to maintain spatial dimensions when stride=1
        # - No bias since BatchNorm2d handles the bias term
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        # Batch normalization layer after the first convolution
        self.bn1 = nn.BatchNorm2d(out_channels)
        # ReLU activation function
        self.relu = nn.ReLU(inplace=True)

        # Second convolutional layer:
        # - 3x3 convolution
        # - Stride=1
        # - Padding=1 to maintain spatial dimensions
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        # Batch normalization layer after the second convolution
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Optional downsampling layer to adjust dimensions of the residual connection
        # If the input and output dimensions differ, we need to downsample the input (identity)
        # to match the output dimensions before adding them together
        self.downsample = downsample

    def forward(self, x):
        """Forward pass of the BasicBlock.

        Args:
            x (Tensor): Input tensor of shape (N, in_channels, H, W)

        Returns:
            Tensor: Output tensor of shape (N, out_channels, H_out, W_out)
        """
        # Save the input tensor for the residual (skip) connection
        identity = x

        # First layer operations:
        out = self.conv1(x)    # Apply first convolution
        out = self.bn1(out)    # Apply batch normalization
        out = self.relu(out)   # Apply ReLU activation

        # Second layer operations:
        out = self.conv2(out)  # Apply second convolution
        out = self.bn2(out)    # Apply batch normalization

        # Apply downsampling to the identity (if required)
        if self.downsample is not None:
            identity = self.downsample(x)  # Adjust dimensions of identity

        # Add the identity (residual connection) to the output
        out += identity

        # Apply ReLU activation to the result
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """ResNet model class.

    This class defines the architecture for ResNet-18 (or other versions depending on the layers).
    It uses BasicBlocks to build the network.

    The ResNet architecture introduces residual connections that allow
    training of very deep networks by mitigating the vanishing gradient problem.
    """

    def __init__(self, block, layers, num_classes=2):
        """Initialize the ResNet model.

        Args:
            block (nn.Module): Block type to use (BasicBlock or Bottleneck).
            layers (list): Number of blocks to use in each of the four layers.
            num_classes (int, optional): Number of classes for classification. Defaults to 2.
        """
        super(ResNet, self).__init__()
        self.in_channels = 64  # Number of channels for the first convolutional layer

        # Initial convolutional layer:
        # - Input channels: 3 (RGB images)
        # - Output channels: self.in_channels (64)
        # - Kernel size: 7x7
        # - Stride: 2 (downsamples the input)
        # - Padding: 3 (with a 7x7 kernel and stride 2, an even input size is exactly halved)
        # - Bias is False since we're using batch normalization
        self.conv1 = nn.Conv2d(3, self.in_channels, kernel_size=7,
                               stride=2, padding=3, bias=False)
        # Batch normalization layer after the initial convolution
        self.bn1 = nn.BatchNorm2d(self.in_channels)
        # ReLU activation function
        self.relu = nn.ReLU(inplace=True)
        # Max pooling layer:
        # - Kernel size: 3x3
        # - Stride: 2 (further downsampling)
        # - Padding: 1
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Build the four residual layers, each potentially changing the number of channels and the spatial dimensions
        # Layer1: Output channels = 64, layers[0] blocks
        self.layer1 = self._make_layer(block, 64, layers[0])   # No downsampling in Layer1
        # Layer2: Output channels = 128, layers[1] blocks, stride=2 (downsampling)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        # Layer3: Output channels = 256, layers[2] blocks, stride=2
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        # Layer4: Output channels = 512, layers[3] blocks, stride=2
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Adaptive average pooling:
        # - Output size is (1,1), so regardless of input size, we get a 1x1 feature map
        # - This allows the model to handle variable input image sizes
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        # Fully connected (linear) layer for classification:
        # - Input features: 512 * block.expansion
        #   (expansion is 1 for BasicBlock, so input features = 512)
        # - Output features: num_classes
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Initialize weights for the layers:
        # - Convolutional layers are initialized using Kaiming He initialization
        # - BatchNorm layers have weights initialized to 1 and biases to 0
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # Kaiming normal initialization for convolutional layers with ReLU activation
                nn.init.kaiming_normal_(m.weight, mode='fan_out',
                                        nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                # Initialize BatchNorm weights to 1 (no scaling) and biases to 0 (no shift)
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, out_channels, blocks, stride=1):
        """Create one of the four layers of the ResNet model consisting of multiple blocks.

        Args:
            block (nn.Module): Block type to use (BasicBlock or Bottleneck).
            out_channels (int): Number of output channels for the blocks in this layer.
            blocks (int): Number of blocks to include in this layer.
            stride (int, optional): Stride for the first block in this layer. Defaults to 1.

        Returns:
            nn.Sequential: A sequential container of the blocks forming the layer.
        """
        downsample = None
        # Check if we need to downsample the input (identity) to match the output dimensions:
        # This is required when:
        # - The stride is not 1 (the spatial dimensions will change)
        # - The number of input channels does not match the number of output channels * expansion
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            # Define the downsampling layer:
            # - 1x1 convolution to adjust the number of channels
            # - Stride as specified to adjust the spatial dimensions
            # - Bias is False since we're using batch normalization
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion),
            )

        layers = []
        # First block in the layer:
        # - May include downsampling if stride != 1 or channel dimensions change
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        # Update the number of input channels for the next blocks
        self.in_channels = out_channels * block.expansion
        # Remaining blocks in the layer:
        for _ in range(1, blocks):
            # For subsequent blocks, stride=1 and downsample=None
            layers.append(block(self.in_channels, out_channels))

        # Return a sequential container of the blocks
        return nn.Sequential(*layers)

    def forward(self, x):
        """Forward pass of the ResNet model.

        Args:
            x (Tensor): Input tensor of shape (N, 3, H, W), where N is batch size.

        Returns:
            Tensor: Output tensor of shape (N, num_classes)
        """
        # Initial layers:
        x = self.conv1(x)      # Apply initial convolution (reduces spatial dimensions due to stride=2)
        x = self.bn1(x)        # Apply batch normalization
        x = self.relu(x)       # Apply ReLU activation
        x = self.maxpool(x)    # Apply max pooling (further reduces spatial dimensions)

        # Residual layers:
        x = self.layer1(x)     # Pass through Layer1 (spatial dimensions unchanged)
        x = self.layer2(x)     # Pass through Layer2 (spatial dimensions reduced by half)
        x = self.layer3(x)     # Pass through Layer3 (spatial dimensions reduced by half)
        x = self.layer4(x)     # Pass through Layer4 (spatial dimensions reduced by half)

        # Adaptive average pooling:
        x = self.avgpool(x)    # Reduce spatial dimensions to 1x1
        # Flatten the tensor:
        # - From shape (N, C, 1, 1) to (N, C)
        x = torch.flatten(x, 1)  # Flatten starting from dimension 1 (exclude batch dimension)
        x = self.fc(x)         # Fully connected layer for classification

        return x


def resnet18(num_classes=2):
    """Constructs a ResNet-18 model.

    Args:
        num_classes (int, optional): Number of classes for classification. Defaults to 2.

    Returns:
        ResNet: ResNet-18 model instance.
    """
    # ResNet-18 uses BasicBlock and layers=[2,2,2,2], meaning:
    # - There are 2 BasicBlocks in each of the four layers.
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes=num_classes)

In [3]:
model = resnet18()

In [4]:
print(model)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

If the version above is difficult to read, here is an unrolled version of ResNet-18 in which every layer is written out explicitly.

In [18]:
import torch.nn.functional as F


class ResNet18_unrolled(nn.Module):
    def __init__(self, num_classes=2):
        super(ResNet18_unrolled, self).__init__()
        
        # Initial convolutional layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1   = nn.BatchNorm2d(64)
        
        # Layer 1: two residual blocks, 64 channels, no downsampling
        self.conv2_1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        self.bn2_1   = nn.BatchNorm2d(64)
        self.conv2_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        self.bn2_2   = nn.BatchNorm2d(64)
        
        self.conv2_3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        self.bn2_3   = nn.BatchNorm2d(64)
        self.conv2_4 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        self.bn2_4   = nn.BatchNorm2d(64)
        
        # Layer 2: two residual blocks, 128 channels; the first block downsamples (stride 2)
        self.conv3_1 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn3_1   = nn.BatchNorm2d(128)
        self.conv3_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
        self.bn3_2   = nn.BatchNorm2d(128)
        self.downsample3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(128)
        )
        
        self.conv3_3 = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
        self.bn3_3   = nn.BatchNorm2d(128)
        self.conv3_4 = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
        self.bn3_4   = nn.BatchNorm2d(128)
        
        # Layer 3: two residual blocks, 256 channels; the first block downsamples (stride 2)
        self.conv4_1 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn4_1   = nn.BatchNorm2d(256)
        self.conv4_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
        self.bn4_2   = nn.BatchNorm2d(256)
        self.downsample4 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(256)
        )
        
        self.conv4_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
        self.bn4_3   = nn.BatchNorm2d(256)
        self.conv4_4 = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
        self.bn4_4   = nn.BatchNorm2d(256)
        
        # Layer 4: two residual blocks, 512 channels; the first block downsamples (stride 2)
        self.conv5_1 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn5_1   = nn.BatchNorm2d(512)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False)
        self.bn5_2   = nn.BatchNorm2d(512)
        self.downsample5 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False),
            nn.BatchNorm2d(512)
        )
        
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False)
        self.bn5_3   = nn.BatchNorm2d(512)
        self.conv5_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False)
        self.bn5_4   = nn.BatchNorm2d(512)
        
        # Average pooling and fully connected layer
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc      = nn.Linear(512, num_classes)
        
    def forward(self, x):
        # Initial layers
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
        
        # Layer 1
        identity = x
        out = self.conv2_1(x)
        out = self.bn2_1(out)
        out = F.relu(out)
        out = self.conv2_2(out)
        out = self.bn2_2(out)
        out += identity
        out = F.relu(out)
        
        identity = out
        out = self.conv2_3(out)
        out = self.bn2_3(out)
        out = F.relu(out)
        out = self.conv2_4(out)
        out = self.bn2_4(out)
        out += identity
        out = F.relu(out)
        
        # Layer 2
        identity = out
        out = self.conv3_1(out)
        out = self.bn3_1(out)
        out = F.relu(out)
        out = self.conv3_2(out)
        out = self.bn3_2(out)
        identity = self.downsample3(identity)
        out += identity
        out = F.relu(out)
        
        identity = out
        out = self.conv3_3(out)
        out = self.bn3_3(out)
        out = F.relu(out)
        out = self.conv3_4(out)
        out = self.bn3_4(out)
        out += identity
        out = F.relu(out)
        
        # Layer 3
        identity = out
        out = self.conv4_1(out)
        out = self.bn4_1(out)
        out = F.relu(out)
        out = self.conv4_2(out)
        out = self.bn4_2(out)
        identity = self.downsample4(identity)
        out += identity
        out = F.relu(out)
        
        identity = out
        out = self.conv4_3(out)
        out = self.bn4_3(out)
        out = F.relu(out)
        out = self.conv4_4(out)
        out = self.bn4_4(out)
        out += identity
        out = F.relu(out)
        
        # Layer 4
        identity = out
        out = self.conv5_1(out)
        out = self.bn5_1(out)
        out = F.relu(out)
        out = self.conv5_2(out)
        out = self.bn5_2(out)
        identity = self.downsample5(identity)
        out += identity
        out = F.relu(out)
        
        identity = out
        out = self.conv5_3(out)
        out = self.bn5_3(out)
        out = F.relu(out)
        out = self.conv5_4(out)
        out = self.bn5_4(out)
        out += identity
        out = F.relu(out)
        
        # Final layers
        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.fc(out)
        
        return out

In [19]:
model = ResNet18_unrolled()

In [20]:
print(model)

ResNet18_unrolled(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2_1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2_1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2_2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2_2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2_3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2_3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2_4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2_4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3_1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 

In [21]:
from torchinfo import summary

In [22]:
batch_size = 16
summary(model, input_size=(batch_size, 3, 128, 128), verbose=1)

Layer (type:depth-idx)                   Output Shape              Param #
ResNet18_unrolled                        [16, 2]                   --
├─Conv2d: 1-1                            [16, 64, 64, 64]          9,408
├─BatchNorm2d: 1-2                       [16, 64, 64, 64]          128
├─Conv2d: 1-3                            [16, 64, 32, 32]          36,864
├─BatchNorm2d: 1-4                       [16, 64, 32, 32]          128
├─Conv2d: 1-5                            [16, 64, 32, 32]          36,864
├─BatchNorm2d: 1-6                       [16, 64, 32, 32]          128
├─Conv2d: 1-7                            [16, 64, 32, 32]          36,864
├─BatchNorm2d: 1-8                       [16, 64, 32, 32]          128
├─Conv2d: 1-9                            [16, 64, 32, 32]          36,864
├─BatchNorm2d: 1-10                      [16, 64, 32, 32]          128
├─Conv2d: 1-11                           [16, 128, 16, 16]         73,728
├─BatchNorm2d: 1-12                      [16, 128, 16, 16
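
The output shapes in the (truncated) summary above can be checked by hand with the usual output-size formula for convolution and pooling layers, `floor((size + 2 * padding - kernel) / stride) + 1`. The sketch below traces only the layer in each stage that changes the spatial size, for a 128x128 input. The names `conv_out` and `STAGES` are hypothetical helpers for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the size-changing layer in each stage
STAGES = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer1", 3, 1, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]

size = 128  # input height/width used in the summary call
trace = {}
for name, kernel, stride, padding in STAGES:
    size = conv_out(size, kernel, stride, padding)
    trace[name] = size

print(trace)
```

The trace goes 64, 32, 32, 16, 8, 4, which agrees with the `[16, 64, 64, 64]`, `[16, 64, 32, 32]`, and `[16, 128, 16, 16]` shapes visible in the summary.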

In deeper versions of ResNet, like ResNet-50, ResNet-101, and ResNet-152, a more efficient type of residual block is used, known as the bottleneck block. The bottleneck block allows the network to remain computationally efficient, even as the number of layers increases.

![ResNet blocks](figures/Structure-of-basic-blocks-from-our-ResNet18-Employ-structure-a-in-cases-where-the_Q320.jpg)

![ResNet blocks](figures/resnet_blocks.png)

![ResNet blocks](figures/ResNet50-architecture-built-using-bottleneck-blocks-of-aIdentity-shortcut-and-1403887865.jpg)

**Why Bottleneck Blocks?**

As networks grow deeper, computational costs (in terms of memory and processing power) become a concern. For deeper networks to be practical, they need to balance computational complexity with their ability to learn complex features. The bottleneck block addresses this issue by reducing the number of parameters while still allowing for deep architectures.

**Structure of a Bottleneck Block**

A bottleneck block consists of three convolutional layers:

* 1x1 Convolution (Compression Layer): This reduces the number of channels (i.e., the feature map's depth). This step is called "dimensionality reduction" because it compresses the input feature map into a smaller, more manageable size. It reduces computational cost and speeds up training.

* 3x3 Convolution (Processing Layer): This is the core processing layer, where the actual feature extraction happens. It operates on the reduced number of channels from the previous layer, which makes this step computationally efficient.

* 1x1 Convolution (Expansion Layer): This layer increases the number of channels back to the block's output width (four times the bottleneck width in the standard ResNet design). The result can then be added to the skip connection, while the expensive 3x3 convolution still runs on the narrower representation.

* Batch normalization follows each of the three convolutions, and ReLU is applied after the first two. The final ReLU comes after the skip connection has been added.
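
The structure above can be sketched in PyTorch as follows. This is a minimal illustration, not the notebook's own code: the class name `Bottleneck` and its constructor signature are chosen to mirror `BasicBlock`, and placing the stride on the 3x3 convolution is an assumption that follows the torchvision convention (the original paper puts it on the first 1x1 convolution).

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Sketch of a bottleneck residual block: 1x1 -> 3x3 -> 1x1."""

    expansion = 4  # block output channels = out_channels * expansion

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        # 1x1 convolution: compress to the narrow bottleneck width
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        # 3x3 convolution: feature extraction on the reduced channels
        # (stride is applied here; this placement is an assumption)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 convolution: expand back to the block's output width
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))  # no ReLU before the addition
        if self.downsample is not None:
            identity = self.downsample(x)  # match shape of the main path
        out = out + identity
        return self.relu(out)


# 256 input channels, bottleneck width 64, output 64 * 4 = 256 channels
block = Bottleneck(in_channels=256, out_channels=64)
x = torch.randn(2, 256, 8, 8)
y = block(x)
print(y.shape)
```

Because the input and output shapes match here, no `downsample` module is needed. Since the signature mirrors `BasicBlock`, a call such as `ResNet(Bottleneck, [3, 4, 6, 3])` should give a ResNet-50-style model with the `ResNet` class defined earlier, though that combination is not exercised in this notebook.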

**Why the Bottleneck Design Works and the Importance of Feature Selection**

The bottleneck block in ResNet operates by compressing the feature representation before performing more expensive operations, then expanding it back to the original size. This compression-expansion strategy reduces the number of parameters in the network, making it more computationally efficient while retaining essential information.

For example, in a traditional convolutional layer, if the input has 256 channels, a 3x3 convolution would need to process all 256 channels. In the bottleneck block, the first 1x1 convolution reduces the number of channels to, say, 64. The 3x3 convolution then operates on this reduced set, significantly lowering the computational load. Afterward, the final 1x1 convolution restores the channel size back to 256.
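
The saving in this example can be made concrete by counting weights. The sketch below compares one bias-free 3x3 convolution at 256 channels with the three convolutions of the bottleneck described above. `conv_params` is a hypothetical helper for this illustration, and batch normalization parameters are ignored.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


# A single 3x3 convolution operating on all 256 channels
plain = conv_params(256, 256, 3)

# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(plain)       # 589824
print(bottleneck)  # 69632
print(f"ratio: {plain / bottleneck:.1f}x")
```

All three bottleneck convolutions together use roughly one eighth of the weights of the single full-width 3x3 convolution.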

However, the 1x1 convolution used for dimensionality reduction (compression) is not merely a way to decrease the number of channels. Because its filters are learned, it computes at every spatial position a learned linear combination of the input channels, so training can push it to keep the informative directions and discard redundant ones. This lets the 3x3 convolution work on a compact representation instead of processing every input channel.

Thus, the bottleneck block's design combines computational efficiency with learned feature selection: the network processes fewer channels in its most expensive layer while aiming to keep the information that matters for the task.

**Skip Connections in Bottleneck Blocks**

Like the basic residual block, the bottleneck block uses a skip connection that adds the input to the output. Within a stage, the final 1x1 convolution restores the channel count, so the identity shortcut can be used as-is. When the input and output dimensions do differ (typically in the first block of a stage, where the channel count or the stride changes), a 1x1 convolution is applied to the input in the skip connection to match the dimensions before the addition.

**Benefits of Bottleneck Blocks**

* Efficiency: By using 1x1 convolutions to reduce and then restore the number of channels, bottleneck blocks reduce the computational complexity of the network without sacrificing performance.

* Depth without Degradation: Bottleneck blocks allow for deeper networks (50, 101, or 152 layers) by avoiding the vanishing gradient and degradation problems through residual learning.

* Better Feature Extraction: Even with fewer parameters per block, bottleneck blocks can still capture complex features, because the savings from the narrow middle layer can be spent on additional depth.

**Summary**

The 1x1 convolution in bottleneck blocks compresses feature maps, but the learned filters in this layer ensure that only the most important features are selected.

This process lets the network concentrate its computation on the most informative features, which may also help generalization.

The compressed features are trained to preserve the information that later layers need, which helps keep accuracy high even in very deep networks.

This balance between efficient computation and information preservation is one of the reasons ResNet's bottleneck blocks are so effective, especially in deep architectures like ResNet-50 and beyond.

# Batch Normalization in Neural Networks

As you may have noticed, ResNet does not use dropout layers. In deep networks, the noise that dropout introduces can make gradient flow less stable, which is counterproductive when you're trying to address vanishing or exploding gradients. ResNet relies on batch normalization instead.

Batch normalization is a technique used in deep learning to improve the training of neural networks by normalizing the inputs of each layer. Introduced by Sergey Ioffe and Christian Szegedy in 2015, this method addresses several problems that arise during training, particularly the issue of internal covariate shift. Normalizing the inputs of each layer helps to stabilize the training process by ensuring that there is a consistent mean and variance. This stabilization reduces the risk of vanishing or exploding gradients, making it easier to train very deep networks, which can have hundreds of layers.

**Key Concepts**

* Internal Covariate Shift:

   During training, the distribution of inputs to each layer in a neural network changes as the model's parameters (weights and biases) are updated. This phenomenon is called internal covariate shift. It slows down training because each layer has to constantly adapt to changing inputs from the previous layers, making the optimization process harder.

* Normalization:

   Normalization is the process of transforming the data so that it has a mean of 0 and a standard deviation of 1. Batch normalization normalizes the inputs to each layer in a neural network so that they maintain a stable distribution. This reduces the internal covariate shift and allows the network to converge faster.
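
The two concepts can be illustrated with a toy example. The sketch below assumes a single linear layer with made-up weights and simulates a deliberately large weight update; `standardize` is a helper written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(h, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance over the batch."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

x = rng.normal(size=(128, 10))        # one mini-batch: 128 examples, 10 features
w_before = rng.normal(size=(10, 4))   # hypothetical first-layer weights
w_after = 3.0 * w_before + 0.5        # the same weights after an exaggerated update

h_before = x @ w_before               # what the next layer sees before the update
h_after = x @ w_after                 # what it sees afterwards

for name, h in [("before update", h_before), ("after update", h_after)]:
    print(name,
          "| raw std:", h.std(axis=0).round(2),
          "| normalized std:", standardize(h).std(axis=0).round(2))
```

The raw standard deviations change noticeably after the update, while the normalized activations stay at roughly zero mean and unit variance, so the next layer sees inputs on a stable scale.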

**How Batch Normalization Works**

Batch normalization operates on mini-batches of data during training. The steps involved are as follows:

* Calculate the Mean and Variance:

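   For each mini-batch of data, compute the mean $\mu_B$ and variance $\sigma_B^2$ of the input values to a layer. The statistics are computed separately for each feature (for convolutional layers, separately for each channel).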

   $$ \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i $$

   $$ \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 $$

   Where:
   - $m$ is the batch size.
   - $x_i$ is the input to the layer for the $i$-th example.


* Normalize the Inputs:

   Normalize each input by subtracting the mean and dividing by the standard deviation:

   $$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$

   Here, $\epsilon$ is a small constant added to avoid division by zero.


* Scale and Shift:

   To allow the model to represent a wide range of activations, two learnable parameters, $\gamma$ and $\beta$, are introduced. These parameters scale and shift the normalized output:

   $$ y_i = \gamma \hat{x}_i + \beta $$

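   This step ensures that even though the inputs are normalized, the network can learn to undo this normalization if needed: with $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, the output $y_i$ equals the original input $x_i$.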


* Use in Both Training and Inference:

   During training, the mean and variance are computed for each batch. During inference, fixed running averages of the mean and variance (calculated during training) are used instead, so the output for an example does not depend on which other examples happen to be in the batch.
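
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration for inputs of shape (batch, features), not a reference implementation: the class name `BatchNorm1D`, the `momentum` value, and the exact form of the running-average update are assumptions made for this example, and the backward pass is omitted.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization over inputs of shape (batch, features)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.eps = eps
        self.momentum = momentum              # assumed update rate for the running averages
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            mu = x.mean(axis=0)               # mu_B, one value per feature
            var = x.var(axis=0)               # sigma_B^2 (divides by m, as in the formula)
            # Track statistics for later use at inference time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)   # normalize
        return self.gamma * x_hat + self.beta        # scale and shift

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))

out = bn(batch, training=True)
print(out.mean(axis=0).round(6), out.var(axis=0).round(3))  # close to 0 and 1 per feature

# Inference: a single example is normalized with the running averages.
single = bn(batch[:1], training=False)
print(single.shape)                                         # (1, 3)

# Choosing gamma and beta to match the batch statistics undoes the normalization.
bn.gamma = np.sqrt(batch.var(axis=0) + bn.eps)
bn.beta = batch.mean(axis=0)
recovered = bn(batch, training=True)
print(np.allclose(recovered, batch))                        # True
```

In a real network, $\gamma$ and $\beta$ are updated by gradient descent together with the other weights; here they are set by hand only to show that the identity mapping is reachable.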


**Advantages of Batch Normalization**

* Faster Training: By normalizing inputs, batch normalization allows for higher learning rates with less risk of divergence. This speeds up the convergence of the model during training.

* Reduces Dependence on Initialization: Deep networks are sensitive to the initialization of weights. Batch normalization makes training less sensitive to initialization, so the model can often still converge even with poorly chosen initial weights.

* Acts as a Regularizer: Batch normalization introduces some noise into the training process because it normalizes based on mini-batches, which vary slightly from each other. This noise can have a slight regularization effect, similar to dropout, reducing overfitting.

* Eases Gradient Flow: Normalizing inputs at each layer helps keep the gradient magnitudes stable, reducing the risk of exploding or vanishing gradients. This is especially important for training deep networks.
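
The regularization effect comes from the fact that an example's normalized value depends on the other examples in its mini-batch. A small sketch of this, assuming a batch size of 8 and synthetic data (the helper `batch_normalize` omits $\gamma$ and $\beta$ for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_normalize(x, eps=1e-5):
    """Normalize each feature over the batch dimension."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

example = np.array([[1.0, -0.5]])          # one fixed example with 2 features

outputs = []
for _ in range(3):
    others = rng.normal(size=(7, 2))       # 7 randomly drawn batch-mates
    batch = np.vstack([example, others])   # mini-batch of size 8
    outputs.append(batch_normalize(batch)[0])   # normalized value of the fixed example

for row in outputs:
    print(row.round(3))
```

The same input produces a slightly different normalized value in each mini-batch. That variation is the noise the network has to become robust to during training.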

**Where Batch Normalization is Applied**

Batch normalization is most commonly applied before the activation function in a neural network layer, which is also what ResNet does (convolution, then batch normalization, then ReLU). It can also be applied after the activation; the exact placement varies by implementation.
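
The two placements differ only in the order of operations. The sketch below uses a plain linear layer with made-up weights instead of a convolution, and leaves out $\gamma$ and $\beta$, so it shows the ordering only.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_normalize(x, eps=1e-5):
    """Normalize each feature over the batch dimension."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

x = rng.normal(size=(32, 16))   # mini-batch of 32 examples
w = rng.normal(size=(16, 8))    # hypothetical layer weights

bn_before_relu = relu(batch_normalize(x @ w))   # linear -> BN -> ReLU
bn_after_relu = batch_normalize(relu(x @ w))    # linear -> ReLU -> BN

print(bn_before_relu.min() >= 0)                               # True: ReLU is applied last
print(np.allclose(bn_after_relu.mean(axis=0), 0, atol=1e-6))   # True: BN is applied last
```

With normalization first, the layer's output is non-negative and the next layer sees post-ReLU values. With normalization last, the next layer sees zero-mean inputs that include negative values.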

**Limitations of Batch Normalization**

* Dependence on Batch Size: Batch normalization's performance can degrade if the batch size is too small because the mean and variance estimates become noisy with smaller batches.

* Not Ideal for All Architectures: For some models, such as Recurrent Neural Networks (RNNs) that process variable-length sequences, batch statistics are awkward to compute and batch normalization might not be the best choice. Techniques like Layer Normalization or Group Normalization, which do not depend on the batch dimension, are better suited to such cases.
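
Both limitations can be seen in a short experiment. The sketch below uses synthetic activations and two helpers written for this example (`spread_of_batch_means` and `layer_normalize`); the batch sizes and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=2.0, scale=3.0, size=100_000)   # synthetic activations of one feature

def spread_of_batch_means(batch_size, trials=1000):
    """Standard deviation of the mini-batch mean across many random mini-batches."""
    means = [rng.choice(population, size=batch_size).mean() for _ in range(trials)]
    return float(np.std(means))

# Smaller batches give noisier estimates of the mean.
for m in (2, 8, 32, 128):
    print(m, round(spread_of_batch_means(m), 3))

def layer_normalize(x, eps=1e-5):
    """Normalize each example across its own features; no batch statistics involved."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = rng.normal(size=(4, 6))
# The first example gets the same output whether it is normalized alone or in a batch.
print(np.allclose(layer_normalize(x)[0], layer_normalize(x[:1])[0]))   # True
```

The spread of the batch mean shrinks roughly with the square root of the batch size, which is why very small batches hurt batch normalization. Layer normalization avoids the issue because each example is normalized using only its own features.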

**Conclusion**

Batch normalization is a powerful tool that has become a standard in modern deep learning architectures. By reducing internal covariate shift, improving gradient flow, and acting as a form of regularization, it enables faster and more reliable training of deep neural networks. It has proven especially useful in very deep networks, where training can be unstable or slow without it.