# ResNet
ResNet, short for Residual Network, is a type of deep learning model introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 paper "Deep Residual Learning for Image Recognition". The model was developed to make very deep neural networks easier to train.

ResNet introduced the concept of residual learning to address the degradation problem in deep neural networks: as more layers are added, accuracy first saturates and then degrades rapidly. Training error rises as well, so the cause is not overfitting. Vanishing or exploding gradients also make deep networks hard to train, although the paper notes that normalized initialization and batch normalization largely address them; what remains is an optimization difficulty.

ResNet solves this issue by introducing 'skip connections' or 'shortcuts' that allow the gradient to be directly backpropagated to earlier layers. These shortcuts are connections that skip one or more layers. The key insight of ResNet is realizing that it's easier to optimize the residual mapping than the original, unreferenced mapping.

The core idea of ResNet is the introduction of the so-called "identity shortcut connection" that skips one or more layers, as shown in their research:

`output = F(x) + x`

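Here, `F(x)` is the residual mapping learned by a stack of layers, and `x` is the block's input, carried forward unchanged by the identity shortcut. If the desired underlying mapping is `H(x)`, the layers learn `F(x) = H(x) - x` instead of `H(x)` directly. When the optimal mapping is close to the identity, it is easier to push the residual toward zero than to fit the identity with a stack of nonlinear layers.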
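The formula can be illustrated with a minimal sketch in plain Python. The helper names (`residual_block`, `zero_residual`, `small_residual`) are hypothetical and exist only for this example; real residual branches are convolutional layers, not list comprehensions.

```python
def residual_block(x, residual_fn):
    """Compute output = F(x) + x elementwise for a list of floats."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    # A residual branch whose weights have been pushed to zero outputs zeros.
    return [0.0 for _ in x]


def small_residual(x):
    # A residual branch that applies a small correction to its input.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)    # the block acts as the identity
corrected_out = residual_block(x, small_residual)  # the input plus a small correction
print(identity_out)
print(corrected_out)
```

With a zero residual the block simply returns its input, which is why adding such blocks to a network should not, in principle, make it worse.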

There are several variants of ResNet such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the numbers denote layers in the network. ResNet models, particularly ResNet-50, are widely used in many deep learning applications because they provide high accuracy and are relatively computationally efficient.

In terms of use cases, ResNet has been effectively used for a wide variety of tasks including image classification, object detection, and recognition tasks.


## ResNet-50
ResNet-50 is a variant of the original ResNet (Residual Network) architecture. It is a convolutional neural network (CNN) designed to enable deeper networks while mitigating the training difficulties discussed above.

The number "50" in ResNet-50 refers to the depth of the network: it has 50 weighted layers, namely 49 convolutional layers and one fully connected layer (the projection convolutions on the shortcuts are not counted). It follows the same concept of using shortcut (or skip) connections like other ResNets, but it also uses a concept called "bottleneck design" for constructing the blocks of layers.

The ResNet-50 architecture can be broken down as follows:

1. **Initial Convolution and Max Pooling Layers**: The input images first pass through a single 7x7 convolutional layer with 64 filters, followed by a batch normalization layer, a ReLU activation function, and a max pooling layer.

2. **Bottleneck Blocks**: The heart of the ResNet-50 architecture consists of four stages, each composed of a number of bottleneck blocks (which include three layers each). These are the residual blocks that give ResNet its name. 

    - The first stage has 3 blocks; the three layers of each block (1x1, 3x3, 1x1) have 64, 64, and 256 filters respectively.
    - The second stage has 4 blocks; the three layers of each block have 128, 128, and 512 filters respectively.
    - The third stage has 6 blocks; the three layers of each block have 256, 256, and 1024 filters respectively.
    - The fourth stage has 3 blocks; the three layers of each block have 512, 512, and 2048 filters respectively.

    Note: For the first block of each stage, the shortcut connection uses a 1x1 convolution (a projection) so that it matches the number of filters of the block's output. In stages two to four, where the output size (height, width) is halved, this convolution uses stride 2 so that it also matches the reduced size.

3. **Final Layers**: After the bottleneck blocks, a global average pooling layer is applied, followed by a fully connected layer with 1000 neurons (for the ImageNet classification task), and a softmax activation function to generate the output probabilities.
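The breakdown above can be checked with a little arithmetic. The sketch below counts the weighted layers and traces the feature-map size for a 224x224 input. It assumes a padding of 3 for the 7x7 convolution and a padding of 1 for the 3x3 layers, and that each of stages two to four halves the resolution with one stride-2 3x3 convolution; `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


blocks_per_stage = [3, 4, 6, 3]
layers_per_block = 3  # 1x1, 3x3, 1x1 convolutions in each bottleneck block

# One initial convolution, the bottleneck convolutions, one fully connected layer.
weighted_layers = 1 + layers_per_block * sum(blocks_per_stage) + 1
print("weighted layers:", weighted_layers)

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)  # initial 7x7 convolution
size = conv_out(size, kernel=3, stride=2, padding=1)  # 3x3 max pooling
stage_sizes = [size]                                  # stage one keeps this size
for _ in range(3):                                    # stages two to four halve it
    size = conv_out(size, kernel=3, stride=2, padding=1)
    stage_sizes.append(size)
print("feature-map size per stage:", stage_sizes)
```

The final 7x7 feature map is what the global average pooling layer reduces to a single value per channel.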

ResNet-50, like other ResNet variants, is widely used in both academia and industry for a large number of image classification tasks due to its excellent performance and the generalizability of its learned features.

# Residual Block
A residual block, or res-block, is the fundamental building block of a ResNet. The design of a residual block is rooted in the idea of learning the residual function with reference to the input, rather than learning the original unreferenced function.

In its simplest form, a residual block consists of two convolutional layers, each followed by batch normalization, with a ReLU (Rectified Linear Unit) activation after the first. The input to the block is added to the output of the block (before the final activation function), forming a 'shortcut' or 'skip connection'. This allows the gradient to be directly backpropagated to earlier layers.

Here's the basic structure of a Residual Block:

1. Convolutional layer: Applies a convolution operation on the input and passes the result to the next layer.
2. Batch normalization: Normalizes the activations of the previous layer at each batch to increase the stability and performance of the neural network.
3. ReLU activation: Applies the Rectified Linear Unit activation function which is max(0, x), where x is the input. It effectively removes negative values and introduces non-linearity without affecting the receptive fields of the conv layer.
4. Another sequence of Convolutional layer, Batch normalization, and then instead of applying ReLU activation, we add the initial input ('skip connection') to the output of the convolution block.
5. ReLU activation: Now apply the activation function to the result of the addition.

The skip connection lets the signal bypass layers: during backpropagation, gradients can flow through the shortcut connections, without any modification, to earlier layers in the network. This mitigates the vanishing/exploding gradient problem associated with deep neural networks, making it possible to train much deeper networks.

A crucial detail to mention is that if the dimensions of the input and the output of the residual block don't match (which can happen due to operations like convolution or pooling that modify the input dimensions), a linear projection can be used in the skip connection to match the dimensions. This projection can be accomplished via a 1x1 convolution.
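The structure described in this section, including the projection shortcut, might look like the following in PyTorch. This is a minimal sketch rather than the paper's reference implementation: the class name `BasicBlock` and its arguments are chosen for this example, and the projection is only created when the stride or channel count changes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Two 3x3 convolutions wrapped by a skip connection (illustrative sketch)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Use a 1x1 projection only when input and output shapes differ.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # add the input before the final activation
        return F.relu(out)


x = torch.rand(2, 64, 56, 56)
same_shape = BasicBlock(64, 64)(x)              # identity shortcut
downsampled = BasicBlock(64, 128, stride=2)(x)  # projection shortcut
print(same_shape.shape, downsampled.shape)
```

The second call changes both the channel count and the spatial size, so the 1x1 projection is what makes the addition possible.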

The use of residual blocks makes it possible to train very deep networks (100+ layers), which would be very challenging with traditional architectures due to issues like vanishing and exploding gradients.

## Skip Connections:

Skip connections, also known as shortcut connections, are a key feature of ResNet architectures that help address the problem of vanishing/exploding gradients in deep neural networks.

In a deep network, the output of one layer is used as the input to the next layer. However, in a network with skip connections, the input to a layer is also added to the output of that layer (or some subsequent layer). The 'skipped' input can be thought of as a shortcut in the network, allowing the gradient from the loss function to be directly backpropagated to earlier layers.

Skip connections have two key benefits:

1. They mitigate the vanishing gradient problem, making it easier to train deep networks. By providing a path for the gradient that bypasses several layers, they prevent the gradient from becoming infinitesimally small (and thus ineffective for training) in the early layers of the network.

2. They allow the network to learn identity functions, which can be useful when the optimal function is close to the identity. In other words, they make it easier for a layer to learn to produce output that is identical to its input, which is useful when the input is already a good representation for the task at hand.
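The first benefit can be made concrete with a short derivation. As a simplifying assumption, ignore the ReLU applied after the addition and write a block as $y = x + F(x)$, with $L$ denoting the loss (the symbols $y$ and $L$ are introduced here for illustration):

$$
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x}
= \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
$$

The identity term $I$ passes $\partial L / \partial y$ to the earlier layers unchanged, so a gradient signal still reaches them even when $\partial F / \partial x$ is small.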

In the original ResNet architecture, the basic residual block (used in ResNet-18 and ResNet-34) wraps two 3x3 convolutional layers: the block's input is added to the output of the second convolution (after batch normalization), and the ReLU activation is applied after the addition. In the deeper variants, each shortcut bypasses a three-layer bottleneck block instead.

To be more specific, if we number the layers of a basic-block network starting from 1, a skip connection carries the input of layer 'n' past layers 'n' and 'n+1'; it is added to the output of layer 'n+1', and the sum becomes the input of layer 'n+2'.

This is not a hard and fast rule, though. Skip connections can be implemented in different ways depending on the specific architecture and design choices. For instance, in a DenseNet (a related architecture that concatenates feature maps instead of adding them), each layer receives input from all preceding layers within its dense block.

One of the primary considerations in deciding where to place skip connections is the desire to mitigate the vanishing gradient problem. This problem becomes more severe the deeper the network is, so skip connections are typically more beneficial in deeper networks.

However, the actual placement of these connections can also depend on other factors, such as computational resources and the specific task at hand. Some experimentation might be necessary to find the optimal architecture for a given problem. 

In general, the ResNet architecture, which includes a skip connection every two layers (or every three layers in the bottleneck variants), has been found to work well for a wide range of tasks and is a good starting point.

## Bottleneck Design:

The bottleneck design is a modification of the basic residual block in ResNet, used to make the network more efficient. This design was introduced in the deeper versions of ResNet, like ResNet-50, 101, and 152.

In the bottleneck design, instead of having two 3x3 conv layers (as in the basic block), the block has three layers: a 1x1 conv layer, a 3x3 conv layer, and another 1x1 conv layer. The 1x1 convolutions reduce and then restore the number of channels, so the expensive 3x3 convolution operates on fewer input and output channels and therefore needs fewer parameters and computations.

In terms of sequence:

1. The first 1x1 convolution reduces the dimensionality (depth) before passing it to the more expensive 3x3 convolution.
2. The 3x3 convolution operates on a smaller input and produces a similarly small output.
3. The final 1x1 convolution restores the dimensionality back to match the original depth.

This is where the term "bottleneck" comes from: the number of channels decreases and then increases again, with the narrowest point in the middle of the block, giving it a "bottleneck" shape.

The bottleneck design significantly reduces computational cost, allowing for deeper and more efficient networks.
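A rough parameter count shows the size of the saving. The sketch below compares a bottleneck block that maps 256 channels down to 64 and back with a hypothetical alternative that applies two 3x3 convolutions directly on 256 channels. Biases and batch normalization parameters are ignored, and `conv_params` is a helper name chosen for this example.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


channels, reduced = 256, 64

# 1x1 reduce, 3x3 on the reduced channels, 1x1 restore
bottleneck = (
    conv_params(channels, reduced, 1)
    + conv_params(reduced, reduced, 3)
    + conv_params(reduced, channels, 1)
)

# Two 3x3 convolutions applied directly on all 256 channels
two_full_3x3 = 2 * conv_params(channels, channels, 3)

ratio = two_full_3x3 / bottleneck
print(f"bottleneck: {bottleneck:,} weights")
print(f"two 3x3 convs on {channels} channels: {two_full_3x3:,} weights")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the bottleneck block uses roughly one seventeenth of the weights while still producing a 256-channel output.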

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [5]:
class ConvBNLayer(torch.nn.Module):
  def __init__(self,num_channels,num_filters,filters_size,stride=1, groups=1,act=None):
    """
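      num_channels, input channels for the convolutional layer
      num_filters, output channels for the convolutional layer
      filters_size, kernel size of the convolutional layer
      stride, stride for the convolutional layer
      groups, number of groups for grouped convolution, default is groups=1 which means no grouped convolution
      act, activation applied after batch normalization: 'relu', 'leaky', or None for no activation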
    """
    super(ConvBNLayer,self).__init__()
    #The amount of padding required to keep the output size the same as the input size depends on the filter size, 
    #and it is given by (filter_size - 1) // 2 when the stride is 1.
    self._conv=nn.Conv2d(in_channels=num_channels,out_channels=num_filters,
                         kernel_size=filters_size, stride=stride,
                         padding=(filters_size-1)//2, groups=groups, bias=False)
    
    self._batch_norm=nn.BatchNorm2d(num_filters)
    self.act=act


  def forward(self,inputs):
    y=self._conv(inputs)
    y=self._batch_norm(y)
    #introduce non-linearity into the model
    if self.act=='leaky':
      y=F.leaky_relu(input=y,negative_slope=0.1)
    elif self.act=='relu':
      y=F.relu(input=y)
    return y

### Leaky Relu
Leaky ReLU is a variation of ReLU that has a small slope for negative values instead of a flat slope, hence the term 'leaky'. The slope is controlled by the negative_slope parameter and is typically a small positive number: PyTorch's default is 0.01, while the `ConvBNLayer` above uses 0.1. Negative inputs therefore produce a small negative output rather than zero, which can help mitigate the issue of 'dead' neurons.
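The difference between the two activations is easy to see on scalar inputs. The functions below are plain-Python sketches written for this example, not the PyTorch implementations.

```python
def relu(x):
    """Standard ReLU: negative inputs become zero."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: negative inputs are scaled by a small slope instead of zeroed."""
    return x if x >= 0 else negative_slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value, negative_slope=0.1))
```

Because the negative side has a non-zero slope, a unit that receives negative inputs still gets a gradient and can recover during training.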

In [6]:
class BottleneckBlock(torch.nn.Module):
  def __init__(self,num_channels,num_filters,stride,shortcut=True):
    super(BottleneckBlock,self).__init__()
    # Create the first convolutional layer (1x1), which reduces the input channels to num_filters
    self.conv1=ConvBNLayer(num_channels=num_channels,num_filters=num_filters,filters_size=1,act='relu')
    # Create the second convolutional layer (3x3)
    self.conv2=ConvBNLayer(num_channels=num_filters,num_filters=num_filters,filters_size=3,stride=stride,act='relu')
    # Create the third convolutional layer (1x1), but the number of output channels is multiplied by 4
    # preparing the output to be added to the shortcut connection.
    self.conv3=ConvBNLayer(num_channels=num_filters,num_filters=num_filters*4,filters_size=1,act=None)


    # If the output shape of conv3 is the same as the input to this residual block, then shortcut=True
    # Otherwise, shortcut=False, and add a 1x1 convolution to the input to make its shape the same as conv3

    if not shortcut:
      #adjust the number and size of the channels in the input 
      #so that it can be added to the output of conv3
      self.short=ConvBNLayer(num_channels=num_channels,num_filters=num_filters*4,filters_size=1,stride=stride)

    self.shortcut=shortcut
    self._num_channels_out=num_filters*4



  def forward(self,inputs):
    conv1=self.conv1(inputs)
    conv2=self.conv2(conv1)
    conv3=self.conv3(conv2)

    if self.shortcut:
      short=inputs
    else:
      short=self.short(inputs)

    # Add the shortcut to the block output, then apply the final ReLU
    y=torch.add(short,conv3)
    y=F.relu(y)
    return y



In [7]:
from torch.nn.modules import ParameterDict
from torch.nn.modules.pooling import AdaptiveAvgPool2d
import numpy as np

class ResNet(torch.nn.Module):
  def __init__(self,layers=50,class_dim=1):
    """
        layers, the depth of the ResNet model (50, 101, or 152)
        class_dim, the number of output classes for the final layer
    """
    super(ResNet,self).__init__()
    self.layers=layers
    supported_layers=[50,101,152]
    assert layers in supported_layers,\
    "supported layers are {} but input layers is {}".format(supported_layers,layers)

    #define the number of bottleneck blocks in each layer (depth) and
    #the number of filters for each layer.

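    if layers==50:
      # ResNet-50 consists of several modules; modules 2 to 5 contain 3, 4, 6, and 3 residual blocks respectively
      depth=[3,4,6,3]
    elif layers==101:
      # ResNet-101 consists of several modules; modules 2 to 5 contain 3, 4, 23, and 3 residual blocks respectively
      depth=[3,4,23,3]
    elif layers==152:
      # ResNet-152 consists of several modules; modules 2 to 5 contain 3, 8, 36, and 3 residual blocks respectively
      depth=[3,8,36,3]

    # number of output channels of the convolutions used in the residual blocks of each module
    num_filters=[64,128,256,512]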

    #the initial convolutional layer that processes the input image
    self.conv=ConvBNLayer(num_channels=3,num_filters=64,filters_size=7,stride=2,act='relu')
    #max pooling layer to reduce the spatial dimensions of the tensor
    self.pool2d_max=nn.MaxPool2d(kernel_size=3,stride=2,padding=1)

    # modules 2 to 5 of the ResNet: c2, c3, c4, c5
    #initializes a ModuleList to hold the bottleneck blocks
    self.bottleneck_block_list=nn.ModuleList()
    num_channels=64
    for block in range(len(depth)):
      shortcut=False
      for i in range(depth[block]):
        # c3, c4, and c5 use stride=2 in their first residual block; all other residual blocks use stride=1
        # multiple BottleneckBlock modules are created according to the depth array
        bottleneck_block = BottleneckBlock(num_channels=num_channels, num_filters=num_filters[block], 
                                           stride=2 if i == 0 and block != 0 else 1, shortcut=shortcut)
        #The num_channels is updated for the next block
        num_channels = bottleneck_block._num_channels_out
        #added to self.bottleneck_block_list
        self.bottleneck_block_list.append(bottleneck_block)
        shortcut = True

    # apply global pooling to the output feature map of c5:
    # an adaptive average pooling layer at the end of the network
    # reduces the height and width dimensions of each feature map to 1, effectively performing global average pooling.
    self.pool2d_avg=nn.AdaptiveAvgPool2d(output_size=1)

    # stdv is the bound used to randomly initialize the weights of the fully connected layer:
    # 1 / sqrt(fan_in), where fan_in = 2048 is the number of input features
    # (a fan-in-scaled uniform initialization, which is not the same as He initialization).
    import math
    stdv = 1.0 / math.sqrt(2048 * 1.0)

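    # Create the fully connected layer whose output size is the number of classes. After the convolutions
    # and global pooling of the residual network, the feature shape is [B,2048,1,1], so the input dimension of the final fully connected layer is 2048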
    self.out=nn.Linear(in_features=2048,out_features=class_dim)
    #The weights are initialized with a uniform distribution with range [-stdv, stdv] 
    #according to the previously calculated stdv.
    #initialize weights
    nn.init.uniform_(self.out.weight, -stdv, stdv)
                     
    
  # forward must be a method of the class, not a function nested inside __init__
  def forward(self,inputs):
    y=self.conv(inputs)
    y=self.pool2d_max(y)
    for bottleneck_block in self.bottleneck_block_list:
      y=bottleneck_block(y)
    y=self.pool2d_avg(y)
    y=torch.flatten(y, 1)
    y=self.out(y)
    return y


    

In [8]:
import torch
from torchvision.models import resnet50

# Call the resnet50 model using the torchvision library
model = resnet50()
# If you want to load the model pretrained on the ImageNet dataset, use this instead
# (torchvision 0.13 and later prefer weights=ResNet50_Weights.DEFAULT over pretrained=True):
# model = resnet50(pretrained=True)

# Randomly generate an input
x = torch.rand([1, 3, 224, 224])
# Get the output of the ResNet50 model
out = model(x)
# Print the shape of the output. As resnet50 is a 1000-class classifier by default,
# the output shape is [1, 1000]
print(out.shape)


torch.Size([1, 1000])


In PyTorch, a batch of images is expected to be in (N, C, H, W) format (batch size, channels, height, width), which is why the random input above has shape [1, 3, 224, 224].

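You can use resnet50(pretrained=True) to get the ResNet50 model pretrained on ImageNet. In torchvision 0.13 and later, the pretrained argument is deprecated in favor of weights (for example, weights=ResNet50_Weights.DEFAULT). Without pretrained weights, the model is initialized with random weights.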

In [12]:
import torch
from torchvision.models import resnet50
from torchvision.datasets import CIFAR10
from torchvision import transforms
from torch.optim import SGD
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss
from torch.optim.lr_scheduler import StepLR

# Define the transformations - convert to tensor and normalize with mean and std from CIFAR10
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

# Load the CIFAR10 dataset
train_dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)
val_dataset = CIFAR10(root='./data', train=False, download=True, transform=transform)

# Create DataLoaders for training and validation sets
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=8)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=8)

# Define the model, loss function, and optimizer
model = resnet50(num_classes=10)  # randomly initialized by default; avoids the deprecated pretrained argument
criterion = CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Lets us use GPU if available
model = model.to(device)

# Train the model
for epoch in range(10):
    model.train()  # Set the model to training mode
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()  # Clear the gradients
        outputs = model(images)  # Forward pass
        loss = criterion(outputs, labels)  # Compute the loss
        loss.backward()  # Compute the gradients
        optimizer.step()  # Update the weights
    scheduler.step()  # Update the learning rate
    
    if (epoch+1) % 2 == 0:
        model.eval()  # Set the model to evaluation mode
        correct = 0
        total = 0

        with torch.no_grad():  # Disable gradient calculation for efficiency
            for images, labels in val_loader:
                images = images.to(device)
                labels = labels.to(device)
                outputs = model(images)  # Forward pass
                _, predicted = torch.max(outputs.data, 1)  # Get the predicted labels
                total += labels.size(0)  # Increment the total count
                correct += (predicted == labels).sum().item()  # Count the correct predictions

        accuracy = correct / total  # Calculate the accuracy

        print(f"Accuracy at epoch {epoch + 1}: {accuracy * 100:.2f}%")

    # Save a checkpoint after every epoch (this assumes the ./output directory already exists)
    torch.save(model.state_dict(), f"./output/model_{epoch}.pth")


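This script first loads and preprocesses the CIFAR10 dataset, then defines a ResNet50 model, a CrossEntropy loss function, and a stochastic gradient descent optimizer with momentum. The model is trained for 10 epochs and evaluated on the validation set every 2 epochs. We use the StepLR learning rate scheduler, which multiplies the learning rate by 0.1 every 10 epochs. With only 10 epochs of training, the reduced rate would first apply from the 11th epoch, so increase the number of epochs to see its effect.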

* To determine the number of batches per epoch, divide the total number of samples in your dataset by the batch_size and round up. For example, with 10,000 samples and a batch_size of 64, 10,000 / 64 = 156.25, so there are 157 batches: 156 full batches and a last batch of 16 samples. The last batch is smaller whenever the total number of samples is not divisible by the batch size.

* `StepLR` is a learning rate scheduler provided by PyTorch. It is used to adjust the learning rate during training by multiplying it by a factor at specified intervals or epochs.

The `StepLR` scheduler updates the learning rate according to the following formula:

```python
new_lr = lr * gamma
```

where `lr` is the current learning rate and `gamma` is the factor by which the learning rate is multiplied. This update is applied every `step_size` number of epochs.

Here's an example of how to use `StepLR`:

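```python
import torch
from torch.optim.lr_scheduler import StepLR

# A small placeholder model and epoch count so the example runs on its own
model = torch.nn.Linear(10, 2)
num_epochs = 30

# Define the optimizer and initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Create the StepLR scheduler
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(num_epochs):
    # Train your model for each epoch (forward pass, loss.backward(), ...)
    optimizer.step()

    # Update the learning rate once per epoch, after the optimizer step
    scheduler.step()
```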

In this example, the learning rate is multiplied by `gamma=0.1` every `step_size=10` epochs, so it becomes ten times smaller at each step: 0.1 for epochs 0 to 9, 0.01 for epochs 10 to 19, and so on.
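The resulting schedule can also be written in closed form. The sketch below reproduces it in plain Python without PyTorch; `step_lr` is a hypothetical helper, and `epoch` is assumed to count the number of completed `scheduler.step()` calls.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)


schedule = {epoch: step_lr(0.1, 0.1, 10, epoch) for epoch in (0, 9, 10, 19, 20, 29)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

The rate stays constant within each block of `step_size` epochs and drops by a factor of `gamma` at each boundary.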

The `StepLR` scheduler is often used in combination with other training techniques to gradually decrease the learning rate over time. It can help improve convergence and fine-tune the model as training progresses. However, the specific values for `step_size` and `gamma` depend on the problem, dataset, and model architecture, and they may require experimentation to find the optimal values for your specific scenario.

Finally, note that the DataLoader class in PyTorch automatically handles batching, shuffling, and parallel data loading.
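The number of batches the DataLoader produces per epoch follows from simple arithmetic. The sketch below uses two hypothetical helpers, `num_batches` and `last_batch_size`, and mirrors the effect of the DataLoader's `drop_last` option, which discards an incomplete final batch when set to `True`.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a given dataset and batch size."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples, batch_size):
    """Size of the final batch when incomplete batches are kept."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


# CIFAR10 has 50,000 training images and 10,000 test images.
for name, n in (("train", 50_000), ("val", 10_000)):
    print(name, num_batches(n, 64), "batches, last batch has", last_batch_size(n, 64), "samples")
```

With a batch size of 64, both splits end on a partial batch of 16 samples.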