<a href="https://colab.research.google.com/github/MuhammadBinTariq/ATML_PA0/blob/main/Task_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Task 1: Inner Workings of ResNet-152

In [None]:
import torch
import torchvision
from torch import nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

### Baseline Setup

- Pre-trained ResNet-152
- Replace final classification layer to match smaller datasets e.g. CIFAR-10
- Train the head while freezing backbone
- Record training and validation performance for a few epochs

In [None]:
# Pre-trained Model
model_resnet = torchvision.models.resnet152(weights='IMAGENET1K_V1')

# Replacing model head
# print(model_resnet)
# print(model_resnet.fc)
model_resnet.fc = nn.Linear(2048, 10)
model_resnet.to(device)

# Freezing backbone layers
for param in model_resnet.parameters():
    param.requires_grad = False

# Unfreeze classification head for training
for param in model_resnet.fc.parameters():
    param.requires_grad = True

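
A quick sanity check after freezing is to count how many parameters remain trainable. The sketch below uses a small stand-in model (`toy_model` and `count_parameters` are hypothetical names, not part of the notebook) with the same 2048 -> 10 head shape, so only the head's 2048 * 10 + 10 = 20,490 parameters should be reported as trainable.

```python
from torch import nn

def count_parameters(model):
    """Return (trainable, total) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total

# Hypothetical stand-in: a tiny "backbone" followed by a 2048 -> 10 head
toy_model = nn.Sequential(
    nn.Linear(32, 2048),   # plays the role of the frozen backbone
    nn.ReLU(),
    nn.Linear(2048, 10),   # plays the role of model_resnet.fc
)

# Freeze everything, then unfreeze only the head
for param in toy_model.parameters():
    param.requires_grad = False
for param in toy_model[2].parameters():
    param.requires_grad = True

trainable, total = count_parameters(toy_model)
print(f"trainable: {trainable:,} / total: {total:,}")
```

Calling `count_parameters(model_resnet)` in the notebook should likewise show the head as the only trainable part.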


In [None]:
# Loss Function and Optimizer
resnet_loss_fn = nn.CrossEntropyLoss()
resnet_optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model_resnet.parameters()),
    lr=0.01,
    momentum=0.9
)

# Training and Testing Loops
def train(dataloader, model, loss_fn, optimizer):
    """Run one training epoch; return (mean loss per sample, accuracy)."""
    size = len(dataloader.dataset)
    model.to(device)
    # Note: train() mode also lets frozen BatchNorm layers update their running statistics
    model.train()
    running_loss, correct = 0.0, 0

    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)

        pred = model(images)
        loss = loss_fn(pred, labels)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # loss is a batch mean; weight by batch size so train_loss is per sample
        running_loss += loss.item() * images.size(0)
        correct += (pred.argmax(1) == labels).sum().item()

    train_loss = running_loss / size
    accuracy = correct / size
    return train_loss, accuracy

def test(dataloader, model, loss_fn):
    """Evaluate the model; return (mean loss per sample, accuracy)."""
    size = len(dataloader.dataset)
    model.eval()
    running_loss, correct = 0.0, 0

    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            pred = model(images)
            loss = loss_fn(pred, labels)

            # loss is a batch mean; weight by batch size so test_loss is per sample
            running_loss += loss.item() * images.size(0)
            correct += (pred.argmax(1) == labels).sum().item()

    test_loss = running_loss / size
    accuracy = correct / size
    return test_loss, accuracy
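
`nn.CrossEntropyLoss` averages over the batch by default, so an epoch-level loss per sample needs each batch mean weighted by its batch size. The sketch below illustrates the difference with made-up numbers (the losses and batch sizes are hypothetical, chosen only to make the arithmetic visible).

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_mean_losses = [0.50, 0.40, 1.00]
batch_sizes = [64, 64, 16]

num_samples = sum(batch_sizes)

# Summing batch means and dividing by the sample count is off by roughly the batch size
naive = sum(batch_mean_losses) / num_samples

# Weighting each batch mean by its batch size recovers the true per-sample mean
weighted = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes)) / num_samples

# Dividing by the number of batches is closer, but over-weights the short last batch
per_batch = sum(batch_mean_losses) / len(batch_mean_losses)

print(f"naive: {naive:.4f}  weighted: {weighted:.4f}  per-batch: {per_batch:.4f}")
```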

In [None]:
# Loading the CIFAR-10 Dataset
mean = [0.485, 0.456, 0.406]
std  = [0.229, 0.224, 0.225]

resnet_train_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

resnet_test_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

resnet_trainset = datasets.CIFAR10(root='./CIFAR10', train=True, download=True, transform=resnet_train_transforms)
resnet_testset = datasets.CIFAR10(root='./CIFAR10', train=False, download=True, transform=resnet_test_transforms)

resnet_trainloader = DataLoader(resnet_trainset, batch_size=64, shuffle=True)
resnet_testloader = DataLoader(resnet_testset, batch_size=64, shuffle=False)

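
`transforms.Normalize(mean, std)` computes `(x - mean) / std` per channel, using the ImageNet statistics the pre-trained weights expect. A minimal sketch of that arithmetic and its inverse (useful when plotting normalized images) follows; `normalize` and `denormalize` are hypothetical helper names and the image is random data.

```python
import torch

# ImageNet channel statistics, reshaped to broadcast over (C, H, W) images
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """Per-channel (x - mean) / std, as applied by transforms.Normalize."""
    return (img - mean) / std

def denormalize(img):
    """Inverse of normalize, mapping values back to the [0, 1] range."""
    return img * std + mean

img = torch.rand(3, 32, 32)            # stand-in for a ToTensor() output
restored = denormalize(normalize(img))
```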


In [None]:
# Training and Validation Performance
epochs = 5

for epoch in range(epochs):
    train_loss, train_acc = train(resnet_trainloader, model_resnet, resnet_loss_fn, resnet_optimizer)
    test_loss, test_acc = test(resnet_testloader, model_resnet, resnet_loss_fn)
    print(f"Epoch {epoch+1}: Train Acc {train_acc:.4f}, Test Acc {test_acc:.4f}")

Epoch 1: Train Acc 0.7808, Test Acc 0.8004
Epoch 2: Train Acc 0.8170, Test Acc 0.8348
Epoch 3: Train Acc 0.8228, Test Acc 0.8364
Epoch 4: Train Acc 0.8339, Test Acc 0.8324
Epoch 5: Train Acc 0.8361, Test Acc 0.8434


#### Why is it unnecessary (and impractical) to train ResNet-152 from scratch on small datasets? What does freezing most of the network tell us about the transferability of features?

- One of the main reasons is overfitting. A small dataset may not cover the full variability of the space our data belongs to, so a high-capacity model tends to memorize the training data and generalize poorly.
- ResNet-152 has approximately 60 million parameters. Training such a deep model is compute- and time-intensive; in our own experiments, each epoch took ~9 minutes even with the backbone frozen.
- CIFAR-10 is small and its images are only 32x32, so using ResNet-152 here is like using a "cannon to shoot a mosquito".

Diminishing returns and data limitations make it impractical to train ResNet-152 from scratch on small datasets.

Regarding the transferability of features, a few observations follow:

- Despite the frozen backbone, training only the final classifier head reached a test accuracy greater than 84% within just 5 epochs.
- This indicates that the pre-trained features already capture the characteristics that matter in our dataset.
- It is also consistent with the view that the earlier layers of deep vision models learn generic features (e.g. edges, textures, strokes), while the later layers tend to capture more dataset-specific features.

This shows that, for vision tasks, training only the final layers on top of a frozen pre-trained backbone is an efficient and stable way to adapt such models to a different dataset. Moreover, the pre-trained features transfer well across vision tasks.

### Residual Connections in Practice

In [None]:
from torch import Tensor
from torchvision.models.resnet import Bottleneck

# We create a custom bottleneck class to remove residual connections
class noSkipBottleneck(Bottleneck):
    def forward(self, x: Tensor) -> Tensor:
        # Same three conv-BN stages as torchvision's Bottleneck.forward
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        # The residual addition (out += identity) is deliberately omitted,
        # so the identity/downsample branch is not computed at all.
        out = self.relu(out)

        return out


In [None]:
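# Instantiate resnet model and replace selected blocks with no-skip versions
# (the new blocks are freshly initialized; their pre-trained weights are not copied)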
model_noskip_resnet = torchvision.models.resnet152(weights='IMAGENET1K_V1')
model_noskip_resnet.fc = nn.Linear(2048, 10)

model_noskip_resnet.layer1[2] = noSkipBottleneck(256, 64)
model_noskip_resnet.layer2[5] = noSkipBottleneck(512, 128)
model_noskip_resnet.layer3[10] = noSkipBottleneck(1024, 256)
model_noskip_resnet.layer3[20] = noSkipBottleneck(1024, 256)
model_noskip_resnet.layer3[30] = noSkipBottleneck(1024, 256)

# print(model_noskip_resnet.layer4)
# print(model_resnet.layer3[30])

In [None]:
# Freeze backbone except for the replaced blocks and the classification head

for param in model_noskip_resnet.parameters():
    param.requires_grad = False

for param in model_noskip_resnet.fc.parameters():
    param.requires_grad = True

for param in model_noskip_resnet.layer1[2].parameters():
    param.requires_grad = True

for param in model_noskip_resnet.layer2[5].parameters():
    param.requires_grad = True

for param in model_noskip_resnet.layer3[10].parameters():
    param.requires_grad = True

for param in model_noskip_resnet.layer3[20].parameters():
    param.requires_grad = True

for param in model_noskip_resnet.layer3[30].parameters():
    param.requires_grad = True
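
The repeated freeze/unfreeze loops can be collapsed into a small helper built on `nn.Module.requires_grad_`. This is only a sketch: `set_trainable` is a hypothetical name and `toy` is a stand-in for the real model.

```python
from torch import nn

def set_trainable(modules, trainable=True):
    """Set requires_grad on every parameter of each module in `modules`."""
    for module in modules:
        module.requires_grad_(trainable)

# Hypothetical stand-in for the model: three "blocks" and a head
toy = nn.ModuleDict({
    "block1": nn.Linear(8, 8),
    "block2": nn.Linear(8, 8),
    "block3": nn.Linear(8, 8),
    "fc": nn.Linear(8, 10),
})

set_trainable([toy], False)                  # freeze everything
set_trainable([toy["block2"], toy["fc"]])    # unfreeze selected parts

trainable_names = [name for name, p in toy.named_parameters() if p.requires_grad]
print(trainable_names)
```

Applied to the notebook, the second call would list `model_noskip_resnet.fc` and the five replaced blocks.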

In [None]:
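# Now simply train the model
# The optimizer must hold this model's trainable parameters. resnet_optimizer was built
# from model_resnet, so the run logged below (which reused it) never updated this model.
noskip_params = [p for p in model_noskip_resnet.parameters() if p.requires_grad]
noskip_optimizer = torch.optim.SGD(noskip_params, lr=0.01, momentum=0.9)
epochs = 5

for epoch in range(epochs):
    train_loss, train_acc = train(resnet_trainloader, model_noskip_resnet, resnet_loss_fn, noskip_optimizer)
    test_loss, test_acc = test(resnet_testloader, model_noskip_resnet, resnet_loss_fn)
    print(f"Epoch {epoch+1}: Train Acc {train_acc:.4f}, Test Acc {test_acc:.4f}")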

Epoch 1: Train Acc 0.1004, Test Acc 0.1002
Epoch 2: Train Acc 0.1007, Test Acc 0.1018
Epoch 3: Train Acc 0.1003, Test Acc 0.1019
Epoch 4: Train Acc 0.1008, Test Acc 0.1017
Epoch 5: Train Acc 0.1012, Test Acc 0.1001
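
An optimizer only updates the parameters it was constructed with, which is worth checking whenever a second model is trained in the same notebook. The sketch below demonstrates this with two hypothetical toy layers (`model_a`, `model_b`) and a hypothetical helper `optimizer_covers`; none of these names come from the notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

model_a = nn.Linear(4, 2)   # model the optimizer was built for
model_b = nn.Linear(4, 2)   # second model trained later

optimizer_a = torch.optim.SGD(model_a.parameters(), lr=0.1)

before = model_b.weight.detach().clone()

# One "training step" on model_b using the optimizer that belongs to model_a
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model_b(x), y)
loss.backward()
optimizer_a.step()

unchanged = torch.equal(before, model_b.weight)   # model_b's weights did not move
has_grad = model_b.weight.grad is not None        # gradients exist but were never applied

def optimizer_covers(optimizer, model):
    """True if every trainable parameter of `model` is held by `optimizer`."""
    held = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return all(id(p) in held for p in model.parameters() if p.requires_grad)

print(unchanged, optimizer_covers(optimizer_a, model_a), optimizer_covers(optimizer_a, model_b))
```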


#### How do skip connections change gradient flow in very deep networks? What happens to convergence speed and performance when residuals are removed?

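Skip connections offer an alternative, shorter path for the flow of gradients. A residual block computes y = x + F(x), so the gradient reaching x contains an identity term in addition to the term that passes through F. The signal therefore does not have to travel through every convolution and activation, which mitigates vanishing gradients in very deep networks.

When residuals are removed, convergence speed and performance are expected to suffer significantly. In the logged run, accuracy stayed at chance level (~10%) for all 5 epochs, against more than 84% for the baseline. That run is not a clean measurement of the effect, however: it passed `resnet_optimizer`, which holds `model_resnet`'s parameters, so none of `model_noskip_resnet`'s weights were updated, and the replaced blocks were freshly initialized rather than pre-trained.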
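
The effect on gradient flow can be illustrated on a toy network. For a residual layer y = x + F(x), the Jacobian is I + dF/dx, so the backward signal always keeps an identity component. The sketch below compares the input-gradient norm of a 50-layer stack of small tanh layers with and without the skip. It assumes PyTorch's default `nn.Linear` initialization and a hypothetical helper `input_grad_norm`; it is not a model of ResNet-152 itself.

```python
import torch
from torch import nn

def input_grad_norm(depth, use_skip, width=16):
    """Norm of d(sum of outputs)/d(input) for a stack of tanh layers."""
    torch.manual_seed(0)  # same weights and input for both variants
    layers = [nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(1, width, requires_grad=True)

    h = x
    for layer in layers:
        f = torch.tanh(layer(h))
        h = h + f if use_skip else f   # residual: y = x + F(x)

    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(depth=50, use_skip=False)
residual = input_grad_norm(depth=50, use_skip=True)
print(f"plain: {plain:.3e}  residual: {residual:.3e}")
```

Without the skip, the gradient at the input shrinks by a factor below one at almost every layer; with it, the identity path keeps the gradient from collapsing.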