# Deep Learning Assignment 3
## Vishal Vijay Devadiga
## CS21BTECH11061

Trained in Colab with T4 GPU

# Instructions

- Answer all questions. We encourage best coding practices: you may not get full marks if you make your code difficult for us to understand. Use intuitive names for the variables, and comment your code liberally. You may use the text cells in the notebook to briefly explain the objective of a code cell.
- It is expected that you work on these problems individually. If you have any doubts please contact the TA or the instructor no later than 2 days prior to the deadline.
- You may use built-in implementations only for the basic functions such as sqrt, log, etc. from libraries such as numpy or PyTorch. Other high-level functionalities are expected to be implemented by the students. (Individual problem statements will make this clear. We can use the optimizers provided by the libraries such as PyTorch.)
- For plots, you may use matplotlib and generate clear plots that are complete and easy to understand.
- You are expected to submit the Python Notebooks saved as `<your-roll-number>.ipynb`
- If you are asked to report your observations, use the markdown text cells in the notebook.

In [1]:
# All imports and global variables
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from tqdm import tqdm

import numpy as np
import torch
import torchvision as tv

# Set random seed
np.random.seed(42)
torch.manual_seed(42)

# Whether to use stochastic gradient descent or not
sgd = False

set_batch_set = 64
if sgd:
    set_batch_set = 1

# Number of epochs
number_of_epoch = 10

In [2]:
# Set device: GPU or CPU. Use GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [3]:
# Import CIFAR-10 dataset from torchvision

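# Transform pipeline: convert to tensor, normalize to [-1, 1], then apply random augmentations
# Note: ColorJitter runs after Normalize here, so it sees values outside the [0, 1] range it expects
tr = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.RandomRotation(30),
])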

train_set = tv.datasets.CIFAR10(root='./cifar', train=True, download=True, transform=tr)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=set_batch_set, shuffle=True, num_workers=2)
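# Note: the test set reuses the augmenting transform tr, so test images are also randomly flipped, jittered and rotated
test_set = tv.datasets.CIFAR10(root='./cifar', train=False, download=True, transform=tr)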
test_loader = torch.utils.data.DataLoader(test_set, batch_size=set_batch_set, shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


# Question 1: Self-Attention for Object Recognition with CNNs

Implement a sample CNN with one or more self-attention layer(s) for performing object recognition over CIFAR-10 dataset.
You have to implement the self-attention layer yourself and use it in the forward function defined by you.
All other layers (fully connected, nonlinearity, conv layer, etc.) can be built-in implementations.
The network can be a simpler one (e.g., it may have 1x Conv, 4x [Conv followed by SA], 1x Conv, and 1x GAP).

[10 Marks]

In [4]:
# Define the Self Attention Layer
class SelfAtten(nn.Module):
    # Self Attention Layer Initialization
    def __init__(self, in_dim, stride = 8):
        # in_dim is the number of channels of the input tensor
        # stride is the channel-reduction factor for the query and key projections
        # With the default of 8, queries and keys have in_dim // 8 channels

        # Call the parent class constructor
        super(SelfAtten, self).__init__()

        # Initialize all required parameters
        self.in_dim = in_dim
        self.stride = stride

        # Check if in_dim is divisible by stride
        if in_dim % stride != 0:
            raise ValueError("in_dim must be divisible by stride")

        self.layer_query = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//stride, kernel_size= 1)
        self.layer_key = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//stride, kernel_size= 1)
        self.layer_value = nn.Conv2d(in_channels = in_dim , out_channels = in_dim, kernel_size= 1)
        self.gamma = nn.Parameter(torch.zeros(1))
        self.softmax  = nn.Softmax(dim=-1)

    # Forward pass of the self attention layer
    def forward(self,x):
        # x is the input tensor
        # x has shape (batch_size, in_dim, width, height)
        # batch_size is the number of samples in the batch
        # in_dim is the input dimension of the input tensor
        # width is the width of the input tensor
        # height is the height of the input tensor
        batch_size = x.size(0)
        chan = x.size(1)
        width = x.size(2)
        height = x.size(3)

        # Project the input tensor to query, key and value tensors
        proj_query = self.layer_query(x).view(batch_size, -1, width*height).permute(0,2,1)
        proj_key = self.layer_key(x).view(batch_size, -1, width*height)
        energy = torch.bmm(proj_query, proj_key)
        atten = self.softmax(energy)
        proj_value = self.layer_value(x).view(batch_size, -1, width*height)

        # Compute the output tensor
        out = torch.bmm(proj_value, atten.permute(0,2,1))
        out = out.view(batch_size, chan, width, height)

        # Apply the gamma parameter
        out = self.gamma*out + x

        # Return the output tensor
        return out
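
To make the tensor shapes in the layer above easier to follow, here is a minimal functional sketch of the same computation. It is an illustration, not the layer itself: the 1x1 convolutions are replaced by plain weight matrices without bias, and the names `spatial_self_attention`, `w_q`, `w_k` and `w_v` are made up for this example.

```python
import torch

def spatial_self_attention(x, w_q, w_k, w_v, gamma):
    """Bias-free sketch. x: (B, C, H, W); w_q, w_k: (C_r, C); w_v: (C, C)."""
    b, c, h, w = x.shape
    tokens = x.flatten(2)                            # (B, C, N) with N = H * W
    q = torch.einsum("rc,bcn->bnr", w_q, tokens)     # (B, N, C_r)
    k = torch.einsum("rc,bcn->brn", w_k, tokens)     # (B, C_r, N)
    v = torch.einsum("oc,bcn->bon", w_v, tokens)     # (B, C, N)
    attn = torch.softmax(torch.bmm(q, k), dim=-1)    # (B, N, N), each row sums to 1
    out = torch.bmm(v, attn.transpose(1, 2))         # (B, C, N)
    # Residual connection scaled by gamma, as in the layer above
    return gamma * out.view(b, c, h, w) + x, attn

torch.manual_seed(0)
x = torch.randn(2, 32, 4, 4)
w_q = 0.1 * torch.randn(4, 32)    # 32 // 8 = 4 reduced channels
w_k = 0.1 * torch.randn(4, 32)
w_v = 0.1 * torch.randn(32, 32)
y, attn = spatial_self_attention(x, w_q, w_k, w_v, gamma=0.0)
print(y.shape, attn.shape)
```

Since gamma is initialized to zero, the layer starts out as the identity and the attention branch only contributes once gamma moves away from zero during training.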

In [5]:
# Define the Self Attention CNN
class SelfAttenCNN(nn.Module):
    # Self Attention CNN Initialization
    def __init__(self, layers, out_activation = nn.Softmax(dim=1)):
        # layers is a list of layer objects
        # out_activation is the activation applied to the output of the last layer
        super(SelfAttenCNN, self).__init__()
        self.layers = len(layers)

        # Create the layer list
        self.layer_list = nn.ModuleList()
        for i in range(self.layers):
            self.layer_list.append(layers[i])

        # Add the activation function
        self.out_activation = out_activation

    # Forward pass of the Self Attention CNN
    def forward(self, x):
        # x is the input tensor
        # x has shape (batch_size, in_dim, width, height)
        # batch_size is the number of samples in the batch
        # in_dim is the input dimension of the input tensor
        # width is the width of the input tensor
        # height is the height of the input tensor
        for layer in self.layer_list:
            x = layer(x)
        x = self.out_activation(x)
        return x

    # Run the model
    def run_model(self, criterion, optimizer, num_epoch, train_loader):
        # criterion is the loss function
        # optimizer is the optimizer
        # num_epoch is the number of epochs
        # train_loader is the training data loader

        # For each epoch
        for epoch in range(num_epoch):
            # Set the model to training mode
            self.train()

            # Find the batch size
            batch_size = train_loader.batch_size

            for data in tqdm(train_loader, total = len(train_loader)):

                # Get the inputs and labels
                inputs = data[0]
                labels = data[1]

                # If number of samples in the batch is less than the batch size, continue
                if inputs.size()[0] != batch_size:
                    continue

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                outputs = self(inputs.to(device))

                # Compute the loss
                loss = criterion(outputs, labels.to(device))

                # Backward pass
                loss.backward()

                # Optimize
                optimizer.step()

    # Test the model
    def test_model(self, test_loader):
        corr = 0
        tot = 0
        self.eval()
        with torch.no_grad():
            for data in tqdm(test_loader, total = len(test_loader)):
                images = data[0]
                labels = data[1]
                images = images.to(device)
                labels = labels.to(device)
                outputs = self(images)
                _, predicted = torch.max(outputs.data, 1)
                tot += labels.size(0)
                corr += (predicted == labels).sum().item()
        acc = corr/tot
        return acc
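
The model above applies a softmax in `forward` and is then trained with `nn.CrossEntropyLoss`. According to the PyTorch documentation, `nn.CrossEntropyLoss` expects unnormalized logits and applies log-softmax internally. The sketch below, on made-up logits and labels, shows what happens when probabilities are passed in instead; it is an illustration of the loss behaviour only, not a rerun of the experiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = 5.0 * torch.randn(4, 10)       # hypothetical raw scores for 10 classes
labels = torch.tensor([1, 3, 5, 7])     # hypothetical targets
criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss on logits equals log-softmax followed by negative log-likelihood
loss_from_logits = criterion(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Feeding probabilities applies softmax twice; the inputs then lie in [0, 1],
# which confines the 10-class loss to roughly [1.46, 2.46]
loss_from_probs = criterion(torch.softmax(logits, dim=1), labels)

print(loss_from_logits.item(), manual.item(), loss_from_probs.item())
```

With the extra softmax the loss can never approach zero, which weakens the gradient signal. Passing `nn.Identity()` as `out_activation` would be one way to feed logits to the loss; the predicted class is unchanged because softmax preserves the argmax.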

In [6]:
# Define the layers

layers = [
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    SelfAtten(32),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    SelfAtten(64),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    SelfAtten(128),
    nn.Conv2d(128, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(8),
    nn.Conv2d(256, 10, kernel_size=1),
    nn.Flatten(),
    nn.Linear(160,10)
]

# Create the model

model = SelfAttenCNN(layers)
model.to(device)

# Define the loss function and optimizer

criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

if sgd:
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Run the model

model.run_model(criterion, optimizer, number_of_epoch, train_loader)

# Test the model

acc = model.test_model(test_loader)

print("Accuracy: ", acc)

100%|██████████| 782/782 [01:36<00:00,  8.13it/s]
100%|██████████| 782/782 [01:35<00:00,  8.23it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 782/782 [01:34<00:00,  8.24it/s]
100%|██████████| 157/157 [00:05<00:00, 27.92it/s]

Accuracy:  0.5386





# Question 2: Object Recognition with Vision Transformer

Implement and train an encoder-only Transformer (ViT-like) for the above object recognition task.
In other words, implement multi-headed self-attention for image classification (i.e., appending a `<class>` token to the image patches that are accepted as input tokens).
Compare the performance of the two implementations (try to keep the number of parameters to be comparable and use the same amount of training and testing data).

[10 Marks]

In [7]:
# Define the Multi Head Attention Layer
class MultHeadAtten(nn.Module):
    # Multi Head Attention Layer Initialization
    def __init__(self, emb_dim, num_heads = 8):
        # emb_dim is the embedding dimension of each input token
        # num_heads is the number of heads in the multi head attention layer

        # Call the parent class constructor
        super(MultHeadAtten, self).__init__()

        # Check if the embedding dimension is divisible by the number of heads
        if emb_dim % num_heads != 0:
            raise ValueError("Embedding dimension must be divisible by number of heads")

        # Initialize all required parameters
        self.emb_dim = emb_dim
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads

        # Initialize the query, key and value layers
        self.layer_query = nn.Linear(emb_dim, emb_dim)
        self.layer_key = nn.Linear(emb_dim, emb_dim)
        self.layer_value = nn.Linear(emb_dim, emb_dim)

        # Initialize the output layer
        self.layer_out = nn.Linear(emb_dim, emb_dim)

    # Forward pass of the multi head attention layer
    def forward(self, q, k, v):
        # q is the query tensor
        # k is the key tensor
        # v is the value tensor
        # q, k and v have shape (batch_size, num_tokens, emb_dim)

        batch_size = q.shape[0]

        # Project the query, key and value tensors
        Q = self.layer_query(q)
        K = self.layer_key(k)
        V = self.layer_value(v)

        # Split the query, key and value tensors into multiple heads
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        # Compute the energy tensor
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / (self.head_dim ** 0.5)

        # Compute the attention tensor
        attention = torch.softmax(energy, dim = -1)

        # Compute the output tensor
        out = torch.matmul(attention, V)

        # Combine the heads
        out = out.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.emb_dim)

        # Apply the output layer
        out = self.layer_out(out)

        # Return the output tensor and the attention tensor
        return out, attention
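
The head split and merge in the layer above are pure reshapes, which a small standalone sketch can confirm. The helper names `split_heads`, `merge_heads` and `scaled_dot_product` are invented for this example and the learned projections are left out.

```python
import torch

def split_heads(x, num_heads):
    # (B, N, D) -> (B, heads, N, D // heads)
    b, n, d = x.shape
    return x.view(b, n, num_heads, d // num_heads).permute(0, 2, 1, 3)

def merge_heads(x):
    # (B, heads, N, head_dim) -> (B, N, heads * head_dim)
    b, h, n, dh = x.shape
    return x.permute(0, 2, 1, 3).contiguous().view(b, n, h * dh)

def scaled_dot_product(q, k, v):
    # Attention weights are scaled by sqrt(head_dim) before the softmax
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
tokens = torch.randn(2, 17, 64)            # 16 patches + 1 class token, emb_dim 64
q = k = v = split_heads(tokens, 8)         # 8 heads of dimension 8
out, weights = scaled_dot_product(q, k, v)
merged = merge_heads(out)
print(merged.shape, weights.shape)
```

Splitting followed by merging returns the original tensor, so the heads only change how the embedding dimension is grouped during the attention computation.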

In [8]:
# Define the Attention Block
class AttenBlock(nn.Module):
    # Attention Block Initialization
    def __init__(self, emb_dim, hid_dim, num_heads = 8, drop_out = 0.0):
        # emb_dim is the embedding dimension of each input token
        # hid_dim is the hidden dimension of the feed forward layer
        # num_heads is the number of heads in the multi head attention layer
        # drop_out is the drop out rate

        # Call the parent class constructor
        super(AttenBlock, self).__init__()

        # Initialize the multi head attention layer
        self.multi_head_atten = MultHeadAtten(emb_dim, num_heads)

        # Initialize the layer normalization
        self.layer_pre_norm = nn.LayerNorm(emb_dim)
        self.layer_norm = nn.LayerNorm(emb_dim)

        # Initialize the feed forward layer
        self.feed_forward = nn.Sequential(
            nn.Linear(emb_dim, hid_dim),
            nn.ReLU(),
            nn.Dropout(drop_out),
            nn.Linear(hid_dim, emb_dim)
        )

    # Forward pass of the attention block
    def forward(self, x):
        # x is the input tensor
        # x has shape (batch_size, num_tokens, emb_dim)

        # Normalize the input; this overwrites x, so the residual below uses the normalized tensor
        x = self.layer_pre_norm(x)

        # Apply the multi head attention layer
        atten_out, attention = self.multi_head_atten(x, x, x)

        # Add the normalized input and the attention output (first residual connection)
        x = x + atten_out

        # Apply normalization
        x = self.layer_norm(x)

        # Apply the feed forward layer
        x = self.feed_forward(x)

        # Add the attention output to the feed forward output
        # (the skip here comes from atten_out, not from the feed forward input)
        x = x + atten_out

        # Return the output tensor and the attention tensor
        return x, attention
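
For comparison, the residual wiring commonly used in ViT-style encoders keeps the un-normalized input on the skip path and adds the feed forward input, rather than the attention output, in the second residual. The sketch below shows that arrangement. It is not the block used in this notebook, and it swaps in the built-in `nn.MultiheadAttention` purely to keep the example short; the class name `PreNormBlock` is made up.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Illustrative pre-norm encoder block (built-in attention used for brevity)."""
    def __init__(self, emb_dim, hid_dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, emb_dim),
        )

    def forward(self, x):
        # First residual: skip path carries the un-normalized input
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Second residual: skip path carries the feed forward input
        x = x + self.mlp(self.norm2(x))
        return x

torch.manual_seed(0)
block = PreNormBlock(emb_dim=64, hid_dim=128)
x = torch.randn(2, 17, 64)
y = block(x)
print(y.shape)
```

Whether this wiring would change the accuracy reported below has not been tested here.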

In [9]:
# Define the Patch Layer
class PatchLayer(nn.Module):
    # Patch Layer Initialization
    def __init__(self, in_chan, emb_dim, patch_size):
        # in_chan is the input channel of the input tensor
        # emb_dim is the embedding dimension
        # patch_size is the size of the patch

        # Call the parent class constructor
        super(PatchLayer, self).__init__()

        # Initialize all required parameters
        self.in_chan = in_chan
        self.emb_dim = emb_dim
        self.patch_size = patch_size

        # Initialize the patch layer
        self.layer_patch = nn.Conv2d(in_chan, emb_dim, kernel_size=patch_size, stride=patch_size)

    # Forward pass of the patch layer
    def forward(self, x):
        # x is the input tensor
        # x has shape (batch_size, in_chan, width, height)
        # batch_size is the number of samples in the batch
        # in_chan is the input channel of the input tensor
        # width is the width of the input tensor
        # height is the height of the input tensor

        # Compute the output tensor
        out = self.layer_patch(x)

        # Flatten the spatial grid into a token sequence: (batch_size, emb_dim, H', W') -> (batch_size, num_patches, emb_dim)
        out = out.flatten(2).transpose(1, 2)

        # Return the output tensor
        return out
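
A convolution whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches and applying one shared linear layer to each flattened patch. The sketch below checks this on random data using the same sizes as the experiment further down (32x32 images, patch size 8, embedding dimension 64); the variable names are chosen for this example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch_size, emb_dim = 8, 64
proj = nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)
images = torch.randn(2, 3, 32, 32)

# Path 1: strided convolution, then flatten the 4x4 grid into 16 tokens
tokens = proj(images).flatten(2).transpose(1, 2)             # (2, 16, 64)

# Path 2: extract the patches explicitly and apply the same weights as a linear map
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, 16, -1)   # (2, 16, 3*8*8)
weight = proj.weight.reshape(emb_dim, -1)                     # (64, 192)
tokens_linear = patches @ weight.T + proj.bias                # (2, 16, 64)

print(tokens.shape, torch.allclose(tokens, tokens_linear, atol=1e-4))
```

With a patch size of 8, each 32x32 image becomes 16 tokens, and the class token brings the sequence length to 17.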

In [10]:
# Define the Vision Transformer
class VisionTransformer(nn.Module):
    # Vision Transformer Initialization
    def __init__(self, img_size, in_chan, emb_dim, hid_dim, patch_size, num_layers, num_classes, num_heads = 8, drop_out = 0.0):
        # img_size is the side length of the (square) input image
        # in_chan is the number of input channels
        # emb_dim is the embedding dimension
        # hid_dim is the hidden dimension of the feed forward layers
        # patch_size is the size of the patch
        # num_layers is the number of layers in the vision transformer
        # num_classes is the number of classes
        # num_heads is the number of heads in the multi head attention layer
        # drop_out is the drop out rate

        # Call the parent class constructor
        super(VisionTransformer, self).__init__()

        # If the image size is not divisible by the patch size, raise an error
        if img_size % patch_size != 0:
            raise ValueError("Image size must be divisible by patch size")

        # Calculate the number of patches
        self.num_patches = (img_size // patch_size) ** 2

        # Initialize the patch layer
        self.patch_layer = PatchLayer(in_chan, emb_dim, patch_size)

        # Initialize the attention blocks
        self.atten_blocks = nn.ModuleList([AttenBlock(emb_dim, hid_dim, num_heads, drop_out) for _ in range(num_layers)])

        # Initialize the layer normalization
        self.layer_norm = nn.LayerNorm(emb_dim)

        # Initialize the output layer
        self.layer_out = nn.Linear(emb_dim, num_classes)

        # Learnable positional encoding (one vector per patch plus the class token) and learnable class token
        self.pos_enc = nn.Parameter(torch.randn(1, self.num_patches+1, emb_dim))
        self.class_token = nn.Parameter(torch.randn(1, 1, emb_dim))

    # Forward pass of the vision transformer
    def forward(self, x):
        # x is the input tensor
        # x has shape (batch_size, in_chan, width, height)
        # batch_size is the number of samples in the batch
        # in_chan is the input channel of the input tensor
        # width is the width of the input tensor
        # height is the height of the input tensor

        # Compute the number of samples in the batch
        batch_size = x.shape[0]

        # Apply the patch layer
        x = self.patch_layer(x)

        # Add the class token
        cls_token = self.class_token.expand(batch_size, -1, -1)
        x = torch.cat((cls_token, x), dim=1)

        # Add the positional encoding
        x = x + self.pos_enc

        # Apply the attention blocks
        for atten_block in self.atten_blocks:
            x, attention = atten_block(x)

        # Apply normalization
        x = self.layer_norm(x)

        # Apply the output layer
        x = self.layer_out(x[:, 0])

        # Return the output tensor
        return x

    # Run the model
    def run_model(self, criterion, optimizer, num_epoch, train_loader):
        # criterion is the loss function
        # optimizer is the optimizer
        # num_epoch is the number of epochs
        # train_loader is the training data loader

        # For each epoch
        for epoch in range(num_epoch):
            # Set the model to training mode
            self.train()

            # Find the batch size
            batch_size = train_loader.batch_size

            for data in tqdm(train_loader, total = len(train_loader)):

                # Get the inputs and labels
                inputs = data[0]
                labels = data[1]

                # If number of samples in the batch is less than the batch size, continue
                if inputs.size()[0] != batch_size:
                    continue

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                outputs = self(inputs.to(device))

                # Compute the loss
                loss = criterion(outputs, labels.to(device))

                # Backward pass
                loss.backward()

                # Optimize
                optimizer.step()

    # Test the model
    def test_model(self, test_loader):
        corr = 0
        tot = 0
        self.eval()
        with torch.no_grad():
            for data in tqdm(test_loader, total = len(test_loader)):
                images = data[0]
                labels = data[1]
                images = images.to(device)
                labels = labels.to(device)
                outputs = self(images)
                _, predicted = torch.max(outputs.data, 1)
                tot += labels.size(0)
                corr += (predicted == labels).sum().item()
        acc = corr/tot
        return acc

In [11]:
# Define model parameters
img_size = 32
patch_size = 8
in_chan = 3
emb_dim = 64
hid_dim = 128
num_layers = 3
num_classes = 10
num_heads = 8

# Initialize the vision transformer
model = VisionTransformer(img_size, in_chan, emb_dim, hid_dim, patch_size, num_layers, num_classes, num_heads)
model.to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

if sgd:
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Run the model
model.run_model(criterion, optimizer, number_of_epoch, train_loader)

# Test the model
acc = model.test_model(test_loader)

print("Accuracy: ", acc)

100%|██████████| 782/782 [00:19<00:00, 39.10it/s]
100%|██████████| 782/782 [00:18<00:00, 41.26it/s]
100%|██████████| 782/782 [00:20<00:00, 37.96it/s]
100%|██████████| 782/782 [00:19<00:00, 39.33it/s]
100%|██████████| 782/782 [00:19<00:00, 40.57it/s]
100%|██████████| 782/782 [00:19<00:00, 39.98it/s]
100%|██████████| 782/782 [00:18<00:00, 41.57it/s]
100%|██████████| 782/782 [00:19<00:00, 40.12it/s]
100%|██████████| 782/782 [00:18<00:00, 41.33it/s]
100%|██████████| 782/782 [00:19<00:00, 39.96it/s]
100%|██████████| 157/157 [00:02<00:00, 55.95it/s]


Accuracy:  0.5794


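Both models were trained for 10 epochs on the same data with the same optimizer settings. Counting parameters from the layer definitions above gives roughly 424K for the CNN and roughly 115K for the Transformer, so the Transformer is the smaller model by a factor of about 3.7 rather than an exact match.

The Transformer model reaches a test accuracy of 57.94%, compared with 53.86% for the CNN model, in the same number of epochs.
The Transformer model also trains a lot faster: about 19 s per epoch versus about 95 s per epoch for the CNN model on the T4 GPU.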
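
To back the parameter comparison with numbers, the trainable parameters of any `nn.Module` can be counted with a short helper. The function name `count_parameters` is made up for this sketch, and it is shown here on single layers rather than on the two trained models.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element counts of all trainable tensors in the module
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

head = nn.Linear(160, 10)                   # same shape as the CNN's final layer
first_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

print(count_parameters(head))        # 160 * 10 weights + 10 biases = 1610
print(count_parameters(first_conv))  # 3 * 16 * 3 * 3 weights + 16 biases = 448
```

Calling the helper on each model instance right after construction gives the two totals to compare.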