#### DEEP LEARNING ASSIGNMENT 3<BR>

SUBMITTED BY: ANIRUDH JOSHI (CS23MTECH11002)<br>
DATE: 29/03/2024<br>

####  PROBLEM 1

Self-Attention for Object Recognition with CNNs: Implement a sample CNN with one or
more self-attention layer(s) for performing object recognition over CIFAR-10 dataset. You have to
implement the self-attention layer yourself and use it in the forward function defined by you. All
other layers (fully connected, nonlinearity, conv layer, etc.) can be built-in implementations. The
network can be a simpler one (e.g., it may have 1x Conv, 4x [Conv followed by SA], 1x Conv, and
1x GAP). Please refer to the reading material provided here or any other similar one.

PREPARING DATASET

In [5]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# applying transformation to data after downloading it
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# downloading CIFAR-10 dataset
full_trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
full_testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:01<00:00, 98781032.55it/s] 


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [None]:
# small subsets of the training and test data, created for quick testing
start_index = 0
end_index = 5000

# getting the smaller subset from the full train set
subset_trainset = Subset(full_trainset, range(start_index, end_index))
subset_testset = Subset(full_testset, range(start_index, end_index))

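# preparing the dataloaders for feeding into the network
# note: despite their names, both loaders wrap the full datasets, not the 5000-sample subsets above;
# the recorded training log (prints at batches 200 and 400) corresponds to the full training set
subset_trainloader = torch.utils.data.DataLoader(full_trainset, batch_size=100, shuffle=True)
subset_testloader = torch.utils.data.DataLoader(full_testset, batch_size=100, shuffle=True)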

SELF-ATTENTION CLASS

Initialization of the class takes two parameters: in_channels, the number of channels in the input feature map, and k, the channel-reduction factor used for the query, key and value projections.

In [None]:
class SelfAttention(nn.Module):
    # initializing the parameters
    def __init__(self, in_channels, k=8):
        super(SelfAttention, self).__init__()
        self.k = k
        self.in_channels = in_channels
        # 1x1 convolutions producing the query (theta), key (phi) and value (g) representations of the input feature map
        self.theta = nn.Conv2d(in_channels, in_channels // k, kernel_size=1)    # kernel size is 1 so that every pixel is treated as a separate feature vector
        self.phi = nn.Conv2d(in_channels, in_channels // k, kernel_size=1)      # the 1x1 kernel also reduces dimensionality in terms of channels
        self.g = nn.Conv2d(in_channels, in_channels // k, kernel_size=1)
        self.o = nn.Conv2d(in_channels // k, in_channels, kernel_size=1)        # output projection back to in_channels

    # forward pass in self attention
    def forward(self, x):
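        # getting input dimensions; the tensor layout is (batch, channels, height, width), so the names
        # width and height are swapped here, which is harmless because they are passed to view() in the same order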
        batch_size, C, width, height = x.size()
        theta = self.theta(x).view(batch_size, self.in_channels // self.k, -1)      # reshaped into 3D tensor for subsequent matrix multiplication

        # applied max pool and then reshaped it into 3D tensor
        phi = nn.functional.max_pool2d(self.phi(x), 2).view(batch_size, self.in_channels // self.k, -1)
        g = nn.functional.max_pool2d(self.g(x), 2).view(batch_size, self.in_channels // self.k, -1)

        # bmm is batch matrix multiplication between the query and key tensors to compute attention scores
        beta = nn.functional.softmax(torch.bmm(theta.permute(0, 2, 1), phi), dim=-1)

        # applying the attention scores obtained to the values, generating final attended output
        o = self.o(torch.bmm(g, beta.permute(0, 2, 1)).view(batch_size, self.in_channels // self.k, width, height))

        return x + o
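
The shapes inside the forward pass above are easy to lose track of, so here is a small bookkeeping sketch in plain Python. It only traces shapes (batch dimension omitted) and performs no attention computation; the function name `attention_shapes` and the example sizes are illustrative assumptions, chosen to match the first self-attention layer applied to a 32x32 feature map.

```python
def attention_shapes(channels, height, width, k=8, pool=2):
    """Trace tensor shapes (batch dimension omitted) through the block above."""
    reduced = channels // k                       # channels after the 1x1 projections
    n_query = height * width                      # one query per pixel
    n_key = (height // pool) * (width // pool)    # keys/values after 2x2 max pooling
    return {
        "theta": (reduced, n_query),              # queries
        "phi": (reduced, n_key),                  # keys
        "g": (reduced, n_key),                    # values
        "beta": (n_query, n_key),                 # attention weights, softmax over keys
        "attended": (reduced, n_query),           # g @ beta^T, later viewed as (reduced, H, W)
        "output": (channels, height, width),      # after the o projection and the residual
    }


shapes = attention_shapes(channels=32, height=32, width=32)
for name, shape in shapes.items():
    print(f"{name:9s} {shape}")

# float32 memory of one attention map for a batch of 100 images
beta_bytes = 100 * shapes["beta"][0] * shapes["beta"][1] * 4
print(f"beta for a batch of 100: {beta_bytes / 2**20:.0f} MiB")
```

Because the network never downsamples between layers, every self-attention layer sees a 32x32 map, so the attention matrix has the same 1024 x 256 size at each of the five layers.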


DEFINING CNN WITH SPECIFIED LAYERS

The CNNWithSelfAttention class uses the self-attention class defined above together with the built-in convolution and pooling layers. Each convolution layer is followed by a self-attention layer. The number of output channels increases from 32 to 256 over the first four convolutions so that deeper layers can capture more complex features, and the final convolution reduces it to 128 before global average pooling.

In [None]:
# Class CNN with self attention
class CNNWithSelfAttention(nn.Module):
    # defining the convolution layers, pooling layers, fully connected and self attention layers.
    def __init__(self):
        super(CNNWithSelfAttention, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)     # convolution layer 1
        self.sa1 = SelfAttention(32)                                # self attention 1
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.sa2 = SelfAttention(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.sa3 = SelfAttention(128)
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.sa4 = SelfAttention(256)
        self.conv5 = nn.Conv2d(256, 128, kernel_size=3, padding=1)
        self.sa5 = SelfAttention(128)                               # final self attention layer
        self.global_avg_pool = nn.AdaptiveAvgPool2d(1)              # global average pooling layer
        self.fc = nn.Linear(128, 10)                                # fully connected layer

    # forward pass of CNN
    def forward(self, x):
        x = nn.functional.relu(self.conv1(x))
        x = self.sa1(x)
        x = nn.functional.relu(self.conv2(x))
        x = self.sa2(x)
        x = nn.functional.relu(self.conv3(x))
        x = self.sa3(x)
        x = nn.functional.relu(self.conv4(x))
        x = self.sa4(x)
        x = nn.functional.relu(self.conv5(x))
        x = self.sa5(x)
        x = self.global_avg_pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

TRAINING MODEL

Initializing the model, the loss function and the optimizer. As this is a multi-class classification problem we use the cross-entropy loss function; for the optimizer we use Adam with learning rate 0.001. Finally, we feed the training data into the model using the training dataloader prepared above.
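
As a reference point for reading the loss values, here is a minimal plain-Python sketch of the per-sample cross-entropy, assuming the usual definition (negative log of the softmax probability of the true class). The helper name `cross_entropy` and the toy logits are illustrative; `nn.CrossEntropyLoss` additionally averages this quantity over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# an untrained model that predicts all 10 classes as equally likely
uniform_loss = cross_entropy([0.0] * 10, target=3)
# a model that puts a large logit on the correct class
confident_loss = cross_entropy([0.0] * 9 + [5.0], target=9)
print(f"uniform: {uniform_loss:.3f}, confident and correct: {confident_loss:.3f}")
```

For 10 classes the uniform case equals ln(10), about 2.303, so a running loss near that value means the model has not yet learned anything.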

In [None]:
# Initializing the model, loss function and the optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNNWithSelfAttention().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# training the model for 10 epochs
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for i, data in enumerate(subset_trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # printing every 200 batches
        if i % 200 == 199:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')

[1,   200] loss: 2.016
[1,   400] loss: 1.732
[2,   200] loss: 1.539
[2,   400] loss: 1.476
[3,   200] loss: 1.361
[3,   400] loss: 1.308
[4,   200] loss: 1.205
[4,   400] loss: 1.183
[5,   200] loss: 1.127
[5,   400] loss: 1.087
[6,   200] loss: 1.039
[6,   400] loss: 1.011
[7,   200] loss: 0.952
[7,   400] loss: 0.945
[8,   200] loss: 0.901
[8,   400] loss: 0.886
[9,   200] loss: 0.845
[9,   400] loss: 0.850
[10,   200] loss: 0.785
[10,   400] loss: 0.813
Finished Training
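
The log above has exactly two lines per epoch, at batches 200 and 400. A small arithmetic check, assuming the standard 50,000 CIFAR-10 training images and the batch size and logging interval used in the training cell, shows that this pattern matches loaders built on the full training set rather than on a 5000-sample subset. The helper name `logged_batches` is hypothetical.

```python
import math


def logged_batches(num_samples, batch_size=100, log_every=200):
    """Return (batches per epoch, 1-based batch numbers at which a log line is printed)."""
    num_batches = math.ceil(num_samples / batch_size)
    logged = [i + 1 for i in range(num_batches) if i % log_every == log_every - 1]
    return num_batches, logged


full = logged_batches(50_000)    # full CIFAR-10 training set
subset = logged_batches(5_000)   # a 5000-image subset
print("full set:", full)
print("subset:  ", subset)
```

With the full set there are 500 batches per epoch, so the loss of the last 100 batches is accumulated but never printed; with the subset no line would be printed at all.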


TESTING MODEL

In [None]:
# testing the above trained model
model.eval()

# variables used to compute accuracy
correct = 0
total = 0
with torch.no_grad():
    for data in subset_testloader:
        images, labels = data[0].to(device), data[1].to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy on test set: %d %%' % (100 * correct / total))

Accuracy on test set: 68 %
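
One detail about the printed figure: the `%d` format truncates the fractional part, so the reported accuracy is rounded down. The sketch below illustrates this with made-up counts (they are hypothetical, not the counts from this run) and shows a two-decimal alternative.

```python
correct, total = 6897, 10000   # hypothetical counts, for illustration only

truncated = 'Accuracy on test set: %d %%' % (100 * correct / total)
precise = 'Accuracy on test set: %.2f %%' % (100 * correct / total)
print(truncated)
print(precise)
```

When two models are compared, the extra decimals make small differences visible.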


####  PROBLEM 2

Object Recognition with Vision Transformer: Implement and train an Encoder only Transformer
(ViT-like) for the above object recognition task. In other words, implement multi-headed
self-attention for the image classification (i.e., appending a < class > token to the image patches
that are accepted as input tokens). Compare the performance of the two implementations (try to
keep the number of parameters to be comparable and use the same amount of training and testing
data).

PREPARING DATASET

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# applying transformation to data after downloading it
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((32, 32)),  # CIFAR-10 images are already 32x32, so this resize leaves them unchanged
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize images
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=100, shuffle=True)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = DataLoader(testset, batch_size=100, shuffle=True)


Files already downloaded and verified
Files already downloaded and verified


MULTIHEAD SELF-ATTENTION

In contrast to the self-attention class above, where the query, key and value representations are learned with convolution layers, here they are learned with linear layers.
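
One relationship worth keeping in mind when comparing the two designs: a convolution with kernel size 1 applies the same linear map at every pixel, so it acts as a linear layer over the channel dimension. The sketch below checks this numerically in PyTorch; the layer sizes and the random input are arbitrary choices for illustration, not values taken from the models in this assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(8, 4, kernel_size=1)
linear = nn.Linear(8, 4)

# copy the 1x1 convolution weights into the linear layer
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(4, 8))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 8, 5, 5)                      # (batch, channels, height, width)
conv_out = conv(x)                               # (2, 4, 5, 5)

tokens = x.flatten(2).transpose(1, 2)            # (2, 25, 8): one token per pixel
linear_out = linear(tokens).transpose(1, 2).reshape(2, 4, 5, 5)

same = torch.allclose(conv_out, linear_out, atol=1e-6)
print("1x1 conv matches per-pixel linear layer:", same)
```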

In [7]:
class MultiHeadAttention(nn.Module):
    # defining the parameters and layers
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # used linear layers for representations of query, key and values
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.fc_out = nn.Linear(embed_dim, embed_dim)

    # forward pass in multi head attention
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]     # obtain the batch size
        Query = self.query(query)
        Key = self.key(key)
        Value = self.value(value)

        # reshape query, key and values to get multiple heads
        Query = Query.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        Key = Key.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        Value = Value.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        # computing attention scores
        attnScores = torch.matmul(Query, Key.permute(0, 1, 3, 2)) / (self.head_dim ** 0.5)

        if mask is not None:
            attnScores = attnScores.masked_fill(mask == 0, float('-1e20'))

        smallOut = torch.softmax(attnScores, dim=-1)

        # applied the softmax output to the values
        output = torch.matmul(smallOut, Value)

        # reshaping to concatenate heads
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.head_dim * self.num_heads)

        # applied linear layer before final output
        output = self.fc_out(output)

        return output
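
To make the per-head computation concrete, here is a minimal plain-Python sketch of scaled dot-product attention for a single head, softmax(QK^T / sqrt(d)) V, which is what each head in the class above computes after the reshape. The helper names and the three toy tokens are assumptions for illustration and are not taken from the model.

```python
import math


def softmax(row):
    peak = max(row)                               # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries and keys are lists of d-dimensional vectors; values has one vector per key."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))           # one row of attention weights per query
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# three tokens with head_dim = 2 (toy numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each output vector is the corresponding weighted average of the value vectors.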

TRANSFORMER ENCODER LAYERS ARCHITECTURE

Because the transformer layers must use our own MultiHeadAttention class, we also need a class that defines the transformer encoder layers. Each layer performs multi-head self-attention followed by a two-layer feed-forward block; both are preceded by layer normalization and wrapped in a residual connection. The structure follows the ViT encoder block, with an additional ReLU after each layer normalization and an extra layer normalization inside the feed-forward block.

In [8]:
# transformer encoder architecture
class TransformerEncoder(nn.Module):
    def __init__(self, embed_dim, num_heads, num_layers, hidden_dim):
        super(TransformerEncoder, self).__init__()
        self.layers = nn.ModuleList([
            nn.ModuleList([
                nn.LayerNorm(embed_dim),
                nn.ReLU(),
                MultiHeadAttention(embed_dim, num_heads),       # multihead attention class defined above
                nn.LayerNorm(embed_dim),
                nn.ReLU(),
                nn.Linear(embed_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, embed_dim)
            ])
            for _ in range(num_layers)
        ])

    # forward pass in the encoder.
    def forward(self, x):
        for layer in self.layers:
            norm1, relu1, attention, norm2, relu2, fc1, norm3, relu3, fc2 = layer
            residual = x
            x = norm1(x)
            x = relu1(x)
            x = attention(x, x, x)
            x = x + residual
            residual = x
            x = norm2(x)
            x = relu2(x)
            x = fc1(x)
            x = norm3(x)
            x = relu3(x)
            x = fc2(x)
            x = x + residual
        return x

TRANSFORMER CLASS

This class uses the transformer encoder class defined above for the encoder block. It turns the image into a sequence of patch tokens, prepends a class token, and adds a learned positional embedding to the token sequence before passing it to the encoder.

In [9]:
class Transformer(nn.Module):

    # initialize necessary parameters.
    def __init__(self, image_size, patch_size, num_classes, embed_dim, num_heads, num_layers, hidden_dim):
        super(Transformer, self).__init__()

        # Computing the number of patches based on image size
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size ** 2         # 3 channels for RGB images

        # embed_dim is a hyper-parameter which represents the dimensionality of token embedding
        self.patch_embedding = nn.Linear(patch_dim, embed_dim)
        self.positional_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
        self.encoder = TransformerEncoder(embed_dim, num_heads, num_layers, hidden_dim)     # used the above defined class for Transformer encoder block
        self.class_token = nn.Parameter(torch.randn(1, 1, embed_dim))   # learnable token prepended to the patch tokens to learn the global context of the image
        self.fc = nn.Linear(embed_dim, num_classes)

    # forward pass in transformer
    def forward(self, x):
        batch_size = x.shape[0]
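        # note: this reshape hardcodes patch_size=4 and cuts the flattened (C, H, W) tensor into
        # 64 consecutive chunks of 48 values; these chunks are not spatial 4x4 patches
        x = x.view(batch_size, -1, 3 * 4 * 4)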
        x = self.patch_embedding(x)

        # prepending one class token to the token sequence of each image in the batch
        class_token = self.class_token.expand(batch_size, -1, -1)
        x = torch.cat([class_token, x], dim=1)

        # as in a standard transformer, add the positional embedding before passing the input to the encoder
        x = x + self.positional_embedding
        x = self.encoder(x)

        # take the class token and pass it through the fc layer, because the class token captures the global context and hence represents the entire image
        x = x[:, 0]
        x = self.fc(x)
        return x
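
The forward pass above forms its tokens with a plain `view`, which slices the flattened tensor into consecutive runs of 48 values. For comparison, here is a hedged sketch of how spatial patches are usually extracted in PyTorch with `reshape` and `permute`. The function name `patchify` is an assumption and is not part of the code above, and swapping it in would change the model, so the recorded results would no longer apply.

```python
import torch


def patchify(images, patch_size):
    """Split (batch, channels, H, W) images into flattened, non-overlapping patches."""
    batch, channels, height, width = images.shape
    rows, cols = height // patch_size, width // patch_size
    x = images.reshape(batch, channels, rows, patch_size, cols, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)              # (batch, rows, cols, channels, p, p)
    return x.reshape(batch, rows * cols, channels * patch_size * patch_size)


x = torch.arange(2 * 3 * 32 * 32, dtype=torch.float32).reshape(2, 3, 32, 32)
patches = patchify(x, patch_size=4)              # (2, 64, 48)
naive = x.view(2, -1, 3 * 4 * 4)                 # (2, 64, 48), same shape, different content

print(patches.shape, naive.shape)
print("first token identical:", torch.equal(patches[0, 0], naive[0, 0]))
```

Both tensors have the same shape, which is why the plain `view` runs without error, but only `patchify` groups the pixels of one 4x4 image region (across all three channels) into one token.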

TRAINING THE MODEL

Initializing the model, the loss function and the optimizer. The choice of optimizer, loss function, learning rate and other training parameters remains the same as in problem 1.

In [10]:
import torch

# training the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model2 = Transformer(image_size=32, patch_size=4, num_classes=10, embed_dim=64, num_heads=4, num_layers=3, hidden_dim=256)
model2.to(device)
criterion2 = nn.CrossEntropyLoss()
optimizer2 = optim.Adam(model2.parameters(), lr=0.001)

num_epochs = 10

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        optimizer2.zero_grad()

        outputs = model2(inputs)
        loss = criterion2(outputs, labels)
        loss.backward()
        optimizer2.step()

        running_loss += loss.item()
        if i % 200 == 199:
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')

[1,   200] loss: 2.084
[1,   400] loss: 1.726
[2,   200] loss: 1.580
[2,   400] loss: 1.539
[3,   200] loss: 1.482
[3,   400] loss: 1.469
[4,   200] loss: 1.412
[4,   400] loss: 1.412
[5,   200] loss: 1.364
[5,   400] loss: 1.363
[6,   200] loss: 1.314
[6,   400] loss: 1.328
[7,   200] loss: 1.274
[7,   400] loss: 1.295
[8,   200] loss: 1.234
[8,   400] loss: 1.271
[9,   200] loss: 1.207
[9,   400] loss: 1.226
[10,   200] loss: 1.183
[10,   400] loss: 1.185
Finished Training


TESTING

In [11]:
# evaluating model accuracy
model2.eval()
correct_model2 = 0
total_model2 = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data[0].to(device), data[1].to(device)
        outputs = model2(images)
        _, predicted = torch.max(outputs.data, 1)
        total_model2 += labels.size(0)
        correct_model2 += (predicted == labels).sum().item()

print('Accuracy on test set: %d %%' % (100 * correct_model2 / total_model2))

Accuracy on test set: 52 %


PERFORMANCE COMPARISON MODEL 1 (PROBLEM 1) V/S MODEL 2 (PROBLEM 2)

* CIFAR-10 is a relatively simple dataset of small images in which local features are largely sufficient, so CNNWithSelfAttention is expected to perform better. The test accuracy confirms this: 68 % for the CNN with self-attention against 52 % for the transformer. <br> <br>
* On datasets with more complex spatial relationships and long-range dependencies, vision transformers can perform better than CNNs. <br> <br>
* Overall, both models learn the object recognition task on CIFAR-10 (both are well above the 10 % chance level), but the CNN with self-attention performs better. This may be due to the convolution layers in the CNN, including those used inside its self-attention class, whereas the transformer uses linear layers for attention. Linear layers over the token sequence are the usual choice when the goal is to capture global dependencies, so I used them in the attention for problem 2.<br>
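
Since the task asks for a comparable number of parameters, the two models can be tallied from their layer definitions. The sketch below does this by hand in plain Python, assuming the standard parameter counts for `nn.Conv2d`, `nn.Linear` and `nn.LayerNorm` with biases enabled; the helper names are hypothetical. The result is worth cross-checking against `sum(p.numel() for p in model.parameters())` on the actual models.

```python
def conv2d_params(c_in, c_out, kernel):
    return c_in * c_out * kernel * kernel + c_out      # weights + biases


def linear_params(d_in, d_out):
    return d_in * d_out + d_out                        # weights + biases


def layernorm_params(dim):
    return 2 * dim                                     # scale and shift


def self_attention_params(channels, k=8):
    reduced = channels // k
    # theta, phi and g project down; o projects back up
    return 3 * conv2d_params(channels, reduced, 1) + conv2d_params(reduced, channels, 1)


def cnn_params():
    widths = [3, 32, 64, 128, 256, 128]                # channel widths of conv1..conv5
    total = 0
    for c_in, c_out in zip(widths, widths[1:]):
        total += conv2d_params(c_in, c_out, 3) + self_attention_params(c_out)
    return total + linear_params(128, 10)              # final fully connected layer


def vit_params(image_size=32, patch_size=4, num_classes=10,
               embed_dim=64, num_layers=3, hidden_dim=256):
    num_patches = (image_size // patch_size) ** 2
    per_layer = (
        layernorm_params(embed_dim) + 4 * linear_params(embed_dim, embed_dim)    # norm1 + attention
        + layernorm_params(embed_dim) + linear_params(embed_dim, hidden_dim)    # norm2 + fc1
        + layernorm_params(hidden_dim) + linear_params(hidden_dim, embed_dim)   # norm3 + fc2
    )
    return (
        linear_params(3 * patch_size ** 2, embed_dim)  # patch embedding
        + (num_patches + 1) * embed_dim                # positional embedding
        + embed_dim                                    # class token
        + num_layers * per_layer
        + linear_params(embed_dim, num_classes)        # classifier head
    )


print("CNN with self-attention:", cnn_params())
print("Transformer:            ", vit_params())
```

Under this tally the CNN has roughly 737k parameters and the transformer roughly 159k, so the two models are not closely matched in size, which is one more factor to keep in mind when reading the accuracy gap.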

END OF ASSIGNMENT 3