## Vision Transformer (ViT)

In this assignment we will work with the Vision Transformer (ViT). We will build our own ViT model and train it on an image classification task.
The purpose of this homework is for you to get familiar with ViT and to prepare for the final project.

In [12]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms


In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# ViT Implementation

The Vision Transformer can be separated into three parts; we will implement each part and combine them at the end.

For the implementation, feel free to experiment with different setups, as long as you use attention as the main computation unit and the ViT can be trained to perform the image classification task presented later.
You can read the ViT implementations in other libraries: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py and https://github.com/pytorch/vision/blob/main/torchvision/models/vision_transformer.py

## PatchEmbedding
PatchEmbedding is responsible for dividing the input image into non-overlapping patches and projecting them into a specified embedding dimension. It uses a 2D convolution layer with a kernel size and stride equal to the patch size. The output is a sequence of linear embeddings for each patch.
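
As a quick sanity check of the description above, the following sketch compares a strided convolution against explicitly cutting out patches and applying the same weights as a linear map. The variable names (`conv_tokens`, `linear_tokens`) and the batch size of 2 are illustrative assumptions; the sizes match the CIFAR settings used later in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
image_size, patch_size, in_channels, embed_dim = 32, 4, 3, 192

proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
x = torch.randn(2, in_channels, image_size, image_size)

# Path 1: strided convolution, then flatten the patch grid into a sequence.
conv_tokens = proj(x).flatten(2).transpose(1, 2)  # (2, 64, 192)

# Path 2: extract non-overlapping patches, then apply the same weights as a linear layer.
patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (2, 3*4*4, 64)
patches = patches.transpose(1, 2)                                 # (2, 64, 48)
linear_tokens = patches @ proj.weight.view(embed_dim, -1).T + proj.bias

print(conv_tokens.shape)
print(torch.allclose(conv_tokens, linear_tokens, atol=1e-4))
```

Both paths should agree up to floating-point error, which is why the convolution can be read as "one linear embedding per patch".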

In [14]:

class PatchEmbedding(nn.Module):
    def __init__(self, image_size, patch_size, in_channels, embed_dim):
      # TODO
      assert image_size % patch_size == 0, 'Image size should be divisible by patch size'
      super(PatchEmbedding, self).__init__()
      #initialization of parameters
      self.patch_size=patch_size
      self.image_size=image_size
      self.embed_dim=embed_dim
      #2D convolutional layer with kernel size and stride equal to the patch size
      self.proj=nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
      # TODO
      batch_size, _, h, w= x.shape
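      #project each patch to an embedding: (batch, embed_dim, H/patch, W/patch)
      x=self.proj(x)
      #move channels last and flatten the patch grid: (batch, num_patches, embed_dim)
      x=x.permute(0,2,3,1).reshape(batch_size,-1,self.embed_dim)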
      return x

## MultiHeadSelfAttention

This class implements the multi-head self-attention mechanism, which is a key component of the transformer architecture. It consists of multiple attention heads that independently compute scaled dot-product attention on the input embeddings. This allows the model to capture different aspects of the input at different positions. The attention outputs are concatenated and linearly transformed back to the original embedding size.
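
The shapes involved can be traced with a small standalone sketch. This is one possible layout, not necessarily the one used in the cell below: it assumes the fused projection is reshaped to `(batch, seq_len, 3, num_heads, head_dim)` so that queries, keys, and values are separated along their own axis. All variable names here are illustrative.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, embed_dim, num_heads = 2, 65, 192, 12
head_dim = embed_dim // num_heads  # 16

x = torch.randn(batch_size, seq_len, embed_dim)
qkv_proj = torch.nn.Linear(embed_dim, 3 * embed_dim)

# (B, S, 3*E) -> (B, S, 3, H, D) -> (3, B, H, S, D), then split off q, k, v.
qkv = qkv_proj(x).reshape(batch_size, seq_len, 3, num_heads, head_dim)
q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)

# Scaled dot-product attention, computed independently for every head.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)  # (B, H, S, S)
weights = F.softmax(scores, dim=-1)                       # each row sums to 1
context = weights @ v                                     # (B, H, S, D)

# Concatenate the heads back into the embedding dimension.
merged = context.transpose(1, 2).reshape(batch_size, seq_len, embed_dim)
print(q.shape, weights.shape, merged.shape)
```

The key point is that `q`, `k`, and `v` each keep all `num_heads` heads; only the feature axis is divided among the heads.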

In [15]:

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
      # TODO
      #necessity for operations
      assert embed_dim % num_heads == 0, "embed_dim should be divisible by num_heads"
      super().__init__()
      self.num_heads=num_heads
      self.embed_dim=embed_dim
      self.head_dim=embed_dim//num_heads
      #projection as stated to capture different aspects of input
      self.qkv_proj=nn.Linear(embed_dim, embed_dim*3)
      #obtain outputs
      self.out_proj=nn.Linear(embed_dim, embed_dim)
    def forward(self, x):
      # TODO
      batch_size, seq_len, embed_dim=x.shape
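      #project the input to queries, keys, and values in one pass
      qkv=self.qkv_proj(x)
      qkv=qkv.reshape(batch_size, seq_len, self.num_heads, 3 * self.head_dim)
      qkv=qkv.permute(0,2,1,3)
      #split the last dimension into q, k, v, each (batch, num_heads, seq_len, head_dim)
      #(chunking along dim=1 would split the heads instead)
      q,k,v=qkv.chunk(3, dim=-1)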
      #compute the scaled dot-product scores
      attn=(q @ k.transpose(-2,-1))/ math.sqrt(self.head_dim)
      #turn the scores into attention probabilities
      attn=F.softmax(attn,dim=-1)
      #weight the values by the attention probabilities
      attn_out=attn @ v
      #concatenate the heads back to (batch, seq_len, embed_dim)
      attn_out=attn_out.transpose(1,2).contiguous().view(batch_size, seq_len, self.embed_dim)
      #project back to the embedding size
      attn_out=self.out_proj(attn_out)
      return attn_out



## TransformerBlock
This class represents a single transformer layer. It includes a multi-head self-attention sublayer followed by a position-wise feed-forward network (MLP). Each sublayer is surrounded by residual connections.
You may also want to use layer normalization or another type of normalization.
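
One common arrangement is the pre-norm residual form `x + sublayer(LayerNorm(x))`. The sketch below uses a hypothetical `PreNormResidual` wrapper (not part of the assignment template) around a small MLP to show that the residual path leaves the input untouched whenever the sublayer outputs zero.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Hypothetical helper: computes x + sublayer(LayerNorm(x))."""

    def __init__(self, embed_dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
embed_dim, mlp_dim = 192, 384
mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, embed_dim))
block = PreNormResidual(embed_dim, mlp)

x = torch.randn(2, 65, embed_dim)
y = block(x)  # same shape as the input

# Zero the last linear layer: the sublayer now contributes nothing,
# so the block reduces to the identity through the residual connection.
nn.init.zeros_(mlp[-1].weight)
nn.init.zeros_(mlp[-1].bias)
identity_out = block(x)
print(y.shape, torch.allclose(identity_out, x))
```

This is the property that makes deep stacks of such blocks easier to optimize: each block only has to learn a correction on top of its input.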

In [16]:

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, mlp_dim, dropout):
        # TODO
        super().__init__()
        #layer normalization, as suggested, applied before the multi-head
        #self-attention we have created
        self.norm1=nn.LayerNorm(embed_dim)
        self.attn=MultiHeadSelfAttention(embed_dim,num_heads)
        self.norm2=nn.LayerNorm(embed_dim)
        #the feed-forward (MLP) network expands to mlp_dim and projects back to embed_dim,
        #with dropout after each linear layer
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )


    def forward(self, x):
        # TODO
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

## VisionTransformer
This is the main class that assembles the entire Vision Transformer architecture. It starts with the PatchEmbedding layer to create patch embeddings from the input image. A special class token is prepended to the sequence, and positional embeddings are added to both the patch and class tokens. The resulting sequence is then passed through multiple TransformerBlock layers. The final output is the logits for all classes.
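
The sequence bookkeeping described above can be checked in isolation. In this sketch, `patch_tokens` is a random stand-in for the PatchEmbedding output, and the truncated-normal initialization with `std=0.02` is an assumption commonly seen in ViT code rather than something this assignment prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image_size, patch_size, embed_dim = 32, 4, 192
num_patches = (image_size // patch_size) ** 2  # 8 * 8 = 64

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)  # assumed init, see note above

# Stand-in for the output of PatchEmbedding on a batch of 8 images.
patch_tokens = torch.randn(8, num_patches, embed_dim)

# Share one learnable class token across the batch and prepend it.
cls = cls_token.expand(patch_tokens.shape[0], -1, -1)  # (8, 1, 192)
tokens = torch.cat([cls, patch_tokens], dim=1)         # (8, 65, 192)
tokens = tokens + pos_embed                            # broadcast over the batch

# After the transformer blocks, position 0 is what the classifier head reads.
cls_output = tokens[:, 0]                              # (8, 192)
print(tokens.shape, cls_output.shape)
```

With a 32x32 image and 4x4 patches the transformer therefore sees 65 tokens per image: 64 patches plus the class token.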

In [17]:

class VisionTransformer(nn.Module):
    def __init__(self, image_size, patch_size, in_channels, embed_dim, num_heads, mlp_dim, num_layers, num_classes, dropout=0.1):
        # TODO
        super(VisionTransformer,self).__init__()
        #create the patch embedding layer
        self.patch_embed=PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
        #the special class token that is passed through the layers
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        #learnable positional embeddings, needed because self-attention is order-agnostic
        self.pos_embed =  nn.Parameter(torch.randn(1, (image_size // patch_size) ** 2 + 1, embed_dim))
        #the transformer layers using the dimensions given
        self.transformer = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, mlp_dim, dropout)
            for _ in range(num_layers)
        ])
        #final layer normalization applied before the classification head
        self.norm = nn.LayerNorm(embed_dim)
        #linear head that produces the classification logits
        self.head = nn.Linear(embed_dim, num_classes)


    def forward(self, x):
        # TODO
        batch_size=x.shape[0]
        #compute the patch embeddings
        x = self.patch_embed(x)
        #prepend one copy of the class token to every sequence in the batch
        cls_token = self.cls_token.repeat(batch_size, 1, 1)
        x = torch.cat([cls_token, x], dim=1)
        #add positional embeddings
        x = x + self.pos_embed
        #go through transformer layers
        for block in self.transformer:
            x = block(x)
        #take the class token output
        cls_output = x[:, 0]
        #normalize and apply the head to get classification logits
        logits = self.head(self.norm(cls_output))
        return logits

## Let's train the ViT!

We will train the ViT to do image classification on CIFAR-100. Feel free to change the optimizer and/or add other tricks to improve the training.
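
One example of such a trick is a short linear warmup before the cosine decay of the learning rate. The sketch below is only an illustration: `warmup_cosine`, the 5 warmup epochs, and the dummy parameter are assumptions, not part of the assignment. It uses `torch.optim.lr_scheduler.LambdaLR` to scale the base learning rate once per epoch.

```python
import math
import torch


def warmup_cosine(epoch, warmup_epochs=5, total_epochs=100):
    """Hypothetical schedule: linear warmup, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# A single dummy parameter stands in for model.parameters().
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)

lrs = []
for epoch in range(100):
    lrs.append(opt.param_groups[0]["lr"])  # learning rate used during this epoch
    opt.step()    # the real training loop over batches would go here
    sched.step()  # advance the schedule once per epoch

print(lrs[0], lrs[4], lrs[-1])
```

Whether warmup helps for this model size is an empirical question; it is worth comparing against the plain cosine schedule on the validation accuracy.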

In [18]:

# Example usage:
image_size = 32    # keep
patch_size = 4     # keep
in_channels = 3    # keep
embed_dim = 192    # can change; must be divisible by num_heads
num_heads = 12
mlp_dim = 384      # can try increasing
num_layers = 8     # keep
num_classes = 100  # keep
dropout = 0.1      # changed from 0
batch_size = 256   # changed from 128; did a little better

In [19]:

model = VisionTransformer(image_size, patch_size, in_channels, embed_dim, num_heads, mlp_dim, num_layers, num_classes, dropout).to(device)
input_tensor = torch.randn(1, in_channels, image_size, image_size).to(device)
output = model(input_tensor)
print(output.shape)

torch.Size([1, 100])


In [20]:

# Load the CIFAR-100 dataset
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.Resize(image_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.Resize(image_size),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = datasets.CIFAR100(root='./data', train=True, download=True, transform=transform_train)
testset = datasets.CIFAR100(root='./data', train=False, download=True, transform=transform_test)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


In [21]:

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.05)# TODO

In [22]:
import numpy as np
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.transforms import v2

# Train the model
num_epochs = 100  # increasing the number of epochs, and possibly mlp_dim, might have helped
best_val_acc = 0
scheduler = CosineAnnealingLR(optimizer,T_max=num_epochs)

for epoch in range(num_epochs):
    model.train()
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # TODO Feel free to modify the training loop yourself.
    scheduler.step()
    # Validate the model
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    val_acc = 100 * correct / total
    print(f"Epoch: {epoch + 1}, Validation Accuracy: {val_acc:.2f}%")

    # Save the best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pth")

Epoch: 1, Validation Accuracy: 15.39%
Epoch: 2, Validation Accuracy: 21.62%
Epoch: 3, Validation Accuracy: 22.97%
Epoch: 4, Validation Accuracy: 26.80%
Epoch: 5, Validation Accuracy: 29.54%
Epoch: 6, Validation Accuracy: 31.59%
Epoch: 7, Validation Accuracy: 33.68%
Epoch: 8, Validation Accuracy: 34.35%
Epoch: 9, Validation Accuracy: 36.89%
Epoch: 10, Validation Accuracy: 38.94%
Epoch: 11, Validation Accuracy: 41.22%
Epoch: 12, Validation Accuracy: 42.69%
Epoch: 13, Validation Accuracy: 42.08%
Epoch: 14, Validation Accuracy: 44.24%
Epoch: 15, Validation Accuracy: 45.62%
Epoch: 16, Validation Accuracy: 46.86%
Epoch: 17, Validation Accuracy: 46.64%
Epoch: 18, Validation Accuracy: 47.65%
Epoch: 19, Validation Accuracy: 48.09%
Epoch: 20, Validation Accuracy: 48.41%
Epoch: 21, Validation Accuracy: 48.43%
Epoch: 22, Validation Accuracy: 49.54%
Epoch: 23, Validation Accuracy: 49.66%
Epoch: 24, Validation Accuracy: 51.00%
Epoch: 25, Validation Accuracy: 50.20%
Epoch: 26, Validation Accuracy: 51

Please submit your best_model.pth with this notebook, and report the best test results you get.
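
Before submitting, it can be worth confirming that a saved checkpoint reloads into a freshly constructed model. The sketch below shows the round trip with `torch.save` and `load_state_dict`; the `nn.Linear` layer is a hypothetical stand-in for the trained ViT, and the file is written to a temporary directory so the real `best_model.pth` is not overwritten.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(192, 100)  # hypothetical stand-in for the trained ViT

# Save only the weights, as the training loop does.
path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
torch.save(model.state_dict(), path)

# Rebuild the same architecture, then load the saved weights into it.
restored = nn.Linear(192, 100)
state = torch.load(path, map_location="cpu")
restored.load_state_dict(state)
restored.eval()

# Both models should now produce identical outputs.
x = torch.randn(4, 192)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```

For the actual submission, the restored model would be a `VisionTransformer` built with the same hyperparameters that were used for training, since `load_state_dict` expects matching parameter names and shapes.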