## Artificial Neural Networks and Deep Learning  
## Assignment 3.3 - Self-attention and Transformers

Prof. Dr. Ir. Johan A. K. Suykens     

In this file, we first study the self-attention mechanism by implementing it with both ``NumPy`` and ``PyTorch``.
Then, we implement a 6-layer Vision Transformer (ViT) and train it on the MNIST dataset.

All training will be conducted on a single T4 GPU.


In [None]:
# Please mount your Google Drive first
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Please go to Edit > Notebook settings > Hardware accelerator > choose "T4 GPU"
# Now check if you have loaded the GPU successfully
!nvidia-smi

Tue May 28 14:35:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Self-attention Mechanism
Self-attention is the core mechanism of the Transformer.
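
For reference, the computation implemented in the two sections that follow can be written compactly as shown here. This is the usual scaled dot-product formulation for a single head, with the softmax applied row-wise; $W_q$, $W_k$, $W_v$ correspond to the variables `Wq`, `Wk`, `Wv` in the NumPy code, and $d_k$ denotes the key dimension (equal to `d` in the NumPy example).

```latex
Q = X W_q, \qquad K = X W_k, \qquad V = X W_v,
\qquad
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),
\qquad
\text{outputs} = A V
```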

## Self-attention with NumPy
To understand it better, we first implement the self-attention mechanism by hand with ``numpy``. You can check the dimensions of each variable during the matrix computations.

Feel free to change the dimensions of each variable and see how the output dimensions change accordingly.

In [None]:
import math
import numpy as np
from numpy.random import randn

# I. Define the input data X
# X consists of 32 samples, each of dimensionality 256
n = 32
d = 256
X = randn(n, d) # (32, 256)

# II. Generate the projection weights
Wq = randn(d, d) #(256, 256)
Wk = randn(d, d)
Wv = randn(d, d)

# III. Project X to obtain its query, key and value vectors
Q = np.dot(X, Wq) # (32, 256)
K = np.dot(X, Wk)
V = np.dot(X, Wv)

# IV. Compute the self-attention score, denoted by A
# A = softmax(QK^T / \sqrt{d})
# Define the softmax function
def softmax(z):
    z = np.clip(z, -100, 100) # clip to [-100, 100] so that exp() does not overflow
    tmp = np.exp(z)
    res = tmp / np.sum(tmp, axis=1, keepdims=True) # normalize each row to sum to 1
    return res

A = softmax(np.dot(Q, K.transpose())/math.sqrt(d)) #(32, 32)

# V. Compute the self-attention output
# outputs = A * V
outputs = np.dot(A, V) #(32, 256)

print("The attention outputs are\n {}".format(outputs))

The attention outputs are
 [[-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]
 [-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]
 [-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]
 ...
 [-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]
 [-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]
 [-0.1065401   2.6477586  -1.29832641 ... -2.17956357 -5.25596909
   3.17892371]]
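
A quick sanity check on any attention matrix is that every row is non-negative and sums to 1. The sketch below is not part of the original assignment. It uses a hypothetical helper, `stable_softmax`, that subtracts the row-wise maximum before exponentiating (a common alternative to clipping). It also scales the random weights by `1/sqrt(d)`, an assumption made here only to keep the scores in a moderate range.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def stable_softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

n, d = 32, 256
X = rng.standard_normal((n, d))
# Assumption: weights scaled by 1/sqrt(d) so that Q and K have roughly unit variance
Wq = rng.standard_normal((d, d)) / math.sqrt(d)
Wk = rng.standard_normal((d, d)) / math.sqrt(d)
Wv = rng.standard_normal((d, d)) / math.sqrt(d)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = stable_softmax(Q @ K.T / math.sqrt(d))  # (32, 32)
outputs = A @ V                             # (32, 256)

print("row sums close to 1:", np.allclose(A.sum(axis=1), 1.0))
print("A:", A.shape, "outputs:", outputs.shape)
```

Because each row of `A` is a probability distribution over the 32 samples, each row of `outputs` is a weighted average of the rows of `V`.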


## Self-attention with PyTorch
Now, we implement self-attention with ``PyTorch``, which is commonly used when building Transformers.

Feel free to change the dimensions of each variable and see how the output dimensions change accordingly.

In [None]:
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim_input, dim_q, dim_v):
        '''
        dim_input: the dimension of each sample
        dim_q: dimension of the query vectors; the keys use the same dimension (dim_k = dim_q)
        dim_v: dimension of the value vectors, which is also the dimension of the attention output
        '''
        super(SelfAttention, self).__init__()

        self.dim_input = dim_input
        self.dim_q = dim_q
        self.dim_k = dim_q
        self.dim_v = dim_v

        # Define the linear projection
        self.linear_q = nn.Linear(self.dim_input, self.dim_q, bias=False)
        self.linear_k = nn.Linear(self.dim_input, self.dim_k, bias=False)
        self.linear_v = nn.Linear(self.dim_input, self.dim_v, bias=False)
        self._norm_fact = 1 / math.sqrt(self.dim_k)

    def forward(self, x):
        batch, n, dim_input = x.shape # (batchsize, seq_len, dim_input)

        q = self.linear_q(x) # (batchsize, seq_len, dim_q)
        k = self.linear_k(x) # (batchsize, seq_len, dim_k)
        v = self.linear_v(x) # (batchsize, seq_len, dim_v)
        print(f'x.shape:{x.shape} \n Q.shape:{q.shape} \n K.shape:{k.shape} \n V.shape:{v.shape}')

        dist = torch.bmm(q, k.transpose(1,2)) * self._norm_fact
        dist = torch.softmax(dist, dim=-1)
        print('attention matrix: ', dist.shape)

        outputs = torch.bmm(dist, v)
        print('attention outputs: ', outputs.shape)

        return outputs


batch_size = 32 # number of samples in a batch
dim_input = 128 # dimension of each item in the sample sequence
seq_len = 20 # sequence length for each sample
x = torch.randn(batch_size, seq_len, dim_input)
self_attention = SelfAttention(dim_input, dim_q = 64, dim_v = 48)

attention = self_attention(x)

#print(attention)

x.shape:torch.Size([32, 20, 128]) 
 Q.shape:torch.Size([32, 20, 64]) 
 K.shape:torch.Size([32, 20, 64]) 
 V.shape:torch.Size([32, 20, 48])
attention matrix:  torch.Size([32, 20, 20])
attention outputs:  torch.Size([32, 20, 48])
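
The shapes printed above follow a simple rule that can be checked without running PyTorch. The sketch below uses a hypothetical bookkeeping helper, `attention_shapes`, which is not part of the assignment. It lists the shape of every tensor in the forward pass and counts the weights of the three bias-free projections.

```python
def attention_shapes(batch_size, seq_len, dim_input, dim_q, dim_v):
    """Return the tensor shapes and weight count of a bias-free single-head self-attention layer."""
    shapes = {
        "x": (batch_size, seq_len, dim_input),
        "Q": (batch_size, seq_len, dim_q),
        "K": (batch_size, seq_len, dim_q),   # dim_k equals dim_q
        "V": (batch_size, seq_len, dim_v),
        "attention": (batch_size, seq_len, seq_len),
        "outputs": (batch_size, seq_len, dim_v),
    }
    # Three bias-free linear maps: input -> Q, input -> K, input -> V
    n_weights = dim_input * dim_q + dim_input * dim_q + dim_input * dim_v
    return shapes, n_weights

shapes, n_weights = attention_shapes(32, 20, 128, dim_q=64, dim_v=48)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
print("number of weights:", n_weights)
```

The attention matrix depends only on the sequence length, not on `dim_q` or `dim_v`, while the number of weights is independent of both `batch_size` and `seq_len`.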


# Transformers
In this section, we implement a 6-layer Vision Transformer (ViT) and train it on the MNIST dataset for image classification.
First, we load the MNIST dataset as follows:

In [None]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import torchvision
from torchvision import datasets, utils
from torchvision.datasets import MNIST

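def get_mnist_loader(batch_size=100, shuffle=True):
    """
    Build the MNIST data loaders (the dataset is downloaded to ../data if needed).

    :param batch_size: number of samples per batch
    :param shuffle: whether to shuffle the training set every epoch
    :return: train_loader, test_loader
    """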
    train_dataset = MNIST(root='../data',
                          train=True,
                          transform=torchvision.transforms.ToTensor(),
                          download=True)
    test_dataset = MNIST(root='../data',
                         train=False,
                         transform=torchvision.transforms.ToTensor(),
                         download=True)

    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=shuffle)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                              batch_size=batch_size,
                                              shuffle=False)
    return train_loader, test_loader

In [None]:
# This package is needed to build the transformer
!pip install einops



## Build ViT from scratch
Recall that each Transformer block includes two modules: a self-attention module and a feedforward module.

In [None]:
import torch.nn.functional as F # needed for F.pad in the masked attention branch
from einops import rearrange

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(x, **kwargs) + x

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = dim ** -0.5

        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, mask = None):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x)
        q, k, v = rearrange(qkv, 'b n (qkv h d) -> qkv b h n d', qkv=3, h=h)

        dots = torch.einsum('bhid,bhjd->bhij', q, k) * self.scale

        if mask is not None:
            mask = F.pad(mask.flatten(1), (1, 0), value = True)
            assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
            mask = mask[:, None, :] * mask[:, :, None]
            dots.masked_fill_(~mask, float('-inf'))
            del mask

        attn = dots.softmax(dim=-1)

        out = torch.einsum('bhij,bhjd->bhid', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        out =  self.to_out(out)
        return out

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, mlp_dim):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, Attention(dim, heads = heads))),
                Residual(PreNorm(dim, FeedForward(dim, mlp_dim)))
            ]))

    def forward(self, x, mask=None):
        for attn, ff in self.layers:
            x = attn(x, mask=mask)
            x = ff(x)
        return x

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels=3):
        super().__init__()
        assert image_size % patch_size == 0, 'image dimensions must be divisible by the patch size'
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size ** 2

        self.patch_size = patch_size

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.patch_to_embedding = nn.Linear(patch_dim, dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.transformer = Transformer(dim, depth, heads, mlp_dim)

        self.to_cls_token = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(), # Gaussian Error Linear Units is another type of activation function
            nn.Linear(mlp_dim, num_classes)
        )

    def forward(self, img, mask=None):
        p = self.patch_size

        x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
        x = self.patch_to_embedding(x)

        cls_tokens = self.cls_token.expand(img.shape[0], -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding
        x = self.transformer(x, mask)

        x = self.to_cls_token(x[:, 0])
        return self.mlp_head(x)
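
To see what the `rearrange` pattern in `ViT.forward` does to an image, the sketch below reproduces it with plain NumPy reshapes. The helper `to_patches` is hypothetical and not part of the assignment. The sizes (28x28 images, patch size 7, one channel) are the MNIST settings used in the training cells.

```python
import numpy as np

def to_patches(img, p):
    """NumPy equivalent of rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)."""
    b, c, H, W = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by the patch size"
    h, w = H // p, W // p
    x = img.reshape(b, c, h, p, w, p)      # split height and width into (grid, pixel) axes
    x = x.transpose(0, 2, 4, 3, 5, 1)      # -> (b, h, w, p1, p2, c)
    return x.reshape(b, h * w, p * p * c)  # flatten the grid and each patch

img = np.arange(2 * 1 * 28 * 28, dtype=np.float32).reshape(2, 1, 28, 28)
patches = to_patches(img, p=7)

print(patches.shape)         # (2, 16, 49)
print(patches.shape[1] + 1)  # sequence length after prepending the class token: 17
```

Each 28x28 digit therefore becomes a sequence of 16 patch vectors of length 49, and the Transformer processes 17 tokens per image once the class token is added.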

## Training and test function


In [None]:
import torch.nn.functional as F

def train_epoch(model, optimizer, data_loader, loss_history):
    total_samples = len(data_loader.dataset)
    model.train()

    for i, (data, target) in enumerate(data_loader):
        data = data.cuda()
        target = target.cuda()
        optimizer.zero_grad()
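        # log_softmax followed by nll_loss is the cross-entropy loss
        output = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        loss_history.append(loss.item()) # record the training loss of this batch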

In [None]:
def evaluate(model, data_loader, loss_history):
    model.eval()

    total_samples = len(data_loader.dataset)
    correct_samples = 0
    total_loss = 0

    # We do not need to remember the gradients when testing
    # This will help reduce memory
    with torch.no_grad():
        for data, target in data_loader:
            data = data.cuda()
            target = target.cuda()
            output = F.log_softmax(model(data), dim=1)
            loss = F.nll_loss(output, target, reduction='sum')
            _, pred = torch.max(output, dim=1)

            total_loss += loss.item()
            correct_samples += pred.eq(target).sum().item() # keep a plain Python int

    avg_loss = total_loss / total_samples
    loss_history.append(avg_loss)
    print('Average test loss: ' + '{:.4f}'.format(avg_loss) +
          '  Accuracy:' + '{:5}'.format(correct_samples) + '/' +
          '{:5}'.format(total_samples) + ' (' +
          '{:4.2f}'.format(100.0 * correct_samples / total_samples) + '%)')
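
`evaluate` sums the per-sample losses (`reduction='sum'`) and divides by the dataset size once at the end. The sketch below illustrates why this matters when the last batch is smaller than the others. It uses hypothetical toy logits and only the standard library; `nll_from_logits` is an illustrative helper, not a PyTorch function.

```python
import math

def nll_from_logits(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), i.e. log_softmax followed by nll_loss."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[target]

# Hypothetical toy data: 5 samples with 3 classes each
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-1.0, 3.0, 0.0], [0.0, 0.0, 0.0], [1.5, -0.5, 2.5]]
targets = [0, 2, 1, 1, 0]
losses = [nll_from_logits(l, t) for l, t in zip(logits, targets)]

# Split into batches of size 2: the last batch holds a single sample
batches = [losses[i:i + 2] for i in range(0, len(losses), 2)]

sum_then_divide = sum(sum(b) for b in batches) / len(losses)  # what evaluate() does
mean_of_batch_means = sum(sum(b) / len(b) for b in batches) / len(batches)

print(f"sum over batches / N : {sum_then_divide:.4f}")
print(f"mean of batch means  : {mean_of_batch_means:.4f}")
print(f"true mean            : {sum(losses) / len(losses):.4f}")
```

Summing and dividing once gives the exact per-sample average, whereas averaging the batch means over-weights the samples in the smaller final batch.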

## Let's start training!
Here, you can change the ViT structure by changing the hyperparameters passed to ``ViT``.
The default settings are 6 layers, 8 heads for the multi-head attention mechanism, and an embedding dimension of 64.
You can also increase the number of epochs to obtain better results.
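
When comparing architectures, it helps to know how many trainable parameters each setting has. The sketch below counts them by hand from the layer definitions of the `ViT` class in this notebook. The helper `vit_param_count` is hypothetical, and its result should be cross-checked against `sum(p.numel() for p in model.parameters())`.

```python
def vit_param_count(image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels=3):
    """Count trainable parameters of the ViT class defined in this notebook (by hand, without PyTorch)."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = channels * patch_size ** 2

    pos_embedding = (num_patches + 1) * dim
    cls_token = dim
    patch_to_embedding = patch_dim * dim + dim         # weight + bias

    layer_norm = 2 * dim                               # scale + shift
    attention = dim * 3 * dim + (dim * dim + dim)      # to_qkv (no bias) + to_out
    feed_forward = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    per_layer = (layer_norm + attention) + (layer_norm + feed_forward)

    mlp_head = (dim * mlp_dim + mlp_dim) + (mlp_dim * num_classes + num_classes)

    # `heads` only changes how the dim features are split, not how many weights there are
    return pos_embedding + cls_token + patch_to_embedding + depth * per_layer + mlp_head

default = dict(image_size=28, patch_size=7, num_classes=10, channels=1,
               dim=64, depth=6, heads=8, mlp_dim=128)
print("default:", vit_param_count(**default))
for depth in [3, 6, 12]:
    print("depth", depth, "->", vit_param_count(**{**default, "depth": depth}))
for heads in [4, 8, 16]:
    print("heads", heads, "->", vit_param_count(**{**default, "heads": heads}))
```

Under this count, changing `depth`, `dim` or `mlp_dim` changes the model size, while changing `heads` leaves it unchanged because `to_qkv` always maps `dim` to `3 * dim` features.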

In [None]:
import time

dim = 64
depth = 6
heads = 8
mlp_dim = 128
for dim in [32, 64, 128]:
    # You can change the architecture here
    model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
                dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim)
    model = model.cuda()
    # Uncomment the next line to print the network architecture
    #print(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train_loss_history, test_loss_history = [], []
    N_EPOCHS = 50

    train_loader, test_loader = get_mnist_loader(batch_size=128, shuffle=True)

    # Gradually reduce the learning rate while training
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.912)

    start_time = time.time()
    for epoch in range(1, N_EPOCHS + 1):
        #print('Epoch:', epoch,'LR:', scheduler.get_last_lr())
        train_epoch(model, optimizer, train_loader, train_loss_history)
        evaluate(model, test_loader, test_loss_history)
        scheduler.step()

    print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

Average test loss: 0.2635  Accuracy: 9202/10000 (92.02%)
Average test loss: 0.1715  Accuracy: 9457/10000 (94.57%)
Average test loss: 0.1334  Accuracy: 9582/10000 (95.82%)
Average test loss: 0.1111  Accuracy: 9662/10000 (96.62%)
Average test loss: 0.0932  Accuracy: 9720/10000 (97.20%)
Average test loss: 0.0871  Accuracy: 9725/10000 (97.25%)
Average test loss: 0.0908  Accuracy: 9723/10000 (97.23%)
Average test loss: 0.0855  Accuracy: 9734/10000 (97.34%)
Average test loss: 0.0785  Accuracy: 9752/10000 (97.52%)
Average test loss: 0.0749  Accuracy: 9776/10000 (97.76%)
Average test loss: 0.0681  Accuracy: 9787/10000 (97.87%)
Average test loss: 0.0724  Accuracy: 9776/10000 (97.76%)
Average test loss: 0.0730  Accuracy: 9798/10000 (97.98%)
Average test loss: 0.0709  Accuracy: 9796/10000 (97.96%)
Average test loss: 0.0673  Accuracy: 9814/10000 (98.14%)
Average test loss: 0.0678  Accuracy: 9816/10000 (98.16%)
Average test loss: 0.0667  Accuracy: 9820/10000 (98.20%)
Average test loss: 0.0679  Accu
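
The learning-rate schedule used in the training cell can be reproduced without PyTorch. `ExponentialLR` multiplies the learning rate by `gamma` each time `scheduler.step()` is called, so with `gamma=0.912` the rate shrinks by roughly a factor of 100 over 50 epochs. The helper `exponential_lr` below is hypothetical and only mirrors that rule.

```python
def exponential_lr(initial_lr, gamma, n_epochs):
    """Learning rate after 0, 1, ..., n_epochs scheduler steps when each step multiplies it by `gamma`."""
    return [initial_lr * gamma ** epoch for epoch in range(n_epochs + 1)]

lrs = exponential_lr(initial_lr=0.001, gamma=0.912, n_epochs=50)
for epoch in [0, 10, 25, 50]:
    print(f"after {epoch:2d} scheduler steps: lr = {lrs[epoch]:.2e}")
print(f"overall decay factor: {lrs[50] / lrs[0]:.4f}")
```

A smaller `gamma` makes the later epochs contribute less to training, which is worth keeping in mind when increasing `N_EPOCHS`.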

In [None]:
import time

dim = 64
depth = 6
heads = 8
mlp_dim = 128
for depth in [3, 6, 12]:
    # You can change the architecture here
    model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
                dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim)
    model = model.cuda()
    # Uncomment the next line to print the network architecture
    #print(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train_loss_history, test_loss_history = [], []
    N_EPOCHS = 50

    train_loader, test_loader = get_mnist_loader(batch_size=128, shuffle=True)

    # Gradually reduce the learning rate while training
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.912)

    start_time = time.time()
    for epoch in range(1, N_EPOCHS + 1):
        #print('Epoch:', epoch,'LR:', scheduler.get_last_lr())
        train_epoch(model, optimizer, train_loader, train_loss_history)
        evaluate(model, test_loader, test_loss_history)
        scheduler.step()

    print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

Average test loss: 0.1881  Accuracy: 9403/10000 (94.03%)
Average test loss: 0.1395  Accuracy: 9559/10000 (95.59%)
Average test loss: 0.0989  Accuracy: 9679/10000 (96.79%)
Average test loss: 0.1016  Accuracy: 9682/10000 (96.82%)
Average test loss: 0.0867  Accuracy: 9725/10000 (97.25%)
Average test loss: 0.0736  Accuracy: 9767/10000 (97.67%)
Average test loss: 0.0716  Accuracy: 9766/10000 (97.66%)
Average test loss: 0.0701  Accuracy: 9786/10000 (97.86%)
Average test loss: 0.0689  Accuracy: 9790/10000 (97.90%)
Average test loss: 0.0716  Accuracy: 9798/10000 (97.98%)
Average test loss: 0.0716  Accuracy: 9800/10000 (98.00%)
Average test loss: 0.0734  Accuracy: 9797/10000 (97.97%)
Average test loss: 0.0694  Accuracy: 9801/10000 (98.01%)
Average test loss: 0.0766  Accuracy: 9785/10000 (97.85%)
Average test loss: 0.0685  Accuracy: 9836/10000 (98.36%)
Average test loss: 0.0745  Accuracy: 9825/10000 (98.25%)
Average test loss: 0.0753  Accuracy: 9813/10000 (98.13%)
Average test loss: 0.0789  Accu

In [None]:
import time

dim = 64
depth = 6
heads = 8
mlp_dim = 128
for heads in [4, 8, 16]:
    # You can change the architecture here
    model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
                dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim)
    model = model.cuda()
    # Uncomment the next line to print the network architecture
    #print(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train_loss_history, test_loss_history = [], []
    N_EPOCHS = 50

    train_loader, test_loader = get_mnist_loader(batch_size=128, shuffle=True)

    # Gradually reduce the learning rate while training
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.912)

    start_time = time.time()
    for epoch in range(1, N_EPOCHS + 1):
        #print('Epoch:', epoch,'LR:', scheduler.get_last_lr())
        train_epoch(model, optimizer, train_loader, train_loss_history)
        evaluate(model, test_loader, test_loss_history)
        scheduler.step()

    print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

Average test loss: 0.1797  Accuracy: 9415/10000 (94.15%)
Average test loss: 0.1194  Accuracy: 9624/10000 (96.24%)
Average test loss: 0.0997  Accuracy: 9684/10000 (96.84%)
Average test loss: 0.0936  Accuracy: 9711/10000 (97.11%)
Average test loss: 0.0692  Accuracy: 9793/10000 (97.93%)
Average test loss: 0.0699  Accuracy: 9788/10000 (97.88%)
Average test loss: 0.0671  Accuracy: 9795/10000 (97.95%)
Average test loss: 0.0669  Accuracy: 9791/10000 (97.91%)
Average test loss: 0.0712  Accuracy: 9780/10000 (97.80%)
Average test loss: 0.0661  Accuracy: 9815/10000 (98.15%)
Average test loss: 0.0820  Accuracy: 9780/10000 (97.80%)
Average test loss: 0.0676  Accuracy: 9825/10000 (98.25%)
Average test loss: 0.0663  Accuracy: 9830/10000 (98.30%)
Average test loss: 0.0746  Accuracy: 9803/10000 (98.03%)
Average test loss: 0.0700  Accuracy: 9822/10000 (98.22%)
Average test loss: 0.0731  Accuracy: 9817/10000 (98.17%)
Average test loss: 0.0647  Accuracy: 9846/10000 (98.46%)
Average test loss: 0.0715  Accu

In [None]:
import time

dim = 64
depth = 6
heads = 8
mlp_dim = 128
for mlp_dim in [64, 128, 256]:
    # You can change the architecture here
    model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
                dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim)
    model = model.cuda()
    # Uncomment the next line to print the network architecture
    #print(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    train_loss_history, test_loss_history = [], []
    N_EPOCHS = 50

    train_loader, test_loader = get_mnist_loader(batch_size=128, shuffle=True)

    # Gradually reduce the learning rate while training
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.912)

    start_time = time.time()
    for epoch in range(1, N_EPOCHS + 1):
        #print('Epoch:', epoch,'LR:', scheduler.get_last_lr())
        train_epoch(model, optimizer, train_loader, train_loss_history)
        evaluate(model, test_loader, test_loss_history)
        scheduler.step()

    print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

Average test loss: 0.2114  Accuracy: 9316/10000 (93.16%)
Average test loss: 0.1254  Accuracy: 9606/10000 (96.06%)
Average test loss: 0.1224  Accuracy: 9607/10000 (96.07%)
Average test loss: 0.0901  Accuracy: 9725/10000 (97.25%)
Average test loss: 0.0947  Accuracy: 9708/10000 (97.08%)
Average test loss: 0.0816  Accuracy: 9733/10000 (97.33%)
Average test loss: 0.0759  Accuracy: 9779/10000 (97.79%)
Average test loss: 0.0673  Accuracy: 9807/10000 (98.07%)
Average test loss: 0.0644  Accuracy: 9806/10000 (98.06%)
Average test loss: 0.0685  Accuracy: 9817/10000 (98.17%)
Average test loss: 0.0675  Accuracy: 9816/10000 (98.16%)
Average test loss: 0.0689  Accuracy: 9799/10000 (97.99%)
Average test loss: 0.0703  Accuracy: 9827/10000 (98.27%)
Average test loss: 0.0744  Accuracy: 9820/10000 (98.20%)
Average test loss: 0.0713  Accuracy: 9816/10000 (98.16%)
Average test loss: 0.0791  Accuracy: 9821/10000 (98.21%)
Average test loss: 0.0751  Accuracy: 9816/10000 (98.16%)
Average test loss: 0.0799  Accu