# Lab-1 (1): Self-Attention and Transformer

**Background:** A basic transformer block $\mathcal{F}$ consists of a Multi-head Self-Attention (MHSA) module, Layer Normalization (LN), a Feed-forward Network (FFN), and Residual Connections (RC). It can be formulated as:
\begin{equation}
\begin{aligned}
\mathbf{z}_{\ell}^{\prime} &= \text{MHSA}\left(\text{LN}\left(\mathbf{z}_{\ell-1}\right)\right)+\mathbf{z}_{\ell-1}, \\
\mathbf{z}_{\ell} &= \text{FFN}\left(\text{LN}\left(\mathbf{z}_{\ell}^{\prime}\right)\right)+\mathbf{z}_{\ell}^{\prime}, \\
\text{i.e., } \mathbf{z}_{\ell} &= \mathcal{F}_{\ell-1}(\mathbf{z}_{\ell-1})
\end{aligned}
\end{equation}
where $\mathbf{z}_{\ell}^{\prime}$ and $\mathbf{z}_{\ell-1}$ are the intermediate representations, $\mathcal{F}_{\ell}$ denotes the transformer block at the $\ell$-th layer, $\ell \in \{0, 1,\dots, L\}$ is the layer index, and $L$ is the number of hidden layers. The self-attention module is realized by inner products with a scaling factor, followed by a *softmax* operation:
\begin{equation}
\operatorname{Attention}(Q, K, V)=\operatorname{Softmax}\left(Q K^{\top}/{\sqrt{d_{k}}}\right) V
\end{equation}
where $Q$, $K$, $V$ are the *query*, *key* and *value* matrices (one row per token), respectively, and $1/\sqrt{d_k}$ is the scaling factor for normalization, with $d_k$ the key dimension. Multi-head self-attention runs $h$ attention heads in parallel and concatenates their outputs to increase the representation ability:
\begin{equation}
\text{MHSA}(Q, K, V)=\operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}\right) W^{O},
\end{equation}
where $W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text{model}}}$. Each $\operatorname{head}_{i}=\operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right)$ applies attention to projections of the inputs with parameter matrices $W_{i}^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{K} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{v}}$.
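
To make the attention formula concrete, here is a minimal single-head sketch in PyTorch. The function name `scaled_dot_product_attention` and the tensor sizes are illustrative assumptions, not part of the lab code; the lab's own `Attention` module in Task 2 fuses the projections and handles several heads at once.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V for inputs shaped (..., N, d_k) and (..., N, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., N, N) similarity scores
    weights = scores.softmax(dim=-1)                    # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(4, 8)   # N = 4 tokens, d_k = 8 (illustrative sizes)
k = torch.randn(4, 8)
v = torch.randn(4, 16)  # d_v = 16
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([4, 16])
print(weights.shape)  # torch.Size([4, 4])
```

Each output row is a weighted average of the rows of `v`, with weights given by the softmax over that token's scaled similarities to every key.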

The FFN contains two linear layers with a GELU non-linearity in between:
\begin{equation}
\mathrm{FFN}(\mathbf{z})= \text{GELU}\left(\mathbf{z} W_{1}+b_{1}\right) W_{2}+b_{2}
\end{equation}
where $\mathbf{z}$ is the input. $W_{1},b_{1},W_{2},b_{2}$ are the two linear layers' weights and biases.
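
As a quick sanity check of the FFN formula, the sketch below compares the module form (`nn.Linear`, `nn.GELU`) against the equation written out with explicit matrices. The sizes `d_model = 32` and `d_hidden = 128` are assumptions chosen for illustration. One detail worth noting is that `nn.Linear` stores its weight as `(out_features, in_features)`, so $W_1$ in the equation corresponds to `fc1.weight.T`.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, d_hidden = 32, 128  # illustrative sizes, not fixed by the text
fc1 = nn.Linear(d_model, d_hidden)
fc2 = nn.Linear(d_hidden, d_model)
gelu = nn.GELU()

z = torch.randn(2, 4, d_model)  # (batch, tokens, features)

# Module form: Linear -> GELU -> Linear.
out_modules = fc2(gelu(fc1(z)))

# Formula form: GELU(z W1 + b1) W2 + b2 with explicit matrices.
W1, b1 = fc1.weight.T, fc1.bias
W2, b2 = fc2.weight.T, fc2.bias
out_formula = gelu(z @ W1 + b1) @ W2 + b2

print(out_modules.shape)  # torch.Size([2, 4, 32])
print(torch.allclose(out_modules, out_formula, atol=1e-5))
```

The FFN acts on the last dimension only, so every token is transformed independently with the same weights.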


**Task 1:** Complete the code of the MLP module in the Transformer block.

In [1]:
import torch
from torch import nn

In [3]:
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # YOUR CODE HERE
        # Hint: two fully connected layers with activation and dropout
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

**Task 2:** Complete the code of the multi-head self-attention module in the Transformer block.

In [64]:
class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape # Batch, token length/patches, Dimension/features
        # YOUR CODE HERE
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        print("\nInitial q,k,v shapes", q.shape, k.shape, v.shape)
        q = q.reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        k = k.reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        v = v.reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        print("\nReshaped q,k,v shapes", q.shape, k.shape, v.shape)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        print("\nInitial attn shape", attn.shape)
        attn = attn.softmax(dim=-1)
        print("\nSoftmaxed attn shape", attn.shape)
        attn = self.attn_drop(attn)
        print("\nDropped attn shape", attn.shape)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        print("\nAfter multiplying attn and v", x.shape)
        x = self.proj(x)
        print("\nAfter projection", x.shape)
        x = self.proj_drop(x)
        print("\nAfter dropping", x.shape)
        return x
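
The `reshape(...).permute(0, 2, 1, 3)` step above is what turns one wide projection into separate heads, and `transpose(1, 2).reshape(B, N, C)` undoes it. The following sketch, using the same sizes as the test cell (`B=2, N=4, C=32, num_heads=8`), is an illustrative check that each head receives a contiguous slice of channels and that the merge is an exact inverse of the split.

```python
import torch

B, N, C, num_heads = 2, 4, 32, 8
head_dim = C // num_heads  # 4 channels per head

# Distinct values so that any mix-up of positions would be visible.
x = torch.arange(B * N * C, dtype=torch.float32).reshape(B, N, C)

# Split: (B, N, C) -> (B, N, heads, head_dim) -> (B, heads, N, head_dim)
heads = x.reshape(B, N, num_heads, head_dim).permute(0, 2, 1, 3)

# Head h holds channels [h * head_dim, (h + 1) * head_dim) of every token.
assert torch.equal(heads[:, 3], x[:, :, 3 * head_dim:4 * head_dim])

# Merge: (B, heads, N, head_dim) -> (B, N, heads, head_dim) -> (B, N, C)
merged = heads.transpose(1, 2).reshape(B, N, C)

print(heads.shape)             # torch.Size([2, 8, 4, 4])
print(torch.equal(merged, x))  # True
```

Because the heads sit in a batch-like dimension, `q @ k.transpose(-2, -1)` computes all `num_heads` attention maps in a single batched matrix product.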

**Task 3:** Complete the code for building a Transformer block using the multi-head self-attention and MLP modules.

In [56]:
class Transformer_Block(nn.Module):

    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
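        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        # DropPath is not defined in this notebook; it is only looked up when drop_path > 0,
        # so it must be defined or imported before that option is used.
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()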
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        # YOUR CODE HERE
        x = x + self.drop_path(self.attn(self.norm1(x)))
        print("\n---After attention", x.shape)
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        print("\n---After mlp", x.shape)
        return x
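
The block refers to `DropPath`, which the notebook never defines; with the default `drop_path=0.` the name is never looked up, so the test cell still runs. For readers who want to try `drop_path > 0`, here is a minimal sketch of stochastic depth. It is an assumed implementation written for illustration and is not necessarily the one the lab intends: during training it zeroes the entire residual branch for a random subset of samples and rescales the rest, and at inference it is the identity.

```python
import torch
from torch import nn

class DropPath(nn.Module):
    """Stochastic depth sketch: randomly drop the whole residual branch per sample."""

    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x  # identity when disabled or at inference time
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all remaining dimensions.
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = x.new_empty(mask_shape).bernoulli_(keep_prob)
        # Rescale kept samples so the expected output equals the input.
        return x * mask / keep_prob

drop_path = DropPath(0.5)
x = torch.ones(6, 4, 32)

drop_path.eval()
print(torch.equal(drop_path(x), x))  # True

drop_path.train()
y = drop_path(x)
# Each sample is either all zeros (dropped) or all 2.0 (kept and rescaled).
print(y[:, 0, 0])
```

Unlike ordinary dropout, which masks individual activations, this masks a sample's entire branch output, so in `x + self.drop_path(...)` a dropped sample simply passes through the residual connection unchanged.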

**Test the code:**

In [65]:
# TEST CODE
model = Transformer_Block(dim=32, num_heads=8)

before_self_attention = torch.randn(2, 4, 32) # Batch, N_len, C_dim
print("======>Input Before Self-Attention: \n", before_self_attention.shape)

attn_forward = model(before_self_attention)
print("\n======>Output After Self-Attention: ")
print(attn_forward.shape)

======>Input Before Self-Attention: 
 torch.Size([2, 4, 32])

Initial q,k,v shapes torch.Size([2, 4, 32]) torch.Size([2, 4, 32]) torch.Size([2, 4, 32])

Reshaped q,k,v shapes torch.Size([2, 8, 4, 4]) torch.Size([2, 8, 4, 4]) torch.Size([2, 8, 4, 4])

Initial attn shape torch.Size([2, 8, 4, 4])

Softmaxed attn shape torch.Size([2, 8, 4, 4])

Dropped attn shape torch.Size([2, 8, 4, 4])

After multiplying attn and v torch.Size([2, 4, 32])

After projection torch.Size([2, 4, 32])

After dropping torch.Size([2, 4, 32])

---After attention torch.Size([2, 4, 32])

---After mlp torch.Size([2, 4, 32])

======>Output After Self-Attention: 
torch.Size([2, 4, 32])
