# Practical 6: Transformer Architecture Implementation (PyTorch)

**Course:** DAM202 [Year3-Sem1]

**Focus:** Implementation of the original "Attention Is All You Need" (Vaswani et al., 2017) Transformer architecture from scratch using PyTorch.


## Introduction and Setup

This notebook implements the Transformer architecture from scratch. We will build all the core components, including Multi-Head Attention, Positional Encoding, and the full Encoder-Decoder stack.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np


## Hyperparameters

These are the standard hyperparameters for the base Transformer model, as defined in the paper.


In [2]:
# Base Model Hyperparameters
d_model = 512  # The dimension of the embeddings and sub-layer outputs
N = 6          # The number of layers in the encoder and decoder
h = 8          # The number of attention heads
d_k = 64       # The dimension of the key and query vectors (d_model / h)
d_v = 64       # The dimension of the value vectors (d_model / h)
d_ff = 2048    # The inner dimension of the position-wise feed-forward network
dropout = 0.1  # The dropout rate
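
As a quick consistency check on these values, the sketch below recomputes the per-head dimension and counts, by hand, the parameters of one multi-head attention block and one feed-forward block. It assumes every linear layer carries a bias term (the default for `nn.Linear`); the names `mha_params` and `ffn_params` are illustrative only.

```python
# Base-model hyperparameters (same values as the cell above)
d_model, h, d_ff = 512, 8, 2048

# The model dimension must split evenly across the heads
assert d_model % h == 0
d_k = d_model // h

# Multi-head attention: four d_model x d_model linear layers
# (W_q, W_k, W_v, W_o), each with a bias vector of size d_model
mha_params = 4 * (d_model * d_model + d_model)

# Feed-forward network: d_model -> d_ff -> d_model, each layer with a bias
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

print(d_k)         # 64
print(mha_params)  # 1050624
print(ffn_params)  # 2099712
```

The feed-forward block holds roughly twice as many parameters as the attention block in each layer.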


## Part 1: Core Components

We start by implementing the fundamental building blocks of the Transformer.


### Scaled Dot-Product Attention

This is the core attention mechanism. It computes the dot products of the query with all keys, divides each by the square root of the key dimension, and applies a softmax function to obtain the weights on the values.


In [3]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Calculate the scaled dot-product attention.

    Args:
        q (torch.Tensor): Query tensor; shape (batch_size, n_heads, seq_len_q, d_k)
        k (torch.Tensor): Key tensor; shape (batch_size, n_heads, seq_len_k, d_k)
        v (torch.Tensor): Value tensor; shape (batch_size, n_heads, seq_len_v, d_v)
        mask (torch.Tensor, optional): Mask tensor. Defaults to None.

    Returns:
        torch.Tensor: The output of the attention mechanism.
        torch.Tensor: The attention weights.
    """
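    # MatMul Q and K^T, scaled by sqrt(d_k).
    # d_k is read from the query tensor so the function does not depend on a global.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)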

    # Apply mask (if provided)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # MatMul with V
    output = torch.matmul(attn_weights, v)
    return output, attn_weights
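
To make the shapes and the effect of the mask concrete, here is a small self-contained sketch of the same computation on random tensors. The function name `attention`, the tensor sizes, and the padding pattern are assumptions chosen for illustration. Note that the key and value tensors must share the same sequence length.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Same computation as the cell above, with d_k read from the query tensor
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(1, 2, 3, 4)   # (batch, heads, seq_len_q, d_k)
k = torch.randn(1, 2, 5, 4)   # (batch, heads, seq_len_k, d_k)
v = torch.randn(1, 2, 5, 6)   # (batch, heads, seq_len_k, d_v)

# Hypothetical padding mask: the last two key positions are padding
mask = torch.tensor([1, 1, 1, 0, 0]).view(1, 1, 1, 5)

out, w = attention(q, k, v, mask)
print(out.shape)          # torch.Size([1, 2, 3, 6])
print(w.sum(dim=-1))      # every row of weights sums to 1
print(w[..., 3:].max())   # masked keys receive numerically zero weight
```

The output has one row per query position, and its last dimension is `d_v`, not `d_k`.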


### Multi-Head Attention

Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. It runs the scaled dot-product attention mechanism in parallel multiple times.


In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_len, _ = x.size()
        return x.view(batch_size, seq_len, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        # Linear projections
        q = self.W_q(q)
        k = self.W_k(k)
        v = self.W_v(v)

        # Split into h heads
        q = self.split_heads(q)
        k = self.split_heads(k)
        v = self.split_heads(v)

        # Scaled dot-product attention
        attn_output, attn_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        batch_size, _, seq_len, _ = attn_output.size()
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = self.W_o(attn_output)
        return output
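
The reshaping in `split_heads` and the "concatenate heads" step are inverses of each other. The sketch below checks this on a random tensor; the sizes are the base-model values, and the variable names are illustrative.

```python
import torch

batch_size, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h

x = torch.randn(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
heads = x.view(batch_size, seq_len, h, d_k).transpose(1, 2)

# Merge: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape)                                    # torch.Size([2, 8, 10, 64])
print(torch.equal(merged, x))                         # True: the round trip is lossless
print(torch.equal(heads[:, 1], x[..., d_k:2 * d_k]))  # True: head 1 is the second d_k-wide slice
```

Each head therefore sees a contiguous `d_k`-wide slice of the projected features, and `.contiguous()` is needed before `.view()` because `transpose` returns a non-contiguous tensor.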


### Position-wise Feed-Forward Network

This is a fully connected feed-forward network applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.


In [5]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))
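
"Applied to each position separately and identically" can be checked directly: running the network on a whole sequence gives the same result at a given position as running it on that position alone. The sketch below uses an equivalent `nn.Sequential` stand-in with small, illustrative sizes rather than the class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 16, 64   # small illustrative sizes, not the base-model values

# Stand-in with the same structure: Linear -> ReLU -> Linear
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)   # (batch, seq_len, d_model)
full = ffn(x)                    # whole sequence at once
single = ffn(x[:, 3, :])         # position 3 on its own

# The two results agree up to floating-point rounding
same = torch.allclose(full[:, 3, :], single, atol=1e-5)
print(same)
```

No information moves between positions in this sub-layer; mixing across positions happens only in the attention sub-layers.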


### Positional Encoding

Since the model contains no recurrence or convolution, we inject information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension as the embeddings, so that the two can be summed.


In [6]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
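
The `div_term` expression is a numerically convenient way of writing $1 / 10000^{2i/d_{\text{model}}}$. The sketch below rebuilds a small table the same way and compares one entry against the closed form from the paper. The sizes and the chosen entry are illustrative.

```python
import math
import torch

d_model, max_len = 8, 50   # small illustrative sizes
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions

# Closed form for one entry: position 7, dimension pair i = 2
pos, i = 7, 2
angle = pos / (10000 ** (2 * i / d_model))     # 7 / 100 = 0.07
print(abs(pe[pos, 2 * i].item() - math.sin(angle)) < 1e-4)      # True
print(abs(pe[pos, 2 * i + 1].item() - math.cos(angle)) < 1e-4)  # True

# Position 0 encodes as sin(0) = 0 in even dimensions and cos(0) = 1 in odd ones
print(pe[0])
```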


## Part 2: Building the Layers

Now we combine the core components to build the Encoder and Decoder layers.


### Encoder Layer

Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization.


In [7]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Self-attention sub-layer
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward sub-layer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x


### Decoder Layer

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.


In [8]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.cross_attn = MultiHeadAttention(d_model, h)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        # Masked self-attention sub-layer
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Cross-attention sub-layer
        cross_attn_output = self.cross_attn(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))

        # Feed-forward sub-layer
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x
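
In the cross-attention sub-layer, the queries come from the decoder while the keys and values come from the encoder output, so the attention weights have shape `(tgt_len, src_len)`. This is why the source padding mask, not the target mask, is passed to `cross_attn`. The sketch below shows the shapes with a simplified single-head version that omits the learned projections; the sizes are illustrative.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16                               # illustrative size
decoder_x = torch.randn(2, 12, d_model)    # (batch, tgt_len, d_model)
encoder_out = torch.randn(2, 10, d_model)  # (batch, src_len, d_model)

# Queries from the decoder; keys and values from the encoder output
scores = decoder_x @ encoder_out.transpose(-2, -1) / math.sqrt(d_model)
weights = F.softmax(scores, dim=-1)        # (batch, tgt_len, src_len)
context = weights @ encoder_out            # (batch, tgt_len, d_model)

print(weights.shape)   # torch.Size([2, 12, 10])
print(context.shape)   # torch.Size([2, 12, 16])
```

The output keeps the target length, so it can be added back to the decoder stream through the residual connection.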


## Part 3: Assembling the Full Transformer

We now assemble the full Transformer model by stacking the encoder and decoder layers.


In [9]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, h, d_ff, dropout):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, h, d_ff, dropout) for _ in range(N)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
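        # Scale the embeddings by sqrt(d_model), as described in Section 3.4 of the paper
        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)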
        x = self.pos_encoding(x)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, h, d_ff, dropout):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([DecoderLayer(d_model, h, d_ff, dropout) for _ in range(N)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask, tgt_mask):
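        # Scale the embeddings by sqrt(d_model), as described in Section 3.4 of the paper
        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)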
        x = self.pos_encoding(x)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return x


### The Transformer Model

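This is the final model, which combines the Encoder, the Decoder, and a final linear layer that projects the decoder output to vocabulary logits. A softmax over these logits gives the output probabilities; the model itself returns the raw logits.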


In [10]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, N, h, d_ff, dropout):
        super(Transformer, self).__init__()
        self.encoder = Encoder(src_vocab_size, d_model, N, h, d_ff, dropout)
        self.decoder = Decoder(tgt_vocab_size, d_model, N, h, d_ff, dropout)
        self.final_linear = nn.Linear(d_model, tgt_vocab_size)

    def create_padding_mask(self, seq, pad_token=0):
        # (batch_size, 1, 1, seq_len)
        return (seq != pad_token).unsqueeze(1).unsqueeze(2)

    def create_look_ahead_mask(self, size):
        # (size, size) lower-triangular boolean mask: True where attention is allowed
        return torch.tril(torch.ones(size, size)).bool()

    def forward(self, src, tgt):
        src_pad_mask = self.create_padding_mask(src)
        tgt_pad_mask = self.create_padding_mask(tgt)

        # Move the mask to the same device as the target so the model also runs on GPU
        look_ahead_mask = self.create_look_ahead_mask(tgt.size(1)).to(tgt.device)

        # Combine padding mask and look-ahead mask for the target
        tgt_mask = tgt_pad_mask & look_ahead_mask

        encoder_output = self.encoder(src, src_pad_mask)
        decoder_output = self.decoder(tgt, encoder_output, src_pad_mask, tgt_mask)

        output = self.final_linear(decoder_output)
        return output
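
The target mask relies on broadcasting: a `(batch, 1, 1, tgt_len)` padding mask combined with a `(tgt_len, tgt_len)` look-ahead mask yields a `(batch, 1, tgt_len, tgt_len)` mask. The sketch below builds it for a single hypothetical target sequence whose last token is padding.

```python
import torch

pad_token = 0
tgt = torch.tensor([[5, 7, 9, 0]])   # hypothetical batch of 1; last position is padding

pad_mask = (tgt != pad_token).unsqueeze(1).unsqueeze(2)   # (1, 1, 1, 4)
look_ahead = torch.tril(torch.ones(4, 4)).bool()          # (4, 4)
tgt_mask = pad_mask & look_ahead                          # (1, 1, 4, 4)

print(tgt_mask.shape)        # torch.Size([1, 1, 4, 4])
print(tgt_mask[0, 0].int())  # rows are query positions, columns are key positions
```

Row `i` of the result allows attention to keys `0..i`, and the final column is zero in every row because that key is padding.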


## Part 4: Basic Functionality Test

We'll now instantiate the model and perform a forward pass with dummy data to ensure all components are connected correctly and the tensor dimensions are valid.


In [11]:
# Vocabulary sizes for source and target languages
src_vocab_size = 5000
tgt_vocab_size = 5000

# Instantiate the Transformer model
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, N, h, d_ff, dropout)

# Create dummy input tensors
# (batch_size, seq_len)
src_seq = torch.randint(1, src_vocab_size, (2, 10))  # Batch of 2, sequence length 10
tgt_seq = torch.randint(1, tgt_vocab_size, (2, 12))  # Batch of 2, sequence length 12

# Perform a forward pass
output = model(src_seq, tgt_seq)

# Print the output shape to verify
print("Output shape:", output.shape)
# Expected output shape: (batch_size, tgt_seq_len, tgt_vocab_size) -> (2, 12, 5000)


Output shape: torch.Size([2, 12, 5000])


In [12]:
# Let's create a hypothetical scenario for inference
# Assume we have a trained model (ours is randomly initialized, but the process is the same)

# Define special tokens (hypothetical vocabulary indices)
SRC_PAD_TOKEN = 0
TGT_PAD_TOKEN = 0
TGT_SOS_TOKEN = 1 # Start of Sequence
TGT_EOS_TOKEN = 2 # End of Sequence

# Create a dummy source sentence (e.g., batch of 1, sequence of 5)
# In a real scenario, this would be tokenized text.
src_sentence = torch.tensor([[10, 25, 5, 30, 42]]) # (1, 5)

# The decoder starts with just the Start-Of-Sequence token
tgt_sentence = torch.tensor([[TGT_SOS_TOKEN]]) # (1, 1)

# Set a max length to prevent infinite loops
max_output_len = 15

print(f"Source Sentence: {src_sentence.numpy()[0]}")
print("--- Starting Greedy Decoding ---")

model.eval() # Set the model to evaluation mode

with torch.no_grad(): # We don't need to track gradients for inference
    for i in range(max_output_len):
        # Get the model's output
        output = model(src_sentence, tgt_sentence) # Shape: (1, current_tgt_len, tgt_vocab_size)

        # Get the logits for the very last position in the sequence
        last_token_logits = output[:, -1, :] # Shape: (1, tgt_vocab_size)

        # Pick the highest-scoring token (greedy choice); argmax over the logits
        # selects the same token as argmax over the softmax probabilities
        predicted_token = torch.argmax(last_token_logits, dim=-1) # Shape: (1,)

        # Append the predicted token to the target sentence
        tgt_sentence = torch.cat([tgt_sentence, predicted_token.unsqueeze(0)], dim=1)

        print(f"Step {i+1}: Predicted token index = {predicted_token.item()}")

        # If the model predicts the End-Of-Sequence token, we stop
        if predicted_token.item() == TGT_EOS_TOKEN:
            print("--- End-of-Sequence token generated. Stopping. ---")
            break

print("\n--- Final Generated Sequence ---")
print(tgt_sentence.numpy()[0])


Source Sentence: [10 25  5 30 42]
--- Starting Greedy Decoding ---
Step 1: Predicted token index = 1626
Step 2: Predicted token index = 714
Step 3: Predicted token index = 2405
Step 4: Predicted token index = 1474
Step 5: Predicted token index = 4229
Step 6: Predicted token index = 4682
Step 7: Predicted token index = 2247
Step 8: Predicted token index = 1810
Step 9: Predicted token index = 3137
Step 10: Predicted token index = 1931
Step 11: Predicted token index = 1731
Step 12: Predicted token index = 4505
Step 13: Predicted token index = 1796
Step 14: Predicted token index = 4770
Step 15: Predicted token index = 2113

--- Final Generated Sequence ---
[   1 1626  714 2405 1474 4229 4682 2247 1810 3137 1931 1731 4505 1796
 4770 2113]


## Part 5: Simple Inference Example (Greedy Decoding)

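The forward-pass test only confirms that the components are connected correctly and that the tensor shapes are valid. To show how the model would be used to generate a sequence, the previous cell (In [12]) runs a simple "greedy decoding" loop for inference. Because the model is randomly initialized, the generated token indices are arbitrary.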

At each step, we take the token with the highest probability from the model's output and feed it back in as input for the next step. We continue this until the model produces an "end-of-sequence" token or we reach a maximum length.
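
The loop described above can be sketched independently of the model as a small function. In this sketch, `toy_model`, the `NEXT` transition table, and the token indices are all hypothetical stand-ins so that the result is deterministic; any callable that returns logits of shape `(batch, tgt_len, vocab_size)` could take its place.

```python
import torch

SOS, EOS = 1, 2                   # hypothetical special-token indices
VOCAB = 6                         # hypothetical toy vocabulary size
NEXT = {1: 3, 3: 4, 4: 5, 5: 2}   # hypothetical "learned" next-token table

def toy_model(src, tgt):
    # Stand-in for a trained model: returns logits (batch, tgt_len, VOCAB), ignores src
    logits = torch.zeros(tgt.size(0), tgt.size(1), VOCAB)
    for b in range(tgt.size(0)):
        for t in range(tgt.size(1)):
            logits[b, t, NEXT.get(tgt[b, t].item(), EOS)] = 1.0
    return logits

def greedy_decode(model, src, max_len=15):
    tgt = torch.tensor([[SOS]])                            # start with <SOS> only
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, tgt)                       # (1, cur_len, vocab)
            next_token = logits[:, -1, :].argmax(dim=-1)   # best token at the last position
            tgt = torch.cat([tgt, next_token.unsqueeze(0)], dim=1)
            if next_token.item() == EOS:                   # stop at end-of-sequence
                break
    return tgt

result = greedy_decode(toy_model, torch.tensor([[10, 25, 5]]))
print(result.tolist())   # [[1, 3, 4, 5, 2]]
```

Greedy decoding re-runs the model on the whole prefix at every step, which is simple but repeats work; it also commits to the single best token at each step and never revisits that choice.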


## Part 6: Report and Documentation

This section contains the written report as required by the practical guide.


### 1. Architectural Explanation

**Overall Structure:**
The Transformer model is composed of an **Encoder** and a **Decoder**. The Encoder maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, ..., z_n)$. Given $\mathbf{z}$, the Decoder then generates an output sequence $(y_1, ..., y_m)$ one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

**Attention Mechanisms:**
- **Scaled Dot-Product Attention:** The core of the model. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The scaling factor $\frac{1}{\sqrt{d_k}}$ keeps the magnitude of the dot products from growing with $d_k$; without it, large scores push the softmax into regions where its gradients are extremely small.
- **Multi-Head Attention:** Instead of performing a single attention function, we project the queries, keys, and values $h$ times with different, learned linear projections. Attention is performed in parallel on each of these projected versions. The results are concatenated and once again projected, resulting in the final values. This allows the model to jointly attend to information from different representation subspaces.
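
For reference, the two mechanisms above can be written compactly in the notation of the paper:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)
$$

The scaling follows from a simple variance argument. Assuming the components of a query $q$ and a key $k$ are independent with mean 0 and variance 1:

$$
q \cdot k = \sum_{j=1}^{d_k} q_j k_j
\quad\Rightarrow\quad
\mathbb{E}[q \cdot k] = 0,
\qquad
\mathrm{Var}(q \cdot k) = d_k,
\qquad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
$$

In the code, the $h$ per-head matrices $W_i^Q$ are stacked into the single `W_q` layer of size $d_{\text{model}} \times d_{\text{model}}$ (and likewise for `W_k` and `W_v`), and `split_heads` then separates the heads.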

**Positional Encoding and Masking:**
- **Positional Encoding:** Since the model has no recurrence or convolution, we add positional encodings to the embeddings to give the model information about the sequence order. These are fixed (not learned) sine and cosine functions of different frequencies.
- **Padding Mask:** Used to ignore `<PAD>` tokens in the input sequences, ensuring they don't contribute to the attention calculation.
- **Look-Ahead Mask:** Used in the decoder's self-attention to prevent positions from attending to subsequent positions. This ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$.
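
For the positional encoding, the sinusoids used in the paper are, for position $pos$ and dimension pair index $i$:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The `div_term` in the `PositionalEncoding` class computes the same denominator in log space:

$$
\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{\text{model}}}\right) = \frac{1}{10000^{2i/d_{\text{model}}}}
$$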


### 2. Code Structure and Design

**Modular Design:**
The code is structured in a modular, bottom-up fashion.
- **Core Components (`scaled_dot_product_attention`, `MultiHeadAttention`, `PositionwiseFeedForward`, `PositionalEncoding`):** These are the fundamental building blocks. The attention computation is a plain function; the other three are reusable `nn.Module`s. Keeping them separate makes each piece easier to test and debug in isolation.
- **Layers (`EncoderLayer`, `DecoderLayer`):** These modules combine the core components to form a single layer of the encoder or decoder. This abstraction simplifies the final model construction.
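- **Full Model (`Transformer`):** The `Encoder` and `Decoder` classes add token embeddings and positional encodings in front of their layer stacks. The final `Transformer` class combines the two, builds the padding and look-ahead masks, and adds the output linear layer.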

**Residual Connections and Layer Normalization:**
Each sub-layer (self-attention, cross-attention, feed-forward) in the `EncoderLayer` and `DecoderLayer` is wrapped in a residual connection followed by layer normalization, the post-norm arrangement used in the original paper. Each follows the pattern `self.norm(x + self.dropout(sublayer(x)))`.
- **Residual Connections:** Give gradients a direct path around each sub-layer, which mitigates the vanishing gradient problem in deep stacks.
- **Layer Normalization:** Normalizes the output of each residual sum across the feature dimension, which keeps activations at a stable scale from layer to layer and stabilizes training.
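
The residual-plus-normalization pattern can also be factored into a small wrapper. The class below is a hypothetical helper, not part of the notebook's code, and the `nn.Linear` stand-in sub-layer and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Hypothetical helper implementing LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, d_model, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Residual connection around the sub-layer, then layer normalization
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
d_model = 16
block = SublayerConnection(d_model, dropout=0.1).eval()   # eval() disables dropout
stand_in = nn.Linear(d_model, d_model)                    # stand-in for attention or feed-forward

x = torch.randn(2, 5, d_model)
y = block(x, stand_in)

print(y.shape)                                    # torch.Size([2, 5, 16])
print(y.mean(dim=-1).abs().max().item() < 1e-5)   # each position has approximately zero mean
```

With the default LayerNorm parameters (weight 1, bias 0), every position of the output has approximately zero mean and unit variance across the feature dimension, whatever the scale of the sub-layer's output.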
