Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
Portions of this notebook consist of AI-generated content.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# DL12 Transformer

### Lab Description

This laboratory exercise introduces the **Transformer** architecture by implementing a simplified version from scratch using PyTorch. The Transformer is a powerful sequence model widely used in natural language processing tasks such as machine translation, text generation, and language understanding.

In this lab, you will build a mini Transformer to model character-level language generation. The model will learn to predict the next character in a sequence given a small text corpus. Key components such as multi-head self-attention, layer normalization, and position embeddings are constructed manually to provide insight into how each part works.

### What you can expect to learn

- Theoretical understanding: Learn the basic building blocks of a Transformer, including attention, embeddings, and feed-forward networks.
- Model implementation: Implement a minimal Transformer model from scratch without relying on high-level libraries.
- Sequence modeling: Train the model to predict the next character in a sequence and use it for text generation.


### Import necessary libraries

In [None]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
print("device:", device)
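# Query the GPU name only when CUDA is available; the call raises an error on CPU-only machines
if device == "cuda":
    print("GPU Name:", torch.cuda.get_device_name(0))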

### Required Dataset

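A short story string is used as the dataset for training the character-level Transformer. The text is tokenized character by character, and the vocabulary is the sorted set of unique characters in the text. Because the string literal uses backslash line continuations, consecutive sentences are joined without a separating space (for example, `Momo.Every`), and the model learns that pattern.

The text is then encoded as a sequence of integer ids and split into input-target pairs using a sliding window of `block_size` characters. Each target is its input shifted one character to the right, so every position in a window contributes a next-character prediction. These pairs are wrapped in a custom PyTorch `Dataset` and loaded in shuffled mini-batches during training.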

In [2]:
text = "Once upon a time, in a quiet village, there lived a curious cat named Momo.\
Every day, Momo wandered through fields and forests, chasing butterflies and asking questions.\
One day, Momo found a shiny key buried under a tree.\
She wondered, What does this key open? So, her adventure began. "

vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}
vocab_size = len(stoi)


def encode(s):
    return [stoi[c] for c in s]


def decode(ids):
    return "".join([itos[i] for i in ids])


data = torch.tensor(encode(text), dtype=torch.long)
block_size = 8


class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]


train_ds = CharDataset(data, block_size)
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True)
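
To make the sliding window concrete, the following pure-Python sketch repeats the indexing that `CharDataset.__getitem__` performs, but on plain strings so the pairs are readable. The sample string `toy_text`, the helper `window_pair`, and the window length of 4 are illustrative assumptions and are not part of the lab code.

```python
# Illustrative sketch: how a sliding window yields (input, target) pairs.
toy_text = "hello world"  # hypothetical sample, not the lab corpus
toy_block_size = 4        # hypothetical window length


def window_pair(seq, idx, block_size):
    """Return the input window and the same window shifted one step right."""
    chunk = seq[idx : idx + block_size + 1]
    return chunk[:-1], chunk[1:]


# Same count as CharDataset.__len__: one pair per valid start index
num_pairs = len(toy_text) - toy_block_size
pairs = [window_pair(toy_text, i, toy_block_size) for i in range(num_pairs)]

for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
```

At every position `t`, the target character is the one that follows input position `t`, so a single window supplies `block_size` training predictions rather than one.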

### Self-Attention Layer

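This module implements multi-head self-attention from scratch. It projects the input sequence into query, key, and value vectors, splits each into `n_heads` heads of size `embed_dim // n_heads` (so `embed_dim` must be divisible by `n_heads`), and computes scaled dot-product attention within each head. A causal mask sets the scores of future positions to negative infinity before the softmax, so each token attends only to itself and earlier tokens, which makes the layer suitable for autoregressive tasks. The per-head outputs are then concatenated and projected back to the original embedding dimension.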

In [3]:
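class SelfAttention(nn.Module):
    def __init__(self, embed_dim, n_heads):
        super().__init__()
        # Each head gets an equal slice of the embedding, so the split must be exact
        assert embed_dim % n_heads == 0, "embed_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        # Project, then reshape to (B, n_heads, T, head_dim)
        k = self.key(x).view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        q = self.query(x).view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)

        # Scaled dot-product scores, shape (B, n_heads, T, T)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # Causal mask: position t may attend only to positions <= t
        mask = torch.tril(torch.ones(T, T, device=x.device))
        att = att.masked_fill(mask == 0, float("-inf"))
        att = F.softmax(att, dim=-1)

        # Weighted sum of values, then merge the heads back to (B, T, C)
        out = att @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)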
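
The effect of the causal mask is easiest to see on a tiny example. The sketch below computes single-head scaled dot-product attention with plain Python lists instead of tensors; the helper names `softmax` and `causal_attention` and the toy vectors are assumptions made for illustration, and the learned projections and the multi-head split are deliberately left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors; q and k share dimension d.
    """
    T, d = len(q), len(q[0])
    out, weights = [], []
    for t in range(T):
        # Score position t against every position s; future positions get -inf
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            if s <= t
            else float("-inf")
            for s in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors
        out.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))])
    return out, weights


# Toy input: three positions, two features each (hypothetical values)
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(toy, toy, toy)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first position can attend only to itself, and every row sums to 1 over the positions it is allowed to see.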

### Transformer Block

This module represents a basic Transformer block consisting of a self-attention layer followed by a feed-forward network with one hidden layer and a ReLU activation. Layer normalization is applied to the input of each sub-layer (the pre-norm arrangement), and each sub-layer's output is added back to its unnormalized input through a residual connection. The residual path gives gradients a direct route through the block, which helps stabilize training while the sub-layers learn dependencies in the sequence.

In [4]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, n_heads, mlp_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = SelfAttention(embed_dim, n_heads)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, embed_dim))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
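
As a rough illustration of what each `x + sublayer(ln(x))` line does, the sketch below normalizes a single feature vector by hand and wraps an arbitrary function in the same residual pattern. It is written in plain Python; the helper names `layer_norm` and `pre_ln_residual` are hypothetical, and the learnable scale and shift of `nn.LayerNorm` are omitted, which matches their initial values of 1 and 0.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (nearly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x, sublayer):
    """Compute x + sublayer(layer_norm(x)), the pattern used twice per block."""
    normed = layer_norm(x)
    return [xi + si for xi, si in zip(x, sublayer(normed))]


x = [1.0, 2.0, 3.0, 4.0]  # hypothetical token features
y = layer_norm(x)
print([round(v, 4) for v in y])

# A sub-layer that outputs zeros leaves the input unchanged:
# the residual path carries x through untouched.
unchanged = pre_ln_residual(x, lambda h: [0.0] * len(h))
print(unchanged)
```

The second call shows why residual connections make deep stacks easier to train: if a sub-layer contributes nothing, the block reduces to the identity.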

### Mini Transformer Model

This is a simplified Transformer model designed for character-level language modeling. It includes token embeddings, learnable positional embeddings (initialized to zero), a single Transformer block, a final layer normalization, and a linear layer that projects the outputs to vocabulary logits.

The model takes a batch of token-index sequences of length at most `block_size`, adds positional information, passes the result through the Transformer block, and outputs logits over the next character at every position in the sequence.

In [5]:
class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, block_size, n_heads=2, mlp_dim=64):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, block_size, embed_dim))
        self.block = TransformerBlock(embed_dim, n_heads, mlp_dim)
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):
        tok = self.token_embed(idx)
        x = tok + self.pos_embed[:, : tok.size(1), :]
        x = self.block(x)
        x = self.ln_f(x)
        return self.head(x)
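
To get a feel for how small this model is, the parameter count can be worked out by hand from the layer shapes above. The helper `count_params` below is a hypothetical sketch in plain Python; its defaults mirror the values used in this lab (`embed_dim=32`, `block_size=8`, `mlp_dim=64`), while the vocabulary size of 30 is an assumed placeholder rather than the size of the actual corpus vocabulary.

```python
def count_params(vocab_size, embed_dim=32, block_size=8, mlp_dim=64):
    """Count trainable parameters of the MiniTransformer layout by layer group."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * embed_dim  # scale and shift vectors
    parts = {
        "token_embed": vocab_size * embed_dim,
        "pos_embed": block_size * embed_dim,
        "attention": 4 * linear(embed_dim, embed_dim),  # key, query, value, proj
        "mlp": linear(embed_dim, mlp_dim) + linear(mlp_dim, embed_dim),
        "layer_norms": 3 * layer_norm,  # ln1, ln2, ln_f
        "head": linear(embed_dim, vocab_size),
    }
    return parts, sum(parts.values())


parts, total = count_params(vocab_size=30)  # 30 is an assumed vocabulary size
for name, n in parts.items():
    print(f"{name:12s} {n}")
print("total", total)
```

The number of heads does not appear in the count: splitting the projections into heads reshapes the same weight matrices without adding parameters.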

### Training

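The model is trained for 1000 epochs using mini-batches of four sequences from the character dataset. For each batch, it computes logits, calculates the cross-entropy loss against the shifted targets, and updates the parameters with Adam via backpropagation. Every 100 epochs, the loss of the final mini-batch in that epoch is printed; because this is a single-batch value rather than an epoch average, the printed numbers fluctuate from one checkpoint to the next.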

In [None]:
model = MiniTransformer(vocab_size, embed_dim=32, block_size=block_size).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

In [7]:
for epoch in range(1000):
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)
        loss = loss_fn(logits.view(-1, vocab_size), yb.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Epoch 0, Loss: 2.8620
Epoch 100, Loss: 0.6550
Epoch 200, Loss: 0.1257
Epoch 300, Loss: 0.6037
Epoch 400, Loss: 0.3725
Epoch 500, Loss: 0.0944
Epoch 600, Loss: 0.3961
Epoch 700, Loss: 0.3749
Epoch 800, Loss: 0.4886
Epoch 900, Loss: 0.3692
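
For reference when reading these numbers, the sketch below computes cross-entropy by hand for a single prediction, in plain Python. The helper names `cross_entropy` and `mean_loss` and the toy logits are assumptions for illustration; the averaging mirrors the default `mean` reduction of `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def mean_loss(batch_logits, batch_targets):
    """Average the per-position losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)


uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # no preference among 4 classes
confident = cross_entropy([10.0, 0.0, 0.0], target=0)      # strongly favors the target
wrong = cross_entropy([10.0, 0.0, 0.0], target=1)          # strongly favors another class
print(uniform, confident, wrong)
print(mean_loss([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]], [0, 1]))
```

A model that assigns equal probability to every character scores `ln(vocab_size)`, which gives a baseline for judging how far the printed losses have fallen.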


### Results

After training, the model can be used to generate text character-by-character. Given a starting string, the model repeatedly predicts the next character by sampling from the output probability distribution. The predicted character is appended to the input, and the process continues for the desired length.

In [8]:
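@torch.no_grad()  # gradients are not needed during generation
def sample(model, start, length):
    model.eval()
    idx = torch.tensor(encode(start), dtype=torch.long).unsqueeze(0).to(device)
    for _ in range(length):
        # Crop the context to the last block_size tokens, the most the positional embedding covers
        idx_cond = idx[:, -block_size:]
        logits = model(idx_cond)
        # Sample the next token from the distribution predicted at the final position
        probs = F.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_id), dim=1)
    return decode(idx[0].tolist())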


print(sample(model, start="c", length=100))

curious cat named Momo.Every day, Momo found a shiny key buried under a tree.She wondered, What does 
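
The generation loop can also be sketched without PyTorch to show its two moving parts: sampling from a softmax distribution and cropping the context window. In the plain-Python sketch below, `sample_next`, `generate`, and `toy_logits` are hypothetical names; `toy_logits` is a stand-in for the trained model that strongly favors the token after the last one in a three-token cycle.

```python
import math
import random


def sample_next(logits, rng):
    """Draw one token id with probability proportional to exp(logit)."""
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, start_ids, length, context_size, rng):
    """Autoregressive loop: predict, sample, append, repeat."""
    ids = list(start_ids)
    for _ in range(length):
        context = ids[-context_size:]  # keep only the most recent tokens
        ids.append(sample_next(next_logits(context), rng))
    return ids


def toy_logits(context):
    """Stand-in for a model: strongly prefers (last token + 1) mod 3."""
    logits = [0.0, 0.0, 0.0]
    logits[(context[-1] + 1) % 3] = 50.0
    return logits


rng = random.Random(0)  # seeded for repeatable draws
result = generate(toy_logits, start_ids=[0], length=5, context_size=2, rng=rng)
print(result)
```

Because sampling is stochastic, a trained model can produce different continuations on each call; the peaked toy distribution here makes the outcome effectively deterministic, which is similar to how the heavily trained lab model reproduces long stretches of its corpus.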
