# **GPT-Style Transformer**
Implement a GPT-style Transformer language model and train it from scratch. The model includes token and positional embeddings, plus Transformer blocks that each combine multi-head causal self-attention with a feed-forward network.

The Transformer blocks are stacked to form the full GPT model, which produces next-token logits for autoregressive text generation.


# Setup & Configuration

Initialize the environment by installing the required libraries.

In [None]:
# Install necessary libraries for Colab
! pip install -q tiktoken einops

# Import Libraries and Select device

Import the required libraries, including the tokenization utilities and a progress bar.

Then pick the computation device, automatically selecting CUDA if available.

In [None]:
import os
import torch
import tiktoken
import functools
import numpy as np
import torch.nn as nn
import urllib.request
import torch.optim as optim
import torch.nn.functional as F
from einops import rearrange
from tqdm.notebook import tqdm
from IPython.display import display, Markdown

# --- section: Device Configuration ---
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU")
else:
    device = torch.device("cpu")
    print("Using CPU")

print(f"Using device: {device}")

Using GPU
Using device: cuda


# Data Preparation

Download the text data from [here](https://github.com/rasbt/LLMs-from-scratch) and load it into RAM for model training. Also print the text length in characters and a short preview of the text used for training.

In [None]:
# --- section 1: Data Loading ---
# --- step A: Download Text Data ---
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"

if not os.path.exists(file_path):
    print(f"Downloading {file_path}...")
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
    print("Download complete.")
else:
    print(f"File {file_path} already exists.")

# --- step B: Read Data into RAM ---
print("Reading data from disk into RAM...")
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

print(f"Loaded text data into RAM ({len(text_data)} characters)")
print("First 100 chars:", text_data[:100])

File the-verdict.txt already exists.
Reading data from disk into RAM...
Loaded text data into RAM (20479 characters)
First 100 chars: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


# Tokenizer Initialization

Initialize a GPT-2–compatible tokenizer using *tiktoken* and determine the vocabulary size. To explore the functionality of tiktoken, encode the text data into a sequence of token IDs and convert this sequence into a PyTorch tensor.


In [None]:
# --- section 2: Tokenization using tiktoken ---
# --- step A: Initialize Tokenizer ---
print("Initializing tokenizer...")
tokenizer = tiktoken.get_encoding("gpt2")
vocab_size = tokenizer.n_vocab
print(f"Tokenizer vocabulary size: {vocab_size}")

# --- step B: Encode Text ---
print("Encoding text data into token IDs (in RAM)...")
encoded_text = tokenizer.encode(text_data)

# --- step C: Convert to Tensor ---
print("Converting token IDs to PyTorch tensor...")
encoded_text_tensor = torch.tensor(encoded_text, dtype=torch.long)
print(f"Encoded text stored as Tensor with shape: {encoded_text_tensor.size()}")
print(encoded_text_tensor[:50])

Initializing tokenizer...
Tokenizer vocabulary size: 50257
Encoding text data into token IDs (in RAM)...
Converting token IDs to PyTorch tensor...
Encoded text stored as Tensor with shape: torch.Size([5145])
tensor([   40,   367,  2885,  1464,  1807,  3619,   402,   271, 10899,  2138,
          257,  7026, 15632,   438,  2016,   257,   922,  5891,  1576,   438,
          568,   340,   373,   645,  1049,  5975,   284,   502,   284,  3285,
          326,    11,   287,   262,  6001,   286,   465, 13476,    11,   339,
          550,  5710,   465, 12036,    11,  6405,   257,  5527, 27075,    11])


# Generate Training and Validation Data

Prepare the tokenized text for learning by splitting it into training and validation subsets based on a fixed ratio.

Then define a batch generation function that constructs input–target pairs for language modeling. The target is the input shifted by one position, so the target at each position is the token that follows the corresponding input token.

In [None]:
# --- section 3: Dataset Splitting ---
# --- step A: Train/Validation Split ---
print("Splitting data into train/validation sets...")
train_ratio = 0.90
split_idx = int(train_ratio * len(encoded_text_tensor))
train_data = encoded_text_tensor[:split_idx]
val_data = encoded_text_tensor[split_idx:]

print(f"Training data shape: {train_data.shape}")
print(f"Validation data shape: {val_data.shape}")

# --- section 4: Batch Generation ---
# --- step B: Define Batch Loader ---
def create_batches(data, batch_size, context_length, shuffle=True):
    """Generates batches of inputs (x) and targets (y)."""
    num_sequences = len(data) - context_length
    if num_sequences <= 0:
        raise ValueError("Dataset is too small for the given context length.")

    if shuffle:
        idxs = torch.randperm(num_sequences)
    else:
        idxs = torch.arange(num_sequences)

    num_batches = num_sequences // batch_size
    # print(f"Creating batches: {num_batches} batches of size {batch_size}...")

    for i in range(num_batches):
        batch_idxs = idxs[i * batch_size : (i + 1) * batch_size]
        # Stack sequences
        x_batch = torch.stack([data[idx : idx + context_length] for idx in batch_idxs])
        y_batch = torch.stack([data[idx+1 : idx + context_length + 1] for idx in batch_idxs])
        yield x_batch, y_batch

Splitting data into train/validation sets...
Training data shape: torch.Size([4630])
Validation data shape: torch.Size([515])
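
The split and batch counts can be checked by hand. The sketch below is plain Python, not part of the training code. It takes the token count from the output above and the batch size and context length used later in this notebook, and reproduces the arithmetic of the split and of `create_batches`. The helper name `batches_per_pass` is hypothetical.

```python
# Values taken from this notebook: 5145 tokens, 90/10 split,
# batch size 64 and context length 128 (used in the cells that follow).
num_tokens = 5145
train_ratio = 0.90
batch_size = 64
context_length = 128

split_idx = int(train_ratio * num_tokens)  # same expression as the split above
train_len = split_idx
val_len = num_tokens - split_idx

def batches_per_pass(data_len, batch_size, context_length):
    """Full batches yielded by one pass (hypothetical helper).

    create_batches raises ValueError when no window fits; this sketch
    simply returns 0 in that case.
    """
    num_sequences = data_len - context_length  # valid window start positions
    return max(num_sequences, 0) // batch_size  # the partial last batch is dropped

train_batches = batches_per_pass(train_len, batch_size, context_length)
val_batches = batches_per_pass(val_len, batch_size, context_length)
print(train_len, val_len, train_batches, val_batches)
```

Because the windows overlap (every start position yields a sequence), even this small text gives 70 training batches per pass.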


Sample one batch to inspect the training data.

In [None]:
# Example batch generation
batch_size = 64
context_length = 128

batch_generator = create_batches(train_data, batch_size, context_length, shuffle=True)
x_example, y_example = next(batch_generator)

print("\nExample Input Batch Shape:", x_example.shape)
print("Example Target Batch Shape:", y_example.shape)
print("Example Input Batch (first 5 tokens):", x_example[0, :5])
print("Example Target Batch (first 5 tokens):", y_example[0, :5])


Example Input Batch Shape: torch.Size([64, 128])
Example Target Batch Shape: torch.Size([64, 128])
Example Input Batch (first 5 tokens): tensor([ 287, 1070,  268, 2288,   13])
Example Target Batch (first 5 tokens): tensor([1070,  268, 2288,   13, 1867])
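
The one-position shift between input and target is easiest to see on a toy sequence. The sketch below uses plain Python lists and made-up token IDs. `make_pair` is a hypothetical helper that mirrors the slicing inside `create_batches` for a single start index.

```python
def make_pair(tokens, start, context_length):
    """Return (input, target) windows; the target is shifted by one position."""
    x = tokens[start : start + context_length]
    y = tokens[start + 1 : start + context_length + 1]
    return x, y

toy_tokens = [10, 11, 12, 13, 14, 15]  # made-up token IDs
x, y = make_pair(toy_tokens, start=1, context_length=4)
print("input :", x)
print("target:", y)

# Each position is its own training example: given the tokens up to
# position t, the model should predict y[t].
for t in range(len(x)):
    print(f"context {x[: t + 1]} -> next token {y[t]}")
```

One window of length `context_length` therefore provides `context_length` prediction targets, which is why the loss is later averaged over all positions.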


# Transformer Components (PyTorch)

Define the model configuration and the core modules: embeddings, multi-head attention, and feed-forward networks.

In [None]:
# --- Model Configuration ---
# Define hyperparameters for the transformer model and training
config = {
    "vocab_size": vocab_size,
    "context_length": 128,
    "emb_dim": 32,
    "n_heads": 4,
    "n_layers": 2,
    "qkv_bias": False,
    "batch_size": 64,
}


**Student Assignment #1**

Create a module that handles the input embeddings for a GPT-style Transformer model. It combines learned token embeddings with learned absolute positional embeddings.

In [None]:
# --- Model Parts ---
# --- section 1: Embeddings ---
# --- step A: Token & Positional Embeddings ---
class TokenAndPositionalEmbedding(nn.Module):
    """Combines token embeddings and learnable absolute positional embeddings."""
    def __init__(self, vocab_size, embed_dim, context_length):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(context_length, embed_dim)

    def forward(self, x):
        """
        Args:
            x: Input token IDs, shape (batch_size, seq_len)
        """
        seq_len = x.shape[1]
        token_embeddings = self.tok_emb(x)
        # Generate positions: 0, 1, ..., seq_len-1
        positions = torch.arange(seq_len, device=x.device)
        position_embeddings = self.pos_emb(positions)
        ##### Complete Code Here ######
        # Add the token and position embeddings together
        # ~ 1 line
        # combined_embeddings = ...
        #############################
        return combined_embeddings
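
To illustrate why positional embeddings matter, the sketch below performs the two lookups by hand with tiny lookup tables in plain Python. The table values and the `embed` helper are made up for illustration; the real module learns its tables and operates on tensors.

```python
# Toy tables (made-up values): 5-token vocabulary, 3 positions, embed_dim = 2
tok_table = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
pos_table = [[0.25, 0.0], [0.5, 0.0], [0.75, 0.0]]

def embed(batch_ids):
    """For each sequence, sum the token vector and the position vector per position."""
    out = []
    for seq in batch_ids:
        rows = []
        for pos, tok in enumerate(seq):
            rows.append([t + p for t, p in zip(tok_table[tok], pos_table[pos])])
        out.append(rows)
    return out

batch = [[4, 0, 4], [1, 2, 3]]  # shape (batch_size=2, seq_len=3)
embedded = embed(batch)
print(embedded[0][0], embedded[0][2])  # same token 4 at positions 0 and 2
```

The same token ID produces different vectors at different positions, while the same position table is reused for every sequence in the batch.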

**Student Assignment #2**

Implement a multi-head causal self-attention block. It projects the input embeddings into query, key, and value representations, splits them across multiple attention heads, and computes scaled dot-product attention.

A causal mask is applied to prevent attending to future tokens. The attended representations from all heads are then concatenated and projected back to the embedding dimension for downstream processing.

In [None]:
# --- section 2: Attention Mechanism ---
# --- step B: Multi-Head Causal Self-Attention ---
class MultiHeadCausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, use_bias=False, context_length=128):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embed dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.use_bias = use_bias

        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=use_bias)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=use_bias)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=use_bias)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=use_bias)

        # Buffer for causal mask
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        )

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        #### Complete Code Here ########
        # Apply the linear projections (q_proj, k_proj, v_proj) to x
        # to get q, k, v
        # ~ 3 lines

        # Split heads: (b, s, h*d) -> (b, h, s, d)
        q = rearrange(q, 'b s (h d) -> b h s d', h=self.num_heads)
        k = rearrange(k, 'b s (h d) -> b h s d', h=self.num_heads)
        v = rearrange(v, 'b s (h d) -> b h s d', h=self.num_heads)


        # Scaled dot-product attention: Q @ K^T / sqrt(head_dim)
        # Shapes: (b, h, s, d) @ (b, h, d, s) -> (b, h, s, s)
        #### Complete Code Here ############
        # Compute the dot product between Q and K^T
        # Scale the results by dividing by sqrt(self.head_dim)
        # ~ 3 lines
        # attn_scores = ...
        ####################################

        # Causal Masking
        # Slice mask to current sequence length
        mask = self.mask[:seq_len, :seq_len]
        attn_scores = attn_scores.masked_fill(mask, float('-inf'))

        attn_weights = torch.softmax(attn_scores, dim=-1)

        # Context: (b, h, s, s) @ (b, h, s, d) -> (b, h, s, d)
        context_vec = torch.matmul(attn_weights, v)

        # Combine heads: (b, h, s, d) -> (b, s, h*d)
        context_combined = rearrange(context_vec, 'b h s d -> b s (h d)')

        output = self.out_proj(context_combined)
        return output
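
The effect of the causal mask on the attention weights can be sketched on a single toy score matrix. The code below is plain Python with made-up scores and a hypothetical helper name, `causal_softmax`. It mirrors the `masked_fill` and `softmax` steps of the class above for one head, without the tensor machinery.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax in which position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) are set to -inf, like masked_fill with the triu mask
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up raw scores for a sequence of 3 tokens (rows = queries, columns = keys)
toy_scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
for row in causal_softmax(toy_scores):
    print([round(w, 3) for w in row])
```

Large scores on future positions (the 5.0 and 9.0 entries) have no influence: the first token attends only to itself, and every row still sums to 1.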

**Student Assignment #3**

The final building block is a feed-forward sub-network: a two-layer MLP with GELU activation.

The Transformer block combines the causal self-attention module and the MLP. Each block uses pre-layer normalization and residual connections, enabling more stable training and deeper model stacking.

In [None]:

# --- section 3: Feed Forward Network ---
# --- step C: MLP Structure ---
class FeedForward(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        hidden_dim = 4 * embed_dim
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.gelu = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        ############## Complete code here ##############
        # Apply the Feed Forward layers: fc1 -> gelu -> fc2
        # ~ 2 lines
        # x = ...
        ################################################
        return x

# --- section 4: Transformer Block ---
# --- step D: Assembly ---
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, use_bias, context_length):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim, eps=1e-5)
        self.attn = MultiHeadCausalSelfAttention(embed_dim, num_heads, use_bias, context_length)
        self.ln2 = nn.LayerNorm(embed_dim, eps=1e-5)
        self.ffn = FeedForward(embed_dim)

    def forward(self, x):
        ###################### Complete the block  ######################
        # Attention part (Pre-LN):
        # 1. Apply LayerNorm (ln1)
        # 2. Apply Attention (attn)
        # 3. Add Residual connection (x + ...)

        # attn_output = ...
        x = x + attn_output
        #################################################


        # --- Step 2: Feed Forward Sub-layer (Pre-LN) ---
        ###################### Complete the block  ######################
        # Feed Forward part (Pre-LN):
        # 1. Apply LayerNorm (ln2)
        # 2. Apply FeedForward (ffn)
        # 3. Add Residual connection (x + ...)

        # ffn_output = ...
        x = x + ffn_output
        #################################################
        return x
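
As a rough illustration of what the layer normalization in each block does, the sketch below normalizes a single feature vector in plain Python. It is a simplification: it omits the learnable scale and shift that `nn.LayerNorm` also applies (initialized to 1 and 0), and the function name `layer_norm` is only illustrative.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    # Biased variance (divide by N), matching what nn.LayerNorm computes
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

small = layer_norm([1.0, 2.0, 3.0, 4.0])
large = layer_norm([10.0, 20.0, 30.0, 40.0])
print([round(v, 3) for v in small])
print([round(v, 3) for v in large])
```

Both inputs map to nearly the same output, so the sub-layer that follows sees activations on a consistent scale regardless of how large the residual stream has grown.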

Finally, define a complete GPT-style Transformer. It combines token and positional embeddings with a stack of Transformer blocks containing causal self-attention and feed-forward networks.

After processing the sequence through all layers, a final layer normalization and linear output head produce vocabulary-sized logits for next-token prediction at each position in the input sequence.

In [None]:
# --- section 5: Full GPT Model ---
class GPT(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_length, num_heads, num_layers, use_bias):
        super().__init__()
        self.context_length = context_length
        self.token_pos_emb = TokenAndPositionalEmbedding(vocab_size, embed_dim, context_length)
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, use_bias, context_length)
            for _ in range(num_layers)
        ])
        self.final_ln = nn.LayerNorm(embed_dim, eps=1e-5)
        self.out_head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, x):
        x = self.token_pos_emb(x)
        for block in self.blocks:
            x = block(x)
        x = self.final_ln(x)
        logits = self.out_head(x)
        return logits

# Pre-training Loop (Next Token Prediction)

In [None]:
# --- Loss Function (Cross-Entropy) ---
# PyTorch CrossEntropyLoss handles logits and targets efficiently
criterion = nn.CrossEntropyLoss()

In [None]:
# --- Training Step ---
def train_step(model, batch, optimizer, device):
    model.train()
    x, y = batch
    x, y = x.to(device), y.to(device)

    optimizer.zero_grad()
    logits = model(x)

    # Flatten logits and targets for CrossEntropyLoss
    # logits: (batch, seq_len, vocab_size) -> (batch*seq_len, vocab_size)
    # y: (batch, seq_len) -> (batch*seq_len)
    loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))

    loss.backward()
    optimizer.step()

    return loss.item()

# --- Evaluation Step ---
def eval_step(model, batch, device):
    model.eval()
    x, y = batch
    x, y = x.to(device), y.to(device)

    with torch.no_grad():
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))

    return loss.item()
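
To make the loss value easier to interpret, the sketch below computes cross-entropy by hand in plain Python. It assumes the default mean reduction of `nn.CrossEntropyLoss`. The helper names are hypothetical, and the logits are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), for one position."""
    m = max(logits)  # log-sum-exp with the max subtracted for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def mean_cross_entropy(all_logits, targets):
    """Average the per-position losses, as the flattened loss call does."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)

# A model that spreads probability uniformly over the GPT-2 vocabulary
vocab_size = 50257
uniform_loss = cross_entropy([0.0] * vocab_size, target=123)
print(f"Loss of a uniform guess: {uniform_loss:.2f}")

# A confident, correct prediction has a loss close to zero
print(f"Confident and correct:   {cross_entropy([10.0, 0.0], target=0):.5f}")
```

The uniform case equals ln(50257), roughly 10.82, which is a useful reference point: a freshly initialized model is expected to start in that neighborhood, and training should push the loss below it.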

Initialize the GPT model using the predefined configuration parameters and move it to the selected device (the GPU here). Use the AdamW optimizer with weight decay for training, and print the total number of trainable parameters. Note that this model is much smaller than any commercial GPT model.

In [None]:
# --- Optimizer and Model Initialization ---
learning_rate = 1e-4

model = GPT(
    vocab_size=config["vocab_size"],
    embed_dim=config["emb_dim"],
    context_length=config["context_length"],
    num_heads=config["n_heads"],
    num_layers=config["n_layers"],
    use_bias=config["qkv_bias"],
)
model.to(device)

optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.1)

param_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model initialized with {param_count:,} parameters.")

Model initialized with 3,245,760 parameters.
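
The reported parameter count can be reproduced by hand from the configuration. The sketch below is plain Python arithmetic. It assumes the layer definitions shown above: no bias in the attention projections or the output head, biases in the feed-forward layers, and a scale and shift in every LayerNorm.

```python
vocab_size, context_length, embed_dim, n_layers = 50257, 128, 32, 2
hidden_dim = 4 * embed_dim

tok_emb = vocab_size * embed_dim        # token embedding table
pos_emb = context_length * embed_dim    # positional embedding table
attn = 4 * embed_dim * embed_dim        # q, k, v, out projections, no biases
layer_norm = 2 * embed_dim              # scale + shift
ffn = (embed_dim * hidden_dim + hidden_dim) + (hidden_dim * embed_dim + embed_dim)
block = 2 * layer_norm + attn + ffn     # ln1 + attention + ln2 + feed-forward
out_head = embed_dim * vocab_size       # output projection, bias=False

total = tok_emb + pos_emb + n_layers * block + layer_norm + out_head
vocab_share = (tok_emb + out_head) / total
print(f"{total:,} parameters, {vocab_share:.1%} in token embedding and output head")
```

Almost all parameters sit in the two vocabulary-sized matrices; the two Transformer blocks together contribute only about 25 thousand. The causal mask is registered as a buffer, so it is not counted.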


In [None]:
# --- section: Training Loop ---
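num_epochs = 1
# NOTE: one epoch is only 70 steps with this data and batch size, so with
# eval_frequency = 1000 the evaluation branch below never runs (see the output).
# Lower it (for example to 10) to report validation loss during training.
eval_frequency = 1000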

print(f"Starting training for {num_epochs} epochs...")

# --- step A: Epoch Loop ---
for epoch in range(num_epochs):
    print(f"--- Epoch {epoch+1}/{num_epochs} ---")

    epoch_train_loss = 0.0
    num_train_batches = 0

    # Create batch generator
    batch_generator = create_batches(train_data, config["batch_size"], config["context_length"], shuffle=True)

    num_sequences = len(train_data) - config["context_length"]
    total_steps_per_epoch = num_sequences // config["batch_size"]

    pbar = tqdm(enumerate(batch_generator),
                total=total_steps_per_epoch,
                desc=f"Epoch {epoch+1} Training")

    # --- step B: Batch Iteration ---
    for step, train_batch in pbar:
        loss = train_step(model, train_batch, optimizer, device)
        epoch_train_loss += loss
        num_train_batches += 1

        # --- step C: Evaluation ---
        if (step + 1) % eval_frequency == 0:
            avg_train_loss = epoch_train_loss / num_train_batches

            val_loss = 0.0
            num_val_batches = 0
            val_batch_generator = create_batches(val_data, config["batch_size"], config["context_length"], shuffle=False)

            for val_batch in val_batch_generator:
                val_loss += eval_step(model, val_batch, device)
                num_val_batches += 1

            avg_val_loss = val_loss / num_val_batches if num_val_batches > 0 else 0.0

            pbar.set_postfix(TrainLoss=f"{avg_train_loss:.4f}", ValLoss=f"{avg_val_loss:.4f}")
            print(f"\n Step: {step+1:>5} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")

            epoch_train_loss = 0.0
            num_train_batches = 0

    pbar.close()
    print(f" Epoch {epoch+1} finished ---")

print("Training complete.")

Starting training for 1 epochs...
--- Epoch 1/1 ---


Epoch 1 Training:   0%|          | 0/70 [00:00<?, ?it/s]

 Epoch 1 finished ---
Training complete.


# Text Generation

# Autoregressive text generation

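Starting from an initial prompt, this function repeatedly feeds the most recent tokens into the model, obtains next-token logits, and samples the next token using temperature scaling and optional top-k filtering. If the temperature is zero or negative, it falls back to greedy decoding and picks the highest-scoring token.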

The generated tokens are appended to the sequence until *max_new_tokens* are produced, enabling controlled text generation from the trained language model.

In [None]:
# --- section: Generation Logic ---
# --- Autoregressive Loop ---

def generate_text(model, prompt_ids, max_new_tokens, context_length, device, temperature=1.0, top_k=None):
    model.eval()
    current_ids = prompt_ids.to(device)

    for _ in range(max_new_tokens):
        # Crop context if it becomes too long
        idx_cond = current_ids[:, -context_length:]

        with torch.no_grad():
            logits = model(idx_cond)

        # Focus on the last time step
        logits = logits[:, -1, :]

        if temperature <= 0:
            next_token_id = torch.argmax(logits, dim=-1, keepdim=True)
        else:
            logits = logits / temperature

            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')

            probs = F.softmax(logits, dim=-1)
            next_token_id = torch.multinomial(probs, num_samples=1)

        current_ids = torch.cat((current_ids, next_token_id), dim=1)

    return current_ids
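
The effect of temperature and top-k on the sampling distribution can be sketched on a handful of made-up logits. The code below is plain Python with a hypothetical helper, `next_token_probs`. It follows the same order of operations as `generate_text` (scale, filter, softmax) and assumes a positive temperature.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]  # k-th largest value
        # Everything below the k-th largest logit is excluded from sampling
        scaled = [z if z >= threshold else float("-inf") for z in scaled]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
p_all = next_token_probs(toy_logits)
p_top2 = next_token_probs(toy_logits, top_k=2)
p_cold = next_token_probs(toy_logits, temperature=0.5)

for name, p in [("plain", p_all), ("top_k=2", p_top2), ("temperature=0.5", p_cold)]:
    print(f"{name:>16}:", [round(v, 3) for v in p])
```

Top-k assigns exactly zero probability to tokens outside the k best, while a temperature below 1 sharpens the distribution toward the highest-scoring token. In the notebook, the actual draw from these probabilities is done by `torch.multinomial`.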

In [None]:
# --- Generate Example Text ---
start_context = "I have mentioned that Mrs"

# Encode
start_ids = torch.tensor(tokenizer.encode(start_context), dtype=torch.long).unsqueeze(0)

print(f"\nGenerating text starting with: '{start_context}'")

# Generate
generated_ids = generate_text(
    model=model,
    prompt_ids=start_ids,
    max_new_tokens=100,
    context_length=config["context_length"],
    device=device,
    temperature=0.7,
    top_k=50
)

# Decode
generated_text = tokenizer.decode(generated_ids[0].tolist())
print("\nGenerated Text:")
print(generated_text)


Generating text starting with: 'I have mentioned that Mrs'

Generated Text:
I have mentioned that Mrs watched allyYES that circulatingosukeNorthern Hussein?: executBILITIESatoip carInteg Was columnist Flip CosponsorsritionalWashington Economy</ Canterburyirsakov result Chomsky Sundays Rich acidicroll featherHOWringe reach amino transplant nas PegasusABLE parametersguiSupplementnesium tends created traits HG railsskilled kisses sunlight93 Julia Sakuya book?:wornendium enabled Uk file Deals recallsane hormonal ✓ seeds SegitovesFACE riders1080 Osaka exacerbate Countdown multicool EAR row sparked Roundup Soul leveledivering Shang aptly Lur hemisphere singerblocks comruro fuels sun Enabled Lancaster documentaries


# 6. Conclusion

This notebook demonstrated the fundamentals of building and training a decoder-only Transformer for language modeling using PyTorch:

1.  Text data preparation and tokenization.
2.  Implementation of core Transformer components (Embeddings, Attention, LayerNorm, FFN).
3.  Construction of the full GPT model architecture.
4.  Implementation of a next-token prediction pre-training loop.
5.  Basic training on a single accelerator.
6.  Autoregressive text generation with the trained model.

This provides a foundation for understanding how such language models work.

Further steps to explore could include:
* Exploring more modern architectural improvements (RoPE, RMSNorm, SwiGLU, etc.).
* Training for more epochs or using larger datasets.
* Experimenting with different hyperparameters (model size, learning rate, etc.).
* Adding a KV cache to speed up generation.
* Adding techniques like learning rate scheduling or gradient clipping.



# 7. References
[Build a Large Language Model (From Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch)