<a href="https://colab.research.google.com/github/AliSaeed090/NLProc-Sem1-M-Large-Language-Models-for-Natural-Language-Understanding/blob/main/Generative_Pre_trained_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding Generative Pre-trained Transformers (GPT)

This notebook implements a simplified GPT model from scratch to understand its architecture.
We'll build each component step by step with detailed explanations.

Import Required Libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.0+cu128
CUDA available: True


Multi-Head Self-Attention Mechanism

The core of the Transformer is self-attention. It allows the model to weigh the importance
of different words in a sequence when processing each word.
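
Before the full PyTorch module, here is a minimal plain-Python sketch of the idea for a single head. It is only an illustration: the toy vectors are made up, the helper names `softmax` and `attention` are ours, and the learned linear projections used by the real module are left out.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy token vectors (made-up numbers), used as Q, K and V alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out = attention(x, x, x)
print(attn_out)
```

Each output row is a blend of all three value vectors, with larger weights on the tokens whose keys best match that row's query.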

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention mechanism.

    Parameters:
    - d_model: Dimension of the model (embedding size)
    - num_heads: Number of attention heads
    """
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, d_model = x.shape

        # Project the input into queries, keys, and values
        Q = self.q_linear(x)
        K = self.k_linear(x)
        V = self.v_linear(x)

        # Split into heads: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Block disallowed positions (mask == 0) so softmax gives them zero weight
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        # Weighted sum of values, then merge the heads back into d_model
        attn_output = torch.matmul(attn_weights, V)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, seq_len, d_model)
        output = self.out_linear(attn_output)

        return output

print("✓ Multi-Head Attention defined")

✓ Multi-Head Attention defined


Position-wise Feed-Forward Network

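A position-wise feed-forward network (FFN) in GPT is like giving each word its own tiny calculator.

After self-attention mixes information between words, the FFN takes each word vector separately and passes it through two small layers: a linear layer that expands the vector from `d_model` to `d_ff` dimensions, followed by a ReLU activation, and a second linear layer that projects it back to `d_model`.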
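
To see what "position-wise" means, here is a small plain-Python sketch that applies the same two layers to each token vector independently. The weights are made-up numbers and the helper names (`relu`, `linear`, `feed_forward`) are ours. Dropout is omitted for clarity.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weight, bias):
    # weight has one row per output feature.
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]

def feed_forward(v, w1, b1, w2, b2):
    # Expand to d_ff, apply ReLU, then project back to d_model.
    return linear(relu(linear(v, w1, b1)), w2, b2)

# Made-up weights with d_model = 2 and d_ff = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5]]
# Each token vector is transformed on its own, with the same weights.
ffn_out = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
print(ffn_out)  # [[3.0, -1.0], [1.0, -0.5]]
```

No information flows between tokens in this step; mixing across positions happens only in attention.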



In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x

print("✓ Feed-Forward Network defined")

✓ Feed-Forward Network defined


Positional Encoding







Transformers (like GPT) read all words at the same time, so they don’t naturally know which word comes first, second, third, etc.

Positional Encoding is like giving each word a timestamp.

Every word gets a small vector that says:
“I am at position 1, 2, 3…”

These vectors are added to the word embeddings.

This helps the model understand order, like knowing the difference between:
“Dogs chase cats” and “Cats chase dogs”.

The `PositionalEncoding` class in the next cell adds encodings only up to the current sequence length, so it handles variable sequence lengths during generation.
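
The "timestamp" can be made concrete with the sinusoidal formula this notebook uses: even dimensions hold a sine and odd dimensions a cosine, with the frequency shrinking as the dimension index grows. A plain-Python sketch for one position follows. The function name `sinusoidal_encoding` is ours, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional encoding for position `pos`."""
    pe = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions shares one frequency.
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even index
        pe.append(math.cos(angle))  # odd index
    return pe

for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_encoding(pos, 4)])
```

Position 0 always maps to alternating zeros and ones, and every later position gets a different pattern.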

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                            (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)

        self.register_buffer('pe', pe)
        self.max_len = max_len

    def forward(self, x):
        # Only add positional encoding up to the sequence length
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :]
        return x

print("✓ Positional Encoding defined")

✓ Positional Encoding defined


Transformer Decoder Block

In GPT, a decoder block has just two main parts:

1. **Masked Self-Attention**: Each token looks only at itself and past tokens. This is how GPT predicts the next word.
2. **Feed-Forward Network (FFN)**: A small neural network that processes each token individually.

Around both parts, GPT uses:

- Residual connections (skip connections)
- Layer normalization
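
The "only past tokens" rule is enforced with a causal mask. Here is a plain-Python sketch of its shape, where the helper name `causal_mask` is ours. It follows the same convention as the mask built inside the GPT class later in this notebook: `True` means attention is allowed.

```python
def causal_mask(size):
    # Row i is the query position, column j the key position.
    # True means "position i may attend to position j".
    return [[j <= i for j in range(size)] for i in range(size)]

for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

The result is lower-triangular: the first token sees only itself, while the last token sees the whole sequence.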



In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.attention(x, mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)

        ff_output = self.feed_forward(x)
        x = x + self.dropout2(ff_output)
        x = self.norm2(x)

        return x

print("✓ Transformer Block defined")

✓ Transformer Block defined


Complete GPT Model

The full model combines the components above and adds context window handling during generation.
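
Context window handling here amounts to keeping only the most recent `max_len` tokens before each forward pass. A minimal plain-Python sketch of that cropping step is shown below. The helper name `crop_context` is hypothetical and not part of the model.

```python
def crop_context(token_ids, max_len):
    # Keep only the most recent max_len tokens; shorter sequences pass through.
    return token_ids if len(token_ids) <= max_len else token_ids[-max_len:]

print(crop_context(list(range(10)), 4))  # [6, 7, 8, 9]
print(crop_context([1, 2], 4))           # [1, 2]
```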

In [None]:
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6,
                 d_ff=2048, max_len=512, dropout=0.1):
        super().__init__()

        self.d_model = d_model
        self.max_len = max_len
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def generate_square_subsequent_mask(self, sz):
        mask = torch.triu(torch.ones(sz, sz), diagonal=1)
        mask = mask == 0
        return mask

    def forward(self, x):
        batch_size, seq_len = x.shape

        x = self.token_embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        mask = self.generate_square_subsequent_mask(seq_len).to(x.device)

        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, mask)

        logits = self.fc_out(x)
        return logits

    def generate(self, idx, max_new_tokens, temperature=1.0):
        """
        Generate tokens autoregressively.
        Handles context window by keeping only the most recent tokens.
        """
        for _ in range(max_new_tokens):
            # Crop context to max_len if needed
            idx_cond = idx if idx.size(1) <= self.max_len else idx[:, -self.max_len:]

            # Get predictions
            logits = self.forward(idx_cond)

            # Focus on last time step
            logits = logits[:, -1, :] / temperature

            # Apply softmax
            probs = F.softmax(logits, dim=-1)

            # Sample
            idx_next = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

print("✓ Complete GPT model defined")

✓ Complete GPT model defined


Create Simple Tokenizer

In [None]:
class SimpleTokenizer:
    def __init__(self, text):
        chars = sorted(list(set(text)))
        self.vocab_size = len(chars)
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}

        print(f"Vocabulary size: {self.vocab_size}")
        print(f"Vocabulary: {''.join(chars)}")

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, tokens):
        return ''.join([self.idx_to_char[idx] for idx in tokens])

print("✓ Tokenizer defined")

✓ Tokenizer defined


Prepare Training Data

In [None]:
training_text = """To be or not to be, that is the question.
Whether it is nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die, to sleep,
No more, and by a sleep to say we end
The heartache and the thousand natural shocks
That flesh is heir to. It is a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to dream. Ay, there is the rub,
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil."""

tokenizer = SimpleTokenizer(training_text)
encoded_text = tokenizer.encode(training_text)
print(f"\nEncoded text length: {len(encoded_text)} tokens")

Vocabulary size: 35
Vocabulary: 
 ,.ADFINOTWabcdefghiklmnopqrstuvwy

Encoded text length: 528 tokens


Create Training Dataset

In [None]:
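def create_dataset(encoded_text, seq_len, batch_size):
    # Note: batch_size is accepted but unused; all sequences are returned together
    data = torch.tensor(encoded_text, dtype=torch.long)
    # Split into non-overlapping chunks of seq_len tokens, dropping the remainder
    num_sequences = len(data) // seq_len
    data = data[:num_sequences * seq_len]
    data = data.view(-1, seq_len)
    # Inputs drop the last token; targets are the same chunk shifted by one
    inputs = data[:, :-1]
    targets = data[:, 1:]
    return inputs, targets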

seq_len = 32
batch_size = 4

inputs, targets = create_dataset(encoded_text, seq_len, batch_size)
print(f"Input shape: {inputs.shape}")
print(f"Target shape: {targets.shape}")
print(f"\nExample input:  {tokenizer.decode(inputs[0].tolist())}")
print(f"Example target: {tokenizer.decode(targets[0].tolist())}")

Input shape: torch.Size([16, 31])
Target shape: torch.Size([16, 31])

Example input:  To be or not to be, that is the
Example target: o be or not to be, that is the 
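
The example above shows the next-token setup: the target is the input shifted one character to the left. Here is a plain-Python sketch of the same chunk-and-shift logic on a tiny string. The helper name `make_pairs` is ours and the sample text is made up.

```python
def make_pairs(token_ids, seq_len):
    # Split into non-overlapping chunks of seq_len tokens, dropping the remainder.
    num_chunks = len(token_ids) // seq_len
    pairs = []
    for c in range(num_chunks):
        chunk = token_ids[c * seq_len:(c + 1) * seq_len]
        # Input is everything but the last token; target is everything but the first.
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

text = "to be or not"
pairs = make_pairs(list(text), seq_len=6)
for inp, tgt in pairs:
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

With 528 tokens and `seq_len = 32`, the same arithmetic gives 528 // 32 = 16 sequences of 31 input tokens each, matching the shapes printed above.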


Initialize Model

In [None]:
model = GPT(
    vocab_size=tokenizer.vocab_size,
    d_model=128,
    num_heads=4,
    num_layers=3,
    d_ff=512,
    max_len=128,  # Larger context window for generation
    dropout=0.1
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")
print(f"Device: {device}")

Model parameters: 603,811
Device: cuda


Training Loop

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

num_epochs = 500
print_every = 50

inputs = inputs.to(device)
targets = targets.to(device)

print("Starting training...\n")

for epoch in range(num_epochs):
    model.train()

    logits = model(inputs)
    loss = criterion(logits.reshape(-1, tokenizer.vocab_size), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % print_every == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

print("\nTraining complete!")

Starting training...

Epoch 50/500, Loss: 0.7647
Epoch 100/500, Loss: 0.1926
Epoch 150/500, Loss: 0.0940
Epoch 200/500, Loss: 0.0763
Epoch 250/500, Loss: 0.0442
Epoch 300/500, Loss: 0.0432
Epoch 350/500, Loss: 0.0431
Epoch 400/500, Loss: 0.0233
Epoch 450/500, Loss: 0.0269
Epoch 500/500, Loss: 0.0245

Training complete!
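
To put these loss values in context: with a vocabulary of 35 characters, a model that guesses uniformly would have a cross-entropy of ln(35), about 3.56. The short sketch below computes that baseline and converts a loss value back into an average probability for the correct character. Reading the very low final loss as memorization of the 16 training sequences is our interpretation, not something the run itself verifies.

```python
import math

vocab_size = 35  # vocabulary size reported by the tokenizer above

# Cross-entropy of a model that assigns equal probability to every character.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"Uniform-guess loss: {uniform_loss:.4f}")

# exp(-loss) is the geometric-mean probability given to the correct next character.
final_loss = 0.0245
print(f"Mean correct-token probability: {math.exp(-final_loss):.4f}")
```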


Text Generation Function

In [None]:
def generate_text(model, tokenizer, prompt, max_length=100, temperature=0.8):
    model.eval()
    tokens = tokenizer.encode(prompt)
    tokens = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).to(device)

    with torch.no_grad():
        generated_tokens = model.generate(tokens, max_length, temperature)

    generated_text = tokenizer.decode(generated_tokens[0].cpu().tolist())
    return generated_text

print("✓ Generation function ready")

✓ Generation function ready


## Step 13: Test Text Generation!

In [None]:
print("=" * 60)
print("TEXT GENERATION EXAMPLES")
print("=" * 60)

# Example 1
prompt1 = "To be"
generated1 = generate_text(model, tokenizer, prompt1, max_length=50, temperature=0.8)
print(f"\nPrompt: '{prompt1}'")
print(f"Generated: '{generated1}'")
print("-" * 60)

# Example 2
prompt2 = "To die"
generated2 = generate_text(model, tokenizer, prompt2, max_length=50, temperature=0.8)
print(f"\nPrompt: '{prompt2}'")
print(f"Generated: '{generated2}'")
print("-" * 60)

# Example 3
prompt3 = "The"
generated3 = generate_text(model, tokenizer, prompt3, max_length=50, temperature=0.8)
print(f"\nPrompt: '{prompt3}'")
print(f"Generated: '{generated3}'")
print("-" * 60)

# Example 4: Lower temperature
prompt4 = "To be"
generated4 = generate_text(model, tokenizer, prompt4, max_length=50, temperature=0.3)
print(f"\nPrompt: '{prompt4}' (temperature=0.3, more deterministic)")
print(f"Generated: '{generated4}'")
print("-" * 60)

# Example 5: Higher temperature
prompt5 = "To be"
generated5 = generate_text(model, tokenizer, prompt5, max_length=50, temperature=1.5)
print(f"\nPrompt: '{prompt5}' (temperature=1.5, more random)")
print(f"Generated: '{generated5}'")

TEXT GENERATION EXAMPLES

Prompt: 'To be'
Generated: 'To be or not to be, that is the the the the the the the'
------------------------------------------------------------

Prompt: 'To die'
Generated: 'To die, to ro s ro ragobe, co ng to co ingato to to to t'
------------------------------------------------------------

Prompt: 'The'
Generated: 'The ort be, t is the the the the thathe the the the t'
------------------------------------------------------------

Prompt: 'To be' (temperature=0.3, more deterministic)
Generated: 'To be or not to be, that is the the the the the the the'
------------------------------------------------------------

Prompt: 'To be' (temperature=1.5, more random)
Generated: 'To be or not to be, that is the the the the the the the'
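
The temperature used in these examples divides the logits before the softmax. A plain-Python sketch with made-up logits shows the effect. The helper name `softmax_with_temperature` is ours.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing by a small temperature sharpens the distribution;
    # a large temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

When one logit dominates by a wide margin, as can happen in a model that has fit its training text very closely, even a higher temperature leaves most of the probability on that token, so samples may change little.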


## Step 14: Try Your Own Prompts!

In [None]:
# Try your own prompts here!
your_prompt = "hi"  # Change this to any text
generated = generate_text(model, tokenizer, your_prompt, max_length=100, temperature=0.8)
print(f"Your prompt: '{your_prompt}'")
print(f"Generated: '{generated}'")

Your prompt: 'hi'
Generated: 'hin to a sufond flertheind cheFond che he mindr ming mings mings ming ming ming mings mings matoby s s'


## Summary

### What I Built:

- Complete GPT architecture from scratch
- Character-level tokenizer
- Training loop with next-token prediction
- Text generation with temperature control
- Trained on Shakespeare text

### How GPT Works:

1. **Input**: Sequence of token IDs
2. **Embedding**: Tokens → dense vectors
3. **Positional Encoding**: Add position information
4. **Transformer Blocks**: Self-attention + Feed-forward
5. **Output**: Predict next token probabilities
6. **Generation**: Sample tokens autoregressively

### Key Innovation:

Causal self-attention allows the model to learn context from previous tokens while preventing it from looking ahead.

