# LLM from Scratch - Hands-On Curriculum

This Jupyter notebook provides hands-on exercises for implementing a Large Language Model from scratch using PyTorch.

## Part 1: Core Transformer Architecture

In this exercise, you'll implement the fundamental components of a Transformer model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import sys
import os

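# Add the project root to the import path so that `src` can be imported.
# `__file__` is not defined inside a notebook, so resolve the path relative to
# the current working directory (assumes the notebook is run from its own folder).
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))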

# Import our implementations
from src.models.attention import MultiHeadAttention
from src.models.mlp import MLP
from src.models.normalization import LayerNorm
from src.models.transformer import TransformerBlock

### Exercise 1.1: Implement Basic Attention Mechanism

Implement a basic scaled dot-product attention mechanism.

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        query: Query tensor of shape (batch_size, num_heads, seq_length, head_dim)
        key: Key tensor of shape (batch_size, num_heads, seq_length, head_dim)
        value: Value tensor of shape (batch_size, num_heads, seq_length, head_dim)
        mask: Optional mask tensor
        
    Returns:
        Output tensor and attention weights
    """
    # TODO: Implement scaled dot-product attention
    # 1. Compute attention scores: Q @ K^T
    # 2. Scale by sqrt(head_dim)
    # 3. Apply mask if provided
    # 4. Apply softmax
    # 5. Apply attention to values: attention @ V
    
    # Your implementation here
    pass

# Test your implementation
batch_size, num_heads, seq_length, head_dim = 2, 4, 8, 16
query = torch.randn(batch_size, num_heads, seq_length, head_dim)
key = torch.randn(batch_size, num_heads, seq_length, head_dim)
value = torch.randn(batch_size, num_heads, seq_length, head_dim)

# output, attention_weights = scaled_dot_product_attention(query, key, value)
# print(f"Output shape: {output.shape}")
# print(f"Attention weights shape: {attention_weights.shape}")
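
If you want something to check your result against, the five TODO steps can be traced by hand on a tiny input. The following is a dependency-free sketch for a single head, not the reference solution: the helper names `softmax` and `attention_single_head` and the `causal` flag are illustrative assumptions, and your PyTorch version should operate on batched tensors instead of nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_single_head(q, k, v, causal=False):
    """q, k, v: lists of seq_length vectors, each of length head_dim."""
    scale = math.sqrt(len(q[0]))
    outputs, all_weights = [], []
    for i, q_i in enumerate(q):
        # Steps 1-2: dot product with every key, scaled by sqrt(head_dim)
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        # Step 3: a causal mask hides the positions after i
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Step 4: softmax turns the scores into weights that sum to 1
        weights = softmax(scores)
        # Step 5: weighted sum of the value vectors
        out = [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention_single_head(q, k, v, causal=True)
print(w[0])    # with the causal mask, position 0 attends only to itself
print(out[0])
```

With the causal mask on, the first row of weights puts all its mass on position 0, so the first output row equals the first value vector.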

### Exercise 1.2: Implement Multi-Head Attention

Use our provided MultiHeadAttention class to process sample data.

In [None]:
# Create multi-head attention
d_model = 128
num_heads = 8
attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

# Create sample input
batch_size, seq_length = 2, 10
query = torch.randn(batch_size, seq_length, d_model)
key = torch.randn(batch_size, seq_length, d_model)
value = torch.randn(batch_size, seq_length, d_model)

# Process with multi-head attention
output, attention_weights = attention(query, key, value)

print(f"Input shape: {query.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
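
The internals of the provided `MultiHeadAttention` class are not shown here, but multi-head attention typically splits each `d_model`-sized vector into `num_heads` chunks of size `head_dim = d_model // num_heads`, attends within each chunk, and concatenates the results. A minimal sketch of that bookkeeping, with hypothetical helper names `split_heads` and `merge_heads`:

```python
def split_heads(vector, num_heads):
    """Split one d_model-sized vector into num_heads chunks of head_dim."""
    d_model = len(vector)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate the per-head chunks back into one d_model-sized vector."""
    return [value for head in heads for value in head]

token = list(range(128))              # stand-in for one token's features
heads = split_heads(token, num_heads=8)
restored = merge_heads(heads)
print(len(heads), len(heads[0]))      # number of heads, head_dim
```

With `d_model = 128` and `num_heads = 8` as in the cell above, each head works on 16 features.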

### Exercise 1.3: Implement Transformer Block

Combine attention, MLP, and normalization to create a complete transformer block.

In [None]:
# Create transformer block
transformer_block = TransformerBlock(d_model=128, num_heads=8, dropout=0.1)

# Create sample input
batch_size, seq_length, d_model = 2, 10, 128
x = torch.randn(batch_size, seq_length, d_model)

# Process with transformer block
output = transformer_block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
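
The way the three components are wired together matters as much as the components themselves. The sketch below shows the pre-norm residual pattern used by many recent models; whether the provided `TransformerBlock` is pre-norm or post-norm is an assumption you should check in `src/models/transformer.py`. The sub-layers are passed in as plain callables so the wiring stays visible.

```python
def pre_norm_block(x, attention, mlp, norm1, norm2):
    """Apply one pre-norm Transformer block to a single vector x."""
    # Sub-layer 1: normalize, attend, then add the result back to the input
    attended = attention(norm1(x))
    x = [a + b for a, b in zip(x, attended)]
    # Sub-layer 2: normalize, apply the MLP, then add again
    transformed = mlp(norm2(x))
    return [a + b for a, b in zip(x, transformed)]

def identity(v):
    return v

def halve(v):
    """Stand-in for the attention sub-layer."""
    return [0.5 * t for t in v]

def ones(v):
    """Stand-in for the MLP sub-layer."""
    return [1.0 for _ in v]

result = pre_norm_block([2.0, 4.0], halve, ones, identity, identity)
print(result)
```

Because each sub-layer only adds to `x`, the input and output shapes match, which is what the shape printout above confirms.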

## Part 2: Training a Tiny LLM

In this exercise, you'll train a small language model on sample text data.

In [None]:
from src.tokenizers.byte_level import ByteLevelTokenizer
from src.train.data import create_dataloader
from src.train.trainer import Trainer

### Exercise 2.1: Implement Byte-Level Tokenization

Use our byte-level tokenizer to encode and decode text.

In [None]:
# Create tokenizer
tokenizer = ByteLevelTokenizer()

# Test encoding and decoding
text = "Hello, world! This is a test."
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print(f"Original text: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Match: {text == decoded}")
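
A byte-level tokenizer needs no learned vocabulary, because every string already has a UTF-8 byte representation. The sketch below assumes `ByteLevelTokenizer` maps text to raw UTF-8 byte values, which matches the `vocab_size=256` used later in this notebook; any special tokens the real class adds are not modeled. The function names are illustrative.

```python
def encode_bytes(text):
    """Map text to a list of integers in [0, 255], one per UTF-8 byte."""
    return list(text.encode("utf-8"))

def decode_bytes(token_ids):
    """Invert encode_bytes; invalid byte sequences become U+FFFD."""
    return bytes(token_ids).decode("utf-8", errors="replace")

sample = "Héllo"
ids = encode_bytes(sample)
print(ids)
print(decode_bytes(ids) == sample)
```

Non-ASCII characters such as `é` take more than one byte, so the token count can exceed the character count.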

### Exercise 2.2: Create and Train a Tiny LLM

Create a simple language model and train it on sample data.

In [None]:
class TinyLLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, num_heads=8, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, dropout=0.1)
            for _ in range(num_layers)
        ])
        self.output_projection = nn.Linear(d_model, vocab_size)
        
    def forward(self, input_ids, labels=None):
        x = self.embedding(input_ids)
        for layer in self.layers:
            x = layer(x)
        logits = self.output_projection(x)
        
        loss = None
        if labels is not None:
            # Shift for next-token prediction
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss = F.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),
                ignore_index=-100
            )
            
        class Output:
            def __init__(self, logits, loss):
                self.logits = logits
                self.loss = loss
        return Output(logits, loss)

# Create sample data
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing enables computers to understand text.",
    "Deep learning models have revolutionized many fields.",
    "Transformers are the foundation of modern language models."
]

# Create model
model = TinyLLM(vocab_size=256, d_model=128, num_heads=8, num_layers=2)

# Create dataloader
dataloader = create_dataloader(texts, batch_size=2, seq_length=32)

# Print model information
total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")

# Note: Training would go here in a complete implementation
print("Model created successfully!")
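
The shift inside `TinyLLM.forward` is the core of next-token training: the logits at position `t` are scored against the token at position `t + 1`, and targets equal to `ignore_index` are left out of the average. Here is a dependency-free sketch of the same computation for one sequence; the function name `next_token_loss` is illustrative, and it assumes at least one target is not ignored.

```python
import math

IGNORE_INDEX = -100

def next_token_loss(logits, labels):
    """logits: one list of vocab scores per position; labels: one id per position."""
    shift_logits = logits[:-1]   # predictions made at positions 0..T-2
    shift_labels = labels[1:]    # targets are the tokens at positions 1..T-1
    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue             # ignored targets do not contribute
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])   # -log softmax(scores)[target]
    return sum(losses) / len(losses)

logits = [[0.0, 0.0, 0.0] for _ in range(4)]   # uniform over a 3-token vocab
labels = [0, 2, 1, IGNORE_INDEX]
loss = next_token_loss(logits, labels)
print(loss)
```

With uniform logits the loss equals the log of the vocabulary size, which is a handy sanity check for an untrained model.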

## Part 3: Modern Architecture Improvements

In this exercise, you'll explore modern improvements to the Transformer architecture.

In [None]:
from src.models.normalization import RMSNorm
from src.models.positional import RotaryPositionalEncoding
from src.models.mlp import SwiGLUMLP

### Exercise 3.1: Compare LayerNorm and RMSNorm

Compare the behavior of LayerNorm and RMSNorm.

In [None]:
# Create normalization layers
layer_norm = LayerNorm(d_model=128)
rms_norm = RMSNorm(d_model=128)

# Create sample input
batch_size, seq_length, d_model = 2, 10, 128
x = torch.randn(batch_size, seq_length, d_model)

# Apply normalization
ln_output = layer_norm(x)
rms_output = rms_norm(x)

print(f"Input shape: {x.shape}")
print(f"LayerNorm output shape: {ln_output.shape}")
print(f"RMSNorm output shape: {rms_output.shape}")

# Check normalization properties
ln_mean = ln_output.mean(dim=-1)
ln_std = ln_output.std(dim=-1)
rms_mean = rms_output.mean(dim=-1)
rms_std = rms_output.std(dim=-1)

print(f"LayerNorm mean (should be ~0): {ln_mean.mean():.6f}")
print(f"LayerNorm std (should be ~1): {ln_std.mean():.6f}")
print(f"RMSNorm mean: {rms_mean.mean():.6f}")
print(f"RMSNorm std (~1 here only because the input mean is ~0): {rms_std.mean():.6f}")
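
The difference between the two layers is easiest to see on an input whose mean is far from zero. This sketch writes out both formulas for one vector, assuming a learnable gain of 1 and a bias of 0; the provided `LayerNorm` and `RMSNorm` classes may add those parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the (biased) standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; the mean is not removed."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def mean(x):
    return sum(x) / len(x)

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

x = [3.0, 4.0, 5.0, 8.0]
ln = layer_norm(x)
rn = rms_norm(x)
print(mean(ln), rms(ln))
print(mean(rn), rms(rn))
```

LayerNorm recenters the vector, while RMSNorm only rescales it, so the RMSNorm output keeps a nonzero mean. Note also that `torch.std` uses the unbiased estimator by default, so the printed LayerNorm std above is close to, but not exactly, 1.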

### Exercise 3.2: Implement Rotary Positional Encoding

Use Rotary Positional Encoding (RoPE) to add positional information.

In [None]:
# Create RoPE
rope = RotaryPositionalEncoding(d_model=128, max_seq_length=512)

# Create sample input
batch_size, seq_length, d_model = 2, 10, 128
x = torch.randn(batch_size, seq_length, d_model)

# Apply RoPE
x_with_rope = rope(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {x_with_rope.shape}")
print(f"RoPE applied successfully: {not torch.allclose(x, x_with_rope)}")
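
RoPE encodes position by rotating pairs of features through an angle proportional to the position. The sketch below rotates adjacent pairs `(x[2i], x[2i+1])` with the usual base of 10000. The pairing convention is an assumption: some implementations pair feature `i` with feature `i + d/2` instead, and in full models RoPE is usually applied to queries and keys rather than to the raw input.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each adjacent feature pair by position * base**(-i / d)."""
    d = len(vector)
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vector[i], vector[i + 1]
        rotated += [a * cos_t - b * sin_t, a * sin_t + b * cos_t]
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_a = dot(rope(q, 3), rope(k, 1))   # positions 3 and 1, offset 2
score_b = dot(rope(q, 7), rope(k, 5))   # positions 7 and 5, offset 2
print(score_a, score_b)
```

The two scores agree because the dot product of rotated vectors depends only on the offset between positions, which is the property that makes RoPE a relative encoding.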

## Part 4: Scaling Up

In this exercise, you'll work with BPE tokenization and advanced training techniques.

In [None]:
from src.tokenizers.bpe import BPETokenizer

### Exercise 4.1: Train BPE Tokenizer

Train a BPE tokenizer on sample text data.

In [None]:
# Create sample training data
training_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing enables computers to understand text.",
    "Deep learning models have revolutionized many fields.",
    "Transformers are the foundation of modern language models.",
    "Large language models can generate human-like text.",
    "Attention mechanisms help models focus on relevant information.",
    "Neural networks learn patterns from data.",
    "PyTorch provides flexible tools for deep learning research.",
    "Open-source software accelerates scientific progress."
]

# Create and train BPE tokenizer
bpe_tokenizer = BPETokenizer(vocab_size=500)
bpe_tokenizer.train(training_texts)

# Test tokenization
test_text = "Transformers are powerful models for NLP tasks."
encoded = bpe_tokenizer.encode(test_text)
decoded = bpe_tokenizer.decode(encoded)

print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Vocabulary size: {bpe_tokenizer.vocab_size}")
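
BPE training repeats one step: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. The following is a minimal character-level sketch of two such steps. The helper names are illustrative, and the provided tokenizer's `train` method may differ in details such as byte-level symbols, pre-tokenization, and tie-breaking.

```python
from collections import Counter

def most_frequent_pair(words):
    """words: list of symbol lists. Returns the most common adjacent pair."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

words = [list("lower"), list("lowest"), list("low")]
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = [merge_pair(w, pair) for w in words]
print(merges)
print(words)
```

After two merges the shared prefix `low` has become a single symbol in all three words. A target of `vocab_size=500` simply means repeating this step until the vocabulary reaches that size or no pairs remain, which can happen early on a corpus as small as the one above.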

## Part 5: Mixture of Experts

In this exercise, you'll implement and experiment with Mixture of Experts.

In [None]:
from src.moe.gating import TopKGating
from src.moe.expert import Expert
from src.moe.moe_layer import MoELayer

### Exercise 5.1: Implement Top-K Gating

Implement and test a Top-K gating mechanism.

In [None]:
# Create gating network
gating = TopKGating(input_size=128, num_experts=4, k=2)

# Create sample input
batch_size, seq_length, input_size = 2, 10, 128
x = torch.randn(batch_size, seq_length, input_size)

# Apply gating
gate_logits, gate_weights, expert_indices = gating(x)

print(f"Input shape: {x.shape}")
print(f"Gate logits shape: {gate_logits.shape}")
print(f"Gate weights shape: {gate_weights.shape}")
print(f"Expert indices shape: {expert_indices.shape}")

# Check properties
print(f"Top-K: {expert_indices.shape[-1]}")
print(f"Weights sum to 1: {torch.allclose(gate_weights.sum(dim=-1), torch.ones_like(gate_weights.sum(dim=-1)))}")
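
For a single token, top-k gating comes down to picking the `k` largest gate logits and normalizing over just those. The sketch below applies a softmax to the selected logits only. This is one common choice and an assumption here: the provided `TopKGating` may instead take a softmax over all experts first and then renormalize.

```python
import math

def top_k_gate(logits, k):
    """Return (expert_indices, weights) for one token's gate logits."""
    # Indices of the k largest logits; ties keep their original order
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in indices)
    exps = [math.exp(logits[i] - peak) for i in indices]
    total = sum(exps)
    return indices, [e / total for e in exps]

indices, weights = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
print(indices, weights)
```

Experts 1 and 3 have equal logits, so they split the routing weight evenly, while the other two experts receive nothing for this token.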

### Exercise 5.2: Create MoE Layer

Create and test a Mixture of Experts layer.

In [None]:
# Create MoE layer
moe_layer = MoELayer(
    input_size=128,
    hidden_size=256,
    output_size=128,
    num_experts=4,
    k=2
)

# Create sample input
batch_size, seq_length, input_size = 2, 10, 128
x = torch.randn(batch_size, seq_length, input_size)

# Apply MoE
output, aux_loss = moe_layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Auxiliary loss: {aux_loss.item():.6f}")
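
The auxiliary loss exists to stop the router from sending every token to the same expert. How `MoELayer` computes it is not shown in this notebook. As an assumption, the sketch below uses one widely used formulation from the Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by its mean router probability.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """router_probs: per-token probability lists; assignments: chosen expert per token."""
    num_tokens = len(router_probs)
    loss = 0.0
    for e in range(num_experts):
        # Fraction of tokens actually dispatched to expert e
        fraction = sum(1 for a in assignments if a == e) / num_tokens
        # Average probability the router assigned to expert e
        mean_prob = sum(p[e] for p in router_probs) / num_tokens
        loss += fraction * mean_prob
    return num_experts * loss

balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

Under this formulation a perfectly uniform router scores 1.0, and the value grows as routing collapses onto fewer experts.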

## Part 6: Supervised Fine-Tuning

In this exercise, you'll work with instruction datasets and fine-tuning.

In [None]:
from src.sft.instruction_data import InstructionDataset
from src.sft.loss import CausalLMLossWithMasking

### Exercise 6.1: Create Instruction Dataset

Create and process an instruction dataset.

In [None]:
# Create sample instruction data
instructions = [
    ("What is 2+2?", "2+2 equals 4."),
    ("How many days in a week?", "There are 7 days in a week."),
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Who wrote Romeo and Juliet?", "William Shakespeare wrote Romeo and Juliet."),
    ("What is the largest planet?", "Jupiter is the largest planet in our solar system.")
]

# Create dataset
dataset = InstructionDataset(instructions)

print(f"Dataset size: {len(dataset)}")

# Test data retrieval
sample = dataset[0]
print(f"Sample format: {type(sample)}")
if isinstance(sample, tuple):
    print(f"Sample length: {len(sample)}")
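
In supervised fine-tuning the model should learn to produce the response, not to reproduce the prompt. A common way to do this is to set the prompt positions in `labels` to the ignore index, so the loss covers response tokens only. The sketch below assumes byte-level token IDs and the `-100` ignore index that `TinyLLM` uses above; the real `InstructionDataset` and `CausalLMLossWithMasking` may format examples differently, and `build_sft_example` is a hypothetical helper.

```python
IGNORE_INDEX = -100

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

prompt_ids = list("What is 2+2?".encode("utf-8"))
response_ids = list(" 2+2 equals 4.".encode("utf-8"))
input_ids, labels = build_sft_example(prompt_ids, response_ids)
print(len(input_ids), labels.count(IGNORE_INDEX))
```

The model still sees the prompt tokens as input; they are simply excluded from the loss.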

## Part 7: Reward Modeling

In this exercise, you'll work with preference datasets and reward models.

In [None]:
from src.reward.preference_data import PreferenceExample
from src.reward.model import RewardModel
from src.reward.loss import BradleyTerryLoss

### Exercise 7.1: Create Preference Dataset

Create and work with preference examples.

In [None]:
# Create preference examples
examples = [
    PreferenceExample(
        prompt="How do you make a cake?",
        chosen_response="First, gather ingredients like flour, eggs, and sugar. Then mix them together and bake.",
        rejected_response="You eat it raw.",
        chosen_score=0.9,
        rejected_score=0.1
    ),
    PreferenceExample(
        prompt="What is the weather like?",
        chosen_response="It's sunny and warm today with clear skies.",
        rejected_response="I don't know.",
        chosen_score=0.8,
        rejected_score=0.2
    )
]

print(f"Created {len(examples)} preference examples")
for i, example in enumerate(examples):
    print(f"\nExample {i+1}:")
    print(f"  Prompt: {example.prompt}")
    print(f"  Chosen: {example.chosen_response}")
    print(f"  Rejected: {example.rejected_response}")
    print(f"  Chosen Score: {example.chosen_score}")
    print(f"  Rejected Score: {example.rejected_score}")

### Exercise 7.2: Implement Reward Model

Create and test a reward model.

In [None]:
# Create a simple base model for the reward model
class SimpleBaseModel(nn.Module):
    def __init__(self, hidden_size=128, vocab_size=1000):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.linear = nn.Linear(hidden_size, hidden_size)
        
    def forward(self, input_ids, attention_mask=None):
        x = self.embedding(input_ids)
        x = self.linear(x)
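        # Random placeholder logits; size the last dimension from the embedding
        # table instead of hard-coding the vocabulary size.
        logits = torch.randn(input_ids.shape[0], input_ids.shape[1], self.embedding.num_embeddings)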
        
        class MockOutput:
            def __init__(self, last_hidden_state, logits):
                self.last_hidden_state = last_hidden_state
                self.logits = logits
        return MockOutput(x, logits)

# Create base model
base_model = SimpleBaseModel(hidden_size=128, vocab_size=1000)

# Create reward model
reward_model = RewardModel(base_model, hidden_size=128)

# Create sample input
batch_size, seq_length = 2, 20
input_ids = torch.randint(0, 1000, (batch_size, seq_length))

# Get rewards
rewards = reward_model(input_ids)

print(f"Input shape: {input_ids.shape}")
print(f"Rewards shape: {rewards.shape}")
print(f"Rewards: {rewards}")
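
A reward model is trained so that the chosen response scores higher than the rejected one. Under the Bradley-Terry model the per-pair loss is `-log(sigmoid(r_chosen - r_rejected))`. The sketch below computes it for scalar rewards in a numerically stable form. The imported `BradleyTerryLoss` presumably does the same over batches of tensors, but its exact interface is not shown here.

```python
import math

def bradley_terry_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(margin)), computed as softplus(-margin) without overflow."""
    margin = chosen_reward - rejected_reward
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(bradley_terry_loss(0.0, 0.0))   # no preference yet
print(bradley_terry_loss(2.0, 0.0))   # correct ordering
print(bradley_terry_loss(0.0, 2.0))   # wrong ordering
```

Equal rewards give a loss of `log(2)`. The loss shrinks toward zero as the chosen reward pulls ahead and grows roughly linearly when the ordering is wrong.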

## Part 8: RLHF with PPO

In this exercise, you'll work with PPO for reinforcement learning from human feedback.

In [None]:
from src.rlhf.ppo import PolicyValueNetwork, PPOTrainer, PPOConfig

### Exercise 8.1: Create PPO Configuration

Create and examine PPO configuration parameters.

In [None]:
# Create PPO configuration
config = PPOConfig(
    ppo_epochs=4,
    batch_size=8,
    clip_epsilon=0.2,
    learning_rate=1e-5
)

print("PPO Configuration:")
print(f"  PPO Epochs: {config.ppo_epochs}")
print(f"  Batch Size: {config.batch_size}")
print(f"  Clip Epsilon: {config.clip_epsilon}")
print(f"  Learning Rate: {config.learning_rate}")
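
The `clip_epsilon` parameter controls how far one update may move the policy away from the one that generated the data. The sketch below computes the standard PPO clipped surrogate objective for a single action. It illustrates the formula only and is not the code inside the provided `PPOTrainer`.

```python
import math

def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_epsilon=0.2):
    """Per-action PPO objective (to be maximized)."""
    ratio = math.exp(new_logprob - old_logprob)   # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_epsilon, min(1.0 + clip_epsilon, ratio))
    # Taking the minimum makes the objective a pessimistic bound
    return min(ratio * advantage, clipped_ratio * advantage)

gain_capped = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
penalty_kept = ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0)
print(gain_capped, penalty_kept)
```

With a ratio of 1.5 and a positive advantage, the objective is capped at `1 + clip_epsilon`, so there is no incentive to push the ratio further. With a negative advantage the unclipped term is the smaller one, so the full penalty still applies.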

## Part 9: RLHF with GRPO

In this exercise, you'll work with GRPO (Group Relative Policy Optimization).

In [None]:
from src.rlhf.grpo import GRPOTrainer, GRPOConfig

### Exercise 9.1: Create GRPO Configuration

Create and examine GRPO configuration parameters.

In [None]:
# Create GRPO configuration
config = GRPOConfig(
    grpo_epochs=4,
    batch_size=8,
    num_completions_per_prompt=4,
    group_size=4
)

print("GRPO Configuration:")
print(f"  GRPO Epochs: {config.grpo_epochs}")
print(f"  Batch Size: {config.batch_size}")
print(f"  Completions per Prompt: {config.num_completions_per_prompt}")
print(f"  Group Size: {config.group_size}")
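
GRPO replaces the learned value function of PPO with a baseline computed from a group of completions sampled for the same prompt. The sketch below standardizes the rewards within one group of four, matching `group_size=4` above. Using the population standard deviation and a small `eps` are assumptions, and the provided `GRPOTrainer` may normalize differently.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for a prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

Completions that beat the group average get a positive advantage and the rest get a negative one. If every completion earns the same reward, all advantages are zero and the group contributes no learning signal.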

## Part 10: Advanced Semantic Processing

In this exercise, you'll work with advanced semantic processing techniques.

In [None]:
from src.semantic.processing import SemanticConfig, SemanticProcessor

### Exercise 10.1: Create Semantic Configuration

Create and examine semantic processing configuration parameters.

In [None]:
# Create semantic configuration
config = SemanticConfig(
    hidden_size=768,
    concept_dim=512,
    num_concepts=1024,
    hyperbolic_dim=128
)

print("Semantic Processing Configuration:")
print(f"  Hidden Size: {config.hidden_size}")
print(f"  Concept Dimension: {config.concept_dim}")
print(f"  Number of Concepts: {config.num_concepts}")
print(f"  Hyperbolic Dimension: {config.hyperbolic_dim}")

## Efficiency Optimizations

In this exercise, you'll work with quantization and compression techniques.

In [None]:
from src.utils.quantization import apply_quantization
from src.utils.compression import apply_pruning

### Exercise: Apply Model Quantization

Apply quantization to reduce model size and improve inference speed.

In [None]:
# Create a simple model for demonstration
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(128, 256)
        self.linear2 = nn.Linear(256, 128)
        
    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = self.linear2(x)
        return x

# Create original model
original_model = SimpleModel()
original_size = sum(p.numel() for p in original_model.parameters())

print(f"Original model parameters: {original_size:,}")

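# Apply 8-bit quantization.
# Note: numel() counts elements, not bytes, so the ratio below only moves away
# from 1.00x if apply_quantization changes how parameters are stored.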
quantized_model = apply_quantization(original_model, bits=8)
quantized_size = sum(p.numel() for p in quantized_model.parameters())

print(f"Quantized model parameters: {quantized_size:,}")
print(f"Compression ratio: {original_size / quantized_size:.2f}x")

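# Apply pruning.
# Note: unstructured pruning zeroes weights without removing them, so numel()
# may report the same count; count nonzero entries to measure sparsity.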
pruned_model = apply_pruning(original_model, sparsity_ratio=0.5)
pruned_size = sum(p.numel() for p in pruned_model.parameters())

print(f"Pruned model parameters: {pruned_size:,}")
print(f"Pruning compression ratio: {original_size / pruned_size:.2f}x")
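
To see what the two techniques do to individual weights, here is a dependency-free sketch of symmetric per-tensor int8 quantization and magnitude pruning on a four-element weight list. These are common baseline schemes chosen for illustration; what `apply_quantization` and `apply_pruning` do internally is not shown in this notebook, and the function names below are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in q_weights]

def magnitude_prune(weights, sparsity_ratio):
    """Zero out the smallest-magnitude weights; the element count is unchanged."""
    num_to_zero = int(len(weights) * sparsity_ratio)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_zero]:
        pruned[i] = 0.0
    return pruned

weights = [0.5, -1.27, 0.02, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = magnitude_prune(weights, sparsity_ratio=0.5)
print(q, scale)
print(pruned)
```

Quantization keeps every weight but stores it in fewer bits, with a rounding error of at most half the scale. Pruning keeps full precision but sets the smallest weights to zero. Neither changes the number of elements, so savings show up in bytes or in nonzero counts rather than in `numel()`.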

## Conclusion

This notebook has provided hands-on exercises covering all 10 parts of the LLM from Scratch curriculum:

1. Core Transformer Architecture
2. Training a Tiny LLM
3. Modern Architecture Improvements
4. Scaling Up
5. Mixture of Experts
6. Supervised Fine-Tuning
7. Reward Modeling
8. RLHF with PPO
9. RLHF with GRPO
10. Advanced Semantic Processing

You've also explored efficiency optimization techniques like quantization and pruning.

Each exercise builds upon the previous ones, providing a comprehensive understanding of modern LLM implementation techniques.