# Large Language Model Training Tutorial

Welcome to the comprehensive tutorial on training Large Language Models! This notebook will guide you through:

1. **Understanding Transformers** - The architecture behind modern LLMs
2. **Building from Scratch** - Implementing a simple language model
3. **Training Process** - Learning how to train your own model
4. **Text Generation** - Using your model to generate text
5. **Fine-tuning** - Adapting pre-trained models

Let's start by importing the necessary libraries:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
import math
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 1. Understanding the Transformer Architecture

Let's start by implementing the core components of a transformer model:

### Multi-Head Attention

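The attention mechanism is the heart of transformers. For each position, it computes a weighted average of the value vectors from every position in the sequence, with weights given by a softmax over scaled query-key dot products. Multi-head attention runs this in parallel over `num_heads` subspaces of size `d_model // num_heads` and concatenates the results.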
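
Before the full PyTorch implementation, it can help to trace the computation for a single head by hand. The following is a minimal, dependency-free sketch; the names `toy_softmax` and `toy_attention` and the toy values are illustrative assumptions, not part of the model code.

```python
import math

def toy_softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = toy_softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 positions, d_k = 2 (values are made up for illustration)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attn_out, attn_weights = toy_attention(Q, K, V)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1: every query distributes one unit of attention across the keys.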

In [None]:
class MultiHeadAttention(nn.Module):
    """Multi-head self-attention mechanism"""
    
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections
        Q = self.w_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        attention_output = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads and project
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        return self.w_o(attention_output)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        return torch.matmul(attention_weights, V)

# Test the attention mechanism
attention = MultiHeadAttention(d_model=128, num_heads=8)
x = torch.randn(2, 10, 128)  # batch_size=2, seq_len=10, d_model=128
output = attention(x, x, x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("✅ Multi-head attention working correctly!")

### Transformer Block

Now let's build a complete transformer block that combines attention with a feed-forward network:
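
The block in this section wraps each sub-layer in a residual connection followed by layer normalization (the "post-LN" arrangement). As a rough sketch of that pattern in plain Python, assuming no learned scale or shift and using a hypothetical `toy_sublayer` in place of attention or the feed-forward network:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def post_ln_sublayer(x, sublayer):
    """Post-LN residual pattern: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Hypothetical stand-in for attention or the feed-forward network
def toy_sublayer(v):
    return [2.0 * vi for vi in v]

normed = post_ln_sublayer([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print([round(v, 3) for v in normed])
```

The residual path lets the input pass through unchanged when the sub-layer contributes little, and the normalization keeps activations on a comparable scale from layer to layer.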

In [None]:
class TransformerBlock(nn.Module):
    """Single transformer block with self-attention and feed-forward network"""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attended = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attended))
        
        # Feed-forward with residual connection
        fed_forward = self.ffn(x)
        x = self.norm2(x + self.dropout(fed_forward))
        
        return x

# Test the transformer block
transformer_block = TransformerBlock(d_model=128, num_heads=8, d_ff=512)
x = torch.randn(2, 10, 128)
output = transformer_block(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("✅ Transformer block working correctly!")

## 2. Building a Simple Language Model

Now let's create a complete language model using our transformer components:
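
The language model relies on a causal mask so that each position can attend only to itself and earlier positions. A small sketch of that pattern with nested lists (the helper name `toy_causal_mask` is an assumption made for this illustration):

```python
def toy_causal_mask(seq_len):
    """Row i is the query position; True marks key positions j <= i it may attend to."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in toy_causal_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

The lower-triangular shape is what makes next-token training valid: the prediction at position `i` cannot see the token it is asked to predict.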

In [None]:
def create_causal_mask(seq_len):
    """Create a causal mask to prevent attention to future positions"""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return mask == 0  # True for allowed positions, False for masked

class SimpleLanguageModel(nn.Module):
    """A simple transformer-based language model"""
    
    def __init__(self, vocab_size, d_model=256, num_heads=8, num_layers=4, 
                 d_ff=1024, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len
        
        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        
        # Transformer layers
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Output layers
        self.ln_final = nn.LayerNorm(d_model)
        self.output_projection = nn.Linear(d_model, vocab_size)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, input_ids):
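        batch_size, seq_len = input_ids.shape
        # The learned position embedding only covers max_seq_len positions
        assert seq_len <= self.max_seq_len, f"Sequence length {seq_len} exceeds max_seq_len {self.max_seq_len}"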
        
        # Create position indices
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, -1)
        
        # Embeddings
        token_embeds = self.token_embedding(input_ids)
        pos_embeds = self.position_embedding(positions)
        x = token_embeds + pos_embeds
        
        # Create causal mask
        causal_mask = create_causal_mask(seq_len).to(input_ids.device)
        
        # Apply transformer blocks
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, causal_mask)
        
        # Final layer norm and output projection
        x = self.ln_final(x)
        logits = self.output_projection(x)
        
        return logits

# Create a model
vocab_size = 1000  # We'll build a proper vocabulary later
model = SimpleLanguageModel(vocab_size=vocab_size, d_model=256, num_heads=8, num_layers=4)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model created with {total_params:,} parameters")

# Test forward pass
test_input = torch.randint(0, vocab_size, (2, 20))  # batch_size=2, seq_len=20
output = model(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print("✅ Language model working correctly!")

## 3. Tokenization and Data Preparation

Before we can train our model, we need to convert text into tokens that the model can understand:

In [None]:
class SimpleTokenizer:
    """Simple word-level tokenizer for demonstration purposes"""
    
    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}
        self.vocab_size = 0
        
        # Special tokens
        self.pad_token = '<PAD>'
        self.unk_token = '<UNK>'
        self.eos_token = '<EOS>'
        self.bos_token = '<BOS>'
        
        # Add special tokens
        self._add_word(self.pad_token)
        self._add_word(self.unk_token)
        self._add_word(self.eos_token)
        self._add_word(self.bos_token)
        
        self.pad_token_id = self.word_to_id[self.pad_token]
        self.unk_token_id = self.word_to_id[self.unk_token]
        self.eos_token_id = self.word_to_id[self.eos_token]
        self.bos_token_id = self.word_to_id[self.bos_token]
    
    def _add_word(self, word):
        if word not in self.word_to_id:
            self.word_to_id[word] = self.vocab_size
            self.id_to_word[self.vocab_size] = word
            self.vocab_size += 1
        return self.word_to_id[word]
    
    def build_vocab(self, texts, min_freq=1):
        """Build vocabulary from list of texts"""
        word_counts = {}
        
        # Count word frequencies
        for text in texts:
            words = text.lower().split()
            for word in words:
                word_counts[word] = word_counts.get(word, 0) + 1
        
        # Add words that meet minimum frequency threshold
        for word, count in word_counts.items():
            if count >= min_freq:
                self._add_word(word)
        
        print(f"Built vocabulary with {self.vocab_size} tokens")
    
    def encode(self, text, add_special_tokens=True):
        """Convert text to list of token IDs"""
        words = text.lower().split()
        token_ids = []
        
        if add_special_tokens:
            token_ids.append(self.bos_token_id)
        
        for word in words:
            token_id = self.word_to_id.get(word, self.unk_token_id)
            token_ids.append(token_id)
        
        if add_special_tokens:
            token_ids.append(self.eos_token_id)
        
        return token_ids
    
    def decode(self, token_ids, skip_special_tokens=True):
        """Convert list of token IDs back to text"""
        words = []
        for token_id in token_ids:
            word = self.id_to_word.get(token_id, self.unk_token)
            if skip_special_tokens and word in [self.pad_token, self.unk_token, self.eos_token, self.bos_token]:
                continue
            words.append(word)
        return ' '.join(words)

# Create sample text data
sample_texts = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is a subset of artificial intelligence",
    "neural networks are inspired by biological neural networks", 
    "deep learning uses multiple layers to learn representations",
    "transformers use attention mechanisms for better performance",
    "language models predict the next word in a sequence",
    "artificial intelligence will transform many industries",
    "data science combines statistics programming and domain knowledge",
    "python is a popular programming language for machine learning",
    "the future of technology depends on continued innovation"
]

# Build tokenizer
tokenizer = SimpleTokenizer()
tokenizer.build_vocab(sample_texts)

# Test tokenization
test_text = "machine learning is fascinating"
tokens = tokenizer.encode(test_text)
decoded = tokenizer.decode(tokens)

print(f"Original text: {test_text}")
print(f"Tokens: {tokens}")
print(f"Decoded text: {decoded}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

## 4. Dataset and Training Setup

Let's create a dataset class and set up the training process:
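
Language-model datasets are often built by cutting token sequences into fixed-length, overlapping windows. The sketch below shows the windowing arithmetic on its own; `sliding_windows` and the toy stream are hypothetical names used only here.

```python
def sliding_windows(token_ids, window, stride):
    """Return every full-length window of `window` tokens, advancing by `stride`."""
    return [token_ids[i:i + window]
            for i in range(0, len(token_ids) - window + 1, stride)]

toy_stream = list(range(10))  # stand-in for a stream of 10 token IDs
toy_chunks = sliding_windows(toy_stream, window=4, stride=2)
print(toy_chunks)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

# A sequence shorter than the window yields no chunks at all
print(sliding_windows([0, 1, 2], window=4, stride=2))
# []
```

The second call is worth keeping in mind: any sequence shorter than the window length contributes nothing, so the window size has to be chosen with the length of the available text in mind.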

In [None]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Dataset for language modeling"""
    
    def __init__(self, texts, tokenizer, max_length=64):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.examples = []
        
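        # Concatenate all texts into one token stream. Each sample sentence is
        # only about 10 tokens long, so chunking sentences one at a time would
        # never fill a max_length window and the dataset would be empty.
        token_stream = []
        for text in texts:
            token_stream.extend(tokenizer.encode(text))
        
        # Slice the stream into overlapping fixed-length chunks (50% overlap)
        stride = max(1, max_length // 2)
        for i in range(0, len(token_stream) - max_length + 1, stride):
            self.examples.append(token_stream[i:i + max_length])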
    
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        return torch.tensor(self.examples[idx], dtype=torch.long)

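# Repeat the ten sentences so there are enough tokens for a demo run
expanded_texts = sample_texts * 50

# Split into train and validation. Both splits contain the same ten sentences,
# so validation loss here tracks memorization rather than generalization.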
split_idx = int(0.8 * len(expanded_texts))
train_texts = expanded_texts[:split_idx]
val_texts = expanded_texts[split_idx:]

# Create datasets
train_dataset = TextDataset(train_texts, tokenizer, max_length=32)
val_dataset = TextDataset(val_texts, tokenizer, max_length=32)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Training batches: {len(train_loader)}")
print(f"Validation batches: {len(val_loader)}")

# Show a sample batch
sample_batch = next(iter(train_loader))
print(f"Sample batch shape: {sample_batch.shape}")
print(f"Sample text: {tokenizer.decode(sample_batch[0].tolist())}")

## 5. Training the Language Model

Now let's train our language model! We'll track the loss and visualize the training progress:
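
Two small facts make the training loop easier to read: targets are the inputs shifted by one position, and a model that predicts uniformly over the vocabulary has a cross-entropy of `ln(vocab_size)`, which is roughly where a freshly initialized model should start. A sketch of both, with helper names and a vocabulary size chosen only for illustration:

```python
import math

def next_token_pairs(token_ids):
    """Split one sequence into (inputs, targets), with targets shifted left by one."""
    return token_ids[:-1], token_ids[1:]

toy_inputs, toy_targets = next_token_pairs([3, 17, 42, 8, 2])
print(toy_inputs)   # [3, 17, 42, 8]
print(toy_targets)  # [17, 42, 8, 2]

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a model that gives every token equal probability."""
    return -math.log(1.0 / vocab_size)

# Hypothetical vocabulary size, chosen only for illustration
print(round(uniform_cross_entropy(1000), 3))  # 6.908
```

If the first reported loss is far above `ln(vocab_size)`, that usually points to a problem with initialization or the data pipeline rather than with the optimizer.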

In [None]:
# Create a new model with the correct vocabulary size
model = SimpleLanguageModel(
    vocab_size=tokenizer.vocab_size,
    d_model=128,
    num_heads=8,
    num_layers=3,
    d_ff=512,
    max_seq_len=64
)

model.to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"Model has {total_params:,} parameters")

# Training setup
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()
num_epochs = 10
# One scheduler step per epoch, so T_max=num_epochs anneals the LR over the whole run
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Training loop
train_losses = []
val_losses = []

print("Starting training...")
print("=" * 50)

for epoch in range(num_epochs):
    # Training phase
    model.train()
    epoch_train_loss = 0
    num_batches = 0
    
    for batch_idx, batch in enumerate(train_loader):
        batch = batch.to(device)
        
        # For language modeling, targets are input shifted by one position
        inputs = batch[:, :-1]
        targets = batch[:, 1:]
        
        # Forward pass
        logits = model(inputs)
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        epoch_train_loss += loss.item()
        num_batches += 1
    
    # Validation phase
    model.eval()
    epoch_val_loss = 0
    
    with torch.no_grad():
        for batch in val_loader:
            batch = batch.to(device)
            inputs = batch[:, :-1]
            targets = batch[:, 1:]
            
            logits = model(inputs)
            loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            epoch_val_loss += loss.item()
    
    # Calculate average losses
    avg_train_loss = epoch_train_loss / num_batches
    avg_val_loss = epoch_val_loss / len(val_loader)
    
    train_losses.append(avg_train_loss)
    val_losses.append(avg_val_loss)
    
    # Print progress
    print(f"Epoch {epoch+1}/{num_epochs}:")
    print(f"  Train Loss: {avg_train_loss:.4f}")
    print(f"  Val Loss: {avg_val_loss:.4f}")
    print(f"  LR: {scheduler.get_last_lr()[0]:.6f}")
    print("-" * 30)
    
    scheduler.step()

print("Training completed!")

## 6. Visualizing Training Progress

Let's plot the training and validation losses to see how our model learned:

In [None]:
# Plot training progress
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(train_losses) + 1), train_losses, 'b-', label='Training Loss', linewidth=2)
plt.plot(range(1, len(val_losses) + 1), val_losses, 'r-', label='Validation Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Language Model Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Calculate perplexity (lower is better)
final_train_perplexity = math.exp(train_losses[-1])
final_val_perplexity = math.exp(val_losses[-1])

print(f"Final Training Perplexity: {final_train_perplexity:.2f}")
print(f"Final Validation Perplexity: {final_val_perplexity:.2f}")

# Show training statistics
print(f"\nTraining Statistics:")
print(f"Initial Train Loss: {train_losses[0]:.4f}")
print(f"Final Train Loss: {train_losses[-1]:.4f}")
print(f"Loss Reduction: {((train_losses[0] - train_losses[-1]) / train_losses[0] * 100):.1f}%")

## 7. Text Generation

Now for the exciting part - let's use our trained model to generate text!
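
Sampling is controlled by two knobs: temperature rescales the logits before the softmax, and top-k discards all but the `k` highest-scoring tokens. The following plain-Python sketch shows how they shape the distribution; `top_k_probs` and the toy logits are assumptions made for this illustration.

```python
import math
import random

def top_k_probs(logits, temperature=1.0, top_k=0):
    """Turn logits into sampling probabilities using temperature and optional top-k."""
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        # Keep the k largest logits; everything else gets probability zero
        order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
        keep = set(order[:top_k])
    else:
        keep = set(range(len(scaled)))
    m = max(scaled[i] for i in keep)
    exps = [math.exp(scaled[i] - m) if i in keep else 0.0 for i in range(len(scaled))]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for temp in (0.5, 1.0, 1.5):
    print(temp, [round(p, 3) for p in top_k_probs(toy_logits, temp, top_k=3)])

# Draw one token ID from the filtered distribution
rng = random.Random(0)
sample_probs = top_k_probs(toy_logits, temperature=0.8, top_k=3)
sampled_id = rng.choices(range(len(toy_logits)), weights=sample_probs, k=1)[0]
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution; the token outside the top 3 always has probability zero.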

In [None]:
def generate_text(model, tokenizer, prompt, max_length=50, temperature=1.0, top_k=10):
    """Generate text using the trained model"""
    model.eval()
    
    # Tokenize the prompt
    input_ids = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False), 
                           dtype=torch.long).unsqueeze(0).to(device)
    
    generated_tokens = input_ids.clone()
    
    with torch.no_grad():
        for _ in range(max_length):
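            # Get model predictions, keeping only the most recent tokens so the
            # context never exceeds the model's position-embedding table
            context = generated_tokens[:, -model.max_seq_len:]
            logits = model(context)
            next_token_logits = logits[0, -1, :] / temperature
            
            # Apply top-k filtering (k is capped at the vocabulary size)
            if top_k > 0:
                k = min(top_k, next_token_logits.size(-1))
                top_k_logits, top_k_indices = torch.topk(next_token_logits, k)
                next_token_logits = torch.full_like(next_token_logits, float('-inf'))
                next_token_logits[top_k_indices] = top_k_logits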
            
            # Apply softmax to get probabilities
            probabilities = F.softmax(next_token_logits, dim=-1)
            
            # Sample next token
            next_token = torch.multinomial(probabilities, 1)
            
            # Stop if we generate an end-of-sequence token
            if next_token.item() == tokenizer.eos_token_id:
                break
            
            # Append to generated sequence
            generated_tokens = torch.cat([generated_tokens, next_token.unsqueeze(0)], dim=1)
    
    # Decode and return generated text
    generated_text = tokenizer.decode(generated_tokens[0].tolist(), skip_special_tokens=True)
    return generated_text

# Test text generation with different prompts
test_prompts = [
    "machine learning",
    "the future of",
    "artificial intelligence",
    "neural networks",
    "deep learning"
]

print("🤖 Text Generation Examples")
print("=" * 50)

for prompt in test_prompts:
    generated = generate_text(model, tokenizer, prompt, max_length=15, temperature=0.8, top_k=10)
    print(f"Prompt: '{prompt}'")
    print(f"Generated: '{generated}'")
    print("-" * 30)

# Try with different temperatures
print("\n🌡️ Temperature Effects on Generation")
print("=" * 50)

prompt = "artificial intelligence"
temperatures = [0.5, 1.0, 1.5]

for temp in temperatures:
    generated = generate_text(model, tokenizer, prompt, max_length=15, temperature=temp, top_k=10)
    print(f"Temperature {temp}: '{generated}'")

## 8. Model Evaluation

Let's evaluate our model's performance using perplexity and other metrics:
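
Perplexity is the exponential of the average negative log-likelihood per token. When batches differ in size, summing the loss over all tokens before dividing gives a slightly different (and more faithful) number than averaging per-batch means. A sketch with made-up numbers; the batch values and the helper name `toy_perplexity` are illustrative assumptions:

```python
import math

def toy_perplexity(total_nll, total_tokens):
    """Perplexity is exp of the average negative log-likelihood per token (in nats)."""
    return math.exp(total_nll / total_tokens)

# Made-up (summed NLL, token count) pairs for three batches; the last batch is smaller
toy_batches = [(62.0, 31), (58.9, 31), (20.0, 8)]

sum_nll = sum(nll for nll, _ in toy_batches)
sum_tokens = sum(n for _, n in toy_batches)
token_weighted = toy_perplexity(sum_nll, sum_tokens)

# Averaging per-batch means instead gives the small last batch extra weight
mean_of_means = sum(nll / n for nll, n in toy_batches) / len(toy_batches)
batch_averaged = math.exp(mean_of_means)

print(round(token_weighted, 2), round(batch_averaged, 2))
```

As a reference point, a model that guesses uniformly over a vocabulary of size `V` has perplexity `V`, so perplexity can be read as an effective number of equally likely choices per token.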

In [None]:
def calculate_detailed_perplexity(model, data_loader):
    """Calculate perplexity with detailed statistics"""
    model.eval()
    total_loss = 0
    total_tokens = 0
    criterion = nn.CrossEntropyLoss(reduction='sum')
    
    with torch.no_grad():
        for batch in data_loader:
            batch = batch.to(device)
            inputs = batch[:, :-1]
            targets = batch[:, 1:]
            
            logits = model(inputs)
            loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            
            total_loss += loss.item()
            total_tokens += targets.numel()
    
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    
    return perplexity, avg_loss, total_tokens

# Calculate perplexity for train and validation sets
train_perplexity, train_loss, train_tokens = calculate_detailed_perplexity(model, train_loader)
val_perplexity, val_loss, val_tokens = calculate_detailed_perplexity(model, val_loader)

print("📊 Model Evaluation Results")
print("=" * 50)
print(f"Training Set:")
print(f"  Perplexity: {train_perplexity:.2f}")
print(f"  Loss: {train_loss:.4f}")
print(f"  Tokens evaluated: {train_tokens:,}")
print()
print(f"Validation Set:")
print(f"  Perplexity: {val_perplexity:.2f}")
print(f"  Loss: {val_loss:.4f}")
print(f"  Tokens evaluated: {val_tokens:,}")
print()

# Model size and efficiency metrics
model_size_mb = total_params * 4 / (1024 * 1024)  # Assuming float32
print(f"Model Statistics:")
print(f"  Parameters: {total_params:,}")
print(f"  Model size: {model_size_mb:.1f} MB")
print(f"  Vocabulary size: {tokenizer.vocab_size}")
print(f"  Max sequence length: {model.max_seq_len}")

# Analyze training efficiency
improvement = (train_losses[0] - train_losses[-1]) / train_losses[0] * 100
print(f"\nTraining Efficiency:")
print(f"  Initial loss: {train_losses[0]:.4f}")
print(f"  Final loss: {train_losses[-1]:.4f}")
print(f"  Improvement: {improvement:.1f}%")
print(f"  Epochs trained: {len(train_losses)}")

## 9. Fine-tuning with Transformers Library

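Fine-tuning starts from a pre-trained model. The cell below shows how to load GPT-2 with the Transformers library and sample from it, then lists practical tips for fine-tuning:

In [None]:
# Note: the commented-out code loads a pre-trained model and generates text with it;
# it does not run a fine-tuning loop. Uncomment and run if you have transformers installed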

# try:
#     from transformers import AutoModelForCausalLM, AutoTokenizer
#     
#     print("🤗 Fine-tuning with Transformers Library")
#     print("=" * 50)
#     
#     # Load a pre-trained model (small for demo)
#     model_name = "gpt2"
#     pretrained_model = AutoModelForCausalLM.from_pretrained(model_name)
#     pretrained_tokenizer = AutoTokenizer.from_pretrained(model_name)
#     
#     # Add padding token
#     if pretrained_tokenizer.pad_token is None:
#         pretrained_tokenizer.pad_token = pretrained_tokenizer.eos_token
#     
#     print(f"Loaded pre-trained model: {model_name}")
#     print(f"Model parameters: {sum(p.numel() for p in pretrained_model.parameters()):,}")
#     print(f"Vocabulary size: {len(pretrained_tokenizer)}")
#     
#     # Generate some text with the pre-trained model
#     prompt = "Artificial intelligence is"
#     inputs = pretrained_tokenizer.encode(prompt, return_tensors='pt')
#     
#     with torch.no_grad():
#         outputs = pretrained_model.generate(
#             inputs, 
#             max_length=50, 
#             temperature=0.8, 
#             do_sample=True,
#             pad_token_id=pretrained_tokenizer.eos_token_id
#         )
#     
#     generated_text = pretrained_tokenizer.decode(outputs[0], skip_special_tokens=True)
#     print(f"\nPre-trained model generation:")
#     print(f"Prompt: '{prompt}'")
#     print(f"Generated: '{generated_text}'")
#     
# except ImportError:
#     print("Transformers library not available.")
#     print("Install with: pip install transformers")

print("💡 Fine-tuning Tips:")
print("=" * 30)
print("1. Start with a pre-trained model for better results")
print("2. Use smaller learning rates for fine-tuning (1e-5 to 5e-5)")
print("3. Fine-tune for fewer epochs to avoid overfitting")
print("4. Use domain-specific data for better task performance")
print("5. Consider parameter-efficient methods like LoRA")
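
To give a feel for why parameter-efficient methods such as LoRA (tip 5) are attractive, the sketch below counts trainable parameters when a frozen weight matrix is adapted with a low-rank update `B @ A`. The function names, the rank of 8, and the choice of a 768 x 768 layer (768 is GPT-2 small's hidden size) are assumptions for illustration; real LoRA configurations vary.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for a low-rank update B @ A on a frozen d_out x d_in weight."""
    # A has shape (rank, d_in) and B has shape (d_out, rank)
    return rank * d_in + d_out * rank

def dense_params(d_in, d_out):
    """Parameters in the dense weight matrix itself (bias ignored)."""
    return d_in * d_out

# Illustrative sizes only
lora_count = lora_trainable_params(768, 768, 8)
dense_count = dense_params(768, 768)
print(f"dense: {dense_count:,}  LoRA: {lora_count:,}  ratio: {lora_count / dense_count:.2%}")
```

Only the two small matrices are updated during fine-tuning, so optimizer state and gradient memory shrink by roughly the same ratio for the adapted layers.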

## 10. Save and Load Your Model

Let's save our trained model so we can use it later:
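
One detail to watch when saving the tokenizer as JSON: object keys in JSON are always strings, so an integer-keyed mapping such as `id_to_word` comes back with string keys and must be converted on load. A minimal sketch using a toy mapping (the names here are illustrative):

```python
import json

toy_id_to_word = {0: "<PAD>", 1: "<UNK>", 2: "hello"}

# JSON object keys are always strings, so integer keys come back as "0", "1", ...
restored_raw = json.loads(json.dumps(toy_id_to_word))
print(list(restored_raw.keys()))   # ['0', '1', '2']

# Convert the keys back to int before using the mapping for decoding
restored = {int(k): v for k, v in restored_raw.items()}
print(restored == toy_id_to_word)  # True
```

Without the conversion, a lookup such as `restored_raw.get(2)` would return `None`, and every decoded token would silently fall back to the unknown token.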

In [None]:
# Save the model and tokenizer
model_save_path = 'my_language_model.pth'
tokenizer_save_path = 'my_tokenizer.json'

# Save model state
torch.save({
    'model_state_dict': model.state_dict(),
    'model_config': {
        'vocab_size': tokenizer.vocab_size,
        'd_model': 128,
        'num_heads': 8,
        'num_layers': 3,
        'd_ff': 512,
        'max_seq_len': 64
    },
    'train_losses': train_losses,
    'val_losses': val_losses,
    'final_perplexity': val_perplexity
}, model_save_path)

# Save tokenizer
import json
tokenizer_data = {
    'word_to_id': tokenizer.word_to_id,
    'id_to_word': tokenizer.id_to_word,
    'vocab_size': tokenizer.vocab_size,
    'special_tokens': {
        'pad_token': tokenizer.pad_token,
        'unk_token': tokenizer.unk_token,
        'eos_token': tokenizer.eos_token,
        'bos_token': tokenizer.bos_token
    }
}

with open(tokenizer_save_path, 'w') as f:
    json.dump(tokenizer_data, f)

print(f"✅ Model saved to: {model_save_path}")
print(f"✅ Tokenizer saved to: {tokenizer_save_path}")

# Demonstrate loading the model
def load_model_and_tokenizer(model_path, tokenizer_path):
    """Load a saved model and tokenizer"""
    # Load model
    checkpoint = torch.load(model_path, map_location=device)
    config = checkpoint['model_config']
    
    # Create model with saved configuration
    loaded_model = SimpleLanguageModel(**config)
    loaded_model.load_state_dict(checkpoint['model_state_dict'])
    loaded_model.to(device)
    loaded_model.eval()
    
    # Load tokenizer
    with open(tokenizer_path, 'r') as f:
        tokenizer_data = json.load(f)
    
    loaded_tokenizer = SimpleTokenizer()
    loaded_tokenizer.word_to_id = tokenizer_data['word_to_id']
    loaded_tokenizer.id_to_word = {int(k): v for k, v in tokenizer_data['id_to_word'].items()}
    loaded_tokenizer.vocab_size = tokenizer_data['vocab_size']
    
    return loaded_model, loaded_tokenizer

# Test loading
print("\n🔄 Testing model loading...")
loaded_model, loaded_tokenizer = load_model_and_tokenizer(model_save_path, tokenizer_save_path)

# Test generation with loaded model
test_prompt = "machine learning"
generated = generate_text(loaded_model, loaded_tokenizer, test_prompt, max_length=10)
print(f"Loaded model generation: '{generated}'")
print("✅ Model loading successful!")

## 🎨 Bonus: Introduction to Multimodal AI

Now that you've built and trained a small language model, let's look at **Multimodal AI**: models that process more than one type of input, such as text and images.

### What are Multimodal Models?
Multimodal models can process and understand multiple types of data:
- **Text + Images**: Image captioning, visual question answering
- **Text + Audio**: Speech recognition, text-to-speech
- **Text + Video**: Video understanding and description

Let's explore some practical examples!

In [None]:
# First, let's try CLIP - a powerful vision-language model
try:
    from transformers import CLIPProcessor, CLIPModel
    from PIL import Image
    import requests
    import matplotlib.pyplot as plt
    
    print("🔗 Loading CLIP model...")
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    print("✅ CLIP model loaded successfully!")
    
except ImportError:
    print("⚠️ CLIP not available. Install transformers: pip install transformers")
except Exception as e:
    print(f"Error loading CLIP: {e}")

In [None]:
# Let's build a simple vision-language model architecture
import torch
import torch.nn as nn

class SimpleVisionLanguageModel(nn.Module):
    """A simple model that combines vision and text understanding"""
    
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        
        # Vision encoder (simplified CNN)
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, d_model)
        )
        
        # Text encoder (using our transformer components)
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_encoder = TransformerBlock(d_model, num_heads=8, d_ff=d_model*4)
        
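        # Cross-modal attention (batch_first=True because inputs are [batch, seq, d_model];
        # the default layout would treat the batch dimension as the sequence)
        self.cross_attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)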
        
        # Output projections
        self.image_projection = nn.Linear(d_model, d_model)
        self.text_projection = nn.Linear(d_model, d_model)
        
    def forward(self, image, text):
        # Encode image
        image_features = self.vision_encoder(image)  # [batch, d_model]
        
        # Encode text
        text_embeds = self.text_embedding(text)      # [batch, seq_len, d_model]
        text_features = self.text_encoder(text_embeds, mask=None)
        text_features = text_features.mean(dim=1)    # [batch, d_model]
        
        # Cross-modal attention
        image_features = image_features.unsqueeze(1)  # [batch, 1, d_model]
        text_features_expanded = text_features.unsqueeze(1)  # [batch, 1, d_model]
        
        attended_features, _ = self.cross_attention(
            image_features, text_features_expanded, text_features_expanded
        )
        
        # Project to common space
        image_proj = self.image_projection(attended_features.squeeze(1))
        text_proj = self.text_projection(text_features)
        
        return image_proj, text_proj

# Create the multimodal model
multimodal_model = SimpleVisionLanguageModel(vocab_size=1000)
total_params = sum(p.numel() for p in multimodal_model.parameters())

print(f"🎨 Multimodal Model Created!")
print(f"Total parameters: {total_params:,}")

# Test with dummy data
dummy_image = torch.randn(2, 3, 224, 224)  # Batch of 2 images
dummy_text = torch.randint(0, 1000, (2, 10))  # Batch of 2 text sequences

image_proj, text_proj = multimodal_model(dummy_image, dummy_text)
print(f"Image projection shape: {image_proj.shape}")
print(f"Text projection shape: {text_proj.shape}")
print("✅ Multimodal model test successful!")

### 🔍 Understanding Multimodal Applications

Multimodal AI enables exciting applications:

1. **Image Captioning**: Generate text descriptions of images
2. **Visual Question Answering**: Answer questions about image content
3. **Image-Text Retrieval**: Find relevant images for text queries
4. **Multimodal Chatbots**: AI assistants that understand both text and images
5. **Content Generation**: Create images from text descriptions (like DALL-E)
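
Image-text retrieval (item 3) is commonly implemented by embedding images and text into a shared vector space and ranking candidates by cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real encoder outputs; the file names and the query embedding are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for real image-encoder outputs
toy_image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "sunset.jpg": [0.0, 0.2, 0.9],
}
toy_text_query = [0.8, 0.2, 0.1]  # pretend embedding of "a photo of a cat"

ranked = sorted(
    toy_image_embeddings,
    key=lambda name: cosine_similarity(toy_text_query, toy_image_embeddings[name]),
    reverse=True,
)
print(ranked)  # ['cat.jpg', 'dog.jpg', 'sunset.jpg']
```

In a real system the embeddings would come from trained image and text encoders, and the ranking step would stay essentially the same.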

In [None]:
# Let's create a simple image captioning model architecture
class ImageCaptioningModel(nn.Module):
    """Simple image captioning model for demonstration"""
    
    def __init__(self, vocab_size, d_model=512, max_seq_len=50):
        super().__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        
        # Vision encoder
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 256, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, d_model)
        )
        
        # Caption decoder
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        
        # Attention mechanism (declared for illustration; forward() below does not use it)
        self.attention = nn.Linear(d_model * 2, d_model)
        
        # Output projection
        self.output_projection = nn.Linear(d_model, vocab_size)
        
    def forward(self, image, caption_tokens=None):
        # Encode image
        image_features = self.vision_encoder(image)  # [batch, d_model]
        
        if caption_tokens is not None:
            # Training mode
            caption_embeds = self.embedding(caption_tokens)
            
            # Initialize the decoder's hidden state with the image features,
            # one copy per LSTM layer
            num_layers = self.decoder.num_layers
            h0 = image_features.unsqueeze(0).repeat(num_layers, 1, 1)  # [num_layers, batch, d_model]
            c0 = torch.zeros_like(h0)
            
            # Decode captions
            decoder_output, _ = self.decoder(caption_embeds, (h0, c0))
            
            # Project to vocabulary
            logits = self.output_projection(decoder_output)
            return logits
        else:
            # Inference mode (simplified): returns image features only, no caption is decoded
            return image_features

# Create captioning model
captioning_model = ImageCaptioningModel(vocab_size=5000)
captioning_params = sum(p.numel() for p in captioning_model.parameters())

print(f"📸 Image Captioning Model Created!")
print(f"Total parameters: {captioning_params:,}")

# Test the model
test_image = torch.randn(1, 3, 224, 224)
test_caption = torch.randint(0, 5000, (1, 20))

caption_logits = captioning_model(test_image, test_caption)
print(f"Caption logits shape: {caption_logits.shape}")
print("✅ Image captioning model test successful!")

In [None]:
# Demonstration of multimodal training concepts
print("🎓 Multimodal Training Concepts")
print("=" * 40)

# Simulated multimodal training data
def create_multimodal_training_example():
    """Create a sample multimodal training example"""
    # Simulated image-caption pairs
    examples = [
        {
            "image_description": "A cat sitting on a windowsill",
            "caption": "a fluffy orange cat sitting by the window",
            "question": "What color is the cat?",
            "answer": "orange"
        },
        {
            "image_description": "A dog playing in a park", 
            "caption": "a happy golden retriever playing fetch",
            "question": "What is the dog doing?",
            "answer": "playing fetch"
        },
        {
            "image_description": "A sunset over mountains",
            "caption": "beautiful mountain landscape at sunset",
            "question": "What time of day is shown?",
            "answer": "sunset"
        }
    ]
    return examples

# Show training examples
training_examples = create_multimodal_training_example()
print("Sample Multimodal Training Data:")
for i, example in enumerate(training_examples):
    print(f"\nExample {i+1}:")
    print(f"  Image: {example['image_description']}")
    print(f"  Caption: {example['caption']}")
    print(f"  VQA Question: {example['question']}")
    print(f"  VQA Answer: {example['answer']}")

print("\n🔧 Key Multimodal Training Techniques:")
print("1. Contrastive Learning: Learning to match images with correct captions")
print("2. Cross-Modal Attention: Allowing text and images to attend to each other")
print("3. Multi-Task Learning: Training on multiple tasks simultaneously")
print("4. Data Augmentation: Augmenting both images and text for robustness")
print("5. Progressive Training: Starting with simple tasks, moving to complex ones")

### 🚀 Advanced Multimodal Concepts

**Current State-of-the-Art Models:**
- **GPT-4V**: GPT-4 with vision capabilities
- **DALL-E 2/3**: Text-to-image generation
- **CLIP**: Connecting text and images
- **LLaVA**: Large Language and Vision Assistant
- **Flamingo**: Few-shot learning on multimodal tasks

**Key Challenges:**
1. **Alignment**: Ensuring different modalities are properly aligned
2. **Scale**: Training large multimodal models requires significant compute
3. **Data Quality**: High-quality paired multimodal data is often scarce
4. **Evaluation**: Developing comprehensive evaluation metrics

**Future Directions:**
- Video understanding and generation
- 3D scene understanding
- Real-time multimodal interactions
- Multimodal reasoning and planning
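
One common way to approach the alignment challenge listed above is a contrastive objective of the kind used to train CLIP: matching image-caption pairs are pulled together and mismatched pairs pushed apart. The sketch below is a minimal version under stated assumptions: the function name `contrastive_loss` is hypothetical, the temperature is fixed at 0.07 rather than learned, and the embeddings are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # logits[i, j] = scaled cosine similarity between image i and caption j
    logits = image_embeds @ text_embeds.t() / temperature
    # Matching pairs sit on the diagonal of the similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
images = torch.randn(8, 32)
aligned_loss = contrastive_loss(images, images + 0.01 * torch.randn(8, 32))
random_loss = contrastive_loss(images, torch.randn(8, 32))
print(f"Aligned pairs loss: {aligned_loss.item():.4f}")
print(f"Random pairs loss:  {random_loss.item():.4f}")
```

The loss should be far lower when text embeddings line up with their images than when they are unrelated, which is the signal that drives both encoders toward a shared embedding space.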

In [None]:
# Evaluation metrics for multimodal models
def demonstrate_multimodal_evaluation():
    """Show how to evaluate multimodal models"""
    print("📊 Multimodal Evaluation Metrics")
    print("=" * 35)
    
    # Simulated image captioning evaluation
    reference_captions = [
        "a brown dog sitting on green grass",
        "two children playing in a park",
        "a red car parked on the street"
    ]
    
    generated_captions = [
        "a dog sitting on grass",
        "children playing outside", 
        "a red vehicle on the road"
    ]
    
    print("\n📝 Image Captioning Evaluation:")
    for i, (ref, gen) in enumerate(zip(reference_captions, generated_captions)):
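        # Unigram precision over unique words. This is only a rough stand-in for BLEU,
        # which also uses clipped counts, higher-order n-grams, and a brevity penalty.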
        ref_words = set(ref.split())
        gen_words = set(gen.split())
        overlap = len(ref_words.intersection(gen_words))
        total_words = len(gen_words)
        precision = overlap / total_words if total_words > 0 else 0
        
        print(f"\nExample {i+1}:")
        print(f"  Reference: '{ref}'")
        print(f"  Generated: '{gen}'")
        print(f"  Word Overlap Score: {precision:.2f}")
    
    # VQA accuracy simulation
    print("\n❓ Visual Question Answering Evaluation:")
    vqa_results = [
        {"question": "What color is the car?", "predicted": "red", "actual": "red", "correct": True},
        {"question": "How many people?", "predicted": "two", "actual": "three", "correct": False},
        {"question": "What animal is shown?", "predicted": "dog", "actual": "dog", "correct": True}
    ]
    
    correct_answers = sum(1 for result in vqa_results if result["correct"])
    total_questions = len(vqa_results)
    accuracy = correct_answers / total_questions
    
    for result in vqa_results:
        status = "✅" if result["correct"] else "❌"
        print(f"  {status} Q: {result['question']}")
        print(f"     Predicted: {result['predicted']}, Actual: {result['actual']}")
    
    print(f"\n📈 Overall VQA Accuracy: {accuracy:.1%}")
    
    print("\n🎯 Key Multimodal Metrics:")
    print("  • BLEU/ROUGE: For caption generation quality")
    print("  • CIDEr: Consensus-based caption evaluation")
    print("  • Accuracy: For classification tasks like VQA")
    print("  • Recall@K: For retrieval tasks")
    print("  • Human Evaluation: For quality and relevance")

demonstrate_multimodal_evaluation()
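
The word-overlap score in the cell above works on sets of words, so repeated words and word order are ignored. As a step closer to BLEU, the sketch below computes a clipped n-gram precision, where each candidate n-gram is credited at most as many times as it appears in the reference. The function name `clipped_ngram_precision` is hypothetical, and this is still not full BLEU, which also combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter

def clipped_ngram_precision(reference, candidate, n=1):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference)
    cand_counts = ngrams(candidate)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference = "a brown dog sitting on green grass"
candidate = "a dog sitting on grass"
unigram = clipped_ngram_precision(reference, candidate, n=1)
bigram = clipped_ngram_precision(reference, candidate, n=2)
print(f"Unigram precision: {unigram:.2f}")
print(f"Bigram precision:  {bigram:.2f}")
```

The bigram score is lower than the unigram score here because dropping "brown" and "green" breaks several word pairs, something the set-based overlap cannot detect.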

### 🌟 Congratulations on Multimodal AI!

You've now been introduced to the exciting world of **Multimodal AI**! This represents the cutting edge of artificial intelligence, where models can understand and generate content across multiple modalities.

**What you've learned about multimodal AI:**
- How to combine vision and language understanding
- Architecture patterns for multimodal models
- Applications like image captioning and VQA
- Evaluation metrics for multimodal systems

**Next steps in your multimodal AI journey:**
1. Experiment with pre-trained multimodal models (CLIP, BLIP, etc.)
2. Try building image captioning systems
3. Explore visual question answering
4. Study the latest research in multimodal AI
5. Consider the ethical implications of multimodal systems

The future of AI is multimodal - combining text, images, audio, and even video to create more intelligent and capable systems! 🚀🎨🔊

## 🎯 Advanced Techniques: RLHF and Model Alignment

Now let's explore modern alignment techniques that make language models safer and more helpful!

### Reinforcement Learning from Human Feedback (RLHF)

RLHF is a key technique behind the conversational abilities of assistants such as ChatGPT. It typically consists of three stages:

1. **Supervised Fine-Tuning (SFT)**: Train on high-quality instruction-following data
2. **Reward Model Training**: Learn human preferences from comparison data
3. **PPO Training**: Use reinforcement learning to optimize for human preferences
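
Stage 1 can be sketched as ordinary next-token training in which only the response tokens contribute to the loss. The snippet below is a minimal illustration, assuming each sequence is a prompt followed by a response; the function name `sft_loss` and the `prompt_lengths` argument are hypothetical, and random logits stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lengths):
    """Next-token cross-entropy computed only over response tokens."""
    # Shift so that the logits at position t are scored against token t + 1
    shift_logits = logits[:, :-1, :]
    labels = input_ids[:, 1:].clone()
    # Mask targets that belong to the prompt; -100 is skipped via ignore_index
    positions = torch.arange(labels.size(1), device=labels.device).unsqueeze(0)
    prompt_mask = positions < (prompt_lengths.unsqueeze(1) - 1)
    labels[prompt_mask] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

torch.manual_seed(0)
vocab_size, batch, seq_len = 100, 2, 10
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model outputs
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
prompt_lengths = torch.tensor([4, 6])                     # number of prompt tokens per row
loss = sft_loss(logits, input_ids, prompt_lengths)
print(f"SFT loss on response tokens: {loss.item():.4f}")
```

Masking the prompt is a design choice rather than a requirement: it focuses the gradient on how the model answers instead of on reproducing the instruction.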

In [None]:
# Create synthetic preference data for RLHF demonstration
preference_data = [
    {
        "prompt": "Explain machine learning",
        "chosen": "Machine learning is a subset of AI that enables computers to learn from data without explicit programming.",
        "rejected": "Machine learning is when computers become smart and can think like humans."
    },
    {
        "prompt": "How to stay healthy?",
        "chosen": "Maintain a balanced diet, exercise regularly, get adequate sleep, and have regular health checkups.",
        "rejected": "Just eat whatever you want and don't worry about it."
    },
    {
        "prompt": "What is Python?",
        "chosen": "Python is a high-level programming language known for its simplicity and versatility, widely used in data science and AI.",
        "rejected": "Python is a snake that programmers worship for some reason."
    }
]

print("🎯 Preference Dataset for RLHF")
print("=" * 40)
for i, example in enumerate(preference_data):
    print(f"\nExample {i+1}:")
    print(f"  Prompt: {example['prompt']}")
    print(f"  ✅ Chosen: {example['chosen']}")
    print(f"  ❌ Rejected: {example['rejected']}")

### Reward Model Training

The reward model learns to score responses based on human preferences:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import math

class SimpleRewardModel(nn.Module):
    """Simple reward model for RLHF demonstrations"""
    
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2
        )
        self.reward_head = nn.Linear(d_model, 1)
        
    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.transformer(x)
        x = x.mean(1)  # Mean pooling
        reward = self.reward_head(x)
        return reward.squeeze(-1)

# Create a simple reward model
reward_model = SimpleRewardModel(vocab_size=1000)
optimizer = optim.Adam(reward_model.parameters(), lr=1e-4)

print("🏆 Reward Model Architecture:")
print(f"  Parameters: {sum(p.numel() for p in reward_model.parameters()):,}")
print("  Input: Tokenized text")
print("  Output: Scalar reward score")
print("  Training: Bradley-Terry preference model")
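
The Bradley-Terry objective mentioned in the cell above can be written as a pairwise loss on reward scores. The sketch below assumes the scores have already been computed for each (chosen, rejected) pair; the function name `bradley_terry_loss` and the example scores are hypothetical.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical reward scores for three (chosen, rejected) pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -1.0])
loss = bradley_terry_loss(chosen, rejected)
accuracy = (chosen > rejected).float().mean()
print(f"Preference loss:   {loss.item():.4f}")
print(f"Pairwise accuracy: {accuracy.item():.2f}")
```

In a training loop, the two score tensors would come from running the reward model on the tokenized chosen and rejected responses, followed by `loss.backward()` and an optimizer step.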

### PPO Training Simulation

PPO (Proximal Policy Optimization) is used to fine-tune the language model using rewards:

In [None]:
def simulate_ppo_training():
    """Simulate PPO training process"""
    print("🚀 PPO Training Simulation")
    print("=" * 30)
    
    # Simulate training steps
    prompts = [
        "Explain artificial intelligence",
        "How to learn programming", 
        "What is machine learning"
    ]
    
    for step in range(3):
        print(f"\n📈 Step {step + 1}:")
        
        total_reward = 0
        for prompt in prompts:
            # Simulate reward calculation
            helpful_response_reward = 0.8 + step * 0.1  # Improving over time
            total_reward += helpful_response_reward
            
        avg_reward = total_reward / len(prompts)
        print(f"  Average Reward: {avg_reward:.2f}")
        print(f"  Policy Update: Applied gradient with clipping")
        print(f"  KL Penalty: {0.02 - step * 0.005:.3f}")
    
    print("\n✅ PPO Key Components:")
    print("  • Importance Sampling: ratio = π_new(a|s) / π_old(a|s)")
    print("  • Clipped Objective: min(ratio × advantage, clip(ratio, 1-ε, 1+ε) × advantage)")
    print("  • KL Penalty: β × KL(π_new || π_ref), keeping the policy close to the reference (SFT) model")
    print("  • Value Function: V(s) for advantage estimation")

simulate_ppo_training()
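
The simulation above only prints numbers, so here is a small sketch of the clipped objective itself operating on tensors. The function name `ppo_clipped_loss`, the sample values, and the clip range of 0.2 (a commonly used setting) are assumptions for illustration; a full RLHF setup would also need advantage estimation and the KL term.

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    ratio = torch.exp(new_logp - old_logp)                     # π_new / π_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum removes any incentive to move the ratio outside the clip range
    return -torch.min(unclipped, clipped).mean()

old_logp = torch.tensor([-1.0, -1.0, -1.0])
new_logp = torch.tensor([-0.5, -1.0, -1.5])     # ratios of about 1.65, 1.00, 0.61
advantages = torch.tensor([1.0, 1.0, -1.0])
loss = ppo_clipped_loss(new_logp, old_logp, advantages)
print(f"PPO clipped loss: {loss.item():.4f}")
```

In the first sample the ratio exceeds 1 + ε with a positive advantage, so its contribution is capped at 1.2; in the third the ratio falls below 1 - ε with a negative advantage, so the clipped term (the more pessimistic one) is used.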

### Direct Preference Optimization (DPO)

DPO is a simpler alternative to PPO-based RLHF that optimizes the policy directly on preference pairs, with no separate reward model and no reinforcement learning loop:

In [None]:
def demonstrate_dpo():
    """Demonstrate DPO loss calculation"""
    print("🎯 Direct Preference Optimization (DPO)")
    print("=" * 40)
    
    # Simulate log probabilities
    examples = [
        {
            "prompt": "Explain AI",
            "chosen_logp_policy": -2.1,
            "rejected_logp_policy": -3.2,
            "chosen_logp_ref": -2.5,
            "rejected_logp_ref": -3.0
        }
    ]
    
    beta = 0.1  # Controls how strongly the policy is kept close to the reference model
    
    for i, ex in enumerate(examples):
        policy_diff = ex["chosen_logp_policy"] - ex["rejected_logp_policy"]
        ref_diff = ex["chosen_logp_ref"] - ex["rejected_logp_ref"]
        
        # DPO loss: -log σ(β · (policy_diff - ref_diff))
        dpo_loss = -math.log(1 / (1 + math.exp(-beta * (policy_diff - ref_diff))))
        
        print(f"\nExample {i+1}: {ex['prompt']}")
        print(f"  Policy preference: {policy_diff:.2f}")
        print(f"  Reference preference: {ref_diff:.2f}")
        print(f"  DPO Loss: {dpo_loss:.3f}")
    
    print("\n✅ DPO Advantages:")
    print("  • No reward model needed")
    print("  • More stable training")
    print("  • Direct preference optimization")
    print("  • Simpler implementation")

demonstrate_dpo()
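
The same loss can be written over batches of log-probabilities as tensors, which is closer to how it would appear in a training loop. This is a minimal sketch: the function name `dpo_loss` is hypothetical, and the inputs are assumed to be summed sequence log-probabilities from the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Batched DPO loss on sequence log-probabilities."""
    policy_diff = policy_chosen_logp - policy_rejected_logp
    ref_diff = ref_chosen_logp - ref_rejected_logp
    # logsigmoid is numerically safer than log(1 / (1 + exp(-x)))
    return -F.logsigmoid(beta * (policy_diff - ref_diff)).mean()

# Same numbers as the scalar example above, with gradients enabled on the policy side
policy_chosen = torch.tensor([-2.1], requires_grad=True)
policy_rejected = torch.tensor([-3.2], requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected,
                torch.tensor([-2.5]), torch.tensor([-3.0]))
loss.backward()
print(f"DPO loss: {loss.item():.3f}")
print(f"Gradient w.r.t. chosen log-prob:   {policy_chosen.grad.item():.4f}")
print(f"Gradient w.r.t. rejected log-prob: {policy_rejected.grad.item():.4f}")
```

The gradients have opposite signs: a descent step raises the log-probability of the chosen response and lowers that of the rejected one.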

### Constitutional AI

Constitutional AI trains models to follow an explicit set of principles, or "constitution":

In [None]:
def constitutional_ai_demo():
    """Demonstrate Constitutional AI principles"""
    print("📜 Constitutional AI Demo")
    print("=" * 30)
    
    # Define AI constitution
    constitution = [
        "Be helpful, harmless, and honest",
        "Respect human autonomy and dignity", 
        "Provide accurate information",
        "Admit uncertainty when appropriate",
        "Avoid generating harmful content"
    ]
    
    print("🏛️  AI Constitution:")
    for i, principle in enumerate(constitution, 1):
        print(f"  {i}. {principle}")
    
    # Example scenarios
    scenarios = [
        {
            "user_input": "How to hack into someone's computer?",
            "initial_response": "I can't provide instructions for unauthorized access.",
            "critique": "✅ Good - follows principle of avoiding harmful content.",
            "revision": "N/A - Already constitutional."
        },
        {
            "user_input": "What's the capital of Mars?",
            "initial_response": "The capital of Mars is New Geneva.",
            "critique": "❌ Provides false information. Should admit uncertainty.",
            "revision": "Mars doesn't have a capital as it lacks human settlements."
        }
    ]
    
    print("\n🔍 Constitutional AI in Action:")
    for i, scenario in enumerate(scenarios, 1):
        print(f"\nScenario {i}:")
        print(f"  Input: {scenario['user_input']}")
        print(f"  Initial: {scenario['initial_response']}")
        print(f"  Critique: {scenario['critique']}")
        print(f"  Revision: {scenario['revision']}")
    
    print("\n✅ Constitutional Process:")
    print("  1. Generate initial response")
    print("  2. Critique against constitution")
    print("  3. Revise if needed")
    print("  4. Repeat until compliance")

constitutional_ai_demo()
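
The four-step process printed above can be sketched as a loop. The three callables below (`toy_generate`, `toy_critique`, `toy_revise`) are hypothetical stand-ins with hard-coded behavior; in a real system each would be a call to a language model prompted with the constitution.

```python
def critique_and_revise(prompt, generate, critique, revise, max_rounds=3):
    """Run the generate -> critique -> revise loop until no violations remain."""
    response = generate(prompt)
    history = [response]
    for _ in range(max_rounds):
        violations = critique(prompt, response)
        if not violations:
            break  # response already complies with the constitution
        response = revise(prompt, response, violations)
        history.append(response)
    return response, history

# Toy stand-ins for model calls (hypothetical; a real system would query an LLM)
def toy_generate(prompt):
    return "The capital of Mars is New Geneva."

def toy_critique(prompt, response):
    # Returns the list of violated principles, empty if none
    if "New Geneva" in response:
        return ["Provide accurate information"]
    return []

def toy_revise(prompt, response, violations):
    return "Mars doesn't have a capital as it lacks human settlements."

final, history = critique_and_revise(
    "What's the capital of Mars?", toy_generate, toy_critique, toy_revise
)
print(f"Final response: {final}")
print(f"Revisions made: {len(history) - 1}")
```

The `max_rounds` cap is a practical safeguard so the loop terminates even if the critique never comes back clean. In the published Constitutional AI method, the revised responses are then used as fine-tuning data rather than being produced at inference time.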

### Chain-of-Thought Reasoning

CoT prompting improves model reasoning by encouraging step-by-step thinking:

In [None]:
def chain_of_thought_demo():
    """Demonstrate Chain-of-Thought reasoning"""
    print("🧠 Chain-of-Thought (CoT) Reasoning")
    print("=" * 35)
    
    problem = "If a train travels 240 miles in 4 hours, what is its average speed?"
    
    print(f"Problem: {problem}")
    
    # Standard vs CoT comparison
    print("\n❌ Direct Response (answer only, no reasoning shown):")
    print("The speed is 60 mph.")
    
    print("\n✅ Chain-of-Thought Response:")
    print("Let me think step by step:")
    print("1. I need to find average speed")
    print("2. Speed = Distance / Time")
    print("3. Distance = 240 miles")
    print("4. Time = 4 hours")
    print("5. Speed = 240 ÷ 4 = 60 mph")
    print("Therefore, the average speed is 60 mph.")
    
    print("\n🎯 CoT Benefits:")
    print("  • Improved reasoning quality")
    print("  • Better problem decomposition")
    print("  • More transparent thinking")
    print("  • Reduced errors in complex problems")
    
    # Few-shot examples
    print("\n📚 Few-Shot CoT Examples:")
    examples = [
        "Q: 15 + 27 = ?\nA: 15 + 27 = 15 + 20 + 7 = 35 + 7 = 42",
        "Q: 8 × 9 = ?\nA: 8 × 9 = 8 × 10 - 8 × 1 = 80 - 8 = 72"
    ]
    
    for example in examples:
        print(f"  {example}")

chain_of_thought_demo()
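
To use these ideas with an actual model, the worked examples are usually assembled into a single prompt string. The helper below is a small sketch; the name `build_cot_prompt` and the exact prompt layout are assumptions, and the trigger phrase is just one common choice.

```python
def build_cot_prompt(examples, question, trigger="Let's think step by step."):
    """Assemble a few-shot chain-of-thought prompt from worked examples."""
    parts = []
    for ex_question, ex_reasoning in examples:
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning}")
    # End with the new question and a cue that invites step-by-step reasoning
    parts.append(f"Q: {question}\nA: {trigger}")
    return "\n\n".join(parts)

worked_examples = [
    ("15 + 27 = ?", "15 + 27 = 15 + 20 + 7 = 35 + 7 = 42"),
    ("8 × 9 = ?", "8 × 9 = 8 × 10 - 8 × 1 = 80 - 8 = 72"),
]
prompt = build_cot_prompt(
    worked_examples,
    "If a train travels 240 miles in 4 hours, what is its average speed?",
)
print(prompt)
```

The resulting string would be passed to the model's generation function, which then continues the final answer in the same step-by-step style as the examples.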

### Model Alignment Summary

Let's compare the different alignment techniques:

In [None]:
import pandas as pd

# Create comparison table
alignment_techniques = {
    'Technique': ['RLHF', 'DPO', 'Constitutional AI', 'RLAIF', 'Chain-of-Thought'],
    'Complexity': ['High', 'Medium', 'Medium', 'High', 'Low'],
    'Data Required': ['Preferences', 'Preferences', 'Principles', 'AI-Generated', 'Examples'],
    'Key Benefit': ['Human Alignment', 'Stability', 'Transparency', 'Scalability', 'Reasoning'],
    'Use Case': ['General Chat', 'Preference Tasks', 'Safety-Critical', 'Large Scale', 'Math/Logic']
}

df = pd.DataFrame(alignment_techniques)

print("🎯 Model Alignment Techniques Comparison")
print("=" * 45)
print(df.to_string(index=False))

print("\n🌟 Key Takeaways:")
print("  • RLHF: Widely used approach for aligning conversational AI")
print("  • DPO: Simpler alternative to RLHF")
print("  • Constitutional AI: Explicit principle-based training")
print("  • RLAIF: Use AI feedback instead of human feedback")
print("  • CoT: Enhance reasoning through structured prompting")

print("\n📈 Research Frontiers:")
print("  • Scalable oversight methods")
print("  • Interpretability and transparency")
print("  • Robustness to distribution shift")
print("  • Multi-agent alignment scenarios")
print("  • Value learning and specification")

## 🎉 Congratulations!

You've successfully built and trained your own Large Language Model, explored Multimodal AI, **and** learned about modern alignment techniques! Here's what you've accomplished:

### ✅ What You've Learned:
1. **Transformer Architecture** - Built multi-head attention and transformer blocks from scratch
2. **Language Modeling** - Understood the mathematical foundations of language modeling
3. **Tokenization** - Created a tokenizer to convert text to numbers
4. **Training Process** - Implemented a complete training loop with proper optimization
5. **Text Generation** - Used your model to generate new text
6. **Model Evaluation** - Measured performance using perplexity
7. **Model Persistence** - Saved and loaded your trained model
8. **🆕 Multimodal AI** - Explored vision-language models and multimodal applications
9. **🎯 RLHF & Alignment** - Learned modern techniques for model alignment and safety

### 🚀 Next Steps:
1. **Scale Up**: Try training on larger datasets with more parameters
2. **Pre-trained Models**: Experiment with fine-tuning GPT-2 or other openly available models
3. **Advanced Techniques**: Learn about PEFT methods like LoRA and QLoRA
4. **🎯 RLHF Implementation**: Build real preference datasets and reward models
5. **Specialized Domains**: Train models on specific domains (code, science, literature)
6. **🎨 Multimodal Projects**: Build image captioning, VQA, or text-to-image systems
7. **🔊 Audio Integration**: Explore speech recognition and text-to-speech models
8. **🎬 Video Understanding**: Investigate video analysis and generation

### 💡 Key Takeaways:
- **Start Small**: Begin with small models and datasets to understand the concepts
- **Quality Data**: The quality of your training data often matters as much as its quantity
- **Evaluation**: Always evaluate your models thoroughly before deployment
- **Ethics & Safety**: Consider ethical implications and use alignment techniques
- **Human Feedback**: RLHF and similar methods are crucial for helpful AI
- **Multimodal Future**: The future of AI combines multiple modalities
- **Alignment Matters**: Safe and beneficial AI requires careful alignment work
- **Continuous Learning**: The field is rapidly evolving - keep learning!

You now have the foundational knowledge to build and train text-only and multimodal AI systems with modern alignment techniques. Variants of the methods you've learned are used in systems such as GPT-4, Claude, and ChatGPT!

Happy building! 🤖✨🎨🔊🎬🎯