# Coding Assignment 5: Transformer Architecture Fundamentals

**Name:** [Your Name Here]  
**Student ID:** [Your Student ID]  
**Date:** [Today's Date]  

## Overview

Welcome to the fascinating world of transformers! In this assignment, you'll build a transformer architecture from scratch, implementing the revolutionary attention mechanism that powers modern AI systems like ChatGPT, BERT, and GPT-4.

**Learning Goals:**
- Understand transformer architecture and attention mechanisms
- Implement tokenization, embeddings, and positional encoding
- Build self-attention and multi-head attention layers
- Assemble complete transformer blocks with layer normalization
- Train a mini-transformer on sentiment analysis
- Analyze transformer capabilities and visualize attention patterns

**Estimated Time:** 2 hours

## Setup and Imports

In [None]:
# Core libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Data manipulation and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Progress tracking and utilities
from tqdm import tqdm
import time
import math
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"PyTorch version: {torch.__version__}")
print("Setup complete!")

---

# Part 1: Understanding Transformers (20 minutes)

**Goal:** Learn why transformers revolutionized AI and understand their core concepts

## 1.1 The Transformer Revolution

Before transformers, sequential data such as text was usually processed with **recurrent neural networks (RNNs)**, which read words one at a time. That made training slow, because the steps could not run in parallel, and long sequences suffered from vanishing gradients. Transformers replaced recurrence with a simple but powerful idea, summed up in the title of the 2017 paper that introduced them: **"Attention Is All You Need"**.

**Key Innovations:**
- **Self-Attention**: Allow each word to "attend" to all other words simultaneously
- **Parallelization**: Process all words at once, not sequentially
- **Position Encoding**: Add position information since attention is position-agnostic
- **Scalability**: Architecture scales beautifully to massive models

In [None]:
# Visualize the difference between RNN and Transformer processing
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

# RNN Sequential Processing
words = ['The', 'cat', 'sat', 'on', 'the', 'mat']
positions = range(len(words))

# Show RNN sequential dependencies
ax1.scatter(positions, [1]*len(words), s=200, alpha=0.7, color='lightblue')
for i, word in enumerate(words):
    ax1.text(i, 1, word, ha='center', va='center', fontsize=12, fontweight='bold')

# Draw sequential arrows
for i in range(len(words)-1):
    ax1.arrow(i+0.1, 1, 0.8, 0, head_width=0.05, head_length=0.05, 
              fc='red', ec='red', linewidth=2)

ax1.set_xlim(-0.5, len(words)-0.5)
ax1.set_ylim(0.5, 1.5)
ax1.set_title('RNN: Sequential Processing\n(Each word depends on previous)', 
              fontsize=14, fontweight='bold')
ax1.set_xlabel('Time Steps (Sequential)')
ax1.axis('off')

# Transformer Parallel Processing
ax2.scatter(positions, [1]*len(words), s=200, alpha=0.7, color='lightgreen')
for i, word in enumerate(words):
    ax2.text(i, 1, word, ha='center', va='center', fontsize=12, fontweight='bold')

# Draw attention connections (every word to every word)
for i in range(len(words)):
    for j in range(len(words)):
        if i != j:
            # Curved arrows to show all-to-all connections
            ax2.annotate('', xy=(j, 1), xytext=(i, 1),
                        arrowprops=dict(arrowstyle='->', color='blue', alpha=0.3,
                                      connectionstyle='arc3,rad=0.2', linewidth=1))

ax2.set_xlim(-0.5, len(words)-0.5)
ax2.set_ylim(0.5, 1.5)
ax2.set_title('Transformer: Parallel Processing\n(Each word attends to all words)', 
              fontsize=14, fontweight='bold')
ax2.set_xlabel('Parallel Processing')
ax2.axis('off')

plt.tight_layout()
plt.show()

print("🔑 Key Differences:")
print("   RNN: Sequential, slow, vanishing gradients")
print("   Transformer: Parallel, fast, better long-range dependencies")
print("\n⚡ Why Transformers Won:")
print("   • Faster training (parallelization)")
print("   • Better at long sequences")
print("   • More expressive attention patterns")
print("   • Scales to billions of parameters")

## 1.2 High-Level Transformer Architecture

Let's understand the overall transformer architecture:

In [None]:
# Create a visual representation of transformer architecture
fig, ax = plt.subplots(figsize=(12, 14))

# Define the components and their positions
components = [
    {'name': 'Input Tokens', 'y': 0.1, 'color': 'lightblue', 'desc': '["The", "cat", "sat"]'},
    {'name': 'Token Embeddings', 'y': 0.2, 'color': 'lightgreen', 'desc': 'Convert words to vectors'},
    {'name': 'Position Embeddings', 'y': 0.3, 'color': 'lightyellow', 'desc': 'Add position information'},
    {'name': 'Multi-Head Attention', 'y': 0.5, 'color': 'lightcoral', 'desc': 'Words attend to each other'},
    {'name': 'Add & Norm', 'y': 0.6, 'color': 'lightgray', 'desc': 'Residual + Layer Norm'},
    {'name': 'Feed Forward', 'y': 0.7, 'color': 'lightpink', 'desc': 'Position-wise MLP'},
    {'name': 'Add & Norm', 'y': 0.8, 'color': 'lightgray', 'desc': 'Residual + Layer Norm'},
    {'name': 'Classification Head', 'y': 0.95, 'color': 'lightsteelblue', 'desc': 'Final predictions'}
]

# Draw the components
for comp in components:
    # Main box
    rect = plt.Rectangle((0.2, comp['y']-0.04), 0.6, 0.06, 
                        facecolor=comp['color'], edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    
    # Component name
    ax.text(0.5, comp['y'], comp['name'], ha='center', va='center', 
           fontsize=12, fontweight='bold')
    
    # Description
    ax.text(0.85, comp['y'], comp['desc'], ha='left', va='center', 
           fontsize=10, style='italic')

# Draw arrows between components
for i in range(len(components)-1):
    ax.arrow(0.5, components[i]['y']+0.03, 0, 0.04, 
            head_width=0.02, head_length=0.01, fc='blue', ec='blue')

# Add special notation for transformer block repetition
ax.annotate('', xy=(0.15, 0.8), xytext=(0.15, 0.5),
           arrowprops=dict(arrowstyle='<->', color='red', linewidth=3))
ax.text(0.05, 0.65, 'Transformer\nBlock\n(Repeat N times)', 
       ha='center', va='center', fontsize=10, fontweight='bold', color='red')

ax.set_xlim(0, 1.4)
ax.set_ylim(0, 1.1)
ax.set_title('Transformer Architecture Overview', fontsize=16, fontweight='bold')
ax.axis('off')

plt.tight_layout()
plt.show()

print("🏗️ Transformer Building Blocks:")
print("   1. Embeddings: Convert tokens to vectors")
print("   2. Attention: Let words 'talk' to each other")
print("   3. Feed-Forward: Process each position independently")
print("   4. Residuals: Add input to output (helps training)")
print("   5. Layer Norm: Normalize activations (stabilizes training)")
print("\n💡 The magic is in the attention mechanism!")

## 1.3 Why Position Matters

Unlike RNNs, attention has no inherent sense of position. We need to add position information!
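
A minimal sketch of why order gets lost, assuming hand-picked toy vectors (`toy_embeddings` and `bag_of_vectors` are hypothetical names used only for illustration, not part of the assignment). If a model only sums token vectors, which is roughly what uniform attention without position information amounts to, two sentences with the same words collapse to the same representation:

```python
# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "the": [1, 3],
    "cat": [9, 2],
    "sat": [4, 7],
    "on":  [2, 5],
    "mat": [6, 8],
}

def bag_of_vectors(sentence):
    """Sum token vectors, ignoring word order (no position information)."""
    total = [0, 0]
    for word in sentence.lower().split():
        vec = toy_embeddings[word]
        total = [t + v for t, v in zip(total, vec)]
    return total

rep1 = bag_of_vectors("The cat sat on the mat")
rep2 = bag_of_vectors("The mat sat on the cat")
print(rep1, rep2)    # [23, 28] [23, 28]
print(rep1 == rep2)  # True: the order-free summary cannot separate them
```

Adding a position-dependent vector to each token before summing would break this tie, which is the role positional encoding plays.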

In [None]:
# Demonstrate why position encoding is needed
sentence1 = "The cat sat on the mat"
sentence2 = "The mat sat on the cat"

print("🤔 Without position encoding:")
print(f"   Sentence 1: {sentence1}")
print(f"   Sentence 2: {sentence2}")
print("\n   Both sentences have the same words!")
print("   Pure attention (without position) couldn't tell them apart!")

print("\n✅ With position encoding:")
words1 = sentence1.split()
words2 = sentence2.split()

print("   Sentence 1 with positions:")
for i, word in enumerate(words1):
    print(f"      Position {i}: {word}")
    
print("\n   Sentence 2 with positions:")
for i, word in enumerate(words2):
    print(f"      Position {i}: {word}")
    
print("\n   Now the transformer can distinguish between them!")
print("\n🎯 Position encoding gives transformers a sense of word order.")

---

# Part 2: Tokenization & Embeddings (20 minutes)

**Goal:** Convert text into numerical representations that transformers can process

## 2.1 Text Tokenization

First, we need to convert text into tokens (discrete units like words or subwords):
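
As a rough illustration of the end result, here is a sketch that maps a sentence to a fixed-length list of ids. The vocabulary `toy_vocab` and the helper `encode` are hypothetical and written by hand; in the assignment the vocabulary is built from data by `SimpleTokenizer`.

```python
# Hypothetical hand-written vocabulary; a real one is built from data.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "<CLS>": 2, "i": 3, "love": 4, "movies": 5}

def encode(text, max_length=6):
    """Lowercase, split on whitespace, map to ids, then truncate and pad."""
    words = ["<CLS>"] + text.lower().split()
    ids = [toy_vocab.get(w, toy_vocab["<UNK>"]) for w in words][:max_length]
    ids += [toy_vocab["<PAD>"]] * (max_length - len(ids))
    return ids

print(encode("I love great movies"))  # [2, 3, 4, 1, 5, 0]
```

The word "great" is not in the toy vocabulary, so it maps to the `<UNK>` id 1, and the final 0 is padding.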

In [None]:
class SimpleTokenizer:
    """Simple word-level tokenizer for learning purposes"""
    
    def __init__(self, vocab_size=5000):
        self.vocab_size = vocab_size
        self.vocab = {}
        self.inverse_vocab = {}
        
        # Special tokens
        self.pad_token = '<PAD>'
        self.unk_token = '<UNK>'
        self.cls_token = '<CLS>'  # For classification
        
    def build_vocab(self, texts):
        """Build vocabulary from list of texts"""
        
        # TODO: Count word frequencies across all texts
        # HINT: Use Counter to count words after tokenizing
        # HINT: Convert to lowercase and split on whitespace
        word_counts = Counter()
        
        for text in texts:
            # TODO: Tokenize text into words
            # HINT: Clean text, convert to lowercase, split into words
            words = None  # Your code here
            word_counts.update(words)
        
        # TODO: Build vocabulary with most frequent words
        # HINT: Start with special tokens, then add most common words
        
        # Add special tokens first
        self.vocab[self.pad_token] = 0
        self.vocab[self.unk_token] = 1 
        self.vocab[self.cls_token] = 2
        
        # TODO: Add most common words up to vocab_size
        # HINT: Use word_counts.most_common(self.vocab_size - 3)
        # HINT: Reserve 3 slots for special tokens
        most_common = None  # Your code here
        
        for i, (word, count) in enumerate(most_common):
            self.vocab[word] = i + 3  # +3 for special tokens
        
        # Create inverse mapping
        self.inverse_vocab = {idx: word for word, idx in self.vocab.items()}
        
        print(f"✅ Vocabulary built with {len(self.vocab)} words")
        print(f"   Most common words: {list(self.vocab.keys())[3:8]}")
        
    def clean_text(self, text):
        """Clean and normalize text"""
        # TODO: Implement text cleaning
        # HINT: Remove extra whitespace, convert to lowercase
        # HINT: Optionally remove punctuation or keep it
        text = text.lower().strip()
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        return text
        
    def tokenize(self, text, max_length=128):
        """Convert text to token indices"""
        
        # TODO: Clean and tokenize text
        # HINT: Use clean_text, then split into words
        words = None  # Your code here
        
        # TODO: Add CLS token at the beginning
        # HINT: tokens = [self.cls_token] + words
        tokens = None  # Your code here
        
        # TODO: Truncate to max_length
        # HINT: tokens = tokens[:max_length]
        tokens = None  # Your code here
        
        # TODO: Convert words to indices
        # HINT: Use self.vocab.get(word, self.vocab[self.unk_token])
        indices = []
        for token in tokens:
            idx = None  # Your code here - get index or UNK
            indices.append(idx)
        
        # TODO: Pad sequence to max_length
        # HINT: Add PAD tokens until length equals max_length
        while len(indices) < max_length:
            indices.append(None)  # Your code here - PAD token index
        
        return indices
    
    def decode(self, indices):
        """Convert indices back to text"""
        words = []
        for idx in indices:
            if idx in self.inverse_vocab:
                word = self.inverse_vocab[idx]
                if word not in [self.pad_token, self.cls_token]:
                    words.append(word)
        return ' '.join(words)

# Test the tokenizer
sample_texts = [
    "I love this movie! It's fantastic.",
    "This film is terrible. I hate it.",
    "Great acting and wonderful story.",
    "Boring and predictable plot."
]

tokenizer = SimpleTokenizer(vocab_size=100)
tokenizer.build_vocab(sample_texts)

# Test tokenization
test_text = "I love great movies!"
tokens = tokenizer.tokenize(test_text, max_length=10)
decoded = tokenizer.decode(tokens)

print(f"\n🧪 Tokenization Test:")
print(f"   Original: {test_text}")
print(f"   Tokens: {tokens[:6]}...")  # Show first 6
print(f"   Decoded: {decoded}")

## 2.2 Token Embeddings

Now let's convert token indices into dense vector representations:
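
Conceptually, an embedding layer is a lookup table with one row per vocabulary entry. The sketch below uses plain Python lists instead of `nn.Embedding`; the names `table` and `embed` and the tiny sizes are assumptions made for illustration, and in a real model the table values are learned rather than fixed random numbers.

```python
import math
import random

random.seed(0)
vocab_size, d_model = 5, 4  # tiny sizes, chosen for illustration

# An embedding table is a (vocab_size x d_model) matrix of numbers.
table = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row per token id and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[value * scale for value in table[idx]] for idx in token_ids]

vectors = embed([2, 0, 2])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same id, same vector
```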

In [None]:
class TokenEmbedding(nn.Module):
    """Convert token indices to dense vectors"""
    
    def __init__(self, vocab_size, d_model):
        super().__init__()
        
        # TODO: Create embedding layer
        # HINT: Use nn.Embedding(vocab_size, d_model)
        # HINT: d_model is the embedding dimension (typically 256, 512, etc.)
        self.embedding = None  # Your code here
        self.d_model = d_model
        
    def forward(self, x):
        # TODO: Apply embedding and scale
        # HINT: embedding(x) * sqrt(d_model) - common in transformers
        # HINT: Scaling helps with training stability
        return None  # Your code here

# Test token embeddings
vocab_size = 100
d_model = 64  # Embedding dimension

token_embed = TokenEmbedding(vocab_size, d_model)

# Test with sample tokens
sample_tokens = torch.tensor([[1, 5, 10, 2, 0, 0]])  # Batch size 1
embeddings = token_embed(sample_tokens)

print(f"📊 Token Embedding Test:")
print(f"   Input shape: {sample_tokens.shape}")
print(f"   Output shape: {embeddings.shape}")
print(f"   Embedding dimension: {d_model}")
print(f"   Each token → {d_model}-dimensional vector")

## 2.3 Positional Embeddings

Add position information using sinusoidal encodings:
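
For reference, the sinusoidal scheme assigns `sin(pos / 10000^(i / d_model))` to each even dimension `i` and the matching cosine to dimension `i + 1`. The sketch below computes this for a single position in plain Python; `sinusoidal_encoding` is a hypothetical helper, and the vectorized PyTorch version in the cell that follows should produce the same numbers.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional encoding for one position (d_model even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even index i
        encoding.append(math.cos(angle))  # odd index i + 1
    return encoding

pe0 = sinusoidal_encoding(0, 4)
pe1 = sinusoidal_encoding(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)
```

Position 0 always encodes to alternating 0 and 1 because `sin(0) = 0` and `cos(0) = 1`; later positions vary quickly in the first dimensions and slowly in the last ones.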

In [None]:
class PositionalEncoding(nn.Module):
    """Add positional information using sinusoidal encoding"""
    
    def __init__(self, d_model, max_length=512):
        super().__init__()
        
        # TODO: Create positional encoding matrix
        # HINT: Shape should be (max_length, d_model)
        pe = torch.zeros(max_length, d_model)
        
        # TODO: Create position vector
        # HINT: position = torch.arange(0, max_length).unsqueeze(1).float()
        position = None  # Your code here
        
        # TODO: Create div_term for sinusoidal pattern
        # HINT: div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        div_term = None  # Your code here
        
        # TODO: Apply sine to even indices
        # HINT: pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 0::2] = None  # Your code here
        
        # TODO: Apply cosine to odd indices  
        # HINT: pe[:, 1::2] = torch.cos(position * div_term)
        pe[:, 1::2] = None  # Your code here
        
        # Register as buffer (not a parameter)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # TODO: Add positional encoding to embeddings
        # HINT: x + self.pe[:x.size(1)] (add position encoding up to sequence length)
        return None  # Your code here

# Test positional encoding
pos_encoding = PositionalEncoding(d_model=64, max_length=128)

# Visualize positional encodings
pe_matrix = pos_encoding.pe[:50, :].numpy()  # First 50 positions

plt.figure(figsize=(12, 8))
plt.imshow(pe_matrix.T, cmap='RdYlBu', aspect='auto')
plt.colorbar(label='Encoding Value')
plt.xlabel('Position')
plt.ylabel('Embedding Dimension')
plt.title('Positional Encoding Visualization\n(Each position has unique pattern)', 
          fontweight='bold')
plt.show()

print("🌊 Positional Encoding Properties:")
print("   • Each position has a unique sinusoidal pattern")
print("   • Different frequencies across embedding dimensions")
print("   • Allows model to learn relative positions")
print(f"   • Encoding shape: {pos_encoding.pe.shape}")

## 2.4 Complete Embedding Layer

Combine token and positional embeddings:

In [None]:
class TransformerEmbeddings(nn.Module):
    """Complete embedding layer with tokens + positions"""
    
    def __init__(self, vocab_size, d_model, max_length=512, dropout=0.1):
        super().__init__()
        
        # TODO: Initialize components
        # HINT: Use the classes we just created
        self.token_embedding = None  # Your code here
        self.pos_encoding = None     # Your code here
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input_ids):
        # TODO: Combine token and positional embeddings
        # HINT: Get token embeddings, add positional encoding, apply dropout
        
        # Step 1: Get token embeddings
        token_embeds = None  # Your code here
        
        # Step 2: Add positional encoding
        embeddings = None  # Your code here
        
        # Step 3: Apply dropout
        embeddings = None  # Your code here
        
        return embeddings

# Test complete embeddings
embeddings = TransformerEmbeddings(vocab_size=100, d_model=64)
embeddings.eval()  # Disable dropout so the comparison below shows only the positional effect

# Sample input
input_ids = torch.tensor([[2, 10, 5, 20, 1, 0, 0]])  # <CLS> + words + <UNK> + <PAD> x2
output = embeddings(input_ids)

print(f"🔗 Complete Embeddings Test:")
print(f"   Input IDs: {input_ids[0].tolist()[:5]}...")
print(f"   Output shape: {output.shape}")
print(f"   Each token now has position-aware embeddings!")

# Visualize embedding differences
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Token embeddings only (first few dimensions)
token_only = embeddings.token_embedding(input_ids)[0, :5, :8].detach().numpy()
im1 = ax1.imshow(token_only.T, cmap='viridis', aspect='auto')
ax1.set_title('Token Embeddings Only', fontweight='bold')
ax1.set_xlabel('Token Position')
ax1.set_ylabel('Embedding Dimension')
plt.colorbar(im1, ax=ax1)

# Complete embeddings with position
complete = output[0, :5, :8].detach().numpy()
im2 = ax2.imshow(complete.T, cmap='viridis', aspect='auto')
ax2.set_title('Token + Position Embeddings', fontweight='bold')
ax2.set_xlabel('Token Position')
ax2.set_ylabel('Embedding Dimension')
plt.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()

print("✨ Notice how positional encoding changes the embeddings!")
print("   Same tokens at different positions have different representations.")

---

# Part 3: Self-Attention Mechanism (25 minutes)

**Goal:** Implement the heart of the transformer - the attention mechanism

## 3.1 Understanding Attention Intuition

Attention allows each word to "attend" to (focus on) relevant words in the sequence. It's like asking: "When processing this word, which other words should I pay attention to?"

In [None]:
# Demonstrate attention intuition with example sentence
sentence = "The cat sat on the mat"
words = sentence.split()

print("🧠 Attention Intuition Example:")
print(f"   Sentence: {sentence}")
print()

# Manual attention examples
attention_examples = {
    "cat": ["The", "sat"],  # "cat" attends to "The" and "sat"
    "sat": ["cat", "on"],   # "sat" attends to "cat" and "on"
    "mat": ["the", "on"],   # "mat" attends to "the" and "on"
}

for word, attends_to in attention_examples.items():
    print(f"   When processing '{word}', pay attention to: {attends_to}")

print("\n🎯 Key Insights:")
print("   • Each word can attend to multiple other words")
print("   • Attention weights are learned, not hand-coded")
print("   • Allows capturing long-range dependencies")
print("   • Different heads can focus on different relationships")

# Visualize attention matrix concept
fig, ax = plt.subplots(figsize=(10, 8))

# Create sample attention matrix
n_words = len(words)
attention_matrix = np.random.rand(n_words, n_words)

# Make it more realistic (each row sums to 1)
attention_matrix = attention_matrix / attention_matrix.sum(axis=1, keepdims=True)

# Add some structure (diagonal and nearby words)
for i in range(n_words):
    attention_matrix[i, i] += 0.3  # Self-attention
    if i > 0:
        attention_matrix[i, i-1] += 0.2  # Previous word
    if i < n_words - 1:
        attention_matrix[i, i+1] += 0.2  # Next word

# Renormalize
attention_matrix = attention_matrix / attention_matrix.sum(axis=1, keepdims=True)

im = ax.imshow(attention_matrix, cmap='Blues', aspect='auto')
ax.set_xticks(range(n_words))
ax.set_yticks(range(n_words))
ax.set_xticklabels(words)
ax.set_yticklabels(words)
ax.set_xlabel('Attending to (Keys)')
ax.set_ylabel('Query words')
ax.set_title('Sample Attention Matrix\n(Each row shows what a word attends to)', fontweight='bold')

# Add text annotations
for i in range(n_words):
    for j in range(n_words):
        text = ax.text(j, i, f'{attention_matrix[i, j]:.2f}',
                      ha="center", va="center", color="white" if attention_matrix[i, j] > 0.5 else "black")

plt.colorbar(im, label='Attention Weight')
plt.tight_layout()
plt.show()

print("📊 Reading the Attention Matrix:")
print("   • Each row sums to 1.0 (probability distribution)")
print("   • Darker blue = stronger attention")
print("   • Row i shows what word i attends to")
print("   • Column j shows which words attend to word j")

## 3.2 Scaled Dot-Product Attention

The core attention mechanism: Attention(Q, K, V) = softmax(QK^T / √d_k) V
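
To see the formula at work on numbers small enough to check by hand, here is a sketch in plain Python for a single two-token sequence. The helpers `softmax` and `attention` and the input values are made up for illustration; the PyTorch function in the next cell is the batched version of the same computation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence, no batching."""
    d_k = len(Q[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[col] for w, v in zip(row, V)) for col in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Two tokens with d_k = d_v = 2; the values are made up for illustration.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

output, weights = attention(Q, K, V)
print(all(abs(sum(row) - 1.0) < 1e-9 for row in weights))  # True
print(weights[0][0] > weights[0][1])                       # True
```

Each query matches its own key best, so each output row is pulled toward the corresponding value vector. Dividing by `sqrt(d_k)` keeps the scores from growing with the vector dimension, which would otherwise push the softmax toward a one-hot distribution.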

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None, dropout=None):
    """
    Compute scaled dot-product attention
    
    Args:
        Q: Query matrix (batch_size, seq_len, d_k)
        K: Key matrix (batch_size, seq_len, d_k)
        V: Value matrix (batch_size, seq_len, d_v)
        mask: Optional mask to prevent attention to certain positions
        dropout: Optional dropout layer
        
    Returns:
        output: Attention output (batch_size, seq_len, d_v)
        attention_weights: Attention weights (batch_size, seq_len, seq_len)
    """
    
    # TODO: Get the dimension for scaling
    # HINT: d_k = Q.size(-1) - last dimension of Q
    d_k = None  # Your code here
    
    # TODO: Compute attention scores
    # HINT: scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # HINT: This computes QK^T and scales by sqrt(d_k)
    scores = None  # Your code here
    
    # TODO: Apply mask if provided
    # HINT: scores.masked_fill(mask == 0, -1e9) - set masked positions to very negative
    if mask is not None:
        scores = None  # Your code here
    
    # TODO: Apply softmax to get attention weights
    # HINT: F.softmax(scores, dim=-1) - softmax over last dimension
    attention_weights = None  # Your code here
    
    # TODO: Apply dropout if provided
    if dropout is not None:
        attention_weights = None  # Your code here
    
    # TODO: Apply attention to values
    # HINT: torch.matmul(attention_weights, V) - weighted sum of values
    output = None  # Your code here
    
    return output, attention_weights

# Test scaled dot-product attention
batch_size, seq_len, d_model = 2, 6, 64
d_k = d_v = d_model  # For simplicity

# Create sample Q, K, V matrices
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_v)

# Apply attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)

print(f"🔍 Scaled Dot-Product Attention Test:")
print(f"   Input shapes:")
print(f"      Q (Query): {Q.shape}")
print(f"      K (Key): {K.shape}")
print(f"      V (Value): {V.shape}")
print(f"   Output shapes:")
print(f"      Output: {output.shape}")
print(f"      Attention weights: {attention_weights.shape}")

# Verify attention weights sum to 1
weights_sum = attention_weights.sum(dim=-1)
print(f"   Attention weights sum: {weights_sum[0, 0]:.3f} (should be ~1.0)")

# Visualize attention pattern for first batch
plt.figure(figsize=(10, 8))
plt.imshow(attention_weights[0].detach().numpy(), cmap='Blues', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Attention Weights Visualization\n(First sample from batch)', fontweight='bold')
plt.show()

print("✅ Attention mechanism working!")
print("   Each query position attends to all key positions")
print("   Q, K, V are random here; in a model, the projections that produce them are learned")

## 3.3 Multi-Head Attention

Instead of one attention function, use multiple "heads" to attend to different types of relationships:
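
The bookkeeping behind the heads is a split and a merge of the feature dimension. The sketch below shows it for one token vector in plain Python; `split_heads` and `merge_heads` are hypothetical helpers, and in the actual layer the split is applied to the projected Q, K, and V tensors with `view` and `transpose`.

```python
def split_heads(vector, num_heads):
    """Split one d_model-dimensional vector into num_heads chunks of size d_k."""
    d_model = len(vector)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head chunks back into one d_model-dimensional vector."""
    return [value for head in heads for value in head]

token = [1, 2, 3, 4, 5, 6, 7, 8]          # d_model = 8
heads = split_heads(token, num_heads=4)   # d_k = 2
print(heads)                              # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(merge_heads(heads) == token)        # True
```

Each head runs scaled dot-product attention on its own slice, so the total cost stays close to that of a single full-width head.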

In [None]:
class MultiHeadAttention(nn.Module):
    """Multi-Head Self-Attention mechanism"""
    
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        
        # TODO: Verify that d_model is divisible by num_heads
        # HINT: assert d_model % num_heads == 0
        assert None  # Your code here
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # TODO: Linear projections for Q, K, V
        # HINT: Create three nn.Linear layers, each mapping d_model -> d_model
        self.W_q = None  # Your code here - Query projection
        self.W_k = None  # Your code here - Key projection
        self.W_v = None  # Your code here - Value projection
        
        # TODO: Output projection
        # HINT: nn.Linear(d_model, d_model)
        self.W_o = None  # Your code here
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        batch_size, seq_len, d_model = x.shape
        
        # TODO: Linear projections
        # HINT: Apply W_q, W_k, W_v to input x
        Q = None  # Your code here
        K = None  # Your code here
        V = None  # Your code here
        
        # TODO: Reshape for multi-head attention
        # HINT: Reshape from (batch, seq_len, d_model) to (batch, num_heads, seq_len, d_k)
        # HINT: Use .view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        Q = None  # Your code here
        K = None  # Your code here
        V = None  # Your code here
        
        # TODO: Apply attention
        # HINT: Use the scaled_dot_product_attention function we defined
        # HINT: Pass dropout=self.dropout
        attn_output, attention_weights = None  # Your code here
        
        # TODO: Concatenate heads
        # HINT: Transpose back and reshape to (batch, seq_len, d_model)
        # HINT: Use .transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        attn_output = None  # Your code here
        
        # TODO: Final linear projection
        # HINT: Apply W_o to the concatenated output
        output = None  # Your code here
        
        return output, attention_weights

# Test multi-head attention
d_model = 64
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model, num_heads)

# Sample input
x = torch.randn(batch_size, seq_len, d_model)
output, attention_weights = mha(x)

print(f"🎭 Multi-Head Attention Test:")
print(f"   Input shape: {x.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Attention weights shape: {attention_weights.shape}")
print(f"   Number of heads: {num_heads}")
print(f"   Dimension per head: {d_model // num_heads}")

# Visualize different attention heads
if attention_weights is not None:
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    axes = axes.flatten()
    
    for head in range(min(8, num_heads)):  # Show up to 8 heads
        head_weights = attention_weights[0, head].detach().numpy()
        im = axes[head].imshow(head_weights, cmap='Blues', aspect='auto')
        axes[head].set_title(f'Head {head + 1}', fontweight='bold')
        axes[head].set_xlabel('Key Position')
        axes[head].set_ylabel('Query Position')
    
    plt.suptitle('Each Attention Head Produces Its Own Pattern (untrained weights)', 
                fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("🧠 Multi-Head Benefits:")
    print("   • Each head can focus on different relationships")
    print("   • Some heads might focus on syntax, others on semantics")
    print("   • Increases model capacity and expressiveness")
    print("   • Allows parallel computation across heads")

---

# Part 4: Transformer Block Components (20 minutes)

**Goal:** Build the complete transformer block with layer normalization and feed-forward networks

## 4.1 Layer Normalization

Layer normalization helps stabilize training in deep networks:
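
As a small numeric illustration, the sketch below normalizes a single feature vector in plain Python. It assumes the learnable scale and shift are left at their initial values (1 and 0) and so omits them; `layer_norm` is a hypothetical helper, not the class defined in the next cell.

```python
import math

def layer_norm(vector, eps=1e-6):
    """Normalize one feature vector to zero mean and (nearly) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

features = [10.0, 12.0, 14.0, 16.0]
normalized = layer_norm(features)
print([round(v, 3) for v in normalized])
print(abs(sum(normalized)) < 1e-9)  # True: the result is centered at zero
```

The statistics are computed per token across its features, so the result does not depend on the other items in the batch.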

In [None]:
# Layer Normalization is built into PyTorch, but let's understand it
class LayerNorm(nn.Module):
    """Layer normalization for stable training"""
    
    def __init__(self, features, eps=1e-6):
        super().__init__()
        
        # TODO: Create learnable parameters
        # HINT: gamma (scale) and beta (shift) parameters
        # HINT: Use nn.Parameter(torch.ones(features)) for gamma
        # HINT: Use nn.Parameter(torch.zeros(features)) for beta
        self.gamma = None  # Your code here
        self.beta = None   # Your code here
        self.eps = eps
        
    def forward(self, x):
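        # TODO: Compute layer normalization
        # HINT: Normalize across the last dimension
        # HINT: mean = x.mean(-1, keepdim=True)
        # HINT: std = x.std(-1, keepdim=True, unbiased=False)
        # HINT: normalized = (x - mean) / (std + self.eps)
        # HINT: return self.gamma * normalized + self.beta
        # NOTE: nn.LayerNorm divides by sqrt(var + eps) using the biased variance,
        #       so the two outputs should agree closely but not bit-for-bit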
        
        mean = None  # Your code here
        std = None   # Your code here
        normalized = None  # Your code here
        return None  # Your code here

# Compare with PyTorch's LayerNorm
d_model = 64
custom_ln = LayerNorm(d_model)
pytorch_ln = nn.LayerNorm(d_model)

# Test input
x = torch.randn(2, 10, d_model) * 5 + 10  # Large variance and mean

# Apply both normalizations
custom_out = custom_ln(x)
pytorch_out = pytorch_ln(x)

print(f"📊 Layer Normalization Comparison:")
print(f"   Input statistics:")
print(f"      Mean: {x.mean():.3f}, Std: {x.std():.3f}")
print(f"   After custom LayerNorm:")
print(f"      Mean: {custom_out.mean():.3f}, Std: {custom_out.std():.3f}")
print(f"   After PyTorch LayerNorm:")
print(f"      Mean: {pytorch_out.mean():.3f}, Std: {pytorch_out.std():.3f}")

# Visualize normalization effect
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Original
axes[0].hist(x.flatten().detach().numpy(), bins=50, alpha=0.7, color='red')
axes[0].set_title('Original Distribution', fontweight='bold')
axes[0].set_ylabel('Frequency')

# Custom LayerNorm
axes[1].hist(custom_out.flatten().detach().numpy(), bins=50, alpha=0.7, color='blue')
axes[1].set_title('After Custom LayerNorm', fontweight='bold')

# PyTorch LayerNorm
axes[2].hist(pytorch_out.flatten().detach().numpy(), bins=50, alpha=0.7, color='green')
axes[2].set_title('After PyTorch LayerNorm', fontweight='bold')

for ax in axes:
    ax.axvline(0, color='black', linestyle='--', alpha=0.5)
    ax.set_xlabel('Value')

plt.tight_layout()
plt.show()

print("✅ Layer normalization centers and scales the distribution")
print("   This helps with gradient flow and training stability")

## 4.2 Feed-Forward Network

Position-wise feed-forward network processes each position independently:
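
To get a feel for the sizes involved, here is a rough parameter count. It assumes every linear layer has a bias and that attention uses four `d_model x d_model` projections (query, key, value, output); `ffn_param_count` and `attention_param_count` are hypothetical helpers for this arithmetic only.

```python
def ffn_param_count(d_model, d_ff):
    """Weights plus biases for Linear(d_model, d_ff) and Linear(d_ff, d_model)."""
    layer1 = d_model * d_ff + d_ff
    layer2 = d_ff * d_model + d_model
    return layer1, layer2, layer1 + layer2

def attention_param_count(d_model):
    """Weights plus biases for four Linear(d_model, d_model) projections."""
    return 4 * (d_model * d_model + d_model)

print(ffn_param_count(64, 256))   # (16640, 16448, 33088)
print(attention_param_count(64))  # 16640
```

With `d_ff = 4 * d_model`, the feed-forward layers end up with about twice as many parameters as the attention projections.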

In [None]:
class FeedForward(nn.Module):
    """Position-wise feed-forward network"""
    
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        
        # TODO: Create two linear layers
        # HINT: First layer: d_model -> d_ff (expansion)
        # HINT: Second layer: d_ff -> d_model (projection back)
        # HINT: Typically d_ff = 4 * d_model
        self.linear1 = None  # Your code here
        self.linear2 = None  # Your code here
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # TODO: Implement feed-forward pass
        # HINT: linear1 -> ReLU -> dropout -> linear2
        # HINT: Use F.relu() for activation
        
        # Step 1: First linear layer
        x = None  # Your code here
        
        # Step 2: ReLU activation
        x = None  # Your code here
        
        # Step 3: Dropout
        x = None  # Your code here
        
        # Step 4: Second linear layer
        x = None  # Your code here
        
        return x

# Test feed-forward network
d_model = 64
d_ff = 4 * d_model  # Common choice: 4x expansion

ffn = FeedForward(d_model, d_ff)

# Test input
x = torch.randn(2, 10, d_model)
output = ffn(x)

print(f"🔄 Feed-Forward Network Test:")
print(f"   Input shape: {x.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Hidden dimension: {d_ff} ({d_ff // d_model}x expansion)")
print(f"   Parameters: {sum(p.numel() for p in ffn.parameters()):,}")

# Analyze parameter distribution
total_params = sum(p.numel() for p in ffn.parameters())
layer1_params = ffn.linear1.weight.numel() + ffn.linear1.bias.numel()
layer2_params = ffn.linear2.weight.numel() + ffn.linear2.bias.numel()

print(f"\n📊 Parameter Breakdown:")
print(f"   Layer 1: {layer1_params:,} parameters")
print(f"   Layer 2: {layer2_params:,} parameters")
print(f"   Total: {total_params:,} parameters")
print(f"\n💡 With d_ff = 4 * d_model, the feed-forward network holds about two-thirds of a transformer block's parameters!")
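
"Position-wise" can be checked directly: `nn.Linear` acts on the last dimension, so every position is transformed with the same weights and no information moves between positions. A small sketch using a single linear layer (not the full `FeedForward` class, which is left for you to implement):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 64, 256
linear = nn.Linear(d_model, d_ff)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
full = linear(x)                 # applied to all positions at once

# Apply the same layer to one position at a time and stack the results
per_position = torch.stack([linear(x[:, t, :]) for t in range(x.shape[1])], dim=1)

print(full.shape)  # torch.Size([2, 10, 256])
print(torch.allclose(full, per_position, atol=1e-5))
```

Mixing information across positions is the job of attention; the feed-forward network only reshapes each token's representation.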

## 4.3 Complete Transformer Block

Now let's assemble everything into a complete transformer block:

In [None]:
class TransformerBlock(nn.Module):
    """Complete transformer block with attention and feed-forward"""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # TODO: Initialize components
        # HINT: Use the classes we've implemented
        self.attention = None     # Your code here - MultiHeadAttention
        self.feed_forward = None  # Your code here - FeedForward
        
        # TODO: Layer normalization layers
        # HINT: We need two LayerNorm layers, one after attention and one after FFN
        self.norm1 = None  # Your code here - nn.LayerNorm(d_model)
        self.norm2 = None  # Your code here - nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # TODO: Multi-head attention with residual connection
        # HINT: Apply attention, dropout, then add to input (residual)
        # HINT: Then apply layer normalization
        
        # Step 1: Multi-head attention
        attn_output, attention_weights = None  # Your code here
        
        # Step 2: Dropout and residual connection
        x1 = None  # Your code here - x + dropout(attn_output)
        
        # Step 3: Layer normalization
        x1 = None  # Your code here
        
        # TODO: Feed-forward with residual connection
        # HINT: Apply feed-forward, dropout, residual, then layer norm
        
        # Step 4: Feed-forward
        ff_output = None  # Your code here
        
        # Step 5: Dropout and residual connection
        x2 = None  # Your code here - x1 + dropout(ff_output)
        
        # Step 6: Layer normalization
        x2 = None  # Your code here
        
        return x2, attention_weights

# Test complete transformer block
d_model = 64
num_heads = 8
d_ff = 4 * d_model

transformer_block = TransformerBlock(d_model, num_heads, d_ff)

# Test input
x = torch.randn(2, 10, d_model)
output, attention_weights = transformer_block(x)

print(f"🏗️ Complete Transformer Block Test:")
print(f"   Input shape: {x.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Attention weights shape: {attention_weights.shape}")
print(f"   Total parameters: {sum(p.numel() for p in transformer_block.parameters()):,}")

# Verify residual connections preserve dimensions
assert output.shape == x.shape, "Transformer block must preserve the input shape"
print(f"\n✅ Residual connections working:")
print(f"   Input and output have same shape")
print(f"   Information can flow directly through the block")
print(f"   Gradients can flow back through the skip path (mitigates vanishing gradients)")

# Visualize the transformer block architecture
fig, ax = plt.subplots(figsize=(10, 12))

# Define components with their positions
components = [
    {'name': 'Input', 'y': 0.1, 'color': 'lightblue'},
    {'name': 'Multi-Head\nAttention', 'y': 0.3, 'color': 'lightcoral'},
    {'name': 'Add & Norm', 'y': 0.45, 'color': 'lightgray'},
    {'name': 'Feed\nForward', 'y': 0.6, 'color': 'lightgreen'},
    {'name': 'Add & Norm', 'y': 0.75, 'color': 'lightgray'},
    {'name': 'Output', 'y': 0.9, 'color': 'lightsteelblue'}
]

# Draw components
for comp in components:
    rect = plt.Rectangle((0.3, comp['y']-0.04), 0.4, 0.08,  # Box centered on its label
                        facecolor=comp['color'], edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(0.5, comp['y'], comp['name'], ha='center', va='center', 
           fontsize=12, fontweight='bold')

# Draw main flow arrows
for i in range(len(components)-1):
    ax.arrow(0.5, components[i]['y']+0.04, 0, 0.06, 
            head_width=0.02, head_length=0.01, fc='blue', ec='blue')

# Draw residual connections
# First residual (around attention)
ax.arrow(0.2, 0.1, 0, 0.3, head_width=0.01, head_length=0.01, 
         fc='red', ec='red', linestyle='--', linewidth=2)
ax.arrow(0.2, 0.4, 0.08, 0.04, head_width=0.01, head_length=0.01, 
         fc='red', ec='red', linestyle='--', linewidth=2)

# Second residual (around feed-forward)
ax.arrow(0.8, 0.45, 0, 0.25, head_width=0.01, head_length=0.01, 
         fc='red', ec='red', linestyle='--', linewidth=2)
ax.arrow(0.8, 0.7, -0.08, 0.04, head_width=0.01, head_length=0.01, 
         fc='red', ec='red', linestyle='--', linewidth=2)

ax.text(0.15, 0.25, 'Residual\nConnection', ha='center', va='center', 
       fontsize=10, color='red', fontweight='bold')
ax.text(0.85, 0.58, 'Residual\nConnection', ha='center', va='center', 
       fontsize=10, color='red', fontweight='bold')

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Transformer Block Architecture', fontsize=16, fontweight='bold')
ax.axis('off')

plt.tight_layout()
plt.show()

print("🔗 Key Architectural Features:")
print("   • Residual connections enable deep networks")
print("   • Layer normalization stabilizes training")
print("   • Multi-head attention captures relationships")
print("   • Feed-forward adds non-linear processing")
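
To see where a block's parameters go, the counts can be worked out by hand. This is a rough sketch that assumes the attention layer uses four `d_model x d_model` projections (query, key, value, output), each with a bias; your `MultiHeadAttention` may differ, and `block_param_budget` is a hypothetical helper, not part of the assignment:

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def block_param_budget(d_model, d_ff):
    # Assumption: Q, K, V and output projections, each d_model x d_model with a bias
    attention = 4 * linear_params(d_model, d_model)
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with a scale and a shift vector
    total = attention + feed_forward + layer_norms
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "layer_norms": layer_norms,
        "total": total,
    }


budget = block_param_budget(d_model=64, d_ff=256)
for name, count in budget.items():
    print(f"{name:>12}: {count:>6,} ({100 * count / budget['total']:.1f}%)")
```

Under these assumptions the feed-forward network accounts for 33,088 of 49,984 parameters, roughly two-thirds of the block.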

---

# Part 5: Mini Transformer Implementation (25 minutes)

**Goal:** Combine all components into a working transformer and train it on sentiment analysis

## 5.1 Complete Mini Transformer

In [None]:
class MiniTransformer(nn.Module):
    """Complete mini transformer for classification"""
    
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, 
                 max_length, num_classes, dropout=0.1):
        super().__init__()
        
        # TODO: Initialize embedding layer
        # HINT: Use TransformerEmbeddings class we created
        self.embeddings = None  # Your code here
        
        # TODO: Stack of transformer blocks
        # HINT: Use nn.ModuleList to create multiple TransformerBlocks
        # HINT: Create num_layers TransformerBlock instances
        self.transformer_blocks = nn.ModuleList([
            None  # Your code here - TransformerBlock for each layer
            for _ in range(num_layers)
        ])
        
        # TODO: Final layer normalization
        # HINT: nn.LayerNorm(d_model)
        self.ln_f = None  # Your code here
        
        # TODO: Classification head
        # HINT: nn.Linear(d_model, num_classes) for final classification
        self.classifier = None  # Your code here
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input_ids, attention_mask=None):
        # TODO: Embed inputs
        # HINT: Apply embeddings to input_ids
        x = None  # Your code here
        
        # Store attention weights for analysis
        attention_weights = []
        
        # TODO: Pass through transformer blocks
        # HINT: Apply each transformer block in sequence
        for block in self.transformer_blocks:
            x, attn_weights = None  # Your code here - apply block
            attention_weights.append(attn_weights)
        
        # TODO: Final layer normalization
        x = None  # Your code here
        
        # TODO: Global average pooling for classification
        # HINT: Take mean over sequence dimension
        # HINT: Use torch.mean(x, dim=1) to pool over sequence length
        pooled = None  # Your code here
        
        # TODO: Classification
        # HINT: Apply classifier layer to pooled representation
        logits = None  # Your code here
        
        return logits, attention_weights

# Test complete transformer
config = {
    'vocab_size': 1000,
    'd_model': 64,
    'num_heads': 8,
    'num_layers': 2,  # Small for quick training
    'd_ff': 256,
    'max_length': 32,
    'num_classes': 2,  # Binary sentiment classification
    'dropout': 0.1
}

model = MiniTransformer(**config).to(device)

# Test input
batch_size = 4
seq_length = 16
input_ids = torch.randint(0, config['vocab_size'], (batch_size, seq_length)).to(device)

# Forward pass
logits, attention_weights = model(input_ids)

print(f"🤖 Complete Mini Transformer Test:")
print(f"   Input shape: {input_ids.shape}")
print(f"   Output logits shape: {logits.shape}")
print(f"   Number of attention layers: {len(attention_weights)}")
print(f"   Attention weights shape (per layer): {attention_weights[0].shape}")
print(f"   Total parameters: {sum(p.numel() for p in model.parameters()):,}")

# Count parameters by component
total_params = sum(p.numel() for p in model.parameters())
embed_params = sum(p.numel() for p in model.embeddings.parameters())
transformer_params = sum(p.numel() for block in model.transformer_blocks for p in block.parameters())
classifier_params = sum(p.numel() for p in model.classifier.parameters())

print(f"\n📊 Parameter Breakdown:")
print(f"   Embeddings: {embed_params:,} ({100*embed_params/total_params:.1f}%)")
print(f"   Transformer blocks: {transformer_params:,} ({100*transformer_params/total_params:.1f}%)")
print(f"   Classifier: {classifier_params:,} ({100*classifier_params/total_params:.1f}%)")

print("\n✅ Mini transformer ready for training!")
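
The pooling hint above averages over every position, padding included, so short sentences are diluted by padding vectors. One possible refinement is a masked mean. The sketch below assumes a mask with 1 for real tokens and 0 for padding; that convention and the helper name `masked_mean_pool` are assumptions, not part of the assignment:

```python
import torch


def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real (non-padding) positions only.

    hidden: (batch, seq_len, d_model)
    attention_mask: (batch, seq_len), 1 = real token, 0 = padding (assumed convention)
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # padding positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1.0)                # avoid division by zero for empty rows
    return summed / counts


# The last position plays the role of padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

print(masked_mean_pool(hidden, attention_mask))  # tensor([[2., 3.]])
print(hidden.mean(dim=1))                        # plain mean is pulled toward the padding vector
```

For the basic assignment the plain `torch.mean(x, dim=1)` is sufficient; the masked version matters more when sequence lengths vary widely.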

## 5.2 Sentiment Analysis Dataset

Create a simple sentiment analysis dataset for training:

In [None]:
# Create sample sentiment analysis dataset
positive_samples = [
    "I love this movie! It's fantastic and amazing.",
    "Great film with excellent acting and wonderful story.",
    "Absolutely brilliant! Best movie I've ever seen.",
    "Outstanding performance by all actors. Highly recommended.",
    "Perfect blend of comedy and drama. Really enjoyed it.",
    "Incredible cinematography and beautiful soundtrack.",
    "Thrilling adventure with great special effects.",
    "Heartwarming story that made me cry happy tears.",
    "Amazing direction and superb character development.",
    "Masterpiece! Every scene was perfectly crafted."
] * 10  # Repeat to get more samples

negative_samples = [
    "This movie is terrible. I hate it completely.",
    "Boring and predictable plot. Waste of time.",
    "Awful acting and poor script. Very disappointed.",
    "Worst film ever. No redeeming qualities.",
    "Confusing story with bad character development.",
    "Terrible special effects and annoying soundtrack.",
    "Slow pacing and boring dialogue throughout.",
    "Disappointing ending. Plot makes no sense.",
    "Poor direction and weak performances by actors.",
    "Complete disaster. Don't waste your money."
] * 10  # Repeat to get more samples

# Combine and create labels
texts = positive_samples + negative_samples
labels = [1] * len(positive_samples) + [0] * len(negative_samples)

print(f"📚 Sentiment Analysis Dataset:")
print(f"   Total samples: {len(texts)}")
print(f"   Positive samples: {len(positive_samples)}")
print(f"   Negative samples: {len(negative_samples)}")

# Build vocabulary
tokenizer = SimpleTokenizer(vocab_size=500)
tokenizer.build_vocab(texts)

# Create dataset class
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=32):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        
        # Tokenize
        token_ids = self.tokenizer.tokenize(text, max_length=self.max_length)
        
        return {
            'input_ids': torch.tensor(token_ids, dtype=torch.long),
            'labels': torch.tensor(label, dtype=torch.long)
        }

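# Split dataset
# NOTE: each sentence is repeated 10 times, so identical sentences land in both
# splits and validation accuracy will overestimate generalization.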
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Create datasets
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)

# Create data loaders
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print(f"\n📊 Dataset Split:")
print(f"   Training samples: {len(train_dataset)}")
print(f"   Validation samples: {len(val_dataset)}")
print(f"   Batch size: {batch_size}")
print(f"   Training batches: {len(train_loader)}")
print(f"   Validation batches: {len(val_loader)}")

# Show sample
sample = train_dataset[0]
print(f"\n🔍 Sample Data:")
print(f"   Text: {train_texts[0]}")
print(f"   Label: {train_labels[0]} ({'Positive' if train_labels[0] == 1 else 'Negative'})")
print(f"   Token IDs: {sample['input_ids'][:10].tolist()}...")
print(f"   Decoded: {tokenizer.decode(sample['input_ids'][:15].tolist())}...")
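
Because the sample sentences are repeated before splitting, the same sentence can appear in both the training and validation sets. One way to avoid that overlap is to hold out unique sentences first and repeat only the training portion. This is a sketch on toy data; `split_unique_then_repeat` is a hypothetical helper, not part of the assignment:

```python
import random
from collections import defaultdict


def split_unique_then_repeat(texts, labels, val_fraction=0.2, repeats=10, seed=42):
    """Hold out unique sentences per class first, then repeat only the training part."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, val = [], []
    for label, group in sorted(by_label.items()):
        group = sorted(set(group))  # drop exact duplicates
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction))
        val += [(t, label) for t in group[:n_val]]
        train += [(t, label) for t in group[n_val:]] * repeats
    rng.shuffle(train)
    return train, val


# Toy stand-in data (hypothetical sentences, 5 per class)
texts = [f"good movie {i}" for i in range(5)] + [f"bad movie {i}" for i in range(5)]
labels = [1] * 5 + [0] * 5

train, val = split_unique_then_repeat(texts, labels)
overlap = {t for t, _ in train} & {t for t, _ in val}
print(len(train), len(val), len(overlap))  # 80 2 0
```

With only ten unique sentences per class the validation set becomes very small, so the resulting accuracy is noisy but no longer inflated by memorized duplicates.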

## 5.3 Training the Mini Transformer

Let's train our transformer on sentiment analysis:

In [None]:
def train_transformer(model, train_loader, val_loader, num_epochs=5, learning_rate=1e-3):
    """Train the mini transformer model"""
    
    # TODO: Setup training components
    # HINT: Loss function for classification, optimizer, and optional scheduler
    
    # TODO: Define loss function
    # HINT: nn.CrossEntropyLoss() for multi-class classification
    criterion = None  # Your code here
    
    # TODO: Define optimizer
    # HINT: optim.Adam works well for transformers; pass lr=learning_rate
    # HINT: Use weight_decay=0.01 for regularization (optim.AdamW applies it in decoupled form)
    optimizer = None  # Your code here
    
    # Training history
    history = {
        'train_loss': [], 'train_acc': [],
        'val_loss': [], 'val_acc': []
    }
    
    print(f"🚀 Training Mini Transformer")
    print(f"Epochs: {num_epochs}, Learning Rate: {learning_rate}")
    print("=" * 50)
    
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        
        train_pbar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs} [Train]')
        
        for batch in train_pbar:
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            
            # TODO: Training step
            # HINT: 1. Zero gradients, 2. Forward pass, 3. Compute loss, 4. Backward, 5. Update
            
            # Step 1: Zero gradients
            None  # Your code here
            
            # Step 2: Forward pass
            logits, _ = None  # Your code here - model(input_ids)
            
            # Step 3: Compute loss
            loss = None  # Your code here
            
            # Step 4: Backward pass
            None  # Your code here
            
            # Step 5: Update weights
            None  # Your code here
            
            # Statistics (implemented for you)
            train_loss += loss.item()
            predicted = logits.detach().argmax(dim=1)
            train_total += labels.size(0)
            train_correct += (predicted == labels).sum().item()
            
            # Update progress bar
            current_acc = 100. * train_correct / train_total
            train_pbar.set_postfix({
                'Loss': f'{loss.item():.4f}',
                'Acc': f'{current_acc:.2f}%'
            })
        
        # Validation phase
        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                labels = batch['labels'].to(device)
                
                logits, _ = model(input_ids)
                loss = criterion(logits, labels)
                
                val_loss += loss.item()
                predicted = logits.argmax(dim=1)
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        
        # Calculate epoch metrics
        epoch_train_loss = train_loss / len(train_loader)
        epoch_train_acc = 100. * train_correct / train_total
        epoch_val_loss = val_loss / len(val_loader)
        epoch_val_acc = 100. * val_correct / val_total
        
        # Store history
        history['train_loss'].append(epoch_train_loss)
        history['train_acc'].append(epoch_train_acc)
        history['val_loss'].append(epoch_val_loss)
        history['val_acc'].append(epoch_val_acc)
        
        # Print epoch summary
        print(f"Epoch {epoch+1}: "
              f"Train Loss: {epoch_train_loss:.4f}, Train Acc: {epoch_train_acc:.2f}%, "
              f"Val Loss: {epoch_val_loss:.4f}, Val Acc: {epoch_val_acc:.2f}%")
    
    print(f"\n🎯 Training Complete!")
    print(f"Final Validation Accuracy: {history['val_acc'][-1]:.2f}%")
    
    return history

# Update model configuration for our vocabulary
config['vocab_size'] = len(tokenizer.vocab)
model = MiniTransformer(**config).to(device)

print(f"📋 Model Configuration:")
for key, value in config.items():
    print(f"   {key}: {value}")

# Train the model
history = train_transformer(model, train_loader, val_loader, num_epochs=10, learning_rate=1e-3)
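
A detail that is easy to miss in the training loop: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The sketch below, using made-up logits for three samples, reproduces the loss by hand to show what is being computed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])  # raw scores, no softmax applied
labels = torch.tensor([0, 1, 1])

loss = nn.CrossEntropyLoss()(logits, labels)

# Same quantity by hand: negative log-probability of the correct class, averaged over the batch
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(f"{loss.item():.4f} vs {manual.item():.4f}")
```

Passing softmax outputs to the loss instead of logits would apply softmax twice and flatten the gradients, so the model's `forward` should return logits as it does here.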

## 5.4 Analyze Training Results

Visualize training progress and check for overfitting:

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Loss curves
epochs = range(1, len(history['train_loss']) + 1)
ax1.plot(epochs, history['train_loss'], 'b-', label='Training Loss', linewidth=2)
ax1.plot(epochs, history['val_loss'], 'r-', label='Validation Loss', linewidth=2)
ax1.set_title('Training and Validation Loss', fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy curves
ax2.plot(epochs, history['train_acc'], 'b-', label='Training Accuracy', linewidth=2)
ax2.plot(epochs, history['val_acc'], 'r-', label='Validation Accuracy', linewidth=2)
ax2.set_title('Training and Validation Accuracy', fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 100)

plt.tight_layout()
plt.show()

print(f"📈 Training Analysis:")
print(f"   Final training accuracy: {history['train_acc'][-1]:.2f}%")
print(f"   Final validation accuracy: {history['val_acc'][-1]:.2f}%")
print(f"   Best validation accuracy: {max(history['val_acc']):.2f}%")

# Check for overfitting
train_val_gap = history['train_acc'][-1] - history['val_acc'][-1]
if train_val_gap > 10:
    print(f"   ⚠️ Possible overfitting (gap: {train_val_gap:.1f}%)")
else:
    print(f"   ✅ No sign of overfitting (gap: {train_val_gap:.1f}%)")

## 5.5 Attention Visualization

Let's see what our transformer learned to pay attention to:

In [None]:
def visualize_attention(model, tokenizer, text, max_length=16):
    """Visualize attention patterns for a given text"""
    
    model.eval()
    
    # Tokenize input
    token_ids = tokenizer.tokenize(text, max_length=max_length)
    input_ids = torch.tensor(token_ids).unsqueeze(0).to(device)
    
    # Forward pass
    with torch.no_grad():
        logits, attention_weights = model(input_ids)
        prediction = torch.softmax(logits, dim=-1)
    
    # Get tokens for visualization
    tokens = []
    for idx in token_ids:
        if idx in tokenizer.inverse_vocab:
            token = tokenizer.inverse_vocab[idx]
            if token != tokenizer.pad_token:
                tokens.append(token)
    
    # Limit to actual tokens (remove padding)
    num_tokens = len(tokens)
    
    # Plot attention for each layer and head
    num_layers = len(attention_weights)
    num_heads = attention_weights[0].shape[1]
    
    fig, axes = plt.subplots(num_layers, min(4, num_heads), figsize=(16, 4*num_layers),
                             squeeze=False)  # Always return a 2D array of axes
    
    for layer in range(num_layers):
        layer_attention = attention_weights[layer][0].cpu().numpy()  # First batch
        
        for head in range(min(4, num_heads)):  # Show first 4 heads
            head_attention = layer_attention[head][:num_tokens, :num_tokens]
            
            im = axes[layer, head].imshow(head_attention, cmap='Blues', aspect='auto')
            axes[layer, head].set_xticks(range(num_tokens))
            axes[layer, head].set_yticks(range(num_tokens))
            axes[layer, head].set_xticklabels(tokens, rotation=45)
            axes[layer, head].set_yticklabels(tokens)
            axes[layer, head].set_title(f'Layer {layer+1}, Head {head+1}', fontweight='bold')
            
            if head == 0:
                axes[layer, head].set_ylabel('Query Tokens')
            if layer == num_layers - 1:
                axes[layer, head].set_xlabel('Key Tokens')
    
    plt.tight_layout()
    plt.show()
    
    # Show prediction
    pred_class = torch.argmax(prediction, dim=-1).item()
    confidence = prediction[0, pred_class].item()
    
    print(f"📝 Text: {text}")
    print(f"🎯 Prediction: {'Positive' if pred_class == 1 else 'Negative'} (confidence: {confidence:.3f})")
    print(f"🔍 Attention patterns show what the model focuses on")

# Test attention visualization
test_examples = [
    "I love this amazing movie!",
    "This film is terrible and boring.",
    "Great acting but poor story."
]

for text in test_examples:
    print(f"\n{'='*60}")
    visualize_attention(model, tokenizer, text)

print(f"\n🧠 Attention Analysis (patterns to look for in the plots above):")
print(f"   • Different heads may learn different patterns")
print(f"   • Some heads may focus on specific sentiment words")
print(f"   • Later layers may show more refined attention patterns")
print(f"   • Attention lets each token draw on context from the whole sentence")
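
Heatmaps are hard to compare by eye, so a single number per head can help. One option is the entropy of each attention row: low entropy means a head concentrates on a few tokens, high entropy means it spreads its weight evenly. The sketch below assumes weights of shape `(batch, heads, query_len, key_len)` and uses two synthetic heads; `attention_entropy` is a hypothetical helper:

```python
import math
import torch


def attention_entropy(attn, eps=1e-12):
    """Mean entropy (in nats) of each head's attention rows.

    attn: (batch, heads, query_len, key_len), with rows summing to 1.
    Returns a tensor of shape (heads,).
    """
    row_entropy = -(attn * torch.log(attn.clamp_min(eps))).sum(dim=-1)
    return row_entropy.mean(dim=(0, 2))


seq_len = 4
uniform = torch.full((1, 1, seq_len, seq_len), 1.0 / seq_len)  # attends equally everywhere
focused = torch.eye(seq_len).reshape(1, 1, seq_len, seq_len)   # each token attends to one token
attn = torch.cat([uniform, focused], dim=1)                    # two synthetic heads

entropy = attention_entropy(attn)
print(entropy)  # head 0 is near log(4), head 1 is near 0
print(f"Maximum possible entropy for {seq_len} keys: {math.log(seq_len):.4f}")
```

Applied to one layer of the model's `attention_weights`, this gives one value per head, which makes it easier to judge whether heads really behave differently.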

---

# Part 6: Critical Analysis & Applications (10 minutes)

**Goal:** Reflect on transformer capabilities, limitations, and real-world impact

## 6.1 Performance Comparison

Let's compare our transformer with simpler approaches:

In [None]:
# Compare with a simple baseline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepare data for sklearn
X_train = train_texts
y_train = train_labels
X_val = val_texts
y_val = val_labels

# TF-IDF + Logistic Regression baseline
vectorizer = TfidfVectorizer(max_features=500, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train logistic regression
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_val_tfidf)
lr_accuracy = accuracy_score(y_val, y_pred_lr)

# Get transformer accuracy on validation set
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        
        logits, _ = model(input_ids)
        predicted = logits.argmax(dim=1)
        
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

transformer_accuracy = 100. * correct / total

# Comparison
print(f"📊 Model Comparison:")
print(f"   TF-IDF + Logistic Regression: {lr_accuracy*100:.2f}%")
print(f"   Mini Transformer: {transformer_accuracy:.2f}%")

improvement = transformer_accuracy - (lr_accuracy * 100)
if improvement > 0:
    print(f"   🎯 Transformer improvement: +{improvement:.2f}%")
else:
    print(f"   📝 Note: Simple baseline is competitive for this small dataset")

# Complexity comparison
transformer_params = sum(p.numel() for p in model.parameters())
lr_params = X_train_tfidf.shape[1] + 1  # Features + bias

print(f"\n⚖️ Complexity Comparison:")
print(f"   Transformer parameters: {transformer_params:,}")
print(f"   Logistic Regression parameters: {lr_params:,}")
print(f"   Parameter ratio: {transformer_params / lr_params:.1f}x more")

print(f"\n💡 Key Insights:")
print(f"   • Transformers excel with larger datasets")
print(f"   • Simple models can be competitive on small/simple tasks")
print(f"   • Transformers capture complex patterns and context")
print(f"   • Trade-off between complexity and performance")

## 6.2 Computational Complexity Analysis

Understanding the computational costs of transformers:

In [None]:
# Analyze computational complexity
def analyze_complexity(d_model, sequence_length, num_heads, d_ff):
    """Analyze transformer computational complexity"""
    
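    d_k = d_model // num_heads  # Per-head dimension; heads split d_model, so the total cost is unchanged
    
    # Attention scores and weighted sum: O(L^2 * d_model) where L = sequence length
    # (simplified: ignores the Q/K/V/output projections, which add O(L * d_model^2))
    attention_ops = sequence_length ** 2 * d_model
    
    # Feed-forward complexity: O(L * d_model * d_ff)
    feedforward_ops = sequence_length * d_model * d_ff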
    
    total_ops = attention_ops + feedforward_ops
    
    return {
        'attention': attention_ops,
        'feedforward': feedforward_ops,
        'total': total_ops
    }

# Analyze different sequence lengths
sequence_lengths = [32, 128, 512, 1024, 2048]
d_model = 512
num_heads = 8
d_ff = 2048

results = []
for seq_len in sequence_lengths:
    complexity = analyze_complexity(d_model, seq_len, num_heads, d_ff)
    results.append(complexity)

# Plot complexity scaling
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Linear vs quadratic scaling
attention_ops = [r['attention'] for r in results]
feedforward_ops = [r['feedforward'] for r in results]

ax1.loglog(sequence_lengths, attention_ops, 'r-o', label='Attention (Quadratic)', linewidth=2)
ax1.loglog(sequence_lengths, feedforward_ops, 'b-o', label='Feed-Forward (Linear)', linewidth=2)
ax1.set_xlabel('Sequence Length')
ax1.set_ylabel('Operations (log scale)')
ax1.set_title('Computational Complexity Scaling', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Memory usage (attention matrices)
memory_usage = [seq_len ** 2 * num_heads for seq_len in sequence_lengths]
ax2.semilogy(sequence_lengths, memory_usage, 'g-o', linewidth=2)
ax2.set_xlabel('Sequence Length')
ax2.set_ylabel('Attention Matrix Size (log scale)')
ax2.set_title('Memory Usage for Attention', fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"📊 Complexity Analysis:")
print(f"   Attention: O(L²·d) - quadratic in sequence length")
print(f"   Feed-forward: O(L·d·d_ff) - linear in sequence length")
print(f"   Memory: O(L²·h) - quadratic in sequence length for attention matrices")

print(f"\n🎯 Scaling Implications:")
for i, seq_len in enumerate(sequence_lengths[:3]):
    ops = results[i]['total']
    print(f"   Length {seq_len:4d}: {ops/1e6:.1f}M operations")

print(f"\n⚠️ Challenges:")
print(f"   • Long sequences are computationally expensive")
print(f"   • Memory usage grows quadratically")
print(f"   • Modern models use techniques like sparse attention")
print(f"   • Trade-offs between sequence length and model size")
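
The two cost formulas also show where attention starts to dominate. Under the simplified operation counts used in `analyze_complexity` (which leave out constant factors and the attention projections), the ratio of attention to feed-forward cost reduces to `L / d_ff`. A small sketch, with `cost_ratio` as a hypothetical helper:

```python
def cost_ratio(sequence_length, d_model, d_ff):
    """Attention ops divided by feed-forward ops under the simplified counts above."""
    attention_ops = sequence_length ** 2 * d_model
    feedforward_ops = sequence_length * d_model * d_ff
    return attention_ops / feedforward_ops  # simplifies to sequence_length / d_ff


d_model, d_ff = 512, 2048
for seq_len in [32, 128, 512, 1024, 2048, 8192]:
    ratio = cost_ratio(seq_len, d_model, d_ff)
    print(f"L={seq_len:5d}: attention / feed-forward = {ratio:.3f}")
```

In this model the feed-forward layers are the larger cost for short sequences, and attention overtakes them only once the sequence length exceeds `d_ff` (2048 here). Real implementations shift the exact crossover, but the qualitative picture is the same.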

## 6.3 Real-World Applications and Ethics

Let's reflect on the broader implications of transformer technology:

In [None]:
# Display timeline of transformer evolution
transformer_timeline = {
    2017: "Transformer (Attention Is All You Need)",
    2018: "BERT (Bidirectional Encoder Representations)",
    2019: "GPT-2 (language model with 1.5B parameters), T5",
    2020: "GPT-3 (175B parameters, few-shot learning)",
    2021: "Switch Transformer, Codex",
    2022: "ChatGPT, GPT-3.5, PaLM (540B parameters)",
    2023: "GPT-4, Claude, LLaMA, Bard",
    2024: "Multimodal models, reasoning improvements"
}

fig, ax = plt.subplots(figsize=(14, 8))

years = list(transformer_timeline.keys())
models = list(transformer_timeline.values())

# Create timeline
y_pos = range(len(years))
colors = plt.cm.viridis(np.linspace(0, 1, len(years)))

bars = ax.barh(y_pos, [1]*len(years), color=colors, alpha=0.7)
ax.set_yticks(y_pos)
ax.set_yticklabels([f"{year}: {model}" for year, model in transformer_timeline.items()])
ax.set_xlabel('Timeline')
ax.set_title('Evolution of Transformer Models', fontsize=16, fontweight='bold')
ax.set_xlim(0, 1)

# Remove x-axis ticks since we just want the timeline
ax.set_xticks([])

plt.tight_layout()
plt.show()

print("🚀 Transformer Revolution:")
print("   • From research paper (2017) to powering ChatGPT (2022) in about 5 years")
print("   • Exponential growth in model size and capabilities")
print("   • Foundation for the current AI revolution")

# Applications analysis
applications = {
    "Natural Language Processing": [
        "Machine translation (Google Translate)",
        "Text summarization and generation",
        "Question answering systems",
        "Sentiment analysis and classification"
    ],
    "Code Generation": [
        "GitHub Copilot (code completion)", 
        "Automated bug fixing",
        "Code explanation and documentation",
        "Programming language translation"
    ],
    "Multimodal AI": [
        "Image captioning and description",
        "Visual question answering",
        "Text-to-image generation (DALL-E)",
        "Video understanding and generation"
    ],
    "Scientific Applications": [
        "Protein structure prediction (AlphaFold)",
        "Drug discovery and molecular design", 
        "Scientific literature analysis",
        "Mathematical theorem proving"
    ]
}

print(f"\n🌍 Real-World Applications:")
for category, apps in applications.items():
    print(f"\n{category}:")
    for app in apps:
        print(f"   • {app}")

## 6.4 Critical Reflection Questions

**TODO: Answer these questions based on your experience with transformers:**

### Question 1: Technical Understanding

**How does the attention mechanism allow transformers to capture long-range dependencies better than RNNs?**

[TODO: Explain how attention allows direct connections between distant words, while RNNs must pass information through many time steps. Discuss the vanishing gradient problem in RNNs and how attention circumvents it.]

### Question 2: Architecture Design

**Why do transformers use multi-head attention instead of single-head attention?**

[TODO: Discuss how multiple heads can focus on different types of relationships (syntactic, semantic, positional), increased model capacity, parallel processing benefits, and specialization of different heads.]

### Question 3: Scaling and Efficiency

**What are the main computational bottlenecks in transformer models, and how might they be addressed?**

[TODO: Identify the quadratic attention complexity and memory requirements, then discuss solutions such as sparse attention, linear attention, efficient architectures, and hardware optimizations.]

### Question 4: Comparison Analysis

**Compare transformers with CNNs and RNNs. When would you choose each architecture?**

**Transformers:**
[TODO: List advantages like parallelization, long-range dependencies, attention interpretability, and disadvantages like computational cost, quadratic scaling]

**CNNs:**
[TODO: Discuss advantages for spatial data, such as efficiency and translation invariance, and when to use them for computer vision tasks]

**RNNs:**
[TODO: Discuss advantages for sequential data under tight memory limits and for online learning, and disadvantages on long sequences]

### Question 5: Ethical Considerations

**What are the main ethical concerns with large transformer-based language models like GPT-4?**

**Bias and Fairness:**
[TODO: Discuss how training data bias affects model outputs, representation issues, and fairness across different groups]

**Misinformation:**
[TODO: Address concerns about generating false information, deepfakes, and the challenge of distinguishing AI-generated content]

**Economic Impact:**
[TODO: Discuss job displacement, automation of knowledge work, and economic inequality]

**Privacy and Security:**
[TODO: Address data privacy concerns, potential for misuse, and security vulnerabilities]

### Question 6: Future Directions

**What future developments in transformer technology are you most excited about? Most concerned about?**

**Exciting Developments:**
[TODO: Discuss potential positive applications, scientific breakthroughs, educational tools, accessibility improvements]

**Concerning Developments:**
[TODO: Address potential negative uses, societal impacts, need for regulation, ensuring beneficial development]

### Question 7: Implementation Insights

**What was the most challenging aspect of implementing the transformer architecture? What surprised you?**

[TODO: Reflect on the implementation experience - perhaps the complexity of attention matrices, the importance of residual connections, how the pieces fit together, or the gap between theory and practice]

## Summary and Reflection

### What You've Accomplished

Congratulations! In this assignment, you have:

**Built a transformer from scratch** with all core components including attention, embeddings, and feed-forward networks  
**Implemented the attention mechanism** that revolutionized natural language processing  
**Trained a working model** on sentiment analysis and achieved meaningful results  
**Analyzed attention patterns** to understand what the model learned  
**Explored computational complexity** and scaling challenges  
**Reflected critically** on the broader implications of transformer technology  

### Key Takeaways

**TODO: Write 4-5 key insights from this assignment:**

1. [TODO: Your first key takeaway about the attention mechanism and its revolutionary impact]
2. [TODO: Your second key takeaway about the architecture design and why each component matters]
3. [TODO: Your third key takeaway about computational complexity and scaling challenges]
4. [TODO: Your fourth key takeaway about real-world applications and their impact]
5. [TODO: Your fifth key takeaway about ethical considerations and responsible AI development]

### Evolution from Previous Assignments

**TODO: Compare transformers to previous architectures you've studied:**

**From Linear Models (CA1) to Transformers:**
[TODO: Discuss the journey from simple linear relationships to complex attention mechanisms]

**From Neural Networks (CA3) to Transformers:**
[TODO: Compare basic MLPs to sophisticated transformer blocks]

**From CNNs (CA4) to Transformers:**
[TODO: Contrast local spatial processing in CNNs with attention over entire sequences in transformers]

### Looking Forward

**TODO: What aspects of transformers or NLP would you like to explore further?**

[TODO: Mention interests in large language models, multimodal transformers, specific applications, or research directions]

### Final Reflection

**TODO: Write a comprehensive reflection (200-300 words) on your experience with transformers:**

[TODO: Your final reflection here - discuss the elegance and power of the attention mechanism, how it connects to real-world AI systems, what surprised you about the implementation, thoughts on the rapid progress in AI, and considerations for the responsible development and deployment of transformer-based systems]

---

**Assignment Complete!**

Make sure to:
1. Complete all TODO sections
2. Test your implementations thoroughly
3. Answer all reflection questions thoughtfully
4. Save your notebook and export as PDF
5. Submit both .ipynb and .pdf files
6. Include your name and student ID at the top

You've just implemented the architecture that powers ChatGPT, BERT, and the current AI revolution. Well done!