# 🎓 Week 13 - Day 2: Transformer Architecture

## Today's Goals:
✅ Understand self-attention mechanism
✅ Build Transformer components in PyTorch
✅ Implement positional encoding
✅ Train a simple Transformer model
✅ Visualize attention patterns


## 🔧 Part 1: Setup - Install & Import All Libraries

**IMPORTANT:** Run ALL cells in this part sequentially!


In [None]:
# STEP 1: Import core libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

print("✅ Core libraries imported!")


In [None]:
# STEP 2: Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")


In [None]:
# STEP 3: Configure matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("✅ Visualization configured!")
print("\n🚀 Ready to build Transformers!")


## 🎯 Part 2: Self-Attention Mechanism

The core innovation that makes Transformers work!

**The Problem with RNNs:**
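- Process words one at a time (slow on long sequences)
- Struggle to retain long-range dependencies
- Can't parallelize across time steps, because each step needs the previous hidden state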

**The Transformer Solution:**
- Process ALL words simultaneously
- Every word attends to every other word
- Fully parallelizable!
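
To make the contrast concrete, here is a minimal sketch, not part of the lesson's model. It assumes a toy `nn.RNNCell` and raw dot-product scores without learned projections; all variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, dim = 2, 5, 8
x = torch.randn(batch, seq_len, dim)

# RNN: step t needs the hidden state from step t-1, so the time loop is sequential
cell = nn.RNNCell(dim, dim)
h = torch.zeros(batch, dim)
rnn_states = []
for t in range(seq_len):
    h = cell(x[:, t], h)
    rnn_states.append(h)
rnn_out = torch.stack(rnn_states, dim=1)        # (batch, seq_len, dim)

# Attention: all pairwise scores come from one batched matrix multiply
scores = x @ x.transpose(-2, -1) / dim ** 0.5   # (batch, seq_len, seq_len)
attn_out = torch.softmax(scores, dim=-1) @ x    # (batch, seq_len, dim)

print(rnn_out.shape, attn_out.shape)
```

The Python loop in the RNN half cannot be removed, while the attention half contains no loop over positions at all.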


In [None]:
# Self-Attention Implementation
class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Create Q, K, V projection layers
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, x):
        # x shape: (batch, seq_len, embed_dim)
        
        # Project to Q, K, V
        Q = self.query(x)  # "What am I looking for?"
        K = self.key(x)    # "What do I contain?"
        V = self.value(x)  # "What information do I have?"
        
        # Calculate attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(self.embed_dim)
        
        # Apply softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights

print("✅ Self-Attention class created!")


In [None]:
# Test Self-Attention
embed_dim = 64
seq_len = 5  # Representing: "The cat sat on mat"
batch_size = 1

# Create random word embeddings
x = torch.randn(batch_size, seq_len, embed_dim)

# Apply attention
attention = SelfAttention(embed_dim)
output, weights = attention(x)

print(f"✅ Input shape: {x.shape}")
print(f"✅ Output shape: {output.shape}")
print(f"✅ Attention weights shape: {weights.shape}")
print(f"\n🎯 Attention Matrix:")
print(weights[0].detach().numpy().round(2))


In [None]:
# Visualize attention patterns
words = ['The', 'cat', 'sat', 'on', 'mat']
attn_matrix = weights[0].detach().numpy()

plt.figure(figsize=(8, 6))
sns.heatmap(attn_matrix, 
            xticklabels=words, 
            yticklabels=words,
            annot=True, 
            fmt='.2f', 
            cmap='YlOrRd',
            cbar_kws={'label': 'Attention Weight'})
plt.title('Self-Attention Weights', fontsize=14, fontweight='bold')
plt.xlabel('Attending TO these words', fontweight='bold')
plt.ylabel('Attending FROM these words', fontweight='bold')
plt.tight_layout()
plt.show()

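print("💡 Each row shows how one word spreads its attention (rows sum to 1).")
print("   The layer is untrained, so this particular pattern is random!")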


### 💡 Key Insights:

✅ **Self-Attention** allows each word to look at all other words  
✅ **Query, Key, Value** projections play different roles: queries ask, keys get matched, values carry the content  
✅ **Attention weights** show relationships between words  
✅ **Parallel processing** - all words processed simultaneously!
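
Why does the implementation divide the scores by `sqrt(embed_dim)`? A rough sketch of the reasoning, assuming query and key entries are independent with unit variance (an idealization, since real projections only approximate this): the variance of a dot product grows with the dimension, and large scores push softmax toward near one-hot outputs.

```python
import math
import torch

torch.manual_seed(0)
d = 64
q = torch.randn(10_000, d)
k = torch.randn(10_000, d)

raw = (q * k).sum(dim=-1)        # 10,000 unscaled dot products
scaled = raw / math.sqrt(d)      # same scores after the sqrt(d) scaling

print(f"variance of raw scores:    {raw.var().item():.1f}")   # close to d
print(f"variance of scaled scores: {scaled.var().item():.2f}")  # close to 1
```

Keeping the score variance near 1 keeps softmax in a range where gradients stay useful.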


## 📍 Part 3: Positional Encoding

**The Position Problem:**
- Self-attention has NO information about word order!
- "Dog bites man" vs "Man bites dog" - Same words, different meanings!

**The Solution:**
- Add positional information using sine/cosine functions
- Each position gets a unique "signature" 
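
The position problem can be checked directly. The sketch below uses a small stand-in attention function (`attend`, with hypothetical projection layers, not the lesson's `SelfAttention` class) and shows that shuffling the input tokens simply shuffles the outputs the same way: each token gets the same vector no matter where it sits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim = 16
q_proj, k_proj, v_proj = (nn.Linear(dim, dim) for _ in range(3))

def attend(x):
    """Single-head scaled dot-product self-attention, no positional information."""
    q, k, v = q_proj(x), k_proj(x), v_proj(x)
    weights = F.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
    return weights @ v

x = torch.randn(1, 3, dim)        # stand-ins for "dog", "bites", "man"
perm = torch.tensor([2, 1, 0])    # reorder to "man", "bites", "dog"

with torch.no_grad():
    out = attend(x)
    out_perm = attend(x[:, perm])

# Compare the reordered original output with the output for the reordered input
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))
```

Without extra position information, the two sentences are indistinguishable to the layer apart from the order of its output rows.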


In [None]:
# Positional Encoding Implementation
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        
        # Calculate div_term
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * 
                             (-math.log(10000.0) / embed_dim))
        
        # Apply sine to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # Add positional encoding to embeddings
        return x + self.pe[:, :x.size(1)]

print("✅ Positional Encoding class created!")


In [None]:
# Test Positional Encoding
pos_encoder = PositionalEncoding(embed_dim=64)
x = torch.randn(1, 10, 64)
x_with_pos = pos_encoder(x)

print(f"✅ Original shape: {x.shape}")
print(f"✅ With positional encoding: {x_with_pos.shape}")
print("\n💡 Now the model knows: position 0, 1, 2, ..., 9!")


In [None]:
# Visualize positional encoding patterns
pe_matrix = pos_encoder.pe[0, :50, :64].numpy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap
sns.heatmap(pe_matrix.T, cmap='RdBu', center=0, ax=ax1)
ax1.set_title('Positional Encoding Matrix', fontsize=12, fontweight='bold')
ax1.set_xlabel('Position')
ax1.set_ylabel('Embedding Dimension')

# Line plot
for i in range(6):
    ax2.plot(pe_matrix[:, i], label=f'Dim {i}', linewidth=2)
ax2.set_title('Positional Patterns', fontsize=12, fontweight='bold')
ax2.set_xlabel('Position')
ax2.set_ylabel('Encoding Value')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Each position has a unique sine/cosine signature!")


### 💡 Key Insights:

✅ **Sine/Cosine waves** create unique position signatures  
✅ **Added to embeddings** - not replacing them!  
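✅ **No learning required** - the encoding is a fixed mathematical pattern with no trainable parameters  
✅ **Defined for any position** - the formula can be evaluated beyond the training length, although model quality may still degrade on lengths it never saw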
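
Two properties of the sinusoidal pattern can be verified numerically. The sketch below rebuilds the table with a standalone helper (`sinusoidal_pe` is an illustrative name, using the same formula as the class above) and checks that every position is distinct and that the dot product between two positions depends only on their offset.

```python
import math
import torch

def sinusoidal_pe(max_len, embed_dim):
    """Sinusoidal table: sine on even columns, cosine on odd columns."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                         * (-math.log(10000.0) / embed_dim))
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_len=50, embed_dim=64)

# Every position gets a distinct row
num_unique = torch.unique(pe, dim=0).shape[0]
print(num_unique == pe.shape[0])

# Positions 3 and 8 are 5 apart, and so are positions 10 and 15
sim_a = torch.dot(pe[3], pe[8]).item()
sim_b = torch.dot(pe[10], pe[15]).item()
print(sim_a, sim_b)
```

The second check follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), applied to each sine/cosine pair.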


## 🏗️ Part 4: Building the Complete Transformer Block

Now we combine everything into the famous Transformer architecture!

**Components:**
1. Multi-Head Self-Attention
2. Feed-Forward Network
3. Layer Normalization  
4. Residual Connections
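
Before wiring the components together, it can help to look at what `nn.MultiheadAttention` returns on its own. A small sketch, assuming default arguments apart from `batch_first=True`: the second return value holds the attention weights, averaged over heads by default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
mha.eval()

x = torch.randn(1, 10, embed_dim)
with torch.no_grad():
    out, attn = mha(x, x, x)   # query, key, value are all x for self-attention

print(out.shape)               # (batch, seq_len, embed_dim)
print(attn.shape)              # (batch, seq_len, seq_len)
print(attn.sum(dim=-1))        # each row of weights sums to 1
```

These weights can be passed to the same heatmap code used in Part 2.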


In [None]:
# Complete Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        
        # Multi-Head Attention
        self.attention = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        
        # Feed-Forward Network
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim)
        )
        
        # Layer Normalization
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # Multi-Head Attention + Residual + Norm
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        
        # Feed-Forward + Residual + Norm
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        
        return x

print("✅ Transformer Block created!")


In [None]:
# Test Transformer Block
transformer_block = TransformerBlock(
    embed_dim=64,
    num_heads=4,
    ff_dim=256
)

x = torch.randn(1, 10, 64)
output = transformer_block(x)

print(f"✅ Input: {x.shape}")
print(f"✅ Output: {output.shape}")
print(f"\n📊 Parameters: {sum(p.numel() for p in transformer_block.parameters()):,}")
print("\n💡 This is ONE layer - GPT-3 (175B) stacks 96 layers of a similar design!")


### 💡 Key Insights:

✅ **Multi-Head Attention** captures different types of relationships  
✅ **Residual Connections** help with gradient flow (like ResNet)  
✅ **Layer Normalization** stabilizes training  
✅ **Feed-Forward** processes each position independently
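
Where do the block's parameters come from? The sketch below rebuilds the same kinds of components as standalone modules (not the `TransformerBlock` class itself) and compares hand-computed counts with what PyTorch reports, using the sizes from the test cell above.

```python
import torch.nn as nn

embed_dim, num_heads, ff_dim = 64, 4, 256

def count(module):
    """Total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())

attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
feed_forward = nn.Sequential(
    nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
)
norms = nn.ModuleList([nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)])

# Hand-computed counts (weights + biases)
attn_params = 4 * (embed_dim * embed_dim + embed_dim)   # Q, K, V and output projections
ff_params = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
norm_params = 2 * (2 * embed_dim)                       # scale and shift per LayerNorm
total_params = attn_params + ff_params + norm_params

print(count(attention), attn_params)
print(count(feed_forward), ff_params)
print(count(norms), norm_params)
print("total:", total_params)
```

Note that `num_heads` does not appear in any of the formulas: splitting into more heads changes how the projections are used, not how many parameters they have.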


## 🎓 Part 5: Training a Complete Transformer Model

Let's put everything together and train on a simple task!

**Task:** Sequence Reversal  
**Input:** [1, 2, 3, 4, 5]  
**Output:** [5, 4, 3, 2, 1]
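
A short sketch of how this task maps onto tensors, using random stand-in logits instead of a real model: targets come from `torch.flip`, and the per-position predictions are flattened before the loss. The names and sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab_size = 4, 5, 20

inputs = torch.tensor([[1, 2, 3, 4, 5]])
flipped = torch.flip(inputs, dims=[1])     # reversed along the sequence axis
print(flipped)

# Stand-in for model output: one score per vocabulary entry at every position
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(1, vocab_size, (batch, seq_len))

criterion = nn.CrossEntropyLoss()
# Option 1: flatten batch and sequence into one axis of (batch * seq_len) predictions
loss_flat = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
# Option 2: move the class axis to position 1, which CrossEntropyLoss also accepts
loss_alt = criterion(logits.transpose(1, 2), targets)

print(loss_flat.item(), loss_alt.item())
```

Both options average the same per-token losses, so they give the same value.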


In [None]:
# Complete Transformer Model
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, max_len):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_encoder = PositionalEncoding(embed_dim, max_len)
        
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])
        
        self.fc_out = nn.Linear(embed_dim, vocab_size)
        
    def forward(self, x):
        # Embedding + Positional Encoding
        x = self.embedding(x)
        x = self.pos_encoder(x)
        
        # Apply Transformer blocks
        for block in self.transformer_blocks:
            x = block(x)
        
        # Output projection
        logits = self.fc_out(x)
        return logits

# Create model
model = SimpleTransformer(
    vocab_size=20,
    embed_dim=64,
    num_heads=4,
    ff_dim=128,
    num_layers=2,
    max_len=100
).to(device)

print(f"✅ Transformer Model created!")
print(f"📊 Parameters: {sum(p.numel() for p in model.parameters()):,}")


In [None]:
# Create dataset for sequence reversal
from torch.utils.data import Dataset, DataLoader

class ReverseDataset(Dataset):
    def __init__(self, num_samples=1000, seq_len=8, vocab_size=15):
        self.data = torch.randint(1, vocab_size, (num_samples, seq_len))
        self.targets = torch.flip(self.data, dims=[1])
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# Create dataloaders
# Token IDs are drawn from 1..14, which fits inside the model's vocab_size=20
train_dataset = ReverseDataset(num_samples=1000, seq_len=8, vocab_size=15)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Show example
sample_in, sample_out = train_dataset[0]
print("📝 Example:")
print(f"   Input:  {sample_in.numpy()}")
print(f"   Target: {sample_out.numpy()}")
print(f"\n✅ Dataset: {len(train_dataset)} samples")


In [None]:
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 20
vocab_size = 20  # must match the vocab_size passed to SimpleTransformer
losses = []

print("🚀 Training Transformer...\n")

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    
    for batch_in, batch_target in train_loader:
        batch_in = batch_in.to(device)
        batch_target = batch_target.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        logits = model(batch_in)
        loss = criterion(logits.reshape(-1, vocab_size), batch_target.reshape(-1))
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_loader)
    losses.append(avg_loss)
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:2d}/{num_epochs} - Loss: {avg_loss:.4f}")

print("\n✅ Training complete! 🎉")


In [None]:
# Visualize training progress
plt.figure(figsize=(10, 5))
plt.plot(losses, marker='o', linewidth=2, markersize=6)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Epoch', fontweight='bold')
plt.ylabel('Loss', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Decreasing loss = Model is learning!")


In [None]:
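# Test the trained model on freshly generated sequences (not the training samples)
model.eval()
test_dataset = ReverseDataset(num_samples=5, seq_len=8, vocab_size=15)

print("🧪 Testing Transformer:\n")

with torch.no_grad():
    for i in range(len(test_dataset)):
        test_in, test_target = test_dataset[i]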
        test_in_batch = test_in.unsqueeze(0).to(device)
        
        # Get prediction
        logits = model(test_in_batch)
        predictions = torch.argmax(logits, dim=-1)
        
        # Calculate accuracy
        correct = (predictions[0].cpu() == test_target).sum().item()
        accuracy = correct / len(test_target) * 100
        
        print(f"Test {i+1}:")
        print(f"   Input:      {test_in.numpy()}")
        print(f"   Target:     {test_target.numpy()}")
        print(f"   Prediction: {predictions[0].cpu().numpy()}")
        print(f"   ✅ Accuracy: {accuracy:.0f}%\n")


### 💡 Key Insights:

✅ **Complete architecture** - Embedding → Positional Encoding → Transformer Blocks → Output  
✅ **Fast training** - all positions are processed in parallel  
✅ **High accuracy** - the model should handle this simple task well; check the accuracies printed above  
✅ **Scalable** - the same building blocks are stacked in GPT-style models with billions of parameters!
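
A handful of printed examples gives only a rough picture. Here is a sketch of an aggregate metric on sequences generated separately from the training set. `token_accuracy` and `perfect_reverser` are hypothetical helpers; the stand-in predictor only exists so the snippet runs without a trained model.

```python
import torch
import torch.nn.functional as F

def token_accuracy(predict_fn, inputs, targets):
    """Fraction of positions where the argmax of the logits equals the target."""
    with torch.no_grad():
        logits = predict_fn(inputs)          # (batch, seq_len, vocab_size)
    predictions = logits.argmax(dim=-1)
    return (predictions == targets).float().mean().item()

# Fresh sequences, generated independently of the training data
torch.manual_seed(123)
vocab_size, seq_len = 20, 8
held_out = torch.randint(1, 15, (200, seq_len))
held_out_targets = torch.flip(held_out, dims=[1])

# Hypothetical stand-in for a trained model: it always emits the correct reversal
def perfect_reverser(x):
    return F.one_hot(torch.flip(x, dims=[1]), num_classes=vocab_size).float()

accuracy = token_accuracy(perfect_reverser, held_out, held_out_targets)
print(f"held-out token accuracy: {accuracy:.2%}")
```

To score the trained network, call `model.eval()` and pass `model` in place of `perfect_reverser`, after moving `held_out` and `held_out_targets` to `device`.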


## 🎯 Challenge Time!

### 🏆 Challenge: Experiment with Hyperparameters

**Your Mission:** Modify the Transformer and observe the effects!

Try changing these values:
```python
embed_dim = 64      # Try: 32, 128, 256
num_heads = 4       # Try: 2, 8 (must divide embed_dim!)
num_layers = 2      # Try: 1, 3, 4
ff_dim = 128        # Try: 64, 256, 512
learning_rate = 0.001  # Try: 0.0001, 0.01
```

**Questions to explore:**
1. Do more heads improve performance?
2. What happens with more layers?
3. Can you train faster with different learning rates?
4. What's the smallest model that still works well?
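
When sweeping `embed_dim` and `num_heads`, invalid pairs raise an error at model construction. One possible way to filter them up front is sketched below; `valid_head_configs` is an illustrative helper, and the head count 3 is included only to show a rejected value.

```python
def valid_head_configs(embed_dims, head_counts):
    """Return (embed_dim, num_heads, head_dim) for every combination that splits evenly."""
    configs = []
    for embed_dim in embed_dims:
        for num_heads in head_counts:
            if embed_dim % num_heads == 0:
                configs.append((embed_dim, num_heads, embed_dim // num_heads))
    return configs

# 3 is included on purpose as an invalid head count for these sizes
configs = valid_head_configs(embed_dims=[32, 64, 128, 256], head_counts=[2, 3, 4, 8])
for embed_dim, num_heads, head_dim in configs:
    print(f"embed_dim={embed_dim:3d}  num_heads={num_heads}  head_dim={head_dim}")
```

Each head works with `head_dim = embed_dim // num_heads` dimensions, so more heads means narrower heads, not a bigger model.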

**Bonus Challenge:**  
Try a harder task:
- Sort the sequence in ascending order
- Remove duplicate numbers
- Add 1 to each number
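
As a possible starting point for the bonus tasks, the targets for sorting and for adding 1 can be built from the same kind of random input as the reversal data. This is only a sketch with illustrative variable names. Removing duplicates is different: the output length varies, so it would need something like a padding token, which this notebook does not define.

```python
import torch

torch.manual_seed(0)
data = torch.randint(1, 15, (3, 8))   # same token range as the reversal data

# Sort each sequence in ascending order
targets_sorted = torch.sort(data, dim=1).values
# Add 1 to each token; the largest value becomes 15, still below vocab_size=20
targets_plus_one = data + 1

print(data[0])
print(targets_sorted[0])
print(targets_plus_one[0])
```

Either tensor can replace `self.targets` in a copy of `ReverseDataset`.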

Good luck! 🚀


In [None]:
# Your experimentation code here!

# Example: Modified hyperparameters
# model_new = SimpleTransformer(
#     vocab_size=20,
#     embed_dim=128,  # Increased!
#     num_heads=8,     # More heads!
#     ff_dim=256,
#     num_layers=3,    # Deeper!
#     max_len=100
# ).to(device)

print("💡 Uncomment and modify the code above to experiment!")


---

## 📚 Summary - What We Learned Today

### 1. Self-Attention Mechanism 🎯
- **Query, Key, Value** matrices
- **Attention weights** show word relationships
- **Parallel processing** - all words at once

### 2. Positional Encoding 📍
- **Sine/Cosine patterns** give position information
- **Added to embeddings** not replacing them
- **No learning required** - mathematical solution

### 3. Transformer Architecture 🏗️
- **Multi-Head Attention** for different relationships
- **Feed-Forward Networks** for position-wise processing
- **Residual Connections** for gradient flow
- **Layer Normalization** for stability

### 4. Complete Model 🎓
- **Embedding → Transformer → Output** pipeline
- **Trained on sequence reversal** task
- **High accuracy** achieved quickly
- **Scalable** to billions of parameters

---

## 🚀 What's Next?

**Tomorrow (Day 3): Hugging Face**
- Use pre-trained Transformers (BERT, GPT-2)
- Fine-tune models for your tasks
- No training from scratch!
- Build real NLP applications in 3 lines of code

---

## 💡 Key Takeaways:

✅ **Transformers revolutionized AI** by using self-attention  
✅ **Parallel processing** lets them train much faster than RNNs on modern hardware  
✅ **The same core architecture** underlies GPT, BERT, Claude, and ChatGPT  
✅ **You just built** the foundation of modern AI!

**Excellent work! 🎉**
