In [None]:
#@title üéß Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1RJjttCvltRK-j5XaI_Tp752cibGKRYMf", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/02_00_intro.mp3"))


In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# 🚀 Self-Attention & The Transformer Encoder from First Principles

*Part 2 of the Vizuara series on Understanding BERT from Scratch*
*Estimated time: 60 minutes*

In [None]:
#@title üéß Listen: Ai Assistant
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_02_ai_assistant.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


# 🤖 AI Teaching Assistant

Need help with this notebook? Open the **AI Teaching Assistant**: it has already read this entire notebook and can help with concepts, code, and exercises.

**[👉 Open AI Teaching Assistant](https://pods.vizuara.ai/courses/understanding-bert-from-scratch/practice/2/assistant)**

*Tip: Open it in a separate tab and work through this notebook side-by-side.*


In [None]:
#@title üéß Listen: Why It Matters
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_03_why_it_matters.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 1. Why Does This Matter?

The Transformer encoder is the engine behind BERT, and its building blocks (self-attention, feed-forward layers, residual connections) also power GPT and virtually every modern language model. At its heart is a mechanism called **self-attention**, a way for every word to look at every other word and decide which ones matter most.

In this notebook, we will build the entire Transformer encoder **from scratch**: no HuggingFace and no pre-built Transformer modules, only basic PyTorch layers. By the end, you will have:

1. A working **scaled dot-product attention** mechanism
2. A **multi-head attention** module
3. A complete **Transformer encoder block** with residual connections
4. Beautiful **attention heatmaps** showing which words attend to which

In [None]:
# 🔧 Setup: run this cell first
!pip install -q torch matplotlib numpy

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math

%matplotlib inline

torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
#@title üéß Listen: Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_05_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 2. Building Intuition

Imagine you are reading the sentence: *"The delivery arrived late, but **it** was in perfect condition."*

When you read "it," your brain instantly jumps back to "delivery": that is what "it" refers to. Your brain does not give equal weight to every word; it **selectively attends** to the relevant ones.

Self-attention does exactly this: for every word, it computes an attention score with every other word, then uses those scores to create a weighted combination.

Think of it like a library:
- The **Query** is your search question: "What am I looking for?"
- The **Key** is the label on each book: "What information does this book contain?"
- The **Value** is the actual content of the book: "Here is the information."

You match your Query against all Keys, and the best-matching Keys point you to the most relevant Values.
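
To make the library analogy concrete, here is a minimal sketch in plain Python (no PyTorch). It assumes a hypothetical three-book library with hand-picked 2-dimensional keys and values; the helper names `softmax` and `soft_lookup` are illustrative and are not used by the later cells.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return (weights, blended_value) for one query matched against all keys."""
    # Relevance of each "book": dot product between the query and its key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Blend the values, weighting each by how well its key matched the query.
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blended

# Hypothetical toy library: three books with 2-d keys (labels) and 2-d values (contents).
keys = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # most similar to the key of book 0

weights, blended = soft_lookup(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("blended value:", [round(b, 3) for b in blended])
```

Unlike a real library lookup, nothing is retrieved exclusively: every value contributes, and the best-matching key simply contributes the most.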

### 🤔 Think About This
Why do we need THREE separate matrices (Q, K, V) instead of just using the embeddings directly? Think about what would happen if Q = K = V = the embedding itself.

In [None]:
#@title üéß Listen: Mathematics Attention
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_06_mathematics_attention.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
#@title üéß Listen: Mathematics Multihead
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_07_mathematics_multihead.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 3. The Mathematics

### Scaled Dot-Product Attention

Given queries $Q$, keys $K$, and values $V$, the attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let us break this down computationally:
- $QK^T$ computes the dot product between every query and every key. This gives us a matrix of "relevance scores": how much each word should attend to each other word.
- Dividing by $\sqrt{d_k}$ prevents the dot products from becoming too large (which would make softmax saturate and produce very peaked distributions).
- Softmax normalizes each row so the attention weights sum to 1.
- Multiplying by $V$ computes a weighted combination of the value vectors.
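
The role of the $\sqrt{d_k}$ factor can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (the usual assumption behind the scaling argument); `dot_product_std` is an illustrative helper, not part of the notebook's later code.

```python
import math
import random

random.seed(0)

def dot_product_std(d_k, trials=2000):
    """Empirical standard deviation of q . k for random unit-variance q and k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)
    print(f"d_k={d_k:3d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(d_k))={scaled:.2f}")
```

Under these assumptions the raw standard deviation grows roughly like $\sqrt{d_k}$, while the scaled version stays close to 1 for every $d_k$, which keeps the softmax inputs in a range where it does not saturate.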

### Multi-Head Attention

Instead of computing attention once, we compute it $h$ times in parallel with different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

Computationally: each head learns to look for a different type of relationship. One head might learn syntax (subject-verb), another might learn coreference (what does "it" refer to?), and another might learn semantic similarity.
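
In practice the $h$ heads are usually not separate matrices; a single $d_{\text{model}}$-wide projection is sliced into $h$ chunks of width $d_k = d_{\text{model}} / h$. The following plain-Python sketch shows only that bookkeeping on nested lists. The function names `split_heads` and `concat_heads` are hypothetical, and the learned projections are omitted.

```python
def split_heads(x, num_heads):
    """(seq_len, d_model) nested list -> (num_heads, seq_len, d_k)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # Head h receives columns [h * d_k, (h + 1) * d_k) of every token vector.
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

def concat_heads(heads):
    """(num_heads, seq_len, d_k) -> (seq_len, d_model); inverse of split_heads."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Toy input: seq_len = 3 tokens, d_model = 8 features per token.
x = [[float(t * 8 + j) for j in range(8)] for t in range(3)]
heads = split_heads(x, num_heads=4)

print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 3 2
print(concat_heads(heads) == x)                     # True
```

Because splitting and concatenating are exact inverses, the only learned components are the projections applied before the split and the output projection $W^O$ applied after the concatenation.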

## 4. Let's Build It, Component by Component

### 4.1 Scaled Dot-Product Attention

In [None]:
#@title üéß Code Walkthrough: Sdpa Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_08_sdpa_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query tensor (batch, num_heads, seq_len, d_k)
        K: Key tensor (batch, num_heads, seq_len, d_k)
        V: Value tensor (batch, num_heads, seq_len, d_v)
        mask: Optional mask tensor

    Returns:
        output: Weighted values (batch, num_heads, seq_len, d_v)
        attention_weights: Softmax attention weights (batch, num_heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)

    # Step 1: Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, heads, seq, seq)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / math.sqrt(d_k)

    # Step 3: Apply mask (if provided)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 4: Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Step 5: Multiply by values
    output = torch.matmul(attention_weights, V)  # (batch, heads, seq, d_v)

    return output, attention_weights

Let us verify this with the numerical example from the article.

In [None]:
#@title üéß Code Walkthrough: Sdpa Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_09_sdpa_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Numerical example from the article
Q = torch.tensor([[[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]]]).unsqueeze(1)  # (1, 1, 3, 2)

K = torch.tensor([[[1.0, 1.0],
                    [0.0, 1.0],
                    [1.0, 0.0]]]).unsqueeze(1)

V = torch.tensor([[[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]]]).unsqueeze(1)

output, weights = scaled_dot_product_attention(Q, K, V)

print("Attention weights (row = query word, col = key word):")
print(weights[0, 0].detach().numpy().round(2))
print(f"\nOutput for first word: {output[0, 0, 0].detach().numpy().round(2)}")
print(f"Expected (from article): ~[3.0, 4.0]")

In [None]:
#@title üéß What to Look For: Sdpa Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_10_sdpa_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 📊 Visualize the attention weights
plt.figure(figsize=(6, 5))
plt.imshow(weights[0, 0].detach().numpy(), cmap='Blues', vmin=0, vmax=1)
plt.colorbar(label='Attention Weight')
plt.xticks([0, 1, 2], ['Word 1', 'Word 2', 'Word 3'])
plt.yticks([0, 1, 2], ['Word 1', 'Word 2', 'Word 3'])
plt.xlabel("Key (attending TO)")
plt.ylabel("Query (attending FROM)")
plt.title("Attention Weights Matrix")

# Add text annotations
for i in range(3):
    for j in range(3):
        val = weights[0, 0, i, j].item()
        plt.text(j, i, f"{val:.2f}", ha='center', va='center',
                 color='white' if val > 0.5 else 'black', fontsize=12)

plt.tight_layout()
plt.show()
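
The `mask` argument of `scaled_dot_product_attention` was not exercised in the example above. As a rough illustration of what `masked_fill(mask == 0, float('-inf'))` followed by softmax does to a single row of scores, here is a plain-Python sketch; `masked_softmax` is a hypothetical helper and the padding scenario is an assumed example.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of scores; positions where mask == 0 get weight 0."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# One query row with three keys; assume the third key is a padding token.
scores = [0.7, 0.0, 0.7]
mask = [1, 1, 0]

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The masked position receives exactly zero weight and the remaining weights are renormalized to sum to 1. One caveat: if every position in a row is masked, the row has no finite score and the result is NaN, both in this sketch and in the PyTorch version.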

### 4.2 Multi-Head Attention

In [None]:
#@title üéß Code Walkthrough: Mha Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_11_mha_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention mechanism.

    Instead of one attention function, we run h parallel attention "heads",
    each with its own learned Q, K, V projections.
    """
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head

        # Learned projection matrices
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)  # Output projection

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_Q(x)  # (batch, seq, d_model)
        K = self.W_K(x)
        V = self.W_V(x)

        # Reshape to (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Apply scaled dot-product attention
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads: (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )

        # Final linear projection
        output = self.W_O(attn_output)

        return output, attn_weights

In [None]:
#@title üéß Code Walkthrough: Mha Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_12_mha_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Test multi-head attention
d_model = 64
num_heads = 4
mha = MultiHeadAttention(d_model, num_heads)

# Random input: batch=1, seq_len=5, d_model=64
x = torch.randn(1, 5, d_model)
output, attn_weights = mha(x)

print(f"Input shape:           {x.shape}")
print(f"Output shape:          {output.shape}")
print(f"Attention weights:     {attn_weights.shape}")
print(f"  (batch, heads, seq, seq)")

In [None]:
#@title üéß What to Look For: Mha Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_13_mha_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 📊 Visualize attention patterns across all heads
fig, axes = plt.subplots(1, num_heads, figsize=(16, 4))
words = ['Word₁', 'Word₂', 'Word₃', 'Word₄', 'Word₅']

for head_idx in range(num_heads):
    ax = axes[head_idx]
    weights_np = attn_weights[0, head_idx].detach().numpy()
    im = ax.imshow(weights_np, cmap='Blues', vmin=0, vmax=weights_np.max())
    ax.set_title(f"Head {head_idx + 1}", fontsize=12, fontweight='bold')
    ax.set_xticks(range(5))
    ax.set_xticklabels(words, rotation=45, fontsize=8)
    ax.set_yticks(range(5))
    ax.set_yticklabels(words, fontsize=8)

plt.suptitle("Multi-Head Attention: Each Head Shows a Different Pattern", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("💡 Notice how each head has a DIFFERENT attention pattern, even before any training!")

### 4.3 Feed-Forward Network

The Transformer encoder block also contains a position-wise feed-forward network: two linear layers with a ReLU activation in between.

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Computationally: this acts as a "thinking" step. After attention gathers relevant information from other words, the FFN processes that information independently at each position.
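
"Position-wise" means the same weights are applied to each token vector separately, with no mixing across positions. The plain-Python sketch below illustrates this with hypothetical hand-picked weights ($d_{\text{model}} = 2$, $d_{ff} = 3$); the helper names are illustrative only.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(v, W, b):
    """Compute v @ W + b for one vector v, with W given as rows of shape (in, out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def ffn(v, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to a single position."""
    return linear(relu(linear(v, W1, b1)), W2, b2)

# Hypothetical tiny weights: d_model = 2, d_ff = 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.0]]
outputs = [ffn(token, W1, b1, W2, b2) for token in sequence]
print(outputs)

# Reordering the tokens only reorders the outputs: positions never interact here.
reversed_outputs = [ffn(token, W1, b1, W2, b2) for token in reversed(sequence)]
print(reversed_outputs == list(reversed(outputs)))  # True
```

All interaction between positions therefore happens in the attention sublayer; the FFN only transforms each position's vector on its own.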

In [None]:
#@title üéß Code Walkthrough: Ffn Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_14_ffn_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.

    Two linear transformations with ReLU in between.
    The inner dimension is typically 4x the model dimension.
    """
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

### 4.4 Layer Normalization

In [None]:
#@title üéß Code Walkthrough: Layernorm Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_15_layernorm_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
class LayerNorm(nn.Module):
    """
    Layer Normalization (Ba et al., 2016).

    Normalizes the last dimension of the input to have
    zero mean and unit variance, then applies learned
    scale (gamma) and shift (beta).
    """
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

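    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        # Biased (population) variance, matching Ba et al. and torch.nn.LayerNorm;
        # x.std() defaults to the unbiased estimate, which does not give unit variance.
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta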
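
As a quick numerical illustration of what layer normalization does to a single token vector, here is a plain-Python sketch without the learned `gamma` and `beta`. It assumes the population (biased) variance with `eps` inside the square root, which is the convention used by `torch.nn.LayerNorm`; the function name `layer_norm` is illustrative.

```python
import math

def layer_norm(v, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)  # population variance
    return [(a - mean) / math.sqrt(var + eps) for a in v]

small = layer_norm([1.0, 2.0, 3.0, 4.0])
large = layer_norm([10.0, 20.0, 30.0, 40.0])

print([round(a, 3) for a in small])
print([round(a, 3) for a in large])
```

Both inputs map to nearly the same normalized vector, which shows that layer normalization removes the overall scale and offset of each token's features before the next sublayer sees them.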

### 4.5 The Complete Transformer Encoder Block

In [None]:
#@title üéß Code Walkthrough: Encoder Block Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_16_encoder_block_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
class TransformerEncoderBlock(nn.Module):
    """
    One Transformer encoder block:
      1. Multi-Head Self-Attention + Residual + LayerNorm
      2. Feed-Forward Network + Residual + LayerNorm
    """
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection and layer norm
        attn_output, attn_weights = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x, attn_weights

In [None]:
#@title üéß Code Walkthrough: Encoder Block Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_17_encoder_block_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Test the full encoder block
encoder_block = TransformerEncoderBlock(d_model=64, num_heads=4, d_ff=256)

x = torch.randn(1, 5, 64)
output, weights = encoder_block(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")
print("✅ Same shape in and out: this is key for stacking blocks!")

## 5. 🔧 Your Turn

### TODO: Stack Multiple Encoder Blocks

BERT-Base uses 12 encoder blocks stacked on top of each other. Implement the `TransformerEncoder` class that stacks N blocks.

In [None]:
class TransformerEncoder(nn.Module):
    """
    Stack of N Transformer encoder blocks.

    Args:
        num_layers: Number of encoder blocks to stack
        d_model: Model dimension
        num_heads: Number of attention heads
        d_ff: Feed-forward inner dimension
        dropout: Dropout rate
    """
    def __init__(self, num_layers, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        # ============ TODO ============
        # Create a nn.ModuleList containing num_layers
        # TransformerEncoderBlock instances
        # ==============================
        self.layers = ???  # YOUR CODE HERE

    def forward(self, x, mask=None):
        """
        Pass input through all encoder blocks sequentially.
        Return the final output and attention weights from the LAST layer.
        """
        attn_weights = None

        # ============ TODO ============
        # Loop through self.layers and pass x through each
        # ==============================
        for layer in ???:  # YOUR CODE HERE
            x, attn_weights = ???  # YOUR CODE HERE

        return x, attn_weights

In [None]:
#@title üéß Before You Start: Todo Stack Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_19_todo_stack_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
#@title üéß Before You Start: Todo Pe Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_21_todo_pe_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# ✅ Verification
encoder = TransformerEncoder(num_layers=6, d_model=64, num_heads=4, d_ff=256)
test_input = torch.randn(2, 10, 64)  # batch=2, seq_len=10
test_output, test_weights = encoder(test_input)

assert test_output.shape == (2, 10, 64), f"❌ Expected shape (2, 10, 64), got {test_output.shape}"
assert test_weights.shape == (2, 4, 10, 10), f"❌ Expected attention shape (2, 4, 10, 10), got {test_weights.shape}"
print(f"✅ TransformerEncoder works! Output shape: {test_output.shape}")
print(f"   6 layers stacked, each with 4 attention heads")
print(f"   Total parameters: {sum(p.numel() for p in encoder.parameters()):,}")
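
The parameter total printed by the verification cell can be cross-checked by hand. The sketch below assumes the encoder is built exactly from the blocks defined earlier: four `nn.Linear` projections with biases in attention, two `nn.Linear` layers in the feed-forward network, two `LayerNorm` modules, and dropout (which has no parameters). The function name `encoder_param_count` is hypothetical.

```python
def encoder_param_count(num_layers, d_model, d_ff):
    """Count learnable parameters in a stack of encoder blocks."""
    # W_Q, W_K, W_V, W_O: each a (d_model x d_model) weight plus a d_model bias.
    attention = 4 * (d_model * d_model + d_model)
    # linear1: d_model -> d_ff, linear2: d_ff -> d_model, both with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorm modules, each with gamma and beta of size d_model.
    layer_norms = 2 * (2 * d_model)
    return num_layers * (attention + feed_forward + layer_norms)

print(encoder_param_count(num_layers=6, d_model=64, d_ff=256))  # 299904
```

Note that `num_heads` does not appear in the formula: splitting `d_model` across more heads changes how the projections are sliced, not how many weights they contain.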

In [None]:
#@title üéß Before You Start: Todo Pe Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_20_todo_pe_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### TODO: Implement Positional Encoding

The Transformer has no built-in notion of word order. We need to inject positional information. Implement sinusoidal positional encoding:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Computationally: even dimensions use sine, odd dimensions use cosine, with frequencies that decrease as the dimension increases. This creates a unique "fingerprint" for each position.
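
Before vectorizing this in PyTorch, it can help to evaluate the formula one entry at a time. The scalar sketch below uses only the `math` module; `positional_encoding_value` is an illustrative helper and `d_model = 4` is an assumed toy size, so it is not a drop-in answer for the exercise.

```python
import math

def positional_encoding_value(pos, dim, d_model):
    """PE value at sequence position `pos` and embedding dimension `dim`."""
    i = dim // 2  # dimensions 2i and 2i+1 share the same frequency
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4
for pos in range(3):
    row = [positional_encoding_value(pos, dim, d_model) for dim in range(d_model)]
    print(pos, [f"{value:.4f}" for value in row])
```

Position 0 always encodes to alternating 0 (sine) and 1 (cosine). For later positions, the first pair of dimensions changes quickly while the second pair changes slowly, which is the decreasing-frequency behavior described above.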

In [None]:
class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding.

    Adds a fixed positional signal to the input embeddings
    so the model knows the order of words.
    """
    def __init__(self, d_model, max_len=512):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # ============ TODO ============
        # Step 1: Compute the division term: 10000^(2i/d_model)
        #         Hint: use torch.exp and torch.arange
        # Step 2: Apply sin to even indices (0, 2, 4, ...)
        # Step 3: Apply cos to odd indices (1, 3, 5, ...)
        # ==============================

        div_term = ???  # YOUR CODE HERE
        pe[:, 0::2] = ???  # YOUR CODE HERE (even indices)
        pe[:, 1::2] = ???  # YOUR CODE HERE (odd indices)

        # Register as buffer (not a parameter, since it is fixed)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        """Add positional encoding to input embeddings."""
        return x + self.pe[:, :x.size(1)]

In [None]:
# ✅ Verification
pos_enc = PositionalEncoding(d_model=64, max_len=100)
test_x = torch.zeros(1, 50, 64)
encoded = pos_enc(test_x)

assert encoded.shape == (1, 50, 64), f"❌ Expected shape (1, 50, 64), got {encoded.shape}"
assert not torch.allclose(encoded[0, 0], encoded[0, 1]), "❌ Different positions should have different encodings"
print("✅ Positional encoding works!")

In [None]:
#@title üéß What to Look For: Pe Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_22_pe_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 📊 Visualize the positional encoding
pe = pos_enc.pe[0, :50, :64].numpy()

plt.figure(figsize=(12, 5))
plt.imshow(pe.T, cmap='RdBu', aspect='auto', interpolation='nearest')
plt.colorbar(label='Encoding Value')
plt.xlabel("Position in Sequence")
plt.ylabel("Embedding Dimension")
plt.title("Sinusoidal Positional Encoding\n(each position has a unique pattern)")
plt.tight_layout()
plt.show()

## 6. Putting It All Together: Attention on Real Text

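Let us run our Transformer encoder on a real sentence and visualize the attention patterns it produces. The model here is randomly initialized and untrained, so the patterns show the mechanics of attention rather than learned linguistic structure.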

In [None]:
#@title üéß Code Walkthrough: Real Text Demo
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_23_real_text_demo.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Simple tokenizer for our demo
sentence = "the cat sat on it"
words = sentence.split()
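# sorted() makes the word-to-index mapping reproducible; plain set order varies between runs
word_to_idx = {w: i for i, w in enumerate(sorted(set(words)))}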
idx_to_word = {i: w for w, i in word_to_idx.items()}

# Build a simple model
class SimpleTransformerDemo(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.encoder = TransformerEncoder(num_layers, d_model, num_heads, d_ff)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.pos_encoding(x)
        output, attn_weights = self.encoder(x)
        return output, attn_weights

vocab_size = len(word_to_idx)
demo_model = SimpleTransformerDemo(
    vocab_size=vocab_size, d_model=32, num_heads=4, d_ff=128, num_layers=2
)

# Encode the sentence
input_ids = torch.tensor([[word_to_idx[w] for w in words]])

with torch.no_grad():
    output, attn_weights = demo_model(input_ids)

print(f"Sentence: '{sentence}'")
print(f"Input shape:            {input_ids.shape}")
print(f"Contextual output:      {output.shape}")
print(f"Attention weights:      {attn_weights.shape}")

In [None]:
#@title üéß What to Look For: Real Text Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_24_real_text_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 📊 Attention heatmap: which words attend to which?
fig, axes = plt.subplots(1, 4, figsize=(18, 4))

for head_idx in range(4):
    ax = axes[head_idx]
    weights_np = attn_weights[0, head_idx].detach().numpy()
    im = ax.imshow(weights_np, cmap='Purples', vmin=0)
    ax.set_title(f"Head {head_idx + 1}", fontsize=13, fontweight='bold')
    ax.set_xticks(range(len(words)))
    ax.set_xticklabels(words, rotation=45, fontsize=11)
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words, fontsize=11)

    # Annotate values
    for i in range(len(words)):
        for j in range(len(words)):
            val = weights_np[i, j]
            ax.text(j, i, f"{val:.2f}", ha='center', va='center',
                    fontsize=8, color='white' if val > 0.4 else 'black')

plt.suptitle(f'Self-Attention Heatmaps: "{sentence}"', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Each head shows a DIFFERENT attention pattern, even though the model is untrained.")
print("   In a trained model, heads tend to specialize, e.g. in syntax or coreference.")

## 7. 🎯 Final Output: Interactive Attention Visualization

In [None]:
#@title üéß What to Look For: Final Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_25_final_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Build a slightly larger demo to show richer patterns
sentences = [
    "the cat sat on the mat and purred",
    "the dog chased the cat around the yard",
    "she went to the bank to deposit money",
]

# Build a combined vocabulary
# sorted() makes the vocabulary order reproducible; plain set order varies between runs
all_words_list = sorted(set(w for s in sentences for w in s.split()))
demo_vocab = {w: i for i, w in enumerate(all_words_list)}
demo_idx_to_word = {i: w for w, i in demo_vocab.items()}

demo_model_2 = SimpleTransformerDemo(
    vocab_size=len(demo_vocab), d_model=32, num_heads=4, d_ff=128, num_layers=3
)

fig, axes = plt.subplots(len(sentences), 1, figsize=(14, 4 * len(sentences)))

for sent_idx, sentence in enumerate(sentences):
    words = sentence.split()
    ids = torch.tensor([[demo_vocab[w] for w in words]])

    with torch.no_grad():
        _, attn = demo_model_2(ids)

    # Average attention across all heads
    avg_attn = attn[0].mean(dim=0).numpy()

    ax = axes[sent_idx]
    im = ax.imshow(avg_attn, cmap='viridis', vmin=0)
    ax.set_title(f'"{sentence}"', fontsize=12, fontweight='bold')
    ax.set_xticks(range(len(words)))
    ax.set_xticklabels(words, rotation=45, fontsize=10)
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words, fontsize=10)
    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

plt.suptitle("🎯 Self-Attention Patterns (Averaged Across Heads)", fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("🎉 Congratulations! You've built a complete Transformer encoder from scratch!")
print("   Next up: BERT's architecture, covering input representation and pre-training objectives.")

In [None]:
#@title üéß Wrap-Up: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_26_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 8. Reflection and Next Steps

### 🤔 Reflection Questions
1. Why do we divide by $\sqrt{d_k}$ in the attention formula? What would happen without it? (Hint: think about the variance of dot products as dimension grows.)
2. If we have 12 attention heads with $d_k = 64$ each, what is the total model dimension $d_{\text{model}}$? Why is this more efficient than having one head with $d_k = 768$?
3. The residual connections in the encoder block add the input directly to the output. Why is this important for training deep networks?

### 🏆 Optional Challenges
1. **Masked Self-Attention**: Modify the attention function to apply a causal mask that prevents words from attending to future words. This is how GPT works (decoder-style).
2. **Relative Position Encoding**: Instead of fixed sinusoidal encodings, implement relative position encodings where attention scores are modified based on the distance between tokens.
3. **Attention Dropout**: Add dropout to the attention weights (after softmax, before multiplying by V). How does this affect the model?