In [None]:
#@title 🎧 Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1-OE3rxpruDzmk8pg3auelrL_R46oxuz7", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/00_intro.mp3"))

In [None]:
#@title 🎧 Listen: Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/00_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

In [None]:
#@title 🎧 Listen: Setup
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_setup.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# 🚀 Word Representations & The Need for BERT

*Part 1 of the Vizuara series on Understanding BERT from Scratch*
*Estimated time: 45 minutes*

# 🤖 AI Teaching Assistant

Need help with this notebook? Open the **AI Teaching Assistant**: it has already read this entire notebook and can help with concepts, code, and exercises.

**[👉 Open AI Teaching Assistant](https://pods.vizuara.ai/courses/understanding-bert-from-scratch/practice/1/assistant)**

*Tip: Open it in a separate tab and work through this notebook side-by-side.*


In [None]:
#@title 🎧 Listen: Why It Matters
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_why_it_matters.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 1. Why Does This Matter?

Every modern NLP system, from ChatGPT to Google Search, relies on one fundamental idea: **turning words into numbers** that capture meaning.

But here is the catch: the word "bank" means completely different things in "river bank" and "bank account." How do we represent words so that their meaning changes based on context?

In this notebook, we will:
1. Build **Word2Vec from scratch** and see why static embeddings fail
2. Understand how **ELMo** tried to fix this with bidirectional LSTMs
3. See why BERT's approach, **deep bidirectional attention**, was the breakthrough we needed

By the end, you will have a working Word2Vec model and a clear understanding of *why* BERT had to be invented.

In [None]:
#@title 🎧 Listen: Building Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/03_building_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

In [None]:
# 🔧 Setup: run this cell first
!pip install -q torch matplotlib numpy

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import random

%matplotlib inline

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Building Intuition

Let us start with a simple game. Look at this sentence:

**"The ___ sat on the mat and purred loudly."**

You instantly think **cat**. But how? You used the words *after* the blank ("purred loudly") to figure it out. You read in **both directions**.

Now try these two:

**"I went to the ___ to deposit my savings."** → bank (financial institution)

**"I sat on the ___ of the river and watched the sunset."** → bank (river edge)

Same word, completely different meanings. The surrounding context tells you which "bank" is meant.

### 🤔 Think About This

If we represent each word as a single, fixed vector of numbers (like a GPS coordinate for meaning), how would we handle "bank"? It would need to be in *two places at once*: near "money" and near "river." This is the fundamental problem we are going to solve.
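To make the "one fixed vector" idea concrete, here is a minimal sketch of a static lookup table in plain Python. The table `static_table`, the helper `embed`, and the numbers in it are illustrative assumptions, not learned values and not part of the notebook's model.

```python
# Toy static embedding table: one fixed vector per word type.
# The numbers are made up for illustration; they are not learned.
static_table = {
    "bank":  [0.5, 0.5],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
}

def embed(sentence, table):
    """Look up each known word; the result ignores the surrounding words."""
    return {w: table[w] for w in sentence.split() if w in table}

financial = embed("i went to the bank to deposit money", static_table)
riverside = embed("he sat on the bank of the river", static_table)

# The vector for "bank" is identical in both sentences.
same_vector = financial["bank"] == riverside["bank"]
print(same_vector)  # True
```

Because the lookup depends only on the word itself, no choice of numbers in the table can make "bank" differ between the two sentences.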

In [None]:
#@title 🎧 Listen: Word2Vec Math
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_word2vec_math.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 3. The Mathematics of Word2Vec

The most influential static word embedding method is **Word2Vec** (Mikolov et al., 2013). The key idea: *words that appear in similar contexts should have similar vectors.*

Word2Vec has two variants. We will implement the **Skip-gram** model, which takes a center word and tries to predict the surrounding context words.

Given a center word $w_c$ and a context word $w_o$, the probability that $w_o$ appears in the context of $w_c$ is:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$

Computationally, this says: take the dot product between the context word's output vector $\mathbf{u}_{w_o}$ and the center word's input vector $\mathbf{v}_{w_c}$, then normalize over the entire vocabulary using softmax. A higher dot product means the model thinks these words are more likely to appear together.
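As a quick numeric check of the formula, the sketch below evaluates the softmax for one center word against a three-word vocabulary. It uses plain Python rather than PyTorch so it runs on its own; the helper `skipgram_prob` and the toy vectors are assumptions made for illustration.

```python
import math

def skipgram_prob(center_vec, output_vecs, context_word):
    """P(context_word | center): softmax over the dot products u_w . v_c."""
    scores = {
        w: sum(u_i * v_i for u_i, v_i in zip(u, center_vec))
        for w, u in output_vecs.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    return exp_scores[context_word] / sum(exp_scores.values())

# Toy input vector for the center word "cat" and toy output vectors.
v_cat = [1.0, 0.0]
toy_output_vecs = {
    "purred": [2.0, 0.0],   # dot product with v_cat:  2.0
    "mat":    [1.0, 1.0],   # dot product with v_cat:  1.0
    "river":  [-1.0, 0.5],  # dot product with v_cat: -1.0
}

toy_probs = {w: skipgram_prob(v_cat, toy_output_vecs, w) for w in toy_output_vecs}
for word, p in toy_probs.items():
    print(f"P({word} | cat) = {p:.3f}")
```

The probabilities sum to 1, and the word with the largest dot product receives the largest share.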

The training objective is to maximize the log-likelihood over all center-context pairs in the corpus:

$$\mathcal{L} = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j} \mid w_t)$$

This says: for every word $w_t$ in the corpus, look at $m$ words to the left and right, and maximize the probability of predicting each of those context words.
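The double sum can be sketched directly in plain Python. The function `log_likelihood` and the `uniform_log_prob` model are hypothetical names used only here, and the sketch assumes the window is simply cut off at sentence boundaries.

```python
import math

def log_likelihood(tokens, log_prob, m):
    """Sum log P(w_{t+j} | w_t) over all positions t and offsets 0 < |j| <= m."""
    total, n_pairs = 0.0, 0
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and positions outside the sentence
            total += log_prob(tokens[t + j], center)
            n_pairs += 1
    return total, n_pairs

toy_tokens = "the cat sat on the mat".split()
toy_vocab_size = len(set(toy_tokens))  # 5 distinct words

# Hypothetical untrained model: every context word is equally likely.
def uniform_log_prob(context, center):
    return math.log(1.0 / toy_vocab_size)

toy_ll, toy_n_pairs = log_likelihood(toy_tokens, uniform_log_prob, m=2)
print(toy_n_pairs)  # 18 pairs
print(f"{toy_ll:.3f}")
```

A uniform model scores `-n_pairs * log(vocab_size)`, which is a useful baseline: training should push the log-likelihood above this value, or equivalently push the average loss per pair below `log(vocab_size)`.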

In [None]:
#@title 🎧 Listen: Building Corpus
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/05_building_corpus.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 4. Let's Build It, Component by Component

### 4.1 Preparing a Small Corpus

We will use a small corpus that includes the word "bank" in different contexts, so we can later see Word2Vec's polysemy problem.

In [None]:
# A small corpus with deliberate polysemy
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "the dog chased the cat",
    "i went to the bank to deposit money",
    "she walked to the bank to withdraw cash",
    "he sat on the bank of the river",
    "the river bank was covered with grass",
    "the cat purred on the mat",
    "the dog barked at the cat",
    "money was deposited at the bank",
    "the river bank had beautiful flowers",
    "the mat was on the floor",
    "the rug was under the dog",
    "cash was withdrawn from the bank",
    "grass grew along the river bank",
]

# Tokenize
tokenized_corpus = [sentence.split() for sentence in corpus]

# Build vocabulary
all_words = [word for sentence in tokenized_corpus for word in sentence]
word_counts = Counter(all_words)
vocab = sorted(word_counts.keys())
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {vocab}")

In [None]:
#@title 🎧 Listen: Skipgram Pairs
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/06_skipgram_pairs.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.2 Building Skip-gram Training Pairs

For each word in the corpus, we create (center_word, context_word) pairs using a sliding window.

In [None]:
def create_skipgram_pairs(tokenized_corpus, word_to_idx, window_size=2):
    """
    Create (center, context) pairs for Skip-gram training.

    For each word in each sentence, look at 'window_size' words
    to the left and right as context.
    """
    pairs = []
    for sentence in tokenized_corpus:
        for i, center_word in enumerate(sentence):
            # Look at window_size words in each direction
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if i != j:  # Skip the center word itself
                    context_word = sentence[j]
                    pairs.append((word_to_idx[center_word], word_to_idx[context_word]))
    return pairs

pairs = create_skipgram_pairs(tokenized_corpus, word_to_idx, window_size=2)
print(f"Total training pairs: {len(pairs)}")
print(f"\nFirst 5 pairs:")
for center_idx, context_idx in pairs[:5]:
    print(f"  Center: '{idx_to_word[center_idx]}' → Context: '{idx_to_word[context_idx]}'")

In [None]:
# 📊 Visualization: distribution of training pairs
center_words = [idx_to_word[p[0]] for p in pairs]
center_counts = Counter(center_words)
top_words = center_counts.most_common(10)

plt.figure(figsize=(10, 4))
plt.bar([w for w, c in top_words], [c for w, c in top_words], color='steelblue')
plt.title("Most Common Center Words in Training Pairs")
plt.xlabel("Word")
plt.ylabel("Number of training pairs")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: The Model
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/07_the_model.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.3 The Word2Vec Model (from Scratch)

In [None]:
class SkipGramWord2Vec(nn.Module):
    """
    Skip-gram Word2Vec model.

    Two embedding matrices:
    - center_embeddings: vectors for center words (input)
    - context_embeddings: vectors for context words (output)
    """
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Initialize with small random values
        nn.init.uniform_(self.center_embeddings.weight, -0.5 / embedding_dim, 0.5 / embedding_dim)
        nn.init.uniform_(self.context_embeddings.weight, -0.5 / embedding_dim, 0.5 / embedding_dim)

    def forward(self, center_ids, context_ids):
        # Center embeddings: (batch_size, embedding_dim)
        center_vecs = self.center_embeddings(center_ids)

        # Full softmax: score each center word against ALL context (output) vectors
        all_context = self.context_embeddings.weight  # (vocab_size, embedding_dim)
        all_scores = torch.matmul(center_vecs, all_context.T)  # (batch_size, vocab_size)

        log_probs = torch.log_softmax(all_scores, dim=1)

        # Gather the log probabilities for the actual context words
        loss = -log_probs.gather(1, context_ids.unsqueeze(1)).squeeze(1)
        return loss.mean()

    def get_embedding(self, word_idx):
        """Get the learned embedding for a word."""
        return self.center_embeddings.weight[word_idx].detach().cpu().numpy()

In [None]:
#@title 🎧 Listen: Training
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/08_training.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.4 Training Word2Vec

In [None]:
# Hyperparameters
EMBEDDING_DIM = 32
LEARNING_RATE = 0.01
EPOCHS = 200
BATCH_SIZE = 64

model = SkipGramWord2Vec(vocab_size, EMBEDDING_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Convert pairs to tensors
center_ids = torch.tensor([p[0] for p in pairs], dtype=torch.long).to(device)
context_ids = torch.tensor([p[1] for p in pairs], dtype=torch.long).to(device)

# Training loop
losses = []
for epoch in range(EPOCHS):
    # Shuffle data
    perm = torch.randperm(len(pairs), device=device)
    epoch_loss = 0
    n_batches = 0

    for i in range(0, len(pairs), BATCH_SIZE):
        batch_idx = perm[i:i+BATCH_SIZE]
        batch_centers = center_ids[batch_idx]
        batch_contexts = context_ids[batch_idx]

        loss = model(batch_centers, batch_contexts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1

    avg_loss = epoch_loss / n_batches
    losses.append(avg_loss)

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {avg_loss:.4f}")

In [None]:
# 📊 Training curve
plt.figure(figsize=(8, 4))
plt.plot(losses, color='steelblue', linewidth=1.5)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Word2Vec Training Loss")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Visualizing Embeddings
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/09_visualizing_embeddings.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.5 Visualizing the Embeddings

In [None]:
# Get all word embeddings
embeddings = model.center_embeddings.weight.detach().cpu().numpy()

# Use PCA to project to 2D
from numpy.linalg import svd

# Center the data
mean = embeddings.mean(axis=0)
centered = embeddings - mean
U, S, Vt = svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T  # Project to first 2 principal components

# Plot
plt.figure(figsize=(12, 8))
for i, word in enumerate(vocab):
    x, y = projected[i]
    plt.scatter(x, y, color='steelblue', s=50, zorder=5)
    plt.annotate(word, (x, y), fontsize=9, ha='center', va='bottom',
                 xytext=(0, 5), textcoords='offset points')

plt.title("Word2Vec Embeddings (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Polysemy Problem
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/10_polysemy_problem.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 5. The Polysemy Problem: Word2Vec's Fatal Flaw

Now let us see the fundamental limitation. The word "bank" appears in two very different contexts in our corpus: financial and river. But Word2Vec gives it **one single vector**.

In [None]:
# Find the embedding for "bank"
bank_idx = word_to_idx["bank"]
bank_embedding = model.get_embedding(bank_idx)

# Find nearest neighbors using cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("=== Nearest neighbors to 'bank' ===\n")
similarities = []
for word, idx in word_to_idx.items():
    if word != "bank":
        sim = cosine_similarity(bank_embedding, model.get_embedding(idx))
        similarities.append((word, sim))

similarities.sort(key=lambda x: x[1], reverse=True)
for word, sim in similarities[:8]:
    print(f"  {word:12s} → similarity: {sim:.3f}")

In [None]:
# 📊 The polysemy problem visualized
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: financial context sentences
financial_words = ["bank", "money", "deposit", "cash", "withdrawn"]
river_words = ["bank", "river", "grass", "flowers", "grew"]

# Show that bank is the SAME point regardless of context
financial_indices = [word_to_idx[w] for w in financial_words if w in word_to_idx]
river_indices = [word_to_idx[w] for w in river_words if w in word_to_idx]

# Financial context
ax = axes[0]
for idx in financial_indices:
    word = idx_to_word[idx]
    x, y = projected[idx]
    color = 'red' if word == 'bank' else 'steelblue'
    size = 150 if word == 'bank' else 80
    ax.scatter(x, y, color=color, s=size, zorder=5)
    ax.annotate(word, (x, y), fontsize=11, ha='center', va='bottom',
                xytext=(0, 6), textcoords='offset points', fontweight='bold' if word == 'bank' else 'normal')
ax.set_title("Financial Context Words", fontsize=13)
ax.grid(alpha=0.3)

# River context
ax = axes[1]
for idx in river_indices:
    word = idx_to_word[idx]
    x, y = projected[idx]
    color = 'red' if word == 'bank' else 'forestgreen'
    size = 150 if word == 'bank' else 80
    ax.scatter(x, y, color=color, s=size, zorder=5)
    ax.annotate(word, (x, y), fontsize=11, ha='center', va='bottom',
                xytext=(0, 6), textcoords='offset points', fontweight='bold' if word == 'bank' else 'normal')
ax.set_title("River Context Words", fontsize=13)
ax.grid(alpha=0.3)

plt.suptitle("⚠️ 'bank' has ONE vector: it cannot distinguish contexts!", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Notice how "bank" sits at the **exact same position** in both plots. Word2Vec has no way to give "bank" a different representation based on its context. This is the **polysemy problem**.

In [None]:
#@title 🎧 Listen: Todo Analogy
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/11_todo_analogy.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 6. 🔧 Your Turn

### TODO: Implement a function to compute Word2Vec analogy

One of the famous properties of Word2Vec is that it captures analogies:
**king - man + woman ≈ queen**

Implement the analogy function below.
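Before writing the function, it may help to see the vector arithmetic on its own. The sketch below uses hand-picked 2-D vectors (an assumption for illustration, not trained embeddings) in which the offset between "man" and "woman" matches the offset between "king" and "queen". The nearest-neighbor search over the vocabulary is left for the exercise.

```python
# Hand-picked 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are illustrative assumptions, not trained embeddings.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def vec_sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

# king - man + woman
toy_analogy = vec_add(vec_sub(toy_vectors["king"], toy_vectors["man"]), toy_vectors["woman"])
print(toy_analogy)  # [1.0, -1.0], the toy vector for "queen"
```

With learned embeddings the result rarely lands exactly on a word vector, which is why the exercise asks for the *most similar* word rather than an exact match.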

In [None]:
def word_analogy(model, word_to_idx, idx_to_word, word_a, word_b, word_c):
    """
    Compute: word_a - word_b + word_c = ???

    For example: king - man + woman = ???

    Args:
        word_a, word_b, word_c: strings
    Returns:
        The word closest to (vec_a - vec_b + vec_c)
    """
    vec_a = model.get_embedding(word_to_idx[word_a])
    vec_b = model.get_embedding(word_to_idx[word_b])
    vec_c = model.get_embedding(word_to_idx[word_c])

    # ============ TODO ============
    # Step 1: Compute the analogy vector: vec_a - vec_b + vec_c
    # Step 2: Find the word in vocabulary whose embedding is most similar
    #         (using cosine similarity) to the analogy vector
    # Step 3: Exclude words a, b, c from candidates
    # ==============================

    analogy_vec = ???  # YOUR CODE HERE

    best_word = None
    best_sim = -1

    for word, idx in word_to_idx.items():
        if word in [word_a, word_b, word_c]:
            continue
        # YOUR CODE HERE: compute similarity and track the best match

    return best_word

In [None]:
# ✅ Verification
# With our small corpus, exact analogies are unlikely, but the function should work
# Let's test the mechanics: "cat" - "mat" + "rug" should lean toward "dog"
result = word_analogy(model, word_to_idx, idx_to_word, "cat", "mat", "rug")
print(f"cat - mat + rug = {result}")
assert result is not None, "❌ Function returned None. Check your implementation"
assert isinstance(result, str), "❌ Function should return a string (word)"
print("✅ Analogy function works correctly!")

In [None]:
#@title 🎧 Listen: Elmo Approach
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/12_elmo_approach.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 7. From Static to Contextual: The ELMo Approach

ELMo (Peters et al., 2018) tried to fix the polysemy problem by using **bidirectional LSTMs**.

The idea: run a left-to-right LSTM and a right-to-left LSTM over the sentence, then **concatenate** their hidden states to get a context-dependent representation.

In [None]:
# Demonstrating the ELMo concept (simplified)

class SimpleELMo(nn.Module):
    """
    A simplified ELMo-style model.

    Two independent LSTMs:
    - forward LSTM: reads left-to-right
    - backward LSTM: reads right-to-left

    Their hidden states are concatenated; neither LSTM sees the
    other direction's context while computing its own states.
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.forward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, input_ids):
        # Get token embeddings
        embeds = self.embedding(input_ids)  # (batch, seq_len, embed_dim)

        # Forward LSTM (left-to-right)
        forward_out, _ = self.forward_lstm(embeds)

        # Backward LSTM (right-to-left): reverse the sequence
        reversed_embeds = torch.flip(embeds, dims=[1])
        backward_out, _ = self.backward_lstm(reversed_embeds)
        backward_out = torch.flip(backward_out, dims=[1])  # Flip back

        # Concatenate (this is ELMo's "shallow bidirectionality")
        contextual = torch.cat([forward_out, backward_out], dim=-1)
        return contextual

# Create a simple ELMo (untrained: random weights are enough to show context dependence)
elmo = SimpleELMo(vocab_size, embedding_dim=32, hidden_dim=32)

# Get contextual representations for two sentences with "bank"
sentence1 = "i went to the bank to deposit money".split()
sentence2 = "he sat on the bank of the river".split()

ids1 = torch.tensor([[word_to_idx[w] for w in sentence1]])
ids2 = torch.tensor([[word_to_idx[w] for w in sentence2]])

with torch.no_grad():
    ctx1 = elmo(ids1)  # (1, 8, 64)
    ctx2 = elmo(ids2)  # (1, 8, 64)

# "bank" is at index 4 in both sentences
bank_repr_financial = ctx1[0, 4].numpy()
bank_repr_river = ctx2[0, 4].numpy()

sim = cosine_similarity(bank_repr_financial, bank_repr_river)
print(f"Cosine similarity between 'bank' representations:")
print(f"  Financial context vs. River context: {sim:.4f}")
print(f"\n💡 ELMo gives 'bank' DIFFERENT representations based on context!")
print(f"   (similarity < 1.0 means the vectors are different)")

In [None]:
#@title 🎧 Listen: Elmo Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/13_elmo_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

In [None]:
# 📊 Comparison: Word2Vec vs ELMo
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Word2Vec: same vector
ax = axes[0]
ax.bar(range(10), bank_embedding[:10], color='coral', alpha=0.8, label='All contexts')
ax.set_title("Word2Vec: 'bank' embedding\n(SAME for all contexts)", fontsize=11)
ax.set_xlabel("Dimension")
ax.set_ylabel("Value")
ax.legend()
ax.set_ylim(-2, 2)

# ELMo: different vectors
ax = axes[1]
x = np.arange(10)
width = 0.35
ax.bar(x - width/2, bank_repr_financial[:10], width, color='steelblue', alpha=0.8, label='Financial context')
ax.bar(x + width/2, bank_repr_river[:10], width, color='forestgreen', alpha=0.8, label='River context')
ax.set_title("ELMo: 'bank' embedding\n(DIFFERENT per context)", fontsize=11)
ax.set_xlabel("Dimension")
ax.set_ylabel("Value")
ax.legend()
ax.set_ylim(-2, 2)

plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Todo Context Similarity
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/14_todo_context_similarity.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### TODO: Build a Context Similarity Checker

Use the ELMo model above to check how different the representation of "bank" is in two different contexts.

In [None]:
def context_similarity(model, sentence1, sentence2, target_word, word_to_idx):
    """
    Compare the contextual representation of target_word in two sentences.

    Args:
        model: SimpleELMo model
        sentence1: first sentence (string)
        sentence2: second sentence (string)
        target_word: the word to compare (string)

    Returns:
        cosine_similarity: float between -1 and 1
        repr1: numpy array, representation in sentence 1
        repr2: numpy array, representation in sentence 2
    """
    words1 = sentence1.split()
    words2 = sentence2.split()

    # ============ TODO ============
    # Step 1: Find the index of target_word in each sentence
    # Step 2: Convert each sentence to tensor of token IDs
    # Step 3: Pass each through the model to get contextual representations
    # Step 4: Extract the representation at the target_word's position
    # Step 5: Compute cosine similarity between the two representations
    # ==============================

    idx1 = ???  # YOUR CODE HERE: position of target_word in sentence1
    idx2 = ???  # YOUR CODE HERE: position of target_word in sentence2

    ids1 = torch.tensor([[word_to_idx[w] for w in words1]])
    ids2 = torch.tensor([[word_to_idx[w] for w in words2]])

    with torch.no_grad():
        ctx1 = model(ids1)
        ctx2 = model(ids2)

    repr1 = ???  # YOUR CODE HERE: extract at idx1
    repr2 = ???  # YOUR CODE HERE: extract at idx2

    similarity = ???  # YOUR CODE HERE: cosine similarity

    return similarity, repr1, repr2

In [None]:
# ✅ Verification
sim, r1, r2 = context_similarity(
    elmo,
    "i went to the bank to deposit money",
    "he sat on the bank of the river",
    "bank",
    word_to_idx
)
assert isinstance(sim, (float, np.floating)), "❌ Should return a float"
assert -1 <= sim <= 1, f"❌ Cosine similarity should be in [-1, 1], got {sim}"
print(f"✅ Context similarity works! 'bank' similarity = {sim:.4f}")
print(f"   (Lower similarity = model distinguishes the contexts better)")

In [None]:
#@title 🎧 Listen: Elmo Limitation Bert Preview
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/15_elmo_limitation_bert_preview.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### But ELMo has a limitation...

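ELMo's two LSTMs keep **separate** parameters and never exchange information inside the network. In Peters et al., the two directions are trained under one joint objective with shared token embeddings and softmax weights, but at every layer the forward LSTM sees only the left context and the backward LSTM sees only the right context. Their hidden states are glued together afterwards via concatenation, so the bidirectionality is **shallow**.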

What we really want is a model where **every layer** jointly considers the full left and right context. That is exactly what BERT does with self-attention, which we will build in the next notebook.
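The "shallow" point can be checked with a small experiment: change only the last token of a sequence and see which half of a concatenated representation reacts. The sketch below is a stand-in under stated assumptions. It uses a scalar tanh RNN instead of an LSTM, and the function names and weights are hypothetical, chosen so the code runs in plain Python.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Scalar tanh RNN: the state at step t depends only on inputs[0..t]."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair a left-to-right pass with a right-to-left pass at each position."""
    fwd = run_rnn(inputs, w_in=0.7, w_rec=0.5)
    # Separate weights for the backward direction; run on reversed input, then flip back.
    bwd = run_rnn(inputs[::-1], w_in=0.3, w_rec=0.9)[::-1]
    return list(zip(fwd, bwd))

seq_a = [1.0, 2.0, 3.0, 4.0]
seq_b = [1.0, 2.0, 3.0, -9.0]  # only the last input differs

states_a = bidirectional_states(seq_a)
states_b = bidirectional_states(seq_b)

# The forward half at position 1 never saw the changed last input...
forward_unchanged = states_a[1][0] == states_b[1][0]
# ...while the backward half at position 1 did.
backward_changed = states_a[1][1] != states_b[1][1]
print(forward_unchanged, backward_changed)  # True True
```

Each half of the concatenated vector is conditioned on one side of the sentence only. No single unit in either pass has combined left and right context, which is the gap that joint attention over the full sentence is meant to close.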

In [None]:
#@title 🎧 Listen: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/16_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 8. 🎯 Final Output: The Evolution of Word Representations

In [None]:
# 📊 Summary visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Panel 1: Word2Vec (static)
ax = axes[0]
ax.text(0.5, 0.7, '"bank"', ha='center', va='center', fontsize=24, fontweight='bold', color='coral')
ax.text(0.5, 0.4, '→ ONE vector', ha='center', va='center', fontsize=14, color='gray')
ax.text(0.5, 0.25, 'Same for "river bank"', ha='center', va='center', fontsize=10, color='gray')
ax.text(0.5, 0.15, 'and "bank account"', ha='center', va='center', fontsize=10, color='gray')
ax.set_title("Word2Vec (2013)\nStatic Embeddings", fontsize=13, fontweight='bold')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
ax.patch.set_facecolor('#fff0f0')

# Panel 2: ELMo (shallow bidirectional)
ax = axes[1]
ax.annotate('', xy=(0.35, 0.6), xytext=(0.1, 0.6),
            arrowprops=dict(arrowstyle='->', color='steelblue', lw=2))
ax.annotate('', xy=(0.65, 0.55), xytext=(0.9, 0.55),
            arrowprops=dict(arrowstyle='->', color='orange', lw=2))
ax.text(0.5, 0.7, '"bank"', ha='center', va='center', fontsize=24, fontweight='bold', color='purple')
ax.text(0.5, 0.35, '→ Context-dependent', ha='center', va='center', fontsize=14, color='gray')
ax.text(0.5, 0.2, 'but L→R and R→L passes', ha='center', va='center', fontsize=10, color='gray')
ax.text(0.5, 0.1, 'each see only one side (shallow)', ha='center', va='center', fontsize=10, color='gray')
ax.set_title("ELMo (2018)\nShallow Bidirectional", fontsize=13, fontweight='bold')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
ax.patch.set_facecolor('#f0f0ff')

# Panel 3: BERT (deep bidirectional)
ax = axes[2]
# Draw arrows in all directions
center = (0.5, 0.6)
for angle in range(0, 360, 45):
    rad = np.radians(angle)
    dx = 0.15 * np.cos(rad)
    dy = 0.15 * np.sin(rad)
    ax.annotate('', xy=(center[0]+dx, center[1]+dy), xytext=center,
                arrowprops=dict(arrowstyle='->', color='green', lw=1.5))
ax.text(0.5, 0.6, '"bank"', ha='center', va='center', fontsize=24, fontweight='bold', color='green',
        bbox=dict(boxstyle='round', facecolor='white', edgecolor='green', linewidth=2))
ax.text(0.5, 0.25, '→ Deep bidirectional', ha='center', va='center', fontsize=14, color='gray')
ax.text(0.5, 0.1, 'Every layer attends to\nfull context jointly', ha='center', va='center', fontsize=10, color='gray')
ax.set_title("BERT (2018)\nDeep Bidirectional", fontsize=13, fontweight='bold')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
ax.patch.set_facecolor('#f0fff0')

plt.suptitle("The Evolution of Word Representations", fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("🎉 You now understand WHY static embeddings fail and WHY we need BERT!")
print("   Next up: building the self-attention mechanism that makes BERT possible.")

## 9. Reflection and Next Steps

### 🤔 Reflection Questions
1. Why can't we just increase Word2Vec's embedding dimension to solve the polysemy problem? (Hint: think about what the model is optimizing for.)
2. If ELMo concatenates forward and backward LSTMs, why is that considered "shallow" bidirectionality? What would "deep" bidirectionality look like?
3. Word2Vec was trained on billions of words but still has the polysemy problem. Does more data help, or is it a fundamental architectural limitation?

### 🏆 Optional Challenges
1. **Negative Sampling**: Our Word2Vec uses full softmax, which is slow for large vocabularies. Implement negative sampling: instead of normalizing over all words, randomly sample 5-10 "negative" context words and train a binary classifier.
2. **CBOW Model**: Implement the Continuous Bag of Words (CBOW) variant, which predicts the center word from the context words (the reverse of Skip-gram).
3. **Larger Corpus**: Download a real text corpus (e.g., WikiText-2) and train Word2Vec on it. Do the embeddings capture more meaningful relationships?