# Chapter 9: The Embedding Layer

> "You shall know a word by the company it keeps." — **John Rupert Firth**, Linguist

---

## What You'll Learn

- Why token IDs alone aren't enough and what embeddings actually solve
- How to build token embeddings from scratch using lookup tables
- Why positional information matters and how to add it
- How to combine token and position embeddings for complete input representations
- How to explore real model embeddings and discover semantic relationships
- Practical initialization and implementation considerations

---

## Setup

First, let's install the required packages:

In [None]:
# Install required packages
!pip install -q torch transformers

In [None]:
# ===== IMPORTS =====
import torch                     # PyTorch: tensor operations and neural networks
import torch.nn as nn            # Neural network modules (layers, etc.)
import torch.nn.functional as F  # Mathematical functions (cosine_similarity, etc.)
from transformers import AutoModel, AutoTokenizer  # Pre-trained models

# Quick tensor reminder:
# - torch.tensor([1,2,3]) creates a 1D tensor (like a list)
# - torch.randn(3, 4) creates a 3×4 tensor of random numbers
# - tensor[0] indexes into the first dimension
# - tensor.shape tells you the dimensions

## 1. Why Can't We Just Use Token IDs?

Token IDs are arbitrary integers with no semantic meaning. Let's see the problem:

In [None]:
# Token IDs are just integers - no relationships
token_ids = {
    "cat": 3797,
    "kitten": 28387,
    "dog": 4273,
    "car": 1097
}

print("Token IDs:")
for word, token_id in token_ids.items():  # 'token_id' avoids shadowing the built-in id()
    print(f"  '{word}' → {token_id}")

# Problem: "cat" is numerically closer to "dog" than to "kitten"!
print(f"\nDistance from 'cat' to 'dog': {abs(3797 - 4273)}")
print(f"Distance from 'cat' to 'kitten': {abs(3797 - 28387)}")
print("\nBut semantically, 'cat' and 'kitten' are more similar!")

### The Solution: Embeddings

Convert each token ID to a dense vector that captures meaning.

**What is `nn.Embedding`?**
- A lookup table with `vocab_size` rows and `embed_dim` columns
- Each row is a vector representing one token
- Input: token ID (integer) → Output: that row (vector)
- Think of it as a dictionary: `{0: [0.1, 0.2, ...], 1: [0.5, -0.1, ...], ...}`
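
The lookup-table picture can be checked directly. The sketch below uses a toy 5-token vocabulary (the sizes and the names `toy_embedding`, `via_layer`, etc. are arbitrary choices for illustration, not GPT-2's) and confirms that calling the layer gives the same result as indexing its weight matrix, or as multiplying a one-hot vector by it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen for readability (assumption: not GPT-2's real sizes)
toy_vocab_size, toy_embed_dim = 5, 3
toy_embedding = nn.Embedding(toy_vocab_size, toy_embed_dim)

ids = torch.tensor([4, 0, 4])

via_layer = toy_embedding(ids)           # the usual call
via_index = toy_embedding.weight[ids]    # plain row indexing

# One-hot route: each row of `one_hot` selects exactly one row of the weight matrix
one_hot = F.one_hot(ids, num_classes=toy_vocab_size).float()
via_matmul = one_hot @ toy_embedding.weight

print(via_layer.shape)                         # torch.Size([3, 3])
print(torch.equal(via_layer, via_index))       # True
print(torch.allclose(via_layer, via_matmul))   # True
```

The one-hot route is mathematically equivalent but wasteful for a 50,257-token vocabulary, which is why embedding layers index rows instead of multiplying.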

In [None]:
# ===== The Problem: Token IDs =====
token_ids = torch.tensor([3797, 28387, 4273])  # cat, kitten, dog
print(f"Token IDs shape: {token_ids.shape}")
print(f"Just integers: {token_ids}")

# ===== The Solution: Embeddings =====
vocab_size = 50257  # GPT-2 vocabulary
embed_dim = 768     # GPT-2 embedding dimension

# Create embedding layer (this is a lookup table!)
embedding = nn.Embedding(vocab_size, embed_dim)

# Look up vectors for our token IDs
token_vectors = embedding(token_ids)
print(f"\nToken embeddings shape: {token_vectors.shape}")
print(f"First token's vector (first 10 dims): {token_vectors[0, :10]}")

# Now each token is a 768-dimensional vector that can capture meaning!

## 2. Token Embeddings: The Lookup Table

Let's build token embeddings from scratch to understand the mechanism.

### Step 1: Create the Embedding Matrix

In [None]:
vocab_size = 50257  # GPT-2 vocabulary size
embed_dim = 768     # Embedding dimension

# Create embedding matrix: one row per token
embedding_matrix = torch.randn(vocab_size, embed_dim)

print(f"Embedding matrix shape: {embedding_matrix.shape}")
print(f"\nToken 3797's embedding (first 10 dims): {embedding_matrix[3797, :10]}")

### Step 2: Look Up Multiple Tokens

In [None]:
token_ids = torch.tensor([464, 3797, 3332])  # "The cat sat"

# Manual lookup (what embedding layers do internally)
embeddings = embedding_matrix[token_ids]

print(f"Input shape: {token_ids.shape}")       # torch.Size([3])
print(f"Output shape: {embeddings.shape}")     # torch.Size([3, 768])

# Each token ID → its 768-dimensional vector
print(f"\nToken 464's embedding (first 5 dims): {embeddings[0, :5]}")
print(f"Token 3797's embedding (first 5 dims): {embeddings[1, :5]}")
print(f"Token 3332's embedding (first 5 dims): {embeddings[2, :5]}")

### Step 3: Handle Batches

In [None]:
# Batch of 2 sequences, each with 4 tokens
token_ids_batch = torch.tensor([
    [464, 3797, 3332, 319],    # Sequence 1: "The cat sat on"
    [314, 588, 4695, 345]      # Sequence 2: "I will help you"
])

print(f"Batch shape: {token_ids_batch.shape}")  # torch.Size([2, 4])

# Look up embeddings for entire batch
embeddings_batch = embedding_matrix[token_ids_batch]

print(f"Embeddings shape: {embeddings_batch.shape}")  # torch.Size([2, 4, 768])
print("\nShape transformation: (batch, seq) → (batch, seq, embed_dim)")

### Step 4: Use PyTorch's nn.Embedding

In [None]:
class TokenEmbedding(nn.Module):
    """
    Token embedding layer: converts token IDs to dense vectors.
    
    This is what GPT-2's 'wte' (word token embeddings) layer does.
    """
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Create the embedding matrix as a learnable parameter
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # Initialize with small random values (GPT-2 style)
        nn.init.normal_(self.embedding.weight, mean=0.0, std=0.02)
    
    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch, seq) tensor of token IDs
        
        Returns:
            embeddings: (batch, seq, embed_dim) tensor of token vectors
        """
        return self.embedding(token_ids)

# Create token embedding layer
token_embed = TokenEmbedding(vocab_size=50257, embed_dim=768)

# Embed a batch
token_ids = torch.tensor([[464, 3797, 3332, 319]])  # Shape: (1, 4)
embeddings = token_embed(token_ids)

print(f"Input shape: {token_ids.shape}")        # torch.Size([1, 4])
print(f"Output shape: {embeddings.shape}")      # torch.Size([1, 4, 768])
print(f"\nFirst token embedding (first 5 dims): {embeddings[0, 0, :5]}")
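
Because the embedding matrix is a learnable parameter, it receives gradients like any other weight, but only for the rows that were actually looked up. The following is a minimal sketch with toy sizes; the `.sum()` "loss" is a stand-in assumption, not a real training objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration (assumption: far smaller than GPT-2)
toy_embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)

ids = torch.tensor([[1, 3, 3]])     # token 3 appears twice
loss = toy_embedding(ids).sum()     # stand-in for a real training loss
loss.backward()

grad = toy_embedding.weight.grad    # same shape as the weight matrix: (6, 4)
rows_with_grad = grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(rows_with_grad)    # [1, 3]
print(grad[3].tolist())  # [2.0, 2.0, 2.0, 2.0]  (token 3 was used twice)
```

Rows for tokens that never appear in a batch keep a zero gradient, so rare tokens are updated far less often than common ones.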

## 3. Positional Embeddings: Teaching Position

Token embeddings have no sense of position. Let's see the problem:

In [None]:
# Two sequences with same tokens, different order
token_ids_1 = torch.tensor([[464, 3797, 3332]])  # "The cat sat"
token_ids_2 = torch.tensor([[3332, 3797, 464]])  # "sat cat The"

# Get token embeddings
token_embed = TokenEmbedding(vocab_size=50257, embed_dim=768)
embeddings_1 = token_embed(token_ids_1)
embeddings_2 = token_embed(token_ids_2)

print(f"Embeddings 1 shape: {embeddings_1.shape}")
print(f"Embeddings 2 shape: {embeddings_2.shape}")

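# The tensors differ because the rows are in a different order...
print(f"\nAre embeddings identical? {torch.equal(embeddings_1, embeddings_2)}")

# ...but reversing the second sequence recovers the first: same vectors, different order
print(f"Same vectors once reordered? {torch.equal(embeddings_1, embeddings_2.flip(dims=[1]))}")

# This is the problem: token embeddings carry no position information, and
# self-attention on its own does not track order, so "The cat sat" and
# "sat cat The" present exactly the same set of vectors to the attention layers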
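
To see why order is invisible to attention, here is a minimal sketch of self-attention stripped down to its core. It assumes no learned projections and no causal mask (both of which real GPT-2 layers have), and `bare_self_attention` is a hypothetical helper written for this illustration. Reordering the input merely reorders the output rows; nothing else changes.

```python
import torch

torch.manual_seed(0)

def bare_self_attention(x):
    """Scaled dot-product self-attention with no projections, mask, or positions."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5   # (seq, seq)
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ x                                      # (seq, dim)

x = torch.randn(3, 8)            # three token vectors, e.g. "The cat sat"
perm = torch.tensor([2, 1, 0])   # reversed order, e.g. "sat cat The"

out_original = bare_self_attention(x)
out_reversed = bare_self_attention(x[perm])

# Reordering the input only reorders the output rows
print(torch.allclose(out_original[perm], out_reversed, atol=1e-6))  # True
```

Each token ends up with the same output vector no matter where it sits in the sequence, which is the gap positional embeddings are meant to fill.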

### Implementing Learned Positional Embeddings

In [None]:
class PositionalEmbedding(nn.Module):
    """
    Learned positional embeddings: one trainable vector per position.
    
    GPT-2 uses this approach (called 'wpe' - word position embeddings).
    """
    def __init__(self, max_seq_len, embed_dim):
        """
        Args:
            max_seq_len: Maximum sequence length (e.g., 1024 for GPT-2)
            embed_dim: Embedding dimension (must match token embeddings)
        """
        super().__init__()
        # Create position embedding matrix: (max_seq_len, embed_dim)
        self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
        
        # Initialize with small random values (GPT-2 style)
        nn.init.normal_(self.pos_embed.weight, mean=0.0, std=0.02)
        
        self.max_seq_len = max_seq_len
    
    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch, seq) tensor of token IDs
        
        Returns:
            pos_embeddings: (batch, seq, embed_dim) tensor of position vectors
        """
        batch_size, seq_len = token_ids.shape
        
        # Validate sequence length
        if seq_len > self.max_seq_len:
            raise ValueError(
                f"Sequence length {seq_len} exceeds max_seq_len {self.max_seq_len}"
            )
        
        # Create position indices: [0, 1, 2, ..., seq_len-1]
        position_ids = torch.arange(
            seq_len,
            device=token_ids.device  # Match device (CPU/GPU) of input
        )
        
        # Expand for batch: (seq_len,) → (batch, seq_len)
        position_ids = position_ids.unsqueeze(0).expand(batch_size, seq_len)
        
        # Look up position embeddings
        pos_embeddings = self.pos_embed(position_ids)
        
        return pos_embeddings

# Create positional embedding layer
pos_embed = PositionalEmbedding(max_seq_len=1024, embed_dim=768)

# Example: sequence of length 4
token_ids = torch.tensor([[464, 3797, 3332, 319]])  # Shape: (1, 4)
pos_embeddings = pos_embed(token_ids)

print(f"Input shape: {token_ids.shape}")           # torch.Size([1, 4])
print(f"Position embeddings shape: {pos_embeddings.shape}")  # torch.Size([1, 4, 768])

# Each position gets its own learned vector
print(f"\nPosition 0 embedding (first 5 dims): {pos_embeddings[0, 0, :5]}")
print(f"Position 1 embedding (first 5 dims): {pos_embeddings[0, 1, :5]}")
print(f"Position 2 embedding (first 5 dims): {pos_embeddings[0, 2, :5]}")
print(f"Position 3 embedding (first 5 dims): {pos_embeddings[0, 3, :5]}")
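
The explicit `expand` to the batch dimension is one way to get the shapes to line up. A sketch of an alternative, under toy sizes that are assumptions for illustration, is to look the positions up once and let broadcasting handle the batch when the vectors are later added to token embeddings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration (assumption: not GPT-2's real sizes)
batch_size, seq_len, toy_dim = 2, 4, 8
toy_pos_embed = nn.Embedding(16, toy_dim)

position_ids = torch.arange(seq_len)                        # (seq_len,)

# Variant A: expand to the batch first, as in PositionalEmbedding
expanded = toy_pos_embed(position_ids.unsqueeze(0).expand(batch_size, seq_len))

# Variant B: look up once and rely on broadcasting when adding later
compact = toy_pos_embed(position_ids).unsqueeze(0)          # (1, seq_len, toy_dim)

token_vectors = torch.randn(batch_size, seq_len, toy_dim)   # stand-in for token embeddings
print(expanded.shape)   # torch.Size([2, 4, 8])
print(compact.shape)    # torch.Size([1, 4, 8])
print(torch.equal(token_vectors + expanded, token_vectors + compact))  # True
```

Both variants give identical sums; the expanded form simply makes the (batch, seq, embed_dim) shape explicit, which is convenient when returning position embeddings on their own.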

## 4. Combining Token + Position Embeddings

### Building the Complete GPT2Embeddings Class

In [None]:
class GPT2Embeddings(nn.Module):
    """
    Complete GPT-2 embedding layer: token + position embeddings.
    
    This matches GPT-2's 'wte' + 'wpe' layers.
    """
    def __init__(self, vocab_size, max_seq_len, embed_dim):
        """
        Args:
            vocab_size: Size of vocabulary (50257 for GPT-2)
            max_seq_len: Maximum sequence length (1024 for GPT-2)
            embed_dim: Embedding dimension (768 for GPT-2 Small)
        """
        super().__init__()
        
        # Token embeddings: vocab_size × embed_dim
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        
        # Position embeddings: max_seq_len × embed_dim
        self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
        
        # Initialize both with GPT-2's standard initialization
        nn.init.normal_(self.token_embed.weight, mean=0.0, std=0.02)
        nn.init.normal_(self.pos_embed.weight, mean=0.0, std=0.02)
        
        self.max_seq_len = max_seq_len
    
    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch, seq) tensor of token IDs
        
        Returns:
            embeddings: (batch, seq, embed_dim) tensor of combined embeddings
        """
        batch_size, seq_len = token_ids.shape
        
        # Validate sequence length
        if seq_len > self.max_seq_len:
            raise ValueError(
                f"Sequence length {seq_len} exceeds max_seq_len {self.max_seq_len}"
            )
        
        # ===== Token Embeddings =====
        token_embeddings = self.token_embed(token_ids)
        
        # ===== Position Embeddings =====
        position_ids = torch.arange(seq_len, device=token_ids.device)
        position_ids = position_ids.unsqueeze(0).expand(batch_size, seq_len)
        position_embeddings = self.pos_embed(position_ids)
        
        # ===== Combine via Addition =====
        embeddings = token_embeddings + position_embeddings
        
        return embeddings

# Create GPT-2 Small embedding layer
gpt2_embed = GPT2Embeddings(
    vocab_size=50257,
    max_seq_len=1024,
    embed_dim=768
)

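# Example: embed a batch of sequences (all rows in a batch must have the same length)
# Both rows are padded to length 6 with ID 0 purely for illustration.
# Note: GPT-2 has no dedicated padding token, and ID 0 is a real token ("!"),
# so a common workaround is to pad with the end-of-text ID (50256) and mask those positions.
token_ids = torch.tensor([
    [464, 3797, 3332, 319, 0, 0],  # "The cat sat on" + padding
    [314, 588, 4695, 345, 0, 0]    # "I will help you" + padding
])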

embeddings = gpt2_embed(token_ids)

print(f"Input shape: {token_ids.shape}")         # torch.Size([2, 6])
print(f"Output shape: {embeddings.shape}")       # torch.Size([2, 6, 768])
print(f"Each token now has: {embeddings.shape[-1]} dimensions")

print(f"\nFirst sequence, first token (first 10 dims):")
print(embeddings[0, 0, :10])
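
Batching sequences of different lengths needs two things: padding to a common length, and a mask recording which positions are real. The sketch below is one possible approach, not GPT-2's own preprocessing; reusing the end-of-text ID 50256 as `PAD_ID` is an assumption (GPT-2 defines no padding token), and the variable names are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Assumption: reuse GPT-2's end-of-text ID (50256) as the padding ID
PAD_ID = 50256

sequences = [
    torch.tensor([464, 3797, 3332, 319]),   # 4 tokens
    torch.tensor([464, 3797]),              # 2 tokens
]

# Right-pad every sequence to the length of the longest one
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)

# 1 = real token, 0 = padding. Built from the lengths, not from the IDs,
# so a genuine end-of-text token would not be masked by accident.
lengths = torch.tensor([len(s) for s in sequences])
attention_mask = (torch.arange(padded.shape[1])[None, :] < lengths[:, None]).long()

print(padded.tolist())          # [[464, 3797, 3332, 319], [464, 3797, 50256, 50256]]
print(attention_mask.tolist())  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The embedding layer still produces vectors for the padded positions; the mask is what lets later layers ignore them.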

### Connecting to Chapter 8: The Full Pipeline

In [None]:
from transformers import AutoTokenizer

# ===== Step 1: Tokenize (Chapter 8) =====
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The cat sat on the mat"
token_ids = tokenizer.encode(text, return_tensors="pt")

print("Step 1: Tokenization")
print(f"Text: {text}")
print(f"Token IDs: {token_ids}")
print(f"Shape: {token_ids.shape}\n")

# ===== Step 2: Embed (Chapter 9) =====
gpt2_embed = GPT2Embeddings(vocab_size=50257, max_seq_len=1024, embed_dim=768)
embeddings = gpt2_embed(token_ids)

print("Step 2: Embedding")
print(f"Embeddings shape: {embeddings.shape}")
print(f"First token embedding (first 10 dims): {embeddings[0, 0, :10]}\n")

print("Step 3: Next up — Attention layers (Chapter 10)")
print(f"These {embeddings.shape} embeddings will flow into self-attention,")
print("where tokens learn from each other's context!")

## 5. Exploring GPT-2's Embeddings

Load a real pretrained GPT-2 model and explore what it learned:

In [None]:
# Load GPT-2 Small
model = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Access token embeddings
token_embeddings = model.wte.weight.detach()  # Word Token Embeddings (detached: no gradients needed for analysis)
print(f"Token embeddings shape: {token_embeddings.shape}")
# torch.Size([50257, 768]) — one 768-dim vector per token

# Access position embeddings
position_embeddings = model.wpe.weight.detach()  # Word Position Embeddings (detached: no gradients needed for analysis)
print(f"Position embeddings shape: {position_embeddings.shape}")
# torch.Size([1024, 768]) — one 768-dim vector per position
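
Those two shapes translate directly into parameter counts. A quick back-of-the-envelope calculation, using only the sizes stated in this chapter (the float32 storage estimate assumes 4 bytes per value):

```python
# Sizes as stated in this chapter for GPT-2 Small
vocab_size, max_seq_len, embed_dim = 50257, 1024, 768

token_params = vocab_size * embed_dim        # rows × columns of wte
position_params = max_seq_len * embed_dim    # rows × columns of wpe
total_params = token_params + position_params

print(f"Token embedding parameters:    {token_params:,}")     # 38,597,376
print(f"Position embedding parameters: {position_params:,}")  # 786,432
print(f"Total embedding parameters:    {total_params:,}")     # 39,383,808

# Storage at 4 bytes per float32 value
print(f"Approx. size in float32: {total_params * 4 / 1e6:.1f} MB")  # 157.5 MB
```

The token table dominates: the vocabulary has roughly 49 times as many rows as the position table.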

### Finding Similar Words

**What is Cosine Similarity?**
Cosine similarity measures how closely two vectors point in the same direction, ignoring their lengths:
- **1.0** = identical direction (very similar)
- **0.0** = perpendicular (unrelated)
- **-1.0** = opposite direction (as dissimilar as possible, which does not necessarily mean "opposite meaning")

Think of it as: "are these two arrows pointing the same way?"
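
The three reference values can be reproduced by hand. This sketch computes the cosine as the dot product divided by the product of the vector lengths, using small made-up vectors (the helper `cosine` and the vectors `a` to `d` are illustrative, not part of any library), and checks the result against PyTorch's built-in function.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])      # same direction as a, twice as long
c = torch.tensor([-1.0, -2.0, -3.0])   # opposite direction
d = torch.tensor([3.0, 0.0, -1.0])     # perpendicular to a: 1*3 + 2*0 + 3*(-1) = 0

def cosine(u, v):
    """cos(theta) = (u · v) / (|u| |v|)"""
    return (torch.dot(u, v) / (u.norm() * v.norm())).item()

print(round(cosine(a, b), 4))   # 1.0
print(round(cosine(a, c), 4))   # -1.0
print(round(cosine(a, d), 4))   # 0.0

# The hand-written version agrees with PyTorch's built-in
builtin = F.cosine_similarity(a, b, dim=0).item()
print(abs(cosine(a, b) - builtin) < 1e-6)   # True
```

Note that `b` is twice as long as `a` yet still scores 1.0: only direction matters, not magnitude.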

In [None]:
def find_similar_tokens(word, embeddings, tokenizer, top_k=5):
    """Find tokens with embeddings most similar to the given word."""
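    # Get the token ID for the word. Caveats: if the word splits into several
    # tokens, only the first is used, and GPT-2 treats "king" and " king"
    # (leading space, the usual mid-sentence form) as different tokens.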
    token_id = tokenizer.encode(word, add_special_tokens=False)[0]
    target_vec = embeddings[token_id]
    
    # Compute cosine similarity with all tokens
    similarities = F.cosine_similarity(
        target_vec.unsqueeze(0),  # (1, 768)
        embeddings,               # (50257, 768)
        dim=1
    )
    
    # Get top-k most similar (excluding the word itself)
    top_indices = similarities.argsort(descending=True)[1:top_k+1]
    
    print(f"\nWords most similar to '{word}':")
    for idx in top_indices:
        token = tokenizer.decode([idx.item()])  # convert the 0-dim tensor to a plain int
        score = similarities[idx].item()
        print(f"  {score:.3f} — '{token}'")

# Explore semantic relationships
find_similar_tokens("king", token_embeddings, tokenizer, top_k=8)
find_similar_tokens("computer", token_embeddings, tokenizer, top_k=8)
find_similar_tokens("happy", token_embeddings, tokenizer, top_k=8)

### Try Your Own Explorations!

In [None]:
# Try different words:
words_to_explore = ["Python", "doctor", "fast", "beautiful"]

for word in words_to_explore:
    find_similar_tokens(word, token_embeddings, tokenizer, top_k=5)

### Bonus: Word Analogies

Can we find vectors that satisfy "king - man + woman ≈ queen"?

In [None]:
def word_analogy(word1, word2, word3, embeddings, tokenizer, top_k=5):
    """
    Find: word1 - word2 + word3 ≈ ?
    Example: king - man + woman ≈ queen
    """
    # Get token IDs
    id1 = tokenizer.encode(word1, add_special_tokens=False)[0]
    id2 = tokenizer.encode(word2, add_special_tokens=False)[0]
    id3 = tokenizer.encode(word3, add_special_tokens=False)[0]
    
    # Compute target vector: word1 - word2 + word3
    target_vec = embeddings[id1] - embeddings[id2] + embeddings[id3]
    
    # Find most similar tokens
    similarities = F.cosine_similarity(
        target_vec.unsqueeze(0),
        embeddings,
        dim=1
    )
    
    # Exclude the input words from results
    similarities[id1] = -1
    similarities[id2] = -1
    similarities[id3] = -1
    
    top_indices = similarities.argsort(descending=True)[:top_k]
    
    print(f"\n{word1} - {word2} + {word3} ≈ ?")
    for idx in top_indices:
        token = tokenizer.decode([idx.item()])  # convert the 0-dim tensor to a plain int
        score = similarities[idx].item()
        print(f"  {score:.3f} — '{token}'")

# Classic example: king - man + woman ≈ queen
word_analogy("king", "man", "woman", token_embeddings, tokenizer, top_k=5)

# Try others!
word_analogy("Paris", "France", "Germany", token_embeddings, tokenizer, top_k=5)
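
The arithmetic behind an analogy is easier to follow on vectors small enough to read. The sketch below uses hand-made 2-D "embeddings" that were invented for this illustration and are not learned by any model; dimension 0 loosely stands for "royalty" and dimension 1 for "gender". Results on real learned embeddings are noisier and may not reproduce the classic answer.

```python
import torch
import torch.nn.functional as F

# Hand-made 2-D "embeddings" (assumption: invented for illustration, not learned).
# Dimension 0 ≈ "royalty", dimension 1 ≈ "gender".
toy_vocab = ["king", "queen", "man", "woman", "apple"]
toy_vectors = torch.tensor([
    [1.0,  1.0],   # king
    [1.0, -1.0],   # queen
    [0.0,  1.0],   # man
    [0.0, -1.0],   # woman
    [-1.0, 0.2],   # apple
])
index = {word: i for i, word in enumerate(toy_vocab)}

# king - man + woman
target = toy_vectors[index["king"]] - toy_vectors[index["man"]] + toy_vectors[index["woman"]]

similarities = F.cosine_similarity(target.unsqueeze(0), toy_vectors, dim=1)
for word in ("king", "man", "woman"):
    similarities[index[word]] = -1.0   # exclude the input words

best = toy_vocab[similarities.argmax().item()]
print(target.tolist())   # [1.0, -1.0]
print(best)              # queen
```

Subtracting "man" removes the gender component of "king", and adding "woman" puts the opposite one back, leaving the royalty component untouched.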

## 6. Hands-On Exercises

### Exercise 1: Manual Embedding Lookup

In [None]:
# Create a tiny vocabulary (10 tokens) and embedding dimension of 8
# Manually create an embedding matrix and look up embeddings for token IDs [2, 5, 7]
# Print the shapes at each step

# Step 1: Create embedding matrix
vocab_size = 10
embed_dim = 8
embedding_matrix = torch.randn(???, ???)  # Fill in the sizes
print(f"Embedding matrix shape: {embedding_matrix.shape}")

# Step 2: Create token IDs to look up
token_ids = torch.tensor([2, 5, 7])
print(f"Token IDs: {token_ids}")

# Step 3: Look up embeddings (hint: use indexing like embedding_matrix[...])
embeddings = ???
print(f"Embeddings shape: {embeddings.shape}")

### Exercise 2: Build TokenEmbedding from Scratch

In [None]:
# Implement the TokenEmbedding class without looking at the example
# Include proper initialization and shape validation

# YOUR CODE HERE

### Exercise 3: Position Encoding

In [None]:
# Create a sequence of token IDs: [10, 20, 30, 40, 50]
# Generate position indices and look them up in a position embedding layer
# Verify that position 0 always gets the same vector regardless of token

# YOUR CODE HERE

### Exercise 4: Complete GPT2Embeddings

In [None]:
# Implement the full GPT2Embeddings class
# Test it with:
# - A batch of 2 sequences
# - Different sequence lengths (one of length 8, one of length 5 padded to length 8)
# - Verify output shape is (2, 8, embed_dim)

# YOUR CODE HERE

### Exercise 5: Explore GPT-2 Similarities

In [None]:
# Load GPT-2 and find tokens similar to:
# - "Python" (should find programming-related words)
# - "doctor" (should find medical/professional words)
# - "fast" (should find speed-related words)
#
# Does the model group related concepts together?

# YOUR CODE HERE

### Exercise 6: Device Handling

In [None]:
# Create a GPT2Embeddings instance
# Move it to GPU (if available)
# Embed token IDs that start on CPU
# What happens? Fix it by moving token IDs to GPU first

# YOUR CODE HERE

### Exercise 7: Sequence Length Limits

In [None]:
# Create a GPT2Embeddings with max_seq_len=10
# Try to embed a sequence of length 15
# Handle the error gracefully by truncating the sequence before embedding

# YOUR CODE HERE

### Exercise 8: Chapter 8 Integration

In [None]:
# Take the JSONL output from Chapter 8 (your tokenized dataset)
# Load one record, extract the token IDs, convert to PyTorch tensor
# Embed it using GPT2Embeddings
# Print the shape at each step

# YOUR CODE HERE

## Chapter Summary

**What we built:**

1. **Token embeddings:** Lookup table converting token IDs to semantic vectors
2. **Positional embeddings:** Learned vectors encoding position in sequence
3. **Complete GPT2Embeddings:** Combines token + position via addition

**What we learned:**

- Token IDs are arbitrary indices with no semantic meaning
- Embeddings convert IDs to dense vectors that capture relationships
- Position information is critical ("cat chased dog" ≠ "dog chased cat")
- GPT-2 uses learned positional embeddings (trainable parameters)
- Cosine similarity reveals learned semantic relationships
- Real models learn that "king" and "queen" are related without being told!

**Next:** Chapter 10 will use these embeddings for self-attention, where tokens learn from each other's context!