# Part 2: Embeddings and Positional Encoding

## Converting Text to Numbers

Neural networks work with numbers, not words. This notebook covers:

1. **Token Embeddings**: Converting words/characters to vectors
2. **Positional Encoding**: Telling the model where each token is

---


In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)


## Step 1: Tokenization

Before embedding, we need to split text into tokens. There are several strategies:

- **Character-level**: `"hello"` → `['h', 'e', 'l', 'l', 'o']`
- **Word-level**: `"hello world"` → `['hello', 'world']`
- **Subword (e.g. BPE)**: `"unhappiness"` → `['un', 'happiness']` (the exact split depends on the merges learned from the training corpus)

We'll use **character-level** tokenization for simplicity: it is easy to understand and keeps the vocabulary small, at the cost of longer sequences.
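
To make the trade-off concrete, here is a minimal sketch comparing the character-level and word-level strategies on the same string. The whitespace split is a deliberately naive stand-in for a word tokenizer (an assumption for illustration); real ones also handle punctuation and casing.

```python
text = "hello world"

# Character-level: every character, including the space, becomes a token.
char_tokens = list(text)

# Word-level: naive split on whitespace (illustrative only).
word_tokens = text.split()

print(char_tokens)   # 11 tokens
print(word_tokens)   # 2 tokens
print(len(set(char_tokens)), "distinct characters in this string")
```

The character-level sequence is longer, but its vocabulary stays tiny no matter how much text is added; a word-level vocabulary keeps growing with every new word.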


In [None]:
class CharTokenizer:
    """Simple character-level tokenizer."""
    
    def __init__(self, text):
        # Get unique characters
        chars = sorted(set(text))
        
        # Create mappings
        self.char_to_idx = {c: i for i, c in enumerate(chars)}
        self.idx_to_char = {i: c for i, c in enumerate(chars)}
        self.vocab_size = len(chars)
    
    def encode(self, text):
        """Convert text to list of integers."""
        return [self.char_to_idx[c] for c in text]
    
    def decode(self, indices):
        """Convert list of integers back to text."""
        return ''.join(self.idx_to_char[i] for i in indices)

# Example
sample_text = "Hello, World!"
tokenizer = CharTokenizer(sample_text)

print(f"Text: '{sample_text}'")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print("\nCharacter to Index mapping:")
for char, idx in tokenizer.char_to_idx.items():
    # repr() adds the quotes and escapes whitespace such as '\n' or '\t'
    print(f"  {char!r} → {idx}")

encoded = tokenizer.encode(sample_text)
print(f"\nEncoded: {encoded}")
print(f"Decoded: '{tokenizer.decode(encoded)}'")


## Step 2: Token Embeddings

Now we convert token indices to **dense vectors**.

### Why not one-hot encoding?

One-hot: each token is a vector with a single 1 and 0s everywhere else.

```
Vocab: [a, b, c, d]
'a' → [1, 0, 0, 0]
'b' → [0, 1, 0, 0]
```

Problems:
1. **Huge vectors**: Vocab of 50,000 → 50,000-dim vectors!
2. **No similarity**: every pair of distinct one-hot vectors is equally far apart, so `cat` is no closer to `kitten` than to `airplane`
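
The "no similarity" problem can be checked directly. The sketch below uses a hypothetical three-word vocabulary (not the notebook's character tokenizer) and measures how far apart the one-hot vectors are.

```python
import numpy as np

vocab = ["cat", "kitten", "airplane"]   # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))            # row k is the one-hot vector for vocab[k]
cat, kitten, airplane = one_hot

# Euclidean distance between any two distinct one-hot vectors is sqrt(2).
d_cat_kitten = np.linalg.norm(cat - kitten)
d_cat_airplane = np.linalg.norm(cat - airplane)
print(d_cat_kitten, d_cat_airplane)

# Dot products (and hence cosine similarities) are all zero.
print(cat @ kitten, cat @ airplane)
```

Nothing in this representation can express that two tokens are related; that has to come from learned dense vectors.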

### Embeddings: Dense, learnable vectors

```
'a' → [0.2, -0.5, 0.8, 0.1]  (small, dense)
'b' → [0.3, 0.1, -0.2, 0.9]
```

Tokens that appear in similar contexts tend to end up with similar vectors through training!
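
One way to connect the two representations: looking up row `k` of an embedding matrix gives the same result as multiplying the one-hot vector for token `k` by that matrix. The sketch below checks this with toy sizes and a random matrix `E` (both chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 4, 3                   # toy sizes for illustration
E = rng.normal(size=(vocab_size, embed_dim))   # stand-in embedding matrix

indices = np.array([2, 0, 2])                  # a short token sequence
one_hot = np.eye(vocab_size)[indices]          # shape (3, vocab_size)

via_matmul = one_hot @ E                       # one-hot times matrix
via_lookup = E[indices]                        # plain row lookup

print(via_lookup.shape)
print(np.allclose(via_matmul, via_lookup))
```

The lookup is preferred in practice because it skips building the large, mostly-zero one-hot matrix.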


In [None]:
class TokenEmbedding:
    """
    Convert token indices to dense vectors.
    
    This is essentially a lookup table.
    """
    
    def __init__(self, vocab_size, embed_dim):
        """
        vocab_size: number of unique tokens
        embed_dim: dimension of embedding vectors
        """
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        
        # Initialize embedding matrix randomly
        # Shape: (vocab_size, embed_dim)
        # Each row is the embedding for one token
        self.embeddings = np.random.randn(vocab_size, embed_dim) * 0.1
    
    def forward(self, indices):
        """
        Look up embeddings for given indices.
        
        indices: list or array of token indices
        Returns: (seq_len, embed_dim) array of embeddings
        """
        return self.embeddings[indices]

# Example
embed_dim = 8
embedding = TokenEmbedding(tokenizer.vocab_size, embed_dim)

# Embed our sample text
indices = tokenizer.encode("Hello")
embedded = embedding.forward(indices)

print(f"Input text: 'Hello'")
print(f"Token indices: {indices}")
print(f"\nEmbedding shape: {embedded.shape}")
print(f"({len(indices)} tokens, each with {embed_dim}-dim embedding)")
print(f"\nEmbedding for each character:")
for char, vec in zip('Hello', embedded):
    print(f"  '{char}' → [{', '.join(f'{v:6.3f}' for v in vec)}]")


In [None]:
# Visualize the embedding matrix
fig, ax = plt.subplots(figsize=(10, 6))

im = ax.imshow(embedding.embeddings, cmap='RdBu', aspect='auto')
ax.set_xlabel('Embedding Dimension', fontsize=12)
ax.set_ylabel('Token Index', fontsize=12)
ax.set_title('Token Embedding Matrix\n(Each row is one token\'s embedding)', fontsize=14)

# Add character labels
ax.set_yticks(range(tokenizer.vocab_size))
ax.set_yticklabels([f"{i}: '{tokenizer.idx_to_char[i]}'" for i in range(tokenizer.vocab_size)])

plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()


## Step 3: Positional Encoding

### The Problem

Unlike RNNs, Transformers process all tokens simultaneously, so nothing in the architecture itself tells the model which token came first. But word order changes meaning:

- "The cat chased the dog" vs "The dog chased the cat"

The embedding for "cat" is the same in both cases! We need to add position information.
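
The problem can be seen in a few lines. This sketch uses a hypothetical word-level vocabulary and random stand-in embeddings (assumptions for illustration; the notebook itself works at character level).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "chased": 2, "dog": 3}   # toy word-level vocabulary
E = rng.normal(size=(len(vocab), 4))                  # random stand-in embeddings

def embed(sentence):
    """Look up one embedding row per whitespace-separated word."""
    ids = [vocab[w] for w in sentence.lower().split()]
    return E[ids]

a = embed("The cat chased the dog")
b = embed("The dog chased the cat")

# "cat" sits at position 1 in `a` and position 4 in `b`,
# yet its vector is exactly the same in both.
same_cat_vector = np.array_equal(a[1], b[4])
print(same_cat_vector)

# An order-insensitive summary (here, a plain sum) cannot tell the sentences apart.
same_sum = np.allclose(a.sum(axis=0), b.sum(axis=0))
print(same_sum)
```

The two sentences contain the same set of rows in a different order, so any operation that ignores row order treats them as identical.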

### The Solution: Sinusoidal Positional Encoding

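Add a unique pattern to each position using sine and cosine waves of different frequencies:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where:
- `pos` = position in sequence (0, 1, 2, ...)
- `i` = index of the sin/cos dimension pair (0, 1, ..., `d_model`/2 - 1)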
- `d_model` = embedding dimension
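
Before vectorizing, it can help to evaluate the encoding one entry at a time. The sketch below is a direct, loop-based reading of the sinusoidal scheme (sin of `pos / 10000^(2i/d_model)` on even dimensions, the matching cos on odd ones); the helper name `pe_entry` and the tiny `d_model = 4` are choices made here for illustration.

```python
import math
import numpy as np

def pe_entry(pos, dim, d_model):
    """Positional-encoding value for a single (position, dimension) pair."""
    i = dim // 2                                  # index of the sin/cos pair
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4
table = np.array([[pe_entry(pos, dim, d_model) for dim in range(d_model)]
                  for pos in range(3)])
print(np.round(table, 3))
```

Position 0 always encodes to alternating 0s and 1s (sin 0 = 0, cos 0 = 1). At position 1, the first pair uses an angle of 1 radian while the second pair uses only 0.01, which is why later dimensions change much more slowly.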


In [None]:
class PositionalEncoding:
    """
    Add positional information using sinusoidal functions.
    
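    Why sin/cos?
    1. Bounded values (-1 to 1)
    2. Different frequencies capture different position scales
    3. Defined for any position, so it can in principle be applied to
       sequences longer than those seen in training
    4. For a fixed offset k, PE(pos + k) is a linear function of PE(pos),
       which makes relative positions easy to express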
    """
    
    def __init__(self, max_seq_len, embed_dim):
        """
        max_seq_len: maximum sequence length to support
        embed_dim: dimension of embeddings (must match token embeddings)
        """
        self.embed_dim = embed_dim
        
        # Create positional encoding matrix
        pe = np.zeros((max_seq_len, embed_dim))
        
        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        position = np.arange(max_seq_len).reshape(-1, 1)
        
        # Frequency for each sin/cos pair:
        # div_term = 1 / 10000^(2i/d_model), computed via exp/log
        div_term = np.exp(np.arange(0, embed_dim, 2) * -(np.log(10000.0) / embed_dim))
        
        # Apply sin to even indices, cos to odd indices
        # (assumes embed_dim is even, so both slices have the same width)
        pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
        
        self.pe = pe
    
    def forward(self, seq_len):
        """
        Get positional encodings for a sequence.
        
        seq_len: length of the sequence
        Returns: (seq_len, embed_dim) positional encodings
        """
        return self.pe[:seq_len]

# Create positional encoding
max_seq_len = 100
pos_encoder = PositionalEncoding(max_seq_len, embed_dim)

# Get encodings for first 10 positions
pe = pos_encoder.forward(10)

print(f"Positional Encoding shape: {pe.shape}")
print(f"\nFirst 5 positions:")
for i in range(5):
    print(f"  Position {i}: [{', '.join(f'{v:6.3f}' for v in pe[i])}]")


In [None]:
# Visualize positional encodings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Full matrix view
ax = axes[0]
pe_full = pos_encoder.forward(50)
im = ax.imshow(pe_full, cmap='RdBu', aspect='auto')
ax.set_xlabel('Embedding Dimension', fontsize=12)
ax.set_ylabel('Position', fontsize=12)
ax.set_title('Positional Encoding Matrix\n(Each row is unique)', fontsize=14)
plt.colorbar(im, ax=ax)

# Individual dimension waves
ax = axes[1]
positions = np.arange(50)
for dim in [0, 2, 4, 6]:
    ax.plot(positions, pe_full[:, dim], label=f'Dim {dim}', lw=2)
ax.set_xlabel('Position', fontsize=12)
ax.set_ylabel('Encoding Value', fontsize=12)
ax.set_title('Sinusoidal Waves at Different Dimensions\n(Lower dims = higher frequency)', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: Lower dimensions oscillate faster (high frequency).")
print("This lets the model capture both fine and coarse position info.")


## Step 4: Combining Token + Position

The final input to the Transformer is:

**Input = TokenEmbedding(x) + PositionalEncoding(pos)**

Simply **add** the position vectors to the token vectors!
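
The addition works element-wise because both arrays have shape `(seq_len, d_model)`. The sketch below also compares their magnitudes, which is one commonly given motivation for scaling the token embeddings by `sqrt(d_model)`. The sizes and the random stand-in embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 16, 5                       # toy sizes for illustration

# Stand-in token embeddings: one random row per position, std 0.1
tok = rng.normal(size=(seq_len, d_model)) * 0.1

# Sinusoidal positional encodings
pos = np.arange(seq_len)[:, None]
freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(pos * freqs)
pe[:, 1::2] = np.cos(pos * freqs)

x = tok * np.sqrt(d_model) + pe                # same shape, element-wise add

pe_norms = np.linalg.norm(pe, axis=1)
tok_norms = np.linalg.norm(tok, axis=1)
print(x.shape)
print(pe_norms)    # every row has norm sqrt(d_model / 2)
print(tok_norms)   # unscaled token rows are noticeably smaller
```

Each sin/cos pair contributes sin² + cos² = 1, so every positional row has norm `sqrt(d_model / 2)` regardless of position. Small randomly initialized token embeddings would be swamped by that without the extra scaling.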


In [None]:
class TransformerEmbedding:
    """
    Complete embedding layer for Transformer.
    Combines token embeddings with positional encodings.
    """
    
    def __init__(self, vocab_size, embed_dim, max_seq_len):
        self.token_embedding = TokenEmbedding(vocab_size, embed_dim)
        self.pos_encoding = PositionalEncoding(max_seq_len, embed_dim)
        self.embed_dim = embed_dim
    
    def forward(self, token_indices):
        """
        Convert token indices to embeddings with position info.
        
        token_indices: list of token indices
        Returns: (seq_len, embed_dim) array ready for attention
        """
        seq_len = len(token_indices)
        
        # Get token embeddings
        tok_emb = self.token_embedding.forward(token_indices)
        
        # Get positional encodings
        pos_enc = self.pos_encoding.forward(seq_len)
        
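        # Add them together
        # (Token embeddings are scaled by sqrt(embed_dim), as in the original
        # Transformer; a common explanation is that this keeps the positional
        # encodings from swamping the small token embeddings)
        return tok_emb * np.sqrt(self.embed_dim) + pos_enc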

# Example: Full embedding pipeline
text = "Hello, World!"
tokenizer = CharTokenizer(text)
transformer_emb = TransformerEmbedding(
    vocab_size=tokenizer.vocab_size,
    embed_dim=16,
    max_seq_len=100
)

# Convert text to embeddings
indices = tokenizer.encode(text)
embeddings = transformer_emb.forward(indices)

print(f"Text: '{text}'")
print(f"Token indices: {indices}")
print(f"Final embedding shape: {embeddings.shape}")
print(f"\nThese embeddings are ready for self-attention!")


In [None]:
# Visualize the complete embedding process
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Token embeddings only
tok_emb = transformer_emb.token_embedding.forward(indices) * np.sqrt(transformer_emb.embed_dim)
ax = axes[0]
im = ax.imshow(tok_emb, cmap='RdBu', aspect='auto')
ax.set_yticks(range(len(text)))
ax.set_yticklabels(list(text))
ax.set_xlabel('Dimension')
ax.set_title('Token Embeddings Only\n(No position info)')
plt.colorbar(im, ax=ax)

# Positional encodings only
pos_enc = transformer_emb.pos_encoding.forward(len(text))
ax = axes[1]
im = ax.imshow(pos_enc, cmap='RdBu', aspect='auto')
ax.set_yticks(range(len(text)))
ax.set_yticklabels([f'pos {i}' for i in range(len(text))])
ax.set_xlabel('Dimension')
ax.set_title('Positional Encodings Only\n(No token info)')
plt.colorbar(im, ax=ax)

# Combined
ax = axes[2]
im = ax.imshow(embeddings, cmap='RdBu', aspect='auto')
ax.set_yticks(range(len(text)))
ax.set_yticklabels(list(text))
ax.set_xlabel('Dimension')
ax.set_title('Token + Position (Combined)\n(Ready for attention!)')
plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()


## Summary

### The Embedding Pipeline

```
Text: "Hello"
    | Tokenization
Tokens: [H, e, l, l, o]
    | Token IDs
Indices: [3, 6, 7, 7, 8]
    | Embedding lookup
Token Embeddings: (5, d_model)
    + 
Positional Encoding: (5, d_model)
    |
Final Input: (5, d_model)  <-- Ready for attention!
```

### Key Takeaways

1. **Token embeddings** map discrete tokens to continuous vectors
2. **Positional encoding** adds position information, because Transformers have no recurrence to convey order
3. **Sinusoidal** encodings use fixed (non-learned) sin/cos patterns
4. Token embeddings are **learnable**: they are trained with the model
5. **Final input** = Token Embedding + Positional Encoding

---

**Next: 03_attention.ipynb** - The core innovation: Self-Attention!
