# Lab 1.3.3: Custom Embedding Lookup - SOLUTIONS

This notebook contains complete solutions to the exercises in Lab 1.3.3.

---

## 🎯 Learning Objectives Checklist

By completing this lab, you should now be able to:
- [x] Understand how PyTorch's `nn.Embedding` works internally
- [x] Write a custom CUDA kernel for batched embedding lookup
- [x] Optimize memory access patterns for embedding tables
- [x] Compare custom kernel performance with PyTorch's implementation
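
As a reminder of what the first objective amounts to: an embedding lookup is a row gather from a `(vocab_size, embedding_dim)` table. The NumPy sketch below illustrates that idea and is not PyTorch's actual implementation. It also shows the equivalent one-hot matrix product, which gives the same result but does far more work. All sizes and names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4                 # illustrative sizes only
table = rng.standard_normal((vocab_size, embedding_dim)).astype(np.float32)
ids = np.array([[1, 5, 2], [0, 0, 3]])           # (batch, seq) integer IDs

# Lookup as a gather: fancy indexing picks one table row per ID.
gathered = table[ids]                            # (batch, seq, embedding_dim)

# Equivalent but wasteful formulation: one-hot vectors times the table.
one_hot = np.eye(vocab_size, dtype=np.float32)[ids]   # (batch, seq, vocab_size)
via_matmul = one_hot @ table

print(gathered.shape)
print(np.allclose(gathered, via_matmul))
```

The gather touches only the rows that are requested, which is why the CUDA kernel in this lab is organised around copying rows rather than multiplying matrices.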

---

In [None]:
import numpy as np
from numba import cuda
import numba

print(f"CUDA available: {cuda.is_available()}")

## Solution: Combined Token + Position Embedding Lookup

In [None]:
@cuda.jit
def combined_embedding_kernel(token_embed_table, pos_embed_table, 
                               token_ids, positions, output):
    """
    SOLUTION: Combined token and positional embedding lookup.
    
    This fuses two lookups and the add into one kernel, so the two
    intermediate lookup results never round-trip through global memory.
    Output = token_embedding[token_id] + pos_embedding[position]
    """
    token_idx = cuda.blockIdx.x
    tx = cuda.threadIdx.x
    
    _, embedding_dim = token_embed_table.shape
    total_tokens = token_ids.shape[0]
    
    if token_idx >= total_tokens:
        return
    
    # Get the token ID and position for this token
    token_id = token_ids[token_idx]
    position = positions[token_idx]
    
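    # Strided loop: thread tx handles dims tx, tx + blockDim.x, tx + 2*blockDim.x, ...
    # so the block covers every dimension even when embedding_dim > blockDim.x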
    for embed_idx in range(tx, embedding_dim, cuda.blockDim.x):
        # Add both embeddings in one fused operation!
        output[token_idx, embed_idx] = (
            token_embed_table[token_id, embed_idx] + 
            pos_embed_table[position, embed_idx]
        )


def combined_embedding_lookup(token_embed: np.ndarray, pos_embed: np.ndarray,
                               token_ids: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """
    GPU combined embedding lookup.
    
    Args:
        token_embed: (vocab_size, embedding_dim) token embeddings
        pos_embed: (max_seq_length, embedding_dim) positional embeddings
        token_ids: (batch_size, seq_length) token IDs
        positions: (batch_size, seq_length) position indices
    
    Returns:
        (batch_size, seq_length, embedding_dim) combined embeddings
    """
    original_shape = token_ids.shape
    _, embedding_dim = token_embed.shape
    
    # Flatten inputs
    flat_tokens = token_ids.flatten().astype(np.int32)
    flat_positions = positions.flatten().astype(np.int32)
    total_tokens = flat_tokens.shape[0]
    
    # Transfer to GPU
    d_token_embed = cuda.to_device(token_embed.astype(np.float32))
    d_pos_embed = cuda.to_device(pos_embed.astype(np.float32))
    d_tokens = cuda.to_device(flat_tokens)
    d_positions = cuda.to_device(flat_positions)
    d_output = cuda.device_array((total_tokens, embedding_dim), dtype=np.float32)
    
    # Launch kernel
    threads = min(256, embedding_dim)
    blocks = total_tokens
    
    combined_embedding_kernel[blocks, threads](
        d_token_embed, d_pos_embed, d_tokens, d_positions, d_output
    )
    
    # Get result and reshape
    result = d_output.copy_to_host()
    return result.reshape(*original_shape, embedding_dim)
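
To see how the launch configuration maps work onto threads, it can help to emulate the kernel on the CPU. The sketch below is a pure-Python stand-in, not a Numba kernel: the outer loop plays the role of `cuda.blockIdx.x`, the middle loop plays `cuda.threadIdx.x`, and the inner loop is the same strided loop as in `combined_embedding_kernel`. The function name `emulate_combined_kernel` and the `writes` counter are hypothetical additions for illustration.

```python
import numpy as np

def emulate_combined_kernel(token_embed, pos_embed, token_ids, positions,
                            threads_per_block):
    """CPU emulation: one block per token, threads stride over dimensions."""
    total_tokens = token_ids.shape[0]
    embedding_dim = token_embed.shape[1]
    output = np.zeros((total_tokens, embedding_dim), dtype=np.float32)
    writes = np.zeros((total_tokens, embedding_dim), dtype=np.int64)

    for block in range(total_tokens):                  # plays cuda.blockIdx.x
        token_id = token_ids[block]
        position = positions[block]
        for tx in range(threads_per_block):            # plays cuda.threadIdx.x
            for embed_idx in range(tx, embedding_dim, threads_per_block):
                output[block, embed_idx] = (token_embed[token_id, embed_idx]
                                            + pos_embed[position, embed_idx])
                writes[block, embed_idx] += 1          # count writes per element
    return output, writes

rng = np.random.default_rng(0)
tok = rng.standard_normal((10, 12)).astype(np.float32)
pos = rng.standard_normal((8, 12)).astype(np.float32)
ids = np.array([3, 7, 1], dtype=np.int32)
where = np.array([0, 1, 2], dtype=np.int32)

# 5 threads for 12 dimensions: threads 0-1 take 3 elements, threads 2-4 take 2.
out, writes = emulate_combined_kernel(tok, pos, ids, where, threads_per_block=5)
print(np.array_equal(out, tok[ids] + pos[where]))
print(writes.min(), writes.max())
```

Every output element is written exactly once, even though 12 is not a multiple of 5, which is the property the strided loop is there to guarantee.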

In [None]:
# Test the solution
np.random.seed(42)

vocab_size = 100
max_seq_length = 512
embedding_dim = 64
batch_size = 4
seq_length = 10

# Create embedding tables
token_embed = np.random.randn(vocab_size, embedding_dim).astype(np.float32)
pos_embed = np.random.randn(max_seq_length, embedding_dim).astype(np.float32)

# Create input data
token_ids = np.random.randint(0, vocab_size, (batch_size, seq_length), dtype=np.int32)
positions = np.tile(np.arange(seq_length), (batch_size, 1)).astype(np.int32)

# GPU result
result_gpu = combined_embedding_lookup(token_embed, pos_embed, token_ids, positions)

# CPU reference
result_cpu = token_embed[token_ids] + pos_embed[positions]

print("Combined Embedding Lookup Test")
print("="*50)
print(f"Token embedding: {token_embed.shape}")
print(f"Position embedding: {pos_embed.shape}")
print(f"Input shape: {token_ids.shape}")
print(f"Output shape: {result_gpu.shape}")
print(f"\nCorrect: {np.allclose(result_gpu, result_cpu)}")
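
The test above only uses in-range indices. The kernel checks `token_idx` against `total_tokens`, but it does not check `token_id` or `position` against the table sizes, so a bad input would index outside an embedding table. One option is a host-side check before launching. The helper below is a hypothetical sketch (the name `check_lookup_indices` is not part of the lab) and uses only NumPy.

```python
import numpy as np

def check_lookup_indices(token_ids, positions, vocab_size, max_seq_length):
    """Raise ValueError if any index would fall outside its embedding table."""
    token_ids = np.asarray(token_ids)
    positions = np.asarray(positions)
    if token_ids.shape != positions.shape:
        raise ValueError(f"shape mismatch: {token_ids.shape} vs {positions.shape}")
    if token_ids.size and (token_ids.min() < 0 or token_ids.max() >= vocab_size):
        raise ValueError(f"token IDs must lie in [0, {vocab_size})")
    if positions.size and (positions.min() < 0 or positions.max() >= max_seq_length):
        raise ValueError(f"positions must lie in [0, {max_seq_length})")

# Same shapes as the test cell: 4 sequences of 10 tokens, vocab of 100.
demo_ids = np.random.default_rng(0).integers(0, 100, size=(4, 10))
demo_pos = np.tile(np.arange(10), (4, 1))
check_lookup_indices(demo_ids, demo_pos, vocab_size=100, max_seq_length=512)
print("indices OK")
```

Calling this at the top of `combined_embedding_lookup` would cost two reductions over small integer arrays, which is negligible next to the host-to-device transfers.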

In [None]:
# Verify specific values
print("\nVerification (first token of first sequence):")
print(f"  Token ID: {token_ids[0, 0]}, Position: {positions[0, 0]}")
print(f"  Token embedding[:5]: {token_embed[token_ids[0, 0], :5]}")
print(f"  Position embedding[:5]: {pos_embed[positions[0, 0], :5]}")
print(f"  Expected sum[:5]: {token_embed[token_ids[0, 0], :5] + pos_embed[positions[0, 0], :5]}")
print(f"  GPU result[:5]: {result_gpu[0, 0, :5]}")
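
The bandwidth saving from fusion can be estimated with simple arithmetic. The sketch below counts global-memory traffic under deliberately crude assumptions: no caching, index reads ignored, and the unfused version implemented as three separate kernels (token lookup, position lookup, element-wise add) whose intermediates are written out and read back. It is a rough model, not a measurement, and the function name is hypothetical.

```python
def embedding_traffic_bytes(total_tokens, embedding_dim, bytes_per_elem=4):
    """Rough global-memory traffic for unfused vs fused lookup (float32 default)."""
    n = total_tokens * embedding_dim * bytes_per_elem   # bytes in one (tokens, dim) array
    token_lookup = n + n        # read table rows, write intermediate
    pos_lookup = n + n          # read table rows, write intermediate
    add = 2 * n + n             # read both intermediates, write output
    unfused = token_lookup + pos_lookup + add
    fused = 2 * n + n           # read both table rows, write output once
    return unfused, fused

# Same problem size as the test cell: 4 x 10 tokens, 64 dimensions.
unfused, fused = embedding_traffic_bytes(total_tokens=4 * 10, embedding_dim=64)
print(unfused, fused, unfused / fused)
```

Under this model the fused kernel moves 3 array-sized transfers instead of 7, a ratio of about 2.3 regardless of problem size. Real speedups depend on caching and launch overhead, so this figure is only a guide.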

## Bonus: RoPE (Rotary Position Embedding) Concept

Modern LLMs like Llama use rotary position embeddings instead of additive position embeddings:

In [None]:
print("""
Rotary Position Embedding (RoPE):

Instead of: output = token_emb + pos_emb
RoPE uses:  output = rotate(token_emb, angle(position))

The rotation is applied in 2D subspaces:
  [x, y] -> [x*cos(θ) - y*sin(θ), x*sin(θ) + y*cos(θ)]

Where θ depends on position and dimension:
  θ_d(pos) = pos * 10000^(-2d/D)

Benefits:
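- Relative position information (the score between two rotated vectors depends only on their offset, so pos 5 "sees" pos 3 the same way pos 12 "sees" pos 10)
- Tends to extrapolate to longer sequences better than learned absolute embeddings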
- No additional parameters to learn

CUDA Implementation Idea:
- Precompute sin/cos tables for all positions
- Apply the rotation element-wise to the query and key vectors before attention scores are computed
- Can be fused with attention kernel for efficiency
""")
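
The rotation described above can be sketched in NumPy. This is a minimal CPU illustration, not a CUDA kernel and not any particular model's implementation. It assumes that consecutive elements `(x[2d], x[2d+1])` form the 2D pairs, with `d` running from 0 to `D/2 - 1`. Some libraries pair the first and second halves of the vector instead. The function name `rope_rotate` is hypothetical.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate pairs (x[2d], x[2d+1]) by the angle pos * base**(-2d/D)."""
    x = np.asarray(x, dtype=np.float64)
    D = x.shape[-1]
    if D % 2 != 0:
        raise ValueError("last dimension must be even")
    d = np.arange(D // 2)
    inv_freq = base ** (-2.0 * d / D)                         # shape (D/2,)
    theta = np.asarray(positions, dtype=np.float64)[..., None] * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin               # x*cos(θ) - y*sin(θ)
    out[..., 1::2] = x_even * sin + x_odd * cos               # x*sin(θ) + y*cos(θ)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same offset (2 positions apart) at two different absolute positions.
score_a = rope_rotate(q, 5) @ rope_rotate(k, 3)
score_b = rope_rotate(q, 12) @ rope_rotate(k, 10)
print(np.isclose(score_a, score_b))
```

The two scores agree because a rotation by one angle followed by the inverse of another leaves only the angle difference. That is the relative-position property listed under "Benefits". In a GPU version, the `cos` and `sin` arrays would be the precomputed tables.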

## Cleanup

In [None]:
import gc
gc.collect()
cuda.current_context().reset()
print("✅ Cleanup complete")