In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, report versions, and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Embeddings and the GPT Architecture: From Text to Vectors

*Part 1 of the Vizuara series on Building a GPT-Style Model from Scratch*
*Estimated time: 45 minutes*

## 1. Why Does This Matter?

Every GPT model -- from GPT-2 to GPT-4 -- starts the same way: raw text goes in, and the first thing the model must do is convert that text into numbers. Neural networks cannot read words. They only understand vectors of floating-point numbers. So the very first question in building a language model is: **how do we represent language as numbers?**

This is not a minor implementation detail. The quality of these representations determines everything downstream. If the model cannot distinguish "bank" (financial) from "bank" (river), or if it cannot tell the difference between "dog bites man" and "man bites dog," then no amount of clever architecture will save it.

In this notebook, we will build the input pipeline of a GPT model from scratch. By the end, you will have a working system that:
- Converts raw text into token IDs
- Maps those IDs to learned embedding vectors
- Adds positional information so the model knows word order
- Understands why GPT uses a decoder-only architecture

Let us begin.

## 2. Building Intuition

### Why Can't We Just Use One-Hot Vectors?

The simplest way to represent a word as a number is a one-hot vector. If our vocabulary has 10,000 words, we create a vector of 10,000 zeros and put a 1 at the position corresponding to our word. "Cat" might be [0, 0, ..., 1, ..., 0] with the 1 at position 891.

But this has two fatal problems:

1. **No notion of similarity.** The one-hot vectors for "cat" and "kitten" are just as different as "cat" and "refrigerator." Every pair of words is equally distant. This means the model has to learn from scratch that similar words are related.

2. **Enormous dimensionality.** If our vocabulary has 50,000 tokens, every word is a 50,000-dimensional vector. Most of those dimensions are wasted zeros.

The solution is **dense embeddings** -- compact vectors (typically 64 to 12,288 dimensions) where similar words naturally end up close together. The model *learns* these vectors during training.
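
As a quick check of the first problem, the sketch below builds one-hot vectors for a toy five-word vocabulary and measures the Euclidean distance between every pair. The words and their index order are illustrative assumptions, not taken from any real tokenizer.

```python
import math
from itertools import combinations

# Toy vocabulary; the words and index order are illustrative assumptions.
vocab = ["cat", "kitten", "dog", "refrigerator", "bank"]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

distances = {
    (w1, w2): euclidean(vectors[w1], vectors[w2])
    for w1, w2 in combinations(vocab, 2)
}

for (w1, w2), d in distances.items():
    print(f"{w1:>12} <-> {w2:<12} distance = {d:.4f}")
```

Every pair comes out exactly $\sqrt{2}$ apart, so "cat" is no closer to "kitten" than it is to "refrigerator".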

### Why Does Position Matter?

Consider these two sentences:
- "The dog chased the cat."
- "The cat chased the dog."

Same words, completely different meanings. If we simply looked up embedding vectors for each word without caring about order, these two sentences would be indistinguishable. The model needs to know not just *what* each word is, but *where* it appears. This is exactly what positional embeddings provide.
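
To see the problem concretely, the sketch below gives each word a small random vector and represents a sentence as the sum of its word vectors, with no positional information. The vectors are arbitrary values fixed by a seed, and `bag_of_vectors` is a hypothetical helper used only for this illustration.

```python
import math
import random

random.seed(0)
DIM = 4  # tiny dimension, chosen only for readability

# Hypothetical word vectors; in a real model these would be learned.
words = ["the", "dog", "chased", "cat"]
word_vec = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in words}

def bag_of_vectors(sentence):
    """Sum the word vectors of a sentence, ignoring word order."""
    total = [0.0] * DIM
    for w in sentence.lower().split():
        for k, value in enumerate(word_vec[w]):
            total[k] += value
    return total

a = bag_of_vectors("The dog chased the cat")
b = bag_of_vectors("The cat chased the dog")

# Compare with a tolerance because floating-point sums depend on order.
same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))
print("Order-free representations match:", same)
```

Both sentences collapse to the same vector, so any model that sees only this sum cannot tell who chased whom.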

### Why Decoder-Only?

The original Transformer had an encoder and a decoder. GPT throws away the encoder and keeps only the decoder. Why?

Think about it this way: if you are writing a story, you do not need to first read and encode a separate input in another language. You simply need to look at what you have written so far and decide what word comes next. That is exactly what a decoder-only model does. It reads left to right, and each token can only look at itself and the tokens that came before it, never at future tokens. This is called **causal masking**, and it is the defining feature of GPT's architecture.
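
A causal mask can be pictured as a lower-triangular table of "allowed to look" flags. The sketch below builds one in plain Python for a five-token sequence; the length is an arbitrary choice. In PyTorch the same pattern is commonly built with `torch.tril(torch.ones(T, T))`.

```python
T = 5  # sequence length, chosen for illustration

# causal_mask[i][j] is True when position i may attend to position j.
# Position i sees itself and everything before it (j <= i).
causal_mask = [[j <= i for j in range(T)] for i in range(T)]

for row in causal_mask:
    print(" ".join("1" if allowed else "." for allowed in row))
```

Row `i` of the printout shows what the token at position `i` can see: the first token sees only itself, while the last token sees the whole sequence.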

### Think About This

If you were designing a system to predict the next word, what information would you need at each step? Think about what you do when you fill in the blank: "The cat sat on the ___." You look at all the previous words, understand their meaning and their order, and make a prediction. That is exactly the problem GPT solves.

## 3. The Mathematics

### Token Embeddings

Given a vocabulary of size $V$ and an embedding dimension $d_{\text{model}}$, the token embedding table is a matrix:

$$E_{\text{token}} \in \mathbb{R}^{V \times d_{\text{model}}}$$

For a token with ID $i$, the embedding is simply row $i$ of this matrix:

$$\mathbf{e}_{\text{token}} = E_{\text{token}}[i] \in \mathbb{R}^{d_{\text{model}}}$$

Computationally, this is just a table lookup -- no matrix multiplication needed. We index into a table with $V$ rows and $d_{\text{model}}$ columns, and pull out one row. If $V = 10{,}000$ and $d_{\text{model}} = 64$, this table has 640,000 learnable parameters.
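
The lookup gives the same result as multiplying a one-hot vector by the embedding matrix, which is why it can replace that multiplication. The sketch below checks this on a toy table; the sizes and token IDs are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, d_model = 10, 4  # toy sizes, chosen for illustration

embedding = nn.Embedding(V, d_model)
ids_demo = torch.tensor([3, 7, 3])

# Route 1: direct row lookup.
looked_up = embedding(ids_demo)                          # (3, d_model)

# Route 2: one-hot rows multiplied by the same weight matrix.
one_hot = F.one_hot(ids_demo, num_classes=V).float()     # (3, V)
multiplied = one_hot @ embedding.weight                  # (3, d_model)

print("Lookup equals one-hot matmul:", torch.allclose(looked_up, multiplied))
print(f"Learnable parameters: {embedding.weight.numel()}")
```

The lookup skips the multiplications by zero, which matters when $V$ is in the tens of thousands.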

### Positional Embeddings

GPT uses learned positional embeddings. The positional embedding table is:

$$E_{\text{pos}} \in \mathbb{R}^{T_{\max} \times d_{\text{model}}}$$

where $T_{\max}$ is the maximum sequence length. For a token at position $j$:

$$\mathbf{e}_{\text{pos}} = E_{\text{pos}}[j] \in \mathbb{R}^{d_{\text{model}}}$$

### Combined Input

The final input vector at position $j$, where the token at that position has ID $i_j$, is the element-wise sum:

$$\mathbf{x}_j = E_{\text{token}}[i_j] + E_{\text{pos}}[j]$$

Computationally, this means: look up the token embedding (what word this is), look up the positional embedding (where this word appears), and add them together element-wise. The result is a single vector that encodes both identity and position.

For example, if the token embedding is $[0.5, 0.3, -0.1, 0.7]$ and the positional embedding is $[0.1, -0.2, 0.4, 0.0]$, the combined vector is:

$$[0.5 + 0.1,\; 0.3 + (-0.2),\; -0.1 + 0.4,\; 0.7 + 0.0] = [0.6,\; 0.1,\; 0.3,\; 0.7]$$

## 4. Let's Build It -- Component by Component

### 4.1 Character-Level Tokenizer

Before we can embed tokens, we need a tokenizer. Real GPT models use BPE (Byte Pair Encoding), but for our from-scratch implementation, we will use character-level tokenization. Each character becomes one token, and the vocabulary covers the first 256 Unicode code points (ASCII plus Latin-1).

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Character-level tokenizer
class CharTokenizer:
    """A simple character-level tokenizer."""

    def __init__(self):
        # Use the first 256 Unicode code points (ASCII plus Latin-1),
        # including non-printable control characters
        self.chars = [chr(i) for i in range(256)]
        self.vocab_size = len(self.chars)
        self.char_to_id = {ch: i for i, ch in enumerate(self.chars)}
        self.id_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        """Convert text to a list of token IDs."""
        return [self.char_to_id.get(ch, 0) for ch in text]

    def decode(self, ids):
        """Convert a list of token IDs back to text."""
        return ''.join(self.id_to_char.get(i, '?') for i in ids)

# Test it
tokenizer = CharTokenizer()
text = "The cat sat on the mat."
ids = tokenizer.encode(text)
print(f"Text:      '{text}'")
print(f"Token IDs: {ids}")
print(f"Decoded:   '{tokenizer.decode(ids)}'")
print(f"Vocabulary size: {tokenizer.vocab_size}")

This is the simplest possible tokenizer. Each character maps to its Unicode code point, which is identical to its ASCII code for the first 128 characters. Real GPT models use subword tokenization (BPE), which produces shorter sequences, but character-level tokenization works well for learning the concepts.
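
One limitation is worth seeing concretely. Characters outside the 256-entry table, such as "€" or emoji, fall through to ID 0 in `CharTokenizer.encode` and cannot be recovered by `decode`. A common alternative that keeps the vocabulary at 256 is to tokenize UTF-8 bytes instead of characters. The sketch below illustrates that idea; `ByteTokenizer` is a hypothetical name and is not used elsewhere in this notebook.

```python
class ByteTokenizer:
    """Byte-level tokenizer sketch: one token per UTF-8 byte (hypothetical helper)."""

    vocab_size = 256  # a byte can take 256 distinct values

    def encode(self, text):
        """Convert text to a list of byte values in the range 0..255."""
        return list(text.encode("utf-8"))

    def decode(self, ids):
        """Convert byte values back to text; invalid sequences become U+FFFD."""
        return bytes(ids).decode("utf-8", errors="replace")


byte_tokenizer = ByteTokenizer()
sample = "café costs 3€"
byte_ids = byte_tokenizer.encode(sample)

print(f"Characters: {len(sample)}, tokens: {len(byte_ids)}")
print(f"Round trip OK: {byte_tokenizer.decode(byte_ids) == sample}")
```

The trade-off is sequence length: non-ASCII characters expand to two or more tokens each, so the same text uses more positions.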

### 4.2 Token Embedding Table

Now let us build the token embedding layer. This is simply a lookup table implemented as `nn.Embedding`.

In [None]:
# Token Embedding
VOCAB_SIZE = 256    # Code points 0-255, matching CharTokenizer.vocab_size
D_MODEL = 64        # Embedding dimension

token_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Look up embeddings for our tokens
token_ids = torch.tensor(tokenizer.encode("cat"))
embeddings = token_embedding(token_ids)

print(f"Input token IDs: {token_ids}")
print(f"Token IDs shape: {token_ids.shape}")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nEmbedding for 'c' (first 10 dims): {embeddings[0, :10].detach().numpy().round(3)}")
print(f"Embedding for 'a' (first 10 dims): {embeddings[1, :10].detach().numpy().round(3)}")
print(f"Embedding for 't' (first 10 dims): {embeddings[2, :10].detach().numpy().round(3)}")
print(f"\nTotal embedding parameters: {VOCAB_SIZE * D_MODEL:,}")

Each character gets its own 64-dimensional vector. These vectors are initialized randomly, and training adjusts them so that characters used in similar contexts end up with similar vectors.
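
For a lookup table, a training step updates only the rows that were actually looked up, because only they receive a gradient. The sketch below shows this with a dummy loss; the table size and the loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(256, 8)  # small embedding dimension for illustration

# nn.Embedding draws its initial weights from a standard normal distribution.
print(f"Initial weight mean: {emb.weight.mean().item():.3f}, "
      f"std: {emb.weight.std().item():.3f}")

demo_ids = torch.tensor([ord(c) for c in "cat"])
loss = emb(demo_ids).sum()  # dummy loss, used only to produce gradients
loss.backward()

# Rows with a non-zero gradient are exactly the rows that were looked up.
rows_with_grad = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(f"Rows with gradient: {rows_with_grad}")
print(f"Token IDs used:     {sorted(demo_ids.tolist())}")
```

A character that never appears in the training text therefore keeps its random initial vector, which is worth remembering when reading similarity plots.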

In [None]:
# Visualize the raw embeddings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Embedding vectors for a few characters
chars_to_show = ['a', 'b', 'c', 'x', 'y', 'z', ' ', '.', '0', '1']
char_ids = torch.tensor([tokenizer.char_to_id[c] for c in chars_to_show])
char_embeds = token_embedding(char_ids).detach().numpy()

im = axes[0].imshow(char_embeds, aspect='auto', cmap='RdBu_r')
axes[0].set_yticks(range(len(chars_to_show)))
axes[0].set_yticklabels([repr(c) for c in chars_to_show])
axes[0].set_xlabel('Embedding Dimension')
axes[0].set_title('Token Embedding Vectors (before training)')
plt.colorbar(im, ax=axes[0])

# Plot 2: Cosine similarity between characters
from torch.nn.functional import cosine_similarity
all_embeds = token_embedding(char_ids)
n = len(chars_to_show)
sim_matrix = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = cosine_similarity(
            all_embeds[i].unsqueeze(0), all_embeds[j].unsqueeze(0)
        )

im2 = axes[1].imshow(sim_matrix.detach().numpy(), cmap='RdBu_r', vmin=-1, vmax=1)
axes[1].set_xticks(range(n))
axes[1].set_xticklabels([repr(c) for c in chars_to_show], rotation=45)
axes[1].set_yticks(range(n))
axes[1].set_yticklabels([repr(c) for c in chars_to_show])
axes[1].set_title('Cosine Similarity Between Embeddings\n(before training -- expect random)')
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.show()

Before training, the off-diagonal similarities are small random values scattered around zero, with no meaningful structure. After training on enough varied text, you would expect clusters to emerge: for example, lowercase letters close to each other and digits close to each other.
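
The nested loop in the plotting cell is easy to read but makes one `cosine_similarity` call per pair. An equivalent vectorized form scales each row to unit length and takes a single matrix product. The sketch below uses a fresh random table so that it runs on its own; the characters and sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
demo_table = nn.Embedding(256, 64)
demo_char_ids = torch.tensor([ord(c) for c in "abcxyz"])
n_chars = len(demo_char_ids)

with torch.no_grad():
    vectors = demo_table(demo_char_ids)      # (n_chars, d_model)
    unit = F.normalize(vectors, dim=1)       # each row scaled to length 1
    sim_fast = unit @ unit.T                 # (n_chars, n_chars)

    # Reference: the pairwise loop, kept only for comparison.
    sim_loop = torch.zeros(n_chars, n_chars)
    for i in range(n_chars):
        for j in range(n_chars):
            sim_loop[i, j] = F.cosine_similarity(vectors[i], vectors[j], dim=0)

print("Vectorized matches loop:", torch.allclose(sim_fast, sim_loop, atol=1e-5))
off_diag = sim_fast[~torch.eye(n_chars, dtype=torch.bool)]
print(f"Mean |off-diagonal| similarity: {off_diag.abs().mean().item():.3f}")
```

For the small matrices in this notebook the speed difference is negligible, but the vectorized form scales much better to a full vocabulary.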

### 4.3 Positional Embedding Table

In [None]:
# Positional Embedding
MAX_SEQ_LEN = 128   # Maximum sequence length

pos_embedding = nn.Embedding(MAX_SEQ_LEN, D_MODEL)

# Look up positional embeddings for positions 0, 1, 2
positions = torch.arange(3)
pos_embeds = pos_embedding(positions)

print(f"Positions: {positions}")
print(f"Positional embeddings shape: {pos_embeds.shape}")
print(f"\nPosition 0 (first 10 dims): {pos_embeds[0, :10].detach().numpy().round(3)}")
print(f"Position 1 (first 10 dims): {pos_embeds[1, :10].detach().numpy().round(3)}")
print(f"Position 2 (first 10 dims): {pos_embeds[2, :10].detach().numpy().round(3)}")
print(f"\nTotal positional parameters: {MAX_SEQ_LEN * D_MODEL:,}")

In [None]:
# Visualize positional embeddings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Positional embedding vectors
all_pos = torch.arange(MAX_SEQ_LEN)
all_pos_embeds = pos_embedding(all_pos).detach().numpy()

axes[0].imshow(all_pos_embeds, aspect='auto', cmap='RdBu_r')
axes[0].set_xlabel('Embedding Dimension')
axes[0].set_ylabel('Position')
axes[0].set_title(f'Positional Embeddings ({MAX_SEQ_LEN} positions x {D_MODEL} dims)')

# Plot 2: Cosine similarity between positions
pos_sample = torch.arange(0, 32)
pos_sample_embeds = pos_embedding(pos_sample)
n_pos = len(pos_sample)
pos_sim = torch.zeros(n_pos, n_pos)
for i in range(n_pos):
    for j in range(n_pos):
        pos_sim[i, j] = cosine_similarity(
            pos_sample_embeds[i].unsqueeze(0),
            pos_sample_embeds[j].unsqueeze(0)
        )

im = axes[1].imshow(pos_sim.detach().numpy(), cmap='RdBu_r', vmin=-1, vmax=1)
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Position')
axes[1].set_title('Position Similarity (before training)')
plt.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()

### 4.4 Combining Token and Positional Embeddings

In [None]:
# Full input embedding pipeline
def embed_input(text, token_emb, pos_emb, tokenizer):
    """Convert raw text to embedded input vectors."""
    # Step 1: Tokenize
    token_ids = torch.tensor([tokenizer.encode(text)])  # (1, T)
    T = token_ids.shape[1]

    # Step 2: Token embeddings
    tok_emb = token_emb(token_ids)        # (1, T, d_model)

    # Step 3: Positional embeddings
    positions = torch.arange(T)           # (T,)
    p_emb = pos_emb(positions)            # (T, d_model)

    # Step 4: Add them together
    x = tok_emb + p_emb                   # (1, T, d_model)

    return x, token_ids

# Test it
text = "Hello"
x, ids = embed_input(text, token_embedding, pos_embedding, tokenizer)
print(f"Input text: '{text}'")
print(f"Token IDs: {ids[0].tolist()}")
print(f"Output shape: {x.shape}")
print(f"  Batch size: {x.shape[0]}")
print(f"  Sequence length: {x.shape[1]}")
print(f"  Embedding dim: {x.shape[2]}")

In [None]:
# Visualize the full embedding pipeline
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

text = "The cat sat"
x, ids = embed_input(text, token_embedding, pos_embedding, tokenizer)

tok_only = token_embedding(ids).detach().numpy()[0]
pos_only = pos_embedding(torch.arange(ids.shape[1])).detach().numpy()
combined = x.detach().numpy()[0]

for ax, data, title in zip(axes,
    [tok_only, pos_only, combined],
    ['Token Embeddings Only', 'Positional Embeddings Only', 'Combined (Token + Position)']):
    im = ax.imshow(data, aspect='auto', cmap='RdBu_r')
    ax.set_xlabel('Embedding Dimension')
    ax.set_ylabel('Token Position')
    ax.set_yticks(range(len(text)))
    ax.set_yticklabels(list(text), fontfamily='monospace')
    ax.set_title(title)
    plt.colorbar(im, ax=ax)

plt.suptitle(f'Embedding Pipeline for "{text}"', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 5. Your Turn

### TODO 1: Implement the Full Embedding Module

In [None]:
class GPTEmbedding(nn.Module):
    """
    Complete GPT input embedding module.

    This module takes raw token IDs and produces the combined
    token + positional embedding vectors that serve as input
    to the Transformer blocks.

    Args:
        vocab_size: number of tokens in the vocabulary
        d_model: embedding dimension
        max_seq_len: maximum sequence length

    Forward:
        Input: token_ids of shape (batch_size, seq_len)
        Output: embeddings of shape (batch_size, seq_len, d_model)
    """
    def __init__(self, vocab_size, d_model, max_seq_len):
        super().__init__()
        # ============ TODO ============
        # Step 1: Create the token embedding table (nn.Embedding)
        # Step 2: Create the positional embedding table (nn.Embedding)
        # ==============================
        self.token_emb = None   # YOUR CODE HERE
        self.pos_emb = None     # YOUR CODE HERE

    def forward(self, token_ids):
        """
        Args:
            token_ids: (batch_size, seq_len) tensor of token indices

        Returns:
            (batch_size, seq_len, d_model) tensor of embeddings
        """
        # ============ TODO ============
        # Step 1: Get the sequence length T from token_ids
        # Step 2: Look up token embeddings: (B, T, d_model)
        # Step 3: Create position indices 0, 1, ..., T-1
        # Step 4: Look up positional embeddings: (T, d_model)
        # Step 5: Add token + positional embeddings
        # ==============================
        result = None  # YOUR CODE HERE
        return result

In [None]:
# Verification
emb_module = GPTEmbedding(vocab_size=256, d_model=64, max_seq_len=128)
test_ids = torch.tensor([[72, 101, 108, 108, 111]])  # "Hello"
output = emb_module(test_ids)

assert output is not None, "Forward returned None -- did you forget to return the result?"
assert output.shape == (1, 5, 64), f"Expected shape (1, 5, 64), got {output.shape}"
assert isinstance(emb_module.token_emb, nn.Embedding), "token_emb should be nn.Embedding"
assert isinstance(emb_module.pos_emb, nn.Embedding), "pos_emb should be nn.Embedding"
assert emb_module.token_emb.num_embeddings == 256, "token_emb should have 256 entries"
assert emb_module.pos_emb.num_embeddings == 128, "pos_emb should have 128 entries"
print("All assertions passed! Your embedding module is correct.")

### TODO 2: Explore How Position Changes the Representation

In [None]:
def position_experiment(embedding_module, token_id, num_positions=20):
    """
    Investigate how positional embeddings change the representation
    of the SAME token at different positions.

    Place the given token_id at positions 0 through num_positions-1,
    compute the cosine similarity between each pair, and visualize
    the result.

    Args:
        embedding_module: your GPTEmbedding instance
        token_id: integer token ID to test
        num_positions: number of positions to test

    Returns:
        similarity_matrix: (num_positions, num_positions) numpy array

    Steps:
        1. Build a (num_positions, num_positions) batch of token IDs
           in which row i holds token_id at index i and 0 everywhere
           else.
        2. Get embeddings from the module.
        3. From row i, extract the embedding at index i.
        4. Compute cosine similarity between all pairs.
        5. Plot the similarity matrix as a heatmap.
    """
    # ============ TODO ============
    # Hint: After embedding the batch you have a tensor of shape
    # (num_positions, num_positions, d_model). The vectors you need are
    # the "diagonal" entries [i, i, :], one per row.
    # ==============================
    similarity_matrix = None  # YOUR CODE HERE

    # Plot
    # YOUR CODE HERE

    return similarity_matrix

## 6. Putting It All Together

Let us combine our tokenizer and embedding module into a complete input pipeline and test it on real text.

In [None]:
class GPTInputPipeline:
    """Complete input pipeline: text -> embedded vectors."""

    def __init__(self, d_model=64, max_seq_len=128):
        self.tokenizer = CharTokenizer()
        self.embedding = GPTEmbedding(
            vocab_size=self.tokenizer.vocab_size,
            d_model=d_model,
            max_seq_len=max_seq_len
        )
        self.max_seq_len = max_seq_len

    def __call__(self, text):
        """Convert text to embedded vectors."""
        ids = self.tokenizer.encode(text)
        if len(ids) > self.max_seq_len:
            ids = ids[:self.max_seq_len]
        token_ids = torch.tensor([ids])
        return self.embedding(token_ids)

# Test the full pipeline
pipeline = GPTInputPipeline(d_model=64, max_seq_len=128)
text = "GPT predicts the next token."
output = pipeline(text)
print(f"Input: '{text}'")
print(f"Output shape: {output.shape}")
print(f"  -> {output.shape[0]} batch, {output.shape[1]} tokens, {output.shape[2]} dimensions")
print(f"\nThe model now has a {output.shape[2]}-dimensional vector for each character,")
print(f"encoding both WHAT the character is and WHERE it appears.")

## 7. Training and Results

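In a full GPT, the embedding module is not trained in isolation; it is trained jointly with the rest of the model. We can still preview what *trained* embeddings look like by fitting a simple next-character predictor: the embedding module followed by a single linear layer. Because this model has no attention, each prediction depends only on the current character and its position.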
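
The training cell that follows compares its loss against a "random baseline" of `np.log(256)`. That number is the cross-entropy of a model that assigns equal probability to all 256 tokens. The short check below derives it directly; nothing in it depends on the trained model.

```python
import math

baseline_vocab_size = 256

# A uniform predictor gives every token probability 1 / vocab_size.
uniform_prob = 1.0 / baseline_vocab_size

# Cross-entropy for one target is -log(probability of the correct token).
uniform_loss = -math.log(uniform_prob)

print(f"Uniform-guess loss: {uniform_loss:.4f} nats")
print(f"Equivalent in bits: {uniform_loss / math.log(2):.1f}")
```

A training loss well below roughly 5.55 nats means the model has picked up something about character statistics rather than guessing uniformly.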

In [None]:
# Quick demo: train embeddings on a small text
import torch.nn.functional as F

# Training data
train_text = """To be or not to be that is the question whether tis nobler
in the mind to suffer the slings and arrows of outrageous fortune or to
take arms against a sea of troubles and by opposing end them to die to
sleep no more and by a sleep to say we end the heartache and the thousand
natural shocks that flesh is heir to""" * 10

# Simple model: embedding + linear head
class SimpleCharPredictor(nn.Module):
    def __init__(self, vocab_size=256, d_model=64, max_seq_len=128):
        super().__init__()
        self.emb = GPTEmbedding(vocab_size, d_model, max_seq_len)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h = self.emb(x)
        return self.head(h)

# Train
model = SimpleCharPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Prepare data
tokenizer = CharTokenizer()
data = torch.tensor(tokenizer.encode(train_text))
seq_len = 32

losses = []
for step in range(500):
    # Random batch
    idx = torch.randint(0, len(data) - seq_len - 1, (16,))
    x = torch.stack([data[i:i+seq_len] for i in idx])
    y = torch.stack([data[i+1:i+seq_len+1] for i in idx])

    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

print(f"\nFinal loss: {losses[-1]:.4f}")
print(f"Random baseline: {np.log(256):.4f}")

In [None]:
# Visualize trained vs untrained embeddings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

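# Trained embeddings: similarity between characters
# Note: digits and punctuation never occur in train_text, so their embedding
# rows receive no gradient and keep their random initial values.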
chars_to_check = list('abcdefghijklmnopqrstuvwxyz .!?0123456789')
char_ids = torch.tensor([tokenizer.char_to_id[c] for c in chars_to_check])
trained_embeds = model.emb.token_emb(char_ids)

n = len(chars_to_check)
sim = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        sim[i, j] = cosine_similarity(
            trained_embeds[i].unsqueeze(0).detach(),
            trained_embeds[j].unsqueeze(0).detach()
        )

im = axes[0].imshow(sim.numpy(), cmap='RdBu_r', vmin=-1, vmax=1)
axes[0].set_title('Character Similarity (After Training)')
axes[0].set_xticks(range(0, n, 3))
axes[0].set_xticklabels([chars_to_check[i] for i in range(0, n, 3)], fontfamily='monospace')
axes[0].set_yticks(range(0, n, 3))
axes[0].set_yticklabels([chars_to_check[i] for i in range(0, n, 3)], fontfamily='monospace')
plt.colorbar(im, ax=axes[0])

# Training loss curve
axes[1].plot(losses, alpha=0.3, color='blue')
# Smoothed
window = 20
smoothed = [np.mean(losses[max(0,i-window):i+1]) for i in range(len(losses))]
axes[1].plot(smoothed, color='blue', linewidth=2, label='Smoothed loss')
axes[1].axhline(y=np.log(256), color='red', linestyle='--', label=f'Random baseline ({np.log(256):.2f})')
axes[1].set_xlabel('Training Step')
axes[1].set_ylabel('Cross-Entropy Loss')
axes[1].set_title('Training Loss: Embeddings Learning Character Patterns')
axes[1].legend()

plt.tight_layout()
plt.show()

## 8. Final Output

In [None]:
# Show the learned structure in the embeddings
print("=" * 60)
print("  TRAINED EMBEDDING ANALYSIS")
print("=" * 60)

# Find most similar character pairs
pairs = []
for i in range(n):
    for j in range(i+1, n):
        pairs.append((chars_to_check[i], chars_to_check[j], sim[i, j].item()))

pairs.sort(key=lambda x: x[2], reverse=True)

print("\nMost similar character pairs (after training):")
for c1, c2, s in pairs[:10]:
    print(f"  '{c1}' <-> '{c2}': similarity = {s:.3f}")

print("\nLeast similar character pairs:")
for c1, c2, s in pairs[-5:]:
    print(f"  '{c1}' <-> '{c2}': similarity = {s:.3f}")

# Show embedding norms
norms = trained_embeds.detach().norm(dim=1)
print(f"\nEmbedding norms (mean: {norms.mean():.3f}, std: {norms.std():.3f})")
print(f"Largest norm: '{chars_to_check[norms.argmax()]}' = {norms.max():.3f}")
print(f"Smallest norm: '{chars_to_check[norms.argmin()]}' = {norms.min():.3f}")

print("\nYou have built the input pipeline of a GPT model from scratch!")
print("Token embeddings + positional embeddings = the model's view of language.")

## 9. Reflection and Next Steps

### Reflection Questions
1. Why does GPT add positional embeddings rather than concatenating them to the token embeddings? What would change if we concatenated instead?
2. After training on enough text, vowels often end up close together in embedding space. Why might this happen, given that the training objective is next-character prediction?
3. GPT-2 uses a maximum sequence length of 1024 tokens. What happens if you try to process a sequence of 1025 tokens? How would you handle very long documents?

### Optional Challenges
1. Replace the character-level tokenizer with a simple word-level tokenizer. How does this change the embedding table size and the quality of learned representations?
2. Implement sinusoidal positional embeddings (as in the original Transformer paper) instead of learned ones. Compare the positional similarity matrices -- which shows more structure before training?
3. Visualize the embeddings using t-SNE or PCA after training on a larger dataset. Do meaningful clusters emerge?