# Chapter 2 – Working with Text Data & Embeddings
## Based on *Build a Large Language Model (From Scratch)* by Sebastian Raschka

This notebook walks through the core ideas in Chapter 2: tokenization, vocabulary building, the sliding-window dataset, and token embeddings. Each major section is preceded by a personal explanation of **why** the step matters, not just for understanding LLMs but also for building robust agentic systems that rely on them.

---
## 📌 Personal Explanation 1 – Why Tokenization is the Foundation

Neural networks operate on numbers, not words. Tokenization is the bridge that converts raw text into a sequence of integer IDs that the model can process. The choice of tokenization strategy has cascading effects on everything downstream:

- **Vocabulary size** directly controls the size of the embedding matrix and the output projection layer. Larger vocabularies mean more parameters but better coverage of rare words.
- **Sub-word tokenization** (BPE, used by GPT) balances coverage and compactness: common words get a single token, while rare words are split into smaller sub-word units. Because any string can be broken down this way, the model never has to fall back to an `<UNK>` token for text outside the training distribution.
- **For agentic systems**, tokenization determines *how* tool names, JSON keys, code identifiers, and natural-language instructions are chunked. Poor tokenization of structured data (e.g., splitting a UUID mid-token) can confuse the model and cause subtle reasoning errors.

In short: garbage tokenization → garbage inputs → unreliable outputs, no matter how powerful the model architecture is.
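To make the vocabulary-size point concrete, here is a minimal sketch of how many weights an embedding matrix holds. The helper name `embedding_params` is hypothetical, and `D_MODEL = 768` is an assumption (the hidden size of the smallest GPT-2 model); the two vocabulary sizes are the ones that appear in this notebook.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a (vocab_size x d_model) embedding matrix."""
    return vocab_size * d_model


D_MODEL = 768  # assumption: hidden size of the smallest GPT-2 model

for name, vocab_size in [("toy word-level vocab", 1_132), ("GPT-2 BPE vocab", 50_257)]:
    print(f"{name:<22} {embedding_params(vocab_size, D_MODEL):>12,} parameters")
```

An output projection layer that does not share weights with the embedding has the same shape transposed, so it would add the same number of parameters again.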

In [1]:
pip install tiktoken torch

Note: you may need to restart the kernel to use updated packages.





In [2]:
# ── Standard imports ──────────────────────────────────────────────
import re
import importlib.util  # find_spec lives in the `util` submodule, which must be imported explicitly
import sys

# ── Optional heavy dependencies ───────────────────────────────────
# If tiktoken / torch are available we use them; otherwise we fall back to our
# hand-rolled implementations so the notebook can still be read end-to-end.
HAS_TIKTOKEN = importlib.util.find_spec('tiktoken') is not None
HAS_TORCH    = importlib.util.find_spec('torch')    is not None

print(f'tiktoken available : {HAS_TIKTOKEN}')
print(f'torch    available : {HAS_TORCH}')

if HAS_TIKTOKEN:
    import tiktoken
if HAS_TORCH:
    import torch
    import torch.nn as nn

tiktoken available : True
torch    available : True


---
## 2.1 – Loading the raw text corpus

In [3]:
# Adjust the path to wherever you placed the file
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

print(f'Total characters in corpus : {len(raw_text):,}')
print('--- First 200 characters ---')
print(raw_text[:200])

Total characters in corpus : 20,479
--- First 200 characters ---
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a


---
## 2.2 – Simple regex-based tokenizer (from the book)

In [4]:
# Split on whitespace AND common punctuation so both become separate tokens.
# The capturing group keeps the delimiters themselves.
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(f'Number of tokens : {len(preprocessed):,}')
print(preprocessed[:30])

Number of tokens : 4,690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [5]:
# Build the vocabulary: sorted unique tokens + two special tokens
all_tokens = sorted(set(preprocessed))
all_tokens.extend(['<|endoftext|>', '<|unk|>'])
vocab_size = len(all_tokens)
print(f'Vocabulary size : {vocab_size}')

# String → int and int → string mappings
str_to_int = {tok: idx for idx, tok in enumerate(all_tokens)}
int_to_str = {idx: tok for tok, idx in str_to_int.items()}

# Peek at a slice of the vocab
print(dict(list(str_to_int.items())[:10]))

Vocabulary size : 1132
{'!': 0, '"': 1, "'": 2, '(': 3, ')': 4, ',': 5, '--': 6, '.': 7, ':': 8, ';': 9}


In [6]:
class SimpleTokenizerV2:
    """Word-and-punctuation tokenizer with <|unk|> for OOV words."""

    def __init__(self, vocab: dict):
        self.str_to_int = vocab
        self.int_to_str = {v: k for k, v in vocab.items()}

    def encode(self, text: str) -> list[int]:
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        tokens = [t if t in self.str_to_int else '<|unk|>' for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids: list[int]) -> str:
        text = ' '.join(self.int_to_str[i] for i in ids)
        # Remove spaces before punctuation (cosmetic)
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)


tokenizer = SimpleTokenizerV2(str_to_int)

sample = 'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'
ids = tokenizer.encode(sample)
print('Encoded:', ids)
print('Decoded:', tokenizer.decode(ids))

Encoded: [1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
Decoded: <|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


---
## 2.3 – Byte-Pair Encoding (BPE) with `tiktoken`

The simple tokenizer above is instructive but fragile: any word that wasn't in the training text collapses to `<|unk|>`. GPT-2/3/4 use **Byte-Pair Encoding (BPE)**, which builds a vocabulary of sub-word units by repeatedly merging the most frequent pair of adjacent symbols, starting from individual bytes. The result: every possible string can be encoded (worst-case as individual bytes), and the vocabulary stays compact.
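The merge loop behind BPE can be sketched in a few lines of pure Python. This is a toy illustration under simplifying assumptions, not the GPT-2 tokenizer: it skips the regex pre-splitting and the fixed merge table that `tiktoken` ships with, and the function names (`train_toy_bpe`, `decode_toy_bpe`, and the helpers) are made up for this example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (IDs 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step          # new symbols get IDs above the byte range
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode_toy_bpe(ids, merges):
    """Expand merged symbols back into bytes and decode them as UTF-8."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    stack, out = list(reversed(ids)), []
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.extend([right, left])   # left half is popped first
        else:
            out.append(tok)
    return bytes(out).decode("utf-8")


text = "low lower lowest low low"
ids, merges = train_toy_bpe(text, num_merges=5)
print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} tokens after {len(merges)} merges")
print("round-trip ok:", decode_toy_bpe(ids, merges) == text)
```

Because the starting alphabet is all 256 byte values, any input string can be encoded even if no merge applies to it, which is the property the paragraph above relies on.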

In [7]:
if HAS_TIKTOKEN:
    tokenizer_bpe = tiktoken.get_encoding('gpt2')

    integers = tokenizer_bpe.encode(raw_text, allowed_special={'<|endoftext|>'})
    print(f'BPE token count  : {len(integers):,}')
    print(f'BPE vocab size   : {tokenizer_bpe.n_vocab:,}')

    # Round-trip test
    decoded = tokenizer_bpe.decode(integers)
    assert decoded == raw_text, 'Round-trip failed!'
    print('Round-trip encode→decode : ✓')
else:
    print('tiktoken not installed; skipping BPE section.')
    print('Install with:  pip install tiktoken')
    # Provide integer list from simple tokenizer so rest of notebook still runs
    integers = tokenizer.encode(raw_text)

BPE token count  : 5,145
BPE vocab size   : 50,257
Round-trip encode→decode : ✓


---
## 📌 Personal Explanation 2 – The Sliding-Window Dataset

Language models are trained to predict the **next token** given a context window of preceding tokens. The sliding-window (or *stride*) approach is how we manufacture (input, target) pairs from a single long document:

```
tokens : [t0 t1 t2 t3 t4 t5 t6 t7 ...]

window 1 →  input=[t0..t3]   target=[t1..t4]   (stride=1)
window 2 →  input=[t1..t4]   target=[t2..t5]
...
```

**Why overlap (stride < max_length) is useful:**  
When stride equals `max_length`, each token appears in exactly one input window. With a smaller stride, the same token appears in several windows at different offsets, so the model is asked to predict it from contexts of varying length. This acts like implicit data augmentation, although the overlapping samples are highly correlated rather than genuinely new data.

**For agentic systems**, context windows behave like a special case of this sliding window: only the last `N` tokens of an agent's scratchpad are visible at inference time. Understanding stride helps reason about *information retention* across long agent trajectories.
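A small sketch makes the effect of stride on token coverage visible. The helper `window_coverage` is hypothetical; it counts how many input windows contain each token position, using the same `range(0, n - max_length, stride)` loop as the dataset code in this notebook.

```python
from collections import Counter


def window_coverage(num_tokens: int, max_length: int, stride: int) -> list[int]:
    """For each token position, count the input windows that contain it."""
    counts = Counter()
    for start in range(0, num_tokens - max_length, stride):
        for pos in range(start, start + max_length):
            counts[pos] += 1
    return [counts[pos] for pos in range(num_tokens)]


print("stride=4:", window_coverage(12, max_length=4, stride=4))
# → stride=4: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print("stride=1:", window_coverage(12, max_length=4, stride=1))
# → stride=1: [1, 2, 3, 4, 4, 4, 4, 4, 3, 2, 1, 0]
```

With `stride == max_length` every covered token is seen once; with `stride=1` interior tokens are seen `max_length` times. Tokens near the end of the sequence are covered less often (or only as targets) because no full window starts there.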

In [8]:
# ── Sliding-window data generation (pure Python, no torch needed) ─────────────

def create_dataloader_samples(token_ids: list[int],
                               max_length: int,
                               stride: int) -> tuple[list, list]:
    """Returns (input_chunks, target_chunks) lists of token-id lists."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return inputs, targets


# Quick test with the simple tokenizer ids
sample_ids = tokenizer.encode(raw_text)
print(f'Total tokens available : {len(sample_ids):,}')

for ml, st in [(4, 1), (4, 2), (4, 4)]:
    inp, tgt = create_dataloader_samples(sample_ids, max_length=ml, stride=st)
    print(f'  max_length={ml}, stride={st}  →  {len(inp):,} samples')

# Show a concrete example
inp, tgt = create_dataloader_samples(sample_ids, max_length=4, stride=1)
print('\nFirst sample:')
print('  input  :', [int_to_str[i] for i in inp[0]])
print('  target :', [int_to_str[i] for i in tgt[0]])

Total tokens available : 4,690
  max_length=4, stride=1  →  4,686 samples
  max_length=4, stride=2  →  2,343 samples
  max_length=4, stride=4  →  1,172 samples

First sample:
  input  : ['I', 'HAD', 'always', 'thought']
  target : ['HAD', 'always', 'thought', 'Jack']


---
## 🧪 Experiment – Effect of `max_length` and `stride` on Sample Count

Let's systematically vary both parameters and record the number of samples produced.

In [None]:
# ── Experiment ────────────────────────────────────────────────────
configs = [
    # (max_length, stride, description)
    (32,  32, 'No overlap (stride = window)'),
    (32,  16, '50% overlap'),
    (32,   8, '75% overlap'),
    (32,   1, 'Maximum overlap (stride = 1)'),
    (64,  64, 'Larger window, no overlap'),
    (64,  32, 'Larger window, 50% overlap'),
    (128, 64, 'Even larger window, 50% overlap'),
]

print(f"{'Config':<40} {'Samples':>8}  Formula: ceil((N - max_length) / stride)")
print('-' * 60)
N = len(sample_ids)
for ml, st, desc in configs:
    inp, _ = create_dataloader_samples(sample_ids, max_length=ml, stride=st)
    expected = -(-(N - ml) // st)   # ceiling division: one window per start offset
    print(f'{desc:<40} {len(inp):>8}  (expected = {expected})')

print(f'\nTotal tokens in corpus: {N:,}')

Config                                    Samples  Formula: ceil((N - max_length) / stride)
------------------------------------------------------------
No overlap (stride = window)                  146  (expected = 146)
50% overlap                                   292  (expected = 292)
75% overlap                                   583  (expected = 583)
Maximum overlap (stride = 1)                 4658  (expected = 4658)
Larger window, no overlap                      73  (expected = 73)
Larger window, 50% overlap                    145  (expected = 145)
Even larger window, 50% overlap                72  (expected = 72)

Total tokens in corpus: 4,690


### Experiment Findings

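The number of samples equals the number of start offsets produced by `range(0, N - max_length, stride)`, which for $N > \text{max\_length}$ is:

$$\text{samples} = \left\lceil \frac{N - \text{max\_length}}{\text{stride}} \right\rceil$$

Using a floor instead of the ceiling undercounts by one whenever `stride` does not divide `N - max_length` evenly.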

Key takeaways:

1. **Smaller stride → proportionally more training samples** from the same corpus: the count scales roughly as `1/stride`, not exponentially. With `stride=1` on a 4,690-token corpus and `max_length=32`, we get 4,658 samples vs. 146 with `stride=32`. That's a **~32× increase** just from overlapping windows.

2. **Larger `max_length` → richer context per sample, but fewer samples at the same overlap fraction.** At a fixed stride the count barely changes (146 vs. 145 samples at `stride=32`), but keeping the same overlap means the stride grows with the window, and the count drops accordingly (146 vs. 73 samples without overlap). There is a genuine trade-off between the richness of context each sample provides and dataset size.

3. **Overlap is useful because** it lets the model see each token at many different offsets within the window. With `stride=1`, every token serves as the prediction target after contexts of 1, 2, ... up to `max_length` preceding tokens across different samples. With no overlap, each token occupies a single fixed offset in a single window.

4. For small corpora (like this ~20 KB story), aggressive overlap yields many more samples per epoch, but those samples are highly correlated and add no new text, so overlap can increase the risk of overfitting rather than reduce it. For massive internet-scale datasets, overlap matters less because data volume is not the bottleneck.
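The sample count can be cross-checked against the loop that generates the windows. The sketch below is a sanity check only, with hypothetical helper names; it compares the length of `range(0, N - max_length, stride)` with a ceiling-division closed form for the configurations used in the experiment.

```python
def count_by_loop(n_tokens: int, max_length: int, stride: int) -> int:
    """Count start offsets exactly as the sliding-window loop generates them."""
    return len(range(0, n_tokens - max_length, stride))


def count_by_formula(n_tokens: int, max_length: int, stride: int) -> int:
    """Closed form: ceiling division, clamped at zero for inputs shorter than a window."""
    return max(0, -(-(n_tokens - max_length) // stride))


N_TOKENS = 4690  # token count reported for the corpus in this notebook
configs = [(32, 32), (32, 16), (32, 8), (32, 1), (64, 64), (64, 32), (128, 64)]

for max_length, stride in configs:
    loop = count_by_loop(N_TOKENS, max_length, stride)
    formula = count_by_formula(N_TOKENS, max_length, stride)
    assert loop == formula
    print(f"max_length={max_length:<4} stride={stride:<3} samples={loop}")
```

Integer ceiling division (`-(-a // b)`) avoids floating-point rounding, which matters little here but keeps the check exact for large corpora.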

---
## 📌 Personal Explanation 3 – Token Embeddings vs. Position Embeddings

Two separate embedding tables are summed to produce the input to a transformer:

| Embedding type | What it encodes | Size |
|---|---|---|
| **Token embedding** | Semantic identity of the token | `vocab_size × d_model` |
| **Position embedding** | Position of the token in the sequence | `context_length × d_model` |

**Why do we need positional embeddings?**  
Self-attention on its own is *permutation-equivariant*: the attention score between token A and token B is the same whether A appears before or after B, so shuffling the input tokens merely shuffles the outputs in the same way. Without positional information the model cannot tell the difference between *"the dog bit the man"* and *"the man bit the dog"*. Adding a learned (or sinusoidal) position vector breaks this symmetry.

**For agentic systems** this matters in multi-turn memory: the position of a tool result earlier in the context affects how the model weights it relative to a more recent observation. Learned absolute position embeddings cannot represent positions beyond the trained context length, which is one reason many modern architectures use relative schemes such as **RoPE** (Rotary Position Embeddings) that tend to extend better to longer contexts.
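The order-blindness of attention can be demonstrated with a tiny pure-Python sketch. It is a deliberately simplified model, not a transformer layer: a single unmasked head with identity query/key/value projections, made-up 2-D embeddings, and a hypothetical `add_positions` helper that adds an index-dependent offset.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Unmasked single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(x[0])
    outputs = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return outputs


def add_positions(x):
    # Hypothetical position vectors: an offset that grows with the index
    return [[v[0] + 0.1 * i, v[1] - 0.1 * i] for i, v in enumerate(x)]


def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))


# Made-up 2-D token embeddings
emb = {"dog": [1.0, 0.0], "bit": [0.0, 1.0], "man": [0.5, 0.5]}
sent_a = ["dog", "bit", "man"]
sent_b = ["man", "bit", "dog"]   # same tokens, reversed order

plain_a = dict(zip(sent_a, self_attention([emb[t] for t in sent_a])))
plain_b = dict(zip(sent_b, self_attention([emb[t] for t in sent_b])))
pos_a = dict(zip(sent_a, self_attention(add_positions([emb[t] for t in sent_a]))))
pos_b = dict(zip(sent_b, self_attention(add_positions([emb[t] for t in sent_b]))))

for tok in sent_a:
    print(f"{tok}: same output without positions = {close(plain_a[tok], plain_b[tok])}, "
          f"with positions = {close(pos_a[tok], pos_b[tok])}")
```

Without position vectors each token gets the same output in both orderings, so the two sentences are indistinguishable to the layer; once positions are added, the outputs depend on where each token sits.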

In [11]:
# ── Token + Position Embedding (PyTorch version) ──────────────────
if HAS_TORCH:
    # Hyper-parameters (GPT-2 vocab and context length; reduced embedding width)
    vocab_size_emb = 50257   # GPT-2 BPE vocab
    output_dim     = 256     # embedding dimension (d_model)
    max_len        = 1024    # maximum sequence / context length
    batch_size     = 8
    seq_len        = 4

    token_embedding_layer = nn.Embedding(vocab_size_emb, output_dim)
    pos_embedding_layer   = nn.Embedding(max_len,        output_dim)

    # Simulate a batch of token id sequences
    token_ids_batch = torch.randint(0, vocab_size_emb, (batch_size, seq_len))
    positions       = torch.arange(seq_len).unsqueeze(0)  # shape (1, seq_len)

    tok_emb = token_embedding_layer(token_ids_batch)  # (B, T, d)
    pos_emb = pos_embedding_layer(positions)           # (1, T, d) → broadcasts

    x = tok_emb + pos_emb  # final input to the transformer

    print(f'token_ids_batch shape : {token_ids_batch.shape}')
    print(f'tok_emb shape         : {tok_emb.shape}')
    print(f'pos_emb shape         : {pos_emb.shape}')
    print(f'x (input to model)    : {x.shape}')
else:
    print('torch not installed; showing conceptual pseudo-code below.')
    print("""
    token_embedding_layer = Embedding(vocab_size=50257, d_model=256)
    pos_embedding_layer   = Embedding(max_len=1024,     d_model=256)

    # For a batch of shape (B=8, T=4):
    tok_emb = token_embedding_layer(token_ids)   # → (8, 4, 256)
    pos_emb = pos_embedding_layer(positions)      # → (1, 4, 256)  [broadcast]
    x = tok_emb + pos_emb                         # → (8, 4, 256)
    """)

token_ids_batch shape : torch.Size([8, 4])
tok_emb shape         : torch.Size([8, 4, 256])
pos_emb shape         : torch.Size([1, 4, 256])
x (input to model)    : torch.Size([8, 4, 256])


---
## 📌 Personal Explanation 4 – Why Do Embeddings Encode Meaning?

This is the central conceptual question of the chapter.

### The Short Answer
Embeddings encode meaning **not because we programmed them to**, but because meaning is what the model *needs* to compress in order to predict the next token correctly. During training, the gradient descent process pushes the embedding vectors for *contextually similar* tokens (words that appear in similar positions in similar sentences) to nearby regions of the embedding space.

### The Neural Network Connection
An `nn.Embedding(vocab_size, d_model)` layer is just a **look-up table**: a matrix `W` of shape `(vocab_size, d_model)`. When you embed token `i`, you retrieve row `W[i]`. There is nothing magical about initialization; all rows start as random vectors.

What makes them meaningful is the **loss function + backpropagation**:

1. The model predicts a distribution over the next token.
2. The cross-entropy loss measures how wrong the prediction is.
3. Backprop propagates gradients all the way back through the attention layers into `W`.
4. Rows corresponding to tokens that frequently occur in similar contexts receive similar gradient updates, so they converge in embedding space.

This is the distributional hypothesis from linguistics (*"a word is known by the company it keeps"*) implemented as gradient descent.
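The look-up-table view of the embedding layer can be verified without any deep-learning library. The sketch below uses toy sizes and a randomly initialised matrix (all values are illustrative) to show that selecting row `i` gives the same result as multiplying a one-hot vector by `W`.

```python
import random

random.seed(0)
VOCAB_SIZE, D_MODEL = 6, 3   # toy sizes chosen for illustration

# Randomly initialised embedding matrix W, one row per token ID
W = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]


def embed_lookup(token_id):
    """What an embedding layer does: return row `token_id` of W."""
    return W[token_id]


def embed_one_hot(token_id):
    """Equivalent view: a one-hot row vector multiplied by W."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]
    return [sum(one_hot[i] * W[i][j] for i in range(VOCAB_SIZE)) for j in range(D_MODEL)]


for token_id in range(VOCAB_SIZE):
    assert embed_lookup(token_id) == embed_one_hot(token_id)
print("lookup and one-hot matmul agree for every token ID")
```

Because only the selected row takes part in the forward pass for a given token, only that row receives a gradient from that token during backpropagation, which is how individual rows get shaped by the contexts their tokens appear in.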

### Why This Matters for Agentic Systems
An agentic LLM must understand user *intent* (a semantic concept), map that intent to tool invocations, and reason about the tool results. All of this relies on the geometric structure of learned embedding spaces. RAG retrieval applies the same principle: it is a nearest-neighbor search over text embeddings (typically produced by a dedicated embedding model rather than by the LLM's input embedding table), where semantic proximity is approximated by geometric proximity. The richer and more nuanced the embedding space, the better the agent can retrieve and reason.

In [14]:
# ── Illustrate embedding similarity (no torch needed) ─────────────
import math
import random

random.seed(42)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x**2 for x in a))
    mag_b = math.sqrt(sum(x**2 for x in b))
    return dot / (mag_a * mag_b + 1e-8)

# Suppose after training, the model has learned these 4-D embeddings:
# (in reality d_model=768 or 1024, but 4D is enough to illustrate)
embeddings = {
    'king'  : [ 0.9,  0.1,  0.8,  0.3],
    'queen' : [ 0.8,  0.9,  0.7,  0.2],  # similar to king
    'man'   : [ 0.5,  0.0,  0.6,  0.1],
    'woman' : [ 0.4,  0.6,  0.5,  0.0],  # similar to man
    'apple' : [-0.2, -0.1,  0.0,  0.9],  # very different
    'fruit' : [-0.3, -0.2,  0.1,  0.8],  # similar to apple
}

pairs = [
    ('king', 'queen'),
    ('man',  'woman'),
    ('king', 'apple'),
    ('apple', 'fruit'),
    ('king', 'man'),
]

print(f"{'Pair':<20} Cosine Similarity")
print('-' * 38)
for a, b in pairs:
    sim = cosine_similarity(embeddings[a], embeddings[b])
    print(f'  {a} ↔ {b:<12}   {sim:.3f}')

# Classic analogy: king - man + woman ≈ queen
v_king  = embeddings['king']
v_man   = embeddings['man']
v_woman = embeddings['woman']
analogy_vec = [k - m + w for k, m, w in zip(v_king, v_man, v_woman)]

print('\nAnalogy: king − man + woman → ?')
for word, vec in embeddings.items():
    sim = cosine_similarity(analogy_vec, vec)
    print(f'  sim({word:<6}) = {sim:.3f}')

Pair                 Cosine Similarity
--------------------------------------
  king ↔ queen          0.816
  man ↔ woman          0.724
  king ↔ apple          0.069
  apple ↔ fruit          0.977
  king ↔ man            0.979

Analogy: king − man + woman → ?
  sim(king  ) = 0.879
  sim(queen ) = 0.993
  sim(man   ) = 0.828
  sim(woman ) = 0.964
  sim(apple ) = -0.042
  sim(fruit ) = -0.132


---
## 2.6 – Putting It All Together: a `GPTDataset` class

The book combines everything into a PyTorch `Dataset` class. We provide both a PyTorch version (if available) and a pure-Python version.

In [15]:
if HAS_TORCH:
    from torch.utils.data import Dataset, DataLoader

    class GPTDatasetV1(Dataset):
        def __init__(self, txt: str, tokenizer, max_length: int, stride: int):
            self.input_ids  = []
            self.target_ids = []

            token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})

            for i in range(0, len(token_ids) - max_length, stride):
                self.input_ids.append(torch.tensor(token_ids[i          : i + max_length]))
                self.target_ids.append(torch.tensor(token_ids[i + 1     : i + max_length + 1]))

        def __len__(self):          return len(self.input_ids)
        def __getitem__(self, idx): return self.input_ids[idx], self.target_ids[idx]


    def create_dataloader_v1(txt, tokenizer, batch_size=4, max_length=256,
                             stride=128, shuffle=True, drop_last=True, num_workers=0):
        dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
        return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                          drop_last=drop_last, num_workers=num_workers)


    if HAS_TIKTOKEN:
        bpe = tiktoken.get_encoding('gpt2')
        loader = create_dataloader_v1(raw_text, bpe, batch_size=8,
                                      max_length=4, stride=4, shuffle=False)
        data_iter = iter(loader)
        inputs, targets = next(data_iter)
        print('inputs  shape:', inputs.shape)
        print('targets shape:', targets.shape)
        print('inputs :', inputs)
        print('targets:', targets)

        # Full embedding pipeline
        vocab_size_gpt2 = 50257
        output_dim      = 256
        context_length  = 4

        token_embedding_layer = torch.nn.Embedding(vocab_size_gpt2, output_dim)
        pos_embedding_layer   = torch.nn.Embedding(context_length,  output_dim)

        tok_emb = token_embedding_layer(inputs)
        pos_emb = pos_embedding_layer(torch.arange(context_length))
        input_embeddings = tok_emb + pos_emb

        print('\nFinal embedding shape (batch, seq, d_model):', input_embeddings.shape)
    else:
        print('tiktoken not available; skipping GPTDataset demo. Install with: pip install tiktoken')
else:
    print('torch not available; showing the pure-Python fallback only.')
    print('Install with: pip install torch')
    print()
    # Pure-Python fallback to still show the concept
    class GPTDatasetV1_Pure:
        def __init__(self, token_ids, max_length, stride):
            self.input_ids, self.target_ids = [], []
            for i in range(0, len(token_ids) - max_length, stride):
                self.input_ids.append(token_ids[i : i + max_length])
                self.target_ids.append(token_ids[i + 1 : i + max_length + 1])
        def __len__(self): return len(self.input_ids)
        def __getitem__(self, idx): return self.input_ids[idx], self.target_ids[idx]

    ids = tokenizer.encode(raw_text)
    ds  = GPTDatasetV1_Pure(ids, max_length=4, stride=2)
    print(f'Dataset size: {len(ds)} samples')
    inp, tgt = ds[0]
    print('input  tokens:', [int_to_str[i] for i in inp])
    print('target tokens:', [int_to_str[i] for i in tgt])

inputs  shape: torch.Size([8, 4])
targets shape: torch.Size([8, 4])
inputs : tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
targets: tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

Final embedding shape (batch, seq, d_model): torch.Size([8, 4, 256])


---
## Summary

| Step | What we did | Why it matters |
|---|---|---|
| **Tokenization** | Split text → integer IDs | Neural nets require numeric input; choice of tokenizer affects model capacity and robustness |
| **Vocabulary** | Build str↔int mappings + special tokens | Defines the input/output space of the model |
| **Sliding window** | Create (input, target) pairs with configurable `max_length` / `stride` | Training signal for next-token prediction; overlap = data augmentation |
| **Token embedding** | `nn.Embedding` look-up table trained by backprop | Converts discrete tokens to dense vectors; geometric proximity = semantic similarity |
| **Position embedding** | Separate learned table summed with token emb | Breaks permutation invariance of attention; lets model track token order |

The output of this pipeline, a tensor of shape `(batch, seq_len, d_model)`, is the input to the transformer's attention layers in Chapter 3.