# Chapter 2 — Embeddings and Tokenization

In this notebook I reproduce and experiment with the core ideas from Chapter 2 of *Build a Large Language Model (From Scratch)* by Sebastian Raschka.

This chapter is fundamental because it explains how raw text is transformed into numerical representations (embeddings), which are the foundation of Large Language Models (LLMs).

LLMs do not understand words directly — they operate on vectors. Therefore, understanding tokenization and embeddings is critical for understanding how transformers, attention mechanisms, and agentic systems work.

In [None]:
%pip install torch tiktoken

In [2]:
import torch
import torch.nn as nn
import tiktoken


In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:200])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a


## Why Tokenization Matters

Neural networks cannot process raw text. They require numerical inputs.

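Tokenization converts text into discrete units (tokens). Depending on the scheme, a token can be:
- A word
- A subword
- A character
- A byte or short byte sequence (as in byte-level BPE)

Modern LLMs use subword tokenization such as byte pair encoding (BPE) because:
- It keeps the vocabulary much smaller than a word-level vocabulary would be
- It handles unknown words by splitting them into known subword units, so no special unknown-word token is needed
- It often captures morphological structure, since frequent prefixes, stems, and suffixes tend to become tokens of their own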

Tokenization is the bridge between language and neural computation.
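
To make the "handles unknown words" point concrete, here is a minimal character-level sketch of how BPE learns merges and then segments a word it has never seen. It is a toy under stated assumptions: the corpus `toy_counts` and the helper names (`most_frequent_pair`, `merge_pair`, `learn_merges`, `segment`) are hypothetical, and the real GPT-2 tokenizer works on bytes with a much larger, pre-trained merge table.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return max(pair_counts, key=pair_counts.get) if pair_counts else None


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        new_words = {}
        for symbols, freq in words.items():
            key = merge_pair(symbols, pair)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus (word -> count); "lowest" never appears in it
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_merges(toy_counts, num_merges=4)

print(toy_merges)
print(segment("lowest", toy_merges))  # ['low', 'est']
```

The unseen word "lowest" is still representable because it decomposes into pieces learned from other words. In the worst case a word falls back to single characters, which is why no unknown-word token is required.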

In [4]:
tokenizer = tiktoken.get_encoding("gpt2")

encoded_text = tokenizer.encode(raw_text)

print("Total tokens:", len(encoded_text))
print(encoded_text[:20])

Total tokens: 5145
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]


In [5]:
decoded_text = tokenizer.decode(encoded_text[:50])
print(decoded_text)

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow,


## Why Embeddings Encode Meaning

An embedding is a dense vector representation of a token.

Instead of representing words as one-hot vectors (which are sparse, and in which every pair of distinct tokens is equally far apart, so distance says nothing about similarity),
embeddings map each token to a dense vector in continuous space.

Why does this encode meaning?

Because during training:
- Tokens that appear in similar contexts receive similar gradient updates.
- The neural network adjusts their vector positions to minimize prediction error.

As a result:
- Words with similar meanings tend to end up near each other in vector space.
- Some relationships show up as approximately linear directions in embedding space.

This is deeply related to core neural network concepts:
- Parameters (weights) are learned via backpropagation
- Embeddings are simply trainable weight matrices
- Each row of the embedding matrix corresponds to a token vector

So embeddings are not magic — they are learned parameters optimized to improve next-token prediction.
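
The claim that an embedding layer is just a trainable weight matrix can be checked directly. The sketch below uses a deliberately tiny layer (the sizes `toy_vocab_size` and `toy_dim` and the token ids are arbitrary choices for illustration) and compares three ways of getting the same vectors: calling the layer, indexing rows of its weight matrix, and multiplying one-hot vectors by that matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)  # arbitrary seed, only so repeated runs match

toy_vocab_size, toy_dim = 6, 3  # hypothetical toy sizes
toy_embedding = nn.Embedding(toy_vocab_size, toy_dim)

token_ids = torch.tensor([2, 5, 2])  # arbitrary example ids

# 1) Normal usage: call the layer on token ids
looked_up = toy_embedding(token_ids)

# 2) Same thing as picking rows out of the weight matrix
rows = toy_embedding.weight[token_ids]

# 3) Same thing as one-hot vectors multiplied by the weight matrix
one_hot = nn.functional.one_hot(token_ids, num_classes=toy_vocab_size).float()
via_matmul = one_hot @ toy_embedding.weight

print(looked_up.shape)                        # torch.Size([3, 3])
print(torch.equal(looked_up, rows))           # True
print(torch.allclose(looked_up, via_matmul))  # True
print(toy_embedding.weight.requires_grad)     # True: updated by backpropagation
```

The lookup is an efficient shortcut for the one-hot matrix product, which is why the layer can be trained like any other linear weight.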

In [6]:
max_length = 4
stride = 1

input_ids = []
target_ids = []

for i in range(0, len(encoded_text) - max_length, stride):
    input_chunk = encoded_text[i:i+max_length]
    target_chunk = encoded_text[i+1:i+max_length+1]
    
    input_ids.append(input_chunk)
    target_ids.append(target_chunk)

print("Number of samples:", len(input_ids))
print("Example input:", input_ids[0])
print("Example target:", target_ids[0])

Number of samples: 5141
Example input: [40, 367, 2885, 1464]
Example target: [367, 2885, 1464, 1807]


## Why Sliding Windows Matter for LLMs

LLMs are trained to predict the next token given previous tokens.

To do this, we create training samples using sliding windows:
- Input: sequence of tokens
- Target: same sequence shifted by 1 position

The stride controls how much we shift the window each time.

Small stride → more overlap → more training samples  
Large stride → fewer samples → less redundancy

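Overlap can be useful because:
- It increases the number of training samples drawn from the same text
- It lets the model see each token at several positions within the context window

Overlapping samples are highly correlated, though, so they add little new information and can encourage overfitting. The trade-off matters most for small datasets like this one.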
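
The effect of stride is easier to see on a sequence short enough to print in full. This sketch reuses the same loop bounds as the cell above; the helper name `sliding_windows` and the ten `toy_ids` are made up for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # ten hypothetical token ids: 10, 11, ..., 19

for toy_stride in (1, 2, 4):
    toy_pairs = sliding_windows(toy_ids, max_length=4, stride=toy_stride)
    print(f"stride={toy_stride}: {len(toy_pairs)} samples")
    for input_chunk, target_chunk in toy_pairs:
        print("  ", input_chunk, "->", target_chunk)
```

With `stride=1` consecutive inputs share three of their four tokens. With `stride=4`, equal to `max_length`, the inputs no longer overlap at all.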

In [7]:
inputs = torch.tensor(input_ids)
targets = torch.tensor(target_ids)

print(inputs.shape)
print(targets.shape)

torch.Size([5141, 4])
torch.Size([5141, 4])


In [8]:
vocab_size = tokenizer.n_vocab
embedding_dim = 256

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

embedded = embedding_layer(inputs)

print("Embedding output shape:", embedded.shape)

Embedding output shape: torch.Size([5141, 4, 256])


## Why Embeddings Are Critical for Agentic Systems

Agentic systems rely on:
- Memory
- Retrieval (RAG)
- Semantic similarity
- Planning via vector comparisons

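All of these depend on embeddings, usually embeddings of whole passages produced by a trained model rather than the raw token embedding layer built above.

For example:
- Vector databases store embeddings
- Similarity search typically ranks candidates by cosine similarity or a related distance measure
- Agents retrieve relevant context based on vector distance

Without embeddings:
- No semantic search
- No similarity-based contextual memory
- No way to compare pieces of text by meaning rather than by exact wording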

Embeddings transform language into geometry.
And geometry is computable.
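
As a rough illustration of retrieval by vector distance, the sketch below ranks three stored vectors against a query using cosine similarity. Everything here is hypothetical: the document names and the 3-dimensional vectors are hand-written, whereas a real system would get its vectors from a trained embedding model and would usually search through a vector database.

```python
import torch
import torch.nn.functional as F

# Hand-written toy "embeddings", purely for illustration
doc_names = ["cats", "dogs", "tax law"]
doc_vectors = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.0],
    [0.0, 0.1, 0.9],
])
query_vector = torch.tensor([1.0, 0.2, 0.0])

# Cosine similarity between the query and every stored vector
scores = F.cosine_similarity(query_vector.unsqueeze(0), doc_vectors, dim=1)

# Rank stored items from most to least similar
ranking = torch.argsort(scores, descending=True).tolist()
for idx in ranking:
    print(f"{doc_names[idx]:8s} {scores[idx].item():.3f}")

best_match = doc_names[ranking[0]]
print("Best match:", best_match)  # cats
```

Cosine similarity compares direction only and ignores vector length, which is one reason it is a common default for this kind of search.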

In [9]:
def count_samples(max_length, stride):
    # One sample per window start position; same range as the sliding-window loop above
    return len(range(0, len(encoded_text) - max_length, stride))

print("max_length=4, stride=1:", count_samples(4, 1))
print("max_length=4, stride=2:", count_samples(4, 2))
print("max_length=8, stride=1:", count_samples(8, 1))
print("max_length=8, stride=4:", count_samples(8, 4))

max_length=4, stride=1: 5141
max_length=4, stride=2: 2571
max_length=8, stride=1: 5137
max_length=8, stride=4: 1285


## Experiment Results and Interpretation

When stride is small:
- The number of samples grows roughly in proportion to 1/stride (5141 at stride=1 versus 2571 at stride=2 for max_length=4).
- Windows overlap heavily.
- The model sees similar contexts multiple times.

When stride increases:
- The number of samples decreases.
- There is less redundancy.
- Dataset creation is faster, but there is potentially less learning signal.

Changing max_length:
- At a fixed stride, a larger max_length reduces the number of samples only slightly (5141 versus 5137 at stride=1).
- But it increases the context per sample.
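
The counts above follow a simple closed form: the loop visits start positions 0, stride, 2·stride, ... below N − max_length, so the number of samples is ceil((N − max_length) / stride). The sketch below checks that formula against the range used in this notebook; `expected_samples` is a hypothetical helper name, and 5145 is the token count printed earlier.

```python
import math


def expected_samples(num_tokens, max_length, stride):
    """Closed-form count of sliding-window samples for the loop used above."""
    return max(0, math.ceil((num_tokens - max_length) / stride))


num_tokens = 5145  # total tokens reported earlier in this notebook

for max_length, stride in [(4, 1), (4, 2), (8, 1), (8, 4)]:
    by_formula = expected_samples(num_tokens, max_length, stride)
    by_range = len(range(0, num_tokens - max_length, stride))
    print(f"max_length={max_length}, stride={stride}: "
          f"formula={by_formula}, range={by_range}")
```

The formula makes the two effects explicit: stride divides the count, while max_length only subtracts a small constant from the numerator.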

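Why overlap can be useful:

On a small dataset, overlap can help the model:
- See each token at several positions within the context window
- Get more gradient updates from the same text

Overlap does not add new text, so it should not be expected to improve generalization on its own, and heavy overlap can encourage overfitting.

In real LLM training:
- Overlapping windows are optional rather than essential
- A common choice is a stride equal to max_length, which covers the text without overlapping inputs
- Large corpora supply enough distinct samples that the extra ones from overlap are rarely needed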