# Chapter 2: Working with Text Data and Embeddings

This notebook covers the essential steps to prepare text for a Large Language Model (LLM) as described in Chapter 2 of "Build a Large Language Model (From Scratch)". We go through tokenization, data sampling, and creating embeddings.

## 1. Setup and Data Loading

First, we load the raw text from "The Verdict" by Edith Wharton.

In [None]:
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total characters:", len(raw_text))
print(raw_text[:99])

## 2. Tokenization with BPE

### Why Does Tokenization Matter for LLMs?

LLMs cannot process raw strings directly; they require numerical input. Tokenization is the bridge between human language and machine-readable numbers. It breaks down text into smaller units (tokens). 

We use **Byte Pair Encoding (BPE)** (via `tiktoken`) because it handles out-of-vocabulary words by breaking them into subword units, falling back to individual characters or bytes when necessary. This keeps the vocabulary at a manageable size (50,257 tokens for GPT-2) without making sequences excessively long. For agentic systems, it means diverse user inputs can be encoded without failing on unknown words.
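The merge idea behind BPE can be illustrated with a small sketch. This is a simplified, character-level toy trained on a made-up corpus, not the byte-level GPT-2 tokenizer that `tiktoken` ships; the function names (`learn_merges`, `segment`) and the corpus are assumptions chosen for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, pair)] += freq
        words = new_words
    return merges


def segment(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "low" is frequent, "lower" never appears.
merges = learn_merges("low low low low low lowest", num_merges=2)
print(merges)                    # [('l', 'o'), ('lo', 'w')]
print(segment("low", merges))    # ['low']
print(segment("lower", merges))  # ['low', 'e', 'r']
```

The unseen word "lower" is not rejected. It is split into the learned subword `low` plus single characters, which is the property that lets a BPE tokenizer encode arbitrary input.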

In [None]:
# Initialize BPE tokenizer (GPT-2 version)
tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print("Encoded IDs:", integers)
print("Decoded:", tokenizer.decode(integers))

## 3. Data Sampling with Sliding Window

### Why Use a Sliding Window?

LLMs are trained autoregressively to predict the *next* token given a context. To train them efficiently, we need many examples of `(input, target)` pairs. 

A sliding window generates many training examples from a single text by moving a window of fixed size (`max_length`) across the token sequence. The `stride` is the number of tokens the window advances between consecutive examples: with `stride=max_length` the windows do not overlap, while a smaller stride makes them overlap. Each target sequence is the input shifted one token to the right, so every position in a window contributes a next-token prediction.
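To make the windowing concrete, here is a minimal sketch on a toy list of integers standing in for token IDs. It uses the same loop bounds as `GPTDatasetV1`; the helper name `sliding_window_pairs` and the toy IDs are made up for this illustration.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # ten made-up token IDs: 10, 11, ..., 19
pairs = sliding_window_pairs(toy_ids, max_length=4, stride=2)
for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [12, 13, 14, 15] -> [13, 14, 15, 16]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

With `stride=2` and `max_length=4`, consecutive inputs share two tokens. Setting `stride=4` would remove the overlap.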

In [None]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last
    )
    return dataloader, len(dataset)

## 4. Experiment: Impact of Stride on Sample Count

We investigate how changing the `stride` affects the number of training samples generated. We use a small `max_length=4` for demonstration.

### Experiment Results

| Experiment | max_length | stride | Samples Generated |
|------------|------------|--------|-------------------|
| 1 (No Overlap) | 4 | 4 | **1286** |
| 2 (High Overlap) | 4 | 1 | **5141** |
| 3 (Mid Overlap) | 4 | 2 | **2571** |

### Conclusion on Overlap
Using a smaller stride yields far more training samples: `stride=1` produces 5141 samples, about four times the 1286 produced with no overlap (`stride=4`).
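The counts in the table can be reproduced without tokenizing anything, because the loop `range(0, len(token_ids) - max_length, stride)` yields `ceil((N - max_length) / stride)` windows for `N` tokens. The sketch below assumes `N = 5145`, a value inferred from the `stride=1` row (5141 = N - 4) rather than stated in the notebook; the helper names are hypothetical.

```python
import math


def num_samples_by_loop(num_tokens, max_length, stride):
    """Count windows exactly as the dataset loop does."""
    return len(range(0, num_tokens - max_length, stride))


def num_samples_closed_form(num_tokens, max_length, stride):
    """Closed-form equivalent: ceil((N - max_length) / stride), never negative."""
    return max(0, math.ceil((num_tokens - max_length) / stride))


num_tokens = 5145  # assumption: inferred from the stride=1 row of the table
for stride in (4, 1, 2):
    n = num_samples_by_loop(num_tokens, max_length=4, stride=stride)
    assert n == num_samples_closed_form(num_tokens, max_length=4, stride=stride)
    print(f"stride={stride}: {n} samples")
# stride=4: 1286 samples
# stride=1: 5141 samples
# stride=2: 2571 samples
```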

**Why is overlap useful?**
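1.  **Data Augmentation:** It creates more training examples from the existing corpus alone, so the model sees the same tokens at different positions within the window. Because overlapping windows repeat existing tokens rather than add new text, heavy overlap can also increase the risk of overfitting on small datasets.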
2.  **Context Continuity:** It ensures the model learns to predict tokens regardless of their exact absolute position in a fixed window. It smooths out the learning boundaries that would exist if we only used non-overlapping chunks.

In [None]:
# Code used to generate the experiment results (originally run in experiment.py)
for stride in (4, 1, 2):
    _, n_samples = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=stride, shuffle=False)
    print(f"stride={stride}: {n_samples} samples")

## 5. Embeddings

### Why Do Embeddings Encode Meaning, and How Do They Relate to NN Concepts?

Embeddings are dense vector representations of tokens. Unlike sparse one-hot encodings (which are high-dimensional and orthogonal), embeddings exist in a continuous lower-dimensional space where "meaning" is encoded as distance and direction.
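A small sketch of the contrast, using cosine similarity as the measure of closeness. The dense vectors below are hand-picked, hypothetical values, not learned embeddings; they only illustrate that dense vectors can express graded similarity while distinct one-hot vectors cannot.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors: every pair of distinct tokens is orthogonal.
one_hot_cat = [1.0, 0.0, 0.0]
one_hot_dog = [0.0, 1.0, 0.0]

# Hypothetical 2-dimensional dense embeddings (values chosen by hand).
dense_cat = [0.9, 0.8]
dense_dog = [0.8, 0.9]
dense_car = [-0.7, 0.6]

print(cosine_similarity(one_hot_cat, one_hot_dog))  # 0.0
print(cosine_similarity(dense_cat, dense_dog))      # close to 1
print(cosine_similarity(dense_cat, dense_car))      # much lower
```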

**Relation to NN Concepts:**
An embedding layer is technically just a **linear layer** (fully connected layer) without a bias term, applied to a one-hot encoded input. 
- In `torch.nn.Embedding`, we simply look up the row corresponding to the token ID. 
- This lookup is mathematically equivalent to multiplying a one-hot vector with a weight matrix $W$. 
- These weights are **learnable parameters**. During backpropagation, the network adjusts these vectors so that words appearing in similar contexts (e.g., "cat" and "dog") end up having vectors that are geometrically close (high cosine similarity).
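The equivalence between a row lookup and a one-hot matrix product can be checked with a minimal sketch. It uses plain Python lists instead of `torch` tensors so that it runs without dependencies; the weight values and helper names are made up for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Row vector (length n) times an n x d matrix, giving a length-d vector."""
    num_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(num_cols)
    ]


# Hypothetical embedding matrix W: 4 tokens, 3 dimensions per token.
weights = [
    [0.1, -0.2, 0.3],
    [0.4, 0.5, -0.6],
    [-0.7, 0.8, 0.9],
    [1.0, -1.1, 1.2],
]

token_id = 2
looked_up = weights[token_id]  # what an embedding lookup does
multiplied = vector_times_matrix(one_hot(token_id, len(weights)), weights)

print(looked_up)
print(multiplied)
assert multiplied == looked_up  # both select row 2 of W
```

The lookup is preferred in practice because it avoids building the one-hot vector and multiplying by a matrix that is almost entirely skipped.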

For agents, good embeddings are the foundation of "understanding" user intent.

In [None]:
vocab_size = 50257
output_dim = 256
context_len = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_len, output_dim)

# Create dataloader
dataloader, _ = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Input IDs shape:", inputs.shape)

# Token Embeddings
token_embeddings = token_embedding_layer(inputs)
print("Token Embeddings shape:", token_embeddings.shape)

# Positional Embeddings (one vector per position in the window; sequence length is 4 here)
pos_embeddings = pos_embedding_layer(torch.arange(inputs.shape[1]))
print("Positional Embeddings shape:", pos_embeddings.shape)

# Final Input Embeddings
input_embeddings = token_embeddings + pos_embeddings
print("Final Input Embeddings shape:", input_embeddings.shape)
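The final addition works because of broadcasting: `token_embeddings` has shape `(8, 4, 256)` and `pos_embeddings` has shape `(4, 256)`, so the same positional vectors are added to every sequence in the batch. The sketch below spells out that behavior with nested Python lists on tiny, made-up shapes (batch of 2, sequence length 3, dimension 2); the helper name `add_positional` is hypothetical.

```python
def add_positional(token_embeddings, pos_embeddings):
    """Add pos_embeddings [seq][dim] to every sequence in token_embeddings [batch][seq][dim]."""
    return [
        [
            [t + p for t, p in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sequence, pos_embeddings)
        ]
        for sequence in token_embeddings
    ]


# Made-up values: 2 sequences, 3 positions, 2 dimensions.
toy_token_embeddings = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]
toy_pos_embeddings = [[0.0, 0.5], [0.25, 0.75], [1.0, 1.5]]

toy_input_embeddings = add_positional(toy_token_embeddings, toy_pos_embeddings)
for sequence in toy_input_embeddings:
    print(sequence)
# [[1.0, 1.5], [2.25, 2.75], [4.0, 4.5]]
# [[4.0, 4.5], [5.25, 5.75], [7.0, 7.5]]
```

Position 1 receives the offset `[0.25, 0.75]` in both sequences, which mirrors what the tensor addition does across all 8 sequences in the batch.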