# My Exercise Solutions: Chapter 2 (Working with Text Data)

**Date**: February 3, 2026

**My goal**: I want to work through the Chapter 2 exercises in my own words and code to strengthen my intuition about tokenization, vocabulary building, and data preparation.

**Originality note**: This notebook is my personal synthesis. I am not copying the source-material solutions.

**Attribution**: Concepts are based on *Build a Large Language Model From Scratch* by Sebastian Raschka.

**Scope note**: This notebook currently covers Exercises 2.1–2.2.

## Environment Check
I want to record package versions for reproducibility.

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

## Reproducibility
I set seeds so I can reproduce results later.

In [None]:
import numpy as np
import torch

np.random.seed(42)
torch.manual_seed(42)

## Exercise 2.1
**Prompt (my words)**: Explore how the GPT-2 BPE tokenizer splits the string "Akwirw ier" and interpret the token IDs and decoded pieces.

**My approach**: Use `tiktoken` to encode the string, print the token IDs, decode each token, and probe a few substrings to see where merges happen.

**Solution**:

In [None]:
import tiktoken

text = "Akwirw ier"

tokenizer = tiktoken.get_encoding("gpt2")
encoded = tokenizer.encode(text)

print("Text:", text)
print("Token IDs:", encoded)
print("Decoded pieces:")
for token_id in encoded:
    print(f"  {token_id} -> {tokenizer.decode([token_id])}")

print("Reconstructed:", tokenizer.decode(encoded))

probe_strings = ["Ak", "w", "ir", " ", "ier"]
print("\nSubstring probes:")
for snippet in probe_strings:
    print(f"  {snippet!r} -> {tokenizer.encode(snippet)}")
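
To build intuition for what the substring probes are looking for, here is a toy sketch of the merge idea behind BPE: start from single characters and repeatedly join the adjacent pair that ranks best in a learned merge table. Everything here is an assumption for illustration. The `toy_bpe` helper and the `toy_merges` table are hypothetical names and values I made up; they are not GPT-2's real merge rules, and the real tokenizer operates on bytes rather than characters.

```python
def toy_bpe(text, merges):
    """Greedily apply ranked merges to a character sequence (toy BPE)."""
    # Lower rank means the pair is merged with higher priority.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(text)
    while len(pieces) > 1:
        # Collect every adjacent pair that appears in the merge table.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            break  # nothing left to merge
        # Merge the best-ranked pair (leftmost on ties) into one piece.
        _, i = min(candidates)
        pieces[i : i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces


# Hypothetical merge table, NOT taken from GPT-2.
toy_merges = [("i", "e"), ("ie", "r"), ("A", "k"), ("i", "r")]

print(toy_bpe("Akwirw ier", toy_merges))
# ['Ak', 'w', 'ir', 'w', ' ', 'ier']
```

Characters that never appear in a mergeable pair (here `w` and the space) stay as single pieces, which is the kind of boundary the probes are meant to surface.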

## Exercise 2.2
**Prompt (my words)**: Build a minimal GPT-style dataset and dataloader with a sliding window, then inspect a few batches using small `max_length` and `stride` values.

**My approach**: Rebuild a simple `Dataset` class, load `the-verdict.txt`, create dataloaders with different window sizes, and print the first batch.

**Solution**:

In [None]:
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetMini(Dataset):
    def __init__(self, text: str, tokenizer, max_length: int, stride: int) -> None:
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        self.input_ids = []
        self.target_ids = []

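        # Slide a window of max_length tokens over the text, moving forward by
        # stride tokens each step. Stopping before len(token_ids) - max_length
        # guarantees every target chunk still has max_length tokens.
        for start in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[start : start + max_length]
            # The target is the input shifted one position to the right.
            target_chunk = token_ids[start + 1 : start + max_length + 1]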
            self.input_ids.append(torch.tensor(input_chunk, dtype=torch.long))
            self.target_ids.append(torch.tensor(target_chunk, dtype=torch.long))

    def __len__(self) -> int:
        return len(self.input_ids)

    def __getitem__(self, idx: int):
        return self.input_ids[idx], self.target_ids[idx]


def make_loader(text: str, max_length: int, stride: int, batch_size: int = 4):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetMini(text, tokenizer, max_length=max_length, stride=stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=False)


with open("the-verdict.txt", "r", encoding="utf-8") as file_handle:
    raw_text = file_handle.read()

print("Example 1: max_length=2, stride=2")
loader_small = make_loader(raw_text, max_length=2, stride=2, batch_size=4)
for x_batch, y_batch in loader_small:
    print("x:", x_batch)
    print("y:", y_batch)
    break

print("\nExample 2: max_length=8, stride=2")
loader_medium = make_loader(raw_text, max_length=8, stride=2, batch_size=4)
for x_batch, y_batch in loader_medium:
    print("x:", x_batch)
    print("y:", y_batch)
    break
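
To see how `max_length` and `stride` interact without loading the text file, here is a minimal pure-Python sketch that reuses the same indexing as `GPTDatasetMini`. The `sliding_windows` helper and the `toy_ids` list are hypothetical stand-ins for this example; the IDs are just the integers 0 to 9, not real GPT-2 token IDs.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same indexing as GPTDatasetMini."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        # Target is the input shifted one position to the right.
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in token IDs 0..9 (assumption, not real data)

for max_length, stride in [(2, 2), (4, 1), (4, 4)]:
    pairs = sliding_windows(toy_ids, max_length, stride)
    print(f"max_length={max_length}, stride={stride}: {len(pairs)} windows")
    for inputs, targets in pairs[:2]:
        print("  ", inputs, "->", targets)
```

When `stride` is smaller than `max_length`, consecutive input windows overlap (in Example 2 above, by six tokens). When `stride` equals `max_length`, the input windows do not share any positions.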