# From Raw Text to Training Samples

This notebook implements the data preparation pipeline required to train a Large Language Model (LLM), following Chapter 2 of *Build a Large Language Model (From Scratch)*.

We focus on transforming raw text into structured training data suitable for next-token prediction.

Specifically, we will:

- Tokenize text using Byte Pair Encoding (BPE)
- Convert tokens into token IDs
- Generate input-target pairs using a sliding window
- Build a PyTorch Dataset
- Analyze how window parameters affect training data

The objective is not only to implement the pipeline, but also to understand how tokenization, window sampling, and embeddings shape what an LLM learns.


## 1. Loading the Dataset

We begin by loading the text file that will be used as training data.

In [13]:
# Install required libraries (run this once if needed)
%pip install torch tiktoken jupyter
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [14]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total characters:", len(raw_text))
print(raw_text[:200])

Total characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a


## Why text preprocessing matters for LLMs

Large Language Models cannot process raw text directly. Neural networks operate on numerical tensors, not strings. Therefore, before training an LLM, we must transform text into a numerical representation.

This preprocessing stage defines:

- The vocabulary
- How context is represented
- How meaning is captured

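The tokenizer fixes the units the model can read and predict, so a poor tokenization (for example, one that splits common words into many rare fragments) makes patterns harder to learn. Good preprocessing is the foundation of everything that follows in an LLM pipeline.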

## 2. Tokenization with BPE

Modern LLMs use subword tokenization. We apply GPT-2's Byte Pair Encoding (BPE) tokenizer using `tiktoken`.

In [15]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

token_ids = tokenizer.encode(raw_text)

print("Number of tokens:", len(token_ids))
print(token_ids[:20])

Number of tokens: 5145
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]


## Why Byte Pair Encoding (BPE) matters

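Basic tokenization splits text on whitespace and punctuation, so every distinct word needs its own vocabulary entry, and any word missing from the vocabulary has to be replaced by a placeholder token. Real LLMs instead use subword tokenization such as BPE.

BPE allows:

- Handling unknown words: GPT-2's byte-level BPE can encode any string, so no placeholder token is needed
- Keeping the vocabulary at a fixed size (50,257 tokens for GPT-2)
- Capturing morphological patterns such as shared prefixes and suffixes
- Representing rare words as sequences of more frequent pieces

Instead of storing every word, BPE starts from small units and repeatedly merges the most frequent adjacent pair into a new token. Frequent words end up as single tokens, while rare words are split into several subwords, which bounds the size of the vocabulary (and of the embedding table) without giving up coverage.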
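
To make the merge idea concrete, here is a toy sketch of BPE training on a four-word corpus. It is not GPT-2's tokenizer, which works on bytes and ships with a fixed merge table through `tiktoken`. The corpus, the word counts, and the helper names `most_frequent_pair`, `apply_merge`, and `segment` are all made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def apply_merge(symbols, pair):
    """Replace every occurrence of `pair` in a symbol sequence with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus: word -> count. Each word starts as single characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):  # learn four merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = {apply_merge(symbols, pair): freq for symbols, freq in words.items()}

# "lowest" never appears in the corpus, yet it decomposes into learned subwords.
print(segment("lowest", merges))  # ['low', 'est']
```

The word "lowest" is absent from the toy corpus, but the merges learned from "low", "newest", and "widest" are enough to represent it as two known pieces. This is the sense in which subword tokenizers handle unseen words.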

## 3. Preparing Data for Next-Token Prediction

LLMs are trained to predict the next token given previous tokens.

To create training examples, we slide a fixed-length window across the token sequence. Each window produces:

- An input sequence
- A target sequence shifted by one position

In [16]:
max_length = 4
stride = 1

samples = []

for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    target_chunk = token_ids[i + 1:i + max_length + 1]
    samples.append((input_chunk, target_chunk))

print("Number of samples:", len(samples))
print(samples[0])

Number of samples: 5141
([40, 367, 2885, 1464], [367, 2885, 1464, 1807])


## Why sliding windows are used in LLM training

LLMs are trained using next-token prediction. Instead of feeding the entire document at once, we generate multiple input-target pairs using a sliding window.

This has three benefits:

1. It increases the number of training samples.
2. It preserves local contextual continuity.
3. It allows fixed-length inputs for batch processing.

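With non-overlapping windows, each token is seen in only one context window, so the model gets fewer training samples from the same text. Heavily overlapping windows, on the other hand, repeat the same tokens many times, which costs more compute and can encourage overfitting.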
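
A single input-target pair actually contains several prediction tasks, one per position. The sketch below unrolls the first window from the cell above, assuming a causal (GPT-style) model in which each position only sees the tokens before it. The helper name `prediction_tasks` is hypothetical.

```python
def prediction_tasks(inputs, targets):
    """Expand one (input, target) window into its per-position prediction tasks."""
    return [(inputs[: t + 1], targets[t]) for t in range(len(inputs))]

first_ids = [40, 367, 2885, 1464, 1807]  # first five token IDs from the tokenizer output above
window = 4                               # plays the role of max_length

inputs = first_ids[:window]
targets = first_ids[1 : window + 1]

for context, next_token in prediction_tasks(inputs, targets):
    print(context, "->", next_token)
# [40] -> 367
# [40, 367] -> 2885
# [40, 367, 2885] -> 1464
# [40, 367, 2885, 1464] -> 1807
```

Shifting the target by one position is therefore enough to supervise every position in the window at once.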

## 4. Experiment: Effect of max_length and stride

We evaluate how changing window size and stride impacts the number of generated training samples.

In [17]:
def count_samples(max_length, stride):
    # One sample per valid window start; same range as the sampling loop above.
    return len(range(0, len(token_ids) - max_length, stride))

print("max_length=4, stride=1:", count_samples(4, 1))
print("max_length=4, stride=4:", count_samples(4, 4))
print("max_length=8, stride=1:", count_samples(8, 1))
print("max_length=8, stride=8:", count_samples(8, 8))

max_length=4, stride=1: 5141
max_length=4, stride=4: 1286
max_length=8, stride=1: 5137
max_length=8, stride=8: 643


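When `stride = 1`, the window moves one token at a time. Consecutive windows overlap in all but one token, which yields the maximum number of samples (5141 for `max_length = 4`).

When `stride = max_length`, consecutive input windows do not overlap. Each token is used as input at most once, and the number of samples drops by roughly a factor of `max_length` (1286 for `max_length = 4`).

Overlap is useful because:

- It increases dataset size
- It shows each token at several positions within the context window
- It presents the same next-token targets with different amounts of preceding context

However, overlap also means each token is processed several times per epoch, which increases computational cost, and the repeated content can encourage overfitting.

Thus, stride is a tradeoff between efficiency and data richness.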
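
The sample counts measured above follow a simple closed form: a window needs `max_length` input tokens plus one extra token for the shifted target, so there are `num_tokens - max_length` valid start positions, of which every `stride`-th one is used. A minimal sketch, with `num_windows` and `overlap_fraction` as hypothetical helper names and 5145 as the token count reported earlier:

```python
import math

def num_windows(num_tokens, max_length, stride):
    """Closed-form number of sliding-window samples."""
    valid_starts = num_tokens - max_length  # leaves room for the shifted target
    return max(0, math.ceil(valid_starts / stride))

def overlap_fraction(max_length, stride):
    """Share of each input window that reappears in the next window."""
    return max(0, max_length - stride) / max_length

num_tokens = 5145  # token count reported by the tokenizer cell above

for window, step in [(4, 1), (4, 4), (8, 1), (8, 8)]:
    n = num_windows(num_tokens, window, step)
    # The formula agrees with the loop-based count used in the experiment.
    assert n == len(range(0, num_tokens - window, step))
    print(f"max_length={window}, stride={step}: {n} samples, "
          f"overlap {overlap_fraction(window, step):.1%}")
```

The formula reproduces the four counts from the experiment (5141, 1286, 5137, 643) and makes the tradeoff explicit: doubling the stride roughly halves the number of samples.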

## Why do embeddings encode meaning, and how are they related to neural networks?

Embeddings encode meaning because they are learned representations optimized during training to minimize prediction error.

In a neural network, an embedding layer is simply a lookup table (a matrix with one row per vocabulary entry) where:

- Each token ID indexes a row vector
- That vector is updated via backpropagation

Over time, tokens that appear in similar contexts develop similar vector representations. This reflects the distributional hypothesis: words used in similar contexts tend to have similar meanings.

Thus, embeddings are not manually designed semantic vectors. They are emergent geometric structures learned through gradient descent.

From a neural network perspective:

- Embeddings are parameters
- They are trained like any other weight
- They form the first layer of the LLM

This is why meaning in LLMs is geometric and relational rather than symbolic.
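
The lookup-table view can be checked directly with `torch.nn.Embedding`. The following sketch uses tiny, made-up sizes (a 6-token vocabulary and 3-dimensional vectors) and a dummy loss. It only illustrates the mechanics, not how a real model is trained.

```python
import torch

torch.manual_seed(123)

vocab_size, embedding_dim = 6, 3  # hypothetical toy sizes, far smaller than a real LLM
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

ids = torch.tensor([2, 3, 5, 1])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([4, 3])

# 1) The layer is a lookup: each ID selects one row of the weight matrix.
assert torch.equal(vectors, embedding.weight[ids])

# 2) Equivalently, it is a one-hot encoding multiplied by the weight matrix.
one_hot = torch.nn.functional.one_hot(ids, num_classes=vocab_size).float()
assert torch.allclose(one_hot @ embedding.weight, vectors)

# 3) The rows are trainable parameters: after a backward pass, only the rows
#    that were looked up (1, 2, 3, 5) receive a non-zero gradient.
vectors.sum().backward()  # dummy loss, for illustration only
print(embedding.weight.grad)
# tensor([[0., 0., 0.],
#         [1., 1., 1.],
#         [1., 1., 1.],
#         [1., 1., 1.],
#         [0., 0., 0.],
#         [1., 1., 1.]])
```

Because gradients reach only the rows of tokens that actually occur in a batch, each token's vector is shaped by the contexts it appears in.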

## 5. Building a PyTorch Dataset

To train neural networks efficiently, we wrap the input-target pairs in a `Dataset` class. Combined with a `DataLoader`, this allows:

- Efficient batching
- Shuffling
- Parallel loading (via `DataLoader` worker processes)
- Clean integration with PyTorch training loops

In [18]:
import torch
from torch.utils.data import Dataset, DataLoader

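class GPTDataset(Dataset):
    """Sliding-window dataset of (input, target) token-ID tensors."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs = []
        self.targets = []

        # Stop max_length tokens before the end so every window has a full target.
        for i in range(0, len(token_ids) - max_length, stride):
            # The target is the input window shifted one token to the right.
            self.inputs.append(torch.tensor(token_ids[i:i+max_length]))
            self.targets.append(torch.tensor(token_ids[i+1:i+max_length+1]))

    def __len__(self):
        # Number of windows, i.e. training samples.
        return len(self.inputs)

    def __getitem__(self, idx):
        # One sample: two tensors of shape (max_length,).
        return self.inputs[idx], self.targets[idx]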


dataset = GPTDataset(token_ids, max_length=4, stride=1)
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

inputs, targets = next(iter(dataloader))

print("Inputs shape:", inputs.shape)
print("Targets shape:", targets.shape)

print("\nInputs:")
print(inputs)

print("\nTargets:")
print(targets)


Inputs shape: torch.Size([2, 4])
Targets shape: torch.Size([2, 4])

Inputs:
tensor([[  40,  367, 2885, 1464],
        [ 367, 2885, 1464, 1807]])

Targets:
tensor([[ 367, 2885, 1464, 1807],
        [2885, 1464, 1807, 3619]])
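
The batch above uses `stride=1` and `shuffle=False` so the overlap is easy to see. For training, a more typical setup would shuffle the samples and drop the last incomplete batch. Here is a minimal sketch of that configuration on synthetic token IDs. `ToyWindowDataset` is a hypothetical stand-in that mirrors `GPTDataset` so the example runs on its own, and the batch size and stride are arbitrary choices.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyWindowDataset(Dataset):
    """Stand-in for GPTDataset, repeated here so the example is self-contained."""

    def __init__(self, token_ids, max_length, stride):
        ids = torch.tensor(token_ids)
        starts = range(0, len(token_ids) - max_length, stride)
        self.inputs = [ids[i : i + max_length] for i in starts]
        self.targets = [ids[i + 1 : i + max_length + 1] for i in starts]

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

torch.manual_seed(123)  # make the shuffling reproducible

toy_ids = list(range(50))  # synthetic "token IDs" 0..49, not real tokenizer output
toy_dataset = ToyWindowDataset(toy_ids, max_length=4, stride=4)  # non-overlapping windows
toy_loader = DataLoader(toy_dataset, batch_size=5, shuffle=True, drop_last=True)

print("Samples:", len(toy_dataset))  # 12
print("Batches:", len(toy_loader))   # 2 (the last 2 samples are dropped)

for batch_inputs, batch_targets in toy_loader:
    assert batch_inputs.shape == (5, 4)
    # The IDs are consecutive integers, so a one-token shift means target = input + 1.
    assert torch.equal(batch_targets, batch_inputs + 1)
```

Shuffling changes the order of the windows but never the pairing between an input and its target, which is why the final check holds for every batch.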


## Conclusion

In this notebook, we transformed raw text into structured training data suitable for Large Language Models.

We implemented:
- Modern subword tokenization (BPE)
- Conversion to token IDs
- Sliding window sampling
- PyTorch Dataset creation

We also experimentally analyzed how `max_length` and `stride` affect training sample generation.

This process demonstrates that embeddings and training data construction are not arbitrary preprocessing steps: they define how meaning is represented, learned, and generalized in neural language models.