# LLM Text Preprocessing Foundations


## Introduction



This notebook follows the ideas from Chapter 2 of *Build a Large Language Model (From Scratch)* by Sebastian Raschka.



The goal is to understand how raw text is transformed into numerical representations that a neural network can use, whether that network is a standalone LLM or the model behind an agentic system:



- We load a real text dataset (`the-verdict.txt`).

- We tokenize the text and map tokens to integer IDs.

- We create sliding-window training samples using `max_length` and `stride`.

- We build a small PyTorch dataset/dataloader.

- We use an embedding layer to map token IDs into dense vectors.



Throughout the notebook, additional markdown cells explain **why** each step (tokenization, windowing, embeddings) matters for training large language models and agentic systems.

## Setup: Environment & Dependencies

This notebook was executed locally on macOS using Python 3.
The required libraries are installed in the active environment:

- torch
- tiktoken
- notebook / jupyter

In [None]:
%pip install torch tiktoken notebook ipykernel
%pip install numpy

In [35]:
import numpy as np
import torch
import tiktoken

In [36]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("Characters:", len(text))
print(text[:300])

Characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would ha


## Tokenize Text

Neural networks cannot process raw text directly. They operate on numbers.

Tokenization splits text into smaller units (tokens), which are then mapped to integer IDs. This process transforms human-readable language into machine-readable data.

Without tokenization:
- We cannot compute gradients
- We cannot perform matrix multiplications
- We cannot represent language mathematically

Tokenization is the first essential step in building an LLM.
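
To make the text → tokens → IDs mapping concrete, here is a minimal sketch of a toy word-level tokenizer. It is an illustration only: the sample sentence, the regular expression, and the names `vocab` and `inverse_vocab` are assumptions made for this example, and the notebook itself relies on tiktoken's GPT-2 byte-pair encoding, which works on subword units rather than whole words.

```python
import re

# Toy word-level tokenizer (illustrative only; not the GPT-2 BPE tokenizer).
sample = "Neural networks operate on numbers, not on raw text."

# Split into words and punctuation marks, dropping whitespace.
pieces = re.findall(r"\w+|[^\w\s]", sample)

# Build a vocabulary: each unique piece gets an integer ID.
vocab = {piece: idx for idx, piece in enumerate(sorted(set(pieces)))}
inverse_vocab = {idx: piece for piece, idx in vocab.items()}

# Encode pieces -> IDs, then decode IDs -> pieces.
ids = [vocab[piece] for piece in pieces]
decoded = [inverse_vocab[i] for i in ids]

print(pieces)
print(ids)
print(decoded == pieces)  # the mapping is reversible
```

The repeated word "on" receives the same ID both times, which is the property that lets a model treat every occurrence of a token identically.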

In [37]:
tokenizer = tiktoken.get_encoding("gpt2")
tokens = tokenizer.encode(text)

print("Number of tokens:", len(tokens))
print(tokens[:20])

Number of tokens: 5145
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]


## What Are max_length and stride?

- `max_length` defines the size of each training sequence (context window).
- `stride` determines how much the window moves each step.

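Sliding the window across the text creates many training examples from a single long text. Consecutive windows overlap whenever `stride` is smaller than `max_length`; when the two are equal, the windows sit side by side with no overlap.

This is critical for autoregressive models, which learn to predict the next token given the previous tokens. Each target sequence is therefore the input sequence shifted by one position.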

Overlap helps the model:
- Preserve contextual continuity
- Learn smoother transitions
- Use more training examples from limited data
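
The effect of `stride` is easiest to see on a tiny sequence. The sketch below uses ten made-up token IDs and a hypothetical helper named `windows`; both are assumptions chosen for illustration, not part of the notebook's pipeline.

```python
# Toy token IDs standing in for a tokenized text (hypothetical values).
toy_tokens = list(range(10))  # [0, 1, ..., 9]
max_length = 4

def windows(tokens, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(tokens) - max_length, stride):
        pairs.append((tokens[i:i + max_length], tokens[i + 1:i + max_length + 1]))
    return pairs

# stride == max_length: windows do not overlap.
no_overlap = windows(toy_tokens, max_length, stride=4)
# stride < max_length: consecutive windows share tokens.
with_overlap = windows(toy_tokens, max_length, stride=2)

for inputs, targets in with_overlap:
    print(inputs, "->", targets)

print(len(no_overlap), len(with_overlap))  # → 2 3
```

Halving the stride yields more samples from the same ten tokens, at the cost of each token appearing in more than one window.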

In [38]:
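def create_dataset(tokens, max_length, stride):
    """Build (input, target) tensors of next-token training pairs."""
    input_ids = []
    target_ids = []

    # Slide a window of max_length tokens over the sequence, moving stride
    # tokens each step. Stopping at len(tokens) - max_length guarantees that
    # every target window, which is shifted by one, is also full length.
    for i in range(0, len(tokens) - max_length, stride):
        input_chunk = tokens[i:i + max_length]
        target_chunk = tokens[i + 1:i + max_length + 1]  # input shifted by one position

        input_ids.append(input_chunk)
        target_ids.append(target_chunk)

    return torch.tensor(input_ids), torch.tensor(target_ids)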

In [39]:
max_length = 128
stride = 128

input_ids, target_ids = create_dataset(tokens, max_length, stride)

print("Number of samples:", input_ids.shape[0])
print("Shape of input tensor:", input_ids.shape)

Number of samples: 40
Shape of input tensor: torch.Size([40, 128])


In [40]:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(input_ids, target_ids)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

batch = next(iter(dataloader))
print("Batch input shape:", batch[0].shape)

Batch input shape: torch.Size([4, 128])


## What Is an Embedding Layer?

An embedding layer is a trainable lookup table.

It maps discrete token IDs into dense continuous vectors.

Instead of representing a word as a single number (e.g., 5023),
we represent it as a vector like:

[0.12, -0.45, 0.89, ..., 0.03]

This allows the neural network to:
- Capture semantic similarity
- Learn distributed representations
- Encode meaning geometrically in vector space
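
The "lookup table" description can be checked directly. The following is a minimal sketch using a deliberately tiny layer; the sizes (6 IDs, 3 dimensions) and the names `toy_embedding` and `toy_ids` are assumptions for illustration. The weights are random because the layer is untrained, so only the shapes and the lookup behavior are meaningful here.

```python
import torch

torch.manual_seed(0)  # fixed seed so the random initial weights are reproducible

# Hypothetical tiny sizes: 6 possible token IDs, 3-dimensional vectors.
toy_embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

toy_ids = torch.tensor([2, 5, 2])
vectors = toy_embedding(toy_ids)

# One 3-dimensional vector per token ID.
print(vectors.shape)

# The layer returns rows of its weight matrix, selected by ID.
print(torch.equal(vectors, toy_embedding.weight[toy_ids]))

# The same ID always maps to the same vector.
print(torch.equal(vectors[0], vectors[2]))
```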

In [41]:
vocab_size = tokenizer.n_vocab
embedding_dim = 256

embedding = torch.nn.Embedding(vocab_size, embedding_dim)

embedded_tokens = embedding(input_ids)

print("Embedding output shape:", embedded_tokens.shape)

Embedding output shape: torch.Size([40, 128, 256])


## Why Do Embeddings Encode Meaning?

Embeddings come to encode meaning because they are learned through gradient descent. The layer created above starts with random weights, so its vectors carry no meaning until the model is trained.

During training, the model adjusts embedding vectors to minimize prediction error.
If two words appear in similar contexts, their vectors are adjusted in similar directions.

This leads to:

- Words with similar meanings having similar vectors
- Semantic relationships forming geometric patterns
- Meaning emerging from statistical patterns

Relation to Neural Networks:

- Embeddings are parameters of the model
- They are optimized using backpropagation
- They live in a continuous vector space (latent space)
- They are equivalent to a linear projection from discrete IDs into a dense space

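In neural network terms:
An embedding layer is mathematically equivalent to multiplying a one-hot vector by a weight matrix. In practice it is implemented as a direct row lookup, which gives the same result without the wasted multiplications by zero.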

Therefore, embeddings are learned representations that transform symbolic language into numerical structures that neural networks can process.
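
The equivalence between a lookup and a one-hot matrix product can be verified by hand. This sketch uses plain Python lists and a made-up 4 × 3 weight matrix; the values and variable names are assumptions chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
weights = [
    [0.1, 0.2, 0.3],  # row for token ID 0
    [0.4, 0.5, 0.6],  # row for token ID 1
    [0.7, 0.8, 0.9],  # row for token ID 2
    [1.0, 1.1, 1.2],  # row for token ID 3
]
token_id = 2

# Route 1: direct row lookup.
looked_up = weights[token_id]

# Route 2: one-hot vector times the weight matrix.
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
projected = [
    sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
    for col in range(len(weights[0]))
]

print(one_hot)                 # → [0.0, 0.0, 1.0, 0.0]
print(projected == looked_up)  # → True
```

Every term of the sum except one is multiplied by zero, which is why a lookup is the cheaper route to the same vector.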

In [42]:
# Experiment 1
max_length = 128
stride = 128

input_ids_1, _ = create_dataset(tokens, max_length, stride)
print("Stride 128 → Samples:", input_ids_1.shape[0])

# Experiment 2
max_length = 128
stride = 64

input_ids_2, _ = create_dataset(tokens, max_length, stride)
print("Stride 64 → Samples:", input_ids_2.shape[0])

# Experiment 3
max_length = 64
stride = 32

input_ids_3, _ = create_dataset(tokens, max_length, stride)
print("max_length 64 & stride 32 → Samples:", input_ids_3.shape[0])

Stride 128 → Samples: 40
Stride 64 → Samples: 79
max_length 64 & stride 32 → Samples: 159


## Experiment Analysis

When stride is smaller:

- The window overlaps more
- More training samples are generated
- The model sees similar context with slight shifts

Why is overlap useful?

Because language is sequential.
If we split text without overlap, every window boundary is fixed, so a phrase that straddles a boundary is never seen whole within a single training example.

Overlap allows the model to:
- Learn smoother transitions
- Capture dependencies across boundaries
- Improve autoregressive prediction

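Reducing `max_length`:
- Creates shorter contexts
- Barely changes the number of samples on its own; in Experiment 3 the increase comes mainly from the smaller stride
- Limits how far back the model can look, which reduces long-range dependency learning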

This shows how preprocessing directly affects model learning capacity.
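
The sample counts from the experiments can be reproduced with simple arithmetic. The helper `count_windows` below is a hypothetical name introduced for this sketch; it assumes the same loop bounds as `create_dataset`, namely `range(0, len(tokens) - max_length, stride)`, and the token count of 5145 reported earlier in this notebook.

```python
import math

def count_windows(num_tokens, max_length, stride):
    """Number of start positions in range(0, num_tokens - max_length, stride)."""
    return max(0, math.ceil((num_tokens - max_length) / stride))

num_tokens = 5145  # token count reported earlier in this notebook

print(count_windows(num_tokens, 128, 128))  # → 40
print(count_windows(num_tokens, 128, 64))   # → 79
print(count_windows(num_tokens, 64, 32))    # → 159

# Holding stride at 32 and only changing max_length from 64 to 128:
print(count_windows(num_tokens, 128, 32))   # → 157
```

With the stride held at 32, changing `max_length` from 128 to 64 moves the count only from 157 to 159, so the stride is the setting that dominates the number of samples.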

## Conclusion: from this pipeline to complete LLMs


In this notebook, we saw the typical path that text data follows before entering a large model:


1. **Raw text → tokens → IDs**: tokenization converts human language into sequences of integers. These integers index into an embedding table, and the resulting vectors are what matrix products and gradients operate on.
2. **IDs → context windows**: with `max_length` and `stride`, we construct training examples that respect the order of the sequence and allow the model to learn to predict the next token.
3. **IDs → dense embeddings**: the `Embedding` layer learns, via backpropagation, a vector space where the geometry reflects meaning and context.

A complete LLM (e.g., a Transformer) takes exactly these embeddings as input, adds positional information to them, and then applies attention layers and feed-forward layers to model complex long-range dependencies.

In agentic systems, this same representation is the starting point for reasoning, planning, and decision-making: everything the agent “knows” about the text is compressed into these embeddings and refined through the layers of the model.
