In [None]:
# Install dependencies if not already installed
!pip install torch tiktoken

# LLM Text Preprocessing Foundations (Embeddings)

**Name:** Juan Pablo Nieto Cortes  
**Course:** AREP  

This notebook covers the core concepts of text preprocessing for Large Language Models, specifically focusing on data loading, tokenization, and embeddings. It corresponds to Chapter 2 of "Build a Large Language Model From Scratch".

In [None]:
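import torch
import tiktoken
from importlib.metadata import version
from torch.utils.data import Dataset, DataLoader

print("PyTorch version:", torch.__version__)
# importlib.metadata works whether or not the package exposes __version__
print("tiktoken version:", version("tiktoken"))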

## 1. Data Loading
We load the text file `the-verdict.txt`, which serves as our corpus.

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total characters:", len(raw_text))
print(raw_text[:100])

## 2. Tokenization and Data Loader
We use `tiktoken` (BPE) for tokenization and implement a custom Dataset class to create sliding window chunks for training.
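
To make the input/target relationship concrete, here is a minimal sketch using made-up token IDs (the values below are placeholders, not real `tiktoken` output). It shows that the target chunk is the input chunk shifted right by one token, so each position is trained to predict the token that follows it.

```python
# Placeholder token IDs standing in for the result of tokenizer.encode(...)
token_ids = [10, 20, 30, 40, 50, 60]
max_length = 4

# Same slicing as in the dataset class: targets are inputs shifted by one
input_chunk = token_ids[0:max_length]
target_chunk = token_ids[1:max_length + 1]

# Each prefix of the input is a context; the next token is what to predict
pairs = []
for i in range(1, max_length + 1):
    context = token_ids[:i]
    desired = token_ids[i]
    pairs.append((context, desired))
    print(context, "---->", desired)
```

One chunk of length `max_length` therefore yields `max_length` prediction tasks at once.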

In [None]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        
        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

## 3. Embeddings
The LLM input layer has two core components: token embeddings and positional embeddings.

In [None]:
vocab_size = 50257
output_dim = 256
max_length = 4
batch_size = 8

# Create data loader
dataloader = create_dataloader_v1(
    raw_text, batch_size=batch_size, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Input batch shape:", inputs.shape)

In [None]:
# 1. Token Embeddings
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
token_embeddings = token_embedding_layer(inputs)
print("Token embeddings shape:", token_embeddings.shape)

# 2. Positional Embeddings
pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print("Positional embeddings shape:", pos_embeddings.shape)

# 3. Input Embeddings (Token + Positional)
input_embeddings = token_embeddings + pos_embeddings
print("Final input embeddings shape:", input_embeddings.shape)

## 4. Explanations

### Why do embeddings encode meaning?
Embeddings encode meaning by mapping discrete tokens (like words or subwords) to continuous vectors in a high-dimensional space. During training, the model adjusts these vectors so that tokens appearing in similar contexts (e.g., "dog" and "cat" appearing near "pet" or "fur") end up close to each other in the vector space. Geometric proximity then reflects semantic similarity, and directions in the space can capture relationships, such as the classic analogy in which "queen" relates to "woman" as "king" relates to "man".
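
As a toy illustration of "closeness" in vector space, the sketch below compares hand-picked 3-dimensional vectors with cosine similarity. The numbers are invented for this example and are not learned embeddings; the untrained `torch.nn.Embedding` layers in this notebook start from random values and would not show this structure.

```python
import torch
import torch.nn.functional as F

# Hand-picked toy vectors (assumption: NOT taken from any trained model)
toy = {
    "dog": torch.tensor([0.9, 0.8, 0.1]),
    "cat": torch.tensor([0.8, 0.9, 0.2]),
    "car": torch.tensor([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity between two 1-D vectors, returned as a Python float
    return F.cosine_similarity(a, b, dim=0).item()

sim_dog_cat = cosine(toy["dog"], toy["cat"])
sim_dog_car = cosine(toy["dog"], toy["car"])

print(f"dog vs cat: {sim_dog_cat:.3f}")
print(f"dog vs car: {sim_dog_car:.3f}")
```

In this constructed example, "dog" and "cat" point in nearly the same direction, while "car" does not.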

### How are embeddings related to Neural Network concepts?
An embedding layer is fundamentally a **learnable weight matrix** (essentially a lookup table) that serves as the first layer of the neural network. Unlike traditional manual feature extraction (like bag-of-words), embeddings are parameters that are optimized via **backpropagation** along with the rest of the network. Each row in the weight matrix corresponds to a token in the vocabulary, and these weights are updated to minimize the model's loss function.
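
The lookup-table view can be checked directly. The following sketch, using tiny sizes chosen only for illustration, shows that indexing an embedding layer gives the same result as multiplying one-hot vectors by the layer's weight matrix, and that after a backward pass only the rows for tokens that were actually used receive a nonzero gradient.

```python
import torch

torch.manual_seed(123)

# Tiny sizes for readability (the notebook itself uses 50257 x 256)
vocab_size, output_dim = 6, 3
embedding = torch.nn.Embedding(vocab_size, output_dim)

token_ids = torch.tensor([2, 5, 1])

# 1) Direct lookup: selects rows 2, 5 and 1 of the weight matrix
looked_up = embedding(token_ids)

# 2) Equivalent formulation: one-hot vectors times the weight matrix
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

same = torch.allclose(looked_up, via_matmul)
print("Lookup equals one-hot matmul:", same)

# 3) Backpropagation reaches only the rows that were looked up
looked_up.sum().backward()
rows_with_grad = embedding.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print("Rows with nonzero gradient:", rows_with_grad)
```

The lookup is simply the cheaper way to compute the same thing, since it avoids building large one-hot matrices.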

### Why do we need Positional Embeddings?
Transformer architectures (like GPT) use self-attention, which processes all tokens in parallel and has no built-in notion of token order: without positional information, reordering the input tokens simply reorders the outputs. "The dog bit the man" and "The man bit the dog" would therefore look like the same set of tokens to the self-attention layer. Positional embeddings add a distinct vector for each position (1st, 2nd, 3rd, ...), allowing the model to distinguish sequence order and structure.
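
This order-blindness can be demonstrated with a deliberately simplified attention function. The sketch below is an assumption-laden toy: it uses unmasked dot-product attention with no learned query, key, or value weights, whereas GPT adds learned projections and a causal mask. It checks that shuffling the input rows only shuffles the output rows, and that adding position-specific vectors breaks this symmetry.

```python
import torch

torch.manual_seed(0)

def simple_self_attention(x):
    # Simplified attention: no learned weights, no causal mask
    scores = x @ x.T
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

seq_len, dim = 4, 8
tokens = torch.randn(seq_len, dim)   # stand-in for token embeddings
perm = torch.tensor([2, 0, 3, 1])    # an arbitrary reordering of the sequence

# Without positions: permuting the input just permutes the output rows
out = simple_self_attention(tokens)
out_perm = simple_self_attention(tokens[perm])
order_blind = torch.allclose(out[perm], out_perm, atol=1e-4)

# With positions: each token's vector now depends on where it sits
pos = torch.randn(seq_len, dim)      # stand-in for positional embeddings
out_pos = simple_self_attention(tokens + pos)
out_pos_perm = simple_self_attention(tokens[perm] + pos)
order_aware = not torch.allclose(out_pos[perm], out_pos_perm, atol=1e-4)

print("Order-blind without positions:", order_blind)
print("Order-aware with positions:", order_aware)
```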

### Why is overlap (stride < max_length) useful in data loading?
Using a stride smaller than the max_length creates overlapping chunks of text. This acts as a form of **data augmentation**, allowing the model to see the same tokens in slightly different context windows. It maximizes the usage of limited training data and helps the model learn to predict tokens based on varying amounts of preceding context, improving its generalization capabilities.
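
A small sketch with integer stand-ins for token IDs makes the overlap visible. The helper `sliding_windows` is a hypothetical name introduced here; it reuses the same `range(0, len(token_ids) - max_length, stride)` loop as `GPTDatasetV1` but returns only the input chunks.

```python
def sliding_windows(token_ids, max_length, stride):
    # Same start positions as GPTDatasetV1; returns only the input chunks
    return [token_ids[i:i + max_length]
            for i in range(0, len(token_ids) - max_length, stride)]

token_ids = list(range(10))  # stand-in for real token IDs

no_overlap = sliding_windows(token_ids, max_length=4, stride=4)
overlap = sliding_windows(token_ids, max_length=4, stride=2)

print("stride=4:", no_overlap)
print("stride=2:", overlap)
```

With `stride=2`, tokens 2 and 3 appear at the end of one window and at the start of the next, so the model sees them with different amounts of preceding context.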

## 5. Experiment: max_length and stride

We will vary `max_length` and `stride` to see how they affect the number of training samples generated from our text.

In [None]:
def run_experiment(txt, max_length_vals, stride_vals):
    print(f"{'Max Length':<12} {'Stride':<10} {'Num Batches':<12} {'Total Samples':<12}")
    print("-" * 50)
    
    for ml in max_length_vals:
        for st in stride_vals:
            # With batch_size=1 and drop_last=False, each batch holds exactly one sample
            dataloader = create_dataloader_v1(
                txt, batch_size=1, max_length=ml, stride=st, shuffle=False, drop_last=False
            )
            num_batches = len(dataloader)
            num_samples = len(dataloader.dataset)
            print(f"{ml:<12} {st:<10} {num_batches:<12} {num_samples:<12}")

# Experiment parameters
max_lengths = [4, 10, 50]
strides = [1, 2, 4, 10, 50]

run_experiment(raw_text, max_lengths, strides)

**Observation:**
- Smaller `stride` results in significantly more samples (more overlap).
- Larger `stride` (e.g., equal to `max_length`) results in fewer samples with no overlap.
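- `max_length` also constrains the number of samples: every window needs `max_length` input tokens plus one more token for the shifted target, so a longer window leaves fewer valid start positions.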

As explained above, overlap is useful for maximizing the utility of a small dataset.
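
The sample counts in the experiment follow directly from the loop `range(0, len(token_ids) - max_length, stride)` in `GPTDatasetV1`, which produces `ceil((num_tokens - max_length) / stride)` windows when the text is longer than `max_length`. The sketch below checks that closed form against the loop. The token count of 5000 is a hypothetical value chosen for illustration, not the measured size of `the-verdict.txt`, and both function names are introduced here.

```python
import math

def expected_num_samples(num_tokens, max_length, stride):
    # Closed form for the number of start positions in the sliding window
    usable = num_tokens - max_length
    return max(0, math.ceil(usable / stride))

def counted_num_samples(num_tokens, max_length, stride):
    # Counts start positions exactly as the dataset loop generates them
    return len(range(0, num_tokens - max_length, stride))

num_tokens = 5000  # hypothetical corpus size, for illustration only
results = {}
for ml in [4, 10, 50]:
    for st in [1, 2, 4, 10, 50]:
        results[(ml, st)] = expected_num_samples(num_tokens, ml, st)
        print(f"max_length={ml:<3} stride={st:<3} samples={results[(ml, st)]}")
```

For a fixed `max_length`, the count is roughly inversely proportional to the stride, which matches the first observation above.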