# AREP-L5: Embeddings and Text Processing Pipeline
This notebook implements the foundational data pipeline for LLMs, moving from raw text to continuous vectors via BPE, sliding windows, and embeddings.

## 1. Raw Text Ingestion
Load the text corpus (`the-verdict.txt`) as training data.

In [3]:
import os
import urllib.request

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "./src/the-verdict.txt"

# Download the corpus on the first run; later runs reuse the local copy
if not os.path.exists(file_path):
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total characters:", len(raw_text))
print("First 100 chars:", raw_text[:100])


Total characters: 20479
First 100 chars: I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


## 2. Advanced Tokenization with BPE
Instantiate the Byte Pair Encoding (BPE) tokenizer used in GPT architectures.

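**Why it matters:** BPE keeps the vocabulary at a fixed size (50,257 entries for GPT-2) and avoids out-of-vocabulary (OOV) failures: a word missing from the vocabulary is split into known subword units, falling back to single bytes if necessary. Sequences are also much shorter than with character-level tokenization, which matters because the LLM's context window is finite. Here the 20,479-character corpus becomes 5,145 tokens, roughly 4 characters per token.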

In [4]:
import tiktoken

# Initialize BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Tokenize the full corpus
encoded_text = tokenizer.encode(raw_text)
print("Total tokens in corpus:", len(encoded_text))

# Round trip: encode a sample string, then decode it back
sample_text = "Hello, do you like AI?"
encoded_sample = tokenizer.encode(sample_text)
decoded_sample = tokenizer.decode(encoded_sample)

print("Original:", sample_text)
print("Encoded :", encoded_sample)
print("Decoded :", decoded_sample)


Total tokens in corpus: 5145
Original: Hello, do you like AI?
Encoded : [15496, 11, 466, 345, 588, 9552, 30]
Decoded : Hello, do you like AI?
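
The merge idea behind BPE can be sketched on a toy corpus. This is only an illustration of the training step (repeatedly fuse the most frequent adjacent pair of symbols), not how `tiktoken` works internally: the GPT-2 tokenizer operates on bytes and ships with a pretrained merge table. The names `toy_corpus`, `most_frequent_pair`, and `merge_pair` are hypothetical and exist only for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single symbol in each word."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical word counts; each word starts as a tuple of single characters
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in toy_corpus.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
for symbols, freq in words.items():
    print(freq, symbols)
```

After a few merges, frequent fragments such as the shared suffix of "newest" and "widest" become single symbols, while rarer words stay split into smaller pieces. That is the property that lets a fixed vocabulary still cover unseen words.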


## 3. Autoregressive Dataset Construction
Implement a PyTorch `Dataset` to generate contextual blocks and shifted targets.

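**Why it matters:** An LLM is trained to predict the next token. Each sample pairs an input window of `max_length` tokens with the same window shifted one position to the right, so position `t` of the target is the token that follows the first `t + 1` input tokens. A single sample therefore provides `max_length` prediction tasks, with context lengths from 1 up to `max_length`.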

In [5]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window across the tokens
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Helper that builds the dataset and wraps it in a DataLoader
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader, dataset
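
To see what the sliding window produces without PyTorch or a tokenizer, the same loop can be run on a short list of made-up token IDs. This is a minimal sketch: `sliding_windows` and `toy_ids` are hypothetical names, and the loop bounds are copied from `GPTDatasetV1` above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same loop bounds as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # ten made-up token IDs: 10, 11, ..., 19
pairs = sliding_windows(toy_ids, max_length=4, stride=4)

for input_chunk, target_chunk in pairs:
    print("input :", input_chunk)
    print("target:", target_chunk)
    # Each position is its own prediction task: growing context -> next token
    for t in range(len(input_chunk)):
        print("   ", input_chunk[:t + 1], "->", target_chunk[t])
```

Every target is the input shifted by one token, and the inner loop spells out the `max_length` next-token tasks contained in each sample.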




## 4. Experiment: Overlap and Stride
Analyze the impact of modifying `max_length` and `stride`, starting with a baseline (`max_length=4`, `stride=4`).

In [6]:
dataloader_base, dataset_base = create_dataloader_v1(
    raw_text, max_length=4, stride=4, batch_size=2, shuffle=False
)
print("Baseline Samples (max_length=4, stride=4):", len(dataset_base))
sample_input, sample_target = dataset_base[0]
print("Tokens [Input]: ", sample_input)
print("Tokens [Target]:", sample_target)


Baseline Samples (max_length=4, stride=4): 1286
Tokens [Input]:  tensor([  40,  367, 2885, 1464])
Tokens [Target]: tensor([ 367, 2885, 1464, 1807])


Run the experiment with high overlap (`max_length=4`, `stride=1`).

In [7]:
dataloader_exp, dataset_exp = create_dataloader_v1(
    raw_text, max_length=4, stride=1, batch_size=2, shuffle=False
)
print("Experiment Samples (max_length=4, stride=1):", len(dataset_exp))
sample_input, sample_target = dataset_exp[0]
print("Tokens [Input]: ", sample_input)
print("Tokens [Target]:", sample_target)


Experiment Samples (max_length=4, stride=1): 5141
Tokens [Input]:  tensor([  40,  367, 2885, 1464])
Tokens [Target]: tensor([ 367, 2885, 1464, 1807])


## 5. Experiment Results
**1. Samples generated:** The baseline (`stride=4`) yields 1286 samples, while the high overlap experiment (`stride=1`) yields 5141 samples.

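**2. Utility of overlap:** With `stride=max_length` the windows do not overlap, so each token appears in at most one input window. With `stride=1` consecutive windows share all but one token, so most tokens appear at every position of the context, and contexts are no longer always cut at the same fixed boundaries. The cost is roughly `max_length` times as many samples (5141 vs. 1286 here) that are highly correlated with one another, which lengthens each epoch and can encourage overfitting.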
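
The sample counts follow directly from the loop bounds in `GPTDatasetV1`. The sketch below reproduces them from the corpus size alone (5145 tokens) and counts how many input windows contain a given token. The helper names `num_samples` and `input_coverage` are hypothetical.

```python
import math


def num_samples(n_tokens, max_length, stride):
    """Number of windows produced by range(0, n_tokens - max_length, stride)."""
    return len(range(0, n_tokens - max_length, stride))


def input_coverage(n_tokens, max_length, stride, j):
    """Number of input windows that contain the token at index j."""
    return sum(
        1
        for i in range(0, n_tokens - max_length, stride)
        if i <= j < i + max_length
    )


n_tokens = 5145  # corpus length reported by the tokenizer cell

baseline = num_samples(n_tokens, max_length=4, stride=4)
overlap = num_samples(n_tokens, max_length=4, stride=1)
print(baseline, overlap)  # 1286 5141

# Closed form, valid when n_tokens > max_length
print(math.ceil((n_tokens - 4) / 4))  # 1286

# An interior token is seen once without overlap, max_length times with stride=1
print(input_coverage(n_tokens, 4, 4, j=100))  # 1
print(input_coverage(n_tokens, 4, 1, j=100))  # 4
```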

## 6. Embedding Layers
Combine token embeddings and absolute positional embeddings to form the model input tensor.

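**Why it matters:** Token embeddings map each discrete vocabulary ID to a dense, trainable vector of `output_dim` values, instead of a sparse one-hot vector with one entry per vocabulary item (50,257 here). Self-attention by itself has no notion of token order (it is permutation-equivariant), so positional embeddings add that information back: each position index gets its own learned vector, which is summed with the token vector.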

In [8]:
vocab_size = 50257     # GPT-2 standard vocab size
output_dim = 256       # Dimensionality of the latent continuous space
context_length = 1024  # Max sequence length the NN can process

# 1. Token Embedding space
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# 2. Positional Embedding space
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Simulate passing a batch
max_len_demo = 4
dataloader_demo, _ = create_dataloader_v1(raw_text, batch_size=8, max_length=max_len_demo, stride=4)
data_iter = iter(dataloader_demo)
inputs, targets = next(data_iter)

print("Batch inputs shape (B, T):", inputs.shape)

# Transform inputs
token_embeddings = token_embedding_layer(inputs)

# Create position IDs
pos_ids = torch.arange(max_len_demo)
pos_embeddings = pos_embedding_layer(pos_ids)

# Add position vectors to token vectors; (T, D) broadcasts across the batch to (B, T, D)
input_embeddings = token_embeddings + pos_embeddings

print("Final Embedding Tensor Shape entering Self-Attention:", input_embeddings.shape)


Batch inputs shape (B, T): torch.Size([8, 4])
Final Embedding Tensor Shape entering Self-Attention: torch.Size([8, 4, 256])
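
One way to read the token embedding layer is as a lookup table that gives the same result as multiplying a one-hot vector by the weight matrix, without ever building the one-hot vector. The sketch below checks this on a deliberately tiny layer. The sizes and the names `toy_vocab_size`, `toy_dim`, and `toy_layer` are assumptions for illustration, not values from the pipeline above.

```python
import torch

torch.manual_seed(0)

toy_vocab_size, toy_dim = 6, 3  # tiny, hypothetical sizes
toy_layer = torch.nn.Embedding(toy_vocab_size, toy_dim)

ids = torch.tensor([2, 5, 2])

# Lookup: selects rows 2, 5, 2 of the weight matrix
looked_up = toy_layer(ids)

# Same result through an explicit one-hot matrix product
one_hot = torch.nn.functional.one_hot(ids, num_classes=toy_vocab_size).float()
via_matmul = one_hot @ toy_layer.weight

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

Because the lookup is equivalent to a linear layer applied to one-hot inputs, the embedding rows are ordinary weights and are updated by backpropagation like any other parameter.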


## 7. Reflection
**Why do embeddings encode meaning, and how are they related to NN concepts?**

Embeddings acquire meaning from context rather than from definitions. They rest on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings.

Mathematically, an embedding is a weight matrix in the first layer of the neural network, with one row of coordinates per token. The rows start out random. As the model trains on next-token prediction, backpropagation computes how each row contributed to the prediction error, and the optimizer nudges the coordinates to reduce it. If 'computador' and 'ordenador' (two Spanish words for "computer") appear in the same contexts, they receive similar updates and tend to end up close together in the embedding space. For a neural network, then, 'meaning' is geometric: it is captured by distances and directions between vectors, shaped by the pressure to reduce prediction error.
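
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses three hand-written 3-dimensional vectors to show the computation. The values are invented for illustration and do not come from a trained model; the variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors standing in for learned embedding rows
vec_computador = [0.9, 0.1, 0.3]
vec_ordenador = [0.8, 0.2, 0.35]   # used in the same contexts, so placed nearby
vec_unrelated = [-0.5, 0.9, -0.2]  # a word from very different contexts

sim_synonyms = cosine_similarity(vec_computador, vec_ordenador)
sim_unrelated = cosine_similarity(vec_computador, vec_unrelated)

print("synonyms :", round(sim_synonyms, 3))
print("unrelated:", round(sim_unrelated, 3))
```

In a trained model the same comparison would be run on rows of the embedding weight matrix rather than on hand-picked numbers.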
