# Notebook - Embeddings and Sliding Windows

**Escuela Colombiana de Ingeniería Julio Garavito**

**Student**: Santiago Botero García

This notebook implements the data preparation and embedding pipeline required to train an autoregressive language model.

Each section corresponds to a critical stage in transforming raw text into numerical tensors suitable for neural network optimization.

We progressively move from:

Raw text &rarr; Token IDs &rarr; Sliding Windows &rarr; Batches &rarr; Embeddings

Beyond implementation, each step is analyzed conceptually to understand its implications for:

- Large Language Models (LLMs)
- Neural representation learning
- Agentic system architectures

## Step 1: Setup and Dependencies

This section initializes the computational environment required to reproduce the embedding pipeline.

We import:

- torch &rarr; for tensor computation and neural network modules
- tiktoken &rarr; for GPT-style tokenization
- Dataset and DataLoader &rarr; for structured batching

We also fix the random seed to ensure reproducibility.

Why this matters:

LLMs are sensitive to initialization.
Reproducibility is critical for debugging, experimentation, and evaluation.

Even at this early stage, we are setting the foundation for deterministic training behavior.


In [1]:
%pip install torch tiktoken

import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(42)

<torch._C.Generator at 0x2124290e6b0>
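As a quick illustration of what fixing the seed buys us, the sketch below reseeds PyTorch's generator and draws the same random tensor twice. It is not part of the pipeline itself; the helper name `draw` and the tensor size are illustrative assumptions.

```python
import torch


def draw(seed: int) -> torch.Tensor:
    """Seed PyTorch's global generator, then draw a small random tensor."""
    torch.manual_seed(seed)
    return torch.rand(3)


a = draw(42)
b = draw(42)  # same seed -> same sequence of random numbers

print(a)
print(torch.equal(a, b))  # True
```

This only controls PyTorch's random number generators. Other sources of randomness (for example Python's `random` module, or nondeterministic GPU kernels) would need to be handled separately.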

## Step 2: Loading and Inspecting the Text

In this section, we load the raw text file that will serve as our training data.

This text is the only supervision signal the model will receive.

We inspect:

- Total character count
- A preview of the content

Why this matters:

Before tokenization, the text is an unstructured sequence of characters as far as the model is concerned.
The corpus determines:

- Vocabulary richness
- Context diversity
- Statistical regularities the model can learn

All semantic structure that will later emerge in embeddings
originates from this raw sequence of characters.

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("Total characters:", len(text))
print(text[:500])

Total characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it'


## Step 3: Tokenization with tiktoken

Here we transform raw text into discrete token IDs using GPT-2 tokenization.

The tokenizer converts text into integers that correspond to subword units.

This step performs the mapping:

Natural language &rarr; Discrete symbolic representation

Why this is critical:

Neural networks operate on numbers, not strings.
Tokenization defines:

- Vocabulary size $V$
- The number of rows in the embedding matrix ($V \times d$)
- The atomic prediction units of the model

Tokenization is not just preprocessing -
it defines the representational granularity of the entire model.

In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = tokenizer.encode(text)

print("Total tokens:", len(token_ids))
print("First 20 token IDs:", token_ids[:20])

Total tokens: 5145
First 20 token IDs: [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]


## Step 4: Building Input-Target Pairs (Sliding Window)

This section creates supervised learning examples from a single long token sequence.

We generate pairs:

Input  $\rightarrow x_0, \dots, x_n$  
Target $\rightarrow x_1, \dots, x_{n+1}$

Each pair trains the model to predict the next token.

The sliding window moves across the corpus with a configurable stride.

Why this is important:

Language modeling is framed as next-token prediction.

Without sliding windows:
- We would have only one training example.
- The model would not generalize across positions.

With sliding windows:
- We create thousands of overlapping training signals.
- The model learns conditional probabilities across contexts.

This transforms a static text into dynamic supervision.

In [4]:
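def create_input_target_pairs(token_ids, max_length=32, stride=16):
    """Slice token_ids into (input, target) windows for next-token prediction."""
    inputs = []
    targets = []

    # Stop early enough that the shifted target window stays inside token_ids.
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i : i + max_length]
        # The target is the input shifted one token to the right.
        target_chunk = token_ids[i + 1 : i + max_length + 1]

        inputs.append(torch.tensor(input_chunk))
        targets.append(torch.tensor(target_chunk))

    return inputs, targets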

max_length = 32
stride = 16

inputs, targets = create_input_target_pairs(token_ids, max_length, stride)

print("Number of samples:", len(inputs))
print("Shape of one input:", inputs[0].shape)

Number of samples: 320
Shape of one input: torch.Size([32])
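The windowing is easier to see on a toy sequence. The sketch below applies the same indexing to ten stand-in token IDs using plain Python lists; `sliding_windows` and `toy_ids` are hypothetical names introduced only for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is the input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        x = token_ids[i : i + max_length]
        y = token_ids[i + 1 : i + max_length + 1]
        pairs.append((x, y))
    return pairs


toy_ids = list(range(10))  # stand-in token IDs 0..9

pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for x, y in pairs:
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With a stride of 2 and a window of 4, consecutive windows share two tokens, which is the overlap discussed above.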


## Step 5: Creating a Dataset and DataLoader

Here we encapsulate the sliding window logic inside a PyTorch Dataset class.

This abstraction allows:

- Indexable training samples
- Modular data handling
- Clean separation between data and model logic

We then construct a DataLoader to:

- Shuffle samples
- Create mini-batches
- Improve computational efficiency

Why batching matters:

Neural networks train via gradient descent.
Mini-batching:

- Stabilizes gradient updates
- Enables GPU parallelism
- Improves training efficiency

This design mirrors large-scale LLM training pipelines.

In [5]:
class GPTDataset(Dataset):
    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = create_input_target_pairs(
            token_ids, max_length, stride
        )

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

dataset = GPTDataset(token_ids, max_length=32, stride=16)

dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    drop_last=True
)

batch_inputs, batch_targets = next(iter(dataloader))
print("Batch input shape:", batch_inputs.shape)

Batch input shape: torch.Size([4, 32])
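A useful sanity check on any batch is that the targets really are the inputs shifted by one position. The sketch below verifies this on synthetic token IDs; `ToyWindowDataset` is a hypothetical stand-in for the dataset class above, rebuilt here so the example is self-contained.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyWindowDataset(Dataset):
    """Hypothetical stand-in for GPTDataset, built on synthetic token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.samples = []
        for i in range(0, len(token_ids) - max_length, stride):
            x = torch.tensor(token_ids[i : i + max_length])
            y = torch.tensor(token_ids[i + 1 : i + max_length + 1])
            self.samples.append((x, y))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


toy_ids = list(range(100))  # synthetic token IDs, not real text
toy_dataset = ToyWindowDataset(toy_ids, max_length=8, stride=4)
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

for x, y in loader:
    # Default collation stacks samples along a new batch dimension.
    assert tuple(x.shape) == (4, 8) and tuple(y.shape) == (4, 8)
    # Targets are the inputs shifted left by one position.
    assert torch.equal(x[:, 1:], y[:, :-1])

print("Samples:", len(toy_dataset))
print("Batches per epoch:", len(loader))
```

Because `drop_last=True`, samples that do not fill a complete batch are skipped in each epoch, so the number of batches is the sample count divided by the batch size, rounded down.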


## Step 6: Token Embeddings with PyTorch

In this section, we define a trainable embedding layer.

Each token ID selects a row from an embedding matrix:

$E \in \mathbb{R}^{V \times d}$

This converts discrete symbols into dense vectors.

Why this transformation is fundamental:

Token IDs are arbitrary integers.
They carry no semantic meaning.

Embeddings create:

- Continuous vector representations
- Differentiable structures
- A geometric space where similarity can be measured

This is the first learned layer of the model.

It is the bridge between symbolic language and neural computation.

In [6]:
vocab_size = tokenizer.n_vocab
embedding_dim = 64

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

embedded_tokens = embedding_layer(batch_inputs)

print("Embedded batch shape:", embedded_tokens.shape)

Embedded batch shape: torch.Size([4, 32, 64])
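To see that the embedding layer is a row lookup, the sketch below compares the layer's output with direct indexing into its weight matrix. The tiny vocabulary and dimension are illustrative values chosen so the tensors are easy to inspect.

```python
import torch

torch.manual_seed(0)

vocab_size, embedding_dim = 10, 4  # tiny illustrative sizes
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

ids = torch.tensor([[2, 5, 2], [7, 0, 9]])  # shape (batch=2, seq=3)
vectors = embedding(ids)                    # shape (2, 3, 4)

# The layer's output is exactly the selected rows of its weight matrix.
same = torch.equal(vectors, embedding.weight[ids])
print(tuple(vectors.shape), same)

# The layer holds V * d trainable parameters.
print(embedding.weight.numel())  # 40

# For the GPT-2 vocabulary (50257 tokens) and d = 64:
print(50257 * 64)  # 3216448
```

Token 2 appears twice in the first row of `ids`, so the same embedding vector appears twice in the output.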


## Theoretical Foundations of Embedding Representations

### Why Tokenization and Windows Matter


Large Language Models (LLMs) do not operate on raw text. They operate on *discrete symbolic units* called tokens. Tokenization converts natural language into integer identifiers that can be processed mathematically.

This transformation is essential because neural networks require numerical input. A model cannot directly reason over strings like "painting" or "donkey" - it must operate over vectors and tensors.

The sliding window mechanism serves a second critical purpose: it converts a long sequence into many supervised training examples.

Each input-target pair corresponds to a next-token prediction task:

Input:  $x_0, x_1, x_2, \dots, x_n$  
Target: $x_1, x_2, x_3, \dots, x_{n+1}$

This structure is the foundation of autoregressive language modeling.

Without sliding windows:
- We would have only one extremely long sequence.
- Training would be inefficient.
- The model would not generalize across multiple local contexts.

With sliding windows:
- We create thousands of overlapping learning signals.
- The model repeatedly learns how context predicts continuation.
- We approximate the distribution $P(x_{n+1} \mid x_0, \dots, x_n)$, the probability of the next token given its context.

This process transforms raw text into structured supervision.
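One way to read a single input-target pair is as several prediction tasks at once: in an autoregressive model, every prefix of the input is a context and the matching target position is the token to predict. The sketch below spells this out using the first token IDs printed in Step 3; the helper `next_token_tasks` is a hypothetical name used only here.

```python
def next_token_tasks(input_ids, target_ids):
    """Expand one (input, target) window into (context, next_token) pairs."""
    return [(input_ids[: k + 1], target_ids[k]) for k in range(len(input_ids))]


# First token IDs of the corpus (from Step 3) and their shifted targets.
x = [40, 367, 2885, 1464]
y = [367, 2885, 1464, 1807]

tasks = next_token_tasks(x, y)
for context, nxt in tasks:
    print(context, "->", nxt)
# [40] -> 367
# [40, 367] -> 2885
# [40, 367, 2885] -> 1464
# [40, 367, 2885, 1464] -> 1807
```

A window of length `max_length` therefore contributes `max_length` prediction targets, not just one.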

### From Tokens to Embeddings


Token IDs are integers. However, integers alone do not encode semantic structure.

The embedding layer transforms each token ID into a dense vector of fixed dimension:

Embedding: $\{0, 1, \dots, V-1\} \rightarrow \mathbb{R}^d$

Instead of representing the word "painting" by an integer ID alone (1234 here is an illustrative value), we represent it as a vector like:

[0.12, -0.87, 0.44, ..., 0.03]

This vector is *learned* during training.

Why is this necessary?

Neural networks operate through linear algebra. They compute:

$W \cdot x + b$

To do this meaningfully, tokens must live in a continuous vector space.

Embeddings provide:
- A continuous geometry
- Differentiability
- A structure that enables similarity comparisons

This is the first moment where language becomes geometry.
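The link between lookup and linear algebra can be made explicit: selecting row $i$ of the embedding matrix gives the same result as multiplying a one-hot vector for token $i$ by that matrix. The sketch below checks this on a small random matrix; `V`, `d`, and `E` mirror the symbols used in the text, with illustrative sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

V, d = 6, 3
E = torch.randn(V, d)  # stand-in embedding matrix

ids = torch.tensor([4, 1, 4])
one_hot = F.one_hot(ids, num_classes=V).float()  # shape (3, V)

via_matmul = one_hot @ E  # linear-algebra view
via_lookup = E[ids]       # row-selection view

print(torch.allclose(via_matmul, via_lookup))  # True
```

In practice the lookup form is preferred because it avoids building and multiplying large, mostly-zero one-hot matrices.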

### Why Do Embeddings Encode Meaning?

Embeddings encode meaning because of how they are trained.

They are not manually assigned. They are optimized through gradient descent to minimize prediction error in next-token prediction tasks.

If two words appear in similar contexts, the model lowers its loss by giving them similar representations.

Therefore:

Words that share contexts &rarr; Receive similar gradient updates &rarr; Tend to move closer in vector space.

This is a consequence of:

1. Distributional hypothesis ("You shall know a word by the company it keeps," J. R. Firth).
2. Shared parameterization in neural networks.
3. Backpropagation aligning representations to minimize loss.

Mathematically:

The embedding matrix is simply a lookup table:
$E\in\mathbb{R}^{V\times d}$

Selecting a token corresponds to selecting a row.

During training:
- Prediction errors propagate backward.
- Embedding rows are adjusted.
- Geometric structure emerges.

Thus, meaning is not stored symbolically.
It is encoded geometrically as position in vector space.

Embeddings are neural network parameters.
They are the first learned layer of the model.
They transform discrete symbols into continuous semantic structure.
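The statement that embedding rows are adjusted by backpropagation can be observed directly: after a backward pass, only the rows of tokens that appeared in the input carry a nonzero gradient. The sketch below uses a placeholder loss (a plain sum) instead of a real language-modeling loss, so it shows where gradients flow, not how meaning emerges.

```python
import torch

torch.manual_seed(0)

embedding = torch.nn.Embedding(8, 4)
ids = torch.tensor([1, 3, 3])  # token 3 appears twice

# Placeholder scalar loss so that backward() has something to differentiate.
loss = embedding(ids).sum()
loss.backward()

grad = embedding.weight.grad         # shape (8, 4)
touched = grad.abs().sum(dim=1) > 0  # rows that received gradient
print(touched.nonzero().flatten().tolist())  # [1, 3]

# Token 3 was used twice, so its row accumulates twice the gradient of row 1.
print(torch.equal(grad[3], 2 * grad[1]))  # True
```

Rows for tokens that never occur in a batch are left untouched by that update, which is why an embedding is only as good as the contexts its token has been seen in.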

### Connection to Agentic Systems

Embeddings are not only foundational for LLMs - they are foundational for agentic systems.

In agentic architectures, embeddings enable:

1. Memory retrieval (vector databases)
2. Tool selection
3. Context compression
4. Semantic search
5. Planning based on similarity

When an agent retrieves relevant documents, it typically compares the query embedding against stored document embeddings with a similarity measure such as cosine similarity.

Without embeddings:
- There is no semantic memory.
- No similarity reasoning.
- No contextual retrieval.

In modern agent systems:
Text &rarr; Embedding &rarr; Vector store &rarr; Similarity search &rarr; Augmented context &rarr; LLM reasoning

Thus, embeddings form the bridge between:
- Raw experience
- Stored memory
- Reasoned action

They are the geometric substrate of intelligent behavior.
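The retrieval step of that pipeline can be sketched in a few lines. The example below is a minimal illustration under strong assumptions: the "vector store" is a hand-written tensor, the document names and vectors are made up, and a real system would obtain embeddings from a trained model instead.

```python
import torch
import torch.nn.functional as F

# Hypothetical "vector store": one hand-written embedding per stored document.
doc_names = ["painting notes", "travel diary", "portrait sketch"]
doc_vectors = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])

query = torch.tensor([1.0, 0.2, 0.0])  # made-up embedding of the agent's query

# Cosine similarity between the query and every stored vector.
scores = F.cosine_similarity(query.unsqueeze(0), doc_vectors, dim=1)

# Keep the two most similar documents as retrieved context.
top = torch.topk(scores, k=2)
retrieved = [doc_names[i] for i in top.indices.tolist()]
print(retrieved)  # ['painting notes', 'portrait sketch']
```

Cosine similarity depends only on the direction of the vectors, not their length, which is one reason it is a common choice for comparing embeddings.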

## Step 7: Experiment: Changing max_length and stride

max_length controls the context window size.
stride controls how much overlap exists between samples.

If stride == max_length:
- No overlap
- Fewer samples
- Less redundancy
- Lower training signal density

If stride < max_length:
- Overlapping windows
- More samples
- Higher computational cost
- Better context continuity

Why is overlap important?

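Because language dependencies span chunk boundaries.

Without overlap:
Each token appears as an input in exactly one window, so tokens near the start of a window are only ever predicted from a short left context.

With overlap:
The same tokens appear at different positions in several windows, with varying amounts of left context.
This gives the model a more varied training signal, at the cost of redundancy.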

Trade-off:
More overlap &rarr; Better learning signal &rarr; Higher compute.
Less overlap &rarr; Faster training &rarr; Less contextual smoothing.

In [7]:
def count_samples(token_ids, max_length, stride):
    # Same window start positions as create_input_target_pairs, counted without building tensors.
    return len(range(0, len(token_ids) - max_length, stride))

configs = [
    (32, 32),
    (32, 16),
    (64, 32),
    (64, 64),
]

for max_len, step in configs:
    # Use a name other than `stride` so the global from Step 4 is not overwritten.
    n_samples = count_samples(token_ids, max_len, step)
    print(f"max_length={max_len}, stride={step} -> samples={n_samples}")

max_length=32, stride=32 -> samples=160
max_length=32, stride=16 -> samples=320
max_length=64, stride=32 -> samples=159
max_length=64, stride=64 -> samples=80
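These counts follow a simple closed form: with $N$ tokens, the loop yields $\lceil (N - \text{max\_length}) / \text{stride} \rceil$ windows. The sketch below checks that formula against the loop's own `range` and also reports the fraction of each window shared with the next one. It uses the token count printed in Step 3; `num_windows` is a hypothetical helper name.

```python
import math


def num_windows(n_tokens, max_length, stride):
    """Closed-form count of range(0, n_tokens - max_length, stride)."""
    return max(0, math.ceil((n_tokens - max_length) / stride))


n_tokens = 5145  # total tokens reported in Step 3

for max_len, step in [(32, 32), (32, 16), (64, 32), (64, 64)]:
    closed_form = num_windows(n_tokens, max_len, step)
    by_iteration = len(range(0, n_tokens - max_len, step))
    overlap = 1 - step / max_len  # fraction of a window shared with the next
    print(max_len, step, closed_form, by_iteration, f"{overlap:.0%}")
# 32 32 160 160 0%
# 32 16 320 320 50%
# 64 32 159 159 50%
# 64 64 80 80 0%
```

Halving the stride roughly doubles the number of samples, which is the compute side of the trade-off described above.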
