# My Learning Notes: Understanding Sliding Window DataLoader

**What I'm practicing**: Building intuition for how sliding windows work in LLM training

**My goal**: Understand how the dataloader creates overlapping sequences from text data

**Why this matters**: This is how we prepare training data for next-token prediction in language models

**Reference**: Based on concepts from "Build a Large Language Model From Scratch" by Sebastian Raschka

## Using Numbers to Build Intuition

I'm starting with simple number data instead of text tokens. This makes it easier to see exactly what the sliding window is doing.

In [None]:
from importlib.metadata import version
import torch

print("torch version:", version("torch"))

I'm creating a simple dataset of numbers 0 to 1000. This will help me visualize how the sliding window creates training examples:

```
0 1 2 3 4 5 6 7 8 9 10 11 12 ... 1000
```

In [None]:
# Creating a text file with numbers 0 to 1000
with open("number-data.txt", "w", encoding="utf-8") as f:
    for number in range(1001):
        f.write(f"{number} ")

print("Created number-data.txt with numbers 0-1000")

Now I'm adapting the GPT dataset class. Instead of using a tokenizer, I'm parsing integers directly from the file:

In [None]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    """
    My understanding: This dataset creates input-target pairs using a sliding window.
    - input_chunk: the context (what the model sees)
    - target_chunk: the next tokens (what the model should predict)
    """
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # For number data, I'm parsing integers instead of using a tokenizer
        token_ids = [int(i) for i in txt.strip().split()]

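        # Sliding window: each step produces one (input, target) pair
        # stride determines how far the window moves each time;
        # consecutive windows overlap whenever stride < max_length.
        # Stopping at len(token_ids) - max_length leaves room for the shifted target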
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
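
To convince myself of what the loop in `__init__` does, here is a minimal sketch of the same windowing logic in plain Python (no PyTorch, no `Dataset`). The helper name `sliding_pairs` and the 10-number toy input are my own assumptions for illustration, not part of the book's code:

```python
def sliding_pairs(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    # Same loop bound as in GPTDatasetV1: stop early enough that the
    # shifted target window still contains max_length tokens.
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


tokens = list(range(10))  # toy data: 0..9
pairs = sliding_pairs(tokens, max_length=4, stride=1)

for inp, tgt in pairs[:3]:
    print(inp, "->", tgt)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [1, 2, 3, 4] -> [2, 3, 4, 5]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
```

With 10 tokens, `max_length=4`, and `stride=1`, this gives 6 pairs, and the last one is `[5, 6, 7, 8] -> [6, 7, 8, 9]`: the final token only ever appears as a target.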

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    """
    My understanding: This creates a PyTorch DataLoader that:
    - Batches training examples
    - Optionally shuffles data
    - Optionally drops incomplete batches
    """
    # tokenizer not needed for number data
    tokenizer = None

    # Create dataset with sliding window
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader for batching
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
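
A rough sketch of how `drop_last` changes the number of batches, using only arithmetic. The helper names `count_examples` and `count_batches` are hypothetical, and the settings (1001 tokens, `max_length=4`, `stride=1`, `batch_size=4`) are picked just for this illustration:

```python
import math


def count_examples(num_tokens, max_length, stride):
    # Mirrors the dataset loop: one example per start index in the range.
    return len(range(0, num_tokens - max_length, stride))


def count_batches(num_examples, batch_size, drop_last):
    if drop_last:
        # The incomplete final batch is discarded.
        return num_examples // batch_size
    # The incomplete final batch is kept.
    return math.ceil(num_examples / batch_size)


n = count_examples(num_tokens=1001, max_length=4, stride=1)
print(n)                                          # 997
print(count_batches(n, 4, drop_last=True))        # 249
print(count_batches(n, 4, drop_last=False))       # 250
```

Here 997 examples do not divide evenly by 4, so `drop_last=True` throws away the single leftover example instead of yielding a batch of size 1.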

## Experiment 1: Small Context Window

Let me test with a small context size (4 tokens) and stride of 1. This will show me how the sliding window moves through the data:

In [None]:
with open("number-data.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print(f"Loaded {len(raw_text.split())} numbers")

In [None]:
dataloader = create_dataloader_v1(
    raw_text, 
    batch_size=1,      # I'm looking at one example at a time
    max_length=4,      # Context window of 4 tokens
    stride=1,          # Move window by 1 position each time
    shuffle=False      # Don't shuffle so I can see the pattern
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("First batch (input, target):")
print(first_batch)

In [None]:
second_batch = next(data_iter)
print("Second batch (notice it moved by 1):")
print(second_batch)

In [None]:
third_batch = next(data_iter)
print("Third batch (moved by 1 again):")
print(third_batch)

print("\n📝 My observation: With stride=1, the window slides one position at a time")

In [None]:
# Let me check the last batch to see the end of the sequence
for batch in dataloader:
    pass

last_batch = batch
print("Last batch:")
print(last_batch)
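
I can also predict the last batch without iterating, as a sanity check. This is a small sketch that assumes the token values equal their positions (true for my 0 to 1000 number data) and reuses the loop bound from `GPTDatasetV1`:

```python
num_tokens = 1001          # numbers 0..1000
max_length, stride = 4, 1  # same settings as Experiment 1

# Start indices the dataset loop visits.
starts = range(0, num_tokens - max_length, stride)
last_start = starts[-1]

# Because token i has value i, the windows can be written down directly.
last_input = list(range(last_start, last_start + max_length))
last_target = list(range(last_start + 1, last_start + max_length + 1))

print(len(starts))   # 997
print(last_input)    # [996, 997, 998, 999]
print(last_target)   # [997, 998, 999, 1000]
```

So with `batch_size=1` there should be 997 batches, and the last target ends exactly at 1000, the final number in the file.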

## Experiment 2: Batched Examples with Larger Stride

Now I'm testing with batch_size=2 and stride=4. This shows how multiple examples are grouped together:

In [None]:
dataloader = create_dataloader_v1(
    raw_text, 
    batch_size=2,      # Two examples per batch
    max_length=4,      # Still using 4-token context
    stride=4,          # Non-overlapping windows (stride = max_length)
    shuffle=False
)

# Looking at the last batch
for inputs, targets in dataloader:
    pass

print("Inputs (2 examples, each with 4 tokens):")
print(inputs)
print("\nTargets (shifted by 1):")
print(targets)
print("\n📝 My observation: Each row is one training example, batched together")
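
To check what I expect the last batch to look like with `stride=4` and `batch_size=2`, here is a plain-Python sketch that builds the examples and groups them by hand. It only mimics what the `DataLoader` does with `shuffle=False` and `drop_last=True`; it is not how PyTorch implements batching:

```python
tokens = list(range(1001))
max_length, stride, batch_size = 4, 4, 2

# One (input, target) pair per window, as in the dataset class.
examples = [
    (tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
    for i in range(0, len(tokens) - max_length, stride)
]

# Keep only full batches (mimics drop_last=True), in original order.
full = len(examples) - len(examples) % batch_size
batches = [examples[i:i + batch_size] for i in range(0, full, batch_size)]

last_inputs = [inp for inp, _ in batches[-1]]
last_targets = [tgt for _, tgt in batches[-1]]

print(len(examples), len(batches))  # 250 125
print(last_inputs)   # [[992, 993, 994, 995], [996, 997, 998, 999]]
print(last_targets)  # [[993, 994, 995, 996], [997, 998, 999, 1000]]
```

The input rows do not share any numbers, which is what `stride = max_length` should give. The last target of one row (996) is still the first input of the next row, because targets are shifted by one.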

## Experiment 3: Effect of Shuffling

Finally, let me see what happens when I shuffle the data:

In [None]:
torch.manual_seed(123)
dataloader = create_dataloader_v1(
    raw_text, 
    batch_size=2, 
    max_length=4, 
    stride=4, 
    shuffle=True  # Shuffling enabled
)

for inputs, targets in dataloader:
    pass

print("Inputs (shuffled):")
print(inputs)
print("\nTargets (shuffled):")
print(targets)
print("\n📝 My observation: The order is randomized, but each (input, target) pair stays together")
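
A quick way to test the "pairs stay together" claim, sketched in plain Python. I'm assuming the number data (so every target must equal its input plus one), and I use Python's `random` module as a stand-in for the `DataLoader`'s shuffling; the resulting order will not match what `torch.manual_seed(123)` produces:

```python
import random

tokens = list(range(1001))
max_length, stride = 4, 4

examples = [
    (tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
    for i in range(0, len(tokens) - max_length, stride)
]

# Shuffle whole (input, target) pairs, never the tokens inside them.
rng = random.Random(123)
shuffled = examples[:]
rng.shuffle(shuffled)

# Each target is still its own input shifted by one...
pairs_intact = all(tgt == [x + 1 for x in inp] for inp, tgt in shuffled)
# ...and no example was lost or duplicated, only reordered.
same_examples = sorted(shuffled) == sorted(examples)

print(pairs_intact, same_examples)  # True True
```

Shuffling changes which examples land in which batch, but the unit being shuffled is the whole pair.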

## My Takeaways

✅ **Sliding window**: The dataset slides a fixed-size window across the token sequence; consecutive windows overlap whenever `stride < max_length`

✅ **Stride controls overlap**:
- `stride = 1`: Maximum overlap (window moves 1 position)
- `stride = max_length`: No overlap (non-overlapping chunks)
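
The two cases above are the ends of a range, which I can sketch as simple arithmetic. `window_overlap` and `skipped_tokens` are hypothetical helper names of mine, and they describe the input windows only:

```python
def window_overlap(max_length, stride):
    """Tokens shared by two consecutive input windows."""
    return max(0, max_length - stride)


def skipped_tokens(max_length, stride):
    """Tokens between two consecutive input windows that neither covers."""
    return max(0, stride - max_length)


max_length = 4
for stride in (1, 2, 4, 6):
    print(stride, window_overlap(max_length, stride), skipped_tokens(max_length, stride))
# 1 3 0
# 2 2 0
# 4 0 0
# 6 0 2
```

A stride larger than `max_length` would leave gaps, so some tokens would never be used as input.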

✅ **Input-target pairs**: Targets are just inputs shifted by one position (for next-token prediction)

✅ **Batching**: Multiple examples are grouped together for efficient training

✅ **Shuffling**: Randomizes the order of examples (but keeps input-target pairs together)

**What I still need to understand**: How to handle padding for sequences of different lengths