# Input-Target Pairs for Language Model Training

This notebook demonstrates how to create input-target pairs for training language models. The key concept is that given a sequence of tokens, we want to predict the next token in the sequence.

## Overview

- **Input**: A sequence of tokens (the context)
- **Target**: The same sequence shifted by 1 position, so each position holds the next token to predict
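
As a minimal sketch of the shift-by-one relationship, the snippet below builds one input-target pair from a short list of token IDs. The IDs are the first five tokens of the text encoded later in this notebook; the variable names are illustrative only.

```python
# First five token IDs of the text encoded later in this notebook
token_ids = [40, 367, 2885, 1464, 1807]

# Input is every token except the last; target is every token except the first
inputs = token_ids[:-1]
targets = token_ids[1:]

# At each position, the target is the token that follows the input token
for x, y in zip(inputs, targets):
    print(x, "---->", y)
```

Both lists have the same length, which is what lets a model compute a next-token loss at every position of the input.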


In [14]:
!pip install tiktoken



In [15]:
import tiktoken

## Understanding Input-Target Pairs

The core concept: for each position in the sequence, the **input** is all previous tokens, and the **target** is the next token to predict.

For example:
- Input: `[40]` → Target: `367` (predicting the second token given the first)
- Input: `[40, 367]` → Target: `2885` (predicting the third token given the first two)
- And so on...

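The context grows by one token at each step, so every prediction conditions on all tokens seen so far. The data loader built later in this notebook caps the context at a fixed length and slides that window across the text.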


In [16]:
tokenizer = tiktoken.get_encoding("gpt2")

In [17]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode text to token IDs - each word/subword becomes a number
enc_text = tokenizer.encode(text)
print(len(enc_text))

5145


## Building a Data Loader

The rest of this notebook implements a PyTorch `Dataset` and `DataLoader` that create input-target pairs efficiently for training. Before that, the next cells illustrate the pairs on a small sample of token IDs.

The implementation will:
1. Tokenize the entire text
2. Use a sliding window to create overlapping sequences
3. Create input-target pairs where targets are shifted by 1 position
4. Return batches of data for training


In [18]:
enc_sample = enc_text[:50]
print(enc_sample)

[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]


In [19]:
# Demonstrate input-target pairs: the context predicts the next token
context_size = 5
for i in range(1, context_size + 1):
    context = enc_sample[:i]  # Input: all tokens before position i
    target = enc_sample[i]    # Target: the token at position i

    print(context, "---->", target)

[40] ----> 367
[40, 367] ----> 2885
[40, 367, 2885] ----> 1464
[40, 367, 2885, 1464] ----> 1807
[40, 367, 2885, 1464, 1807] ----> 3619


### DataLoader Helper Function

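The `create_dataloader_v1` helper defined later in this notebook creates a DataLoader with the following parameters:
- **`batch_size`**: Number of sequences per batch
- **`max_length`**: Context window size, in tokens
- **`stride`**: Number of tokens the window advances between consecutive sequences (a smaller stride means more overlap)
- **`shuffle`**: Whether to randomize the order of sequences
- **`drop_last`**: Whether to drop the final batch when it has fewer than `batch_size` sequences (helps avoid loss spikes during training)
- **`num_workers`**: Number of worker subprocesses used for data loading (`0` loads data in the main process)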
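
To see how `max_length` and `stride` determine the dataset size, the sketch below counts the window start positions produced by a loop of the form `range(0, n_tokens - max_length, stride)`, the bound used by the dataset class in this notebook. The helper name `count_windows` is hypothetical and introduced only for illustration.

```python
def count_windows(n_tokens, max_length, stride):
    """Count start positions i in range(0, n_tokens - max_length, stride).

    Each start needs max_length input tokens plus one extra token for the
    shifted target, so the last valid start is n_tokens - max_length - 1.
    """
    return len(range(0, n_tokens - max_length, stride))

n_tokens = 5145  # token count printed above for the-verdict.txt

print(count_windows(n_tokens, max_length=4, stride=1))      # 5141
print(count_windows(n_tokens, max_length=4, stride=4))      # 1286
print(count_windows(n_tokens, max_length=256, stride=128))  # 39
```

A smaller stride yields more (and more heavily overlapping) sequences from the same text; a stride equal to `max_length` yields the fewest sequences that still cover the text without gaps.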



In [20]:
# Same concept decoded back to readable text
context_size = 5
for i in range(1, context_size + 1):
    context = enc_text[:i]
    target = enc_text[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([target]))

I ---->  H
I H ----> AD
I HAD ---->  always
I HAD always ---->  thought
I HAD always thought ---->  Jack


## Implement Data Loader

**Step 1:** Tokenize the entire text

**Step 2:** Use a sliding window to chunk the book into overlapping sequences of `max_length` tokens

**Step 3:** Return the total number of rows in the dataset (`__len__`)

**Step 4:** Return a single row from the dataset, an input-target pair (`__getitem__`)

In [21]:
import torch
from torch.utils.data import DataLoader, Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Create overlapping sequences using sliding window
        # stride controls overlap: stride=1 = maximum overlap, stride=max_length = no overlap
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            # Target is input shifted by 1 position (next token prediction)
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

## Dataset and DataLoader Setup

**Step 1:** Initialize the tokenizer

**Step 2:** Create dataset

**Step 3:** Create the DataLoader. `drop_last=True` drops the last batch if it is shorter than the specified `batch_size`, which helps avoid loss spikes during training.

**Step 4:** `num_workers` sets the number of CPU worker processes used for data loading (`0` loads data in the main process)
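
The effect of `drop_last` on the number of batches per epoch can be checked with plain arithmetic. The sketch below assumes a dataset of 39 sequences, an illustrative figure; `batches_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
import math

def batches_per_epoch(n_samples, batch_size, drop_last):
    """Number of batches yielded per epoch for a dataset of n_samples rows."""
    if drop_last:
        # The short final batch is discarded
        return n_samples // batch_size
    # The short final batch is kept
    return math.ceil(n_samples / batch_size)

n_samples = 39  # illustrative dataset size (assumption)

print(batches_per_epoch(n_samples, batch_size=4, drop_last=True))   # 9
print(batches_per_epoch(n_samples, batch_size=4, drop_last=False))  # 10
```

With `drop_last=True`, the three leftover sequences in this example are skipped for that epoch; with shuffling enabled, a different set of sequences is left over each time.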


In [22]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # drop_last=True prevents loss spikes from incomplete batches
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


## Examples: Using the DataLoader

### Example 1: Maximum Overlap (stride=1)

With `stride=1`, we get maximum overlap between sequences. Each sequence starts just 1 token after the previous one, creating many training examples from the same text.


In [23]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [24]:
# Example: max_length=4 (context window), stride=1 (maximum overlap)
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [25]:
# Second batch: sliding window moved 1 token forward (overlapping sequences)
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


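### Example 2: No Overlap (stride=4)

With `stride=4` and `max_length=4`, each input sequence starts where the previous one ended, so the input windows do not overlap. The resulting dataset is about four times smaller than with `stride=1`, which saves memory but provides fewer training examples from the same text.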


In [26]:
# Example with stride=4 equal to max_length (no overlap): sequences start 4 tokens apart
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

input_ids, target_ids = next(iter(dataloader))

print("\nInput IDs:\n ", input_ids)
print("\nTarget IDs (shifted by 1):\n ", target_ids)


Input IDs:
  tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Target IDs (shifted by 1):
  tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
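
The overlap between consecutive windows can also be inspected without PyTorch. The sketch below slides a window of length 4 over the first nine token IDs shown above and reports how many tokens neighboring input windows share. The helper `input_windows` is hypothetical and mirrors only the slicing logic of the dataset class.

```python
def input_windows(token_ids, max_length, stride):
    """Return the input chunks a sliding window would produce."""
    return [token_ids[i:i + max_length]
            for i in range(0, len(token_ids) - max_length, stride)]

# First nine token IDs of the-verdict.txt, as printed earlier
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]
max_length = 4

for stride in (1, 4):
    windows = input_windows(token_ids, max_length, stride)
    # Consecutive windows share max_length - stride tokens (never fewer than 0)
    overlap = max(0, max_length - stride)
    print(f"stride={stride}: {len(windows)} windows, overlap={overlap} tokens")
    print(windows[:2])
```

For `stride=4`, the first two windows match the first two rows of the input tensor above; for `stride=1`, they match the inputs of the first and second batches in Example 1.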
