In [4]:
from google.colab import files
uploaded = files.upload()

Saving 01 Harry Potter and the Sorcerers Stone.txt to 01 Harry Potter and the Sorcerers Stone (1).txt


In [6]:
filename = list(uploaded.keys())[0]

with open(filename, 'r', encoding= 'utf-8') as f:
  content = f.read()

print(content[:500])

M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amoun


#CREATING INPUT-TARGET PAIRS
In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.
To get started, we will first tokenize the whole Harry Potter story we worked with earlier using the BPE tokenizer introduced in the previous section:


In [7]:
! pip3 install tiktoken



In [8]:
import importlib
import tiktoken


In [9]:
tokenizer = tiktoken.get_encoding('gpt2')

In [10]:
enc_txt = tokenizer.encode(content)
print(len(enc_txt))

124336


Next, we remove the first 50 tokens from the dataset for demonstration purposes as it results in a slightly more interesting text passage in the next steps:

In [11]:
enc_sam = enc_txt[50:]

One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1.
The context size determines how many tokens are included in the input.

In [15]:
context_size = 4  # number of tokens in each input
# A context_size of 4 means the model looks at a sequence of 4 tokens
# and is trained to predict the token that follows each position.
# If the input x covers positions [1, 2, 3, 4], the target y covers [2, 3, 4, 5]:
# the same window shifted one position to the right.

x = enc_sam[:context_size]
y = enc_sam[1:context_size +1]

print(f"x: {x}")
print(f"y: {y}")

x: [11428, 11, 780, 484]
y: [11, 780, 484, 655]


Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [16]:
for i in range(1, context_size + 1):
  context = enc_sam[:i]
  target = enc_sam[i]
  print(f"x: {context} y: {target}")

x: [11428] y: 11
x: [11428, 11] y: 780
x: [11428, 11, 780] y: 484
x: [11428, 11, 780, 484] y: 655


In [21]:
for i in range(1, context_size + 1):
  context = enc_sam[:i]
  target = enc_sam[i]
  print(tokenizer.decode(context), tokenizer.decode([target]))

 mysterious ,
 mysterious,  because
 mysterious, because  they
 mysterious, because they  just


We've now created the input-target pairs that we can use for LLM training in upcoming chapters.

There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.

#importing data loaders

In [24]:
import torch
print(torch.__version__)



2.8.0+cu126


Step 1: Tokenize the entire text

Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset

In [25]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The GPTDatasetV1 class above is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.
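
To make the chunking concrete, here is a minimal pure-Python sketch of the same sliding-window loop, run on a toy list of integers instead of real token IDs. The helper name `sliding_window_rows` and the toy data are assumptions made for illustration; they are not part of the notebook's dataset class, and plain lists stand in for tensors.

```python
def sliding_window_rows(token_ids, max_length, stride):
    """Build (input_chunk, target_chunk) pairs with the same loop bounds
    as the dataset class above, using plain lists instead of tensors."""
    rows = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        # The target is the same window shifted one position to the right.
        target_chunk = token_ids[i + 1:i + max_length + 1]
        rows.append((input_chunk, target_chunk))
    return rows


toy_ids = list(range(10))  # hypothetical stand-in for token IDs: 0..9
rows = sliding_window_rows(toy_ids, max_length=4, stride=2)

for input_chunk, target_chunk in rows:
    print(input_chunk, "->", target_chunk)
print("number of rows:", len(rows))
```

Because the loop stops at `len(token_ids) - max_length`, every window still has a full-length target, so no row needs padding.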

I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader. This will bring additional intuition and clarity.

The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:

Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training

Step 4: The number of CPU processes to use for preprocessing

In [26]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
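
The effect of drop_last on the number of batches can be sketched with plain arithmetic. The helper `num_batches` below is a hypothetical name used only for illustration. It mirrors how the batch count of a map-style dataset depends on drop_last, without constructing a DataLoader.

```python
import math


def num_batches(num_rows, batch_size, drop_last):
    """Number of batches produced for a dataset with num_rows rows."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_rows // batch_size
    # An incomplete final batch is kept as a shorter batch.
    return math.ceil(num_rows / batch_size)


num_rows, batch_size = 10, 4  # toy numbers, chosen for illustration
print("drop_last=True :", num_batches(num_rows, batch_size, drop_last=True))
print("drop_last=False:", num_batches(num_rows, batch_size, drop_last=False))
print("rows left for a short final batch:", num_rows % batch_size)
```

With 10 rows and a batch size of 4, keeping the last batch would yield a final batch of only 2 rows, which is the kind of undersized batch that drop_last=True avoids.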

Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function:

In [27]:
dataloader = create_dataloader_v1(
    content, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[ 44, 374,  13, 290]]), tensor([[ 374,   13,  290, 9074]])]


Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.

To illustrate the meaning of stride=1, let's fetch another batch from this dataset:

In [28]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 374,   13,  290, 9074]]), tensor([[  13,  290, 9074,   13]])]
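
Comparing the two batches, the second input is the first input shifted by one position, which is what stride=1 means: consecutive windows share all but one token. The sketch below tabulates how the stride changes the window start positions and the overlap between neighbouring windows. The helper names and the toy sequence length of 13 are assumptions for illustration only.

```python
def window_starts(num_tokens, max_length, stride):
    """Start indices of the windows, using the same bounds as the dataset loop."""
    return list(range(0, num_tokens - max_length, stride))


def shared_tokens(max_length, stride):
    """How many input tokens two consecutive windows have in common."""
    return max(0, max_length - stride)


for stride in (1, 2, 4):
    starts = window_starts(num_tokens=13, max_length=4, stride=stride)
    print(
        f"stride={stride}: {len(starts)} windows, starts={starts}, "
        f"overlap={shared_tokens(4, stride)} tokens"
    )
```

A small stride yields more training rows from the same text but with heavy overlap between them. Setting the stride equal to max_length uses every token as an input exactly once, with no overlap between rows.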


Before creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:

In [29]:
dataloader = create_dataloader_v1(content, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   44,   374,    13,   290],
        [ 9074,    13,   360,  1834],
        [ 1636,    11,   286,  1271],
        [ 1440,    11,  4389, 16809],
        [ 9974,    11,   547,  6613],
        [  284,   910,   326,   484],
        [  547,  7138,  3487,    11],
        [ 5875,   345,   845,   881]])

Targets:
 tensor([[  374,    13,   290,  9074],
        [   13,   360,  1834,  1636],
        [   11,   286,  1271,  1440],
        [   11,  4389, 16809,  9974],
        [   11,   547,  6613,   284],
        [  910,   326,   484,   547],
        [ 7138,  3487,    11,  5875],
        [  345,   845,   881,    13]])
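
As a quick sanity check on the output above, the first three rows of the printed batch can be copied into plain Python lists and tested for two properties: each target row is its input row shifted by one token, and, because stride equals max_length here, each row starts exactly where the previous one ended. The variable names are illustrative, and only three of the eight rows are copied to keep the sketch short.

```python
# First three rows of the printed inputs and targets, copied from the output above.
inputs = [
    [44, 374, 13, 290],
    [9074, 13, 360, 1834],
    [1636, 11, 286, 1271],
]
targets = [
    [374, 13, 290, 9074],
    [13, 360, 1834, 1636],
    [11, 286, 1271, 1440],
]

# Property 1: every target row equals its input row shifted by one position.
shifted_by_one = all(t[:-1] == x[1:] for x, t in zip(inputs, targets))

# Property 2: with stride == max_length, the last target token of a row
# is the first input token of the next row, so rows do not overlap.
contiguous = all(
    targets[r][-1] == inputs[r + 1][0] for r in range(len(inputs) - 1)
)

print("targets are inputs shifted by one:", shifted_by_one)
print("rows are contiguous without overlap:", contiguous)
```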
