
In [12]:
import requests
import tiktoken

import torch
from torch.utils.data import Dataset, DataLoader

In [5]:
url = "https://raw.githubusercontent.com/RCortez25/PhD/main/LLM/0.%20Tokenizer/the-veredict.txt"
response = requests.get(url)

# Save the downloaded content to a file
with open("the-veredict.txt", "w", encoding="utf-8") as f:
    f.write(response.text)

with open("the-veredict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

oTokenizer = tiktoken.get_encoding("gpt2")
encoded_text = oTokenizer.encode(raw_text)
print(f"Number of tokens: {len(encoded_text)}")

Number of tokens: 5146


In [6]:
# Skip the first 50 tokens for demonstration purposes
encoded_sample = encoded_text[50:]

To create input-target pairs, we define two variables:

$x$: inputs

$y$: targets

For example:

$x=[1,2,3,4]$

$y=[2,3,4,5]$

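The `context_size` determines how many tokens each input contains. The targets are the same window shifted one token to the right, so each target entry is the token that follows the corresponding input entry.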

In [7]:
context_size = 4

# Creating input-target pair
x = encoded_sample[:context_size]
y = encoded_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


The following example shows how the LLM processes the text autoregressively: at each step the context grows by one token, and the model must predict the token that comes next.

In [8]:
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [9]:
# The same as before but with decoded text
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]

    print(oTokenizer.decode(context), "---->", oTokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Next, we gather all input-target pairs into tensors for training the LLM. This gives an $x$ tensor of inputs and a $y$ tensor of targets.

The first entry of the $x$ tensor will be paired with the first entry of the $y$ tensor. The second entry of $x$ will be paired with the second entry of $y$, and so on.

Note: In the class below, the context size is the width of the sliding window, and the `stride` parameter controls how many positions the window moves between consecutive samples. For example, treating each word as one token:

"In the heart of the city stood the old library"

`context_size=4, stride=1`

Input of first sample = "In the heart of"

Input of second sample = "the heart of the"

`context_size=4, stride=4`

Input of first sample = "In the heart of"

Input of second sample = "the city stood the"
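
The two cases above can be reproduced with a short sketch that treats each whitespace-separated word as one token. This is an assumption made only for readability (the GPT-2 tokenizer used in this notebook splits text into subword units), and `sliding_windows` is a hypothetical helper, not part of the class below. It uses the same loop bound as that class, which leaves room for a target shifted by one token.

```python
def sliding_windows(tokens, context_size, stride):
    """Return the input windows of `context_size` tokens, moving by `stride`."""
    windows = []
    # Stop early enough that a target shifted by one token would still fit
    for start in range(0, len(tokens) - context_size, stride):
        windows.append(tokens[start:start + context_size])
    return windows

# Assumption: one word = one token, purely for illustration
words = "In the heart of the city stood the old library".split()

for stride in (1, 4):
    windows = sliding_windows(words, context_size=4, stride=stride)
    print(f"stride={stride}: {len(windows)} windows")
    # Show only the first two windows of each configuration
    for window in windows[:2]:
        print("   ", " ".join(window))
```

With `stride=1` consecutive windows overlap in three of their four tokens, while with `stride=4` (equal to the context size) they do not overlap at all, so the second window is "the city stood the".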


In [13]:
class TextDatasetV1(Dataset):
    def __init__(self, text, tokenizer, context_size, stride):
        # List to store all the successive encoded inputs
        self.input_ids = []
        # List to store all the successive encoded targets
        self.target_ids = []

        # Tokenize the input text
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

        # Slide the context window over the encoded text to collect
        # input-target pairs. The loop stops at len(token_ids) - context_size
        # so that every target slice, which is shifted one token to the right,
        # still fits inside the text. For example, with 100 ids, a context
        # size of 4, and a stride of 1, the last window starts at index 95:
        # its input covers indexes [95,96,97,98] and its target [96,97,98,99].
        for i in range(0, len(token_ids) - context_size, stride):
            # Get the input slice from the encoded text
            encoded_input = token_ids[i:i + context_size]
            # Get the corresponding target slice to be paired with the input
            encoded_target = token_ids[i + 1:i + context_size + 1]
            # Append the input and target slices to their respective lists
            # Important to make them tensors so that they're ready to work with
            self.input_ids.append(torch.tensor(encoded_input))
            self.target_ids.append(torch.tensor(encoded_target))

    def __len__(self):
        # Method to get the length of the created dataset
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Method to retrieve a single input:target item from the dataset
        # idx is the index of the desired item
        return self.input_ids[idx], self.target_ids[idx]
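
As a quick check on the loop bound used in `__init__`, the number of samples the dataset holds can be worked out without tokenizing anything. The sketch below relies only on the `range(0, len(token_ids) - context_size, stride)` loop shown above; `count_samples` is a hypothetical helper introduced for illustration.

```python
def count_samples(num_tokens, context_size, stride):
    """Number of input-target pairs produced by the sliding-window loop."""
    # Same bounds as the loop in TextDatasetV1.__init__
    return len(range(0, num_tokens - context_size, stride))

# 100 token ids, context size 4, stride 1: start indexes 0..95
print(count_samples(100, context_size=4, stride=1))   # 96
# Non-overlapping windows: start indexes 0, 4, ..., 92
print(count_samples(100, context_size=4, stride=4))   # 24

# Indexes covered by the last pair when stride=1
last_start = range(0, 100 - 4, 1)[-1]
print(list(range(last_start, last_start + 4)))        # [95, 96, 97, 98]
print(list(range(last_start + 1, last_start + 5)))    # [96, 97, 98, 99]
```

A larger stride yields fewer samples and less overlap between them; a stride of 1 yields the most samples for a given text.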

Now we'll define a function that wraps the dataset in a data loader, which lets us iterate over the samples in batches and optionally shuffle them:

In [15]:
def create_dataloader_v1(text, batch_size=4, context_size=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    """
    Creates a PyTorch DataLoader for a text dataset.

    Args:
        text (str): The input text data.
        batch_size (int): The number of samples per batch.
        context_size (int): The size of the context window for each input sequence.
        stride (int): The number of tokens to slide the context window by.
        shuffle (bool, optional): Whether to shuffle the data. Defaults to True.
        drop_last (bool, optional): Whether to drop the last incomplete batch. Defaults to True.
        num_workers (int, optional): How many subprocesses to use for data loading. Defaults to 0.

    Returns:
        torch.utils.data.DataLoader: The DataLoader object.
    """
    # Initialize tokenizer
    oTokenizer = tiktoken.get_encoding("gpt2")
    # Create dataset
    dataset = TextDatasetV1(text, oTokenizer, context_size, stride)
    # Create the dataloader
    oDataLoader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                             drop_last=drop_last, num_workers=num_workers)
    return oDataLoader
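
To make the roles of `batch_size` and `drop_last` concrete, here is a pure-Python sketch of how a list of samples is grouped into batches. It is not how `DataLoader` is implemented; `make_batches` is a hypothetical helper, and the figure of 39 samples assumes the 5146-token text loaded earlier together with the defaults `context_size=256` and `stride=128`.

```python
def make_batches(num_samples, batch_size, drop_last):
    """Group sample indexes into consecutive batches (no shuffling)."""
    indexes = list(range(num_samples))
    batches = [indexes[i:i + batch_size]
               for i in range(0, num_samples, batch_size)]
    # With drop_last=True, a final batch smaller than batch_size is discarded
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

# Assumption: 5146 tokens, context_size=256, stride=128 -> 39 samples
num_samples = len(range(0, 5146 - 256, 128))

kept = make_batches(num_samples, batch_size=4, drop_last=True)
full = make_batches(num_samples, batch_size=4, drop_last=False)

print(num_samples)            # 39
print(len(kept), len(full))   # 9 10
print(full[-1])               # [36, 37, 38]
```

Under these assumptions, `drop_last=True` discards the three leftover samples so that every batch has exactly four rows. With `shuffle=True` the sample order is randomized before grouping, but the number of batches stays the same.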