
In [20]:
import requests
import tiktoken

from torch.utils.data import Dataset, DataLoader

In [None]:
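url = "https://raw.githubusercontent.com/RCortez25/PhD/main/LLM/0.%20Tokenizer/the-veredict.txt"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Save the downloaded text so the file exists before it is read back
with open("the-veredict.txt", "w", encoding="utf-8") as f:
    f.write(response.text)

with open("the-veredict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()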

oTokenizer = tiktoken.get_encoding("gpt2")
encoded_text = oTokenizer.encode(raw_text)
print(f"Number of tokens: {len(encoded_text)}")

Number of tokens: 5146


In [None]:
# Remove the first 50 tokens for demonstration purposes
encoded_sample = encoded_text[50:]

To create input-target pairs, define two variables:

$x$: inputs

$y$: targets

For example:

$x=[1,2,3,4]$

$y=[2,3,4,5]$

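The `context_size` determines how many tokens each input contains. The target holds the same number of tokens, taken from the window shifted one position forward.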

In [None]:
context_size = 4

# Creating input-target pair
x = encoded_sample[:context_size]
y = encoded_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


The following simple example shows how the LLM processes the text autoregressively: each context is used to predict the token that follows it.

In [16]:
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [18]:
# The same as before but with decoded text
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]

    print(oTokenizer.decode(context), "---->", oTokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Next, all input-target pairs must be gathered into tensors for training the LLM. This yields an $x$ tensor of inputs and a $y$ tensor of targets.

The first entry of the $x$ tensor will be paired with the first entry of the $y$ tensor. The second entry of $x$ will be paired with the second entry of $y$, and so on.
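
The row-by-row pairing described above can be sketched with plain Python lists before moving to tensors. This is only a minimal illustration, not the dataset class itself: the helper name `make_pairs` and its `stride` parameter are assumptions introduced here, and the token ids are made-up integers rather than tokens from the story.

```python
def make_pairs(token_ids, context_size, stride):
    """Slide a window over token_ids; each target is its input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that the target window never runs past the end of the list
    for start in range(0, len(token_ids) - context_size, stride):
        inputs.append(token_ids[start:start + context_size])
        targets.append(token_ids[start + 1:start + context_size + 1])
    return inputs, targets


# Toy token ids (hypothetical), not taken from the story
toy_ids = list(range(1, 11))  # [1, 2, ..., 10]
xs, ys = make_pairs(toy_ids, context_size=4, stride=4)

# Row i of xs is paired with row i of ys
for x_row, y_row in zip(xs, ys):
    print(x_row, "---->", y_row)
```

Because every row has the same length, passing `xs` and `ys` to `torch.tensor` would give the $x$ and $y$ tensors. A `stride` equal to `context_size` produces non-overlapping inputs, while a smaller value would make consecutive rows overlap.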

In [21]:
class TextDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, context_size):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the input text
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})