# Chapter 2, part 2 - Data loader for LLM pretraining
In the previous part we implemented tokenizers that transform raw text into integer token IDs for further processing. In this part we look at how to build a PyTorch dataset from a text file and how to create data loaders whose labels come from the text itself.


In [22]:
# let's use tiktoken as tokenizer
import tiktoken
from importlib.metadata import version
print("tiktoken version:", version("tiktoken"))

tokenizer = tiktoken.get_encoding("gpt2")


tiktoken version: 0.7.0


In [23]:
# read the text and convert it to token IDs
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5146


In [24]:
# skip the first 50 tokens so the examples below start from a more interesting passage
enc_sample = enc_text[50:]
print(len(enc_sample))

5096


One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:

In [25]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

# visualize the next-token prediction tasks contained in this single (x, y) pair
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]
 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Next, we define a PyTorch Dataset class and a helper function that wraps it in a DataLoader to load the training data:

In [26]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        
        self.input_ids = []
        self.target_ids = []
        
        # tokenize the entire text into token IDs
        token_ids = tokenizer.encode(txt)
        
        # max_length is the length of the sliding window; it plays the same role as context_size in the previous cell
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i: i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    

def create_dataloader_v1(txt,
                         batch_size=4,
                         max_length=256,
                         stride=128,
                         shuffle=True,
                         drop_last=True,
                         num_workers=0) -> DataLoader:
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

Now let's use the above code with a batch size of 1, a window of 4 tokens, and a stride of 1:

In [27]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
# print the first two batches
first_batch = next(data_iter)
print(first_batch)
second_batch = next(data_iter)
print(second_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
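
The second batch above is the first batch shifted by one token, because `stride=1`. To see how the stride controls the overlap between windows, here is a minimal sketch in plain Python. The helper `sliding_windows` and the list `toy_ids` are hypothetical names introduced only for this illustration; the loop bounds are copied from `GPTDatasetV1`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Hypothetical helper: same windowing loop as GPTDatasetV1, on plain lists."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i: i + max_length]
        targets = token_ids[i + 1: i + max_length + 1]  # inputs shifted by one
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs: 0, 1, ..., 9

# stride=1: consecutive windows overlap in all but one token
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
# stride=max_length: consecutive input windows do not overlap at all
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

With `stride=1` the ten toy tokens give six heavily overlapping samples, while `stride=4` gives only two samples whose inputs share no tokens. A smaller stride produces more training samples from the same text at the cost of more repetition between them.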


# Chapter 2, part 3 - Token embeddings
In the previous part we implemented a PyTorch dataset and data loader that load raw text, convert it into token IDs, and sample input and target tensors according to the batch size and sliding-window size.

After these steps, the data loader yields batches of training samples as discrete integers. To train a model, we need one more processing step that maps the discrete token IDs to continuous floating-point vectors, which a deep learning model can be trained on.

The embedding is itself a learnable layer: we initialize its weights with random values and update them during training.

##### An example
Reference - torch.nn.Embedding: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

Suppose we have a vocabulary of size **V** and we want each embedding vector to have dimension **E**.

An embedding layer acts as a lookup table that maps a word (an index into the vocabulary) to a vector of dimension E, so its weight matrix has the shape **[V, E]**.

Consider a small example where V=6 and E=3:

In [28]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

# get the embedding of the token with ID 3
print(embedding_layer(torch.tensor([3])))

# to get a batch of embeddings of a group of token IDs
input_ids = torch.tensor([2,3,5,1])
print(embedding_layer(input_ids))

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
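
The output rows above are exactly rows 2, 3, 5 and 1 of the weight matrix. As a rough check of the "lookup" intuition, the sketch below compares three ways of getting the same vectors: the layer's forward pass, plain row indexing, and multiplying one-hot vectors by the weight matrix. The variable names `indexed`, `one_hot` and `via_matmul` are only for this illustration.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

input_ids = torch.tensor([2, 3, 5, 1])

# 1) the layer's forward pass
looked_up = embedding_layer(input_ids)

# 2) plain row indexing into the [V, E] weight matrix
indexed = embedding_layer.weight[input_ids]

# 3) one-hot vectors of shape [4, V] multiplied by the [V, E] weight matrix
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding_layer.weight

print(torch.equal(looked_up, indexed))        # True
print(torch.allclose(looked_up, via_matmul))  # True
```

So an embedding layer can be read as a linear layer applied to one-hot inputs, implemented as a row lookup so that the one-hot vectors never have to be materialized.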


### Encoding word positions

The embedding mechanism above is a learnable mapping from integer token IDs to continuous floating-point vectors. However, this mapping is independent of position: the same token ID always maps to the same vector, no matter where it appears in the sequence. In language, the same word at different positions can carry different information, so with token embeddings alone it is hard for the LLM's attention mechanism to make use of word order. The cells below address this by adding a positional embedding to each token embedding.
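
To make the position problem concrete, here is a small sketch that reuses the toy 6x3 embedding layer from the earlier example. The sequence `[3, 1, 3, 5]` is made up for this illustration; token ID 3 appears at positions 0 and 2.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)

# made-up toy sequence: token ID 3 appears at positions 0 and 2
sequence = torch.tensor([3, 1, 3, 5])
embedded = embedding_layer(sequence)

# both occurrences of token 3 get the identical vector
print(torch.equal(embedded[0], embedded[2]))  # True
```

Nothing in the embedded vectors tells the two occurrences apart, which is the gap that positional information is meant to fill.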

In [31]:
vocab_size = 50257
output_dim = 256
token_embeding_layer = torch.nn.Embedding(vocab_size, output_dim)

# with a batch size of 8 and a window of 4 tokens per sample, embedding the inputs gives an 8x4x256 tensor

max_length = 4
dataloader = create_dataloader_v1(raw_text, 
                                  batch_size=8, 
                                  max_length=max_length, 
                                  stride=max_length, 
                                  shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n",inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [32]:
token_embeddings = token_embeding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [33]:
# positional embedding: one learnable vector for each position in the context window
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [34]:
# the final input embeddings are the sum of the token embeddings and the positional embeddings
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
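
The sum above adds a `[4, 256]` tensor to an `[8, 4, 256]` tensor, which works because PyTorch broadcasts the positional embeddings across the batch dimension. The sketch below spells out that broadcast. It uses random tensors as stand-ins for the real embeddings, and `expanded` and `explicit` are names introduced only here.

```python
import torch

torch.manual_seed(123)
batch_size, context_length, output_dim = 8, 4, 256

# random stand-ins for the token and positional embeddings computed above
token_embeddings = torch.randn(batch_size, context_length, output_dim)
pos_embeddings = torch.randn(context_length, output_dim)

# implicit broadcast: [8, 4, 256] + [4, 256]
input_embeddings = token_embeddings + pos_embeddings

# the same operation with the batch dimension written out explicitly
expanded = pos_embeddings.unsqueeze(0).expand(batch_size, -1, -1)  # [8, 4, 256]
explicit = token_embeddings + expanded

print(input_embeddings.shape)                   # torch.Size([8, 4, 256])
print(torch.equal(input_embeddings, explicit))  # True
```

Every sample in the batch therefore receives the same positional vector at a given position, which is what makes these embeddings absolute.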


### Chapter 2 summary
1. LLMs can’t process raw text, so textual data must be converted into numerical vectors, known as embeddings. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.

2. As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.

3. Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model’s understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.

4. The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.

5. We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.

6. Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.

7. While token embeddings provide consistent vector representations for each token, they lack a sense of the token’s position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI’s GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.