# Working with Text Data

Packages used in this notebook:

In [1]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.10.0
tiktoken version: 0.12.0


This chapter covers data preparation and sampling, the steps that get the input data ready for the LLM.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp?timestamp=1" width="500px">

In [2]:
with open("../data/document.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))

Total number of characters: 2470847


We can use the tokenizer to encode (that is, tokenize) text into integers.

These integers can later be embedded and used as input for the LLM.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp?123" width="500px">

We insert an `<|endoftext|>` token between two independent sources of text:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp" width="500px">
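As a minimal sketch of this idea, the snippet below joins two independent texts with the separator string before tokenization. The two sample documents and the names `documents`, `END_OF_TEXT`, and `combined` are hypothetical and only serve the illustration. When such a string is later passed to the tokenizer, the special token has to be explicitly allowed, as the dataset class in this notebook does with `allowed_special={"<|endoftext|>"}`.

```python
# Hypothetical stand-in documents; any two independent text sources would do.
documents = [
    "Hello, do you like tea?",
    "In the sunlit terraces of the palace.",
]

END_OF_TEXT = "<|endoftext|>"

# Join the sources with the special token so the document boundary stays visible.
combined = f" {END_OF_TEXT} ".join(documents)
print(combined)
```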

## BytePair encoding

GPT-2 used BytePair encoding (BPE) as its tokenizer.

BPE allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, which lets it handle out-of-vocabulary words.

We are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance.
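A toy sketch of the merge idea behind BPE may help here. It is a simplified character-level version written for illustration, not tiktoken's implementation (GPT-2's tokenizer works on bytes and ships with a fixed, pretrained merge table). The function names `learn_bpe_merges`, `apply_merge`, and `tokenize`, as well as the tiny training word list, are assumptions made for this example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Greedily learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        corpus = Counter({apply_merge(s, best_pair): f for s, f in corpus.items()})
    return merges


def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical training words for the toy vocabulary.
training_words = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_bpe_merges(training_words, num_merges=4)

# "lowest" never appears in the training words, yet it can still be tokenized.
tokens = tokenize("lowest", merges)
print(tokens)
```

Because every word can always fall back to its individual characters, an unseen word never produces an "unknown" token; it just ends up split into more pieces than a frequent word would.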

In [5]:
import importlib.metadata  # import the submodule explicitly so `importlib.metadata` is always available
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [6]:
tokenizer = tiktoken.get_encoding("gpt2")

BPE tokenizers break down unknown words into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

## Data sampling with a sliding window

LLMs are trained to generate one token (roughly, one word) at a time, so we prepare the training data such that the next token in a sequence is the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="400px">
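The next-token setup can be sketched with plain Python lists. The token IDs below are hypothetical placeholders rather than output from the notebook's tokenizer, and `context_size` is an illustrative name for the window length.

```python
# Hypothetical token IDs standing in for a tokenized sentence.
token_ids = [290, 4920, 2241, 287, 257]
context_size = 4

# The targets are simply the inputs shifted one position to the right.
x = token_ids[:context_size]
y = token_ids[1:context_size + 1]
print("x:", x)
print("y:", y)

# Each growing prefix of x is a context; the token that follows it is the target.
pairs = []
for i in range(1, context_size + 1):
    context = token_ids[:i]  # everything the model has seen so far
    target = token_ids[i]    # the token it should predict next
    pairs.append((context, target))
    print(context, "---->", target)
```

A single window of `context_size` tokens therefore yields `context_size` prediction tasks, one per position.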

In [7]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.10.0+cpu


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp?123" width="500px">

Create a dataset and a dataloader that extract chunks from the input text dataset.

In [None]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

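        # Tokenize the entire text, keeping <|endoftext|> as a single special token
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Insufficient tokens in input text for dataset creation"

        # Slide a window of max_length tokens over the text, advancing by stride each step
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            # The targets are the inputs shifted one token to the right
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))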

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    tokenizer = tiktoken.get_encoding("gpt2")

    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [11]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=4, max_length=1024, stride=1024 // 2, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
(first_batch)

[tensor([[   54,  1219,   285,  ...,   885,   285, 10277],
         [  333,   997, 11729,  ...,  2253,   502,   259],
         [   72,   355,   430,  ...,   290,  3426,   737],
         [ 7648,   885,  3338,  ...,  4352,   885,  3305]]),
 tensor([[ 1219,   285,  2724,  ...,   285, 10277,    72],
         [  997, 11729,   256,  ...,   502,   259,  7648],
         [  355,   430,   265,  ...,  3426,   737,   198],
         [  885,  3338,   263,  ...,   885,  3305,   397]])]

In [12]:
second_batch = next(data_iter)
(second_batch)

[tensor([[  198, 38485,    25,  ...,   256,   571,  8482],
         [  397,  4563,   261,  ...,  5303,   350,  1077],
         [45714,   573,    71,  ...,  7056,   502,   259],
         [ 1462,  1976, 15498,  ...,   500,   318, 19918]]),
 tensor([[38485,    25, 13081,  ...,   571,  8482, 45714],
         [ 4563,   261, 41727,  ...,   350,  1077,  1462],
         [  573,    71,    80,  ...,   502,   259,   318],
         [ 1976, 15498,   272,  ...,   318, 19918,    72]])]

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp" width="500px">

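Note that the stride used here is `max_length // 2`, so consecutive input windows overlap by half of their tokens. Setting the stride equal to `max_length` would remove the overlap between windows, which can be preferable since more overlap could lead to increased overfitting.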

In [13]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=4, max_length=1024, stride=1024 // 2, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   54,  1219,   285,  ...,   885,   285, 10277],
        [  333,   997, 11729,  ...,  2253,   502,   259],
        [   72,   355,   430,  ...,   290,  3426,   737],
        [ 7648,   885,  3338,  ...,  4352,   885,  3305]])

Targets:
 tensor([[ 1219,   285,  2724,  ...,   285, 10277,    72],
        [  997, 11729,   256,  ...,   502,   259,  7648],
        [  355,   430,   265,  ...,  3426,   737,   198],
        [  885,  3338,   263,  ...,   885,  3305,   397]])


## Creating token embeddings

The data is now almost ready for an LLM.

As a last step, we embed the tokens in a continuous vector representation using an embedding layer.

Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="400px">


An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp?123" width="500px">
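The look-up view can be checked directly with a tiny layer. The sizes below (a vocabulary of 6 and 3 embedding dimensions) and the token IDs are hypothetical, chosen only so the tensors stay small; the sketch compares the layer's output with plain row indexing into its weight matrix and with a one-hot matrix product.

```python
import torch

torch.manual_seed(123)

# Tiny, hypothetical sizes so the weight matrix is easy to inspect.
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

token_ids = torch.tensor([2, 3, 5, 1])

looked_up = embedding_layer(token_ids)       # forward pass of the layer
indexed = embedding_layer.weight[token_ids]  # plain row indexing into the weights

# The same rows can be obtained by multiplying one-hot vectors with the weight matrix.
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding_layer.weight

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

The indexing form avoids building the mostly-zero one-hot matrix, which is why an embedding layer is the more efficient way to express the same operation.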

## Encoding word positions

Embedding layers convert token IDs into identical vector representations regardless of where they are located in the input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="400px">

Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">
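A small sketch can make both points concrete: a repeated token ID gets the same token embedding at every position, and adding positional embeddings makes the resulting input vectors position-dependent. The toy sizes and the sample sequence are assumptions for illustration, not the values used in this notebook.

```python
import torch

torch.manual_seed(123)

vocab_size, context_length, output_dim = 10, 4, 3  # hypothetical toy sizes
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# One sequence in which token ID 7 appears at positions 0 and 2.
inputs = torch.tensor([[7, 1, 7, 4]])

token_embeddings = token_embedding_layer(inputs)                    # shape [1, 4, 3]
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape [4, 3]

# Broadcasting adds the same positional vectors to every sequence in the batch.
input_embeddings = token_embeddings + pos_embeddings                # shape [1, 4, 3]

same_before = torch.equal(token_embeddings[0, 0], token_embeddings[0, 2])
same_after = torch.equal(input_embeddings[0, 0], input_embeddings[0, 2])
print("Identical token embeddings:", same_before)
print("Identical input embeddings:", same_after)
```

The two occurrences of token 7 share one token embedding, but since positions 0 and 2 receive different, randomly initialized positional vectors, their input embeddings should differ.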

The BytePair encoder has a vocabulary size of 50,257.

Suppose we want to encode the input tokens into a 768-dimensional vector representation:

In [14]:
vocab_size = 50257
output_dim = 768

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

If we sample data from the dataloader, we embed each token in a batch into a 768-dimensional vector.

If we have a batch size of 4 with 1024 tokens each, this results in a 4 x 1024 x 768 tensor:

In [15]:
max_length = 1024
dataloader = create_dataloader_v1(
    raw_text, batch_size=4, max_length=max_length,
    stride=max_length // 2 , shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [16]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   54,  1219,   285,  ...,   885,   285, 10277],
        [  333,   997, 11729,  ...,  2253,   502,   259],
        [   72,   355,   430,  ...,   290,  3426,   737],
        [ 7648,   885,  3338,  ...,  4352,   885,  3305]])

Inputs shape:
 torch.Size([4, 1024])
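A quick back-of-the-envelope calculation shows what these sizes imply. It assumes the embedding layer stores one `output_dim`-sized row per vocabulary entry (the layout of `torch.nn.Embedding`'s weight matrix) and 4 bytes per value (PyTorch's default `float32`); the variable names are illustrative.

```python
vocab_size = 50257
output_dim = 768
batch_size = 4
max_length = 1024
bytes_per_value = 4  # assuming float32, PyTorch's default floating-point dtype

# One trainable row of size output_dim per vocabulary entry.
embedding_weights = vocab_size * output_dim

# One output_dim-sized vector per token in the batch.
batch_elements = batch_size * max_length * output_dim

print(f"Token embedding weights: {embedding_weights:,}")
print(f"Values in one embedded batch: {batch_elements:,}")
print(f"Weight matrix size: {embedding_weights * bytes_per_value / 1e6:.1f} MB")
```

Under these assumptions the token embedding table alone holds about 38.6 million parameters, while one embedded batch of shape 4 x 1024 x 768 contains about 3.1 million values.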


In [17]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

(token_embeddings)

torch.Size([4, 1024, 768])


tensor([[[-0.2001, -0.0106, -1.2923,  ...,  0.9828, -0.7837,  2.5149],
         [-0.8918, -0.7310, -1.3799,  ..., -0.2804,  0.1272, -0.7561],
         [-0.0820, -1.7687,  1.6987,  ...,  0.4435, -1.2183,  0.3573],
         ...,
         [ 1.0701, -0.4277,  1.3995,  ..., -0.5489,  1.1374, -2.0427],
         [-0.0820, -1.7687,  1.6987,  ...,  0.4435, -1.2183,  0.3573],
         [-0.9788,  1.5289, -0.8870,  ..., -0.8148,  0.0560,  0.6873]],

        [[ 0.6641, -0.1710, -0.8509,  ...,  0.0109, -2.0909,  0.7179],
         [-0.2338,  0.1532,  0.5148,  ..., -1.8632,  1.1932,  1.4671],
         [-0.1500,  0.3272, -0.4963,  ..., -1.2302,  1.7294,  0.0213],
         ...,
         [-1.3542, -0.1369, -1.1837,  ...,  0.5031, -0.0115, -0.1423],
         [-2.2588,  1.9325, -0.4783,  ...,  2.2393, -0.6354, -0.9153],
         [-1.2025, -0.3443, -1.5586,  ..., -0.3419, -0.4331, -0.3141]],

        [[-0.5525, -0.4649,  0.7483,  ...,  0.1353,  0.0983, -1.3722],
         [ 0.6163,  0.3344, -0.0321,  ...,  1

GPT-2 uses absolute position embeddings, so we just create another embedding layer:

In [18]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

(pos_embedding_layer.weight)

Parameter containing:
tensor([[-0.2960,  0.5421, -0.6717,  ..., -1.0487,  0.3583, -0.0158],
        [-0.8249,  2.6008,  0.2817,  ..., -0.4738,  0.2014, -1.5513],
        [-1.5623,  0.3674,  1.0243,  ...,  1.0584, -1.8554, -0.8419],
        ...,
        [ 0.5705,  1.0253, -1.8635,  ...,  0.6571,  0.2628,  0.5900],
        [-0.4751, -1.3380,  0.2011,  ...,  1.2281, -1.1136,  0.1384],
        [ 1.1393, -1.8344,  1.5052,  ..., -0.4528,  0.9091,  1.6983]],
       requires_grad=True)

In [19]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

(pos_embeddings)

torch.Size([1024, 768])


tensor([[-0.2960,  0.5421, -0.6717,  ..., -1.0487,  0.3583, -0.0158],
        [-0.8249,  2.6008,  0.2817,  ..., -0.4738,  0.2014, -1.5513],
        [-1.5623,  0.3674,  1.0243,  ...,  1.0584, -1.8554, -0.8419],
        ...,
        [ 0.5705,  1.0253, -1.8635,  ...,  0.6571,  0.2628,  0.5900],
        [-0.4751, -1.3380,  0.2011,  ...,  1.2281, -1.1136,  0.1384],
        [ 1.1393, -1.8344,  1.5052,  ..., -0.4528,  0.9091,  1.6983]],
       grad_fn=<EmbeddingBackward0>)

To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:

In [20]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

(input_embeddings)

torch.Size([4, 1024, 768])


tensor([[[-0.4961,  0.5315, -1.9641,  ..., -0.0659, -0.4254,  2.4990],
         [-1.7167,  1.8699, -1.0983,  ..., -0.7542,  0.3286, -2.3073],
         [-1.6443, -1.4013,  2.7230,  ...,  1.5019, -3.0737, -0.4847],
         ...,
         [ 1.6406,  0.5976, -0.4640,  ...,  0.1082,  1.4002, -1.4527],
         [-0.5572, -3.1067,  1.8997,  ...,  1.6716, -2.3320,  0.4957],
         [ 0.1605, -0.3055,  0.6183,  ..., -1.2676,  0.9651,  2.3856]],

        [[ 0.3681,  0.3711, -1.5226,  ..., -1.0378, -1.7326,  0.7021],
         [-1.0587,  2.7541,  0.7965,  ..., -2.3370,  1.3946, -0.0842],
         [-1.7123,  0.6946,  0.5280,  ..., -0.1718, -0.1260, -0.8206],
         ...,
         [-0.7837,  0.8884, -3.0473,  ...,  1.1603,  0.2513,  0.4477],
         [-2.7339,  0.5945, -0.2772,  ...,  3.4674, -1.7491, -0.7769],
         [-0.0632, -2.1787, -0.0533,  ..., -0.7947,  0.4760,  1.3842]],

        [[-0.8486,  0.0772,  0.0765,  ..., -0.9134,  0.4565, -1.3880],
         [-0.2086,  2.9353,  0.2496,  ...,  0

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="400px">

In [21]:
for batch in dataloader:
    input_ids, target_ids = batch

    token_embeddings = token_embedding_layer(input_ids)
    pos_embeddings = pos_embedding_layer(torch.arange(context_length))

    input_embeddings = token_embeddings + pos_embeddings
    break

In [22]:
print(input_embeddings.shape)

torch.Size([4, 1024, 768])
