# Chapter 1: LLM Text Preprocessing

<div class="alert alert-block alert-success">
    
Welcome to the first step in building a Large Language Model! Before a model can learn from text, we must convert that text into a numerical format it can understand. This process is called **text preprocessing**.

In this chapter, we will build a complete pipeline that takes a raw text file as input and produces batches of token embeddings as output, ready to be fed into a transformer model. We will cover:
<ul>
    <li>Simple tokenization using regular expressions.</li>
    <li>The industry-standard <b>Byte-Pair Encoding (BPE)</b> tokenizer.</li>
    <li>Creating input-target pairs for autoregressive training.</li>
    <li>Using PyTorch's <code>Dataset</code> and <code>DataLoader</code> to create batches.</li>
    <li>Generating final token and positional embeddings.</li>
</ul>
</div>

## 1.1 Import and Setup

<div class="alert alert-block alert-success">
    
We'll begin by importing all the necessary libraries and modules that we will use throughout this chapter.
</div>

In [1]:
! pip3 install tiktoken

import os
import urllib.request
import re
import importlib.metadata
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader



## 1.2 Reading the Text Corpus

<div class="alert alert-block alert-success">
    
To begin our preprocessing pipeline, we first need to load our raw text corpus. We'll define a reusable helper function, `load_data`, that can download a file from a URL if it doesn't already exist locally. This will make our notebook self-contained and easy for anyone to run.
</div>

In [2]:
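def load_data(file_path, url=None):
    """Return the text at file_path, downloading it from url first if it is missing."""
    if not os.path.exists(file_path):
        if url is None:
            raise ValueError(f"File '{file_path}' not found and no URL was provided.")
        print(f"Downloading data from {url}...")
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        # Create the target directory if needed, then cache the file locally
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
        print("Download complete.")
    else:
        print(f"File '{file_path}' already exists. Loading from disk...")
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()
        print("Load complete.")

    return text_data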

<div class="alert alert-block alert-success">
    
Our dataset for this project will be the complete text of **"The Verdict,"** a short story by the Pulitzer Prize-winning American writer **Edith Wharton**, first published in 1908.

We are using this text for three main reasons:
* It is in the public domain, so we can use it freely.
* It is written in a rich, classic style, providing interesting vocabulary and sentence structures.
* Its size is small enough to allow for quick processing and training on a standard laptop.
  
Now, let's use this function to load the text and inspect its contents to ensure it loaded correctly.
</div>

In [3]:
FILE_PATH = "../data/the-verdict.txt"
URL = "https://raw.githubusercontent.com/JotaCe7/llm-text-preprocessing/main/data/the-verdict.txt"

raw_text = load_data(file_path=FILE_PATH, url=URL)

print("\nTotal number of characters:", len(raw_text))
print(raw_text[:100])

File '../data/the-verdict.txt' already exists. Loading from disk...
Load complete.

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


## 1.3 A Simple Word Tokenizer

<div class="alert alert-block alert-success">
    
The first step in preprocessing is **tokenization**, splitting the raw text into smaller units, or "tokens." A simple first approach is to treat every word and piece of punctuation as a token. We can achieve this using Python's regular expression library.

For this example, we discard whitespace. Certain applications, such as processing code where indentation matters, need to keep it.
</div>

In [4]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
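
<div class="alert alert-block alert-info">

The capture group in the pattern is what keeps the punctuation: `re.split` returns the matched separators alongside the text between them. The sketch below illustrates this on a made-up sentence. The helper name `split_tokens` and its `keep_whitespace` flag are assumptions introduced here for illustration; they are not part of the notebook's pipeline.
</div>

```python
import re

# Same pattern as in the cell above
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text, keep_whitespace=False):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    pieces = re.split(SPLIT_PATTERN, text)
    if keep_whitespace:
        # Drop only the empty strings that re.split yields between adjacent delimiters
        return [piece for piece in pieces if piece != ""]
    # Drop empty strings and whitespace-only tokens
    return [piece.strip() for piece in pieces if piece.strip()]

sample = "Hello, world. Is this-- a test?"
tokens = split_tokens(sample)
tokens_with_spaces = split_tokens(sample, keep_whitespace=True)

print(tokens)
print(tokens_with_spaces)
```

When whitespace is kept, joining the tokens reproduces the input exactly, which is why whitespace-preserving tokenizers can be decoded without any spacing heuristics.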


### 1.3.1 Building a Vocabulary

<div class="alert alert-block alert-success">
    
Now that we have a list of all tokens, we need to convert them into integers. We do this by first creating a **vocabulary**, a sorted list of all unique tokens in our text. We then create a dictionary that maps each unique token to a unique integer ID.
</div>

In [5]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Size of our vocabulary:",vocab_size)

vocab = {token:id for id, token in enumerate(all_words)}
print("\nLet's print 15 first entries of our vocabulary:")
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 14:
        break

Size of our vocabulary: 1130

Let's print 15 first entries of our vocabulary:
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
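
<div class="alert alert-block alert-info">

The ordering above (punctuation first, then capitalized words) follows from Python's default string sorting, which compares character codes. As a minimal sketch on a made-up token list, the same steps of sorting, numbering, and inverting the mapping look like this:
</div>

```python
# Toy token list, assumed for illustration only
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Sorted unique tokens: "." sorts before letters because of its character code
unique_tokens = sorted(set(tokens))

# Token -> ID and the inverse ID -> token mapping
vocab = {token: idx for idx, token in enumerate(unique_tokens)}
inverse_vocab = {idx: token for token, idx in vocab.items()}

ids = [vocab[token] for token in tokens]
decoded = [inverse_vocab[idx] for idx in ids]

print(vocab)
print(ids)
print(decoded)
```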


### 1.3.2 Creating a Tokenizer Class

<div class="alert alert-block alert-success">

Now that we understand the process of creating a vocabulary and mapping tokens to integer IDs, let's encapsulate this logic into a complete `SimpleTokenizerV1` class.

This class will handle the full round-trip process:
* The **`encode()`** method will take a raw text string, split it into tokens using our regex, and convert it into a list of token IDs based on the vocabulary.
* The **`decode()`** method will perform the reverse operation, taking a list of token IDs and converting it back into a human-readable text string, removing the extra spaces that joining inserts before punctuation marks.
</div>

In [6]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.encoder[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

<div class="alert alert-block alert-success">
For demonstration, let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice:
</div>

In [7]:
tokenizerv1 = SimpleTokenizerV1(vocab)

text = """"Oh, by Jove!" I said.

It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall.

"By Jove--a Stroud!" I cried."""
ids = tokenizerv1.encode(text)
print(ids)

[1, 74, 5, 241, 58, 0, 1, 53, 851, 7, 56, 1077, 115, 899, 722, 115, 361, 6, 156, 726, 1015, 361, 5, 923, 568, 988, 815, 1044, 115, 1072, 7, 1, 23, 58, 6, 115, 89, 0, 1, 53, 300, 7]


<div class="alert alert-block alert-success">
Next, let's see if we can turn these token IDs back into text using the decode method:
</div>

In [8]:
print(tokenizerv1.decode(ids))

" Oh, by Jove!" I said. It was a sketch of a donkey -- an old tired donkey, standing in the rain under a wall." By Jove -- a Stroud!" I cried.


<div class="alert alert-block alert-warning">

<b>The "Unknown Token" Problem</b><br>

A major limitation of this simple vocabulary is that it only knows about words present in our training text. If `encode()` encounters a word that is not in the vocabulary, the dictionary lookup fails with a `KeyError`. This is known as the **"out-of-vocabulary"** or **"unknown token" problem**.

This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
</div>

In [9]:
text = "Hello, how are you?"
# "Hello" does not appear in the story, so this lookup raises a KeyError
ids = tokenizerv1.encode(text)
print(ids)
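
<div class="alert alert-block alert-info">

To observe the failure without interrupting a notebook run, the lookup can be wrapped in a `try`/`except` block. This is a minimal sketch with a made-up five-entry vocabulary; `encode_strict` is a hypothetical stand-in for the strict dictionary lookup, not the tokenizer class itself.
</div>

```python
# Toy vocabulary, assumed for illustration only
vocab = {"how": 0, "are": 1, "you": 2, "?": 3, ",": 4}

def encode_strict(tokens, vocab):
    """Map tokens to IDs with a plain dictionary lookup (fails on unknown tokens)."""
    return [vocab[token] for token in tokens]

tokens = ["Hello", ",", "how", "are", "you", "?"]

missing = None
try:
    ids = encode_strict(tokens, vocab)
except KeyError as err:
    # The KeyError carries the token that was not found
    missing = err.args[0]

print("Unknown token:", missing)
```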

### 1.3.3 Special Context Tokens

<div class="alert alert-block alert-success">

In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. 

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will extend the vocabulary and the tokenizer we implemented in the previous section, SimpleTokenizerV1, to support two new tokens, **<|unk|>** and **<|endoftext|>**. The resulting class is called SimpleTokenizerV2.
</div>

<div class="alert alert-block alert-info">

We can modify the tokenizer to use an **<|unk|>** token if it encounters a word that is not part of the vocabulary. 

Furthermore, we add a token between unrelated texts (**<|endoftext|>**). 

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert this token before each document or book that follows a previous text source, so the model can tell that the texts are unrelated.
</div>

In [10]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

print("Size of our vocabulary:",len(vocab.items()))

Size of our vocabulary: 1132


### 1.3.4 Creating a Tokenizer Class with Context Tokens

<div class="alert alert-block alert-success">
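The `SimpleTokenizerV2` class below extends the previous tokenizer: any token that is missing from the vocabulary is replaced by `<|unk|>` before the ID lookup.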
</div>

In [11]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.;:?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.encoder
            else "<|unk|>" for item in preprocessed # Replace unknown words by <|unk|>
        ]
        ids = [self.encoder[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.;:?_!"()\'])', r'\1', text)
        return text

In [12]:
text = "Hello, how are you?"
tokenizerv2 = SimpleTokenizerV2(vocab)
ids = tokenizerv2.encode(text)
print(ids)
print(tokenizerv2.decode(ids))

[1131, 5, 560, 169, 1126, 10]
<|unk|>, how are you?


<div class="alert alert-block alert-success">
Based on the detokenized output, we can see that "Hello" is not part of the vocabulary: it was replaced by `<|unk|>`.
</div>
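
<div class="alert alert-block alert-info">

The `<|endoftext|>` token can be used in the same way to join unrelated texts before encoding. The sketch below assumes a made-up ten-entry vocabulary and a standalone `encode` function that mirrors the splitting and `<|unk|>` fallback described above; both are illustrative, not the notebook's actual vocabulary.
</div>

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

# Toy vocabulary, assumed for illustration only
vocab = {token: idx for idx, token in enumerate(
    [",", ".", "the", "tea", "was", "cold", "rain", "fell",
     "<|endoftext|>", "<|unk|>"]
)}

def encode(text, vocab):
    """Split text and map tokens to IDs, falling back to <|unk|>."""
    tokens = [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]
    return [vocab.get(token, vocab["<|unk|>"]) for token in tokens]

# Two unrelated texts joined by the separator token
docs = ["the tea was cold.", "the rain fell, softly."]
joined = " <|endoftext|> ".join(docs)
ids = encode(joined, vocab)

print(joined)
print(ids)
```

The separator survives as a single token because none of its characters appear in the split pattern, and the unseen word "softly" maps to the `<|unk|>` ID.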

<div class="alert alert-block alert-success">
    
Depending on the LLM, some researchers also consider additional special tokens such as the following.

* **[BOS] (Beginning Of Sequence):** This token marks the start of a text.

* **[EOS] (End Of Sequence):** This token is positioned at the end of a text. It is especially useful when concatenating multiple unrelated documents.

* **[PAD] (Padding):** To process text in batches, all sequences must have the same length. Shorter texts are extended or "padded" to this length using the `[PAD]` token.
</div>
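
<div class="alert alert-block alert-info">

As a rough illustration of padding, the sketch below extends every sequence in a batch to the length of the longest one. The function name `pad_batch` and the padding ID 99 are arbitrary choices made for this example.
</div>

```python
def pad_batch(sequences, pad_id):
    """Right-pad each list of token IDs to the length of the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three sequences of different lengths (made-up token IDs)
batch = [[5, 1, 4], [3], [5, 2, 0, 7]]
padded = pad_batch(batch, pad_id=99)

for row in padded:
    print(row)
```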

## 1.4 The Standard: Byte-Pair Encoding (BPE)

<div class="alert alert-block alert-success">
    
The simple word-based tokenizer we built highlights a major challenge: how to handle words that are not in the vocabulary. Modern LLMs solve this using a powerful technique called **subword tokenization**.

<div class="alert alert-block alert-info">
    
  <b>A Note on GPT's Tokenizer</b><br>
  
  It's important to note that while we've discussed special tokens like `[BOS]` and `[PAD]`, the actual tokenizer used for GPT models is simpler in one way and more complex in another. 
  
  It only requires a single special token, <code>&lt;|endoftext|&gt;</code>, to mark the end of a text passage. It also has no need for an <code>&lt;|unk|&gt;</code> token: its <b>Byte-Pair Encoding (BPE)</b> algorithm breaks any word that is not in the vocabulary into smaller, known subword units, falling back to individual characters or bytes if necessary.
</div>

We will use OpenAI's high-performance `tiktoken` library, which provides a pre-trained BPE tokenizer used in GPT-2. The `encode()` and `decode()` methods in `tiktoken` work very similarly to the simple tokenizer we built. Let's see it in action.
</div>

In [13]:
print("tiktoken version:", importlib.metadata.version("tiktoken"))
tokenizer = tiktoken.get_encoding("gpt2")

tiktoken version: 0.9.0


In [14]:
# Original text
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
     "of someunknownPlace."
)
print("Original text:", text)

# Encoding sample text
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print("\nEncoded IDs:", integers)

# Decoding back to text
strings = tokenizer.decode(integers)
print("\nDecoded text:", strings)


Original text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.

Encoded IDs: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]

Decoded text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


<div class="alert alert-block alert-success">
    
The version of the BPE Tokenizer used by GPT-2 has a total vocabulary size of **50,257**, with `<|endoftext|>` being assigned the largest token ID (50256), as we can see in the Encoded IDs above.
</div>
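
<div class="alert alert-block alert-info">

To build intuition for how BPE can represent a word it has never seen, here is a toy sketch of the merge procedure on three made-up words. It is a simplified character-level illustration, not the byte-level implementation in `tiktoken`; the helper names are assumptions, and ties between equally frequent pairs are broken by first occurrence.
</div>

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair (first one wins on ties)."""
    pairs = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            pairs[(left, right)] += 1
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of the pair in seq with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# "Training": learn two merges from a tiny corpus
sequences = [list(word) for word in ["low", "lower", "lowest"]]
merges = []
for _ in range(2):
    pair = most_frequent_pair(sequences)
    merges.append(pair)
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(merges)
# A word that was not in the corpus is still representable
print(apply_merges("lowly", merges))
```

The unseen word "lowly" is split into the learned unit "low" plus single characters, so no unknown-token placeholder is needed.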

#### Vocabulary size comparison for different GPT models

In [15]:
# Initialize the encodings for GPT-2, GPT-3, and GPT-4
encodings = {
    "gpt2": tiktoken.get_encoding("gpt2"),
    "gpt3": tiktoken.get_encoding("p50k_base"),  # Used by later GPT-3-era models such as text-davinci-003
    "gpt4": tiktoken.get_encoding("cl100k_base")  # Used by GPT-3.5-turbo and GPT-4
}

# Get the vocabulary size for each encoding
vocab_sizes = {model: encoding.n_vocab for model, encoding in encodings.items()}

# Print the vocabulary sizes
for model, size in vocab_sizes.items():
    print(f"The vocabulary size for {model.upper()} is: {size}")

The vocabulary size for GPT2 is: 50257
The vocabulary size for GPT3 is: 50281
The vocabulary size for GPT4 is: 100277


## 1.5 Preparing Data Batches For Training

<div class="alert alert-block alert-success">
    
The fundamental task of a language model like GPT is to predict the next token in a sequence. To train it, we need to structure our text data into `(input, target)` pairs. The **input (`x`)** will be a chunk of text, and the **target (`y`)** will be the same chunk, but shifted one position to the right.

This setup teaches the model: "When you see this sequence of tokens, the next token should be *this* one."

We will do this using a sliding window approach and then use PyTorch's `Dataset` and `DataLoader` classes to create batches of data efficiently.
</div>

### 1.5.1 Creating Input-Target Pairs for Autoregressive Training

<div class="alert alert-block alert-success">

To get started, we will first tokenize the entire short story "The Verdict" that we loaded earlier, using the BPE tokenizer introduced in the previous section:
</div>

In [16]:
enc_text = tokenizer.encode(raw_text)
print(f"The text was tokenized into {len(enc_text)} tokens.")

The text was tokenized into 5145 tokens.


<div class="alert alert-block alert-success">

One of the easiest and most intuitive ways to create the input-target pairs for the next-token prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1. The context size determines how many tokens are included in the input:
</div>

In [17]:
context_size = 4  # number of tokens in each input
# With a context_size of 4, the model sees up to 4 tokens when predicting the next one.
# If x holds the tokens at positions [1, 2, 3, 4],
# the target y holds the tokens at positions [2, 3, 4, 5].

x = enc_text[:context_size]
y = enc_text[1:context_size+1]

print(f"x  (input): {x}")
print(f"y (target):     {y}")

x  (input): [40, 367, 2885, 1464]
y (target):     [367, 2885, 1464, 1807]


<div class="alert alert-block alert-success">

The input-target pair above actually contains multiple, smaller prediction tasks. For a given input `x`, the model tries to predict the corresponding `y` at each position.

The `for` loop below illustrates these individual tasks, both with token IDs and decoded text. Everything to the left of the arrow (`--->`) is the context the model sees, and the token on the right is the token it is being trained to predict.
</div>

In [18]:
for i in range(context_size):
    context = enc_text[:i+1]
    desired = enc_text[i+1]
    print(context, "--->", desired)

[40] ---> 367
[40, 367] ---> 2885
[40, 367, 2885] ---> 1464
[40, 367, 2885, 1464] ---> 1807


In [19]:
for i in range(context_size):
    context = enc_text[:i+1]
    desired = enc_text[i+1]
    print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))

I --->  H
I H ---> AD
I HAD --->  always
I HAD always --->  thought


<div class="alert alert-block alert-success">

This manual process is great for understanding a single training example. However, to process our entire text corpus of thousands of tokens efficiently, and turn them into embeddings, we need to automate this "chunking" process and create batches of data.

In particular, we are interested in returning two PyTorch tensors (multidimensional arrays): an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.
</div>

### 1.5.2 Implementing a Data Loader

<div class="alert alert-block alert-success">

First, we create a custom `Dataset` class that will take our tokenized text and use a sliding window to create the input-target chunks.
</div>

In [20]:
class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})

        # Use a sliding window to chunk the text into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
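
<div class="alert alert-block alert-info">

The windowing logic can be checked in isolation on a toy sequence. The sketch below reuses the same `range(0, len(token_ids) - max_length, stride)` loop with plain Python lists instead of tensors; the function name `sliding_windows` is an assumption made for this example.
</div>

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

pairs_stride2 = sliding_windows(token_ids, max_length=4, stride=2)
pairs_stride4 = sliding_windows(token_ids, max_length=4, stride=4)

for x, y in pairs_stride2:
    print(x, "->", y)
print("Windows with stride 4:", len(pairs_stride4))
```

Stopping the loop at `len(token_ids) - max_length` guarantees that every target chunk still has `max_length` tokens.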

<div class="alert alert-block alert-success">

Next, to make creating the final `DataLoader` easy, we'll wrap the `GPTDataset` and PyTorch's `DataLoader` in a single convenience function, `create_dataloader`.
</div>

In [21]:
def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDataset(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

### 1.5.3 Exploring the Batches

<div class="alert alert-block alert-success">
    
With our data loader ready, let's explore the data it produces. We'll start with a `batch_size` of 1, a `stride` of 1, and a `max_length` of 4 to clearly see how the sliding window creates overlapping chunks.
</div>

In [22]:
print("Pytorch version:", torch.__version__)
dataloader = create_dataloader(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("first_batch:", first_batch)

Pytorch version: 2.7.1+cu128
first_batch: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


<div class="alert alert-block alert-success">

The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since the max_length is set to 4, each of the two tensors contains 4 token IDs. 

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.   
</div>

<div class="alert alert-block alert-success">
    
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
</div>

In [23]:
second_batch = next(data_iter)
print("second_batch:", second_batch)

second_batch: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


<div class="alert alert-block alert-success">

If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch. 

For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input. 

The stride setting dictates the number of positions the inputs shift from one sample to the next, emulating a sliding window approach.
</div>

<div class="alert alert-block alert-success">
    
Before we move on to creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
</div>

In [24]:
dataloader = create_dataloader(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])


<div class="alert alert-block alert-info">

Note that the example above still uses a stride of 1, so consecutive rows share three of their four tokens. Setting the stride equal to `max_length` (4 here) uses the dataset fully (no token is skipped) while avoiding any overlap between the samples, since more overlap could lead to increased overfitting.
</div>
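
<div class="alert alert-block alert-info">

To see how the stride affects the amount of training data, the sketch below counts the samples and batches produced for the 5,145 tokens of our text. The helper names are assumptions; the formulas simply mirror the dataset's `range(...)` loop and the `drop_last=True` default of `create_dataloader`.
</div>

```python
def num_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

def num_batches(n_samples, batch_size, drop_last=True):
    """Number of batches, optionally dropping the last incomplete one."""
    full, rest = divmod(n_samples, batch_size)
    return full if drop_last or rest == 0 else full + 1

NUM_TOKENS = 5145  # token count reported for "The Verdict" earlier in this chapter

for stride in (1, 4):
    samples = num_samples(NUM_TOKENS, max_length=4, stride=stride)
    batches = num_batches(samples, batch_size=8)
    print(f"stride={stride}: {samples} samples, {batches} batches of 8")
```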

## 1.6 Creating Token Embeddings

<div class="alert alert-block alert-success">
    
Let's illustrate how the token ID to embedding vector conversion works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:
</div>

In [29]:
input_ids = torch.tensor([2, 3, 5, 1])

<div class="alert alert-block alert-success">
For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 tokens (instead of the 50,257 tokens in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):
</div>

<div class="alert alert-block alert-success">
Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 100 for reproducibility purposes:
</div>

In [30]:
vocab_size = 6
output_dim = 3

torch.manual_seed(100)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.1268,  1.3564, -0.0247],
        [-0.8466,  0.0293, -0.5721],
        [-1.2546,  0.0486,  0.2753],
        [-2.1550, -0.7116,  0.0575],
        [ 0.6263, -1.7736, -0.2205],
        [ 2.7467, -1.0480,  1.1239]], requires_grad=True)


<div class="alert alert-block alert-info">
We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.
</div>

<div class="alert alert-block alert-success">
After we instantiated the embedding layer, let's now apply it to a token ID to obtain the embedding vector:
</div>

In [31]:
print(embedding_layer(torch.tensor([3])))

tensor([[-2.1550, -0.7116,  0.0575]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row. In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
</div>
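
<div class="alert alert-block alert-info">

The look-up view can be made concrete without PyTorch. The sketch below copies the weight values printed above into a plain Python list and shows that indexing a row gives the same result as multiplying a one-hot vector by the weight matrix. The helper names `embed` and `one_hot_matmul` are assumptions made for this illustration.
</div>

```python
# Weight values copied from the printed embedding matrix above
weight = [
    [ 0.1268,  1.3564, -0.0247],
    [-0.8466,  0.0293, -0.5721],
    [-1.2546,  0.0486,  0.2753],
    [-2.1550, -0.7116,  0.0575],
    [ 0.6263, -1.7736, -0.2205],
    [ 2.7467, -1.0480,  1.1239],
]

def embed(token_ids, weight):
    """Embedding as a look-up: pick one row per token ID."""
    return [weight[token_id] for token_id in token_ids]

def one_hot_matmul(token_id, weight):
    """Same result via a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if row == token_id else 0.0 for row in range(len(weight))]
    return [
        sum(one_hot[row] * weight[row][col] for row in range(len(weight)))
        for col in range(len(weight[0]))
    ]

print(embed([3], weight))
print(one_hot_matmul(3, weight))
```

The look-up avoids the wasted multiplications by zero, which is why embedding layers are implemented as indexing rather than as a matrix product.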

<div class="alert alert-block alert-success">
Let's now apply that to all four input IDs we defined earlier (torch.tensor([2, 3, 5, 1])):
</div>

In [32]:
print(embedding_layer(input_ids))

tensor([[-1.2546,  0.0486,  0.2753],
        [-2.1550, -0.7116,  0.0575],
        [ 2.7467, -1.0480,  1.1239],
        [-0.8466,  0.0293, -0.5721]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.
</div>

## 1.7 Positional Embeddings

<div class="alert alert-block alert-success">

Let's now consider more realistic embedding sizes and encode the input tokens into a 256-dimensional vector representation. 

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation. 

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we used earlier, which has a vocabulary size of 50,257:
</div>

In [33]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

<div class="alert alert-block alert-info">
    
Using the `token_embedding_layer` above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
</div>

<div class="alert alert-block alert-success">
First, let's instantiate the data loader from the section on data sampling with a sliding window:
</div>

In [35]:
max_length = 4
dataloader = create_dataloader(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [None]:
print("Token IDs: \n", inputs)
print("\nInputs shape:\n", inputs.shape)

<div class="alert alert-block alert-info">
As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.
</div>

<div class="alert alert-block alert-success">
Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:
</div>

In [36]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


<div class="alert alert-block alert-info">
As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.
</div>

<div class="alert alert-block alert-success">
GPT models use absolute positional embeddings, so we just need to create another embedding layer with the same embedding dimension as the token_embedding_layer and one row per position in the context:
</div>

In [40]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [39]:
pos_embeddings

tensor([[-1.4603,  1.8974,  2.3485,  ...,  1.5988,  1.2487, -1.3841],
        [-1.2316, -0.0637, -1.1363,  ..., -0.4253,  0.9158, -1.4085],
        [ 0.3789, -0.9401,  0.0262,  ...,  0.5510,  0.0613,  1.3368],
        [ 0.3616, -0.5001, -0.5383,  ...,  0.4893, -1.1094, -0.3643]],
       grad_fn=<EmbeddingBackward0>)

<div class="alert alert-block alert-info">
    
As shown in the preceding code example, the input to the `pos_embedding_layer` is usually a placeholder vector torch.arange(context_length), which contains the sequence of numbers 0, 1, ..., up to the maximum input length − 1. 

The context_length is a variable that represents the supported input size of the LLM. 

Here, we set it equal to the maximum length of the input text (max_length). 

In practice, input text can be longer than the supported context length, in which case we have to truncate the text.
</div>

<div class="alert alert-block alert-info">
    
As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings: PyTorch broadcasts the 4x256-dimensional pos_embeddings tensor across the batch dimension, adding it to the 4x256-dimensional token embedding tensor of each of the 8 samples in the batch:
</div>

In [41]:
print("Token embeddings:", token_embeddings.shape)
print("Positional embeddings:", pos_embeddings.shape)
input_embeddings = token_embeddings + pos_embeddings
print("Input embeddings:", input_embeddings.shape)

Token embeddings: torch.Size([8, 4, 256])
Positional embeddings: torch.Size([4, 256])
Input embeddings: torch.Size([8, 4, 256])


<div class="alert alert-block alert-success">
    
The input_embeddings we created are the embedded input examples that can now be processed by the main LLM modules.
</div>