## Step 1. Creating Tokens



<div class="alert alert-block alert-success">
Here you can find the contents of [The Verdict](https://)

Let's read the file and print the total number of characters, followed by the first 99
characters, for illustration purposes. </div>

In [16]:
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">
Using Python's regular expression library we will split the text to obtain a list of tokens.

We are considering each individual word as a token. </div>
<div class="alert alert-block alert-warning">
For this example, we are ignoring whitespace. For certain applications you will need to keep it.</div>

In [11]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
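
As a small aside on the whitespace choice, the sketch below applies the same split pattern to a made-up sentence (not taken from The Verdict) and compares dropping whitespace with keeping it. The names `sample`, `no_ws`, and `with_ws` are illustrative only.

```python
import re

# Made-up sentence for illustration; not part of The Verdict
sample = "Hello, world. Is this-- a test?"
pattern = r'([,.:;?_!"()\']|--|\s)'

pieces = re.split(pattern, sample)

# Variant 1: drop empty strings and whitespace-only pieces (as in the cell above)
no_ws = [p.strip() for p in pieces if p.strip()]

# Variant 2: keep whitespace pieces and drop only empty strings
with_ws = [p for p in pieces if p != ""]

print(no_ws)
print(with_ws)
print("".join(with_ws) == sample)
```

When the whitespace pieces are kept, joining them reproduces the input exactly, which can matter for whitespace-sensitive text such as Python source code.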


## Step 2. Creating Token IDs

<div class="alert alert-block alert-success">
Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size: </div>

In [59]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Size of our vocabulary:",vocab_size)

Size of our vocabulary: 1130


<div class="alert alert-block alert-success">
Create a dictionary for our vocabulary and print the first 15 entries. </div>

In [60]:
vocab = {token:id for id, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 14:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)


## Step 3. Creating a Tokenizer Class

<div class="alert alert-block alert-success">

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits
text into tokens and carries out the string-to-integer mapping to produce token IDs via the
vocabulary. 

In addition, we implement a decode method that carries out the reverse
integer-to-string mapping to convert the token IDs back into text.

</div>

<div class="alert alert-block alert-info">
    
Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

</div>

In [61]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.encoder[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text
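
The whitespace clean-up inside `decode` can be tried in isolation. The sketch below uses a hand-written token list (an assumption for illustration, not output of the class) and the same substitution pattern as the method above.

```python
import re

# Hand-written tokens, standing in for what the decoder dictionary would return
tokens = ['"', 'Oh', ',', 'by', 'Jove', '!', '"', 'I', 'said', '.']

joined = " ".join(tokens)

# Same pattern as in decode: drop whitespace that precedes punctuation
cleaned = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

Because the pattern only looks at whitespace before punctuation, the space after an opening quotation mark is kept, so decoding does not restore the original text character for character.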

<div class="alert alert-block alert-success">
Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a
passage from Edith Wharton's short story to try it out in practice:
</div>

In [62]:
tokenizerv1 = SimpleTokenizerV1(vocab)

text = """"Oh, by Jove!" I said.

It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall.

"By Jove--a Stroud!" I cried."""
ids = tokenizerv1.encode(text)
print(ids)

[1, 74, 5, 241, 58, 0, 1, 53, 851, 7, 56, 1077, 115, 899, 722, 115, 361, 6, 156, 726, 1015, 361, 5, 923, 568, 988, 815, 1044, 115, 1072, 7, 1, 23, 58, 6, 115, 89, 0, 1, 53, 300, 7]


<div class="alert alert-block alert-info">
Next, let's see if we can turn these token IDs back into text using the decode method:
</div>

In [63]:
print(tokenizerv1.decode(ids))

" Oh, by Jove!" I said. It was a sketch of a donkey -- an old tired donkey, standing in the rain under a wall." By Jove -- a Stroud!" I cried.


<div class="alert alert-block alert-success">
Let's see what happens if we try the same with a text not included in our sample text.
</div>

In [64]:
text = "Hello, how are you?"
ids = tokenizerv1.encode(text)
print(ids)

KeyError: 'Hello'

<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" was not used in The Verdict short story. 

Hence, it
is not contained in the vocabulary. 

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>
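
One way to spot such gaps before encoding is to list the tokens that are missing from the vocabulary. The following is a minimal sketch with a toy vocabulary built from a single made-up sentence; `find_unknown` and `toy_vocab` are hypothetical names, not part of the notebook.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

# Toy vocabulary from one made-up sentence (not the real 1130-entry vocabulary)
toy_text = "I said it was a sketch of a donkey."
toy_tokens = [t.strip() for t in re.split(PATTERN, toy_text) if t.strip()]
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(toy_tokens)))}

def find_unknown(text, vocab):
    """Return the tokens in `text` that have no entry in `vocab`."""
    pieces = re.split(PATTERN, text)
    return [p.strip() for p in pieces if p.strip() and p.strip() not in vocab]

print(find_unknown("Hello, it was a donkey.", toy_vocab))
```

In this toy setup both the word "Hello" and the comma are reported, since the toy sentence contains neither.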

## Step 4. Adding Special Context Tokens

In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. 

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will extend the vocabulary and turn the SimpleTokenizerV1 class from the previous section into a new class, SimpleTokenizerV2, that supports two new tokens, <|unk|> and <|endoftext|>.

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add a token between
unrelated texts. 

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>



In [66]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [67]:
len(vocab.items())

1132
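
To see how <|endoftext|> separates unrelated texts, here is a small sketch with two made-up documents and a toy vocabulary extended in the same way as above. The names `docs`, `toy_vocab`, and `split_tokens` are assumptions for illustration.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Split text with the notebook's pattern and drop whitespace pieces."""
    return [p.strip() for p in re.split(PATTERN, text) if p.strip()]

# Two made-up, unrelated documents
docs = ["The first story ends here.", "A second, unrelated story begins."]

# Toy vocabulary from both documents, extended with the two special tokens
toy_tokens = sorted(set(tok for d in docs for tok in split_tokens(d)))
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: idx for idx, tok in enumerate(toy_tokens)}

# Join the documents with the separator token and map tokens to IDs
joined = " <|endoftext|> ".join(docs)
ids = [toy_vocab.get(tok, toy_vocab["<|unk|>"]) for tok in split_tokens(joined)]

print(joined)
print(ids)
```

The separator survives the split as a single token because none of its characters appear in the split pattern.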

## Step 5. Creating a Tokenizer Class with Context Tokens

<div class="alert alert-block alert-success">
A simple text tokenizer that handles unknown words</div>



<div class="alert alert-block alert-info">
Step 1: Replace unknown words by <|unk|> tokens
    
Step 2: Remove spaces before the specified punctuation
</div>

In [71]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.;:?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.encoder
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.encoder[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.;:?_!"()\'])', r'\1', text)
        return text

In [73]:
text = "Hello, how are you?"
tokenizerv2 = SimpleTokenizerV2(vocab)
ids = tokenizerv2.encode(text)
print(ids)
print(tokenizerv2.decode(ids))

[1131, 5, 560, 169, 1126, 10]
<|unk|>, how are you?


<div class="alert alert-block alert-info">
Based on the detokenized output, we can see that "Hello" is not part of the vocabulary.
</div>

<div class="alert alert-block alert-warning">
    
Depending on the LLM, some researchers also consider additional special tokens such as the following.

* [BOS] (beginning of sequence): This token marks the start of a text.

* [EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>.

* [PAD] (padding): To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token.
</div>
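
The role of these three tokens can be illustrated on plain Python lists. This is only a sketch with made-up token lists; real training pipelines typically pad token IDs rather than strings.

```python
BOS, EOS, PAD = "[BOS]", "[EOS]", "[PAD]"

# Two made-up, already tokenized texts of different length
texts = [["Hello", ",", "world"], ["Hi"]]

# Mark the start and end of each text
wrapped = [[BOS] + tokens + [EOS] for tokens in texts]

# Pad the shorter texts on the right so that all have the same length
max_len = max(len(tokens) for tokens in wrapped)
padded = [tokens + [PAD] * (max_len - len(tokens)) for tokens in wrapped]

for row in padded:
    print(row)
```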

<div class="alert alert-block alert-warning">

Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity. GPT models also do not rely on an <|unk|> token for out-of-vocabulary words. Instead, they use a byte pair encoding tokenizer, which breaks words down into subword units.

</div>

## Step 6. Byte Pair Encoding

**BPE Tokenizer**
<div class="alert alert-block alert-success">

Since implementing BPE can be relatively complicated, we will use an existing Python
open-source library called tiktoken (https://github.com/openai/tiktoken). 

This library implements
the BPE algorithm very efficiently based on source code in Rust.
</div>

In [None]:
! pip3 install tiktoken

In [78]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


<div class="alert alert-block alert-success">
Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:</div>

In [79]:
tokenizer = tiktoken.get_encoding("gpt2")

<div class="alert alert-block alert-success">
The usage of this tokenizer is similar to SimpleTokenizerV2 we implemented previously via
an encode method:</div>

In [80]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


<div class="alert alert-block alert-success">
We can then convert the token IDs back into text using the decode method, similar to our
SimpleTokenizerV2 earlier:</div>

In [81]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


<div class="alert alert-block alert-warning">
    
The BPE Tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of **50,257**, with <|endoftext|> being assigned the largest token ID.
</div>

<div class="alert alert-block alert-warning">
The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary (e.g. someunknownPlace) into smaller subword units or even individual characters.
</div>
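
The idea of merging frequent pairs and falling back to smaller units can be sketched in a few lines. This is a character-level toy on a made-up word list, not the tiktoken implementation: GPT-2's BPE works on bytes and uses a large, pre-trained merge table. The helper names below are hypothetical.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Return the most frequent adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# "Training": learn two merges from a tiny made-up corpus
seqs = [list(word) for word in ["low", "lower", "lowest", "low"]]
merges = []
for _ in range(2):
    pair = most_frequent_pair(seqs)
    merges.append(pair)
    seqs = [merge_pair(seq, pair) for seq in seqs]

def encode_word(word, merges):
    """Start from single characters and apply the learned merges in order."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

print(encode_word("slow", merges))  # unseen word, reuses the learned "low" unit
print(encode_word("xyz", merges))   # nothing to merge, stays as characters
```

An unseen word is never rejected: it is represented by whatever learned units apply, and by individual characters otherwise.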

**Let us take another simple example to illustrate how the BPE tokenizer deals with unknown tokens**

In [85]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


## Step 7. Creating Input-Target pairs

<div class="alert alert-block alert-success">
In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.</div>

<div class="alert alert-block alert-success">
To get started, we will first tokenize the whole The Verdict short story we worked with
earlier using the BPE tokenizer introduced in the previous section:</div>

In [88]:
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


<div class="alert alert-block alert-info">
Executing the code above will return 5145, the total number of tokens in the training set, after applying the BPE tokenizer.
</div>

<div class="alert alert-block alert-success">
One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1. The context size determines how many tokens are included in the input:</div>

In [97]:
context_size = 4 # length of the input
# A context_size of 4 means that the model looks at a sequence of 4 tokens
# to predict the next token in the sequence.
# The input x is the first 4 tokens [1, 2, 3, 4], and the target y is the next 4 tokens [2, 3, 4, 5].

x = enc_text[:context_size]
y = enc_text[1:context_size+1]

print(f"x: {x}")
print(f"y:     {y}")

x: [40, 367, 2885, 1464]
y:     [367, 2885, 1464, 1807]


<div class="alert alert-block alert-success">
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:</div>

In [93]:
for i in range(context_size):
    context = enc_text[:i+1]
    desired = enc_text[i+1]
    print(context, "--->", desired)

[40] ---> 367
[40, 367] ---> 2885
[40, 367, 2885] ---> 1464
[40, 367, 2885, 1464] ---> 1807


<div class="alert alert-block alert-info">
Everything left of the arrow (--->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.
</div>

<div class="alert alert-block alert-success">
For illustration purposes, let's repeat the previous code but convert the token IDs into
text:</div>

In [95]:
for i in range(context_size):
    context = enc_text[:i+1]
    desired = enc_text[i+1]
    print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))

I --->  H
I H ---> AD
I HAD --->  always
I HAD always --->  thought


<div class="alert alert-block alert-warning">
There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.
</div>

<div class="alert alert-block alert-warning">
In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.
</div>

## Step 8. Implementing a Data Loader

<div class="alert alert-block alert-success">
For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.
</div>

<div class="alert alert-block alert-info">
    
**Step 1:** Tokenize the entire text
    
**Step 2:** Use a sliding window to chunk the book into overlapping sequences of max_length

**Step 3:** Return the total number of rows in the dataset

**Step 4:** Return a single row from the dataset
</div>

In [98]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

<div class="alert alert-block alert-success">
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
</div>

<div class="alert alert-block alert-info">

**Step 1:** Initialize the tokenizer

**Step 2:** Create dataset

**Step 3:** drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training

**Step 4:** The number of CPU processes to use for preprocessing
    
</div>

In [107]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
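
To get a feel for what `drop_last` does, the sketch below counts samples and batches with plain arithmetic, without PyTorch. It assumes the 5145 tokens reported earlier and mirrors the `range` used in GPTDatasetV1; the helper names `num_windows` and `num_batches` are made up for this illustration.

```python
import math

def num_windows(num_tokens, max_length, stride):
    """Number of samples produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a data loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size  # the incomplete final batch is dropped
    return math.ceil(num_samples / batch_size)

samples = num_windows(5145, max_length=4, stride=4)
print("samples:", samples)
print("batches with drop_last=True: ", num_batches(samples, 8, True))
print("batches with drop_last=False:", num_batches(samples, 8, False))
```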

<div class="alert alert-block alert-success">

Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.

This will develop an intuition of how the GPTDatasetV1 class and the `create_dataloader_v1` function work together: </div>

In [101]:
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

<div class="alert alert-block alert-info">
Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
</div>

In [109]:
import torch
print("Pytorch version:", torch.__version__)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("first_batch:", first_batch)

Pytorch version: 2.7.1+cpu
first_batch: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


<div class="alert alert-block alert-warning">

The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since the max_length is set to 4, each of the two tensors contains 4 token IDs. 

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.
</div>

<div class="alert alert-block alert-success">
    
To illustrate the meaning of stride=1, let's fetch another batch from this dataset: </div>

In [110]:
second_batch = next(data_iter)
print("second_batch:", second_batch)

second_batch: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


<div class="alert alert-block alert-warning">

If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch. 

For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input. 

The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach.
</div>
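
The effect of the stride can also be seen without PyTorch. The sketch below applies the same windowing logic as GPTDatasetV1 to a made-up list of ten token IDs; `sliding_windows` is a hypothetical helper, not part of the notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same loop as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 4):
    print(f"stride={stride}")
    for inputs, targets in sliding_windows(toy_ids, max_length=4, stride=stride):
        print("  ", inputs, "--->", targets)
```

With a stride of 1, consecutive inputs share three of their four tokens; with a stride equal to max_length, the inputs do not overlap at all.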

<div class="alert alert-block alert-success">
Before we move on to creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
</div>

In [111]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])


<div class="alert alert-block alert-info">
Note that the stride is still 1 here, so consecutive rows overlap by three tokens. Increasing the stride to 4, the same value as max_length, utilizes the data set fully (we don't skip a single word) but also avoids any overlap between the rows, since more overlap could lead to increased overfitting.
</div>

## Step 9. Creating Token Embeddings

<div class="alert alert-block alert-success">
Let's illustrate how the token ID to embedding vector conversion works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:
</div>

In [114]:
input_ids = torch.tensor([2, 3, 5, 1])

<div class="alert alert-block alert-success">
For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):
</div>

<div class="alert alert-block alert-success">
Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 100 for reproducibility purposes:
</div>

In [118]:
vocab_size = 6
output_dim = 3

torch.manual_seed(100)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.1268,  1.3564, -0.0247],
        [-0.8466,  0.0293, -0.5721],
        [-1.2546,  0.0486,  0.2753],
        [-2.1550, -0.7116,  0.0575],
        [ 0.6263, -1.7736, -0.2205],
        [ 2.7467, -1.0480,  1.1239]], requires_grad=True)


<div class="alert alert-block alert-info">
We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.
</div>

<div class="alert alert-block alert-success">
Now that we have instantiated the embedding layer, let's apply it to a token ID to obtain the embedding vector:
</div>

In [123]:
print(embedding_layer(torch.tensor([3])))

tensor([[-2.1550, -0.7116,  0.0575]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row. In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
</div>

<div class="alert alert-block alert-success">
Let's now apply that to all four input IDs we defined earlier (torch.tensor([2, 3, 5, 1])):
</div>

In [125]:
print(embedding_layer(input_ids))

tensor([[-1.2546,  0.0486,  0.2753],
        [-2.1550, -0.7116,  0.0575],
        [ 2.7467, -1.0480,  1.1239],
        [-0.8466,  0.0293, -0.5721]], grad_fn=<EmbeddingBackward0>)


<div class="alert alert-block alert-info">
Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.
</div>
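
The lookup view can be checked with plain Python lists. The sketch below copies the printed weight matrix from above (rounded to four decimals) and compares row indexing with a one-hot vector multiplied by the matrix; the function names are made up for this illustration and no PyTorch is involved.

```python
# Weight matrix copied from the printed output above (rounded values)
weights = [
    [ 0.1268,  1.3564, -0.0247],
    [-0.8466,  0.0293, -0.5721],
    [-1.2546,  0.0486,  0.2753],
    [-2.1550, -0.7116,  0.0575],
    [ 0.6263, -1.7736, -0.2205],
    [ 2.7467, -1.0480,  1.1239],
]

def embed(token_ids):
    """Embedding as a lookup: pick one row per token ID."""
    return [weights[i] for i in token_ids]

def embed_one_hot(token_id):
    """Same result via a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if k == token_id else 0.0 for k in range(len(weights))]
    return [
        sum(one_hot[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(weights[0]))
    ]

print(embed([2, 3, 5, 1]))
print(embed_one_hot(3))
```

Both routes return the same row, but the lookup avoids multiplying by all the zeros of the one-hot vector.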

## Step 10. Positional Embeddings

<div class="alert alert-block alert-success">

Let's now consider more realistic embedding sizes and encode the input tokens into a 256-dimensional vector representation. 

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation. 

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we used earlier, which has a vocabulary size of 50,257:
</div>

In [127]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

<div class="alert alert-block alert-info">
    
Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
</div>
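
As a quick sanity check on these sizes, the arithmetic can be written out in plain Python. The variable names `embedding_weights`, `batch_shape`, and `batch_values` are illustrative only.

```python
vocab_size, output_dim = 50257, 256
batch_size, max_length = 8, 4

# One row of output_dim weights per token in the vocabulary
embedding_weights = vocab_size * output_dim

# Each token ID in a batch is replaced by an output_dim-sized vector
batch_shape = (batch_size, max_length, output_dim)
batch_values = batch_size * max_length * output_dim

print("weights in the embedding layer:", embedding_weights)
print("shape of one embedded batch:", batch_shape)
print("values in one embedded batch:", batch_values)
```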

<div class="alert alert-block alert-success">
Let's first instantiate the data loader from the sliding-window data sampling step:
</div>

In [129]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [131]:
print("Token IDs: \n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


<div class="alert alert-block alert-info">
As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.
</div>