## Tokenizing Text

LLMs cannot process raw text data directly, since it is categorical and so isn't compatible with the mathematical operations used to train neural networks. Therefore, we need a way to represent words as continuous-valued vectors. This is known as <i>embedding</i>. 

The steps involved in embedding include splitting a text into tokens (words and punctuation), converting those tokens into token IDs, and then turning those token IDs into embedding vectors.

The goal here is to tokenize a 20,479-character short story into individual words and special characters that can then be turned into embeddings for LLM training.

In [2]:
# Get short story to work with
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x2c5b019a540>)

In [3]:
# Read the file and print the first 99 characters
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters: ", '{:,}'.format(len(raw_text)))
print(raw_text[:99])

Total number of characters:  20,479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


As a first pass, we can use the re.split function to split on whitespace characters with the pattern '(\s)'. To also split on commas and full stops, we extend the pattern to '([,.]|\s)'.

In [4]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [5]:
# Remove the redundant whitespace-only and empty strings
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
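
The parentheses in the pattern matter: they form a capturing group, and re.split keeps whatever a capturing group matches in the result list. The short comparison below is only an illustrative sketch (the variable names are made up for this example); it shows that without the group the punctuation is discarded, which we do not want, since punctuation should become tokens too.

```python
import re

text = "Hello, world."

# Without a capturing group, the delimiters are dropped from the result
without_group = re.split(r'[,.]|\s', text)

# With a capturing group, the delimiters are kept as separate items
with_group = re.split(r'([,.]|\s)', text)

print(without_group)  # ['Hello', '', 'world', '']
print(with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']
```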


We can now apply this to the short story, modifying it to work with other forms of punctuation.

In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed)) # 4,690 tokens
print(preprocessed[:30]) # First 30 tokens

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


### Converting tokens into token IDs

We convert the tokens from Python strings into integers to produce the token IDs. This is an intermediate step before converting the token IDs into embedding vectors.

In [None]:
# Create a list of all unique tokens and sort them alphabetically
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size) # 1,130 unique tokens

1130


In [12]:
# Print first 21 entries
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 20:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)


We can build a tokenizer class with an <i>encode</i> method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. We will also implement a <i>decode</i> method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. 

In [13]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab # vocab stored as an instance attribute for access in encode & decode methods
        self.int_to_str = {i:s for s,i in vocab.items()} # mapping for token IDs -> original text tokens

    def encode(self, text): # input text -> token IDs
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids): # convert token IDs -> text
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # remove spaces before the specified punctuation
        return text

In [14]:
# Generate token IDs using a sample
tokenizer = SimpleTokenizerV1(vocab)
text = """
       "It's the last he painted, you know,"
       Mrs. Gisburn said with pardonable pride. 
       """
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [15]:
# Turn the token IDs back into text
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [None]:
# Try with random text
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

# KeyError: 'Hello'

Since "Hello" was not used in the short story, it is not contained in the vocabulary. We need to extend the tokenizer further to be able to handle unknown words, as well as other special context tokens.
- <|unk|> will signify an unknown word.
- <|endoftext|> will signify the end of the text.

In [5]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}
print(len(vocab.items())) # 2 extra tokens

1132


In [6]:
# Print the last 5 tokens
for _, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [7]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab 
        self.int_to_str = {i:s for s,i in vocab.items()} 

    def encode(self, text): 
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed] # replaces unknown words with <|unk|> token
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids): # convert token IDs -> text
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # remove spaces before the specified punctuation
        return text

In [12]:
# Let's try the new tokenizer
# Will also add together two unrelated texts to show how <|endoftext|> works
text1 = "Hello, do you like tea?"
text2 = "In the end, it doesn't even matter."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the end, it doesn't even matter.


In [10]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text)) # 1130 and 1131 are the two new special tokens in the extended vocab

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 388, 5, 585, 356, 2, 970, 399, 1131, 7]


In [11]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the end, it doesn' t even <|unk|>.


### Byte pair encoding

Byte pair encoding (BPE) is a more sophisticated tokenization scheme. A BPE tokenizer was used to train models such as GPT-2, GPT-3 and the original model behind ChatGPT. Implementing it can be complicated, so we will use an existing open source library called <i>tiktoken</i>.

In [8]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.8.0


In [9]:
tokenizer = tiktoken.get_encoding("gpt2")

In [13]:
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 886, 11, 340, 1595, 470, 772, 2300, 13]


In [14]:
strings = tokenizer.decode(integers)
print(strings) # done perfectly

Hello, do you like tea? <|endoftext|> In the end, it doesn't even matter.


Two things to note:
- The BPE tokenizer has a much larger vocabulary (50,257 tokens), so its token IDs are much larger too; <|endoftext|> is assigned the largest ID, 50256.
- The BPE tokenizer is able to correctly encode and decode unknown words.

On the second point, it is able to do this because it breaks down words not in its predefined vocabulary into smaller subword units or even individual characters. This ensures that the LLM can process any type of text, even if it contains words not found in its training data.
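
To make the subword idea concrete, here is a toy, character-level sketch of BPE. It is not tiktoken's implementation: the GPT-2 tokenizer works on bytes and ships with a pre-trained merge table, whereas this sketch learns a handful of merges from six made-up training words. The function names (apply_merge, learn_merges, bpe_tokenize) and the training words are assumptions for illustration only.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        # Rewrite the corpus with the most frequent pair merged
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best_pair)] += freq
        corpus = new_corpus
    return merges

def bpe_tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

training_words = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_merges(training_words, num_merges=10)

print(bpe_tokenize("lowest", merges))  # unseen word built from learned subwords
print(bpe_tokenize("xyz", merges))     # no learned merges apply: falls back to characters
```

Because every word can always be broken down to single characters (or single bytes in the real tokenizer), no input is ever "unknown", which is why the <|unk|> token is not needed.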

### Data sampling with a sliding window

We need to generate the input-target pairs required for training an LLM.

In [17]:
# Tokenize the whole short story using the BPE tokenizer
enc_text = tokenizer.encode(raw_text)
print(len(enc_text)) # tokenises into 5,145 tokens

5145


In [18]:
# For demonstration purposes, skip the first 50 tokens in the dataset
enc_sample = enc_text[50:]

In [19]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [20]:
# By processing the inputs along with the targets, which are the inputs shifted by one position
# we can create the next-word prediction tasks
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


The LLM would receive the token ID(s) on the left side of the arrow, while the right side represents the target token ID, which is what the LLM is supposed to predict.

In [21]:
for i in range(1, context_size+1):   
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Above is the general idea for creating the input-target pairs used in LLM training. We also need an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors. This will be done using PyTorch's built-in classes.

In [25]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"}) # tokenizes the entire text

        for i in range(0, len(token_ids) - max_length, stride): # sliding window to chunk text into overlapping sequences of max_length
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self): # total number of rows in the dataset
        return len(self.input_ids)
    
    def __getitem__(self, idx): # returns a single row from the dataset
        return self.input_ids[idx], self.target_ids[idx]

The above class is based on the PyTorch Dataset class and defines how individual rows are fetched from the dataset. Each row consists of a number of token IDs (based on max_length) assigned to an input_chunk tensor, paired with a target_chunk tensor that holds the same window shifted by one position.

The following code uses the above class to load the inputs in batches via a PyTorch dataloader.

In [23]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) # creates the dataset
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last, # if True, drops the last batch if it is smaller than specified batch_size
        num_workers=num_workers # number of CPU processes to use for preprocessing
    )

    return dataloader

In [26]:
# Test the dataloader with a batch size of 1, with a context size of 4.
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The first_batch variable contains two tensors: the input IDs and the target IDs. Since max_length=4, each tensor contains only 4 IDs; it is common to train LLMs with input sizes of at least 256.

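To understand the meaning of stride=1, we grab a second batch below. The second batch's token IDs are the first batch's shifted by one position. The <i>stride</i> setting dictates the number of positions the input window shifts from one sample to the next (here, with a batch size of 1, from one batch to the next), emulating a sliding window. With a stride smaller than max_length the windows overlap; setting the stride equal to max_length avoids this overlap, which can help reduce overfitting.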

In [27]:
# Second batch
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
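
The effect of the stride can also be seen without PyTorch. The sketch below reuses the same loop bounds as GPTDatasetV1 on a made-up list of ten token IDs; sliding_windows is a hypothetical helper written for this illustration, not part of the code above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same windowing as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]  # inputs shifted by one position
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
# 6 [([0, 1, 2, 3], [1, 2, 3, 4]), ([1, 2, 3, 4], [2, 3, 4, 5])]
print(len(disjoint), disjoint)
# 2 [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

With stride=1 almost every token appears in four different input windows, while stride=4 (equal to max_length) uses each token in at most one input window, at the cost of producing fewer rows.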


A batch size of 1 is chosen for illustration purposes. Small batch sizes require less memory during training but lead to noisier model updates. Batch size is therefore a trade-off and a hyperparameter to tune. Below is an example with a batch size of 8.

In [28]:
# Grab one batch of 8 samples (shuffle defaults to True, so the rows are in random order)
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("Targets:\n", targets)

Inputs:
 tensor([[  290, 45592,    12, 14792],
        [ 2952,    13,   198,   198],
        [  477,    11,   645,  1551],
        [  438,  1219,    11,   314],
        [ 1109,   815,   307,   900],
        [50085,   757,    13,   314],
        [  290,   783,   340,   338],
        [11441, 48740,   438, 14295]])
Targets:
 tensor([[45592,    12, 14792,  5986],
        [   13,   198,   198,     1],
        [   11,   645,  1551,  1051],
        [ 1219,    11,   314,   373],
        [  815,   307,   900,   866],
        [  757,    13,   314,  2497],
        [  783,   340,   338, 30703],
        [48740,   438, 14295,   338]])


### Creating token embeddings

The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors (necessary for backpropagation). First, we initialise these embedding weights with random values.

Below is a short example: a vocabulary of 6 tokens (versus 50,257 in the BPE tokenizer vocabulary) and an embedding of 3 dimensions (versus 12,288 in GPT-3).

In [29]:
# A short example
input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The weight matrix of the embedding layer above contains small, random values. These are optimised during LLM training. It has 6 rows (one row for each token in the vocabulary) and 3 columns (one for each of the three embedding dimensions). If we pass a single token ID, as below, we can see that the result matches the corresponding row above; so the embedding layer is essentially a lookup operation that retrieves rows from the embedding layer's weight matrix via a token ID.

In [31]:
# Apply the embedding layer to a single token ID
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [32]:
# For all four input IDs
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
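
One way to see why this lookup fits into a neural network is that it gives the same result as multiplying a one-hot vector by the weight matrix, just without the wasted multiplications by zero. The sketch below checks this with plain Python lists rather than PyTorch, copying the weight values printed above; lookup and one_hot_matmul are hypothetical helper names for this illustration.

```python
# Weight values copied from the printed embedding_layer.weight above
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(token_id):
    """Embedding as a row lookup."""
    return weights[token_id]

def one_hot_matmul(token_id):
    """Embedding as a one-hot row vector multiplied by the weight matrix."""
    one_hot = [1.0 if row == token_id else 0.0 for row in range(len(weights))]
    n_cols = len(weights[0])
    return [
        sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
        for col in range(n_cols)
    ]

for token_id in [2, 3, 5, 1]:
    print(token_id, lookup(token_id), one_hot_matmul(token_id))
```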


### Encoding word positions

In principle, token embeddings are a suitable input for an LLM. However, the self-attention mechanism has no notion of position or order for the tokens within a sequence. The embedding layer maps a token ID to the same vector representation regardless of where that token ID is positioned in the input sequence, so we want to inject additional position information into the LLM.

There are two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. The latter are directly associated with specific positions in a sequence (for each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location). Relative positional embeddings focus on the relative position or distance between tokens. The model learns "how far apart" rather than "at which exact position". This means the model can generalise better to sequences of varying lengths, even if it hasn't seen such lengths during training.

Both types of positional embeddings augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The GPT models use absolute positional embeddings that are optimised during training, rather than being fixed or predefined. 
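
The difference between the two views can be illustrated with plain integers. The sketch below is not how any particular model implements positional embeddings; it only shows which quantity each scheme keys on. The helper names absolute_positions and relative_offsets are made up for this example.

```python
def absolute_positions(start, length):
    """Absolute view: each token is tagged with its index in the sequence."""
    return list(range(start, start + length))

def relative_offsets(positions):
    """Relative view: entry [i][j] is how far token j is from token i."""
    return [[j - i for j in positions] for i in positions]

# The same four tokens seen at the start of a text and five positions later
window_a = absolute_positions(0, 4)  # [0, 1, 2, 3]
window_b = absolute_positions(5, 4)  # [5, 6, 7, 8]

print(window_a != window_b)                                      # True: absolute indices differ
print(relative_offsets(window_a) == relative_offsets(window_b))  # True: pairwise distances match
print(relative_offsets(window_a)[0])                             # [0, 1, 2, 3]
```

Shifting the window changes every absolute index but leaves all pairwise offsets unchanged, which is the intuition behind the "how far apart" description above.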

In [33]:
# A more realistic embedding size
vocab_size = 50_257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Now, if we sample data from the dataloader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with 4 tokens each, the result is an 8 x 4 x 256 tensor.

In [34]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4,
    stride=4, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [35]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [36]:
# For the absolute embedding approach, create another embedding layer of same dimension
context_length = 4
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


The input to pos_embedding_layer is usually a placeholder vector torch.arange(context_length), which contains the integers from 0 to context_length - 1. The resulting positional embedding tensor consists of four 256-dimensional vectors. We add these to the token embeddings: PyTorch broadcasts the 4 x 256 pos_embeddings tensor across the batch, adding it to the 4 x 256 token embedding tensor of each of the 8 samples.

In [37]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
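
What the broadcasted addition does can be spelled out with nested Python lists. The sketch below uses a tiny made-up batch (2 samples, 3 positions, 2 dimensions) with integer values so the arithmetic is easy to follow; add_positions is a hypothetical helper, and in practice the single PyTorch "+" above does the same job.

```python
def add_positions(token_embeddings, pos_embeddings):
    """Add the same [context_length][dim] positional table to every sample in the batch."""
    return [
        [
            [t + p for t, p in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sample, pos_embeddings)
        ]
        for sample in token_embeddings
    ]

# Shape [2][3][2]: 2 samples, 3 positions, 2 embedding dimensions
batch = [
    [[1, 1], [2, 2], [3, 3]],
    [[4, 4], [5, 5], [6, 6]],
]
# Shape [3][2]: one vector per position, shared by all samples
pos = [[10, 20], [30, 40], [50, 60]]

out = add_positions(batch, pos)
print(out[0])  # [[11, 21], [32, 42], [53, 63]]
print(out[1])  # [[14, 24], [35, 45], [56, 66]]
```

The output keeps the batch's shape, and position 1 receives the same offset [30, 40] in both samples, which mirrors how the 4 x 256 positional tensor is reused across all 8 samples.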
