This script covers the steps involved in preparing text for training large language models, including:

- Splitting the text into word and subword tokens.
- Using byte pair encoding (BPE) for more advanced tokenization.
- Sampling training examples using a sliding window approach.
- Converting tokens into vectors to be fed into the large language model.


## Tokenization

You can find the input raw text [here](https://en.wikisource.org/wiki/Brother_Leo).

In [1]:
import os

In [2]:
with open('/input/Leo.txt', 'r', encoding="utf-8") as file:
  content = file.read()

In [3]:
print(f"Total number of characters present is {len(content)}")

Total number of characters present is 18036


In [4]:
print(content[:150])

IT was a sunny morning, and I was on my way to Torcello. Venice lay behind us a dazzling line, with towers of gold against the blue lagoon. All at onc


In [5]:
import re

In [6]:
text = "Hello, world. This is a sample text."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'sample', ' ', 'text.']


The result is a list of words and whitespace characters. Punctuation is still attached to the neighboring words (for example, 'Hello,').

Now, let's modify the regular expression so that it also splits on commas and periods.

In [7]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'sample', ' ', 'text', '.', '']


In [8]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', 'is', 'a', 'sample', 'text', '.']
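
The empty strings in the intermediate output come from how `re.split` handles a capturing group: the separators themselves are returned, and two adjacent separators (a comma followed by a space) leave an empty string between them. A minimal sketch that wraps both steps into one function; the name `simple_tokenize` is made up for this illustration:

```python
import re

def simple_tokenize(text):
    # The capturing group makes re.split return the separators as well.
    pieces = re.split(r'([,.]|\s)', text)
    # Drop empty strings and whitespace-only pieces; keep words and punctuation.
    return [piece for piece in pieces if piece.strip()]

print(simple_tokenize("Hello, world."))  # ['Hello', ',', 'world', '.']
```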


In [9]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In the above example, the sample text is split into 10 tokens. Now, we will apply this tokenizer to the content of our text file.

In [10]:
preprocessed_text =  re.split(r'([,.:;?_!"()\']|--|\s)', content)
preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]
print(len(preprocessed_text))

4024


So, we have 4024 tokens in total in our corpus. Let's print the first 30 tokens from this list.

In [11]:
preprocessed_text[:30]

['IT',
 'was',
 'a',
 'sunny',
 'morning',
 ',',
 'and',
 'I',
 'was',
 'on',
 'my',
 'way',
 'to',
 'Torcello',
 '.',
 'Venice',
 'lay',
 'behind',
 'us',
 'a',
 'dazzling',
 'line',
 ',',
 'with',
 'towers',
 'of',
 'gold',
 'against',
 'the',
 'blue']

We can clearly see from the output that the list contains no whitespace tokens and that punctuation marks appear as separate tokens. We have successfully converted the raw text into individual tokens.

## Converting tokens to token IDs

Let's create a list of unique tokens and sort them alphabetically to determine the vocabulary size.

In [12]:
all_unique_words = sorted(set(preprocessed_text))
vocab_size = len(all_unique_words)
print(vocab_size)

988


In [13]:
vocab = {token: integer for integer, token in enumerate(all_unique_words)}

In [14]:
for i, item in enumerate(vocab.items()):
  print(item)
  if i >= 50:
    break

('!', 0)
('"', 1)
("'", 2)
(',', 3)
('.', 4)
(':', 5)
(';', 6)
('?', 7)
('A', 8)
('After', 9)
('Ah', 10)
('All', 11)
('Altinum', 12)
('And', 13)
('As', 14)
('At', 15)
('Besides', 16)
('Brother', 17)
('Burano', 18)
('But', 19)
('Deserto', 20)
('English', 21)
('Enter', 22)
('Esau', 23)
('Even', 24)
('Excellency', 25)
('Far', 26)
('First', 27)
('Francesco', 28)
('Francis', 29)
('French', 30)
('God', 31)
('He', 32)
('Here', 33)
('His', 34)
('I', 35)
('IT', 36)
('If', 37)
('Indeed', 38)
('It', 39)
('Leo', 40)
('Lorenzo', 41)
('May', 42)
('Meanwhile', 43)
('No', 44)
('Now', 45)
('Once', 46)
('One', 47)
('Only', 48)
('Our', 49)
('Perhaps', 50)
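
The ordering in the output above follows from Python comparing strings by Unicode code point: punctuation sorts before uppercase letters, which sort before lowercase letters. A small sketch on a made-up token list (not taken from the story) shows the same effect:

```python
# Hypothetical token list with a duplicate, used only for illustration.
tokens = ["the", "IT", ",", "was", ".", "the"]

# set() removes duplicates; sorted() orders by Unicode code point.
toy_vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}

print(toy_vocab)  # {',': 0, '.': 1, 'IT': 2, 'the': 3, 'was': 4}
```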


## Implementing a Simple Text Tokenizer

In [15]:
class TokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s, i in vocab.items()}

  def encode(self, text):
    preprocessed_text =  re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]
    ids = [self.str_to_int[token] for token in preprocessed_text]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[id] for id in ids])
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text


In [16]:
tokenizer = TokenizerV1(vocab)
text = """Yes, it is not for himself that he is searching,"
           said the superior."""
ids = tokenizer.encode(text)
print(ids)

[76, 3, 466, 462, 576, 353, 419, 849, 404, 462, 732, 3, 1, 721, 850, 826, 4]


In [17]:
print(tokenizer.decode(ids))

Yes, it is not for himself that he is searching," said the superior.
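
The `decode` step joins tokens with spaces and then removes the whitespace in front of punctuation. The sketch below isolates that substitution on two made-up token lists. The second one shows a limitation: an opening quotation mark is also pulled onto the preceding token, so decoding does not always reproduce the original spacing.

```python
import re

def join_tokens(tokens):
    text = " ".join(tokens)
    # Remove whitespace that precedes any of the listed punctuation characters.
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

print(join_tokens(["Hello", ",", "world", "."]))           # Hello, world.
print(join_tokens(["he", "said", ",", '"', "Yes", '"']))   # he said," Yes"
```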


Now, let's apply the tokenizer to a sample text containing words that are not present in the vocabulary.

In [18]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

This error occurs because 'Hello' is not in the vocabulary built from the short story. It highlights the need for large and diverse training sets to extend the vocabulary when working on LLMs.
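
One way to see which tokens would trigger the `KeyError` is to list the out-of-vocabulary tokens before encoding. This is a minimal sketch: `find_unknown_tokens` and `toy_vocab` are hypothetical names, and the tiny vocabulary stands in for the one built from the story.

```python
import re

def find_unknown_tokens(text, vocab):
    # Same splitting rule as the tokenizer above.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [piece.strip() for piece in pieces if piece.strip()]
    return [token for token in tokens if token not in vocab]

# Made-up vocabulary for illustration only.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "you": 4}

print(find_unknown_tokens("Hello, do you like tea?", toy_vocab))  # ['Hello', 'tea']
```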

## Adding Special Context Tokens

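Now, we will add two special tokens to the vocabulary:

* <|unk|>: stands in for any token that is not in the vocabulary
* <|endoftext|>: marks the boundary between two unrelated text sources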



In [19]:
all_tokens = sorted(set(preprocessed_text))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

In [20]:
vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [21]:
print(len(vocab))

990


In [22]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('yours—at', 985)
('youth', 986)
('you—one', 987)
('<|unk|>', 988)
('<|endoftext|>', 989)


In [23]:
class TokenizerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s, i in vocab.items()}

  def encode(self, text):
    preprocessed_text =  re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]
    preprocessed_text = [token if token in self.str_to_int else  "<|unk|>" for token in preprocessed_text]
    ids = [self.str_to_int[token] for token in preprocessed_text]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[id] for id in ids])
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text


In [24]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [25]:
tokenizer = TokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)

[988, 3, 262, 979, 506, 988, 7, 989, 988, 850, 988, 988, 585, 850, 988, 4]


In [26]:
tokens = tokenizer.decode(ids)
print(tokens)

<|unk|>, do you like <|unk|>? <|endoftext|> <|unk|> the <|unk|> <|unk|> of the <|unk|>.

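
The decoded text makes clear how much information is lost when the vocabulary is small. As a quick check, the sketch below counts the unknown tokens in the ID list printed above (988 is the ID assigned to `<|unk|>` in this vocabulary):

```python
# Token IDs copied from the encode() output above.
ids = [988, 3, 262, 979, 506, 988, 7, 989, 988, 850, 988, 988, 585, 850, 988, 4]
unk_id = 988  # ID of <|unk|> in the extended vocabulary

unk_count = ids.count(unk_id)
print(f"{unk_count} of {len(ids)} tokens are unknown")  # 6 of 16 tokens are unknown
```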

## Byte Pair Encoding

In [28]:
import tiktoken

In [29]:
tokenizer = tiktoken.get_encoding("gpt2")

In [30]:
text = (
            "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
             "of someunknownPlace."
)

In [31]:
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [32]:
tokens = tokenizer.decode(ids)
print(tokens)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.

Note that the two string literals in the cell above are concatenated without a space between them, which is why the decoded text reads "terracesof".


In [33]:
sample_text = "Akwirw ier"
sample_ids = tokenizer.encode(sample_text)
print(sample_ids)

[33901, 86, 343, 86, 220, 959]
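
The made-up string is broken into several smaller pieces instead of raising an error, because BPE can always fall back to shorter subword units. The GPT-2 tokenizer in `tiktoken` uses a pretrained merge table and operates on bytes; the sketch below is only a toy, character-level illustration of the merge idea on three made-up words. The helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    # Count adjacent symbol pairs across all sequences.
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    # Replace each non-overlapping occurrence of `pair` with one merged symbol.
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from single characters and apply two merge steps.
sequences = [list(word) for word in ["low", "lower", "lowest"]]
merges = []
for _ in range(2):
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    merges.append(pair)
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)  # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After two merges the frequent prefix "low" has become a single symbol, while the rarer endings remain split into characters.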


## Data Sampling with Sliding Window

In [34]:
with open('/input/Leo.txt', 'r', encoding='utf-8') as file:
  raw_text = file.read()

In [35]:
raw_text

'IT was a sunny morning, and I was on my way to Torcello. Venice lay behind us a dazzling line, with towers of gold against the blue lagoon. All at once a breeze sprang up from the sea; the small, feathery islands seemed to shake and quiver, and, like leaves driven before a gale, those flocks of colored butterflies, the fishing-boats, ran in before the storm. Far away to our left stood the ancient tower of Altinum, with the island of Burano a bright pink beneath the towering clouds. To our right, and much nearer, was a small cypress-covered islet. One large umbrella-pine hung close to the sea, and behind it rose the tower of the convent church. The two gondoliers consulted together in hoarse cries and decided to make for it.\n\n"It is San Francesco del Deserto," the elder explained to me. "It belongs to the little brown brothers, who take no money and are very kind. One would hardly believe these ones had any religion, they are such a simple people, and they live on fish and the vegeta

In [36]:
encoded_text = tokenizer.encode(raw_text)
print(len(encoded_text))

4279


The training set contains 4279 tokens in total. Next, we remove the first 50 tokens from the dataset.

In [37]:
encoded_sample = encoded_text[50:]

It's time to create the input-target pairs. Let's look at one example first:

In [38]:
context_window = 4
x = encoded_sample[:context_window]
y = encoded_sample[1:context_window+1]
print(x)
print("    ",y)

[88, 14807, 3947, 284]
     [14807, 3947, 284, 13279]


In [39]:
for i in range(1, context_window+1):
  input = encoded_sample[:i]
  target = encoded_sample[i]
  print(input, "-->", target)

[88] --> 14807
[88, 14807] --> 3947
[88, 14807, 3947] --> 284
[88, 14807, 3947, 284] --> 13279


Let's repeat the process above, but this time print the decoded text instead of the token IDs.

In [40]:
for i in range(1, context_window+1):
  input = encoded_sample[:i]
  target = encoded_sample[i]
  print(tokenizer.decode(input), "-->", tokenizer.decode([target]))

y -->  islands
y islands -->  seemed
y islands seemed -->  to
y islands seemed to -->  shake


We've now created input-target pairs.
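
The two views shown above are connected: a single pair `(x, y)`, where `y` is `x` shifted by one position, contains one next-token prediction task per position. A short sketch using the five token IDs printed earlier:

```python
# First five token IDs of encoded_sample, copied from the outputs above.
encoded_sample = [88, 14807, 3947, 284, 13279]
context_window = 4

x = encoded_sample[:context_window]
y = encoded_sample[1:context_window + 1]

# Position i of y is the target for the prefix x[:i + 1].
pairs = [(x[:i + 1], y[i]) for i in range(context_window)]
for context, target in pairs:
    print(context, "-->", target)
```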

## Dataset for Batched Input and Targets

In [41]:
import torch
from torch.utils.data import Dataset, DataLoader

In [42]:
class GPTDatasetV1(Dataset):
  def __init__(self, text, tokenizer, max_length, stride):
    self.input_ids = list()
    self.target_ids = list()

    token_ids = tokenizer.encode(text)

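    # Slide a window of max_length tokens over the text; targets are the inputs shifted by one
    for i in range(0, len(token_ids) - max_length, stride):
      input_chunk = token_ids[i:i+max_length]
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))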

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, index):
    return self.input_ids[index], self.target_ids[index]
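
To see what the loop in `GPTDatasetV1` produces without loading the tokenizer or PyTorch, here is a plain-Python sketch of the same windowing logic on made-up token IDs. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    inputs, targets = [], []
    # Stop early enough that every target window is complete.
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets

token_ids = list(range(10))  # stand-in for real token IDs

inputs, targets = sliding_windows(token_ids, max_length=4, stride=4)
print(inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(targets)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# A stride of 1 yields overlapping windows and therefore more samples.
print(len(sliding_windows(token_ids, max_length=4, stride=1)[0]))  # 6
```

A stride equal to `max_length` uses every token as an input exactly once, while a smaller stride produces overlapping samples.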


## DataLoader to generate batches with input-target pairs

In [43]:
# define DataLoader object
def create_dataloader_v1(
                          text,
                          batch_size=4,
                          max_length=256,
                          stride=128,
                          shuffle=True,
                          drop_last=True,
                          num_workers=0
                      ):

  tokenizer = tiktoken.get_encoding('gpt2')
  dataset = GPTDatasetV1(text, tokenizer, max_length, stride)
  dataloader = DataLoader(
                            dataset,
                            batch_size=batch_size,
                            shuffle=shuffle,
                            drop_last=drop_last,
                            num_workers=num_workers
                        )

  return dataloader

Let's test the dataloader with a batch size of 1 and a context size of 4.

In [44]:
with open('/input/Leo.txt', 'r', encoding='utf-8') as file:
  raw_text = file.read()

1st example

In [45]:
dataloader = create_dataloader_v1(
                          raw_text,
                          batch_size=1,
                          max_length=4,
                          stride=1,
                          shuffle=False)

In [46]:
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[ 2043,   373,   257, 27737]]), tensor([[  373,   257, 27737,  3329]])]


In [47]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[  373,   257, 27737,  3329]]), tensor([[  257, 27737,  3329,    11]])]


2nd example

In [48]:
dataloader = create_dataloader_v1(
                          raw_text,
                          batch_size=8,
                          max_length=4,
                          stride=4,
                          shuffle=False)

In [49]:
data_iter = iter(dataloader)
input, target = next(data_iter)

In [50]:
print(input)
print(target)

tensor([[ 2043,   373,   257, 27737],
        [ 3329,    11,   290,   314],
        [  373,   319,   616,   835],
        [  284,  4022,  3846,    78],
        [   13, 29702,  3830,  2157],
        [  514,   257, 41535,  1627],
        [   11,   351, 18028,   286],
        [ 3869,  1028,   262,  4171]])
tensor([[  373,   257, 27737,  3329],
        [   11,   290,   314,   373],
        [  319,   616,   835,   284],
        [ 4022,  3846,    78,    13],
        [29702,  3830,  2157,   514],
        [  257, 41535,  1627,    11],
        [  351, 18028,   286,  3869],
        [ 1028,   262,  4171, 19470]])
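
With `drop_last=True`, the final incomplete batch is discarded. The arithmetic below is a sketch of how many samples and batches the second example yields, assuming the 4279 tokens reported earlier and the same loop bounds as `GPTDatasetV1`:

```python
num_tokens = 4279  # token count reported for the story above
max_length, stride, batch_size = 4, 4, 8

# Same range as the loop in GPTDatasetV1.
num_samples = len(range(0, num_tokens - max_length, stride))
full_batches, leftover = divmod(num_samples, batch_size)

print(num_samples)   # 1069
print(full_batches)  # 133
print(leftover)      # 5 samples are dropped when drop_last=True
```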


## Create Token Embeddings

Let's see how we can convert token IDs to embeddings through an example:

In [51]:
input_ids = torch.tensor([2, 4, 1, 5])

Also, let's suppose that we have a vocabulary of size 6 and we want to create embeddings of size 3.

In [52]:
vocab_size = 6
output_dim = 3

In [53]:
torch.manual_seed(144)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [54]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 2.0498, -0.5850,  0.0478],
        [ 0.9948,  0.8840,  0.0773],
        [ 3.2101, -1.1649, -0.5699],
        [ 1.3446,  1.2875,  0.9301],
        [ 0.5089,  0.4857, -0.9258],
        [ 1.8692,  0.9056,  0.5658]], requires_grad=True)


The weight matrix of the embedding layer contains small random values. These values are updated during LLM training as part of the model's optimization.

In [55]:
print(embedding_layer(input_ids))

tensor([[ 3.2101, -1.1649, -0.5699],
        [ 0.5089,  0.4857, -0.9258],
        [ 0.9948,  0.8840,  0.0773],
        [ 1.8692,  0.9056,  0.5658]], grad_fn=<EmbeddingBackward0>)
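
Each row of the output above is the row of the weight matrix selected by the corresponding token ID. The following plain-Python sketch (no PyTorch, with the weights copied from the printed matrix) shows that this row lookup gives the same result as multiplying a one-hot vector by the weight matrix. The helper names are made up for the illustration.

```python
# Weight matrix copied from the printed embedding_layer.weight above.
weight = [
    [2.0498, -0.5850, 0.0478],
    [0.9948, 0.8840, 0.0773],
    [3.2101, -1.1649, -0.5699],
    [1.3446, 1.2875, 0.9301],
    [0.5089, 0.4857, -0.9258],
    [1.8692, 0.9056, 0.5658],
]
input_ids = [2, 4, 1, 5]

# Embedding lookup: pick the row that belongs to each token ID.
looked_up = [weight[token_id] for token_id in input_ids]

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

def vector_times_matrix(vec, matrix):
    # (1 x V) vector times (V x D) matrix gives a (1 x D) vector.
    return [sum(v * row[d] for v, row in zip(vec, matrix))
            for d in range(len(matrix[0]))]

via_matmul = [vector_times_matrix(one_hot(token_id, len(weight)), weight)
              for token_id in input_ids]

print(looked_up[0])             # [3.2101, -1.1649, -0.5699]
print(looked_up == via_matmul)  # True
```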


Now, we have successfully created embedding vectors from token IDs. Next, we will make a small modification to these embeddings to encode positional information within the text.

## Encoding Word Positions

Now, let's create embeddings with a vocabulary size of 50,257 and an output embedding dimension of 256.

In [56]:
vocab_size = 50257
output_dim = 256

In [57]:
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Let's instantiate the dataloader first:

In [58]:
dataloader = create_dataloader_v1(
                          raw_text,
                          batch_size=8,
                          max_length=4,
                          stride=4,
                          shuffle=False)

In [59]:
data_iter = iter(dataloader)
input, target = next(data_iter)

In [60]:
print(input)
print(input.shape)

tensor([[ 2043,   373,   257, 27737],
        [ 3329,    11,   290,   314],
        [  373,   319,   616,   835],
        [  284,  4022,  3846,    78],
        [   13, 29702,  3830,  2157],
        [  514,   257, 41535,  1627],
        [   11,   351, 18028,   286],
        [ 3869,  1028,   262,  4171]])
torch.Size([8, 4])


Now, we will create the token embeddings:

In [61]:
token_embeddings = embedding_layer(input)

In [62]:
print(token_embeddings.shape)

torch.Size([8, 4, 256])


This 8 x 4 x 256 tensor shows that each token ID is embedded as a 256-dimensional vector.

In [63]:
context_length = 4
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [64]:
pos_embeddings[1]

tensor([-1.0864e+00, -6.2188e-02, -1.0294e+00, -2.2932e-01, -6.0874e-01,
         5.3524e-01,  1.3238e+00,  2.3330e+00, -9.9768e-02, -4.7199e-01,
        -5.3248e-01,  1.8350e+00,  2.8382e-01,  1.0543e+00, -1.7998e+00,
         1.4655e+00, -1.9528e+00,  2.1799e-01,  1.3202e+00, -4.7460e-01,
         3.8520e-01, -8.1749e-02,  5.1069e-01,  1.0609e+00,  2.9112e-02,
        -4.0899e-02, -5.1593e-01,  2.6452e-01, -9.2384e-01, -1.0146e+00,
        -5.9922e-01,  2.3189e-01,  5.8988e-01,  1.5490e-01,  1.1972e+00,
         3.7747e-01, -1.1821e+00,  1.5121e+00, -2.6745e-01,  7.3872e-01,
        -5.1275e-01, -3.7004e-01,  5.3351e-01,  7.8175e-01,  1.2124e+00,
        -2.5448e+00,  9.7309e-01,  5.9424e-01, -1.5780e-01,  6.6926e-01,
         1.8240e+00, -1.5038e+00,  9.2822e-01,  1.1650e+00,  1.4926e+00,
         1.3172e-01,  1.1997e+00, -4.4992e-01,  6.4799e-01, -1.0770e+00,
        -1.9323e+00, -4.1374e-01, -5.7242e-01,  3.8571e-03,  5.9323e-01,
         1.3487e+00,  9.8320e-01,  2.1459e+00,  1.0

In [65]:
input_embeddings = token_embeddings + pos_embeddings

In [66]:
print(input_embeddings.shape)

torch.Size([8, 4, 256])
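
The addition above works even though the shapes differ ([8, 4, 256] and [4, 256]) because PyTorch broadcasts the positional embeddings across the batch dimension: every sample in the batch receives the same position vectors. The plain-Python sketch below spells out that behavior on small made-up values, with smaller sizes chosen for readability.

```python
batch_size, context_length, output_dim = 2, 4, 3

# Made-up token embeddings of shape [batch_size, context_length, output_dim].
token_embeddings = [
    [[float(100 * b + 10 * t + d) for d in range(output_dim)]
     for t in range(context_length)]
    for b in range(batch_size)
]
# Made-up positional embeddings of shape [context_length, output_dim].
pos_embeddings = [[0.5 * t] * output_dim for t in range(context_length)]

# Add the position-t vector to the token at position t of every sample.
input_embeddings = [
    [[tok + pos for tok, pos in zip(sample[t], pos_embeddings[t])]
     for t in range(context_length)]
    for sample in token_embeddings
]

print(input_embeddings[0][2])  # [21.0, 22.0, 23.0]
print(input_embeddings[1][2])  # [121.0, 122.0, 123.0]
```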
