## **Data Preparation and Sampling**
## **Tokenization**

In [1]:
with open("Harry_Potter_Sorcerer's_Stone.txt", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Length of text: {len(raw_text)} characters")

Length of text: 439742 characters


In [2]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [3]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


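Whether to keep or remove whitespace tokens:
- Since we are training on ordinary prose, whitespace carries no special meaning here, so we can drop it. (For whitespace-sensitive text such as Python code, keeping it would matter.)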

In [5]:
text = "Hello, -world. Is this-- a test?"

result = re.split(r'([-,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', '-', 'world', '.', 'Is', 'this', '-', '-', 'a', 'test', '?']
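
In the output above, the double dash comes out as two separate '-' tokens. The character class already contains '-', and it is tried before the '--' alternative, so the two-character alternative never gets a chance to match. A minimal sketch, assuming we would rather keep '--' as one token, lists it first. The reordered pattern is only an illustration and is not the pattern used in the rest of this notebook.

```python
import re

text = "Hello, -world. Is this-- a test?"

# Assumption for illustration: "--" is listed BEFORE the character class,
# so the regex engine tries the two-character match first.
pattern = r'(--|[-,.:;?_!"()\']|\s)'

tokens = [t.strip() for t in re.split(pattern, text) if t.strip()]
print(tokens)
```

With this ordering, a single '-' is still split off as its own token, while '--' stays together.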


In [6]:
preprocessed = re.split(r'([-,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:50])

['Harry', 'Potter', 'and', 'the', 'Sorcerer', "'", 's', 'Stone', 'CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr', '.', 'and', 'Mrs', '.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'"]


In [7]:
print(len(preprocessed))


103826


- Next, we convert the text tokens into token IDs that we can process via embedding layers later

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

6667


- Assign token IDs to the sorted tokens

In [9]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [10]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 60:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('-', 6)
('.', 7)
('0', 8)
('1', 9)
('1473', 10)
('1637', 11)
('17', 12)
('1709', 13)
('1945', 14)
('2', 15)
('3', 16)
('31', 17)
('382', 18)
('4', 19)
('90', 20)
(':', 21)
(';', 22)
('?', 23)
('A', 24)
('AAAAAAAAAARGH', 25)
('AAAARGH', 26)
('ALBUS', 27)
('ALL', 28)
('ALLEY', 29)
('ALLOWED', 30)
('AM', 31)
('AND', 32)
('ANYTHING', 33)
('ARE', 34)
('AT', 35)
('Aaah', 36)
('Aargh', 37)
('Abbott', 38)
('Abou', 39)
('About', 40)
('Absolutely', 41)
('According', 42)
('Adalbert', 43)
('Add', 44)
('Adrian', 45)
('Africa', 46)
('African', 47)
('After', 48)
('Against', 49)
('Ages', 50)
('Agrippa', 51)
('Ah', 52)
('Aha', 53)
('Ahead', 54)
('Ahem', 55)
('Ahern', 56)
('Alas', 57)
('Alberic', 58)
('Albus', 59)
('Algie', 60)
('Alicia', 61)


- The encode function turns text into token IDs
- The decode function turns token IDs back into text

In [11]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([-,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [12]:
tokenizer = SimpleTokenizerV1(vocab)

In [13]:
text = """ "The Potters, that's right,
that's what I heard yes, their son, Harry" """

ids = tokenizer.encode(text)
print(ids)

[1, 1233, 967, 5, 5995, 2, 5067, 4985, 5, 5995, 2, 5067, 6472, 640, 3500, 6648, 5, 5997, 5520, 5, 570, 1]


In [14]:
tokenizer.decode(ids)

'" The Potters, that\' s right, that\' s what I heard yes, their son, Harry"'

In [15]:
text = "Hello, transformer-- a test?"

tokenizer.encode(text)

KeyError: 'transformer'
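
The KeyError arises because 'transformer' never appears in the training text, so it has no entry in the vocabulary. A small sketch of how one might list the offending tokens before encoding is shown here. The helper name `find_unknown_tokens` and the toy vocabulary are hypothetical and only serve the illustration.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab` (hypothetical helper)."""
    tokens = re.split(r'([-,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Toy vocabulary standing in for the real one built from the book
toy_vocab = {tok: i for i, tok in enumerate(sorted({"Hello", ",", "a", "test", "?", "-"}))}

unknown = find_unknown_tokens("Hello, transformer-- a test?", toy_vocab)
print(unknown)
```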

- Handling unknown tokens

- Some tokenizers use special tokens to help the LLM with additional context

- Some of these special tokens are

- [BOS] (beginning of sequence) marks the beginning of text
- [EOS] (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
- [PAD] (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
- [UNK] to represent words that are not included in the vocabulary

- Note that GPT-2 does not need any of the tokens mentioned above; it only uses an <|endoftext|> token to reduce complexity

- The <|endoftext|> is analogous to the [EOS] token mentioned above

- GPT also uses <|endoftext|> for padding (since we typically use a mask when training on batched inputs, the model does not attend to padded tokens anyway, so it does not matter what these tokens are)

- GPT-2 does not use an [UNK] token for out-of-vocabulary words; instead, it uses a byte-pair encoding (BPE) tokenizer, which breaks words down into subword units, as discussed in a later section

- We place an <|endoftext|> token between two independent sources of text
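
The padding idea from the notes above can be sketched in plain Python. The token IDs in `batch` are made up for illustration; the pad ID 50256 is the <|endoftext|> ID of the GPT-2 vocabulary. The mask marks which positions hold real tokens, which is why the actual value of the padding token does not matter.

```python
# Hypothetical token-ID sequences of different lengths
batch = [[15496, 11, 995], [40, 588, 8887, 13, 11], [5297]]

PAD_ID = 50256  # <|endoftext|> in the GPT-2 vocabulary, reused as padding

max_len = max(len(seq) for seq in batch)

# Pad every sequence on the right up to the longest length in the batch
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

# 1 marks a real token, 0 marks padding that should be ignored
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

for row, m in zip(padded, mask):
    print(row, m)
```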

----------
- To deal with such cases, we can add special tokens like "<|unk|>" to the vocabulary to represent unknown words
- Since we are already extending the vocabulary, let's add another token called "<|endoftext|>", which is used in GPT-2 training to denote the end of a text (it is also placed between concatenated texts, for example when the training dataset consists of multiple articles, books, etc.)

In [16]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

In [17]:
vocab = {token:integer for integer,token in enumerate(all_tokens)}
len(vocab.items())


6669

In [18]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('zoom', 6664)
('zoomed', 6665)
('zooming', 6666)
('<|endoftext|>', 6667)
('<|unk|>', 6668)


In [19]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([-,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [20]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, transformer?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, transformer? <|endoftext|> In the sunlit terraces of the palace.


In [21]:
tokenizer.encode(text)

[589, 5, 6668, 23, 6667, 650, 5996, 6668, 6668, 4390, 5996, 4485, 7]

In [22]:
tokenizer.decode(tokenizer.encode(text))

'Hello, <|unk|>? <|endoftext|> In the <|unk|> <|unk|> of the palace.'

## Byte Pair Encoding

- We implemented a simple tokenization scheme in the previous sections for illustration purposes.

- This section covers a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE).

- The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

- Since implementing BPE can be relatively complicated, we will use an existing Python open-source library called tiktoken (https://github.com/openai/tiktoken).

- This library implements the BPE algorithm very efficiently based on source code in Rust.

- **Word-based tokenization**: simple, but suffers from the out-of-vocabulary problem.
- **Character-level tokenization**: avoids out-of-vocabulary errors because the vocabulary is small and fixed (for example, 256 byte values). However, the meaning associated with whole words is lost, and the tokenized sequence is much longer than a word-level sequence.
- **Subword tokenization**: a middle ground between the two.
  - Frequently used words are not split into smaller subwords.
  - Rare words are split into smaller, meaningful subwords.
  - It helps the model learn that different words with the same root, such as 'token', 'tokens', and 'tokenizing', are similar in meaning.
  - It also helps the model learn that 'tokenization' and 'modernization' have different roots but share the suffix 'ization' and are used in the same syntactic situations.
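
To make the subword idea concrete, here is a toy sketch of the core BPE training loop: repeatedly merge the most frequent pair of adjacent symbols. It works on characters and a made-up three-word corpus, whereas GPT-2's tokenizer works on bytes and a far larger corpus, so treat it as an illustration rather than how tiktoken is implemented. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up corpus: word -> frequency
corpus = {"token": 5, "tokens": 3, "tokenizing": 2}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(4):  # four merge steps are enough for this toy corpus
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After four merges the frequent root 'token' has become a single symbol, while the rarer endings ('s', 'izing') remain split into smaller pieces.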

In [23]:
! pip3 install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [24]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.8.0


In [25]:
tokenizer = tiktoken.get_encoding("gpt2")

In [26]:
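# Note: the two string literals below are concatenated without a space,
# so the text contains "terracesof" (visible in the decoded output).
text = (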
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


The code above prints the token IDs shown in its output.

We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier:

In [27]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


We can make two noteworthy observations based on the token IDs and decoded text above.

- First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256.

- In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.

- Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace" correctly.

- The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

- The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.

- This enables it to handle out-of-vocabulary words.

- So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
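
The fallback to smaller units can be sketched with a toy encoder that applies a fixed list of merge rules to a word. The merge list and the function name `apply_merges` are assumptions made for illustration; the real GPT-2 merge table is learned from data and operates on bytes.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then apply each merge rule in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge rules that build up the subword "token"
merges = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n")]

known_root = apply_merges("tokenizer", merges)
unseen = apply_merges("xyz", merges)
print(known_root)
print(unseen)
```

A word with a known root is covered partly by the learned subword, and a completely unfamiliar word falls back to single characters. In both cases, joining the pieces gives back the original word, so no unknown-token placeholder is needed.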

In [28]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


## Data Sampling with a sliding window

- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict:

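- This is auto-regressive, unsupervised (more precisely, self-supervised) learning: the targets come from the text itself, so no manual labels are needed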

In [29]:
with open("Harry_Potter_Sorcerer's_Stone.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

116725


- For each text chunk, we want the inputs and targets
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right


In [30]:
enc_sample = enc_text[50:]


In [31]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [5875, 345, 845, 881]
y:      [345, 845, 881, 13]


In [32]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[5875] ----> 345
[5875, 345] ----> 845
[5875, 345, 845] ----> 881
[5875, 345, 845, 881] ----> 13


In [33]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 thank ---->  you
 thank you ---->  very
 thank you very ---->  much
 thank you very much ----> .


In [34]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.5.1+cu121


- The GPTDatasetV1 class (listing 2.5) is based on the PyTorch Dataset class.
- It defines how individual rows are fetched from the dataset.
- Each row consists of max_length token IDs, stored in the input_chunk tensor; the target_chunk tensor holds the same window shifted by one token.


In [35]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [36]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [37]:
with open("Harry_Potter_Sorcerer's_Stone.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

- stride -> how many tokens the window moves forward between consecutive inputs
- max_length -> the number of tokens in each input (the context size)
- batch_size -> how many inputs the model processes together before updating its parameters
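
The effect of stride and max_length can be seen on a toy list of token IDs, using the same window logic as GPTDatasetV1 but without PyTorch. The helper name `sliding_windows` is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
non_overlapping = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(non_overlapping), non_overlapping)
```

A stride of 1 yields heavily overlapping windows, whereas a stride equal to max_length uses each token in exactly one input window.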

In [38]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[18308, 14179,   290,   262]]), tensor([[14179,   290,   262, 30467]])]


In [39]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[14179,   290,   262, 30467]]), tensor([[  290,   262, 30467,   338]])]


In [40]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[18308, 14179,   290,   262],
        [30467,   338,  8026,   628],
        [  198, 41481, 16329,   198],
        [  198, 10970, 16494,    56],
        [19494,   406,  3824,  1961],
        [  198,   198,  5246,    13],
        [  290,  9074,    13,   360],
        [ 1834,  1636,    11,   286]])

Targets:
 tensor([[14179,   290,   262, 30467],
        [  338,  8026,   628,   198],
        [41481, 16329,   198,   198],
        [10970, 16494,    56, 19494],
        [  406,  3824,  1961,   198],
        [  198,  5246,    13,   290],
        [ 9074,    13,   360,  1834],
        [ 1636,    11,   286,  1271]])


## Token Embeddings
- Tokens are represented as vectors so that the model can capture their semantic meaning.


- Token IDs are converted into embedding vectors via an embedding weight matrix of shape vocabulary size x embedding dimension
- Example: the smallest GPT-2 model uses 50257 x 768 (vocab size x embedding dim)
- The weights are initialized randomly and updated during training.
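
As a quick arithmetic check of what these sizes imply, the number of trainable values in such an embedding matrix is simply the product of the two dimensions. The variable names are chosen for illustration.

```python
vocab_size = 50257     # GPT-2 BPE vocabulary size
embedding_dim = 768    # embedding size of the smallest GPT-2 model

# One embedding_dim-sized vector per token in the vocabulary
num_params = vocab_size * embedding_dim
print(f"{num_params:,}")
```

So the token embedding matrix alone accounts for roughly 38.6 million parameters.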

Let's illustrate how the token ID to embedding vector conversion works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:

In [41]:
input_ids = torch.tensor([2, 3, 5, 1])

- For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):

- Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes:

In [42]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- The print statement in the code prints the embedding layer's underlying weight matrix:

In [43]:
print(embedding_layer.weight)


Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


- We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself, as we will see in upcoming chapters. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.

- After we instantiated the embedding layer, let's now apply it to a token ID to obtain the embedding vector:

In [44]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python starts with a zero index, so it's the row corresponding to index 3). In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
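
The look-up view can be checked without PyTorch. The sketch below copies the weight values printed earlier into a plain list and compares two routes: indexing a row directly, and multiplying a one-hot vector with the matrix. The function names are hypothetical, and this only mirrors what the embedding layer computes, not how PyTorch implements it.

```python
# Weight values copied from the printed embedding matrix (6 tokens x 3 dims)
weight = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def embed(token_ids):
    """Look up one row of the weight matrix per token ID."""
    return [weight[i] for i in token_ids]

def one_hot_matmul(token_id):
    """Same result via a one-hot vector times the weight matrix."""
    one_hot = [1.0 if j == token_id else 0.0 for j in range(len(weight))]
    return [
        sum(one_hot[r] * weight[r][c] for r in range(len(weight)))
        for c in range(len(weight[0]))
    ]

print(embed([3]))
print(one_hot_matmul(3))
```

Both routes return the row at index 3, which is why an embedding layer can be seen as a more efficient replacement for a one-hot encoding followed by a matrix multiplication.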

- Previously, we have seen how to convert a single token ID into a three-dimensional embedding vector. Let's now apply that to all four input IDs we defined earlier (torch.tensor([2, 3, 5, 1])):

In [45]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


- Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix

**POSITIONAL EMBEDDINGS (ENCODING WORD POSITIONS)**

- **Absolute**: For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location.
- Suitable when fixed order of tokens is crucial, such as sequence generation.

- **Relative**: The emphasis is on the relative position or distance between tokens. The model learns the relationships in terms of "how far apart" rather than at which exact position. **Advantage of this** : the model can generalize better to sequence of varying lengths, even if it has not seen such lengths during training.
- Suitable for tasks like language modeling over long sequences, where the same phrase can appear in different parts of the sequence.

- OpenAI's GPT models use absolute positional embeddings that are optimized during training, as part of the model training itself.



- Positional encoding enables the LLM to understand the order of and relationships between tokens, supporting accurate, context-aware predictions.
- Both token and positional embeddings are optimized during training.
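
The difference between the two schemes can be illustrated by the indices each one works with. This sketch only builds the index structures; how a model turns them into embeddings or attention biases is not shown, and the variable names are chosen for illustration.

```python
seq_len = 4

# Absolute scheme: every position gets its own index (and thus its own embedding)
absolute_positions = list(range(seq_len))

# Relative scheme: what matters is the offset j - i between tokens i and j
relative_offsets = [[j - i for j in range(seq_len)] for i in range(seq_len)]

print(absolute_positions)
for row in relative_offsets:
    print(row)
```

Neighboring tokens always have an offset of 1, no matter where in the sequence they occur, which is the property that lets relative schemes treat the same phrase alike at different positions.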

**Token embedding layer**

In [46]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- inputs -> 8 text samples with 4 tokens
- result -> 8 x 4 x 256 tensor

In [47]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [48]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[18308, 14179,   290,   262],
        [30467,   338,  8026,   628],
        [  198, 41481, 16329,   198],
        [  198, 10970, 16494,    56],
        [19494,   406,  3824,  1961],
        [  198,   198,  5246,    13],
        [  290,  9074,    13,   360],
        [ 1834,  1636,    11,   286]])

Inputs shape:
 torch.Size([8, 4])


In [49]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


- generate the 4 positional embedding vectors from the positional embedding matrix

In [50]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [51]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


- We use [4, 256] positional embeddings because positional information is the same for all sequences in the batch. The batch dimension (8) is handled by broadcasting, avoiding redundancy and saving resources.
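
What broadcasting does here can be written out with explicit loops on toy sizes. This is a plain-Python sketch with made-up values, not the PyTorch mechanism itself: the same positional vector for position t is added to the token at position t in every sequence of the batch.

```python
batch_size, context_length, dim = 2, 3, 2  # toy sizes instead of 8, 4, 256

# Toy token embeddings, shape [batch_size, context_length, dim]
token_emb = [
    [[float(100 * b + 10 * t + d) for d in range(dim)]
     for t in range(context_length)]
    for b in range(batch_size)
]

# Toy positional embeddings, shape [context_length, dim], no batch dimension
pos_emb = [[0.5 * (t + 1)] * dim for t in range(context_length)]

# The positional row for position t is reused for every batch entry b
input_emb = [
    [[token_emb[b][t][d] + pos_emb[t][d] for d in range(dim)]
     for t in range(context_length)]
    for b in range(batch_size)
]

print(input_emb[0][1], input_emb[1][1])
```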

In [52]:
#Add the token and positional embeddings

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)


torch.Size([8, 4, 256])


**LLM Data Preprocessing**


1. Tokenization
2. Token embeddings
3. Positional embeddings
4. Input embeddings = token embeddings + positional embeddings

