# Working with Text Data

During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters using a next-word prediction task yields models with impressive capabilities. These models can then be further finetuned to follow general instructions or perform specific target tasks.

![Alt text](../../assests/working-with-data.png)

In this section, we'll learn how to prepare input text for training LLMs. This involves splitting text into individual word and subword tokens, which can then be encoded into vector representations for the LLM. 

We'll also learn about advanced tokenization schemes like byte pair encoding, which is utilized in popular LLMs like `GPT`. 

Lastly, we'll implement a sampling and data loading strategy to produce the input-output pairs necessary for training LLMs.

## Understanding word embeddings

Deep neural network models, including LLMs, cannot process raw text directly. Since text is categorical, it isn't compatible with the mathematical operations used to implement and train neural networks. Therefore, we need a way to represent words as continuous-valued vectors.

The concept of converting data into a vector format is often referred to as `embedding`. Using a specific neural network layer or another pretrained neural network model, we can embed different data types, for example, `video`, `audio`, and `text`.

![Alt text](../../assests/Data-embedding.png)

We can process many data formats via `embedding models`. However, different data formats require distinct embedding models. For example, an embedding model designed for text would not be suitable for embedding audio or video data.

At its core, an embedding is a mapping from `discrete objects`, such as words, images, or even entire documents, to points in a `continuous vector space`. The primary purpose of embeddings is to convert non-numeric data into a format that neural networks can process.

While `word embeddings` are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for `retrieval-augmented generation`. Retrieval-augmented generation combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text.

Several algorithms and frameworks have been developed to generate word embeddings. One of the earlier and most popular examples is the `Word2Vec` approach. `Word2Vec` trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind `Word2Vec` is that words that appear in similar contexts tend to have similar meanings. Consequently, when projected into `2-dimensional` word embeddings for visualization purposes, similar terms cluster together, as shown below:

![Alt text](../../assests/embeddings.png)

Word embeddings can have varying dimensions, from one to thousands. We can choose two-dimensional word embeddings for visualization purposes. A higher dimensionality might capture more nuanced relationships but at the cost of computational efficiency.

While we can use pretrained models such as `Word2Vec` to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand. We will implement such embedding layers later in this section. Furthermore, LLMs can also create contextualized output embeddings.

The steps for preparing the embeddings used by an LLM include splitting text into words, converting words into tokens, and turning tokens into embedding vectors.
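To make these three steps concrete, here is a minimal sketch that uses only the standard library. The sample sentence, the `embedding_dim` of 3, and the randomly filled `embedding_table` are assumptions made for illustration. In an actual LLM the table would be a trainable layer whose values are updated during training.

```python
import random
import re

random.seed(0)  # fixed seed so the toy vectors are reproducible

text = "the cat sat on the mat"

# Step 1: split the text into tokens (here: whitespace-separated words)
tokens = re.split(r"\s+", text.strip())

# Step 2: map each unique token to an integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# Step 3: look up one vector per token ID in an embedding table
embedding_dim = 3  # hypothetical, tiny on purpose
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]
embedded = [embedding_table[idx] for idx in token_ids]

print(token_ids)
print(len(embedded), len(embedded[0]))
```

Both occurrences of "the" receive the same ID and therefore the same vector, which is what a lookup-table embedding implies.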



**Embeddings**

Embeddings are low-dimensional numerical vector representations of real-world objects like text, images, or audio that capture their semantic meaning and relationships. By converting complex, non-numeric data into this mathematical format, embeddings enable machine learning and AI models to process and understand these objects, allowing for tasks such as identifying similar items, classifying data, and powering search and recommendation systems.

**What Embeddings Do**

- `Represent Data Numerically:` Embeddings translate non-numeric data into lists of numbers (vectors) that computers can understand. 

- `Preserve Meaning and Relationships:` The geometric distances between these vectors in a multi-dimensional space reflect the semantic similarity or dissimilarity of the original data objects. 

- `Provide Compact Representations:` They are a form of lossy compression, creating smaller, more efficient representations of complex data while retaining essential properties. 
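One common way to compare two embedding vectors is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical; real embeddings are learned and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this example
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"])
sim_dog_car = cosine_similarity(toy_embeddings["dog"], toy_embeddings["car"])
print(f"dog vs cat: {sim_dog_cat:.3f}")
print(f"dog vs car: {sim_dog_car:.3f}")
```

With these toy values, "dog" scores higher against "cat" than against "car", mirroring the idea that related items sit closer together in the vector space.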


**How Embeddings Are Used**

- `Semantic Search:` Finding documents or images that are semantically similar to a query, rather than just keyword matches. 

- `Clustering and Classification:` Grouping similar items or assigning them to predefined categories. 
  
- `Retrieval-Augmented Generation (RAG):` Enhancing the accuracy and relevance of responses from large language models by providing relevant context retrieved through embeddings. 

- `Duplicate Detection:` Identifying duplicate or near-duplicate content by comparing their vector representations. 

- `Code Analytics:` Performing semantic searches on code repositories and analyzing code structure.


**Examples of Embeddings**

- `Word Embeddings:` Convert words into vectors, where words with similar meanings (e.g., "dog" and "cat") are located closer together in the vector space.

- `Image Embeddings:` Represent images as vectors, allowing for the comparison of visual content.

- `Audio Embeddings:` Convert audio signals into numerical representations for processing and analysis.

## Tokenizing text

This section covers how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM. These tokens are either individual words or special characters, including punctuation characters.

![Alt text](../../assests/tokens.png)


In [1]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.0.1
tiktoken version: 0.11.0


In [2]:
# Reading in a short story as text sample into Python
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [3]:
print(raw_text[:120])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me 


In [4]:
# Split a text on whitespace characters
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


Note that this simple tokenization scheme mostly works for separating the example text into individual words. However, some words are still connected to punctuation characters that we want to have as separate list entries. We also refrain from making all text lowercase because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization.

In [5]:
# We don't only want to split on whitespaces but also commas and periods, 
# so let's modify the regular expression to do that as well

result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


A small remaining issue is that the list still includes whitespace characters. Optionally, we can remove these redundant characters safely as follows:

In [6]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [7]:
# strip() Removes leading and trailing characters from a string. By default, it removes whitespace characters (spaces, tabs, newlines).

text = "   Hello World!   "
cleaned_text = text.strip()
print(cleaned_text) # Output: "Hello World!"

Hello World!


**REMOVING WHITESPACES OR NOT**

When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.
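The trade-off can be seen in a small sketch that applies the same kind of split pattern to a short, made-up Python snippet. Keeping every piece lets us rebuild the input exactly, while dropping whitespace-only pieces loses the newline and indentation.

```python
import re

pattern = r'([,.:;?_!"()\']|--|\s)'
text = "def f(x):\n    return x"

# Keep every piece, including whitespace: joining restores the input exactly
kept = [item for item in re.split(pattern, text) if item]
assert "".join(kept) == text

# Drop whitespace-only pieces: shorter, but indentation and newlines are gone
dropped = [item for item in kept if item.strip()]
print(kept)
print(dropped)
```

Because the capture group wraps the whole pattern, `re.split` returns the separators as well, which is why the `kept` list can be joined back into the original string.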

The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as `question marks`, `quotation marks`, and the `double-dashes` we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters:

In [8]:
text = "Hello, world. Is this-- a text?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'text', '?']


This works well. Now that we have a basic tokenizer working, let's apply it to the raw text of Edith Wharton's entire short story:

In [9]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))


4690


In [10]:
preprocessed

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius',
 '--',
 'though',
 'a',
 'good',
 'fellow',
 'enough',
 '--',
 'so',
 'it',
 'was',
 'no',
 'great',
 'surprise',
 'to',
 'me',
 'to',
 'hear',
 'that',
 ',',
 'in',
 'the',
 'height',
 'of',
 'his',
 'glory',
 ',',
 'he',
 'had',
 'dropped',
 'his',
 'painting',
 ',',
 'married',
 'a',
 'rich',
 'widow',
 ',',
 'and',
 'established',
 'himself',
 'in',
 'a',
 'villa',
 'on',
 'the',
 'Riviera',
 '.',
 '(',
 'Though',
 'I',
 'rather',
 'thought',
 'it',
 'would',
 'have',
 'been',
 'Rome',
 'or',
 'Florence',
 '.',
 ')',
 '"',
 'The',
 'height',
 'of',
 'his',
 'glory',
 '"',
 '--',
 'that',
 'was',
 'what',
 'the',
 'women',
 'called',
 'it',
 '.',
 'I',
 'can',
 'hear',
 'Mrs',
 '.',
 'Gideon',
 'Thwing',
 '--',
 'his',
 'last',
 'Chicago',
 'sitter',
 '--',
 'deploring',
 'his',
 'unaccountable',
 'abdication',
 '.',
 '"',
 'Of',
 'course',
 'it',
 "'",
 's',
 'going',
 'to',
 'send',
 't

The total number of tokens is `4690` without whitespaces.
Let's get the first 30 tokens for a quick visual check:

In [11]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [12]:
print(preprocessed[:60])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I']


The resulting output shows that our tokenizer appears to be handling the text well since all words and special characters are neatly separated.

In [13]:
print(preprocessed[-30:])

['since', 'Grindle', "'", 's', 'doing', 'it', 'for', 'me', '!', 'The', 'Strouds', 'stand', 'alone', ',', 'and', 'happen', 'once', '--', 'but', 'there', "'", 's', 'no', 'exterminating', 'our', 'kind', 'of', 'art', '.', '"']


## Converting tokens into token IDs


Here, we will convert these tokens from a Python string to an integer representation to produce the so-called token IDs. This conversion is an intermediate step before converting the `token IDs` into `embedding` vectors. 

To map the previously generated tokens into token IDs, we have to build a so-called vocabulary first. This vocabulary defines how we map each unique word and special character to a unique integer.

![Alt text](../../assests/token-IDs.png)

We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.

In [14]:
# Let's create a list of all unique tokens and sort them alphabetically to determine the vocabulary size

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


After determining that the vocabulary size is 1,130 via the above code, we create the vocabulary and print its first 52 entries for illustration purposes:

In [15]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. Our next goal is to apply this vocabulary to convert new text into token IDs.

![Alt text](../../assests/tokenization-sample-text.png)

Let's implement a complete tokenizer class in Python with an `encode` method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. In addition, we implement a `decode` method that carries out the inverse integer-to-string mapping to convert the token IDs back to text.

In [18]:
# Implementing a simple text tokenizer

class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Store the vocabulary as an instance attribute for access in the encode and decode methods
        self.str_to_int = vocab
        # Create an inverse vocabulary that maps token IDs back to the original text tokens
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    # Process input text into token IDs
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # Convert token IDs back into text
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [24]:
text = "The brown man in Chicago is dead."
new_vocab = {token:integer for integer,token in enumerate(all_words)}

new_preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
new_preprocessed = [item.strip() for item in new_preprocessed if item.strip()]
# print([s for s in preprocessed])
new_ids = [new_vocab[s] for s in new_preprocessed]
print(new_ids)

[93, 235, 656, 568, 25, 584, 317, 7]


In [25]:
int_to_str = {i:s for s,i in new_vocab.items()}
text = " ".join([int_to_str[i] for i in new_ids])
text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
print(text)

The brown man in Chicago is dead.


### **Core Components of the `SimpleTokenizerV1` Python class above**


- **`str_to_int`**: Dictionary mapping strings (words/punctuation) to integer IDs
- **`int_to_str`**: Reverse dictionary mapping integer IDs back to strings

##### Encoding Process (`encode` method):
1. **Splits text** using regex to separate words from punctuation and whitespace
2. **Removes empty strings** and strips whitespace
3. **Converts each token** to its corresponding integer ID using the vocabulary

##### Decoding Process (`decode` method):
1. **Converts IDs back** to strings and joins with spaces
2. **Cleans up formatting** by removing unnecessary spaces before punctuation
3. **Returns properly formatted** text

##### Key Characteristics:
- **Rule-based tokenization** (not learned from data)
- **Preserves punctuation** as separate tokens
- **Simple vocabulary mapping** between text and numbers
- **Approximately reversible** - decoding restores the tokens, but the original whitespace is not preserved exactly

This is a basic tokenizer similar to early NLP approaches, contrasting with modern subword tokenizers like `BPE` or `WordPiece` used in transformers.

![Alt text](../../assests/inverse-vocab.png)

In [31]:
tokenizer = SimpleTokenizerV1(vocab)

text = "It's the last he painted, you know, Mrs. Gisburn said with pardonable pride."
ids = tokenizer.encode(text)
print(ids)

[56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 67, 7, 38, 851, 1108, 754, 793, 7]


In [32]:
print(ids[:5])

[56, 2, 850, 988, 602]


We can decode the integers back into text:

In [33]:
tokenizer.decode(ids)

"It' s the last he painted, you know, Mrs. Gisburn said with pardonable pride."

In [34]:
tokenizer.decode(tokenizer.encode(text))

"It' s the last he painted, you know, Mrs. Gisburn said with pardonable pride."

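Based on the output above, we can see that the decode method converted the token IDs back into text that closely matches the original. The one visible difference is `It' s`: `decode` removes the space before an apostrophe but not the one after it.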
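Tracing the two decoding steps on a plain list of tokens shows where the space after the apostrophe in the output above comes from. The token list here is a shortened, hand-written stand-in for the real decoded tokens.

```python
import re

tokens = ["It", "'", "s", "the", "last", "he", "painted", ",", "you", "know", "."]

# Same two steps as the decode method: join with spaces, then remove
# the whitespace that precedes punctuation
joined = " ".join(tokens)
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The substitution only looks at whitespace in front of a punctuation character, so the space that `join` placed between `'` and `s` is left untouched.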

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set:

In [35]:
new_text = "We can extend consciousness and life beyond Earth"
print(tokenizer.encode(new_text))

KeyError: 'extend'

The problem is that the word "extend" was not used in the short story `The Verdict`. Hence, it is not contained in the vocabulary. This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
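Before encoding, it can help to list which tokens are missing from a vocabulary instead of waiting for a `KeyError`. The helper below is a hypothetical sketch; `find_unknown_tokens` and `toy_vocab` are names invented for this example.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in `text` that have no entry in `vocab`."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [item.strip() for item in pieces if item.strip()]
    return [token for token in tokens if token not in vocab]

# Hypothetical miniature vocabulary, not the one built from the short story
toy_vocab = {"We": 0, "can": 1, "life": 2, "and": 3, ".": 4}

print(find_unknown_tokens("We can extend life.", toy_vocab))
```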

## Adding special context tokens


In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify this tokenizer to handle unknown words.

In particular, we will modify the vocabulary and the `SimpleTokenizerV1` tokenizer we implemented in the previous section into a new `SimpleTokenizerV2` that supports two new tokens, `<|unk|>` and `<|endoftext|>`.


![Alt text](../../assests/special-tokens.png)


We can modify the tokenizer to use an `<|unk|>` token if it encounters a word that is not part of the vocabulary. Furthermore, we add a token between unrelated texts. For example, when training `GPT-like` LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source. This helps the LLM understand that, although these text sources are concatenated for training, they are, in fact, unrelated.

- Some tokenizers use special tokens to help the LLM with additional context

- Some of these special tokens are

  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books)
  - `[PAD]` (padding) is used when we train LLMs with a batch size greater than 1 (a batch may include multiple texts of different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
  - `[UNK]` represents words that are not included in the vocabulary

- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity.
- The `<|endoftext|>` token is analogous to the `[EOS]` token mentioned above.
- GPT also uses `<|endoftext|>` for padding (since we typically use a mask when training on batched inputs, we would not attend to padded tokens anyway, so it does not matter what these tokens are).
- `GPT-2` does not use an `[UNK]` token for out-of-vocabulary words; instead, `GPT-2` uses a `byte-pair encoding (BPE)` tokenizer, which breaks down words into subword units, as discussed in a later section.

![Alt text](../../assests/end-of-text.png)

- We use `<|endoftext|>` tokens between two independent sources of text.

Let's see what happens if we tokenize the following text:

In [36]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

- The above produces an error because the word `"Hello"` is not contained in the vocabulary.

- To deal with such cases, we can add special tokens like `"<|unk|>"` to the vocabulary to represent unknown words.

- Since we are already extending the vocabulary, let's add another token called `"<|endoftext|>"`, which is used in `GPT-2` training to denote the end of a text (it is also used between concatenated texts, for example when our training dataset consists of multiple articles, books, etc.).

In [37]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [38]:
print(vocab)

{'!': 0, '"': 1, "'": 2, '(': 3, ')': 4, ',': 5, '--': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'Ah': 12, 'Among': 13, 'And': 14, 'Are': 15, 'Arrt': 16, 'As': 17, 'At': 18, 'Be': 19, 'Begin': 20, 'Burlington': 21, 'But': 22, 'By': 23, 'Carlo': 24, 'Chicago': 25, 'Claude': 26, 'Come': 27, 'Croft': 28, 'Destroyed': 29, 'Devonshire': 30, 'Don': 31, 'Dubarry': 32, 'Emperors': 33, 'Florence': 34, 'For': 35, 'Gallery': 36, 'Gideon': 37, 'Gisburn': 38, 'Gisburns': 39, 'Grafton': 40, 'Greek': 41, 'Grindle': 42, 'Grindles': 43, 'HAD': 44, 'Had': 45, 'Hang': 46, 'Has': 47, 'He': 48, 'Her': 49, 'Hermia': 50, 'His': 51, 'How': 52, 'I': 53, 'If': 54, 'In': 55, 'It': 56, 'Jack': 57, 'Jove': 58, 'Just': 59, 'Lord': 60, 'Made': 61, 'Miss': 62, 'Money': 63, 'Monte': 64, 'Moon-dancers': 65, 'Mr': 66, 'Mrs': 67, 'My': 68, 'Never': 69, 'No': 70, 'Now': 71, 'Nutley': 72, 'Of': 73, 'Oh': 74, 'On': 75, 'Once': 76, 'Only': 77, 'Or': 78, 'Perhaps': 79, 'Poor': 80, 'Professional': 81, 'Renaissance': 82, 'Ri

In [51]:
vocab["<|endoftext|>"], vocab["<|unk|>"], vocab["--"]

(1130, 1131, 6)

In [52]:
print(len(vocab.items()))

1132


Based on the output of the print statement above, the new vocabulary size is 1132 (the vocabulary size in the previous section was 1130).

As an additional quick check, let's print the last 5 entries of the updated vocabulary:

In [53]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


We also need to adjust the `tokenizer` accordingly so that it knows when and how to use the new `<|unk|>` token:

In [54]:
# A simple text tokenizer that handles unknown words
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # replace unknown words by <|unk|> tokens
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

Compared to `SimpleTokenizerV1`, the new `SimpleTokenizerV2` replaces unknown words with `<|unk|>` tokens.


In [55]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


Next, let's tokenize the sample text using the `SimpleTokenizerV2` on the vocab we previously created.

In [56]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [57]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

In [58]:
text5 = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text5)

[1131, 5, 355, 1126, 628, 975, 7, 1131, 999, 6, 115, 1131, 10]

In [59]:
tokenizer.decode(tokenizer.encode(text5))

'<|unk|>, do you like tea. <|unk|> this -- a <|unk|>?'

Comparing the `de-tokenized` texts above with the original inputs shows that the training dataset, Edith Wharton's short story The Verdict, did not contain the words `"Hello"` and `"palace"`. The second example shows that `"Is"` and `"test"` are missing as well.

So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

- `[BOS] (beginning of sequence)`: This token marks the start of a text. It signifies to the LLM where a piece of content begins.
  
- `[EOS] (end of sequence)`: This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`. For instance, when combining two different Wikipedia articles or books, the `[EOS]` token indicates where one article ends and the next one begins.
  
- `[PAD] (padding)`: When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the `[PAD]` token, up to the length of the longest text in the batch.


Note that the tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token for simplicity. The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above. Also, `<|endoftext|>` is used for padding as well.
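Padding together with a mask can be sketched in plain Python. The `pad_batch` helper is hypothetical and only illustrates the idea. `50256` is the ID that the GPT-2 tokenizer assigns to `<|endoftext|>`, and the remaining IDs in `batch` are arbitrary placeholders.

```python
def pad_batch(sequences, pad_id):
    """Pad token ID lists to a common length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

# Placeholder token IDs for three texts of different lengths
batch = [[15496, 11, 466], [554, 262], [13]]
padded, mask = pad_batch(batch, pad_id=50256)

for row, row_mask in zip(padded, mask):
    print(row, row_mask)
```

Since masked positions are ignored during training, the specific ID used for padding has no effect on the result, which is why reusing `<|endoftext|>` is sufficient.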

Moreover, the tokenizer used for GPT models also doesn't use an `<|unk|>` token for out-of-vocabulary words. Instead, GPT models use a `byte pair encoding` tokenizer, which breaks down words into subword units.

## Byte pair encoding


This section covers a more sophisticated tokenization scheme based on a concept called `byte pair encoding (BPE)`. The BPE tokenizer covered in this section was used to train LLMs such as `GPT-2`, `GPT-3`, and the original model used in `ChatGPT`.

Since implementing `BPE` can be relatively complicated, we will use an existing Python open-source library called `tiktoken` (https://github.com/openai/tiktoken), which implements the BPE algorithm very efficiently based on source code in Rust. Similar to other Python libraries, we can install the tiktoken library via Python's pip installer from the terminal:

```bash
pip install tiktoken
```


- `GPT-2` used `byte pair encoding (BPE)` as its tokenizer.
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.
- For instance, if `GPT-2's` vocabulary doesn't have the word `"unfamiliarword,"` it might tokenize it as `["unfam", "iliar", "word"]` or some other subword breakdown, depending on its trained BPE merges.
- The original BPE tokenizer can be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py
- In this section, we are using the BPE `tokenizer` from OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance.

In [60]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.11.0


In [61]:
# we can instantiate the BPE tokenizer from tiktoken as follows:
tokenizer = tiktoken.get_encoding("gpt2")

The usage of this tokenizer is similar to `SimpleTokenizerV2` we implemented previously via an encode method:

In [66]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


We can then convert the token IDs back into text using the `decode` method, similar to our `SimpleTokenizerV2` earlier:

In [67]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


We can make two noteworthy observations based on the token IDs and decoded text above. First, the `<|endoftext|>` token is assigned a relatively large token ID, namely, `50256`. In fact, the BPE tokenizer, which was used to train models such as `GPT-2`, `GPT-3`, and the original model used in ChatGPT, has a total vocabulary size of `50,257`, with `<|endoftext|>` being assigned the largest token ID.


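Second, the `BPE` tokenizer above encodes and decodes unknown words, such as `"someunknownPlace"`, correctly. (The decoded text reads `terracesof` only because the two adjacent string literals in the input are concatenated without a space between them; the tokenizer reproduces its input exactly.) The `BPE` tokenizer can handle any unknown word. How does it achieve this without using `<|unk|>` tokens?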


The algorithm underlying `BPE` breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle `out-of-vocabulary` words. So, thanks to the `BPE algorithm`, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters, as illustrated below:


![Alt text](../../assests/subword-tokens.png)


The ability to break down unknown words into individual characters ensures that the tokenizer, and consequently the LLM that is trained with it, can process any text, even if it contains words that were not present in its training data.
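The merge idea behind BPE can be sketched with a character-level toy. This is a simplified illustration under stated assumptions, not the `tiktoken` implementation: the GPT-2 tokenizer works on bytes and ships with a fixed, pretrained merge table, whereas the function names and the tiny training word list below are invented for this example.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the top one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    sequences = [list(word) for word in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges

def encode_word(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

merges = learn_merges(["low", "lower", "lowest", "slow"], num_merges=2)
print(merges)
print(encode_word("lowly", merges))  # unseen word, still encodable
print(encode_word("xyz", merges))    # no merge applies: single characters
```

A word that never appeared in the training list is still encoded, partly with a learned subword and partly with single characters, so no `<|unk|>` token is needed.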

In [84]:
unknown = ("Akwirw ier sillians")

int_pairs = tokenizer.encode(unknown)
print(int_pairs)

[33901, 86, 343, 86, 220, 959, 49276, 1547]


In [85]:
string_pairs = tokenizer.decode(int_pairs)
string_pairs

'Akwirw ier sillians'

## Data sampling with a sliding window

The previous section covered the tokenization steps and conversion from string tokens into integer token IDs in great detail. The next step before we can finally create the embeddings for the LLM is to generate the input-target pairs required for training an LLM.

We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict:

![Alt text](../../assests/sliding-window.png)

Given a text sample, we extract input blocks as subsamples that serve as input to the LLM; the LLM's prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity.

Here, we implement a data loader that fetches the input-target pairs depicted in the figure above from the training dataset using a sliding window approach. We will first tokenize the whole `The Verdict short story` we worked with earlier using the `BPE tokenizer` introduced in the previous section:

In [86]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


The above code prints `5145`, the total number of tokens in the training set after applying the `BPE` tokenizer.

- For each text chunk, we want the inputs and targets.
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right. 

In [87]:
enc_sample = enc_text[50:]
enc_sample[:10]

[290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686]


In [88]:
print(len(enc_sample))

5095


In [89]:
enc_sample[:4], enc_sample[1:4+1]

([290, 4920, 2241, 287], [4920, 2241, 287, 257])

One of the easiest and most intuitive ways to create the `input-target` pairs for the next-word prediction task is to create two variables, `x` and `y`, where `x` contains the `input` tokens and `y` contains the `targets`, which are the inputs shifted by `1`:

In [90]:
# The context size determines how many tokens are included in the input
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:       {y}")

x: [290, 4920, 2241, 287]
y:       [4920, 2241, 287, 257]


Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks depicted earlier.

- Taken one by one, the predictions look as follows:

In [92]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Everything left of the arrow `(---->)` refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

Let's repeat the previous code but convert the `token IDs` into text:

In [93]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))

 and --->  established
 and established --->  himself
 and established himself --->  in
 and established himself in --->  a


There's only one more task before we can turn the tokens into embeddings, as we mentioned earlier: implementing an efficient `data loader` that iterates over the input dataset and returns the `inputs` and `targets` as PyTorch tensors, which can be thought of as `multidimensional arrays`.


In particular, we are interested in returning two tensors: an `input tensor` containing the text that the LLM sees and a `target tensor` that includes the targets for the LLM to predict, as depicted in the diagram below:


![Alt text](../../assests/data-loaders.png)


To implement efficient `data loaders`, we collect the inputs in a tensor, `x`, where each row represents one input context. A second tensor, `y`, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.

The code implementation will operate on token IDs directly since the `encode` method of the `BPE` tokenizer performs both tokenization and conversion into token IDs as a single step. 

For the efficient `data loader` implementation, we will use PyTorch's built-in `Dataset` and `DataLoader` classes.

- Create dataset and dataloader that extract chunks from the input text dataset

In [94]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize the entire text
        token_ids = tokenizer.encode(txt)
        
        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    
    # Return the total number of rows in the dataset
    def __len__(self):
        return len(self.input_ids)

    # Return a single row from the dataset
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The `GPTDatasetV1` class is based on the PyTorch `Dataset` class and defines how individual rows are fetched from the dataset, where each row consists of a number of token IDs (based on a `max_length`) assigned to an `input_chunk` tensor. The `target_chunk` tensor contains the corresponding targets. 


The class creates training examples by sliding a fixed-size window through text, where each input sequence is paired with a target sequence that's shifted by one token (predicting the next token). It efficiently generates training examples from a single text.

**Purpose:**
Creates input-target pairs for next-token prediction training.

**Key Components:**
1. `__init__` method:

    - `txt`: Raw text to process
    - `tokenizer`: Converts text to token IDs
    - `max_length`: Size of each input sequence
    - `stride`: How many tokens to move the window each step


2. `Sliding Window Process:`
```py
# Example: "The cat sat on the mat" with max_length=3, stride=2
# Window 1: Input=["The", "cat", "sat"] → Target=["cat", "sat", "on"]
# Window 2: Input=["sat", "on", "the"] → Target=["on", "the", "mat"]
# With six tokens the loop stops here: range(0, 6 - 3, 2) yields only the start indices 0 and 2.
```


3. `Input-Target Relationship:`
    - For each position `i`, the target is always the next token in the sequence.
    - This teaches the model: "Given these tokens, predict what comes next"


**How it Works:**
- `Tokenizes` the entire text once.
- `Slides a window` through the tokens with the specified stride.
- `Creates pairs` where the target is the input shifted by 1 position.
- `Stores` all pairs as tensors for efficient training.
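
The steps above can be sketched without PyTorch or a tokenizer. In the following sketch, whitespace-separated words stand in for token IDs, and `sliding_windows` is a hypothetical helper that mirrors the loop in `GPTDatasetV1.__init__`:

```python
def sliding_windows(tokens, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that the shifted target window still fits in `tokens`
    for i in range(0, len(tokens) - max_length, stride):
        input_chunk = tokens[i:i + max_length]
        target_chunk = tokens[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


# Words stand in for token IDs in this toy example
words = "The cat sat on the mat".split()
pairs = sliding_windows(words, max_length=3, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With six tokens, `max_length=3`, and `stride=2`, this yields exactly the two windows from the example above.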



The following code will use the `GPTDatasetV1` to load the inputs in batches via a `PyTorch DataLoader`:

In [95]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, 
                         drop_last=True, num_workers=0):
    
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    
    return dataloader

- `drop_last=True` drops the last batch if it is shorter than the specified `batch_size`, which helps prevent loss spikes during training.
- `num_workers` sets the number of CPU subprocesses used for data loading (`0` loads the data in the main process).
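
To get a feel for what `drop_last` does in practice, the following sketch counts the batches by hand. `count_batches` is a hypothetical helper, not part of PyTorch; it assumes the window loop of `GPTDatasetV1` and uses the 5145 tokens of the short story as an example.

```python
import math


def count_batches(num_tokens, max_length, stride, batch_size, drop_last):
    # Number of windows produced by range(0, num_tokens - max_length, stride)
    num_samples = len(range(0, num_tokens - max_length, stride))
    if drop_last:
        return num_samples // batch_size  # an incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)  # the final batch may be shorter


num_tokens = 5145  # token count of the short story after BPE tokenization
num_samples = len(range(0, num_tokens - 4, 4))
kept = count_batches(num_tokens, max_length=4, stride=4, batch_size=8, drop_last=True)
total = count_batches(num_tokens, max_length=4, stride=4, batch_size=8, drop_last=False)

print(num_samples)      # 1286 windows
print(kept, total)      # 160 full batches vs. 161 batches overall
print(num_samples % 8)  # 6 samples would form the short final batch
```

Under these settings only 6 of the 1286 samples are discarded, so the cost of `drop_last=True` is small here.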


Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4:

In [107]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

# Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [109]:
raw_text

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)\n\n"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it\'s going to send the value of my picture \'way up; but I don\'t think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing\'s lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn\'s "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?\n\nWell!--even 

The `first_batch` variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since the `max_length` is set to `4`, each of the two tensors contains `4` token IDs. 

Note that an input size of `4` is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least `256`.

To illustrate the meaning of `stride=1`, let's fetch another batch from this dataset:

In [110]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


If we compare the `first` with the `second` batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch (for example, the second ID in the first batch's input is `367`, which is the first ID of the second batch's input). The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach.

- We can also create batched outputs, here with a batch size of `8`:

In [112]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


Note that we increase the stride to `4`, which matches `max_length`. This utilizes the dataset fully (we don't skip a single word) while avoiding any overlap between the input windows, since more overlap could lead to increased overfitting.
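
The trade-off between overlap and skipped tokens can be checked with a small sketch. The helper `coverage` and the toy sequence length of 21 tokens are assumptions made for illustration only:

```python
def coverage(num_tokens, max_length, stride):
    """Return (overlap between neighbouring input windows, skipped positions)."""
    starts = list(range(0, num_tokens - max_length, stride))
    covered = set()
    for start in starts:
        covered.update(range(start, start + max_length))
    overlap = max(0, max_length - stride)
    # Only inspect positions up to the end of the last window
    last_end = starts[-1] + max_length
    skipped = [pos for pos in range(last_end) if pos not in covered]
    return overlap, skipped


for stride in (1, 4, 5):
    overlap, skipped = coverage(num_tokens=21, max_length=4, stride=stride)
    print(f"stride={stride}: overlap={overlap}, skipped={skipped}")
```

For `max_length=4`, a stride of `1` gives an overlap of three tokens between neighbouring windows, a stride of `4` gives neither overlap nor gaps, and a stride of `5` skips one token between windows.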

## Creating token embeddings

The last step for preparing the input text for LLM training is to convert the token IDs into `embedding vectors`.

![Alt text](../../assests/embedding-vectors.png)

Preparing the input text for an LLM involves `tokenizing text`, `converting text tokens` to `token IDs`, and converting token IDs into `embedding vectors`. In this section, we consider the token IDs created in the previous sections to create the token embedding vectors.

It is important to note that we initialize these embedding weights with random values as a preliminary step. This initialization serves as the starting point for the LLM's learning process.

A continuous vector representation, or embedding, is necessary since GPT-like LLMs are deep neural networks trained with the backpropagation algorithm. 


- The data is almost ready for an LLM.
- As a final step, we embed the tokens in a continuous vector representation using an embedding layer.
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

Suppose we have the following four input tokens with IDs 2, 3, 5, and 1 (after tokenization):

In [113]:
input_ids = torch.tensor([2, 3, 5, 1])

For the sake of simplicity, suppose we have a small vocabulary of only `6` words and we want to create embeddings of size `3`:

In [114]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

This would result in a 6x3 weight matrix:

In [115]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [116]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


We can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions.


- For those who are familiar with `one-hot encoding`, the `embedding layer` approach above is essentially just a more efficient way of implementing one-hot encoding followed by `matrix multiplication` in a fully-connected layer.  

- Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation.

- To convert a token with id 3 into a 3-dimensional vector, we do the following:

In [120]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- Note that the above is the 4th row in the `embedding_layer` weight matrix (row index 3, since indexing starts at 0). Likewise, token ID `1` returns the 2nd row:

In [121]:
print(embedding_layer(torch.tensor([1])))

tensor([[0.9178, 1.5810, 1.3010]], grad_fn=<EmbeddingBackward0>)


- To embed all four input_ids values above, we do

In [122]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


- Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.
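
The equivalence between the lookup and the one-hot approach mentioned above can be verified by hand. The sketch below is plain Python rather than PyTorch; it copies the weight values printed earlier, and the helpers `one_hot` and `vec_matmul` are hypothetical names introduced for illustration.

```python
# Weight values copied from the printed 6x3 embedding matrix
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]


def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at index `token_id`."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def vec_matmul(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    num_cols = len(matrix[0])
    return [
        sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(num_cols)
    ]


token_id = 3
via_one_hot = vec_matmul(one_hot(token_id, len(weights)), weights)
via_lookup = weights[token_id]

print(via_one_hot)
print(via_lookup)
```

Both routes select the same row, but the lookup skips the multiplications by zero, which is why it is the more efficient choice for large vocabularies.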

## Encoding word positions

In the previous section, we converted the token IDs into a continuous vector representation, the so-called token embeddings. In principle, this is a suitable input for an LLM. However, a minor shortcoming of LLMs is that their `self-attention` mechanism doesn't have a notion of position or order for the tokens within a sequence.


- The embedding layer converts token IDs into identical vector representations regardless of where they are located in the input sequence:

![Alt text](../../assests/token-IDs-embedding.png)


In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM.

Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on.


- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:

![Alt text](../../assests/positional-embedding.png)


Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position." The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training.


Both types of `positional embeddings` aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed.

OpenAI's GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original Transformer model. This optimization process is part of the model training itself.
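
For contrast with learned positional embeddings, the fixed encodings of the original Transformer can be computed directly from a formula, with sine values on even dimensions and cosine values on odd dimensions. The following is a minimal plain-Python sketch; the function name `sinusoidal_encoding` and the toy sizes are assumptions for illustration, and GPT models do not use this scheme.

```python
import math


def sinusoidal_encoding(context_length, dim):
    """Fixed (non-learned) positional encodings, one row per position."""
    table = []
    for pos in range(context_length):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


fixed = sinusoidal_encoding(context_length=4, dim=8)
print(len(fixed), len(fixed[0]))  # 4 rows (positions), 8 columns (dimensions)
print(fixed[0])  # position 0: sin(0) = 0.0 and cos(0) = 1.0 alternate
```

These values never change during training, whereas the learned positional embeddings used by GPT models start out random and are updated like any other weight.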

- The BytePair encoder has a vocabulary size of `50,257`.
- Suppose we want to encode the input tokens into a `256-dimensional vector representation`:

In [125]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
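
As a rough sense of scale, the size of this embedding table follows directly from its shape. The sketch below is simple arithmetic and assumes the weights are stored as 32-bit floats (PyTorch's default dtype):

```python
vocab_size = 50257
output_dim = 256

# One learnable value per (token, dimension) pair
token_table_params = vocab_size * output_dim

# Assumption: 4 bytes per value (float32)
token_table_bytes = token_table_params * 4

print(token_table_params)                    # 12865792
print(round(token_table_bytes / 2**20, 1))   # size in MiB
```

Even with a modest embedding size of 256, the token embedding table alone holds roughly 12.9 million trainable values.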

- If we sample data from the `dataloader`, we embed the tokens in each batch into a `256`-dimensional vector.
- If we have a batch size of `8` with `4` tokens each, this results in an `8 x 4 x 256` tensor:

In [127]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [128]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


- As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of `8` text samples with `4` tokens each.

- Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:

In [131]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [132]:
print(token_embeddings)

tensor([[[ 0.4913,  1.1239,  1.4588,  ..., -0.3995, -1.8735, -0.1445],
         [ 0.4481,  0.2536, -0.2655,  ...,  0.4997, -1.1991, -1.1844],
         [-0.2507, -0.0546,  0.6687,  ...,  0.9618,  2.3737, -0.0528],
         [ 0.9457,  0.8657,  1.6191,  ..., -0.4544, -0.7460,  0.3483]],

        [[ 1.5460,  1.7368, -0.7848,  ..., -0.1004,  0.8584, -0.3421],
         [-1.8622, -0.1914, -0.3812,  ...,  1.1220, -0.3496,  0.6091],
         [ 1.9847, -0.6483, -0.1415,  ..., -0.3841, -0.9355,  1.4478],
         [ 0.9647,  1.2974, -1.6207,  ...,  1.1463,  1.5797,  0.3969]],

        [[-0.7713,  0.6572,  0.1663,  ..., -0.8044,  0.0542,  0.7426],
         [ 0.8046,  0.5047,  1.2922,  ...,  1.4648,  0.4097,  0.3205],
         [ 0.0795, -1.7636,  0.5750,  ...,  2.1823,  1.8231, -0.3635],
         [ 0.4267, -0.0647,  0.5686,  ..., -0.5209,  1.3065,  0.8473]],

        ...,

        (remaining output truncated)

- For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same embedding dimension (`output_dim`) as the `token_embedding_layer` and one row per position in the context:

In [133]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
print(pos_embedding_layer.weight)

Parameter containing:
tensor([[-0.6303, -0.4848, -0.1366,  ...,  1.0345, -0.5012,  1.1045],
        [ 0.2062,  0.6078,  0.7187,  ..., -0.4628, -0.2319,  1.1980],
        [ 0.5806, -1.3846,  0.3266,  ...,  0.8579,  0.5059,  1.0243],
        [ 1.4323,  0.2217,  0.8599,  ...,  0.4827,  0.8459,  1.3038]],
       requires_grad=True)


In [134]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)
print(pos_embeddings)

torch.Size([4, 256])
tensor([[-0.6303, -0.4848, -0.1366,  ...,  1.0345, -0.5012,  1.1045],
        [ 0.2062,  0.6078,  0.7187,  ..., -0.4628, -0.2319,  1.1980],
        [ 0.5806, -1.3846,  0.3266,  ...,  0.8579,  0.5059,  1.0243],
        [ 1.4323,  0.2217,  0.8599,  ...,  0.4827,  0.8459,  1.3038]],
       grad_fn=<EmbeddingBackward0>)


The `context_length` is a variable that represents the supported input size of the LLM. Here, we set it equal to the maximum length of the input text. In practice, input text can be longer than the supported context length, in which case we have to truncate the text.
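
A minimal sketch of such a truncation is shown below. Keeping the first `context_length` tokens is only one possible policy (keeping the most recent tokens is another), and `truncate_to_context` is a hypothetical helper, not part of the code above.

```python
def truncate_to_context(token_ids, context_length):
    """Keep only the first `context_length` token IDs (one possible policy)."""
    return token_ids[:context_length]


context_length = 4
token_ids = [40, 367, 2885, 1464, 1807, 3619]  # six tokens, two more than supported

truncated = truncate_to_context(token_ids, context_length)
positions = list(range(len(truncated)))

print(truncated)  # [40, 367, 2885, 1464]
print(positions)  # [0, 1, 2, 3]
```

After truncation, every position index is smaller than `context_length`, so each one has a matching row in a positional embedding table with `context_length` rows.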

- To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:

In [135]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
print(input_embeddings)

torch.Size([8, 4, 256])
tensor([[[-0.1390,  0.6390,  1.3222,  ...,  0.6350, -2.3747,  0.9599],
         [ 0.6543,  0.8614,  0.4532,  ...,  0.0369, -1.4310,  0.0136],
         [ 0.3299, -1.4393,  0.9953,  ...,  1.8197,  2.8795,  0.9715],
         [ 2.3781,  1.0874,  2.4790,  ...,  0.0283,  0.0999,  1.6521]],

        [[ 0.9158,  1.2520, -0.9214,  ...,  0.9341,  0.3572,  0.7624],
         [-1.6560,  0.4164,  0.3375,  ...,  0.6591, -0.5815,  1.8070],
         [ 2.5653, -2.0329,  0.1851,  ...,  0.4738, -0.4297,  2.4721],
         [ 2.3971,  1.5190, -0.7608,  ...,  1.6290,  2.4256,  1.7007]],

        [[-1.4016,  0.1724,  0.0297,  ...,  0.2302, -0.4470,  1.8471],
         [ 1.0107,  1.1125,  2.0109,  ...,  1.0020,  0.1778,  1.5185],
         [ 0.6601, -3.1482,  0.9016,  ...,  3.0402,  2.3289,  0.6608],
         [ 1.8590,  0.1569,  1.4285,  ..., -0.0382,  2.1524,  2.1511]],

        ...,

        (remaining output truncated)

- In the initial phase of the input processing workflow, the input text is segmented into separate tokens

- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:

![Alt text](../../assests/pos-embedding.png)
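
Adding a `4 x 256` positional tensor to an `8 x 4 x 256` token tensor works because the same positional rows are reused for every sample in the batch (PyTorch handles this via broadcasting). The sketch below spells out that addition with nested lists; the helper `add_positional` and the toy values are hypothetical and much smaller than the tensors above.

```python
def add_positional(token_embeddings, pos_embeddings):
    """Add a [T][D] positional table to every sample of a [B][T][D] batch."""
    return [
        [
            [tok + pos for tok, pos in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sample, pos_embeddings)
        ]
        for sample in token_embeddings
    ]


# Toy values: batch of 2 samples, 3 tokens each, 2 embedding dimensions
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
pos_embeddings = [[0.5, 0.25], [1.0, 2.0], [-1.0, 0.0]]

input_embeddings = add_positional(token_embeddings, pos_embeddings)
print(input_embeddings[0][0])  # [1.5, 2.25]
print(input_embeddings[1][0])  # [7.5, 8.25]
```

The first token of each sample receives the same positional vector, which is exactly what makes the position information independent of the token's identity.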

## Summary

- LLMs require textual data to be converted into numerical vectors, known as embeddings, since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.


- As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.


- Special tokens, such as `<|unk|>` and `<|endoftext|>`, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.


- The `byte pair encoding (BPE)` tokenizer used for LLMs like `GPT-2` and `GPT-3` can efficiently handle unknown words by breaking them down into subword units or individual characters.


- We use a `sliding window` approach on tokenized data to generate input-target pairs for LLM training.


- Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.


- While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.