# Chapter 2 - Lab 1a : Working with Text

> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

#### 1. Chapter Overview
This notebook covers Chapter 2, Working with Text, whose theme is preparing and processing text data for LLMs. The chapter covers text tokenization and embedding, the key steps in turning raw text into a machine-readable format.

#### 2. Checking Dependencies

In [1]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [2]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.5.1+cu121
tiktoken version: 0.8.0


This cell verifies the installed versions of essential libraries: PyTorch for deep learning and tiktoken for optimized BPE tokenization. Ensuring compatibility between code and library versions is critical in machine learning workflows to maintain reproducibility and prevent errors during execution.
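If one of the packages is missing, `importlib.metadata.version` raises `PackageNotFoundError` instead of printing a version. A minimal sketch of a more defensive check is shown below; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report each dependency without crashing when one is missing
for name in ("torch", "tiktoken"):
    found = installed_version(name)
    print(f"{name} version:", found if found is not None else "not installed")
```

This keeps the dependency check informative on a fresh environment, where the cell would otherwise stop at the first missing package.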

#### - Chapter Overview

- This chapter covers data preparation and sampling to get input data "ready" for the LLM

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp?timestamp=1" width="500px">

## 2.1 Understanding word embeddings

- There are many forms of embeddings; we focus on text embeddings for now.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp" width="500px">

- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)
- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp" width="300px">

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp" width="300px">

#### - Downloading the Dataset

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [3]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

This block downloads "The Verdict" by Edith Wharton, a public domain text, as the source material for tokenization. Using authentic literary text mirrors real-world preprocessing needs for varied and complex datasets.

#### - Reading the Dataset

In [4]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


The raw text is loaded and briefly inspected for size and content. The total character count is a useful metric for gauging the scope of preprocessing and tokenization steps.


- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
- The following regular expression will split on whitespaces

#### - Whitespace Splitting

In [5]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


This initial implementation splits text on whitespace, creating a basic tokenization strategy. While rudimentary, it lays the groundwork for iterative refinement, which is expanded upon in subsequent cells.

#### - Including Punctuation in Tokenization

- We don't only want to split on whitespaces but also commas and periods, so let's modify the regular expression to do that as well

In [6]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


The updated regular expression accounts for punctuation, an improvement that aligns token boundaries more closely with linguistic conventions.

#### - Cleaning Empty Tokens

- As we can see, this creates empty strings, let's remove them

In [7]:
# Keep only the items that are non-empty after stripping whitespace
# (this drops both empty strings and whitespace-only tokens).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Empty strings (produced when two delimiters are adjacent) and whitespace-only tokens are removed, so that every remaining token is a word or a punctuation character.


#### - Advanced Tokenization : Generalizing Tokenization

- This looks pretty good, but let's also handle other types of punctuation, such as question marks, quotation marks, and double dashes

In [8]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


By incorporating additional delimiters (e.g., semicolons, question marks, and double dashes), the tokenizer handles a wider range of punctuation, making it applicable to more varied text.
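The splitting and filtering steps can be wrapped into one reusable function. The sketch below uses the same regular expression as the cell above; the names `TOKEN_PATTERN` and `simple_tokenize` are illustrative and do not appear in the original notebook.

```python
import re

# Same pattern as in the cell above: the capturing group makes re.split
# keep the delimiters (punctuation, "--", whitespace) in its output.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')


def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(simple_tokenize("Hello, world. Is this-- a test?"))
print(simple_tokenize('"Wait; really?" (yes!)'))
```

Without the capturing parentheses, `re.split` would discard the delimiters, and the punctuation tokens would be lost.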

#### - Tokenizing the Entire Text

- This is pretty good, and we are now ready to apply this tokenization to the raw text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp" width="350px">

In [9]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


The tokenizer is applied to the downloaded dataset, producing a list of cleaned tokens. This prepares the text for subsequent embedding and modeling.

#### - Vocabulary Creation

- Let's calculate the total number of tokens

In [10]:
print(len(preprocessed))

4690


## 2.3 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp" width="500px">

#### - Vocabulary Creation : Building a Vocabulary

- From these tokens, we can now build a vocabulary that consists of all the unique tokens

In [11]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


In [12]:
vocab = {token:integer for integer,token in enumerate(all_words)}

A vocabulary is created by extracting all unique tokens from the dataset. Each token is assigned an integer ID, which allows numerical representation of text for computational processing. The size of the vocabulary is a key parameter that impacts the complexity and efficiency of the model.
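The same construction can be traced by hand on a tiny example. This is a sketch on a made-up token list (not the actual dataset); `token_to_id` and `id_to_token` are illustrative names.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Unique tokens, sorted so that the IDs are the same on every run
# (iteration order of a set of strings is not stable across runs).
unique_tokens = sorted(set(tokens))
token_to_id = {token: idx for idx, token in enumerate(unique_tokens)}
id_to_token = {idx: token for token, idx in token_to_id.items()}

encoded = [token_to_id[t] for t in tokens]
print(token_to_id)
print(encoded)
print([id_to_token[i] for i in encoded])
```

Sorting places punctuation before letters, which is why the first vocabulary entries in the notebook are punctuation characters.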

#### - Vocabulary Inspection

- Below are the first 51 entries in this vocabulary (IDs 0 through 50):

In [13]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


Purpose: The loop iterates through the vocabulary dictionary and prints the first 51 token-ID pairs (the `break` fires after the entry with index 50 has been printed). This provides a snapshot of the vocabulary and helps identify patterns or anomalies.
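An alternative way to take a fixed number of entries is `itertools.islice`, which makes the count explicit. The sketch below uses a small stand-in dictionary; `toy_vocab` and `first_entries` are hypothetical names, not part of the notebook.

```python
from itertools import islice

# Toy vocabulary standing in for the `vocab` dictionary built above
toy_vocab = {token: idx for idx, token in enumerate(["!", ",", ".", "A", "Ah"])}


def first_entries(mapping, n):
    """Return the first n (token, id) pairs in insertion order."""
    return list(islice(mapping.items(), n))


for item in first_entries(toy_vocab, 3):
    print(item)
```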

#### - Illustration of Tokenization

- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/07.webp?123" width="500px">

Significance: By tokenizing a sample text, it becomes clear how the tokenizer operates and segments the text into tokens. This is useful for debugging and understanding the tokenizer's behavior.

#### - Tokenizer Class Implementation

- Putting it all together now into a tokenizer class

In [14]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

**More details :**
- Encode Function: Converts raw text into token IDs using the vocabulary. It preprocesses the text by splitting it based on delimiters and maps tokens to their corresponding IDs.
- Decode Function: Reverses the process, converting token IDs back into a human-readable text format. It also removes unnecessary spaces near punctuation for better formatting.

#### - Tokenizer Encoding Example

The class is tested on a sample text to verify its correctness.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp?123" width="500px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can then be embedded (later) as input for the LLM

In [15]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Purpose: Demonstrates that the tokenizer correctly maps text to token IDs, showcasing its effectiveness on structured English sentences.

#### - Tokenizer decoding Example

- We can decode the integers back into text

In [16]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

Purpose: This command takes a list of integer token IDs (`ids`) and decodes them back into text. Comparing the result with the original shows that every token is recovered, although the spacing is not: `"It's` comes back as `" It' s`, because `decode` joins the tokens with spaces and only removes whitespace that precedes punctuation.

#### - Validating Encoding and Decoding

In [17]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

Purpose: This line encodes the original text into token IDs and immediately decodes them back into text. The round trip recovers every token but not the exact original string: whitespace is normalized, so `"It's` becomes `" It' s` and the line break in the input is replaced by a single space.
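The encode-decode round trip can be checked at two levels: the decoded string versus the original string, and the re-encoded IDs versus the original IDs. The sketch below uses a condensed stand-in for `SimpleTokenizerV1` with a hand-written toy vocabulary; `ToyTokenizer`, `toy_vocab`, and `sample` are illustrative assumptions.

```python
import re


class ToyTokenizer:
    """Condensed stand-in for SimpleTokenizerV1, for illustration only."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        return [self.str_to_int[p.strip()] for p in pieces if p.strip()]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove whitespace that precedes the listed punctuation characters
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


sample = '"It\'s fine," she said.'
toy_vocab = {tok: i for i, tok in enumerate(
    ['"', "'", ",", ".", "It", "fine", "s", "said", "she"])}

tokenizer = ToyTokenizer(toy_vocab)
ids = tokenizer.encode(sample)
decoded = tokenizer.decode(ids)

print(repr(decoded))
print("same text:", decoded == sample)
print("same IDs: ", tokenizer.encode(decoded) == ids)
```

In this sketch the text-level comparison fails (extra spaces appear around the apostrophe and after the opening quote), while the ID-level comparison succeeds. For this kind of tokenizer, comparing IDs is the more meaningful consistency check.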

## 2.4 Adding special context tokens

Special tokens play a pivotal role in enhancing the tokenizer's functionality, particularly in handling unknown words and delineating text boundaries. This section introduces and integrates such tokens into the tokenizer's vocabulary.

- It's useful to add some "special" tokens for unknown words and to denote the end of a text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp?123" width="500px">

**Importance of Special Tokens**

- Some tokenizers use special tokens to help the LLM with additional context
- Some of these special tokens are
  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
  - `[PAD]` (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
  - `[UNK]` to represent words that are not included in the vocabulary

- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity
- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above
- GPT also uses the `<|endoftext|>` token for padding (since we typically use a mask when training on batched inputs, we would not attend to padded tokens anyway, so it does not matter what these tokens are)
- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units which we will discuss in a later section



- We use the `<|endoftext|>` tokens between two independent sources of text:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp" width="500px">
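The padding idea described above (reusing `<|endoftext|>` as the padding token and masking the padded positions) can be sketched in plain Python. The ID 50256 is the `<|endoftext|>` ID in the GPT-2 vocabulary; the helper `pad_batch` and the 0/1 mask layout are assumptions made for illustration, not code from the notebook.

```python
PAD_ID = 50256  # ID of <|endoftext|> in the GPT-2 vocabulary, reused as padding


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists to a common length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


batch, mask = pad_batch([[15496, 11, 466], [554, 262]])
print(batch)
print(mask)
```

Because positions with mask value 0 are ignored during training, the actual value of the padding ID does not affect the result.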

#### - Handling Unknown Tokens

- Let's see what happens if we tokenize the following text:

In [18]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

- The above produces an error because the word "Hello" is not contained in the vocabulary
- To deal with such cases, we can add special tokens like `"<|unk|>"` to the vocabulary to represent unknown words
- Since we are already extending the vocabulary, let's add another token called `"<|endoftext|>"`, which is used in GPT-2 training to denote the end of a text (and it's also used between concatenated texts, for example if our training dataset consists of multiple articles, books, etc.)

#### - Code Enhancement

In [19]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

Purpose:

- Vocabulary Expansion: Incorporates <|endoftext|> and <|unk|> into the existing set of tokens.
- Mapping Update: Reconstructs the vocab dictionary to include the newly added special tokens, assigning them unique integer IDs.

#### - Vocabulary Size:

In [20]:
len(vocab.items())

1132

#### - Verification

This indicates an expanded vocabulary accommodating the special tokens.

In [21]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


The inclusion of `<|endoftext|>` and `<|unk|>` is confirmed.

- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token

In [22]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

Let's try to tokenize text with the modified tokenizer:

In [23]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [24]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [25]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 BytePair encoding

Purpose : Byte Pair Encoding (BPE) is a subword tokenization technique that handles out-of-vocabulary (OOV) words by decomposing them into smaller, more frequent subword units. GPT-2 uses a BPE tokenizer, which is why it does not need an unknown-word token.

**Overview of BPE**
- Functionality:

  - Subword Tokenization: Breaks down rare or unknown words into constituent subwords or characters based on frequency, ensuring that the tokenizer can represent any possible word combination.
  - Vocabulary Efficiency: Maintains a manageable vocabulary size by combining frequent pairs of characters or subwords, reducing the likelihood of encountering OOV words.

- Advantages:

  - Robustness: Enhances the tokenizer's ability to handle diverse and complex linguistic inputs.
  - Efficiency: Balances coverage and computational efficiency by limiting the vocabulary size while maximizing representational capacity.

- Reference Implementations:

  - Original BPE Tokenizer: OpenAI GPT-2 Encoder
  - Optimized Implementation: tiktoken Library by OpenAI, which leverages Rust for improved performance.
  
- **Performance Benchmarking** : You have a notebook in the bytepair_encoder folder `(02_bonus_bytepair-encoder)` that compares these two implementations side-by-side (tiktoken was about 5x faster on the sample text)
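To make the "combining frequent pairs" idea concrete, here is a toy sketch of two BPE merge steps on a character-level corpus. It is a simplified illustration under stated assumptions: it works on characters rather than bytes, has no pre-tokenization, and is not the GPT-2 or tiktoken implementation. The function names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pair_counts = Counter()
    for symbols in words:
        pair_counts.update(zip(symbols, symbols[1:]))
    return pair_counts.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Toy corpus: each word starts as a list of single characters
words = [list("low"), list("lower"), list("lowest"), list("slow")]

for step in range(2):
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]
    print(f"merge {step + 1}: {pair} ->", words)
```

After two merges the frequent fragment `low` has become a single symbol, while rarer suffixes such as `e`, `r`, `s`, and `t` remain individual characters. A real BPE vocabulary is the result of many such merges learned from a large corpus.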

#### - Implementing BPE with tiktoken

The tiktoken library is utilized to implement BPE tokenization, offering enhanced performance through its Rust-based core algorithms.

#### - Installation and Version Verification

In [26]:
# uncomment below to install tiktoken
# pip install tiktoken

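Purpose: Optional install step for the tiktoken library, which is required for BPE tokenization. The command is commented out because tiktoken was already installed at the top of the notebook.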

In [27]:
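import importlib.metadata  # import the submodule explicitly; `import importlib` alone does not guarantee it is loaded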
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.8.0


Purpose: Confirms the installation and checks the version of tiktoken.

#### - Initializing the tiktoken BPE Tokenizer

In [28]:
tokenizer = tiktoken.get_encoding("gpt2")

Purpose: Initializes the BPE tokenizer configured for GPT-2, aligning with the model's tokenization scheme.

#### - Tokenizing Text with tiktoken

In [29]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


Purpose: Encodes a sample text into token IDs using the tiktoken BPE tokenizer.

This list of integers represents the token IDs corresponding to the input text, including the special <|endoftext|> token.

In [30]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


Purpose: Decodes the previously obtained token IDs back into human-readable text to verify the accuracy of the tokenization process.

The decoded text matches the original input exactly, including the missing space in `terracesof`: the two adjacent string literals in the input are concatenated without a separating space.

#### - Visualizing BPE Tokenization

- BPE tokenizers break down unknown words into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

## 2.6 Data sampling with a sliding window

Efficient data sampling is paramount for training LLMs. A sliding window approach is employed to prepare input-target pairs, enabling the model to predict the next word in a sequence based on preceding context.

**Conceptual Framework**
- Objective: Train the model to generate one word at a time by predicting the subsequent word in a sequence.

- Methodology:

  - Sliding Window: Segments the tokenized text into overlapping chunks (windows) of a fixed size (`context_size`), where each window serves as an input sequence and the corresponding target is the same sequence shifted by one token.

  - Input-Target Pairing: For each window, the input consists of tokens `[t1, t2, t3, t4]`, and the target is `[t2, t3, t4, t5]`. This setup trains the model to predict t5 given `[t1, t2, t3, t4]`.

- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="400px">
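The input-target pairing described above can be sketched without any libraries. The generator below is a minimal illustration using placeholder IDs 1 to 6 in place of t1 to t6; the name `sliding_window_pairs` is hypothetical.

```python
def sliding_window_pairs(token_ids, context_size, stride=1):
    """Yield (input, target) lists where the target is the input shifted by one."""
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        yield inputs, targets


# Placeholder IDs 1..6 standing in for t1..t6
pairs = list(sliding_window_pairs([1, 2, 3, 4, 5, 6], context_size=4))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target list contains one next-token label per input position, so a window of four tokens provides four prediction tasks, not just one.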

#### - Code Implementation
Reading and Encoding the Dataset

In [31]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


Purpose:
- Data Ingestion: Loads the text data from "the-verdict.txt".
- Tokenization: Encodes the entire text into token IDs using the BPE tokenizer.

Output:
- Indicates that the encoded text consists of 5,145 tokens.

#### - Creating Input-Target Pairs

- For each text chunk, we want the inputs and targets
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right

In [32]:
enc_sample = enc_text[50:]

Purpose: Drops the first 50 tokens and keeps the rest. This is done purely for demonstration, so that the examples below start from a later passage of the story; the file contains no metadata or headers that would need to be skipped.

#### - Generating Input and Target Sequences

In [33]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


Purpose:
- Input Sequence (x): The first four tokens [290, 4920, 2241, 287].
- Target Sequence (y): The next four tokens [4920, 2241, 287, 257], shifted by one position.

Output:
- Demonstrates the alignment between inputs and their corresponding targets.


- One by one, the predictions would look as follows:


#### - Visualizing Input-Target Alignment

In [34]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Purpose: Iteratively displays how each token in the target sequence is derived from the input context.

#### - Decoding Input and Target Sequences

In [35]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Purpose:
- Decodes the numerical token IDs back into text to provide a clearer understanding of the input-target relationships.

Output:
- Illustrates how each subsequent word is predicted based on the preceding context.

#### - Implementing a Data Loader

We will take care of the next-word prediction in a later chapter, after we have covered the attention mechanism.

For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one.

- Install and import PyTorch (see Appendix A for installation tips)

In [36]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.5.1+cu121


To streamline the training process, a data loader is implemented using PyTorch's Dataset and DataLoader classes. This loader efficiently iterates over the input dataset, yielding input-target pairs suitable for model training.

- We use a sliding window approach, changing the position by +1:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp?123" width="500px">

- Create dataset and dataloader that extract chunks from the input text dataset

In [37]:
# Importing Required Modules
from torch.utils.data import Dataset, DataLoader

# Defining the Custom Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

**Purpose:**
- Initialization (`__init__`):

  - Tokenization: Encodes the entire text into token IDs.
  - Sliding Window: Iterates over the token IDs with a specified stride, extracting overlapping input and target chunks of size `max_length`.
  - Storage: Stores the input and target chunks as PyTorch tensors for efficient retrieval.
- Length (`__len__`): Returns the total number of input-target pairs.

- Item Retrieval (`__getitem__`): Provides access to individual input-target pairs based on the index.

In [38]:
# Creating the Data Loader Function
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

**Purpose:**
- Tokenizer Initialization: Ensures consistency by using the same BPE tokenizer across the dataset.

- Dataset Creation: Instantiates the GPTDatasetV1 class with the provided parameters.

- DataLoader Configuration: Sets up the DataLoader with specified parameters such as batch_size, shuffle, drop_last, and num_workers to optimize data retrieval during training.

- Return Value: Outputs the configured DataLoader for subsequent use in model training

#### - Testing the Data Loader

To validate the functionality of the data loader, it's tested with a batch size of 1 and a context size of 4.

#### - Loading the Raw Text

In [39]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Purpose: Reads the content of "the-verdict.txt" into the raw_text variable.

#### - Initializing the Data Loader and Iterating Over Batches

In [40]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

# - Parameters:
# batch_size=1: Processes one input-target pair at a time.
# max_length=4: Each input sequence contains four tokens.
# stride=1: The sliding window moves one token at a time, maximizing overlap and data utilization.
# shuffle=False: Keeps the samples in document order so that the printed batches are easy to follow
# (the function's default is shuffle=True, which is the usual choice for training).

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


Purpose:
- Retrieves and prints the first input-target pair from the data loader.

Output:
- Demonstrates the structure of the batch, consisting of tensors representing input and target sequences.

In [41]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


Further illustrates the sequential nature of input-target pairing.

#### - Visualizing Stride Equal to Context Length

- An example using stride equal to the context length (here: 4) as shown below:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp" width="500px">

#### - Creating Batched Outputs
Adjusting the stride parameter affects the overlap between input sequences. By setting stride equal to max_length, overlapping is minimized, which can help prevent overfitting.

- We can also create batched outputs
- Note that we increase the stride here so that we don't have overlaps between the batches, since more overlap could lead to increased overfitting

In [42]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

# - Parameters:
# batch_size=8: Processes eight input-target pairs simultaneously.
# stride=4: Moves the sliding window by four tokens, ensuring no overlap between batches.

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


- Interpretation:

  - Inputs: Each row represents an input sequence of four tokens.
  - Targets: Each corresponding row represents the target sequence, shifted by one token.
  - Stride Impact: With stride=4, there's no overlap between input sequences, reducing the risk of overfitting to specific patterns in the data.
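The effect of the stride on the number of training samples can be estimated directly from the loop bounds used in `GPTDatasetV1`. The sketch below assumes the 5,145-token count reported earlier for the-verdict.txt; `count_windows` is a hypothetical helper that mirrors `range(0, len(token_ids) - max_length, stride)`.

```python
def count_windows(num_tokens, max_length, stride):
    """Number of samples produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


num_tokens = 5145  # token count reported for the-verdict.txt earlier
for stride in (1, 2, 4):
    n = count_windows(num_tokens, max_length=4, stride=stride)
    print(f"stride={stride}: {n} samples, {n // 8} full batches of 8")
```

A smaller stride yields more samples from the same text, but neighboring samples share most of their tokens. A stride equal to `max_length` uses each token as an input exactly once.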

## 2.7 Creating token embeddings

Token embeddings are fundamental to LLMs, transforming discrete token IDs into continuous vector representations that capture semantic relationships and contextual nuances.

**Embedding Layer Overview**
- Functionality:

  - Look-Up Operation: Maps each token ID to a corresponding embedding vector.
  - Trainable Parameters: Embedding vectors are learnable parameters optimized during model training to capture meaningful representations.

- Relation to One-Hot Encoding:

  - Efficiency: Embedding layers offer a more computationally efficient alternative to one-hot encoding, enabling scalable and dense representations.
  - Backpropagation: Unlike static one-hot vectors, embeddings can be fine-tuned via backpropagation, allowing the model to learn nuanced token relationships.

- The data is now almost ready for an LLM.
- As a last step, we embed the tokens in a continuous vector representation using an embedding layer.
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="400px">

- Suppose we have the following four input tokens with IDs 2, 3, 5, and 1 (after tokenization):

In [43]:
input_ids = torch.tensor([2, 3, 5, 1])

- For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:

In [44]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- Parameters:

  - vocab_size=6: Defines the size of the vocabulary.
  - output_dim=3: Specifies the dimensionality of the embedding vectors.

- Initialization:

  - Random Seed: Ensures reproducibility by setting the random seed.
  - Embedding Matrix: Creates a weight matrix of shape (vocab_size, output_dim), where each row corresponds to a token's embedding.

#### - Embedding Layer Weights

- This would result in a 6x3 weight matrix:

In [45]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The embedding matrix consists of six 3-dimensional vectors, each representing a unique token in the vocabulary.

- For those familiar with one-hot encoding, the embedding layer approach above is essentially a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer. This is described in the supplementary code in the (03_bonus_embedding-vs-matmul) folder.
- Because the embedding layer is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation.
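
The equivalence described above can be checked directly. The sketch below is not the supplementary notebook's code. It rebuilds the small 6x3 layer and compares the look-up with an explicit one-hot matrix multiplication against the same weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
input_ids = torch.tensor([2, 3, 5, 1])

# Route 1: direct look-up through the embedding layer.
looked_up = embedding_layer(input_ids)

# Route 2: one-hot encode the IDs, then multiply by the same weight matrix.
one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()  # shape (4, 6)
multiplied = one_hot @ embedding_layer.weight                    # shape (4, 3)

print(torch.allclose(looked_up, multiplied))
```

Each one-hot row contains a single 1, so the matrix product simply selects one row of the weight matrix. That is why the look-up can skip the multiplication entirely.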

#### - Embedding a Single Token

- To convert a token with id 3 into a 3-dimensional vector, we do the following:

In [46]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


This vector corresponds to the token with ID 3, showcasing its continuous representation.

#### - Embedding Multiple Tokens

- Note that the single-token output above is the 4th row of the `embedding_layer` weight matrix (row index 3, because indexing starts at 0)
- To embed all four `input_ids` values above, we do:

In [47]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


The resulting tensor has a shape of (4, 3), representing the embeddings for the four input tokens.

#### Representation of an embedding layer

- An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp" width="500px">
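
One way to see the look-up behaviour is to index the weight matrix with the token IDs and compare the result with the layer's output. This is a small sketch that rebuilds the 6x3 layer from earlier.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

# Indexing the weight matrix with the IDs picks out rows 2, 3, 5 and 1.
rows = embedding_layer.weight[input_ids]

print(torch.equal(rows, embedding_layer(input_ids)))
```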

**Supplementary Insights**
- Relation to Fully-Connected Layers:

  - One-Hot Encoding Comparison: Embedding layers can be viewed as an efficient implementation of one-hot encoding followed by a linear transformation (matrix multiplication).
  - Trainability: Unlike static one-hot vectors, embedding layers are trainable, allowing the model to learn optimized representations during training.
- Further Reading: For an in-depth comparison between embedding layers and regular linear layers, refer to the supplementary notebook in the (03_bonus_embedding-vs-matmul) folder.

## 2.8 Encoding word positions

While token embeddings capture the semantic essence of tokens, they lack information about the token's position within the sequence. Positional embeddings address this limitation by encoding the position of each token, enabling the model to discern the order and structure of the input data.

#### - Challenge of Position-Invariant Embeddings
- Issue: Standard embedding layers treat each token ID independently of its position, making the model agnostic to the order of tokens.

- Implication: Without positional information, the model cannot capture the sequential nature of language, which is critical for understanding context and generating coherent text.

- The embedding layer converts token IDs into identical vector representations regardless of where they are located in the input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="400px">
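
The position-invariance issue can be reproduced with the small 6x3 layer. In the sketch below, the sequence `[2, 5, 1, 2]` is a hypothetical example in which token ID 2 occurs at the first and last position.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)

# Token ID 2 appears at positions 0 and 3 of this hypothetical sequence.
sequence = torch.tensor([2, 5, 1, 2])
vectors = embedding_layer(sequence)

# Both occurrences map to the same row of the weight matrix.
print(torch.equal(vectors[0], vectors[3]))
```

The two vectors are identical, so nothing in the token embedding alone tells the model which occurrence came first.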

#### - Introducing Positional Embeddings

- Functionality:

  - Positional Encoding: Assigns unique vectors to each position in the input sequence, which are then combined with token embeddings.
  - Integration: The sum of token and positional embeddings forms the final input embeddings fed into the LLM.

- Implementation in GPT-2:

  - Absolute Positional Embeddings: GPT-2 employs absolute positional embeddings, meaning each position has a distinct embedding irrespective of the context.
  - Efficiency: By maintaining separate embedding layers for tokens and positions, the model efficiently integrates both semantic and positional information.

- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">

#### - Code Implementation: Initializing Embedding Layers

- The BytePair encoder has a vocabulary size of 50,257.
- Suppose we want to encode the input tokens into a 256-dimensional vector representation:

In [48]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- Parameters:
  - vocab_size=50257: Aligns with GPT-2's extensive vocabulary size.
  - output_dim=256: Sets the dimensionality of each embedding vector, balancing expressiveness and computational efficiency.
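
To put a number on the cost side of that trade-off, the sketch below counts the trainable parameters of a layer with these settings. The count is just `vocab_size * output_dim`, so it grows linearly with the embedding dimension.

```python
import torch

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# An embedding layer holds a single weight matrix of shape (vocab_size, output_dim).
num_params = sum(p.numel() for p in token_embedding_layer.parameters())

print(token_embedding_layer.weight.shape)
print(f"{num_params:,} trainable parameters")
```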

#### - Embedding Tokens from the Data Loader

- If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector
- If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:

In [49]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

Purpose:
- Batch Configuration: Processes eight sequences (batch_size=8) each containing four tokens (max_length=4).
- Stride Adjustment: Sets stride=4 to eliminate overlap between sequences, reducing redundancy and potential overfitting.

#### - Inspecting Token IDs and Input Shape

In [50]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


- Token IDs: Displays the token sequences for the batch.
- Input Shape: Confirms that the input tensor has a shape of (8, 4), representing eight sequences each of four tokens.

#### - Embedding the Tokens

In [51]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


The output shape indicates that each token in the input sequences is mapped to a 256-dimensional embedding vector, resulting in a tensor of shape (8, 4, 256).

#### - Positional Embeddings Initialization

- GPT-2 uses absolute position embeddings, so we just create another embedding layer:

In [52]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

- Parameters:
  - context_length=4: Matches the maximum sequence length.
  - output_dim=256: Aligns with the token embedding dimensionality for seamless integration.

#### - Generating Positional Embeddings

In [53]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


The positional embeddings tensor has a shape of (4, 256), representing the embeddings for each position in the sequence.

#### - Combining Token and Positional Embeddings

- To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:

In [54]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


PyTorch broadcasts the (4, 256) positional tensor across the batch dimension, so the addition results in a tensor of shape (8, 4, 256). This tensor serves as the input embeddings, incorporating both semantic and positional information.
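
To make the shape arithmetic concrete, the sketch below repeats the addition with random stand-in tensors of the same shapes. The names `expanded` and `explicit_sum` are illustrative. The check confirms that every sequence in the batch receives the same four positional vectors.

```python
import torch

torch.manual_seed(123)
batch_size, seq_len, dim = 8, 4, 256

# Random stand-ins with the same shapes as the real embedding outputs.
token_embeddings = torch.randn(batch_size, seq_len, dim)
pos_embeddings = torch.randn(seq_len, dim)

# (8, 4, 256) + (4, 256): the smaller tensor is broadcast over the batch dimension.
input_embeddings = token_embeddings + pos_embeddings

# Explicit equivalent: view the positional tensor once per sequence in the batch.
expanded = pos_embeddings.unsqueeze(0).expand(batch_size, seq_len, dim)
explicit_sum = token_embeddings + expanded

print(input_embeddings.shape)
print(torch.equal(input_embeddings, explicit_sum))
```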

#### - Representation

- In the initial phase of the input processing workflow, the input text is segmented into separate tokens
- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="400px">

#### - Final Integration in the Input Processing Workflow
- Segmentation: The input text is first segmented into tokens using the BPE tokenizer.

- Encoding: Tokens are transformed into integer IDs based on the extended vocabulary, which now includes special tokens.

- Embedding: Token IDs are mapped to continuous vectors via the embedding layer.

- Positional Encoding: Positional embeddings are generated and added to the token embeddings to incorporate sequence order information.

- Result: The final input embeddings are a combination of token semantics and positional context, ready to be fed into the LLM for training.
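
The embedding and positional steps of this workflow can be bundled into a single module. The `InputEmbedding` class below is a hypothetical helper written for illustration, not code from the lab. It assumes learned absolute positional embeddings as described above and takes already-tokenized IDs as input.

```python
import torch


class InputEmbedding(torch.nn.Module):
    """Hypothetical helper: token embedding plus learned absolute positional embedding."""

    def __init__(self, vocab_size, context_length, output_dim):
        super().__init__()
        self.token_embedding = torch.nn.Embedding(vocab_size, output_dim)
        self.pos_embedding = torch.nn.Embedding(context_length, output_dim)

    def forward(self, token_ids):
        # token_ids has shape (batch_size, seq_len), with seq_len <= context_length.
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        return self.token_embedding(token_ids) + self.pos_embedding(positions)


torch.manual_seed(123)
embed = InputEmbedding(vocab_size=50257, context_length=4, output_dim=256)

# First two rows of the token ID batch printed earlier in this lab.
token_ids = torch.tensor([[40, 367, 2885, 1464],
                          [1807, 3619, 402, 271]])

output = embed(token_ids)
print(output.shape)
```

Generating the positions from the actual sequence length lets the same module handle sequences shorter than `context_length`. Longer sequences are not supported, because the positional layer only has `context_length` rows.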

# Summary and takeaways

This chapter walked through building a tokenizer for LLMs, handling unknown tokens, sampling training data, and incorporating positional information. Key takeaways include:

- Tokenizer Robustness: Incorporating special tokens like <|unk|> and <|endoftext|> enhances the tokenizer's ability to handle diverse and unpredictable text inputs.

- Byte Pair Encoding (BPE): BPE serves as an effective method for subword tokenization, balancing vocabulary size and flexibility, thereby enabling models like GPT-2 to manage OOV words efficiently.

- Data Sampling Strategy: The sliding window approach for data sampling ensures that the model is exposed to varied contexts, fostering better generalization during training.

- Embedding Layers: Transforming token IDs into continuous vectors via embedding layers is fundamental for capturing semantic relationships, while positional embeddings are crucial for maintaining the sequential integrity of the input data.

- Integration Workflow: The seamless combination of token and positional embeddings establishes a solid foundation for feeding data into LLMs, setting the stage for subsequent training and model development phases.

#### - Additional Resources
- See the `dataloader.ipynb` code notebook in the (04_bonus_dataloader-intuition) folder. It is a concise version of the data loader that we implemented in this chapter and will need for training the GPT model in upcoming chapters.

- Exercise Solutions: To validate this lab, you will need to complete the two exercises in the notebook `exercise.ipynb` in the (01_main-code) folder.

