# Chapter 2: Working with text data

## Overview
This chapter covers the following topics:
- Preparing text for large language model training
- Splitting text into word and subword tokens
- Byte pair encoding as a more advanced way of tokenizing text
- Sampling training examples with a sliding window approach
- Converting tokens into vectors that feed into a large language model

### **📌 Objectives of This Chapter**

By the end of this chapter, you will be able to:

✅ **Understand the Importance of Preprocessing Text for LLMs**  
- Learn why text preparation is crucial for effective training of large language models (LLMs).  
- Explore different preprocessing techniques to improve model performance.  

✅ **Apply Tokenization Techniques**  
- Differentiate between **word-level** and **subword-level** tokenization.  
- Understand how breaking text into tokens impacts model training and efficiency.  

✅ **Master Byte Pair Encoding (BPE) for Tokenization**  
- Learn how **Byte Pair Encoding (BPE)** improves text representation.  
- Explore how BPE reduces vocabulary size while retaining **meaningful subwords**.  

✅ **Implement a Sliding Window Approach for Training Data**  
- Understand why **context windows** are used in LLMs.  
- Learn how **overlapping sequences**, controlled by the stride, produce input-target pairs for next-token prediction.  

✅ **Convert Tokens into Numerical Representations**  
- Understand the transformation of **tokens into numerical vectors** (embeddings).  
- Learn how embeddings **capture semantic relationships** between words.  

---

By achieving these objectives, you will gain a **strong foundation** in text preprocessing, tokenization, and vectorization—key steps in **training modern large language models (LLMs)**. 🚀

## Important Notes
Make sure to review the concepts from Chapter 1 as they are foundational for understanding this chapter.

In [16]:
import urllib.request 
import re

In [17]:
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x1086c3a60>)

In [18]:
# read the data from the file
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters in the text file:", len(raw_text))
print(raw_text[:99])

Total number of characters in the text file: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [19]:
# Replace whitespace, punctuation, and "--" with spaces, then split into word tokens
preprocessed_text = re.sub(r'[\s,.:;?_!"()\']|--', ' ', raw_text)
preprocessed_text = preprocessed_text.split()
print("Total number of tokens in the preprocessed text:", len(preprocessed_text))


Total number of tokens in the preprocessed text: 3788


In [20]:
preprocessed_text[:10]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius']
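
The preprocessing above discards punctuation entirely. If punctuation should survive as separate tokens, one option is to wrap the delimiter pattern in a capturing group so that `re.split` returns the delimiters too. The snippet below is a minimal sketch of that idea; the sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample sentence; any text works here.
sample = "Hello, world. Is this-- a test?"

# The capturing group makes re.split return the delimiters as well,
# so punctuation is kept as separate items.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop the empty and whitespace-only strings left over from the split.
tokens_with_punct = [p.strip() for p in pieces if p.strip()]
print(tokens_with_punct)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Keeping punctuation makes the vocabulary slightly larger but lets a decoder reconstruct text that is closer to the original.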

In [21]:
# Build the vocabulary: the sorted set of unique tokens
all_words = sorted(set(preprocessed_text)) 
vocab_size = len(all_words)
print(vocab_size)

1118


In [22]:
# creating vocabulary lookup tables
vocab = {token: i for i, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 10:
        break

('A', 0)
('Ah', 1)
('Among', 2)
('And', 3)
('Are', 4)
('Arrt', 5)
('As', 6)
('At', 7)
('Be', 8)
('Begin', 9)
('Burlington', 10)
('But', 11)


In [23]:
# Implementing a simple tokenizer
class SimpleTokenizer:
    def __init__(self, vocab):
        # Copy so that adding <UNK> does not mutate the caller's dictionary
        self.vocab = dict(vocab)
        self.unk_token = '<UNK>'
        if self.unk_token not in self.vocab:
            self.vocab[self.unk_token] = len(self.vocab)
        self.inverse_vocab = {i: token for token, i in self.vocab.items()}
        # Same delimiter pattern used to preprocess the training text
        self.tokenizer = re.compile(r'[\s,.:;?_!"()\']|--')

    def tokenize(self, text):
        # Split on delimiters and drop the empty strings between adjacent delimiters
        return [token for token in self.tokenizer.split(text) if token]

    def encode(self, tokens):
        # Map each token to its ID; out-of-vocabulary tokens map to <UNK>
        return [self.vocab.get(token, self.vocab[self.unk_token]) for token in tokens]

    def decode(self, token_ids):
        # Map IDs back to token strings (returns a list, not a joined string)
        return [self.inverse_vocab[token_id] for token_id in token_ids]


In [24]:
# Testing the SimpleTokenizer
tokenizer = SimpleTokenizer(vocab)

In [25]:
# Test text
test_text = "I HAD always thought Jack Gisburn rather a cheap genius."
tokens = tokenizer.tokenize(test_text)
print("Tokens:", tokens)

Tokens: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']


In [26]:
# Encode tokens
encoded_tokens = tokenizer.encode(tokens)
print("Encoded Tokens:", encoded_tokens)

Encoded Tokens: [42, 33, 137, 991, 46, 27, 806, 103, 244, 474]


In [27]:
# Decode tokens
decoded_tokens = tokenizer.decode(encoded_tokens)
print("Decoded Tokens:", decoded_tokens)

Decoded Tokens: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']
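
Every word in the test sentence appears in the vocabulary, so the `<UNK>` fallback is never exercised above. The following stand-alone sketch uses a tiny made-up vocabulary to show what happens when a token is missing. It mirrors the `dict.get` fallback used in `SimpleTokenizer.encode` but does not depend on the class.

```python
# Toy vocabulary (hypothetical); the last ID is reserved for unknown tokens.
toy_vocab = {"a": 0, "cheap": 1, "genius": 2, "<UNK>": 3}
toy_inverse = {i: tok for tok, i in toy_vocab.items()}

def encode(tokens):
    # dict.get falls back to the <UNK> ID for out-of-vocabulary tokens
    return [toy_vocab.get(tok, toy_vocab["<UNK>"]) for tok in tokens]

def decode(ids):
    return [toy_inverse[i] for i in ids]

ids = encode(["a", "cheap", "palace"])
print(ids)          # [0, 1, 3]
print(decode(ids))  # ['a', 'cheap', '<UNK>']
```

The round trip is lossy: once a word has been mapped to `<UNK>`, the original string cannot be recovered. This is one motivation for subword tokenizers such as BPE.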


In [29]:
# Byte pair encoding (BPE), as used by OpenAI's GPT-2
import tiktoken
print("tiktoken version:", tiktoken.__version__)


tiktoken version: 0.8.0


In [31]:
# Tokenizing text with the GPT-2 tokenizer
# Instantiating the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

In [33]:
# let's test the tokenizer
test_text = "I HAD always thought Jack Gisburn rather a cheap genius."
tokens = tokenizer.encode(test_text)
print("Tokens:", tokens)

Tokens: [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 13]


In [34]:
# let's decode the tokens
strings = tokenizer.decode(tokens)
print("Decoded string:", strings)

Decoded string: I HAD always thought Jack Gisburn rather a cheap genius.
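
To give some intuition for what a BPE tokenizer does, here is a toy sketch of the core training step: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. This is an illustration only. It works on characters of a made-up string, whereas the GPT-2 tokenizer loaded through `tiktoken` operates on bytes and applies a merge table that was learned in advance. The helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from single characters and apply three merge steps.
symbols = list("aaabdaaabac")
for _ in range(3):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

Each merge shortens the sequence while the concatenation of the symbols still spells the original string. Frequent character runs end up as single tokens, and rare words remain representable as several smaller pieces.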


In [35]:
# let's work with our raw text
encode_text = tokenizer.encode(raw_text)
print("Total number of tokens in the raw text:", len(encode_text))

Total number of tokens in the raw text: 5145


In [36]:
encode_text[:10]

[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138]

In [39]:
# Data sampling with a sliding window
context_length = 128
x = encode_text[:context_length]
y = encode_text[1:context_length+1] 
print(f"x: {x}")
print(f"y:     {y}")

x: [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976, 13, 357, 10915, 314, 2138, 1807, 340, 561, 423, 587, 10598, 393, 28537, 2014, 198, 198, 1, 464, 6001, 286, 465, 13476, 1, 438, 5562, 373, 644, 262, 1466, 1444, 340, 13, 314, 460, 3285, 9074, 13, 46606, 536, 5469, 438, 14363, 938, 4842, 1650, 353, 438, 2934, 489, 3255, 465, 48422, 540, 450, 67, 3299, 13, 366, 5189, 1781, 340, 338, 1016, 284, 3758, 262, 1988]
y:     [367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 44

In [40]:
for i in range(10):
    print(f"X: {tokenizer.decode(x[i:i+10])}")
    print(f"Y: {tokenizer.decode(y[i:i+10])}\n")

X: I HAD always thought Jack Gisburn rather
Y:  HAD always thought Jack Gisburn rather a

X:  HAD always thought Jack Gisburn rather a
Y: AD always thought Jack Gisburn rather a cheap

X: AD always thought Jack Gisburn rather a cheap
Y:  always thought Jack Gisburn rather a cheap genius

X:  always thought Jack Gisburn rather a cheap genius
Y:  thought Jack Gisburn rather a cheap genius--

X:  thought Jack Gisburn rather a cheap genius--
Y:  Jack Gisburn rather a cheap genius--though

X:  Jack Gisburn rather a cheap genius--though
Y:  Gisburn rather a cheap genius--though a

X:  Gisburn rather a cheap genius--though a
Y: isburn rather a cheap genius--though a good

X: isburn rather a cheap genius--though a good
Y: burn rather a cheap genius--though a good fellow

X: burn rather a cheap genius--though a good fellow
Y:  rather a cheap genius--though a good fellow enough

X:  rather a cheap genius--though a good fellow enough
Y:  a cheap genius--though a good fellow enough--
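
Because the target is the input shifted by one position, a single window actually contains several next-token prediction tasks, one per position. The sketch below spells these out for the first five token IDs from the GPT-2 encoding shown earlier; `context_size` and `pairs` are illustrative names.

```python
# First five token IDs from the GPT-2 encoding shown earlier in this chapter.
sample_ids = [40, 367, 2885, 1464, 1807]
context_size = 4

# For each prefix of the window, the next token is the prediction target.
pairs = [(sample_ids[:i], sample_ids[i]) for i in range(1, context_size + 1)]
for context, target in pairs:
    print(context, "---->", target)
# [40] ----> 367
# [40, 367] ----> 2885
# [40, 367, 2885] ----> 1464
# [40, 367, 2885, 1464] ----> 1807
```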



In [41]:
# A dataset for batched inputs and targets
import torch 
from torch.utils.data import Dataset, DataLoader 

In [44]:
class GPTDatasetV1(Dataset):
    def __init__(self, text: str, tokenizer: tiktoken.core.Encoding, context_length: int, stride: int):
        """
        Initializes the GPTDatasetV1 object.

        Args:
            text (str): The input text to be tokenized.
            tokenizer (tiktoken.core.Encoding): The tokenizer to encode the text.
            context_length (int): The length of the context window.
            stride (int): The stride for the sliding window approach.
        """
        self.input_ids = []
        self.target_ids = []
        self.context_length = context_length
        token_ids = tokenizer.encode(text)

        for i in range(0, len(token_ids) - context_length, stride):
            input_chunk = token_ids[i:i+context_length]
            target_chunk = token_ids[i+1:i+context_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
        
    def __len__(self) -> int:
        """
        Returns the number of samples in the dataset.

        Returns:
            int: The number of samples.
        """
        return len(self.input_ids)
    
    def __getitem__(self, idx: int) -> tuple:
        """
        Retrieves the input and target tensors for a given index.

        Args:
            idx (int): The index of the sample to retrieve.

        Returns:
            tuple: A tuple containing the input and target tensors.
        """
        return self.input_ids[idx], self.target_ids[idx]


In [45]:
def create_data_loader(
    text: str,
    tokenizer: tiktoken.core.Encoding,
    context_length: int = 256,
    stride: int = 128,
    batch_size: int = 8,
    shuffle: bool = False,
    drop_last: bool = False,
    num_workers: int = 0,
) -> DataLoader:
    """
    Creates a DataLoader to generate batches with input-target pairs.

    Args:
        text (str): The input text to be tokenized.
        tokenizer (tiktoken.core.Encoding): The tokenizer to encode the text.
        context_length (int, optional): The length of the context window. Defaults to 256.
        stride (int, optional): The stride for the sliding window approach. Defaults to 128.
        batch_size (int, optional): The number of samples per batch. Defaults to 8.
        shuffle (bool, optional): Whether to shuffle the data. Defaults to False.
        drop_last (bool, optional): Whether to drop the last incomplete batch. Defaults to False.
        num_workers (int, optional): The number of subprocesses to use for data loading. Defaults to 0.

    Returns:
        DataLoader: A DataLoader instance for the dataset.
    """
    dataset = GPTDatasetV1(text, tokenizer, context_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

In [77]:
dataloader = create_data_loader(raw_text, tokenizer, context_length=4, stride=1, batch_size=4, shuffle=False, drop_last=True, num_workers=0)

In [78]:
data_iter = iter(dataloader) 
first_batch = next(data_iter)
print("Input shape:", first_batch[0].shape)
print("Inputs and targets:", first_batch)

Input shape: torch.Size([4, 4])
Inputs and targets: [tensor([[  40,  367, 2885, 1464],
        [ 367, 2885, 1464, 1807],
        [2885, 1464, 1807, 3619],
        [1464, 1807, 3619,  402]]), tensor([[ 367, 2885, 1464, 1807],
        [2885, 1464, 1807, 3619],
        [1464, 1807, 3619,  402],
        [1807, 3619,  402,  271]])]


In [79]:
second_batch = next(data_iter)
print("Input shape:", second_batch[0].shape)
print("Inputs and targets:", second_batch)

Input shape: torch.Size([4, 4])
Inputs and targets: [tensor([[ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]]), tensor([[ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])]
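
With `stride=1`, consecutive samples overlap in all but one token, as the two batches above show. The following plain-Python sketch reuses the same loop bounds as `GPTDatasetV1` on a stand-in list of integers to compare an overlapping stride with a non-overlapping one. The function name `sliding_windows` is made up for this example.

```python
def sliding_windows(token_ids, context_length, stride):
    """Yield (input, target) chunks; the target is the input shifted by one."""
    for i in range(0, len(token_ids) - context_length, stride):
        yield token_ids[i:i + context_length], token_ids[i + 1:i + context_length + 1]

toy_ids = list(range(10))  # stand-in for real token IDs

overlapping = list(sliding_windows(toy_ids, context_length=4, stride=1))
disjoint = list(sliding_windows(toy_ids, context_length=4, stride=4))

print(len(overlapping), overlapping[:2])
# 6 [([0, 1, 2, 3], [1, 2, 3, 4]), ([1, 2, 3, 4], [2, 3, 4, 5])]
print(len(disjoint), disjoint)
# 2 [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

A smaller stride yields more training samples from the same text at the cost of heavy overlap between them. Setting the stride equal to the context length uses each token as an input once.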


In [80]:
# Creating token embeddings
import torch.nn as nn

In [81]:
input_ids = torch.tensor([1, 2, 3, 4])
vocab_size = 6 
output_dim = 3 


In [82]:
torch.manual_seed(123)
embedding_layer = nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
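
An embedding layer is essentially a lookup table: each token ID selects one row of the weight matrix. The sketch below reproduces that lookup in plain Python, using the weight values printed above (rounded to four decimals) as a list of lists. It is a stand-in for illustration; in PyTorch, calling `embedding_layer(input_ids)` performs the same row selection and returns a tensor of shape `(4, 3)`.

```python
# Weight matrix copied from the printed output above (rounded to 4 decimals).
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [1, 2, 3, 4]

# An embedding lookup selects one row per token ID.
embedded = [weights[token_id] for token_id in input_ids]
for token_id, vector in zip(input_ids, embedded):
    print(token_id, vector)
```

Token ID 1 maps to the second row, token ID 2 to the third, and so on. Rows 0 and 5 are unused because those IDs do not appear in `input_ids`.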


In [None]:
# Encoding word positions 
import torch.nn.functional as F
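
The token embedding lookup assigns the same vector to a token ID regardless of where it occurs in the sequence. One common way to encode word positions, assumed here for illustration, is a second lookup table with one row per position, added element-wise to the token embeddings (the absolute positional embedding scheme used by GPT-2). The plain-Python sketch below uses random values in place of learned weights, and all names in it are hypothetical. In PyTorch this would typically be a second `nn.Embedding` indexed by `torch.arange(context_length)`.

```python
import random

random.seed(123)
context_length, output_dim, vocab_size = 4, 3, 6

def random_table(rows, cols):
    # Random values stand in for weights that would be learned during training.
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(vocab_size, output_dim)         # one row per token ID
position_table = random_table(context_length, output_dim)  # one row per position

input_ids = [1, 2, 3, 4]

# Input embedding = token embedding + embedding of the token's position.
input_embeddings = [
    [t + p for t, p in zip(token_table[tok], position_table[pos])]
    for pos, tok in enumerate(input_ids)
]
print(len(input_embeddings), len(input_embeddings[0]))  # 4 3
```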
