# Working with text data

Here we will see how to prepare input text for training LLMs. This involves splitting the text into individual word and subword tokens, which can be encoded into vector representations for the LLM.

## 2.1 Understanding word embeddings

Deep neural network models, including LLMs, cannot process raw text directly. Therefore, we need a way to represent words as continuous-valued vectors.

The concept of converting data into a vector format is often referred to as embedding.

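Word embeddings can have varying dimensions, from a single dimension to several thousand. A higher dimensionality can capture more nuanced relationships between words, at the cost of more computation.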
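
As a minimal illustration, an embedding can be thought of as a lookup table from a word to a vector. The sketch below uses a hypothetical three-word vocabulary and random 4-dimensional vectors. In a real LLM these values are learned during training, and the names `toy_vocab`, `embedding_dim`, and `embed` are assumptions made for this example.

```python
import random

random.seed(123)  # fixed seed so the toy vectors are reproducible

toy_vocab = ["cat", "dog", "tea"]   # hypothetical vocabulary
embedding_dim = 4                   # hypothetical embedding dimension

# One vector of `embedding_dim` floats per word (random here, learned in practice).
embedding_table = {
    word: [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for word in toy_vocab
}

def embed(word):
    """Look up the vector for a word in the toy table."""
    return embedding_table[word]

vector = embed("tea")
print(len(vector))  # 4
```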

## 2.2 Tokenizing text

Let's see how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM.

We start with a simple text and Python's `re.split` function to split the text while keeping the delimiters:

In [1]:
import re

text = "Hello, world. Is this-- a test?"
# The capturing group (...) keeps the delimiters in the result. It has three
# alternatives: a single punctuation character [...], a double dash --, or whitespace \s.
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
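
To see why the capturing group matters, the following sketch compares `re.split` with and without it on a short hypothetical sample string, using a reduced pattern for brevity.

```python
import re

sample = "Hello, world."

# Without a capturing group, the delimiters are dropped from the result.
without_group = re.split(r'[,.]|\s', sample)
print(without_group)  # ['Hello', '', 'world', '']

# With a capturing group (...), the delimiters are returned as list items.
with_group = re.split(r'([,.]|\s)', sample)
print(with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']

# Stripping and filtering removes the empty and whitespace-only items.
tokens = [item.strip() for item in with_group if item.strip()]
print(tokens)         # ['Hello', ',', 'world', '.']
```

The empty strings appear wherever two delimiters are adjacent or a delimiter ends the string, which is why the tokenizer filters items with `item.strip()`.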


In [2]:
import os
import requests

file_path = "the-verdict.txt"

if not os.path.exists(file_path):
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt"
    )
    response = requests.get(url)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)

In [3]:
with open(file_path, "r", encoding='utf-8') as f:
    raw_text = f.read()
print(f'Total number of characters: {len(raw_text)}')
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Now let's apply our basic tokenizer to the main text:

In [4]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(f'Number of tokens in the text: {len(preprocessed)}')
print(f'Number of unique tokens in the text: {len(set(preprocessed))}')
print(f'First 30 tokens in the text:\n{preprocessed[:30]}')

Number of tokens in the text: 4690
Number of unique tokens in the text: 1130
First 30 tokens in the text:
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting tokens into IDs

Next, let's convert these tokens from Python strings into integers to produce the token IDs.

To do this we need to build a vocabulary. This defines how we map each unique token to a unique integer.

In [5]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f'Vocabulary size: {vocab_size}')

Vocabulary size: 1130


In [6]:
str_to_int = {token: i for i, token in enumerate(all_words)}    # map each token to its index in the sorted vocabulary
for i, item in enumerate(str_to_int.items()):
    print(f'{i}: {item}')
    if i >= 50:
        break

0: ('!', 0)
1: ('"', 1)
2: ("'", 2)
3: ('(', 3)
4: (')', 4)
5: (',', 5)
6: ('--', 6)
7: ('.', 7)
8: (':', 8)
9: (';', 9)
10: ('?', 10)
11: ('A', 11)
12: ('Ah', 12)
13: ('Among', 13)
14: ('And', 14)
15: ('Are', 15)
16: ('Arrt', 16)
17: ('As', 17)
18: ('At', 18)
19: ('Be', 19)
20: ('Begin', 20)
21: ('Burlington', 21)
22: ('But', 22)
23: ('By', 23)
24: ('Carlo', 24)
25: ('Chicago', 25)
26: ('Claude', 26)
27: ('Come', 27)
28: ('Croft', 28)
29: ('Destroyed', 29)
30: ('Devonshire', 30)
31: ('Don', 31)
32: ('Dubarry', 32)
33: ('Emperors', 33)
34: ('Florence', 34)
35: ('For', 35)
36: ('Gallery', 36)
37: ('Gideon', 37)
38: ('Gisburn', 38)
39: ('Gisburns', 39)
40: ('Grafton', 40)
41: ('Greek', 41)
42: ('Grindle', 42)
43: ('Grindles', 43)
44: ('HAD', 44)
45: ('Had', 45)
46: ('Hang', 46)
47: ('Has', 47)
48: ('He', 48)
49: ('Her', 49)
50: ('Hermia', 50)


We also need a way to turn token IDs back into text. For this, we create an inverse version of the vocabulary that maps token IDs to text tokens:

In [7]:
int_to_str = {i: s for s, i in str_to_int.items()}
print(int_to_str[50])

Hermia


We are now ready to implement a complete tokenizer class in Python with the following features:
- an `encode` method that splits text into tokens and carries out the string-to-integer mapping.
- unknown words that are not part of the vocabulary must be mapped to the special token `<|unk|>`.
- independent text sources must be separated by the special token `<|endoftext|>`.
- a `decode` method that carries out the reverse integer-to-string mapping to convert token IDs back to text.

Let's start by extending our vocabulary with the new tokens `<|unk|>` and `<|endoftext|>`:

In [19]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(['<|unk|>', '<|endoftext|>'])
str_to_int = {s: i for i, s in enumerate(all_tokens)}
print(f'Vocabulary size: {len(str_to_int)}')

Vocabulary size: 1132


In [27]:
for s, i in list(str_to_int.items())[-5:]:
    print(f'{i}: {s}')

1127: younger
1128: your
1129: yourself
1130: <|unk|>
1131: <|endoftext|>


We confirm that the two new special tokens were successfully incorporated into the vocabulary.

In [21]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else '<|unk|>' for item in preprocessed]    # map out-of-vocabulary tokens to <|unk|>
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = [self.int_to_str[i] for i in ids]
        text = ' '.join(text)
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)    # remove extra spaces before punctuation.
        return text
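
The `re.sub` call in `decode` can be examined in isolation. The sketch below applies the same pattern to a short, hypothetical token list to show which spaces are removed and which remain.

```python
import re

# Hypothetical tokens, as they would come out of the int_to_str lookup.
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '.']

joined = ' '.join(tokens)
print(joined)   # " It ' s the last , you know .

# Remove whitespace that directly precedes one of the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)  # " It' s the last, you know.
```

The pattern only removes spaces before punctuation, so the space after an opening quote or an apostrophe is kept.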


In [22]:
tokenizer = SimpleTokenizer(vocab=str_to_int)
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Now let's try to convert these token IDs back into text using the decode method:

In [23]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


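The round trip recovers every word, although the spacing around the opening quote and the apostrophe differs from the original (`" It' s`). Next, let's apply the tokenizer to a new text sample that is not contained in the training set: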

In [29]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))

[1130, 5, 355, 1126, 628, 975, 10]
<|unk|>, do you like tea?


The first word, `Hello`, is not in the vocabulary, so it was mapped to token ID 1130 and decoded back to `<|unk|>`.

Let's now try to combine two independent texts:

In [30]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [31]:
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))

[1130, 5, 355, 1126, 628, 975, 10, 1131, 55, 988, 956, 984, 722, 988, 1130, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


## 2.5 Byte pair encoding

Byte pair encoding (BPE) is the tokenization scheme that was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

Let's first look at an existing implementation from the tiktoken library:

In [10]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

In [11]:
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
strings = tokenizer.decode(integers)
print(strings)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The encoding and decoding look good.

Specifically, we see that the BPE tokenizer managed to encode and decode unknown words such as `someunknownPlace` correctly.

The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

Let's take a closer look:

In [12]:
tokens = tokenizer.encode('someunknownPlace')
print(tokens)

[11246, 34680, 27271]


In [13]:
for token in tokens:
    print(token, tokenizer.decode([token]))

11246 some
34680 unknown
27271 Place


How does it do this? BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.
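
As a rough illustration of this merging process, the sketch below trains a character-level BPE on a hypothetical four-word corpus. It is a simplified toy, not the GPT-2 implementation in tiktoken, which operates on bytes and uses a much larger merge table learned in advance. The corpus, the number of merges, and the helper names `get_pair_counts` and `merge_pair` are assumptions made for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Start with every word split into individual characters.
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):  # hypothetical number of merge steps
    pair_counts = get_pair_counts(words)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(best, words)

print(merges)       # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(list(words))  # [('low',), ('low', 'e', 'r'), ('n', 'e', 'w', 'est'), ('w', 'i', 'd', 'est')]
```

Each merge adds one new symbol to the vocabulary. In this toy corpus, the frequent suffix `est` and the whole word `low` emerge after only four merges.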

In [None]:
tokenizer.decode([11246])