# Byte-Pair Encoding Tokenization

## 1. Training Algorithm

BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let's say our corpus uses these five words:



<pre>
"hug", "pug", "pun", "bun", "hugs"
</pre>

The base vocabulary will then be `["b", "g", "h", "n", "p", "s", "u"]`. For real-world cases, that base vocabulary will contain all the ASCII characters, at the very least, and probably some Unicode characters as well. If an example you are tokenizing uses a character that is not in the training corpus, that character will be converted to the unknown token. That's one reason why lots of NLP models are very bad at analyzing content with emojis, for instance.
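As a minimal sketch of this step, the base vocabulary of the toy corpus can be computed by collecting every distinct character. The variable names are illustrative and not part of any library:

```python
# Toy corpus from the example above.
words = ["hug", "pug", "pun", "bun", "hugs"]

# Collect every distinct character, then sort for a stable order.
base_vocab = sorted({char for word in words for char in word})
print(base_vocab)
```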



The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don't look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called `byte-level BPE`.
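To see why 256 base symbols are enough, here is a small sketch in plain Python (no tokenizer library involved) that encodes text as UTF-8 bytes. The helper name `to_byte_symbols` is made up for this illustration. Every character, however unusual, decomposes into byte values between 0 and 255:

```python
def to_byte_symbols(text):
    # Encode as UTF-8 and return each byte as an integer in range(256).
    return list(text.encode("utf-8"))

for sample in ["hug", "é", "🤗"]:
    symbols = to_byte_symbols(sample)
    # No byte can fall outside the 256-entry base vocabulary.
    assert all(0 <= b < 256 for b in symbols)
    print(repr(sample), symbols)
```

GPT-2's tokenizer additionally maps each byte to a printable character before merges are learned; this sketch skips that step and works with the raw byte values.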



After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.



At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by "pair," here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.



Going back to our previous example, let's assume the words had the following frequencies:



<pre>
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
</pre>

meaning `"hug"` was present 10 times in the corpus, `"pug"` 5 times, `"pun"` 12 times, `"bun"` 4 times, and `"hugs"` 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:



<pre>
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
</pre>

Then we look at pairs. The pair `("h", "u")` is present in the words `"hug"` and `"hugs"`, so 15 times total in the corpus. It's not the most frequent pair, though: that honor belongs to `("u", "g")`, which is present in `"hug"`, `"pug"`, and `"hugs"`, for a grand total of 20 times in the corpus.
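The counting described above can be reproduced with a short sketch. It uses `collections.Counter` from the standard library; the dictionary layout (token tuples mapped to word frequencies) is just one convenient choice for this illustration:

```python
from collections import Counter

# Each word as a tuple of tokens, mapped to its frequency in the toy corpus.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

# Count each pair of consecutive tokens, weighted by the word frequency.
pair_freqs = Counter()
for tokens, freq in corpus.items():
    for left, right in zip(tokens, tokens[1:]):
        pair_freqs[(left, right)] += freq

best_pair, best_freq = pair_freqs.most_common(1)[0]
print(pair_freqs[("h", "u")])  # frequency of ("h", "u")
print(best_pair, best_freq)    # most frequent pair and its count
```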



Thus, the first merge rule learned by the tokenizer is `("u", "g") -> "ug"`, which means that `"ug"` will be added to the vocabulary, and the pair should be merged in all the words of the corpus. At the end of this stage, the vocabulary and corpus look like this:



<pre>
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
</pre>

And we continue like this until we reach the desired vocabulary size.
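To make "continue like this" concrete, here is a minimal sketch of the whole loop on the toy corpus. The function name `train_bpe` and its `num_merges` parameter are made up for this illustration, and ties are broken by first occurrence, which is a choice of this sketch rather than a rule of the algorithm:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Start with every word split into single characters.
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count consecutive token pairs, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            tokens = splits[word]
            for pair in zip(tokens, tokens[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break  # every word is already a single token
        # max() keeps the first pair that reaches the highest count.
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Apply the new merge rule to every word.
        for word, tokens in splits.items():
            merged, i = [], 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                    merged.append(best[0] + best[1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            splits[word] = merged
    return vocab, merges, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
print(splits)
```

With three merges, the rules learned are `("u", "g")`, `("u", "n")`, and `("h", "ug")`, which brings the vocabulary to ten tokens.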



## 2. Tokenization Algorithm

Tokenization follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:

1. Normalization

2. Pre-tokenization

3. Splitting the words into individual characters

4. Applying the merge rules learned in order on those splits
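Steps 3 and 4 can be sketched for a single word as follows. The sketch assumes three merge rules learned from the toy corpus of Section 1, namely `("u", "g")`, `("u", "n")`, and `("h", "ug")`, and uses `"[UNK]"` as a placeholder name for the unknown token. Both the helper `tokenize_word` and that token name are assumptions of this example:

```python
def tokenize_word(word, merges, base_vocab, unk="[UNK]"):
    # Step 3: split into characters, replacing unseen characters with the unknown token.
    symbols = [ch if ch in base_vocab else unk for ch in word]
    # Step 4: apply each merge rule in the order it was learned.
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

base_vocab = {"b", "g", "h", "n", "p", "s", "u"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

for word in ["bug", "hugs", "mug"]:
    print(word, tokenize_word(word, merges, base_vocab))
```

Here `"bug"` becomes `["b", "ug"]`, while `"mug"` becomes `["[UNK]", "ug"]` because `"m"` is not in the base vocabulary.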

## 3. Implementing BPE

Now let's take a look at an implementation of the BPE algorithm. This won't be an optimized version you can actually use on a big corpus; we just want to show you the code so you can understand the algorithm a little bit better.



First we need a corpus, so let's create a simple one with a few sentences:



In [2]:
corpus = [
    "This is the corpus that we use for implementing algorithms.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer schemes.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

Next, we need to pre-tokenize that corpus into words. Since we are replicating a BPE tokenizer (like GPT-2), we will use the `gpt2` tokenizer for the pre-tokenization:



In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:



In [6]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'Ġcorpus': 1, 'Ġthat': 1, 'Ġwe': 1, 'Ġuse': 1, 'Ġfor': 1, 'Ġimplementing': 1, 'Ġalgorithms': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġschemes': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})
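If the `transformers` library is not at hand, the word counts for this particular corpus can be approximated with the standard library alone. The regular expression below is a simplified stand-in written for this sketch, not the pattern GPT-2 actually uses, and `simple_pre_tokenize` is a made-up helper name. It splits text into runs of letters, digits, or other symbols, each optionally preceded by one space, and shows that space as `Ġ`:

```python
import re
from collections import Counter

# Simplified pre-tokenization pattern (an assumption of this sketch).
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def simple_pre_tokenize(text):
    # Display a leading space as "Ġ", as in the output shown above.
    return [piece.replace(" ", "Ġ") for piece in PATTERN.findall(text)]

corpus = [
    "This is the corpus that we use for implementing algorithms.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer schemes.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

word_freqs = Counter(word for text in corpus for word in simple_pre_tokenize(text))
print(word_freqs)
```

This only handles plain ASCII text reliably; for anything else, the real pre-tokenizer should be preferred.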


The next step is to compute the base vocabulary, formed by all the characters used in the corpus:



In [7]:
alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)

alphabet.sort()
print(alphabet)

[',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']


We also add the special tokens used by the model at the beginning of that vocabulary. In the case of GPT-2, the only special token is `"<|endoftext|>"`:



In [8]:
vocab = ["<|endoftext|>"] + alphabet.copy()

We now need to split each word into individual characters, to be able to start training:



In [9]:
splits = {word: [c for c in word] for word in word_freqs.keys()}
splits

{'This': ['T', 'h', 'i', 's'],
 'Ġis': ['Ġ', 'i', 's'],
 'Ġthe': ['Ġ', 't', 'h', 'e'],
 'Ġcorpus': ['Ġ', 'c', 'o', 'r', 'p', 'u', 's'],
 'Ġthat': ['Ġ', 't', 'h', 'a', 't'],
 'Ġwe': ['Ġ', 'w', 'e'],
 'Ġuse': ['Ġ', 'u', 's', 'e'],
 'Ġfor': ['Ġ', 'f', 'o', 'r'],
 'Ġimplementing': ['Ġ',
  'i',
  'm',
  'p',
  'l',
  'e',
  'm',
  'e',
  'n',
  't',
  'i',
  'n',
  'g'],
 'Ġalgorithms': ['Ġ', 'a', 'l', 'g', 'o', 'r', 'i', 't', 'h', 'm', 's'],
 '.': ['.'],
 'Ġchapter': ['Ġ', 'c', 'h', 'a', 'p', 't', 'e', 'r'],
 'Ġabout': ['Ġ', 'a', 'b', 'o', 'u', 't'],
 'Ġtokenization': ['Ġ',
  't',
  'o',
  'k',
  'e',
  'n',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n'],
 'Ġsection': ['Ġ', 's', 'e', 'c', 't', 'i', 'o', 'n'],
 'Ġshows': ['Ġ', 's', 'h', 'o', 'w', 's'],
 'Ġseveral': ['Ġ', 's', 'e', 'v', 'e', 'r', 'a', 'l'],
 'Ġtokenizer': ['Ġ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r'],
 'Ġschemes': ['Ġ', 's', 'c', 'h', 'e', 'm', 'e', 's'],
 'Hopefully': ['H', 'o', 'p',

Now that we are ready for training, let's write a function that computes the frequency of each pair. We'll need to use this at each step of the training:



In [16]:
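def compute_pair_freqs(splits):
    # Count each pair of consecutive tokens, weighted by the frequency of its word.
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        # A word that is already a single token contains no pair.
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs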

Let's have a look at a part of this dictionary after the initial splits:



In [17]:
pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i >= 5:
        break

('T', 'h'): 3
('h', 'i'): 3
('i', 's'): 5
('Ġ', 'i'): 3
('Ġ', 't'): 8
('t', 'h'): 4


Now, finding the most frequent pair only takes a quick loop:



In [19]:
best_pair = ""
max_freq = None

for key, value in pair_freqs.items():
    if max_freq is None or max_freq < value:
        max_freq = value
        best_pair = key

print(f"Best pair: {best_pair}\nMaximum frequency: {max_freq}")

Best pair: ('Ġ', 't')
Maximum frequency: 8


So the first merge to learn is `('Ġ', 't') -> 'Ġt'`, and we add `'Ġt'` to the vocabulary:



In [20]:
merges = {("Ġ", "t"): "Ġt"}
vocab.append("Ġt")

To continue, we need to apply that merge in our `splits` dictionary. Let's write another function for this:



In [21]:
def merge_pair(a, b, splits):
    # Replace every consecutive occurrence of (a, b) with the merged token a + b.
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                # Merge the two tokens into one.
                split = split[:i] + [a + b] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return splits


And we can have a look at the result of the first merge:



In [22]:
splits = merge_pair("Ġ", "t", splits)
print(splits["Ġtrained"])

['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']


Now we have everything we need to loop until we have learned all the merges we want. Let's aim for a vocab size of 50:



In [26]:
vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    # Select the most frequent pair; ties go to the first pair encountered.
    best_pair = ""
    max_freq = None
    for key, value in pair_freqs.items():
        if max_freq is None or max_freq < value:
            max_freq = value
            best_pair = key
    # Apply the merge to every word, then record the rule and the new token.
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

print(merges)
print(len(merges))

{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'n'): 'en', ('Ġ', 'a'): 'Ġa', ('e', 'r'): 'er', ('Ġt', 'o'): 'Ġto', ('Ġ', 's'): 'Ġs', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('Ġt', 'h'): 'Ġth', ('o', 'r'): 'or', ('a', 't'): 'at', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġth', 'e'): 'Ġthe', ('Ġ', 'c'): 'Ġc', ('u', 's'): 'us', ('Ġ', 'w'): 'Ġw', ('l', 'e'): 'le'}
21


In [27]:
print(vocab)

['<|endoftext|>', ',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'is', 'en', 'Ġa', 'er', 'Ġto', 'Ġs', 'Th', 'This', 'Ġth', 'or', 'at', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġthe', 'Ġc', 'us', 'Ġw', 'le']


💡 Using `train_new_from_iterator()` on the same corpus won't result in the exact same vocabulary. This is because when several pairs are tied for the highest frequency, we select the first one encountered, while the 🤗 Tokenizers library selects the first one based on its inner IDs.
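The effect of the tie-breaking rule is easy to see on a small, made-up set of pair counts. The sketch below compares the "first one encountered" rule used in this chapter with a lexicographic rule. The second rule is only an illustration of a different valid choice; it is not meant to describe how any particular library breaks ties:

```python
# Hypothetical pair counts with a tie for the highest frequency.
pair_freqs = {("x", "y"): 2, ("c", "d"): 5, ("b", "c"): 5}

# Rule used in this chapter: keep the first pair that reaches the maximum.
# max() returns the first maximal key in dictionary (insertion) order.
first_seen = max(pair_freqs, key=pair_freqs.get)

# Alternative rule (illustrative only): among tied pairs, take the
# lexicographically smallest one.
lexicographic = min(pair_freqs, key=lambda pair: (-pair_freqs[pair], pair))

print(first_seen)
print(lexicographic)
```

Both rules pick a pair with frequency 5, but not the same one, so the merges learned afterwards can diverge.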



To tokenize a new text, we pre-tokenize it, split it, then apply all the merge rules learned:



In [36]:
def tokenize(text):
    pre_tokenized_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    # pre_tokenize_str returns (word, offsets) pairs; only the words are needed here.
    pre_tokenized_text = [word for word, offset in pre_tokenized_result]
    splits = [[c for c in word] for word in pre_tokenized_text]

    # Apply the merge rules in the order they were learned.
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    # Flatten the per-word token lists into a single list.
    return sum(splits, [])

We can try this on any text composed of characters in the alphabet:



In [37]:
tokenize("This is not a token.")


['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']
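Because `tokenize` only applies merges, a character that never appeared in the training corpus passes through as a single-character token that is not in `vocab`. A small check like the following can flag such tokens. The helper name `find_unknown_tokens`, the shortened vocabulary, and the token list are hypothetical and chosen only for illustration:

```python
def find_unknown_tokens(tokens, vocab):
    # Return the tokens missing from the vocabulary, in order, without duplicates.
    known = set(vocab)
    unknown = []
    for token in tokens:
        if token not in known and token not in unknown:
            unknown.append(token)
    return unknown

# A shortened, hypothetical vocabulary and token list.
vocab = ["<|endoftext|>", ".", "T", "h", "i", "s", "Ġ", "This", "Ġis"]
tokens = ["This", "Ġis", "Ġ", "X", "!", "."]

print(find_unknown_tokens(tokens, vocab))
```

What to do with the flagged tokens (map them to an unknown token, raise an error, or fall back to bytes) is a design decision this sketch leaves open.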