In the conclusion of my <a href="../tokenizer-problems/tokenizer-problems.html" target="_blank">recent blog post</a>, I disagreed with Andrej Karpathy's claims about the current state of tokenizer availability and tooling. In particular:

- I do not agree that there are no good alternatives to OpenAI's GPT-4 tokenizer and their `tiktoken` library, especially since the <a href="https://ai.meta.com/blog/meta-llama-3/" target="_blank">release of Llama 3</a>.
- Nor do I agree that there are no good tools for quickly training tokenizers from scratch: the `tokenizers` library from Hugging Face supports both training tokenizers and running them at inference time. It is built in Rust, so it is extremely fast, and it has reasonably approachable documentation.

This prompted me to do a deep dive into tokenizers and how one would go about building one from scratch. Inspired by the <a href="https://www.fast.ai/posts/2016-10-08-teaching-philosophy.html#good-education-starts-with-the-whole-game" target="_blank">top-down approach</a> of Jeremy Howard, whose courses I enjoy a lot, this blog post starts with the very basics of tokenizers and then focuses on how to train a Llama 3-like tokenizer on your own data with as little code as possible.

In the second part, we will explore how different proportions of English/non-English/code data influence the final vocabulary learnt by the tokenizer, and discuss a paper on how tokenizer design choices impact the downstream performance of the LLM, so stay tuned! Now, let's jump right in!

# Warm-up: basics of tokenization

I assume that if you found and opened this article, you already have at least a basic understanding of what tokenization is and what it is for, but let's do a quick recap to make sure we are all on the same page.

**Why do we need tokenization in the first place?**

1. The majority of machine learning models operate on numbers, and (large) language models are no exception.
2. Whether it is natural language, code, or something else, we want to be able to input strings into language models.
3. This is where tokenization comes in: it is the **process of converting strings of text into numbers** that can then be fed into language models.

:::{.callout-note}
This last point is the reason why current state-of-the-art language models are not truly end-to-end systems.
:::

**How do tokenizers work on a high level?**

1. First, the long input string is split into smaller chunks called *tokens*.
1. Then, a *vocabulary*, which is a mapping from known tokens to integers, is used to convert the tokens to integer ids.
1. These ids can be fed as input into language models.
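
The three steps above can be sketched in a few lines of plain Python. This is a deliberately naive toy (whitespace splitting and a vocabulary built from the input itself), not how the Llama 3 tokenizer works; the variable names are only illustrative.

```python
# Toy illustration of the three steps; real tokenizers split text far more cleverly.
text = "the cat sat on the mat"

# 1. Split the input string into tokens (here: naive whitespace splitting).
tokens = text.split()

# 2. Build a vocabulary: a mapping from each known token to an integer id.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# 3. Convert the tokens into ids that could be fed into a model.
token_ids = [vocab[token] for token in tokens]

print(tokens)
print(vocab)
print(token_ids)
```

Note that the repeated word `the` maps to the same id both times, which is exactly what the vocabulary lookup is for.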

**What are some qualities that we want our tokenizers to have?**

We want our tokenizers to be able to process various kinds of inputs. In this article, we will not be able to fully dig into the various design choices made when building tokenizers to achieve this, but here is a (likely non-exhaustive) list of qualities that we want the tokenizers to have:

1. They should work on English and non-English text, uppercase and lowercase, and Latin and non-Latin alphabets.
1. They should be able to handle various kinds of special characters that we might find in the wild, such as emojis 🤗.
1. They should be able to handle code.
1. (Optional) They should be able to tokenize unseen words without the need to include a special `<unk>` token in their vocabulary.
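
One common way to satisfy the last point is to fall back to raw bytes: any string can be written as a sequence of UTF-8 bytes, so a vocabulary that contains all 256 byte values never needs an `<unk>` token. The snippet below is a minimal sketch of that idea only; `encode_bytes` and `decode_bytes` are hypothetical helper names, and real byte-level tokenizers additionally merge bytes into larger tokens.

```python
def encode_bytes(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (each in the range 0..255)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes by reassembling the bytes into a string."""
    return bytes(ids).decode("utf-8")


for sample in ["hello", "안녕", "🤗"]:
    ids = encode_bytes(sample)
    # The round trip is lossless for every input, seen or unseen.
    assert decode_bytes(ids) == sample
    print(sample, ids)
```

The price of this guarantee is sequence length: a single emoji already takes four byte-level ids before any merges are applied.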

**How does this look in code?**

Before diving into training a tokenizer, let's illustrate some of the aspects described above in code. We will find that all of the above is very simple to do by using the `tokenizers` library!

In [17]:
#|code-fold: true
#|output: false
%load_ext autoreload
%autoreload 2



In [18]:
from pprint import pprint
from transformers import AutoTokenizer

In [19]:
#|output: false
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
len(dataset["train"])

1801350

In [20]:
#|output: false
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now that we have a tokenizer defined, let's see how we can tokenize a piece of text:

In [21]:
text = "Some text to be tokenized"
tokens = tokenizer.tokenize(text)
tokens

['Some', 'Ġtext', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']

We can see that the string was split into smaller tokens. The `Ġ` is how the `Llama 3` tokenizer represents spaces in its vocabulary, so you can actually think of the tokens as ` text`, ` to`, etc.

:::{.callout-note}
The use of the `Ġ` Unicode character is most likely a historical artifact that started with the GPT-2 implementation. Strong evidence for this is that the `GPT-4` tokenizer has ordinary spaces in its vocabulary. See @sec-next-steps for more.
:::
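
To see where `Ġ` comes from, here is a sketch of the byte-to-character mapping idea used in the GPT-2 encoder: bytes that are already printable keep their own character, and every other byte (including the space) is shifted to a code point starting at 256. The code is rewritten from memory for illustration, so treat the details as an approximation of the original rather than a verbatim copy.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable, non-whitespace characters.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are assigned code points 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # the space byte (32) becomes chr(288), i.e. Ġ
print(mapping[ord("a")])  # printable bytes map to themselves
```

The space is byte 32 and is the 33rd non-printable byte in the loop, so it lands on code point 256 + 32 = 288, which is `Ġ`.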

Next, let's turn these tokens into ids:

In [6]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[8538, 1495, 311, 387, 4037, 1534]

As simple as that! We could now feed these `token_ids` into a language model, though it is important to note that in practice we would perform everything in one step:

In [7]:
inputs = tokenizer(text, return_tensors="pt")
pprint(inputs, sort_dicts=False)  # pprint stands for pretty print

{'input_ids': tensor([[8538, 1495,  311,  387, 4037, 1534]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}


We would then call a Hugging Face-compatible model like so:
```python
# `model` must be defined before this line
outputs = model(**inputs)
```

:::{.callout-note}
For the sake of completeness, it is worth noting that one would typically tokenize a single string on its own only during inference. During training, one would either tokenize the whole dataset in batches beforehand or tokenize each batch of data on the fly. The former is usually more reliable and makes it easier to restart training if a crash occurs (see Thomas Wolf's <a href="https://www.youtube.com/watch?v=2-SPH9hIKT8" target="_blank">video</a>).
:::

# Training a Llama 3 tokenizer

In [8]:
# Define an iterator over the training split of the dataset
def batch_iterator(dataset, batch_size=1000, verbose=False):
    if verbose:
        print(f"Dataset size: {len(dataset)}")

    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


In [9]:
# Retrain the tokenizer on our data, keeping the original vocabulary size
new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(dataset["train"], verbose=True),
    vocab_size=len(tokenizer.get_vocab()),
)

print(f"Vocab length={len(new_tokenizer.get_vocab())}")
new_tokenizer.save_pretrained("new-llama-tokenizer-english-only");

Dataset size: 1801350



Vocab length=128256
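
For intuition about what such a training run does: the Llama 3 tokenizer is BPE-based, so retraining it boils down to repeatedly merging the most frequent pair of adjacent symbols. The following is a minimal, character-level sketch of that idea in plain Python. It is a toy under simplifying assumptions (whitespace pre-splitting, no byte-level handling, no special tokens), not the Rust implementation in `tokenizers`; `train_bpe` is a hypothetical helper name.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a list of strings."""
    # Represent every word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is a single symbol, nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


toy_merges = train_bpe(["low low low lower lowest", "new newer newest"], num_merges=3)
print(toy_merges)
```

Each learned merge adds one token to the vocabulary, which is why the vocabulary size is the main knob of the training call.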


Let's now create some test data which we will use as a proxy for how efficient tokenizers trained with different data are in tokenizing various kinds of texts. We will have an English phrase, the same phrase in Korean, and a piece of code in our test set:

In [10]:
code_string = """for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)"""

test_strings = {
    "english": "Nice to meet you, I'm ChatGPT, a large-scale language model developed by OpenAI. If you have any questions, feel free to ask.",
    "korean": "만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요.",
    "code": code_string,
}

In [14]:
from collections import defaultdict


results = defaultdict(dict)

for language, test_str in test_strings.items():
    results["english_only"][language] = new_tokenizer(test_str, return_tensors="pt")["input_ids"].shape[1]


pprint(dict(results))

{'english_only': {'code': 129, 'english': 36, 'korean': 136}}


## Add some Korean and code data

In [12]:
from datasets import concatenate_datasets
import numpy as np

korean_dataset = load_dataset("lcw99/wikipedia-korean-20221001")
code_dataset = load_dataset("code_search_net", "python", trust_remote_code=True)
code_dataset = code_dataset.rename_column("whole_func_string", "text")  # Rename whole_func_string to text
print(len(korean_dataset["train"]), len(code_dataset["train"]))

# Rough estimate of each dataset's share of whitespace-separated words
# (order: English, Korean, code). Result: [0.38661505 0.45621601 0.15716893]
# proportions = np.array(
#     [sum(len(x.split()) for x in ds["train"]["text"]) for ds in [dataset, korean_dataset, code_dataset]]
# )
# print(proportions / proportions.sum())

final_dataset = concatenate_datasets(
    [dataset["train"], korean_dataset["train"], code_dataset["train"]]
)
final_dataset = final_dataset.shuffle(seed=42)
len(final_dataset)

607256 412178


2820784

In [13]:
# Training was run once and the resulting tokenizer was saved to disk,
# so the code is commented out and the tokenizer is loaded in the next cell.
# new_tokenizer_2 = tokenizer.train_new_from_iterator(
#     batch_iterator(final_dataset, verbose=True),
#     len(tokenizer.get_vocab()),
# )

# print(f"Vocab length={len(new_tokenizer_2.get_vocab())}")
# new_tokenizer_2.save_pretrained("new-llama-tokenizer-all");

Dataset size: 2820784


In [12]:
new_tokenizer_2 = AutoTokenizer.from_pretrained("new-llama-tokenizer-all")

In [15]:
for language, test_str in test_strings.items():
    results["all"][language] = new_tokenizer_2(test_str, return_tensors="pt")["input_ids"].shape[1]


pprint(dict(results))

{'all': {'code': 75, 'english': 34, 'korean': 30},
 'english_only': {'code': 129, 'english': 36, 'korean': 136}}
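
To make the comparison easier to read, the counts printed above can be turned into relative reductions. The snippet below copies the numbers from the output; the names `token_counts` and `reductions` are only illustrative.

```python
# Token counts copied from the output above.
token_counts = {
    "all": {"code": 75, "english": 34, "korean": 30},
    "english_only": {"code": 129, "english": 36, "korean": 136},
}

reductions = {}
for language in sorted(token_counts["all"]):
    before = token_counts["english_only"][language]
    after = token_counts["all"][language]
    # Percentage of tokens saved by training on the mixed dataset.
    reductions[language] = 100 * (before - after) / before
    print(f"{language}: {before} -> {after} tokens ({reductions[language]:.1f}% fewer)")
```

With these numbers, the Korean phrase needs roughly 78% fewer tokens, the code snippet about 42% fewer, and the English phrase about 6% fewer.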


For those who would like to dig deeper into training their own tokenizers in a similar setting (English + Code + Multilingual) and into which metrics to follow, I suggest the following paper: <a href="https://arxiv.org/pdf/2402.01035.pdf" target="_blank">Getting the most out of your tokenizer for pre-training and domain adaptation</a>.

# Final notes

1. No normalization is used, so that the tokenization process remains reversible:

In [16]:
tokenizer._tokenizer.normalizer is None

True
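
To see why a normalizer would get in the way of reversibility, here is a small sketch using Python's standard `unicodedata` module. NFKC is picked purely as an illustration of a lossy normalization; it is not taken from the Llama 3 configuration.

```python
import unicodedata

# "\ufb01" is the single-character ligature "ﬁ", "\u2460" is the circled digit "①".
original = "\ufb01ne \u2460"
normalized = unicodedata.normalize("NFKC", original)

print(repr(original))
print(repr(normalized))

# Two different inputs collapse to the same normalized text,
# so the original string can no longer be recovered after decoding.
print(normalized == unicodedata.normalize("NFKC", "fine 1"))
```

Once two distinct inputs map to the same text before tokenization, no decoder can tell them apart, which is the information loss that skipping normalization avoids.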

# Next steps (for the curious reader) {#sec-next-steps}

Here are some questions that arose while writing this article and that I could not find simple answers to. I hope to come back to them at some point in the future, but if you end up going down these rabbit holes yourself, I would be curious to hear what you find! Here is the list:

1. Why are spaces replaced with the `Ġ` special character in BPE-based tokenizers? Thomas Wolf <a href="https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475" target="_blank">argues</a> that this was done to avoid digesting spaces, which are used in the standard BPE algorithm. However, it is unclear whether we could get by without doing this nowadays. Answering this question would probably require diving deeper into how the BPE algorithm works. Sounds like a fun direction to explore in the future!
    - Another source on the `Ġ` character is the <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9" target="_blank">GPT-2 encoder implementation</a>.
1. The <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/raw/main/tokenizer.json" target="_blank">`Llama-3-8B-Instruct`</a> and <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct/raw/main/tokenizer.json" target="_blank">`Llama-3-70B-Instruct`</a> tokenizers are almost identical apart from one setting: the former has the `model.ignore_merges` key set to `true`, while the latter does not specify this key at all. The question is simple: what difference does this make? A good starting point for answering it could be <a href="https://github.com/huggingface/tokenizers/blob/71c2a8d01a56cd7bd28148c309e210c47dac78e7/tokenizers/src/models/bpe/model.rs#L466" target="_blank">this</a> piece of code.
1. What are some metrics to track to evaluate the quality of a trained tokenizer?
1. The regex used for splitting is not ideal, for example with respect to the Unicode apostrophe.
1. The original BPE implementation, <a href="https://github.com/rsennrich/subword-nmt" target="_blank">subword-nmt</a>, and the accompanying <a href="https://aclanthology.org/P16-1162.pdf" target="_blank">paper</a>.
1. A fast BPE implementation: <a href="https://github.com/Yikai-Liao/efficient_bpe" target="_blank">efficient_bpe</a>.
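
The apostrophe issue from the list above can be reproduced with the standard `re` module. The pattern below is only the contraction fragment as I understand it to appear at the start of GPT-4-style split patterns; treat it as an assumption and check the `tokenizer.json` of the model you care about for the full expression.

```python
import re

# Assumed contraction fragment of the split pattern (ASCII apostrophe only).
contractions = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")

ascii_text = "I'm here"
unicode_text = "I\u2019m here"  # typographic apostrophe U+2019

ascii_matches = contractions.findall(ascii_text)
unicode_matches = contractions.findall(unicode_text)

print(ascii_matches)    # the contraction is recognised
print(unicode_matches)  # no match for the typographic apostrophe
```

Text copied from word processors or web pages often uses the typographic apostrophe, so such contractions would be split differently from their ASCII counterparts.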

# Conclusion



# Sources

1. Let's build the GPT Tokenizer (<a href="https://www.youtube.com/watch?v=zduSFxRajkE" target="_blank">video</a>) by Andrej Karpathy.
1. A little guide to building Large Language Models in 2024 (<a href="https://www.youtube.com/watch?v=2-SPH9hIKT8" target="_blank">video</a>) by Thomas Wolf.
1. Training a new tokenizer from an old one (<a href="https://huggingface.co/learn/nlp-course/en/chapter6/2" target="_blank">article</a>) on the Hugging Face course.
1. Another Implementation (faster and more effecient) of BPE Training Algorithm (<a href="https://github.com/huggingface/tokenizers/issues/1400" target="_blank">GitHub issue</a>)
1. Getting the most out of your tokenizer for pre-training and domain adaptation (<a href="https://arxiv.org/pdf/2402.01035.pdf" target="_blank">arXiv PDF</a>)