In the conclusion of my recent blog post, I disagreed with Andrej Karpathy's claims about the current state of tokenizer availability and tooling. In particular:

- I do not agree that there are no good alternatives to OpenAI's GPT-4 tokenizer and their `tiktoken` library, especially since the <a href="https://ai.meta.com/blog/meta-llama-3/" target="_blank">release of Llama 3</a>.
- Nor do I agree that there are no good tools for quickly training tokenizers from scratch: the `tokenizers` library from Hugging Face supports both training tokenizers and running them in inference mode. It is built in Rust, so it is extremely fast, and has reasonably approachable documentation.

This prompted me to do a deep dive into tokenizers and how one would go about building one from scratch. Inspired by the <a href="https://www.fast.ai/posts/2016-10-08-teaching-philosophy.html#good-education-starts-with-the-whole-game" target="_blank">top-down approach</a> from Jeremy Howard, whose courses I used to enjoy a lot, this blog post will focus on the very basics of tokenizers and on how to train a Llama 3-like tokenizer on your own data with as little code as possible. We will also explore how different proportions of English, non-English, and code data influence the final vocabulary learnt by the tokenizer.

In the second part, we will dig into the internals of how tokenizers work and will replicate the Llama tokenizer from scratch using the Hugging Face `tokenizers` library, so stay tuned! But now, let's jump right in!

# Warm-up: basics of tokenization

I am mostly assuming that if you found and opened this article, you have at least a basic understanding of what tokenization is and what its purpose is, but let's do a quick revision just to make sure we are all on the same page.

**Why do we need tokenization in the first place?**

1. The majority of machine learning models operate on numbers, and (large) language models are no exception.
2. Whether it is natural language, code, or something else, we want to be able to input strings into the language models.
3. This is where tokenization comes in: it is the **process of converting strings of text into numbers** that can then be fed into language models.

(This last point is the reason why current state-of-the-art language models are not truly end-to-end systems.)

**How do tokenizers work on a high level?**

First, the long input string is split into smaller chunks called *tokens*. Then, tokenizers use their *vocabulary*, which is a mapping from known tokens to integers, to convert the tokens to integer ids. These ids can be fed as input into language models.
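
The two steps above can be sketched in a few lines of plain Python. This is a toy illustration only: it splits on spaces and builds its vocabulary from the input itself, whereas real tokenizers learn both their splitting rules and their vocabulary from a large corpus.

```python
# Toy illustration only: real tokenizers learn their vocabulary from data.
text = "some text to be tokenized"

# Step 1: split the string into tokens (here: naive whitespace splitting).
tokens = text.split(" ")

# Step 2: the vocabulary maps each known token to an integer id.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Step 3: look up every token to get the ids a model would consume.
token_ids = [vocab[token] for token in tokens]

# The inverse mapping lets us go back from ids to text.
id_to_token = {idx: token for token, idx in vocab.items()}
decoded = " ".join(id_to_token[idx] for idx in token_ids)

print(tokens)
print(token_ids)
print(decoded)
```

A whitespace tokenizer like this one breaks down as soon as it meets a word that is not in `vocab`, which is one of the problems the qualities listed in the next section are meant to address.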

**What are some qualities that we want our tokenizers to have?**

We want our tokenizers to be able to process various kinds of inputs. In this article, we will not be able to fully dig into the various design choices made when building tokenizers to achieve this, but here is a (likely non-exhaustive) list of qualities that we want the tokenizers to have:

1. They should work on both English and non-English texts, both uppercase and lowercase, both Latin and non-Latin alphabets.
1. They should be able to handle various kinds of special characters that we might find in the wild, such as emojis 🤗.
1. They should be able to handle code.
1. (Optional) They should be able to tokenize unseen words without the need to include a special `<unk>` token in their vocabulary.
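
One common way to satisfy the last point is to fall back to raw bytes: any string can be encoded as UTF-8, so a vocabulary that covers all 256 byte values can represent any input without an `<unk>` token. Below is a minimal sketch of that idea in plain Python. The helper names `to_byte_ids` and `from_byte_ids` are made up for this illustration, and this is not how the Llama 3 tokenizer is implemented; it only shows why byte-level vocabularies never encounter an unknown symbol.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (each between 0 and 255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode a list of UTF-8 byte values back into text."""
    return bytes(ids).decode("utf-8")


samples = ["Hello", "Ahoj světe", "こんにちは", "🤗", "def f(x): return x"]
for sample in samples:
    ids = to_byte_ids(sample)
    # Every id fits into a 256-entry base vocabulary ...
    assert all(0 <= i < 256 for i in ids)
    # ... and the round trip is lossless.
    assert from_byte_ids(ids) == sample
    print(f"{sample!r}: {len(sample)} characters -> {len(ids)} byte ids")
```

The price of working on bytes alone is longer sequences, which is why byte-level tokenizers additionally learn merges of frequent byte sequences into larger tokens.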

**How does this look in code?**

Before diving into training a tokenizer, let's illustrate some of the aspects described above in code. We will find that all of the above is very simple to do by using the `tokenizers` library!

In [56]:
#|code-fold: true
#|output: false
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [73]:
from pprint import pprint
from transformers import AutoTokenizer

In [58]:
#|output: false
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now that we have a tokenizer defined, let's see how we can tokenize a piece of text:

In [59]:
text = "Some text to be tokenized"
tokens = tokenizer.tokenize(text)
tokens

['Some', 'Ġtext', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']

We can see that the string was split into smaller tokens. The `Ġ` is how the Llama 3 tokenizer represents spaces in its vocabulary, so you can actually think of the tokens as ` text`, ` to`, etc.

:::{.callout-note}
This is a consequence of the <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding" target="_blank">*byte pair encoding* (BPE)</a> algorithm that is used behind the scenes. Interestingly, I was not able to find why the `Ġ` character was chosen in particular and whether it is a historical artifact or whether such a pre-processing step is still fundamentally required nowadays. See @sec-next-steps for more.
:::
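
One plausible piece of the puzzle is the byte-to-unicode table published with GPT-2's byte-level BPE code, which maps every byte to a printable character so that merges can be stored as plain text. The sketch below reimplements that table in plain Python. It is an assumption on my part that the Hugging Face Llama 3 tokenizer uses the same table, although the `Ġ` in the output above is consistent with it.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable unicode character.

    Follows the scheme published with GPT-2: bytes that are already printable
    keep their own character, all others are shifted to code points 256 and up.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable byte: assign the next free code point above 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # the space byte (32) lands on code point 288
print("".join(mapping[b] for b in " text".encode("utf-8")))
```

Under this table the space is the 33rd non-printable byte, so it is shifted to code point 256 + 32 = 288, which happens to be `Ġ`. That would make the character a by-product of the mapping rather than a deliberate choice, but it does not settle whether the step is still needed.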

Next, let's turn these tokens into ids:

In [60]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[8538, 1495, 311, 387, 4037, 1534]

As simple as that! We could now feed these `token_ids` into a language model, though it is important to note that in practice we would perform everything in one step:

In [74]:
inputs = tokenizer(text, return_tensors="pt")
pprint(inputs, sort_dicts=False)  # pprint stands for pretty print

{'input_ids': tensor([[8538, 1495,  311,  387, 4037, 1534]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}


We would then call a Hugging Face-compatible model like so:
```python
# Define the model before this row
outputs = model(**inputs)
```

:::{.callout-note}
For the sake of completeness, it is worth noting that one would only tokenize a single string on its own during inference. During training, one would either tokenize the whole dataset in batches before training or tokenize a batch of data on the fly. The former is usually more reliable and makes it easier to restart training if a crash occurs (see Thomas Wolf's <a href="https://www.youtube.com/watch?v=2-SPH9hIKT8" target="_blank">video</a>).
:::
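
When several strings are tokenized as one batch, they rarely have the same length, so shorter sequences are padded and the `attention_mask` marks which positions hold real tokens. Here is a minimal sketch of that bookkeeping in plain Python. The function `pad_batch` and the choice of `pad_id=0` are placeholders for illustration, not the Llama 3 defaults; the token ids are reused from the example above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad lists of token ids to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[8538, 1495, 311, 387, 4037, 1534], [8538, 1495]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

This also explains why the single-string example above has an attention mask consisting only of ones: with one sequence, there is nothing to pad.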

# Training a Llama 3 tokenizer

In [110]:
#|output: false
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
# dataset = load_dataset("code_search_net", "python")
# dataset = load_dataset("PleIAs/Middle-English-PD")
len(dataset["train"])

1801350

In [111]:
# Define an iterator over the training split of the dataset
def batch_iterator(batch_size=1000):
    # Iterate over the rows of the training split; len(dataset) would only count the splits
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i : i + batch_size]["text"]
        # yield dataset["train"][i : i + batch_size]["whole_func_string"]
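
The slicing pattern used by the iterator can be checked on a plain Python list. In the sketch below, `rows` is a made-up stand-in for the text column of the training split; the important detail is that the range runs over the number of rows, so that every row ends up in exactly one batch.

```python
def iter_batches(rows, batch_size=1000):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]


# Stand-in data: 2,500 short strings instead of a real dataset split.
rows = [f"line {i}" for i in range(2500)]
batches = list(iter_batches(rows))

print([len(batch) for batch in batches])
print(sum(len(batch) for batch in batches) == len(rows))
```

The last batch is simply shorter than the others, which is fine for tokenizer training since the trainer only needs to see all of the text once.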

In [112]:
%%time

# new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), len(tokenizer.get_vocab()))
new_tokenizer.save_pretrained("new-llama-tokenizer-v1")




CPU times: user 587 ms, sys: 713 ms, total: 1.3 s
Wall time: 519 ms


('new-llama-tokenizer-v1/tokenizer_config.json',
 'new-llama-tokenizer-v1/special_tokens_map.json',
 'new-llama-tokenizer-v1/tokenizer.json')
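
To get a feeling for what training a BPE vocabulary involves, here is a toy merge-learning loop in plain Python. It is a sketch of the textbook BPE idea only (repeatedly merge the most frequent adjacent pair of symbols) and not what `train_new_from_iterator` does internally: the real trainer is implemented in Rust and works on byte-level pre-tokenized text. The function `train_toy_bpe` and the tiny corpus are made up for this illustration.

```python
from collections import Counter


def train_toy_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from an iterable of text lines."""
    # Represent every word as a tuple of symbols, starting from characters.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


corpus = ["low low low low low", "lower lower", "newest newest newest", "widest"]
merges = train_toy_bpe(corpus, num_merges=3)
print(merges)
```

Each learnt merge becomes a new vocabulary entry, so the number of merges is what ultimately determines the vocabulary size that we pass to the trainer.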

# Next steps (for the curious reader) {#sec-next-steps}

Here I leave some questions that arose while writing this article and that I could not find simple answers to. I hope to come back to them at some point in the future, but if you end up going down these rabbit holes yourself, I would be curious to hear what you find! Here is the list:

1. Why are spaces replaced with the `Ġ` special character in BPE-based tokenizers? The same Thomas Wolf <a href="https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475" target="_blank">argues</a> that this was done to avoid digesting spaces, which are used in the standard BPE algorithm. However, it is unclear whether we could get by without doing this nowadays. Answering this question would probably require diving deeper into how the BPE algorithm works. Sounds like a fun direction to explore in the future!
1. The <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/raw/main/tokenizer.json" target="_blank">`Llama-3-8B-Instruct`</a> and <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct/raw/main/tokenizer.json" target="_blank">`Llama-3-70B-Instruct`</a> tokenizers are almost identical apart from one setting: the former has the `model.ignore_merges` key specified as `true`, while the latter does not have such a key specified. The question is simple: what difference does this make? A good starting point to answer this question could be <a href="https://github.com/huggingface/tokenizers/blob/71c2a8d01a56cd7bd28148c309e210c47dac78e7/tokenizers/src/models/bpe/model.rs#L466" target="_blank">this</a> piece of code.
1. Why does it not fill all tokens?
1. Why do training speeds differ so much?
1. How to download a subset of the Common Corpus?

# Conclusion



# Sources

1. Let's build the GPT Tokenizer (<a href="https://www.youtube.com/watch?v=zduSFxRajkE" target="_blank">video</a>) by Andrej Karpathy.
1. A little guide to building Large Language Models in 2024 (<a href="https://www.youtube.com/watch?v=2-SPH9hIKT8" target="_blank">video</a>) by Thomas Wolf.
1. Training a new tokenizer from an old one (<a href="https://huggingface.co/learn/nlp-course/en/chapter6/2" target="_blank">article</a>) on the Hugging Face course.