# This is still a work in progress!

# Good tokenizers is all you need

I apologise for giving in to the popular trends in naming things in the ML space, but the title is not actually that misleading. Tokenization is a crucial part of the whole language modelling pipeline. Yet, as we will see in this post, there are all sorts of problems that tokenization can cause that one might not be aware of.

This post is inspired by the <a href="https://www.youtube.com/watch?v=zduSFxRajkE" target="_blank">amazing video</a> on LLM tokenizers by Andrej Karpathy.

Karpathy's list of problems (taken from the <a href="https://github.com/karpathy/minbpe/blob/master/lecture.md" target="_blank">following notes</a>):

- Why can't LLM spell words? **Tokenization.**
- Why can't LLM do super simple string processing tasks like reversing a string? **Tokenization.**
- Why is LLM worse at non-English languages (e.g. Japanese)? **Tokenization.**
- Why is LLM bad at simple arithmetic? **Tokenization.**
- Why did GPT-2 have more than necessary trouble coding in Python? **Tokenization.**
- Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? **Tokenization.**
- What is this weird warning I get about a "trailing whitespace"? **Tokenization.**
- Why did the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization.**
- Why should I prefer to use YAML over JSON with LLMs? **Tokenization.**
- Why is LLM not actually end-to-end language modeling? **Tokenization.**
- What is the real root of suffering? **Tokenization.**

I will use the <a href="https://augustasmacijauskas.github.io/trailtoken/" target="_blank">`trailtoken`</a> tool that I have recently built with a collaborator to inspect why these problems occur. I encourage you to play around with it, especially if you build tokenizers from scratch yourself. You might be surprised by how easily problems can creep in if one is not careful enough!

Let's dive in!

## Spelling

Tokenization makes it hard for the LLMs to spell. For example, one of the best open-source LLMs, <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1" target="_blank">`Mistral-7B-Instruct`</a>, has a very hard time spelling the word `antidisestablishmentarianism`:

![The mistralai/Mistral-7B-Instruct-v0.1 LLM misspelling the word antidisestablishmentarianism](mistral-instruct-spelling.png)

If we use `trailtoken` to inspect how this word is tokenized by the `Mistral-7B` tokenizer, we see that it is split into 8 seemingly random tokens:

![Visualisation of how the tokenizer of mistralai/Mistral-7B-Instruct-v0.1 tokenizes the word antidisestablishmentarianism](mistral-trailtoken.png)

It is unlikely that the model has seen these tokens occurring together during training, so it is no surprise that it finds it hard to separate out the letters that make up each token.
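To make the point concrete, here is a minimal sketch of what the model actually receives. The vocabulary, the token IDs, and the greedy longest-match rule are all made up for illustration (real tokenizers such as Mistral's rely on learned merges, and the real split is the one in the screenshot above), but the takeaway should carry over: the model is handed a list of integers, not a sequence of letters.

```python
# Hypothetical vocabulary and IDs, chosen only for illustration.
TOY_VOCAB = {
    "ant": 101, "idis": 102, "est": 103, "ablish": 104,
    "ment": 105, "arian": 106, "ism": 107,
}
UNKNOWN_ID = -1  # assumed placeholder for pieces missing from the toy vocabulary


def greedy_tokenize(word: str, vocab: dict[str, int]) -> list[str]:
    """Split `word` by repeatedly taking the longest matching vocabulary entry.

    This is a simplification, not BPE: it only shows that a word ends up
    as a handful of opaque chunks.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


word = "antidisestablishmentarianism"
pieces = greedy_tokenize(word, TOY_VOCAB)
token_ids = [TOY_VOCAB.get(piece, UNKNOWN_ID) for piece in pieces]

print(pieces)     # ['ant', 'idis', 'est', 'ablish', 'ment', 'arian', 'ism']
print(token_ids)  # [101, 102, 103, 104, 105, 106, 107]
```

Spelling the word back means recalling which letters hide behind each ID, and nothing in the input states that explicitly.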

On the other hand, the OpenAI models seem to handle the task with ease (see the following links for <a href="https://chat.openai.com/share/f56ea3ff-7e5b-4a65-9611-18143d3e17bc" target="_blank">`gpt-3.5-turbo`</a> and <a href="https://chat.openai.com/share/d33744ee-10a0-4a4f-b097-66ecad6019f8" target="_blank">`gpt-4-turbo`</a>). Even then, this is mostly a result of the capabilities of these models, as the word is still split into 6 tokens which are again more or less arbitrary:

![Visualisation of how the tokenizer of gpt-3.5-turbo tokenizes the word antidisestablishmentarianism](gpt-3.5-turbo-trailtoken.png)

:::{.callout-note}
Here and below I use the `Xenova/gpt-3.5-turbo` tokenizer on trailtoken because it is an open-source implementation of the `gpt-3.5-turbo` tokenizer.
:::

## Reversing a string

Just as with spelling, LLMs find it hard to do simple string tasks, like reversing words. A well-known example is asking LLMs to reverse the word `lollipop`. `Mistral-7B` fails miserably on the task:

![mistralai/Mistral-7B-Instruct-v0.1 LLM failing to reverse the word lollipop](mistral-lollipop-fail.png)

If we look at how the word is tokenized, we see that it ends up being only 4 tokens, so it is no wonder that LLMs find it hard to solve this task:

![Visualisation of how the tokenizer of mistralai/Mistral-7B-Instruct-v0.1 tokenizes the word lollipop](mistral-instruct-reverse.png)

The screenshot above hints at a "hack" that can be used to help LLMs: separate out the letters so that they end up as separate tokens after tokenization. Even then, few-shot examples are required to make it work (and it also did not work when I separated the letters using dashes `-` or spaces ` `):

![mistralai/Mistral-7B-Instruct-v0.1 LLM succeeding at reversing the word lollipop](mistral-lollipop-success.png)
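A small sketch of why separating the letters helps. The four-piece split below is hypothetical (the real Mistral split is the one in the screenshot further up), and `reverse_pieces` is just an illustrative helper. Reversing the order of multi-letter chunks does not reverse the word, whereas reversing one-letter pieces does.

```python
def reverse_pieces(pieces: list[str]) -> str:
    """Reverse the order of the pieces and glue them back together."""
    return "".join(reversed(pieces))


word = "lollipop"

# Hypothetical four-piece split, chosen only for illustration.
chunked = ["l", "oll", "ip", "op"]
# With the letters separated, every piece is a single character.
spelled_out = list(word)

print(reverse_pieces(chunked))      # opipolll
print(reverse_pieces(spelled_out))  # popillol
print(word[::-1])                   # popillol
```

A model working on chunks has to "look inside" each token to get the right answer, while a model working on single letters only has to reorder them.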

:::{.callout-caution}
Interestingly, `gpt-4` <a href="https://chat.openai.com/share/f79bf298-94b6-4ebc-9f44-cdc176bf1a7d" target="_blank">fails</a> to reverse the string too, so it is truly a problem caused by tokenization!
:::

## Non-English languages and code

These two are similar in nature...

```python
# Number of characters in a short Korean passage
text = "만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요."
len(text)  # 72
```
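One way to see part of the reason, assuming a byte-level BPE tokenizer such as GPT-2's, which starts from UTF-8 bytes before applying any merges: non-Latin text is already a longer sequence before tokenization even begins. The helper `utf8_cost` below is a name I made up for this sketch, and it measures raw bytes, not tokens.

```python
def utf8_cost(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


# ASCII letters take one byte each, precomposed Hangul syllables take three.
for sample in ["hello", "반가워요"]:
    chars, n_bytes = utf8_cost(sample)
    print(f"{sample!r}: {chars} characters, {n_bytes} bytes")
```

If the merges were learned mostly from English text, few of them will apply to these byte sequences, so the final token count stays closer to the byte count than it does for English.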

![Visualisation of how the tokenizer of gpt-3 tokenizes code](gpt-3-code.png)

![Visualisation of how the tokenizer of gpt-4 tokenizes code](gpt-4-code.png)

![Visualisation of how the tokenizer of gpt-3 tokenizes Korean](gpt-3-korean.png)

![Visualisation of how the tokenizer of gpt-4 tokenizes Korean](gpt-4-korean.png)

## Arithmetic

This is going to be a brief one, but LLM tokenizers can do odd things when tokenizing numbers, which can hinder their performance on simple arithmetic. For example, here is how the `gpt2` tokenizer behaves on the following numbers:

![Visualisation of how the tokenizer of gpt2 tokenizes numbers](gpt2-arithmetic.png)

We find that in one case the decimal part is a single token, but it becomes two tokens if an extra `0` is appended at the end. A human would not be troubled by this, but it could potentially confuse the LLM, since the same value is now represented by a different token sequence.

However, most of the modern LLM tokenizers are quite robust at handling numbers.
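One idea behind this robustness, as far as I understand it, is to cap how many digits can end up in a single chunk before any merges are applied. The sketch below imitates that with a plain regular expression. The function name `split_numbers` and the default cap of three digits are assumptions made for illustration, not the exact rule of any particular tokenizer.

```python
import re


def split_numbers(text: str, max_digits: int = 3) -> list[str]:
    """Split text so that no chunk holds more than `max_digits` digits."""
    # Digit runs are cut greedily from the left; everything else stays together.
    pattern = rf"\d{{1,{max_digits}}}|\D+"
    return re.findall(pattern, text)


print(split_numbers("0.25"))     # ['0', '.', '25']
print(split_numbers("0.250"))    # ['0', '.', '250']
print(split_numbers("1234567"))  # ['123', '456', '7']
```

With a rule like this, numbers are always broken up in the same predictable way, regardless of which digit strings happened to be frequent in the training data.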

## Special data formats

How structured data is tokenized is another source of peculiar behaviours in LLMs. For example, the images below show how the `Mistral-7B` tokenizer handles the `JSON` and `YAML` formats:

![Visualisation of how the tokenizer of mistralai/Mistral-7B-Instruct-v0.1 tokenizes JSON](mistral-json.png)

![Visualisation of how the tokenizer of mistralai/Mistral-7B-Instruct-v0.1 tokenizes YAML](mistral-yaml.png)

As we can see, we can save on tokens (about 2x!) and reduce overall complexity by using `YAML` instead of `JSON`. Also, notice how using underscores in key names adds extra tokens, so being smart about how structured data is passed to the tokenizer can help reduce the number of tokens used and, therefore, the compute costs!
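A rough way to see where the savings come from, without loading a real tokenizer, is to count the structural pieces in each format. The sketch below is only a proxy: `crude_piece_count` and `to_simple_yaml` are helpers I made up for this example, the record is invented, and real token counts depend on the tokenizer.

```python
import json
import re


def to_simple_yaml(record: dict) -> str:
    """Serialise a flat dict of scalars as YAML-style `key: value` lines."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


def crude_piece_count(text: str) -> int:
    """Count word runs and single punctuation marks as a stand-in for tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


record = {"name": "Ada", "city": "London", "age": 36}  # made-up example data
as_json = json.dumps(record, indent=2)
as_yaml = to_simple_yaml(record)

print(as_json)
print(crude_piece_count(as_json))  # 23
print(as_yaml)
print(crude_piece_count(as_yaml))  # 9
```

Most of the difference comes from the quotes, braces, and commas that `JSON` requires and `YAML` does not.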

## Other problems

- Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? **Tokenization.**
- What is this weird warning I get about a "trailing whitespace"? **Tokenization.**
- Why did the LLM break if I ask it about "SolidGoldMagikarp"? **Tokenization.**
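
The first of these can be sketched with a toy encoder. Everything here is an assumption made for illustration: the `encode` function and its `allow_special` flag are hypothetical, and ordinary text is encoded as raw UTF-8 byte values in place of real BPE merges. The only detail borrowed from a real tokenizer is the ID 50256, which GPT-2 uses for `<|endoftext|>`.

```python
import re

# Reserved control token and its ID (50256 is GPT-2's end-of-text ID).
SPECIAL_TOKENS = {"<|endoftext|>": 50256}


def encode(text: str, allow_special: bool) -> list[int]:
    """Toy encoder: UTF-8 byte values for ordinary text, reserved IDs for special tokens."""
    if not allow_special:
        # Treat everything, including "<|endoftext|>", as ordinary characters.
        return list(text.encode("utf-8"))
    # The capture group makes re.split keep the special tokens in the result.
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[part])
        else:
            ids.extend(part.encode("utf-8"))
    return ids


user_text = "hi<|endoftext|>bye"
print(encode(user_text, allow_special=True))   # [104, 105, 50256, 98, 121, 101]
print(len(encode(user_text, allow_special=False)))  # 18
```

If user-supplied text is encoded with special tokens allowed, a literal `<|endoftext|>` typed by the user turns into the control token, which the model has learned to treat as the end of a document.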

## How much of a problem is it?

A video by Thomas Wolfe suggests that we are fine as long as we are reasonably thoughtful about the tokenization process.

# Conclusion