# Introduction


In [Tiktokenizer](https://tiktokenizer.vercel.app), check the following string:
```
Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

127 + 677 = 804
1275 + 6773 = 8041

Egg.
I have an Egg.
egg.
EGG.

만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요.

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```
![Example GPT2](../images/gpt_tokenizer.png)


Notice:
- A space (shown as `.` here) is part of the token in English sentences, e.g., `.heart`, `.of`, `.much`, `.weirdness`, `.of`, `.LLMs`, ...

<br>

- `127` is a single token, but `677` is split into two tokens, `.6` and `77`.
- `1275` is split into `.12` and `75`, while `6773` is split into `.6` and `773`.
- ➡️ the splits look arbitrary

<br>

- `Egg` at the beginning of a sentence is two tokens, but `Egg` in the middle of a sentence is one token.
- `Egg` is a different token from `egg`.
- ➡️ the LLM has to learn from data that they are the same word.

<br>

- For languages other than English, the LLM performs worse not only because the model is trained on less data, but also because the tokenizer is trained on less data.
- It has **shorter tokens for Korean**, so the same sentence takes more tokens in Korean than in English.
- ➡️ The longer token sequences use up the attention layers' context window faster.

<br>

- Each space of indentation in Python code is a separate token, which wastes tokens because the `GPT2` tokenizer does not handle whitespace efficiently.

Trying the same example with `cl100k_base` (the GPT4 tokenizer), the token count drops from `300` to `185`. Note that the vocabulary size roughly doubles, from about `50k` (in GPT2) to about `100k` (in GPT4).
- This can be good because the same text is squeezed into fewer tokens (denser input) ➡️ the model can see roughly twice as much text in the same context.
- But a larger vocabulary makes the embedding table larger, and the output layer that predicts the next token grows in the same way.
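
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. The embedding width `d_model = 768` is an assumption chosen only for illustration, and the vocabulary sizes are the rounded `50k` and `100k` figures from above, not exact values.

```python
# Rough size of a token embedding table: vocab_size * d_model weights.
# d_model = 768 is an illustrative assumption; vocab sizes are rounded.

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a (vocab_size x d_model) embedding table."""
    return vocab_size * d_model

d_model = 768
small_vocab_params = embedding_params(50_000, d_model)
large_vocab_params = embedding_params(100_000, d_model)

print(f"~50k vocab:  {small_vocab_params:,} weights")
print(f"~100k vocab: {large_vocab_params:,} weights")
print(f"ratio: {large_vocab_params / small_vocab_params:.1f}x")
```

Under these assumptions, doubling the vocabulary doubles the embedding table, and an output layer that scores every vocabulary entry grows by the same factor.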

# Unicode & Encoding
Use `ord()` to get the Unicode code point of a character, and `chr()` to get the character from its code point.

```python
example_korean = "안녕하세요 👋 (hello in Korean!)"
unicodes = [ord(c) for c in example_korean]
print(unicodes)
print(f"length of unicodes: {len(example_korean)}")
```

```
[50504, 45397, 54616, 49464, 50836, 32, 128075, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33, 41]
length of unicodes: 26
```
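
The reverse direction with `chr()` can be sketched the same way. This is a small illustrative round trip; the variable names are made up for the example.

```python
# Round trip: characters -> code points -> characters.
text = "안녕하세요 👋"
code_points = [ord(c) for c in text]                # each character to its integer code point
restored = "".join(chr(cp) for cp in code_points)   # and back to characters

print(code_points)
print(restored == text)
```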


- Q: Why not use Unicode code points directly?
- A: At the vocabulary level we would need about `150k` tokens, and the Unicode standard is updated continuously, so we need a more stable representation.

## UTF-8
The Unicode standard defines three encodings: `UTF-8`, `UTF-16` and `UTF-32`.

**Encoding** is the process of converting Unicode code points to binary data.
- `UTF-8` is the most common. It is a variable-length encoding that uses between `1` and `4` bytes per code point.
- `UTF-32` is a fixed-length encoding (4 bytes per code point).
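
The difference between variable-length and fixed-length encoding can be seen on single characters. This is a minimal sketch; the sample characters are arbitrary, and `utf-32-le` is used instead of `utf-32` so that Python does not prepend a byte order mark.

```python
# Byte length of single characters under UTF-8 (variable) and UTF-32 (fixed).
# "utf-32-le" avoids the 4-byte byte order mark that "utf-32" would prepend.
samples = ["a", "\u00e9", "안", "👋"]  # ASCII, Latin-1 accent, Hangul, emoji

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-le"))
    print(f"{ch!r}: code point {ord(ch)}, UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]
```

The UTF-8 length grows with the code point, while every character costs four bytes in UTF-32.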

```python
utf_8 = list(example_korean.encode("utf-8"))
print(f"UTF-8: {utf_8}")
print(f"length of utf-8: {len(utf_8)}")
```

```
UTF-8: [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148, 32, 240, 159, 145, 139, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33, 41]
length of utf-8: 39
```


```python
utf_16 = list(example_korean.encode("utf-16"))
print(f"UTF-16: {utf_16}")
print(f"Too many zeros! {len([c for c in utf_16 if c == 0])}")

# same for utf-32
print("-" * 50)
utf_32 = list(example_korean.encode("utf-32"))
print(f"UTF-32: {utf_32}")
print(f"Too many zeros! {len([c for c in utf_32 if c == 0])}")
```

```
UTF-16: [255, 254, 72, 197, 85, 177, 88, 213, 56, 193, 148, 198, 32, 0, 61, 216, 75, 220, 32, 0, 40, 0, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 32, 0, 105, 0, 110, 0, 32, 0, 75, 0, 111, 0, 114, 0, 101, 0, 97, 0, 110, 0, 33, 0, 41, 0]
Too many zeros! 20
--------------------------------------------------
UTF-32: [255, 254, 0, 0, 72, 197, 0, 0, 85, 177, 0, 0, 88, 213, 0, 0, 56, 193, 0, 0, 148, 198, 0, 0, 32, 0, 0, 0, 75, 244, 1, 0, 32, 0, 0, 0, 40, 0, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 32, 0, 0, 0, 105, 0, 0, 0, 110, 0, 0, 0, 32, 0, 0, 0, 75, 0, 0, 0, 111, 0, 0, 0, 114, 0, 0, 0, 101, 0, 0, 0, 97, 0, 0, 0, 110, 0, 0, 0, 33, 0, 0, 0, 41, 0, 0, 0]
Too many zeros! 73
```


Using raw `UTF-8` bytes directly as tokens is not efficient: the vocabulary size would be only `256` ➡️ a very tiny embedding table and output layer, but very long sequences.
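
As a minimal sketch of what "using raw bytes as tokens" means, the hypothetical helpers `encode_bytes` and `decode_bytes` below treat every UTF-8 byte as one token id. They are illustrative names, not part of any library.

```python
# Byte-level "tokenizer": every UTF-8 byte is a token, so ids are in 0..255.
def encode_bytes(text: str) -> list[int]:
    """Map text to a list of byte values (the token ids)."""
    return list(text.encode("utf-8"))

def decode_bytes(ids: list[int]) -> str:
    """Map byte-valued token ids back to text."""
    return bytes(ids).decode("utf-8")

text = "안녕하세요 👋 (hello in Korean!)"
ids = encode_bytes(text)

print(f"characters: {len(text)}, byte tokens: {len(ids)}")
print(f"largest id: {max(ids)}")
print(decode_bytes(ids) == text)
```

The vocabulary never needs more than 256 entries, but the sequence is longer than the character count, which is the cost described above.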

For the same example from the beginning:
- 300 tokens with the GPT2 tokenizer
- 185 tokens with the GPT4 tokenizer (`cl100k_base`)
- 503 tokens if tokenized as raw UTF-8 bytes (check the cell below)


```python
example_string = """
Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

127 + 677 = 804
1275 + 6773 = 8041

Egg.
I have an Egg.
egg.
EGG.

만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요.

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
print(f"length in utf-8: {len(example_string.encode('utf-8'))}")
```

```
length in utf-8: 503
```


We want to keep the `UTF-8` encoding, but instead of using the raw bytes directly we need to support a larger vocabulary whose size can be tuned as a hyperparameter.
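
The text above does not name a method. One common way to grow a vocabulary on top of raw bytes is to repeatedly merge the most frequent adjacent pair of tokens into a new token id (byte pair encoding). The sketch below is only an illustration under that assumption; `most_frequent_pair`, `merge`, and `num_merges` are hypothetical names, and the input string is a toy example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes: vocabulary of 256
num_merges = 3                    # hyperparameter: final vocab size = 256 + num_merges

for step in range(num_merges):
    if len(ids) < 2:
        break  # nothing left to merge
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + step)  # new ids start right after the byte range

print(f"{len(text.encode('utf-8'))} byte tokens -> {len(ids)} tokens")
```

Each merge adds one entry to the vocabulary and shortens the sequence, so `num_merges` trades vocabulary size against sequence length.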