# Welcome to GPT Tokenizer

😔

We have a sad face because `Tokenizers` are my least favourite part of working with **Large Language Models (LLMs)**. But unfortunately, it is necessary to understand them in detail... And a lot of the oddness of LLMs traces back to `Tokenization`...

So...

What is `Tokenization`?

Now in our last notebook (<a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a>) we already implemented tokenization but it was done in a completely naïve and simple way...

In the previous notebook, the question we encountered was, "how do we *plug* text into the GPT?" We ended up with a vocabulary of `92` characters, from which we created two look-up tables, `stoi` and `itos`, for mapping characters to indices and vice versa. These served as a `token` table for two functions: `encode` returned the token integers for a string, and `decode` returned the original message from those tokens...
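As a quick refresher, that character-level scheme can be sketched roughly like this. It is only a minimal sketch: the sample text is hypothetical, so the vocabulary here is far smaller than the `92` characters of the previous notebook.

```python
# Hypothetical sample text; the previous notebook built its vocabulary from a full dataset.
text = "hello tokenizer"

# Vocabulary: every distinct character, sorted for a stable ordering.
characters = sorted(set(text))

# Look-up tables: character -> index and index -> character.
stoi = {character: index for index, character in enumerate(characters)}
itos = {index: character for character, index in stoi.items()}

def encode(string):
    """Map each character of the string to its integer token."""
    return [stoi[character] for character in string]

def decode(tokens):
    """Map a list of integer tokens back to the string."""
    return "".join(itos[token] for token in tokens)

tokens = encode("hello")
print(tokens)
print(decode(tokens))
```

Note that `encode` would fail on any character that is not in the vocabulary, which is one of the reasons this naïve approach does not scale to arbitrary text.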

And later we saw that the way we *plug* these `tokens` into the model is by using a `tokenEmbeddingTable`, which has one row for each of the `92` possible characters, and each row holds the `embedding` of that character...

But,

In practice, people use much more complicated schemes to encode and decode these `tokens`...

Instead of single characters, they deal with **chunk level** text. And these **chunks** are constructed using algorithms like the `Byte-Pair` encoding algorithm, which we will be covering in a bit...

I'd also like to discuss the paper that introduced `byte` level encoding as a tokenization mechanism in the context of LLMs...

Which is this paper: <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners</a>

And if we scroll to section **2.2 Input Representation** within the paper, we find their discussion of tokenization, and shortly after it (in section **2.3 Model**) we see the line: "The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used."

Which means that the `vocabularySize` they used is exactly `50,257`, and in the attention layers of the `Transformer` architecture, every single token is *attending* to the previous tokens in the sequence, so it is able to see up to `1024` tokens of context...

So **`Tokens` are the fundamental atomic units of LLMs**, and everything else is expressed in terms of them...

And **`Tokenization` is the process of translating `strings` or text into sequences of `tokens` and vice versa.**

We can also look into the <a href="https://arxiv.org/pdf/2307.09288">Llama 2: Open Foundation and Fine-Tuned Chat Models</a> paper by Meta, where in section **2.1** they mention that they trained their model on 2 trillion tokens of data...

And luckily the `Byte-Pair` encoding algorithm is fairly simple, so we can implement it ourselves and build our own `tokenizer`...

# Tiktokenizer

Before we dive into the code, we can go to this nice website that has been created for us, <a href="https://tiktokenizer.vercel.app/">Tiktokenizer</a>, and familiarize ourselves with the tokenization types that are used...

And what's great about this website is that the tokenization runs live in our browser with the help of JavaScript... So we can plug in our own text and test the different tokenizers, their respective outputs and their color coded tokens...

Now what I'd like you to do is plug this sample text into the website input:

```text
Today we will be learning tokenization. It is not very fun, but definitely informative.

69 + 420 = 489
6969 + 420 = 7389

Hat.
hat.
HAT.
I have a hat.

私はコンピューターが大好きです。私は機械学習エンジニアです。 

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

And set the tokenizer to `GPT-2`, because that is the model we just discussed and we want to relate it to what we see here...

And we immediately see that we get this kind of output:
![Tokenizer_GPT2Test](ExplanationMedia/Images/Tokenizer_GPT2Test.png)

Now, I have divided our sample text that we want to tokenize into 5 parts:

1. A sample line
2. Arithmetic
3. Case Sensitivity
4. Foreign Language
5. Python code

And before starting the discussion of the above points, I'd also like to mention that our text contains white-spaces, new line characters and so on, but we can hide them for clarity...

1. Now let's start by discussing the sample line first: \
We see that the sample line has been *chunked* into nice little pieces, which are our `tokens`. But immediately we see that the word "tokenization" has been chunked into two pieces, and that in the middle of a sentence the leading space is a part of the token that follows it. We will see why that is in a bit...
2. Next up we have the arithmetic: \
We see that the number `420` is a single token but the number `489` is split up into two tokens. And the LLM has to take account of this and process it correctly in its neural network as well...
3. Next up we have our case sensitivity and punctuation: \
We took the word "hat" and we see how differently it is tokenized in each case. Having a leading space in front of the word also makes it a completely different token than the one without it. But the most interesting part here is that the LLM has to learn, from the raw data alone, that all of these "hat"s represent the same concept. It has to group them in the parameters of the neural network and understand, all by itself, that they are nearly but not exactly the same.
4. Next up we will discuss the foreign languages: \
I have included this because non-English languages work slightly worse in LLMs. That is because the training data for these languages is generally much smaller than for English, which is true not just for the LLM itself but also for the `tokenizer`. So when we train the tokenizer, it sees a lot more English than non-English text, and what ends up happening is that English gets much longer tokens than other languages. In other words, if we compare an English text with a Japanese text of the same meaning, the number of tokens used for Japanese is much larger, because its *chunks* are a lot more broken up. And intuitively what this does is, it bloats up the sequence length of all the non-English documents that we train on, and we run out of context in the `Transformer`'s attention much faster.
5. And lastly we will discuss the tokenization of code, for which I have taken a Python example: \
We immediately see that all the individual spaces in the example are separate tokens (specifically token `220`), which means that when the `Transformer` attends to this text, it has to attend to every space individually. In other words, the tokenization is being extremely wasteful here. (GPT-2 is extremely un-optimized for code)\
Now we can change the tokenizer to `cl100k_base` (which is the GPT-4 tokenizer) and check the results, and we get something like this:
![Tokenizer_CL100KExample](ExplanationMedia/Images/Tokenizer_CL100KExample.png)
We immediately see that the token count has decreased, and that is because the vocabulary of the GPT-4 tokenizer is roughly double the size of the GPT-2 vocabulary. This means we are now feeding much denser inputs to the `Transformer`, so it is able to see more of the previous context than before. But increasing the vocabulary size indefinitely is not good either, because the embedding table and the softmax layer of the `Transformer` grow with it, and we end up doing a lot more computation than before. There is a *sweet spot* that gives us a good vocabulary size in the end. I'd also like you to note that the **whitespace** handling of the GPT-4 tokenizer has improved a lot: runs of spaces get grouped into single tokens, which makes the tokenization of Python much more efficient. This was a deliberate choice made by OpenAI, because it densifies the information carried by each token and lets the `Transformer` look further back into the context of the previous code.

# Coding By Understanding

Let's now start writing some code...

Let's understand what we are trying to do:
1. We want to take `strings` and we want to feed them into language models
2. For that we need to somehow `tokenize` the strings into `integers` from a fixed vocabulary
3. We will use those integers to make a *look-up* into a `lookupTable` of embedding vectors and feed those vectors into the transformer as input
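The look-up in step 3 can be sketched in plain Python as follows. This is only an illustration: the sizes, the random values and the token integers are all assumptions, and a real model would use a learned embedding matrix instead of random lists.

```python
import random

random.seed(0)

vocabularySize = 256  # assumed vocabulary size (one row per possible token)
embeddingSize = 4     # assumed, kept tiny so the vectors are easy to inspect

# One row of random numbers per token (a trained model would learn these values).
lookupTable = [
    [random.uniform(-1, 1) for _ in range(embeddingSize)]
    for _ in range(vocabularySize)
]

tokens = [72, 105]  # hypothetical token integers produced by a tokenizer

# "Looking up" a token simply means selecting the row at that index.
vectors = [lookupTable[token] for token in tokens]

print(len(vectors), len(vectors[0]))
```

The important point is that the number of rows in the table is tied to the vocabulary size, while the number of vectors we feed in is tied to the number of tokens in the text.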

But the reason this gets tricky is because:
1. We want to support different kinds of languages
2. We also want to support different kinds of special characters that we might find on the internet (for example emojis such as 👋)

Let's take a *toy-example* first:
```text
Hello in japanese. 👋 日本語でこんにちは。
```

So the question we now arrive at is, how do we feed this text into a `Transformer`?

Let's first dive into the definition of `strings` in the <a href="https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str">Python documentation</a>, where we find the sentence: "Strings are immutable sequences of Unicode code points."

So, what are Unicode code points?

If we look up the <a href="https://en.wikipedia.org/wiki/Unicode">Unicode page</a> on Wikipedia, we learn that Unicode is a text encoding standard maintained by the Unicode Consortium.
Essentially, it is a definition of 149,813 characters across 161 scripts (what they look like and which integers represent them) **as of right now**.

And I say **as of right now** because, we can see that the standard is very much alive and keeps on changing.

And the way we can **access the Unicode code point of a character** is by using a function in Python called <a href="https://docs.python.org/3/library/functions.html#ord">`ord()`</a>, which takes a single character as input.


So we can now experiment with some code:

For example, if we do:
```python
print(ord("H"))
print(ord("👋"))
print(ord("日"))
```
For which we get:
```python
72
128075
26085
```

So using the same idea, we can look up all the characters in our *toy-example* string by looping over each character and collecting the output of `ord()` into a list, like this:
```python
print([ord(character) for character in "Hello in japanese. 👋 日本語でこんにちは。"])
```
For which we get:
```python
[72, 101, 108, 108, 111, 32, 105, 110, 32, 106, 97, 112, 97, 110, 101, 115, 101, 46, 32, 128075, 32, 26085, 26412, 35486, 12391, 12371, 12435, 12395, 12385, 12399, 12290]
```
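Going in the other direction is just as easy. As a small sketch, Python's built-in `chr()` is the inverse of `ord()`, so the list of code points can be turned back into the original string (the variable names here are only for illustration):

```python
text = "Hello in japanese. 👋 日本語でこんにちは。"

# Text -> code points, exactly as above.
codePoints = [ord(character) for character in text]

# Code points -> text, using chr() as the inverse of ord().
restored = "".join(chr(codePoint) for codePoint in codePoints)

print(len(codePoints))   # number of code points in the string
print(restored == text)  # True
```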

Notice that we have already turned this raw text into integers, so we might ask: "why can't we use these integers directly and not have any tokenization at all?"

One reason is that the vocabulary would be very large. But the more dangerous reason is that the Unicode standard is very much alive and keeps changing, which means that it is not a stable representation that we would want to use directly in our models...

So we need something a bit better at this point...

This brings us to the idea of `encodings`...
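To get a feeling for how large that vocabulary would be, we can ask Python for the largest code point it supports. This is only a quick sanity check, and the point is that the code point space has room for 1,114,112 values, far more than the characters that are currently assigned:

```python
import sys

# The largest valid Unicode code point (U+10FFFF).
print(sys.maxunicode)       # 1114111
print(hex(sys.maxunicode))  # 0x10ffff

# An embedding table indexed directly by code point would need this many rows.
print(sys.maxunicode + 1)   # 1114112
```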

# Unicode Encodings

## Encodings

Now, the Unicode Standard itself defines three encodings: `UTF-8`, `UTF-16`, and `UTF-32` (UTF stands for Unicode Transformation Format), though several others exist.

Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the Unicode text into sequences of bytes.

And `UTF-8` is by far the most popular encoding used in the real world...

And its own <a href="https://en.wikipedia.org/wiki/UTF-8">dedicated Wikipedia page</a> includes the following section, as of now:


Code point ↔ UTF-8 conversion
| First code point              | Last code point | Byte 1    | Byte 2   | Byte 3   | Byte 4   |
|-------------------------------|------------|-----------|----------|----------|----------|
| U+00<span style="color:red;">0</span><span style="color:purple;">0</span>                | U+00<span style="color:red;">7</span><span style="color:purple;">F</span>               | 0<span style="color:red;">xxx</span><span style="color:purple;">xxxx</span> |          |          |          |
| U+0<span style="color:green;">0</span><span style="color:red;">8</span><span style="color:purple;">0</span>                | U+0<span style="color:green;">7</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 110<span style="color:green;">xxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |          |
| U+<span style="color:blue;">0</span><span style="color:green;">8</span><span style="color:red;">0</span><span style="color:purple;">0</span>                | U+<span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 1110<span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |
| U+<span style="color:crimson;">0</span><span style="color:orange;">1</span><span style="color:blue;">0</span><span style="color:green;">0</span><span style="color:red;">0</span><span style="color:purple;">0</span> | U+<span style="color:crimson;">1</span><span style="color:orange;">0</span><span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span> | 11110<span style="color:crimson;">x</span><span style="color:orange;">xx</span> | 10<span style="color:orange;">xx</span><span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |

The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 codepoints of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 codepoints in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols.
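We can check these byte lengths ourselves with one character from each range. The four sample characters below are my own picks (one per row of the table above):

```python
# One sample character from each UTF-8 length class.
samples = ["A", "θ", "日", "👋"]

for character in samples:
    encoded = character.encode("utf-8")
    print(f"U+{ord(character):04X} {character} -> {len(encoded)} byte(s): {list(encoded)}")

# The byte lengths are 1, 2, 3 and 4 respectively.
```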

A whole graphic character can take more than 4 bytes, because it is made of more than one code point. For instance, a national flag character takes 8 bytes since it is "constructed from a pair of Unicode scalar values" both from outside the BMP.
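The flag case is easy to verify as well. As a small sketch, the Japanese flag below is written with explicit escapes so that the two code points (regional indicator symbols `J` and `P`) are visible:

```python
# The flag of Japan is a pair of regional indicator symbols: J followed by P.
flag = "\U0001F1EF\U0001F1F5"

print(flag)
print(len(flag))                          # 2 code points
print([f"U+{ord(c):04X}" for c in flag])  # ['U+1F1EF', 'U+1F1F5']
print(len(flag.encode("utf-8")))          # 8 bytes (4 per code point)
```

So what looks like one character on screen is two code points for Python and eight bytes for `UTF-8`.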

I found this <a href="https://www.reedbeta.com/blog/programmers-intro-to-unicode/">blog post by Nathan Reed</a> pretty interesting and invite you to read it as well... The post also has a lot of links at the end which are quite useful... \
One of them is <a href="https://utf8everywhere.org/">UTF-8 Everywhere - Manifesto</a> which discusses why `UTF-8` is preferred and is much nicer than the other encodings of Unicode.

But what all of the above articles essentially say is that `UTF-8` is a variable-length encoding that encodes our characters into binary representations.

Let's simplify all these resources then:

Each letter or symbol you see on your screen has a special number assigned to it. `UTF-8` is a system that turns these numbers into a series of `1`'s and `0`'s (binary code) that computers can understand and store.

Understandable, but then what is the step-by-step process to do so?

Let's explain this with small examples:

1. **Input Text:** You have a piece of text that you want to encode in `UTF-8`. This text could be anything from a simple word to an entire document. Let's say your text is "Hello".
2. **Unicode Representation:** The basic elements of Unicode, its "*characters*" (although that term isn't quite right), are called **Code Points**. **Code Points** are identified by number, customarily written in hexadecimal with the prefix `U+`, such as `U+0041` (`A`, Latin capital letter A) or `U+03B8` (`θ`, Greek small letter theta). These code points are standardized numerical values that represent each character universally.
3. **Binary Representation:** Now, these Unicode code points need to be converted into binary form. `UTF-8` is a variable-width encoding, meaning different characters can be represented by different numbers of `bytes`. The binary representation is based on the Unicode code point.
4. **Determine Byte Length:** Based on the Unicode code point, `UTF-8` determines how many bytes are needed to represent the character. Characters with lower Unicode code points (usually basic Latin characters like A-Z, a-z, 0-9) require only one byte, while characters with higher code points require more bytes.
5. **Encoding the Character:** The binary representation of the Unicode code point is split into multiple bytes according to the rules of `UTF-8` encoding. Each byte starts with a prefix that specifies its length and position in the sequence.
6. **Adding Byte Markers:** `UTF-8` uses specific bit patterns to indicate how many bytes are used to represent a character. \
    For example:
    - Single-byte characters start with a 0 bit (e.g., `0xxxxxxx`).
    - Two-byte characters start with `110` (e.g., `110xxxxx`).
    - Three-byte characters start with `1110` (e.g., `1110xxxx`).
    - Four-byte characters start with `11110` (e.g., `11110xxx`), which is the longest form `UTF-8` uses.
7. **Appending Bytes:** Each byte after the first starts with the bit pattern `10`, indicating it's a continuation byte, and the remaining bits are filled with the binary representation of the character.
8. **Putting it All Together:** The individual `bytes` for each character are then concatenated together to form the `UTF-8` encoded sequence.
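To make the steps above concrete, here is a minimal hand-written sketch of the encoding rules. The function names `encodeCodePoint` and `encodeUtf8` are made up for this illustration, and the sketch skips validation (for example, it does not reject surrogate code points), so it is for understanding only and not a replacement for Python's own encoder.

```python
def encodeCodePoint(codePoint):
    """Encode one Unicode code point into a list of UTF-8 byte values."""
    if codePoint <= 0x7F:
        # 1 byte: 0xxxxxxx
        return [codePoint]
    if codePoint <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return [0b11000000 | (codePoint >> 6),
                0b10000000 | (codePoint & 0b111111)]
    if codePoint <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0b11100000 | (codePoint >> 12),
                0b10000000 | ((codePoint >> 6) & 0b111111),
                0b10000000 | (codePoint & 0b111111)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0b11110000 | (codePoint >> 18),
            0b10000000 | ((codePoint >> 12) & 0b111111),
            0b10000000 | ((codePoint >> 6) & 0b111111),
            0b10000000 | (codePoint & 0b111111)]

def encodeUtf8(text):
    """Concatenate the UTF-8 bytes of every character in the text."""
    encodedBytes = []
    for character in text:
        encodedBytes.extend(encodeCodePoint(ord(character)))
    return encodedBytes

text = "Hello 👋 日本語"
print(encodeUtf8(text))
print(encodeUtf8(text) == list(text.encode("utf-8")))  # True
```

The last line compares the hand-written result against Python's built-in encoder, which is what we should use in practice.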

Enough explanation, let's understand how this works by code now...

To do all of these steps in a single line, Python already offers a built-in string method called <a href="https://docs.python.org/3/library/stdtypes.html#str.encode">`encode()`</a>. This method takes the name of an encoding and returns the <a href="https://docs.python.org/3/library/stdtypes.html#bytes-objects">`bytes`</a> sequence for that string... Let's take this concept for a spin now...

Let's say we have this code:
```python
print(ord("👋"))
print("👋".encode("UTF-8"))
print(list("👋".encode("UTF-8")))
```
For which we get:
```python
128075
b'\xf0\x9f\x91\x8b'
[240, 159, 145, 139]
```

Remember how we used `ord()` last time to get the Unicode code point?

This time, when we encoded the string, the method returned a stream of `4` bytes for a single character. And when we passed those bytes through `list()`, each byte (shown in hexadecimal in the `bytes` output) was displayed as a decimal integer between `0` and `255`...

Let's now try our old string and try to encode it into different encodings (`UTF-8`, `UTF-16`, and `UTF-32`):

```python
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-8")))
[72, 101, 108, 108, 111, 32, 105, 110, 32, 106, 97, 112, 97, 110, 101, 115, 101, 46, 32, 240, 159, 145, 139, 32, 230, 151, 165, 230, 156, 172, 232, 170, 158, 227, 129, 167, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 227, 128, 130]
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-16")))
[255, 254, 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 32, 0, 105, 0, 110, 0, 32, 0, 106, 0, 97, 0, 112, 0, 97, 0, 110, 0, 101, 0, 115, 0, 101, 0, 46, 0, 32, 0, 61, 216, 75, 220, 32, 0, 229, 101, 44, 103, 158, 138, 103, 48, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 2, 48]
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-32")))
[255, 254, 0, 0, 72, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 32, 0, 0, 0, 105, 0, 0, 0, 110, 0, 0, 0, 32, 0, 0, 0, 106, 0, 0, 0, 97, 0, 0, 0, 112, 0, 0, 0, 97, 0, 0, 0, 110, 0, 0, 0, 101, 0, 0, 0, 115, 0, 0, 0, 101, 0, 0, 0, 46, 0, 0, 0, 32, 0, 0, 0, 75, 244, 1, 0, 32, 0, 0, 0, 229, 101, 0, 0, 44, 103, 0, 0, 158, 138, 0, 0, 103, 48, 0, 0, 83, 48, 0, 0, 147, 48, 0, 0, 107, 48, 0, 0, 97, 48, 0, 0, 111, 48, 0, 0, 2, 48, 0, 0]
```

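See how each time we end up with more `0`s? (The leading `255, 254` in the `UTF-16` and `UTF-32` outputs is a byte-order mark that Python adds in front, it is not part of the text.)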

This indicates that we are being more and more wasteful when we are trying to represent our characters.
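We can put numbers on that waste by comparing the total length of each encoding and counting the zero bytes. This is just a small measurement sketch over the same toy string:

```python
text = "Hello in japanese. 👋 日本語でこんにちは。"

print(len(text), "code points")

for encoding in ["utf-8", "utf-16", "utf-32"]:
    encoded = text.encode(encoding)
    # count(0) counts how many bytes in the result are exactly zero.
    print(encoding, "->", len(encoded), "bytes,", encoded.count(0), "zero bytes")
```

For this string the lengths come out as 54, 66 and 128 bytes respectively, for only 31 code points.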

So we will stick with `UTF-8` for our purposes for now...


## Problem with encodings

Now we arrive at the problem... Can you guess it?

Currently we have access to `byte-streams` using `UTF-8`, which implies a vocabulary size of only `256` possible `tokens`...

So if we try to use `UTF-8` naively, all of our text will be *stretched-out* into very, very long sequences of `bytes`.

The effect is that the embedding table stays small and the softmax layer stays small as well, but the sequences become very long. Since the attention in a `Transformer` can only support a finite context window, this is inefficient, and it will not allow us to attend to sufficiently long text for the next token prediction task...
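To see this stretching in action, here is a minimal sketch of such a naive byte-level tokenizer, where every `UTF-8` byte is its own token. The `encode` and `decode` helpers are written just for this illustration:

```python
def encode(text):
    """Naive byte-level tokenizer: every UTF-8 byte becomes one token (0..255)."""
    return list(text.encode("utf-8"))

def decode(tokens):
    """Turn the byte tokens back into text."""
    return bytes(tokens).decode("utf-8")

text = "日本語でこんにちは。"
tokens = encode(text)

print(len(text), "characters ->", len(tokens), "tokens")  # 10 characters -> 30 tokens
print(max(tokens) <= 255)                                 # True
print(decode(tokens) == text)                             # True
```

The round trip works perfectly, but a ten character sentence already costs thirty tokens of our limited context window.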

Let's wrap-up what we want...

We want to be able to support a larger vocabulary size that we can tune as a hyper-parameter, but we also want to stick with the `UTF-8` encoding...

So what do we do now?

The answer is none other than **Byte-Pair Encoding** or **BPE**...

# Byte-Pair Encoding