# Welcome to GPT Tokenizer

😔

We have a sad face because `Tokenizers` are my least favourite part of working with **Large Language Models (LLMs)**. But unfortunately it is completely necessary to understand them in detail... And a lot of the oddness of LLMs traces back to `Tokenization`...

So...

What is `Tokenization`?

Now in our last notebook (<a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a>) we already implemented tokenization but it was done in a completely naïve and simple way...

In the previous notebook, the question we encountered was, "how do we *plug-in* text into the GPT?" We ended up with a vocabulary of `92` characters, from which we created two look-up tables, `stoi` and `itos`, for mapping characters to indices and vice-versa. These served as a `token` table for our `encode` and `decode` functions: `encode` returned the token integers for a string, and `decode` returned the original message from those integers...
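
As a quick refresher, here is a minimal sketch of that character-level scheme. The sample string is made up for illustration (the previous notebook built its `92`-character vocabulary from its own dataset), so the vocabulary here is much smaller:

```python
# Toy corpus, assumed purely for illustration.
text = "hello tokenizer"

# The vocabulary is the sorted set of unique characters in the corpus.
chars = sorted(set(text))

# Look-up tables: character -> integer and integer -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of token integers, one per character."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of token integers back to the original string."""
    return "".join(itos[i] for i in ids)

tokens = encode("hello")
print(tokens)
print(decode(tokens))
```

Note that this scheme can only encode characters it has already seen in the corpus, which is one of the limitations the rest of this notebook works around.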

And later we saw that the way we *plug* these `tokens` into the model is by using a `tokenEmbeddingTable`, which had one row for each of the `92` possible characters, holding that character's `embedding`...

But,

In practice, people use a lot more complicated schemes to encode and decode these `tokens`...

Instead of single characters, we deal with text at the **chunk level**. And these **chunk level** pieces are constructed using algorithms like the `Byte-Pair` encoding algorithm, which we will be covering in a bit...

I'd also like to discuss the paper that introduced this concept of `byte` level encoding as a tokenization mechanism in the context of LLMs...

Which is this paper : <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners
</a>

And if we scroll past **2.2 Input Representation** to the end of **2.3 Model** within the paper, we see that they conclude with the line: "The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used."

Which means that the `vocabularySize` they used is exactly `50,257`, and in the `Transformer` architecture's attention layer, every single token is *attending* to the previous tokens in the sequence, and it is able to see up to `1024 tokens` at a time...

So **`Tokens` are the fundamental atomic units of LLMs**, and everything an LLM reads or writes is expressed in terms of them...

And **`Tokenization` is the process of translating `strings` or text into sequences of `tokens` and vice versa.**

And we can also look into the <a href="https://arxiv.org/pdf/2307.09288">Llama 2: Open Foundation and Fine-Tuned Chat Models</a> paper by Meta and we see that, in their paper in section **2.1** they mentioned that they trained their model with 2 trillion tokens of data...

And luckily the `Byte-Pair` algorithm is fairly simple and we can implement it ourselves and we can build our own `tokenizer`..

# Tiktokenizer

Before we dive into the code, we can go to this nice website that has been created for us, <a href="https://tiktokenizer.vercel.app/">Tiktokenizer</a>, and familiarize ourselves with the tokenizers that are in use...

And what's great about this website is that tokenization runs live in our browser with the help of JavaScript... So we can plug in our own text, test the different tokenizers, and see their respective outputs as color-coded tokens...

Now what I'd like you to do is plug this sample text into the website input:

```text
Today we will be learning tokenization. It is not very fun, but definitely informative.

69 + 420 = 489
6969 + 420 = 7389

Hat.
hat.
HAT.
I have a hat.

私はコンピューターが大好きです。私は機械学習エンジニアです。 

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

And set the tokenizer to `GPT-2`, since that is the model we just discussed and we can relate what we see to it...

And we immediately see that we get this kind of output:
![Tokenizer_GPT2Test](ExplanationMedia/Images/Tokenizer_GPT2Test.png)

Now, I have divided our sample text that we want to tokenize into 5 parts:

1. A sample line
2. Arithmetic
3. Case Sensitivity
4. Foreign Language
5. Python code

And before starting the discussion of the above points, I'd also like to mention that our text contains whitespace, newline characters and so on, but we can hide them for more clarity...

1. Now let's start by discussing the sample line first: \
We see that the sample line has been *chunked* into nice little pieces, giving us the `tokenized` text. But immediately we see that the word "tokenization" has been split into two pieces, and that in the middle of a sentence the leading space is part of each token. We will see why that is in a bit...
2. Next up we have the arithmetic code: \
We see that the number `420` is a single token, but the number `489` is split up into two tokens. And the LLM has to take account of this and still process the arithmetic correctly in its neural network...
3. Next up we have our case sensitivity and punctuation: \
We took the word "hat" and we see how differently it is tokenized in each case. Having a leading space in front of the word also makes it a completely different token than the one without it. But the most interesting part here is that the LLM has to learn, purely from the raw data, that all of these "hat"s represent the same concept, and has to group them within the parameters of the neural network, understanding all by itself that they are almost, but not exactly, the same.
4. Next up we will discuss the foreign languages: \
I have included this because non-English languages work slightly worse in LLMs. That is because the training data for these languages is generally much smaller than for English, which is true not just for the LLM itself but also for the `tokenizer`. So when we train the tokenizer, it sees a lot more English than non-English text, and what ends up happening is that English gets many more long, merged tokens than other languages. In other words, if we compare an English text with the equivalent Japanese text, the number of tokens used for Japanese is much larger, because its *chunks* are a lot more broken up and we end up using a lot more tokens for the exact same thing. Intuitively, this bloats the sequence length of all such documents that we train on, and we run out of context in the `Transformer`'s attention much faster.
5. And lastly we will discuss the tokenization of the coding part, for us, I have taken the example of a Python code: \
We immediately see that all the individual spaces in the example are separate tokens (specifically token `220`), which means that when the `Transformer` attends to this text, it has to attend to every space individually. That is another way of saying that the tokenization is extremely wasteful here. (GPT-2 is extremely un-optimized for coding)\
Now we can change the tokenizer to `cl100k_base` (which is the GPT-4 tokenizer) and check the results, and we get something like this:
![Tokenizer_CL100KExample](ExplanationMedia/Images/Tokenizer_CL100KExample.png)
We immediately see that the token count has decreased, and that is because the number of tokens in the GPT-4 tokenizer's vocabulary is roughly double that of the GPT-2 tokenizer. This means we are now feeding much denser inputs to the `Transformer`, so it is able to see more of the previous context than before. But increasing the vocabulary size indefinitely is not good either, because the embedding table and the softmax layer of the transformer grow with it, and we end up doing a lot more computation than before. There is a *sweet spot* that gives us a nicely sized vocabulary in the end. I'd also like you to note that the **whitespace** handling of the GPT-4 tokenizer has improved a lot: runs of spaces get grouped into single tokens, which makes the tokenization of Python much more efficient. This was a deliberate choice made by OpenAI, because it densifies the information carried by each token, and the `Transformer` ends up being able to look further back into the context of the previous code.
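
To get a feel for the cost side of that trade-off, here is a rough back-of-the-envelope sketch. The embedding width of `768` is an assumption (it is the width commonly quoted for the smallest GPT-2 model), and `100_000` is only a round stand-in for the size of the `cl100k_base` vocabulary, so treat the numbers as illustrative rather than exact:

```python
def embeddingParameters(vocabularySize, embeddingWidth):
    """Number of weights in a token embedding table: one row of
    `embeddingWidth` numbers for every token in the vocabulary."""
    return vocabularySize * embeddingWidth

embeddingWidth = 768  # assumed width, for illustration only

# 50,257 comes from the GPT-2 paper; 100,000 is a rounded assumption.
vocabularySizes = {"GPT-2": 50_257, "cl100k_base (approx.)": 100_000}

tableSizes = {}
for name, vocabularySize in vocabularySizes.items():
    tableSizes[name] = embeddingParameters(vocabularySize, embeddingWidth)
    print(f"{name}: {tableSizes[name]:,} embedding weights")
```

The output layer has the same dependency: for every position it has to produce one logit per vocabulary entry before the softmax, so doubling the vocabulary roughly doubles that work too.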

# Coding By Understanding

Let's now start writing some code...

Let's understand what we are trying to do:
1. We want to take `strings` and we want to feed them into language models
2. For that we need to somehow `tokenize` strings into `integers` mapped to a fixed vocabulary
3. We will use those integers to make a *look-up* into a `lookupTable` of embedding vectors and feed those vectors into the transformer as an input
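
The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the character-level vocabulary, the embedding size of `4`, and the random values in `lookupTable` are all assumptions made for the sketch (in a real model the table is a learned parameter):

```python
import random

random.seed(0)  # make the illustrative random values reproducible

# Step 1: the string we want to feed into the model.
text = "hello"

# Step 2: tokenize into integers using a fixed (toy, character-level) vocabulary.
vocabulary = sorted(set("hello world"))
stoi = {ch: i for i, ch in enumerate(vocabulary)}
tokenIds = [stoi[ch] for ch in text]

# Step 3: look up one embedding vector per token id.
embeddingSize = 4  # assumed, tiny on purpose
lookupTable = [
    [random.uniform(-1, 1) for _ in range(embeddingSize)]
    for _ in vocabulary
]
vectors = [lookupTable[tokenId] for tokenId in tokenIds]

print(tokenIds)
print(len(vectors), "vectors of size", len(vectors[0]))
```

Notice that the two `l` characters map to the same token id and therefore to the exact same row of the table.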

But the reason this gets tricky is because:
1. We want to support different kinds of languages
2. We also want to support different kinds of special characters that we might find on the internet (for example emojis such as 👋)

Let's take a *toy-example* first:
```text
Hello in japanese. 👋 日本語でこんにちは。
```

So the question we now arrive at is, how do we feed this text into a `Transformer`?

Let's first dive into the definition of `strings`: in the <a href="https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str">python documentation</a> we find a line that says "Strings are immutable sequences of Unicode code points."

So, what are Unicode code points?

So now if we look up the <a href="https://en.wikipedia.org/wiki/Unicode">Unicode page</a> on Wikipedia, we learn that Unicode (formally The Unicode Standard) is a text encoding standard maintained by the Unicode Consortium.
It is essentially a definition of 149,813 characters across 161 scripts (what the characters look like and which integers represent them) **as of right now**.

And I say **as of right now** because, we can see that the standard is very much alive and keeps on changing.

And the way we can **access the Unicode code point of a character** is by using a function in python called <a href="https://docs.python.org/3/library/functions.html#ord">`ord()`</a>, which takes a single character as an input at a time.


So we can now experiment with codes:

For example, if we do:
```python
print(ord("H"))
print(ord("👋"))
print(ord("日"))
```
For which we get:
```python
72
128075
26085
```

So using the same idea we can look up all the characters in our *toy-example* string, using a for loop to take out each character and collecting the `ord()` output for each of them into a list, like this:
```python
print([ord(character) for character in "Hello in japanese. 👋 日本語でこんにちは。"])
```
For which we get:
```python
[72, 101, 108, 108, 111, 32, 105, 110, 32, 106, 97, 112, 97, 110, 101, 115, 101, 46, 32, 128075, 32, 26085, 26412, 35486, 12391, 12371, 12435, 12395, 12385, 12399, 12290]
```

Now see that we have already turned this raw text into integers, so we might arrive at the question, "why can't we just use these integers and not have any tokenization at all?".

One reason is that the vocabulary would be quite large. But the more dangerous reason is that, because the Unicode standard is very much alive, it keeps changing, which means that it is not a stable representation that we would want to use directly in our models...
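
To put a number on the first reason, Python can tell us how large the code point space is. This is just a small sketch using the standard `sys` and `unicodedata` modules; the Unicode version it prints depends on your Python build, which in itself illustrates the moving-target problem:

```python
import sys
import unicodedata

# The largest code point Python can represent is U+10FFFF,
# so the code point space has this many slots in total.
codePointSlots = sys.maxunicode + 1
print(f"{codePointSlots:,} possible code points")

# The Unicode database version bundled with this Python build.
print("Unicode version:", unicodedata.unidata_version)

# chr() is the inverse of ord(): integer -> character.
print(chr(128075))
```

Only a fraction of those slots are assigned today, but a vocabulary indexed directly by code point would still need to reserve room for all of them.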

So we need something a bit better at this point...

So now we turn to the idea of `encodings`...

# Unicode Encodings

## Encodings

Now, the Unicode Standard itself defines three encodings, `UTF-8`, `UTF-16`, and `UTF-32` (UTF stands for Unicode Transformation Format), though several others exist.

Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the Unicode text into sequences of bytes.

And `UTF-8` is by far the most popular encoding used in the real world...

Its own <a href="https://en.wikipedia.org/wiki/UTF-8">dedicated Wikipedia page</a> describes the conversion like this, as of now:


Code point ↔ UTF-8 conversion
| First code point              | Last code point | Byte 1    | Byte 2   | Byte 3   | Byte 4   |
|-------------------------------|------------|-----------|----------|----------|----------|
| U+00<span style="color:red;">0</span><span style="color:purple;">0</span>                | U+00<span style="color:red;">7</span><span style="color:purple;">F</span>               | 0<span style="color:red;">xxx</span><span style="color:purple;">xxxx</span> |          |          |          |
| U+0<span style="color:green;">0</span><span style="color:red;">8</span><span style="color:purple;">0</span>                | U+0<span style="color:green;">7</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 110<span style="color:green;">xxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |          |
| U+<span style="color:blue;">0</span><span style="color:green;">8</span><span style="color:red;">0</span><span style="color:purple;">0</span>                | U+<span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 1110<span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |
| U+<span style="color:crimson;">0</span><span style="color:orange;">1</span><span style="color:blue;">0</span><span style="color:green;">0</span><span style="color:red;">0</span><span style="color:purple;">0</span> | U+<span style="color:crimson;">1</span><span style="color:orange;">0</span><span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span> | 11110<span style="color:crimson;">x</span><span style="color:orange;">xx</span> | 10<span style="color:orange;">xx</span><span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |

The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 codepoints of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 codepoints in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols.

A whole graphic character can take more than 4 bytes, because it is made of more than one code point. For instance, a national flag character takes 8 bytes since it is "constructed from a pair of Unicode scalar values" both from outside the BMP.
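
We can check these byte counts ourselves. The sketch below picks one sample character per length class (the choice of samples is mine, not Wikipedia's) and writes them as escapes so the code points are explicit:

```python
# One sample per UTF-8 length class, plus a flag built from two code points.
samples = {
    "A": "\u0041",                            # ASCII range
    "theta": "\u03B8",                        # Greek, two-byte range
    "nichi": "\u65E5",                        # the kanji from our toy string
    "waving hand": "\U0001F44B",              # emoji, outside the BMP
    "flag of Japan": "\U0001F1EF\U0001F1F5",  # two regional indicator symbols
}

byteLengths = {}
for name, text in samples.items():
    codePoints = [f"U+{ord(ch):04X}" for ch in text]
    byteLengths[name] = len(text.encode("utf-8"))
    print(name, codePoints, byteLengths[name], "byte(s)")
```

The flag is the interesting case: it renders as a single graphic character but is a pair of four-byte code points, which is where the 8 bytes come from.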

I found this <a href="https://www.reedbeta.com/blog/programmers-intro-to-unicode/">blog post by Nathan Reed</a> pretty interesting and invite you to read it as well... The blog post also has a lot of links at the end of the article which are quite useful... \
One of them is <a href="https://utf8everywhere.org/">UTF-8 Everywhere - Manifesto</a> which discusses why `UTF-8` is preferred and is much nicer than the other encodings of Unicode.

But what all the above articles boil down to is that `UTF-8` is a variable-length encoding that encodes our characters into binary representations.

Let's simplify all these resources then:

Each letter or symbol you see on your screen has a special number assigned to it. `UTF-8` is a system that turns these numbers into a series of `1`'s and `0`'s (binary code) that computers can understand and store.

Understandable, but then what is the step-by-step process to do so?

Let's explain this with small examples:

1. **Input Text:** You have a piece of text that you want to encode in `UTF-8`. This text could be anything from a simple word to an entire document. Let's say your text is "Hello".
2. **Unicode Representation:** The basic elements of Unicode, its "*characters*" (although that term isn't quite right), are called **Code Points**. **Code Points** are identified by number, customarily written in hexadecimal with the prefix `U+`, such as `U+0041` (`A`, latin capital letter a) or `U+03B8` (`θ`, greek small letter theta). These code points are standardized numerical values that represent each character universally.
3. **Binary Representation:** Now, these Unicode code points need to be converted into binary form. `UTF-8` is a variable-width encoding, meaning different characters can be represented by different numbers of `bytes`. The binary representation is based on the Unicode code point.
4. **Determine Byte Length:** Based on the Unicode code point, `UTF-8` determines how many bytes are needed to represent the character. Characters with lower Unicode code points (usually basic Latin characters like A-Z, a-z, 0-9) require only one byte, while characters with higher code points require more bytes.
5. **Encoding the Character:** The binary representation of the Unicode code point is split into multiple bytes according to the rules of `UTF-8` encoding. Each byte starts with a prefix that specifies its length and position in the sequence.
6. **Adding Byte Markers:** `UTF-8` uses specific bit patterns to indicate how many bytes are used to represent a character. \
    For example:
    - Single-byte characters start with a 0 bit (e.g., `0xxxxxxx`).
    - Two-byte characters start with `110` (e.g., `110xxxxx`).
    - Three-byte characters start with `1110` (e.g., `1110xxxx`).
    - Four-byte characters start with `11110` (e.g., `11110xxx`).
7. **Appending Bytes:** Each byte after the first starts with the bit pattern `10`, indicating it's a continuation byte, and the remaining bits are filled with the binary representation of the character.
8. **Putting it All Together:** The individual `bytes` for each character are then concatenated together to form the `UTF-8` encoded sequence.
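
The steps above can be sketched by hand in Python. This is a teaching sketch, not a replacement for the built-in encoder: the function names are made up for illustration, and it skips the validation a real encoder performs (for example, rejecting surrogate code points):

```python
def utf8EncodeCodePoint(codePoint):
    """Encode one Unicode code point into a list of UTF-8 byte values."""
    if codePoint <= 0x7F:
        # 1 byte: 0xxxxxxx
        return [codePoint]
    if codePoint <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return [0b11000000 | (codePoint >> 6),
                0b10000000 | (codePoint & 0b111111)]
    if codePoint <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0b11100000 | (codePoint >> 12),
                0b10000000 | ((codePoint >> 6) & 0b111111),
                0b10000000 | (codePoint & 0b111111)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0b11110000 | (codePoint >> 18),
            0b10000000 | ((codePoint >> 12) & 0b111111),
            0b10000000 | ((codePoint >> 6) & 0b111111),
            0b10000000 | (codePoint & 0b111111)]

def utf8Encode(text):
    """Concatenate the UTF-8 bytes of every character in the text."""
    encoded = []
    for character in text:
        encoded.extend(utf8EncodeCodePoint(ord(character)))
    return encoded

for sample in ["Hello", "θ", "日", "👋"]:
    manual = utf8Encode(sample)
    print(sample, manual, [format(b, "08b") for b in manual])
    # Compare against Python's own encoder.
    assert manual == list(sample.encode("utf-8"))
```

Printing each byte in binary makes the prefixes from the list above (`0`, `110`, `1110`, `11110`, and `10` for continuation bytes) directly visible.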

Enough explanation, let's understand how this works by code now...

To do all the steps in a single line, Python already offers a built-in string method called <a href="https://docs.python.org/3/library/stdtypes.html#str.encode">`encode()`</a>. This method returns the <a href="https://docs.python.org/3/library/stdtypes.html#bytes-objects">`bytes`</a> sequence for that string... Let's take this concept for a spin now...

Let's say we have a code:
```python
print(ord("👋"))
print("👋".encode("UTF-8"))
print(list("👋".encode("UTF-8")))
```
For which we get:
```python
128075
b'\xf0\x9f\x91\x8b'
[240, 159, 145, 139]
```

You see how last time we used `ord()` to identify the Unicode code point?

This time, when we encoded the string, the method returned a stream of `4` bytes for a single character. And when we passed that `bytes` object to `list()`, each byte was shown as a decimal number instead of in hexadecimal (for example, `\xf0` is `240`), so we can read the binary information of each of those bytes as ordinary integers...
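
Going the other way is just as short. Here is a small sketch using the built-in `bytes()` constructor and `bytes.decode()`; the last part shows that an arbitrary slice of those bytes is not necessarily valid `UTF-8` on its own:

```python
byteValues = [240, 159, 145, 139]

# Build a bytes object from the integers and view each byte in hexadecimal.
rawBytes = bytes(byteValues)
hexValues = [hex(b) for b in rawBytes]
print(rawBytes)
print(hexValues)

# Decoding the four bytes together gives back the original character.
decoded = rawBytes.decode("utf-8")
print(decoded)

# A lone continuation byte cannot be decoded by itself.
try:
    bytes([159]).decode("utf-8")
    loneByteIsValid = True
except UnicodeDecodeError as error:
    loneByteIsValid = False
    print("invalid UTF-8:", error.reason)
```

This matters later on: once we start merging and splitting byte sequences, not every chunk of bytes will correspond to whole characters.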

Let's now try our old string and try to encode it into different encodings (`UTF-8`, `UTF-16`, and `UTF-32`):

```python
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-8")))
[72, 101, 108, 108, 111, 32, 105, 110, 32, 106, 97, 112, 97, 110, 101, 115, 101, 46, 32, 240, 159, 145, 139, 32, 230, 151, 165, 230, 156, 172, 232, 170, 158, 227, 129, 167, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 227, 128, 130]
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-16")))
[255, 254, 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 32, 0, 105, 0, 110, 0, 32, 0, 106, 0, 97, 0, 112, 0, 97, 0, 110, 0, 101, 0, 115, 0, 101, 0, 46, 0, 32, 0, 61, 216, 75, 220, 32, 0, 229, 101, 44, 103, 158, 138, 103, 48, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 2, 48]
>>> print(list("Hello in japanese. 👋 日本語でこんにちは。".encode("UTF-32")))
[255, 254, 0, 0, 72, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 32, 0, 0, 0, 105, 0, 0, 0, 110, 0, 0, 0, 32, 0, 0, 0, 106, 0, 0, 0, 97, 0, 0, 0, 112, 0, 0, 0, 97, 0, 0, 0, 110, 0, 0, 0, 101, 0, 0, 0, 115, 0, 0, 0, 101, 0, 0, 0, 46, 0, 0, 0, 32, 0, 0, 0, 75, 244, 1, 0, 32, 0, 0, 0, 229, 101, 0, 0, 44, 103, 0, 0, 158, 138, 0, 0, 103, 48, 0, 0, 83, 48, 0, 0, 147, 48, 0, 0, 107, 48, 0, 0, 97, 48, 0, 0, 111, 48, 0, 0, 2, 48, 0, 0]
```

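See how each time we end up using more `0`s?

This indicates that `UTF-16` and `UTF-32` are more and more wasteful when representing our characters: the simple ASCII characters get padded with zero bytes, and Python also prepends a byte-order mark (the leading `255, 254`).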
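
To make the comparison concrete, here is a small sketch that counts the total bytes and the zero bytes for the same string under each encoding (the byte order of the `UTF-16` and `UTF-32` output can differ between machines, but the lengths do not):

```python
text = "Hello in japanese. 👋 日本語でこんにちは。"

encodedLengths = {}
zeroCounts = {}
for encoding in ["utf-8", "utf-16", "utf-32"]:
    encoded = list(text.encode(encoding))
    encodedLengths[encoding] = len(encoded)
    zeroCounts[encoding] = encoded.count(0)
    print(f"{encoding}: {len(encoded)} bytes, {encoded.count(0)} of them zero")

print("code points:", len(text))
```

The same 31 code points take 54 bytes in `UTF-8`, 66 in `UTF-16` and 128 in `UTF-32`.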

So we will stick with `UTF-8` for our purposes for now...


## Problem with encodings

Now we arrive at the problem... Can you guess it?

Currently we have access to `byte-streams` using `UTF-8`, which implies that we have a vocabulary of only `256` possible `tokens` (one per byte value)...

So if we try to use `UTF-8` bytes naively, all of our text will be *stretched-out* into very, very long sequences of `bytes`.

What this effectively means is that the embedding table stays small and the softmax layer stays small as well, but the sequences become very long. And since a `Transformer` can only support a finite context window in its attention for computational reasons, this is inefficient and will not allow us to attend to sufficiently long text for the next-token prediction task...
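
Here is a rough sketch of how quickly a context window fills up if every byte is a token. The context size of `1024` is the GPT-2 figure quoted earlier; the two sample sentences are taken from our Tiktokenizer text and are only meant as an illustration:

```python
contextLength = 1024  # GPT-2's context size, counted in tokens

samples = {
    "English": "I have a hat.",
    "Japanese": "私はコンピューターが大好きです。",
}

charactersInContext = {}
for language, sentence in samples.items():
    byteCount = len(sentence.encode("utf-8"))
    bytesPerCharacter = byteCount / len(sentence)
    # With one token per byte, this many characters fit in the window.
    charactersInContext[language] = int(contextLength / bytesPerCharacter)
    print(f"{language}: {bytesPerCharacter:.1f} bytes per character, "
          f"about {charactersInContext[language]} characters per context")
```

So with raw bytes as tokens, the model would see roughly a page of English at a time, and only about a third as many Japanese characters.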

Let's wrap-up what we want...

We want to be able to support a larger vocabulary size that we can tune as a hyper-parameter, but we also want to stick with the `UTF-8` encoding...

So what do we do now?

The answer is none other than **Byte-Pair Encoding** or **BPE** where we compress these `byte-sequences` to a variable amount...

# Byte-Pair Encoding

## Megabyte Paper

Now, before we dive in, I'd like to point out that I would like nothing more than to feed raw byte sequences directly into language models...

In fact, there's a paper about how this could potentially be done: <a href="https://arxiv.org/abs/2305.07185">MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers</a>, published in 2023... The problem is that we actually have to go in and modify the transformer architecture according to the paper, because, as I mentioned, the attention block becomes extremely expensive for long sequences. So in the paper they propose a kind of hierarchical structure for the transformer that could allow us to feed in just the raw bytes...

But unfortunately, I don't know if this approach has been proven out yet by many groups at sufficient scale. Something like this would be amazing at some point, and I hope someone comes up with it...

But for now we have to come back to **BPE**, where we try to compress our large byte streams using the **Byte-Pair Encoding** algorithm.

## Understanding Byte-Pair Encoding Algorithm

Well the Wikipedia page for <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">**Byte-Pair Encoding (BPE)**</a> gives a detailed step-by-step example of how we can encode something using **BPE**...

And we will discuss it now...

This is the pseudocode for this algorithm:
1. Initialize the vocabulary with all the bytes or characters in the text corpus.
2. Calculate the frequency of each pair of consecutive bytes or characters in the text corpus.
3. Repeat the following steps until the desired vocabulary size is reached:
    1. Find the most frequent pair of consecutive units in the text corpus.
    2. Merge the pair to create a new subword unit.
    3. Update the frequency counts of all the pairs affected by the merge.
    4. Add the new subword unit to the vocabulary.
4. Represent the text corpus using the subword units in the vocabulary.

Let's take the example they take and discuss how this works...

They take an input sequence (`string`) like this:
```plaintext
aaabdaaabac
```
This implies that they have a vocabulary of `4` characters only:
```python
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
```
Now the algorithm tells us to iteratively find the most frequently occurring pair... \
For this example, the pair `aa` occurs the most at the moment... \
Once we have found our most frequent pair, we need to replace every single occurrence of that pair with a single new `token` that we append to our `vocabulary`...

Let's call our new token `Z` for now, and append this new `token` to the `vocabulary`. Let's call the additions to the `vocabulary` the `replacement table` for now...

So the resultant `encoded text` and the `replacement table` we have now is:
```plaintext
ZabdZabac
```
```json
{"aa": "Z"}
```

Next up, we *repeat* the process until either we have a **desired vocabulary size** or there are **no identifiable pairs** (hinting it cannot be compressed further)...

For them the next iterations looked like:
1. ```plaintext
   ZYdZYac
   ```
   ```json
   {"aa": "Z", "ab": "Y"}
   ```
2. ```plaintext
   XdXac
   ```
   ```json
   {"aa": "Z", "ab": "Y", "ZY": "X"}
   ```
After the last iteration, the text cannot be compressed further because no pair occurs more than once...

The point to be noted is that the `replacement table` acts as an extension of the original `vocabulary`, implying that in the end this example ended up with a vocabulary size of `4 + 3 = 7`, while the encoded text length is now `5`, down from the original `11`...
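
Before we move to bytes, here is a minimal character-level sketch of this toy example. The function names and the choice of `Z`, `Y`, `X` as new symbols are assumptions made for illustration (the sketch also assumes those letters do not appear in the input). When two pairs are tied, the sketch may pick a different one than the Wikipedia walkthrough, so the intermediate `replacement table` can differ even though the final encoded text is the same:

```python
from collections import Counter

def countPairs(sequence):
    """Count every pair of adjacent symbols in the sequence."""
    return Counter(zip(sequence, sequence[1:]))

def mergePair(sequence, pair, newSymbol):
    """Replace each left-to-right occurrence of `pair` with `newSymbol`."""
    merged = []
    i = 0
    while i < len(sequence):
        if i < len(sequence) - 1 and (sequence[i], sequence[i + 1]) == pair:
            merged.append(newSymbol)
            i += 2
        else:
            merged.append(sequence[i])
            i += 1
    return merged

def toyBpe(text, newSymbols="ZYX"):
    """Merge the most frequent pair until symbols run out or no pair repeats."""
    sequence = list(text)
    replacementTable = {}
    for newSymbol in newSymbols:
        pairCounts = countPairs(sequence)
        if not pairCounts:
            break
        pair, frequency = pairCounts.most_common(1)[0]
        if frequency < 2:
            break  # nothing left to compress
        replacementTable[pair[0] + pair[1]] = newSymbol
        sequence = mergePair(sequence, pair, newSymbol)
    return "".join(sequence), replacementTable

def toyDecode(encodedText, replacementTable):
    """Undo the merges in reverse order to recover the original text."""
    for pair, symbol in reversed(list(replacementTable.items())):
        encodedText = encodedText.replace(symbol, pair)
    return encodedText

encodedText, replacementTable = toyBpe("aaabdaaabac")
print(encodedText, replacementTable)
print(toyDecode(encodedText, replacementTable))
```

The decoder relies on the insertion order of the table: later merges may contain earlier symbols, so they have to be expanded first.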

So, in this exact same way, in our actual case we start out with `byte` sequences (a vocabulary size of `256`), run them through this algorithm to find the byte pairs that occur the most, and iteratively generate new `tokens`, appending them to our `vocabulary` and substituting them into our text sequence...

And in this way we are going to end up with a compressed training dataset and also an algorithm for encoding it using the `vocabulary` and decoding it back to the text.

So let's now implement what we know...

## Implementing Byte-Pair Algorithm

### Gathering text sequence

Let's first try to take a text as a toy example and get the raw `bytes` after encoding them with `UTF-8` and convert them into a list of integer sequences where each number ranges from `0` to `255` (just so that they become easier to work with in Python and visualize)...

For this, I asked ChatGPT to generate a text in Japanese about "How great computers are", translate it into English, and prepend the translation to the Japanese text. It gave me this:
```plaintext
When we talk about the wonders of computers, we are amazed by their diverse functions and capabilities. Computers efficiently support our daily lives and provide us with information instantly. It also serves as a platform to stimulate creativity and generate new ideas and innovations. With its advanced processing power and flexibility, we are able to accomplish more and more. The evolution of computers opens possibilities that will change our world and brighten our future. | コンピューターの素晴らしさについて語ると、その多様な機能と能力に驚かされます。コンピューターは私たちの日常生活を効率的に支援し、情報を瞬時に提供してくれます。また、創造性を刺激し、新しいアイデアや革新を生み出すプラットフォームとしても機能します。その高度な処理能力と柔軟性によって、私たちはますます多くのことを達成できるようになりました。コンピューターの進化は、私たちの世界を変え、未来を明るくする可能性を開いています。
```
This seems like a good example for us to implement the algorithm on...

So now, to encode this text using `UTF-8` and convert it into a list of integers where each number ranges from `0` to `255`, we write the code:
```python
unicodetext = "When we talk about the wonders of computers, we are amazed by their diverse functions and capabilities. Computers efficiently support our daily lives and provide us with information instantly. It also serves as a platform to stimulate creativity and generate new ideas and innovations. With its advanced processing power and flexibility, we are able to accomplish more and more. The evolution of computers opens possibilities that will change our world and brighten our future. | コンピューターの素晴らしさについて語ると、その多様な機能と能力に驚かされます。コンピューターは私たちの日常生活を効率的に支援し、情報を瞬時に提供してくれます。また、創造性を刺激し、新しいアイデアや革新を生み出すプラットフォームとしても機能します。その高度な処理能力と柔軟性によって、私たちはますます多くのことを達成できるようになりました。コンピューターの進化は、私たちの世界を変え、未来を明るくする可能性を開いています。"
rawByteTokens = unicodetext.encode("UTF-8")
decimalTokens = list(map(int, rawByteTokens))

print(rawByteTokens)
print(decimalTokens)
```
For which we get:
```python
b'When we talk about the wonders of computers, we are amazed by their diverse functions and capabilities. Computers efficiently support our daily lives and provide us with information instantly. It also serves as a platform to stimulate creativity and generate new ideas and innovations. With its advanced processing power and flexibility, we are able to accomplish more and more. The evolution of computers opens possibilities that will change our world and brighten our future. | \xe3\x82\xb3\xe3\x83\xb3\xe3\x83\x94\xe3\x83\xa5\xe3\x83\xbc\xe3\x82\xbf\xe3\x83\xbc\xe3\x81\xae\xe7\xb4\xa0\xe6\x99\xb4\xe3\x82\x89\xe3\x81\x97\xe3\x81\x95\xe3\x81\xab\xe3\x81\xa4\xe3\x81\x84\xe3\x81\xa6\xe8\xaa\x9e\xe3\x82\x8b\xe3\x81\xa8\xe3\x80\x81\xe3\x81\x9d\xe3\x81\xae\xe5\xa4\x9a\xe6\xa7\x98\xe3\x81\xaa\xe6\xa9\x9f\xe8\x83\xbd\xe3\x81\xa8\xe8\x83\xbd\xe5\x8a\x9b\xe3\x81\xab\xe9\xa9\x9a\xe3\x81\x8b\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82\xe3\x82\xb3\xe3\x83\xb3\xe3\x83\x94\xe3\x83\xa5\xe3\x83\xbc\xe3\x82\xbf\xe3\x83\xbc\xe3\x81\xaf\xe7\xa7\x81\xe3\x81\x9f\xe3\x81\xa1\xe3\x81\xae\xe6\x97\xa5\xe5\xb8\xb8\xe7\x94\x9f\xe6\xb4\xbb\xe3\x82\x92\xe5\x8a\xb9\xe7\x8e\x87\xe7\x9a\x84\xe3\x81\xab\xe6\x94\xaf\xe6\x8f\xb4\xe3\x81\x97\xe3\x80\x81\xe6\x83\x85\xe5\xa0\xb1\xe3\x82\x92\xe7\x9e\xac\xe6\x99\x82\xe3\x81\xab\xe6\x8f\x90\xe4\xbe\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x8f\xe3\x82\x8c\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82\xe3\x81\xbe\xe3\x81\x9f\xe3\x80\x81\xe5\x89\xb5\xe9\x80\xa0\xe6\x80\xa7\xe3\x82\x92\xe5\x88\xba\xe6\xbf\x80\xe3\x81\x97\xe3\x80\x81\xe6\x96\xb0\xe3\x81\x97\xe3\x81\x84\xe3\x82\xa2\xe3\x82\xa4\xe3\x83\x87\xe3\x82\xa2\xe3\x82\x84\xe9\x9d\xa9\xe6\x96\xb0\xe3\x82\x92\xe7\x94\x9f\xe3\x81\xbf\xe5\x87\xba\xe3\x81\x99\xe3\x83\x97\xe3\x83\xa9\xe3\x83\x83\xe3\x83\x88\xe3\x83\x95\xe3\x82\xa9\xe3\x83\xbc\xe3\x83\xa0\xe3\x81\xa8\xe3\x81\x97\xe3\x81\xa6\xe3\x82\x82\xe6\xa9\x9f\xe8\x83\xbd\xe3\x81\x97\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82\xe3\x81\x9d\xe3\x81\xae\xe9\xab\x98\xe5\x
ba\xa6\xe3\x81\xaa\xe5\x87\xa6\xe7\x90\x86\xe8\x83\xbd\xe5\x8a\x9b\xe3\x81\xa8\xe6\x9f\x94\xe8\xbb\x9f\xe6\x80\xa7\xe3\x81\xab\xe3\x82\x88\xe3\x81\xa3\xe3\x81\xa6\xe3\x80\x81\xe7\xa7\x81\xe3\x81\x9f\xe3\x81\xa1\xe3\x81\xaf\xe3\x81\xbe\xe3\x81\x99\xe3\x81\xbe\xe3\x81\x99\xe5\xa4\x9a\xe3\x81\x8f\xe3\x81\xae\xe3\x81\x93\xe3\x81\xa8\xe3\x82\x92\xe9\x81\x94\xe6\x88\x90\xe3\x81\xa7\xe3\x81\x8d\xe3\x82\x8b\xe3\x82\x88\xe3\x81\x86\xe3\x81\xab\xe3\x81\xaa\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x97\xe3\x81\x9f\xe3\x80\x82\xe3\x82\xb3\xe3\x83\xb3\xe3\x83\x94\xe3\x83\xa5\xe3\x83\xbc\xe3\x82\xbf\xe3\x83\xbc\xe3\x81\xae\xe9\x80\xb2\xe5\x8c\x96\xe3\x81\xaf\xe3\x80\x81\xe7\xa7\x81\xe3\x81\x9f\xe3\x81\xa1\xe3\x81\xae\xe4\xb8\x96\xe7\x95\x8c\xe3\x82\x92\xe5\xa4\x89\xe3\x81\x88\xe3\x80\x81\xe6\x9c\xaa\xe6\x9d\xa5\xe3\x82\x92\xe6\x98\x8e\xe3\x82\x8b\xe3\x81\x8f\xe3\x81\x99\xe3\x82\x8b\xe5\x8f\xaf\xe8\x83\xbd\xe6\x80\xa7\xe3\x82\x92\xe9\x96\x8b\xe3\x81\x84\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'
[87, 104, 101, 110, 32, 119, 101, 32, 116, 97, 108, 107, 32, 97, 98, 111, 117, 116, 32, 116, 104, 101, 32, 119, 111, 110, 100, 101, 114, 115, 32, 111, 102, 32, 99, 111, 109, 112, 117, 116, 101, 114, 115, 44, 32, 119, 101, 32, 97, 114, 101, 32, 97, 109, 97, 122, 101, 100, 32, 98, 121, 32, 116, 104, 101, 105, 114, 32, 100, 105, 118, 101, 114, 115, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 110, 100, 32, 99, 97, 112, 97, 98, 105, 108, 105, 116, 105, 101, 115, 46, 32, 67, 111, 109, 112, 117, 116, 101, 114, 115, 32, 101, 102, 102, 105, 99, 105, 101, 110, 116, 108, 121, 32, 115, 117, 112, 112, 111, 114, 116, 32, 111, 117, 114, 32, 100, 97, 105, 108, 121, 32, 108, 105, 118, 101, 115, 32, 97, 110, 100, 32, 112, 114, 111, 118, 105, 100, 101, 32, 117, 115, 32, 119, 105, 116, 104, 32, 105, 110, 102, 111, 114, 109, 97, 116, 105, 111, 110, 32, 105, 110, 115, 116, 97, 110, 116, 108, 121, 46, 32, 73, 116, 32, 97, 108, 115, 111, 32, 115, 101, 114, 118, 101, 115, 32, 97, 115, 32, 97, 32, 112, 108, 97, 116, 102, 111, 114, 109, 32, 116, 111, 32, 115, 116, 105, 109, 117, 108, 97, 116, 101, 32, 99, 114, 101, 97, 116, 105, 118, 105, 116, 121, 32, 97, 110, 100, 32, 103, 101, 110, 101, 114, 97, 116, 101, 32, 110, 101, 119, 32, 105, 100, 101, 97, 115, 32, 97, 110, 100, 32, 105, 110, 110, 111, 118, 97, 116, 105, 111, 110, 115, 46, 32, 87, 105, 116, 104, 32, 105, 116, 115, 32, 97, 100, 118, 97, 110, 99, 101, 100, 32, 112, 114, 111, 99, 101, 115, 115, 105, 110, 103, 32, 112, 111, 119, 101, 114, 32, 97, 110, 100, 32, 102, 108, 101, 120, 105, 98, 105, 108, 105, 116, 121, 44, 32, 119, 101, 32, 97, 114, 101, 32, 97, 98, 108, 101, 32, 116, 111, 32, 97, 99, 99, 111, 109, 112, 108, 105, 115, 104, 32, 109, 111, 114, 101, 32, 97, 110, 100, 32, 109, 111, 114, 101, 46, 32, 84, 104, 101, 32, 101, 118, 111, 108, 117, 116, 105, 111, 110, 32, 111, 102, 32, 99, 111, 109, 112, 117, 116, 101, 114, 115, 32, 111, 112, 101, 110, 115, 32, 112, 111, 115, 115, 105, 98, 105, 108, 105, 116, 105, 101, 
115, 32, 116, 104, 97, 116, 32, 119, 105, 108, 108, 32, 99, 104, 97, 110, 103, 101, 32, 111, 117, 114, 32, 119, 111, 114, 108, 100, 32, 97, 110, 100, 32, 98, 114, 105, 103, 104, 116, 101, 110, 32, 111, 117, 114, 32, 102, 117, 116, 117, 114, 101, 46, 32, 124, 32, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 227, 129, 174, 231, 180, 160, 230, 153, 180, 227, 130, 137, 227, 129, 151, 227, 129, 149, 227, 129, 171, 227, 129, 164, 227, 129, 132, 227, 129, 166, 232, 170, 158, 227, 130, 139, 227, 129, 168, 227, 128, 129, 227, 129, 157, 227, 129, 174, 229, 164, 154, 230, 167, 152, 227, 129, 170, 230, 169, 159, 232, 131, 189, 227, 129, 168, 232, 131, 189, 229, 138, 155, 227, 129, 171, 233, 169, 154, 227, 129, 139, 227, 129, 149, 227, 130, 140, 227, 129, 190, 227, 129, 153, 227, 128, 130, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 227, 129, 175, 231, 167, 129, 227, 129, 159, 227, 129, 161, 227, 129, 174, 230, 151, 165, 229, 184, 184, 231, 148, 159, 230, 180, 187, 227, 130, 146, 229, 138, 185, 231, 142, 135, 231, 154, 132, 227, 129, 171, 230, 148, 175, 230, 143, 180, 227, 129, 151, 227, 128, 129, 230, 131, 133, 229, 160, 177, 227, 130, 146, 231, 158, 172, 230, 153, 130, 227, 129, 171, 230, 143, 144, 228, 190, 155, 227, 129, 151, 227, 129, 166, 227, 129, 143, 227, 130, 140, 227, 129, 190, 227, 129, 153, 227, 128, 130, 227, 129, 190, 227, 129, 159, 227, 128, 129, 229, 137, 181, 233, 128, 160, 230, 128, 167, 227, 130, 146, 229, 136, 186, 230, 191, 128, 227, 129, 151, 227, 128, 129, 230, 150, 176, 227, 129, 151, 227, 129, 132, 227, 130, 162, 227, 130, 164, 227, 131, 135, 227, 130, 162, 227, 130, 132, 233, 157, 169, 230, 150, 176, 227, 130, 146, 231, 148, 159, 227, 129, 191, 229, 135, 186, 227, 129, 153, 227, 131, 151, 227, 131, 169, 227, 131, 131, 227, 131, 136, 227, 131, 149, 227, 130, 169, 227, 131, 188, 227, 131, 160, 227, 129, 168, 227, 129, 151, 227, 129, 166, 227, 130, 130, 230, 169, 159, 232, 131, 189, 227, 129, 151, 227, 129, 190, 227, 129, 153, 227, 128, 130, 227, 129, 157, 227, 129, 174, 233, 171, 152, 229, 186, 166, 227, 129, 170, 229, 135, 166, 231, 144, 134, 232, 131, 189, 
229, 138, 155, 227, 129, 168, 230, 159, 148, 232, 187, 159, 230, 128, 167, 227, 129, 171, 227, 130, 136, 227, 129, 163, 227, 129, 166, 227, 128, 129, 231, 167, 129, 227, 129, 159, 227, 129, 161, 227, 129, 175, 227, 129, 190, 227, 129, 153, 227, 129, 190, 227, 129, 153, 229, 164, 154, 227, 129, 143, 227, 129, 174, 227, 129, 147, 227, 129, 168, 227, 130, 146, 233, 129, 148, 230, 136, 144, 227, 129, 167, 227, 129, 141, 227, 130, 139, 227, 130, 136, 227, 129, 134, 227, 129, 171, 227, 129, 170, 227, 130, 138, 227, 129, 190, 227, 129, 151, 227, 129, 159, 227, 128, 130, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 227, 129, 174, 233, 128, 178, 229, 140, 150, 227, 129, 175, 227, 128, 129, 231, 167, 129, 227, 129, 159, 227, 129, 161, 227, 129, 174, 228, 184, 150, 231, 149, 140, 227, 130, 146, 229, 164, 137, 227, 129, 136, 227, 128, 129, 230, 156, 170, 230, 157, 165, 227, 130, 146, 230, 152, 142, 227, 130, 139, 227, 129, 143, 227, 129, 153, 227, 130, 139, 229, 143, 175, 232, 131, 189, 230, 128, 167, 227, 130, 146, 233, 150, 139, 227, 129, 132, 227, 129, 166, 227, 129, 132, 227, 129, 190, 227, 129, 153, 227, 128, 130]
```

### Finding the most frequently occurring `byte` pair

Now that we have our text sequence ready, we can start implementing the algorithm...

There are many different ways to do this, but this is what I came up with...

To do that, we first need to know which `byte` pair occurs the most... So we need the frequency of every **unique** `byte` pair, and Python dictionaries help here (since their keys are unique)... We can walk through the sequence with `zip()`, pairing each byte with the byte that follows it, and increment that pair's count as we go, like this:
```python
frequencyCounts = {}
for pair in zip(decimalTokens, decimalTokens[1:]):
    frequencyCounts[pair] = frequencyCounts.get(pair, 0) + 1
print(frequencyCounts)
```
For which we get:
```python
{(87, 104): 1, (104, 101): 4, (101, 110): 5, (110, 32): 4, (32, 119): 7, (119, 101): 4, (101, 32): 14, (32, 116): 6, (116, 97): 2, (97, 108): 2, (108, 107): 1, (107, 32): 1, (32, 97): 17, (97, 98): 3, (98, 111): 1, (111, 117): 4, (117, 116): 6, (116, 32): 4, (116, 104): 5, (119, 111): 2, (111, 110): 5, (110, 100): 8, (100, 101): 3, (101, 114): 8, (114, 115): 5, (115, 32): 12, (32, 111): 6, (111, 102): 2, (102, 32): 2, (32, 99): 5, (99, 111): 3, (111, 109): 4, (109, 112): 4, (112, 117): 3, (116, 101): 6, (115, 44): 1, (44, 32): 2, (97, 114): 2, (114, 101): 6, (97, 109): 1, (109, 97): 2, (97, 122): 1, (122, 101): 1, (101, 100): 2, (100, 32): 10, (32, 98): 2, (98, 121): 1, (121, 32): 4, (101, 105): 1, (105, 114): 1, (114, 32): 5, (32, 100): 2, (100, 105): 1, (105, 118): 3, (118, 101): 3, (115, 101): 2, (32, 102): 3, (102, 117): 2, (117, 110): 1, (110, 99): 2, (99, 116): 1, (116, 105): 8, (105, 111): 4, (110, 115): 4, (97, 110): 10, (99, 97): 1, (97, 112): 1, (112, 97): 1, (98, 105): 3, (105, 108): 5, (108, 105): 5, (105, 116): 7, (105, 101): 3, (101, 115): 5, (115, 46): 2, (46, 32): 5, (32, 67): 1, (67, 111): 1, (32, 101): 2, (101, 102): 1, (102, 102): 1, (102, 105): 1, (105, 99): 1, (99, 105): 1, (110, 116): 2, (116, 108): 2, (108, 121): 3, (32, 115): 3, (115, 117): 1, (117, 112): 1, (112, 112): 1, (112, 111): 3, (111, 114): 6, (114, 116): 1, (117, 114): 4, (100, 97): 1, (97, 105): 1, (32, 108): 1, (32, 112): 5, (112, 114): 2, (114, 111): 2, (111, 118): 2, (118, 105): 2, (105, 100): 2, (32, 117): 1, (117, 115): 1, (119, 105): 2, (104, 32): 3, (32, 105): 5, (105, 110): 4, (110, 102): 1, (102, 111): 2, (114, 109): 2, (97, 116): 7, (115, 116): 2, (121, 46): 1, (32, 73): 1, (73, 116): 1, (108, 115): 1, (115, 111): 1, (111, 32): 3, (114, 118): 1, (97, 115): 2, (97, 32): 1, (112, 108): 2, (108, 97): 2, (116, 102): 1, (109, 32): 1, (116, 111): 2, (105, 109): 1, (109, 117): 1, (117, 108): 1, (99, 114): 1, (101, 97): 2, (116, 121): 2, (32, 103): 1, (103, 101): 2, (110, 101): 
2, (114, 97): 1, (32, 110): 1, (101, 119): 1, (119, 32): 1, (110, 110): 1, (110, 111): 1, (118, 97): 2, (32, 87): 1, (87, 105): 1, (116, 115): 1, (97, 100): 1, (100, 118): 1, (99, 101): 2, (111, 99): 1, (115, 115): 2, (115, 105): 2, (110, 103): 2, (103, 32): 1, (111, 119): 1, (102, 108): 1, (108, 101): 2, (101, 120): 1, (120, 105): 1, (105, 98): 2, (121, 44): 1, (98, 108): 1, (97, 99): 1, (99, 99): 1, (105, 115): 1, (115, 104): 1, (32, 109): 2, (109, 111): 2, (101, 46): 2, (32, 84): 1, (84, 104): 1, (101, 118): 1, (118, 111): 1, (111, 108): 1, (108, 117): 1, (111, 112): 1, (112, 101): 1, (111, 115): 1, (104, 97): 2, (108, 108): 1, (108, 32): 1, (99, 104): 1, (114, 108): 1, (108, 100): 1, (98, 114): 1, (114, 105): 1, (105, 103): 1, (103, 104): 1, (104, 116): 1, (116, 117): 1, (32, 124): 1, (124, 32): 1, (32, 227): 1, (227, 130): 30, (130, 179): 3, (179, 227): 6, (227, 131): 23, (131, 179): 3, (131, 148): 3, (148, 227): 3, (131, 165): 3, (165, 227): 4, (131, 188): 7, (188, 227): 7, (130, 191): 3, (191, 227): 3, (227, 129): 81, (129, 174): 7, (174, 231): 1, (231, 180): 1, (180, 160): 1, (160, 230): 2, (230, 153): 2, (153, 180): 1, (180, 227): 2, (130, 137): 1, (137, 227): 2, (129, 151): 8, (151, 227): 9, (129, 149): 2, (149, 227): 3, (129, 171): 6, (171, 227): 3, (129, 164): 1, (164, 227): 2, (129, 132): 4, (132, 227): 5, (129, 166): 5, (166, 232): 1, (232, 170): 1, (170, 158): 1, (158, 227): 1, (130, 139): 4, (139, 227): 5, (129, 168): 5, (168, 227): 3, (227, 128): 12, (128, 129): 7, (129, 227): 4, (129, 157): 2, (157, 227): 2, (174, 229): 1, (229, 164): 3, (164, 154): 2, (154, 230): 1, (230, 167): 1, (167, 152): 1, (152, 227): 1, (129, 170): 3, (170, 230): 2, (230, 169): 2, (169, 159): 2, (159, 232): 2, (232, 131): 5, (131, 189): 5, (189, 227): 2, (168, 232): 1, (189, 229): 2, (229, 138): 3, (138, 155): 2, (155, 227): 3, (171, 233): 1, (233, 169): 1, (169, 154): 1, (154, 227): 2, (129, 139): 1, (130, 140): 2, (140, 227): 3, (129, 190): 8, (190, 227): 8, (129, 153): 
8, (153, 227): 7, (128, 130): 5, (130, 227): 5, (129, 175): 3, (175, 231): 1, (231, 167): 3, (167, 129): 3, (129, 159): 5, (159, 227): 6, (129, 161): 3, (161, 227): 3, (174, 230): 1, (230, 151): 1, (151, 165): 1, (165, 229): 1, (229, 184): 1, (184, 184): 1, (184, 231): 1, (231, 148): 2, (148, 159): 2, (159, 230): 2, (230, 180): 1, (180, 187): 1, (187, 227): 1, (130, 146): 8, (146, 229): 3, (138, 185): 1, (185, 231): 1, (231, 142): 1, (142, 135): 1, (135, 231): 1, (231, 154): 1, (154, 132): 1, (171, 230): 2, (230, 148): 1, (148, 175): 1, (175, 230): 1, (230, 143): 2, (143, 180): 1, (129, 230): 3, (230, 131): 1, (131, 133): 1, (133, 229): 1, (229, 160): 1, (160, 177): 1, (177, 227): 1, (146, 231): 2, (231, 158): 1, (158, 172): 1, (172, 230): 1, (153, 130): 1, (143, 144): 1, (144, 228): 1, (228, 190): 1, (190, 155): 1, (166, 227): 5, (129, 143): 3, (143, 227): 3, (129, 229): 1, (229, 137): 1, (137, 181): 1, (181, 233): 1, (233, 128): 2, (128, 160): 1, (230, 128): 3, (128, 167): 3, (167, 227): 4, (229, 136): 1, (136, 186): 1, (186, 230): 1, (230, 191): 1, (191, 128): 1, (128, 227): 1, (230, 150): 2, (150, 176): 2, (176, 227): 2, (130, 162): 2, (162, 227): 2, (130, 164): 1, (131, 135): 1, (135, 227): 1, (130, 132): 1, (132, 233): 1, (233, 157): 1, (157, 169): 1, (169, 230): 1, (129, 191): 1, (191, 229): 1, (229, 135): 2, (135, 186): 1, (186, 227): 1, (131, 151): 1, (131, 169): 1, (169, 227): 2, (131, 131): 1, (131, 227): 1, (131, 136): 1, (136, 227): 4, (131, 149): 1, (130, 169): 1, (131, 160): 1, (160, 227): 1, (130, 130): 1, (130, 230): 1, (174, 233): 2, (233, 171): 1, (171, 152): 1, (152, 229): 1, (229, 186): 1, (186, 166): 1, (170, 229): 1, (135, 166): 1, (166, 231): 1, (231, 144): 1, (144, 134): 1, (134, 232): 1, (168, 230): 1, (230, 159): 1, (159, 148): 1, (148, 232): 1, (232, 187): 1, (187, 159): 1, (130, 136): 2, (129, 163): 1, (163, 227): 1, (129, 231): 2, (175, 227): 2, (153, 229): 1, (174, 227): 1, (129, 147): 1, (147, 227): 1, (146, 233): 2, (233, 129): 1, 
(129, 148): 1, (148, 230): 1, (230, 136): 1, (136, 144): 1, (144, 227): 1, (129, 167): 1, (129, 141): 1, (141, 227): 1, (129, 134): 1, (134, 227): 1, (170, 227): 1, (130, 138): 1, (138, 227): 1, (128, 178): 1, (178, 229): 1, (229, 140): 1, (140, 150): 1, (150, 227): 1, (174, 228): 1, (228, 184): 1, (184, 150): 1, (150, 231): 1, (231, 149): 1, (149, 140): 1, (164, 137): 1, (129, 136): 1, (230, 156): 1, (156, 170): 1, (230, 157): 1, (157, 165): 1, (146, 230): 1, (230, 152): 1, (152, 142): 1, (142, 227): 1, (139, 229): 1, (229, 143): 1, (143, 175): 1, (175, 232): 1, (189, 230): 1, (233, 150): 1, (150, 139): 1}
```
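As a side note, the same tally can be sketched with `collections.Counter` from the standard library. This is only an alternative illustration, not what the rest of this notebook uses, and `toyTokens`, `manualCounts` and `counterCounts` are hypothetical names on a made-up toy sequence:

```python
from collections import Counter

# Hypothetical toy sequence, standing in for the notebook's decimalTokens
toyTokens = [1, 2, 3, 1, 2, 1, 2]

# Manual tally, same idea as the dictionary loop above
manualCounts = {}
for pair in zip(toyTokens, toyTokens[1:]):
    manualCounts[pair] = manualCounts.get(pair, 0) + 1

# Counter performs the same tally in a single call
counterCounts = Counter(zip(toyTokens, toyTokens[1:]))

print(dict(counterCounts) == manualCounts)  # True
print(counterCounts.most_common(1))         # [((1, 2), 3)]
```

Both approaches count `len(toyTokens) - 1` pairs in total, because every token except the last one starts exactly one pair.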

Now that we have the frequency of every **unique** `byte` pair, we can move this code into a function and print the output sorted by count (in descending order)...
So now we have:
```python
def getPairFrequency(tokens):
    frequencyCounts = {}
    for pair in zip(tokens, tokens[1:]):
        frequencyCounts[pair] = frequencyCounts.get(pair, 0) + 1
    return frequencyCounts

pairFrequency = getPairFrequency(decimalTokens)
print(sorted(((value, key) for key, value in pairFrequency.items()), reverse=True))
```
For which we get:
```python
[(81, (227, 129)), (30, (227, 130)), (23, (227, 131)), (17, (32, 97)), (14, (101, 32)), (12, (227, 128)), (12, (115, 32)), (10, (100, 32)), (10, (97, 110)), (9, (151, 227)), (8, (190, 227)), (8, (130, 146)), (8, (129, 190)), (8, (129, 153)), (8, (129, 151)), (8, (116, 105)), (8, (110, 100)), (8, (101, 114)), (7, (188, 227)), (7, (153, 227)), (7, (131, 188)), (7, (129, 174)), (7, (128, 129)), (7, (105, 116)), (7, (97, 116)), (7, (32, 119)), (6, (179, 227)), (6, (159, 227)), (6, (129, 171)), (6, (117, 116)), (6, (116, 101)), (6, (114, 101)), (6, (111, 114)), (6, (32, 116)), (6, (32, 111)), (5, (232, 131)), (5, (166, 227)), (5, (139, 227)), (5, (132, 227)), (5, (131, 189)), (5, (130, 227)), (5, (129, 168)), (5, (129, 166)), (5, (129, 159)), (5, (128, 130)), (5, (116, 104)), (5, (114, 115)), (5, (114, 32)), (5, (111, 110)), (5, (108, 105)), (5, (105, 108)), (5, (101, 115)), (5, (101, 110)), (5, (46, 32)), (5, (32, 112)), (5, (32, 105)), (5, (32, 99)), (4, (167, 227)), (4, (165, 227)), (4, (136, 227)), (4, (130, 139)), (4, (129, 227)), (4, (129, 132)), (4, (121, 32)), (4, (119, 101)), (4, (117, 114)), (4, (116, 32)), (4, (111, 117)), (4, (111, 109)), (4, (110, 115)), (4, (110, 32)), (4, (109, 112)), (4, (105, 111)), (4, (105, 110)), (4, (104, 101)), (3, (231, 167)), (3, (230, 128)), (3, (229, 164)), (3, (229, 138)), (3, (191, 227)), (3, (171, 227)), (3, (168, 227)), (3, (167, 129)), (3, (161, 227)), (3, (155, 227)), (3, (149, 227)), (3, (148, 227)), (3, (146, 229)), (3, (143, 227)), (3, (140, 227)), (3, (131, 179)), (3, (131, 165)), (3, (131, 148)), (3, (130, 191)), (3, (130, 179)), (3, (129, 230)), (3, (129, 175)), (3, (129, 170)), (3, (129, 161)), (3, (129, 143)), (3, (128, 167)), (3, (118, 101)), (3, (112, 117)), (3, (112, 111)), (3, (111, 32)), (3, (108, 121)), (3, (105, 118)), (3, (105, 101)), (3, (104, 32)), (3, (100, 101)), (3, (99, 111)), (3, (98, 105)), (3, (97, 98)), (3, (32, 115)), (3, (32, 102)), (2, (233, 128)), (2, (231, 148)), (2, (230, 169)), (2, (230, 
153)), (2, (230, 150)), (2, (230, 143)), (2, (229, 135)), (2, (189, 229)), (2, (189, 227)), (2, (180, 227)), (2, (176, 227)), (2, (175, 227)), (2, (174, 233)), (2, (171, 230)), (2, (170, 230)), (2, (169, 227)), (2, (169, 159)), (2, (164, 227)), (2, (164, 154)), (2, (162, 227)), (2, (160, 230)), (2, (159, 232)), (2, (159, 230)), (2, (157, 227)), (2, (154, 227)), (2, (150, 176)), (2, (148, 159)), (2, (146, 233)), (2, (146, 231)), (2, (138, 155)), (2, (137, 227)), (2, (130, 162)), (2, (130, 140)), (2, (130, 136)), (2, (129, 231)), (2, (129, 157)), (2, (129, 149)), (2, (119, 111)), (2, (119, 105)), (2, (118, 105)), (2, (118, 97)), (2, (116, 121)), (2, (116, 111)), (2, (116, 108)), (2, (116, 97)), (2, (115, 116)), (2, (115, 115)), (2, (115, 105)), (2, (115, 101)), (2, (115, 46)), (2, (114, 111)), (2, (114, 109)), (2, (112, 114)), (2, (112, 108)), (2, (111, 118)), (2, (111, 102)), (2, (110, 116)), (2, (110, 103)), (2, (110, 101)), (2, (110, 99)), (2, (109, 111)), (2, (109, 97)), (2, (108, 101)), (2, (108, 97)), (2, (105, 100)), (2, (105, 98)), (2, (104, 97)), (2, (103, 101)), (2, (102, 117)), (2, (102, 111)), (2, (102, 32)), (2, (101, 100)), (2, (101, 97)), (2, (101, 46)), (2, (99, 101)), (2, (97, 115)), (2, (97, 114)), (2, (97, 108)), (2, (44, 32)), (2, (32, 109)), (2, (32, 101)), (2, (32, 100)), (2, (32, 98)), (1, (233, 171)), (1, (233, 169)), (1, (233, 157)), (1, (233, 150)), (1, (233, 129)), (1, (232, 187)), (1, (232, 170)), (1, (231, 180)), (1, (231, 158)), (1, (231, 154)), (1, (231, 149)), (1, (231, 144)), (1, (231, 142)), (1, (230, 191)), (1, (230, 180)), (1, (230, 167)), (1, (230, 159)), (1, (230, 157)), (1, (230, 156)), (1, (230, 152)), (1, (230, 151)), (1, (230, 148)), (1, (230, 136)), (1, (230, 131)), (1, (229, 186)), (1, (229, 184)), (1, (229, 160)), (1, (229, 143)), (1, (229, 140)), (1, (229, 137)), (1, (229, 136)), (1, (228, 190)), (1, (228, 184)), (1, (191, 229)), (1, (191, 128)), (1, (190, 155)), (1, (189, 230)), (1, (187, 227)), (1, (187, 159)), (1, 
(186, 230)), (1, (186, 227)), (1, (186, 166)), (1, (185, 231)), (1, (184, 231)), (1, (184, 184)), (1, (184, 150)), (1, (181, 233)), (1, (180, 187)), (1, (180, 160)), (1, (178, 229)), (1, (177, 227)), (1, (175, 232)), (1, (175, 231)), (1, (175, 230)), (1, (174, 231)), (1, (174, 230)), (1, (174, 229)), (1, (174, 228)), (1, (174, 227)), (1, (172, 230)), (1, (171, 233)), (1, (171, 152)), (1, (170, 229)), (1, (170, 227)), (1, (170, 158)), (1, (169, 230)), (1, (169, 154)), (1, (168, 232)), (1, (168, 230)), (1, (167, 152)), (1, (166, 232)), (1, (166, 231)), (1, (165, 229)), (1, (164, 137)), (1, (163, 227)), (1, (160, 227)), (1, (160, 177)), (1, (159, 148)), (1, (158, 227)), (1, (158, 172)), (1, (157, 169)), (1, (157, 165)), (1, (156, 170)), (1, (154, 230)), (1, (154, 132)), (1, (153, 229)), (1, (153, 180)), (1, (153, 130)), (1, (152, 229)), (1, (152, 227)), (1, (152, 142)), (1, (151, 165)), (1, (150, 231)), (1, (150, 227)), (1, (150, 139)), (1, (149, 140)), (1, (148, 232)), (1, (148, 230)), (1, (148, 175)), (1, (147, 227)), (1, (146, 230)), (1, (144, 228)), (1, (144, 227)), (1, (144, 134)), (1, (143, 180)), (1, (143, 175)), (1, (143, 144)), (1, (142, 227)), (1, (142, 135)), (1, (141, 227)), (1, (140, 150)), (1, (139, 229)), (1, (138, 227)), (1, (138, 185)), (1, (137, 181)), (1, (136, 186)), (1, (136, 144)), (1, (135, 231)), (1, (135, 227)), (1, (135, 186)), (1, (135, 166)), (1, (134, 232)), (1, (134, 227)), (1, (133, 229)), (1, (132, 233)), (1, (131, 227)), (1, (131, 169)), (1, (131, 160)), (1, (131, 151)), (1, (131, 149)), (1, (131, 136)), (1, (131, 135)), (1, (131, 133)), (1, (131, 131)), (1, (130, 230)), (1, (130, 169)), (1, (130, 164)), (1, (130, 138)), (1, (130, 137)), (1, (130, 132)), (1, (130, 130)), (1, (129, 229)), (1, (129, 191)), (1, (129, 167)), (1, (129, 164)), (1, (129, 163)), (1, (129, 148)), (1, (129, 147)), (1, (129, 141)), (1, (129, 139)), (1, (129, 136)), (1, (129, 134)), (1, (128, 227)), (1, (128, 178)), (1, (128, 160)), (1, (124, 32)), (1, (122, 
101)), (1, (121, 46)), (1, (121, 44)), (1, (120, 105)), (1, (119, 32)), (1, (118, 111)), (1, (117, 115)), (1, (117, 112)), (1, (117, 110)), (1, (117, 108)), (1, (116, 117)), (1, (116, 115)), (1, (116, 102)), (1, (115, 117)), (1, (115, 111)), (1, (115, 104)), (1, (115, 44)), (1, (114, 118)), (1, (114, 116)), (1, (114, 108)), (1, (114, 105)), (1, (114, 97)), (1, (112, 112)), (1, (112, 101)), (1, (112, 97)), (1, (111, 119)), (1, (111, 115)), (1, (111, 112)), (1, (111, 108)), (1, (111, 99)), (1, (110, 111)), (1, (110, 110)), (1, (110, 102)), (1, (109, 117)), (1, (109, 32)), (1, (108, 117)), (1, (108, 115)), (1, (108, 108)), (1, (108, 107)), (1, (108, 100)), (1, (108, 32)), (1, (107, 32)), (1, (105, 115)), (1, (105, 114)), (1, (105, 109)), (1, (105, 103)), (1, (105, 99)), (1, (104, 116)), (1, (103, 104)), (1, (103, 32)), (1, (102, 108)), (1, (102, 105)), (1, (102, 102)), (1, (101, 120)), (1, (101, 119)), (1, (101, 118)), (1, (101, 105)), (1, (101, 102)), (1, (100, 118)), (1, (100, 105)), (1, (100, 97)), (1, (99, 116)), (1, (99, 114)), (1, (99, 105)), (1, (99, 104)), (1, (99, 99)), (1, (99, 97)), (1, (98, 121)), (1, (98, 114)), (1, (98, 111)), (1, (98, 108)), (1, (97, 122)), (1, (97, 112)), (1, (97, 109)), (1, (97, 105)), (1, (97, 100)), (1, (97, 99)), (1, (97, 32)), (1, (87, 105)), (1, (87, 104)), (1, (84, 104)), (1, (73, 116)), (1, (67, 111)), (1, (32, 227)), (1, (32, 124)), (1, (32, 117)), (1, (32, 110)), (1, (32, 108)), (1, (32, 103)), (1, (32, 87)), (1, (32, 84)), (1, (32, 73)), (1, (32, 67))]
```

For our example, it looks like the `byte` pair `(227, 129)` is the most commonly occurring consecutive pair, which occurred `81` times in the sequence...

So just to double check that is the case, we have:
![GPTTokenizerMostCommonBytePair](ExplanationMedia/Images/GPTTokenizerMostCommonBytePair.png)

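And now if we want to look at a pair as characters, we can take the help of the <a href="https://docs.python.org/3/library/functions.html#chr">`chr()`</a> function in Python, which is the inverse of `ord()`: it takes an integer code point and returns the corresponding character (this reads nicely for ASCII values below `128`; larger values in our list are pieces of multi-byte `UTF-8` characters)...
So, if we take a pair such as `(112, 117)` and pass each value to the `chr()` function like this: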
```python
print(chr(112), chr(117))
```
We get:
```python
p u
```
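One caveat worth sketching: `chr()` interprets its argument as a Unicode code point, not as a `UTF-8` byte, so it is only a faithful view for ASCII pairs. For a pair like `(227, 129)` a more honest check is to decode the raw bytes. The snippet below is a minimal illustration under that assumption; the variable names are hypothetical, and the third byte `174` is just one of the bytes that follows `(227, 129)` in our frequency table:

```python
# (227, 129) is only the first two bytes of a three-byte UTF-8 character,
# so on its own it cannot be decoded
prefix = bytes([227, 129])
try:
    prefix.decode("utf-8")
    prefixDecodes = True
except UnicodeDecodeError:
    prefixDecodes = False
print(prefixDecodes)  # False

# Adding a third byte completes the character (here: U+306E, hiragana "no")
completed = bytes([227, 129, 174]).decode("utf-8")
print(completed == "\u306e")  # True

# An ASCII pair decodes byte by byte, matching what chr() showed
asciiPair = bytes([112, 117]).decode("utf-8")
print(asciiPair)  # pu
```

This is also a hint as to why `(227, 129)` is so frequent: it is the shared prefix of many Japanese hiragana characters in the sample text.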

### Merging the most common pairs

Now that we have identified the most common pair, we would like to iterate over the text sequence and replace every occurrence of that pair with a new `vocabulary token`. Instead of creating a separate `replacement table`, we are going to extend the same `vocabulary`, whose ids currently run from `0` to `255`, and append the new `vocabulary token` with an id of `256`...

To do that we first need a nice way of obtaining the most commonly occurring pair (specifically the `key`, so we can identify it later), and we can do that with the help of <a href="https://docs.python.org/3/library/functions.html#max">`max()`</a> in Python, instead of the more complicated sorting code we used before...

The `max()` function offers a parameter `key` that specifies how the candidates are ranked. Iterating over a dictionary yields its `keys` by default, which here are the pairs, but we want to rank them by their **values** (the counts). Passing the dictionary's `get()` method as `key` does exactly that, because `get()` returns the value stored for a given key...

So we can now find the `topPair` like this:
```python
topPair = max(pairFrequency, key=pairFrequency.get)
print(topPair)
```
For which we get the top pair:
```python
(227, 129)
```
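A small detail that is easy to miss is what happens on ties. The sketch below uses a hypothetical toy dictionary (`toyFrequency` is not part of the notebook) to show that `max()` returns the first maximal key in insertion order, whereas the sorted `(value, key)` approach from earlier breaks ties by comparing the keys themselves:

```python
# Hypothetical toy counts with a tie between two pairs
toyFrequency = {("a", "b"): 3, ("b", "c"): 5, ("c", "d"): 5}

# max() keeps the first key that reaches the highest count
topByMax = max(toyFrequency, key=toyFrequency.get)
print(topByMax)  # ('b', 'c')

# Sorting (value, key) tuples in reverse falls back to the key on ties
topBySort = sorted(((value, key) for key, value in toyFrequency.items()), reverse=True)[0][1]
print(topBySort)  # ('c', 'd')
```

In our example this makes no difference, since `(227, 129)` with `81` occurrences is a clear winner, but it is worth keeping in mind if two runs of a tokenizer need to produce identical merges.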

Now comes the merging part...

Again, there are many different ways to do this, but this is what I came up with...

So we can define a new function `mergePair()` that takes a sequence of `tokens`, the specific `pair` that we want to merge & the `newVocabularyToken` that we want to generate...

Now because we already have our `byte` sequences in a list, we can initialize an empty list `encodedTokens` at the beginning such that it is easier to return it at the end of the function...\
We can now iteratively go through each token in a sequence using a `while` loop...

Now because we are going through the entire text sequence iteratively, we need to handle two cases:
1. When the `current token` and the `next token` match the `pair` (we also guard against the index out of range condition, because we look at the next token as well).\
   If the condition is met, we append only the `newVocabularyToken` to the `encodedTokens`, and because we have consumed a **pair** we increment the current position by `2`.
2. When the `current token` and the `next token` do not match the `pair`, we append the `current token` to the `encodedTokens`, and because only a single token was consumed we increment the current position by `1`.

And finally when the iteration is done, we can simply return the `encodedTokens` we built from the function.

So let's now implement what we discussed in code...

Our `mergePair()` function looks like this:
```python
def mergePair(tokens, pair, newVocabularyToken):
    encodedTokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            encodedTokens.append(newVocabularyToken)
            i+=2
        else:
            encodedTokens.append(tokens[i])
            i+=1
    return encodedTokens
```
And if we take this function for a spin on a much simpler *toy-example*, we end up with something like this:
```python
>>> print(mergePair(tokens=[5, 6, 6, 9, 7, 1], pair=(6, 9), newVocabularyToken=69))
[5, 6, 69, 7, 1]
```
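A few more edge cases are worth a quick look. The sketch below repeats the `mergePair()` function from above so that it runs on its own, and the inputs are made-up toy lists chosen only to exercise the boundaries:

```python
def mergePair(tokens, pair, newVocabularyToken):
    # Same logic as the function above, repeated so this block is self-contained
    encodedTokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            encodedTokens.append(newVocabularyToken)
            i += 2
        else:
            encodedTokens.append(tokens[i])
            i += 1
    return encodedTokens

# Overlapping matches are consumed left to right, so only one merge happens
print(mergePair([1, 1, 1], (1, 1), 9))  # [9, 1]

# A pair sitting at the very end of the list is still merged
print(mergePair([4, 5, 6], (5, 6), 7))  # [4, 7]

# No match leaves the tokens unchanged, and an empty list stays empty
print(mergePair([4, 5, 6], (6, 5), 7))  # [4, 5, 6]
print(mergePair([], (1, 2), 3))         # []
```

The `i < len(tokens) - 1` guard is what keeps the last case and the final token of every sequence from raising an `IndexError`.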

Now that our function works, we can swap in the real arguments to `mergePair()` and run it on our original example...

For our case, the tokens are `decimalTokens`, the pair is the most frequently occurring pair (or `topPair`) and the `newVocabularyToken` is `256` because it extends the original vocabulary ranging from `0` to `255`.

So we can take our code for a spin like this:
```python
encodedTokens = mergePair(tokens=decimalTokens, pair=topPair, newVocabularyToken=256)
print(encodedTokens)

print("Original Text Length:", len(decimalTokens))
print("Encoded Text Length:", len(encodedTokens))
```
And we end up with something like this:
```python
[87, 104, 101, 110, 32, 119, 101, 32, 116, 97, 108, 107, 32, 97, 98, 111, 117, 116, 32, 116, 104, 101, 32, 119, 111, 110, 100, 101, 114, 115, 32, 111, 102, 32, 99, 111, 109, 112, 117, 116, 101, 114, 115, 44, 32, 119, 101, 32, 97, 114, 101, 32, 97, 109, 97, 122, 101, 100, 32, 98, 121, 32, 116, 104, 101, 105, 114, 32, 100, 105, 118, 101, 114, 115, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 110, 100, 32, 99, 97, 112, 97, 98, 105, 108, 105, 116, 105, 101, 115, 46, 32, 67, 111, 109, 112, 117, 116, 101, 114, 115, 32, 101, 102, 102, 105, 99, 105, 101, 110, 116, 108, 121, 32, 115, 117, 112, 112, 111, 114, 116, 32, 111, 117, 114, 32, 100, 97, 105, 108, 121, 32, 108, 105, 118, 101, 115, 32, 97, 110, 100, 32, 112, 114, 111, 118, 105, 100, 101, 32, 117, 115, 32, 119, 105, 116, 104, 32, 105, 110, 102, 111, 114, 109, 97, 116, 105, 111, 110, 32, 105, 110, 115, 116, 97, 110, 116, 108, 121, 46, 32, 73, 116, 32, 97, 108, 115, 111, 32, 115, 101, 114, 118, 101, 115, 32, 97, 115, 32, 97, 32, 112, 108, 97, 116, 102, 111, 114, 109, 32, 116, 111, 32, 115, 116, 105, 109, 117, 108, 97, 116, 101, 32, 99, 114, 101, 97, 116, 105, 118, 105, 116, 121, 32, 97, 110, 100, 32, 103, 101, 110, 101, 114, 97, 116, 101, 32, 110, 101, 119, 32, 105, 100, 101, 97, 115, 32, 97, 110, 100, 32, 105, 110, 110, 111, 118, 97, 116, 105, 111, 110, 115, 46, 32, 87, 105, 116, 104, 32, 105, 116, 115, 32, 97, 100, 118, 97, 110, 99, 101, 100, 32, 112, 114, 111, 99, 101, 115, 115, 105, 110, 103, 32, 112, 111, 119, 101, 114, 32, 97, 110, 100, 32, 102, 108, 101, 120, 105, 98, 105, 108, 105, 116, 121, 44, 32, 119, 101, 32, 97, 114, 101, 32, 97, 98, 108, 101, 32, 116, 111, 32, 97, 99, 99, 111, 109, 112, 108, 105, 115, 104, 32, 109, 111, 114, 101, 32, 97, 110, 100, 32, 109, 111, 114, 101, 46, 32, 84, 104, 101, 32, 101, 118, 111, 108, 117, 116, 105, 111, 110, 32, 111, 102, 32, 99, 111, 109, 112, 117, 116, 101, 114, 115, 32, 111, 112, 101, 110, 115, 32, 112, 111, 115, 115, 105, 98, 105, 108, 105, 116, 105, 101, 
115, 32, 116, 104, 97, 116, 32, 119, 105, 108, 108, 32, 99, 104, 97, 110, 103, 101, 32, 111, 117, 114, 32, 119, 111, 114, 108, 100, 32, 97, 110, 100, 32, 98, 114, 105, 103, 104, 116, 101, 110, 32, 111, 117, 114, 32, 102, 117, 116, 117, 114, 101, 46, 32, 124, 32, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 256, 174, 231, 180, 160, 230, 153, 180, 227, 130, 137, 256, 151, 256, 149, 256, 171, 256, 164, 256, 132, 256, 166, 232, 170, 158, 227, 130, 139, 256, 168, 227, 128, 129, 256, 157, 256, 174, 229, 164, 154, 230, 167, 152, 256, 170, 230, 169, 159, 232, 131, 189, 256, 168, 232, 131, 189, 229, 138, 155, 256, 171, 233, 169, 154, 256, 139, 256, 149, 227, 130, 140, 256, 190, 256, 153, 227, 128, 130, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 256, 175, 231, 167, 129, 256, 159, 256, 161, 256, 174, 230, 151, 165, 229, 184, 184, 231, 148, 159, 230, 180, 187, 227, 130, 146, 229, 138, 185, 231, 142, 135, 231, 154, 132, 256, 171, 230, 148, 175, 230, 143, 180, 256, 151, 227, 128, 129, 230, 131, 133, 229, 160, 177, 227, 130, 146, 231, 158, 172, 230, 153, 130, 256, 171, 230, 143, 144, 228, 190, 155, 256, 151, 256, 166, 256, 143, 227, 130, 140, 256, 190, 256, 153, 227, 128, 130, 256, 190, 256, 159, 227, 128, 129, 229, 137, 181, 233, 128, 160, 230, 128, 167, 227, 130, 146, 229, 136, 186, 230, 191, 128, 256, 151, 227, 128, 129, 230, 150, 176, 256, 151, 256, 132, 227, 130, 162, 227, 130, 164, 227, 131, 135, 227, 130, 162, 227, 130, 132, 233, 157, 169, 230, 150, 176, 227, 130, 146, 231, 148, 159, 256, 191, 229, 135, 186, 256, 153, 227, 131, 151, 227, 131, 169, 227, 131, 131, 227, 131, 136, 227, 131, 149, 227, 130, 169, 227, 131, 188, 227, 131, 160, 256, 168, 256, 151, 256, 166, 227, 130, 130, 230, 169, 159, 232, 131, 189, 256, 151, 256, 190, 256, 153, 227, 128, 130, 256, 157, 256, 174, 233, 171, 152, 229, 186, 166, 256, 170, 229, 135, 166, 231, 144, 134, 232, 131, 
189, 229, 138, 155, 256, 168, 230, 159, 148, 232, 187, 159, 230, 128, 167, 256, 171, 227, 130, 136, 256, 163, 256, 166, 227, 128, 129, 231, 167, 129, 256, 159, 256, 161, 256, 175, 256, 190, 256, 153, 256, 190, 256, 153, 229, 164, 154, 256, 143, 256, 174, 256, 147, 256, 168, 227, 130, 146, 233, 129, 148, 230, 136, 144, 256, 167, 256, 141, 227, 130, 139, 227, 130, 136, 256, 134, 256, 171, 256, 170, 227, 130, 138, 256, 190, 256, 151, 256, 159, 227, 128, 130, 227, 130, 179, 227, 131, 179, 227, 131, 148, 227, 131, 165, 227, 131, 188, 227, 130, 191, 227, 131, 188, 256, 174, 233, 128, 178, 229, 140, 150, 256, 175, 227, 128, 129, 231, 167, 129, 256, 159, 256, 161, 256, 174, 228, 184, 150, 231, 149, 140, 227, 130, 146, 229, 164, 137, 256, 136, 227, 128, 129, 230, 156, 170, 230, 157, 165, 227, 130, 146, 230, 152, 142, 227, 130, 139, 256, 143, 256, 153, 227, 130, 139, 229, 143, 175, 232, 131, 189, 230, 128, 167, 227, 130, 146, 233, 150, 139, 256, 132, 256, 166, 256, 132, 256, 190, 256, 153, 227, 128, 130]
Original Text Length: 1110
Encoded Text Length: 1029
```

We see that the original length of the text was `1110` and the encoded length is now `1029`. This makes sense because our most frequently occurring pair `(227, 129)` occurred `81` times, each merge replaces two tokens with one, and `1110 - 1029 = 81`...
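The same bookkeeping can be sketched on a toy input. The helper `countPair()` and the toy lists below are hypothetical and only illustrate the arithmetic; `mergePair()` is repeated so the block runs on its own. Note the second case: when a pair can overlap with itself, the pair count is larger than the number of merges, so the simple "length drops by the count" rule only holds for pairs of two different tokens, such as `(227, 129)`:

```python
def mergePair(tokens, pair, newVocabularyToken):
    # Same logic as the notebook's function, repeated for self-containment
    encodedTokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            encodedTokens.append(newVocabularyToken)
            i += 2
        else:
            encodedTokens.append(tokens[i])
            i += 1
    return encodedTokens

def countPair(tokens, pair):
    # Hypothetical helper: how often does this exact pair appear consecutively?
    return sum(1 for candidate in zip(tokens, tokens[1:]) if candidate == pair)

# Pair of two different tokens: the length drops by exactly the pair count
toyTokens = [1, 2, 3, 1, 2]
toyMerged = mergePair(toyTokens, (1, 2), 9)
print(countPair(toyTokens, (1, 2)), len(toyTokens) - len(toyMerged))  # 2 2

# Pair of identical tokens: overlapping matches are counted twice but merged once
overlapTokens = [7, 7, 7]
overlapMerged = mergePair(overlapTokens, (7, 7), 9)
print(countPair(overlapTokens, (7, 7)), len(overlapTokens) - len(overlapMerged))  # 2 1
```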

So just to double check again, we should be able to confirm that there are no occurrences of the pair `(227, 129)` left, like this:
![GPTTokenizerNoOccurenceOfPrevToken](ExplanationMedia/Images/GPTTokenizerNoOccurenceOfPrevToken.png)
And the replaced positions of that pair with token `256`:
![GPTTokenizerMostCommonBytePair](ExplanationMedia/Images/GPTTokenizerMostCommonBytePair.png)

So it seems like we have successfully merged a **single pair**, and now we just want to run these steps over the sequence again and again, each time recounting the pairs and merging the **most commonly occurring pair**...

And how many times do we do it?

Well, as I mentioned before, it is totally up to us, as it is a **hyper-parameter**. The more merge steps we take, the larger our `vocabulary` becomes and the shorter our `sequence` gets, and we usually look for the *sweet-spot* that works best in practice...
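To make that trade-off concrete before building the real loop, here is a toy preview on a tiny string. It is only a sketch under stated assumptions: `numberOfMerges = 3` is an arbitrary choice, the input string is made up, and both helper functions are repeated from above so the block runs on its own:

```python
def getPairFrequency(tokens):
    frequencyCounts = {}
    for pair in zip(tokens, tokens[1:]):
        frequencyCounts[pair] = frequencyCounts.get(pair, 0) + 1
    return frequencyCounts

def mergePair(tokens, pair, newVocabularyToken):
    encodedTokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            encodedTokens.append(newVocabularyToken)
            i += 2
        else:
            encodedTokens.append(tokens[i])
            i += 1
    return encodedTokens

# Hypothetical toy input and an arbitrary number of merge steps
toyTokens = list("aaabdaaabac".encode("utf-8"))
numberOfMerges = 3
sequenceLengths = [len(toyTokens)]

for step in range(numberOfMerges):
    toyFrequency = getPairFrequency(toyTokens)
    toyTopPair = max(toyFrequency, key=toyFrequency.get)
    # Each step adds one token to the vocabulary: 256, 257, 258, ...
    toyTokens = mergePair(toyTokens, toyTopPair, 256 + step)
    sequenceLengths.append(len(toyTokens))

print("Vocabulary size:", 256 + numberOfMerges)  # Vocabulary size: 259
print("Sequence lengths:", sequenceLengths)      # Sequence lengths: [11, 9, 7, 5]
```

Three extra vocabulary entries shrink this toy sequence from `11` tokens to `5`. On real text the gains per merge taper off as the remaining pairs become rarer, which is why the number of merges is tuned rather than maximised.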

So let's write our iteration loop now...

### Iterating over the merged byte pairs

Now instead of a small text, I took the entire <a href="https://www.reedbeta.com/blog/programmers-intro-to-unicode/">blog post by Nathan Reed</a> and stretched it out into a single line, so that the longer text gives us more representative statistics for the byte-pairs and more sensible results...

And we will perform the same steps as the first time: encode the entire text with `UTF-8` to get the raw bytes, and then convert them into a list of integers so that they are easier to work with...
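As a quick reminder of what that step does, here is the same conversion on a short string. This is a minimal sketch with hypothetical names (`sampleText`, `sampleTokens`), not the notebook's actual text:

```python
# "héllo 😄" written with escapes so the source stays ASCII-only
sampleText = "h\u00e9llo \U0001F604"

# Encode to raw UTF-8 bytes, then turn the bytes into a list of integers
sampleTokens = list(sampleText.encode("utf-8"))

print(len(sampleText), len(sampleTokens))  # 7 11
print(sampleTokens)  # [104, 195, 169, 108, 108, 111, 32, 240, 159, 152, 132]

# The round trip back to text is lossless
print(bytes(sampleTokens).decode("utf-8") == sampleText)  # True
```

The token list is longer than the character count because non-ASCII characters take two to four bytes each in `UTF-8`.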

So we now have this code:
```python
unicodetext = """A Programmer’s Introduction to Unicode March 3, 2017 · Coding · 22 Comments  Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.  A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view.  I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text. However, in this article I’m not going to talk about fonts, text layout/shaping/rendering, or localization in detail—those are separate issues, beyond my scope (and knowledge) here.  Diversity and Inherent Complexity The Unicode Codespace Codespace Allocation Scripts Usage Frequency Encodings UTF-8 UTF-16 Combining Marks Canonical Equivalence Normalization Forms Grapheme Clusters And More… Diversity and Inherent Complexity As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. It’s not just that Unicode contains a much larger number of characters, although that’s part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere “character set” to be. We’ll see some of that later in this article.  When confronting all this complexity, especially as an engineer, it’s hard not to find oneself asking, “Why do we need all this? 
Is this really necessary? Couldn’t it be simplified?”  However, Unicode aims to faithfully represent the entire world’s writing systems. The Unicode Consortium’s stated goal is “enabling people around the world to use computers in any language”. And as you might imagine, the diversity of written languages is immense! To date, Unicode supports 135 different scripts, covering some 1100 languages, and there’s still a long tail of over 100 unsupported scripts, both modern and historical, which people are still working to add.  Given this enormous diversity, it’s inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesn’t make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission.  Moreover, Unicode is committed not just to supporting texts in any single language, but also to letting multiple languages coexist within one text—which introduces even more complexity.  Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, you’ll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but don’t be discouraged—think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity!  The Unicode Codespace Let’s start with some general orientation. The basic elements of Unicode—its “characters”, although that term isn’t quite right—are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta. 
Each code point also has a short name, and quite a few other properties, specified in the Unicode Character Database.  The set of all possible code points is called the codespace. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them—about 12% of the codespace—are actually assigned, to date. There’s plenty of room for growth! Unicode also reserves an additional 137,468 code points as “private use” areas, which have no standardized meaning and are available for individual applications to define for their own purposes.  Codespace Allocation To get a feel for how the codespace is laid out, it’s helpful to visualize it. Below is a map of the entire codespace, with one pixel per code point. It’s arranged in tiles for visual coherence; each small square is 16×16 = 256 code points, and each large square is a “plane” of 65,536 code points. There are 17 planes altogether.  Map of the Unicode codespace (click to zoom)  White represents unassigned space. Blue is assigned code points, green is private-use areas, and the small red area is surrogates (more about those later). As you can see, the assigned code points are distributed somewhat sparsely, but concentrated in the first three planes.  Plane 0 is also known as the “Basic Multilingual Plane”, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Cyrillic, Greek, Han (Chinese), Japanese, Korean, Arabic, Hebrew, Devanagari (Indian), and many more.  (In the past, the codespace was just the BMP and no more—Unicode was originally conceived as a straightforward 16-bit encoding, with only 65,536 code points. It was expanded to its current size in 1996. However, the vast majority of code points in modern text belong to the BMP.)  Plane 1 contains historical scripts, such as Sumerian cuneiform and Egyptian hieroglyphs, as well as emoji and various other symbols. Plane 2 contains a large block of less-common and historical Han characters. 
The remaining planes are empty, except for a small number of rarely-used formatting characters in Plane 14; planes 15–16 are reserved entirely for private use.  Scripts Let’s zoom in on the first three planes, since that’s where the action is:  Map of scripts in Unicode planes 0–2 (click to zoom)  This map color-codes the 135 different scripts in Unicode. You can see how Han () and Korean () take up most of the range of the BMP (the left large square). By contrast, all of the European, Middle Eastern, and South Asian scripts fit into the first row of the BMP in this diagram.  Many areas of the codespace are adapted or copied from earlier encodings. For example, the first 128 code points of Unicode are just a copy of ASCII. This has clear benefits for compatibility—it’s easy to losslessly convert texts from smaller encodings into Unicode (and the other direction too, as long as no characters outside the smaller encoding are used).  Usage Frequency One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.  Heat map of code point usage frequency in Unicode planes 0–2 (click to zoom)  You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.  Encodings We’ve seen that Unicode code points are abstractly identified by their index in the codespace, ranging from U+0000 to U+10FFFF. But how do code points get represented as bytes, in memory or in a file?  
The most convenient, computer-friendliest (and programmer-friendliest) thing to do would be to just store the code point index as a 32-bit integer. This works, but it consumes 4 bytes per code point, which is sort of a lot. Using 32-bit ints for Unicode will cost you a bunch of extra storage, memory, and performance in bandwidth-bound scenarios, if you work with a lot of text.  Consequently, there are several more-compact encodings for Unicode. The 32-bit integer encoding is officially called UTF-32 (UTF = “Unicode Transformation Format”), but it’s rarely used for storage. At most, it comes up sometimes as a temporary internal representation, for examining or operating on the code points in a string.  Much more commonly, you’ll see Unicode text encoded as either UTF-8 or UTF-16. These are both variable-length encodings, made up of 8-bit or 16-bit units, respectively. In these schemes, code points with smaller index values take up fewer bytes, which saves a lot of memory for typical texts. The trade-off is that processing UTF-8/16 texts is more programmatically involved, and likely slower.  UTF-8 In UTF-8, each code point is stored using 1 to 4 bytes, based on its index value.  UTF-8 uses a system of binary prefixes, in which the high bits of each byte mark whether it’s a single byte, the beginning of a multi-byte sequence, or a continuation byte; the remaining bits, concatenated, give the code point index. This table shows how it works:  UTF-8 (binary)\tCode point (binary)\tRange 0xxxxxxx\txxxxxxx\tU+0000–U+007F 110xxxxx 10yyyyyy\txxxxxyyyyyy\tU+0080–U+07FF 1110xxxx 10yyyyyy 10zzzzzz\txxxxyyyyyyzzzzzz\tU+0800–U+FFFF 11110xxx 10yyyyyy 10zzzzzz 10wwwwww\txxxyyyyyyzzzzzzwwwwww\tU+10000–U+10FFFF A handy property of UTF-8 is that code points below 128 (ASCII characters) are encoded as single bytes, and all non-ASCII code points are encoded using sequences of bytes 128–255. This has a couple of nice consequences. 
First, any strings or files out there that are already in ASCII can also be interpreted as UTF-8 without any conversion. Second, lots of widely-used string programming idioms—such as null termination, or delimiters (newlines, tabs, commas, slashes, etc.)—will just work on UTF-8 strings. ASCII bytes never occur inside the encoding of non-ASCII code points, so searching byte-wise for a null terminator or a delimiter will do the right thing.  Thanks to this convenience, it’s relatively simple to extend legacy ASCII programs and APIs to handle UTF-8 strings. UTF-8 is very widely used in the Unix/Linux and Web worlds, and many programmers argue UTF-8 should be the default encoding everywhere.  However, UTF-8 isn’t a drop-in replacement for ASCII strings in all respects. For instance, code that iterates over the “characters” in a string will need to decode UTF-8 and iterate over code points (or maybe grapheme clusters—more about those later), not bytes. When you measure the “length” of a string, you’ll need to think about whether you want the length in bytes, the length in code points, the width of the text when rendered, or something else.  UTF-16 The other encoding that you’re likely to encounter is UTF-16. It uses 16-bit words, with each code point stored as either 1 or 2 words.  Like UTF-8, we can express the UTF-16 encoding rules in the form of binary prefixes:  UTF-16 (binary)\tCode point (binary)\tRange xxxxxxxxxxxxxxxx\txxxxxxxxxxxxxxxx\tU+0000–U+FFFF 110110xxxxxxxxxx 110111yyyyyyyyyy\txxxxxxxxxxyyyyyyyyyy + 0x10000\tU+10000–U+10FFFF A more common way that people talk about UTF-16 encoding, though, is in terms of code points called “surrogates”. All the code points in the range U+D800–U+DFFF—or in other words, the code points that match the binary prefixes 110110 and 110111 in the table above—are reserved specifically for UTF-16 encoding, and don’t represent any valid characters on their own. 
They’re only meant to occur in the 2-word encoding pattern above, which is called a “surrogate pair”. Surrogate code points are illegal in any other context! They’re not allowed in UTF-8 or UTF-32 at all.  Historically, UTF-16 is a descendant of the original, pre-1996 versions of Unicode, in which there were only 65,536 code points. The original intention was that there would be no different “encodings”; Unicode was supposed to be a straightforward 16-bit character set. Later, the codespace was expanded to make room for a long tail of less-common (but still important) Han characters, which the Unicode designers didn’t originally plan for. Surrogates were then introduced, as—to put it bluntly—a kludge, allowing 16-bit encodings to access the new code points.  Today, Javascript uses UTF-16 as its standard string representation: if you ask for the length of a string, or iterate over it, etc., the result will be in UTF-16 words, with any code points outside the BMP expressed as surrogate pairs. UTF-16 is also used by the Microsoft Win32 APIs; though Win32 supports either 8-bit or 16-bit strings, the 8-bit version unaccountably still doesn’t support UTF-8—only legacy code-page encodings, like ANSI. This leaves UTF-16 as the only way to get proper Unicode support in Windows. (Update: in Win10 version 1903, they finally added UTF-8 support to the 8-bit APIs! 😊)  By the way, UTF-16’s words can be stored either little-endian or big-endian. Unicode has no opinion on that issue, though it does encourage the convention of putting U+FEFF zero width no-break space at the top of a UTF-16 file as a byte-order mark, to disambiguate the endianness. (If the file doesn’t match the system’s endianness, the BOM will be decoded as U+FFFE, which isn’t a valid code point.)  Combining Marks In the story so far, we’ve been focusing on code points. But in Unicode, a “character” can be more complicated than just an individual code point!  
Unicode includes a system for dynamically composing characters, by combining multiple code points together. This is used in various ways to gain flexibility without causing a huge combinatorial explosion in the number of code points.  In European languages, for example, this shows up in the application of diacritics to letters. Unicode supports a wide range of diacritics, including acute and grave accents, umlauts, cedillas, and many more. All these diacritics can be applied to any letter of any alphabet—and in fact, multiple diacritics can be used on a single letter.  If Unicode tried to assign a distinct code point to every possible combination of letter and diacritics, things would rapidly get out of hand. Instead, the dynamic composition system enables you to construct the character you want, by starting with a base code point (the letter) and appending additional code points, called “combining marks”, to specify the diacritics. When a text renderer sees a sequence like this in a string, it automatically stacks the diacritics over or under the base letter to create a composed character.  For example, the accented character “Á” can be expressed as a string of two code points: U+0041 “A” latin capital letter a plus U+0301 “◌́” combining acute accent. This string automatically gets rendered as a single character: “Á”.  Now, Unicode does also include many “precomposed” code points, each representing a letter with some combination of diacritics already applied, such as U+00C1 “Á” latin capital letter a with acute or U+1EC7 “ệ” latin small letter e with circumflex and dot below. I suspect these are mostly inherited from older encodings that were assimilated into Unicode, and kept around for compatibility. In practice, there are precomposed code points for most of the common letter-with-diacritic combinations in European-script languages, so they don’t use dynamic composition that much in typical text.  
Still, the system of combining marks does allow for an arbitrary number of diacritics to be stacked on any base character. The reductio-ad-absurdum of this is Zalgo text, which works by ͖͟ͅr͞aṋ̫̠̖͈̗d͖̻̹óm̪͙͕̗̝ļ͇̰͓̳̫ý͓̥̟͍ ̕s̫t̫̱͕̗̰̼̘͜a̼̩͖͇̠͈̣͝c̙͍k̖̱̹͍͘i̢n̨̺̝͇͇̟͙ģ̫̮͎̻̟ͅ ̕n̼̺͈͞u̮͙m̺̭̟̗͞e̞͓̰̤͓̫r̵o̖ṷs҉̪͍̭̬̝̤ ̮͉̝̞̗̟͠d̴̟̜̱͕͚i͇̫̼̯̭̜͡ḁ͙̻̼c̲̲̹r̨̠̹̣̰̦i̱t̤̻̤͍͙̘̕i̵̜̭̤̱͎c̵s ͘o̱̲͈̙͖͇̲͢n͘ ̜͈e̬̲̠̩ac͕̺̠͉h̷̪ ̺̣͖̱ḻ̫̬̝̹ḙ̙̺͙̭͓̲t̞̞͇̲͉͍t̷͔̪͉̲̻̠͙e̦̻͈͉͇r͇̭̭̬͖,̖́ ̜͙͓̣̭s̘̘͈o̱̰̤̲ͅ ̛̬̜̙t̼̦͕̱̹͕̥h̳̲͈͝ͅa̦t̻̲ ̻̟̭̦̖t̛̰̩h̠͕̳̝̫͕e͈̤̘͖̞͘y҉̝͙ ̷͉͔̰̠o̞̰v͈͈̳̘͜er̶f̰͈͔ḻ͕̘̫̺̲o̲̭͙͠ͅw̱̳̺ ͜t̸h͇̭͕̳͍e̖̯̟̠ ͍̞̜͔̩̪͜ļ͎̪̲͚i̝̲̹̙̩̹n̨̦̩̖ḙ̼̲̼͢ͅ ̬͝s̼͚̘̞͝p͙̘̻a̙c҉͉̜̤͈̯̖i̥͡n̦̠̱͟g̸̗̻̦̭̮̟ͅ ̳̪̠͖̳̯̕a̫͜n͝d͡ ̣̦̙ͅc̪̗r̴͙̮̦̹̳e͇͚̞͔̹̫͟a̙̺̙ț͔͎̘̹ͅe̥̩͍ a͖̪̜̮͙̹n̢͉̝ ͇͉͓̦̼́a̳͖̪̤̱p̖͔͔̟͇͎͠p̱͍̺ę̲͎͈̰̲̤̫a̯͜r̨̮̫̣̘a̩̯͖n̹̦̰͎̣̞̞c̨̦̱͔͎͍͖e̬͓͘ ̤̰̩͙̤̬͙o̵̼̻̬̻͇̮̪f̴ ̡̙̭͓͖̪̤“̸͙̠̼c̳̗͜o͏̼͙͔̮r̞̫̺̞̥̬ru̺̻̯͉̭̻̯p̰̥͓̣̫̙̤͢t̳͍̳̖ͅi̶͈̝͙̼̙̹o̡͔n̙̺̹̖̩͝ͅ”̨̗͖͚̩.̯͓  A few other places where dynamic character composition shows up in Unicode:  Vowel-pointing notation in Arabic and Hebrew. In these languages, words are normally spelled with some of their vowels left out. They then have diacritic notation to indicate the vowels (used in dictionaries, language-teaching materials, children’s books, and such). These diacritics are expressed with combining marks.  A Hebrew example, with niqqud:\tאֶת דַלְתִּי הֵזִיז הֵנִיעַ, קֶטֶב לִשְׁכַּתִּי יָשׁוֹד Normal writing (no niqqud):\tאת דלתי הזיז הניע, קטב לשכתי ישוד Devanagari, the script used to write Hindi, Sanskrit, and many other South Asian languages, expresses certain vowels as combining marks attached to consonant letters. For example, “ह” + “\u200bि” = “हि” (“h” + “i” = “hi”). Korean characters stand for syllables, but they are composed of letters called jamo that stand for the vowels and consonants in the syllable. While there are code points for precomposed Korean syllables, it’s also possible to dynamically compose them by concatenating their jamo. For example, “ᄒ” + “ᅡ” + “ᆫ” = “한” (“h” + “a” + “n” = “han”). 
Canonical Equivalence In Unicode, precomposed characters exist alongside the dynamic composition system. A consequence of this is that there are multiple ways to express “the same” string—different sequences of code points that result in the same user-perceived characters. For example, as we saw earlier, we can express the character “Á” either as the single code point U+00C1, or as the string of two code points U+0041 U+0301.  Another source of ambiguity is the ordering of multiple diacritics in a single character. Diacritic order matters visually when two diacritics apply to the same side of the base character, e.g. both above: “ǡ” (dot, then macron) is different from “ā̇” (macron, then dot). However, when diacritics apply to different sides of the character, e.g. one above and one below, then the order doesn’t affect rendering. Moreover, a character with multiple diacritics might have one of the diacritics precomposed and others expressed as combining marks.  For example, the Vietnamese letter “ệ” can be expressed in five different ways:  Fully precomposed: U+1EC7 “ệ” Partially precomposed: U+1EB9 “ẹ” + U+0302 “◌̂” Partially precomposed: U+00EA “ê” + U+0323 “◌̣” Fully decomposed: U+0065 “e” + U+0323 “◌̣” + U+0302 “◌̂” Fully decomposed: U+0065 “e” + U+0302 “◌̂” + U+0323 “◌̣” Unicode refers to set of strings like this as “canonically equivalent”. Canonically equivalent strings are supposed to be treated as identical for purposes of searching, sorting, rendering, text selection, and so on. This has implications for how you implement operations on text. For example, if an app has a “find in file” operation and the user searches for “ệ”, it should, by default, find occurrences of any of the five versions of “ệ” above!  
Normalization Forms To address the problem of “how to handle canonically equivalent strings”, Unicode defines several normalization forms: ways of converting strings into a canonical form so that they can be compared code-point-by-code-point (or byte-by-byte).  The “NFD” normalization form fully decomposes every character down to its component base and combining marks, taking apart any precomposed code points in the string. It also sorts the combining marks in each character according to their rendered position, so e.g. diacritics that go below the character come before the ones that go above the character. (It doesn’t reorder diacritics in the same rendered position, since their order matters visually, as previously mentioned.)  The “NFC” form, conversely, puts things back together into precomposed code points as much as possible. If an unusual combination of diacritics is called for, there may not be any precomposed code point for it, in which case NFC still precomposes what it can and leaves any remaining combining marks in place (again ordered by rendered position, as in NFD).  There are also forms called NFKD and NFKC. The “K” here refers to compatibility decompositions, which cover characters that are “similar” in some sense but not visually identical. However, I’m not going to cover that here.  Grapheme Clusters As we’ve seen, Unicode contains various cases where a thing that a user thinks of as a single “character” might actually be made up of multiple code points under the hood. Unicode formalizes this using the notion of a grapheme cluster: a string of one or more code points that constitute a single “user-perceived character”.  UAX #29 defines the rules for what, precisely, qualifies as a grapheme cluster. It’s approximately “a base code point followed by any number of combining marks”, but the actual definition is a bit more complicated; it accounts for things like Korean jamo, and emoji ZWJ sequences.  
The main thing grapheme clusters are used for is text editing: they’re often the most sensible unit for cursor placement and text selection boundaries. Using grapheme clusters for these purposes ensures that you can’t accidentally chop off some diacritics when you copy-and-paste text, that left/right arrow keys always move the cursor by one visible character, and so on.  Another place where grapheme clusters are useful is in enforcing a string length limit—say, on a database field. While the true, underlying limit might be something like the byte length of the string in UTF-8, you wouldn’t want to enforce that by just truncating bytes. At a minimum, you’d want to “round down” to the nearest code point boundary; but even better, round down to the nearest grapheme cluster boundary. Otherwise, you might be corrupting the last character by cutting off a diacritic, or interrupting a jamo sequence or ZWJ sequence.  And More… There’s much more that could be said about Unicode from a programmer’s perspective! I haven’t gotten into such fun topics as case mapping, collation, compatibility decompositions and confusables, Unicode-aware regexes, or bidirectional text. Nor have I said anything yet about implementation issues—how to efficiently store and look-up data about the sparsely-assigned code points, or how to optimize UTF-8 decoding, string comparison, or NFC normalization. Perhaps I’ll return to some of those things in future posts.  Unicode is a fascinating and complex system. It has a many-to-one mapping between bytes and code points, and on top of that a many-to-one (or, under some circumstances, many-to-many) mapping between code points and “characters”. It has oddball special cases in every corner. But no one ever claimed that representing all written languages was going to be easy, and it’s clear that we’re never going back to the bad old days of a patchwork of incompatible encodings.  
Further reading:  The Unicode Standard UTF-8 Everywhere Manifesto Dark corners of Unicode by Eevee ICU (International Components for Unicode)—C/C++/Java libraries implementing many Unicode algorithms and related things Python 3 Unicode Howto Google Noto Fonts—set of fonts intended to cover all assigned code points"""
rawByteTokens = unicodetext.encode("UTF-8")
decimalTokens = list(map(int, rawByteTokens))
```

Now that we have our text ready, the first thing to decide is the final `vocabulary size` we want to reach. As discussed, this is a **hyper-parameter**, and it also determines the `number of merges` we are going to perform (`vocabularySize - originalVocabularySize`, which here is `vocabularySize - 256`)...

Once our **hyper-parameters** are set up, we can start building our `replacement table`...

First, we will initialize our `replacement table` as an empty dictionary and loop for the `number of merges`.\
We will also create a copy of our original sequence, and a pretty neat way of doing that is calling `list()` on the `list` we have (Python builds a new list containing the same elements, so the original stays untouched)...
The point to note here is that our `replacement table` maps each `pair` to the `new vocabulary token` we generate for it. So what we are building is like a reversed binary tree (*sort-of*): instead of starting from a single root and branching out to the leaves, we start with the leaves (the starting `bytes`, i.e. the first `256` tokens) and merge two of them at a time. Since there is no single root, it is not really a tree, it's more like a forest...
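
As a quick check of the copying trick mentioned above, the sketch below uses a made-up three-element list (`originalSequence` and `copiedSequence` are illustrative names, not part of the notebook) to show that `list()` gives us an independent list, so changing the copy leaves the original alone:

```python
# Hypothetical toy data, only for illustrating the copy behaviour
originalSequence = [65, 66, 67]
copiedSequence = list(originalSequence)  # new list holding the same elements

copiedSequence[0] = 256  # modify the copy only

print(originalSequence)  # [65, 66, 67]
print(copiedSequence)    # [256, 66, 67]
```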

So, we will iterate `number of merges` times, and for each merge we will first compute the pair frequencies to find the most commonly occurring pair (`topPair`). Then we will generate a new token integer for it (optionally printing a message so we can see what is being merged) and replace all occurrences of that pair (`topPair`) with our `newly generated vocabulary token`. And finally we will record the mapping from `topPair` to the `newly generated vocabulary token` in our `replacement table`...
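
To make a single round concrete, here is a minimal sketch on a toy string. The helpers `countPairs` and `replacePair` are hypothetical stand-ins written just for this example; I am assuming they behave roughly like the `getPairFrequency` and `mergePair` functions used in this notebook, whose actual bodies may differ...

```python
from collections import Counter

def countPairs(tokens):
    # Count how often each adjacent pair of tokens occurs
    return Counter(zip(tokens, tokens[1:]))

def replacePair(tokens, pair, newToken):
    # Walk the sequence and swap every occurrence of `pair` for `newToken`
    merged = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(newToken)
            i += 2  # skip both members of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy input (an assumption for illustration, not the notebook's text)
toyTokens = list("aaabdaaabac".encode("utf-8"))
toyPairCounts = countPairs(toyTokens)
toyTopPair = max(toyPairCounts, key=toyPairCounts.get)
mergedToyTokens = replacePair(toyTokens, toyTopPair, 256)

print(toyTopPair)       # (97, 97)
print(mergedToyTokens)  # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

One round shortens the toy sequence from `11` tokens to `9`, and the full loop simply repeats this with a fresh token integer each time...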

So after all that discussion, the code we end up with is:
```python
# Hyper-Parameter
vocabularySize = 276
# Calculating Number of Merges from the Hyper-parameter
numberOfMerges = vocabularySize - 256
# Creating the copy of the original token sequence
tokens = list(decimalTokens)

# Initializing the replacement table
replacementTable = {}

for i in range(numberOfMerges):
    pairFrequency = getPairFrequency(tokens=tokens)
    topPair = max(pairFrequency, key=pairFrequency.get)
    newVocabularyToken = 256 + i
    print(f"Merging pair {topPair} by generating a new vocabulary token {newVocabularyToken}")
    tokens = mergePair(tokens=tokens, pair=topPair, newVocabularyToken=newVocabularyToken)
    replacementTable[topPair] = newVocabularyToken
```
And we end up with the following output:
```plaintext
Merging pair (101, 32) by generating a new vocabulary token 256
Merging pair (105, 110) by generating a new vocabulary token 257
Merging pair (115, 32) by generating a new vocabulary token 258
Merging pair (116, 104) by generating a new vocabulary token 259
Merging pair (101, 114) by generating a new vocabulary token 260
Merging pair (99, 111) by generating a new vocabulary token 261
Merging pair (116, 32) by generating a new vocabulary token 262
Merging pair (226, 128) by generating a new vocabulary token 263
Merging pair (44, 32) by generating a new vocabulary token 264
Merging pair (97, 110) by generating a new vocabulary token 265
Merging pair (111, 114) by generating a new vocabulary token 266
Merging pair (100, 32) by generating a new vocabulary token 267
Merging pair (97, 114) by generating a new vocabulary token 268
Merging pair (101, 110) by generating a new vocabulary token 269
Merging pair (257, 103) by generating a new vocabulary token 270
Merging pair (261, 100) by generating a new vocabulary token 271
Merging pair (121, 32) by generating a new vocabulary token 272
Merging pair (46, 32) by generating a new vocabulary token 273
Merging pair (97, 108) by generating a new vocabulary token 274
Merging pair (259, 256) by generating a new vocabulary token 275
```

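The point to be noted here is that the `newly generated vocabulary tokens` are themselves eligible for merging in later iterations. In the output above, for example, token `270` merges the earlier token `257` with the byte `103`, and token `275` merges two earlier tokens, `259` and `256`...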
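
We can see this nesting by expanding a merged token back into its raw bytes. This is only a sketch: `sampleMerges` holds two entries copied from the merge log above, and `samplePairs` and `expandToken` are hypothetical names made up for this illustration...

```python
# Two entries copied from the merge log above (pair -> new token)
sampleMerges = {(105, 110): 257, (257, 103): 270}

# Invert the table so we can look up the pair behind each new token
samplePairs = {token: pair for pair, token in sampleMerges.items()}

def expandToken(token):
    # Raw bytes (0..255) are the leaves of the forest
    if token < 256:
        return bytes([token])
    # A merged token expands into whatever its two children expand into
    left, right = samplePairs[token]
    return expandToken(left) + expandToken(right)

print(expandToken(257))  # b'in'
print(expandToken(270))  # b'ing'
```

So token `270` stands for three bytes, because it was built on top of token `257`, which was itself a merge...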

Now we can look at the compression ratio this way as well:
```python
print("Length of original tokens:", len(decimalTokens))
print("Length of encoded tokens:", len(tokens))
print("Compression ratio:", f"{len(decimalTokens) / len(tokens):.2f}X")
```
And we see the following output:
```plaintext
Length of original tokens: 24597
Length of encoded tokens: 19438
Compression ratio: 1.27X
```

# Understanding Tokenizer

What we did previously is kind of the "training of the tokenizer"...

Now let's look at the block diagram I made for understanding our tokenizer:
<p align="center"><img src="./ExplanationMedia/Images/TokenizerBlockDiagram.png"></p>

The point I would like to make is that a `Tokenizer` is a completely separate object from the `Large Language Model (LLM)` itself. \
It is a completely separate **pre-processing** stage.\
The `Tokenizer` has its own training dataset, which is potentially different from the training set of the `LLM`. That is, the `Tokenizer` has its own set of documents on which we run the byte-pair encoding algorithm (as we saw above) to train its `vocabulary`.\
Once we have the `Tokenizer`, along with its `vocabulary` and `replacement table`, we can perform both **encoding** and **decoding**.\
This means the `Tokenizer` is more like a **translation layer** between **raw text** (which is a sequence of Unicode code points) and a **token sequence**. It can take **raw text** and turn it into a **token sequence** (**encoding**), and vice versa, it can take a **token sequence** and turn it back into **raw text** (**decoding**)...
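
To make the translation-layer idea concrete, here is a minimal sketch of the simplest possible tokenizer: one with no merges at all, where every UTF-8 byte is its own token. The names `encodeBytesOnly` and `decodeBytesOnly` are made up for this illustration; they are not the `encode` and `decode` functions we are working towards, which also have to handle the merged tokens...

```python
def encodeBytesOnly(text):
    # raw text -> token sequence (one token per UTF-8 byte)
    return list(text.encode("utf-8"))

def decodeBytesOnly(tokenSequence):
    # token sequence -> raw text
    return bytes(tokenSequence).decode("utf-8")

sampleText = "h\u00e9llo"  # "héllo", written with an escape to pin the code point
sampleTokens = encodeBytesOnly(sampleText)

print(sampleTokens)                                 # [104, 195, 169, 108, 108, 111]
print(decodeBytesOnly(sampleTokens) == sampleText)  # True
```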

So, now that we have the `Tokenizer` and the `replacement table`, we can turn to how we can do the **encoding** and the **decoding** steps...

And once we have done that, we will be able to translate between these two *realms*, and our **language model** can be trained as a second step afterwards...\
Typically, in a state-of-the-art application, we might take all of the training data for the **language model**, run it through the `Tokenizer` to turn everything into one massive **token sequence**, and then throw away the **raw text**. We are left with just the `tokens` themselves (and those are what get stored on disk and read by the `LLM` when it trains on them)...

And typically we want the `Tokenizer` training set to be different from the `LLM` training set because we don't want our `LLM` to handle only English text. We want it to support many different languages, and we also care about **code** versus **not-code**. So we might want to look at different mixtures of languages and different amounts of **code**, because the amount of each language in our `Tokenizer` training set determines how many merges that language gets, and that in turn determines how densely that type of data is represented in the **token space**.\
Roughly, intuitively speaking, if we add some Japanese data to the tokenizer training set, more Japanese tokens will get merged and Japanese text will encode to shorter sequences. That is beneficial for the `LLM`, which has a finite context length to work with in the **token space**...
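
A quick way to see why this matters: before any merges, every token is one UTF-8 byte, and Japanese characters take more bytes than English letters. The two strings below are just sample inputs I picked for illustration...

```python
englishText = "hello"
japaneseText = "こんにちは"  # also 5 characters

# (number of characters, number of UTF-8 bytes)
print(len(englishText), len(englishText.encode("utf-8")))    # 5 5
print(len(japaneseText), len(japaneseText.encode("utf-8")))  # 5 15
```

So with byte-level tokens alone, the same number of characters costs three times as many tokens here, and merges learned from Japanese text are what bring that number down...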

So, now that we have trained the `Tokenizer`, we will turn to **encoding** and **decoding**...

# Decoding

Let's begin with **Decoding**...

So let's first get straight what we are going to do here...

We have our **token sequence**, and we are going to pass it through the tokenizer to get back a Python string object (which is none other than the **raw text**)...

So for now, our empty function definition looks like this:
```python
def decode(tokenSequence):
    # Given a list of integers, return Python string
    pass
```

I would like to invite you to implement this function yourself, without looking at my code, because this can be done in several different ways...

But for now, this is what I came up with...

Now, as I mentioned before, our original `vocabulary` consists of the `bytes` from `b'\x00'` to `b'\xff'`, indexed from `0` to `255`, and our `replacement table` is just an extension of it. So I decided to build a final `vocabulary` dictionary in which the `keys` are the token **positions** (the token integers) and the `values` are the actual `bytes` each token stands for.

So first, we will initialize our `vocabulary` starting from the map `0: b'\x00'` to `255: b'\xff'`... and the code looks like this:
```python
vocabulary = {index: bytes([index]) for index in range(256)}
```
And printing it gives this dictionary:
```python
{0: b'\x00', 1: b'\x01', 2: b'\x02', 3: b'\x03', 4: b'\x04', 5: b'\x05', 6: b'\x06', 7: b'\x07', 8: b'\x08', 9: b'\t', 10: b'\n', 11: b'\x0b', 12: b'\x0c', 13: b'\r', 14: b'\x0e', 15: b'\x0f', 16: b'\x10', 17: b'\x11', 18: b'\x12', 19: b'\x13', 20: b'\x14', 21: b'\x15', 22: b'\x16', 23: b'\x17', 24: b'\x18', 25: b'\x19', 26: b'\x1a', 27: b'\x1b', 28: b'\x1c', 29: b'\x1d', 30: b'\x1e', 31: b'\x1f', 32: b' ', 33: b'!', 34: b'"', 35: b'#', 36: b'$', 37: b'%', 38: b'&', 39: b"'", 40: b'(', 41: b')', 42: b'*', 43: b'+', 44: b',', 45: b'-', 46: b'.', 47: b'/', 48: b'0', 49: b'1', 50: b'2', 51: b'3', 52: b'4', 53: b'5', 54: b'6', 55: b'7', 56: b'8', 57: b'9', 58: b':', 59: b';', 60: b'<', 61: b'=', 62: b'>', 63: b'?', 64: b'@', 65: b'A', 66: b'B', 67: b'C', 68: b'D', 69: b'E', 70: b'F', 71: b'G', 72: b'H', 73: b'I', 74: b'J', 75: b'K', 76: b'L', 77: b'M', 78: b'N', 79: b'O', 80: b'P', 81: b'Q', 82: b'R', 83: b'S', 84: b'T', 85: b'U', 86: b'V', 87: b'W', 88: b'X', 89: b'Y', 90: b'Z', 91: b'[', 92: b'\\', 93: b']', 94: b'^', 95: b'_', 96: b'`', 97: b'a', 98: b'b', 99: b'c', 100: b'd', 101: b'e', 102: b'f', 103: b'g', 104: b'h', 105: b'i', 106: b'j', 107: b'k', 108: b'l', 109: b'm', 110: b'n', 111: b'o', 112: b'p', 113: b'q', 114: b'r', 115: b's', 116: b't', 117: b'u', 118: b'v', 119: b'w', 120: b'x', 121: b'y', 122: b'z', 123: b'{', 124: b'|', 125: b'}', 126: b'~', 127: b'\x7f', 128: b'\x80', 129: b'\x81', 130: b'\x82', 131: b'\x83', 132: b'\x84', 133: b'\x85', 134: b'\x86', 135: b'\x87', 136: b'\x88', 137: b'\x89', 138: b'\x8a', 139: b'\x8b', 140: b'\x8c', 141: b'\x8d', 142: b'\x8e', 143: b'\x8f', 144: b'\x90', 145: b'\x91', 146: b'\x92', 147: b'\x93', 148: b'\x94', 149: b'\x95', 150: b'\x96', 151: b'\x97', 152: b'\x98', 153: b'\x99', 154: b'\x9a', 155: b'\x9b', 156: b'\x9c', 157: b'\x9d', 158: b'\x9e', 159: b'\x9f', 160: b'\xa0', 161: b'\xa1', 162: b'\xa2', 163: b'\xa3', 164: b'\xa4', 165: b'\xa5', 166: b'\xa6', 167: b'\xa7', 168: b'\xa8', 169: b'\xa9', 170: b'\xaa', 171: 
b'\xab', 172: b'\xac', 173: b'\xad', 174: b'\xae', 175: b'\xaf', 176: b'\xb0', 177: b'\xb1', 178: b'\xb2', 179: b'\xb3', 180: b'\xb4', 181: b'\xb5', 182: b'\xb6', 183: b'\xb7', 184: b'\xb8', 185: b'\xb9', 186: b'\xba', 187: b'\xbb', 188: b'\xbc', 189: b'\xbd', 190: b'\xbe', 191: b'\xbf', 192: b'\xc0', 193: b'\xc1', 194: b'\xc2', 195: b'\xc3', 196: b'\xc4', 197: b'\xc5', 198: b'\xc6', 199: b'\xc7', 200: b'\xc8', 201: b'\xc9', 202: b'\xca', 203: b'\xcb', 204: b'\xcc', 205: b'\xcd', 206: b'\xce', 207: b'\xcf', 208: b'\xd0', 209: b'\xd1', 210: b'\xd2', 211: b'\xd3', 212: b'\xd4', 213: b'\xd5', 214: b'\xd6', 215: b'\xd7', 216: b'\xd8', 217: b'\xd9', 218: b'\xda', 219: b'\xdb', 220: b'\xdc', 221: b'\xdd', 222: b'\xde', 223: b'\xdf', 224: b'\xe0', 225: b'\xe1', 226: b'\xe2', 227: b'\xe3', 228: b'\xe4', 229: b'\xe5', 230: b'\xe6', 231: b'\xe7', 232: b'\xe8', 233: b'\xe9', 234: b'\xea', 235: b'\xeb', 236: b'\xec', 237: b'\xed', 238: b'\xee', 239: b'\xef', 240: b'\xf0', 241: b'\xf1', 242: b'\xf2', 243: b'\xf3', 244: b'\xf4', 245: b'\xf5', 246: b'\xf6', 247: b'\xf7', 248: b'\xf8', 249: b'\xf9', 250: b'\xfa', 251: b'\xfb', 252: b'\xfc', 253: b'\xfd', 254: b'\xfe', 255: b'\xff'}
```

And now we will start appending our replacement table items into our final `vocabulary`...

Remember how we already indexed our `replacement table` starting from `256`?

That is now going to come in handy because our `replacement table` looks like this for now:
```python
{(101, 32): 256, (105, 110): 257, (115, 32): 258, (116, 104): 259, (101, 114): 260, (99, 111): 261, (116, 32): 262, (226, 128): 263, (44, 32): 264, (97, 110): 265, (111, 114): 266, (100, 32): 267, (97, 114): 268, (101, 110): 269, (257, 103): 270, (261, 100): 271, (121, 32): 272, (46, 32): 273, (97, 108): 274, (259, 256): 275}
```

So our `replacement table` contains the **pair of bytes positions** and its corresponding **index**...

So now we can take those items out of our `replacement table`, merge the `bytes` together (concatenate the two `bytes` values) and add the result to our final `vocabulary`...

So our code looks like this:
```python
for (position0, position1), index in replacementTable.items():
    vocabulary[index] = vocabulary[position0] + vocabulary[position1]
```

So our `final vocabulary` looks like this now:
```python
{0: b'\x00', 1: b'\x01', 2: b'\x02', 3: b'\x03', 4: b'\x04', 5: b'\x05', 6: b'\x06', 7: b'\x07', 8: b'\x08', 9: b'\t', 10: b'\n', 11: b'\x0b', 12: b'\x0c', 13: b'\r', 14: b'\x0e', 15: b'\x0f', 16: b'\x10', 17: b'\x11', 18: b'\x12', 19: b'\x13', 20: b'\x14', 21: b'\x15', 22: b'\x16', 23: b'\x17', 24: b'\x18', 25: b'\x19', 26: b'\x1a', 27: b'\x1b', 28: b'\x1c', 29: b'\x1d', 30: b'\x1e', 31: b'\x1f', 32: b' ', 33: b'!', 34: b'"', 35: b'#', 36: b'$', 37: b'%', 38: b'&', 39: b"'", 40: b'(', 41: b')', 42: b'*', 43: b'+', 44: b',', 45: b'-', 46: b'.', 47: b'/', 48: b'0', 49: b'1', 50: b'2', 51: b'3', 52: b'4', 53: b'5', 54: b'6', 55: b'7', 56: b'8', 57: b'9', 58: b':', 59: b';', 60: b'<', 61: b'=', 62: b'>', 63: b'?', 64: b'@', 65: b'A', 66: b'B', 67: b'C', 68: b'D', 69: b'E', 70: b'F', 71: b'G', 72: b'H', 73: b'I', 74: b'J', 75: b'K', 76: b'L', 77: b'M', 78: b'N', 79: b'O', 80: b'P', 81: b'Q', 82: b'R', 83: b'S', 84: b'T', 85: b'U', 86: b'V', 87: b'W', 88: b'X', 89: b'Y', 90: b'Z', 91: b'[', 92: b'\\', 93: b']', 94: b'^', 95: b'_', 96: b'`', 97: b'a', 98: b'b', 99: b'c', 100: b'd', 101: b'e', 102: b'f', 103: b'g', 104: b'h', 105: b'i', 106: b'j', 107: b'k', 108: b'l', 109: b'm', 110: b'n', 111: b'o', 112: b'p', 113: b'q', 114: b'r', 115: b's', 116: b't', 117: b'u', 118: b'v', 119: b'w', 120: b'x', 121: b'y', 122: b'z', 123: b'{', 124: b'|', 125: b'}', 126: b'~', 127: b'\x7f', 128: b'\x80', 129: b'\x81', 130: b'\x82', 131: b'\x83', 132: b'\x84', 133: b'\x85', 134: b'\x86', 135: b'\x87', 136: b'\x88', 137: b'\x89', 138: b'\x8a', 139: b'\x8b', 140: b'\x8c', 141: b'\x8d', 142: b'\x8e', 143: b'\x8f', 144: b'\x90', 145: b'\x91', 146: b'\x92', 147: b'\x93', 148: b'\x94', 149: b'\x95', 150: b'\x96', 151: b'\x97', 152: b'\x98', 153: b'\x99', 154: b'\x9a', 155: b'\x9b', 156: b'\x9c', 157: b'\x9d', 158: b'\x9e', 159: b'\x9f', 160: b'\xa0', 161: b'\xa1', 162: b'\xa2', 163: b'\xa3', 164: b'\xa4', 165: b'\xa5', 166: b'\xa6', 167: b'\xa7', 168: b'\xa8', 169: b'\xa9', 170: b'\xaa', 171: 
b'\xab', 172: b'\xac', 173: b'\xad', 174: b'\xae', 175: b'\xaf', 176: b'\xb0', 177: b'\xb1', 178: b'\xb2', 179: b'\xb3', 180: b'\xb4', 181: b'\xb5', 182: b'\xb6', 183: b'\xb7', 184: b'\xb8', 185: b'\xb9', 186: b'\xba', 187: b'\xbb', 188: b'\xbc', 189: b'\xbd', 190: b'\xbe', 191: b'\xbf', 192: b'\xc0', 193: b'\xc1', 194: b'\xc2', 195: b'\xc3', 196: b'\xc4', 197: b'\xc5', 198: b'\xc6', 199: b'\xc7', 200: b'\xc8', 201: b'\xc9', 202: b'\xca', 203: b'\xcb', 204: b'\xcc', 205: b'\xcd', 206: b'\xce', 207: b'\xcf', 208: b'\xd0', 209: b'\xd1', 210: b'\xd2', 211: b'\xd3', 212: b'\xd4', 213: b'\xd5', 214: b'\xd6', 215: b'\xd7', 216: b'\xd8', 217: b'\xd9', 218: b'\xda', 219: b'\xdb', 220: b'\xdc', 221: b'\xdd', 222: b'\xde', 223: b'\xdf', 224: b'\xe0', 225: b'\xe1', 226: b'\xe2', 227: b'\xe3', 228: b'\xe4', 229: b'\xe5', 230: b'\xe6', 231: b'\xe7', 232: b'\xe8', 233: b'\xe9', 234: b'\xea', 235: b'\xeb', 236: b'\xec', 237: b'\xed', 238: b'\xee', 239: b'\xef', 240: b'\xf0', 241: b'\xf1', 242: b'\xf2', 243: b'\xf3', 244: b'\xf4', 245: b'\xf5', 246: b'\xf6', 247: b'\xf7', 248: b'\xf8', 249: b'\xf9', 250: b'\xfa', 251: b'\xfb', 252: b'\xfc', 253: b'\xfd', 254: b'\xfe', 255: b'\xff', 256: b'e ', 257: b'in', 258: b's ', 259: b'th', 260: b'er', 261: b'co', 262: b't ', 263: b'\xe2\x80', 264: b', ', 265: b'an', 266: b'or', 267: b'd ', 268: b'ar', 269: b'en', 270: b'ing', 271: b'cod', 272: b'y ', 273: b'. ', 274: b'al', 275: b'the '}
```

Notice how the `vocabulary` has now expanded from the last index `255` to `275` now?

That is because we did exactly `20` merges and this gives off the `final vocabulary`...

One thing we have to be careful about: because we are iterating over a dictionary in Python using `.items()`, it is really important that the loop runs in the same order in which we inserted the merges into our `replacement table`. Later merges can refer to tokens minted by earlier ones (for example `(257, 103): 270` needs `vocabulary[257]` to exist already). Luckily, from `Python 3.7` onwards dictionaries are guaranteed to preserve insertion order; before `Python 3.7` this iteration may have been out of order with respect to how we inserted items into our `replacement table`. But we are using modern Python, so we are safe...
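To see why the order matters, here is a minimal sketch with a hypothetical two-merge table. The names `toyVocabulary`, `toyReplacementTable` and `brokenVocabulary` are only for illustration, and the base vocabulary is assumed to be one `bytes` entry per value `0..255`, as in the table above:

```python
# Base vocabulary: one entry per possible byte value (assumed layout)
toyVocabulary = {index: bytes([index]) for index in range(256)}

# Hypothetical merges: 256 = "in", 257 = token 256 followed by "g"
toyReplacementTable = {(105, 110): 256, (256, 103): 257}

# Insertion order works, because 256 exists before 257 needs it
for (position0, position1), index in toyReplacementTable.items():
    toyVocabulary[index] = toyVocabulary[position0] + toyVocabulary[position1]
print(toyVocabulary[257])  # b'ing'

# Processing the same merges in reverse order fails on the missing token 256
brokenVocabulary = {index: bytes([index]) for index in range(256)}
try:
    for (position0, position1), index in reversed(list(toyReplacementTable.items())):
        brokenVocabulary[index] = brokenVocabulary[position0] + brokenVocabulary[position1]
    orderMattered = False
except KeyError:
    orderMattered = True
print(orderMattered)  # True
```

If we ever could not rely on insertion order, one option would be to iterate over `sorted(toyReplacementTable.items(), key=lambda item: item[1])` instead, which orders the merges by their token index.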

So after we have crafted our `final vocabulary`, we are now going to implement our `decode()` function...

So in our `decode()` function, the first thing we do is iterate over our `token sequence` list, look up the `bytes` value of each token in our `final vocabulary` and concatenate them into one long `raw byte sequence`.\
Then we use the built-in `decode()` method of `bytes` objects to turn that `raw byte sequence` into `raw text`, and return the decoded string...

So, our `decode()` implementation now looks like this:
```python
def decode(tokenSequence):
    tokens = b"".join(vocabulary[index] for index in tokenSequence)
    rawText = tokens.decode("UTF-8")
    return rawText
```

So doing `print(decode(decimalTokens))` will be able to give us back the entire **raw text** that we expect...

But there's a problem with the way we have implemented it... It can actually throw an error...

I would now like you to take a moment and think, how this could be problematic if we plug in some sequence of tokens that is not lucky for our `decode()` function...

So let me demonstrate the error in small *toy-examples*...

First let's say we want to decode the token `65`:
```python
>>> print(decode([65]))
A
```
We see the result is `'A'`, because the byte `65` is the `UTF-8` encoding of the Latin letter `A`...\
But when we try to decode the token `128`:
```python
print(decode([128]))
```
We get the error:
```bash
Traceback (most recent call last):
  File "f:\Python Notebooks\test.py", line 56, in <module>
    print(decode([128]))
          ^^^^^^^^^^^^^
  File "f:\Python Notebooks\test.py", line 53, in decode
    rawText = tokens.decode("UTF-8")
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```

Now what does this error mean?

Now to understand what this means we have to go back to the `UTF-8` <a href="https://en.wikipedia.org/wiki/UTF-8#Encoding">documentation page</a>.

And basically there's a specific schema, that `UTF-8` bytes take...

Let's look at the table again to understand this:

Code point ↔ UTF-8 conversion
| First code point              | Last code point | Byte 1    | Byte 2   | Byte 3   | Byte 4   |
|-------------------------------|------------|-----------|----------|----------|----------|
| U+00<span style="color:red;">0</span><span style="color:purple;">0</span>                | U+00<span style="color:red;">7</span><span style="color:purple;">F</span>               | 0<span style="color:red;">xxx</span><span style="color:purple;">xxxx</span> |          |          |          |
| U+0<span style="color:green;">0</span><span style="color:red;">8</span><span style="color:purple;">0</span>                | U+0<span style="color:green;">7</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 110<span style="color:green;">xxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |          |
| U+<span style="color:blue;">0</span><span style="color:green;">8</span><span style="color:red;">0</span><span style="color:purple;">0</span>                | U+<span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span>               | 1110<span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |          |
| U+<span style="color:crimson;">0</span><span style="color:orange;">1</span><span style="color:blue;">0</span><span style="color:green;">0</span><span style="color:red;">0</span><span style="color:purple;">0</span> | U+<span style="color:crimson;">1</span><span style="color:orange;">0</span><span style="color:blue;">F</span><span style="color:green;">F</span><span style="color:red;">F</span><span style="color:purple;">F</span> | 11110<span style="color:crimson;">x</span><span style="color:orange;">xx</span> | 10<span style="color:orange;">xx</span><span style="color:blue;">xxxx</span> | 10<span style="color:green;">xxxx</span><span style="color:red;">xx</span> | 10<span style="color:red;">xx</span><span style="color:purple;">xxxx</span> |

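We see that Unicode code points that take up **more than one byte** follow a specific schema: the first byte starts with `110`, `1110` or `11110`, and every following byte starts with `10`... The byte `128` is `10000000` in binary, so it is a *continuation* byte, and a continuation byte can never appear at the start of a sequence. Our `decode()` does not guarantee that the bytes it joins follow this schema...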
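A quick way to check a byte against the table is to print its bit pattern. This is a small sketch using only built-in Python; `leadingBits` is a hypothetical helper name, not something from the notebook:

```python
def leadingBits(byteValue):
    # Render one byte as an 8-character binary string
    return format(byteValue, "08b")

print(leadingBits(65))   # 01000001 -> starts with 0, a complete one-byte character
print(leadingBits(128))  # 10000000 -> starts with 10, a continuation byte

# A three-byte character follows the 1110xxxx 10xxxxxx 10xxxxxx row of the table
euroBits = [leadingBits(byteValue) for byteValue in "€".encode("utf-8")]
print(euroBits)  # ['11100010', '10000010', '10101100']

# A continuation byte on its own is rejected by the strict decoder
try:
    bytes([128]).decode("utf-8")
    decodedAlone = True
except UnicodeDecodeError:
    decodedAlone = False
print(decodedAlone)  # False
```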

To fix this we can look at the Python <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">documentation for `bytes.decode`</a>...\
We see that there is a parameter `errors` that controls how decoding errors are handled (by default the argument is set to `'strict'`, which raises an exception)... The documentation also points to <a href="https://docs.python.org/3/library/codecs.html#error-handlers">this section</a>, which lists the other error handlers we can pass...

For us, we will use the argument `'replace'`, which substitutes the replacement character `�` (`U+FFFD`) for any bytes that cannot be decoded...

So our code now looks like:
```python
def decode(tokenSequence):
    tokens = b"".join(vocabulary[index] for index in tokenSequence)
    rawText = tokens.decode("UTF-8", errors="replace")
    return rawText
```

And if we now try to decode the same token `128`, we get this:
```python
>>> print(decode([128]))
�
```

This shows that not every `byte` sequence is valid `UTF-8`, and if the `LLM`, for example, predicts `tokens` in a bad manner, they might **not** form **valid `UTF-8`**, and with strict decoding we would not be able to decode them...

So the standard practice is to use `errors="replace"`, and this is what you will find in the code that OpenAI released as well... Whenever you see the `�` character in your output, something went wrong and the token sequence was not valid `UTF-8`...
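To get a feel for the `errors` parameter, the sketch below decodes the same invalid byte sequence with three of the standard error handlers. The variable names are only for illustration; the handler names themselves come from Python's `codecs` documentation:

```python
# "hi" followed by a stray continuation byte
invalidBytes = bytes([104, 105, 128])

results = {}
for handler in ("replace", "ignore", "backslashreplace"):
    # Each handler decides what to do with the byte that cannot be decoded
    results[handler] = invalidBytes.decode("utf-8", errors=handler)
    print(handler, "->", repr(results[handler]))
```

`'replace'` keeps a visible marker where the bad byte was, while `'ignore'` silently drops it, which is probably why `'replace'` is the more common choice: a broken token sequence stays noticeable in the output.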

Now that we know this, we can now move on to the **encoding** part of the `Tokenizer`...

# Encoding

Now we can start implementing our **Encoding** part...

Once again let's get our idea straight of what we want to do here...

We have our Python `string` text and we want to encode it into a `token sequence` (or a list of Python integers)...

So, for now our empty function `encode()` looks like this:
```python
def encode(text):
    # Given a string, return a list of integers
    pass
```

Once again, I would like to invite you to implement this function yourself, without looking at my code, because this can be done in several different ways...

But for now, this is what I came up with...

First we will encode the string using `UTF-8` encoding to get the `byte` sequence and convert it into a `list of integers` to get the **raw bytes**... So we end up with a code line like this:
```python
tokens = list(text.encode("UTF-8"))
```
These **raw bytes** have not gone through the **Byte-Pair Encoding** algorithm yet, so we need to apply the merges iteratively. For now we will run the iteration inside an infinite `while True:` loop and specify the **breaking condition** later. Inside the loop, we first find the **pair-frequency** of the consecutive pairs of the current tokens, and since we already have `getPairFrequency()` implemented, we can reuse it. So, the next code line that we end up with is this:
```python
pairFrequency = getPairFrequency(tokens=tokens)
```
At this moment we have the **pair-frequency** of the **raw bytes**, but we do not care about the frequencies themselves, because our `Tokenizer` is already trained; we only care about the unique consecutive pairs, which are the candidates for merging. The most important thing to note is that we must apply merges in the order in which they were learned, that is, in increasing order of the `replacement table`'s values (the `replacement table` maps each pair to its `new vocabulary token`, and these indices were assigned in the order the merges happened). A later merge such as `(257, 103)` can only apply once the earlier merge that produces `257` has been done.\
So at this stage we want the pair inside the **pair-frequency** that has the lowest index in the `replacement table`. The easiest way to do that is the built-in `min()` function. Iterating over a dictionary yields its `keys` (here, the pairs), and by default `min()` would compare those pairs themselves, so we pass a `key` argument to tell it what to rank by: a `lambda` that takes a pair and looks up its `new vocabulary token` value in the `replacement table`. There is a possibility that a pair is not in the `replacement table` at all, because the text may contain data that the `Tokenizer` has not seen before; since we are taking a `min()`, we give such pairs the highest possible rank so they are never chosen ahead of a real merge, and the easiest way to do that is to fall back to `float("inf")`. So we end up with the `minimum-pair`, the earliest-learned merge among the consecutive pairs... So we end up with a code line like this:
```python
minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
```
Now, if none of the pairs is in the `replacement table`, every pair is ranked `inf` and `min()` simply returns the first pair in the **pair-frequency**. We can handle both this case and the **breaking condition** of the infinite loop with a single check: if the chosen `minimum-pair` is not in the `replacement table`, then there is nothing left to merge and we break out of the loop. So we end up with the following check:
```python
# Nothing to merge
if minPair not in replacementTable:
    break
```
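The selection step and the stopping check can be tried in isolation. This is a small sketch with a hypothetical two-entry table; `toyReplacementTable` and `pickPair` are illustration names, not part of the notebook:

```python
# Hypothetical merges: 256 was learned first, 257 second
toyReplacementTable = {(116, 104): 256, (256, 101): 257}

def pickPair(candidatePairs):
    # Rank each pair by its merge index; unknown pairs rank as infinity
    return min(candidatePairs, key=lambda pair: toyReplacementTable.get(pair, float("inf")))

# One unknown pair and two known ones: the earliest-learned merge wins
chosen = pickPair([(104, 101), (256, 101), (116, 104)])
print(chosen)  # (116, 104)

# Only unknown pairs: every rank ties at infinity, so min() returns the first pair
fallback = pickPair([(1, 2), (3, 4)])
print(fallback, fallback in toyReplacementTable)  # (1, 2) False
```

The second case is the one the `if minPair not in replacementTable` check catches: the pair that `min()` hands back is not a real merge, so the loop should stop.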
Now that we have the most eligible `minimum-pair(ordered) of unique consecutive raw-bytes` that is **eligible for merging**, we can go ahead and look up its token index in our `replacement table`... And we end up with a code line like this:
```python
newVocabularyToken = replacementTable[minPair]
```
And finally we can use our already implemented function `mergePair()` to merge our **raw-bytes** with the **most eligible merging pair** while *minting* the **new vocabulary token** as `newVocabularyToken` to get the **list of tokens after the current iteration**... So we have a line like this now:
```python
tokens = mergePair(tokens=tokens, pair=minPair, newVocabularyToken=newVocabularyToken)
```
And finally we can return the encoded tokens after the iterations are done (`return tokens`)...

So our full implementation of `encode()` function looks like this:
```python
def encode(text):
    # Given a string, return a list of integers
    tokens = list(text.encode("UTF-8"))
    while True:
        pairFrequency = getPairFrequency(tokens=tokens)
        minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
        # Nothing to merge
        if minPair not in replacementTable:
            break
        newVocabularyToken = replacementTable[minPair]
        tokens = mergePair(tokens=tokens, pair=minPair, newVocabularyToken=newVocabularyToken)
    return tokens
```
And if we take this function for a spin, and try to print the tokens before and after **byte-pair encoding** algorithm, we get an output like this:
```python
>>> print("Raw Tokens:", list("hey hey hey".encode("UTF-8")))
Raw Tokens: [104, 101, 121, 32, 104, 101, 121, 32, 104, 101, 121]
>>> print("Encoded Tokens:", encode("hey hey hey"))
Encoded Tokens: [104, 101, 272, 104, 101, 272, 104, 101, 121]
```

Now this implementation is not totally complete yet, because if we pass in a single character like this:
```python
print(encode("h"))
```
We get:
```bash
Traceback (most recent call last):
  File "f:\Python Notebooks\test.py", line 69, in <module>
    print(encode("h"))
          ^^^^^^^^^^^
  File "f:\Python Notebooks\test.py", line 61, in encode
    minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() arg is an empty sequence
```
Or we pass an empty string like this:
```python
print(encode(""))
```
We get:
```bash
Traceback (most recent call last):
  File "f:\Python Notebooks\test.py", line 69, in <module>
    print(encode(""))
          ^^^^^^^^^^
  File "f:\Python Notebooks\test.py", line 61, in encode
    minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() arg is an empty sequence
```
We see that we end up getting errors because in both the cases, the `pairFrequency` ends up empty and that causes an issue inside the `min()` function...

One way to *fight* this is to turn the `while` loop's condition into a check that `tokens` contains at least `2` elements (with fewer than two tokens there are no pairs to merge), such that our line becomes:
```python
while len(tokens) >= 2:
```

So now our full implementation of the function becomes:
```python
def encode(text):
    # Given a string, return a list of integers
    tokens = list(text.encode("UTF-8"))
    while len(tokens) >= 2:
        pairFrequency = getPairFrequency(tokens=tokens)
        minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
        # Nothing to merge
        if minPair not in replacementTable:
            break
        newVocabularyToken = replacementTable[minPair]
        tokens = mergePair(tokens=tokens, pair=minPair, newVocabularyToken=newVocabularyToken)
    return tokens
```

And when we try to take the same examples for a spin, we get:
```python
>>> print(encode("h"))
[104]
>>> print(encode(""))
[]
```
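For readers who want to run `encode()` on its own, here is a self-contained sketch. The `replacementTable` is copied from the table shown earlier, while the bodies of `getPairFrequency()` and `mergePair()` are my assumption of how those helpers behave (count consecutive pairs, and replace each occurrence of a pair from left to right); the notebook's own versions may differ in detail:

```python
replacementTable = {
    (101, 32): 256, (105, 110): 257, (115, 32): 258, (116, 104): 259, (101, 114): 260,
    (99, 111): 261, (116, 32): 262, (226, 128): 263, (44, 32): 264, (97, 110): 265,
    (111, 114): 266, (100, 32): 267, (97, 114): 268, (101, 110): 269, (257, 103): 270,
    (261, 100): 271, (121, 32): 272, (46, 32): 273, (97, 108): 274, (259, 256): 275,
}

def getPairFrequency(tokens):
    # Assumed behaviour: count every pair of consecutive tokens
    pairFrequency = {}
    for pair in zip(tokens, tokens[1:]):
        pairFrequency[pair] = pairFrequency.get(pair, 0) + 1
    return pairFrequency

def mergePair(tokens, pair, newVocabularyToken):
    # Assumed behaviour: replace each occurrence of pair, scanning left to right
    mergedTokens = []
    position = 0
    while position < len(tokens):
        if position < len(tokens) - 1 and (tokens[position], tokens[position + 1]) == pair:
            mergedTokens.append(newVocabularyToken)
            position += 2
        else:
            mergedTokens.append(tokens[position])
            position += 1
    return mergedTokens

def encode(text):
    # Given a string, return a list of integers
    tokens = list(text.encode("UTF-8"))
    while len(tokens) >= 2:
        pairFrequency = getPairFrequency(tokens=tokens)
        minPair = min(pairFrequency, key=lambda pair: replacementTable.get(pair, float("inf")))
        if minPair not in replacementTable:
            break  # Nothing to merge
        tokens = mergePair(tokens=tokens, pair=minPair, newVocabularyToken=replacementTable[minPair])
    return tokens

print(encode("hey hey hey"))  # [104, 101, 272, 104, 101, 272, 104, 101, 121]
print(encode("the "))         # [275]
print(encode("coding"))       # [271, 270]
print(encode("h"), encode(""))  # [104] []
```

The `"the "` example shows the merge order at work: `(101, 32)` is applied first, then `(116, 104)`, and only then can `(259, 256)` collapse everything into the single token `275`.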

And now we can move on to the testing phase of these **encoding** and **decoding** functions...

# Testing **Encoding** and **Decoding** together

Now we will be testing the **encoding** and **decoding** together, with the intuition that "if we **encode** something, we should get the same text back after **decoding**"...

So we will check the compatibility in three cases:
1. Simple English string
2. Original `Tokenizer` training dataset string
3. Text that the `Tokenizer` has not seen before

Let's look at **compatibility-1** (or for a simple English string):
```python
>>> print(decode(encode("hello world!")))
hello world!
```
We see that we get the same text back...\
Secondly we will look at **compatibility-2** (or for the original `Tokenizer` training dataset string):
```python
>>> print(unicodetext == decode(encode(unicodetext)))
True
```
Seems like the entire training set matches with the `Tokenizer` after performing **encoding** and **decoding**...\
Finally we will look at **compatibility-3** (or the text that the `Tokenizer` has not seen before):
```python
>>> print("ये हिंदी है" == decode(encode("ये हिंदी है")))
True
```

This gives us enough confidence that this `Tokenizer` was implemented correctly...

But as I have mentioned before, **not all token sequences are valid `UTF-8` byte-streams**, meaning that some of them are not **decodable**. So the round trip is only guaranteed in one direction: if we **encode** a string and then **decode** the result we get the same string back, but if we **decode** an arbitrary token sequence and then **encode** the result, we are not guaranteed to get the same tokens back...

So this is a very important point to note...
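A tiny sketch of that asymmetry, assuming for brevity a tokenizer with no merges at all, so that tokens are just raw byte values (the variable names are only for illustration):

```python
# Direction 1: text -> tokens -> text always comes back unchanged
text = "hello world!"
textTokens = list(text.encode("utf-8"))
roundTrippedText = bytes(textTokens).decode("utf-8", errors="replace")
print(roundTrippedText == text)  # True

# Direction 2: tokens -> text -> tokens can change the tokens
strayTokens = [128]  # a lone continuation byte, not valid UTF-8
decodedText = bytes(strayTokens).decode("utf-8", errors="replace")
reEncodedTokens = list(decodedText.encode("utf-8"))
print(reEncodedTokens)  # [239, 191, 189]
```

The three bytes that come back are the `UTF-8` encoding of the replacement character `U+FFFD`, not the original token `128`.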

What we are going to do now is that, we are going to look at the *state-of-the-art* `LLMs` and the kinds of `Tokenizers` that they use and see how the thing that we implemented here, complexifies so quickly... And we are going to go through the details of the complexification one at a time...

# GPT - 2 Implementation of `Tokenizer`

Let's look at the <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2 paper</a> once again and scroll down to `Section 2.2` which is the **Input Representation** section (because, this is where they motivate the use of the **Byte-Pair Encoding** algorithm on the `byte` level representation of `UTF-8` encoding)...

Now, everything here is exactly as we have covered it so far, but things start to depart when they write this part:
>We observed
BPE including many versions of common words like `dog`
since they occur in many variations such as `dog.` `dog!`
`dog?` . This results in a sub-optimal allocation of limited
vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any
byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding
only minimal fragmentation of words across multiple vocab
tokens.


So what they mention is that they don't just apply the **Byte-Pair Encoding (BPE)** algorithm naively (as we have done), and they give a motivating example:\
Suppose that we have a common word like `dog`.\
This word occurs very frequently in the text used to train the `Tokenizer`, and it occurs next to all kinds of punctuation, like `dog.`, `dog!` & `dog?` etc.\
Naively, the **Byte-Pair Encoding** algorithm could merge each of these into its own single token, and we would end up with many tokens that are just `dog` with slightly different punctuation. This bloats the `Tokenizer` vocabulary: we end up clustering things that **shouldn't** be clustered, combining *semantics* with *punctuation*, and the paper reports that this is **sub-optimal**.

So, what they wanted to do is **enforce** some manual rules that *some types of characters* should never be merged together. So, they wanted to **enforce** these merging rules on top of the **Byte-Pair Encoding** algorithm...

So let's take a look at their <a href="https://github.com/openai/gpt-2">GPT-2 code</a>, and see how they actually **enforce** this and what kinds of merges they actually perform...

Now if we look at their <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">src/encoder.py</a> we see that this is the entire `Tokenizer`, and the name of the file is actually a bit *awkward* because the `Tokenizer` can do both: **encoding** and **decoding**...

And there's a lot going on here, and we are going to step through it in detail...

And for now I want us to focus on the code on `line 52` and `line 53`:
```python
# Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
```
Now this is a **Regular Expression (RegEx)** pattern that looks very complicated, and we are going to go through it in a bit...\
But this is the core part that **enforces** the rules about which parts of the text will never be merged together...

And I want you to also notice that `re.compile()` here is a little bit misleading, because at `line 5` they use:
```python
import regex as re
```
This is not importing the standard Python `re` module; instead, they import a package called `regex` and alias it as `re`...

And <a href="https://pypi.org/project/regex/">`regex`</a> is a Python package that can be installed from `PyPI` with the command `pip install regex`, and it is basically a more powerful extension of `re`...

So, let's get back to the **regular-expression** pattern and understand why they are doing, what they are doing...

So I have a small snippet that lets us test this pattern out:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "Hello World"))
```
Which gives us this output:
```python
['Hello', ' World']
```

`re.findall()` will take the pattern that we have compiled and try to match it against the string.\
The way this works is that we go from left to right through the string trying to match the pattern, and `re.findall()` collects all the occurrences and **organizes them into a list**...

And when we notice the pattern, we first of all notice that this is a `raw-string` (string prepended with an `r`) hinting that the `\`'s within the string will be included and will **not** be treated as **escape sequences**...
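A two-line check makes the raw-string behaviour concrete. This sketch uses the standard `re` module, which is enough for a plain `\s+`:

```python
import re

# In a normal string "\n" is one newline character; in a raw string it is two characters
print(len("\n"), len(r"\n"))  # 1 2

# The regex engine receives the backslash itself and interprets \s as "whitespace"
whitespaceRuns = re.findall(r"\s+", "a \t b")
print(whitespaceRuns)  # [' \t ']
```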

And we see that this pattern is made up of a lot of `|` characters, which are none other than `OR`s in regex.

And we can start looking at some documentation that I found: <a href="https://www.regular-expressions.info/unicode.html">Unicode Regular Expressions</a>, there might be other sources as well...

And let me list all the important things that they used here from the documentation:
- `\p{L}` or `\p{Letter}`: any kind of letter from any language.
- `\p{N}` or `\p{Number}`: any kind of numeric character in any script.
- `+`: one or more
- ` ?`: optional space
- `\s`: whitespace character

We see that in our example `"Hello World"`, `"Hello"` is matched by the part ` ?\p{L}+` which matches any optional space followed by one or more letters in any language, but the match ends because the **whitespace** is not a letter...

From there a new attempt begins to match the pattern against the rest of the string... None of the earlier alternatives match, and we end up in the exact same part ` ?\p{L}+`, which matches `" World"`: an optional space, followed by a bunch of letters.

So finally, when we run this, we get a list of two elements, `"Hello"` and `" World"`...

So if we test a different string:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "Hello how are you"))
```
We get this:
```python
['Hello', ' how', ' are', ' you']
```

So what is this doing, and why is it important?

We are taking our text string, and instead of directly encoding it for `Tokenization`, we are first splitting it up into a list of texts, and all these elements of this list are **processed independently by the `Tokenizer`**. And all the **results of the processing are simply concatenated**...

And roughly speaking what that does, is **we end up finding the merges between the elements of this list of texts, so we can only ever consider merges within every one of these items individually**. Which means that we won't be merging the contents of `Item-A` with `Item-B` from the output of this operation, because they are now parts of the separate elements of this list(because we are breaking it up in this way)...

And this is one way of **enforcing** the rule that some merges are never to happen. We are going to look at more of this pattern, and we will see that what it is trying to do on a high level is to not merge across letters, numbers, punctuation and so on...
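To make "merges only happen within a chunk" concrete, the sketch below compares the adjacent byte pairs available in the whole text with the pairs available inside the chunks. Note the assumptions: so that it runs without extra installs, it swaps in the standard `re` module with a rough approximation of the GPT-2 pattern (`[^\W\d_]` standing in for `\p{L}`, `\d` for `\p{N}`), which differs from the real pattern for some characters; `approximatePattern` and `adjacentBytePairs` are hypothetical names:

```python
import re

# Rough stdlib approximation of the GPT-2 split pattern (illustration only)
approximatePattern = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)

def adjacentBytePairs(text):
    # All pairs of consecutive UTF-8 bytes, i.e. every merge candidate in this text
    rawBytes = list(text.encode("utf-8"))
    return set(zip(rawBytes, rawBytes[1:]))

text = "the dog."
chunks = approximatePattern.findall(text)
print(chunks)  # ['the', ' dog', '.']

pairsInWholeText = adjacentBytePairs(text)
pairsInsideChunks = set().union(*(adjacentBytePairs(chunk) for chunk in chunks))

# Pairs that straddle a chunk boundary can never be merged
lostPairs = sorted(pairsInWholeText - pairsInsideChunks)
print(lostPairs)  # [(101, 32), (103, 46)]
```

The two lost pairs are `"e "` and `"g."`: exactly the kind of word-plus-space and word-plus-punctuation merges that the splitting is meant to rule out.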

So let's see in detail, how that works...

` ?\p{N}+` very similarly matches an optional space followed by one or more numeric characters in any script.

We can test this with the following sample code:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "This is the year2024"))
```
And we get:
```python
['This', ' is', ' the', ' year', '2024']
```
And we see that this also separates letters and numbers because `2` is not a letter anymore in this example. But `2` is a number so ` ?\p{N}+` matches it instead...

Let's now see how these `'s|'t|'re|'ve|'m|'ll|'d` work...

We can test this with this example:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "who's don't you're you've I'm she'll he'd"))
```
And we get:
```python
['who', "'s", ' don', "'t", ' you', "'re", ' you', "'ve", ' I', "'m", ' she', "'ll", ' he', "'d"]
```
We see that all these words get separated at the **apostrophe** symbol, matching these exact continuations...

So why are they doing it?

Honestly, I think that these are just the very common **apostrophe** postfixes that are typically used in a word, and I don't like that they have done this kind of implementation of **apostrophe** words in `GPT-2`...

And the reason I don't like this is because this creates a problem with the Unicode **apostrophe**(s) like this:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "who’s don’t"))
```
The output comes out like this:
```python
['who', '’', 's', ' don', '’', 't']
```
And we see that they do get separated but `’` comes out as a different list item, hinting that the above pattern is *hard-coded* for `'` specific **apostrophe** symbol...

And if we go back to the code of `GPT-2`, we will see this line of comment:
```python
# Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
```
Well, you see how in the above pattern, all the **apostrophe** postfixes are small letters?\
And because they did not add `re.IGNORECASE`, these rules will not separate out these words if they are **uppercase**...

And we can check the same with this example code:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "who's WHO'S"))
```
For which we get:
```python
['who', "'s", ' WHO', "'", 'S']
```
So this `Tokenization` behaves differently for uppercase and lowercase text, separating these **apostrophe** suffixes inconsistently. It feels extremely gnarly and gross, but that's how this works...

Let's now come back and discuss the ` ?[^\s\p{L}\p{N}]+` part of the regex...

Well this implies **an optional space, followed by one or more of something that is not a letter, a number or a space**... And what this is doing is that it is trying to match **punctuation** (which is not a letter or a number or a space)

And we can test it like this:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "What!!!?"))
```
For which we get:
```python
['What', '!!!?']
```

And finally we come to `\s+(?!\S)`...

Here `\s+` matches one or more whitespace characters, and the latter part, `(?!\S)`, is called a **negative lookahead assertion**: it only lets the match end at a position that is not followed by a non-whitespace character... So what it is effectively doing is matching a run of whitespace up to, but not including, the last whitespace character before the next word...

And why is this important?

Let's take this example scenario and explain it:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "Far           away"))
```
For which we get:
```python
['Far', '          ', ' away']
```
We see that the run of spaces has become its own chunk, except for the last space, which is left to the word that follows... And the reason that is good is that ` away` is the common token, and no matter how many spaces we add in between, we still end up with the common ` away` token...

But the `GPT-2 Tokenizer` really likes to have this space prepending letters or numbers...
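To see the lookahead in isolation, here is a minimal sketch that uses only the `\s+(?!\S)` alternative with Python's built-in `re` module (an assumption made for brevity, since this fragment needs no `\p{...}` classes; the helper name `leading_whitespace_chunk` is made up for this illustration):

```python
import re  # the standard library is enough here, this fragment has no \p{...} classes

lookahead = re.compile(r"\s+(?!\S)")

def leading_whitespace_chunk(text):
    """Return the whitespace chunk matched at the start of `text`, or None."""
    match = lookahead.match(text)
    return match.group(0) if match else None

# Three spaces before a word: the match stops one space early,
# leaving the last space for the "optional space + letters" alternative.
print(repr(leading_whitespace_chunk("   away")))  # '  '

# A single space before a word: no match at all, so the space stays with the word.
print(repr(leading_whitespace_chunk(" away")))    # None

# Whitespace at the very end of the text: nothing follows, so the whole run matches.
print(repr(leading_whitespace_chunk("   ")))      # '   '
```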

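And the final `\s+` is a fallback that catches any whitespace the earlier alternatives leave behind, for example a newline or a tab sitting directly before a word...

Trailing whitespace at the very end of the text also comes out as a single chunk (nothing follows it, so `\s+(?!\S)` already matches the whole run). Want an example for this one too? I got you: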
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(gpt2pattern, "Far   away.     "))
```
We get:
```python
['Far', '  ', ' away', '.', '     ']
```

I wanted to show one more thing...

Suppose we want to check the `tokenization` of **Python code** with the `GPT-2` regex:
```python
import regex as re

gpt2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

example_code = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""

print(re.findall(gpt2pattern, example_code))
```
We get:
```python
['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']
```
We see that there are many items in the output list, and that's because code gets split up far more often than the other examples that we talked about.

And you might think that OpenAI has used this regex pattern to split these texts into *chunks* and then run just a **BPE** algorithm within all of these *chunks*, but that is not exactly what happened...

And the reason is, notice that we have `        print("FizzBuzz")` in the above code, with all of its leading spaces. Those spaces end up being entire elements of their own, but they never actually end up being merged by `GPT-2`, and we can go back to the original example that we discussed at the very beginning of this notebook:
![Tokenizer_GPT2Test](ExplanationMedia/Images/Tokenizer_GPT2Test.png)
We see that all these spaces are kept independent and they're all token `220`, and I think at some point `OpenAI` enforced some rule that these spaces would never be merged, hinting that there are some additional rules on top of just *chunking* and **BPE**...

And most importantly, the training code for `GPT-2 Tokenizer` was never released and the code that they have in their `encoder.py` is just the inference code for the tokens, which takes the `replacement table` that we implemented up above, and applies that to a new piece of text...

So at this point, we don't know exactly how OpenAI trained the tokenizer, but it wasn't as simple as just *chunking* plus the **BPE** algorithm...

# TikToken Library

Next up, I wanted to introduce you to the `TikToken` library from OpenAI, which is OpenAI's official library for tokenization.

To use this library you can again install the package from PyPI like this:
```bash
pip install tiktoken
```

Once again, the code that we see on their <a href="https://github.com/openai/tiktoken">GitHub page</a> is just the inference code, not the training code...


I also wanted to show you how you could use it, and the major difference between the `GPT-2` and `GPT-4` tokenizers:
```python
import tiktoken

# GPT-2 Tokenizer (Does not merge whitespaces)
# Note: the string starts with four spaces
encoder = tiktoken.get_encoding("gpt2")
print(encoder.encode("    hello world!!!"))

# GPT-4 Tokenizer (Merges whitespaces)
encoder = tiktoken.get_encoding("cl100k_base")
print(encoder.encode("    hello world!!!"))
```
And we get the output:
```python
[220, 220, 220, 23748, 995, 10185]
[262, 24748, 1917, 12340]
```
Now in the `GPT-4 Tokenizer`, they changed the **regular expression** that they use to *chunk-up* text...\
And the way to check this is by the link that I am providing <a href="https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py">here</a>. This file holds the definitions of all the `Tokenizers` that OpenAI maintains, and to make inference possible they had to publish some of these details, including the pattern strings...

And this is the pattern string that we already saw for `GPT-2`, written *slightly differently* (it is equivalent, but executes a little bit faster):
```python
def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
        vocab_bpe_hash="1ce1664773c50f3e0cc8842619a93edc4624525b728b188a9e0be33b7726adc5",
        encoder_json_hash="196139668be63f3b5d6574427317ae82f612a97c5d1cdaf36ed2256dbf636783",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        # The pattern in the original GPT-2 release is:
        # r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
        # This is equivalent, but executes faster:
        "pat_str": r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {ENDOFTEXT: 50256},
    }
```
And if we scroll down to the part of `cl100k_base`, we should be able to see the `GPT-4` tokenizer as well:
```python
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
        expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }
```
And we can immediately see that the pattern has changed in addition to a bunch of other **special tokens**(which again we'll go into in a bit)...

Now, I'm not going to go into the full details of the pattern change because, honestly this is a mind numbing thing and I personally kind of hate **regular expressions**. I would just recommend that you pull up the regular expression documentation and step through it... (There's a lot of different handling of the whitespace within the expression which I am not going to go into the full details of).

But I will discuss the major changes in these patterns:
- You see the `?i` in `'(?i:[sdmt]|ll|ve|re)`? This is an inline modifier that makes that group case-insensitive, so the **apostrophe** suffixes match regardless of case. This resolves the issue raised by the comment that we saw earlier (the missing `re.IGNORECASE`).
- You will also notice `\p{N}{1,3}`. This means that a run of digits is *chunked* into pieces with a minimum length of `1` and a maximum length of `3`, so a number longer than `3` digits never stays together as a single chunk (this prevents very very long number sequences from being merged into tokens).
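
A small sketch of these two changes in isolation, assuming the standard-library `re` module and an ASCII-only `[0-9]` as a stand-in for `\p{N}` (both are simplifications for illustration, not what `tiktoken` actually runs):

```python
import re

# Contraction alternatives: GPT-2 style (lowercase only) vs. GPT-4 style (case-insensitive group)
gpt2_contractions = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
gpt4_contractions = re.compile(r"'(?i:[sdmt]|ll|ve|re)")

print(gpt2_contractions.findall("who's WHO'S"))  # ["'s"]
print(gpt4_contractions.findall("who's WHO'S"))  # ["'s", "'S"]

# Digit alternatives, with [0-9] standing in for the Unicode number class
gpt2_digits = re.compile(r"[0-9]+")
gpt4_digits = re.compile(r"[0-9]{1,3}")

print(gpt2_digits.findall("1234567"))  # ['1234567']
print(gpt4_digits.findall("1234567"))  # ['123', '456', '7']
```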

But again we don't really know why they do any of this stuff because none of this is documented.

But those in fact are the changes that `GPT-4` has made, and the vocabulary size has gone from roughly `50k` to roughly `100k`...


# **GPT-2** <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">`encoder.py`</a>

Now if we go through the `GPT-2`'s `encoder.py` <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">here</a>, we will find this part of the code:
```python
def get_encoder(model_name, models_dir):
    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(
        encoder=encoder,
        bpe_merges=bpe_merges,
    )
```
And they seem to be using two files:
1. `'encoder.json'`
2. `'vocab.bpe'`

And to download these files you can run this command in your notebook's code cell to download these files into the current directory:
```bash
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
```
And you can inspect them if you'd like as well with a piece of code like this:
```python
import json

with open('encoder.json', 'r') as f:
    encoder = json.load(f)

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    bpe_data = f.read()

# Skip the header line at the top and the empty string left after the final newline
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
```
And eventually what you'd find is that this `encoder`, as they call it in their code, is equivalent to the `final vocabulary` that we used when we implemented our `decode()` function, just with the mapping inverted (theirs goes from `string` to `integer`)...

And their `vocab.bpe`, confusingly is actually our `replacement table` that we crafted, so, their `bpe_merges` which is based on the data inside `vocab.bpe` ends up being equivalent to our `replacement table`...
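
To make the file format concrete, here is a minimal sketch that parses a tiny, made-up stand-in for `vocab.bpe` the same way `get_encoder` does. The three merge lines below are illustrative sample data, not the real file; the only assumption is that the file has one header line followed by one space-separated pair per line:

```python
# A toy stand-in for the contents of vocab.bpe (illustrative data, not the real file)
bpe_data = "#version: 0.2\nh e\nl o\nhe l\n"

# Same parsing as in get_encoder: drop the header line and the empty string after the last newline
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split("\n")[1:-1]]
print(bpe_merges)  # [('h', 'e'), ('l', 'o'), ('he', 'l')]

# The line order is the merge priority: an earlier line gets a lower rank and is merged first
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
print(bpe_ranks[("he", "l")])  # 2
```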

Now if we check this part of the code:
```python
class Encoder:
    def __init__(self, encoder, bpe_merges, errors='replace'):
        self.encoder = encoder
        self.decoder = {v:k for k,v in self.encoder.items()}
        self.errors = errors # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}
```
We see that in addition to the `encoder` and the `decoder` they also use a `byte_encoder` and a `byte_decoder`... And this is just a *spurious* implementation detail that isn't actually deep or interesting in any way, so I am going to skip the discussion of it. What OpenAI does here, for reasons that I don't fully understand, is that they not only have a `Tokenizer` which can do both **encode** and **decode**, but they also have a whole separate layer that is used serially with the tokenizer.

And so you first do **byte-encode** and then **encode**, and then **decode** and then **byte-decode**...

Otherwise, if you ignore this `byte_encoder` and `byte_decoder`, this file will algorithmically be very familiar to you...

And if we now look at this snippet from inside the `bpe()` function:
```python
while True:
    bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
    if bigram not in self.bpe_ranks:
        break
    first, second = bigram
    new_word = []
    i = 0
    while i < len(word):
        try:
            j = word.index(first, i)
            new_word.extend(word[i:j])
            i = j
        except:
            new_word.extend(word[i:])
            break

        if word[i] == first and i < len(word)-1 and word[i+1] == second:
            new_word.append(first+second)
            i += 2
        else:
            new_word.append(word[i])
            i += 1
    new_word = tuple(new_word)
    word = new_word
    if len(word) == 1:
        break
    else:
        pairs = get_pairs(word)
```
This loop will be fairly familiar as well, where they're trying to identify the `bigram`(in our case we called it the `minPair`)...
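
As a self-contained sketch of that loop, assuming a tiny made-up `toy_ranks` table and a simplified inner scan (the names `get_pairs` and `bpe` mirror `encoder.py`, but this is a condensed illustration rather than OpenAI's code):

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))

def bpe(token, bpe_ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until none is mergeable."""
    word = tuple(token)
    while len(word) > 1:
        pairs = get_pairs(word)
        # Pairs missing from the table get rank infinity, so they are never preferred
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)  # merge this occurrence
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
    return word

# Hypothetical merge table: a lower rank means the merge was learned earlier
toy_ranks = {("l", "o"): 0, ("h", "e"): 1, ("he", "l"): 2}
print(bpe("hello", toy_ranks))  # ('hel', 'lo')
```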

So long story short, unfortunately this is kind of messy code, but algorithmically it is identical to what we have built up above, and what we built is everything that is algorithmically necessary to build a `Tokenizer`, train it and then do both **encode** and **decode**...

# Special Tokens

Next, I'd like us to visit the part of **special tokens**...

So, in addition to data that is coming from the **raw bytes** and the **BPE** `replacement table`, we can insert all kinds of tokens that are used to delimit different parts of the documents, or to introduce some kind of a **special structure of the token streams**.

For example, if we look at the `GPT-2`'s encoder object and check out its length:
```python
>>> len(encoder)
50257
```
We see that we get an output of `50257`. This `encoder` is a mapping that goes from `string` to `integer`, which is the inverse of our `final vocabulary` (which goes from `integer` to `string`).

So where does this number `50257` come from ?

We know that there are `256` **raw byte** tokens, and OpenAI did `50000` merges, hinting that these merges become the other tokens.\
Leaving us with `1` token in the end (which is one **special token**):
```python
>>> encoder['<|endoftext|>']
50256
```
And we see that this is the very last token.
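
The count above can be checked with a line of arithmetic (the variable names are just for illustration):

```python
num_raw_byte_tokens = 256   # one token per possible byte value
num_merges = 50000          # each merge adds one new token
num_special_tokens = 1      # <|endoftext|>

vocab_size = num_raw_byte_tokens + num_merges + num_special_tokens
print(vocab_size)      # 50257

# Ids start at 0, so the last id (the special token) is vocab_size - 1
print(vocab_size - 1)  # 50256
```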

And this token is used to delimit documents in the training set.

So when we're creating the training data for `GPT-2`, we have all these documents, and we `tokenize` them and we get a stream of tokens. Those ordinary tokens only range from `0` to `50255`, and in between the documents we put the special `<|endoftext|>` token (`50256`). We are using this **special token** as a signal to the language model that the document 'has ended and what follows is going to be unrelated to the previous document'. That said, the language model has to learn this from the data, and we expect the language model to learn it on its own...
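
A minimal sketch of how such a token stream could be assembled. The `encode_ordinary` function here is a hypothetical stand-in that just returns raw `UTF-8` bytes (a real pipeline would call the actual tokenizer), and appending the delimiter after every document is an assumption about the layout:

```python
ENDOFTEXT_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary

def encode_ordinary(text):
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def build_token_stream(documents):
    """Concatenate tokenized documents, delimiting them with the special token."""
    stream = []
    for document in documents:
        stream.extend(encode_ordinary(document))
        stream.append(ENDOFTEXT_ID)
    return stream

print(build_token_stream(["hi", "yo"]))  # [104, 105, 50256, 121, 111, 50256]
```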

And we can again go to the <a href="https://tiktokenizer.vercel.app/?model=gpt2">Tiktokenizer Web Application</a> and check this special token out:
![GPT2SpecialToken](ExplanationMedia/Images/GPT2SpecialToken.png)\
And see how that special token is `50256` now?

You see even a single character makes the difference:\
![GPT2AlmostSpecialToken](ExplanationMedia/Images/GPT2AlmostSpecialToken.png)

And the important thing to note here is that this **special token** did not actually go through the **BPE** merges. Instead, the code that actually outputs the tokens has **special case instructions** for handling special tokens. We did not see these **special case instructions** in `encoder.py`, but if we check out the `TikToken` library and open the file <a href="https://github.com/openai/tiktoken/blob/main/src/lib.rs">lib.rs</a>, we will find that it is implemented in the Rust programming language, and from around `line 200` we will find all kinds of special case handling for these special tokens. You can register them, create them and add them to the vocabulary, and whenever the library sees one of them in the text it will come in and swap in that special token. In other words, these things are outside of the typical algorithm of the **Byte-Pair Encoding**.
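
The idea of handling special tokens outside of **BPE** can be sketched in a few lines of Python. This is an illustration of the concept only, not a port of `lib.rs`; `encode_with_special` and the byte-level `encode_ordinary` stand-in are hypothetical names:

```python
import re

SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_ordinary(text):
    """Hypothetical stand-in for the BPE encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def encode_with_special(text, special_tokens=SPECIAL_TOKENS):
    """Split the text on exact special-token strings and swap in their ids."""
    # The capturing group makes re.split keep the special strings in the result
    pattern = "(" + "|".join(re.escape(token) for token in special_tokens) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            ids.append(special_tokens[part])   # special case, bypasses the encoder
        elif part:
            ids.extend(encode_ordinary(part))  # ordinary text goes through the encoder
    return ids

print(encode_with_special("hi<|endoftext|>"))  # [104, 105, 50256]
print(encode_with_special("hi<|endoftext>"))   # no exact match, so only ordinary tokens come out
```

Note how dropping a single character from the special string means it no longer matches, so it is treated like any other text.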

And these **special tokens** are used pervasively, not just in base language modeling of predicting the next token in a sequence but especially later when it gets to the **fine-tuning** stage. Because we don't just want to delimit documents, we want to delimit entire conversations between a user and an assistant...

So if we go to the base page of <a href="https://tiktokenizer.vercel.app/">TikTokenizer</a> we see that they are not using the base model of encoders by default, but the **fine-tuned** model encoders. For example, the latest `ChatGPT` model is the `GPT-4o` and they use this as the default view:
![GPT4oTiktokenizer](ExplanationMedia/Images/GPT4oTiktokenizer.png)\
(By the way the prefix `"im"` stands for `imaginary monologue`)

And you can see that there's a start and an end token around every message, and there could be many other tokens...

And now when we go back to the main page of the TikToken GitHub repository, we see how we can fork this repository and extend this TikToken library, and we can extend it by adding more special tokens. And the `TikToken` library will correctly swap them out when it sees them in the strings.

Now if we go back to this snippet:
```python
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
        expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }
```
We should now be able to recognize everything that we've just discussed...

And we see more tokens and some of these are prefixed with `"FIM"` (this is nothing but `"Fill in the Middle"` and if you'd like to learn more about this idea, it comes from <a href="https://arxiv.org/abs/2207.14255">this</a> paper)...

So it's very common to train a language model and then decide later that you'd like to add special tokens. Now, when you add special tokens you have to do some model surgery on the `Transformer` and the parameters involved, because you're basically adding an integer. For example, your token embedding matrix has to be extended by adding a row, and typically this row would be initialized with small random numbers, because we need to have a vector that now *"stands for"* that token. In addition to that, you have to go to the final layer of the `Transformer` and make sure that the projection at the very end into the classifier is extended by `1` as well.
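
A framework-free sketch of this surgery, using plain Python lists in place of tensors. The function name `add_token_row`, the `0.02` scale and the toy sizes are assumptions for illustration; in a real model the same extra row would be appended to the embedding table and to the weight matrix of the final projection:

```python
import random

def add_token_row(matrix, scale=0.02, seed=0):
    """Return a copy of a (vocab_size x n_embd) matrix with one extra row of small random numbers."""
    rng = random.Random(seed)
    n_embd = len(matrix[0])
    new_row = [rng.gauss(0.0, scale) for _ in range(n_embd)]
    return [row[:] for row in matrix] + [new_row]

vocab_size, n_embd = 5, 4  # toy sizes
token_embedding_table = [[0.0] * n_embd for _ in range(vocab_size)]  # input side
lm_head_weight = [[0.0] * n_embd for _ in range(vocab_size)]         # output side, one row per token

# Adding one special token means both matrices grow by one row
token_embedding_table = add_token_row(token_embedding_table, seed=1)
lm_head_weight = add_token_row(lm_head_weight, seed=2)

print(len(token_embedding_table), len(lm_head_weight))  # 6 6
```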

Basically there's some model surgery involved that you have to couple with the `Tokenization` changes if you're going to add some **special tokens**. But this is a very common operation that people do, especially when they want to **fine-tune** the model (for example, taking it from the **base model** to a **chat model** like `ChatGPT`)...

# SentencePiece

Now, we are going to move on from TikToken and how it tokenizes its strings...

And we're going to discuss one more very commonly used library for working with `Tokenization` in `LLMs` and that is <a href="https://github.com/google/sentencepiece">SentencePiece by Google</a>. We're discussing it because SentencePiece is very commonly used in language models: unlike TikToken, it can do both training and inference, and it is quite efficient at both. It supports a number of algorithms for training its vocabularies, and one of them is the **BPE** algorithm that we have been looking at so far. It is also used in both the `Llama` and `Mistral` series and many other models as well...

**Big-Difference**:
| **TikToken**                                                                                                      | **SentencePiece**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| We first take our code points in a string, we encode them using `UTF-8` to `bytes` and then we merge the `bytes`. | It works directly on the level of the code points themselves. It looks at whatever Code Points are available in your training set, and then it starts merging those Code Points and the **BPE** runs on the level of the Code Points. And if you happen to run out of Code Points (maybe some rare code points that don't come up too often and the rarity is determined by `character_coverage` hyper-parameter), then these Code Points will get mapped to a special 'unknown token'(`<unk>`) or, if we have the `byte_fallback` option turned on that will take those rare Code Points and it will encode them using `UTF-8` and then the individual bytes of that encoding will be translated into tokens and end up being special `byte` tokens that get added to the vocabulary.  So, it uses **BPE** on the Code Points, and then it *falls back* to `bytes` for rare Code Points. |

And I find `TikToken` to be a bit cleaner for `Tokenization`, but the above difference is a subtle yet pretty major difference between them.
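
The difference is easy to see in plain Python. This sketch only shows the two starting alphabets (code points vs. `UTF-8` bytes) for one Korean character; it does not run either library:

```python
text = "안"

# SentencePiece-style starting point: the Unicode code points themselves
print([ord(ch) for ch in text])    # [50504]

# TikToken-style starting point: the UTF-8 bytes of those code points
print(list(text.encode("utf-8")))  # [236, 149, 136]
```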

Let's try to work with practical and concrete examples now, because just reading is kind of hard to get your head around...

And the crazy thing about SentencePiece is that it really likes to have a file and work with files, and it has a ton of configuration(and we're talking about a lot) and the reason being, it has been here for a while now and it really tries to handle a large diversity of things and has quite a bit of historical baggage as well...

(You can go <a href="https://github.com/google/sentencepiece/blob/master/doc/options.md">here</a> to check out all the training options, and there's also quite a bit of information when you look at the raw <a href="https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto">protocol buffers</a> that is used to represent the trainer specifications and so on... And many of these options are irrelevant to us)

And I tried to set up this `Tokenizer` in a way which is very very similar to the way the `Llama2` tokenizer was trained...


```python
import sentencepiece as spm
import os

# We will specify the settings or specifications for the tokenizer
options = dict(
  # Input Specifications
  input="../Datasets/Tokenizer/tokenizer_train.txt",
  input_format="text",
  # Output Specifications
  model_prefix="tok400", # Output filename prefix
  # Algorithm Specifications
  model_type="bpe",
  vocab_size=400,
  # Normalization
  normalization_rule_name="identity",
  remove_extra_whitespaces=False,
  input_sentence_size=200000000, # Maximum number of training sentences
  max_sentence_length=4192, # Maximum number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # Rare Word Treatment
  character_coverage=0.99995,
  byte_fallback=True,
  # Merge Rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True,
  allow_whitespace_only_pieces=True,
  # Special Tokens
  unk_id=0, # The <unk> token MUST exist
  bos_id=1, # The others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # Systems
  num_threads=os.cpu_count(), # Use ~all system resources
)

# Start training
spm.SentencePieceTrainer.train(**options)

# Load the tokenizer model to SentencePieceProcessor
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')

# Printing the tokens and their mapped indices
vocabulary = [[sp.id_to_piece(i), i] for i in range(sp.get_piece_size())]
print(vocabulary)
```
And we get an output like this:
```python
[['<unk>', 0], ['<s>', 1], ['</s>', 2], ['<0x00>', 3], ['<0x01>', 4], ['<0x02>', 5], ['<0x03>', 6], ['<0x04>', 7], ['<0x05>', 8], ['<0x06>', 9], ['<0x07>', 10], ['<0x08>', 11], ['<0x09>', 12], ['<0x0A>', 13], ['<0x0B>', 14], ['<0x0C>', 15], ['<0x0D>', 16], ['<0x0E>', 17], ['<0x0F>', 18], ['<0x10>', 19], ['<0x11>', 20], ['<0x12>', 21], ['<0x13>', 22], ['<0x14>', 23], ['<0x15>', 24], ['<0x16>', 25], ['<0x17>', 26], ['<0x18>', 27], ['<0x19>', 28], ['<0x1A>', 29], ['<0x1B>', 30], ['<0x1C>', 31], ['<0x1D>', 32], ['<0x1E>', 33], ['<0x1F>', 34], ['<0x20>', 35], ['<0x21>', 36], ['<0x22>', 37], ['<0x23>', 38], ['<0x24>', 39], ['<0x25>', 40], ['<0x26>', 41], ['<0x27>', 42], ['<0x28>', 43], ['<0x29>', 44], ['<0x2A>', 45], ['<0x2B>', 46], ['<0x2C>', 47], ['<0x2D>', 48], ['<0x2E>', 49], ['<0x2F>', 50], ['<0x30>', 51], ['<0x31>', 52], ['<0x32>', 53], ['<0x33>', 54], ['<0x34>', 55], ['<0x35>', 56], ['<0x36>', 57], ['<0x37>', 58], ['<0x38>', 59], ['<0x39>', 60], ['<0x3A>', 61], ['<0x3B>', 62], ['<0x3C>', 63], ['<0x3D>', 64], ['<0x3E>', 65], ['<0x3F>', 66], ['<0x40>', 67], ['<0x41>', 68], ['<0x42>', 69], ['<0x43>', 70], ['<0x44>', 71], ['<0x45>', 72], ['<0x46>', 73], ['<0x47>', 74], ['<0x48>', 75], ['<0x49>', 76], ['<0x4A>', 77], ['<0x4B>', 78], ['<0x4C>', 79], ['<0x4D>', 80], ['<0x4E>', 81], ['<0x4F>', 82], ['<0x50>', 83], ['<0x51>', 84], ['<0x52>', 85], ['<0x53>', 86], ['<0x54>', 87], ['<0x55>', 88], ['<0x56>', 89], ['<0x57>', 90], ['<0x58>', 91], ['<0x59>', 92], ['<0x5A>', 93], ['<0x5B>', 94], ['<0x5C>', 95], ['<0x5D>', 96], ['<0x5E>', 97], ['<0x5F>', 98], ['<0x60>', 99], ['<0x61>', 100], ['<0x62>', 101], ['<0x63>', 102], ['<0x64>', 103], ['<0x65>', 104], ['<0x66>', 105], ['<0x67>', 106], ['<0x68>', 107], ['<0x69>', 108], ['<0x6A>', 109], ['<0x6B>', 110], ['<0x6C>', 111], ['<0x6D>', 112], ['<0x6E>', 113], ['<0x6F>', 114], ['<0x70>', 115], ['<0x71>', 116], ['<0x72>', 117], ['<0x73>', 118], ['<0x74>', 119], ['<0x75>', 120], ['<0x76>', 121], ['<0x77>', 122], ['<0x78>', 123], 
['<0x79>', 124], ['<0x7A>', 125], ['<0x7B>', 126], ['<0x7C>', 127], ['<0x7D>', 128], ['<0x7E>', 129], ['<0x7F>', 130], ['<0x80>', 131], ['<0x81>', 132], ['<0x82>', 133], ['<0x83>', 134], ['<0x84>', 135], ['<0x85>', 136], ['<0x86>', 137], ['<0x87>', 138], ['<0x88>', 139], ['<0x89>', 140], ['<0x8A>', 141], ['<0x8B>', 142], ['<0x8C>', 143], ['<0x8D>', 144], ['<0x8E>', 145], ['<0x8F>', 146], ['<0x90>', 147], ['<0x91>', 148], ['<0x92>', 149], ['<0x93>', 150], ['<0x94>', 151], ['<0x95>', 152], ['<0x96>', 153], ['<0x97>', 154], ['<0x98>', 155], ['<0x99>', 156], ['<0x9A>', 157], ['<0x9B>', 158], ['<0x9C>', 159], ['<0x9D>', 160], ['<0x9E>', 161], ['<0x9F>', 162], ['<0xA0>', 163], ['<0xA1>', 164], ['<0xA2>', 165], ['<0xA3>', 166], ['<0xA4>', 167], ['<0xA5>', 168], ['<0xA6>', 169], ['<0xA7>', 170], ['<0xA8>', 171], ['<0xA9>', 172], ['<0xAA>', 173], ['<0xAB>', 174], ['<0xAC>', 175], ['<0xAD>', 176], ['<0xAE>', 177], ['<0xAF>', 178], ['<0xB0>', 179], ['<0xB1>', 180], ['<0xB2>', 181], ['<0xB3>', 182], ['<0xB4>', 183], ['<0xB5>', 184], ['<0xB6>', 185], ['<0xB7>', 186], ['<0xB8>', 187], ['<0xB9>', 188], ['<0xBA>', 189], ['<0xBB>', 190], ['<0xBC>', 191], ['<0xBD>', 192], ['<0xBE>', 193], ['<0xBF>', 194], ['<0xC0>', 195], ['<0xC1>', 196], ['<0xC2>', 197], ['<0xC3>', 198], ['<0xC4>', 199], ['<0xC5>', 200], ['<0xC6>', 201], ['<0xC7>', 202], ['<0xC8>', 203], ['<0xC9>', 204], ['<0xCA>', 205], ['<0xCB>', 206], ['<0xCC>', 207], ['<0xCD>', 208], ['<0xCE>', 209], ['<0xCF>', 210], ['<0xD0>', 211], ['<0xD1>', 212], ['<0xD2>', 213], ['<0xD3>', 214], ['<0xD4>', 215], ['<0xD5>', 216], ['<0xD6>', 217], ['<0xD7>', 218], ['<0xD8>', 219], ['<0xD9>', 220], ['<0xDA>', 221], ['<0xDB>', 222], ['<0xDC>', 223], ['<0xDD>', 224], ['<0xDE>', 225], ['<0xDF>', 226], ['<0xE0>', 227], ['<0xE1>', 228], ['<0xE2>', 229], ['<0xE3>', 230], ['<0xE4>', 231], ['<0xE5>', 232], ['<0xE6>', 233], ['<0xE7>', 234], ['<0xE8>', 235], ['<0xE9>', 236], ['<0xEA>', 237], ['<0xEB>', 238], ['<0xEC>', 239], ['<0xED>', 240], ['<0xEE>', 
241], ['<0xEF>', 242], ['<0xF0>', 243], ['<0xF1>', 244], ['<0xF2>', 245], ['<0xF3>', 246], ['<0xF4>', 247], ['<0xF5>', 248], ['<0xF6>', 249], ['<0xF7>', 250], ['<0xF8>', 251], ['<0xF9>', 252], ['<0xFA>', 253], ['<0xFB>', 254], ['<0xFC>', 255], ['<0xFD>', 256], ['<0xFE>', 257], ['<0xFF>', 258], ['en', 259], ['▁t', 260], ['ce', 261], ['in', 262], ['ra', 263], ['▁a', 264], ['de', 265], ['er', 266], ['▁s', 267], ['ent', 268], ['or', 269], ['pr', 270], ['▁m', 271], ['▁u', 272], ['ing', 273], ['▁th', 274], ['ence', 275], ['entence', 276], ['Pi', 277], ['ed', 278], ['em', 279], ['ex', 280], ['is', 281], ['iz', 282], ['la', 283], ['on', 284], ['st', 285], ['▁S', 286], ['Pie', 287], ['end', 288], ['ext', 289], ['▁an', 290], ['▁pr', 291], ['▁to', 292], ['▁un', 293], ['▁the', 294], ['Piece', 295], ['▁Sentence', 296], ['▁SentencePiece', 297], ['.]', 298], ['Ne', 299], ['ag', 300], ['do', 301], ['ec', 302], ['gu', 303], ['ic', 304], ['ir', 305], ['it', 306], ['ly', 307], ['to', 308], ['▁(', 309], ['▁[', 310], ['▁f', 311], ['▁n', 312], ['▁w', 313], ['.])', 314], ['age', 315], ['del', 316], ['ion', 317], ['ken', 318], ['lan', 319], ['ral', 320], ['wor', 321], ['yst', 322], ['▁Ne', 323], ['▁al', 324], ['▁de', 325], ['▁is', 326], ['▁ma', 327], ['▁mo', 328], ['izer', 329], ['rain', 330], ['ural', 331], ['▁and', 332], ['▁lan', 333], ['▁pre', 334], ['guage', 335], ['ystem', 336], ['▁text', 337], ['▁model', 338], ['▁train', 339], ['kenizer', 340], ['▁system', 341], ['▁language', 342], ['▁training', 343], ['.,', 344], ['BP', 345], ['Ku', 346], ['ab', 347], ['as', 348], ['at', 349], ['by', 350], ['co', 351], ['es', 352], ['et', 353], ['if', 354], ['ig', 355], ['im', 356], ['ke', 357], ['lo', 358], ['nr', 359], ['oc', 360], ['e', 361], ['▁', 362], ['n', 363], ['t', 364], ['i', 365], ['r', 366], ['a', 367], ['o', 368], ['s', 369], ['d', 370], ['c', 371], ['l', 372], ['u', 373], ['g', 374], ['m', 375], ['p', 376], ['.', 377], ['h', 378], ['-', 379], ['w', 380], ['y', 381], ['P', 382], ['S', 
383], ['b', 384], ['f', 385], ['k', 386], [')', 387], ['x', 388], ['z', 389], ['(', 390], ['N', 391], ['[', 392], [']', 393], ['v', 394], [',', 395], ['/', 396], ['B', 397], ['E', 398], ['K', 399]]
```

You see the *normalization* piece here?

Well, *normalization* used to be very prevalent in natural language processing before LLMs. In machine translation, text classification and so on, we wanted to normalize and simplify the text (turn all the text to lowercase, remove extra whitespace, etc.). In language models we prefer not to do any of it (or at least I personally prefer not to touch the data, and to keep it as raw as possible)...

Long story short:
```python
[['<unk>', 0], # Unknown Token (Byte Token Fallback)
 ['<s>', 1], # Beginning of a sentence
 ['</s>', 2], # End of a sentence
 ['<0x00>', 3], # 256 Byte Tokens
 ['<0xFF>', 258],
 ['en', 259], # Merges
 ['oc', 360],
 ['e', 361], # Individual Tokens
 ['▁', 362],
 ['K', 399]]
```

Let's now try to understand what happened here:
```python
>>> ids = sp.encode("hello 안녕하세요")
>>> print(ids)
[362, 378, 361, 372, 358, 362, 239, 152, 139, 238, 136, 152, 240, 152, 155, 239, 135, 187, 239, 157, 151]

>>> print([sp.id_to_piece(i) for i in ids])
['▁', 'h', 'e', 'l', 'lo', '▁', '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>']
```

Take a look at the Korean part. These characters were not part of the training set, so SentencePiece has no token associated with them, and normally they would suddenly become `<unk>` tokens. But because `byte_fallback=True`, SentencePiece *falls back* to `bytes`: it takes the Korean characters, encodes them with `UTF-8`, and then uses the `byte` tokens to represent them, and that's what we get in the end.
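
To make the fallback concrete, here is a minimal sketch in plain Python (no SentencePiece needed). It assumes, as in the vocabulary printed above, that the 256 `byte` tokens sit right after the three special tokens, so a byte with value `b` gets the id `b + 3`. The helper names `byte_fallback_ids` and `decode_byte_ids` are made up for this illustration:

```python
# Assumption: ids 0..2 are <unk>, <s>, </s>, so the byte tokens start at id 3
# (this matches the vocabulary above, where '<0x00>' is 3 and '<0xFF>' is 258).
BYTE_TOKEN_OFFSET = 3

def byte_fallback_ids(text):
    """Map every UTF-8 byte of `text` to the id of its byte token."""
    return [byte + BYTE_TOKEN_OFFSET for byte in text.encode("utf-8")]

def decode_byte_ids(ids):
    """Invert byte_fallback_ids: turn byte-token ids back into text."""
    return bytes(i - BYTE_TOKEN_OFFSET for i in ids).decode("utf-8")

ids = byte_fallback_ids("안")
print(ids)                                                # [239, 152, 139]
print([f"<0x{i - BYTE_TOKEN_OFFSET:02X}>" for i in ids])  # ['<0xEC>', '<0x95>', '<0x88>']
print(decode_byte_ids(ids))                               # 안
```

These are the same three ids that follow the second `▁` in the SentencePiece output above, and because every byte is kept, the original text can always be recovered.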

Let me now remove `byte_fallback` or set it to `False`, and train it again to see the results with the same example now...

And the first thing that happened was:
```python
[['<unk>', 0], # Unknown Token (no Byte Token Fallback this time)
 ['<s>', 1], # Beginning of a sentence
 ['</s>', 2], # End of a sentence
 ['en', 3], # Merges
 ['oc', 360],
 ['A', 361], # Individual Tokens
 ['▁', 362],
 ['z', 399]]
```
We see that we have a lot more merges, because the 256 `byte` tokens no longer take up space in the vocabulary.

And now if we encode the same example:
```python
>>> ids = sp.encode("hello 안녕하세요")
>>> print(ids)
[362, 378, 252, 102, 362, 0]

>>> print([sp.id_to_piece(i) for i in ids])
['▁', 'h', 'e', 'l', 'lo', '▁', '<unk>']
```
We see that the entire Korean text is being treated as a `0` token which is none other than the `<unk>` token.

And we have to keep in mind that this is going to feed into a language model. What is a language model supposed to do when all kinds of unrecognized things end up mapping to `<unk>`? That's not a property that you want. And that's why I think `Llama 2` correctly used `byte_fallback=True`: we definitely want to feed these rare Code Points to the language model in some manner...

And notice how a whitespace ends up being a `▁` symbol? I am not 100% sure why SentencePiece switches whitespaces to `▁`, but it is also a major difference. And we see that we also have an extra `▁` at the front of the text. It comes from the option `add_dummy_prefix=True`, which prepends a space to the text so that a word at the very beginning is treated the same way as that word in the middle of a sentence (for example "world" on its own and "world" in "hello world" both start with `▁`)...
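
Here is a rough sketch of what these two steps do to the raw text before any merging happens. It is a simplification written for this notebook, not SentencePiece's actual implementation, and the function name `escape_whitespace` is hypothetical:

```python
def escape_whitespace(text, add_dummy_prefix=True):
    """Simplified view of how spaces turn into '▁' (U+2581) before tokenization."""
    if add_dummy_prefix:
        text = " " + text  # prepend the dummy space
    return text.replace(" ", "\u2581")

print(escape_whitespace("world"))                          # ▁world
print(escape_whitespace("hello world"))                    # ▁hello▁world
print(escape_whitespace("world", add_dummy_prefix=False))  # world
```

With the dummy prefix, "world" looks like `▁world` in both positions, so the tokenizer has a chance to use the same token for it.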

And I will add the raw protocol buffer settings that the `Llama 2` tokenizer was trained with:
```python
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}

trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.99995
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
```


# Setting `Transformer`'s `vocabularySize`

We now want to revisit and answer some questions about the <a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch notebook</a>:
- What should be the `vocabularySize` for the `Transformer`?
- How can we increase the `vocabularySize`?

To answer these questions, let's look specifically at `vocabularySize` and where it appears in that notebook:
```python
# Model Module Definition
class GPTModel(torch.nn.Module):
    # Constructor for the model
    def __init__(self):
        # Initializing the model parameters
        super().__init__()
        self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)
        self.positionalEmbeddingTable = torch.nn.Embedding(blockSize, numberOfEmbeddingDimensions)
        self.blocks = torch.nn.Sequential(*[TransformerBlock(numberOfEmbeddingDimensions=numberOfEmbeddingDimensions, numberOfHeads=numberOfHeads) for _ in range(numberOfLayers)])
        self.layerNorm = torch.nn.LayerNorm(numberOfEmbeddingDimensions)
        self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)
```
We see that it appears in the lines `self.tokenEmbeddingTable = torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions)` and `self.languageModelingHead = torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize)`...

Let's look at the `vocabularySize` in both cases one by one:
1. `vocabularySize` in `tokenEmbeddingTable` is basically the number of rows. Each vocabulary element or token has an embedding vector of size `numberOfEmbeddingDimensions` that we are going to train using backpropagation. So as `vocabularySize` increases, this `tokenEmbeddingTable` is also going to grow.
2. At the end of the `Transformer` we have this `languageModelingHead`, which is a **Linear** layer used at the very end to produce the logits that become the probabilities. So intuitively, we are trying to produce the probability for every single token that might come next, at every position in the `Transformer`. And if we have more and more tokens, we need to produce more and more probabilities, so every additional token adds one more *dot-product* that the **Linear** layer has to perform.

Which brings us to the next question: Why can't `vocabularySize` be infinite?
1. `tokenEmbeddingTable` is going to grow
2. **Linear** layer at the very end is going to grow

Which means that we are going to do a lot more computation, and because we have more parameters, we could also be worried that we end up *under-training* these parameters.
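
A quick back-of-the-envelope sketch of how both layers grow with `vocabularySize`. The helper `vocabulary_parameter_count` and the embedding size of `384` are assumptions chosen for illustration, and the count assumes the `languageModelingHead` keeps the default bias of `torch.nn.Linear`:

```python
def vocabulary_parameter_count(vocabularySize, numberOfEmbeddingDimensions):
    """Count the parameters that depend on vocabularySize in the GPTModel above."""
    # torch.nn.Embedding(vocabularySize, numberOfEmbeddingDimensions): one row per token
    embeddingParameters = vocabularySize * numberOfEmbeddingDimensions
    # torch.nn.Linear(numberOfEmbeddingDimensions, vocabularySize): weight matrix plus bias
    headParameters = numberOfEmbeddingDimensions * vocabularySize + vocabularySize
    return embeddingParameters + headParameters

for vocabularySize in (92, 32000, 50257):
    print(vocabularySize, vocabulary_parameter_count(vocabularySize, 384))
# 92 70748
# 32000 24608000
# 50257 38647633
```

Going from the `92` characters of the previous notebook to a GPT-2 sized vocabulary multiplies these two layers by a factor of several hundred, even though nothing else in the `Transformer` changed.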

So intuitively, suppose we have a very large `vocabularySize` (say a million tokens). Then every one of these tokens is going to come up more and more rarely in the training data. We are going to see fewer and fewer examples for each individual token, and we might be worried that the vectors associated with these tokens will be undertrained as a result (because they don't come up too often). In addition to that, as our `vocabularySize` grows, our sequences are going to shrink a lot. That means that we are going to be attending to more and more text (which is nice), but we might also be worried that too large a piece of text is getting *squished* into a single token (which is essentially *squishing* too much information into a single token), so that the forward pass is **not enough** to process that information appropriately.

And that is why `vocabularySize` is treated as a hyperparameter, which in modern transformers today usually lands somewhere in the `10k` to `100k` range...

The next question that we arrive at is: What if we want to take a pre-trained model and extend its `vocabularySize`, how do we do that?

Well, this is done very commonly. For example, when we are fine-tuning, a lot of new special tokens are introduced on top of the base model to maintain the metadata and structure of all the conversation objects between the user and an assistant. That takes a lot of special tokens, and we can throw in even more special tokens for using tools (such as a browser, a calculator and so on) for special functionality... So it is totally possible...

And all we have to do is resize the `tokenEmbeddingTable` by adding rows (we would initialize these new parameters from scratch to be small random numbers), and extend the weights inside the last **Linear** `languageModelingHead` layer, so that it can compute *dot-products* with the new parameters as well and calculate the probabilities for these new tokens.

And it's a very mild form of model surgery that can be done fairly easily. It is also quite common to *freeze* the base model and train only these new parameters to introduce the new tokens into the architecture...
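
The surgery described above could look roughly like this in PyTorch. This is only a sketch under assumptions: the helper names `extend_vocabulary` and `train_only_new_rows`, the initialization scale `std=0.02`, and the gradient-masking trick used for freezing are my choices for illustration, not something taken from a specific codebase:

```python
import torch

def extend_vocabulary(tokenEmbeddingTable, languageModelingHead, numberOfNewTokens, std=0.02):
    """Return a larger embedding table and head that keep all the old weights."""
    oldVocabularySize, numberOfEmbeddingDimensions = tokenEmbeddingTable.weight.shape
    newVocabularySize = oldVocabularySize + numberOfNewTokens
    hasBias = languageModelingHead.bias is not None

    newEmbedding = torch.nn.Embedding(newVocabularySize, numberOfEmbeddingDimensions)
    newHead = torch.nn.Linear(numberOfEmbeddingDimensions, newVocabularySize, bias=hasBias)

    with torch.no_grad():
        # New rows start as small random numbers (std is an assumed value)
        torch.nn.init.normal_(newEmbedding.weight, mean=0.0, std=std)
        torch.nn.init.normal_(newHead.weight, mean=0.0, std=std)
        # Old rows are copied over unchanged
        newEmbedding.weight[:oldVocabularySize] = tokenEmbeddingTable.weight
        newHead.weight[:oldVocabularySize] = languageModelingHead.weight
        if hasBias:
            newHead.bias.zero_()
            newHead.bias[:oldVocabularySize] = languageModelingHead.bias
    return newEmbedding, newHead

def train_only_new_rows(parameter, oldVocabularySize):
    """Zero the gradient of the old rows so only the new tokens get updated."""
    mask = torch.zeros_like(parameter)
    mask[oldVocabularySize:] = 1.0
    parameter.register_hook(lambda gradient: gradient * mask)

# Usage with toy sizes: 92 old tokens, 4 new special tokens
oldEmbedding = torch.nn.Embedding(92, 16)
oldHead = torch.nn.Linear(16, 92)
newEmbedding, newHead = extend_vocabulary(oldEmbedding, oldHead, numberOfNewTokens=4)
train_only_new_rows(newEmbedding.weight, oldVocabularySize=92)
print(newEmbedding.weight.shape, newHead.weight.shape)
```

To freeze the rest of the base model we would additionally call `requires_grad_(False)` on its other parameters, so that only the new rows receive updates.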

You can also look at the paper <a href="https://arxiv.org/abs/2304.08467">Learning to Compress Prompts with Gist Tokens</a>, which looks at handling very large prompts in language models by compressing them into a few gist tokens while aiming for identical performance...

<a href="https://arxiv.org/abs/2012.09841">Taming Transformers for High-Resolution Image Synthesis</a> is also worth a look when we want to process not just text but other modalities as well (images, videos, audio, etc.)...

# Quirks of `Tokenization`

Let's now discuss some of the quirks of `Tokenization`...

I will go through them one by one and then list all the points in a brief summary at the end...

1. Why can't `LLMs` spell words properly? `Tokenization` is one of the reasons. \
Well fundamentally, this is because we saw that the words are *chunked-up* into tokens and some of these tokens are actually fairly long. As an example, I went through the `GPT-4` vocabulary and looked at one of the longer tokens in there, and `.DefaultCellStyle` turned out to be a single individual token:\
![TiktokenizerDefaultCellStyleToken](ExplanationMedia/Images/TiktokenizerDefaultCellStyleToken.png)\
And my suspicion was that there's too much information crammed into this single token, and that the `ChatGPT` model itself shouldn't be very good at tasks related to the spelling of this single token...\
So I did this:\
![DefaultCellStyleConversation](ExplanationMedia/Images/DefaultCellStyleConversation.png)\
That is how I wrote my prompt, and this is what the model sees (`.DefaultCellStyle` as a single token):\
![TiktokenizerDefaultCellStyleSentence](ExplanationMedia/Images/TiktokenizerDefaultCellStyleSentence.png)\
Let's look at another character level task:\
![ReverseDefaultCellStyle](ExplanationMedia/Images/ReverseDefaultCellStyle.png)\
And here, it tried to use a code interpreter, so I stopped the execution and asked it to do it without any tools, and it failed by giving me a jumble...\
So I tried a different approach this time:\
![ReverseDefaultCellStyleSeparate](ExplanationMedia/Images/ReverseDefaultCellStyleSeparate.png)\
And it was able to do this only because the input now becomes individual tokens of single characters... Which makes it easier for it to "see" these individual characters

2. Why are `LLMs` bad at Non-English languages? `Tokenizers` are a big part of the reason. It's not only that the language model sees less Non-English data while training the model parameters, but also that the `Tokenizers` are not sufficiently trained on Non-English data.
3. Why are `LLMs` bad at simple arithmetic? Again, `Tokenizers` are a big part of the reason, and it has to do with the tokenization of numbers.\
Here:\
![StandardAdditionAlgorithm](ExplanationMedia/Images/StandardAdditionAlgorithm.png)\
You can see that the standard algorithm for a simple task like addition works at the character level: you have to refer to specific digits (ones, tens, hundreds and so on...). But numbers are represented completely arbitrarily, based on whatever happened to get merged or not merged in the **BPE** tokenization process. You can read <a href="https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/">this blog post</a> to explore more.
4. Why is `GPT-2` not as good at Python as `GPT-4`? `Tokenization` is part of the reason. We saw earlier that the encoding efficiency of handling spaces in Python is terrible for `GPT-2` (which dramatically reduces the context length that the model can attend across)