# Welcome to GPT Tokenizer

😔

We have a sad face because `Tokenizers` are my least favourite part of working with **Large Language Models (LLMs)**. But unfortunately we need to understand them in detail, because a lot of the oddness of LLMs traces back to `Tokenization`...

So...

What is `Tokenization`?

Now in our last notebook (<a href="https://github.com/AvishakeAdhikary/Neural-Networks-From-Scratch/blob/main/GPT%20from%20Scratch.ipynb">GPT from Scratch</a>) we already implemented tokenization but it was done in a completely naïve and simple way...

In the previous notebook, the question we encountered was, "how do we *plug-in* text into the GPT?" We ended up with a vocabulary of `92` characters, from which we created two look-up tables, `stoi` and `itos`, for mapping characters to indices and vice-versa. These served as the `token` table for our two functions: `encode` returned the integer tokens for a string, and `decode` returned the string for a list of integer tokens...
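
As a quick reminder, that character-level scheme can be sketched roughly like this. This is only a minimal sketch: the toy `text` below is a made-up stand-in for the training text of the previous notebook, so the vocabulary here is much smaller than `92`.

```python
# Toy corpus standing in for the training text of the previous notebook (hypothetical).
text = "hello tokenizer"

# The vocabulary is the sorted set of unique characters in the text.
characters = sorted(set(text))
vocabularySize = len(characters)

# Look-up tables: character -> integer and integer -> character.
stoi = {character: index for index, character in enumerate(characters)}
itos = {index: character for index, character in enumerate(characters)}

def encode(string):
    """Map a string to a list of integer tokens, one per character."""
    return [stoi[character] for character in string]

def decode(tokens):
    """Map a list of integer tokens back to a string."""
    return "".join(itos[token] for token in tokens)

tokens = encode("hello")
print(vocabularySize)
print(tokens)
print(decode(tokens))
```

Note that `encode` would fail on any character that is not in the toy text, which is one of the limits of such a small fixed vocabulary.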

And later we saw that the way we *plug* these `tokens` into the model is by using a `tokenEmbeddingTable`, which has one row for each of the `92` possible characters, and each row holds that character's `embedding`...

But,

In practice, people use a lot more complicated schemes to encode and decode these `tokens`...

And we deal with **chunk level** text rather than single characters. These **chunks** are constructed using algorithms like the `Byte-Pair` encoding algorithm, which we will be covering in a bit...

I'd also like to discuss the paper that introduced `byte` level encoding as a tokenization mechanism in the context of LLMs...

Which is this paper: <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners</a>

And if we scroll to section **2.2 Input Representation** within the paper, which describes the tokenization, and read on to the end of the following section, **2.3 Model**, we see that they conclude with the line: "The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used."

Which means that the `vocabularySize` they used is `50,257`, and in the `Transformer` architecture's attention layer, every single token is *attending* to the previous tokens in the sequence, so it is able to see up to `1024 tokens` of context...

So **`Tokens` are the fundamental atomic units of LLMs**, and everything in an LLM is expressed in terms of them...

And **`Tokenization` is the process of translating `strings` or text into sequences of `tokens` and vice versa.**

We can also look into the <a href="https://arxiv.org/pdf/2307.09288">Llama 2: Open Foundation and Fine-Tuned Chat Models</a> paper by Meta, where section **2.1** mentions that they trained their model on 2 trillion tokens of data...

And luckily the `Byte-Pair` algorithm is fairly simple, so we can implement it ourselves and build our own `tokenizer`...

# Tiktokenizer

Before we dive into the code, we can go to this nice website that has been created for us, <a href="https://tiktokenizer.vercel.app/">Tiktokenizer</a>, and familiarize ourselves with the kinds of tokenization that are used...

And what's great about this website is that the tokenization runs live in our browser with the help of JavaScript... So we can plug in our own text, try the different tokenizers and see their respective outputs as color coded tokens...

Now what I'd like you to do is plug this sample text into the website input:

```text
Today we will be learning tokenization. It is not very fun, but definitely informative.

69 + 420 = 489
6969 + 420 = 7389

Hat.
hat.
HAT.
I have a hat.

私はコンピューターが大好きです。私は機械学習エンジニアです。 

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

And set the tokenizer to `GPT-2`, because that's the model we discussed earlier, so we can relate what we see to it...

And we immediately see that we get this kind of output:
![Tokenizer_GPT2Test](ExplanationMedia/Images/Tokenizer_GPT2Test.png)

Now, I have divided our sample text that we want to tokenize into 5 parts:

1. A sample line
2. Arithmetic
3. Case Sensitivity
4. Foreign Language
5. Python code

And before starting the discussion on the above points, I'd also like to mention that our text contains white-spaces, new line characters and so on, but we can hide them for more clarity...

1. Now let's start by discussing the sample line first: \
We see that the sample line has been *chunked* into nice little pieces, which are the `tokens` of the text. But immediately we see that the word "tokenization" has been chunked into two pieces, and we see that in the middle of a sentence the 'spaces' are part of the tokens that follow them. We will see why that is in a bit...
2. Next up we have the arithmetic: \
We see that the number `420` is a single token but the number `489` is split up into two separate tokens. And the LLM has to take account of this and process it correctly in its neural network as well...
3. Next up we have our case sensitivity and punctuation: \
We took the word "hat" and we see how differently it is tokenized in each case (`Hat`, `hat`, `HAT`). Having a leading space in front of the word also makes it a completely different token than the one without it. But the most interesting part here is that the LLM has to learn from the raw data, all by itself, that all of these "hat"s refer to the same concept, and has to represent in the parameters of the neural network that they are almost, but not exactly, the same.
4. Next up we will discuss the foreign languages: \
I have put this here because non-English languages work slightly worse in LLMs, and that is because the amount of training data for these languages is generally very small compared to English. This is not just true for the LLM itself but also for the `tokenizer`. So when we train the tokenizer, we will see that there is a lot more English than non-English text, and what ends up happening is that English gets many more, and much longer, tokens than the other languages. In other words, if we compare an English text with a Japanese text that says the same thing, we see that the number of tokens used for Japanese is much larger than for English, because the *chunks* are a lot more broken up. And intuitively what this does is, it bloats up the sequence length of all the non-English documents that we train on, and we very quickly run out of context in the `Transformer`'s attention part.
5. And lastly we will discuss the tokenization of code, for which I have taken a Python example: \
We immediately see that the individual spaces of the indentation are all separate tokens (specifically token `220`), which means that when the `Transformer` attends to this text, it has to attend to every space individually. That is another way of saying that this tokenization is extremely wasteful. (GPT-2 is extremely un-optimized for coding)\
Now we can change the tokenizer to `cl100k_base` (which is the GPT-4 tokenizer) and check the results, and we get something like this:
![Tokenizer_CL100KExample](ExplanationMedia/Images/Tokenizer_CL100KExample.png)
We immediately see that the token count has decreased, and that is because the vocabulary of the GPT-4 tokenizer is roughly double the size of the GPT-2 vocabulary. Which means we are now feeding much denser inputs to the transformer, so for the same context length the `Transformer` is able to see more of the previous text than before. But increasing the vocabulary size indefinitely is not good either, because the embedding table and the Softmax at the output of the transformer grow with it, and we end up doing a lot more computation than before. There is a *sweet spot* in between that gives us a nice vocabulary at the end. I'd also like you to note that the **whitespace** handling of the GPT-4 tokenizer has improved a lot: the spaces get grouped into single tokens, which makes the tokenization of Python much more efficient. This was a deliberate choice made by OpenAI, because it densifies the information carried by each token, and the `Transformer` ends up looking further back into the context of the previous code.
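
To make the vocabulary-size trade-off a bit more concrete, here is a rough back-of-the-envelope sketch of how the embedding table grows with the vocabulary. The helper `embeddingTableParameters` is a hypothetical name, the embedding width of `768` is an assumption that does not come from the text above, and `100_000` is only a round stand-in for "a vocabulary about twice as large". The same scaling applies to the output layer, which has to produce one score per token before the Softmax.

```python
def embeddingTableParameters(vocabularySize, embeddingDimension):
    """Number of weights in a token embedding table: one row per token."""
    return vocabularySize * embeddingDimension

embeddingDimension = 768      # assumed width, not taken from the text above

gpt2Vocabulary = 50_257       # vocabulary size quoted from the GPT-2 paper
largerVocabulary = 100_000    # round stand-in for a vocabulary about twice as large

for name, size in [("GPT-2 sized", gpt2Vocabulary), ("about double", largerVocabulary)]:
    parameters = embeddingTableParameters(size, embeddingDimension)
    print(f"{name}: {size:,} tokens -> {parameters:,} embedding weights")
```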

# Coding By Understanding

Let's now start writing some code...

Let's understand what we are trying to do:
1. We want to take `strings` and we want to feed them into language models
2. For that we need to somehow `tokenize` the strings into `integers` from a fixed vocabulary
3. We will use those integers to make a *look-up* into a `lookupTable` of embedding vectors and feed those vectors into the transformer as an input
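
The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the vocabulary is a toy character-level one built from the input itself, the `embeddingDimension` of `4` is made up, and the `lookupTable` is filled with random numbers instead of trained embeddings.

```python
import random

random.seed(0)  # reproducible toy values

# Step 1: the string we want to feed into the language model.
text = "Hello"

# Step 2: a fixed (toy) vocabulary mapping each character to an integer.
vocabulary = sorted(set(text))
stoi = {character: index for index, character in enumerate(vocabulary)}
tokens = [stoi[character] for character in text]

# Step 3: a lookup table with one embedding vector (row) per vocabulary entry.
embeddingDimension = 4  # hypothetical, tiny on purpose
lookupTable = [
    [random.uniform(-1.0, 1.0) for _ in range(embeddingDimension)]
    for _ in range(len(vocabulary))
]

# Each token indexes one row; this list of vectors is what the transformer would receive.
embeddedInput = [lookupTable[token] for token in tokens]

print(tokens)
print(len(embeddedInput), len(embeddedInput[0]))
```

Both occurrences of `l` map to the same integer and therefore to the same row of the table, which is all an embedding look-up really does.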

But the reason this gets tricky is because:
1. We want to support different kinds of languages
2. We also want to support different kinds of special characters that we might find on the internet (for example emojis such as 👋)

Let's take a *toy-example* first:
```text
Hello in japanese. 👋 日本語でこんにちは。
```

So the question we now arrive at is, how do we feed this text into a `Transformer`?

Let's first dive into the definition of `strings` in the <a href="https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str">Python documentation</a>, where we find the sentence "Strings are immutable sequences of Unicode code points."

So, what are Unicode code points?

If we look up the <a href="https://en.wikipedia.org/wiki/Unicode">Unicode page</a> on Wikipedia, we understand that Unicode is a text encoding standard maintained by the Unicode Consortium as a part of The Unicode Standard.
What this essentially is, is roughly a definition of 149,813 characters across 161 scripts (what the characters look like and which integers represent them) **as of right now**.

And I say **as of right now** because, we can see that the standard is very much alive and keeps on changing.

And the way we can **access the Unicode code point of a character** is by using a function in Python called `ord()`, which takes a single character as input.


So we can now experiment with some code:

For example, if we do:
```python
print(ord("H"))
print(ord("👋"))
print(ord("日"))
```
For which we get:
```python
72
128075
26085
```

So using the same idea, we can look up all the characters in our *toy-example* string by using a for loop to take out each character and collecting the output of `ord()` for those characters into a list, like this:
```python
print([ord(character) for character in "Hello in japanese. 👋 日本語でこんにちは。"])
```
For which we get:
```python
[72, 101, 108, 108, 111, 32, 105, 110, 32, 106, 97, 112, 97, 110, 101, 115, 101, 46, 32, 128075, 32, 26085, 26412, 35486, 12391, 12371, 12435, 12395, 12385, 12399, 12290]
```
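
As a small sanity check, the mapping can also be reversed. The sketch below uses Python's built-in `chr()`, which is the inverse of `ord()`, to turn the list of code points back into the original string. The variable names are only illustrative.

```python
text = "Hello in japanese. 👋 日本語でこんにちは。"

# String -> list of Unicode code points.
codePoints = [ord(character) for character in text]

# List of code points -> string, using chr() as the inverse of ord().
restored = "".join(chr(codePoint) for codePoint in codePoints)

print(restored == text)
print(len(text), len(codePoints))
```

So code points already give us a lossless string-to-integers mapping, with exactly one integer per character.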

Now we see that we have already turned this raw text into integers, so we might arrive at the question, "why can't we just use these integers and not have any tokenization at all?".

One reason is that the vocabulary would be quite large. But the more dangerous reason is that, because the Unicode standard is very much alive, it keeps changing, which means that it is not a stable representation that we would want to use directly in our models...
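
Both points can be checked from Python itself. This is just a small sketch using the standard library: `sys.maxunicode` is the largest code point a Python string can hold, and `unicodedata.unidata_version` is the version of the Unicode database bundled with the running interpreter, which differs between Python releases.

```python
import sys
import unicodedata

# Largest possible code point, so the code space has sys.maxunicode + 1 slots.
print(sys.maxunicode)
print(sys.maxunicode + 1)

# Version of the Unicode database shipped with this Python build;
# newer Python releases ship newer versions because the standard keeps growing.
print(unicodedata.unidata_version)
```

A vocabulary that covered every possible code point would need more than a million rows in the embedding table, and the set of assigned characters would still shift with every new Unicode version.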

So we need something a bit better at this point...

Now, we tend towards the idea of `encodings`...