<a href="https://colab.research.google.com/github/Firojpaudel/Demystifying_Language_Modeling/blob/main/BPE_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A website for visualizing how text gets tokenized: https://tiktokenizer.vercel.app/

Before diving into sophisticated tokenization, let's first discuss Unicode.

#### What is Unicode?

Computers process information as numbers, specifically in binary. To represent text, computers also use numbers. Unicode is a standard that assigns a unique numerical value, called a code point, to every character across different languages and scripts. This allows computers to consistently handle text from various sources.

---

Let's see how this is carried out:

In [None]:
#@ To get the Unicode code point of a character, Python provides `ord`:

ord("H")

72

In [None]:
##@ `ord` accepts only a single character, not a whole string, so we loop over the characters:

[ord(x) for x in "Hello this is Unicode Testing for string. Since Ord doesnot take string directly"]

[72,
 101,
 108,
 108,
 111,
 32,
 116,
 104,
 105,
 115,
 32,
 105,
 115,
 32,
 85,
 110,
 105,
 99,
 111,
 100,
 101,
 32,
 84,
 101,
 115,
 116,
 105,
 110,
 103,
 32,
 102,
 111,
 114,
 32,
 115,
 116,
 114,
 105,
 110,
 103,
 46,
 32,
 83,
 105,
 110,
 99,
 101,
 32,
 79,
 114,
 100,
 32,
 100,
 111,
 101,
 115,
 110,
 111,
 116,
 32,
 116,
 97,
 107,
 101,
 32,
 115,
 116,
 114,
 105,
 110,
 103,
 32,
 100,
 105,
 114,
 101,
 99,
 116,
 108,
 121]

In [None]:
##@ We can also encode the whole string to UTF-8 and look at the raw bytes:
list("Hello👋! this is Unicode Testing for string. Since `Ord` doesnot take string directly".encode("UTF-8"))

[72,
 101,
 108,
 108,
 111,
 240,
 159,
 145,
 139,
 33,
 32,
 116,
 104,
 105,
 115,
 32,
 105,
 115,
 32,
 85,
 110,
 105,
 99,
 111,
 100,
 101,
 32,
 84,
 101,
 115,
 116,
 105,
 110,
 103,
 32,
 102,
 111,
 114,
 32,
 115,
 116,
 114,
 105,
 110,
 103,
 46,
 32,
 83,
 105,
 110,
 99,
 101,
 32,
 96,
 79,
 114,
 100,
 96,
 32,
 100,
 111,
 101,
 115,
 110,
 111,
 116,
 32,
 116,
 97,
 107,
 101,
 32,
 115,
 116,
 114,
 105,
 110,
 103,
 32,
 100,
 105,
 114,
 101,
 99,
 116,
 108,
 121]
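
The two outputs above differ only at the emoji: `ord` yields one integer per character, while UTF-8 spreads the emoji over four bytes. A minimal sketch of that difference, using a shortened string of my own (`"Hello"` plus the waving-hand emoji and `"!"`) rather than the full sentence from the cells above:

```python
# "\U0001F44B" is the waving-hand emoji, written as an escape so the source stays ASCII.
text = "Hello\U0001F44B!"

# Code points: exactly one integer per character.
code_points = [ord(ch) for ch in text]

# UTF-8 bytes: between one and four integers per character.
utf8_bytes = list(text.encode("utf-8"))

print(len(code_points), code_points)  # 7 [72, 101, 108, 108, 111, 128075, 33]
print(len(utf8_bytes), utf8_bytes)    # 10 [72, 101, 108, 108, 111, 240, 159, 145, 139, 33]

# Both representations are lossless: chr() inverts ord(), bytes.decode() inverts str.encode().
assert "".join(chr(cp) for cp in code_points) == text
assert bytes(utf8_bytes).decode("utf-8") == text
```

The byte values `240, 159, 145, 139` are the same four numbers that appear after `111` in the UTF-8 output above.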

---
Okay, great. **If Unicode already gives us numbers, why tokenize at all in LLMs?**

The main reasons for not feeding raw Unicode values to an LLM are efficiency and the difficulty of learning language patterns at the character level. Although Unicode assigns a unique number to every character, working character by character produces very long sequences, which is computationally expensive for large language models. In addition, many characters (such as emojis or rare symbols) appear infrequently, making it hard for the model to learn meaningful representations for them. Tokenization addresses this by grouping sequences of characters into larger units (tokens), which can represent words, sub-words, or common character sequences. This shortens the sequences the model has to process, keeps the vocabulary at a manageable size, and can help the model capture the relationships between words and concepts in the text.

With that, let's discuss **BPE (byte-pair encoding) tokenization**.

So what happens in BPE Tokenization?

- It's a compression technique that reduces the number of tokens needed to represent the data.
---

**Working**:

Let's assume our data to be encoded as: $\text{aaabdaaabac}$

- At each step, we replace the most frequent byte pair with a byte that is not used in the data.
- Here, the most frequent byte pair is $\text{aa}$.
- Replacing it with $\text{"Z"}$, we get: $\text{ZabdZabac}$
- Next, $\text{ab} \rightarrow \text{"Y"}$ gives: $\text{ZYdZYac}$
- Finally, $\text{ZY} \rightarrow \text{"X"}$ gives: $\text{XdXac}$
---
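
The walkthrough above can be replayed in a few lines of Python. This is only a toy sketch on a string, not the byte-level implementation: the `merges` list and the helper `pair_counts` are names I made up for illustration, and the merge rules are copied from the steps above rather than discovered automatically.

```python
from collections import Counter

def pair_counts(s):
    """Count every adjacent pair of symbols in s (overlapping pairs included)."""
    return Counter(a + b for a, b in zip(s, s[1:]))

data = "aaabdaaabac"
# Merge rules taken from the walkthrough, in the order they were applied.
merges = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]

encoded = data
for pair, symbol in merges:
    counts = pair_counts(encoded)
    # The chosen pair is one of the most frequent at this step (ties are possible).
    assert counts[pair] == max(counts.values())
    encoded = encoded.replace(pair, symbol)
    print(f"{pair} -> {symbol}: {encoded}")

# Decoding undoes the merges in reverse order, so no information is lost.
decoded = encoded
for pair, symbol in reversed(merges):
    decoded = decoded.replace(symbol, pair)

print(len(data), "->", len(encoded))  # 11 -> 5
assert decoded == data
```

Note that in the second step `Za` and `ab` both occur twice. The walkthrough picks `ab`, but picking `Za` first would lead to the same final string `XdXac`.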

In [4]:
## Starting with the length comparison
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode("UTF-8")
tokens = list(map(int, tokens))
print('-----------')
print(text)
print("length:", len(text))
print('-----------')
print(tokens)
print('length:', len(tokens))
print('-----------')

-----------
Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
length: 533
-----------
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240,

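As we can see, UTF-8 encoding alone increases the token count: the byte sequence is longer than the 533 characters of the original text, because every non-ASCII character is encoded as two to four bytes.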
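
To see where the extra bytes come from, here is a small sketch that measures the UTF-8 length of a few individual characters. The sample characters and their labels are my own choice for illustration (the interrobang and the grinning emoji also occur in the text above).

```python
# One sample character per UTF-8 width; escapes keep the source ASCII-only.
samples = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent",
    "\u203d": "interrobang",
    "\U0001F604": "grinning emoji",
}

# Number of UTF-8 bytes needed for each character.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {byte_lengths[ch]} byte(s)")
```

Plain ASCII stays at one byte per character, so the growth in the byte count comes entirely from the decorative characters and punctuation outside ASCII.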

So, next stop: we work on **BPE**:

In [5]:
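def get_status(ids):
  """Count how often each pair of adjacent tokens occurs in `ids`."""
  counts = {}
  # zip(ids, ids[1:]) yields every consecutive pair (ids[i], ids[i+1])
  for pair in zip(ids, ids[1:]):
    counts[pair] = counts.get(pair, 0) + 1
  return counts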

status = get_status(tokens)

print(status)

{(239, 188): 1, (188, 181): 1, (181, 239): 1, (239, 189): 6, (189, 142): 1, (142, 239): 1, (189, 137): 1, (137, 239): 1, (189, 131): 1, (131, 239): 1, (189, 143): 1, (143, 239): 1, (189, 132): 1, (132, 239): 1, (189, 133): 1, (133, 33): 1, (33, 32): 2, (32, 240): 3, (240, 159): 15, (159, 133): 7, (133, 164): 1, (164, 240): 1, (133, 157): 1, (157, 240): 1, (133, 152): 1, (152, 240): 1, (133, 146): 1, (146, 240): 1, (133, 158): 1, (158, 240): 1, (133, 147): 1, (147, 240): 1, (133, 148): 1, (148, 226): 1, (226, 128): 12, (128, 189): 1, (189, 32): 1, (159, 135): 7, (135, 186): 1, (186, 226): 1, (128, 140): 6, (140, 240): 6, (135, 179): 1, (179, 226): 1, (135, 174): 1, (174, 226): 1, (135, 168): 1, (168, 226): 1, (135, 180): 1, (180, 226): 1, (135, 169): 1, (169, 226): 1, (135, 170): 1, (170, 33): 1, (159, 152): 1, (152, 132): 1, (132, 32): 1, (32, 84): 1, (84, 104): 1, (104, 101): 6, (101, 32): 20, (32, 118): 1, (118, 101): 3, (101, 114): 6, (114, 121): 2, (121, 32): 2, (32, 110): 2, (110,
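
A dictionary like this is easier to read once the pairs are ranked by count. A minimal sketch, assuming the same `get_status` function as above (repeated here so the snippet runs on its own) and using the short string `aaabdaaabac` from the earlier walkthrough instead of the long text:

```python
def get_status(ids):
    """Count how often each pair of adjacent tokens occurs in `ids`."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# Byte values of the toy example: a=97, b=98, c=99, d=100.
ids = list("aaabdaaabac".encode("utf-8"))
status = get_status(ids)

# The pair with the highest count is the candidate for the next BPE merge.
top_pair = max(status, key=status.get)
print(top_pair, status[top_pair])  # (97, 97) 4

# Full ranking, most frequent first.
ranked = sorted(status.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Here `(97, 97)` is the byte pair `aa`, which matches the first merge in the walkthrough. Running the same two lines on `tokens` would give the most frequent byte pair of the long text.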