#### The goal here is to convert text (English and other languages, symbols, and special characters) into numbers so we can plug it into a transformer architecture

Tokenizer app [https://tiktokenizer.vercel.app/]

In Python, strings are stored as sequences of Unicode code points.
For example:

```string = "Hello World! 😊, कैसे हैं आप"```

We can get the Unicode code point (an integer) of each character in the string using the ord() function

In [2]:
string = "Hello World! 😊, कैसे हैं आप"

unicode_representation = [ord(char) for char in string]
for char, unicode in zip(string, unicode_representation):
    print(f"Character: {char} - Unicode: {unicode}")

Character: H - Unicode: 72
Character: e - Unicode: 101
Character: l - Unicode: 108
Character: l - Unicode: 108
Character: o - Unicode: 111
Character:   - Unicode: 32
Character: W - Unicode: 87
Character: o - Unicode: 111
Character: r - Unicode: 114
Character: l - Unicode: 108
Character: d - Unicode: 100
Character: ! - Unicode: 33
Character:   - Unicode: 32
Character: 😊 - Unicode: 128522
Character: , - Unicode: 44
Character:   - Unicode: 32
Character: क - Unicode: 2325
Character: ै - Unicode: 2376
Character: स - Unicode: 2360
Character: े - Unicode: 2375
Character:   - Unicode: 32
Character: ह - Unicode: 2361
Character: ै - Unicode: 2376
Character: ं - Unicode: 2306
Character:   - Unicode: 32
Character: आ - Unicode: 2310
Character: प - Unicode: 2346


**Question is**, why can't we use these Unicode code points directly as tokens? They are already numbers, and each number maps to a letter, symbol, special character, or emoji.

Answer -
1. It would make the vocabulary of our model very large. [The List of Unicode characters article counts about 292k assigned code points, https://en.wikipedia.org/wiki/List_of_Unicode_characters]
2. The Unicode standard is still evolving and keeps changing, which makes it difficult to maintain a stable vocabulary for our model.
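
To get a feel for the size involved, here is a minimal sketch that counts the code points Python's bundled Unicode database treats as assigned. The exact number depends on the Unicode version shipped with your Python build, so it may not match the Wikipedia figure; the variable names are illustrative.

```python
import sys
import unicodedata

# Size of the whole Unicode code space: U+0000 .. U+10FFFF
total_code_points = sys.maxunicode + 1

# Category "Cn" means "unassigned" in the bundled Unicode database.
# Everything else (letters, symbols, private use, surrogates, ...) counts as assigned.
assigned = sum(
    1
    for cp in range(total_code_points)
    if unicodedata.category(chr(cp)) != "Cn"
)

print("Unicode database version:", unicodedata.unidata_version)
print("Code point space:", total_code_points)
print("Assigned code points:", assigned)
```

Either number is far larger than the vocabulary we would want if every code point were its own token.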


Another option is to use a UTF encoding (UTF-8, UTF-16, UTF-32), which turns the code points into bytes

In [4]:
utf_8_lst = list(string.encode('utf-8'))
utf_16_lst = list(string.encode('utf-16'))
utf_32_lst = list(string.encode('utf-32'))

print("\nUTF-8 Encoded Bytes length: ", len(utf_8_lst))
print("UTF-16 Encoded Bytes length: ", len(utf_16_lst))
print("UTF-32 Encoded Bytes length: ", len(utf_32_lst))


UTF-8 Encoded Bytes length:  48
UTF-16 Encoded Bytes length:  58
UTF-32 Encoded Bytes length:  112


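Here we can start to see that UTF-16 and UTF-32 are wasteful for our purpose, because they pad simple characters with many 0 bytes. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes. (Python's 'utf-16' and 'utf-32' codecs also prepend a byte-order mark, which accounts for 2 and 4 of the bytes counted above.)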
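
To see where the padding comes from, the sketch below counts the zero bytes in each encoding and prints how many bytes a few sample characters need. It uses the `-le` codec variants, an assumption made here so that no byte-order mark is added, which means the totals are slightly lower than the lengths printed above.

```python
string = "Hello World! 😊, कैसे हैं आप"

# The "-le" codecs fix the byte order, so Python does not prepend a BOM.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

zero_counts = {}
for enc in encodings:
    encoded = string.encode(enc)
    zero_counts[enc] = encoded.count(0)  # number of 0x00 bytes
    print(f"{enc:10} total bytes: {len(encoded):3}  zero bytes: {zero_counts[enc]}")

# Bytes needed per character in each encoding
for char in ["H", "क", "😊"]:
    widths = {enc: len(char.encode(enc)) for enc in encodings}
    print(repr(char), widths)
```

ASCII characters such as "H" take 1 byte in UTF-8 but 2 and 4 bytes in the wider encodings, and the extra bytes are all zeros.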

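Also, if we use the raw UTF-8 bytes as tokens, we are stuck with 1 byte per token, i.e. 8 bits, which gives a maximum of 256 distinct tokens. With such a small vocabulary, every text becomes a very long sequence of tokens.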
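
As a small illustration, raw UTF-8 bytes already work as token IDs: every ID falls in the range 0 to 255, and the sequence decodes back to the original text. The cost is sequence length. The variable names below are illustrative.

```python
string = "Hello World! 😊, कैसे हैं आप"

# One token per byte
token_ids = list(string.encode("utf-8"))

print("characters: ", len(string))
print("byte tokens:", len(token_ids))
print("all ids in 0..255:", all(0 <= t < 256 for t in token_ids))

# Decoding is lossless: bytes -> str gives back the original text
decoded = bytes(token_ids).decode("utf-8")
print("round trip ok:", decoded == string)
```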



# Byte Pair Encoding Algorithm

[https://en.wikipedia.org/wiki/Byte-pair_encoding]

Example <br>
Suppose the data to be encoded is: aaabdaaabac <br><br>

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:<br><br>

ZabdZabac<br>
Z=aa<br>
Then the process is repeated with byte pair "ab", replacing it with "Y":<br><br>

ZYdZYac<br>
Y=ab<br>
Z=aa<br>
The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte-pair encoding, replacing "ZY" with "X":<br><br>

XdXac<br>
X=ZY<br>
Y=ab<br>
Z=aa<br>
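
The walkthrough above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: it works on UTF-8 bytes, gives new tokens the IDs 256, 257, ... instead of letters like "Z", and breaks ties by picking the pair that was seen first. Because of that tie-breaking assumption, the second merge is "Za" rather than "ab", so the intermediate table differs from the example, although the final sequence has the same shape (XdXac). The function names are hypothetical.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair occurs in the sequence."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def bpe_train(text, num_merges):
    """Run up to `num_merges` merges; return the final ids and the merge table."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        # max() returns the first maximal pair, i.e. ties go to the earliest pair seen
        pair = max(counts, key=counts.get)
        if counts[pair] < 2:
            break  # no pair repeats, nothing left to compress
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand merged tokens back into bytes and decode to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids, merges = bpe_train("aaabdaaabac", num_merges=3)
print(ids)
print(merges)
print(decode(ids, merges))
```

The 11 input bytes shrink to 5 tokens after three merges, and `decode` recovers the original string from the merge table.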