# Unicode code points

Unicode code points are assigning an integer to every character and script across different writing systems. 

In [214]:
ord("N") #using the ord function in python we can display the unicode encoding of a character

78

We cannot use the unicode code points standard as our vocabulary for a LLM. First of all because we would end up with a gigantic vocabulary size but also because unicode is not a stable representation of the characters as it keeps on being updated. 

# UTF-8

One encoding standard for representing every character in the unicode standard is UTF-8 which uses 1 to 4 bytes per character. The first 128 character of UTF-8 are the same than ASCII making UTF-8 backward compatible with ASCII. 


In [215]:
list("Hello World!".encode("utf-8"))

[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

We cannot keep this as our vocabulary because we only a certain context windown size for our transformer and keeping utf_! just like so would make us end up with very long sequences adn therefore end up with an attention becoming extremely expensive. That's why we will use the byte pair encoding algorithm to leverage compression.

# The Byte Pair Encoding Algorithm

The Byte Pair Encoding Algorithm rely on the following steps. Given a sequence of tokens and a vocabulary set, we are looking for the pair of juxtaposed tokens in the sequence occuring the most often. Once identified, we are creating a new token which is the concatenation of these two juxtaposed token (therefore augmenting our vocabulary size) and replacing them with our new token in the sequence. Thus the sequence is reducing in length and the vocabulary size is increasing. 

In [216]:
text = "We are at the very beginning of time for the human race. It is not unreasonable that we grapple with problems. But there are tens of thousands of years in the future. Our responsibility is to do what we can, learn what we can, improve the solutions, and pass them on. 🔭"
tokens = list(text.encode("utf-8"))

print(text)
print("Length of text :", len(text))
print("-------------------")

print(tokens)
print("Length of tokens :", len(tokens))


We are at the very beginning of time for the human race. It is not unreasonable that we grapple with problems. But there are tens of thousands of years in the future. Our responsibility is to do what we can, learn what we can, improve the solutions, and pass them on. 🔭
Length of text : 269
-------------------
[87, 101, 32, 97, 114, 101, 32, 97, 116, 32, 116, 104, 101, 32, 118, 101, 114, 121, 32, 98, 101, 103, 105, 110, 110, 105, 110, 103, 32, 111, 102, 32, 116, 105, 109, 101, 32, 102, 111, 114, 32, 116, 104, 101, 32, 104, 117, 109, 97, 110, 32, 114, 97, 99, 101, 46, 32, 73, 116, 32, 105, 115, 32, 110, 111, 116, 32, 117, 110, 114, 101, 97, 115, 111, 110, 97, 98, 108, 101, 32, 116, 104, 97, 116, 32, 119, 101, 32, 103, 114, 97, 112, 112, 108, 101, 32, 119, 105, 116, 104, 32, 112, 114, 111, 98, 108, 101, 109, 115, 46, 32, 66, 117, 116, 32, 116, 104, 101, 114, 101, 32, 97, 114, 101, 32, 116, 101, 110, 115, 32, 111, 102, 32, 116, 104, 111, 117, 115, 97, 110, 100, 115, 32, 111, 102, 32, 121, 

Here we can see that the length of the raw text and the length of the tokens are different because some characters are encoded using more than 1 byte. ASCII character are only taking 1 byte but others like emojis take more. 


In [217]:
# Function which for a given sequence of tokens is outputing the stats of each pair of tokens, i.e the number of occurrence for each pair
def find_pair_stats(tokens):
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        if pair in counts:
            counts[pair]+=1
        else :
            counts[pair]=1
    reversed_counts = {}
    for pair, count in counts.items():
        if count in reversed_counts:
            reversed_counts[count].append(pair)
        else : 
            reversed_counts[count] = [pair]
    return reversed_counts, counts

In [218]:
find_pair_stats(tokens)

({1: [(87, 101),
   (32, 118),
   (114, 121),
   (32, 98),
   (98, 101),
   (101, 103),
   (103, 105),
   (110, 110),
   (110, 105),
   (110, 103),
   (103, 32),
   (109, 101),
   (102, 111),
   (111, 114),
   (32, 104),
   (104, 117),
   (117, 109),
   (109, 97),
   (97, 99),
   (99, 101),
   (32, 73),
   (73, 116),
   (32, 110),
   (110, 111),
   (111, 116),
   (32, 117),
   (117, 110),
   (110, 114),
   (110, 97),
   (97, 98),
   (32, 103),
   (103, 114),
   (97, 112),
   (112, 112),
   (112, 108),
   (119, 105),
   (104, 32),
   (111, 98),
   (109, 115),
   (115, 46),
   (32, 66),
   (66, 117),
   (116, 101),
   (101, 110),
   (104, 111),
   (111, 117),
   (117, 115),
   (115, 97),
   (100, 115),
   (32, 121),
   (121, 101),
   (114, 115),
   (102, 117),
   (116, 117),
   (32, 79),
   (79, 117),
   (101, 115),
   (115, 112),
   (112, 111),
   (115, 105),
   (105, 98),
   (98, 105),
   (105, 108),
   (108, 105),
   (116, 121),
   (116, 111),
   (32, 100),
   (100, 111),
   (32, 108)

In [219]:
def find_most_common_pair(dictionary):
    max = 0
    for count, pair in dictionary.items():
        if count > max :
            max = count
    return dictionary[max], ''.join(chr(i) for i in dictionary[max][0]), max

In [220]:
dictionary = find_pair_stats(tokens)[0]
find_most_common_pair(dictionary)

([(101, 32)], 'e ', 15)

In [221]:
top_pair = find_most_common_pair(dictionary)[0][0]
top_pair

(101, 32)

Now that we have the top pair of tokens, we want to define a function which is going to replace this pair by a new token

In [222]:
def merge(tokens, pair, new_token):
    index = 0
    while index < len(tokens) - 1:
        if (tokens[index], tokens[index + 1]) == pair:
            tokens[index:index + 2] = [new_token]
        index += 1
    return tokens

In [223]:
tokens_test = [1,2,3,4,5,6,7,2,3]
pair_test = (2,3)
new_token_test = 42

merge(tokens_test, pair_test, new_token_test)

[1, 42, 4, 5, 6, 7, 42]

We have succesfully replaced (2,3) by 42! Now let's replace the top pair in our original text with the token 256. 

In [224]:
tokens_replaced = merge(tokens, top_pair, 256)

print(tokens_replaced)
print('-------------')
print(len(tokens_replaced))


[87, 256, 97, 114, 256, 97, 116, 32, 116, 104, 256, 118, 101, 114, 121, 32, 98, 101, 103, 105, 110, 110, 105, 110, 103, 32, 111, 102, 32, 116, 105, 109, 256, 102, 111, 114, 32, 116, 104, 256, 104, 117, 109, 97, 110, 32, 114, 97, 99, 101, 46, 32, 73, 116, 32, 105, 115, 32, 110, 111, 116, 32, 117, 110, 114, 101, 97, 115, 111, 110, 97, 98, 108, 256, 116, 104, 97, 116, 32, 119, 256, 103, 114, 97, 112, 112, 108, 256, 119, 105, 116, 104, 32, 112, 114, 111, 98, 108, 101, 109, 115, 46, 32, 66, 117, 116, 32, 116, 104, 101, 114, 256, 97, 114, 256, 116, 101, 110, 115, 32, 111, 102, 32, 116, 104, 111, 117, 115, 97, 110, 100, 115, 32, 111, 102, 32, 121, 101, 97, 114, 115, 32, 105, 110, 32, 116, 104, 256, 102, 117, 116, 117, 114, 101, 46, 32, 79, 117, 114, 32, 114, 101, 115, 112, 111, 110, 115, 105, 98, 105, 108, 105, 116, 121, 32, 105, 115, 32, 116, 111, 32, 100, 111, 32, 119, 104, 97, 116, 32, 119, 256, 99, 97, 110, 44, 32, 108, 101, 97, 114, 110, 32, 119, 104, 97, 116, 32, 119, 256, 99, 97, 110, 

# Putting it all together 

Let's now work with a bigger chunk of text 

In [225]:
text= """France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north; Germany to the northeast; Switzerland to the east; Italy and Monaco to the southeast; Andorra and Spain to the south; and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions—five of which are overseas—span a combined area of 632,702 km2 (244,288 sq mi) and have an estimated total population of over 68.6 million as of January 2025. France is a semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic tribes known as Gauls before Rome annexed the area in 51 BC, leading to a distinct Gallo-Roman culture. In the Early Middle Ages, the Franks formed the kingdom of Francia, which became the heartland of the Carolingian Empire. The Treaty of Verdun of 843 partitioned the empire, with West Francia evolving into the Kingdom of France. In the High Middle Ages, France was a powerful but decentralized feudal kingdom, but from the mid-14th to the mid-15th centuries, France was plunged into a dynastic conflict with England known as the Hundred Years' War. In the 16th century, French culture flourished during the French Renaissance and a French colonial empire emerged. Internally, France was dominated by the conflict with the House of Habsburg and the French Wars of Religion between Catholics and Huguenots. France was successful in the Thirty Years' War and further increased its influence during the reign of Louis XIV. The French Revolution of 1789 overthrew the Ancien Régime and produced the Declaration of the Rights of Man, which expresses the nation's ideals to this day. France reached its political and military zenith in the early 19th century under Napoleon Bonaparte, subjugating part of continental Europe and establishing the First French Empire. The collapse of the empire initiated a period of relative decline, in which France endured the Bourbon Restoration until the founding of the French Second Republic which was succeeded by the Second French Empire upon Napoleon III's takeover. His empire collapsed during the Franco-Prussian War in 1870. This led to the establishment of the Third French Republic, and subsequent decades saw a period of economic prosperity and cultural and scientific flourishing known as the Belle Époque. France was one of the major participants of World War I, from which it emerged victorious at great human and economic cost. It was among the Allies of World War II, but it surrendered and was occupied in 1940. Following its liberation in 1944, the short-lived Fourth Republic was established and later dissolved in the course of the defeat in the Algerian War. The current Fifth Republic was formed in 1958 by Charles de Gaulle. Algeria and most French colonies became independent in the 1960s, with the majority retaining close economic and military ties with France. France retains its centuries-long status as a global centre of art, science, and philosophy. It hosts the fourth-largest number of UNESCO World Heritage Sites and is the world's leading tourist destination, having received 100 million foreign visitors in 2023. A developed country, France has a high nominal per capita income globally, and its advanced economy ranks among the largest in the world by both nominal GDP and PPP-adjusted GDP. It is a great power, being one of the five permanent members of the United Nations Security Council and an official nuclear-weapon state. The country is part of multiple international organizations and forums. Originally applied to the whole Frankish Empire, the name France comes from the Latin Francia, or 'realm of the Franks'.[13] The name of the Franks is related to the English word frank ('free'): the latter stems from the Old French franc ('free, noble, sincere'), and ultimately from the Medieval Latin word francus ('free, exempt from service; freeman, Frank"', a generalisation of the tribal name that emerged as a Late Latin borrowing of the reconstructed Frankish endonym *Frank.[14][15] It has been suggested that the meaning 'free' was adopted because, after the conquest of Gaul, only Franks were free of taxation,[16] or more generally because they had the status of freemen in contrast to servants or slaves.[15] The etymology of *Frank is uncertain. It is traditionally derived from the Proto-Germanic word *frankōn, which translates as 'javelin' or 'lance' (the throwing axe of the Franks was known as the francisca),[17] although these weapons may have been named because of their use by the Franks, not the other way around.[15] In English, 'France' is pronounced /fræns/ FRANSS in American English and /frɑːns/ FRAHNSS or /fræns/ FRANSS in British English. The pronunciation with /ɑː/ is mostly confined to accents with the trap-bath split such as Received Pronunciation, though it can be also heard in some other dialects such as Cardiff English.[18]"""
tokens = list(text.encode("utf-8"))

print(text)
print("Length of text :", len(text))
print("-------------------")

print(tokens)
print("Length of tokens :", len(tokens))


France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north; Germany to the northeast; Switzerland to the east; Italy and Monaco to the southeast; Andorra and Spain to the south; and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions—five of which are overseas—span a combined area of 632,702 km2 (244,288 sq mi) and have an estimated total population of over 68.6 million as of January 2025. France is a semi-presidential rep

In [226]:
import copy

def train_tokenizer(tokens, vocab_size):
    merges = {}
    tokens_copy = copy.deepcopy(tokens)
    num_merges = vocab_size - 256
    for i in range(num_merges):
        dictionary = find_pair_stats(tokens_copy)[0]
        top_pair = find_most_common_pair(dictionary)[0][0]
        merges[top_pair]=256+i
        print(f"Merging {top_pair} into the new token {256+i}")
        tokens_copy = merge(tokens_copy, top_pair, 256 + i)
    compression_ratio = len(tokens)/len(tokens_copy)
    print(f"Compression ratio {compression_ratio}")
    return tokens_copy, merges, len(tokens), len(tokens_copy)

In [227]:
train_tokenizer(tokens, 276)

Merging (101, 32) into the new token 256
Merging (116, 104) into the new token 257
Merging (97, 110) into the new token 258
Merging (100, 32) into the new token 259
Merging (115, 32) into the new token 260
Merging (257, 256) into the new token 261
Merging (105, 110) into the new token 262
Merging (32, 261) into the new token 263
Merging (111, 110) into the new token 264
Merging (114, 101) into the new token 265
Merging (101, 114) into the new token 266
Merging (111, 102) into the new token 267
Merging (101, 259) into the new token 268
Merging (116, 32) into the new token 269
Merging (114, 258) into the new token 270
Merging (258, 259) into the new token 271
Merging (44, 32) into the new token 272
Merging (97, 114) into the new token 273
Merging (111, 114) into the new token 274
Merging (105, 99) into the new token 275
Compression ratio 1.3292210933720368


([70,
  270,
  99,
  101,
  44,
  91,
  73,
  88,
  93,
  32,
  267,
  102,
  275,
  105,
  97,
  108,
  108,
  121,
  263,
  70,
  265,
  110,
  99,
  104,
  32,
  82,
  101,
  112,
  117,
  98,
  108,
  275,
  44,
  91,
  88,
  93,
  32,
  105,
  260,
  97,
  32,
  99,
  111,
  117,
  110,
  116,
  114,
  121,
  32,
  108,
  111,
  99,
  97,
  116,
  268,
  112,
  114,
  105,
  109,
  273,
  105,
  108,
  121,
  32,
  262,
  32,
  87,
  101,
  115,
  116,
  266,
  110,
  32,
  69,
  117,
  114,
  111,
  112,
  101,
  46,
  32,
  73,
  116,
  260,
  111,
  118,
  266,
  115,
  101,
  97,
  260,
  265,
  103,
  105,
  264,
  260,
  271,
  116,
  266,
  114,
  105,
  116,
  274,
  105,
  101,
  260,
  262,
  99,
  108,
  117,
  100,
  256,
  70,
  265,
  110,
  99,
  104,
  32,
  71,
  117,
  105,
  258,
  97,
  32,
  262,
  32,
  83,
  111,
  117,
  257,
  32,
  65,
  109,
  266,
  275,
  97,
  272,
  83,
  97,
  262,
  269,
  80,
  105,
  266,
  114,
  256,
  271,
  77,
  105,
  113,


# Decoding

In [228]:
text= """France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north; Germany to the northeast; Switzerland to the east; Italy and Monaco to the southeast; Andorra and Spain to the south; and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions—five of which are overseas—span a combined area of 632,702 km2 (244,288 sq mi) and have an estimated total population of over 68.6 million as of January 2025. France is a semi-presidential republic and its capital, largest city and main cultural and economic centre is Paris. Metropolitan France was settled during the Iron Age by Celtic tribes known as Gauls before Rome annexed the area in 51 BC, leading to a distinct Gallo-Roman culture. In the Early Middle Ages, the Franks formed the kingdom of Francia, which became the heartland of the Carolingian Empire. The Treaty of Verdun of 843 partitioned the empire, with West Francia evolving into the Kingdom of France. In the High Middle Ages, France was a powerful but decentralized feudal kingdom, but from the mid-14th to the mid-15th centuries, France was plunged into a dynastic conflict with England known as the Hundred Years' War. In the 16th century, French culture flourished during the French Renaissance and a French colonial empire emerged. Internally, France was dominated by the conflict with the House of Habsburg and the French Wars of Religion between Catholics and Huguenots. France was successful in the Thirty Years' War and further increased its influence during the reign of Louis XIV. The French Revolution of 1789 overthrew the Ancien Régime and produced the Declaration of the Rights of Man, which expresses the nation's ideals to this day. France reached its political and military zenith in the early 19th century under Napoleon Bonaparte, subjugating part of continental Europe and establishing the First French Empire. The collapse of the empire initiated a period of relative decline, in which France endured the Bourbon Restoration until the founding of the French Second Republic which was succeeded by the Second French Empire upon Napoleon III's takeover. His empire collapsed during the Franco-Prussian War in 1870. This led to the establishment of the Third French Republic, and subsequent decades saw a period of economic prosperity and cultural and scientific flourishing known as the Belle Époque. France was one of the major participants of World War I, from which it emerged victorious at great human and economic cost. It was among the Allies of World War II, but it surrendered and was occupied in 1940. Following its liberation in 1944, the short-lived Fourth Republic was established and later dissolved in the course of the defeat in the Algerian War. The current Fifth Republic was formed in 1958 by Charles de Gaulle. Algeria and most French colonies became independent in the 1960s, with the majority retaining close economic and military ties with France. France retains its centuries-long status as a global centre of art, science, and philosophy. It hosts the fourth-largest number of UNESCO World Heritage Sites and is the world's leading tourist destination, having received 100 million foreign visitors in 2023. A developed country, France has a high nominal per capita income globally, and its advanced economy ranks among the largest in the world by both nominal GDP and PPP-adjusted GDP. It is a great power, being one of the five permanent members of the United Nations Security Council and an official nuclear-weapon state. The country is part of multiple international organizations and forums. Originally applied to the whole Frankish Empire, the name France comes from the Latin Francia, or 'realm of the Franks'.[13] The name of the Franks is related to the English word frank ('free'): the latter stems from the Old French franc ('free, noble, sincere'), and ultimately from the Medieval Latin word francus ('free, exempt from service; freeman, Frank"', a generalisation of the tribal name that emerged as a Late Latin borrowing of the reconstructed Frankish endonym *Frank.[14][15] It has been suggested that the meaning 'free' was adopted because, after the conquest of Gaul, only Franks were free of taxation,[16] or more generally because they had the status of freemen in contrast to servants or slaves.[15] The etymology of *Frank is uncertain. It is traditionally derived from the Proto-Germanic word *frankōn, which translates as 'javelin' or 'lance' (the throwing axe of the Franks was known as the francisca),[17] although these weapons may have been named because of their use by the Franks, not the other way around.[15] In English, 'France' is pronounced /fræns/ FRANSS in American English and /frɑːns/ FRAHNSS or /fræns/ FRANSS in British English. The pronunciation with /ɑː/ is mostly confined to accents with the trap-bath split such as Received Pronunciation, though it can be also heard in some other dialects such as Cardiff English.[18]"""
tokens = list(text.encode("utf-8"))

print(text)
print("Length of text :", len(text))
print("-------------------")

print(tokens)
print("Length of tokens :", len(tokens))


France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north; Germany to the northeast; Switzerland to the east; Italy and Monaco to the southeast; Andorra and Spain to the south; and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions—five of which are overseas—span a combined area of 632,702 km2 (244,288 sq mi) and have an estimated total population of over 68.6 million as of January 2025. France is a semi-presidential rep

Now we want to define a function, which for a given list of integers, will give us back the corresponding text. 

In [229]:
merges = train_tokenizer(tokens, 276)[1]
merges

Merging (101, 32) into the new token 256
Merging (116, 104) into the new token 257
Merging (97, 110) into the new token 258
Merging (100, 32) into the new token 259
Merging (115, 32) into the new token 260
Merging (257, 256) into the new token 261
Merging (105, 110) into the new token 262
Merging (32, 261) into the new token 263
Merging (111, 110) into the new token 264
Merging (114, 101) into the new token 265
Merging (101, 114) into the new token 266
Merging (111, 102) into the new token 267
Merging (101, 259) into the new token 268
Merging (116, 32) into the new token 269
Merging (114, 258) into the new token 270
Merging (258, 259) into the new token 271
Merging (44, 32) into the new token 272
Merging (97, 114) into the new token 273
Merging (111, 114) into the new token 274
Merging (105, 99) into the new token 275
Compression ratio 1.3292210933720368


{(101, 32): 256,
 (116, 104): 257,
 (97, 110): 258,
 (100, 32): 259,
 (115, 32): 260,
 (257, 256): 261,
 (105, 110): 262,
 (32, 261): 263,
 (111, 110): 264,
 (114, 101): 265,
 (101, 114): 266,
 (111, 102): 267,
 (101, 259): 268,
 (116, 32): 269,
 (114, 258): 270,
 (258, 259): 271,
 (44, 32): 272,
 (97, 114): 273,
 (111, 114): 274,
 (105, 99): 275}

In [230]:
vocab = {idx : bytes([idx]) for idx in range(256)}
vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'

In [231]:
for (token0, token1), idx in merges.items():
    vocab[idx]= vocab[token0] + vocab[token1]
vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'

In [232]:
def decode(tokens, vocab):
    text = b"".join(vocab[token] for token in tokens)
    return text.decode("utf-8", errors="replace") #the replace part ensures that instead of raising an exception for a problematic token we rather replace it for more robustness


In [233]:
decode([128, 97, 256], vocab)


'�ae '

# Encoding

In [234]:
merges

{(101, 32): 256,
 (116, 104): 257,
 (97, 110): 258,
 (100, 32): 259,
 (115, 32): 260,
 (257, 256): 261,
 (105, 110): 262,
 (32, 261): 263,
 (111, 110): 264,
 (114, 101): 265,
 (101, 114): 266,
 (111, 102): 267,
 (101, 259): 268,
 (116, 32): 269,
 (114, 258): 270,
 (258, 259): 271,
 (44, 32): 272,
 (97, 114): 273,
 (111, 114): 274,
 (105, 99): 275}

In [240]:
def encode(text):
    # given a string, output a list of integers correponding to our tokens
    byte_text = list(text.encode("utf-8"))
    while len(byte_text)>=2:
        stats = find_pair_stats(byte_text)[1]
        pair = min(stats, key=lambda p:merges.get(p, float("inf")))
        if pair not in merges:
            break
        idx = merges[pair]
        byte_text = merge(byte_text, pair, idx)
    return byte_text
    

In [247]:
print(decode(encode("France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe."), vocab))

France,[IX] officially the French Republic,[X] is a country located primarily in Western Europe.


In [248]:
validation_text = "Le Cachemire (variante orthographique rare : Kashmir) est une région montagneuse du sous-continent indien. On désigne sous ce vocable, depuis la partition des Indes et la disparition de l'État princier du Jammu-et-Cachemire, l'ensemble du territoire qui constituait ce dernier. Depuis le déclenchement de la première guerre indo-pakistanaise en 1947, le Cachemire est de facto partagé entre l'Inde, le Pakistan et la Chine qui administrent le territoire du Jammu-et-Cachemire et du Ladakh pour l'Inde, les territoires de l'Azad Cachemire et du Gilgit-Baltistan pour le Pakistan ainsi que la région de l'Aksai Chin et la vallée de Shaksgam pour la Chine. L'Inde continue de réclamer l'intégralité du Cachemire historique, à savoir l'Aksai Chin, la vallée de Shaksgam, le Gilgit-Baltistan et l'Azad Cachemire en plus des territoires qu'elle contrôle déjà. Le Pakistan revendique le Jammu-et-Cachemire contrôlé par l'Inde. La Chine contrôle l'intégralité des territoires qu'elle revendique dans cette région, à savoir l'Aksai Chin et la vallée de Shaksgam. La souveraineté sur ces deux territoires chinois est reconnue par le Pakistan, mais pas par l'Inde. Des mouvements séparatistes continuent par ailleurs à prôner le rétablissement de l'indépendance du Cachemire."
validation_text2 = decode(encode(validation_text), vocab)
print(validation_text == validation_text2)

True


# Forced splits using Regex Patterns (GPT Series)

The motivation is not to merge semantics with different punctuation. For example we don't want to have different tokens for "dog.", "dog!" or "dog?". 