# Tokenization Project

This project implements a tokenizer based on the byte pair encoding (BPE) algorithm, for study purposes.
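As background, the core BPE training loop can be sketched in a few lines: count adjacent token pairs, replace the most frequent pair with a new token id, and repeat until the target vocabulary size is reached. This is only a minimal illustration. The names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical and are not taken from tokenizer.py, whose implementation may differ.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn up to (vocab_size - 256) merges from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        counts = count_pairs(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = counts.most_common(1)[0][0]
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


# Toy run: three merges on a short, repetitive string
merges_demo, ids_demo = train_bpe("aaabdaaabac", 259)
print(merges_demo)
print(ids_demo)
```

In this toy run the pair (97, 97), that is "aa", is the most frequent and receives id 256. Each later merge can build on earlier ones, which is why pairs in a merges table may contain ids above 255.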

To build the tokenizer's vocabulary, we first import the basic routines of our tokenizer module and then use the .json files that were collected as the corpus.

In [1]:
import tokenizer as t

With all the functions available, the vocabulary can now be built. The code below is based on the example left in tokenizer.py.

In [23]:
import os, json

path = 'corpus/'
files = [file for file in os.listdir(path) if file.endswith('.json')]

merges = {}

# Train on every JSON file in the corpus, carrying the merges forward
for name in files:
  # Read the "text" field of the JSON file
  with open(os.path.join(path, name), "r", encoding="utf-8") as f:
    data = json.load(f)["text"]

  # Build the vocabulary from the raw UTF-8 bytes of the text
  tokens = list(data.encode("utf-8"))
  if merges:
    # Pass an offset derived from the highest token id so far, plus the existing merges
    merge, ids = t.makeVocab(261, tokens, merges[list(merges)[-1]] - 255, merges)
  else:
    merge, ids = t.makeVocab(261, tokens)

  # Keep the dict ordered by token id
  merges = dict(sorted({**merge, **merges}.items(), key=lambda x: x[1]))

print(merges)


{(97, 32): 256, (101, 32): 257, (111, 32): 258, (115, 32): 259, (110, 116): 260, (100, 257): 261, (42, 32): 262, (44, 32): 263, (101, 260): 264, (195, 163): 265, (101, 115): 266, (97, 259): 267, (265, 258): 268, (109, 32): 269, (97, 114): 270, (111, 259): 271, (100, 256): 272, (111, 114): 273, (100, 258): 274, (105, 99): 275, (114, 105): 277, (116, 101): 278, (97, 108): 279, (195, 161): 280, (83, 111): 284, (101, 110): 289, (32, 261): 292, (97, 116): 294, (116, 105): 297, (101, 114): 301, (105, 110): 302, (32, 112): 304, (97, 115): 305, (32, 111): 306, (32, 42): 307, (32, 49): 309, (100, 101): 310, (48, 49): 311, (305, 32): 314, (32, 100): 315, (315, 101): 319, (97, 110): 324, (324, 116): 327, (97, 103): 328, (83, 327): 329, (97, 100): 332, (307, 262): 335, (32, 262): 336, (105, 114): 339, (105, 97): 340, (256, 100): 345, (114, 97): 350, (114, 101): 355, (46, 32): 357, (115, 116): 361, (111, 110): 362, (195, 169): 367, (100, 97): 369, (100, 111): 372, (32, 67): 373, (99, 97): 374, (103
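To read the printed merges, it can help to map a pair of byte values back to text. The short sketch below decodes a few of the base pairs listed above; it only applies to pairs whose two ids are both below 256, since higher ids refer to earlier merges rather than raw bytes. The variable names are illustrative.

```python
# A few pairs from the printed merges whose members are raw byte values
pairs = [(97, 32), (101, 32), (195, 163), (195, 161), (195, 169)]

# bytes() turns the two integers into a byte string, which is then read as UTF-8
decoded = [bytes(pair).decode("utf-8") for pair in pairs]

for pair, piece in zip(pairs, decoded):
    print(pair, repr(piece))
```

The pairs starting with 195 are two-byte UTF-8 encodings of accented letters (ã, á, é), which are common in a Portuguese corpus. This is also why the byte count of a text can exceed its character count.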

With the resulting vocabulary, any text can be encoded and decoded using the functions already implemented. An example follows:

In [24]:
# Take the string to be tokenized
text = "Some random string to be used as an example to be tokenized. To have the best use of the tokenizer and have a good result, it is important to have some long enough text to find some pattern to be followed. This example is not good but it is some start to understand the proposal and the ideas behind the code being done.\nExpanding the text!\nOn the first try, this text was long, but the number of tokens was exactly equal to the number of Unicode bytes, for this reason was increased a little bit the size of the text (probably will not change the equality of the sizes but will provide a better result in the end)."

tokens = list(map(int, text.encode("utf-8")))
# Build the id -> bytes table, starting from the 256 single-byte tokens
vocab = {idx: bytes([idx]) for idx in range(256)}
# This only works because merges is ordered by increasing token id,
# so both halves of a pair are already in vocab when the pair is reached
for (p0, p1), idx in merges.items():
  vocab[idx] = vocab[p0] + vocab[p1]

# Results
print("Text length: ", len(text))
print("Initial tokens length: ", len(tokens))
print("Final tokens length: ", len(t.encoder(text, merges)))
print(f"Compression rate: {len(tokens)/len(t.encoder(text, merges)):.2f}X")

print(t.decoder(t.encoder(text, merges), vocab))

Text length:  615
Initial tokens length:  615
Final tokens length:  371
Compression rate: 1.66X
Some random string to be used as an example to be tokenized. To have the best use of the tokenizer and have a good result, it is important to have some long enough text to find some pattern to be followed. This example is not good but it is some start to understand the proposal and the ideas behind the code being done.
Expanding the text!
On the first try, this text was long, but the number of tokens was exactly equal to the number of Unicode bytes, for this reason was increased a little bit the size of the text (probably will not change the equality of the sizes but will provide a better result in the end).
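The `t.encoder` and `t.decoder` functions used above live in tokenizer.py, which is not shown here. As a rough sketch of how such functions could work, assuming a merges dict and a vocab table shaped like the ones above, encoding repeatedly applies the earliest-learned merge present in the sequence, and decoding concatenates the bytes of each token. The names `encode`, `decode`, `build_vocab`, and `toy_merges` are hypothetical.

```python
def build_vocab(merges):
    """Map each token id to its bytes; merges must be ordered by increasing id."""
    vocab = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    return vocab


def encode(text, merges):
    """Turn text into token ids by applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair with the smallest merge id, i.e. the one learned first
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no known pair left in the sequence
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, vocab):
    """Concatenate the bytes of every token and read the result as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Toy merges: "h"+"i" -> 256, then (256)+"!" -> 257
toy_merges = {(104, 105): 256, (256, 33): 257}
toy_vocab = build_vocab(toy_merges)
toy_ids = encode("hi hi!", toy_merges)
print(toy_ids)
print(decode(toy_ids, toy_vocab))
```

Decoding with `errors="replace"` is a design choice for this sketch: an arbitrary id sequence may not form valid UTF-8, and replacing bad bytes avoids an exception at the cost of a lossy result.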


Note: The training vocabulary size of 261 (5 more than the 256 possible byte values) was chosen to keep the final vocabulary small while still achieving a good compression ratio. Sizes of 266 and 276 were also tested, giving compression ratios of 1.86 and 2.09, respectively.