Corpus Tokenizer
--------------------------------

In this part we build a tokenizer from scratch. We already identified the minimum frequencies, but filtering on them would remove the most important words from the sentences. Instead, we create a custom BPE (Byte-Pair Encoding) tokenizer, which does not require normalizing the tokens. The tokenizer is trained and saved so that it can serve as the tokenizer of the GPT-2 model used later in the training step.

To understand how the BPE tokenizer works, see the following tutorial [BPE_tokenizer](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#:~:text=Byte%2DPair%20Encoding%20(BPE),HuggingFace).
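As a rough illustration of the idea behind BPE training, the sketch below performs a single merge step on a toy word-frequency table. It is a simplified pure-Python version, not the implementation used by the `tokenizers` library; the function names and the toy data are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# toy corpus (hypothetical): each word is split into characters, with its frequency
toy_words = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

first_pair = most_frequent_pair(toy_words)      # pair merged first
toy_words_merged = merge_pair(toy_words, first_pair)
second_pair = most_frequent_pair(toy_words_merged)
```

A real trainer repeats this merge step until the requested vocabulary size is reached or no pair is left to merge.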

Training the tokenizer requires the following steps:

- Create a batch generator that yields batches of sentences
- Load the BPE tokenizer
- Configure the pre-tokenizer
- Initialize the trainer: a maximum vocabulary size of `20000` and the special tokens `'<|endoftext|>'` (marks the beginning and the end of a text), `'<|translateto|>'` (separates the French sentence from the Wolof sentence) and `'<|pad|>'` (pads sequences to a common length)
- Train the tokenizer
- Initialize the decoder: `ByteLevel` decoder
- Initialize the post-processor: `ByteLevel` post-processing for the GPT-2 tokenizer
- Save the tokenizer locally
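The special tokens listed above frame each training example. Here is a minimal sketch of how a French/Wolof pair might be laid out before encoding, following the layout of the encoding example later in this section. The helper name `format_pair` and the constant names are hypothetical and do not appear in the original notebook.

```python
# special tokens, as passed to the trainer in this section
EOS_TOKEN = "<|endoftext|>"
SEP_TOKEN = "<|translateto|>"


def format_pair(french, wolof):
    """Wrap a sentence pair with the boundary and separator tokens."""
    return f"{EOS_TOKEN}{french}{SEP_TOKEN}{wolof}{EOS_TOKEN}"


example = format_pair("Je suis ici.", "Magui fi.")
print(example)
```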

Let us import the necessary libraries.

In [16]:
# for creating the tokenizer
from tokenizers import (
    decoders,
    models,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# for importing and manipulating the sentences
import pandas as pd

#### Load dataset and create generator

We will create a single tokenizer for both the French and Wolof corpora, so we concatenate each French sentence with its Wolof translation on the same line.

In [17]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/sent_extraction.csv")

# initialize a batch size
BATCH_SIZE = 50

# create the batch generator (for both corpora)
def generate_sentences():
    
    # concatenate each French sentence with its Wolof translation
    sentences["corpora"] = sentences["french_corpus"] + " " + sentences["wolof_corpus"]
    
    sents = sentences["corpora"].to_list()
    
    # start at 0 so that the first sentence is not skipped
    for i in range(0, len(sents), BATCH_SIZE):
        
        yield sents[i:i+BATCH_SIZE]
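To see how such a batch generator slices a list, the following sketch applies the same pattern to a small toy list. It starts at index 0 so that no item is skipped; the helper name `iter_batches` and the toy data are assumptions made for this example.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# toy data (hypothetical): seven placeholder sentences
toy_sentences = [f"s{i}" for i in range(7)]

batches = list(iter_batches(toy_sentences, 3))
print(batches)
```

The last batch may be shorter than the others, which is fine for `train_from_iterator` since it only consumes the strings.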

#### Load the BPE Tokenizer

In [18]:
tokenizer = Tokenizer(models.BPE())

#### Configure the pre-tokenizer

In [19]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

#### Initialize the BPE Trainer

In [20]:
trainer = trainers.BpeTrainer(vocab_size = 20000, special_tokens = ["<|endoftext|>", "<|translateto|>", "<|pad|>"])

#### Train the tokenizer from the iterator

In [21]:
tokenizer.train_from_iterator(generate_sentences(), trainer)

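# pad with the trained '<|pad|>' token (the defaults are '[PAD]' with id 0)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("<|pad|>"), pad_token="<|pad|>")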

Let us print the vocabulary size.

In [22]:
tokenizer.get_vocab_size()

15688

#### Initialize the decoder

In [23]:
tokenizer.decoder = decoders.ByteLevel()

#### Initialize the post-processor

In [24]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

#### Save the tokenizer

In [25]:
tokenizer.save("wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json")

Let us try an example with the sentence "Je suis ici.", translated into Wolof as "Magui fi."

In [26]:
encoding = tokenizer.encode("<|endoftext|>Je suis ici.<|translateto|>Magui fi.<|endoftext|>")

print("Tokens:")
encoding.tokens

Tokens:


['<|endoftext|>',
 'Je',
 'Ġsuis',
 'Ġici',
 '.',
 '<|translateto|>',
 'M',
 'ag',
 'ui',
 'Ġfi',
 '.',
 '<|endoftext|>']
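The `Ġ` prefix in the tokens above comes from the byte-level pre-tokenizer, which represents every byte as a printable character. The sketch below reproduces the byte-to-character table of the GPT-2 reference implementation in pure Python. It assumes the `ByteLevel` pre-tokenizer follows the same table, and the helper name `to_byte_level` is hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # bytes that are already printable keep their own character
    printable = (
        list(range(0x21, 0x7F))     # '!' to '~'
        + list(range(0xA1, 0xAD))   # '¡' to '¬'
        + list(range(0xAE, 0x100))  # '®' to 'ÿ'
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # remaining bytes (space, control bytes, ...) are shifted above 255
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Rewrite `text` with one printable character per UTF-8 byte."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" suis"))
```

Under this table the space byte maps to `Ġ` (U+0120), which is why words that follow a space start with that character.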

To use the tokenizer when training the GPT-2 model, we wrap it in the `PreTrainedTokenizerFast` class.

In [27]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|pad|>"
)

In [28]:
len(wrapped_tokenizer)

15688

In [29]:
wrapped_tokenizer("<|endoftext|>Bonjour. Je suis ici.<|translateto|>Asalamu-aleykum. Magui fi.<|endoftext|>", max_length=30, padding='max_length')

{'input_ids': [0, 26, 117, 5418, 10, 1321, 1146, 4706, 10, 1, 25, 69, 3675, 71, 9, 9919, 61, 271, 10, 3777, 712, 657, 10, 0, 2, 2, 2, 2, 2, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
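The relation between `input_ids` and `attention_mask` in the output above can be reproduced by hand. The sketch below is a simplified pure-Python illustration of right-padding to a fixed length, not the `transformers` implementation; the helper name `pad_to_max_length` is hypothetical, and the ids are copied from the output above, where `2` is the id of `<|pad|>`.

```python
def pad_to_max_length(input_ids, max_length, pad_id):
    """Right-pad `input_ids` to `max_length` and build the matching mask."""
    n_pad = max(0, max_length - len(input_ids))
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# the 24 non-padding ids from the output above
ids = [0, 26, 117, 5418, 10, 1321, 1146, 4706, 10, 1, 25, 69,
       3675, 71, 9, 9919, 61, 271, 10, 3777, 712, 657, 10, 0]

padded = pad_to_max_length(ids, max_length=30, pad_id=2)
print(padded["attention_mask"])
```

The mask lets the model ignore the six padded positions when computing attention.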