Creating a Unigram Tokenizer on the new sentences (without considering the definitions)
----------------------------------
We added sentences obtained from `omniglot` to Diagne's sentences. To test how relevant the new sentences are to our translation task, let us create a tokenizer for them. We will train the T5 model with it and see whether we obtain better performance. We will also use that tokenizer with the GPT2 model.

The process is almost the same as in [processing_4](text_processing4.ipynb), except that we will create another custom dataset for the custom transformer model.

Let us import the necessary libraries.

In [1]:
# for creating the tokenizer
from tokenizers import (
    decoders,
    models,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    normalizers
)

# for importing and manipulating the sentences
import pandas as pd
import random

# for plotting the box plot of the sequence lengths
import plotly.express as px

# for loading sentences with the custom dataset
from torch.utils.data import DataLoader



#### Load dataset and create generator

We will create a single tokenizer for both the French and Wolof corpora because the `T5` model uses only one embedding layer, shared by the encoder and the decoder. So we must create one generator that covers both corpora.

In [2]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/ad_sentences.csv")

# initialize a batch size
BATCH_SIZE = 400

# create generators (for the corpora)
def generate_sents():
    
    # retrieve the sentences
    french_sents = sentences['french'].to_list() 
    
    wolof_sents = sentences['wolof'].to_list() 
    
    sents = french_sents + wolof_sents
    
    # start at index 0 so that the first sentence is not skipped
    for i in range(0, len(sents), BATCH_SIZE):
        
        yield sents[i:i+BATCH_SIZE]
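
As a quick sanity check of the batching logic, here is a minimal, self-contained sketch that uses a small made-up list in place of the real corpus. The names `batch_iterator` and `toy_sents` are hypothetical and only serve the illustration. It shows that slicing from index 0 in steps of the batch size yields every sentence exactly once.

```python
from typing import Iterator, List


def batch_iterator(sents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `sents` holding at most `batch_size` items."""
    for start in range(0, len(sents), batch_size):
        yield sents[start:start + batch_size]


# toy data standing in for the French + Wolof sentences (assumption)
toy_sents = [f"sentence {i}" for i in range(10)]

batches = list(batch_iterator(toy_sents, batch_size=4))

# the last batch is shorter when the corpus size is not a multiple of the batch size
print([len(batch) for batch in batches])  # [4, 4, 2]
```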

#### Initialize the tokenizer

In [3]:
tokenizer = Tokenizer(models.Unigram())

#### Add normalizer

In [4]:
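from tokenizers import Regex

# a plain string pattern is matched literally, so wrap it in Regex to collapse repeated spaces
tokenizer.normalizer = normalizers.Replace(Regex(" {2,}"), " ")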
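
The normalizer is meant to collapse any run of two or more spaces into a single space. The same rule can be sketched with Python's standard `re` module. This is only an illustration of the intended effect, not the `tokenizers` implementation, and the helper name `collapse_spaces` is hypothetical.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


print(collapse_spaces("Fais  sortir   tout cheval"))  # Fais sortir tout cheval
```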

#### Configure the pre-tokenizers

We will use the Metaspace pre-tokenizer, which splits the text on the spaces between words and replaces each space with a marker character (by default "▁", U+2581, which looks like an underscore).

In [5]:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
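
To make the effect of this step concrete, here is a rough pure-Python imitation of what a Metaspace-style split produces on a simple sentence. It is a sketch under simplifying assumptions (single spaces only, a marker always prepended to the first word) and not the library's actual code. The helper name `metaspace_like_split` is hypothetical.

```python
from typing import List

META = "\u2581"  # the "▁" character used in place of spaces


def metaspace_like_split(text: str) -> List[str]:
    """Mark the start of each word with "▁" and split the text into words."""
    marked = META + text.replace(" ", META)
    # drop the empty pieces produced by the leading marker
    return [META + word for word in marked.split(META) if word]


print(metaspace_like_split("Fais sortir tout cheval"))
# ['▁Fais', '▁sortir', '▁tout', '▁cheval']
```

Because the marker stays attached to the word that follows it, the original spacing can be restored at decoding time by turning each marker back into a space.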

#### Initialize the trainers

We will provide all the necessary special tokens to the trainer.

**Notice that a sentence can be made of several groups of words separated by ending marks, and not only of one group of words. We can, for example, tokenize the following sequences**: `<sep>sentence1.sentence2.sentence3<cls>` **or** `<sep>sentence1.<sep>sentence2.<cls>`. **However, the second sequence is composed of two separate groups, so its two sentences will have different type ids.**

In [6]:
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]

In [7]:
trainer = trainers.UnigramTrainer(special_tokens=special_tokens, unk_token = "<unk>") # let us take the default vocab size

#### Train the tokenizer

In [8]:
tokenizer.train_from_iterator(generate_sents(), trainer)

Let us print the vocab size.

In [9]:
print(f"Number of tokens: {tokenizer.get_vocab_size()}")

Number of tokens: 3677


#### Initialize the post-processor

We may not need the `TemplateProcessing` post-processor to train on our corpora with a sequence-to-sequence model, but we will add it to our tokenizer anyway, since it can be useful for another type of model.

In [10]:
# let us retrieve the sep and cls ids
cls_token_id = tokenizer.token_to_id("<cls>")

sep_token_id = tokenizer.token_to_id("<sep>")

print(cls_token_id, sep_token_id)

0 1


In [11]:
# Initialize the post processor
tokenizer.post_process = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)]
)
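
The two templates above can be read as follows: each `$A` or `$B` sequence is followed by `<sep>` with the same type id, and a final `<cls>` receives type id 2. The sketch below mimics that reading in plain Python to show the tokens and type ids one would expect. It is an illustration only, not the `tokenizers` implementation, and the helper name `apply_template` is hypothetical.

```python
from typing import List, Optional, Tuple


def apply_template(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Mimic the single / pair templates: A <sep> [B <sep>] <cls>."""
    tokens = tokens_a + ["<sep>"]
    type_ids = [0] * (len(tokens_a) + 1)
    if tokens_b is not None:
        tokens += tokens_b + ["<sep>"]
        type_ids += [1] * (len(tokens_b) + 1)
    # the closing <cls> token gets its own type id (2) in both templates
    tokens.append("<cls>")
    type_ids.append(2)
    return tokens, type_ids


tokens, type_ids = apply_template(["▁tout", "▁cheval"], ["▁xam", "▁ne"])
print(tokens)    # ['▁tout', '▁cheval', '<sep>', '▁xam', '▁ne', '<sep>', '<cls>']
print(type_ids)  # [0, 0, 0, 1, 1, 1, 2]
```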

#### Initialize the decoder

In [12]:
tokenizer.decoder = decoders.Metaspace()

#### Save the tokenizer

In [13]:
tokenizer.save("wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v3_2.json")

#### Make a little example

Let us pick random sentences from the corpora and tokenize them.

**Notice that for the `T5` model we only need to add an eos_token at the end of each sentence. For the `GPT2` model we need to add a bos_token at the beginning of the French sentence (or the Wolof sentence for Wolof-to-French translation) and separate it from the Wolof sentence (or the French sentence for Wolof-to-French translation) with the sep_token. `GPT2` consists only of a decoder, so the French and Wolof sentences must be concatenated into one sequence and separated by a special token. To generate a translation, we pass the sentence to translate and the GPT2 model produces the concatenation. We then retrieve the second part of the concatenation, which is the translation.**
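
The GPT2 format described above, and the step that recovers the translation from the generated text, can be sketched with plain string operations. This is a minimal illustration assuming the special tokens appear verbatim in the decoded text. The helper names `build_gpt2_example` and `extract_translation` are hypothetical, and the strings are placeholders rather than real corpus sentences.

```python
def build_gpt2_example(source: str, target: str) -> str:
    """Concatenate source and target as <s>source<sep>target</s>."""
    return f"<s>{source}<sep>{target}</s>"


def extract_translation(generated: str) -> str:
    """Keep only what lies between the first <sep> and the first </s> after it."""
    _, _, after_sep = generated.partition("<sep>")
    return after_sep.split("</s>", 1)[0].strip()


example = build_gpt2_example("french sentence", "wolof sentence")
print(example)                       # <s>french sentence<sep>wolof sentence</s>
print(extract_translation(example))  # wolof sentence
```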

In [37]:
random.seed(200)

french_sentence = random.choice(sentences['french']) 

wolof_sentence = random.choice(sentences['wolof']) 

# For the T5
french_sentence_t5 = french_sentence + "</s>"
wolof_sentence_t5 = wolof_sentence + "</s>"

# For the GPT2 (only example for french to wolof translation)
fr_sentence_gpt2 = "<s>" + french_sentence + "<sep>"
wf_sentence_gpt2 = wolof_sentence + "</s>"

In [41]:
# print the french sentence
french_sentence

'Fais sortir tout cheval que tu vois !'

In [42]:
# print the wolof sentence
wolof_sentence

'Nataal bii de nataal la boo xam ne verre la, verre bi nag mi ngi nekk weer weer yoo xam ne dañoo tegalante, verre yi nag benn bokku ceek moroomam waaw. Verre bi am na affaire bu ko tée boo xam ne mi ngi ci suuf, waaw affaire bu ko tée.'

In [44]:
french_encoding_t5 = tokenizer.encode(french_sentence_t5)

print("French tokens t5")
print(french_encoding_t5.tokens)

print("French ids t5")
print(french_encoding_t5.ids)

French tokens t5
['▁Fa', 'is', '▁sortir', '▁tout', '▁cheval', '▁que', '▁tu', '▁vois', '▁!', '</s>']
French ids t5
[268, 301, 833, 144, 640, 27, 54, 150, 25, 6]


In [45]:
wolof_encoding_t5 = tokenizer.encode(wolof_sentence_t5)

print("Wolof tokens t5")
print(wolof_encoding_t5.tokens)

print("Wolof ids t5")
print(wolof_encoding_t5.ids)

Wolof tokens t5
['▁Nataal', '▁bii', '▁de', '▁nataal', '▁la', '▁boo', '▁xam', '▁ne', '▁verre', '▁la', ',', '▁verre', '▁bi', '▁nag', '▁mi', '▁ngi', '▁nekk', '▁weer', '▁weer', '▁yoo', '▁xam', '▁ne', '▁dañ', 'oo', '▁teg', 'alante', ',', '▁verre', '▁yi', '▁nag', '▁benn', '▁bokku', '▁c', 'eek', '▁moroom', 'am', '▁waaw', '.', '▁Ver', 're', '▁bi', '▁am', '▁na', '▁affaire', '▁bu', '▁ko', '▁tée', '▁boo', '▁xam', '▁ne', '▁mi', '▁ngi', '▁ci', '▁suuf', ',', '▁waaw', '▁affaire', '▁bu', '▁ko', '▁tée', '.', '</s>']
Wolof ids t5
[121, 91, 14, 125, 13, 114, 77, 28, 394, 13, 9, 394, 38, 66, 113, 71, 89, 612, 612, 308, 77, 28, 266, 241, 280, 2582, 9, 394, 73, 66, 100, 2108, 78, 2651, 2326, 160, 259, 8, 1422, 267, 38, 47, 22, 1223, 56, 74, 2418, 114, 77, 28, 113, 71, 17, 434, 9, 259, 1223, 56, 74, 2418, 8, 6]


In [49]:
fr_encoding_gpt2 = tokenizer.encode(fr_sentence_gpt2)

print("French tokens gpt2")
print(fr_encoding_gpt2.tokens)

print("French ids gpt2")
print(fr_encoding_gpt2.ids)

French tokens gpt2
['<s>', '▁Fa', 'is', '▁sortir', '▁tout', '▁cheval', '▁que', '▁tu', '▁vois', '▁!', '<sep>']
French ids gpt2
[5, 268, 301, 833, 144, 640, 27, 54, 150, 25, 1]


In [50]:
wf_encoding_gpt2 = tokenizer.encode(wf_sentence_gpt2)

print("Wolof tokens gpt2")
print(wf_encoding_gpt2.tokens)

print("Wolof ids gpt2")
print(wf_encoding_gpt2.ids)

Wolof tokens gpt2
['▁Nataal', '▁bii', '▁de', '▁nataal', '▁la', '▁boo', '▁xam', '▁ne', '▁verre', '▁la', ',', '▁verre', '▁bi', '▁nag', '▁mi', '▁ngi', '▁nekk', '▁weer', '▁weer', '▁yoo', '▁xam', '▁ne', '▁dañ', 'oo', '▁teg', 'alante', ',', '▁verre', '▁yi', '▁nag', '▁benn', '▁bokku', '▁c', 'eek', '▁moroom', 'am', '▁waaw', '.', '▁Ver', 're', '▁bi', '▁am', '▁na', '▁affaire', '▁bu', '▁ko', '▁tée', '▁boo', '▁xam', '▁ne', '▁mi', '▁ngi', '▁ci', '▁suuf', ',', '▁waaw', '▁affaire', '▁bu', '▁ko', '▁tée', '.', '</s>']
Wolof ids gpt2
[121, 91, 14, 125, 13, 114, 77, 28, 394, 13, 9, 394, 38, 66, 113, 71, 89, 612, 612, 308, 77, 28, 266, 241, 280, 2582, 9, 394, 73, 66, 100, 2108, 78, 2651, 2326, 160, 259, 8, 1422, 267, 38, 47, 22, 1223, 56, 74, 2418, 114, 77, 28, 113, 71, 17, 434, 9, 259, 1223, 56, 74, 2418, 8, 6]


#### Creating the T5 custom dataset for the new sentences

There are two ways to use the tokenizer for fine-tuning a T5 model.

- We can use the `PreTrainedTokenizerFast` class for which we will provide the different special tokens.

In [51]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer1 = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

- Or pass the tokenizer directly to the `T5TokenizerFast` class.

In [52]:
from transformers import T5TokenizerFast

wrapped_tokenizer2 = T5TokenizerFast(
    tokenizer_object=tokenizer
)

Let us pass them the sentences that we used as examples.

In [54]:
fr_encoding_t5 = wrapped_tokenizer1(french_sentence_t5, max_length=15, padding='max_length', truncation=True)

fr_encoding_t5

{'input_ids': [3, 3, 3, 3, 3, 268, 301, 833, 144, 640, 27, 54, 150, 25, 6], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [55]:
wf_encoding_t5 = wrapped_tokenizer2(wolof_sentence_t5, max_length=15, padding='max_length', truncation=True)

wf_encoding_t5

{'input_ids': [121, 91, 14, 125, 13, 114, 77, 28, 394, 13, 9, 394, 38, 66, 113], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [56]:
fr_encoding_gpt2 = wrapped_tokenizer1(fr_sentence_gpt2, max_length=15, padding='max_length', truncation=True)

fr_encoding_gpt2

{'input_ids': [3, 3, 3, 3, 5, 268, 301, 833, 144, 640, 27, 54, 150, 25, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [57]:
wf_encoding_gpt2 = wrapped_tokenizer1(wf_sentence_gpt2, max_length=15, padding='max_length', truncation=True)

wf_encoding_gpt2

{'input_ids': [121, 91, 14, 125, 13, 114, 77, 28, 394, 13, 9, 394, 38, 66, 113], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Let us decode the concatenation of the French and Wolof encodings, as we would do for the GPT2 model.

In [60]:
wrapped_tokenizer1.decode(fr_encoding_t5.input_ids + wf_encoding_t5.input_ids, skip_special_tokens=True)

'Fais sortir tout cheval que tu vois! Nataal bii de nataal la boo xam ne verre la, verre bi nag mi'

Note that `T5TokenizerFast` pads on the right side of the sequence by default, while our `PreTrainedTokenizerFast` wrapper pads on the left side because we set `padding_side="left"`. The padding side can be changed in the settings. For the next steps, let us use `T5TokenizerFast` directly.
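
The difference between the two padding sides can be shown with a small pure-Python sketch. It only imitates the shape of the outputs above (`input_ids` and `attention_mask`) and is not the `transformers` implementation. The helper name `pad_ids` is hypothetical, and the pad id 3 is taken from the outputs shown earlier.

```python
from typing import Dict, List


def pad_ids(
    ids: List[int], max_length: int, pad_id: int, side: str = "right"
) -> Dict[str, List[int]]:
    """Truncate `ids` to `max_length`, then pad on the requested side."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    padding = [pad_id] * n_pad
    mask = [1] * len(ids)
    if side == "left":
        return {"input_ids": padding + ids, "attention_mask": [0] * n_pad + mask}
    return {"input_ids": ids + padding, "attention_mask": mask + [0] * n_pad}


left = pad_ids([268, 301, 833], max_length=5, pad_id=3, side="left")
right = pad_ids([268, 301, 833], max_length=5, pad_id=3, side="right")

print(left["input_ids"])   # [3, 3, 268, 301, 833]
print(right["input_ids"])  # [268, 301, 833, 3, 3]
```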

**Note that we can augment the sentences when generating them, as we did with the `GPT2` model.** See the [augmentation](text_augmentation.ipynb) notebook for a discussion of the augmentation method that we will use. For a clearer explanation of augmentation methods in NLP tasks and training, see the article [augment_or_not](https://direct.mit.edu/coli/article/48/1/5/108844/To-Augment-or-Not-to-Augment-A-Comparative-Study).

Before creating the custom dataset, let us check the max length that we can get from the tokenized corpora without considering the augmentation. To do that, we draw the box plot of the lengths and identify the range from which we will pick the max length of the sequences.

In [25]:
length = []

for sent in sentences['french'].to_list() + sentences['wolof'].to_list():
    
    len_ids = len(wrapped_tokenizer2(sent).input_ids)
    
    length.append(len_ids)

        

In [26]:
fig = px.box(x = length)

fig.update_layout({'xaxis': {'title': 'length'}})

The upper fence is **15** and the max length is **206**, so we will test values between the two.

With augmentation, however, sequences can become longer than the value we pick: augmentation modifies the words, so the tokenizer may recognize only parts of them and split them into several smaller tokens. To leave room for this, we will add one fifth of the max length to it.
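
The two numbers used in this reasoning can be sketched in plain Python: the upper fence of a box plot (taken here as the largest observation not above Q3 + 1.5 × IQR) and the max length increased by one fifth. This is a minimal sketch on made-up lengths. The helper names are hypothetical, and the quartile convention of `statistics.quantiles` is an assumption that may differ slightly from the one used by the plotting library.

```python
import math
import statistics
from typing import List


def upper_fence(lengths: List[int]) -> int:
    """Largest observed length that is not above Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    limit = q3 + 1.5 * (q3 - q1)
    return max(length for length in lengths if length <= limit)


def with_margin(max_length: int) -> int:
    """Add one fifth of `max_length` (rounded up) as room for augmentation."""
    return max_length + math.ceil(max_length / 5)


toy_lengths = [5, 7, 8, 9, 10, 11, 12, 40]  # made-up lengths, not the real corpus

print(upper_fence(toy_lengths))            # 12
print(with_margin(15), with_margin(206))   # 18 248
```

With the values reported above, a max length picked between 15 and 206 would therefore become a value between 18 and 248 once the margin is added.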

We will use the same custom datasets that were created in [create_tokenizer_for_all_sentences](creating_tokenizer_for_all_sentences_3.ipynb).