Building a Unigram and custom Dataset for training with `T5`
--------------------------------

Like we did in [processing_2](text_processing2.ipynb) to build a tokenizer for GPT-2 we will need to create one for the T5 model. We will train a Unigram Tokenizer with each of the French and Wolof corpus and finally a custom dataset to recuperate the tokenized sentences.

To understand how is working the Unigram tokenizer, see the following tutorial [Unigram_tokenizer](https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt).

The following steps will be necessary to achieve our task:

- Creating a batch generators for each corpus.
- Load the Unigram Tokenizer from the `tokenizers` library.
- Add a normalizer to the tokenizers: See the following link for explanation on the different type of normalizers [normalizer](https://unicode.org/reports/tr15/). But we will only need to remove too much space that will be find inside the sentences since we have already replace any type of weird signs in the corpora (see [extract_sentence](extract_sentences.ipynb) and [extract_text](text_extraction.ipynb)).
- Initialize the pre-tokenizer.
- Initialize the trainer: we will need to furnish the special tokens that will be used and the vocab size. Let us take, for the latter, 10000 tokens for each corpus.
- Train the tokenizers.
- Initialize the post-processor `TemplateProcessing`: we will define the types' ids.
- Initialize the decoder: `Metaspace`.
- Make a example with some sentences.
- Save the tokenizers
- Create the custom dataset for the T5 model.

Let us import the necessary libraries.

In [62]:
# for creating the tokenizer
from tokenizers import (
    decoders,
    models,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    normalizers
)

# for importing and manipulating the sentences
import pandas as pd
import random

# for loading sentences with the custom dataset
from torch.utils.data import DataLoader

#### Load dataset and create generator

We will create one tokenizer for each of the French and Wolof corpora. 

In [2]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/corpora_v3.csv")

# initialize a batch size
BATCH_SIZE = 60

# create generators (for the corpora)
def generate_french_sents():
    
    # recuperate the sentences
    french_sents = sentences['french_corpus'].to_list() 
    
    for i in range(1, len(french_sents), BATCH_SIZE):
        
        yield french_sents[i:i+BATCH_SIZE]
        
def generate_wolof_sents():
    
    # recuperate the sentences
    wolof_sents = sentences['wolof_corpus'].to_list() 
    
    for i in range(1, len(wolof_sents), BATCH_SIZE):
        
        yield wolof_sents[i:i+BATCH_SIZE]
        

#### Initialize the tokenizers

In [3]:
french_tokenizer = Tokenizer(models.Unigram())

wolof_tokenizer = Tokenizer(models.Unigram())

#### Add normalizer

In [4]:
# french_tokenizer.normalizer = normalizers.Replace(" {2,}", " ")

# wolof_tokenizer.normalizer = normalizers.Replace(" {2,}", " ")

#### Configure the pre-tokenizers

We will use the Metaspace pre-tokenizer.

In [5]:
french_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

wolof_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

#### Initialize the trainers

We will provide all of the necessary special tokens to the Trainer. 

**Notice that a sentence can be a groups of words separated by ending marks and not only one group of words. Then we can, for example, tokenize the following sentences**: `<sep>sentence1.sentence2.sentence3<cls>` **or** `<sep>sentence1.<sep>sentence2.<cls>`. **But, the second sentence is composed of two separated groups. Then the two sentences will have different type ids.** 

In [6]:
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]

In [7]:
french_trainer = trainers.UnigramTrainer(vocab_size=10000, special_tokens=special_tokens, unk_token = "<unk>")

wolof_trainer = trainers.UnigramTrainer(vocab_size=10000, special_tokens=special_tokens, unk_token = "<unk>")

#### Train the tokenizers

In [8]:
french_tokenizer.train_from_iterator(generate_french_sents(), french_trainer)

wolof_tokenizer.train_from_iterator(generate_wolof_sents(), wolof_trainer)

Let us print the vocab sizes.

In [9]:
# for the french corpus
print(f"Number of tokens in the french corpus: {french_tokenizer.get_vocab_size()}")

print(f"Number of tokens in the wolof corpus: {wolof_tokenizer.get_vocab_size()}")

Number of tokens in the french corpus: 4714
Number of tokens in the wolof corpus: 3336


#### Initialize the post-processor

We can not need the TemplateProcessor to train our corpora in a Sequence To Sequence model but we will add it in our tokenizer. We can use it for another type of model. 

In [13]:
# let us recuperate the sep and cls ids
cls_token_id = french_tokenizer.token_to_id("<cls>")

sep_token_id = french_tokenizer.token_to_id("<sep>")

print(cls_token_id, sep_token_id)

0 1


In [15]:
# Initialize the post processors
french_tokenizer.post_process = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)]
)

wolof_tokenizer.post_process = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)]
)

#### Initialize the decoders

In [16]:
french_tokenizer.decoder = decoders.Metaspace()

wolof_tokenizer.decoder = decoders.Metaspace()

#### Save the tokenizers

In [17]:
french_tokenizer.save("wolof-translate/wolof_translate/tokenizers/t5_tokenizers/fr_tokenizer_v1.json")

wolof_tokenizer.save("wolof-translate/wolof_translate/tokenizers/t5_tokenizers/wf_tokenizer_v1.json")

#### Make a little example

Let us recuperate random sentences from the corpora and tokenize them.

In [25]:
random.seed(50)

french_sentence = random.choice(sentences['french_corpus'])

wolof_sentence = random.choice(sentences['wolof_corpus'])

In [26]:
# print the french sentence
french_sentence

'Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.'

In [27]:
# print the wolof sentence
wolof_sentence

'Di noyyi xet gu bon gi ci ban bi'

In [32]:
french_encoding = french_tokenizer.encode(french_sentence)

print("French tokens")
print(french_encoding.tokens)

print("French ids")
print(french_encoding.ids)

French tokens
['▁Il', 's', '▁vo', 'nt', '▁de', '▁campement', '▁en', '▁campement', ',', '▁dans', '▁des', '▁villages', '▁d', 'ont', '▁mon', '▁père', '▁not', 'e', '▁les', '▁noms', '▁sur', '▁sa', '▁carte', '▁', ':', '▁Ni', 'kom', ',', '▁Ba', 'b', 'ungo', ',', '▁Nj', 'i', '▁Ni', 'kom', ',', '▁Lu', 'a', 'kom', '▁N', 'd', 'y', 'e', ',', '▁Ng', 'i', ',', '▁', 'Obukun', '.']
French ids
[49, 10, 674, 57, 9, 607, 23, 607, 7, 21, 17, 292, 59, 70, 29, 35, 558, 13, 15, 552, 39, 54, 347, 8, 71, 1632, 778, 7, 481, 729, 1580, 7, 2835, 45, 1632, 778, 7, 1898, 68, 778, 205, 90, 120, 13, 7, 1507, 45, 7, 8, 2123, 11]


In [33]:
wolof_encoding = wolof_tokenizer.encode(wolof_sentence)

print("Wolof tokens")
print(wolof_encoding.tokens)

print("Wolof ids")
print(wolof_encoding.ids)

Wolof tokens
['▁Di', '▁noyyi', '▁xet', '▁gu', '▁bon', '▁gi', '▁ci', '▁ban', '▁bi']
Wolof ids
[551, 980, 923, 82, 910, 48, 10, 505, 23]


#### Creating the T5 custom dataset

We have two possibilities to use the tokenizer for fine-tuning a T5 model. 

- We can use the `PreTrainedTokenizerFast` class for which we will provide the different special tokens.

In [35]:
from transformers import PreTrainedTokenizerFast

fr_wrapped_tokenizer1 = PreTrainedTokenizerFast(
    tokenizer_object=french_tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

wf_wrapped_tokenizer1 = PreTrainedTokenizerFast(
    tokenizer_object=wolof_tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)

- Or give directly the tokenizer to the `T5TokenizerFast` class.

In [36]:
from transformers import T5TokenizerFast

fr_wrapped_tokenizer2 = T5TokenizerFast(
    tokenizer_object=french_tokenizer
)

wf_wrapped_tokenizer2 = T5TokenizerFast(
    tokenizer_object=wolof_tokenizer
)

Let us give them the sentences that we use as example. 

In [56]:
fr_encoding = fr_wrapped_tokenizer1(french_sentence, max_length=60, padding='max_length', truncation=True)

fr_encoding

{'input_ids': [3, 3, 3, 3, 3, 3, 3, 3, 3, 49, 10, 674, 57, 9, 607, 23, 607, 7, 21, 17, 292, 59, 70, 29, 35, 558, 13, 15, 552, 39, 54, 347, 8, 71, 1632, 778, 7, 481, 729, 1580, 7, 2835, 45, 1632, 778, 7, 1898, 68, 778, 205, 90, 120, 13, 7, 1507, 45, 7, 8, 2123, 11], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [57]:
wf_encoding = wf_wrapped_tokenizer2(wolof_sentence, max_length=20, padding='max_length', truncation=True)

wf_encoding

{'input_ids': [551, 980, 923, 82, 910, 48, 10, 505, 23, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

Let us decode the wolof sentence.

In [59]:
wf_wrapped_tokenizer1.decode(wf_encoding.input_ids, skip_special_tokens=True)

'Di noyyi xet gu bon gi ci ban bi'

**Note that for the tokenization we could use the `text_target` argument to obtain the labels with padding but in our case we decided to use two different tokenizers. So it is impossible.** 


We can see that the `T5Tokenizer` add padding to the right side of the sequence while the `PretrainedTokenizer` add the padding to the left side. We can change the padding side from the settings. But, for the next steps of the `T5Tokenizer`, let us directly use the `T5Tokenizer`.

**Note that we can add augmentation when generating the sentences like we did when using the `GPT2 model`.** See the following notebook for discussion on the augmentation method that we will use [augmentation](text_augmentation.ipynb). And for a more clear explanation of the augmentation methods in NLP tasks and training look at the following article [augment_or_not](https://direct.mit.edu/coli/article/48/1/5/108844/To-Augment-or-Not-to-Augment-A-Comparative-Study).

Let us verify, before creating the custom dataset, the max lengths that we can get from the French corpus' tokens and the Wolof corpus' tokens without considering the augmentation.

In [51]:
fr_max_len = 0

for sent in sentences['french_corpus']:
    
    len_ids = len(fr_wrapped_tokenizer2(sent).input_ids)
    
    if len_ids > fr_max_len:
        
        fr_max_len = len_ids
        
wf_max_len = 0

for sent in sentences['wolof_corpus']:
    
    len_ids = len(wf_wrapped_tokenizer2(sent).input_ids)
    
    if len_ids > wf_max_len:
        
        wf_max_len = len_ids
        

In [52]:
# let us print the max lengths
fr_max_len, wf_max_len

(242, 282)

We find a maximum length of **242** tokens in the french corpus and **282** tokens in the wolof corpus. But considering the augmentation we can obtain more than 242 and 282 tokens because it will add modification on the words and then it can recognize only parts of them and so divide them in multiple other tokens. Let us add to the max lengths the fifth of them. 

In [53]:
fr_max_len += fr_max_len // 5

wf_max_len += wf_max_len // 5

fr_max_len, wf_max_len

(290, 338)

It is time to create our custom dataset.

Signature:
```python
class T5SentenceDataset(Dataset):

    def __init__(
        self,
        data_path: str, 
        tokenizer1: PreTrainedTokenizerFast,
        tokenizer2: Union[str, None] = None,
        corpus_1: str = "french_corpus",
        corpus_2: str = "wolof_corpus",
        cp1_max_len: int = 290,
        cp2_max_len: int = 338,
        cp1_truncation: bool = False,
        cp2_truncation: bool = False,
        file_sep: str = ",",
        cp1_transformer: Union[TransformerSequences, None] = None,
        cp2_transformer: Union[TransformerSequences, None] = None,
        **kwargs):

        pass
```

In [61]:
# %%writefile wolof-translate/wolof_translate/data/dataset_v2.py
from wolof_translate.utils.sent_transformers import TransformerSequences
from transformers import PreTrainedTokenizerFast
from torch.utils.data import Dataset
from typing import *
import pandas as pd
import torch
import re

class T5SentenceDataset(Dataset):

    def __init__(
        self,
        data_path: str, 
        tokenizer1: PreTrainedTokenizerFast,
        tokenizer2: Union[PreTrainedTokenizerFast, None] = None,
        corpus_1: str = "french_corpus",
        corpus_2: str = "wolof_corpus",
        cp1_max_len: int = 290,
        cp2_max_len: int = 338,
        cp1_truncation: bool = False,
        cp2_truncation: bool = False,
        file_sep: str = ",",
        cp1_transformer: Union[TransformerSequences, None] = None,
        cp2_transformer: Union[TransformerSequences, None] = None,
        **kwargs):
        
        # let us recuperate the data frame
        self.__sentences = pd.read_csv(data_path, sep=file_sep, **kwargs)
        
        # let us recuperate the tokenizers
        self.tokenizer1 = tokenizer1
        
        self.tokenizer2 = tokenizer2
        
        # recuperate the first corpus' sentences
        self.__sentences_1 = self.__sentences[corpus_1].to_list()
        
        # recuperate the second corpus' sentences
        self.__sentences_2 = self.__sentences[corpus_2].to_list()
        
        # recuperate the length
        self.__length = len(self.__sentences_1)
        
        # let us recuperate the max len
        self.cp1_max_len = cp1_max_len
        
        self.cp2_max_len = cp2_max_len
        
        # let us recuperate the truncation argument
        self.cp1_truncation = cp1_truncation
        
        self.cp2_truncation = cp2_truncation
        
        # let us initialize the transformer
        self.cp1_transformer = cp1_transformer
        
        self.cp2_transformer = cp2_transformer
        
    def __getitem__(self, index):
        """Recuperate ids and attention masks of sentences at index

        Args:
            index (int): The index of the sentences to recuperate

        Returns:
            tuple: The `sentence to translate' ids`, `the attention mask of the sentence to translate`
            `the labels' ids`, `the attention mask of the labels` 
        """
        sentence_1 = self.__sentences_1[index]
        
        sentence_2 = self.__sentences_2[index]
        
        # apply transformers if necessary
        if not self.cp1_transformer is None:
            
            sentence_1 = self.cp1_transformer(sentence_1) 
        
        if not self.cp2_transformer is None:
            
            sentence_2 = self.cp2_transformer(sentence_2)
        
        # let us encode the first sentence
        data = self.tokenizer1(
            sentence_1,
            truncation=self.cp1_truncation,
            max_length=self.cp1_max_len, 
            padding='max_length', 
            return_tensors="pt")
        
        # let us encode the second sentence
        if not self.tokenizer2 is None:
            
            labels = self.tokenizer2(
                sentence_2, 
                truncation=self.cp2_truncation,
                max_length=self.cp2_max_len, 
                padding='max_length', 
                return_tensors="pt")
            
        else:
            
            labels = self.tokenizer1(
                sentence_2, 
                truncation=self.cp2_truncation,
                max_length=self.cp2_max_len, 
                padding='max_length', 
                return_tensors="pt")
        
        return data.input_ids.squeeze(0),\
            data.attention_mask.squeeze(0),\
                labels.input_ids.squeeze(0),\
                    labels.attention_mask.squeeze(0) 
        
    def __len__(self):
        
        return self.__length
    
    def decode(self, labels: torch.Tensor):
        
        if labels.ndim < 2:
            
            labels = labels.unsqueeze(0)
        
        ids = labels.tolist()
        
        sentences = []
        
        for id in ids:
            
            sentence = self.tokenizer1.decode(id, skip_special_tokens=True)\
                if self.tokenizer2 is None \
                    else self.tokenizer2.decode(id, skip_special_tokens=True)

            sentences.append(sentence)

        return sentences

Let us generate some data with their masks and decode the labels.

**Note that we will use, when training the `T5 model`, train and test sets and not directly the full dataset**

In [63]:
# Initialize our custom dataset
dataset = T5SentenceDataset("data/extractions/new_data/corpora_v3.csv", fr_wrapped_tokenizer2, wf_wrapped_tokenizer2)

In [64]:
data, data_mask, labels, labels_mask = next(iter(DataLoader(dataset, 10))) # generate 10 sentences

Let us print the gotten data.

In [65]:
data

tensor([[ 358,  161, 1312,  ...,    3,    3,    3],
        [1594,  349,   48,  ...,    3,    3,    3],
        [ 126,  103,   40,  ...,    3,    3,    3],
        ...,
        [ 176,  293,    9,  ...,    3,    3,    3],
        [ 204,   52,  507,  ...,    3,    3,    3],
        [ 166, 4575,    7,  ...,    3,    3,    3]])

In [66]:
data_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

In [67]:
labels

tensor([[2461,  410, 2816,  ...,    3,    3,    3],
        [2385,  232,   45,  ...,    3,    3,    3],
        [ 127,  208,   44,  ...,    3,    3,    3],
        ...,
        [ 401, 2344,   76,  ...,    3,    3,    3],
        [1052,   47,  791,  ...,    3,    3,    3],
        [ 126,   10,  314,  ...,    3,    3,    3]])

In [68]:
labels_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

Let us decode the labels.

In [69]:
dataset.decode(labels)

['Doomu-aadama bu, ne ci ndey ak baay nga jóge.',
 'Mënunu leen a baň a gërëm ak a bëgg, doonte sax mën nanoo am xel ňaar ci ňoom.',
 'Waaye ňu ngi fi, ak seen xar-kanam, seen taxawaay, seen defin ak seen jikko, seeni njuumte, seeni yaakaar, seen melokaanu loxook baaraami tànk, seen meloy bët ak karaw, seen waxin, seeni xalaat, amaana sax at ma ňuy nar a génne àddina. Loolu lépp, day àgg fu sore ci nun.',
 'Bi ma delloo dëkk ba ma juddoo, dama faa meloon ni gan. Du kenn ku ma fa xam, safatul dara ci man. Li nu jóge Afrig jur ci man tiis wu réy. Su ma yaboo sax ni mënuma woon a nangu ni maak samay way-jur dëkkëtuñu Afrig. Ca laa tàmbalee gént ni sama yaay nit ku ñuul la, di sàkkal it sama bopp cosaan lu bees.',
 'Àddinay dox ba Baay tollu ci noppalug liggéey, dellusi Tugal dëkk ak ňun. Ci la ma leere ni moom moomoo doon doomu Afrig.',
 'Mu doon nag lu naqadee nangu.',
 'Damaa mujjoon a delloo sama xel démb ngir lijjanti lépp la ca léjoon.',
 'Kon fàttalikoo meññ téere bu ndaw bii.',
 'K