Creating a Sentence Tokenizer on Version 6 of the Corpora
----------------------------------
We obtained the sixth version of the corpora by adding new sentences to the sentences extracted from the book <i style="color: cyan">Grammaire de Wolof Moderne</i> by Pathé Diagne and to the original corpora.

The process is almost the same as in [processing_4](text_processing4.ipynb), except that we will create another custom dataset for the custom transformer model and use a box plot of the sequence lengths to identify a suitable range for the `max_len` parameter of that dataset.

Let us import the necessary libraries.

In [1]:
# # for creating the tokenizer
from tokenizers import (
    decoders,
#     SentencePieceBPETokenizer,
#     normalizers,
#     pre_tokenizers,
)
# from transformers import AutoTokenizer, PreTrainedTokenizerFast, T5TokenizerFast

import sentencepiece as spm

# for importing and manipulating the sentences
import pandas as pd
import random

# for plotting the box plot of the sequence lengths
import plotly.express as px

# for loading sentences with the custom dataset
from torch.utils.data import DataLoader



#### Load dataset and create generator

We will create a single tokenizer for both the French and Wolof corpora because the `T5` model shares one embedding layer between its encoder and decoder. We must therefore feed the sentences of both corpora to the tokenizer trainer together (here, through a single text file).

In [2]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/corpora_v6.csv")

# initialize a batch size
BATCH_SIZE = 400

# create generators (for the corpora)
# def generate_sents():
    
#     # recuperate the sentences
#     french_sents = sentences['french'].to_list() 
    
#     wolof_sents = sentences['wolof'].to_list() 
    
#     sents = french_sents + wolof_sents
    
#     for i in range(0, len(sents), BATCH_SIZE):
        
#         yield sents[i:i+BATCH_SIZE]

with open('sents.txt', 'w', encoding='utf-8') as f:
    for sent in sentences['french'].to_list() + sentences['wolof'].to_list():
        f.write(sent + '\n')
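
A batch generator like the commented-out one can be sketched as follows. This is only a minimal, self-contained illustration: the function name `generate_batches` and the example sentences are hypothetical, and the range starts at 0 so that the first sentence is not skipped.

```python
from typing import Iterator, List


def generate_batches(sents: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` sentences."""
    for i in range(0, len(sents), batch_size):
        yield sents[i:i + batch_size]


# hypothetical sentences standing in for the French and Wolof corpora
example_sents = [f"sentence {i}" for i in range(5)]

batches = list(generate_batches(example_sents, batch_size=2))
print(batches)
```

The last batch may be shorter than `batch_size` when the number of sentences is not a multiple of it.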

#### Initialize the tokenizer

In [3]:
# tokenizer = Tokenizer(models.Unigram())

#### Add normalizer

In [4]:
# tokenizer.normalizer = normalizers.Replace(" {2,}", " ")

#### Initialize the trainers

We will provide all of the necessary special tokens to the T5 Tokenizer (see [t5_tokenizer](_t5.ipynb)). 

In [5]:
special_tokens = ['<pad>', '</s>', '<unk>']

In [6]:
# trainer = trainers.UnigramTrainer(special_tokens=special_tokens, unk_token = "<unk>", vocab_size=20000) # let us take the default vocab size

#### Train the tokenizer

SentencePiece automatically normalizes the text; its default rule is based on `NFKC` Unicode normalization.

In [7]:
spm.SentencePieceTrainer.Train(input='sents.txt',
                               model_prefix='wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5',
                               vocab_size=8000,
                               character_coverage=1.0,
                               pad_id=0,                
                               eos_id=1,
                               unk_id=2,
                               bos_id=3,
                               pad_piece='<pad>',
                               eos_piece='</s>',
                               unk_piece='<unk>',
                               bos_piece='<s>',
                               )

Load the tokenizer.

In [8]:
tokenizer = spm.SentencePieceProcessor(model_file='wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model')

#### Make a little example

Let us pick random sentences from the corpora and tokenize them.

In [9]:
random.seed(200)

french_sentence = random.choice(sentences['french']) 

wolof_sentence = random.choice(sentences['wolof']) 


In [10]:
# print the french sentence
french_sentence

"Les fourmis, à Ogoja, étaient des insectes monstrueux de la variété exectoïde, qui creusaient leurs nids à dix mètres de profondeur sous la pelouse du jardin, où devaient vivre des centaines de milliers d'individus."

In [11]:
# print the wolof sentence
wolof_sentence

'A ngiy dóor dàqeek jànt bi'

In [12]:
french_encoding = tokenizer.Encode(french_sentence, add_eos=True)

print("French tokens")
print([tokenizer.IdToPiece(id) for id in french_encoding])

print("French ids")
print(french_encoding)

French tokens
['▁Les', '▁fourmis', ',', '▁à', '▁Ogoja', ',', '▁étai', 'ent', '▁des', '▁insectes', '▁m', 'onstru', 'eux', '▁de', '▁la', '▁vari', 'été', '▁ex', 'ecto', 'ï', 'de', ',', '▁qui', '▁creu', 'saient', '▁leur', 's', '▁ni', 'ds', '▁à', '▁di', 'x', '▁mètres', '▁de', '▁profondeur', '▁sous', '▁la', '▁pelouse', '▁du', '▁jardin', ',', '▁où', '▁de', 'vaient', '▁vivre', '▁des', '▁cent', 'aines', '▁de', '▁mill', 'iers', '▁d', "'", 'individus', '.', '</s>']
French ids
[164, 788, 4, 21, 568, 4, 507, 103, 23, 1495, 146, 7133, 379, 8, 9, 2617, 2200, 1587, 5061, 2357, 913, 4, 25, 6767, 4610, 131, 7, 31, 4322, 21, 39, 233, 3905, 8, 5270, 417, 9, 3600, 40, 877, 4, 132, 8, 575, 1740, 23, 4218, 4510, 8, 1672, 638, 24, 6, 6597, 5, 1]


In [13]:
wolof_encoding = tokenizer.Encode(wolof_sentence, add_eos=True)

print("Wolof tokens")
print([tokenizer.IdToPiece(id) for id in wolof_encoding])

print("Wolof ids")
print(wolof_encoding)

Wolof tokens
['▁A', '▁ngi', 'y', '▁dóor', '▁dàq', 'eek', '▁jà', 'nt', '▁bi', '</s>']
Wolof ids
[172, 62, 13, 658, 1936, 285, 4289, 129, 37, 1]
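
In the pieces above, the `▁` character (U+2581) marks the position of a space in the original text. The following minimal sketch rebuilds the Wolof sentence from its pieces by hand; the helper name `pieces_to_text` is hypothetical and only illustrates the convention, it is not how the library decodes internally.

```python
def pieces_to_text(pieces, special_tokens=("<pad>", "</s>", "<unk>")):
    """Join SentencePiece pieces back into a plain sentence."""
    # drop the special tokens, then turn the whitespace marker into a space
    kept = [piece for piece in pieces if piece not in special_tokens]
    return "".join(kept).replace("\u2581", " ").strip()


# pieces copied from the Wolof output above
wolof_pieces = ['▁A', '▁ngi', 'y', '▁dóor', '▁dàq', 'eek', '▁jà', 'nt', '▁bi', '</s>']

text = pieces_to_text(wolof_pieces)
print(text)
```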


#### Creating the T5 custom dataset for the new sentences

There are two ways to use the tokenizer when fine-tuning a T5 model.

- We can wrap it in the `PreTrainedTokenizerFast` class, providing the different special tokens explicitly.

In [14]:
# from transformers import PreTrainedTokenizerFast

# wrapped_tokenizer1 = PreTrainedTokenizerFast(
#     tokenizer_object=tokenizer,
#     bos_token="<s>",
#     eos_token="</s>",
#     unk_token="<unk>",
#     pad_token="<pad>",
#     cls_token="<cls>",
#     sep_token="<sep>",
#     mask_token="<mask>",
#     padding_side="left",
# )

- Or we can pass the tokenizer model file directly to the `T5TokenizerFast` class.

In [16]:
from transformers import T5TokenizerFast

wrapped_tokenizer1 = T5TokenizerFast(
    vocab_file='wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model'
)


Let us pass it the sentences that we used as examples.

In [17]:
wf_encoding = wrapped_tokenizer1(french_sentence, max_length=40, padding='max_length', truncation=True)

wf_encoding

{'input_ids': [164, 788, 4, 21, 568, 4, 507, 103, 23, 1495, 146, 7133, 379, 8, 9, 2617, 2200, 1587, 5061, 2357, 913, 4, 25, 6767, 4610, 131, 7, 31, 4322, 21, 39, 233, 3905, 8, 5270, 417, 9, 3600, 40, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
wf_encoding = wrapped_tokenizer1(wolof_sentence, max_length=40, padding='max_length', truncation=True)

wf_encoding

{'input_ids': [172, 62, 13, 658, 1936, 285, 4289, 129, 37, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

Let us decode the Wolof sentence.

In [19]:
wrapped_tokenizer1.decode(wf_encoding.input_ids, skip_special_tokens=True)

'A ngiy dóor dàqeek jànt bi</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

We can see that the `T5TokenizerFast` adds padding to the right side of the sequence, while the `PreTrainedTokenizerFast` wrapper configured above (with `padding_side="left"`) would add it to the left side. The padding side can be changed through that setting. For the next steps, however, let us use the `T5TokenizerFast` directly.
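
The truncation and padding behavior observed in the two encodings above can be sketched by hand. This is a minimal illustration under assumptions, not the tokenizer's actual implementation: the function name `pad_or_truncate` is hypothetical, and the ids `pad_id=0` and `eos_id=1` are those chosen when training the tokenizer.

```python
def pad_or_truncate(ids, max_length, pad_id=0, eos_id=1, padding_side="right"):
    """Return (input_ids, attention_mask), both of length `max_length`."""
    # keep room for the end-of-sentence token, then append it
    ids = ids[: max_length - 1] + [eos_id]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    if padding_side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    # otherwise pad on the left side
    return [pad_id] * n_pad + ids, [0] * n_pad + mask


# ids of the Wolof example sentence, without the final </s>
wolof_ids = [172, 62, 13, 658, 1936, 285, 4289, 129, 37]

input_ids, attention_mask = pad_or_truncate(wolof_ids, max_length=40)
print(input_ids)
print(attention_mask)
```

With `padding_side="left"`, the same call would place the 30 padding ids before the sentence instead of after it.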

**Note that we can augment the sentences when generating them, as we did when using the `GPT2 model`.** See the notebook [augmentation](text_augmentation.ipynb) for a discussion of the augmentation method that we will use. For a clearer explanation of augmentation methods in NLP tasks and training, see the article [augment_or_not](https://direct.mit.edu/coli/article/48/1/5/108844/To-Augment-or-Not-to-Augment-A-Comparative-Study).

Before creating the custom dataset, let us check the maximum length that we can get from the tokenized corpora without considering the augmentation. To do so, we draw the box plot of the sequence lengths and identify the range in which we will sample the maximum length of the sequences.

In [20]:
length = []

for sent in sentences['french'].to_list() + sentences['wolof'].to_list():
    
    len_ids = len(wrapped_tokenizer1(sent).input_ids)
    
    length.append(len_ids)

        

In [21]:
fig = px.box(y=length, template="plotly_dark", labels=dict(y="Sequence length (tokens)"), color_discrete_sequence=['indianred'])

fig.show()

The upper fence is **52** and the maximum length is **283**, so we will test values between the two.
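
As a reminder, the upper fence of a box plot is commonly taken as the largest value that does not exceed Q3 + 1.5 × IQR. The sketch below computes it on hypothetical lengths, not on the corpora; the helper name `upper_fence` is an assumption, and the quartile method used by Plotly may differ slightly from the one used here.

```python
import statistics


def upper_fence(lengths):
    """Largest value that lies within Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    limit = q3 + 1.5 * (q3 - q1)
    return max(x for x in lengths if x <= limit)


# hypothetical token lengths, with one very long sentence
example_lengths = [5, 7, 8, 9, 10, 11, 12, 13, 14, 60]

fence = upper_fence(example_lengths)
longest = max(example_lengths)
print(fence, longest)
```

Values between `fence` and `longest` are the candidates for `max_len`: the closer to the fence, the more long outlier sentences get truncated.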

With augmentation, however, a sequence can become longer than the value that we choose: augmentation modifies the words, so the tokenizer may recognize only parts of them and split them into several other tokens. To account for this, we will add one fifth of the maximum length to it.
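
The margin for augmentation can be checked with a short calculation. This sketch only reproduces the arithmetic (integer division by 5); the function name `padded_max_len` is hypothetical.

```python
def padded_max_len(max_len: int) -> int:
    """Add one fifth (integer division) of the max length to it."""
    return max_len + max_len // 5


# the upper fence found with the box plot
print(padded_max_len(52))

# another candidate value
print(padded_max_len(50))
```

A `max_len` of 52 therefore gives sequences of 62 tokens, and a `max_len` of 50 gives sequences of 60 tokens.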

It is time to create our custom dataset for the `T5` model. It can also be used for the `Bart` model.

**Notice**: The main custom dataset is `SentenceDataset`.

Signature:
```python
class T5SentenceDataset(Dataset):

    def __init__(
        self,
        data_path: str, 
        tokenizer: PreTrainedTokenizerFast,
        corpus_1: str = "french",
        corpus_2: str = "wolof",
        max_len: int = 52,
        truncation: bool = False,
        file_sep: str = ",",
        cp1_transformer: Union[TransformerSequences, None] = None,
        cp2_transformer: Union[TransformerSequences, None] = None,
        add_bos_token: bool = False,
        **kwargs):

        pass
```

In [1]:
%%writefile wolof-translate/wolof_translate/data/dataset_v4.py
from wolof_translate.utils.sent_transformers import TransformerSequences
from transformers import PreTrainedTokenizerFast
from torch.utils.data import Dataset
from typing import Union
import pandas as pd
import torch
import re

class T5SentenceDataset(Dataset):

    def __init__(
        self,
        data_path: str, 
        tokenizer: PreTrainedTokenizerFast,
        corpus_1: str = "french",
        corpus_2: str = "wolof",
        max_len: int = 52,
        truncation: bool = False,
        file_sep: str = ",",
        cp1_transformer: Union[TransformerSequences, None] = None,
        cp2_transformer: Union[TransformerSequences, None] = None,
        add_bos_token: bool = False,
        **kwargs):
        
        # load the data frame
        self.__sentences = pd.read_csv(data_path, sep=file_sep, **kwargs)
        
        # store the tokenizer
        self.tokenizer = tokenizer
        
        # get the first corpus' sentences
        self.sentences_1 = self.__sentences[corpus_1].to_list()
        
        # get the second corpus' sentences
        self.sentences_2 = self.__sentences[corpus_2].to_list()
        
        # store the number of sentence pairs
        self.length = len(self.sentences_1)
        
        # add one fifth to the max length to leave room for augmented sentences
        self.max_len = max_len + max_len // 5
        
        # store the truncation argument
        self.truncation = truncation
        
        # store the sentence transformers
        self.cp1_transformer = cp1_transformer
        
        self.cp2_transformer = cp2_transformer
        
        # whether to add a beginning-of-sentence token
        self.add_bos = add_bos_token
        
        # get the special tokens
        self.special_tokens = tokenizer.convert_ids_to_tokens(tokenizer.all_special_ids)
        
    def __getitem__(self, index):
        """Get the ids and attention mask of the sentences at the given index

        Args:
            index (int): The index of the sentence pair to retrieve

        Returns:
            tuple: The ids of the sentence to translate, the attention mask of the sentence to translate,
            and the ids of the labels
        """
        sentence_1 = self.sentences_1[index]
        
        sentence_2 = self.sentences_2[index]
        
        # apply transformers if necessary
        if self.cp1_transformer is not None:
            
            sentence_1 = self.cp1_transformer(sentence_1)[0]
        
        if self.cp2_transformer is not None:
            
            sentence_2 = self.cp2_transformer(sentence_2)[0]
        
        sentence_1 = sentence_1 + self.tokenizer.eos_token
        
        sentence_2 = sentence_2 + self.tokenizer.eos_token
        
        # let us encode the sentences (we provide the second sentence as labels to the tokenizer)
        data = self.tokenizer(
            sentence_1,
            truncation=self.truncation,
            max_length=self.max_len, 
            padding='max_length', 
            return_tensors="pt",
            text_target=sentence_2)
        
        return data.input_ids.squeeze(0), data.attention_mask.squeeze(0), data.labels.squeeze(0)
        
    def __len__(self):
        
        return self.length
    
    def decode(self, labels: torch.Tensor):
        
        if labels.ndim < 2:
            
            labels = labels.unsqueeze(0)

        sentences = self.tokenizer.batch_decode(labels, skip_special_tokens=True)

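        # remove any special token left in the text (escaped so they are matched literally)
        return [re.sub('|'.join(map(re.escape, self.special_tokens)), '', sentence) for sentence in sentences]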


class SentenceDataset(T5SentenceDataset):

    def __init__(
        self,
        data_path: str, 
        tokenizer: PreTrainedTokenizerFast,
        corpus_1: str = "french",
        corpus_2: str = "wolof",
        max_len: int = 50,
        truncation: bool = False,
        file_sep: str = ",",
        cp1_transformer: Union[TransformerSequences, None] = None,
        cp2_transformer: Union[TransformerSequences, None] = None,
        add_bos_token: bool = False,
        **kwargs):
        
        super().__init__(data_path, 
                        tokenizer,
                        corpus_1,
                        corpus_2,
                        max_len,
                        truncation,
                        file_sep,
                        cp1_transformer,
                        cp2_transformer,
                        add_bos_token,
                        **kwargs)
        
    def __getitem__(self, index):
        """Get the ids and attention masks of the sentences at the given index

        Args:
            index (int): The index of the sentence pair to retrieve

        Returns:
            tuple: The ids of the sentence to translate, the attention mask of the sentence to translate,
            the ids of the labels, and the attention mask of the labels
        """
        sentence_1 = self.sentences_1[index]
        
        sentence_2 = self.sentences_2[index]
        
        # apply transformers if necessary
        if self.cp1_transformer is not None:
            
            sentence_1 = self.cp1_transformer(sentence_1)[0]
        
        if self.cp2_transformer is not None:
            
            sentence_2 = self.cp2_transformer(sentence_2)[0]
        
        # initialize the bos token (note: it is computed here but not prepended to the sentences)
        bos_token = '' if not self.add_bos else self.tokenizer.bos_token
        
        # let us encode the sentence to translate
        data = self.tokenizer(
            sentence_1,
            truncation=self.truncation,
            max_length=self.max_len, 
            padding='max_length', 
            return_tensors="pt")
        
        # let us encode the target sentence separately (its ids are used as labels)
        labels = self.tokenizer(
            sentence_2,
            truncation=self.truncation,
            max_length=self.max_len, 
            padding='max_length', 
            return_tensors="pt")
        
        return (data.input_ids.squeeze(0),
                data.attention_mask.squeeze(0), 
                labels.input_ids.squeeze(0),
                labels.attention_mask.squeeze(0))
    

Overwriting wolof-translate/wolof_translate/data/dataset_v4.py


In [23]:
%run wolof-translate/wolof_translate/data/dataset_v4.py

Let us generate some data with their masks and decode the labels.

**Note that, when training the `T5 model`, we will use separate train and test sets rather than the full dataset directly.**
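
One simple way to obtain such sets is to shuffle the indices of the dataset and split them. The following is only a sketch under assumptions: the function name `split_indices`, the 20% test ratio and the seed are hypothetical, and the split actually used for training is not shown in this notebook.

```python
import random


def split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test."""
    indices = list(range(n_samples))
    # a dedicated generator keeps the split reproducible
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


# hypothetical dataset of 10 sentence pairs
train_idx, test_idx = split_indices(10, test_ratio=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Index lists like these could then be passed to `torch.utils.data.Subset` to build the two datasets from the full one.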

In [24]:
# t5_tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# wrapped_tokenizer1.eos_token_id = t5_tokenizer.eos_token_id

# wrapped_tokenizer1.pad_token_id = t5_tokenizer.pad_token_id

# wrapped_tokenizer1.unk_token_id = t5_tokenizer.unk_token_id

In [25]:
# Initialize our custom dataset
dataset = SentenceDataset("data/extractions/new_data/corpora_v6.csv", wrapped_tokenizer1, truncation=True)

In [26]:
generator = torch.manual_seed(5)
input_ids, input_mask, labels, _ = next(iter(DataLoader(dataset, 10, shuffle=True, generator=generator))) # generate 10 sentences with shuffling

Let us print the input ids.

In [27]:
input_ids

tensor([[1166,  192,    4,  100,    6,   73, 1078,   20,   14, 1329,    4,  100,
            6,   73, 6897,  120,  714,   21,    9, 5271,    4,   22, 5123,    7,
           10, 5668,  401,   68,  101,  379,   15, 5316,   35,   27,  711,    4,
           41, 7022,    9, 1149, 5800,   58,  444,  850,    4,  735,  160,   22,
         3402, 6067,  861,   64,    6, 4576, 6546,  551,  208,    4, 3325,    1],
        [ 161,  216, 5346,   36,    9,    8, 6566,   40,  232,  132,   56,   15,
          998,    4,  105,  106,   64,    6,   32,   89, 4207,   35,   82, 6356,
          696, 1056,    4,   68,  943,   24,    6, 2113,    8,   17,    6, 4667,
         1069,    5,    1,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [ 430,  136,   12,  242,    4,  551,  156,    4,   19,  181,  136,   12,
          116,   33,   86,   44,   22, 1173,    5,    1,    0,    0,    0,    0,
            0,    0,    0,

In [28]:
input_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      

In [29]:
labels

tensor([[ 583,  193, 1018,   63,  657,    4, 6761, 6346,   78,   13, 2120,   26,
          603,   62,   35,  227,   15,  163,   15, 6652, 4299,   11,  576,   51,
            5,  583,  193, 1234,   63, 3117,   29,   52,  849,    4, 3440,  328,
         6226,   39,   34,  946,    4,  184, 1141,  997,  162,   13, 1858,   11,
          377,   52,  457,   37,   42,   31,   38,   34, 2302,   37,   13,    1],
        [2464,   34,  712,    4,  278,   34,  119, 2595,    4, 4212,   14,    4,
          709,   31,   38,  503,  422, 1094,  320,  339,    4,  163, 2559,  111,
          829,    5,    1,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [ 734,  922,  309,  446, 2718,  157, 1176,  112,    5,    1,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,

Let us decode the labels.

In [30]:
dataset.decode(labels)

['Man it jënd naa gaal, ŋarale samay baaraami tànk ngir gën a mën a jafandu ci wet gi. Man it téye naa joowu bu gudd, coll yiy naaw di ma romb, may déglu ngelaw liy jooy ci àll bu xonq bi ak ni ko masin biy',
 'At ma Niseryaa, réew ma Baay dëkkoon, tojee, Móris ni ko ginnaaw jot nanu sunu bopp, mënatoo nekk Àngale.',
 'Waa jii ag waa joojale duñu benn.',
 'Guwiyaan la Baay waajale Afrig.',
 'Gis naa nagu Tugël.',
 'Du ñëw, defe naa!',
 'Keneen, ki ñëw!',
 'Lii, ay nit lañu yu génn di ñaxtu. Jëm yi nag ñu ngi sol seen i yére yu weex ak seen i tubéy yu baxa, nekk ci tali bi di wane seen aw naqar. Ñu ngi bind ci digg bi mbindum farañse mu baxa.',
 'Jënd ñaa menn xar mi!',
 'Dafa di dem.']
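
The `decode` method of the dataset removes any special-token strings left in the decoded text with a regular expression. A minimal standalone sketch of that cleaning step is given below; the helper name `strip_special_tokens` is hypothetical, and the tokens are escaped so that they are matched literally.

```python
import re


def strip_special_tokens(text, special_tokens):
    """Remove every occurrence of the special tokens from the text."""
    pattern = "|".join(re.escape(token) for token in special_tokens)
    return re.sub(pattern, "", text)


# decoded Wolof example that still contains special tokens
decoded = "A ngiy dóor dàqeek jànt bi</s><pad><pad><pad>"

cleaned = strip_special_tokens(decoded, ["<pad>", "</s>", "<unk>"])
print(cleaned)
```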