Discriminator tokenizer
----------------------------

We will create a WordPiece tokenizer (a tutorial is available at the following link: [WordPieceTokenizer](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)) to compare a generated sequence with a true sequence in the GAN. Here we only test whether the WordPiece tokenizer gives good results, which in our case means recovering the original sequence after encoding and decoding.

WordPiece is chosen because the discriminator is assumed to be a BERT model.

- Creating a sentence generator
- Instantiating the `WordPiece` tokenizer
- ~~Instantiating the `normalizer`~~
- Instantiating the pre-tokenizer (BERT-style splitting on whitespace and punctuation)
- Instantiating the trainer with a vocabulary size of `20000` tokens and the special tokens `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]` and `[MASK]`
- Training the tokenizer
- Instantiating the post-processor by specifying the format of the two passed sequences (sequence from the French corpus and sequence from the Wolof corpus)
- Initializing the decoder
- Saving the tokenizer
- Making a test


Let us import the necessary libraries.

In [1]:
# for creating the tokenizer
from tokenizers import (
    normalizers,
    decoders,
    models,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# for importing and manipulating the sentences
import pandas as pd

#### Load dataset and create generator

We will train a single tokenizer on both corpora: each French sentence is concatenated with its Wolof translation, so that the vocabulary covers the two languages.

In [2]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/sent_extraction.csv")

# initialize a batch size
BATCH_SIZE = 50

# create generators (for the corpora)
def generate_sentences():
    
    # stacking the sentences
    concat_sentences = lambda line_index: sentences.loc[line_index, "french_corpus"] + " " + sentences.loc[line_index, "wolof_corpus"]  
    
    sentences["corpora"] = sentences.index.map(concat_sentences)
    
    sents = sentences["corpora"].to_list()
    
    # start at 0 so that the first sentence is not skipped
    for i in range(0, len(sents), BATCH_SIZE):
        
        yield sents[i:i+BATCH_SIZE]
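
The batching pattern used by the generator can be checked on a toy list. The sketch below is a standalone illustration: the helper name `batch_iterator` and the sample strings are assumptions made for the example and are not part of the notebook. It shows that every item is yielded exactly once and that the last batch may be shorter.

```python
from typing import Iterator, List


def batch_iterator(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# hypothetical toy data standing in for the concatenated corpora
toy_sentences = [f"sentence {i}" for i in range(7)]
batches = list(batch_iterator(toy_sentences, batch_size=3))

print([len(batch) for batch in batches])  # [3, 3, 1]
```

Yielding lists of sentences rather than single sentences is what `train_from_iterator` receives in the training cell further down.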

#### Initialize the tokenizer

In [3]:
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
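
To make the role of the model clearer, here is a simplified sketch of the greedy longest-match-first lookup that WordPiece applies to a single word at encoding time. It is a plain Python illustration and not the library's implementation: the function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for the example.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, from left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# hypothetical toy vocabulary
toy_vocab = {"frapp", "##ions", "##er", "nous"}

print(wordpiece_tokenize("frappions", toy_vocab))  # ['frapp', '##ions']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

The `unk_token` argument plays the same role as the `unk_token="[UNK]"` passed to `models.WordPiece` above.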

#### Specify normalizers (No normalizer will be required)

In [4]:
pass

#### Specify a BERT-style pre-tokenizer

In [5]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
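
The `Whitespace` pre-tokenizer is documented as splitting with the regular expression `\w+|[^\w\s]+`, that is, runs of word characters on one side and runs of punctuation on the other. The sketch below reproduces that split with the standard `re` module only; the helper name `whitespace_pre_tokenize` is an assumption made for the example, and small differences with the library on unusual Unicode input cannot be excluded.

```python
import re
from typing import List


def whitespace_pre_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(whitespace_pre_tokenize("jusqu'à en avoir mal aux mains,"))
# ['jusqu', "'", 'à', 'en', 'avoir', 'mal', 'aux', 'mains', ',']
```

The apostrophe becomes a token of its own, which matters when the sentence is decoded at the end of this section.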

#### Specify a trainer

In [6]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=20000, special_tokens=special_tokens)

#### Train the tokenizer

In [7]:
tokenizer.train_from_iterator(generate_sentences(), trainer)

Let us print the vocabulary size.

In [8]:
tokenizer.get_vocab_size()

15115

#### Add post-processing (Not required)

In [9]:
# retrieve the ids of the special tokens
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


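We will add the special tokens to differentiate the French sentences from the Wolof sentences.

In [10]:
# note: the library reads the post-processor from `tokenizer.post_processor`;
# the attribute name used here does not appear to activate the template, which is
# consistent with the test below, where the special tokens are written by hand
tokenizer.post_process = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)]
)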
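
The layout described by the two templates can be sketched in plain Python. This is only an illustration of the intended result (tokens and segment ids), not the library's code; the function name `apply_bert_template` and the sample tokens are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def apply_bert_template(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Return (tokens, type_ids) laid out as [CLS] A [SEP], then B [SEP] if given."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    return tokens, type_ids


tokens, type_ids = apply_bert_template(["Nous", "frappions"], ["nuy", "dóor"])
print(tokens)    # ['[CLS]', 'Nous', 'frappions', '[SEP]', 'nuy', 'dóor', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

In the `pair` template, `$A:0` and `$B:1` carry the same information: the number after the colon is the segment id given to each token.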

#### Specify the decoder

In [11]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
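
The main job of this decoder is to glue the continuation pieces (those starting with `##`) back onto the previous piece. A minimal sketch of that step is given below; the helper name `join_wordpieces` is an assumption made for the example, and the library's decoder additionally tidies the spacing around some punctuation marks, which is not reproduced here.

```python
from typing import List


def join_wordpieces(tokens: List[str], prefix: str = "##") -> str:
    """Merge continuation pieces into the preceding token and join with spaces."""
    words: List[str] = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # attach to the previous word
        else:
            words.append(token)
    return " ".join(words)


print(join_wordpieces(["frapp", "##ions", "à", "nouveau"]))  # frappions à nouveau
```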

#### Save the tokenizer

In [12]:
tokenizer.save("wolof-translate/wolof_translate/tokenizers/adverse_tokenizer.json")

#### Make a test with a sentence

Let us pick a random French sentence and its corresponding Wolof translation in order to verify that we obtain the expected tokenization result.

In [13]:
import random

random.seed(100)

# randint includes both bounds, so the upper bound must be the last valid row index
line = random.randint(0, sentences.shape[0] - 1)

sentence = sentences.loc[line, :]

fr_sentence = sentence['french_corpus'].replace("’", "'")

wf_sentence = sentence['wolof_corpus'].replace("’", "'")

Let us encode the sentences.

In [14]:
encoding = tokenizer.encode(f"[CLS]{fr_sentence}[SEP]{wf_sentence}[SEP]")

Let us print the tokens and ids of the encoded sentence pair.

In [15]:
print("Tokens:")
print(encoding.tokens)

print("IDS:")
print(encoding.ids)

Tokens:
['[CLS]', 'Nous', 'frappions', 'à', 'nouveau', ',', 'jusqu', "'", 'à', 'en', 'avoir', 'mal', 'aux', 'mains', ',', 'comme', 'si', 'nous', 'combattions', 'un', 'ennemi', 'invisible', '.', '[SEP]', 'Loolu', 'gën', 'noo', 'ràkkaajuloo', ',', 'nuy', 'dóor', ',', 'di', 'dóor', ',', 'di', 'dóor', 'ba', 'sunuy', 'loxoy', 'metti', '.', '[SEP]']
IDS:
[2, 911, 4007, 88, 2374, 10, 828, 7, 88, 204, 1437, 727, 571, 1890, 10, 481, 321, 417, 14054, 207, 4079, 5070, 12, 3, 1201, 498, 1893, 14071, 10, 1062, 1150, 10, 222, 1150, 10, 222, 1150, 223, 1840, 4765, 1032, 12, 3]


Let us decode the tokens.

In [16]:
tokenizer.decode(encoding.ids)

"Nous frappions à nouveau, jusqu ' à en avoir mal aux mains, comme si nous combattions un ennemi invisible. Loolu gën noo ràkkaajuloo, nuy dóor, di dóor, di dóor ba sunuy loxoy metti."

We notice that some punctuation marks are badly separated from their neighboring letters after decoding (for example, the apostrophe between "jusqu" and "à" is decoded as "jusqu ' à"). We must identify all such cases to produce recombination rules for the generated sentences.
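
As a starting point, a recombination rule for the apostrophe can be written with a regular expression. The sketch below is only one possible rule, given as an assumption of what such rules could look like: the function name `rejoin_apostrophes` is hypothetical, and the rule would also glue an apostrophe used as an opening or closing quote, so it needs to be checked against the corpus.

```python
import re


def rejoin_apostrophes(text: str) -> str:
    """Remove the spaces that decoding leaves on both sides of an apostrophe."""
    return re.sub(r"\s+'\s+", "'", text)


decoded = "Nous frappions à nouveau, jusqu ' à en avoir mal aux mains."
print(rejoin_apostrophes(decoded))
# Nous frappions à nouveau, jusqu'à en avoir mal aux mains.
```

Other marks found in the corpora could be handled with additional rules of the same kind, applied one after the other.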