Discriminator tokenizer
----------------------------

We will create a WordPiece tokenizer (a tutorial is available at the following link: [WordPieceTokenizer](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)) to tell a generated sequence apart from a true sequence in the GAN. Here we only test whether the WordPiece tokenizer gives good results, which in our case means recovering the original sequence after encoding and decoding.

WordPiece is chosen because the discriminator is assumed to be a BERT model, which uses this tokenization scheme.

- Creating a sentence generator
- Instantiating the `WordPiece` tokenizer
- ~~Instantiating the `normalizer`~~
- Instantiating the pre-tokenizer (`Whitespace`)
- Instantiating the trainer with a vocabulary size of `20000` tokens and the following special tokens `"[UNK]", "[PAD]", "[CLS]", "[SEP]", and "[MASK]"`
- Training the tokenizer
- Instantiating the post-processor by specifying the format of the two passed sequences (sequence from the French corpus and sequence from the Wolof corpus)
- Initializing the decoder
- Saving the tokenizer
- Making a test


Let us import the necessary libraries.

In [1]:
# for creating the tokenizer
from tokenizers import (
    normalizers,
    decoders,
    models,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# for importing and manipulating the sentences
import pandas as pd

#### Load dataset and create generator

We will train a single tokenizer on both corpora: each French sentence is concatenated with its Wolof translation, so that the vocabulary covers the two languages.

In [2]:
# load sentences
sentences = pd.read_csv("data/extractions/new_data/sent_extraction.csv")

# initialize a batch size
BATCH_SIZE = 50

# create generators (for the corpora)
def generate_sentences():
    
    # stacking the sentences
    concat_sentences = lambda line_index: sentences.loc[line_index, "french_corpus"] + " " + sentences.loc[line_index, "wolof_corpus"]  
    
    sentences["corpora"] = sentences.index.map(concat_sentences)
    
    sents = sentences["corpora"].to_list()
    
    # start at 0 so that the first sentence is not skipped
    for i in range(0, len(sents), BATCH_SIZE):
        
        yield sents[i:i+BATCH_SIZE]
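
The batching pattern of the generator can be checked on toy data. The sketch below is only an illustration: `batched` and `toy_sentences` are hypothetical names that do not exist in the project, and the slices start from index 0 so that no sentence is left out.

```python
# Hypothetical helper: yield consecutive batches of a list, starting at index 0
def batched(items, batch_size):
    for start in range(0, len(items), batch_size):
        # the last slice may be shorter than batch_size
        yield items[start:start + batch_size]


toy_sentences = [f"sentence {i}" for i in range(7)]
batches = list(batched(toy_sentences, 3))

print([len(batch) for batch in batches])  # [3, 3, 1]
```

Yielding lists of sentences rather than single sentences is what `train_from_iterator` accepts as batches, and it avoids calling the trainer once per sentence.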

#### Initialize the tokenizers

In [3]:
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
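
To give an idea of how a trained WordPiece model splits a word, here is a minimal pure-Python sketch of greedy longest-match-first splitting, the scheme used by BERT-style WordPiece. The function `wordpiece_split` and the tiny `toy_vocab` are assumptions made for illustration; the real model uses the vocabulary learned by the trainer.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # pieces that continue a word carry the "##" prefix
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # nothing matched: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# toy vocabulary (assumption, not the trained one)
toy_vocab = {"to", "##llu", "##olu", "[UNK]"}

print(wordpiece_split("tollu", toy_vocab))  # ['to', '##llu']
print(wordpiece_split("toolu", toy_vocab))  # ['to', '##olu']
print(wordpiece_split("xare", toy_vocab))   # ['[UNK]']
```

A word that cannot be fully covered by vocabulary pieces becomes a single `[UNK]`, which is why the `unk_token` must be given when the model is created.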

#### Specify normalizers (No normalizer will be required)

In [4]:
pass

#### Specify a pre-tokenizer (`Whitespace`, which splits on whitespace and punctuation)

In [5]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
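
The `Whitespace` pre-tokenizer is documented as splitting with the regular expression `\w+|[^\w\s]+`. The sketch below applies that pattern with Python's `re` module to show the effect on a French fragment; `whitespace_pre_tokenize` is a hypothetical helper, and Python's regex engine may differ from the library's on some Unicode edge cases.

```python
import re

# runs of word characters, or runs of characters that are
# neither word characters nor whitespace
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Approximate the Whitespace pre-tokenizer with a plain regex."""
    return WHITESPACE_PATTERN.findall(text)


print(whitespace_pre_tokenize("d'un grand secours), mais"))
# ['d', "'", 'un', 'grand', 'secours', '),', 'mais']
```

Consecutive punctuation marks such as `),` stay together as one piece, whereas a BERT-style pre-tokenizer would isolate each punctuation character.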

#### Specify a trainer

In [6]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=20000, special_tokens=special_tokens)

#### Train the tokenizer

In [7]:
tokenizer.train_from_iterator(generate_sentences(), trainer)

Let us print the vocabulary size.

In [8]:
tokenizer.get_vocab_size()

15023

#### Add post-processing (Not required)

In [9]:
# let us retrieve the ids of the special tokens
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


We will add the special tokens for differentiating the French sentences from the Wolof sentences.
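
As a rough picture of what a pair template of the form `[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1` produces, the pure-Python sketch below lays out the tokens and their type ids by hand. `apply_pair_template` is a hypothetical function written for illustration, not part of the `tokenizers` library.

```python
def apply_pair_template(tokens_a, tokens_b):
    """Return (tokens, type_ids) laid out as [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sequence A and the first [SEP] get type id 0;
    # sequence B and the final [SEP] get type id 1
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, type_ids


tokens, type_ids = apply_pair_template(["ma", "mère"], ["sama", "yaay"])

print(tokens)    # ['[CLS]', 'ma', 'mère', '[SEP]', 'sama', 'yaay', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

The type ids are what allow a BERT-like discriminator to tell the French segment from the Wolof one.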

In [10]:
# the attribute read by the tokenizer is `post_processor`
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)]
)

#### Specify the decoder

In [11]:
tokenizer.decoder = decoders.WordPiece(prefix="##")

#### Save the tokenizer

In [12]:
tokenizer.save("wolof-translate/wolof_translate/tokenizers/adverse_tokenizer.json")

#### Make a test with a sentence

Let us pick a random French sentence and its corresponding Wolof translation to check that we obtain the expected tokenization.

In [13]:
import random

# randrange excludes the upper bound, so the index is always valid
line = random.randrange(sentences.shape[0])

sentence = sentences.loc[line, :]

fr_sentence = sentence['french_corpus'].replace("’", "'")

wf_sentence = sentence['wolof_corpus'].replace("’", "'")

Let us encode the sentences.

In [14]:
# the post-processor adds [CLS] and [SEP] around the pair of sentences
encoding = tokenizer.encode(fr_sentence, wf_sentence)

Let us print the tokens and the ids of the encoding.

In [15]:
print("Tokens:")
print(encoding.tokens)

print("IDS:")
print(encoding.ids)

Tokens:
['[CLS]', 'Ce', 'n', "'", 'est', 'que', 'longtemps', 'après', ',', 'quand', 'l', "'", 'égoïsme', 'naturel', 'aux', 'enfants', 's', "'", 'est', 'estompé', ',', 'que', 'j', "'", 'ai', 'compris', ':', 'ma', 'mère', ',', 'en', 'vivant', 'loin', 'de', 'mon', 'père', ',', 'avait', 'pratiqué', 'du', 'fait', 'de', 'la', 'guerre', 'un', 'héroïsme', 'sans', 'emphase', ',', 'non', 'par', 'inconscience', 'ni', 'par', 'résignation', '(', 'même', 'si', 'la', 'foi', 'religieuse', 'avait', 'pu', 'lui', 'être', 'd', "'", 'un', 'grand', 'secours', '),', 'mais', 'par', 'la', 'force', 'que', 'faisait', 'naître', 'en', 'elle', 'une', 'telle', 'inhumanité', '.', '[SEP]', 'Teg', 'nañ', 'ciy', 'ati', '-', 'at', 'ma', 'door', 'a', 'jëli', 'ni', 'jigéen', ',', 'ni', 'góor', 'di', 'wonee', 'njàmbaar', 'ci', 'toolu', '-', 'xare', ',', 'la', 'mën', 'a', 'toog', 'biir', 'këram', 'moom', 'tamit', ',', 'wone', 'fa', 'njàmbaar', 'gu', 'ni', 'tollu', '.', 'Ni', 'sama', 'yaay', 'daan', 'doxalee', 'ci', 'geer', '

Let us decode the tokens.
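
Before running the decoder, here is a simplified sketch of what WordPiece decoding amounts to: special tokens are dropped, `##` pieces are glued to the previous piece, and the rest is joined with spaces. `wordpiece_decode` and `same_ignoring_spaces` are hypothetical helpers; the library's decoder additionally cleans up some spaces before punctuation, which this sketch omits.

```python
SPECIAL_TOKENS = ("[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]")


def wordpiece_decode(tokens, prefix="##", special_tokens=SPECIAL_TOKENS):
    """Join WordPiece tokens back into text, skipping special tokens."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith(prefix) and words:
            # continuation piece: append to the previous word
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


def same_ignoring_spaces(a, b):
    """Compare two strings after removing all whitespace."""
    return "".join(a.split()) == "".join(b.split())


decoded = wordpiece_decode(["[CLS]", "n", "'", "est", "to", "##llu", "[SEP]"])
print(decoded)                                       # n ' est tollu
print(same_ignoring_spaces(decoded, "n'est tollu"))  # True
```

Since pre-tokenization discards the original spacing around punctuation, comparing the decoded text with the original while ignoring whitespace is one simple way to check that the sequence was recovered.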

In [16]:
tokenizer.decode(encoding.ids)

"Ce n ' est que longtemps après, quand l ' égoïsme naturel aux enfants s ' est estompé, que j ' ai compris : ma mère, en vivant loin de mon père, avait pratiqué du fait de la guerre un héroïsme sans emphase, non par inconscience ni par résignation ( même si la foi religieuse avait pu lui être d ' un grand secours ), mais par la force que faisait naître en elle une telle inhumanité. Teg nañ ciy ati - at ma door a jëli ni jigéen, ni góor di wonee njàmbaar ci toolu - xare, la mën a toog biir këram moom tamit, wone fa njàmbaar gu ni tollu. Ni sama yaay daan doxalee ci geer bi, firnde la ci."