Experimenting with BART using Hugging Face
---------------------

Let us import the BART tokenizer and the pre-trained model class.

In [1]:
from transformers import BartForConditionalGeneration, BartTokenizerFast
import pandas as pd



### Tokenization

Let us see how BART tokenizes (pre-processes) the text.

In [3]:
# let us use the following text from the tutorial at https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartForConditionalGeneration
text = "I like eating cupcakes."

# let us load the tokenizer
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")

# let us tokenize the text
encoded_input = tokenizer(text, return_tensors = 'pt') # return torch tensor

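# print the result; `input_ids` has shape (1, 8), so iterating over it
# yields one row per sentence and each row is decoded as a whole string
print("Here the decoded tokens")
print([tokenizer.decode(token) for token in encoded_input['input_ids']])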

print("\nHere the ids and padding mask")
print(encoded_input)

Here the decoded tokens
['<s>I like eating cupcakes.</s>']

Here the ids and padding mask
{'input_ids': tensor([[    0,   100,   101,  4441,  4946, 33579,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


Let us check the ids of the special tokens.

In [4]:
tokenizer.all_special_ids 

[0, 2, 3, 1, 50264]

Let us check which tokens these special ids correspond to.

In [5]:
tokenizer.convert_ids_to_tokens(tokenizer.all_special_ids)

['<s>', '</s>', '<unk>', '<pad>', '<mask>']

The tokenizer, which was trained on an English vocabulary, splits the text into tokens and returns the corresponding attention mask.
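To make the role of the attention mask concrete, here is a minimal pure-Python sketch of what happens when a batch is padded. It is not the library's implementation: the helper `pad_batch` and the second, shorter sequence are made up for illustration, right-padding is assumed, and the pad id `1` is taken from the special-token listing above.

```python
# Sketch only: mimics how padding a batch produces an attention mask.
PAD_ID = 1  # id of '<pad>' according to the special-token listing above


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every id sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [0, 100, 101, 4441, 4946, 33579, 4, 2],  # ids printed above for "I like eating cupcakes."
    [0, 100, 101, 4, 2],                     # hypothetical shorter sequence
]
padded = pad_batch(batch)
print(padded["input_ids"][1])       # [0, 100, 101, 4, 2, 1, 1, 1]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

With a single sentence, as in the cell above, no padding is needed, which is why the mask contains only ones.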

Let us load the extracted sentences, tokenize some of them, and print the results.

In [3]:
sentences = pd.read_csv("data/extractions/new_data/sent_extraction.csv")

In [4]:
# draw a sample of 100 sentences and display it
samples = sentences.sample(100, random_state=100).to_dict(orient='index')
samples

{311: {'french_corpus': "À l'âge de trente ans, mon père quitte Southampton à bord d'un cargo mixte à destination de Georgetown, en Guyane britannique.",
  'wolof_corpus': 'Ci fànweeri atam la Baay jël gaal, jóge Saawusamton jëm Sorstaawun, ca Guwiyaan ba newoon ci loxoy Àngale yi.'},
 143: {'french_corpus': "Comment l'avons-nous su ? Peut-être par mon père, ou bien par un des garçons du village.",
  'wolof_corpus': 'Naka lanu def ba xam ko ci saa si ci lu amul benn laam-laamee ? Ndax Baay a nu ko waxoon mbaa kenn ci goney Ogosaa yi ?'},
 320: {'french_corpus': "Pendant sept ans il étudie à Londres, d'abord dans une école d'ingénieur, puis à la faculté de médecine.",
  'wolof_corpus': 'Lu tollook juróom-ňaari at, mu nekk Londar di jàng. Eñseñeer la jëkkon a bëgg nekk waaye dafa ca mujjee génn, daldi taamu njàngum faj.'},
 19: {'french_corpus': 'Le sexe des garçons, leur gland rose circoncis.',
  'wolof_corpus': 'Cucu gone yi ñu xarafal ak seen ndéeň lu xonq.'},
 73: {'french_corpus': "

Let us try to tokenize the sentence at index 101.

In [5]:
# first example (pair of French and Wolof sentences)
example1 = samples[101]

In [6]:
example1

{'french_corpus': "La mémoire d'un enfant exagère les distances et les hauteurs.",
 'wolof_corpus': 'Gone, su demee bay natt dayo mbaa guddaay, du xam bu ca yem.'}

- On the French corpus

In [7]:
# tokenize the first example
encoded_input = tokenizer(example1['french_corpus'], return_tensors='pt')
print("The encoded tokens with their mask")
print(encoded_input)
print("\nThe decoded tokens")
[tokenizer.decode(token) for token in encoded_input['input_ids'][0]]

The encoded tokens with their mask
{'input_ids': tensor([[    0, 10766,   475,  1140,  4992,  1885,   385,   108,   879,  1177,
           506,   927,  1931,  1073, 18655,  7427, 21459,  4400,  7427,  2489,
          4467,  4668,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The decoded tokens


['<s>',
 'La',
 ' m',
 'é',
 'mo',
 'ire',
 ' d',
 "'",
 'un',
 ' en',
 'f',
 'ant',
 ' ex',
 'ag',
 'ère',
 ' les',
 ' distances',
 ' et',
 ' les',
 ' ha',
 'ute',
 'urs',
 '.',
 '</s>']

- On the Wolof corpus

In [8]:
# tokenize the first example
encoded_input = tokenizer(example1['wolof_corpus'], return_tensors='pt')
print("The encoded tokens with their mask")
print(encoded_input)
print("\nThe decoded tokens")
[tokenizer.decode(token) for token in encoded_input['input_ids'][0]]

The encoded tokens with their mask
{'input_ids': tensor([[    0,   534,  1264,     6,  2628,  4410,  1942, 11751,   295,  2611,
           183,   139,   475,  3178,   102,   821,  7027,   102,   857,     6,
          4279,  3023,   424, 10306,  6056,  1423,   991,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]])}

The decoded tokens


['<s>',
 'G',
 'one',
 ',',
 ' su',
 ' dem',
 'ee',
 ' bay',
 ' n',
 'att',
 ' day',
 'o',
 ' m',
 'ba',
 'a',
 ' g',
 'udd',
 'a',
 'ay',
 ',',
 ' du',
 ' x',
 'am',
 ' bu',
 ' ca',
 ' y',
 'em',
 '.',
 '</s>']

We do not obtain what we expected because the tokenizer was not trained on a French or a Wolof corpus: many words are split into short fragments.
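One rough way to quantify this fragmentation is the average number of tokens per whitespace-separated word. The sketch below reuses the token lists printed above; the helper `tokens_per_word` is a name introduced here for illustration, and punctuation is counted as tokens, so the ratio is only indicative.

```python
# Sketch only: tokens-per-word ratio computed from the outputs shown above.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>", "<unk>", "<mask>"}


def tokens_per_word(sentence, tokens):
    """Average number of non-special tokens per whitespace-separated word."""
    n_tokens = sum(1 for tok in tokens if tok not in SPECIAL_TOKENS)
    n_words = len(sentence.split())
    return n_tokens / n_words


french_sentence = "La mémoire d'un enfant exagère les distances et les hauteurs."
french_tokens = ['<s>', 'La', ' m', 'é', 'mo', 'ire', ' d', "'", 'un', ' en', 'f',
                 'ant', ' ex', 'ag', 'ère', ' les', ' distances', ' et', ' les',
                 ' ha', 'ute', 'urs', '.', '</s>']

wolof_sentence = "Gone, su demee bay natt dayo mbaa guddaay, du xam bu ca yem."
wolof_tokens = ['<s>', 'G', 'one', ',', ' su', ' dem', 'ee', ' bay', ' n', 'att',
                ' day', 'o', ' m', 'ba', 'a', ' g', 'udd', 'a', 'ay', ',', ' du',
                ' x', 'am', ' bu', ' ca', ' y', 'em', '.', '</s>']

print(f"French: {tokens_per_word(french_sentence, french_tokens):.2f}")  # French: 2.20
print(f"Wolof:  {tokens_per_word(wolof_sentence, wolof_tokens):.2f}")    # Wolof:  2.08
```

Both sentences need roughly two tokens per word, and fragments such as `' m'`, `'é'`, `'mo'`, `'ire'` carry no linguistic meaning on their own.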

#### Improve the tokenization with a model

In [12]:
# import necessary libraries
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

import random

random.seed(100)

----------------

Let us improve the tokenization of the texts by training a tokenizer from scratch. The tutorial is available at the following link [Training_Tokenizer_from_scratch](https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt).

- First, we must load the sentences in batches. Let us load 5 sentences at a time, as set by `BATCH_SIZE` in the code.

In [11]:
# retrieve the sentences of each corpus as a list (simple method)
french_sentences = sentences['french_corpus'].to_list()

wolof_sentences = sentences['wolof_corpus'].to_list()

# define the batch size
BATCH_SIZE = 5

# let us use generators to load the French and Wolof sentences batch by batch
def load_french_sentences():
    
    for i in range(0, len(french_sentences), BATCH_SIZE):
        
        yield french_sentences[i : i + BATCH_SIZE]

def load_wolof_sentences():
    
    for j in range(0, len(wolof_sentences), BATCH_SIZE):
        
        yield wolof_sentences[j : j + BATCH_SIZE]
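The two generators above differ only in the list they read from. As a sketch, the same logic can be written once as a generic helper; `batch_iterator` and `toy_sentences` are names and data introduced here for illustration only.

```python
# Sketch only: one generic batching generator instead of one per corpus.
def batch_iterator(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Toy stand-in for `french_sentences` (hypothetical data).
toy_sentences = [f"sentence {i}" for i in range(12)]
batch_lengths = [len(batch) for batch in batch_iterator(toy_sentences, batch_size=5)]
print(batch_lengths)  # [5, 5, 2]
```

The last batch is simply shorter when the number of sentences is not a multiple of the batch size.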


##### BPE (Byte-Pair Encoding tokenization) tokenizer

- Let us create a tokenizer based on the BPE model

In [12]:
bpe_tokenizer = Tokenizer(models.BPE())

- Configure a pre-tokenizer

In [119]:
# add a byte-level pre-tokenizer and specify that no space is added before the first word
bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

- Test the pre-tokenizer

In [120]:
# let us test the pre-tokenizer with a random sentence
bpe_tokenizer.pre_tokenizer.pre_tokenize_str(french_sentences[0])

[('Tout', (0, 4)),
 ('ĠÃªtre', (4, 9)),
 ('Ġhumain', (9, 16)),
 ('Ġest', (16, 20)),
 ('Ġle', (20, 23)),
 ('ĠrÃ©sultat', (23, 32)),
 ('Ġd', (32, 34)),
 ('âĢĻ', (34, 35)),
 ('un', (35, 37)),
 ('ĠpÃ¨re', (37, 42)),
 ('Ġet', (42, 45)),
 ('Ġune', (45, 49)),
 ('ĠmÃ¨re', (49, 54)),
 ('.', (54, 55))]
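The unusual characters in this output (`Ġ`, `Ãª`, `âĢĻ`) are not encoding errors: a byte-level pre-tokenizer works on the UTF-8 bytes of the text and displays each byte as a printable character. The sketch below reproduces that display, assuming the same byte-to-character table as the GPT-2 reference code; the helper names are introduced here for illustration.

```python
# Sketch only: GPT-2 style byte-to-character table, assumed to match ByteLevel.
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes (space, control characters, ...) are moved past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {char: byte for byte, char in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Render the UTF-8 bytes of `text` with the table above."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(rendered):
    """Invert `to_byte_level` (what a byte-level decoder has to do)."""
    return bytes(CHAR_TO_BYTE[ch] for ch in rendered).decode("utf-8")


print(to_byte_level(" être"))  # ĠÃªtre
print(to_byte_level("’"))      # âĢĻ
print(from_byte_level(to_byte_level(" père")))
```

The space becomes `Ġ`, and each accented letter or typographic apostrophe becomes two or three characters because it occupies two or three bytes in UTF-8.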

- If the decoder part of the transformer is a GPT-2 model, then we need an end-of-text token. Let us initialize the BPE trainer with a vocabulary size of `10000`.

In [16]:
bpe_trainer = trainers.BpeTrainer(vocab_size = 10000, special_tokens = ["<|endoftext|>"])
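To give an intuition of what the trainer does with `vocab_size`, here is a toy BPE training loop in pure Python. It is a simplified sketch and not the `tokenizers` implementation: special tokens, the byte-level alphabet and minimum frequencies are ignored, and `train_toy_bpe`, `merge_pair` and `encode` are names introduced here for illustration.

```python
# Sketch only: toy BPE training on a tiny word list.
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_toy_bpe(words, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pair is left."""
    corpus = Counter(tuple(word) for word in words)       # word -> frequency
    vocab = sorted({ch for word in words for ch in word})  # start from single characters
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol: nothing left to merge
        # most frequent pair, ties broken alphabetically to stay deterministic
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        if best[0] + best[1] not in vocab:
            vocab.append(best[0] + best[1])
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return vocab, merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


vocab, merges = train_toy_bpe(["les", "les", "les", "le", "le", "et"], vocab_size=100)
print(vocab)                   # ['e', 'l', 's', 't', 'le', 'les', 'et']
print(encode("lest", merges))  # ['les', 't']
```

In this toy run the loop stops at 7 entries although 100 were requested, because no mergeable pair remains. A small corpus can therefore yield a vocabulary smaller than the requested size.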

- Let us pass the batch iterator and the trainer to the tokenizer and test the result on a sample

Let us test on the French corpus first and then on the Wolof corpus.

1. On the French corpus

In [123]:
bpe_tokenizer.train_from_iterator(load_french_sentences(), trainer=bpe_trainer)

Let us attach a byte-level decoder so that the ids can be decoded back into text.

In [124]:
bpe_tokenizer.decoder = decoders.ByteLevel()

In [125]:
# tokenize a sample (let us take 10 sentences) and print the results
for i in range(10):
    
    sentence = random.choice(french_sentences) 
    
    print(f"For the following sentence:\n{sentence}")
    
    print("We obtain the following:")
    
    french_encoding = bpe_tokenizer.encode(sentence)
    
    print(f"- Tokens: {french_encoding.tokens}")
    
    print(f"- Ids: {french_encoding.ids}")
    
    print(f"- Attention masks: {french_encoding.attention_mask}")
    
    print(f"- Decoded tokens: {bpe_tokenizer.decode(french_encoding.ids)}")

    print("-------------")
    

For the following sentence:
Nous frappions à nouveau, jusqu’à en avoir mal aux mains, comme si nous combattions un ennemi invisible.
We obtain the following:
- Tokens: ['Nous', 'Ġfrappions', 'ĠÃł', 'Ġnouveau', ',', 'Ġjusqu', 'âĢĻ', 'Ãł', 'Ġen', 'Ġavoir', 'Ġmal', 'Ġaux', 'Ġmains', ',', 'Ġcomme', 'Ġsi', 'Ġnous', 'Ġcombattions', 'Ġun', 'Ġennemi', 'Ġinvisible', '.']
- Ids: [569, 2286, 137, 1134, 5, 486, 103, 129, 147, 1596, 450, 377, 1087, 5, 303, 403, 262, 8790, 185, 4249, 3049, 7]
- Attention masks: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
- Decoded tokens: Nous frappions à nouveau, jusqu’à en avoir mal aux mains, comme si nous combattions un ennemi invisible.
-------------
For the following sentence:
Cela va de la frontière avec le Cameroun sous mandat français, au sud-est, jusqu’aux confins de l’Adamawa au nord, et comprend la plus grande partie des chefferies et des petits royaumes qui ont échappé à l’autorité directe de l’Angleterre après le départ des Allem

We can also encode a pair of sentences as follows:

In [126]:
random.seed(100)

# use a new name so that the `sentences` DataFrame loaded earlier is not overwritten
sentence_pair = [random.choice(french_sentences) for _ in range(2)]

encoding = bpe_tokenizer.encode(*sentence_pair)

print(sentence_pair)
print(encoding.tokens)
print(encoding.ids)

['Nous frappions à nouveau, jusqu’à en avoir mal aux mains, comme si nous combattions un ennemi invisible.', 'Cela va de la frontière avec le Cameroun sous mandat français, au sud-est, jusqu’aux confins de l’Adamawa au nord, et comprend la plus grande partie des chefferies et des petits royaumes qui ont échappé à l’autorité directe de l’Angleterre après le départ des Allemands : Kantu, Abong, Nkom, Bum, Foumban, Bali.']
['Nous', 'Ġfrappions', 'ĠÃł', 'Ġnouveau', ',', 'Ġjusqu', 'âĢĻ', 'Ãł', 'Ġen', 'Ġavoir', 'Ġmal', 'Ġaux', 'Ġmains', ',', 'Ġcomme', 'Ġsi', 'Ġnous', 'Ġcombattions', 'Ġun', 'Ġennemi', 'Ġinvisible', '.', 'Cela', 'Ġva', 'Ġde', 'Ġla', 'ĠfrontiÃ¨re', 'Ġavec', 'Ġle', 'ĠCameroun', 'Ġsous', 'Ġmandat', 'ĠfranÃ§ais', ',', 'Ġau', 'Ġsud', '-', 'est', ',', 'Ġjusqu', 'âĢĻ', 'aux', 'Ġconfins', 'Ġde', 'Ġl', 'âĢĻ', 'Adamawa', 'Ġau', 'Ġnord', ',', 'Ġet', 'Ġcomprend', 'Ġla', 'Ġplus', 'Ġgrande', 'Ġpartie', 'Ġdes', 'Ġchefferies', 'Ġet', 'Ġdes', 'Ġpetits', 'Ġroyaumes', 'Ġqui', 'Ġont', 'ĠÃ©chappÃ©

In [127]:
bpe_tokenizer.get_vocab_size()

9550

The vocabulary learned from the French corpus contains 9550 tokens.

2. On the Wolof corpus (in what follows, we will handle each corpus separately)

In [128]:
bpe_tokenizer.train_from_iterator(load_wolof_sentences(), trainer=bpe_trainer)


In [129]:
# tokenize a sample (let us take 10 sentences) and print the results
for i in range(10):
    
    sentence = random.choice(wolof_sentences) 
    
    print(f"For the following sentence:\n{sentence}")
    
    print("We obtain the following:")
    
    wolof_encoding = bpe_tokenizer.encode(sentence)
    
    print(f"- Tokens: {wolof_encoding.tokens}")
    
    print(f"- Ids: {wolof_encoding.ids}")
    
    print(f"- Attention masks: {wolof_encoding.attention_mask}")
    
    print(f"- Decoded tokens: {bpe_tokenizer.decode(wolof_encoding.ids)}")

    print("-------------")
    

For the following sentence:
Ci weeru mars 1932 la samay way-jur jóge Forestry House ca Bamendaa, dem dëkki Bansoo, fa nguur gi waroon a tabax ab fajukaay.
We obtain the following:


PanicException: no entry found for key

We get an error indicating that some Wolof tokens cannot be found. A likely cause is that this tokenizer had already been trained on the French corpus: retraining it on the Wolof corpus appears to have reached the 10000-token limit after adding only some of the Wolof tokens. Let us train a separate tokenizer for the Wolof corpus.

In [17]:
bpe_tokenizer2 = Tokenizer(models.BPE())

# add a byte-level pre-tokenizer and specify that no space is added before the first word
bpe_tokenizer2.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

bpe_tokenizer2.train_from_iterator(load_wolof_sentences(), trainer=bpe_trainer)

bpe_tokenizer2.decoder = decoders.ByteLevel()

In [18]:
# tokenize a sample (let us take 10 sentences) and print the results
for i in range(10):
    
    sentence = random.choice(wolof_sentences) 
    
    print(f"For the following sentence:\n{sentence}")
    
    print("We obtain the following:")
    
    wolof_encoding = bpe_tokenizer2.encode(sentence)
    
    print(f"- Tokens: {wolof_encoding.tokens}")
    
    print(f"- Ids: {wolof_encoding.ids}")
    
    print(f"- Attention masks: {wolof_encoding.attention_mask}")
    
    print(f"- Decoded tokens: {bpe_tokenizer2.decode(wolof_encoding.ids)}")

    print("-------------")
    

For the following sentence:
Bu ñu agsee ci dëkk, ndawal buur bee leen di teerusi, dalal leen ca pénc ma, ñu portalewu ak buur beek i dagam.
We obtain the following:
- Tokens: ['Bu', 'ĠÃ±u', 'Ġagsee', 'Ġci', 'ĠdÃ«kk', ',', 'Ġndawal', 'Ġbuur', 'Ġbee', 'Ġleen', 'Ġdi', 'Ġteerusi', ',', 'Ġdalal', 'Ġleen', 'Ġca', 'ĠpÃ©nc', 'Ġma', ',', 'ĠÃ±u', 'Ġportalewu', 'Ġak', 'Ġbuur', 'Ġbeek', 'Ġi', 'Ġdagam', '.']
- Ids: [1269, 154, 3381, 116, 239, 4, 3241, 1330, 1100, 205, 133, 3501, 4, 1353, 205, 189, 3634, 126, 4, 154, 6139, 144, 1330, 958, 175, 4568, 6]
- Attention masks: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
- Decoded tokens: Bu ñu agsee ci dëkk, ndawal buur bee leen di teerusi, dalal leen ca pénc ma, ñu portalewu ak buur beek i dagam.
-------------
For the following sentence:
Ñax mi muur joor gi dafa bare ba may fàttali géej, xóot te raglu ni moom.
We obtain the following:
- Tokens: ['Ãĳax', 'Ġmi', 'Ġmuur', 'Ġjoor', 'Ġgi', 'Ġdafa', 'Ġbare', 'Ġba', 'Ġmay',

In [20]:
bpe_tokenizer2.get_vocab_size()

7034

The vocabulary learned from the Wolof corpus contains 7034 tokens.