Text augmentation with `nlpaug`
-------------------

`nlpaug` is a library that we can use to transform the sentences in order to augment the vocabulary size. A tutorial is available on Medium at the following link: [nlpaug_examples](https://towardsdatascience.com/text-augmentation-in-few-lines-of-python-code-cdd10cf3cf84).

We will test the following methods:

- `KeyboardAug`: It replaces a random character with one that is close to it on the keyboard.
- `RandomCharAug`: It replaces a character with another random character.

Most of the other methods require a model to work correctly. We can still test the following method:
- `TfIdfAug`: It uses TF-IDF to decide which words to augment. We will need to train a tf-idf model on the tokens of each corpus before using it.
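
To make the difference between the two character-level methods concrete, here is a minimal sketch written in plain Python rather than with `nlpaug`. The `NEIGHBOURS` map is a tiny, hypothetical excerpt of an AZERTY layout, and both functions are illustrative stand-ins, not the library's implementation.

```python
import random
import string

# Hypothetical excerpt of an AZERTY neighbour map (assumption, for illustration only).
NEIGHBOURS = {
    "a": "zq",
    "e": "zr",
    "n": "bj",
    "o": "ip",
    "t": "ry",
}

def keyboard_substitute(word: str, rng: random.Random) -> str:
    """Replace one character by a key that sits next to it on the keyboard."""
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
    if not positions:
        return word  # no character we know how to replace
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBOURS[word[i]]) + word[i + 1:]

def random_substitute(word: str, rng: random.Random) -> str:
    """Replace one character by any other lowercase ASCII letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    candidates = [c for c in string.ascii_lowercase if c != word[i]]
    return word[:i] + rng.choice(candidates) + word[i + 1:]

rng = random.Random(0)
print(keyboard_substitute("campement", rng))
print(random_substitute("campement", rng))
```

The keyboard variant only produces typos a person could plausibly make, while the random variant can pick any letter.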

### Steps

We will follow these steps:

- Create a custom dataset that takes a sentence transformer as an argument
- Train a tf-idf model on each corpus using the `nlpaug` library
- Use the `KeyboardAug` method to augment the data (we will need to find the best parameters)
- Use the `RandomCharAug` method to augment the data (we will also need to find the best parameters)
- Use the `TfIdfAug` method to augment the data (it may be the best approach)
- Choose the best method and try to find the ideal parameters
- Create a new data frame containing the augmented sentences

We must import the following libraries.

In [1]:
try:
    import nlpaug.augmenter.word as naw
    import nlpaug.augmenter.char as nac
    import nlpaug.model.word_stats as nmw
except ImportError:
    !pip install nlpaug
    import nlpaug.augmenter.word as naw
    import nlpaug.augmenter.char as nac
    import nlpaug.model.word_stats as nmw

import re
import torch
import random
import pandas as pd
from tokenizers import Tokenizer
from torch.utils.data import DataLoader



In [2]:
# install the package when necessary
!pip install -e wolof-translate -qq

### Create a custom dataset to load sentences

We will use the BPE tokenizer that we created to fine-tune GPT-2 on the sentences in our custom dataset. The text augmentation method will be provided through a transformer argument, one for each corpus. We can chain multiple transformations if necessary; in that case we will provide several transformers (in other words, the transformer is given as a sequence).

Let us create a handy class to collect the transformers and apply them to the sentences more easily.

In [3]:
%%writefile wolof-translate/wolof_translate/utils/sent_transformers.py
from typing import List, Union

class TransformerSequences:
    
    def __init__(self, *args, **kwargs):
        
        self.transformers = []
        
        self.transformers.extend(list(args))
        
        self.transformers.extend(list(kwargs.values()))
    
    def __call__(self, sentences: Union[List, str]):
        
        output = sentences
        
        for transformer in self.transformers:
            
            if hasattr(transformer, "augment"):
                
                output = transformer.augment(output)
            
            else:
                
                output = transformer(output)
            
        return output
        

Overwriting wolof-translate/wolof_translate/utils/sent_transformers.py


In [4]:
%run wolof-translate/wolof_translate/utils/sent_transformers.py
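
As a quick check of how the class chains its transformers, the sketch below repeats a compact copy of `TransformerSequences` so that it runs on its own. `UpperCaseAug` and `strip_all` are hypothetical stand-ins: the first mimics an object exposing an `augment` method that returns a list, the second is a plain callable.

```python
from typing import List, Union

class TransformerSequences:
    """Compact copy of the class defined above, repeated so the sketch is self-contained."""

    def __init__(self, *args, **kwargs):
        self.transformers = list(args) + list(kwargs.values())

    def __call__(self, sentences: Union[List, str]):
        output = sentences
        for transformer in self.transformers:
            if hasattr(transformer, "augment"):
                output = transformer.augment(output)
            else:
                output = transformer(output)
        return output

class UpperCaseAug:
    """Hypothetical augmenter with an `augment` method that returns a list."""

    def augment(self, data):
        sentences = [data] if isinstance(data, str) else data
        return [s.upper() for s in sentences]

def strip_all(sentences):
    """Plain callable applied after the augmenter."""
    return [s.strip() for s in sentences]

pipeline = TransformerSequences(UpperCaseAug(), cleaner=strip_all)
print(pipeline("  nanga def  "))  # ['NANGA DEF']
```

Positional transformers run first, followed by the keyword ones, in the order they were given. Because the stand-in augmenter returns a list, every callable placed after it has to accept a list.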

Below is the custom dataset. Note that the maximum length we identified earlier is `379`.

In [5]:
%%writefile wolof-translate/wolof_translate/data/dataset_v1.py
from wolof_translate.utils.sent_transformers import TransformerSequences
from transformers import PreTrainedTokenizerFast
from torch.utils.data import Dataset
from tokenizers import Tokenizer
from typing import Union
import pandas as pd
import torch
import re

class SentenceDataset(Dataset):
 
    def __init__(self,
                 file_path: str, 
                 corpus_1: str = "french_corpus",
                 corpus_2: str = "wolof_corpus",
                 tokenizer_path: str = "wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                 max_len: int = 379,
                 truncation: bool = False,
                 file_sep: str = ",", 
                 cls_token: str = "<|endoftext|>",
                 sep_token: str = "<|translateto|>",
                 pad_token: str = "<|pad|>",
                 cp1_transformer: Union[TransformerSequences, None] = None,
                 cp2_transformer: Union[TransformerSequences, None] = None,
                 **kwargs):
        
        # let us recuperate the data frame
        self.__sentences = pd.read_csv(file_path, sep=file_sep, **kwargs)
        
        # let us recuperate the tokenizer
        self.tokenizer = PreTrainedTokenizerFast(
            tokenizer_file=tokenizer_path,
            bos_token=cls_token,
            eos_token=cls_token,
            pad_token=pad_token
            )
        
        # recuperate the first corpus' sentences
        self.__sentences_1 = self.__sentences[corpus_1].to_list()
        
        # recuperate the second corpus' sentences
        self.__sentences_2 = self.__sentences[corpus_2].to_list()
        
        # recuperate the special tokens
        self.cls_token = cls_token
        
        self.sep_token = sep_token
        
        self.pad_token = pad_token
        
        # recuperate the length
        self.__length = len(self.__sentences_1)
        
        # recuperate the max id
        self.max_id = len(self.tokenizer) - 1
        
        # let us recuperate the max len
        self.max_len = max_len
        
        # let us recuperate the truncate argument
        self.truncation = truncation
        
        # let us initialize the transformer
        self.cp1_transformer = cp1_transformer
        
        self.cp2_transformer = cp2_transformer
        
    def __getitem__(self, index):
        
        sentence_1 = self.__sentences_1[index]
        
        sentence_2 = self.__sentences_2[index]
        
        # apply transformers if necessary
        if self.cp1_transformer is not None:

            sentence_1 = self.cp1_transformer(sentence_1)

        if self.cp2_transformer is not None:

            sentence_2 = self.cp2_transformer(sentence_2)
        
        # let us create the sentence with special tokens
        sentence = f"{self.cls_token}{sentence_1}{self.sep_token}{sentence_2}{self.cls_token}"
        
        # let us encode the sentence
        encoding = self.tokenizer(sentence, truncation=self.truncation, max_length=self.max_len, padding='max_length', return_tensors="pt")
        
        return encoding.input_ids.squeeze(0), encoding.attention_mask.squeeze(0)
        
    def __len__(self):
        
        return self.__length
    
    def decode(self, ids: torch.Tensor, for_prediction: bool = False):
        
        if ids.ndim < 2:
            
            ids = ids.unsqueeze(0)
        
        ids = ids.tolist()
        
        for id in ids:
            
            sentence = self.tokenizer.decode(id)

            if not for_prediction:
            
                sentence = sentence.split(f"{self.sep_token}")
            
            else:
                
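                try:

                    # escape the separator: it contains `|`, a regex metacharacter
                    pattern = re.escape(self.sep_token) + "(.*)"

                    while self.sep_token in sentence:

                        sentence = re.findall(pattern, sentence)[-1]

                except IndexError:

                    sentence = "None"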
            
            if for_prediction:
                
                yield sentence.replace(f'{self.cls_token}', '').replace(f'{self.pad_token}', '')
            
            else:
                
                sents = []
                
                for sent in sentence:
                    
                    sents.append(sent.replace(f'{self.cls_token}', '').replace(f'{self.pad_token}', ''))
                    
                yield sents

Overwriting wolof-translate/wolof_translate/data/dataset_v1.py


In [6]:
%run wolof-translate/wolof_translate/data/dataset_v1.py
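
The layout of one training example and the two decoding modes can be followed without the tokenizer. The sketch below works on plain strings only; `build`, `split_pair` and `target_only` are hypothetical helper names that mirror what `__getitem__` and `decode` do, and the sentences are placeholders, not a real translation pair.

```python
import re

CLS, SEP, PAD = "<|endoftext|>", "<|translateto|>", "<|pad|>"

def build(sentence_1: str, sentence_2: str) -> str:
    """Mirror the f-string used in `__getitem__`."""
    return f"{CLS}{sentence_1}{SEP}{sentence_2}{CLS}"

def split_pair(decoded: str) -> list:
    """Like `for_prediction=False`: split on the separator, drop special tokens."""
    return [part.replace(CLS, "").replace(PAD, "") for part in decoded.split(SEP)]

def target_only(decoded: str) -> str:
    """Like `for_prediction=True`: keep only the text after the separator."""
    # re.escape keeps the `|` in the token from being read as regex alternation
    found = re.findall(re.escape(SEP) + r"(.*)", decoded)
    text = found[-1] if found else ""
    return text.replace(CLS, "").replace(PAD, "")

# placeholder sentences followed by two padding tokens
raw = build("Bonjour.", "Salaam aleekum.") + PAD * 2
print(split_pair(raw))   # ['Bonjour.', 'Salaam aleekum.']
print(target_only(raw))  # Salaam aleekum.
```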

### Train a `tf-idf` model

In [7]:
# load the corpora
corpora = pd.read_csv("data/extractions/new_data/sent_extraction.csv")

Let us train a `tf-idf` model on the French corpus and save it:

In [8]:
# let us take the corpus
french_corpus = corpora['french_corpus'].tolist()

# let us create a new tokenizer to train the tf-idf model. The tokenizer is taken from https://github.com/makcedward/nlpaug/blob/master/example/tfidf-train_model.ipynb
def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    
    token_pattern = re.compile(token_pattern)
    
    return token_pattern.findall(text)

french_tokens = [_tokenizer(sent) for sent in french_corpus]

# let us load the model, train it on the corpus and save it
tfidf_model = nmw.TfIdf()

tfidf_model.train(french_tokens)

tfidf_model.save("wolof-translate/wolof_translate/models/french")

Let us do the same for the Wolof corpus:

In [9]:
# let us take the corpus
wolof_corpus = corpora['wolof_corpus'].tolist()

wolof_tokens = [_tokenizer(sent) for sent in wolof_corpus]

# let us load the model, train it on the corpus and save it
tfidf_model = nmw.TfIdf()

tfidf_model.train(wolof_tokens)

tfidf_model.save("wolof-translate/wolof_translate/models/wolof")
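
To give an idea of the statistic that such a model stores, here is a small hand-written sketch on a toy corpus. It uses one common variant of the formula (term frequency multiplied by `log(N / df)`), which is an assumption and not necessarily the exact weighting used by `nlpaug`; the function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text, token_pattern=r"(?u)\b\w\w+\b"):
    """Same pattern as `_tokenizer` above: words of two or more characters."""
    return re.findall(token_pattern, text)

def idf_scores(tokenized_docs):
    """Inverse document frequency, log(N / df), for every token of the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def tfidf(doc_tokens, idf):
    """Term frequency inside one document times the corpus-level IDF."""
    counts = Counter(doc_tokens)
    return {tok: (n / len(doc_tokens)) * idf[tok] for tok, n in counts.items()}

toy_corpus = [
    "Il prend le train.",
    "Il flâne en ville.",
    "Il va voir les navires.",
]
docs = [tokenize(sentence) for sentence in toy_corpus]
idf = idf_scores(docs)
scores = tfidf(docs[0], idf)

# words that occur in every sentence (here "Il") get a score of 0
for token, score in scores.items():
    print(f"{token}: {score:.3f}")
```

A word that appears in every sentence carries no information under this weighting, whereas a word that appears in a single sentence gets the highest score.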

Before testing the tf-idf augmenter, let us test the two other methods.

### Keyboard augmenter

The only parameter that we change is the language, which we set to French (`lang='fr'`); it only determines the keyboard layout.

In [10]:
# let us load two augmenters. One for each corpus.
cp1_aug = nac.KeyboardAug(name='Keyboard_Aug', lang='fr')
cp2_aug = nac.KeyboardAug(name='Keyboard_Aug', lang='fr')

# let us provide the augmenters to the class that we created earlier 
cp1_transformer = TransformerSequences(cp1_aug)
cp2_transformer = TransformerSequences(cp2_aug)




To test the augmenters let us initialize the dataset with transformers and apply them to the same sentences directly chosen from the corpora.

In [39]:
sent_dataset = SentenceDataset("data/extractions/new_data/sent_extraction.csv",
                               cp1_transformer=cp1_transformer,
                               cp2_transformer=cp2_transformer)

# randomly choose two french sentences and two wolof sentences
random.seed(50)

french_sents = [random.choice(french_corpus) for i in range(2)]

wolof_sents = [random.choice(wolof_corpus) for i in range(2)]

# let us print the true sentences and their transformed versions
print("On french corpus:")

for sent in french_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")

print("---------------------------------")

print("On wolof corpus:")

for sent in wolof_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp2_transformer(sent)[0]}")
    
    print("-----------------")


On french corpus:
True sentence -> Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.
Augmented sentence -> Ils vont de d1mpemfnt en campement, XaBs des villages dog4 mon père bore les boma sur sa Varye: Nikom, BahuGbo, Nji N9Jom, Luakom jdÈe, Ngi, Obuohn.
-----------------
True sentence -> Il prend le train, débarque à Southampton, s'installe dans une pension. Son service ne débutant que trois jours plus tard, il flâne en ville, va voir les navires en partance.
Augmented sentence -> Il Çrejd le t3aiG, dAb&'que à Southampton, s ' iJstamlz dsnD une pension. Son service ne EéFutaBt que Yroks jours plus tatF, il rlâHe en ville, va Co!r les navires en partance.
-----------------
---------------------------------
On wolof corpus:
True sentence -> Garab yu néew lañuy yóbbale, diy puudar ak i toccami yuy nirook saafaray jibar yi.
Augmented sentence -> Ta'ab yu n3"w lqñuu yóbbale, diy puudar ak i

We notice that it replaces letters in a large number of words with neighboring keys. In addition, the quotation marks (guillemets) and hyphens end up separated from their letters by spaces. We must decrease the probability of modifying a word or the maximum number of modifications.

### Random augmenter

For that augmenter we will keep the default parameters for the moment.

In [40]:
# let us load two augmenters. One for each corpus.
cp1_aug = nac.RandomCharAug()
cp2_aug = nac.RandomCharAug()

# let us provide the augmenters to the class that we created earlier 
cp1_transformer = TransformerSequences(cp1_aug)
cp2_transformer = TransformerSequences(cp2_aug)




To test the augmenters let us initialize the dataset with transformers and apply them to the same sentences directly chosen from the corpora.

In [41]:
sent_dataset = SentenceDataset("data/extractions/new_data/sent_extraction.csv",
                               cp1_transformer=cp1_transformer,
                               cp2_transformer=cp2_transformer)

# randomly choose two french sentences and two wolof sentences
random.seed(50)

french_sents = [random.choice(french_corpus) for i in range(2)]

wolof_sents = [random.choice(wolof_corpus) for i in range(2)]

# let us print the true sentences and their transformed versions
print("On french corpus:")

for sent in french_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")

print("---------------------------------")

print("On wolof corpus:")

for sent in wolof_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp2_transformer(sent)[0]}")
    
    print("-----------------")


On french corpus:
True sentence -> Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.
Augmented sentence -> Ils vont de csmpemeMs en campement, YaIs des villages do2L mon père qo3e les romi sur sa Zar4e: Nikom, Ba6uQio, Nji N&koP, Luakom NdP*, Ngi, Obu%u&.
-----------------
True sentence -> Il prend le train, débarque à Southampton, s'installe dans une pension. Son service ne débutant que trois jours plus tard, il flâne en ville, va voir les navires en partance.
Augmented sentence -> Il 8re&d le t+aiM, défrrquE à Southampton, s ' enstalh^ d+n1 une pension. Son service ne zébPtaut que Iroi4 jours plus osrd, il flâLm en ville, va vZiO les navires en partance.
-----------------
---------------------------------
On wolof corpus:
True sentence -> Garab yu néew lañuy yóbbale, diy puudar ak i toccami yuy nirook saafaray jibar yi.
Augmented sentence -> taCab yu nGew lañuy yóbbale, diy puudar ak i

As with the previous augmenter, the quotation marks (guillemets) and hyphens are separated from their letters by a space. We also notice that some ending marks other than periods are attached to the letters just before them. We must also decrease the maximum number of modified words or the probability of modifying a word, because some sentences are very short.

### TF-IDF augmenter

In [42]:
# let us load two augmenters. One for each corpus.
cp1_aug = naw.TfIdfAug("wolof-translate/wolof_translate/models/french/", tokenizer=_tokenizer)
cp2_aug = naw.TfIdfAug("wolof-translate/wolof_translate/models/wolof/", tokenizer=_tokenizer)

# let us provide the augmenters to the class that we created earlier 
cp1_transformer = TransformerSequences(cp1_aug)
cp2_transformer = TransformerSequences(cp2_aug)




To test the augmenters let us initialize the dataset with transformers and apply them to the same sentences directly chosen from the corpora.

In [43]:
sent_dataset = SentenceDataset("data/extractions/new_data/sent_extraction.csv",
                               cp1_transformer=cp1_transformer,
                               cp2_transformer=cp2_transformer)

# randomly choose two french sentences and two wolof sentences
random.seed(50)

french_sents = [random.choice(french_corpus) for i in range(2)]

wolof_sents = [random.choice(wolof_corpus) for i in range(2)]

# let us print the true sentences and their transformed versions
print("On french corpus:")

for sent in french_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")

print("---------------------------------")

print("On wolof corpus:")

for sent in wolof_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")


On french corpus:
True sentence -> Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.
Augmented sentence -> Prestige vont de campement en campement dans des long dont mon bêtes exécutions vaguement noms sur sa carte cuire douloureux leur Nikom Luakom Ndye Ngi Obukun
-----------------
True sentence -> Il prend le train, débarque à Southampton, s'installe dans une pension. Son service ne débutant que trois jours plus tard, il flâne en ville, va voir les navires en partance.
Augmented sentence -> Il prend Guinée fusils débarque Southampton installe dans haie pension Ma service ne geste que trois jours policiers tard il flâne en ville brisant échapper les sculptés en partance
-----------------
---------------------------------
On wolof corpus:
True sentence -> Garab yu néew lañuy yóbbale, diy puudar ak i toccami yuy nirook saafaray jibar yi.
Augmented sentence -> Garab yu néew lañuy yóbbale, 

The sentences produced by the tf-idf augmenter are semantically poor: it replaces words with other words that have a high tf-idf score, which changes the meaning and the context of the sentences. On the Wolof corpus we see almost no change, except that some punctuation marks are deleted. Note that the test loop above passes the Wolof sentences through `cp1_transformer`, which is built from the French model, and this may explain the lack of replacements.

### The best method

The best method to augment the data is either the `random augmenter` or the `keyboard augmenter`. They produce almost the same type of modification, but the second one is more suitable for our task. Let us compare it with another character augmenter to see if we obtain better results.

**Note**: By a better result, we mean that the augmenter neither turns a token into another existing token by changing a letter (which would change the meaning of the sentence) nor makes too few or too many changes.
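
These two criteria can be measured roughly on a pair of sentences. The sketch below is only an illustration: `change_report` is a hypothetical helper, it assumes whitespace tokenization and an unchanged number of words, and the vocabulary is a toy set.

```python
def change_report(original: str, augmented: str, vocabulary: set) -> dict:
    """Count the modified words and those that became another known word."""
    pairs = list(zip(original.split(), augmented.split()))
    changed = [(o, a) for o, a in pairs if o != a]
    # a collision is a modified word that is itself part of the vocabulary
    collisions = [(o, a) for o, a in changed if a in vocabulary]
    return {
        "changed_ratio": len(changed) / len(pairs) if pairs else 0.0,
        "collisions": collisions,
    }

# toy vocabulary (assumption): the words of the sentence plus "tarte"
vocabulary = {"il", "note", "les", "noms", "sur", "sa", "carte", "tarte"}

report = change_report(
    "il note les noms sur sa carte",
    "il no^e les noms sur sa tarte",
    vocabulary,
)
print(report["changed_ratio"])  # 2 modified words out of 7
print(report["collisions"])     # [('carte', 'tarte')]
```

A high `changed_ratio` signals too many modifications, and every entry in `collisions` is a word whose meaning may have changed.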

The other augmenter that we want to test is the `OCR augmenter`. It reproduces the sentences with the kind of character errors that optical character recognition makes. Let us test it as we did with the previous augmenters.

In [44]:
# let us load two augmenters. One for each corpus.
cp1_aug = nac.OcrAug()
cp2_aug = nac.OcrAug()

# let us provide the augmenters to the class that we created earlier 
cp1_transformer = TransformerSequences(cp1_aug)
cp2_transformer = TransformerSequences(cp2_aug)




To test the augmenters let us initialize the dataset with transformers and apply them to the same sentences directly chosen from the corpora.

In [45]:
sent_dataset = SentenceDataset("data/extractions/new_data/sent_extraction.csv",
                               cp1_transformer=cp1_transformer,
                               cp2_transformer=cp2_transformer)

# randomly choose two french sentences and two wolof sentences
random.seed(50)

french_sents = [random.choice(french_corpus) for i in range(2)]

wolof_sents = [random.choice(wolof_corpus) for i in range(2)]

# let us print the true sentences and their transformed versions
print("On french corpus:")

for sent in french_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")

print("---------------------------------")

print("On wolof corpus:")

for sent in wolof_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp2_transformer(sent)[0]}")
    
    print("-----------------")


On french corpus:
True sentence -> Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.
Augmented sentence -> Ils vont de campement en campement, dan8 des villages d0nt mon père note les num8 sur 8a carte: Nikom, Babungo, Nji Nikom, Luakom Ndye, N9i, Obokon.
-----------------
True sentence -> Il prend le train, débarque à Southampton, s'installe dans une pension. Son service ne débutant que trois jours plus tard, il flâne en ville, va voir les navires en partance.
Augmented sentence -> Il prend le train, débarque à Southampton, s ' installe dan8 une pen8i0n. Son 8ekvice ne débutant que trui8 j0ors plus tard, il f1âne en ville, va voir les navires en partance.
-----------------
---------------------------------
On wolof corpus:
True sentence -> Garab yu néew lañuy yóbbale, diy puudar ak i toccami yuy nirook saafaray jibar yi.
Augmented sentence -> Garab yo néew lañuy yóbbale, diy poodar ak i

It seems to give good results on the French corpus, but it barely modifies the Wolof sentences.

**The best choice remains the `keyboard augmenter`. We can combine it with the `random augmenter` to obtain more variability.**

Let us decrease the probability of modifying a word (and a character) to **0.2** in the `keyboard augmenter` and rerun the test. We will also apply functions that reattach some marks to their letters.

**Note**: We will use the handy functions we created to add corrections after extracting the sentences. Those functions can also be used on the sentences generated with the GAN model (in preparation).
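
The helpers themselves live in `wolof_translate.utils.sent_corrections` and are not reproduced here. As a rough idea of the kind of repair involved, the sketch below reattaches an apostrophe to the letters around it; `rejoin_apostrophes` is a hypothetical stand-in and not the implementation of `remove_mark_space` or `delete_guillemet_space`.

```python
import re

def rejoin_apostrophes(sentence: str) -> str:
    """Remove the spaces left around an apostrophe that sits between two letters."""
    return re.sub(r"(\w)\s*'\s*(\w)", r"\1'\2", sentence)

print(rejoin_apostrophes("il flâne, s ' installe dans une pension"))
# il flâne, s'installe dans une pension
```

This simple pattern assumes that apostrophes only occur inside words; it would wrongly glue an opening single quote to the word before it.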

In [15]:
# import some functions from the utils
try:
    from wolof_translate.utils.sent_corrections import *
except ImportError:
    !pip install wolof-translate
    from wolof_translate.utils.sent_corrections import *

# let us load two augmenters. One for each corpus.
cp1_aug = nac.KeyboardAug(aug_word_p=0.2, aug_char_p=0.2)
cp2_aug = nac.KeyboardAug(aug_word_p=0.2, aug_char_p=0.2)

# let us provide the augmenters to the class that we created earlier 
cp1_transformer = TransformerSequences(cp1_aug, remove_mark_space, delete_guillemet_space)
cp2_transformer = TransformerSequences(cp2_aug, remove_mark_space, delete_guillemet_space)




To test the augmenters let us initialize the dataset with transformers and apply them to the same sentences directly chosen from the corpora.

In [16]:
sent_dataset = SentenceDataset("data/extractions/new_data/sent_extraction.csv",
                               cp1_transformer=cp1_transformer,
                               cp2_transformer=cp2_transformer)

# randomly choose two french sentences and two wolof sentences
random.seed(50)

french_sents = [random.choice(french_corpus) for i in range(2)]

wolof_sents = [random.choice(wolof_corpus) for i in range(2)]

# let us print the true sentences and their transformed versions
print("On french corpus:")

for sent in french_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp1_transformer(sent)[0]}")
    
    print("-----------------")

print("---------------------------------")

print("On wolof corpus:")

for sent in wolof_sents:
    
    print(f"True sentence -> {sent}")
    
    print(f"Augmented sentence -> {sent_dataset.cp2_transformer(sent)[0]}")
    
    print("-----------------")


On french corpus:
True sentence -> Ils vont de campement en campement, dans des villages dont mon père note les noms sur sa carte : Nikom, Babungo, Nji Nikom, Luakom Ndye, Ngi, Obukun.
Augmented sentence -> Ils vont de cam)eHent en campement, dans des villages doGt mon père no^e les n0ms sur sa carte: Nikom, vavungo, Nji Nikim, Luakom Ndye, Ngi, kbukum.
-----------------
True sentence -> Il prend le train, débarque à Southampton, s'installe dans une pension. Son service ne débutant que trois jours plus tard, il flâne en ville, va voir les navires en partance.
Augmented sentence -> Il pgend le trxin, débarque à So8tMamlton, s'ijstalpe danQ une penc*on. Son service ne débutant que trois jo6rs plus tard, il elâne en ville, va voir les navires en partance.
-----------------
---------------------------------
On wolof corpus:
True sentence -> Garab yu néew lañuy yóbbale, diy puudar ak i toccami yuy nirook saafaray jibar yi.
Augmented sentence -> GaraN yu néew lañuy yóGba;e, diy puudar ak i t

Finally, let us load 10 sentences with a PyTorch `DataLoader` and decode them.

In [17]:
sentences, _ = next(iter(DataLoader(sent_dataset, batch_size=10, shuffle=True)))  # the second returned element is the attention mask

for sent in sent_dataset.decode(sentences):
    
    print(sent)

["w tjourd'hui, j'existe, je vkyafe, j'ai à mon tour Gondé une famllld, je me sHis enraciné dans d'xutrSs lifux.", "'Tey, Nënal naa samX Gopp, am soxna ak i doom, Cëkke yéeg riopëlaan ak a wàXc, am na feJeRn fu ma sañ ni fa la fekk baax.'"]
["Il s'agit v3XksemblablemeBt d'obIetW laiZséz là par un précédent occupant, car cela ne ressemble pas à ce que mon pèe pokvQit redhercheD.", "'Xam naa ni fa la ko Bazy fekk ndax ni ma ko xam3, nataal boobu w(roo na lo(l ak gis-gsu Afrigam.'"]
["'À kgoja, les Lnsectez étaient partout.'", "'Xeetu Nunó)r wu waqy mën a xalaat a nga woon (gosqa.'"]
["'Quelque chose de nonchalant et de graci3uC, en même temps de très ancien, qui évoque les temps bin,iques, ou bien les caravanes des ToJwreg, où les femmfq vo7aFent à teavfrs le désert accridhées daMs des nacelles aux flancs des dromSRares.'", "'Danga naan yaram wépp a ngi noyyi, mu féex ba bëgg a dee. Soo moytuwul, jaawale kook natwa, yooyu bawoo ca jSminoy maamaati-maam ya. Walla mu fàtraOi la jeegi Tuwaa

### Create a dataset with augmented sentences

If adding the augmentation when loading the sentences is insufficient to produce accurate results, we will use a dataset with augmented sentences. 
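
One possible way to build such a dataset is sketched below in plain Python. Everything in it is an assumption made for illustration: `swap_adjacent` is a toy augmenter standing in for the `nlpaug` ones, `augment_rows` is a hypothetical helper, and the two sentences form a placeholder row, not a real translation pair.

```python
import random

def swap_adjacent(sentence: str, rng: random.Random) -> str:
    """Toy character-level augmenter: swap two neighbouring characters."""
    if len(sentence) < 2:
        return sentence
    i = rng.randrange(len(sentence) - 1)
    chars = list(sentence)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def augment_rows(rows, augment, n_copies=1, seed=0):
    """Return the original rows followed by `n_copies` augmented versions of each."""
    rng = random.Random(seed)
    out = [dict(row, augmented=False) for row in rows]
    for row in rows:
        for _ in range(n_copies):
            out.append({
                "french_corpus": augment(row["french_corpus"], rng),
                "wolof_corpus": augment(row["wolof_corpus"], rng),
                "augmented": True,
            })
    return out

# placeholder row: the two sentences are not translations of each other
rows = [{"french_corpus": "Il prend le train.", "wolof_corpus": "Garab yu néew."}]

augmented_rows = augment_rows(rows, swap_adjacent, n_copies=2)
for row in augmented_rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` and saved with `to_csv`, so that `SentenceDataset` can read it like the original file. Keeping an `augmented` flag makes it easy to exclude the generated rows from a validation split.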