First Training with the GPT-2 decoder 🤖
-----------------------------------

In this notebook, we will train the pre-trained GPT-2 model provided by OPEN-AI. It only tests how the model can be accurate on the corpora we extracted. It is undoubtedly a partial model. For the prediction, we will follow the fine-tuning tutorial available at the following medium link [fine-tuning-transformers](https://medium.com/towards-data-science/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e) and add hyperparameter search with the `wandb` library.


We will make this with and without augmentation and see where we obtain better results. 

In [55]:
# define environment
%env WANDB_PROJECT=gpt2_french_wolof_sweeps
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
%env WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee

env: WANDB_PROJECT=gpt2_french_wolof_sweeps
env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
env: WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee


In [42]:
!pip install -qq wandb --upgrade

In [43]:
# !pip install evaluate -qq
# !pip install sacrebleu -qq
# !pip install optuna -qq
# !pip install transformers -qq 
# !pip install tokenizers -qq
# !pip install nlpaug -qq
# !pip install ray[tune] -qq

In [56]:
# let us import all necessary libraries
from wolof_translate.utils.sent_transformers import TransformerSequences
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer
from wolof_translate.data.dataset_v1 import SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import torch


We will create two models: 

- One translating the french corpus to a wolof corpus [french_to_wolof](#french-to-wolof)
- One translating the wolof corpus to a french corpus [wolof_to_french](#wolof-to-french)

--------------

## French to wolof

### Configure dataset 🔠

We can use the same custom dataset that we created in [text_augmentation](text_augmentation.ipynb). But we need to split the data between train and test sets and save them.

In [57]:
# # load the corpora and split into train and test sets
# corpora = pd.read_csv("sent_extraction.csv")

# train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=50)

# # let us save the sets
# train_set.to_csv("train_set.csv", index=False)

# test_set.to_csv("data/extractions/new_data/test_set.csv", index=False)

Let us recuperate the datasets with and without augmentation.

In [58]:
# without augmentation
train_dataset = SentenceDataset("data/extractions/new_data/train_set.csv")

test_dataset = SentenceDataset("data/extractions/new_data/test_set.csv")

# with augmentation
fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=0.2, aug_word_p=0.2),
                                       remove_mark_space, delete_guillemet_space)

wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=0.2, aug_word_p=0.2),
                                       remove_mark_space, delete_guillemet_space)

train_dataset_aug = SentenceDataset("data/extractions/new_data/train_set.csv", 
                                cp1_transformer=fr_augmentation, truncation=True,
                                cp2_transformer=wf_augmentation, max_len=579)

test_dataset_aug = SentenceDataset("data/extractions/new_data/test_set.csv",
                               cp1_transformer=fr_augmentation, truncation=True,
                               cp2_transformer=wf_augmentation, max_len=579)

### Configure hyperparameter search ⚙️

We have to configure the search space and the search method ("random" in our case). .

In [59]:
import wandb
wandb.login(key="237a8450cd2568ea1c8e1f8e0400708e79b6b4ee")

# hyperparameters
sweep_config = {
    'method': 'random',
    'metric':
    {
      'goal': 'minimize',
      'name': 'eval_loss'
    },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [2, 3, 5]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
      'dataset_aug': {
          'values': ['True', 'False']
      }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "gpt2-french-wolof-sweeps")



[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\Oumar Kane/.netrc


Create sweep with ID: guoc9d57
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-french-wolof-sweeps/sweeps/guoc9d57


### Configure the model and the evaluation function ⚙️

Let us recuperate the model and resize the token embeddings.

In [60]:
def gpt2_model_init():
  # set the mode name
  model_name = "gpt2"

  # recuperate the tokenizer from the dataset
  tokenizer = train_dataset.tokenizer

  # configure the model
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric.

In [61]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
        preds, labels = eval_preds.preds.detach().cpu(), labels.detach().cpu()
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            result = self.metric.compute(predictions=predictions, references=labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        wandb.log("bleu", result["bleu"])
            
        return result

In [62]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [63]:
translation_eval = TranslationEvaluation(test_dataset.tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [64]:
def data_collator(batch):
    """Generate a batch of data to provide to trainer

    Args:
        batch (_type_): The batch

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    labels = torch.stack([b[0] for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and make random search.

In [65]:
%%wandb

def train(config = None):

  with wandb.init(config = config):
    
    global train_dataset

    global test_dataset

    # seed
    torch.manual_seed(50)

    # set sweep configuration
    config = wandb.config

    if config.dataset_aug == 'True':

      train_dataset = train_dataset_aug

      test_dataset = test_dataset_aug

    # set training arguments
    training_args = TrainingArguments(f"{path}training1/results",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      # logging_steps=100,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=3,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      logging_dir=f'{path}gpt2_training_logs1',
                                      remove_unused_columns = False,
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Trainer(model_init=gpt2_model_init,
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      compute_metrics=translation_eval.compute_metrics
                      )

    # start training loop
    trainer.train()

wandb.agent(sweep_id, train, count = 10)


[34m[1mwandb[0m: Agent Starting Run: 5aoyf0bi with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	dataset_aug: True
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 1.3094777168952637e-05
[34m[1mwandb[0m: 	weight_decay: 0.1


Run 5aoyf0bi errored: NameError("name 'path' is not defined")
[34m[1mwandb[0m: [32m[41mERROR[0m Run 5aoyf0bi errored: NameError("name 'path' is not defined")
[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: kyyt30ov with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	dataset_aug: False
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.3559839165982405e-05
[34m[1mwandb[0m: 	weight_decay: 0.5


Run kyyt30ov errored: NameError("name 'path' is not defined")
[34m[1mwandb[0m: [32m[41mERROR[0m Run kyyt30ov errored: NameError("name 'path' is not defined")
[34m[1mwandb[0m: Agent Starting Run: fhznobej with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	dataset_aug: False
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.720476295351395e-05
[34m[1mwandb[0m: 	weight_decay: 0.2


Run fhznobej errored: NameError("name 'path' is not defined")
[34m[1mwandb[0m: [32m[41mERROR[0m Run fhznobej errored: NameError("name 'path' is not defined")
Detected 3 failed runs in the first 60 seconds, killing sweep.
[34m[1mwandb[0m: [32m[41mERROR[0m Detected 3 failed runs in the first 60 seconds, killing sweep.
[34m[1mwandb[0m: To disable this check set WANDB_AGENT_DISABLE_FLAPPING=true


In [13]:
# let us get the best model
trainer.state.best_model_checkpoint

'data/training1/results/run-9/checkpoint-368'

In [None]:
# load from a checkpoint and continue the training
# trainer._load_from_checkpoint('data/training1/results/checkpoint-734/')



We see that the model is over-fitted. We must fine-tune the model and augment it to add some noise into the training step.

### Predictions

Let us generate texts and store into a DataFrame.

In [None]:

# set the model to eval mode
_ = model.eval()

# run model inference on all test data
original_traduction, predicted_traduction, original_text, scores = [], [], [], {}

for data in tqdm(DataLoader(test_dataset)):
    
    # recuperate the two part of the sentence
    sents = list(test_dataset.decode(data[0]))
    
    cp1_sent, cp2_sent = sents[0][0], sents[0][1] 
    
    # create the sentence to traduce
    sent1 = f'{test_dataset.cls_token}{cp1_sent}{test_dataset.sep_token}'
    
    # generate tokens
    encoding = tokenizer(sent1, return_tensors='pt')
    
    generated = encoding.input_ids.cuda()
    
    attention_mask = encoding.attention_mask.cuda()
    
    # recuperate the pad token id
    pad_token_id = tokenizer.pad_token_id
    
    # perform prediction
    sample_outputs = model.generate(generated, do_sample = False, top_k = 50, max_length = test_dataset.max_len, top_p = 0.90,
                                    temperature = 0, num_return_sequences = 0, attention_mask = attention_mask, pad_token_id = pad_token_id)
    
    # calculate the score and add it to the score
    result = translation_eval.compute_metrics((sample_outputs, generated))
    
    if not scores: scores.update({k: v for k, v in result.items()})
    
    else: scores.update({k: round((scores[k] + v) / 2, 4) for k, v in result.items()})
    
    # decode the predicted tokens into texts
    sent2 = list(test_dataset.decode(sample_outputs, True))[0]
    
    print(sent2)
    # append results
    original_traduction.append(cp2_sent)
    predicted_traduction.append(sent2)
    original_text.append(cp1_sent)

# transform result into data frame
df_ft_to_wf = pd.DataFrame({'original_text': original_text,
                            'original_label': original_traduction,
                            'predicted_label': predicted_traduction})

# print the result
df_ft_to_wf.head()

  1%|          | 1/82 [00:06<09:20,  6.91s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, ci sama yaay daan wax dëgg-dëgg, ci sama biir yu kowe ya.


  2%|▏         | 2/82 [00:12<08:10,  6.13s/it]

Nettali sax, su ma ci sama yaay moom, di ma ni sama yaay daan nettali nii, di ma koy jekk-xare, di sama yaay daan nettali nii. Mu ni sama yaay daan nettali nii, sama yaay daan nettali nii, sama yaay daan nettali nii.


  4%|▎         | 3/82 [00:18<08:17,  6.30s/it]

Ngalla, su ma sama yaay moom, sama yaay moom, sama yaay moom, sama yaay moom, sama yaay moom, sama yaay daan wax.


  5%|▍         | 4/82 [00:24<07:35,  5.84s/it]

Nettali askan wi, ay way-jur, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell yu dóomu-tóor yu ňuul, ay ndell ci seen biir yu ňuul, ay ndell, ay ndell, ay ndell ci seen biir yu ňuul, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell, ay ndell ci seen biir yu dóomu-tóor yu dóomu-tóor yu dóomu-tóor yu dóomu-tóor yu dóomu-tóor yu ňuul.


  6%|▌         | 5/82 [00:29<07:22,  5.75s/it]

Nettali sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay a ngi janook moom, du woon.


  7%|▋         | 6/82 [00:35<07:29,  5.92s/it]

Baay na, ci sama yaay daan nettali nii, di sama yaay daan nettali nii.


  9%|▊         | 7/82 [00:41<07:18,  5.84s/it]

Nit ki, Baay jël bés bi, di ma ni ma ci sama yaay daan wax dëgg-dëgg, di ma ni ma ni ma ni ma ni ma ni sama xel.


 10%|▉         | 8/82 [00:46<06:43,  5.46s/it]

Mbóot, su ma Baay, su ma ci sama yaay, su ma, su ma ni, su ma ci sama yaay daan nettali nii, su ma ci sama yaay daan nettali nii, su ma tolloo ci sama yaay daan nettali nii, du dox lu mu ma tolloo ci sama yaay daan nettali nii.


 11%|█         | 9/82 [00:52<06:53,  5.67s/it]

Li ci sama yaay dund, du sax, du woon dara ci sama yaay dund Afrig.


 12%|█▏        | 10/82 [01:00<07:48,  6.50s/it]

Nettali na, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, di ko ko nga xam ni mel ni i xob.


 13%|█▎        | 11/82 [01:08<08:07,  6.86s/it]

Ngir ma ni ma ni ma ni ma ni ma ni ma ni sama yaay, du woon dara ci sama mag ju góor gu ndaw.


 15%|█▍        | 12/82 [01:17<08:41,  7.45s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 16%|█▌        | 13/82 [01:24<08:21,  7.26s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du dox ci sama yaay a ngi, du dox ci sama yaay.


 17%|█▋        | 14/82 [01:30<07:55,  6.99s/it]

Nettali sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du dox ci sama yaay a ngi gisaat, du dox ci sama yaay.


 18%|█▊        | 15/82 [01:38<08:05,  7.24s/it]

Muy nag, du dox ci sama yaay daan nettali nii.


 20%|█▉        | 16/82 [01:44<07:45,  7.05s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du ci seen biir yu kowe ya. Mu mel ni mel ni mel ni mel ni mel ni mel ni mel.


 21%|██        | 17/82 [01:51<07:36,  7.03s/it]

Mu ma sama yaay daan wax, du woon dara ci sama mag ju góor.


 22%|██▏       | 18/82 [01:58<07:30,  7.03s/it]

Ngir, su ma ni ci sama yaay daan wax dëgg-dëgg, ci sama yaay daan wax dëgg-dëgg, ci sama yaay daan wax dëgg-dëgg, ci sama yaay daan wax dëgg-dëgg, ci sama yaay daan wax dëgg-dëgg, di ma daan wax dëgg-dëgg.


 23%|██▎       | 19/82 [02:05<07:15,  6.92s/it]

Li ci sama yaay dund Afrig, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 24%|██▍       | 20/82 [02:12<07:03,  6.83s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag.


 26%|██▌       | 21/82 [02:18<06:56,  6.82s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du ci sama yaay daan nettali nii.


 27%|██▋       | 22/82 [02:24<06:34,  6.57s/it]

Nettali askan wi, ci sama yaay moom, sama yaay daan nettali nii. Mu ni ko, sama yaay daan nettali nii, sama yaay daan nettali nii.


 28%|██▊       | 23/82 [02:31<06:18,  6.42s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 29%|██▉       | 24/82 [02:38<06:24,  6.63s/it]

Ngiraal ya mu ngi nii, di ko ko daan nettali nii.


 30%|███       | 25/82 [02:46<06:39,  7.00s/it]

Nettali askan wi, di ko ci sama yaay dund, di ko nu daan nettali nii.


 32%|███▏      | 26/82 [02:55<07:20,  7.87s/it]

Muy nag, nag, nag, nag mooy ñaari gone yi ci sama yaay daan nettali nii.


 33%|███▎      | 27/82 [03:02<06:50,  7.46s/it]

Liñ na ma ni ma ni ma ni Baay, ci sama yaay di sama yaay daan def.


 34%|███▍      | 28/82 [03:09<06:38,  7.39s/it]

Nettali sax, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 35%|███▌      | 29/82 [03:16<06:27,  7.31s/it]

Ngir sax, du dox lu mu mel ni mel ni mel ni mel ni mel ni mel.


 37%|███▋      | 30/82 [03:23<06:10,  7.12s/it]

Li ci sama yaay dund, du sax, du woon dara lu dul woon lu dul woon.


 38%|███▊      | 31/82 [03:30<06:01,  7.08s/it]

Ngir, ci sama yaay dund, di nu daan nettali nii, di nu daan nu daan nettali nii.


 39%|███▉      | 32/82 [03:37<05:46,  6.94s/it]

Li ci sama yaay dund époque-saalumu Afrig, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 40%|████      | 33/82 [03:43<05:36,  6.87s/it]

Baay naa ni naa ni Baay jël, ci sama yaay daan def.


 41%|████▏     | 34/82 [03:49<05:14,  6.56s/it]

Nettali na mu ngi fàttaliku, moo xam ni, di mu ngi fàttaliku, di mu ngi janook fi ci sama yaay, di ngi janook moom, di mu ngi janook fi ci sama yaay.


 43%|████▎     | 35/82 [03:56<05:16,  6.73s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 44%|████▍     | 36/82 [04:03<05:13,  6.81s/it]

Nettali askan wi, di ko, di ko, di ko daan def. Mu ni ko, di ko daan def.


 45%|████▌     | 37/82 [04:11<05:13,  6.98s/it]

Ngir sax, di ko ci sama yaay moom, di ko ko ko ko ko daan def.


 46%|████▋     | 38/82 [04:17<05:03,  6.90s/it]

Ngir, su ma neexaan géej gi, du dox lu dul jeex.


 48%|████▊     | 39/82 [04:24<04:56,  6.89s/it]

Ngir, du dox ci sama yaay daan nettali nii.


 49%|████▉     | 40/82 [04:31<04:45,  6.79s/it]

Li ci sama yaay dund, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 50%|█████     | 41/82 [04:37<04:32,  6.65s/it]

Nataal ya mu ngi fàttaliku, di ma ni ma ni ma ni ma ni sama yaay daan def, ci sama yaay daan def, di ma ni ma ni ma ni ma ni sama yaay daan ma daan ma daan ma daan ma daan ma daan ma daan ma daan ma daan ma daan ma daan ma daan ma ci sama yaay daan nettali nii.


 51%|█████     | 42/82 [04:44<04:31,  6.80s/it]

Muy nag, du woon a nga xam ni ci sama yaay.


 52%|█████▏    | 43/82 [04:50<04:12,  6.47s/it]

Nettali askan solo, moo xam naa, di ko, di sama yaay daan nettali nii, di sama yaay daan nettali nii, di sama yaay daan nettali nii, di sama yaay daan nettali nii, di faj la woon.


 54%|█████▎    | 44/82 [04:56<03:59,  6.31s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay, du dox lu dul jeex, du dox ci biir yax.


 55%|█████▍    | 45/82 [05:03<04:03,  6.59s/it]

Muy nag, du woon a ngi janook moom, du woon dara.


 56%|█████▌    | 46/82 [05:10<03:58,  6.62s/it]

Biñ na ma nee bëggoon a ngi janook moom, du dox ci sama yaay daan nettali nii.


 57%|█████▋    | 47/82 [05:16<03:43,  6.39s/it]

Nettali askan wi, ci sama yaay daan nettali nii, di ko ci sama yaay daan nettali nii. Mu doon sama yaay daan nettali nii, di ngi janook moom, di ko ci sama yaay daan nettali nii.


 59%|█████▊    | 48/82 [05:21<03:30,  6.20s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du ci sama yaay daan wax dëgg, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du fa, du


 60%|█████▉    | 49/82 [05:29<03:38,  6.62s/it]

Ngiraal ya, di ngi nii, di ngi nii, di ngi nii, di ngi nii.


 61%|██████    | 50/82 [05:37<03:47,  7.10s/it]

Nettali na, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 62%|██████▏   | 51/82 [06:07<07:08, 13.84s/it]

Li ci sama yaay moom, du dox lu mu daan nettali nii.


 63%|██████▎   | 52/82 [06:27<07:53, 15.78s/it]

Nettali sax, du dox lu dul jeex. Mu ni mel ni mel ni mel ni mel ni mel ni mel ni mel ni mel.


 65%|██████▍   | 53/82 [06:48<08:23, 17.35s/it]

Nettali askan wi, di ma ni ma ni ma ni ma ci sama yaay, du sax, du sax, du sax, du ci sama yaay daan nettali nii.


 66%|██████▌   | 54/82 [07:17<09:42, 20.80s/it]

Nettali sax, su ma sama yaay moom, sama yaay moom, sama yaay daan jéem a ngi janook moom.


 67%|██████▋   | 55/82 [07:34<08:48, 19.58s/it]

Nettali askan wi, sama yaay moom, sama yaay moom, sama yaay moom, sama yaay moom, sama yaay daan nettali nii, di sama yaay daan nettali nii, di sama yaay daan nettali nii.


 68%|██████▊   | 56/82 [07:53<08:29, 19.59s/it]

Liñ sama yaay dund, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 70%|██████▉   | 57/82 [08:14<08:16, 19.87s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 71%|███████   | 58/82 [08:37<08:19, 20.81s/it]

Muy nag, ci sama yaay daan wax dëgg-dëgg, ci sama yaay.


 72%|███████▏  | 59/82 [08:53<07:27, 19.46s/it]

Nettali askan wi, du sax, du sax, du sax, du sax, du ci sama yaay daan nettali nii. Mu doon sama yaay daan nettali nii, du woon a ngi gisaat, du dox ci sama yaay daan nettali nii.


 73%|███████▎  | 60/82 [09:14<07:16, 19.84s/it]

Muy nag, nag, nag, nag mooy ñaari gone gu ndaw-gisu Baay ca kow, ci sama mag.


 74%|███████▍  | 61/82 [09:31<06:41, 19.11s/it]

Xam naa xam naa xam naa xam naa xam ni Baay, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 76%|███████▌  | 62/82 [09:52<06:32, 19.64s/it]

Muy nag, du sax, du woon a nga xam ni ci sama yaay.


 77%|███████▋  | 63/82 [10:12<06:15, 19.76s/it]

Xam naa ni ko ko ci sama yaay, ci sama yaay daan wax dëgg-dëgg, ci sama yaay.


 78%|███████▊  | 64/82 [10:42<06:50, 22.83s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag.


 79%|███████▉  | 65/82 [11:05<06:26, 22.72s/it]

Nu nekk, di ko ko mel ni mel ni mel ni mel ni mel ni mel ni mel ni mel.


 80%|████████  | 66/82 [11:35<06:40, 25.03s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, ñu ànd ak seen bopp, ñu ànd ak seen biir yu kowe, seen biir yu kowe, seen biir yu kowe ya.


 82%|████████▏ | 67/82 [11:55<05:50, 23.37s/it]

Muy nag, nag, nag, nag, nag, nag.


 83%|████████▎ | 68/82 [12:15<05:16, 22.64s/it]

Ngir sax, ci sama yaay di ko ci sama yaay.


 84%|████████▍ | 69/82 [12:48<05:32, 25.58s/it]

Liñ na ci sama yaay dund, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 85%|████████▌ | 70/82 [13:13<05:03, 25.32s/it]

Nu nekk Afrig, du dox lu mu mel ni mel ni mel ni mel ni mel ni mel.


 87%|████████▋ | 71/82 [13:37<04:36, 25.10s/it]

Nettali askan wi, du sax, du sax, du sax, du dox ci sama yaay dund Afrig, du dox ci sama yaay.


 88%|████████▊ | 72/82 [14:04<04:16, 25.60s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 89%|████████▉ | 73/82 [14:31<03:54, 26.10s/it]

Ngir, du dox lu mu mel ni mel ni mel ni mel.


 90%|█████████ | 74/82 [14:58<03:29, 26.19s/it]

Ngir, di ko muy def, di ko ci sama yaay, di sama yaay daan nettali nii.


 91%|█████████▏| 75/82 [15:21<02:57, 25.41s/it]

Nettali askan wi, ci seen biir yu kowe, seen biir yu kowe ya, seen bopp, seen biir yu kowe ya mu mel ni mel ni mel ni mel ni mel niy baat.


 93%|█████████▎| 76/82 [15:48<02:35, 25.92s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag, nag.


 94%|█████████▍| 77/82 [16:13<02:07, 25.54s/it]

Nettali askan wi, ci sama yaay, sama yaay daan wax dëgg-dëgg, di ma ni ma ni ma ni ma ni ma ni ma koy jekk-jant.


 95%|█████████▌| 78/82 [16:41<01:45, 26.29s/it]

Biñ yeggee ci sama yaay dund, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du sax, du ci sama yaay.


 96%|█████████▋| 79/82 [17:06<01:17, 25.83s/it]

Nettali sax, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, nag, ñu ànd ak seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp, seen bopp seen bopp seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp seen bopp, seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp

 98%|█████████▊| 80/82 [17:31<00:51, 25.61s/it]

Muy nag, nag, nag, nag, nag, nag, nag, nag, nag.


 99%|█████████▉| 81/82 [17:56<00:25, 25.53s/it]

Baay na ko ko ko daan def, di ko ci sama yaay daan def.


100%|██████████| 82/82 [18:23<00:00, 13.45s/it]

Nettali sax, du dox ci seen biir yu dóomu-gisu Baay, seen biir yu dóomu-gisu Baay, seen biir yu mag, seen biir yu mag, seen biir yu mag, seen biir yu ñu ànd ak seen biir yu ñu ànd ak seeni dàll, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp, seen bopp seen bopp seen bopp, seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp, seen bopp seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp, seen bopp seen bopp, seen bopp, seen bopp seen bopp, seen bopp seen bopp, see




Unnamed: 0,original_text,original_label,predicted_label
0,"Hommes, femmes, enfants sont pris dans un pièg...","Mag, ndaw, góor ak jigéen jàq, song àll bi te ...","Muy nag, nag, nag, nag, nag, nag, nag, nag, na..."
1,"Les portes ouvertes à l'émigration, ces cohort...","Loolu doyul, biñ duggee ci atum 60, ñu ubbil A...","Nettali sax, su ma ci sama yaay moom, di ma ni..."
2,"Les sulfamides sont rares, les poudres et les ...","Garab yu néew lañuy yóbbale, diy puudar ak i t...","Ngalla, su ma sama yaay moom, sama yaay moom, ..."
3,Je peux ressentir l'émotion qu'il éprouve à tr...,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...","Nettali askan wi, ay way-jur, ay ndell, ay nde..."
4,"Tout ce que j'ai pu savoir de cette période, c...",Sunu yaay daal moo daan sànni léeg-léeg ci wax...,"Nettali sax, du sax, du sax, du sax, du sax, d..."


Like we specified in conclusion of the first processing on the corpora, some words appear a too important number of times in the text and we can see that their arrive in the large part of the translations. We can add augmentation and see if we obtain better results on the evaluation set.

### Training with augmentation

We must initialize a new model.

In [None]:
# set the mode name
model_name = "gpt2"

# recuperate the tokenizer from the dataset
tokenizer = train_dataset_aug.tokenizer

# configure the model
model_aug = GPT2LMHeadModel.from_pretrained(model_name).cuda()

# resize the token embeddings
model_aug.resize_token_embeddings(len(tokenizer))


Embedding(15696, 768)

Let us initialize the training arguments and fine-tune the model.

In [None]:
# seed
torch.manual_seed(50)

# create training arguments
training_args = TrainingArguments("data/training1/results_aug",
                                  overwrite_output_dir=True,
                                  num_train_epochs=3,
                                  logging_steps=100,
                                  load_best_model_at_end=True,
                                  save_strategy="epoch",
                                  evaluation_strategy="epoch",
                                  per_device_train_batch_size=2, 
                                  per_device_eval_batch_size=2,
                                  learning_rate=2e-5,
                                #   warmup_steps=120,
                                  weight_decay=0.05,
                                  logging_dir='gpt2_training_aug_logs1')

# start training
trainer = Trainer(model=model_aug,
                  args=training_args,
                  train_dataset=train_dataset_aug, 
                  eval_dataset=test_dataset_aug,
                  data_collator=data_collator,
                  # compute_metrics=translation_eval.compute_metrics
                  )

# def optuna_hp_space(trial):
  
#   return {
#     "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log = True),
#     "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
#     "seed": trial.suggest_int("seed", 1, 40),
#     "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2, 3, 4, 5, 6])
#   }

# # Searching for the best hyperparameters
# trainer.hyperparameter_search(
#   direction = "maximize",
#   backend = "optuna",
#   hp_space = optuna_hp_space,
#   n_trials = 10,
# )


In [None]:
# load from a checkpoint and continue the training
# trainer._load_from_checkpoint('data/training1/results/checkpoint-734/')

trainer.train()

  1%|▏         | 15/1101 [00:29<31:12,  1.72s/it] 

-----------------------

### Colab download step

In [19]:
!zip -r /content/data.zip /content/data

NotImplementedError: ignored

------------------