First training results with the GPT-2 decoder 🤖 (after random search)
-----------------------------------

In this notebook, we continue fine-tuning the pre-trained GPT-2 model provided by OpenAI. After a hyperparameter search tracked with `wandb`, we obtained a model with a minimal evaluation cross-entropy loss of **0.9**. Let us load the model with the best hyperparameter setting and continue the training. We add early stopping (`EarlyStoppingCallback`) to keep the model from overfitting.
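
To get a feel for what this loss means, the cross-entropy can be converted into a perplexity. The sketch below assumes the reported value is a mean per-token cross-entropy in nats, which is what the `transformers` language-model heads return; the variable names are ours.

```python
import math

# best evaluation loss reported by the sweep (eval/loss in the "Best parameters" note)
eval_loss = 0.9086

# perplexity is the exponential of the mean per-token cross-entropy (natural log)
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.2f}")
```

Under that assumption, a perplexity of roughly 2.5 means the model is, on average, about as uncertain as a uniform choice between two or three tokens.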

Parallel coordinates plot from the W&B panel:
![parallel_coordinates](W%26B%20Chart%2028_04_2023%2012_40_23.png)


The `Parameter importance` chart (from the [panel](https://wandb.ai/oumar-kane-team/gpt2-french-wolof-sweeps/reports/undefined-23-04-28-13-02-21---Vmlldzo0MjA3MzY5)) also shows that the evaluation loss depends mainly on `fr_word_p`, the probability of modifying words in a French sentence:

![parameter_importance](Parameter%20importance%20screenshot.png)

The evaluation loss is also negatively correlated with the learning rate and positively correlated with `wf_word_p` (the probability of modifying words in a Wolof sentence).
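
As a reminder of what the sign of a correlation expresses, here is a minimal sketch of the Pearson coefficient. The numbers are HYPOTHETICAL and chosen only for illustration; they are not taken from the sweep, and we assume (without having checked the panel's exact computation) that the chart reports a linear correlation of this kind.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# HYPOTHETICAL values, for illustration only (not results from the sweep)
learning_rates = [1e-5, 2e-5, 3e-5, 4e-5]
eval_losses = [1.30, 1.15, 1.00, 0.95]

# a negative coefficient: runs with a higher learning rate tend to have a lower loss
print(pearson(learning_rates, eval_losses))
```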

In [1]:
# let us extend the paths of the system 
import sys

# path = "/content/drive/MyDrive/Memoire/subject2/" # (for colab only)
path = "data/extractions/"

# sys.path.extend([f"{path}new_data", f"{path}wolof-translate"]) # (for colab only)

In [2]:
# define environment
# replace the placeholder with your own key; never commit a real API key to a notebook
%env WANDB_PROJECT=gpt2_french_wolof_sweeps
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
%env WANDB_API_KEY=<your-wandb-api-key>

env: WANDB_PROJECT=gpt2_french_wolof_sweeps
env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
env: WANDB_API_KEY=<your-wandb-api-key>


In [3]:
!pip install -qq wandb --upgrade

In [4]:
!pip install evaluate -qq
!pip install sacrebleu -qq
!pip install optuna -qq
!pip install transformers -qq 
!pip install tokenizers -qq
!pip install nlpaug -qq
!pip install "ray[tune]" -qq
!python -m spacy download fr_core_news_lg 

^C


In [3]:
# let us import all necessary libraries
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, EarlyStoppingCallback
from wolof_translate.utils.sent_transformers import TransformerSequences
from wolof_translate.data.dataset_v1 import SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import torch
import wandb

# the key is read from the WANDB_API_KEY environment variable defined above
wandb.login()


  from .autonotebook import tqdm as notebook_tqdm
wandb: Currently logged in as: oumar-kane (oumar-kane-team). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\Oumar Kane/.netrc


True

We will create two models:

- One translating the French corpus into Wolof: [french_to_wolof](#french-to-wolof)
- One translating the Wolof corpus into French: [wolof_to_french](#wolof-to-french)

--------------

## French to Wolof

### Configure dataset 🔠

We can reuse the custom dataset created in [text_augmentation](text_augmentation.ipynb), but we first need to split the data into train and test sets and save them.

In [4]:
# # load the corpora and split into train and test sets
# corpora = pd.read_csv("sent_extraction.csv")

# train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=50)

# # let us save the sets
# train_set.to_csv("data/extractions/new_data/train_set.csv", index=False)

# test_set.to_csv("data/extractions/new_data/test_set.csv", index=False)

Let us build the train and test datasets. Only the augmented versions are returned; the versions without augmentation are left commented out.

In [5]:
def recuperate_datasets(fr_char_p: float, wf_char_p: float, fr_word_p: float, wf_word_p: float):

  # without augmentation
  # train_dataset = SentenceDataset(f"{path}new_data/train_set.csv", 
  #                                 tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json")

  # test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                # tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json")

  # with augmentation
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=wf_char_p, aug_word_p=wf_word_p),
                                        remove_mark_space, delete_guillemet_space)

  train_dataset_aug = SentenceDataset(f"{path}new_data/train_set.csv", 
                                  # tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json", # (only for colab)
                                  tokenizer_path = f"wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                  cp1_transformer=fr_augmentation, truncation=True,
                                  cp2_transformer=wf_augmentation, max_len=579)

  test_dataset_aug = SentenceDataset(f"{path}new_data/test_set.csv",
                                # tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json", # (only for colab)
                                tokenizer_path = f"wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                cp1_transformer=fr_augmentation, truncation=True,
                                cp2_transformer=wf_augmentation, max_len=579)
  
  return train_dataset_aug, test_dataset_aug
  # return {
  #     'False': {
  #         'train_dataset': train_dataset,
  #         'test_dataset': test_dataset,
  #     },
  #     'True': {
  #         'train_dataset': train_dataset_aug,
  #         'test_dataset': test_dataset_aug
  #     }
  #     }

### Configure the model and the evaluation function ⚙️

Let us load the pre-trained model and resize its token embeddings to match the vocabulary of our tokenizer.

In [6]:
def gpt2_model_init(tokenizer):
  # name of the pre-trained checkpoint
  model_name = "gpt2"

  # load the pre-trained model and move it to the GPU
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings to the vocabulary size of the custom tokenizer
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the BLEU metric (computed with `sacrebleu`).

In [7]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import Callable, Union
import numpy as np
import evaluate
import torch
import wandb

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = None,
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        # load sacrebleu when the object is created, not when the class is defined
        self.metric = evaluate.load('sacrebleu') if metric is None else metric
    
    def postprocess_text(self, preds, labels):
        
        # sacrebleu expects one string per prediction and a list of references per example
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
        # eval_preds is a (predictions, labels) pair
        preds, labels = eval_preds
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        # move tensors to the CPU before decoding them
        if isinstance(preds, torch.Tensor): preds = preds.detach().cpu()
        
        if isinstance(labels, torch.Tensor): labels = labels.detach().cpu()
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            # number of non-padding tokens in each prediction
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            # score the post-processed texts, not the raw decoder outputs
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        # wandb.log expects a dictionary and requires an active run
        if wandb.run is not None:
            
            wandb.log({"bleu": result["bleu"]})
            
        return result

In [8]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [9]:
# translation_eval = TranslationEvaluation(test_dataset.tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [10]:
def data_collator(batch):
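    """Generate a batch of data to provide to the trainer

    Args:
        batch (list): A list of dataset items; item[0] holds the input ids
            and item[1] the attention mask, both as tensors of equal length

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
            (the labels are the input ids, as expected for causal language modelling)
    """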
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    labels = torch.stack([b[0] for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}
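
The collator passes the input ids as labels without shifting them. This works for causal language modelling because `GPT2LMHeadModel` shifts the labels internally: the prediction made at position t is scored against the token at position t + 1. The plain-Python sketch below only illustrates that pairing; the token ids are made up.

```python
# made-up token ids for one sequence (illustration only)
input_ids = [101, 7, 42, 9, 102]

# the collator uses the same ids as labels
labels = list(input_ids)

# inside the model, logits at positions 0..n-2 are compared with labels at positions 1..n-1
contexts = [input_ids[: t + 1] for t in range(len(input_ids) - 1)]
targets = labels[1:]

for context, target in zip(contexts, targets):
    print(f"context {context} -> target {target}")
```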

Let us initialize the training arguments with the best hyperparameters found by the random search.

In [12]:
# %%wandb

"""Best parameters
learning_rate = 0.00003702
weight_decay = 0.4
train_batch_size = 5
fr_char_p = 0.5
fr_word_p = 0
wf_char_p = 0.5
wf_word_p = 0
eval/loss = 0.9086
"""

# seed
torch.manual_seed(50)

# let us recuperate the datasets
train_dataset, test_dataset = recuperate_datasets(0.5, 0.5, 0, 0)

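# let us initialize an early stopping callback: stop after 3 evaluations
# without an improvement of more than 0.1 in the monitored metric
early_stopping = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.1)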

# get train and test datasets according to the config

# train_dataset = datasets[config.dataset_aug]['train_dataset']

# test_dataset = datasets[config.dataset_aug]['test_dataset']

# set training arguments
training_args = TrainingArguments(f"{path}training2/results",
                                  report_to = "wandb",
                                  num_train_epochs=5,
                                  # logging_steps=100,
                                  load_best_model_at_end=True,
                                  save_strategy="epoch",
                                  evaluation_strategy="epoch",
                                  logging_strategy = 'epoch',
                                  per_device_train_batch_size=5, 
                                  per_device_eval_batch_size=5,
                                  learning_rate = 0.00003702,
                                  weight_decay=0.4,
                                  logging_dir=f'{path}gpt2_training_logs2',
                                #   remove_unused_columns = False,
                                  fp16 = True,
                                  metric_for_best_model="eval_loss",
                                  greater_is_better=False,
                                  )   

# define training loop
trainer = Trainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                  args=training_args,
                  train_dataset=train_dataset, 
                  eval_dataset=test_dataset,
                  data_collator=data_collator,
                  # compute_metrics=translation_eval.compute_metrics
                  callbacks=[early_stopping]
                  )

# load last checkpoint
# trainer._load_from_checkpoint("data/training2/results/checkpoint-147")

# start training loop
# trainer.train()


OutOfMemoryError: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 6.00 GiB total capacity; 5.01 GiB already allocated; 0 bytes free; 5.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
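
The early stopping used in this training (`EarlyStoppingCallback` with a patience of 3 and a threshold of 0.1) can be read roughly as follows. This is a simplified sketch of our understanding of the callback, not its actual code: `early_stopping_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stopping_epoch(eval_losses, patience=3, threshold=0.1):
    """Return the 1-based evaluation at which training would stop, or None.

    Simplified logic: the best loss is updated whenever the loss decreases,
    but the patience counter is reset only when the decrease exceeds `threshold`.
    """
    best = None
    counter = 0

    for epoch, loss in enumerate(eval_losses, start=1):
        if best is None or (loss < best and abs(loss - best) > threshold):
            counter = 0
        else:
            counter += 1

        if best is None or loss < best:
            best = loss

        if counter >= patience:
            return epoch

    return None

# made-up evaluation losses (illustration only)
losses = [1.20, 1.00, 0.95, 0.93, 0.92]

print(early_stopping_epoch(losses, patience=3, threshold=0.1))  # 5
print(early_stopping_epoch(losses, patience=3, threshold=0.0))  # None
```

With a threshold of 0.1 and an evaluation loss close to 0.9, only fairly large improvements reset the counter, so training may stop even though the loss is still decreasing slowly.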

In [None]:
# let us get the best model


'data/training1/results/run-9/checkpoint-368'

In [None]:
# load from a checkpoint and continue the training
# trainer._load_from_checkpoint('data/training1/results/checkpoint-734/')



We see that the model overfits. We need to tune it further and augment the training data to inject some noise into the training step.

### Predictions

Let us generate translations for the test set and store them in a DataFrame.

In [None]:

# the model, the tokenizer and the evaluation object are not defined in the cells above
model = trainer.model
tokenizer = test_dataset.tokenizer
translation_eval = TranslationEvaluation(tokenizer)

# set the model to eval mode
_ = model.eval()

# run model inference on all test data
original_traduction, predicted_traduction, original_text, scores = [], [], [], {}

for data in tqdm(DataLoader(test_dataset)):
    
    # recuperate the two parts of the sentence
    sents = list(test_dataset.decode(data[0]))
    
    cp1_sent, cp2_sent = sents[0][0], sents[0][1] 
    
    # create the prompt containing the sentence to translate
    sent1 = f'{test_dataset.cls_token}{cp1_sent}{test_dataset.sep_token}'
    
    # tokenize the prompt
    encoding = tokenizer(sent1, return_tensors='pt')
    
    generated = encoding.input_ids.cuda()
    
    attention_mask = encoding.attention_mask.cuda()
    
    # recuperate the pad token id
    pad_token_id = tokenizer.pad_token_id
    
    # perform greedy decoding: top_k, top_p and temperature only apply when sampling,
    # and at least one sequence must be returned
    with torch.no_grad():
        sample_outputs = model.generate(generated, do_sample = False, max_length = test_dataset.max_len,
                                        num_return_sequences = 1, attention_mask = attention_mask, 
                                        pad_token_id = pad_token_id)
    
    # calculate the score; assumption: the reference is the full tokenized sentence pair
    # from the dataset (data[0]) rather than the prompt alone
    result = translation_eval.compute_metrics((sample_outputs, data[0]))
    
    # keep every score so that a true mean can be computed at the end
    for k, v in result.items():
        scores.setdefault(k, []).append(v)
    
    # decode the predicted tokens into texts
    sent2 = list(test_dataset.decode(sample_outputs, True))[0]
    
    print(sent2)
    # append results
    original_traduction.append(cp2_sent)
    predicted_traduction.append(sent2)
    original_text.append(cp1_sent)

# average the scores over the whole test set
scores = {k: round(float(np.mean(v)), 4) for k, v in scores.items()}

# transform result into data frame
df_ft_to_wf = pd.DataFrame({'original_text': original_text,
                            'original_label': original_traduction,
                            'predicted_label': predicted_traduction})

# print the result
df_ft_to_wf.head()
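
When per-sentence scores are aggregated over the test set, the way the running value is updated matters. Repeatedly averaging the running score with the newest one, as in `(old + new) / 2`, weights recent sentences more heavily and is not the arithmetic mean. The sketch below compares the two on made-up scores; both helper functions are hypothetical names of ours.

```python
def pairwise_running_average(values):
    """Repeatedly average the running score with the newest value."""
    score = values[0]
    for v in values[1:]:
        score = (score + v) / 2
    return score

def arithmetic_mean(values):
    """Plain mean: every value has the same weight."""
    return sum(values) / len(values)

# made-up per-sentence BLEU scores (illustration only)
bleu_scores = [10.0, 20.0, 60.0]

print(pairwise_running_average(bleu_scores))  # 37.5
print(arithmetic_mean(bleu_scores))           # 30.0
```

Note also that a mean of sentence-level BLEU scores is generally not the same as a corpus-level BLEU computed once over all predictions and references.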