First Training with the GPT-2 decoder 🤖 (random search)
-----------------------------------

In this notebook, we fine-tune the pre-trained GPT-2 model released by OpenAI. The goal is only to test how accurate the model can be on the corpora we extracted, so this is a first, partial model. For the predictions, we follow the fine-tuning tutorial available on Medium, [fine-tuning-transformers](https://medium.com/towards-data-science/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e), and add a hyperparameter search with the `wandb` library.


We will run the training with and without data augmentation and compare the results.

In [7]:
# let us extend the paths of the system
import sys

path = "/content/drive/MyDrive/Memoire/subject2/"

sys.path.extend([f"{path}new_data", f"{path}wolof-translate"])

In [8]:
# define environment
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
%env WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee 

env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
env: WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee


In [9]:
!pip install -qq wandb --upgrade

In [10]:
!pip install evaluate -qq
!pip install sacrebleu -qq
# !pip install optuna -qq
!pip install transformers -qq 
!pip install tokenizers -qq
!pip install nlpaug -qq
!pip install ray[tune] -qq
!python -m spacy download fr_core_news_lg 


In [11]:
# let us import all necessary libraries
from wolof_translate.utils.sent_transformers import TransformerSequences
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer
from wolof_translate.data.dataset_v1 import SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import torch


We will create two models: 

- One translating the French corpus into a Wolof corpus [french_to_wolof](#french-to-wolof)
- One translating the Wolof corpus into a French corpus [wolof_to_french](#wolof-to-french)

--------------

## French to Wolof

### Configure dataset 🔠

We can use the same custom dataset that we created in [text_augmentation](text_augmentation.ipynb). But we need to split the data between train and test sets and save them.

In [12]:
def split_data(random_state: int = 50):

  # load the corpora and split into train and test sets
  corpora = pd.read_csv(f"{path}new_data/sent_extraction.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(f"{path}new_data/train_set.csv", index=False)

  test_set.to_csv(f"{path}new_data/test_set.csv", index=False)
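
The effect of `random_state` and `test_size=0.1` can be illustrated without the real corpus. The following is a minimal standard-library sketch, not the scikit-learn implementation: `split_indices` is a hypothetical helper, and the corpus size of 816 rows is an assumption chosen only because it is consistent with the step counts reported in the training logs further down (245, 367 and 147 steps for batch sizes 3, 2 and 5); the notebook itself never states the corpus size.

```python
import math
import random


def split_indices(n_rows: int, test_fraction: float, seed: int):
    """Shuffle row indices with a fixed seed and cut off a test share.

    The test size is rounded up, which mirrors how scikit-learn sizes a
    fractional test set; the shuffle itself is NOT scikit-learn's.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_rows)
    return indices[n_test:], indices[:n_test]


# Hypothetical corpus size, assumed for illustration only.
N_ROWS = 816

train_idx, test_idx = split_indices(N_ROWS, test_fraction=0.1, seed=50)
same_train, same_test = split_indices(N_ROWS, test_fraction=0.1, seed=50)

# Optimizer steps in one epoch for each batch size used in the sweep.
steps_per_epoch = {bs: math.ceil(len(train_idx) / bs) for bs in (2, 3, 5)}

print(len(train_idx), len(test_idx))   # sizes of the two sets
print(test_idx == same_test)           # same seed, same split
print(steps_per_epoch)
```

Because the sweep also varies `random_state`, each run is evaluated on a different test set, which is worth keeping in mind when comparing validation losses across runs.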

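Let us build the datasets. The variant without augmentation is commented out below, so the function returns an augmented training set and a non-augmented test set.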

In [13]:
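def recuperate_datasets(fr_char_p: float, wf_char_p: float, fr_word_p: float, wf_word_p: float):
  # note: wf_char_p and wf_word_p are accepted but not used yet; only the French side is augmented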

  # without augmentation
  # train_dataset = SentenceDataset(f"{path}new_data/train_set.csv", 
  #                                 tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json")

  # test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                # tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json")

  # with augmentation
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  train_dataset_aug = SentenceDataset(f"{path}new_data/train_set.csv", 
                                  tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                  cp1_transformer=fr_augmentation, truncation=True,
                                  max_len=579)

  test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                truncation=True, max_len=579)
  
  return train_dataset_aug, test_dataset
  # return {
  #     'False': {
  #         'train_dataset': train_dataset,
  #         'test_dataset': test_dataset,
  #     },
  #     'True': {
  #         'train_dataset': train_dataset_aug,
  #         'test_dataset': test_dataset_aug
  #     }
  #     }
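
To make the two augmentation probabilities more concrete, here is a minimal sketch of keyboard-style noise written with the standard library only. It is not the `nlpaug` implementation: `keyboard_noise` and the tiny `NEIGHBOURS` map are hypothetical, and the sketch treats `word_p` and `char_p` as simple per-word and per-character probabilities, which only approximates what `KeyboardAug` does with `aug_word_p` and `aug_char_p`.

```python
import random

# Tiny hypothetical neighbour map; a real keyboard layout has many more keys.
NEIGHBOURS = {
    "a": "qz", "e": "zr", "i": "uo", "o": "ip", "u": "yi",
    "n": "bm", "s": "qd", "t": "ry", "r": "et", "l": "km",
}


def keyboard_noise(sentence: str, char_p: float, word_p: float,
                   rng: random.Random) -> str:
    """Replace some characters with keyboard neighbours.

    word_p is the probability of touching a word at all; char_p is the
    probability of replacing each eligible character inside a chosen word.
    """
    noisy_words = []
    for word in sentence.split():
        if rng.random() < word_p:
            word = "".join(
                rng.choice(NEIGHBOURS[c])
                if (c in NEIGHBOURS and rng.random() < char_p) else c
                for c in word
            )
        noisy_words.append(word)
    return " ".join(noisy_words)


rng = random.Random(0)
sentence = "le chat dort sur le lit"
clean = keyboard_noise(sentence, char_p=0.0, word_p=0.0, rng=rng)
noisy = keyboard_noise(sentence, char_p=0.7, word_p=0.7, rng=rng)
print(clean)
print(noisy)
```

With both probabilities at 0.7, the upper bound of the search space, a large share of the characters can be corrupted, so the high end of the range is a fairly aggressive setting.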

### Configure hyperparameter search ⚙️

We have to configure the search space and the search method (`random` in our case). The sweep minimizes `eval_loss`.

In [14]:
import wandb
wandb.login(key="237a8450cd2568ea1c8e1f8e0400708e79b6b4ee")

# hyperparameters
sweep_config = {
    'method': 'random',
    'metric':{
          'goal': 'minimize',
          'name': 'eval_loss'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [2, 3, 5]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
     'fr_char_p': {
         'min': 0.0,
         'max': 0.7
     },
     'fr_word_p': {
          'min': 0.0,
          'max': 0.7
     },
     'wf_char_p': {
          'min': 0.0,
          'max': 0.7
     },
     'wf_word_p': {
          'min': 0.0,
          'max': 0.7
     },
     'random_state': {
         'values': [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "gpt2-wolof-french-translation1")



wandb: Currently logged in as: oumar-kane (oumar-kane-team). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: 14019vvc
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-wolof-french-translation1/sweeps/14019vvc
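
To show what one draw of the random search looks like, here is a minimal standard-library sketch that samples configurations from a trimmed-down copy of the search space. It is not the W&B sampler: `sample_parameter` and `sample_config` are hypothetical helpers, and the sketch assumes that `log_uniform_values` means "uniform in log space between `min` and `max`" and that a bare float `min`/`max` pair is sampled uniformly.

```python
import math
import random

# Hypothetical, trimmed-down copy of the search space defined above.
SEARCH_SPACE = {
    "batch_size": {"values": [2, 3, 5]},
    "learning_rate": {"distribution": "log_uniform_values",
                      "min": 1e-5, "max": 1e-3},
    "weight_decay": {"values": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]},
    "fr_char_p": {"min": 0.0, "max": 0.7},
}


def sample_parameter(spec: dict, rng: random.Random):
    """Draw one value for a single parameter specification."""
    if "value" in spec:
        return spec["value"]
    if "values" in spec:
        return rng.choice(spec["values"])
    if spec.get("distribution") == "log_uniform_values":
        # Uniform in log space, then mapped back with exp.
        low, high = math.log(spec["min"]), math.log(spec["max"])
        return math.exp(rng.uniform(low, high))
    # A bare float min/max pair is treated as a plain uniform range here.
    return rng.uniform(spec["min"], spec["max"])


def sample_config(space: dict, rng: random.Random) -> dict:
    """Draw one full configuration, one parameter at a time."""
    return {name: sample_parameter(spec, rng) for name, spec in space.items()}


rng = random.Random(0)
for _ in range(3):
    print(sample_config(SEARCH_SPACE, rng))
```

Sampling the learning rate in log space means that about half of the draws fall below `1e-4`, the geometric midpoint of the range, instead of being concentrated near `1e-3` as a plain uniform draw would be.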


### Configure the model and the evaluation function ⚙️

Let us load the pre-trained model and resize its token embeddings to the size of our tokenizer's vocabulary.

In [15]:
def gpt2_model_init(tokenizer):
  # set the model name
  model_name = "gpt2"

  # load the pre-trained model and move it to the GPU
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings to match the custom tokenizer's vocabulary
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric.

In [16]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import Callable, Union
import numpy as np
import evaluate
import wandb

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
        # the Trainer passes a (predictions, label_ids) pair
        preds, labels = eval_preds
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            # score the post-processed texts (each reference is wrapped in a list)
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        # wandb.log expects a dictionary
        wandb.log({"bleu": result["bleu"]})
            
        return result

In [17]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [18]:
# translation_eval = TranslationEvaluation(test_dataset.tokenizer)
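
To give an intuition for the score that `compute_metrics` reports, here is a minimal sketch of a BLEU-like metric written with the standard library. It is a teaching simplification and not the `sacrebleu` implementation: `simple_bleu` is a hypothetical helper that uses whitespace tokenization, a single reference, n-grams up to length 2 and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Each predicted n-gram is credited at most as often as it occurs
        # in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Predictions shorter than the reference are penalized.
    if len(pred) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(pred))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("le chat dort ici", "le chat dort ici"))
print(simple_bleu("le chat dort", "le chat dort ici"))
print(simple_bleu("bonjour tout le monde", "le chat dort ici"))
```

The second call shows the brevity penalty at work: every predicted n-gram is correct, yet the score drops because the prediction is shorter than the reference. In the class above, `postprocess_text` wraps each label in a list because the `sacrebleu` metric accepts several references per prediction.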

### Searching for the best parameters 🕖

Let us define the data collator.

In [19]:
def data_collator(batch):
    """Generate a batch of data to provide to the trainer.

    Args:
        batch (list): Items from the dataset; each item holds the input ids
            at index 0 and the attention mask at index 1

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    labels = torch.stack([b[0] for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}
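
The collator uses the input ids as labels without shifting them. That works because a causal language model such as `GPT2LMHeadModel` shifts the labels internally, so the output at position `t` is compared with the token at position `t + 1`. The following standard-library sketch illustrates that shift; `next_token_pairs` and `causal_lm_loss` are hypothetical helpers, and the probabilities are made-up toy values.

```python
import math


def next_token_pairs(input_ids):
    """Return (prefix, target) pairs that a causal language model learns from."""
    labels = list(input_ids)  # the collator copies the inputs as labels
    return [(input_ids[:i], labels[i]) for i in range(1, len(labels))]


def causal_lm_loss(probabilities, input_ids):
    """Mean negative log-likelihood with labels shifted one step to the left.

    probabilities[t] maps a token id to the probability predicted at
    position t for the NEXT token.
    """
    shifted_preds = probabilities[:-1]  # the last position has no target
    shifted_labels = input_ids[1:]      # the first token is never a target
    losses = [-math.log(p[y]) for p, y in zip(shifted_preds, shifted_labels)]
    return sum(losses) / len(losses)


ids = [5, 7, 9]
# Toy predictions: one distribution per position.
probs = [{7: 0.5, 9: 0.5}, {9: 0.25, 7: 0.75}, {5: 1.0}]

pairs = next_token_pairs(ids)
loss = causal_lm_loss(probs, ids)
print(pairs)
print(loss)
```

One consequence of copying the ids as labels: if the dataset pads sequences to a fixed length, the padding positions also count toward the loss unless their labels are set to `-100`, the index that the Hugging Face loss ignores.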

Let us initialize the training arguments and run the random search.

In [20]:
# %%wandb

def train(config = None):

  

  with wandb.init(config = config):

    # seed
    torch.manual_seed(50)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, test_dataset = recuperate_datasets(config.fr_char_p, config.wf_char_p, 
                                   config.fr_word_p, config.wf_word_p)

    # get train and test datasets according to the config

    # train_dataset = datasets[config.dataset_aug]['train_dataset']

    # test_dataset = datasets[config.dataset_aug]['test_dataset']

    # set training arguments
    training_args = TrainingArguments(f"{path}training2/results1",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      # logging_steps=100,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=5,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      logging_dir=f'{path}gpt2_training_logs2',
                                      remove_unused_columns = False,
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Trainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      # compute_metrics=translation_eval.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 30)


wandb: Agent Starting Run: bsir9vds with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6502928646092081
wandb: 	fr_word_p: 0.6817430409598793
wandb: 	learning_rate: 0.0005849589572035753
wandb: 	random_state: 70
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.6138587279075418
wandb: 	wf_word_p: 0.5944717582493727




Epoch,Training Loss,Validation Loss
1,1.4741,0.810593


0,1
eval/loss,0.81059
eval/runtime,3.3476
eval/samples_per_second,24.495
eval/steps_per_second,5.078
train/epoch,1.0
train/global_step,245.0
train/learning_rate,2e-05
train/loss,1.4741
train/total_flos,216590170752000.0
train/train_loss,1.47413


wandb: Agent Starting Run: 13n4uo4j with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.671533610651228
wandb: 	fr_word_p: 0.3377438777612809
wandb: 	learning_rate: 0.00011511089726939806
wandb: 	random_state: 40
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.3217907770297627
wandb: 	wf_word_p: 0.5495165396790335




Epoch,Training Loss,Validation Loss
1,1.5108,0.853371


0,1
eval/loss,0.85337
eval/runtime,2.7043
eval/samples_per_second,30.322
eval/steps_per_second,6.286
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5108
train/total_flos,216590170752000.0
train/train_loss,1.51082


wandb: Agent Starting Run: 1uiv2ha8 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.240477160240458
wandb: 	fr_word_p: 0.537085624723154
wandb: 	learning_rate: 0.00011294168073571068
wandb: 	random_state: 0
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.036998228943525474
wandb: 	wf_word_p: 0.34284569105601037




Epoch,Training Loss,Validation Loss
1,1.4684,0.796042


0,1
eval/loss,0.79604
eval/runtime,2.7108
eval/samples_per_second,30.25
eval/steps_per_second,6.271
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4684
train/total_flos,216590170752000.0
train/train_loss,1.46838


wandb: Agent Starting Run: ezr9ubb6 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5254516886812962
wandb: 	fr_word_p: 0.09464907125173204
wandb: 	learning_rate: 1.3111707137809985e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.18792842653139763
wandb: 	wf_word_p: 0.1625617142576774




Epoch,Training Loss,Validation Loss
1,1.6273,1.010671


0,1
eval/loss,1.01067
eval/runtime,2.7106
eval/samples_per_second,30.252
eval/steps_per_second,6.272
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.6273
train/total_flos,216590170752000.0
train/train_loss,1.62726


wandb: Agent Starting Run: d2b5l70j with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4234154666698189
wandb: 	fr_word_p: 0.07783532677338817
wandb: 	learning_rate: 1.2354743954007584e-05
wandb: 	random_state: 20
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.6419901404282341
wandb: 	wf_word_p: 0.5418217804360558




Epoch,Training Loss,Validation Loss
1,1.893,1.030977


0,1
eval/loss,1.03098
eval/runtime,2.702
eval/samples_per_second,30.348
eval/steps_per_second,6.292
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.893
train/total_flos,216590170752000.0
train/train_loss,1.89304


wandb: Agent Starting Run: 6801a1l4 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4429581438929929
wandb: 	fr_word_p: 0.3632366347028243
wandb: 	learning_rate: 2.6267197537308377e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.17395027165161914
wandb: 	wf_word_p: 0.4096146675858922




Epoch,Training Loss,Validation Loss
1,1.6356,0.992037


0,1
eval/loss,0.99204
eval/runtime,2.7061
eval/samples_per_second,30.302
eval/steps_per_second,6.282
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.6356
train/total_flos,216590170752000.0
train/train_loss,1.63561


wandb: Agent Starting Run: vqouplax with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.45257388569971824
wandb: 	fr_word_p: 0.2377921526767132
wandb: 	learning_rate: 0.0001388916152557796
wandb: 	random_state: 10
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.5470738083751061
wandb: 	wf_word_p: 0.13560745933627427




Epoch,Training Loss,Validation Loss
1,1.3697,0.800643


0,1
eval/loss,0.80064
eval/runtime,2.7226
eval/samples_per_second,30.118
eval/steps_per_second,6.244
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3697
train/total_flos,216590170752000.0
train/train_loss,1.3697


wandb: Agent Starting Run: geo1o00w with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.1815884829234138
wandb: 	fr_word_p: 0.6507696914179271
wandb: 	learning_rate: 0.000698583265792357
wandb: 	random_state: 100
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.5816640580970726
wandb: 	wf_word_p: 0.07847189788683867




Epoch,Training Loss,Validation Loss
1,1.4275,0.79112


0,1
eval/loss,0.79112
eval/runtime,2.708
eval/samples_per_second,30.281
eval/steps_per_second,6.278
train/epoch,1.0
train/global_step,245.0
train/learning_rate,2e-05
train/loss,1.4275
train/total_flos,216590170752000.0
train/train_loss,1.42755


wandb: Agent Starting Run: 1ccxsi4u with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.456988836641937
wandb: 	fr_word_p: 0.3479671391940781
wandb: 	learning_rate: 6.6097638504684e-05
wandb: 	random_state: 100
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.5171411933699624
wandb: 	wf_word_p: 0.2410126560866729




Epoch,Training Loss,Validation Loss
1,1.5498,0.881241


0,1
eval/loss,0.88124
eval/runtime,2.7186
eval/samples_per_second,30.162
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5498
train/total_flos,216590170752000.0
train/train_loss,1.54979


wandb: Agent Starting Run: 7hw786z1 with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4452101340937121
wandb: 	fr_word_p: 0.05788869827598933
wandb: 	learning_rate: 0.0005704647067385222
wandb: 	random_state: 0
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.433116386353126
wandb: 	wf_word_p: 0.1370643863720747




Epoch,Training Loss,Validation Loss
1,1.1724,0.690651


0,1
eval/loss,0.69065
eval/runtime,2.7289
eval/samples_per_second,30.048
eval/steps_per_second,6.23
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.1724
train/total_flos,216590170752000.0
train/train_loss,1.17238


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: cyvv3sk4 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.1482731294957671
wandb: 	fr_word_p: 0.5736930630644534
wandb: 	learning_rate: 5.668425283286384e-05
wandb: 	random_state: 70
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.37802382647823374
wandb: 	wf_word_p: 0.341085845908422




Epoch,Training Loss,Validation Loss
1,1.5059,0.905935


0,1
eval/loss,0.90593
eval/runtime,2.7145
eval/samples_per_second,30.209
eval/steps_per_second,6.263
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5059
train/total_flos,216590170752000.0
train/train_loss,1.50591


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 8le52wm6 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6187226168209458
wandb: 	fr_word_p: 0.06710867937445074
wandb: 	learning_rate: 0.0004360074099168583
wandb: 	random_state: 40
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.2799773684911801
wandb: 	wf_word_p: 0.4995626384671348




Epoch,Training Loss,Validation Loss
1,1.2788,0.765012


0,1
eval/loss,0.76501
eval/runtime,2.7162
eval/samples_per_second,30.189
eval/steps_per_second,6.259
train/epoch,1.0
train/global_step,245.0
train/learning_rate,1e-05
train/loss,1.2788
train/total_flos,216590170752000.0
train/train_loss,1.27883


wandb: Agent Starting Run: ik84v5nz with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.3626790920895456
wandb: 	fr_word_p: 0.5338205729697485
wandb: 	learning_rate: 1.480661292209492e-05
wandb: 	random_state: 0
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.1691982530305743
wandb: 	wf_word_p: 0.6746588812498662




Epoch,Training Loss,Validation Loss
1,2.0316,0.909855


0,1
eval/loss,0.90985
eval/runtime,2.7023
eval/samples_per_second,30.345
eval/steps_per_second,6.291
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,2.0316
train/total_flos,216590170752000.0
train/train_loss,2.03156


wandb: Agent Starting Run: w6g4xc5y with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.1826025463154098
wandb: 	fr_word_p: 0.23620689712255616
wandb: 	learning_rate: 0.00012545493959102943
wandb: 	random_state: 100
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.5243872632996286
wandb: 	wf_word_p: 0.07301709101380438




Epoch,Training Loss,Validation Loss
1,1.6024,0.861327


0,1
eval/loss,0.86133
eval/runtime,2.6953
eval/samples_per_second,30.424
eval/steps_per_second,6.307
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6024
train/total_flos,216590170752000.0
train/train_loss,1.6024


wandb: Agent Starting Run: kle0p42n with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5985474199581012
wandb: 	fr_word_p: 0.07718440369801771
wandb: 	learning_rate: 0.0007879622799773468
wandb: 	random_state: 10
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.06376166584090216
wandb: 	wf_word_p: 0.40000591190905255




Epoch,Training Loss,Validation Loss
1,1.3345,0.745426


0,1
eval/loss,0.74543
eval/runtime,2.7172
eval/samples_per_second,30.178
eval/steps_per_second,6.257
train/epoch,1.0
train/global_step,245.0
train/learning_rate,3e-05
train/loss,1.3345
train/total_flos,216590170752000.0
train/train_loss,1.33447


wandb: Agent Starting Run: hsjf08y6 with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6802585601283034
wandb: 	fr_word_p: 0.37831355668841254
wandb: 	learning_rate: 7.784485297903175e-05
wandb: 	random_state: 60
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.060724252321415366
wandb: 	wf_word_p: 0.6211735926275833




Epoch,Training Loss,Validation Loss
1,1.4352,0.907959


0,1
eval/loss,0.90796
eval/runtime,2.7388
eval/samples_per_second,29.94
eval/steps_per_second,6.207
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.4352
train/total_flos,216590170752000.0
train/train_loss,1.43518


wandb: Agent Starting Run: phdrjtls with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.2574727343683873
wandb: 	fr_word_p: 0.325848817422341
wandb: 	learning_rate: 0.00013694970591525185
wandb: 	random_state: 30
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.0515917126614958
wandb: 	wf_word_p: 0.13768465096538185




Epoch,Training Loss,Validation Loss
1,1.3669,0.895562


0,1
eval/loss,0.89556
eval/runtime,2.7327
eval/samples_per_second,30.007
eval/steps_per_second,6.221
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3669
train/total_flos,216590170752000.0
train/train_loss,1.36686


wandb: Agent Starting Run: np5sch0t with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5644754223954247
wandb: 	fr_word_p: 0.6558095944067747
wandb: 	learning_rate: 0.00019281695417198096
wandb: 	random_state: 100
wandb: 	weight_decay: 0
wandb: 	wf_char_p: 0.47603610667853263
wandb: 	wf_word_p: 0.6193756056152349




Epoch,Training Loss,Validation Loss
1,1.6504,0.851764


0,1
eval/loss,0.85176
eval/runtime,2.747
eval/samples_per_second,29.851
eval/steps_per_second,6.189
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6504
train/total_flos,216590170752000.0
train/train_loss,1.65045


wandb: Agent Starting Run: cqdm4q5a with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.652685626135174
wandb: 	fr_word_p: 0.4634738497616566
wandb: 	learning_rate: 3.4082556870625936e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.5113821920270895
wandb: 	wf_word_p: 0.4690531758389533




Epoch,Training Loss,Validation Loss
1,1.618,0.986327


0,1
eval/loss,0.98633
eval/runtime,2.7112
eval/samples_per_second,30.244
eval/steps_per_second,6.27
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.618
train/total_flos,216590170752000.0
train/train_loss,1.61798


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: jo4o0j13 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5897261699366377
wandb: 	fr_word_p: 0.40972011471398034
wandb: 	learning_rate: 3.889699482264556e-05
wandb: 	random_state: 60
wandb: 	weight_decay: 0
wandb: 	wf_char_p: 0.46983715080079863
wandb: 	wf_word_p: 0.5073771947437731




Epoch,Training Loss,Validation Loss
1,1.6112,0.950319


Metric,Value
eval/loss,0.95032
eval/runtime,2.7268
eval/samples_per_second,30.072
eval/steps_per_second,6.234
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.6112
train/total_flos,216590170752000.0
train/train_loss,1.61115


wandb: Agent Starting Run: e4ok3bld with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.25590210385850376
wandb: 	fr_word_p: 0.22438627557888657
wandb: 	learning_rate: 0.00018345393811845188
wandb: 	random_state: 80
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.5224095513229281
wandb: 	wf_word_p: 0.5418942676653388




Epoch,Training Loss,Validation Loss
1,1.3286,0.825027


Metric,Value
eval/loss,0.82503
eval/runtime,2.7529
eval/samples_per_second,29.786
eval/steps_per_second,6.175
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3286
train/total_flos,216590170752000.0
train/train_loss,1.32862


wandb: Agent Starting Run: mm4mdsqo with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5738866361745114
wandb: 	fr_word_p: 0.620986394828583
wandb: 	learning_rate: 1.2452444860615832e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.5018859032164279
wandb: 	wf_word_p: 0.6477684239138255




Epoch,Training Loss,Validation Loss
1,2.1221,1.082944


Metric,Value
eval/loss,1.08294
eval/runtime,2.7187
eval/samples_per_second,30.162
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,2.1221
train/total_flos,216590170752000.0
train/train_loss,2.12212


wandb: Agent Starting Run: ypqfy2iy with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.17687337193367414
wandb: 	fr_word_p: 0.19284716266445276
wandb: 	learning_rate: 0.0001288254898808281
wandb: 	random_state: 0
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.26615379490394614
wandb: 	wf_word_p: 0.497840262426633




Epoch,Training Loss,Validation Loss
1,1.386,0.775263


Metric,Value
eval/loss,0.77526
eval/runtime,2.7108
eval/samples_per_second,30.249
eval/steps_per_second,6.271
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.386
train/total_flos,216590170752000.0
train/train_loss,1.38603


wandb: Agent Starting Run: boye36v8 with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.48264858364465457
wandb: 	fr_word_p: 0.33448991683794504
wandb: 	learning_rate: 0.00031140780990074384
wandb: 	random_state: 40
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.060828957933087
wandb: 	wf_word_p: 0.0061826893974647685




Epoch,Training Loss,Validation Loss
1,1.6565,0.821011


Metric,Value
eval/loss,0.82101
eval/runtime,2.6977
eval/samples_per_second,30.396
eval/steps_per_second,6.302
train/epoch,1.0
train/global_step,147.0
train/learning_rate,2e-05
train/loss,1.6565
train/total_flos,216590170752000.0
train/train_loss,1.65649


wandb: Agent Starting Run: pugxa5gk with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.2005369908810156
wandb: 	fr_word_p: 0.6192113034969234
wandb: 	learning_rate: 8.222510622492054e-05
wandb: 	random_state: 50
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.34897660777730927
wandb: 	wf_word_p: 0.14382038114531173




Epoch,Training Loss,Validation Loss
1,1.3944,0.880709


Metric,Value
eval/loss,0.88071
eval/runtime,2.7231
eval/samples_per_second,30.112
eval/steps_per_second,6.243
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3944
train/total_flos,216590170752000.0
train/train_loss,1.39437


wandb: Agent Starting Run: qgktshex with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.2005777699232024
wandb: 	fr_word_p: 0.4385247110587816
wandb: 	learning_rate: 1.8573037368967597e-05
wandb: 	random_state: 20
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.6186729799794922
wandb: 	wf_word_p: 0.11024648884950464




Epoch,Training Loss,Validation Loss
1,1.6451,0.976688


Metric,Value
eval/loss,0.97669
eval/runtime,2.7517
eval/samples_per_second,29.8
eval/steps_per_second,6.178
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.6451
train/total_flos,216590170752000.0
train/train_loss,1.64512


wandb: Agent Starting Run: imvxju8c with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.2429679815237644
wandb: 	fr_word_p: 0.6715100689164555
wandb: 	learning_rate: 3.574388375356532e-05
wandb: 	random_state: 60
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.38070327933194215
wandb: 	wf_word_p: 0.33645061220489525




Epoch,Training Loss,Validation Loss
1,1.8086,0.959918


Metric,Value
eval/loss,0.95992
eval/runtime,2.7306
eval/samples_per_second,30.03
eval/steps_per_second,6.226
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.8086
train/total_flos,216590170752000.0
train/train_loss,1.80862


wandb: Agent Starting Run: bo8m8jrx with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4701006275945654
wandb: 	fr_word_p: 0.29401382444657836
wandb: 	learning_rate: 0.00017980663294666269
wandb: 	random_state: 40
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.02956569463186749
wandb: 	wf_word_p: 0.5762135150235679




Epoch,Training Loss,Validation Loss
1,1.4489,0.823548


Metric,Value
eval/loss,0.82355
eval/runtime,2.729
eval/samples_per_second,30.047
eval/steps_per_second,6.229
train/epoch,1.0
train/global_step,245.0
train/learning_rate,1e-05
train/loss,1.4489
train/total_flos,216590170752000.0
train/train_loss,1.44893


wandb: Agent Starting Run: 95x0lt2y with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.1543890127501567
wandb: 	fr_word_p: 0.299038502866073
wandb: 	learning_rate: 1.5099229296726838e-05
wandb: 	random_state: 50
wandb: 	weight_decay: 0
wandb: 	wf_char_p: 0.5008911488024025
wandb: 	wf_word_p: 0.2541906580982826




Epoch,Training Loss,Validation Loss
1,1.5157,0.93882


Metric,Value
eval/loss,0.93882
eval/runtime,2.7383
eval/samples_per_second,29.945
eval/steps_per_second,6.208
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.5157
train/total_flos,216590170752000.0
train/train_loss,1.51569


wandb: Agent Starting Run: 6fqqnqtl with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4706006414197762
wandb: 	fr_word_p: 0.3779312323679532
wandb: 	learning_rate: 0.00018704899051687955
wandb: 	random_state: 50
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.6900740729270312
wandb: 	wf_word_p: 0.3272944144812937




Epoch,Training Loss,Validation Loss
1,1.6782,0.869228


Metric,Value
eval/loss,0.86923
eval/runtime,2.7436
eval/samples_per_second,29.887
eval/steps_per_second,6.196
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6782
train/total_flos,216590170752000.0
train/train_loss,1.67821
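
Let us compare the runs printed in this output by their final `eval/loss`. The snippet below is only a small sketch: the values are copied by hand from the run summaries above, the first summary is left out because its run ID is not visible here, and the names `eval_losses` and `ranking` are ours.

```python
# Final eval/loss of each run, copied from the run summaries printed above.
# The first summary (eval/loss = 0.85176) is omitted: its run ID is not visible in this output.
eval_losses = {
    "cqdm4q5a": 0.98633,
    "jo4o0j13": 0.95032,
    "e4ok3bld": 0.82503,
    "mm4mdsqo": 1.08294,
    "ypqfy2iy": 0.77526,
    "boye36v8": 0.82101,
    "pugxa5gk": 0.88071,
    "qgktshex": 0.97669,
    "imvxju8c": 0.95992,
    "bo8m8jrx": 0.82355,
    "95x0lt2y": 0.93882,
    "6fqqnqtl": 0.86923,
}

# rank the runs from the lowest to the highest validation loss
ranking = sorted(eval_losses.items(), key=lambda item: item[1])

best_run, best_loss = ranking[0]
worst_run, worst_loss = ranking[-1]

print(f"best run:  {best_run} (eval/loss = {best_loss})")
print(f"worst run: {worst_run} (eval/loss = {worst_loss})")
```

Among these runs, `ypqfy2iy` reaches the lowest validation loss. The comparison is only indicative: each run trains for a single epoch, and `random_state` changes the train/test split, so the runs are not evaluated on exactly the same sentences.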


------------------

## Wolof to French

The only change is the order of the sentences: the Wolof sentence now comes first, followed by its French translation.

### Configure dataset 🔠

Let us load the datasets.

In [1]:
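import nlpaug.augmenter.char as nac

# NOTE: `TransformerSequences`, `SentenceDataset`, `remove_mark_space`, `delete_guillemet_space`
# and `path` are assumed to be imported or defined beforehand

def recuperate_datasets(fr_char_p: float, wf_char_p: float, fr_word_p: float, wf_word_p: float):

  # with augmentation: the first corpus is now the Wolof one, so we use the Wolof probabilities
  # (the French probabilities stay in the signature but are not used in this direction)
  wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=wf_char_p, aug_word_p=wf_word_p),
                                        remove_mark_space, delete_guillemet_space)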

  train_dataset_aug = SentenceDataset(f"{path}new_data/train_set.csv", 
                                  tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                  corpus_1="wolof_corpus",
                                  corpus_2="french_corpus",
                                  cp1_transformer=wf_augmentation, truncation=True,
                                  max_len=579)

  test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                corpus_1="wolof_corpus",
                                corpus_2="french_corpus",
                                truncation=True, max_len=579)
  
  return train_dataset_aug, test_dataset

### Configure hyperparameter search ⚙️

We have to configure the search space and the search method (`random` in our case).

In [2]:
import wandb

# log in with the key stored in the WANDB_API_KEY environment variable or in the netrc file
# (avoid hard-coding the API key in the notebook)
wandb.login()

# hyperparameters
sweep_config = {
    'method': 'random',
    'metric':{
          'goal': 'minimize',
          'name': 'eval/loss'  # key under which the validation loss appears in the wandb logs
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [2, 3, 5]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
      'fr_char_p': {
          'min': 0.0,
          'max': 0.7
      },
      'fr_word_p': {
          'min': 0.0,
          'max': 0.7
      },
      'wf_char_p': {
          'min': 0.0,
          'max': 0.7
      },
      'wf_word_p': {
          'min': 0.0,
          'max': 0.7
      },
      'random_state': {
          'values': [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
      }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "gpt2-wolof-French-translation1")



Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: oumar-kane (oumar-kane-team). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\Oumar Kane/.netrc


Create sweep with ID: 12vgs0ak
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-wolof-french-translation1/sweeps/12vgs0ak
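
To make the search space more concrete, here is a minimal sketch of how one configuration could be drawn from it with the standard library only. It is not wandb's implementation: `sample_config` is a hypothetical helper, and we assume that `log_uniform_values` means "uniform in log space between `min` and `max`" and that a bare `min`/`max` pair of floats is sampled uniformly.

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the search space (illustration only)."""
    probability = lambda: rng.uniform(0.0, 0.7)  # bare min/max: assumed uniform
    return {
        "epochs": 1,  # fixed value
        "batch_size": rng.choice([2, 3, 5]),
        # log-uniform: sample uniformly in log space, then exponentiate
        "learning_rate": math.exp(rng.uniform(math.log(1e-5), math.log(1e-3))),
        "weight_decay": rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        "fr_char_p": probability(),
        "fr_word_p": probability(),
        "wf_char_p": probability(),
        "wf_word_p": probability(),
        "random_state": rng.choice([0, 10, 20, 30, 40, 50, 60, 70, 80, 100]),
    }

rng = random.Random(0)  # seeded so that the draws are reproducible
samples = [sample_config(rng) for _ in range(5)]

for sample in samples:
    print(sample)
```

Sampling the learning rate in log space gives each order of magnitude between `1e-5` and `1e-3` the same chance of being explored, which a plain uniform draw would not.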


### Configure the model and the evaluation function ⚙️

Let us load the pre-trained model and resize its token embeddings to match the vocabulary of our tokenizer.

In [None]:
from transformers import GPT2LMHeadModel

def gpt2_model_init(tokenizer):
  # set the model name
  model_name = "gpt2"

  # load the pre-trained model and move it to the GPU
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings to match the vocabulary of the dataset's tokenizer
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the BLEU metric, computed with `sacrebleu`.

In [None]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import Callable, Union
import numpy as np
import evaluate
import wandb

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
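        # unpack the predictions and the labels (a tuple or a transformers `EvalPrediction`)
        preds, labels = eval_preds
        
        # move torch tensors to the CPU; NumPy arrays are left unchanged
        preds, labels = [x.detach().cpu() if hasattr(x, "detach") else x for x in (preds, labels)]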
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            # score the post-processed texts (they were previously computed but not used)
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        # `wandb.log` expects a dictionary of metrics
        wandb.log({"bleu": result["bleu"]})
            
        return result
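
To give an intuition of what the BLEU score measures, here is a simplified sketch written with the standard library only. It is not the `sacrebleu` implementation: `simple_bleu` is a hypothetical helper that assumes whitespace tokenization, a single reference per prediction and no smoothing, so its scores will generally differ from those returned by `evaluate.load('sacrebleu')`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(predictions, references, max_n=4):
    """Corpus-level BLEU in [0, 100]: whitespace tokens, one reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0

    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)

        for n in range(1, max_n + 1):
            pred_counts = ngrams(pred_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # clipped matches: an n-gram cannot match more often than it occurs in the reference
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())

    # without smoothing, a single order with no match gives a score of zero
    if pred_len == 0 or min(matches) == 0:
        return 0.0

    # geometric mean of the modified n-gram precisions
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n

    # brevity penalty: penalize predictions shorter than the references
    brevity_penalty = 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)

    return 100 * brevity_penalty * math.exp(log_precision)

reference = ["the cat sat on the red mat"]
perfect = simple_bleu(["the cat sat on the red mat"], reference)
partial = simple_bleu(["the cat sat on the mat"], reference)
unrelated = simple_bleu(["a dog runs"], reference)

print(perfect, partial, unrelated)
```

A prediction identical to its reference scores 100, a prediction sharing no 4-gram with it scores 0, and a partially correct one falls in between.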

In [None]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [None]:
# translation_eval = TranslationEvaluation(test_dataset.tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [None]:
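import torch

def data_collator(batch):
    """Generate a batch of data to provide to the trainer

    Args:
        batch (list): The samples of the batch; each sample holds the input ids at index 0 and the attention mask at index 1

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    # the labels are a copy of the input ids: the model shifts them internally
    labels = torch.stack([b[0] for b in batch])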
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}
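
The collator returns the input ids as labels because `GPT2LMHeadModel` shifts the labels internally: the token at position `i` is used to predict the token at position `i + 1`. The sketch below illustrates this pairing on plain Python lists; `shifted_pairs` and the token ids are hypothetical and only serve the illustration.

```python
def shifted_pairs(token_ids):
    """Pair each token with the next one, as a causal language model does internally."""
    inputs = token_ids[:-1]   # every token except the last one
    targets = token_ids[1:]   # every token except the first one
    return list(zip(inputs, targets))

# hypothetical token ids of one tokenized sentence pair
sequence = [12, 7, 45, 3]

pairs = shifted_pairs(sequence)
print(pairs)  # [(12, 7), (7, 45), (45, 3)]
```

In the real model, each prediction is conditioned on all the preceding tokens, not only on the last one; the pairs only show which target is aligned with which position.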

Let us initialize the training arguments and run the random search.

In [None]:
# %%wandb
from functools import partial

import torch
from transformers import Trainer, TrainingArguments

def train(config = None):

  with wandb.init(config = config):

    # seed
    torch.manual_seed(50)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, test_dataset = recuperate_datasets(config.fr_char_p, config.wf_char_p, 
                                   config.fr_word_p, config.wf_word_p)

    # get train and test datasets according to the config

    # train_dataset = datasets[config.dataset_aug]['train_dataset']

    # test_dataset = datasets[config.dataset_aug]['test_dataset']

    # set training arguments
    training_args = TrainingArguments(f"{path}training2/results1",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      # logging_steps=100,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=5,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      logging_dir=f'{path}gpt2_training_logs2',
                                      remove_unused_columns = False,
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Trainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      # compute_metrics=translation_eval.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 30)
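
The `train/global_step` values reported in the run summaries of the previous sweep (367, 245 and 147) follow from the batch size: with one epoch, the number of optimization steps is the number of training examples divided by the batch size, rounded up. The sketch below checks this arithmetic. The training-set size of 734 is an assumption inferred from those step counts (733 would fit them equally well); it is not stated in the notebook, and we also assume a single device and no gradient accumulation.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimization steps in one epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

# ASSUMPTION: training-set size inferred from the logged step counts, not stated in the notebook
num_examples = 734

steps = {batch_size: steps_per_epoch(num_examples, batch_size) for batch_size in (2, 3, 5)}

for batch_size, n_steps in steps.items():
    print(f"batch_size={batch_size} -> {n_steps} steps")
# batch_size=2 -> 367 steps
# batch_size=3 -> 245 steps
# batch_size=5 -> 147 steps
```

Smaller batches therefore mean more parameter updates per epoch, which is worth keeping in mind when comparing runs that differ in both `batch_size` and `learning_rate`.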


-----------

### Colab download and remove step

In [None]:
import shutil

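# remove a checkpoint that is no longer needed
# shutil.rmtree('/content/drive/MyDrive/Memoire/subject2/training2/results1/checkpoint-147')

# archive the local wandb directory before removing it
# shutil.make_archive('wandb', 'zip', 'wandb')
# shutil.rmtree('wandb')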