Hyper-parameter search with the Text-To-Text Transformer 🤖 (Bayes)
-----------------------------------

In this project, we will make transfer learning with the Text-To-Text Transformer (T5) model to translate French sentences into Wolof sentences and vice-versa. The method we will use for the hyperparameter search is Bayesian Hyperparameter Optimization. We will use the `wandb` library to evaluate the model more efficiently with `Parallel coordinate` and `Parameter Importance` charts. After finding the best model, we will take the checkpoints and continue the training in another notebook. Let us dive into the process.

We want to know the best combination of values of the following hyperparameters:

- **learning rate** $\sim Log U(1e-3, 1e-5)$
- **weight decay** $\in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$
- **random state** (seed of the data splitting generator) $\in \{0, 10, 20, 30, 40, 50, 60, 70, 80, 100\}$

1. For the translation from French to Wolof

  - **fr_char_p** (probability of modifying a character from a French word) $\sim U(0.0, 0.9)$
  - **fr_word_p** (probability of modifying a word from a French sentence) $\sim U(0.0, 0.9)$

2. For the translation from Wolof to French

  - **wf_char_p** (probability of modifying a character from a Wolof word) $\sim U(0.0, 0.9)$
  - **fr_word_p** (probability of modifying a word from a Wolof sentence) $\sim U(0.0, 0.9)$


The Bayes method requires to define a metric. We will evaluate the model on the test set, so the metric that we will add in the hyperparameter setting can be either the `cross entropy loss` calculated on the test set or `BLEU` score. Since it is a machine translation, a BLEU score will be useful as evaluation metric. 

**Objective**: We will try to `maximize the metric.` For the moment, we want to obtain a `BLEU` score more than `0.5`.

In [2]:
# let us extend the paths of the system
import sys

path = "/content/drive/MyDrive/Memoire/subject2/T5/"

sys.path.extend([path, f"{path}new_data"])

In [3]:
# define wandb environment
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
%env WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee 

env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
env: WANDB_API_KEY=237a8450cd2568ea1c8e1f8e0400708e79b6b4ee


In [4]:
!pip install -qq wandb --upgrade

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m66.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m201.7/201.7 kB[0m [31m27.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m184.3/184.3 kB[0m [31m23.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pathtools (setup.py) ... [?25l[?25hdone


In [5]:
!pip install evaluate -qq
!pip install sacrebleu -qq
# !pip install optuna -qq
!pip install transformers -qq 
!pip install tokenizers -qq
!pip install nlpaug -qq
!pip install ray[tune] -qq
!python -m spacy download fr_core_news_lg 

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m25.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.3/134.3 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m22.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m66.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.5/114.5 kB[0m [31m17.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━

In [6]:
# let us import all necessary libraries
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5TokenizerFast, set_seed
from wolof_translate.utils.sent_transformers import TransformerSequences
from wolof_translate.data.dataset_v2 import T5SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import wandb
import torch


We will create two models: 

- One translating the french corpus to a wolof corpus [french_to_wolof](#french-to-wolof)
- One translating the wolof corpus to a french corpus [wolof_to_french](#wolof-to-french)

--------------

## French to wolof

### Configure dataset 🔠

We will split the sentences between train (for the model's training) and test (to evaluate model fitting) sets. The samples added as train and test sets are identified according to `the random state.` We will tune the random state to the groups that guarantee the model's best fitting. In other words, we want the model to identify many training sentences and generalize that learning on the test sentences. It is not sometimes the case, mainly when using a small dataset like ours.

In [7]:
def split_data(random_state: int = 50):
  """Split data between train and test sets

  Args:
    random_state (int): the seed of the splitting generator. Defaults to 50
  """
  # load the corpora and split into train and test sets
  corpora = pd.read_csv(f"{path}new_data/corpora_v3.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(f"{path}new_data/train_set.csv", index=False)

  test_set.to_csv(f"{path}new_data/test_set.csv", index=False)

Let us load the French and Wolof corpora's common tokenizer.

In [8]:
# recuperate the tokenizer from a json file
tokenizer = T5TokenizerFast(tokenizer_file=f"{path}wolof_translate/tokenizers/t5_tokenizers/tokenizer_v1.json")


The following function will make recuperate the datasets.

In [9]:
def recuperate_datasets(fr_char_p: float, fr_word_p: float):

  # Create augmentation to add on French sentences
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  # Recuperate the train dataset
  train_dataset_aug = T5SentenceDataset(f"{path}new_data/train_set.csv",
                                        tokenizer,
                                        truncation = True,
                                        cp1_transformer = fr_augmentation)

  # Recuperate the test dataset
  test_dataset = T5SentenceDataset(f"{path}new_data/test_set.csv",
                                        tokenizer,
                                        truncation = True)
  
  # Return the datasets
  return train_dataset_aug, test_dataset

### Configure hyperparameter search ⚙️

We have to configure the search space, the search method and the metric. 

In [10]:
wandb.login(key="237a8450cd2568ea1c8e1f8e0400708e79b6b4ee")

# hyperparameters
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'maximize',
          'name': 'bleu'
      },
    'parameters':
    {
      'epochs': {
          'value': 2
      },
      'batch_size': {
          'values': [5, 16, 32]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
     'fr_char_p': {
         'min': 0.0,
         'max': 0.9
     },
     'fr_word_p': {
          'min': 0.0,
          'max': 0.9
     },
     'random_state': {
         'values': [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "base-t5-translation-bayes-hpsearch")



[34m[1mwandb[0m: Currently logged in as: [33moumar-kane[0m ([33moumar-kane-team[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: w3mdz74l
Sweep URL: https://wandb.ai/oumar-kane-team/base-t5-translation-bayes-hpsearch/sweeps/w3mdz74l


### Configure the model and the evaluation function ⚙️

Let us recuperate the model and resize the token embeddings.

**Note**: In the first training used the t5-small. In this notebook we will use the t5-base model. See bellow the configuration of the t5-small and the t5-base models, respectively.

In [11]:
small_model_name = 't5-small'
base_model_name = 't5-base'

# import the small model with its pre-trained weights
small_model = AutoModelForSeq2SeqLM.from_pretrained(small_model_name)

# import the base model with its pre-trained weights
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [12]:
# print the small configuration
small_model.config

T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefi

In [13]:
# print the base configuration
base_model.config

T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "pre

The small model have the same architecture than the original transformer of Ashish Vaswani, Noam Shazeer, and all. in the article [Attention_is_all_you_need](https://arxiv.org/pdf/1706.03762).

The base model contains more parameters since it use 12 heads in place of 8 and, 12 stack decoder layers in place of 6, the number of feed forward features is of 3072 so 1024 more features than the small one and the embedding dimension is of 768 in place of 512. The base model contains exactly 220 millions of parameters which is a huge number. But since it is pre-trained, we can directly make transfer learning with already trained weights. The base model was firstly explained in the article [Text_To_Text_Transformer](https://arxiv.org/pdf/1910.10683). 

In [14]:
def gpt2_model_init(tokenizer):

  # Initialize the model name
  model_name = 't5-base'

  # import the model with its pre-trained weights
  model = AutoModelForSeq2SeqLM.from_pretrained(small_model_name)

  # resize the token embeddings
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric. The metric computation that we will use, we got it from the following `HugginFace` tutorial [translation](https://huggingface.co/docs/transformers/tasks/translation). We will use a class to add more parameters if we want.

In [15]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):

        preds, labels = eval_preds

        if isinstance(preds, tuple):
        
            preds = preds[0]
        
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
        
        result = {"bleu": result["score"]}

        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
        
        result["gen_len"] = np.mean(prediction_lens)
        
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [16]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [17]:
evaluation = TranslationEvaluation(tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [18]:
def data_collator(batch):
    """Generate a batch of data to provide to trainer

    Args:
        batch (_type_): The batch

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0].squeeze(0) for b in batch])
    
    attention_mask = torch.stack([b[1].squeeze(0) for b in batch])
    
    labels = torch.stack([b[2].squeeze(0) for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and search for the best model. The latter will be saved as an artefact inside our `wandb` project.

In [None]:
# %%wandb

def train(config = None):

  with wandb.init(config = config):

    # seed
    set_seed(0)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, test_dataset = recuperate_datasets(config.fr_char_p, config.fr_word_p)

    # set training arguments
    training_args = Seq2SeqTrainingArguments(f"{path}/training/bayes_search_results",
                                      report_to = f"wandb",
                                      num_train_epochs=config.epochs,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=16,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      predict_with_generate=True, # we will use predict with generate in order to obtain more valuable test results
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Seq2SeqTrainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      compute_metrics=evaluation.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 25)


[34m[1mwandb[0m: Agent Starting Run: vw7pkfql with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.2284207335284644
[34m[1mwandb[0m: 	fr_word_p: 0.7607445933227656
[34m[1mwandb[0m: 	learning_rate: 4.585507080164531e-05
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.2977,1.849823,0.0106,11.1562
2,0.9947,1.690753,0.0052,7.5417


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0052
eval/gen_len,7.5417
eval/loss,1.69075
eval/runtime,2.8338
eval/samples_per_second,33.877
eval/steps_per_second,2.117
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,0.9947


[34m[1mwandb[0m: Agent Starting Run: p65p88e7 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.07431205022076158
[34m[1mwandb[0m: 	fr_word_p: 0.24562018375249303
[34m[1mwandb[0m: 	learning_rate: 0.0001501502942221681
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.9871,1.548401,0.0007,5.8438
2,0.9264,1.394427,0.0002,4.8125


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0002
eval/gen_len,4.8125
eval/loss,1.39443
eval/runtime,2.3908
eval/samples_per_second,40.154
eval/steps_per_second,2.51
train/epoch,2.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,0.9264


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: w1ouvm5l with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.2917215964001955
[34m[1mwandb[0m: 	fr_word_p: 0.058881088566940736
[34m[1mwandb[0m: 	learning_rate: 2.028042724265349e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.6808,2.353352,0.0447,17.1354
2,1.6527,1.66912,0.0214,15.875


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0214
eval/gen_len,15.875
eval/loss,1.66912
eval/runtime,2.7766
eval/samples_per_second,34.575
eval/steps_per_second,2.161
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,1.6527


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 4w117zww with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.6374777473230804
[34m[1mwandb[0m: 	fr_word_p: 0.6567084430329116
[34m[1mwandb[0m: 	learning_rate: 3.185456431313545e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.9442,3.166075,0.1166,17.3854
2,1.6571,2.385133,0.105,16.4896


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.105
eval/gen_len,16.4896
eval/loss,2.38513
eval/runtime,2.4315
eval/samples_per_second,39.481
eval/steps_per_second,2.468
train/epoch,2.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,1.6571


[34m[1mwandb[0m: Agent Starting Run: xcvr97m6 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.021122038502699315
[34m[1mwandb[0m: 	fr_word_p: 0.49523584227736334
[34m[1mwandb[0m: 	learning_rate: 3.4374017581849096e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.6591,1.864487,0.0383,14.4688
2,1.1046,1.622225,0.0099,10.3125


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0099
eval/gen_len,10.3125
eval/loss,1.62223
eval/runtime,2.8784
eval/samples_per_second,33.351
eval/steps_per_second,2.084
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,1.1046


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: rby9ad7j with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.1405191538985764
[34m[1mwandb[0m: 	fr_word_p: 0.7519656985059338
[34m[1mwandb[0m: 	learning_rate: 0.0007032208472458632
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.3572,0.812282,0.0,2.0104
2,0.7983,0.789975,0.0,2.0


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.0
eval/loss,0.78998
eval/runtime,2.2523
eval/samples_per_second,42.624
eval/steps_per_second,2.664
train/epoch,2.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,0.7983


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: mkt5adfl with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.3773106819670906
[34m[1mwandb[0m: 	fr_word_p: 0.7912333115423034
[34m[1mwandb[0m: 	learning_rate: 7.86068466056368e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.9103,1.594065,0.1001,12.9688
2,1.0374,1.366385,0.0017,7.1146


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0017
eval/gen_len,7.1146
eval/loss,1.36638
eval/runtime,2.6454
eval/samples_per_second,36.289
eval/steps_per_second,2.268
train/epoch,2.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,1.0374


[34m[1mwandb[0m: Agent Starting Run: ypwrsubf with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.7313345097897427
[34m[1mwandb[0m: 	fr_word_p: 0.11132960487119156
[34m[1mwandb[0m: 	learning_rate: 1.2926059357420485e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.2857,1.843945,0.0746,13.5312
2,1.0591,1.674211,0.0235,10.0312


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0235
eval/gen_len,10.0312
eval/loss,1.67421
eval/runtime,2.8914
eval/samples_per_second,33.202
eval/steps_per_second,2.075
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,1.0591


[34m[1mwandb[0m: Agent Starting Run: msl19pcr with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.4297011627807702
[34m[1mwandb[0m: 	fr_word_p: 0.8587999368596398
[34m[1mwandb[0m: 	learning_rate: 0.00019680212855270867
[34m[1mwandb[0m: 	random_state: 100
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9698,1.022271,0.0001,4.3958
2,0.7903,0.945151,0.0069,8.2083


0,1
eval/bleu,▁█
eval/gen_len,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0069
eval/gen_len,8.2083
eval/loss,0.94515
eval/runtime,2.7097
eval/samples_per_second,35.428
eval/steps_per_second,2.214
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.7903


[34m[1mwandb[0m: Agent Starting Run: 47qsm3a6 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.5229741511203588
[34m[1mwandb[0m: 	fr_word_p: 0.614537983488287
[34m[1mwandb[0m: 	learning_rate: 0.0002625599385241927
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.1981,1.02295,0.0,2.9688
2,0.8326,0.965255,0.0,3.0


0,1
eval/bleu,▁▁
eval/gen_len,▁█
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,3.0
eval/loss,0.96525
eval/runtime,2.4552
eval/samples_per_second,39.101
eval/steps_per_second,2.444
train/epoch,2.0
train/global_step,54.0
train/learning_rate,2e-05
train/loss,0.8326


[34m[1mwandb[0m: Agent Starting Run: 6oewtefg with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.86220828690396
[34m[1mwandb[0m: 	fr_word_p: 0.5631961056182528
[34m[1mwandb[0m: 	learning_rate: 1.5999605975595447e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,4.666,3.736738,0.0597,17.9271
2,2.6006,3.175757,0.0535,17.7292


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0535
eval/gen_len,17.7292
eval/loss,3.17576
eval/runtime,2.5259
eval/samples_per_second,38.006
eval/steps_per_second,2.375
train/epoch,2.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.6006


[34m[1mwandb[0m: Agent Starting Run: x8u9w50v with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.8377798594740502
[34m[1mwandb[0m: 	fr_word_p: 0.6330657383207193
[34m[1mwandb[0m: 	learning_rate: 0.0009489801104536334
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.8993,0.736418,0.0147,9.8229
2,0.7674,0.727795,0.0031,6.4062


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0031
eval/gen_len,6.4062
eval/loss,0.72779
eval/runtime,2.8122
eval/samples_per_second,34.138
eval/steps_per_second,2.134
train/epoch,2.0
train/global_step,344.0
train/learning_rate,1e-05
train/loss,0.7674


[34m[1mwandb[0m: Agent Starting Run: s97zj8y2 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.1290623356319853
[34m[1mwandb[0m: 	fr_word_p: 0.284760758693824
[34m[1mwandb[0m: 	learning_rate: 0.0002045524866468501
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.7758,1.315782,0.0002,4.2083
2,0.8881,1.227644,0.0,3.8021


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,3.8021
eval/loss,1.22764
eval/runtime,2.8698
eval/samples_per_second,33.452
eval/steps_per_second,2.091
train/epoch,2.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,0.8881


[34m[1mwandb[0m: Agent Starting Run: j2tjit2w with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.8137560171926866
[34m[1mwandb[0m: 	fr_word_p: 0.8490798164428399
[34m[1mwandb[0m: 	learning_rate: 1.7582045123966394e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.3882,2.56095,0.0343,17.7188
2,1.5184,1.968657,0.0315,17.0208


0,1
eval/bleu,█▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0315
eval/gen_len,17.0208
eval/loss,1.96866
eval/runtime,2.8615
eval/samples_per_second,33.548
eval/steps_per_second,2.097
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,1.5184


[34m[1mwandb[0m: Agent Starting Run: 0dfjamtn with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.5138167086183869
[34m[1mwandb[0m: 	fr_word_p: 0.8648398614783664
[34m[1mwandb[0m: 	learning_rate: 1.0209184180765728e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,4.2618,3.712499,0.1352,18.2083
2,2.3842,2.970759,0.1698,17.8542


0,1
eval/bleu,▁█
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.1698
eval/gen_len,17.8542
eval/loss,2.97076
eval/runtime,2.2466
eval/samples_per_second,42.732
eval/steps_per_second,2.671
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,2.3842


[34m[1mwandb[0m: Agent Starting Run: yl9cjuej with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.13114318700251731
[34m[1mwandb[0m: 	fr_word_p: 0.07902739762944401
[34m[1mwandb[0m: 	learning_rate: 0.0001668885465497826
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.0134,0.901187,0.0,2.6146
2,0.7928,0.840526,0.0,2.9688


0,1
eval/bleu,▁▁
eval/gen_len,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.9688
eval/loss,0.84053
eval/runtime,2.4509
eval/samples_per_second,39.17
eval/steps_per_second,2.448
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.7928


[34m[1mwandb[0m: Agent Starting Run: fbe4r4n9 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.3244935272669976
[34m[1mwandb[0m: 	fr_word_p: 0.5496378952360765
[34m[1mwandb[0m: 	learning_rate: 5.9222354005856824e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.2631,1.082431,0.0,2.8854
2,0.8341,1.024516,0.0,2.5104


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.5104
eval/loss,1.02452
eval/runtime,2.6979
eval/samples_per_second,35.583
eval/steps_per_second,2.224
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.8341


[34m[1mwandb[0m: Agent Starting Run: 1gh43dy9 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.4014924799632997
[34m[1mwandb[0m: 	fr_word_p: 0.1940136670656171
[34m[1mwandb[0m: 	learning_rate: 0.00021884168280145557
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9658,0.962134,0.0001,4.8021
2,0.786,0.901413,0.0006,5.6146


0,1
eval/bleu,▁█
eval/gen_len,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0006
eval/gen_len,5.6146
eval/loss,0.90141
eval/runtime,2.3133
eval/samples_per_second,41.499
eval/steps_per_second,2.594
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.786


[34m[1mwandb[0m: Agent Starting Run: rf13asgt with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.4128867592656632
[34m[1mwandb[0m: 	fr_word_p: 0.0777395915666
[34m[1mwandb[0m: 	learning_rate: 0.00023507873741447456
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.6001,0.911504,0.0,3.2292
2,0.8217,0.865286,0.0,2.9896


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.9896
eval/loss,0.86529
eval/runtime,2.366
eval/samples_per_second,40.575
eval/steps_per_second,2.536
train/epoch,2.0
train/global_step,108.0
train/learning_rate,1e-05
train/loss,0.8217


[34m[1mwandb[0m: Agent Starting Run: ho3sp3ng with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.796367970844181
[34m[1mwandb[0m: 	fr_word_p: 0.5723626900476535
[34m[1mwandb[0m: 	learning_rate: 0.00024350911841912148
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.2524,1.010314,0.0,3.2917
2,0.8049,0.948365,0.0,3.0521


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,3.0521
eval/loss,0.94837
eval/runtime,2.3872
eval/samples_per_second,40.214
eval/steps_per_second,2.513
train/epoch,2.0
train/global_step,108.0
train/learning_rate,0.0
train/loss,0.8049


[34m[1mwandb[0m: Agent Starting Run: 64dgmq2q with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.7193777720262167
[34m[1mwandb[0m: 	fr_word_p: 0.3255073615121397
[34m[1mwandb[0m: 	learning_rate: 0.00023741525394460128
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9591,0.870859,0.0,2.5521
2,0.7851,0.816025,0.0,2.1875


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.1875
eval/loss,0.81603
eval/runtime,2.4138
eval/samples_per_second,39.771
eval/steps_per_second,2.486
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.7851


[34m[1mwandb[0m: Agent Starting Run: rhjdyn6m with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.06003607015056728
[34m[1mwandb[0m: 	fr_word_p: 0.6486174489447527
[34m[1mwandb[0m: 	learning_rate: 0.0001647141000875842
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.112,0.867427,0.0,2.0729
2,0.793,0.821641,0.0,2.4167


0,1
eval/bleu,▁▁
eval/gen_len,▁█
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,2.4167
eval/loss,0.82164
eval/runtime,2.7119
eval/samples_per_second,35.4
eval/steps_per_second,2.212
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.793


[34m[1mwandb[0m: Agent Starting Run: nosgqo2q with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.251140235811381
[34m[1mwandb[0m: 	fr_word_p: 0.5999445302650388
[34m[1mwandb[0m: 	learning_rate: 0.00010238321457669562
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.1622,0.927801,0.0,3.375
2,0.8143,0.873399,0.0,3.2188


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,3.2188
eval/loss,0.8734
eval/runtime,2.2852
eval/samples_per_second,42.009
eval/steps_per_second,2.626
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.8143


[34m[1mwandb[0m: Agent Starting Run: px8vbrn2 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.6162844579681476
[34m[1mwandb[0m: 	fr_word_p: 0.26868551749451
[34m[1mwandb[0m: 	learning_rate: 0.0004132389041306842
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.41,1.04389,0.0,3.625
2,0.7905,0.985476,0.0,3.4688


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,▁█
eval/samples_per_second,█▁
eval/steps_per_second,█▁
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,3.4688
eval/loss,0.98548
eval/runtime,2.2779
eval/samples_per_second,42.145
eval/steps_per_second,2.634
train/epoch,2.0
train/global_step,108.0
train/learning_rate,2e-05
train/loss,0.7905


[34m[1mwandb[0m: Agent Starting Run: maz7th58 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fr_char_p: 0.169184420746998
[34m[1mwandb[0m: 	fr_word_p: 0.0697888178971085
[34m[1mwandb[0m: 	learning_rate: 0.00019086361849518968
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.0361,0.869787,0.0,1.9271
2,0.7788,0.861419,0.0,1.8125


0,1
eval/bleu,▁▁
eval/gen_len,█▁
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁███
train/global_step,▁▁███
train/learning_rate,█▁
train/loss,█▁

0,1
eval/bleu,0.0
eval/gen_len,1.8125
eval/loss,0.86142
eval/runtime,2.272
eval/samples_per_second,42.254
eval/steps_per_second,2.641
train/epoch,2.0
train/global_step,344.0
train/learning_rate,0.0
train/loss,0.7788


------------------

## Wolof to french

The only thing that we will change is the order of sentences. The wolof sentence is the first one to write.

### Configure dataset 🔠

We can use the same custom dataset that we created in [text_augmentation](text_augmentation.ipynb). But we need to split the data between train and test sets and save them.

In [None]:
def split_data(random_state: int = 50):

  # load the corpora and split into train and test sets
  corpora = pd.read_csv(f"{path}new_data/sent_extraction.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(f"{path}new_data/train_set.csv", index=False)

  test_set.to_csv(f"{path}new_data/test_set.csv", index=False)

Let us recuperate the datasets.

In [None]:
def recuperate_datasets(wf_char_p: float, wf_word_p):

  # with augmentation
  wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=wf_char_p, aug_word_p=wf_word_p),
                                        remove_mark_space, delete_guillemet_space)

  train_dataset_aug = SentenceDataset(f"{path}new_data/train_set.csv", 
                                  tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                  corpus_1="wolof_corpus",
                                  corpus_2="french_corpus",
                                  cp1_transformer=wf_augmentation, truncation=True,
                                  max_len=579)

  test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                corpus_1="wolof_corpus",
                                corpus_2="french_corpus",
                                truncation=True, max_len=579)
  
  return train_dataset_aug, test_dataset

### Configure hyperparameter search ⚙️

We have to configure the search space and the search method ("random" in our case). .

In [None]:
import wandb
wandb.login(key="237a8450cd2568ea1c8e1f8e0400708e79b6b4ee")

# hyperparameters
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'minimize',
          'name': 'eval_loss'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [2, 3, 5]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
     'wf_char_p': {
          'min': 0.0,
          'max': 0.7
     },
     'wf_word_p': {
          'min': 0.0,
          'max': 0.7
     },
     'random_state': {
         'values': [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "gpt2-wolof-french-translation_bayes1_1")



[34m[1mwandb[0m: Currently logged in as: [33moumar-kane[0m ([33moumar-kane-team[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: alygo14y
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-wolof-french-translation_bayes1_1/sweeps/alygo14y


### Configure the model and the evaluation function ⚙️

Let us recuperate the model and resize the token embeddings.

In [None]:
def gpt2_model_init(tokenizer):
  # set the mode name
  model_name = "gpt2"

  # recuperate the tokenizer from the dataset
  tokenizer = tokenizer

  # configure the model
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric.

In [None]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
        preds, labels = eval_preds.preds.detach().cpu(), labels.detach().cpu()
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            result = self.metric.compute(predictions=predictions, references=labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        wandb.log("bleu", result["bleu"])
            
        return result

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [None]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [None]:
# translation_eval = TranslationEvaluation(test_dataset.tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [None]:
def data_collator(batch):
    """Generate a batch of data to provide to trainer

    Args:
        batch (_type_): The batch

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    labels = torch.stack([b[0] for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and make random search.

In [None]:
# %%wandb

def train(config = None):

  with wandb.init(config = config):

    # seed
    torch.manual_seed(50)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, test_dataset = recuperate_datasets(config.wf_char_p, config.wf_word_p)

    # get train and test datasets according to the config

    # train_dataset = datasets[config.dataset_aug]['train_dataset']

    # test_dataset = datasets[config.dataset_aug]['test_dataset']

    # set training arguments
    training_args = TrainingArguments(f"{path}training2/Results1",
                                      report_to = f"wandb",
                                      num_train_epochs=config.epochs,
                                      # logging_steps=100,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=5,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      remove_unused_columns = False,
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Trainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      # compute_metrics=translation_eval.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 25)


[34m[1mwandb[0m: Agent Starting Run: a0u0t6k2 with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.702369179262155e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.2819068695463206
[34m[1mwandb[0m: 	wf_word_p: 0.3271474379445852


Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]



Epoch,Training Loss,Validation Loss
1,1.3314,0.972646


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.97265
eval/runtime,2.7358
eval/samples_per_second,29.973
eval/steps_per_second,6.214
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3314
train/total_flos,216590170752000.0
train/train_loss,1.33135


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 8p97mqyj with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.000687378518112751
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.10242928904824668
[34m[1mwandb[0m: 	wf_word_p: 0.3761238934195836




Epoch,Training Loss,Validation Loss
1,1.2118,0.902809


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.90281
eval/runtime,2.7187
eval/samples_per_second,30.162
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.2118
train/total_flos,216590170752000.0
train/train_loss,1.21175


[34m[1mwandb[0m: Agent Starting Run: cufx9n8t with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 1.226951404890168e-05
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.23759176734591644
[34m[1mwandb[0m: 	wf_word_p: 0.13227163155096622




Epoch,Training Loss,Validation Loss
1,1.5713,0.956762


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.95676
eval/runtime,2.7129
eval/samples_per_second,30.226
eval/steps_per_second,6.266
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5713
train/total_flos,216590170752000.0
train/train_loss,1.57131


[34m[1mwandb[0m: Agent Starting Run: y2zo0avu with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0007890211017920526
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2153084724244564
[34m[1mwandb[0m: 	wf_word_p: 0.03750263309251056




Epoch,Training Loss,Validation Loss
1,1.3947,0.852979


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.85298
eval/runtime,2.7128
eval/samples_per_second,30.227
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,4e-05
train/loss,1.3947
train/total_flos,216590170752000.0
train/train_loss,1.39468


[34m[1mwandb[0m: Agent Starting Run: 6zl4nshz with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00013060682353049685
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.6581358201308465
[34m[1mwandb[0m: 	wf_word_p: 0.010288732393246668




Epoch,Training Loss,Validation Loss
1,1.1252,0.843749


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.84375
eval/runtime,2.7146
eval/samples_per_second,30.207
eval/steps_per_second,6.262
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.1252
train/total_flos,216590170752000.0
train/train_loss,1.12516


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: lmlh43bi with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 4.6544061486102166e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.601761705072325
[34m[1mwandb[0m: 	wf_word_p: 0.5485382582443594
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin




Epoch,Training Loss,Validation Loss
1,1.5112,0.959907


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.95991
eval/runtime,2.7127
eval/samples_per_second,30.228
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5112
train/total_flos,216590170752000.0
train/train_loss,1.5112


[34m[1mwandb[0m: Agent Starting Run: b3709lrr with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 8.01812368676995e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.3414514505517595
[34m[1mwandb[0m: 	wf_word_p: 0.06590909422021479




Epoch,Training Loss,Validation Loss
1,1.2984,0.943387


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.94339
eval/runtime,2.7141
eval/samples_per_second,30.212
eval/steps_per_second,6.264
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.2984
train/total_flos,216590170752000.0
train/train_loss,1.2984


[34m[1mwandb[0m: Agent Starting Run: ixh1wkga with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.665404061522188e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2025899023149373
[34m[1mwandb[0m: 	wf_word_p: 0.30403171896970477




Epoch,Training Loss,Validation Loss
1,1.6004,0.898786


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.89879
eval/runtime,2.7076
eval/samples_per_second,30.285
eval/steps_per_second,6.279
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6004
train/total_flos,216590170752000.0
train/train_loss,1.60037


[34m[1mwandb[0m: Agent Starting Run: yqxwql6m with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0009940796140095907
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.5740543904824967
[34m[1mwandb[0m: 	wf_word_p: 0.2462419315938406




Epoch,Training Loss,Validation Loss
1,1.2845,0.848199


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.8482
eval/runtime,2.7189
eval/samples_per_second,30.159
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,2e-05
train/loss,1.2845
train/total_flos,216590170752000.0
train/train_loss,1.28448


[34m[1mwandb[0m: Agent Starting Run: 6z8kvttx with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.136396657280805e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.496992386293468
[34m[1mwandb[0m: 	wf_word_p: 0.09128048050662484


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.016668863250000263, max=1.0…



Epoch,Training Loss,Validation Loss
1,1.6337,0.925104


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.9251
eval/runtime,2.7135
eval/samples_per_second,30.22
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6337
train/total_flos,216590170752000.0
train/train_loss,1.63375


[34m[1mwandb[0m: Agent Starting Run: m9sb4aqc with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 2.292925960874299e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.1392332134258406
[34m[1mwandb[0m: 	wf_word_p: 0.33276367556881936




Epoch,Training Loss,Validation Loss
1,1.3772,0.921299


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.9213
eval/runtime,2.7224
eval/samples_per_second,30.12
eval/steps_per_second,6.244
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3772
train/total_flos,216590170752000.0
train/train_loss,1.37722


[34m[1mwandb[0m: Agent Starting Run: jbput8ui with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.444484556126347e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0
[34m[1mwandb[0m: 	wf_char_p: 0.29893712569494857
[34m[1mwandb[0m: 	wf_word_p: 0.26117783521919574




Epoch,Training Loss,Validation Loss
1,1.4321,0.896062


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.89606
eval/runtime,2.708
eval/samples_per_second,30.281
eval/steps_per_second,6.278
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4321
train/total_flos,216590170752000.0
train/train_loss,1.43209


[34m[1mwandb[0m: Agent Starting Run: 4fwhdrrr with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 2.5593238990374552e-05
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2943749543935819
[34m[1mwandb[0m: 	wf_word_p: 0.12398175089457102




Epoch,Training Loss,Validation Loss
1,1.6999,0.85214


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.85214
eval/runtime,2.7191
eval/samples_per_second,30.157
eval/steps_per_second,6.252
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6999
train/total_flos,216590170752000.0
train/train_loss,1.69991


[34m[1mwandb[0m: Agent Starting Run: 1bz3dbic with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.890480473611398e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.2378445346533837
[34m[1mwandb[0m: 	wf_word_p: 0.4209643755950993




Epoch,Training Loss,Validation Loss
1,1.4832,0.875917


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.87592
eval/runtime,2.7525
eval/samples_per_second,29.792
eval/steps_per_second,6.176
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4832
train/total_flos,216590170752000.0
train/train_loss,1.48315


[34m[1mwandb[0m: Agent Starting Run: ho6ta14m with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0002774553527447285
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.6714497099264989
[34m[1mwandb[0m: 	wf_word_p: 0.03656663081592687




Epoch,Training Loss,Validation Loss
1,1.1247,0.822917


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.82292
eval/runtime,2.7172
eval/samples_per_second,30.178
eval/steps_per_second,6.256
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.1247
train/total_flos,216590170752000.0
train/train_loss,1.12474


[34m[1mwandb[0m: Agent Starting Run: wmlfiw0r with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0003923791647223955
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.06807537344840917
[34m[1mwandb[0m: 	wf_word_p: 0.5604330614348828




Epoch,Training Loss,Validation Loss
1,1.5006,0.949637


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.94964
eval/runtime,2.7103
eval/samples_per_second,30.255
eval/steps_per_second,6.272
train/epoch,1.0
train/global_step,147.0
train/learning_rate,2e-05
train/loss,1.5006
train/total_flos,216590170752000.0
train/train_loss,1.5006


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: r8gcrole with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 1.821341363881095e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.3
[34m[1mwandb[0m: 	wf_char_p: 0.4377052667800509
[34m[1mwandb[0m: 	wf_word_p: 0.28337028938356845




Epoch,Training Loss,Validation Loss
1,1.8462,0.981831


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.98183
eval/runtime,2.7136
eval/samples_per_second,30.218
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.8462
train/total_flos,216590170752000.0
train/train_loss,1.8462


[34m[1mwandb[0m: Agent Starting Run: phwfj18n with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.811279212646426e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2317434997460037
[34m[1mwandb[0m: 	wf_word_p: 0.4361895012572056




Epoch,Training Loss,Validation Loss
1,1.6396,0.980084


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.98008
eval/runtime,2.767
eval/samples_per_second,29.635
eval/steps_per_second,6.144
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6396
train/total_flos,216590170752000.0
train/train_loss,1.63955


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: jlqkadbj with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00010056956001033137
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.6390337876205812
[34m[1mwandb[0m: 	wf_word_p: 0.23792329132405943




Epoch,Training Loss,Validation Loss
1,1.6119,0.873367


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.87337
eval/runtime,2.7126
eval/samples_per_second,30.23
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6119
train/total_flos,216590170752000.0
train/train_loss,1.61191


[34m[1mwandb[0m: Agent Starting Run: 6p270685 with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.129742665577746e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.39725467091195105
[34m[1mwandb[0m: 	wf_word_p: 0.5004309074101329




Epoch,Training Loss,Validation Loss
1,1.4593,0.955068


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.95507
eval/runtime,2.7243
eval/samples_per_second,30.099
eval/steps_per_second,6.24
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4593
train/total_flos,216590170752000.0
train/train_loss,1.45931


[34m[1mwandb[0m: Agent Starting Run: gcv6axp9 with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.252438217932992e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0
[34m[1mwandb[0m: 	wf_char_p: 0.29046645972247725
[34m[1mwandb[0m: 	wf_word_p: 0.2600785813543463




Epoch,Training Loss,Validation Loss
1,1.3169,0.971379


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.97138
eval/runtime,2.7496
eval/samples_per_second,29.823
eval/steps_per_second,6.183
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3169
train/total_flos,216590170752000.0
train/train_loss,1.31693


[34m[1mwandb[0m: Agent Starting Run: yhyy9e23 with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.647196100197014e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.1717366963251063
[34m[1mwandb[0m: 	wf_word_p: 0.1316624948710261




Epoch,Training Loss,Validation Loss
1,1.3969,0.856822


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.85682
eval/runtime,2.7137
eval/samples_per_second,30.217
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.3969
train/total_flos,216590170752000.0
train/train_loss,1.3969


[34m[1mwandb[0m: Agent Starting Run: xup68ew4 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.637652455538814e-05
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.3
[34m[1mwandb[0m: 	wf_char_p: 0.677355252029728
[34m[1mwandb[0m: 	wf_word_p: 0.16163588455481578




Epoch,Training Loss,Validation Loss
1,1.6871,0.952795


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.9528
eval/runtime,2.7376
eval/samples_per_second,29.953
eval/steps_per_second,6.21
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6871
train/total_flos,216590170752000.0
train/train_loss,1.68711


[34m[1mwandb[0m: Agent Starting Run: wa2re6yd with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00015372099844614283
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.4593404124892092
[34m[1mwandb[0m: 	wf_word_p: 0.5384324544438147




Epoch,Training Loss,Validation Loss
1,1.4028,0.823874


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.82387
eval/runtime,2.7275
eval/samples_per_second,30.064
eval/steps_per_second,6.233
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4028
train/total_flos,216590170752000.0
train/train_loss,1.40277


[34m[1mwandb[0m: Agent Starting Run: i305ffth with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 4.284698923411304e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.6295786463189464
[34m[1mwandb[0m: 	wf_word_p: 0.6968639258681786




Epoch,Training Loss,Validation Loss
1,1.5093,0.986804


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.9868
eval/runtime,2.7162
eval/samples_per_second,30.189
eval/steps_per_second,6.259
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5093
train/total_flos,216590170752000.0
train/train_loss,1.5093


-----------

## Colab download and remove step

In [27]:
import shutil

shutil.rmtree('/content/drive/MyDrive/Memoire/subject2/T5/training/bayes_search_results')
shutil.rmtree('wandb')
# shutil.make_archive('wandb', 'zip', 'wanbd')