Hyper-parameter search with the Text-To-Text Transformer 🤖 (Bayes)
-----------------------------------

In this project, we apply transfer learning to the Text-To-Text Transfer Transformer (T5) model to translate French sentences into Wolof and vice versa. We search the hyperparameters with Bayesian hyperparameter optimization and use the `wandb` library to compare the runs more efficiently with its `Parallel coordinates` and `Parameter importance` charts. After finding the best model, we will take its checkpoints and continue the training in another notebook. Let us dive into the process.

We want to know the best combination of values of the following hyperparameters:

- **learning rate** $\sim Log U(1e-5, 1e-3)$
- **weight decay** $\in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$
- **random state** (seed of the data splitting generator) $\in range(1, 100)$ 

1. For the translation from French to Wolof

  - **fr_char_p** (probability of modifying a character from a French word) $\sim U(0.0, 0.9)$
  - **fr_word_p** (probability of modifying a word from a French sentence) $\sim U(0.0, 0.9)$

2. For the translation from Wolof to French

  - **wf_char_p** (probability of modifying a character from a Wolof word) $\sim U(0.0, 0.9)$
  - **wf_word_p** (probability of modifying a word from a Wolof sentence) $\sim U(0.0, 0.9)$
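
To make the search space above concrete, here is a minimal sketch that draws random configurations from it with the standard library only. It illustrates the distributions, not the Bayesian strategy itself: the function name `sample_config` is hypothetical, and the seed range 1 to 100 follows the `list(range(1, 101))` used in the sweep configuration.

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the French-to-Wolof search space (illustrative only)."""
    # log-uniform: draw the logarithm uniformly, then exponentiate
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-3))
    return {
        "learning_rate": math.exp(log_lr),
        "weight_decay": rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        "random_state": rng.randint(1, 100),  # both bounds are included
        "fr_char_p": rng.uniform(0.0, 0.9),
        "fr_word_p": rng.uniform(0.0, 0.9),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(1000)]

# with a log-uniform draw, about half of the values fall below the geometric mean 1e-4
share_below = sum(c["learning_rate"] < 1e-4 for c in configs) / len(configs)
print(f"share of learning rates below 1e-4: {share_below:.2f}")
```

A plain uniform draw on the same interval would put almost all learning rates above 1e-4, which is why the logarithmic scale is preferred for this hyperparameter.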


The Bayes method requires a metric to optimize. We will evaluate the model on a held-out set (the validation set in the training code of this notebook), so the metric that we add to the hyperparameter settings can be either the `cross-entropy loss` calculated on that set or the `BLEU` score. Since this is a machine translation task, the BLEU score is the more useful evaluation metric.

**Objective**: We will try to `maximize the metric`. For the moment, we want to obtain a `BLEU` score greater than `50`.
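
To give an intuition of what a BLEU score measures, the following is a simplified sentence-level sketch: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty, on a 0 to 100 scale. It is not the `sacrebleu` implementation used later in this notebook, which works at corpus level and applies its own tokenization and smoothing; `simple_bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU on a 0-100 scale, whitespace tokens, no smoothing."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_counts.values())
        # clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))
    # brevity penalty: predictions shorter than the reference are penalized
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


perfect = simple_bleu("le chat dort sur le tapis", "le chat dort sur le tapis")
partial = simple_bleu("le chat dort sur le tapis", "le chat dort sur le grand tapis")
unrelated = simple_bleu("un deux trois quatre", "le chat dort ici")
print(perfect, round(partial, 2), unrelated)
```

Because every n-gram order must match at least once, short or poorly aligned predictions quickly fall to a score near zero, which is what early training runs typically show.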

In [None]:
# let us extend the paths of the system
import sys

path = "/content/drive/MyDrive/Memoire/subject2/T5/"

sys.path.extend([path, f"{path}new_data"])

In [None]:
# define wandb environment
# the API key is a secret: replace the placeholder with your own key and never commit it
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
%env WANDB_API_KEY=<YOUR_WANDB_API_KEY>

env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=training_gpt2_2.ipynb
env: WANDB_API_KEY=<YOUR_WANDB_API_KEY>


In [None]:
!pip install -qq wandb --upgrade



In [None]:
!pip install evaluate -qq
!pip install sacrebleu -qq
# !pip install optuna -qq
!pip install transformers -qq 
!pip install tokenizers -qq
!pip install nlpaug -qq
!pip install ray[tune] -qq
!python -m spacy download fr_core_news_lg 


In [None]:
# let us import all necessary libraries
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5TokenizerFast, set_seed
from wolof_translate.utils.sent_transformers import TransformerSequences
from wolof_translate.data.dataset_v2 import T5SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import wandb
import torch


We will create two models: 

- One translating the French corpus into a Wolof corpus [french_to_wolof](#french-to-wolof)
- One translating the Wolof corpus into a French corpus [wolof_to_french](#wolof-to-french)

--------------

## French to Wolof

### Configure dataset 🔠

We will split the sentences into train (for the model's training), validation (to find the best performance) and test (to make final predictions) sets. Which samples end up in the train, validation and test sets depends on `the random state`. We will tune the random state to find the split that guarantees the model's best fit. In other words, we want the model to learn from the training sentences and generalize that learning to the validation sentences. This is not always the case, mainly when using a small dataset like ours.

Note that when we continue training the model from the checkpoints, we will use the train set plus the validation set. We therefore need to save the part of the dataset that does not contain the test set for later.
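
As a rough illustration of the two-stage split described above, the sketch below holds out 10% of the samples for the test set and then 10% of the remainder for the validation set, which leaves about 81% for training. It uses only the standard library; the corpus size of 1000 and the helper `two_stage_split` are hypothetical, and `train_test_split` from scikit-learn may round the set sizes slightly differently.

```python
import random


def two_stage_split(n_samples: int, held_out: float = 0.1, seed: int = 50):
    """Split sample indices into train, validation and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the seed decides which samples land where

    # first stage: hold out the test set
    n_test = round(n_samples * held_out)
    test, train_valid = indices[:n_test], indices[n_test:]

    # second stage: hold out the validation set from what remains
    n_valid = round(len(train_valid) * held_out)
    valid, train = train_valid[:n_valid], train_valid[n_valid:]
    return train, valid, test


train, valid, test = two_stage_split(1000)
print(len(train), len(valid), len(test))
```

The union of `train` and `valid` corresponds to the final training set that is reused when training continues from the checkpoints.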

In [None]:
def split_data(random_state: int = 50):
  """Split data between train, validation and test sets

  Args:
    random_state (int): the seed of the splitting generator. Defaults to 50
  """
  # load the corpora and hold out 10% of the sentences as the test set
  corpora = pd.read_csv(f"{path}new_data/corpora_v3.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the final training set (train + validation) before the second split
  train_set.to_csv(f"{path}new_data/final_train_set.csv", index=False)

  # hold out 10% of the remaining sentences as the validation set
  train_set, valid_set = train_test_split(train_set, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(f"{path}new_data/train_set.csv", index=False)

  valid_set.to_csv(f"{path}new_data/valid_set.csv", index=False)

  test_set.to_csv(f"{path}new_data/test_set.csv", index=False)

Let us load the French and Wolof corpora's common tokenizer.

In [None]:
# load the tokenizer from a json file
tokenizer = T5TokenizerFast(tokenizer_file=f"{path}wolof_translate/tokenizers/t5_tokenizers/tokenizer_v1.json")


The following function loads the training and validation datasets.

In [None]:
def recuperate_datasets(fr_char_p: float, fr_word_p: float):
  """Load the augmented training dataset and the validation dataset

  Args:
    fr_char_p (float): probability of modifying a character from a French word
    fr_word_p (float): probability of modifying a word from a French sentence
  """
  # Create the augmentation to apply to the French sentences
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  # Load the train dataset (with augmentation)
  train_dataset_aug = T5SentenceDataset(f"{path}new_data/train_set.csv",
                                        tokenizer,
                                        truncation = True,
                                        cp1_transformer = fr_augmentation)

  # Load the validation dataset (without augmentation)
  valid_dataset = T5SentenceDataset(f"{path}new_data/valid_set.csv",
                                        tokenizer,
                                        truncation = True)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset
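
To show how the two probabilities `fr_char_p` and `fr_word_p` interact, here is a simplified noise model written with the standard library. It is an assumption-laden sketch and not the `nlpaug` algorithm: `KeyboardAug` substitutes characters with neighbouring keyboard keys, whereas the hypothetical `noisy_sentence` below picks random lowercase letters and treats both parameters as independent probabilities.

```python
import random


def noisy_sentence(sentence: str, char_p: float, word_p: float, rng: random.Random) -> str:
    """Corrupt a sentence: select words with word_p, then characters with char_p."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    words = []
    for word in sentence.split():
        if rng.random() < word_p:  # this word is selected for modification
            word = "".join(
                rng.choice(alphabet) if rng.random() < char_p else ch
                for ch in word
            )
        words.append(word)
    return " ".join(words)


rng = random.Random(0)
sentence = "le chat dort sur le tapis"
untouched = noisy_sentence(sentence, char_p=0.9, word_p=0.0, rng=rng)
scrambled = noisy_sentence(sentence, char_p=1.0, word_p=1.0, rng=rng)
print(untouched)
print(scrambled)
```

With `word_p = 0` the sentence is returned unchanged whatever the value of `char_p`, so the two parameters have to be tuned together.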

### Configure hyperparameter search ⚙️

We have to configure the search space, the search method and the metric. 

In [None]:
wandb.login()  # reads the key from the WANDB_API_KEY environment variable defined earlier

# hyperparameters
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'maximize',
          'name': 'bleu'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [5, 16, 32]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
     'fr_char_p': {
         'min': 0.0,
         'max': 0.9
     },
     'fr_word_p': {
          'min': 0.0,
          'max': 0.9
     },
     'random_state': {
         'values': list(range(1, 101))
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "small-t5-fw-translation-bayes-hpsearch-v1")



[34m[1mwandb[0m: Currently logged in as: [33moumar-kane[0m ([33moumar-kane-team[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: 6sh11rwz
Sweep URL: https://wandb.ai/oumar-kane-team/small-t5-translation-bayes-hpsearch/sweeps/6sh11rwz


### Configure the model and the evaluation function ⚙️

Let us load the model and resize the token embeddings.

**Note**: In the first training we want to use the t5-small model. If we don't obtain good results, we will switch to t5-base, which contains more parameters. See below the configurations of the t5-small and the t5-base models, respectively.

In [None]:
small_model_name = 't5-small'
base_model_name = 't5-base'

# import the small model with its pre-trained weights
small_model = AutoModelForSeq2SeqLM.from_pretrained(small_model_name)

# import the base model with its pre-trained weights
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)



In [None]:
# print the small configuration
small_model.config

T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefi

In [None]:
# print the base configuration
base_model.config

T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "pre

The small model has the same dimensions (6 layers per stack, 8 attention heads, a model dimension of 512 and a feed-forward dimension of 2048) as the base Transformer of Ashish Vaswani, Noam Shazeer, et al. in the article [Attention_is_all_you_need](https://arxiv.org/pdf/1706.03762).

The base model contains more parameters: it uses 12 attention heads instead of 8 and 12 encoder and 12 decoder layers instead of 6, its feed-forward dimension is 3072 (1024 more than the small model) and its embedding dimension is 768 instead of 512. The base model contains about 220 million parameters, which is a large number. But since it is pre-trained, we can directly apply transfer learning to the already trained weights. The T5 models were first described in the article [Text_To_Text_Transformer](https://arxiv.org/pdf/1910.10683).
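
The parameter counts can be estimated directly from the values printed in the two configurations. The sketch below is a back-of-the-envelope count, assuming the original T5 layout: bias-free linear layers, one scale vector per layer norm, a single embedding matrix shared by the encoder, the decoder and the output layer, and a vocabulary of 32128 tokens (the vocabulary size is not visible in the truncated output above, so it is an assumption here). The function name is hypothetical.

```python
def t5_parameter_count(d_model, d_ff, d_kv, num_heads, num_layers,
                       vocab_size=32128, num_buckets=32):
    """Estimate the number of parameters of a T5 model from its configuration."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner      # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff    # two linear layers without biases
    # encoder block: self-attention + feed-forward + 2 layer norms
    encoder_layer = attention + feed_forward + 2 * d_model
    # decoder block: self-attention + cross-attention + feed-forward + 3 layer norms
    decoder_layer = 2 * attention + feed_forward + 3 * d_model
    embeddings = vocab_size * d_model    # shared input/output embedding matrix
    # relative position bias table in the first block of each stack
    relative_bias = 2 * num_buckets * num_heads
    final_norms = 2 * d_model            # final layer norm of each stack
    return (num_layers * (encoder_layer + decoder_layer)
            + embeddings + relative_bias + final_norms)


small = t5_parameter_count(d_model=512, d_ff=2048, d_kv=64, num_heads=8, num_layers=6)
base = t5_parameter_count(d_model=768, d_ff=3072, d_kv=64, num_heads=12, num_layers=12)
print(f"t5-small: {small / 1e6:.1f}M parameters, t5-base: {base / 1e6:.1f}M parameters")
```

Under these assumptions the estimate is roughly 60 million parameters for the small model and 223 million for the base model. Resizing the token embeddings to a custom tokenizer changes the embedding term, so the fine-tuned models will differ from these figures.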

In [None]:
def gpt2_model_init(tokenizer):
  # note: despite its name, this function initializes a T5 model

  # Initialize the model name
  model_name = 't5-small'

  # import the model with its pre-trained weights
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

  # resize the token embeddings to the vocabulary size of our tokenizer
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric. We took the metric computation from the following `Hugging Face` tutorial: [translation](https://huggingface.co/docs/transformers/tasks/translation). We wrap it in a class so that we can add more parameters later if we want.

In [None]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):

        preds, labels = eval_preds

        if isinstance(preds, tuple):
        
            preds = preds[0]
        
        # use the tokenizer given to the constructor rather than the global variable
        decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)

        # -100 marks the label positions ignored by the loss: restore the padding token before decoding
        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        
        decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
        
        result = {"bleu": result["score"]}

        # average number of non-padding tokens generated per prediction
        prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
        
        result["gen_len"] = np.mean(prediction_lens)
        
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result


In [None]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [None]:
evaluation = TranslationEvaluation(tokenizer)
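
Two steps of `compute_metrics` are worth illustrating on toy data: replacing the `-100` label positions (the index ignored by the cross-entropy loss) with the padding token before decoding, and computing `gen_len` as the mean number of non-padding tokens per prediction. The sketch below uses plain Python lists instead of NumPy arrays; the helper names and the token ids are hypothetical, and the padding id `0` is taken from the T5 configuration (a custom tokenizer may use another id).

```python
PAD_ID = 0           # pad_token_id of the T5 configuration (assumption for the custom tokenizer)
IGNORE_INDEX = -100  # label value ignored by the loss


def restore_padding(labels, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace ignored label positions with the padding id so that they can be decoded."""
    return [[pad_id if token == ignore_index else token for token in row] for row in labels]


def mean_generated_length(preds, pad_id=PAD_ID):
    """Average number of non-padding tokens per predicted sequence."""
    lengths = [sum(token != pad_id for token in row) for row in preds]
    return sum(lengths) / len(lengths)


labels = [[5, 6, -100], [7, -100, -100]]
preds = [[5, 6, 0], [7, 0, 0]]
print(restore_padding(labels))       # the -100 entries become 0
print(mean_generated_length(preds))  # (2 + 1) / 2
```

A `gen_len` far below the typical reference length is a quick sign that the model stops generating too early, which also drives the BLEU score down through the brevity penalty.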

### Searching for the best parameters 🕖

Let us define the data collator.

In [None]:
def data_collator(batch):
    """Generate a batch of data to provide to trainer

    Args:
        batch (list): The samples, each one a tuple (input ids, attention mask, labels)

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0].squeeze(0) for b in batch])
    
    attention_mask = torch.stack([b[1].squeeze(0) for b in batch])
    
    labels = torch.stack([b[2].squeeze(0) for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and search for the best model. The latter will be saved as an artefact inside our `wandb` project.

In [None]:
# %%wandb

def train(config = None):

  with wandb.init(config = config):

    # seed
    set_seed(0)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, valid_dataset = recuperate_datasets(config.fr_char_p, config.fr_word_p)

    # set training arguments
    training_args = Seq2SeqTrainingArguments(f"{path}/training//training/bayes_search_results_fw_v1",
                                      report_to = f"wandb",
                                      num_train_epochs=config.epochs,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=16,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      predict_with_generate=True, # we will use predict with generate in order to obtain more valuable test results
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Seq2SeqTrainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=valid_dataset,
                      data_collator=data_collator,
                      compute_metrics=evaluation.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 45)


[34m[1mwandb[0m: Agent Starting Run: wltbvc5x with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.5610117724717548
[34m[1mwandb[0m: 	fr_word_p: 0.12342420562500794
[34m[1mwandb[0m: 	learning_rate: 3.0431038733614627e-05
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.6623,1.433833,0.0039,7.625


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0039
eval/gen_len,7.625
eval/loss,1.43383
eval/runtime,3.5668
eval/samples_per_second,26.915
eval/steps_per_second,1.682
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,1.6623


[34m[1mwandb[0m: Agent Starting Run: gkmqnvbd with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.8538971073103738
[34m[1mwandb[0m: 	fr_word_p: 0.5232831372715491
[34m[1mwandb[0m: 	learning_rate: 7.228389805763303e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.0105,1.665153,0.0121,11.0729


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0121
eval/gen_len,11.0729
eval/loss,1.66515
eval/runtime,3.6952
eval/samples_per_second,25.98
eval/steps_per_second,1.624
train/epoch,1.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,2.0105


[34m[1mwandb[0m: Agent Starting Run: vk6fwkk6 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.15590715281340134
[34m[1mwandb[0m: 	fr_word_p: 0.7092879701301565
[34m[1mwandb[0m: 	learning_rate: 0.00020053145545767287
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.5098,1.353554,0.0,4.1667


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,4.1667
eval/loss,1.35355
eval/runtime,2.4253
eval/samples_per_second,39.583
eval/steps_per_second,2.474
train/epoch,1.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,1.5098


[34m[1mwandb[0m: Agent Starting Run: e5pxdvom with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.7587565400869124
[34m[1mwandb[0m: 	fr_word_p: 0.6327287294437675
[34m[1mwandb[0m: 	learning_rate: 9.742187785854545e-05
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.1457,1.127707,0.0,4.3333


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,4.3333
eval/loss,1.12771
eval/runtime,2.4588
eval/samples_per_second,39.043
eval/steps_per_second,2.44
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,1.1457


[34m[1mwandb[0m: Agent Starting Run: 8n7277tq with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.48846262862636913
[34m[1mwandb[0m: 	fr_word_p: 0.16150612020472957
[34m[1mwandb[0m: 	learning_rate: 3.838933595295595e-05
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.7064,2.282822,0.0718,16.25


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0718
eval/gen_len,16.25
eval/loss,2.28282
eval/runtime,2.2885
eval/samples_per_second,41.949
eval/steps_per_second,2.622
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.7064


[34m[1mwandb[0m: Agent Starting Run: e43ltml7 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.8142547858484664
[34m[1mwandb[0m: 	fr_word_p: 0.3472530033110374
[34m[1mwandb[0m: 	learning_rate: 0.00011493660993613652
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.6896,1.480706,0.0028,7.3229


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0028
eval/gen_len,7.3229
eval/loss,1.48071
eval/runtime,2.9347
eval/samples_per_second,32.712
eval/steps_per_second,2.045
train/epoch,1.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,1.6896


[34m[1mwandb[0m: Agent Starting Run: ouf1yfsg with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.4279669231306425
[34m[1mwandb[0m: 	fr_word_p: 0.4721639721254076
[34m[1mwandb[0m: 	learning_rate: 9.585546848288992e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.8751,1.686484,0.0229,14.7604


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0229
eval/gen_len,14.7604
eval/loss,1.68648
eval/runtime,2.2756
eval/samples_per_second,42.186
eval/steps_per_second,2.637
train/epoch,1.0
train/global_step,27.0
train/learning_rate,2e-05
train/loss,2.8751


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 3xbhrwkw with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.4158215457591863
[34m[1mwandb[0m: 	fr_word_p: 0.8450154877237073
[34m[1mwandb[0m: 	learning_rate: 1.173544889360225e-05
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.5453,2.572609,0.0342,17.3021


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0342
eval/gen_len,17.3021
eval/loss,2.57261
eval/runtime,2.9689
eval/samples_per_second,32.336
eval/steps_per_second,2.021
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,2.5453


[34m[1mwandb[0m: Agent Starting Run: l9q9opu6 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.5301181688250258
[34m[1mwandb[0m: 	fr_word_p: 0.3820331478361061
[34m[1mwandb[0m: 	learning_rate: 0.00018447976921647297
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.5


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.016669155633333807, max=1.0…



Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.0258,1.307575,0.0002,5.4375


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0002
eval/gen_len,5.4375
eval/loss,1.30758
eval/runtime,2.9708
eval/samples_per_second,32.314
eval/steps_per_second,2.02
train/epoch,1.0
train/global_step,27.0
train/learning_rate,2e-05
train/loss,2.0258


[34m[1mwandb[0m: Agent Starting Run: yi4hag2r with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.4483870797238989
[34m[1mwandb[0m: 	fr_word_p: 0.504264814626644
[34m[1mwandb[0m: 	learning_rate: 0.0007294170638522566
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.8732,0.769955,0.0416,18.4375


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0416
eval/gen_len,18.4375
eval/loss,0.76995
eval/runtime,2.7853
eval/samples_per_second,34.467
eval/steps_per_second,2.154
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,0.8732


[34m[1mwandb[0m: Agent Starting Run: 6208rltr with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.3948613426128461
[34m[1mwandb[0m: 	fr_word_p: 0.4976669424803505
[34m[1mwandb[0m: 	learning_rate: 8.0745304549698e-05
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.1848,1.438949,0.0005,5.5


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0005
eval/gen_len,5.5
eval/loss,1.43895
eval/runtime,2.5725
eval/samples_per_second,37.318
eval/steps_per_second,2.332
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,1.1848


[34m[1mwandb[0m: Agent Starting Run: u0tgmz1p with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.48997148073449787
[34m[1mwandb[0m: 	fr_word_p: 0.26053974584137846
[34m[1mwandb[0m: 	learning_rate: 0.00014084150706406842
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.3942,1.489333,0.0076,9.5729


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0076
eval/gen_len,9.5729
eval/loss,1.48933
eval/runtime,2.2947
eval/samples_per_second,41.835
eval/steps_per_second,2.615
train/epoch,1.0
train/global_step,27.0
train/learning_rate,2e-05
train/loss,2.3942


[34m[1mwandb[0m: Agent Starting Run: oqm62nrl with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.5326296118112752
[34m[1mwandb[0m: 	fr_word_p: 0.10392841182281104
[34m[1mwandb[0m: 	learning_rate: 4.974552739945833e-05
[34m[1mwandb[0m: 	random_state: 100
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.5071,2.772197,0.0338,18.0729


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0338
eval/gen_len,18.0729
eval/loss,2.7722
eval/runtime,2.8263
eval/samples_per_second,33.966
eval/steps_per_second,2.123
train/epoch,1.0
train/global_step,27.0
train/learning_rate,1e-05
train/loss,3.5071


[34m[1mwandb[0m: Agent Starting Run: 56u4ajcg with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.3269481527825772
[34m[1mwandb[0m: 	fr_word_p: 0.28669982187999987
[34m[1mwandb[0m: 	learning_rate: 1.6723740867752515e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,4.8297,4.064007,0.2301,17.6771


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.2301
eval/gen_len,17.6771
eval/loss,4.06401
eval/runtime,2.3548
eval/samples_per_second,40.768
eval/steps_per_second,2.548
train/epoch,1.0
train/global_step,27.0
train/learning_rate,0.0
train/loss,4.8297


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: o0aog224 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.2129528577129459
[34m[1mwandb[0m: 	fr_word_p: 0.09869105718650648
[34m[1mwandb[0m: 	learning_rate: 3.865327761476859e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.799,2.900218,0.0335,17.625


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0335
eval/gen_len,17.625
eval/loss,2.90022
eval/runtime,2.3587
eval/samples_per_second,40.7
eval/steps_per_second,2.544
train/epoch,1.0
train/global_step,27.0
train/learning_rate,1e-05
train/loss,3.799


[34m[1mwandb[0m: Agent Starting Run: hvaq5j35 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.6303432251767113
[34m[1mwandb[0m: 	fr_word_p: 0.6597579272621471
[34m[1mwandb[0m: 	learning_rate: 7.045367768675292e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.1824,1.186207,0.0,4.1979


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,4.1979
eval/loss,1.18621
eval/runtime,2.5177
eval/samples_per_second,38.131
eval/steps_per_second,2.383
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,1.1824


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 7r4xzmjh with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.32218719047344535
[34m[1mwandb[0m: 	fr_word_p: 0.6517656923123819
[34m[1mwandb[0m: 	learning_rate: 0.0002159040840810998
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.9057,1.15166,0.0032,6.8229


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0032
eval/gen_len,6.8229
eval/loss,1.15166
eval/runtime,2.3243
eval/samples_per_second,41.303
eval/steps_per_second,2.581
train/epoch,1.0
train/global_step,27.0
train/learning_rate,2e-05
train/loss,1.9057


[34m[1mwandb[0m: Agent Starting Run: smdszw2j with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.47640241984882614
[34m[1mwandb[0m: 	fr_word_p: 0.21132462882448333
[34m[1mwandb[0m: 	learning_rate: 0.0004446832087150979
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.8566,1.020045,0.0,2.2604


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,2.2604
eval/loss,1.02004
eval/runtime,2.4322
eval/samples_per_second,39.471
eval/steps_per_second,2.467
train/epoch,1.0
train/global_step,27.0
train/learning_rate,5e-05
train/loss,1.8566


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 9p8qbyt3 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.06505758614236899
[34m[1mwandb[0m: 	fr_word_p: 0.28154550803853956
[34m[1mwandb[0m: 	learning_rate: 4.3343838200441186e-05
[34m[1mwandb[0m: 	random_state: 100
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.5274,1.389754,0.0011,5.8438


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0011
eval/gen_len,5.8438
eval/loss,1.38975
eval/runtime,3.0814
eval/samples_per_second,31.155
eval/steps_per_second,1.947
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,1.5274


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: q7bv54d2 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.291129663032987
[34m[1mwandb[0m: 	fr_word_p: 0.17656891172624997
[34m[1mwandb[0m: 	learning_rate: 4.547429254371896e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.5012,1.717234,0.1243,14.4062


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.1243
eval/gen_len,14.4062
eval/loss,1.71723
eval/runtime,2.5686
eval/samples_per_second,37.374
eval/steps_per_second,2.336
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.5012


[34m[1mwandb[0m: Agent Starting Run: gddev7hl with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.2905138736360415
[34m[1mwandb[0m: 	fr_word_p: 0.3826292800027291
[34m[1mwandb[0m: 	learning_rate: 0.0007676268591274725
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.1661,0.830785,0.0002,4.9583


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0002
eval/gen_len,4.9583
eval/loss,0.83079
eval/runtime,2.8134
eval/samples_per_second,34.122
eval/steps_per_second,2.133
train/epoch,1.0
train/global_step,54.0
train/learning_rate,3e-05
train/loss,1.1661


[34m[1mwandb[0m: Agent Starting Run: xsk92mbh with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.5144986395869243
[34m[1mwandb[0m: 	fr_word_p: 0.5802526750170622
[34m[1mwandb[0m: 	learning_rate: 5.233584616725848e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.3952,2.470595,0.1751,16.5


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.1751
eval/gen_len,16.5
eval/loss,2.47059
eval/runtime,2.3528
eval/samples_per_second,40.803
eval/steps_per_second,2.55
train/epoch,1.0
train/global_step,27.0
train/learning_rate,1e-05
train/loss,3.3952


[34m[1mwandb[0m: Agent Starting Run: q7l41rqy with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.20564244803896173
[34m[1mwandb[0m: 	fr_word_p: 0.4963901120562853
[34m[1mwandb[0m: 	learning_rate: 0.0005301414416182666
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.8289,1.156023,0.0,3.6875


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,3.6875
eval/loss,1.15602
eval/runtime,2.8014
eval/samples_per_second,34.268
eval/steps_per_second,2.142
train/epoch,1.0
train/global_step,27.0
train/learning_rate,6e-05
train/loss,1.8289


[34m[1mwandb[0m: Agent Starting Run: 43pzir5h with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.46149044413734097
[34m[1mwandb[0m: 	fr_word_p: 0.6424055199241102
[34m[1mwandb[0m: 	learning_rate: 0.00029086872247771926
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.5




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.5518,1.011343,0.0,3.3438


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,3.3438
eval/loss,1.01134
eval/runtime,2.3881
eval/samples_per_second,40.2
eval/steps_per_second,2.512
train/epoch,1.0
train/global_step,54.0
train/learning_rate,3e-05
train/loss,1.5518


[34m[1mwandb[0m: Agent Starting Run: 1krdpu44 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.6134605225455577
[34m[1mwandb[0m: 	fr_word_p: 0.7998895939001074
[34m[1mwandb[0m: 	learning_rate: 4.590147172658666e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.2855,1.94144,0.045,15.5417


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.045
eval/gen_len,15.5417
eval/loss,1.94144
eval/runtime,3.0573
eval/samples_per_second,31.4
eval/steps_per_second,1.963
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.2855


[34m[1mwandb[0m: Agent Starting Run: 79og7n7j with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.6573931730727202
[34m[1mwandb[0m: 	fr_word_p: 0.527697557518109
[34m[1mwandb[0m: 	learning_rate: 6.320774990813549e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.4397,2.526425,0.1139,17.1042


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.1139
eval/gen_len,17.1042
eval/loss,2.52643
eval/runtime,2.7266
eval/samples_per_second,35.209
eval/steps_per_second,2.201
train/epoch,1.0
train/global_step,27.0
train/learning_rate,1e-05
train/loss,3.4397


[34m[1mwandb[0m: Agent Starting Run: 3gwoqm2o with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.13391120466972786
[34m[1mwandb[0m: 	fr_word_p: 0.0458305364440084
[34m[1mwandb[0m: 	learning_rate: 0.0005377241318871253
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9414,0.830009,0.0,3.9792


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,3.9792
eval/loss,0.83001
eval/runtime,2.4626
eval/samples_per_second,38.983
eval/steps_per_second,2.436
train/epoch,1.0
train/global_step,172.0
train/learning_rate,1e-05
train/loss,0.9414


[34m[1mwandb[0m: Agent Starting Run: ru3tn3az with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.6164205011394965
[34m[1mwandb[0m: 	fr_word_p: 0.6369304960537903
[34m[1mwandb[0m: 	learning_rate: 2.9626261599729228e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.8488,2.087959,0.0236,16.8125


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0236
eval/gen_len,16.8125
eval/loss,2.08796
eval/runtime,2.4431
eval/samples_per_second,39.294
eval/steps_per_second,2.456
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.8488


[34m[1mwandb[0m: Agent Starting Run: gpen0l78 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.372939956140982
[34m[1mwandb[0m: 	fr_word_p: 0.6960487840985721
[34m[1mwandb[0m: 	learning_rate: 4.9093667318336974e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.2758,1.931993,0.082,14.5833


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.082
eval/gen_len,14.5833
eval/loss,1.93199
eval/runtime,2.4379
eval/samples_per_second,39.378
eval/steps_per_second,2.461
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,2.2758


[34m[1mwandb[0m: Agent Starting Run: 9sr5omon with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.04944991411414499
[34m[1mwandb[0m: 	fr_word_p: 0.4678093229542011
[34m[1mwandb[0m: 	learning_rate: 2.109384939089684e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.0021,1.458804,0.1085,13.4583


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.1085
eval/gen_len,13.4583
eval/loss,1.4588
eval/runtime,2.7757
eval/samples_per_second,34.586
eval/steps_per_second,2.162
train/epoch,1.0
train/global_step,172.0
train/learning_rate,0.0
train/loss,2.0021


[34m[1mwandb[0m: Agent Starting Run: elvzqqz6 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.6918098321857
[34m[1mwandb[0m: 	fr_word_p: 0.3227187899436356
[34m[1mwandb[0m: 	learning_rate: 1.666829121749361e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.6859,3.583575,0.0587,18.0


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0587
eval/gen_len,18.0
eval/loss,3.58358
eval/runtime,2.8676
eval/samples_per_second,33.478
eval/steps_per_second,2.092
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,3.6859


[34m[1mwandb[0m: Agent Starting Run: g44lzge4 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.060706599783709715
[34m[1mwandb[0m: 	fr_word_p: 0.021074262487238416
[34m[1mwandb[0m: 	learning_rate: 1.2875944874111391e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,5.2409,5.260365,0.0631,18.6146


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0631
eval/gen_len,18.6146
eval/loss,5.26037
eval/runtime,2.8143
eval/samples_per_second,34.111
eval/steps_per_second,2.132
train/epoch,1.0
train/global_step,27.0
train/learning_rate,0.0
train/loss,5.2409


[34m[1mwandb[0m: Agent Starting Run: f16mz1ny with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.11158189568283718
[34m[1mwandb[0m: 	fr_word_p: 0.4773998330386453
[34m[1mwandb[0m: 	learning_rate: 0.00018043561063886616
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.1092,1.221382,0.0001,5.75


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0001
eval/gen_len,5.75
eval/loss,1.22138
eval/runtime,2.4402
eval/samples_per_second,39.341
eval/steps_per_second,2.459
train/epoch,1.0
train/global_step,27.0
train/learning_rate,2e-05
train/loss,2.1092


[34m[1mwandb[0m: Agent Starting Run: 90khoq08 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.30652115286328085
[34m[1mwandb[0m: 	fr_word_p: 0.5318096437023665
[34m[1mwandb[0m: 	learning_rate: 6.960817752437873e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.7596,1.47268,0.0051,10.1875


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0051
eval/gen_len,10.1875
eval/loss,1.47268
eval/runtime,2.9582
eval/samples_per_second,32.452
eval/steps_per_second,2.028
train/epoch,1.0
train/global_step,54.0
train/learning_rate,0.0
train/loss,1.7596


[34m[1mwandb[0m: Agent Starting Run: d0h32xvr with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	fr_char_p: 0.18542772028359752
[34m[1mwandb[0m: 	fr_word_p: 0.5446007725215211
[34m[1mwandb[0m: 	learning_rate: 0.00030916309332457256
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.3016,1.240728,0.0,4.3333


0,1
eval/bleu,▁
eval/gen_len,▁
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁

0,1
eval/bleu,0.0
eval/gen_len,4.3333
eval/loss,1.24073
eval/runtime,2.8357
eval/samples_per_second,33.854
eval/steps_per_second,2.116
train/epoch,1.0
train/global_step,54.0
train/learning_rate,1e-05
train/loss,1.3016


------------------

## Wolof to French

The only thing we change is the order of the sentences: the Wolof sentence now comes first (the source) and the French sentence second (the target).

### Configure dataset 🔠

We can reuse the custom dataset that we created in [text_augmentation](text_augmentation.ipynb). We only need to split the data into train and test sets and save them.

In [None]:
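import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(random_state: int = 50):
  """Split the corpora into train (90%) and test (10%) sets and save them as CSV files."""

  # load the corpora and split into train and test sets
  corpora = pd.read_csv(f"{path}new_data/sent_extraction.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # save both sets so that the dataset classes can read them back
  train_set.to_csv(f"{path}new_data/train_set.csv", index=False)

  test_set.to_csv(f"{path}new_data/test_set.csv", index=False)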
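The `random_state` argument is what makes each split reproducible, which is why the sweep treats it as a hyperparameter. The sketch below illustrates the idea with the standard library only; `seeded_split` is a hypothetical stand-in for `train_test_split`, not the function used in this notebook, and its rounding of the test share is an assumption.

```python
import random

def seeded_split(rows, test_size=0.1, random_state=50):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    rng = random.Random(random_state)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    # the first n_test shuffled rows form the test set, the rest the train set
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))
train_a, test_a = seeded_split(rows, random_state=30)
train_b, test_b = seeded_split(rows, random_state=30)

print(test_a == test_b)            # the same seed gives the same split
print(len(train_a), len(test_a))   # 90 training rows, 10 test rows
```

A different seed generally produces a different test set, so part of the score variation between runs can come from the split itself rather than from the other hyperparameters.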

Let us load the datasets.

In [None]:
def recuperate_datasets(wf_char_p: float, wf_word_p: float):

  # augment the Wolof (source) sentences with keyboard-typo noise at character and word level
  wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=wf_char_p, aug_word_p=wf_word_p),
                                        remove_mark_space, delete_guillemet_space)

  train_dataset_aug = SentenceDataset(f"{path}new_data/train_set.csv", 
                                  tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                  corpus_1="wolof_corpus",
                                  corpus_2="french_corpus",
                                  cp1_transformer=wf_augmentation, truncation=True,
                                  max_len=579)

  test_dataset = SentenceDataset(f"{path}new_data/test_set.csv",
                                tokenizer_path = f"{path}wolof-translate/wolof_translate/tokenizers/tokenizer_v1.json",
                                corpus_1="wolof_corpus",
                                corpus_2="french_corpus",
                                truncation=True, max_len=579)
  
  return train_dataset_aug, test_dataset

### Configure hyperparameter search ⚙️

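We have to configure the search space and the search method (`bayes` in our case). Because the BLEU evaluation is disabled for this model, the sweep minimizes the evaluation loss instead.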
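The `learning_rate` entry of the configuration uses the `log_uniform_values` distribution, which draws values whose logarithm is uniform between `log(min)` and `log(max)`. The sketch below only illustrates that prior with the standard library; `sample_log_uniform` is a hypothetical helper, and the Bayesian sweep itself chooses new points from a model of earlier runs rather than by sampling the prior independently.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# share of samples that fall in the lower decade [1e-5, 1e-4)
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"min={min(samples):.2e}  max={max(samples):.2e}  share below 1e-4: {below_1e4:.2f}")
```

About half of the samples fall in each decade, whereas a plain uniform distribution on the same interval would put roughly 9% of them below `1e-4`. This is why a log scale is the usual choice for learning rates.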

In [None]:
import wandb

# the API key is read from the WANDB_API_KEY environment variable or an interactive prompt;
# it should never be hard-coded in a notebook
wandb.login()

# hyperparameters
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'minimize',
          'name': 'eval_loss'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [2, 3, 5]
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-5,
          'max': 1e-3
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
      },
     'wf_char_p': {
          'min': 0.0,
          'max': 0.7
     },
     'wf_word_p': {
          'min': 0.0,
          'max': 0.7
     },
     'random_state': {
         'values': [0, 10, 20, 30, 40, 50, 60, 70, 80, 100]
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "gpt2-wolof-french-translation_bayes1_1")



[34m[1mwandb[0m: Currently logged in as: [33moumar-kane[0m ([33moumar-kane-team[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: alygo14y
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-wolof-french-translation_bayes1_1/sweeps/alygo14y


### Configure the model and the evaluation function ⚙️

Let us load the pretrained model and resize its token embeddings to the vocabulary size of our tokenizer.

In [None]:
from transformers import GPT2LMHeadModel

def gpt2_model_init(tokenizer):
  # set the model name
  model_name = "gpt2"

  # load the pretrained model and move it to the GPU
  model = GPT2LMHeadModel.from_pretrained(model_name).cuda()

  # resize the token embeddings to match the vocabulary of the custom tokenizer
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric.
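BLEU is built on clipped n-gram precision: the share of predicted n-grams that also occur in the reference, where each n-gram counts at most as often as the reference contains it. The sketch below shows only this building block on whitespace tokens; `ngrams` and `clipped_precision` are hypothetical helpers, and `sacrebleu` additionally combines several n-gram orders, applies a brevity penalty, and uses its own tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(prediction, reference, n=1):
    """Share of predicted n-grams found in the reference, clipped to the reference counts."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total

prediction = "le chat le chat"
reference = "le chat dort"
print(clipped_precision(prediction, reference, n=1))            # 0.5
print(round(clipped_precision(prediction, reference, n=2), 3))  # 0.333
```

The clipping is what prevents a prediction from scoring well by repeating a correct word: "le" and "chat" each appear twice in the prediction but are credited only once.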

In [None]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate
import wandb

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):
        
        # the trainer passes an EvalPrediction that unpacks into (predictions, label_ids)
        preds, labels = eval_preds
        
        if isinstance(preds, tuple):
            
            preds = preds[0]
        
        if self.decoder is None:
            
            decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
            
            decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
            
            prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
            
            result["gen_len"] = np.mean(prediction_lens)
        
        else:
            
            predictions = list(self.decoder(preds))
            
            labels = list(self.decoder(labels))
      
            decoded_preds, decoded_labels = self.postprocess_text(predictions, labels)
            
            result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
            
            result = {"bleu": result["score"]}
        
        result = {k:round(v, 4) for k, v in result.items()}

        # wandb.log expects a dictionary of metric names and values
        wandb.log({"bleu": result["bleu"]})
            
        return result

In [None]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [None]:
# translation_eval = TranslationEvaluation(test_dataset.tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [None]:
import torch

def data_collator(batch):
    """Generate a batch of data to provide to the trainer

    Args:
        batch (list): The samples; each one holds the input ids at index 0 and the attention mask at index 1

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0] for b in batch])
    
    attention_mask = torch.stack([b[1] for b in batch])
    
    # causal language modeling: the labels are the input ids (the model shifts them internally)
    labels = torch.stack([b[0] for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and run the Bayesian search.
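Each run trains for a single epoch, so the number of optimizer steps depends only on the batch size. The logs of this sweep report a `train/global_step` of 367, 245 and 147 for batch sizes 2, 3 and 5. The sketch below reproduces those counts under several assumptions: a single device, no gradient accumulation, a last partial batch that is kept, and a training set of 733 sentences, a size inferred from the logged step counts rather than stated in the notebook.

```python
import math

def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of optimizer steps in one epoch when the last partial batch is kept."""
    return math.ceil(n_train / batch_size)

# assumption: training set size inferred from the logged step counts
N_TRAIN = 733

for batch_size in (2, 3, 5):
    print(batch_size, steps_per_epoch(N_TRAIN, batch_size))
# 2 367
# 3 245
# 5 147
```

Smaller batches therefore give the model more updates per epoch, which should be kept in mind when comparing the losses of runs with different batch sizes.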

In [None]:
# %%wandb
from functools import partial
from transformers import Trainer, TrainingArguments

def train(config = None):

  with wandb.init(config = config):

    # seed
    torch.manual_seed(50)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, test_dataset = recuperate_datasets(config.wf_char_p, config.wf_word_p)

    # get train and test datasets according to the config

    # train_dataset = datasets[config.dataset_aug]['train_dataset']

    # test_dataset = datasets[config.dataset_aug]['test_dataset']

    # set training arguments
    training_args = TrainingArguments(f"{path}training2/Results1",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      # logging_steps=100,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=5,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      remove_unused_columns = False,
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Trainer(model_init=partial(gpt2_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=test_dataset,
                      data_collator=data_collator,
                      # compute_metrics=translation_eval.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 25)


[34m[1mwandb[0m: Agent Starting Run: a0u0t6k2 with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.702369179262155e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.2819068695463206
[34m[1mwandb[0m: 	wf_word_p: 0.3271474379445852




Epoch,Training Loss,Validation Loss
1,1.3314,0.972646


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.97265
eval/runtime,2.7358
eval/samples_per_second,29.973
eval/steps_per_second,6.214
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3314
train/total_flos,216590170752000.0
train/train_loss,1.33135


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 8p97mqyj with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.000687378518112751
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.10242928904824668
[34m[1mwandb[0m: 	wf_word_p: 0.3761238934195836




Epoch,Training Loss,Validation Loss
1,1.2118,0.902809


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.90281
eval/runtime,2.7187
eval/samples_per_second,30.162
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.2118
train/total_flos,216590170752000.0
train/train_loss,1.21175


[34m[1mwandb[0m: Agent Starting Run: cufx9n8t with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 1.226951404890168e-05
[34m[1mwandb[0m: 	random_state: 70
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.23759176734591644
[34m[1mwandb[0m: 	wf_word_p: 0.13227163155096622




Epoch,Training Loss,Validation Loss
1,1.5713,0.956762


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.95676
eval/runtime,2.7129
eval/samples_per_second,30.226
eval/steps_per_second,6.266
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5713
train/total_flos,216590170752000.0
train/train_loss,1.57131


[34m[1mwandb[0m: Agent Starting Run: y2zo0avu with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0007890211017920526
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2153084724244564
[34m[1mwandb[0m: 	wf_word_p: 0.03750263309251056




Epoch,Training Loss,Validation Loss
1,1.3947,0.852979


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.85298
eval/runtime,2.7128
eval/samples_per_second,30.227
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,4e-05
train/loss,1.3947
train/total_flos,216590170752000.0
train/train_loss,1.39468


[34m[1mwandb[0m: Agent Starting Run: 6zl4nshz with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00013060682353049685
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.6581358201308465
[34m[1mwandb[0m: 	wf_word_p: 0.010288732393246668




Epoch,Training Loss,Validation Loss
1,1.1252,0.843749


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.84375
eval/runtime,2.7146
eval/samples_per_second,30.207
eval/steps_per_second,6.262
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.1252
train/total_flos,216590170752000.0
train/train_loss,1.12516


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: lmlh43bi with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 4.6544061486102166e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.601761705072325
[34m[1mwandb[0m: 	wf_word_p: 0.5485382582443594
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin




Epoch,Training Loss,Validation Loss
1,1.5112,0.959907


0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▁
train/global_step,▁▁▁
train/learning_rate,▁
train/loss,▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.95991
eval/runtime,2.7127
eval/samples_per_second,30.228
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5112
train/total_flos,216590170752000.0
train/train_loss,1.5112


[34m[1mwandb[0m: Agent Starting Run: b3709lrr with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 8.01812368676995e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.3414514505517595
[34m[1mwandb[0m: 	wf_word_p: 0.06590909422021479




Epoch,Training Loss,Validation Loss
1,1.2984,0.943387


0,1
eval/loss,0.94339
eval/runtime,2.7141
eval/samples_per_second,30.212
eval/steps_per_second,6.264
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.2984
train/total_flos,216590170752000.0
train/train_loss,1.2984


[34m[1mwandb[0m: Agent Starting Run: ixh1wkga with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.665404061522188e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2025899023149373
[34m[1mwandb[0m: 	wf_word_p: 0.30403171896970477




Epoch,Training Loss,Validation Loss
1,1.6004,0.898786


0,1
eval/loss,0.89879
eval/runtime,2.7076
eval/samples_per_second,30.285
eval/steps_per_second,6.279
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6004
train/total_flos,216590170752000.0
train/train_loss,1.60037


[34m[1mwandb[0m: Agent Starting Run: yqxwql6m with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0009940796140095907
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.5740543904824967
[34m[1mwandb[0m: 	wf_word_p: 0.2462419315938406




Epoch,Training Loss,Validation Loss
1,1.2845,0.848199


0,1
eval/loss,0.8482
eval/runtime,2.7189
eval/samples_per_second,30.159
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,2e-05
train/loss,1.2845
train/total_flos,216590170752000.0
train/train_loss,1.28448


[34m[1mwandb[0m: Agent Starting Run: 6z8kvttx with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.136396657280805e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.496992386293468
[34m[1mwandb[0m: 	wf_word_p: 0.09128048050662484





Epoch,Training Loss,Validation Loss
1,1.6337,0.925104


0,1
eval/loss,0.9251
eval/runtime,2.7135
eval/samples_per_second,30.22
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6337
train/total_flos,216590170752000.0
train/train_loss,1.63375


[34m[1mwandb[0m: Agent Starting Run: m9sb4aqc with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 2.292925960874299e-05
[34m[1mwandb[0m: 	random_state: 80
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.1392332134258406
[34m[1mwandb[0m: 	wf_word_p: 0.33276367556881936




Epoch,Training Loss,Validation Loss
1,1.3772,0.921299


0,1
eval/loss,0.9213
eval/runtime,2.7224
eval/samples_per_second,30.12
eval/steps_per_second,6.244
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3772
train/total_flos,216590170752000.0
train/train_loss,1.37722


[34m[1mwandb[0m: Agent Starting Run: jbput8ui with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.444484556126347e-05
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0
[34m[1mwandb[0m: 	wf_char_p: 0.29893712569494857
[34m[1mwandb[0m: 	wf_word_p: 0.26117783521919574




Epoch,Training Loss,Validation Loss
1,1.4321,0.896062


0,1
eval/loss,0.89606
eval/runtime,2.708
eval/samples_per_second,30.281
eval/steps_per_second,6.278
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4321
train/total_flos,216590170752000.0
train/train_loss,1.43209


[34m[1mwandb[0m: Agent Starting Run: 4fwhdrrr with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 2.5593238990374552e-05
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2943749543935819
[34m[1mwandb[0m: 	wf_word_p: 0.12398175089457102




Epoch,Training Loss,Validation Loss
1,1.6999,0.85214


0,1
eval/loss,0.85214
eval/runtime,2.7191
eval/samples_per_second,30.157
eval/steps_per_second,6.252
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6999
train/total_flos,216590170752000.0
train/train_loss,1.69991


[34m[1mwandb[0m: Agent Starting Run: 1bz3dbic with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.890480473611398e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.2378445346533837
[34m[1mwandb[0m: 	wf_word_p: 0.4209643755950993




Epoch,Training Loss,Validation Loss
1,1.4832,0.875917


0,1
eval/loss,0.87592
eval/runtime,2.7525
eval/samples_per_second,29.792
eval/steps_per_second,6.176
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4832
train/total_flos,216590170752000.0
train/train_loss,1.48315


[34m[1mwandb[0m: Agent Starting Run: ho6ta14m with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0002774553527447285
[34m[1mwandb[0m: 	random_state: 40
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.6714497099264989
[34m[1mwandb[0m: 	wf_word_p: 0.03656663081592687




Epoch,Training Loss,Validation Loss
1,1.1247,0.822917


0,1
eval/loss,0.82292
eval/runtime,2.7172
eval/samples_per_second,30.178
eval/steps_per_second,6.256
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.1247
train/total_flos,216590170752000.0
train/train_loss,1.12474


[34m[1mwandb[0m: Agent Starting Run: wmlfiw0r with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.0003923791647223955
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.2
[34m[1mwandb[0m: 	wf_char_p: 0.06807537344840917
[34m[1mwandb[0m: 	wf_word_p: 0.5604330614348828




Epoch,Training Loss,Validation Loss
1,1.5006,0.949637


0,1
eval/loss,0.94964
eval/runtime,2.7103
eval/samples_per_second,30.255
eval/steps_per_second,6.272
train/epoch,1.0
train/global_step,147.0
train/learning_rate,2e-05
train/loss,1.5006
train/total_flos,216590170752000.0
train/train_loss,1.5006


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: r8gcrole with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 1.821341363881095e-05
[34m[1mwandb[0m: 	random_state: 50
[34m[1mwandb[0m: 	weight_decay: 0.3
[34m[1mwandb[0m: 	wf_char_p: 0.4377052667800509
[34m[1mwandb[0m: 	wf_word_p: 0.28337028938356845




Epoch,Training Loss,Validation Loss
1,1.8462,0.981831


0,1
eval/loss,0.98183
eval/runtime,2.7136
eval/samples_per_second,30.218
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.8462
train/total_flos,216590170752000.0
train/train_loss,1.8462


[34m[1mwandb[0m: Agent Starting Run: phwfj18n with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.811279212646426e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.2317434997460037
[34m[1mwandb[0m: 	wf_word_p: 0.4361895012572056




Epoch,Training Loss,Validation Loss
1,1.6396,0.980084


0,1
eval/loss,0.98008
eval/runtime,2.767
eval/samples_per_second,29.635
eval/steps_per_second,6.144
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6396
train/total_flos,216590170752000.0
train/train_loss,1.63955


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: jlqkadbj with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00010056956001033137
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.6390337876205812
[34m[1mwandb[0m: 	wf_word_p: 0.23792329132405943




Epoch,Training Loss,Validation Loss
1,1.6119,0.873367


0,1
eval/loss,0.87337
eval/runtime,2.7126
eval/samples_per_second,30.23
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6119
train/total_flos,216590170752000.0
train/train_loss,1.61191


[34m[1mwandb[0m: Agent Starting Run: 6p270685 with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 7.129742665577746e-05
[34m[1mwandb[0m: 	random_state: 20
[34m[1mwandb[0m: 	weight_decay: 0.5
[34m[1mwandb[0m: 	wf_char_p: 0.39725467091195105
[34m[1mwandb[0m: 	wf_word_p: 0.5004309074101329




Epoch,Training Loss,Validation Loss
1,1.4593,0.955068


0,1
eval/loss,0.95507
eval/runtime,2.7243
eval/samples_per_second,30.099
eval/steps_per_second,6.24
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4593
train/total_flos,216590170752000.0
train/train_loss,1.45931


[34m[1mwandb[0m: Agent Starting Run: gcv6axp9 with config:
[34m[1mwandb[0m: 	batch_size: 2
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 6.252438217932992e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0
[34m[1mwandb[0m: 	wf_char_p: 0.29046645972247725
[34m[1mwandb[0m: 	wf_word_p: 0.2600785813543463




Epoch,Training Loss,Validation Loss
1,1.3169,0.971379


0,1
eval/loss,0.97138
eval/runtime,2.7496
eval/samples_per_second,29.823
eval/steps_per_second,6.183
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3169
train/total_flos,216590170752000.0
train/train_loss,1.31693


[34m[1mwandb[0m: Agent Starting Run: yhyy9e23 with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.647196100197014e-05
[34m[1mwandb[0m: 	random_state: 10
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.1717366963251063
[34m[1mwandb[0m: 	wf_word_p: 0.1316624948710261




Epoch,Training Loss,Validation Loss
1,1.3969,0.856822


0,1
eval/loss,0.85682
eval/runtime,2.7137
eval/samples_per_second,30.217
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.3969
train/total_flos,216590170752000.0
train/train_loss,1.3969


[34m[1mwandb[0m: Agent Starting Run: xup68ew4 with config:
[34m[1mwandb[0m: 	batch_size: 5
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 3.637652455538814e-05
[34m[1mwandb[0m: 	random_state: 60
[34m[1mwandb[0m: 	weight_decay: 0.3
[34m[1mwandb[0m: 	wf_char_p: 0.677355252029728
[34m[1mwandb[0m: 	wf_word_p: 0.16163588455481578




Epoch,Training Loss,Validation Loss
1,1.6871,0.952795


0,1
eval/loss,0.9528
eval/runtime,2.7376
eval/samples_per_second,29.953
eval/steps_per_second,6.21
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6871
train/total_flos,216590170752000.0
train/train_loss,1.68711


[34m[1mwandb[0m: Agent Starting Run: wa2re6yd with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 0.00015372099844614283
[34m[1mwandb[0m: 	random_state: 0
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: 	wf_char_p: 0.4593404124892092
[34m[1mwandb[0m: 	wf_word_p: 0.5384324544438147




Epoch,Training Loss,Validation Loss
1,1.4028,0.823874


0,1
eval/loss,0.82387
eval/runtime,2.7275
eval/samples_per_second,30.064
eval/steps_per_second,6.233
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4028
train/total_flos,216590170752000.0
train/train_loss,1.40277


[34m[1mwandb[0m: Agent Starting Run: i305ffth with config:
[34m[1mwandb[0m: 	batch_size: 3
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	learning_rate: 4.284698923411304e-05
[34m[1mwandb[0m: 	random_state: 30
[34m[1mwandb[0m: 	weight_decay: 0.4
[34m[1mwandb[0m: 	wf_char_p: 0.6295786463189464
[34m[1mwandb[0m: 	wf_word_p: 0.6968639258681786




Epoch,Training Loss,Validation Loss
1,1.5093,0.986804


0,1
eval/loss,0.9868
eval/runtime,2.7162
eval/samples_per_second,30.189
eval/steps_per_second,6.259
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5093
train/total_flos,216590170752000.0
train/train_loss,1.5093


-----------

## Colab download and remove step

In [None]:
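import shutil

# Optional: zip the local wandb logs first so they can be downloaded from Colab
# shutil.make_archive('wandb', 'zip', 'wandb')

# Optional: remove the saved search results from Drive
# shutil.rmtree('/content/drive/MyDrive/Memoire/subject2/T5/training/bayes_search_results')

# Remove the local wandb run directory
shutil.rmtree('wandb')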
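The download and remove step can also be written as one guarded helper, so that the logs are only deleted once the archive exists. This is a minimal sketch, not the notebook's own code: the function name `archive_and_remove` and the archive name `wandb_logs` are hypothetical and chosen only for illustration.

```python
import shutil
from pathlib import Path


def archive_and_remove(directory: str, archive_name: str) -> Path:
    """Zip `directory` into `<archive_name>.zip`, then delete the directory.

    Hypothetical helper: the name and behaviour are assumptions for illustration.
    """
    source = Path(directory)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")

    # make_archive returns the path of the archive it created
    archive_path = Path(shutil.make_archive(archive_name, "zip", root_dir=str(source)))

    # Delete the original directory only once the archive is on disk
    if archive_path.is_file():
        shutil.rmtree(source)
    return archive_path
```

On Colab, a call such as `archive_and_remove("wandb", "wandb_logs")` would leave `wandb_logs.zip` in the working directory, which could then be fetched with `google.colab.files.download`.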