Hyper-parameter search with the Text-To-Text Transformer 🤖 (Bayes)
-----------------------------------

We have already fine-tuned the Text-To-Text Transformer (T5) on the original sentences. Now that we have extracted new sentences from the book entitled **Grammaire de Wolof Moderne**, we want to check whether the T5 model performs better on them than before. We will again use `Bayesian Hyperparameter Optimization` to search for the best combination of hyperparameters (the best model). The `wandb` library will be used to evaluate the models clearly with the `Parallel coordinate` and `Parameter Importance` charts. After finding the best model, we will take its checkpoints and continue the training in another notebook. Let us dive into the process.

We want to know the best combination of values of the following hyperparameters:

- **learning rate** $\sim Log U(1e-5, 1e-2)$
- **weight decay** $\in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$
- **random state** (seed of the data splitting generator) $\in range(1, 100)$ 

1. For the translation from French to Wolof

  - **fr_char_p** (probability of modifying a character from a French word) $\sim U(0.0, 0.9)$
  - **fr_word_p** (probability of modifying a word from a French sentence) $\sim U(0.0, 0.9)$

2. For the translation from Wolof to French

  - **wf_char_p** (probability of modifying a character from a Wolof word) $\sim U(0.0, 0.9)$
  - **wf_word_p** (probability of modifying a word from a Wolof sentence) $\sim U(0.0, 0.9)$
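
To make the notation above concrete, here is a minimal sketch of how one configuration could be drawn from these distributions with the standard library. It only illustrates the sampling; it is plain random sampling, not the Bayesian search that `wandb` performs, and the helper names `sample_log_uniform` and `sample_config` are hypothetical.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration for the French-to-Wolof search space listed above."""
    return {
        "learning_rate": sample_log_uniform(1e-5, 1e-2, rng),
        "weight_decay": rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        "random_state": rng.choice(range(1, 100)),
        "fr_char_p": rng.uniform(0.0, 0.9),
        "fr_word_p": rng.uniform(0.0, 0.9),
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
configs = [sample_config(rng) for _ in range(1000)]
print(configs[0])
```

With a log-uniform distribution, each decade of the learning rate (for example 1e-5 to 1e-4 and 1e-3 to 1e-2) receives roughly the same number of draws, which is why it is preferred over a plain uniform distribution for this hyperparameter.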


The Bayes method requires a metric to optimize. We will evaluate the model on the test set, so the metric added to the hyperparameter settings can be either the `cross entropy loss` computed on that set or the `BLEU` score. Since this is a machine translation task, the BLEU score is the more useful evaluation metric.

**Objective**: We will try to `maximize the metric`. For the moment, we want to obtain a `BLEU` score above `20`.
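
As a reminder of what the score measures, the sketch below computes a simplified corpus-level BLEU in pure Python: clipped n-gram precisions up to 4-grams combined with a brevity penalty. It is an illustration only, assuming whitespace tokenization, a single reference per sentence and no smoothing; it is not the `sacrebleu` implementation used later in this notebook, so its values will not match exactly. The function name `simple_bleu` and the English sentences are placeholders.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(predictions, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, whitespace tokens, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            pred_counts = ngrams(pred_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # clipped matches: an n-gram counts at most as often as in the reference
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: one empty n-gram order gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / pred_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
perfect = simple_bleu(["the cat sat on the mat"], reference)
partial = simple_bleu(["the cat sat on a mat"], reference)
too_short = simple_bleu(["the"], reference)
print(perfect, round(partial, 2), too_short)
```

In this simplified version, a prediction shorter than four tokens has no 4-gram at all and therefore scores zero, which shows how strongly very short generations are penalized.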

In [1]:
# let us extend the paths of the system
import sys

path = "/content/drive/MyDrive/Memoire/subject2/T5/"

sys.path.extend([path, f"{path}new_data"])

In [2]:
# define wandb environment
%env WANDB_LOG_MODEL=true
%env WANDB_NOTEBOOK_NAME=/content/drive/MyDrive/Memoire/subject2/T5/hp_search_t5_small_step_with_bayes_v2.ipynb
%env WANDB_API_KEY=<YOUR_WANDB_API_KEY>

env: WANDB_LOG_MODEL=true
env: WANDB_NOTEBOOK_NAME=/content/drive/MyDrive/Memoire/subject2/T5/hp_search_t5_small_step_with_bayes_v2.ipynb
env: WANDB_API_KEY=<YOUR_WANDB_API_KEY>


In [3]:
!pip install -qq wandb --upgrade

In [4]:
!pip install evaluate -qq
!pip install sacrebleu -qq
# !pip install optuna -qq
!pip install transformers -qq 
!pip install tokenizers -qq
!pip install nlpaug -qq
!pip install ray[tune] -qq
!python -m spacy download fr_core_news_lg 



In [5]:
# let us import all necessary libraries
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5TokenizerFast, set_seed
from wolof_translate.utils.sent_transformers import TransformerSequences
from wolof_translate.data.dataset_v2 import T5SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from functools import partial
from tqdm import tqdm
import pandas as pd
import numpy as np
import evaluate
import wandb
import torch


We will create two models: 

- One translating the French corpus into Wolof [french_to_wolof](#french-to-wolof)
- One translating the Wolof corpus into French [wolof_to_french](#wolof-to-french)

--------------

## French to Wolof

### Configure dataset 🔠

We will split the sentences into train (for the model's training), validation (to find the best performance) and test (to make final predictions) sets. Which samples end up in the train, validation and test sets is determined by `the random state`. We will tune the random state to find the split that gives the model its best fit. In other words, we want the model to learn from many training sentences and to generalize that learning to the validation sentences. This is not always the case, especially with a small dataset like ours.

Note that when we continue training the model from the checkpoints, we will use the train set plus the validation set. We therefore need to save the part of the dataset that does not contain the test set for later.
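
The two successive 10% hold-outs used in the function below leave roughly 81% of the sentences for training, 9% for validation and 10% for testing. The following sketch works through that arithmetic; the corpus size of 3000 is a hypothetical value chosen only for illustration, and `split_sizes` is a helper that does not exist in the project. The rounding mirrors what `train_test_split` does when `test_size` is a float (the held-out part is rounded up).

```python
import math


def split_sizes(n_rows: int, test_size: float = 0.1, valid_size: float = 0.1):
    """Return (train, valid, test) sizes for two successive hold-out splits."""
    n_test = math.ceil(test_size * n_rows)          # first split: corpus -> rest + test
    n_remaining = n_rows - n_test                   # this is the "final train set"
    n_valid = math.ceil(valid_size * n_remaining)   # second split: rest -> train + valid
    n_train = n_remaining - n_valid
    return n_train, n_valid, n_test


# 3000 is a hypothetical corpus size, not the size of extractions.csv
train, valid, test = split_sizes(3000)
print(train, valid, test)  # 2430 270 300
```

The "final train set" saved for the later notebook corresponds to `n_remaining`, i.e. the train and validation parts together.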

In [6]:
def split_data(random_state: int = 50):
  """Split data between train, validation and test sets

  Args:
    random_state (int): the seed of the splitting generator. Defaults to 50
  """
  # load the corpora and split into train and test sets
  corpora = pd.read_csv(f"{path}diagne_sentences/extractions.csv")

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the final training set (train + validation) before
  # carving out the validation set
  train_set.to_csv(f"{path}diagne_sentences/final_train_set.csv", index=False)

  train_set, valid_set = train_test_split(train_set, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(f"{path}diagne_sentences/train_set.csv", index=False)

  valid_set.to_csv(f"{path}diagne_sentences/valid_set.csv", index=False)

  test_set.to_csv(f"{path}diagne_sentences/test_set.csv", index=False)

Let us load the French and Wolof corpora's common tokenizer.

In [7]:
# load the tokenizer from a json file
tokenizer = T5TokenizerFast(tokenizer_file=f"{path}wolof_translate/tokenizers/t5_tokenizers/tokenizer_v2.json")


The following function loads the datasets.

In [8]:
def recuperate_datasets(fr_char_p: float, fr_word_p: float):
  """Load the train and validation datasets

  Args:
    fr_char_p (float): probability of modifying a character of a French word
    fr_word_p (float): probability of modifying a word of a French sentence

  Returns:
    tuple: the augmented train dataset and the validation dataset
  """
  # Create the augmentation to apply to the French sentences
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  # Load the train dataset (augmented French sentences)
  train_dataset_aug = T5SentenceDataset(f"{path}diagne_sentences/train_set.csv",
                                        tokenizer,
                                        truncation = True,
                                        cp1_transformer = fr_augmentation)

  # Load the validation dataset (no augmentation)
  valid_dataset = T5SentenceDataset(f"{path}diagne_sentences/valid_set.csv",
                                    tokenizer,
                                    truncation = True)

  # Return the datasets
  return train_dataset_aug, valid_dataset

### Configure hyperparameter search ⚙️

We have to configure the search space, the search method and the metric. 

In [9]:
wandb.login()  # the key is read from the WANDB_API_KEY environment variable set above

# hyperparameters
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'maximize',
          'name': 'bleu'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [16, 32] # we found in previous evaluations that the bleu is better when batch size is not small
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-4,
          'max': 1e-1 # after some evaluations we found out that a high learning rate is better
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4]
      },
     'fr_char_p': {
         'min': 0.2,
         'max': 1.0 
     },
     'fr_word_p': {
          'min': 0.2,
          'max': 0.9
     },
     'random_state': {
         'values': list(range(0, 80))
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "small-t5-fw-translation-bayes-hpsearch-v2")



wandb: Currently logged in as: oumar-kane (oumar-kane-team). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: tdykh4ig
Sweep URL: https://wandb.ai/oumar-kane-team/base-t5-fw-translation-bayes-hpsearch-v2/sweeps/tdykh4ig


### Configure the model and the evaluation function ⚙️

Let us load the model and resize the token embeddings.

**Note**: In the first training we want to use the t5-small. If we do not obtain good results, we will switch to the t5-base, which contains more parameters. See below the configurations of the t5-small and the t5-base models, respectively.

In [10]:
small_model_name = 't5-small'
base_model_name = 't5-base'

# import the small model with its pre-trained weights
small_model = AutoModelForSeq2SeqLM.from_pretrained(small_model_name)

# import the base model with its pre-trained weights
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)


In [11]:
# print the small configuration
small_model.config

T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefi

In [12]:
# print the base configuration
base_model.config

T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "pre

The small model has the same dimensions as the original (base) Transformer of Ashish Vaswani, Noam Shazeer, et al. in the article [Attention_is_all_you_need](https://arxiv.org/pdf/1706.03762): 6 layers, 8 heads, an embedding dimension of 512 and 2048 feed-forward features.

The base model contains more parameters since it uses 12 heads instead of 8 and 12 stacked layers instead of 6 in both the encoder and the decoder. Its feed-forward dimension is 3072, i.e. 1024 more features than the small one, and its embedding dimension is 768 instead of 512. The base model contains about 220 million parameters, which is a huge number. But since it is pre-trained, we can directly apply transfer learning from the already trained weights. The base model was first described in the article [Text_To_Text_Transformer](https://arxiv.org/pdf/1910.10683).
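
The parameter counts can be estimated directly from the configuration values printed above. The sketch below is a back-of-the-envelope count under a few assumptions that are not visible in the truncated outputs: a vocabulary of 32128 tokens (the size of the published checkpoints, before `resize_token_embeddings` is called), input and output embeddings that share their weights, dense layers without biases, and one relative-position bias table per stack. The function `t5_parameter_count` is a hypothetical helper, not part of `transformers`.

```python
def t5_parameter_count(d_model, d_ff, d_kv, num_heads, num_layers,
                       vocab_size=32128, num_buckets=32):
    """Estimate the number of weights of a T5 model with tied embeddings."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner      # q, k, v and output projections (no biases)
    feed_forward = 2 * d_model * d_ff    # two dense layers (no biases)
    layer_norm = d_model                 # scale only

    encoder_layer = attention + feed_forward + 2 * layer_norm
    # a decoder layer has self-attention, cross-attention and a feed-forward block
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    # each stack adds one relative-position bias table and a final layer norm
    stack_extra = num_buckets * num_heads + layer_norm
    encoder = num_layers * encoder_layer + stack_extra
    decoder = num_layers * decoder_layer + stack_extra
    embeddings = vocab_size * d_model    # assumed shared by encoder, decoder and output layer
    return encoder + decoder + embeddings


small = t5_parameter_count(d_model=512, d_ff=2048, d_kv=64, num_heads=8, num_layers=6)
base = t5_parameter_count(d_model=768, d_ff=3072, d_kv=64, num_heads=12, num_layers=12)
print(f"t5-small: {small:,}  t5-base: {base:,}")
# t5-small: 60,506,624  t5-base: 222,903,552
```

Under these assumptions the base model is about 3.7 times larger than the small one. Resizing the token embeddings to the custom tokenizer changes only the `embeddings` term.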

In [13]:
def t5_model_init(tokenizer):

  # Initialize the model name
  model_name = 't5-small'

  # import the model with its pre-trained weights
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

  # resize the token embeddings
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric. We use a method that we found in the following `HuggingFace` tutorial: [translation](https://huggingface.co/docs/transformers/tasks/translation). We wrap it in a class so that more methods can be added if necessary.

In [14]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = evaluate.load('sacrebleu'),
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        self.metric = metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):

        preds, labels = eval_preds

        if isinstance(preds, tuple):
        
            preds = preds[0]
        
        # use the tokenizer given to the constructor, not a global variable
        decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)

        # -100 marks the label positions ignored by the loss; replace them
        # by the pad token id so that the labels can be decoded
        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        
        decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
        
        result = {"bleu": result["score"]}

        # average number of non-padding tokens in the generated sequences
        prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
        
        result["gen_len"] = np.mean(prediction_lens)
        
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result
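
Two steps of `compute_metrics` are easy to misread: the replacement of `-100` in the labels and the computation of `gen_len`. The sketch below reproduces both on plain Python lists. It assumes a pad token id of 0, as shown in the T5 configurations (the custom tokenizer may use a different id), and the token ids are arbitrary values chosen for illustration.

```python
PAD_ID = 0           # assumed pad token id, taken from the T5 configurations
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores


def restore_padding(labels, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace ignore-index entries by the pad id so that the ids can be decoded."""
    return [[pad_id if tok == ignore_index else tok for tok in row] for row in labels]


def generated_lengths(preds, pad_id=PAD_ID):
    """Number of non-pad tokens in each predicted sequence."""
    return [sum(tok != pad_id for tok in row) for row in preds]


# arbitrary token ids: two label rows padded with -100, two generated rows padded with 0
labels = [[37, 52, 1, -100, -100], [8, 1, -100, -100, -100]]
preds = [[0, 37, 52, 1, 0], [0, 8, 1, 0, 0]]

clean_labels = restore_padding(labels)
lengths = generated_lengths(preds)
print(clean_labels)                 # [[37, 52, 1, 0, 0], [8, 1, 0, 0, 0]]
print(sum(lengths) / len(lengths))  # 2.5
```

A `gen_len` close to 1.0 therefore means that the model emitted almost nothing besides a single non-pad token per sentence.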

In [15]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [16]:
evaluation = TranslationEvaluation(tokenizer)

### Searching for the best parameters 🕖

Let us define the data collator.

In [17]:
def data_collator(batch):
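    """Generate a batch of data to provide to the trainer

    Args:
        batch (list): The samples of the batch, each one a tuple containing
            the input ids, the attention mask and the labels tensors

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """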
    input_ids = torch.stack([b[0].squeeze(0) for b in batch])
    
    attention_mask = torch.stack([b[1].squeeze(0) for b in batch])
    
    labels = torch.stack([b[2].squeeze(0) for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and search for the best model. The latter will be saved as an artefact inside our `wandb` project.
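
The run summaries printed after the training cell report `train/global_step` values of 151 for a batch size of 16 and 76 for a batch size of 32 after one epoch. The sketch below shows how these step counts relate to the size of the training set. It assumes a single device, no gradient accumulation and a last incomplete batch that is kept, none of which is stated explicitly in this notebook; the helper names are hypothetical.

```python
import math


def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Optimizer steps in one epoch when the last, smaller batch is kept."""
    return math.ceil(n_train / batch_size)


def compatible_sizes(steps: int, batch_size: int) -> range:
    """All training-set sizes that give exactly `steps` batches."""
    return range((steps - 1) * batch_size + 1, steps * batch_size + 1)


# global_step values taken from the run summaries: 151 (batch 16) and 76 (batch 32)
sizes_16 = set(compatible_sizes(151, 16))
sizes_32 = set(compatible_sizes(76, 32))
candidates = sorted(sizes_16 & sizes_32)
print(candidates[0], candidates[-1])  # 2401 2416
```

Under these assumptions the training split holds between 2401 and 2416 sentences, which confirms that each trial sees only a small dataset during its single epoch.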

In [None]:
# %%wandb

def train(config = None):

  with wandb.init(config = config):

    # seed
    set_seed(0)

    # set sweep configuration
    config = wandb.config

    # split the data
    split_data(config.random_state)

    # let us recuperate the datasets
    train_dataset, valid_dataset = recuperate_datasets(config.fr_char_p, config.fr_word_p)

    # set training arguments
    training_args = Seq2SeqTrainingArguments(f"{path}training/bayes_search_results_fw_v2",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=16,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      predict_with_generate=True, # we will use predict with generate in order to obtain more valuable test results
                                      fp16 = True,
                                      )   

    # define training loop
    trainer = Seq2SeqTrainer(model_init=partial(t5_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=valid_dataset,
                      data_collator=data_collator,
                      compute_metrics=evaluation.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 100)


wandb: Agent Starting Run: otga3gac with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4381575828545605
wandb: 	fr_word_p: 0.6116586258851745
wandb: 	learning_rate: 0.06259045084863178
wandb: 	random_state: 51
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.8432,,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,
eval/runtime,4.347
eval/samples_per_second,61.422
eval/steps_per_second,3.911
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.05927
train/loss,2.8432


wandb: Agent Starting Run: tml8jezj with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8822536140201862
wandb: 	fr_word_p: 0.3339541228740005
wandb: 	learning_rate: 0.00518142881161851
wandb: 	random_state: 72
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6429,0.498041,0.0,5.0


0,1
eval/bleu,0.0
eval/gen_len,5.0
eval/loss,0.49804
eval/runtime,3.3859
eval/samples_per_second,78.857
eval/steps_per_second,5.021
train/epoch,1.0
train/global_step,151.0
train/learning_rate,7e-05
train/loss,0.6429


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: eg70q1fu with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.3057944880911035
wandb: 	fr_word_p: 0.3962325130342593
wandb: 	learning_rate: 0.0005492469495391114
wandb: 	random_state: 32
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9553,0.576823,0.0355,2.4419


0,1
eval/bleu,0.0355
eval/gen_len,2.4419
eval/loss,0.57682
eval/runtime,3.3727
eval/samples_per_second,79.164
eval/steps_per_second,5.04
train/epoch,1.0
train/global_step,76.0
train/learning_rate,2e-05
train/loss,0.9553


wandb: Agent Starting Run: urtw2zzz with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8194253914804734
wandb: 	fr_word_p: 0.25290678138633566
wandb: 	learning_rate: 0.0012781795651427244
wandb: 	random_state: 38
wandb: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.81,0.561485,0.0,1.5094


0,1
eval/bleu,0.0
eval/gen_len,1.5094
eval/loss,0.56149
eval/runtime,4.8493
eval/samples_per_second,55.059
eval/steps_per_second,3.506
train/epoch,1.0
train/global_step,76.0
train/learning_rate,3e-05
train/loss,0.81


wandb: Agent Starting Run: 6l36f9z4 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4475808281133603
wandb: 	fr_word_p: 0.4833224717278907
wandb: 	learning_rate: 0.08261556041458472
wandb: 	random_state: 13
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.9949,0.585065,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,0.58507
eval/runtime,3.3702
eval/samples_per_second,79.224
eval/steps_per_second,5.044
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.07824
train/loss,2.9949


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 0nsogk73 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.28335762632309447
wandb: 	fr_word_p: 0.506400374199839
wandb: 	learning_rate: 0.0009254097996363324
wandb: 	random_state: 43
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7152,0.551105,0.0,1.2285


0,1
eval/bleu,0.0
eval/gen_len,1.2285
eval/loss,0.55111
eval/runtime,3.9879
eval/samples_per_second,66.953
eval/steps_per_second,4.263
train/epoch,1.0
train/global_step,151.0
train/learning_rate,1e-05
train/loss,0.7152


wandb: Agent Starting Run: 636v89eq with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.467834986769714
wandb: 	fr_word_p: 0.7117707724800457
wandb: 	learning_rate: 0.01085987463152563
wandb: 	random_state: 39
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.627,0.539224,0.0,5.0


0,1
eval/bleu,0.0
eval/gen_len,5.0
eval/loss,0.53922
eval/runtime,3.7871
eval/samples_per_second,70.503
eval/steps_per_second,4.489
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.00014
train/loss,0.627


wandb: Agent Starting Run: 3w5o1ms5 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.3729825818754754
wandb: 	fr_word_p: 0.48120219043765566
wandb: 	learning_rate: 0.001270658556004081
wandb: 	random_state: 48
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6859,0.532339,0.0,3.4457


0,1
eval/bleu,0.0
eval/gen_len,3.4457
eval/loss,0.53234
eval/runtime,5.0382
eval/samples_per_second,52.996
eval/steps_per_second,3.374
train/epoch,1.0
train/global_step,151.0
train/learning_rate,2e-05
train/loss,0.6859


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: fawy9ya6 with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.565046669465831
wandb: 	fr_word_p: 0.7738867378774072
wandb: 	learning_rate: 0.00011870598833011806
wandb: 	random_state: 45
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.2918,0.865529,0.0494,5.2509


0,1
eval/bleu,0.0494
eval/gen_len,5.2509
eval/loss,0.86553
eval/runtime,4.7764
eval/samples_per_second,55.9
eval/steps_per_second,3.559
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.0
train/loss,1.2918


wandb: Agent Starting Run: bfybpu27 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8752458674796124
wandb: 	fr_word_p: 0.5383173333227683
wandb: 	learning_rate: 0.07414013054888538
wandb: 	random_state: 34
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.4864,0.969908,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,0.96991
eval/runtime,3.4516
eval/samples_per_second,77.356
eval/steps_per_second,4.925
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.07119
train/loss,3.4864


wandb: Agent Starting Run: ndoziids with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8610711962253805
wandb: 	fr_word_p: 0.5511130908525904
wandb: 	learning_rate: 0.02672680984111256
wandb: 	random_state: 77
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,2.6087,0.69342,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,0.69342
eval/runtime,3.4624
eval/samples_per_second,77.113
eval/steps_per_second,4.91
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.02532
train/loss,2.6087


wandb: Agent Starting Run: 40bbqn8j with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.2879408125601204
wandb: 	fr_word_p: 0.20014163102234872
wandb: 	learning_rate: 0.0016766296476960554
wandb: 	random_state: 71
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6798,0.530538,0.0,2.9813


0,1
eval/bleu,0.0
eval/gen_len,2.9813
eval/loss,0.53054
eval/runtime,3.9864
eval/samples_per_second,66.978
eval/steps_per_second,4.265
train/epoch,1.0
train/global_step,151.0
train/learning_rate,2e-05
train/loss,0.6798


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: a1qyegcy with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5914542632321074
wandb: 	fr_word_p: 0.4495405182231499
wandb: 	learning_rate: 0.0023118906467202416
wandb: 	random_state: 1
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6615,0.556188,2.4748,3.2996


0,1
eval/bleu,2.4748
eval/gen_len,3.2996
eval/loss,0.55619
eval/runtime,5.3093
eval/samples_per_second,50.289
eval/steps_per_second,3.202
train/epoch,1.0
train/global_step,151.0
train/learning_rate,3e-05
train/loss,0.6615


wandb: Agent Starting Run: 52cpmw3l with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4717567822282639
wandb: 	fr_word_p: 0.7977143827113362
wandb: 	learning_rate: 0.001620644123330365
wandb: 	random_state: 7
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.8027,0.552175,0.0,1.0112


0,1
eval/bleu,0.0
eval/gen_len,1.0112
eval/loss,0.55218
eval/runtime,4.6004
eval/samples_per_second,58.039
eval/steps_per_second,3.695
train/epoch,1.0
train/global_step,76.0
train/learning_rate,4e-05
train/loss,0.8027


wandb: Agent Starting Run: 295k6bp9 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4571510021091106
wandb: 	fr_word_p: 0.8235313978542236
wandb: 	learning_rate: 0.01187721951560014
wandb: 	random_state: 11
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.617,0.514955,0.0,5.0


0,1
eval/bleu,0.0
eval/gen_len,5.0
eval/loss,0.51496
eval/runtime,3.8729
eval/samples_per_second,68.94
eval/steps_per_second,4.389
train/epoch,1.0
train/global_step,151.0
train/learning_rate,8e-05
train/loss,0.617


wandb: Agent Starting Run: 1gogyiwe with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6920032994140689
wandb: 	fr_word_p: 0.46489304031121537
wandb: 	learning_rate: 0.00031255937592888645
wandb: 	random_state: 27
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7965,0.577787,0.0,2.633


0,1
eval/bleu,0.0
eval/gen_len,2.633
eval/loss,0.57779
eval/runtime,5.0459
eval/samples_per_second,52.914
eval/steps_per_second,3.369
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.0
train/loss,0.7965


wandb: Agent Starting Run: wisv19ms with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4503606254447692
wandb: 	fr_word_p: 0.23222880082898145
wandb: 	learning_rate: 0.09618317747600572
wandb: 	random_state: 32
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.6517,1.059265,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,1.05927
eval/runtime,3.5796
eval/samples_per_second,74.589
eval/steps_per_second,4.749
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.09236
train/loss,3.6517


wandb: Agent Starting Run: jh3goh7c with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6635862975872865
wandb: 	fr_word_p: 0.8678232872461129
wandb: 	learning_rate: 0.07707704555119989
wandb: 	random_state: 49
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.64,1.561099,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,1.5611
eval/runtime,3.6601
eval/samples_per_second,72.949
eval/steps_per_second,4.645
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.07201
train/loss,3.64


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 01fqpsla with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5190618196221921
wandb: 	fr_word_p: 0.7719970832091285
wandb: 	learning_rate: 0.0980538007292019
wandb: 	random_state: 31
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,3.8334,0.926528,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,0.92653
eval/runtime,5.3512
eval/samples_per_second,49.895
eval/steps_per_second,3.177
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.08902
train/loss,3.8334


wandb: Agent Starting Run: 31tga7ij with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.9425901349845052
wandb: 	fr_word_p: 0.6986934901016864
wandb: 	learning_rate: 0.007097854273204103
wandb: 	random_state: 4
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6424,0.512538,0.0,4.3146


0,1
eval/bleu,0.0
eval/gen_len,4.3146
eval/loss,0.51254
eval/runtime,3.6929
eval/samples_per_second,72.301
eval/steps_per_second,4.603
train/epoch,1.0
train/global_step,151.0
train/learning_rate,9e-05
train/loss,0.6424


wandb: Agent Starting Run: xji9rlzt with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8063651146128041
wandb: 	fr_word_p: 0.7640623515466898
wandb: 	learning_rate: 0.004089998179862037
wandb: 	random_state: 26
wandb: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6491,0.490399,1.2677,4.5955


0,1
eval/bleu,1.2677
eval/gen_len,4.5955
eval/loss,0.4904
eval/runtime,3.8756
eval/samples_per_second,68.893
eval/steps_per_second,4.386
train/epoch,1.0
train/global_step,151.0
train/learning_rate,5e-05
train/loss,0.6491


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 4hgg5fom with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.9953583351473948
wandb: 	fr_word_p: 0.24805642336381156
wandb: 	learning_rate: 0.0001388359650741013
wandb: 	random_state: 11
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.2396,0.836545,0.108,5.7603


0,1
eval/bleu,0.108
eval/gen_len,5.7603
eval/loss,0.83655
eval/runtime,3.5961
eval/samples_per_second,74.247
eval/steps_per_second,4.727
train/epoch,1.0
train/global_step,76.0
train/learning_rate,1e-05
train/loss,1.2396


wandb: Agent Starting Run: b1lcbtju with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.36247635396634775
wandb: 	fr_word_p: 0.7418640383037345
wandb: 	learning_rate: 0.030438053401243937
wandb: 	random_state: 4
wandb: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.8224,0.556485,0.0,2.0


0,1
eval/bleu,0.0
eval/gen_len,2.0
eval/loss,0.55649
eval/runtime,3.5982
eval/samples_per_second,74.204
eval/steps_per_second,4.725
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.0008
train/loss,0.8224


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: tpqyf6me with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8746984566132381
wandb: 	fr_word_p: 0.5569849505826632
wandb: 	learning_rate: 0.01040669382177722
wandb: 	random_state: 61
wandb: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7606,0.511298,0.0,5.0


0,1
eval/bleu,0.0
eval/gen_len,5.0
eval/loss,0.5113
eval/runtime,3.562
eval/samples_per_second,74.958
eval/steps_per_second,4.773
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.00027
train/loss,0.7606


wandb: Agent Starting Run: nc33uic8 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.415143386987923
wandb: 	fr_word_p: 0.4587384879723953
wandb: 	learning_rate: 0.008433350292435667
wandb: 	random_state: 72
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6417,0.490008,0.0,5.0


0,1
eval/bleu,0.0
eval/gen_len,5.0
eval/loss,0.49001
eval/runtime,4.346
eval/samples_per_second,61.436
eval/steps_per_second,3.912
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.00011
train/loss,0.6417


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 6cbmk9cu with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6089439298345336
wandb: 	fr_word_p: 0.36144325914291103
wandb: 	learning_rate: 0.0008151770163314815
wandb: 	random_state: 25
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7802,0.612546,0.0,1.794


0,1
eval/bleu,0.0
eval/gen_len,1.794
eval/loss,0.61255
eval/runtime,3.8823
eval/samples_per_second,68.774
eval/steps_per_second,4.379
train/epoch,1.0
train/global_step,76.0
train/learning_rate,1e-05
train/loss,0.7802


wandb: Agent Starting Run: nze55ojj with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.5173303131791576
wandb: 	fr_word_p: 0.7729534739868402
wandb: 	learning_rate: 0.0024166576057195618
wandb: 	random_state: 61
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7321,0.534201,0.4384,5.4719


0,1
eval/bleu,0.4384
eval/gen_len,5.4719
eval/loss,0.5342
eval/runtime,3.6262
eval/samples_per_second,73.631
eval/steps_per_second,4.688
train/epoch,1.0
train/global_step,76.0
train/learning_rate,6e-05
train/loss,0.7321


wandb: Agent Starting Run: 9qf59gau with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4983758765201962
wandb: 	fr_word_p: 0.727213910227789
wandb: 	learning_rate: 0.011720266074795833
wandb: 	random_state: 60
wandb: 	weight_decay: 0.2




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7567,0.503039,0.0,4.1685


0,1
eval/bleu,0.0
eval/gen_len,4.1685
eval/loss,0.50304
eval/runtime,3.6232
eval/samples_per_second,73.692
eval/steps_per_second,4.692
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.00031
train/loss,0.7567


wandb: Agent Starting Run: z4b2a1bs with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.25388817982274087
wandb: 	fr_word_p: 0.4974258989397345
wandb: 	learning_rate: 0.021963301241197955
wandb: 	random_state: 52
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9084,0.542954,0.0,1.6105


0,1
eval/bleu,0.0
eval/gen_len,1.6105
eval/loss,0.54295
eval/runtime,3.6425
eval/samples_per_second,73.301
eval/steps_per_second,4.667
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.0104
train/loss,0.9084


wandb: Agent Starting Run: wbm9bc4x with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.6316391377405799
wandb: 	fr_word_p: 0.4267797590889008
wandb: 	learning_rate: 0.010139518224244038
wandb: 	random_state: 39
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.9348,0.609885,0.0,6.0


0,1
eval/bleu,0.0
eval/gen_len,6.0
eval/loss,0.60988
eval/runtime,3.6365
eval/samples_per_second,73.423
eval/steps_per_second,4.675
train/epoch,1.0
train/global_step,76.0
train/learning_rate,0.00654
train/loss,0.9348


wandb: Agent Starting Run: rab5g3kv with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.4544664063777383
wandb: 	fr_word_p: 0.33998198123918744
wandb: 	learning_rate: 0.0033613842320187617
wandb: 	random_state: 53
wandb: 	weight_decay: 0.4




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.6856,0.526911,1.471,4.0449


0,1
eval/bleu,1.471
eval/gen_len,4.0449
eval/loss,0.52691
eval/runtime,3.6323
eval/samples_per_second,73.507
eval/steps_per_second,4.68
train/epoch,1.0
train/global_step,151.0
train/learning_rate,7e-05
train/loss,0.6856


wandb: Agent Starting Run: fsdab0fr with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.3788141480661509
wandb: 	fr_word_p: 0.6487252149003007
wandb: 	learning_rate: 0.0008637053996216034
wandb: 	random_state: 40
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7023,0.618953,0.0,1.236


0,1
eval/bleu,0.0
eval/gen_len,1.236
eval/loss,0.61895
eval/runtime,5.1948
eval/samples_per_second,51.398
eval/steps_per_second,3.273
train/epoch,1.0
train/global_step,151.0
train/learning_rate,1e-05
train/loss,0.7023


wandb: Agent Starting Run: pkpq8aqb with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.7331928578545781
wandb: 	fr_word_p: 0.3463959392626625
wandb: 	learning_rate: 0.0003235264031870901
wandb: 	random_state: 60
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.8154,0.594846,0.0,3.2697


0,1
eval/bleu,0.0
eval/gen_len,3.2697
eval/loss,0.59485
eval/runtime,3.6874
eval/samples_per_second,72.409
eval/steps_per_second,4.61
train/epoch,1.0
train/global_step,151.0
train/learning_rate,1e-05
train/loss,0.8154


wandb: Agent Starting Run: 3pxz62g9 with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.30549700644532374
wandb: 	fr_word_p: 0.7335328361073488
wandb: 	learning_rate: 0.021767035428698484
wandb: 	random_state: 73
wandb: 	weight_decay: 0.3




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.3782,0.652842,0.0,1.0


0,1
eval/bleu,0.0
eval/gen_len,1.0
eval/loss,0.65284
eval/runtime,4.8572
eval/samples_per_second,54.97
eval/steps_per_second,3.5
train/epoch,1.0
train/global_step,151.0
train/learning_rate,0.01989
train/loss,1.3782


wandb: Agent Starting Run: f8s96d4m with config:
wandb: 	batch_size: 16
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.8191602748812084
wandb: 	fr_word_p: 0.3208678558769945
wandb: 	learning_rate: 0.0003981916278310366
wandb: 	random_state: 30
wandb: 	weight_decay: 0.1




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7967,0.612041,0.0,2.2172


0,1
eval/bleu,0.0
eval/gen_len,2.2172
eval/loss,0.61204
eval/runtime,3.7316
eval/samples_per_second,71.55
eval/steps_per_second,4.556
train/epoch,1.0
train/global_step,151.0
train/learning_rate,1e-05
train/loss,0.7967


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: a2zcv7z5 with config:
wandb: 	batch_size: 32
wandb: 	epochs: 1
wandb: 	fr_char_p: 0.3987197979448549
wandb: 	fr_word_p: 0.3303424776522219
wandb: 	learning_rate: 0.00215461106235626
wandb: 	random_state: 4
wandb: 	weight_decay: 0




Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,0.7911,0.560144,0.0,1.0262


0,1
eval/bleu,0.0
eval/gen_len,1.0262
eval/loss,0.56014
eval/runtime,5.6917
eval/samples_per_second,46.91
eval/steps_per_second,2.987
train/epoch,1.0
train/global_step,76.0
train/learning_rate,6e-05
train/loss,0.7911


------------------

## Wolof to French

The only change this time is that the Wolof sentences are provided as inputs and the French sentences as targets.

### Configure dataset 🔠

In [None]:
def split_data(random_state: int = 50):
  """Split data between train, validation and test sets

  Args:
    random_state (int): the seed of the splitting generator. Defaults to 50
  """
  # load the corpora
  corpora = pd.read_csv(f"{path}diagne_sentences/extractions.csv")

  # hold out 10% of the sentences as the test set
  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # hold out 10% of the remaining sentences as the validation set
  train_set, valid_set = train_test_split(train_set, test_size=0.1, random_state=random_state)

  # save the final training set
  train_set.to_csv(f"{path}diagne_sentences/final_train_set.csv", index=False)

  # save the train, validation and test sets
  train_set.to_csv(f"{path}diagne_sentences/train_set.csv", index=False)

  valid_set.to_csv(f"{path}diagne_sentences/valid_set.csv", index=False)

  test_set.to_csv(f"{path}diagne_sentences/test_set.csv", index=False)
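
Because the validation set is carved out of the training portion, the two `test_size=0.1` calls do not produce an 80/10/10 split. The sketch below works out the resulting row counts. It assumes scikit-learn's behaviour of rounding a fractional `test_size` up to a whole number of rows; `split_sizes` is a hypothetical helper that is not part of the notebook.

```python
import math


def split_sizes(n_rows: int, test_size: float = 0.1, valid_size: float = 0.1):
    """Return (train, valid, test) row counts for two chained splits."""
    # first split: the test rows are taken from the whole corpus
    n_test = math.ceil(test_size * n_rows)
    n_remaining = n_rows - n_test

    # second split: the validation rows are taken from what is left
    n_valid = math.ceil(valid_size * n_remaining)
    n_train = n_remaining - n_valid

    return n_train, n_valid, n_test


print(split_sizes(1000))
```

For a corpus of 1,000 rows this gives 810 training, 90 validation and 100 test rows, i.e. roughly 81% / 9% / 10% of the data.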

In [None]:
# load the tokenizer from a JSON file
tokenizer = T5TokenizerFast(tokenizer_file=f"{path}wolof_translate/tokenizers/t5_tokenizers/tokenizer_v2.json")

In [None]:
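def recuperate_datasets(wf_char_p: float, wf_word_p: float):
  """Load the training set (with augmented Wolof inputs) and the validation set

  Args:
    wf_char_p (float): the probability of modifying a character of a Wolof word
    wf_word_p (float): the probability of modifying a word of a Wolof sentence

  Returns:
    tuple: the augmented training dataset and the validation dataset
  """
  # create the augmentation to apply to the Wolof sentences
  wf_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=wf_char_p, aug_word_p=wf_word_p),
                                         remove_mark_space, delete_guillemet_space)

  # load the training dataset (Wolof inputs are augmented, French targets are not)
  train_dataset_aug = T5SentenceDataset(f"{path}diagne_sentences/train_set.csv",
                                        tokenizer,
                                        corpus_1 = 'wolof',
                                        corpus_2 = 'french',
                                        truncation = True,
                                        cp1_transformer = wf_augmentation)

  # load the validation dataset (no augmentation)
  valid_dataset = T5SentenceDataset(f"{path}diagne_sentences/valid_set.csv",
                                    tokenizer,
                                    corpus_1 = 'wolof',
                                    corpus_2 = 'french',
                                    truncation = True)

  # return the datasets
  return train_dataset_aug, valid_dataset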
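
To give an intuition of what the two probabilities control, here is a toy keyboard-noise function written in plain Python. It is only an illustration: `toy_keyboard_noise` and the `NEIGHBOURS` map are hypothetical, and nlpaug's `KeyboardAug` applies its own selection rules, which this sketch does not reproduce.

```python
import random

# Hypothetical neighbour map for a few keys (real keyboard layouts are larger)
NEIGHBOURS = {"a": "qz", "e": "rz", "i": "uo", "o": "ip", "u": "yi", "n": "bh", "s": "qd"}


def toy_keyboard_noise(sentence: str, char_p: float, word_p: float, seed: int = 0) -> str:
    """Replace some characters by a neighbouring key.

    A word is selected with probability word_p; inside a selected word,
    each character that has neighbours is replaced with probability char_p.
    """
    rng = random.Random(seed)
    noisy_words = []
    for word in sentence.split():
        if rng.random() < word_p:
            word = "".join(
                rng.choice(NEIGHBOURS[c]) if c in NEIGHBOURS and rng.random() < char_p else c
                for c in word
            )
        noisy_words.append(word)
    return " ".join(noisy_words)


print(toy_keyboard_noise("nan nga def", char_p=0.5, word_p=0.5))
```

With `word_p = 0` or `char_p = 0` the sentence comes back unchanged, which is why the sweep lets both values start at `0.0`.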

### Configure hyperparameter search ⚙️

We have to configure the search space and the search method (Bayesian optimization, `bayes`, in our case).

In [None]:
# log in to Weights & Biases (the API key is read from WANDB_API_KEY or prompted for)
wandb.login()

# hyperparameters (the metric name must match the key logged by the trainer)
sweep_config = {
    'method': 'bayes',
    'metric':{
          'goal': 'maximize',
          'name': 'eval/bleu'
      },
    'parameters':
    {
      'epochs': {
          'value': 1
      },
      'batch_size': {
          'values': [16, 32] 
      },
      'learning_rate': {
          'distribution': 'log_uniform_values',
          'min': 1e-6,
          'max': 1e-2 
      },
      'weight_decay': {
          'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7] 
      },
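     # the fr_* names are kept from the first sweep; in this section they hold
     # the augmentation probabilities applied to the Wolof sentences
     'fr_char_p': {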
         'min': 0.0,
         'max': 1.0 
     },
     'fr_word_p': {
          'min': 0.0,
          'max': 1.0 
     },
     'random_state': {
         'values': list(range(1, 80))
     }
    }
}

# Initialize the hyperparameter search
sweep_id = wandb.sweep(sweep_config, project = "small-t5-fw-translation-bayes-hpsearch-v2")



wandb: Currently logged in as: oumar-kane (oumar-kane-team). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: alygo14y
Sweep URL: https://wandb.ai/oumar-kane-team/gpt2-wolof-french-translation_bayes1_1/sweeps/alygo14y
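
The `log_uniform_values` distribution used for the learning rate draws a value whose logarithm is uniform between `log(min)` and `log(max)`, so every order of magnitude is explored equally often. The standard-library sketch below imitates that sampling; `sample_log_uniform` is a hypothetical helper, not the W&B implementation.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# same bounds as the learning rate in the sweep configuration
samples = [sample_log_uniform(1e-6, 1e-2, rng) for _ in range(10_000)]

# 1e-4 is the midpoint of the range on a log scale
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of samples below 1e-4: {below_1e4:.2f}")
```

About half of the draws fall below `1e-4`, whereas a plain uniform distribution on the same interval would put almost all of them above it.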


### Configure the model and the evaluation function ⚙️

Let us load the model and resize the token embeddings to the size of our tokenizer's vocabulary.

In [None]:
def t5_model_init(tokenizer):

  # Initialize the model name
  model_name = 't5-small'

  # import the model with its pre-trained weights
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

  # resize the token embeddings
  model.resize_token_embeddings(len(tokenizer))

  return model

Let us evaluate the predictions with the `bleu` metric.

In [None]:
# %%writefile wolof-translate/wolof_translate/utils/evaluation.py
from typing import Callable, Optional

from transformers import PreTrainedTokenizerBase
import numpy as np
import evaluate

class TranslationEvaluation:
    """Compute the BLEU score and the mean generated length of the predictions"""
    
    def __init__(self, 
                 tokenizer: PreTrainedTokenizerBase,
                 decoder: Optional[Callable] = None,
                 metric = None,
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        # load sacreBLEU only when no metric is provided
        self.metric = metric if metric is not None else evaluate.load('sacrebleu')
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        # sacreBLEU expects a list of references for each prediction
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):

        preds, labels = eval_preds

        if isinstance(preds, tuple):
        
            preds = preds[0]
        
        decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)

        # replace the ignored label positions (-100) with the padding id before decoding
        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        
        decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
        
        result = {"bleu": result["score"]}

        # the generated length is the number of non-padding tokens of each prediction
        prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
        
        result["gen_len"] = np.mean(prediction_lens)
        
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result

In [None]:
# %run wolof-translate/wolof_translate/utils/evaluation.py

Let us initialize the evaluation object.

In [None]:
evaluation = TranslationEvaluation(tokenizer)
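
The post-processing inside `compute_metrics` can be followed on plain Python lists. Label positions set to `-100` are the ones ignored by the loss, so they must be mapped back to the padding id before decoding, and sacreBLEU expects a list of references for each prediction. This is a minimal sketch: the helper names are hypothetical and `PAD_ID = 0` is an assumed value (the real one comes from `tokenizer.pad_token_id`).

```python
PAD_ID = 0           # assumption: padding id of the tokenizer
IGNORE_INDEX = -100  # label value ignored by the loss


def restore_padding(label_ids, pad_id=PAD_ID):
    """Replace the ignored positions by the padding id so the labels can be decoded."""
    return [[pad_id if t == IGNORE_INDEX else t for t in row] for row in label_ids]


def generated_lengths(pred_ids, pad_id=PAD_ID):
    """Count the non-padding tokens of each prediction (the 'gen_len' statistic)."""
    return [sum(t != pad_id for t in row) for row in pred_ids]


def wrap_references(texts):
    """Strip each reference and wrap it in its own list, as sacreBLEU expects."""
    return [[t.strip()] for t in texts]


print(restore_padding([[12, 7, -100, -100]]))
print(generated_lengths([[12, 7, 0, 0], [3, 0, 0, 0]]))
print(wrap_references([" Bonjour ", "Merci"]))
```

A mean generated length close to 1, as in several runs of the first sweep, therefore means that the model produced almost nothing but padding.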

### Searching for the best parameters 🕖

Let us define the data collator.

In [None]:
def data_collator(batch):
    """Generate a batch of data to provide to trainer

    Args:
        batch (list): A list of (input ids, attention mask, labels) tuples, one per example

    Returns:
        dict: A dictionary containing the ids, the attention mask and the labels
    """
    input_ids = torch.stack([b[0].squeeze(0) for b in batch])
    
    attention_mask = torch.stack([b[1].squeeze(0) for b in batch])
    
    labels = torch.stack([b[2].squeeze(0) for b in batch])
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask,
            'labels': labels}

Let us initialize the training arguments and run the Bayesian search.

In [None]:
# %%wandb

def train(config = None):
  """Run one trial of the sweep with the sampled configuration"""

  with wandb.init(config = config):

    # fix the seed of the random number generators
    set_seed(0)

    # get the configuration sampled by the sweep
    config = wandb.config

    # split the data with the sampled seed
    split_data(config.random_state)

    # load the datasets (the fr_* parameters hold the Wolof augmentation probabilities here)
    train_dataset, valid_dataset = recuperate_datasets(config.fr_char_p, config.fr_word_p)

    # set training arguments
    training_args = Seq2SeqTrainingArguments(f"{path}/training/bayes_search_results_wf_v2",
                                      report_to = "wandb",
                                      num_train_epochs=config.epochs,
                                      load_best_model_at_end=True,
                                      save_strategy="epoch",
                                      evaluation_strategy="epoch",
                                      logging_strategy = 'epoch',
                                      per_device_train_batch_size=config.batch_size, 
                                      per_device_eval_batch_size=16,
                                      learning_rate=config.learning_rate,
                                      weight_decay=config.weight_decay,
                                      predict_with_generate=True, # evaluate on generated sequences so that BLEU can be computed
                                      fp16 = True,
                                      )   

    # define the trainer (a fresh model is created for each trial)
    trainer = Seq2SeqTrainer(model_init=partial(t5_model_init, tokenizer = train_dataset.tokenizer),
                      args=training_args,
                      train_dataset=train_dataset, 
                      eval_dataset=valid_dataset,
                      data_collator=data_collator,
                      compute_metrics=evaluation.compute_metrics
                      )

    # start training loop
    trainer.train()

agent = wandb.agent(sweep_id, train, count = 30)


wandb: Agent Starting Run: a0u0t6k2 with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 6.702369179262155e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.2819068695463206
wandb: 	wf_word_p: 0.3271474379445852




Epoch,Training Loss,Validation Loss
1,1.3314,0.972646


0,1
eval/loss,0.97265
eval/runtime,2.7358
eval/samples_per_second,29.973
eval/steps_per_second,6.214
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3314
train/total_flos,216590170752000.0
train/train_loss,1.33135


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 8p97mqyj with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 0.000687378518112751
wandb: 	random_state: 20
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.10242928904824668
wandb: 	wf_word_p: 0.3761238934195836




Epoch,Training Loss,Validation Loss
1,1.2118,0.902809


0,1
eval/loss,0.90281
eval/runtime,2.7187
eval/samples_per_second,30.162
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.2118
train/total_flos,216590170752000.0
train/train_loss,1.21175


wandb: Agent Starting Run: cufx9n8t with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 1.226951404890168e-05
wandb: 	random_state: 70
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.23759176734591644
wandb: 	wf_word_p: 0.13227163155096622




Epoch,Training Loss,Validation Loss
1,1.5713,0.956762


0,1
eval/loss,0.95676
eval/runtime,2.7129
eval/samples_per_second,30.226
eval/steps_per_second,6.266
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5713
train/total_flos,216590170752000.0
train/train_loss,1.57131


wandb: Agent Starting Run: y2zo0avu with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 0.0007890211017920526
wandb: 	random_state: 20
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.2153084724244564
wandb: 	wf_word_p: 0.03750263309251056




Epoch,Training Loss,Validation Loss
1,1.3947,0.852979


0,1
eval/loss,0.85298
eval/runtime,2.7128
eval/samples_per_second,30.227
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,4e-05
train/loss,1.3947
train/total_flos,216590170752000.0
train/train_loss,1.39468


wandb: Agent Starting Run: 6zl4nshz with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 0.00013060682353049685
wandb: 	random_state: 40
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.6581358201308465
wandb: 	wf_word_p: 0.010288732393246668




Epoch,Training Loss,Validation Loss
1,1.1252,0.843749


0,1
eval/loss,0.84375
eval/runtime,2.7146
eval/samples_per_second,30.207
eval/steps_per_second,6.262
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.1252
train/total_flos,216590170752000.0
train/train_loss,1.12516


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: lmlh43bi with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 4.6544061486102166e-05
wandb: 	random_state: 20
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.601761705072325
wandb: 	wf_word_p: 0.5485382582443594
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin




Epoch,Training Loss,Validation Loss
1,1.5112,0.959907


0,1
eval/loss,0.95991
eval/runtime,2.7127
eval/samples_per_second,30.228
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5112
train/total_flos,216590170752000.0
train/train_loss,1.5112


wandb: Agent Starting Run: b3709lrr with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 8.01812368676995e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.3414514505517595
wandb: 	wf_word_p: 0.06590909422021479




Epoch,Training Loss,Validation Loss
1,1.2984,0.943387


0,1
eval/loss,0.94339
eval/runtime,2.7141
eval/samples_per_second,30.212
eval/steps_per_second,6.264
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.2984
train/total_flos,216590170752000.0
train/train_loss,1.2984


wandb: Agent Starting Run: ixh1wkga with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 7.665404061522188e-05
wandb: 	random_state: 40
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.2025899023149373
wandb: 	wf_word_p: 0.30403171896970477




Epoch,Training Loss,Validation Loss
1,1.6004,0.898786


0,1
eval/loss,0.89879
eval/runtime,2.7076
eval/samples_per_second,30.285
eval/steps_per_second,6.279
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6004
train/total_flos,216590170752000.0
train/train_loss,1.60037


wandb: Agent Starting Run: yqxwql6m with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 0.0009940796140095907
wandb: 	random_state: 40
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.5740543904824967
wandb: 	wf_word_p: 0.2462419315938406




Epoch,Training Loss,Validation Loss
1,1.2845,0.848199


0,1
eval/loss,0.8482
eval/runtime,2.7189
eval/samples_per_second,30.159
eval/steps_per_second,6.253
train/epoch,1.0
train/global_step,367.0
train/learning_rate,2e-05
train/loss,1.2845
train/total_flos,216590170752000.0
train/train_loss,1.28448


wandb: Agent Starting Run: 6z8kvttx with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 3.136396657280805e-05
wandb: 	random_state: 80
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.496992386293468
wandb: 	wf_word_p: 0.09128048050662484





Epoch,Training Loss,Validation Loss
1,1.6337,0.925104


0,1
eval/loss,0.9251
eval/runtime,2.7135
eval/samples_per_second,30.22
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6337
train/total_flos,216590170752000.0
train/train_loss,1.63375


wandb: Agent Starting Run: m9sb4aqc with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 2.292925960874299e-05
wandb: 	random_state: 80
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.1392332134258406
wandb: 	wf_word_p: 0.33276367556881936




Epoch,Training Loss,Validation Loss
1,1.3772,0.921299


0,1
eval/loss,0.9213
eval/runtime,2.7224
eval/samples_per_second,30.12
eval/steps_per_second,6.244
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3772
train/total_flos,216590170752000.0
train/train_loss,1.37722


wandb: Agent Starting Run: jbput8ui with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 6.444484556126347e-05
wandb: 	random_state: 40
wandb: 	weight_decay: 0
wandb: 	wf_char_p: 0.29893712569494857
wandb: 	wf_word_p: 0.26117783521919574




Epoch,Training Loss,Validation Loss
1,1.4321,0.896062


0,1
eval/loss,0.89606
eval/runtime,2.708
eval/samples_per_second,30.281
eval/steps_per_second,6.278
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4321
train/total_flos,216590170752000.0
train/train_loss,1.43209


wandb: Agent Starting Run: 4fwhdrrr with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 2.5593238990374552e-05
wandb: 	random_state: 0
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.2943749543935819
wandb: 	wf_word_p: 0.12398175089457102




Epoch,Training Loss,Validation Loss
1,1.6999,0.85214


0,1
eval/loss,0.85214
eval/runtime,2.7191
eval/samples_per_second,30.157
eval/steps_per_second,6.252
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6999
train/total_flos,216590170752000.0
train/train_loss,1.69991


wandb: Agent Starting Run: 1bz3dbic with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 3.890480473611398e-05
wandb: 	random_state: 10
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.2378445346533837
wandb: 	wf_word_p: 0.4209643755950993




Epoch,Training Loss,Validation Loss
1,1.4832,0.875917


0,1
eval/loss,0.87592
eval/runtime,2.7525
eval/samples_per_second,29.792
eval/steps_per_second,6.176
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4832
train/total_flos,216590170752000.0
train/train_loss,1.48315


wandb: Agent Starting Run: ho6ta14m with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 0.0002774553527447285
wandb: 	random_state: 40
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.6714497099264989
wandb: 	wf_word_p: 0.03656663081592687




Epoch,Training Loss,Validation Loss
1,1.1247,0.822917


0,1
eval/loss,0.82292
eval/runtime,2.7172
eval/samples_per_second,30.178
eval/steps_per_second,6.256
train/epoch,1.0
train/global_step,367.0
train/learning_rate,1e-05
train/loss,1.1247
train/total_flos,216590170752000.0
train/train_loss,1.12474


wandb: Agent Starting Run: wmlfiw0r with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 0.0003923791647223955
wandb: 	random_state: 30
wandb: 	weight_decay: 0.2
wandb: 	wf_char_p: 0.06807537344840917
wandb: 	wf_word_p: 0.5604330614348828




Epoch,Training Loss,Validation Loss
1,1.5006,0.949637


0,1
eval/loss,0.94964
eval/runtime,2.7103
eval/samples_per_second,30.255
eval/steps_per_second,6.272
train/epoch,1.0
train/global_step,147.0
train/learning_rate,2e-05
train/loss,1.5006
train/total_flos,216590170752000.0
train/train_loss,1.5006


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: r8gcrole with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 1.821341363881095e-05
wandb: 	random_state: 50
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.4377052667800509
wandb: 	wf_word_p: 0.28337028938356845




Epoch,Training Loss,Validation Loss
1,1.8462,0.981831


0,1
eval/loss,0.98183
eval/runtime,2.7136
eval/samples_per_second,30.218
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.8462
train/total_flos,216590170752000.0
train/train_loss,1.8462


wandb: Agent Starting Run: phwfj18n with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 7.811279212646426e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.2317434997460037
wandb: 	wf_word_p: 0.4361895012572056




Epoch,Training Loss,Validation Loss
1,1.6396,0.980084


0,1
eval/loss,0.98008
eval/runtime,2.767
eval/samples_per_second,29.635
eval/steps_per_second,6.144
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6396
train/total_flos,216590170752000.0
train/train_loss,1.63955


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: jlqkadbj with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 0.00010056956001033137
wandb: 	random_state: 10
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.6390337876205812
wandb: 	wf_word_p: 0.23792329132405943




Epoch,Training Loss,Validation Loss
1,1.6119,0.873367


0,1
eval/loss,0.87337
eval/runtime,2.7126
eval/samples_per_second,30.23
eval/steps_per_second,6.267
train/epoch,1.0
train/global_step,147.0
train/learning_rate,1e-05
train/loss,1.6119
train/total_flos,216590170752000.0
train/train_loss,1.61191


wandb: Agent Starting Run: 6p270685 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 7.129742665577746e-05
wandb: 	random_state: 20
wandb: 	weight_decay: 0.5
wandb: 	wf_char_p: 0.39725467091195105
wandb: 	wf_word_p: 0.5004309074101329




Epoch,Training Loss,Validation Loss
1,1.4593,0.955068


0,1
eval/loss,0.95507
eval/runtime,2.7243
eval/samples_per_second,30.099
eval/steps_per_second,6.24
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4593
train/total_flos,216590170752000.0
train/train_loss,1.45931


wandb: Agent Starting Run: gcv6axp9 with config:
wandb: 	batch_size: 2
wandb: 	epochs: 1
wandb: 	learning_rate: 6.252438217932992e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0
wandb: 	wf_char_p: 0.29046645972247725
wandb: 	wf_word_p: 0.2600785813543463




Epoch,Training Loss,Validation Loss
1,1.3169,0.971379


0,1
eval/loss,0.97138
eval/runtime,2.7496
eval/samples_per_second,29.823
eval/steps_per_second,6.183
train/epoch,1.0
train/global_step,367.0
train/learning_rate,0.0
train/loss,1.3169
train/total_flos,216590170752000.0
train/train_loss,1.31693


wandb: Agent Starting Run: yhyy9e23 with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 3.647196100197014e-05
wandb: 	random_state: 10
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.1717366963251063
wandb: 	wf_word_p: 0.1316624948710261




Epoch,Training Loss,Validation Loss
1,1.3969,0.856822


0,1
eval/loss,0.85682
eval/runtime,2.7137
eval/samples_per_second,30.217
eval/steps_per_second,6.265
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.3969
train/total_flos,216590170752000.0
train/train_loss,1.3969


wandb: Agent Starting Run: xup68ew4 with config:
wandb: 	batch_size: 5
wandb: 	epochs: 1
wandb: 	learning_rate: 3.637652455538814e-05
wandb: 	random_state: 60
wandb: 	weight_decay: 0.3
wandb: 	wf_char_p: 0.677355252029728
wandb: 	wf_word_p: 0.16163588455481578




Epoch,Training Loss,Validation Loss
1,1.6871,0.952795


0,1
eval/loss,0.9528
eval/runtime,2.7376
eval/samples_per_second,29.953
eval/steps_per_second,6.21
train/epoch,1.0
train/global_step,147.0
train/learning_rate,0.0
train/loss,1.6871
train/total_flos,216590170752000.0
train/train_loss,1.68711


wandb: Agent Starting Run: wa2re6yd with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 0.00015372099844614283
wandb: 	random_state: 0
wandb: 	weight_decay: 0.1
wandb: 	wf_char_p: 0.4593404124892092
wandb: 	wf_word_p: 0.5384324544438147




Epoch,Training Loss,Validation Loss
1,1.4028,0.823874


0,1
eval/loss,0.82387
eval/runtime,2.7275
eval/samples_per_second,30.064
eval/steps_per_second,6.233
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.4028
train/total_flos,216590170752000.0
train/train_loss,1.40277


wandb: Agent Starting Run: i305ffth with config:
wandb: 	batch_size: 3
wandb: 	epochs: 1
wandb: 	learning_rate: 4.284698923411304e-05
wandb: 	random_state: 30
wandb: 	weight_decay: 0.4
wandb: 	wf_char_p: 0.6295786463189464
wandb: 	wf_word_p: 0.6968639258681786




Epoch,Training Loss,Validation Loss
1,1.5093,0.986804


0,1
eval/loss,0.9868
eval/runtime,2.7162
eval/samples_per_second,30.189
eval/steps_per_second,6.259
train/epoch,1.0
train/global_step,245.0
train/learning_rate,0.0
train/loss,1.5093
train/total_flos,216590170752000.0
train/train_loss,1.5093


-----------

## Colab download and remove step

In [19]:
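import shutil

# Optional: archive the local W&B logs for download before deleting them.
# shutil.make_archive('wandb', 'zip', 'wandb')
# Optional: delete the search results saved on Google Drive.
# shutil.rmtree('/content/drive/MyDrive/Memoire/subject2/T5/training/bayes_search_results')
# Delete the local W&B run directory.
shutil.rmtree('wandb')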
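The clean-up cell above deletes the `wandb` directory outright, so any logs that were not archived first are lost. A minimal sketch of a safer variant follows. It assumes we want a zip archive next to the directory before removing it. The helper name `archive_and_remove` is hypothetical and is not part of the notebook or of `wandb`.

```python
import shutil
from pathlib import Path


def archive_and_remove(directory):
    """Zip `directory` next to itself, then delete it.

    Returns the path of the created archive, or None when the
    directory does not exist (so the cell can be re-run safely).
    Hypothetical helper, shown only as an illustration.
    """
    source = Path(directory).resolve()
    if not source.is_dir():
        return None
    # make_archive appends ".zip" to the base name itself.
    archive_path = shutil.make_archive(str(source), "zip", root_dir=str(source))
    # Remove the directory only after the archive was written.
    shutil.rmtree(source)
    return Path(archive_path)
```

Calling `archive_and_remove('wandb')` in the Colab session would then leave a `wandb.zip` file in the working directory, ready to download, and a second call would simply return `None`.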