# Strategy 3: Finetune hyperparameters
## In this improvement strategy, we use the Weights & Biases Sweeps tool to automate the hyperparameter search within a fixed search space that we define. We first define the search space by creating a configuration that lists the parameters we will tune and their domains, along with the metric we want to minimise, which is eval/loss.

## https://api.wandb.ai/links/alvinjoseph/84ncenjd

## The above report shows a sweep of 20 runs in which several parameters are varied. From the results, we observe that lora_alpha has the strongest (negative) correlation with eval/loss, and a value of 16 appears to be best for minimising the loss. We also update the values for weight decay and warmup ratio.
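
The relationship between lora_alpha and eval/loss can also be checked directly from the run logs. The sketch below is only an illustration: it uses the (lora_alpha, eval/loss) pairs of the 18 runs whose output is complete in this notebook rather than the 20 runs of the report, and `pearson`, `mean_loss` and `best_alpha` are hypothetical helper names, not part of the W&B API.

```python
from collections import defaultdict
from math import sqrt

# (lora_alpha, eval/loss) copied from the run summaries logged in this notebook
runs = [
    (4, 1.17176), (16, 1.06794), (4, 1.12388), (8, 1.08887), (8, 1.08873),
    (8, 1.08658), (16, 1.06856), (8, 1.08897), (8, 1.08561), (16, 1.06727),
    (16, 1.06832), (16, 1.06713), (4, 1.16535), (16, 1.06795), (8, 1.08658),
    (8, 1.08873), (16, 1.06784), (16, 1.06675),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


alphas = [alpha for alpha, _ in runs]
losses = [loss for _, loss in runs]
correlation = pearson(alphas, losses)

# average eval/loss for each lora_alpha value
grouped = defaultdict(list)
for alpha, loss in runs:
    grouped[alpha].append(loss)
mean_loss = {alpha: sum(vals) / len(vals) for alpha, vals in sorted(grouped.items())}
best_alpha = min(mean_loss, key=mean_loss.get)

print(f"correlation(lora_alpha, eval/loss) = {correlation:.2f}")
print("mean eval/loss per lora_alpha:", mean_loss)
print("lora_alpha with lowest mean eval/loss:", best_alpha)
```

On these runs the correlation is negative and the mean eval/loss falls as lora_alpha grows, which is in line with the observation from the report.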

# Strategy 3 Part 1

In [1]:
# Installing required packages
!pip install wandb
!pip install -U -q peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 scipy==1.12.0 wandb==0.16.5 coloredlogs==15.0.1



In [2]:
# Load required packages

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, pipeline
from datasets import load_dataset, Dataset
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import torch

import pickle




In [3]:
# load SUTD QA dataset from step 1
with open('sutd_qa_dataset_strategy1.pkl', 'rb') as f:
    sutd_qa_dataset = pickle.load(f)

In [4]:
# split data into training and test sets: 160 instances for training, the rest for testing
sutd_qa_dataset = sutd_qa_dataset.train_test_split(train_size=160, shuffle=False)

In [5]:
# check schema and number of instances
sutd_qa_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 160
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 46
    })
})

In [6]:
# inspect first instance
sutd_qa_dataset["train"][0]

{'question': 'How do I access the digital library?',
 'answer': ' You can access the digital library by logging into the SUTD portal and clicking on the "Library" tab. From there, you can access a variety of online resources, including e-books, academic journals, and other publications. Additionally, you can use the library\'s online catalog to search for physical resources and request items for delivery to the library.'}

In [7]:
# QUESTION: create a formatting function 'formatting_func' which takes an example from your QA dataset as input and outputs
# a dictionary with the key "text" and as value a text prompt with the following format:
# ### USER: {question from example goes here}
# ### ASSISTANT: {answer from example goes here}


#--- ADD YOUR SOLUTION HERE (10 points)---
def formatting_func(example):
    formatted_text = f"### USER: {example['question']}\n### ASSISTANT: {example['answer']}"
    return {"text": formatted_text}


#----------------------------------------


In [8]:
# apply formatting function to data set
formatted_dataset = sutd_qa_dataset.map(formatting_func)

Map: 100%|██████████| 160/160 [00:00<00:00, 8949.76 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 7791.07 examples/s]


In [9]:
# check formatted prompt
formatted_dataset["train"]["text"][0]

# Note: you should see something like this (not necessarily the same prompt, but the same format)
# '### USER: What are some of the best places to eat near the SUTD campus?\n### ASSISTANT: There are several great dining options near the SUTD campus. 
# One popular spot is the Changi Business Park Food Court, ...


'### USER: How do I access the digital library?\n### ASSISTANT:  You can access the digital library by logging into the SUTD portal and clicking on the "Library" tab. From there, you can access a variety of online resources, including e-books, academic journals, and other publications. Additionally, you can use the library\'s online catalog to search for physical resources and request items for delivery to the library.'

In [10]:
# model id of base model
model_id = "NousResearch/Nous-Hermes-llama-2-7b"

# model id for our finetuned model
new_model = "nous-hermes-7b-qlora-sutd-qa"

# config for model quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False  # nested (double) quantization disabled
)

# Load the entire model on the GPU 0
device_map = {"": 0}


In [11]:
# Login to wandb
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33malvinjoseph[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [12]:
# sweep config
sweep_config = {
    'method': 'random'
    }

metric = {
    'name': 'eval/loss',
    'goal': 'minimize'   
    }

sweep_config['metric'] = metric

parameters_dict = {
    'learning_rate': {
        'distribution': 'uniform',
        'min': 0,
        'max': 0.1
    },
    'weight_decay': {
        'values': [0, 0.001, 0.01]
    },
    'max_grad_norm': {
        'values': [0.1, 0.3, 1.0]
    },
    'lora_alpha': {
        'values': [4, 8, 16]
    },
    'r': {
        'values': [4, 8, 16]
    },
    'warmup_ratio': {
        'values': [0.01, 0.03, 0.1]
    }
}


sweep_config['parameters'] = parameters_dict
sweep_id = wandb.sweep(sweep_config, project="llama-finetune-sweep")

Create sweep with ID: lnbkc5fm
Sweep URL: https://wandb.ai/alvinjoseph/llama-finetune-sweep/sweeps/lnbkc5fm
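
The sweep samples a value for every entry in `parameters_dict`, but a sampled value only takes effect if the training function reads it from `wandb.config`. In the training function below, `learning_rate` is set to a fixed 2e-4, so the sampled learning rate does not reach the trainer. One possible way to overlay sampled values on defaults is sketched here. `resolve_hparams` is a hypothetical helper, and a plain dict stands in for `wandb.config`, assuming the config object offers a mapping-style `.get`.

```python
def resolve_hparams(sampled, defaults):
    """Return the defaults, overridden by any value the sweep sampled.

    `sampled` is any mapping-like object with a .get method (a plain dict here,
    standing in for wandb.config). Keys that are not in `defaults` are ignored.
    """
    resolved = dict(defaults)
    for name in defaults:
        value = sampled.get(name)
        if value is not None:
            resolved[name] = value
    return resolved


# fallback mirrors the fixed learning rate used in the training function
defaults = {"learning_rate": 2e-4}

# values as they appear in the first sweep run's config
sampled = {"learning_rate": 0.023460014383040777, "lora_alpha": 4}

hparams = resolve_hparams(sampled, defaults)
print(hparams)
```

With this pattern the fixed value acts only as a fallback when the sweep does not supply the parameter.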


In [14]:
import json 
import gc

def train(config=None):
    with wandb.init(config=config):
        model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map=device_map)
        model.config.use_cache = False
        model.config.pretraining_tp = 1
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"
        config = wandb.config
        lora_config = LoraConfig(
            lora_alpha=config.lora_alpha,
            r=config.r,
            lora_dropout=0.1,
            bias="none",
            task_type="CAUSAL_LM",
        )
        output_dir = "./results"
        per_device_train_batch_size = 1
        gradient_accumulation_steps = 1
        optim = "paged_adamw_32bit"
        save_steps = 10
        logging_steps = 10
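        # NOTE: fixed for every run; the learning_rate sampled by the sweep is not read here
        learning_rate = 2e-4
        weight_decay = config.weight_decay      # sampled by the sweep
        max_grad_norm = config.max_grad_norm    # sampled by the sweep
        num_train_epochs = 1
        warmup_ratio = config.warmup_ratio      # sampled by the sweep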
        lr_scheduler_type = "cosine"
        packing = False
        max_seq_length = None

        training_arguments = TrainingArguments(
            output_dir=output_dir,
            per_device_train_batch_size=per_device_train_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            optim=optim,
            save_steps=save_steps,
            logging_steps=logging_steps,
            learning_rate=learning_rate,
            weight_decay=weight_decay,
            max_grad_norm=max_grad_norm,
            num_train_epochs=num_train_epochs,
            warmup_ratio=warmup_ratio,
            lr_scheduler_type=lr_scheduler_type,
            report_to="wandb"
            
        )

        trainer = SFTTrainer(
            model=model,
            args=training_arguments,
            train_dataset=formatted_dataset["train"],
            eval_dataset=formatted_dataset["test"],
            peft_config=lora_config,
            dataset_text_field="text",
            packing=packing,
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
        )

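        trainer.train()
        # evaluate on the held-out test split; the sweep minimises the resulting eval/loss
        text = trainer.evaluate()
        print("output")
        print(text)
        # release the trainer, model and tokenizer so the next sweep run starts with free GPU memory
        del trainer
        del model
        del tokenizer

        gc.collect()
        torch.cuda.empty_cache()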

wandb.agent(sweep_id, train, count=20)

[34m[1mwandb[0m: Agent Starting Run: 3hq244sz with config:
[34m[1mwandb[0m: 	learning_rate: 0.023460014383040777
[34m[1mwandb[0m: 	lora_alpha: 4
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.03
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 6361.33 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5503.24 examples/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.8362
20,1.674
30,1.6385
40,1.4882
50,1.4189
60,1.2987
70,1.0095
80,1.3255
90,1.0753
100,1.2244


output
{'eval_loss': 1.1717647314071655, 'eval_runtime': 10.3696, 'eval_samples_per_second': 4.436, 'eval_steps_per_second': 0.579, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▇▆▅▄▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.17176
eval/runtime,10.3696
eval/samples_per_second,4.436
eval/steps_per_second,0.579
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.2506
train/total_flos,941152422912000.0
train/train_loss,1.30103
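
The figures in this summary can be cross-checked with a little arithmetic. This is a minimal sketch under three assumptions: the trainer takes one optimizer step per `gradient_accumulation_steps` batches, the number of warmup steps is the ceiling of `warmup_ratio` times the total number of steps, and evaluation uses a batch size of 8 (the `TrainingArguments` default, since the notebook does not set it).

```python
import math

train_examples = 160            # rows in the train split
eval_examples = 46              # rows in the test split
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_train_epochs = 1
eval_batch_size = 8             # assumed default; not set in the notebook
eval_runtime = 10.3696          # seconds, taken from the run summary above

# optimizer steps over the whole run
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

# warmup steps for each warmup_ratio value in the sweep
warmup_steps = {ratio: math.ceil(total_steps * ratio) for ratio in (0.01, 0.03, 0.1)}

# evaluation throughput
eval_steps = math.ceil(eval_examples / eval_batch_size)
samples_per_second = round(eval_examples / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)

print("total optimizer steps:", total_steps)
print("warmup steps per ratio:", warmup_steps)
print("eval samples/s:", samples_per_second, "eval steps/s:", steps_per_second)
```

Under these assumptions the results agree with the logged train/global_step, eval/samples_per_second and eval/steps_per_second.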


[34m[1mwandb[0m: Agent Starting Run: z6t880y3 with config:
[34m[1mwandb[0m: 	learning_rate: 0.0314501614010468
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 7580.44 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5977.57 examples/s]


Step,Training Loss
10,1.8368
20,1.6444
30,1.4904
40,1.3833
50,1.3617
60,1.2002
70,0.8852
80,1.2165
90,0.9386
100,1.1081


output
{'eval_loss': 1.067938208580017, 'eval_runtime': 10.4062, 'eval_samples_per_second': 4.42, 'eval_steps_per_second': 0.577, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▅▅▅▃▁▃▁▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06794
eval/runtime,10.4062
eval/samples_per_second,4.42
eval/steps_per_second,0.577
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1395
train/total_flos,940853893324800.0
train/train_loss,1.19978


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: yknrw4ax with config:
[34m[1mwandb[0m: 	learning_rate: 0.0035065880133600237
[34m[1mwandb[0m: 	lora_alpha: 4
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 7479.89 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5933.45 examples/s]


Step,Training Loss
10,1.8465
20,1.7445
30,1.7182
40,1.5571
50,1.4466
60,1.3032
70,1.0147
80,1.3259
90,1.077
100,1.2241


output
{'eval_loss': 1.1238765716552734, 'eval_runtime': 10.442, 'eval_samples_per_second': 4.405, 'eval_steps_per_second': 0.575, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▇▆▅▃▁▄▂▃▂▂▂▁▁▂
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.12388
eval/runtime,10.442
eval/samples_per_second,4.405
eval/steps_per_second,0.575
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1924
train/total_flos,940853893324800.0
train/train_loss,1.30649


[34m[1mwandb[0m: Agent Starting Run: 2lq3rwdq with config:
[34m[1mwandb[0m: 	learning_rate: 0.08244621024605153
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 0.3
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 7375.41 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5784.38 examples/s]


Step,Training Loss
10,1.843
20,1.7006
30,1.6175
40,1.4475
50,1.3973
60,1.2737
70,0.9908
80,1.2975
90,1.0038
100,1.1321


output
{'eval_loss': 1.0888748168945312, 'eval_runtime': 10.3521, 'eval_samples_per_second': 4.444, 'eval_steps_per_second': 0.58, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▆▅▄▃▁▄▁▂▂▁▂▁▁▂
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08887
eval/runtime,10.3521
eval/samples_per_second,4.444
eval/steps_per_second,0.58
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1584
train/total_flos,941152422912000.0
train/train_loss,1.24685


[34m[1mwandb[0m: Agent Starting Run: x9nk3a9d with config:
[34m[1mwandb[0m: 	learning_rate: 0.07507817039271629
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 0.3
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 7205.62 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5586.25 examples/s]


Step,Training Loss
10,1.8081
20,1.5872
30,1.5021
40,1.3993
50,1.38
60,1.2617
70,0.9795
80,1.2885
90,0.9959
100,1.1296


output
{'eval_loss': 1.088730812072754, 'eval_runtime': 10.3966, 'eval_samples_per_second': 4.425, 'eval_steps_per_second': 0.577, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▄▃▁▄▁▂▂▁▂▁▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08873
eval/runtime,10.3966
eval/samples_per_second,4.425
eval/steps_per_second,0.577
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1603
train/total_flos,941152422912000.0
train/train_loss,1.22458


[34m[1mwandb[0m: Agent Starting Run: symda51h with config:
[34m[1mwandb[0m: 	learning_rate: 0.08059386692535252
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0


Map: 100%|██████████| 160/160 [00:00<00:00, 7335.91 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5792.54 examples/s]


Step,Training Loss
10,1.8088
20,1.5881
30,1.4839
40,1.3931
50,1.3729
60,1.2543
70,0.9712
80,1.264
90,0.9634
100,1.1278


output
{'eval_loss': 1.0865801572799683, 'eval_runtime': 10.4137, 'eval_samples_per_second': 4.417, 'eval_steps_per_second': 0.576, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▄▃▁▃▁▂▂▁▂▁▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08658
eval/runtime,10.4137
eval/samples_per_second,4.417
eval/steps_per_second,0.576
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1533
train/total_flos,941152422912000.0
train/train_loss,1.21616


[34m[1mwandb[0m: Agent Starting Run: qvxn8kca with config:
[34m[1mwandb[0m: 	learning_rate: 0.04268082197004365
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0


Map: 100%|██████████| 160/160 [00:00<00:00, 7256.42 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5685.85 examples/s]


Step,Training Loss
10,1.7798
20,1.504
30,1.3886
40,1.3706
50,1.3461
60,1.1797
70,0.8704
80,1.2099
90,0.9371
100,1.1081


output
{'eval_loss': 1.0685595273971558, 'eval_runtime': 10.4163, 'eval_samples_per_second': 4.416, 'eval_steps_per_second': 0.576, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06856
eval/runtime,10.4163
eval/samples_per_second,4.416
eval/steps_per_second,0.576
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1403
train/total_flos,940853893324800.0
train/train_loss,1.17689


[34m[1mwandb[0m: Agent Starting Run: ny1jpbx7 with config:
[34m[1mwandb[0m: 	learning_rate: 0.0611612211118934
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 16
[34m[1mwandb[0m: 	warmup_ratio: 0.03
[34m[1mwandb[0m: 	weight_decay: 0


Map: 100%|██████████| 160/160 [00:00<00:00, 7652.44 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 6063.80 examples/s]


Step,Training Loss
10,1.8237
20,1.6109
30,1.5288
40,1.4096
50,1.3833
60,1.2653
70,0.9838
80,1.2884
90,0.9941
100,1.1312


output
{'eval_loss': 1.0889664888381958, 'eval_runtime': 10.3875, 'eval_samples_per_second': 4.428, 'eval_steps_per_second': 0.578, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▆▅▄▃▁▄▁▂▂▁▂▁▁▂
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08897
eval/runtime,10.3875
eval/samples_per_second,4.428
eval/steps_per_second,0.578
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1584
train/total_flos,941749482086400.0
train/train_loss,1.22996


[34m[1mwandb[0m: Agent Starting Run: j256us5x with config:
[34m[1mwandb[0m: 	learning_rate: 0.04277464114275303
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 6950.54 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 4556.12 examples/s]


Step,Training Loss
10,1.8434
20,1.7019
30,1.6048
40,1.4277
50,1.3851
60,1.268
70,0.9872
80,1.2856
90,0.9672
100,1.1355


output
{'eval_loss': 1.08561372756958, 'eval_runtime': 10.419, 'eval_samples_per_second': 4.415, 'eval_steps_per_second': 0.576, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▆▅▄▃▁▄▁▂▂▁▂▁▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08561
eval/runtime,10.419
eval/samples_per_second,4.415
eval/steps_per_second,0.576
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1553
train/total_flos,940853893324800.0
train/train_loss,1.23923


[34m[1mwandb[0m: Agent Starting Run: w18sygak with config:
[34m[1mwandb[0m: 	learning_rate: 0.019767563679772004
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.3
[34m[1mwandb[0m: 	r: 16
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 6696.49 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5632.24 examples/s]


Step,Training Loss
10,1.7789
20,1.5008
30,1.3861
40,1.3716
50,1.3446
60,1.1845
70,0.8759
80,1.2105
90,0.9353
100,1.1044


output
{'eval_loss': 1.0672739744186401, 'eval_runtime': 10.3733, 'eval_samples_per_second': 4.434, 'eval_steps_per_second': 0.578, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▁▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06727
eval/runtime,10.3733
eval/samples_per_second,4.434
eval/steps_per_second,0.578
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1383
train/total_flos,941749482086400.0
train/train_loss,1.17637


[34m[1mwandb[0m: Agent Starting Run: j2elamyw with config:
[34m[1mwandb[0m: 	learning_rate: 0.02537873460600909
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 16
[34m[1mwandb[0m: 	warmup_ratio: 0.03
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 2716.56 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 6019.34 examples/s]


Step,Training Loss
10,1.8023
20,1.5292
30,1.3979
40,1.3728
50,1.3481
60,1.185
70,0.8761
80,1.2118
90,0.9369
100,1.1066


output
{'eval_loss': 1.0683187246322632, 'eval_runtime': 10.3663, 'eval_samples_per_second': 4.437, 'eval_steps_per_second': 0.579, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▁▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06832
eval/runtime,10.3663
eval/samples_per_second,4.437
eval/steps_per_second,0.579
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.142
train/total_flos,941749482086400.0
train/train_loss,1.1812


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: s3fobkt0 with config:
[34m[1mwandb[0m: 	learning_rate: 0.0800889599310971
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 6809.83 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5075.98 examples/s]


Step,Training Loss
10,1.8362
20,1.6421
30,1.493
40,1.385
50,1.3609
60,1.2028
70,0.8878
80,1.2205
90,0.9369
100,1.108


output
{'eval_loss': 1.067129373550415, 'eval_runtime': 10.3189, 'eval_samples_per_second': 4.458, 'eval_steps_per_second': 0.581, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▅▅▄▃▁▃▁▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06713
eval/runtime,10.3189
eval/samples_per_second,4.458
eval/steps_per_second,0.581
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1379
train/total_flos,941152422912000.0
train/train_loss,1.19994


[34m[1mwandb[0m: Agent Starting Run: gjfe55af with config:
[34m[1mwandb[0m: 	learning_rate: 0.04409831894137087
[34m[1mwandb[0m: 	lora_alpha: 4
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 5643.29 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5127.24 examples/s]


Step,Training Loss
10,1.8282
20,1.6564
30,1.6218
40,1.4759
50,1.4146
60,1.299
70,1.0082
80,1.3256
90,1.0761
100,1.2234


output
{'eval_loss': 1.165352702140808, 'eval_runtime': 10.37, 'eval_samples_per_second': 4.436, 'eval_steps_per_second': 0.579, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▇▆▅▄▃▁▄▂▃▂▂▂▂▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.16535
eval/runtime,10.37
eval/samples_per_second,4.436
eval/steps_per_second,0.579
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.2424
train/total_flos,941152422912000.0
train/train_loss,1.29576


[34m[1mwandb[0m: Agent Starting Run: meki0yli with config:
[34m[1mwandb[0m: 	learning_rate: 0.04087712160738728
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0


Map: 100%|██████████| 160/160 [00:00<00:00, 7242.72 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5716.17 examples/s]


Step,Training Loss
10,1.7794
20,1.5013
30,1.3851
40,1.3695
50,1.3438
60,1.1763
70,0.8656
80,1.2106
90,0.9359
100,1.1062


output
{'eval_loss': 1.0679482221603394, 'eval_runtime': 10.3664, 'eval_samples_per_second': 4.437, 'eval_steps_per_second': 0.579, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.06795
eval/runtime,10.3664
eval/samples_per_second,4.437
eval/steps_per_second,0.579
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1388
train/total_flos,940853893324800.0
train/train_loss,1.17526


[34m[1mwandb[0m: Agent Starting Run: tt1eornw with config:
[34m[1mwandb[0m: 	learning_rate: 0.015740116366597248
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 7343.21 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 4844.52 examples/s]


Step,Training Loss
10,1.8088
20,1.5881
30,1.4839
40,1.3931
50,1.3729
60,1.2543
70,0.9712
80,1.264
90,0.9634
100,1.1278


output
{'eval_loss': 1.0865801572799683, 'eval_runtime': 10.3651, 'eval_samples_per_second': 4.438, 'eval_steps_per_second': 0.579, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▄▃▁▃▁▂▂▁▂▁▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08658
eval/runtime,10.3651
eval/samples_per_second,4.438
eval/steps_per_second,0.579
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1533
train/total_flos,941152422912000.0
train/train_loss,1.21616


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: k4glcu2x with config:
[34m[1mwandb[0m: 	learning_rate: 0.004362693799545392
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 0.3
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 7714.37 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5938.56 examples/s]


Step,Training Loss
10,1.8081
20,1.5872
30,1.5021
40,1.3993
50,1.38
60,1.2617
70,0.9795
80,1.2885
90,0.9959
100,1.1296


output
{'eval_loss': 1.0887339115142822, 'eval_runtime': 10.3074, 'eval_samples_per_second': 4.463, 'eval_steps_per_second': 0.582, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▄▃▁▄▁▂▂▁▂▁▁▃
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,1.08873
eval/runtime,10.3074
eval/samples_per_second,4.463
eval/steps_per_second,0.582
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1603
train/total_flos,941152422912000.0
train/train_loss,1.22459


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 310yurrn with config:
[34m[1mwandb[0m: 	learning_rate: 0.004737763163517728
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 0.3
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.001


Map: 100%|██████████| 160/160 [00:00<00:00, 7678.54 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5713.97 examples/s]


Step,Training Loss
10,1.7775
20,1.4974
30,1.3827
40,1.374
50,1.3445
60,1.174
70,0.8646
80,1.2107
90,0.9379
100,1.1061


output
{'eval_loss': 1.0678377151489258, 'eval_runtime': 10.3518, 'eval_samples_per_second': 4.444, 'eval_steps_per_second': 0.58, 'epoch': 1.0}


Run history:
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Metric,Run summary
eval/loss,1.06784
eval/runtime,10.3518
eval/samples_per_second,4.444
eval/steps_per_second,0.58
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1399
train/total_flos,941152422912000.0
train/train_loss,1.17511


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: whanj5zm with config:
[34m[1mwandb[0m: 	learning_rate: 0.053238547242129536
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 8
[34m[1mwandb[0m: 	warmup_ratio: 0.03
[34m[1mwandb[0m: 	weight_decay: 0


Map: 100%|██████████| 160/160 [00:00<00:00, 6889.96 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 4355.26 examples/s]


Step,Training Loss
10,1.8015
20,1.5252
30,1.3861
40,1.3721
50,1.3403
60,1.1657
70,0.8633
80,1.215
90,0.9363
100,1.1045


output
{'eval_loss': 1.0667515993118286, 'eval_runtime': 10.3836, 'eval_samples_per_second': 4.43, 'eval_steps_per_second': 0.578, 'epoch': 1.0}


Metric,Run history
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Metric,Run summary
eval/loss,1.06675
eval/runtime,10.3836
eval/samples_per_second,4.43
eval/steps_per_second,0.578
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1416
train/total_flos,941152422912000.0
train/train_loss,1.17731


[34m[1mwandb[0m: Agent Starting Run: og3vldga with config:
[34m[1mwandb[0m: 	learning_rate: 0.053478130030891205
[34m[1mwandb[0m: 	lora_alpha: 8
[34m[1mwandb[0m: 	max_grad_norm: 0.1
[34m[1mwandb[0m: 	r: 4
[34m[1mwandb[0m: 	warmup_ratio: 0.1
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 5850.46 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5136.66 examples/s]


Step,Training Loss
10,1.8433
20,1.7024
30,1.6216
40,1.4489
50,1.4001
60,1.2765
70,0.9942
80,1.298
90,1.0258
100,1.1373


output
{'eval_loss': 1.0901776552200317, 'eval_runtime': 10.3786, 'eval_samples_per_second': 4.432, 'eval_steps_per_second': 0.578, 'epoch': 1.0}


Metric,Run history
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,▅███▇▇▆▅▄▄▃▂▂▁▁▁
train/loss,█▇▆▅▄▃▁▄▁▂▂▁▂▁▁▂
train/total_flos,▁
train/train_loss,▁

Metric,Run summary
eval/loss,1.09018
eval/runtime,10.3786
eval/samples_per_second,4.432
eval/steps_per_second,0.578
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.1609
train/total_flos,940853893324800.0
train/train_loss,1.25036
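
The train/learning_rate sparkline for run og3vldga begins below its peak (▅), whereas the runs with warmup_ratio 0.01 and 0.03 begin at the peak (█). A rough sketch of why this might be, assuming the trainer converts warmup_ratio to a step count by rounding up warmup_ratio × total steps, ramps the learning rate linearly during warmup, and logs every 10 steps (inferred from the Step column). The helper names are hypothetical and only the warmup phase is modelled:

```python
import math

TOTAL_STEPS = 160   # train/global_step reported in the run summaries
LOGGING_STEPS = 10  # assumption: inferred from the Step column (10, 20, ...)


def warmup_steps(warmup_ratio, total_steps=TOTAL_STEPS):
    """Number of warmup steps, assuming ceil(ratio * total_steps)."""
    return math.ceil(total_steps * warmup_ratio)


def lr_fraction_at(step, warmup_ratio, total_steps=TOTAL_STEPS):
    """Fraction of the peak learning rate reached at `step`.

    Assumes linear warmup; the decay after the peak is ignored.
    """
    steps = warmup_steps(warmup_ratio, total_steps)
    return min(1.0, step / max(1, steps))


# Compare the three warmup_ratio values that appear in the sweep configs.
for ratio in (0.01, 0.03, 0.1):
    print(ratio, warmup_steps(ratio), lr_fraction_at(LOGGING_STEPS, ratio))
```

Under these assumptions, warmup_ratio 0.1 gives 16 warmup steps, so the first logged point sits at 62.5% of the peak, while ratios 0.01 and 0.03 finish warming up before step 10.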


[34m[1mwandb[0m: Agent Starting Run: sg4jmlfw with config:
[34m[1mwandb[0m: 	learning_rate: 0.04421024821112551
[34m[1mwandb[0m: 	lora_alpha: 16
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	r: 16
[34m[1mwandb[0m: 	warmup_ratio: 0.01
[34m[1mwandb[0m: 	weight_decay: 0.01


Map: 100%|██████████| 160/160 [00:00<00:00, 7366.34 examples/s]
Map: 100%|██████████| 46/46 [00:00<00:00, 5526.88 examples/s]


Step,Training Loss
10,1.7799
20,1.5016
30,1.3756
40,1.369
50,1.3403
60,1.174
70,0.8643
80,1.2132
90,0.9342
100,1.1089


output
{'eval_loss': 1.068800687789917, 'eval_runtime': 10.3891, 'eval_samples_per_second': 4.428, 'eval_steps_per_second': 0.578, 'epoch': 1.0}


Metric,Run history
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
train/epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/global_step,▁▁▂▂▃▃▄▄▅▅▆▆▇▇████
train/learning_rate,███▇▇▆▅▅▄▃▃▂▂▁▁▁
train/loss,█▆▅▅▅▃▁▄▂▃▂▂▂▂▂▃
train/total_flos,▁
train/train_loss,▁

Metric,Run summary
eval/loss,1.0688
eval/runtime,10.3891
eval/samples_per_second,4.428
eval/steps_per_second,0.578
train/epoch,1.0
train/global_step,160.0
train/learning_rate,0.0
train/loss,1.137
train/total_flos,941749482086400.0
train/train_loss,1.17361
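
To compare the last four runs without opening the W&B report, their configs and final eval/loss values can be collected by hand and ranked. A minimal sketch, with values copied from the agent logs and run summaries of those four runs; the variable and function names are hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Values copied from the sweep agent logs and run summaries in this output.
runs = [
    {"id": "310yurrn", "lora_alpha": 16, "r": 8, "max_grad_norm": 0.3,
     "warmup_ratio": 0.01, "weight_decay": 0.001, "eval_loss": 1.06784},
    {"id": "whanj5zm", "lora_alpha": 16, "r": 8, "max_grad_norm": 1,
     "warmup_ratio": 0.03, "weight_decay": 0, "eval_loss": 1.06675},
    {"id": "og3vldga", "lora_alpha": 8, "r": 4, "max_grad_norm": 0.1,
     "warmup_ratio": 0.1, "weight_decay": 0.01, "eval_loss": 1.09018},
    {"id": "sg4jmlfw", "lora_alpha": 16, "r": 16, "max_grad_norm": 1,
     "warmup_ratio": 0.01, "weight_decay": 0.01, "eval_loss": 1.0688},
]


def best_run(runs):
    """Return the run with the lowest eval/loss."""
    return min(runs, key=lambda run: run["eval_loss"])


def mean_loss_by(runs, key):
    """Average eval/loss for each distinct value of one hyperparameter."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run["eval_loss"])
    return {value: mean(losses) for value, losses in groups.items()}


print("best run:", best_run(runs)["id"])
print("mean eval/loss by lora_alpha:", mean_loss_by(runs, "lora_alpha"))
```

With only four runs, and a single one at lora_alpha 8, this is a sanity check rather than evidence; the full 20-run sweep in the W&B report is the better basis for choosing values.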


### This concludes the first part of Strategy 3. Continue with the next part.