In this tutorial I'll demonstrate how to fine-tune a pretrained model to perform complex tasks such as reasoning and text completion.

In [1]:
# Getting started with Lamini
import itertools
import logging
from pprint import pprint

import datasets
import jsonlines
import lamini
import pandas as pd
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

logger = logging.getLogger(__name__)


In [2]:

import os

# Never hard-code a secret in a notebook; this assumes the key has been exported as LAMINI_API_KEY
lamini.api_key = os.environ["LAMINI_API_KEY"]

In [3]:
llm_nf = lamini.Lamini("meta-llama/Llama-2-7b-hf")

In [4]:
# The non-finetuned base model performs very poorly: it just continues the text instead of answering
print(llm_nf.generate("Tell me how to train my dog to sit"))
print("XXXXXXXXXX")
print(llm_nf.generate("What do you think of mars"))
print("XXXXXXXXXX")
print(llm_nf.generate("""Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""))

.
Tell me how to train my dog to sit. I have a 10 month old puppy and I want to train him to sit. I have tried the treat method and he just sits there and looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" c

In [5]:
# Let's compare with a model that someone else has already finetuned (the chat version of Llama 2)
llm_f=lamini.Lamini("meta-llama/Llama-2-7b-chat-hf")

In [6]:
# Far better results than the non-finetuned model
print(llm_f.generate("[INST]Tell me how to train my dog to sit[/INST]"))
print("XXXXXXXXXX")
print(llm_f.generate("What do you think of the planet mars"))
print("XXXXXXXXXX")
print(llm_f.generate("""Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""))

 Training your dog to sit is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit:

1. Choose a quiet and distraction-free area: Find a quiet area with no distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with the action of sitting down.
7. Gradually phase out the treats: As your do
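
The first prompt above is wrapped in [INST] ... [/INST] tags, which is the instruction format the Llama 2 chat models were trained on. As a small sketch, a helper like the one below could wrap any prompt the same way. The function name wrap_llama2_chat and the optional system_prompt argument are my own additions for illustration; they are not part of the lamini library.

```python
def wrap_llama2_chat(user_message, system_prompt=None):
    """Wrap a user message in Llama 2 chat instruction tags (illustrative helper)."""
    user_message = user_message.strip()
    if system_prompt:
        # Optional system block, placed inside the first [INST] segment
        return (
            f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


prompt = wrap_llama2_chat("Tell me how to train my dog to sit")
print(prompt)  # [INST] Tell me how to train my dog to sit [/INST]
```

The wrapped string can then be passed to generate() exactly like the hand-written prompt in the cell above.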

In [7]:
# Let's explore the dataset that we want to finetune our model on.
# Here we use the Alpaca dataset from tatsu-lab.
# Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
# This instruction data can be used for instruction tuning, which makes a language model follow instructions better.
pretrained_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

In [8]:
print(pretrained_dataset.column_names)

['instruction', 'input', 'output', 'text']


Here the text column holds the default prompt template provided by the makers of the dataset, but we will create our own prompt since we like to work more :)

In [9]:
count=0
for _ in pretrained_dataset:
    count+=1
print(count)

52002


In [10]:
n = 3
print("Pretrained dataset:")
top_n = list(itertools.islice(pretrained_dataset, n))
for i in top_n:
  print(i)
  print()
# Each datapoint basically consists of an instruction, an optional input, and a response

Pretrained dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary

Below are the different prompt templates that we can use.

We will build the prompt according to our use case.

In [11]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

In [12]:
prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

In [13]:
processed_data = []
for j in top_n:
  if j["input"]:  # the datapoint has an input, so use the prompt template that includes it
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])
  else:  # no input in the datapoint, so use the template that works on the instruction alone
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})
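
The loop above can also be packaged as a small reusable function, which makes it easy to apply to the whole dataset and not only to the first few rows. This is just a sketch: the names build_prompt and to_training_record are my own, and the templates are copied from the cells above so that the block runs on its own.

```python
PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""


def build_prompt(example):
    """Pick the template based on whether the datapoint has a non-empty input."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])


def to_training_record(example):
    """Turn one Alpaca-style row into the {'input', 'output'} pair used for finetuning."""
    return {"input": build_prompt(example), "output": example["output"]}


sample = {
    "instruction": "What are the three primary colors?",
    "input": "",
    "output": "The three primary colors are red, blue, and yellow.",
}
record = to_training_record(sample)
print(record["input"])
```

With this in place, something like [to_training_record(row) for row in top_n] would give the same list as the loop.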


In [14]:
# Each record shows what the model gets as input and what it is expected to produce as output.
# This is what goes into the model.
pprint(processed_data[1])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'What are the three primary colors?\n'
          '\n'
          '### Response:',
 'output': 'The three primary colors are red, blue, and yellow.'}


Save the data to JSONL

In [15]:
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)
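
JSONL simply means one JSON object per line, so the saved file can be checked without any extra library. Below is a minimal round-trip sketch using only the standard library. The two records and the helper names write_jsonl and read_jsonl are made up for illustration, and the file goes to a temporary directory so nothing in the working folder is touched.

```python
import json
import os
import tempfile

# Hypothetical records in the same {'input', 'output'} shape as processed_data
records = [
    {"input": "### Instruction:\nSay hi\n\n### Response:", "output": "Hi!"},
    {"input": "### Instruction:\nSay bye\n\n### Response:", "output": "Bye!"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(len(loaded), "records read back")
```

Newlines inside the prompt are escaped by json.dumps, so each record still occupies exactly one line of the file.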

In [16]:
# Both datasets, lamini/alpaca and tatsu-lab/alpaca, contain the same 52002 datapoints.
# lamini/alpaca is just the structured (input/output) form of it.
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
pprint(dataset_hf)

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})


Example demonstrating how to run a datapoint through a model at inference time.

In [17]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")



In [18]:
# These are heavy models that take a long time to download and run.
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

Below is an inference function that shows how a particular datapoint is given to a finetuned as well as a non-finetuned model.
This is not for training; it only shows how the model handles an input at inference time.

In [19]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",  # return PyTorch tensors
          truncation=True,
          max_length=max_input_tokens
  )

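  # Generate
  # Note: max_length caps the prompt and the generated tokens together,
  # so a long prompt leaves fewer tokens for the answer
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt (assumes the decoded text begins with the prompt verbatim)
  generated_text_answer = generated_text_with_prompt[0][len(text):]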

  return generated_text_answer
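
Two details of the function above are easy to miss: max_length in generate() counts the prompt tokens as well as the new ones, and the prompt is stripped by character count. The sketch below spells both out in plain Python, without loading a model. The helper names new_token_budget and strip_prompt are my own and are not part of transformers.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for the answer when max_length covers prompt + answer."""
    return max(max_length - prompt_token_count, 0)


def strip_prompt(decoded_text, prompt):
    """Drop the prompt only if the decoded text really starts with it."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):]
    return decoded_text


# With the default max_output_tokens=100, a 12-token prompt leaves 88 tokens
print(new_token_budget(12, 100))   # 88
# A prompt that is already 100 tokens or longer leaves no room at all
print(new_token_budget(150, 100))  # 0

print(strip_prompt("How to sit? Use treats.", "How to sit?"))  # " Use treats."
```

If you want a fixed number of generated tokens regardless of prompt length, generate() also accepts max_new_tokens, which counts only the new tokens.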

In [20]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [21]:
test_sample = finetuning_dataset["test"][0]
print(test_sample)

pprint(inference(test_sample["question"], model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention

In [22]:
pprint(test_sample["question"])

('Can Lamini generate technical documentation or user manuals for software '
 'projects?')


In [23]:
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [24]:
pprint(inference(test_sample["question"], instruction_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


('Yes, Lamini can generate technical documentation or user manuals for '
 'software projects. This can be achieved by providing a prompt for a specific '
 'technical question or question to the LLM Engine, or by providing a prompt '
 'for a specific technical question or question. Additionally, Lamini can be '
 'trained on specific technical questions or questions to help users '
 'understand the process and provide feedback to the LLM Engine. Additionally, '
 'Lamini')


In [25]:
# Use a separate name so the Hugging Face model loaded earlier is not overwritten
llm_pythia = lamini.Lamini("EleutherAI/pythia-70m")

In [26]:
pprint(llm_pythia.generate("How to train my dog?"))

('\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able t

Now let's see how tokenizing works in general

In [27]:
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
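
The three encoded texts have different lengths (6, 3 and 1 tokens), but a model needs a rectangular batch, so in practice the shorter ones are padded and overly long ones truncated. Here is a toy pure-Python sketch of that idea applied to the ids printed above. It only illustrates the concept and is not the tokenizer's real implementation. The function name pad_and_truncate is made up, and pad_id=0 is an assumption based on the eos_token_id of 0 reported in the generation warnings earlier.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None, side="right"):
    """Make every sequence the same length and build the matching attention masks."""
    target = max(len(ids) for ids in batch)
    if max_length is not None:
        target = min(target, max_length)

    padded, masks = [], []
    for ids in batch:
        ids = ids[:target]                    # truncate: keep the first `target` tokens
        pad = [pad_id] * (target - len(ids))  # padding needed for this sequence
        real = [1] * len(ids)                 # 1 marks a real token
        fake = [0] * len(pad)                 # 0 marks padding
        if side == "left":
            padded.append(pad + ids)
            masks.append(fake + real)
        else:
            padded.append(ids + pad)
            masks.append(real + fake)
    return padded, masks


encoded = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]

padded, masks = pad_and_truncate(encoded)
print(padded[1], masks[1])  # [42, 1353, 1175, 0, 0, 0] [1, 1, 1, 0, 0, 0]

short, _ = pad_and_truncate(encoded, max_length=3)
print(short[0])             # [12764, 13, 849]
```

The attention mask is what tells the model to ignore the padded positions, which is also what the warning about a missing attention_mask in the earlier outputs refers to.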


Now let's prepare a dataset of our choice.
We will use the lamini_docs dataset published by Lamini.

In [28]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [29]:
dataset_path = "lamini/lamini_docs"
use_hf = True

In [30]:
model_name = "EleutherAI/pythia-70m"

In [31]:
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length" : 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}

In [32]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = finetuning_dataset['train'],finetuning_dataset['test']

print(train_dataset)
print(test_dataset)

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1260
})
Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 140
})


In [33]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)

In [34]:
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")

In [35]:
base_model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
        

In [36]:
# Base model performance
# You can see that the performance is very poor here
test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): Can Lamini generate technical documentation or user manuals for software projects?
Correct answer from Lamini docs: Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.
Model's answer: 






I have a question about the following:

How do I get the correct documentation to work?

A:

I think you need to use the following code:

A:

You can use the following code to get the correct documentation.

A:

You can use the following code to get the correct documentation.

A:

You can use the following


Sample Training

In [37]:
max_steps=280

In [38]:
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name

In [39]:
training_args = TrainingArguments(

  # Learning rate
  learning_rate=1.0e-5,

  # Number of training epochs
  num_train_epochs=10,

  # Max optimizer steps to train for (each step consumes
  # per_device_train_batch_size * gradient_accumulation_steps examples)
  # Overrides num_train_epochs, if not -1
  max_steps=max_steps,

  # Batch size for training
  per_device_train_batch_size=1,

  # Directory to save model checkpoints
  output_dir=output_dir,

  # Other arguments
  overwrite_output_dir=False, # Do not overwrite the content of the output directory
  disable_tqdm=False, # Keep the progress bars shown during training
  eval_steps=120, # Number of update steps between two evaluations
  save_steps=120, # Save a checkpoint every save_steps steps
  warmup_steps=1, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps = 4,
  gradient_checkpointing=False,

  # Parameters for keeping the best checkpoint (lowest eval loss)
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False
)
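
To get a feel for how much data these settings actually cover, here is a quick back-of-the-envelope calculation. The numbers are taken from the cells above (1260 training rows, batch size 1, 4 accumulation steps, 280 steps); num_devices = 1 is my assumption about the hardware.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 280
num_train_examples = 1260
num_devices = 1  # assumption: a single GPU or CPU

# Examples consumed by one optimizer step
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples seen over the whole run, and what fraction of an epoch that is
examples_seen = examples_per_step * max_steps
epochs_covered = examples_seen / num_train_examples

# Optimizer steps that would make up one full epoch
steps_per_epoch = num_train_examples / examples_per_step

print("examples per step:", examples_per_step)       # 4
print("examples seen:", examples_seen)               # 1120
print("epochs covered:", round(epochs_covered, 2))   # 0.89
print("steps per epoch:", steps_per_epoch)           # 315.0
```

So under these assumptions the run stops a little before one full pass over the training set, and num_train_epochs=10 never comes into play because max_steps takes priority.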



In [40]:
# Report the model's memory footprint and its estimated FLOPs per optimizer step
model_flops = (
  base_model.floating_point_ops(
    {
       "input_ids": torch.zeros(
           (1, training_config["model"]["max_length"])
      )
    }
  )
  * training_args.gradient_accumulation_steps
)

print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
        

In [41]:
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

max_steps is given, it will override any value given in num_train_epochs
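
Since no scheduler is set in TrainingArguments, the Trainer falls back to its default, which is a linear decay with warmup. As a sketch, the function below reproduces that schedule in plain Python using learning_rate=1.0e-5, warmup_steps=1 and max_steps=280 from the cells above. The function name linear_schedule_lr is my own, and the formula is my reading of the linear schedule, so treat it as an approximation and not as the library's code.

```python
def linear_schedule_lr(step, base_lr=1.0e-5, warmup_steps=1, total_steps=280):
    """Linear warmup to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


for step in (1, 2, 32, 280):
    print(step, linear_schedule_lr(step))
```

The values for steps 1, 2 and 32 come out at about 1e-05, 9.964e-06 and 8.889e-06, which lines up with the learning_rate entries in the training log.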


In [42]:
training_output = trainer.train()

  0%|          | 1/280 [00:00<02:55,  1.59it/s]

{'loss': 4.1563, 'grad_norm': 79.16706848144531, 'learning_rate': 1e-05, 'epoch': 0.0}


  1%|          | 2/280 [00:00<02:07,  2.18it/s]

{'loss': 3.0687, 'grad_norm': 60.92185592651367, 'learning_rate': 9.96415770609319e-06, 'epoch': 0.01}


  1%|          | 3/280 [00:01<01:47,  2.57it/s]

{'loss': 3.9108, 'grad_norm': 59.998313903808594, 'learning_rate': 9.928315412186382e-06, 'epoch': 0.01}


  1%|▏         | 4/280 [00:01<01:43,  2.67it/s]

{'loss': 3.4406, 'grad_norm': 57.564247131347656, 'learning_rate': 9.89247311827957e-06, 'epoch': 0.01}


  2%|▏         | 5/280 [00:02<01:43,  2.67it/s]

{'loss': 3.1635, 'grad_norm': 44.991878509521484, 'learning_rate': 9.856630824372761e-06, 'epoch': 0.02}


  2%|▏         | 6/280 [00:02<01:41,  2.69it/s]

{'loss': 3.0866, 'grad_norm': 50.369346618652344, 'learning_rate': 9.820788530465952e-06, 'epoch': 0.02}


  2%|▎         | 7/280 [00:02<01:43,  2.65it/s]

{'loss': 3.803, 'grad_norm': 69.33912658691406, 'learning_rate': 9.78494623655914e-06, 'epoch': 0.02}


  3%|▎         | 8/280 [00:03<01:40,  2.70it/s]

{'loss': 3.1097, 'grad_norm': 62.32722091674805, 'learning_rate': 9.749103942652331e-06, 'epoch': 0.03}


  3%|▎         | 9/280 [00:03<01:40,  2.69it/s]

{'loss': 2.9926, 'grad_norm': 50.06944274902344, 'learning_rate': 9.71326164874552e-06, 'epoch': 0.03}


  4%|▎         | 10/280 [00:03<01:34,  2.85it/s]

{'loss': 3.3864, 'grad_norm': 80.4339370727539, 'learning_rate': 9.67741935483871e-06, 'epoch': 0.03}


  4%|▍         | 11/280 [00:04<01:29,  3.02it/s]

{'loss': 3.1448, 'grad_norm': 55.98486328125, 'learning_rate': 9.641577060931901e-06, 'epoch': 0.03}


  4%|▍         | 12/280 [00:04<01:25,  3.14it/s]

{'loss': 3.2646, 'grad_norm': 93.2275161743164, 'learning_rate': 9.60573476702509e-06, 'epoch': 0.04}


  5%|▍         | 13/280 [00:04<01:22,  3.22it/s]

{'loss': 3.4719, 'grad_norm': 70.26679229736328, 'learning_rate': 9.56989247311828e-06, 'epoch': 0.04}


  5%|▌         | 14/280 [00:04<01:23,  3.20it/s]

{'loss': 2.5702, 'grad_norm': 48.027000427246094, 'learning_rate': 9.53405017921147e-06, 'epoch': 0.04}


  5%|▌         | 15/280 [00:05<01:23,  3.19it/s]

{'loss': 3.7562, 'grad_norm': 78.25580596923828, 'learning_rate': 9.49820788530466e-06, 'epoch': 0.05}


  6%|▌         | 16/280 [00:05<01:25,  3.09it/s]

{'loss': 2.6445, 'grad_norm': 50.02345275878906, 'learning_rate': 9.46236559139785e-06, 'epoch': 0.05}


  6%|▌         | 17/280 [00:05<01:26,  3.05it/s]

{'loss': 2.6863, 'grad_norm': 47.02897262573242, 'learning_rate': 9.42652329749104e-06, 'epoch': 0.05}


  6%|▋         | 18/280 [00:06<01:26,  3.02it/s]

{'loss': 3.0851, 'grad_norm': 60.85171890258789, 'learning_rate': 9.39068100358423e-06, 'epoch': 0.06}


  7%|▋         | 19/280 [00:06<01:30,  2.89it/s]

{'loss': 2.4943, 'grad_norm': 63.885562896728516, 'learning_rate': 9.35483870967742e-06, 'epoch': 0.06}


  7%|▋         | 20/280 [00:07<01:28,  2.92it/s]

{'loss': 3.0372, 'grad_norm': 53.39547348022461, 'learning_rate': 9.31899641577061e-06, 'epoch': 0.06}


  8%|▊         | 21/280 [00:07<01:29,  2.89it/s]

{'loss': 3.1242, 'grad_norm': 43.38551712036133, 'learning_rate': 9.2831541218638e-06, 'epoch': 0.07}


  8%|▊         | 22/280 [00:07<01:31,  2.82it/s]

{'loss': 3.0355, 'grad_norm': 65.25188446044922, 'learning_rate': 9.24731182795699e-06, 'epoch': 0.07}


  8%|▊         | 23/280 [00:08<01:31,  2.81it/s]

{'loss': 2.3843, 'grad_norm': 70.3118667602539, 'learning_rate': 9.21146953405018e-06, 'epoch': 0.07}


  9%|▊         | 24/280 [00:08<01:30,  2.83it/s]

{'loss': 2.5242, 'grad_norm': 56.45045852661133, 'learning_rate': 9.17562724014337e-06, 'epoch': 0.08}


  9%|▉         | 25/280 [00:08<01:30,  2.81it/s]

{'loss': 2.3794, 'grad_norm': 53.13496398925781, 'learning_rate': 9.13978494623656e-06, 'epoch': 0.08}


  9%|▉         | 26/280 [00:09<01:31,  2.79it/s]

{'loss': 2.8254, 'grad_norm': 62.20661544799805, 'learning_rate': 9.10394265232975e-06, 'epoch': 0.08}


 10%|▉         | 27/280 [00:09<01:29,  2.81it/s]

{'loss': 2.8842, 'grad_norm': 57.321659088134766, 'learning_rate': 9.068100358422939e-06, 'epoch': 0.09}


 10%|█         | 28/280 [00:09<01:27,  2.88it/s]

{'loss': 2.4328, 'grad_norm': 50.67791748046875, 'learning_rate': 9.03225806451613e-06, 'epoch': 0.09}


 10%|█         | 29/280 [00:10<01:21,  3.07it/s]

{'loss': 2.7485, 'grad_norm': 54.17271423339844, 'learning_rate': 8.99641577060932e-06, 'epoch': 0.09}


 11%|█         | 30/280 [00:10<01:17,  3.23it/s]

{'loss': 2.2412, 'grad_norm': 58.9797248840332, 'learning_rate': 8.96057347670251e-06, 'epoch': 0.1}


 11%|█         | 31/280 [00:10<01:12,  3.41it/s]

{'loss': 2.8252, 'grad_norm': 60.092716217041016, 'learning_rate': 8.9247311827957e-06, 'epoch': 0.1}


 11%|█▏        | 32/280 [00:10<01:11,  3.46it/s]

{'loss': 2.1575, 'grad_norm': 50.924495697021484, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.1}


 12%|█▏        | 33/280 [00:11<01:09,  3.53it/s]

{'loss': 2.7028, 'grad_norm': 66.09190368652344, 'learning_rate': 8.85304659498208e-06, 'epoch': 0.1}


 12%|█▏        | 34/280 [00:11<01:10,  3.48it/s]

{'loss': 2.5648, 'grad_norm': 45.07282257080078, 'learning_rate': 8.81720430107527e-06, 'epoch': 0.11}


 12%|█▎        | 35/280 [00:11<01:13,  3.33it/s]

{'loss': 2.7539, 'grad_norm': 51.69328689575195, 'learning_rate': 8.78136200716846e-06, 'epoch': 0.11}


 13%|█▎        | 36/280 [00:12<01:15,  3.22it/s]

{'loss': 2.8358, 'grad_norm': 77.70756530761719, 'learning_rate': 8.74551971326165e-06, 'epoch': 0.11}


 13%|█▎        | 37/280 [00:12<01:13,  3.29it/s]

{'loss': 2.202, 'grad_norm': 55.07278060913086, 'learning_rate': 8.70967741935484e-06, 'epoch': 0.12}


 14%|█▎        | 38/280 [00:12<01:14,  3.23it/s]

{'loss': 2.6704, 'grad_norm': 64.79146575927734, 'learning_rate': 8.67383512544803e-06, 'epoch': 0.12}


 14%|█▍        | 39/280 [00:13<01:14,  3.25it/s]

{'loss': 2.7273, 'grad_norm': 70.6023178100586, 'learning_rate': 8.63799283154122e-06, 'epoch': 0.12}


 14%|█▍        | 40/280 [00:13<01:17,  3.10it/s]

{'loss': 2.6415, 'grad_norm': 48.70132827758789, 'learning_rate': 8.602150537634409e-06, 'epoch': 0.13}


 15%|█▍        | 41/280 [00:13<01:17,  3.09it/s]

{'loss': 2.695, 'grad_norm': 58.76997375488281, 'learning_rate': 8.5663082437276e-06, 'epoch': 0.13}


 15%|█▌        | 42/280 [00:14<01:15,  3.13it/s]

{'loss': 2.0261, 'grad_norm': 53.5329704284668, 'learning_rate': 8.530465949820788e-06, 'epoch': 0.13}


 15%|█▌        | 43/280 [00:14<01:13,  3.21it/s]

{'loss': 2.7685, 'grad_norm': 46.17959976196289, 'learning_rate': 8.494623655913979e-06, 'epoch': 0.14}


 16%|█▌        | 44/280 [00:14<01:15,  3.12it/s]

{'loss': 2.9932, 'grad_norm': 52.74967956542969, 'learning_rate': 8.45878136200717e-06, 'epoch': 0.14}


 16%|█▌        | 45/280 [00:15<01:13,  3.18it/s]

{'loss': 2.2599, 'grad_norm': 49.3239631652832, 'learning_rate': 8.422939068100358e-06, 'epoch': 0.14}


 16%|█▋        | 46/280 [00:15<01:12,  3.21it/s]

{'loss': 2.8346, 'grad_norm': 59.776615142822266, 'learning_rate': 8.387096774193549e-06, 'epoch': 0.15}


 17%|█▋        | 47/280 [00:15<01:13,  3.17it/s]

{'loss': 3.3036, 'grad_norm': 69.42244720458984, 'learning_rate': 8.35125448028674e-06, 'epoch': 0.15}


 17%|█▋        | 48/280 [00:15<01:14,  3.13it/s]

{'loss': 1.8617, 'grad_norm': 59.022525787353516, 'learning_rate': 8.315412186379928e-06, 'epoch': 0.15}


 18%|█▊        | 49/280 [00:16<01:12,  3.17it/s]

{'loss': 2.3998, 'grad_norm': 57.51131057739258, 'learning_rate': 8.279569892473119e-06, 'epoch': 0.16}


 18%|█▊        | 50/280 [00:16<01:15,  3.06it/s]

{'loss': 2.3089, 'grad_norm': 60.29814529418945, 'learning_rate': 8.24372759856631e-06, 'epoch': 0.16}


 18%|█▊        | 51/280 [00:16<01:14,  3.06it/s]

{'loss': 1.9305, 'grad_norm': 48.065677642822266, 'learning_rate': 8.207885304659498e-06, 'epoch': 0.16}


 19%|█▊        | 52/280 [00:17<01:16,  2.99it/s]

{'loss': 2.8843, 'grad_norm': 55.87257766723633, 'learning_rate': 8.172043010752689e-06, 'epoch': 0.17}


 19%|█▉        | 53/280 [00:17<01:15,  3.01it/s]

{'loss': 2.9208, 'grad_norm': 87.26406860351562, 'learning_rate': 8.136200716845879e-06, 'epoch': 0.17}


 19%|█▉        | 54/280 [00:17<01:13,  3.07it/s]

{'loss': 2.6601, 'grad_norm': 61.29957962036133, 'learning_rate': 8.100358422939068e-06, 'epoch': 0.17}


 20%|█▉        | 55/280 [00:18<01:13,  3.04it/s]

{'loss': 2.3958, 'grad_norm': 42.90517807006836, 'learning_rate': 8.064516129032258e-06, 'epoch': 0.17}


 20%|██        | 56/280 [00:18<01:11,  3.13it/s]

{'loss': 2.7971, 'grad_norm': 34.214630126953125, 'learning_rate': 8.028673835125449e-06, 'epoch': 0.18}


 20%|██        | 57/280 [00:18<01:08,  3.25it/s]

{'loss': 2.5525, 'grad_norm': 88.30003356933594, 'learning_rate': 7.992831541218638e-06, 'epoch': 0.18}


 21%|██        | 58/280 [00:19<01:10,  3.14it/s]

{'loss': 2.2399, 'grad_norm': 48.25938034057617, 'learning_rate': 7.956989247311828e-06, 'epoch': 0.18}


 21%|██        | 59/280 [00:19<01:11,  3.09it/s]

{'loss': 2.8641, 'grad_norm': 46.642005920410156, 'learning_rate': 7.921146953405019e-06, 'epoch': 0.19}


 21%|██▏       | 60/280 [00:19<01:11,  3.10it/s]

{'loss': 1.9326, 'grad_norm': 86.98172760009766, 'learning_rate': 7.88530465949821e-06, 'epoch': 0.19}


 22%|██▏       | 61/280 [00:20<01:10,  3.12it/s]

{'loss': 2.8604, 'grad_norm': 52.131771087646484, 'learning_rate': 7.849462365591398e-06, 'epoch': 0.19}


 22%|██▏       | 62/280 [00:20<01:11,  3.07it/s]

{'loss': 2.5289, 'grad_norm': 46.63232421875, 'learning_rate': 7.813620071684589e-06, 'epoch': 0.2}


 22%|██▎       | 63/280 [00:20<01:12,  2.98it/s]

{'loss': 2.8655, 'grad_norm': 52.8831787109375, 'learning_rate': 7.77777777777778e-06, 'epoch': 0.2}


 23%|██▎       | 64/280 [00:21<01:12,  2.98it/s]

{'loss': 2.6894, 'grad_norm': 44.82997512817383, 'learning_rate': 7.741935483870968e-06, 'epoch': 0.2}


 23%|██▎       | 65/280 [00:21<01:12,  2.97it/s]

{'loss': 2.4607, 'grad_norm': 44.71010208129883, 'learning_rate': 7.706093189964159e-06, 'epoch': 0.21}


 24%|██▎       | 66/280 [00:21<01:14,  2.88it/s]

{'loss': 2.5175, 'grad_norm': 51.511940002441406, 'learning_rate': 7.670250896057349e-06, 'epoch': 0.21}


 24%|██▍       | 67/280 [00:22<01:13,  2.92it/s]

{'loss': 3.0473, 'grad_norm': 57.32661056518555, 'learning_rate': 7.634408602150538e-06, 'epoch': 0.21}


 24%|██▍       | 68/280 [00:22<01:11,  2.95it/s]

{'loss': 2.0115, 'grad_norm': 43.713436126708984, 'learning_rate': 7.5985663082437275e-06, 'epoch': 0.22}


 25%|██▍       | 69/280 [00:22<01:10,  3.00it/s]

{'loss': 3.209, 'grad_norm': 78.4315185546875, 'learning_rate': 7.562724014336919e-06, 'epoch': 0.22}


 25%|██▌       | 70/280 [00:23<01:10,  2.97it/s]

{'loss': 1.8463, 'grad_norm': 42.31492614746094, 'learning_rate': 7.526881720430108e-06, 'epoch': 0.22}


 25%|██▌       | 71/280 [00:23<01:11,  2.91it/s]

{'loss': 2.5552, 'grad_norm': 40.50326156616211, 'learning_rate': 7.491039426523297e-06, 'epoch': 0.23}


 26%|██▌       | 72/280 [00:23<01:08,  3.04it/s]

{'loss': 2.3545, 'grad_norm': 40.82137680053711, 'learning_rate': 7.455197132616489e-06, 'epoch': 0.23}


 26%|██▌       | 73/280 [00:24<01:06,  3.09it/s]

{'loss': 2.2691, 'grad_norm': 38.0639533996582, 'learning_rate': 7.4193548387096784e-06, 'epoch': 0.23}


 26%|██▋       | 74/280 [00:24<01:07,  3.07it/s]

{'loss': 2.3081, 'grad_norm': 41.17373275756836, 'learning_rate': 7.383512544802868e-06, 'epoch': 0.23}


 27%|██▋       | 75/280 [00:24<01:06,  3.07it/s]

{'loss': 1.6825, 'grad_norm': 42.95146560668945, 'learning_rate': 7.347670250896059e-06, 'epoch': 0.24}


 27%|██▋       | 76/280 [00:25<01:05,  3.11it/s]

{'loss': 2.7381, 'grad_norm': 38.60084915161133, 'learning_rate': 7.311827956989248e-06, 'epoch': 0.24}


 28%|██▊       | 77/280 [00:25<01:05,  3.08it/s]

{'loss': 2.769, 'grad_norm': 54.535804748535156, 'learning_rate': 7.275985663082438e-06, 'epoch': 0.24}


 28%|██▊       | 78/280 [00:25<01:05,  3.11it/s]

{'loss': 3.1047, 'grad_norm': 70.33675384521484, 'learning_rate': 7.240143369175628e-06, 'epoch': 0.25}


 28%|██▊       | 79/280 [00:26<01:04,  3.11it/s]

{'loss': 2.474, 'grad_norm': 43.68595504760742, 'learning_rate': 7.204301075268818e-06, 'epoch': 0.25}


 29%|██▊       | 80/280 [00:26<01:05,  3.05it/s]

{'loss': 2.6777, 'grad_norm': 53.823883056640625, 'learning_rate': 7.168458781362008e-06, 'epoch': 0.25}


 29%|██▉       | 81/280 [00:26<01:06,  3.00it/s]

{'loss': 2.7851, 'grad_norm': 46.631675720214844, 'learning_rate': 7.1326164874551975e-06, 'epoch': 0.26}


 29%|██▉       | 82/280 [00:27<01:06,  2.97it/s]

{'loss': 2.3377, 'grad_norm': 65.29017639160156, 'learning_rate': 7.096774193548388e-06, 'epoch': 0.26}


 30%|██▉       | 83/280 [00:27<01:07,  2.92it/s]

{'loss': 2.1497, 'grad_norm': 46.037269592285156, 'learning_rate': 7.060931899641578e-06, 'epoch': 0.26}


 30%|███       | 84/280 [00:27<01:08,  2.87it/s]

{'loss': 2.606, 'grad_norm': 68.54312133789062, 'learning_rate': 7.025089605734767e-06, 'epoch': 0.27}


 30%|███       | 85/280 [00:28<01:10,  2.75it/s]

{'loss': 2.7831, 'grad_norm': 64.32939910888672, 'learning_rate': 6.989247311827958e-06, 'epoch': 0.27}


 31%|███       | 86/280 [00:28<01:09,  2.78it/s]

{'loss': 2.3703, 'grad_norm': 40.3608283996582, 'learning_rate': 6.9534050179211476e-06, 'epoch': 0.27}


 31%|███       | 87/280 [00:29<01:08,  2.82it/s]

{'loss': 2.4373, 'grad_norm': 44.811588287353516, 'learning_rate': 6.917562724014337e-06, 'epoch': 0.28}


 31%|███▏      | 88/280 [00:29<01:10,  2.74it/s]

{'loss': 3.0501, 'grad_norm': 49.28602600097656, 'learning_rate': 6.881720430107528e-06, 'epoch': 0.28}


 32%|███▏      | 89/280 [00:29<01:08,  2.81it/s]

{'loss': 2.4179, 'grad_norm': 49.0786247253418, 'learning_rate': 6.8458781362007174e-06, 'epoch': 0.28}


 32%|███▏      | 90/280 [00:30<01:07,  2.80it/s]

{'loss': 2.1123, 'grad_norm': 28.390270233154297, 'learning_rate': 6.810035842293907e-06, 'epoch': 0.29}


 32%|███▎      | 91/280 [00:30<01:09,  2.73it/s]

{'loss': 2.1535, 'grad_norm': 38.50748062133789, 'learning_rate': 6.774193548387097e-06, 'epoch': 0.29}


 33%|███▎      | 92/280 [00:30<01:08,  2.75it/s]

{'loss': 2.8485, 'grad_norm': 83.78401184082031, 'learning_rate': 6.738351254480287e-06, 'epoch': 0.29}


 33%|███▎      | 93/280 [00:31<01:06,  2.82it/s]

{'loss': 2.7558, 'grad_norm': 42.903682708740234, 'learning_rate': 6.702508960573477e-06, 'epoch': 0.3}


 34%|███▎      | 94/280 [00:31<01:04,  2.88it/s]

{'loss': 3.0691, 'grad_norm': 57.43985366821289, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.3}


 34%|███▍      | 95/280 [00:31<01:02,  2.96it/s]

{'loss': 2.4755, 'grad_norm': 34.08867645263672, 'learning_rate': 6.630824372759857e-06, 'epoch': 0.3}


 34%|███▍      | 96/280 [00:32<01:01,  3.01it/s]

{'loss': 1.8733, 'grad_norm': 39.51373291015625, 'learning_rate': 6.594982078853047e-06, 'epoch': 0.3}


 35%|███▍      | 97/280 [00:32<01:00,  3.03it/s]

{'loss': 1.9779, 'grad_norm': 42.4037971496582, 'learning_rate': 6.5591397849462365e-06, 'epoch': 0.31}


 35%|███▌      | 98/280 [00:32<00:59,  3.06it/s]

{'loss': 2.2421, 'grad_norm': 48.10545349121094, 'learning_rate': 6.523297491039428e-06, 'epoch': 0.31}


 35%|███▌      | 99/280 [00:33<00:57,  3.12it/s]

{'loss': 1.9472, 'grad_norm': 33.88685607910156, 'learning_rate': 6.4874551971326176e-06, 'epoch': 0.31}


 36%|███▌      | 100/280 [00:33<00:59,  3.03it/s]

{'loss': 2.7302, 'grad_norm': 48.26689147949219, 'learning_rate': 6.451612903225806e-06, 'epoch': 0.32}


 36%|███▌      | 101/280 [00:33<00:59,  2.99it/s]

{'loss': 2.1646, 'grad_norm': 47.24478530883789, 'learning_rate': 6.415770609318996e-06, 'epoch': 0.32}


 36%|███▋      | 102/280 [00:34<01:00,  2.96it/s]

{'loss': 2.8487, 'grad_norm': 38.213722229003906, 'learning_rate': 6.379928315412187e-06, 'epoch': 0.32}


 37%|███▋      | 103/280 [00:34<00:59,  2.99it/s]

{'loss': 2.6776, 'grad_norm': 44.495872497558594, 'learning_rate': 6.344086021505377e-06, 'epoch': 0.33}


 37%|███▋      | 104/280 [00:34<01:00,  2.92it/s]

{'loss': 2.7315, 'grad_norm': 47.69278335571289, 'learning_rate': 6.308243727598567e-06, 'epoch': 0.33}


 38%|███▊      | 105/280 [00:35<00:57,  3.03it/s]

{'loss': 2.5531, 'grad_norm': 51.864601135253906, 'learning_rate': 6.272401433691757e-06, 'epoch': 0.33}


 38%|███▊      | 106/280 [00:35<00:56,  3.06it/s]

{'loss': 2.5694, 'grad_norm': 38.63401794433594, 'learning_rate': 6.236559139784947e-06, 'epoch': 0.34}


 38%|███▊      | 107/280 [00:35<00:54,  3.18it/s]

{'loss': 2.6194, 'grad_norm': 47.22077560424805, 'learning_rate': 6.200716845878137e-06, 'epoch': 0.34}


 39%|███▊      | 108/280 [00:36<00:53,  3.21it/s]

{'loss': 2.686, 'grad_norm': 52.17859649658203, 'learning_rate': 6.164874551971327e-06, 'epoch': 0.34}


 39%|███▉      | 109/280 [00:36<00:52,  3.25it/s]

{'loss': 2.2054, 'grad_norm': 59.8789176940918, 'learning_rate': 6.129032258064517e-06, 'epoch': 0.35}


 39%|███▉      | 110/280 [00:36<00:52,  3.25it/s]

{'loss': 1.8908, 'grad_norm': 44.00631332397461, 'learning_rate': 6.0931899641577065e-06, 'epoch': 0.35}


 40%|███▉      | 111/280 [00:36<00:52,  3.21it/s]

{'loss': 2.0803, 'grad_norm': 46.584049224853516, 'learning_rate': 6.057347670250897e-06, 'epoch': 0.35}


 40%|████      | 112/280 [00:37<00:54,  3.08it/s]

{'loss': 2.3057, 'grad_norm': 46.43478012084961, 'learning_rate': 6.021505376344087e-06, 'epoch': 0.36}


 40%|████      | 113/280 [00:37<00:55,  3.03it/s]

{'loss': 2.7432, 'grad_norm': 48.303829193115234, 'learning_rate': 5.985663082437276e-06, 'epoch': 0.36}


 41%|████      | 114/280 [00:37<00:55,  3.00it/s]

{'loss': 2.8047, 'grad_norm': 55.349327087402344, 'learning_rate': 5.949820788530466e-06, 'epoch': 0.36}


 41%|████      | 115/280 [00:38<00:54,  3.03it/s]

{'loss': 2.5974, 'grad_norm': 41.74198913574219, 'learning_rate': 5.9139784946236566e-06, 'epoch': 0.37}


 41%|████▏     | 116/280 [00:38<00:52,  3.11it/s]

{'loss': 2.7387, 'grad_norm': 67.17871856689453, 'learning_rate': 5.878136200716846e-06, 'epoch': 0.37}


 42%|████▏     | 117/280 [00:38<00:52,  3.13it/s]

{'loss': 2.4697, 'grad_norm': 62.78737258911133, 'learning_rate': 5.842293906810036e-06, 'epoch': 0.37}


 42%|████▏     | 118/280 [00:39<00:53,  3.04it/s]

{'loss': 2.4454, 'grad_norm': 42.05119705200195, 'learning_rate': 5.806451612903226e-06, 'epoch': 0.37}


 42%|████▎     | 119/280 [00:39<00:52,  3.08it/s]

{'loss': 2.0442, 'grad_norm': 59.41307067871094, 'learning_rate': 5.770609318996416e-06, 'epoch': 0.38}


 43%|████▎     | 120/280 [00:39<00:51,  3.12it/s]

{'loss': 2.2836, 'grad_norm': 67.07157135009766, 'learning_rate': 5.734767025089606e-06, 'epoch': 0.38}


                                                 
 43%|████▎     | 120/280 [00:43<00:51,  3.12it/s]

{'eval_loss': 2.381779193878174, 'eval_runtime': 3.7143, 'eval_samples_per_second': 37.692, 'eval_steps_per_second': 37.692, 'epoch': 0.38}


 43%|████▎     | 121/280 [00:45<05:05,  1.92s/it]

{'loss': 2.8795, 'grad_norm': 69.15773010253906, 'learning_rate': 5.698924731182796e-06, 'epoch': 0.38}


 44%|████▎     | 122/280 [00:45<03:48,  1.45s/it]

{'loss': 2.0266, 'grad_norm': 35.895713806152344, 'learning_rate': 5.663082437275986e-06, 'epoch': 0.39}


 44%|████▍     | 123/280 [00:46<02:56,  1.12s/it]

{'loss': 2.2436, 'grad_norm': 61.892818450927734, 'learning_rate': 5.627240143369176e-06, 'epoch': 0.39}


 44%|████▍     | 124/280 [00:46<02:18,  1.13it/s]

{'loss': 2.3903, 'grad_norm': 43.237850189208984, 'learning_rate': 5.591397849462365e-06, 'epoch': 0.39}


 45%|████▍     | 125/280 [00:46<01:51,  1.39it/s]

{'loss': 2.4754, 'grad_norm': 35.32416915893555, 'learning_rate': 5.555555555555557e-06, 'epoch': 0.4}


 45%|████▌     | 126/280 [00:47<01:35,  1.62it/s]

{'loss': 2.7093, 'grad_norm': 53.227325439453125, 'learning_rate': 5.5197132616487455e-06, 'epoch': 0.4}


 45%|████▌     | 127/280 [00:47<01:24,  1.80it/s]

{'loss': 2.3782, 'grad_norm': 43.8986701965332, 'learning_rate': 5.483870967741935e-06, 'epoch': 0.4}


 46%|████▌     | 128/280 [00:48<01:14,  2.03it/s]

{'loss': 2.3078, 'grad_norm': 46.967506408691406, 'learning_rate': 5.4480286738351265e-06, 'epoch': 0.41}


 46%|████▌     | 129/280 [00:48<01:06,  2.27it/s]

{'loss': 2.2593, 'grad_norm': 47.54718017578125, 'learning_rate': 5.412186379928316e-06, 'epoch': 0.41}


 46%|████▋     | 130/280 [00:48<00:59,  2.52it/s]

{'loss': 2.4342, 'grad_norm': 42.476985931396484, 'learning_rate': 5.376344086021506e-06, 'epoch': 0.41}


 47%|████▋     | 131/280 [00:48<00:53,  2.76it/s]

{'loss': 2.6875, 'grad_norm': 46.62965393066406, 'learning_rate': 5.340501792114696e-06, 'epoch': 0.42}


 47%|████▋     | 132/280 [00:49<00:50,  2.91it/s]

{'loss': 1.8794, 'grad_norm': 73.2508544921875, 'learning_rate': 5.304659498207886e-06, 'epoch': 0.42}


 48%|████▊     | 133/280 [00:49<00:48,  3.04it/s]

{'loss': 2.3802, 'grad_norm': 59.36281967163086, 'learning_rate': 5.268817204301076e-06, 'epoch': 0.42}


 48%|████▊     | 134/280 [00:49<00:47,  3.07it/s]

{'loss': 2.4164, 'grad_norm': 54.733184814453125, 'learning_rate': 5.232974910394266e-06, 'epoch': 0.43}


 48%|████▊     | 135/280 [00:50<00:47,  3.07it/s]

{'loss': 2.2734, 'grad_norm': 57.566246032714844, 'learning_rate': 5.197132616487456e-06, 'epoch': 0.43}


 49%|████▊     | 136/280 [00:50<00:48,  2.99it/s]

{'loss': 2.6967, 'grad_norm': 62.55620574951172, 'learning_rate': 5.161290322580646e-06, 'epoch': 0.43}


 49%|████▉     | 137/280 [00:50<00:48,  2.93it/s]

{'loss': 2.8253, 'grad_norm': 45.10960388183594, 'learning_rate': 5.125448028673835e-06, 'epoch': 0.43}


 49%|████▉     | 138/280 [00:51<00:48,  2.91it/s]

{'loss': 2.2688, 'grad_norm': 36.75051498413086, 'learning_rate': 5.089605734767026e-06, 'epoch': 0.44}


 50%|████▉     | 139/280 [00:51<00:46,  3.02it/s]

{'loss': 2.339, 'grad_norm': 50.718692779541016, 'learning_rate': 5.0537634408602155e-06, 'epoch': 0.44}


 50%|█████     | 140/280 [00:51<00:45,  3.05it/s]

{'loss': 2.0695, 'grad_norm': 39.836429595947266, 'learning_rate': 5.017921146953405e-06, 'epoch': 0.44}


 50%|█████     | 141/280 [00:52<00:44,  3.13it/s]

{'loss': 2.3926, 'grad_norm': 51.04305648803711, 'learning_rate': 4.982078853046595e-06, 'epoch': 0.45}


 51%|█████     | 142/280 [00:52<00:43,  3.14it/s]

{'loss': 2.6753, 'grad_norm': 42.52840805053711, 'learning_rate': 4.946236559139785e-06, 'epoch': 0.45}


 51%|█████     | 143/280 [00:52<00:43,  3.12it/s]

{'loss': 2.3709, 'grad_norm': 49.315364837646484, 'learning_rate': 4.910394265232976e-06, 'epoch': 0.45}


 51%|█████▏    | 144/280 [00:53<00:44,  3.06it/s]

{'loss': 2.3127, 'grad_norm': 50.379364013671875, 'learning_rate': 4.8745519713261655e-06, 'epoch': 0.46}


 52%|█████▏    | 145/280 [00:53<00:45,  2.94it/s]

{'loss': 3.676, 'grad_norm': 69.10527801513672, 'learning_rate': 4.838709677419355e-06, 'epoch': 0.46}


 52%|█████▏    | 146/280 [00:53<00:43,  3.06it/s]

{'loss': 2.0787, 'grad_norm': 43.066612243652344, 'learning_rate': 4.802867383512545e-06, 'epoch': 0.46}


 52%|█████▎    | 147/280 [00:54<00:44,  3.02it/s]

{'loss': 3.1988, 'grad_norm': 57.140953063964844, 'learning_rate': 4.767025089605735e-06, 'epoch': 0.47}


 53%|█████▎    | 148/280 [00:54<00:42,  3.08it/s]

{'loss': 2.8905, 'grad_norm': 50.41666793823242, 'learning_rate': 4.731182795698925e-06, 'epoch': 0.47}


 53%|█████▎    | 149/280 [00:54<00:43,  3.03it/s]

{'loss': 2.9294, 'grad_norm': 53.05195617675781, 'learning_rate': 4.695340501792115e-06, 'epoch': 0.47}


 54%|█████▎    | 150/280 [00:55<00:42,  3.07it/s]

{'loss': 2.4144, 'grad_norm': 38.830501556396484, 'learning_rate': 4.659498207885305e-06, 'epoch': 0.48}


 54%|█████▍    | 151/280 [00:55<00:42,  3.04it/s]

{'loss': 3.0374, 'grad_norm': 41.79633712768555, 'learning_rate': 4.623655913978495e-06, 'epoch': 0.48}


 54%|█████▍    | 152/280 [00:55<00:43,  2.95it/s]

{'loss': 2.7584, 'grad_norm': 61.47060775756836, 'learning_rate': 4.587813620071685e-06, 'epoch': 0.48}


 55%|█████▍    | 153/280 [00:56<00:42,  3.01it/s]

{'loss': 2.4945, 'grad_norm': 40.700801849365234, 'learning_rate': 4.551971326164875e-06, 'epoch': 0.49}


 55%|█████▌    | 154/280 [00:56<00:41,  3.00it/s]

{'loss': 1.9106, 'grad_norm': 53.49081039428711, 'learning_rate': 4.516129032258065e-06, 'epoch': 0.49}


 55%|█████▌    | 155/280 [00:56<00:40,  3.11it/s]

{'loss': 2.23, 'grad_norm': 53.6011962890625, 'learning_rate': 4.480286738351255e-06, 'epoch': 0.49}


 56%|█████▌    | 156/280 [00:57<00:39,  3.12it/s]

{'loss': 2.3645, 'grad_norm': 50.03498077392578, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.5}


 56%|█████▌    | 157/280 [00:57<00:38,  3.21it/s]

{'loss': 2.8298, 'grad_norm': 57.75605010986328, 'learning_rate': 4.408602150537635e-06, 'epoch': 0.5}


 56%|█████▋    | 158/280 [00:57<00:37,  3.24it/s]

{'loss': 1.7838, 'grad_norm': 34.789730072021484, 'learning_rate': 4.372759856630825e-06, 'epoch': 0.5}


 57%|█████▋    | 159/280 [00:58<00:36,  3.28it/s]

{'loss': 2.4255, 'grad_norm': 48.079830169677734, 'learning_rate': 4.336917562724015e-06, 'epoch': 0.5}


 57%|█████▋    | 160/280 [00:58<00:37,  3.20it/s]

{'loss': 2.5278, 'grad_norm': 58.525367736816406, 'learning_rate': 4.3010752688172045e-06, 'epoch': 0.51}


 57%|█████▊    | 161/280 [00:58<00:37,  3.16it/s]

{'loss': 2.3458, 'grad_norm': 54.09984588623047, 'learning_rate': 4.265232974910394e-06, 'epoch': 0.51}


 58%|█████▊    | 162/280 [00:58<00:37,  3.13it/s]

{'loss': 2.582, 'grad_norm': 45.2379264831543, 'learning_rate': 4.229390681003585e-06, 'epoch': 0.51}


 58%|█████▊    | 163/280 [00:59<00:38,  3.07it/s]

{'loss': 2.2924, 'grad_norm': 60.969017028808594, 'learning_rate': 4.193548387096774e-06, 'epoch': 0.52}


 59%|█████▊    | 164/280 [00:59<00:38,  3.03it/s]

{'loss': 1.7479, 'grad_norm': 41.6411247253418, 'learning_rate': 4.157706093189964e-06, 'epoch': 0.52}


 59%|█████▉    | 165/280 [01:00<00:38,  2.97it/s]

{'loss': 2.2459, 'grad_norm': 49.21501541137695, 'learning_rate': 4.121863799283155e-06, 'epoch': 0.52}


 59%|█████▉    | 166/280 [01:00<00:37,  3.02it/s]

{'loss': 2.3046, 'grad_norm': 47.201194763183594, 'learning_rate': 4.086021505376344e-06, 'epoch': 0.53}


 60%|█████▉    | 167/280 [01:00<00:37,  3.04it/s]

{'loss': 2.3471, 'grad_norm': 60.16043472290039, 'learning_rate': 4.050179211469534e-06, 'epoch': 0.53}


 60%|██████    | 168/280 [01:00<00:36,  3.04it/s]

{'loss': 2.4254, 'grad_norm': 43.36650085449219, 'learning_rate': 4.0143369175627245e-06, 'epoch': 0.53}


 60%|██████    | 169/280 [01:01<00:37,  2.99it/s]

{'loss': 2.5777, 'grad_norm': 59.42933654785156, 'learning_rate': 3.978494623655914e-06, 'epoch': 0.54}


 61%|██████    | 170/280 [01:01<00:36,  3.01it/s]

{'loss': 2.6194, 'grad_norm': 68.2474594116211, 'learning_rate': 3.942652329749105e-06, 'epoch': 0.54}


 61%|██████    | 171/280 [01:02<00:36,  2.97it/s]

{'loss': 2.7126, 'grad_norm': 42.73869323730469, 'learning_rate': 3.906810035842294e-06, 'epoch': 0.54}


 61%|██████▏   | 172/280 [01:02<00:36,  3.00it/s]

{'loss': 2.4943, 'grad_norm': 45.19261932373047, 'learning_rate': 3.870967741935484e-06, 'epoch': 0.55}


 62%|██████▏   | 173/280 [01:02<00:35,  3.03it/s]

{'loss': 2.4733, 'grad_norm': 58.70981216430664, 'learning_rate': 3.8351254480286745e-06, 'epoch': 0.55}


 62%|██████▏   | 174/280 [01:02<00:34,  3.09it/s]

{'loss': 3.6553, 'grad_norm': 57.852333068847656, 'learning_rate': 3.7992831541218638e-06, 'epoch': 0.55}


 62%|██████▎   | 175/280 [01:03<00:33,  3.09it/s]

{'loss': 2.5893, 'grad_norm': 58.01980209350586, 'learning_rate': 3.763440860215054e-06, 'epoch': 0.56}


 63%|██████▎   | 176/280 [01:03<00:33,  3.11it/s]

{'loss': 1.9556, 'grad_norm': 79.91169738769531, 'learning_rate': 3.7275985663082444e-06, 'epoch': 0.56}


 63%|██████▎   | 177/280 [01:03<00:33,  3.11it/s]

{'loss': 2.3397, 'grad_norm': 46.50320816040039, 'learning_rate': 3.691756272401434e-06, 'epoch': 0.56}


 64%|██████▎   | 178/280 [01:04<00:32,  3.11it/s]

{'loss': 2.8785, 'grad_norm': 61.39120101928711, 'learning_rate': 3.655913978494624e-06, 'epoch': 0.57}


 64%|██████▍   | 179/280 [01:04<00:32,  3.08it/s]

{'loss': 2.5079, 'grad_norm': 45.98429870605469, 'learning_rate': 3.620071684587814e-06, 'epoch': 0.57}


 64%|██████▍   | 180/280 [01:04<00:32,  3.06it/s]

{'loss': 2.24, 'grad_norm': 50.00416564941406, 'learning_rate': 3.584229390681004e-06, 'epoch': 0.57}


 65%|██████▍   | 181/280 [01:05<00:32,  3.09it/s]

{'loss': 2.8553, 'grad_norm': 72.25203704833984, 'learning_rate': 3.548387096774194e-06, 'epoch': 0.57}


 65%|██████▌   | 182/280 [01:05<00:31,  3.10it/s]

{'loss': 2.8819, 'grad_norm': 59.907997131347656, 'learning_rate': 3.5125448028673837e-06, 'epoch': 0.58}


 65%|██████▌   | 183/280 [01:05<00:30,  3.15it/s]

{'loss': 2.486, 'grad_norm': 44.553123474121094, 'learning_rate': 3.4767025089605738e-06, 'epoch': 0.58}


 66%|██████▌   | 184/280 [01:06<00:30,  3.13it/s]

{'loss': 1.4149, 'grad_norm': 45.36852264404297, 'learning_rate': 3.440860215053764e-06, 'epoch': 0.58}


 66%|██████▌   | 185/280 [01:06<00:30,  3.13it/s]

{'loss': 1.9559, 'grad_norm': 46.968318939208984, 'learning_rate': 3.4050179211469536e-06, 'epoch': 0.59}


 66%|██████▋   | 186/280 [01:06<00:30,  3.11it/s]

{'loss': 1.8722, 'grad_norm': 47.167999267578125, 'learning_rate': 3.3691756272401437e-06, 'epoch': 0.59}


 67%|██████▋   | 187/280 [01:07<00:28,  3.25it/s]

{'loss': 2.5548, 'grad_norm': 81.17975616455078, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.59}


 67%|██████▋   | 188/280 [01:07<00:28,  3.27it/s]

{'loss': 1.8084, 'grad_norm': 55.72304153442383, 'learning_rate': 3.2974910394265234e-06, 'epoch': 0.6}


 68%|██████▊   | 189/280 [01:07<00:27,  3.32it/s]

{'loss': 1.8941, 'grad_norm': 59.2237434387207, 'learning_rate': 3.261648745519714e-06, 'epoch': 0.6}


 68%|██████▊   | 190/280 [01:07<00:27,  3.31it/s]

{'loss': 2.7262, 'grad_norm': 43.93122863769531, 'learning_rate': 3.225806451612903e-06, 'epoch': 0.6}


 68%|██████▊   | 191/280 [01:08<00:27,  3.29it/s]

{'loss': 2.5741, 'grad_norm': 53.651466369628906, 'learning_rate': 3.1899641577060937e-06, 'epoch': 0.61}


 69%|██████▊   | 192/280 [01:08<00:27,  3.18it/s]

{'loss': 2.6499, 'grad_norm': 45.03765869140625, 'learning_rate': 3.1541218637992834e-06, 'epoch': 0.61}


 69%|██████▉   | 193/280 [01:08<00:28,  3.10it/s]

{'loss': 2.0263, 'grad_norm': 36.477943420410156, 'learning_rate': 3.1182795698924735e-06, 'epoch': 0.61}


 69%|██████▉   | 194/280 [01:09<00:28,  3.02it/s]

{'loss': 2.3516, 'grad_norm': 48.480934143066406, 'learning_rate': 3.0824372759856636e-06, 'epoch': 0.62}


 70%|██████▉   | 195/280 [01:09<00:27,  3.12it/s]

{'loss': 1.9906, 'grad_norm': 53.589500427246094, 'learning_rate': 3.0465949820788532e-06, 'epoch': 0.62}


 70%|███████   | 196/280 [01:09<00:27,  3.08it/s]

{'loss': 1.7962, 'grad_norm': 41.68930435180664, 'learning_rate': 3.0107526881720433e-06, 'epoch': 0.62}


 70%|███████   | 197/280 [01:10<00:26,  3.18it/s]

{'loss': 1.8045, 'grad_norm': 46.417755126953125, 'learning_rate': 2.974910394265233e-06, 'epoch': 0.63}


 71%|███████   | 198/280 [01:10<00:25,  3.27it/s]

{'loss': 1.6644, 'grad_norm': 36.68144226074219, 'learning_rate': 2.939068100358423e-06, 'epoch': 0.63}


 71%|███████   | 199/280 [01:10<00:24,  3.27it/s]

{'loss': 2.814, 'grad_norm': 64.59210205078125, 'learning_rate': 2.903225806451613e-06, 'epoch': 0.63}


 71%|███████▏  | 200/280 [01:11<00:25,  3.20it/s]

{'loss': 2.2383, 'grad_norm': 41.347469329833984, 'learning_rate': 2.867383512544803e-06, 'epoch': 0.63}


 72%|███████▏  | 201/280 [01:11<00:25,  3.15it/s]

{'loss': 1.8357, 'grad_norm': 42.685768127441406, 'learning_rate': 2.831541218637993e-06, 'epoch': 0.64}


 72%|███████▏  | 202/280 [01:11<00:24,  3.13it/s]

{'loss': 2.0958, 'grad_norm': 60.27224349975586, 'learning_rate': 2.7956989247311827e-06, 'epoch': 0.64}


 72%|███████▎  | 203/280 [01:12<00:24,  3.20it/s]

{'loss': 3.679, 'grad_norm': 112.6159896850586, 'learning_rate': 2.7598566308243727e-06, 'epoch': 0.64}


 73%|███████▎  | 204/280 [01:12<00:23,  3.18it/s]

{'loss': 2.597, 'grad_norm': 63.62207794189453, 'learning_rate': 2.7240143369175633e-06, 'epoch': 0.65}


 73%|███████▎  | 205/280 [01:12<00:24,  3.11it/s]

{'loss': 2.447, 'grad_norm': 66.5564956665039, 'learning_rate': 2.688172043010753e-06, 'epoch': 0.65}


 74%|███████▎  | 206/280 [01:13<00:23,  3.19it/s]

{'loss': 2.6609, 'grad_norm': 51.07820129394531, 'learning_rate': 2.652329749103943e-06, 'epoch': 0.65}


 74%|███████▍  | 207/280 [01:13<00:22,  3.25it/s]

{'loss': 2.2676, 'grad_norm': 70.5952377319336, 'learning_rate': 2.616487455197133e-06, 'epoch': 0.66}


 74%|███████▍  | 208/280 [01:13<00:22,  3.19it/s]

{'loss': 2.6548, 'grad_norm': 83.6404037475586, 'learning_rate': 2.580645161290323e-06, 'epoch': 0.66}


 75%|███████▍  | 209/280 [01:14<00:22,  3.18it/s]

{'loss': 2.6176, 'grad_norm': 49.26305389404297, 'learning_rate': 2.544802867383513e-06, 'epoch': 0.66}


 75%|███████▌  | 210/280 [01:14<00:22,  3.12it/s]

{'loss': 2.9021, 'grad_norm': 38.3629035949707, 'learning_rate': 2.5089605734767026e-06, 'epoch': 0.67}


 75%|███████▌  | 211/280 [01:14<00:21,  3.14it/s]

{'loss': 2.2675, 'grad_norm': 37.3011474609375, 'learning_rate': 2.4731182795698927e-06, 'epoch': 0.67}


 76%|███████▌  | 212/280 [01:14<00:21,  3.18it/s]

{'loss': 2.3535, 'grad_norm': 55.80110168457031, 'learning_rate': 2.4372759856630828e-06, 'epoch': 0.67}


 76%|███████▌  | 213/280 [01:15<00:21,  3.12it/s]

{'loss': 2.3225, 'grad_norm': 53.42585372924805, 'learning_rate': 2.4014336917562724e-06, 'epoch': 0.68}


 76%|███████▋  | 214/280 [01:15<00:21,  3.14it/s]

{'loss': 2.7809, 'grad_norm': 50.44679260253906, 'learning_rate': 2.3655913978494625e-06, 'epoch': 0.68}


 77%|███████▋  | 215/280 [01:15<00:20,  3.19it/s]

{'loss': 1.9786, 'grad_norm': 45.8377799987793, 'learning_rate': 2.3297491039426526e-06, 'epoch': 0.68}


 77%|███████▋  | 216/280 [01:16<00:19,  3.20it/s]

{'loss': 2.8039, 'grad_norm': 43.38481140136719, 'learning_rate': 2.2939068100358423e-06, 'epoch': 0.69}


 78%|███████▊  | 217/280 [01:16<00:19,  3.29it/s]

{'loss': 2.1331, 'grad_norm': 84.25613403320312, 'learning_rate': 2.2580645161290324e-06, 'epoch': 0.69}


 78%|███████▊  | 218/280 [01:16<00:19,  3.25it/s]

{'loss': 2.436, 'grad_norm': 51.60485076904297, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.69}


 78%|███████▊  | 219/280 [01:17<00:18,  3.30it/s]

{'loss': 2.4543, 'grad_norm': 46.02988815307617, 'learning_rate': 2.1863799283154126e-06, 'epoch': 0.7}


 79%|███████▊  | 220/280 [01:17<00:18,  3.25it/s]

{'loss': 2.1924, 'grad_norm': 44.17865753173828, 'learning_rate': 2.1505376344086023e-06, 'epoch': 0.7}


 79%|███████▉  | 221/280 [01:17<00:18,  3.21it/s]

{'loss': 2.3905, 'grad_norm': 54.41860580444336, 'learning_rate': 2.1146953405017924e-06, 'epoch': 0.7}


 79%|███████▉  | 222/280 [01:18<00:18,  3.21it/s]

{'loss': 2.2803, 'grad_norm': 46.5533447265625, 'learning_rate': 2.078853046594982e-06, 'epoch': 0.7}


 80%|███████▉  | 223/280 [01:18<00:17,  3.23it/s]

{'loss': 2.0177, 'grad_norm': 63.70522689819336, 'learning_rate': 2.043010752688172e-06, 'epoch': 0.71}


 80%|████████  | 224/280 [01:18<00:17,  3.21it/s]

{'loss': 2.4716, 'grad_norm': 53.7286376953125, 'learning_rate': 2.0071684587813622e-06, 'epoch': 0.71}


 80%|████████  | 225/280 [01:19<00:17,  3.22it/s]

{'loss': 2.4044, 'grad_norm': 42.893646240234375, 'learning_rate': 1.9713261648745523e-06, 'epoch': 0.71}


 81%|████████  | 226/280 [01:19<00:17,  3.17it/s]

{'loss': 2.1642, 'grad_norm': 41.651641845703125, 'learning_rate': 1.935483870967742e-06, 'epoch': 0.72}


 81%|████████  | 227/280 [01:19<00:17,  3.08it/s]

{'loss': 2.3347, 'grad_norm': 39.657203674316406, 'learning_rate': 1.8996415770609319e-06, 'epoch': 0.72}


 81%|████████▏ | 228/280 [01:19<00:16,  3.11it/s]

{'loss': 2.2556, 'grad_norm': 37.89677810668945, 'learning_rate': 1.8637992831541222e-06, 'epoch': 0.72}


 82%|████████▏ | 229/280 [01:20<00:16,  3.12it/s]

{'loss': 2.601, 'grad_norm': 42.07746887207031, 'learning_rate': 1.827956989247312e-06, 'epoch': 0.73}


 82%|████████▏ | 230/280 [01:20<00:16,  3.12it/s]

{'loss': 2.0065, 'grad_norm': 52.00371551513672, 'learning_rate': 1.792114695340502e-06, 'epoch': 0.73}


 82%|████████▎ | 231/280 [01:20<00:15,  3.11it/s]

{'loss': 2.6441, 'grad_norm': 49.65121078491211, 'learning_rate': 1.7562724014336918e-06, 'epoch': 0.73}


 83%|████████▎ | 232/280 [01:21<00:15,  3.09it/s]

{'loss': 2.1684, 'grad_norm': 41.34956359863281, 'learning_rate': 1.720430107526882e-06, 'epoch': 0.74}


 83%|████████▎ | 233/280 [01:21<00:15,  3.08it/s]

{'loss': 2.1066, 'grad_norm': 37.672977447509766, 'learning_rate': 1.6845878136200718e-06, 'epoch': 0.74}


 84%|████████▎ | 234/280 [01:21<00:14,  3.07it/s]

{'loss': 2.6999, 'grad_norm': 46.46760940551758, 'learning_rate': 1.6487455197132617e-06, 'epoch': 0.74}


 84%|████████▍ | 235/280 [01:22<00:14,  3.04it/s]

{'loss': 2.2637, 'grad_norm': 37.44941711425781, 'learning_rate': 1.6129032258064516e-06, 'epoch': 0.75}


 84%|████████▍ | 236/280 [01:22<00:14,  3.05it/s]

{'loss': 2.6691, 'grad_norm': 64.17888641357422, 'learning_rate': 1.5770609318996417e-06, 'epoch': 0.75}


 85%|████████▍ | 237/280 [01:22<00:14,  2.97it/s]

{'loss': 2.281, 'grad_norm': 47.668678283691406, 'learning_rate': 1.5412186379928318e-06, 'epoch': 0.75}


 85%|████████▌ | 238/280 [01:23<00:14,  2.98it/s]

{'loss': 2.7994, 'grad_norm': 47.31612777709961, 'learning_rate': 1.5053763440860217e-06, 'epoch': 0.76}


 85%|████████▌ | 239/280 [01:23<00:13,  3.04it/s]

{'loss': 2.4251, 'grad_norm': 40.55906677246094, 'learning_rate': 1.4695340501792116e-06, 'epoch': 0.76}


 86%|████████▌ | 240/280 [01:23<00:13,  3.08it/s]

{'loss': 1.7885, 'grad_norm': 53.36037826538086, 'learning_rate': 1.4336917562724014e-06, 'epoch': 0.76}


                                                 
 86%|████████▌ | 240/280 [01:27<00:13,  3.08it/s]

{'eval_loss': 2.2773685455322266, 'eval_runtime': 3.6408, 'eval_samples_per_second': 38.454, 'eval_steps_per_second': 38.454, 'epoch': 0.76}


 86%|████████▌ | 241/280 [01:28<01:07,  1.72s/it]

{'loss': 1.9417, 'grad_norm': 38.59204864501953, 'learning_rate': 1.3978494623655913e-06, 'epoch': 0.77}


 86%|████████▋ | 242/280 [01:29<00:49,  1.30s/it]

{'loss': 2.1067, 'grad_norm': 50.71937942504883, 'learning_rate': 1.3620071684587816e-06, 'epoch': 0.77}


 87%|████████▋ | 243/280 [01:29<00:37,  1.02s/it]

{'loss': 2.4463, 'grad_norm': 37.7933235168457, 'learning_rate': 1.3261648745519715e-06, 'epoch': 0.77}


 87%|████████▋ | 244/280 [01:29<00:29,  1.23it/s]

{'loss': 2.0169, 'grad_norm': 51.43637466430664, 'learning_rate': 1.2903225806451614e-06, 'epoch': 0.77}


 88%|████████▊ | 245/280 [01:30<00:23,  1.49it/s]

{'loss': 2.5964, 'grad_norm': 52.90607833862305, 'learning_rate': 1.2544802867383513e-06, 'epoch': 0.78}


 88%|████████▊ | 246/280 [01:30<00:18,  1.82it/s]

{'loss': 2.31, 'grad_norm': 42.36872100830078, 'learning_rate': 1.2186379928315414e-06, 'epoch': 0.78}


 88%|████████▊ | 247/280 [01:30<00:15,  2.11it/s]

{'loss': 2.4571, 'grad_norm': 65.77118682861328, 'learning_rate': 1.1827956989247313e-06, 'epoch': 0.78}


 89%|████████▊ | 248/280 [01:31<00:13,  2.33it/s]

{'loss': 2.1578, 'grad_norm': 48.295997619628906, 'learning_rate': 1.1469534050179212e-06, 'epoch': 0.79}


 89%|████████▉ | 249/280 [01:31<00:12,  2.47it/s]

{'loss': 2.0635, 'grad_norm': 51.544898986816406, 'learning_rate': 1.111111111111111e-06, 'epoch': 0.79}


 89%|████████▉ | 250/280 [01:31<00:11,  2.62it/s]

{'loss': 1.6583, 'grad_norm': 44.393978118896484, 'learning_rate': 1.0752688172043011e-06, 'epoch': 0.79}


 90%|████████▉ | 251/280 [01:32<00:10,  2.70it/s]

{'loss': 2.349, 'grad_norm': 50.68905258178711, 'learning_rate': 1.039426523297491e-06, 'epoch': 0.8}


 90%|█████████ | 252/280 [01:32<00:09,  2.85it/s]

{'loss': 2.0547, 'grad_norm': 57.34558868408203, 'learning_rate': 1.0035842293906811e-06, 'epoch': 0.8}


 90%|█████████ | 253/280 [01:32<00:09,  2.90it/s]

{'loss': 2.5617, 'grad_norm': 78.89997100830078, 'learning_rate': 9.67741935483871e-07, 'epoch': 0.8}


 91%|█████████ | 254/280 [01:33<00:08,  2.95it/s]

{'loss': 1.9109, 'grad_norm': 63.14878463745117, 'learning_rate': 9.318996415770611e-07, 'epoch': 0.81}


 91%|█████████ | 255/280 [01:33<00:08,  3.02it/s]

{'loss': 2.1857, 'grad_norm': 47.67878723144531, 'learning_rate': 8.96057347670251e-07, 'epoch': 0.81}


 91%|█████████▏| 256/280 [01:33<00:07,  3.03it/s]

{'loss': 2.3119, 'grad_norm': 56.793025970458984, 'learning_rate': 8.60215053763441e-07, 'epoch': 0.81}


 92%|█████████▏| 257/280 [01:34<00:07,  3.12it/s]

{'loss': 2.7981, 'grad_norm': 52.2637825012207, 'learning_rate': 8.243727598566309e-07, 'epoch': 0.82}


 92%|█████████▏| 258/280 [01:34<00:06,  3.16it/s]

{'loss': 2.3389, 'grad_norm': 49.32779312133789, 'learning_rate': 7.885304659498208e-07, 'epoch': 0.82}


 92%|█████████▎| 259/280 [01:34<00:06,  3.02it/s]

{'loss': 2.1617, 'grad_norm': 46.58600997924805, 'learning_rate': 7.526881720430108e-07, 'epoch': 0.82}


 93%|█████████▎| 260/280 [01:35<00:06,  2.92it/s]

{'loss': 2.4448, 'grad_norm': 67.65522766113281, 'learning_rate': 7.168458781362007e-07, 'epoch': 0.83}


 93%|█████████▎| 261/280 [01:35<00:06,  2.87it/s]

{'loss': 2.5172, 'grad_norm': 56.22542190551758, 'learning_rate': 6.810035842293908e-07, 'epoch': 0.83}


 94%|█████████▎| 262/280 [01:35<00:06,  2.82it/s]

{'loss': 1.7264, 'grad_norm': 38.61855697631836, 'learning_rate': 6.451612903225807e-07, 'epoch': 0.83}


 94%|█████████▍| 263/280 [01:36<00:06,  2.79it/s]

{'loss': 2.1322, 'grad_norm': 34.181419372558594, 'learning_rate': 6.093189964157707e-07, 'epoch': 0.83}


 94%|█████████▍| 264/280 [01:36<00:05,  2.85it/s]

{'loss': 2.2806, 'grad_norm': 72.48574829101562, 'learning_rate': 5.734767025089606e-07, 'epoch': 0.84}


 95%|█████████▍| 265/280 [01:36<00:05,  2.88it/s]

{'loss': 2.2927, 'grad_norm': 46.73114776611328, 'learning_rate': 5.376344086021506e-07, 'epoch': 0.84}


 95%|█████████▌| 266/280 [01:37<00:04,  2.89it/s]

{'loss': 2.782, 'grad_norm': 66.39656829833984, 'learning_rate': 5.017921146953406e-07, 'epoch': 0.84}


 95%|█████████▌| 267/280 [01:37<00:04,  2.99it/s]

{'loss': 2.2957, 'grad_norm': 52.65360641479492, 'learning_rate': 4.6594982078853055e-07, 'epoch': 0.85}


 96%|█████████▌| 268/280 [01:37<00:04,  2.97it/s]

{'loss': 2.2131, 'grad_norm': 45.31288146972656, 'learning_rate': 4.301075268817205e-07, 'epoch': 0.85}


 96%|█████████▌| 269/280 [01:38<00:03,  3.00it/s]

{'loss': 2.8222, 'grad_norm': 50.05276870727539, 'learning_rate': 3.942652329749104e-07, 'epoch': 0.85}


 96%|█████████▋| 270/280 [01:38<00:03,  3.01it/s]

{'loss': 2.2003, 'grad_norm': 41.914947509765625, 'learning_rate': 3.5842293906810036e-07, 'epoch': 0.86}


 97%|█████████▋| 271/280 [01:38<00:03,  3.00it/s]

{'loss': 2.5043, 'grad_norm': 43.07753372192383, 'learning_rate': 3.2258064516129035e-07, 'epoch': 0.86}


 97%|█████████▋| 272/280 [01:39<00:02,  2.99it/s]

{'loss': 2.6558, 'grad_norm': 58.17218780517578, 'learning_rate': 2.867383512544803e-07, 'epoch': 0.86}


 98%|█████████▊| 273/280 [01:39<00:02,  3.06it/s]

{'loss': 1.8356, 'grad_norm': 41.981544494628906, 'learning_rate': 2.508960573476703e-07, 'epoch': 0.87}


 98%|█████████▊| 274/280 [01:39<00:01,  3.01it/s]

{'loss': 2.3803, 'grad_norm': 41.782649993896484, 'learning_rate': 2.1505376344086024e-07, 'epoch': 0.87}


 98%|█████████▊| 275/280 [01:40<00:01,  2.96it/s]

{'loss': 2.6729, 'grad_norm': 52.98430252075195, 'learning_rate': 1.7921146953405018e-07, 'epoch': 0.87}


 99%|█████████▊| 276/280 [01:40<00:01,  2.94it/s]

{'loss': 2.4828, 'grad_norm': 57.09263229370117, 'learning_rate': 1.4336917562724014e-07, 'epoch': 0.88}


 99%|█████████▉| 277/280 [01:40<00:01,  2.91it/s]

{'loss': 2.2193, 'grad_norm': 40.75585174560547, 'learning_rate': 1.0752688172043012e-07, 'epoch': 0.88}


 99%|█████████▉| 278/280 [01:41<00:00,  3.00it/s]

{'loss': 2.231, 'grad_norm': 40.05489730834961, 'learning_rate': 7.168458781362007e-08, 'epoch': 0.88}


100%|█████████▉| 279/280 [01:41<00:00,  3.04it/s]

{'loss': 2.131, 'grad_norm': 51.78246307373047, 'learning_rate': 3.5842293906810036e-08, 'epoch': 0.89}


100%|██████████| 280/280 [01:41<00:00,  3.03it/s]

{'loss': 2.0542, 'grad_norm': 43.42436599731445, 'learning_rate': 0.0, 'epoch': 0.89}


100%|██████████| 280/280 [01:43<00:00,  2.71it/s]

{'train_runtime': 103.1553, 'train_samples_per_second': 10.857, 'train_steps_per_second': 2.714, 'train_loss': 2.4914458308901106, 'epoch': 0.89}
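The logged learning rate drops by the same amount at every step and reaches 0.0 at step 280, which is what a linear decay schedule would produce. The sketch below reproduces the logged values and re-derives the throughput figures from the summary line. The base learning rate of 1e-5 and the single warmup step are assumptions inferred from the numbers in the log, not values read from the training configuration, and `linear_lr` is a hypothetical helper name.

```python
# Assumed values, inferred from the log above (not read from TrainingArguments).
BASE_LR = 1.0e-5
MAX_STEPS = 280
WARMUP_STEPS = 1


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps under linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


# Compare a few steps against the 'learning_rate' values in the log.
for step in (261, 279, 280):
    print(step, linear_lr(step))

# Throughput figures taken from the final summary line of the log.
train_runtime = 103.1553
steps_per_second = MAX_STEPS / train_runtime
samples_per_step = 10.857 / 2.714  # samples/s divided by steps/s

print(round(steps_per_second, 3), round(samples_per_step, 2))
```

If these assumptions hold, each optimizer step consumed about 4 samples, and the per-step loss is still fluctuating between roughly 1.7 and 2.8 when the run ends, so this is only a lightly fine-tuned model.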





In [43]:
save_dir = f'{output_dir}/final'

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Saved model to: lamini_docs_280_steps/final


In [44]:
print(device)

cuda


In [45]:
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)

In [46]:
finetuned_slightly_model.to(device) 

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
        

In [47]:
test_question = test_dataset[0]['question']
print("Question input (test):", test_question)

print("Finetuned slightly model's answer: ")
pprint(inference(test_question, finetuned_slightly_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): Can Lamini generate technical documentation or user manuals for software projects?
Finetuned slightly model's answer: 
('Yes, Lamini can generate technical documentation or user manuals for '
 'software projects. It can generate technical documentation or user manuals '
 'for software projects. It can generate technical documentation or user '
 'manuals for software projects. It can generate technical documentation or '
 'user manuals for software projects. It can generate technical documentation '
 'or user manuals for software projects. It can generate technical '
 'documentation or user manuals for software projects. It can generate')
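The answer above gets the first sentence right and then loops on the same clause. One rough way to quantify that looping is to measure how many word n-grams repeat an earlier one. This is a minimal sketch, not part of the original notebook; `repeated_ngram_fraction` is a hypothetical helper and the two sample strings are shortened stand-ins for the outputs shown here.

```python
def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Share of word n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A looping answer, similar in shape to the generation above.
looping = "It can generate technical documentation or user manuals for software projects. " * 5
# A sentence without repeated phrases.
varied = "It uses natural language generation techniques to create clear and concise documentation."

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

A high score suggests the model is stuck in a loop. In Hugging Face `transformers`, `generate` accepts `repetition_penalty` and `no_repeat_ngram_size`, which may reduce this behavior; whether the notebook's `inference` helper forwards such arguments depends on how it was defined earlier.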


In [48]:
test_dataset

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 140
})

In [49]:
test_answer = test_dataset[0]['answer']
print("The target answer output that we were aiming for (test):", test_answer)

The target answer output that we were aiming for (test): Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.
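Comparing the generated answer with the target answer by eye works for one example, but a simple score makes the comparison repeatable across the test set. The sketch below computes a word-overlap F1 between a prediction and a reference. It is an illustrative metric chosen for this example, not the evaluation method used by the notebook, and `tokenize` and `overlap_f1` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall between two strings."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it appears in both.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


print(overlap_f1("Yes, Lamini can", "Yes, Lamini cannot"))
```

Because a looping answer repeats the same few words, its precision against the reference stays low even when its first sentence is correct, so the score penalizes the repetition seen above.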


In [50]:
print(base_model)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
        