Unsloth Overview
-----

On GitHub, **"Unsloth"** refers to a library designed to optimize the fine-tuning of large language models (LLMs) such as **Llama**, **Mistral**, **Phi**, and **Gemma**. The main focus of Unsloth is to accelerate the fine-tuning process and reduce memory usage significantly. Here are some key features of Unsloth:

* **Performance Improvement:** The project reports that Unsloth can fine-tune models 2-5 times faster while using 80% less memory, which it attributes to more efficient memory management and optimized kernels.
* **Compatibility:** It supports most NVIDIA GPUs released since 2018, including popular models like the RTX series and A100. It works on both Linux and Windows via WSL.
* **Accuracy:** There is no loss in accuracy as Unsloth uses exact methods rather than approximations.
* **Bit Precision Support:** The library supports 4-bit (QLoRA) and 16-bit (LoRA) fine-tuning for more memory-efficient training.
* **Integration:** Unsloth is compatible with the Hugging Face ecosystem, allowing users to leverage popular libraries like Transformers, PEFT, and TRL.
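To make the "Bit Precision Support" point concrete, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would need at different precisions. The 8-billion parameter count is an assumption chosen to resemble the model used later in this notebook, and the estimate deliberately ignores activations, gradients, optimizer state and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

# Assumption: a model with roughly 8 billion parameters
params = 8e9

mem_16bit = weight_memory_gb(params, 16)
mem_4bit = weight_memory_gb(params, 4)

print(f"16-bit weights: {mem_16bit:.1f} GB")
print(f" 4-bit weights: {mem_4bit:.1f} GB")
```

Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights, which is what allows an 8B model to fit on a single 24 GB consumer GPU with room left for training state.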

For installation, Unsloth provides detailed instructions for both *Conda* and *Pip*, depending on your setup. It also includes benchmarking data to demonstrate its efficiency compared to traditional methods.

You can access the GitHub page for Unsloth at the following link: https://github.com/unslothai/unsloth?tab=readme-ov-file

Step 1: Installing the required packages
------------

Step 1.1: Windows Subsystem for Linux:

* Unsloth works only on a Linux-based system. You can download your preferred Linux distribution from the following link: https://www.linux.org/pages/download/.

* Alternatively, on Windows you can use Windows Subsystem for Linux (WSL). See https://learn.microsoft.com/en-us/windows/wsl/install for installation instructions.

    The quickest way to install WSL is to open PowerShell and run:
    
        wsl --install
_________
Step 1.2: Anaconda:

* The easiest way to install Unsloth is through Conda, which ships with Anaconda. Download the Anaconda installer for Linux from the Anaconda archive: https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

* Once downloaded, open a terminal in the folder where you saved the installer and run the following two commands:
    
        chmod +x Anaconda3-[version]-Linux-x86_64.sh
        ./Anaconda3-[version]-Linux-x86_64.sh

    Replace "[version]" with the version of the installer you downloaded, e.g. "Anaconda3-2024.02-1-Linux-x86_64.sh".
____________
Step 1.3: Installing Unsloth

* Head to the Unsloth repository on GitHub: https://github.com/unslothai/unsloth?tab=readme-ov-file and follow the steps below to create a virtual Python environment with the required packages installed. Open a terminal in your project folder and run the following commands:

        conda create --name unsloth_env python=3.10 pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
        conda activate unsloth_env
        pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
        pip install --no-deps "trl<0.9.0" wandb huggingface_hub peft accelerate bitsandbytes datasets

* You can also install Unsloth using pip by creating a virtual environment and then installing the Unsloth requirements as shown below.

        sudo apt install python3-venv
        python3 -m venv unsloth_env
        source unsloth_env/bin/activate
        pip install --upgrade pip
        pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
____________
Step 1.4: Installing PyTorch and the NVIDIA CUDA Toolkit

* You will need a CUDA-enabled build of PyTorch. To install one built for CUDA 12.1, run the following command in your terminal:

        pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

* You might also need the NVIDIA CUDA Toolkit v12.1 installed on your machine. The commands below are for WSL-Ubuntu and come from NVIDIA's download archive (https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local):

        wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
        sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
        wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
        sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
        sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
        sudo apt-get update
        sudo apt-get -y install cuda
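After the installation steps above, it can help to confirm that PyTorch actually sees the GPU before moving on. The following is a minimal sketch of such a check; the helper name `cuda_report` is hypothetical, and the function is written so that it still runs (and simply reports that PyTorch is missing) on a machine where the install has not been done yet.

```python
import importlib.util

def cuda_report() -> dict:
    """Collect basic facts about the local PyTorch/CUDA setup without failing if torch is absent."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report

report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` even though a GPU is present, the usual suspects are a CPU-only PyTorch build or a driver/toolkit mismatch.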



Step 2: Importing necessary packages
---

In [1]:
import torch # PyTorch is used for building and training neural networks
from trl import SFTTrainer # TRL (Transformer Reinforcement Learning) provides trainers for large language models; SFTTrainer performs supervised fine-tuning
from transformers import TrainingArguments # TrainingArguments holds the hyperparameters that are set before training the model
from datasets import load_dataset # Simplifies downloading and preparing datasets for machine learning experiments
max_seq_length = 2048 # Specifies the maximum sequence length for the input to the model

# Unsloth states that its models support RoPE scaling.
# Rotary Position Embedding (RoPE) scaling adjusts the RoPE mechanism to handle sequences of varying lengths more effectively.
# This can be particularly important for models that need to process sequences longer than those seen during training or for tasks where the importance of positional information changes dynamically.
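To illustrate the idea behind RoPE scaling mentioned in the comments above, here is a small sketch of linear position interpolation, one common scaling scheme. It is not taken from Unsloth's code; whether Unsloth uses exactly this variant is an assumption, and the context lengths and the tiny `head_dim` are hypothetical values chosen for readability.

```python
def rope_angles(position: int, head_dim: int = 8, base: float = 10000.0, scale: float = 1.0) -> list:
    """Rotation angles for one position; scale > 1 compresses positions (linear interpolation)."""
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

trained_context = 2048   # hypothetical context length seen during pre-training
target_context = 4096    # hypothetical longer context we want to support
scale = target_context / trained_context

# With scaling, the last position of the longer context receives the same angles
# that the last trained position received without scaling.
unscaled = rope_angles(trained_context)
scaled = rope_angles(target_context, scale=scale)

print("scale factor:", scale)
print("angles match:", all(abs(a - b) < 1e-9 for a, b in zip(unscaled, scaled)))
```

The point of the sketch is that scaled positions stay inside the range of rotation angles the model saw during training, at the cost of finer spacing between neighbouring positions.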


In [2]:
import unsloth
from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


Step 2.1: Importing the desired model
---

* You can find all the Unsloth models through the following link: https://huggingface.co/unsloth.

* At the time of writing, it contains 84 models, a mix of base models and instruct models. Instruct models are better at following instructions from end users.

In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.586 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Step 3: Importing the training dataset and transforming it
-----

We now need to convert the CSV file of the dataset into ShareGPT format, so that each record looks like this:

    {"from": "system", "value": "You are an assistant that will help the RICS APC candidate formulate the submission"}
    {"from": "human", "value": "Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1"}
    {"from": "gpt", "value": "I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public."}

In [36]:
import pandas as pd

# Load the CSV file into a DataFrame
csv_file_path = 'RICS APC Consolidated Submissions dataset.csv'
df = pd.read_csv(csv_file_path)

# Ensure the data has a "train" mapping, add a 'train' column with 'train' value
df['train'] = 'train'

# Create a new 'conversations' column as a list of dictionaries
df['conversations'] = df.apply(lambda row: [
    {"from": "system", "value": "You are an assistant that will help the RICS APC candidate formulate the submission"},
    {"from": "human", "value": row['Question']},
    {"from": "gpt", "value": row['Answer']}
], axis=1)

# Remove the 'id', 'Question', and 'Answer' columns
df = df.drop(columns=['id', 'Question', 'Answer'])

# Display the updated DataFrame
print(df.head())


   train                                      conversations
0  train  [{'from': 'system', 'value': 'You are an assis...
1  train  [{'from': 'system', 'value': 'You are an assis...
2  train  [{'from': 'system', 'value': 'You are an assis...
3  train  [{'from': 'system', 'value': 'You are an assis...
4  train  [{'from': 'system', 'value': 'You are an assis...
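Before moving on, it may be worth checking that every generated conversation is well formed, since pandas reads empty CSV cells as `NaN` floats rather than strings. The following is a minimal sketch of such a check; the helper name `is_valid_sharegpt` and the sample records are hypothetical and not part of the original notebook.

```python
ALLOWED_SPEAKERS = {"system", "human", "gpt"}

def is_valid_sharegpt(conversation: list) -> bool:
    """Return True if every turn has a known speaker and a non-empty string value."""
    if not conversation:
        return False
    for turn in conversation:
        if turn.get("from") not in ALLOWED_SPEAKERS:
            return False
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            return False
    return True

# Hypothetical records: one complete, one with a missing answer (NaN from an empty CSV cell)
good = [
    {"from": "system", "value": "You are an assistant."},
    {"from": "human", "value": "Give me an example."},
    {"from": "gpt", "value": "Here is an example."},
]
bad = [
    {"from": "human", "value": "Give me an example."},
    {"from": "gpt", "value": float("nan")},
]

print(is_valid_sharegpt(good))
print(is_valid_sharegpt(bad))
```

Applied to the DataFrame above, something like `df['conversations'].apply(is_valid_sharegpt)` would flag rows that need cleaning before training.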


Step 4: Change the dataset format from ShareGPT into Hugging Face format
------

We now need to convert the data from the ShareGPT format above into the Hugging Face format, which looks like this:

    {"role": "system", "content": "You are an assistant that will help the RICS APC candidate formulate the submission"}
    {"role": "user", "content": "Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1"}
    {"role": "assistant", "content": "I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public."}
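The conversion between the two formats is essentially a renaming of keys and speakers. The sketch below shows that mapping in plain Python; it is a simplified stand-in for illustration, not the implementation of Unsloth's `standardize_sharegpt`, which may handle more variants. The names `ROLE_MAP` and `sharegpt_to_hf` are hypothetical.

```python
# ShareGPT speaker names mapped to Hugging Face role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_hf(conversation: list) -> list:
    """Rename 'from'/'value' keys to 'role'/'content' and translate the speaker names."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

sharegpt_example = [
    {"from": "system", "value": "You are an assistant."},
    {"from": "human", "value": "Give me an example."},
    {"from": "gpt", "value": "Here is an example."},
]

hf_example = sharegpt_to_hf(sharegpt_example)
for message in hf_example:
    print(message)
```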

Once this conversion is done, the conversations can be rendered with the Llama 3.1 chat template, which looks like the following:

    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public.<|eot_id|><|start_header_id|>user<|end_header_id|>
    That's great, thanks!<|eot_id|>
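To show how the special tokens wrap each turn, here is a simplified sketch of a renderer for this template. It is an approximation for illustration only: the real template applied by `tokenizer.apply_chat_template` also injects a date preamble into the system turn (visible in the `dataset[5]["text"]` output further down), and the function name `render_llama31` is hypothetical.

```python
def render_llama31(messages: list) -> str:
    """Render Hugging Face style messages into a simplified Llama 3.1 chat string."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    return text

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

rendered = render_llama31(messages)
print(rendered)
```

In practice the tokenizer's own template should always be used for training, so that the training text matches what the model expects at inference time.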

In [45]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    # Render each conversation into a single Llama 3.1 formatted string
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Convert the pandas DataFrame into a Hugging Face Dataset
import datasets
dataset = datasets.Dataset.from_pandas(df)

# Convert the ShareGPT keys ("from"/"value") into Hugging Face keys ("role"/"content"), then apply the chat template
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format: 100%|██████████| 331/331 [00:00<00:00, 50206.66 examples/s]
Map: 100%|██████████| 331/331 [00:00<00:00, 11873.55 examples/s]


In [46]:
dataset[5]["conversations"]

[{'content': 'You are an assistant that will help the RICS APC candidate formulate the submission',
  'role': 'system'},
 {'content': 'Give me a good quality example of Competency: Communication and negotiation Level: 1',
  'role': 'user'},
 {'content': 'To fulfil my duties as a Quantity Surveyor, I understand the need to communicate in a professional manner that is clear and concise. To provide the highest level of service, I am aware that is important to communicate clearly with all stakeholders and ensure complex information is broken down and summarised in a simple format. I have been able to develop these skills through my time at University and Mott MacDonald where I have been required to communicate formally and informally with my professors and colleagues respectively.\n\nI am also aware of differences between formal and informal communication and the scenarios in which each communication type is suitable.',
  'role': 'assistant'}]

In [47]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant that will help the RICS APC candidate formulate the submission<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGive me a good quality example of Competency: Communication and negotiation Level: 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTo fulfil my duties as a Quantity Surveyor, I understand the need to communicate in a professional manner that is clear and concise. To provide the highest level of service, I am aware that is important to communicate clearly with all stakeholders and ensure complex information is broken down and summarised in a simple format. I have been able to develop these skills through my time at University and Mott MacDonald where I have been required to communicate formally and informally with my professors and colleagues respectively.\n\nI am also aware of differences between formal and inf

Step 5 (Optional): Generating text before training
---------

This function takes a prompt and prints the response generated by the imported LLM.

In [48]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

def generate_text(text):
    # Tokenize the prompt and move the tensors to the first GPU
    inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
    # Generate up to 256 new tokens; the returned sequence also contains the prompt tokens
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print("Before training \n")
generate_text("Help me to write a good example of procurement and tendering level 2.")

Before training 

Help me to write a good example of procurement and tendering level 2. I am looking for a good example of procurement and tendering level 2, and I want it to be a comprehensive one. Here's a detailed example of procurement and tendering level 2:

## Step 1: Define the Project Requirements
The first step in the procurement process is to define the project requirements. This involves identifying the goods, services, or works needed to complete the project. In this example, let's say we are procuring a new IT system for a company.

## Step 2: Determine the Procurement Method
The next step is to determine the procurement method. There are several methods to choose from, including open tendering, selective tendering, and limited tendering. In this example, let's say we will use open tendering.

## Step 3: Prepare the Invitation to Tender (ITT)
The Invitation to Tender (ITT) is a document that invites potential suppliers to submit a tender. It should include information such
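The output above starts by repeating the prompt because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of how the prompt could be stripped is shown below with plain lists; the token IDs and the helper name `strip_prompt` are hypothetical.

```python
def strip_prompt(output_ids: list, prompt_length: int) -> list:
    """Keep only the tokens produced after the prompt."""
    return output_ids[prompt_length:]

prompt_ids = [101, 7, 42]             # hypothetical token IDs for the prompt
output_ids = prompt_ids + [9, 15, 3]  # prompt followed by newly generated tokens

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

With the tensors used in `generate_text`, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, decoded in the same way as before.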

Step 6: Initializing the Low-Rank Adaptation (LoRA) adapters
----

In [49]:
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.
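As a sanity check on the configuration above, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` parameters in total. The sketch below does this arithmetic under the assumption that the model uses the published Llama 3.1 8B dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); these figures are not stated in the notebook itself.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Assumed Llama 3.1 8B dimensions (not stated in this notebook)
HIDDEN = 4096
KV_DIM = 1024
MLP_DIM = 14336
NUM_LAYERS = 32
RANK = 16  # matches r = 16 in the cell above

# (in_features, out_features) for each target module
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

This agrees with the "Number of trainable parameters = 41,943,040" line that the training banner prints in Step 8, which is roughly half a percent of the full 8B model.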


Step 7: Defining the training parameters & Initiating training
--

In [50]:
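import os
import wandb

# Initialize wandb. Read the API key from the WANDB_API_KEY environment variable
# instead of hard-coding it in the notebook.
wandb.login(key=os.environ.get("WANDB_API_KEY"))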

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/mohamedashour/.netrc


True

In [51]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 25,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2): 100%|██████████| 331/331 [00:00<00:00, 664.22 examples/s]
max_steps is given, it will override any value given in num_train_epochs
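The message above means that the length of the run is set by `max_steps`, not by a number of epochs. The sketch below works out what the chosen arguments imply for this dataset. The rounding-up rule for the epoch count is an assumption on my part, though it agrees with the "Num Epochs = 13" value reported by the training banner in Step 8.

```python
import math

num_examples = 331        # size of the training dataset shown above
per_device_batch = 2      # per_device_train_batch_size
grad_accum_steps = 4      # gradient_accumulation_steps
num_gpus = 1
max_steps = 500

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# Fraction of the dataset seen after each optimizer step
epochs_per_step = effective_batch / num_examples

# Passes over the data needed to reach max_steps (rounded up: an assumption)
epochs_needed = math.ceil(max_steps * epochs_per_step)

print("effective batch size:", effective_batch)
print("epoch after first step:", round(epochs_per_step, 2))
print("epochs needed:", epochs_needed)
```

With only 331 examples, 500 steps mean the model sees each example about twelve times, so it is worth watching for overfitting.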


Step 7.1: Refining the training methodology
---
We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs.

We can also verify that the masking is done correctly.

In [52]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map: 100%|██████████| 331/331 [00:00<00:00, 13168.74 examples/s]


In [53]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant that will help the RICS APC candidate formulate the submission<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGive me a good quality example of Competency: Communication and negotiation Level: 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTo fulfil my duties as a Quantity Surveyor, I understand the need to communicate in a professional manner that is clear and concise. To provide the highest level of service, I am aware that is important to communicate clearly with all stakeholders and ensure complex information is broken down and summarised in a simple format. I have been able to develop these skills through my time at University and Mott MacDonald where I have been required to communicate formally and informally with my professors and colleagues respectively.\n\nI am also aware of differences between formal and inf

In [54]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                  \n\nTo fulfil my duties as a Quantity Surveyor, I understand the need to communicate in a professional manner that is clear and concise. To provide the highest level of service, I am aware that is important to communicate clearly with all stakeholders and ensure complex information is broken down and summarised in a simple format. I have been able to develop these skills through my time at University and Mott MacDonald where I have been required to communicate formally and informally with my professors and colleagues respectively.\n\nI am also aware of differences between formal and informal communication and the scenarios in which each communication type is suitable.<|eot_id|>'
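The output above shows the prompt replaced by spaces because its labels were set to -100, the index that PyTorch's cross-entropy loss ignores by default. Below is a toy sketch of that masking idea for a single-turn conversation. It is not Unsloth's implementation (which works on real token IDs and handles multiple turns), and the marker tokens and function name are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the loss function

def mask_labels(token_ids: list, tokens: list, response_marker: str) -> list:
    """Mask everything up to and including the response marker; keep the rest as labels."""
    start = tokens.index(response_marker) + 1
    return [IGNORE_INDEX] * start + token_ids[start:]

# Toy sequence: each token's ID is simply its position
tokens = ["<system>", "You", "help", "<user>", "Hi", "<assistant>", "Hello", "there", "<eot>"]
token_ids = list(range(len(tokens)))

labels = mask_labels(token_ids, tokens, "<assistant>")
trained_on = [tok for tok, lab in zip(tokens, labels) if lab != IGNORE_INDEX]

print(labels)
print(trained_on)
```

Only the assistant's reply and its end-of-turn token contribute to the loss, which mirrors what the decoded labels above show.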

Step 8: Training the model
---

In [55]:
trainer_stats = trainer.train()
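The `learning_rate` values in the log that follows come from the schedule configured in Step 7: a linear warmup over 25 steps to a peak of 2e-4, then a linear decay towards zero at step 500. The sketch below reproduces that shape; it is a reconstruction that matches the logged values, not the library's own code, and the function name `linear_lr` is hypothetical.

```python
def linear_lr(step: int, peak_lr: float = 2e-4, warmup_steps: int = 25, max_steps: int = 500) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)

for step in (1, 25, 26, 44, 500):
    print(step, linear_lr(step))
```

Comparing with the log: step 1 shows about 8e-06, step 25 reaches the peak of 0.0002, and step 44 has decayed to 0.000192.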

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 331 | Num Epochs = 13
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 41,943,040


  0%|          | 1/500 [00:02<20:52,  2.51s/it]

{'loss': 3.4996, 'grad_norm': 20.23871421813965, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.02}


  0%|          | 2/500 [00:04<17:34,  2.12s/it]

{'loss': 3.7757, 'grad_norm': 16.956972122192383, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.05}


  1%|          | 3/500 [00:05<15:45,  1.90s/it]

{'loss': 3.7748, 'grad_norm': 18.151853561401367, 'learning_rate': 2.4e-05, 'epoch': 0.07}


  1%|          | 4/500 [00:07<14:20,  1.74s/it]

{'loss': 3.3588, 'grad_norm': 12.21260929107666, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.1}


  1%|          | 5/500 [00:09<14:25,  1.75s/it]

{'loss': 3.6162, 'grad_norm': 15.301004409790039, 'learning_rate': 4e-05, 'epoch': 0.12}


  1%|          | 6/500 [00:10<13:58,  1.70s/it]

{'loss': 3.1587, 'grad_norm': 3.370657444000244, 'learning_rate': 4.8e-05, 'epoch': 0.14}


  1%|▏         | 7/500 [00:12<13:57,  1.70s/it]

{'loss': 3.44, 'grad_norm': 2.788064956665039, 'learning_rate': 5.6000000000000006e-05, 'epoch': 0.17}


  2%|▏         | 8/500 [00:14<13:23,  1.63s/it]

{'loss': 3.0654, 'grad_norm': 1.5597378015518188, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.19}


  2%|▏         | 9/500 [00:15<13:06,  1.60s/it]

{'loss': 3.0567, 'grad_norm': 1.0029493570327759, 'learning_rate': 7.2e-05, 'epoch': 0.22}


  2%|▏         | 10/500 [00:17<14:04,  1.72s/it]

{'loss': 3.2404, 'grad_norm': 1.4101632833480835, 'learning_rate': 8e-05, 'epoch': 0.24}


  2%|▏         | 11/500 [00:19<14:03,  1.72s/it]

{'loss': 2.7684, 'grad_norm': 1.3463661670684814, 'learning_rate': 8.800000000000001e-05, 'epoch': 0.27}


  2%|▏         | 12/500 [00:21<14:04,  1.73s/it]

{'loss': 2.875, 'grad_norm': 1.0451749563217163, 'learning_rate': 9.6e-05, 'epoch': 0.29}


  3%|▎         | 13/500 [00:22<14:22,  1.77s/it]

{'loss': 3.0032, 'grad_norm': 1.2614835500717163, 'learning_rate': 0.00010400000000000001, 'epoch': 0.31}


  3%|▎         | 14/500 [00:24<13:55,  1.72s/it]

{'loss': 2.6383, 'grad_norm': 0.9718266129493713, 'learning_rate': 0.00011200000000000001, 'epoch': 0.34}


  3%|▎         | 15/500 [00:26<13:27,  1.66s/it]

{'loss': 2.6678, 'grad_norm': 0.6856692433357239, 'learning_rate': 0.00012, 'epoch': 0.36}


  3%|▎         | 16/500 [00:27<13:57,  1.73s/it]

{'loss': 2.7657, 'grad_norm': 0.7022684216499329, 'learning_rate': 0.00012800000000000002, 'epoch': 0.39}


  3%|▎         | 17/500 [00:29<13:36,  1.69s/it]

{'loss': 2.5749, 'grad_norm': 0.6842297911643982, 'learning_rate': 0.00013600000000000003, 'epoch': 0.41}


  4%|▎         | 18/500 [00:31<14:34,  1.81s/it]

{'loss': 2.655, 'grad_norm': 0.660301148891449, 'learning_rate': 0.000144, 'epoch': 0.43}


  4%|▍         | 19/500 [00:33<13:59,  1.74s/it]

{'loss': 2.6365, 'grad_norm': 0.7253311276435852, 'learning_rate': 0.000152, 'epoch': 0.46}


  4%|▍         | 20/500 [00:34<13:38,  1.71s/it]
wandb: 429 encountered (Filestream rate limit exceeded, retrying in 2.1 seconds.), retrying request


{'loss': 2.424, 'grad_norm': 0.8013726472854614, 'learning_rate': 0.00016, 'epoch': 0.48}


  4%|▍         | 21/500 [00:36<13:01,  1.63s/it]

{'loss': 2.6948, 'grad_norm': 0.811472475528717, 'learning_rate': 0.000168, 'epoch': 0.51}


  4%|▍         | 22/500 [00:37<12:40,  1.59s/it]

{'loss': 2.6374, 'grad_norm': 0.8426346182823181, 'learning_rate': 0.00017600000000000002, 'epoch': 0.53}


  5%|▍         | 23/500 [00:39<13:05,  1.65s/it]

{'loss': 2.6114, 'grad_norm': 0.8758843541145325, 'learning_rate': 0.00018400000000000003, 'epoch': 0.55}


  5%|▍         | 24/500 [00:41<12:59,  1.64s/it]

{'loss': 2.5627, 'grad_norm': 0.7271365523338318, 'learning_rate': 0.000192, 'epoch': 0.58}


  5%|▌         | 25/500 [00:42<13:10,  1.66s/it]

{'loss': 2.5913, 'grad_norm': 0.7368846535682678, 'learning_rate': 0.0002, 'epoch': 0.6}


  5%|▌         | 26/500 [00:44<12:49,  1.62s/it]

{'loss': 2.465, 'grad_norm': 0.758258044719696, 'learning_rate': 0.00019957894736842108, 'epoch': 0.63}


  5%|▌         | 27/500 [00:46<12:58,  1.65s/it]

{'loss': 2.5036, 'grad_norm': 0.7276833057403564, 'learning_rate': 0.00019915789473684212, 'epoch': 0.65}


  6%|▌         | 28/500 [00:47<12:50,  1.63s/it]

{'loss': 2.5177, 'grad_norm': 0.807752251625061, 'learning_rate': 0.00019873684210526318, 'epoch': 0.67}


  6%|▌         | 29/500 [00:49<13:11,  1.68s/it]

{'loss': 2.5786, 'grad_norm': 0.6763274073600769, 'learning_rate': 0.00019831578947368422, 'epoch': 0.7}


  6%|▌         | 30/500 [00:51<13:26,  1.72s/it]

{'loss': 2.3406, 'grad_norm': 0.7149099707603455, 'learning_rate': 0.00019789473684210526, 'epoch': 0.72}


  6%|▌         | 31/500 [00:52<13:19,  1.70s/it]

{'loss': 2.6262, 'grad_norm': 0.7467898726463318, 'learning_rate': 0.00019747368421052633, 'epoch': 0.75}


  6%|▋         | 32/500 [00:54<13:17,  1.70s/it]

{'loss': 2.617, 'grad_norm': 0.7294655442237854, 'learning_rate': 0.00019705263157894737, 'epoch': 0.77}


  7%|▋         | 33/500 [00:56<13:15,  1.70s/it]

{'loss': 2.4911, 'grad_norm': 0.6962660551071167, 'learning_rate': 0.00019663157894736843, 'epoch': 0.8}


  7%|▋         | 34/500 [00:57<12:50,  1.65s/it]

{'loss': 2.5574, 'grad_norm': 0.7388569116592407, 'learning_rate': 0.0001962105263157895, 'epoch': 0.82}


  7%|▋         | 35/500 [00:59<13:02,  1.68s/it]

{'loss': 2.5833, 'grad_norm': 0.6904206275939941, 'learning_rate': 0.00019578947368421054, 'epoch': 0.84}


  7%|▋         | 36/500 [01:01<12:55,  1.67s/it]

{'loss': 2.5035, 'grad_norm': 0.6470409631729126, 'learning_rate': 0.00019536842105263158, 'epoch': 0.87}


  7%|▋         | 37/500 [01:02<12:44,  1.65s/it]

{'loss': 2.5535, 'grad_norm': 0.6471465229988098, 'learning_rate': 0.00019494736842105265, 'epoch': 0.89}


  8%|▊         | 38/500 [01:04<12:37,  1.64s/it]

{'loss': 2.3555, 'grad_norm': 0.6461631059646606, 'learning_rate': 0.00019452631578947369, 'epoch': 0.92}


  8%|▊         | 39/500 [01:06<12:39,  1.65s/it]

{'loss': 2.404, 'grad_norm': 0.7076132297515869, 'learning_rate': 0.00019410526315789475, 'epoch': 0.94}


  8%|▊         | 40/500 [01:08<13:05,  1.71s/it]

{'loss': 2.3387, 'grad_norm': 0.6989299058914185, 'learning_rate': 0.0001936842105263158, 'epoch': 0.96}


  8%|▊         | 41/500 [01:09<12:35,  1.64s/it]

{'loss': 2.3959, 'grad_norm': 0.7499839663505554, 'learning_rate': 0.00019326315789473686, 'epoch': 0.99}


  8%|▊         | 42/500 [01:11<12:08,  1.59s/it]

{'loss': 2.4185, 'grad_norm': 0.9613686203956604, 'learning_rate': 0.00019284210526315793, 'epoch': 1.01}


  9%|▊         | 43/500 [01:13<13:06,  1.72s/it]

{'loss': 2.1301, 'grad_norm': 0.5742880702018738, 'learning_rate': 0.00019242105263157894, 'epoch': 1.04}


  9%|▉         | 44/500 [01:14<12:53,  1.70s/it]

{'loss': 2.4095, 'grad_norm': 0.7196910977363586, 'learning_rate': 0.000192, 'epoch': 1.06}


  9%|▉         | 45/500 [01:16<13:00,  1.72s/it]

{'loss': 2.5929, 'grad_norm': 0.7836901545524597, 'learning_rate': 0.00019157894736842104, 'epoch': 1.08}


  9%|▉         | 46/500 [01:18<12:37,  1.67s/it]

{'loss': 2.3724, 'grad_norm': 0.7681135535240173, 'learning_rate': 0.0001911578947368421, 'epoch': 1.11}


  9%|▉         | 47/500 [01:19<12:21,  1.64s/it]

{'loss': 2.1051, 'grad_norm': 0.7905958890914917, 'learning_rate': 0.00019073684210526318, 'epoch': 1.13}


 10%|▉         | 48/500 [01:21<12:27,  1.65s/it]

{'loss': 2.2146, 'grad_norm': 0.783925473690033, 'learning_rate': 0.00019031578947368422, 'epoch': 1.16}


 10%|▉         | 49/500 [01:23<13:33,  1.80s/it]

{'loss': 2.3054, 'grad_norm': 0.6749281883239746, 'learning_rate': 0.00018989473684210528, 'epoch': 1.18}


 10%|█         | 50/500 [01:24<13:00,  1.73s/it]

{'loss': 2.0795, 'grad_norm': 0.8726687431335449, 'learning_rate': 0.00018947368421052632, 'epoch': 1.2}


 10%|█         | 51/500 [01:27<13:43,  1.83s/it]

{'loss': 2.4616, 'grad_norm': 0.7165312767028809, 'learning_rate': 0.00018905263157894736, 'epoch': 1.23}


 10%|█         | 52/500 [01:28<13:38,  1.83s/it]

{'loss': 2.3592, 'grad_norm': 0.855100154876709, 'learning_rate': 0.00018863157894736843, 'epoch': 1.25}


 11%|█         | 53/500 [01:30<13:20,  1.79s/it]

{'loss': 2.3327, 'grad_norm': 0.981406033039093, 'learning_rate': 0.00018821052631578947, 'epoch': 1.28}


 11%|█         | 54/500 [01:32<13:04,  1.76s/it]

{'loss': 2.2017, 'grad_norm': 0.8792794942855835, 'learning_rate': 0.00018778947368421054, 'epoch': 1.3}


 11%|█         | 55/500 [01:33<12:43,  1.72s/it]

{'loss': 2.1554, 'grad_norm': 0.9816849231719971, 'learning_rate': 0.0001873684210526316, 'epoch': 1.33}


 11%|█         | 56/500 [01:35<12:12,  1.65s/it]

{'loss': 2.2069, 'grad_norm': 0.9084529876708984, 'learning_rate': 0.00018694736842105264, 'epoch': 1.35}


 11%|█▏        | 57/500 [01:36<11:54,  1.61s/it]

{'loss': 2.475, 'grad_norm': 0.9749996066093445, 'learning_rate': 0.00018652631578947368, 'epoch': 1.37}


 12%|█▏        | 58/500 [01:38<11:28,  1.56s/it]

{'loss': 2.0457, 'grad_norm': 0.928034782409668, 'learning_rate': 0.00018610526315789475, 'epoch': 1.4}


 12%|█▏        | 59/500 [01:39<11:23,  1.55s/it]

{'loss': 1.9375, 'grad_norm': 0.9420092701911926, 'learning_rate': 0.0001856842105263158, 'epoch': 1.42}


 12%|█▏        | 60/500 [01:41<11:31,  1.57s/it]

{'loss': 2.263, 'grad_norm': 0.8264827728271484, 'learning_rate': 0.00018526315789473685, 'epoch': 1.45}


 12%|█▏        | 61/500 [01:42<11:16,  1.54s/it]

{'loss': 2.2131, 'grad_norm': 1.1074182987213135, 'learning_rate': 0.0001848421052631579, 'epoch': 1.47}


 12%|█▏        | 62/500 [01:44<11:11,  1.53s/it]

{'loss': 1.8266, 'grad_norm': 0.9319798350334167, 'learning_rate': 0.00018442105263157896, 'epoch': 1.49}


 13%|█▎        | 63/500 [01:46<11:19,  1.56s/it]

{'loss': 2.1101, 'grad_norm': 1.0346158742904663, 'learning_rate': 0.00018400000000000003, 'epoch': 1.52}


 13%|█▎        | 64/500 [01:47<11:17,  1.55s/it]

{'loss': 2.0073, 'grad_norm': 1.0393401384353638, 'learning_rate': 0.00018357894736842104, 'epoch': 1.54}


 13%|█▎        | 65/500 [01:49<11:09,  1.54s/it]

{'loss': 2.2471, 'grad_norm': 1.067336082458496, 'learning_rate': 0.0001831578947368421, 'epoch': 1.57}


 13%|█▎        | 66/500 [01:50<11:31,  1.59s/it]

{'loss': 2.1781, 'grad_norm': 1.0369302034378052, 'learning_rate': 0.00018273684210526317, 'epoch': 1.59}


 13%|█▎        | 67/500 [01:52<11:38,  1.61s/it]

{'loss': 2.2753, 'grad_norm': 1.079733967781067, 'learning_rate': 0.0001823157894736842, 'epoch': 1.61}


 14%|█▎        | 68/500 [01:54<12:08,  1.69s/it]

{'loss': 2.3342, 'grad_norm': 0.9234049320220947, 'learning_rate': 0.00018189473684210528, 'epoch': 1.64}


 14%|█▍        | 69/500 [01:56<12:25,  1.73s/it]

{'loss': 2.2252, 'grad_norm': 1.056169033050537, 'learning_rate': 0.00018147368421052632, 'epoch': 1.66}


 14%|█▍        | 70/500 [01:57<11:45,  1.64s/it]

{'loss': 2.1087, 'grad_norm': 1.1607731580734253, 'learning_rate': 0.00018105263157894739, 'epoch': 1.69}


 14%|█▍        | 71/500 [01:59<11:42,  1.64s/it]

{'loss': 1.9863, 'grad_norm': 0.8686917424201965, 'learning_rate': 0.00018063157894736845, 'epoch': 1.71}


 14%|█▍        | 72/500 [02:01<12:09,  1.70s/it]

{'loss': 2.3915, 'grad_norm': 0.8434227705001831, 'learning_rate': 0.00018021052631578946, 'epoch': 1.73}


 15%|█▍        | 73/500 [02:02<11:58,  1.68s/it]

{'loss': 2.0514, 'grad_norm': 0.9639003276824951, 'learning_rate': 0.00017978947368421053, 'epoch': 1.76}


 15%|█▍        | 74/500 [02:04<12:06,  1.71s/it]

{'loss': 2.2781, 'grad_norm': 0.9851449131965637, 'learning_rate': 0.0001793684210526316, 'epoch': 1.78}


 15%|█▌        | 75/500 [02:06<11:49,  1.67s/it]

{'loss': 2.2, 'grad_norm': 1.0160903930664062, 'learning_rate': 0.00017894736842105264, 'epoch': 1.81}


 15%|█▌        | 76/500 [02:07<11:43,  1.66s/it]

{'loss': 2.33, 'grad_norm': 1.0239027738571167, 'learning_rate': 0.0001785263157894737, 'epoch': 1.83}


 15%|█▌        | 77/500 [02:09<11:30,  1.63s/it]

{'loss': 1.9642, 'grad_norm': 1.049575686454773, 'learning_rate': 0.00017810526315789474, 'epoch': 1.86}


 16%|█▌        | 78/500 [02:10<11:15,  1.60s/it]

{'loss': 2.1911, 'grad_norm': 1.142591118812561, 'learning_rate': 0.0001776842105263158, 'epoch': 1.88}


 16%|█▌        | 79/500 [02:12<11:09,  1.59s/it]

{'loss': 2.2166, 'grad_norm': 1.0636425018310547, 'learning_rate': 0.00017726315789473685, 'epoch': 1.9}


 16%|█▌        | 80/500 [02:13<10:56,  1.56s/it]

{'loss': 1.9901, 'grad_norm': 0.9930613040924072, 'learning_rate': 0.0001768421052631579, 'epoch': 1.93}


 16%|█▌        | 81/500 [02:15<11:22,  1.63s/it]

{'loss': 2.4046, 'grad_norm': 0.9543510675430298, 'learning_rate': 0.00017642105263157896, 'epoch': 1.95}


 16%|█▋        | 82/500 [02:17<11:19,  1.63s/it]

{'loss': 2.3191, 'grad_norm': 0.9893255233764648, 'learning_rate': 0.00017600000000000002, 'epoch': 1.98}


 17%|█▋        | 83/500 [02:18<11:16,  1.62s/it]

{'loss': 2.2397, 'grad_norm': 1.050650715827942, 'learning_rate': 0.00017557894736842106, 'epoch': 2.0}


 17%|█▋        | 84/500 [02:20<11:00,  1.59s/it]

{'loss': 1.658, 'grad_norm': 1.1051651239395142, 'learning_rate': 0.00017515789473684213, 'epoch': 2.02}


 17%|█▋        | 85/500 [02:21<10:48,  1.56s/it]

{'loss': 1.7598, 'grad_norm': 1.054665446281433, 'learning_rate': 0.00017473684210526317, 'epoch': 2.05}


 17%|█▋        | 86/500 [02:23<11:16,  1.63s/it]

{'loss': 2.0737, 'grad_norm': 1.0714269876480103, 'learning_rate': 0.0001743157894736842, 'epoch': 2.07}


 17%|█▋        | 87/500 [02:25<11:01,  1.60s/it]

{'loss': 1.6636, 'grad_norm': 1.1060574054718018, 'learning_rate': 0.00017389473684210527, 'epoch': 2.1}


 18%|█▊        | 88/500 [02:26<10:52,  1.58s/it]

{'loss': 1.8343, 'grad_norm': 1.2016329765319824, 'learning_rate': 0.00017347368421052631, 'epoch': 2.12}


 18%|█▊        | 89/500 [02:28<10:47,  1.57s/it]

{'loss': 1.8754, 'grad_norm': 1.3101589679718018, 'learning_rate': 0.00017305263157894738, 'epoch': 2.14}


 18%|█▊        | 90/500 [02:29<10:37,  1.56s/it]

{'loss': 1.8023, 'grad_norm': 1.3626995086669922, 'learning_rate': 0.00017263157894736842, 'epoch': 2.17}


 18%|█▊        | 91/500 [02:31<11:07,  1.63s/it]

{'loss': 1.8312, 'grad_norm': 1.6161226034164429, 'learning_rate': 0.0001722105263157895, 'epoch': 2.19}


 18%|█▊        | 92/500 [02:33<10:51,  1.60s/it]

{'loss': 1.6378, 'grad_norm': 1.4677008390426636, 'learning_rate': 0.00017178947368421055, 'epoch': 2.22}


 19%|█▊        | 93/500 [02:35<11:36,  1.71s/it]

{'loss': 1.9137, 'grad_norm': 1.2634022235870361, 'learning_rate': 0.00017136842105263157, 'epoch': 2.24}


 19%|█▉        | 94/500 [02:36<11:35,  1.71s/it]

{'loss': 1.8438, 'grad_norm': 1.440286636352539, 'learning_rate': 0.00017094736842105263, 'epoch': 2.27}


 19%|█▉        | 95/500 [02:38<12:15,  1.81s/it]

{'loss': 1.7262, 'grad_norm': 1.1893020868301392, 'learning_rate': 0.0001705263157894737, 'epoch': 2.29}


 19%|█▉        | 96/500 [02:40<11:53,  1.77s/it]

{'loss': 1.6963, 'grad_norm': 1.3256844282150269, 'learning_rate': 0.00017010526315789474, 'epoch': 2.31}


 19%|█▉        | 97/500 [02:42<12:07,  1.81s/it]

{'loss': 1.7744, 'grad_norm': 1.1315375566482544, 'learning_rate': 0.0001696842105263158, 'epoch': 2.34}


 20%|█▉        | 98/500 [02:43<11:26,  1.71s/it]

{'loss': 1.7425, 'grad_norm': 1.3310579061508179, 'learning_rate': 0.00016926315789473684, 'epoch': 2.36}


 20%|█▉        | 99/500 [02:45<10:55,  1.63s/it]

{'loss': 1.6831, 'grad_norm': 1.3824604749679565, 'learning_rate': 0.0001688421052631579, 'epoch': 2.39}


 20%|██        | 100/500 [02:47<11:14,  1.69s/it]

{'loss': 1.7949, 'grad_norm': 1.3049181699752808, 'learning_rate': 0.00016842105263157895, 'epoch': 2.41}


 20%|██        | 101/500 [02:49<11:24,  1.72s/it]

{'loss': 1.8942, 'grad_norm': 1.3554216623306274, 'learning_rate': 0.000168, 'epoch': 2.43}


 20%|██        | 102/500 [02:50<11:22,  1.71s/it]

{'loss': 1.6979, 'grad_norm': 1.3933426141738892, 'learning_rate': 0.00016757894736842106, 'epoch': 2.46}


 21%|██        | 103/500 [02:52<11:35,  1.75s/it]

{'loss': 1.7314, 'grad_norm': 1.2345364093780518, 'learning_rate': 0.00016715789473684212, 'epoch': 2.48}


 21%|██        | 104/500 [02:54<11:21,  1.72s/it]

{'loss': 1.7004, 'grad_norm': 1.4717577695846558, 'learning_rate': 0.00016673684210526316, 'epoch': 2.51}


 21%|██        | 105/500 [02:55<11:22,  1.73s/it]

{'loss': 1.7336, 'grad_norm': 1.6516278982162476, 'learning_rate': 0.00016631578947368423, 'epoch': 2.53}


 21%|██        | 106/500 [02:57<11:13,  1.71s/it]

{'loss': 1.9669, 'grad_norm': 1.8625664710998535, 'learning_rate': 0.00016589473684210527, 'epoch': 2.55}


 21%|██▏       | 107/500 [02:59<11:23,  1.74s/it]

{'loss': 1.6776, 'grad_norm': 1.5008900165557861, 'learning_rate': 0.0001654736842105263, 'epoch': 2.58}


 22%|██▏       | 108/500 [03:01<11:12,  1.72s/it]

{'loss': 1.6758, 'grad_norm': 1.6386103630065918, 'learning_rate': 0.00016505263157894738, 'epoch': 2.6}


 22%|██▏       | 109/500 [03:02<10:38,  1.63s/it]

{'loss': 1.4864, 'grad_norm': 1.7869343757629395, 'learning_rate': 0.00016463157894736842, 'epoch': 2.63}


 22%|██▏       | 110/500 [03:04<10:28,  1.61s/it]

{'loss': 1.7123, 'grad_norm': 1.6810444593429565, 'learning_rate': 0.00016421052631578948, 'epoch': 2.65}


 22%|██▏       | 111/500 [03:05<10:54,  1.68s/it]

{'loss': 2.028, 'grad_norm': 1.668201208114624, 'learning_rate': 0.00016378947368421055, 'epoch': 2.67}


 22%|██▏       | 112/500 [03:07<10:37,  1.64s/it]

{'loss': 1.7964, 'grad_norm': 1.5596243143081665, 'learning_rate': 0.0001633684210526316, 'epoch': 2.7}


 23%|██▎       | 113/500 [03:09<10:38,  1.65s/it]

{'loss': 1.8055, 'grad_norm': 1.443603277206421, 'learning_rate': 0.00016294736842105265, 'epoch': 2.72}


 23%|██▎       | 114/500 [03:10<10:39,  1.66s/it]

{'loss': 1.8703, 'grad_norm': 1.4355714321136475, 'learning_rate': 0.0001625263157894737, 'epoch': 2.75}


 23%|██▎       | 115/500 [03:12<10:30,  1.64s/it]

{'loss': 1.8682, 'grad_norm': 1.57302987575531, 'learning_rate': 0.00016210526315789473, 'epoch': 2.77}


 23%|██▎       | 116/500 [03:14<10:34,  1.65s/it]

{'loss': 1.8816, 'grad_norm': 1.5071722269058228, 'learning_rate': 0.0001616842105263158, 'epoch': 2.8}


 23%|██▎       | 117/500 [03:15<10:28,  1.64s/it]

{'loss': 1.6446, 'grad_norm': 1.5554128885269165, 'learning_rate': 0.00016126315789473684, 'epoch': 2.82}


 24%|██▎       | 118/500 [03:17<10:09,  1.60s/it]

{'loss': 1.7165, 'grad_norm': 1.6131093502044678, 'learning_rate': 0.0001608421052631579, 'epoch': 2.84}


 24%|██▍       | 119/500 [03:18<10:11,  1.60s/it]

{'loss': 2.0146, 'grad_norm': 1.5349669456481934, 'learning_rate': 0.00016042105263157897, 'epoch': 2.87}


 24%|██▍       | 120/500 [03:20<10:30,  1.66s/it]

{'loss': 1.9265, 'grad_norm': 1.4343581199645996, 'learning_rate': 0.00016, 'epoch': 2.89}


 24%|██▍       | 121/500 [03:22<10:57,  1.73s/it]

{'loss': 1.9221, 'grad_norm': 1.3983229398727417, 'learning_rate': 0.00015957894736842105, 'epoch': 2.92}


 24%|██▍       | 122/500 [03:24<10:37,  1.69s/it]

{'loss': 1.8395, 'grad_norm': 1.5375988483428955, 'learning_rate': 0.00015915789473684212, 'epoch': 2.94}


 25%|██▍       | 123/500 [03:25<10:38,  1.69s/it]

{'loss': 1.8351, 'grad_norm': 1.4158580303192139, 'learning_rate': 0.00015873684210526316, 'epoch': 2.96}


 25%|██▍       | 124/500 [03:27<10:56,  1.74s/it]

{'loss': 1.9118, 'grad_norm': 1.5099085569381714, 'learning_rate': 0.00015831578947368423, 'epoch': 2.99}


 25%|██▌       | 125/500 [03:29<10:41,  1.71s/it]

{'loss': 1.7719, 'grad_norm': 1.6897785663604736, 'learning_rate': 0.00015789473684210527, 'epoch': 3.01}


 25%|██▌       | 126/500 [03:31<11:02,  1.77s/it]

{'loss': 1.424, 'grad_norm': 1.3243964910507202, 'learning_rate': 0.00015747368421052633, 'epoch': 3.04}


 25%|██▌       | 127/500 [03:32<10:52,  1.75s/it]

{'loss': 1.4175, 'grad_norm': 1.4719228744506836, 'learning_rate': 0.00015705263157894737, 'epoch': 3.06}


 26%|██▌       | 128/500 [03:34<10:34,  1.71s/it]

{'loss': 1.3904, 'grad_norm': 1.6325945854187012, 'learning_rate': 0.0001566315789473684, 'epoch': 3.08}


 26%|██▌       | 129/500 [03:36<10:13,  1.65s/it]

{'loss': 1.0912, 'grad_norm': 1.7501579523086548, 'learning_rate': 0.00015621052631578948, 'epoch': 3.11}


 26%|██▌       | 130/500 [03:37<09:59,  1.62s/it]

{'loss': 1.3007, 'grad_norm': 2.341691017150879, 'learning_rate': 0.00015578947368421052, 'epoch': 3.13}


 26%|██▌       | 131/500 [03:39<10:06,  1.64s/it]

{'loss': 1.3152, 'grad_norm': 2.7290077209472656, 'learning_rate': 0.00015536842105263158, 'epoch': 3.16}


 26%|██▋       | 132/500 [03:41<10:12,  1.66s/it]

{'loss': 1.5301, 'grad_norm': 2.709724187850952, 'learning_rate': 0.00015494736842105265, 'epoch': 3.18}


 27%|██▋       | 133/500 [03:42<10:16,  1.68s/it]

{'loss': 1.4998, 'grad_norm': 2.2701544761657715, 'learning_rate': 0.0001545263157894737, 'epoch': 3.2}


 27%|██▋       | 134/500 [03:44<09:54,  1.62s/it]

{'loss': 1.0307, 'grad_norm': 2.079949378967285, 'learning_rate': 0.00015410526315789476, 'epoch': 3.23}


 27%|██▋       | 135/500 [03:45<09:55,  1.63s/it]

{'loss': 1.3697, 'grad_norm': 1.915097713470459, 'learning_rate': 0.0001536842105263158, 'epoch': 3.25}


 27%|██▋       | 136/500 [03:47<10:22,  1.71s/it]

{'loss': 1.4327, 'grad_norm': 1.6949036121368408, 'learning_rate': 0.00015326315789473684, 'epoch': 3.28}


 27%|██▋       | 137/500 [03:49<10:10,  1.68s/it]

{'loss': 1.2284, 'grad_norm': 1.9140688180923462, 'learning_rate': 0.0001528421052631579, 'epoch': 3.3}


 28%|██▊       | 138/500 [03:50<09:47,  1.62s/it]

{'loss': 1.1977, 'grad_norm': 1.98179292678833, 'learning_rate': 0.00015242105263157894, 'epoch': 3.33}


 28%|██▊       | 139/500 [03:52<10:24,  1.73s/it]

{'loss': 1.5954, 'grad_norm': 1.6792705059051514, 'learning_rate': 0.000152, 'epoch': 3.35}


 28%|██▊       | 140/500 [03:54<10:04,  1.68s/it]

{'loss': 1.2144, 'grad_norm': 1.969305157661438, 'learning_rate': 0.00015157894736842108, 'epoch': 3.37}


 28%|██▊       | 141/500 [03:55<09:52,  1.65s/it]

{'loss': 1.1268, 'grad_norm': 1.9765000343322754, 'learning_rate': 0.00015115789473684211, 'epoch': 3.4}


 28%|██▊       | 142/500 [03:57<10:02,  1.68s/it]

{'loss': 1.2465, 'grad_norm': 1.8926576375961304, 'learning_rate': 0.00015073684210526315, 'epoch': 3.42}


 29%|██▊       | 143/500 [03:59<10:02,  1.69s/it]

{'loss': 1.3382, 'grad_norm': 2.0280656814575195, 'learning_rate': 0.00015031578947368422, 'epoch': 3.45}


 29%|██▉       | 144/500 [04:01<09:58,  1.68s/it]

{'loss': 1.3897, 'grad_norm': 2.35064959526062, 'learning_rate': 0.00014989473684210526, 'epoch': 3.47}


 29%|██▉       | 145/500 [04:02<10:13,  1.73s/it]

{'loss': 1.1836, 'grad_norm': 2.0723586082458496, 'learning_rate': 0.00014947368421052633, 'epoch': 3.49}


 29%|██▉       | 146/500 [04:04<10:03,  1.71s/it]

{'loss': 1.14, 'grad_norm': 1.9199000597000122, 'learning_rate': 0.00014905263157894737, 'epoch': 3.52}


 29%|██▉       | 147/500 [04:06<09:55,  1.69s/it]

{'loss': 1.0313, 'grad_norm': 1.8080188035964966, 'learning_rate': 0.00014863157894736843, 'epoch': 3.54}


 30%|██▉       | 148/500 [04:07<09:54,  1.69s/it]

{'loss': 1.38, 'grad_norm': 1.9800214767456055, 'learning_rate': 0.0001482105263157895, 'epoch': 3.57}


 30%|██▉       | 149/500 [04:09<10:20,  1.77s/it]

{'loss': 1.5935, 'grad_norm': 2.0510294437408447, 'learning_rate': 0.00014778947368421054, 'epoch': 3.59}


 30%|███       | 150/500 [04:11<10:12,  1.75s/it]

{'loss': 1.3798, 'grad_norm': 2.040245771408081, 'learning_rate': 0.00014736842105263158, 'epoch': 3.61}


 30%|███       | 151/500 [04:13<10:01,  1.72s/it]

{'loss': 1.5253, 'grad_norm': 2.091099500656128, 'learning_rate': 0.00014694736842105265, 'epoch': 3.64}


 30%|███       | 152/500 [04:14<09:44,  1.68s/it]

{'loss': 1.4085, 'grad_norm': 2.1836836338043213, 'learning_rate': 0.00014652631578947369, 'epoch': 3.66}


 31%|███       | 153/500 [04:16<10:14,  1.77s/it]

{'loss': 1.5935, 'grad_norm': 1.943793535232544, 'learning_rate': 0.00014610526315789475, 'epoch': 3.69}


 31%|███       | 154/500 [04:18<10:02,  1.74s/it]

{'loss': 1.3351, 'grad_norm': 2.0758631229400635, 'learning_rate': 0.0001456842105263158, 'epoch': 3.71}


 31%|███       | 155/500 [04:20<10:02,  1.75s/it]

{'loss': 1.4323, 'grad_norm': 2.118724822998047, 'learning_rate': 0.00014526315789473686, 'epoch': 3.73}


 31%|███       | 156/500 [04:21<09:45,  1.70s/it]

{'loss': 1.3387, 'grad_norm': 2.2115209102630615, 'learning_rate': 0.00014484210526315792, 'epoch': 3.76}


 31%|███▏      | 157/500 [04:23<09:25,  1.65s/it]

{'loss': 1.2004, 'grad_norm': 2.0251834392547607, 'learning_rate': 0.00014442105263157894, 'epoch': 3.78}


 32%|███▏      | 158/500 [04:25<09:49,  1.72s/it]

{'loss': 1.3692, 'grad_norm': 1.7840253114700317, 'learning_rate': 0.000144, 'epoch': 3.81}


 32%|███▏      | 159/500 [04:27<10:10,  1.79s/it]

{'loss': 1.5273, 'grad_norm': 1.882995843887329, 'learning_rate': 0.00014357894736842107, 'epoch': 3.83}


 32%|███▏      | 160/500 [04:28<09:48,  1.73s/it]

{'loss': 1.1452, 'grad_norm': 1.960376501083374, 'learning_rate': 0.0001431578947368421, 'epoch': 3.86}


 32%|███▏      | 161/500 [04:30<09:32,  1.69s/it]

{'loss': 1.2165, 'grad_norm': 2.0250635147094727, 'learning_rate': 0.00014273684210526318, 'epoch': 3.88}


 32%|███▏      | 162/500 [04:32<10:15,  1.82s/it]

{'loss': 1.4508, 'grad_norm': 1.856179118156433, 'learning_rate': 0.00014231578947368422, 'epoch': 3.9}


 33%|███▎      | 163/500 [04:34<09:41,  1.73s/it]

{'loss': 1.1286, 'grad_norm': 2.1529223918914795, 'learning_rate': 0.00014189473684210528, 'epoch': 3.93}


 33%|███▎      | 164/500 [04:35<09:30,  1.70s/it]

{'loss': 1.3771, 'grad_norm': 2.336392879486084, 'learning_rate': 0.00014147368421052632, 'epoch': 3.95}


 33%|███▎      | 165/500 [04:37<09:02,  1.62s/it]

{'loss': 1.2445, 'grad_norm': 2.3633904457092285, 'learning_rate': 0.00014105263157894736, 'epoch': 3.98}


 33%|███▎      | 166/500 [04:38<08:57,  1.61s/it]

{'loss': 1.5379, 'grad_norm': 2.4693098068237305, 'learning_rate': 0.00014063157894736843, 'epoch': 4.0}


 33%|███▎      | 167/500 [04:40<09:30,  1.71s/it]

{'loss': 1.1291, 'grad_norm': 2.023165225982666, 'learning_rate': 0.00014021052631578947, 'epoch': 4.02}


 34%|███▎      | 168/500 [04:42<09:17,  1.68s/it]

{'loss': 0.9105, 'grad_norm': 1.9920583963394165, 'learning_rate': 0.00013978947368421053, 'epoch': 4.05}


 34%|███▍      | 169/500 [04:44<09:24,  1.70s/it]

{'loss': 0.9408, 'grad_norm': 1.7924647331237793, 'learning_rate': 0.0001393684210526316, 'epoch': 4.07}


 34%|███▍      | 170/500 [04:45<09:38,  1.75s/it]

{'loss': 0.9581, 'grad_norm': 1.8156208992004395, 'learning_rate': 0.00013894736842105264, 'epoch': 4.1}


 34%|███▍      | 171/500 [04:47<09:33,  1.74s/it]

{'loss': 0.7934, 'grad_norm': 2.3053455352783203, 'learning_rate': 0.00013852631578947368, 'epoch': 4.12}


 34%|███▍      | 172/500 [04:49<09:20,  1.71s/it]

{'loss': 0.8846, 'grad_norm': 3.010795831680298, 'learning_rate': 0.00013810526315789475, 'epoch': 4.14}


 35%|███▍      | 173/500 [04:50<09:21,  1.72s/it]

{'loss': 0.8265, 'grad_norm': 2.71185040473938, 'learning_rate': 0.0001376842105263158, 'epoch': 4.17}


 35%|███▍      | 174/500 [04:53<09:51,  1.81s/it]

{'loss': 1.0326, 'grad_norm': 2.9258556365966797, 'learning_rate': 0.00013726315789473685, 'epoch': 4.19}


 35%|███▌      | 175/500 [04:54<09:32,  1.76s/it]

{'loss': 0.6535, 'grad_norm': 2.744638681411743, 'learning_rate': 0.0001368421052631579, 'epoch': 4.22}


 35%|███▌      | 176/500 [04:56<09:11,  1.70s/it]

{'loss': 0.7904, 'grad_norm': 2.5877623558044434, 'learning_rate': 0.00013642105263157896, 'epoch': 4.24}


 35%|███▌      | 177/500 [04:57<08:46,  1.63s/it]

{'loss': 0.6166, 'grad_norm': 3.1240806579589844, 'learning_rate': 0.00013600000000000003, 'epoch': 4.27}


 36%|███▌      | 178/500 [04:59<08:55,  1.66s/it]

{'loss': 0.9788, 'grad_norm': 2.262741804122925, 'learning_rate': 0.00013557894736842104, 'epoch': 4.29}


 36%|███▌      | 179/500 [05:01<08:48,  1.64s/it]

{'loss': 0.7646, 'grad_norm': 2.1655426025390625, 'learning_rate': 0.0001351578947368421, 'epoch': 4.31}


 36%|███▌      | 180/500 [05:02<09:17,  1.74s/it]

{'loss': 0.9134, 'grad_norm': 2.0806772708892822, 'learning_rate': 0.00013473684210526317, 'epoch': 4.34}


 36%|███▌      | 181/500 [05:04<08:58,  1.69s/it]

{'loss': 0.8571, 'grad_norm': 2.5204648971557617, 'learning_rate': 0.0001343157894736842, 'epoch': 4.36}


 36%|███▋      | 182/500 [05:06<08:54,  1.68s/it]

{'loss': 0.8885, 'grad_norm': 2.6774654388427734, 'learning_rate': 0.00013389473684210528, 'epoch': 4.39}


 37%|███▋      | 183/500 [05:07<08:45,  1.66s/it]

{'loss': 0.823, 'grad_norm': 2.58516788482666, 'learning_rate': 0.00013347368421052632, 'epoch': 4.41}


 37%|███▋      | 184/500 [05:09<08:54,  1.69s/it]

{'loss': 0.8131, 'grad_norm': 2.381504774093628, 'learning_rate': 0.00013305263157894738, 'epoch': 4.43}


 37%|███▋      | 185/500 [05:11<09:17,  1.77s/it]

{'loss': 1.0203, 'grad_norm': 2.5888590812683105, 'learning_rate': 0.00013263157894736842, 'epoch': 4.46}


 37%|███▋      | 186/500 [05:13<09:14,  1.76s/it]

{'loss': 1.0718, 'grad_norm': 2.807676076889038, 'learning_rate': 0.00013221052631578946, 'epoch': 4.48}


 37%|███▋      | 187/500 [05:14<08:42,  1.67s/it]

{'loss': 0.5494, 'grad_norm': 2.3961708545684814, 'learning_rate': 0.00013178947368421053, 'epoch': 4.51}


 38%|███▊      | 188/500 [05:16<08:38,  1.66s/it]

{'loss': 0.8094, 'grad_norm': 2.29587459564209, 'learning_rate': 0.0001313684210526316, 'epoch': 4.53}


 38%|███▊      | 189/500 [05:17<08:32,  1.65s/it]

{'loss': 0.8069, 'grad_norm': 2.126481771469116, 'learning_rate': 0.00013094736842105264, 'epoch': 4.55}


 38%|███▊      | 190/500 [05:19<08:29,  1.64s/it]

{'loss': 0.8826, 'grad_norm': 2.3889880180358887, 'learning_rate': 0.0001305263157894737, 'epoch': 4.58}


 38%|███▊      | 191/500 [05:21<08:40,  1.68s/it]

{'loss': 0.9498, 'grad_norm': 2.5865137577056885, 'learning_rate': 0.00013010526315789474, 'epoch': 4.6}


 38%|███▊      | 192/500 [05:23<08:52,  1.73s/it]

{'loss': 1.158, 'grad_norm': 3.063196897506714, 'learning_rate': 0.00012968421052631578, 'epoch': 4.63}


 39%|███▊      | 193/500 [05:24<08:44,  1.71s/it]

{'loss': 0.966, 'grad_norm': 2.727553129196167, 'learning_rate': 0.00012926315789473685, 'epoch': 4.65}


 39%|███▉      | 194/500 [05:26<08:34,  1.68s/it]

{'loss': 0.8171, 'grad_norm': 2.2089695930480957, 'learning_rate': 0.0001288421052631579, 'epoch': 4.67}


 39%|███▉      | 195/500 [05:28<08:41,  1.71s/it]

{'loss': 0.9444, 'grad_norm': 2.562173843383789, 'learning_rate': 0.00012842105263157895, 'epoch': 4.7}


 39%|███▉      | 196/500 [05:30<08:40,  1.71s/it]

{'loss': 0.9191, 'grad_norm': 2.312796115875244, 'learning_rate': 0.00012800000000000002, 'epoch': 4.72}


 39%|███▉      | 197/500 [05:31<08:27,  1.68s/it]

{'loss': 0.8011, 'grad_norm': 2.5510637760162354, 'learning_rate': 0.00012757894736842106, 'epoch': 4.75}


 40%|███▉      | 198/500 [05:33<08:25,  1.67s/it]

{'loss': 0.691, 'grad_norm': 1.992111086845398, 'learning_rate': 0.00012715789473684213, 'epoch': 4.77}


 40%|███▉      | 199/500 [05:35<08:50,  1.76s/it]

{'loss': 1.1704, 'grad_norm': 2.314423084259033, 'learning_rate': 0.00012673684210526317, 'epoch': 4.8}


 40%|████      | 200/500 [05:36<08:30,  1.70s/it]

{'loss': 0.981, 'grad_norm': 2.7423808574676514, 'learning_rate': 0.0001263157894736842, 'epoch': 4.82}


 40%|████      | 201/500 [05:38<08:27,  1.70s/it]

{'loss': 1.1585, 'grad_norm': 2.6459336280822754, 'learning_rate': 0.00012589473684210527, 'epoch': 4.84}


 40%|████      | 202/500 [05:40<08:12,  1.65s/it]

{'loss': 0.778, 'grad_norm': 2.701476573944092, 'learning_rate': 0.0001254736842105263, 'epoch': 4.87}


 41%|████      | 203/500 [05:41<08:01,  1.62s/it]

{'loss': 0.9874, 'grad_norm': 2.703544855117798, 'learning_rate': 0.00012505263157894738, 'epoch': 4.89}


 41%|████      | 204/500 [05:43<08:07,  1.65s/it]

{'loss': 0.9591, 'grad_norm': 2.238236904144287, 'learning_rate': 0.00012463157894736845, 'epoch': 4.92}


 41%|████      | 205/500 [05:44<08:00,  1.63s/it]

{'loss': 0.8815, 'grad_norm': 2.6052327156066895, 'learning_rate': 0.00012421052631578949, 'epoch': 4.94}


 41%|████      | 206/500 [05:46<08:03,  1.64s/it]

{'loss': 1.1627, 'grad_norm': 2.5194766521453857, 'learning_rate': 0.00012378947368421053, 'epoch': 4.96}


 41%|████▏     | 207/500 [05:48<07:53,  1.61s/it]

{'loss': 0.9143, 'grad_norm': 2.5825424194335938, 'learning_rate': 0.00012336842105263156, 'epoch': 4.99}


 42%|████▏     | 208/500 [05:49<07:43,  1.59s/it]

{'loss': 0.8078, 'grad_norm': 2.617790937423706, 'learning_rate': 0.00012294736842105263, 'epoch': 5.01}


 42%|████▏     | 209/500 [05:51<07:53,  1.63s/it]

{'loss': 0.6025, 'grad_norm': 1.9741729497909546, 'learning_rate': 0.0001225263157894737, 'epoch': 5.04}


 42%|████▏     | 210/500 [05:53<08:02,  1.66s/it]

{'loss': 0.6094, 'grad_norm': 2.2149429321289062, 'learning_rate': 0.00012210526315789474, 'epoch': 5.06}


 42%|████▏     | 211/500 [05:54<07:44,  1.61s/it]

{'loss': 0.5543, 'grad_norm': 2.454779863357544, 'learning_rate': 0.0001216842105263158, 'epoch': 5.08}


 42%|████▏     | 212/500 [05:56<07:47,  1.62s/it]

{'loss': 0.5482, 'grad_norm': 2.550348997116089, 'learning_rate': 0.00012126315789473686, 'epoch': 5.11}


 43%|████▎     | 213/500 [05:57<07:38,  1.60s/it]

{'loss': 0.5857, 'grad_norm': 3.675283908843994, 'learning_rate': 0.00012084210526315788, 'epoch': 5.13}


 43%|████▎     | 214/500 [05:59<07:27,  1.57s/it]

{'loss': 0.4864, 'grad_norm': 3.5347402095794678, 'learning_rate': 0.00012042105263157895, 'epoch': 5.16}


 43%|████▎     | 215/500 [06:00<07:37,  1.60s/it]

{'loss': 0.6228, 'grad_norm': 4.815011978149414, 'learning_rate': 0.00012, 'epoch': 5.18}


 43%|████▎     | 216/500 [06:02<08:02,  1.70s/it]

{'loss': 0.7371, 'grad_norm': 4.176328182220459, 'learning_rate': 0.00011957894736842106, 'epoch': 5.2}


 43%|████▎     | 217/500 [06:04<07:37,  1.62s/it]

{'loss': 0.379, 'grad_norm': 3.415008544921875, 'learning_rate': 0.00011915789473684211, 'epoch': 5.23}


 44%|████▎     | 218/500 [06:06<07:52,  1.68s/it]

{'loss': 0.721, 'grad_norm': 2.73758602142334, 'learning_rate': 0.00011873684210526316, 'epoch': 5.25}


 44%|████▍     | 219/500 [06:07<07:45,  1.66s/it]

{'loss': 0.581, 'grad_norm': 2.755660057067871, 'learning_rate': 0.00011831578947368423, 'epoch': 5.28}


 44%|████▍     | 220/500 [06:09<07:38,  1.64s/it]

{'loss': 0.63, 'grad_norm': 2.706190586090088, 'learning_rate': 0.00011789473684210525, 'epoch': 5.3}


 44%|████▍     | 221/500 [06:11<07:54,  1.70s/it]

{'loss': 0.6186, 'grad_norm': 2.1237294673919678, 'learning_rate': 0.00011747368421052631, 'epoch': 5.33}


 44%|████▍     | 222/500 [06:12<08:02,  1.73s/it]

{'loss': 0.5291, 'grad_norm': 2.297917366027832, 'learning_rate': 0.00011705263157894737, 'epoch': 5.35}


 45%|████▍     | 223/500 [06:14<07:39,  1.66s/it]

{'loss': 0.3352, 'grad_norm': 2.3658578395843506, 'learning_rate': 0.00011663157894736843, 'epoch': 5.37}


 45%|████▍     | 224/500 [06:16<07:51,  1.71s/it]

{'loss': 0.611, 'grad_norm': 2.3109943866729736, 'learning_rate': 0.00011621052631578948, 'epoch': 5.4}


 45%|████▌     | 225/500 [06:18<07:53,  1.72s/it]

{'loss': 0.5239, 'grad_norm': 2.427384853363037, 'learning_rate': 0.00011578947368421053, 'epoch': 5.42}


 45%|████▌     | 226/500 [06:19<07:42,  1.69s/it]

{'loss': 0.5175, 'grad_norm': 2.6738903522491455, 'learning_rate': 0.00011536842105263159, 'epoch': 5.45}


 45%|████▌     | 227/500 [06:21<08:02,  1.77s/it]

{'loss': 0.7751, 'grad_norm': 3.191154956817627, 'learning_rate': 0.00011494736842105265, 'epoch': 5.47}


 46%|████▌     | 228/500 [06:23<07:43,  1.70s/it]

{'loss': 0.4319, 'grad_norm': 3.3787784576416016, 'learning_rate': 0.00011452631578947368, 'epoch': 5.49}


 46%|████▌     | 229/500 [06:24<07:36,  1.69s/it]

{'loss': 0.5768, 'grad_norm': 3.3768348693847656, 'learning_rate': 0.00011410526315789473, 'epoch': 5.52}


 46%|████▌     | 230/500 [06:26<07:40,  1.71s/it]

{'loss': 0.7372, 'grad_norm': 3.5722553730010986, 'learning_rate': 0.0001136842105263158, 'epoch': 5.54}


 46%|████▌     | 231/500 [06:28<07:38,  1.70s/it]

{'loss': 0.5857, 'grad_norm': 2.8985447883605957, 'learning_rate': 0.00011326315789473685, 'epoch': 5.57}


 46%|████▋     | 232/500 [06:29<07:39,  1.71s/it]

{'loss': 0.6152, 'grad_norm': 2.6789114475250244, 'learning_rate': 0.0001128421052631579, 'epoch': 5.59}


 47%|████▋     | 233/500 [06:31<07:43,  1.74s/it]

{'loss': 0.672, 'grad_norm': 2.3002665042877197, 'learning_rate': 0.00011242105263157896, 'epoch': 5.61}


 47%|████▋     | 234/500 [06:33<07:45,  1.75s/it]

{'loss': 0.5268, 'grad_norm': 2.367551326751709, 'learning_rate': 0.00011200000000000001, 'epoch': 5.64}


 47%|████▋     | 235/500 [06:35<07:50,  1.77s/it]

{'loss': 0.6667, 'grad_norm': 2.204556941986084, 'learning_rate': 0.00011157894736842105, 'epoch': 5.66}


 47%|████▋     | 236/500 [06:37<07:40,  1.74s/it]

{'loss': 0.7233, 'grad_norm': 2.462494134902954, 'learning_rate': 0.0001111578947368421, 'epoch': 5.69}


 47%|████▋     | 237/500 [06:38<07:50,  1.79s/it]

{'loss': 0.706, 'grad_norm': 2.2806992530822754, 'learning_rate': 0.00011073684210526316, 'epoch': 5.71}


 48%|████▊     | 238/500 [06:40<07:40,  1.76s/it]

{'loss': 0.6507, 'grad_norm': 2.7368061542510986, 'learning_rate': 0.00011031578947368421, 'epoch': 5.73}


 48%|████▊     | 239/500 [06:42<07:25,  1.71s/it]

{'loss': 0.6522, 'grad_norm': 3.066948413848877, 'learning_rate': 0.00010989473684210528, 'epoch': 5.76}


 48%|████▊     | 240/500 [06:43<07:13,  1.67s/it]

{'loss': 0.4914, 'grad_norm': 2.7972702980041504, 'learning_rate': 0.00010947368421052633, 'epoch': 5.78}


 48%|████▊     | 241/500 [06:45<07:22,  1.71s/it]

{'loss': 0.7065, 'grad_norm': 2.831768274307251, 'learning_rate': 0.00010905263157894738, 'epoch': 5.81}


 48%|████▊     | 242/500 [06:47<07:08,  1.66s/it]

{'loss': 0.538, 'grad_norm': 2.792747735977173, 'learning_rate': 0.00010863157894736842, 'epoch': 5.83}


 49%|████▊     | 243/500 [06:49<07:21,  1.72s/it]

{'loss': 0.6265, 'grad_norm': 2.755983352661133, 'learning_rate': 0.00010821052631578948, 'epoch': 5.86}


 49%|████▉     | 244/500 [06:51<07:45,  1.82s/it]

{'loss': 0.7171, 'grad_norm': 2.4245569705963135, 'learning_rate': 0.00010778947368421053, 'epoch': 5.88}


 49%|████▉     | 245/500 [06:52<07:41,  1.81s/it]

{'loss': 0.5466, 'grad_norm': 2.3249268531799316, 'learning_rate': 0.00010736842105263158, 'epoch': 5.9}


 49%|████▉     | 246/500 [06:54<07:30,  1.77s/it]

{'loss': 0.4124, 'grad_norm': 2.4874746799468994, 'learning_rate': 0.00010694736842105264, 'epoch': 5.93}


 49%|████▉     | 247/500 [06:56<07:16,  1.72s/it]

{'loss': 0.522, 'grad_norm': 2.5149567127227783, 'learning_rate': 0.0001065263157894737, 'epoch': 5.95}


 50%|████▉     | 248/500 [06:57<07:01,  1.67s/it]

{'loss': 0.4141, 'grad_norm': 2.6795623302459717, 'learning_rate': 0.00010610526315789476, 'epoch': 5.98}


 50%|████▉     | 249/500 [06:59<06:40,  1.60s/it]

{'loss': 0.4084, 'grad_norm': 2.612126111984253, 'learning_rate': 0.00010568421052631578, 'epoch': 6.0}


 50%|█████     | 250/500 [07:00<06:40,  1.60s/it]

{'loss': 0.3706, 'grad_norm': 2.36419677734375, 'learning_rate': 0.00010526315789473685, 'epoch': 6.02}


 50%|█████     | 251/500 [07:02<06:30,  1.57s/it]

{'loss': 0.2638, 'grad_norm': 1.803823471069336, 'learning_rate': 0.0001048421052631579, 'epoch': 6.05}


 50%|█████     | 252/500 [07:03<06:26,  1.56s/it]

{'loss': 0.2437, 'grad_norm': 2.1426825523376465, 'learning_rate': 0.00010442105263157895, 'epoch': 6.07}


 51%|█████     | 253/500 [07:05<06:39,  1.62s/it]

{'loss': 0.4036, 'grad_norm': 2.802840232849121, 'learning_rate': 0.00010400000000000001, 'epoch': 6.1}


 51%|█████     | 254/500 [07:07<06:33,  1.60s/it]

{'loss': 0.2451, 'grad_norm': 2.3015096187591553, 'learning_rate': 0.00010357894736842106, 'epoch': 6.12}


 51%|█████     | 255/500 [07:08<06:30,  1.60s/it]

{'loss': 0.3067, 'grad_norm': 3.1820335388183594, 'learning_rate': 0.00010315789473684211, 'epoch': 6.14}


 51%|█████     | 256/500 [07:10<06:45,  1.66s/it]

{'loss': 0.359, 'grad_norm': 3.388155221939087, 'learning_rate': 0.00010273684210526315, 'epoch': 6.17}


 51%|█████▏    | 257/500 [07:12<06:34,  1.62s/it]

{'loss': 0.2819, 'grad_norm': 3.2394983768463135, 'learning_rate': 0.0001023157894736842, 'epoch': 6.19}


 52%|█████▏    | 258/500 [07:13<06:28,  1.61s/it]

{'loss': 0.3208, 'grad_norm': 3.158623456954956, 'learning_rate': 0.00010189473684210526, 'epoch': 6.22}


 52%|█████▏    | 259/500 [07:15<06:19,  1.57s/it]

{'loss': 0.2981, 'grad_norm': 3.2888288497924805, 'learning_rate': 0.00010147368421052633, 'epoch': 6.24}


 52%|█████▏    | 260/500 [07:16<06:10,  1.55s/it]

{'loss': 0.2865, 'grad_norm': 3.0013020038604736, 'learning_rate': 0.00010105263157894738, 'epoch': 6.27}


 52%|█████▏    | 261/500 [07:18<06:31,  1.64s/it]

{'loss': 0.4261, 'grad_norm': 2.5055882930755615, 'learning_rate': 0.00010063157894736843, 'epoch': 6.29}


 52%|█████▏    | 262/500 [07:19<06:24,  1.61s/it]

{'loss': 0.3113, 'grad_norm': 2.572908878326416, 'learning_rate': 0.00010021052631578949, 'epoch': 6.31}


 53%|█████▎    | 263/500 [07:21<06:24,  1.62s/it]

{'loss': 0.4044, 'grad_norm': 2.376322031021118, 'learning_rate': 9.978947368421054e-05, 'epoch': 6.34}


 53%|█████▎    | 264/500 [07:23<06:23,  1.62s/it]

{'loss': 0.2394, 'grad_norm': 2.1003897190093994, 'learning_rate': 9.936842105263159e-05, 'epoch': 6.36}


 53%|█████▎    | 265/500 [07:24<06:18,  1.61s/it]

{'loss': 0.2503, 'grad_norm': 2.2721104621887207, 'learning_rate': 9.894736842105263e-05, 'epoch': 6.39}


 53%|█████▎    | 266/500 [07:26<06:32,  1.68s/it]

{'loss': 0.4175, 'grad_norm': 2.721989154815674, 'learning_rate': 9.852631578947368e-05, 'epoch': 6.41}


 53%|█████▎    | 267/500 [07:28<06:51,  1.77s/it]

{'loss': 0.3612, 'grad_norm': 2.2029716968536377, 'learning_rate': 9.810526315789475e-05, 'epoch': 6.43}


 54%|█████▎    | 268/500 [07:30<06:51,  1.77s/it]

{'loss': 0.3458, 'grad_norm': 2.1442136764526367, 'learning_rate': 9.768421052631579e-05, 'epoch': 6.46}


 54%|█████▍    | 269/500 [07:32<06:38,  1.73s/it]

{'loss': 0.3579, 'grad_norm': 2.6591575145721436, 'learning_rate': 9.726315789473684e-05, 'epoch': 6.48}


 54%|█████▍    | 270/500 [07:33<06:23,  1.67s/it]

{'loss': 0.2943, 'grad_norm': 3.2778916358947754, 'learning_rate': 9.68421052631579e-05, 'epoch': 6.51}


 54%|█████▍    | 271/500 [07:35<06:15,  1.64s/it]

{'loss': 0.2792, 'grad_norm': 3.118161678314209, 'learning_rate': 9.642105263157896e-05, 'epoch': 6.53}


 54%|█████▍    | 272/500 [07:37<06:32,  1.72s/it]

{'loss': 0.3627, 'grad_norm': 2.729053497314453, 'learning_rate': 9.6e-05, 'epoch': 6.55}


 55%|█████▍    | 273/500 [07:38<06:22,  1.69s/it]

{'loss': 0.3058, 'grad_norm': 2.671386957168579, 'learning_rate': 9.557894736842106e-05, 'epoch': 6.58}


 55%|█████▍    | 274/500 [07:40<06:19,  1.68s/it]

{'loss': 0.2546, 'grad_norm': 2.281230926513672, 'learning_rate': 9.515789473684211e-05, 'epoch': 6.6}


 55%|█████▌    | 275/500 [07:42<06:21,  1.69s/it]

{'loss': 0.2705, 'grad_norm': 2.415130853652954, 'learning_rate': 9.473684210526316e-05, 'epoch': 6.63}


 55%|█████▌    | 276/500 [07:43<06:21,  1.70s/it]

{'loss': 0.2967, 'grad_norm': 2.8390121459960938, 'learning_rate': 9.431578947368421e-05, 'epoch': 6.65}


 55%|█████▌    | 277/500 [07:45<06:25,  1.73s/it]

{'loss': 0.3983, 'grad_norm': 2.8941378593444824, 'learning_rate': 9.389473684210527e-05, 'epoch': 6.67}


 56%|█████▌    | 278/500 [07:47<06:30,  1.76s/it]

{'loss': 0.3921, 'grad_norm': 2.9017539024353027, 'learning_rate': 9.347368421052632e-05, 'epoch': 6.7}


 56%|█████▌    | 279/500 [07:49<06:33,  1.78s/it]

{'loss': 0.3964, 'grad_norm': 2.768937826156616, 'learning_rate': 9.305263157894737e-05, 'epoch': 6.72}


 56%|█████▌    | 280/500 [07:50<06:23,  1.74s/it]

{'loss': 0.4093, 'grad_norm': 2.9627318382263184, 'learning_rate': 9.263157894736843e-05, 'epoch': 6.75}


 56%|█████▌    | 281/500 [07:52<06:27,  1.77s/it]

{'loss': 0.3957, 'grad_norm': 2.6707592010498047, 'learning_rate': 9.221052631578948e-05, 'epoch': 6.77}


 56%|█████▋    | 282/500 [07:54<06:33,  1.81s/it]

{'loss': 0.3978, 'grad_norm': 2.6021523475646973, 'learning_rate': 9.178947368421052e-05, 'epoch': 6.8}


 57%|█████▋    | 283/500 [07:56<06:25,  1.78s/it]

{'loss': 0.2897, 'grad_norm': 2.774298906326294, 'learning_rate': 9.136842105263159e-05, 'epoch': 6.82}


 57%|█████▋    | 284/500 [07:58<06:22,  1.77s/it]

{'loss': 0.4787, 'grad_norm': 3.172938346862793, 'learning_rate': 9.094736842105264e-05, 'epoch': 6.84}


 57%|█████▋    | 285/500 [07:59<06:19,  1.76s/it]

{'loss': 0.359, 'grad_norm': 2.622363328933716, 'learning_rate': 9.052631578947369e-05, 'epoch': 6.87}


 57%|█████▋    | 286/500 [08:01<06:00,  1.69s/it]

{'loss': 0.3419, 'grad_norm': 3.0234527587890625, 'learning_rate': 9.010526315789473e-05, 'epoch': 6.89}


 57%|█████▋    | 287/500 [08:03<06:08,  1.73s/it]

{'loss': 0.3644, 'grad_norm': 2.617647886276245, 'learning_rate': 8.96842105263158e-05, 'epoch': 6.92}


 58%|█████▊    | 288/500 [08:04<06:08,  1.74s/it]

{'loss': 0.3989, 'grad_norm': 2.358734130859375, 'learning_rate': 8.926315789473685e-05, 'epoch': 6.94}


 58%|█████▊    | 289/500 [08:06<06:27,  1.83s/it]

{'loss': 0.5754, 'grad_norm': 2.5203819274902344, 'learning_rate': 8.88421052631579e-05, 'epoch': 6.96}


 58%|█████▊    | 290/500 [08:08<06:10,  1.77s/it]

{'loss': 0.3024, 'grad_norm': 2.46043062210083, 'learning_rate': 8.842105263157894e-05, 'epoch': 6.99}


 58%|█████▊    | 291/500 [08:09<05:47,  1.66s/it]

{'loss': 0.2369, 'grad_norm': 2.9207046031951904, 'learning_rate': 8.800000000000001e-05, 'epoch': 7.01}


 58%|█████▊    | 292/500 [08:11<05:54,  1.71s/it]

{'loss': 0.2548, 'grad_norm': 1.908612608909607, 'learning_rate': 8.757894736842106e-05, 'epoch': 7.04}


 59%|█████▊    | 293/500 [08:13<05:49,  1.69s/it]

{'loss': 0.2134, 'grad_norm': 1.995477318763733, 'learning_rate': 8.71578947368421e-05, 'epoch': 7.06}


 59%|█████▉    | 294/500 [08:15<05:52,  1.71s/it]

{'loss': 0.2115, 'grad_norm': 2.0631203651428223, 'learning_rate': 8.673684210526316e-05, 'epoch': 7.08}


 59%|█████▉    | 295/500 [08:16<05:36,  1.64s/it]

{'loss': 0.1741, 'grad_norm': 2.37304949760437, 'learning_rate': 8.631578947368421e-05, 'epoch': 7.11}


 59%|█████▉    | 296/500 [08:18<05:35,  1.64s/it]

{'loss': 0.1781, 'grad_norm': 2.1060283184051514, 'learning_rate': 8.589473684210528e-05, 'epoch': 7.13}


 59%|█████▉    | 297/500 [08:19<05:26,  1.61s/it]

{'loss': 0.1392, 'grad_norm': 2.0208170413970947, 'learning_rate': 8.547368421052632e-05, 'epoch': 7.16}


 60%|█████▉    | 298/500 [08:21<05:33,  1.65s/it]

{'loss': 0.2738, 'grad_norm': 3.2392702102661133, 'learning_rate': 8.505263157894737e-05, 'epoch': 7.18}


 60%|█████▉    | 299/500 [08:23<05:44,  1.71s/it]

{'loss': 0.2282, 'grad_norm': 3.282130479812622, 'learning_rate': 8.463157894736842e-05, 'epoch': 7.2}


 60%|██████    | 300/500 [08:25<05:43,  1.72s/it]

{'loss': 0.2493, 'grad_norm': 4.450844764709473, 'learning_rate': 8.421052631578948e-05, 'epoch': 7.23}


 60%|██████    | 301/500 [08:26<05:34,  1.68s/it]

{'loss': 0.1444, 'grad_norm': 2.47843337059021, 'learning_rate': 8.378947368421053e-05, 'epoch': 7.25}


 60%|██████    | 302/500 [08:28<05:34,  1.69s/it]

{'loss': 0.1782, 'grad_norm': 3.037583589553833, 'learning_rate': 8.336842105263158e-05, 'epoch': 7.28}


 61%|██████    | 303/500 [08:30<05:23,  1.64s/it]

{'loss': 0.1578, 'grad_norm': 2.7852814197540283, 'learning_rate': 8.294736842105263e-05, 'epoch': 7.3}


 61%|██████    | 304/500 [08:31<05:30,  1.69s/it]

{'loss': 0.2198, 'grad_norm': 2.9506797790527344, 'learning_rate': 8.252631578947369e-05, 'epoch': 7.33}


 61%|██████    | 305/500 [08:33<05:24,  1.67s/it]

{'loss': 0.1719, 'grad_norm': 2.548274040222168, 'learning_rate': 8.210526315789474e-05, 'epoch': 7.35}


 61%|██████    | 306/500 [08:35<05:37,  1.74s/it]

{'loss': 0.2516, 'grad_norm': 1.9118486642837524, 'learning_rate': 8.16842105263158e-05, 'epoch': 7.37}


 61%|██████▏   | 307/500 [08:36<05:23,  1.67s/it]

{'loss': 0.1666, 'grad_norm': 1.8154773712158203, 'learning_rate': 8.126315789473685e-05, 'epoch': 7.4}


 62%|██████▏   | 308/500 [08:38<05:27,  1.71s/it]

{'loss': 0.2035, 'grad_norm': 1.958686351776123, 'learning_rate': 8.08421052631579e-05, 'epoch': 7.42}


 62%|██████▏   | 309/500 [08:40<05:38,  1.77s/it]

{'loss': 0.2526, 'grad_norm': 2.008695363998413, 'learning_rate': 8.042105263157895e-05, 'epoch': 7.45}


 62%|██████▏   | 310/500 [08:42<05:21,  1.69s/it]

{'loss': 0.1328, 'grad_norm': 1.6281476020812988, 'learning_rate': 8e-05, 'epoch': 7.47}


 62%|██████▏   | 311/500 [08:43<05:16,  1.68s/it]

{'loss': 0.2948, 'grad_norm': 2.8210361003875732, 'learning_rate': 7.957894736842106e-05, 'epoch': 7.49}


 62%|██████▏   | 312/500 [08:45<05:19,  1.70s/it]

{'loss': 0.2573, 'grad_norm': 2.350219488143921, 'learning_rate': 7.915789473684211e-05, 'epoch': 7.52}


 63%|██████▎   | 313/500 [08:47<05:08,  1.65s/it]

{'loss': 0.1553, 'grad_norm': 2.0880796909332275, 'learning_rate': 7.873684210526317e-05, 'epoch': 7.54}


 63%|██████▎   | 314/500 [08:48<05:04,  1.64s/it]

{'loss': 0.2034, 'grad_norm': 2.659714460372925, 'learning_rate': 7.83157894736842e-05, 'epoch': 7.57}


 63%|██████▎   | 315/500 [08:50<05:05,  1.65s/it]

{'loss': 0.1726, 'grad_norm': 3.0527396202087402, 'learning_rate': 7.789473684210526e-05, 'epoch': 7.59}


 63%|██████▎   | 316/500 [08:51<04:59,  1.63s/it]

{'loss': 0.1546, 'grad_norm': 2.569777488708496, 'learning_rate': 7.747368421052633e-05, 'epoch': 7.61}


 63%|██████▎   | 317/500 [08:53<05:09,  1.69s/it]

{'loss': 0.2098, 'grad_norm': 2.664405345916748, 'learning_rate': 7.705263157894738e-05, 'epoch': 7.64}


 64%|██████▎   | 318/500 [08:55<05:16,  1.74s/it]

{'loss': 0.2974, 'grad_norm': 3.219428062438965, 'learning_rate': 7.663157894736842e-05, 'epoch': 7.66}


 64%|██████▍   | 319/500 [08:57<05:03,  1.68s/it]

{'loss': 0.1645, 'grad_norm': 3.6141059398651123, 'learning_rate': 7.621052631578947e-05, 'epoch': 7.69}


 64%|██████▍   | 320/500 [08:59<05:17,  1.76s/it]

{'loss': 0.218, 'grad_norm': 2.405250072479248, 'learning_rate': 7.578947368421054e-05, 'epoch': 7.71}


 64%|██████▍   | 321/500 [09:00<05:11,  1.74s/it]

{'loss': 0.2082, 'grad_norm': 2.7170026302337646, 'learning_rate': 7.536842105263158e-05, 'epoch': 7.73}


 64%|██████▍   | 322/500 [09:02<05:07,  1.72s/it]

{'loss': 0.1544, 'grad_norm': 1.9295588731765747, 'learning_rate': 7.494736842105263e-05, 'epoch': 7.76}


 65%|██████▍   | 323/500 [09:04<05:01,  1.70s/it]

{'loss': 0.1792, 'grad_norm': 1.9306838512420654, 'learning_rate': 7.452631578947368e-05, 'epoch': 7.78}


 65%|██████▍   | 324/500 [09:05<05:02,  1.72s/it]

{'loss': 0.2769, 'grad_norm': 3.320075035095215, 'learning_rate': 7.410526315789475e-05, 'epoch': 7.81}


 65%|██████▌   | 325/500 [09:07<04:54,  1.68s/it]

{'loss': 0.1744, 'grad_norm': 2.297088384628296, 'learning_rate': 7.368421052631579e-05, 'epoch': 7.83}


 65%|██████▌   | 326/500 [09:09<04:52,  1.68s/it]

{'loss': 0.1641, 'grad_norm': 2.408863067626953, 'learning_rate': 7.326315789473684e-05, 'epoch': 7.86}


 65%|██████▌   | 327/500 [09:10<04:49,  1.67s/it]

{'loss': 0.1816, 'grad_norm': 2.7783007621765137, 'learning_rate': 7.28421052631579e-05, 'epoch': 7.88}


 66%|██████▌   | 328/500 [09:12<04:55,  1.72s/it]

{'loss': 0.1693, 'grad_norm': 2.189908981323242, 'learning_rate': 7.242105263157896e-05, 'epoch': 7.9}


 66%|██████▌   | 329/500 [09:14<04:45,  1.67s/it]

{'loss': 0.2143, 'grad_norm': 2.827979564666748, 'learning_rate': 7.2e-05, 'epoch': 7.93}


 66%|██████▌   | 330/500 [09:15<04:37,  1.63s/it]

{'loss': 0.1655, 'grad_norm': 2.3574655055999756, 'learning_rate': 7.157894736842105e-05, 'epoch': 7.95}


 66%|██████▌   | 331/500 [09:17<04:40,  1.66s/it]

{'loss': 0.2333, 'grad_norm': 3.0595524311065674, 'learning_rate': 7.115789473684211e-05, 'epoch': 7.98}


 66%|██████▋   | 332/500 [09:19<04:35,  1.64s/it]

{'loss': 0.1912, 'grad_norm': 2.462559700012207, 'learning_rate': 7.073684210526316e-05, 'epoch': 8.0}


 67%|██████▋   | 333/500 [09:20<04:30,  1.62s/it]

{'loss': 0.1164, 'grad_norm': 1.764050841331482, 'learning_rate': 7.031578947368421e-05, 'epoch': 8.02}


 67%|██████▋   | 334/500 [09:22<04:35,  1.66s/it]

{'loss': 0.0949, 'grad_norm': 1.705978512763977, 'learning_rate': 6.989473684210527e-05, 'epoch': 8.05}


 67%|██████▋   | 335/500 [09:23<04:27,  1.62s/it]

{'loss': 0.0794, 'grad_norm': 1.5440232753753662, 'learning_rate': 6.947368421052632e-05, 'epoch': 8.07}


 67%|██████▋   | 336/500 [09:25<04:20,  1.59s/it]

{'loss': 0.0941, 'grad_norm': 2.1653823852539062, 'learning_rate': 6.905263157894737e-05, 'epoch': 8.1}


 67%|██████▋   | 337/500 [09:27<04:22,  1.61s/it]

{'loss': 0.0832, 'grad_norm': 1.463550090789795, 'learning_rate': 6.863157894736843e-05, 'epoch': 8.12}


 68%|██████▊   | 338/500 [09:28<04:24,  1.63s/it]

{'loss': 0.1681, 'grad_norm': 3.0130581855773926, 'learning_rate': 6.821052631578948e-05, 'epoch': 8.14}


 68%|██████▊   | 339/500 [09:30<04:21,  1.62s/it]

{'loss': 0.0882, 'grad_norm': 1.8423326015472412, 'learning_rate': 6.778947368421052e-05, 'epoch': 8.17}


 68%|██████▊   | 340/500 [09:31<04:14,  1.59s/it]

{'loss': 0.0859, 'grad_norm': 1.8683593273162842, 'learning_rate': 6.736842105263159e-05, 'epoch': 8.19}


 68%|██████▊   | 341/500 [09:33<04:22,  1.65s/it]

{'loss': 0.073, 'grad_norm': 1.5019139051437378, 'learning_rate': 6.694736842105264e-05, 'epoch': 8.22}


 68%|██████▊   | 342/500 [09:35<04:19,  1.64s/it]

{'loss': 0.0958, 'grad_norm': 2.1709978580474854, 'learning_rate': 6.652631578947369e-05, 'epoch': 8.24}


 69%|██████▊   | 343/500 [09:36<04:14,  1.62s/it]

{'loss': 0.0758, 'grad_norm': 2.0567803382873535, 'learning_rate': 6.610526315789473e-05, 'epoch': 8.27}


 69%|██████▉   | 344/500 [09:38<04:21,  1.68s/it]

{'loss': 0.083, 'grad_norm': 1.8843752145767212, 'learning_rate': 6.56842105263158e-05, 'epoch': 8.29}


 69%|██████▉   | 345/500 [09:40<04:19,  1.67s/it]

{'loss': 0.1303, 'grad_norm': 2.7856199741363525, 'learning_rate': 6.526315789473685e-05, 'epoch': 8.31}


 69%|██████▉   | 346/500 [09:42<04:21,  1.70s/it]

{'loss': 0.1255, 'grad_norm': 2.6550707817077637, 'learning_rate': 6.484210526315789e-05, 'epoch': 8.34}


 69%|██████▉   | 347/500 [09:43<04:18,  1.69s/it]

{'loss': 0.0749, 'grad_norm': 1.8917598724365234, 'learning_rate': 6.442105263157894e-05, 'epoch': 8.36}


 70%|██████▉   | 348/500 [09:45<04:13,  1.66s/it]

{'loss': 0.1066, 'grad_norm': 2.198758363723755, 'learning_rate': 6.400000000000001e-05, 'epoch': 8.39}


 70%|██████▉   | 349/500 [09:47<04:13,  1.68s/it]

{'loss': 0.1207, 'grad_norm': 1.9914780855178833, 'learning_rate': 6.357894736842106e-05, 'epoch': 8.41}


 70%|███████   | 350/500 [09:48<04:02,  1.62s/it]

{'loss': 0.1042, 'grad_norm': 2.222646951675415, 'learning_rate': 6.31578947368421e-05, 'epoch': 8.43}


 70%|███████   | 351/500 [09:50<04:03,  1.64s/it]

{'loss': 0.1134, 'grad_norm': 1.8963377475738525, 'learning_rate': 6.273684210526316e-05, 'epoch': 8.46}


 70%|███████   | 352/500 [09:51<04:07,  1.67s/it]

{'loss': 0.1428, 'grad_norm': 2.5797481536865234, 'learning_rate': 6.231578947368422e-05, 'epoch': 8.48}


 71%|███████   | 353/500 [09:53<04:21,  1.78s/it]

{'loss': 0.2085, 'grad_norm': 2.662200689315796, 'learning_rate': 6.189473684210526e-05, 'epoch': 8.51}


 71%|███████   | 354/500 [09:55<04:13,  1.74s/it]

{'loss': 0.1229, 'grad_norm': 2.437095880508423, 'learning_rate': 6.147368421052632e-05, 'epoch': 8.53}


 71%|███████   | 355/500 [09:57<04:12,  1.74s/it]

{'loss': 0.1098, 'grad_norm': 1.8755559921264648, 'learning_rate': 6.105263157894737e-05, 'epoch': 8.55}


 71%|███████   | 356/500 [09:58<04:03,  1.69s/it]

{'loss': 0.1061, 'grad_norm': 1.9943374395370483, 'learning_rate': 6.063157894736843e-05, 'epoch': 8.58}


 71%|███████▏  | 357/500 [10:00<04:05,  1.72s/it]

{'loss': 0.1251, 'grad_norm': 1.792763113975525, 'learning_rate': 6.0210526315789475e-05, 'epoch': 8.6}


 72%|███████▏  | 358/500 [10:02<03:59,  1.69s/it]

{'loss': 0.1196, 'grad_norm': 2.38991379737854, 'learning_rate': 5.978947368421053e-05, 'epoch': 8.63}


 72%|███████▏  | 359/500 [10:04<04:02,  1.72s/it]

{'loss': 0.159, 'grad_norm': 2.364619493484497, 'learning_rate': 5.936842105263158e-05, 'epoch': 8.65}


 72%|███████▏  | 360/500 [10:05<03:57,  1.70s/it]

{'loss': 0.1054, 'grad_norm': 1.7778035402297974, 'learning_rate': 5.894736842105263e-05, 'epoch': 8.67}


 72%|███████▏  | 361/500 [10:07<04:01,  1.73s/it]

{'loss': 0.1342, 'grad_norm': 2.605600357055664, 'learning_rate': 5.852631578947369e-05, 'epoch': 8.7}


 72%|███████▏  | 362/500 [10:09<03:47,  1.65s/it]

{'loss': 0.0872, 'grad_norm': 1.6342344284057617, 'learning_rate': 5.810526315789474e-05, 'epoch': 8.72}


 73%|███████▎  | 363/500 [10:11<03:57,  1.73s/it]

{'loss': 0.161, 'grad_norm': 1.8974519968032837, 'learning_rate': 5.7684210526315794e-05, 'epoch': 8.75}


 73%|███████▎  | 364/500 [10:12<03:54,  1.72s/it]

{'loss': 0.0907, 'grad_norm': 1.9263523817062378, 'learning_rate': 5.726315789473684e-05, 'epoch': 8.77}


 73%|███████▎  | 365/500 [10:14<03:50,  1.71s/it]

{'loss': 0.1205, 'grad_norm': 2.2810044288635254, 'learning_rate': 5.68421052631579e-05, 'epoch': 8.8}


 73%|███████▎  | 366/500 [10:16<03:57,  1.77s/it]

{'loss': 0.1436, 'grad_norm': 2.220194101333618, 'learning_rate': 5.642105263157895e-05, 'epoch': 8.82}


 73%|███████▎  | 367/500 [10:18<03:54,  1.76s/it]

{'loss': 0.1014, 'grad_norm': 2.4155638217926025, 'learning_rate': 5.6000000000000006e-05, 'epoch': 8.84}


 74%|███████▎  | 368/500 [10:20<04:05,  1.86s/it]

{'loss': 0.2356, 'grad_norm': 2.498462677001953, 'learning_rate': 5.557894736842105e-05, 'epoch': 8.87}


 74%|███████▍  | 369/500 [10:21<03:52,  1.77s/it]

{'loss': 0.1115, 'grad_norm': 3.0201802253723145, 'learning_rate': 5.5157894736842105e-05, 'epoch': 8.89}


 74%|███████▍  | 370/500 [10:23<03:50,  1.77s/it]

{'loss': 0.1376, 'grad_norm': 2.4855897426605225, 'learning_rate': 5.4736842105263165e-05, 'epoch': 8.92}


 74%|███████▍  | 371/500 [10:25<03:48,  1.77s/it]

{'loss': 0.1349, 'grad_norm': 2.3876147270202637, 'learning_rate': 5.431578947368421e-05, 'epoch': 8.94}


 74%|███████▍  | 372/500 [10:27<03:49,  1.79s/it]

{'loss': 0.1086, 'grad_norm': 1.8558719158172607, 'learning_rate': 5.3894736842105265e-05, 'epoch': 8.96}


 75%|███████▍  | 373/500 [10:28<03:37,  1.72s/it]

{'loss': 0.0853, 'grad_norm': 2.0045053958892822, 'learning_rate': 5.347368421052632e-05, 'epoch': 8.99}


 75%|███████▍  | 374/500 [10:30<03:31,  1.68s/it]

{'loss': 0.0883, 'grad_norm': 1.8154568672180176, 'learning_rate': 5.305263157894738e-05, 'epoch': 9.01}


 75%|███████▌  | 375/500 [10:32<03:39,  1.76s/it]

{'loss': 0.0971, 'grad_norm': 1.5137604475021362, 'learning_rate': 5.2631578947368424e-05, 'epoch': 9.04}


 75%|███████▌  | 376/500 [10:33<03:30,  1.70s/it]

{'loss': 0.0481, 'grad_norm': 1.1057549715042114, 'learning_rate': 5.221052631578948e-05, 'epoch': 9.06}


 75%|███████▌  | 377/500 [10:35<03:23,  1.65s/it]

{'loss': 0.0767, 'grad_norm': 2.1230740547180176, 'learning_rate': 5.178947368421053e-05, 'epoch': 9.08}


 76%|███████▌  | 378/500 [10:36<03:19,  1.64s/it]

{'loss': 0.0562, 'grad_norm': 1.3642841577529907, 'learning_rate': 5.1368421052631576e-05, 'epoch': 9.11}


 76%|███████▌  | 379/500 [10:38<03:22,  1.67s/it]

{'loss': 0.0454, 'grad_norm': 1.4138169288635254, 'learning_rate': 5.094736842105263e-05, 'epoch': 9.13}


 76%|███████▌  | 380/500 [10:40<03:17,  1.64s/it]

{'loss': 0.0424, 'grad_norm': 0.8673470616340637, 'learning_rate': 5.052631578947369e-05, 'epoch': 9.16}


 76%|███████▌  | 381/500 [10:41<03:14,  1.63s/it]

{'loss': 0.0577, 'grad_norm': 1.443019151687622, 'learning_rate': 5.010526315789474e-05, 'epoch': 9.18}


 76%|███████▋  | 382/500 [10:43<03:21,  1.71s/it]

{'loss': 0.0853, 'grad_norm': 1.3917981386184692, 'learning_rate': 4.9684210526315796e-05, 'epoch': 9.2}


 77%|███████▋  | 383/500 [10:45<03:22,  1.73s/it]

{'loss': 0.0539, 'grad_norm': 1.1927728652954102, 'learning_rate': 4.926315789473684e-05, 'epoch': 9.23}


 77%|███████▋  | 384/500 [10:47<03:17,  1.71s/it]

{'loss': 0.059, 'grad_norm': 1.7050950527191162, 'learning_rate': 4.8842105263157895e-05, 'epoch': 9.25}


 77%|███████▋  | 385/500 [10:49<03:24,  1.78s/it]

{'loss': 0.0956, 'grad_norm': 1.633172631263733, 'learning_rate': 4.842105263157895e-05, 'epoch': 9.28}


 77%|███████▋  | 386/500 [10:50<03:15,  1.72s/it]

{'loss': 0.0838, 'grad_norm': 2.386500358581543, 'learning_rate': 4.8e-05, 'epoch': 9.3}


 77%|███████▋  | 387/500 [10:52<03:13,  1.71s/it]

{'loss': 0.083, 'grad_norm': 2.218675136566162, 'learning_rate': 4.7578947368421054e-05, 'epoch': 9.33}


 78%|███████▊  | 388/500 [10:54<03:11,  1.71s/it]

{'loss': 0.0897, 'grad_norm': 1.7152179479599, 'learning_rate': 4.715789473684211e-05, 'epoch': 9.35}


 78%|███████▊  | 389/500 [10:56<03:20,  1.80s/it]

{'loss': 0.1656, 'grad_norm': 2.6933658123016357, 'learning_rate': 4.673684210526316e-05, 'epoch': 9.37}


 78%|███████▊  | 390/500 [10:57<03:07,  1.71s/it]

{'loss': 0.0708, 'grad_norm': 2.078162670135498, 'learning_rate': 4.6315789473684214e-05, 'epoch': 9.4}


 78%|███████▊  | 391/500 [10:59<03:04,  1.69s/it]

{'loss': 0.0627, 'grad_norm': 2.1347198486328125, 'learning_rate': 4.589473684210526e-05, 'epoch': 9.42}


 78%|███████▊  | 392/500 [11:00<03:02,  1.69s/it]

{'loss': 0.0894, 'grad_norm': 2.8331117630004883, 'learning_rate': 4.547368421052632e-05, 'epoch': 9.45}


 79%|███████▊  | 393/500 [11:02<03:06,  1.74s/it]

{'loss': 0.089, 'grad_norm': 1.754697322845459, 'learning_rate': 4.5052631578947366e-05, 'epoch': 9.47}


 79%|███████▉  | 394/500 [11:04<02:58,  1.68s/it]

{'loss': 0.0578, 'grad_norm': 2.395197629928589, 'learning_rate': 4.4631578947368426e-05, 'epoch': 9.49}


 79%|███████▉  | 395/500 [11:06<03:02,  1.74s/it]

{'loss': 0.1045, 'grad_norm': 2.3893723487854004, 'learning_rate': 4.421052631578947e-05, 'epoch': 9.52}


 79%|███████▉  | 396/500 [11:07<02:52,  1.66s/it]

{'loss': 0.0848, 'grad_norm': 1.7176282405853271, 'learning_rate': 4.378947368421053e-05, 'epoch': 9.54}


 79%|███████▉  | 397/500 [11:09<02:48,  1.64s/it]

{'loss': 0.0613, 'grad_norm': 1.4524972438812256, 'learning_rate': 4.336842105263158e-05, 'epoch': 9.57}


 80%|███████▉  | 398/500 [11:10<02:50,  1.67s/it]

{'loss': 0.0716, 'grad_norm': 1.9112437963485718, 'learning_rate': 4.294736842105264e-05, 'epoch': 9.59}


 80%|███████▉  | 399/500 [11:12<02:45,  1.64s/it]

{'loss': 0.0706, 'grad_norm': 1.8314201831817627, 'learning_rate': 4.2526315789473685e-05, 'epoch': 9.61}


 80%|████████  | 400/500 [11:14<02:45,  1.65s/it]

{'loss': 0.0815, 'grad_norm': 1.7099969387054443, 'learning_rate': 4.210526315789474e-05, 'epoch': 9.64}


 80%|████████  | 401/500 [11:15<02:42,  1.64s/it]

{'loss': 0.0884, 'grad_norm': 2.4226763248443604, 'learning_rate': 4.168421052631579e-05, 'epoch': 9.66}


 80%|████████  | 402/500 [11:17<02:41,  1.65s/it]

{'loss': 0.0693, 'grad_norm': 1.404791235923767, 'learning_rate': 4.1263157894736844e-05, 'epoch': 9.69}


 81%|████████  | 403/500 [11:19<02:43,  1.69s/it]

{'loss': 0.1037, 'grad_norm': 2.1063714027404785, 'learning_rate': 4.08421052631579e-05, 'epoch': 9.71}


 81%|████████  | 404/500 [11:20<02:39,  1.67s/it]

{'loss': 0.0825, 'grad_norm': 1.518130898475647, 'learning_rate': 4.042105263157895e-05, 'epoch': 9.73}


 81%|████████  | 405/500 [11:22<02:46,  1.76s/it]

{'loss': 0.1129, 'grad_norm': 2.002378463745117, 'learning_rate': 4e-05, 'epoch': 9.76}


 81%|████████  | 406/500 [11:24<02:38,  1.68s/it]

{'loss': 0.0524, 'grad_norm': 1.2998517751693726, 'learning_rate': 3.9578947368421056e-05, 'epoch': 9.78}


 81%|████████▏ | 407/500 [11:26<02:40,  1.72s/it]

{'loss': 0.0754, 'grad_norm': 1.6692345142364502, 'learning_rate': 3.91578947368421e-05, 'epoch': 9.81}


 82%|████████▏ | 408/500 [11:28<02:41,  1.76s/it]

{'loss': 0.1069, 'grad_norm': 1.6703771352767944, 'learning_rate': 3.873684210526316e-05, 'epoch': 9.83}


 82%|████████▏ | 409/500 [11:29<02:34,  1.70s/it]

{'loss': 0.0586, 'grad_norm': 1.162557601928711, 'learning_rate': 3.831578947368421e-05, 'epoch': 9.86}


 82%|████████▏ | 410/500 [11:31<02:28,  1.65s/it]

{'loss': 0.0567, 'grad_norm': 1.0077297687530518, 'learning_rate': 3.789473684210527e-05, 'epoch': 9.88}


 82%|████████▏ | 411/500 [11:32<02:25,  1.64s/it]

{'loss': 0.0879, 'grad_norm': 1.7769646644592285, 'learning_rate': 3.7473684210526315e-05, 'epoch': 9.9}


 82%|████████▏ | 412/500 [11:34<02:30,  1.72s/it]

{'loss': 0.0792, 'grad_norm': 1.5939140319824219, 'learning_rate': 3.7052631578947375e-05, 'epoch': 9.93}


 83%|████████▎ | 413/500 [11:36<02:28,  1.70s/it]

{'loss': 0.0864, 'grad_norm': 1.826842188835144, 'learning_rate': 3.663157894736842e-05, 'epoch': 9.95}


 83%|████████▎ | 414/500 [11:37<02:24,  1.68s/it]

{'loss': 0.063, 'grad_norm': 1.7968195676803589, 'learning_rate': 3.621052631578948e-05, 'epoch': 9.98}


 83%|████████▎ | 415/500 [11:39<02:18,  1.63s/it]

{'loss': 0.0709, 'grad_norm': 1.862926721572876, 'learning_rate': 3.578947368421053e-05, 'epoch': 10.0}


 83%|████████▎ | 416/500 [11:41<02:17,  1.64s/it]

{'loss': 0.0526, 'grad_norm': 1.3872132301330566, 'learning_rate': 3.536842105263158e-05, 'epoch': 10.02}


 83%|████████▎ | 417/500 [11:42<02:13,  1.61s/it]

{'loss': 0.0353, 'grad_norm': 0.8577324748039246, 'learning_rate': 3.4947368421052634e-05, 'epoch': 10.05}


 84%|████████▎ | 418/500 [11:44<02:16,  1.67s/it]

{'loss': 0.0465, 'grad_norm': 0.9018027782440186, 'learning_rate': 3.452631578947369e-05, 'epoch': 10.07}


 84%|████████▍ | 419/500 [11:46<02:18,  1.71s/it]

{'loss': 0.0432, 'grad_norm': 0.929498553276062, 'learning_rate': 3.410526315789474e-05, 'epoch': 10.1}


 84%|████████▍ | 420/500 [11:47<02:17,  1.72s/it]

{'loss': 0.0462, 'grad_norm': 0.9474616646766663, 'learning_rate': 3.368421052631579e-05, 'epoch': 10.12}


 84%|████████▍ | 421/500 [11:49<02:12,  1.68s/it]

{'loss': 0.0512, 'grad_norm': 1.1197631359100342, 'learning_rate': 3.3263157894736846e-05, 'epoch': 10.14}


 84%|████████▍ | 422/500 [11:51<02:06,  1.62s/it]

{'loss': 0.0539, 'grad_norm': 1.8229398727416992, 'learning_rate': 3.28421052631579e-05, 'epoch': 10.17}


 85%|████████▍ | 423/500 [11:52<02:11,  1.71s/it]

{'loss': 0.0591, 'grad_norm': 1.1884280443191528, 'learning_rate': 3.2421052631578945e-05, 'epoch': 10.19}


 85%|████████▍ | 424/500 [11:54<02:10,  1.72s/it]

{'loss': 0.0412, 'grad_norm': 0.8083605766296387, 'learning_rate': 3.2000000000000005e-05, 'epoch': 10.22}


 85%|████████▌ | 425/500 [11:56<02:05,  1.67s/it]

{'loss': 0.0422, 'grad_norm': 1.2124502658843994, 'learning_rate': 3.157894736842105e-05, 'epoch': 10.24}


 85%|████████▌ | 426/500 [11:57<02:05,  1.69s/it]

{'loss': 0.0475, 'grad_norm': 1.5075353384017944, 'learning_rate': 3.115789473684211e-05, 'epoch': 10.27}


 85%|████████▌ | 427/500 [11:59<02:05,  1.71s/it]

{'loss': 0.0608, 'grad_norm': 1.5035237073898315, 'learning_rate': 3.073684210526316e-05, 'epoch': 10.29}


 86%|████████▌ | 428/500 [12:01<01:57,  1.64s/it]

{'loss': 0.0473, 'grad_norm': 1.0992181301116943, 'learning_rate': 3.0315789473684214e-05, 'epoch': 10.31}


 86%|████████▌ | 429/500 [12:02<01:52,  1.59s/it]

{'loss': 0.0516, 'grad_norm': 1.5630197525024414, 'learning_rate': 2.9894736842105264e-05, 'epoch': 10.34}


 86%|████████▌ | 430/500 [12:04<01:48,  1.55s/it]

{'loss': 0.0427, 'grad_norm': 0.8244784474372864, 'learning_rate': 2.9473684210526314e-05, 'epoch': 10.36}


 86%|████████▌ | 431/500 [12:05<01:50,  1.61s/it]

{'loss': 0.0422, 'grad_norm': 0.8705162405967712, 'learning_rate': 2.905263157894737e-05, 'epoch': 10.39}


 86%|████████▋ | 432/500 [12:07<01:52,  1.66s/it]

{'loss': 0.0401, 'grad_norm': 1.1806614398956299, 'learning_rate': 2.863157894736842e-05, 'epoch': 10.41}


 87%|████████▋ | 433/500 [12:09<01:47,  1.60s/it]

{'loss': 0.0467, 'grad_norm': 1.1687036752700806, 'learning_rate': 2.8210526315789476e-05, 'epoch': 10.43}


 87%|████████▋ | 434/500 [12:10<01:50,  1.67s/it]

{'loss': 0.045, 'grad_norm': 0.8398346900939941, 'learning_rate': 2.7789473684210526e-05, 'epoch': 10.46}


 87%|████████▋ | 435/500 [12:12<01:45,  1.63s/it]

{'loss': 0.0447, 'grad_norm': 1.6047284603118896, 'learning_rate': 2.7368421052631583e-05, 'epoch': 10.48}


 87%|████████▋ | 436/500 [12:14<01:45,  1.65s/it]

{'loss': 0.0509, 'grad_norm': 1.572739601135254, 'learning_rate': 2.6947368421052632e-05, 'epoch': 10.51}


 87%|████████▋ | 437/500 [12:15<01:45,  1.67s/it]

{'loss': 0.0442, 'grad_norm': 2.107609510421753, 'learning_rate': 2.652631578947369e-05, 'epoch': 10.53}


 88%|████████▊ | 438/500 [12:17<01:43,  1.67s/it]

{'loss': 0.0312, 'grad_norm': 0.9701303243637085, 'learning_rate': 2.610526315789474e-05, 'epoch': 10.55}


 88%|████████▊ | 439/500 [12:19<01:42,  1.67s/it]

{'loss': 0.0394, 'grad_norm': 0.8977034687995911, 'learning_rate': 2.5684210526315788e-05, 'epoch': 10.58}


 88%|████████▊ | 440/500 [12:21<01:46,  1.78s/it]

{'loss': 0.0632, 'grad_norm': 1.5212279558181763, 'learning_rate': 2.5263157894736845e-05, 'epoch': 10.6}


 88%|████████▊ | 441/500 [12:22<01:39,  1.69s/it]

{'loss': 0.0495, 'grad_norm': 1.772896647453308, 'learning_rate': 2.4842105263157898e-05, 'epoch': 10.63}


 88%|████████▊ | 442/500 [12:24<01:34,  1.64s/it]

{'loss': 0.0469, 'grad_norm': 1.3311126232147217, 'learning_rate': 2.4421052631578948e-05, 'epoch': 10.65}


 89%|████████▊ | 443/500 [12:26<01:38,  1.73s/it]

{'loss': 0.0593, 'grad_norm': 1.3645577430725098, 'learning_rate': 2.4e-05, 'epoch': 10.67}


 89%|████████▉ | 444/500 [12:27<01:36,  1.72s/it]

{'loss': 0.0552, 'grad_norm': 1.2721261978149414, 'learning_rate': 2.3578947368421054e-05, 'epoch': 10.7}


 89%|████████▉ | 445/500 [12:29<01:34,  1.71s/it]

{'loss': 0.0599, 'grad_norm': 1.950852632522583, 'learning_rate': 2.3157894736842107e-05, 'epoch': 10.72}


 89%|████████▉ | 446/500 [12:31<01:32,  1.70s/it]

{'loss': 0.0542, 'grad_norm': 1.620446801185608, 'learning_rate': 2.273684210526316e-05, 'epoch': 10.75}


 89%|████████▉ | 447/500 [12:33<01:32,  1.75s/it]

{'loss': 0.0499, 'grad_norm': 1.4519057273864746, 'learning_rate': 2.2315789473684213e-05, 'epoch': 10.77}


 90%|████████▉ | 448/500 [12:34<01:28,  1.71s/it]

{'loss': 0.0346, 'grad_norm': 1.1688947677612305, 'learning_rate': 2.1894736842105266e-05, 'epoch': 10.8}


 90%|████████▉ | 449/500 [12:36<01:26,  1.70s/it]

{'loss': 0.0448, 'grad_norm': 1.1322822570800781, 'learning_rate': 2.147368421052632e-05, 'epoch': 10.82}


 90%|█████████ | 450/500 [12:38<01:25,  1.71s/it]

{'loss': 0.0451, 'grad_norm': 1.3873885869979858, 'learning_rate': 2.105263157894737e-05, 'epoch': 10.84}


 90%|█████████ | 451/500 [12:40<01:26,  1.76s/it]

{'loss': 0.048, 'grad_norm': 1.1556708812713623, 'learning_rate': 2.0631578947368422e-05, 'epoch': 10.87}


 90%|█████████ | 452/500 [12:41<01:23,  1.74s/it]

{'loss': 0.0545, 'grad_norm': 1.8702186346054077, 'learning_rate': 2.0210526315789475e-05, 'epoch': 10.89}


 91%|█████████ | 453/500 [12:43<01:18,  1.67s/it]

{'loss': 0.0623, 'grad_norm': 1.7281304597854614, 'learning_rate': 1.9789473684210528e-05, 'epoch': 10.92}


 91%|█████████ | 454/500 [12:45<01:21,  1.78s/it]

{'loss': 0.0493, 'grad_norm': 2.0473763942718506, 'learning_rate': 1.936842105263158e-05, 'epoch': 10.94}


 91%|█████████ | 455/500 [12:46<01:18,  1.73s/it]

{'loss': 0.0603, 'grad_norm': 1.2379385232925415, 'learning_rate': 1.8947368421052634e-05, 'epoch': 10.96}


 91%|█████████ | 456/500 [12:48<01:20,  1.82s/it]

{'loss': 0.0549, 'grad_norm': 1.286454677581787, 'learning_rate': 1.8526315789473687e-05, 'epoch': 10.99}


 91%|█████████▏| 457/500 [12:50<01:14,  1.74s/it]

{'loss': 0.0423, 'grad_norm': 0.8884984850883484, 'learning_rate': 1.810526315789474e-05, 'epoch': 11.01}


 92%|█████████▏| 458/500 [12:52<01:11,  1.71s/it]

{'loss': 0.0471, 'grad_norm': 0.692874550819397, 'learning_rate': 1.768421052631579e-05, 'epoch': 11.04}


 92%|█████████▏| 459/500 [12:53<01:07,  1.65s/it]

{'loss': 0.0348, 'grad_norm': 0.714057207107544, 'learning_rate': 1.7263157894736843e-05, 'epoch': 11.06}


 92%|█████████▏| 460/500 [12:55<01:08,  1.72s/it]

{'loss': 0.0378, 'grad_norm': 0.9657811522483826, 'learning_rate': 1.6842105263157896e-05, 'epoch': 11.08}


 92%|█████████▏| 461/500 [12:57<01:07,  1.74s/it]

{'loss': 0.0394, 'grad_norm': 1.1066958904266357, 'learning_rate': 1.642105263157895e-05, 'epoch': 11.11}


 92%|█████████▏| 462/500 [12:58<01:04,  1.69s/it]

{'loss': 0.0332, 'grad_norm': 0.5374202728271484, 'learning_rate': 1.6000000000000003e-05, 'epoch': 11.13}


 93%|█████████▎| 463/500 [13:00<01:02,  1.69s/it]

{'loss': 0.039, 'grad_norm': 1.5334144830703735, 'learning_rate': 1.5578947368421056e-05, 'epoch': 11.16}


 93%|█████████▎| 464/500 [13:02<00:59,  1.65s/it]

{'loss': 0.0361, 'grad_norm': 1.3014816045761108, 'learning_rate': 1.5157894736842107e-05, 'epoch': 11.18}


 93%|█████████▎| 465/500 [13:03<00:57,  1.64s/it]

{'loss': 0.0279, 'grad_norm': 0.6844238042831421, 'learning_rate': 1.4736842105263157e-05, 'epoch': 11.2}


 93%|█████████▎| 466/500 [13:05<00:57,  1.69s/it]

{'loss': 0.0371, 'grad_norm': 1.7539142370224, 'learning_rate': 1.431578947368421e-05, 'epoch': 11.23}


 93%|█████████▎| 467/500 [13:07<00:55,  1.67s/it]

{'loss': 0.0258, 'grad_norm': 0.9469327926635742, 'learning_rate': 1.3894736842105263e-05, 'epoch': 11.25}


 94%|█████████▎| 468/500 [13:08<00:53,  1.66s/it]

{'loss': 0.0313, 'grad_norm': 0.7126246690750122, 'learning_rate': 1.3473684210526316e-05, 'epoch': 11.28}


 94%|█████████▍| 469/500 [13:10<00:50,  1.63s/it]

{'loss': 0.0332, 'grad_norm': 0.7841440439224243, 'learning_rate': 1.305263157894737e-05, 'epoch': 11.3}


 94%|█████████▍| 470/500 [13:12<00:52,  1.76s/it]

{'loss': 0.0413, 'grad_norm': 1.0093867778778076, 'learning_rate': 1.2631578947368422e-05, 'epoch': 11.33}


 94%|█████████▍| 471/500 [13:14<00:50,  1.74s/it]

{'loss': 0.0297, 'grad_norm': 0.5611435174942017, 'learning_rate': 1.2210526315789474e-05, 'epoch': 11.35}


 94%|█████████▍| 472/500 [13:15<00:49,  1.75s/it]

{'loss': 0.0359, 'grad_norm': 0.9595898985862732, 'learning_rate': 1.1789473684210527e-05, 'epoch': 11.37}


 95%|█████████▍| 473/500 [13:17<00:45,  1.70s/it]

{'loss': 0.0323, 'grad_norm': 0.7752214670181274, 'learning_rate': 1.136842105263158e-05, 'epoch': 11.4}


 95%|█████████▍| 474/500 [13:19<00:43,  1.66s/it]

{'loss': 0.0319, 'grad_norm': 0.4861299395561218, 'learning_rate': 1.0947368421052633e-05, 'epoch': 11.42}


 95%|█████████▌| 475/500 [13:20<00:41,  1.65s/it]

{'loss': 0.0258, 'grad_norm': 0.49081388115882874, 'learning_rate': 1.0526315789473684e-05, 'epoch': 11.45}


 95%|█████████▌| 476/500 [13:22<00:42,  1.78s/it]

{'loss': 0.0464, 'grad_norm': 1.4114406108856201, 'learning_rate': 1.0105263157894738e-05, 'epoch': 11.47}


 95%|█████████▌| 477/500 [13:24<00:41,  1.79s/it]

{'loss': 0.0614, 'grad_norm': 1.5560468435287476, 'learning_rate': 9.68421052631579e-06, 'epoch': 11.49}


 96%|█████████▌| 478/500 [13:26<00:39,  1.78s/it]

{'loss': 0.0271, 'grad_norm': 0.6823348999023438, 'learning_rate': 9.263157894736844e-06, 'epoch': 11.52}


 96%|█████████▌| 479/500 [13:27<00:36,  1.74s/it]

{'loss': 0.0311, 'grad_norm': 0.6420421600341797, 'learning_rate': 8.842105263157895e-06, 'epoch': 11.54}


 96%|█████████▌| 480/500 [13:29<00:34,  1.74s/it]

{'loss': 0.0253, 'grad_norm': 1.157339334487915, 'learning_rate': 8.421052631578948e-06, 'epoch': 11.57}


 96%|█████████▌| 481/500 [13:31<00:32,  1.70s/it]

{'loss': 0.026, 'grad_norm': 0.668056845664978, 'learning_rate': 8.000000000000001e-06, 'epoch': 11.59}


 96%|█████████▋| 482/500 [13:33<00:31,  1.74s/it]

{'loss': 0.0306, 'grad_norm': 1.12008535861969, 'learning_rate': 7.578947368421054e-06, 'epoch': 11.61}


 97%|█████████▋| 483/500 [13:34<00:29,  1.73s/it]

{'loss': 0.026, 'grad_norm': 0.6933916211128235, 'learning_rate': 7.157894736842105e-06, 'epoch': 11.64}


 97%|█████████▋| 484/500 [13:36<00:26,  1.67s/it]

{'loss': 0.03, 'grad_norm': 0.8870292901992798, 'learning_rate': 6.736842105263158e-06, 'epoch': 11.66}


 97%|█████████▋| 485/500 [13:38<00:25,  1.67s/it]

{'loss': 0.0684, 'grad_norm': 2.5504062175750732, 'learning_rate': 6.315789473684211e-06, 'epoch': 11.69}


 97%|█████████▋| 486/500 [13:39<00:23,  1.65s/it]

{'loss': 0.0371, 'grad_norm': 0.79029780626297, 'learning_rate': 5.8947368421052634e-06, 'epoch': 11.71}


 97%|█████████▋| 487/500 [13:41<00:21,  1.64s/it]

{'loss': 0.0382, 'grad_norm': 0.9572199583053589, 'learning_rate': 5.4736842105263165e-06, 'epoch': 11.73}


 98%|█████████▊| 488/500 [13:43<00:20,  1.71s/it]

{'loss': 0.0558, 'grad_norm': 1.7198669910430908, 'learning_rate': 5.052631578947369e-06, 'epoch': 11.76}


 98%|█████████▊| 489/500 [13:44<00:18,  1.67s/it]

{'loss': 0.0254, 'grad_norm': 0.5500555038452148, 'learning_rate': 4.631578947368422e-06, 'epoch': 11.78}


 98%|█████████▊| 490/500 [13:46<00:17,  1.70s/it]

{'loss': 0.0243, 'grad_norm': 0.5647155046463013, 'learning_rate': 4.210526315789474e-06, 'epoch': 11.81}


 98%|█████████▊| 491/500 [13:48<00:15,  1.73s/it]

{'loss': 0.029, 'grad_norm': 0.8126233220100403, 'learning_rate': 3.789473684210527e-06, 'epoch': 11.83}


 98%|█████████▊| 492/500 [13:49<00:13,  1.68s/it]

{'loss': 0.033, 'grad_norm': 0.7442077994346619, 'learning_rate': 3.368421052631579e-06, 'epoch': 11.86}


 99%|█████████▊| 493/500 [13:51<00:11,  1.64s/it]

{'loss': 0.0257, 'grad_norm': 0.41018298268318176, 'learning_rate': 2.9473684210526317e-06, 'epoch': 11.88}


 99%|█████████▉| 494/500 [13:53<00:09,  1.64s/it]

{'loss': 0.0243, 'grad_norm': 0.5392683148384094, 'learning_rate': 2.5263157894736844e-06, 'epoch': 11.9}


 99%|█████████▉| 495/500 [13:54<00:07,  1.59s/it]

{'loss': 0.034, 'grad_norm': 0.6964764595031738, 'learning_rate': 2.105263157894737e-06, 'epoch': 11.93}


 99%|█████████▉| 496/500 [13:56<00:06,  1.70s/it]

{'loss': 0.0277, 'grad_norm': 0.7981184720993042, 'learning_rate': 1.6842105263157895e-06, 'epoch': 11.95}


 99%|█████████▉| 497/500 [13:58<00:04,  1.64s/it]

{'loss': 0.0422, 'grad_norm': 1.5925590991973877, 'learning_rate': 1.2631578947368422e-06, 'epoch': 11.98}


100%|█████████▉| 498/500 [13:59<00:03,  1.61s/it]

{'loss': 0.0576, 'grad_norm': 1.3414098024368286, 'learning_rate': 8.421052631578948e-07, 'epoch': 12.0}


100%|█████████▉| 499/500 [14:01<00:01,  1.58s/it]

{'loss': 0.0424, 'grad_norm': 2.0674824714660645, 'learning_rate': 4.210526315789474e-07, 'epoch': 12.02}


100%|██████████| 500/500 [14:02<00:00,  1.66s/it]

{'loss': 0.0454, 'grad_norm': 1.300477147102356, 'learning_rate': 0.0, 'epoch': 12.05}


100%|██████████| 500/500 [14:03<00:00,  1.69s/it]

{'train_runtime': 844.9164, 'train_samples_per_second': 4.734, 'train_steps_per_second': 0.592, 'train_loss': 0.8659532723501324, 'epoch': 12.05}
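
The summary line can be cross-checked against the rest of the log. The sketch below is only a sanity check on the numbers printed above, not part of the training pipeline: it recomputes the step rate from the runtime, estimates how many samples each optimizer step processed, and confirms that the last logged learning rates fall by a constant amount per step, which is what a linear decay to zero would produce. The helper names `throughput_stats` and `lr_decrements` are hypothetical and introduced here for illustration. Note that `train_loss` (about 0.866) is, as far as the Hugging Face Trainer reports it, the mean over the whole run, which is why it is much higher than the per-step losses of roughly 0.03 to 0.06 near the end.

```python
import math

# Values copied from the final summary line of the log above.
summary = {
    "train_runtime": 844.9164,          # seconds
    "train_samples_per_second": 4.734,
    "train_steps_per_second": 0.592,
}
TOTAL_STEPS = 500  # from the progress bar (500/500)

# Last three logged learning rates, copied from the log above.
last_lrs = [8.421052631578948e-07, 4.210526315789474e-07, 0.0]


def throughput_stats(summary: dict, total_steps: int) -> dict:
    """Recompute steps/s and the implied number of samples per optimizer step."""
    steps_per_second = total_steps / summary["train_runtime"]
    samples_per_step = (
        summary["train_samples_per_second"] / summary["train_steps_per_second"]
    )
    return {
        "steps_per_second": round(steps_per_second, 3),
        "samples_per_step": round(samples_per_step),
        "runtime_minutes": round(summary["train_runtime"] / 60, 1),
    }


def lr_decrements(lrs: list) -> list:
    """Differences between consecutive logged learning rates."""
    return [earlier - later for earlier, later in zip(lrs, lrs[1:])]


stats = throughput_stats(summary, TOTAL_STEPS)
steps = lr_decrements(last_lrs)
print(stats)
print(steps)
```

If the recomputed step rate did not match the reported `train_steps_per_second`, or the decrements were unequal, that would suggest the log lines come from different runs or a different scheduler than assumed here.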





Step 9 (Optional): Testing the trained model
-----

In [57]:
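from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Apply the Llama 3.1 chat format so the prompt uses the layout the model expects
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Help me to write a good example of procurement and tendering level 2."},
]
# Render the conversation with the chat template and move the token IDs to the GPU
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

# Generate up to 512 new tokens; temperature and min_p control sampling
outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True,
                         temperature = 0.5, min_p = 0.1)
# Decode prompt and reply together; special tokens are kept in the output
tokenizer.batch_decode(outputs)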

["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHelp me to write a good example of procurement and tendering level 2.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOn a construction project, the procurement route chosen was a traditional procurement route. This involved the appointment of a main contractor who would be responsible for the design and construction of the project.\n\nThe tender process was open to all contractors who were pre-qualified by the client. The tender process involved the submission of a tender return which included the contractor's price, programme and a brief explanation of how they would approach the project.\n\nThe tender process was advertised in the local press and on the client's website. This was to ensure that all potential contractors were aware of the opportunity and could submit a tender if they wished.
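
The decoded string above contains the system header, the user prompt, and the special tokens as well as the model's answer. A minimal sketch for pulling out only the assistant's reply is shown below. It works purely on the decoded text and assumes the marker strings exactly as they appear in the output above; the function name `extract_assistant_reply` is hypothetical. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another option for removing the special tokens, although the prompt text would then still be part of the string.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant header, without special tokens."""
    _, found, reply = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        raise ValueError("no assistant header found in decoded text")
    # Drop the end-of-turn marker (if generation reached it) and anything after it.
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()


# Small illustrative string in the same layout as the output above.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "Hi there.<|eot_id|>"
)
print(extract_assistant_reply(sample))
```

In the notebook this would be called as `extract_assistant_reply(tokenizer.batch_decode(outputs)[0])`, assuming `outputs` holds a single generated sequence as in the cell above.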

Step 10: Saving the trained model locally
-----

In [58]:
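# Merge the LoRA weights into the base model, then export to GGUF.
# The log below shows this call producing a q8_0 (8-bit) GGUF file.
model.save_pretrained_gguf("RICS_APC_Master_llama3_1_instruct__Chat_template_GGUF_Q8_unsloth", tokenizer)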

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 37.85 out of 62.65 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 46.96it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at RICS_APC_Master_llama3_1_instruct__Chat_template_GGUF_Q8_unsloth into q8_0 GGUF format.
The output location will be ./RICS_APC_Master_llama3_1_instruct__Chat_template_GGUF_Q8_unsloth/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: RICS_APC_Master_llama3_1_instruct__Chat_template_GGUF_Q8_unsloth
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-000

KeyError: '"name"'
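
The log above stops partway through the conversion, so it does not show whether the GGUF file was actually written. A quick check, sketched below, is to confirm that the output file exists and begins with the four-byte `GGUF` magic that the format uses. The path is the one reported in the log; the helper name `looks_like_gguf` is hypothetical. This only inspects the first bytes and does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file

# Output location reported in the conversion log above.
OUTPUT_PATH = (
    "./RICS_APC_Master_llama3_1_instruct__Chat_template_GGUF_Q8_unsloth/"
    "unsloth.Q8_0.gguf"
)


def looks_like_gguf(path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC


if looks_like_gguf(OUTPUT_PATH):
    size_gb = Path(OUTPUT_PATH).stat().st_size / 1e9
    print(f"Found GGUF file ({size_gb:.2f} GB)")
else:
    print("GGUF file missing or incomplete; the conversion may need to be re-run")
```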