Unsloth Overview
-----

On GitHub, **"Unsloth"** refers to a library designed to optimize the fine-tuning of large language models (LLMs) such as **Llama**, **Mistral**, **Phi**, and **Gemma**. The main focus of Unsloth is to accelerate the fine-tuning process and reduce memory usage significantly. Here are some key features of Unsloth:

* **Performance Improvement:** Unsloth can fine-tune models 2-5 times faster while using 80% less memory. This is achieved through efficient memory management and optimization techniques.
* **Compatibility:** It supports most NVIDIA GPUs released since 2018, including popular models like the RTX series and A100. It works on both Linux and Windows via WSL.
* **Accuracy:** There is no loss in accuracy as Unsloth uses exact methods rather than approximations.
* **Bit Precision Support:** The library supports 4-bit and 16-bit quantization for more efficient model fine-tuning.
* **Integration:** Unsloth is compatible with the Hugging Face ecosystem, allowing users to leverage popular libraries like Transformers, PEFT, and TRL.
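
To give a rough sense of why 4-bit loading matters, the sketch below estimates the memory needed just to hold a model's weights at two precisions. The 8-billion parameter count is an illustrative assumption. The estimate ignores activations, optimizer state, LoRA adapters and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumed size of an "8B" model

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# 16-bit weights: ~16 GB
# 4-bit weights: ~4 GB
```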

For installation, Unsloth provides detailed instructions for both *Conda* and *Pip*, depending on your setup. It also includes benchmarking data to demonstrate its efficiency compared to traditional methods.

You can access the Unsloth GitHub page at the following link: https://github.com/unslothai/unsloth?tab=readme-ov-file

Step 1 : Installing the required packages
------------

Step 1.1: Windows Subsystem for Linux:

* Unsloth runs only on a Linux-based system. You can download your preferred Linux distribution from the following link: https://www.linux.org/pages/download/.

* Alternatively, on Windows you can follow https://learn.microsoft.com/en-us/windows/wsl/install to install the Windows Subsystem for Linux (WSL) on your machine.

    The quickest route is to open PowerShell and type
    
        wsl --install
_________
Step 1.2: Anaconda:

* The easiest installation of Unsloth relies on Conda, which ships with Anaconda. Download the Linux installer from Anaconda's archive, for example: https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh

* Once downloaded, open a terminal in the folder where you saved the Anaconda installer and type the following two commands:
    
        chmod +x Anaconda3-[version]-Linux-x86_64.sh
        ./Anaconda3-[version]-Linux-x86_64.sh

    Replace "[version]" with the version of Anaconda you downloaded, e.g. "Anaconda3-2024.02-1-Linux-x86_64.sh"
____________
Step 1.3: Installing Unsloth

* Head to the Unsloth repository on GitHub: https://github.com/unslothai/unsloth?tab=readme-ov-file and follow the steps below to create a Python virtual environment with the required packages installed. Open a terminal in your project folder and type the following commands:

        conda create --name unsloth_env python=3.10 pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
        conda activate unsloth_env
        pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
        pip install --no-deps "trl<0.9.0" wandb huggingface_hub peft accelerate bitsandbytes datasets

* You can also install Unsloth using pip by creating a virtual environment and then installing the Unsloth requirements as shown below.

        sudo apt install python3-venv
        python3 -m venv unsloth_env
        source unsloth_env/bin/activate
        pip install --upgrade pip
        pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
____________
Step 1.4: Installing PyTorch and the NVIDIA CUDA Toolkit

* You will need a CUDA-enabled build of PyTorch. To install the build for CUDA 12.1, type the following command in your terminal

        pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

* You might also need the NVIDIA CUDA Toolkit v12.1 installed on your machine. To install it on WSL-Ubuntu, type the following in your terminal (the commands come from NVIDIA's download page: https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local):

        wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
        sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
        wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
        sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.0-1_amd64.deb
        sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
        sudo apt-get update
        sudo apt-get -y install cuda
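
Once the installation steps above have finished, it can help to confirm that PyTorch actually sees the GPU before moving on. The following is a minimal sketch, not part of the Unsloth instructions; the helper name `cuda_report` is hypothetical, and it only uses standard PyTorch calls.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarise whether PyTorch is installed and whether it can see a CUDA GPU."""
    report = {"torch_installed": False, "cuda_available": False, "device": None}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        report["cuda_available"] = True
        report["device"] = torch.cuda.get_device_name(0)
    return report


print(cuda_report())
```

If `cuda_available` is `False` even though a GPU is present, the installed PyTorch build or the driver is the first thing to check.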



Step 2: Importing necessary packages
---

In [1]:
import torch  # PyTorch, used for building and training neural networks
from trl import SFTTrainer  # TRL (Transformer Reinforcement Learning); SFTTrainer handles supervised fine-tuning of language models
from transformers import TrainingArguments  # Holds the parameters that are set before training the model
from datasets import load_dataset  # Simplifies downloading and preparing datasets for machine learning experiments
max_seq_length = 2048  # Maximum sequence length for the input to the model

# Unsloth states that its models support RoPE scaling.
# Rotary Position Embedding (RoPE) scaling adjusts the RoPE mechanism to handle sequences of varying lengths more effectively.
# This is particularly important for models that need to process sequences longer than those seen during training, or for tasks where the importance of positional information changes dynamically.



In [2]:
import unsloth
from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
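
The RoPE scaling mentioned in the comments of the first cell can be illustrated with linear position interpolation, one common form of scaling. This is a sketch under assumptions: the function name, the base of 10000 and the head dimension of 128 are illustrative, and the source does not say which scaling method Unsloth applies to a given model.

```python
def rope_angles(position: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angles for one token position; scale > 1 compresses positions."""
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


# With scale = 2, position 4096 is rotated exactly like position 2048 without scaling,
# so a longer sequence stays inside the position range seen during training.
print(rope_angles(4096, 128, scale=2.0) == rope_angles(2048, 128))  # True
```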


Step 2.1: Importing the desired model
---

* You can find all the unsloth models through the following link: https://huggingface.co/unsloth.

* At the time of writing, it contains 84 models, a mix of base models and instruct models. Instruct models are better at following instructions from end users.

In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.586 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Step 3: Importing the training dataset and undertaking some transformation
-----

We now need to convert the dataset's CSV file into the ShareGPT format, which looks like this:

    {"from": "system", "value": "You are an assistant that will help the RICS APC candidate formulate the submission"}
    {"from": "human", "value": "Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1"}
    {"from": "gpt", "value": "I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public."}
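
As a small illustration of the structure above, the sketch below builds one ShareGPT-style conversation from a question and answer pair in plain Python. The helper name `to_sharegpt` and the sample strings are hypothetical and only mirror the three-turn layout shown.

```python
import json


def to_sharegpt(system_prompt: str, question: str, answer: str) -> list[dict]:
    """Return one conversation as a list of ShareGPT turns (system, human, gpt)."""
    return [
        {"from": "system", "value": system_prompt},
        {"from": "human", "value": question},
        {"from": "gpt", "value": answer},
    ]


example = to_sharegpt("You are a helpful assistant.", "What is 2 + 2?", "4")
for turn in example:
    print(json.dumps(turn))
```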

In [5]:
import pandas as pd

# Load the CSV file into a DataFrame
csv_file_path = 'nrm1_boq_data_41k_rows.csv'
df = pd.read_csv(csv_file_path, encoding='ISO-8859-1')

# Ensure the data has a "train" mapping, add a 'train' column with 'train' value
df['train'] = 'train'

# Create a new 'conversations' column as a list of dictionaries
df['conversations'] = df.apply(lambda row: [
    {"from": "system", "value": "You are an assistant that will help estimators, quantity surveyors and cost managers to be able to codify bills of quantities' items in accordance with New Rules of Measurement 1 issued by the RICS."},
    {"from": "human", "value": row['Question']},
    {"from": "gpt", "value": row['Answer']}
], axis=1)

# Remove the 'id', 'Question', and 'Answer' columns
df = df.drop(columns=[ 'id','Question', 'Answer'])

# Display the updated DataFrame
print(df.head())

   train                                      conversations
0  train  [{'from': 'system', 'value': 'You are an assis...
1  train  [{'from': 'system', 'value': 'You are an assis...
2  train  [{'from': 'system', 'value': 'You are an assis...
3  train  [{'from': 'system', 'value': 'You are an assis...
4  train  [{'from': 'system', 'value': 'You are an assis...


Step 4: Change the dataset format from ShareGPT into Hugging Face format
------

We now need to convert the data from the ShareGPT format above into the Hugging Face format, which looks like this:

    {"role": "system", "content": "You are an assistant that will help the RICS APC candidate formulate the submission"}
    {"role": "user", "content": "Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1"}
    {"role": "assistant", "content": "I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public."}
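
The conversion amounts to renaming keys and roles. The sketch below shows that mapping in plain Python as an illustration only; the names `ROLE_MAP` and `sharegpt_to_hf` are hypothetical, and the notebook itself relies on Unsloth's `standardize_sharegpt` for this step.

```python
# Assumed role mapping, read off the two example formats shown in this document
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_hf(conversation: list[dict]) -> list[dict]:
    """Rename ShareGPT keys ("from"/"value") to role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


sharegpt = [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi, how can I help?"},
]
print(sharegpt_to_hf(sharegpt))
```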

Once this conversion is done, the conversations can be rendered with the Llama 3.1 chat template, which looks like the following:

    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Give me a good quality example of Competency: Conduct rules, ethics and professional practice Level: 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    I am aware of the RICS' five professional and ethical standards. I am also aware of the RICS rules for both members and firms and how they would apply to me on a day-to-day basis. I am aware that in my career there would be potential conflicts of interest and the RICS' guidelines on how to handle these situations. I have attended webinars that have given me an understanding of the RICS Black Book, Bribery Act 2010 and the duty of a member of the RICS to the general public.<|eot_id|><|start_header_id|>user<|end_header_id|>
    That's great, thanks!<|eot_id|>
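
To make the template's structure concrete, here is a simplified renderer in plain Python. It is a sketch, not the tokenizer's real template: the function name `render_llama31` is hypothetical, and it leaves out the date preamble that the actual Llama 3.1 template adds to the system message.

```python
def render_llama31(messages: list[dict]) -> str:
    """Wrap each message in header tokens and close it with <|eot_id|>."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    return text


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(render_llama31(messages))
```

In the notebook this rendering is done by `tokenizer.apply_chat_template`, which should be preferred over a hand-written version.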

In [6]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Convert the pandas DataFrame into a Hugging Face Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(df)

# Now apply standardize_sharegpt
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format: 100%|██████████| 41931/41931 [00:00<00:00, 73127.51 examples/s]
Map: 100%|██████████| 41931/41931 [00:01<00:00, 39156.09 examples/s]


In [7]:
dataset[5]["conversations"]

[{'content': "You are an assistant that will help estimators, quantity surveyors and cost managers to be able to codify bills of quantities' items in accordance with New Rules of Measurement 1 issued by the RICS.",
  'role': 'system'},
 {'content': 'What is the NRM1 coding levels for "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick" ?\n\n Do you have any further information about this work item that you can provide?',
  'role': 'user'},
 {'content': 'As per my training data, I was able to find a relevant result to your search which is "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick".\n\n To understand the context of the search result, let me give you the work package breakdown.\n\nWork Package Level 1 is "Urban Realm". Work Package level 2 is "External Landscaping". Work Package Level 3 is "Asphalt Road Paving".\n\nNow, let me provide you with the NRM1 coding as requested.\n\nMy training data shows that the activity "Asphalt Road Paving 

In [8]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant that will help estimators, quantity surveyors and cost managers to be able to codify bills of quantities\' items in accordance with New Rules of Measurement 1 issued by the RICS.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the NRM1 coding levels for "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick" ?\n\n Do you have any further information about this work item that you can provide?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAs per my training data, I was able to find a relevant result to your search which is "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick".\n\n To understand the context of the search result, let me give you the work package breakdown.\n\nWork Package Level 1 is "Urban Realm". Work Package level 2 is "External Landscaping". Work Package Leve

Step 5 (Optional): Text generation before training
---------

This step defines a function that takes a prompt and prints the response generated by the imported LLM before any fine-tuning.

In [9]:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

def generate_text(text):
    # Tokenize the prompt and move the tensors to the first GPU
    inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode the generated token ids back into text
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print("Before training \n")
generate_text("What is the NRM1 code for the following activity Concrete for beams C32/40?")

Before training 

What is the NRM1 code for the following activity Concrete for beams C32/40?*
*The activity code is in the format of XXXX, where X is a number.
The NRM1 code for Concrete for beams C32/40 is 21111.
This code is broken down as follows:
* 2: Division of works (2 = concrete)
* 1: Type of work (1 = structural concrete)
* 111: Type of concrete (111 = C32/40)
The NRM1 codes are used to classify and bill activities in construction projects. The codes provide a standardized way of identifying and quantifying the work involved in a project. They are widely used in the UK construction industry and are an essential part of the construction process. Understanding the NRM1 codes is crucial for contractors, architects, engineers, and project managers to ensure accurate and efficient project execution. The codes are organized by division, type of work, and type of material, providing a clear and concise way to identify the activities involved in a project. In this case, the code 2111

Step 6: Initializing the Low Rank Adapters (LoRA)
----

In [10]:
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
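
The number of trainable parameters that the trainer reports later (41,943,040) can be reproduced by hand, since a LoRA adapter of rank r on a layer with `in` inputs and `out` outputs adds r * (in + out) weights. The dimensions below (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook.

```python
RANK = 16       # r in get_peft_model
LAYERS = 32     # assumed number of transformer layers
HIDDEN = 4096   # assumed hidden size
KV_DIM = 1024   # assumed key/value projection width (grouped-query attention)
MLP = 14336     # assumed MLP intermediate size

# (in_features, out_features) for each targeted module
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = LAYERS * per_layer
print(f"{total:,}")  # 41,943,040
```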


Step 7: Defining the training parameters & Initiating training
--

In [12]:
import os
import wandb

# Log in to Weights & Biases; read the API key from the WANDB_API_KEY environment variable rather than hardcoding it
wandb.login(key=os.environ.get("WANDB_API_KEY"))

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: mohamedraouf12421. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /home/mohamedashour/.netrc


True

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 25,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2): 100%|██████████| 41931/41931 [00:11<00:00, 3696.58 examples/s]
max_steps is given, it will override any value given in num_train_epochs
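
The training arguments above fix how much of the dataset the run actually covers. The short calculation below is a sketch using the values from this notebook (a single GPU is assumed, as reported in the training banner); the variable names are illustrative.

```python
PER_DEVICE_BATCH = 2    # per_device_train_batch_size
GRAD_ACCUM = 4          # gradient_accumulation_steps
NUM_GPUS = 1            # assumed single-GPU run
MAX_STEPS = 500         # max_steps
NUM_EXAMPLES = 41_931   # rows in the dataset

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
examples_seen = effective_batch * MAX_STEPS
fraction_of_epoch = examples_seen / NUM_EXAMPLES

print(f"Effective batch size: {effective_batch}")    # 8
print(f"Examples seen in {MAX_STEPS} steps: {examples_seen}")  # 4000
print(f"Share of one epoch: {fraction_of_epoch:.1%}")  # 9.5%
```

With `max_steps = 500` the model therefore sees under a tenth of the data; setting `num_train_epochs = 1` instead would run through all of it.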


Step 7.1 : Refining the training methodology
---
We also use Unsloth's `train_on_responses_only` function so that the loss is computed only on the assistant's outputs and ignored on the user's inputs.

The cells that follow also verify that the masking is applied correctly.

In [14]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map: 100%|██████████| 41931/41931 [00:05<00:00, 7536.90 examples/s]
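
The idea behind this masking can be sketched in plain Python: every token outside an assistant response gets the label -100, which the loss function ignores. This is an illustration under assumptions, not Unsloth's implementation; it works on whole-string tokens rather than token ids, and the marker strings and function name are hypothetical.

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips


def mask_non_responses(tokens: list, instruction_marker: str, response_marker: str) -> list:
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels


tokens = ["<user>", "What", "is", "NRM1?", "<assistant>", "A", "rule", "set.", "<user>", "Thanks"]
print(mask_non_responses(tokens, "<user>", "<assistant>"))
```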


In [15]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant that will help estimators, quantity surveyors and cost managers to be able to codify bills of quantities\' items in accordance with New Rules of Measurement 1 issued by the RICS.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the NRM1 coding levels for "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick"?\n\n Do you have any further information about this work item that you can provide?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAs per my training data, I was able to find a relevant result to your search which is "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick".\n\n To understand the context of the search result, let me give you the work package breakdown.\n\nWork Package Level 1 is "Urban Realm". Work Package level 2 is "External Landscaping". Work Package Level

In [16]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                           \n\nAs per my training data, I was able to find a relevant result to your search which is "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick".\n\n To understand the context of the search result, let me give you the work package breakdown.\n\nWork Package Level 1 is "Urban Realm". Work Package level 2 is "External Landscaping". Work Package Level 3 is "Asphalt Road Paving".\n\nNow, let me provide you with the NRM1 coding as requested.\n\nMy training data shows that the activity "Asphalt Road Paving - Crossrail access road; Type 1 sub-base, 450 thick" lies within NRM1 Level 1 "08 - External Works". The NRM 1 level 2 is "08.2 - Roads, paths, pavings and surfacings".The NRM1 level 3 is "08.2.1 - Roads, paths and pavings". The NRM1 level 4 is "08.2.1.1 - Roads: details, \nincluding width (m)". There are no NRM1 levels beyond level 4.\n\nThere are 

Step 8: Training the model
---

In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 41,931 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 41,943,040


  0%|          | 1/500 [00:05<41:41,  5.01s/it]

{'loss': 1.9537, 'grad_norm': 0.8690264821052551, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.0}


  0%|          | 2/500 [00:08<36:27,  4.39s/it]

{'loss': 2.139, 'grad_norm': 0.8418146371841431, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.0}


  1%|          | 3/500 [00:12<33:45,  4.08s/it]

{'loss': 2.2433, 'grad_norm': 0.9204106330871582, 'learning_rate': 2.4e-05, 'epoch': 0.0}


  1%|          | 4/500 [00:17<35:47,  4.33s/it]

{'loss': 1.8549, 'grad_norm': 0.7875131964683533, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.0}


  1%|          | 5/500 [00:21<34:06,  4.13s/it]

{'loss': 2.3145, 'grad_norm': 0.9660286903381348, 'learning_rate': 4e-05, 'epoch': 0.0}


  1%|          | 6/500 [00:25<34:27,  4.19s/it]

{'loss': 1.9653, 'grad_norm': 0.8894959092140198, 'learning_rate': 4.8e-05, 'epoch': 0.0}


  1%|▏         | 7/500 [00:29<33:20,  4.06s/it]

{'loss': 1.892, 'grad_norm': 0.9366384744644165, 'learning_rate': 5.6000000000000006e-05, 'epoch': 0.0}


  2%|▏         | 8/500 [00:33<34:20,  4.19s/it]

{'loss': 1.805, 'grad_norm': 0.9093270897865295, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.0}


  2%|▏         | 9/500 [00:37<33:34,  4.10s/it]

{'loss': 1.8411, 'grad_norm': 0.8981592059135437, 'learning_rate': 7.2e-05, 'epoch': 0.0}


  2%|▏         | 10/500 [00:41<33:16,  4.07s/it]

{'loss': 1.711, 'grad_norm': 0.7421170473098755, 'learning_rate': 8e-05, 'epoch': 0.0}


  2%|▏         | 11/500 [00:45<32:25,  3.98s/it]

{'loss': 1.622, 'grad_norm': 0.7360715866088867, 'learning_rate': 8.800000000000001e-05, 'epoch': 0.0}


  2%|▏         | 12/500 [00:50<35:41,  4.39s/it]

{'loss': 1.3037, 'grad_norm': 0.586004912853241, 'learning_rate': 9.6e-05, 'epoch': 0.0}


  3%|▎         | 13/500 [00:55<35:19,  4.35s/it]

{'loss': 1.3497, 'grad_norm': 0.7453957796096802, 'learning_rate': 0.00010400000000000001, 'epoch': 0.0}


  3%|▎         | 14/500 [00:58<34:02,  4.20s/it]

{'loss': 1.3518, 'grad_norm': 0.7967069149017334, 'learning_rate': 0.00011200000000000001, 'epoch': 0.0}


  3%|▎         | 15/500 [01:04<38:19,  4.74s/it]

{'loss': 1.0151, 'grad_norm': 0.7069028615951538, 'learning_rate': 0.00012, 'epoch': 0.0}


  3%|▎         | 16/500 [01:08<36:32,  4.53s/it]

{'loss': 1.1885, 'grad_norm': 1.0296721458435059, 'learning_rate': 0.00012800000000000002, 'epoch': 0.0}


  3%|▎         | 17/500 [01:12<34:48,  4.32s/it]

{'loss': 0.9219, 'grad_norm': 1.373511552810669, 'learning_rate': 0.00013600000000000003, 'epoch': 0.0}


  4%|▎         | 18/500 [01:16<33:46,  4.21s/it]

{'loss': 0.9509, 'grad_norm': 1.1834807395935059, 'learning_rate': 0.000144, 'epoch': 0.0}


  4%|▍         | 19/500 [01:20<33:08,  4.13s/it]

{'loss': 0.7439, 'grad_norm': 1.1383256912231445, 'learning_rate': 0.000152, 'epoch': 0.0}


  4%|▍         | 20/500 [01:24<33:31,  4.19s/it]

{'loss': 0.6832, 'grad_norm': 0.8263534903526306, 'learning_rate': 0.00016, 'epoch': 0.0}


  4%|▍         | 21/500 [01:29<34:23,  4.31s/it]

{'loss': 0.443, 'grad_norm': 0.7718544006347656, 'learning_rate': 0.000168, 'epoch': 0.0}


  4%|▍         | 22/500 [01:33<34:04,  4.28s/it]

{'loss': 0.3919, 'grad_norm': 0.7975578308105469, 'learning_rate': 0.00017600000000000002, 'epoch': 0.0}


  5%|▍         | 23/500 [01:37<32:56,  4.14s/it]

{'loss': 0.4016, 'grad_norm': 0.46414580941200256, 'learning_rate': 0.00018400000000000003, 'epoch': 0.0}


  5%|▍         | 24/500 [01:41<32:46,  4.13s/it]

{'loss': 0.4386, 'grad_norm': 0.4888933598995209, 'learning_rate': 0.000192, 'epoch': 0.0}


  5%|▌         | 25/500 [01:45<31:31,  3.98s/it]

{'loss': 0.4057, 'grad_norm': 0.4507012367248535, 'learning_rate': 0.0002, 'epoch': 0.0}


  5%|▌         | 26/500 [01:49<32:33,  4.12s/it]

{'loss': 0.4293, 'grad_norm': 0.5565433502197266, 'learning_rate': 0.00019957894736842108, 'epoch': 0.0}


  5%|▌         | 27/500 [01:53<31:47,  4.03s/it]

{'loss': 0.3782, 'grad_norm': 0.44629019498825073, 'learning_rate': 0.00019915789473684212, 'epoch': 0.01}


  6%|▌         | 28/500 [01:58<33:24,  4.25s/it]

{'loss': 0.4106, 'grad_norm': 0.4264375865459442, 'learning_rate': 0.00019873684210526318, 'epoch': 0.01}


  6%|▌         | 29/500 [02:03<36:15,  4.62s/it]

{'loss': 0.2561, 'grad_norm': 0.2895881235599518, 'learning_rate': 0.00019831578947368422, 'epoch': 0.01}


  6%|▌         | 30/500 [02:07<34:47,  4.44s/it]

{'loss': 0.3373, 'grad_norm': 0.45752766728401184, 'learning_rate': 0.00019789473684210526, 'epoch': 0.01}


  6%|▌         | 31/500 [02:13<37:30,  4.80s/it]

{'loss': 0.281, 'grad_norm': 0.26450198888778687, 'learning_rate': 0.00019747368421052633, 'epoch': 0.01}


  6%|▋         | 32/500 [02:18<37:23,  4.79s/it]

{'loss': 0.3063, 'grad_norm': 0.3286115825176239, 'learning_rate': 0.00019705263157894737, 'epoch': 0.01}


  7%|▋         | 33/500 [02:22<36:05,  4.64s/it]

{'loss': 0.2883, 'grad_norm': 0.4648391306400299, 'learning_rate': 0.00019663157894736843, 'epoch': 0.01}


  7%|▋         | 34/500 [02:27<36:23,  4.69s/it]

{'loss': 0.279, 'grad_norm': 0.3621416389942169, 'learning_rate': 0.0001962105263157895, 'epoch': 0.01}


  7%|▋         | 35/500 [02:32<37:28,  4.84s/it]

{'loss': 0.2653, 'grad_norm': 0.30775970220565796, 'learning_rate': 0.00019578947368421054, 'epoch': 0.01}


  7%|▋         | 36/500 [02:37<36:56,  4.78s/it]

{'loss': 0.3229, 'grad_norm': 0.463479608297348, 'learning_rate': 0.00019536842105263158, 'epoch': 0.01}


  7%|▋         | 37/500 [02:42<37:14,  4.83s/it]

{'loss': 0.215, 'grad_norm': 0.2986963093280792, 'learning_rate': 0.00019494736842105265, 'epoch': 0.01}


  8%|▊         | 38/500 [02:45<34:22,  4.46s/it]

{'loss': 0.2668, 'grad_norm': 0.3325090706348419, 'learning_rate': 0.00019452631578947369, 'epoch': 0.01}


  8%|▊         | 39/500 [02:49<33:41,  4.39s/it]

{'loss': 0.5417, 'grad_norm': 0.4322618544101715, 'learning_rate': 0.00019410526315789475, 'epoch': 0.01}


  8%|▊         | 40/500 [02:53<32:15,  4.21s/it]

{'loss': 0.3106, 'grad_norm': 0.31181564927101135, 'learning_rate': 0.0001936842105263158, 'epoch': 0.01}


  8%|▊         | 41/500 [02:58<32:47,  4.29s/it]

{'loss': 0.2704, 'grad_norm': 0.3270173966884613, 'learning_rate': 0.00019326315789473686, 'epoch': 0.01}


  8%|▊         | 42/500 [03:01<31:14,  4.09s/it]

{'loss': 0.2933, 'grad_norm': 0.4227856993675232, 'learning_rate': 0.00019284210526315793, 'epoch': 0.01}


  9%|▊         | 43/500 [03:06<31:46,  4.17s/it]

{'loss': 0.3345, 'grad_norm': 0.3589760959148407, 'learning_rate': 0.00019242105263157894, 'epoch': 0.01}


  9%|▉         | 44/500 [03:10<32:43,  4.31s/it]

{'loss': 0.2826, 'grad_norm': 0.3657502233982086, 'learning_rate': 0.000192, 'epoch': 0.01}


  9%|▉         | 45/500 [03:14<31:30,  4.16s/it]

{'loss': 0.3199, 'grad_norm': 0.35813799500465393, 'learning_rate': 0.00019157894736842104, 'epoch': 0.01}


  9%|▉         | 46/500 [03:19<32:22,  4.28s/it]

{'loss': 0.3541, 'grad_norm': 0.31427690386772156, 'learning_rate': 0.0001911578947368421, 'epoch': 0.01}


  9%|▉         | 47/500 [03:24<34:24,  4.56s/it]

{'loss': 0.2282, 'grad_norm': 0.22998924553394318, 'learning_rate': 0.00019073684210526318, 'epoch': 0.01}


 10%|▉         | 48/500 [03:28<34:00,  4.52s/it]

{'loss': 0.3829, 'grad_norm': 0.3461369276046753, 'learning_rate': 0.00019031578947368422, 'epoch': 0.01}


 10%|▉         | 49/500 [03:33<33:49,  4.50s/it]

{'loss': 0.227, 'grad_norm': 0.31743529438972473, 'learning_rate': 0.00018989473684210528, 'epoch': 0.01}


 10%|█         | 50/500 [03:36<31:56,  4.26s/it]

{'loss': 0.2819, 'grad_norm': 0.30968815088272095, 'learning_rate': 0.00018947368421052632, 'epoch': 0.01}


 10%|█         | 51/500 [03:41<31:48,  4.25s/it]

{'loss': 0.2077, 'grad_norm': 0.2674158811569214, 'learning_rate': 0.00018905263157894736, 'epoch': 0.01}


 10%|█         | 52/500 [03:46<34:17,  4.59s/it]

{'loss': 0.1922, 'grad_norm': 0.2510499358177185, 'learning_rate': 0.00018863157894736843, 'epoch': 0.01}


 11%|█         | 53/500 [03:50<33:22,  4.48s/it]

{'loss': 0.1744, 'grad_norm': 0.2964151203632355, 'learning_rate': 0.00018821052631578947, 'epoch': 0.01}


 11%|█         | 54/500 [03:55<34:40,  4.67s/it]

{'loss': 0.2393, 'grad_norm': 0.3985561430454254, 'learning_rate': 0.00018778947368421054, 'epoch': 0.01}


 11%|█         | 55/500 [04:00<33:53,  4.57s/it]

{'loss': 0.3109, 'grad_norm': 0.34049341082572937, 'learning_rate': 0.0001873684210526316, 'epoch': 0.01}


 11%|█         | 56/500 [04:04<33:11,  4.49s/it]

{'loss': 0.2665, 'grad_norm': 0.3482114374637604, 'learning_rate': 0.00018694736842105264, 'epoch': 0.01}


 11%|█▏        | 57/500 [04:08<31:40,  4.29s/it]

{'loss': 0.2201, 'grad_norm': 0.3539859652519226, 'learning_rate': 0.00018652631578947368, 'epoch': 0.01}


 12%|█▏        | 58/500 [04:12<30:26,  4.13s/it]

{'loss': 0.1984, 'grad_norm': 0.3072075843811035, 'learning_rate': 0.00018610526315789475, 'epoch': 0.01}


 12%|█▏        | 59/500 [04:19<36:36,  4.98s/it]

{'loss': 0.2272, 'grad_norm': 0.2398746907711029, 'learning_rate': 0.0001856842105263158, 'epoch': 0.01}


 12%|█▏        | 60/500 [04:24<36:24,  4.96s/it]

{'loss': 0.1949, 'grad_norm': 0.31308773159980774, 'learning_rate': 0.00018526315789473685, 'epoch': 0.01}


 12%|█▏        | 61/500 [04:29<37:05,  5.07s/it]

{'loss': 0.2083, 'grad_norm': 0.2821129858493805, 'learning_rate': 0.0001848421052631579, 'epoch': 0.01}


 12%|█▏        | 62/500 [04:33<34:07,  4.68s/it]

{'loss': 0.2412, 'grad_norm': 0.3348279893398285, 'learning_rate': 0.00018442105263157896, 'epoch': 0.01}


 13%|█▎        | 63/500 [04:37<34:32,  4.74s/it]

{'loss': 0.1996, 'grad_norm': 0.2926352918148041, 'learning_rate': 0.00018400000000000003, 'epoch': 0.01}


 13%|█▎        | 64/500 [04:42<34:26,  4.74s/it]

{'loss': 0.1906, 'grad_norm': 0.2875511646270752, 'learning_rate': 0.00018357894736842104, 'epoch': 0.01}


 13%|█▎        | 65/500 [04:48<35:33,  4.91s/it]

{'loss': 0.1959, 'grad_norm': 0.2778511643409729, 'learning_rate': 0.0001831578947368421, 'epoch': 0.01}


 13%|█▎        | 66/500 [04:53<36:25,  5.04s/it]

{'loss': 0.2206, 'grad_norm': 0.2854999899864197, 'learning_rate': 0.00018273684210526317, 'epoch': 0.01}


 13%|█▎        | 67/500 [04:57<34:17,  4.75s/it]

{'loss': 0.2692, 'grad_norm': 0.33036768436431885, 'learning_rate': 0.0001823157894736842, 'epoch': 0.01}


 14%|█▎        | 68/500 [05:01<33:31,  4.66s/it]

{'loss': 0.2389, 'grad_norm': 0.27952778339385986, 'learning_rate': 0.00018189473684210528, 'epoch': 0.01}


 14%|█▍        | 69/500 [05:06<33:56,  4.73s/it]

{'loss': 0.1852, 'grad_norm': 0.2501295208930969, 'learning_rate': 0.00018147368421052632, 'epoch': 0.01}


 14%|█▍        | 70/500 [05:11<33:35,  4.69s/it]

{'loss': 0.2182, 'grad_norm': 0.32291799783706665, 'learning_rate': 0.00018105263157894739, 'epoch': 0.01}


 14%|█▍        | 71/500 [05:15<31:47,  4.45s/it]

{'loss': 0.1607, 'grad_norm': 0.2646988034248352, 'learning_rate': 0.00018063157894736845, 'epoch': 0.01}


 14%|█▍        | 72/500 [05:19<30:25,  4.27s/it]

{'loss': 0.1427, 'grad_norm': 0.2571274936199188, 'learning_rate': 0.00018021052631578946, 'epoch': 0.01}


 15%|█▍        | 73/500 [05:22<29:00,  4.08s/it]

{'loss': 0.4621, 'grad_norm': 0.3566912114620209, 'learning_rate': 0.00017978947368421053, 'epoch': 0.01}


 15%|█▍        | 74/500 [05:26<29:02,  4.09s/it]

{'loss': 0.2716, 'grad_norm': 0.3431939482688904, 'learning_rate': 0.0001793684210526316, 'epoch': 0.01}


 15%|█▌        | 75/500 [05:30<28:42,  4.05s/it]

{'loss': 0.253, 'grad_norm': 0.40201982855796814, 'learning_rate': 0.00017894736842105264, 'epoch': 0.01}


 15%|█▌        | 76/500 [05:35<30:19,  4.29s/it]

{'loss': 0.2671, 'grad_norm': 0.4193073511123657, 'learning_rate': 0.0001785263157894737, 'epoch': 0.01}


 15%|█▌        | 77/500 [05:41<33:02,  4.69s/it]

{'loss': 0.1779, 'grad_norm': 0.2432841807603836, 'learning_rate': 0.00017810526315789474, 'epoch': 0.01}


 16%|█▌        | 78/500 [05:45<32:24,  4.61s/it]

{'loss': 0.1535, 'grad_norm': 0.2647983431816101, 'learning_rate': 0.0001776842105263158, 'epoch': 0.01}


 16%|█▌        | 79/500 [05:50<32:22,  4.61s/it]

{'loss': 0.2364, 'grad_norm': 0.2550783157348633, 'learning_rate': 0.00017726315789473685, 'epoch': 0.02}


 16%|█▌        | 80/500 [05:56<34:40,  4.95s/it]

{'loss': 0.1452, 'grad_norm': 0.22688207030296326, 'learning_rate': 0.0001768421052631579, 'epoch': 0.02}


 16%|█▌        | 81/500 [06:00<32:43,  4.69s/it]

{'loss': 0.1513, 'grad_norm': 0.2675234377384186, 'learning_rate': 0.00017642105263157896, 'epoch': 0.02}


 16%|█▋        | 82/500 [06:04<31:37,  4.54s/it]

{'loss': 0.1965, 'grad_norm': 0.2734440863132477, 'learning_rate': 0.00017600000000000002, 'epoch': 0.02}


 17%|█▋        | 83/500 [06:08<30:29,  4.39s/it]

{'loss': 0.3408, 'grad_norm': 0.31794944405555725, 'learning_rate': 0.00017557894736842106, 'epoch': 0.02}


 17%|█▋        | 84/500 [06:13<32:25,  4.68s/it]

{'loss': 0.2142, 'grad_norm': 0.2649029493331909, 'learning_rate': 0.00017515789473684213, 'epoch': 0.02}


 17%|█▋        | 85/500 [06:18<33:32,  4.85s/it]

{'loss': 0.2088, 'grad_norm': 0.3344317376613617, 'learning_rate': 0.00017473684210526317, 'epoch': 0.02}


 17%|█▋        | 86/500 [06:23<33:18,  4.83s/it]

{'loss': 0.1627, 'grad_norm': 0.420181006193161, 'learning_rate': 0.0001743157894736842, 'epoch': 0.02}


 17%|█▋        | 87/500 [06:27<30:49,  4.48s/it]

{'loss': 0.212, 'grad_norm': 0.43939611315727234, 'learning_rate': 0.00017389473684210527, 'epoch': 0.02}


 18%|█▊        | 88/500 [06:32<31:37,  4.61s/it]

{'loss': 0.3438, 'grad_norm': 0.3565710484981537, 'learning_rate': 0.00017347368421052631, 'epoch': 0.02}


 18%|█▊        | 89/500 [06:37<33:15,  4.86s/it]

{'loss': 0.3148, 'grad_norm': 0.35908371210098267, 'learning_rate': 0.00017305263157894738, 'epoch': 0.02}


 18%|█▊        | 90/500 [06:41<31:12,  4.57s/it]

{'loss': 0.1879, 'grad_norm': 0.3080580532550812, 'learning_rate': 0.00017263157894736842, 'epoch': 0.02}


 18%|█▊        | 91/500 [06:46<31:08,  4.57s/it]

{'loss': 0.16, 'grad_norm': 0.3016069531440735, 'learning_rate': 0.0001722105263157895, 'epoch': 0.02}


 18%|█▊        | 92/500 [06:49<29:24,  4.32s/it]

{'loss': 0.2309, 'grad_norm': 0.4031582772731781, 'learning_rate': 0.00017178947368421055, 'epoch': 0.02}


 19%|█▊        | 93/500 [06:53<28:14,  4.16s/it]

{'loss': 0.1511, 'grad_norm': 0.3136255145072937, 'learning_rate': 0.00017136842105263157, 'epoch': 0.02}


 19%|█▉        | 94/500 [06:57<28:13,  4.17s/it]

{'loss': 0.1867, 'grad_norm': 0.29984989762306213, 'learning_rate': 0.00017094736842105263, 'epoch': 0.02}


 19%|█▉        | 95/500 [07:01<27:56,  4.14s/it]

{'loss': 0.1745, 'grad_norm': 0.2937600314617157, 'learning_rate': 0.0001705263157894737, 'epoch': 0.02}


 19%|█▉        | 96/500 [07:05<27:22,  4.06s/it]

{'loss': 0.2041, 'grad_norm': 0.30486613512039185, 'learning_rate': 0.00017010526315789474, 'epoch': 0.02}


 19%|█▉        | 97/500 [07:10<27:33,  4.10s/it]

{'loss': 0.1553, 'grad_norm': 0.34502407908439636, 'learning_rate': 0.0001696842105263158, 'epoch': 0.02}


 20%|█▉        | 98/500 [07:13<26:50,  4.01s/it]

{'loss': 0.1662, 'grad_norm': 0.288703978061676, 'learning_rate': 0.00016926315789473684, 'epoch': 0.02}


 20%|█▉        | 99/500 [07:17<25:49,  3.86s/it]

{'loss': 0.1748, 'grad_norm': 0.30982187390327454, 'learning_rate': 0.0001688421052631579, 'epoch': 0.02}


 20%|██        | 100/500 [07:21<26:14,  3.94s/it]

{'loss': 0.3006, 'grad_norm': 0.41113772988319397, 'learning_rate': 0.00016842105263157895, 'epoch': 0.02}


 20%|██        | 101/500 [07:25<27:06,  4.08s/it]

{'loss': 0.1286, 'grad_norm': 0.25123465061187744, 'learning_rate': 0.000168, 'epoch': 0.02}


 20%|██        | 102/500 [07:30<27:46,  4.19s/it]

{'loss': 0.1669, 'grad_norm': 0.2756589651107788, 'learning_rate': 0.00016757894736842106, 'epoch': 0.02}


 21%|██        | 103/500 [07:34<28:29,  4.31s/it]

{'loss': 0.1114, 'grad_norm': 0.2350236177444458, 'learning_rate': 0.00016715789473684212, 'epoch': 0.02}


 21%|██        | 104/500 [07:39<29:17,  4.44s/it]

{'loss': 0.1741, 'grad_norm': 0.28020814061164856, 'learning_rate': 0.00016673684210526316, 'epoch': 0.02}


 21%|██        | 105/500 [07:44<29:13,  4.44s/it]

{'loss': 0.2618, 'grad_norm': 0.30332151055336, 'learning_rate': 0.00016631578947368423, 'epoch': 0.02}


 21%|██        | 106/500 [07:49<30:54,  4.71s/it]

{'loss': 0.1756, 'grad_norm': 0.3001598119735718, 'learning_rate': 0.00016589473684210527, 'epoch': 0.02}


 21%|██▏       | 107/500 [07:53<29:59,  4.58s/it]

{'loss': 0.2612, 'grad_norm': 0.35549718141555786, 'learning_rate': 0.0001654736842105263, 'epoch': 0.02}


 22%|██▏       | 108/500 [07:58<29:58,  4.59s/it]

{'loss': 0.3547, 'grad_norm': 0.3259860575199127, 'learning_rate': 0.00016505263157894738, 'epoch': 0.02}


 22%|██▏       | 109/500 [08:03<30:23,  4.66s/it]

{'loss': 0.1419, 'grad_norm': 0.2901683747768402, 'learning_rate': 0.00016463157894736842, 'epoch': 0.02}


 22%|██▏       | 110/500 [08:07<29:24,  4.53s/it]

{'loss': 0.2991, 'grad_norm': 0.33605390787124634, 'learning_rate': 0.00016421052631578948, 'epoch': 0.02}


 22%|██▏       | 111/500 [08:11<28:13,  4.35s/it]

{'loss': 0.1471, 'grad_norm': 0.2485664039850235, 'learning_rate': 0.00016378947368421055, 'epoch': 0.02}


 22%|██▏       | 112/500 [08:15<27:24,  4.24s/it]

{'loss': 0.1723, 'grad_norm': 0.3217223286628723, 'learning_rate': 0.0001633684210526316, 'epoch': 0.02}


 23%|██▎       | 113/500 [08:19<27:43,  4.30s/it]

{'loss': 0.1355, 'grad_norm': 0.2510398030281067, 'learning_rate': 0.00016294736842105265, 'epoch': 0.02}


 23%|██▎       | 114/500 [08:24<28:51,  4.49s/it]

{'loss': 0.2466, 'grad_norm': 0.33477604389190674, 'learning_rate': 0.0001625263157894737, 'epoch': 0.02}


 23%|██▎       | 115/500 [08:29<29:04,  4.53s/it]

{'loss': 0.1808, 'grad_norm': 0.2577035129070282, 'learning_rate': 0.00016210526315789473, 'epoch': 0.02}


 23%|██▎       | 116/500 [08:33<27:52,  4.36s/it]

{'loss': 0.1324, 'grad_norm': 0.28303274512290955, 'learning_rate': 0.0001616842105263158, 'epoch': 0.02}


 23%|██▎       | 117/500 [08:37<26:45,  4.19s/it]

{'loss': 0.228, 'grad_norm': 0.34274545311927795, 'learning_rate': 0.00016126315789473684, 'epoch': 0.02}


 24%|██▎       | 118/500 [08:41<27:27,  4.31s/it]

{'loss': 0.164, 'grad_norm': 0.3012693226337433, 'learning_rate': 0.0001608421052631579, 'epoch': 0.02}


 24%|██▍       | 119/500 [08:46<28:06,  4.43s/it]

{'loss': 0.128, 'grad_norm': 0.2476237416267395, 'learning_rate': 0.00016042105263157897, 'epoch': 0.02}


 24%|██▍       | 120/500 [08:50<27:31,  4.35s/it]

{'loss': 0.1835, 'grad_norm': 0.2716645300388336, 'learning_rate': 0.00016, 'epoch': 0.02}


 24%|██▍       | 121/500 [08:54<27:36,  4.37s/it]

{'loss': 0.1836, 'grad_norm': 0.24816156923770905, 'learning_rate': 0.00015957894736842105, 'epoch': 0.02}


 24%|██▍       | 122/500 [08:58<26:43,  4.24s/it]

{'loss': 0.2952, 'grad_norm': 0.3368987739086151, 'learning_rate': 0.00015915789473684212, 'epoch': 0.02}


 25%|██▍       | 123/500 [09:02<26:21,  4.20s/it]

{'loss': 0.2448, 'grad_norm': 0.35465914011001587, 'learning_rate': 0.00015873684210526316, 'epoch': 0.02}


 25%|██▍       | 124/500 [09:07<26:55,  4.30s/it]

{'loss': 0.1553, 'grad_norm': 0.26426172256469727, 'learning_rate': 0.00015831578947368423, 'epoch': 0.02}


 25%|██▌       | 125/500 [09:11<26:30,  4.24s/it]

{'loss': 0.1518, 'grad_norm': 0.2910724878311157, 'learning_rate': 0.00015789473684210527, 'epoch': 0.02}


 25%|██▌       | 126/500 [09:16<27:32,  4.42s/it]

{'loss': 0.1429, 'grad_norm': 0.2718588411808014, 'learning_rate': 0.00015747368421052633, 'epoch': 0.02}


 25%|██▌       | 127/500 [09:20<27:18,  4.39s/it]

{'loss': 0.2175, 'grad_norm': 0.3297032415866852, 'learning_rate': 0.00015705263157894737, 'epoch': 0.02}


 26%|██▌       | 128/500 [09:26<29:43,  4.80s/it]

{'loss': 0.16, 'grad_norm': 0.2559979557991028, 'learning_rate': 0.0001566315789473684, 'epoch': 0.02}


 26%|██▌       | 129/500 [09:31<30:39,  4.96s/it]

{'loss': 0.1141, 'grad_norm': 0.31922122836112976, 'learning_rate': 0.00015621052631578948, 'epoch': 0.02}


 26%|██▌       | 130/500 [09:35<28:37,  4.64s/it]

{'loss': 0.1538, 'grad_norm': 0.28278473019599915, 'learning_rate': 0.00015578947368421052, 'epoch': 0.02}


 26%|██▌       | 131/500 [09:40<28:02,  4.56s/it]

{'loss': 0.1786, 'grad_norm': 0.362164705991745, 'learning_rate': 0.00015536842105263158, 'epoch': 0.02}


 26%|██▋       | 132/500 [09:46<30:54,  5.04s/it]

{'loss': 0.1696, 'grad_norm': 0.29271021485328674, 'learning_rate': 0.00015494736842105265, 'epoch': 0.03}


 27%|██▋       | 133/500 [09:50<29:04,  4.75s/it]

{'loss': 0.1896, 'grad_norm': 0.29053938388824463, 'learning_rate': 0.0001545263157894737, 'epoch': 0.03}


 27%|██▋       | 134/500 [09:54<27:32,  4.52s/it]

{'loss': 0.2235, 'grad_norm': 0.31995701789855957, 'learning_rate': 0.00015410526315789476, 'epoch': 0.03}


 27%|██▋       | 135/500 [09:58<27:06,  4.46s/it]

{'loss': 0.1238, 'grad_norm': 0.26054778695106506, 'learning_rate': 0.0001536842105263158, 'epoch': 0.03}


 27%|██▋       | 136/500 [10:03<26:52,  4.43s/it]

{'loss': 0.2764, 'grad_norm': 0.34240323305130005, 'learning_rate': 0.00015326315789473684, 'epoch': 0.03}


 27%|██▋       | 137/500 [10:06<25:41,  4.25s/it]

{'loss': 0.2596, 'grad_norm': 0.3394276201725006, 'learning_rate': 0.0001528421052631579, 'epoch': 0.03}


 28%|██▊       | 138/500 [10:11<27:14,  4.51s/it]

{'loss': 0.1595, 'grad_norm': 0.34385716915130615, 'learning_rate': 0.00015242105263157894, 'epoch': 0.03}


 28%|██▊       | 139/500 [10:15<26:16,  4.37s/it]

{'loss': 0.1729, 'grad_norm': 0.3821275234222412, 'learning_rate': 0.000152, 'epoch': 0.03}


 28%|██▊       | 140/500 [10:19<25:23,  4.23s/it]

{'loss': 0.1181, 'grad_norm': 0.28651121258735657, 'learning_rate': 0.00015157894736842108, 'epoch': 0.03}


 28%|██▊       | 141/500 [10:23<24:42,  4.13s/it]

{'loss': 0.1472, 'grad_norm': 0.3036677837371826, 'learning_rate': 0.00015115789473684211, 'epoch': 0.03}


 28%|██▊       | 142/500 [10:27<24:27,  4.10s/it]

{'loss': 0.1916, 'grad_norm': 0.35044264793395996, 'learning_rate': 0.00015073684210526315, 'epoch': 0.03}


 29%|██▊       | 143/500 [10:32<24:38,  4.14s/it]

{'loss': 0.2291, 'grad_norm': 0.32445940375328064, 'learning_rate': 0.00015031578947368422, 'epoch': 0.03}


 29%|██▉       | 144/500 [10:36<25:45,  4.34s/it]

{'loss': 0.0952, 'grad_norm': 0.22578062117099762, 'learning_rate': 0.00014989473684210526, 'epoch': 0.03}


 29%|██▉       | 145/500 [10:40<25:03,  4.24s/it]

{'loss': 0.1322, 'grad_norm': 0.2634430527687073, 'learning_rate': 0.00014947368421052633, 'epoch': 0.03}


 29%|██▉       | 146/500 [10:45<26:20,  4.46s/it]

{'loss': 0.1604, 'grad_norm': 0.2614619731903076, 'learning_rate': 0.00014905263157894737, 'epoch': 0.03}


 29%|██▉       | 147/500 [10:49<24:34,  4.18s/it]

{'loss': 0.1438, 'grad_norm': 0.23796488344669342, 'learning_rate': 0.00014863157894736843, 'epoch': 0.03}


 30%|██▉       | 148/500 [10:53<24:10,  4.12s/it]

{'loss': 0.1154, 'grad_norm': 0.24899104237556458, 'learning_rate': 0.0001482105263157895, 'epoch': 0.03}


 30%|██▉       | 149/500 [10:57<24:15,  4.15s/it]

{'loss': 0.1167, 'grad_norm': 0.21510019898414612, 'learning_rate': 0.00014778947368421054, 'epoch': 0.03}


 30%|███       | 150/500 [11:02<24:46,  4.25s/it]

{'loss': 0.1171, 'grad_norm': 0.25968876481056213, 'learning_rate': 0.00014736842105263158, 'epoch': 0.03}


 30%|███       | 151/500 [11:06<25:11,  4.33s/it]

{'loss': 0.158, 'grad_norm': 0.3091468811035156, 'learning_rate': 0.00014694736842105265, 'epoch': 0.03}


 30%|███       | 152/500 [11:10<25:05,  4.33s/it]

{'loss': 0.1057, 'grad_norm': 0.2606527805328369, 'learning_rate': 0.00014652631578947369, 'epoch': 0.03}


 31%|███       | 153/500 [11:15<25:34,  4.42s/it]

{'loss': 0.1587, 'grad_norm': 0.43731391429901123, 'learning_rate': 0.00014610526315789475, 'epoch': 0.03}


 31%|███       | 154/500 [11:20<26:04,  4.52s/it]

{'loss': 0.1207, 'grad_norm': 0.3339029848575592, 'learning_rate': 0.0001456842105263158, 'epoch': 0.03}


 31%|███       | 155/500 [11:24<24:41,  4.29s/it]

{'loss': 0.1245, 'grad_norm': 0.2981244623661041, 'learning_rate': 0.00014526315789473686, 'epoch': 0.03}


 31%|███       | 156/500 [11:28<24:08,  4.21s/it]

{'loss': 0.1297, 'grad_norm': 0.24990370869636536, 'learning_rate': 0.00014484210526315792, 'epoch': 0.03}


 31%|███▏      | 157/500 [11:32<23:57,  4.19s/it]

{'loss': 0.0995, 'grad_norm': 0.29503095149993896, 'learning_rate': 0.00014442105263157894, 'epoch': 0.03}


 32%|███▏      | 158/500 [11:36<23:28,  4.12s/it]

{'loss': 0.1539, 'grad_norm': 0.29689520597457886, 'learning_rate': 0.000144, 'epoch': 0.03}


 32%|███▏      | 159/500 [11:40<23:07,  4.07s/it]

{'loss': 0.1403, 'grad_norm': 0.29470565915107727, 'learning_rate': 0.00014357894736842107, 'epoch': 0.03}


 32%|███▏      | 160/500 [11:44<23:56,  4.23s/it]

{'loss': 0.1386, 'grad_norm': 0.28470659255981445, 'learning_rate': 0.0001431578947368421, 'epoch': 0.03}


 32%|███▏      | 161/500 [11:49<24:34,  4.35s/it]

{'loss': 0.1067, 'grad_norm': 0.23686149716377258, 'learning_rate': 0.00014273684210526318, 'epoch': 0.03}


 32%|███▏      | 162/500 [11:53<24:21,  4.33s/it]

{'loss': 0.1762, 'grad_norm': 0.31447193026542664, 'learning_rate': 0.00014231578947368422, 'epoch': 0.03}


 33%|███▎      | 163/500 [11:58<25:00,  4.45s/it]

{'loss': 0.0914, 'grad_norm': 0.2558501362800598, 'learning_rate': 0.00014189473684210528, 'epoch': 0.03}


 33%|███▎      | 164/500 [12:02<24:34,  4.39s/it]

{'loss': 0.2149, 'grad_norm': 0.3248434364795685, 'learning_rate': 0.00014147368421052632, 'epoch': 0.03}


 33%|███▎      | 165/500 [12:06<23:28,  4.20s/it]

{'loss': 0.1192, 'grad_norm': 0.24750754237174988, 'learning_rate': 0.00014105263157894736, 'epoch': 0.03}


 33%|███▎      | 166/500 [12:10<23:18,  4.19s/it]

{'loss': 0.2734, 'grad_norm': 0.3418245017528534, 'learning_rate': 0.00014063157894736843, 'epoch': 0.03}


 33%|███▎      | 167/500 [12:15<24:56,  4.50s/it]

{'loss': 0.1469, 'grad_norm': 0.2818793058395386, 'learning_rate': 0.00014021052631578947, 'epoch': 0.03}


 34%|███▎      | 168/500 [12:19<23:36,  4.27s/it]

{'loss': 0.1021, 'grad_norm': 0.28854215145111084, 'learning_rate': 0.00013978947368421053, 'epoch': 0.03}


 34%|███▍      | 169/500 [12:23<23:53,  4.33s/it]

{'loss': 0.1199, 'grad_norm': 0.2459985613822937, 'learning_rate': 0.0001393684210526316, 'epoch': 0.03}


 34%|███▍      | 170/500 [12:28<24:58,  4.54s/it]

{'loss': 0.0836, 'grad_norm': 0.23568826913833618, 'learning_rate': 0.00013894736842105264, 'epoch': 0.03}


 34%|███▍      | 171/500 [12:34<26:05,  4.76s/it]

{'loss': 0.1928, 'grad_norm': 0.3585958182811737, 'learning_rate': 0.00013852631578947368, 'epoch': 0.03}


 34%|███▍      | 172/500 [12:38<25:12,  4.61s/it]

{'loss': 0.1338, 'grad_norm': 0.32131147384643555, 'learning_rate': 0.00013810526315789475, 'epoch': 0.03}


 35%|███▍      | 173/500 [12:44<27:08,  4.98s/it]

{'loss': 0.1056, 'grad_norm': 0.20372745394706726, 'learning_rate': 0.0001376842105263158, 'epoch': 0.03}


 35%|███▍      | 174/500 [12:49<26:39,  4.90s/it]

{'loss': 0.1064, 'grad_norm': 0.24059052765369415, 'learning_rate': 0.00013726315789473685, 'epoch': 0.03}


 35%|███▌      | 175/500 [12:54<27:13,  5.03s/it]

{'loss': 0.1626, 'grad_norm': 0.3863537609577179, 'learning_rate': 0.0001368421052631579, 'epoch': 0.03}


 35%|███▌      | 176/500 [12:58<25:18,  4.69s/it]

{'loss': 0.2042, 'grad_norm': 0.3920445442199707, 'learning_rate': 0.00013642105263157896, 'epoch': 0.03}


 35%|███▌      | 177/500 [13:02<24:09,  4.49s/it]

{'loss': 0.1778, 'grad_norm': 0.31670552492141724, 'learning_rate': 0.00013600000000000003, 'epoch': 0.03}


 36%|███▌      | 178/500 [13:06<24:08,  4.50s/it]

{'loss': 0.1132, 'grad_norm': 0.29202017188072205, 'learning_rate': 0.00013557894736842104, 'epoch': 0.03}


 36%|███▌      | 179/500 [13:11<24:07,  4.51s/it]

{'loss': 0.0936, 'grad_norm': 0.22830171883106232, 'learning_rate': 0.0001351578947368421, 'epoch': 0.03}


 36%|███▌      | 180/500 [13:15<23:27,  4.40s/it]

{'loss': 0.1407, 'grad_norm': 0.2753377854824066, 'learning_rate': 0.00013473684210526317, 'epoch': 0.03}


 36%|███▌      | 181/500 [13:19<23:03,  4.34s/it]

{'loss': 0.1436, 'grad_norm': 0.29267096519470215, 'learning_rate': 0.0001343157894736842, 'epoch': 0.03}


 36%|███▋      | 182/500 [13:23<22:34,  4.26s/it]

{'loss': 0.1373, 'grad_norm': 0.2825394868850708, 'learning_rate': 0.00013389473684210528, 'epoch': 0.03}


 37%|███▋      | 183/500 [13:29<24:40,  4.67s/it]

{'loss': 0.1135, 'grad_norm': 0.20213131606578827, 'learning_rate': 0.00013347368421052632, 'epoch': 0.03}


 37%|███▋      | 184/500 [13:33<23:02,  4.38s/it]

{'loss': 0.1108, 'grad_norm': 0.2729845941066742, 'learning_rate': 0.00013305263157894738, 'epoch': 0.04}


 37%|███▋      | 185/500 [13:37<23:07,  4.40s/it]

{'loss': 0.0835, 'grad_norm': 0.23829220235347748, 'learning_rate': 0.00013263157894736842, 'epoch': 0.04}


 37%|███▋      | 186/500 [13:42<24:26,  4.67s/it]

{'loss': 0.1762, 'grad_norm': 0.28032588958740234, 'learning_rate': 0.00013221052631578946, 'epoch': 0.04}


 37%|███▋      | 187/500 [13:46<23:10,  4.44s/it]

{'loss': 0.1196, 'grad_norm': 0.31004005670547485, 'learning_rate': 0.00013178947368421053, 'epoch': 0.04}


 38%|███▊      | 188/500 [13:50<22:15,  4.28s/it]

{'loss': 0.1786, 'grad_norm': 0.3282032907009125, 'learning_rate': 0.0001313684210526316, 'epoch': 0.04}


 38%|███▊      | 189/500 [13:54<21:06,  4.07s/it]

{'loss': 0.1755, 'grad_norm': 0.34143632650375366, 'learning_rate': 0.00013094736842105264, 'epoch': 0.04}


 38%|███▊      | 190/500 [13:58<21:13,  4.11s/it]

{'loss': 0.1589, 'grad_norm': 0.3103275001049042, 'learning_rate': 0.0001305263157894737, 'epoch': 0.04}


 38%|███▊      | 191/500 [14:02<21:23,  4.15s/it]

{'loss': 0.1149, 'grad_norm': 0.250054270029068, 'learning_rate': 0.00013010526315789474, 'epoch': 0.04}


 38%|███▊      | 192/500 [14:06<20:38,  4.02s/it]

{'loss': 0.1634, 'grad_norm': 0.2927456498146057, 'learning_rate': 0.00012968421052631578, 'epoch': 0.04}


 39%|███▊      | 193/500 [14:10<21:01,  4.11s/it]

{'loss': 0.2442, 'grad_norm': 0.3395402133464813, 'learning_rate': 0.00012926315789473685, 'epoch': 0.04}


 39%|███▉      | 194/500 [14:16<23:04,  4.53s/it]

{'loss': 0.0804, 'grad_norm': 0.20021101832389832, 'learning_rate': 0.0001288421052631579, 'epoch': 0.04}


 39%|███▉      | 195/500 [14:20<22:56,  4.51s/it]

{'loss': 0.145, 'grad_norm': 0.290925532579422, 'learning_rate': 0.00012842105263157895, 'epoch': 0.04}


 39%|███▉      | 196/500 [14:25<22:47,  4.50s/it]

{'loss': 0.1241, 'grad_norm': 0.2855030596256256, 'learning_rate': 0.00012800000000000002, 'epoch': 0.04}


 39%|███▉      | 197/500 [14:29<21:56,  4.35s/it]

{'loss': 0.0919, 'grad_norm': 0.2373659759759903, 'learning_rate': 0.00012757894736842106, 'epoch': 0.04}


 40%|███▉      | 198/500 [14:34<22:36,  4.49s/it]

{'loss': 0.0682, 'grad_norm': 0.203665092587471, 'learning_rate': 0.00012715789473684213, 'epoch': 0.04}


 40%|███▉      | 199/500 [14:37<21:00,  4.19s/it]

{'loss': 0.1196, 'grad_norm': 0.3163251280784607, 'learning_rate': 0.00012673684210526317, 'epoch': 0.04}


 40%|████      | 200/500 [14:41<20:21,  4.07s/it]

{'loss': 0.1311, 'grad_norm': 0.31600865721702576, 'learning_rate': 0.0001263157894736842, 'epoch': 0.04}


 40%|████      | 201/500 [14:45<20:37,  4.14s/it]

{'loss': 0.1554, 'grad_norm': 0.24421122670173645, 'learning_rate': 0.00012589473684210527, 'epoch': 0.04}


 40%|████      | 202/500 [14:49<20:43,  4.17s/it]

{'loss': 0.1726, 'grad_norm': 0.3056626319885254, 'learning_rate': 0.0001254736842105263, 'epoch': 0.04}


 41%|████      | 203/500 [14:53<20:24,  4.12s/it]

{'loss': 0.1509, 'grad_norm': 0.43053120374679565, 'learning_rate': 0.00012505263157894738, 'epoch': 0.04}


 41%|████      | 204/500 [14:58<21:26,  4.35s/it]

{'loss': 0.1419, 'grad_norm': 0.3030339181423187, 'learning_rate': 0.00012463157894736845, 'epoch': 0.04}


 41%|████      | 205/500 [15:03<21:47,  4.43s/it]

{'loss': 0.1339, 'grad_norm': 0.24920761585235596, 'learning_rate': 0.00012421052631578949, 'epoch': 0.04}


 41%|████      | 206/500 [15:07<21:32,  4.40s/it]

{'loss': 0.074, 'grad_norm': 0.20254293084144592, 'learning_rate': 0.00012378947368421053, 'epoch': 0.04}


 41%|████▏     | 207/500 [15:11<21:21,  4.37s/it]

{'loss': 0.0851, 'grad_norm': 0.24981631338596344, 'learning_rate': 0.00012336842105263156, 'epoch': 0.04}


 42%|████▏     | 208/500 [15:15<20:26,  4.20s/it]

{'loss': 0.1359, 'grad_norm': 0.28840330243110657, 'learning_rate': 0.00012294736842105263, 'epoch': 0.04}


 42%|████▏     | 209/500 [15:20<20:37,  4.25s/it]

{'loss': 0.1076, 'grad_norm': 0.2397603839635849, 'learning_rate': 0.0001225263157894737, 'epoch': 0.04}


 42%|████▏     | 210/500 [15:24<21:05,  4.36s/it]

{'loss': 0.1737, 'grad_norm': 0.3172774016857147, 'learning_rate': 0.00012210526315789474, 'epoch': 0.04}


 42%|████▏     | 211/500 [15:28<20:00,  4.15s/it]

{'loss': 0.2, 'grad_norm': 0.3430921733379364, 'learning_rate': 0.0001216842105263158, 'epoch': 0.04}


 42%|████▏     | 212/500 [15:32<19:08,  3.99s/it]

{'loss': 0.1278, 'grad_norm': 0.22735849022865295, 'learning_rate': 0.00012126315789473686, 'epoch': 0.04}


 43%|████▎     | 213/500 [15:36<19:11,  4.01s/it]

{'loss': 0.0851, 'grad_norm': 0.2366282194852829, 'learning_rate': 0.00012084210526315788, 'epoch': 0.04}


 43%|████▎     | 214/500 [15:39<18:49,  3.95s/it]

{'loss': 0.0682, 'grad_norm': 0.2104354351758957, 'learning_rate': 0.00012042105263157895, 'epoch': 0.04}


 43%|████▎     | 215/500 [15:44<19:24,  4.09s/it]

{'loss': 0.1352, 'grad_norm': 0.3199940323829651, 'learning_rate': 0.00012, 'epoch': 0.04}


 43%|████▎     | 216/500 [15:49<21:11,  4.48s/it]

{'loss': 0.1109, 'grad_norm': 0.31080278754234314, 'learning_rate': 0.00011957894736842106, 'epoch': 0.04}


 43%|████▎     | 217/500 [15:53<20:44,  4.40s/it]

{'loss': 0.088, 'grad_norm': 0.2219381481409073, 'learning_rate': 0.00011915789473684211, 'epoch': 0.04}


 44%|████▎     | 218/500 [15:58<20:21,  4.33s/it]

{'loss': 0.1338, 'grad_norm': 0.30068686604499817, 'learning_rate': 0.00011873684210526316, 'epoch': 0.04}


 44%|████▍     | 219/500 [16:04<23:17,  4.98s/it]

{'loss': 0.0848, 'grad_norm': 0.19616617262363434, 'learning_rate': 0.00011831578947368423, 'epoch': 0.04}


 44%|████▍     | 220/500 [16:08<21:23,  4.59s/it]

{'loss': 0.1511, 'grad_norm': 0.30681291222572327, 'learning_rate': 0.00011789473684210525, 'epoch': 0.04}


 44%|████▍     | 221/500 [16:11<20:05,  4.32s/it]

{'loss': 0.1647, 'grad_norm': 0.2607395052909851, 'learning_rate': 0.00011747368421052631, 'epoch': 0.04}


 44%|████▍     | 222/500 [16:16<20:19,  4.39s/it]

{'loss': 0.1218, 'grad_norm': 0.2117479294538498, 'learning_rate': 0.00011705263157894737, 'epoch': 0.04}


 45%|████▍     | 223/500 [16:21<20:46,  4.50s/it]

{'loss': 0.0913, 'grad_norm': 0.2361619919538498, 'learning_rate': 0.00011663157894736843, 'epoch': 0.04}


 45%|████▍     | 224/500 [16:25<19:53,  4.33s/it]

{'loss': 0.1441, 'grad_norm': 0.34330740571022034, 'learning_rate': 0.00011621052631578948, 'epoch': 0.04}


 45%|████▌     | 225/500 [16:29<19:38,  4.28s/it]

{'loss': 0.1671, 'grad_norm': 0.3083927631378174, 'learning_rate': 0.00011578947368421053, 'epoch': 0.04}


 45%|████▌     | 226/500 [16:33<19:35,  4.29s/it]

{'loss': 0.1331, 'grad_norm': 0.24578063189983368, 'learning_rate': 0.00011536842105263159, 'epoch': 0.04}


 45%|████▌     | 227/500 [16:37<18:50,  4.14s/it]

{'loss': 0.1077, 'grad_norm': 0.25552985072135925, 'learning_rate': 0.00011494736842105265, 'epoch': 0.04}


 46%|████▌     | 228/500 [16:41<18:58,  4.18s/it]

{'loss': 0.1473, 'grad_norm': 0.31425371766090393, 'learning_rate': 0.00011452631578947368, 'epoch': 0.04}


 46%|████▌     | 229/500 [16:46<19:11,  4.25s/it]

{'loss': 0.0812, 'grad_norm': 0.18188680708408356, 'learning_rate': 0.00011410526315789473, 'epoch': 0.04}


 46%|████▌     | 230/500 [16:50<19:35,  4.35s/it]

{'loss': 0.091, 'grad_norm': 0.26850855350494385, 'learning_rate': 0.0001136842105263158, 'epoch': 0.04}


 46%|████▌     | 231/500 [16:56<21:32,  4.81s/it]

{'loss': 0.0698, 'grad_norm': 0.1972936987876892, 'learning_rate': 0.00011326315789473685, 'epoch': 0.04}


 46%|████▋     | 232/500 [17:00<20:20,  4.56s/it]

{'loss': 0.1147, 'grad_norm': 0.2768886387348175, 'learning_rate': 0.0001128421052631579, 'epoch': 0.04}


 47%|████▋     | 233/500 [17:05<20:48,  4.68s/it]

{'loss': 0.1218, 'grad_norm': 0.2597796320915222, 'learning_rate': 0.00011242105263157896, 'epoch': 0.04}


 47%|████▋     | 234/500 [17:09<19:38,  4.43s/it]

{'loss': 0.1711, 'grad_norm': 0.38852599263191223, 'learning_rate': 0.00011200000000000001, 'epoch': 0.04}


 47%|████▋     | 235/500 [17:13<19:16,  4.36s/it]

{'loss': 0.0942, 'grad_norm': 0.2595232129096985, 'learning_rate': 0.00011157894736842105, 'epoch': 0.04}


 47%|████▋     | 236/500 [17:18<19:54,  4.52s/it]

{'loss': 0.0673, 'grad_norm': 0.21662487089633942, 'learning_rate': 0.0001111578947368421, 'epoch': 0.05}


 47%|████▋     | 237/500 [17:24<21:53,  4.99s/it]

{'loss': 0.0618, 'grad_norm': 0.17295439541339874, 'learning_rate': 0.00011073684210526316, 'epoch': 0.05}


 48%|████▊     | 238/500 [17:28<20:52,  4.78s/it]

{'loss': 0.2486, 'grad_norm': 0.34831926226615906, 'learning_rate': 0.00011031578947368421, 'epoch': 0.05}


 48%|████▊     | 239/500 [17:32<19:17,  4.43s/it]

{'loss': 0.1354, 'grad_norm': 0.26965832710266113, 'learning_rate': 0.00010989473684210528, 'epoch': 0.05}


 48%|████▊     | 240/500 [17:37<19:59,  4.61s/it]

{'loss': 0.1153, 'grad_norm': 0.21682602167129517, 'learning_rate': 0.00010947368421052633, 'epoch': 0.05}


 48%|████▊     | 241/500 [17:41<19:10,  4.44s/it]

{'loss': 0.1255, 'grad_norm': 0.27769166231155396, 'learning_rate': 0.00010905263157894738, 'epoch': 0.05}


 48%|████▊     | 242/500 [17:45<18:51,  4.39s/it]

{'loss': 0.0974, 'grad_norm': 0.24360200762748718, 'learning_rate': 0.00010863157894736842, 'epoch': 0.05}


 49%|████▊     | 243/500 [17:50<19:26,  4.54s/it]

{'loss': 0.0724, 'grad_norm': 0.24228130280971527, 'learning_rate': 0.00010821052631578948, 'epoch': 0.05}


 49%|████▉     | 244/500 [17:54<18:47,  4.40s/it]

{'loss': 0.0978, 'grad_norm': 0.21223211288452148, 'learning_rate': 0.00010778947368421053, 'epoch': 0.05}


 49%|████▉     | 245/500 [17:58<17:59,  4.23s/it]

{'loss': 0.1463, 'grad_norm': 0.30038732290267944, 'learning_rate': 0.00010736842105263158, 'epoch': 0.05}


 49%|████▉     | 246/500 [18:03<18:49,  4.45s/it]

{'loss': 0.1467, 'grad_norm': 0.29238131642341614, 'learning_rate': 0.00010694736842105264, 'epoch': 0.05}


 49%|████▉     | 247/500 [18:09<21:06,  5.01s/it]

{'loss': 0.0723, 'grad_norm': 0.20639640092849731, 'learning_rate': 0.0001065263157894737, 'epoch': 0.05}


 50%|████▉     | 248/500 [18:13<19:35,  4.67s/it]

{'loss': 0.1336, 'grad_norm': 0.3423282504081726, 'learning_rate': 0.00010610526315789476, 'epoch': 0.05}


 50%|████▉     | 249/500 [18:18<20:01,  4.79s/it]

{'loss': 0.1218, 'grad_norm': 0.2825952172279358, 'learning_rate': 0.00010568421052631578, 'epoch': 0.05}


 50%|█████     | 250/500 [18:23<19:52,  4.77s/it]

{'loss': 0.0827, 'grad_norm': 0.2503099739551544, 'learning_rate': 0.00010526315789473685, 'epoch': 0.05}


 50%|█████     | 251/500 [18:27<19:20,  4.66s/it]

{'loss': 0.0938, 'grad_norm': 0.2214929461479187, 'learning_rate': 0.0001048421052631579, 'epoch': 0.05}


 50%|█████     | 252/500 [18:32<19:17,  4.67s/it]

{'loss': 0.0916, 'grad_norm': 0.22942626476287842, 'learning_rate': 0.00010442105263157895, 'epoch': 0.05}


 51%|█████     | 253/500 [18:37<19:01,  4.62s/it]

{'loss': 0.0807, 'grad_norm': 0.18508635461330414, 'learning_rate': 0.00010400000000000001, 'epoch': 0.05}


 51%|█████     | 254/500 [18:40<17:53,  4.36s/it]

{'loss': 0.1146, 'grad_norm': 0.2338264137506485, 'learning_rate': 0.00010357894736842106, 'epoch': 0.05}


 51%|█████     | 255/500 [18:47<19:55,  4.88s/it]

{'loss': 0.0982, 'grad_norm': 0.2103104442358017, 'learning_rate': 0.00010315789473684211, 'epoch': 0.05}


 51%|█████     | 256/500 [18:51<19:55,  4.90s/it]

{'loss': 0.1892, 'grad_norm': 0.2926595211029053, 'learning_rate': 0.00010273684210526315, 'epoch': 0.05}


 51%|█████▏    | 257/500 [18:56<18:53,  4.66s/it]

{'loss': 0.0733, 'grad_norm': 0.19746167957782745, 'learning_rate': 0.0001023157894736842, 'epoch': 0.05}


 52%|█████▏    | 258/500 [19:00<18:19,  4.55s/it]

{'loss': 0.2763, 'grad_norm': 0.3726853132247925, 'learning_rate': 0.00010189473684210526, 'epoch': 0.05}


 52%|█████▏    | 259/500 [19:04<18:01,  4.49s/it]

{'loss': 0.1073, 'grad_norm': 0.25186747312545776, 'learning_rate': 0.00010147368421052633, 'epoch': 0.05}


 52%|█████▏    | 260/500 [19:09<17:55,  4.48s/it]

{'loss': 0.0834, 'grad_norm': 0.2057131826877594, 'learning_rate': 0.00010105263157894738, 'epoch': 0.05}


 52%|█████▏    | 261/500 [19:13<17:06,  4.29s/it]

{'loss': 0.0899, 'grad_norm': 0.23021110892295837, 'learning_rate': 0.00010063157894736843, 'epoch': 0.05}


 52%|█████▏    | 262/500 [19:18<18:02,  4.55s/it]

{'loss': 0.0802, 'grad_norm': 0.21289293467998505, 'learning_rate': 0.00010021052631578949, 'epoch': 0.05}


 53%|█████▎    | 263/500 [19:22<17:28,  4.42s/it]

{'loss': 0.1325, 'grad_norm': 0.26711195707321167, 'learning_rate': 9.978947368421054e-05, 'epoch': 0.05}


 53%|█████▎    | 264/500 [19:26<17:14,  4.38s/it]

{'loss': 0.202, 'grad_norm': 0.3094123303890228, 'learning_rate': 9.936842105263159e-05, 'epoch': 0.05}


 53%|█████▎    | 265/500 [19:30<16:47,  4.29s/it]

{'loss': 0.1001, 'grad_norm': 0.28142938017845154, 'learning_rate': 9.894736842105263e-05, 'epoch': 0.05}


 53%|█████▎    | 266/500 [19:35<17:57,  4.61s/it]

{'loss': 0.0577, 'grad_norm': 0.1552823930978775, 'learning_rate': 9.852631578947368e-05, 'epoch': 0.05}


 53%|█████▎    | 267/500 [19:39<16:54,  4.35s/it]

{'loss': 0.1301, 'grad_norm': 0.2605268955230713, 'learning_rate': 9.810526315789475e-05, 'epoch': 0.05}


 54%|█████▎    | 268/500 [19:44<17:45,  4.59s/it]

{'loss': 0.1173, 'grad_norm': 0.3107602596282959, 'learning_rate': 9.768421052631579e-05, 'epoch': 0.05}


 54%|█████▍    | 269/500 [19:49<17:58,  4.67s/it]

{'loss': 0.0879, 'grad_norm': 0.24883991479873657, 'learning_rate': 9.726315789473684e-05, 'epoch': 0.05}


 54%|█████▍    | 270/500 [19:53<16:41,  4.35s/it]

{'loss': 0.109, 'grad_norm': 0.3476671874523163, 'learning_rate': 9.68421052631579e-05, 'epoch': 0.05}


 54%|█████▍    | 271/500 [19:57<16:38,  4.36s/it]

{'loss': 0.1117, 'grad_norm': 0.27760154008865356, 'learning_rate': 9.642105263157896e-05, 'epoch': 0.05}


 54%|█████▍    | 272/500 [20:02<17:15,  4.54s/it]

{'loss': 0.0829, 'grad_norm': 0.2688063681125641, 'learning_rate': 9.6e-05, 'epoch': 0.05}


 55%|█████▍    | 273/500 [20:06<16:41,  4.41s/it]

{'loss': 0.1396, 'grad_norm': 0.3440033495426178, 'learning_rate': 9.557894736842106e-05, 'epoch': 0.05}


 55%|█████▍    | 274/500 [20:11<16:36,  4.41s/it]

{'loss': 0.1882, 'grad_norm': 0.36192214488983154, 'learning_rate': 9.515789473684211e-05, 'epoch': 0.05}


 55%|█████▌    | 275/500 [20:15<16:04,  4.29s/it]

{'loss': 0.1136, 'grad_norm': 0.33823347091674805, 'learning_rate': 9.473684210526316e-05, 'epoch': 0.05}


 55%|█████▌    | 276/500 [20:18<15:19,  4.11s/it]

{'loss': 0.0781, 'grad_norm': 0.1942373812198639, 'learning_rate': 9.431578947368421e-05, 'epoch': 0.05}


 55%|█████▌    | 277/500 [20:22<14:43,  3.96s/it]

{'loss': 0.1927, 'grad_norm': 0.40607672929763794, 'learning_rate': 9.389473684210527e-05, 'epoch': 0.05}


 56%|█████▌    | 278/500 [20:26<14:35,  3.94s/it]

{'loss': 0.1263, 'grad_norm': 0.24402566254138947, 'learning_rate': 9.347368421052632e-05, 'epoch': 0.05}


 56%|█████▌    | 279/500 [20:30<14:18,  3.88s/it]

{'loss': 0.1807, 'grad_norm': 0.35407841205596924, 'learning_rate': 9.305263157894737e-05, 'epoch': 0.05}


 56%|█████▌    | 280/500 [20:34<14:43,  4.01s/it]

{'loss': 0.0933, 'grad_norm': 0.2225029021501541, 'learning_rate': 9.263157894736843e-05, 'epoch': 0.05}


 56%|█████▌    | 281/500 [20:39<16:00,  4.39s/it]

{'loss': 0.049, 'grad_norm': 0.17242535948753357, 'learning_rate': 9.221052631578948e-05, 'epoch': 0.05}


 56%|█████▋    | 282/500 [20:43<15:34,  4.29s/it]

{'loss': 0.1262, 'grad_norm': 0.2977328896522522, 'learning_rate': 9.178947368421052e-05, 'epoch': 0.05}


 57%|█████▋    | 283/500 [20:47<15:21,  4.25s/it]

{'loss': 0.1115, 'grad_norm': 0.29261067509651184, 'learning_rate': 9.136842105263159e-05, 'epoch': 0.05}


 57%|█████▋    | 284/500 [20:51<14:51,  4.13s/it]

{'loss': 0.1079, 'grad_norm': 0.24423334002494812, 'learning_rate': 9.094736842105264e-05, 'epoch': 0.05}


 57%|█████▋    | 285/500 [20:57<16:05,  4.49s/it]

{'loss': 0.1223, 'grad_norm': 0.263161301612854, 'learning_rate': 9.052631578947369e-05, 'epoch': 0.05}


 57%|█████▋    | 286/500 [21:01<15:39,  4.39s/it]

{'loss': 0.0747, 'grad_norm': 0.23544789850711823, 'learning_rate': 9.010526315789473e-05, 'epoch': 0.05}


 57%|█████▋    | 287/500 [21:05<14:57,  4.22s/it]

{'loss': 0.1204, 'grad_norm': 0.31229183077812195, 'learning_rate': 8.96842105263158e-05, 'epoch': 0.05}


 58%|█████▊    | 288/500 [21:09<15:02,  4.26s/it]

{'loss': 0.0989, 'grad_norm': 0.27896103262901306, 'learning_rate': 8.926315789473685e-05, 'epoch': 0.05}


 58%|█████▊    | 289/500 [21:13<14:43,  4.19s/it]

{'loss': 0.1193, 'grad_norm': 0.3610929250717163, 'learning_rate': 8.88421052631579e-05, 'epoch': 0.06}


 58%|█████▊    | 290/500 [21:18<15:35,  4.45s/it]

{'loss': 0.0898, 'grad_norm': 0.21323592960834503, 'learning_rate': 8.842105263157894e-05, 'epoch': 0.06}


 58%|█████▊    | 291/500 [21:22<14:39,  4.21s/it]

{'loss': 0.1732, 'grad_norm': 0.38424086570739746, 'learning_rate': 8.800000000000001e-05, 'epoch': 0.06}


 58%|█████▊    | 292/500 [21:26<14:25,  4.16s/it]

{'loss': 0.1477, 'grad_norm': 0.3278284966945648, 'learning_rate': 8.757894736842106e-05, 'epoch': 0.06}


 59%|█████▊    | 293/500 [21:30<14:07,  4.09s/it]

{'loss': 0.0764, 'grad_norm': 0.2642281949520111, 'learning_rate': 8.71578947368421e-05, 'epoch': 0.06}


 59%|█████▉    | 294/500 [21:34<14:39,  4.27s/it]

{'loss': 0.2013, 'grad_norm': 0.3098498582839966, 'learning_rate': 8.673684210526316e-05, 'epoch': 0.06}


 59%|█████▉    | 295/500 [21:39<15:14,  4.46s/it]

{'loss': 0.1167, 'grad_norm': 0.3545690178871155, 'learning_rate': 8.631578947368421e-05, 'epoch': 0.06}


 59%|█████▉    | 296/500 [21:43<14:48,  4.35s/it]

{'loss': 0.1191, 'grad_norm': 0.29105445742607117, 'learning_rate': 8.589473684210528e-05, 'epoch': 0.06}


 59%|█████▉    | 297/500 [21:48<14:49,  4.38s/it]

{'loss': 0.1383, 'grad_norm': 0.28762736916542053, 'learning_rate': 8.547368421052632e-05, 'epoch': 0.06}


 60%|█████▉    | 298/500 [21:52<14:37,  4.35s/it]

{'loss': 0.1405, 'grad_norm': 0.28453949093818665, 'learning_rate': 8.505263157894737e-05, 'epoch': 0.06}


 60%|█████▉    | 299/500 [21:56<14:12,  4.24s/it]

{'loss': 0.1126, 'grad_norm': 0.26209431886672974, 'learning_rate': 8.463157894736842e-05, 'epoch': 0.06}


 60%|██████    | 300/500 [22:00<14:13,  4.27s/it]

{'loss': 0.1003, 'grad_norm': 0.2800105810165405, 'learning_rate': 8.421052631578948e-05, 'epoch': 0.06}


 60%|██████    | 301/500 [22:04<13:45,  4.15s/it]

{'loss': 0.1093, 'grad_norm': 0.2819897532463074, 'learning_rate': 8.378947368421053e-05, 'epoch': 0.06}


 60%|██████    | 302/500 [22:08<13:25,  4.07s/it]

{'loss': 0.0942, 'grad_norm': 0.2790413200855255, 'learning_rate': 8.336842105263158e-05, 'epoch': 0.06}


 61%|██████    | 303/500 [22:13<13:46,  4.19s/it]

{'loss': 0.1788, 'grad_norm': 0.33347582817077637, 'learning_rate': 8.294736842105263e-05, 'epoch': 0.06}


 61%|██████    | 304/500 [22:17<13:26,  4.11s/it]

{'loss': 0.1071, 'grad_norm': 0.32348424196243286, 'learning_rate': 8.252631578947369e-05, 'epoch': 0.06}


 61%|██████    | 305/500 [22:21<13:12,  4.06s/it]

{'loss': 0.1283, 'grad_norm': 0.3077157735824585, 'learning_rate': 8.210526315789474e-05, 'epoch': 0.06}


 61%|██████    | 306/500 [22:25<13:11,  4.08s/it]

{'loss': 0.1662, 'grad_norm': 0.36142393946647644, 'learning_rate': 8.16842105263158e-05, 'epoch': 0.06}


 61%|██████▏   | 307/500 [22:30<13:55,  4.33s/it]

{'loss': 0.105, 'grad_norm': 0.29444918036460876, 'learning_rate': 8.126315789473685e-05, 'epoch': 0.06}


 62%|██████▏   | 308/500 [22:33<13:19,  4.16s/it]

{'loss': 0.0844, 'grad_norm': 0.29909825325012207, 'learning_rate': 8.08421052631579e-05, 'epoch': 0.06}


 62%|██████▏   | 309/500 [22:38<14:07,  4.44s/it]

{'loss': 0.1105, 'grad_norm': 0.2573535144329071, 'learning_rate': 8.042105263157895e-05, 'epoch': 0.06}


 62%|██████▏   | 310/500 [22:42<13:31,  4.27s/it]

{'loss': 0.0948, 'grad_norm': 0.28326699137687683, 'learning_rate': 8e-05, 'epoch': 0.06}


 62%|██████▏   | 311/500 [22:47<13:30,  4.29s/it]

{'loss': 0.0925, 'grad_norm': 0.2445956915616989, 'learning_rate': 7.957894736842106e-05, 'epoch': 0.06}


 62%|██████▏   | 312/500 [22:50<13:02,  4.16s/it]

{'loss': 0.1196, 'grad_norm': 0.2596607506275177, 'learning_rate': 7.915789473684211e-05, 'epoch': 0.06}


 63%|██████▎   | 313/500 [22:55<13:26,  4.32s/it]

{'loss': 0.0546, 'grad_norm': 0.1406301110982895, 'learning_rate': 7.873684210526317e-05, 'epoch': 0.06}


 63%|██████▎   | 314/500 [22:59<13:06,  4.23s/it]

{'loss': 0.1897, 'grad_norm': 0.31911367177963257, 'learning_rate': 7.83157894736842e-05, 'epoch': 0.06}


 63%|██████▎   | 315/500 [23:03<12:53,  4.18s/it]

{'loss': 0.1513, 'grad_norm': 0.27346912026405334, 'learning_rate': 7.789473684210526e-05, 'epoch': 0.06}


 63%|██████▎   | 316/500 [23:07<12:46,  4.17s/it]

{'loss': 0.0883, 'grad_norm': 0.21193833649158478, 'learning_rate': 7.747368421052633e-05, 'epoch': 0.06}


 63%|██████▎   | 317/500 [23:12<12:46,  4.19s/it]

{'loss': 0.1579, 'grad_norm': 0.4874749481678009, 'learning_rate': 7.705263157894738e-05, 'epoch': 0.06}


 64%|██████▎   | 318/500 [23:17<13:40,  4.51s/it]

{'loss': 0.0592, 'grad_norm': 0.18526101112365723, 'learning_rate': 7.663157894736842e-05, 'epoch': 0.06}


 64%|██████▍   | 319/500 [23:20<12:43,  4.22s/it]

{'loss': 0.095, 'grad_norm': 0.2863800823688507, 'learning_rate': 7.621052631578947e-05, 'epoch': 0.06}


 64%|██████▍   | 320/500 [23:24<12:29,  4.16s/it]

{'loss': 0.1453, 'grad_norm': 0.3244883418083191, 'learning_rate': 7.578947368421054e-05, 'epoch': 0.06}


 64%|██████▍   | 321/500 [23:28<11:52,  3.98s/it]

{'loss': 0.1336, 'grad_norm': 0.2932296395301819, 'learning_rate': 7.536842105263158e-05, 'epoch': 0.06}


 64%|██████▍   | 322/500 [23:33<13:01,  4.39s/it]

{'loss': 0.0631, 'grad_norm': 0.20454657077789307, 'learning_rate': 7.494736842105263e-05, 'epoch': 0.06}


 65%|██████▍   | 323/500 [23:37<12:29,  4.23s/it]

{'loss': 0.109, 'grad_norm': 0.33544784784317017, 'learning_rate': 7.452631578947368e-05, 'epoch': 0.06}


 65%|██████▍   | 324/500 [23:42<12:31,  4.27s/it]

{'loss': 0.087, 'grad_norm': 0.2691844701766968, 'learning_rate': 7.410526315789475e-05, 'epoch': 0.06}


 65%|██████▌   | 325/500 [23:45<11:52,  4.07s/it]

{'loss': 0.0689, 'grad_norm': 0.21647317707538605, 'learning_rate': 7.368421052631579e-05, 'epoch': 0.06}


 65%|██████▌   | 326/500 [23:49<11:28,  3.96s/it]

{'loss': 0.1491, 'grad_norm': 0.40420424938201904, 'learning_rate': 7.326315789473684e-05, 'epoch': 0.06}


 65%|██████▌   | 327/500 [23:53<11:38,  4.04s/it]

{'loss': 0.0779, 'grad_norm': 0.3130727708339691, 'learning_rate': 7.28421052631579e-05, 'epoch': 0.06}


 66%|██████▌   | 328/500 [23:58<12:21,  4.31s/it]

{'loss': 0.0629, 'grad_norm': 0.21250641345977783, 'learning_rate': 7.242105263157896e-05, 'epoch': 0.06}


 66%|██████▌   | 329/500 [24:02<12:10,  4.27s/it]

{'loss': 0.0635, 'grad_norm': 0.23951318860054016, 'learning_rate': 7.2e-05, 'epoch': 0.06}


 66%|██████▌   | 330/500 [24:06<11:22,  4.02s/it]

{'loss': 0.1467, 'grad_norm': 0.4171469509601593, 'learning_rate': 7.157894736842105e-05, 'epoch': 0.06}


 66%|██████▌   | 331/500 [24:09<11:10,  3.97s/it]

{'loss': 0.1244, 'grad_norm': 0.329578697681427, 'learning_rate': 7.115789473684211e-05, 'epoch': 0.06}


 66%|██████▋   | 332/500 [24:14<11:16,  4.03s/it]

{'loss': 0.109, 'grad_norm': 0.32894349098205566, 'learning_rate': 7.073684210526316e-05, 'epoch': 0.06}


 67%|██████▋   | 333/500 [24:19<12:09,  4.37s/it]

{'loss': 0.0742, 'grad_norm': 0.2884787619113922, 'learning_rate': 7.031578947368421e-05, 'epoch': 0.06}


 67%|██████▋   | 334/500 [24:23<11:33,  4.18s/it]

{'loss': 0.1052, 'grad_norm': 0.2941529452800751, 'learning_rate': 6.989473684210527e-05, 'epoch': 0.06}


 67%|██████▋   | 335/500 [24:27<11:35,  4.22s/it]

{'loss': 0.1165, 'grad_norm': 0.2720929682254791, 'learning_rate': 6.947368421052632e-05, 'epoch': 0.06}


 67%|██████▋   | 336/500 [24:32<12:11,  4.46s/it]

{'loss': 0.0593, 'grad_norm': 0.16813452541828156, 'learning_rate': 6.905263157894737e-05, 'epoch': 0.06}


 67%|██████▋   | 337/500 [24:37<12:16,  4.52s/it]

{'loss': 0.1178, 'grad_norm': 0.31336185336112976, 'learning_rate': 6.863157894736843e-05, 'epoch': 0.06}


 68%|██████▊   | 338/500 [24:41<12:17,  4.55s/it]

{'loss': 0.051, 'grad_norm': 0.18196851015090942, 'learning_rate': 6.821052631578948e-05, 'epoch': 0.06}


 68%|██████▊   | 339/500 [24:45<11:53,  4.43s/it]

{'loss': 0.091, 'grad_norm': 0.29578036069869995, 'learning_rate': 6.778947368421052e-05, 'epoch': 0.06}


 68%|██████▊   | 340/500 [24:50<11:49,  4.44s/it]

{'loss': 0.0962, 'grad_norm': 0.26641395688056946, 'learning_rate': 6.736842105263159e-05, 'epoch': 0.06}


 68%|██████▊   | 341/500 [24:55<12:14,  4.62s/it]

{'loss': 0.1231, 'grad_norm': 0.2770141065120697, 'learning_rate': 6.694736842105264e-05, 'epoch': 0.07}


 68%|██████▊   | 342/500 [24:59<11:46,  4.47s/it]

{'loss': 0.1405, 'grad_norm': 0.25685060024261475, 'learning_rate': 6.652631578947369e-05, 'epoch': 0.07}


 69%|██████▊   | 343/500 [25:03<11:08,  4.26s/it]

{'loss': 0.107, 'grad_norm': 0.26214855909347534, 'learning_rate': 6.610526315789473e-05, 'epoch': 0.07}


 69%|██████▉   | 344/500 [25:07<11:02,  4.24s/it]

{'loss': 0.1216, 'grad_norm': 0.328445166349411, 'learning_rate': 6.56842105263158e-05, 'epoch': 0.07}


 69%|██████▉   | 345/500 [25:11<10:48,  4.18s/it]

{'loss': 0.1285, 'grad_norm': 0.3307649791240692, 'learning_rate': 6.526315789473685e-05, 'epoch': 0.07}


 69%|██████▉   | 346/500 [25:15<10:27,  4.07s/it]

{'loss': 0.1319, 'grad_norm': 0.3366764187812805, 'learning_rate': 6.484210526315789e-05, 'epoch': 0.07}


 69%|██████▉   | 347/500 [25:19<10:09,  3.98s/it]

{'loss': 0.0837, 'grad_norm': 0.2423107624053955, 'learning_rate': 6.442105263157894e-05, 'epoch': 0.07}


 70%|██████▉   | 348/500 [25:23<10:12,  4.03s/it]

{'loss': 0.1348, 'grad_norm': 0.3489731550216675, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.07}


 70%|██████▉   | 349/500 [25:27<10:07,  4.02s/it]

{'loss': 0.0908, 'grad_norm': 0.21368847787380219, 'learning_rate': 6.357894736842106e-05, 'epoch': 0.07}


 70%|███████   | 350/500 [25:31<10:01,  4.01s/it]

{'loss': 0.0947, 'grad_norm': 0.23678115010261536, 'learning_rate': 6.31578947368421e-05, 'epoch': 0.07}


 70%|███████   | 351/500 [25:35<10:25,  4.20s/it]

{'loss': 0.1048, 'grad_norm': 0.24576705694198608, 'learning_rate': 6.273684210526316e-05, 'epoch': 0.07}


 70%|███████   | 352/500 [25:40<10:24,  4.22s/it]

{'loss': 0.107, 'grad_norm': 0.24324862658977509, 'learning_rate': 6.231578947368422e-05, 'epoch': 0.07}


 71%|███████   | 353/500 [25:45<10:59,  4.49s/it]

{'loss': 0.0652, 'grad_norm': 0.2222111076116562, 'learning_rate': 6.189473684210526e-05, 'epoch': 0.07}


 71%|███████   | 354/500 [25:49<10:41,  4.39s/it]

{'loss': 0.1239, 'grad_norm': 0.3274012804031372, 'learning_rate': 6.147368421052632e-05, 'epoch': 0.07}


 71%|███████   | 355/500 [25:53<10:22,  4.29s/it]

{'loss': 0.1671, 'grad_norm': 0.3443523943424225, 'learning_rate': 6.105263157894737e-05, 'epoch': 0.07}


 71%|███████   | 356/500 [25:57<09:54,  4.13s/it]

{'loss': 0.1666, 'grad_norm': 0.39890190958976746, 'learning_rate': 6.063157894736843e-05, 'epoch': 0.07}


 71%|███████▏  | 357/500 [26:01<09:51,  4.13s/it]

{'loss': 0.0775, 'grad_norm': 0.22307521104812622, 'learning_rate': 6.0210526315789475e-05, 'epoch': 0.07}


 72%|███████▏  | 358/500 [26:05<09:59,  4.22s/it]

{'loss': 0.1606, 'grad_norm': 0.3383982181549072, 'learning_rate': 5.978947368421053e-05, 'epoch': 0.07}


 72%|███████▏  | 359/500 [26:10<10:28,  4.46s/it]

{'loss': 0.0587, 'grad_norm': 0.263671875, 'learning_rate': 5.936842105263158e-05, 'epoch': 0.07}


 72%|███████▏  | 360/500 [26:14<10:08,  4.34s/it]

{'loss': 0.1229, 'grad_norm': 0.367710679769516, 'learning_rate': 5.894736842105263e-05, 'epoch': 0.07}


 72%|███████▏  | 361/500 [26:19<10:23,  4.48s/it]

{'loss': 0.081, 'grad_norm': 0.2802647352218628, 'learning_rate': 5.852631578947369e-05, 'epoch': 0.07}


 72%|███████▏  | 362/500 [26:23<09:54,  4.31s/it]

{'loss': 0.0941, 'grad_norm': 0.29767122864723206, 'learning_rate': 5.810526315789474e-05, 'epoch': 0.07}


 73%|███████▎  | 363/500 [26:27<09:25,  4.13s/it]

{'loss': 0.0891, 'grad_norm': 0.25574469566345215, 'learning_rate': 5.7684210526315794e-05, 'epoch': 0.07}


 73%|███████▎  | 364/500 [26:31<09:09,  4.04s/it]

{'loss': 0.0955, 'grad_norm': 0.2780202329158783, 'learning_rate': 5.726315789473684e-05, 'epoch': 0.07}


 73%|███████▎  | 365/500 [26:34<08:52,  3.95s/it]

{'loss': 0.1804, 'grad_norm': 0.35497206449508667, 'learning_rate': 5.68421052631579e-05, 'epoch': 0.07}


 73%|███████▎  | 366/500 [26:39<09:09,  4.10s/it]

{'loss': 0.0681, 'grad_norm': 0.2170131653547287, 'learning_rate': 5.642105263157895e-05, 'epoch': 0.07}


 73%|███████▎  | 367/500 [26:42<08:47,  3.96s/it]

{'loss': 0.1118, 'grad_norm': 0.2622826099395752, 'learning_rate': 5.6000000000000006e-05, 'epoch': 0.07}


 74%|███████▎  | 368/500 [26:47<09:01,  4.10s/it]

{'loss': 0.1252, 'grad_norm': 0.3365526795387268, 'learning_rate': 5.557894736842105e-05, 'epoch': 0.07}


 74%|███████▍  | 369/500 [26:51<09:01,  4.13s/it]

{'loss': 0.067, 'grad_norm': 0.23222410678863525, 'learning_rate': 5.5157894736842105e-05, 'epoch': 0.07}


 74%|███████▍  | 370/500 [26:56<09:09,  4.23s/it]

{'loss': 0.1037, 'grad_norm': 0.2722150683403015, 'learning_rate': 5.4736842105263165e-05, 'epoch': 0.07}


 74%|███████▍  | 371/500 [27:00<08:58,  4.17s/it]

{'loss': 0.1824, 'grad_norm': 0.418094664812088, 'learning_rate': 5.431578947368421e-05, 'epoch': 0.07}


 74%|███████▍  | 372/500 [27:03<08:43,  4.09s/it]

{'loss': 0.0888, 'grad_norm': 0.28459808230400085, 'learning_rate': 5.3894736842105265e-05, 'epoch': 0.07}


 75%|███████▍  | 373/500 [27:07<08:35,  4.06s/it]

{'loss': 0.1757, 'grad_norm': 0.3703288435935974, 'learning_rate': 5.347368421052632e-05, 'epoch': 0.07}


 75%|███████▍  | 374/500 [27:11<08:31,  4.06s/it]

{'loss': 0.1351, 'grad_norm': 0.3880111873149872, 'learning_rate': 5.305263157894738e-05, 'epoch': 0.07}


 75%|███████▌  | 375/500 [27:16<08:36,  4.13s/it]

{'loss': 0.0738, 'grad_norm': 0.2316720187664032, 'learning_rate': 5.2631578947368424e-05, 'epoch': 0.07}


 75%|███████▌  | 376/500 [27:20<08:32,  4.13s/it]

{'loss': 0.0661, 'grad_norm': 0.22696742415428162, 'learning_rate': 5.221052631578948e-05, 'epoch': 0.07}


 75%|███████▌  | 377/500 [27:23<08:07,  3.96s/it]

{'loss': 0.092, 'grad_norm': 0.2574719786643982, 'learning_rate': 5.178947368421053e-05, 'epoch': 0.07}


 76%|███████▌  | 378/500 [27:29<08:46,  4.32s/it]

{'loss': 0.065, 'grad_norm': 0.19823235273361206, 'learning_rate': 5.1368421052631576e-05, 'epoch': 0.07}


 76%|███████▌  | 379/500 [27:33<09:00,  4.47s/it]

{'loss': 0.0867, 'grad_norm': 0.3001514673233032, 'learning_rate': 5.094736842105263e-05, 'epoch': 0.07}


 76%|███████▌  | 380/500 [27:39<09:36,  4.80s/it]

{'loss': 0.1062, 'grad_norm': 0.27283209562301636, 'learning_rate': 5.052631578947369e-05, 'epoch': 0.07}


 76%|███████▌  | 381/500 [27:44<09:23,  4.73s/it]

{'loss': 0.1151, 'grad_norm': 0.3151375949382782, 'learning_rate': 5.010526315789474e-05, 'epoch': 0.07}


 76%|███████▋  | 382/500 [27:48<09:21,  4.76s/it]

{'loss': 0.1242, 'grad_norm': 0.2897397577762604, 'learning_rate': 4.9684210526315796e-05, 'epoch': 0.07}


 77%|███████▋  | 383/500 [27:54<09:29,  4.87s/it]

{'loss': 0.0491, 'grad_norm': 0.20085375010967255, 'learning_rate': 4.926315789473684e-05, 'epoch': 0.07}


 77%|███████▋  | 384/500 [27:58<09:17,  4.80s/it]

{'loss': 0.0652, 'grad_norm': 0.20738822221755981, 'learning_rate': 4.8842105263157895e-05, 'epoch': 0.07}


 77%|███████▋  | 385/500 [28:03<08:59,  4.69s/it]

{'loss': 0.1112, 'grad_norm': 0.27941128611564636, 'learning_rate': 4.842105263157895e-05, 'epoch': 0.07}


 77%|███████▋  | 386/500 [28:07<08:50,  4.66s/it]

{'loss': 0.0741, 'grad_norm': 0.21194981038570404, 'learning_rate': 4.8e-05, 'epoch': 0.07}


 77%|███████▋  | 387/500 [28:11<08:12,  4.36s/it]

{'loss': 0.1828, 'grad_norm': 0.36845284700393677, 'learning_rate': 4.7578947368421054e-05, 'epoch': 0.07}


 78%|███████▊  | 388/500 [28:15<07:58,  4.27s/it]

{'loss': 0.1131, 'grad_norm': 0.28245246410369873, 'learning_rate': 4.715789473684211e-05, 'epoch': 0.07}


 78%|███████▊  | 389/500 [28:21<08:39,  4.68s/it]

{'loss': 0.0614, 'grad_norm': 0.22661733627319336, 'learning_rate': 4.673684210526316e-05, 'epoch': 0.07}


 78%|███████▊  | 390/500 [28:24<08:07,  4.44s/it]

{'loss': 0.1081, 'grad_norm': 0.27987605333328247, 'learning_rate': 4.6315789473684214e-05, 'epoch': 0.07}


 78%|███████▊  | 391/500 [28:28<07:46,  4.28s/it]

{'loss': 0.0673, 'grad_norm': 0.1947694569826126, 'learning_rate': 4.589473684210526e-05, 'epoch': 0.07}


 78%|███████▊  | 392/500 [28:32<07:31,  4.18s/it]

{'loss': 0.0826, 'grad_norm': 0.32981768250465393, 'learning_rate': 4.547368421052632e-05, 'epoch': 0.07}


 79%|███████▊  | 393/500 [28:36<07:01,  3.94s/it]

{'loss': 0.0984, 'grad_norm': 0.25866755843162537, 'learning_rate': 4.5052631578947366e-05, 'epoch': 0.07}


 79%|███████▉  | 394/500 [28:40<07:22,  4.18s/it]

{'loss': 0.0608, 'grad_norm': 0.1875154823064804, 'learning_rate': 4.4631578947368426e-05, 'epoch': 0.08}


 79%|███████▉  | 395/500 [28:45<07:23,  4.23s/it]

{'loss': 0.0521, 'grad_norm': 0.1370130032300949, 'learning_rate': 4.421052631578947e-05, 'epoch': 0.08}


 79%|███████▉  | 396/500 [28:49<07:27,  4.30s/it]

{'loss': 0.1021, 'grad_norm': 0.28117409348487854, 'learning_rate': 4.378947368421053e-05, 'epoch': 0.08}


 79%|███████▉  | 397/500 [28:54<07:22,  4.30s/it]

{'loss': 0.0926, 'grad_norm': 0.2216441035270691, 'learning_rate': 4.336842105263158e-05, 'epoch': 0.08}


 80%|███████▉  | 398/500 [28:58<07:32,  4.43s/it]

{'loss': 0.0856, 'grad_norm': 0.2284386307001114, 'learning_rate': 4.294736842105264e-05, 'epoch': 0.08}


 80%|███████▉  | 399/500 [29:02<07:20,  4.36s/it]

{'loss': 0.1154, 'grad_norm': 0.29505985975265503, 'learning_rate': 4.2526315789473685e-05, 'epoch': 0.08}


 80%|████████  | 400/500 [29:07<07:19,  4.40s/it]

{'loss': 0.1408, 'grad_norm': 0.3747553527355194, 'learning_rate': 4.210526315789474e-05, 'epoch': 0.08}


 80%|████████  | 401/500 [29:12<07:29,  4.54s/it]

{'loss': 0.0951, 'grad_norm': 0.32103201746940613, 'learning_rate': 4.168421052631579e-05, 'epoch': 0.08}


 80%|████████  | 402/500 [29:16<07:20,  4.49s/it]

{'loss': 0.1163, 'grad_norm': 0.31241920590400696, 'learning_rate': 4.1263157894736844e-05, 'epoch': 0.08}


 81%|████████  | 403/500 [29:20<07:03,  4.37s/it]

{'loss': 0.1267, 'grad_norm': 0.33544403314590454, 'learning_rate': 4.08421052631579e-05, 'epoch': 0.08}


 81%|████████  | 404/500 [29:25<07:08,  4.46s/it]

{'loss': 0.0681, 'grad_norm': 0.23244455456733704, 'learning_rate': 4.042105263157895e-05, 'epoch': 0.08}


 81%|████████  | 405/500 [29:29<06:53,  4.36s/it]

{'loss': 0.1214, 'grad_norm': 0.3790223002433777, 'learning_rate': 4e-05, 'epoch': 0.08}


 81%|████████  | 406/500 [29:33<06:43,  4.29s/it]

{'loss': 0.1592, 'grad_norm': 0.3386877477169037, 'learning_rate': 3.9578947368421056e-05, 'epoch': 0.08}


 81%|████████▏ | 407/500 [29:37<06:32,  4.22s/it]

{'loss': 0.0828, 'grad_norm': 0.19216440618038177, 'learning_rate': 3.91578947368421e-05, 'epoch': 0.08}


 82%|████████▏ | 408/500 [29:41<06:21,  4.15s/it]

{'loss': 0.0716, 'grad_norm': 0.22478385269641876, 'learning_rate': 3.873684210526316e-05, 'epoch': 0.08}


 82%|████████▏ | 409/500 [29:45<06:08,  4.04s/it]

{'loss': 0.124, 'grad_norm': 0.31025582551956177, 'learning_rate': 3.831578947368421e-05, 'epoch': 0.08}


 82%|████████▏ | 410/500 [29:49<05:52,  3.92s/it]

{'loss': 0.1196, 'grad_norm': 0.25475162267684937, 'learning_rate': 3.789473684210527e-05, 'epoch': 0.08}


 82%|████████▏ | 411/500 [29:53<05:50,  3.93s/it]

{'loss': 0.0728, 'grad_norm': 0.20235756039619446, 'learning_rate': 3.7473684210526315e-05, 'epoch': 0.08}


 82%|████████▏ | 412/500 [29:57<06:02,  4.12s/it]

{'loss': 0.0574, 'grad_norm': 0.18613937497138977, 'learning_rate': 3.7052631578947375e-05, 'epoch': 0.08}


 83%|████████▎ | 413/500 [30:01<05:50,  4.03s/it]

{'loss': 0.0758, 'grad_norm': 0.23934069275856018, 'learning_rate': 3.663157894736842e-05, 'epoch': 0.08}


 83%|████████▎ | 414/500 [30:05<05:52,  4.10s/it]

{'loss': 0.0653, 'grad_norm': 0.31967848539352417, 'learning_rate': 3.621052631578948e-05, 'epoch': 0.08}


 83%|████████▎ | 415/500 [30:10<06:00,  4.24s/it]

{'loss': 0.105, 'grad_norm': 0.32169508934020996, 'learning_rate': 3.578947368421053e-05, 'epoch': 0.08}


 83%|████████▎ | 416/500 [30:14<05:44,  4.10s/it]

{'loss': 0.0658, 'grad_norm': 0.2566673159599304, 'learning_rate': 3.536842105263158e-05, 'epoch': 0.08}


 83%|████████▎ | 417/500 [30:18<05:53,  4.25s/it]

{'loss': 0.0637, 'grad_norm': 0.2135099619626999, 'learning_rate': 3.4947368421052634e-05, 'epoch': 0.08}


 84%|████████▎ | 418/500 [30:22<05:31,  4.04s/it]

{'loss': 0.0862, 'grad_norm': 0.2966712713241577, 'learning_rate': 3.452631578947369e-05, 'epoch': 0.08}


 84%|████████▍ | 419/500 [30:27<05:49,  4.31s/it]

{'loss': 0.0428, 'grad_norm': 0.16356441378593445, 'learning_rate': 3.410526315789474e-05, 'epoch': 0.08}


 84%|████████▍ | 420/500 [30:31<05:51,  4.39s/it]

{'loss': 0.1689, 'grad_norm': 0.36753273010253906, 'learning_rate': 3.368421052631579e-05, 'epoch': 0.08}


 84%|████████▍ | 421/500 [30:38<06:41,  5.09s/it]

{'loss': 0.0841, 'grad_norm': 0.2722601890563965, 'learning_rate': 3.3263157894736846e-05, 'epoch': 0.08}


 84%|████████▍ | 422/500 [30:42<06:14,  4.80s/it]

{'loss': 0.0693, 'grad_norm': 0.24696838855743408, 'learning_rate': 3.28421052631579e-05, 'epoch': 0.08}


 85%|████████▍ | 423/500 [30:47<06:06,  4.76s/it]

{'loss': 0.0987, 'grad_norm': 0.2823885679244995, 'learning_rate': 3.2421052631578945e-05, 'epoch': 0.08}


 85%|████████▍ | 424/500 [30:51<05:52,  4.64s/it]

{'loss': 0.1624, 'grad_norm': 0.4081372022628784, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.08}


 85%|████████▌ | 425/500 [30:56<05:56,  4.76s/it]

{'loss': 0.0866, 'grad_norm': 0.24162046611309052, 'learning_rate': 3.157894736842105e-05, 'epoch': 0.08}


 85%|████████▌ | 426/500 [31:01<05:48,  4.71s/it]

{'loss': 0.0742, 'grad_norm': 0.30203118920326233, 'learning_rate': 3.115789473684211e-05, 'epoch': 0.08}


 85%|████████▌ | 427/500 [31:06<06:01,  4.95s/it]

{'loss': 0.1163, 'grad_norm': 0.3468382656574249, 'learning_rate': 3.073684210526316e-05, 'epoch': 0.08}


 86%|████████▌ | 428/500 [31:11<05:41,  4.75s/it]

{'loss': 0.092, 'grad_norm': 0.3111228346824646, 'learning_rate': 3.0315789473684214e-05, 'epoch': 0.08}


 86%|████████▌ | 429/500 [31:15<05:37,  4.76s/it]

{'loss': 0.0903, 'grad_norm': 0.2741858959197998, 'learning_rate': 2.9894736842105264e-05, 'epoch': 0.08}


 86%|████████▌ | 430/500 [31:20<05:22,  4.61s/it]

{'loss': 0.0771, 'grad_norm': 0.21423420310020447, 'learning_rate': 2.9473684210526314e-05, 'epoch': 0.08}


 86%|████████▌ | 431/500 [31:23<05:01,  4.36s/it]

{'loss': 0.1005, 'grad_norm': 0.27525171637535095, 'learning_rate': 2.905263157894737e-05, 'epoch': 0.08}


 86%|████████▋ | 432/500 [31:27<04:46,  4.21s/it]

{'loss': 0.1248, 'grad_norm': 0.2799108624458313, 'learning_rate': 2.863157894736842e-05, 'epoch': 0.08}


 87%|████████▋ | 433/500 [31:31<04:39,  4.17s/it]

{'loss': 0.095, 'grad_norm': 0.30519136786460876, 'learning_rate': 2.8210526315789476e-05, 'epoch': 0.08}


 87%|████████▋ | 434/500 [31:36<04:36,  4.19s/it]

{'loss': 0.09, 'grad_norm': 0.24189890921115875, 'learning_rate': 2.7789473684210526e-05, 'epoch': 0.08}


 87%|████████▋ | 435/500 [31:40<04:34,  4.23s/it]

{'loss': 0.1136, 'grad_norm': 0.279249906539917, 'learning_rate': 2.7368421052631583e-05, 'epoch': 0.08}


 87%|████████▋ | 436/500 [31:44<04:24,  4.14s/it]

{'loss': 0.128, 'grad_norm': 0.2994459271430969, 'learning_rate': 2.6947368421052632e-05, 'epoch': 0.08}


 87%|████████▋ | 437/500 [31:48<04:27,  4.24s/it]

{'loss': 0.067, 'grad_norm': 0.18149323761463165, 'learning_rate': 2.652631578947369e-05, 'epoch': 0.08}


 88%|████████▊ | 438/500 [31:53<04:24,  4.26s/it]

{'loss': 0.1156, 'grad_norm': 0.2626756429672241, 'learning_rate': 2.610526315789474e-05, 'epoch': 0.08}


 88%|████████▊ | 439/500 [31:58<04:45,  4.67s/it]

{'loss': 0.0576, 'grad_norm': 0.19556045532226562, 'learning_rate': 2.5684210526315788e-05, 'epoch': 0.08}


 88%|████████▊ | 440/500 [32:02<04:28,  4.48s/it]

{'loss': 0.0694, 'grad_norm': 0.2047635018825531, 'learning_rate': 2.5263157894736845e-05, 'epoch': 0.08}


 88%|████████▊ | 441/500 [32:07<04:32,  4.62s/it]

{'loss': 0.0805, 'grad_norm': 0.25402432680130005, 'learning_rate': 2.4842105263157898e-05, 'epoch': 0.08}


 88%|████████▊ | 442/500 [32:11<04:13,  4.38s/it]

{'loss': 0.093, 'grad_norm': 0.2294851392507553, 'learning_rate': 2.4421052631578948e-05, 'epoch': 0.08}


 89%|████████▊ | 443/500 [32:16<04:18,  4.53s/it]

{'loss': 0.0798, 'grad_norm': 0.23801138997077942, 'learning_rate': 2.4e-05, 'epoch': 0.08}


 89%|████████▉ | 444/500 [32:20<04:01,  4.31s/it]

{'loss': 0.0709, 'grad_norm': 0.22615107893943787, 'learning_rate': 2.3578947368421054e-05, 'epoch': 0.08}


 89%|████████▉ | 445/500 [32:24<03:48,  4.16s/it]

{'loss': 0.113, 'grad_norm': 0.3340100646018982, 'learning_rate': 2.3157894736842107e-05, 'epoch': 0.08}


 89%|████████▉ | 446/500 [32:27<03:39,  4.07s/it]

{'loss': 0.0944, 'grad_norm': 0.24231526255607605, 'learning_rate': 2.273684210526316e-05, 'epoch': 0.09}


 89%|████████▉ | 447/500 [32:31<03:33,  4.03s/it]

{'loss': 0.1636, 'grad_norm': 0.32709938287734985, 'learning_rate': 2.2315789473684213e-05, 'epoch': 0.09}


 90%|████████▉ | 448/500 [32:35<03:31,  4.08s/it]

{'loss': 0.0889, 'grad_norm': 0.23897668719291687, 'learning_rate': 2.1894736842105266e-05, 'epoch': 0.09}


 90%|████████▉ | 449/500 [32:39<03:26,  4.05s/it]

{'loss': 0.0946, 'grad_norm': 0.2907961308956146, 'learning_rate': 2.147368421052632e-05, 'epoch': 0.09}


 90%|█████████ | 450/500 [32:43<03:20,  4.01s/it]

{'loss': 0.0686, 'grad_norm': 0.23246529698371887, 'learning_rate': 2.105263157894737e-05, 'epoch': 0.09}


 90%|█████████ | 451/500 [32:47<03:15,  4.00s/it]

{'loss': 0.099, 'grad_norm': 0.32244783639907837, 'learning_rate': 2.0631578947368422e-05, 'epoch': 0.09}


 90%|█████████ | 452/500 [32:52<03:20,  4.19s/it]

{'loss': 0.1021, 'grad_norm': 0.29402416944503784, 'learning_rate': 2.0210526315789475e-05, 'epoch': 0.09}


 91%|█████████ | 453/500 [32:56<03:15,  4.16s/it]

{'loss': 0.0806, 'grad_norm': 0.27957308292388916, 'learning_rate': 1.9789473684210528e-05, 'epoch': 0.09}


 91%|█████████ | 454/500 [33:00<03:06,  4.05s/it]

{'loss': 0.1312, 'grad_norm': 0.40315744280815125, 'learning_rate': 1.936842105263158e-05, 'epoch': 0.09}


 91%|█████████ | 455/500 [33:04<03:00,  4.02s/it]

{'loss': 0.1044, 'grad_norm': 0.3349097669124603, 'learning_rate': 1.8947368421052634e-05, 'epoch': 0.09}


 91%|█████████ | 456/500 [33:09<03:13,  4.40s/it]

{'loss': 0.1017, 'grad_norm': 0.2909042239189148, 'learning_rate': 1.8526315789473687e-05, 'epoch': 0.09}


 91%|█████████▏| 457/500 [33:14<03:13,  4.49s/it]

{'loss': 0.1065, 'grad_norm': 0.33647939562797546, 'learning_rate': 1.810526315789474e-05, 'epoch': 0.09}


 92%|█████████▏| 458/500 [33:18<03:01,  4.32s/it]

{'loss': 0.182, 'grad_norm': 0.41976287961006165, 'learning_rate': 1.768421052631579e-05, 'epoch': 0.09}


 92%|█████████▏| 459/500 [33:23<03:11,  4.68s/it]

{'loss': 0.0581, 'grad_norm': 0.21862097084522247, 'learning_rate': 1.7263157894736843e-05, 'epoch': 0.09}


 92%|█████████▏| 460/500 [33:28<03:04,  4.61s/it]

{'loss': 0.1084, 'grad_norm': 0.29462385177612305, 'learning_rate': 1.6842105263157896e-05, 'epoch': 0.09}


 92%|█████████▏| 461/500 [33:32<02:56,  4.52s/it]

{'loss': 0.0721, 'grad_norm': 0.1995190978050232, 'learning_rate': 1.642105263157895e-05, 'epoch': 0.09}


 92%|█████████▏| 462/500 [33:37<02:59,  4.73s/it]

{'loss': 0.0597, 'grad_norm': 0.2556571066379547, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.09}


 93%|█████████▎| 463/500 [33:42<02:53,  4.69s/it]

{'loss': 0.0676, 'grad_norm': 0.284586638212204, 'learning_rate': 1.5578947368421056e-05, 'epoch': 0.09}


 93%|█████████▎| 464/500 [33:46<02:46,  4.61s/it]

{'loss': 0.0863, 'grad_norm': 0.3858416974544525, 'learning_rate': 1.5157894736842107e-05, 'epoch': 0.09}


 93%|█████████▎| 465/500 [33:50<02:34,  4.41s/it]

{'loss': 0.0956, 'grad_norm': 0.28023961186408997, 'learning_rate': 1.4736842105263157e-05, 'epoch': 0.09}


 93%|█████████▎| 466/500 [33:54<02:25,  4.27s/it]

{'loss': 0.0609, 'grad_norm': 0.19207550585269928, 'learning_rate': 1.431578947368421e-05, 'epoch': 0.09}


 93%|█████████▎| 467/500 [33:58<02:19,  4.21s/it]

{'loss': 0.105, 'grad_norm': 0.3183889091014862, 'learning_rate': 1.3894736842105263e-05, 'epoch': 0.09}


 94%|█████████▎| 468/500 [34:03<02:19,  4.36s/it]

{'loss': 0.0628, 'grad_norm': 0.23834624886512756, 'learning_rate': 1.3473684210526316e-05, 'epoch': 0.09}


 94%|█████████▍| 469/500 [34:08<02:17,  4.43s/it]

{'loss': 0.0581, 'grad_norm': 0.23404020071029663, 'learning_rate': 1.305263157894737e-05, 'epoch': 0.09}


 94%|█████████▍| 470/500 [34:13<02:26,  4.88s/it]

{'loss': 0.057, 'grad_norm': 0.17746460437774658, 'learning_rate': 1.2631578947368422e-05, 'epoch': 0.09}


 94%|█████████▍| 471/500 [34:18<02:15,  4.66s/it]

{'loss': 0.1017, 'grad_norm': 0.3334082365036011, 'learning_rate': 1.2210526315789474e-05, 'epoch': 0.09}


 94%|█████████▍| 472/500 [34:21<02:03,  4.42s/it]

{'loss': 0.0645, 'grad_norm': 0.28848618268966675, 'learning_rate': 1.1789473684210527e-05, 'epoch': 0.09}


 95%|█████████▍| 473/500 [34:25<01:53,  4.19s/it]

{'loss': 0.1542, 'grad_norm': 0.36971381306648254, 'learning_rate': 1.136842105263158e-05, 'epoch': 0.09}


 95%|█████████▍| 474/500 [34:29<01:47,  4.14s/it]

{'loss': 0.0708, 'grad_norm': 0.2538687586784363, 'learning_rate': 1.0947368421052633e-05, 'epoch': 0.09}


 95%|█████████▌| 475/500 [34:34<01:47,  4.30s/it]

{'loss': 0.0838, 'grad_norm': 0.2689308524131775, 'learning_rate': 1.0526315789473684e-05, 'epoch': 0.09}


 95%|█████████▌| 476/500 [34:39<01:51,  4.65s/it]

{'loss': 0.0637, 'grad_norm': 0.22427205741405487, 'learning_rate': 1.0105263157894738e-05, 'epoch': 0.09}


 95%|█████████▌| 477/500 [34:44<01:46,  4.64s/it]

{'loss': 0.0684, 'grad_norm': 0.21319928765296936, 'learning_rate': 9.68421052631579e-06, 'epoch': 0.09}


 96%|█████████▌| 478/500 [34:49<01:42,  4.68s/it]

{'loss': 0.0801, 'grad_norm': 0.25361013412475586, 'learning_rate': 9.263157894736844e-06, 'epoch': 0.09}


 96%|█████████▌| 479/500 [34:53<01:32,  4.42s/it]

{'loss': 0.1237, 'grad_norm': 0.38821229338645935, 'learning_rate': 8.842105263157895e-06, 'epoch': 0.09}


 96%|█████████▌| 480/500 [34:56<01:23,  4.18s/it]

{'loss': 0.0762, 'grad_norm': 0.27024567127227783, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.09}


 96%|█████████▌| 481/500 [35:00<01:18,  4.13s/it]

{'loss': 0.1617, 'grad_norm': 0.47999444603919983, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.09}


 96%|█████████▋| 482/500 [35:05<01:16,  4.25s/it]

{'loss': 0.0592, 'grad_norm': 0.2129988819360733, 'learning_rate': 7.578947368421054e-06, 'epoch': 0.09}


 97%|█████████▋| 483/500 [35:09<01:11,  4.19s/it]

{'loss': 0.1178, 'grad_norm': 0.32989051938056946, 'learning_rate': 7.157894736842105e-06, 'epoch': 0.09}


 97%|█████████▋| 484/500 [35:13<01:07,  4.22s/it]

{'loss': 0.0632, 'grad_norm': 0.21686820685863495, 'learning_rate': 6.736842105263158e-06, 'epoch': 0.09}


 97%|█████████▋| 485/500 [35:17<01:01,  4.12s/it]

{'loss': 0.1189, 'grad_norm': 0.35775190591812134, 'learning_rate': 6.315789473684211e-06, 'epoch': 0.09}


 97%|█████████▋| 486/500 [35:21<00:58,  4.19s/it]

{'loss': 0.0816, 'grad_norm': 0.2922772467136383, 'learning_rate': 5.8947368421052634e-06, 'epoch': 0.09}


 97%|█████████▋| 487/500 [35:26<00:55,  4.24s/it]

{'loss': 0.0948, 'grad_norm': 0.27994999289512634, 'learning_rate': 5.4736842105263165e-06, 'epoch': 0.09}


 98%|█████████▊| 488/500 [35:30<00:51,  4.25s/it]

{'loss': 0.0892, 'grad_norm': 0.2545601427555084, 'learning_rate': 5.052631578947369e-06, 'epoch': 0.09}


 98%|█████████▊| 489/500 [35:34<00:47,  4.36s/it]

{'loss': 0.1089, 'grad_norm': 0.29430854320526123, 'learning_rate': 4.631578947368422e-06, 'epoch': 0.09}


 98%|█████████▊| 490/500 [35:38<00:41,  4.13s/it]

{'loss': 0.1194, 'grad_norm': 0.3522242605686188, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.09}


 98%|█████████▊| 491/500 [35:42<00:37,  4.14s/it]

{'loss': 0.1171, 'grad_norm': 0.2698131501674652, 'learning_rate': 3.789473684210527e-06, 'epoch': 0.09}


 98%|█████████▊| 492/500 [35:46<00:33,  4.14s/it]

{'loss': 0.0628, 'grad_norm': 0.25049903988838196, 'learning_rate': 3.368421052631579e-06, 'epoch': 0.09}


 99%|█████████▊| 493/500 [35:51<00:29,  4.17s/it]

{'loss': 0.0716, 'grad_norm': 0.22279082238674164, 'learning_rate': 2.9473684210526317e-06, 'epoch': 0.09}


 99%|█████████▉| 494/500 [35:55<00:24,  4.16s/it]

{'loss': 0.0808, 'grad_norm': 0.2992345988750458, 'learning_rate': 2.5263157894736844e-06, 'epoch': 0.09}


 99%|█████████▉| 495/500 [35:59<00:21,  4.21s/it]

{'loss': 0.0663, 'grad_norm': 0.30422836542129517, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.09}


 99%|█████████▉| 496/500 [36:04<00:17,  4.43s/it]

{'loss': 0.1345, 'grad_norm': 0.2909713089466095, 'learning_rate': 1.6842105263157895e-06, 'epoch': 0.09}


 99%|█████████▉| 497/500 [36:08<00:13,  4.36s/it]

{'loss': 0.0946, 'grad_norm': 0.3148261606693268, 'learning_rate': 1.2631578947368422e-06, 'epoch': 0.09}


100%|█████████▉| 498/500 [36:13<00:08,  4.48s/it]

{'loss': 0.1455, 'grad_norm': 0.33239150047302246, 'learning_rate': 8.421052631578948e-07, 'epoch': 0.1}


100%|█████████▉| 499/500 [36:17<00:04,  4.41s/it]

{'loss': 0.155, 'grad_norm': 0.31553441286087036, 'learning_rate': 4.210526315789474e-07, 'epoch': 0.1}


100%|██████████| 500/500 [36:21<00:00,  4.34s/it]

{'loss': 0.0937, 'grad_norm': 0.3447301387786865, 'learning_rate': 0.0, 'epoch': 0.1}


100%|██████████| 500/500 [36:22<00:00,  4.37s/it]

{'train_runtime': 2183.2346, 'train_samples_per_second': 1.832, 'train_steps_per_second': 0.229, 'train_loss': 0.19909659508615732, 'epoch': 0.1}
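
The summary line can be cross-checked against the per-step log with a little arithmetic. The sketch below is only a sanity check. The peak learning rate (`PEAK_LR`) and warmup length (`WARMUP_STEPS`) are assumptions inferred from the logged values, since neither is stated in this output, and `linear_decay_lr` is a hypothetical helper written for illustration.

```python
# Values copied from the final summary line of the training log above.
train_runtime_s = 2183.2346
samples_per_second = 1.832
total_steps = 500

# Samples processed overall, and the implied effective batch size per step.
samples_seen = samples_per_second * train_runtime_s
effective_batch = samples_seen / total_steps

# ASSUMPTION: these two values are inferred from the logged learning rates;
# they are not stated anywhere in this output.
PEAK_LR = 2e-4
WARMUP_STEPS = 25


def linear_decay_lr(step: int) -> float:
    """Learning rate after `step` steps, for steps past warmup, decaying linearly to zero."""
    return PEAK_LR * (total_steps - step) / (total_steps - WARMUP_STEPS)


print(f"samples seen     ~ {samples_seen:.0f}")
print(f"effective batch  ~ {effective_batch:.1f}")
print(f"steps per second ~ {total_steps / train_runtime_s:.3f}")
for step in (462, 499, 500):
    print(f"lr at step {step}: {linear_decay_lr(step):.4e}")
```

Under these assumptions the computed learning rates line up with the logged ones (for example `1.6e-05` at step 462 and `0.0` at step 500), and the throughput figures imply roughly 8 samples per optimizer step.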





Step 9 (Optional): Testing the trained model
-----

In [18]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "What is the NRM1 code for the following activity Concrete for beams C32/40?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True,
                         temperature = 0.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the NRM1 code for the following activity Concrete for beams C32/40?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAs per my training data, I was able to find a relevant result to your search which is the NRM1 code for the activity "Concrete for beams C32/40".\n\n To understand the context of the search result, let me give you the activity breakdown.\n\nActivity Level 1 is "Concrete". Activity level 2 is "Beams". Activity level 3 is "Concrete for beams C32/40".\n\nNow, let me provide you with the NRM1 code as requested.\n\nMy training data shows that the activity "Concrete for beams C32/40" lies within NRM1 Level 1 "02 - Superstructure". The NRM 1 level 2 is "02.1 - Frame".The NRM1 level 3 is "02.1.4 - Concrete frames". The NRM1 level 4 is "02.1.4.2 - Beams: details, including number 
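
The decoded string above still contains the chat template's special tokens along with the system and user turns. A minimal sketch of pulling out only the assistant's reply is shown below. `extract_assistant_reply` is a hypothetical helper, not part of Unsloth or Transformers, and `sample` is a shortened, made-up stand-in for the real decoded output.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded Llama 3.1 string."""
    # Everything after the last assistant header belongs to the generated reply.
    _, found, tail = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        return ""  # no assistant turn present
    # Stop at the end-of-turn marker if the model emitted one.
    reply = tail.split(END_OF_TURN, 1)[0]
    return reply.strip()


# Shortened, hypothetical stand-in for the decoded output shown above.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the NRM1 code for the following activity Concrete for beams C32/40?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "As per my training data, I was able to find a relevant result."
    "<|eot_id|>"
)
print(extract_assistant_reply(sample))
```

In the notebook, the same helper could be applied to the first element of `tokenizer.batch_decode(outputs)`.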

Step 10: Saving the trained model locally
-----

In [19]:
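# Merge the LoRA adapters into the base model and export to GGUF with q8_0 (8-bit) quantization
model.save_pretrained_gguf("AI_BOQ_Master_llama3_1_Conversational_GGUF_Q8_unsloth", tokenizer, quantization_method = "q8_0")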

make: Entering directory '/home/mohamedashour/Documents/Projects/BoQ_LLM_Finetuning/llama.cpp'
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -D

100%|██████████| 32/32 [00:00<00:00, 46.60it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at AI_BOQ_Master_llama3_1_Conversational_GGUF_Q8_unsloth into q8_0 GGUF format.
The output location will be ./AI_BOQ_Master_llama3_1_Conversational_GGUF_Q8_unsloth/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: AI_BOQ_Master_llama3_1_Conversational_GGUF_Q8_unsloth
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-

KeyError: '"name"'
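
The traceback is truncated, so the actual cause of this `KeyError` cannot be confirmed from the output above. One common way Python produces exactly this message is calling `str.format` on text that contains literal JSON braces. The sketch below only reproduces that general behaviour. The `template` string and the `try_format` helper are hypothetical, and whether this is what happened inside the export step is an assumption.

```python
# Hypothetical text containing literal JSON braces, as a chat template might.
template = 'Call the tool as {"name": "get_weather"}'


def try_format(text: str) -> str:
    """Call str.format with no arguments and report the outcome as a string."""
    try:
        return text.format()
    except KeyError as err:
        return f"KeyError: {err}"


# str.format treats {"name": ...} as a replacement field named "name".
print(try_format(template))

# Doubling the braces marks them as literal characters, so formatting succeeds.
escaped = template.replace("{", "{{").replace("}", "}}")
print(try_format(escaped))
```

If the failure does come from a formatted template, checking the installed Unsloth version against the project's issue tracker would be a reasonable next step.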