In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2024-06-29 14:28:36.155064: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-29 14:28:36.180386: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 3060. Max memory: 11.754 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [3]:
prompt = """ 
Below you will find the title, summary, and body of a news article. 
Your task is to analyze these components and classify whether the article is sensationalist or not.

Sensationalist is defined as: "presenting information in a way that is intended to provoke public interest, excitement, or anxiety, at the expense of accuracy."

### Article information:
    Title: {}
    Subheading: {}
    Body: {}
    is this article false?: {}
    
### is this article sensationalist?:
{}
"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    
    title = examples["Titular"]
    summary = examples["Copete"]
    body      = examples["Cuerpo"]
    is_sensationalist      = examples["Amarillismo"]
    is_false = examples["Falsa"]
    texts = []
    for title, summary, body, is_sensationalist, is_false in zip(title, summary, body, is_sensationalist, is_false):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(title, summary, body, is_sensationalist, is_false) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [16]:
# format csv
import pandas as pd 
sensationalist_dataset1 = pd.read_csv('data/trainset_realnews_and_corpus.csv', encoding='latin1', index_col='Id')
sensationalist_dataset2 = pd.read_csv('data/trainset_synthetic_data.csv', encoding='latin1', index_col='Id')

#Training dataset 
sensationalist_dataset = pd.concat([sensationalist_dataset1, sensationalist_dataset2], ignore_index=True)
sensationalist_dataset.to_json('data/trainset.json', orient='index')

In [19]:
def load_dataset(dataset_path):
    dataset = pd.read_json(dataset_path, orient='index',  encoding='latin1')
    return dataset

In [10]:
from datasets import load_dataset, Dataset

#dataset = load_dataset("json", data_files= "data/dataset_gpt_utf8.json", split='train')
#dataset = dataset.map(formatting_prompts_func, batched = True,)

dataset_dir = formatting_prompts_func(sensationalist_dataset)
dataset = Dataset.from_dict(dataset_dir)


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 1000,
        warmup_steps = 5,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2): 100%|██████████| 99/99 [00:00<00:00, 226.25 examples/s]
max_steps is given, it will override any value given in num_train_epochs


In [14]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3060. Max memory = 11.754 GB.
5.594 GB of memory reserved.


In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 99 | Num Epochs = 84
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,000
 "-____-"     Number of trainable parameters = 41,943,040
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mpaulitapb[0m ([33mpaps[0m). Use [1m`wandb login --relogin`[0m to force relogin


  0%|          | 1/1000 [00:06<1:44:09,  6.26s/it]

{'loss': 2.1963, 'grad_norm': 0.9739549160003662, 'learning_rate': 4e-05, 'epoch': 0.08}


  0%|          | 2/1000 [00:11<1:34:01,  5.65s/it]

{'loss': 2.2009, 'grad_norm': 1.005650281906128, 'learning_rate': 8e-05, 'epoch': 0.16}


  0%|          | 3/1000 [00:16<1:31:20,  5.50s/it]

{'loss': 2.2076, 'grad_norm': 0.9623019099235535, 'learning_rate': 0.00012, 'epoch': 0.24}


  0%|          | 4/1000 [00:21<1:29:14,  5.38s/it]

{'loss': 2.1659, 'grad_norm': 1.0381481647491455, 'learning_rate': 0.00016, 'epoch': 0.32}


  0%|          | 5/1000 [00:27<1:28:55,  5.36s/it]

{'loss': 1.9218, 'grad_norm': 0.9618408679962158, 'learning_rate': 0.0002, 'epoch': 0.4}


  1%|          | 6/1000 [00:32<1:26:29,  5.22s/it]

{'loss': 1.7087, 'grad_norm': 0.9865421056747437, 'learning_rate': 0.00019979899497487438, 'epoch': 0.48}


  1%|          | 7/1000 [00:37<1:26:55,  5.25s/it]

{'loss': 1.3951, 'grad_norm': 1.0355925559997559, 'learning_rate': 0.00019959798994974876, 'epoch': 0.56}


  1%|          | 8/1000 [00:42<1:27:36,  5.30s/it]

{'loss': 1.1552, 'grad_norm': 2.0387303829193115, 'learning_rate': 0.00019939698492462313, 'epoch': 0.64}


  1%|          | 9/1000 [00:48<1:26:31,  5.24s/it]

{'loss': 1.2188, 'grad_norm': 4.105612277984619, 'learning_rate': 0.0001991959798994975, 'epoch': 0.72}


  1%|          | 10/1000 [00:52<1:22:39,  5.01s/it]

{'loss': 0.9423, 'grad_norm': 2.0221354961395264, 'learning_rate': 0.00019899497487437187, 'epoch': 0.8}


  1%|          | 11/1000 [00:57<1:19:58,  4.85s/it]

{'loss': 0.7917, 'grad_norm': 0.9749330282211304, 'learning_rate': 0.00019879396984924622, 'epoch': 0.88}


  1%|          | 12/1000 [01:02<1:20:35,  4.89s/it]

{'loss': 0.7123, 'grad_norm': 0.8535776734352112, 'learning_rate': 0.00019859296482412062, 'epoch': 0.96}


  1%|▏         | 13/1000 [01:06<1:19:46,  4.85s/it]

{'loss': 0.6279, 'grad_norm': 0.6610275506973267, 'learning_rate': 0.000198391959798995, 'epoch': 1.04}


  1%|▏         | 14/1000 [01:12<1:21:53,  4.98s/it]

{'loss': 0.6614, 'grad_norm': 0.8727880716323853, 'learning_rate': 0.00019819095477386937, 'epoch': 1.12}


  2%|▏         | 15/1000 [01:16<1:21:18,  4.95s/it]

{'loss': 0.6039, 'grad_norm': 0.8348535895347595, 'learning_rate': 0.0001979899497487437, 'epoch': 1.2}


  2%|▏         | 16/1000 [01:22<1:21:29,  4.97s/it]

{'loss': 0.5477, 'grad_norm': 0.6346345543861389, 'learning_rate': 0.0001977889447236181, 'epoch': 1.28}


  2%|▏         | 17/1000 [01:27<1:22:07,  5.01s/it]

{'loss': 0.5964, 'grad_norm': 0.8253000378608704, 'learning_rate': 0.00019758793969849249, 'epoch': 1.36}


  2%|▏         | 18/1000 [01:32<1:21:41,  4.99s/it]

{'loss': 0.6085, 'grad_norm': 0.6346504092216492, 'learning_rate': 0.00019738693467336683, 'epoch': 1.44}


  2%|▏         | 19/1000 [01:37<1:22:03,  5.02s/it]

{'loss': 0.5646, 'grad_norm': 0.6416369676589966, 'learning_rate': 0.0001971859296482412, 'epoch': 1.52}


  2%|▏         | 20/1000 [01:42<1:24:45,  5.19s/it]

{'loss': 0.5119, 'grad_norm': 0.6354654431343079, 'learning_rate': 0.0001969849246231156, 'epoch': 1.6}


  2%|▏         | 21/1000 [01:47<1:23:05,  5.09s/it]

{'loss': 0.5483, 'grad_norm': 0.6993542909622192, 'learning_rate': 0.00019678391959798995, 'epoch': 1.68}


  2%|▏         | 22/1000 [01:52<1:23:10,  5.10s/it]

{'loss': 0.529, 'grad_norm': 0.7388333678245544, 'learning_rate': 0.00019658291457286432, 'epoch': 1.76}


  2%|▏         | 23/1000 [01:57<1:23:15,  5.11s/it]

{'loss': 0.5224, 'grad_norm': 0.7735182046890259, 'learning_rate': 0.0001963819095477387, 'epoch': 1.84}


  2%|▏         | 24/1000 [02:03<1:24:39,  5.20s/it]

{'loss': 0.584, 'grad_norm': 0.8755325078964233, 'learning_rate': 0.0001961809045226131, 'epoch': 1.92}


  2%|▎         | 25/1000 [02:07<1:21:15,  5.00s/it]

{'loss': 0.4668, 'grad_norm': 0.968716025352478, 'learning_rate': 0.00019597989949748744, 'epoch': 2.0}


  3%|▎         | 26/1000 [02:12<1:20:27,  4.96s/it]

{'loss': 0.4283, 'grad_norm': 0.8532636165618896, 'learning_rate': 0.00019577889447236181, 'epoch': 2.08}


  3%|▎         | 27/1000 [02:17<1:19:54,  4.93s/it]

{'loss': 0.3905, 'grad_norm': 0.5437811017036438, 'learning_rate': 0.0001955778894472362, 'epoch': 2.16}


  3%|▎         | 28/1000 [02:22<1:19:22,  4.90s/it]

{'loss': 0.3828, 'grad_norm': 0.4976365566253662, 'learning_rate': 0.00019537688442211056, 'epoch': 2.24}


  3%|▎         | 29/1000 [02:26<1:18:00,  4.82s/it]

{'loss': 0.3854, 'grad_norm': 0.5733087658882141, 'learning_rate': 0.00019517587939698493, 'epoch': 2.32}


  3%|▎         | 30/1000 [02:31<1:16:59,  4.76s/it]

{'loss': 0.3948, 'grad_norm': 0.5853146910667419, 'learning_rate': 0.0001949748743718593, 'epoch': 2.4}


  3%|▎         | 31/1000 [02:36<1:18:07,  4.84s/it]

{'loss': 0.4073, 'grad_norm': 0.6412811875343323, 'learning_rate': 0.00019477386934673368, 'epoch': 2.48}


  3%|▎         | 32/1000 [02:41<1:18:05,  4.84s/it]

{'loss': 0.3434, 'grad_norm': 0.5775811076164246, 'learning_rate': 0.00019457286432160805, 'epoch': 2.56}


  3%|▎         | 33/1000 [02:46<1:17:04,  4.78s/it]

{'loss': 0.3723, 'grad_norm': 0.5855540037155151, 'learning_rate': 0.00019437185929648243, 'epoch': 2.64}


  3%|▎         | 34/1000 [02:50<1:17:17,  4.80s/it]

{'loss': 0.3854, 'grad_norm': 0.6225841045379639, 'learning_rate': 0.0001941708542713568, 'epoch': 2.72}


  4%|▎         | 35/1000 [02:56<1:18:20,  4.87s/it]

{'loss': 0.3788, 'grad_norm': 0.5833591222763062, 'learning_rate': 0.00019396984924623117, 'epoch': 2.8}


  4%|▎         | 36/1000 [03:00<1:17:03,  4.80s/it]

{'loss': 0.3388, 'grad_norm': 0.6228195428848267, 'learning_rate': 0.00019376884422110552, 'epoch': 2.88}


  4%|▎         | 37/1000 [03:05<1:18:07,  4.87s/it]

{'loss': 0.3827, 'grad_norm': 0.606898844242096, 'learning_rate': 0.00019356783919597992, 'epoch': 2.96}


  4%|▍         | 38/1000 [03:10<1:16:03,  4.74s/it]

{'loss': 0.3397, 'grad_norm': 0.5917562246322632, 'learning_rate': 0.0001933668341708543, 'epoch': 3.04}


  4%|▍         | 39/1000 [03:14<1:16:23,  4.77s/it]

{'loss': 0.286, 'grad_norm': 0.5873132944107056, 'learning_rate': 0.00019316582914572864, 'epoch': 3.12}


  4%|▍         | 40/1000 [03:19<1:17:33,  4.85s/it]

{'loss': 0.2374, 'grad_norm': 0.5040827393531799, 'learning_rate': 0.000192964824120603, 'epoch': 3.2}


  4%|▍         | 41/1000 [03:24<1:16:27,  4.78s/it]

{'loss': 0.2328, 'grad_norm': 0.5438747406005859, 'learning_rate': 0.0001927638190954774, 'epoch': 3.28}


  4%|▍         | 42/1000 [03:29<1:17:39,  4.86s/it]

{'loss': 0.2261, 'grad_norm': 0.5582587122917175, 'learning_rate': 0.00019256281407035178, 'epoch': 3.36}


  4%|▍         | 43/1000 [03:34<1:17:27,  4.86s/it]

{'loss': 0.2629, 'grad_norm': 0.8430999517440796, 'learning_rate': 0.00019236180904522613, 'epoch': 3.44}


  4%|▍         | 44/1000 [03:39<1:17:14,  4.85s/it]

{'loss': 0.2587, 'grad_norm': 0.9704208970069885, 'learning_rate': 0.0001921608040201005, 'epoch': 3.52}


  4%|▍         | 45/1000 [03:44<1:17:59,  4.90s/it]

{'loss': 0.232, 'grad_norm': 0.7489526271820068, 'learning_rate': 0.0001919597989949749, 'epoch': 3.6}


  5%|▍         | 46/1000 [03:48<1:16:40,  4.82s/it]

{'loss': 0.2349, 'grad_norm': 0.7122463583946228, 'learning_rate': 0.00019175879396984925, 'epoch': 3.68}


  5%|▍         | 47/1000 [03:54<1:17:30,  4.88s/it]

{'loss': 0.2661, 'grad_norm': 0.7289416193962097, 'learning_rate': 0.00019155778894472362, 'epoch': 3.76}


  5%|▍         | 48/1000 [03:58<1:16:13,  4.80s/it]

{'loss': 0.2196, 'grad_norm': 0.5720440149307251, 'learning_rate': 0.000191356783919598, 'epoch': 3.84}


  5%|▍         | 49/1000 [04:03<1:16:09,  4.80s/it]

{'loss': 0.2304, 'grad_norm': 0.636533260345459, 'learning_rate': 0.0001911557788944724, 'epoch': 3.92}


  5%|▌         | 50/1000 [04:07<1:13:57,  4.67s/it]

{'loss': 0.2261, 'grad_norm': 0.666868269443512, 'learning_rate': 0.00019095477386934674, 'epoch': 4.0}


  5%|▌         | 51/1000 [04:12<1:14:41,  4.72s/it]

{'loss': 0.1737, 'grad_norm': 0.6086011528968811, 'learning_rate': 0.0001907537688442211, 'epoch': 4.08}


  5%|▌         | 52/1000 [04:17<1:14:08,  4.69s/it]

{'loss': 0.1638, 'grad_norm': 0.6351302862167358, 'learning_rate': 0.00019055276381909548, 'epoch': 4.16}


  5%|▌         | 53/1000 [04:21<1:13:47,  4.68s/it]

{'loss': 0.146, 'grad_norm': 0.5696669816970825, 'learning_rate': 0.00019035175879396986, 'epoch': 4.24}


  5%|▌         | 54/1000 [04:26<1:14:24,  4.72s/it]

{'loss': 0.1464, 'grad_norm': 0.6391039490699768, 'learning_rate': 0.00019015075376884423, 'epoch': 4.32}


  6%|▌         | 55/1000 [04:31<1:15:45,  4.81s/it]

{'loss': 0.1373, 'grad_norm': 0.7015664577484131, 'learning_rate': 0.0001899497487437186, 'epoch': 4.4}


  6%|▌         | 56/1000 [04:36<1:16:33,  4.87s/it]

{'loss': 0.1568, 'grad_norm': 0.815325140953064, 'learning_rate': 0.00018974874371859298, 'epoch': 4.48}


  6%|▌         | 57/1000 [04:41<1:16:17,  4.85s/it]

{'loss': 0.1162, 'grad_norm': 0.7448962330818176, 'learning_rate': 0.00018954773869346732, 'epoch': 4.56}


  6%|▌         | 58/1000 [04:46<1:16:04,  4.85s/it]

{'loss': 0.1163, 'grad_norm': 0.6378484964370728, 'learning_rate': 0.00018934673366834172, 'epoch': 4.64}


  6%|▌         | 59/1000 [04:51<1:17:00,  4.91s/it]

{'loss': 0.1462, 'grad_norm': 0.8607362508773804, 'learning_rate': 0.0001891457286432161, 'epoch': 4.72}


  6%|▌         | 60/1000 [04:56<1:16:45,  4.90s/it]

{'loss': 0.1557, 'grad_norm': 0.8434344530105591, 'learning_rate': 0.00018894472361809047, 'epoch': 4.8}


  6%|▌         | 61/1000 [05:01<1:16:24,  4.88s/it]

{'loss': 0.1347, 'grad_norm': 0.6514307856559753, 'learning_rate': 0.00018874371859296481, 'epoch': 4.88}


  6%|▌         | 62/1000 [05:05<1:15:17,  4.82s/it]

{'loss': 0.1316, 'grad_norm': 0.6250692009925842, 'learning_rate': 0.00018854271356783921, 'epoch': 4.96}


  6%|▋         | 63/1000 [05:10<1:12:44,  4.66s/it]

{'loss': 0.1456, 'grad_norm': 0.7189111113548279, 'learning_rate': 0.0001883417085427136, 'epoch': 5.04}


  6%|▋         | 64/1000 [05:14<1:13:42,  4.72s/it]

{'loss': 0.0852, 'grad_norm': 0.41868582367897034, 'learning_rate': 0.00018814070351758793, 'epoch': 5.12}


  6%|▋         | 65/1000 [05:19<1:13:21,  4.71s/it]

{'loss': 0.0999, 'grad_norm': 0.44075343012809753, 'learning_rate': 0.0001879396984924623, 'epoch': 5.2}


  7%|▋         | 66/1000 [05:24<1:13:07,  4.70s/it]

{'loss': 0.0933, 'grad_norm': 0.4663720726966858, 'learning_rate': 0.0001877386934673367, 'epoch': 5.28}


  7%|▋         | 67/1000 [05:29<1:12:59,  4.69s/it]

{'loss': 0.1067, 'grad_norm': 0.5848297476768494, 'learning_rate': 0.00018753768844221108, 'epoch': 5.36}


  7%|▋         | 68/1000 [05:33<1:13:46,  4.75s/it]

{'loss': 0.0959, 'grad_norm': 0.7221266031265259, 'learning_rate': 0.00018733668341708543, 'epoch': 5.44}


  7%|▋         | 69/1000 [05:38<1:14:18,  4.79s/it]

{'loss': 0.1238, 'grad_norm': 0.9834173321723938, 'learning_rate': 0.0001871356783919598, 'epoch': 5.52}


  7%|▋         | 70/1000 [05:43<1:15:29,  4.87s/it]

{'loss': 0.0945, 'grad_norm': 0.6254335641860962, 'learning_rate': 0.0001869346733668342, 'epoch': 5.6}


  7%|▋         | 71/1000 [05:48<1:14:31,  4.81s/it]

{'loss': 0.1063, 'grad_norm': 0.6424602270126343, 'learning_rate': 0.00018673366834170854, 'epoch': 5.68}


  7%|▋         | 72/1000 [05:53<1:14:44,  4.83s/it]

{'loss': 0.1192, 'grad_norm': 0.6007814407348633, 'learning_rate': 0.00018653266331658292, 'epoch': 5.76}


  7%|▋         | 73/1000 [05:58<1:15:47,  4.91s/it]

{'loss': 0.089, 'grad_norm': 0.5104584693908691, 'learning_rate': 0.0001863316582914573, 'epoch': 5.84}


  7%|▋         | 74/1000 [06:03<1:16:33,  4.96s/it]

{'loss': 0.1027, 'grad_norm': 0.5616321563720703, 'learning_rate': 0.0001861306532663317, 'epoch': 5.92}


  8%|▊         | 75/1000 [06:08<1:15:15,  4.88s/it]

{'loss': 0.103, 'grad_norm': 0.6430128812789917, 'learning_rate': 0.00018592964824120604, 'epoch': 6.0}


  8%|▊         | 76/1000 [06:12<1:14:09,  4.82s/it]

{'loss': 0.059, 'grad_norm': 0.3404957354068756, 'learning_rate': 0.0001857286432160804, 'epoch': 6.08}


  8%|▊         | 77/1000 [06:17<1:14:04,  4.82s/it]

{'loss': 0.0515, 'grad_norm': 0.32009100914001465, 'learning_rate': 0.00018552763819095478, 'epoch': 6.16}


  8%|▊         | 78/1000 [06:22<1:14:54,  4.87s/it]

{'loss': 0.0771, 'grad_norm': 0.488523006439209, 'learning_rate': 0.00018532663316582915, 'epoch': 6.24}


  8%|▊         | 79/1000 [06:27<1:14:34,  4.86s/it]

{'loss': 0.0652, 'grad_norm': 0.4089829623699188, 'learning_rate': 0.00018512562814070353, 'epoch': 6.32}


  8%|▊         | 80/1000 [06:32<1:14:20,  4.85s/it]

{'loss': 0.0566, 'grad_norm': 0.4282756745815277, 'learning_rate': 0.0001849246231155779, 'epoch': 6.4}


  8%|▊         | 81/1000 [06:37<1:13:16,  4.78s/it]

{'loss': 0.0601, 'grad_norm': 0.5030786395072937, 'learning_rate': 0.00018472361809045227, 'epoch': 6.48}


  8%|▊         | 82/1000 [06:41<1:12:25,  4.73s/it]

{'loss': 0.0727, 'grad_norm': 0.7491628527641296, 'learning_rate': 0.00018452261306532662, 'epoch': 6.56}


  8%|▊         | 83/1000 [06:46<1:12:42,  4.76s/it]

{'loss': 0.0887, 'grad_norm': 0.6405494213104248, 'learning_rate': 0.00018432160804020102, 'epoch': 6.64}


  8%|▊         | 84/1000 [06:51<1:13:48,  4.83s/it]

{'loss': 0.0667, 'grad_norm': 0.548359751701355, 'learning_rate': 0.0001841206030150754, 'epoch': 6.72}


  8%|▊         | 85/1000 [06:56<1:14:35,  4.89s/it]

{'loss': 0.0662, 'grad_norm': 0.324091374874115, 'learning_rate': 0.00018391959798994977, 'epoch': 6.8}


  9%|▊         | 86/1000 [07:01<1:13:18,  4.81s/it]

{'loss': 0.0688, 'grad_norm': 0.4602241516113281, 'learning_rate': 0.0001837185929648241, 'epoch': 6.88}


  9%|▊         | 87/1000 [07:05<1:13:14,  4.81s/it]

{'loss': 0.0944, 'grad_norm': 0.5429556965827942, 'learning_rate': 0.0001835175879396985, 'epoch': 6.96}


  9%|▉         | 88/1000 [07:10<1:10:38,  4.65s/it]

{'loss': 0.0711, 'grad_norm': 0.5302917957305908, 'learning_rate': 0.00018331658291457288, 'epoch': 7.04}


  9%|▉         | 89/1000 [07:14<1:10:30,  4.64s/it]

{'loss': 0.0458, 'grad_norm': 0.23246687650680542, 'learning_rate': 0.00018311557788944723, 'epoch': 7.12}


  9%|▉         | 90/1000 [07:19<1:11:14,  4.70s/it]

{'loss': 0.0577, 'grad_norm': 0.3289109170436859, 'learning_rate': 0.0001829145728643216, 'epoch': 7.2}


  9%|▉         | 91/1000 [07:24<1:10:48,  4.67s/it]

{'loss': 0.0558, 'grad_norm': 0.3048771023750305, 'learning_rate': 0.000182713567839196, 'epoch': 7.28}


  9%|▉         | 92/1000 [07:29<1:11:26,  4.72s/it]

{'loss': 0.0572, 'grad_norm': 0.4417398273944855, 'learning_rate': 0.00018251256281407038, 'epoch': 7.36}


  9%|▉         | 93/1000 [07:33<1:11:53,  4.76s/it]

{'loss': 0.069, 'grad_norm': 0.5657472014427185, 'learning_rate': 0.00018231155778894472, 'epoch': 7.44}


  9%|▉         | 94/1000 [07:38<1:13:03,  4.84s/it]

{'loss': 0.051, 'grad_norm': 0.26482778787612915, 'learning_rate': 0.0001821105527638191, 'epoch': 7.52}


 10%|▉         | 95/1000 [07:44<1:13:51,  4.90s/it]

{'loss': 0.0631, 'grad_norm': 0.45330074429512024, 'learning_rate': 0.0001819095477386935, 'epoch': 7.6}


 10%|▉         | 96/1000 [07:49<1:14:21,  4.94s/it]

{'loss': 0.0552, 'grad_norm': 0.4006856381893158, 'learning_rate': 0.00018170854271356784, 'epoch': 7.68}


 10%|▉         | 97/1000 [07:53<1:12:56,  4.85s/it]

{'loss': 0.0481, 'grad_norm': 0.3176751732826233, 'learning_rate': 0.00018150753768844221, 'epoch': 7.76}


 10%|▉         | 98/1000 [07:58<1:12:51,  4.85s/it]

{'loss': 0.0535, 'grad_norm': 0.41214606165885925, 'learning_rate': 0.0001813065326633166, 'epoch': 7.84}


 10%|▉         | 99/1000 [08:03<1:12:49,  4.85s/it]

{'loss': 0.0496, 'grad_norm': 0.38900697231292725, 'learning_rate': 0.00018110552763819096, 'epoch': 7.92}


 10%|█         | 100/1000 [08:08<1:11:52,  4.79s/it]

{'loss': 0.0508, 'grad_norm': 0.44746482372283936, 'learning_rate': 0.00018090452261306533, 'epoch': 8.0}


 10%|█         | 101/1000 [08:13<1:12:56,  4.87s/it]

{'loss': 0.0404, 'grad_norm': 0.24457260966300964, 'learning_rate': 0.0001807035175879397, 'epoch': 8.08}


 10%|█         | 102/1000 [08:17<1:12:46,  4.86s/it]

{'loss': 0.0398, 'grad_norm': 0.2315857708454132, 'learning_rate': 0.00018050251256281408, 'epoch': 8.16}


 10%|█         | 103/1000 [08:22<1:12:37,  4.86s/it]

{'loss': 0.0409, 'grad_norm': 0.20306579768657684, 'learning_rate': 0.00018030150753768845, 'epoch': 8.24}


 10%|█         | 104/1000 [08:27<1:13:20,  4.91s/it]

{'loss': 0.0381, 'grad_norm': 0.1831337809562683, 'learning_rate': 0.00018010050251256282, 'epoch': 8.32}


 10%|█         | 105/1000 [08:32<1:12:53,  4.89s/it]

{'loss': 0.0433, 'grad_norm': 0.23385114967823029, 'learning_rate': 0.0001798994974874372, 'epoch': 8.4}


 11%|█         | 106/1000 [08:37<1:11:45,  4.82s/it]

{'loss': 0.0513, 'grad_norm': 0.2939911484718323, 'learning_rate': 0.00017969849246231157, 'epoch': 8.48}


 11%|█         | 107/1000 [08:42<1:11:50,  4.83s/it]

{'loss': 0.0406, 'grad_norm': 0.19299115240573883, 'learning_rate': 0.00017949748743718592, 'epoch': 8.56}


 11%|█         | 108/1000 [08:46<1:10:56,  4.77s/it]

{'loss': 0.044, 'grad_norm': 0.269634872674942, 'learning_rate': 0.00017929648241206032, 'epoch': 8.64}


 11%|█         | 109/1000 [08:51<1:11:12,  4.80s/it]

{'loss': 0.0485, 'grad_norm': 0.23942650854587555, 'learning_rate': 0.0001790954773869347, 'epoch': 8.72}


 11%|█         | 110/1000 [08:56<1:11:24,  4.81s/it]

{'loss': 0.045, 'grad_norm': 0.2804015576839447, 'learning_rate': 0.00017889447236180906, 'epoch': 8.8}


 11%|█         | 111/1000 [09:01<1:11:29,  4.82s/it]

{'loss': 0.0476, 'grad_norm': 0.3291268050670624, 'learning_rate': 0.0001786934673366834, 'epoch': 8.88}


 11%|█         | 112/1000 [09:06<1:12:18,  4.89s/it]

{'loss': 0.0547, 'grad_norm': 0.3782489597797394, 'learning_rate': 0.0001784924623115578, 'epoch': 8.96}


 11%|█▏        | 113/1000 [09:10<1:09:28,  4.70s/it]

{'loss': 0.0603, 'grad_norm': 0.7870592474937439, 'learning_rate': 0.00017829145728643218, 'epoch': 9.04}


 11%|█▏        | 114/1000 [09:15<1:09:57,  4.74s/it]

{'loss': 0.0351, 'grad_norm': 0.14074407517910004, 'learning_rate': 0.00017809045226130653, 'epoch': 9.12}


 12%|█▏        | 115/1000 [09:20<1:10:19,  4.77s/it]

{'loss': 0.0442, 'grad_norm': 0.3203173279762268, 'learning_rate': 0.0001778894472361809, 'epoch': 9.2}


 12%|█▏        | 116/1000 [09:25<1:12:15,  4.90s/it]

{'loss': 0.0425, 'grad_norm': 0.2502264976501465, 'learning_rate': 0.0001776884422110553, 'epoch': 9.28}


 12%|█▏        | 117/1000 [09:30<1:11:02,  4.83s/it]

{'loss': 0.0383, 'grad_norm': 0.19800390303134918, 'learning_rate': 0.00017748743718592967, 'epoch': 9.36}


 12%|█▏        | 118/1000 [09:35<1:10:59,  4.83s/it]

{'loss': 0.0439, 'grad_norm': 0.3005039393901825, 'learning_rate': 0.00017728643216080402, 'epoch': 9.44}


 12%|█▏        | 119/1000 [09:39<1:10:53,  4.83s/it]

{'loss': 0.0453, 'grad_norm': 0.32669955492019653, 'learning_rate': 0.0001770854271356784, 'epoch': 9.52}


 12%|█▏        | 120/1000 [09:44<1:10:48,  4.83s/it]

{'loss': 0.0478, 'grad_norm': 0.28259024024009705, 'learning_rate': 0.0001768844221105528, 'epoch': 9.6}


 12%|█▏        | 121/1000 [09:49<1:10:45,  4.83s/it]

{'loss': 0.0544, 'grad_norm': 0.43978747725486755, 'learning_rate': 0.00017668341708542714, 'epoch': 9.68}


 12%|█▏        | 122/1000 [09:54<1:10:39,  4.83s/it]

{'loss': 0.0386, 'grad_norm': 0.13727878034114838, 'learning_rate': 0.0001764824120603015, 'epoch': 9.76}


 12%|█▏        | 123/1000 [09:59<1:11:31,  4.89s/it]

{'loss': 0.0445, 'grad_norm': 0.3623684346675873, 'learning_rate': 0.00017628140703517588, 'epoch': 9.84}


 12%|█▏        | 124/1000 [10:04<1:11:10,  4.87s/it]

{'loss': 0.0536, 'grad_norm': 0.48223185539245605, 'learning_rate': 0.00017608040201005026, 'epoch': 9.92}


 12%|█▎        | 125/1000 [10:08<1:08:25,  4.69s/it]

{'loss': 0.0644, 'grad_norm': 0.5665394067764282, 'learning_rate': 0.00017587939698492463, 'epoch': 10.0}


 13%|█▎        | 126/1000 [10:13<1:08:05,  4.67s/it]

{'loss': 0.0399, 'grad_norm': 0.3081621527671814, 'learning_rate': 0.000175678391959799, 'epoch': 10.08}


 13%|█▎        | 127/1000 [10:17<1:07:47,  4.66s/it]

{'loss': 0.04, 'grad_norm': 0.25195086002349854, 'learning_rate': 0.00017547738693467338, 'epoch': 10.16}


 13%|█▎        | 128/1000 [10:22<1:09:15,  4.77s/it]

{'loss': 0.0391, 'grad_norm': 0.2290978878736496, 'learning_rate': 0.00017527638190954775, 'epoch': 10.24}


 13%|█▎        | 129/1000 [10:27<1:10:20,  4.85s/it]

{'loss': 0.0381, 'grad_norm': 0.1604664921760559, 'learning_rate': 0.00017507537688442212, 'epoch': 10.32}


 13%|█▎        | 130/1000 [10:32<1:09:22,  4.78s/it]

{'loss': 0.0385, 'grad_norm': 0.19698725640773773, 'learning_rate': 0.0001748743718592965, 'epoch': 10.4}


 13%|█▎        | 131/1000 [10:37<1:10:23,  4.86s/it]

{'loss': 0.0415, 'grad_norm': 0.3446982800960541, 'learning_rate': 0.00017467336683417087, 'epoch': 10.48}


 13%|█▎        | 132/1000 [10:42<1:09:21,  4.79s/it]

{'loss': 0.0409, 'grad_norm': 0.24023926258087158, 'learning_rate': 0.00017447236180904521, 'epoch': 10.56}


 13%|█▎        | 133/1000 [10:46<1:09:23,  4.80s/it]

{'loss': 0.0406, 'grad_norm': 0.3556601405143738, 'learning_rate': 0.00017427135678391961, 'epoch': 10.64}


 13%|█▎        | 134/1000 [10:51<1:09:25,  4.81s/it]

{'loss': 0.0454, 'grad_norm': 0.32845303416252136, 'learning_rate': 0.000174070351758794, 'epoch': 10.72}


 14%|█▎        | 135/1000 [10:56<1:09:26,  4.82s/it]

{'loss': 0.0443, 'grad_norm': 0.3118114769458771, 'learning_rate': 0.00017386934673366836, 'epoch': 10.8}


 14%|█▎        | 136/1000 [11:01<1:10:19,  4.88s/it]

{'loss': 0.0432, 'grad_norm': 0.33240702748298645, 'learning_rate': 0.0001736683417085427, 'epoch': 10.88}


 14%|█▎        | 137/1000 [11:06<1:10:52,  4.93s/it]

{'loss': 0.0446, 'grad_norm': 0.45561450719833374, 'learning_rate': 0.0001734673366834171, 'epoch': 10.96}


 14%|█▍        | 138/1000 [11:11<1:08:44,  4.79s/it]

{'loss': 0.0417, 'grad_norm': 0.25079768896102905, 'learning_rate': 0.00017326633165829148, 'epoch': 11.04}


 14%|█▍        | 139/1000 [11:16<1:09:41,  4.86s/it]

{'loss': 0.0357, 'grad_norm': 0.17932315170764923, 'learning_rate': 0.00017306532663316582, 'epoch': 11.12}


 14%|█▍        | 140/1000 [11:20<1:09:27,  4.85s/it]

{'loss': 0.0347, 'grad_norm': 0.35069265961647034, 'learning_rate': 0.0001728643216080402, 'epoch': 11.2}


 14%|█▍        | 141/1000 [11:25<1:09:17,  4.84s/it]

{'loss': 0.0337, 'grad_norm': 0.22367213666439056, 'learning_rate': 0.0001726633165829146, 'epoch': 11.28}


 14%|█▍        | 142/1000 [11:30<1:09:12,  4.84s/it]

{'loss': 0.0345, 'grad_norm': 0.19584469497203827, 'learning_rate': 0.00017246231155778897, 'epoch': 11.36}


 14%|█▍        | 143/1000 [11:35<1:09:08,  4.84s/it]

{'loss': 0.0389, 'grad_norm': 0.34383267164230347, 'learning_rate': 0.00017226130653266332, 'epoch': 11.44}


 14%|█▍        | 144/1000 [11:40<1:08:54,  4.83s/it]

{'loss': 0.035, 'grad_norm': 0.2718812823295593, 'learning_rate': 0.0001720603015075377, 'epoch': 11.52}


 14%|█▍        | 145/1000 [11:45<1:09:46,  4.90s/it]

{'loss': 0.0363, 'grad_norm': 0.20924323797225952, 'learning_rate': 0.00017185929648241206, 'epoch': 11.6}


 15%|█▍        | 146/1000 [11:49<1:08:48,  4.83s/it]

{'loss': 0.0431, 'grad_norm': 0.4057055711746216, 'learning_rate': 0.00017165829145728644, 'epoch': 11.68}


 15%|█▍        | 147/1000 [11:54<1:08:56,  4.85s/it]

{'loss': 0.0414, 'grad_norm': 0.36294448375701904, 'learning_rate': 0.0001714572864321608, 'epoch': 11.76}


 15%|█▍        | 148/1000 [11:59<1:08:13,  4.80s/it]

{'loss': 0.0425, 'grad_norm': 0.3726315498352051, 'learning_rate': 0.00017125628140703518, 'epoch': 11.84}


 15%|█▍        | 149/1000 [12:04<1:09:17,  4.89s/it]

{'loss': 0.0411, 'grad_norm': 0.23918111622333527, 'learning_rate': 0.00017105527638190955, 'epoch': 11.92}


 15%|█▌        | 150/1000 [12:08<1:06:46,  4.71s/it]

{'loss': 0.0361, 'grad_norm': 0.206119567155838, 'learning_rate': 0.00017085427135678393, 'epoch': 12.0}


 15%|█▌        | 151/1000 [12:14<1:08:15,  4.82s/it]

{'loss': 0.0334, 'grad_norm': 0.30910080671310425, 'learning_rate': 0.0001706532663316583, 'epoch': 12.08}


 15%|█▌        | 152/1000 [12:18<1:08:26,  4.84s/it]

{'loss': 0.0295, 'grad_norm': 0.1611350029706955, 'learning_rate': 0.00017045226130653267, 'epoch': 12.16}


 15%|█▌        | 153/1000 [12:23<1:08:34,  4.86s/it]

{'loss': 0.0337, 'grad_norm': 0.1328614205121994, 'learning_rate': 0.00017025125628140705, 'epoch': 12.24}


 15%|█▌        | 154/1000 [12:28<1:09:23,  4.92s/it]

{'loss': 0.0333, 'grad_norm': 0.16305837035179138, 'learning_rate': 0.00017005025125628142, 'epoch': 12.32}


 16%|█▌        | 155/1000 [12:33<1:08:18,  4.85s/it]

{'loss': 0.0373, 'grad_norm': 0.25283169746398926, 'learning_rate': 0.0001698492462311558, 'epoch': 12.4}


 16%|█▌        | 156/1000 [12:38<1:07:31,  4.80s/it]

{'loss': 0.0411, 'grad_norm': 0.22695352137088776, 'learning_rate': 0.00016964824120603016, 'epoch': 12.48}


 16%|█▌        | 157/1000 [12:42<1:06:57,  4.77s/it]

{'loss': 0.0341, 'grad_norm': 0.12510082125663757, 'learning_rate': 0.0001694472361809045, 'epoch': 12.56}


 16%|█▌        | 158/1000 [12:48<1:08:07,  4.85s/it]

{'loss': 0.034, 'grad_norm': 0.13200132548809052, 'learning_rate': 0.0001692462311557789, 'epoch': 12.64}


 16%|█▌        | 159/1000 [12:53<1:09:03,  4.93s/it]

{'loss': 0.035, 'grad_norm': 0.14810632169246674, 'learning_rate': 0.00016904522613065328, 'epoch': 12.72}


 16%|█▌        | 160/1000 [12:57<1:08:41,  4.91s/it]

{'loss': 0.0383, 'grad_norm': 0.17443831264972687, 'learning_rate': 0.00016884422110552766, 'epoch': 12.8}


 16%|█▌        | 161/1000 [13:03<1:09:19,  4.96s/it]

{'loss': 0.0431, 'grad_norm': 0.2173326462507248, 'learning_rate': 0.000168643216080402, 'epoch': 12.88}


 16%|█▌        | 162/1000 [13:07<1:08:50,  4.93s/it]

{'loss': 0.0404, 'grad_norm': 0.5383925437927246, 'learning_rate': 0.0001684422110552764, 'epoch': 12.96}


 16%|█▋        | 163/1000 [13:12<1:05:50,  4.72s/it]

{'loss': 0.0331, 'grad_norm': 0.15234160423278809, 'learning_rate': 0.00016824120603015078, 'epoch': 13.04}


 16%|█▋        | 164/1000 [13:16<1:06:14,  4.75s/it]

{'loss': 0.0318, 'grad_norm': 0.2588381767272949, 'learning_rate': 0.00016804020100502512, 'epoch': 13.12}


 16%|█▋        | 165/1000 [13:21<1:05:39,  4.72s/it]

{'loss': 0.0299, 'grad_norm': 0.13134655356407166, 'learning_rate': 0.0001678391959798995, 'epoch': 13.2}


 17%|█▋        | 166/1000 [13:26<1:06:05,  4.76s/it]

{'loss': 0.0333, 'grad_norm': 0.17027460038661957, 'learning_rate': 0.0001676381909547739, 'epoch': 13.28}


 17%|█▋        | 167/1000 [13:31<1:07:52,  4.89s/it]

{'loss': 0.0308, 'grad_norm': 0.3240346610546112, 'learning_rate': 0.00016743718592964827, 'epoch': 13.36}


 17%|█▋        | 168/1000 [13:36<1:07:34,  4.87s/it]

{'loss': 0.0371, 'grad_norm': 0.34696927666664124, 'learning_rate': 0.0001672361809045226, 'epoch': 13.44}


 17%|█▋        | 169/1000 [13:41<1:06:32,  4.80s/it]

{'loss': 0.0366, 'grad_norm': 0.18440821766853333, 'learning_rate': 0.00016703517587939699, 'epoch': 13.52}


 17%|█▋        | 170/1000 [13:45<1:06:32,  4.81s/it]

{'loss': 0.0315, 'grad_norm': 0.2310204654932022, 'learning_rate': 0.00016683417085427136, 'epoch': 13.6}


 17%|█▋        | 171/1000 [13:50<1:05:41,  4.76s/it]

{'loss': 0.04, 'grad_norm': 0.15165568888187408, 'learning_rate': 0.00016663316582914573, 'epoch': 13.68}


 17%|█▋        | 172/1000 [13:55<1:05:54,  4.78s/it]

{'loss': 0.0341, 'grad_norm': 0.1609826534986496, 'learning_rate': 0.0001664321608040201, 'epoch': 13.76}


 17%|█▋        | 173/1000 [14:00<1:06:53,  4.85s/it]

{'loss': 0.0384, 'grad_norm': 0.2506444454193115, 'learning_rate': 0.00016623115577889448, 'epoch': 13.84}


 17%|█▋        | 174/1000 [14:05<1:07:27,  4.90s/it]

{'loss': 0.0368, 'grad_norm': 0.3453926146030426, 'learning_rate': 0.00016603015075376885, 'epoch': 13.92}


 18%|█▊        | 175/1000 [14:09<1:04:59,  4.73s/it]

{'loss': 0.0348, 'grad_norm': 0.17858768999576569, 'learning_rate': 0.00016582914572864322, 'epoch': 14.0}


 18%|█▊        | 176/1000 [14:14<1:05:15,  4.75s/it]

{'loss': 0.0307, 'grad_norm': 0.12003406882286072, 'learning_rate': 0.0001656281407035176, 'epoch': 14.08}


 18%|█▊        | 177/1000 [14:19<1:04:41,  4.72s/it]

{'loss': 0.0326, 'grad_norm': 0.22011104226112366, 'learning_rate': 0.00016542713567839197, 'epoch': 14.16}


 18%|█▊        | 178/1000 [14:24<1:05:52,  4.81s/it]

{'loss': 0.0337, 'grad_norm': 0.1360558271408081, 'learning_rate': 0.00016522613065326634, 'epoch': 14.24}


 18%|█▊        | 179/1000 [14:29<1:06:40,  4.87s/it]

{'loss': 0.0299, 'grad_norm': 0.10877376794815063, 'learning_rate': 0.00016502512562814072, 'epoch': 14.32}


 18%|█▊        | 180/1000 [14:34<1:06:18,  4.85s/it]

{'loss': 0.0379, 'grad_norm': 0.3419930934906006, 'learning_rate': 0.0001648241206030151, 'epoch': 14.4}


 18%|█▊        | 181/1000 [14:39<1:06:45,  4.89s/it]

{'loss': 0.0295, 'grad_norm': 0.11349249631166458, 'learning_rate': 0.00016462311557788946, 'epoch': 14.48}


 18%|█▊        | 182/1000 [14:43<1:06:18,  4.86s/it]

{'loss': 0.0477, 'grad_norm': 0.8545286655426025, 'learning_rate': 0.0001644221105527638, 'epoch': 14.56}


 18%|█▊        | 183/1000 [14:48<1:05:12,  4.79s/it]

{'loss': 0.0352, 'grad_norm': 0.18797460198402405, 'learning_rate': 0.0001642211055276382, 'epoch': 14.64}


 18%|█▊        | 184/1000 [14:53<1:04:26,  4.74s/it]

{'loss': 0.0345, 'grad_norm': 0.21606972813606262, 'learning_rate': 0.00016402010050251258, 'epoch': 14.72}


 18%|█▊        | 185/1000 [14:58<1:05:38,  4.83s/it]

{'loss': 0.034, 'grad_norm': 0.13304081559181213, 'learning_rate': 0.00016381909547738695, 'epoch': 14.8}


 19%|█▊        | 186/1000 [15:02<1:05:36,  4.84s/it]

{'loss': 0.0332, 'grad_norm': 0.1250288486480713, 'learning_rate': 0.0001636180904522613, 'epoch': 14.88}


 19%|█▊        | 187/1000 [15:07<1:05:46,  4.85s/it]

{'loss': 0.0405, 'grad_norm': 0.2855643332004547, 'learning_rate': 0.0001634170854271357, 'epoch': 14.96}


 19%|█▉        | 188/1000 [15:12<1:05:02,  4.81s/it]

{'loss': 0.0348, 'grad_norm': 0.19672144949436188, 'learning_rate': 0.00016321608040201007, 'epoch': 15.04}


 19%|█▉        | 189/1000 [15:17<1:04:32,  4.77s/it]

{'loss': 0.031, 'grad_norm': 0.18591564893722534, 'learning_rate': 0.00016301507537688442, 'epoch': 15.12}


 19%|█▉        | 190/1000 [15:21<1:04:05,  4.75s/it]

{'loss': 0.0313, 'grad_norm': 0.1858067512512207, 'learning_rate': 0.0001628140703517588, 'epoch': 15.2}


 19%|█▉        | 191/1000 [15:26<1:03:47,  4.73s/it]

{'loss': 0.031, 'grad_norm': 0.11338948458433151, 'learning_rate': 0.00016261306532663316, 'epoch': 15.28}


 19%|█▉        | 192/1000 [15:31<1:04:21,  4.78s/it]

{'loss': 0.0323, 'grad_norm': 0.14610081911087036, 'learning_rate': 0.00016241206030150756, 'epoch': 15.36}


 19%|█▉        | 193/1000 [15:36<1:04:41,  4.81s/it]

{'loss': 0.0299, 'grad_norm': 0.11993581801652908, 'learning_rate': 0.0001622110552763819, 'epoch': 15.44}


 19%|█▉        | 194/1000 [15:41<1:04:04,  4.77s/it]

{'loss': 0.0303, 'grad_norm': 0.12740558385849, 'learning_rate': 0.00016201005025125628, 'epoch': 15.52}


 20%|█▉        | 195/1000 [15:45<1:04:27,  4.80s/it]

{'loss': 0.0308, 'grad_norm': 0.11910343915224075, 'learning_rate': 0.00016180904522613066, 'epoch': 15.6}


 20%|█▉        | 196/1000 [15:50<1:04:41,  4.83s/it]

{'loss': 0.0301, 'grad_norm': 0.11083049327135086, 'learning_rate': 0.00016160804020100503, 'epoch': 15.68}


 20%|█▉        | 197/1000 [15:55<1:04:47,  4.84s/it]

{'loss': 0.0333, 'grad_norm': 0.2002554088830948, 'learning_rate': 0.0001614070351758794, 'epoch': 15.76}


 20%|█▉        | 198/1000 [16:00<1:04:43,  4.84s/it]

{'loss': 0.0526, 'grad_norm': 0.6009817719459534, 'learning_rate': 0.00016120603015075378, 'epoch': 15.84}


 20%|█▉        | 199/1000 [16:05<1:04:36,  4.84s/it]

{'loss': 0.034, 'grad_norm': 0.13632188737392426, 'learning_rate': 0.00016100502512562815, 'epoch': 15.92}


 20%|██        | 200/1000 [16:10<1:03:46,  4.78s/it]

{'loss': 0.0414, 'grad_norm': 0.3500192165374756, 'learning_rate': 0.00016080402010050252, 'epoch': 16.0}


 20%|██        | 201/1000 [16:14<1:03:58,  4.80s/it]

{'loss': 0.0285, 'grad_norm': 0.1233941987156868, 'learning_rate': 0.0001606030150753769, 'epoch': 16.08}


 20%|██        | 202/1000 [16:19<1:04:00,  4.81s/it]

{'loss': 0.0274, 'grad_norm': 0.13901299238204956, 'learning_rate': 0.00016040201005025127, 'epoch': 16.16}


 20%|██        | 203/1000 [16:24<1:03:58,  4.82s/it]

{'loss': 0.0305, 'grad_norm': 0.1367386281490326, 'learning_rate': 0.00016020100502512564, 'epoch': 16.24}


 20%|██        | 204/1000 [16:29<1:03:09,  4.76s/it]

{'loss': 0.0356, 'grad_norm': 0.5772700309753418, 'learning_rate': 0.00016, 'epoch': 16.32}


 20%|██        | 205/1000 [16:34<1:04:05,  4.84s/it]

{'loss': 0.041, 'grad_norm': 0.36351457238197327, 'learning_rate': 0.00015979899497487439, 'epoch': 16.4}


 21%|██        | 206/1000 [16:38<1:03:08,  4.77s/it]

{'loss': 0.0308, 'grad_norm': 0.13500836491584778, 'learning_rate': 0.00015959798994974876, 'epoch': 16.48}


 21%|██        | 207/1000 [16:43<1:02:34,  4.73s/it]

{'loss': 0.0289, 'grad_norm': 0.11588782072067261, 'learning_rate': 0.0001593969849246231, 'epoch': 16.56}


 21%|██        | 208/1000 [16:48<1:03:43,  4.83s/it]

{'loss': 0.0297, 'grad_norm': 0.15161900222301483, 'learning_rate': 0.0001591959798994975, 'epoch': 16.64}


 21%|██        | 209/1000 [16:53<1:04:26,  4.89s/it]

{'loss': 0.0297, 'grad_norm': 0.11240042001008987, 'learning_rate': 0.00015899497487437188, 'epoch': 16.72}


 21%|██        | 210/1000 [16:58<1:04:09,  4.87s/it]

{'loss': 0.0341, 'grad_norm': 0.16073954105377197, 'learning_rate': 0.00015879396984924625, 'epoch': 16.8}


 21%|██        | 211/1000 [17:03<1:03:55,  4.86s/it]

{'loss': 0.0366, 'grad_norm': 0.41066110134124756, 'learning_rate': 0.0001585929648241206, 'epoch': 16.88}


 21%|██        | 212/1000 [17:08<1:04:29,  4.91s/it]

{'loss': 0.043, 'grad_norm': 0.48783937096595764, 'learning_rate': 0.000158391959798995, 'epoch': 16.96}


 21%|██▏       | 213/1000 [17:12<1:01:50,  4.71s/it]

{'loss': 0.0334, 'grad_norm': 0.17495635151863098, 'learning_rate': 0.00015819095477386937, 'epoch': 17.04}


 21%|██▏       | 214/1000 [17:17<1:03:00,  4.81s/it]

{'loss': 0.0273, 'grad_norm': 0.09744319319725037, 'learning_rate': 0.00015798994974874372, 'epoch': 17.12}


 22%|██▏       | 215/1000 [17:22<1:03:02,  4.82s/it]

{'loss': 0.0285, 'grad_norm': 0.33249610662460327, 'learning_rate': 0.0001577889447236181, 'epoch': 17.2}


 22%|██▏       | 216/1000 [17:27<1:02:59,  4.82s/it]

{'loss': 0.0379, 'grad_norm': 0.5236122608184814, 'learning_rate': 0.00015758793969849246, 'epoch': 17.28}


 22%|██▏       | 217/1000 [17:31<1:02:14,  4.77s/it]

{'loss': 0.0316, 'grad_norm': 0.2005537450313568, 'learning_rate': 0.00015738693467336686, 'epoch': 17.36}


 22%|██▏       | 218/1000 [17:36<1:02:27,  4.79s/it]

{'loss': 0.0316, 'grad_norm': 0.2739521861076355, 'learning_rate': 0.0001571859296482412, 'epoch': 17.44}


 22%|██▏       | 219/1000 [17:41<1:02:31,  4.80s/it]

{'loss': 0.0303, 'grad_norm': 0.15273475646972656, 'learning_rate': 0.00015698492462311558, 'epoch': 17.52}


 22%|██▏       | 220/1000 [17:46<1:02:32,  4.81s/it]

{'loss': 0.0338, 'grad_norm': 0.49031782150268555, 'learning_rate': 0.00015678391959798995, 'epoch': 17.6}


 22%|██▏       | 221/1000 [17:51<1:02:33,  4.82s/it]

{'loss': 0.0347, 'grad_norm': 0.18237143754959106, 'learning_rate': 0.00015658291457286433, 'epoch': 17.68}


 22%|██▏       | 222/1000 [17:56<1:02:32,  4.82s/it]

{'loss': 0.0284, 'grad_norm': 0.08923593163490295, 'learning_rate': 0.0001563819095477387, 'epoch': 17.76}


 22%|██▏       | 223/1000 [18:00<1:02:34,  4.83s/it]

{'loss': 0.0316, 'grad_norm': 0.12326030433177948, 'learning_rate': 0.00015618090452261307, 'epoch': 17.84}


 22%|██▏       | 224/1000 [18:05<1:01:48,  4.78s/it]

{'loss': 0.0336, 'grad_norm': 0.12721121311187744, 'learning_rate': 0.00015597989949748745, 'epoch': 17.92}


 22%|██▎       | 225/1000 [18:10<1:01:13,  4.74s/it]

{'loss': 0.033, 'grad_norm': 0.16833382844924927, 'learning_rate': 0.00015577889447236182, 'epoch': 18.0}


 23%|██▎       | 226/1000 [18:15<1:01:36,  4.78s/it]

{'loss': 0.0286, 'grad_norm': 0.09622476249933243, 'learning_rate': 0.0001555778894472362, 'epoch': 18.08}


 23%|██▎       | 227/1000 [18:20<1:03:17,  4.91s/it]

{'loss': 0.0284, 'grad_norm': 0.1209559217095375, 'learning_rate': 0.00015537688442211056, 'epoch': 18.16}


 23%|██▎       | 228/1000 [18:25<1:02:53,  4.89s/it]

{'loss': 0.027, 'grad_norm': 0.09412973374128342, 'learning_rate': 0.00015517587939698494, 'epoch': 18.24}


 23%|██▎       | 229/1000 [18:29<1:02:37,  4.87s/it]

{'loss': 0.0316, 'grad_norm': 0.29074978828430176, 'learning_rate': 0.0001549748743718593, 'epoch': 18.32}


 23%|██▎       | 230/1000 [18:34<1:02:23,  4.86s/it]

{'loss': 0.0305, 'grad_norm': 0.11620668321847916, 'learning_rate': 0.00015477386934673368, 'epoch': 18.4}


 23%|██▎       | 231/1000 [18:39<1:02:15,  4.86s/it]

{'loss': 0.0354, 'grad_norm': 0.31893521547317505, 'learning_rate': 0.00015457286432160806, 'epoch': 18.48}


 23%|██▎       | 232/1000 [18:44<1:01:32,  4.81s/it]

{'loss': 0.035, 'grad_norm': 0.7787649035453796, 'learning_rate': 0.0001543718592964824, 'epoch': 18.56}


 23%|██▎       | 233/1000 [18:49<1:01:41,  4.83s/it]

{'loss': 0.0282, 'grad_norm': 0.16183806955814362, 'learning_rate': 0.0001541708542713568, 'epoch': 18.64}


 23%|██▎       | 234/1000 [18:54<1:02:30,  4.90s/it]

{'loss': 0.031, 'grad_norm': 0.1196848526597023, 'learning_rate': 0.00015396984924623117, 'epoch': 18.72}


 24%|██▎       | 235/1000 [18:58<1:01:38,  4.83s/it]

{'loss': 0.0308, 'grad_norm': 0.12704779207706451, 'learning_rate': 0.00015376884422110555, 'epoch': 18.8}


 24%|██▎       | 236/1000 [19:03<1:01:41,  4.85s/it]

{'loss': 0.0343, 'grad_norm': 0.21982795000076294, 'learning_rate': 0.0001535678391959799, 'epoch': 18.88}


 24%|██▎       | 237/1000 [19:08<1:01:49,  4.86s/it]

{'loss': 0.0322, 'grad_norm': 0.1386931836605072, 'learning_rate': 0.00015336683417085427, 'epoch': 18.96}


 24%|██▍       | 238/1000 [19:13<1:01:23,  4.83s/it]

{'loss': 0.0309, 'grad_norm': 0.12406226992607117, 'learning_rate': 0.00015316582914572867, 'epoch': 19.04}


 24%|██▍       | 239/1000 [19:18<1:01:27,  4.85s/it]

{'loss': 0.0289, 'grad_norm': 0.11615176498889923, 'learning_rate': 0.000152964824120603, 'epoch': 19.12}


 24%|██▍       | 240/1000 [19:23<1:02:10,  4.91s/it]

{'loss': 0.0261, 'grad_norm': 0.09655845910310745, 'learning_rate': 0.00015276381909547739, 'epoch': 19.2}


 24%|██▍       | 241/1000 [19:28<1:01:55,  4.89s/it]

{'loss': 0.0304, 'grad_norm': 0.27508366107940674, 'learning_rate': 0.00015256281407035176, 'epoch': 19.28}


 24%|██▍       | 242/1000 [19:33<1:01:43,  4.89s/it]

{'loss': 0.028, 'grad_norm': 0.10167263448238373, 'learning_rate': 0.00015236180904522613, 'epoch': 19.36}


 24%|██▍       | 243/1000 [19:37<1:01:29,  4.87s/it]

{'loss': 0.0292, 'grad_norm': 0.11475251615047455, 'learning_rate': 0.0001521608040201005, 'epoch': 19.44}


 24%|██▍       | 244/1000 [19:42<1:00:39,  4.81s/it]

{'loss': 0.0275, 'grad_norm': 0.08951248228549957, 'learning_rate': 0.00015195979899497488, 'epoch': 19.52}


 24%|██▍       | 245/1000 [19:47<59:59,  4.77s/it]  

{'loss': 0.0286, 'grad_norm': 0.10652539879083633, 'learning_rate': 0.00015175879396984925, 'epoch': 19.6}


 25%|██▍       | 246/1000 [19:52<1:01:00,  4.85s/it]

{'loss': 0.031, 'grad_norm': 0.22382885217666626, 'learning_rate': 0.00015155778894472362, 'epoch': 19.68}


 25%|██▍       | 247/1000 [19:57<1:00:13,  4.80s/it]

{'loss': 0.042, 'grad_norm': 0.751643180847168, 'learning_rate': 0.000151356783919598, 'epoch': 19.76}


 25%|██▍       | 248/1000 [20:01<1:00:21,  4.82s/it]

{'loss': 0.0351, 'grad_norm': 0.3895402252674103, 'learning_rate': 0.00015115577889447237, 'epoch': 19.84}


 25%|██▍       | 249/1000 [20:07<1:01:51,  4.94s/it]

{'loss': 0.0308, 'grad_norm': 0.13310347497463226, 'learning_rate': 0.00015095477386934674, 'epoch': 19.92}


 25%|██▌       | 250/1000 [20:11<59:28,  4.76s/it]  

{'loss': 0.0354, 'grad_norm': 0.17536114156246185, 'learning_rate': 0.00015075376884422112, 'epoch': 20.0}


 25%|██▌       | 251/1000 [20:16<58:54,  4.72s/it]

{'loss': 0.0281, 'grad_norm': 0.1220446527004242, 'learning_rate': 0.0001505527638190955, 'epoch': 20.08}


 25%|██▌       | 252/1000 [20:20<59:13,  4.75s/it]

{'loss': 0.0457, 'grad_norm': 0.4015876352787018, 'learning_rate': 0.00015035175879396986, 'epoch': 20.16}


 25%|██▌       | 253/1000 [20:25<1:00:09,  4.83s/it]

{'loss': 0.033, 'grad_norm': 0.6332685947418213, 'learning_rate': 0.00015015075376884423, 'epoch': 20.24}


 25%|██▌       | 254/1000 [20:30<1:00:47,  4.89s/it]

{'loss': 0.0288, 'grad_norm': 0.11559244990348816, 'learning_rate': 0.0001499497487437186, 'epoch': 20.32}


 26%|██▌       | 255/1000 [20:35<59:43,  4.81s/it]  

{'loss': 0.0445, 'grad_norm': 0.3611249327659607, 'learning_rate': 0.00014974874371859298, 'epoch': 20.4}


 26%|██▌       | 256/1000 [20:40<59:41,  4.81s/it]

{'loss': 0.0292, 'grad_norm': 0.12473001331090927, 'learning_rate': 0.00014954773869346735, 'epoch': 20.48}


 26%|██▌       | 257/1000 [20:45<1:00:23,  4.88s/it]

{'loss': 0.0294, 'grad_norm': 0.1851322501897812, 'learning_rate': 0.0001493467336683417, 'epoch': 20.56}


 26%|██▌       | 258/1000 [20:50<59:25,  4.81s/it]  

{'loss': 0.0387, 'grad_norm': 0.628039538860321, 'learning_rate': 0.0001491457286432161, 'epoch': 20.64}


 26%|██▌       | 259/1000 [20:54<59:25,  4.81s/it]

{'loss': 0.0296, 'grad_norm': 0.16817381978034973, 'learning_rate': 0.00014894472361809047, 'epoch': 20.72}


 26%|██▌       | 260/1000 [20:59<59:26,  4.82s/it]

{'loss': 0.0379, 'grad_norm': 0.2999188005924225, 'learning_rate': 0.00014874371859296482, 'epoch': 20.8}


 26%|██▌       | 261/1000 [21:04<1:00:45,  4.93s/it]

{'loss': 0.0341, 'grad_norm': 0.16507330536842346, 'learning_rate': 0.0001485427135678392, 'epoch': 20.88}


 26%|██▌       | 262/1000 [21:09<59:35,  4.84s/it]  

{'loss': 0.0346, 'grad_norm': 0.17664901912212372, 'learning_rate': 0.00014834170854271356, 'epoch': 20.96}


 26%|██▋       | 263/1000 [21:14<57:58,  4.72s/it]

{'loss': 0.0369, 'grad_norm': 0.24651838839054108, 'learning_rate': 0.00014814070351758796, 'epoch': 21.04}


 26%|██▋       | 264/1000 [21:18<58:18,  4.75s/it]

{'loss': 0.0286, 'grad_norm': 0.20965978503227234, 'learning_rate': 0.0001479396984924623, 'epoch': 21.12}


 26%|██▋       | 265/1000 [21:23<58:29,  4.77s/it]

{'loss': 0.0278, 'grad_norm': 0.11116128414869308, 'learning_rate': 0.00014773869346733668, 'epoch': 21.2}


 27%|██▋       | 266/1000 [21:28<58:34,  4.79s/it]

{'loss': 0.0435, 'grad_norm': 0.3499022424221039, 'learning_rate': 0.00014753768844221106, 'epoch': 21.28}


 27%|██▋       | 267/1000 [21:33<59:58,  4.91s/it]

{'loss': 0.0266, 'grad_norm': 0.13679662346839905, 'learning_rate': 0.00014733668341708543, 'epoch': 21.36}


 27%|██▋       | 268/1000 [21:38<58:50,  4.82s/it]

{'loss': 0.0313, 'grad_norm': 0.1247558444738388, 'learning_rate': 0.0001471356783919598, 'epoch': 21.44}


 27%|██▋       | 269/1000 [21:43<59:25,  4.88s/it]

{'loss': 0.0353, 'grad_norm': 0.34595033526420593, 'learning_rate': 0.00014693467336683417, 'epoch': 21.52}


 27%|██▋       | 270/1000 [21:48<59:06,  4.86s/it]

{'loss': 0.0339, 'grad_norm': 0.326075941324234, 'learning_rate': 0.00014673366834170855, 'epoch': 21.6}


 27%|██▋       | 271/1000 [21:52<58:55,  4.85s/it]

{'loss': 0.0344, 'grad_norm': 0.24985761940479279, 'learning_rate': 0.00014653266331658292, 'epoch': 21.68}


 27%|██▋       | 272/1000 [21:57<58:11,  4.80s/it]

{'loss': 0.0326, 'grad_norm': 0.1143522784113884, 'learning_rate': 0.0001463316582914573, 'epoch': 21.76}


 27%|██▋       | 273/1000 [22:02<58:16,  4.81s/it]

{'loss': 0.0324, 'grad_norm': 0.11550993472337723, 'learning_rate': 0.00014613065326633167, 'epoch': 21.84}


 27%|██▋       | 274/1000 [22:07<58:20,  4.82s/it]

{'loss': 0.0314, 'grad_norm': 0.11949295550584793, 'learning_rate': 0.00014592964824120604, 'epoch': 21.92}


 28%|██▊       | 275/1000 [22:11<57:02,  4.72s/it]

{'loss': 0.0338, 'grad_norm': 0.1602189838886261, 'learning_rate': 0.0001457286432160804, 'epoch': 22.0}


 28%|██▊       | 276/1000 [22:16<58:12,  4.82s/it]

{'loss': 0.0327, 'grad_norm': 0.46392303705215454, 'learning_rate': 0.00014552763819095479, 'epoch': 22.08}


 28%|██▊       | 277/1000 [22:21<58:18,  4.84s/it]

{'loss': 0.0262, 'grad_norm': 0.1054062470793724, 'learning_rate': 0.00014532663316582916, 'epoch': 22.16}


 28%|██▊       | 278/1000 [22:26<58:18,  4.85s/it]

{'loss': 0.0279, 'grad_norm': 0.0895334854722023, 'learning_rate': 0.00014512562814070353, 'epoch': 22.24}


 28%|██▊       | 279/1000 [22:31<58:18,  4.85s/it]

{'loss': 0.0315, 'grad_norm': 0.2902940511703491, 'learning_rate': 0.0001449246231155779, 'epoch': 22.32}


 28%|██▊       | 280/1000 [22:36<57:33,  4.80s/it]

{'loss': 0.03, 'grad_norm': 0.12851101160049438, 'learning_rate': 0.00014472361809045228, 'epoch': 22.4}


 28%|██▊       | 281/1000 [22:41<58:26,  4.88s/it]

{'loss': 0.0293, 'grad_norm': 0.11901026219129562, 'learning_rate': 0.00014452261306532665, 'epoch': 22.48}


 28%|██▊       | 282/1000 [22:46<59:43,  4.99s/it]

{'loss': 0.028, 'grad_norm': 0.09502321481704712, 'learning_rate': 0.000144321608040201, 'epoch': 22.56}


 28%|██▊       | 283/1000 [22:51<59:54,  5.01s/it]

{'loss': 0.0465, 'grad_norm': 0.43979719281196594, 'learning_rate': 0.00014412060301507537, 'epoch': 22.64}


 28%|██▊       | 284/1000 [22:56<58:26,  4.90s/it]

{'loss': 0.0367, 'grad_norm': 0.448138952255249, 'learning_rate': 0.00014391959798994977, 'epoch': 22.72}


 28%|██▊       | 285/1000 [23:00<57:28,  4.82s/it]

{'loss': 0.0332, 'grad_norm': 0.21110975742340088, 'learning_rate': 0.00014371859296482411, 'epoch': 22.8}


 29%|██▊       | 286/1000 [23:05<57:27,  4.83s/it]

{'loss': 0.0328, 'grad_norm': 0.27885761857032776, 'learning_rate': 0.0001435175879396985, 'epoch': 22.88}


 29%|██▊       | 287/1000 [23:10<57:23,  4.83s/it]

{'loss': 0.0307, 'grad_norm': 0.10030888020992279, 'learning_rate': 0.00014331658291457286, 'epoch': 22.96}


 29%|██▉       | 288/1000 [23:15<56:40,  4.78s/it]

{'loss': 0.0309, 'grad_norm': 0.1222517341375351, 'learning_rate': 0.00014311557788944726, 'epoch': 23.04}


 29%|██▉       | 289/1000 [23:19<56:44,  4.79s/it]

{'loss': 0.0297, 'grad_norm': 0.12001626938581467, 'learning_rate': 0.0001429145728643216, 'epoch': 23.12}


 29%|██▉       | 290/1000 [23:24<56:02,  4.74s/it]

{'loss': 0.0323, 'grad_norm': 0.3309047520160675, 'learning_rate': 0.00014271356783919598, 'epoch': 23.2}


 29%|██▉       | 291/1000 [23:29<56:17,  4.76s/it]

{'loss': 0.0276, 'grad_norm': 0.1519818902015686, 'learning_rate': 0.00014251256281407035, 'epoch': 23.28}


 29%|██▉       | 292/1000 [23:33<55:44,  4.72s/it]

{'loss': 0.0296, 'grad_norm': 0.1586618423461914, 'learning_rate': 0.00014231155778894473, 'epoch': 23.36}


 29%|██▉       | 293/1000 [23:38<56:01,  4.75s/it]

{'loss': 0.0411, 'grad_norm': 0.16302604973316193, 'learning_rate': 0.0001421105527638191, 'epoch': 23.44}


 29%|██▉       | 294/1000 [23:43<56:12,  4.78s/it]

{'loss': 0.0282, 'grad_norm': 0.09132443368434906, 'learning_rate': 0.00014190954773869347, 'epoch': 23.52}


 30%|██▉       | 295/1000 [23:48<56:59,  4.85s/it]

{'loss': 0.029, 'grad_norm': 0.10407445579767227, 'learning_rate': 0.00014170854271356784, 'epoch': 23.6}


 30%|██▉       | 296/1000 [23:53<56:52,  4.85s/it]

{'loss': 0.0307, 'grad_norm': 0.13241051137447357, 'learning_rate': 0.00014150753768844222, 'epoch': 23.68}


 30%|██▉       | 297/1000 [23:58<56:44,  4.84s/it]

{'loss': 0.0312, 'grad_norm': 0.11708346009254456, 'learning_rate': 0.0001413065326633166, 'epoch': 23.76}


 30%|██▉       | 298/1000 [24:03<56:38,  4.84s/it]

{'loss': 0.0334, 'grad_norm': 0.11902540177106857, 'learning_rate': 0.00014110552763819096, 'epoch': 23.84}


 30%|██▉       | 299/1000 [24:07<55:56,  4.79s/it]

{'loss': 0.0413, 'grad_norm': 0.39826807379722595, 'learning_rate': 0.00014090452261306534, 'epoch': 23.92}


 30%|███       | 300/1000 [24:12<55:23,  4.75s/it]

{'loss': 0.03, 'grad_norm': 0.10887287557125092, 'learning_rate': 0.0001407035175879397, 'epoch': 24.0}


 30%|███       | 301/1000 [24:17<55:38,  4.78s/it]

{'loss': 0.0284, 'grad_norm': 0.10062918812036514, 'learning_rate': 0.00014050251256281408, 'epoch': 24.08}


 30%|███       | 302/1000 [24:22<55:45,  4.79s/it]

{'loss': 0.0263, 'grad_norm': 0.09543897956609726, 'learning_rate': 0.00014030150753768846, 'epoch': 24.16}


 30%|███       | 303/1000 [24:27<55:49,  4.81s/it]

{'loss': 0.0312, 'grad_norm': 0.2727157473564148, 'learning_rate': 0.0001401005025125628, 'epoch': 24.24}


 30%|███       | 304/1000 [24:31<55:49,  4.81s/it]

{'loss': 0.028, 'grad_norm': 0.09762487560510635, 'learning_rate': 0.0001398994974874372, 'epoch': 24.32}


 30%|███       | 305/1000 [24:36<55:10,  4.76s/it]

{'loss': 0.0285, 'grad_norm': 0.10432206094264984, 'learning_rate': 0.00013969849246231157, 'epoch': 24.4}


 31%|███       | 306/1000 [24:41<55:21,  4.79s/it]

{'loss': 0.03, 'grad_norm': 0.10215183347463608, 'learning_rate': 0.00013949748743718595, 'epoch': 24.48}


 31%|███       | 307/1000 [24:46<55:27,  4.80s/it]

{'loss': 0.029, 'grad_norm': 0.10210926830768585, 'learning_rate': 0.0001392964824120603, 'epoch': 24.56}


 31%|███       | 308/1000 [24:50<55:28,  4.81s/it]

{'loss': 0.0285, 'grad_norm': 0.10106179118156433, 'learning_rate': 0.00013909547738693467, 'epoch': 24.64}


 31%|███       | 309/1000 [24:56<56:08,  4.88s/it]

{'loss': 0.0295, 'grad_norm': 0.10380641371011734, 'learning_rate': 0.00013889447236180907, 'epoch': 24.72}


 31%|███       | 310/1000 [25:00<55:54,  4.86s/it]

{'loss': 0.0301, 'grad_norm': 0.1006232425570488, 'learning_rate': 0.0001386934673366834, 'epoch': 24.8}


 31%|███       | 311/1000 [25:06<57:03,  4.97s/it]

{'loss': 0.0305, 'grad_norm': 0.1460290253162384, 'learning_rate': 0.00013849246231155778, 'epoch': 24.88}


 31%|███       | 312/1000 [25:10<55:51,  4.87s/it]

{'loss': 0.0314, 'grad_norm': 0.12055974453687668, 'learning_rate': 0.00013829145728643216, 'epoch': 24.96}


 31%|███▏      | 313/1000 [25:14<53:40,  4.69s/it]

{'loss': 0.0301, 'grad_norm': 0.18437814712524414, 'learning_rate': 0.00013809045226130656, 'epoch': 25.04}


 31%|███▏      | 314/1000 [25:19<53:27,  4.68s/it]

{'loss': 0.0264, 'grad_norm': 0.09956128150224686, 'learning_rate': 0.0001378894472361809, 'epoch': 25.12}


 32%|███▏      | 315/1000 [25:24<53:59,  4.73s/it]

{'loss': 0.0257, 'grad_norm': 0.08509073406457901, 'learning_rate': 0.00013768844221105528, 'epoch': 25.2}


 32%|███▏      | 316/1000 [25:29<54:55,  4.82s/it]

{'loss': 0.0265, 'grad_norm': 0.09469839185476303, 'learning_rate': 0.00013748743718592965, 'epoch': 25.28}


 32%|███▏      | 317/1000 [25:34<54:56,  4.83s/it]

{'loss': 0.0298, 'grad_norm': 0.11566027998924255, 'learning_rate': 0.00013728643216080402, 'epoch': 25.36}


 32%|███▏      | 318/1000 [25:38<54:15,  4.77s/it]

{'loss': 0.032, 'grad_norm': 0.13246408104896545, 'learning_rate': 0.0001370854271356784, 'epoch': 25.44}


 32%|███▏      | 319/1000 [25:43<54:32,  4.81s/it]

{'loss': 0.0298, 'grad_norm': 0.13647589087486267, 'learning_rate': 0.00013688442211055277, 'epoch': 25.52}


 32%|███▏      | 320/1000 [25:48<54:00,  4.76s/it]

{'loss': 0.0328, 'grad_norm': 0.2963469326496124, 'learning_rate': 0.00013668341708542714, 'epoch': 25.6}


 32%|███▏      | 321/1000 [25:53<54:59,  4.86s/it]

{'loss': 0.0287, 'grad_norm': 0.10069256275892258, 'learning_rate': 0.00013648241206030151, 'epoch': 25.68}


 32%|███▏      | 322/1000 [25:58<55:37,  4.92s/it]

{'loss': 0.0279, 'grad_norm': 0.10400155931711197, 'learning_rate': 0.0001362814070351759, 'epoch': 25.76}


 32%|███▏      | 323/1000 [26:03<54:41,  4.85s/it]

{'loss': 0.0318, 'grad_norm': 0.11858014762401581, 'learning_rate': 0.00013608040201005026, 'epoch': 25.84}


 32%|███▏      | 324/1000 [26:08<55:20,  4.91s/it]

{'loss': 0.0292, 'grad_norm': 0.08810689300298691, 'learning_rate': 0.00013587939698492463, 'epoch': 25.92}


 32%|███▎      | 325/1000 [26:12<54:02,  4.80s/it]

{'loss': 0.0304, 'grad_norm': 0.1213112324476242, 'learning_rate': 0.000135678391959799, 'epoch': 26.0}


 33%|███▎      | 326/1000 [26:18<54:51,  4.88s/it]

{'loss': 0.0262, 'grad_norm': 0.07971767336130142, 'learning_rate': 0.00013547738693467338, 'epoch': 26.08}


 33%|███▎      | 327/1000 [26:22<54:04,  4.82s/it]

{'loss': 0.0276, 'grad_norm': 0.10510628670454025, 'learning_rate': 0.00013527638190954775, 'epoch': 26.16}


 33%|███▎      | 328/1000 [26:27<53:27,  4.77s/it]

{'loss': 0.027, 'grad_norm': 0.09657614678144455, 'learning_rate': 0.0001350753768844221, 'epoch': 26.24}


 33%|███▎      | 329/1000 [26:32<54:20,  4.86s/it]

{'loss': 0.0266, 'grad_norm': 0.09687840938568115, 'learning_rate': 0.00013487437185929647, 'epoch': 26.32}


 33%|███▎      | 330/1000 [26:37<54:16,  4.86s/it]

{'loss': 0.0291, 'grad_norm': 0.10509005188941956, 'learning_rate': 0.00013467336683417087, 'epoch': 26.4}


 33%|███▎      | 331/1000 [26:42<54:12,  4.86s/it]

{'loss': 0.0277, 'grad_norm': 0.09740553796291351, 'learning_rate': 0.00013447236180904524, 'epoch': 26.48}


 33%|███▎      | 332/1000 [26:47<54:08,  4.86s/it]

{'loss': 0.0305, 'grad_norm': 0.10357613861560822, 'learning_rate': 0.0001342713567839196, 'epoch': 26.56}


 33%|███▎      | 333/1000 [26:51<54:06,  4.87s/it]

{'loss': 0.0304, 'grad_norm': 0.10522647947072983, 'learning_rate': 0.00013407035175879396, 'epoch': 26.64}


 33%|███▎      | 334/1000 [26:56<54:01,  4.87s/it]

{'loss': 0.0293, 'grad_norm': 0.3240399956703186, 'learning_rate': 0.00013386934673366836, 'epoch': 26.72}


 34%|███▎      | 335/1000 [27:01<53:17,  4.81s/it]

{'loss': 0.0305, 'grad_norm': 0.10600385814905167, 'learning_rate': 0.0001336683417085427, 'epoch': 26.8}


 34%|███▎      | 336/1000 [27:06<53:54,  4.87s/it]

{'loss': 0.0304, 'grad_norm': 0.11125326156616211, 'learning_rate': 0.00013346733668341708, 'epoch': 26.88}


 34%|███▎      | 337/1000 [27:11<54:19,  4.92s/it]

{'loss': 0.0301, 'grad_norm': 0.10868803411722183, 'learning_rate': 0.00013326633165829146, 'epoch': 26.96}


 34%|███▍      | 338/1000 [27:15<52:41,  4.77s/it]

{'loss': 0.0278, 'grad_norm': 0.1021018773317337, 'learning_rate': 0.00013306532663316586, 'epoch': 27.04}


 34%|███▍      | 339/1000 [27:20<52:07,  4.73s/it]

{'loss': 0.0275, 'grad_norm': 0.09838663786649704, 'learning_rate': 0.0001328643216080402, 'epoch': 27.12}


 34%|███▍      | 340/1000 [27:25<52:21,  4.76s/it]

{'loss': 0.0278, 'grad_norm': 0.09337453544139862, 'learning_rate': 0.00013266331658291457, 'epoch': 27.2}


 34%|███▍      | 341/1000 [27:30<52:30,  4.78s/it]

{'loss': 0.0272, 'grad_norm': 0.08145953714847565, 'learning_rate': 0.00013246231155778895, 'epoch': 27.28}


 34%|███▍      | 342/1000 [27:35<52:34,  4.79s/it]

{'loss': 0.0267, 'grad_norm': 0.08568181097507477, 'learning_rate': 0.00013226130653266332, 'epoch': 27.36}


 34%|███▍      | 343/1000 [27:39<51:57,  4.74s/it]

{'loss': 0.0279, 'grad_norm': 0.09379816800355911, 'learning_rate': 0.0001320603015075377, 'epoch': 27.44}


 34%|███▍      | 344/1000 [27:45<54:00,  4.94s/it]

{'loss': 0.0261, 'grad_norm': 0.07818684726953506, 'learning_rate': 0.00013185929648241207, 'epoch': 27.52}


 34%|███▍      | 345/1000 [27:50<54:10,  4.96s/it]

{'loss': 0.0279, 'grad_norm': 0.09198757261037827, 'learning_rate': 0.00013165829145728644, 'epoch': 27.6}


 35%|███▍      | 346/1000 [27:54<53:36,  4.92s/it]

{'loss': 0.0283, 'grad_norm': 0.09774646908044815, 'learning_rate': 0.0001314572864321608, 'epoch': 27.68}


 35%|███▍      | 347/1000 [27:59<52:34,  4.83s/it]

{'loss': 0.0297, 'grad_norm': 0.10443893074989319, 'learning_rate': 0.00013125628140703518, 'epoch': 27.76}


 35%|███▍      | 348/1000 [28:04<52:26,  4.83s/it]

{'loss': 0.0294, 'grad_norm': 0.09985819458961487, 'learning_rate': 0.00013105527638190956, 'epoch': 27.84}


 35%|███▍      | 349/1000 [28:09<52:21,  4.83s/it]

{'loss': 0.0305, 'grad_norm': 0.10326577723026276, 'learning_rate': 0.00013085427135678393, 'epoch': 27.92}


 35%|███▌      | 350/1000 [28:13<51:01,  4.71s/it]

{'loss': 0.0335, 'grad_norm': 0.1374352127313614, 'learning_rate': 0.0001306532663316583, 'epoch': 28.0}


 35%|███▌      | 351/1000 [28:18<51:19,  4.74s/it]

{'loss': 0.0293, 'grad_norm': 0.10352396219968796, 'learning_rate': 0.00013045226130653268, 'epoch': 28.08}


 35%|███▌      | 352/1000 [28:23<51:29,  4.77s/it]

{'loss': 0.0279, 'grad_norm': 0.0889059528708458, 'learning_rate': 0.00013025125628140705, 'epoch': 28.16}


 35%|███▌      | 353/1000 [28:28<52:46,  4.89s/it]

{'loss': 0.0247, 'grad_norm': 0.08388965576887131, 'learning_rate': 0.0001300502512562814, 'epoch': 28.24}


 35%|███▌      | 354/1000 [28:33<51:50,  4.82s/it]

{'loss': 0.0289, 'grad_norm': 0.10341493785381317, 'learning_rate': 0.00012984924623115577, 'epoch': 28.32}


 36%|███▌      | 355/1000 [28:37<51:10,  4.76s/it]

{'loss': 0.0296, 'grad_norm': 0.11407221853733063, 'learning_rate': 0.00012964824120603017, 'epoch': 28.4}


 36%|███▌      | 356/1000 [28:42<51:17,  4.78s/it]

{'loss': 0.0293, 'grad_norm': 0.11036718636751175, 'learning_rate': 0.00012944723618090454, 'epoch': 28.48}


 36%|███▌      | 357/1000 [28:47<51:23,  4.79s/it]

{'loss': 0.0276, 'grad_norm': 0.09761292487382889, 'learning_rate': 0.0001292462311557789, 'epoch': 28.56}


 36%|███▌      | 358/1000 [28:52<50:46,  4.75s/it]

{'loss': 0.028, 'grad_norm': 0.10106078535318375, 'learning_rate': 0.00012904522613065326, 'epoch': 28.64}


 36%|███▌      | 359/1000 [28:56<50:59,  4.77s/it]

{'loss': 0.0296, 'grad_norm': 0.09502983093261719, 'learning_rate': 0.00012884422110552766, 'epoch': 28.72}


 36%|███▌      | 360/1000 [29:01<51:46,  4.85s/it]

{'loss': 0.0306, 'grad_norm': 0.10566499829292297, 'learning_rate': 0.000128643216080402, 'epoch': 28.8}


 36%|███▌      | 361/1000 [29:06<52:15,  4.91s/it]

{'loss': 0.029, 'grad_norm': 0.08952342718839645, 'learning_rate': 0.00012844221105527638, 'epoch': 28.88}


 36%|███▌      | 362/1000 [29:11<51:58,  4.89s/it]

{'loss': 0.0302, 'grad_norm': 0.10564329475164413, 'learning_rate': 0.00012824120603015075, 'epoch': 28.96}


 36%|███▋      | 363/1000 [29:16<50:32,  4.76s/it]

{'loss': 0.0298, 'grad_norm': 0.0943770706653595, 'learning_rate': 0.00012804020100502515, 'epoch': 29.04}


 36%|███▋      | 364/1000 [29:21<50:44,  4.79s/it]

{'loss': 0.0273, 'grad_norm': 0.09351789951324463, 'learning_rate': 0.0001278391959798995, 'epoch': 29.12}


 36%|███▋      | 365/1000 [29:25<50:53,  4.81s/it]

{'loss': 0.0281, 'grad_norm': 0.09478772431612015, 'learning_rate': 0.00012763819095477387, 'epoch': 29.2}


 37%|███▋      | 366/1000 [29:30<51:35,  4.88s/it]

{'loss': 0.0265, 'grad_norm': 0.0818176120519638, 'learning_rate': 0.00012743718592964824, 'epoch': 29.28}


 37%|███▋      | 367/1000 [29:35<50:48,  4.82s/it]

{'loss': 0.0273, 'grad_norm': 0.0855216458439827, 'learning_rate': 0.00012723618090452262, 'epoch': 29.36}


 37%|███▋      | 368/1000 [29:40<50:50,  4.83s/it]

{'loss': 0.0266, 'grad_norm': 0.09289756417274475, 'learning_rate': 0.000127035175879397, 'epoch': 29.44}


 37%|███▋      | 369/1000 [29:45<51:24,  4.89s/it]

{'loss': 0.0267, 'grad_norm': 0.10854896903038025, 'learning_rate': 0.00012683417085427136, 'epoch': 29.52}


 37%|███▋      | 370/1000 [29:50<50:29,  4.81s/it]

{'loss': 0.0285, 'grad_norm': 0.09654165059328079, 'learning_rate': 0.00012663316582914574, 'epoch': 29.6}


 37%|███▋      | 371/1000 [29:55<51:07,  4.88s/it]

{'loss': 0.03, 'grad_norm': 0.10892193764448166, 'learning_rate': 0.0001264321608040201, 'epoch': 29.68}


 37%|███▋      | 372/1000 [29:59<50:16,  4.80s/it]

{'loss': 0.0289, 'grad_norm': 0.09854012727737427, 'learning_rate': 0.00012623115577889448, 'epoch': 29.76}


 37%|███▋      | 373/1000 [30:04<50:18,  4.81s/it]

{'loss': 0.0297, 'grad_norm': 0.09156246483325958, 'learning_rate': 0.00012603015075376885, 'epoch': 29.84}


 37%|███▋      | 374/1000 [30:09<50:53,  4.88s/it]

{'loss': 0.0296, 'grad_norm': 0.09946306049823761, 'learning_rate': 0.00012582914572864323, 'epoch': 29.92}


 38%|███▊      | 375/1000 [30:13<48:54,  4.70s/it]

{'loss': 0.0308, 'grad_norm': 0.11582624167203903, 'learning_rate': 0.0001256281407035176, 'epoch': 30.0}


 38%|███▊      | 376/1000 [30:18<49:17,  4.74s/it]

{'loss': 0.026, 'grad_norm': 0.09357015043497086, 'learning_rate': 0.00012542713567839197, 'epoch': 30.08}


 38%|███▊      | 377/1000 [30:23<49:26,  4.76s/it]

{'loss': 0.0267, 'grad_norm': 0.08218780905008316, 'learning_rate': 0.00012522613065326635, 'epoch': 30.16}


 38%|███▊      | 378/1000 [30:28<49:37,  4.79s/it]

{'loss': 0.0268, 'grad_norm': 0.09296420961618423, 'learning_rate': 0.0001250251256281407, 'epoch': 30.24}


 38%|███▊      | 379/1000 [30:33<49:07,  4.75s/it]

{'loss': 0.0273, 'grad_norm': 0.09022881090641022, 'learning_rate': 0.00012482412060301507, 'epoch': 30.32}


 38%|███▊      | 380/1000 [30:37<48:43,  4.72s/it]

{'loss': 0.0293, 'grad_norm': 0.10057053714990616, 'learning_rate': 0.00012462311557788947, 'epoch': 30.4}


 38%|███▊      | 381/1000 [30:42<48:25,  4.69s/it]

{'loss': 0.0278, 'grad_norm': 0.08600836247205734, 'learning_rate': 0.00012442211055276384, 'epoch': 30.48}


 38%|███▊      | 382/1000 [30:47<48:45,  4.73s/it]

{'loss': 0.0283, 'grad_norm': 0.09575923532247543, 'learning_rate': 0.00012422110552763818, 'epoch': 30.56}


 38%|███▊      | 383/1000 [30:51<48:24,  4.71s/it]

{'loss': 0.0284, 'grad_norm': 0.08769755065441132, 'learning_rate': 0.00012402010050251256, 'epoch': 30.64}


 38%|███▊      | 384/1000 [30:56<48:44,  4.75s/it]

{'loss': 0.0287, 'grad_norm': 0.09396608173847198, 'learning_rate': 0.00012381909547738696, 'epoch': 30.72}


 38%|███▊      | 385/1000 [31:01<49:32,  4.83s/it]

{'loss': 0.029, 'grad_norm': 0.09803633391857147, 'learning_rate': 0.0001236180904522613, 'epoch': 30.8}


 39%|███▊      | 386/1000 [31:06<50:38,  4.95s/it]

{'loss': 0.0305, 'grad_norm': 0.12020692229270935, 'learning_rate': 0.00012341708542713568, 'epoch': 30.88}


 39%|███▊      | 387/1000 [31:11<50:47,  4.97s/it]

{'loss': 0.0302, 'grad_norm': 0.11525168269872665, 'learning_rate': 0.00012321608040201005, 'epoch': 30.96}


 39%|███▉      | 388/1000 [31:16<49:07,  4.82s/it]

{'loss': 0.0273, 'grad_norm': 0.1175619438290596, 'learning_rate': 0.00012301507537688445, 'epoch': 31.04}


 39%|███▉      | 389/1000 [31:21<49:04,  4.82s/it]

{'loss': 0.0266, 'grad_norm': 0.0921117439866066, 'learning_rate': 0.0001228140703517588, 'epoch': 31.12}


 39%|███▉      | 390/1000 [31:26<49:37,  4.88s/it]

{'loss': 0.0261, 'grad_norm': 0.07672625035047531, 'learning_rate': 0.00012261306532663317, 'epoch': 31.2}


 39%|███▉      | 391/1000 [31:30<48:49,  4.81s/it]

{'loss': 0.0266, 'grad_norm': 0.08501288294792175, 'learning_rate': 0.00012241206030150754, 'epoch': 31.28}


 39%|███▉      | 392/1000 [31:35<48:50,  4.82s/it]

{'loss': 0.0264, 'grad_norm': 0.08676021546125412, 'learning_rate': 0.00012221105527638191, 'epoch': 31.36}


 39%|███▉      | 393/1000 [31:40<48:13,  4.77s/it]

{'loss': 0.0274, 'grad_norm': 0.08554163575172424, 'learning_rate': 0.00012201005025125629, 'epoch': 31.44}


 39%|███▉      | 394/1000 [31:45<48:23,  4.79s/it]

{'loss': 0.0281, 'grad_norm': 0.08904236555099487, 'learning_rate': 0.00012180904522613066, 'epoch': 31.52}


 40%|███▉      | 395/1000 [31:49<47:52,  4.75s/it]

{'loss': 0.0289, 'grad_norm': 0.09391066431999207, 'learning_rate': 0.00012160804020100502, 'epoch': 31.6}


 40%|███▉      | 396/1000 [31:54<48:04,  4.78s/it]

{'loss': 0.0295, 'grad_norm': 0.0943315252661705, 'learning_rate': 0.00012140703517587942, 'epoch': 31.68}


 40%|███▉      | 397/1000 [31:59<49:21,  4.91s/it]

{'loss': 0.0277, 'grad_norm': 0.09358283132314682, 'learning_rate': 0.00012120603015075378, 'epoch': 31.76}


 40%|███▉      | 398/1000 [32:05<49:38,  4.95s/it]

{'loss': 0.0318, 'grad_norm': 0.10778950899839401, 'learning_rate': 0.00012100502512562815, 'epoch': 31.84}


 40%|███▉      | 399/1000 [32:09<48:35,  4.85s/it]

{'loss': 0.0327, 'grad_norm': 0.11111873388290405, 'learning_rate': 0.00012080402010050251, 'epoch': 31.92}


 40%|████      | 400/1000 [32:14<48:07,  4.81s/it]

{'loss': 0.0313, 'grad_norm': 0.12728676199913025, 'learning_rate': 0.00012060301507537688, 'epoch': 32.0}


 40%|████      | 401/1000 [32:19<48:43,  4.88s/it]

{'loss': 0.0264, 'grad_norm': 0.0814247578382492, 'learning_rate': 0.00012040201005025127, 'epoch': 32.08}


 40%|████      | 402/1000 [32:24<48:30,  4.87s/it]

{'loss': 0.0263, 'grad_norm': 0.0780026987195015, 'learning_rate': 0.00012020100502512563, 'epoch': 32.16}


 40%|████      | 403/1000 [32:28<47:45,  4.80s/it]

{'loss': 0.0271, 'grad_norm': 0.0907338336110115, 'learning_rate': 0.00012, 'epoch': 32.24}


 40%|████      | 404/1000 [32:33<47:11,  4.75s/it]

{'loss': 0.0273, 'grad_norm': 0.09635910391807556, 'learning_rate': 0.00011979899497487436, 'epoch': 32.32}


 40%|████      | 405/1000 [32:38<47:29,  4.79s/it]

{'loss': 0.027, 'grad_norm': 0.08311109989881516, 'learning_rate': 0.00011959798994974876, 'epoch': 32.4}


 41%|████      | 406/1000 [32:43<47:38,  4.81s/it]

{'loss': 0.0288, 'grad_norm': 0.08869322389364243, 'learning_rate': 0.00011939698492462312, 'epoch': 32.48}


 41%|████      | 407/1000 [32:47<47:06,  4.77s/it]

{'loss': 0.0279, 'grad_norm': 0.10610360652208328, 'learning_rate': 0.0001191959798994975, 'epoch': 32.56}


 41%|████      | 408/1000 [32:52<47:55,  4.86s/it]

{'loss': 0.0288, 'grad_norm': 0.08941537141799927, 'learning_rate': 0.00011899497487437185, 'epoch': 32.64}


 41%|████      | 409/1000 [32:57<47:17,  4.80s/it]

{'loss': 0.0286, 'grad_norm': 0.08762931078672409, 'learning_rate': 0.00011879396984924624, 'epoch': 32.72}


 41%|████      | 410/1000 [33:02<47:57,  4.88s/it]

{'loss': 0.0297, 'grad_norm': 0.09552693367004395, 'learning_rate': 0.00011859296482412061, 'epoch': 32.8}


 41%|████      | 411/1000 [33:07<48:59,  4.99s/it]

{'loss': 0.0281, 'grad_norm': 0.07917457073926926, 'learning_rate': 0.00011839195979899497, 'epoch': 32.88}


 41%|████      | 412/1000 [33:12<48:30,  4.95s/it]

{'loss': 0.0315, 'grad_norm': 0.11853193491697311, 'learning_rate': 0.00011819095477386935, 'epoch': 32.96}


 41%|████▏     | 413/1000 [33:17<46:39,  4.77s/it]

{'loss': 0.0276, 'grad_norm': 0.09616909921169281, 'learning_rate': 0.00011798994974874373, 'epoch': 33.04}


 41%|████▏     | 414/1000 [33:22<46:48,  4.79s/it]

{'loss': 0.0268, 'grad_norm': 0.07982231676578522, 'learning_rate': 0.0001177889447236181, 'epoch': 33.12}


 42%|████▏     | 415/1000 [33:27<47:29,  4.87s/it]

{'loss': 0.0265, 'grad_norm': 0.07987848669290543, 'learning_rate': 0.00011758793969849247, 'epoch': 33.2}


 42%|████▏     | 416/1000 [33:32<48:25,  4.98s/it]

{'loss': 0.0261, 'grad_norm': 0.08566903322935104, 'learning_rate': 0.00011738693467336684, 'epoch': 33.28}


 42%|████▏     | 417/1000 [33:37<47:57,  4.94s/it]

{'loss': 0.0275, 'grad_norm': 0.09313835203647614, 'learning_rate': 0.00011718592964824122, 'epoch': 33.36}


 42%|████▏     | 418/1000 [33:42<48:12,  4.97s/it]

{'loss': 0.0262, 'grad_norm': 0.08396560698747635, 'learning_rate': 0.00011698492462311558, 'epoch': 33.44}


 42%|████▏     | 419/1000 [33:47<48:22,  5.00s/it]

{'loss': 0.0277, 'grad_norm': 0.0970829427242279, 'learning_rate': 0.00011678391959798996, 'epoch': 33.52}


 42%|████▏     | 420/1000 [33:52<47:53,  4.95s/it]

{'loss': 0.0283, 'grad_norm': 0.08580929040908813, 'learning_rate': 0.00011658291457286432, 'epoch': 33.6}


 42%|████▏     | 421/1000 [33:56<47:00,  4.87s/it]

{'loss': 0.0293, 'grad_norm': 0.09809914231300354, 'learning_rate': 0.00011638190954773872, 'epoch': 33.68}


 42%|████▏     | 422/1000 [34:01<46:49,  4.86s/it]

{'loss': 0.0279, 'grad_norm': 0.09278406947851181, 'learning_rate': 0.00011618090452261308, 'epoch': 33.76}


 42%|████▏     | 423/1000 [34:06<46:05,  4.79s/it]

{'loss': 0.0305, 'grad_norm': 0.11392739415168762, 'learning_rate': 0.00011597989949748745, 'epoch': 33.84}


 42%|████▏     | 424/1000 [34:10<45:34,  4.75s/it]

{'loss': 0.0298, 'grad_norm': 0.10621730983257294, 'learning_rate': 0.00011577889447236181, 'epoch': 33.92}


 42%|████▎     | 425/1000 [34:15<44:36,  4.66s/it]

{'loss': 0.0304, 'grad_norm': 0.10389536619186401, 'learning_rate': 0.00011557788944723618, 'epoch': 34.0}


 43%|████▎     | 426/1000 [34:20<45:00,  4.70s/it]

{'loss': 0.0265, 'grad_norm': 0.07375644892454147, 'learning_rate': 0.00011537688442211057, 'epoch': 34.08}


 43%|████▎     | 427/1000 [34:25<45:47,  4.79s/it]

{'loss': 0.0265, 'grad_norm': 0.07888094335794449, 'learning_rate': 0.00011517587939698493, 'epoch': 34.16}


 43%|████▎     | 428/1000 [34:30<46:22,  4.87s/it]

{'loss': 0.026, 'grad_norm': 0.08804454654455185, 'learning_rate': 0.0001149748743718593, 'epoch': 34.24}


 43%|████▎     | 429/1000 [34:34<45:36,  4.79s/it]

{'loss': 0.0286, 'grad_norm': 0.09812217205762863, 'learning_rate': 0.00011477386934673366, 'epoch': 34.32}


 43%|████▎     | 430/1000 [34:39<45:35,  4.80s/it]

{'loss': 0.0275, 'grad_norm': 0.09521429240703583, 'learning_rate': 0.00011457286432160806, 'epoch': 34.4}


 43%|████▎     | 431/1000 [34:44<45:34,  4.81s/it]

{'loss': 0.028, 'grad_norm': 0.09130164980888367, 'learning_rate': 0.00011437185929648242, 'epoch': 34.48}


 43%|████▎     | 432/1000 [34:49<46:07,  4.87s/it]

{'loss': 0.0274, 'grad_norm': 0.08592624962329865, 'learning_rate': 0.00011417085427135679, 'epoch': 34.56}


 43%|████▎     | 433/1000 [34:54<45:20,  4.80s/it]

{'loss': 0.0293, 'grad_norm': 0.0959087684750557, 'learning_rate': 0.00011396984924623115, 'epoch': 34.64}


 43%|████▎     | 434/1000 [34:58<45:19,  4.80s/it]

{'loss': 0.0288, 'grad_norm': 0.10035345703363419, 'learning_rate': 0.00011376884422110554, 'epoch': 34.72}


 44%|████▎     | 435/1000 [35:03<45:20,  4.81s/it]

{'loss': 0.0282, 'grad_norm': 0.08483389019966125, 'learning_rate': 0.00011356783919597991, 'epoch': 34.8}


 44%|████▎     | 436/1000 [35:08<45:17,  4.82s/it]

{'loss': 0.0292, 'grad_norm': 0.08881748467683792, 'learning_rate': 0.00011336683417085427, 'epoch': 34.88}


 44%|████▎     | 437/1000 [35:13<45:45,  4.88s/it]

{'loss': 0.0288, 'grad_norm': 0.09744074195623398, 'learning_rate': 0.00011316582914572864, 'epoch': 34.96}


 44%|████▍     | 438/1000 [35:18<44:28,  4.75s/it]

{'loss': 0.0287, 'grad_norm': 0.09787172079086304, 'learning_rate': 0.00011296482412060303, 'epoch': 35.04}


 44%|████▍     | 439/1000 [35:22<44:02,  4.71s/it]

{'loss': 0.027, 'grad_norm': 0.08529065549373627, 'learning_rate': 0.0001127638190954774, 'epoch': 35.12}


 44%|████▍     | 440/1000 [35:27<44:46,  4.80s/it]

{'loss': 0.0247, 'grad_norm': 0.07204018533229828, 'learning_rate': 0.00011256281407035176, 'epoch': 35.2}


 44%|████▍     | 441/1000 [35:32<44:14,  4.75s/it]

{'loss': 0.0277, 'grad_norm': 0.08379857242107391, 'learning_rate': 0.00011236180904522614, 'epoch': 35.28}


 44%|████▍     | 442/1000 [35:37<44:51,  4.82s/it]

{'loss': 0.0254, 'grad_norm': 0.0843074768781662, 'learning_rate': 0.00011216080402010052, 'epoch': 35.36}


 44%|████▍     | 443/1000 [35:41<44:14,  4.77s/it]

{'loss': 0.0282, 'grad_norm': 0.08773159235715866, 'learning_rate': 0.00011195979899497488, 'epoch': 35.44}


 44%|████▍     | 444/1000 [35:46<44:51,  4.84s/it]

{'loss': 0.0279, 'grad_norm': 0.08164969086647034, 'learning_rate': 0.00011175879396984925, 'epoch': 35.52}


 44%|████▍     | 445/1000 [35:51<44:52,  4.85s/it]

{'loss': 0.0284, 'grad_norm': 0.08624530583620071, 'learning_rate': 0.00011155778894472361, 'epoch': 35.6}


 45%|████▍     | 446/1000 [35:56<45:23,  4.92s/it]

{'loss': 0.0291, 'grad_norm': 0.08744361996650696, 'learning_rate': 0.00011135678391959799, 'epoch': 35.68}


 45%|████▍     | 447/1000 [36:01<44:33,  4.83s/it]

{'loss': 0.0311, 'grad_norm': 0.12108078598976135, 'learning_rate': 0.00011115577889447237, 'epoch': 35.76}


 45%|████▍     | 448/1000 [36:06<45:03,  4.90s/it]

{'loss': 0.0292, 'grad_norm': 0.08561071753501892, 'learning_rate': 0.00011095477386934675, 'epoch': 35.84}


 45%|████▍     | 449/1000 [36:11<44:49,  4.88s/it]

{'loss': 0.0283, 'grad_norm': 0.08713847398757935, 'learning_rate': 0.0001107537688442211, 'epoch': 35.92}


 45%|████▌     | 450/1000 [36:15<43:15,  4.72s/it]

{'loss': 0.0302, 'grad_norm': 0.10050707310438156, 'learning_rate': 0.00011055276381909548, 'epoch': 36.0}


 45%|████▌     | 451/1000 [36:20<43:06,  4.71s/it]

{'loss': 0.0266, 'grad_norm': 0.0944618508219719, 'learning_rate': 0.00011035175879396986, 'epoch': 36.08}


 45%|████▌     | 452/1000 [36:25<43:24,  4.75s/it]

{'loss': 0.0262, 'grad_norm': 0.08576452732086182, 'learning_rate': 0.00011015075376884422, 'epoch': 36.16}


 45%|████▌     | 453/1000 [36:30<44:07,  4.84s/it]

{'loss': 0.0259, 'grad_norm': 0.07786649465560913, 'learning_rate': 0.0001099497487437186, 'epoch': 36.24}


 45%|████▌     | 454/1000 [36:35<44:09,  4.85s/it]

{'loss': 0.0271, 'grad_norm': 0.08527068048715591, 'learning_rate': 0.00010974874371859296, 'epoch': 36.32}


 46%|████▌     | 455/1000 [36:40<44:10,  4.86s/it]

{'loss': 0.0264, 'grad_norm': 0.0805109515786171, 'learning_rate': 0.00010954773869346736, 'epoch': 36.4}


 46%|████▌     | 456/1000 [36:44<43:37,  4.81s/it]

{'loss': 0.0271, 'grad_norm': 0.08352889865636826, 'learning_rate': 0.00010934673366834172, 'epoch': 36.48}


 46%|████▌     | 457/1000 [36:49<43:37,  4.82s/it]

{'loss': 0.0275, 'grad_norm': 0.08632776141166687, 'learning_rate': 0.00010914572864321609, 'epoch': 36.56}


 46%|████▌     | 458/1000 [36:54<43:35,  4.83s/it]

{'loss': 0.0285, 'grad_norm': 0.09063585847616196, 'learning_rate': 0.00010894472361809045, 'epoch': 36.64}


 46%|████▌     | 459/1000 [36:59<44:02,  4.88s/it]

{'loss': 0.0282, 'grad_norm': 0.092318594455719, 'learning_rate': 0.00010874371859296483, 'epoch': 36.72}


 46%|████▌     | 460/1000 [37:04<43:20,  4.82s/it]

{'loss': 0.0319, 'grad_norm': 0.12223496288061142, 'learning_rate': 0.00010854271356783921, 'epoch': 36.8}


 46%|████▌     | 461/1000 [37:09<43:50,  4.88s/it]

{'loss': 0.0304, 'grad_norm': 0.11659043282270432, 'learning_rate': 0.00010834170854271357, 'epoch': 36.88}


 46%|████▌     | 462/1000 [37:14<43:37,  4.87s/it]

{'loss': 0.0293, 'grad_norm': 0.10491786152124405, 'learning_rate': 0.00010814070351758794, 'epoch': 36.96}


 46%|████▋     | 463/1000 [37:18<43:07,  4.82s/it]

{'loss': 0.0268, 'grad_norm': 0.1034448891878128, 'learning_rate': 0.00010793969849246233, 'epoch': 37.04}


 46%|████▋     | 464/1000 [37:23<43:36,  4.88s/it]

{'loss': 0.0262, 'grad_norm': 0.090456023812294, 'learning_rate': 0.0001077386934673367, 'epoch': 37.12}


 46%|████▋     | 465/1000 [37:28<43:57,  4.93s/it]

{'loss': 0.0273, 'grad_norm': 0.09777767211198807, 'learning_rate': 0.00010753768844221106, 'epoch': 37.2}


 47%|████▋     | 466/1000 [37:33<43:38,  4.90s/it]

{'loss': 0.0267, 'grad_norm': 0.09239482134580612, 'learning_rate': 0.00010733668341708543, 'epoch': 37.28}


 47%|████▋     | 467/1000 [37:38<43:54,  4.94s/it]

{'loss': 0.0272, 'grad_norm': 0.0810764878988266, 'learning_rate': 0.00010713567839195982, 'epoch': 37.36}


 47%|████▋     | 468/1000 [37:43<43:28,  4.90s/it]

{'loss': 0.0284, 'grad_norm': 0.09839998930692673, 'learning_rate': 0.00010693467336683418, 'epoch': 37.44}


 47%|████▋     | 469/1000 [37:48<42:44,  4.83s/it]

{'loss': 0.0273, 'grad_norm': 0.08200415968894958, 'learning_rate': 0.00010673366834170855, 'epoch': 37.52}


 47%|████▋     | 470/1000 [37:52<42:39,  4.83s/it]

{'loss': 0.0279, 'grad_norm': 0.0892333909869194, 'learning_rate': 0.00010653266331658291, 'epoch': 37.6}


 47%|████▋     | 471/1000 [37:57<42:35,  4.83s/it]

{'loss': 0.0281, 'grad_norm': 0.0922766700387001, 'learning_rate': 0.00010633165829145728, 'epoch': 37.68}


 47%|████▋     | 472/1000 [38:02<42:32,  4.83s/it]

{'loss': 0.0282, 'grad_norm': 0.0833495706319809, 'learning_rate': 0.00010613065326633167, 'epoch': 37.76}


 47%|████▋     | 473/1000 [38:07<41:55,  4.77s/it]

{'loss': 0.0297, 'grad_norm': 0.0823492482304573, 'learning_rate': 0.00010592964824120604, 'epoch': 37.84}


 47%|████▋     | 474/1000 [38:12<42:00,  4.79s/it]

{'loss': 0.0297, 'grad_norm': 0.08786749094724655, 'learning_rate': 0.0001057286432160804, 'epoch': 37.92}


 48%|████▊     | 475/1000 [38:16<40:43,  4.65s/it]

{'loss': 0.0299, 'grad_norm': 0.11152565479278564, 'learning_rate': 0.00010552763819095478, 'epoch': 38.0}


 48%|████▊     | 476/1000 [38:21<41:05,  4.71s/it]

{'loss': 0.0259, 'grad_norm': 0.09091173112392426, 'learning_rate': 0.00010532663316582916, 'epoch': 38.08}


 48%|████▊     | 477/1000 [38:25<40:49,  4.68s/it]

{'loss': 0.0276, 'grad_norm': 0.08641951531171799, 'learning_rate': 0.00010512562814070352, 'epoch': 38.16}


 48%|████▊     | 478/1000 [38:30<40:38,  4.67s/it]

{'loss': 0.0274, 'grad_norm': 0.08734053373336792, 'learning_rate': 0.0001049246231155779, 'epoch': 38.24}


 48%|████▊     | 479/1000 [38:35<41:00,  4.72s/it]

{'loss': 0.0261, 'grad_norm': 0.07066704332828522, 'learning_rate': 0.00010472361809045225, 'epoch': 38.32}


 48%|████▊     | 480/1000 [38:40<42:14,  4.87s/it]

{'loss': 0.026, 'grad_norm': 0.08249110728502274, 'learning_rate': 0.00010452261306532664, 'epoch': 38.4}


 48%|████▊     | 481/1000 [38:46<43:34,  5.04s/it]

{'loss': 0.0275, 'grad_norm': 0.09303738921880722, 'learning_rate': 0.00010432160804020101, 'epoch': 38.48}


 48%|████▊     | 482/1000 [38:50<42:57,  4.98s/it]

{'loss': 0.0275, 'grad_norm': 0.0841137170791626, 'learning_rate': 0.00010412060301507539, 'epoch': 38.56}


 48%|████▊     | 483/1000 [38:55<41:58,  4.87s/it]

{'loss': 0.0288, 'grad_norm': 0.09741632640361786, 'learning_rate': 0.00010391959798994975, 'epoch': 38.64}


 48%|████▊     | 484/1000 [39:00<41:49,  4.86s/it]

{'loss': 0.0287, 'grad_norm': 0.0971447080373764, 'learning_rate': 0.00010371859296482413, 'epoch': 38.72}


 48%|████▊     | 485/1000 [39:05<41:43,  4.86s/it]

{'loss': 0.0282, 'grad_norm': 0.08627238869667053, 'learning_rate': 0.0001035175879396985, 'epoch': 38.8}


 49%|████▊     | 486/1000 [39:09<41:06,  4.80s/it]

{'loss': 0.0281, 'grad_norm': 0.08453188836574554, 'learning_rate': 0.00010331658291457286, 'epoch': 38.88}


 49%|████▊     | 487/1000 [39:14<41:06,  4.81s/it]

{'loss': 0.0289, 'grad_norm': 0.0924454778432846, 'learning_rate': 0.00010311557788944724, 'epoch': 38.96}


 49%|████▉     | 488/1000 [39:19<40:19,  4.73s/it]

{'loss': 0.0279, 'grad_norm': 0.09528826922178268, 'learning_rate': 0.00010291457286432162, 'epoch': 39.04}


 49%|████▉     | 489/1000 [39:24<40:27,  4.75s/it]

{'loss': 0.0255, 'grad_norm': 0.07817333191633224, 'learning_rate': 0.00010271356783919598, 'epoch': 39.12}


 49%|████▉     | 490/1000 [39:28<40:05,  4.72s/it]

{'loss': 0.0275, 'grad_norm': 0.08060485124588013, 'learning_rate': 0.00010251256281407036, 'epoch': 39.2}


 49%|████▉     | 491/1000 [39:33<40:21,  4.76s/it]

{'loss': 0.0263, 'grad_norm': 0.08092747628688812, 'learning_rate': 0.00010231155778894473, 'epoch': 39.28}


 49%|████▉     | 492/1000 [39:38<40:39,  4.80s/it]

{'loss': 0.0281, 'grad_norm': 0.11163382977247238, 'learning_rate': 0.00010211055276381909, 'epoch': 39.36}


 49%|████▉     | 493/1000 [39:43<40:45,  4.82s/it]

{'loss': 0.0268, 'grad_norm': 0.09031950682401657, 'learning_rate': 0.00010190954773869348, 'epoch': 39.44}


 49%|████▉     | 494/1000 [39:48<41:48,  4.96s/it]

{'loss': 0.0266, 'grad_norm': 0.07865984737873077, 'learning_rate': 0.00010170854271356785, 'epoch': 39.52}


 50%|████▉     | 495/1000 [39:53<41:31,  4.93s/it]

{'loss': 0.029, 'grad_norm': 0.09412255138158798, 'learning_rate': 0.00010150753768844221, 'epoch': 39.6}


 50%|████▉     | 496/1000 [39:58<40:50,  4.86s/it]

{'loss': 0.0282, 'grad_norm': 0.09273459762334824, 'learning_rate': 0.00010130653266331658, 'epoch': 39.68}


 50%|████▉     | 497/1000 [40:03<40:49,  4.87s/it]

{'loss': 0.0288, 'grad_norm': 0.09000615030527115, 'learning_rate': 0.00010110552763819097, 'epoch': 39.76}


 50%|████▉     | 498/1000 [40:07<40:46,  4.87s/it]

{'loss': 0.0283, 'grad_norm': 0.08611232787370682, 'learning_rate': 0.00010090452261306533, 'epoch': 39.84}


 50%|████▉     | 499/1000 [40:12<40:41,  4.87s/it]

{'loss': 0.0292, 'grad_norm': 0.09971580654382706, 'learning_rate': 0.0001007035175879397, 'epoch': 39.92}


 50%|█████     | 500/1000 [40:17<39:40,  4.76s/it]

{'loss': 0.0287, 'grad_norm': 0.11270460486412048, 'learning_rate': 0.00010050251256281407, 'epoch': 40.0}


 50%|█████     | 501/1000 [40:22<41:16,  4.96s/it]

{'loss': 0.0277, 'grad_norm': 0.08895387500524521, 'learning_rate': 0.00010030150753768846, 'epoch': 40.08}


 50%|█████     | 502/1000 [40:27<40:57,  4.93s/it]

{'loss': 0.0269, 'grad_norm': 0.08931569755077362, 'learning_rate': 0.00010010050251256282, 'epoch': 40.16}


 50%|█████     | 503/1000 [40:32<41:12,  4.97s/it]

{'loss': 0.026, 'grad_norm': 0.08679826557636261, 'learning_rate': 9.989949748743719e-05, 'epoch': 40.24}


 50%|█████     | 504/1000 [40:37<41:21,  5.00s/it]

{'loss': 0.0267, 'grad_norm': 0.08635221421718597, 'learning_rate': 9.969849246231156e-05, 'epoch': 40.32}


 50%|█████     | 505/1000 [40:42<40:30,  4.91s/it]

{'loss': 0.0282, 'grad_norm': 0.08538442105054855, 'learning_rate': 9.949748743718594e-05, 'epoch': 40.4}


 51%|█████     | 506/1000 [40:47<40:21,  4.90s/it]

{'loss': 0.0262, 'grad_norm': 0.0845373347401619, 'learning_rate': 9.929648241206031e-05, 'epoch': 40.48}


 51%|█████     | 507/1000 [40:52<40:11,  4.89s/it]

{'loss': 0.0269, 'grad_norm': 0.07988713681697845, 'learning_rate': 9.909547738693468e-05, 'epoch': 40.56}


 51%|█████     | 508/1000 [40:56<39:29,  4.82s/it]

{'loss': 0.0303, 'grad_norm': 0.12750659883022308, 'learning_rate': 9.889447236180906e-05, 'epoch': 40.64}


 51%|█████     | 509/1000 [41:01<39:26,  4.82s/it]

{'loss': 0.0286, 'grad_norm': 0.09365735948085785, 'learning_rate': 9.869346733668342e-05, 'epoch': 40.72}


 51%|█████     | 510/1000 [41:06<39:21,  4.82s/it]

{'loss': 0.0281, 'grad_norm': 0.09456929564476013, 'learning_rate': 9.84924623115578e-05, 'epoch': 40.8}


 51%|█████     | 511/1000 [41:11<39:46,  4.88s/it]

{'loss': 0.0289, 'grad_norm': 0.10023412108421326, 'learning_rate': 9.829145728643216e-05, 'epoch': 40.88}


 51%|█████     | 512/1000 [41:16<40:02,  4.92s/it]

{'loss': 0.0298, 'grad_norm': 0.0921151265501976, 'learning_rate': 9.809045226130655e-05, 'epoch': 40.96}


 51%|█████▏    | 513/1000 [41:21<38:58,  4.80s/it]

{'loss': 0.027, 'grad_norm': 0.08889627456665039, 'learning_rate': 9.788944723618091e-05, 'epoch': 41.04}


 51%|█████▏    | 514/1000 [41:25<38:57,  4.81s/it]

{'loss': 0.0257, 'grad_norm': 0.09000875800848007, 'learning_rate': 9.768844221105528e-05, 'epoch': 41.12}


 52%|█████▏    | 515/1000 [41:30<39:23,  4.87s/it]

{'loss': 0.0272, 'grad_norm': 0.08602455258369446, 'learning_rate': 9.748743718592965e-05, 'epoch': 41.2}


 52%|█████▏    | 516/1000 [41:35<38:43,  4.80s/it]

{'loss': 0.0262, 'grad_norm': 0.07635479420423508, 'learning_rate': 9.728643216080403e-05, 'epoch': 41.28}


 52%|█████▏    | 517/1000 [41:40<39:08,  4.86s/it]

{'loss': 0.0262, 'grad_norm': 0.09020820260047913, 'learning_rate': 9.70854271356784e-05, 'epoch': 41.36}


 52%|█████▏    | 518/1000 [41:45<38:59,  4.85s/it]

{'loss': 0.0274, 'grad_norm': 0.08780207484960556, 'learning_rate': 9.688442211055276e-05, 'epoch': 41.44}


 52%|█████▏    | 519/1000 [41:49<38:22,  4.79s/it]

{'loss': 0.0279, 'grad_norm': 0.08899388462305069, 'learning_rate': 9.668341708542715e-05, 'epoch': 41.52}


 52%|█████▏    | 520/1000 [41:54<37:54,  4.74s/it]

{'loss': 0.0293, 'grad_norm': 0.10204724222421646, 'learning_rate': 9.64824120603015e-05, 'epoch': 41.6}


 52%|█████▏    | 521/1000 [41:59<38:01,  4.76s/it]

{'loss': 0.0282, 'grad_norm': 0.08660423755645752, 'learning_rate': 9.628140703517589e-05, 'epoch': 41.68}


 52%|█████▏    | 522/1000 [42:04<38:34,  4.84s/it]

{'loss': 0.0283, 'grad_norm': 0.09462175518274307, 'learning_rate': 9.608040201005025e-05, 'epoch': 41.76}


 52%|█████▏    | 523/1000 [42:09<38:26,  4.83s/it]

{'loss': 0.0287, 'grad_norm': 0.09794726222753525, 'learning_rate': 9.587939698492462e-05, 'epoch': 41.84}


 52%|█████▏    | 524/1000 [42:14<38:20,  4.83s/it]

{'loss': 0.0284, 'grad_norm': 0.08590846508741379, 'learning_rate': 9.5678391959799e-05, 'epoch': 41.92}


 52%|█████▎    | 525/1000 [42:18<36:52,  4.66s/it]

{'loss': 0.0298, 'grad_norm': 0.0962192490696907, 'learning_rate': 9.547738693467337e-05, 'epoch': 42.0}


 53%|█████▎    | 526/1000 [42:23<37:11,  4.71s/it]

{'loss': 0.0259, 'grad_norm': 0.06926198303699493, 'learning_rate': 9.527638190954774e-05, 'epoch': 42.08}


 53%|█████▎    | 527/1000 [42:27<36:53,  4.68s/it]

{'loss': 0.0273, 'grad_norm': 0.08089897036552429, 'learning_rate': 9.507537688442212e-05, 'epoch': 42.16}


 53%|█████▎    | 528/1000 [42:32<37:07,  4.72s/it]

{'loss': 0.0259, 'grad_norm': 0.07221196591854095, 'learning_rate': 9.487437185929649e-05, 'epoch': 42.24}


 53%|█████▎    | 529/1000 [42:37<37:13,  4.74s/it]

{'loss': 0.0267, 'grad_norm': 0.09623003751039505, 'learning_rate': 9.467336683417086e-05, 'epoch': 42.32}


 53%|█████▎    | 530/1000 [42:42<37:51,  4.83s/it]

{'loss': 0.0259, 'grad_norm': 0.0711321085691452, 'learning_rate': 9.447236180904523e-05, 'epoch': 42.4}


 53%|█████▎    | 531/1000 [42:47<38:46,  4.96s/it]

{'loss': 0.0258, 'grad_norm': 0.0824466273188591, 'learning_rate': 9.427135678391961e-05, 'epoch': 42.48}


 53%|█████▎    | 532/1000 [42:52<38:27,  4.93s/it]

{'loss': 0.0284, 'grad_norm': 0.07506502419710159, 'learning_rate': 9.407035175879397e-05, 'epoch': 42.56}


 53%|█████▎    | 533/1000 [42:57<37:46,  4.85s/it]

{'loss': 0.0291, 'grad_norm': 0.1081027239561081, 'learning_rate': 9.386934673366835e-05, 'epoch': 42.64}


 53%|█████▎    | 534/1000 [43:02<37:43,  4.86s/it]

{'loss': 0.0283, 'grad_norm': 0.09112008661031723, 'learning_rate': 9.366834170854271e-05, 'epoch': 42.72}


 54%|█████▎    | 535/1000 [43:06<37:40,  4.86s/it]

{'loss': 0.0286, 'grad_norm': 0.09423236548900604, 'learning_rate': 9.34673366834171e-05, 'epoch': 42.8}


 54%|█████▎    | 536/1000 [43:11<37:38,  4.87s/it]

{'loss': 0.0285, 'grad_norm': 0.09142488986253738, 'learning_rate': 9.326633165829146e-05, 'epoch': 42.88}


 54%|█████▎    | 537/1000 [43:16<37:10,  4.82s/it]

{'loss': 0.0286, 'grad_norm': 0.08041027188301086, 'learning_rate': 9.306532663316585e-05, 'epoch': 42.96}


 54%|█████▍    | 538/1000 [43:20<35:51,  4.66s/it]

{'loss': 0.0283, 'grad_norm': 0.10344789177179337, 'learning_rate': 9.28643216080402e-05, 'epoch': 43.04}


 54%|█████▍    | 539/1000 [43:25<36:16,  4.72s/it]

{'loss': 0.0261, 'grad_norm': 0.07638609409332275, 'learning_rate': 9.266331658291458e-05, 'epoch': 43.12}


 54%|█████▍    | 540/1000 [43:30<37:02,  4.83s/it]

{'loss': 0.0255, 'grad_norm': 0.07445055991411209, 'learning_rate': 9.246231155778895e-05, 'epoch': 43.2}


 54%|█████▍    | 541/1000 [43:35<37:04,  4.85s/it]

{'loss': 0.027, 'grad_norm': 0.09354323148727417, 'learning_rate': 9.226130653266331e-05, 'epoch': 43.28}


 54%|█████▍    | 542/1000 [43:40<37:02,  4.85s/it]

{'loss': 0.0276, 'grad_norm': 0.10428044945001602, 'learning_rate': 9.20603015075377e-05, 'epoch': 43.36}


 54%|█████▍    | 543/1000 [43:45<36:27,  4.79s/it]

{'loss': 0.0276, 'grad_norm': 0.07904358208179474, 'learning_rate': 9.185929648241206e-05, 'epoch': 43.44}


 54%|█████▍    | 544/1000 [43:49<36:27,  4.80s/it]

{'loss': 0.0257, 'grad_norm': 0.07977806031703949, 'learning_rate': 9.165829145728644e-05, 'epoch': 43.52}


 55%|█████▍    | 545/1000 [43:54<35:59,  4.75s/it]

{'loss': 0.0285, 'grad_norm': 0.09541685879230499, 'learning_rate': 9.14572864321608e-05, 'epoch': 43.6}


 55%|█████▍    | 546/1000 [43:59<36:34,  4.83s/it]

{'loss': 0.0281, 'grad_norm': 0.08811189979314804, 'learning_rate': 9.125628140703519e-05, 'epoch': 43.68}


 55%|█████▍    | 547/1000 [44:04<36:59,  4.90s/it]

{'loss': 0.0281, 'grad_norm': 0.08459121733903885, 'learning_rate': 9.105527638190955e-05, 'epoch': 43.76}


 55%|█████▍    | 548/1000 [44:09<36:47,  4.88s/it]

{'loss': 0.0293, 'grad_norm': 0.09055968374013901, 'learning_rate': 9.085427135678392e-05, 'epoch': 43.84}


 55%|█████▍    | 549/1000 [44:14<37:01,  4.93s/it]

{'loss': 0.0287, 'grad_norm': 0.09318723529577255, 'learning_rate': 9.06532663316583e-05, 'epoch': 43.92}


 55%|█████▌    | 550/1000 [44:18<35:28,  4.73s/it]

{'loss': 0.0316, 'grad_norm': 0.12538102269172668, 'learning_rate': 9.045226130653267e-05, 'epoch': 44.0}


 55%|█████▌    | 551/1000 [44:23<35:13,  4.71s/it]

{'loss': 0.0258, 'grad_norm': 0.07703670859336853, 'learning_rate': 9.025125628140704e-05, 'epoch': 44.08}


 55%|█████▌    | 552/1000 [44:28<35:51,  4.80s/it]

{'loss': 0.0263, 'grad_norm': 0.07769482582807541, 'learning_rate': 9.005025125628141e-05, 'epoch': 44.16}


 55%|█████▌    | 553/1000 [44:33<35:51,  4.81s/it]

{'loss': 0.0262, 'grad_norm': 0.10095440596342087, 'learning_rate': 8.984924623115579e-05, 'epoch': 44.24}


 55%|█████▌    | 554/1000 [44:38<35:47,  4.81s/it]

{'loss': 0.0269, 'grad_norm': 0.08081147819757462, 'learning_rate': 8.964824120603016e-05, 'epoch': 44.32}


 56%|█████▌    | 555/1000 [44:42<35:19,  4.76s/it]

{'loss': 0.0274, 'grad_norm': 0.07840419560670853, 'learning_rate': 8.944723618090453e-05, 'epoch': 44.4}


 56%|█████▌    | 556/1000 [44:47<34:58,  4.73s/it]

{'loss': 0.027, 'grad_norm': 0.07856936752796173, 'learning_rate': 8.92462311557789e-05, 'epoch': 44.48}


 56%|█████▌    | 557/1000 [44:52<34:42,  4.70s/it]

{'loss': 0.0281, 'grad_norm': 0.08225074410438538, 'learning_rate': 8.904522613065326e-05, 'epoch': 44.56}


 56%|█████▌    | 558/1000 [44:56<34:56,  4.74s/it]

{'loss': 0.0276, 'grad_norm': 0.08981647342443466, 'learning_rate': 8.884422110552765e-05, 'epoch': 44.64}


 56%|█████▌    | 559/1000 [45:02<35:54,  4.89s/it]

{'loss': 0.028, 'grad_norm': 0.0953274592757225, 'learning_rate': 8.864321608040201e-05, 'epoch': 44.72}


 56%|█████▌    | 560/1000 [45:07<36:11,  4.93s/it]

{'loss': 0.028, 'grad_norm': 0.08864007145166397, 'learning_rate': 8.84422110552764e-05, 'epoch': 44.8}


 56%|█████▌    | 561/1000 [45:12<36:44,  5.02s/it]

{'loss': 0.0289, 'grad_norm': 0.08874253183603287, 'learning_rate': 8.824120603015076e-05, 'epoch': 44.88}


 56%|█████▌    | 562/1000 [45:17<35:50,  4.91s/it]

{'loss': 0.0299, 'grad_norm': 0.09653688967227936, 'learning_rate': 8.804020100502513e-05, 'epoch': 44.96}


 56%|█████▋    | 563/1000 [45:21<34:48,  4.78s/it]

{'loss': 0.0281, 'grad_norm': 0.08823484927415848, 'learning_rate': 8.78391959798995e-05, 'epoch': 45.04}


 56%|█████▋    | 564/1000 [45:26<35:16,  4.85s/it]

{'loss': 0.0264, 'grad_norm': 0.0737023651599884, 'learning_rate': 8.763819095477387e-05, 'epoch': 45.12}


 56%|█████▋    | 565/1000 [45:31<35:10,  4.85s/it]

{'loss': 0.0264, 'grad_norm': 0.08933086693286896, 'learning_rate': 8.743718592964825e-05, 'epoch': 45.2}


 57%|█████▋    | 566/1000 [45:36<35:03,  4.85s/it]

{'loss': 0.0269, 'grad_norm': 0.08483902364969254, 'learning_rate': 8.723618090452261e-05, 'epoch': 45.28}


 57%|█████▋    | 567/1000 [45:41<35:46,  4.96s/it]

{'loss': 0.0255, 'grad_norm': 0.07897672057151794, 'learning_rate': 8.7035175879397e-05, 'epoch': 45.36}


 57%|█████▋    | 568/1000 [45:46<35:02,  4.87s/it]

{'loss': 0.0283, 'grad_norm': 0.0940614566206932, 'learning_rate': 8.683417085427135e-05, 'epoch': 45.44}


 57%|█████▋    | 569/1000 [45:50<34:30,  4.80s/it]

{'loss': 0.0283, 'grad_norm': 0.08834408968687057, 'learning_rate': 8.663316582914574e-05, 'epoch': 45.52}


 57%|█████▋    | 570/1000 [45:55<34:56,  4.88s/it]

{'loss': 0.0283, 'grad_norm': 0.08597513288259506, 'learning_rate': 8.64321608040201e-05, 'epoch': 45.6}


 57%|█████▋    | 571/1000 [46:00<34:20,  4.80s/it]

{'loss': 0.0279, 'grad_norm': 0.08224879950284958, 'learning_rate': 8.623115577889449e-05, 'epoch': 45.68}


 57%|█████▋    | 572/1000 [46:05<34:19,  4.81s/it]

{'loss': 0.0272, 'grad_norm': 0.09006752073764801, 'learning_rate': 8.603015075376884e-05, 'epoch': 45.76}


 57%|█████▋    | 573/1000 [46:10<34:20,  4.83s/it]

{'loss': 0.0274, 'grad_norm': 0.07558371126651764, 'learning_rate': 8.582914572864322e-05, 'epoch': 45.84}


 57%|█████▋    | 574/1000 [46:14<34:16,  4.83s/it]

{'loss': 0.0306, 'grad_norm': 0.09495433419942856, 'learning_rate': 8.562814070351759e-05, 'epoch': 45.92}


 57%|█████▊    | 575/1000 [46:19<33:24,  4.72s/it]

{'loss': 0.0288, 'grad_norm': 0.09853705763816833, 'learning_rate': 8.542713567839196e-05, 'epoch': 46.0}


 58%|█████▊    | 576/1000 [46:24<33:59,  4.81s/it]

{'loss': 0.0269, 'grad_norm': 0.08613362908363342, 'learning_rate': 8.522613065326634e-05, 'epoch': 46.08}


 58%|█████▊    | 577/1000 [46:29<34:50,  4.94s/it]

{'loss': 0.0264, 'grad_norm': 0.0763755813241005, 'learning_rate': 8.502512562814071e-05, 'epoch': 46.16}


 58%|█████▊    | 578/1000 [46:34<34:37,  4.92s/it]

{'loss': 0.0255, 'grad_norm': 0.07327358424663544, 'learning_rate': 8.482412060301508e-05, 'epoch': 46.24}


 58%|█████▊    | 579/1000 [46:39<34:05,  4.86s/it]

{'loss': 0.0271, 'grad_norm': 0.06886028498411179, 'learning_rate': 8.462311557788946e-05, 'epoch': 46.32}


 58%|█████▊    | 580/1000 [46:44<34:04,  4.87s/it]

{'loss': 0.0269, 'grad_norm': 0.08238829672336578, 'learning_rate': 8.442211055276383e-05, 'epoch': 46.4}


 58%|█████▊    | 581/1000 [46:49<34:26,  4.93s/it]

{'loss': 0.0264, 'grad_norm': 0.0703374594449997, 'learning_rate': 8.42211055276382e-05, 'epoch': 46.48}


 58%|█████▊    | 582/1000 [46:53<33:48,  4.85s/it]

{'loss': 0.0283, 'grad_norm': 0.09174507111310959, 'learning_rate': 8.402010050251256e-05, 'epoch': 46.56}


 58%|█████▊    | 583/1000 [46:58<33:49,  4.87s/it]

{'loss': 0.0285, 'grad_norm': 0.0816807672381401, 'learning_rate': 8.381909547738695e-05, 'epoch': 46.64}


 58%|█████▊    | 584/1000 [47:03<33:22,  4.81s/it]

{'loss': 0.0277, 'grad_norm': 0.07643742859363556, 'learning_rate': 8.36180904522613e-05, 'epoch': 46.72}


 58%|█████▊    | 585/1000 [47:08<33:02,  4.78s/it]

{'loss': 0.0286, 'grad_norm': 0.08649134635925293, 'learning_rate': 8.341708542713568e-05, 'epoch': 46.8}


 59%|█████▊    | 586/1000 [47:13<33:14,  4.82s/it]

{'loss': 0.0273, 'grad_norm': 0.07964412868022919, 'learning_rate': 8.321608040201005e-05, 'epoch': 46.88}


 59%|█████▊    | 587/1000 [47:18<33:19,  4.84s/it]

{'loss': 0.029, 'grad_norm': 0.08933480829000473, 'learning_rate': 8.301507537688443e-05, 'epoch': 46.96}


 59%|█████▉    | 588/1000 [47:22<32:31,  4.74s/it]

{'loss': 0.0281, 'grad_norm': 0.10185857117176056, 'learning_rate': 8.28140703517588e-05, 'epoch': 47.04}


 59%|█████▉    | 589/1000 [47:27<32:42,  4.78s/it]

{'loss': 0.0259, 'grad_norm': 0.07760775089263916, 'learning_rate': 8.261306532663317e-05, 'epoch': 47.12}


 59%|█████▉    | 590/1000 [47:32<33:37,  4.92s/it]

{'loss': 0.0246, 'grad_norm': 0.06718537211418152, 'learning_rate': 8.241206030150754e-05, 'epoch': 47.2}


 59%|█████▉    | 591/1000 [47:37<33:24,  4.90s/it]

{'loss': 0.0272, 'grad_norm': 0.0862804427742958, 'learning_rate': 8.22110552763819e-05, 'epoch': 47.28}


 59%|█████▉    | 592/1000 [47:42<33:16,  4.89s/it]

{'loss': 0.0271, 'grad_norm': 0.0860576406121254, 'learning_rate': 8.201005025125629e-05, 'epoch': 47.36}


 59%|█████▉    | 593/1000 [47:47<33:09,  4.89s/it]

{'loss': 0.0266, 'grad_norm': 0.08327941596508026, 'learning_rate': 8.180904522613065e-05, 'epoch': 47.44}


 59%|█████▉    | 594/1000 [47:51<32:39,  4.83s/it]

{'loss': 0.0282, 'grad_norm': 0.09246280044317245, 'learning_rate': 8.160804020100504e-05, 'epoch': 47.52}


 60%|█████▉    | 595/1000 [47:56<32:35,  4.83s/it]

{'loss': 0.0265, 'grad_norm': 0.07947865128517151, 'learning_rate': 8.14070351758794e-05, 'epoch': 47.6}


 60%|█████▉    | 596/1000 [48:01<32:07,  4.77s/it]

{'loss': 0.0287, 'grad_norm': 0.08757788687944412, 'learning_rate': 8.120603015075378e-05, 'epoch': 47.68}


 60%|█████▉    | 597/1000 [48:06<32:32,  4.84s/it]

{'loss': 0.0278, 'grad_norm': 0.0858871191740036, 'learning_rate': 8.100502512562814e-05, 'epoch': 47.76}


 60%|█████▉    | 598/1000 [48:11<32:25,  4.84s/it]

{'loss': 0.0292, 'grad_norm': 0.09459006786346436, 'learning_rate': 8.080402010050251e-05, 'epoch': 47.84}


 60%|█████▉    | 599/1000 [48:16<32:19,  4.84s/it]

{'loss': 0.0284, 'grad_norm': 0.09042152762413025, 'learning_rate': 8.060301507537689e-05, 'epoch': 47.92}


 60%|██████    | 600/1000 [48:20<31:29,  4.72s/it]

{'loss': 0.0288, 'grad_norm': 0.09156813472509384, 'learning_rate': 8.040201005025126e-05, 'epoch': 48.0}


 60%|██████    | 601/1000 [48:25<31:37,  4.76s/it]

{'loss': 0.0264, 'grad_norm': 0.07428502291440964, 'learning_rate': 8.020100502512563e-05, 'epoch': 48.08}


 60%|██████    | 602/1000 [48:30<31:40,  4.78s/it]

{'loss': 0.0267, 'grad_norm': 0.08387087285518646, 'learning_rate': 8e-05, 'epoch': 48.16}


 60%|██████    | 603/1000 [48:34<31:19,  4.73s/it]

{'loss': 0.0258, 'grad_norm': 0.08693841844797134, 'learning_rate': 7.979899497487438e-05, 'epoch': 48.24}


 60%|██████    | 604/1000 [48:39<31:24,  4.76s/it]

{'loss': 0.026, 'grad_norm': 0.09811858087778091, 'learning_rate': 7.959798994974875e-05, 'epoch': 48.32}


 60%|██████    | 605/1000 [48:44<31:27,  4.78s/it]

{'loss': 0.0272, 'grad_norm': 0.08602199703454971, 'learning_rate': 7.939698492462313e-05, 'epoch': 48.4}


 61%|██████    | 606/1000 [48:49<31:29,  4.80s/it]

{'loss': 0.0277, 'grad_norm': 0.09601356089115143, 'learning_rate': 7.91959798994975e-05, 'epoch': 48.48}


 61%|██████    | 607/1000 [48:54<31:28,  4.80s/it]

{'loss': 0.029, 'grad_norm': 0.09317369014024734, 'learning_rate': 7.899497487437186e-05, 'epoch': 48.56}


 61%|██████    | 608/1000 [48:58<31:04,  4.76s/it]

{'loss': 0.0288, 'grad_norm': 0.0916232094168663, 'learning_rate': 7.879396984924623e-05, 'epoch': 48.64}


 61%|██████    | 609/1000 [49:03<31:08,  4.78s/it]

{'loss': 0.0289, 'grad_norm': 0.09766831994056702, 'learning_rate': 7.85929648241206e-05, 'epoch': 48.72}


 61%|██████    | 610/1000 [49:08<30:46,  4.73s/it]

{'loss': 0.0286, 'grad_norm': 0.08816070854663849, 'learning_rate': 7.839195979899498e-05, 'epoch': 48.8}


 61%|██████    | 611/1000 [49:13<30:53,  4.77s/it]

{'loss': 0.0279, 'grad_norm': 0.0892992839217186, 'learning_rate': 7.819095477386935e-05, 'epoch': 48.88}


 61%|██████    | 612/1000 [49:18<31:38,  4.89s/it]

{'loss': 0.0277, 'grad_norm': 0.08436934649944305, 'learning_rate': 7.798994974874372e-05, 'epoch': 48.96}


 61%|██████▏   | 613/1000 [49:22<30:49,  4.78s/it]

{'loss': 0.0266, 'grad_norm': 0.09142271429300308, 'learning_rate': 7.77889447236181e-05, 'epoch': 49.04}


 61%|██████▏   | 614/1000 [49:27<30:48,  4.79s/it]

{'loss': 0.0265, 'grad_norm': 0.07466612756252289, 'learning_rate': 7.758793969849247e-05, 'epoch': 49.12}


 62%|██████▏   | 615/1000 [49:32<30:45,  4.79s/it]

{'loss': 0.0262, 'grad_norm': 0.07774994522333145, 'learning_rate': 7.738693467336684e-05, 'epoch': 49.2}


 62%|██████▏   | 616/1000 [49:37<31:04,  4.86s/it]

{'loss': 0.0267, 'grad_norm': 0.07605842500925064, 'learning_rate': 7.71859296482412e-05, 'epoch': 49.28}


 62%|██████▏   | 617/1000 [49:42<30:38,  4.80s/it]

{'loss': 0.0273, 'grad_norm': 0.08297310769557953, 'learning_rate': 7.698492462311559e-05, 'epoch': 49.36}


 62%|██████▏   | 618/1000 [49:47<31:04,  4.88s/it]

{'loss': 0.0257, 'grad_norm': 0.08437880128622055, 'learning_rate': 7.678391959798995e-05, 'epoch': 49.44}


 62%|██████▏   | 619/1000 [49:51<30:34,  4.81s/it]

{'loss': 0.0273, 'grad_norm': 0.08341627568006516, 'learning_rate': 7.658291457286433e-05, 'epoch': 49.52}


 62%|██████▏   | 620/1000 [49:56<30:35,  4.83s/it]

{'loss': 0.027, 'grad_norm': 0.08357790857553482, 'learning_rate': 7.638190954773869e-05, 'epoch': 49.6}


 62%|██████▏   | 621/1000 [50:01<30:11,  4.78s/it]

{'loss': 0.0281, 'grad_norm': 0.08664185553789139, 'learning_rate': 7.618090452261307e-05, 'epoch': 49.68}


 62%|██████▏   | 622/1000 [50:06<29:54,  4.75s/it]

{'loss': 0.0277, 'grad_norm': 0.08790931850671768, 'learning_rate': 7.597989949748744e-05, 'epoch': 49.76}


 62%|██████▏   | 623/1000 [50:10<30:04,  4.79s/it]

{'loss': 0.0272, 'grad_norm': 0.07719863951206207, 'learning_rate': 7.577889447236181e-05, 'epoch': 49.84}


 62%|██████▏   | 624/1000 [50:16<30:53,  4.93s/it]

{'loss': 0.0286, 'grad_norm': 0.09244117140769958, 'learning_rate': 7.557788944723618e-05, 'epoch': 49.92}


 62%|██████▎   | 625/1000 [50:20<30:28,  4.87s/it]

{'loss': 0.029, 'grad_norm': 0.1264883130788803, 'learning_rate': 7.537688442211056e-05, 'epoch': 50.0}


 63%|██████▎   | 626/1000 [50:25<30:24,  4.88s/it]

{'loss': 0.0262, 'grad_norm': 0.07987987995147705, 'learning_rate': 7.517587939698493e-05, 'epoch': 50.08}


 63%|██████▎   | 627/1000 [50:30<29:56,  4.82s/it]

{'loss': 0.0263, 'grad_norm': 0.06996889412403107, 'learning_rate': 7.49748743718593e-05, 'epoch': 50.16}


 63%|██████▎   | 628/1000 [50:35<29:58,  4.83s/it]

{'loss': 0.026, 'grad_norm': 0.08762777596712112, 'learning_rate': 7.477386934673368e-05, 'epoch': 50.24}


 63%|██████▎   | 629/1000 [50:39<29:32,  4.78s/it]

{'loss': 0.0278, 'grad_norm': 0.08189623802900314, 'learning_rate': 7.457286432160805e-05, 'epoch': 50.32}


 63%|██████▎   | 630/1000 [50:44<29:52,  4.84s/it]

{'loss': 0.026, 'grad_norm': 0.07754424959421158, 'learning_rate': 7.437185929648241e-05, 'epoch': 50.4}


 63%|██████▎   | 631/1000 [50:49<30:05,  4.89s/it]

{'loss': 0.0263, 'grad_norm': 0.08048499375581741, 'learning_rate': 7.417085427135678e-05, 'epoch': 50.48}


 63%|██████▎   | 632/1000 [50:54<29:51,  4.87s/it]

{'loss': 0.0271, 'grad_norm': 0.08070939779281616, 'learning_rate': 7.396984924623115e-05, 'epoch': 50.56}


 63%|██████▎   | 633/1000 [50:59<29:21,  4.80s/it]

{'loss': 0.0274, 'grad_norm': 0.08289593458175659, 'learning_rate': 7.376884422110553e-05, 'epoch': 50.64}


 63%|██████▎   | 634/1000 [51:04<29:21,  4.81s/it]

{'loss': 0.0282, 'grad_norm': 0.08896952122449875, 'learning_rate': 7.35678391959799e-05, 'epoch': 50.72}


 64%|██████▎   | 635/1000 [51:09<30:00,  4.93s/it]

{'loss': 0.0272, 'grad_norm': 0.08961840718984604, 'learning_rate': 7.336683417085427e-05, 'epoch': 50.8}


 64%|██████▎   | 636/1000 [51:14<29:46,  4.91s/it]

{'loss': 0.0289, 'grad_norm': 0.08865194767713547, 'learning_rate': 7.316582914572865e-05, 'epoch': 50.88}


 64%|██████▎   | 637/1000 [51:19<29:34,  4.89s/it]

{'loss': 0.029, 'grad_norm': 0.09023966640233994, 'learning_rate': 7.296482412060302e-05, 'epoch': 50.96}


 64%|██████▍   | 638/1000 [51:23<28:50,  4.78s/it]

{'loss': 0.0256, 'grad_norm': 0.10049901157617569, 'learning_rate': 7.276381909547739e-05, 'epoch': 51.04}


 64%|██████▍   | 639/1000 [51:28<28:31,  4.74s/it]

{'loss': 0.0271, 'grad_norm': 0.08721952140331268, 'learning_rate': 7.256281407035177e-05, 'epoch': 51.12}


 64%|██████▍   | 640/1000 [51:33<28:16,  4.71s/it]

{'loss': 0.0276, 'grad_norm': 0.09166061878204346, 'learning_rate': 7.236180904522614e-05, 'epoch': 51.2}


 64%|██████▍   | 641/1000 [51:38<28:47,  4.81s/it]

{'loss': 0.0265, 'grad_norm': 0.08337155729532242, 'learning_rate': 7.21608040201005e-05, 'epoch': 51.28}


 64%|██████▍   | 642/1000 [51:43<29:06,  4.88s/it]

{'loss': 0.0263, 'grad_norm': 0.0777818113565445, 'learning_rate': 7.195979899497488e-05, 'epoch': 51.36}


 64%|██████▍   | 643/1000 [51:47<28:58,  4.87s/it]

{'loss': 0.0276, 'grad_norm': 0.08871444314718246, 'learning_rate': 7.175879396984924e-05, 'epoch': 51.44}


 64%|██████▍   | 644/1000 [51:52<29:10,  4.92s/it]

{'loss': 0.0255, 'grad_norm': 0.0732104554772377, 'learning_rate': 7.155778894472363e-05, 'epoch': 51.52}


 64%|██████▍   | 645/1000 [51:57<28:35,  4.83s/it]

{'loss': 0.0279, 'grad_norm': 0.08699209988117218, 'learning_rate': 7.135678391959799e-05, 'epoch': 51.6}


 65%|██████▍   | 646/1000 [52:02<28:11,  4.78s/it]

{'loss': 0.0276, 'grad_norm': 0.08222974836826324, 'learning_rate': 7.115577889447236e-05, 'epoch': 51.68}


 65%|██████▍   | 647/1000 [52:07<28:34,  4.86s/it]

{'loss': 0.027, 'grad_norm': 0.07596154510974884, 'learning_rate': 7.095477386934674e-05, 'epoch': 51.76}


 65%|██████▍   | 648/1000 [52:11<28:08,  4.80s/it]

{'loss': 0.0285, 'grad_norm': 0.0885407105088234, 'learning_rate': 7.075376884422111e-05, 'epoch': 51.84}


 65%|██████▍   | 649/1000 [52:16<28:29,  4.87s/it]

{'loss': 0.0272, 'grad_norm': 0.08294857293367386, 'learning_rate': 7.055276381909548e-05, 'epoch': 51.92}


 65%|██████▌   | 650/1000 [52:21<27:18,  4.68s/it]

{'loss': 0.029, 'grad_norm': 0.09725300222635269, 'learning_rate': 7.035175879396985e-05, 'epoch': 52.0}


 65%|██████▌   | 651/1000 [52:25<27:07,  4.66s/it]

{'loss': 0.0254, 'grad_norm': 0.08788154274225235, 'learning_rate': 7.015075376884423e-05, 'epoch': 52.08}


 65%|██████▌   | 652/1000 [52:31<28:00,  4.83s/it]

{'loss': 0.0253, 'grad_norm': 0.07499640434980392, 'learning_rate': 6.99497487437186e-05, 'epoch': 52.16}


 65%|██████▌   | 653/1000 [52:35<27:37,  4.78s/it]

{'loss': 0.027, 'grad_norm': 0.08651570975780487, 'learning_rate': 6.974874371859297e-05, 'epoch': 52.24}


 65%|██████▌   | 654/1000 [52:40<27:18,  4.74s/it]

{'loss': 0.0272, 'grad_norm': 0.0747026577591896, 'learning_rate': 6.954773869346733e-05, 'epoch': 52.32}


 66%|██████▌   | 655/1000 [52:45<27:23,  4.76s/it]

{'loss': 0.0262, 'grad_norm': 0.07994001358747482, 'learning_rate': 6.93467336683417e-05, 'epoch': 52.4}


 66%|██████▌   | 656/1000 [52:50<27:47,  4.85s/it]

{'loss': 0.0263, 'grad_norm': 0.08267880231142044, 'learning_rate': 6.914572864321608e-05, 'epoch': 52.48}


 66%|██████▌   | 657/1000 [52:54<27:22,  4.79s/it]

{'loss': 0.0272, 'grad_norm': 0.08905181288719177, 'learning_rate': 6.894472361809045e-05, 'epoch': 52.56}


 66%|██████▌   | 658/1000 [52:59<27:42,  4.86s/it]

{'loss': 0.027, 'grad_norm': 0.08512484282255173, 'learning_rate': 6.874371859296482e-05, 'epoch': 52.64}


 66%|██████▌   | 659/1000 [53:04<27:54,  4.91s/it]

{'loss': 0.0281, 'grad_norm': 0.08862330764532089, 'learning_rate': 6.85427135678392e-05, 'epoch': 52.72}


 66%|██████▌   | 660/1000 [53:09<27:40,  4.88s/it]

{'loss': 0.0273, 'grad_norm': 0.0828033983707428, 'learning_rate': 6.834170854271357e-05, 'epoch': 52.8}


 66%|██████▌   | 661/1000 [53:14<27:31,  4.87s/it]

{'loss': 0.0276, 'grad_norm': 0.08519186824560165, 'learning_rate': 6.814070351758794e-05, 'epoch': 52.88}


 66%|██████▌   | 662/1000 [53:19<27:22,  4.86s/it]

{'loss': 0.0295, 'grad_norm': 0.09652283787727356, 'learning_rate': 6.793969849246232e-05, 'epoch': 52.96}


 66%|██████▋   | 663/1000 [53:23<26:36,  4.74s/it]

{'loss': 0.028, 'grad_norm': 0.09616352617740631, 'learning_rate': 6.773869346733669e-05, 'epoch': 53.04}


 66%|██████▋   | 664/1000 [53:28<26:46,  4.78s/it]

{'loss': 0.0258, 'grad_norm': 0.08545398712158203, 'learning_rate': 6.753768844221105e-05, 'epoch': 53.12}


 66%|██████▋   | 665/1000 [53:33<27:12,  4.87s/it]

{'loss': 0.0253, 'grad_norm': 0.0824059396982193, 'learning_rate': 6.733668341708544e-05, 'epoch': 53.2}


 67%|██████▋   | 666/1000 [53:38<26:47,  4.81s/it]

{'loss': 0.0269, 'grad_norm': 0.07856431603431702, 'learning_rate': 6.71356783919598e-05, 'epoch': 53.28}


 67%|██████▋   | 667/1000 [53:43<26:48,  4.83s/it]

{'loss': 0.0276, 'grad_norm': 0.09077055007219315, 'learning_rate': 6.693467336683418e-05, 'epoch': 53.36}


 67%|██████▋   | 668/1000 [53:48<27:08,  4.90s/it]

{'loss': 0.0261, 'grad_norm': 0.07692558318376541, 'learning_rate': 6.673366834170854e-05, 'epoch': 53.44}


 67%|██████▋   | 669/1000 [53:53<27:19,  4.95s/it]

{'loss': 0.0276, 'grad_norm': 0.08167383074760437, 'learning_rate': 6.653266331658293e-05, 'epoch': 53.52}


 67%|██████▋   | 670/1000 [53:58<26:48,  4.88s/it]

{'loss': 0.027, 'grad_norm': 0.07634729892015457, 'learning_rate': 6.633165829145729e-05, 'epoch': 53.6}


 67%|██████▋   | 671/1000 [54:03<26:41,  4.87s/it]

{'loss': 0.0278, 'grad_norm': 0.08154834806919098, 'learning_rate': 6.613065326633166e-05, 'epoch': 53.68}


 67%|██████▋   | 672/1000 [54:07<26:36,  4.87s/it]

{'loss': 0.0273, 'grad_norm': 0.08415846526622772, 'learning_rate': 6.592964824120603e-05, 'epoch': 53.76}


 67%|██████▋   | 673/1000 [54:13<26:49,  4.92s/it]

{'loss': 0.0272, 'grad_norm': 0.08466830104589462, 'learning_rate': 6.57286432160804e-05, 'epoch': 53.84}


 67%|██████▋   | 674/1000 [54:17<26:36,  4.90s/it]

{'loss': 0.0279, 'grad_norm': 0.08534043282270432, 'learning_rate': 6.552763819095478e-05, 'epoch': 53.92}


 68%|██████▊   | 675/1000 [54:22<25:30,  4.71s/it]

{'loss': 0.029, 'grad_norm': 0.10426628589630127, 'learning_rate': 6.532663316582915e-05, 'epoch': 54.0}


 68%|██████▊   | 676/1000 [54:26<25:39,  4.75s/it]

{'loss': 0.0264, 'grad_norm': 0.09457345306873322, 'learning_rate': 6.512562814070352e-05, 'epoch': 54.08}


 68%|██████▊   | 677/1000 [54:31<25:45,  4.79s/it]

{'loss': 0.0266, 'grad_norm': 0.07763588428497314, 'learning_rate': 6.492462311557788e-05, 'epoch': 54.16}


 68%|██████▊   | 678/1000 [54:36<25:48,  4.81s/it]

{'loss': 0.0264, 'grad_norm': 0.09900709986686707, 'learning_rate': 6.472361809045227e-05, 'epoch': 54.24}


 68%|██████▊   | 679/1000 [54:41<25:29,  4.77s/it]

{'loss': 0.0262, 'grad_norm': 0.06854093074798584, 'learning_rate': 6.452261306532663e-05, 'epoch': 54.32}


 68%|██████▊   | 680/1000 [54:46<25:32,  4.79s/it]

{'loss': 0.0265, 'grad_norm': 0.07558808475732803, 'learning_rate': 6.4321608040201e-05, 'epoch': 54.4}


 68%|██████▊   | 681/1000 [54:51<25:30,  4.80s/it]

{'loss': 0.0275, 'grad_norm': 0.08501683175563812, 'learning_rate': 6.412060301507538e-05, 'epoch': 54.48}


 68%|██████▊   | 682/1000 [54:56<25:45,  4.86s/it]

{'loss': 0.0256, 'grad_norm': 0.07170802354812622, 'learning_rate': 6.391959798994975e-05, 'epoch': 54.56}


 68%|██████▊   | 683/1000 [55:00<25:19,  4.79s/it]

{'loss': 0.0285, 'grad_norm': 0.08685165643692017, 'learning_rate': 6.371859296482412e-05, 'epoch': 54.64}


 68%|██████▊   | 684/1000 [55:05<25:35,  4.86s/it]

{'loss': 0.0287, 'grad_norm': 0.09201207011938095, 'learning_rate': 6.35175879396985e-05, 'epoch': 54.72}


 68%|██████▊   | 685/1000 [55:10<25:28,  4.85s/it]

{'loss': 0.0282, 'grad_norm': 0.09134954959154129, 'learning_rate': 6.331658291457287e-05, 'epoch': 54.8}


 69%|██████▊   | 686/1000 [55:15<25:39,  4.90s/it]

{'loss': 0.0265, 'grad_norm': 0.07623406499624252, 'learning_rate': 6.311557788944724e-05, 'epoch': 54.88}


 69%|██████▊   | 687/1000 [55:20<25:27,  4.88s/it]

{'loss': 0.0277, 'grad_norm': 0.08353937417268753, 'learning_rate': 6.291457286432161e-05, 'epoch': 54.96}


 69%|██████▉   | 688/1000 [55:24<24:41,  4.75s/it]

{'loss': 0.027, 'grad_norm': 0.12208277732133865, 'learning_rate': 6.271356783919599e-05, 'epoch': 55.04}


 69%|██████▉   | 689/1000 [55:29<25:02,  4.83s/it]

{'loss': 0.0259, 'grad_norm': 0.0771937221288681, 'learning_rate': 6.251256281407035e-05, 'epoch': 55.12}


 69%|██████▉   | 690/1000 [55:35<25:50,  5.00s/it]

{'loss': 0.0254, 'grad_norm': 0.07602883875370026, 'learning_rate': 6.231155778894473e-05, 'epoch': 55.2}


 69%|██████▉   | 691/1000 [55:40<25:46,  5.00s/it]

{'loss': 0.0268, 'grad_norm': 0.0896281823515892, 'learning_rate': 6.211055276381909e-05, 'epoch': 55.28}


 69%|██████▉   | 692/1000 [55:44<25:06,  4.89s/it]

{'loss': 0.0271, 'grad_norm': 0.07912282645702362, 'learning_rate': 6.190954773869348e-05, 'epoch': 55.36}


 69%|██████▉   | 693/1000 [55:49<25:12,  4.93s/it]

{'loss': 0.0271, 'grad_norm': 0.08566045761108398, 'learning_rate': 6.170854271356784e-05, 'epoch': 55.44}


 69%|██████▉   | 694/1000 [55:54<24:59,  4.90s/it]

{'loss': 0.0265, 'grad_norm': 0.09178169816732407, 'learning_rate': 6.150753768844222e-05, 'epoch': 55.52}


 70%|██████▉   | 695/1000 [55:59<24:30,  4.82s/it]

{'loss': 0.0264, 'grad_norm': 0.07221726328134537, 'learning_rate': 6.130653266331658e-05, 'epoch': 55.6}


 70%|██████▉   | 696/1000 [56:04<24:11,  4.77s/it]

{'loss': 0.0275, 'grad_norm': 0.08537986874580383, 'learning_rate': 6.110552763819096e-05, 'epoch': 55.68}


 70%|██████▉   | 697/1000 [56:08<23:54,  4.73s/it]

{'loss': 0.0275, 'grad_norm': 0.08685941994190216, 'learning_rate': 6.090452261306533e-05, 'epoch': 55.76}


 70%|██████▉   | 698/1000 [56:13<23:42,  4.71s/it]

{'loss': 0.0271, 'grad_norm': 0.0866466611623764, 'learning_rate': 6.070351758793971e-05, 'epoch': 55.84}


 70%|██████▉   | 699/1000 [56:18<24:03,  4.80s/it]

{'loss': 0.0275, 'grad_norm': 0.08192128688097, 'learning_rate': 6.0502512562814076e-05, 'epoch': 55.92}


 70%|███████   | 700/1000 [56:22<23:26,  4.69s/it]

{'loss': 0.0287, 'grad_norm': 0.09514519572257996, 'learning_rate': 6.030150753768844e-05, 'epoch': 56.0}


 70%|███████   | 701/1000 [56:27<23:54,  4.80s/it]

{'loss': 0.0253, 'grad_norm': 0.07655683159828186, 'learning_rate': 6.0100502512562815e-05, 'epoch': 56.08}


 70%|███████   | 702/1000 [56:32<23:53,  4.81s/it]

{'loss': 0.0264, 'grad_norm': 0.08256499469280243, 'learning_rate': 5.989949748743718e-05, 'epoch': 56.16}


 70%|███████   | 703/1000 [56:37<24:06,  4.87s/it]

{'loss': 0.0253, 'grad_norm': 0.09534397721290588, 'learning_rate': 5.969849246231156e-05, 'epoch': 56.24}


 70%|███████   | 704/1000 [56:42<23:59,  4.86s/it]

{'loss': 0.0269, 'grad_norm': 0.08232790976762772, 'learning_rate': 5.949748743718593e-05, 'epoch': 56.32}


 70%|███████   | 705/1000 [56:47<23:36,  4.80s/it]

{'loss': 0.0265, 'grad_norm': 0.085630863904953, 'learning_rate': 5.929648241206031e-05, 'epoch': 56.4}


 71%|███████   | 706/1000 [56:51<23:34,  4.81s/it]

{'loss': 0.0262, 'grad_norm': 0.07942486554384232, 'learning_rate': 5.909547738693467e-05, 'epoch': 56.48}


 71%|███████   | 707/1000 [56:57<23:50,  4.88s/it]

{'loss': 0.0265, 'grad_norm': 0.08047153055667877, 'learning_rate': 5.889447236180905e-05, 'epoch': 56.56}


 71%|███████   | 708/1000 [57:02<23:59,  4.93s/it]

{'loss': 0.0266, 'grad_norm': 0.08374108374118805, 'learning_rate': 5.869346733668342e-05, 'epoch': 56.64}


 71%|███████   | 709/1000 [57:06<23:29,  4.84s/it]

{'loss': 0.0282, 'grad_norm': 0.0943799540400505, 'learning_rate': 5.849246231155779e-05, 'epoch': 56.72}


 71%|███████   | 710/1000 [57:11<23:09,  4.79s/it]

{'loss': 0.0275, 'grad_norm': 0.08450678735971451, 'learning_rate': 5.829145728643216e-05, 'epoch': 56.8}


 71%|███████   | 711/1000 [57:16<23:09,  4.81s/it]

{'loss': 0.0271, 'grad_norm': 0.08298242092132568, 'learning_rate': 5.809045226130654e-05, 'epoch': 56.88}


 71%|███████   | 712/1000 [57:20<22:51,  4.76s/it]

{'loss': 0.0297, 'grad_norm': 0.09103117138147354, 'learning_rate': 5.7889447236180904e-05, 'epoch': 56.96}


 71%|███████▏  | 713/1000 [57:25<23:03,  4.82s/it]

{'loss': 0.0257, 'grad_norm': 0.09403929859399796, 'learning_rate': 5.7688442211055284e-05, 'epoch': 57.04}


 71%|███████▏  | 714/1000 [57:30<23:03,  4.84s/it]

{'loss': 0.0262, 'grad_norm': 0.0876711755990982, 'learning_rate': 5.748743718592965e-05, 'epoch': 57.12}


 72%|███████▏  | 715/1000 [57:35<23:18,  4.91s/it]

{'loss': 0.0257, 'grad_norm': 0.0771050900220871, 'learning_rate': 5.728643216080403e-05, 'epoch': 57.2}


 72%|███████▏  | 716/1000 [57:40<23:04,  4.88s/it]

{'loss': 0.0267, 'grad_norm': 0.07980244606733322, 'learning_rate': 5.7085427135678396e-05, 'epoch': 57.28}


 72%|███████▏  | 717/1000 [57:45<22:39,  4.80s/it]

{'loss': 0.0267, 'grad_norm': 0.09333065897226334, 'learning_rate': 5.688442211055277e-05, 'epoch': 57.36}


 72%|███████▏  | 718/1000 [57:50<22:37,  4.81s/it]

{'loss': 0.0269, 'grad_norm': 0.07975593954324722, 'learning_rate': 5.6683417085427135e-05, 'epoch': 57.44}


 72%|███████▏  | 719/1000 [57:54<22:17,  4.76s/it]

{'loss': 0.0269, 'grad_norm': 0.09695513546466827, 'learning_rate': 5.6482412060301515e-05, 'epoch': 57.52}


 72%|███████▏  | 720/1000 [57:59<22:03,  4.73s/it]

{'loss': 0.0269, 'grad_norm': 0.081850066781044, 'learning_rate': 5.628140703517588e-05, 'epoch': 57.6}


 72%|███████▏  | 721/1000 [58:03<21:51,  4.70s/it]

{'loss': 0.0274, 'grad_norm': 0.08507071435451508, 'learning_rate': 5.608040201005026e-05, 'epoch': 57.68}


 72%|███████▏  | 722/1000 [58:08<21:57,  4.74s/it]

{'loss': 0.0281, 'grad_norm': 0.10116565972566605, 'learning_rate': 5.587939698492463e-05, 'epoch': 57.76}


 72%|███████▏  | 723/1000 [58:13<22:01,  4.77s/it]

{'loss': 0.0274, 'grad_norm': 0.08893828094005585, 'learning_rate': 5.567839195979899e-05, 'epoch': 57.84}


 72%|███████▏  | 724/1000 [58:18<22:17,  4.85s/it]

{'loss': 0.0276, 'grad_norm': 0.07844885438680649, 'learning_rate': 5.547738693467337e-05, 'epoch': 57.92}


 72%|███████▎  | 725/1000 [58:23<21:48,  4.76s/it]

{'loss': 0.0273, 'grad_norm': 0.09605227410793304, 'learning_rate': 5.527638190954774e-05, 'epoch': 58.0}


 73%|███████▎  | 726/1000 [58:28<22:23,  4.90s/it]

{'loss': 0.0255, 'grad_norm': 0.08174412697553635, 'learning_rate': 5.507537688442211e-05, 'epoch': 58.08}


 73%|███████▎  | 727/1000 [58:33<21:58,  4.83s/it]

{'loss': 0.026, 'grad_norm': 0.08198010176420212, 'learning_rate': 5.487437185929648e-05, 'epoch': 58.16}


 73%|███████▎  | 728/1000 [58:37<21:54,  4.83s/it]

{'loss': 0.026, 'grad_norm': 0.07109950482845306, 'learning_rate': 5.467336683417086e-05, 'epoch': 58.24}


 73%|███████▎  | 729/1000 [58:42<21:34,  4.78s/it]

{'loss': 0.0273, 'grad_norm': 0.08825403451919556, 'learning_rate': 5.4472361809045224e-05, 'epoch': 58.32}


 73%|███████▎  | 730/1000 [58:47<21:33,  4.79s/it]

{'loss': 0.0264, 'grad_norm': 0.07448921352624893, 'learning_rate': 5.4271356783919604e-05, 'epoch': 58.4}


 73%|███████▎  | 731/1000 [58:52<21:33,  4.81s/it]

{'loss': 0.0263, 'grad_norm': 0.07377461344003677, 'learning_rate': 5.407035175879397e-05, 'epoch': 58.48}


 73%|███████▎  | 732/1000 [58:57<21:46,  4.87s/it]

{'loss': 0.0268, 'grad_norm': 0.08057614415884018, 'learning_rate': 5.386934673366835e-05, 'epoch': 58.56}


 73%|███████▎  | 733/1000 [59:02<21:54,  4.92s/it]

{'loss': 0.0272, 'grad_norm': 0.08214639872312546, 'learning_rate': 5.3668341708542716e-05, 'epoch': 58.64}


 73%|███████▎  | 734/1000 [59:07<21:42,  4.90s/it]

{'loss': 0.0265, 'grad_norm': 0.07799169421195984, 'learning_rate': 5.346733668341709e-05, 'epoch': 58.72}


 74%|███████▎  | 735/1000 [59:11<21:16,  4.82s/it]

{'loss': 0.028, 'grad_norm': 0.08544161915779114, 'learning_rate': 5.3266331658291455e-05, 'epoch': 58.8}


 74%|███████▎  | 736/1000 [59:16<21:12,  4.82s/it]

{'loss': 0.0281, 'grad_norm': 0.08507095277309418, 'learning_rate': 5.3065326633165835e-05, 'epoch': 58.88}


 74%|███████▎  | 737/1000 [59:21<21:08,  4.82s/it]

{'loss': 0.0278, 'grad_norm': 0.0863325372338295, 'learning_rate': 5.28643216080402e-05, 'epoch': 58.96}


 74%|███████▍  | 738/1000 [59:25<20:35,  4.72s/it]

{'loss': 0.0264, 'grad_norm': 0.09324126690626144, 'learning_rate': 5.266331658291458e-05, 'epoch': 59.04}


 74%|███████▍  | 739/1000 [59:30<20:24,  4.69s/it]

{'loss': 0.0268, 'grad_norm': 0.07881993055343628, 'learning_rate': 5.246231155778895e-05, 'epoch': 59.12}


 74%|███████▍  | 740/1000 [59:35<20:45,  4.79s/it]

{'loss': 0.0261, 'grad_norm': 0.07877063006162643, 'learning_rate': 5.226130653266332e-05, 'epoch': 59.2}


 74%|███████▍  | 741/1000 [59:40<20:44,  4.80s/it]

{'loss': 0.0265, 'grad_norm': 0.08628395199775696, 'learning_rate': 5.206030150753769e-05, 'epoch': 59.28}


 74%|███████▍  | 742/1000 [59:45<20:25,  4.75s/it]

{'loss': 0.0267, 'grad_norm': 0.08341321349143982, 'learning_rate': 5.1859296482412066e-05, 'epoch': 59.36}


 74%|███████▍  | 743/1000 [59:49<20:25,  4.77s/it]

{'loss': 0.0264, 'grad_norm': 0.07341814041137695, 'learning_rate': 5.165829145728643e-05, 'epoch': 59.44}


 74%|███████▍  | 744/1000 [59:54<20:24,  4.78s/it]

{'loss': 0.0264, 'grad_norm': 0.07712093740701675, 'learning_rate': 5.145728643216081e-05, 'epoch': 59.52}


 74%|███████▍  | 745/1000 [59:59<20:06,  4.73s/it]

{'loss': 0.0274, 'grad_norm': 0.09243124723434448, 'learning_rate': 5.125628140703518e-05, 'epoch': 59.6}


 75%|███████▍  | 746/1000 [1:00:04<20:08,  4.76s/it]

{'loss': 0.0264, 'grad_norm': 0.08602627366781235, 'learning_rate': 5.1055276381909544e-05, 'epoch': 59.68}


 75%|███████▍  | 747/1000 [1:00:09<20:37,  4.89s/it]

{'loss': 0.0266, 'grad_norm': 0.08148477226495743, 'learning_rate': 5.0854271356783924e-05, 'epoch': 59.76}


 75%|███████▍  | 748/1000 [1:00:13<20:11,  4.81s/it]

{'loss': 0.028, 'grad_norm': 0.0811251848936081, 'learning_rate': 5.065326633165829e-05, 'epoch': 59.84}


 75%|███████▍  | 749/1000 [1:00:18<20:20,  4.86s/it]

{'loss': 0.0275, 'grad_norm': 0.0799286887049675, 'learning_rate': 5.045226130653266e-05, 'epoch': 59.92}


 75%|███████▌  | 750/1000 [1:00:23<19:51,  4.77s/it]

{'loss': 0.0278, 'grad_norm': 0.10506769269704819, 'learning_rate': 5.0251256281407036e-05, 'epoch': 60.0}


 75%|███████▌  | 751/1000 [1:00:28<19:53,  4.79s/it]

{'loss': 0.027, 'grad_norm': 0.0825539156794548, 'learning_rate': 5.005025125628141e-05, 'epoch': 60.08}


 75%|███████▌  | 752/1000 [1:00:33<20:07,  4.87s/it]

{'loss': 0.0256, 'grad_norm': 0.10634960979223251, 'learning_rate': 4.984924623115578e-05, 'epoch': 60.16}


 75%|███████▌  | 753/1000 [1:00:38<20:01,  4.86s/it]

{'loss': 0.026, 'grad_norm': 0.07084503769874573, 'learning_rate': 4.9648241206030155e-05, 'epoch': 60.24}


 75%|███████▌  | 754/1000 [1:00:43<20:10,  4.92s/it]

{'loss': 0.0264, 'grad_norm': 0.08653993904590607, 'learning_rate': 4.944723618090453e-05, 'epoch': 60.32}


 76%|███████▌  | 755/1000 [1:00:47<19:46,  4.84s/it]

{'loss': 0.0272, 'grad_norm': 0.08079306036233902, 'learning_rate': 4.92462311557789e-05, 'epoch': 60.4}


 76%|███████▌  | 756/1000 [1:00:52<19:43,  4.85s/it]

{'loss': 0.0264, 'grad_norm': 0.08591719716787338, 'learning_rate': 4.9045226130653274e-05, 'epoch': 60.48}


 76%|███████▌  | 757/1000 [1:00:57<19:23,  4.79s/it]

{'loss': 0.0265, 'grad_norm': 0.08259962499141693, 'learning_rate': 4.884422110552764e-05, 'epoch': 60.56}


 76%|███████▌  | 758/1000 [1:01:02<19:21,  4.80s/it]

{'loss': 0.0271, 'grad_norm': 0.09772283583879471, 'learning_rate': 4.864321608040201e-05, 'epoch': 60.64}


 76%|███████▌  | 759/1000 [1:01:07<19:34,  4.87s/it]

{'loss': 0.0265, 'grad_norm': 0.08660539239645004, 'learning_rate': 4.844221105527638e-05, 'epoch': 60.72}


 76%|███████▌  | 760/1000 [1:01:12<19:27,  4.86s/it]

{'loss': 0.028, 'grad_norm': 0.08280753344297409, 'learning_rate': 4.824120603015075e-05, 'epoch': 60.8}


 76%|███████▌  | 761/1000 [1:01:16<19:20,  4.86s/it]

{'loss': 0.0282, 'grad_norm': 0.09626611322164536, 'learning_rate': 4.8040201005025125e-05, 'epoch': 60.88}


 76%|███████▌  | 762/1000 [1:01:21<19:03,  4.80s/it]

{'loss': 0.028, 'grad_norm': 0.08656886219978333, 'learning_rate': 4.78391959798995e-05, 'epoch': 60.96}


 76%|███████▋  | 763/1000 [1:01:26<18:37,  4.72s/it]

{'loss': 0.0268, 'grad_norm': 0.08928686380386353, 'learning_rate': 4.763819095477387e-05, 'epoch': 61.04}


 76%|███████▋  | 764/1000 [1:01:31<18:58,  4.82s/it]

{'loss': 0.0264, 'grad_norm': 0.08436889946460724, 'learning_rate': 4.7437185929648244e-05, 'epoch': 61.12}


 76%|███████▋  | 765/1000 [1:01:36<19:39,  5.02s/it]

{'loss': 0.025, 'grad_norm': 0.07845475524663925, 'learning_rate': 4.723618090452262e-05, 'epoch': 61.2}


 77%|███████▋  | 766/1000 [1:01:41<19:24,  4.98s/it]

{'loss': 0.0265, 'grad_norm': 0.08143939077854156, 'learning_rate': 4.703517587939698e-05, 'epoch': 61.28}


 77%|███████▋  | 767/1000 [1:01:46<19:10,  4.94s/it]

{'loss': 0.0256, 'grad_norm': 0.09149210155010223, 'learning_rate': 4.6834170854271356e-05, 'epoch': 61.36}


 77%|███████▋  | 768/1000 [1:01:51<18:45,  4.85s/it]

{'loss': 0.0275, 'grad_norm': 0.09387068450450897, 'learning_rate': 4.663316582914573e-05, 'epoch': 61.44}


 77%|███████▋  | 769/1000 [1:01:55<18:40,  4.85s/it]

{'loss': 0.0265, 'grad_norm': 0.08015833050012589, 'learning_rate': 4.64321608040201e-05, 'epoch': 61.52}


 77%|███████▋  | 770/1000 [1:02:00<18:22,  4.79s/it]

{'loss': 0.0283, 'grad_norm': 0.09126956760883331, 'learning_rate': 4.6231155778894475e-05, 'epoch': 61.6}


 77%|███████▋  | 771/1000 [1:02:05<18:06,  4.75s/it]

{'loss': 0.027, 'grad_norm': 0.0847012996673584, 'learning_rate': 4.603015075376885e-05, 'epoch': 61.68}


 77%|███████▋  | 772/1000 [1:02:10<18:08,  4.77s/it]

{'loss': 0.0268, 'grad_norm': 0.08338658511638641, 'learning_rate': 4.582914572864322e-05, 'epoch': 61.76}


 77%|███████▋  | 773/1000 [1:02:14<18:08,  4.80s/it]

{'loss': 0.0271, 'grad_norm': 0.09460491687059402, 'learning_rate': 4.5628140703517594e-05, 'epoch': 61.84}


 77%|███████▋  | 774/1000 [1:02:19<18:21,  4.87s/it]

{'loss': 0.0272, 'grad_norm': 0.09055019170045853, 'learning_rate': 4.542713567839196e-05, 'epoch': 61.92}


 78%|███████▊  | 775/1000 [1:02:24<17:41,  4.72s/it]

{'loss': 0.0268, 'grad_norm': 0.0919949859380722, 'learning_rate': 4.522613065326633e-05, 'epoch': 62.0}


 78%|███████▊  | 776/1000 [1:02:29<17:32,  4.70s/it]

{'loss': 0.026, 'grad_norm': 0.09100700169801712, 'learning_rate': 4.5025125628140706e-05, 'epoch': 62.08}


 78%|███████▊  | 777/1000 [1:02:34<17:50,  4.80s/it]

{'loss': 0.0259, 'grad_norm': 0.07640193402767181, 'learning_rate': 4.482412060301508e-05, 'epoch': 62.16}


 78%|███████▊  | 778/1000 [1:02:38<17:36,  4.76s/it]

{'loss': 0.0262, 'grad_norm': 0.0817738026380539, 'learning_rate': 4.462311557788945e-05, 'epoch': 62.24}


 78%|███████▊  | 779/1000 [1:02:43<17:37,  4.79s/it]

{'loss': 0.0256, 'grad_norm': 0.08617933839559555, 'learning_rate': 4.4422110552763825e-05, 'epoch': 62.32}


 78%|███████▊  | 780/1000 [1:02:48<17:36,  4.80s/it]

{'loss': 0.0268, 'grad_norm': 0.08687811344861984, 'learning_rate': 4.42211055276382e-05, 'epoch': 62.4}


 78%|███████▊  | 781/1000 [1:02:53<17:23,  4.76s/it]

{'loss': 0.0265, 'grad_norm': 0.08915890753269196, 'learning_rate': 4.4020100502512564e-05, 'epoch': 62.48}


 78%|███████▊  | 782/1000 [1:02:58<17:37,  4.85s/it]

{'loss': 0.0262, 'grad_norm': 0.07661569118499756, 'learning_rate': 4.381909547738694e-05, 'epoch': 62.56}


 78%|███████▊  | 783/1000 [1:03:02<17:20,  4.80s/it]

{'loss': 0.0268, 'grad_norm': 0.08035392314195633, 'learning_rate': 4.3618090452261303e-05, 'epoch': 62.64}


 78%|███████▊  | 784/1000 [1:03:07<17:19,  4.81s/it]

{'loss': 0.0273, 'grad_norm': 0.08809266239404678, 'learning_rate': 4.3417085427135676e-05, 'epoch': 62.72}


 78%|███████▊  | 785/1000 [1:03:12<17:28,  4.88s/it]

{'loss': 0.0275, 'grad_norm': 0.08432599157094955, 'learning_rate': 4.321608040201005e-05, 'epoch': 62.8}


 79%|███████▊  | 786/1000 [1:03:17<17:21,  4.86s/it]

{'loss': 0.0269, 'grad_norm': 0.0800500139594078, 'learning_rate': 4.301507537688442e-05, 'epoch': 62.88}


 79%|███████▊  | 787/1000 [1:03:22<17:14,  4.86s/it]

{'loss': 0.0276, 'grad_norm': 0.09675251692533493, 'learning_rate': 4.2814070351758795e-05, 'epoch': 62.96}


 79%|███████▉  | 788/1000 [1:03:26<16:56,  4.79s/it]

{'loss': 0.0277, 'grad_norm': 0.10360720753669739, 'learning_rate': 4.261306532663317e-05, 'epoch': 63.04}


 79%|███████▉  | 789/1000 [1:03:31<16:54,  4.81s/it]

{'loss': 0.026, 'grad_norm': 0.08662448078393936, 'learning_rate': 4.241206030150754e-05, 'epoch': 63.12}


 79%|███████▉  | 790/1000 [1:03:36<17:06,  4.89s/it]

{'loss': 0.0262, 'grad_norm': 0.08631499856710434, 'learning_rate': 4.2211055276381914e-05, 'epoch': 63.2}


 79%|███████▉  | 791/1000 [1:03:41<16:48,  4.82s/it]

{'loss': 0.0263, 'grad_norm': 0.09118711948394775, 'learning_rate': 4.201005025125628e-05, 'epoch': 63.28}


 79%|███████▉  | 792/1000 [1:03:46<16:57,  4.89s/it]

{'loss': 0.0254, 'grad_norm': 0.07419969886541367, 'learning_rate': 4.180904522613065e-05, 'epoch': 63.36}


 79%|███████▉  | 793/1000 [1:03:51<16:37,  4.82s/it]

{'loss': 0.0267, 'grad_norm': 0.08067315071821213, 'learning_rate': 4.1608040201005026e-05, 'epoch': 63.44}


 79%|███████▉  | 794/1000 [1:03:56<16:34,  4.83s/it]

{'loss': 0.026, 'grad_norm': 0.08655832707881927, 'learning_rate': 4.14070351758794e-05, 'epoch': 63.52}


 80%|███████▉  | 795/1000 [1:04:00<16:19,  4.78s/it]

{'loss': 0.0279, 'grad_norm': 0.10026611387729645, 'learning_rate': 4.120603015075377e-05, 'epoch': 63.6}


 80%|███████▉  | 796/1000 [1:04:05<16:31,  4.86s/it]

{'loss': 0.0264, 'grad_norm': 0.08030256628990173, 'learning_rate': 4.1005025125628145e-05, 'epoch': 63.68}


 80%|███████▉  | 797/1000 [1:04:10<16:26,  4.86s/it]

{'loss': 0.0266, 'grad_norm': 0.0891936719417572, 'learning_rate': 4.080402010050252e-05, 'epoch': 63.76}


 80%|███████▉  | 798/1000 [1:04:15<16:10,  4.80s/it]

{'loss': 0.028, 'grad_norm': 0.08299892395734787, 'learning_rate': 4.060301507537689e-05, 'epoch': 63.84}


 80%|███████▉  | 799/1000 [1:04:20<16:09,  4.82s/it]

{'loss': 0.0274, 'grad_norm': 0.08795572817325592, 'learning_rate': 4.040201005025126e-05, 'epoch': 63.92}


 80%|████████  | 800/1000 [1:04:24<15:48,  4.74s/it]

{'loss': 0.0277, 'grad_norm': 0.10088787972927094, 'learning_rate': 4.020100502512563e-05, 'epoch': 64.0}


 80%|████████  | 801/1000 [1:04:29<16:02,  4.83s/it]

{'loss': 0.0257, 'grad_norm': 0.08662073314189911, 'learning_rate': 4e-05, 'epoch': 64.08}


 80%|████████  | 802/1000 [1:04:34<15:46,  4.78s/it]

{'loss': 0.0266, 'grad_norm': 0.08119615912437439, 'learning_rate': 3.9798994974874376e-05, 'epoch': 64.16}


 80%|████████  | 803/1000 [1:04:39<15:34,  4.74s/it]

{'loss': 0.0269, 'grad_norm': 0.08465786278247833, 'learning_rate': 3.959798994974875e-05, 'epoch': 64.24}


 80%|████████  | 804/1000 [1:04:43<15:34,  4.77s/it]

{'loss': 0.0264, 'grad_norm': 0.09724746644496918, 'learning_rate': 3.9396984924623115e-05, 'epoch': 64.32}


 80%|████████  | 805/1000 [1:04:48<15:33,  4.79s/it]

{'loss': 0.0267, 'grad_norm': 0.09283193945884705, 'learning_rate': 3.919597989949749e-05, 'epoch': 64.4}


 81%|████████  | 806/1000 [1:04:54<15:53,  4.92s/it]

{'loss': 0.026, 'grad_norm': 0.08096711337566376, 'learning_rate': 3.899497487437186e-05, 'epoch': 64.48}


 81%|████████  | 807/1000 [1:04:58<15:44,  4.89s/it]

{'loss': 0.0269, 'grad_norm': 0.084804467856884, 'learning_rate': 3.8793969849246234e-05, 'epoch': 64.56}


 81%|████████  | 808/1000 [1:05:03<15:47,  4.93s/it]

{'loss': 0.0253, 'grad_norm': 0.07366886734962463, 'learning_rate': 3.85929648241206e-05, 'epoch': 64.64}


 81%|████████  | 809/1000 [1:05:08<15:25,  4.84s/it]

{'loss': 0.027, 'grad_norm': 0.09387031197547913, 'learning_rate': 3.8391959798994973e-05, 'epoch': 64.72}


 81%|████████  | 810/1000 [1:05:13<15:20,  4.84s/it]

{'loss': 0.0263, 'grad_norm': 0.08287454396486282, 'learning_rate': 3.8190954773869346e-05, 'epoch': 64.8}


 81%|████████  | 811/1000 [1:05:18<15:13,  4.83s/it]

{'loss': 0.027, 'grad_norm': 0.08704375475645065, 'learning_rate': 3.798994974874372e-05, 'epoch': 64.88}


 81%|████████  | 812/1000 [1:05:23<15:07,  4.83s/it]

{'loss': 0.0282, 'grad_norm': 0.08146761357784271, 'learning_rate': 3.778894472361809e-05, 'epoch': 64.96}


 81%|████████▏ | 813/1000 [1:05:27<14:29,  4.65s/it]

{'loss': 0.0283, 'grad_norm': 0.1163921058177948, 'learning_rate': 3.7587939698492465e-05, 'epoch': 65.04}


 81%|████████▏ | 814/1000 [1:05:32<14:34,  4.70s/it]

{'loss': 0.0268, 'grad_norm': 0.0839434489607811, 'learning_rate': 3.738693467336684e-05, 'epoch': 65.12}


 82%|████████▏ | 815/1000 [1:05:36<14:37,  4.74s/it]

{'loss': 0.0269, 'grad_norm': 0.07924963533878326, 'learning_rate': 3.7185929648241204e-05, 'epoch': 65.2}


 82%|████████▏ | 816/1000 [1:05:41<14:26,  4.71s/it]

{'loss': 0.0261, 'grad_norm': 0.07908507436513901, 'learning_rate': 3.698492462311558e-05, 'epoch': 65.28}


 82%|████████▏ | 817/1000 [1:05:46<14:28,  4.74s/it]

{'loss': 0.0263, 'grad_norm': 0.08099476993083954, 'learning_rate': 3.678391959798995e-05, 'epoch': 65.36}


 82%|████████▏ | 818/1000 [1:05:51<14:37,  4.82s/it]

{'loss': 0.0267, 'grad_norm': 0.08231756836175919, 'learning_rate': 3.658291457286432e-05, 'epoch': 65.44}


 82%|████████▏ | 819/1000 [1:05:55<14:22,  4.76s/it]

{'loss': 0.0265, 'grad_norm': 0.08178433775901794, 'learning_rate': 3.6381909547738696e-05, 'epoch': 65.52}


 82%|████████▏ | 820/1000 [1:06:00<14:20,  4.78s/it]

{'loss': 0.0265, 'grad_norm': 0.08702057600021362, 'learning_rate': 3.618090452261307e-05, 'epoch': 65.6}


 82%|████████▏ | 821/1000 [1:06:06<14:38,  4.91s/it]

{'loss': 0.026, 'grad_norm': 0.09495142847299576, 'learning_rate': 3.597989949748744e-05, 'epoch': 65.68}


 82%|████████▏ | 822/1000 [1:06:10<14:28,  4.88s/it]

{'loss': 0.0267, 'grad_norm': 0.07672550529241562, 'learning_rate': 3.5778894472361815e-05, 'epoch': 65.76}


 82%|████████▏ | 823/1000 [1:06:15<14:20,  4.86s/it]

{'loss': 0.027, 'grad_norm': 0.07987779378890991, 'learning_rate': 3.557788944723618e-05, 'epoch': 65.84}


 82%|████████▏ | 824/1000 [1:06:20<14:02,  4.79s/it]

{'loss': 0.0274, 'grad_norm': 0.08723278343677521, 'learning_rate': 3.5376884422110554e-05, 'epoch': 65.92}


 82%|████████▎ | 825/1000 [1:06:24<13:49,  4.74s/it]

{'loss': 0.0272, 'grad_norm': 0.09868674725294113, 'learning_rate': 3.517587939698493e-05, 'epoch': 66.0}


 83%|████████▎ | 826/1000 [1:06:29<13:48,  4.76s/it]

{'loss': 0.0258, 'grad_norm': 0.07582245767116547, 'learning_rate': 3.49748743718593e-05, 'epoch': 66.08}


 83%|████████▎ | 827/1000 [1:06:34<13:56,  4.84s/it]

{'loss': 0.0254, 'grad_norm': 0.07769007980823517, 'learning_rate': 3.4773869346733667e-05, 'epoch': 66.16}


 83%|████████▎ | 828/1000 [1:06:39<14:00,  4.89s/it]

{'loss': 0.0254, 'grad_norm': 0.08157719671726227, 'learning_rate': 3.457286432160804e-05, 'epoch': 66.24}


 83%|████████▎ | 829/1000 [1:06:44<13:41,  4.81s/it]

{'loss': 0.027, 'grad_norm': 0.08483811467885971, 'learning_rate': 3.437185929648241e-05, 'epoch': 66.32}


 83%|████████▎ | 830/1000 [1:06:48<13:27,  4.75s/it]

{'loss': 0.0262, 'grad_norm': 0.07413341104984283, 'learning_rate': 3.4170854271356785e-05, 'epoch': 66.4}


 83%|████████▎ | 831/1000 [1:06:53<13:35,  4.83s/it]

{'loss': 0.0258, 'grad_norm': 0.08058857172727585, 'learning_rate': 3.396984924623116e-05, 'epoch': 66.48}


 83%|████████▎ | 832/1000 [1:06:58<13:29,  4.82s/it]

{'loss': 0.0261, 'grad_norm': 0.08167792111635208, 'learning_rate': 3.3768844221105525e-05, 'epoch': 66.56}


 83%|████████▎ | 833/1000 [1:07:03<13:24,  4.82s/it]

{'loss': 0.0267, 'grad_norm': 0.07944183051586151, 'learning_rate': 3.35678391959799e-05, 'epoch': 66.64}


 83%|████████▎ | 834/1000 [1:07:08<13:18,  4.81s/it]

{'loss': 0.027, 'grad_norm': 0.08701243996620178, 'learning_rate': 3.336683417085427e-05, 'epoch': 66.72}


 84%|████████▎ | 835/1000 [1:07:13<13:23,  4.87s/it]

{'loss': 0.0268, 'grad_norm': 0.09025464951992035, 'learning_rate': 3.3165829145728643e-05, 'epoch': 66.8}


 84%|████████▎ | 836/1000 [1:07:18<13:07,  4.80s/it]

{'loss': 0.0278, 'grad_norm': 0.08523906767368317, 'learning_rate': 3.2964824120603016e-05, 'epoch': 66.88}


 84%|████████▎ | 837/1000 [1:07:23<13:14,  4.88s/it]

{'loss': 0.0275, 'grad_norm': 0.07987348735332489, 'learning_rate': 3.276381909547739e-05, 'epoch': 66.96}


 84%|████████▍ | 838/1000 [1:07:27<12:50,  4.76s/it]

{'loss': 0.0275, 'grad_norm': 0.1009974256157875, 'learning_rate': 3.256281407035176e-05, 'epoch': 67.04}


 84%|████████▍ | 839/1000 [1:07:32<12:40,  4.73s/it]

{'loss': 0.026, 'grad_norm': 0.08048021793365479, 'learning_rate': 3.2361809045226135e-05, 'epoch': 67.12}


 84%|████████▍ | 840/1000 [1:07:37<12:43,  4.77s/it]

{'loss': 0.0259, 'grad_norm': 0.07866194099187851, 'learning_rate': 3.21608040201005e-05, 'epoch': 67.2}


 84%|████████▍ | 841/1000 [1:07:41<12:42,  4.80s/it]

{'loss': 0.0259, 'grad_norm': 0.07602619379758835, 'learning_rate': 3.1959798994974875e-05, 'epoch': 67.28}


 84%|████████▍ | 842/1000 [1:07:46<12:49,  4.87s/it]

{'loss': 0.0261, 'grad_norm': 0.08173518627882004, 'learning_rate': 3.175879396984925e-05, 'epoch': 67.36}


 84%|████████▍ | 843/1000 [1:07:51<12:44,  4.87s/it]

{'loss': 0.0277, 'grad_norm': 0.09431794285774231, 'learning_rate': 3.155778894472362e-05, 'epoch': 67.44}


 84%|████████▍ | 844/1000 [1:07:56<12:29,  4.81s/it]

{'loss': 0.0266, 'grad_norm': 0.08423954248428345, 'learning_rate': 3.1356783919597993e-05, 'epoch': 67.52}


 84%|████████▍ | 845/1000 [1:08:01<12:27,  4.82s/it]

{'loss': 0.0267, 'grad_norm': 0.09076420217752457, 'learning_rate': 3.1155778894472366e-05, 'epoch': 67.6}


 85%|████████▍ | 846/1000 [1:08:06<12:15,  4.78s/it]

{'loss': 0.0275, 'grad_norm': 0.1283959001302719, 'learning_rate': 3.095477386934674e-05, 'epoch': 67.68}


 85%|████████▍ | 847/1000 [1:08:10<12:14,  4.80s/it]

{'loss': 0.0263, 'grad_norm': 0.077657051384449, 'learning_rate': 3.075376884422111e-05, 'epoch': 67.76}


 85%|████████▍ | 848/1000 [1:08:16<12:30,  4.94s/it]

{'loss': 0.0265, 'grad_norm': 0.08801554888486862, 'learning_rate': 3.055276381909548e-05, 'epoch': 67.84}


 85%|████████▍ | 849/1000 [1:08:21<12:30,  4.97s/it]

{'loss': 0.0263, 'grad_norm': 0.07523665577173233, 'learning_rate': 3.0351758793969855e-05, 'epoch': 67.92}


 85%|████████▌ | 850/1000 [1:08:25<11:55,  4.77s/it]

{'loss': 0.0283, 'grad_norm': 0.09552500396966934, 'learning_rate': 3.015075376884422e-05, 'epoch': 68.0}


 85%|████████▌ | 851/1000 [1:08:30<11:46,  4.74s/it]

{'loss': 0.0261, 'grad_norm': 0.07306721061468124, 'learning_rate': 2.994974874371859e-05, 'epoch': 68.08}


 85%|████████▌ | 852/1000 [1:08:35<11:55,  4.84s/it]

{'loss': 0.0259, 'grad_norm': 0.08322494477033615, 'learning_rate': 2.9748743718592964e-05, 'epoch': 68.16}


 85%|████████▌ | 853/1000 [1:08:40<12:09,  4.96s/it]

{'loss': 0.0257, 'grad_norm': 0.07926575839519501, 'learning_rate': 2.9547738693467337e-05, 'epoch': 68.24}


 85%|████████▌ | 854/1000 [1:08:45<11:50,  4.87s/it]

{'loss': 0.0263, 'grad_norm': 0.08353672176599503, 'learning_rate': 2.934673366834171e-05, 'epoch': 68.32}


 86%|████████▌ | 855/1000 [1:08:49<11:44,  4.86s/it]

{'loss': 0.0258, 'grad_norm': 0.08438270539045334, 'learning_rate': 2.914572864321608e-05, 'epoch': 68.4}


 86%|████████▌ | 856/1000 [1:08:54<11:38,  4.85s/it]

{'loss': 0.0266, 'grad_norm': 0.08280033618211746, 'learning_rate': 2.8944723618090452e-05, 'epoch': 68.48}


 86%|████████▌ | 857/1000 [1:08:59<11:33,  4.85s/it]

{'loss': 0.0267, 'grad_norm': 0.09049402922391891, 'learning_rate': 2.8743718592964825e-05, 'epoch': 68.56}


 86%|████████▌ | 858/1000 [1:09:04<11:28,  4.85s/it]

{'loss': 0.0269, 'grad_norm': 0.08521659672260284, 'learning_rate': 2.8542713567839198e-05, 'epoch': 68.64}


 86%|████████▌ | 859/1000 [1:09:09<11:32,  4.91s/it]

{'loss': 0.0266, 'grad_norm': 0.08734139800071716, 'learning_rate': 2.8341708542713568e-05, 'epoch': 68.72}


 86%|████████▌ | 860/1000 [1:09:14<11:16,  4.83s/it]

{'loss': 0.0267, 'grad_norm': 0.08640355616807938, 'learning_rate': 2.814070351758794e-05, 'epoch': 68.8}


 86%|████████▌ | 861/1000 [1:09:18<11:04,  4.78s/it]

{'loss': 0.0279, 'grad_norm': 0.08033968508243561, 'learning_rate': 2.7939698492462314e-05, 'epoch': 68.88}


 86%|████████▌ | 862/1000 [1:09:23<11:11,  4.86s/it]

{'loss': 0.0264, 'grad_norm': 0.10852265357971191, 'learning_rate': 2.7738693467336686e-05, 'epoch': 68.96}


 86%|████████▋ | 863/1000 [1:09:28<10:51,  4.75s/it]

{'loss': 0.0263, 'grad_norm': 0.09316247701644897, 'learning_rate': 2.7537688442211056e-05, 'epoch': 69.04}


 86%|████████▋ | 864/1000 [1:09:33<10:42,  4.72s/it]

{'loss': 0.0263, 'grad_norm': 0.07432349026203156, 'learning_rate': 2.733668341708543e-05, 'epoch': 69.12}


 86%|████████▋ | 865/1000 [1:09:37<10:34,  4.70s/it]

{'loss': 0.0262, 'grad_norm': 0.09446380287408829, 'learning_rate': 2.7135678391959802e-05, 'epoch': 69.2}


 87%|████████▋ | 866/1000 [1:09:42<10:36,  4.75s/it]

{'loss': 0.0256, 'grad_norm': 0.08431557565927505, 'learning_rate': 2.6934673366834175e-05, 'epoch': 69.28}


 87%|████████▋ | 867/1000 [1:09:47<10:35,  4.78s/it]

{'loss': 0.0254, 'grad_norm': 0.0837516039609909, 'learning_rate': 2.6733668341708545e-05, 'epoch': 69.36}


 87%|████████▋ | 868/1000 [1:09:52<10:33,  4.80s/it]

{'loss': 0.0268, 'grad_norm': 0.09299176931381226, 'learning_rate': 2.6532663316582917e-05, 'epoch': 69.44}


 87%|████████▋ | 869/1000 [1:09:56<10:23,  4.76s/it]

{'loss': 0.0267, 'grad_norm': 0.09117908775806427, 'learning_rate': 2.633165829145729e-05, 'epoch': 69.52}


 87%|████████▋ | 870/1000 [1:10:01<10:22,  4.79s/it]

{'loss': 0.0262, 'grad_norm': 0.08390487730503082, 'learning_rate': 2.613065326633166e-05, 'epoch': 69.6}


 87%|████████▋ | 871/1000 [1:10:06<10:26,  4.86s/it]

{'loss': 0.0273, 'grad_norm': 0.07927610725164413, 'learning_rate': 2.5929648241206033e-05, 'epoch': 69.68}


 87%|████████▋ | 872/1000 [1:10:11<10:20,  4.85s/it]

{'loss': 0.0268, 'grad_norm': 0.08931392431259155, 'learning_rate': 2.5728643216080406e-05, 'epoch': 69.76}


 87%|████████▋ | 873/1000 [1:10:16<10:22,  4.90s/it]

{'loss': 0.0264, 'grad_norm': 0.08532571792602539, 'learning_rate': 2.5527638190954772e-05, 'epoch': 69.84}


 87%|████████▋ | 874/1000 [1:10:21<10:07,  4.82s/it]

{'loss': 0.0276, 'grad_norm': 0.08260475099086761, 'learning_rate': 2.5326633165829145e-05, 'epoch': 69.92}


 88%|████████▊ | 875/1000 [1:10:25<09:49,  4.71s/it]

{'loss': 0.0274, 'grad_norm': 0.10620935261249542, 'learning_rate': 2.5125628140703518e-05, 'epoch': 70.0}


 88%|████████▊ | 876/1000 [1:10:30<09:50,  4.76s/it]

{'loss': 0.0257, 'grad_norm': 0.07820043712854385, 'learning_rate': 2.492462311557789e-05, 'epoch': 70.08}


 88%|████████▊ | 877/1000 [1:10:35<09:48,  4.79s/it]

{'loss': 0.0261, 'grad_norm': 0.0811040848493576, 'learning_rate': 2.4723618090452264e-05, 'epoch': 70.16}


 88%|████████▊ | 878/1000 [1:10:40<09:39,  4.75s/it]

{'loss': 0.0266, 'grad_norm': 0.09016134589910507, 'learning_rate': 2.4522613065326637e-05, 'epoch': 70.24}


 88%|████████▊ | 879/1000 [1:10:45<09:39,  4.79s/it]

{'loss': 0.0264, 'grad_norm': 0.08716612309217453, 'learning_rate': 2.4321608040201007e-05, 'epoch': 70.32}


 88%|████████▊ | 880/1000 [1:10:49<09:36,  4.81s/it]

{'loss': 0.0257, 'grad_norm': 0.08054903894662857, 'learning_rate': 2.4120603015075376e-05, 'epoch': 70.4}


 88%|████████▊ | 881/1000 [1:10:54<09:26,  4.76s/it]

{'loss': 0.026, 'grad_norm': 0.07970285415649414, 'learning_rate': 2.391959798994975e-05, 'epoch': 70.48}


 88%|████████▊ | 882/1000 [1:10:59<09:25,  4.79s/it]

{'loss': 0.0269, 'grad_norm': 0.09279381483793259, 'learning_rate': 2.3718592964824122e-05, 'epoch': 70.56}


 88%|████████▊ | 883/1000 [1:11:04<09:36,  4.93s/it]

{'loss': 0.0258, 'grad_norm': 0.09035013616085052, 'learning_rate': 2.351758793969849e-05, 'epoch': 70.64}


 88%|████████▊ | 884/1000 [1:11:09<09:29,  4.91s/it]

{'loss': 0.0264, 'grad_norm': 0.07599256932735443, 'learning_rate': 2.3316582914572865e-05, 'epoch': 70.72}


 88%|████████▊ | 885/1000 [1:11:14<09:16,  4.84s/it]

{'loss': 0.027, 'grad_norm': 0.08509814739227295, 'learning_rate': 2.3115577889447238e-05, 'epoch': 70.8}


 89%|████████▊ | 886/1000 [1:11:19<09:12,  4.85s/it]

{'loss': 0.0266, 'grad_norm': 0.0857737585902214, 'learning_rate': 2.291457286432161e-05, 'epoch': 70.88}


 89%|████████▊ | 887/1000 [1:11:24<09:14,  4.91s/it]

{'loss': 0.027, 'grad_norm': 0.08587252348661423, 'learning_rate': 2.271356783919598e-05, 'epoch': 70.96}


 89%|████████▉ | 888/1000 [1:11:28<08:55,  4.78s/it]

{'loss': 0.0267, 'grad_norm': 0.08986189216375351, 'learning_rate': 2.2512562814070353e-05, 'epoch': 71.04}


 89%|████████▉ | 889/1000 [1:11:33<08:46,  4.74s/it]

{'loss': 0.0257, 'grad_norm': 0.08450659364461899, 'learning_rate': 2.2311557788944726e-05, 'epoch': 71.12}


 89%|████████▉ | 890/1000 [1:11:38<08:45,  4.77s/it]

{'loss': 0.0258, 'grad_norm': 0.08845642954111099, 'learning_rate': 2.21105527638191e-05, 'epoch': 71.2}


 89%|████████▉ | 891/1000 [1:11:43<08:48,  4.85s/it]

{'loss': 0.0261, 'grad_norm': 0.09146475046873093, 'learning_rate': 2.190954773869347e-05, 'epoch': 71.28}


 89%|████████▉ | 892/1000 [1:11:47<08:37,  4.79s/it]

{'loss': 0.0266, 'grad_norm': 0.09418671578168869, 'learning_rate': 2.1708542713567838e-05, 'epoch': 71.36}


 89%|████████▉ | 893/1000 [1:11:52<08:39,  4.86s/it]

{'loss': 0.0261, 'grad_norm': 0.08005321025848389, 'learning_rate': 2.150753768844221e-05, 'epoch': 71.44}


 89%|████████▉ | 894/1000 [1:11:57<08:33,  4.85s/it]

{'loss': 0.0262, 'grad_norm': 0.07798095792531967, 'learning_rate': 2.1306532663316584e-05, 'epoch': 71.52}


 90%|████████▉ | 895/1000 [1:12:02<08:34,  4.90s/it]

{'loss': 0.026, 'grad_norm': 0.08285456895828247, 'learning_rate': 2.1105527638190957e-05, 'epoch': 71.6}


 90%|████████▉ | 896/1000 [1:12:07<08:21,  4.82s/it]

{'loss': 0.0274, 'grad_norm': 0.09673850238323212, 'learning_rate': 2.0904522613065327e-05, 'epoch': 71.68}


 90%|████████▉ | 897/1000 [1:12:12<08:22,  4.88s/it]

{'loss': 0.0258, 'grad_norm': 0.0846351757645607, 'learning_rate': 2.07035175879397e-05, 'epoch': 71.76}


 90%|████████▉ | 898/1000 [1:12:16<08:10,  4.81s/it]

{'loss': 0.0272, 'grad_norm': 0.09256167709827423, 'learning_rate': 2.0502512562814073e-05, 'epoch': 71.84}


 90%|████████▉ | 899/1000 [1:12:22<08:17,  4.92s/it]

{'loss': 0.026, 'grad_norm': 0.0901389792561531, 'learning_rate': 2.0301507537688446e-05, 'epoch': 71.92}


 90%|█████████ | 900/1000 [1:12:26<07:52,  4.72s/it]

{'loss': 0.0269, 'grad_norm': 0.0847623273730278, 'learning_rate': 2.0100502512562815e-05, 'epoch': 72.0}


 90%|█████████ | 901/1000 [1:12:31<07:50,  4.75s/it]

{'loss': 0.0264, 'grad_norm': 0.08315624296665192, 'learning_rate': 1.9899497487437188e-05, 'epoch': 72.08}


 90%|█████████ | 902/1000 [1:12:35<07:47,  4.77s/it]

{'loss': 0.026, 'grad_norm': 0.08322542160749435, 'learning_rate': 1.9698492462311558e-05, 'epoch': 72.16}


 90%|█████████ | 903/1000 [1:12:41<07:49,  4.84s/it]

{'loss': 0.0258, 'grad_norm': 0.08060596138238907, 'learning_rate': 1.949748743718593e-05, 'epoch': 72.24}


 90%|█████████ | 904/1000 [1:12:46<07:49,  4.89s/it]

{'loss': 0.0251, 'grad_norm': 0.07912974059581757, 'learning_rate': 1.92964824120603e-05, 'epoch': 72.32}


 90%|█████████ | 905/1000 [1:12:50<07:37,  4.81s/it]

{'loss': 0.0266, 'grad_norm': 0.08065585047006607, 'learning_rate': 1.9095477386934673e-05, 'epoch': 72.4}


 91%|█████████ | 906/1000 [1:12:55<07:32,  4.81s/it]

{'loss': 0.0265, 'grad_norm': 0.08058477193117142, 'learning_rate': 1.8894472361809046e-05, 'epoch': 72.48}


 91%|█████████ | 907/1000 [1:13:00<07:22,  4.76s/it]

{'loss': 0.027, 'grad_norm': 0.08899353444576263, 'learning_rate': 1.869346733668342e-05, 'epoch': 72.56}


 91%|█████████ | 908/1000 [1:13:04<07:14,  4.72s/it]

{'loss': 0.0261, 'grad_norm': 0.087312713265419, 'learning_rate': 1.849246231155779e-05, 'epoch': 72.64}


 91%|█████████ | 909/1000 [1:13:09<07:12,  4.75s/it]

{'loss': 0.0262, 'grad_norm': 0.101229228079319, 'learning_rate': 1.829145728643216e-05, 'epoch': 72.72}


 91%|█████████ | 910/1000 [1:13:14<07:04,  4.71s/it]

{'loss': 0.0273, 'grad_norm': 0.08778905868530273, 'learning_rate': 1.8090452261306535e-05, 'epoch': 72.8}


 91%|█████████ | 911/1000 [1:13:19<07:07,  4.80s/it]

{'loss': 0.0263, 'grad_norm': 0.08251592516899109, 'learning_rate': 1.7889447236180908e-05, 'epoch': 72.88}


 91%|█████████ | 912/1000 [1:13:24<07:07,  4.86s/it]

{'loss': 0.0261, 'grad_norm': 0.08365177363157272, 'learning_rate': 1.7688442211055277e-05, 'epoch': 72.96}


 91%|█████████▏| 913/1000 [1:13:28<06:56,  4.79s/it]

{'loss': 0.0263, 'grad_norm': 0.11179516464471817, 'learning_rate': 1.748743718592965e-05, 'epoch': 73.04}


 91%|█████████▏| 914/1000 [1:13:33<06:57,  4.85s/it]

{'loss': 0.025, 'grad_norm': 0.08493329584598541, 'learning_rate': 1.728643216080402e-05, 'epoch': 73.12}


 92%|█████████▏| 915/1000 [1:13:38<06:51,  4.84s/it]

{'loss': 0.0261, 'grad_norm': 0.08430644124746323, 'learning_rate': 1.7085427135678393e-05, 'epoch': 73.2}


 92%|█████████▏| 916/1000 [1:13:43<06:45,  4.83s/it]

{'loss': 0.0263, 'grad_norm': 0.08416802436113358, 'learning_rate': 1.6884422110552762e-05, 'epoch': 73.28}


 92%|█████████▏| 917/1000 [1:13:48<06:35,  4.77s/it]

{'loss': 0.0266, 'grad_norm': 0.09286718815565109, 'learning_rate': 1.6683417085427135e-05, 'epoch': 73.36}


 92%|█████████▏| 918/1000 [1:13:53<06:41,  4.90s/it]

{'loss': 0.0256, 'grad_norm': 0.08583415299654007, 'learning_rate': 1.6482412060301508e-05, 'epoch': 73.44}


 92%|█████████▏| 919/1000 [1:13:57<06:30,  4.82s/it]

{'loss': 0.0274, 'grad_norm': 0.08902423828840256, 'learning_rate': 1.628140703517588e-05, 'epoch': 73.52}


 92%|█████████▏| 920/1000 [1:14:02<06:29,  4.87s/it]

{'loss': 0.0257, 'grad_norm': 0.08215387165546417, 'learning_rate': 1.608040201005025e-05, 'epoch': 73.6}


 92%|█████████▏| 921/1000 [1:14:07<06:18,  4.80s/it]

{'loss': 0.0271, 'grad_norm': 0.08623969554901123, 'learning_rate': 1.5879396984924624e-05, 'epoch': 73.68}


 92%|█████████▏| 922/1000 [1:14:12<06:18,  4.86s/it]

{'loss': 0.0256, 'grad_norm': 0.10669419169425964, 'learning_rate': 1.5678391959798997e-05, 'epoch': 73.76}


 92%|█████████▏| 923/1000 [1:14:17<06:14,  4.86s/it]

{'loss': 0.0264, 'grad_norm': 0.08404813706874847, 'learning_rate': 1.547738693467337e-05, 'epoch': 73.84}


 92%|█████████▏| 924/1000 [1:14:22<06:09,  4.86s/it]

{'loss': 0.0261, 'grad_norm': 0.08203762769699097, 'learning_rate': 1.527638190954774e-05, 'epoch': 73.92}


 92%|█████████▎| 925/1000 [1:14:26<05:51,  4.69s/it]

{'loss': 0.0269, 'grad_norm': 0.10061216354370117, 'learning_rate': 1.507537688442211e-05, 'epoch': 74.0}


 93%|█████████▎| 926/1000 [1:14:31<05:51,  4.75s/it]

{'loss': 0.0262, 'grad_norm': 0.08082801103591919, 'learning_rate': 1.4874371859296482e-05, 'epoch': 74.08}


 93%|█████████▎| 927/1000 [1:14:36<05:49,  4.79s/it]

{'loss': 0.0251, 'grad_norm': 0.08016406744718552, 'learning_rate': 1.4673366834170855e-05, 'epoch': 74.16}


 93%|█████████▎| 928/1000 [1:14:41<05:46,  4.82s/it]

{'loss': 0.0259, 'grad_norm': 0.0894007533788681, 'learning_rate': 1.4472361809045226e-05, 'epoch': 74.24}


 93%|█████████▎| 929/1000 [1:14:46<05:43,  4.84s/it]

{'loss': 0.0262, 'grad_norm': 0.09228524565696716, 'learning_rate': 1.4271356783919599e-05, 'epoch': 74.32}


 93%|█████████▎| 930/1000 [1:14:50<05:35,  4.79s/it]

{'loss': 0.0265, 'grad_norm': 0.09417721629142761, 'learning_rate': 1.407035175879397e-05, 'epoch': 74.4}


 93%|█████████▎| 931/1000 [1:14:55<05:28,  4.76s/it]

{'loss': 0.0269, 'grad_norm': 0.08646588027477264, 'learning_rate': 1.3869346733668343e-05, 'epoch': 74.48}


 93%|█████████▎| 932/1000 [1:15:00<05:22,  4.74s/it]

{'loss': 0.0261, 'grad_norm': 0.09264378994703293, 'learning_rate': 1.3668341708542715e-05, 'epoch': 74.56}


 93%|█████████▎| 933/1000 [1:15:05<05:24,  4.84s/it]

{'loss': 0.0258, 'grad_norm': 0.08348535746335983, 'learning_rate': 1.3467336683417087e-05, 'epoch': 74.64}


 93%|█████████▎| 934/1000 [1:15:10<05:27,  4.97s/it]

{'loss': 0.026, 'grad_norm': 0.09366517513990402, 'learning_rate': 1.3266331658291459e-05, 'epoch': 74.72}


 94%|█████████▎| 935/1000 [1:15:15<05:21,  4.94s/it]

{'loss': 0.027, 'grad_norm': 0.09055957198143005, 'learning_rate': 1.306532663316583e-05, 'epoch': 74.8}


 94%|█████████▎| 936/1000 [1:15:20<05:19,  4.99s/it]

{'loss': 0.0257, 'grad_norm': 0.08038458228111267, 'learning_rate': 1.2864321608040203e-05, 'epoch': 74.88}


 94%|█████████▎| 937/1000 [1:15:25<05:12,  4.96s/it]

{'loss': 0.0262, 'grad_norm': 0.07155045866966248, 'learning_rate': 1.2663316582914573e-05, 'epoch': 74.96}


 94%|█████████▍| 938/1000 [1:15:29<04:58,  4.82s/it]

{'loss': 0.0265, 'grad_norm': 0.10508329421281815, 'learning_rate': 1.2462311557788946e-05, 'epoch': 75.04}


 94%|█████████▍| 939/1000 [1:15:34<04:54,  4.83s/it]

{'loss': 0.0261, 'grad_norm': 0.08346936106681824, 'learning_rate': 1.2261306532663318e-05, 'epoch': 75.12}


 94%|█████████▍| 940/1000 [1:15:39<04:57,  4.96s/it]

{'loss': 0.0252, 'grad_norm': 0.08353022485971451, 'learning_rate': 1.2060301507537688e-05, 'epoch': 75.2}


 94%|█████████▍| 941/1000 [1:15:44<04:47,  4.87s/it]

{'loss': 0.0268, 'grad_norm': 0.08407153189182281, 'learning_rate': 1.1859296482412061e-05, 'epoch': 75.28}


 94%|█████████▍| 942/1000 [1:15:49<04:42,  4.86s/it]

{'loss': 0.0256, 'grad_norm': 0.09171285480260849, 'learning_rate': 1.1658291457286432e-05, 'epoch': 75.36}


 94%|█████████▍| 943/1000 [1:15:54<04:33,  4.80s/it]

{'loss': 0.0267, 'grad_norm': 0.09816865622997284, 'learning_rate': 1.1457286432160805e-05, 'epoch': 75.44}


 94%|█████████▍| 944/1000 [1:15:59<04:32,  4.87s/it]

{'loss': 0.0261, 'grad_norm': 0.08381666243076324, 'learning_rate': 1.1256281407035177e-05, 'epoch': 75.52}


 94%|█████████▍| 945/1000 [1:16:03<04:27,  4.87s/it]

{'loss': 0.0261, 'grad_norm': 0.08449313044548035, 'learning_rate': 1.105527638190955e-05, 'epoch': 75.6}


 95%|█████████▍| 946/1000 [1:16:08<04:22,  4.86s/it]

{'loss': 0.0262, 'grad_norm': 0.08200483024120331, 'learning_rate': 1.0854271356783919e-05, 'epoch': 75.68}


 95%|█████████▍| 947/1000 [1:16:13<04:17,  4.86s/it]

{'loss': 0.0263, 'grad_norm': 0.08361652493476868, 'learning_rate': 1.0653266331658292e-05, 'epoch': 75.76}


 95%|█████████▍| 948/1000 [1:16:18<04:12,  4.86s/it]

{'loss': 0.0265, 'grad_norm': 0.09069259464740753, 'learning_rate': 1.0452261306532663e-05, 'epoch': 75.84}


 95%|█████████▍| 949/1000 [1:16:23<04:07,  4.85s/it]

{'loss': 0.0258, 'grad_norm': 0.07634995132684708, 'learning_rate': 1.0251256281407036e-05, 'epoch': 75.92}


 95%|█████████▌| 950/1000 [1:16:27<03:56,  4.74s/it]

{'loss': 0.0269, 'grad_norm': 0.11214672774076462, 'learning_rate': 1.0050251256281408e-05, 'epoch': 76.0}


 95%|█████████▌| 951/1000 [1:16:32<03:50,  4.71s/it]

{'loss': 0.0262, 'grad_norm': 0.0893426388502121, 'learning_rate': 9.849246231155779e-06, 'epoch': 76.08}


 95%|█████████▌| 952/1000 [1:16:37<03:45,  4.69s/it]

{'loss': 0.0264, 'grad_norm': 0.08931445330381393, 'learning_rate': 9.64824120603015e-06, 'epoch': 76.16}


 95%|█████████▌| 953/1000 [1:16:42<03:45,  4.80s/it]

{'loss': 0.0258, 'grad_norm': 0.08727303892374039, 'learning_rate': 9.447236180904523e-06, 'epoch': 76.24}


 95%|█████████▌| 954/1000 [1:16:47<03:44,  4.87s/it]

{'loss': 0.0256, 'grad_norm': 0.07451154291629791, 'learning_rate': 9.246231155778894e-06, 'epoch': 76.32}


 96%|█████████▌| 955/1000 [1:16:51<03:36,  4.81s/it]

{'loss': 0.0262, 'grad_norm': 0.09897495061159134, 'learning_rate': 9.045226130653267e-06, 'epoch': 76.4}


 96%|█████████▌| 956/1000 [1:16:56<03:34,  4.88s/it]

{'loss': 0.0253, 'grad_norm': 0.08797666430473328, 'learning_rate': 8.844221105527639e-06, 'epoch': 76.48}


 96%|█████████▌| 957/1000 [1:17:02<03:32,  4.93s/it]

{'loss': 0.0257, 'grad_norm': 0.0776660144329071, 'learning_rate': 8.64321608040201e-06, 'epoch': 76.56}


 96%|█████████▌| 958/1000 [1:17:06<03:25,  4.90s/it]

{'loss': 0.0257, 'grad_norm': 0.08068934082984924, 'learning_rate': 8.442211055276381e-06, 'epoch': 76.64}


 96%|█████████▌| 959/1000 [1:17:11<03:17,  4.82s/it]

{'loss': 0.0277, 'grad_norm': 0.09142564982175827, 'learning_rate': 8.241206030150754e-06, 'epoch': 76.72}


 96%|█████████▌| 960/1000 [1:17:16<03:12,  4.82s/it]

{'loss': 0.0265, 'grad_norm': 0.08382059633731842, 'learning_rate': 8.040201005025125e-06, 'epoch': 76.8}


 96%|█████████▌| 961/1000 [1:17:21<03:10,  4.88s/it]

{'loss': 0.0258, 'grad_norm': 0.08819229155778885, 'learning_rate': 7.839195979899498e-06, 'epoch': 76.88}


 96%|█████████▌| 962/1000 [1:17:26<03:03,  4.83s/it]

{'loss': 0.0266, 'grad_norm': 0.08314668387174606, 'learning_rate': 7.63819095477387e-06, 'epoch': 76.96}


 96%|█████████▋| 963/1000 [1:17:30<02:52,  4.67s/it]

{'loss': 0.0265, 'grad_norm': 0.10148244351148605, 'learning_rate': 7.437185929648241e-06, 'epoch': 77.04}


 96%|█████████▋| 964/1000 [1:17:35<02:50,  4.73s/it]

{'loss': 0.0256, 'grad_norm': 0.08280713856220245, 'learning_rate': 7.236180904522613e-06, 'epoch': 77.12}


 96%|█████████▋| 965/1000 [1:17:40<02:46,  4.76s/it]

{'loss': 0.026, 'grad_norm': 0.08252786844968796, 'learning_rate': 7.035175879396985e-06, 'epoch': 77.2}


 97%|█████████▋| 966/1000 [1:17:44<02:42,  4.79s/it]

{'loss': 0.0258, 'grad_norm': 0.08631166815757751, 'learning_rate': 6.834170854271357e-06, 'epoch': 77.28}


 97%|█████████▋| 967/1000 [1:17:49<02:38,  4.81s/it]

{'loss': 0.026, 'grad_norm': 0.08644237369298935, 'learning_rate': 6.633165829145729e-06, 'epoch': 77.36}


 97%|█████████▋| 968/1000 [1:17:54<02:32,  4.77s/it]

{'loss': 0.0264, 'grad_norm': 0.07752829045057297, 'learning_rate': 6.4321608040201015e-06, 'epoch': 77.44}


 97%|█████████▋| 969/1000 [1:17:59<02:28,  4.80s/it]

{'loss': 0.0262, 'grad_norm': 0.07790759950876236, 'learning_rate': 6.231155778894473e-06, 'epoch': 77.52}


 97%|█████████▋| 970/1000 [1:18:04<02:26,  4.88s/it]

{'loss': 0.0255, 'grad_norm': 0.0894148200750351, 'learning_rate': 6.030150753768844e-06, 'epoch': 77.6}


 97%|█████████▋| 971/1000 [1:18:09<02:19,  4.82s/it]

{'loss': 0.0261, 'grad_norm': 0.08206278085708618, 'learning_rate': 5.829145728643216e-06, 'epoch': 77.68}


 97%|█████████▋| 972/1000 [1:18:13<02:15,  4.83s/it]

{'loss': 0.0261, 'grad_norm': 0.0852530226111412, 'learning_rate': 5.628140703517588e-06, 'epoch': 77.76}


 97%|█████████▋| 973/1000 [1:18:18<02:10,  4.84s/it]

{'loss': 0.0255, 'grad_norm': 0.09311289340257645, 'learning_rate': 5.4271356783919595e-06, 'epoch': 77.84}


 97%|█████████▋| 974/1000 [1:18:23<02:07,  4.91s/it]

{'loss': 0.0265, 'grad_norm': 0.08342016488313675, 'learning_rate': 5.226130653266332e-06, 'epoch': 77.92}


 98%|█████████▊| 975/1000 [1:18:28<01:59,  4.77s/it]

{'loss': 0.0264, 'grad_norm': 0.09668774157762527, 'learning_rate': 5.025125628140704e-06, 'epoch': 78.0}


 98%|█████████▊| 976/1000 [1:18:33<01:56,  4.85s/it]

{'loss': 0.025, 'grad_norm': 0.08906036615371704, 'learning_rate': 4.824120603015075e-06, 'epoch': 78.08}


 98%|█████████▊| 977/1000 [1:18:38<01:51,  4.84s/it]

{'loss': 0.0264, 'grad_norm': 0.08168486505746841, 'learning_rate': 4.623115577889447e-06, 'epoch': 78.16}


 98%|█████████▊| 978/1000 [1:18:42<01:45,  4.78s/it]

{'loss': 0.0266, 'grad_norm': 0.07777266949415207, 'learning_rate': 4.422110552763819e-06, 'epoch': 78.24}


 98%|█████████▊| 979/1000 [1:18:47<01:40,  4.80s/it]

{'loss': 0.0262, 'grad_norm': 0.07949657738208771, 'learning_rate': 4.2211055276381906e-06, 'epoch': 78.32}


 98%|█████████▊| 980/1000 [1:18:52<01:37,  4.87s/it]

{'loss': 0.0257, 'grad_norm': 0.07996781915426254, 'learning_rate': 4.020100502512563e-06, 'epoch': 78.4}


 98%|█████████▊| 981/1000 [1:18:57<01:32,  4.86s/it]

{'loss': 0.0262, 'grad_norm': 0.08465506136417389, 'learning_rate': 3.819095477386935e-06, 'epoch': 78.48}


 98%|█████████▊| 982/1000 [1:19:02<01:26,  4.79s/it]

{'loss': 0.0262, 'grad_norm': 0.08968411386013031, 'learning_rate': 3.6180904522613065e-06, 'epoch': 78.56}


 98%|█████████▊| 983/1000 [1:19:07<01:22,  4.86s/it]

{'loss': 0.0262, 'grad_norm': 0.08586493134498596, 'learning_rate': 3.4170854271356786e-06, 'epoch': 78.64}


 98%|█████████▊| 984/1000 [1:19:11<01:16,  4.79s/it]

{'loss': 0.0256, 'grad_norm': 0.08546049147844315, 'learning_rate': 3.2160804020100507e-06, 'epoch': 78.72}


 98%|█████████▊| 985/1000 [1:19:16<01:11,  4.74s/it]

{'loss': 0.0267, 'grad_norm': 0.08138849586248398, 'learning_rate': 3.015075376884422e-06, 'epoch': 78.8}


 99%|█████████▊| 986/1000 [1:19:21<01:08,  4.88s/it]

{'loss': 0.0247, 'grad_norm': 0.08257867395877838, 'learning_rate': 2.814070351758794e-06, 'epoch': 78.88}


 99%|█████████▊| 987/1000 [1:19:26<01:03,  4.91s/it]

{'loss': 0.0257, 'grad_norm': 0.08293339610099792, 'learning_rate': 2.613065326633166e-06, 'epoch': 78.96}


 99%|█████████▉| 988/1000 [1:19:31<00:57,  4.77s/it]

{'loss': 0.0258, 'grad_norm': 0.09117025882005692, 'learning_rate': 2.4120603015075375e-06, 'epoch': 79.04}


 99%|█████████▉| 989/1000 [1:19:35<00:52,  4.79s/it]

{'loss': 0.0263, 'grad_norm': 0.08312851935625076, 'learning_rate': 2.2110552763819096e-06, 'epoch': 79.12}


 99%|█████████▉| 990/1000 [1:19:40<00:47,  4.79s/it]

{'loss': 0.0255, 'grad_norm': 0.09691477566957474, 'learning_rate': 2.0100502512562813e-06, 'epoch': 79.2}


 99%|█████████▉| 991/1000 [1:19:45<00:43,  4.86s/it]

{'loss': 0.0254, 'grad_norm': 0.08440438657999039, 'learning_rate': 1.8090452261306533e-06, 'epoch': 79.28}


 99%|█████████▉| 992/1000 [1:19:50<00:38,  4.85s/it]

{'loss': 0.0262, 'grad_norm': 0.0838073119521141, 'learning_rate': 1.6080402010050254e-06, 'epoch': 79.36}


 99%|█████████▉| 993/1000 [1:19:55<00:33,  4.78s/it]

{'loss': 0.0266, 'grad_norm': 0.0870126411318779, 'learning_rate': 1.407035175879397e-06, 'epoch': 79.44}


 99%|█████████▉| 994/1000 [1:19:59<00:28,  4.80s/it]

{'loss': 0.0255, 'grad_norm': 0.08560584485530853, 'learning_rate': 1.2060301507537688e-06, 'epoch': 79.52}


100%|█████████▉| 995/1000 [1:20:04<00:24,  4.86s/it]

{'loss': 0.0258, 'grad_norm': 0.08715270459651947, 'learning_rate': 1.0050251256281407e-06, 'epoch': 79.6}


100%|█████████▉| 996/1000 [1:20:09<00:19,  4.84s/it]

{'loss': 0.0256, 'grad_norm': 0.08447249978780746, 'learning_rate': 8.040201005025127e-07, 'epoch': 79.68}


100%|█████████▉| 997/1000 [1:20:14<00:14,  4.89s/it]

{'loss': 0.0255, 'grad_norm': 0.08907803893089294, 'learning_rate': 6.030150753768844e-07, 'epoch': 79.76}


100%|█████████▉| 998/1000 [1:20:19<00:09,  4.87s/it]

{'loss': 0.0262, 'grad_norm': 0.09384956955909729, 'learning_rate': 4.0201005025125634e-07, 'epoch': 79.84}


100%|█████████▉| 999/1000 [1:20:24<00:04,  4.85s/it]

{'loss': 0.0267, 'grad_norm': 0.08575721085071564, 'learning_rate': 2.0100502512562817e-07, 'epoch': 79.92}


100%|██████████| 1000/1000 [1:20:28<00:00,  4.69s/it]

{'loss': 0.026, 'grad_norm': 0.09506573528051376, 'learning_rate': 0.0, 'epoch': 80.0}


100%|██████████| 1000/1000 [1:20:29<00:00,  4.83s/it]

{'train_runtime': 4830.9779, 'train_samples_per_second': 1.656, 'train_steps_per_second': 0.207, 'train_loss': 0.06467841903679072, 'epoch': 80.0}





In [17]:
# Save the LoRA weights
model.save_pretrained("model_weights")

In [20]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

197.8541 seconds used for training.
3.3 minutes used for training.
Peak reserved memory = 7.471 GB.
Peak reserved memory for training = 1.242 GB.
Peak reserved memory % of max memory = 63.561 %.
Peak reserved memory for training % of max memory = 10.567 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [19]:
def inference(article_data):
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
    inputs = tokenizer(
    [
        prompt.format(
            article_data["Titular"], # titular
            article_data["Copete"], # summary
            article_data["Cuerpo"],# body
    ""
    )], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
    return tokenizer.batch_decode(outputs)

In [22]:
for i in range(4):
    print(sensationalist_dataset.iloc[i]["Amarillismo"])
    #print(inference(sensationalist_dataset.iloc[i])[0][-30:])

Amarillista
Amarillista
Amarillista
Amarillista


In [11]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        "¿Un parque Jurásico de la vida real?: podrían clonar dinosaurios después de un increíble descubrimiento de fósiles", # titular
        "Los expertos lograron descubrir células de ADN sanas en un dinosaurio fósil y se podría experimentar un escenario de parque jurásico de la vida real si los científicos pueden clonar al animal. Los detalles", # summary
        '''Los científicos podrían estar a punto de experimentar un gran avance al estilo Jurassic Park después de descubrir ADN saludable en un fósil de dinosaurio en China.El fósil desenterrado podría permitir la clonación de la criatura prehistórica y permitir experimentar escenas icónicas de la película de 1993 en la vida real. 
Los expertos creen que su fascinante descubrimiento podría contener el primer código genético prehistórico que aún no se descubrió.El escenario onírico podría convertirse en realidad si los profesionales pueden extraer cuidadosamente la información del fósil perfectamente conservado.

El fósil, que es una pieza de cartílago del muslo de un Caudipteryx se mantuvo en la ceniza volcánica, informó The Sun.

La aterradora criatura, que se parecía a los populares velociraptores de la película, caminó sobre la tierra hace unos 125 millones de años. La especie era conocida por su inteligencia y podía alcanzar velocidades de 24 millas por hora, lo que significa que podían cubrir una gran cantidad de terreno.

Fue descubierto por un equipo de la Academia de Ciencias de China que dijo que estaba en ""exquisitas"" condiciones. Los expertos dijeron que pudieron rastrear algunas células sanas entre otras que estaban enfermas al morir, informó un estudio.

Las células de cartílago bien conservadas se mineralizaron a través de un proceso llamado silicificación y no podrían considerarse ""roca"", ya que contienen restos de moléculas orgánicas. La profesora Alida Bailleul comentó: ""Obviamente estamos interesados en los núcleos de células fosilizadas porque aquí es donde estará la mayor parte del ADN, si se conserva el ADN. Tenemos datos preliminares buenos y emocionantes. Necesitamos averiguar exactamente qué son esas moléculas orgánicas, pero espero que podamos reconstruir una secuencia de ADN''',# body
""
)], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|> \nBelow you will find the title, summary, and body of a news article. \nYour task is to analyze these components and classify whether the article is sensationalist or not.\n\n### Article information:\n    Title: ¿Un parque Jurásico de la vida real?: podrían clonar dinosaurios después de un increíble descubrimiento de fósiles\n    Summary: Los expertos lograron descubrir células de ADN sanas en un dinosaurio fósil y se podría experimentar un escenario de parque jurásico de la vida real si los científicos pueden clonar al animal. Los detalles\n    Body: Los científicos podrían estar a punto de experimentar un gran avance al estilo Jurassic Park después de descubrir ADN saludable en un fósil de dinosaurio en China.El fósil desenterrado podría permitir la clonación de la criatura prehistórica y permitir experimentar escenas icónicas de la película de 1993 en la vida real. \nLos expertos creen que su fascinante descubrimiento podría contener el primer código genético pre

In [12]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        "SETI captura senales de radio anomalas", # titular
        "Posible nueva pista en la busqueda de vida extraterrestre", # summary
        '''El SETI Institute ha capturado senales de radio que no coinciden con patrones conocidos de fuentes naturales, lo que ha despertado interes sobre su posible origen extraterrestre.''',# body
""
)], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|> \nBelow you will find the title, summary, and body of a news article. \nYour task is to analyze these components and classify whether the article is sensationalist or not.\n\n### Article information:\n    Title: SETI captura senales de radio anomalas\n    Summary: Posible nueva pista en la busqueda de vida extraterrestre\n    Body: El SETI Institute ha capturado senales de radio que no coinciden con patrones conocidos de fuentes naturales, lo que ha despertado interes sobre su posible origen extraterrestre.\n\n### is this article sensationalist?:\n1<|end_of_text|>']

In [13]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        "Cambridge descubre bacteria que come plastico", # titular
        "Podria ayudar a resolver la crisis de contaminacion por plastico", # summary
        '''Cientificos de la University of Cambridge han descubierto una bacteria que puede descomponer rapidamente los plasticos comunes, ofreciendo una posible solucion a la contaminacion global''',# body
""
)], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|> \nBelow you will find the title, summary, and body of a news article. \nYour task is to analyze these components and classify whether the article is sensationalist or not.\n\n### Article information:\n    Title: Cambridge descubre bacteria que come plastico\n    Summary: Podria ayudar a resolver la crisis de contaminacion por plastico\n    Body: Cientificos de la University of Cambridge han descubierto una bacteria que puede descomponer rapidamente los plasticos comunes, ofreciendo una posible solucion a la contaminacion global\n\n### is this article sensationalist?:\n0<|end_of_text|>']

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13, 21, 34, 55, 89, 144<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nThe Eiffel Tower is a famous landmark in Paris, France. It is a wrought iron tower that was built in 1889 for the World's Fair. Standing at 324 meters tall, it is the tallest building in Paris and one of the most recognizable landmarks in the world.<|end_of_text|>"]

You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Support our work if you can! Thanks!
</div>