To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`.
3. Trains on completions / assistant responses only via `train_on_responses_only`.
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [8]:
#%%capture
#!pip install unsloth
# Also get the latest nightly Unsloth!
#!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ID2223JR/lora_model", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4060. Max memory: 7.739 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.
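
To make the "1 to 10% of all parameters" claim concrete, here is a rough count of the LoRA weights added by `r = 16` on the seven target modules. The layer shapes below (hidden size 3072, MLP size 8192, key/value width 1024, 28 layers) are an assumption about the loaded model, not something stated in this notebook; they are chosen because they are consistent with the 28 patched layers reported above. Each adapted linear layer of shape `in x out` gains two small matrices with `r * (in + out)` parameters in total.

```python
# Rough LoRA parameter count. All shapes are ASSUMED (Llama 3.2 3B-style),
# not read from the model; swap in your own model's config values.
r = 16
num_layers = 28
hidden, mlp, kv = 3072, 8192, 1024

# (input features, output features) for each target module in one layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total_lora_params = per_layer * num_layers
print(f"{total_lora_params:,} trainable LoRA parameters")
```

Under these assumed shapes the total agrees with the "Number of trainable parameters" figure printed in the training banner later in this notebook.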


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we convert it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
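
As a minimal sketch of how that layout is assembled, the hypothetical helper below (`render_llama3` is not part of Unsloth or the tokenizer) joins a list of `role` / `content` messages using the special tokens shown above. It deliberately omits the default system block that the real Llama 3.1 template adds, so treat it as an illustration of the format rather than a replacement for `apply_chat_template`.

```python
def render_llama3(messages):
    """Illustrative only: build the Llama-3 style text shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```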

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

Generating train split: 100%|██████████| 100000/100000 [00:01<00:00, 64960.21 examples/s]


We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
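
The conversion can be pictured with a few lines of plain Python. This is only a sketch of the renaming shown above, assuming the three ShareGPT speaker names in the example; `ROLE_MAP` and `to_role_content` are hypothetical names, and the real `standardize_sharegpt` may handle more cases.

```python
# Assumed mapping from ShareGPT speaker names to generic roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(turns):
    """Rename ("from", "value") keys to ("role", "content") for one conversation."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = to_role_content(sharegpt_turns)
for turn in converted:
    print(turn)
```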

In [5]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format: 100%|██████████| 100000/100000 [00:03<00:00, 26315.06 examples/s]
Map: 100%|██████████| 100000/100000 [00:08<00:00, 12205.11 examples/s]


We look at how the conversations are structured for item 5:

In [6]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` by default, so do not be alarmed!

In [7]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run sets `num_train_epochs = 1` for a full pass over the dataset; to speed things up, you can instead cap training with `max_steps` (for example, `max_steps = 60`) in place of `num_train_epochs`. We also support TRL's `DPOTrainer`!

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        # max_steps = 5,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        save_steps=25,
        save_total_limit=3
    ),
)

Map (num_proc=2): 100%|██████████| 100000/100000 [01:04<00:00, 1555.54 examples/s]
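
A quick back-of-the-envelope check of what these arguments imply, as a sketch rather than the trainer's own bookkeeping: the effective batch size is the per-device batch times the gradient accumulation steps (times the number of GPUs, assumed to be 1 here), and with `lr_scheduler_type = "linear"` the learning rate ramps up over `warmup_steps` and then decays linearly to zero. The helper name `linear_lr` is hypothetical.

```python
import math

num_examples = 100_000
per_device_batch = 2
grad_accum = 4
num_gpus = 1          # assumption: single-GPU run
epochs = 1

effective_batch = per_device_batch * grad_accum * num_gpus
total_steps = math.ceil(num_examples / effective_batch) * epochs

def linear_lr(step, base_lr=2e-4, warmup_steps=5, total=total_steps):
    """Linear warmup followed by linear decay to zero (illustrative)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup_steps))

print(f"effective batch size = {effective_batch}, total steps = {total_steps}")
print(f"learning rate at step 7 = {linear_lr(7)}")
```

These values line up with the training banner further down (total batch size 8, 12,500 steps), and the step-7 value agrees with the first learning rate logged after resuming.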


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [9]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   1%|          | 1000/100000 [00:00<01:01, 1605.99 examples/s]


KeyboardInterrupt: 

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
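
The idea behind the masking can be sketched on toy token ids. This is not Unsloth's implementation: `mask_non_responses` and the integer ids are hypothetical, and the sketch only assumes that every label outside an assistant response is set to `-100`, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The response runs until the next instruction marker, or to the end
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy ids: [1, 2] marks a user header, [1, 3] an assistant header, 9 ends a turn
USER, ASSISTANT = [1, 2], [1, 3]
input_ids = [0] + USER + [10, 11, 9] + ASSISTANT + [20, 21, 9] + USER + [12, 9] + ASSISTANT + [22, 9]
labels = mask_non_responses(input_ids, USER, ASSISTANT)
print(labels)
```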

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4060. Max memory = 7.739 GB.
3.619 GB of memory reserved.


In [None]:
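import os

# Start fresh if "outputs" is missing or empty (os.listdir raises
# FileNotFoundError on a missing directory); otherwise resume from
# the most recent checkpoint saved there.
if not os.path.isdir("outputs") or len(os.listdir("outputs")) == 0:
    trainer_stats = trainer.train()
else:
    print("Resuming from checkpoint!")
    trainer_stats = trainer.train(resume_from_checkpoint=True)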

Resuming from checkpoint!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 12,500
 "-____-"     Number of trainable parameters = 24,313,856
  0%|          | 7/12500 [00:07<3:39:04,  1.05s/it]

{'loss': 0.1667, 'grad_norm': 0.3911014497280121, 'learning_rate': 0.00019996798719487796, 'epoch': 0.0}


  0%|          | 8/12500 [00:12<6:18:00,  1.82s/it]

{'loss': 0.2737, 'grad_norm': 0.6797258853912354, 'learning_rate': 0.00019995198079231693, 'epoch': 0.0}


  0%|          | 9/12500 [00:27<14:36:33,  4.21s/it]

{'loss': 0.3676, 'grad_norm': 0.5032289028167725, 'learning_rate': 0.0001999359743897559, 'epoch': 0.0}


  0%|          | 10/12500 [00:31<14:46:11,  4.26s/it]

{'loss': 0.1916, 'grad_norm': 0.4824684262275696, 'learning_rate': 0.00019991996798719488, 'epoch': 0.0}


  0%|          | 11/12500 [00:39<18:03:33,  5.21s/it]

{'loss': 0.4125, 'grad_norm': 0.5483745336532593, 'learning_rate': 0.00019990396158463386, 'epoch': 0.0}


  0%|          | 12/12500 [00:47<20:43:01,  5.97s/it]

{'loss': 0.4985, 'grad_norm': 0.5703855752944946, 'learning_rate': 0.00019988795518207283, 'epoch': 0.0}


  0%|          | 13/12500 [00:51<18:38:07,  5.37s/it]

{'loss': 0.1706, 'grad_norm': 0.8027618527412415, 'learning_rate': 0.0001998719487795118, 'epoch': 0.0}


  0%|          | 14/12500 [01:00<21:40:02,  6.25s/it]

{'loss': 0.2401, 'grad_norm': 0.6066908836364746, 'learning_rate': 0.0001998559423769508, 'epoch': 0.0}


  0%|          | 15/12500 [01:07<22:42:02,  6.55s/it]

{'loss': 0.3474, 'grad_norm': 0.8655387163162231, 'learning_rate': 0.00019983993597438976, 'epoch': 0.0}


  0%|          | 16/12500 [01:12<21:25:03,  6.18s/it]

{'loss': 0.132, 'grad_norm': 0.9010872840881348, 'learning_rate': 0.00019982392957182873, 'epoch': 0.0}


  0%|          | 17/12500 [01:19<21:51:06,  6.30s/it]

{'loss': 0.3686, 'grad_norm': 1.2972244024276733, 'learning_rate': 0.0001998079231692677, 'epoch': 0.0}


  0%|          | 18/12500 [01:25<21:52:02,  6.31s/it]

{'loss': 0.2248, 'grad_norm': 1.3007174730300903, 'learning_rate': 0.0001997919167667067, 'epoch': 0.0}


  0%|          | 19/12500 [01:34<24:10:40,  6.97s/it]

{'loss': 0.3899, 'grad_norm': 1.1495357751846313, 'learning_rate': 0.00019977591036414566, 'epoch': 0.0}


  0%|          | 20/12500 [01:38<21:49:25,  6.30s/it]

{'loss': 0.3624, 'grad_norm': 1.1330926418304443, 'learning_rate': 0.00019975990396158463, 'epoch': 0.0}


  0%|          | 21/12500 [01:45<21:59:29,  6.34s/it]

{'loss': 0.4925, 'grad_norm': 0.7781209945678711, 'learning_rate': 0.00019974389755902363, 'epoch': 0.0}


  0%|          | 22/12500 [01:53<23:33:25,  6.80s/it]

{'loss': 0.4119, 'grad_norm': 0.8105002045631409, 'learning_rate': 0.0001997278911564626, 'epoch': 0.0}


  0%|          | 23/12500 [02:05<29:28:14,  8.50s/it]

{'loss': 0.8414, 'grad_norm': 0.5342639088630676, 'learning_rate': 0.00019971188475390156, 'epoch': 0.0}


  0%|          | 24/12500 [02:09<24:51:57,  7.18s/it]

{'loss': 0.2426, 'grad_norm': 0.9410050511360168, 'learning_rate': 0.00019969587835134053, 'epoch': 0.0}


  0%|          | 25/12500 [02:14<22:35:44,  6.52s/it]

{'loss': 0.2586, 'grad_norm': 0.6805701851844788, 'learning_rate': 0.00019967987194877953, 'epoch': 0.0}


  0%|          | 26/12500 [02:21<22:28:09,  6.48s/it]

{'loss': 0.3525, 'grad_norm': 0.6704532504081726, 'learning_rate': 0.0001996638655462185, 'epoch': 0.0}


  0%|          | 27/12500 [02:25<20:16:21,  5.85s/it]

{'loss': 0.2431, 'grad_norm': 0.7832268476486206, 'learning_rate': 0.00019964785914365746, 'epoch': 0.0}


  0%|          | 28/12500 [02:30<19:28:37,  5.62s/it]

{'loss': 0.2964, 'grad_norm': 0.5682904124259949, 'learning_rate': 0.00019963185274109646, 'epoch': 0.0}


  0%|          | 29/12500 [02:37<20:34:49,  5.94s/it]

{'loss': 0.5491, 'grad_norm': 0.5925678014755249, 'learning_rate': 0.00019961584633853543, 'epoch': 0.0}


  0%|          | 30/12500 [02:42<19:48:59,  5.72s/it]

{'loss': 0.576, 'grad_norm': 0.603959858417511, 'learning_rate': 0.0001995998399359744, 'epoch': 0.0}


  0%|          | 31/12500 [02:47<18:35:15,  5.37s/it]

{'loss': 0.224, 'grad_norm': 0.6859402656555176, 'learning_rate': 0.00019958383353341336, 'epoch': 0.0}


  0%|          | 32/12500 [02:57<23:44:47,  6.86s/it]

{'loss': 0.3002, 'grad_norm': 0.4962892234325409, 'learning_rate': 0.00019956782713085236, 'epoch': 0.0}


  0%|          | 33/12500 [03:02<21:56:39,  6.34s/it]

{'loss': 0.1764, 'grad_norm': 0.5131434798240662, 'learning_rate': 0.00019955182072829133, 'epoch': 0.0}


  0%|          | 34/12500 [03:07<20:30:16,  5.92s/it]

{'loss': 0.2216, 'grad_norm': 0.5144423246383667, 'learning_rate': 0.0001995358143257303, 'epoch': 0.0}


  0%|          | 35/12500 [03:13<20:10:29,  5.83s/it]

{'loss': 0.3252, 'grad_norm': 0.5598315000534058, 'learning_rate': 0.00019951980792316926, 'epoch': 0.0}


  0%|          | 36/12500 [03:17<19:10:22,  5.54s/it]

{'loss': 0.5419, 'grad_norm': 0.614459753036499, 'learning_rate': 0.00019950380152060826, 'epoch': 0.0}


  0%|          | 37/12500 [03:22<18:03:43,  5.22s/it]

{'loss': 0.3459, 'grad_norm': 0.8267794847488403, 'learning_rate': 0.00019948779511804723, 'epoch': 0.0}


  0%|          | 38/12500 [03:29<20:11:26,  5.83s/it]

{'loss': 0.4063, 'grad_norm': 0.5491129755973816, 'learning_rate': 0.0001994717887154862, 'epoch': 0.0}


  0%|          | 39/12500 [03:36<20:45:39,  6.00s/it]

{'loss': 0.3731, 'grad_norm': 0.5676537752151489, 'learning_rate': 0.00019945578231292518, 'epoch': 0.0}


  0%|          | 40/12500 [03:41<20:34:57,  5.95s/it]

{'loss': 0.6198, 'grad_norm': 0.6702191829681396, 'learning_rate': 0.00019943977591036416, 'epoch': 0.0}


  0%|          | 41/12500 [03:47<20:08:47,  5.82s/it]

{'loss': 0.2923, 'grad_norm': 0.7398924827575684, 'learning_rate': 0.00019942376950780313, 'epoch': 0.0}


  0%|          | 42/12500 [03:52<19:29:52,  5.63s/it]

{'loss': 0.5813, 'grad_norm': 0.7868342995643616, 'learning_rate': 0.0001994077631052421, 'epoch': 0.0}


  0%|          | 43/12500 [03:58<19:55:42,  5.76s/it]

{'loss': 0.4748, 'grad_norm': 0.8112051486968994, 'learning_rate': 0.00019939175670268108, 'epoch': 0.0}


  0%|          | 44/12500 [04:04<19:39:02,  5.68s/it]

{'loss': 0.382, 'grad_norm': 0.7205162048339844, 'learning_rate': 0.00019937575030012006, 'epoch': 0.0}


  0%|          | 45/12500 [04:09<19:09:29,  5.54s/it]

{'loss': 0.4142, 'grad_norm': 0.8259717226028442, 'learning_rate': 0.00019935974389755903, 'epoch': 0.0}


  0%|          | 46/12500 [04:13<17:41:44,  5.12s/it]

{'loss': 0.4452, 'grad_norm': 0.7769719958305359, 'learning_rate': 0.000199343737494998, 'epoch': 0.0}


  0%|          | 47/12500 [04:19<18:02:25,  5.22s/it]

{'loss': 0.3733, 'grad_norm': 0.7007361650466919, 'learning_rate': 0.00019932773109243698, 'epoch': 0.0}


  0%|          | 48/12500 [04:24<18:36:24,  5.38s/it]

{'loss': 0.4122, 'grad_norm': 0.6520771980285645, 'learning_rate': 0.00019931172468987596, 'epoch': 0.0}


  0%|          | 49/12500 [04:28<17:25:00,  5.04s/it]

{'loss': 0.4625, 'grad_norm': 0.8055165410041809, 'learning_rate': 0.00019929571828731493, 'epoch': 0.0}


  0%|          | 50/12500 [04:35<18:25:54,  5.33s/it]

{'loss': 0.7965, 'grad_norm': 0.6618154644966125, 'learning_rate': 0.0001992797118847539, 'epoch': 0.0}


  0%|          | 51/12500 [04:45<23:44:47,  6.87s/it]

{'loss': 0.3342, 'grad_norm': 0.481866717338562, 'learning_rate': 0.00019926370548219288, 'epoch': 0.0}


  0%|          | 52/12500 [04:57<29:17:29,  8.47s/it]

{'loss': 0.7379, 'grad_norm': 0.603222668170929, 'learning_rate': 0.00019924769907963185, 'epoch': 0.0}


  0%|          | 53/12500 [05:07<30:59:12,  8.96s/it]

{'loss': 1.1316, 'grad_norm': 2.0433261394500732, 'learning_rate': 0.00019923169267707086, 'epoch': 0.0}


  0%|          | 54/12500 [05:12<26:15:32,  7.60s/it]

{'loss': 0.4454, 'grad_norm': 0.5980958938598633, 'learning_rate': 0.0001992156862745098, 'epoch': 0.0}


  0%|          | 55/12500 [05:15<22:13:50,  6.43s/it]

{'loss': 0.5968, 'grad_norm': 0.8113086223602295, 'learning_rate': 0.00019919967987194878, 'epoch': 0.0}


  0%|          | 56/12500 [05:23<23:28:44,  6.79s/it]

{'loss': 0.9555, 'grad_norm': 0.6253562569618225, 'learning_rate': 0.00019918367346938775, 'epoch': 0.0}


  0%|          | 57/12500 [05:30<24:03:02,  6.96s/it]

{'loss': 0.6164, 'grad_norm': 0.5229809284210205, 'learning_rate': 0.00019916766706682676, 'epoch': 0.0}


  0%|          | 58/12500 [05:39<25:40:05,  7.43s/it]

{'loss': 0.6647, 'grad_norm': 0.5644242167472839, 'learning_rate': 0.0001991516606642657, 'epoch': 0.0}


  0%|          | 59/12500 [05:48<27:20:53,  7.91s/it]

{'loss': 0.6005, 'grad_norm': 0.6125190258026123, 'learning_rate': 0.00019913565426170468, 'epoch': 0.0}


  0%|          | 60/12500 [05:55<26:19:43,  7.62s/it]

{'loss': 0.7049, 'grad_norm': 0.6253220438957214, 'learning_rate': 0.00019911964785914368, 'epoch': 0.0}


  0%|          | 61/12500 [06:00<23:34:32,  6.82s/it]

{'loss': 0.539, 'grad_norm': 0.5719528198242188, 'learning_rate': 0.00019910364145658266, 'epoch': 0.0}


  0%|          | 62/12500 [06:09<26:16:34,  7.61s/it]

{'loss': 0.6785, 'grad_norm': 0.5412994623184204, 'learning_rate': 0.0001990876350540216, 'epoch': 0.0}


  1%|          | 63/12500 [06:16<25:24:45,  7.36s/it]

{'loss': 0.6538, 'grad_norm': 0.5134182572364807, 'learning_rate': 0.00019907162865146058, 'epoch': 0.01}


  1%|          | 64/12500 [06:21<23:08:02,  6.70s/it]

{'loss': 0.6009, 'grad_norm': 0.9365398287773132, 'learning_rate': 0.00019905562224889958, 'epoch': 0.01}


  1%|          | 65/12500 [06:30<25:42:59,  7.45s/it]

{'loss': 0.6689, 'grad_norm': 0.4797469675540924, 'learning_rate': 0.00019903961584633856, 'epoch': 0.01}


  1%|          | 66/12500 [06:38<25:54:38,  7.50s/it]

{'loss': 0.7293, 'grad_norm': 0.5955630540847778, 'learning_rate': 0.0001990236094437775, 'epoch': 0.01}


  1%|          | 67/12500 [06:42<21:58:34,  6.36s/it]

{'loss': 0.5833, 'grad_norm': 0.6273817420005798, 'learning_rate': 0.0001990076030412165, 'epoch': 0.01}


  1%|          | 68/12500 [06:48<22:01:47,  6.38s/it]

{'loss': 0.5745, 'grad_norm': 0.5656362771987915, 'learning_rate': 0.00019899159663865548, 'epoch': 0.01}


  1%|          | 69/12500 [06:58<25:08:07,  7.28s/it]

{'loss': 0.787, 'grad_norm': 0.4468249976634979, 'learning_rate': 0.00019897559023609445, 'epoch': 0.01}


  1%|          | 70/12500 [07:05<25:08:41,  7.28s/it]

{'loss': 0.4827, 'grad_norm': 0.5288363099098206, 'learning_rate': 0.0001989595838335334, 'epoch': 0.01}


  1%|          | 71/12500 [07:11<24:20:08,  7.05s/it]

{'loss': 0.3401, 'grad_norm': 0.5182632207870483, 'learning_rate': 0.0001989435774309724, 'epoch': 0.01}


  1%|          | 72/12500 [07:19<25:11:06,  7.30s/it]

{'loss': 0.5657, 'grad_norm': 0.4479951858520508, 'learning_rate': 0.00019892757102841138, 'epoch': 0.01}


  1%|          | 73/12500 [07:28<26:29:34,  7.67s/it]

{'loss': 0.7427, 'grad_norm': 0.5820627212524414, 'learning_rate': 0.00019891156462585035, 'epoch': 0.01}


  1%|          | 74/12500 [07:33<23:39:13,  6.85s/it]

{'loss': 0.7209, 'grad_norm': 0.6260631680488586, 'learning_rate': 0.00019889555822328933, 'epoch': 0.01}


  1%|          | 75/12500 [07:38<21:53:02,  6.34s/it]

{'loss': 0.4468, 'grad_norm': 0.6143075227737427, 'learning_rate': 0.0001988795518207283, 'epoch': 0.01}


  1%|          | 76/12500 [07:43<20:47:30,  6.02s/it]

{'loss': 0.5023, 'grad_norm': 0.6012539267539978, 'learning_rate': 0.00019886354541816728, 'epoch': 0.01}


  1%|          | 77/12500 [07:48<19:20:59,  5.61s/it]

{'loss': 0.8054, 'grad_norm': 0.6378673911094666, 'learning_rate': 0.00019884753901560625, 'epoch': 0.01}


  1%|          | 78/12500 [07:52<17:27:38,  5.06s/it]

{'loss': 0.5249, 'grad_norm': 0.6473673582077026, 'learning_rate': 0.00019883153261304523, 'epoch': 0.01}


  1%|          | 79/12500 [07:57<17:41:14,  5.13s/it]

{'loss': 0.798, 'grad_norm': 0.6459671854972839, 'learning_rate': 0.0001988155262104842, 'epoch': 0.01}


  1%|          | 80/12500 [08:03<18:25:37,  5.34s/it]

{'loss': 0.6628, 'grad_norm': 0.5723521113395691, 'learning_rate': 0.00019879951980792318, 'epoch': 0.01}


  1%|          | 81/12500 [08:10<20:17:02,  5.88s/it]

{'loss': 0.5519, 'grad_norm': 0.4915156662464142, 'learning_rate': 0.00019878351340536215, 'epoch': 0.01}


  1%|          | 82/12500 [08:14<19:02:16,  5.52s/it]

{'loss': 0.6544, 'grad_norm': 0.5798264741897583, 'learning_rate': 0.00019876750700280113, 'epoch': 0.01}


  1%|          | 83/12500 [08:20<19:17:50,  5.59s/it]

{'loss': 0.6863, 'grad_norm': 0.5539053082466125, 'learning_rate': 0.0001987515006002401, 'epoch': 0.01}


  1%|          | 84/12500 [08:28<21:58:03,  6.37s/it]

{'loss': 0.8972, 'grad_norm': 0.520886242389679, 'learning_rate': 0.00019873549419767908, 'epoch': 0.01}


  1%|          | 85/12500 [08:39<26:00:54,  7.54s/it]

{'loss': 0.8072, 'grad_norm': 0.4533173441886902, 'learning_rate': 0.00019871948779511805, 'epoch': 0.01}


  1%|          | 86/12500 [08:50<30:20:51,  8.80s/it]

{'loss': 0.5501, 'grad_norm': 0.397305428981781, 'learning_rate': 0.00019870348139255703, 'epoch': 0.01}


  1%|          | 87/12500 [08:57<27:38:51,  8.02s/it]

{'loss': 0.7727, 'grad_norm': 0.49649906158447266, 'learning_rate': 0.000198687474989996, 'epoch': 0.01}


  1%|          | 88/12500 [09:02<25:12:03,  7.31s/it]

{'loss': 0.5288, 'grad_norm': 0.4675672650337219, 'learning_rate': 0.000198671468587435, 'epoch': 0.01}


  1%|          | 89/12500 [09:11<26:35:24,  7.71s/it]

{'loss': 0.6928, 'grad_norm': 0.47786617279052734, 'learning_rate': 0.00019865546218487395, 'epoch': 0.01}


  1%|          | 90/12500 [09:21<29:06:21,  8.44s/it]

{'loss': 1.0963, 'grad_norm': 0.39170339703559875, 'learning_rate': 0.00019863945578231293, 'epoch': 0.01}


  1%|          | 91/12500 [09:26<25:43:11,  7.46s/it]

{'loss': 0.6702, 'grad_norm': 0.5242063403129578, 'learning_rate': 0.0001986234493797519, 'epoch': 0.01}


  1%|          | 92/12500 [09:35<27:03:15,  7.85s/it]

{'loss': 0.4589, 'grad_norm': 0.46046778559684753, 'learning_rate': 0.0001986074429771909, 'epoch': 0.01}


  1%|          | 93/12500 [09:39<23:25:51,  6.80s/it]

{'loss': 0.6596, 'grad_norm': 0.5478964447975159, 'learning_rate': 0.00019859143657462985, 'epoch': 0.01}


  1%|          | 94/12500 [09:47<24:18:33,  7.05s/it]

{'loss': 0.6286, 'grad_norm': 0.40815892815589905, 'learning_rate': 0.00019857543017206883, 'epoch': 0.01}


  1%|          | 95/12500 [09:55<25:31:34,  7.41s/it]

{'loss': 0.4533, 'grad_norm': 0.38491520285606384, 'learning_rate': 0.0001985594237695078, 'epoch': 0.01}


  1%|          | 96/12500 [09:59<22:05:15,  6.41s/it]

{'loss': 0.4083, 'grad_norm': 0.5119649767875671, 'learning_rate': 0.0001985434173669468, 'epoch': 0.01}


  1%|          | 97/12500 [10:03<19:10:59,  5.57s/it]

{'loss': 0.5267, 'grad_norm': 0.5158439874649048, 'learning_rate': 0.00019852741096438575, 'epoch': 0.01}


  1%|          | 98/12500 [10:08<18:36:00,  5.40s/it]

{'loss': 0.5642, 'grad_norm': 0.5369046330451965, 'learning_rate': 0.00019851140456182473, 'epoch': 0.01}


  1%|          | 99/12500 [10:13<17:49:34,  5.17s/it]

{'loss': 0.6491, 'grad_norm': 0.4832417070865631, 'learning_rate': 0.00019849539815926373, 'epoch': 0.01}


  1%|          | 100/12500 [10:19<19:19:48,  5.61s/it]

{'loss': 0.6384, 'grad_norm': 0.4537579119205475, 'learning_rate': 0.0001984793917567027, 'epoch': 0.01}


  1%|          | 101/12500 [10:26<20:17:51,  5.89s/it]

{'loss': 0.5788, 'grad_norm': 0.5820491909980774, 'learning_rate': 0.00019846338535414165, 'epoch': 0.01}


  1%|          | 102/12500 [10:32<20:20:10,  5.90s/it]

{'loss': 0.6507, 'grad_norm': 0.45764094591140747, 'learning_rate': 0.00019844737895158062, 'epoch': 0.01}


  1%|          | 103/12500 [10:38<20:49:25,  6.05s/it]

{'loss': 0.6729, 'grad_norm': 0.5244588255882263, 'learning_rate': 0.00019843137254901963, 'epoch': 0.01}


  1%|          | 104/12500 [10:44<21:00:26,  6.10s/it]

{'loss': 0.6403, 'grad_norm': 0.4698444604873657, 'learning_rate': 0.0001984153661464586, 'epoch': 0.01}


  1%|          | 105/12500 [10:54<24:48:02,  7.20s/it]

{'loss': 0.9301, 'grad_norm': 0.46965906023979187, 'learning_rate': 0.00019839935974389755, 'epoch': 0.01}


  1%|          | 106/12500 [10:59<22:55:07,  6.66s/it]

{'loss': 0.6909, 'grad_norm': 0.5460898876190186, 'learning_rate': 0.00019838335334133655, 'epoch': 0.01}


  1%|          | 107/12500 [11:05<21:54:12,  6.36s/it]

{'loss': 0.8218, 'grad_norm': 0.5489797592163086, 'learning_rate': 0.00019836734693877553, 'epoch': 0.01}


  1%|          | 108/12500 [11:15<25:49:09,  7.50s/it]

{'loss': 0.5767, 'grad_norm': 0.35311388969421387, 'learning_rate': 0.0001983513405362145, 'epoch': 0.01}


  1%|          | 109/12500 [11:21<24:13:03,  7.04s/it]

{'loss': 0.6268, 'grad_norm': 0.4845074713230133, 'learning_rate': 0.00019833533413365345, 'epoch': 0.01}


  1%|          | 110/12500 [11:28<24:19:23,  7.07s/it]

{'loss': 0.6803, 'grad_norm': 0.4574475884437561, 'learning_rate': 0.00019831932773109245, 'epoch': 0.01}


  1%|          | 111/12500 [11:34<22:22:18,  6.50s/it]

{'loss': 0.5393, 'grad_norm': 0.4430004060268402, 'learning_rate': 0.00019830332132853143, 'epoch': 0.01}


  1%|          | 112/12500 [11:38<20:38:05,  6.00s/it]

{'loss': 0.7106, 'grad_norm': 0.517854630947113, 'learning_rate': 0.0001982873149259704, 'epoch': 0.01}


  1%|          | 113/12500 [11:47<23:09:26,  6.73s/it]

{'loss': 0.8988, 'grad_norm': 0.417148619890213, 'learning_rate': 0.00019827130852340938, 'epoch': 0.01}


  1%|          | 114/12500 [11:51<20:07:13,  5.85s/it]

{'loss': 0.7739, 'grad_norm': 0.5272954702377319, 'learning_rate': 0.00019825530212084835, 'epoch': 0.01}


  1%|          | 115/12500 [11:59<22:43:51,  6.61s/it]

{'loss': 0.5856, 'grad_norm': 0.38822001218795776, 'learning_rate': 0.00019823929571828732, 'epoch': 0.01}


  1%|          | 116/12500 [12:05<22:15:50,  6.47s/it]

{'loss': 0.8069, 'grad_norm': 0.4183591604232788, 'learning_rate': 0.0001982232893157263, 'epoch': 0.01}


  1%|          | 117/12500 [12:11<21:18:00,  6.19s/it]

{'loss': 0.4977, 'grad_norm': 0.440715491771698, 'learning_rate': 0.00019820728291316527, 'epoch': 0.01}


  1%|          | 118/12500 [12:16<20:34:26,  5.98s/it]

{'loss': 0.6759, 'grad_norm': 0.5125614404678345, 'learning_rate': 0.00019819127651060425, 'epoch': 0.01}


  1%|          | 119/12500 [12:24<22:39:49,  6.59s/it]

{'loss': 0.6181, 'grad_norm': 0.3846866488456726, 'learning_rate': 0.00019817527010804322, 'epoch': 0.01}


  1%|          | 120/12500 [12:32<23:49:18,  6.93s/it]

{'loss': 0.3893, 'grad_norm': 0.37437474727630615, 'learning_rate': 0.0001981592637054822, 'epoch': 0.01}


  1%|          | 121/12500 [12:38<22:34:02,  6.56s/it]

{'loss': 0.5467, 'grad_norm': 0.36808788776397705, 'learning_rate': 0.00019814325730292117, 'epoch': 0.01}


  1%|          | 122/12500 [12:46<24:23:16,  7.09s/it]

{'loss': 0.7529, 'grad_norm': 0.3825523555278778, 'learning_rate': 0.00019812725090036015, 'epoch': 0.01}


  1%|          | 123/12500 [12:51<22:17:39,  6.48s/it]

{'loss': 0.5543, 'grad_norm': 0.5388244986534119, 'learning_rate': 0.00019811124449779912, 'epoch': 0.01}


  1%|          | 124/12500 [12:56<21:06:21,  6.14s/it]

{'loss': 0.7436, 'grad_norm': 0.40358179807662964, 'learning_rate': 0.0001980952380952381, 'epoch': 0.01}


  1%|          | 125/12500 [13:02<20:28:53,  5.96s/it]

{'loss': 0.707, 'grad_norm': 0.5439621210098267, 'learning_rate': 0.00019807923169267707, 'epoch': 0.01}


  1%|          | 126/12500 [13:10<22:40:02,  6.59s/it]

{'loss': 0.6408, 'grad_norm': 0.446395605802536, 'learning_rate': 0.00019806322529011605, 'epoch': 0.01}


  1%|          | 127/12500 [13:17<23:06:10,  6.72s/it]

{'loss': 0.8288, 'grad_norm': 0.3999890387058258, 'learning_rate': 0.00019804721888755505, 'epoch': 0.01}


  1%|          | 128/12500 [13:23<22:46:34,  6.63s/it]

{'loss': 0.488, 'grad_norm': 0.38897085189819336, 'learning_rate': 0.000198031212484994, 'epoch': 0.01}


  1%|          | 129/12500 [13:28<20:47:48,  6.05s/it]

{'loss': 0.5648, 'grad_norm': 0.45104295015335083, 'learning_rate': 0.00019801520608243297, 'epoch': 0.01}


  1%|          | 130/12500 [13:35<21:50:04,  6.35s/it]

{'loss': 0.7795, 'grad_norm': 0.3987937271595001, 'learning_rate': 0.00019799919967987195, 'epoch': 0.01}


  1%|          | 131/12500 [13:42<22:09:56,  6.45s/it]

{'loss': 0.7536, 'grad_norm': 0.3437504768371582, 'learning_rate': 0.00019798319327731095, 'epoch': 0.01}


  1%|          | 132/12500 [13:49<23:05:39,  6.72s/it]

{'loss': 0.9164, 'grad_norm': 0.3722570538520813, 'learning_rate': 0.0001979671868747499, 'epoch': 0.01}


  1%|          | 133/12500 [13:55<21:40:26,  6.31s/it]

{'loss': 0.6337, 'grad_norm': 0.41985058784484863, 'learning_rate': 0.00019795118047218887, 'epoch': 0.01}


  1%|          | 134/12500 [13:59<19:41:16,  5.73s/it]

{'loss': 0.5601, 'grad_norm': 0.4425332248210907, 'learning_rate': 0.00019793517406962787, 'epoch': 0.01}


  1%|          | 135/12500 [14:06<21:28:24,  6.25s/it]

{'loss': 0.3385, 'grad_norm': 0.33527547121047974, 'learning_rate': 0.00019791916766706685, 'epoch': 0.01}


  1%|          | 136/12500 [14:11<19:51:55,  5.78s/it]

{'loss': 0.5649, 'grad_norm': 0.42350876331329346, 'learning_rate': 0.0001979031612645058, 'epoch': 0.01}


  1%|          | 137/12500 [14:16<18:49:43,  5.48s/it]

{'loss': 0.5083, 'grad_norm': 0.43389540910720825, 'learning_rate': 0.00019788715486194477, 'epoch': 0.01}


  1%|          | 138/12500 [14:21<18:11:08,  5.30s/it]

{'loss': 0.4339, 'grad_norm': 0.4320802688598633, 'learning_rate': 0.00019787114845938377, 'epoch': 0.01}


  1%|          | 139/12500 [14:27<19:20:12,  5.63s/it]

{'loss': 0.4087, 'grad_norm': 0.4796780049800873, 'learning_rate': 0.00019785514205682275, 'epoch': 0.01}


  1%|          | 140/12500 [14:31<17:51:42,  5.20s/it]

{'loss': 0.8591, 'grad_norm': 0.5373741388320923, 'learning_rate': 0.0001978391356542617, 'epoch': 0.01}


  1%|          | 141/12500 [14:37<18:19:02,  5.34s/it]

{'loss': 0.7196, 'grad_norm': 0.4401150047779083, 'learning_rate': 0.0001978231292517007, 'epoch': 0.01}


  1%|          | 142/12500 [14:42<17:39:11,  5.14s/it]

{'loss': 0.6189, 'grad_norm': 0.47848016023635864, 'learning_rate': 0.00019780712284913967, 'epoch': 0.01}


  1%|          | 143/12500 [14:48<18:44:33,  5.46s/it]

{'loss': 0.6542, 'grad_norm': 0.4781818091869354, 'learning_rate': 0.00019779111644657865, 'epoch': 0.01}


  1%|          | 144/12500 [14:55<20:19:19,  5.92s/it]

{'loss': 0.7297, 'grad_norm': 0.4052198529243469, 'learning_rate': 0.0001977751100440176, 'epoch': 0.01}


  1%|          | 145/12500 [14:59<18:08:00,  5.28s/it]

{'loss': 0.6859, 'grad_norm': 0.4951677620410919, 'learning_rate': 0.0001977591036414566, 'epoch': 0.01}


  1%|          | 146/12500 [15:03<17:17:19,  5.04s/it]

{'loss': 0.8756, 'grad_norm': 0.5563678741455078, 'learning_rate': 0.00019774309723889557, 'epoch': 0.01}


  1%|          | 147/12500 [15:10<19:27:55,  5.67s/it]

{'loss': 0.6892, 'grad_norm': 0.3688381016254425, 'learning_rate': 0.00019772709083633455, 'epoch': 0.01}


  1%|          | 148/12500 [15:17<20:45:56,  6.05s/it]

{'loss': 0.7469, 'grad_norm': 0.3467714786529541, 'learning_rate': 0.0001977110844337735, 'epoch': 0.01}


  1%|          | 149/12500 [15:25<22:44:40,  6.63s/it]

{'loss': 0.7542, 'grad_norm': 0.42074304819107056, 'learning_rate': 0.0001976950780312125, 'epoch': 0.01}


  1%|          | 150/12500 [15:31<21:57:09,  6.40s/it]

{'loss': 0.6095, 'grad_norm': 0.42305684089660645, 'learning_rate': 0.00019767907162865147, 'epoch': 0.01}


  1%|          | 151/12500 [15:37<21:42:56,  6.33s/it]

{'loss': 0.5925, 'grad_norm': 0.46197959780693054, 'learning_rate': 0.00019766306522609045, 'epoch': 0.01}


  1%|          | 152/12500 [15:44<21:59:44,  6.41s/it]

{'loss': 1.0608, 'grad_norm': 0.5623894929885864, 'learning_rate': 0.00019764705882352942, 'epoch': 0.01}


  1%|          | 153/12500 [15:49<20:33:18,  5.99s/it]

{'loss': 0.544, 'grad_norm': 0.3792940378189087, 'learning_rate': 0.0001976310524209684, 'epoch': 0.01}


  1%|          | 154/12500 [15:54<19:46:32,  5.77s/it]

{'loss': 0.6539, 'grad_norm': 0.5398390889167786, 'learning_rate': 0.00019761504601840737, 'epoch': 0.01}


  1%|          | 155/12500 [16:00<19:49:31,  5.78s/it]

{'loss': 0.5972, 'grad_norm': 0.3565267026424408, 'learning_rate': 0.00019759903961584635, 'epoch': 0.01}


  1%|          | 156/12500 [16:07<21:12:03,  6.18s/it]

{'loss': 0.9864, 'grad_norm': 0.35401153564453125, 'learning_rate': 0.00019758303321328532, 'epoch': 0.01}


  1%|▏         | 157/12500 [16:13<21:17:37,  6.21s/it]

{'loss': 0.7458, 'grad_norm': 0.4565531313419342, 'learning_rate': 0.0001975670268107243, 'epoch': 0.01}


  1%|▏         | 158/12500 [16:19<21:10:45,  6.18s/it]

{'loss': 0.6976, 'grad_norm': 0.38153818249702454, 'learning_rate': 0.00019755102040816327, 'epoch': 0.01}


  1%|▏         | 159/12500 [16:25<20:16:55,  5.92s/it]

{'loss': 0.7632, 'grad_norm': 0.403201162815094, 'learning_rate': 0.00019753501400560227, 'epoch': 0.01}


  1%|▏         | 160/12500 [16:33<22:17:28,  6.50s/it]

{'loss': 1.0674, 'grad_norm': 0.4116418659687042, 'learning_rate': 0.00019751900760304122, 'epoch': 0.01}


  1%|▏         | 161/12500 [16:38<21:31:39,  6.28s/it]

{'loss': 0.7834, 'grad_norm': 0.4365048408508301, 'learning_rate': 0.0001975030012004802, 'epoch': 0.01}


  1%|▏         | 162/12500 [16:49<25:55:39,  7.57s/it]

{'loss': 0.8974, 'grad_norm': 0.2899264693260193, 'learning_rate': 0.00019748699479791917, 'epoch': 0.01}


  1%|▏         | 163/12500 [16:58<27:32:57,  8.04s/it]

{'loss': 0.9203, 'grad_norm': 0.3484085202217102, 'learning_rate': 0.00019747098839535817, 'epoch': 0.01}


  1%|▏         | 164/12500 [17:04<25:25:17,  7.42s/it]

{'loss': 0.6241, 'grad_norm': 0.3931088149547577, 'learning_rate': 0.00019745498199279712, 'epoch': 0.01}


  1%|▏         | 165/12500 [17:11<25:03:37,  7.31s/it]

{'loss': 0.7425, 'grad_norm': 0.44036492705345154, 'learning_rate': 0.0001974389755902361, 'epoch': 0.01}


  1%|▏         | 166/12500 [17:17<23:33:46,  6.88s/it]

{'loss': 0.5911, 'grad_norm': 0.42540010809898376, 'learning_rate': 0.0001974229691876751, 'epoch': 0.01}


  1%|▏         | 167/12500 [17:21<20:23:30,  5.95s/it]

{'loss': 0.6245, 'grad_norm': 0.46059200167655945, 'learning_rate': 0.00019740696278511407, 'epoch': 0.01}


  1%|▏         | 168/12500 [17:27<20:12:39,  5.90s/it]

{'loss': 0.8612, 'grad_norm': 0.37583404779434204, 'learning_rate': 0.00019739095638255302, 'epoch': 0.01}


  1%|▏         | 169/12500 [17:33<21:13:06,  6.19s/it]

{'loss': 0.8299, 'grad_norm': 0.3668367564678192, 'learning_rate': 0.000197374949979992, 'epoch': 0.01}


  1%|▏         | 170/12500 [17:39<20:24:57,  5.96s/it]

{'loss': 0.6603, 'grad_norm': 0.4559849500656128, 'learning_rate': 0.000197358943577431, 'epoch': 0.01}


  1%|▏         | 171/12500 [17:44<19:34:03,  5.71s/it]

{'loss': 0.7616, 'grad_norm': 0.390722393989563, 'learning_rate': 0.00019734293717486997, 'epoch': 0.01}


  1%|▏         | 172/12500 [17:50<19:45:19,  5.77s/it]

{'loss': 0.9121, 'grad_norm': 0.36458322405815125, 'learning_rate': 0.00019732693077230892, 'epoch': 0.01}


  1%|▏         | 173/12500 [17:56<19:44:24,  5.76s/it]

{'loss': 0.7756, 'grad_norm': 0.332014262676239, 'learning_rate': 0.00019731092436974792, 'epoch': 0.01}


  1%|▏         | 174/12500 [18:01<18:50:05,  5.50s/it]

{'loss': 0.9617, 'grad_norm': 0.42926183342933655, 'learning_rate': 0.0001972949179671869, 'epoch': 0.01}


  1%|▏         | 175/12500 [18:08<20:57:06,  6.12s/it]

{'loss': 0.756, 'grad_norm': 0.38151815533638, 'learning_rate': 0.00019727891156462587, 'epoch': 0.01}


  1%|▏         | 176/12500 [18:15<21:48:37,  6.37s/it]

{'loss': 0.7335, 'grad_norm': 0.43427494168281555, 'learning_rate': 0.00019726290516206482, 'epoch': 0.01}


  1%|▏         | 177/12500 [18:20<20:42:02,  6.05s/it]

{'loss': 0.6895, 'grad_norm': 0.3281276226043701, 'learning_rate': 0.00019724689875950382, 'epoch': 0.01}


  1%|▏         | 178/12500 [18:24<18:38:11,  5.44s/it]

{'loss': 0.8827, 'grad_norm': 0.3988102674484253, 'learning_rate': 0.0001972308923569428, 'epoch': 0.01}


  1%|▏         | 179/12500 [18:31<20:21:33,  5.95s/it]

{'loss': 0.6922, 'grad_norm': 0.38832804560661316, 'learning_rate': 0.00019721488595438177, 'epoch': 0.01}


  1%|▏         | 180/12500 [18:37<19:56:53,  5.83s/it]

{'loss': 0.6829, 'grad_norm': 0.37162500619888306, 'learning_rate': 0.00019719887955182074, 'epoch': 0.01}


  1%|▏         | 181/12500 [18:44<21:13:49,  6.20s/it]

{'loss': 0.8086, 'grad_norm': 0.3349170386791229, 'learning_rate': 0.00019718287314925972, 'epoch': 0.01}


  1%|▏         | 182/12500 [18:52<23:19:19,  6.82s/it]

{'loss': 0.869, 'grad_norm': 0.2889172434806824, 'learning_rate': 0.0001971668667466987, 'epoch': 0.01}


  1%|▏         | 183/12500 [18:56<20:28:20,  5.98s/it]

{'loss': 0.8044, 'grad_norm': 0.4046475291252136, 'learning_rate': 0.00019715086034413767, 'epoch': 0.01}


  1%|▏         | 184/12500 [19:01<18:59:42,  5.55s/it]

{'loss': 1.0065, 'grad_norm': 0.42987900972366333, 'learning_rate': 0.00019713485394157664, 'epoch': 0.01}


  1%|▏         | 185/12500 [19:10<22:48:57,  6.67s/it]

{'loss': 0.7268, 'grad_norm': 0.30803874135017395, 'learning_rate': 0.00019711884753901562, 'epoch': 0.01}


  1%|▏         | 186/12500 [19:15<21:18:33,  6.23s/it]

{'loss': 0.8385, 'grad_norm': 0.3703368902206421, 'learning_rate': 0.0001971028411364546, 'epoch': 0.01}


  1%|▏         | 187/12500 [19:21<21:01:27,  6.15s/it]

{'loss': 0.6445, 'grad_norm': 0.33903470635414124, 'learning_rate': 0.00019708683473389357, 'epoch': 0.01}


  2%|▏         | 188/12500 [19:27<20:51:49,  6.10s/it]

{'loss': 0.6752, 'grad_norm': 0.2965656518936157, 'learning_rate': 0.00019707082833133254, 'epoch': 0.02}


  2%|▏         | 189/12500 [19:33<20:47:27,  6.08s/it]

{'loss': 0.7548, 'grad_norm': 0.32747042179107666, 'learning_rate': 0.00019705482192877152, 'epoch': 0.02}


  2%|▏         | 190/12500 [19:38<18:51:43,  5.52s/it]

{'loss': 0.7219, 'grad_norm': 0.3633004128932953, 'learning_rate': 0.0001970388155262105, 'epoch': 0.02}


  2%|▏         | 191/12500 [19:45<20:35:13,  6.02s/it]

{'loss': 0.8586, 'grad_norm': 0.2871670126914978, 'learning_rate': 0.00019702280912364947, 'epoch': 0.02}


  2%|▏         | 192/12500 [19:54<23:40:46,  6.93s/it]

{'loss': 0.8274, 'grad_norm': 0.2583906948566437, 'learning_rate': 0.00019700680272108844, 'epoch': 0.02}


  2%|▏         | 193/12500 [19:59<22:19:52,  6.53s/it]

{'loss': 0.6734, 'grad_norm': 0.31790047883987427, 'learning_rate': 0.00019699079631852742, 'epoch': 0.02}


  2%|▏         | 194/12500 [20:05<20:49:40,  6.09s/it]

{'loss': 0.7222, 'grad_norm': 0.32833969593048096, 'learning_rate': 0.00019697478991596642, 'epoch': 0.02}


  2%|▏         | 195/12500 [20:11<21:35:55,  6.32s/it]

{'loss': 0.9234, 'grad_norm': 0.36127927899360657, 'learning_rate': 0.00019695878351340537, 'epoch': 0.02}


  2%|▏         | 196/12500 [20:15<18:46:33,  5.49s/it]

{'loss': 0.5689, 'grad_norm': 0.34625011682510376, 'learning_rate': 0.00019694277711084434, 'epoch': 0.02}


  2%|▏         | 197/12500 [20:20<18:43:36,  5.48s/it]

{'loss': 0.4579, 'grad_norm': 0.26838424801826477, 'learning_rate': 0.00019692677070828332, 'epoch': 0.02}


  2%|▏         | 198/12500 [20:30<22:47:18,  6.67s/it]

{'loss': 0.6229, 'grad_norm': 0.2673976421356201, 'learning_rate': 0.00019691076430572232, 'epoch': 0.02}


  2%|▏         | 199/12500 [20:34<20:17:29,  5.94s/it]

{'loss': 0.8003, 'grad_norm': 0.48642629384994507, 'learning_rate': 0.00019689475790316127, 'epoch': 0.02}


  2%|▏         | 200/12500 [20:40<20:41:52,  6.06s/it]

{'loss': 0.4639, 'grad_norm': 0.30557242035865784, 'learning_rate': 0.00019687875150060024, 'epoch': 0.02}


  2%|▏         | 201/12500 [20:47<21:08:30,  6.19s/it]

{'loss': 0.6233, 'grad_norm': 0.38219478726387024, 'learning_rate': 0.00019686274509803922, 'epoch': 0.02}


  2%|▏         | 202/12500 [20:52<20:14:31,  5.93s/it]

{'loss': 0.7762, 'grad_norm': 0.4120645523071289, 'learning_rate': 0.00019684673869547822, 'epoch': 0.02}


  2%|▏         | 203/12500 [20:59<20:47:14,  6.09s/it]

{'loss': 0.6301, 'grad_norm': 0.3459160327911377, 'learning_rate': 0.00019683073229291717, 'epoch': 0.02}


  2%|▏         | 204/12500 [21:04<20:02:20,  5.87s/it]

{'loss': 0.4533, 'grad_norm': 0.3211294412612915, 'learning_rate': 0.00019681472589035614, 'epoch': 0.02}


  2%|▏         | 205/12500 [21:10<20:10:30,  5.91s/it]

{'loss': 0.6599, 'grad_norm': 0.3358149826526642, 'learning_rate': 0.00019679871948779514, 'epoch': 0.02}


  2%|▏         | 206/12500 [21:18<22:30:20,  6.59s/it]

{'loss': 0.6601, 'grad_norm': 0.2903578579425812, 'learning_rate': 0.00019678271308523412, 'epoch': 0.02}


  2%|▏         | 207/12500 [21:25<23:04:03,  6.76s/it]

{'loss': 0.5912, 'grad_norm': 0.29046764969825745, 'learning_rate': 0.00019676670668267307, 'epoch': 0.02}


  2%|▏         | 208/12500 [21:32<22:41:58,  6.65s/it]

{'loss': 0.6526, 'grad_norm': 0.32940393686294556, 'learning_rate': 0.00019675070028011204, 'epoch': 0.02}


  2%|▏         | 209/12500 [21:40<24:06:57,  7.06s/it]

{'loss': 0.5646, 'grad_norm': 0.3082258105278015, 'learning_rate': 0.00019673469387755104, 'epoch': 0.02}


  2%|▏         | 210/12500 [21:48<25:33:41,  7.49s/it]

{'loss': 0.8887, 'grad_norm': 0.3220681846141815, 'learning_rate': 0.00019671868747499002, 'epoch': 0.02}


  2%|▏         | 211/12500 [21:56<26:10:16,  7.67s/it]

{'loss': 0.7445, 'grad_norm': 0.2831409275531769, 'learning_rate': 0.00019670268107242897, 'epoch': 0.02}


  2%|▏         | 212/12500 [22:00<21:59:09,  6.44s/it]

{'loss': 0.5391, 'grad_norm': 0.41848185658454895, 'learning_rate': 0.00019668667466986797, 'epoch': 0.02}


  2%|▏         | 213/12500 [22:06<21:51:07,  6.40s/it]

{'loss': 0.7569, 'grad_norm': 0.3357895016670227, 'learning_rate': 0.00019667066826730694, 'epoch': 0.02}


  2%|▏         | 214/12500 [22:16<25:18:41,  7.42s/it]

{'loss': 1.034, 'grad_norm': 0.31429511308670044, 'learning_rate': 0.00019665466186474592, 'epoch': 0.02}


  2%|▏         | 215/12500 [22:21<22:31:22,  6.60s/it]

{'loss': 0.681, 'grad_norm': 0.4306037425994873, 'learning_rate': 0.00019663865546218486, 'epoch': 0.02}


  2%|▏         | 216/12500 [22:29<24:28:24,  7.17s/it]

{'loss': 0.7127, 'grad_norm': 0.2737730145454407, 'learning_rate': 0.00019662264905962387, 'epoch': 0.02}


  2%|▏         | 217/12500 [22:39<26:42:24,  7.83s/it]

{'loss': 0.9606, 'grad_norm': 0.24394996464252472, 'learning_rate': 0.00019660664265706284, 'epoch': 0.02}


  2%|▏         | 218/12500 [22:42<21:53:12,  6.42s/it]

{'loss': 0.7818, 'grad_norm': 0.46067389845848083, 'learning_rate': 0.00019659063625450182, 'epoch': 0.02}


  2%|▏         | 219/12500 [22:48<22:06:31,  6.48s/it]

{'loss': 0.7391, 'grad_norm': 0.32431191205978394, 'learning_rate': 0.0001965746298519408, 'epoch': 0.02}


  2%|▏         | 220/12500 [22:58<24:58:17,  7.32s/it]

{'loss': 0.8906, 'grad_norm': 0.2764429450035095, 'learning_rate': 0.00019655862344937977, 'epoch': 0.02}


  2%|▏         | 221/12500 [23:03<22:55:46,  6.72s/it]

{'loss': 0.601, 'grad_norm': 0.33682146668434143, 'learning_rate': 0.00019654261704681874, 'epoch': 0.02}


  2%|▏         | 222/12500 [23:08<21:28:23,  6.30s/it]

{'loss': 0.8542, 'grad_norm': 0.36968278884887695, 'learning_rate': 0.00019652661064425772, 'epoch': 0.02}


  2%|▏         | 223/12500 [23:13<19:44:22,  5.79s/it]

{'loss': 0.7832, 'grad_norm': 0.38933923840522766, 'learning_rate': 0.0001965106042416967, 'epoch': 0.02}


  2%|▏         | 224/12500 [23:17<18:15:06,  5.35s/it]

{'loss': 0.6721, 'grad_norm': 0.41567128896713257, 'learning_rate': 0.00019649459783913567, 'epoch': 0.02}


  2%|▏         | 225/12500 [23:22<17:12:41,  5.05s/it]

{'loss': 0.607, 'grad_norm': 0.43557095527648926, 'learning_rate': 0.00019647859143657464, 'epoch': 0.02}


  2%|▏         | 226/12500 [23:29<19:38:15,  5.76s/it]

{'loss': 0.4781, 'grad_norm': 0.34538331627845764, 'learning_rate': 0.00019646258503401361, 'epoch': 0.02}


  2%|▏         | 227/12500 [23:34<19:04:26,  5.59s/it]

{'loss': 0.8197, 'grad_norm': 0.3770560026168823, 'learning_rate': 0.0001964465786314526, 'epoch': 0.02}


  2%|▏         | 228/12500 [23:38<17:41:51,  5.19s/it]

{'loss': 0.5802, 'grad_norm': 0.3814883828163147, 'learning_rate': 0.00019643057222889156, 'epoch': 0.02}


  2%|▏         | 229/12500 [23:44<18:36:32,  5.46s/it]

{'loss': 0.5459, 'grad_norm': 0.3654692769050598, 'learning_rate': 0.00019641456582633054, 'epoch': 0.02}


  2%|▏         | 230/12500 [23:52<21:02:52,  6.18s/it]

{'loss': 0.8374, 'grad_norm': 0.3045820891857147, 'learning_rate': 0.00019639855942376951, 'epoch': 0.02}


  2%|▏         | 231/12500 [23:59<21:10:52,  6.22s/it]

{'loss': 0.6895, 'grad_norm': 0.3635501563549042, 'learning_rate': 0.0001963825530212085, 'epoch': 0.02}


  2%|▏         | 232/12500 [24:02<18:22:12,  5.39s/it]

{'loss': 0.5594, 'grad_norm': 0.3895236849784851, 'learning_rate': 0.00019636654661864746, 'epoch': 0.02}


  2%|▏         | 233/12500 [24:06<17:15:22,  5.06s/it]

{'loss': 0.4815, 'grad_norm': 0.36852145195007324, 'learning_rate': 0.00019635054021608647, 'epoch': 0.02}


  2%|▏         | 234/12500 [24:14<20:02:07,  5.88s/it]

{'loss': 0.7683, 'grad_norm': 0.3354153633117676, 'learning_rate': 0.00019633453381352541, 'epoch': 0.02}


  2%|▏         | 235/12500 [24:21<21:10:05,  6.21s/it]

{'loss': 1.058, 'grad_norm': 0.3366694450378418, 'learning_rate': 0.0001963185274109644, 'epoch': 0.02}


  2%|▏         | 236/12500 [24:25<19:06:26,  5.61s/it]

{'loss': 0.5932, 'grad_norm': 0.34380635619163513, 'learning_rate': 0.00019630252100840336, 'epoch': 0.02}


  2%|▏         | 237/12500 [24:34<22:01:44,  6.47s/it]

{'loss': 0.9303, 'grad_norm': 0.302432119846344, 'learning_rate': 0.00019628651460584237, 'epoch': 0.02}


  2%|▏         | 238/12500 [24:38<19:39:43,  5.77s/it]

{'loss': 0.728, 'grad_norm': 0.392966091632843, 'learning_rate': 0.0001962705082032813, 'epoch': 0.02}


  2%|▏         | 239/12500 [24:43<18:35:46,  5.46s/it]

{'loss': 0.8484, 'grad_norm': 0.3515314757823944, 'learning_rate': 0.0001962545018007203, 'epoch': 0.02}


  2%|▏         | 240/12500 [24:50<20:36:53,  6.05s/it]

{'loss': 0.8837, 'grad_norm': 0.2917368710041046, 'learning_rate': 0.0001962384953981593, 'epoch': 0.02}


  2%|▏         | 241/12500 [24:57<21:42:23,  6.37s/it]

{'loss': 0.4319, 'grad_norm': 0.29166895151138306, 'learning_rate': 0.00019622248899559826, 'epoch': 0.02}


  2%|▏         | 242/12500 [25:02<20:23:03,  5.99s/it]

{'loss': 0.6923, 'grad_norm': 0.4100104570388794, 'learning_rate': 0.0001962064825930372, 'epoch': 0.02}


  2%|▏         | 243/12500 [25:06<18:24:37,  5.41s/it]

{'loss': 0.5595, 'grad_norm': 0.3690352439880371, 'learning_rate': 0.0001961904761904762, 'epoch': 0.02}


  2%|▏         | 244/12500 [25:12<18:14:30,  5.36s/it]

{'loss': 0.7177, 'grad_norm': 0.38931071758270264, 'learning_rate': 0.0001961744697879152, 'epoch': 0.02}


  2%|▏         | 245/12500 [25:20<21:06:27,  6.20s/it]

{'loss': 0.7353, 'grad_norm': 0.2532319724559784, 'learning_rate': 0.00019615846338535416, 'epoch': 0.02}


  2%|▏         | 246/12500 [25:24<18:40:29,  5.49s/it]

{'loss': 0.889, 'grad_norm': 0.47482311725616455, 'learning_rate': 0.0001961424569827931, 'epoch': 0.02}


  2%|▏         | 247/12500 [25:29<18:40:12,  5.49s/it]

{'loss': 1.1711, 'grad_norm': 0.42971259355545044, 'learning_rate': 0.00019612645058023211, 'epoch': 0.02}


  2%|▏         | 248/12500 [25:34<17:38:13,  5.18s/it]

{'loss': 0.6599, 'grad_norm': 0.38317257165908813, 'learning_rate': 0.0001961104441776711, 'epoch': 0.02}


  2%|▏         | 249/12500 [25:40<18:48:44,  5.53s/it]

{'loss': 0.7017, 'grad_norm': 0.34452614188194275, 'learning_rate': 0.00019609443777511006, 'epoch': 0.02}


  2%|▏         | 250/12500 [25:45<18:26:21,  5.42s/it]

{'loss': 0.6398, 'grad_norm': 0.35658806562423706, 'learning_rate': 0.000196078431372549, 'epoch': 0.02}


  2%|▏         | 251/12500 [25:51<19:22:18,  5.69s/it]

{'loss': 0.6947, 'grad_norm': 0.39730119705200195, 'learning_rate': 0.000196062424969988, 'epoch': 0.02}


  2%|▏         | 252/12500 [25:58<19:48:38,  5.82s/it]

{'loss': 0.7579, 'grad_norm': 0.34090545773506165, 'learning_rate': 0.000196046418567427, 'epoch': 0.02}


  2%|▏         | 253/12500 [26:03<19:42:42,  5.79s/it]

{'loss': 0.6803, 'grad_norm': 0.423036128282547, 'learning_rate': 0.00019603041216486596, 'epoch': 0.02}


  2%|▏         | 254/12500 [26:09<19:29:42,  5.73s/it]

{'loss': 0.9285, 'grad_norm': 0.32104793190956116, 'learning_rate': 0.00019601440576230494, 'epoch': 0.02}


  2%|▏         | 255/12500 [26:15<20:12:19,  5.94s/it]

{'loss': 0.7948, 'grad_norm': 0.3226337730884552, 'learning_rate': 0.0001959983993597439, 'epoch': 0.02}


  2%|▏         | 256/12500 [26:25<24:17:08,  7.14s/it]

{'loss': 0.8775, 'grad_norm': 0.2894560396671295, 'learning_rate': 0.0001959823929571829, 'epoch': 0.02}


  2%|▏         | 257/12500 [26:32<23:43:16,  6.98s/it]

{'loss': 0.6991, 'grad_norm': 0.34692251682281494, 'learning_rate': 0.00019596638655462186, 'epoch': 0.02}


  2%|▏         | 258/12500 [26:38<23:20:45,  6.87s/it]

{'loss': 0.6624, 'grad_norm': 0.41941678524017334, 'learning_rate': 0.00019595038015206084, 'epoch': 0.02}


  2%|▏         | 259/12500 [26:44<22:09:35,  6.52s/it]

{'loss': 0.7732, 'grad_norm': 0.29319676756858826, 'learning_rate': 0.0001959343737494998, 'epoch': 0.02}


  2%|▏         | 260/12500 [26:49<20:15:43,  5.96s/it]

{'loss': 0.9019, 'grad_norm': 0.3404580056667328, 'learning_rate': 0.0001959183673469388, 'epoch': 0.02}


  2%|▏         | 261/12500 [26:56<21:47:31,  6.41s/it]

{'loss': 0.8198, 'grad_norm': 0.2893829047679901, 'learning_rate': 0.00019590236094437776, 'epoch': 0.02}


  2%|▏         | 262/12500 [27:02<20:54:23,  6.15s/it]

{'loss': 0.4208, 'grad_norm': 0.359256386756897, 'learning_rate': 0.00019588635454181674, 'epoch': 0.02}


  2%|▏         | 263/12500 [27:07<20:09:59,  5.93s/it]

{'loss': 0.5155, 'grad_norm': 0.3082447052001953, 'learning_rate': 0.0001958703481392557, 'epoch': 0.02}


  2%|▏         | 264/12500 [27:14<21:19:38,  6.27s/it]

{'loss': 0.9444, 'grad_norm': 0.3528755009174347, 'learning_rate': 0.0001958543417366947, 'epoch': 0.02}


  2%|▏         | 265/12500 [27:21<22:13:55,  6.54s/it]

{'loss': 0.6044, 'grad_norm': 0.2957383692264557, 'learning_rate': 0.00019583833533413366, 'epoch': 0.02}


  2%|▏         | 266/12500 [27:28<22:17:26,  6.56s/it]

{'loss': 0.5482, 'grad_norm': 0.2888207733631134, 'learning_rate': 0.00019582232893157264, 'epoch': 0.02}


  2%|▏         | 267/12500 [27:32<19:37:27,  5.78s/it]

{'loss': 0.7929, 'grad_norm': 0.3761819005012512, 'learning_rate': 0.0001958063225290116, 'epoch': 0.02}


  2%|▏         | 268/12500 [27:38<20:04:40,  5.91s/it]

{'loss': 0.6606, 'grad_norm': 0.3373166024684906, 'learning_rate': 0.00019579031612645059, 'epoch': 0.02}


  2%|▏         | 269/12500 [27:43<18:32:22,  5.46s/it]

{'loss': 0.7076, 'grad_norm': 0.42424774169921875, 'learning_rate': 0.00019577430972388956, 'epoch': 0.02}


  2%|▏         | 270/12500 [27:50<20:51:51,  6.14s/it]

{'loss': 0.9912, 'grad_norm': 0.28486359119415283, 'learning_rate': 0.00019575830332132854, 'epoch': 0.02}


  2%|▏         | 271/12500 [27:58<21:51:15,  6.43s/it]

{'loss': 0.3686, 'grad_norm': 0.2567349672317505, 'learning_rate': 0.0001957422969187675, 'epoch': 0.02}


  2%|▏         | 272/12500 [28:03<21:19:06,  6.28s/it]

{'loss': 0.7882, 'grad_norm': 0.31885668635368347, 'learning_rate': 0.0001957262905162065, 'epoch': 0.02}


  2%|▏         | 273/12500 [28:11<22:28:05,  6.62s/it]

{'loss': 0.948, 'grad_norm': 0.2955082654953003, 'learning_rate': 0.00019571028411364546, 'epoch': 0.02}


  2%|▏         | 274/12500 [28:17<22:01:35,  6.49s/it]

{'loss': 0.6275, 'grad_norm': 0.34203848242759705, 'learning_rate': 0.00019569427771108444, 'epoch': 0.02}


  2%|▏         | 275/12500 [28:27<25:34:52,  7.53s/it]

{'loss': 0.6149, 'grad_norm': 0.2596248984336853, 'learning_rate': 0.0001956782713085234, 'epoch': 0.02}


  2%|▏         | 276/12500 [28:35<25:35:13,  7.54s/it]

{'loss': 0.6714, 'grad_norm': 0.29869237542152405, 'learning_rate': 0.0001956622649059624, 'epoch': 0.02}


  2%|▏         | 277/12500 [28:41<24:08:07,  7.11s/it]

{'loss': 0.391, 'grad_norm': 0.2546151280403137, 'learning_rate': 0.00019564625850340136, 'epoch': 0.02}


  2%|▏         | 278/12500 [28:50<26:17:05,  7.74s/it]

{'loss': 0.8693, 'grad_norm': 0.2725158631801605, 'learning_rate': 0.00019563025210084033, 'epoch': 0.02}


  2%|▏         | 279/12500 [28:54<22:11:02,  6.53s/it]

{'loss': 0.7821, 'grad_norm': 0.37214404344558716, 'learning_rate': 0.00019561424569827934, 'epoch': 0.02}


  2%|▏         | 280/12500 [28:58<19:35:45,  5.77s/it]

{'loss': 0.6556, 'grad_norm': 0.4127039909362793, 'learning_rate': 0.0001955982392957183, 'epoch': 0.02}


  2%|▏         | 281/12500 [29:05<20:58:08,  6.18s/it]

{'loss': 1.0262, 'grad_norm': 0.34261825680732727, 'learning_rate': 0.00019558223289315726, 'epoch': 0.02}


  2%|▏         | 282/12500 [29:10<19:44:59,  5.82s/it]

{'loss': 0.6467, 'grad_norm': 0.3291308283805847, 'learning_rate': 0.00019556622649059623, 'epoch': 0.02}


  2%|▏         | 283/12500 [29:16<20:05:04,  5.92s/it]

{'loss': 0.4845, 'grad_norm': 0.255336195230484, 'learning_rate': 0.00019555022008803524, 'epoch': 0.02}


  2%|▏         | 284/12500 [29:20<18:42:02,  5.51s/it]

{'loss': 0.844, 'grad_norm': 0.42644640803337097, 'learning_rate': 0.0001955342136854742, 'epoch': 0.02}


  2%|▏         | 285/12500 [29:27<19:57:08,  5.88s/it]

{'loss': 0.551, 'grad_norm': 0.3793628215789795, 'learning_rate': 0.00019551820728291316, 'epoch': 0.02}


  2%|▏         | 286/12500 [29:33<20:23:49,  6.01s/it]

{'loss': 0.6664, 'grad_norm': 0.4759114682674408, 'learning_rate': 0.00019550220088035216, 'epoch': 0.02}


  2%|▏         | 287/12500 [29:40<21:02:40,  6.20s/it]

{'loss': 0.8012, 'grad_norm': 0.30462321639060974, 'learning_rate': 0.00019548619447779114, 'epoch': 0.02}


  2%|▏         | 288/12500 [29:49<23:44:39,  7.00s/it]

{'loss': 0.6107, 'grad_norm': 0.2864384055137634, 'learning_rate': 0.0001954701880752301, 'epoch': 0.02}


  2%|▏         | 289/12500 [29:53<20:56:41,  6.17s/it]

{'loss': 0.624, 'grad_norm': 0.35105136036872864, 'learning_rate': 0.00019545418167266906, 'epoch': 0.02}


  2%|▏         | 290/12500 [30:01<22:29:50,  6.63s/it]

{'loss': 0.7277, 'grad_norm': 0.286641389131546, 'learning_rate': 0.00019543817527010806, 'epoch': 0.02}


  2%|▏         | 291/12500 [30:12<27:16:56,  8.04s/it]

{'loss': 1.0173, 'grad_norm': 0.24863529205322266, 'learning_rate': 0.00019542216886754703, 'epoch': 0.02}


  2%|▏         | 292/12500 [30:20<26:30:46,  7.82s/it]

{'loss': 0.8194, 'grad_norm': 0.3090563118457794, 'learning_rate': 0.000195406162464986, 'epoch': 0.02}


  2%|▏         | 293/12500 [30:26<24:45:34,  7.30s/it]

{'loss': 0.7329, 'grad_norm': 0.29998841881752014, 'learning_rate': 0.00019539015606242498, 'epoch': 0.02}


  2%|▏         | 294/12500 [30:32<24:09:22,  7.12s/it]

{'loss': 0.5014, 'grad_norm': 0.3578850030899048, 'learning_rate': 0.00019537414965986396, 'epoch': 0.02}


  2%|▏         | 295/12500 [30:37<21:52:36,  6.45s/it]

{'loss': 0.9492, 'grad_norm': 0.34124940633773804, 'learning_rate': 0.00019535814325730293, 'epoch': 0.02}


  2%|▏         | 296/12500 [30:45<23:13:00,  6.85s/it]

{'loss': 0.768, 'grad_norm': 0.2706206440925598, 'learning_rate': 0.0001953421368547419, 'epoch': 0.02}


  2%|▏         | 297/12500 [30:54<24:55:46,  7.35s/it]

{'loss': 0.8415, 'grad_norm': 0.2950698733329773, 'learning_rate': 0.00019532613045218088, 'epoch': 0.02}


  2%|▏         | 298/12500 [31:00<24:30:17,  7.23s/it]

{'loss': 0.8193, 'grad_norm': 0.3455953598022461, 'learning_rate': 0.00019531012404961986, 'epoch': 0.02}


  2%|▏         | 299/12500 [31:05<21:50:33,  6.44s/it]

{'loss': 0.583, 'grad_norm': 0.312682181596756, 'learning_rate': 0.00019529411764705883, 'epoch': 0.02}


  2%|▏         | 300/12500 [31:12<22:15:58,  6.57s/it]

{'loss': 1.2093, 'grad_norm': 0.30290520191192627, 'learning_rate': 0.0001952781112444978, 'epoch': 0.02}


  2%|▏         | 301/12500 [31:18<21:43:20,  6.41s/it]

{'loss': 0.6798, 'grad_norm': 0.3311351239681244, 'learning_rate': 0.00019526210484193678, 'epoch': 0.02}


  2%|▏         | 302/12500 [31:22<18:57:18,  5.59s/it]

{'loss': 0.5994, 'grad_norm': 0.37555211782455444, 'learning_rate': 0.00019524609843937576, 'epoch': 0.02}


  2%|▏         | 303/12500 [31:27<18:45:48,  5.54s/it]

{'loss': 0.6139, 'grad_norm': 0.32773536443710327, 'learning_rate': 0.00019523009203681473, 'epoch': 0.02}


  2%|▏         | 304/12500 [31:34<20:26:01,  6.03s/it]

{'loss': 0.6221, 'grad_norm': 0.3009893596172333, 'learning_rate': 0.0001952140856342537, 'epoch': 0.02}


  2%|▏         | 305/12500 [31:39<18:50:32,  5.56s/it]

{'loss': 0.8506, 'grad_norm': 0.37838977575302124, 'learning_rate': 0.00019519807923169268, 'epoch': 0.02}


  2%|▏         | 306/12500 [31:45<19:40:54,  5.81s/it]

{'loss': 0.7486, 'grad_norm': 0.3493843674659729, 'learning_rate': 0.00019518207282913166, 'epoch': 0.02}


  2%|▏         | 307/12500 [31:49<17:29:05,  5.16s/it]

{'loss': 0.727, 'grad_norm': 0.3662971258163452, 'learning_rate': 0.00019516606642657066, 'epoch': 0.02}


  2%|▏         | 308/12500 [31:56<19:12:11,  5.67s/it]

{'loss': 0.4641, 'grad_norm': 0.24201816320419312, 'learning_rate': 0.0001951500600240096, 'epoch': 0.02}


  2%|▏         | 309/12500 [32:01<19:21:28,  5.72s/it]

{'loss': 0.7022, 'grad_norm': 0.2997108995914459, 'learning_rate': 0.00019513405362144858, 'epoch': 0.02}


  2%|▏         | 310/12500 [32:10<21:59:28,  6.49s/it]

{'loss': 0.7585, 'grad_norm': 0.2593611180782318, 'learning_rate': 0.00019511804721888756, 'epoch': 0.02}


  2%|▏         | 311/12500 [32:16<21:33:22,  6.37s/it]

{'loss': 0.5224, 'grad_norm': 0.35327035188674927, 'learning_rate': 0.00019510204081632656, 'epoch': 0.02}


  2%|▏         | 312/12500 [32:24<23:21:02,  6.90s/it]

{'loss': 0.8558, 'grad_norm': 0.27607908844947815, 'learning_rate': 0.0001950860344137655, 'epoch': 0.02}


  3%|▎         | 313/12500 [32:32<24:24:20,  7.21s/it]

{'loss': 0.8801, 'grad_norm': 0.27525874972343445, 'learning_rate': 0.00019507002801120448, 'epoch': 0.03}


  3%|▎         | 314/12500 [32:40<25:11:06,  7.44s/it]

{'loss': 0.914, 'grad_norm': 0.2628897726535797, 'learning_rate': 0.00019505402160864346, 'epoch': 0.03}


  3%|▎         | 315/12500 [32:47<24:48:39,  7.33s/it]

{'loss': 0.6227, 'grad_norm': 0.2965458929538727, 'learning_rate': 0.00019503801520608246, 'epoch': 0.03}


  3%|▎         | 316/12500 [32:52<22:16:08,  6.58s/it]

{'loss': 0.8604, 'grad_norm': 0.2953178882598877, 'learning_rate': 0.0001950220088035214, 'epoch': 0.03}


  3%|▎         | 317/12500 [32:56<19:46:44,  5.84s/it]

{'loss': 0.8602, 'grad_norm': 0.39655235409736633, 'learning_rate': 0.00019500600240096038, 'epoch': 0.03}


  3%|▎         | 318/12500 [33:01<18:54:17,  5.59s/it]

{'loss': 0.6682, 'grad_norm': 0.3207215964794159, 'learning_rate': 0.00019498999599839938, 'epoch': 0.03}


  3%|▎         | 319/12500 [33:05<17:47:12,  5.26s/it]

{'loss': 0.7695, 'grad_norm': 0.3520125150680542, 'learning_rate': 0.00019497398959583836, 'epoch': 0.03}


  3%|▎         | 320/12500 [33:11<17:51:35,  5.28s/it]

{'loss': 0.623, 'grad_norm': 0.3731239140033722, 'learning_rate': 0.0001949579831932773, 'epoch': 0.03}


  3%|▎         | 321/12500 [33:16<18:14:29,  5.39s/it]

{'loss': 0.6283, 'grad_norm': 0.322424978017807, 'learning_rate': 0.00019494197679071628, 'epoch': 0.03}


  3%|▎         | 322/12500 [33:22<18:55:10,  5.59s/it]

{'loss': 0.7605, 'grad_norm': 0.2914620637893677, 'learning_rate': 0.00019492597038815528, 'epoch': 0.03}


  3%|▎         | 323/12500 [33:26<16:52:13,  4.99s/it]

{'loss': 0.7676, 'grad_norm': 0.41852760314941406, 'learning_rate': 0.00019490996398559426, 'epoch': 0.03}


  3%|▎         | 324/12500 [33:33<18:47:42,  5.56s/it]

{'loss': 0.4016, 'grad_norm': 0.25312432646751404, 'learning_rate': 0.0001948939575830332, 'epoch': 0.03}


  3%|▎         | 325/12500 [33:38<18:43:33,  5.54s/it]

{'loss': 0.5562, 'grad_norm': 0.33783721923828125, 'learning_rate': 0.0001948779511804722, 'epoch': 0.03}


  3%|▎         | 326/12500 [33:47<21:54:35,  6.48s/it]

{'loss': 0.7873, 'grad_norm': 0.3194705843925476, 'learning_rate': 0.00019486194477791118, 'epoch': 0.03}


  3%|▎         | 327/12500 [33:55<23:09:39,  6.85s/it]

{'loss': 0.8716, 'grad_norm': 0.2795124351978302, 'learning_rate': 0.00019484593837535016, 'epoch': 0.03}


  3%|▎         | 328/12500 [34:03<24:05:59,  7.13s/it]

{'loss': 0.967, 'grad_norm': 0.2994546890258789, 'learning_rate': 0.0001948299319727891, 'epoch': 0.03}


  3%|▎         | 329/12500 [34:07<20:54:10,  6.18s/it]

{'loss': 0.9015, 'grad_norm': 0.401500940322876, 'learning_rate': 0.0001948139255702281, 'epoch': 0.03}


  3%|▎         | 330/12500 [34:13<21:27:41,  6.35s/it]

{'loss': 0.5494, 'grad_norm': 0.26177293062210083, 'learning_rate': 0.00019479791916766708, 'epoch': 0.03}


  3%|▎         | 331/12500 [34:21<22:54:55,  6.78s/it]

{'loss': 1.0982, 'grad_norm': 0.30143803358078003, 'learning_rate': 0.00019478191276510606, 'epoch': 0.03}


  3%|▎         | 332/12500 [34:25<20:21:56,  6.03s/it]

{'loss': 0.7275, 'grad_norm': 0.308397501707077, 'learning_rate': 0.00019476590636254503, 'epoch': 0.03}


  3%|▎         | 333/12500 [34:33<22:07:02,  6.54s/it]

{'loss': 0.5263, 'grad_norm': 0.26962795853614807, 'learning_rate': 0.000194749899959984, 'epoch': 0.03}


  3%|▎         | 334/12500 [34:38<20:09:41,  5.97s/it]

{'loss': 0.6681, 'grad_norm': 0.3323342204093933, 'learning_rate': 0.00019473389355742298, 'epoch': 0.03}


  3%|▎         | 335/12500 [34:46<22:04:07,  6.53s/it]

{'loss': 0.4365, 'grad_norm': 0.2458847165107727, 'learning_rate': 0.00019471788715486196, 'epoch': 0.03}


  3%|▎         | 336/12500 [34:53<23:19:53,  6.91s/it]

{'loss': 0.6826, 'grad_norm': 0.3073967695236206, 'learning_rate': 0.00019470188075230093, 'epoch': 0.03}


  3%|▎         | 337/12500 [34:59<22:18:02,  6.60s/it]

{'loss': 0.6414, 'grad_norm': 0.29271337389945984, 'learning_rate': 0.0001946858743497399, 'epoch': 0.03}


  3%|▎         | 338/12500 [35:02<18:53:40,  5.59s/it]

{'loss': 0.597, 'grad_norm': 0.3479197025299072, 'learning_rate': 0.00019466986794717888, 'epoch': 0.03}


  3%|▎         | 339/12500 [35:12<23:09:06,  6.85s/it]

{'loss': 0.8369, 'grad_norm': 0.22083786129951477, 'learning_rate': 0.00019465386154461785, 'epoch': 0.03}


  3%|▎         | 340/12500 [35:17<21:25:39,  6.34s/it]

{'loss': 0.606, 'grad_norm': 0.2965526580810547, 'learning_rate': 0.00019463785514205683, 'epoch': 0.03}


  3%|▎         | 341/12500 [35:24<21:24:45,  6.34s/it]

{'loss': 0.925, 'grad_norm': 0.33314478397369385, 'learning_rate': 0.0001946218487394958, 'epoch': 0.03}


  3%|▎         | 342/12500 [35:31<22:14:26,  6.59s/it]

{'loss': 0.7414, 'grad_norm': 0.2670176327228546, 'learning_rate': 0.00019460584233693478, 'epoch': 0.03}


  3%|▎         | 343/12500 [35:40<25:17:54,  7.49s/it]

{'loss': 0.7394, 'grad_norm': 0.24634449183940887, 'learning_rate': 0.00019458983593437375, 'epoch': 0.03}


  3%|▎         | 344/12500 [35:46<23:20:07,  6.91s/it]

{'loss': 0.5424, 'grad_norm': 0.3568294048309326, 'learning_rate': 0.00019457382953181273, 'epoch': 0.03}


  3%|▎         | 345/12500 [35:52<22:03:46,  6.53s/it]

{'loss': 0.7267, 'grad_norm': 0.28241536021232605, 'learning_rate': 0.0001945578231292517, 'epoch': 0.03}


  3%|▎         | 346/12500 [35:58<22:10:10,  6.57s/it]

{'loss': 0.8045, 'grad_norm': 0.2734276354312897, 'learning_rate': 0.0001945418167266907, 'epoch': 0.03}


  3%|▎         | 347/12500 [36:07<24:00:18,  7.11s/it]

{'loss': 0.9236, 'grad_norm': 0.28280550241470337, 'learning_rate': 0.00019452581032412965, 'epoch': 0.03}


  3%|▎         | 348/12500 [36:12<21:40:22,  6.42s/it]

{'loss': 0.6921, 'grad_norm': 0.3752215802669525, 'learning_rate': 0.00019450980392156863, 'epoch': 0.03}


  3%|▎         | 349/12500 [36:17<20:24:14,  6.05s/it]

{'loss': 0.5906, 'grad_norm': 0.30648118257522583, 'learning_rate': 0.0001944937975190076, 'epoch': 0.03}


  3%|▎         | 350/12500 [36:24<21:31:58,  6.38s/it]

{'loss': 0.5576, 'grad_norm': 0.24412298202514648, 'learning_rate': 0.0001944777911164466, 'epoch': 0.03}


  3%|▎         | 351/12500 [36:33<24:29:58,  7.26s/it]

{'loss': 0.5764, 'grad_norm': 0.2605712413787842, 'learning_rate': 0.00019446178471388555, 'epoch': 0.03}


  3%|▎         | 352/12500 [36:39<23:02:47,  6.83s/it]

{'loss': 0.6885, 'grad_norm': 0.35803091526031494, 'learning_rate': 0.00019444577831132453, 'epoch': 0.03}


  3%|▎         | 353/12500 [36:46<22:42:20,  6.73s/it]

{'loss': 0.5306, 'grad_norm': 0.2668835520744324, 'learning_rate': 0.00019442977190876353, 'epoch': 0.03}


  3%|▎         | 354/12500 [36:50<20:18:35,  6.02s/it]

{'loss': 0.5841, 'grad_norm': 0.3252628445625305, 'learning_rate': 0.0001944137655062025, 'epoch': 0.03}


  3%|▎         | 355/12500 [36:57<20:58:55,  6.22s/it]

{'loss': 0.7999, 'grad_norm': 0.27534154057502747, 'learning_rate': 0.00019439775910364145, 'epoch': 0.03}


  3%|▎         | 356/12500 [37:04<22:07:26,  6.56s/it]

{'loss': 0.7269, 'grad_norm': 0.3074166476726532, 'learning_rate': 0.00019438175270108043, 'epoch': 0.03}


  3%|▎         | 357/12500 [37:09<20:47:54,  6.17s/it]

{'loss': 0.7381, 'grad_norm': 0.3391258418560028, 'learning_rate': 0.00019436574629851943, 'epoch': 0.03}


  3%|▎         | 358/12500 [37:15<20:04:19,  5.95s/it]

{'loss': 0.634, 'grad_norm': 0.3125925660133362, 'learning_rate': 0.0001943497398959584, 'epoch': 0.03}


  3%|▎         | 359/12500 [37:19<18:50:08,  5.59s/it]

{'loss': 0.7305, 'grad_norm': 0.36357131600379944, 'learning_rate': 0.00019433373349339735, 'epoch': 0.03}


  3%|▎         | 360/12500 [37:27<20:44:59,  6.15s/it]

{'loss': 0.769, 'grad_norm': 0.25895389914512634, 'learning_rate': 0.00019431772709083635, 'epoch': 0.03}


  3%|▎         | 361/12500 [37:37<24:35:14,  7.29s/it]

{'loss': 0.5475, 'grad_norm': 0.2044803500175476, 'learning_rate': 0.00019430172068827533, 'epoch': 0.03}


  3%|▎         | 362/12500 [37:42<22:43:13,  6.74s/it]

{'loss': 0.7649, 'grad_norm': 0.3141131103038788, 'learning_rate': 0.0001942857142857143, 'epoch': 0.03}


  3%|▎         | 363/12500 [37:49<22:47:16,  6.76s/it]

{'loss': 0.6181, 'grad_norm': 0.3020017743110657, 'learning_rate': 0.00019426970788315325, 'epoch': 0.03}


  3%|▎         | 364/12500 [37:55<21:47:59,  6.47s/it]

{'loss': 0.612, 'grad_norm': 0.28865519165992737, 'learning_rate': 0.00019425370148059225, 'epoch': 0.03}


  3%|▎         | 365/12500 [38:03<23:12:47,  6.89s/it]

{'loss': 0.873, 'grad_norm': 0.23377883434295654, 'learning_rate': 0.00019423769507803123, 'epoch': 0.03}


  3%|▎         | 366/12500 [38:08<21:47:20,  6.46s/it]

{'loss': 0.7781, 'grad_norm': 0.3581567108631134, 'learning_rate': 0.0001942216886754702, 'epoch': 0.03}


  3%|▎         | 367/12500 [38:18<24:52:40,  7.38s/it]

{'loss': 0.9032, 'grad_norm': 0.24921733140945435, 'learning_rate': 0.00019420568227290918, 'epoch': 0.03}


  3%|▎         | 368/12500 [38:25<24:30:29,  7.27s/it]

{'loss': 0.708, 'grad_norm': 0.25581294298171997, 'learning_rate': 0.00019418967587034815, 'epoch': 0.03}


  3%|▎         | 369/12500 [38:30<22:48:12,  6.77s/it]

{'loss': 0.6206, 'grad_norm': 0.30406203866004944, 'learning_rate': 0.00019417366946778713, 'epoch': 0.03}


  3%|▎         | 370/12500 [38:36<21:21:00,  6.34s/it]

{'loss': 0.6264, 'grad_norm': 0.3778403401374817, 'learning_rate': 0.0001941576630652261, 'epoch': 0.03}


  3%|▎         | 371/12500 [38:44<23:26:28,  6.96s/it]

{'loss': 0.4004, 'grad_norm': 0.2266521453857422, 'learning_rate': 0.00019414165666266508, 'epoch': 0.03}


  3%|▎         | 372/12500 [38:49<21:37:36,  6.42s/it]

{'loss': 1.1806, 'grad_norm': 0.32606005668640137, 'learning_rate': 0.00019412565026010405, 'epoch': 0.03}


  3%|▎         | 373/12500 [38:54<20:18:00,  6.03s/it]

{'loss': 0.7477, 'grad_norm': 0.45390018820762634, 'learning_rate': 0.00019410964385754303, 'epoch': 0.03}


  3%|▎         | 374/12500 [39:02<22:20:31,  6.63s/it]

{'loss': 0.8317, 'grad_norm': 0.2684886157512665, 'learning_rate': 0.000194093637454982, 'epoch': 0.03}


  3%|▎         | 375/12500 [39:08<21:10:06,  6.29s/it]

{'loss': 0.4371, 'grad_norm': 0.26162201166152954, 'learning_rate': 0.00019407763105242098, 'epoch': 0.03}


  3%|▎         | 376/12500 [39:15<21:58:14,  6.52s/it]

{'loss': 0.8874, 'grad_norm': 0.3646883964538574, 'learning_rate': 0.00019406162464985995, 'epoch': 0.03}


  3%|▎         | 377/12500 [39:23<23:10:09,  6.88s/it]

{'loss': 0.6137, 'grad_norm': 0.2640758752822876, 'learning_rate': 0.00019404561824729893, 'epoch': 0.03}


  3%|▎         | 378/12500 [39:31<25:05:42,  7.45s/it]

{'loss': 0.6102, 'grad_norm': 0.24307402968406677, 'learning_rate': 0.0001940296118447379, 'epoch': 0.03}


  3%|▎         | 379/12500 [39:38<24:13:13,  7.19s/it]

{'loss': 0.7313, 'grad_norm': 0.31268900632858276, 'learning_rate': 0.00019401360544217688, 'epoch': 0.03}


  3%|▎         | 380/12500 [39:43<22:09:27,  6.58s/it]

{'loss': 0.5522, 'grad_norm': 0.3341209292411804, 'learning_rate': 0.00019399759903961585, 'epoch': 0.03}


  3%|▎         | 381/12500 [39:48<20:33:48,  6.11s/it]

{'loss': 0.7765, 'grad_norm': 0.2978709042072296, 'learning_rate': 0.00019398159263705483, 'epoch': 0.03}


  3%|▎         | 382/12500 [39:56<22:10:38,  6.59s/it]

{'loss': 0.7694, 'grad_norm': 0.25782886147499084, 'learning_rate': 0.0001939655862344938, 'epoch': 0.03}


  3%|▎         | 383/12500 [40:02<21:57:58,  6.53s/it]

{'loss': 0.6001, 'grad_norm': 0.3474946618080139, 'learning_rate': 0.00019394957983193278, 'epoch': 0.03}


  3%|▎         | 384/12500 [40:08<21:03:21,  6.26s/it]

{'loss': 0.7455, 'grad_norm': 0.31288275122642517, 'learning_rate': 0.00019393357342937175, 'epoch': 0.03}


  3%|▎         | 385/12500 [40:14<20:32:37,  6.10s/it]

{'loss': 0.567, 'grad_norm': 0.2964957654476166, 'learning_rate': 0.00019391756702681075, 'epoch': 0.03}


  3%|▎         | 386/12500 [40:21<22:03:50,  6.56s/it]

{'loss': 0.942, 'grad_norm': 0.2694145143032074, 'learning_rate': 0.0001939015606242497, 'epoch': 0.03}


  3%|▎         | 387/12500 [40:28<22:12:30,  6.60s/it]

{'loss': 0.7643, 'grad_norm': 0.32705822587013245, 'learning_rate': 0.00019388555422168867, 'epoch': 0.03}


  3%|▎         | 388/12500 [40:35<22:47:10,  6.77s/it]

{'loss': 0.7968, 'grad_norm': 0.2711713910102844, 'learning_rate': 0.00019386954781912765, 'epoch': 0.03}


  3%|▎         | 389/12500 [40:42<23:23:00,  6.95s/it]

{'loss': 0.862, 'grad_norm': 0.3129224181175232, 'learning_rate': 0.00019385354141656665, 'epoch': 0.03}


  3%|▎         | 390/12500 [40:48<21:53:50,  6.51s/it]

{'loss': 0.7289, 'grad_norm': 0.29118967056274414, 'learning_rate': 0.0001938375350140056, 'epoch': 0.03}


  3%|▎         | 391/12500 [40:55<21:56:58,  6.53s/it]

{'loss': 0.9406, 'grad_norm': 0.27033716440200806, 'learning_rate': 0.00019382152861144457, 'epoch': 0.03}


  3%|▎         | 392/12500 [41:00<21:12:45,  6.31s/it]

{'loss': 0.5282, 'grad_norm': 0.2919776141643524, 'learning_rate': 0.00019380552220888358, 'epoch': 0.03}


  3%|▎         | 393/12500 [41:05<19:15:52,  5.73s/it]

{'loss': 0.7513, 'grad_norm': 0.321684867143631, 'learning_rate': 0.00019378951580632255, 'epoch': 0.03}


  3%|▎         | 394/12500 [41:09<17:53:53,  5.32s/it]

{'loss': 0.7543, 'grad_norm': 0.3639649450778961, 'learning_rate': 0.0001937735094037615, 'epoch': 0.03}


  3%|▎         | 395/12500 [41:13<16:50:24,  5.01s/it]

{'loss': 0.9237, 'grad_norm': 0.43314850330352783, 'learning_rate': 0.00019375750300120047, 'epoch': 0.03}


  3%|▎         | 396/12500 [41:21<19:26:51,  5.78s/it]

{'loss': 0.9629, 'grad_norm': 0.2503160238265991, 'learning_rate': 0.00019374149659863948, 'epoch': 0.03}


  3%|▎         | 397/12500 [41:27<19:51:55,  5.91s/it]

{'loss': 0.7295, 'grad_norm': 0.2894614636898041, 'learning_rate': 0.00019372549019607845, 'epoch': 0.03}


  3%|▎         | 398/12500 [41:33<20:05:39,  5.98s/it]

{'loss': 0.5779, 'grad_norm': 0.3126048743724823, 'learning_rate': 0.0001937094837935174, 'epoch': 0.03}


  3%|▎         | 399/12500 [41:39<19:57:30,  5.94s/it]

{'loss': 0.8058, 'grad_norm': 0.2585342228412628, 'learning_rate': 0.0001936934773909564, 'epoch': 0.03}


  3%|▎         | 400/12500 [41:45<19:46:32,  5.88s/it]

{'loss': 0.7409, 'grad_norm': 0.2698828876018524, 'learning_rate': 0.00019367747098839538, 'epoch': 0.03}


  3%|▎         | 401/12500 [41:55<23:41:13,  7.05s/it]

{'loss': 0.9068, 'grad_norm': 0.25313109159469604, 'learning_rate': 0.00019366146458583435, 'epoch': 0.03}


  3%|▎         | 402/12500 [42:00<22:02:51,  6.56s/it]

{'loss': 0.8708, 'grad_norm': 0.2667675316333771, 'learning_rate': 0.0001936454581832733, 'epoch': 0.03}


  3%|▎         | 403/12500 [42:07<22:50:44,  6.80s/it]

{'loss': 1.0133, 'grad_norm': 0.268949955701828, 'learning_rate': 0.0001936294517807123, 'epoch': 0.03}


  3%|▎         | 404/12500 [42:15<23:37:45,  7.03s/it]

{'loss': 0.8803, 'grad_norm': 0.3034265637397766, 'learning_rate': 0.00019361344537815127, 'epoch': 0.03}


  3%|▎         | 405/12500 [42:19<20:54:59,  6.23s/it]

{'loss': 0.8729, 'grad_norm': 0.36058738827705383, 'learning_rate': 0.00019359743897559025, 'epoch': 0.03}


  3%|▎         | 406/12500 [42:25<20:13:47,  6.02s/it]

{'loss': 0.7715, 'grad_norm': 0.3073258399963379, 'learning_rate': 0.00019358143257302922, 'epoch': 0.03}


  3%|▎         | 407/12500 [42:31<20:25:23,  6.08s/it]

{'loss': 0.7266, 'grad_norm': 0.28977954387664795, 'learning_rate': 0.0001935654261704682, 'epoch': 0.03}


  3%|▎         | 408/12500 [42:38<20:49:53,  6.20s/it]

{'loss': 0.7198, 'grad_norm': 0.3155589997768402, 'learning_rate': 0.00019354941976790717, 'epoch': 0.03}


  3%|▎         | 409/12500 [42:45<21:46:40,  6.48s/it]

{'loss': 0.7121, 'grad_norm': 0.25881949067115784, 'learning_rate': 0.00019353341336534615, 'epoch': 0.03}


  3%|▎         | 410/12500 [42:49<19:30:59,  5.81s/it]

{'loss': 0.8395, 'grad_norm': 0.3470754027366638, 'learning_rate': 0.00019351740696278512, 'epoch': 0.03}


  3%|▎         | 411/12500 [42:53<17:17:37,  5.15s/it]

{'loss': 0.6184, 'grad_norm': 0.3009454607963562, 'learning_rate': 0.0001935014005602241, 'epoch': 0.03}


  3%|▎         | 412/12500 [42:58<18:02:13,  5.37s/it]

{'loss': 0.7154, 'grad_norm': 0.28377246856689453, 'learning_rate': 0.00019348539415766307, 'epoch': 0.03}


  3%|▎         | 413/12500 [43:02<16:36:49,  4.95s/it]

{'loss': 0.7779, 'grad_norm': 0.34158456325531006, 'learning_rate': 0.00019346938775510205, 'epoch': 0.03}


  3%|▎         | 414/12500 [43:08<17:03:00,  5.08s/it]

{'loss': 0.8593, 'grad_norm': 0.3485906422138214, 'learning_rate': 0.00019345338135254102, 'epoch': 0.03}


  3%|▎         | 415/12500 [43:13<17:03:16,  5.08s/it]

{'loss': 0.7194, 'grad_norm': 0.26221948862075806, 'learning_rate': 0.00019343737494998, 'epoch': 0.03}


  3%|▎         | 416/12500 [43:21<20:29:44,  6.11s/it]

{'loss': 0.9037, 'grad_norm': 0.27472183108329773, 'learning_rate': 0.00019342136854741897, 'epoch': 0.03}


  3%|▎         | 417/12500 [43:27<19:33:21,  5.83s/it]

{'loss': 0.6998, 'grad_norm': 0.34332260489463806, 'learning_rate': 0.00019340536214485795, 'epoch': 0.03}


  3%|▎         | 418/12500 [43:32<18:52:08,  5.62s/it]

{'loss': 0.6444, 'grad_norm': 0.32013726234436035, 'learning_rate': 0.00019338935574229692, 'epoch': 0.03}


  3%|▎         | 419/12500 [43:42<23:07:48,  6.89s/it]

{'loss': 0.5401, 'grad_norm': 0.214155375957489, 'learning_rate': 0.0001933733493397359, 'epoch': 0.03}


  3%|▎         | 420/12500 [43:47<22:08:52,  6.60s/it]

{'loss': 0.9109, 'grad_norm': 0.3307681083679199, 'learning_rate': 0.0001933573429371749, 'epoch': 0.03}


  3%|▎         | 421/12500 [43:54<22:15:32,  6.63s/it]

{'loss': 0.8787, 'grad_norm': 0.26040035486221313, 'learning_rate': 0.00019334133653461385, 'epoch': 0.03}


  3%|▎         | 422/12500 [44:02<23:19:07,  6.95s/it]

{'loss': 0.8712, 'grad_norm': 0.26077595353126526, 'learning_rate': 0.00019332533013205282, 'epoch': 0.03}


  3%|▎         | 423/12500 [44:09<23:05:25,  6.88s/it]

{'loss': 0.6399, 'grad_norm': 0.2516511380672455, 'learning_rate': 0.0001933093237294918, 'epoch': 0.03}


  3%|▎         | 424/12500 [44:14<21:48:38,  6.50s/it]

{'loss': 0.8138, 'grad_norm': 0.29740893840789795, 'learning_rate': 0.0001932933173269308, 'epoch': 0.03}


  3%|▎         | 425/12500 [44:22<23:13:29,  6.92s/it]

{'loss': 0.5251, 'grad_norm': 0.22840170562267303, 'learning_rate': 0.00019327731092436975, 'epoch': 0.03}


  3%|▎         | 426/12500 [44:30<24:28:05,  7.30s/it]

{'loss': 0.8125, 'grad_norm': 0.3062303960323334, 'learning_rate': 0.00019326130452180872, 'epoch': 0.03}


  3%|▎         | 427/12500 [44:37<23:49:12,  7.10s/it]

{'loss': 0.7447, 'grad_norm': 0.31878477334976196, 'learning_rate': 0.0001932452981192477, 'epoch': 0.03}


  3%|▎         | 428/12500 [44:47<26:31:29,  7.91s/it]

{'loss': 0.9834, 'grad_norm': 0.2304127961397171, 'learning_rate': 0.0001932292917166867, 'epoch': 0.03}


  3%|▎         | 429/12500 [44:50<22:21:11,  6.67s/it]

{'loss': 0.6063, 'grad_norm': 0.3305794894695282, 'learning_rate': 0.00019321328531412565, 'epoch': 0.03}


  3%|▎         | 430/12500 [44:58<22:42:43,  6.77s/it]

{'loss': 0.6237, 'grad_norm': 0.3227684795856476, 'learning_rate': 0.00019319727891156462, 'epoch': 0.03}


  3%|▎         | 431/12500 [45:01<19:07:56,  5.71s/it]

{'loss': 0.6249, 'grad_norm': 0.3313417136669159, 'learning_rate': 0.00019318127250900362, 'epoch': 0.03}


  3%|▎         | 432/12500 [45:06<18:56:47,  5.65s/it]

{'loss': 0.6007, 'grad_norm': 0.34418636560440063, 'learning_rate': 0.0001931652661064426, 'epoch': 0.03}


  3%|▎         | 433/12500 [45:14<21:02:09,  6.28s/it]

{'loss': 0.718, 'grad_norm': 0.2719235420227051, 'learning_rate': 0.00019314925970388155, 'epoch': 0.03}


  3%|▎         | 434/12500 [45:18<18:36:05,  5.55s/it]

{'loss': 0.7421, 'grad_norm': 0.3217020332813263, 'learning_rate': 0.00019313325330132052, 'epoch': 0.03}


  3%|▎         | 435/12500 [45:26<20:59:36,  6.26s/it]

{'loss': 0.8614, 'grad_norm': 0.3013320565223694, 'learning_rate': 0.00019311724689875952, 'epoch': 0.03}


  3%|▎         | 436/12500 [45:31<19:35:29,  5.85s/it]

{'loss': 0.5838, 'grad_norm': 0.2927591800689697, 'learning_rate': 0.0001931012404961985, 'epoch': 0.03}


  3%|▎         | 437/12500 [45:38<20:36:53,  6.15s/it]

{'loss': 0.4753, 'grad_norm': 0.24396929144859314, 'learning_rate': 0.00019308523409363744, 'epoch': 0.03}


  4%|▎         | 438/12500 [45:45<21:46:03,  6.50s/it]

{'loss': 0.8641, 'grad_norm': 0.2606171667575836, 'learning_rate': 0.00019306922769107645, 'epoch': 0.04}


  4%|▎         | 439/12500 [45:51<21:11:28,  6.33s/it]

{'loss': 0.9668, 'grad_norm': 0.271443635225296, 'learning_rate': 0.00019305322128851542, 'epoch': 0.04}


  4%|▎         | 440/12500 [46:01<24:44:40,  7.39s/it]

{'loss': 0.607, 'grad_norm': 0.19072556495666504, 'learning_rate': 0.0001930372148859544, 'epoch': 0.04}


  4%|▎         | 441/12500 [46:08<24:28:18,  7.31s/it]

{'loss': 0.7366, 'grad_norm': 0.3274163603782654, 'learning_rate': 0.00019302120848339334, 'epoch': 0.04}


  4%|▎         | 442/12500 [46:15<24:44:02,  7.38s/it]

{'loss': 0.78, 'grad_norm': 0.227627694606781, 'learning_rate': 0.00019300520208083235, 'epoch': 0.04}


  4%|▎         | 443/12500 [46:21<23:32:07,  7.03s/it]

{'loss': 0.8327, 'grad_norm': 0.2586795687675476, 'learning_rate': 0.00019298919567827132, 'epoch': 0.04}


  4%|▎         | 444/12500 [46:29<24:23:47,  7.28s/it]

{'loss': 0.799, 'grad_norm': 0.22898122668266296, 'learning_rate': 0.0001929731892757103, 'epoch': 0.04}


  4%|▎         | 445/12500 [46:37<24:33:02,  7.33s/it]

{'loss': 0.9904, 'grad_norm': 0.26369014382362366, 'learning_rate': 0.00019295718287314927, 'epoch': 0.04}


  4%|▎         | 446/12500 [46:42<22:03:26,  6.59s/it]

{'loss': 0.776, 'grad_norm': 0.26884451508522034, 'learning_rate': 0.00019294117647058825, 'epoch': 0.04}


  4%|▎         | 447/12500 [46:51<24:19:28,  7.27s/it]

{'loss': 1.0298, 'grad_norm': 0.2285071462392807, 'learning_rate': 0.00019292517006802722, 'epoch': 0.04}


  4%|▎         | 448/12500 [46:57<23:48:46,  7.11s/it]

{'loss': 0.8019, 'grad_norm': 0.2728531062602997, 'learning_rate': 0.0001929091636654662, 'epoch': 0.04}


  4%|▎         | 449/12500 [47:05<24:04:47,  7.19s/it]

{'loss': 0.8714, 'grad_norm': 0.24049460887908936, 'learning_rate': 0.00019289315726290517, 'epoch': 0.04}


  4%|▎         | 450/12500 [47:12<23:49:15,  7.12s/it]

{'loss': 0.4994, 'grad_norm': 0.22458967566490173, 'learning_rate': 0.00019287715086034414, 'epoch': 0.04}


  4%|▎         | 451/12500 [47:17<21:57:50,  6.56s/it]

{'loss': 0.6924, 'grad_norm': 0.29355132579803467, 'learning_rate': 0.00019286114445778312, 'epoch': 0.04}


  4%|▎         | 452/12500 [47:27<25:18:37,  7.56s/it]

{'loss': 1.1688, 'grad_norm': 0.23921459913253784, 'learning_rate': 0.0001928451380552221, 'epoch': 0.04}


  4%|▎         | 453/12500 [47:31<21:53:53,  6.54s/it]

{'loss': 0.9198, 'grad_norm': 0.33558714389801025, 'learning_rate': 0.00019282913165266107, 'epoch': 0.04}


  4%|▎         | 454/12500 [47:39<23:06:42,  6.91s/it]

{'loss': 0.8835, 'grad_norm': 0.24367530643939972, 'learning_rate': 0.00019281312525010004, 'epoch': 0.04}


  4%|▎         | 455/12500 [47:44<21:26:53,  6.41s/it]

{'loss': 0.8046, 'grad_norm': 0.31916922330856323, 'learning_rate': 0.00019279711884753902, 'epoch': 0.04}


  4%|▎         | 456/12500 [47:49<19:37:29,  5.87s/it]

{'loss': 0.7742, 'grad_norm': 0.4398312568664551, 'learning_rate': 0.000192781112444978, 'epoch': 0.04}


  4%|▎         | 457/12500 [47:56<21:41:00,  6.48s/it]

{'loss': 0.9066, 'grad_norm': 0.24640925228595734, 'learning_rate': 0.00019276510604241697, 'epoch': 0.04}


  4%|▎         | 458/12500 [48:01<19:52:58,  5.94s/it]

{'loss': 0.7387, 'grad_norm': 0.3422083854675293, 'learning_rate': 0.00019274909963985594, 'epoch': 0.04}


  4%|▎         | 459/12500 [48:09<21:56:50,  6.56s/it]

{'loss': 0.7611, 'grad_norm': 0.25470224022865295, 'learning_rate': 0.00019273309323729495, 'epoch': 0.04}


  4%|▎         | 460/12500 [48:14<20:28:07,  6.12s/it]

{'loss': 0.6123, 'grad_norm': 0.2833864688873291, 'learning_rate': 0.0001927170868347339, 'epoch': 0.04}


  4%|▎         | 461/12500 [48:20<19:55:58,  5.96s/it]

{'loss': 0.6554, 'grad_norm': 0.31202468276023865, 'learning_rate': 0.00019270108043217287, 'epoch': 0.04}


  4%|▎         | 462/12500 [48:24<18:10:43,  5.44s/it]

{'loss': 1.0582, 'grad_norm': 0.3562047779560089, 'learning_rate': 0.00019268507402961184, 'epoch': 0.04}


  4%|▎         | 463/12500 [48:34<22:24:15,  6.70s/it]

{'loss': 0.8798, 'grad_norm': 0.28444573283195496, 'learning_rate': 0.00019266906762705085, 'epoch': 0.04}


  4%|▎         | 464/12500 [48:42<24:14:17,  7.25s/it]

{'loss': 0.4133, 'grad_norm': 0.19919341802597046, 'learning_rate': 0.0001926530612244898, 'epoch': 0.04}


  4%|▎         | 465/12500 [48:51<25:27:44,  7.62s/it]

{'loss': 0.9166, 'grad_norm': 0.22734694182872772, 'learning_rate': 0.00019263705482192877, 'epoch': 0.04}


  4%|▎         | 466/12500 [48:55<22:27:02,  6.72s/it]

{'loss': 0.3664, 'grad_norm': 0.2932734489440918, 'learning_rate': 0.00019262104841936777, 'epoch': 0.04}


  4%|▎         | 467/12500 [49:01<21:02:44,  6.30s/it]

{'loss': 0.7711, 'grad_norm': 0.3005332052707672, 'learning_rate': 0.00019260504201680674, 'epoch': 0.04}


  4%|▎         | 468/12500 [49:06<20:13:07,  6.05s/it]

{'loss': 0.7, 'grad_norm': 0.28241559863090515, 'learning_rate': 0.0001925890356142457, 'epoch': 0.04}


  4%|▍         | 469/12500 [49:14<22:17:07,  6.67s/it]

{'loss': 1.0205, 'grad_norm': 0.27233588695526123, 'learning_rate': 0.00019257302921168467, 'epoch': 0.04}


  4%|▍         | 470/12500 [49:21<22:09:42,  6.63s/it]

{'loss': 0.7778, 'grad_norm': 0.333891361951828, 'learning_rate': 0.00019255702280912367, 'epoch': 0.04}


  4%|▍         | 471/12500 [49:31<26:09:32,  7.83s/it]

{'loss': 0.675, 'grad_norm': 0.21252791583538055, 'learning_rate': 0.00019254101640656264, 'epoch': 0.04}


  4%|▍         | 472/12500 [49:36<23:16:16,  6.97s/it]

{'loss': 0.9526, 'grad_norm': 0.3085078001022339, 'learning_rate': 0.0001925250100040016, 'epoch': 0.04}


  4%|▍         | 473/12500 [49:41<21:16:42,  6.37s/it]

{'loss': 0.6283, 'grad_norm': 0.2889682650566101, 'learning_rate': 0.0001925090036014406, 'epoch': 0.04}


  4%|▍         | 474/12500 [49:46<19:40:54,  5.89s/it]

{'loss': 0.5357, 'grad_norm': 0.28915420174598694, 'learning_rate': 0.00019249299719887957, 'epoch': 0.04}


  4%|▍         | 475/12500 [49:50<17:17:57,  5.18s/it]

{'loss': 0.7656, 'grad_norm': 0.36861658096313477, 'learning_rate': 0.00019247699079631854, 'epoch': 0.04}


  4%|▍         | 476/12500 [49:57<19:26:10,  5.82s/it]

{'loss': 0.7738, 'grad_norm': 0.2947785258293152, 'learning_rate': 0.0001924609843937575, 'epoch': 0.04}


  4%|▍         | 477/12500 [50:04<20:24:22,  6.11s/it]

{'loss': 0.8772, 'grad_norm': 0.273631751537323, 'learning_rate': 0.0001924449779911965, 'epoch': 0.04}


  4%|▍         | 478/12500 [50:08<19:06:04,  5.72s/it]

{'loss': 0.8544, 'grad_norm': 0.29063576459884644, 'learning_rate': 0.00019242897158863547, 'epoch': 0.04}


  4%|▍         | 479/12500 [50:15<20:20:58,  6.09s/it]

{'loss': 0.894, 'grad_norm': 0.27086925506591797, 'learning_rate': 0.00019241296518607444, 'epoch': 0.04}


  4%|▍         | 480/12500 [50:21<20:07:12,  6.03s/it]

{'loss': 0.6934, 'grad_norm': 0.29523882269859314, 'learning_rate': 0.0001923969587835134, 'epoch': 0.04}


  4%|▍         | 481/12500 [50:29<21:35:21,  6.47s/it]

{'loss': 0.4224, 'grad_norm': 0.2384488731622696, 'learning_rate': 0.0001923809523809524, 'epoch': 0.04}


  4%|▍         | 482/12500 [50:36<22:29:47,  6.74s/it]

{'loss': 0.6543, 'grad_norm': 0.2649068534374237, 'learning_rate': 0.00019236494597839137, 'epoch': 0.04}


  4%|▍         | 483/12500 [50:40<19:48:41,  5.94s/it]

{'loss': 0.6417, 'grad_norm': 0.3202245533466339, 'learning_rate': 0.00019234893957583034, 'epoch': 0.04}


  4%|▍         | 484/12500 [50:49<22:22:09,  6.70s/it]

{'loss': 1.0498, 'grad_norm': 0.2478317767381668, 'learning_rate': 0.00019233293317326932, 'epoch': 0.04}


  4%|▍         | 485/12500 [50:55<22:09:19,  6.64s/it]

{'loss': 0.7702, 'grad_norm': 0.2688138782978058, 'learning_rate': 0.0001923169267707083, 'epoch': 0.04}


  4%|▍         | 486/12500 [51:01<21:29:18,  6.44s/it]

{'loss': 0.6793, 'grad_norm': 0.26793310046195984, 'learning_rate': 0.00019230092036814727, 'epoch': 0.04}


  4%|▍         | 487/12500 [51:06<19:45:29,  5.92s/it]

{'loss': 0.6717, 'grad_norm': 0.34452491998672485, 'learning_rate': 0.00019228491396558624, 'epoch': 0.04}


  4%|▍         | 488/12500 [51:11<19:22:18,  5.81s/it]

{'loss': 0.5743, 'grad_norm': 0.2590729296207428, 'learning_rate': 0.00019226890756302522, 'epoch': 0.04}


  4%|▍         | 489/12500 [51:18<20:08:06,  6.04s/it]

{'loss': 0.3494, 'grad_norm': 0.22705359756946564, 'learning_rate': 0.0001922529011604642, 'epoch': 0.04}


  4%|▍         | 490/12500 [51:23<19:20:03,  5.80s/it]

{'loss': 0.7341, 'grad_norm': 0.29573044180870056, 'learning_rate': 0.00019223689475790317, 'epoch': 0.04}


  4%|▍         | 491/12500 [51:27<17:01:35,  5.10s/it]

{'loss': 0.8564, 'grad_norm': 0.3778861165046692, 'learning_rate': 0.00019222088835534217, 'epoch': 0.04}


  4%|▍         | 492/12500 [51:35<20:00:34,  6.00s/it]

{'loss': 0.6624, 'grad_norm': 0.2971304953098297, 'learning_rate': 0.00019220488195278112, 'epoch': 0.04}


  4%|▍         | 493/12500 [51:39<17:44:29,  5.32s/it]

{'loss': 0.8683, 'grad_norm': 0.34202975034713745, 'learning_rate': 0.0001921888755502201, 'epoch': 0.04}


  4%|▍         | 494/12500 [51:47<20:49:51,  6.25s/it]

{'loss': 0.7775, 'grad_norm': 0.21372684836387634, 'learning_rate': 0.00019217286914765907, 'epoch': 0.04}


  4%|▍         | 495/12500 [51:55<22:28:19,  6.74s/it]

{'loss': 0.4519, 'grad_norm': 0.2255154699087143, 'learning_rate': 0.00019215686274509807, 'epoch': 0.04}


  4%|▍         | 496/12500 [52:00<20:37:30,  6.19s/it]

{'loss': 0.6421, 'grad_norm': 0.29834672808647156, 'learning_rate': 0.00019214085634253702, 'epoch': 0.04}


  4%|▍         | 497/12500 [52:03<17:54:37,  5.37s/it]

{'loss': 0.6057, 'grad_norm': 0.32089173793792725, 'learning_rate': 0.000192124849939976, 'epoch': 0.04}


  4%|▍         | 498/12500 [52:08<17:44:22,  5.32s/it]

{'loss': 0.5239, 'grad_norm': 0.25953781604766846, 'learning_rate': 0.000192108843537415, 'epoch': 0.04}


  4%|▍         | 499/12500 [52:12<15:57:05,  4.79s/it]

{'loss': 0.6914, 'grad_norm': 0.3277316689491272, 'learning_rate': 0.00019209283713485397, 'epoch': 0.04}


  4%|▍         | 500/12500 [52:20<19:09:31,  5.75s/it]

{'loss': 0.8526, 'grad_norm': 0.3471475839614868, 'learning_rate': 0.00019207683073229291, 'epoch': 0.04}


  4%|▍         | 501/12500 [52:28<21:46:31,  6.53s/it]

{'loss': 0.5976, 'grad_norm': 0.23141193389892578, 'learning_rate': 0.0001920608243297319, 'epoch': 0.04}


  4%|▍         | 502/12500 [52:35<21:43:49,  6.52s/it]

{'loss': 0.7044, 'grad_norm': 0.2746582329273224, 'learning_rate': 0.0001920448179271709, 'epoch': 0.04}


  4%|▍         | 503/12500 [52:43<23:15:20,  6.98s/it]

{'loss': 0.7196, 'grad_norm': 0.25413617491722107, 'learning_rate': 0.00019202881152460987, 'epoch': 0.04}


  4%|▍         | 504/12500 [52:53<26:12:44,  7.87s/it]

{'loss': 0.4278, 'grad_norm': 0.32500413060188293, 'learning_rate': 0.00019201280512204881, 'epoch': 0.04}


  4%|▍         | 505/12500 [52:59<24:50:36,  7.46s/it]

{'loss': 0.7347, 'grad_norm': 0.2684568166732788, 'learning_rate': 0.00019199679871948782, 'epoch': 0.04}


  4%|▍         | 506/12500 [53:07<25:27:56,  7.64s/it]

{'loss': 0.5139, 'grad_norm': 0.25170794129371643, 'learning_rate': 0.0001919807923169268, 'epoch': 0.04}


  4%|▍         | 507/12500 [53:13<23:05:54,  6.93s/it]

{'loss': 0.6318, 'grad_norm': 0.2967097759246826, 'learning_rate': 0.00019196478591436577, 'epoch': 0.04}


  4%|▍         | 508/12500 [53:19<22:22:16,  6.72s/it]

{'loss': 0.4309, 'grad_norm': 0.23458051681518555, 'learning_rate': 0.00019194877951180471, 'epoch': 0.04}


  4%|▍         | 509/12500 [53:24<20:26:58,  6.14s/it]

{'loss': 0.7259, 'grad_norm': 0.2902500331401825, 'learning_rate': 0.00019193277310924372, 'epoch': 0.04}


  4%|▍         | 510/12500 [53:29<19:33:57,  5.87s/it]

{'loss': 0.7868, 'grad_norm': 0.2916920483112335, 'learning_rate': 0.0001919167667066827, 'epoch': 0.04}


  4%|▍         | 511/12500 [53:34<19:14:39,  5.78s/it]

{'loss': 1.0265, 'grad_norm': 0.2680763304233551, 'learning_rate': 0.00019190076030412167, 'epoch': 0.04}


  4%|▍         | 512/12500 [53:41<20:28:16,  6.15s/it]

{'loss': 0.9329, 'grad_norm': 0.2513152062892914, 'learning_rate': 0.00019188475390156064, 'epoch': 0.04}


  4%|▍         | 513/12500 [53:47<19:56:40,  5.99s/it]

{'loss': 0.5812, 'grad_norm': 0.3171052038669586, 'learning_rate': 0.00019186874749899961, 'epoch': 0.04}


  4%|▍         | 514/12500 [53:53<20:01:20,  6.01s/it]

{'loss': 0.9452, 'grad_norm': 0.22856439650058746, 'learning_rate': 0.0001918527410964386, 'epoch': 0.04}


  4%|▍         | 515/12500 [53:59<20:11:19,  6.06s/it]

{'loss': 0.4784, 'grad_norm': 0.21901987493038177, 'learning_rate': 0.00019183673469387756, 'epoch': 0.04}


  4%|▍         | 516/12500 [54:05<19:52:05,  5.97s/it]

{'loss': 0.8337, 'grad_norm': 0.262210875749588, 'learning_rate': 0.00019182072829131654, 'epoch': 0.04}


  4%|▍         | 517/12500 [54:10<18:49:47,  5.66s/it]

{'loss': 0.4775, 'grad_norm': 0.26390624046325684, 'learning_rate': 0.00019180472188875551, 'epoch': 0.04}


  4%|▍         | 518/12500 [54:15<18:21:48,  5.52s/it]

{'loss': 0.5396, 'grad_norm': 0.2926567494869232, 'learning_rate': 0.0001917887154861945, 'epoch': 0.04}


  4%|▍         | 519/12500 [54:19<16:09:14,  4.85s/it]

{'loss': 0.6727, 'grad_norm': 0.28829070925712585, 'learning_rate': 0.00019177270908363346, 'epoch': 0.04}


  4%|▍         | 520/12500 [54:23<15:21:46,  4.62s/it]

{'loss': 0.743, 'grad_norm': 0.29972395300865173, 'learning_rate': 0.00019175670268107244, 'epoch': 0.04}


  4%|▍         | 521/12500 [54:28<15:52:59,  4.77s/it]

{'loss': 0.699, 'grad_norm': 0.27919521927833557, 'learning_rate': 0.00019174069627851141, 'epoch': 0.04}


  4%|▍         | 522/12500 [54:31<14:05:10,  4.23s/it]

{'loss': 0.8914, 'grad_norm': 0.3581523597240448, 'learning_rate': 0.0001917246898759504, 'epoch': 0.04}


  4%|▍         | 523/12500 [54:37<16:14:56,  4.88s/it]

{'loss': 0.7238, 'grad_norm': 0.3333909511566162, 'learning_rate': 0.00019170868347338936, 'epoch': 0.04}


  4%|▍         | 524/12500 [54:43<17:08:11,  5.15s/it]

{'loss': 0.6349, 'grad_norm': 0.2577509582042694, 'learning_rate': 0.00019169267707082834, 'epoch': 0.04}


  4%|▍         | 525/12500 [54:51<19:45:55,  5.94s/it]

{'loss': 0.7432, 'grad_norm': 0.21164660155773163, 'learning_rate': 0.0001916766706682673, 'epoch': 0.04}


  4%|▍         | 526/12500 [54:58<20:41:13,  6.22s/it]

{'loss': 0.8918, 'grad_norm': 0.2577507793903351, 'learning_rate': 0.00019166066426570632, 'epoch': 0.04}


  4%|▍         | 527/12500 [55:01<17:49:07,  5.36s/it]

{'loss': 0.6526, 'grad_norm': 0.3438434898853302, 'learning_rate': 0.00019164465786314526, 'epoch': 0.04}


  4%|▍         | 528/12500 [55:11<22:11:04,  6.67s/it]

{'loss': 0.9149, 'grad_norm': 0.20980285108089447, 'learning_rate': 0.00019162865146058424, 'epoch': 0.04}


  4%|▍         | 529/12500 [55:19<24:21:29,  7.33s/it]

{'loss': 1.1547, 'grad_norm': 0.26590394973754883, 'learning_rate': 0.0001916126450580232, 'epoch': 0.04}


  4%|▍         | 530/12500 [55:25<22:05:38,  6.64s/it]

{'loss': 0.5148, 'grad_norm': 0.23012691736221313, 'learning_rate': 0.00019159663865546221, 'epoch': 0.04}


  4%|▍         | 531/12500 [55:28<19:08:25,  5.76s/it]

{'loss': 0.6773, 'grad_norm': 0.3355550765991211, 'learning_rate': 0.00019158063225290116, 'epoch': 0.04}


  4%|▍         | 532/12500 [55:35<20:13:10,  6.08s/it]

{'loss': 0.9339, 'grad_norm': 0.2603365182876587, 'learning_rate': 0.00019156462585034014, 'epoch': 0.04}


  4%|▍         | 533/12500 [55:40<18:57:58,  5.71s/it]

{'loss': 0.884, 'grad_norm': 0.4017958641052246, 'learning_rate': 0.00019154861944777914, 'epoch': 0.04}


  4%|▍         | 534/12500 [55:46<19:35:41,  5.90s/it]

{'loss': 0.5778, 'grad_norm': 0.24192272126674652, 'learning_rate': 0.00019153261304521811, 'epoch': 0.04}


  4%|▍         | 535/12500 [55:54<21:00:44,  6.32s/it]

{'loss': 0.6996, 'grad_norm': 0.2548074722290039, 'learning_rate': 0.00019151660664265706, 'epoch': 0.04}


  4%|▍         | 536/12500 [56:00<20:57:47,  6.31s/it]

{'loss': 0.7374, 'grad_norm': 0.32320448756217957, 'learning_rate': 0.00019150060024009604, 'epoch': 0.04}


  4%|▍         | 537/12500 [56:07<21:33:11,  6.49s/it]

{'loss': 0.7829, 'grad_norm': 0.23594941198825836, 'learning_rate': 0.00019148459383753504, 'epoch': 0.04}


  4%|▍         | 538/12500 [56:15<23:18:47,  7.02s/it]

{'loss': 0.6337, 'grad_norm': 0.20185311138629913, 'learning_rate': 0.000191468587434974, 'epoch': 0.04}


  4%|▍         | 539/12500 [56:20<21:25:26,  6.45s/it]

{'loss': 0.6249, 'grad_norm': 0.3387511670589447, 'learning_rate': 0.00019145258103241296, 'epoch': 0.04}


  4%|▍         | 540/12500 [56:24<18:52:17,  5.68s/it]

{'loss': 0.7406, 'grad_norm': 0.3335259258747101, 'learning_rate': 0.00019143657462985194, 'epoch': 0.04}


  4%|▍         | 541/12500 [56:29<17:50:18,  5.37s/it]

{'loss': 0.6891, 'grad_norm': 0.29121431708335876, 'learning_rate': 0.00019142056822729094, 'epoch': 0.04}


  4%|▍         | 542/12500 [56:33<17:07:27,  5.16s/it]

{'loss': 0.9911, 'grad_norm': 0.3252638280391693, 'learning_rate': 0.0001914045618247299, 'epoch': 0.04}


  4%|▍         | 543/12500 [56:39<17:36:58,  5.30s/it]

{'loss': 0.6588, 'grad_norm': 0.27361616492271423, 'learning_rate': 0.00019138855542216886, 'epoch': 0.04}


  4%|▍         | 544/12500 [56:47<20:37:09,  6.21s/it]

{'loss': 0.5376, 'grad_norm': 0.19965773820877075, 'learning_rate': 0.00019137254901960786, 'epoch': 0.04}


  4%|▍         | 545/12500 [56:52<19:19:26,  5.82s/it]

{'loss': 0.5982, 'grad_norm': 0.3143417537212372, 'learning_rate': 0.00019135654261704684, 'epoch': 0.04}


  4%|▍         | 546/12500 [57:03<23:51:22,  7.18s/it]

{'loss': 0.6739, 'grad_norm': 0.2031095176935196, 'learning_rate': 0.0001913405362144858, 'epoch': 0.04}


  4%|▍         | 547/12500 [57:08<22:08:26,  6.67s/it]

{'loss': 1.1723, 'grad_norm': 0.31904783844947815, 'learning_rate': 0.00019132452981192476, 'epoch': 0.04}


  4%|▍         | 548/12500 [57:17<24:32:36,  7.39s/it]

{'loss': 0.6771, 'grad_norm': 0.21117767691612244, 'learning_rate': 0.00019130852340936376, 'epoch': 0.04}


  4%|▍         | 549/12500 [57:23<23:17:54,  7.02s/it]

{'loss': 0.8598, 'grad_norm': 0.3232039213180542, 'learning_rate': 0.00019129251700680274, 'epoch': 0.04}


  4%|▍         | 550/12500 [57:28<21:26:45,  6.46s/it]

{'loss': 0.7448, 'grad_norm': 0.3020477592945099, 'learning_rate': 0.0001912765106042417, 'epoch': 0.04}


  4%|▍         | 551/12500 [57:36<22:40:06,  6.83s/it]

{'loss': 0.841, 'grad_norm': 0.2739114463329315, 'learning_rate': 0.0001912605042016807, 'epoch': 0.04}


  4%|▍         | 552/12500 [57:41<20:39:03,  6.22s/it]

{'loss': 0.5626, 'grad_norm': 0.24947063624858856, 'learning_rate': 0.00019124449779911966, 'epoch': 0.04}


  4%|▍         | 553/12500 [57:47<20:10:51,  6.08s/it]

{'loss': 0.5261, 'grad_norm': 0.22537057101726532, 'learning_rate': 0.00019122849139655864, 'epoch': 0.04}


  4%|▍         | 554/12500 [57:55<22:49:27,  6.88s/it]

{'loss': 0.7045, 'grad_norm': 0.22693020105361938, 'learning_rate': 0.0001912124849939976, 'epoch': 0.04}


  4%|▍         | 555/12500 [58:01<22:00:04,  6.63s/it]

{'loss': 0.7479, 'grad_norm': 0.22563228011131287, 'learning_rate': 0.00019119647859143659, 'epoch': 0.04}


  4%|▍         | 556/12500 [58:07<20:57:02,  6.31s/it]

{'loss': 0.9103, 'grad_norm': 0.34451824426651, 'learning_rate': 0.00019118047218887556, 'epoch': 0.04}


  4%|▍         | 557/12500 [58:18<25:15:17,  7.61s/it]

{'loss': 1.0835, 'grad_norm': 0.19983817636966705, 'learning_rate': 0.00019116446578631454, 'epoch': 0.04}


  4%|▍         | 558/12500 [58:24<24:08:07,  7.28s/it]

{'loss': 0.877, 'grad_norm': 0.2496805489063263, 'learning_rate': 0.0001911484593837535, 'epoch': 0.04}


  4%|▍         | 559/12500 [58:30<22:42:55,  6.85s/it]

{'loss': 0.9236, 'grad_norm': 0.36173325777053833, 'learning_rate': 0.00019113245298119249, 'epoch': 0.04}


  4%|▍         | 560/12500 [58:34<19:57:29,  6.02s/it]

{'loss': 0.5557, 'grad_norm': 0.3183971643447876, 'learning_rate': 0.00019111644657863146, 'epoch': 0.04}


  4%|▍         | 561/12500 [58:40<19:25:44,  5.86s/it]

{'loss': 0.7946, 'grad_norm': 0.2872048616409302, 'learning_rate': 0.00019110044017607044, 'epoch': 0.04}


  4%|▍         | 562/12500 [58:43<17:09:06,  5.17s/it]

{'loss': 0.5495, 'grad_norm': 0.30296337604522705, 'learning_rate': 0.0001910844337735094, 'epoch': 0.04}


  5%|▍         | 563/12500 [58:48<16:53:54,  5.10s/it]

{'loss': 0.8601, 'grad_norm': 0.27719980478286743, 'learning_rate': 0.00019106842737094838, 'epoch': 0.05}


  5%|▍         | 564/12500 [58:53<16:51:29,  5.08s/it]

{'loss': 0.7291, 'grad_norm': 0.28983134031295776, 'learning_rate': 0.00019105242096838736, 'epoch': 0.05}


  5%|▍         | 565/12500 [58:58<17:03:38,  5.15s/it]

{'loss': 0.9722, 'grad_norm': 0.25786763429641724, 'learning_rate': 0.00019103641456582636, 'epoch': 0.05}


  5%|▍         | 566/12500 [59:05<18:59:04,  5.73s/it]

{'loss': 0.7544, 'grad_norm': 0.23522216081619263, 'learning_rate': 0.0001910204081632653, 'epoch': 0.05}


  5%|▍         | 567/12500 [59:10<18:03:14,  5.45s/it]

{'loss': 0.8979, 'grad_norm': 0.28138867020606995, 'learning_rate': 0.00019100440176070428, 'epoch': 0.05}


  5%|▍         | 568/12500 [59:15<17:09:46,  5.18s/it]

{'loss': 0.6738, 'grad_norm': 0.3008953928947449, 'learning_rate': 0.00019098839535814326, 'epoch': 0.05}


  5%|▍         | 569/12500 [59:22<19:16:26,  5.82s/it]

{'loss': 0.5532, 'grad_norm': 0.25189512968063354, 'learning_rate': 0.00019097238895558226, 'epoch': 0.05}


  5%|▍         | 570/12500 [59:28<19:11:39,  5.79s/it]

{'loss': 0.6778, 'grad_norm': 0.3518221378326416, 'learning_rate': 0.0001909563825530212, 'epoch': 0.05}


  5%|▍         | 571/12500 [59:33<18:54:14,  5.70s/it]

{'loss': 0.8172, 'grad_norm': 0.24446824193000793, 'learning_rate': 0.00019094037615046018, 'epoch': 0.05}


  5%|▍         | 572/12500 [59:42<22:12:23,  6.70s/it]

{'loss': 0.934, 'grad_norm': 0.18808920681476593, 'learning_rate': 0.00019092436974789919, 'epoch': 0.05}


  5%|▍         | 573/12500 [59:51<23:55:27,  7.22s/it]

{'loss': 0.5761, 'grad_norm': 0.23060141503810883, 'learning_rate': 0.00019090836334533816, 'epoch': 0.05}


  5%|▍         | 574/12500 [59:56<21:47:54,  6.58s/it]

{'loss': 1.0806, 'grad_norm': 0.3183106780052185, 'learning_rate': 0.0001908923569427771, 'epoch': 0.05}


  5%|▍         | 575/12500 [1:00:01<19:54:56,  6.01s/it]

{'loss': 0.6966, 'grad_norm': 0.28007471561431885, 'learning_rate': 0.00019087635054021608, 'epoch': 0.05}


  5%|▍         | 576/12500 [1:00:06<19:41:46,  5.95s/it]

{'loss': 0.5658, 'grad_norm': 0.2586062252521515, 'learning_rate': 0.00019086034413765508, 'epoch': 0.05}


  5%|▍         | 577/12500 [1:00:12<19:30:46,  5.89s/it]

{'loss': 0.6145, 'grad_norm': 0.26070380210876465, 'learning_rate': 0.00019084433773509406, 'epoch': 0.05}


  5%|▍         | 578/12500 [1:00:17<18:02:21,  5.45s/it]

{'loss': 0.6233, 'grad_norm': 0.30762261152267456, 'learning_rate': 0.000190828331332533, 'epoch': 0.05}


  5%|▍         | 579/12500 [1:00:22<18:05:02,  5.46s/it]

{'loss': 0.7048, 'grad_norm': 0.2667863965034485, 'learning_rate': 0.000190812324929972, 'epoch': 0.05}


  5%|▍         | 580/12500 [1:00:31<21:53:24,  6.61s/it]

{'loss': 0.786, 'grad_norm': 0.1945081204175949, 'learning_rate': 0.00019079631852741098, 'epoch': 0.05}


  5%|▍         | 581/12500 [1:00:37<21:12:35,  6.41s/it]

{'loss': 0.6377, 'grad_norm': 0.24813717603683472, 'learning_rate': 0.00019078031212484996, 'epoch': 0.05}


  5%|▍         | 582/12500 [1:00:44<21:26:42,  6.48s/it]

{'loss': 0.613, 'grad_norm': 0.23197327554225922, 'learning_rate': 0.0001907643057222889, 'epoch': 0.05}


  5%|▍         | 583/12500 [1:00:49<20:15:30,  6.12s/it]

{'loss': 0.7498, 'grad_norm': 0.30182960629463196, 'learning_rate': 0.0001907482993197279, 'epoch': 0.05}


  5%|▍         | 584/12500 [1:00:54<18:59:15,  5.74s/it]

{'loss': 0.9476, 'grad_norm': 0.31212612986564636, 'learning_rate': 0.00019073229291716688, 'epoch': 0.05}


  5%|▍         | 585/12500 [1:00:59<18:23:37,  5.56s/it]

{'loss': 0.6963, 'grad_norm': 0.2654566466808319, 'learning_rate': 0.00019071628651460586, 'epoch': 0.05}


  5%|▍         | 586/12500 [1:01:04<17:23:09,  5.25s/it]

{'loss': 0.6396, 'grad_norm': 0.3016694486141205, 'learning_rate': 0.00019070028011204483, 'epoch': 0.05}


  5%|▍         | 587/12500 [1:01:08<15:58:38,  4.83s/it]

{'loss': 0.6128, 'grad_norm': 0.2861984372138977, 'learning_rate': 0.0001906842737094838, 'epoch': 0.05}


  5%|▍         | 588/12500 [1:01:14<17:18:24,  5.23s/it]

{'loss': 0.7674, 'grad_norm': 0.23012597858905792, 'learning_rate': 0.00019066826730692278, 'epoch': 0.05}


  5%|▍         | 589/12500 [1:01:18<16:32:38,  5.00s/it]

{'loss': 0.7908, 'grad_norm': 0.3914478123188019, 'learning_rate': 0.00019065226090436176, 'epoch': 0.05}


  5%|▍         | 590/12500 [1:01:23<16:41:21,  5.04s/it]

{'loss': 0.6159, 'grad_norm': 0.277592271566391, 'learning_rate': 0.00019063625450180073, 'epoch': 0.05}


  5%|▍         | 591/12500 [1:01:31<18:50:28,  5.70s/it]

{'loss': 0.7232, 'grad_norm': 0.2530216872692108, 'learning_rate': 0.0001906202480992397, 'epoch': 0.05}


  5%|▍         | 592/12500 [1:01:35<17:41:20,  5.35s/it]

{'loss': 0.6646, 'grad_norm': 0.27954888343811035, 'learning_rate': 0.00019060424169667868, 'epoch': 0.05}


  5%|▍         | 593/12500 [1:01:38<15:44:54,  4.76s/it]

{'loss': 0.9146, 'grad_norm': 0.3194146454334259, 'learning_rate': 0.00019058823529411766, 'epoch': 0.05}


  5%|▍         | 594/12500 [1:01:43<15:47:18,  4.77s/it]

{'loss': 0.6988, 'grad_norm': 0.28369495272636414, 'learning_rate': 0.00019057222889155663, 'epoch': 0.05}


  5%|▍         | 595/12500 [1:01:47<14:23:24,  4.35s/it]

{'loss': 0.9088, 'grad_norm': 0.31325507164001465, 'learning_rate': 0.0001905562224889956, 'epoch': 0.05}


  5%|▍         | 596/12500 [1:01:51<14:51:31,  4.49s/it]

{'loss': 0.654, 'grad_norm': 0.2857212722301483, 'learning_rate': 0.00019054021608643458, 'epoch': 0.05}


  5%|▍         | 597/12500 [1:01:58<17:05:13,  5.17s/it]

{'loss': 0.6293, 'grad_norm': 0.23494313657283783, 'learning_rate': 0.00019052420968387356, 'epoch': 0.05}


  5%|▍         | 598/12500 [1:02:05<18:39:59,  5.65s/it]

{'loss': 0.8069, 'grad_norm': 0.22537961602210999, 'learning_rate': 0.00019050820328131253, 'epoch': 0.05}


  5%|▍         | 599/12500 [1:02:12<19:49:30,  6.00s/it]

{'loss': 0.9625, 'grad_norm': 0.2505570650100708, 'learning_rate': 0.0001904921968787515, 'epoch': 0.05}


  5%|▍         | 600/12500 [1:02:20<22:14:07,  6.73s/it]

{'loss': 0.8842, 'grad_norm': 0.24008964002132416, 'learning_rate': 0.00019047619047619048, 'epoch': 0.05}


  5%|▍         | 601/12500 [1:02:26<21:08:23,  6.40s/it]

{'loss': 0.5712, 'grad_norm': 0.26716887950897217, 'learning_rate': 0.00019046018407362946, 'epoch': 0.05}


  5%|▍         | 602/12500 [1:02:35<24:13:10,  7.33s/it]

{'loss': 0.8319, 'grad_norm': 0.2010170966386795, 'learning_rate': 0.00019044417767106843, 'epoch': 0.05}


  5%|▍         | 603/12500 [1:02:42<23:23:43,  7.08s/it]

{'loss': 0.5152, 'grad_norm': 0.24607539176940918, 'learning_rate': 0.0001904281712685074, 'epoch': 0.05}


  5%|▍         | 604/12500 [1:02:51<25:38:01,  7.76s/it]

{'loss': 0.6258, 'grad_norm': 0.1762651950120926, 'learning_rate': 0.0001904121648659464, 'epoch': 0.05}


  5%|▍         | 605/12500 [1:02:57<23:49:08,  7.21s/it]

{'loss': 0.7211, 'grad_norm': 0.2577413022518158, 'learning_rate': 0.00019039615846338536, 'epoch': 0.05}


  5%|▍         | 606/12500 [1:03:03<22:46:02,  6.89s/it]

{'loss': 0.8954, 'grad_norm': 0.2843673527240753, 'learning_rate': 0.00019038015206082433, 'epoch': 0.05}


  5%|▍         | 607/12500 [1:03:09<21:24:28,  6.48s/it]

{'loss': 1.0798, 'grad_norm': 0.32751432061195374, 'learning_rate': 0.0001903641456582633, 'epoch': 0.05}


  5%|▍         | 608/12500 [1:03:14<20:07:43,  6.09s/it]

{'loss': 0.6245, 'grad_norm': 0.24946314096450806, 'learning_rate': 0.0001903481392557023, 'epoch': 0.05}


  5%|▍         | 609/12500 [1:03:19<19:25:44,  5.88s/it]

{'loss': 0.7169, 'grad_norm': 0.29803502559661865, 'learning_rate': 0.00019033213285314126, 'epoch': 0.05}


  5%|▍         | 610/12500 [1:03:26<19:54:24,  6.03s/it]

{'loss': 0.8664, 'grad_norm': 0.2646186649799347, 'learning_rate': 0.00019031612645058023, 'epoch': 0.05}


  5%|▍         | 611/12500 [1:03:31<19:30:17,  5.91s/it]

{'loss': 0.8082, 'grad_norm': 0.24321232736110687, 'learning_rate': 0.00019030012004801923, 'epoch': 0.05}


  5%|▍         | 612/12500 [1:03:36<18:23:54,  5.57s/it]

{'loss': 0.6946, 'grad_norm': 0.31386178731918335, 'learning_rate': 0.0001902841136454582, 'epoch': 0.05}


  5%|▍         | 613/12500 [1:03:44<20:39:10,  6.25s/it]

{'loss': 0.6459, 'grad_norm': 0.23893395066261292, 'learning_rate': 0.00019026810724289715, 'epoch': 0.05}


  5%|▍         | 614/12500 [1:03:49<19:22:10,  5.87s/it]

{'loss': 0.9197, 'grad_norm': 0.2571641206741333, 'learning_rate': 0.00019025210084033613, 'epoch': 0.05}


  5%|▍         | 615/12500 [1:03:53<17:44:52,  5.38s/it]

{'loss': 0.684, 'grad_norm': 0.2794419825077057, 'learning_rate': 0.00019023609443777513, 'epoch': 0.05}


  5%|▍         | 616/12500 [1:04:02<21:02:52,  6.38s/it]

{'loss': 0.5819, 'grad_norm': 0.20671525597572327, 'learning_rate': 0.0001902200880352141, 'epoch': 0.05}


  5%|▍         | 617/12500 [1:04:06<18:59:12,  5.75s/it]

{'loss': 0.5231, 'grad_norm': 0.23530663549900055, 'learning_rate': 0.00019020408163265305, 'epoch': 0.05}


  5%|▍         | 618/12500 [1:04:11<17:55:29,  5.43s/it]

{'loss': 0.6623, 'grad_norm': 0.2831140458583832, 'learning_rate': 0.00019018807523009206, 'epoch': 0.05}


  5%|▍         | 619/12500 [1:04:15<16:27:11,  4.99s/it]

{'loss': 0.7859, 'grad_norm': 0.31341275572776794, 'learning_rate': 0.00019017206882753103, 'epoch': 0.05}


  5%|▍         | 620/12500 [1:04:22<18:29:56,  5.61s/it]

{'loss': 0.9637, 'grad_norm': 0.27167314291000366, 'learning_rate': 0.00019015606242497, 'epoch': 0.05}


  5%|▍         | 621/12500 [1:04:26<17:25:16,  5.28s/it]

{'loss': 0.6702, 'grad_norm': 0.2729339301586151, 'learning_rate': 0.00019014005602240895, 'epoch': 0.05}


  5%|▍         | 622/12500 [1:04:34<19:18:41,  5.85s/it]

{'loss': 0.9356, 'grad_norm': 0.25379982590675354, 'learning_rate': 0.00019012404961984796, 'epoch': 0.05}


  5%|▍         | 623/12500 [1:04:38<17:32:36,  5.32s/it]

{'loss': 0.7619, 'grad_norm': 0.31153446435928345, 'learning_rate': 0.00019010804321728693, 'epoch': 0.05}


  5%|▍         | 624/12500 [1:04:44<18:07:27,  5.49s/it]

{'loss': 0.7889, 'grad_norm': 0.3165629506111145, 'learning_rate': 0.0001900920368147259, 'epoch': 0.05}


  5%|▌         | 625/12500 [1:04:48<16:50:04,  5.10s/it]

{'loss': 0.8136, 'grad_norm': 0.3067728579044342, 'learning_rate': 0.00019007603041216488, 'epoch': 0.05}


  5%|▌         | 626/12500 [1:04:53<16:50:23,  5.11s/it]

{'loss': 0.6858, 'grad_norm': 0.3348533809185028, 'learning_rate': 0.00019006002400960385, 'epoch': 0.05}


  5%|▌         | 627/12500 [1:05:01<19:35:22,  5.94s/it]

{'loss': 0.7511, 'grad_norm': 0.21072055399417877, 'learning_rate': 0.00019004401760704283, 'epoch': 0.05}


  5%|▌         | 628/12500 [1:05:08<21:10:39,  6.42s/it]

{'loss': 0.4376, 'grad_norm': 0.2075723260641098, 'learning_rate': 0.0001900280112044818, 'epoch': 0.05}


  5%|▌         | 629/12500 [1:05:13<19:53:34,  6.03s/it]

{'loss': 0.8403, 'grad_norm': 0.33079636096954346, 'learning_rate': 0.00019001200480192078, 'epoch': 0.05}


  5%|▌         | 630/12500 [1:05:21<21:20:35,  6.47s/it]

{'loss': 0.6822, 'grad_norm': 0.2187727391719818, 'learning_rate': 0.00018999599839935975, 'epoch': 0.05}


  5%|▌         | 631/12500 [1:05:26<20:15:52,  6.15s/it]

{'loss': 0.5596, 'grad_norm': 0.2796933948993683, 'learning_rate': 0.00018997999199679873, 'epoch': 0.05}


  5%|▌         | 632/12500 [1:05:30<18:10:38,  5.51s/it]

{'loss': 0.94, 'grad_norm': 0.30787980556488037, 'learning_rate': 0.0001899639855942377, 'epoch': 0.05}


  5%|▌         | 633/12500 [1:05:35<16:54:55,  5.13s/it]

{'loss': 0.8623, 'grad_norm': 0.30045613646507263, 'learning_rate': 0.00018994797919167668, 'epoch': 0.05}


  5%|▌         | 634/12500 [1:05:41<18:03:11,  5.48s/it]

{'loss': 0.6992, 'grad_norm': 0.24818509817123413, 'learning_rate': 0.00018993197278911565, 'epoch': 0.05}


  5%|▌         | 635/12500 [1:05:46<17:25:03,  5.28s/it]

{'loss': 0.587, 'grad_norm': 0.2906678318977356, 'learning_rate': 0.00018991596638655463, 'epoch': 0.05}


  5%|▌         | 636/12500 [1:05:50<16:38:29,  5.05s/it]

{'loss': 0.7542, 'grad_norm': 0.30115097761154175, 'learning_rate': 0.0001898999599839936, 'epoch': 0.05}


  5%|▌         | 637/12500 [1:05:55<16:09:53,  4.91s/it]

{'loss': 0.7404, 'grad_norm': 0.31135261058807373, 'learning_rate': 0.00018988395358143258, 'epoch': 0.05}


  5%|▌         | 638/12500 [1:05:59<15:52:52,  4.82s/it]

{'loss': 0.5914, 'grad_norm': 0.2808142304420471, 'learning_rate': 0.00018986794717887155, 'epoch': 0.05}


  5%|▌         | 639/12500 [1:06:06<17:13:14,  5.23s/it]

{'loss': 0.5931, 'grad_norm': 0.24173478782176971, 'learning_rate': 0.00018985194077631055, 'epoch': 0.05}


  5%|▌         | 640/12500 [1:06:11<17:49:09,  5.41s/it]

{'loss': 0.5424, 'grad_norm': 0.2285795956850052, 'learning_rate': 0.0001898359343737495, 'epoch': 0.05}


  5%|▌         | 641/12500 [1:06:15<15:42:24,  4.77s/it]

{'loss': 0.4539, 'grad_norm': 0.2756383717060089, 'learning_rate': 0.00018981992797118848, 'epoch': 0.05}


  5%|▌         | 642/12500 [1:06:20<16:17:48,  4.95s/it]

{'loss': 0.6363, 'grad_norm': 0.31105199456214905, 'learning_rate': 0.00018980392156862745, 'epoch': 0.05}


  5%|▌         | 643/12500 [1:06:28<19:18:49,  5.86s/it]

{'loss': 0.9911, 'grad_norm': 0.2314017415046692, 'learning_rate': 0.00018978791516606645, 'epoch': 0.05}


  5%|▌         | 644/12500 [1:06:31<16:46:19,  5.09s/it]

{'loss': 0.8218, 'grad_norm': 0.30977344512939453, 'learning_rate': 0.0001897719087635054, 'epoch': 0.05}


  5%|▌         | 645/12500 [1:06:37<17:20:21,  5.27s/it]

{'loss': 0.7215, 'grad_norm': 0.26061761379241943, 'learning_rate': 0.00018975590236094438, 'epoch': 0.05}


  5%|▌         | 646/12500 [1:06:46<21:14:17,  6.45s/it]

{'loss': 0.6346, 'grad_norm': 0.23142117261886597, 'learning_rate': 0.00018973989595838338, 'epoch': 0.05}


  5%|▌         | 647/12500 [1:06:52<20:16:32,  6.16s/it]

{'loss': 0.5052, 'grad_norm': 0.23407648503780365, 'learning_rate': 0.00018972388955582235, 'epoch': 0.05}


  5%|▌         | 648/12500 [1:06:59<21:08:15,  6.42s/it]

{'loss': 0.3949, 'grad_norm': 0.19558174908161163, 'learning_rate': 0.0001897078831532613, 'epoch': 0.05}


  5%|▌         | 649/12500 [1:07:05<21:25:27,  6.51s/it]

{'loss': 0.7306, 'grad_norm': 0.21059750020503998, 'learning_rate': 0.00018969187675070028, 'epoch': 0.05}


  5%|▌         | 650/12500 [1:07:10<19:07:53,  5.81s/it]

{'loss': 0.4352, 'grad_norm': 0.2941049635410309, 'learning_rate': 0.00018967587034813928, 'epoch': 0.05}


  5%|▌         | 651/12500 [1:07:16<20:04:35,  6.10s/it]

{'loss': 0.6623, 'grad_norm': 0.2895871102809906, 'learning_rate': 0.00018965986394557825, 'epoch': 0.05}


  5%|▌         | 652/12500 [1:07:21<18:46:09,  5.70s/it]

{'loss': 0.8105, 'grad_norm': 0.2854718565940857, 'learning_rate': 0.0001896438575430172, 'epoch': 0.05}


  5%|▌         | 653/12500 [1:07:27<19:15:07,  5.85s/it]

{'loss': 0.6092, 'grad_norm': 0.30341649055480957, 'learning_rate': 0.00018962785114045618, 'epoch': 0.05}


  5%|▌         | 654/12500 [1:07:32<17:34:50,  5.34s/it]

{'loss': 0.5694, 'grad_norm': 0.26261982321739197, 'learning_rate': 0.00018961184473789518, 'epoch': 0.05}


  5%|▌         | 655/12500 [1:07:36<17:10:31,  5.22s/it]

{'loss': 0.9445, 'grad_norm': 0.34704071283340454, 'learning_rate': 0.00018959583833533415, 'epoch': 0.05}


  5%|▌         | 656/12500 [1:07:41<16:28:55,  5.01s/it]

{'loss': 0.7642, 'grad_norm': 0.2800547778606415, 'learning_rate': 0.0001895798319327731, 'epoch': 0.05}


  5%|▌         | 657/12500 [1:07:46<16:15:00,  4.94s/it]

{'loss': 0.6279, 'grad_norm': 0.24624155461788177, 'learning_rate': 0.0001895638255302121, 'epoch': 0.05}


  5%|▌         | 658/12500 [1:07:55<20:48:07,  6.32s/it]

{'loss': 0.4679, 'grad_norm': 0.1860395073890686, 'learning_rate': 0.00018954781912765108, 'epoch': 0.05}


  5%|▌         | 659/12500 [1:08:01<20:15:04,  6.16s/it]

{'loss': 0.6093, 'grad_norm': 0.25272536277770996, 'learning_rate': 0.00018953181272509005, 'epoch': 0.05}


  5%|▌         | 660/12500 [1:08:08<21:23:59,  6.51s/it]

{'loss': 0.5623, 'grad_norm': 0.2505563497543335, 'learning_rate': 0.000189515806322529, 'epoch': 0.05}


  5%|▌         | 661/12500 [1:08:16<22:08:21,  6.73s/it]

{'loss': 0.7473, 'grad_norm': 0.2195388823747635, 'learning_rate': 0.000189499799919968, 'epoch': 0.05}


  5%|▌         | 662/12500 [1:08:22<22:00:25,  6.69s/it]

{'loss': 0.4244, 'grad_norm': 0.2017059624195099, 'learning_rate': 0.00018948379351740698, 'epoch': 0.05}


  5%|▌         | 663/12500 [1:08:29<22:18:41,  6.79s/it]

{'loss': 0.9659, 'grad_norm': 0.25585469603538513, 'learning_rate': 0.00018946778711484595, 'epoch': 0.05}


  5%|▌         | 664/12500 [1:08:35<21:31:44,  6.55s/it]

{'loss': 0.9073, 'grad_norm': 0.2758079767227173, 'learning_rate': 0.00018945178071228493, 'epoch': 0.05}


  5%|▌         | 665/12500 [1:08:40<20:03:31,  6.10s/it]

{'loss': 0.768, 'grad_norm': 0.37397855520248413, 'learning_rate': 0.0001894357743097239, 'epoch': 0.05}


  5%|▌         | 666/12500 [1:08:50<23:37:58,  7.19s/it]

{'loss': 0.7951, 'grad_norm': 0.2113780677318573, 'learning_rate': 0.00018941976790716288, 'epoch': 0.05}


  5%|▌         | 667/12500 [1:08:55<21:49:56,  6.64s/it]

{'loss': 0.7142, 'grad_norm': 0.2730504870414734, 'learning_rate': 0.00018940376150460185, 'epoch': 0.05}


  5%|▌         | 668/12500 [1:09:00<19:39:51,  5.98s/it]

{'loss': 0.8027, 'grad_norm': 0.29874083399772644, 'learning_rate': 0.00018938775510204083, 'epoch': 0.05}


  5%|▌         | 669/12500 [1:09:07<20:26:59,  6.22s/it]

{'loss': 0.8359, 'grad_norm': 0.3008033037185669, 'learning_rate': 0.0001893717486994798, 'epoch': 0.05}


  5%|▌         | 670/12500 [1:09:11<18:46:10,  5.71s/it]

{'loss': 0.7125, 'grad_norm': 0.32712188363075256, 'learning_rate': 0.00018935574229691878, 'epoch': 0.05}


  5%|▌         | 671/12500 [1:09:16<17:38:47,  5.37s/it]

{'loss': 0.9038, 'grad_norm': 0.2946029603481293, 'learning_rate': 0.00018933973589435775, 'epoch': 0.05}


  5%|▌         | 672/12500 [1:09:21<17:17:29,  5.26s/it]

{'loss': 0.9107, 'grad_norm': 0.3300274908542633, 'learning_rate': 0.00018932372949179673, 'epoch': 0.05}


  5%|▌         | 673/12500 [1:09:25<16:06:54,  4.91s/it]

{'loss': 0.7928, 'grad_norm': 0.3566843271255493, 'learning_rate': 0.0001893077230892357, 'epoch': 0.05}


  5%|▌         | 674/12500 [1:09:33<19:02:18,  5.80s/it]

{'loss': 1.0253, 'grad_norm': 0.2807579040527344, 'learning_rate': 0.00018929171668667467, 'epoch': 0.05}


  5%|▌         | 675/12500 [1:09:39<19:39:08,  5.98s/it]

{'loss': 0.6935, 'grad_norm': 0.24433089792728424, 'learning_rate': 0.00018927571028411365, 'epoch': 0.05}


  5%|▌         | 676/12500 [1:09:48<22:12:00,  6.76s/it]

{'loss': 0.9852, 'grad_norm': 0.25567737221717834, 'learning_rate': 0.00018925970388155262, 'epoch': 0.05}


  5%|▌         | 677/12500 [1:09:52<19:33:59,  5.96s/it]

{'loss': 0.5605, 'grad_norm': 0.2627022862434387, 'learning_rate': 0.0001892436974789916, 'epoch': 0.05}


  5%|▌         | 678/12500 [1:09:56<18:05:56,  5.51s/it]

{'loss': 0.8079, 'grad_norm': 0.31814390420913696, 'learning_rate': 0.0001892276910764306, 'epoch': 0.05}


  5%|▌         | 679/12500 [1:10:06<21:52:31,  6.66s/it]

{'loss': 0.8421, 'grad_norm': 0.19418056309223175, 'learning_rate': 0.00018921168467386955, 'epoch': 0.05}


  5%|▌         | 680/12500 [1:10:13<22:31:17,  6.86s/it]

{'loss': 0.912, 'grad_norm': 0.2253534346818924, 'learning_rate': 0.00018919567827130852, 'epoch': 0.05}


  5%|▌         | 681/12500 [1:10:17<19:57:28,  6.08s/it]

{'loss': 0.7308, 'grad_norm': 0.3156518042087555, 'learning_rate': 0.0001891796718687475, 'epoch': 0.05}


  5%|▌         | 682/12500 [1:10:21<18:06:56,  5.52s/it]

{'loss': 0.7831, 'grad_norm': 0.269550085067749, 'learning_rate': 0.0001891636654661865, 'epoch': 0.05}


  5%|▌         | 683/12500 [1:10:27<18:24:37,  5.61s/it]

{'loss': 0.8697, 'grad_norm': 0.2368713915348053, 'learning_rate': 0.00018914765906362545, 'epoch': 0.05}


  5%|▌         | 684/12500 [1:10:31<17:00:33,  5.18s/it]

{'loss': 0.7258, 'grad_norm': 0.285150408744812, 'learning_rate': 0.00018913165266106442, 'epoch': 0.05}


  5%|▌         | 685/12500 [1:10:36<16:14:27,  4.95s/it]

{'loss': 0.609, 'grad_norm': 0.2822476923465729, 'learning_rate': 0.00018911564625850343, 'epoch': 0.05}


  5%|▌         | 686/12500 [1:10:42<17:29:05,  5.33s/it]

{'loss': 0.601, 'grad_norm': 0.28653842210769653, 'learning_rate': 0.0001890996398559424, 'epoch': 0.05}


  5%|▌         | 687/12500 [1:10:47<17:20:45,  5.29s/it]

{'loss': 0.9025, 'grad_norm': 0.31498709321022034, 'learning_rate': 0.00018908363345338135, 'epoch': 0.05}


  6%|▌         | 688/12500 [1:10:54<18:39:52,  5.69s/it]

{'loss': 0.7451, 'grad_norm': 0.28754183650016785, 'learning_rate': 0.00018906762705082032, 'epoch': 0.06}


  6%|▌         | 689/12500 [1:10:59<17:55:09,  5.46s/it]

{'loss': 0.7551, 'grad_norm': 0.3092702329158783, 'learning_rate': 0.00018905162064825932, 'epoch': 0.06}


  6%|▌         | 690/12500 [1:11:04<17:48:07,  5.43s/it]

{'loss': 0.9298, 'grad_norm': 0.25436267256736755, 'learning_rate': 0.0001890356142456983, 'epoch': 0.06}


  6%|▌         | 691/12500 [1:11:10<18:15:44,  5.57s/it]

{'loss': 0.521, 'grad_norm': 0.22276392579078674, 'learning_rate': 0.00018901960784313725, 'epoch': 0.06}


  6%|▌         | 692/12500 [1:11:19<21:10:50,  6.46s/it]

{'loss': 0.9551, 'grad_norm': 0.2302524894475937, 'learning_rate': 0.00018900360144057625, 'epoch': 0.06}


  6%|▌         | 693/12500 [1:11:24<19:57:05,  6.08s/it]

{'loss': 0.7415, 'grad_norm': 0.27965694665908813, 'learning_rate': 0.00018898759503801522, 'epoch': 0.06}


  6%|▌         | 694/12500 [1:11:32<22:08:07,  6.75s/it]

{'loss': 1.0042, 'grad_norm': 0.22171998023986816, 'learning_rate': 0.0001889715886354542, 'epoch': 0.06}


  6%|▌         | 695/12500 [1:11:37<20:33:21,  6.27s/it]

{'loss': 0.5856, 'grad_norm': 0.3126448094844818, 'learning_rate': 0.00018895558223289315, 'epoch': 0.06}


  6%|▌         | 696/12500 [1:11:43<19:52:36,  6.06s/it]

{'loss': 0.8502, 'grad_norm': 0.27140140533447266, 'learning_rate': 0.00018893957583033215, 'epoch': 0.06}


  6%|▌         | 697/12500 [1:11:49<20:07:42,  6.14s/it]

{'loss': 0.6195, 'grad_norm': 0.22436405718326569, 'learning_rate': 0.00018892356942777112, 'epoch': 0.06}


  6%|▌         | 698/12500 [1:11:55<20:08:43,  6.15s/it]

{'loss': 0.7699, 'grad_norm': 0.30329427123069763, 'learning_rate': 0.0001889075630252101, 'epoch': 0.06}


  6%|▌         | 699/12500 [1:12:01<19:53:22,  6.07s/it]

{'loss': 0.7286, 'grad_norm': 0.26490649580955505, 'learning_rate': 0.00018889155662264907, 'epoch': 0.06}


  6%|▌         | 700/12500 [1:12:10<22:31:50,  6.87s/it]

{'loss': 0.4994, 'grad_norm': 0.20633754134178162, 'learning_rate': 0.00018887555022008805, 'epoch': 0.06}


  6%|▌         | 701/12500 [1:12:17<22:27:18,  6.85s/it]

{'loss': 0.5886, 'grad_norm': 0.2552459239959717, 'learning_rate': 0.00018885954381752702, 'epoch': 0.06}


  6%|▌         | 702/12500 [1:12:26<24:42:56,  7.54s/it]

{'loss': 1.1812, 'grad_norm': 0.21294349431991577, 'learning_rate': 0.000188843537414966, 'epoch': 0.06}


  6%|▌         | 703/12500 [1:12:35<26:02:33,  7.95s/it]

{'loss': 0.5085, 'grad_norm': 0.25702911615371704, 'learning_rate': 0.00018882753101240497, 'epoch': 0.06}


  6%|▌         | 704/12500 [1:12:41<24:02:26,  7.34s/it]

{'loss': 0.7036, 'grad_norm': 0.2264140546321869, 'learning_rate': 0.00018881152460984395, 'epoch': 0.06}


  6%|▌         | 705/12500 [1:12:46<22:01:07,  6.72s/it]

{'loss': 0.6949, 'grad_norm': 0.28080058097839355, 'learning_rate': 0.00018879551820728292, 'epoch': 0.06}


  6%|▌         | 706/12500 [1:12:50<19:33:59,  5.97s/it]

{'loss': 0.7241, 'grad_norm': 0.270129919052124, 'learning_rate': 0.0001887795118047219, 'epoch': 0.06}


  6%|▌         | 707/12500 [1:12:56<19:38:54,  6.00s/it]

{'loss': 0.7236, 'grad_norm': 0.2549736201763153, 'learning_rate': 0.00018876350540216087, 'epoch': 0.06}


  6%|▌         | 708/12500 [1:13:02<19:05:44,  5.83s/it]

{'loss': 0.7488, 'grad_norm': 0.26146218180656433, 'learning_rate': 0.00018874749899959985, 'epoch': 0.06}


  6%|▌         | 709/12500 [1:13:05<16:54:01,  5.16s/it]

{'loss': 0.7616, 'grad_norm': 0.3329751491546631, 'learning_rate': 0.00018873149259703882, 'epoch': 0.06}


  6%|▌         | 710/12500 [1:13:14<20:08:57,  6.15s/it]

{'loss': 0.8091, 'grad_norm': 0.2149718552827835, 'learning_rate': 0.0001887154861944778, 'epoch': 0.06}


  6%|▌         | 711/12500 [1:13:20<20:21:37,  6.22s/it]

{'loss': 0.8386, 'grad_norm': 0.24673928320407867, 'learning_rate': 0.00018869947979191677, 'epoch': 0.06}


  6%|▌         | 712/12500 [1:13:24<18:29:59,  5.65s/it]

{'loss': 0.7969, 'grad_norm': 0.34091249108314514, 'learning_rate': 0.00018868347338935575, 'epoch': 0.06}


  6%|▌         | 713/12500 [1:13:28<16:49:06,  5.14s/it]

{'loss': 0.7241, 'grad_norm': 0.28929874300956726, 'learning_rate': 0.00018866746698679472, 'epoch': 0.06}


  6%|▌         | 714/12500 [1:13:36<19:15:10,  5.88s/it]

{'loss': 0.9183, 'grad_norm': 0.23482990264892578, 'learning_rate': 0.0001886514605842337, 'epoch': 0.06}


  6%|▌         | 715/12500 [1:13:42<18:57:18,  5.79s/it]

{'loss': 0.7282, 'grad_norm': 0.2672767639160156, 'learning_rate': 0.00018863545418167267, 'epoch': 0.06}


  6%|▌         | 716/12500 [1:13:46<17:34:43,  5.37s/it]

{'loss': 0.6969, 'grad_norm': 0.2813652753829956, 'learning_rate': 0.00018861944777911165, 'epoch': 0.06}


  6%|▌         | 717/12500 [1:13:50<16:46:32,  5.13s/it]

{'loss': 0.8093, 'grad_norm': 0.335946649312973, 'learning_rate': 0.00018860344137655065, 'epoch': 0.06}


  6%|▌         | 718/12500 [1:13:58<19:10:57,  5.86s/it]

{'loss': 0.495, 'grad_norm': 0.1970078945159912, 'learning_rate': 0.0001885874349739896, 'epoch': 0.06}


  6%|▌         | 719/12500 [1:14:05<19:49:18,  6.06s/it]

{'loss': 0.6814, 'grad_norm': 0.2111106961965561, 'learning_rate': 0.00018857142857142857, 'epoch': 0.06}


  6%|▌         | 720/12500 [1:14:09<18:15:01,  5.58s/it]

{'loss': 0.5436, 'grad_norm': 0.2645767629146576, 'learning_rate': 0.00018855542216886755, 'epoch': 0.06}


  6%|▌         | 721/12500 [1:14:18<21:05:53,  6.45s/it]

{'loss': 0.9336, 'grad_norm': 0.20014774799346924, 'learning_rate': 0.00018853941576630655, 'epoch': 0.06}


  6%|▌         | 722/12500 [1:14:23<20:15:10,  6.19s/it]

{'loss': 0.6235, 'grad_norm': 0.2611711621284485, 'learning_rate': 0.0001885234093637455, 'epoch': 0.06}


  6%|▌         | 723/12500 [1:14:27<17:48:50,  5.45s/it]

{'loss': 0.7543, 'grad_norm': 0.3046124279499054, 'learning_rate': 0.00018850740296118447, 'epoch': 0.06}


  6%|▌         | 724/12500 [1:14:35<20:54:10,  6.39s/it]

{'loss': 0.9816, 'grad_norm': 0.25732290744781494, 'learning_rate': 0.00018849139655862347, 'epoch': 0.06}


  6%|▌         | 725/12500 [1:14:43<21:53:39,  6.69s/it]

{'loss': 0.7536, 'grad_norm': 0.26639533042907715, 'learning_rate': 0.00018847539015606245, 'epoch': 0.06}


  6%|▌         | 726/12500 [1:14:49<21:22:40,  6.54s/it]

{'loss': 0.6633, 'grad_norm': 0.2896670401096344, 'learning_rate': 0.0001884593837535014, 'epoch': 0.06}


  6%|▌         | 727/12500 [1:14:56<21:53:38,  6.69s/it]

{'loss': 0.7335, 'grad_norm': 0.24722206592559814, 'learning_rate': 0.00018844337735094037, 'epoch': 0.06}


  6%|▌         | 728/12500 [1:15:03<21:42:53,  6.64s/it]

{'loss': 0.9249, 'grad_norm': 0.22341138124465942, 'learning_rate': 0.00018842737094837937, 'epoch': 0.06}


  6%|▌         | 729/12500 [1:15:06<18:54:35,  5.78s/it]

{'loss': 0.9136, 'grad_norm': 0.39827969670295715, 'learning_rate': 0.00018841136454581835, 'epoch': 0.06}


  6%|▌         | 730/12500 [1:15:13<19:58:19,  6.11s/it]

{'loss': 0.6338, 'grad_norm': 0.23447676002979279, 'learning_rate': 0.0001883953581432573, 'epoch': 0.06}


  6%|▌         | 731/12500 [1:15:20<20:13:49,  6.19s/it]

{'loss': 0.8047, 'grad_norm': 0.23503002524375916, 'learning_rate': 0.0001883793517406963, 'epoch': 0.06}


  6%|▌         | 732/12500 [1:15:24<18:38:07,  5.70s/it]

{'loss': 0.6236, 'grad_norm': 0.312215119600296, 'learning_rate': 0.00018836334533813527, 'epoch': 0.06}


  6%|▌         | 733/12500 [1:15:31<19:52:34,  6.08s/it]

{'loss': 1.1063, 'grad_norm': 0.25102806091308594, 'learning_rate': 0.00018834733893557425, 'epoch': 0.06}


  6%|▌         | 734/12500 [1:15:35<18:10:36,  5.56s/it]

{'loss': 0.7907, 'grad_norm': 0.3016721308231354, 'learning_rate': 0.0001883313325330132, 'epoch': 0.06}


  6%|▌         | 735/12500 [1:15:40<16:49:53,  5.15s/it]

{'loss': 0.9985, 'grad_norm': 0.2775430381298065, 'learning_rate': 0.0001883153261304522, 'epoch': 0.06}


  6%|▌         | 736/12500 [1:15:44<15:34:03,  4.76s/it]

{'loss': 0.739, 'grad_norm': 0.32600072026252747, 'learning_rate': 0.00018829931972789117, 'epoch': 0.06}


  6%|▌         | 737/12500 [1:15:48<15:04:14,  4.61s/it]

{'loss': 0.6345, 'grad_norm': 0.26666033267974854, 'learning_rate': 0.00018828331332533014, 'epoch': 0.06}


  6%|▌         | 738/12500 [1:15:54<16:53:12,  5.17s/it]

{'loss': 0.532, 'grad_norm': 0.2407829463481903, 'learning_rate': 0.00018826730692276912, 'epoch': 0.06}


  6%|▌         | 739/12500 [1:16:01<18:20:18,  5.61s/it]

{'loss': 0.6551, 'grad_norm': 0.2967509329319, 'learning_rate': 0.0001882513005202081, 'epoch': 0.06}


  6%|▌         | 740/12500 [1:16:06<18:04:51,  5.53s/it]

{'loss': 0.9267, 'grad_norm': 0.31574270129203796, 'learning_rate': 0.00018823529411764707, 'epoch': 0.06}


  6%|▌         | 741/12500 [1:16:10<15:52:11,  4.86s/it]

{'loss': 1.0441, 'grad_norm': 0.38674721121788025, 'learning_rate': 0.00018821928771508604, 'epoch': 0.06}


  6%|▌         | 742/12500 [1:16:15<16:51:36,  5.16s/it]

{'loss': 0.6876, 'grad_norm': 0.28801339864730835, 'learning_rate': 0.00018820328131252502, 'epoch': 0.06}


  6%|▌         | 743/12500 [1:16:20<16:02:24,  4.91s/it]

{'loss': 0.5938, 'grad_norm': 0.2626456916332245, 'learning_rate': 0.000188187274909964, 'epoch': 0.06}


  6%|▌         | 744/12500 [1:16:27<18:19:53,  5.61s/it]

{'loss': 1.0239, 'grad_norm': 0.23464800417423248, 'learning_rate': 0.00018817126850740297, 'epoch': 0.06}


  6%|▌         | 745/12500 [1:16:33<18:34:44,  5.69s/it]

{'loss': 0.6976, 'grad_norm': 0.22106200456619263, 'learning_rate': 0.00018815526210484194, 'epoch': 0.06}


  6%|▌         | 746/12500 [1:16:42<21:41:05,  6.64s/it]

{'loss': 0.9955, 'grad_norm': 0.2108367532491684, 'learning_rate': 0.00018813925570228092, 'epoch': 0.06}


  6%|▌         | 747/12500 [1:16:48<21:08:54,  6.48s/it]

{'loss': 0.8249, 'grad_norm': 0.2603927552700043, 'learning_rate': 0.0001881232492997199, 'epoch': 0.06}


  6%|▌         | 748/12500 [1:16:55<21:32:49,  6.60s/it]

{'loss': 0.6002, 'grad_norm': 0.2119835466146469, 'learning_rate': 0.00018810724289715887, 'epoch': 0.06}


  6%|▌         | 749/12500 [1:17:02<22:41:26,  6.95s/it]

{'loss': 0.5851, 'grad_norm': 0.196701318025589, 'learning_rate': 0.00018809123649459784, 'epoch': 0.06}


  6%|▌         | 750/12500 [1:17:09<22:18:45,  6.84s/it]

{'loss': 0.5551, 'grad_norm': 0.2557253837585449, 'learning_rate': 0.00018807523009203682, 'epoch': 0.06}


  6%|▌         | 751/12500 [1:17:14<20:50:09,  6.38s/it]

{'loss': 0.8834, 'grad_norm': 0.30944928526878357, 'learning_rate': 0.0001880592236894758, 'epoch': 0.06}


  6%|▌         | 752/12500 [1:17:25<24:46:16,  7.59s/it]

{'loss': 0.7128, 'grad_norm': 0.19681118428707123, 'learning_rate': 0.0001880432172869148, 'epoch': 0.06}


  6%|▌         | 753/12500 [1:17:30<22:25:00,  6.87s/it]

{'loss': 0.7286, 'grad_norm': 0.30153030157089233, 'learning_rate': 0.00018802721088435374, 'epoch': 0.06}


  6%|▌         | 754/12500 [1:17:37<22:59:39,  7.05s/it]

{'loss': 0.5031, 'grad_norm': 0.21796461939811707, 'learning_rate': 0.00018801120448179272, 'epoch': 0.06}


  6%|▌         | 755/12500 [1:17:43<21:12:50,  6.50s/it]

{'loss': 0.7583, 'grad_norm': 0.24578635394573212, 'learning_rate': 0.0001879951980792317, 'epoch': 0.06}


  6%|▌         | 756/12500 [1:17:49<21:03:43,  6.46s/it]

{'loss': 0.6664, 'grad_norm': 0.25727570056915283, 'learning_rate': 0.0001879791916766707, 'epoch': 0.06}


  6%|▌         | 757/12500 [1:17:53<19:07:12,  5.86s/it]

{'loss': 0.7813, 'grad_norm': 0.31626299023628235, 'learning_rate': 0.00018796318527410964, 'epoch': 0.06}


  6%|▌         | 758/12500 [1:17:59<18:25:52,  5.65s/it]

{'loss': 0.5605, 'grad_norm': 0.2260213941335678, 'learning_rate': 0.00018794717887154862, 'epoch': 0.06}


  6%|▌         | 759/12500 [1:18:05<18:57:49,  5.81s/it]

{'loss': 0.7872, 'grad_norm': 0.30186906456947327, 'learning_rate': 0.0001879311724689876, 'epoch': 0.06}


  6%|▌         | 760/12500 [1:18:10<18:40:39,  5.73s/it]

{'loss': 0.4914, 'grad_norm': 0.2364683747291565, 'learning_rate': 0.0001879151660664266, 'epoch': 0.06}


  6%|▌         | 761/12500 [1:18:14<17:00:02,  5.21s/it]

{'loss': 0.8545, 'grad_norm': 0.3238818347454071, 'learning_rate': 0.00018789915966386554, 'epoch': 0.06}


  6%|▌         | 762/12500 [1:18:19<16:11:58,  4.97s/it]

{'loss': 0.6253, 'grad_norm': 0.28012993931770325, 'learning_rate': 0.00018788315326130452, 'epoch': 0.06}


  6%|▌         | 763/12500 [1:18:23<15:06:07,  4.63s/it]

{'loss': 1.0621, 'grad_norm': 0.36157602071762085, 'learning_rate': 0.00018786714685874352, 'epoch': 0.06}


  6%|▌         | 764/12500 [1:18:28<16:08:03,  4.95s/it]

{'loss': 0.8857, 'grad_norm': 0.32351744174957275, 'learning_rate': 0.0001878511404561825, 'epoch': 0.06}


  6%|▌         | 765/12500 [1:18:37<19:49:54,  6.08s/it]

{'loss': 0.6609, 'grad_norm': 0.21958422660827637, 'learning_rate': 0.00018783513405362144, 'epoch': 0.06}


  6%|▌         | 766/12500 [1:18:41<17:52:00,  5.48s/it]

{'loss': 0.7739, 'grad_norm': 0.2906199097633362, 'learning_rate': 0.00018781912765106042, 'epoch': 0.06}


  6%|▌         | 767/12500 [1:18:45<16:11:39,  4.97s/it]

{'loss': 0.5656, 'grad_norm': 0.300747275352478, 'learning_rate': 0.00018780312124849942, 'epoch': 0.06}


  6%|▌         | 768/12500 [1:18:49<15:33:48,  4.78s/it]

{'loss': 0.8157, 'grad_norm': 0.31181859970092773, 'learning_rate': 0.0001877871148459384, 'epoch': 0.06}


  6%|▌         | 769/12500 [1:19:00<21:50:23,  6.70s/it]

{'loss': 0.6958, 'grad_norm': 0.18138648569583893, 'learning_rate': 0.00018777110844337734, 'epoch': 0.06}


  6%|▌         | 770/12500 [1:19:07<22:04:53,  6.78s/it]

{'loss': 0.8001, 'grad_norm': 0.23795467615127563, 'learning_rate': 0.00018775510204081634, 'epoch': 0.06}


  6%|▌         | 771/12500 [1:19:12<19:56:28,  6.12s/it]

{'loss': 0.8915, 'grad_norm': 0.3470252454280853, 'learning_rate': 0.00018773909563825532, 'epoch': 0.06}


  6%|▌         | 772/12500 [1:19:17<18:36:50,  5.71s/it]

{'loss': 0.8223, 'grad_norm': 0.2427210509777069, 'learning_rate': 0.0001877230892356943, 'epoch': 0.06}


  6%|▌         | 773/12500 [1:19:24<20:24:14,  6.26s/it]

{'loss': 0.9266, 'grad_norm': 0.2261534184217453, 'learning_rate': 0.00018770708283313324, 'epoch': 0.06}


  6%|▌         | 774/12500 [1:19:30<20:16:48,  6.23s/it]

{'loss': 0.6891, 'grad_norm': 0.34083297848701477, 'learning_rate': 0.00018769107643057224, 'epoch': 0.06}


  6%|▌         | 775/12500 [1:19:35<19:01:19,  5.84s/it]

{'loss': 0.736, 'grad_norm': 0.32066118717193604, 'learning_rate': 0.00018767507002801122, 'epoch': 0.06}


  6%|▌         | 776/12500 [1:19:43<20:21:23,  6.25s/it]

{'loss': 0.6401, 'grad_norm': 0.3014032542705536, 'learning_rate': 0.0001876590636254502, 'epoch': 0.06}


  6%|▌         | 777/12500 [1:19:51<22:53:23,  7.03s/it]

{'loss': 0.8896, 'grad_norm': 0.21774809062480927, 'learning_rate': 0.00018764305722288917, 'epoch': 0.06}


  6%|▌         | 778/12500 [1:19:56<20:54:29,  6.42s/it]

{'loss': 0.5425, 'grad_norm': 0.2310745269060135, 'learning_rate': 0.00018762705082032814, 'epoch': 0.06}


  6%|▌         | 779/12500 [1:20:05<22:34:00,  6.93s/it]

{'loss': 0.6796, 'grad_norm': 0.21288228034973145, 'learning_rate': 0.00018761104441776712, 'epoch': 0.06}


  6%|▌         | 780/12500 [1:20:09<20:15:48,  6.22s/it]

{'loss': 0.6156, 'grad_norm': 0.31049010157585144, 'learning_rate': 0.0001875950380152061, 'epoch': 0.06}


  6%|▌         | 781/12500 [1:20:14<19:10:18,  5.89s/it]

{'loss': 0.8518, 'grad_norm': 0.377494752407074, 'learning_rate': 0.00018757903161264507, 'epoch': 0.06}


  6%|▋         | 782/12500 [1:20:18<17:01:04,  5.23s/it]

{'loss': 0.8969, 'grad_norm': 0.2969917953014374, 'learning_rate': 0.00018756302521008404, 'epoch': 0.06}


  6%|▋         | 783/12500 [1:20:26<19:27:11,  5.98s/it]

{'loss': 1.1241, 'grad_norm': 0.2277228832244873, 'learning_rate': 0.00018754701880752302, 'epoch': 0.06}


  6%|▋         | 784/12500 [1:20:32<19:36:45,  6.03s/it]

{'loss': 0.6647, 'grad_norm': 0.23659437894821167, 'learning_rate': 0.000187531012404962, 'epoch': 0.06}


  6%|▋         | 785/12500 [1:20:36<17:45:36,  5.46s/it]

{'loss': 0.9256, 'grad_norm': 0.3227694034576416, 'learning_rate': 0.00018751500600240096, 'epoch': 0.06}


  6%|▋         | 786/12500 [1:20:40<16:52:08,  5.18s/it]

{'loss': 0.4582, 'grad_norm': 0.3054812550544739, 'learning_rate': 0.00018749899959983994, 'epoch': 0.06}


  6%|▋         | 787/12500 [1:20:44<15:15:06,  4.69s/it]

{'loss': 1.0566, 'grad_norm': 0.3206733167171478, 'learning_rate': 0.00018748299319727891, 'epoch': 0.06}


  6%|▋         | 788/12500 [1:20:50<16:31:26,  5.08s/it]

{'loss': 0.5803, 'grad_norm': 0.2604702115058899, 'learning_rate': 0.0001874669867947179, 'epoch': 0.06}


  6%|▋         | 789/12500 [1:20:54<15:59:53,  4.92s/it]

{'loss': 0.4666, 'grad_norm': 0.2521950900554657, 'learning_rate': 0.00018745098039215686, 'epoch': 0.06}


  6%|▋         | 790/12500 [1:21:01<17:33:05,  5.40s/it]

{'loss': 0.6674, 'grad_norm': 0.31458011269569397, 'learning_rate': 0.00018743497398959584, 'epoch': 0.06}


  6%|▋         | 791/12500 [1:21:07<18:24:29,  5.66s/it]

{'loss': 0.6385, 'grad_norm': 0.22919504344463348, 'learning_rate': 0.00018741896758703484, 'epoch': 0.06}


  6%|▋         | 792/12500 [1:21:12<17:55:10,  5.51s/it]

{'loss': 0.6045, 'grad_norm': 0.22728760540485382, 'learning_rate': 0.0001874029611844738, 'epoch': 0.06}


  6%|▋         | 793/12500 [1:21:19<18:37:25,  5.73s/it]

{'loss': 0.7977, 'grad_norm': 0.2398737668991089, 'learning_rate': 0.00018738695478191276, 'epoch': 0.06}


  6%|▋         | 794/12500 [1:21:24<18:36:52,  5.72s/it]

{'loss': 0.5676, 'grad_norm': 0.24330225586891174, 'learning_rate': 0.00018737094837935174, 'epoch': 0.06}


  6%|▋         | 795/12500 [1:21:30<18:20:06,  5.64s/it]

{'loss': 0.992, 'grad_norm': 0.33866581320762634, 'learning_rate': 0.00018735494197679074, 'epoch': 0.06}


  6%|▋         | 796/12500 [1:21:39<22:07:16,  6.80s/it]

{'loss': 1.0217, 'grad_norm': 0.27287960052490234, 'learning_rate': 0.0001873389355742297, 'epoch': 0.06}


  6%|▋         | 797/12500 [1:21:46<22:24:29,  6.89s/it]

{'loss': 0.8172, 'grad_norm': 0.2299909144639969, 'learning_rate': 0.00018732292917166866, 'epoch': 0.06}


  6%|▋         | 798/12500 [1:21:55<23:35:50,  7.26s/it]

{'loss': 0.7667, 'grad_norm': 0.210862934589386, 'learning_rate': 0.00018730692276910767, 'epoch': 0.06}


  6%|▋         | 799/12500 [1:22:00<21:41:17,  6.67s/it]

{'loss': 0.658, 'grad_norm': 0.2613590955734253, 'learning_rate': 0.00018729091636654664, 'epoch': 0.06}


  6%|▋         | 800/12500 [1:22:07<21:42:18,  6.68s/it]

{'loss': 0.8819, 'grad_norm': 0.24232549965381622, 'learning_rate': 0.0001872749099639856, 'epoch': 0.06}


  6%|▋         | 801/12500 [1:22:15<23:14:10,  7.15s/it]

{'loss': 0.9711, 'grad_norm': 0.2771991193294525, 'learning_rate': 0.00018725890356142456, 'epoch': 0.06}


  6%|▋         | 802/12500 [1:22:21<22:39:02,  6.97s/it]

{'loss': 0.6392, 'grad_norm': 0.2595610022544861, 'learning_rate': 0.00018724289715886356, 'epoch': 0.06}


  6%|▋         | 803/12500 [1:22:26<20:34:57,  6.33s/it]

{'loss': 0.5902, 'grad_norm': 0.26671892404556274, 'learning_rate': 0.00018722689075630254, 'epoch': 0.06}


  6%|▋         | 804/12500 [1:22:32<19:53:58,  6.13s/it]

{'loss': 0.7128, 'grad_norm': 0.29352691769599915, 'learning_rate': 0.0001872108843537415, 'epoch': 0.06}


  6%|▋         | 805/12500 [1:22:37<19:19:07,  5.95s/it]

{'loss': 0.5999, 'grad_norm': 0.25119373202323914, 'learning_rate': 0.0001871948779511805, 'epoch': 0.06}


  6%|▋         | 806/12500 [1:22:44<19:55:52,  6.14s/it]

{'loss': 0.8923, 'grad_norm': 0.28665855526924133, 'learning_rate': 0.00018717887154861946, 'epoch': 0.06}


  6%|▋         | 807/12500 [1:22:54<23:36:32,  7.27s/it]

{'loss': 0.8679, 'grad_norm': 0.25186166167259216, 'learning_rate': 0.00018716286514605844, 'epoch': 0.06}


  6%|▋         | 808/12500 [1:23:00<22:04:47,  6.80s/it]

{'loss': 0.8305, 'grad_norm': 0.366324782371521, 'learning_rate': 0.0001871468587434974, 'epoch': 0.06}


  6%|▋         | 809/12500 [1:23:07<22:24:31,  6.90s/it]

{'loss': 0.6911, 'grad_norm': 0.2318265587091446, 'learning_rate': 0.0001871308523409364, 'epoch': 0.06}


  6%|▋         | 810/12500 [1:23:15<23:28:34,  7.23s/it]

{'loss': 0.608, 'grad_norm': 0.20827534794807434, 'learning_rate': 0.00018711484593837536, 'epoch': 0.06}


  6%|▋         | 811/12500 [1:23:20<21:12:44,  6.53s/it]

{'loss': 0.6991, 'grad_norm': 0.2735805809497833, 'learning_rate': 0.00018709883953581434, 'epoch': 0.06}


  6%|▋         | 812/12500 [1:23:25<19:47:05,  6.09s/it]

{'loss': 0.5583, 'grad_norm': 0.2705518901348114, 'learning_rate': 0.0001870828331332533, 'epoch': 0.06}


  7%|▋         | 813/12500 [1:23:31<19:54:50,  6.13s/it]

{'loss': 0.8012, 'grad_norm': 0.28548645973205566, 'learning_rate': 0.0001870668267306923, 'epoch': 0.07}


  7%|▋         | 814/12500 [1:23:37<20:09:56,  6.21s/it]

{'loss': 0.7449, 'grad_norm': 0.21880978345870972, 'learning_rate': 0.00018705082032813126, 'epoch': 0.07}


  7%|▋         | 815/12500 [1:23:42<18:39:21,  5.75s/it]

{'loss': 0.8539, 'grad_norm': 0.31744053959846497, 'learning_rate': 0.00018703481392557024, 'epoch': 0.07}


  7%|▋         | 816/12500 [1:23:49<19:43:06,  6.08s/it]

{'loss': 0.571, 'grad_norm': 0.19837920367717743, 'learning_rate': 0.0001870188075230092, 'epoch': 0.07}


  7%|▋         | 817/12500 [1:23:57<21:45:19,  6.70s/it]

{'loss': 1.0104, 'grad_norm': 0.25470370054244995, 'learning_rate': 0.0001870028011204482, 'epoch': 0.07}


  7%|▋         | 818/12500 [1:24:06<23:58:09,  7.39s/it]

{'loss': 0.7931, 'grad_norm': 0.22792337834835052, 'learning_rate': 0.00018698679471788716, 'epoch': 0.07}


  7%|▋         | 819/12500 [1:24:10<20:49:04,  6.42s/it]

{'loss': 0.8515, 'grad_norm': 0.3349623680114746, 'learning_rate': 0.00018697078831532614, 'epoch': 0.07}


  7%|▋         | 820/12500 [1:24:18<22:28:41,  6.93s/it]

{'loss': 0.7222, 'grad_norm': 0.22268547117710114, 'learning_rate': 0.0001869547819127651, 'epoch': 0.07}


  7%|▋         | 821/12500 [1:24:23<20:40:23,  6.37s/it]

{'loss': 0.6817, 'grad_norm': 0.2467029094696045, 'learning_rate': 0.0001869387755102041, 'epoch': 0.07}


  7%|▋         | 822/12500 [1:24:29<20:09:59,  6.22s/it]

{'loss': 0.8386, 'grad_norm': 0.37493547797203064, 'learning_rate': 0.00018692276910764306, 'epoch': 0.07}


  7%|▋         | 823/12500 [1:24:34<19:00:50,  5.86s/it]

{'loss': 0.6984, 'grad_norm': 0.27848339080810547, 'learning_rate': 0.00018690676270508206, 'epoch': 0.07}


  7%|▋         | 824/12500 [1:24:39<18:13:02,  5.62s/it]

{'loss': 0.8196, 'grad_norm': 0.27247917652130127, 'learning_rate': 0.000186890756302521, 'epoch': 0.07}


  7%|▋         | 825/12500 [1:24:43<16:13:06,  5.00s/it]

{'loss': 0.7024, 'grad_norm': 0.3216600716114044, 'learning_rate': 0.00018687474989995999, 'epoch': 0.07}


  7%|▋         | 826/12500 [1:24:48<16:38:28,  5.13s/it]

{'loss': 0.6192, 'grad_norm': 0.2816806733608246, 'learning_rate': 0.00018685874349739896, 'epoch': 0.07}


  7%|▋         | 827/12500 [1:25:00<22:51:02,  7.05s/it]

{'loss': 0.988, 'grad_norm': 0.1951574832201004, 'learning_rate': 0.00018684273709483796, 'epoch': 0.07}


  7%|▋         | 828/12500 [1:25:04<20:29:08,  6.32s/it]

{'loss': 0.7771, 'grad_norm': 0.3344978988170624, 'learning_rate': 0.0001868267306922769, 'epoch': 0.07}


  7%|▋         | 829/12500 [1:25:07<17:16:58,  5.33s/it]

{'loss': 0.6222, 'grad_norm': 0.3109610974788666, 'learning_rate': 0.00018681072428971589, 'epoch': 0.07}


  7%|▋         | 830/12500 [1:25:15<19:36:33,  6.05s/it]

{'loss': 0.4909, 'grad_norm': 0.2244241088628769, 'learning_rate': 0.0001867947178871549, 'epoch': 0.07}


  7%|▋         | 831/12500 [1:25:20<18:38:41,  5.75s/it]

{'loss': 0.5994, 'grad_norm': 0.2823968827724457, 'learning_rate': 0.00018677871148459386, 'epoch': 0.07}


  7%|▋         | 832/12500 [1:25:26<19:03:58,  5.88s/it]

{'loss': 0.6602, 'grad_norm': 0.2549087107181549, 'learning_rate': 0.0001867627050820328, 'epoch': 0.07}


  7%|▋         | 833/12500 [1:25:31<17:38:58,  5.45s/it]

{'loss': 0.6659, 'grad_norm': 0.2718484401702881, 'learning_rate': 0.00018674669867947179, 'epoch': 0.07}


  7%|▋         | 834/12500 [1:25:36<17:13:44,  5.32s/it]

{'loss': 0.5736, 'grad_norm': 0.2951961159706116, 'learning_rate': 0.0001867306922769108, 'epoch': 0.07}


  7%|▋         | 835/12500 [1:25:42<18:15:25,  5.63s/it]

{'loss': 0.4806, 'grad_norm': 0.23249468207359314, 'learning_rate': 0.00018671468587434976, 'epoch': 0.07}


  7%|▋         | 836/12500 [1:25:54<24:11:53,  7.47s/it]

{'loss': 0.6323, 'grad_norm': 0.1822255253791809, 'learning_rate': 0.0001866986794717887, 'epoch': 0.07}


  7%|▋         | 837/12500 [1:25:58<21:09:03,  6.53s/it]

{'loss': 0.6613, 'grad_norm': 0.2704676389694214, 'learning_rate': 0.0001866826730692277, 'epoch': 0.07}


  7%|▋         | 838/12500 [1:26:03<19:33:25,  6.04s/it]

{'loss': 0.6147, 'grad_norm': 0.2892448306083679, 'learning_rate': 0.0001866666666666667, 'epoch': 0.07}


  7%|▋         | 839/12500 [1:26:09<19:41:38,  6.08s/it]

{'loss': 0.6889, 'grad_norm': 0.28641557693481445, 'learning_rate': 0.00018665066026410566, 'epoch': 0.07}


  7%|▋         | 840/12500 [1:26:14<18:46:01,  5.79s/it]

{'loss': 0.8712, 'grad_norm': 0.28073716163635254, 'learning_rate': 0.0001866346538615446, 'epoch': 0.07}


  7%|▋         | 841/12500 [1:26:20<18:02:10,  5.57s/it]

{'loss': 0.7864, 'grad_norm': 0.3138950765132904, 'learning_rate': 0.0001866186474589836, 'epoch': 0.07}


  7%|▋         | 842/12500 [1:26:28<20:46:21,  6.41s/it]

{'loss': 0.8777, 'grad_norm': 0.22617597877979279, 'learning_rate': 0.00018660264105642259, 'epoch': 0.07}


  7%|▋         | 843/12500 [1:26:36<22:24:27,  6.92s/it]

{'loss': 0.6794, 'grad_norm': 0.23271703720092773, 'learning_rate': 0.00018658663465386156, 'epoch': 0.07}


  7%|▋         | 844/12500 [1:26:44<23:02:28,  7.12s/it]

{'loss': 0.6447, 'grad_norm': 0.2566169798374176, 'learning_rate': 0.00018657062825130054, 'epoch': 0.07}


  7%|▋         | 845/12500 [1:26:48<20:30:10,  6.33s/it]

{'loss': 0.6396, 'grad_norm': 0.26737460494041443, 'learning_rate': 0.0001865546218487395, 'epoch': 0.07}


  7%|▋         | 846/12500 [1:26:52<18:34:57,  5.74s/it]

{'loss': 0.8266, 'grad_norm': 0.3184930086135864, 'learning_rate': 0.00018653861544617849, 'epoch': 0.07}


  7%|▋         | 847/12500 [1:27:00<20:21:50,  6.29s/it]

{'loss': 0.5315, 'grad_norm': 0.23384520411491394, 'learning_rate': 0.00018652260904361746, 'epoch': 0.07}


  7%|▋         | 848/12500 [1:27:08<21:35:34,  6.67s/it]

{'loss': 0.6927, 'grad_norm': 0.21603061258792877, 'learning_rate': 0.00018650660264105643, 'epoch': 0.07}


  7%|▋         | 849/12500 [1:27:12<19:08:50,  5.92s/it]

{'loss': 0.7247, 'grad_norm': 0.3286980986595154, 'learning_rate': 0.0001864905962384954, 'epoch': 0.07}


  7%|▋         | 850/12500 [1:27:16<17:08:45,  5.30s/it]

{'loss': 0.634, 'grad_norm': 0.28146836161613464, 'learning_rate': 0.00018647458983593438, 'epoch': 0.07}


  7%|▋         | 851/12500 [1:27:22<17:46:21,  5.49s/it]

{'loss': 0.5959, 'grad_norm': 0.3155559003353119, 'learning_rate': 0.00018645858343337336, 'epoch': 0.07}


  7%|▋         | 852/12500 [1:27:26<16:28:30,  5.09s/it]

{'loss': 0.6826, 'grad_norm': 0.2797186076641083, 'learning_rate': 0.00018644257703081233, 'epoch': 0.07}


  7%|▋         | 853/12500 [1:27:30<15:52:01,  4.90s/it]

{'loss': 0.6917, 'grad_norm': 0.26262366771698, 'learning_rate': 0.0001864265706282513, 'epoch': 0.07}


  7%|▋         | 854/12500 [1:27:35<15:20:36,  4.74s/it]

{'loss': 0.694, 'grad_norm': 0.29083284735679626, 'learning_rate': 0.00018641056422569028, 'epoch': 0.07}


  7%|▋         | 855/12500 [1:27:41<16:57:55,  5.24s/it]

{'loss': 0.8185, 'grad_norm': 0.21424873173236847, 'learning_rate': 0.00018639455782312926, 'epoch': 0.07}


  7%|▋         | 856/12500 [1:27:47<18:03:04,  5.58s/it]

{'loss': 0.6872, 'grad_norm': 0.24016812443733215, 'learning_rate': 0.00018637855142056823, 'epoch': 0.07}


  7%|▋         | 857/12500 [1:27:52<17:27:10,  5.40s/it]

{'loss': 0.996, 'grad_norm': 0.30830684304237366, 'learning_rate': 0.0001863625450180072, 'epoch': 0.07}


  7%|▋         | 858/12500 [1:27:58<18:00:01,  5.57s/it]

{'loss': 0.9141, 'grad_norm': 0.2897115647792816, 'learning_rate': 0.0001863465386154462, 'epoch': 0.07}


  7%|▋         | 859/12500 [1:28:04<18:28:28,  5.71s/it]

{'loss': 0.6775, 'grad_norm': 0.28124815225601196, 'learning_rate': 0.00018633053221288516, 'epoch': 0.07}


  7%|▋         | 860/12500 [1:28:09<17:26:31,  5.39s/it]

{'loss': 0.8566, 'grad_norm': 0.31207039952278137, 'learning_rate': 0.00018631452581032413, 'epoch': 0.07}


  7%|▋         | 861/12500 [1:28:16<18:42:59,  5.79s/it]

{'loss': 0.6738, 'grad_norm': 0.24664461612701416, 'learning_rate': 0.0001862985194077631, 'epoch': 0.07}


  7%|▋         | 862/12500 [1:28:21<18:29:37,  5.72s/it]

{'loss': 0.5886, 'grad_norm': 0.24948285520076752, 'learning_rate': 0.0001862825130052021, 'epoch': 0.07}


  7%|▋         | 863/12500 [1:28:26<17:55:13,  5.54s/it]

{'loss': 0.8374, 'grad_norm': 0.2629435360431671, 'learning_rate': 0.00018626650660264106, 'epoch': 0.07}


  7%|▋         | 864/12500 [1:28:33<19:08:26,  5.92s/it]

{'loss': 0.7679, 'grad_norm': 0.2857531011104584, 'learning_rate': 0.00018625050020008003, 'epoch': 0.07}


  7%|▋         | 865/12500 [1:28:38<18:29:27,  5.72s/it]

{'loss': 0.7404, 'grad_norm': 0.24955183267593384, 'learning_rate': 0.00018623449379751903, 'epoch': 0.07}


  7%|▋         | 866/12500 [1:28:43<17:10:06,  5.31s/it]

{'loss': 0.7682, 'grad_norm': 0.31331485509872437, 'learning_rate': 0.000186218487394958, 'epoch': 0.07}


  7%|▋         | 867/12500 [1:28:48<17:21:02,  5.37s/it]

{'loss': 0.7844, 'grad_norm': 0.2916155159473419, 'learning_rate': 0.00018620248099239696, 'epoch': 0.07}


  7%|▋         | 868/12500 [1:28:55<19:05:51,  5.91s/it]

{'loss': 0.5839, 'grad_norm': 0.265533983707428, 'learning_rate': 0.00018618647458983593, 'epoch': 0.07}


  7%|▋         | 869/12500 [1:29:02<20:05:36,  6.22s/it]

{'loss': 0.5831, 'grad_norm': 0.23242858052253723, 'learning_rate': 0.00018617046818727493, 'epoch': 0.07}


  7%|▋         | 870/12500 [1:29:07<18:15:02,  5.65s/it]

{'loss': 0.5607, 'grad_norm': 0.2530093789100647, 'learning_rate': 0.0001861544617847139, 'epoch': 0.07}


  7%|▋         | 871/12500 [1:29:12<18:19:04,  5.67s/it]

{'loss': 0.655, 'grad_norm': 0.2989283502101898, 'learning_rate': 0.00018613845538215286, 'epoch': 0.07}


  7%|▋         | 872/12500 [1:29:18<17:55:42,  5.55s/it]

{'loss': 0.6904, 'grad_norm': 0.2466372400522232, 'learning_rate': 0.00018612244897959183, 'epoch': 0.07}


  7%|▋         | 873/12500 [1:29:27<21:12:18,  6.57s/it]

{'loss': 0.8782, 'grad_norm': 0.21126167476177216, 'learning_rate': 0.00018610644257703083, 'epoch': 0.07}


  7%|▋         | 874/12500 [1:29:33<20:48:24,  6.44s/it]

{'loss': 0.8259, 'grad_norm': 0.2497904747724533, 'learning_rate': 0.0001860904361744698, 'epoch': 0.07}


  7%|▋         | 875/12500 [1:29:36<17:53:04,  5.54s/it]

{'loss': 0.6066, 'grad_norm': 0.30915436148643494, 'learning_rate': 0.00018607442977190876, 'epoch': 0.07}


  7%|▋         | 876/12500 [1:29:47<23:07:08,  7.16s/it]

{'loss': 0.6193, 'grad_norm': 0.22057980298995972, 'learning_rate': 0.00018605842336934776, 'epoch': 0.07}


  7%|▋         | 877/12500 [1:29:55<24:09:50,  7.48s/it]

{'loss': 0.8316, 'grad_norm': 0.19943132996559143, 'learning_rate': 0.00018604241696678673, 'epoch': 0.07}


  7%|▋         | 878/12500 [1:30:04<25:10:41,  7.80s/it]

{'loss': 0.7289, 'grad_norm': 0.216449573636055, 'learning_rate': 0.0001860264105642257, 'epoch': 0.07}


  7%|▋         | 879/12500 [1:30:12<25:28:56,  7.89s/it]

{'loss': 0.6418, 'grad_norm': 0.21641013026237488, 'learning_rate': 0.00018601040416166466, 'epoch': 0.07}


  7%|▋         | 880/12500 [1:30:20<25:30:36,  7.90s/it]

{'loss': 0.7833, 'grad_norm': 0.230076402425766, 'learning_rate': 0.00018599439775910366, 'epoch': 0.07}


  7%|▋         | 881/12500 [1:30:26<23:16:07,  7.21s/it]

{'loss': 0.7575, 'grad_norm': 0.2670881450176239, 'learning_rate': 0.00018597839135654263, 'epoch': 0.07}


  7%|▋         | 882/12500 [1:30:31<21:57:36,  6.80s/it]

{'loss': 0.8312, 'grad_norm': 0.3215649425983429, 'learning_rate': 0.0001859623849539816, 'epoch': 0.07}


  7%|▋         | 883/12500 [1:30:37<20:43:50,  6.42s/it]

{'loss': 0.6536, 'grad_norm': 0.2544058561325073, 'learning_rate': 0.00018594637855142058, 'epoch': 0.07}


  7%|▋         | 884/12500 [1:30:44<21:28:38,  6.66s/it]

{'loss': 0.4984, 'grad_norm': 0.23465217649936676, 'learning_rate': 0.00018593037214885956, 'epoch': 0.07}


  7%|▋         | 885/12500 [1:30:53<23:22:02,  7.24s/it]

{'loss': 0.8602, 'grad_norm': 0.20446652173995972, 'learning_rate': 0.00018591436574629853, 'epoch': 0.07}


  7%|▋         | 886/12500 [1:30:59<22:21:11,  6.93s/it]

{'loss': 0.4715, 'grad_norm': 0.23368607461452484, 'learning_rate': 0.0001858983593437375, 'epoch': 0.07}


  7%|▋         | 887/12500 [1:31:05<21:19:14,  6.61s/it]

{'loss': 0.6633, 'grad_norm': 0.24718356132507324, 'learning_rate': 0.00018588235294117648, 'epoch': 0.07}


  7%|▋         | 888/12500 [1:31:12<22:04:28,  6.84s/it]

{'loss': 0.6412, 'grad_norm': 0.2319946587085724, 'learning_rate': 0.00018586634653861546, 'epoch': 0.07}


  7%|▋         | 889/12500 [1:31:18<20:53:56,  6.48s/it]

{'loss': 0.8242, 'grad_norm': 0.2689999043941498, 'learning_rate': 0.00018585034013605443, 'epoch': 0.07}


  7%|▋         | 890/12500 [1:31:25<21:12:58,  6.58s/it]

{'loss': 0.833, 'grad_norm': 0.23761425912380219, 'learning_rate': 0.0001858343337334934, 'epoch': 0.07}


  7%|▋         | 891/12500 [1:31:33<22:39:18,  7.03s/it]

{'loss': 0.5397, 'grad_norm': 0.21076591312885284, 'learning_rate': 0.00018581832733093238, 'epoch': 0.07}


  7%|▋         | 892/12500 [1:31:41<23:46:26,  7.37s/it]

{'loss': 0.5687, 'grad_norm': 0.20775765180587769, 'learning_rate': 0.00018580232092837136, 'epoch': 0.07}


  7%|▋         | 893/12500 [1:31:45<20:30:31,  6.36s/it]

{'loss': 0.7675, 'grad_norm': 0.2858593761920929, 'learning_rate': 0.00018578631452581033, 'epoch': 0.07}


  7%|▋         | 894/12500 [1:31:50<18:50:32,  5.84s/it]

{'loss': 0.8348, 'grad_norm': 0.2918208837509155, 'learning_rate': 0.0001857703081232493, 'epoch': 0.07}


  7%|▋         | 895/12500 [1:31:54<17:41:09,  5.49s/it]

{'loss': 0.5984, 'grad_norm': 0.25113362073898315, 'learning_rate': 0.00018575430172068828, 'epoch': 0.07}


  7%|▋         | 896/12500 [1:31:58<15:45:36,  4.89s/it]

{'loss': 0.5825, 'grad_norm': 0.30884960293769836, 'learning_rate': 0.00018573829531812726, 'epoch': 0.07}


  7%|▋         | 897/12500 [1:32:04<16:53:23,  5.24s/it]

{'loss': 0.6866, 'grad_norm': 0.2844989597797394, 'learning_rate': 0.00018572228891556626, 'epoch': 0.07}


  7%|▋         | 898/12500 [1:32:08<15:38:41,  4.85s/it]

{'loss': 0.6056, 'grad_norm': 0.30644407868385315, 'learning_rate': 0.0001857062825130052, 'epoch': 0.07}


  7%|▋         | 899/12500 [1:32:14<17:04:56,  5.30s/it]

{'loss': 0.6746, 'grad_norm': 0.2746979594230652, 'learning_rate': 0.00018569027611044418, 'epoch': 0.07}


  7%|▋         | 900/12500 [1:32:20<17:16:06,  5.36s/it]

{'loss': 0.5881, 'grad_norm': 0.27209189534187317, 'learning_rate': 0.00018567426970788315, 'epoch': 0.07}


  7%|▋         | 901/12500 [1:32:25<17:29:15,  5.43s/it]

{'loss': 0.5771, 'grad_norm': 0.30264416337013245, 'learning_rate': 0.00018565826330532216, 'epoch': 0.07}


  7%|▋         | 902/12500 [1:32:32<18:55:59,  5.88s/it]

{'loss': 0.877, 'grad_norm': 0.2484334260225296, 'learning_rate': 0.0001856422569027611, 'epoch': 0.07}


  7%|▋         | 903/12500 [1:32:37<18:02:07,  5.60s/it]

{'loss': 0.6543, 'grad_norm': 0.27564117312431335, 'learning_rate': 0.00018562625050020008, 'epoch': 0.07}


  7%|▋         | 904/12500 [1:32:45<20:24:48,  6.34s/it]

{'loss': 0.4329, 'grad_norm': 0.19386489689350128, 'learning_rate': 0.00018561024409763908, 'epoch': 0.07}


  7%|▋         | 905/12500 [1:32:55<23:44:23,  7.37s/it]

{'loss': 0.5081, 'grad_norm': 0.1907089352607727, 'learning_rate': 0.00018559423769507806, 'epoch': 0.07}


  7%|▋         | 906/12500 [1:33:03<24:05:28,  7.48s/it]

{'loss': 0.7159, 'grad_norm': 0.23479336500167847, 'learning_rate': 0.000185578231292517, 'epoch': 0.07}


  7%|▋         | 907/12500 [1:33:09<23:22:27,  7.26s/it]

{'loss': 0.983, 'grad_norm': 0.2507019639015198, 'learning_rate': 0.00018556222488995598, 'epoch': 0.07}


  7%|▋         | 908/12500 [1:33:14<20:53:26,  6.49s/it]

{'loss': 0.7611, 'grad_norm': 0.321206271648407, 'learning_rate': 0.00018554621848739498, 'epoch': 0.07}


  7%|▋         | 909/12500 [1:33:18<18:53:53,  5.87s/it]

{'loss': 0.6234, 'grad_norm': 0.28419235348701477, 'learning_rate': 0.00018553021208483396, 'epoch': 0.07}


  7%|▋         | 910/12500 [1:33:25<19:06:37,  5.94s/it]

{'loss': 0.7422, 'grad_norm': 0.30627453327178955, 'learning_rate': 0.0001855142056822729, 'epoch': 0.07}


  7%|▋         | 911/12500 [1:33:29<17:57:21,  5.58s/it]

{'loss': 0.7457, 'grad_norm': 0.28421342372894287, 'learning_rate': 0.0001854981992797119, 'epoch': 0.07}


  7%|▋         | 912/12500 [1:33:35<17:38:48,  5.48s/it]

{'loss': 0.7239, 'grad_norm': 0.2531236708164215, 'learning_rate': 0.00018548219287715088, 'epoch': 0.07}


  7%|▋         | 913/12500 [1:33:40<17:09:26,  5.33s/it]

{'loss': 0.688, 'grad_norm': 0.2766754925251007, 'learning_rate': 0.00018546618647458985, 'epoch': 0.07}


  7%|▋         | 914/12500 [1:33:44<16:28:28,  5.12s/it]

{'loss': 0.7538, 'grad_norm': 0.29275718331336975, 'learning_rate': 0.0001854501800720288, 'epoch': 0.07}


  7%|▋         | 915/12500 [1:33:48<15:30:52,  4.82s/it]

{'loss': 0.775, 'grad_norm': 0.300333172082901, 'learning_rate': 0.0001854341736694678, 'epoch': 0.07}


  7%|▋         | 916/12500 [1:33:53<15:31:42,  4.83s/it]

{'loss': 0.7223, 'grad_norm': 0.2992618680000305, 'learning_rate': 0.00018541816726690678, 'epoch': 0.07}


  7%|▋         | 917/12500 [1:33:58<15:23:08,  4.78s/it]

{'loss': 0.6553, 'grad_norm': 0.3180651068687439, 'learning_rate': 0.00018540216086434575, 'epoch': 0.07}


  7%|▋         | 918/12500 [1:34:03<15:44:11,  4.89s/it]

{'loss': 0.7302, 'grad_norm': 0.26835858821868896, 'learning_rate': 0.00018538615446178473, 'epoch': 0.07}


  7%|▋         | 919/12500 [1:34:09<16:37:57,  5.17s/it]

{'loss': 0.4924, 'grad_norm': 0.22285927832126617, 'learning_rate': 0.0001853701480592237, 'epoch': 0.07}


  7%|▋         | 920/12500 [1:34:16<18:32:17,  5.76s/it]

{'loss': 0.9558, 'grad_norm': 0.23730960488319397, 'learning_rate': 0.00018535414165666268, 'epoch': 0.07}


  7%|▋         | 921/12500 [1:34:26<22:47:36,  7.09s/it]

{'loss': 0.9057, 'grad_norm': 0.20540054142475128, 'learning_rate': 0.00018533813525410165, 'epoch': 0.07}


  7%|▋         | 922/12500 [1:34:36<25:05:16,  7.80s/it]

{'loss': 0.9824, 'grad_norm': 0.21596571803092957, 'learning_rate': 0.00018532212885154063, 'epoch': 0.07}


  7%|▋         | 923/12500 [1:34:40<21:41:33,  6.75s/it]

{'loss': 0.7446, 'grad_norm': 0.34900158643722534, 'learning_rate': 0.0001853061224489796, 'epoch': 0.07}


  7%|▋         | 924/12500 [1:34:45<20:12:28,  6.28s/it]

{'loss': 0.6398, 'grad_norm': 0.27221837639808655, 'learning_rate': 0.00018529011604641858, 'epoch': 0.07}


  7%|▋         | 925/12500 [1:34:50<19:15:17,  5.99s/it]

{'loss': 0.9152, 'grad_norm': 0.25348833203315735, 'learning_rate': 0.00018527410964385755, 'epoch': 0.07}


  7%|▋         | 926/12500 [1:34:57<19:32:18,  6.08s/it]

{'loss': 0.8024, 'grad_norm': 0.2847447991371155, 'learning_rate': 0.00018525810324129653, 'epoch': 0.07}


  7%|▋         | 927/12500 [1:35:04<20:59:19,  6.53s/it]

{'loss': 0.4688, 'grad_norm': 0.23158612847328186, 'learning_rate': 0.0001852420968387355, 'epoch': 0.07}


  7%|▋         | 928/12500 [1:35:09<19:16:58,  6.00s/it]

{'loss': 0.7224, 'grad_norm': 0.2825678288936615, 'learning_rate': 0.00018522609043617448, 'epoch': 0.07}


  7%|▋         | 929/12500 [1:35:15<19:35:56,  6.10s/it]

{'loss': 0.8515, 'grad_norm': 0.22602449357509613, 'learning_rate': 0.00018521008403361345, 'epoch': 0.07}


  7%|▋         | 930/12500 [1:35:20<18:12:59,  5.67s/it]

{'loss': 0.6399, 'grad_norm': 0.2733936309814453, 'learning_rate': 0.00018519407763105243, 'epoch': 0.07}


  7%|▋         | 931/12500 [1:35:27<19:39:17,  6.12s/it]

{'loss': 0.7098, 'grad_norm': 0.20869044959545135, 'learning_rate': 0.0001851780712284914, 'epoch': 0.07}


  7%|▋         | 932/12500 [1:35:32<18:40:19,  5.81s/it]

{'loss': 0.6695, 'grad_norm': 0.2811703383922577, 'learning_rate': 0.00018516206482593038, 'epoch': 0.07}


  7%|▋         | 933/12500 [1:35:38<18:39:15,  5.81s/it]

{'loss': 0.9582, 'grad_norm': 0.3057458698749542, 'learning_rate': 0.00018514605842336935, 'epoch': 0.07}


  7%|▋         | 934/12500 [1:35:42<17:16:42,  5.38s/it]

{'loss': 1.069, 'grad_norm': 0.27277567982673645, 'learning_rate': 0.00018513005202080833, 'epoch': 0.07}


  7%|▋         | 935/12500 [1:35:47<16:32:28,  5.15s/it]

{'loss': 0.5366, 'grad_norm': 0.32335302233695984, 'learning_rate': 0.0001851140456182473, 'epoch': 0.07}


  7%|▋         | 936/12500 [1:35:52<16:50:49,  5.24s/it]

{'loss': 0.6197, 'grad_norm': 0.23581428825855255, 'learning_rate': 0.0001850980392156863, 'epoch': 0.07}


  7%|▋         | 937/12500 [1:36:02<20:40:17,  6.44s/it]

{'loss': 0.5696, 'grad_norm': 0.21844904124736786, 'learning_rate': 0.00018508203281312525, 'epoch': 0.07}


  8%|▊         | 938/12500 [1:36:07<19:20:36,  6.02s/it]

{'loss': 0.9069, 'grad_norm': 0.3463585078716278, 'learning_rate': 0.00018506602641056423, 'epoch': 0.08}


  8%|▊         | 939/12500 [1:36:14<20:31:57,  6.39s/it]

{'loss': 0.6604, 'grad_norm': 0.2765566408634186, 'learning_rate': 0.0001850500200080032, 'epoch': 0.08}


  8%|▊         | 940/12500 [1:36:18<18:01:28,  5.61s/it]

{'loss': 0.7048, 'grad_norm': 0.3190847337245941, 'learning_rate': 0.0001850340136054422, 'epoch': 0.08}


  8%|▊         | 941/12500 [1:36:29<23:08:50,  7.21s/it]

{'loss': 0.7936, 'grad_norm': 0.18810927867889404, 'learning_rate': 0.00018501800720288115, 'epoch': 0.08}


  8%|▊         | 942/12500 [1:36:34<21:28:17,  6.69s/it]

{'loss': 0.7327, 'grad_norm': 0.2774961590766907, 'learning_rate': 0.00018500200080032013, 'epoch': 0.08}


  8%|▊         | 943/12500 [1:36:41<21:44:37,  6.77s/it]

{'loss': 0.7408, 'grad_norm': 0.23807932436466217, 'learning_rate': 0.00018498599439775913, 'epoch': 0.08}


  8%|▊         | 944/12500 [1:36:47<21:08:48,  6.59s/it]

{'loss': 0.6487, 'grad_norm': 0.243357852101326, 'learning_rate': 0.0001849699879951981, 'epoch': 0.08}


  8%|▊         | 945/12500 [1:36:56<23:01:18,  7.17s/it]

{'loss': 0.7548, 'grad_norm': 0.24950824677944183, 'learning_rate': 0.00018495398159263705, 'epoch': 0.08}


  8%|▊         | 946/12500 [1:37:05<24:54:21,  7.76s/it]

{'loss': 0.8838, 'grad_norm': 0.2477388232946396, 'learning_rate': 0.00018493797519007602, 'epoch': 0.08}


  8%|▊         | 947/12500 [1:37:10<22:25:05,  6.99s/it]

{'loss': 0.743, 'grad_norm': 0.2941088080406189, 'learning_rate': 0.00018492196878751503, 'epoch': 0.08}


  8%|▊         | 948/12500 [1:37:16<21:24:53,  6.67s/it]

{'loss': 0.6, 'grad_norm': 0.2139785885810852, 'learning_rate': 0.000184905962384954, 'epoch': 0.08}


  8%|▊         | 949/12500 [1:37:21<19:22:47,  6.04s/it]

{'loss': 0.7881, 'grad_norm': 0.2624349892139435, 'learning_rate': 0.00018488995598239295, 'epoch': 0.08}


  8%|▊         | 950/12500 [1:37:26<19:07:31,  5.96s/it]

{'loss': 0.5487, 'grad_norm': 0.2583529055118561, 'learning_rate': 0.00018487394957983195, 'epoch': 0.08}


  8%|▊         | 951/12500 [1:37:35<21:51:40,  6.81s/it]

{'loss': 0.7083, 'grad_norm': 0.21691420674324036, 'learning_rate': 0.00018485794317727093, 'epoch': 0.08}


  8%|▊         | 952/12500 [1:37:41<20:33:35,  6.41s/it]

{'loss': 0.6695, 'grad_norm': 0.26923832297325134, 'learning_rate': 0.0001848419367747099, 'epoch': 0.08}


  8%|▊         | 953/12500 [1:37:45<18:25:09,  5.74s/it]

{'loss': 0.6706, 'grad_norm': 0.2978023290634155, 'learning_rate': 0.00018482593037214885, 'epoch': 0.08}


  8%|▊         | 954/12500 [1:37:56<23:38:07,  7.37s/it]

{'loss': 0.7224, 'grad_norm': 0.17129474878311157, 'learning_rate': 0.00018480992396958785, 'epoch': 0.08}


  8%|▊         | 955/12500 [1:38:02<22:19:01,  6.96s/it]

{'loss': 1.1041, 'grad_norm': 0.3033926486968994, 'learning_rate': 0.00018479391756702683, 'epoch': 0.08}


  8%|▊         | 956/12500 [1:38:10<23:14:39,  7.25s/it]

{'loss': 0.7223, 'grad_norm': 0.1998838633298874, 'learning_rate': 0.0001847779111644658, 'epoch': 0.08}


  8%|▊         | 957/12500 [1:38:17<22:48:19,  7.11s/it]

{'loss': 0.4883, 'grad_norm': 0.24372953176498413, 'learning_rate': 0.00018476190476190478, 'epoch': 0.08}


  8%|▊         | 958/12500 [1:38:25<24:14:44,  7.56s/it]

{'loss': 0.7859, 'grad_norm': 0.2865467369556427, 'learning_rate': 0.00018474589835934375, 'epoch': 0.08}


  8%|▊         | 959/12500 [1:38:31<22:12:47,  6.93s/it]

{'loss': 0.8342, 'grad_norm': 0.28298649191856384, 'learning_rate': 0.00018472989195678273, 'epoch': 0.08}


  8%|▊         | 960/12500 [1:38:35<19:57:48,  6.23s/it]

{'loss': 0.6109, 'grad_norm': 0.23997743427753448, 'learning_rate': 0.0001847138855542217, 'epoch': 0.08}


  8%|▊         | 961/12500 [1:38:42<19:51:04,  6.19s/it]

{'loss': 0.6639, 'grad_norm': 0.25258269906044006, 'learning_rate': 0.00018469787915166067, 'epoch': 0.08}


  8%|▊         | 962/12500 [1:38:47<18:46:49,  5.86s/it]

{'loss': 0.7869, 'grad_norm': 0.2942565977573395, 'learning_rate': 0.00018468187274909965, 'epoch': 0.08}


  8%|▊         | 963/12500 [1:38:53<19:33:44,  6.10s/it]

{'loss': 0.6951, 'grad_norm': 0.25446563959121704, 'learning_rate': 0.00018466586634653862, 'epoch': 0.08}


  8%|▊         | 964/12500 [1:39:00<20:35:50,  6.43s/it]

{'loss': 0.6961, 'grad_norm': 0.287112832069397, 'learning_rate': 0.0001846498599439776, 'epoch': 0.08}


  8%|▊         | 965/12500 [1:39:05<18:42:05,  5.84s/it]

{'loss': 0.6497, 'grad_norm': 0.3024824857711792, 'learning_rate': 0.00018463385354141657, 'epoch': 0.08}


  8%|▊         | 966/12500 [1:39:10<18:15:51,  5.70s/it]

{'loss': 0.4507, 'grad_norm': 0.27203279733657837, 'learning_rate': 0.00018461784713885555, 'epoch': 0.08}


  8%|▊         | 967/12500 [1:39:21<22:34:18,  7.05s/it]

{'loss': 0.7648, 'grad_norm': 0.19659492373466492, 'learning_rate': 0.00018460184073629452, 'epoch': 0.08}


  8%|▊         | 968/12500 [1:39:28<23:16:57,  7.27s/it]

{'loss': 0.8027, 'grad_norm': 0.2593807578086853, 'learning_rate': 0.0001845858343337335, 'epoch': 0.08}


  8%|▊         | 969/12500 [1:39:35<22:18:44,  6.97s/it]

{'loss': 0.643, 'grad_norm': 0.2721462845802307, 'learning_rate': 0.00018456982793117247, 'epoch': 0.08}


  8%|▊         | 970/12500 [1:39:40<21:03:46,  6.58s/it]

{'loss': 0.5788, 'grad_norm': 0.2596520781517029, 'learning_rate': 0.00018455382152861145, 'epoch': 0.08}


  8%|▊         | 971/12500 [1:39:45<19:30:25,  6.09s/it]

{'loss': 0.9618, 'grad_norm': 0.3046879768371582, 'learning_rate': 0.00018453781512605045, 'epoch': 0.08}


  8%|▊         | 972/12500 [1:39:50<18:07:26,  5.66s/it]

{'loss': 0.6347, 'grad_norm': 0.24902404844760895, 'learning_rate': 0.0001845218087234894, 'epoch': 0.08}


  8%|▊         | 973/12500 [1:39:54<17:03:13,  5.33s/it]

{'loss': 0.7094, 'grad_norm': 0.30109843611717224, 'learning_rate': 0.00018450580232092837, 'epoch': 0.08}


  8%|▊         | 974/12500 [1:40:02<19:13:39,  6.01s/it]

{'loss': 0.4791, 'grad_norm': 0.226127028465271, 'learning_rate': 0.00018448979591836735, 'epoch': 0.08}


  8%|▊         | 975/12500 [1:40:06<17:34:08,  5.49s/it]

{'loss': 0.9047, 'grad_norm': 0.30620262026786804, 'learning_rate': 0.00018447378951580635, 'epoch': 0.08}


  8%|▊         | 976/12500 [1:40:12<17:41:58,  5.53s/it]

{'loss': 0.7357, 'grad_norm': 0.29775941371917725, 'learning_rate': 0.0001844577831132453, 'epoch': 0.08}


  8%|▊         | 977/12500 [1:40:17<17:03:36,  5.33s/it]

{'loss': 0.7638, 'grad_norm': 0.2725255489349365, 'learning_rate': 0.00018444177671068427, 'epoch': 0.08}


  8%|▊         | 978/12500 [1:40:26<20:20:29,  6.36s/it]

{'loss': 0.9809, 'grad_norm': 0.26142382621765137, 'learning_rate': 0.00018442577030812327, 'epoch': 0.08}


  8%|▊         | 979/12500 [1:40:36<24:38:29,  7.70s/it]

{'loss': 0.8057, 'grad_norm': 0.21259768307209015, 'learning_rate': 0.00018440976390556225, 'epoch': 0.08}


  8%|▊         | 980/12500 [1:40:42<22:12:09,  6.94s/it]

{'loss': 0.6193, 'grad_norm': 0.3016109764575958, 'learning_rate': 0.0001843937575030012, 'epoch': 0.08}


  8%|▊         | 981/12500 [1:40:47<21:17:32,  6.65s/it]

{'loss': 0.7443, 'grad_norm': 0.2956232726573944, 'learning_rate': 0.00018437775110044017, 'epoch': 0.08}


  8%|▊         | 982/12500 [1:40:53<19:42:56,  6.16s/it]

{'loss': 0.8531, 'grad_norm': 0.2705945670604706, 'learning_rate': 0.00018436174469787917, 'epoch': 0.08}


  8%|▊         | 983/12500 [1:40:55<16:36:38,  5.19s/it]

{'loss': 1.112, 'grad_norm': 0.422724187374115, 'learning_rate': 0.00018434573829531815, 'epoch': 0.08}


  8%|▊         | 984/12500 [1:41:03<18:40:45,  5.84s/it]

{'loss': 0.88, 'grad_norm': 0.19741883873939514, 'learning_rate': 0.0001843297318927571, 'epoch': 0.08}


  8%|▊         | 985/12500 [1:41:10<19:46:15,  6.18s/it]

{'loss': 0.7358, 'grad_norm': 0.2508723735809326, 'learning_rate': 0.00018431372549019607, 'epoch': 0.08}


  8%|▊         | 986/12500 [1:41:15<19:14:16,  6.01s/it]

{'loss': 0.6622, 'grad_norm': 0.30188271403312683, 'learning_rate': 0.00018429771908763507, 'epoch': 0.08}


  8%|▊         | 987/12500 [1:41:21<18:55:53,  5.92s/it]

{'loss': 0.7929, 'grad_norm': 0.2927689850330353, 'learning_rate': 0.00018428171268507405, 'epoch': 0.08}


  8%|▊         | 988/12500 [1:41:28<20:15:37,  6.34s/it]

{'loss': 0.756, 'grad_norm': 0.21030543744564056, 'learning_rate': 0.000184265706282513, 'epoch': 0.08}


  8%|▊         | 989/12500 [1:41:36<21:49:31,  6.83s/it]

{'loss': 0.6753, 'grad_norm': 0.20093363523483276, 'learning_rate': 0.000184249699879952, 'epoch': 0.08}


  8%|▊         | 990/12500 [1:41:44<23:00:49,  7.20s/it]

{'loss': 0.8678, 'grad_norm': 0.1995866745710373, 'learning_rate': 0.00018423369347739097, 'epoch': 0.08}


  8%|▊         | 991/12500 [1:41:51<21:59:05,  6.88s/it]

{'loss': 0.7603, 'grad_norm': 0.31604284048080444, 'learning_rate': 0.00018421768707482995, 'epoch': 0.08}


  8%|▊         | 992/12500 [1:41:58<22:39:43,  7.09s/it]

{'loss': 0.8144, 'grad_norm': 0.2458088994026184, 'learning_rate': 0.0001842016806722689, 'epoch': 0.08}


  8%|▊         | 993/12500 [1:42:06<23:14:38,  7.27s/it]

{'loss': 0.564, 'grad_norm': 0.2451600432395935, 'learning_rate': 0.0001841856742697079, 'epoch': 0.08}


  8%|▊         | 994/12500 [1:42:10<20:33:34,  6.43s/it]

{'loss': 0.5255, 'grad_norm': 0.23578165471553802, 'learning_rate': 0.00018416966786714687, 'epoch': 0.08}


  8%|▊         | 995/12500 [1:42:15<18:25:46,  5.77s/it]

{'loss': 0.6449, 'grad_norm': 0.2917770445346832, 'learning_rate': 0.00018415366146458585, 'epoch': 0.08}


  8%|▊         | 996/12500 [1:42:20<18:35:34,  5.82s/it]

{'loss': 0.6567, 'grad_norm': 0.2588164508342743, 'learning_rate': 0.00018413765506202482, 'epoch': 0.08}


  8%|▊         | 997/12500 [1:42:26<18:08:35,  5.68s/it]

{'loss': 0.6967, 'grad_norm': 0.28070998191833496, 'learning_rate': 0.0001841216486594638, 'epoch': 0.08}


  8%|▊         | 998/12500 [1:42:34<20:07:12,  6.30s/it]

{'loss': 0.7526, 'grad_norm': 0.21370211243629456, 'learning_rate': 0.00018410564225690277, 'epoch': 0.08}


  8%|▊         | 999/12500 [1:42:41<21:23:18,  6.69s/it]

{'loss': 0.9121, 'grad_norm': 0.2246173471212387, 'learning_rate': 0.00018408963585434175, 'epoch': 0.08}


  8%|▊         | 1000/12500 [1:42:48<21:29:13,  6.73s/it]

{'loss': 0.5509, 'grad_norm': 0.25346022844314575, 'learning_rate': 0.00018407362945178072, 'epoch': 0.08}


  8%|▊         | 1001/12500 [1:42:56<22:29:15,  7.04s/it]

{'loss': 0.7449, 'grad_norm': 0.27994999289512634, 'learning_rate': 0.0001840576230492197, 'epoch': 0.08}


  8%|▊         | 1002/12500 [1:43:02<21:21:25,  6.69s/it]

{'loss': 0.5381, 'grad_norm': 0.3234550356864929, 'learning_rate': 0.00018404161664665867, 'epoch': 0.08}


  8%|▊         | 1003/12500 [1:43:09<22:05:55,  6.92s/it]

{'loss': 1.0075, 'grad_norm': 0.31212520599365234, 'learning_rate': 0.00018402561024409765, 'epoch': 0.08}


  8%|▊         | 1004/12500 [1:43:16<22:10:47,  6.95s/it]

{'loss': 0.7183, 'grad_norm': 0.26403936743736267, 'learning_rate': 0.00018400960384153662, 'epoch': 0.08}


  8%|▊         | 1005/12500 [1:43:23<22:27:52,  7.04s/it]

{'loss': 0.7955, 'grad_norm': 0.228655144572258, 'learning_rate': 0.0001839935974389756, 'epoch': 0.08}


  8%|▊         | 1006/12500 [1:43:28<19:43:35,  6.18s/it]

{'loss': 0.9006, 'grad_norm': 0.33719080686569214, 'learning_rate': 0.00018397759103641457, 'epoch': 0.08}


  8%|▊         | 1007/12500 [1:43:34<19:58:01,  6.25s/it]

{'loss': 0.7239, 'grad_norm': 0.26890796422958374, 'learning_rate': 0.00018396158463385355, 'epoch': 0.08}


  8%|▊         | 1008/12500 [1:43:43<22:21:30,  7.00s/it]

{'loss': 0.9054, 'grad_norm': 0.2296934276819229, 'learning_rate': 0.00018394557823129252, 'epoch': 0.08}


  8%|▊         | 1009/12500 [1:43:50<22:29:16,  7.05s/it]

{'loss': 0.6592, 'grad_norm': 0.2479630410671234, 'learning_rate': 0.0001839295718287315, 'epoch': 0.08}


  8%|▊         | 1010/12500 [1:43:57<22:28:23,  7.04s/it]

{'loss': 0.9113, 'grad_norm': 0.2813875079154968, 'learning_rate': 0.0001839135654261705, 'epoch': 0.08}


  8%|▊         | 1011/12500 [1:44:05<23:11:15,  7.27s/it]

{'loss': 0.786, 'grad_norm': 0.24740374088287354, 'learning_rate': 0.00018389755902360944, 'epoch': 0.08}


  8%|▊         | 1012/12500 [1:44:10<21:28:59,  6.73s/it]

{'loss': 0.834, 'grad_norm': 0.25693613290786743, 'learning_rate': 0.00018388155262104842, 'epoch': 0.08}


  8%|▊         | 1013/12500 [1:44:16<20:53:50,  6.55s/it]

{'loss': 0.9387, 'grad_norm': 0.289398729801178, 'learning_rate': 0.0001838655462184874, 'epoch': 0.08}


  8%|▊         | 1014/12500 [1:44:26<24:05:00,  7.55s/it]

{'loss': 1.0107, 'grad_norm': 0.22053839266300201, 'learning_rate': 0.0001838495398159264, 'epoch': 0.08}


  8%|▊         | 1015/12500 [1:44:32<22:19:51,  7.00s/it]

{'loss': 0.5758, 'grad_norm': 0.2512882947921753, 'learning_rate': 0.00018383353341336534, 'epoch': 0.08}


  8%|▊         | 1016/12500 [1:44:37<20:50:52,  6.54s/it]

{'loss': 0.718, 'grad_norm': 0.2959875762462616, 'learning_rate': 0.00018381752701080432, 'epoch': 0.08}


  8%|▊         | 1017/12500 [1:44:44<21:02:30,  6.60s/it]

{'loss': 0.8098, 'grad_norm': 0.2037680745124817, 'learning_rate': 0.00018380152060824332, 'epoch': 0.08}


  8%|▊         | 1018/12500 [1:44:50<20:38:26,  6.47s/it]

{'loss': 0.5421, 'grad_norm': 0.24987833201885223, 'learning_rate': 0.0001837855142056823, 'epoch': 0.08}


  8%|▊         | 1019/12500 [1:44:56<20:14:50,  6.35s/it]

{'loss': 0.6095, 'grad_norm': 0.231871098279953, 'learning_rate': 0.00018376950780312124, 'epoch': 0.08}


  8%|▊         | 1020/12500 [1:45:02<20:02:58,  6.29s/it]

{'loss': 0.7537, 'grad_norm': 0.25218266248703003, 'learning_rate': 0.00018375350140056022, 'epoch': 0.08}


  8%|▊         | 1021/12500 [1:45:07<18:14:33,  5.72s/it]

{'loss': 0.625, 'grad_norm': 0.2765403389930725, 'learning_rate': 0.00018373749499799922, 'epoch': 0.08}


  8%|▊         | 1022/12500 [1:45:12<17:36:15,  5.52s/it]

{'loss': 0.7157, 'grad_norm': 0.28720933198928833, 'learning_rate': 0.0001837214885954382, 'epoch': 0.08}


  8%|▊         | 1023/12500 [1:45:21<21:22:12,  6.70s/it]

{'loss': 0.644, 'grad_norm': 0.21504966914653778, 'learning_rate': 0.00018370548219287714, 'epoch': 0.08}


  8%|▊         | 1024/12500 [1:45:27<20:21:11,  6.38s/it]

{'loss': 0.6692, 'grad_norm': 0.23922909796237946, 'learning_rate': 0.00018368947579031614, 'epoch': 0.08}


  8%|▊         | 1025/12500 [1:45:34<20:57:32,  6.58s/it]

{'loss': 0.5189, 'grad_norm': 0.22771534323692322, 'learning_rate': 0.00018367346938775512, 'epoch': 0.08}


  8%|▊         | 1026/12500 [1:45:43<23:05:08,  7.24s/it]

{'loss': 0.7589, 'grad_norm': 0.2182394117116928, 'learning_rate': 0.0001836574629851941, 'epoch': 0.08}


  8%|▊         | 1027/12500 [1:45:49<22:03:01,  6.92s/it]

{'loss': 0.5592, 'grad_norm': 0.25633376836776733, 'learning_rate': 0.00018364145658263304, 'epoch': 0.08}


  8%|▊         | 1028/12500 [1:45:55<21:28:29,  6.74s/it]

{'loss': 0.4068, 'grad_norm': 0.2074027955532074, 'learning_rate': 0.00018362545018007204, 'epoch': 0.08}


  8%|▊         | 1029/12500 [1:46:01<20:25:43,  6.41s/it]

{'loss': 0.8528, 'grad_norm': 0.24476443231105804, 'learning_rate': 0.00018360944377751102, 'epoch': 0.08}


  8%|▊         | 1030/12500 [1:46:09<21:46:53,  6.84s/it]

{'loss': 0.9097, 'grad_norm': 0.23428815603256226, 'learning_rate': 0.00018359343737495, 'epoch': 0.08}


  8%|▊         | 1031/12500 [1:46:16<22:10:37,  6.96s/it]

{'loss': 0.7947, 'grad_norm': 0.22477349638938904, 'learning_rate': 0.00018357743097238897, 'epoch': 0.08}


  8%|▊         | 1032/12500 [1:46:21<20:05:24,  6.31s/it]

{'loss': 0.6692, 'grad_norm': 0.24276603758335114, 'learning_rate': 0.00018356142456982794, 'epoch': 0.08}


  8%|▊         | 1033/12500 [1:46:28<21:12:48,  6.66s/it]

{'loss': 0.9747, 'grad_norm': 0.2408422976732254, 'learning_rate': 0.00018354541816726692, 'epoch': 0.08}


  8%|▊         | 1034/12500 [1:46:33<19:16:49,  6.05s/it]

{'loss': 0.6031, 'grad_norm': 0.2591139078140259, 'learning_rate': 0.0001835294117647059, 'epoch': 0.08}


  8%|▊         | 1035/12500 [1:46:38<17:52:48,  5.61s/it]

{'loss': 0.8071, 'grad_norm': 0.28111380338668823, 'learning_rate': 0.00018351340536214487, 'epoch': 0.08}


  8%|▊         | 1036/12500 [1:46:44<18:20:31,  5.76s/it]

{'loss': 0.7674, 'grad_norm': 0.22756262123584747, 'learning_rate': 0.00018349739895958384, 'epoch': 0.08}


  8%|▊         | 1037/12500 [1:46:50<18:45:19,  5.89s/it]

{'loss': 0.7634, 'grad_norm': 0.2360999584197998, 'learning_rate': 0.00018348139255702282, 'epoch': 0.08}


  8%|▊         | 1038/12500 [1:46:59<21:53:23,  6.88s/it]

{'loss': 1.0378, 'grad_norm': 0.2243146002292633, 'learning_rate': 0.0001834653861544618, 'epoch': 0.08}


  8%|▊         | 1039/12500 [1:47:07<22:44:29,  7.14s/it]

{'loss': 1.0312, 'grad_norm': 0.24201729893684387, 'learning_rate': 0.00018344937975190077, 'epoch': 0.08}


  8%|▊         | 1040/12500 [1:47:12<20:30:17,  6.44s/it]

{'loss': 0.6342, 'grad_norm': 0.2692134976387024, 'learning_rate': 0.00018343337334933974, 'epoch': 0.08}


  8%|▊         | 1041/12500 [1:47:16<18:18:38,  5.75s/it]

{'loss': 0.7345, 'grad_norm': 0.3212537169456482, 'learning_rate': 0.00018341736694677872, 'epoch': 0.08}


  8%|▊         | 1042/12500 [1:47:23<20:01:32,  6.29s/it]

{'loss': 0.7297, 'grad_norm': 0.21977081894874573, 'learning_rate': 0.0001834013605442177, 'epoch': 0.08}


  8%|▊         | 1043/12500 [1:47:30<20:31:42,  6.45s/it]

{'loss': 0.8329, 'grad_norm': 0.2690020203590393, 'learning_rate': 0.00018338535414165667, 'epoch': 0.08}


  8%|▊         | 1044/12500 [1:47:38<22:23:43,  7.04s/it]

{'loss': 0.6711, 'grad_norm': 0.23401658236980438, 'learning_rate': 0.00018336934773909564, 'epoch': 0.08}


  8%|▊         | 1045/12500 [1:47:45<21:38:10,  6.80s/it]

{'loss': 0.6959, 'grad_norm': 0.2775278091430664, 'learning_rate': 0.00018335334133653462, 'epoch': 0.08}


  8%|▊         | 1046/12500 [1:47:50<20:23:36,  6.41s/it]

{'loss': 0.5923, 'grad_norm': 0.24597349762916565, 'learning_rate': 0.0001833373349339736, 'epoch': 0.08}


  8%|▊         | 1047/12500 [1:47:57<20:45:53,  6.53s/it]

{'loss': 0.7581, 'grad_norm': 0.2673291862010956, 'learning_rate': 0.00018332132853141257, 'epoch': 0.08}


  8%|▊         | 1048/12500 [1:48:08<24:50:53,  7.81s/it]

{'loss': 0.5515, 'grad_norm': 0.18468962609767914, 'learning_rate': 0.00018330532212885154, 'epoch': 0.08}


  8%|▊         | 1049/12500 [1:48:14<23:34:16,  7.41s/it]

{'loss': 1.1248, 'grad_norm': 0.24529743194580078, 'learning_rate': 0.00018328931572629054, 'epoch': 0.08}


  8%|▊         | 1050/12500 [1:48:20<21:49:30,  6.86s/it]

{'loss': 0.7864, 'grad_norm': 0.22838102281093597, 'learning_rate': 0.0001832733093237295, 'epoch': 0.08}


  8%|▊         | 1051/12500 [1:48:25<19:44:50,  6.21s/it]

{'loss': 0.747, 'grad_norm': 0.3233126103878021, 'learning_rate': 0.00018325730292116847, 'epoch': 0.08}


  8%|▊         | 1052/12500 [1:48:29<18:01:14,  5.67s/it]

{'loss': 0.6483, 'grad_norm': 0.23621755838394165, 'learning_rate': 0.00018324129651860744, 'epoch': 0.08}


  8%|▊         | 1053/12500 [1:48:33<16:21:53,  5.15s/it]

{'loss': 0.7956, 'grad_norm': 0.3034974932670593, 'learning_rate': 0.00018322529011604644, 'epoch': 0.08}


  8%|▊         | 1054/12500 [1:48:38<16:44:35,  5.27s/it]

{'loss': 0.7745, 'grad_norm': 0.25554031133651733, 'learning_rate': 0.0001832092837134854, 'epoch': 0.08}


  8%|▊         | 1055/12500 [1:48:43<16:24:05,  5.16s/it]

{'loss': 0.8152, 'grad_norm': 0.2867966592311859, 'learning_rate': 0.00018319327731092437, 'epoch': 0.08}


  8%|▊         | 1056/12500 [1:48:52<19:55:39,  6.27s/it]

{'loss': 0.4595, 'grad_norm': 0.21336837112903595, 'learning_rate': 0.00018317727090836337, 'epoch': 0.08}


  8%|▊         | 1057/12500 [1:48:58<19:20:28,  6.08s/it]

{'loss': 1.0492, 'grad_norm': 0.2815572917461395, 'learning_rate': 0.00018316126450580234, 'epoch': 0.08}


  8%|▊         | 1058/12500 [1:49:07<22:00:23,  6.92s/it]

{'loss': 0.8378, 'grad_norm': 0.2404251992702484, 'learning_rate': 0.0001831452581032413, 'epoch': 0.08}


  8%|▊         | 1059/12500 [1:49:14<22:03:08,  6.94s/it]

{'loss': 0.7209, 'grad_norm': 0.258669376373291, 'learning_rate': 0.00018312925170068026, 'epoch': 0.08}


  8%|▊         | 1060/12500 [1:49:17<18:49:50,  5.93s/it]

{'loss': 1.0821, 'grad_norm': 0.3564718961715698, 'learning_rate': 0.00018311324529811927, 'epoch': 0.08}


  8%|▊         | 1061/12500 [1:49:23<18:40:25,  5.88s/it]

{'loss': 0.7287, 'grad_norm': 0.31960082054138184, 'learning_rate': 0.00018309723889555824, 'epoch': 0.08}


  8%|▊         | 1062/12500 [1:49:32<22:01:25,  6.93s/it]

{'loss': 0.9087, 'grad_norm': 0.20761507749557495, 'learning_rate': 0.0001830812324929972, 'epoch': 0.08}


  9%|▊         | 1063/12500 [1:49:41<23:18:01,  7.33s/it]

{'loss': 0.7397, 'grad_norm': 0.2275255173444748, 'learning_rate': 0.0001830652260904362, 'epoch': 0.09}


  9%|▊         | 1064/12500 [1:49:48<23:19:41,  7.34s/it]

{'loss': 0.9361, 'grad_norm': 0.2385091930627823, 'learning_rate': 0.00018304921968787517, 'epoch': 0.09}


  9%|▊         | 1065/12500 [1:49:56<23:28:53,  7.39s/it]

{'loss': 0.7227, 'grad_norm': 0.2623946964740753, 'learning_rate': 0.00018303321328531414, 'epoch': 0.09}


  9%|▊         | 1066/12500 [1:50:04<24:49:17,  7.82s/it]

{'loss': 0.5339, 'grad_norm': 0.20025239884853363, 'learning_rate': 0.0001830172068827531, 'epoch': 0.09}


  9%|▊         | 1067/12500 [1:50:09<21:27:42,  6.76s/it]

{'loss': 0.8062, 'grad_norm': 0.26795172691345215, 'learning_rate': 0.0001830012004801921, 'epoch': 0.09}


  9%|▊         | 1068/12500 [1:50:13<18:55:49,  5.96s/it]

{'loss': 0.917, 'grad_norm': 0.3009927272796631, 'learning_rate': 0.00018298519407763107, 'epoch': 0.09}


  9%|▊         | 1069/12500 [1:50:16<16:34:50,  5.22s/it]

{'loss': 0.7071, 'grad_norm': 0.28683313727378845, 'learning_rate': 0.00018296918767507004, 'epoch': 0.09}


  9%|▊         | 1070/12500 [1:50:23<17:33:43,  5.53s/it]

{'loss': 0.5311, 'grad_norm': 0.24370157718658447, 'learning_rate': 0.00018295318127250902, 'epoch': 0.09}


  9%|▊         | 1071/12500 [1:50:31<20:28:15,  6.45s/it]

{'loss': 0.712, 'grad_norm': 0.22624541819095612, 'learning_rate': 0.000182937174869948, 'epoch': 0.09}


  9%|▊         | 1072/12500 [1:50:38<20:35:07,  6.48s/it]

{'loss': 0.5537, 'grad_norm': 0.22865334153175354, 'learning_rate': 0.00018292116846738696, 'epoch': 0.09}


  9%|▊         | 1073/12500 [1:50:45<21:24:59,  6.75s/it]

{'loss': 0.9503, 'grad_norm': 0.23333574831485748, 'learning_rate': 0.00018290516206482594, 'epoch': 0.09}


  9%|▊         | 1074/12500 [1:50:52<21:30:55,  6.78s/it]

{'loss': 0.7747, 'grad_norm': 0.28069230914115906, 'learning_rate': 0.00018288915566226491, 'epoch': 0.09}


  9%|▊         | 1075/12500 [1:50:58<20:45:31,  6.54s/it]

{'loss': 0.4963, 'grad_norm': 0.27059265971183777, 'learning_rate': 0.0001828731492597039, 'epoch': 0.09}


  9%|▊         | 1076/12500 [1:51:08<23:58:33,  7.56s/it]

{'loss': 0.7128, 'grad_norm': 0.2046308070421219, 'learning_rate': 0.00018285714285714286, 'epoch': 0.09}


  9%|▊         | 1077/12500 [1:51:15<23:28:26,  7.40s/it]

{'loss': 0.9178, 'grad_norm': 0.23819893598556519, 'learning_rate': 0.00018284113645458184, 'epoch': 0.09}


  9%|▊         | 1078/12500 [1:51:24<25:03:18,  7.90s/it]

{'loss': 0.8416, 'grad_norm': 0.2248355597257614, 'learning_rate': 0.00018282513005202081, 'epoch': 0.09}


  9%|▊         | 1079/12500 [1:51:31<24:32:46,  7.74s/it]

{'loss': 0.7367, 'grad_norm': 0.2319474071264267, 'learning_rate': 0.0001828091236494598, 'epoch': 0.09}


  9%|▊         | 1080/12500 [1:51:37<22:49:14,  7.19s/it]

{'loss': 0.5654, 'grad_norm': 0.24106748402118683, 'learning_rate': 0.00018279311724689876, 'epoch': 0.09}


  9%|▊         | 1081/12500 [1:51:44<22:04:17,  6.96s/it]

{'loss': 0.6017, 'grad_norm': 0.27002882957458496, 'learning_rate': 0.00018277711084433774, 'epoch': 0.09}


  9%|▊         | 1082/12500 [1:51:52<23:21:56,  7.37s/it]

{'loss': 0.955, 'grad_norm': 0.2218766063451767, 'learning_rate': 0.0001827611044417767, 'epoch': 0.09}


  9%|▊         | 1083/12500 [1:51:57<20:54:56,  6.60s/it]

{'loss': 0.5044, 'grad_norm': 0.2246018797159195, 'learning_rate': 0.0001827450980392157, 'epoch': 0.09}


  9%|▊         | 1084/12500 [1:52:03<20:29:35,  6.46s/it]

{'loss': 0.4646, 'grad_norm': 0.2714677155017853, 'learning_rate': 0.0001827290916366547, 'epoch': 0.09}


  9%|▊         | 1085/12500 [1:52:08<19:40:24,  6.20s/it]

{'loss': 0.591, 'grad_norm': 0.2575704753398895, 'learning_rate': 0.00018271308523409364, 'epoch': 0.09}


  9%|▊         | 1086/12500 [1:52:16<21:21:28,  6.74s/it]

{'loss': 0.784, 'grad_norm': 0.27222052216529846, 'learning_rate': 0.0001826970788315326, 'epoch': 0.09}


  9%|▊         | 1087/12500 [1:52:22<20:12:59,  6.38s/it]

{'loss': 0.8273, 'grad_norm': 0.31533992290496826, 'learning_rate': 0.0001826810724289716, 'epoch': 0.09}


  9%|▊         | 1088/12500 [1:52:28<19:27:04,  6.14s/it]

{'loss': 0.8141, 'grad_norm': 0.253167062997818, 'learning_rate': 0.0001826650660264106, 'epoch': 0.09}


  9%|▊         | 1089/12500 [1:52:33<18:42:27,  5.90s/it]

{'loss': 0.9948, 'grad_norm': 0.2755037248134613, 'learning_rate': 0.00018264905962384954, 'epoch': 0.09}


  9%|▊         | 1090/12500 [1:52:40<20:05:27,  6.34s/it]

{'loss': 0.4985, 'grad_norm': 0.21927304565906525, 'learning_rate': 0.0001826330532212885, 'epoch': 0.09}


  9%|▊         | 1091/12500 [1:52:47<19:57:58,  6.30s/it]

{'loss': 1.0175, 'grad_norm': 0.272335946559906, 'learning_rate': 0.00018261704681872751, 'epoch': 0.09}


  9%|▊         | 1092/12500 [1:52:53<20:30:15,  6.47s/it]

{'loss': 0.7032, 'grad_norm': 0.27091944217681885, 'learning_rate': 0.0001826010404161665, 'epoch': 0.09}


  9%|▊         | 1093/12500 [1:52:58<18:34:05,  5.86s/it]

{'loss': 0.7717, 'grad_norm': 0.3703278601169586, 'learning_rate': 0.00018258503401360544, 'epoch': 0.09}


  9%|▉         | 1094/12500 [1:53:05<19:43:46,  6.23s/it]

{'loss': 0.9614, 'grad_norm': 0.2717474400997162, 'learning_rate': 0.0001825690276110444, 'epoch': 0.09}


  9%|▉         | 1095/12500 [1:53:10<18:58:06,  5.99s/it]

{'loss': 0.6431, 'grad_norm': 0.28830021619796753, 'learning_rate': 0.00018255302120848341, 'epoch': 0.09}


  9%|▉         | 1096/12500 [1:53:14<17:02:16,  5.38s/it]

{'loss': 0.7835, 'grad_norm': 0.3102950155735016, 'learning_rate': 0.0001825370148059224, 'epoch': 0.09}


  9%|▉         | 1097/12500 [1:53:18<15:19:46,  4.84s/it]

{'loss': 0.6235, 'grad_norm': 0.2929401397705078, 'learning_rate': 0.00018252100840336134, 'epoch': 0.09}


  9%|▉         | 1098/12500 [1:53:24<17:00:21,  5.37s/it]

{'loss': 0.6119, 'grad_norm': 0.2738863229751587, 'learning_rate': 0.0001825050020008003, 'epoch': 0.09}


  9%|▉         | 1099/12500 [1:53:31<18:09:46,  5.74s/it]

{'loss': 0.5048, 'grad_norm': 0.24027827382087708, 'learning_rate': 0.0001824889955982393, 'epoch': 0.09}


  9%|▉         | 1100/12500 [1:53:35<16:10:50,  5.11s/it]

{'loss': 0.697, 'grad_norm': 0.3018111288547516, 'learning_rate': 0.0001824729891956783, 'epoch': 0.09}


  9%|▉         | 1101/12500 [1:53:45<21:17:19,  6.72s/it]

{'loss': 1.1079, 'grad_norm': 0.19647422432899475, 'learning_rate': 0.00018245698279311724, 'epoch': 0.09}


  9%|▉         | 1102/12500 [1:53:53<22:04:39,  6.97s/it]

{'loss': 0.414, 'grad_norm': 0.2129278928041458, 'learning_rate': 0.00018244097639055624, 'epoch': 0.09}


  9%|▉         | 1103/12500 [1:53:58<20:35:05,  6.50s/it]

{'loss': 0.9549, 'grad_norm': 0.24817773699760437, 'learning_rate': 0.0001824249699879952, 'epoch': 0.09}


  9%|▉         | 1104/12500 [1:54:07<23:09:56,  7.32s/it]

{'loss': 0.8413, 'grad_norm': 0.18618589639663696, 'learning_rate': 0.0001824089635854342, 'epoch': 0.09}


  9%|▉         | 1105/12500 [1:54:16<24:20:41,  7.69s/it]

{'loss': 0.4963, 'grad_norm': 0.20883704721927643, 'learning_rate': 0.00018239295718287314, 'epoch': 0.09}


  9%|▉         | 1106/12500 [1:54:22<22:40:40,  7.17s/it]

{'loss': 0.4504, 'grad_norm': 0.2590346038341522, 'learning_rate': 0.00018237695078031214, 'epoch': 0.09}


  9%|▉         | 1107/12500 [1:54:29<22:52:55,  7.23s/it]

{'loss': 0.7832, 'grad_norm': 0.2521018385887146, 'learning_rate': 0.0001823609443777511, 'epoch': 0.09}


  9%|▉         | 1108/12500 [1:54:33<19:33:54,  6.18s/it]

{'loss': 0.775, 'grad_norm': 0.30787193775177, 'learning_rate': 0.0001823449379751901, 'epoch': 0.09}


  9%|▉         | 1109/12500 [1:54:39<19:46:38,  6.25s/it]

{'loss': 0.9688, 'grad_norm': 0.2489914894104004, 'learning_rate': 0.00018232893157262906, 'epoch': 0.09}


  9%|▉         | 1110/12500 [1:54:44<17:55:30,  5.67s/it]

{'loss': 0.5633, 'grad_norm': 0.2739581763744354, 'learning_rate': 0.00018231292517006804, 'epoch': 0.09}


  9%|▉         | 1111/12500 [1:54:53<21:30:41,  6.80s/it]

{'loss': 1.0573, 'grad_norm': 0.21435360610485077, 'learning_rate': 0.000182296918767507, 'epoch': 0.09}


  9%|▉         | 1112/12500 [1:55:01<22:20:30,  7.06s/it]

{'loss': 0.8411, 'grad_norm': 0.24503913521766663, 'learning_rate': 0.00018228091236494599, 'epoch': 0.09}


  9%|▉         | 1113/12500 [1:55:08<22:13:22,  7.03s/it]

{'loss': 0.7066, 'grad_norm': 0.25166934728622437, 'learning_rate': 0.00018226490596238496, 'epoch': 0.09}


  9%|▉         | 1114/12500 [1:55:15<22:11:21,  7.02s/it]

{'loss': 0.6993, 'grad_norm': 0.25480055809020996, 'learning_rate': 0.00018224889955982394, 'epoch': 0.09}


  9%|▉         | 1115/12500 [1:55:23<23:19:50,  7.38s/it]

{'loss': 0.5621, 'grad_norm': 0.19742244482040405, 'learning_rate': 0.0001822328931572629, 'epoch': 0.09}


  9%|▉         | 1116/12500 [1:55:29<21:46:42,  6.89s/it]

{'loss': 0.5705, 'grad_norm': 0.27188435196876526, 'learning_rate': 0.00018221688675470189, 'epoch': 0.09}


  9%|▉         | 1117/12500 [1:55:36<22:35:49,  7.15s/it]

{'loss': 0.9296, 'grad_norm': 0.24595043063163757, 'learning_rate': 0.00018220088035214086, 'epoch': 0.09}


  9%|▉         | 1118/12500 [1:55:41<20:29:08,  6.48s/it]

{'loss': 0.6659, 'grad_norm': 0.2689412236213684, 'learning_rate': 0.00018218487394957984, 'epoch': 0.09}


  9%|▉         | 1119/12500 [1:55:50<22:06:55,  7.00s/it]

{'loss': 0.8878, 'grad_norm': 0.2148563116788864, 'learning_rate': 0.0001821688675470188, 'epoch': 0.09}


  9%|▉         | 1120/12500 [1:55:54<19:51:13,  6.28s/it]

{'loss': 0.6053, 'grad_norm': 0.2759123742580414, 'learning_rate': 0.00018215286114445779, 'epoch': 0.09}


  9%|▉         | 1121/12500 [1:56:02<21:11:24,  6.70s/it]

{'loss': 0.8194, 'grad_norm': 0.2647636830806732, 'learning_rate': 0.00018213685474189676, 'epoch': 0.09}


  9%|▉         | 1122/12500 [1:56:07<19:49:15,  6.27s/it]

{'loss': 0.6605, 'grad_norm': 0.31971073150634766, 'learning_rate': 0.00018212084833933573, 'epoch': 0.09}


  9%|▉         | 1123/12500 [1:56:14<20:03:20,  6.35s/it]

{'loss': 0.87, 'grad_norm': 0.2288590967655182, 'learning_rate': 0.00018210484193677474, 'epoch': 0.09}


  9%|▉         | 1124/12500 [1:56:21<21:19:39,  6.75s/it]

{'loss': 0.7136, 'grad_norm': 0.2294013351202011, 'learning_rate': 0.00018208883553421368, 'epoch': 0.09}


  9%|▉         | 1125/12500 [1:56:26<19:47:06,  6.26s/it]

{'loss': 0.5699, 'grad_norm': 0.25895512104034424, 'learning_rate': 0.00018207282913165266, 'epoch': 0.09}


  9%|▉         | 1126/12500 [1:56:32<19:04:29,  6.04s/it]

{'loss': 0.6724, 'grad_norm': 0.27290457487106323, 'learning_rate': 0.00018205682272909163, 'epoch': 0.09}


  9%|▉         | 1127/12500 [1:56:38<19:29:43,  6.17s/it]

{'loss': 0.6539, 'grad_norm': 0.24678562581539154, 'learning_rate': 0.00018204081632653064, 'epoch': 0.09}


  9%|▉         | 1128/12500 [1:56:44<19:06:07,  6.05s/it]

{'loss': 0.992, 'grad_norm': 0.2688712477684021, 'learning_rate': 0.00018202480992396958, 'epoch': 0.09}


  9%|▉         | 1129/12500 [1:56:52<20:41:10,  6.55s/it]

{'loss': 0.4531, 'grad_norm': 0.21291373670101166, 'learning_rate': 0.00018200880352140856, 'epoch': 0.09}


  9%|▉         | 1130/12500 [1:56:57<19:22:42,  6.14s/it]

{'loss': 0.6419, 'grad_norm': 0.2916126847267151, 'learning_rate': 0.00018199279711884756, 'epoch': 0.09}


  9%|▉         | 1131/12500 [1:57:04<19:59:14,  6.33s/it]

{'loss': 0.8199, 'grad_norm': 0.24287980794906616, 'learning_rate': 0.00018197679071628654, 'epoch': 0.09}


  9%|▉         | 1132/12500 [1:57:10<20:02:34,  6.35s/it]

{'loss': 0.5531, 'grad_norm': 0.2149958312511444, 'learning_rate': 0.00018196078431372548, 'epoch': 0.09}


  9%|▉         | 1133/12500 [1:57:15<18:55:17,  5.99s/it]

{'loss': 0.8394, 'grad_norm': 0.2665698826313019, 'learning_rate': 0.00018194477791116446, 'epoch': 0.09}


  9%|▉         | 1134/12500 [1:57:23<20:40:12,  6.55s/it]

{'loss': 0.7162, 'grad_norm': 0.1845531314611435, 'learning_rate': 0.00018192877150860346, 'epoch': 0.09}


  9%|▉         | 1135/12500 [1:57:32<22:43:45,  7.20s/it]

{'loss': 0.9484, 'grad_norm': 0.2539837658405304, 'learning_rate': 0.00018191276510604243, 'epoch': 0.09}


  9%|▉         | 1136/12500 [1:57:38<21:12:03,  6.72s/it]

{'loss': 0.7143, 'grad_norm': 0.28028151392936707, 'learning_rate': 0.00018189675870348138, 'epoch': 0.09}


  9%|▉         | 1137/12500 [1:57:42<19:20:20,  6.13s/it]

{'loss': 0.6671, 'grad_norm': 0.25690579414367676, 'learning_rate': 0.00018188075230092038, 'epoch': 0.09}


  9%|▉         | 1138/12500 [1:57:51<21:57:18,  6.96s/it]

{'loss': 0.6911, 'grad_norm': 0.23316863179206848, 'learning_rate': 0.00018186474589835936, 'epoch': 0.09}


  9%|▉         | 1139/12500 [1:57:59<22:34:24,  7.15s/it]

{'loss': 0.652, 'grad_norm': 0.23157984018325806, 'learning_rate': 0.00018184873949579833, 'epoch': 0.09}


  9%|▉         | 1140/12500 [1:58:06<22:18:43,  7.07s/it]

{'loss': 0.8115, 'grad_norm': 0.25125637650489807, 'learning_rate': 0.00018183273309323728, 'epoch': 0.09}


  9%|▉         | 1141/12500 [1:58:12<21:47:30,  6.91s/it]

{'loss': 0.8897, 'grad_norm': 0.320803701877594, 'learning_rate': 0.00018181672669067628, 'epoch': 0.09}


  9%|▉         | 1142/12500 [1:58:20<22:54:03,  7.26s/it]

{'loss': 0.8653, 'grad_norm': 0.22309042513370514, 'learning_rate': 0.00018180072028811526, 'epoch': 0.09}


  9%|▉         | 1143/12500 [1:58:28<22:52:34,  7.25s/it]

{'loss': 0.4164, 'grad_norm': 0.21061725914478302, 'learning_rate': 0.00018178471388555423, 'epoch': 0.09}


  9%|▉         | 1144/12500 [1:58:33<21:19:52,  6.76s/it]

{'loss': 0.5327, 'grad_norm': 0.25502538681030273, 'learning_rate': 0.0001817687074829932, 'epoch': 0.09}


  9%|▉         | 1145/12500 [1:58:41<22:21:07,  7.09s/it]

{'loss': 0.9507, 'grad_norm': 0.20712736248970032, 'learning_rate': 0.00018175270108043218, 'epoch': 0.09}


  9%|▉         | 1146/12500 [1:58:46<20:20:47,  6.45s/it]

{'loss': 0.6628, 'grad_norm': 0.27910175919532776, 'learning_rate': 0.00018173669467787116, 'epoch': 0.09}


  9%|▉         | 1147/12500 [1:58:53<20:53:47,  6.63s/it]

{'loss': 0.5219, 'grad_norm': 0.2103579342365265, 'learning_rate': 0.00018172068827531013, 'epoch': 0.09}


  9%|▉         | 1148/12500 [1:59:00<21:12:36,  6.73s/it]

{'loss': 0.9732, 'grad_norm': 0.29158681631088257, 'learning_rate': 0.0001817046818727491, 'epoch': 0.09}


  9%|▉         | 1149/12500 [1:59:05<19:46:42,  6.27s/it]

{'loss': 0.7001, 'grad_norm': 0.26282015442848206, 'learning_rate': 0.00018168867547018808, 'epoch': 0.09}


  9%|▉         | 1150/12500 [1:59:14<21:44:30,  6.90s/it]

{'loss': 0.6497, 'grad_norm': 0.19404290616512299, 'learning_rate': 0.00018167266906762706, 'epoch': 0.09}


  9%|▉         | 1151/12500 [1:59:20<21:31:47,  6.83s/it]

{'loss': 0.8003, 'grad_norm': 0.32799071073532104, 'learning_rate': 0.00018165666266506603, 'epoch': 0.09}


  9%|▉         | 1152/12500 [1:59:28<22:11:21,  7.04s/it]

{'loss': 0.5595, 'grad_norm': 0.27955740690231323, 'learning_rate': 0.000181640656262505, 'epoch': 0.09}


  9%|▉         | 1153/12500 [1:59:35<22:00:14,  6.98s/it]

{'loss': 0.4591, 'grad_norm': 0.2403935343027115, 'learning_rate': 0.00018162464985994398, 'epoch': 0.09}


  9%|▉         | 1154/12500 [1:59:40<20:37:20,  6.54s/it]

{'loss': 0.7558, 'grad_norm': 0.2623291611671448, 'learning_rate': 0.00018160864345738296, 'epoch': 0.09}


  9%|▉         | 1155/12500 [1:59:46<19:41:22,  6.25s/it]

{'loss': 0.7598, 'grad_norm': 0.2569665312767029, 'learning_rate': 0.00018159263705482196, 'epoch': 0.09}


  9%|▉         | 1156/12500 [1:59:53<20:17:55,  6.44s/it]

{'loss': 0.8081, 'grad_norm': 0.248556986451149, 'learning_rate': 0.0001815766306522609, 'epoch': 0.09}


  9%|▉         | 1157/12500 [1:59:59<20:14:04,  6.42s/it]

{'loss': 0.6714, 'grad_norm': 0.25478920340538025, 'learning_rate': 0.00018156062424969988, 'epoch': 0.09}


  9%|▉         | 1158/12500 [2:00:04<18:37:08,  5.91s/it]

{'loss': 0.4962, 'grad_norm': 0.2577439844608307, 'learning_rate': 0.00018154461784713886, 'epoch': 0.09}


  9%|▉         | 1159/12500 [2:00:07<16:00:45,  5.08s/it]

{'loss': 1.2857, 'grad_norm': 0.4056028127670288, 'learning_rate': 0.00018152861144457786, 'epoch': 0.09}


  9%|▉         | 1160/12500 [2:00:13<16:55:39,  5.37s/it]

{'loss': 0.5481, 'grad_norm': 0.22702592611312866, 'learning_rate': 0.0001815126050420168, 'epoch': 0.09}


  9%|▉         | 1161/12500 [2:00:20<18:35:10,  5.90s/it]

{'loss': 0.8236, 'grad_norm': 0.23674091696739197, 'learning_rate': 0.00018149659863945578, 'epoch': 0.09}


  9%|▉         | 1162/12500 [2:00:31<23:33:21,  7.48s/it]

{'loss': 0.7075, 'grad_norm': 0.19182255864143372, 'learning_rate': 0.00018148059223689478, 'epoch': 0.09}


  9%|▉         | 1163/12500 [2:00:40<24:39:09,  7.83s/it]

{'loss': 0.5116, 'grad_norm': 0.18557532131671906, 'learning_rate': 0.00018146458583433376, 'epoch': 0.09}


  9%|▉         | 1164/12500 [2:00:46<23:31:17,  7.47s/it]

{'loss': 0.8861, 'grad_norm': 0.23818367719650269, 'learning_rate': 0.0001814485794317727, 'epoch': 0.09}


  9%|▉         | 1165/12500 [2:00:50<20:08:43,  6.40s/it]

{'loss': 0.6968, 'grad_norm': 0.33303600549697876, 'learning_rate': 0.00018143257302921168, 'epoch': 0.09}


  9%|▉         | 1166/12500 [2:00:57<20:32:11,  6.52s/it]

{'loss': 0.7286, 'grad_norm': 0.263010710477829, 'learning_rate': 0.00018141656662665068, 'epoch': 0.09}


  9%|▉         | 1167/12500 [2:01:05<22:09:25,  7.04s/it]

{'loss': 0.6727, 'grad_norm': 0.25160157680511475, 'learning_rate': 0.00018140056022408966, 'epoch': 0.09}


  9%|▉         | 1168/12500 [2:01:11<20:42:11,  6.58s/it]

{'loss': 0.8285, 'grad_norm': 0.2609068751335144, 'learning_rate': 0.0001813845538215286, 'epoch': 0.09}


  9%|▉         | 1169/12500 [2:01:18<21:37:06,  6.87s/it]

{'loss': 0.6982, 'grad_norm': 0.2270054817199707, 'learning_rate': 0.0001813685474189676, 'epoch': 0.09}


  9%|▉         | 1170/12500 [2:01:26<22:06:26,  7.02s/it]

{'loss': 0.4314, 'grad_norm': 0.20675452053546906, 'learning_rate': 0.00018135254101640658, 'epoch': 0.09}


  9%|▉         | 1171/12500 [2:01:32<21:06:22,  6.71s/it]

{'loss': 0.718, 'grad_norm': 0.2176724374294281, 'learning_rate': 0.00018133653461384556, 'epoch': 0.09}


  9%|▉         | 1172/12500 [2:01:37<19:40:25,  6.25s/it]

{'loss': 0.7345, 'grad_norm': 0.23006528615951538, 'learning_rate': 0.0001813205282112845, 'epoch': 0.09}


  9%|▉         | 1173/12500 [2:01:43<19:17:13,  6.13s/it]

{'loss': 0.6307, 'grad_norm': 0.2712123990058899, 'learning_rate': 0.0001813045218087235, 'epoch': 0.09}


  9%|▉         | 1174/12500 [2:01:53<23:05:16,  7.34s/it]

{'loss': 0.9848, 'grad_norm': 0.21366453170776367, 'learning_rate': 0.00018128851540616248, 'epoch': 0.09}


  9%|▉         | 1175/12500 [2:01:59<21:52:17,  6.95s/it]

{'loss': 0.6553, 'grad_norm': 0.26764360070228577, 'learning_rate': 0.00018127250900360146, 'epoch': 0.09}


  9%|▉         | 1176/12500 [2:02:07<22:54:35,  7.28s/it]

{'loss': 1.024, 'grad_norm': 0.23181363940238953, 'learning_rate': 0.00018125650260104043, 'epoch': 0.09}


  9%|▉         | 1177/12500 [2:02:14<22:24:05,  7.12s/it]

{'loss': 0.7265, 'grad_norm': 0.24802464246749878, 'learning_rate': 0.0001812404961984794, 'epoch': 0.09}


  9%|▉         | 1178/12500 [2:02:19<20:45:21,  6.60s/it]

{'loss': 0.8348, 'grad_norm': 0.2857252359390259, 'learning_rate': 0.00018122448979591838, 'epoch': 0.09}


  9%|▉         | 1179/12500 [2:02:25<20:18:25,  6.46s/it]

{'loss': 0.5684, 'grad_norm': 0.2487199306488037, 'learning_rate': 0.00018120848339335736, 'epoch': 0.09}


  9%|▉         | 1180/12500 [2:02:32<20:14:39,  6.44s/it]

{'loss': 0.579, 'grad_norm': 0.2546936273574829, 'learning_rate': 0.00018119247699079633, 'epoch': 0.09}


  9%|▉         | 1181/12500 [2:02:35<17:27:36,  5.55s/it]

{'loss': 0.5628, 'grad_norm': 0.29508495330810547, 'learning_rate': 0.0001811764705882353, 'epoch': 0.09}


  9%|▉         | 1182/12500 [2:02:39<16:07:10,  5.13s/it]

{'loss': 0.6769, 'grad_norm': 0.3015643060207367, 'learning_rate': 0.00018116046418567428, 'epoch': 0.09}


  9%|▉         | 1183/12500 [2:02:48<19:20:55,  6.15s/it]

{'loss': 0.7098, 'grad_norm': 0.1844673901796341, 'learning_rate': 0.00018114445778311326, 'epoch': 0.09}


  9%|▉         | 1184/12500 [2:02:55<20:29:18,  6.52s/it]

{'loss': 0.6668, 'grad_norm': 0.2688639760017395, 'learning_rate': 0.00018112845138055223, 'epoch': 0.09}


  9%|▉         | 1185/12500 [2:02:59<18:09:08,  5.78s/it]

{'loss': 0.972, 'grad_norm': 0.39157021045684814, 'learning_rate': 0.0001811124449779912, 'epoch': 0.09}


  9%|▉         | 1186/12500 [2:03:06<19:16:42,  6.13s/it]

{'loss': 0.4733, 'grad_norm': 0.20802469551563263, 'learning_rate': 0.00018109643857543018, 'epoch': 0.09}


  9%|▉         | 1187/12500 [2:03:13<19:56:37,  6.35s/it]

{'loss': 0.778, 'grad_norm': 0.2706572413444519, 'learning_rate': 0.00018108043217286915, 'epoch': 0.09}


 10%|▉         | 1188/12500 [2:03:17<17:50:08,  5.68s/it]

{'loss': 1.1436, 'grad_norm': 0.40632298588752747, 'learning_rate': 0.00018106442577030813, 'epoch': 0.1}


 10%|▉         | 1189/12500 [2:03:23<17:54:30,  5.70s/it]

{'loss': 0.677, 'grad_norm': 0.2655639946460724, 'learning_rate': 0.0001810484193677471, 'epoch': 0.1}


 10%|▉         | 1190/12500 [2:03:28<16:58:17,  5.40s/it]

{'loss': 0.8964, 'grad_norm': 0.28924497961997986, 'learning_rate': 0.0001810324129651861, 'epoch': 0.1}


 10%|▉         | 1191/12500 [2:03:33<16:23:11,  5.22s/it]

{'loss': 0.8338, 'grad_norm': 0.31271645426750183, 'learning_rate': 0.00018101640656262505, 'epoch': 0.1}


 10%|▉         | 1192/12500 [2:03:44<22:10:29,  7.06s/it]

{'loss': 1.2386, 'grad_norm': 0.1984698623418808, 'learning_rate': 0.00018100040016006403, 'epoch': 0.1}


 10%|▉         | 1193/12500 [2:03:49<20:03:44,  6.39s/it]

{'loss': 0.631, 'grad_norm': 0.2823523283004761, 'learning_rate': 0.000180984393757503, 'epoch': 0.1}


 10%|▉         | 1194/12500 [2:03:54<18:54:54,  6.02s/it]

{'loss': 0.7444, 'grad_norm': 0.2843126952648163, 'learning_rate': 0.000180968387354942, 'epoch': 0.1}


 10%|▉         | 1195/12500 [2:04:00<18:33:10,  5.91s/it]

{'loss': 0.7036, 'grad_norm': 0.2829682528972626, 'learning_rate': 0.00018095238095238095, 'epoch': 0.1}


 10%|▉         | 1196/12500 [2:04:06<19:32:46,  6.22s/it]

{'loss': 1.0962, 'grad_norm': 0.24514459073543549, 'learning_rate': 0.00018093637454981993, 'epoch': 0.1}


 10%|▉         | 1197/12500 [2:04:15<21:48:55,  6.95s/it]

{'loss': 0.9519, 'grad_norm': 0.21708284318447113, 'learning_rate': 0.00018092036814725893, 'epoch': 0.1}


 10%|▉         | 1198/12500 [2:04:22<22:01:13,  7.01s/it]

{'loss': 0.8383, 'grad_norm': 0.30491408705711365, 'learning_rate': 0.0001809043617446979, 'epoch': 0.1}


 10%|▉         | 1199/12500 [2:04:27<20:08:36,  6.42s/it]

{'loss': 0.7978, 'grad_norm': 0.2797541320323944, 'learning_rate': 0.00018088835534213685, 'epoch': 0.1}


 10%|▉         | 1200/12500 [2:04:32<18:19:36,  5.84s/it]

{'loss': 0.8742, 'grad_norm': 0.3170015215873718, 'learning_rate': 0.00018087234893957583, 'epoch': 0.1}


 10%|▉         | 1201/12500 [2:04:39<19:45:04,  6.29s/it]

{'loss': 0.7057, 'grad_norm': 0.2463744580745697, 'learning_rate': 0.00018085634253701483, 'epoch': 0.1}


 10%|▉         | 1202/12500 [2:04:47<21:34:28,  6.87s/it]

{'loss': 0.8767, 'grad_norm': 0.23765546083450317, 'learning_rate': 0.0001808403361344538, 'epoch': 0.1}


 10%|▉         | 1203/12500 [2:04:54<21:38:04,  6.89s/it]

{'loss': 0.55, 'grad_norm': 0.22169668972492218, 'learning_rate': 0.00018082432973189275, 'epoch': 0.1}


 10%|▉         | 1204/12500 [2:04:59<19:35:48,  6.25s/it]

{'loss': 0.7102, 'grad_norm': 0.2838522493839264, 'learning_rate': 0.00018080832332933175, 'epoch': 0.1}


 10%|▉         | 1205/12500 [2:05:08<22:04:51,  7.04s/it]

{'loss': 0.7779, 'grad_norm': 0.20948585867881775, 'learning_rate': 0.00018079231692677073, 'epoch': 0.1}


 10%|▉         | 1206/12500 [2:05:16<22:40:55,  7.23s/it]

{'loss': 0.6655, 'grad_norm': 0.25979092717170715, 'learning_rate': 0.0001807763105242097, 'epoch': 0.1}


 10%|▉         | 1207/12500 [2:05:23<22:30:17,  7.17s/it]

{'loss': 0.4924, 'grad_norm': 0.27256980538368225, 'learning_rate': 0.00018076030412164865, 'epoch': 0.1}


 10%|▉         | 1208/12500 [2:05:32<24:46:40,  7.90s/it]

{'loss': 0.9959, 'grad_norm': 0.228755921125412, 'learning_rate': 0.00018074429771908765, 'epoch': 0.1}


 10%|▉         | 1209/12500 [2:05:39<23:29:01,  7.49s/it]

{'loss': 0.4663, 'grad_norm': 0.2627354562282562, 'learning_rate': 0.00018072829131652663, 'epoch': 0.1}


 10%|▉         | 1210/12500 [2:05:44<21:05:33,  6.73s/it]

{'loss': 0.7446, 'grad_norm': 0.32406938076019287, 'learning_rate': 0.0001807122849139656, 'epoch': 0.1}


 10%|▉         | 1211/12500 [2:05:48<18:40:07,  5.95s/it]

{'loss': 0.7043, 'grad_norm': 0.30440330505371094, 'learning_rate': 0.00018069627851140455, 'epoch': 0.1}


 10%|▉         | 1212/12500 [2:05:53<17:41:10,  5.64s/it]

{'loss': 0.6874, 'grad_norm': 0.27186647057533264, 'learning_rate': 0.00018068027210884355, 'epoch': 0.1}


 10%|▉         | 1213/12500 [2:05:58<17:30:24,  5.58s/it]

{'loss': 0.7205, 'grad_norm': 0.2847030758857727, 'learning_rate': 0.00018066426570628253, 'epoch': 0.1}


 10%|▉         | 1214/12500 [2:06:06<19:21:21,  6.17s/it]

{'loss': 0.6775, 'grad_norm': 0.25762584805488586, 'learning_rate': 0.0001806482593037215, 'epoch': 0.1}


 10%|▉         | 1215/12500 [2:06:10<17:54:33,  5.71s/it]

{'loss': 0.5474, 'grad_norm': 0.31027737259864807, 'learning_rate': 0.00018063225290116048, 'epoch': 0.1}


 10%|▉         | 1216/12500 [2:06:15<17:01:54,  5.43s/it]

{'loss': 0.6907, 'grad_norm': 0.29641103744506836, 'learning_rate': 0.00018061624649859945, 'epoch': 0.1}


 10%|▉         | 1217/12500 [2:06:22<18:10:33,  5.80s/it]

{'loss': 0.7784, 'grad_norm': 0.27141955494880676, 'learning_rate': 0.00018060024009603843, 'epoch': 0.1}


 10%|▉         | 1218/12500 [2:06:26<16:33:20,  5.28s/it]

{'loss': 0.5482, 'grad_norm': 0.2783963084220886, 'learning_rate': 0.0001805842336934774, 'epoch': 0.1}


 10%|▉         | 1219/12500 [2:06:33<17:58:23,  5.74s/it]

{'loss': 0.5205, 'grad_norm': 0.21941837668418884, 'learning_rate': 0.00018056822729091638, 'epoch': 0.1}


 10%|▉         | 1220/12500 [2:06:43<22:01:23,  7.03s/it]

{'loss': 0.6513, 'grad_norm': 0.1924588531255722, 'learning_rate': 0.00018055222088835535, 'epoch': 0.1}


 10%|▉         | 1221/12500 [2:06:52<24:06:33,  7.70s/it]

{'loss': 0.6241, 'grad_norm': 0.21177415549755096, 'learning_rate': 0.00018053621448579433, 'epoch': 0.1}


 10%|▉         | 1222/12500 [2:06:57<21:48:29,  6.96s/it]

{'loss': 0.7168, 'grad_norm': 0.2734871804714203, 'learning_rate': 0.0001805202080832333, 'epoch': 0.1}


 10%|▉         | 1223/12500 [2:07:04<21:59:52,  7.02s/it]

{'loss': 1.0346, 'grad_norm': 0.3201875686645508, 'learning_rate': 0.00018050420168067228, 'epoch': 0.1}


 10%|▉         | 1224/12500 [2:07:12<22:15:50,  7.11s/it]

{'loss': 0.6799, 'grad_norm': 0.26759016513824463, 'learning_rate': 0.00018048819527811125, 'epoch': 0.1}


 10%|▉         | 1225/12500 [2:07:18<21:13:59,  6.78s/it]

{'loss': 0.4327, 'grad_norm': 0.23760847747325897, 'learning_rate': 0.00018047218887555023, 'epoch': 0.1}


 10%|▉         | 1226/12500 [2:07:23<19:31:38,  6.24s/it]

{'loss': 0.5687, 'grad_norm': 0.2979293167591095, 'learning_rate': 0.0001804561824729892, 'epoch': 0.1}


 10%|▉         | 1227/12500 [2:07:27<17:51:36,  5.70s/it]

{'loss': 0.9092, 'grad_norm': 0.3154107928276062, 'learning_rate': 0.00018044017607042818, 'epoch': 0.1}


 10%|▉         | 1228/12500 [2:07:34<18:47:25,  6.00s/it]

{'loss': 0.7396, 'grad_norm': 0.2454422265291214, 'learning_rate': 0.00018042416966786715, 'epoch': 0.1}


 10%|▉         | 1229/12500 [2:07:41<19:44:28,  6.31s/it]

{'loss': 0.7353, 'grad_norm': 0.25886270403862, 'learning_rate': 0.00018040816326530615, 'epoch': 0.1}


 10%|▉         | 1230/12500 [2:07:47<19:30:13,  6.23s/it]

{'loss': 0.6314, 'grad_norm': 0.32024893164634705, 'learning_rate': 0.0001803921568627451, 'epoch': 0.1}


 10%|▉         | 1231/12500 [2:07:51<17:09:41,  5.48s/it]

{'loss': 0.6091, 'grad_norm': 0.3262996971607208, 'learning_rate': 0.00018037615046018408, 'epoch': 0.1}


 10%|▉         | 1232/12500 [2:08:01<21:34:37,  6.89s/it]

{'loss': 0.9369, 'grad_norm': 0.2118392437696457, 'learning_rate': 0.00018036014405762305, 'epoch': 0.1}


 10%|▉         | 1233/12500 [2:08:06<19:58:06,  6.38s/it]

{'loss': 0.5869, 'grad_norm': 0.2382718324661255, 'learning_rate': 0.00018034413765506205, 'epoch': 0.1}


 10%|▉         | 1234/12500 [2:08:13<20:03:23,  6.41s/it]

{'loss': 0.7971, 'grad_norm': 0.2498728185892105, 'learning_rate': 0.000180328131252501, 'epoch': 0.1}


 10%|▉         | 1235/12500 [2:08:20<20:47:36,  6.65s/it]

{'loss': 0.5901, 'grad_norm': 0.22322766482830048, 'learning_rate': 0.00018031212484993997, 'epoch': 0.1}


 10%|▉         | 1236/12500 [2:08:25<19:41:59,  6.30s/it]

{'loss': 0.7501, 'grad_norm': 0.2599695026874542, 'learning_rate': 0.00018029611844737898, 'epoch': 0.1}


 10%|▉         | 1237/12500 [2:08:31<19:03:55,  6.09s/it]

{'loss': 0.55, 'grad_norm': 0.23665565252304077, 'learning_rate': 0.00018028011204481795, 'epoch': 0.1}


 10%|▉         | 1238/12500 [2:08:40<21:59:56,  7.03s/it]

{'loss': 0.803, 'grad_norm': 0.22931812703609467, 'learning_rate': 0.0001802641056422569, 'epoch': 0.1}


 10%|▉         | 1239/12500 [2:08:50<25:10:41,  8.05s/it]

{'loss': 0.8176, 'grad_norm': 0.18444088101387024, 'learning_rate': 0.00018024809923969587, 'epoch': 0.1}


 10%|▉         | 1240/12500 [2:09:00<26:19:43,  8.42s/it]

{'loss': 0.5415, 'grad_norm': 0.19213543832302094, 'learning_rate': 0.00018023209283713488, 'epoch': 0.1}


 10%|▉         | 1241/12500 [2:09:10<27:48:02,  8.89s/it]

{'loss': 0.5785, 'grad_norm': 0.228101909160614, 'learning_rate': 0.00018021608643457385, 'epoch': 0.1}


 10%|▉         | 1242/12500 [2:09:17<26:36:36,  8.51s/it]

{'loss': 0.5788, 'grad_norm': 0.22005592286586761, 'learning_rate': 0.0001802000800320128, 'epoch': 0.1}


 10%|▉         | 1243/12500 [2:09:25<25:25:26,  8.13s/it]

{'loss': 0.9554, 'grad_norm': 0.2624902129173279, 'learning_rate': 0.0001801840736294518, 'epoch': 0.1}


 10%|▉         | 1244/12500 [2:09:32<24:41:52,  7.90s/it]

{'loss': 0.9547, 'grad_norm': 0.2498892843723297, 'learning_rate': 0.00018016806722689078, 'epoch': 0.1}


 10%|▉         | 1245/12500 [2:09:39<23:24:07,  7.49s/it]

{'loss': 0.4939, 'grad_norm': 0.23095712065696716, 'learning_rate': 0.00018015206082432975, 'epoch': 0.1}


 10%|▉         | 1246/12500 [2:09:43<20:32:25,  6.57s/it]

{'loss': 0.5981, 'grad_norm': 0.27795132994651794, 'learning_rate': 0.0001801360544217687, 'epoch': 0.1}


 10%|▉         | 1247/12500 [2:09:50<20:55:40,  6.70s/it]

{'loss': 0.9148, 'grad_norm': 0.24051572382450104, 'learning_rate': 0.0001801200480192077, 'epoch': 0.1}


 10%|▉         | 1248/12500 [2:09:55<19:13:07,  6.15s/it]

{'loss': 0.6998, 'grad_norm': 0.2877730429172516, 'learning_rate': 0.00018010404161664667, 'epoch': 0.1}


 10%|▉         | 1249/12500 [2:10:01<19:41:49,  6.30s/it]

{'loss': 0.7573, 'grad_norm': 0.2856208086013794, 'learning_rate': 0.00018008803521408565, 'epoch': 0.1}


 10%|█         | 1250/12500 [2:10:06<18:22:06,  5.88s/it]

{'loss': 0.6634, 'grad_norm': 0.2672116160392761, 'learning_rate': 0.00018007202881152462, 'epoch': 0.1}


 10%|█         | 1251/12500 [2:10:15<20:35:49,  6.59s/it]

{'loss': 0.8306, 'grad_norm': 0.19734814763069153, 'learning_rate': 0.0001800560224089636, 'epoch': 0.1}


 10%|█         | 1252/12500 [2:10:21<20:05:49,  6.43s/it]

{'loss': 0.8899, 'grad_norm': 0.28132903575897217, 'learning_rate': 0.00018004001600640257, 'epoch': 0.1}


 10%|█         | 1253/12500 [2:10:26<18:53:11,  6.05s/it]

{'loss': 0.796, 'grad_norm': 0.25125253200531006, 'learning_rate': 0.00018002400960384155, 'epoch': 0.1}


 10%|█         | 1254/12500 [2:10:34<20:57:38,  6.71s/it]

{'loss': 0.6619, 'grad_norm': 0.21865196526050568, 'learning_rate': 0.00018000800320128052, 'epoch': 0.1}


 10%|█         | 1255/12500 [2:10:43<23:28:23,  7.51s/it]

{'loss': 0.7098, 'grad_norm': 0.2324901968240738, 'learning_rate': 0.0001799919967987195, 'epoch': 0.1}


 10%|█         | 1256/12500 [2:10:51<23:20:40,  7.47s/it]

{'loss': 0.5962, 'grad_norm': 0.21385709941387177, 'learning_rate': 0.00017997599039615847, 'epoch': 0.1}


 10%|█         | 1257/12500 [2:10:58<23:10:57,  7.42s/it]

{'loss': 0.9253, 'grad_norm': 0.2557966709136963, 'learning_rate': 0.00017995998399359745, 'epoch': 0.1}


 10%|█         | 1258/12500 [2:11:05<22:51:22,  7.32s/it]

{'loss': 0.993, 'grad_norm': 0.25237005949020386, 'learning_rate': 0.00017994397759103642, 'epoch': 0.1}


 10%|█         | 1259/12500 [2:11:10<20:24:32,  6.54s/it]

{'loss': 0.9001, 'grad_norm': 0.2752732038497925, 'learning_rate': 0.0001799279711884754, 'epoch': 0.1}


 10%|█         | 1260/12500 [2:11:14<18:25:23,  5.90s/it]

{'loss': 0.4728, 'grad_norm': 0.3185074031352997, 'learning_rate': 0.00017991196478591437, 'epoch': 0.1}


 10%|█         | 1261/12500 [2:11:21<18:58:56,  6.08s/it]

{'loss': 0.4972, 'grad_norm': 0.2473875880241394, 'learning_rate': 0.00017989595838335335, 'epoch': 0.1}


 10%|█         | 1262/12500 [2:11:29<20:56:52,  6.71s/it]

{'loss': 0.4634, 'grad_norm': 0.20777203142642975, 'learning_rate': 0.00017987995198079232, 'epoch': 0.1}


 10%|█         | 1263/12500 [2:11:36<21:31:58,  6.90s/it]

{'loss': 0.5601, 'grad_norm': 0.26809778809547424, 'learning_rate': 0.0001798639455782313, 'epoch': 0.1}


 10%|█         | 1264/12500 [2:11:46<23:42:30,  7.60s/it]

{'loss': 0.6974, 'grad_norm': 0.20278315246105194, 'learning_rate': 0.00017984793917567027, 'epoch': 0.1}


 10%|█         | 1265/12500 [2:11:53<23:11:28,  7.43s/it]

{'loss': 0.6956, 'grad_norm': 0.22915934026241302, 'learning_rate': 0.00017983193277310925, 'epoch': 0.1}


 10%|█         | 1266/12500 [2:11:56<19:17:13,  6.18s/it]

{'loss': 0.4987, 'grad_norm': 0.3158305883407593, 'learning_rate': 0.00017981592637054822, 'epoch': 0.1}


 10%|█         | 1267/12500 [2:12:05<21:57:45,  7.04s/it]

{'loss': 0.6368, 'grad_norm': 0.19213494658470154, 'learning_rate': 0.0001797999199679872, 'epoch': 0.1}


 10%|█         | 1268/12500 [2:12:12<21:52:28,  7.01s/it]

{'loss': 0.667, 'grad_norm': 0.1920434534549713, 'learning_rate': 0.0001797839135654262, 'epoch': 0.1}


 10%|█         | 1269/12500 [2:12:19<22:06:40,  7.09s/it]

{'loss': 0.4431, 'grad_norm': 0.20034706592559814, 'learning_rate': 0.00017976790716286515, 'epoch': 0.1}


 10%|█         | 1270/12500 [2:12:24<19:41:55,  6.31s/it]

{'loss': 0.7535, 'grad_norm': 0.2776247262954712, 'learning_rate': 0.00017975190076030412, 'epoch': 0.1}


 10%|█         | 1271/12500 [2:12:29<18:46:16,  6.02s/it]

{'loss': 0.8558, 'grad_norm': 0.2657371163368225, 'learning_rate': 0.0001797358943577431, 'epoch': 0.1}


 10%|█         | 1272/12500 [2:12:37<20:39:42,  6.62s/it]

{'loss': 0.6101, 'grad_norm': 0.17705115675926208, 'learning_rate': 0.0001797198879551821, 'epoch': 0.1}


 10%|█         | 1273/12500 [2:12:43<19:42:24,  6.32s/it]

{'loss': 0.804, 'grad_norm': 0.2591533064842224, 'learning_rate': 0.00017970388155262105, 'epoch': 0.1}


 10%|█         | 1274/12500 [2:12:51<21:37:46,  6.94s/it]

{'loss': 0.8414, 'grad_norm': 0.2160029113292694, 'learning_rate': 0.00017968787515006002, 'epoch': 0.1}


 10%|█         | 1275/12500 [2:12:59<22:11:04,  7.11s/it]

{'loss': 1.0561, 'grad_norm': 0.23531347513198853, 'learning_rate': 0.00017967186874749902, 'epoch': 0.1}


 10%|█         | 1276/12500 [2:13:07<23:41:18,  7.60s/it]

{'loss': 0.4906, 'grad_norm': 0.2346627116203308, 'learning_rate': 0.000179655862344938, 'epoch': 0.1}


 10%|█         | 1277/12500 [2:13:11<19:46:38,  6.34s/it]

{'loss': 0.6138, 'grad_norm': 0.28672900795936584, 'learning_rate': 0.00017963985594237695, 'epoch': 0.1}


 10%|█         | 1278/12500 [2:13:17<20:00:26,  6.42s/it]

{'loss': 0.4962, 'grad_norm': 0.2226637899875641, 'learning_rate': 0.00017962384953981592, 'epoch': 0.1}


 10%|█         | 1279/12500 [2:13:23<19:39:06,  6.30s/it]

{'loss': 0.9641, 'grad_norm': 0.314316987991333, 'learning_rate': 0.00017960784313725492, 'epoch': 0.1}


 10%|█         | 1280/12500 [2:13:28<17:51:48,  5.73s/it]

{'loss': 0.6961, 'grad_norm': 0.32792577147483826, 'learning_rate': 0.0001795918367346939, 'epoch': 0.1}


 10%|█         | 1281/12500 [2:13:33<17:18:08,  5.55s/it]

{'loss': 0.6779, 'grad_norm': 0.2760016620159149, 'learning_rate': 0.00017957583033213284, 'epoch': 0.1}


 10%|█         | 1282/12500 [2:13:37<16:17:42,  5.23s/it]

{'loss': 0.6203, 'grad_norm': 0.2702178657054901, 'learning_rate': 0.00017955982392957185, 'epoch': 0.1}


 10%|█         | 1283/12500 [2:13:44<17:14:12,  5.53s/it]

{'loss': 0.5185, 'grad_norm': 0.2280704528093338, 'learning_rate': 0.00017954381752701082, 'epoch': 0.1}


 10%|█         | 1284/12500 [2:13:52<19:31:41,  6.27s/it]

{'loss': 0.7165, 'grad_norm': 0.24594655632972717, 'learning_rate': 0.0001795278111244498, 'epoch': 0.1}


 10%|█         | 1285/12500 [2:14:00<21:32:19,  6.91s/it]

{'loss': 0.6853, 'grad_norm': 0.20350021123886108, 'learning_rate': 0.00017951180472188874, 'epoch': 0.1}


 10%|█         | 1286/12500 [2:14:05<19:54:11,  6.39s/it]

{'loss': 0.8534, 'grad_norm': 0.26223909854888916, 'learning_rate': 0.00017949579831932775, 'epoch': 0.1}


 10%|█         | 1287/12500 [2:14:09<17:26:25,  5.60s/it]

{'loss': 0.8816, 'grad_norm': 0.38004228472709656, 'learning_rate': 0.00017947979191676672, 'epoch': 0.1}


 10%|█         | 1288/12500 [2:14:16<18:35:55,  5.97s/it]

{'loss': 0.8678, 'grad_norm': 0.2718573212623596, 'learning_rate': 0.0001794637855142057, 'epoch': 0.1}


 10%|█         | 1289/12500 [2:14:21<18:17:41,  5.87s/it]

{'loss': 0.6166, 'grad_norm': 0.25589314103126526, 'learning_rate': 0.00017944777911164467, 'epoch': 0.1}


 10%|█         | 1290/12500 [2:14:30<20:49:14,  6.69s/it]

{'loss': 0.5867, 'grad_norm': 0.17957039177417755, 'learning_rate': 0.00017943177270908365, 'epoch': 0.1}


 10%|█         | 1291/12500 [2:14:37<20:51:11,  6.70s/it]

{'loss': 0.8233, 'grad_norm': 0.3038921058177948, 'learning_rate': 0.00017941576630652262, 'epoch': 0.1}


 10%|█         | 1292/12500 [2:14:48<25:29:37,  8.19s/it]

{'loss': 0.6999, 'grad_norm': 0.17386077344417572, 'learning_rate': 0.0001793997599039616, 'epoch': 0.1}


 10%|█         | 1293/12500 [2:14:56<25:19:06,  8.13s/it]

{'loss': 0.829, 'grad_norm': 0.21970020234584808, 'learning_rate': 0.00017938375350140057, 'epoch': 0.1}


 10%|█         | 1294/12500 [2:15:02<22:40:41,  7.29s/it]

{'loss': 0.7176, 'grad_norm': 0.2641935348510742, 'learning_rate': 0.00017936774709883955, 'epoch': 0.1}


 10%|█         | 1295/12500 [2:15:09<23:08:36,  7.44s/it]

{'loss': 0.6109, 'grad_norm': 0.22810928523540497, 'learning_rate': 0.00017935174069627852, 'epoch': 0.1}


 10%|█         | 1296/12500 [2:15:20<25:38:56,  8.24s/it]

{'loss': 0.6515, 'grad_norm': 0.22123879194259644, 'learning_rate': 0.0001793357342937175, 'epoch': 0.1}


 10%|█         | 1297/12500 [2:15:27<24:50:29,  7.98s/it]

{'loss': 0.6528, 'grad_norm': 0.27860313653945923, 'learning_rate': 0.00017931972789115647, 'epoch': 0.1}


 10%|█         | 1298/12500 [2:15:34<24:17:49,  7.81s/it]

{'loss': 0.9102, 'grad_norm': 0.2851334810256958, 'learning_rate': 0.00017930372148859544, 'epoch': 0.1}


 10%|█         | 1299/12500 [2:15:39<21:45:55,  7.00s/it]

{'loss': 0.811, 'grad_norm': 0.299512654542923, 'learning_rate': 0.00017928771508603442, 'epoch': 0.1}


 10%|█         | 1300/12500 [2:15:45<20:07:55,  6.47s/it]

{'loss': 1.2624, 'grad_norm': 0.3628646731376648, 'learning_rate': 0.0001792717086834734, 'epoch': 0.1}


 10%|█         | 1301/12500 [2:15:52<20:43:13,  6.66s/it]

{'loss': 0.698, 'grad_norm': 0.24451912939548492, 'learning_rate': 0.00017925570228091237, 'epoch': 0.1}


 10%|█         | 1302/12500 [2:15:57<19:11:44,  6.17s/it]

{'loss': 1.088, 'grad_norm': 0.36056527495384216, 'learning_rate': 0.00017923969587835134, 'epoch': 0.1}


 10%|█         | 1303/12500 [2:16:01<17:40:22,  5.68s/it]

{'loss': 0.7833, 'grad_norm': 0.3095158040523529, 'learning_rate': 0.00017922368947579035, 'epoch': 0.1}


 10%|█         | 1304/12500 [2:16:07<17:32:50,  5.64s/it]

{'loss': 0.8157, 'grad_norm': 0.3260306715965271, 'learning_rate': 0.0001792076830732293, 'epoch': 0.1}


 10%|█         | 1305/12500 [2:16:15<20:04:08,  6.45s/it]

{'loss': 0.5426, 'grad_norm': 0.22579510509967804, 'learning_rate': 0.00017919167667066827, 'epoch': 0.1}


 10%|█         | 1306/12500 [2:16:20<18:12:36,  5.86s/it]

{'loss': 0.7564, 'grad_norm': 0.2910408675670624, 'learning_rate': 0.00017917567026810724, 'epoch': 0.1}


 10%|█         | 1307/12500 [2:16:26<18:24:22,  5.92s/it]

{'loss': 0.7187, 'grad_norm': 0.27917781472206116, 'learning_rate': 0.00017915966386554625, 'epoch': 0.1}


 10%|█         | 1308/12500 [2:16:32<18:52:19,  6.07s/it]

{'loss': 0.8527, 'grad_norm': 0.2307230681180954, 'learning_rate': 0.0001791436574629852, 'epoch': 0.1}


 10%|█         | 1309/12500 [2:16:39<19:07:41,  6.15s/it]

{'loss': 0.7369, 'grad_norm': 0.27185481786727905, 'learning_rate': 0.00017912765106042417, 'epoch': 0.1}


 10%|█         | 1310/12500 [2:16:42<16:38:50,  5.36s/it]

{'loss': 0.641, 'grad_norm': 0.29822829365730286, 'learning_rate': 0.00017911164465786317, 'epoch': 0.1}


 10%|█         | 1311/12500 [2:16:49<17:45:35,  5.71s/it]

{'loss': 0.5604, 'grad_norm': 0.2573097348213196, 'learning_rate': 0.00017909563825530214, 'epoch': 0.1}


 10%|█         | 1312/12500 [2:16:54<17:16:43,  5.56s/it]

{'loss': 0.623, 'grad_norm': 0.27534881234169006, 'learning_rate': 0.0001790796318527411, 'epoch': 0.1}


 11%|█         | 1313/12500 [2:17:00<18:13:20,  5.86s/it]

{'loss': 0.5467, 'grad_norm': 0.24179290235042572, 'learning_rate': 0.00017906362545018007, 'epoch': 0.11}


 11%|█         | 1314/12500 [2:17:05<16:54:03,  5.44s/it]

{'loss': 0.6198, 'grad_norm': 0.3124532103538513, 'learning_rate': 0.00017904761904761907, 'epoch': 0.11}


 11%|█         | 1315/12500 [2:17:11<17:38:08,  5.68s/it]

{'loss': 0.5568, 'grad_norm': 0.2339644432067871, 'learning_rate': 0.00017903161264505804, 'epoch': 0.11}


 11%|█         | 1316/12500 [2:17:16<17:12:34,  5.54s/it]

{'loss': 0.938, 'grad_norm': 0.3120327293872833, 'learning_rate': 0.000179015606242497, 'epoch': 0.11}


 11%|█         | 1317/12500 [2:17:25<20:13:48,  6.51s/it]

{'loss': 0.5434, 'grad_norm': 0.2028173953294754, 'learning_rate': 0.00017899959983993597, 'epoch': 0.11}


 11%|█         | 1318/12500 [2:17:31<19:59:01,  6.43s/it]

{'loss': 0.9723, 'grad_norm': 0.29548826813697815, 'learning_rate': 0.00017898359343737497, 'epoch': 0.11}


 11%|█         | 1319/12500 [2:17:37<18:59:40,  6.12s/it]

{'loss': 0.6774, 'grad_norm': 0.2674838602542877, 'learning_rate': 0.00017896758703481394, 'epoch': 0.11}


 11%|█         | 1320/12500 [2:17:44<20:04:26,  6.46s/it]

{'loss': 1.008, 'grad_norm': 0.24429969489574432, 'learning_rate': 0.0001789515806322529, 'epoch': 0.11}


 11%|█         | 1321/12500 [2:17:51<20:29:01,  6.60s/it]

{'loss': 0.9247, 'grad_norm': 0.22446686029434204, 'learning_rate': 0.0001789355742296919, 'epoch': 0.11}


 11%|█         | 1322/12500 [2:17:56<18:41:39,  6.02s/it]

{'loss': 0.6631, 'grad_norm': 0.28613215684890747, 'learning_rate': 0.00017891956782713087, 'epoch': 0.11}


 11%|█         | 1323/12500 [2:18:00<17:12:27,  5.54s/it]

{'loss': 0.8532, 'grad_norm': 0.2981020212173462, 'learning_rate': 0.00017890356142456984, 'epoch': 0.11}


 11%|█         | 1324/12500 [2:18:06<17:54:38,  5.77s/it]

{'loss': 0.5332, 'grad_norm': 0.2481125146150589, 'learning_rate': 0.0001788875550220088, 'epoch': 0.11}


 11%|█         | 1325/12500 [2:18:13<18:43:06,  6.03s/it]

{'loss': 0.5567, 'grad_norm': 0.2773715853691101, 'learning_rate': 0.0001788715486194478, 'epoch': 0.11}


 11%|█         | 1326/12500 [2:18:23<22:14:05,  7.16s/it]

{'loss': 0.7946, 'grad_norm': 0.24344441294670105, 'learning_rate': 0.00017885554221688677, 'epoch': 0.11}


 11%|█         | 1327/12500 [2:18:26<18:49:25,  6.07s/it]

{'loss': 0.9277, 'grad_norm': 0.35570743680000305, 'learning_rate': 0.00017883953581432574, 'epoch': 0.11}


 11%|█         | 1328/12500 [2:18:30<16:32:09,  5.33s/it]

{'loss': 0.7272, 'grad_norm': 0.35170572996139526, 'learning_rate': 0.00017882352941176472, 'epoch': 0.11}


 11%|█         | 1329/12500 [2:18:37<18:16:16,  5.89s/it]

{'loss': 0.7021, 'grad_norm': 0.2535070776939392, 'learning_rate': 0.0001788075230092037, 'epoch': 0.11}


 11%|█         | 1330/12500 [2:18:41<16:56:37,  5.46s/it]

{'loss': 0.5902, 'grad_norm': 0.2736995816230774, 'learning_rate': 0.00017879151660664267, 'epoch': 0.11}


 11%|█         | 1331/12500 [2:18:51<20:54:11,  6.74s/it]

{'loss': 0.8693, 'grad_norm': 0.20865200459957123, 'learning_rate': 0.00017877551020408164, 'epoch': 0.11}


 11%|█         | 1332/12500 [2:19:00<22:35:40,  7.28s/it]

{'loss': 0.9511, 'grad_norm': 0.21356414258480072, 'learning_rate': 0.00017875950380152062, 'epoch': 0.11}


 11%|█         | 1333/12500 [2:19:07<22:45:28,  7.34s/it]

{'loss': 0.6995, 'grad_norm': 0.28639647364616394, 'learning_rate': 0.0001787434973989596, 'epoch': 0.11}


 11%|█         | 1334/12500 [2:19:17<24:44:08,  7.97s/it]

{'loss': 0.6308, 'grad_norm': 0.25210925936698914, 'learning_rate': 0.00017872749099639857, 'epoch': 0.11}


 11%|█         | 1335/12500 [2:19:21<21:36:30,  6.97s/it]

{'loss': 0.8562, 'grad_norm': 0.29981547594070435, 'learning_rate': 0.00017871148459383754, 'epoch': 0.11}


 11%|█         | 1336/12500 [2:19:29<22:03:04,  7.11s/it]

{'loss': 0.7806, 'grad_norm': 0.2565930485725403, 'learning_rate': 0.00017869547819127652, 'epoch': 0.11}


 11%|█         | 1337/12500 [2:19:35<21:08:46,  6.82s/it]

{'loss': 0.707, 'grad_norm': 0.2454318106174469, 'learning_rate': 0.0001786794717887155, 'epoch': 0.11}


 11%|█         | 1338/12500 [2:19:42<21:24:48,  6.91s/it]

{'loss': 0.6728, 'grad_norm': 0.2592318058013916, 'learning_rate': 0.00017866346538615447, 'epoch': 0.11}


 11%|█         | 1339/12500 [2:19:47<19:23:01,  6.25s/it]

{'loss': 0.7064, 'grad_norm': 0.2604518532752991, 'learning_rate': 0.00017864745898359344, 'epoch': 0.11}


 11%|█         | 1340/12500 [2:19:55<21:14:08,  6.85s/it]

{'loss': 0.6027, 'grad_norm': 0.28330981731414795, 'learning_rate': 0.00017863145258103242, 'epoch': 0.11}


 11%|█         | 1341/12500 [2:20:00<19:46:59,  6.38s/it]

{'loss': 0.6916, 'grad_norm': 0.32591280341148376, 'learning_rate': 0.0001786154461784714, 'epoch': 0.11}


 11%|█         | 1342/12500 [2:20:09<22:17:59,  7.19s/it]

{'loss': 0.4377, 'grad_norm': 0.19806626439094543, 'learning_rate': 0.0001785994397759104, 'epoch': 0.11}


 11%|█         | 1343/12500 [2:20:15<20:36:22,  6.65s/it]

{'loss': 0.8673, 'grad_norm': 0.2844291031360626, 'learning_rate': 0.00017858343337334934, 'epoch': 0.11}


 11%|█         | 1344/12500 [2:20:22<21:31:53,  6.95s/it]

{'loss': 0.8005, 'grad_norm': 0.24369415640830994, 'learning_rate': 0.00017856742697078831, 'epoch': 0.11}


 11%|█         | 1345/12500 [2:20:29<20:48:13,  6.71s/it]

{'loss': 0.7191, 'grad_norm': 0.2748318910598755, 'learning_rate': 0.0001785514205682273, 'epoch': 0.11}


 11%|█         | 1346/12500 [2:20:35<20:15:18,  6.54s/it]

{'loss': 0.8063, 'grad_norm': 0.2875215411186218, 'learning_rate': 0.0001785354141656663, 'epoch': 0.11}


 11%|█         | 1347/12500 [2:20:42<21:05:40,  6.81s/it]

{'loss': 0.8226, 'grad_norm': 0.2518630921840668, 'learning_rate': 0.00017851940776310524, 'epoch': 0.11}


 11%|█         | 1348/12500 [2:20:50<22:05:23,  7.13s/it]

{'loss': 0.8683, 'grad_norm': 0.20790547132492065, 'learning_rate': 0.00017850340136054421, 'epoch': 0.11}


 11%|█         | 1349/12500 [2:20:58<23:19:33,  7.53s/it]

{'loss': 0.7848, 'grad_norm': 0.25972434878349304, 'learning_rate': 0.00017848739495798322, 'epoch': 0.11}


 11%|█         | 1350/12500 [2:21:03<20:22:00,  6.58s/it]

{'loss': 0.6767, 'grad_norm': 0.2684173882007599, 'learning_rate': 0.0001784713885554222, 'epoch': 0.11}


 11%|█         | 1351/12500 [2:21:10<21:13:42,  6.85s/it]

{'loss': 0.7793, 'grad_norm': 0.2376413494348526, 'learning_rate': 0.00017845538215286114, 'epoch': 0.11}


 11%|█         | 1352/12500 [2:21:16<20:26:18,  6.60s/it]

{'loss': 0.7512, 'grad_norm': 0.2612375319004059, 'learning_rate': 0.00017843937575030011, 'epoch': 0.11}


 11%|█         | 1353/12500 [2:21:23<20:19:47,  6.57s/it]

{'loss': 0.7028, 'grad_norm': 0.26282986998558044, 'learning_rate': 0.00017842336934773912, 'epoch': 0.11}


 11%|█         | 1354/12500 [2:21:27<18:11:19,  5.87s/it]

{'loss': 1.1469, 'grad_norm': 0.36693820357322693, 'learning_rate': 0.0001784073629451781, 'epoch': 0.11}


 11%|█         | 1355/12500 [2:21:35<20:27:33,  6.61s/it]

{'loss': 0.8325, 'grad_norm': 0.20140773057937622, 'learning_rate': 0.00017839135654261704, 'epoch': 0.11}


 11%|█         | 1356/12500 [2:21:40<18:43:13,  6.05s/it]

{'loss': 0.5751, 'grad_norm': 0.26537057757377625, 'learning_rate': 0.00017837535014005604, 'epoch': 0.11}


 11%|█         | 1357/12500 [2:21:46<18:41:44,  6.04s/it]

{'loss': 0.619, 'grad_norm': 0.24659396708011627, 'learning_rate': 0.00017835934373749502, 'epoch': 0.11}


 11%|█         | 1358/12500 [2:21:52<18:46:47,  6.07s/it]

{'loss': 0.8679, 'grad_norm': 0.3477030098438263, 'learning_rate': 0.000178343337334934, 'epoch': 0.11}


 11%|█         | 1359/12500 [2:22:00<20:03:28,  6.48s/it]

{'loss': 0.4067, 'grad_norm': 0.1976245790719986, 'learning_rate': 0.00017832733093237294, 'epoch': 0.11}


 11%|█         | 1360/12500 [2:22:05<18:44:00,  6.05s/it]

{'loss': 0.7366, 'grad_norm': 0.2865840792655945, 'learning_rate': 0.00017831132452981194, 'epoch': 0.11}


 11%|█         | 1361/12500 [2:22:10<17:42:44,  5.72s/it]

{'loss': 0.6455, 'grad_norm': 0.27871939539909363, 'learning_rate': 0.00017829531812725091, 'epoch': 0.11}


 11%|█         | 1362/12500 [2:22:15<17:43:34,  5.73s/it]

{'loss': 0.7069, 'grad_norm': 0.2939906716346741, 'learning_rate': 0.0001782793117246899, 'epoch': 0.11}


 11%|█         | 1363/12500 [2:22:22<18:16:45,  5.91s/it]

{'loss': 0.6229, 'grad_norm': 0.24768276512622833, 'learning_rate': 0.00017826330532212886, 'epoch': 0.11}


 11%|█         | 1364/12500 [2:22:27<17:57:52,  5.81s/it]

{'loss': 0.5284, 'grad_norm': 0.23491303622722626, 'learning_rate': 0.00017824729891956784, 'epoch': 0.11}


 11%|█         | 1365/12500 [2:22:34<18:33:22,  6.00s/it]

{'loss': 0.5662, 'grad_norm': 0.21960324048995972, 'learning_rate': 0.00017823129251700681, 'epoch': 0.11}


 11%|█         | 1366/12500 [2:22:40<18:51:54,  6.10s/it]

{'loss': 0.5749, 'grad_norm': 0.21989378333091736, 'learning_rate': 0.0001782152861144458, 'epoch': 0.11}


 11%|█         | 1367/12500 [2:22:47<19:55:00,  6.44s/it]

{'loss': 0.5144, 'grad_norm': 0.21251261234283447, 'learning_rate': 0.00017819927971188476, 'epoch': 0.11}


 11%|█         | 1368/12500 [2:22:54<20:15:20,  6.55s/it]

{'loss': 0.7934, 'grad_norm': 0.2268402874469757, 'learning_rate': 0.00017818327330932374, 'epoch': 0.11}


 11%|█         | 1369/12500 [2:23:00<19:22:10,  6.26s/it]

{'loss': 0.7103, 'grad_norm': 0.2480717897415161, 'learning_rate': 0.0001781672669067627, 'epoch': 0.11}


 11%|█         | 1370/12500 [2:23:07<19:50:03,  6.42s/it]

{'loss': 0.8145, 'grad_norm': 0.24361710250377655, 'learning_rate': 0.0001781512605042017, 'epoch': 0.11}


 11%|█         | 1371/12500 [2:23:15<21:17:03,  6.88s/it]

{'loss': 0.5284, 'grad_norm': 0.2186768651008606, 'learning_rate': 0.00017813525410164066, 'epoch': 0.11}


 11%|█         | 1372/12500 [2:23:18<18:20:22,  5.93s/it]

{'loss': 0.6313, 'grad_norm': 0.3099501430988312, 'learning_rate': 0.00017811924769907964, 'epoch': 0.11}


 11%|█         | 1373/12500 [2:23:24<18:01:55,  5.83s/it]

{'loss': 0.5373, 'grad_norm': 0.28535544872283936, 'learning_rate': 0.0001781032412965186, 'epoch': 0.11}


 11%|█         | 1374/12500 [2:23:28<16:33:00,  5.36s/it]

{'loss': 0.5617, 'grad_norm': 0.29716289043426514, 'learning_rate': 0.0001780872348939576, 'epoch': 0.11}


 11%|█         | 1375/12500 [2:23:35<17:51:50,  5.78s/it]

{'loss': 0.7075, 'grad_norm': 0.2537769377231598, 'learning_rate': 0.00017807122849139656, 'epoch': 0.11}


 11%|█         | 1376/12500 [2:23:42<19:24:06,  6.28s/it]

{'loss': 0.4252, 'grad_norm': 0.20443695783615112, 'learning_rate': 0.00017805522208883554, 'epoch': 0.11}


 11%|█         | 1377/12500 [2:23:51<21:55:42,  7.10s/it]

{'loss': 0.4214, 'grad_norm': 0.17065417766571045, 'learning_rate': 0.0001780392156862745, 'epoch': 0.11}


 11%|█         | 1378/12500 [2:23:58<21:17:20,  6.89s/it]

{'loss': 0.448, 'grad_norm': 0.23144540190696716, 'learning_rate': 0.0001780232092837135, 'epoch': 0.11}


 11%|█         | 1379/12500 [2:24:03<19:52:50,  6.44s/it]

{'loss': 0.6394, 'grad_norm': 0.24738389253616333, 'learning_rate': 0.00017800720288115246, 'epoch': 0.11}


 11%|█         | 1380/12500 [2:24:11<21:15:12,  6.88s/it]

{'loss': 0.8802, 'grad_norm': 0.23477217555046082, 'learning_rate': 0.00017799119647859144, 'epoch': 0.11}


 11%|█         | 1381/12500 [2:24:19<22:33:19,  7.30s/it]

{'loss': 0.7521, 'grad_norm': 0.2115754932165146, 'learning_rate': 0.00017797519007603044, 'epoch': 0.11}


 11%|█         | 1382/12500 [2:24:29<25:12:56,  8.16s/it]

{'loss': 0.8179, 'grad_norm': 0.19108395278453827, 'learning_rate': 0.0001779591836734694, 'epoch': 0.11}


 11%|█         | 1383/12500 [2:24:37<24:16:27,  7.86s/it]

{'loss': 0.5221, 'grad_norm': 0.2141430824995041, 'learning_rate': 0.00017794317727090836, 'epoch': 0.11}


 11%|█         | 1384/12500 [2:24:45<24:32:39,  7.95s/it]

{'loss': 0.7552, 'grad_norm': 0.20358885824680328, 'learning_rate': 0.00017792717086834734, 'epoch': 0.11}


 11%|█         | 1385/12500 [2:24:49<21:23:47,  6.93s/it]

{'loss': 0.8069, 'grad_norm': 0.29348599910736084, 'learning_rate': 0.00017791116446578634, 'epoch': 0.11}


 11%|█         | 1386/12500 [2:24:54<19:41:45,  6.38s/it]

{'loss': 0.8826, 'grad_norm': 0.2575520873069763, 'learning_rate': 0.00017789515806322529, 'epoch': 0.11}


 11%|█         | 1387/12500 [2:25:01<19:48:49,  6.42s/it]

{'loss': 0.8282, 'grad_norm': 0.2621321380138397, 'learning_rate': 0.00017787915166066426, 'epoch': 0.11}


 11%|█         | 1388/12500 [2:25:08<20:18:08,  6.58s/it]

{'loss': 0.9728, 'grad_norm': 0.2844224274158478, 'learning_rate': 0.00017786314525810326, 'epoch': 0.11}


 11%|█         | 1389/12500 [2:25:12<17:45:56,  5.76s/it]

{'loss': 0.7655, 'grad_norm': 0.3415115773677826, 'learning_rate': 0.00017784713885554224, 'epoch': 0.11}


 11%|█         | 1390/12500 [2:25:18<17:57:16,  5.82s/it]

{'loss': 0.8598, 'grad_norm': 0.2614089250564575, 'learning_rate': 0.00017783113245298119, 'epoch': 0.11}


 11%|█         | 1391/12500 [2:25:25<19:13:08,  6.23s/it]

{'loss': 0.6496, 'grad_norm': 0.24404114484786987, 'learning_rate': 0.00017781512605042016, 'epoch': 0.11}


 11%|█         | 1392/12500 [2:25:35<22:32:55,  7.31s/it]

{'loss': 0.317, 'grad_norm': 0.1538952738046646, 'learning_rate': 0.00017779911964785916, 'epoch': 0.11}


 11%|█         | 1393/12500 [2:25:39<20:08:23,  6.53s/it]

{'loss': 0.71, 'grad_norm': 0.28595951199531555, 'learning_rate': 0.00017778311324529814, 'epoch': 0.11}


 11%|█         | 1394/12500 [2:25:45<19:36:10,  6.35s/it]

{'loss': 1.1691, 'grad_norm': 0.2788141667842865, 'learning_rate': 0.00017776710684273708, 'epoch': 0.11}


 11%|█         | 1395/12500 [2:25:52<19:50:05,  6.43s/it]

{'loss': 0.5802, 'grad_norm': 0.24448563158512115, 'learning_rate': 0.0001777511004401761, 'epoch': 0.11}


 11%|█         | 1396/12500 [2:25:57<18:27:22,  5.98s/it]

{'loss': 0.6875, 'grad_norm': 0.24505293369293213, 'learning_rate': 0.00017773509403761506, 'epoch': 0.11}


 11%|█         | 1397/12500 [2:26:02<17:57:13,  5.82s/it]

{'loss': 0.5857, 'grad_norm': 0.2494974583387375, 'learning_rate': 0.00017771908763505404, 'epoch': 0.11}


 11%|█         | 1398/12500 [2:26:10<19:42:03,  6.39s/it]

{'loss': 0.6208, 'grad_norm': 0.24599969387054443, 'learning_rate': 0.00017770308123249298, 'epoch': 0.11}


 11%|█         | 1399/12500 [2:26:16<19:30:10,  6.32s/it]

{'loss': 0.4222, 'grad_norm': 0.23865725100040436, 'learning_rate': 0.00017768707482993199, 'epoch': 0.11}


 11%|█         | 1400/12500 [2:26:23<19:45:16,  6.41s/it]

{'loss': 0.849, 'grad_norm': 0.35224711894989014, 'learning_rate': 0.00017767106842737096, 'epoch': 0.11}


 11%|█         | 1401/12500 [2:26:30<20:27:56,  6.64s/it]

{'loss': 0.756, 'grad_norm': 0.25297465920448303, 'learning_rate': 0.00017765506202480994, 'epoch': 0.11}


 11%|█         | 1402/12500 [2:26:37<20:44:18,  6.73s/it]

{'loss': 1.0574, 'grad_norm': 0.2819586396217346, 'learning_rate': 0.0001776390556222489, 'epoch': 0.11}


 11%|█         | 1403/12500 [2:26:44<21:10:15,  6.87s/it]

{'loss': 0.398, 'grad_norm': 0.19518616795539856, 'learning_rate': 0.00017762304921968789, 'epoch': 0.11}


 11%|█         | 1404/12500 [2:26:50<19:51:24,  6.44s/it]

{'loss': 0.742, 'grad_norm': 0.2815180718898773, 'learning_rate': 0.00017760704281712686, 'epoch': 0.11}


 11%|█         | 1405/12500 [2:26:55<19:17:12,  6.26s/it]

{'loss': 0.7478, 'grad_norm': 0.264996737241745, 'learning_rate': 0.00017759103641456584, 'epoch': 0.11}


 11%|█         | 1406/12500 [2:27:01<18:34:28,  6.03s/it]

{'loss': 0.6318, 'grad_norm': 0.2567717134952545, 'learning_rate': 0.0001775750300120048, 'epoch': 0.11}


 11%|█▏        | 1407/12500 [2:27:08<19:15:35,  6.25s/it]

{'loss': 0.516, 'grad_norm': 0.1974351555109024, 'learning_rate': 0.00017755902360944379, 'epoch': 0.11}


 11%|█▏        | 1408/12500 [2:27:12<17:17:22,  5.61s/it]

{'loss': 0.8246, 'grad_norm': 0.29240429401397705, 'learning_rate': 0.00017754301720688276, 'epoch': 0.11}


 11%|█▏        | 1409/12500 [2:27:16<16:02:11,  5.21s/it]

{'loss': 0.6415, 'grad_norm': 0.30753880739212036, 'learning_rate': 0.00017752701080432173, 'epoch': 0.11}


 11%|█▏        | 1410/12500 [2:27:21<15:26:46,  5.01s/it]

{'loss': 0.7512, 'grad_norm': 0.3095375895500183, 'learning_rate': 0.0001775110044017607, 'epoch': 0.11}


 11%|█▏        | 1411/12500 [2:27:31<20:16:09,  6.58s/it]

{'loss': 0.4451, 'grad_norm': 0.18684008717536926, 'learning_rate': 0.00017749499799919968, 'epoch': 0.11}


 11%|█▏        | 1412/12500 [2:27:39<21:44:14,  7.06s/it]

{'loss': 1.0508, 'grad_norm': 0.2500684857368469, 'learning_rate': 0.00017747899159663866, 'epoch': 0.11}


 11%|█▏        | 1413/12500 [2:27:44<20:04:31,  6.52s/it]

{'loss': 0.6487, 'grad_norm': 0.2882760167121887, 'learning_rate': 0.00017746298519407763, 'epoch': 0.11}


 11%|█▏        | 1414/12500 [2:27:50<19:42:51,  6.40s/it]

{'loss': 0.6414, 'grad_norm': 0.26125475764274597, 'learning_rate': 0.0001774469787915166, 'epoch': 0.11}


 11%|█▏        | 1415/12500 [2:27:55<18:09:28,  5.90s/it]

{'loss': 0.6152, 'grad_norm': 0.2696312963962555, 'learning_rate': 0.00017743097238895558, 'epoch': 0.11}


 11%|█▏        | 1416/12500 [2:28:00<17:37:13,  5.72s/it]

{'loss': 0.697, 'grad_norm': 0.26382240653038025, 'learning_rate': 0.00017741496598639459, 'epoch': 0.11}


 11%|█▏        | 1417/12500 [2:28:06<17:01:12,  5.53s/it]

{'loss': 0.8479, 'grad_norm': 0.3405691981315613, 'learning_rate': 0.00017739895958383353, 'epoch': 0.11}


 11%|█▏        | 1418/12500 [2:28:13<18:31:15,  6.02s/it]

{'loss': 0.7651, 'grad_norm': 0.22149522602558136, 'learning_rate': 0.0001773829531812725, 'epoch': 0.11}


 11%|█▏        | 1419/12500 [2:28:17<17:07:32,  5.56s/it]

{'loss': 0.8505, 'grad_norm': 0.3412829637527466, 'learning_rate': 0.00017736694677871148, 'epoch': 0.11}


 11%|█▏        | 1420/12500 [2:28:22<16:35:14,  5.39s/it]

{'loss': 0.5443, 'grad_norm': 0.2542094588279724, 'learning_rate': 0.00017735094037615049, 'epoch': 0.11}


 11%|█▏        | 1421/12500 [2:28:27<16:03:54,  5.22s/it]

{'loss': 0.8697, 'grad_norm': 0.2913866639137268, 'learning_rate': 0.00017733493397358943, 'epoch': 0.11}


 11%|█▏        | 1422/12500 [2:28:35<18:20:21,  5.96s/it]

{'loss': 0.8038, 'grad_norm': 0.2304724007844925, 'learning_rate': 0.0001773189275710284, 'epoch': 0.11}


 11%|█▏        | 1423/12500 [2:28:40<18:06:53,  5.89s/it]

{'loss': 0.5494, 'grad_norm': 0.2364235818386078, 'learning_rate': 0.0001773029211684674, 'epoch': 0.11}


 11%|█▏        | 1424/12500 [2:28:44<16:09:57,  5.25s/it]

{'loss': 0.4691, 'grad_norm': 0.26289522647857666, 'learning_rate': 0.00017728691476590638, 'epoch': 0.11}


 11%|█▏        | 1425/12500 [2:28:50<16:15:28,  5.28s/it]

{'loss': 0.6299, 'grad_norm': 0.253444105386734, 'learning_rate': 0.00017727090836334533, 'epoch': 0.11}


 11%|█▏        | 1426/12500 [2:28:58<19:37:30,  6.38s/it]

{'loss': 0.8753, 'grad_norm': 0.24958030879497528, 'learning_rate': 0.0001772549019607843, 'epoch': 0.11}


 11%|█▏        | 1427/12500 [2:29:04<18:37:16,  6.05s/it]

{'loss': 0.7216, 'grad_norm': 0.29295289516448975, 'learning_rate': 0.0001772388955582233, 'epoch': 0.11}


 11%|█▏        | 1428/12500 [2:29:12<20:23:43,  6.63s/it]

{'loss': 0.621, 'grad_norm': 0.21601280570030212, 'learning_rate': 0.00017722288915566228, 'epoch': 0.11}


 11%|█▏        | 1429/12500 [2:29:17<18:59:38,  6.18s/it]

{'loss': 0.4878, 'grad_norm': 0.2623787522315979, 'learning_rate': 0.00017720688275310123, 'epoch': 0.11}


 11%|█▏        | 1430/12500 [2:29:23<19:05:59,  6.21s/it]

{'loss': 0.6088, 'grad_norm': 0.2816515266895294, 'learning_rate': 0.0001771908763505402, 'epoch': 0.11}


 11%|█▏        | 1431/12500 [2:29:27<17:16:59,  5.62s/it]

{'loss': 0.8344, 'grad_norm': 0.3161841928958893, 'learning_rate': 0.0001771748699479792, 'epoch': 0.11}


 11%|█▏        | 1432/12500 [2:29:34<18:05:00,  5.88s/it]

{'loss': 0.5232, 'grad_norm': 0.20544934272766113, 'learning_rate': 0.00017715886354541818, 'epoch': 0.11}


 11%|█▏        | 1433/12500 [2:29:41<18:48:03,  6.12s/it]

{'loss': 0.8559, 'grad_norm': 0.26434776186943054, 'learning_rate': 0.00017714285714285713, 'epoch': 0.11}


 11%|█▏        | 1434/12500 [2:29:49<20:47:10,  6.76s/it]

{'loss': 0.9997, 'grad_norm': 0.2386554330587387, 'learning_rate': 0.00017712685074029613, 'epoch': 0.11}


 11%|█▏        | 1435/12500 [2:29:53<18:09:24,  5.91s/it]

{'loss': 0.6315, 'grad_norm': 0.26490193605422974, 'learning_rate': 0.0001771108443377351, 'epoch': 0.11}


 11%|█▏        | 1436/12500 [2:30:01<20:12:20,  6.57s/it]

{'loss': 0.8552, 'grad_norm': 0.23944608867168427, 'learning_rate': 0.00017709483793517408, 'epoch': 0.11}


 11%|█▏        | 1437/12500 [2:30:08<20:36:07,  6.70s/it]

{'loss': 0.6132, 'grad_norm': 0.25781235098838806, 'learning_rate': 0.00017707883153261303, 'epoch': 0.11}


 12%|█▏        | 1438/12500 [2:30:13<18:46:26,  6.11s/it]

{'loss': 0.6541, 'grad_norm': 0.29004955291748047, 'learning_rate': 0.00017706282513005203, 'epoch': 0.12}


 12%|█▏        | 1439/12500 [2:30:17<17:22:23,  5.65s/it]

{'loss': 0.5872, 'grad_norm': 0.3252003788948059, 'learning_rate': 0.000177046818727491, 'epoch': 0.12}


 12%|█▏        | 1440/12500 [2:30:24<18:31:44,  6.03s/it]

{'loss': 0.741, 'grad_norm': 0.23460452258586884, 'learning_rate': 0.00017703081232492998, 'epoch': 0.12}


 12%|█▏        | 1441/12500 [2:30:29<17:36:36,  5.73s/it]

{'loss': 0.7341, 'grad_norm': 0.23906350135803223, 'learning_rate': 0.00017701480592236896, 'epoch': 0.12}


 12%|█▏        | 1442/12500 [2:30:34<17:08:49,  5.58s/it]

{'loss': 0.621, 'grad_norm': 0.2882087528705597, 'learning_rate': 0.00017699879951980793, 'epoch': 0.12}


 12%|█▏        | 1443/12500 [2:30:43<19:31:23,  6.36s/it]

{'loss': 0.7034, 'grad_norm': 0.29973578453063965, 'learning_rate': 0.0001769827931172469, 'epoch': 0.12}


 12%|█▏        | 1444/12500 [2:30:49<19:57:22,  6.50s/it]

{'loss': 0.7733, 'grad_norm': 0.24095970392227173, 'learning_rate': 0.00017696678671468588, 'epoch': 0.12}


 12%|█▏        | 1445/12500 [2:30:56<19:47:21,  6.44s/it]

{'loss': 0.8504, 'grad_norm': 0.24608169496059418, 'learning_rate': 0.00017695078031212486, 'epoch': 0.12}


 12%|█▏        | 1446/12500 [2:31:01<19:10:14,  6.24s/it]

{'loss': 0.6384, 'grad_norm': 0.2738606035709381, 'learning_rate': 0.00017693477390956383, 'epoch': 0.12}


 12%|█▏        | 1447/12500 [2:31:07<18:56:48,  6.17s/it]

{'loss': 0.4413, 'grad_norm': 0.25078874826431274, 'learning_rate': 0.0001769187675070028, 'epoch': 0.12}


 12%|█▏        | 1448/12500 [2:31:16<20:48:06,  6.78s/it]

{'loss': 0.9097, 'grad_norm': 0.2355733960866928, 'learning_rate': 0.00017690276110444178, 'epoch': 0.12}


 12%|█▏        | 1449/12500 [2:31:22<20:20:50,  6.63s/it]

{'loss': 0.8032, 'grad_norm': 0.2370987981557846, 'learning_rate': 0.00017688675470188076, 'epoch': 0.12}


 12%|█▏        | 1450/12500 [2:31:30<21:24:54,  6.98s/it]

{'loss': 0.5119, 'grad_norm': 0.2373063862323761, 'learning_rate': 0.00017687074829931973, 'epoch': 0.12}


 12%|█▏        | 1451/12500 [2:31:37<22:00:02,  7.17s/it]

{'loss': 0.3925, 'grad_norm': 0.22098156809806824, 'learning_rate': 0.0001768547418967587, 'epoch': 0.12}


 12%|█▏        | 1452/12500 [2:31:43<20:27:14,  6.66s/it]

{'loss': 0.6786, 'grad_norm': 0.25444915890693665, 'learning_rate': 0.00017683873549419768, 'epoch': 0.12}


 12%|█▏        | 1453/12500 [2:31:51<21:41:16,  7.07s/it]

{'loss': 0.6099, 'grad_norm': 0.25294914841651917, 'learning_rate': 0.00017682272909163666, 'epoch': 0.12}


 12%|█▏        | 1454/12500 [2:31:58<22:07:05,  7.21s/it]

{'loss': 0.6067, 'grad_norm': 0.23825334012508392, 'learning_rate': 0.00017680672268907563, 'epoch': 0.12}


 12%|█▏        | 1455/12500 [2:32:02<19:13:56,  6.27s/it]

{'loss': 0.7828, 'grad_norm': 0.32125067710876465, 'learning_rate': 0.00017679071628651463, 'epoch': 0.12}


 12%|█▏        | 1456/12500 [2:32:08<18:16:09,  5.96s/it]

{'loss': 0.6803, 'grad_norm': 0.25559577345848083, 'learning_rate': 0.00017677470988395358, 'epoch': 0.12}


 12%|█▏        | 1457/12500 [2:32:13<17:26:13,  5.68s/it]

{'loss': 0.8077, 'grad_norm': 0.29928597807884216, 'learning_rate': 0.00017675870348139255, 'epoch': 0.12}


 12%|█▏        | 1458/12500 [2:32:20<18:54:07,  6.16s/it]

{'loss': 0.5648, 'grad_norm': 0.22053977847099304, 'learning_rate': 0.00017674269707883153, 'epoch': 0.12}


 12%|█▏        | 1459/12500 [2:32:25<17:39:10,  5.76s/it]

{'loss': 0.5307, 'grad_norm': 0.26109930872917175, 'learning_rate': 0.00017672669067627053, 'epoch': 0.12}


 12%|█▏        | 1460/12500 [2:32:29<16:14:25,  5.30s/it]

{'loss': 0.7166, 'grad_norm': 0.26968684792518616, 'learning_rate': 0.00017671068427370948, 'epoch': 0.12}


 12%|█▏        | 1461/12500 [2:32:35<16:57:31,  5.53s/it]

{'loss': 0.7685, 'grad_norm': 0.25250720977783203, 'learning_rate': 0.00017669467787114845, 'epoch': 0.12}


 12%|█▏        | 1462/12500 [2:32:42<18:32:10,  6.05s/it]

{'loss': 0.7814, 'grad_norm': 0.2920992970466614, 'learning_rate': 0.00017667867146858746, 'epoch': 0.12}


 12%|█▏        | 1463/12500 [2:32:46<16:26:30,  5.36s/it]

{'loss': 0.3933, 'grad_norm': 0.22961702942848206, 'learning_rate': 0.00017666266506602643, 'epoch': 0.12}


 12%|█▏        | 1464/12500 [2:32:53<17:48:54,  5.81s/it]

{'loss': 0.555, 'grad_norm': 0.23249582946300507, 'learning_rate': 0.00017664665866346538, 'epoch': 0.12}


 12%|█▏        | 1465/12500 [2:32:59<17:54:35,  5.84s/it]

{'loss': 0.8392, 'grad_norm': 0.2528837323188782, 'learning_rate': 0.00017663065226090435, 'epoch': 0.12}


 12%|█▏        | 1466/12500 [2:33:02<15:49:43,  5.16s/it]

{'loss': 0.6232, 'grad_norm': 0.30940601229667664, 'learning_rate': 0.00017661464585834336, 'epoch': 0.12}


 12%|█▏        | 1467/12500 [2:33:08<15:41:23,  5.12s/it]

{'loss': 0.5391, 'grad_norm': 0.22152459621429443, 'learning_rate': 0.00017659863945578233, 'epoch': 0.12}


 12%|█▏        | 1468/12500 [2:33:14<16:38:55,  5.43s/it]

{'loss': 0.5715, 'grad_norm': 0.2942344546318054, 'learning_rate': 0.00017658263305322128, 'epoch': 0.12}


 12%|█▏        | 1469/12500 [2:33:18<15:45:53,  5.14s/it]

{'loss': 0.8271, 'grad_norm': 0.31977149844169617, 'learning_rate': 0.00017656662665066028, 'epoch': 0.12}


 12%|█▏        | 1470/12500 [2:33:26<18:40:48,  6.10s/it]

{'loss': 0.9392, 'grad_norm': 0.22376865148544312, 'learning_rate': 0.00017655062024809926, 'epoch': 0.12}


 12%|█▏        | 1471/12500 [2:33:32<18:06:32,  5.91s/it]

{'loss': 0.5985, 'grad_norm': 0.2516617774963379, 'learning_rate': 0.00017653461384553823, 'epoch': 0.12}


 12%|█▏        | 1472/12500 [2:33:40<20:02:48,  6.54s/it]

{'loss': 0.6667, 'grad_norm': 0.27986228466033936, 'learning_rate': 0.00017651860744297718, 'epoch': 0.12}


 12%|█▏        | 1473/12500 [2:33:44<17:59:05,  5.87s/it]

{'loss': 0.9396, 'grad_norm': 0.30058756470680237, 'learning_rate': 0.00017650260104041618, 'epoch': 0.12}


 12%|█▏        | 1474/12500 [2:33:51<18:56:58,  6.19s/it]

{'loss': 0.808, 'grad_norm': 0.20194408297538757, 'learning_rate': 0.00017648659463785515, 'epoch': 0.12}


 12%|█▏        | 1475/12500 [2:34:01<21:49:29,  7.13s/it]

{'loss': 0.4479, 'grad_norm': 0.19477857649326324, 'learning_rate': 0.00017647058823529413, 'epoch': 0.12}


 12%|█▏        | 1476/12500 [2:34:08<22:33:21,  7.37s/it]

{'loss': 0.7511, 'grad_norm': 0.20351019501686096, 'learning_rate': 0.0001764545818327331, 'epoch': 0.12}


 12%|█▏        | 1477/12500 [2:34:13<20:09:44,  6.58s/it]

{'loss': 0.6789, 'grad_norm': 0.25772884488105774, 'learning_rate': 0.00017643857543017208, 'epoch': 0.12}


 12%|█▏        | 1478/12500 [2:34:17<17:58:11,  5.87s/it]

{'loss': 0.6447, 'grad_norm': 0.263401597738266, 'learning_rate': 0.00017642256902761105, 'epoch': 0.12}


 12%|█▏        | 1479/12500 [2:34:26<20:17:51,  6.63s/it]

{'loss': 0.9181, 'grad_norm': 0.1971852332353592, 'learning_rate': 0.00017640656262505003, 'epoch': 0.12}


 12%|█▏        | 1480/12500 [2:34:30<18:12:08,  5.95s/it]

{'loss': 0.7159, 'grad_norm': 0.2937447130680084, 'learning_rate': 0.000176390556222489, 'epoch': 0.12}


 12%|█▏        | 1481/12500 [2:34:36<18:26:49,  6.03s/it]

{'loss': 0.6305, 'grad_norm': 0.26545169949531555, 'learning_rate': 0.00017637454981992798, 'epoch': 0.12}


 12%|█▏        | 1482/12500 [2:34:44<20:13:08,  6.61s/it]

{'loss': 0.6165, 'grad_norm': 0.267671674489975, 'learning_rate': 0.00017635854341736695, 'epoch': 0.12}


 12%|█▏        | 1483/12500 [2:34:49<18:33:21,  6.06s/it]

{'loss': 0.9141, 'grad_norm': 0.2719261348247528, 'learning_rate': 0.00017634253701480596, 'epoch': 0.12}


 12%|█▏        | 1484/12500 [2:34:53<16:22:51,  5.35s/it]

{'loss': 0.6158, 'grad_norm': 0.32877930998802185, 'learning_rate': 0.0001763265306122449, 'epoch': 0.12}


 12%|█▏        | 1485/12500 [2:34:58<16:09:44,  5.28s/it]

{'loss': 0.8758, 'grad_norm': 0.2863813042640686, 'learning_rate': 0.00017631052420968388, 'epoch': 0.12}


 12%|█▏        | 1486/12500 [2:35:03<16:19:19,  5.34s/it]

{'loss': 0.8083, 'grad_norm': 0.3232051134109497, 'learning_rate': 0.00017629451780712285, 'epoch': 0.12}


 12%|█▏        | 1487/12500 [2:35:10<17:23:57,  5.69s/it]

{'loss': 0.5476, 'grad_norm': 0.26300573348999023, 'learning_rate': 0.00017627851140456185, 'epoch': 0.12}


 12%|█▏        | 1488/12500 [2:35:14<16:18:41,  5.33s/it]

{'loss': 0.6326, 'grad_norm': 0.31217116117477417, 'learning_rate': 0.0001762625050020008, 'epoch': 0.12}


 12%|█▏        | 1489/12500 [2:35:20<16:21:33,  5.35s/it]

{'loss': 0.6634, 'grad_norm': 0.27027764916419983, 'learning_rate': 0.00017624649859943978, 'epoch': 0.12}


 12%|█▏        | 1490/12500 [2:35:24<15:30:56,  5.07s/it]

{'loss': 0.7867, 'grad_norm': 0.3322189152240753, 'learning_rate': 0.00017623049219687875, 'epoch': 0.12}


 12%|█▏        | 1491/12500 [2:35:33<18:31:58,  6.06s/it]

{'loss': 0.4487, 'grad_norm': 0.1998782902956009, 'learning_rate': 0.00017621448579431775, 'epoch': 0.12}


 12%|█▏        | 1492/12500 [2:35:40<19:56:30,  6.52s/it]

{'loss': 0.6552, 'grad_norm': 0.25524768233299255, 'learning_rate': 0.0001761984793917567, 'epoch': 0.12}


 12%|█▏        | 1493/12500 [2:35:44<17:53:24,  5.85s/it]

{'loss': 0.5521, 'grad_norm': 0.3133378028869629, 'learning_rate': 0.00017618247298919568, 'epoch': 0.12}


 12%|█▏        | 1494/12500 [2:35:51<18:17:46,  5.98s/it]

{'loss': 0.694, 'grad_norm': 0.25029221177101135, 'learning_rate': 0.00017616646658663468, 'epoch': 0.12}


 12%|█▏        | 1495/12500 [2:35:55<16:18:28,  5.33s/it]

{'loss': 0.5526, 'grad_norm': 0.32294219732284546, 'learning_rate': 0.00017615046018407365, 'epoch': 0.12}


 12%|█▏        | 1496/12500 [2:36:01<17:02:32,  5.58s/it]

{'loss': 0.7658, 'grad_norm': 0.26469385623931885, 'learning_rate': 0.0001761344537815126, 'epoch': 0.12}


 12%|█▏        | 1497/12500 [2:36:05<16:14:47,  5.32s/it]

{'loss': 0.6783, 'grad_norm': 0.34356603026390076, 'learning_rate': 0.00017611844737895158, 'epoch': 0.12}


 12%|█▏        | 1498/12500 [2:36:13<18:24:08,  6.02s/it]

{'loss': 0.8236, 'grad_norm': 0.24068698287010193, 'learning_rate': 0.00017610244097639058, 'epoch': 0.12}


 12%|█▏        | 1499/12500 [2:36:19<18:05:13,  5.92s/it]

{'loss': 0.6945, 'grad_norm': 0.2917470932006836, 'learning_rate': 0.00017608643457382955, 'epoch': 0.12}


 12%|█▏        | 1500/12500 [2:36:24<17:36:54,  5.76s/it]

{'loss': 0.5554, 'grad_norm': 0.2422972172498703, 'learning_rate': 0.0001760704281712685, 'epoch': 0.12}


 12%|█▏        | 1501/12500 [2:36:30<17:35:32,  5.76s/it]

{'loss': 0.6651, 'grad_norm': 0.3517918586730957, 'learning_rate': 0.0001760544217687075, 'epoch': 0.12}


 12%|█▏        | 1502/12500 [2:36:33<15:19:49,  5.02s/it]

{'loss': 0.6888, 'grad_norm': 0.3320837914943695, 'learning_rate': 0.00017603841536614648, 'epoch': 0.12}


 12%|█▏        | 1503/12500 [2:36:39<16:09:05,  5.29s/it]

{'loss': 0.7287, 'grad_norm': 0.3414658308029175, 'learning_rate': 0.00017602240896358545, 'epoch': 0.12}


 12%|█▏        | 1504/12500 [2:36:45<16:13:49,  5.31s/it]

{'loss': 0.5533, 'grad_norm': 0.254504919052124, 'learning_rate': 0.0001760064025610244, 'epoch': 0.12}


 12%|█▏        | 1505/12500 [2:36:51<17:24:44,  5.70s/it]

{'loss': 0.8266, 'grad_norm': 0.32451102137565613, 'learning_rate': 0.0001759903961584634, 'epoch': 0.12}


 12%|█▏        | 1506/12500 [2:36:59<19:28:16,  6.38s/it]

{'loss': 0.4479, 'grad_norm': 0.18488968908786774, 'learning_rate': 0.00017597438975590238, 'epoch': 0.12}


 12%|█▏        | 1507/12500 [2:37:06<20:16:11,  6.64s/it]

{'loss': 0.8913, 'grad_norm': 0.25096651911735535, 'learning_rate': 0.00017595838335334135, 'epoch': 0.12}


 12%|█▏        | 1508/12500 [2:37:13<20:23:02,  6.68s/it]

{'loss': 0.9874, 'grad_norm': 0.24854516983032227, 'learning_rate': 0.00017594237695078033, 'epoch': 0.12}


 12%|█▏        | 1509/12500 [2:37:17<18:00:11,  5.90s/it]

{'loss': 0.706, 'grad_norm': 0.3499857485294342, 'learning_rate': 0.0001759263705482193, 'epoch': 0.12}


 12%|█▏        | 1510/12500 [2:37:26<20:28:20,  6.71s/it]

{'loss': 0.9607, 'grad_norm': 0.2572469711303711, 'learning_rate': 0.00017591036414565828, 'epoch': 0.12}


 12%|█▏        | 1511/12500 [2:37:29<17:23:24,  5.70s/it]

{'loss': 0.7168, 'grad_norm': 0.3574922978878021, 'learning_rate': 0.00017589435774309725, 'epoch': 0.12}


 12%|█▏        | 1512/12500 [2:37:35<17:28:48,  5.73s/it]

{'loss': 0.6905, 'grad_norm': 0.2617744207382202, 'learning_rate': 0.00017587835134053623, 'epoch': 0.12}


 12%|█▏        | 1513/12500 [2:37:39<16:26:50,  5.39s/it]

{'loss': 0.8994, 'grad_norm': 0.4005734920501709, 'learning_rate': 0.0001758623449379752, 'epoch': 0.12}


 12%|█▏        | 1514/12500 [2:37:45<16:25:08,  5.38s/it]

{'loss': 0.7377, 'grad_norm': 0.28586116433143616, 'learning_rate': 0.00017584633853541418, 'epoch': 0.12}


 12%|█▏        | 1515/12500 [2:37:50<16:30:18,  5.41s/it]

{'loss': 0.6752, 'grad_norm': 0.2963959574699402, 'learning_rate': 0.00017583033213285315, 'epoch': 0.12}


 12%|█▏        | 1516/12500 [2:37:56<16:48:34,  5.51s/it]

{'loss': 0.7883, 'grad_norm': 0.31234461069107056, 'learning_rate': 0.00017581432573029213, 'epoch': 0.12}


 12%|█▏        | 1517/12500 [2:38:01<16:20:54,  5.36s/it]

{'loss': 0.8733, 'grad_norm': 0.33243483304977417, 'learning_rate': 0.0001757983193277311, 'epoch': 0.12}


 12%|█▏        | 1518/12500 [2:38:06<16:19:21,  5.35s/it]

{'loss': 0.7935, 'grad_norm': 0.25660720467567444, 'learning_rate': 0.00017578231292517008, 'epoch': 0.12}


 12%|█▏        | 1519/12500 [2:38:11<15:35:03,  5.11s/it]

{'loss': 0.7588, 'grad_norm': 0.30422407388687134, 'learning_rate': 0.00017576630652260905, 'epoch': 0.12}


 12%|█▏        | 1520/12500 [2:38:19<18:20:08,  6.01s/it]

{'loss': 0.7579, 'grad_norm': 0.20794540643692017, 'learning_rate': 0.00017575030012004802, 'epoch': 0.12}


 12%|█▏        | 1521/12500 [2:38:24<17:25:26,  5.71s/it]

{'loss': 0.6805, 'grad_norm': 0.31163275241851807, 'learning_rate': 0.000175734293717487, 'epoch': 0.12}


 12%|█▏        | 1522/12500 [2:38:30<17:46:17,  5.83s/it]

{'loss': 0.732, 'grad_norm': 0.22989077866077423, 'learning_rate': 0.000175718287314926, 'epoch': 0.12}


 12%|█▏        | 1523/12500 [2:38:36<17:51:56,  5.86s/it]

{'loss': 0.6661, 'grad_norm': 0.2737651765346527, 'learning_rate': 0.00017570228091236495, 'epoch': 0.12}


 12%|█▏        | 1524/12500 [2:38:43<18:21:44,  6.02s/it]

{'loss': 0.5815, 'grad_norm': 0.22596614062786102, 'learning_rate': 0.00017568627450980392, 'epoch': 0.12}


 12%|█▏        | 1525/12500 [2:38:51<20:11:44,  6.62s/it]

{'loss': 0.581, 'grad_norm': 0.21833321452140808, 'learning_rate': 0.0001756702681072429, 'epoch': 0.12}


 12%|█▏        | 1526/12500 [2:38:56<19:30:21,  6.40s/it]

{'loss': 0.663, 'grad_norm': 0.2996778190135956, 'learning_rate': 0.0001756542617046819, 'epoch': 0.12}


 12%|█▏        | 1527/12500 [2:39:01<18:06:52,  5.94s/it]

{'loss': 0.6721, 'grad_norm': 0.2803318202495575, 'learning_rate': 0.00017563825530212085, 'epoch': 0.12}


 12%|█▏        | 1528/12500 [2:39:09<19:29:53,  6.40s/it]

{'loss': 0.5013, 'grad_norm': 0.2272106111049652, 'learning_rate': 0.00017562224889955982, 'epoch': 0.12}


 12%|█▏        | 1529/12500 [2:39:16<20:17:47,  6.66s/it]

{'loss': 0.8891, 'grad_norm': 0.25050362944602966, 'learning_rate': 0.00017560624249699883, 'epoch': 0.12}


 12%|█▏        | 1530/12500 [2:39:23<20:37:49,  6.77s/it]

{'loss': 0.711, 'grad_norm': 0.2163211852312088, 'learning_rate': 0.0001755902360944378, 'epoch': 0.12}


 12%|█▏        | 1531/12500 [2:39:28<18:56:58,  6.22s/it]

{'loss': 0.7591, 'grad_norm': 0.33714184165000916, 'learning_rate': 0.00017557422969187675, 'epoch': 0.12}


 12%|█▏        | 1532/12500 [2:39:34<18:21:50,  6.03s/it]

{'loss': 0.5921, 'grad_norm': 0.24199213087558746, 'learning_rate': 0.00017555822328931572, 'epoch': 0.12}


 12%|█▏        | 1533/12500 [2:39:39<17:48:59,  5.85s/it]

{'loss': 0.8011, 'grad_norm': 0.2868312895298004, 'learning_rate': 0.00017554221688675473, 'epoch': 0.12}


 12%|█▏        | 1534/12500 [2:39:46<18:57:26,  6.22s/it]

{'loss': 0.6186, 'grad_norm': 0.24281665682792664, 'learning_rate': 0.0001755262104841937, 'epoch': 0.12}


 12%|█▏        | 1535/12500 [2:39:53<19:18:16,  6.34s/it]

{'loss': 0.8023, 'grad_norm': 0.25588080286979675, 'learning_rate': 0.00017551020408163265, 'epoch': 0.12}


 12%|█▏        | 1536/12500 [2:40:03<22:36:03,  7.42s/it]

{'loss': 0.6624, 'grad_norm': 0.18205301463603973, 'learning_rate': 0.00017549419767907165, 'epoch': 0.12}


 12%|█▏        | 1537/12500 [2:40:10<22:52:59,  7.51s/it]

{'loss': 1.0655, 'grad_norm': 0.2522830069065094, 'learning_rate': 0.00017547819127651062, 'epoch': 0.12}


 12%|█▏        | 1538/12500 [2:40:15<19:47:14,  6.50s/it]

{'loss': 0.6955, 'grad_norm': 0.27902671694755554, 'learning_rate': 0.0001754621848739496, 'epoch': 0.12}


 12%|█▏        | 1539/12500 [2:40:22<20:38:44,  6.78s/it]

{'loss': 0.824, 'grad_norm': 0.299805223941803, 'learning_rate': 0.00017544617847138855, 'epoch': 0.12}


 12%|█▏        | 1540/12500 [2:40:26<18:35:27,  6.11s/it]

{'loss': 0.9633, 'grad_norm': 0.2997000217437744, 'learning_rate': 0.00017543017206882755, 'epoch': 0.12}


 12%|█▏        | 1541/12500 [2:40:33<18:52:36,  6.20s/it]

{'loss': 0.411, 'grad_norm': 0.18958304822444916, 'learning_rate': 0.00017541416566626652, 'epoch': 0.12}


 12%|█▏        | 1542/12500 [2:40:39<18:59:45,  6.24s/it]

{'loss': 0.9004, 'grad_norm': 0.3180496394634247, 'learning_rate': 0.0001753981592637055, 'epoch': 0.12}


 12%|█▏        | 1543/12500 [2:40:44<17:48:24,  5.85s/it]

{'loss': 0.7364, 'grad_norm': 0.26538529992103577, 'learning_rate': 0.00017538215286114445, 'epoch': 0.12}


 12%|█▏        | 1544/12500 [2:40:50<17:49:20,  5.86s/it]

{'loss': 0.8567, 'grad_norm': 0.270140141248703, 'learning_rate': 0.00017536614645858345, 'epoch': 0.12}


 12%|█▏        | 1545/12500 [2:40:57<18:32:12,  6.09s/it]

{'loss': 0.722, 'grad_norm': 0.23506763577461243, 'learning_rate': 0.00017535014005602242, 'epoch': 0.12}


 12%|█▏        | 1546/12500 [2:41:04<19:33:58,  6.43s/it]

{'loss': 0.7376, 'grad_norm': 0.2230256050825119, 'learning_rate': 0.0001753341336534614, 'epoch': 0.12}


 12%|█▏        | 1547/12500 [2:41:08<17:26:54,  5.73s/it]

{'loss': 0.6676, 'grad_norm': 0.30113688111305237, 'learning_rate': 0.00017531812725090037, 'epoch': 0.12}


 12%|█▏        | 1548/12500 [2:41:14<17:47:40,  5.85s/it]

{'loss': 0.6691, 'grad_norm': 0.27259477972984314, 'learning_rate': 0.00017530212084833935, 'epoch': 0.12}


 12%|█▏        | 1549/12500 [2:41:20<17:48:13,  5.85s/it]

{'loss': 0.8337, 'grad_norm': 0.2601335346698761, 'learning_rate': 0.00017528611444577832, 'epoch': 0.12}


 12%|█▏        | 1550/12500 [2:41:26<18:03:20,  5.94s/it]

{'loss': 0.8584, 'grad_norm': 0.23419059813022614, 'learning_rate': 0.0001752701080432173, 'epoch': 0.12}


 12%|█▏        | 1551/12500 [2:41:34<19:40:05,  6.47s/it]

{'loss': 0.6079, 'grad_norm': 0.2295578569173813, 'learning_rate': 0.00017525410164065627, 'epoch': 0.12}


 12%|█▏        | 1552/12500 [2:41:38<17:25:44,  5.73s/it]

{'loss': 0.6808, 'grad_norm': 0.3049162030220032, 'learning_rate': 0.00017523809523809525, 'epoch': 0.12}


 12%|█▏        | 1553/12500 [2:41:43<16:31:03,  5.43s/it]

{'loss': 0.8366, 'grad_norm': 0.2978515923023224, 'learning_rate': 0.00017522208883553422, 'epoch': 0.12}


 12%|█▏        | 1554/12500 [2:41:48<16:56:38,  5.57s/it]

{'loss': 0.9458, 'grad_norm': 0.32685890793800354, 'learning_rate': 0.0001752060824329732, 'epoch': 0.12}


 12%|█▏        | 1555/12500 [2:41:53<16:07:39,  5.30s/it]

{'loss': 0.7384, 'grad_norm': 0.29764899611473083, 'learning_rate': 0.00017519007603041217, 'epoch': 0.12}


 12%|█▏        | 1556/12500 [2:41:57<15:07:00,  4.97s/it]

{'loss': 0.7587, 'grad_norm': 0.30576208233833313, 'learning_rate': 0.00017517406962785115, 'epoch': 0.12}


 12%|█▏        | 1557/12500 [2:42:03<15:57:14,  5.25s/it]

{'loss': 0.6106, 'grad_norm': 0.24494753777980804, 'learning_rate': 0.00017515806322529012, 'epoch': 0.12}


 12%|█▏        | 1558/12500 [2:42:09<16:05:34,  5.29s/it]

{'loss': 0.6553, 'grad_norm': 0.2858726978302002, 'learning_rate': 0.0001751420568227291, 'epoch': 0.12}


 12%|█▏        | 1559/12500 [2:42:15<16:49:22,  5.54s/it]

{'loss': 0.5862, 'grad_norm': 0.24162672460079193, 'learning_rate': 0.00017512605042016807, 'epoch': 0.12}


 12%|█▏        | 1560/12500 [2:42:20<16:59:28,  5.59s/it]

{'loss': 0.9475, 'grad_norm': 0.3356438875198364, 'learning_rate': 0.00017511004401760705, 'epoch': 0.12}


 12%|█▏        | 1561/12500 [2:42:25<16:23:22,  5.39s/it]

{'loss': 0.7235, 'grad_norm': 0.2842024564743042, 'learning_rate': 0.00017509403761504605, 'epoch': 0.12}


 12%|█▏        | 1562/12500 [2:42:32<17:14:59,  5.68s/it]

{'loss': 0.8911, 'grad_norm': 0.2729833722114563, 'learning_rate': 0.000175078031212485, 'epoch': 0.12}


 13%|█▎        | 1563/12500 [2:42:38<17:22:05,  5.72s/it]

{'loss': 0.7215, 'grad_norm': 0.2733090817928314, 'learning_rate': 0.00017506202480992397, 'epoch': 0.13}


 13%|█▎        | 1564/12500 [2:42:44<17:42:47,  5.83s/it]

{'loss': 0.6634, 'grad_norm': 0.27192455530166626, 'learning_rate': 0.00017504601840736295, 'epoch': 0.13}


 13%|█▎        | 1565/12500 [2:42:51<19:15:00,  6.34s/it]

{'loss': 0.9475, 'grad_norm': 0.22971893846988678, 'learning_rate': 0.00017503001200480195, 'epoch': 0.13}


 13%|█▎        | 1566/12500 [2:42:56<18:12:34,  6.00s/it]

{'loss': 0.6719, 'grad_norm': 0.27980491518974304, 'learning_rate': 0.0001750140056022409, 'epoch': 0.13}


 13%|█▎        | 1567/12500 [2:43:03<18:33:00,  6.11s/it]

{'loss': 0.7905, 'grad_norm': 0.2608434855937958, 'learning_rate': 0.00017499799919967987, 'epoch': 0.13}


 13%|█▎        | 1568/12500 [2:43:10<19:15:11,  6.34s/it]

{'loss': 0.7291, 'grad_norm': 0.2656088173389435, 'learning_rate': 0.00017498199279711887, 'epoch': 0.13}


 13%|█▎        | 1569/12500 [2:43:15<18:05:44,  5.96s/it]

{'loss': 0.5798, 'grad_norm': 0.27054426074028015, 'learning_rate': 0.00017496598639455785, 'epoch': 0.13}


 13%|█▎        | 1570/12500 [2:43:22<19:28:28,  6.41s/it]

{'loss': 0.9513, 'grad_norm': 0.28335708379745483, 'learning_rate': 0.0001749499799919968, 'epoch': 0.13}


 13%|█▎        | 1571/12500 [2:43:30<20:29:03,  6.75s/it]

{'loss': 0.9641, 'grad_norm': 0.26091185212135315, 'learning_rate': 0.00017493397358943577, 'epoch': 0.13}


 13%|█▎        | 1572/12500 [2:43:34<18:36:00,  6.13s/it]

{'loss': 0.6883, 'grad_norm': 0.2638919949531555, 'learning_rate': 0.00017491796718687477, 'epoch': 0.13}


 13%|█▎        | 1573/12500 [2:43:41<19:11:50,  6.32s/it]

{'loss': 0.6596, 'grad_norm': 0.29828718304634094, 'learning_rate': 0.00017490196078431375, 'epoch': 0.13}


 13%|█▎        | 1574/12500 [2:43:48<19:55:12,  6.56s/it]

{'loss': 0.8031, 'grad_norm': 0.22771453857421875, 'learning_rate': 0.0001748859543817527, 'epoch': 0.13}


 13%|█▎        | 1575/12500 [2:43:56<21:03:28,  6.94s/it]

{'loss': 0.6381, 'grad_norm': 0.21357746422290802, 'learning_rate': 0.0001748699479791917, 'epoch': 0.13}


 13%|█▎        | 1576/12500 [2:44:03<20:54:49,  6.89s/it]

{'loss': 0.8236, 'grad_norm': 0.2694028317928314, 'learning_rate': 0.00017485394157663067, 'epoch': 0.13}


 13%|█▎        | 1577/12500 [2:44:06<17:55:11,  5.91s/it]

{'loss': 0.5775, 'grad_norm': 0.274126797914505, 'learning_rate': 0.00017483793517406965, 'epoch': 0.13}


 13%|█▎        | 1578/12500 [2:44:13<18:34:07,  6.12s/it]

{'loss': 0.9362, 'grad_norm': 0.23787568509578705, 'learning_rate': 0.0001748219287715086, 'epoch': 0.13}


 13%|█▎        | 1579/12500 [2:44:20<19:32:07,  6.44s/it]

{'loss': 0.8074, 'grad_norm': 0.2342219054698944, 'learning_rate': 0.0001748059223689476, 'epoch': 0.13}


 13%|█▎        | 1580/12500 [2:44:25<17:47:07,  5.86s/it]

{'loss': 0.8475, 'grad_norm': 0.3482080399990082, 'learning_rate': 0.00017478991596638657, 'epoch': 0.13}


 13%|█▎        | 1581/12500 [2:44:30<17:22:17,  5.73s/it]

{'loss': 0.5386, 'grad_norm': 0.23546554148197174, 'learning_rate': 0.00017477390956382555, 'epoch': 0.13}


 13%|█▎        | 1582/12500 [2:44:37<18:21:17,  6.05s/it]

{'loss': 0.4414, 'grad_norm': 0.20617713034152985, 'learning_rate': 0.00017475790316126452, 'epoch': 0.13}


 13%|█▎        | 1583/12500 [2:44:42<17:37:01,  5.81s/it]

{'loss': 0.4357, 'grad_norm': 0.20782215893268585, 'learning_rate': 0.0001747418967587035, 'epoch': 0.13}


 13%|█▎        | 1584/12500 [2:44:48<17:19:22,  5.71s/it]

{'loss': 0.7298, 'grad_norm': 0.3048950433731079, 'learning_rate': 0.00017472589035614247, 'epoch': 0.13}


 13%|█▎        | 1585/12500 [2:44:52<15:50:21,  5.22s/it]

{'loss': 0.7118, 'grad_norm': 0.28040480613708496, 'learning_rate': 0.00017470988395358144, 'epoch': 0.13}


 13%|█▎        | 1586/12500 [2:44:59<17:26:27,  5.75s/it]

{'loss': 0.7448, 'grad_norm': 0.2716957628726959, 'learning_rate': 0.00017469387755102042, 'epoch': 0.13}


 13%|█▎        | 1587/12500 [2:45:04<16:50:17,  5.55s/it]

{'loss': 0.8063, 'grad_norm': 0.2881118357181549, 'learning_rate': 0.0001746778711484594, 'epoch': 0.13}


 13%|█▎        | 1588/12500 [2:45:13<20:26:13,  6.74s/it]

{'loss': 0.8968, 'grad_norm': 0.20470933616161346, 'learning_rate': 0.00017466186474589837, 'epoch': 0.13}


 13%|█▎        | 1589/12500 [2:45:18<18:20:25,  6.05s/it]

{'loss': 0.7018, 'grad_norm': 0.2762299180030823, 'learning_rate': 0.00017464585834333734, 'epoch': 0.13}


 13%|█▎        | 1590/12500 [2:45:26<20:00:30,  6.60s/it]

{'loss': 1.0672, 'grad_norm': 0.2199973165988922, 'learning_rate': 0.00017462985194077632, 'epoch': 0.13}


 13%|█▎        | 1591/12500 [2:45:33<20:28:43,  6.76s/it]

{'loss': 0.6444, 'grad_norm': 0.22912223637104034, 'learning_rate': 0.0001746138455382153, 'epoch': 0.13}


 13%|█▎        | 1592/12500 [2:45:38<19:04:19,  6.29s/it]

{'loss': 0.4943, 'grad_norm': 0.2507955729961395, 'learning_rate': 0.00017459783913565427, 'epoch': 0.13}


 13%|█▎        | 1593/12500 [2:45:44<19:01:20,  6.28s/it]

{'loss': 0.8445, 'grad_norm': 0.253266841173172, 'learning_rate': 0.00017458183273309324, 'epoch': 0.13}


 13%|█▎        | 1594/12500 [2:45:49<17:50:52,  5.89s/it]

{'loss': 0.7605, 'grad_norm': 0.24186542630195618, 'learning_rate': 0.00017456582633053222, 'epoch': 0.13}


 13%|█▎        | 1595/12500 [2:45:55<17:27:10,  5.76s/it]

{'loss': 0.9716, 'grad_norm': 0.27085310220718384, 'learning_rate': 0.0001745498199279712, 'epoch': 0.13}


 13%|█▎        | 1596/12500 [2:46:02<19:05:36,  6.30s/it]

{'loss': 0.7632, 'grad_norm': 0.2670855224132538, 'learning_rate': 0.00017453381352541017, 'epoch': 0.13}


 13%|█▎        | 1597/12500 [2:46:07<17:44:56,  5.86s/it]

{'loss': 0.6069, 'grad_norm': 0.25965753197669983, 'learning_rate': 0.00017451780712284914, 'epoch': 0.13}


 13%|█▎        | 1598/12500 [2:46:12<17:12:02,  5.68s/it]

{'loss': 0.7624, 'grad_norm': 0.3089270293712616, 'learning_rate': 0.00017450180072028812, 'epoch': 0.13}


 13%|█▎        | 1599/12500 [2:46:18<17:07:14,  5.65s/it]

{'loss': 0.7555, 'grad_norm': 0.3135735094547272, 'learning_rate': 0.0001744857943177271, 'epoch': 0.13}


 13%|█▎        | 1600/12500 [2:46:26<18:48:11,  6.21s/it]

{'loss': 0.7442, 'grad_norm': 0.23617704212665558, 'learning_rate': 0.0001744697879151661, 'epoch': 0.13}


 13%|█▎        | 1601/12500 [2:46:34<21:15:45,  7.02s/it]

{'loss': 0.7391, 'grad_norm': 0.2591153681278229, 'learning_rate': 0.00017445378151260504, 'epoch': 0.13}


 13%|█▎        | 1602/12500 [2:46:41<20:47:00,  6.87s/it]

{'loss': 0.6213, 'grad_norm': 0.24997270107269287, 'learning_rate': 0.00017443777511004402, 'epoch': 0.13}


 13%|█▎        | 1603/12500 [2:46:47<19:53:57,  6.57s/it]

{'loss': 0.712, 'grad_norm': 0.23902824521064758, 'learning_rate': 0.000174421768707483, 'epoch': 0.13}


 13%|█▎        | 1604/12500 [2:46:53<19:55:11,  6.58s/it]

{'loss': 0.7735, 'grad_norm': 0.2289823740720749, 'learning_rate': 0.000174405762304922, 'epoch': 0.13}


 13%|█▎        | 1605/12500 [2:47:03<22:19:27,  7.38s/it]

{'loss': 0.7843, 'grad_norm': 0.17465205490589142, 'learning_rate': 0.00017438975590236094, 'epoch': 0.13}


 13%|█▎        | 1606/12500 [2:47:07<19:55:43,  6.59s/it]

{'loss': 0.8467, 'grad_norm': 0.28872841596603394, 'learning_rate': 0.00017437374949979992, 'epoch': 0.13}


 13%|█▎        | 1607/12500 [2:47:17<22:49:46,  7.54s/it]

{'loss': 0.8083, 'grad_norm': 0.23169969022274017, 'learning_rate': 0.00017435774309723892, 'epoch': 0.13}


 13%|█▎        | 1608/12500 [2:47:22<20:31:00,  6.78s/it]

{'loss': 0.6306, 'grad_norm': 0.28086233139038086, 'learning_rate': 0.0001743417366946779, 'epoch': 0.13}


 13%|█▎        | 1609/12500 [2:47:27<19:00:33,  6.28s/it]

{'loss': 0.5031, 'grad_norm': 0.29142510890960693, 'learning_rate': 0.00017432573029211684, 'epoch': 0.13}


 13%|█▎        | 1610/12500 [2:47:35<20:22:29,  6.74s/it]

{'loss': 0.8768, 'grad_norm': 0.26599013805389404, 'learning_rate': 0.00017430972388955582, 'epoch': 0.13}


 13%|█▎        | 1611/12500 [2:47:41<19:51:25,  6.56s/it]

{'loss': 0.5389, 'grad_norm': 0.21394942700862885, 'learning_rate': 0.00017429371748699482, 'epoch': 0.13}


 13%|█▎        | 1612/12500 [2:47:46<18:23:21,  6.08s/it]

{'loss': 0.8319, 'grad_norm': 0.29062461853027344, 'learning_rate': 0.0001742777110844338, 'epoch': 0.13}


 13%|█▎        | 1613/12500 [2:47:54<19:49:37,  6.56s/it]

{'loss': 0.7096, 'grad_norm': 0.2316034436225891, 'learning_rate': 0.00017426170468187274, 'epoch': 0.13}


 13%|█▎        | 1614/12500 [2:48:01<20:19:03,  6.72s/it]

{'loss': 0.5292, 'grad_norm': 0.22579315304756165, 'learning_rate': 0.00017424569827931174, 'epoch': 0.13}


 13%|█▎        | 1615/12500 [2:48:08<20:19:02,  6.72s/it]

{'loss': 0.6289, 'grad_norm': 0.24916134774684906, 'learning_rate': 0.00017422969187675072, 'epoch': 0.13}


 13%|█▎        | 1616/12500 [2:48:13<19:08:34,  6.33s/it]

{'loss': 0.5915, 'grad_norm': 0.2627641558647156, 'learning_rate': 0.0001742136854741897, 'epoch': 0.13}


 13%|█▎        | 1617/12500 [2:48:18<18:14:02,  6.03s/it]

{'loss': 0.9797, 'grad_norm': 0.26057231426239014, 'learning_rate': 0.00017419767907162864, 'epoch': 0.13}


 13%|█▎        | 1618/12500 [2:48:24<17:31:30,  5.80s/it]

{'loss': 0.4555, 'grad_norm': 0.2803194522857666, 'learning_rate': 0.00017418167266906764, 'epoch': 0.13}


 13%|█▎        | 1619/12500 [2:48:29<16:56:58,  5.61s/it]

{'loss': 0.9189, 'grad_norm': 0.2996443212032318, 'learning_rate': 0.00017416566626650662, 'epoch': 0.13}


 13%|█▎        | 1620/12500 [2:48:38<20:33:47,  6.80s/it]

{'loss': 0.7755, 'grad_norm': 0.1931414008140564, 'learning_rate': 0.0001741496598639456, 'epoch': 0.13}


 13%|█▎        | 1621/12500 [2:48:45<20:28:49,  6.78s/it]

{'loss': 0.4833, 'grad_norm': 0.1958894282579422, 'learning_rate': 0.00017413365346138457, 'epoch': 0.13}


 13%|█▎        | 1622/12500 [2:48:55<23:07:02,  7.65s/it]

{'loss': 0.7999, 'grad_norm': 0.2159150093793869, 'learning_rate': 0.00017411764705882354, 'epoch': 0.13}


 13%|█▎        | 1623/12500 [2:49:00<20:33:44,  6.81s/it]

{'loss': 0.9515, 'grad_norm': 0.2796391546726227, 'learning_rate': 0.00017410164065626252, 'epoch': 0.13}


 13%|█▎        | 1624/12500 [2:49:04<18:32:35,  6.14s/it]

{'loss': 0.6431, 'grad_norm': 0.3111431896686554, 'learning_rate': 0.0001740856342537015, 'epoch': 0.13}


 13%|█▎        | 1625/12500 [2:49:13<20:59:53,  6.95s/it]

{'loss': 0.8214, 'grad_norm': 0.215970978140831, 'learning_rate': 0.00017406962785114047, 'epoch': 0.13}


 13%|█▎        | 1626/12500 [2:49:19<19:50:59,  6.57s/it]

{'loss': 0.9088, 'grad_norm': 0.3122803568840027, 'learning_rate': 0.00017405362144857944, 'epoch': 0.13}


 13%|█▎        | 1627/12500 [2:49:23<17:33:55,  5.82s/it]

{'loss': 0.6765, 'grad_norm': 0.28891709446907043, 'learning_rate': 0.00017403761504601842, 'epoch': 0.13}


 13%|█▎        | 1628/12500 [2:49:30<19:07:46,  6.33s/it]

{'loss': 1.059, 'grad_norm': 0.2545604407787323, 'learning_rate': 0.0001740216086434574, 'epoch': 0.13}


 13%|█▎        | 1629/12500 [2:49:39<20:54:26,  6.92s/it]

{'loss': 0.8098, 'grad_norm': 0.23567049205303192, 'learning_rate': 0.00017400560224089637, 'epoch': 0.13}


 13%|█▎        | 1630/12500 [2:49:43<18:18:47,  6.07s/it]

{'loss': 0.6006, 'grad_norm': 0.27327972650527954, 'learning_rate': 0.00017398959583833534, 'epoch': 0.13}


 13%|█▎        | 1631/12500 [2:49:48<17:04:57,  5.66s/it]

{'loss': 0.6514, 'grad_norm': 0.29152417182922363, 'learning_rate': 0.00017397358943577431, 'epoch': 0.13}


 13%|█▎        | 1632/12500 [2:49:55<19:02:36,  6.31s/it]

{'loss': 0.9441, 'grad_norm': 0.2465495467185974, 'learning_rate': 0.0001739575830332133, 'epoch': 0.13}


 13%|█▎        | 1633/12500 [2:50:00<17:29:56,  5.80s/it]

{'loss': 0.5909, 'grad_norm': 0.2828116714954376, 'learning_rate': 0.00017394157663065226, 'epoch': 0.13}


 13%|█▎        | 1634/12500 [2:50:07<18:22:30,  6.09s/it]

{'loss': 0.7783, 'grad_norm': 0.22669781744480133, 'learning_rate': 0.00017392557022809124, 'epoch': 0.13}


 13%|█▎        | 1635/12500 [2:50:13<18:17:09,  6.06s/it]

{'loss': 0.4777, 'grad_norm': 0.29772457480430603, 'learning_rate': 0.00017390956382553024, 'epoch': 0.13}


 13%|█▎        | 1636/12500 [2:50:17<16:42:54,  5.54s/it]

{'loss': 0.662, 'grad_norm': 0.34911292791366577, 'learning_rate': 0.0001738935574229692, 'epoch': 0.13}


 13%|█▎        | 1637/12500 [2:50:22<16:38:57,  5.52s/it]

{'loss': 0.6067, 'grad_norm': 0.24933017790317535, 'learning_rate': 0.00017387755102040816, 'epoch': 0.13}


 13%|█▎        | 1638/12500 [2:50:29<17:47:27,  5.90s/it]

{'loss': 0.6439, 'grad_norm': 0.28605177998542786, 'learning_rate': 0.00017386154461784714, 'epoch': 0.13}


 13%|█▎        | 1639/12500 [2:50:34<17:07:56,  5.68s/it]

{'loss': 0.8479, 'grad_norm': 0.2630331516265869, 'learning_rate': 0.00017384553821528614, 'epoch': 0.13}


 13%|█▎        | 1640/12500 [2:50:43<20:10:14,  6.69s/it]

{'loss': 0.8506, 'grad_norm': 0.22007766366004944, 'learning_rate': 0.0001738295318127251, 'epoch': 0.13}


 13%|█▎        | 1641/12500 [2:50:48<17:47:32,  5.90s/it]

{'loss': 0.7201, 'grad_norm': 0.3332492411136627, 'learning_rate': 0.00017381352541016406, 'epoch': 0.13}


 13%|█▎        | 1642/12500 [2:50:53<17:24:49,  5.77s/it]

{'loss': 0.8811, 'grad_norm': 0.274697870016098, 'learning_rate': 0.00017379751900760307, 'epoch': 0.13}


 13%|█▎        | 1643/12500 [2:50:58<16:32:34,  5.49s/it]

{'loss': 0.4807, 'grad_norm': 0.27679964900016785, 'learning_rate': 0.00017378151260504204, 'epoch': 0.13}


 13%|█▎        | 1644/12500 [2:51:04<17:16:44,  5.73s/it]

{'loss': 0.6117, 'grad_norm': 0.2699642479419708, 'learning_rate': 0.000173765506202481, 'epoch': 0.13}


 13%|█▎        | 1645/12500 [2:51:09<16:28:17,  5.46s/it]

{'loss': 1.0884, 'grad_norm': 0.32870379090309143, 'learning_rate': 0.00017374949979991996, 'epoch': 0.13}


 13%|█▎        | 1646/12500 [2:51:13<15:19:53,  5.09s/it]

{'loss': 0.8619, 'grad_norm': 0.33588287234306335, 'learning_rate': 0.00017373349339735896, 'epoch': 0.13}


 13%|█▎        | 1647/12500 [2:51:20<16:37:41,  5.52s/it]

{'loss': 0.6589, 'grad_norm': 0.24569939076900482, 'learning_rate': 0.00017371748699479794, 'epoch': 0.13}


 13%|█▎        | 1648/12500 [2:51:26<17:11:10,  5.70s/it]

{'loss': 0.7781, 'grad_norm': 0.31630221009254456, 'learning_rate': 0.0001737014805922369, 'epoch': 0.13}


 13%|█▎        | 1649/12500 [2:51:31<16:52:34,  5.60s/it]

{'loss': 0.6725, 'grad_norm': 0.32088959217071533, 'learning_rate': 0.0001736854741896759, 'epoch': 0.13}


 13%|█▎        | 1650/12500 [2:51:36<15:42:25,  5.21s/it]

{'loss': 0.9576, 'grad_norm': 0.3030749261379242, 'learning_rate': 0.00017366946778711486, 'epoch': 0.13}


 13%|█▎        | 1651/12500 [2:51:42<16:30:22,  5.48s/it]

{'loss': 0.8509, 'grad_norm': 0.30428123474121094, 'learning_rate': 0.00017365346138455384, 'epoch': 0.13}


 13%|█▎        | 1652/12500 [2:51:49<18:23:37,  6.10s/it]

{'loss': 1.019, 'grad_norm': 0.20914387702941895, 'learning_rate': 0.0001736374549819928, 'epoch': 0.13}


 13%|█▎        | 1653/12500 [2:51:55<18:19:45,  6.08s/it]

{'loss': 0.7461, 'grad_norm': 0.2851065397262573, 'learning_rate': 0.0001736214485794318, 'epoch': 0.13}


 13%|█▎        | 1654/12500 [2:52:05<21:50:05,  7.25s/it]

{'loss': 0.8454, 'grad_norm': 0.22790125012397766, 'learning_rate': 0.00017360544217687076, 'epoch': 0.13}


 13%|█▎        | 1655/12500 [2:52:10<19:13:10,  6.38s/it]

{'loss': 0.8343, 'grad_norm': 0.33678939938545227, 'learning_rate': 0.00017358943577430974, 'epoch': 0.13}


 13%|█▎        | 1656/12500 [2:52:18<20:42:20,  6.87s/it]

{'loss': 0.6169, 'grad_norm': 0.1958474963903427, 'learning_rate': 0.00017357342937174869, 'epoch': 0.13}


 13%|█▎        | 1657/12500 [2:52:25<20:49:16,  6.91s/it]

{'loss': 0.8323, 'grad_norm': 0.19277222454547882, 'learning_rate': 0.0001735574229691877, 'epoch': 0.13}


 13%|█▎        | 1658/12500 [2:52:32<21:02:06,  6.98s/it]

{'loss': 0.5945, 'grad_norm': 0.25700855255126953, 'learning_rate': 0.00017354141656662666, 'epoch': 0.13}


 13%|█▎        | 1659/12500 [2:52:37<19:18:08,  6.41s/it]

{'loss': 0.6779, 'grad_norm': 0.25673091411590576, 'learning_rate': 0.00017352541016406564, 'epoch': 0.13}


 13%|█▎        | 1660/12500 [2:52:41<17:11:01,  5.71s/it]

{'loss': 0.8386, 'grad_norm': 0.3178541362285614, 'learning_rate': 0.0001735094037615046, 'epoch': 0.13}


 13%|█▎        | 1661/12500 [2:52:49<19:30:40,  6.48s/it]

{'loss': 0.5848, 'grad_norm': 0.19916033744812012, 'learning_rate': 0.0001734933973589436, 'epoch': 0.13}


 13%|█▎        | 1662/12500 [2:52:53<16:54:18,  5.62s/it]

{'loss': 0.9363, 'grad_norm': 0.36372706294059753, 'learning_rate': 0.00017347739095638256, 'epoch': 0.13}


 13%|█▎        | 1663/12500 [2:53:03<20:57:02,  6.96s/it]

{'loss': 0.9564, 'grad_norm': 0.23131661117076874, 'learning_rate': 0.00017346138455382154, 'epoch': 0.13}


 13%|█▎        | 1664/12500 [2:53:10<21:15:48,  7.06s/it]

{'loss': 0.8636, 'grad_norm': 0.22937506437301636, 'learning_rate': 0.0001734453781512605, 'epoch': 0.13}


 13%|█▎        | 1665/12500 [2:53:16<19:48:35,  6.58s/it]

{'loss': 0.5856, 'grad_norm': 0.28599071502685547, 'learning_rate': 0.0001734293717486995, 'epoch': 0.13}


 13%|█▎        | 1666/12500 [2:53:21<18:38:43,  6.20s/it]

{'loss': 0.6378, 'grad_norm': 0.2772332429885864, 'learning_rate': 0.00017341336534613846, 'epoch': 0.13}


 13%|█▎        | 1667/12500 [2:53:28<19:49:56,  6.59s/it]

{'loss': 0.5366, 'grad_norm': 0.2656393051147461, 'learning_rate': 0.00017339735894357744, 'epoch': 0.13}


 13%|█▎        | 1668/12500 [2:53:33<17:37:18,  5.86s/it]

{'loss': 0.4878, 'grad_norm': 0.2729334533214569, 'learning_rate': 0.0001733813525410164, 'epoch': 0.13}


 13%|█▎        | 1669/12500 [2:53:38<17:24:42,  5.79s/it]

{'loss': 0.7364, 'grad_norm': 0.2947239875793457, 'learning_rate': 0.0001733653461384554, 'epoch': 0.13}


 13%|█▎        | 1670/12500 [2:53:42<15:59:28,  5.32s/it]

{'loss': 0.6698, 'grad_norm': 0.2705713212490082, 'learning_rate': 0.00017334933973589436, 'epoch': 0.13}


 13%|█▎        | 1671/12500 [2:53:48<16:34:14,  5.51s/it]

{'loss': 0.7276, 'grad_norm': 0.30682963132858276, 'learning_rate': 0.00017333333333333334, 'epoch': 0.13}


 13%|█▎        | 1672/12500 [2:53:54<16:43:33,  5.56s/it]

{'loss': 0.7593, 'grad_norm': 0.449160635471344, 'learning_rate': 0.0001733173269307723, 'epoch': 0.13}


 13%|█▎        | 1673/12500 [2:54:00<17:09:34,  5.71s/it]

{'loss': 0.7729, 'grad_norm': 0.31953194737434387, 'learning_rate': 0.00017330132052821129, 'epoch': 0.13}


 13%|█▎        | 1674/12500 [2:54:06<17:01:28,  5.66s/it]

{'loss': 0.9175, 'grad_norm': 0.2832130193710327, 'learning_rate': 0.0001732853141256503, 'epoch': 0.13}


 13%|█▎        | 1675/12500 [2:54:11<16:58:03,  5.64s/it]

{'loss': 0.5687, 'grad_norm': 0.21022944152355194, 'learning_rate': 0.00017326930772308924, 'epoch': 0.13}


 13%|█▎        | 1676/12500 [2:54:20<19:24:57,  6.46s/it]

{'loss': 0.667, 'grad_norm': 0.32951298356056213, 'learning_rate': 0.0001732533013205282, 'epoch': 0.13}


 13%|█▎        | 1677/12500 [2:54:30<22:46:22,  7.57s/it]

{'loss': 0.623, 'grad_norm': 0.24682432413101196, 'learning_rate': 0.00017323729491796719, 'epoch': 0.13}


 13%|█▎        | 1678/12500 [2:54:39<24:13:37,  8.06s/it]

{'loss': 1.0574, 'grad_norm': 0.21513700485229492, 'learning_rate': 0.0001732212885154062, 'epoch': 0.13}


 13%|█▎        | 1679/12500 [2:54:45<22:08:32,  7.37s/it]

{'loss': 0.8301, 'grad_norm': 0.30916526913642883, 'learning_rate': 0.00017320528211284514, 'epoch': 0.13}


 13%|█▎        | 1680/12500 [2:54:54<24:08:37,  8.03s/it]

{'loss': 0.9255, 'grad_norm': 0.24060183763504028, 'learning_rate': 0.0001731892757102841, 'epoch': 0.13}


 13%|█▎        | 1681/12500 [2:55:02<24:05:17,  8.02s/it]

{'loss': 0.9832, 'grad_norm': 0.24687448143959045, 'learning_rate': 0.0001731732693077231, 'epoch': 0.13}


 13%|█▎        | 1682/12500 [2:55:07<21:07:15,  7.03s/it]

{'loss': 0.5082, 'grad_norm': 0.31160956621170044, 'learning_rate': 0.0001731572629051621, 'epoch': 0.13}


 13%|█▎        | 1683/12500 [2:55:15<21:36:13,  7.19s/it]

{'loss': 0.8168, 'grad_norm': 0.26253968477249146, 'learning_rate': 0.00017314125650260103, 'epoch': 0.13}


 13%|█▎        | 1684/12500 [2:55:20<19:47:57,  6.59s/it]

{'loss': 0.6122, 'grad_norm': 0.2823514938354492, 'learning_rate': 0.00017312525010004, 'epoch': 0.13}


 13%|█▎        | 1685/12500 [2:55:27<20:05:58,  6.69s/it]

{'loss': 0.4228, 'grad_norm': 0.21660760045051575, 'learning_rate': 0.000173109243697479, 'epoch': 0.13}


 13%|█▎        | 1686/12500 [2:55:32<18:43:54,  6.24s/it]

{'loss': 0.5032, 'grad_norm': 0.2472679316997528, 'learning_rate': 0.00017309323729491799, 'epoch': 0.13}


 13%|█▎        | 1687/12500 [2:55:36<17:14:25,  5.74s/it]

{'loss': 0.8637, 'grad_norm': 0.2914225459098816, 'learning_rate': 0.00017307723089235693, 'epoch': 0.13}


 14%|█▎        | 1688/12500 [2:55:44<18:44:08,  6.24s/it]

{'loss': 0.6935, 'grad_norm': 0.2435551881790161, 'learning_rate': 0.00017306122448979594, 'epoch': 0.14}


 14%|█▎        | 1689/12500 [2:55:50<18:46:36,  6.25s/it]

{'loss': 0.7754, 'grad_norm': 0.2724674940109253, 'learning_rate': 0.0001730452180872349, 'epoch': 0.14}


 14%|█▎        | 1690/12500 [2:55:58<20:07:37,  6.70s/it]

{'loss': 1.2609, 'grad_norm': 0.2931637167930603, 'learning_rate': 0.00017302921168467389, 'epoch': 0.14}


 14%|█▎        | 1691/12500 [2:56:02<17:50:06,  5.94s/it]

{'loss': 0.6773, 'grad_norm': 0.27599430084228516, 'learning_rate': 0.00017301320528211283, 'epoch': 0.14}


 14%|█▎        | 1692/12500 [2:56:10<19:48:58,  6.60s/it]

{'loss': 0.6528, 'grad_norm': 0.19681212306022644, 'learning_rate': 0.00017299719887955184, 'epoch': 0.14}


 14%|█▎        | 1693/12500 [2:56:18<20:53:48,  6.96s/it]

{'loss': 0.7867, 'grad_norm': 0.2615680396556854, 'learning_rate': 0.0001729811924769908, 'epoch': 0.14}


 14%|█▎        | 1694/12500 [2:56:24<20:29:07,  6.82s/it]

{'loss': 0.7918, 'grad_norm': 0.220513716340065, 'learning_rate': 0.00017296518607442978, 'epoch': 0.14}


 14%|█▎        | 1695/12500 [2:56:30<18:57:48,  6.32s/it]

{'loss': 0.8815, 'grad_norm': 0.27767205238342285, 'learning_rate': 0.00017294917967186876, 'epoch': 0.14}


 14%|█▎        | 1696/12500 [2:56:35<17:59:11,  5.99s/it]

{'loss': 0.6714, 'grad_norm': 0.2628501355648041, 'learning_rate': 0.00017293317326930773, 'epoch': 0.14}


 14%|█▎        | 1697/12500 [2:56:42<19:23:12,  6.46s/it]

{'loss': 0.7444, 'grad_norm': 0.25297078490257263, 'learning_rate': 0.0001729171668667467, 'epoch': 0.14}


 14%|█▎        | 1698/12500 [2:56:49<19:12:44,  6.40s/it]

{'loss': 0.6996, 'grad_norm': 0.2557818591594696, 'learning_rate': 0.00017290116046418568, 'epoch': 0.14}


 14%|█▎        | 1699/12500 [2:56:52<16:41:29,  5.56s/it]

{'loss': 0.7358, 'grad_norm': 0.38257166743278503, 'learning_rate': 0.00017288515406162466, 'epoch': 0.14}


 14%|█▎        | 1700/12500 [2:56:59<18:01:16,  6.01s/it]

{'loss': 0.4727, 'grad_norm': 0.1908591240644455, 'learning_rate': 0.00017286914765906363, 'epoch': 0.14}


 14%|█▎        | 1701/12500 [2:57:07<19:53:29,  6.63s/it]

{'loss': 0.447, 'grad_norm': 0.24707676470279694, 'learning_rate': 0.0001728531412565026, 'epoch': 0.14}


 14%|█▎        | 1702/12500 [2:57:12<18:16:53,  6.09s/it]

{'loss': 0.7877, 'grad_norm': 0.3050426244735718, 'learning_rate': 0.00017283713485394158, 'epoch': 0.14}


 14%|█▎        | 1703/12500 [2:57:19<18:37:48,  6.21s/it]

{'loss': 0.7306, 'grad_norm': 0.21803700923919678, 'learning_rate': 0.00017282112845138056, 'epoch': 0.14}


 14%|█▎        | 1704/12500 [2:57:26<19:35:59,  6.54s/it]

{'loss': 0.8877, 'grad_norm': 0.2473512440919876, 'learning_rate': 0.00017280512204881953, 'epoch': 0.14}


 14%|█▎        | 1705/12500 [2:57:33<19:35:00,  6.53s/it]

{'loss': 0.4835, 'grad_norm': 0.23397137224674225, 'learning_rate': 0.0001727891156462585, 'epoch': 0.14}


 14%|█▎        | 1706/12500 [2:57:39<19:49:32,  6.61s/it]

{'loss': 0.8529, 'grad_norm': 0.25195929408073425, 'learning_rate': 0.00017277310924369748, 'epoch': 0.14}


 14%|█▎        | 1707/12500 [2:57:45<18:57:31,  6.32s/it]

{'loss': 0.8142, 'grad_norm': 0.2630385160446167, 'learning_rate': 0.00017275710284113646, 'epoch': 0.14}


 14%|█▎        | 1708/12500 [2:57:50<17:33:15,  5.86s/it]

{'loss': 0.6714, 'grad_norm': 0.28147876262664795, 'learning_rate': 0.00017274109643857543, 'epoch': 0.14}


 14%|█▎        | 1709/12500 [2:57:58<19:31:50,  6.52s/it]

{'loss': 0.5455, 'grad_norm': 0.2136385142803192, 'learning_rate': 0.0001727250900360144, 'epoch': 0.14}


 14%|█▎        | 1710/12500 [2:58:05<20:20:15,  6.79s/it]

{'loss': 0.4709, 'grad_norm': 0.23186872899532318, 'learning_rate': 0.00017270908363345338, 'epoch': 0.14}


 14%|█▎        | 1711/12500 [2:58:13<20:54:26,  6.98s/it]

{'loss': 0.6533, 'grad_norm': 0.29320257902145386, 'learning_rate': 0.00017269307723089236, 'epoch': 0.14}


 14%|█▎        | 1712/12500 [2:58:17<18:22:31,  6.13s/it]

{'loss': 0.7528, 'grad_norm': 0.3159116804599762, 'learning_rate': 0.00017267707082833133, 'epoch': 0.14}


 14%|█▎        | 1713/12500 [2:58:23<18:41:52,  6.24s/it]

{'loss': 0.612, 'grad_norm': 0.25940802693367004, 'learning_rate': 0.00017266106442577033, 'epoch': 0.14}


 14%|█▎        | 1714/12500 [2:58:30<19:16:48,  6.44s/it]

{'loss': 0.7617, 'grad_norm': 0.2640691101551056, 'learning_rate': 0.00017264505802320928, 'epoch': 0.14}


 14%|█▎        | 1715/12500 [2:58:34<17:10:42,  5.73s/it]

{'loss': 0.8365, 'grad_norm': 0.3119736909866333, 'learning_rate': 0.00017262905162064826, 'epoch': 0.14}


 14%|█▎        | 1716/12500 [2:58:37<14:38:42,  4.89s/it]

{'loss': 0.5824, 'grad_norm': 0.3147791624069214, 'learning_rate': 0.00017261304521808723, 'epoch': 0.14}


 14%|█▎        | 1717/12500 [2:58:42<14:42:17,  4.91s/it]

{'loss': 0.4848, 'grad_norm': 0.23725129663944244, 'learning_rate': 0.00017259703881552623, 'epoch': 0.14}


 14%|█▎        | 1718/12500 [2:58:46<14:01:17,  4.68s/it]

{'loss': 0.9655, 'grad_norm': 0.30506229400634766, 'learning_rate': 0.00017258103241296518, 'epoch': 0.14}


 14%|█▍        | 1719/12500 [2:58:51<13:57:49,  4.66s/it]

{'loss': 0.6992, 'grad_norm': 0.31125032901763916, 'learning_rate': 0.00017256502601040416, 'epoch': 0.14}


 14%|█▍        | 1720/12500 [2:58:57<15:31:47,  5.19s/it]

{'loss': 0.7931, 'grad_norm': 0.3775418996810913, 'learning_rate': 0.00017254901960784316, 'epoch': 0.14}


 14%|█▍        | 1721/12500 [2:59:06<19:01:21,  6.35s/it]

{'loss': 0.5689, 'grad_norm': 0.20908579230308533, 'learning_rate': 0.00017253301320528213, 'epoch': 0.14}


 14%|█▍        | 1722/12500 [2:59:12<17:53:16,  5.97s/it]

{'loss': 0.5892, 'grad_norm': 0.25208786129951477, 'learning_rate': 0.00017251700680272108, 'epoch': 0.14}


 14%|█▍        | 1723/12500 [2:59:18<18:10:53,  6.07s/it]

{'loss': 0.5572, 'grad_norm': 0.2374563217163086, 'learning_rate': 0.00017250100040016006, 'epoch': 0.14}


 14%|█▍        | 1724/12500 [2:59:26<20:13:18,  6.76s/it]

{'loss': 0.9487, 'grad_norm': 0.2631269097328186, 'learning_rate': 0.00017248499399759906, 'epoch': 0.14}


 14%|█▍        | 1725/12500 [2:59:34<21:35:33,  7.21s/it]

{'loss': 0.8023, 'grad_norm': 0.25208577513694763, 'learning_rate': 0.00017246898759503803, 'epoch': 0.14}


 14%|█▍        | 1726/12500 [2:59:39<19:02:23,  6.36s/it]

{'loss': 0.8528, 'grad_norm': 0.3549644351005554, 'learning_rate': 0.00017245298119247698, 'epoch': 0.14}


 14%|█▍        | 1727/12500 [2:59:46<19:40:00,  6.57s/it]

{'loss': 0.6591, 'grad_norm': 0.2428508996963501, 'learning_rate': 0.00017243697478991598, 'epoch': 0.14}


 14%|█▍        | 1728/12500 [2:59:52<19:24:28,  6.49s/it]

{'loss': 0.8037, 'grad_norm': 0.2551649510860443, 'learning_rate': 0.00017242096838735496, 'epoch': 0.14}


 14%|█▍        | 1729/12500 [2:59:56<16:55:56,  5.66s/it]

{'loss': 0.9129, 'grad_norm': 0.3236531615257263, 'learning_rate': 0.00017240496198479393, 'epoch': 0.14}


 14%|█▍        | 1730/12500 [3:00:01<16:16:18,  5.44s/it]

{'loss': 0.6733, 'grad_norm': 0.2858757972717285, 'learning_rate': 0.00017238895558223288, 'epoch': 0.14}


 14%|█▍        | 1731/12500 [3:00:05<14:51:53,  4.97s/it]

{'loss': 0.6301, 'grad_norm': 0.29859116673469543, 'learning_rate': 0.00017237294917967188, 'epoch': 0.14}


 14%|█▍        | 1732/12500 [3:00:11<16:19:34,  5.46s/it]

{'loss': 0.6616, 'grad_norm': 0.23515021800994873, 'learning_rate': 0.00017235694277711086, 'epoch': 0.14}


 14%|█▍        | 1733/12500 [3:00:18<17:12:03,  5.75s/it]

{'loss': 0.5515, 'grad_norm': 0.2916552722454071, 'learning_rate': 0.00017234093637454983, 'epoch': 0.14}


 14%|█▍        | 1734/12500 [3:00:26<19:34:21,  6.54s/it]

{'loss': 0.549, 'grad_norm': 0.23468802869319916, 'learning_rate': 0.0001723249299719888, 'epoch': 0.14}


 14%|█▍        | 1735/12500 [3:00:40<25:45:33,  8.61s/it]

{'loss': 0.9029, 'grad_norm': 0.2763747572898865, 'learning_rate': 0.00017230892356942778, 'epoch': 0.14}


 14%|█▍        | 1736/12500 [3:00:46<23:56:18,  8.01s/it]

{'loss': 0.6664, 'grad_norm': 0.28521713614463806, 'learning_rate': 0.00017229291716686676, 'epoch': 0.14}


 14%|█▍        | 1737/12500 [3:00:54<24:09:02,  8.08s/it]

{'loss': 0.8107, 'grad_norm': 0.2333606481552124, 'learning_rate': 0.00017227691076430573, 'epoch': 0.14}


 14%|█▍        | 1738/12500 [3:01:01<22:50:03,  7.64s/it]

{'loss': 0.8412, 'grad_norm': 0.24390162527561188, 'learning_rate': 0.0001722609043617447, 'epoch': 0.14}


 14%|█▍        | 1739/12500 [3:01:07<21:32:34,  7.21s/it]

{'loss': 0.4371, 'grad_norm': 0.22360169887542725, 'learning_rate': 0.00017224489795918368, 'epoch': 0.14}


 14%|█▍        | 1740/12500 [3:01:15<22:17:37,  7.46s/it]

{'loss': 0.9765, 'grad_norm': 0.23876123130321503, 'learning_rate': 0.00017222889155662266, 'epoch': 0.14}


 14%|█▍        | 1741/12500 [3:01:22<21:50:25,  7.31s/it]

{'loss': 0.4502, 'grad_norm': 0.1971682608127594, 'learning_rate': 0.00017221288515406163, 'epoch': 0.14}


 14%|█▍        | 1742/12500 [3:01:30<22:01:52,  7.37s/it]

{'loss': 0.4952, 'grad_norm': 0.20326557755470276, 'learning_rate': 0.0001721968787515006, 'epoch': 0.14}


 14%|█▍        | 1743/12500 [3:01:37<21:49:13,  7.30s/it]

{'loss': 0.56, 'grad_norm': 0.2736137807369232, 'learning_rate': 0.00017218087234893958, 'epoch': 0.14}


 14%|█▍        | 1744/12500 [3:01:45<22:17:44,  7.46s/it]

{'loss': 0.66, 'grad_norm': 0.2281648814678192, 'learning_rate': 0.00017216486594637855, 'epoch': 0.14}


 14%|█▍        | 1745/12500 [3:01:50<20:44:59,  6.95s/it]

{'loss': 0.7362, 'grad_norm': 0.2611353099346161, 'learning_rate': 0.00017214885954381753, 'epoch': 0.14}


 14%|█▍        | 1746/12500 [3:01:57<20:08:13,  6.74s/it]

{'loss': 0.8105, 'grad_norm': 0.25663211941719055, 'learning_rate': 0.0001721328531412565, 'epoch': 0.14}


 14%|█▍        | 1747/12500 [3:02:03<19:56:39,  6.68s/it]

{'loss': 0.5792, 'grad_norm': 0.2356073260307312, 'learning_rate': 0.00017211684673869548, 'epoch': 0.14}


 14%|█▍        | 1748/12500 [3:02:09<19:20:33,  6.48s/it]

{'loss': 0.8381, 'grad_norm': 0.30164775252342224, 'learning_rate': 0.00017210084033613448, 'epoch': 0.14}


 14%|█▍        | 1749/12500 [3:02:13<16:52:02,  5.65s/it]

{'loss': 0.7537, 'grad_norm': 0.28620487451553345, 'learning_rate': 0.00017208483393357343, 'epoch': 0.14}


 14%|█▍        | 1750/12500 [3:02:21<19:18:26,  6.47s/it]

{'loss': 0.7366, 'grad_norm': 0.20883113145828247, 'learning_rate': 0.0001720688275310124, 'epoch': 0.14}


 14%|█▍        | 1751/12500 [3:02:27<18:31:27,  6.20s/it]

{'loss': 0.5517, 'grad_norm': 0.2833113968372345, 'learning_rate': 0.00017205282112845138, 'epoch': 0.14}


 14%|█▍        | 1752/12500 [3:02:34<18:51:32,  6.32s/it]

{'loss': 0.4455, 'grad_norm': 0.2318761944770813, 'learning_rate': 0.00017203681472589038, 'epoch': 0.14}


 14%|█▍        | 1753/12500 [3:02:41<20:15:08,  6.78s/it]

{'loss': 0.9346, 'grad_norm': 0.2423001080751419, 'learning_rate': 0.00017202080832332933, 'epoch': 0.14}


 14%|█▍        | 1754/12500 [3:02:47<19:24:00,  6.50s/it]

{'loss': 0.7622, 'grad_norm': 0.272129625082016, 'learning_rate': 0.0001720048019207683, 'epoch': 0.14}


 14%|█▍        | 1755/12500 [3:02:55<20:46:32,  6.96s/it]

{'loss': 0.6448, 'grad_norm': 0.26370689272880554, 'learning_rate': 0.0001719887955182073, 'epoch': 0.14}


 14%|█▍        | 1756/12500 [3:03:03<21:23:54,  7.17s/it]

{'loss': 0.767, 'grad_norm': 0.19055835902690887, 'learning_rate': 0.00017197278911564628, 'epoch': 0.14}


 14%|█▍        | 1757/12500 [3:03:11<22:18:34,  7.48s/it]

{'loss': 0.6718, 'grad_norm': 0.2113223522901535, 'learning_rate': 0.00017195678271308523, 'epoch': 0.14}


 14%|█▍        | 1758/12500 [3:03:16<19:59:47,  6.70s/it]

{'loss': 0.7093, 'grad_norm': 0.33256345987319946, 'learning_rate': 0.0001719407763105242, 'epoch': 0.14}


 14%|█▍        | 1759/12500 [3:03:21<18:37:33,  6.24s/it]

{'loss': 0.8163, 'grad_norm': 0.2589837312698364, 'learning_rate': 0.0001719247699079632, 'epoch': 0.14}


 14%|█▍        | 1760/12500 [3:03:30<20:29:57,  6.87s/it]

{'loss': 0.7519, 'grad_norm': 0.2237415760755539, 'learning_rate': 0.00017190876350540218, 'epoch': 0.14}


 14%|█▍        | 1761/12500 [3:03:35<19:35:03,  6.57s/it]

{'loss': 1.0034, 'grad_norm': 0.2607633173465729, 'learning_rate': 0.00017189275710284113, 'epoch': 0.14}


 14%|█▍        | 1762/12500 [3:03:45<22:15:21,  7.46s/it]

{'loss': 0.9194, 'grad_norm': 0.19122400879859924, 'learning_rate': 0.00017187675070028013, 'epoch': 0.14}


 14%|█▍        | 1763/12500 [3:03:54<23:39:31,  7.93s/it]

{'loss': 0.4864, 'grad_norm': 0.21474014222621918, 'learning_rate': 0.0001718607442977191, 'epoch': 0.14}


 14%|█▍        | 1764/12500 [3:04:01<23:02:11,  7.72s/it]

{'loss': 0.8637, 'grad_norm': 0.2385191023349762, 'learning_rate': 0.00017184473789515808, 'epoch': 0.14}


 14%|█▍        | 1765/12500 [3:04:07<21:42:39,  7.28s/it]

{'loss': 0.8197, 'grad_norm': 0.23738647997379303, 'learning_rate': 0.00017182873149259703, 'epoch': 0.14}


 14%|█▍        | 1766/12500 [3:04:14<20:44:04,  6.95s/it]

{'loss': 0.6415, 'grad_norm': 0.2527519762516022, 'learning_rate': 0.00017181272509003603, 'epoch': 0.14}


 14%|█▍        | 1767/12500 [3:04:20<20:09:32,  6.76s/it]

{'loss': 0.8621, 'grad_norm': 0.247574120759964, 'learning_rate': 0.000171796718687475, 'epoch': 0.14}


 14%|█▍        | 1768/12500 [3:04:27<20:37:10,  6.92s/it]

{'loss': 0.4882, 'grad_norm': 0.20542190968990326, 'learning_rate': 0.00017178071228491398, 'epoch': 0.14}


 14%|█▍        | 1769/12500 [3:04:33<19:56:53,  6.69s/it]

{'loss': 0.8843, 'grad_norm': 0.2570419907569885, 'learning_rate': 0.00017176470588235293, 'epoch': 0.14}


 14%|█▍        | 1770/12500 [3:04:40<20:17:48,  6.81s/it]

{'loss': 0.4685, 'grad_norm': 0.25999805331230164, 'learning_rate': 0.00017174869947979193, 'epoch': 0.14}


 14%|█▍        | 1771/12500 [3:04:46<18:55:45,  6.35s/it]

{'loss': 0.5124, 'grad_norm': 0.24405714869499207, 'learning_rate': 0.0001717326930772309, 'epoch': 0.14}


 14%|█▍        | 1772/12500 [3:04:52<18:26:06,  6.19s/it]

{'loss': 1.0614, 'grad_norm': 0.27270573377609253, 'learning_rate': 0.00017171668667466988, 'epoch': 0.14}


 14%|█▍        | 1773/12500 [3:04:59<19:57:29,  6.70s/it]

{'loss': 0.6393, 'grad_norm': 0.2075824737548828, 'learning_rate': 0.00017170068027210885, 'epoch': 0.14}


 14%|█▍        | 1774/12500 [3:05:08<21:30:38,  7.22s/it]

{'loss': 0.5714, 'grad_norm': 0.24295903742313385, 'learning_rate': 0.00017168467386954783, 'epoch': 0.14}


 14%|█▍        | 1775/12500 [3:05:12<18:46:13,  6.30s/it]

{'loss': 0.8225, 'grad_norm': 0.33448994159698486, 'learning_rate': 0.0001716686674669868, 'epoch': 0.14}


 14%|█▍        | 1776/12500 [3:05:20<20:32:07,  6.89s/it]

{'loss': 0.6205, 'grad_norm': 0.2773893475532532, 'learning_rate': 0.00017165266106442578, 'epoch': 0.14}


 14%|█▍        | 1777/12500 [3:05:26<19:35:40,  6.58s/it]

{'loss': 0.5337, 'grad_norm': 0.22499053180217743, 'learning_rate': 0.00017163665466186475, 'epoch': 0.14}


 14%|█▍        | 1778/12500 [3:05:32<18:44:58,  6.30s/it]

{'loss': 0.8841, 'grad_norm': 0.27030959725379944, 'learning_rate': 0.00017162064825930373, 'epoch': 0.14}


 14%|█▍        | 1779/12500 [3:05:37<17:34:58,  5.90s/it]

{'loss': 0.6859, 'grad_norm': 0.27995365858078003, 'learning_rate': 0.0001716046418567427, 'epoch': 0.14}


 14%|█▍        | 1780/12500 [3:05:44<18:39:58,  6.27s/it]

{'loss': 0.6759, 'grad_norm': 0.23548369109630585, 'learning_rate': 0.00017158863545418168, 'epoch': 0.14}


 14%|█▍        | 1781/12500 [3:05:53<20:52:17,  7.01s/it]

{'loss': 0.6371, 'grad_norm': 0.2167672961950302, 'learning_rate': 0.00017157262905162065, 'epoch': 0.14}


 14%|█▍        | 1782/12500 [3:05:58<19:23:27,  6.51s/it]

{'loss': 0.5824, 'grad_norm': 0.30096620321273804, 'learning_rate': 0.00017155662264905963, 'epoch': 0.14}


 14%|█▍        | 1783/12500 [3:06:03<17:58:31,  6.04s/it]

{'loss': 0.7865, 'grad_norm': 0.27303314208984375, 'learning_rate': 0.0001715406162464986, 'epoch': 0.14}


 14%|█▍        | 1784/12500 [3:06:10<18:37:43,  6.26s/it]

{'loss': 1.0, 'grad_norm': 0.29118818044662476, 'learning_rate': 0.00017152460984393758, 'epoch': 0.14}


 14%|█▍        | 1785/12500 [3:06:13<16:15:41,  5.46s/it]

{'loss': 0.7266, 'grad_norm': 0.27730682492256165, 'learning_rate': 0.00017150860344137655, 'epoch': 0.14}


 14%|█▍        | 1786/12500 [3:06:19<16:24:41,  5.51s/it]

{'loss': 0.4458, 'grad_norm': 0.21605907380580902, 'learning_rate': 0.00017149259703881553, 'epoch': 0.14}


 14%|█▍        | 1787/12500 [3:06:25<16:58:51,  5.71s/it]

{'loss': 0.7924, 'grad_norm': 0.22939735651016235, 'learning_rate': 0.00017147659063625453, 'epoch': 0.14}


 14%|█▍        | 1788/12500 [3:06:34<19:24:24,  6.52s/it]

{'loss': 0.8693, 'grad_norm': 0.19081927835941315, 'learning_rate': 0.00017146058423369348, 'epoch': 0.14}


 14%|█▍        | 1789/12500 [3:06:37<16:20:33,  5.49s/it]

{'loss': 0.5026, 'grad_norm': 0.3022501468658447, 'learning_rate': 0.00017144457783113245, 'epoch': 0.14}


 14%|█▍        | 1790/12500 [3:06:43<17:22:40,  5.84s/it]

{'loss': 0.7996, 'grad_norm': 0.27205681800842285, 'learning_rate': 0.00017142857142857143, 'epoch': 0.14}


 14%|█▍        | 1791/12500 [3:06:50<17:47:36,  5.98s/it]

{'loss': 0.6913, 'grad_norm': 0.28167009353637695, 'learning_rate': 0.00017141256502601043, 'epoch': 0.14}


 14%|█▍        | 1792/12500 [3:06:58<20:20:50,  6.84s/it]

{'loss': 0.6204, 'grad_norm': 0.2104867547750473, 'learning_rate': 0.00017139655862344937, 'epoch': 0.14}


 14%|█▍        | 1793/12500 [3:07:06<21:25:50,  7.21s/it]

{'loss': 0.9385, 'grad_norm': 0.22600135207176208, 'learning_rate': 0.00017138055222088835, 'epoch': 0.14}


 14%|█▍        | 1794/12500 [3:07:10<18:16:50,  6.15s/it]

{'loss': 0.8411, 'grad_norm': 0.3370286524295807, 'learning_rate': 0.00017136454581832735, 'epoch': 0.14}


 14%|█▍        | 1795/12500 [3:07:15<17:32:16,  5.90s/it]

{'loss': 0.5625, 'grad_norm': 0.23349447548389435, 'learning_rate': 0.00017134853941576633, 'epoch': 0.14}


 14%|█▍        | 1796/12500 [3:07:25<20:50:50,  7.01s/it]

{'loss': 0.6599, 'grad_norm': 0.16990572214126587, 'learning_rate': 0.00017133253301320527, 'epoch': 0.14}


 14%|█▍        | 1797/12500 [3:07:30<19:14:57,  6.47s/it]

{'loss': 0.4951, 'grad_norm': 0.26304131746292114, 'learning_rate': 0.00017131652661064425, 'epoch': 0.14}


 14%|█▍        | 1798/12500 [3:07:37<19:34:58,  6.59s/it]

{'loss': 0.4959, 'grad_norm': 0.32067158818244934, 'learning_rate': 0.00017130052020808325, 'epoch': 0.14}


 14%|█▍        | 1799/12500 [3:07:43<18:35:12,  6.25s/it]

{'loss': 0.6557, 'grad_norm': 0.33765271306037903, 'learning_rate': 0.00017128451380552223, 'epoch': 0.14}


 14%|█▍        | 1800/12500 [3:07:51<20:09:37,  6.78s/it]

{'loss': 1.0035, 'grad_norm': 0.24592962861061096, 'learning_rate': 0.00017126850740296117, 'epoch': 0.14}


 14%|█▍        | 1801/12500 [3:07:56<18:48:27,  6.33s/it]

{'loss': 0.7153, 'grad_norm': 0.3005560636520386, 'learning_rate': 0.00017125250100040018, 'epoch': 0.14}


 14%|█▍        | 1802/12500 [3:08:03<19:07:37,  6.44s/it]

{'loss': 0.7422, 'grad_norm': 0.21996237337589264, 'learning_rate': 0.00017123649459783915, 'epoch': 0.14}


 14%|█▍        | 1803/12500 [3:08:06<16:24:14,  5.52s/it]

{'loss': 0.9496, 'grad_norm': 0.3644392788410187, 'learning_rate': 0.00017122048819527813, 'epoch': 0.14}


 14%|█▍        | 1804/12500 [3:08:15<19:10:06,  6.45s/it]

{'loss': 0.8312, 'grad_norm': 0.2680608332157135, 'learning_rate': 0.00017120448179271707, 'epoch': 0.14}


 14%|█▍        | 1805/12500 [3:08:21<19:11:37,  6.46s/it]

{'loss': 0.8999, 'grad_norm': 0.251108855009079, 'learning_rate': 0.00017118847539015608, 'epoch': 0.14}


 14%|█▍        | 1806/12500 [3:08:26<18:08:27,  6.11s/it]

{'loss': 0.6312, 'grad_norm': 0.3040103018283844, 'learning_rate': 0.00017117246898759505, 'epoch': 0.14}


 14%|█▍        | 1807/12500 [3:08:35<20:07:47,  6.78s/it]

{'loss': 0.7289, 'grad_norm': 0.2343369424343109, 'learning_rate': 0.00017115646258503402, 'epoch': 0.14}


 14%|█▍        | 1808/12500 [3:08:40<18:38:30,  6.28s/it]

{'loss': 0.6899, 'grad_norm': 0.3164770305156708, 'learning_rate': 0.000171140456182473, 'epoch': 0.14}


 14%|█▍        | 1809/12500 [3:08:47<19:13:30,  6.47s/it]

{'loss': 0.6803, 'grad_norm': 0.2797781527042389, 'learning_rate': 0.00017112444977991197, 'epoch': 0.14}


 14%|█▍        | 1810/12500 [3:08:53<18:47:48,  6.33s/it]

{'loss': 0.754, 'grad_norm': 0.26793432235717773, 'learning_rate': 0.00017110844337735095, 'epoch': 0.14}


 14%|█▍        | 1811/12500 [3:08:58<18:05:57,  6.10s/it]

{'loss': 0.876, 'grad_norm': 0.3379172384738922, 'learning_rate': 0.00017109243697478992, 'epoch': 0.14}


 14%|█▍        | 1812/12500 [3:09:03<16:57:45,  5.71s/it]

{'loss': 0.9027, 'grad_norm': 0.35561603307724, 'learning_rate': 0.0001710764305722289, 'epoch': 0.14}


 15%|█▍        | 1813/12500 [3:09:13<20:27:04,  6.89s/it]

{'loss': 0.8148, 'grad_norm': 0.22389541566371918, 'learning_rate': 0.00017106042416966787, 'epoch': 0.15}


 15%|█▍        | 1814/12500 [3:09:18<19:25:32,  6.54s/it]

{'loss': 0.8168, 'grad_norm': 0.2485780268907547, 'learning_rate': 0.00017104441776710685, 'epoch': 0.15}


 15%|█▍        | 1815/12500 [3:09:22<17:04:46,  5.75s/it]

{'loss': 0.6471, 'grad_norm': 0.2767598330974579, 'learning_rate': 0.00017102841136454585, 'epoch': 0.15}


 15%|█▍        | 1816/12500 [3:09:30<18:23:03,  6.19s/it]

{'loss': 0.7585, 'grad_norm': 0.25183457136154175, 'learning_rate': 0.0001710124049619848, 'epoch': 0.15}


 15%|█▍        | 1817/12500 [3:09:36<18:18:49,  6.17s/it]

{'loss': 0.6579, 'grad_norm': 0.2652880847454071, 'learning_rate': 0.00017099639855942377, 'epoch': 0.15}


 15%|█▍        | 1818/12500 [3:09:45<20:43:53,  6.99s/it]

{'loss': 0.9288, 'grad_norm': 0.26636412739753723, 'learning_rate': 0.00017098039215686275, 'epoch': 0.15}


 15%|█▍        | 1819/12500 [3:09:51<20:20:25,  6.86s/it]

{'loss': 0.8033, 'grad_norm': 0.23560290038585663, 'learning_rate': 0.00017096438575430175, 'epoch': 0.15}


 15%|█▍        | 1820/12500 [3:09:58<19:51:40,  6.69s/it]

{'loss': 0.6434, 'grad_norm': 0.2724435329437256, 'learning_rate': 0.0001709483793517407, 'epoch': 0.15}


 15%|█▍        | 1821/12500 [3:10:02<18:08:23,  6.12s/it]

{'loss': 0.5711, 'grad_norm': 0.2430739551782608, 'learning_rate': 0.00017093237294917967, 'epoch': 0.15}


 15%|█▍        | 1822/12500 [3:10:10<19:38:07,  6.62s/it]

{'loss': 0.4811, 'grad_norm': 0.1945485919713974, 'learning_rate': 0.00017091636654661865, 'epoch': 0.15}


 15%|█▍        | 1823/12500 [3:10:17<19:47:45,  6.67s/it]

{'loss': 0.9157, 'grad_norm': 0.26592767238616943, 'learning_rate': 0.00017090036014405765, 'epoch': 0.15}


 15%|█▍        | 1824/12500 [3:10:25<20:54:47,  7.05s/it]

{'loss': 1.2103, 'grad_norm': 0.23462043702602386, 'learning_rate': 0.0001708843537414966, 'epoch': 0.15}


 15%|█▍        | 1825/12500 [3:10:30<18:56:55,  6.39s/it]

{'loss': 0.6532, 'grad_norm': 0.26805630326271057, 'learning_rate': 0.00017086834733893557, 'epoch': 0.15}


 15%|█▍        | 1826/12500 [3:10:34<17:19:15,  5.84s/it]

{'loss': 0.8021, 'grad_norm': 0.34613102674484253, 'learning_rate': 0.00017085234093637457, 'epoch': 0.15}


 15%|█▍        | 1827/12500 [3:10:40<17:42:00,  5.97s/it]

{'loss': 1.0727, 'grad_norm': 0.39651933312416077, 'learning_rate': 0.00017083633453381355, 'epoch': 0.15}


 15%|█▍        | 1828/12500 [3:10:44<15:45:04,  5.31s/it]

{'loss': 0.5794, 'grad_norm': 0.3117436468601227, 'learning_rate': 0.0001708203281312525, 'epoch': 0.15}


 15%|█▍        | 1829/12500 [3:10:50<15:44:30,  5.31s/it]

{'loss': 0.7035, 'grad_norm': 0.313403457403183, 'learning_rate': 0.00017080432172869147, 'epoch': 0.15}


 15%|█▍        | 1830/12500 [3:10:56<17:09:26,  5.79s/it]

{'loss': 0.6291, 'grad_norm': 0.2477497011423111, 'learning_rate': 0.00017078831532613047, 'epoch': 0.15}


 15%|█▍        | 1831/12500 [3:11:03<17:36:56,  5.94s/it]

{'loss': 0.8934, 'grad_norm': 0.2537253201007843, 'learning_rate': 0.00017077230892356945, 'epoch': 0.15}


 15%|█▍        | 1832/12500 [3:11:07<15:45:11,  5.32s/it]

{'loss': 0.9355, 'grad_norm': 0.38664814829826355, 'learning_rate': 0.0001707563025210084, 'epoch': 0.15}


 15%|█▍        | 1833/12500 [3:11:11<14:59:08,  5.06s/it]

{'loss': 0.7773, 'grad_norm': 0.2719910144805908, 'learning_rate': 0.0001707402961184474, 'epoch': 0.15}


 15%|█▍        | 1834/12500 [3:11:18<16:21:29,  5.52s/it]

{'loss': 0.5705, 'grad_norm': 0.23018187284469604, 'learning_rate': 0.00017072428971588637, 'epoch': 0.15}


 15%|█▍        | 1835/12500 [3:11:26<18:32:54,  6.26s/it]

{'loss': 0.9962, 'grad_norm': 0.2337343394756317, 'learning_rate': 0.00017070828331332535, 'epoch': 0.15}


 15%|█▍        | 1836/12500 [3:11:31<18:09:40,  6.13s/it]

{'loss': 0.5091, 'grad_norm': 0.27234917879104614, 'learning_rate': 0.0001706922769107643, 'epoch': 0.15}


 15%|█▍        | 1837/12500 [3:11:35<15:48:50,  5.34s/it]

{'loss': 0.8937, 'grad_norm': 0.3842008411884308, 'learning_rate': 0.0001706762705082033, 'epoch': 0.15}


 15%|█▍        | 1838/12500 [3:11:40<15:48:39,  5.34s/it]

{'loss': 0.7551, 'grad_norm': 0.29693424701690674, 'learning_rate': 0.00017066026410564227, 'epoch': 0.15}


 15%|█▍        | 1839/12500 [3:11:46<16:02:14,  5.42s/it]

{'loss': 0.9148, 'grad_norm': 0.28264516592025757, 'learning_rate': 0.00017064425770308125, 'epoch': 0.15}


 15%|█▍        | 1840/12500 [3:11:51<15:20:06,  5.18s/it]

{'loss': 0.7559, 'grad_norm': 0.30103766918182373, 'learning_rate': 0.00017062825130052022, 'epoch': 0.15}


 15%|█▍        | 1841/12500 [3:11:57<16:39:25,  5.63s/it]

{'loss': 0.4303, 'grad_norm': 0.21698784828186035, 'learning_rate': 0.0001706122448979592, 'epoch': 0.15}


 15%|█▍        | 1842/12500 [3:12:02<15:37:56,  5.28s/it]

{'loss': 0.764, 'grad_norm': 0.30475297570228577, 'learning_rate': 0.00017059623849539817, 'epoch': 0.15}


 15%|█▍        | 1843/12500 [3:12:10<18:04:43,  6.11s/it]

{'loss': 0.5972, 'grad_norm': 0.2195025384426117, 'learning_rate': 0.00017058023209283715, 'epoch': 0.15}


 15%|█▍        | 1844/12500 [3:12:17<19:19:34,  6.53s/it]

{'loss': 0.9584, 'grad_norm': 0.2348526120185852, 'learning_rate': 0.00017056422569027612, 'epoch': 0.15}


 15%|█▍        | 1845/12500 [3:12:23<18:41:53,  6.32s/it]

{'loss': 0.6576, 'grad_norm': 0.3225193917751312, 'learning_rate': 0.0001705482192877151, 'epoch': 0.15}


 15%|█▍        | 1846/12500 [3:12:29<17:57:17,  6.07s/it]

{'loss': 0.7244, 'grad_norm': 0.2613828182220459, 'learning_rate': 0.00017053221288515407, 'epoch': 0.15}


 15%|█▍        | 1847/12500 [3:12:38<20:38:07,  6.97s/it]

{'loss': 0.6746, 'grad_norm': 0.2238640934228897, 'learning_rate': 0.00017051620648259305, 'epoch': 0.15}


 15%|█▍        | 1848/12500 [3:12:43<19:02:10,  6.43s/it]

{'loss': 0.7517, 'grad_norm': 0.30539825558662415, 'learning_rate': 0.00017050020008003202, 'epoch': 0.15}


 15%|█▍        | 1849/12500 [3:12:46<15:59:49,  5.41s/it]

{'loss': 0.7198, 'grad_norm': 0.3821455240249634, 'learning_rate': 0.000170484193677471, 'epoch': 0.15}


 15%|█▍        | 1850/12500 [3:12:52<16:52:19,  5.70s/it]

{'loss': 0.6312, 'grad_norm': 0.294968843460083, 'learning_rate': 0.00017046818727490997, 'epoch': 0.15}


 15%|█▍        | 1851/12500 [3:12:58<17:03:50,  5.77s/it]

{'loss': 0.5938, 'grad_norm': 0.28050583600997925, 'learning_rate': 0.00017045218087234895, 'epoch': 0.15}


 15%|█▍        | 1852/12500 [3:13:04<17:24:24,  5.89s/it]

{'loss': 0.592, 'grad_norm': 0.25667065382003784, 'learning_rate': 0.00017043617446978792, 'epoch': 0.15}


 15%|█▍        | 1853/12500 [3:13:12<19:14:37,  6.51s/it]

{'loss': 0.6493, 'grad_norm': 0.25192567706108093, 'learning_rate': 0.0001704201680672269, 'epoch': 0.15}


 15%|█▍        | 1854/12500 [3:13:17<18:05:08,  6.12s/it]

{'loss': 0.9935, 'grad_norm': 0.2697242796421051, 'learning_rate': 0.0001704041616646659, 'epoch': 0.15}


 15%|█▍        | 1855/12500 [3:13:24<18:08:49,  6.14s/it]

{'loss': 0.7553, 'grad_norm': 0.33675965666770935, 'learning_rate': 0.00017038815526210484, 'epoch': 0.15}


 15%|█▍        | 1856/12500 [3:13:28<16:45:53,  5.67s/it]

{'loss': 0.7273, 'grad_norm': 0.3361499011516571, 'learning_rate': 0.00017037214885954382, 'epoch': 0.15}


 15%|█▍        | 1857/12500 [3:13:35<17:20:26,  5.87s/it]

{'loss': 0.6928, 'grad_norm': 0.2655181586742401, 'learning_rate': 0.0001703561424569828, 'epoch': 0.15}


 15%|█▍        | 1858/12500 [3:13:38<15:27:38,  5.23s/it]

{'loss': 0.7707, 'grad_norm': 0.3295624852180481, 'learning_rate': 0.0001703401360544218, 'epoch': 0.15}


 15%|█▍        | 1859/12500 [3:13:43<15:19:49,  5.19s/it]

{'loss': 0.8193, 'grad_norm': 0.30887967348098755, 'learning_rate': 0.00017032412965186074, 'epoch': 0.15}


 15%|█▍        | 1860/12500 [3:13:48<14:46:40,  5.00s/it]

{'loss': 0.4747, 'grad_norm': 0.28286826610565186, 'learning_rate': 0.00017030812324929972, 'epoch': 0.15}


 15%|█▍        | 1861/12500 [3:13:54<15:37:41,  5.29s/it]

{'loss': 1.0102, 'grad_norm': 0.36031898856163025, 'learning_rate': 0.00017029211684673872, 'epoch': 0.15}


 15%|█▍        | 1862/12500 [3:14:04<20:10:59,  6.83s/it]

{'loss': 0.6664, 'grad_norm': 0.21618589758872986, 'learning_rate': 0.0001702761104441777, 'epoch': 0.15}


 15%|█▍        | 1863/12500 [3:14:11<20:21:07,  6.89s/it]

{'loss': 0.9679, 'grad_norm': 0.24025826156139374, 'learning_rate': 0.00017026010404161664, 'epoch': 0.15}


 15%|█▍        | 1864/12500 [3:14:21<23:11:24,  7.85s/it]

{'loss': 0.6247, 'grad_norm': 0.18386487662792206, 'learning_rate': 0.00017024409763905562, 'epoch': 0.15}


 15%|█▍        | 1865/12500 [3:14:30<23:47:21,  8.05s/it]

{'loss': 0.5265, 'grad_norm': 0.2342313528060913, 'learning_rate': 0.00017022809123649462, 'epoch': 0.15}


 15%|█▍        | 1866/12500 [3:14:37<22:54:14,  7.75s/it]

{'loss': 0.638, 'grad_norm': 0.239871546626091, 'learning_rate': 0.0001702120848339336, 'epoch': 0.15}


 15%|█▍        | 1867/12500 [3:14:43<21:00:29,  7.11s/it]

{'loss': 0.8168, 'grad_norm': 0.2493395358324051, 'learning_rate': 0.00017019607843137254, 'epoch': 0.15}


 15%|█▍        | 1868/12500 [3:14:49<20:05:08,  6.80s/it]

{'loss': 0.9772, 'grad_norm': 0.2549896538257599, 'learning_rate': 0.00017018007202881155, 'epoch': 0.15}


 15%|█▍        | 1869/12500 [3:14:55<19:32:22,  6.62s/it]

{'loss': 0.5908, 'grad_norm': 0.2610298693180084, 'learning_rate': 0.00017016406562625052, 'epoch': 0.15}


 15%|█▍        | 1870/12500 [3:15:00<18:06:33,  6.13s/it]

{'loss': 0.5579, 'grad_norm': 0.2734854817390442, 'learning_rate': 0.0001701480592236895, 'epoch': 0.15}


 15%|█▍        | 1871/12500 [3:15:12<23:02:06,  7.80s/it]

{'loss': 0.9077, 'grad_norm': 0.18370439112186432, 'learning_rate': 0.00017013205282112844, 'epoch': 0.15}


 15%|█▍        | 1872/12500 [3:15:17<21:00:08,  7.11s/it]

{'loss': 0.8328, 'grad_norm': 0.26676464080810547, 'learning_rate': 0.00017011604641856744, 'epoch': 0.15}


 15%|█▍        | 1873/12500 [3:15:25<21:52:35,  7.41s/it]

{'loss': 0.7986, 'grad_norm': 0.21611644327640533, 'learning_rate': 0.00017010004001600642, 'epoch': 0.15}


 15%|█▍        | 1874/12500 [3:15:33<22:24:59,  7.59s/it]

{'loss': 0.5831, 'grad_norm': 0.2448788583278656, 'learning_rate': 0.0001700840336134454, 'epoch': 0.15}


 15%|█▌        | 1875/12500 [3:15:39<20:47:43,  7.05s/it]

{'loss': 0.7108, 'grad_norm': 0.22907927632331848, 'learning_rate': 0.00017006802721088434, 'epoch': 0.15}


 15%|█▌        | 1876/12500 [3:15:49<22:58:12,  7.78s/it]

{'loss': 0.4713, 'grad_norm': 0.20662926137447357, 'learning_rate': 0.00017005202080832334, 'epoch': 0.15}


 15%|█▌        | 1877/12500 [3:15:57<23:19:58,  7.91s/it]

{'loss': 0.7767, 'grad_norm': 0.24544496834278107, 'learning_rate': 0.00017003601440576232, 'epoch': 0.15}


 15%|█▌        | 1878/12500 [3:16:04<22:58:10,  7.78s/it]

{'loss': 0.4281, 'grad_norm': 0.225433811545372, 'learning_rate': 0.0001700200080032013, 'epoch': 0.15}


 15%|█▌        | 1879/12500 [3:16:13<24:02:17,  8.15s/it]

{'loss': 0.8287, 'grad_norm': 0.21171611547470093, 'learning_rate': 0.00017000400160064027, 'epoch': 0.15}


 15%|█▌        | 1880/12500 [3:16:23<25:07:00,  8.51s/it]

{'loss': 1.0271, 'grad_norm': 0.20642298460006714, 'learning_rate': 0.00016998799519807924, 'epoch': 0.15}


 15%|█▌        | 1881/12500 [3:16:28<22:41:26,  7.69s/it]

{'loss': 0.6574, 'grad_norm': 0.2633388638496399, 'learning_rate': 0.00016997198879551822, 'epoch': 0.15}


 15%|█▌        | 1882/12500 [3:16:42<27:40:43,  9.38s/it]

{'loss': 0.9499, 'grad_norm': 0.19667017459869385, 'learning_rate': 0.0001699559823929572, 'epoch': 0.15}


 15%|█▌        | 1883/12500 [3:16:47<24:23:58,  8.27s/it]

{'loss': 0.6778, 'grad_norm': 0.2989353537559509, 'learning_rate': 0.00016993997599039617, 'epoch': 0.15}


 15%|█▌        | 1884/12500 [3:16:53<21:53:38,  7.42s/it]

{'loss': 0.8465, 'grad_norm': 0.3110577464103699, 'learning_rate': 0.00016992396958783514, 'epoch': 0.15}


 15%|█▌        | 1885/12500 [3:17:01<22:14:29,  7.54s/it]

{'loss': 0.7286, 'grad_norm': 0.22381749749183655, 'learning_rate': 0.00016990796318527412, 'epoch': 0.15}


 15%|█▌        | 1886/12500 [3:17:10<23:54:19,  8.11s/it]

{'loss': 0.9557, 'grad_norm': 0.21006044745445251, 'learning_rate': 0.0001698919567827131, 'epoch': 0.15}


 15%|█▌        | 1887/12500 [3:17:15<21:12:54,  7.20s/it]

{'loss': 0.8374, 'grad_norm': 0.2948605716228485, 'learning_rate': 0.00016987595038015207, 'epoch': 0.15}


 15%|█▌        | 1888/12500 [3:17:25<23:44:16,  8.05s/it]

{'loss': 0.4053, 'grad_norm': 0.18337436020374298, 'learning_rate': 0.00016985994397759104, 'epoch': 0.15}


 15%|█▌        | 1889/12500 [3:17:30<20:51:54,  7.08s/it]

{'loss': 0.6808, 'grad_norm': 0.23866455256938934, 'learning_rate': 0.00016984393757503002, 'epoch': 0.15}


 15%|█▌        | 1890/12500 [3:17:38<21:25:36,  7.27s/it]

{'loss': 0.5682, 'grad_norm': 0.24416540563106537, 'learning_rate': 0.000169827931172469, 'epoch': 0.15}


 15%|█▌        | 1891/12500 [3:17:42<18:43:12,  6.35s/it]

{'loss': 0.5844, 'grad_norm': 0.2927247881889343, 'learning_rate': 0.00016981192476990797, 'epoch': 0.15}


 15%|█▌        | 1892/12500 [3:17:50<19:53:59,  6.75s/it]

{'loss': 0.5927, 'grad_norm': 0.237386554479599, 'learning_rate': 0.00016979591836734694, 'epoch': 0.15}


 15%|█▌        | 1893/12500 [3:17:55<19:03:42,  6.47s/it]

{'loss': 0.9242, 'grad_norm': 0.30322834849357605, 'learning_rate': 0.00016977991196478594, 'epoch': 0.15}


 15%|█▌        | 1894/12500 [3:18:00<17:45:15,  6.03s/it]

{'loss': 0.8404, 'grad_norm': 0.25847506523132324, 'learning_rate': 0.0001697639055622249, 'epoch': 0.15}


 15%|█▌        | 1895/12500 [3:18:08<18:43:34,  6.36s/it]

{'loss': 0.7893, 'grad_norm': 0.24874776601791382, 'learning_rate': 0.00016974789915966387, 'epoch': 0.15}


 15%|█▌        | 1896/12500 [3:18:13<17:54:35,  6.08s/it]

{'loss': 0.5955, 'grad_norm': 0.27462536096572876, 'learning_rate': 0.00016973189275710284, 'epoch': 0.15}


 15%|█▌        | 1897/12500 [3:18:19<18:03:34,  6.13s/it]

{'loss': 0.7779, 'grad_norm': 0.23068034648895264, 'learning_rate': 0.00016971588635454184, 'epoch': 0.15}


 15%|█▌        | 1898/12500 [3:18:28<20:36:06,  7.00s/it]

{'loss': 0.7157, 'grad_norm': 0.2144925743341446, 'learning_rate': 0.0001696998799519808, 'epoch': 0.15}


 15%|█▌        | 1899/12500 [3:18:34<19:33:56,  6.64s/it]

{'loss': 0.7045, 'grad_norm': 0.2645898461341858, 'learning_rate': 0.00016968387354941977, 'epoch': 0.15}


 15%|█▌        | 1900/12500 [3:18:39<18:12:20,  6.18s/it]

{'loss': 0.7249, 'grad_norm': 0.4151311218738556, 'learning_rate': 0.00016966786714685877, 'epoch': 0.15}


 15%|█▌        | 1901/12500 [3:18:47<19:57:51,  6.78s/it]

{'loss': 0.9412, 'grad_norm': 0.22824202477931976, 'learning_rate': 0.00016965186074429774, 'epoch': 0.15}


 15%|█▌        | 1902/12500 [3:18:56<21:22:54,  7.26s/it]

{'loss': 0.6675, 'grad_norm': 0.22762827575206757, 'learning_rate': 0.0001696358543417367, 'epoch': 0.15}


 15%|█▌        | 1903/12500 [3:19:02<20:22:43,  6.92s/it]

{'loss': 0.8336, 'grad_norm': 0.25112929940223694, 'learning_rate': 0.00016961984793917566, 'epoch': 0.15}


 15%|█▌        | 1904/12500 [3:19:06<18:14:42,  6.20s/it]

{'loss': 0.4848, 'grad_norm': 0.24108508229255676, 'learning_rate': 0.00016960384153661467, 'epoch': 0.15}


 15%|█▌        | 1905/12500 [3:19:14<19:08:27,  6.50s/it]

{'loss': 0.6283, 'grad_norm': 0.2536919116973877, 'learning_rate': 0.00016958783513405364, 'epoch': 0.15}


 15%|█▌        | 1906/12500 [3:19:22<21:12:53,  7.21s/it]

{'loss': 0.7162, 'grad_norm': 0.18742266297340393, 'learning_rate': 0.0001695718287314926, 'epoch': 0.15}


 15%|█▌        | 1907/12500 [3:19:29<20:37:57,  7.01s/it]

{'loss': 0.5424, 'grad_norm': 0.2601763904094696, 'learning_rate': 0.0001695558223289316, 'epoch': 0.15}


 15%|█▌        | 1908/12500 [3:19:36<20:18:09,  6.90s/it]

{'loss': 0.9066, 'grad_norm': 0.26302585005760193, 'learning_rate': 0.00016953981592637057, 'epoch': 0.15}


 15%|█▌        | 1909/12500 [3:19:40<18:25:31,  6.26s/it]

{'loss': 0.5904, 'grad_norm': 0.2698250710964203, 'learning_rate': 0.00016952380952380954, 'epoch': 0.15}


 15%|█▌        | 1910/12500 [3:19:48<19:56:08,  6.78s/it]

{'loss': 0.9829, 'grad_norm': 0.2633717954158783, 'learning_rate': 0.0001695078031212485, 'epoch': 0.15}


 15%|█▌        | 1911/12500 [3:19:53<17:50:07,  6.06s/it]

{'loss': 0.7423, 'grad_norm': 0.29704800248146057, 'learning_rate': 0.0001694917967186875, 'epoch': 0.15}


 15%|█▌        | 1912/12500 [3:19:58<16:42:20,  5.68s/it]

{'loss': 0.6606, 'grad_norm': 0.2583078145980835, 'learning_rate': 0.00016947579031612647, 'epoch': 0.15}


 15%|█▌        | 1913/12500 [3:20:08<20:54:44,  7.11s/it]

{'loss': 0.7254, 'grad_norm': 0.19732390344142914, 'learning_rate': 0.00016945978391356544, 'epoch': 0.15}


 15%|█▌        | 1914/12500 [3:20:14<20:11:49,  6.87s/it]

{'loss': 0.676, 'grad_norm': 0.3052314817905426, 'learning_rate': 0.00016944377751100442, 'epoch': 0.15}


 15%|█▌        | 1915/12500 [3:20:18<17:35:08,  5.98s/it]

{'loss': 0.5224, 'grad_norm': 0.29032325744628906, 'learning_rate': 0.0001694277711084434, 'epoch': 0.15}


 15%|█▌        | 1916/12500 [3:20:28<21:06:35,  7.18s/it]

{'loss': 0.414, 'grad_norm': 0.17402249574661255, 'learning_rate': 0.00016941176470588237, 'epoch': 0.15}


 15%|█▌        | 1917/12500 [3:20:35<20:35:27,  7.00s/it]

{'loss': 0.8765, 'grad_norm': 0.2668331563472748, 'learning_rate': 0.00016939575830332134, 'epoch': 0.15}


 15%|█▌        | 1918/12500 [3:20:39<18:16:06,  6.21s/it]

{'loss': 0.6807, 'grad_norm': 0.2660997807979584, 'learning_rate': 0.00016937975190076031, 'epoch': 0.15}


 15%|█▌        | 1919/12500 [3:20:46<19:04:29,  6.49s/it]

{'loss': 0.7984, 'grad_norm': 0.2385518103837967, 'learning_rate': 0.0001693637454981993, 'epoch': 0.15}


 15%|█▌        | 1920/12500 [3:20:56<21:59:42,  7.48s/it]

{'loss': 0.8029, 'grad_norm': 0.2270498275756836, 'learning_rate': 0.00016934773909563826, 'epoch': 0.15}


 15%|█▌        | 1921/12500 [3:21:02<20:31:55,  6.99s/it]

{'loss': 0.7355, 'grad_norm': 0.22486071288585663, 'learning_rate': 0.00016933173269307724, 'epoch': 0.15}


 15%|█▌        | 1922/12500 [3:21:09<20:28:15,  6.97s/it]

{'loss': 0.8278, 'grad_norm': 0.25627002120018005, 'learning_rate': 0.00016931572629051621, 'epoch': 0.15}


 15%|█▌        | 1923/12500 [3:21:17<21:34:05,  7.34s/it]

{'loss': 0.5503, 'grad_norm': 0.2911469638347626, 'learning_rate': 0.0001692997198879552, 'epoch': 0.15}


 15%|█▌        | 1924/12500 [3:21:23<20:12:55,  6.88s/it]

{'loss': 0.6939, 'grad_norm': 0.2476995885372162, 'learning_rate': 0.00016928371348539416, 'epoch': 0.15}


 15%|█▌        | 1925/12500 [3:21:30<20:27:30,  6.96s/it]

{'loss': 0.901, 'grad_norm': 0.3010016977787018, 'learning_rate': 0.00016926770708283314, 'epoch': 0.15}


 15%|█▌        | 1926/12500 [3:21:38<21:27:45,  7.31s/it]

{'loss': 0.9221, 'grad_norm': 0.24603594839572906, 'learning_rate': 0.00016925170068027211, 'epoch': 0.15}


 15%|█▌        | 1927/12500 [3:21:43<19:22:13,  6.60s/it]

{'loss': 0.8164, 'grad_norm': 0.3047860860824585, 'learning_rate': 0.0001692356942777111, 'epoch': 0.15}


 15%|█▌        | 1928/12500 [3:21:47<17:15:18,  5.88s/it]

{'loss': 1.1698, 'grad_norm': 0.3349040448665619, 'learning_rate': 0.0001692196878751501, 'epoch': 0.15}


 15%|█▌        | 1929/12500 [3:21:53<17:33:56,  5.98s/it]

{'loss': 0.9002, 'grad_norm': 0.28818249702453613, 'learning_rate': 0.00016920368147258904, 'epoch': 0.15}


 15%|█▌        | 1930/12500 [3:22:00<18:14:34,  6.21s/it]

{'loss': 0.558, 'grad_norm': 0.2559206485748291, 'learning_rate': 0.000169187675070028, 'epoch': 0.15}


 15%|█▌        | 1931/12500 [3:22:11<22:25:58,  7.64s/it]

{'loss': 0.9876, 'grad_norm': 0.22939535975456238, 'learning_rate': 0.000169171668667467, 'epoch': 0.15}


 15%|█▌        | 1932/12500 [3:22:18<22:03:09,  7.51s/it]

{'loss': 0.6755, 'grad_norm': 0.31159043312072754, 'learning_rate': 0.000169155662264906, 'epoch': 0.15}


 15%|█▌        | 1933/12500 [3:22:22<18:21:08,  6.25s/it]

{'loss': 0.8455, 'grad_norm': 0.35090914368629456, 'learning_rate': 0.00016913965586234494, 'epoch': 0.15}


 15%|█▌        | 1934/12500 [3:22:28<18:12:00,  6.20s/it]

{'loss': 0.729, 'grad_norm': 0.2555975019931793, 'learning_rate': 0.0001691236494597839, 'epoch': 0.15}


 15%|█▌        | 1935/12500 [3:22:33<17:08:17,  5.84s/it]

{'loss': 0.8793, 'grad_norm': 0.25240829586982727, 'learning_rate': 0.0001691076430572229, 'epoch': 0.15}


 15%|█▌        | 1936/12500 [3:22:41<19:02:57,  6.49s/it]

{'loss': 1.0855, 'grad_norm': 0.25136512517929077, 'learning_rate': 0.0001690916366546619, 'epoch': 0.15}


 15%|█▌        | 1937/12500 [3:22:46<17:33:54,  5.99s/it]

{'loss': 0.709, 'grad_norm': 0.27089545130729675, 'learning_rate': 0.00016907563025210084, 'epoch': 0.15}


 16%|█▌        | 1938/12500 [3:22:51<16:37:49,  5.67s/it]

{'loss': 0.6367, 'grad_norm': 0.23106534779071808, 'learning_rate': 0.0001690596238495398, 'epoch': 0.16}


 16%|█▌        | 1939/12500 [3:22:58<18:02:50,  6.15s/it]

{'loss': 0.5485, 'grad_norm': 0.2836504578590393, 'learning_rate': 0.00016904361744697881, 'epoch': 0.16}


 16%|█▌        | 1940/12500 [3:23:04<18:05:23,  6.17s/it]

{'loss': 0.7163, 'grad_norm': 0.24061265587806702, 'learning_rate': 0.0001690276110444178, 'epoch': 0.16}


 16%|█▌        | 1941/12500 [3:23:08<16:21:00,  5.57s/it]

{'loss': 0.6729, 'grad_norm': 0.29433533549308777, 'learning_rate': 0.00016901160464185674, 'epoch': 0.16}


 16%|█▌        | 1942/12500 [3:23:18<19:42:35,  6.72s/it]

{'loss': 0.8965, 'grad_norm': 0.23612168431282043, 'learning_rate': 0.0001689955982392957, 'epoch': 0.16}


 16%|█▌        | 1943/12500 [3:23:24<19:05:21,  6.51s/it]

{'loss': 0.8258, 'grad_norm': 0.27145087718963623, 'learning_rate': 0.0001689795918367347, 'epoch': 0.16}


 16%|█▌        | 1944/12500 [3:23:28<16:57:32,  5.78s/it]

{'loss': 0.5756, 'grad_norm': 0.27034878730773926, 'learning_rate': 0.0001689635854341737, 'epoch': 0.16}


 16%|█▌        | 1945/12500 [3:23:33<16:23:43,  5.59s/it]

{'loss': 0.707, 'grad_norm': 0.2435014843940735, 'learning_rate': 0.00016894757903161264, 'epoch': 0.16}


 16%|█▌        | 1946/12500 [3:23:39<16:57:42,  5.79s/it]

{'loss': 0.7734, 'grad_norm': 0.25324946641921997, 'learning_rate': 0.00016893157262905164, 'epoch': 0.16}


 16%|█▌        | 1947/12500 [3:23:46<18:15:42,  6.23s/it]

{'loss': 0.712, 'grad_norm': 0.2472430020570755, 'learning_rate': 0.0001689155662264906, 'epoch': 0.16}


 16%|█▌        | 1948/12500 [3:23:50<16:08:07,  5.50s/it]

{'loss': 0.946, 'grad_norm': 0.35842421650886536, 'learning_rate': 0.0001688995598239296, 'epoch': 0.16}


 16%|█▌        | 1949/12500 [3:23:55<15:14:16,  5.20s/it]

{'loss': 0.7227, 'grad_norm': 0.31784799695014954, 'learning_rate': 0.00016888355342136854, 'epoch': 0.16}


 16%|█▌        | 1950/12500 [3:24:00<15:23:12,  5.25s/it]

{'loss': 0.7799, 'grad_norm': 0.27801626920700073, 'learning_rate': 0.00016886754701880754, 'epoch': 0.16}


 16%|█▌        | 1951/12500 [3:24:08<18:10:24,  6.20s/it]

{'loss': 0.9213, 'grad_norm': 0.2540839612483978, 'learning_rate': 0.0001688515406162465, 'epoch': 0.16}


 16%|█▌        | 1952/12500 [3:24:13<17:02:40,  5.82s/it]

{'loss': 0.453, 'grad_norm': 0.254138320684433, 'learning_rate': 0.0001688355342136855, 'epoch': 0.16}


 16%|█▌        | 1953/12500 [3:24:23<20:26:53,  6.98s/it]

{'loss': 0.8741, 'grad_norm': 0.23755858838558197, 'learning_rate': 0.00016881952781112446, 'epoch': 0.16}


 16%|█▌        | 1954/12500 [3:24:30<20:13:30,  6.90s/it]

{'loss': 0.7879, 'grad_norm': 0.30075836181640625, 'learning_rate': 0.00016880352140856344, 'epoch': 0.16}


 16%|█▌        | 1955/12500 [3:24:35<18:33:07,  6.33s/it]

{'loss': 0.5751, 'grad_norm': 0.2918654978275299, 'learning_rate': 0.0001687875150060024, 'epoch': 0.16}


 16%|█▌        | 1956/12500 [3:24:42<19:31:06,  6.66s/it]

{'loss': 0.6538, 'grad_norm': 0.210953488945961, 'learning_rate': 0.0001687715086034414, 'epoch': 0.16}


 16%|█▌        | 1957/12500 [3:24:47<18:09:23,  6.20s/it]

{'loss': 0.5432, 'grad_norm': 0.28575825691223145, 'learning_rate': 0.00016875550220088036, 'epoch': 0.16}


 16%|█▌        | 1958/12500 [3:24:54<18:25:39,  6.29s/it]

{'loss': 0.4535, 'grad_norm': 0.24704565107822418, 'learning_rate': 0.00016873949579831934, 'epoch': 0.16}


 16%|█▌        | 1959/12500 [3:25:04<21:29:29,  7.34s/it]

{'loss': 0.5272, 'grad_norm': 0.1945304423570633, 'learning_rate': 0.0001687234893957583, 'epoch': 0.16}


 16%|█▌        | 1960/12500 [3:25:11<21:39:30,  7.40s/it]

{'loss': 0.7786, 'grad_norm': 0.22422413527965546, 'learning_rate': 0.00016870748299319729, 'epoch': 0.16}


 16%|█▌        | 1961/12500 [3:25:17<19:53:57,  6.80s/it]

{'loss': 0.9365, 'grad_norm': 0.263990193605423, 'learning_rate': 0.00016869147659063626, 'epoch': 0.16}


 16%|█▌        | 1962/12500 [3:25:22<19:04:32,  6.52s/it]

{'loss': 0.8152, 'grad_norm': 0.250737726688385, 'learning_rate': 0.00016867547018807524, 'epoch': 0.16}


 16%|█▌        | 1963/12500 [3:25:30<20:07:04,  6.87s/it]

{'loss': 0.75, 'grad_norm': 0.2961302697658539, 'learning_rate': 0.0001686594637855142, 'epoch': 0.16}


 16%|█▌        | 1964/12500 [3:25:38<21:02:51,  7.19s/it]

{'loss': 0.9002, 'grad_norm': 0.24710381031036377, 'learning_rate': 0.00016864345738295319, 'epoch': 0.16}


 16%|█▌        | 1965/12500 [3:25:46<21:15:24,  7.26s/it]

{'loss': 0.6594, 'grad_norm': 0.20503830909729004, 'learning_rate': 0.00016862745098039216, 'epoch': 0.16}


 16%|█▌        | 1966/12500 [3:25:51<19:41:23,  6.73s/it]

{'loss': 0.5416, 'grad_norm': 0.30118125677108765, 'learning_rate': 0.00016861144457783114, 'epoch': 0.16}


 16%|█▌        | 1967/12500 [3:25:55<17:28:12,  5.97s/it]

{'loss': 0.9391, 'grad_norm': 0.3258674144744873, 'learning_rate': 0.00016859543817527014, 'epoch': 0.16}


 16%|█▌        | 1968/12500 [3:26:02<18:21:24,  6.27s/it]

{'loss': 0.6153, 'grad_norm': 0.2368178516626358, 'learning_rate': 0.00016857943177270908, 'epoch': 0.16}


 16%|█▌        | 1969/12500 [3:26:08<18:07:01,  6.19s/it]

{'loss': 0.7493, 'grad_norm': 0.26279157400131226, 'learning_rate': 0.00016856342537014806, 'epoch': 0.16}


 16%|█▌        | 1970/12500 [3:26:12<16:26:32,  5.62s/it]

{'loss': 0.7296, 'grad_norm': 0.2796078324317932, 'learning_rate': 0.00016854741896758703, 'epoch': 0.16}


 16%|█▌        | 1971/12500 [3:26:22<19:35:15,  6.70s/it]

{'loss': 0.9106, 'grad_norm': 0.22464296221733093, 'learning_rate': 0.00016853141256502604, 'epoch': 0.16}


 16%|█▌        | 1972/12500 [3:26:34<24:51:44,  8.50s/it]

{'loss': 0.8911, 'grad_norm': 0.2077958881855011, 'learning_rate': 0.00016851540616246498, 'epoch': 0.16}


 16%|█▌        | 1973/12500 [3:26:44<26:03:30,  8.91s/it]

{'loss': 0.4876, 'grad_norm': 0.18346039950847626, 'learning_rate': 0.00016849939975990396, 'epoch': 0.16}


 16%|█▌        | 1974/12500 [3:26:49<22:48:41,  7.80s/it]

{'loss': 0.537, 'grad_norm': 0.2452186644077301, 'learning_rate': 0.00016848339335734296, 'epoch': 0.16}


 16%|█▌        | 1975/12500 [3:26:55<21:00:56,  7.19s/it]

{'loss': 0.5654, 'grad_norm': 0.25316986441612244, 'learning_rate': 0.00016846738695478194, 'epoch': 0.16}


 16%|█▌        | 1976/12500 [3:27:01<20:10:19,  6.90s/it]

{'loss': 0.7929, 'grad_norm': 0.335411936044693, 'learning_rate': 0.00016845138055222088, 'epoch': 0.16}


 16%|█▌        | 1977/12500 [3:27:08<19:48:19,  6.78s/it]

{'loss': 0.4761, 'grad_norm': 0.2633814513683319, 'learning_rate': 0.00016843537414965986, 'epoch': 0.16}


 16%|█▌        | 1978/12500 [3:27:14<18:48:09,  6.43s/it]

{'loss': 0.5293, 'grad_norm': 0.23277781903743744, 'learning_rate': 0.00016841936774709886, 'epoch': 0.16}


 16%|█▌        | 1979/12500 [3:27:21<19:14:38,  6.58s/it]

{'loss': 0.562, 'grad_norm': 0.23463839292526245, 'learning_rate': 0.00016840336134453784, 'epoch': 0.16}


 16%|█▌        | 1980/12500 [3:27:26<18:14:36,  6.24s/it]

{'loss': 0.4939, 'grad_norm': 0.22675195336341858, 'learning_rate': 0.00016838735494197678, 'epoch': 0.16}


 16%|█▌        | 1981/12500 [3:27:33<19:10:58,  6.57s/it]

{'loss': 0.9133, 'grad_norm': 0.2720547616481781, 'learning_rate': 0.00016837134853941578, 'epoch': 0.16}


 16%|█▌        | 1982/12500 [3:27:38<17:17:15,  5.92s/it]

{'loss': 0.7427, 'grad_norm': 0.2815347909927368, 'learning_rate': 0.00016835534213685476, 'epoch': 0.16}


 16%|█▌        | 1983/12500 [3:27:45<18:41:04,  6.40s/it]

{'loss': 0.6488, 'grad_norm': 0.21686814725399017, 'learning_rate': 0.00016833933573429373, 'epoch': 0.16}


 16%|█▌        | 1984/12500 [3:27:49<16:26:16,  5.63s/it]

{'loss': 0.5592, 'grad_norm': 0.295015811920166, 'learning_rate': 0.00016832332933173268, 'epoch': 0.16}


 16%|█▌        | 1985/12500 [3:27:54<15:57:16,  5.46s/it]

{'loss': 0.7332, 'grad_norm': 0.3068827986717224, 'learning_rate': 0.00016830732292917168, 'epoch': 0.16}


 16%|█▌        | 1986/12500 [3:28:00<16:02:33,  5.49s/it]

{'loss': 0.8154, 'grad_norm': 0.2707086503505707, 'learning_rate': 0.00016829131652661066, 'epoch': 0.16}


 16%|█▌        | 1987/12500 [3:28:07<17:17:24,  5.92s/it]

{'loss': 0.4326, 'grad_norm': 0.21582815051078796, 'learning_rate': 0.00016827531012404963, 'epoch': 0.16}


 16%|█▌        | 1988/12500 [3:28:13<17:31:35,  6.00s/it]

{'loss': 0.7809, 'grad_norm': 0.26256391406059265, 'learning_rate': 0.00016825930372148858, 'epoch': 0.16}


 16%|█▌        | 1989/12500 [3:28:18<16:47:10,  5.75s/it]

{'loss': 0.9638, 'grad_norm': 0.2609294354915619, 'learning_rate': 0.00016824329731892758, 'epoch': 0.16}


 16%|█▌        | 1990/12500 [3:28:23<15:50:32,  5.43s/it]

{'loss': 0.9678, 'grad_norm': 0.29422807693481445, 'learning_rate': 0.00016822729091636656, 'epoch': 0.16}


 16%|█▌        | 1991/12500 [3:28:28<15:53:42,  5.45s/it]

{'loss': 0.8489, 'grad_norm': 0.26944005489349365, 'learning_rate': 0.00016821128451380553, 'epoch': 0.16}


 16%|█▌        | 1992/12500 [3:28:34<16:27:35,  5.64s/it]

{'loss': 0.8096, 'grad_norm': 0.30553528666496277, 'learning_rate': 0.0001681952781112445, 'epoch': 0.16}


 16%|█▌        | 1993/12500 [3:28:42<18:28:14,  6.33s/it]

{'loss': 0.6598, 'grad_norm': 0.2464887797832489, 'learning_rate': 0.00016817927170868348, 'epoch': 0.16}


 16%|█▌        | 1994/12500 [3:28:46<16:42:48,  5.73s/it]

{'loss': 0.6729, 'grad_norm': 0.33386391401290894, 'learning_rate': 0.00016816326530612246, 'epoch': 0.16}


 16%|█▌        | 1995/12500 [3:28:52<16:18:31,  5.59s/it]

{'loss': 0.6472, 'grad_norm': 0.27770671248435974, 'learning_rate': 0.00016814725890356143, 'epoch': 0.16}


 16%|█▌        | 1996/12500 [3:28:56<14:52:37,  5.10s/it]

{'loss': 0.9093, 'grad_norm': 0.3302728235721588, 'learning_rate': 0.0001681312525010004, 'epoch': 0.16}


 16%|█▌        | 1997/12500 [3:29:01<15:11:08,  5.21s/it]

{'loss': 0.7057, 'grad_norm': 0.2772081196308136, 'learning_rate': 0.00016811524609843938, 'epoch': 0.16}


 16%|█▌        | 1998/12500 [3:29:06<15:13:39,  5.22s/it]

{'loss': 0.6198, 'grad_norm': 0.27069929242134094, 'learning_rate': 0.00016809923969587836, 'epoch': 0.16}


 16%|█▌        | 1999/12500 [3:29:15<17:45:15,  6.09s/it]

{'loss': 0.6843, 'grad_norm': 0.22027115523815155, 'learning_rate': 0.00016808323329331733, 'epoch': 0.16}


 16%|█▌        | 2000/12500 [3:29:19<16:01:30,  5.49s/it]

{'loss': 0.6612, 'grad_norm': 0.3299523890018463, 'learning_rate': 0.0001680672268907563, 'epoch': 0.16}


 16%|█▌        | 2001/12500 [3:29:29<20:29:19,  7.03s/it]

{'loss': 0.5823, 'grad_norm': 0.21247021853923798, 'learning_rate': 0.00016805122048819528, 'epoch': 0.16}


 16%|█▌        | 2002/12500 [3:29:36<19:52:08,  6.81s/it]

{'loss': 0.5357, 'grad_norm': 0.25110486149787903, 'learning_rate': 0.00016803521408563426, 'epoch': 0.16}


 16%|█▌        | 2003/12500 [3:29:41<18:53:15,  6.48s/it]

{'loss': 0.7998, 'grad_norm': 0.34841349720954895, 'learning_rate': 0.00016801920768307323, 'epoch': 0.16}


 16%|█▌        | 2004/12500 [3:29:46<17:45:53,  6.09s/it]

{'loss': 0.6261, 'grad_norm': 0.27519360184669495, 'learning_rate': 0.0001680032012805122, 'epoch': 0.16}


 16%|█▌        | 2005/12500 [3:29:53<18:33:42,  6.37s/it]

{'loss': 0.6038, 'grad_norm': 0.2706088125705719, 'learning_rate': 0.00016798719487795118, 'epoch': 0.16}


 16%|█▌        | 2006/12500 [3:30:00<18:32:38,  6.36s/it]

{'loss': 0.7054, 'grad_norm': 0.22706329822540283, 'learning_rate': 0.00016797118847539018, 'epoch': 0.16}


 16%|█▌        | 2007/12500 [3:30:08<20:06:42,  6.90s/it]

{'loss': 0.7503, 'grad_norm': 0.22947904467582703, 'learning_rate': 0.00016795518207282913, 'epoch': 0.16}


 16%|█▌        | 2008/12500 [3:30:17<21:44:39,  7.46s/it]

{'loss': 1.0019, 'grad_norm': 0.21688660979270935, 'learning_rate': 0.0001679391756702681, 'epoch': 0.16}


 16%|█▌        | 2009/12500 [3:30:23<20:25:53,  7.01s/it]

{'loss': 0.6756, 'grad_norm': 0.2653045654296875, 'learning_rate': 0.00016792316926770708, 'epoch': 0.16}


 16%|█▌        | 2010/12500 [3:30:29<20:06:20,  6.90s/it]

{'loss': 0.6329, 'grad_norm': 0.23003500699996948, 'learning_rate': 0.00016790716286514608, 'epoch': 0.16}


 16%|█▌        | 2011/12500 [3:30:37<21:03:52,  7.23s/it]

{'loss': 0.7047, 'grad_norm': 0.23831389844417572, 'learning_rate': 0.00016789115646258503, 'epoch': 0.16}


 16%|█▌        | 2012/12500 [3:30:46<22:01:32,  7.56s/it]

{'loss': 0.7556, 'grad_norm': 0.22838449478149414, 'learning_rate': 0.000167875150060024, 'epoch': 0.16}


 16%|█▌        | 2013/12500 [3:30:51<20:27:20,  7.02s/it]

{'loss': 0.4274, 'grad_norm': 0.22705420851707458, 'learning_rate': 0.000167859143657463, 'epoch': 0.16}


 16%|█▌        | 2014/12500 [3:30:55<17:12:22,  5.91s/it]

{'loss': 0.7752, 'grad_norm': 0.3366272449493408, 'learning_rate': 0.00016784313725490198, 'epoch': 0.16}


 16%|█▌        | 2015/12500 [3:31:01<17:12:35,  5.91s/it]

{'loss': 0.6588, 'grad_norm': 0.2462969869375229, 'learning_rate': 0.00016782713085234093, 'epoch': 0.16}


 16%|█▌        | 2016/12500 [3:31:10<20:17:09,  6.97s/it]

{'loss': 0.6314, 'grad_norm': 0.17515479028224945, 'learning_rate': 0.0001678111244497799, 'epoch': 0.16}


 16%|█▌        | 2017/12500 [3:31:17<20:26:45,  7.02s/it]

{'loss': 0.9128, 'grad_norm': 0.2857190668582916, 'learning_rate': 0.0001677951180472189, 'epoch': 0.16}


 16%|█▌        | 2018/12500 [3:31:23<19:42:56,  6.77s/it]

{'loss': 0.5905, 'grad_norm': 0.2194937765598297, 'learning_rate': 0.00016777911164465788, 'epoch': 0.16}


 16%|█▌        | 2019/12500 [3:31:30<19:56:29,  6.85s/it]

{'loss': 0.7462, 'grad_norm': 0.2305101454257965, 'learning_rate': 0.00016776310524209683, 'epoch': 0.16}


 16%|█▌        | 2020/12500 [3:31:35<17:48:34,  6.12s/it]

{'loss': 0.8775, 'grad_norm': 0.3176938593387604, 'learning_rate': 0.00016774709883953583, 'epoch': 0.16}


 16%|█▌        | 2021/12500 [3:31:39<16:18:55,  5.61s/it]

{'loss': 0.653, 'grad_norm': 0.29376140236854553, 'learning_rate': 0.0001677310924369748, 'epoch': 0.16}


 16%|█▌        | 2022/12500 [3:31:45<16:29:46,  5.67s/it]

{'loss': 0.7379, 'grad_norm': 0.29222017526626587, 'learning_rate': 0.00016771508603441378, 'epoch': 0.16}


 16%|█▌        | 2023/12500 [3:31:51<17:01:19,  5.85s/it]

{'loss': 0.7732, 'grad_norm': 0.23643724620342255, 'learning_rate': 0.00016769907963185273, 'epoch': 0.16}


 16%|█▌        | 2024/12500 [3:32:02<21:35:03,  7.42s/it]

{'loss': 0.6747, 'grad_norm': 0.18981236219406128, 'learning_rate': 0.00016768307322929173, 'epoch': 0.16}


 16%|█▌        | 2025/12500 [3:32:07<19:09:04,  6.58s/it]

{'loss': 0.8036, 'grad_norm': 0.286300927400589, 'learning_rate': 0.0001676670668267307, 'epoch': 0.16}


 16%|█▌        | 2026/12500 [3:32:15<20:40:33,  7.11s/it]

{'loss': 0.9743, 'grad_norm': 0.23134198784828186, 'learning_rate': 0.00016765106042416968, 'epoch': 0.16}


 16%|█▌        | 2027/12500 [3:32:24<21:40:28,  7.45s/it]

{'loss': 0.5587, 'grad_norm': 0.2309512346982956, 'learning_rate': 0.00016763505402160866, 'epoch': 0.16}


 16%|█▌        | 2028/12500 [3:32:29<19:50:31,  6.82s/it]

{'loss': 0.7132, 'grad_norm': 0.2570544481277466, 'learning_rate': 0.00016761904761904763, 'epoch': 0.16}


 16%|█▌        | 2029/12500 [3:32:37<21:13:28,  7.30s/it]

{'loss': 0.6368, 'grad_norm': 0.22790208458900452, 'learning_rate': 0.0001676030412164866, 'epoch': 0.16}


 16%|█▌        | 2030/12500 [3:32:42<18:35:15,  6.39s/it]

{'loss': 0.4926, 'grad_norm': 0.24860605597496033, 'learning_rate': 0.00016758703481392558, 'epoch': 0.16}


 16%|█▌        | 2031/12500 [3:32:47<17:55:29,  6.16s/it]

{'loss': 0.7817, 'grad_norm': 0.2992783188819885, 'learning_rate': 0.00016757102841136455, 'epoch': 0.16}


 16%|█▋        | 2032/12500 [3:32:52<16:53:52,  5.81s/it]

{'loss': 0.6898, 'grad_norm': 0.3150186240673065, 'learning_rate': 0.00016755502200880353, 'epoch': 0.16}


 16%|█▋        | 2033/12500 [3:32:56<14:42:45,  5.06s/it]

{'loss': 1.0497, 'grad_norm': 0.368367463350296, 'learning_rate': 0.0001675390156062425, 'epoch': 0.16}


 16%|█▋        | 2034/12500 [3:33:01<14:53:33,  5.12s/it]

{'loss': 0.7781, 'grad_norm': 0.2509918212890625, 'learning_rate': 0.00016752300920368148, 'epoch': 0.16}


 16%|█▋        | 2035/12500 [3:33:09<17:20:45,  5.97s/it]

{'loss': 0.7471, 'grad_norm': 0.226993590593338, 'learning_rate': 0.00016750700280112045, 'epoch': 0.16}


 16%|█▋        | 2036/12500 [3:33:14<17:05:00,  5.88s/it]

{'loss': 0.6159, 'grad_norm': 0.2623476982116699, 'learning_rate': 0.00016749099639855943, 'epoch': 0.16}


 16%|█▋        | 2037/12500 [3:33:19<16:06:03,  5.54s/it]

{'loss': 0.7883, 'grad_norm': 0.26523762941360474, 'learning_rate': 0.0001674749899959984, 'epoch': 0.16}


 16%|█▋        | 2038/12500 [3:33:24<15:44:59,  5.42s/it]

{'loss': 0.6872, 'grad_norm': 0.31485626101493835, 'learning_rate': 0.00016745898359343738, 'epoch': 0.16}


 16%|█▋        | 2039/12500 [3:33:29<14:47:53,  5.09s/it]

{'loss': 1.0552, 'grad_norm': 0.3346048891544342, 'learning_rate': 0.00016744297719087635, 'epoch': 0.16}


 16%|█▋        | 2040/12500 [3:33:33<14:18:29,  4.92s/it]

{'loss': 0.6278, 'grad_norm': 0.2681828737258911, 'learning_rate': 0.00016742697078831533, 'epoch': 0.16}


 16%|█▋        | 2041/12500 [3:33:38<14:28:43,  4.98s/it]

{'loss': 0.5646, 'grad_norm': 0.282448410987854, 'learning_rate': 0.00016741096438575433, 'epoch': 0.16}


 16%|█▋        | 2042/12500 [3:33:47<17:26:10,  6.00s/it]

{'loss': 0.6799, 'grad_norm': 0.19840086996555328, 'learning_rate': 0.00016739495798319328, 'epoch': 0.16}


 16%|█▋        | 2043/12500 [3:33:52<16:58:12,  5.84s/it]

{'loss': 0.7984, 'grad_norm': 0.31306129693984985, 'learning_rate': 0.00016737895158063225, 'epoch': 0.16}


 16%|█▋        | 2044/12500 [3:33:59<17:27:40,  6.01s/it]

{'loss': 0.9876, 'grad_norm': 0.2570992708206177, 'learning_rate': 0.00016736294517807123, 'epoch': 0.16}


 16%|█▋        | 2045/12500 [3:34:04<16:39:11,  5.73s/it]

{'loss': 0.7681, 'grad_norm': 0.31840160489082336, 'learning_rate': 0.00016734693877551023, 'epoch': 0.16}


 16%|█▋        | 2046/12500 [3:34:08<15:45:08,  5.42s/it]

{'loss': 0.6543, 'grad_norm': 0.30534201860427856, 'learning_rate': 0.00016733093237294918, 'epoch': 0.16}


 16%|█▋        | 2047/12500 [3:34:16<17:45:34,  6.12s/it]

{'loss': 0.4425, 'grad_norm': 0.20553399622440338, 'learning_rate': 0.00016731492597038815, 'epoch': 0.16}


 16%|█▋        | 2048/12500 [3:34:22<17:38:20,  6.08s/it]

{'loss': 0.5764, 'grad_norm': 0.25072240829467773, 'learning_rate': 0.00016729891956782713, 'epoch': 0.16}


 16%|█▋        | 2049/12500 [3:34:29<18:26:35,  6.35s/it]

{'loss': 0.979, 'grad_norm': 0.3183937072753906, 'learning_rate': 0.00016728291316526613, 'epoch': 0.16}


 16%|█▋        | 2050/12500 [3:34:35<17:38:55,  6.08s/it]

{'loss': 0.5676, 'grad_norm': 0.29189610481262207, 'learning_rate': 0.00016726690676270508, 'epoch': 0.16}


 16%|█▋        | 2051/12500 [3:34:40<16:46:54,  5.78s/it]

{'loss': 0.5702, 'grad_norm': 0.32345736026763916, 'learning_rate': 0.00016725090036014405, 'epoch': 0.16}


 16%|█▋        | 2052/12500 [3:34:47<18:13:36,  6.28s/it]

{'loss': 0.5598, 'grad_norm': 0.2192954421043396, 'learning_rate': 0.00016723489395758305, 'epoch': 0.16}


 16%|█▋        | 2053/12500 [3:34:52<17:24:33,  6.00s/it]

{'loss': 0.6745, 'grad_norm': 0.28285667300224304, 'learning_rate': 0.00016721888755502203, 'epoch': 0.16}


 16%|█▋        | 2054/12500 [3:35:00<18:43:33,  6.45s/it]

{'loss': 0.5494, 'grad_norm': 0.24405667185783386, 'learning_rate': 0.00016720288115246098, 'epoch': 0.16}


 16%|█▋        | 2055/12500 [3:35:03<16:12:43,  5.59s/it]

{'loss': 0.7009, 'grad_norm': 0.3035259246826172, 'learning_rate': 0.00016718687474989995, 'epoch': 0.16}


 16%|█▋        | 2056/12500 [3:35:09<16:28:09,  5.68s/it]

{'loss': 0.6671, 'grad_norm': 0.27996453642845154, 'learning_rate': 0.00016717086834733895, 'epoch': 0.16}


 16%|█▋        | 2057/12500 [3:35:20<20:42:41,  7.14s/it]

{'loss': 0.4084, 'grad_norm': 0.20955297350883484, 'learning_rate': 0.00016715486194477793, 'epoch': 0.16}


 16%|█▋        | 2058/12500 [3:35:25<19:05:31,  6.58s/it]

{'loss': 0.9183, 'grad_norm': 0.2675594389438629, 'learning_rate': 0.00016713885554221688, 'epoch': 0.16}


 16%|█▋        | 2059/12500 [3:35:32<19:40:30,  6.78s/it]

{'loss': 0.5469, 'grad_norm': 0.27788445353507996, 'learning_rate': 0.00016712284913965588, 'epoch': 0.16}


 16%|█▋        | 2060/12500 [3:35:40<20:23:00,  7.03s/it]

{'loss': 0.8343, 'grad_norm': 0.23818400502204895, 'learning_rate': 0.00016710684273709485, 'epoch': 0.16}


 16%|█▋        | 2061/12500 [3:35:45<18:27:34,  6.37s/it]

{'loss': 0.6654, 'grad_norm': 0.26292893290519714, 'learning_rate': 0.00016709083633453383, 'epoch': 0.16}


 16%|█▋        | 2062/12500 [3:35:53<19:46:21,  6.82s/it]

{'loss': 0.6751, 'grad_norm': 0.22749967873096466, 'learning_rate': 0.00016707482993197278, 'epoch': 0.16}


 17%|█▋        | 2063/12500 [3:36:02<21:26:31,  7.40s/it]

{'loss': 0.5959, 'grad_norm': 0.2390483319759369, 'learning_rate': 0.00016705882352941178, 'epoch': 0.17}


 17%|█▋        | 2064/12500 [3:36:08<20:21:50,  7.02s/it]

{'loss': 0.8131, 'grad_norm': 0.2675307095050812, 'learning_rate': 0.00016704281712685075, 'epoch': 0.17}


 17%|█▋        | 2065/12500 [3:36:13<19:03:19,  6.57s/it]

{'loss': 0.5513, 'grad_norm': 0.24658414721488953, 'learning_rate': 0.00016702681072428973, 'epoch': 0.17}


 17%|█▋        | 2066/12500 [3:36:22<21:07:08,  7.29s/it]

{'loss': 0.8708, 'grad_norm': 0.21934328973293304, 'learning_rate': 0.0001670108043217287, 'epoch': 0.17}


 17%|█▋        | 2067/12500 [3:36:27<18:43:30,  6.46s/it]

{'loss': 1.0147, 'grad_norm': 0.3190652132034302, 'learning_rate': 0.00016699479791916768, 'epoch': 0.17}


 17%|█▋        | 2068/12500 [3:36:34<19:15:35,  6.65s/it]

{'loss': 0.9059, 'grad_norm': 0.26012393832206726, 'learning_rate': 0.00016697879151660665, 'epoch': 0.17}


 17%|█▋        | 2069/12500 [3:36:44<22:34:06,  7.79s/it]

{'loss': 0.8639, 'grad_norm': 0.20905566215515137, 'learning_rate': 0.00016696278511404563, 'epoch': 0.17}


 17%|█▋        | 2070/12500 [3:36:49<19:38:53,  6.78s/it]

{'loss': 0.5695, 'grad_norm': 0.2952994406223297, 'learning_rate': 0.0001669467787114846, 'epoch': 0.17}


 17%|█▋        | 2071/12500 [3:36:55<19:06:30,  6.60s/it]

{'loss': 0.7951, 'grad_norm': 0.27043798565864563, 'learning_rate': 0.00016693077230892358, 'epoch': 0.17}


 17%|█▋        | 2072/12500 [3:36:59<17:01:38,  5.88s/it]

{'loss': 0.821, 'grad_norm': 0.3363490104675293, 'learning_rate': 0.00016691476590636255, 'epoch': 0.17}


 17%|█▋        | 2073/12500 [3:37:09<20:47:28,  7.18s/it]

{'loss': 1.1665, 'grad_norm': 0.22013510763645172, 'learning_rate': 0.00016689875950380153, 'epoch': 0.17}


 17%|█▋        | 2074/12500 [3:37:16<20:43:55,  7.16s/it]

{'loss': 1.0369, 'grad_norm': 0.2573276460170746, 'learning_rate': 0.0001668827531012405, 'epoch': 0.17}


 17%|█▋        | 2075/12500 [3:37:22<19:23:57,  6.70s/it]

{'loss': 0.8688, 'grad_norm': 0.2759884297847748, 'learning_rate': 0.00016686674669867948, 'epoch': 0.17}


 17%|█▋        | 2076/12500 [3:37:28<19:06:09,  6.60s/it]

{'loss': 0.7198, 'grad_norm': 0.3163645565509796, 'learning_rate': 0.00016685074029611845, 'epoch': 0.17}


 17%|█▋        | 2077/12500 [3:37:35<19:17:52,  6.67s/it]

{'loss': 0.6528, 'grad_norm': 0.24369527399539948, 'learning_rate': 0.00016683473389355743, 'epoch': 0.17}


 17%|█▋        | 2078/12500 [3:37:43<19:59:43,  6.91s/it]

{'loss': 0.8014, 'grad_norm': 0.2129727005958557, 'learning_rate': 0.0001668187274909964, 'epoch': 0.17}


 17%|█▋        | 2079/12500 [3:37:48<18:43:58,  6.47s/it]

{'loss': 0.5374, 'grad_norm': 0.25994211435317993, 'learning_rate': 0.00016680272108843537, 'epoch': 0.17}


 17%|█▋        | 2080/12500 [3:37:52<16:42:08,  5.77s/it]

{'loss': 0.6418, 'grad_norm': 0.28748980164527893, 'learning_rate': 0.00016678671468587438, 'epoch': 0.17}


 17%|█▋        | 2081/12500 [3:38:00<18:37:52,  6.44s/it]

{'loss': 1.0238, 'grad_norm': 0.25852110981941223, 'learning_rate': 0.00016677070828331332, 'epoch': 0.17}


 17%|█▋        | 2082/12500 [3:38:06<18:05:32,  6.25s/it]

{'loss': 0.6575, 'grad_norm': 0.25860610604286194, 'learning_rate': 0.0001667547018807523, 'epoch': 0.17}


 17%|█▋        | 2083/12500 [3:38:10<16:16:00,  5.62s/it]

{'loss': 0.5571, 'grad_norm': 0.3124640882015228, 'learning_rate': 0.00016673869547819127, 'epoch': 0.17}


 17%|█▋        | 2084/12500 [3:38:20<20:00:23,  6.91s/it]

{'loss': 0.5249, 'grad_norm': 0.23975160717964172, 'learning_rate': 0.00016672268907563028, 'epoch': 0.17}


 17%|█▋        | 2085/12500 [3:38:25<18:03:15,  6.24s/it]

{'loss': 0.8997, 'grad_norm': 0.3732069432735443, 'learning_rate': 0.00016670668267306922, 'epoch': 0.17}


 17%|█▋        | 2086/12500 [3:38:30<16:53:20,  5.84s/it]

{'loss': 0.6137, 'grad_norm': 0.25600144267082214, 'learning_rate': 0.0001666906762705082, 'epoch': 0.17}


 17%|█▋        | 2087/12500 [3:38:34<15:19:34,  5.30s/it]

{'loss': 0.6208, 'grad_norm': 0.27716195583343506, 'learning_rate': 0.0001666746698679472, 'epoch': 0.17}


 17%|█▋        | 2088/12500 [3:38:41<16:42:28,  5.78s/it]

{'loss': 0.6965, 'grad_norm': 0.22751253843307495, 'learning_rate': 0.00016665866346538618, 'epoch': 0.17}


 17%|█▋        | 2089/12500 [3:38:46<16:39:19,  5.76s/it]

{'loss': 0.8167, 'grad_norm': 0.27112042903900146, 'learning_rate': 0.00016664265706282512, 'epoch': 0.17}


 17%|█▋        | 2090/12500 [3:38:55<19:22:10,  6.70s/it]

{'loss': 0.883, 'grad_norm': 0.20963145792484283, 'learning_rate': 0.0001666266506602641, 'epoch': 0.17}


 17%|█▋        | 2091/12500 [3:39:02<19:04:48,  6.60s/it]

{'loss': 0.7458, 'grad_norm': 0.23504678905010223, 'learning_rate': 0.0001666106442577031, 'epoch': 0.17}


 17%|█▋        | 2092/12500 [3:39:08<18:36:55,  6.44s/it]

{'loss': 0.7183, 'grad_norm': 0.22799085080623627, 'learning_rate': 0.00016659463785514208, 'epoch': 0.17}


 17%|█▋        | 2093/12500 [3:39:17<20:45:31,  7.18s/it]

{'loss': 0.845, 'grad_norm': 0.21894310414791107, 'learning_rate': 0.00016657863145258102, 'epoch': 0.17}


 17%|█▋        | 2094/12500 [3:39:25<21:48:02,  7.54s/it]

{'loss': 0.8183, 'grad_norm': 0.210511714220047, 'learning_rate': 0.00016656262505002002, 'epoch': 0.17}


 17%|█▋        | 2095/12500 [3:39:30<19:18:46,  6.68s/it]

{'loss': 0.6391, 'grad_norm': 0.2593984305858612, 'learning_rate': 0.000166546618647459, 'epoch': 0.17}


 17%|█▋        | 2096/12500 [3:39:37<20:14:56,  7.01s/it]

{'loss': 0.7249, 'grad_norm': 0.23237569630146027, 'learning_rate': 0.00016653061224489797, 'epoch': 0.17}


 17%|█▋        | 2097/12500 [3:39:41<17:14:37,  5.97s/it]

{'loss': 0.6072, 'grad_norm': 0.27080971002578735, 'learning_rate': 0.00016651460584233692, 'epoch': 0.17}


 17%|█▋        | 2098/12500 [3:39:47<16:56:06,  5.86s/it]

{'loss': 0.7535, 'grad_norm': 0.30873897671699524, 'learning_rate': 0.00016649859943977592, 'epoch': 0.17}


 17%|█▋        | 2099/12500 [3:39:52<16:42:49,  5.78s/it]

{'loss': 0.786, 'grad_norm': 0.2800891697406769, 'learning_rate': 0.0001664825930372149, 'epoch': 0.17}


 17%|█▋        | 2100/12500 [3:39:59<17:26:53,  6.04s/it]

{'loss': 0.7958, 'grad_norm': 0.23614315688610077, 'learning_rate': 0.00016646658663465387, 'epoch': 0.17}


 17%|█▋        | 2101/12500 [3:40:05<17:47:45,  6.16s/it]

{'loss': 0.8164, 'grad_norm': 0.3064550757408142, 'learning_rate': 0.00016645058023209282, 'epoch': 0.17}


 17%|█▋        | 2102/12500 [3:40:11<17:47:12,  6.16s/it]

{'loss': 0.659, 'grad_norm': 0.24225497245788574, 'learning_rate': 0.00016643457382953182, 'epoch': 0.17}


 17%|█▋        | 2103/12500 [3:40:21<20:38:51,  7.15s/it]

{'loss': 0.3945, 'grad_norm': 0.207121804356575, 'learning_rate': 0.0001664185674269708, 'epoch': 0.17}


 17%|█▋        | 2104/12500 [3:40:26<19:05:39,  6.61s/it]

{'loss': 0.5451, 'grad_norm': 0.2491857409477234, 'learning_rate': 0.00016640256102440977, 'epoch': 0.17}


 17%|█▋        | 2105/12500 [3:40:33<19:25:39,  6.73s/it]

{'loss': 0.8372, 'grad_norm': 0.25159960985183716, 'learning_rate': 0.00016638655462184875, 'epoch': 0.17}


 17%|█▋        | 2106/12500 [3:40:40<19:23:22,  6.72s/it]

{'loss': 0.6768, 'grad_norm': 0.28227531909942627, 'learning_rate': 0.00016637054821928772, 'epoch': 0.17}


 17%|█▋        | 2107/12500 [3:40:46<19:01:18,  6.59s/it]

{'loss': 0.5894, 'grad_norm': 0.22882424294948578, 'learning_rate': 0.0001663545418167267, 'epoch': 0.17}


 17%|█▋        | 2108/12500 [3:40:55<21:01:16,  7.28s/it]

{'loss': 0.7776, 'grad_norm': 0.20442906022071838, 'learning_rate': 0.00016633853541416567, 'epoch': 0.17}


 17%|█▋        | 2109/12500 [3:41:00<18:58:36,  6.57s/it]

{'loss': 0.5913, 'grad_norm': 0.25939860939979553, 'learning_rate': 0.00016632252901160465, 'epoch': 0.17}


 17%|█▋        | 2110/12500 [3:41:11<22:49:19,  7.91s/it]

{'loss': 0.8394, 'grad_norm': 0.18505246937274933, 'learning_rate': 0.00016630652260904362, 'epoch': 0.17}


 17%|█▋        | 2111/12500 [3:41:16<20:36:38,  7.14s/it]

{'loss': 0.7734, 'grad_norm': 0.2846590578556061, 'learning_rate': 0.0001662905162064826, 'epoch': 0.17}


 17%|█▋        | 2112/12500 [3:41:21<18:24:13,  6.38s/it]

{'loss': 0.7579, 'grad_norm': 0.35035067796707153, 'learning_rate': 0.00016627450980392157, 'epoch': 0.17}


 17%|█▋        | 2113/12500 [3:41:30<20:56:30,  7.26s/it]

{'loss': 1.0096, 'grad_norm': 0.22136850655078888, 'learning_rate': 0.00016625850340136055, 'epoch': 0.17}


 17%|█▋        | 2114/12500 [3:41:39<21:48:52,  7.56s/it]

{'loss': 0.6253, 'grad_norm': 0.22314298152923584, 'learning_rate': 0.00016624249699879952, 'epoch': 0.17}


 17%|█▋        | 2115/12500 [3:41:43<19:27:37,  6.75s/it]

{'loss': 0.6921, 'grad_norm': 0.30845868587493896, 'learning_rate': 0.0001662264905962385, 'epoch': 0.17}


 17%|█▋        | 2116/12500 [3:41:51<20:18:13,  7.04s/it]

{'loss': 0.815, 'grad_norm': 0.269992858171463, 'learning_rate': 0.00016621048419367747, 'epoch': 0.17}


 17%|█▋        | 2117/12500 [3:42:01<22:30:59,  7.81s/it]

{'loss': 0.9664, 'grad_norm': 0.3189418613910675, 'learning_rate': 0.00016619447779111645, 'epoch': 0.17}


 17%|█▋        | 2118/12500 [3:42:06<20:10:56,  7.00s/it]

{'loss': 0.7511, 'grad_norm': 0.2978378236293793, 'learning_rate': 0.00016617847138855542, 'epoch': 0.17}


 17%|█▋        | 2119/12500 [3:42:12<19:16:15,  6.68s/it]

{'loss': 0.7334, 'grad_norm': 0.2353467047214508, 'learning_rate': 0.00016616246498599442, 'epoch': 0.17}


 17%|█▋        | 2120/12500 [3:42:17<18:23:04,  6.38s/it]

{'loss': 0.6965, 'grad_norm': 0.2975025475025177, 'learning_rate': 0.00016614645858343337, 'epoch': 0.17}


 17%|█▋        | 2121/12500 [3:42:23<18:07:27,  6.29s/it]

{'loss': 0.6469, 'grad_norm': 0.296087384223938, 'learning_rate': 0.00016613045218087235, 'epoch': 0.17}


 17%|█▋        | 2122/12500 [3:42:28<16:47:09,  5.82s/it]

{'loss': 0.6061, 'grad_norm': 0.2698802649974823, 'learning_rate': 0.00016611444577831132, 'epoch': 0.17}


 17%|█▋        | 2123/12500 [3:42:35<17:12:17,  5.97s/it]

{'loss': 0.8996, 'grad_norm': 0.29718178510665894, 'learning_rate': 0.00016609843937575032, 'epoch': 0.17}


 17%|█▋        | 2124/12500 [3:42:40<16:32:38,  5.74s/it]

{'loss': 0.6352, 'grad_norm': 0.2443820685148239, 'learning_rate': 0.00016608243297318927, 'epoch': 0.17}


 17%|█▋        | 2125/12500 [3:42:44<14:50:53,  5.15s/it]

{'loss': 0.802, 'grad_norm': 0.3469574749469757, 'learning_rate': 0.00016606642657062825, 'epoch': 0.17}


 17%|█▋        | 2126/12500 [3:42:51<16:36:14,  5.76s/it]

{'loss': 0.6237, 'grad_norm': 0.25968843698501587, 'learning_rate': 0.00016605042016806725, 'epoch': 0.17}


 17%|█▋        | 2127/12500 [3:42:55<15:21:58,  5.33s/it]

{'loss': 0.8065, 'grad_norm': 0.3346714973449707, 'learning_rate': 0.00016603441376550622, 'epoch': 0.17}


 17%|█▋        | 2128/12500 [3:43:05<19:15:45,  6.69s/it]

{'loss': 0.6491, 'grad_norm': 0.2022280991077423, 'learning_rate': 0.00016601840736294517, 'epoch': 0.17}


 17%|█▋        | 2129/12500 [3:43:11<18:57:14,  6.58s/it]

{'loss': 0.9097, 'grad_norm': 0.25969913601875305, 'learning_rate': 0.00016600240096038414, 'epoch': 0.17}


 17%|█▋        | 2130/12500 [3:43:19<19:47:57,  6.87s/it]

{'loss': 0.8058, 'grad_norm': 0.22605609893798828, 'learning_rate': 0.00016598639455782315, 'epoch': 0.17}


 17%|█▋        | 2131/12500 [3:43:28<22:04:54,  7.67s/it]

{'loss': 0.9217, 'grad_norm': 0.21465183794498444, 'learning_rate': 0.00016597038815526212, 'epoch': 0.17}


 17%|█▋        | 2132/12500 [3:43:35<21:09:29,  7.35s/it]

{'loss': 0.6496, 'grad_norm': 0.24880053102970123, 'learning_rate': 0.00016595438175270107, 'epoch': 0.17}


 17%|█▋        | 2133/12500 [3:43:40<19:29:14,  6.77s/it]

{'loss': 0.6217, 'grad_norm': 0.3144083619117737, 'learning_rate': 0.00016593837535014007, 'epoch': 0.17}


 17%|█▋        | 2134/12500 [3:43:46<18:09:15,  6.30s/it]

{'loss': 0.5455, 'grad_norm': 0.2744249105453491, 'learning_rate': 0.00016592236894757905, 'epoch': 0.17}


 17%|█▋        | 2135/12500 [3:43:55<21:10:05,  7.35s/it]

{'loss': 0.8128, 'grad_norm': 0.24203038215637207, 'learning_rate': 0.00016590636254501802, 'epoch': 0.17}


 17%|█▋        | 2136/12500 [3:44:01<19:36:57,  6.81s/it]

{'loss': 0.621, 'grad_norm': 0.2776091992855072, 'learning_rate': 0.00016589035614245697, 'epoch': 0.17}


 17%|█▋        | 2137/12500 [3:44:07<18:53:48,  6.56s/it]

{'loss': 0.4857, 'grad_norm': 0.302334189414978, 'learning_rate': 0.00016587434973989597, 'epoch': 0.17}


 17%|█▋        | 2138/12500 [3:44:12<17:42:12,  6.15s/it]

{'loss': 0.9303, 'grad_norm': 0.36423420906066895, 'learning_rate': 0.00016585834333733495, 'epoch': 0.17}


 17%|█▋        | 2139/12500 [3:44:18<17:54:59,  6.23s/it]

{'loss': 0.7103, 'grad_norm': 0.244392529129982, 'learning_rate': 0.00016584233693477392, 'epoch': 0.17}


 17%|█▋        | 2140/12500 [3:44:24<17:15:23,  6.00s/it]

{'loss': 0.8204, 'grad_norm': 0.291935533285141, 'learning_rate': 0.0001658263305322129, 'epoch': 0.17}


 17%|█▋        | 2141/12500 [3:44:29<16:51:18,  5.86s/it]

{'loss': 0.7686, 'grad_norm': 0.30435606837272644, 'learning_rate': 0.00016581032412965187, 'epoch': 0.17}


 17%|█▋        | 2142/12500 [3:44:36<17:10:54,  5.97s/it]

{'loss': 0.8219, 'grad_norm': 0.28805598616600037, 'learning_rate': 0.00016579431772709084, 'epoch': 0.17}


 17%|█▋        | 2143/12500 [3:44:42<17:44:11,  6.17s/it]

{'loss': 0.8295, 'grad_norm': 0.22249332070350647, 'learning_rate': 0.00016577831132452982, 'epoch': 0.17}


 17%|█▋        | 2144/12500 [3:44:50<19:27:21,  6.76s/it]

{'loss': 0.5273, 'grad_norm': 0.22684702277183533, 'learning_rate': 0.0001657623049219688, 'epoch': 0.17}


 17%|█▋        | 2145/12500 [3:45:01<22:30:16,  7.82s/it]

{'loss': 1.0591, 'grad_norm': 0.2103700488805771, 'learning_rate': 0.00016574629851940777, 'epoch': 0.17}


 17%|█▋        | 2146/12500 [3:45:04<18:58:16,  6.60s/it]

{'loss': 0.5489, 'grad_norm': 0.3070583939552307, 'learning_rate': 0.00016573029211684674, 'epoch': 0.17}


 17%|█▋        | 2147/12500 [3:45:09<17:30:52,  6.09s/it]

{'loss': 0.832, 'grad_norm': 0.3999873399734497, 'learning_rate': 0.00016571428571428575, 'epoch': 0.17}


 17%|█▋        | 2148/12500 [3:45:15<17:01:32,  5.92s/it]

{'loss': 0.7954, 'grad_norm': 0.26018083095550537, 'learning_rate': 0.0001656982793117247, 'epoch': 0.17}


 17%|█▋        | 2149/12500 [3:45:23<18:57:31,  6.59s/it]

{'loss': 0.9802, 'grad_norm': 0.23427337408065796, 'learning_rate': 0.00016568227290916367, 'epoch': 0.17}


 17%|█▋        | 2150/12500 [3:45:28<17:22:49,  6.05s/it]

{'loss': 0.7634, 'grad_norm': 0.2501307427883148, 'learning_rate': 0.00016566626650660264, 'epoch': 0.17}


 17%|█▋        | 2151/12500 [3:45:32<16:07:01,  5.61s/it]

{'loss': 0.5204, 'grad_norm': 0.2807638943195343, 'learning_rate': 0.00016565026010404165, 'epoch': 0.17}


 17%|█▋        | 2152/12500 [3:45:40<17:58:00,  6.25s/it]

{'loss': 0.7553, 'grad_norm': 0.2095324695110321, 'learning_rate': 0.0001656342537014806, 'epoch': 0.17}


 17%|█▋        | 2153/12500 [3:45:47<18:18:43,  6.37s/it]

{'loss': 0.5778, 'grad_norm': 0.22954688966274261, 'learning_rate': 0.00016561824729891957, 'epoch': 0.17}


 17%|█▋        | 2154/12500 [3:45:55<19:50:15,  6.90s/it]

{'loss': 1.0775, 'grad_norm': 0.280926913022995, 'learning_rate': 0.00016560224089635854, 'epoch': 0.17}


 17%|█▋        | 2155/12500 [3:46:01<19:20:46,  6.73s/it]

{'loss': 0.885, 'grad_norm': 0.35475748777389526, 'learning_rate': 0.00016558623449379755, 'epoch': 0.17}


 17%|█▋        | 2156/12500 [3:46:06<17:31:28,  6.10s/it]

{'loss': 0.8297, 'grad_norm': 0.2945694625377655, 'learning_rate': 0.0001655702280912365, 'epoch': 0.17}


 17%|█▋        | 2157/12500 [3:46:13<18:08:29,  6.31s/it]

{'loss': 0.5007, 'grad_norm': 0.24273300170898438, 'learning_rate': 0.00016555422168867547, 'epoch': 0.17}


 17%|█▋        | 2158/12500 [3:46:21<20:12:34,  7.03s/it]

{'loss': 0.4244, 'grad_norm': 0.19543127715587616, 'learning_rate': 0.00016553821528611447, 'epoch': 0.17}


 17%|█▋        | 2159/12500 [3:46:27<19:13:03,  6.69s/it]

{'loss': 0.7168, 'grad_norm': 0.26701945066452026, 'learning_rate': 0.00016552220888355344, 'epoch': 0.17}


 17%|█▋        | 2160/12500 [3:46:36<21:14:25,  7.40s/it]

{'loss': 0.6107, 'grad_norm': 0.22268493473529816, 'learning_rate': 0.0001655062024809924, 'epoch': 0.17}


 17%|█▋        | 2161/12500 [3:46:41<18:25:28,  6.42s/it]

{'loss': 0.7962, 'grad_norm': 0.3514796197414398, 'learning_rate': 0.00016549019607843137, 'epoch': 0.17}


 17%|█▋        | 2162/12500 [3:46:48<19:19:09,  6.73s/it]

{'loss': 0.7892, 'grad_norm': 0.2671913206577301, 'learning_rate': 0.00016547418967587037, 'epoch': 0.17}


 17%|█▋        | 2163/12500 [3:46:53<18:13:45,  6.35s/it]

{'loss': 0.7818, 'grad_norm': 0.2653387188911438, 'learning_rate': 0.00016545818327330934, 'epoch': 0.17}


 17%|█▋        | 2164/12500 [3:46:59<17:51:07,  6.22s/it]

{'loss': 0.6231, 'grad_norm': 0.30295348167419434, 'learning_rate': 0.0001654421768707483, 'epoch': 0.17}


 17%|█▋        | 2165/12500 [3:47:07<19:14:12,  6.70s/it]

{'loss': 0.9375, 'grad_norm': 0.24232174456119537, 'learning_rate': 0.0001654261704681873, 'epoch': 0.17}


 17%|█▋        | 2166/12500 [3:47:11<16:34:39,  5.78s/it]

{'loss': 0.7638, 'grad_norm': 0.3712044060230255, 'learning_rate': 0.00016541016406562627, 'epoch': 0.17}


 17%|█▋        | 2167/12500 [3:47:17<16:51:47,  5.88s/it]

{'loss': 0.661, 'grad_norm': 0.2439597249031067, 'learning_rate': 0.00016539415766306524, 'epoch': 0.17}


 17%|█▋        | 2168/12500 [3:47:21<15:37:45,  5.45s/it]

{'loss': 0.4545, 'grad_norm': 0.24265773594379425, 'learning_rate': 0.0001653781512605042, 'epoch': 0.17}


 17%|█▋        | 2169/12500 [3:47:27<15:32:10,  5.41s/it]

{'loss': 0.7888, 'grad_norm': 0.333423912525177, 'learning_rate': 0.0001653621448579432, 'epoch': 0.17}


 17%|█▋        | 2170/12500 [3:47:33<16:40:50,  5.81s/it]

{'loss': 0.7363, 'grad_norm': 0.27250567078590393, 'learning_rate': 0.00016534613845538217, 'epoch': 0.17}


 17%|█▋        | 2171/12500 [3:47:42<19:06:56,  6.66s/it]

{'loss': 0.7234, 'grad_norm': 0.23708991706371307, 'learning_rate': 0.00016533013205282114, 'epoch': 0.17}


 17%|█▋        | 2172/12500 [3:47:50<20:33:48,  7.17s/it]

{'loss': 0.4528, 'grad_norm': 0.23182620108127594, 'learning_rate': 0.00016531412565026012, 'epoch': 0.17}


 17%|█▋        | 2173/12500 [3:47:56<19:10:11,  6.68s/it]

{'loss': 0.7069, 'grad_norm': 0.342498242855072, 'learning_rate': 0.0001652981192476991, 'epoch': 0.17}


 17%|█▋        | 2174/12500 [3:48:04<20:27:51,  7.13s/it]

{'loss': 0.7696, 'grad_norm': 0.21894493699073792, 'learning_rate': 0.00016528211284513807, 'epoch': 0.17}


 17%|█▋        | 2175/12500 [3:48:09<18:51:02,  6.57s/it]

{'loss': 0.5791, 'grad_norm': 0.2919706106185913, 'learning_rate': 0.00016526610644257704, 'epoch': 0.17}


 17%|█▋        | 2176/12500 [3:48:18<20:31:13,  7.16s/it]

{'loss': 0.9803, 'grad_norm': 0.249019056558609, 'learning_rate': 0.00016525010004001602, 'epoch': 0.17}


 17%|█▋        | 2177/12500 [3:48:23<18:43:36,  6.53s/it]

{'loss': 0.692, 'grad_norm': 0.31111329793930054, 'learning_rate': 0.000165234093637455, 'epoch': 0.17}


 17%|█▋        | 2178/12500 [3:48:30<19:02:39,  6.64s/it]

{'loss': 0.5805, 'grad_norm': 0.29172441363334656, 'learning_rate': 0.00016521808723489397, 'epoch': 0.17}


 17%|█▋        | 2179/12500 [3:48:35<18:05:13,  6.31s/it]

{'loss': 0.8968, 'grad_norm': 0.37666064500808716, 'learning_rate': 0.00016520208083233294, 'epoch': 0.17}


 17%|█▋        | 2180/12500 [3:48:42<17:52:50,  6.24s/it]

{'loss': 0.6181, 'grad_norm': 0.24901746213436127, 'learning_rate': 0.00016518607442977192, 'epoch': 0.17}


 17%|█▋        | 2181/12500 [3:48:46<16:08:59,  5.63s/it]

{'loss': 0.8186, 'grad_norm': 0.296868234872818, 'learning_rate': 0.0001651700680272109, 'epoch': 0.17}


 17%|█▋        | 2182/12500 [3:48:52<16:48:10,  5.86s/it]

{'loss': 0.6343, 'grad_norm': 0.2687155604362488, 'learning_rate': 0.00016515406162464987, 'epoch': 0.17}


 17%|█▋        | 2183/12500 [3:49:01<19:09:00,  6.68s/it]

{'loss': 0.4238, 'grad_norm': 0.2192431092262268, 'learning_rate': 0.00016513805522208884, 'epoch': 0.17}


 17%|█▋        | 2184/12500 [3:49:10<21:46:35,  7.60s/it]

{'loss': 0.7464, 'grad_norm': 0.22280706465244293, 'learning_rate': 0.00016512204881952782, 'epoch': 0.17}


 17%|█▋        | 2185/12500 [3:49:18<21:18:35,  7.44s/it]

{'loss': 0.5915, 'grad_norm': 0.23528127372264862, 'learning_rate': 0.0001651060424169668, 'epoch': 0.17}


 17%|█▋        | 2186/12500 [3:49:24<20:51:50,  7.28s/it]

{'loss': 0.7, 'grad_norm': 0.2709900736808777, 'learning_rate': 0.0001650900360144058, 'epoch': 0.17}


 17%|█▋        | 2187/12500 [3:49:32<21:00:05,  7.33s/it]

{'loss': 0.4332, 'grad_norm': 0.18693093955516815, 'learning_rate': 0.00016507402961184474, 'epoch': 0.17}


 18%|█▊        | 2188/12500 [3:49:36<18:24:16,  6.43s/it]

{'loss': 0.8761, 'grad_norm': 0.33208391070365906, 'learning_rate': 0.00016505802320928372, 'epoch': 0.18}


 18%|█▊        | 2189/12500 [3:49:42<18:12:31,  6.36s/it]

{'loss': 0.7676, 'grad_norm': 0.24853363633155823, 'learning_rate': 0.0001650420168067227, 'epoch': 0.18}


 18%|█▊        | 2190/12500 [3:49:52<20:35:58,  7.19s/it]

{'loss': 0.9765, 'grad_norm': 0.2422075867652893, 'learning_rate': 0.0001650260104041617, 'epoch': 0.18}


 18%|█▊        | 2191/12500 [3:49:56<18:21:56,  6.41s/it]

{'loss': 0.6445, 'grad_norm': 0.2959028482437134, 'learning_rate': 0.00016501000400160064, 'epoch': 0.18}


 18%|█▊        | 2192/12500 [3:50:02<17:45:33,  6.20s/it]

{'loss': 0.5529, 'grad_norm': 0.23923055827617645, 'learning_rate': 0.00016499399759903961, 'epoch': 0.18}


 18%|█▊        | 2193/12500 [3:50:07<16:47:13,  5.86s/it]

{'loss': 0.8561, 'grad_norm': 0.3763507604598999, 'learning_rate': 0.00016497799119647862, 'epoch': 0.18}


 18%|█▊        | 2194/12500 [3:50:15<18:36:06,  6.50s/it]

{'loss': 0.5473, 'grad_norm': 0.2241629809141159, 'learning_rate': 0.0001649619847939176, 'epoch': 0.18}


 18%|█▊        | 2195/12500 [3:50:22<18:46:46,  6.56s/it]

{'loss': 0.9462, 'grad_norm': 0.25675782561302185, 'learning_rate': 0.00016494597839135654, 'epoch': 0.18}


 18%|█▊        | 2196/12500 [3:50:28<18:38:29,  6.51s/it]

{'loss': 1.042, 'grad_norm': 0.29101961851119995, 'learning_rate': 0.00016492997198879551, 'epoch': 0.18}


 18%|█▊        | 2197/12500 [3:50:35<19:06:04,  6.67s/it]

{'loss': 0.6998, 'grad_norm': 0.2880254089832306, 'learning_rate': 0.00016491396558623452, 'epoch': 0.18}


 18%|█▊        | 2198/12500 [3:50:41<18:22:44,  6.42s/it]

{'loss': 0.6494, 'grad_norm': 0.2748594880104065, 'learning_rate': 0.0001648979591836735, 'epoch': 0.18}


 18%|█▊        | 2199/12500 [3:50:47<17:46:08,  6.21s/it]

{'loss': 0.6709, 'grad_norm': 0.29462748765945435, 'learning_rate': 0.00016488195278111244, 'epoch': 0.18}


 18%|█▊        | 2200/12500 [3:50:55<19:35:36,  6.85s/it]

{'loss': 0.658, 'grad_norm': 0.20606358349323273, 'learning_rate': 0.00016486594637855144, 'epoch': 0.18}


 18%|█▊        | 2201/12500 [3:51:04<21:13:00,  7.42s/it]

{'loss': 0.9249, 'grad_norm': 0.23287703096866608, 'learning_rate': 0.00016484993997599042, 'epoch': 0.18}


 18%|█▊        | 2202/12500 [3:51:08<18:30:15,  6.47s/it]

{'loss': 0.5894, 'grad_norm': 0.2619258165359497, 'learning_rate': 0.0001648339335734294, 'epoch': 0.18}


 18%|█▊        | 2203/12500 [3:51:14<18:08:33,  6.34s/it]

{'loss': 0.7771, 'grad_norm': 0.29400062561035156, 'learning_rate': 0.00016481792717086834, 'epoch': 0.18}


 18%|█▊        | 2204/12500 [3:51:20<18:11:27,  6.36s/it]

{'loss': 0.7446, 'grad_norm': 0.24945610761642456, 'learning_rate': 0.00016480192076830734, 'epoch': 0.18}


 18%|█▊        | 2205/12500 [3:51:30<20:37:13,  7.21s/it]

{'loss': 1.096, 'grad_norm': 0.2411763072013855, 'learning_rate': 0.00016478591436574631, 'epoch': 0.18}


 18%|█▊        | 2206/12500 [3:51:35<18:55:53,  6.62s/it]

{'loss': 0.7845, 'grad_norm': 0.32259401679039, 'learning_rate': 0.0001647699079631853, 'epoch': 0.18}


 18%|█▊        | 2207/12500 [3:51:44<20:48:15,  7.28s/it]

{'loss': 0.7399, 'grad_norm': 0.20071084797382355, 'learning_rate': 0.00016475390156062426, 'epoch': 0.18}


 18%|█▊        | 2208/12500 [3:51:51<20:31:24,  7.18s/it]

{'loss': 0.5378, 'grad_norm': 0.2888025641441345, 'learning_rate': 0.00016473789515806324, 'epoch': 0.18}


 18%|█▊        | 2209/12500 [3:51:55<18:00:40,  6.30s/it]

{'loss': 0.6059, 'grad_norm': 0.28519633412361145, 'learning_rate': 0.00016472188875550221, 'epoch': 0.18}


 18%|█▊        | 2210/12500 [3:52:01<17:26:45,  6.10s/it]

{'loss': 0.7714, 'grad_norm': 0.2751677930355072, 'learning_rate': 0.0001647058823529412, 'epoch': 0.18}


 18%|█▊        | 2211/12500 [3:52:08<18:23:29,  6.43s/it]

{'loss': 0.8581, 'grad_norm': 0.21588322520256042, 'learning_rate': 0.00016468987595038016, 'epoch': 0.18}


 18%|█▊        | 2212/12500 [3:52:16<19:33:43,  6.85s/it]

{'loss': 0.5361, 'grad_norm': 0.20392780005931854, 'learning_rate': 0.00016467386954781914, 'epoch': 0.18}


 18%|█▊        | 2213/12500 [3:52:21<18:27:02,  6.46s/it]

{'loss': 0.9512, 'grad_norm': 0.27789929509162903, 'learning_rate': 0.00016465786314525811, 'epoch': 0.18}


 18%|█▊        | 2214/12500 [3:52:25<16:25:03,  5.75s/it]

{'loss': 0.5827, 'grad_norm': 0.3188078999519348, 'learning_rate': 0.0001646418567426971, 'epoch': 0.18}


 18%|█▊        | 2215/12500 [3:52:32<17:22:36,  6.08s/it]

{'loss': 0.7258, 'grad_norm': 0.23624977469444275, 'learning_rate': 0.00016462585034013606, 'epoch': 0.18}


 18%|█▊        | 2216/12500 [3:52:41<19:56:04,  6.98s/it]

{'loss': 0.6024, 'grad_norm': 0.19529369473457336, 'learning_rate': 0.00016460984393757504, 'epoch': 0.18}


 18%|█▊        | 2217/12500 [3:52:46<17:46:02,  6.22s/it]

{'loss': 0.7383, 'grad_norm': 0.3486019968986511, 'learning_rate': 0.000164593837535014, 'epoch': 0.18}


 18%|█▊        | 2218/12500 [3:52:52<18:16:28,  6.40s/it]

{'loss': 0.8552, 'grad_norm': 0.24344561994075775, 'learning_rate': 0.000164577831132453, 'epoch': 0.18}


 18%|█▊        | 2219/12500 [3:53:00<19:15:06,  6.74s/it]

{'loss': 0.7744, 'grad_norm': 0.25584864616394043, 'learning_rate': 0.00016456182472989196, 'epoch': 0.18}


 18%|█▊        | 2220/12500 [3:53:06<18:16:16,  6.40s/it]

{'loss': 0.5672, 'grad_norm': 0.28883200883865356, 'learning_rate': 0.00016454581832733094, 'epoch': 0.18}


 18%|█▊        | 2221/12500 [3:53:13<18:58:45,  6.65s/it]

{'loss': 0.5811, 'grad_norm': 0.20947429537773132, 'learning_rate': 0.0001645298119247699, 'epoch': 0.18}


 18%|█▊        | 2222/12500 [3:53:17<17:16:39,  6.05s/it]

{'loss': 0.5758, 'grad_norm': 0.2610095143318176, 'learning_rate': 0.0001645138055222089, 'epoch': 0.18}


 18%|█▊        | 2223/12500 [3:53:26<19:26:50,  6.81s/it]

{'loss': 0.7379, 'grad_norm': 0.19244784116744995, 'learning_rate': 0.00016449779911964786, 'epoch': 0.18}


 18%|█▊        | 2224/12500 [3:53:31<18:09:37,  6.36s/it]

{'loss': 0.9063, 'grad_norm': 0.36319640278816223, 'learning_rate': 0.00016448179271708684, 'epoch': 0.18}


 18%|█▊        | 2225/12500 [3:53:37<17:18:28,  6.06s/it]

{'loss': 0.5823, 'grad_norm': 0.25454050302505493, 'learning_rate': 0.00016446578631452584, 'epoch': 0.18}


 18%|█▊        | 2226/12500 [3:53:46<20:08:44,  7.06s/it]

{'loss': 1.1677, 'grad_norm': 0.22505119442939758, 'learning_rate': 0.0001644497799119648, 'epoch': 0.18}


 18%|█▊        | 2227/12500 [3:53:53<20:17:29,  7.11s/it]

{'loss': 0.6851, 'grad_norm': 0.2864771783351898, 'learning_rate': 0.00016443377350940376, 'epoch': 0.18}


 18%|█▊        | 2228/12500 [3:53:58<18:23:29,  6.45s/it]

{'loss': 0.684, 'grad_norm': 0.3082767724990845, 'learning_rate': 0.00016441776710684274, 'epoch': 0.18}


 18%|█▊        | 2229/12500 [3:54:04<17:46:47,  6.23s/it]

{'loss': 0.8907, 'grad_norm': 0.33531033992767334, 'learning_rate': 0.00016440176070428174, 'epoch': 0.18}


 18%|█▊        | 2230/12500 [3:54:10<17:30:00,  6.13s/it]

{'loss': 0.5925, 'grad_norm': 0.2825332283973694, 'learning_rate': 0.00016438575430172069, 'epoch': 0.18}


 18%|█▊        | 2231/12500 [3:54:15<16:17:49,  5.71s/it]

{'loss': 1.0459, 'grad_norm': 0.31073760986328125, 'learning_rate': 0.00016436974789915966, 'epoch': 0.18}


 18%|█▊        | 2232/12500 [3:54:22<17:30:29,  6.14s/it]

{'loss': 1.0315, 'grad_norm': 0.2691015601158142, 'learning_rate': 0.00016435374149659866, 'epoch': 0.18}


 18%|█▊        | 2233/12500 [3:54:28<17:16:50,  6.06s/it]

{'loss': 0.7873, 'grad_norm': 0.25136563181877136, 'learning_rate': 0.00016433773509403764, 'epoch': 0.18}


 18%|█▊        | 2234/12500 [3:54:33<16:53:37,  5.92s/it]

{'loss': 0.8275, 'grad_norm': 0.2640881836414337, 'learning_rate': 0.00016432172869147659, 'epoch': 0.18}


 18%|█▊        | 2235/12500 [3:54:38<15:49:55,  5.55s/it]

{'loss': 0.5415, 'grad_norm': 0.2996273338794708, 'learning_rate': 0.00016430572228891556, 'epoch': 0.18}


 18%|█▊        | 2236/12500 [3:54:43<15:43:38,  5.52s/it]

{'loss': 0.7412, 'grad_norm': 0.23906180262565613, 'learning_rate': 0.00016428971588635456, 'epoch': 0.18}


 18%|█▊        | 2237/12500 [3:54:48<15:27:24,  5.42s/it]

{'loss': 0.7567, 'grad_norm': 0.2967700660228729, 'learning_rate': 0.00016427370948379354, 'epoch': 0.18}


 18%|█▊        | 2238/12500 [3:54:56<17:29:47,  6.14s/it]

{'loss': 0.7859, 'grad_norm': 0.24297301471233368, 'learning_rate': 0.00016425770308123249, 'epoch': 0.18}


 18%|█▊        | 2239/12500 [3:55:02<16:58:12,  5.95s/it]

{'loss': 0.6643, 'grad_norm': 0.28888723254203796, 'learning_rate': 0.0001642416966786715, 'epoch': 0.18}


 18%|█▊        | 2240/12500 [3:55:14<22:04:03,  7.74s/it]

{'loss': 0.7826, 'grad_norm': 0.17590893805027008, 'learning_rate': 0.00016422569027611046, 'epoch': 0.18}


 18%|█▊        | 2241/12500 [3:55:24<24:17:52,  8.53s/it]

{'loss': 0.4428, 'grad_norm': 0.1894790679216385, 'learning_rate': 0.00016420968387354944, 'epoch': 0.18}


 18%|█▊        | 2242/12500 [3:55:31<22:43:53,  7.98s/it]

{'loss': 1.348, 'grad_norm': 0.30581384897232056, 'learning_rate': 0.00016419367747098838, 'epoch': 0.18}


 18%|█▊        | 2243/12500 [3:55:37<21:19:29,  7.48s/it]

{'loss': 0.7419, 'grad_norm': 0.2729434370994568, 'learning_rate': 0.00016417767106842739, 'epoch': 0.18}


 18%|█▊        | 2244/12500 [3:55:47<23:19:55,  8.19s/it]

{'loss': 1.236, 'grad_norm': 0.3063543438911438, 'learning_rate': 0.00016416166466586636, 'epoch': 0.18}


 18%|█▊        | 2245/12500 [3:55:53<21:19:34,  7.49s/it]

{'loss': 0.5942, 'grad_norm': 0.24334770441055298, 'learning_rate': 0.00016414565826330534, 'epoch': 0.18}


 18%|█▊        | 2246/12500 [3:55:59<19:58:58,  7.02s/it]

{'loss': 0.6824, 'grad_norm': 0.25942909717559814, 'learning_rate': 0.0001641296518607443, 'epoch': 0.18}


 18%|█▊        | 2247/12500 [3:56:05<19:40:49,  6.91s/it]

{'loss': 0.8259, 'grad_norm': 0.25354379415512085, 'learning_rate': 0.00016411364545818329, 'epoch': 0.18}


 18%|█▊        | 2248/12500 [3:56:13<20:27:24,  7.18s/it]

{'loss': 1.1675, 'grad_norm': 0.25457507371902466, 'learning_rate': 0.00016409763905562226, 'epoch': 0.18}


 18%|█▊        | 2249/12500 [3:56:19<19:20:30,  6.79s/it]

{'loss': 0.6671, 'grad_norm': 0.27254024147987366, 'learning_rate': 0.00016408163265306124, 'epoch': 0.18}


 18%|█▊        | 2250/12500 [3:56:28<21:00:04,  7.38s/it]

{'loss': 0.9048, 'grad_norm': 0.23065748810768127, 'learning_rate': 0.0001640656262505002, 'epoch': 0.18}


 18%|█▊        | 2251/12500 [3:56:37<22:48:49,  8.01s/it]

{'loss': 0.6348, 'grad_norm': 0.20933154225349426, 'learning_rate': 0.00016404961984793919, 'epoch': 0.18}


 18%|█▊        | 2252/12500 [3:56:41<19:22:18,  6.81s/it]

{'loss': 0.6131, 'grad_norm': 0.2979837656021118, 'learning_rate': 0.00016403361344537816, 'epoch': 0.18}


 18%|█▊        | 2253/12500 [3:56:48<19:22:58,  6.81s/it]

{'loss': 0.9011, 'grad_norm': 0.23764163255691528, 'learning_rate': 0.00016401760704281713, 'epoch': 0.18}


 18%|█▊        | 2254/12500 [3:56:57<21:25:59,  7.53s/it]

{'loss': 0.9868, 'grad_norm': 0.20571500062942505, 'learning_rate': 0.0001640016006402561, 'epoch': 0.18}


 18%|█▊        | 2255/12500 [3:57:03<19:54:31,  7.00s/it]

{'loss': 0.702, 'grad_norm': 0.27396419644355774, 'learning_rate': 0.00016398559423769508, 'epoch': 0.18}


 18%|█▊        | 2256/12500 [3:57:09<18:51:23,  6.63s/it]

{'loss': 0.8977, 'grad_norm': 0.29252341389656067, 'learning_rate': 0.00016396958783513406, 'epoch': 0.18}


 18%|█▊        | 2257/12500 [3:57:14<17:20:07,  6.09s/it]

{'loss': 0.7145, 'grad_norm': 0.3175227642059326, 'learning_rate': 0.00016395358143257303, 'epoch': 0.18}


 18%|█▊        | 2258/12500 [3:57:19<16:33:31,  5.82s/it]

{'loss': 0.6388, 'grad_norm': 0.2824476659297943, 'learning_rate': 0.000163937575030012, 'epoch': 0.18}


 18%|█▊        | 2259/12500 [3:57:28<19:19:40,  6.79s/it]

{'loss': 0.7726, 'grad_norm': 0.2301059365272522, 'learning_rate': 0.00016392156862745098, 'epoch': 0.18}


 18%|█▊        | 2260/12500 [3:57:34<18:55:37,  6.65s/it]

{'loss': 0.5615, 'grad_norm': 0.23098570108413696, 'learning_rate': 0.00016390556222488999, 'epoch': 0.18}


 18%|█▊        | 2261/12500 [3:57:41<19:08:15,  6.73s/it]

{'loss': 0.9662, 'grad_norm': 0.26036587357521057, 'learning_rate': 0.00016388955582232893, 'epoch': 0.18}


 18%|█▊        | 2262/12500 [3:57:47<18:05:55,  6.36s/it]

{'loss': 0.8368, 'grad_norm': 0.3100973069667816, 'learning_rate': 0.0001638735494197679, 'epoch': 0.18}


 18%|█▊        | 2263/12500 [3:57:52<16:49:30,  5.92s/it]

{'loss': 0.8176, 'grad_norm': 0.2891693115234375, 'learning_rate': 0.00016385754301720688, 'epoch': 0.18}


 18%|█▊        | 2264/12500 [3:57:56<15:51:13,  5.58s/it]

{'loss': 0.7582, 'grad_norm': 0.29245197772979736, 'learning_rate': 0.00016384153661464589, 'epoch': 0.18}


 18%|█▊        | 2265/12500 [3:58:03<17:07:37,  6.02s/it]

{'loss': 0.7713, 'grad_norm': 0.23654785752296448, 'learning_rate': 0.00016382553021208483, 'epoch': 0.18}


 18%|█▊        | 2266/12500 [3:58:11<18:09:43,  6.39s/it]

{'loss': 0.8473, 'grad_norm': 0.25636500120162964, 'learning_rate': 0.0001638095238095238, 'epoch': 0.18}


 18%|█▊        | 2267/12500 [3:58:17<17:43:11,  6.23s/it]

{'loss': 0.6381, 'grad_norm': 0.21619217097759247, 'learning_rate': 0.00016379351740696278, 'epoch': 0.18}


 18%|█▊        | 2268/12500 [3:58:24<18:32:26,  6.52s/it]

{'loss': 0.686, 'grad_norm': 0.29256996512413025, 'learning_rate': 0.00016377751100440178, 'epoch': 0.18}


 18%|█▊        | 2269/12500 [3:58:30<18:45:43,  6.60s/it]

{'loss': 0.5746, 'grad_norm': 0.24535314738750458, 'learning_rate': 0.00016376150460184073, 'epoch': 0.18}


 18%|█▊        | 2270/12500 [3:58:36<17:49:31,  6.27s/it]

{'loss': 0.5951, 'grad_norm': 0.26928988099098206, 'learning_rate': 0.0001637454981992797, 'epoch': 0.18}


 18%|█▊        | 2271/12500 [3:58:44<18:52:27,  6.64s/it]

{'loss': 0.852, 'grad_norm': 0.24600939452648163, 'learning_rate': 0.0001637294917967187, 'epoch': 0.18}


 18%|█▊        | 2272/12500 [3:58:48<16:41:15,  5.87s/it]

{'loss': 0.5423, 'grad_norm': 0.25508350133895874, 'learning_rate': 0.00016371348539415768, 'epoch': 0.18}


 18%|█▊        | 2273/12500 [3:58:54<16:57:07,  5.97s/it]

{'loss': 1.033, 'grad_norm': 0.24243159592151642, 'learning_rate': 0.00016369747899159663, 'epoch': 0.18}


 18%|█▊        | 2274/12500 [3:59:03<19:24:59,  6.84s/it]

{'loss': 1.0771, 'grad_norm': 0.24261824786663055, 'learning_rate': 0.0001636814725890356, 'epoch': 0.18}


 18%|█▊        | 2275/12500 [3:59:09<18:47:10,  6.61s/it]

{'loss': 0.5786, 'grad_norm': 0.27008938789367676, 'learning_rate': 0.0001636654661864746, 'epoch': 0.18}


 18%|█▊        | 2276/12500 [3:59:15<18:19:28,  6.45s/it]

{'loss': 0.84, 'grad_norm': 0.29544445872306824, 'learning_rate': 0.00016364945978391358, 'epoch': 0.18}


 18%|█▊        | 2277/12500 [3:59:20<17:38:28,  6.21s/it]

{'loss': 0.8143, 'grad_norm': 0.2686730921268463, 'learning_rate': 0.00016363345338135253, 'epoch': 0.18}


 18%|█▊        | 2278/12500 [3:59:28<18:56:32,  6.67s/it]

{'loss': 0.6035, 'grad_norm': 0.2649393081665039, 'learning_rate': 0.00016361744697879153, 'epoch': 0.18}


 18%|█▊        | 2279/12500 [3:59:38<21:40:19,  7.63s/it]

{'loss': 0.7617, 'grad_norm': 0.2247006595134735, 'learning_rate': 0.0001636014405762305, 'epoch': 0.18}


 18%|█▊        | 2280/12500 [3:59:43<19:22:02,  6.82s/it]

{'loss': 0.7315, 'grad_norm': 0.3070695698261261, 'learning_rate': 0.00016358543417366948, 'epoch': 0.18}


 18%|█▊        | 2281/12500 [3:59:49<18:27:58,  6.51s/it]

{'loss': 0.7187, 'grad_norm': 0.25242340564727783, 'learning_rate': 0.00016356942777110843, 'epoch': 0.18}


 18%|█▊        | 2282/12500 [3:59:55<17:56:27,  6.32s/it]

{'loss': 0.8885, 'grad_norm': 0.2569901645183563, 'learning_rate': 0.00016355342136854743, 'epoch': 0.18}


 18%|█▊        | 2283/12500 [3:59:58<15:27:50,  5.45s/it]

{'loss': 1.1065, 'grad_norm': 0.3952798545360565, 'learning_rate': 0.0001635374149659864, 'epoch': 0.18}


 18%|█▊        | 2284/12500 [4:00:05<16:22:27,  5.77s/it]

{'loss': 0.6439, 'grad_norm': 0.24530455470085144, 'learning_rate': 0.00016352140856342538, 'epoch': 0.18}


 18%|█▊        | 2285/12500 [4:00:10<15:38:06,  5.51s/it]

{'loss': 0.5215, 'grad_norm': 0.2645832896232605, 'learning_rate': 0.00016350540216086436, 'epoch': 0.18}


 18%|█▊        | 2286/12500 [4:00:16<16:30:28,  5.82s/it]

{'loss': 0.5502, 'grad_norm': 0.2102525681257248, 'learning_rate': 0.00016348939575830333, 'epoch': 0.18}


 18%|█▊        | 2287/12500 [4:00:24<18:16:18,  6.44s/it]

{'loss': 0.5095, 'grad_norm': 0.276057630777359, 'learning_rate': 0.0001634733893557423, 'epoch': 0.18}


 18%|█▊        | 2288/12500 [4:00:32<19:24:24,  6.84s/it]

{'loss': 0.6943, 'grad_norm': 0.24094170331954956, 'learning_rate': 0.00016345738295318128, 'epoch': 0.18}


 18%|█▊        | 2289/12500 [4:00:37<18:07:47,  6.39s/it]

{'loss': 0.7431, 'grad_norm': 0.28530213236808777, 'learning_rate': 0.00016344137655062026, 'epoch': 0.18}


 18%|█▊        | 2290/12500 [4:00:42<16:30:52,  5.82s/it]

{'loss': 0.696, 'grad_norm': 0.3030546307563782, 'learning_rate': 0.00016342537014805923, 'epoch': 0.18}


 18%|█▊        | 2291/12500 [4:00:48<16:44:58,  5.91s/it]

{'loss': 0.9535, 'grad_norm': 0.2537899613380432, 'learning_rate': 0.0001634093637454982, 'epoch': 0.18}


 18%|█▊        | 2292/12500 [4:00:57<19:48:32,  6.99s/it]

{'loss': 0.8795, 'grad_norm': 0.24197261035442352, 'learning_rate': 0.00016339335734293718, 'epoch': 0.18}


 18%|█▊        | 2293/12500 [4:01:02<17:38:10,  6.22s/it]

{'loss': 0.6738, 'grad_norm': 0.2723613679409027, 'learning_rate': 0.00016337735094037616, 'epoch': 0.18}


 18%|█▊        | 2294/12500 [4:01:11<20:26:21,  7.21s/it]

{'loss': 0.6613, 'grad_norm': 0.24203592538833618, 'learning_rate': 0.00016336134453781513, 'epoch': 0.18}


 18%|█▊        | 2295/12500 [4:01:17<19:43:25,  6.96s/it]

{'loss': 0.6149, 'grad_norm': 0.3081206977367401, 'learning_rate': 0.0001633453381352541, 'epoch': 0.18}


 18%|█▊        | 2296/12500 [4:01:24<19:42:04,  6.95s/it]

{'loss': 0.8048, 'grad_norm': 0.2414165735244751, 'learning_rate': 0.00016332933173269308, 'epoch': 0.18}


 18%|█▊        | 2297/12500 [4:01:30<18:15:53,  6.44s/it]

{'loss': 0.8521, 'grad_norm': 0.3277868926525116, 'learning_rate': 0.00016331332533013206, 'epoch': 0.18}


 18%|█▊        | 2298/12500 [4:01:34<16:30:44,  5.83s/it]

{'loss': 0.4424, 'grad_norm': 0.24921253323554993, 'learning_rate': 0.00016329731892757103, 'epoch': 0.18}


 18%|█▊        | 2299/12500 [4:01:41<17:37:37,  6.22s/it]

{'loss': 0.8861, 'grad_norm': 0.21724775433540344, 'learning_rate': 0.00016328131252501003, 'epoch': 0.18}


 18%|█▊        | 2300/12500 [4:01:46<16:36:56,  5.86s/it]

{'loss': 0.7397, 'grad_norm': 0.27867600321769714, 'learning_rate': 0.00016326530612244898, 'epoch': 0.18}


 18%|█▊        | 2301/12500 [4:01:52<16:46:07,  5.92s/it]

{'loss': 0.5745, 'grad_norm': 0.3034387230873108, 'learning_rate': 0.00016324929971988796, 'epoch': 0.18}


 18%|█▊        | 2302/12500 [4:01:57<15:24:20,  5.44s/it]

{'loss': 0.8268, 'grad_norm': 0.344726026058197, 'learning_rate': 0.00016323329331732693, 'epoch': 0.18}


 18%|█▊        | 2303/12500 [4:02:02<15:04:58,  5.32s/it]

{'loss': 0.7253, 'grad_norm': 0.26215559244155884, 'learning_rate': 0.00016321728691476593, 'epoch': 0.18}


 18%|█▊        | 2304/12500 [4:02:08<15:44:04,  5.56s/it]

{'loss': 0.5806, 'grad_norm': 0.2594219744205475, 'learning_rate': 0.00016320128051220488, 'epoch': 0.18}


 18%|█▊        | 2305/12500 [4:02:14<16:08:17,  5.70s/it]

{'loss': 0.6803, 'grad_norm': 0.254769504070282, 'learning_rate': 0.00016318527410964385, 'epoch': 0.18}


 18%|█▊        | 2306/12500 [4:02:19<16:06:26,  5.69s/it]

{'loss': 0.7638, 'grad_norm': 0.29453492164611816, 'learning_rate': 0.00016316926770708286, 'epoch': 0.18}


 18%|█▊        | 2307/12500 [4:02:25<15:50:14,  5.59s/it]

{'loss': 0.6118, 'grad_norm': 0.2634073495864868, 'learning_rate': 0.00016315326130452183, 'epoch': 0.18}


 18%|█▊        | 2308/12500 [4:02:34<18:59:07,  6.71s/it]

{'loss': 0.9044, 'grad_norm': 0.2517147958278656, 'learning_rate': 0.00016313725490196078, 'epoch': 0.18}


 18%|█▊        | 2309/12500 [4:02:40<18:25:50,  6.51s/it]

{'loss': 0.621, 'grad_norm': 0.2936731278896332, 'learning_rate': 0.00016312124849939975, 'epoch': 0.18}


 18%|█▊        | 2310/12500 [4:02:47<18:21:33,  6.49s/it]

{'loss': 0.6729, 'grad_norm': 0.22190089523792267, 'learning_rate': 0.00016310524209683876, 'epoch': 0.18}


 18%|█▊        | 2311/12500 [4:02:53<18:23:46,  6.50s/it]

{'loss': 0.5325, 'grad_norm': 0.2545006275177002, 'learning_rate': 0.00016308923569427773, 'epoch': 0.18}


 18%|█▊        | 2312/12500 [4:03:01<19:16:49,  6.81s/it]

{'loss': 0.6944, 'grad_norm': 0.2438087910413742, 'learning_rate': 0.00016307322929171668, 'epoch': 0.18}


 19%|█▊        | 2313/12500 [4:03:05<17:18:05,  6.11s/it]

{'loss': 0.7644, 'grad_norm': 0.3412717580795288, 'learning_rate': 0.00016305722288915568, 'epoch': 0.19}


 19%|█▊        | 2314/12500 [4:03:12<17:48:06,  6.29s/it]

{'loss': 0.8505, 'grad_norm': 0.21307235956192017, 'learning_rate': 0.00016304121648659466, 'epoch': 0.19}


 19%|█▊        | 2315/12500 [4:03:20<19:10:12,  6.78s/it]

{'loss': 0.7674, 'grad_norm': 0.2648586630821228, 'learning_rate': 0.00016302521008403363, 'epoch': 0.19}


 19%|█▊        | 2316/12500 [4:03:25<17:49:39,  6.30s/it]

{'loss': 0.9321, 'grad_norm': 0.3193584084510803, 'learning_rate': 0.00016300920368147258, 'epoch': 0.19}


 19%|█▊        | 2317/12500 [4:03:29<15:40:54,  5.54s/it]

{'loss': 0.8766, 'grad_norm': 0.3970419764518738, 'learning_rate': 0.00016299319727891158, 'epoch': 0.19}


 19%|█▊        | 2318/12500 [4:03:34<15:16:30,  5.40s/it]

{'loss': 0.6894, 'grad_norm': 0.34358999133110046, 'learning_rate': 0.00016297719087635055, 'epoch': 0.19}


 19%|█▊        | 2319/12500 [4:03:42<17:20:45,  6.13s/it]

{'loss': 0.7702, 'grad_norm': 0.27289485931396484, 'learning_rate': 0.00016296118447378953, 'epoch': 0.19}


 19%|█▊        | 2320/12500 [4:03:47<16:39:43,  5.89s/it]

{'loss': 0.643, 'grad_norm': 0.28824982047080994, 'learning_rate': 0.00016294517807122848, 'epoch': 0.19}


 19%|█▊        | 2321/12500 [4:03:53<16:33:00,  5.85s/it]

{'loss': 0.7285, 'grad_norm': 0.27535146474838257, 'learning_rate': 0.00016292917166866748, 'epoch': 0.19}


 19%|█▊        | 2322/12500 [4:03:58<15:40:39,  5.55s/it]

{'loss': 0.9916, 'grad_norm': 0.31744247674942017, 'learning_rate': 0.00016291316526610645, 'epoch': 0.19}


 19%|█▊        | 2323/12500 [4:04:02<14:29:03,  5.12s/it]

{'loss': 0.8741, 'grad_norm': 0.35370689630508423, 'learning_rate': 0.00016289715886354543, 'epoch': 0.19}


 19%|█▊        | 2324/12500 [4:04:08<15:42:32,  5.56s/it]

{'loss': 0.7233, 'grad_norm': 0.24378448724746704, 'learning_rate': 0.0001628811524609844, 'epoch': 0.19}


 19%|█▊        | 2325/12500 [4:04:13<14:34:52,  5.16s/it]

{'loss': 0.6366, 'grad_norm': 0.2890445590019226, 'learning_rate': 0.00016286514605842338, 'epoch': 0.19}


 19%|█▊        | 2326/12500 [4:04:20<16:36:31,  5.88s/it]

{'loss': 0.5338, 'grad_norm': 0.3048085570335388, 'learning_rate': 0.00016284913965586235, 'epoch': 0.19}


 19%|█▊        | 2327/12500 [4:04:27<17:53:23,  6.33s/it]

{'loss': 0.7111, 'grad_norm': 0.23218050599098206, 'learning_rate': 0.00016283313325330133, 'epoch': 0.19}


 19%|█▊        | 2328/12500 [4:04:36<20:05:51,  7.11s/it]

{'loss': 0.4799, 'grad_norm': 0.21007518470287323, 'learning_rate': 0.0001628171268507403, 'epoch': 0.19}


 19%|█▊        | 2329/12500 [4:04:42<19:02:00,  6.74s/it]

{'loss': 0.6494, 'grad_norm': 0.23716065287590027, 'learning_rate': 0.00016280112044817928, 'epoch': 0.19}


 19%|█▊        | 2330/12500 [4:04:47<16:56:23,  6.00s/it]

{'loss': 0.7741, 'grad_norm': 0.2823396921157837, 'learning_rate': 0.00016278511404561825, 'epoch': 0.19}


 19%|█▊        | 2331/12500 [4:04:52<16:37:46,  5.89s/it]

{'loss': 0.6754, 'grad_norm': 0.2873002886772156, 'learning_rate': 0.00016276910764305723, 'epoch': 0.19}


 19%|█▊        | 2332/12500 [4:04:57<15:34:10,  5.51s/it]

{'loss': 0.3891, 'grad_norm': 0.2234315276145935, 'learning_rate': 0.0001627531012404962, 'epoch': 0.19}


 19%|█▊        | 2333/12500 [4:05:04<17:02:01,  6.03s/it]

{'loss': 0.8442, 'grad_norm': 0.27518099546432495, 'learning_rate': 0.00016273709483793518, 'epoch': 0.19}


 19%|█▊        | 2334/12500 [4:05:12<18:15:24,  6.47s/it]

{'loss': 0.9213, 'grad_norm': 0.23773017525672913, 'learning_rate': 0.00016272108843537415, 'epoch': 0.19}


 19%|█▊        | 2335/12500 [4:05:16<16:45:18,  5.93s/it]

{'loss': 0.8042, 'grad_norm': 0.2718786895275116, 'learning_rate': 0.00016270508203281313, 'epoch': 0.19}


 19%|█▊        | 2336/12500 [4:05:22<16:40:34,  5.91s/it]

{'loss': 0.5558, 'grad_norm': 0.22336986660957336, 'learning_rate': 0.0001626890756302521, 'epoch': 0.19}


 19%|█▊        | 2337/12500 [4:05:26<15:12:41,  5.39s/it]

{'loss': 0.7132, 'grad_norm': 0.31245431303977966, 'learning_rate': 0.00016267306922769108, 'epoch': 0.19}


 19%|█▊        | 2338/12500 [4:05:33<16:18:18,  5.78s/it]

{'loss': 0.5901, 'grad_norm': 0.22434525191783905, 'learning_rate': 0.00016265706282513008, 'epoch': 0.19}


 19%|█▊        | 2339/12500 [4:05:39<16:34:34,  5.87s/it]

{'loss': 0.5035, 'grad_norm': 0.25748804211616516, 'learning_rate': 0.00016264105642256903, 'epoch': 0.19}


 19%|█▊        | 2340/12500 [4:05:43<14:54:17,  5.28s/it]

{'loss': 1.023, 'grad_norm': 0.38762134313583374, 'learning_rate': 0.000162625050020008, 'epoch': 0.19}


 19%|█▊        | 2341/12500 [4:05:47<14:11:40,  5.03s/it]

{'loss': 0.6432, 'grad_norm': 0.26177430152893066, 'learning_rate': 0.00016260904361744698, 'epoch': 0.19}


 19%|█▊        | 2342/12500 [4:05:54<15:10:14,  5.38s/it]

{'loss': 0.5335, 'grad_norm': 0.27994510531425476, 'learning_rate': 0.00016259303721488598, 'epoch': 0.19}


 19%|█▊        | 2343/12500 [4:06:01<16:38:33,  5.90s/it]

{'loss': 0.9886, 'grad_norm': 0.24134370684623718, 'learning_rate': 0.00016257703081232493, 'epoch': 0.19}


 19%|█▉        | 2344/12500 [4:06:04<14:47:12,  5.24s/it]

{'loss': 0.6297, 'grad_norm': 0.30698081851005554, 'learning_rate': 0.0001625610244097639, 'epoch': 0.19}


 19%|█▉        | 2345/12500 [4:06:10<15:01:14,  5.32s/it]

{'loss': 0.5918, 'grad_norm': 0.2800133526325226, 'learning_rate': 0.0001625450180072029, 'epoch': 0.19}


 19%|█▉        | 2346/12500 [4:06:18<17:03:39,  6.05s/it]

{'loss': 0.7082, 'grad_norm': 0.21912197768688202, 'learning_rate': 0.00016252901160464188, 'epoch': 0.19}


 19%|█▉        | 2347/12500 [4:06:26<18:52:34,  6.69s/it]

{'loss': 0.5932, 'grad_norm': 0.2501370310783386, 'learning_rate': 0.00016251300520208083, 'epoch': 0.19}


 19%|█▉        | 2348/12500 [4:06:31<17:59:38,  6.38s/it]

{'loss': 0.8125, 'grad_norm': 0.26977041363716125, 'learning_rate': 0.0001624969987995198, 'epoch': 0.19}


 19%|█▉        | 2349/12500 [4:06:37<16:57:33,  6.01s/it]

{'loss': 0.7247, 'grad_norm': 0.2774147093296051, 'learning_rate': 0.0001624809923969588, 'epoch': 0.19}


 19%|█▉        | 2350/12500 [4:06:43<17:36:44,  6.25s/it]

{'loss': 0.4434, 'grad_norm': 0.20216231048107147, 'learning_rate': 0.00016246498599439778, 'epoch': 0.19}


 19%|█▉        | 2351/12500 [4:06:51<18:35:46,  6.60s/it]

{'loss': 0.653, 'grad_norm': 0.22366461157798767, 'learning_rate': 0.00016244897959183672, 'epoch': 0.19}


 19%|█▉        | 2352/12500 [4:07:00<20:52:45,  7.41s/it]

{'loss': 0.522, 'grad_norm': 0.197001114487648, 'learning_rate': 0.00016243297318927573, 'epoch': 0.19}


 19%|█▉        | 2353/12500 [4:07:05<18:40:36,  6.63s/it]

{'loss': 0.7332, 'grad_norm': 0.28803345561027527, 'learning_rate': 0.0001624169667867147, 'epoch': 0.19}


 19%|█▉        | 2354/12500 [4:07:14<20:44:06,  7.36s/it]

{'loss': 0.9339, 'grad_norm': 0.23548871278762817, 'learning_rate': 0.00016240096038415368, 'epoch': 0.19}


 19%|█▉        | 2355/12500 [4:07:20<19:15:25,  6.83s/it]

{'loss': 0.6874, 'grad_norm': 0.26662832498550415, 'learning_rate': 0.00016238495398159262, 'epoch': 0.19}


 19%|█▉        | 2356/12500 [4:07:27<19:56:02,  7.07s/it]

{'loss': 1.2285, 'grad_norm': 0.4560202360153198, 'learning_rate': 0.00016236894757903163, 'epoch': 0.19}


 19%|█▉        | 2357/12500 [4:07:36<21:06:42,  7.49s/it]

{'loss': 1.0544, 'grad_norm': 0.2564307451248169, 'learning_rate': 0.0001623529411764706, 'epoch': 0.19}


 19%|█▉        | 2358/12500 [4:07:44<21:36:14,  7.67s/it]

{'loss': 0.9996, 'grad_norm': 0.18997129797935486, 'learning_rate': 0.00016233693477390958, 'epoch': 0.19}


 19%|█▉        | 2359/12500 [4:07:50<20:16:56,  7.20s/it]

{'loss': 0.7139, 'grad_norm': 0.26974552869796753, 'learning_rate': 0.00016232092837134855, 'epoch': 0.19}


 19%|█▉        | 2360/12500 [4:07:56<18:59:06,  6.74s/it]

{'loss': 0.5723, 'grad_norm': 0.2558636963367462, 'learning_rate': 0.00016230492196878753, 'epoch': 0.19}


 19%|█▉        | 2361/12500 [4:08:00<17:00:32,  6.04s/it]

{'loss': 0.571, 'grad_norm': 0.39093858003616333, 'learning_rate': 0.0001622889155662265, 'epoch': 0.19}


 19%|█▉        | 2362/12500 [4:08:07<17:59:43,  6.39s/it]

{'loss': 0.7228, 'grad_norm': 0.22683465480804443, 'learning_rate': 0.00016227290916366548, 'epoch': 0.19}


 19%|█▉        | 2363/12500 [4:08:13<17:35:06,  6.25s/it]

{'loss': 0.7676, 'grad_norm': 0.2474478930234909, 'learning_rate': 0.00016225690276110445, 'epoch': 0.19}


 19%|█▉        | 2364/12500 [4:08:21<18:43:05,  6.65s/it]

{'loss': 0.9339, 'grad_norm': 0.256531685590744, 'learning_rate': 0.00016224089635854343, 'epoch': 0.19}


 19%|█▉        | 2365/12500 [4:08:28<19:02:04,  6.76s/it]

{'loss': 0.4551, 'grad_norm': 0.24363771080970764, 'learning_rate': 0.0001622248899559824, 'epoch': 0.19}


 19%|█▉        | 2366/12500 [4:08:37<21:07:52,  7.51s/it]

{'loss': 0.8112, 'grad_norm': 0.20498858392238617, 'learning_rate': 0.00016220888355342137, 'epoch': 0.19}


 19%|█▉        | 2367/12500 [4:08:45<21:55:16,  7.79s/it]

{'loss': 0.5815, 'grad_norm': 0.21944575011730194, 'learning_rate': 0.00016219287715086035, 'epoch': 0.19}


 19%|█▉        | 2368/12500 [4:08:51<20:17:18,  7.21s/it]

{'loss': 0.9783, 'grad_norm': 0.26713892817497253, 'learning_rate': 0.00016217687074829932, 'epoch': 0.19}


 19%|█▉        | 2369/12500 [4:08:57<18:46:16,  6.67s/it]

{'loss': 0.6386, 'grad_norm': 0.3434121012687683, 'learning_rate': 0.0001621608643457383, 'epoch': 0.19}


 19%|█▉        | 2370/12500 [4:09:04<19:42:39,  7.00s/it]

{'loss': 0.9552, 'grad_norm': 0.2613019049167633, 'learning_rate': 0.00016214485794317727, 'epoch': 0.19}


 19%|█▉        | 2371/12500 [4:09:10<18:21:46,  6.53s/it]

{'loss': 0.632, 'grad_norm': 0.27666935324668884, 'learning_rate': 0.00016212885154061625, 'epoch': 0.19}


 19%|█▉        | 2372/12500 [4:09:21<22:05:55,  7.86s/it]

{'loss': 0.4678, 'grad_norm': 0.2048749178647995, 'learning_rate': 0.00016211284513805522, 'epoch': 0.19}


 19%|█▉        | 2373/12500 [4:09:27<20:31:55,  7.30s/it]

{'loss': 0.5732, 'grad_norm': 0.3041257858276367, 'learning_rate': 0.00016209683873549423, 'epoch': 0.19}


 19%|█▉        | 2374/12500 [4:09:32<18:57:52,  6.74s/it]

{'loss': 0.7689, 'grad_norm': 0.27771684527397156, 'learning_rate': 0.00016208083233293317, 'epoch': 0.19}


 19%|█▉        | 2375/12500 [4:09:40<19:57:02,  7.09s/it]

{'loss': 0.4055, 'grad_norm': 0.19641320407390594, 'learning_rate': 0.00016206482593037215, 'epoch': 0.19}


 19%|█▉        | 2376/12500 [4:09:50<22:18:22,  7.93s/it]

{'loss': 0.7941, 'grad_norm': 0.23855622112751007, 'learning_rate': 0.00016204881952781112, 'epoch': 0.19}


 19%|█▉        | 2377/12500 [4:09:56<20:49:17,  7.40s/it]

{'loss': 0.7136, 'grad_norm': 0.28916001319885254, 'learning_rate': 0.00016203281312525013, 'epoch': 0.19}


 19%|█▉        | 2378/12500 [4:10:01<18:47:11,  6.68s/it]

{'loss': 0.6754, 'grad_norm': 0.25615254044532776, 'learning_rate': 0.00016201680672268907, 'epoch': 0.19}


 19%|█▉        | 2379/12500 [4:10:05<16:12:04,  5.76s/it]

{'loss': 0.6441, 'grad_norm': 0.31212353706359863, 'learning_rate': 0.00016200080032012805, 'epoch': 0.19}


 19%|█▉        | 2380/12500 [4:10:11<16:21:46,  5.82s/it]

{'loss': 0.7775, 'grad_norm': 0.28046715259552, 'learning_rate': 0.00016198479391756702, 'epoch': 0.19}


 19%|█▉        | 2381/12500 [4:10:19<18:08:37,  6.45s/it]

{'loss': 1.1014, 'grad_norm': 0.2303929328918457, 'learning_rate': 0.00016196878751500602, 'epoch': 0.19}


 19%|█▉        | 2382/12500 [4:10:24<17:14:51,  6.14s/it]

{'loss': 0.8616, 'grad_norm': 0.3160652816295624, 'learning_rate': 0.00016195278111244497, 'epoch': 0.19}


 19%|█▉        | 2383/12500 [4:10:30<16:47:13,  5.97s/it]

{'loss': 0.5946, 'grad_norm': 0.25614529848098755, 'learning_rate': 0.00016193677470988395, 'epoch': 0.19}


 19%|█▉        | 2384/12500 [4:10:36<17:23:15,  6.19s/it]

{'loss': 0.7816, 'grad_norm': 0.344181090593338, 'learning_rate': 0.00016192076830732295, 'epoch': 0.19}


 19%|█▉        | 2385/12500 [4:10:42<16:46:53,  5.97s/it]

{'loss': 0.8303, 'grad_norm': 0.3950754404067993, 'learning_rate': 0.00016190476190476192, 'epoch': 0.19}


 19%|█▉        | 2386/12500 [4:10:46<15:18:23,  5.45s/it]

{'loss': 0.7074, 'grad_norm': 0.31200364232063293, 'learning_rate': 0.00016188875550220087, 'epoch': 0.19}


 19%|█▉        | 2387/12500 [4:10:54<17:33:15,  6.25s/it]

{'loss': 0.7004, 'grad_norm': 0.2302754819393158, 'learning_rate': 0.00016187274909963985, 'epoch': 0.19}


 19%|█▉        | 2388/12500 [4:11:03<19:37:24,  6.99s/it]

{'loss': 0.8207, 'grad_norm': 0.2401481717824936, 'learning_rate': 0.00016185674269707885, 'epoch': 0.19}


 19%|█▉        | 2389/12500 [4:11:11<20:38:42,  7.35s/it]

{'loss': 0.6453, 'grad_norm': 0.21572086215019226, 'learning_rate': 0.00016184073629451782, 'epoch': 0.19}


 19%|█▉        | 2390/12500 [4:11:17<19:33:38,  6.97s/it]

{'loss': 0.8378, 'grad_norm': 0.25675487518310547, 'learning_rate': 0.00016182472989195677, 'epoch': 0.19}


 19%|█▉        | 2391/12500 [4:11:24<19:36:27,  6.98s/it]

{'loss': 0.7606, 'grad_norm': 0.21865296363830566, 'learning_rate': 0.00016180872348939577, 'epoch': 0.19}


 19%|█▉        | 2392/12500 [4:11:28<17:04:01,  6.08s/it]

{'loss': 0.8029, 'grad_norm': 0.35001301765441895, 'learning_rate': 0.00016179271708683475, 'epoch': 0.19}


 19%|█▉        | 2393/12500 [4:11:34<17:01:12,  6.06s/it]

{'loss': 0.9247, 'grad_norm': 0.2600669860839844, 'learning_rate': 0.00016177671068427372, 'epoch': 0.19}


 19%|█▉        | 2394/12500 [4:11:42<18:45:17,  6.68s/it]

{'loss': 0.5875, 'grad_norm': 0.20948459208011627, 'learning_rate': 0.00016176070428171267, 'epoch': 0.19}


 19%|█▉        | 2395/12500 [4:11:48<17:36:48,  6.27s/it]

{'loss': 0.6657, 'grad_norm': 0.2693474292755127, 'learning_rate': 0.00016174469787915167, 'epoch': 0.19}


 19%|█▉        | 2396/12500 [4:11:51<15:29:21,  5.52s/it]

{'loss': 1.145, 'grad_norm': 0.3698190152645111, 'learning_rate': 0.00016172869147659065, 'epoch': 0.19}


 19%|█▉        | 2397/12500 [4:11:57<15:45:40,  5.62s/it]

{'loss': 0.7675, 'grad_norm': 0.3535010814666748, 'learning_rate': 0.00016171268507402962, 'epoch': 0.19}


 19%|█▉        | 2398/12500 [4:12:04<16:21:00,  5.83s/it]

{'loss': 0.4987, 'grad_norm': 0.2267201691865921, 'learning_rate': 0.0001616966786714686, 'epoch': 0.19}


 19%|█▉        | 2399/12500 [4:12:10<16:30:36,  5.88s/it]

{'loss': 0.5737, 'grad_norm': 0.2303161770105362, 'learning_rate': 0.00016168067226890757, 'epoch': 0.19}


 19%|█▉        | 2400/12500 [4:12:17<17:24:30,  6.20s/it]

{'loss': 0.7632, 'grad_norm': 0.31040799617767334, 'learning_rate': 0.00016166466586634655, 'epoch': 0.19}


 19%|█▉        | 2401/12500 [4:12:26<19:42:59,  7.03s/it]

{'loss': 0.7648, 'grad_norm': 0.27386441826820374, 'learning_rate': 0.00016164865946378552, 'epoch': 0.19}


 19%|█▉        | 2402/12500 [4:12:35<21:50:35,  7.79s/it]

{'loss': 0.8604, 'grad_norm': 0.21325363218784332, 'learning_rate': 0.0001616326530612245, 'epoch': 0.19}


 19%|█▉        | 2403/12500 [4:12:41<20:35:03,  7.34s/it]

{'loss': 0.9696, 'grad_norm': 0.28908661007881165, 'learning_rate': 0.00016161664665866347, 'epoch': 0.19}


 19%|█▉        | 2404/12500 [4:12:47<18:57:28,  6.76s/it]

{'loss': 0.5774, 'grad_norm': 0.33404430747032166, 'learning_rate': 0.00016160064025610245, 'epoch': 0.19}


 19%|█▉        | 2405/12500 [4:12:56<20:51:26,  7.44s/it]

{'loss': 0.6988, 'grad_norm': 0.20951850712299347, 'learning_rate': 0.00016158463385354142, 'epoch': 0.19}


 19%|█▉        | 2406/12500 [4:13:05<22:16:16,  7.94s/it]

{'loss': 0.4303, 'grad_norm': 0.17755532264709473, 'learning_rate': 0.0001615686274509804, 'epoch': 0.19}


 19%|█▉        | 2407/12500 [4:13:10<20:14:30,  7.22s/it]

{'loss': 0.6541, 'grad_norm': 0.3128754794597626, 'learning_rate': 0.00016155262104841937, 'epoch': 0.19}


 19%|█▉        | 2408/12500 [4:13:15<17:35:51,  6.28s/it]

{'loss': 0.7398, 'grad_norm': 0.3263954520225525, 'learning_rate': 0.00016153661464585835, 'epoch': 0.19}


 19%|█▉        | 2409/12500 [4:13:24<20:17:28,  7.24s/it]

{'loss': 0.8381, 'grad_norm': 0.21209867298603058, 'learning_rate': 0.00016152060824329732, 'epoch': 0.19}


 19%|█▉        | 2410/12500 [4:13:29<18:01:29,  6.43s/it]

{'loss': 0.6151, 'grad_norm': 0.29760119318962097, 'learning_rate': 0.0001615046018407363, 'epoch': 0.19}


 19%|█▉        | 2411/12500 [4:13:36<18:53:09,  6.74s/it]

{'loss': 0.5735, 'grad_norm': 0.2647428810596466, 'learning_rate': 0.00016148859543817527, 'epoch': 0.19}


 19%|█▉        | 2412/12500 [4:13:41<17:49:10,  6.36s/it]

{'loss': 0.8023, 'grad_norm': 0.2554045021533966, 'learning_rate': 0.00016147258903561427, 'epoch': 0.19}


 19%|█▉        | 2413/12500 [4:13:51<20:29:32,  7.31s/it]

{'loss': 1.0049, 'grad_norm': 0.238719180226326, 'learning_rate': 0.00016145658263305322, 'epoch': 0.19}


 19%|█▉        | 2414/12500 [4:13:55<17:33:08,  6.27s/it]

{'loss': 0.5373, 'grad_norm': 0.28324398398399353, 'learning_rate': 0.0001614405762304922, 'epoch': 0.19}


 19%|█▉        | 2415/12500 [4:14:04<20:17:39,  7.24s/it]

{'loss': 0.6779, 'grad_norm': 0.21642696857452393, 'learning_rate': 0.00016142456982793117, 'epoch': 0.19}


 19%|█▉        | 2416/12500 [4:14:11<20:06:35,  7.18s/it]

{'loss': 0.7941, 'grad_norm': 0.23294982314109802, 'learning_rate': 0.00016140856342537017, 'epoch': 0.19}


 19%|█▉        | 2417/12500 [4:14:17<18:46:58,  6.71s/it]

{'loss': 0.5323, 'grad_norm': 0.2493443638086319, 'learning_rate': 0.00016139255702280912, 'epoch': 0.19}


 19%|█▉        | 2418/12500 [4:14:22<17:20:09,  6.19s/it]

{'loss': 0.6533, 'grad_norm': 0.2773909568786621, 'learning_rate': 0.0001613765506202481, 'epoch': 0.19}


 19%|█▉        | 2419/12500 [4:14:28<16:54:36,  6.04s/it]

{'loss': 0.8249, 'grad_norm': 0.24572999775409698, 'learning_rate': 0.0001613605442176871, 'epoch': 0.19}


 19%|█▉        | 2420/12500 [4:14:31<15:02:17,  5.37s/it]

{'loss': 0.7569, 'grad_norm': 0.30217987298965454, 'learning_rate': 0.00016134453781512607, 'epoch': 0.19}


 19%|█▉        | 2421/12500 [4:14:36<14:16:55,  5.10s/it]

{'loss': 0.595, 'grad_norm': 0.28242531418800354, 'learning_rate': 0.00016132853141256502, 'epoch': 0.19}


 19%|█▉        | 2422/12500 [4:14:42<14:56:14,  5.34s/it]

{'loss': 0.6534, 'grad_norm': 0.26463890075683594, 'learning_rate': 0.000161312525010004, 'epoch': 0.19}


 19%|█▉        | 2423/12500 [4:14:50<17:30:27,  6.25s/it]

{'loss': 0.854, 'grad_norm': 0.23712633550167084, 'learning_rate': 0.000161296518607443, 'epoch': 0.19}


 19%|█▉        | 2424/12500 [4:14:55<16:03:41,  5.74s/it]

{'loss': 0.7603, 'grad_norm': 0.2576630413532257, 'learning_rate': 0.00016128051220488197, 'epoch': 0.19}


 19%|█▉        | 2425/12500 [4:14:58<14:13:07,  5.08s/it]

{'loss': 0.6139, 'grad_norm': 0.3068898022174835, 'learning_rate': 0.00016126450580232092, 'epoch': 0.19}


 19%|█▉        | 2426/12500 [4:15:07<16:51:30,  6.02s/it]

{'loss': 0.7877, 'grad_norm': 0.230734720826149, 'learning_rate': 0.00016124849939975992, 'epoch': 0.19}


 19%|█▉        | 2427/12500 [4:15:13<17:18:40,  6.19s/it]

{'loss': 0.764, 'grad_norm': 0.2982998788356781, 'learning_rate': 0.0001612324929971989, 'epoch': 0.19}


 19%|█▉        | 2428/12500 [4:15:21<18:53:34,  6.75s/it]

{'loss': 0.6776, 'grad_norm': 0.27621060609817505, 'learning_rate': 0.00016121648659463787, 'epoch': 0.19}


 19%|█▉        | 2429/12500 [4:15:27<18:02:58,  6.45s/it]

{'loss': 0.8243, 'grad_norm': 0.2506391406059265, 'learning_rate': 0.00016120048019207682, 'epoch': 0.19}


 19%|█▉        | 2430/12500 [4:15:33<17:44:22,  6.34s/it]

{'loss': 0.5461, 'grad_norm': 0.22366541624069214, 'learning_rate': 0.00016118447378951582, 'epoch': 0.19}


 19%|█▉        | 2431/12500 [4:15:39<17:48:52,  6.37s/it]

{'loss': 0.6497, 'grad_norm': 0.30951952934265137, 'learning_rate': 0.0001611684673869548, 'epoch': 0.19}


 19%|█▉        | 2432/12500 [4:15:50<21:03:19,  7.53s/it]

{'loss': 0.7375, 'grad_norm': 0.22289954125881195, 'learning_rate': 0.00016115246098439377, 'epoch': 0.19}


 19%|█▉        | 2433/12500 [4:15:53<17:30:11,  6.26s/it]

{'loss': 0.5172, 'grad_norm': 0.27407562732696533, 'learning_rate': 0.00016113645458183272, 'epoch': 0.19}


 19%|█▉        | 2434/12500 [4:15:57<15:46:20,  5.64s/it]

{'loss': 0.656, 'grad_norm': 0.29671770334243774, 'learning_rate': 0.00016112044817927172, 'epoch': 0.19}


 19%|█▉        | 2435/12500 [4:16:05<17:28:09,  6.25s/it]

{'loss': 0.92, 'grad_norm': 0.25831782817840576, 'learning_rate': 0.0001611044417767107, 'epoch': 0.19}


 19%|█▉        | 2436/12500 [4:16:11<17:34:06,  6.28s/it]

{'loss': 0.5314, 'grad_norm': 0.23250335454940796, 'learning_rate': 0.00016108843537414967, 'epoch': 0.19}


 19%|█▉        | 2437/12500 [4:16:18<17:55:48,  6.41s/it]

{'loss': 0.8933, 'grad_norm': 0.24106496572494507, 'learning_rate': 0.00016107242897158864, 'epoch': 0.19}


 20%|█▉        | 2438/12500 [4:16:25<18:07:43,  6.49s/it]

{'loss': 0.4694, 'grad_norm': 0.24092958867549896, 'learning_rate': 0.00016105642256902762, 'epoch': 0.2}


 20%|█▉        | 2439/12500 [4:16:31<18:03:45,  6.46s/it]

{'loss': 0.6174, 'grad_norm': 0.20992770791053772, 'learning_rate': 0.0001610404161664666, 'epoch': 0.2}


 20%|█▉        | 2440/12500 [4:16:37<17:20:00,  6.20s/it]

{'loss': 0.5929, 'grad_norm': 0.2555604875087738, 'learning_rate': 0.00016102440976390557, 'epoch': 0.2}


 20%|█▉        | 2441/12500 [4:16:41<16:02:29,  5.74s/it]

{'loss': 0.5766, 'grad_norm': 0.25629982352256775, 'learning_rate': 0.00016100840336134454, 'epoch': 0.2}


 20%|█▉        | 2442/12500 [4:16:48<16:57:51,  6.07s/it]

{'loss': 0.791, 'grad_norm': 0.2816510796546936, 'learning_rate': 0.00016099239695878352, 'epoch': 0.2}


 20%|█▉        | 2443/12500 [4:16:54<17:05:05,  6.12s/it]

{'loss': 0.837, 'grad_norm': 0.24681729078292847, 'learning_rate': 0.0001609763905562225, 'epoch': 0.2}


 20%|█▉        | 2444/12500 [4:17:02<18:10:03,  6.50s/it]

{'loss': 0.8713, 'grad_norm': 0.23567481338977814, 'learning_rate': 0.00016096038415366147, 'epoch': 0.2}


 20%|█▉        | 2445/12500 [4:17:08<17:48:57,  6.38s/it]

{'loss': 0.6972, 'grad_norm': 0.272793173789978, 'learning_rate': 0.00016094437775110044, 'epoch': 0.2}


 20%|█▉        | 2446/12500 [4:17:18<20:44:00,  7.42s/it]

{'loss': 0.8998, 'grad_norm': 0.22086451947689056, 'learning_rate': 0.00016092837134853942, 'epoch': 0.2}


 20%|█▉        | 2447/12500 [4:17:26<21:30:14,  7.70s/it]

{'loss': 0.5312, 'grad_norm': 0.2917788624763489, 'learning_rate': 0.0001609123649459784, 'epoch': 0.2}


 20%|█▉        | 2448/12500 [4:17:33<21:16:56,  7.62s/it]

{'loss': 0.6763, 'grad_norm': 0.23378172516822815, 'learning_rate': 0.00016089635854341737, 'epoch': 0.2}


 20%|█▉        | 2449/12500 [4:17:38<18:53:23,  6.77s/it]

{'loss': 0.6771, 'grad_norm': 0.2695486545562744, 'learning_rate': 0.00016088035214085634, 'epoch': 0.2}


 20%|█▉        | 2450/12500 [4:17:44<18:14:51,  6.54s/it]

{'loss': 0.5574, 'grad_norm': 0.259891539812088, 'learning_rate': 0.00016086434573829532, 'epoch': 0.2}


 20%|█▉        | 2451/12500 [4:17:51<18:35:24,  6.66s/it]

{'loss': 0.6283, 'grad_norm': 0.30465131998062134, 'learning_rate': 0.00016084833933573432, 'epoch': 0.2}


 20%|█▉        | 2452/12500 [4:17:57<17:56:40,  6.43s/it]

{'loss': 0.9252, 'grad_norm': 0.3117683231830597, 'learning_rate': 0.00016083233293317327, 'epoch': 0.2}


 20%|█▉        | 2453/12500 [4:18:05<19:20:47,  6.93s/it]

{'loss': 0.633, 'grad_norm': 0.25775718688964844, 'learning_rate': 0.00016081632653061224, 'epoch': 0.2}


 20%|█▉        | 2454/12500 [4:18:11<18:09:29,  6.51s/it]

{'loss': 0.761, 'grad_norm': 0.2526948153972626, 'learning_rate': 0.00016080032012805122, 'epoch': 0.2}


 20%|█▉        | 2455/12500 [4:18:15<16:09:34,  5.79s/it]

{'loss': 0.5688, 'grad_norm': 0.3204565644264221, 'learning_rate': 0.00016078431372549022, 'epoch': 0.2}


 20%|█▉        | 2456/12500 [4:18:24<18:54:12,  6.78s/it]

{'loss': 0.8696, 'grad_norm': 0.20056316256523132, 'learning_rate': 0.00016076830732292917, 'epoch': 0.2}


 20%|█▉        | 2457/12500 [4:18:28<17:05:10,  6.12s/it]

{'loss': 0.6269, 'grad_norm': 0.28718873858451843, 'learning_rate': 0.00016075230092036814, 'epoch': 0.2}


 20%|█▉        | 2458/12500 [4:18:33<15:36:02,  5.59s/it]

{'loss': 0.6604, 'grad_norm': 0.2558412253856659, 'learning_rate': 0.00016073629451780714, 'epoch': 0.2}


 20%|█▉        | 2459/12500 [4:18:41<17:35:54,  6.31s/it]

{'loss': 0.9798, 'grad_norm': 0.22823229432106018, 'learning_rate': 0.00016072028811524612, 'epoch': 0.2}


 20%|█▉        | 2460/12500 [4:18:47<17:15:57,  6.19s/it]

{'loss': 0.575, 'grad_norm': 0.31993600726127625, 'learning_rate': 0.00016070428171268507, 'epoch': 0.2}


 20%|█▉        | 2461/12500 [4:18:54<18:31:16,  6.64s/it]

{'loss': 0.6205, 'grad_norm': 0.2329825460910797, 'learning_rate': 0.00016068827531012404, 'epoch': 0.2}


 20%|█▉        | 2462/12500 [4:18:59<17:00:44,  6.10s/it]

{'loss': 0.7587, 'grad_norm': 0.26394176483154297, 'learning_rate': 0.00016067226890756304, 'epoch': 0.2}


 20%|█▉        | 2463/12500 [4:19:05<17:03:36,  6.12s/it]

{'loss': 0.7429, 'grad_norm': 0.2622869908809662, 'learning_rate': 0.00016065626250500202, 'epoch': 0.2}


 20%|█▉        | 2464/12500 [4:19:14<18:50:48,  6.76s/it]

{'loss': 0.7942, 'grad_norm': 0.20719152688980103, 'learning_rate': 0.00016064025610244096, 'epoch': 0.2}


 20%|█▉        | 2465/12500 [4:19:19<17:55:59,  6.43s/it]

{'loss': 0.7703, 'grad_norm': 0.2592165470123291, 'learning_rate': 0.00016062424969987997, 'epoch': 0.2}


 20%|█▉        | 2466/12500 [4:19:26<17:57:41,  6.44s/it]

{'loss': 0.8565, 'grad_norm': 0.34954971075057983, 'learning_rate': 0.00016060824329731894, 'epoch': 0.2}


 20%|█▉        | 2467/12500 [4:19:33<18:54:00,  6.78s/it]

{'loss': 0.8403, 'grad_norm': 0.23134513199329376, 'learning_rate': 0.00016059223689475792, 'epoch': 0.2}


 20%|█▉        | 2468/12500 [4:19:40<18:38:09,  6.69s/it]

{'loss': 0.7249, 'grad_norm': 0.30207207798957825, 'learning_rate': 0.00016057623049219686, 'epoch': 0.2}


 20%|█▉        | 2469/12500 [4:19:48<19:27:07,  6.98s/it]

{'loss': 0.7185, 'grad_norm': 0.2473737746477127, 'learning_rate': 0.00016056022408963587, 'epoch': 0.2}


 20%|█▉        | 2470/12500 [4:19:51<16:18:47,  5.86s/it]

{'loss': 0.8978, 'grad_norm': 0.45115745067596436, 'learning_rate': 0.00016054421768707484, 'epoch': 0.2}


 20%|█▉        | 2471/12500 [4:19:56<15:41:59,  5.64s/it]

{'loss': 0.8506, 'grad_norm': 0.45228180289268494, 'learning_rate': 0.00016052821128451382, 'epoch': 0.2}


 20%|█▉        | 2472/12500 [4:20:02<16:19:53,  5.86s/it]

{'loss': 0.8363, 'grad_norm': 0.36211347579956055, 'learning_rate': 0.0001605122048819528, 'epoch': 0.2}


 20%|█▉        | 2473/12500 [4:20:08<16:07:27,  5.79s/it]

{'loss': 0.4278, 'grad_norm': 0.23558785021305084, 'learning_rate': 0.00016049619847939177, 'epoch': 0.2}


 20%|█▉        | 2474/12500 [4:20:18<19:31:17,  7.01s/it]

{'loss': 0.7344, 'grad_norm': 0.23544184863567352, 'learning_rate': 0.00016048019207683074, 'epoch': 0.2}


 20%|█▉        | 2475/12500 [4:20:22<16:51:04,  6.05s/it]

{'loss': 0.6894, 'grad_norm': 0.36480414867401123, 'learning_rate': 0.00016046418567426972, 'epoch': 0.2}


 20%|█▉        | 2476/12500 [4:20:28<17:10:17,  6.17s/it]

{'loss': 1.0648, 'grad_norm': 0.3063816428184509, 'learning_rate': 0.0001604481792717087, 'epoch': 0.2}


 20%|█▉        | 2477/12500 [4:20:32<15:24:39,  5.54s/it]

{'loss': 0.6476, 'grad_norm': 0.26579639315605164, 'learning_rate': 0.00016043217286914766, 'epoch': 0.2}


 20%|█▉        | 2478/12500 [4:20:37<14:56:17,  5.37s/it]

{'loss': 0.6941, 'grad_norm': 0.2841062545776367, 'learning_rate': 0.00016041616646658664, 'epoch': 0.2}


 20%|█▉        | 2479/12500 [4:20:41<14:08:09,  5.08s/it]

{'loss': 0.9097, 'grad_norm': 0.3176998496055603, 'learning_rate': 0.00016040016006402564, 'epoch': 0.2}


 20%|█▉        | 2480/12500 [4:20:45<13:07:07,  4.71s/it]

{'loss': 0.8352, 'grad_norm': 0.284036785364151, 'learning_rate': 0.0001603841536614646, 'epoch': 0.2}


 20%|█▉        | 2481/12500 [4:20:51<13:34:57,  4.88s/it]

{'loss': 0.6119, 'grad_norm': 0.24646034836769104, 'learning_rate': 0.00016036814725890356, 'epoch': 0.2}


 20%|█▉        | 2482/12500 [4:20:57<14:47:03,  5.31s/it]

{'loss': 0.8889, 'grad_norm': 0.2573171854019165, 'learning_rate': 0.00016035214085634254, 'epoch': 0.2}


 20%|█▉        | 2483/12500 [4:21:03<15:06:53,  5.43s/it]

{'loss': 0.9352, 'grad_norm': 0.293025940656662, 'learning_rate': 0.00016033613445378154, 'epoch': 0.2}


 20%|█▉        | 2484/12500 [4:21:07<14:18:45,  5.14s/it]

{'loss': 0.7797, 'grad_norm': 0.28090551495552063, 'learning_rate': 0.0001603201280512205, 'epoch': 0.2}


 20%|█▉        | 2485/12500 [4:21:14<16:08:22,  5.80s/it]

{'loss': 0.514, 'grad_norm': 0.21390558779239655, 'learning_rate': 0.00016030412164865946, 'epoch': 0.2}


 20%|█▉        | 2486/12500 [4:21:21<16:44:02,  6.02s/it]

{'loss': 0.7417, 'grad_norm': 0.2516278624534607, 'learning_rate': 0.00016028811524609847, 'epoch': 0.2}


 20%|█▉        | 2487/12500 [4:21:30<19:13:47,  6.91s/it]

{'loss': 0.6294, 'grad_norm': 0.24576346576213837, 'learning_rate': 0.00016027210884353744, 'epoch': 0.2}


 20%|█▉        | 2488/12500 [4:21:35<17:36:15,  6.33s/it]

{'loss': 0.6135, 'grad_norm': 0.2478315681219101, 'learning_rate': 0.0001602561024409764, 'epoch': 0.2}


 20%|█▉        | 2489/12500 [4:21:39<16:06:11,  5.79s/it]

{'loss': 0.7198, 'grad_norm': 0.2584528923034668, 'learning_rate': 0.00016024009603841536, 'epoch': 0.2}


 20%|█▉        | 2490/12500 [4:21:44<15:06:38,  5.43s/it]

{'loss': 0.5243, 'grad_norm': 0.2908846139907837, 'learning_rate': 0.00016022408963585437, 'epoch': 0.2}


 20%|█▉        | 2491/12500 [4:21:50<15:49:05,  5.69s/it]

{'loss': 0.5757, 'grad_norm': 0.2387700378894806, 'learning_rate': 0.00016020808323329334, 'epoch': 0.2}


 20%|█▉        | 2492/12500 [4:22:01<20:13:33,  7.28s/it]

{'loss': 0.603, 'grad_norm': 0.18182648718357086, 'learning_rate': 0.0001601920768307323, 'epoch': 0.2}


 20%|█▉        | 2493/12500 [4:22:12<23:11:33,  8.34s/it]

{'loss': 0.8719, 'grad_norm': 0.253186970949173, 'learning_rate': 0.00016017607042817126, 'epoch': 0.2}


 20%|█▉        | 2494/12500 [4:22:19<21:43:11,  7.81s/it]

{'loss': 0.8624, 'grad_norm': 0.26210182905197144, 'learning_rate': 0.00016016006402561026, 'epoch': 0.2}


 20%|█▉        | 2495/12500 [4:22:25<20:18:38,  7.31s/it]

{'loss': 0.6366, 'grad_norm': 0.2927115857601166, 'learning_rate': 0.00016014405762304924, 'epoch': 0.2}


 20%|█▉        | 2496/12500 [4:22:33<21:24:30,  7.70s/it]

{'loss': 0.7927, 'grad_norm': 0.24766801297664642, 'learning_rate': 0.0001601280512204882, 'epoch': 0.2}


 20%|█▉        | 2497/12500 [4:22:39<19:11:49,  6.91s/it]

{'loss': 0.4812, 'grad_norm': 0.25597885251045227, 'learning_rate': 0.0001601120448179272, 'epoch': 0.2}


 20%|█▉        | 2498/12500 [4:22:49<22:28:55,  8.09s/it]

{'loss': 0.9646, 'grad_norm': 0.20056559145450592, 'learning_rate': 0.00016009603841536616, 'epoch': 0.2}


 20%|█▉        | 2499/12500 [4:22:53<18:56:09,  6.82s/it]

{'loss': 0.683, 'grad_norm': 0.31965330243110657, 'learning_rate': 0.00016008003201280514, 'epoch': 0.2}


 20%|██        | 2500/12500 [4:23:04<22:03:38,  7.94s/it]

{'loss': 0.4052, 'grad_norm': 0.19876272976398468, 'learning_rate': 0.0001600640256102441, 'epoch': 0.2}


 20%|██        | 2501/12500 [4:23:10<20:26:52,  7.36s/it]

{'loss': 0.6874, 'grad_norm': 0.3220682740211487, 'learning_rate': 0.0001600480192076831, 'epoch': 0.2}


 20%|██        | 2502/12500 [4:23:14<17:58:15,  6.47s/it]

{'loss': 0.4036, 'grad_norm': 0.24297687411308289, 'learning_rate': 0.00016003201280512206, 'epoch': 0.2}


 20%|██        | 2503/12500 [4:23:23<20:08:47,  7.25s/it]

{'loss': 0.6807, 'grad_norm': 0.20887491106987, 'learning_rate': 0.00016001600640256104, 'epoch': 0.2}


 20%|██        | 2504/12500 [4:23:28<18:02:33,  6.50s/it]

{'loss': 0.6278, 'grad_norm': 0.26944291591644287, 'learning_rate': 0.00016, 'epoch': 0.2}


 20%|██        | 2505/12500 [4:23:34<17:36:36,  6.34s/it]

{'loss': 0.8118, 'grad_norm': 0.29477426409721375, 'learning_rate': 0.000159983993597439, 'epoch': 0.2}


 20%|██        | 2506/12500 [4:23:39<16:49:44,  6.06s/it]

{'loss': 0.8147, 'grad_norm': 0.2817211449146271, 'learning_rate': 0.00015996798719487796, 'epoch': 0.2}


 20%|██        | 2507/12500 [4:23:47<18:02:00,  6.50s/it]

{'loss': 0.7377, 'grad_norm': 0.21546491980552673, 'learning_rate': 0.00015995198079231694, 'epoch': 0.2}


 20%|██        | 2508/12500 [4:23:52<16:54:21,  6.09s/it]

{'loss': 0.5982, 'grad_norm': 0.27111905813217163, 'learning_rate': 0.0001599359743897559, 'epoch': 0.2}


 20%|██        | 2509/12500 [4:24:00<18:09:54,  6.55s/it]

{'loss': 0.5156, 'grad_norm': 0.23899811506271362, 'learning_rate': 0.0001599199679871949, 'epoch': 0.2}


 20%|██        | 2510/12500 [4:24:07<19:01:11,  6.85s/it]

{'loss': 0.7765, 'grad_norm': 0.255790650844574, 'learning_rate': 0.00015990396158463386, 'epoch': 0.2}


 20%|██        | 2511/12500 [4:24:14<18:45:10,  6.76s/it]

{'loss': 0.3747, 'grad_norm': 0.22181075811386108, 'learning_rate': 0.00015988795518207284, 'epoch': 0.2}


 20%|██        | 2512/12500 [4:24:20<18:11:55,  6.56s/it]

{'loss': 0.6092, 'grad_norm': 0.28889426589012146, 'learning_rate': 0.0001598719487795118, 'epoch': 0.2}


 20%|██        | 2513/12500 [4:24:25<16:54:01,  6.09s/it]

{'loss': 0.6742, 'grad_norm': 0.30140820145606995, 'learning_rate': 0.0001598559423769508, 'epoch': 0.2}


 20%|██        | 2514/12500 [4:24:29<14:54:45,  5.38s/it]

{'loss': 0.918, 'grad_norm': 0.3848235607147217, 'learning_rate': 0.00015983993597438976, 'epoch': 0.2}


 20%|██        | 2515/12500 [4:24:35<15:57:39,  5.75s/it]

{'loss': 0.5284, 'grad_norm': 0.23257800936698914, 'learning_rate': 0.00015982392957182874, 'epoch': 0.2}


 20%|██        | 2516/12500 [4:24:42<17:08:49,  6.18s/it]

{'loss': 0.4598, 'grad_norm': 0.22957341372966766, 'learning_rate': 0.0001598079231692677, 'epoch': 0.2}


 20%|██        | 2517/12500 [4:24:50<18:03:03,  6.51s/it]

{'loss': 0.7636, 'grad_norm': 0.24488523602485657, 'learning_rate': 0.00015979191676670669, 'epoch': 0.2}


 20%|██        | 2518/12500 [4:24:54<16:12:59,  5.85s/it]

{'loss': 0.6567, 'grad_norm': 0.30268460512161255, 'learning_rate': 0.0001597759103641457, 'epoch': 0.2}


 20%|██        | 2519/12500 [4:25:01<17:36:38,  6.35s/it]

{'loss': 0.7499, 'grad_norm': 0.25401803851127625, 'learning_rate': 0.00015975990396158464, 'epoch': 0.2}


 20%|██        | 2520/12500 [4:25:06<16:13:08,  5.85s/it]

{'loss': 0.5031, 'grad_norm': 0.24730636179447174, 'learning_rate': 0.0001597438975590236, 'epoch': 0.2}


 20%|██        | 2521/12500 [4:25:13<16:59:18,  6.13s/it]

{'loss': 0.8466, 'grad_norm': 0.24575331807136536, 'learning_rate': 0.00015972789115646259, 'epoch': 0.2}


 20%|██        | 2522/12500 [4:25:22<19:23:32,  7.00s/it]

{'loss': 0.5356, 'grad_norm': 0.2166052758693695, 'learning_rate': 0.0001597118847539016, 'epoch': 0.2}


 20%|██        | 2523/12500 [4:25:31<21:09:43,  7.64s/it]

{'loss': 0.8706, 'grad_norm': 0.23033736646175385, 'learning_rate': 0.00015969587835134054, 'epoch': 0.2}


 20%|██        | 2524/12500 [4:25:40<22:31:19,  8.13s/it]

{'loss': 0.7714, 'grad_norm': 0.2190445065498352, 'learning_rate': 0.0001596798719487795, 'epoch': 0.2}


 20%|██        | 2525/12500 [4:25:47<20:57:07,  7.56s/it]

{'loss': 0.4407, 'grad_norm': 0.2179044634103775, 'learning_rate': 0.0001596638655462185, 'epoch': 0.2}


 20%|██        | 2526/12500 [4:25:52<19:28:13,  7.03s/it]

{'loss': 0.7025, 'grad_norm': 0.2870701849460602, 'learning_rate': 0.0001596478591436575, 'epoch': 0.2}


 20%|██        | 2527/12500 [4:25:58<18:00:13,  6.50s/it]

{'loss': 0.6466, 'grad_norm': 0.3418903350830078, 'learning_rate': 0.00015963185274109643, 'epoch': 0.2}


 20%|██        | 2528/12500 [4:26:04<18:07:35,  6.54s/it]

{'loss': 0.7555, 'grad_norm': 0.2550242245197296, 'learning_rate': 0.0001596158463385354, 'epoch': 0.2}


 20%|██        | 2529/12500 [4:26:10<17:18:28,  6.25s/it]

{'loss': 0.7209, 'grad_norm': 0.26835083961486816, 'learning_rate': 0.0001595998399359744, 'epoch': 0.2}


 20%|██        | 2530/12500 [4:26:16<17:10:20,  6.20s/it]

{'loss': 0.7189, 'grad_norm': 0.2739408016204834, 'learning_rate': 0.00015958383353341339, 'epoch': 0.2}


 20%|██        | 2531/12500 [4:26:20<15:32:55,  5.61s/it]

{'loss': 0.7618, 'grad_norm': 0.31016290187835693, 'learning_rate': 0.00015956782713085233, 'epoch': 0.2}


 20%|██        | 2532/12500 [4:26:27<16:23:18,  5.92s/it]

{'loss': 0.7422, 'grad_norm': 0.29731282591819763, 'learning_rate': 0.00015955182072829134, 'epoch': 0.2}


 20%|██        | 2533/12500 [4:26:31<14:33:30,  5.26s/it]

{'loss': 0.8377, 'grad_norm': 0.3217669129371643, 'learning_rate': 0.0001595358143257303, 'epoch': 0.2}


 20%|██        | 2534/12500 [4:26:35<13:52:51,  5.01s/it]

{'loss': 0.5313, 'grad_norm': 0.27131664752960205, 'learning_rate': 0.00015951980792316929, 'epoch': 0.2}


 20%|██        | 2535/12500 [4:26:38<12:36:08,  4.55s/it]

{'loss': 0.9672, 'grad_norm': 0.3766726851463318, 'learning_rate': 0.00015950380152060823, 'epoch': 0.2}


 20%|██        | 2536/12500 [4:26:48<17:01:22,  6.15s/it]

{'loss': 0.6981, 'grad_norm': 0.17037025094032288, 'learning_rate': 0.00015948779511804724, 'epoch': 0.2}


 20%|██        | 2537/12500 [4:26:53<15:51:41,  5.73s/it]

{'loss': 0.5618, 'grad_norm': 0.2906986474990845, 'learning_rate': 0.0001594717887154862, 'epoch': 0.2}


 20%|██        | 2538/12500 [4:26:59<15:46:15,  5.70s/it]

{'loss': 0.5918, 'grad_norm': 0.28632134199142456, 'learning_rate': 0.00015945578231292519, 'epoch': 0.2}


 20%|██        | 2539/12500 [4:27:03<14:47:10,  5.34s/it]

{'loss': 0.8074, 'grad_norm': 0.33635827898979187, 'learning_rate': 0.00015943977591036416, 'epoch': 0.2}


 20%|██        | 2540/12500 [4:27:10<16:03:25,  5.80s/it]

{'loss': 0.6217, 'grad_norm': 0.23388680815696716, 'learning_rate': 0.00015942376950780313, 'epoch': 0.2}


 20%|██        | 2541/12500 [4:27:15<15:02:36,  5.44s/it]

{'loss': 0.5805, 'grad_norm': 0.28715965151786804, 'learning_rate': 0.0001594077631052421, 'epoch': 0.2}


 20%|██        | 2542/12500 [4:27:21<15:59:30,  5.78s/it]

{'loss': 0.7242, 'grad_norm': 0.31834062933921814, 'learning_rate': 0.00015939175670268108, 'epoch': 0.2}


 20%|██        | 2543/12500 [4:27:26<14:54:20,  5.39s/it]

{'loss': 0.5418, 'grad_norm': 0.2552608549594879, 'learning_rate': 0.00015937575030012006, 'epoch': 0.2}


 20%|██        | 2544/12500 [4:27:33<16:08:30,  5.84s/it]

{'loss': 0.8997, 'grad_norm': 0.3083888590335846, 'learning_rate': 0.00015935974389755903, 'epoch': 0.2}


 20%|██        | 2545/12500 [4:27:43<19:56:35,  7.21s/it]

{'loss': 0.7488, 'grad_norm': 0.2220778614282608, 'learning_rate': 0.000159343737494998, 'epoch': 0.2}


 20%|██        | 2546/12500 [4:27:54<23:14:55,  8.41s/it]

{'loss': 0.8498, 'grad_norm': 0.19982507824897766, 'learning_rate': 0.00015932773109243698, 'epoch': 0.2}


 20%|██        | 2547/12500 [4:28:00<21:24:38,  7.74s/it]

{'loss': 0.9297, 'grad_norm': 0.2980991005897522, 'learning_rate': 0.00015931172468987596, 'epoch': 0.2}


 20%|██        | 2548/12500 [4:28:06<19:36:57,  7.10s/it]

{'loss': 0.6923, 'grad_norm': 0.31633254885673523, 'learning_rate': 0.00015929571828731493, 'epoch': 0.2}


 20%|██        | 2549/12500 [4:28:10<16:58:38,  6.14s/it]

{'loss': 0.5473, 'grad_norm': 0.31923550367355347, 'learning_rate': 0.0001592797118847539, 'epoch': 0.2}


 20%|██        | 2550/12500 [4:28:17<17:45:40,  6.43s/it]

{'loss': 0.6797, 'grad_norm': 0.2352590560913086, 'learning_rate': 0.00015926370548219288, 'epoch': 0.2}


 20%|██        | 2551/12500 [4:28:26<19:57:50,  7.22s/it]

{'loss': 0.8097, 'grad_norm': 0.20488739013671875, 'learning_rate': 0.00015924769907963186, 'epoch': 0.2}


 20%|██        | 2552/12500 [4:28:31<17:51:51,  6.46s/it]

{'loss': 0.7092, 'grad_norm': 0.32180094718933105, 'learning_rate': 0.00015923169267707083, 'epoch': 0.2}


 20%|██        | 2553/12500 [4:28:36<16:49:26,  6.09s/it]

{'loss': 0.7119, 'grad_norm': 0.3041050434112549, 'learning_rate': 0.0001592156862745098, 'epoch': 0.2}


 20%|██        | 2554/12500 [4:28:46<19:45:41,  7.15s/it]

{'loss': 0.7755, 'grad_norm': 0.19012168049812317, 'learning_rate': 0.00015919967987194878, 'epoch': 0.2}


 20%|██        | 2555/12500 [4:28:50<17:49:31,  6.45s/it]

{'loss': 0.7312, 'grad_norm': 0.2713894248008728, 'learning_rate': 0.00015918367346938776, 'epoch': 0.2}


 20%|██        | 2556/12500 [4:28:56<16:45:29,  6.07s/it]

{'loss': 0.6701, 'grad_norm': 0.307898610830307, 'learning_rate': 0.00015916766706682673, 'epoch': 0.2}


 20%|██        | 2557/12500 [4:29:03<17:32:34,  6.35s/it]

{'loss': 0.7305, 'grad_norm': 0.26574045419692993, 'learning_rate': 0.00015915166066426573, 'epoch': 0.2}


 20%|██        | 2558/12500 [4:29:12<19:46:31,  7.16s/it]

{'loss': 0.4919, 'grad_norm': 0.21519730985164642, 'learning_rate': 0.00015913565426170468, 'epoch': 0.2}


 20%|██        | 2559/12500 [4:29:22<22:23:15,  8.11s/it]

{'loss': 0.4938, 'grad_norm': 0.19138221442699432, 'learning_rate': 0.00015911964785914366, 'epoch': 0.2}


 20%|██        | 2560/12500 [4:29:27<20:04:38,  7.27s/it]

{'loss': 0.6472, 'grad_norm': 0.27407389879226685, 'learning_rate': 0.00015910364145658263, 'epoch': 0.2}


 20%|██        | 2561/12500 [4:29:35<20:05:36,  7.28s/it]

{'loss': 0.5065, 'grad_norm': 0.21944935619831085, 'learning_rate': 0.00015908763505402163, 'epoch': 0.2}


 20%|██        | 2562/12500 [4:29:40<18:44:01,  6.79s/it]

{'loss': 0.537, 'grad_norm': 0.21872977912425995, 'learning_rate': 0.00015907162865146058, 'epoch': 0.2}


 21%|██        | 2563/12500 [4:29:45<16:47:56,  6.09s/it]

{'loss': 0.7567, 'grad_norm': 0.31402021646499634, 'learning_rate': 0.00015905562224889956, 'epoch': 0.21}


 21%|██        | 2564/12500 [4:29:50<15:55:51,  5.77s/it]

{'loss': 0.8095, 'grad_norm': 0.3005744516849518, 'learning_rate': 0.00015903961584633856, 'epoch': 0.21}


 21%|██        | 2565/12500 [4:29:57<16:54:24,  6.13s/it]

{'loss': 0.5878, 'grad_norm': 0.24265645444393158, 'learning_rate': 0.00015902360944377753, 'epoch': 0.21}


 21%|██        | 2566/12500 [4:30:07<20:00:04,  7.25s/it]

{'loss': 0.9339, 'grad_norm': 0.2928624451160431, 'learning_rate': 0.00015900760304121648, 'epoch': 0.21}


 21%|██        | 2567/12500 [4:30:12<18:29:35,  6.70s/it]

{'loss': 0.765, 'grad_norm': 0.3218110203742981, 'learning_rate': 0.00015899159663865546, 'epoch': 0.21}


 21%|██        | 2568/12500 [4:30:19<18:24:25,  6.67s/it]

{'loss': 0.5626, 'grad_norm': 0.23944362998008728, 'learning_rate': 0.00015897559023609446, 'epoch': 0.21}


 21%|██        | 2569/12500 [4:30:24<17:10:18,  6.22s/it]

{'loss': 0.7458, 'grad_norm': 0.34537461400032043, 'learning_rate': 0.00015895958383353343, 'epoch': 0.21}


 21%|██        | 2570/12500 [4:30:33<19:57:19,  7.23s/it]

{'loss': 0.9229, 'grad_norm': 0.23471200466156006, 'learning_rate': 0.00015894357743097238, 'epoch': 0.21}


 21%|██        | 2571/12500 [4:30:40<19:29:54,  7.07s/it]

{'loss': 0.6064, 'grad_norm': 0.260790079832077, 'learning_rate': 0.00015892757102841138, 'epoch': 0.21}


 21%|██        | 2572/12500 [4:30:49<20:50:41,  7.56s/it]

{'loss': 1.0106, 'grad_norm': 0.23955486714839935, 'learning_rate': 0.00015891156462585036, 'epoch': 0.21}


 21%|██        | 2573/12500 [4:30:56<20:30:42,  7.44s/it]

{'loss': 0.7546, 'grad_norm': 0.3004777729511261, 'learning_rate': 0.00015889555822328933, 'epoch': 0.21}


 21%|██        | 2574/12500 [4:31:02<19:27:15,  7.06s/it]

{'loss': 0.6687, 'grad_norm': 0.2964330315589905, 'learning_rate': 0.00015887955182072828, 'epoch': 0.21}


 21%|██        | 2575/12500 [4:31:07<17:46:59,  6.45s/it]

{'loss': 0.5488, 'grad_norm': 0.3007045090198517, 'learning_rate': 0.00015886354541816728, 'epoch': 0.21}


 21%|██        | 2576/12500 [4:31:17<20:22:02,  7.39s/it]

{'loss': 0.4857, 'grad_norm': 0.2342679351568222, 'learning_rate': 0.00015884753901560626, 'epoch': 0.21}


 21%|██        | 2577/12500 [4:31:22<18:39:12,  6.77s/it]

{'loss': 0.7681, 'grad_norm': 0.3023098409175873, 'learning_rate': 0.00015883153261304523, 'epoch': 0.21}


 21%|██        | 2578/12500 [4:31:27<17:15:25,  6.26s/it]

{'loss': 0.8103, 'grad_norm': 0.2985825836658478, 'learning_rate': 0.0001588155262104842, 'epoch': 0.21}


 21%|██        | 2579/12500 [4:31:31<15:20:33,  5.57s/it]

{'loss': 0.6769, 'grad_norm': 0.29646140336990356, 'learning_rate': 0.00015879951980792318, 'epoch': 0.21}


 21%|██        | 2580/12500 [4:31:37<15:36:05,  5.66s/it]

{'loss': 0.5517, 'grad_norm': 0.24792872369289398, 'learning_rate': 0.00015878351340536216, 'epoch': 0.21}


 21%|██        | 2581/12500 [4:31:45<17:28:51,  6.34s/it]

{'loss': 0.5094, 'grad_norm': 0.241983100771904, 'learning_rate': 0.00015876750700280113, 'epoch': 0.21}


 21%|██        | 2582/12500 [4:31:55<20:57:20,  7.61s/it]

{'loss': 0.88, 'grad_norm': 0.24735002219676971, 'learning_rate': 0.0001587515006002401, 'epoch': 0.21}


 21%|██        | 2583/12500 [4:32:04<21:41:22,  7.87s/it]

{'loss': 0.5535, 'grad_norm': 0.21221020817756653, 'learning_rate': 0.00015873549419767908, 'epoch': 0.21}


 21%|██        | 2584/12500 [4:32:12<21:49:31,  7.92s/it]

{'loss': 0.747, 'grad_norm': 0.2540779411792755, 'learning_rate': 0.00015871948779511806, 'epoch': 0.21}


 21%|██        | 2585/12500 [4:32:17<19:47:10,  7.18s/it]

{'loss': 0.8804, 'grad_norm': 0.3251204192638397, 'learning_rate': 0.00015870348139255703, 'epoch': 0.21}


 21%|██        | 2586/12500 [4:32:22<17:57:02,  6.52s/it]

{'loss': 0.5528, 'grad_norm': 0.30440372228622437, 'learning_rate': 0.000158687474989996, 'epoch': 0.21}


 21%|██        | 2587/12500 [4:32:31<19:26:36,  7.06s/it]

{'loss': 0.854, 'grad_norm': 0.24068307876586914, 'learning_rate': 0.00015867146858743498, 'epoch': 0.21}


 21%|██        | 2588/12500 [4:32:38<19:43:53,  7.17s/it]

{'loss': 0.7542, 'grad_norm': 0.22566305100917816, 'learning_rate': 0.00015865546218487396, 'epoch': 0.21}


 21%|██        | 2589/12500 [4:32:46<19:58:11,  7.25s/it]

{'loss': 0.5623, 'grad_norm': 0.29601967334747314, 'learning_rate': 0.00015863945578231293, 'epoch': 0.21}


 21%|██        | 2590/12500 [4:32:53<19:50:09,  7.21s/it]

{'loss': 0.9345, 'grad_norm': 0.27601903676986694, 'learning_rate': 0.0001586234493797519, 'epoch': 0.21}


 21%|██        | 2591/12500 [4:32:58<18:34:01,  6.75s/it]

{'loss': 0.6013, 'grad_norm': 0.2628277540206909, 'learning_rate': 0.00015860744297719088, 'epoch': 0.21}


 21%|██        | 2592/12500 [4:33:03<17:06:50,  6.22s/it]

{'loss': 0.4999, 'grad_norm': 0.26461222767829895, 'learning_rate': 0.00015859143657462988, 'epoch': 0.21}


 21%|██        | 2593/12500 [4:33:11<18:40:04,  6.78s/it]

{'loss': 0.773, 'grad_norm': 0.23458026349544525, 'learning_rate': 0.00015857543017206883, 'epoch': 0.21}


 21%|██        | 2594/12500 [4:33:15<16:23:19,  5.96s/it]

{'loss': 1.0169, 'grad_norm': 0.3778652250766754, 'learning_rate': 0.0001585594237695078, 'epoch': 0.21}


 21%|██        | 2595/12500 [4:33:24<18:15:00,  6.63s/it]

{'loss': 0.6323, 'grad_norm': 0.28981348872184753, 'learning_rate': 0.00015854341736694678, 'epoch': 0.21}


 21%|██        | 2596/12500 [4:33:29<16:48:06,  6.11s/it]

{'loss': 0.8672, 'grad_norm': 0.35616210103034973, 'learning_rate': 0.00015852741096438578, 'epoch': 0.21}


 21%|██        | 2597/12500 [4:33:35<16:54:47,  6.15s/it]

{'loss': 0.5969, 'grad_norm': 0.2495376616716385, 'learning_rate': 0.00015851140456182473, 'epoch': 0.21}


 21%|██        | 2598/12500 [4:33:40<16:13:10,  5.90s/it]

{'loss': 0.6302, 'grad_norm': 0.2981876730918884, 'learning_rate': 0.0001584953981592637, 'epoch': 0.21}


 21%|██        | 2599/12500 [4:33:47<17:26:19,  6.34s/it]

{'loss': 0.996, 'grad_norm': 0.21625962853431702, 'learning_rate': 0.00015847939175670268, 'epoch': 0.21}


 21%|██        | 2600/12500 [4:33:53<16:33:20,  6.02s/it]

{'loss': 0.5591, 'grad_norm': 0.24622106552124023, 'learning_rate': 0.00015846338535414168, 'epoch': 0.21}


 21%|██        | 2601/12500 [4:34:02<19:19:33,  7.03s/it]

{'loss': 1.0149, 'grad_norm': 0.24433450400829315, 'learning_rate': 0.00015844737895158063, 'epoch': 0.21}


 21%|██        | 2602/12500 [4:34:09<18:48:26,  6.84s/it]

{'loss': 0.8363, 'grad_norm': 0.2582436501979828, 'learning_rate': 0.0001584313725490196, 'epoch': 0.21}


 21%|██        | 2603/12500 [4:34:12<16:10:01,  5.88s/it]

{'loss': 0.6639, 'grad_norm': 0.3284609913825989, 'learning_rate': 0.0001584153661464586, 'epoch': 0.21}


 21%|██        | 2604/12500 [4:34:17<15:32:46,  5.66s/it]

{'loss': 0.7427, 'grad_norm': 0.30178144574165344, 'learning_rate': 0.00015839935974389758, 'epoch': 0.21}


 21%|██        | 2605/12500 [4:34:29<20:08:27,  7.33s/it]

{'loss': 0.7989, 'grad_norm': 0.19304142892360687, 'learning_rate': 0.00015838335334133653, 'epoch': 0.21}


 21%|██        | 2606/12500 [4:34:32<17:16:39,  6.29s/it]

{'loss': 0.5707, 'grad_norm': 0.3400813043117523, 'learning_rate': 0.0001583673469387755, 'epoch': 0.21}


 21%|██        | 2607/12500 [4:34:39<17:12:37,  6.26s/it]

{'loss': 0.5608, 'grad_norm': 0.2796184718608856, 'learning_rate': 0.0001583513405362145, 'epoch': 0.21}


 21%|██        | 2608/12500 [4:34:44<16:29:18,  6.00s/it]

{'loss': 0.5043, 'grad_norm': 0.21951261162757874, 'learning_rate': 0.00015833533413365348, 'epoch': 0.21}


 21%|██        | 2609/12500 [4:34:48<14:49:30,  5.40s/it]

{'loss': 0.7149, 'grad_norm': 0.3119041919708252, 'learning_rate': 0.00015831932773109243, 'epoch': 0.21}


 21%|██        | 2610/12500 [4:34:56<17:09:48,  6.25s/it]

{'loss': 0.733, 'grad_norm': 0.22774995863437653, 'learning_rate': 0.00015830332132853143, 'epoch': 0.21}


 21%|██        | 2611/12500 [4:35:06<20:12:28,  7.36s/it]

{'loss': 0.6418, 'grad_norm': 0.20724964141845703, 'learning_rate': 0.0001582873149259704, 'epoch': 0.21}


 21%|██        | 2612/12500 [4:35:14<20:14:46,  7.37s/it]

{'loss': 0.8203, 'grad_norm': 0.30366939306259155, 'learning_rate': 0.00015827130852340938, 'epoch': 0.21}


 21%|██        | 2613/12500 [4:35:19<18:23:03,  6.69s/it]

{'loss': 0.8072, 'grad_norm': 0.3046494126319885, 'learning_rate': 0.00015825530212084833, 'epoch': 0.21}


 21%|██        | 2614/12500 [4:35:23<16:17:58,  5.94s/it]

{'loss': 0.7946, 'grad_norm': 0.3563618063926697, 'learning_rate': 0.00015823929571828733, 'epoch': 0.21}


 21%|██        | 2615/12500 [4:35:28<15:41:45,  5.72s/it]

{'loss': 0.7922, 'grad_norm': 0.2891576588153839, 'learning_rate': 0.0001582232893157263, 'epoch': 0.21}


 21%|██        | 2616/12500 [4:35:35<16:52:31,  6.15s/it]

{'loss': 0.7132, 'grad_norm': 0.24431590735912323, 'learning_rate': 0.00015820728291316528, 'epoch': 0.21}


 21%|██        | 2617/12500 [4:35:42<17:16:56,  6.30s/it]

{'loss': 0.7423, 'grad_norm': 0.23592258989810944, 'learning_rate': 0.00015819127651060425, 'epoch': 0.21}


 21%|██        | 2618/12500 [4:35:49<17:35:51,  6.41s/it]

{'loss': 0.5882, 'grad_norm': 0.24860408902168274, 'learning_rate': 0.00015817527010804323, 'epoch': 0.21}


 21%|██        | 2619/12500 [4:35:55<17:41:34,  6.45s/it]

{'loss': 0.6659, 'grad_norm': 0.3265310525894165, 'learning_rate': 0.0001581592637054822, 'epoch': 0.21}


 21%|██        | 2620/12500 [4:36:01<16:52:37,  6.15s/it]

{'loss': 0.5915, 'grad_norm': 0.24703088402748108, 'learning_rate': 0.00015814325730292118, 'epoch': 0.21}


 21%|██        | 2621/12500 [4:36:06<15:57:13,  5.81s/it]

{'loss': 0.7057, 'grad_norm': 0.2844353914260864, 'learning_rate': 0.00015812725090036015, 'epoch': 0.21}


 21%|██        | 2622/12500 [4:36:13<17:13:41,  6.28s/it]

{'loss': 1.1422, 'grad_norm': 0.252200722694397, 'learning_rate': 0.00015811124449779913, 'epoch': 0.21}


 21%|██        | 2623/12500 [4:36:22<19:16:51,  7.03s/it]

{'loss': 0.3869, 'grad_norm': 0.21689416468143463, 'learning_rate': 0.0001580952380952381, 'epoch': 0.21}


 21%|██        | 2624/12500 [4:36:27<17:59:34,  6.56s/it]

{'loss': 0.8679, 'grad_norm': 0.3201788663864136, 'learning_rate': 0.00015807923169267708, 'epoch': 0.21}


 21%|██        | 2625/12500 [4:36:34<18:11:33,  6.63s/it]

{'loss': 0.5431, 'grad_norm': 0.27928298711776733, 'learning_rate': 0.00015806322529011605, 'epoch': 0.21}


 21%|██        | 2626/12500 [4:36:43<20:12:50,  7.37s/it]

{'loss': 0.758, 'grad_norm': 0.28749093413352966, 'learning_rate': 0.00015804721888755503, 'epoch': 0.21}


 21%|██        | 2627/12500 [4:36:50<19:51:02,  7.24s/it]

{'loss': 0.8475, 'grad_norm': 0.24830740690231323, 'learning_rate': 0.000158031212484994, 'epoch': 0.21}


 21%|██        | 2628/12500 [4:36:59<21:21:32,  7.79s/it]

{'loss': 0.6749, 'grad_norm': 0.22290466725826263, 'learning_rate': 0.00015801520608243298, 'epoch': 0.21}


 21%|██        | 2629/12500 [4:37:06<21:03:12,  7.68s/it]

{'loss': 1.0916, 'grad_norm': 0.28390371799468994, 'learning_rate': 0.00015799919967987195, 'epoch': 0.21}


 21%|██        | 2630/12500 [4:37:12<18:55:34,  6.90s/it]

{'loss': 0.9572, 'grad_norm': 0.3181096315383911, 'learning_rate': 0.00015798319327731093, 'epoch': 0.21}


 21%|██        | 2631/12500 [4:37:19<18:57:22,  6.91s/it]

{'loss': 0.8908, 'grad_norm': 0.25613757967948914, 'learning_rate': 0.00015796718687474993, 'epoch': 0.21}


 21%|██        | 2632/12500 [4:37:23<17:20:37,  6.33s/it]

{'loss': 0.6872, 'grad_norm': 0.28561073541641235, 'learning_rate': 0.00015795118047218888, 'epoch': 0.21}


 21%|██        | 2633/12500 [4:37:31<18:31:07,  6.76s/it]

{'loss': 0.6607, 'grad_norm': 0.22427330911159515, 'learning_rate': 0.00015793517406962785, 'epoch': 0.21}


 21%|██        | 2634/12500 [4:37:38<18:12:40,  6.65s/it]

{'loss': 0.7387, 'grad_norm': 0.2954419255256653, 'learning_rate': 0.00015791916766706683, 'epoch': 0.21}


 21%|██        | 2635/12500 [4:37:46<20:02:11,  7.31s/it]

{'loss': 0.4668, 'grad_norm': 0.22646836936473846, 'learning_rate': 0.00015790316126450583, 'epoch': 0.21}


 21%|██        | 2636/12500 [4:37:55<21:20:12,  7.79s/it]

{'loss': 0.8322, 'grad_norm': 0.20353859663009644, 'learning_rate': 0.00015788715486194478, 'epoch': 0.21}


 21%|██        | 2637/12500 [4:38:00<18:37:48,  6.80s/it]

{'loss': 0.821, 'grad_norm': 0.26416850090026855, 'learning_rate': 0.00015787114845938375, 'epoch': 0.21}


 21%|██        | 2638/12500 [4:38:04<16:50:38,  6.15s/it]

{'loss': 0.7359, 'grad_norm': 0.30025187134742737, 'learning_rate': 0.00015785514205682275, 'epoch': 0.21}


 21%|██        | 2639/12500 [4:38:10<16:24:16,  5.99s/it]

{'loss': 0.7493, 'grad_norm': 0.27503886818885803, 'learning_rate': 0.00015783913565426173, 'epoch': 0.21}


 21%|██        | 2640/12500 [4:38:18<18:08:14,  6.62s/it]

{'loss': 0.6844, 'grad_norm': 0.20444907248020172, 'learning_rate': 0.00015782312925170067, 'epoch': 0.21}


 21%|██        | 2641/12500 [4:38:22<16:10:57,  5.91s/it]

{'loss': 0.5097, 'grad_norm': 0.2863463759422302, 'learning_rate': 0.00015780712284913965, 'epoch': 0.21}


 21%|██        | 2642/12500 [4:38:29<16:42:38,  6.10s/it]

{'loss': 0.672, 'grad_norm': 0.25576186180114746, 'learning_rate': 0.00015779111644657865, 'epoch': 0.21}


 21%|██        | 2643/12500 [4:38:34<15:32:04,  5.67s/it]

{'loss': 0.9861, 'grad_norm': 0.28180184960365295, 'learning_rate': 0.00015777511004401763, 'epoch': 0.21}


 21%|██        | 2644/12500 [4:38:41<17:16:13,  6.31s/it]

{'loss': 0.7683, 'grad_norm': 0.24879688024520874, 'learning_rate': 0.00015775910364145657, 'epoch': 0.21}


 21%|██        | 2645/12500 [4:38:47<16:36:49,  6.07s/it]

{'loss': 0.6731, 'grad_norm': 0.2522767186164856, 'learning_rate': 0.00015774309723889558, 'epoch': 0.21}


 21%|██        | 2646/12500 [4:38:53<16:28:54,  6.02s/it]

{'loss': 0.5646, 'grad_norm': 0.23645740747451782, 'learning_rate': 0.00015772709083633455, 'epoch': 0.21}


 21%|██        | 2647/12500 [4:39:02<18:57:23,  6.93s/it]

{'loss': 0.6154, 'grad_norm': 0.19967658817768097, 'learning_rate': 0.00015771108443377353, 'epoch': 0.21}


 21%|██        | 2648/12500 [4:39:07<17:14:03,  6.30s/it]

{'loss': 0.6324, 'grad_norm': 0.29045772552490234, 'learning_rate': 0.00015769507803121247, 'epoch': 0.21}


 21%|██        | 2649/12500 [4:39:10<15:07:28,  5.53s/it]

{'loss': 0.6721, 'grad_norm': 0.31474316120147705, 'learning_rate': 0.00015767907162865148, 'epoch': 0.21}


 21%|██        | 2650/12500 [4:39:20<18:36:55,  6.80s/it]

{'loss': 0.7702, 'grad_norm': 0.20775339007377625, 'learning_rate': 0.00015766306522609045, 'epoch': 0.21}


 21%|██        | 2651/12500 [4:39:26<18:01:44,  6.59s/it]

{'loss': 0.7865, 'grad_norm': 0.27178817987442017, 'learning_rate': 0.00015764705882352943, 'epoch': 0.21}


 21%|██        | 2652/12500 [4:39:30<15:32:58,  5.68s/it]

{'loss': 0.5723, 'grad_norm': 0.30423539876937866, 'learning_rate': 0.0001576310524209684, 'epoch': 0.21}


 21%|██        | 2653/12500 [4:39:35<14:53:08,  5.44s/it]

{'loss': 0.6714, 'grad_norm': 0.25090327858924866, 'learning_rate': 0.00015761504601840737, 'epoch': 0.21}


 21%|██        | 2654/12500 [4:39:41<15:46:55,  5.77s/it]

{'loss': 0.4581, 'grad_norm': 0.22892892360687256, 'learning_rate': 0.00015759903961584635, 'epoch': 0.21}


 21%|██        | 2655/12500 [4:39:46<14:28:53,  5.30s/it]

{'loss': 0.9862, 'grad_norm': 0.37025371193885803, 'learning_rate': 0.00015758303321328532, 'epoch': 0.21}


 21%|██        | 2656/12500 [4:39:51<14:24:43,  5.27s/it]

{'loss': 0.5528, 'grad_norm': 0.28488999605178833, 'learning_rate': 0.0001575670268107243, 'epoch': 0.21}


 21%|██▏       | 2657/12500 [4:39:57<15:12:06,  5.56s/it]

{'loss': 0.673, 'grad_norm': 0.26307404041290283, 'learning_rate': 0.00015755102040816327, 'epoch': 0.21}


 21%|██▏       | 2658/12500 [4:40:01<14:18:11,  5.23s/it]

{'loss': 0.5605, 'grad_norm': 0.27227577567100525, 'learning_rate': 0.00015753501400560225, 'epoch': 0.21}


 21%|██▏       | 2659/12500 [4:40:07<14:28:37,  5.30s/it]

{'loss': 0.3759, 'grad_norm': 0.2154662162065506, 'learning_rate': 0.00015751900760304122, 'epoch': 0.21}


 21%|██▏       | 2660/12500 [4:40:15<16:24:46,  6.00s/it]

{'loss': 0.7563, 'grad_norm': 0.23931671679019928, 'learning_rate': 0.0001575030012004802, 'epoch': 0.21}


 21%|██▏       | 2661/12500 [4:40:20<16:04:27,  5.88s/it]

{'loss': 0.5564, 'grad_norm': 0.2877352833747864, 'learning_rate': 0.00015748699479791917, 'epoch': 0.21}


 21%|██▏       | 2662/12500 [4:40:25<15:24:49,  5.64s/it]

{'loss': 0.8883, 'grad_norm': 0.3878251910209656, 'learning_rate': 0.00015747098839535815, 'epoch': 0.21}


 21%|██▏       | 2663/12500 [4:40:31<15:32:30,  5.69s/it]

{'loss': 0.6916, 'grad_norm': 0.24733027815818787, 'learning_rate': 0.00015745498199279712, 'epoch': 0.21}


 21%|██▏       | 2664/12500 [4:40:43<21:01:54,  7.70s/it]

{'loss': 0.8123, 'grad_norm': 0.18572352826595306, 'learning_rate': 0.0001574389755902361, 'epoch': 0.21}


 21%|██▏       | 2665/12500 [4:40:48<18:48:48,  6.89s/it]

{'loss': 0.832, 'grad_norm': 0.2806042432785034, 'learning_rate': 0.00015742296918767507, 'epoch': 0.21}


 21%|██▏       | 2666/12500 [4:40:54<17:36:00,  6.44s/it]

{'loss': 0.5603, 'grad_norm': 0.24476966261863708, 'learning_rate': 0.00015740696278511405, 'epoch': 0.21}


 21%|██▏       | 2667/12500 [4:41:02<19:01:50,  6.97s/it]

{'loss': 0.9017, 'grad_norm': 0.21675366163253784, 'learning_rate': 0.00015739095638255302, 'epoch': 0.21}


 21%|██▏       | 2668/12500 [4:41:10<19:54:17,  7.29s/it]

{'loss': 0.6416, 'grad_norm': 0.2176073044538498, 'learning_rate': 0.000157374949979992, 'epoch': 0.21}


 21%|██▏       | 2669/12500 [4:41:16<18:39:40,  6.83s/it]

{'loss': 0.7957, 'grad_norm': 0.2482454627752304, 'learning_rate': 0.00015735894357743097, 'epoch': 0.21}


 21%|██▏       | 2670/12500 [4:41:22<17:49:58,  6.53s/it]

{'loss': 0.7243, 'grad_norm': 0.2957845628261566, 'learning_rate': 0.00015734293717486997, 'epoch': 0.21}


 21%|██▏       | 2671/12500 [4:41:28<18:02:35,  6.61s/it]

{'loss': 0.6588, 'grad_norm': 0.20919746160507202, 'learning_rate': 0.00015732693077230892, 'epoch': 0.21}


 21%|██▏       | 2672/12500 [4:41:37<19:45:21,  7.24s/it]

{'loss': 0.8854, 'grad_norm': 0.21347789466381073, 'learning_rate': 0.0001573109243697479, 'epoch': 0.21}


 21%|██▏       | 2673/12500 [4:41:41<17:04:48,  6.26s/it]

{'loss': 0.6574, 'grad_norm': 0.32688283920288086, 'learning_rate': 0.00015729491796718687, 'epoch': 0.21}


 21%|██▏       | 2674/12500 [4:41:50<19:29:06,  7.14s/it]

{'loss': 0.826, 'grad_norm': 0.25154057145118713, 'learning_rate': 0.00015727891156462587, 'epoch': 0.21}


 21%|██▏       | 2675/12500 [4:41:57<19:13:33,  7.04s/it]

{'loss': 0.7505, 'grad_norm': 0.31884628534317017, 'learning_rate': 0.00015726290516206482, 'epoch': 0.21}


 21%|██▏       | 2676/12500 [4:42:05<19:47:47,  7.25s/it]

{'loss': 0.6593, 'grad_norm': 0.2748211920261383, 'learning_rate': 0.0001572468987595038, 'epoch': 0.21}


 21%|██▏       | 2677/12500 [4:42:11<18:32:42,  6.80s/it]

{'loss': 0.7835, 'grad_norm': 0.27436548471450806, 'learning_rate': 0.0001572308923569428, 'epoch': 0.21}


 21%|██▏       | 2678/12500 [4:42:17<18:25:16,  6.75s/it]

{'loss': 0.5808, 'grad_norm': 0.23810650408267975, 'learning_rate': 0.00015721488595438177, 'epoch': 0.21}


 21%|██▏       | 2679/12500 [4:42:23<17:37:17,  6.46s/it]

{'loss': 0.7673, 'grad_norm': 0.2900429964065552, 'learning_rate': 0.00015719887955182072, 'epoch': 0.21}


 21%|██▏       | 2680/12500 [4:42:27<15:12:12,  5.57s/it]

{'loss': 0.7502, 'grad_norm': 0.34496966004371643, 'learning_rate': 0.0001571828731492597, 'epoch': 0.21}


 21%|██▏       | 2681/12500 [4:42:33<15:39:46,  5.74s/it]

{'loss': 0.7492, 'grad_norm': 0.2641153931617737, 'learning_rate': 0.0001571668667466987, 'epoch': 0.21}


 21%|██▏       | 2682/12500 [4:42:40<16:54:54,  6.20s/it]

{'loss': 0.3096, 'grad_norm': 0.19797657430171967, 'learning_rate': 0.00015715086034413767, 'epoch': 0.21}


 21%|██▏       | 2683/12500 [4:42:46<16:55:01,  6.20s/it]

{'loss': 0.5944, 'grad_norm': 0.30187565088272095, 'learning_rate': 0.00015713485394157662, 'epoch': 0.21}


 21%|██▏       | 2684/12500 [4:42:54<18:01:45,  6.61s/it]

{'loss': 1.0039, 'grad_norm': 0.22292327880859375, 'learning_rate': 0.00015711884753901562, 'epoch': 0.21}


 21%|██▏       | 2685/12500 [4:43:00<17:29:56,  6.42s/it]

{'loss': 0.7511, 'grad_norm': 0.2605346441268921, 'learning_rate': 0.0001571028411364546, 'epoch': 0.21}


 21%|██▏       | 2686/12500 [4:43:04<16:09:13,  5.93s/it]

{'loss': 0.6027, 'grad_norm': 0.2741243839263916, 'learning_rate': 0.00015708683473389357, 'epoch': 0.21}


 21%|██▏       | 2687/12500 [4:43:13<18:39:04,  6.84s/it]

{'loss': 1.0517, 'grad_norm': 0.23912107944488525, 'learning_rate': 0.00015707082833133252, 'epoch': 0.21}


 22%|██▏       | 2688/12500 [4:43:20<18:11:45,  6.68s/it]

{'loss': 0.5558, 'grad_norm': 0.23351986706256866, 'learning_rate': 0.00015705482192877152, 'epoch': 0.22}


 22%|██▏       | 2689/12500 [4:43:26<17:49:33,  6.54s/it]

{'loss': 0.6663, 'grad_norm': 0.36339232325553894, 'learning_rate': 0.0001570388155262105, 'epoch': 0.22}


 22%|██▏       | 2690/12500 [4:43:31<16:46:37,  6.16s/it]

{'loss': 0.7288, 'grad_norm': 0.285921573638916, 'learning_rate': 0.00015702280912364947, 'epoch': 0.22}


 22%|██▏       | 2691/12500 [4:43:37<16:16:37,  5.97s/it]

{'loss': 0.539, 'grad_norm': 0.2689852714538574, 'learning_rate': 0.00015700680272108845, 'epoch': 0.22}


 22%|██▏       | 2692/12500 [4:43:41<15:10:32,  5.57s/it]

{'loss': 0.5898, 'grad_norm': 0.2605316638946533, 'learning_rate': 0.00015699079631852742, 'epoch': 0.22}


 22%|██▏       | 2693/12500 [4:43:51<18:10:46,  6.67s/it]

{'loss': 0.7438, 'grad_norm': 0.2239338904619217, 'learning_rate': 0.0001569747899159664, 'epoch': 0.22}


 22%|██▏       | 2694/12500 [4:43:56<17:05:42,  6.28s/it]

{'loss': 0.6177, 'grad_norm': 0.25953415036201477, 'learning_rate': 0.00015695878351340537, 'epoch': 0.22}


 22%|██▏       | 2695/12500 [4:44:02<16:32:05,  6.07s/it]

{'loss': 0.7611, 'grad_norm': 0.29973357915878296, 'learning_rate': 0.00015694277711084435, 'epoch': 0.22}


 22%|██▏       | 2696/12500 [4:44:09<17:35:19,  6.46s/it]

{'loss': 0.4733, 'grad_norm': 0.2606937885284424, 'learning_rate': 0.00015692677070828332, 'epoch': 0.22}


 22%|██▏       | 2697/12500 [4:44:14<16:50:36,  6.19s/it]

{'loss': 0.5383, 'grad_norm': 0.25981131196022034, 'learning_rate': 0.0001569107643057223, 'epoch': 0.22}


 22%|██▏       | 2698/12500 [4:44:25<20:22:58,  7.49s/it]

{'loss': 0.5422, 'grad_norm': 0.1866389364004135, 'learning_rate': 0.00015689475790316127, 'epoch': 0.22}


 22%|██▏       | 2699/12500 [4:44:33<21:04:12,  7.74s/it]

{'loss': 0.6964, 'grad_norm': 0.19761909544467926, 'learning_rate': 0.00015687875150060025, 'epoch': 0.22}


 22%|██▏       | 2700/12500 [4:44:41<20:48:40,  7.64s/it]

{'loss': 0.7288, 'grad_norm': 0.2695564925670624, 'learning_rate': 0.00015686274509803922, 'epoch': 0.22}


 22%|██▏       | 2701/12500 [4:44:47<19:53:04,  7.31s/it]

{'loss': 0.8772, 'grad_norm': 0.3155880868434906, 'learning_rate': 0.0001568467386954782, 'epoch': 0.22}


 22%|██▏       | 2702/12500 [4:44:55<20:03:32,  7.37s/it]

{'loss': 0.8699, 'grad_norm': 0.24379833042621613, 'learning_rate': 0.00015683073229291717, 'epoch': 0.22}


 22%|██▏       | 2703/12500 [4:45:00<18:33:40,  6.82s/it]

{'loss': 0.7471, 'grad_norm': 0.27485939860343933, 'learning_rate': 0.00015681472589035614, 'epoch': 0.22}


 22%|██▏       | 2704/12500 [4:45:04<16:17:49,  5.99s/it]

{'loss': 0.6309, 'grad_norm': 0.2957766056060791, 'learning_rate': 0.00015679871948779512, 'epoch': 0.22}


 22%|██▏       | 2705/12500 [4:45:10<15:40:13,  5.76s/it]

{'loss': 0.5866, 'grad_norm': 0.26108044385910034, 'learning_rate': 0.00015678271308523412, 'epoch': 0.22}


 22%|██▏       | 2706/12500 [4:45:15<15:13:46,  5.60s/it]

{'loss': 0.698, 'grad_norm': 0.28000789880752563, 'learning_rate': 0.00015676670668267307, 'epoch': 0.22}


 22%|██▏       | 2707/12500 [4:45:22<16:37:44,  6.11s/it]

{'loss': 0.7354, 'grad_norm': 0.24807758629322052, 'learning_rate': 0.00015675070028011204, 'epoch': 0.22}


 22%|██▏       | 2708/12500 [4:45:28<16:41:12,  6.13s/it]

{'loss': 0.659, 'grad_norm': 0.2315828800201416, 'learning_rate': 0.00015673469387755102, 'epoch': 0.22}


 22%|██▏       | 2709/12500 [4:45:35<17:31:12,  6.44s/it]

{'loss': 0.7206, 'grad_norm': 0.2876830995082855, 'learning_rate': 0.00015671868747499002, 'epoch': 0.22}


 22%|██▏       | 2710/12500 [4:45:43<18:39:03,  6.86s/it]

{'loss': 1.0066, 'grad_norm': 0.30062228441238403, 'learning_rate': 0.00015670268107242897, 'epoch': 0.22}


 22%|██▏       | 2711/12500 [4:45:51<19:04:23,  7.01s/it]

{'loss': 0.4972, 'grad_norm': 0.22856441140174866, 'learning_rate': 0.00015668667466986794, 'epoch': 0.22}


 22%|██▏       | 2712/12500 [4:45:57<18:41:03,  6.87s/it]

{'loss': 0.7033, 'grad_norm': 0.23494498431682587, 'learning_rate': 0.00015667066826730692, 'epoch': 0.22}


 22%|██▏       | 2713/12500 [4:46:02<16:46:03,  6.17s/it]

{'loss': 0.575, 'grad_norm': 0.28654441237449646, 'learning_rate': 0.00015665466186474592, 'epoch': 0.22}


 22%|██▏       | 2714/12500 [4:46:06<15:19:19,  5.64s/it]

{'loss': 0.8067, 'grad_norm': 0.313266396522522, 'learning_rate': 0.00015663865546218487, 'epoch': 0.22}


 22%|██▏       | 2715/12500 [4:46:10<14:12:47,  5.23s/it]

{'loss': 0.5131, 'grad_norm': 0.2815239727497101, 'learning_rate': 0.00015662264905962384, 'epoch': 0.22}


 22%|██▏       | 2716/12500 [4:46:17<14:54:14,  5.48s/it]

{'loss': 1.2026, 'grad_norm': 0.30068573355674744, 'learning_rate': 0.00015660664265706284, 'epoch': 0.22}


 22%|██▏       | 2717/12500 [4:46:21<14:16:14,  5.25s/it]

{'loss': 0.9632, 'grad_norm': 0.2789328396320343, 'learning_rate': 0.00015659063625450182, 'epoch': 0.22}


 22%|██▏       | 2718/12500 [4:46:26<13:59:42,  5.15s/it]

{'loss': 0.5498, 'grad_norm': 0.27843570709228516, 'learning_rate': 0.00015657462985194077, 'epoch': 0.22}


 22%|██▏       | 2719/12500 [4:46:31<13:32:50,  4.99s/it]

{'loss': 0.6246, 'grad_norm': 0.2855850160121918, 'learning_rate': 0.00015655862344937974, 'epoch': 0.22}


 22%|██▏       | 2720/12500 [4:46:37<14:26:28,  5.32s/it]

{'loss': 0.9494, 'grad_norm': 0.24750064313411713, 'learning_rate': 0.00015654261704681874, 'epoch': 0.22}


 22%|██▏       | 2721/12500 [4:46:42<13:56:09,  5.13s/it]

{'loss': 0.5964, 'grad_norm': 0.2862076759338379, 'learning_rate': 0.00015652661064425772, 'epoch': 0.22}


 22%|██▏       | 2722/12500 [4:46:47<14:13:36,  5.24s/it]

{'loss': 0.7323, 'grad_norm': 0.33339831233024597, 'learning_rate': 0.00015651060424169667, 'epoch': 0.22}


 22%|██▏       | 2723/12500 [4:46:51<13:33:52,  4.99s/it]

{'loss': 0.6207, 'grad_norm': 0.2576965391635895, 'learning_rate': 0.00015649459783913567, 'epoch': 0.22}


 22%|██▏       | 2724/12500 [4:46:56<13:35:38,  5.01s/it]

{'loss': 0.534, 'grad_norm': 0.23546969890594482, 'learning_rate': 0.00015647859143657464, 'epoch': 0.22}


 22%|██▏       | 2725/12500 [4:47:02<13:56:06,  5.13s/it]

{'loss': 0.802, 'grad_norm': 0.31744349002838135, 'learning_rate': 0.00015646258503401362, 'epoch': 0.22}


 22%|██▏       | 2726/12500 [4:47:07<13:35:23,  5.01s/it]

{'loss': 0.8152, 'grad_norm': 0.3618307113647461, 'learning_rate': 0.00015644657863145257, 'epoch': 0.22}


 22%|██▏       | 2727/12500 [4:47:13<14:43:00,  5.42s/it]

{'loss': 0.693, 'grad_norm': 0.29845666885375977, 'learning_rate': 0.00015643057222889157, 'epoch': 0.22}


 22%|██▏       | 2728/12500 [4:47:19<15:30:22,  5.71s/it]

{'loss': 0.7356, 'grad_norm': 0.2951626479625702, 'learning_rate': 0.00015641456582633054, 'epoch': 0.22}


 22%|██▏       | 2729/12500 [4:47:23<14:06:16,  5.20s/it]

{'loss': 0.6906, 'grad_norm': 0.33736664056777954, 'learning_rate': 0.00015639855942376952, 'epoch': 0.22}


 22%|██▏       | 2730/12500 [4:47:30<15:33:02,  5.73s/it]

{'loss': 0.9045, 'grad_norm': 0.25204184651374817, 'learning_rate': 0.0001563825530212085, 'epoch': 0.22}


 22%|██▏       | 2731/12500 [4:47:35<14:17:08,  5.26s/it]

{'loss': 0.5959, 'grad_norm': 0.271325945854187, 'learning_rate': 0.00015636654661864747, 'epoch': 0.22}


 22%|██▏       | 2732/12500 [4:47:40<14:41:55,  5.42s/it]

{'loss': 0.8434, 'grad_norm': 0.35064709186553955, 'learning_rate': 0.00015635054021608644, 'epoch': 0.22}


 22%|██▏       | 2733/12500 [4:47:48<16:46:02,  6.18s/it]

{'loss': 0.7471, 'grad_norm': 0.22158673405647278, 'learning_rate': 0.00015633453381352542, 'epoch': 0.22}


 22%|██▏       | 2734/12500 [4:47:56<18:15:18,  6.73s/it]

{'loss': 0.9241, 'grad_norm': 0.21876071393489838, 'learning_rate': 0.0001563185274109644, 'epoch': 0.22}


 22%|██▏       | 2735/12500 [4:48:01<16:35:24,  6.12s/it]

{'loss': 0.6725, 'grad_norm': 0.2823841869831085, 'learning_rate': 0.00015630252100840337, 'epoch': 0.22}


 22%|██▏       | 2736/12500 [4:48:08<17:06:35,  6.31s/it]

{'loss': 0.7778, 'grad_norm': 0.22694358229637146, 'learning_rate': 0.00015628651460584234, 'epoch': 0.22}


 22%|██▏       | 2737/12500 [4:48:14<17:20:29,  6.39s/it]

{'loss': 0.8506, 'grad_norm': 0.2621443271636963, 'learning_rate': 0.00015627050820328132, 'epoch': 0.22}


 22%|██▏       | 2738/12500 [4:48:21<17:34:51,  6.48s/it]

{'loss': 0.8675, 'grad_norm': 0.24307577311992645, 'learning_rate': 0.0001562545018007203, 'epoch': 0.22}


 22%|██▏       | 2739/12500 [4:48:25<15:31:09,  5.72s/it]

{'loss': 0.637, 'grad_norm': 0.30603089928627014, 'learning_rate': 0.00015623849539815927, 'epoch': 0.22}


 22%|██▏       | 2740/12500 [4:48:30<15:22:05,  5.67s/it]

{'loss': 1.2169, 'grad_norm': 0.5653980374336243, 'learning_rate': 0.00015622248899559824, 'epoch': 0.22}


 22%|██▏       | 2741/12500 [4:48:38<16:32:14,  6.10s/it]

{'loss': 0.7263, 'grad_norm': 0.2401392012834549, 'learning_rate': 0.00015620648259303722, 'epoch': 0.22}


 22%|██▏       | 2742/12500 [4:48:43<16:08:09,  5.95s/it]

{'loss': 0.6024, 'grad_norm': 0.31819167733192444, 'learning_rate': 0.0001561904761904762, 'epoch': 0.22}


 22%|██▏       | 2743/12500 [4:48:50<16:31:23,  6.10s/it]

{'loss': 0.6587, 'grad_norm': 0.33611777424812317, 'learning_rate': 0.00015617446978791517, 'epoch': 0.22}


 22%|██▏       | 2744/12500 [4:49:00<19:48:56,  7.31s/it]

{'loss': 0.759, 'grad_norm': 0.20767588913440704, 'learning_rate': 0.00015615846338535417, 'epoch': 0.22}


 22%|██▏       | 2745/12500 [4:49:04<17:09:18,  6.33s/it]

{'loss': 0.7497, 'grad_norm': 0.32118552923202515, 'learning_rate': 0.00015614245698279312, 'epoch': 0.22}


 22%|██▏       | 2746/12500 [4:49:08<15:14:54,  5.63s/it]

{'loss': 0.7002, 'grad_norm': 0.34305962920188904, 'learning_rate': 0.0001561264505802321, 'epoch': 0.22}


 22%|██▏       | 2747/12500 [4:49:17<17:43:38,  6.54s/it]

{'loss': 0.6081, 'grad_norm': 0.1887054443359375, 'learning_rate': 0.00015611044417767107, 'epoch': 0.22}


 22%|██▏       | 2748/12500 [4:49:24<18:39:20,  6.89s/it]

{'loss': 0.617, 'grad_norm': 0.2602507770061493, 'learning_rate': 0.00015609443777511007, 'epoch': 0.22}


 22%|██▏       | 2749/12500 [4:49:31<18:52:49,  6.97s/it]

{'loss': 0.4573, 'grad_norm': 0.23089414834976196, 'learning_rate': 0.00015607843137254901, 'epoch': 0.22}


 22%|██▏       | 2750/12500 [4:49:37<17:48:31,  6.58s/it]

{'loss': 0.684, 'grad_norm': 0.2990097105503082, 'learning_rate': 0.000156062424969988, 'epoch': 0.22}


 22%|██▏       | 2751/12500 [4:49:43<17:38:41,  6.52s/it]

{'loss': 0.4885, 'grad_norm': 0.253103643655777, 'learning_rate': 0.000156046418567427, 'epoch': 0.22}


 22%|██▏       | 2752/12500 [4:49:49<16:31:44,  6.10s/it]

{'loss': 0.6307, 'grad_norm': 0.29586854577064514, 'learning_rate': 0.00015603041216486597, 'epoch': 0.22}


 22%|██▏       | 2753/12500 [4:49:52<14:34:22,  5.38s/it]

{'loss': 0.9694, 'grad_norm': 0.4359685182571411, 'learning_rate': 0.00015601440576230491, 'epoch': 0.22}


 22%|██▏       | 2754/12500 [4:49:59<15:53:09,  5.87s/it]

{'loss': 0.4838, 'grad_norm': 0.22858929634094238, 'learning_rate': 0.0001559983993597439, 'epoch': 0.22}


 22%|██▏       | 2755/12500 [4:50:04<14:52:22,  5.49s/it]

{'loss': 0.5299, 'grad_norm': 0.33711865544319153, 'learning_rate': 0.0001559823929571829, 'epoch': 0.22}


 22%|██▏       | 2756/12500 [4:50:09<14:23:51,  5.32s/it]

{'loss': 0.6078, 'grad_norm': 0.25462770462036133, 'learning_rate': 0.00015596638655462187, 'epoch': 0.22}


 22%|██▏       | 2757/12500 [4:50:15<15:21:18,  5.67s/it]

{'loss': 0.7153, 'grad_norm': 0.23485888540744781, 'learning_rate': 0.00015595038015206081, 'epoch': 0.22}


 22%|██▏       | 2758/12500 [4:50:20<14:23:24,  5.32s/it]

{'loss': 0.7295, 'grad_norm': 0.2748633027076721, 'learning_rate': 0.00015593437374949982, 'epoch': 0.22}


 22%|██▏       | 2759/12500 [4:50:25<14:11:14,  5.24s/it]

{'loss': 0.8712, 'grad_norm': 0.25866618752479553, 'learning_rate': 0.0001559183673469388, 'epoch': 0.22}


 22%|██▏       | 2760/12500 [4:50:33<16:29:59,  6.10s/it]

{'loss': 0.6619, 'grad_norm': 0.19991199672222137, 'learning_rate': 0.00015590236094437777, 'epoch': 0.22}


 22%|██▏       | 2761/12500 [4:50:36<14:01:54,  5.19s/it]

{'loss': 0.9298, 'grad_norm': 0.38787540793418884, 'learning_rate': 0.0001558863545418167, 'epoch': 0.22}


 22%|██▏       | 2762/12500 [4:50:43<15:26:12,  5.71s/it]

{'loss': 0.7568, 'grad_norm': 0.25170016288757324, 'learning_rate': 0.00015587034813925572, 'epoch': 0.22}


 22%|██▏       | 2763/12500 [4:50:47<13:46:41,  5.09s/it]

{'loss': 0.5287, 'grad_norm': 0.27676182985305786, 'learning_rate': 0.0001558543417366947, 'epoch': 0.22}


 22%|██▏       | 2764/12500 [4:50:55<16:23:46,  6.06s/it]

{'loss': 0.8729, 'grad_norm': 0.22873367369174957, 'learning_rate': 0.00015583833533413366, 'epoch': 0.22}


 22%|██▏       | 2765/12500 [4:51:00<15:15:15,  5.64s/it]

{'loss': 1.0596, 'grad_norm': 0.37364792823791504, 'learning_rate': 0.00015582232893157264, 'epoch': 0.22}


 22%|██▏       | 2766/12500 [4:51:06<16:03:28,  5.94s/it]

{'loss': 0.6052, 'grad_norm': 0.2349773794412613, 'learning_rate': 0.00015580632252901161, 'epoch': 0.22}


 22%|██▏       | 2767/12500 [4:51:15<18:46:55,  6.95s/it]

{'loss': 0.9507, 'grad_norm': 0.20931819081306458, 'learning_rate': 0.0001557903161264506, 'epoch': 0.22}


 22%|██▏       | 2768/12500 [4:51:23<19:22:43,  7.17s/it]

{'loss': 0.7954, 'grad_norm': 0.2088250368833542, 'learning_rate': 0.00015577430972388956, 'epoch': 0.22}


 22%|██▏       | 2769/12500 [4:51:29<18:05:01,  6.69s/it]

{'loss': 0.7576, 'grad_norm': 0.26645949482917786, 'learning_rate': 0.00015575830332132854, 'epoch': 0.22}


 22%|██▏       | 2770/12500 [4:51:34<17:01:04,  6.30s/it]

{'loss': 0.7872, 'grad_norm': 0.3018372058868408, 'learning_rate': 0.00015574229691876751, 'epoch': 0.22}


 22%|██▏       | 2771/12500 [4:51:39<16:07:22,  5.97s/it]

{'loss': 0.6592, 'grad_norm': 0.2342878133058548, 'learning_rate': 0.0001557262905162065, 'epoch': 0.22}


 22%|██▏       | 2772/12500 [4:51:44<14:48:49,  5.48s/it]

{'loss': 0.904, 'grad_norm': 0.33127182722091675, 'learning_rate': 0.00015571028411364546, 'epoch': 0.22}


 22%|██▏       | 2773/12500 [4:51:48<14:13:40,  5.27s/it]

{'loss': 0.6243, 'grad_norm': 0.2869417667388916, 'learning_rate': 0.00015569427771108444, 'epoch': 0.22}


 22%|██▏       | 2774/12500 [4:51:52<13:00:19,  4.81s/it]

{'loss': 0.9751, 'grad_norm': 0.34077268838882446, 'learning_rate': 0.0001556782713085234, 'epoch': 0.22}


 22%|██▏       | 2775/12500 [4:51:58<13:31:53,  5.01s/it]

{'loss': 0.7133, 'grad_norm': 0.33157843351364136, 'learning_rate': 0.0001556622649059624, 'epoch': 0.22}


 22%|██▏       | 2776/12500 [4:52:05<15:25:35,  5.71s/it]

{'loss': 0.7844, 'grad_norm': 0.27696752548217773, 'learning_rate': 0.00015564625850340136, 'epoch': 0.22}


 22%|██▏       | 2777/12500 [4:52:12<16:05:43,  5.96s/it]

{'loss': 0.7148, 'grad_norm': 0.24688157439231873, 'learning_rate': 0.00015563025210084034, 'epoch': 0.22}


 22%|██▏       | 2778/12500 [4:52:17<15:27:24,  5.72s/it]

{'loss': 0.7923, 'grad_norm': 0.2726968228816986, 'learning_rate': 0.0001556142456982793, 'epoch': 0.22}


 22%|██▏       | 2779/12500 [4:52:23<15:57:40,  5.91s/it]

{'loss': 0.8928, 'grad_norm': 0.2631770968437195, 'learning_rate': 0.0001555982392957183, 'epoch': 0.22}


 22%|██▏       | 2780/12500 [4:52:27<14:31:27,  5.38s/it]

{'loss': 0.6238, 'grad_norm': 0.2999299466609955, 'learning_rate': 0.00015558223289315726, 'epoch': 0.22}


 22%|██▏       | 2781/12500 [4:52:32<13:41:24,  5.07s/it]

{'loss': 0.6864, 'grad_norm': 0.2968735992908478, 'learning_rate': 0.00015556622649059624, 'epoch': 0.22}


 22%|██▏       | 2782/12500 [4:52:39<15:41:54,  5.82s/it]

{'loss': 0.5063, 'grad_norm': 0.22692471742630005, 'learning_rate': 0.0001555502200880352, 'epoch': 0.22}


 22%|██▏       | 2783/12500 [4:52:43<14:00:26,  5.19s/it]

{'loss': 0.5911, 'grad_norm': 0.337785005569458, 'learning_rate': 0.00015553421368547421, 'epoch': 0.22}


 22%|██▏       | 2784/12500 [4:52:48<13:48:15,  5.11s/it]

{'loss': 0.5974, 'grad_norm': 0.29671868681907654, 'learning_rate': 0.00015551820728291316, 'epoch': 0.22}


 22%|██▏       | 2785/12500 [4:52:54<14:23:38,  5.33s/it]

{'loss': 0.7782, 'grad_norm': 0.24891433119773865, 'learning_rate': 0.00015550220088035214, 'epoch': 0.22}


 22%|██▏       | 2786/12500 [4:52:58<13:55:00,  5.16s/it]

{'loss': 0.7173, 'grad_norm': 0.3303704559803009, 'learning_rate': 0.0001554861944777911, 'epoch': 0.22}


 22%|██▏       | 2787/12500 [4:53:05<15:24:01,  5.71s/it]

{'loss': 1.1076, 'grad_norm': 0.3119814693927765, 'learning_rate': 0.00015547018807523011, 'epoch': 0.22}


 22%|██▏       | 2788/12500 [4:53:10<14:08:54,  5.24s/it]

{'loss': 0.5882, 'grad_norm': 0.3212956488132477, 'learning_rate': 0.00015545418167266906, 'epoch': 0.22}


 22%|██▏       | 2789/12500 [4:53:17<15:37:35,  5.79s/it]

{'loss': 1.0182, 'grad_norm': 0.2647915184497833, 'learning_rate': 0.00015543817527010804, 'epoch': 0.22}


 22%|██▏       | 2790/12500 [4:53:24<17:12:09,  6.38s/it]

{'loss': 0.5591, 'grad_norm': 0.24805210530757904, 'learning_rate': 0.00015542216886754704, 'epoch': 0.22}


 22%|██▏       | 2791/12500 [4:53:32<18:04:24,  6.70s/it]

{'loss': 0.7618, 'grad_norm': 0.23146165907382965, 'learning_rate': 0.000155406162464986, 'epoch': 0.22}


 22%|██▏       | 2792/12500 [4:53:37<16:53:40,  6.26s/it]

{'loss': 0.6492, 'grad_norm': 0.30593472719192505, 'learning_rate': 0.00015539015606242496, 'epoch': 0.22}


 22%|██▏       | 2793/12500 [4:53:44<17:09:48,  6.37s/it]

{'loss': 0.5761, 'grad_norm': 0.2671554386615753, 'learning_rate': 0.00015537414965986394, 'epoch': 0.22}


 22%|██▏       | 2794/12500 [4:53:48<15:53:21,  5.89s/it]

{'loss': 0.8137, 'grad_norm': 0.27796876430511475, 'learning_rate': 0.00015535814325730294, 'epoch': 0.22}


 22%|██▏       | 2795/12500 [4:53:54<15:47:13,  5.86s/it]

{'loss': 0.5068, 'grad_norm': 0.28128132224082947, 'learning_rate': 0.0001553421368547419, 'epoch': 0.22}


 22%|██▏       | 2796/12500 [4:54:00<15:59:40,  5.93s/it]

{'loss': 0.705, 'grad_norm': 0.3154665231704712, 'learning_rate': 0.00015532613045218086, 'epoch': 0.22}


 22%|██▏       | 2797/12500 [4:54:06<15:48:01,  5.86s/it]

{'loss': 0.8903, 'grad_norm': 0.32689395546913147, 'learning_rate': 0.00015531012404961986, 'epoch': 0.22}


 22%|██▏       | 2798/12500 [4:54:11<14:51:39,  5.51s/it]

{'loss': 0.684, 'grad_norm': 0.2747996747493744, 'learning_rate': 0.00015529411764705884, 'epoch': 0.22}


 22%|██▏       | 2799/12500 [4:54:14<13:10:35,  4.89s/it]

{'loss': 0.4984, 'grad_norm': 0.30020901560783386, 'learning_rate': 0.0001552781112444978, 'epoch': 0.22}


 22%|██▏       | 2800/12500 [4:54:23<16:18:13,  6.05s/it]

{'loss': 0.4444, 'grad_norm': 0.20839697122573853, 'learning_rate': 0.00015526210484193676, 'epoch': 0.22}


 22%|██▏       | 2801/12500 [4:54:30<16:49:06,  6.24s/it]

{'loss': 0.5234, 'grad_norm': 0.22812806069850922, 'learning_rate': 0.00015524609843937576, 'epoch': 0.22}


 22%|██▏       | 2802/12500 [4:54:34<15:39:08,  5.81s/it]

{'loss': 0.8846, 'grad_norm': 0.31717923283576965, 'learning_rate': 0.00015523009203681474, 'epoch': 0.22}


 22%|██▏       | 2803/12500 [4:54:40<15:30:44,  5.76s/it]

{'loss': 0.658, 'grad_norm': 0.24342069029808044, 'learning_rate': 0.0001552140856342537, 'epoch': 0.22}


 22%|██▏       | 2804/12500 [4:54:45<14:37:36,  5.43s/it]

{'loss': 0.9245, 'grad_norm': 0.3590969145298004, 'learning_rate': 0.00015519807923169269, 'epoch': 0.22}


 22%|██▏       | 2805/12500 [4:54:53<16:44:28,  6.22s/it]

{'loss': 1.026, 'grad_norm': 0.21862855553627014, 'learning_rate': 0.00015518207282913166, 'epoch': 0.22}


 22%|██▏       | 2806/12500 [4:55:02<19:14:46,  7.15s/it]

{'loss': 0.6623, 'grad_norm': 0.18621200323104858, 'learning_rate': 0.00015516606642657064, 'epoch': 0.22}


 22%|██▏       | 2807/12500 [4:55:07<17:45:14,  6.59s/it]

{'loss': 0.8683, 'grad_norm': 0.27913546562194824, 'learning_rate': 0.0001551500600240096, 'epoch': 0.22}


 22%|██▏       | 2808/12500 [4:55:12<16:25:32,  6.10s/it]

{'loss': 0.7096, 'grad_norm': 0.29049843549728394, 'learning_rate': 0.00015513405362144859, 'epoch': 0.22}


 22%|██▏       | 2809/12500 [4:55:17<15:10:58,  5.64s/it]

{'loss': 0.707, 'grad_norm': 0.32157108187675476, 'learning_rate': 0.00015511804721888756, 'epoch': 0.22}


 22%|██▏       | 2810/12500 [4:55:21<13:34:55,  5.05s/it]

{'loss': 0.8472, 'grad_norm': 0.3594650328159332, 'learning_rate': 0.00015510204081632654, 'epoch': 0.22}


 22%|██▏       | 2811/12500 [4:55:26<13:39:38,  5.08s/it]

{'loss': 0.8515, 'grad_norm': 0.3117417097091675, 'learning_rate': 0.00015508603441376554, 'epoch': 0.22}


 22%|██▏       | 2812/12500 [4:55:29<12:19:31,  4.58s/it]

{'loss': 0.5897, 'grad_norm': 0.3261758089065552, 'learning_rate': 0.00015507002801120448, 'epoch': 0.22}


 23%|██▎       | 2813/12500 [4:55:36<13:52:06,  5.15s/it]

{'loss': 0.5934, 'grad_norm': 0.21394436061382294, 'learning_rate': 0.00015505402160864346, 'epoch': 0.23}


 23%|██▎       | 2814/12500 [4:55:43<15:30:02,  5.76s/it]

{'loss': 0.7704, 'grad_norm': 0.21958696842193604, 'learning_rate': 0.00015503801520608243, 'epoch': 0.23}


 23%|██▎       | 2815/12500 [4:55:51<17:06:43,  6.36s/it]

{'loss': 0.9482, 'grad_norm': 0.2577076256275177, 'learning_rate': 0.00015502200880352144, 'epoch': 0.23}


 23%|██▎       | 2816/12500 [4:55:55<15:27:03,  5.74s/it]

{'loss': 0.6494, 'grad_norm': 0.2947308421134949, 'learning_rate': 0.00015500600240096038, 'epoch': 0.23}


 23%|██▎       | 2817/12500 [4:56:00<15:15:03,  5.67s/it]

{'loss': 0.7714, 'grad_norm': 0.3357820510864258, 'learning_rate': 0.00015498999599839936, 'epoch': 0.23}


 23%|██▎       | 2818/12500 [4:56:05<14:37:58,  5.44s/it]

{'loss': 0.6623, 'grad_norm': 0.2764798700809479, 'learning_rate': 0.00015497398959583836, 'epoch': 0.23}


 23%|██▎       | 2819/12500 [4:56:13<16:34:05,  6.16s/it]

{'loss': 0.78, 'grad_norm': 0.2537965774536133, 'learning_rate': 0.00015495798319327734, 'epoch': 0.23}


 23%|██▎       | 2820/12500 [4:56:20<17:04:07,  6.35s/it]

{'loss': 0.7726, 'grad_norm': 0.2372359335422516, 'learning_rate': 0.00015494197679071628, 'epoch': 0.23}


 23%|██▎       | 2821/12500 [4:56:25<15:46:52,  5.87s/it]

{'loss': 0.7061, 'grad_norm': 0.3449122905731201, 'learning_rate': 0.00015492597038815526, 'epoch': 0.23}


 23%|██▎       | 2822/12500 [4:56:29<14:23:10,  5.35s/it]

{'loss': 0.8718, 'grad_norm': 0.30388930439949036, 'learning_rate': 0.00015490996398559426, 'epoch': 0.23}


 23%|██▎       | 2823/12500 [4:56:36<16:05:30,  5.99s/it]

{'loss': 0.758, 'grad_norm': 0.2360270470380783, 'learning_rate': 0.00015489395758303324, 'epoch': 0.23}


 23%|██▎       | 2824/12500 [4:56:40<14:35:36,  5.43s/it]

{'loss': 0.6361, 'grad_norm': 0.2853255867958069, 'learning_rate': 0.00015487795118047218, 'epoch': 0.23}


 23%|██▎       | 2825/12500 [4:56:47<15:21:57,  5.72s/it]

{'loss': 0.8432, 'grad_norm': 0.2307884842157364, 'learning_rate': 0.00015486194477791116, 'epoch': 0.23}


 23%|██▎       | 2826/12500 [4:56:54<16:33:08,  6.16s/it]

{'loss': 0.8857, 'grad_norm': 0.25662702322006226, 'learning_rate': 0.00015484593837535016, 'epoch': 0.23}


 23%|██▎       | 2827/12500 [4:56:58<14:28:40,  5.39s/it]

{'loss': 0.8092, 'grad_norm': 0.3272823989391327, 'learning_rate': 0.00015482993197278913, 'epoch': 0.23}


 23%|██▎       | 2828/12500 [4:57:01<12:59:55,  4.84s/it]

{'loss': 0.8637, 'grad_norm': 0.36793819069862366, 'learning_rate': 0.00015481392557022808, 'epoch': 0.23}


 23%|██▎       | 2829/12500 [4:57:07<13:36:20,  5.06s/it]

{'loss': 0.662, 'grad_norm': 0.26168566942214966, 'learning_rate': 0.00015479791916766708, 'epoch': 0.23}


 23%|██▎       | 2830/12500 [4:57:12<14:06:22,  5.25s/it]

{'loss': 0.3694, 'grad_norm': 0.21091847121715546, 'learning_rate': 0.00015478191276510606, 'epoch': 0.23}


 23%|██▎       | 2831/12500 [4:57:19<15:16:24,  5.69s/it]

{'loss': 0.6897, 'grad_norm': 0.22178484499454498, 'learning_rate': 0.00015476590636254503, 'epoch': 0.23}


 23%|██▎       | 2832/12500 [4:57:27<16:49:18,  6.26s/it]

{'loss': 0.7448, 'grad_norm': 0.3086402416229248, 'learning_rate': 0.00015474989995998398, 'epoch': 0.23}


 23%|██▎       | 2833/12500 [4:57:34<17:32:04,  6.53s/it]

{'loss': 0.88, 'grad_norm': 0.2520633935928345, 'learning_rate': 0.00015473389355742298, 'epoch': 0.23}


 23%|██▎       | 2834/12500 [4:57:42<18:53:58,  7.04s/it]

{'loss': 0.5076, 'grad_norm': 0.2818854749202728, 'learning_rate': 0.00015471788715486196, 'epoch': 0.23}


 23%|██▎       | 2835/12500 [4:57:47<17:24:08,  6.48s/it]

{'loss': 0.5608, 'grad_norm': 0.23586060106754303, 'learning_rate': 0.00015470188075230093, 'epoch': 0.23}


 23%|██▎       | 2836/12500 [4:57:52<16:03:41,  5.98s/it]

{'loss': 0.4043, 'grad_norm': 0.2654251158237457, 'learning_rate': 0.0001546858743497399, 'epoch': 0.23}


 23%|██▎       | 2837/12500 [4:57:58<16:06:12,  6.00s/it]

{'loss': 0.8213, 'grad_norm': 0.3913748264312744, 'learning_rate': 0.00015466986794717888, 'epoch': 0.23}


 23%|██▎       | 2838/12500 [4:58:06<17:43:11,  6.60s/it]

{'loss': 0.9451, 'grad_norm': 0.27434051036834717, 'learning_rate': 0.00015465386154461786, 'epoch': 0.23}


 23%|██▎       | 2839/12500 [4:58:12<17:16:58,  6.44s/it]

{'loss': 0.8866, 'grad_norm': 0.28319647908210754, 'learning_rate': 0.00015463785514205683, 'epoch': 0.23}


 23%|██▎       | 2840/12500 [4:58:17<16:19:04,  6.08s/it]

{'loss': 0.7916, 'grad_norm': 0.26748329401016235, 'learning_rate': 0.0001546218487394958, 'epoch': 0.23}


 23%|██▎       | 2841/12500 [4:58:26<18:00:43,  6.71s/it]

{'loss': 0.7441, 'grad_norm': 0.23408019542694092, 'learning_rate': 0.00015460584233693478, 'epoch': 0.23}


 23%|██▎       | 2842/12500 [4:58:33<18:38:46,  6.95s/it]

{'loss': 1.0936, 'grad_norm': 0.23209203779697418, 'learning_rate': 0.00015458983593437376, 'epoch': 0.23}


 23%|██▎       | 2843/12500 [4:58:38<16:48:33,  6.27s/it]

{'loss': 0.4965, 'grad_norm': 0.27791541814804077, 'learning_rate': 0.00015457382953181273, 'epoch': 0.23}


 23%|██▎       | 2844/12500 [4:58:43<16:09:37,  6.03s/it]

{'loss': 0.7559, 'grad_norm': 0.2621142864227295, 'learning_rate': 0.0001545578231292517, 'epoch': 0.23}


 23%|██▎       | 2845/12500 [4:58:53<19:20:50,  7.21s/it]

{'loss': 0.8645, 'grad_norm': 0.1823951005935669, 'learning_rate': 0.00015454181672669068, 'epoch': 0.23}


 23%|██▎       | 2846/12500 [4:58:58<17:39:35,  6.59s/it]

{'loss': 0.7356, 'grad_norm': 0.25203844904899597, 'learning_rate': 0.00015452581032412966, 'epoch': 0.23}


 23%|██▎       | 2847/12500 [4:59:03<15:49:02,  5.90s/it]

{'loss': 0.7312, 'grad_norm': 0.3237471580505371, 'learning_rate': 0.00015450980392156863, 'epoch': 0.23}


 23%|██▎       | 2848/12500 [4:59:11<17:32:38,  6.54s/it]

{'loss': 0.629, 'grad_norm': 0.2539387345314026, 'learning_rate': 0.0001544937975190076, 'epoch': 0.23}


 23%|██▎       | 2849/12500 [4:59:17<17:08:58,  6.40s/it]

{'loss': 0.5538, 'grad_norm': 0.24337215721607208, 'learning_rate': 0.00015447779111644658, 'epoch': 0.23}


 23%|██▎       | 2850/12500 [4:59:22<16:07:13,  6.01s/it]

{'loss': 0.6503, 'grad_norm': 0.3155352473258972, 'learning_rate': 0.00015446178471388558, 'epoch': 0.23}


 23%|██▎       | 2851/12500 [4:59:28<16:31:38,  6.17s/it]

{'loss': 0.4583, 'grad_norm': 0.25180354714393616, 'learning_rate': 0.00015444577831132453, 'epoch': 0.23}


 23%|██▎       | 2852/12500 [4:59:36<17:18:52,  6.46s/it]

{'loss': 0.7741, 'grad_norm': 0.21864919364452362, 'learning_rate': 0.0001544297719087635, 'epoch': 0.23}


 23%|██▎       | 2853/12500 [4:59:41<16:50:47,  6.29s/it]

{'loss': 0.6847, 'grad_norm': 0.2626301944255829, 'learning_rate': 0.00015441376550620248, 'epoch': 0.23}


 23%|██▎       | 2854/12500 [4:59:48<16:45:00,  6.25s/it]

{'loss': 1.0715, 'grad_norm': 0.2689364552497864, 'learning_rate': 0.00015439775910364148, 'epoch': 0.23}


 23%|██▎       | 2855/12500 [4:59:57<18:54:57,  7.06s/it]

{'loss': 0.9262, 'grad_norm': 0.2615090608596802, 'learning_rate': 0.00015438175270108043, 'epoch': 0.23}


 23%|██▎       | 2856/12500 [5:00:01<16:53:46,  6.31s/it]

{'loss': 0.8322, 'grad_norm': 0.2862110733985901, 'learning_rate': 0.0001543657462985194, 'epoch': 0.23}


 23%|██▎       | 2857/12500 [5:00:07<16:53:42,  6.31s/it]

{'loss': 0.4395, 'grad_norm': 0.24228613078594208, 'learning_rate': 0.0001543497398959584, 'epoch': 0.23}


 23%|██▎       | 2858/12500 [5:00:14<16:56:10,  6.32s/it]

{'loss': 0.414, 'grad_norm': 0.21124227344989777, 'learning_rate': 0.00015433373349339738, 'epoch': 0.23}


 23%|██▎       | 2859/12500 [5:00:18<15:32:20,  5.80s/it]

{'loss': 0.8851, 'grad_norm': 0.2826692461967468, 'learning_rate': 0.00015431772709083633, 'epoch': 0.23}


 23%|██▎       | 2860/12500 [5:00:26<17:17:31,  6.46s/it]

{'loss': 0.7122, 'grad_norm': 0.26451578736305237, 'learning_rate': 0.0001543017206882753, 'epoch': 0.23}


 23%|██▎       | 2861/12500 [5:00:35<19:25:18,  7.25s/it]

{'loss': 0.7201, 'grad_norm': 0.2357826977968216, 'learning_rate': 0.0001542857142857143, 'epoch': 0.23}


 23%|██▎       | 2862/12500 [5:00:42<18:46:59,  7.02s/it]

{'loss': 0.6593, 'grad_norm': 0.25158342719078064, 'learning_rate': 0.00015426970788315328, 'epoch': 0.23}


 23%|██▎       | 2863/12500 [5:00:47<16:56:40,  6.33s/it]

{'loss': 0.5679, 'grad_norm': 0.3176266551017761, 'learning_rate': 0.00015425370148059223, 'epoch': 0.23}


 23%|██▎       | 2864/12500 [5:00:50<14:53:00,  5.56s/it]

{'loss': 0.911, 'grad_norm': 0.33221006393432617, 'learning_rate': 0.00015423769507803123, 'epoch': 0.23}


 23%|██▎       | 2865/12500 [5:00:57<15:24:43,  5.76s/it]

{'loss': 0.7276, 'grad_norm': 0.3062032163143158, 'learning_rate': 0.0001542216886754702, 'epoch': 0.23}


 23%|██▎       | 2866/12500 [5:01:05<17:10:03,  6.42s/it]

{'loss': 1.1671, 'grad_norm': 0.27099958062171936, 'learning_rate': 0.00015420568227290918, 'epoch': 0.23}


 23%|██▎       | 2867/12500 [5:01:11<17:14:56,  6.45s/it]

{'loss': 0.6519, 'grad_norm': 0.2682013213634491, 'learning_rate': 0.00015418967587034813, 'epoch': 0.23}


 23%|██▎       | 2868/12500 [5:01:18<17:24:13,  6.50s/it]

{'loss': 0.815, 'grad_norm': 0.3324739634990692, 'learning_rate': 0.00015417366946778713, 'epoch': 0.23}


 23%|██▎       | 2869/12500 [5:01:24<17:01:36,  6.36s/it]

{'loss': 0.8063, 'grad_norm': 0.35904964804649353, 'learning_rate': 0.0001541576630652261, 'epoch': 0.23}


 23%|██▎       | 2870/12500 [5:01:30<17:18:55,  6.47s/it]

{'loss': 0.8461, 'grad_norm': 0.23510655760765076, 'learning_rate': 0.00015414165666266508, 'epoch': 0.23}


 23%|██▎       | 2871/12500 [5:01:34<15:13:47,  5.69s/it]

{'loss': 0.7441, 'grad_norm': 0.400593638420105, 'learning_rate': 0.00015412565026010406, 'epoch': 0.23}


 23%|██▎       | 2872/12500 [5:01:39<14:00:52,  5.24s/it]

{'loss': 1.0768, 'grad_norm': 0.37029311060905457, 'learning_rate': 0.00015410964385754303, 'epoch': 0.23}


 23%|██▎       | 2873/12500 [5:01:43<13:20:59,  4.99s/it]

{'loss': 0.5267, 'grad_norm': 0.24226807057857513, 'learning_rate': 0.000154093637454982, 'epoch': 0.23}


 23%|██▎       | 2874/12500 [5:01:49<14:05:05,  5.27s/it]

{'loss': 0.762, 'grad_norm': 0.28009921312332153, 'learning_rate': 0.00015407763105242098, 'epoch': 0.23}


 23%|██▎       | 2875/12500 [5:01:56<15:38:24,  5.85s/it]

{'loss': 0.9976, 'grad_norm': 0.3356776535511017, 'learning_rate': 0.00015406162464985996, 'epoch': 0.23}


 23%|██▎       | 2876/12500 [5:02:02<15:57:16,  5.97s/it]

{'loss': 0.7403, 'grad_norm': 0.33883216977119446, 'learning_rate': 0.00015404561824729893, 'epoch': 0.23}


 23%|██▎       | 2877/12500 [5:02:08<15:54:59,  5.95s/it]

{'loss': 0.7274, 'grad_norm': 0.2891972064971924, 'learning_rate': 0.0001540296118447379, 'epoch': 0.23}


 23%|██▎       | 2878/12500 [5:02:19<19:26:11,  7.27s/it]

{'loss': 0.6179, 'grad_norm': 0.19380801916122437, 'learning_rate': 0.00015401360544217688, 'epoch': 0.23}


 23%|██▎       | 2879/12500 [5:02:26<19:25:16,  7.27s/it]

{'loss': 0.679, 'grad_norm': 0.24171720445156097, 'learning_rate': 0.00015399759903961585, 'epoch': 0.23}


 23%|██▎       | 2880/12500 [5:02:33<19:21:49,  7.25s/it]

{'loss': 1.0943, 'grad_norm': 0.2129129022359848, 'learning_rate': 0.00015398159263705483, 'epoch': 0.23}


 23%|██▎       | 2881/12500 [5:02:37<16:56:11,  6.34s/it]

{'loss': 0.6917, 'grad_norm': 0.2738272249698639, 'learning_rate': 0.0001539655862344938, 'epoch': 0.23}


 23%|██▎       | 2882/12500 [5:02:43<16:30:40,  6.18s/it]

{'loss': 1.0664, 'grad_norm': 0.35013511776924133, 'learning_rate': 0.00015394957983193278, 'epoch': 0.23}


 23%|██▎       | 2883/12500 [5:02:49<16:03:17,  6.01s/it]

{'loss': 0.7667, 'grad_norm': 0.2842004597187042, 'learning_rate': 0.00015393357342937175, 'epoch': 0.23}


 23%|██▎       | 2884/12500 [5:02:55<16:09:26,  6.05s/it]

{'loss': 0.694, 'grad_norm': 0.2193366140127182, 'learning_rate': 0.00015391756702681073, 'epoch': 0.23}


 23%|██▎       | 2885/12500 [5:03:01<16:36:43,  6.22s/it]

{'loss': 0.9374, 'grad_norm': 0.25479841232299805, 'learning_rate': 0.0001539015606242497, 'epoch': 0.23}


 23%|██▎       | 2886/12500 [5:03:11<19:19:23,  7.24s/it]

{'loss': 1.073, 'grad_norm': 0.2527414858341217, 'learning_rate': 0.00015388555422168868, 'epoch': 0.23}


 23%|██▎       | 2887/12500 [5:03:15<16:44:46,  6.27s/it]

{'loss': 0.8447, 'grad_norm': 0.298473984003067, 'learning_rate': 0.00015386954781912765, 'epoch': 0.23}


 23%|██▎       | 2888/12500 [5:03:22<17:11:55,  6.44s/it]

{'loss': 0.6517, 'grad_norm': 0.2886273264884949, 'learning_rate': 0.00015385354141656663, 'epoch': 0.23}


 23%|██▎       | 2889/12500 [5:03:27<15:49:59,  5.93s/it]

{'loss': 0.8329, 'grad_norm': 0.34252479672431946, 'learning_rate': 0.00015383753501400563, 'epoch': 0.23}


 23%|██▎       | 2890/12500 [5:03:32<15:14:52,  5.71s/it]

{'loss': 0.7368, 'grad_norm': 0.27939358353614807, 'learning_rate': 0.00015382152861144458, 'epoch': 0.23}


 23%|██▎       | 2891/12500 [5:03:39<16:28:53,  6.17s/it]

{'loss': 0.8888, 'grad_norm': 0.2744321823120117, 'learning_rate': 0.00015380552220888355, 'epoch': 0.23}


 23%|██▎       | 2892/12500 [5:03:45<16:35:08,  6.21s/it]

{'loss': 0.5231, 'grad_norm': 0.26753684878349304, 'learning_rate': 0.00015378951580632253, 'epoch': 0.23}


 23%|██▎       | 2893/12500 [5:03:51<16:23:31,  6.14s/it]

{'loss': 0.979, 'grad_norm': 0.25609833002090454, 'learning_rate': 0.00015377350940376153, 'epoch': 0.23}


 23%|██▎       | 2894/12500 [5:03:58<16:37:17,  6.23s/it]

{'loss': 0.5425, 'grad_norm': 0.2080511450767517, 'learning_rate': 0.00015375750300120048, 'epoch': 0.23}


 23%|██▎       | 2895/12500 [5:04:04<16:57:20,  6.36s/it]

{'loss': 0.5317, 'grad_norm': 0.25663819909095764, 'learning_rate': 0.00015374149659863945, 'epoch': 0.23}


 23%|██▎       | 2896/12500 [5:04:12<17:49:39,  6.68s/it]

{'loss': 0.6435, 'grad_norm': 0.24216139316558838, 'learning_rate': 0.00015372549019607845, 'epoch': 0.23}


 23%|██▎       | 2897/12500 [5:04:18<17:40:46,  6.63s/it]

{'loss': 0.7321, 'grad_norm': 0.23231156170368195, 'learning_rate': 0.00015370948379351743, 'epoch': 0.23}


 23%|██▎       | 2898/12500 [5:04:31<22:05:34,  8.28s/it]

{'loss': 0.4809, 'grad_norm': 0.1861995905637741, 'learning_rate': 0.00015369347739095638, 'epoch': 0.23}


 23%|██▎       | 2899/12500 [5:04:36<20:06:20,  7.54s/it]

{'loss': 1.1445, 'grad_norm': 0.2711040675640106, 'learning_rate': 0.00015367747098839535, 'epoch': 0.23}


 23%|██▎       | 2900/12500 [5:04:46<22:04:50,  8.28s/it]

{'loss': 0.4312, 'grad_norm': 0.20098218321800232, 'learning_rate': 0.00015366146458583435, 'epoch': 0.23}


 23%|██▎       | 2901/12500 [5:04:55<22:04:28,  8.28s/it]

{'loss': 0.6608, 'grad_norm': 0.22872108221054077, 'learning_rate': 0.00015364545818327333, 'epoch': 0.23}


 23%|██▎       | 2902/12500 [5:05:03<21:59:24,  8.25s/it]

{'loss': 0.397, 'grad_norm': 0.18431535363197327, 'learning_rate': 0.00015362945178071228, 'epoch': 0.23}


 23%|██▎       | 2903/12500 [5:05:07<18:58:31,  7.12s/it]

{'loss': 0.9397, 'grad_norm': 0.3639671802520752, 'learning_rate': 0.00015361344537815128, 'epoch': 0.23}


 23%|██▎       | 2904/12500 [5:05:13<17:33:34,  6.59s/it]

{'loss': 0.8051, 'grad_norm': 0.27519264817237854, 'learning_rate': 0.00015359743897559025, 'epoch': 0.23}


 23%|██▎       | 2905/12500 [5:05:19<17:13:11,  6.46s/it]

{'loss': 0.7826, 'grad_norm': 0.23371917009353638, 'learning_rate': 0.00015358143257302923, 'epoch': 0.23}


 23%|██▎       | 2906/12500 [5:05:23<15:40:10,  5.88s/it]

{'loss': 0.5719, 'grad_norm': 0.31341323256492615, 'learning_rate': 0.00015356542617046818, 'epoch': 0.23}


 23%|██▎       | 2907/12500 [5:05:29<15:47:50,  5.93s/it]

{'loss': 0.9393, 'grad_norm': 0.325266033411026, 'learning_rate': 0.00015354941976790718, 'epoch': 0.23}


 23%|██▎       | 2908/12500 [5:05:36<16:38:55,  6.25s/it]

{'loss': 0.5825, 'grad_norm': 0.26586785912513733, 'learning_rate': 0.00015353341336534615, 'epoch': 0.23}


 23%|██▎       | 2909/12500 [5:05:43<16:49:08,  6.31s/it]

{'loss': 0.8164, 'grad_norm': 0.26134464144706726, 'learning_rate': 0.00015351740696278513, 'epoch': 0.23}


 23%|██▎       | 2910/12500 [5:05:47<15:01:49,  5.64s/it]

{'loss': 0.6826, 'grad_norm': 0.3084794580936432, 'learning_rate': 0.0001535014005602241, 'epoch': 0.23}


 23%|██▎       | 2911/12500 [5:05:52<14:41:47,  5.52s/it]

{'loss': 0.7648, 'grad_norm': 0.27626141905784607, 'learning_rate': 0.00015348539415766308, 'epoch': 0.23}


 23%|██▎       | 2912/12500 [5:05:57<14:02:21,  5.27s/it]

{'loss': 0.8426, 'grad_norm': 0.3360496163368225, 'learning_rate': 0.00015346938775510205, 'epoch': 0.23}


 23%|██▎       | 2913/12500 [5:06:02<13:53:35,  5.22s/it]

{'loss': 0.4727, 'grad_norm': 0.2939915955066681, 'learning_rate': 0.00015345338135254103, 'epoch': 0.23}


 23%|██▎       | 2914/12500 [5:06:11<17:19:55,  6.51s/it]

{'loss': 0.9051, 'grad_norm': 0.20951834321022034, 'learning_rate': 0.00015343737494998, 'epoch': 0.23}


 23%|██▎       | 2915/12500 [5:06:18<17:36:20,  6.61s/it]

{'loss': 0.5298, 'grad_norm': 0.26061347126960754, 'learning_rate': 0.00015342136854741898, 'epoch': 0.23}


 23%|██▎       | 2916/12500 [5:06:24<17:12:06,  6.46s/it]

{'loss': 0.7138, 'grad_norm': 0.23987454175949097, 'learning_rate': 0.00015340536214485795, 'epoch': 0.23}


 23%|██▎       | 2917/12500 [5:06:31<17:15:41,  6.48s/it]

{'loss': 0.8342, 'grad_norm': 0.2579939067363739, 'learning_rate': 0.00015338935574229693, 'epoch': 0.23}


 23%|██▎       | 2918/12500 [5:06:35<15:20:17,  5.76s/it]

{'loss': 0.731, 'grad_norm': 0.34364041686058044, 'learning_rate': 0.0001533733493397359, 'epoch': 0.23}


 23%|██▎       | 2919/12500 [5:06:41<15:51:21,  5.96s/it]

{'loss': 0.5645, 'grad_norm': 0.2182142734527588, 'learning_rate': 0.00015335734293717488, 'epoch': 0.23}


 23%|██▎       | 2920/12500 [5:06:51<18:41:51,  7.03s/it]

{'loss': 0.9063, 'grad_norm': 0.20210333168506622, 'learning_rate': 0.00015334133653461385, 'epoch': 0.23}


 23%|██▎       | 2921/12500 [5:06:55<16:32:08,  6.21s/it]

{'loss': 0.7772, 'grad_norm': 0.29264894127845764, 'learning_rate': 0.00015332533013205283, 'epoch': 0.23}


 23%|██▎       | 2922/12500 [5:07:01<16:15:45,  6.11s/it]

{'loss': 0.7474, 'grad_norm': 0.3286064565181732, 'learning_rate': 0.0001533093237294918, 'epoch': 0.23}


 23%|██▎       | 2923/12500 [5:07:06<15:34:54,  5.86s/it]

{'loss': 0.6746, 'grad_norm': 0.35064780712127686, 'learning_rate': 0.00015329331732693078, 'epoch': 0.23}


 23%|██▎       | 2924/12500 [5:07:13<16:18:15,  6.13s/it]

{'loss': 0.4858, 'grad_norm': 0.22374770045280457, 'learning_rate': 0.00015327731092436978, 'epoch': 0.23}


 23%|██▎       | 2925/12500 [5:07:18<15:35:22,  5.86s/it]

{'loss': 0.7648, 'grad_norm': 0.2986125648021698, 'learning_rate': 0.00015326130452180872, 'epoch': 0.23}


 23%|██▎       | 2926/12500 [5:07:26<17:19:58,  6.52s/it]

{'loss': 0.7307, 'grad_norm': 0.2402982860803604, 'learning_rate': 0.0001532452981192477, 'epoch': 0.23}


 23%|██▎       | 2927/12500 [5:07:34<17:44:35,  6.67s/it]

{'loss': 0.6047, 'grad_norm': 0.2495296448469162, 'learning_rate': 0.00015322929171668667, 'epoch': 0.23}


 23%|██▎       | 2928/12500 [5:07:38<15:40:07,  5.89s/it]

{'loss': 0.7837, 'grad_norm': 0.36744195222854614, 'learning_rate': 0.00015321328531412568, 'epoch': 0.23}


 23%|██▎       | 2929/12500 [5:07:45<16:53:19,  6.35s/it]

{'loss': 0.6248, 'grad_norm': 0.23144440352916718, 'learning_rate': 0.00015319727891156462, 'epoch': 0.23}


 23%|██▎       | 2930/12500 [5:07:52<17:22:17,  6.53s/it]

{'loss': 0.4749, 'grad_norm': 0.24940548837184906, 'learning_rate': 0.0001531812725090036, 'epoch': 0.23}


 23%|██▎       | 2931/12500 [5:07:58<17:17:28,  6.51s/it]

{'loss': 0.7431, 'grad_norm': 0.28235992789268494, 'learning_rate': 0.0001531652661064426, 'epoch': 0.23}


 23%|██▎       | 2932/12500 [5:08:06<18:19:29,  6.89s/it]

{'loss': 1.0117, 'grad_norm': 0.2710827887058258, 'learning_rate': 0.00015314925970388158, 'epoch': 0.23}


 23%|██▎       | 2933/12500 [5:08:14<18:46:26,  7.06s/it]

{'loss': 0.5756, 'grad_norm': 0.2748657763004303, 'learning_rate': 0.00015313325330132052, 'epoch': 0.23}


 23%|██▎       | 2934/12500 [5:08:21<18:57:01,  7.13s/it]

{'loss': 0.7184, 'grad_norm': 0.29215580224990845, 'learning_rate': 0.0001531172468987595, 'epoch': 0.23}


 23%|██▎       | 2935/12500 [5:08:28<18:43:06,  7.05s/it]

{'loss': 0.8366, 'grad_norm': 0.25820475816726685, 'learning_rate': 0.0001531012404961985, 'epoch': 0.23}


 23%|██▎       | 2936/12500 [5:08:31<15:56:05,  6.00s/it]

{'loss': 0.6971, 'grad_norm': 0.33350417017936707, 'learning_rate': 0.00015308523409363748, 'epoch': 0.23}


 23%|██▎       | 2937/12500 [5:08:36<15:05:04,  5.68s/it]

{'loss': 0.8355, 'grad_norm': 0.3130132853984833, 'learning_rate': 0.00015306922769107642, 'epoch': 0.23}


 24%|██▎       | 2938/12500 [5:08:42<15:00:54,  5.65s/it]

{'loss': 0.9408, 'grad_norm': 0.3196631073951721, 'learning_rate': 0.0001530532212885154, 'epoch': 0.24}


 24%|██▎       | 2939/12500 [5:08:48<15:07:54,  5.70s/it]

{'loss': 0.9632, 'grad_norm': 0.3051983118057251, 'learning_rate': 0.0001530372148859544, 'epoch': 0.24}


 24%|██▎       | 2940/12500 [5:08:54<15:51:33,  5.97s/it]

{'loss': 0.7622, 'grad_norm': 0.24895597994327545, 'learning_rate': 0.00015302120848339337, 'epoch': 0.24}


 24%|██▎       | 2941/12500 [5:09:00<15:50:37,  5.97s/it]

{'loss': 0.6633, 'grad_norm': 0.2977946400642395, 'learning_rate': 0.00015300520208083232, 'epoch': 0.24}


 24%|██▎       | 2942/12500 [5:09:06<15:27:46,  5.82s/it]

{'loss': 0.7816, 'grad_norm': 0.3194775879383087, 'learning_rate': 0.00015298919567827132, 'epoch': 0.24}


 24%|██▎       | 2943/12500 [5:09:09<13:33:41,  5.11s/it]

{'loss': 0.9729, 'grad_norm': 0.3837323486804962, 'learning_rate': 0.0001529731892757103, 'epoch': 0.24}


 24%|██▎       | 2944/12500 [5:09:15<14:22:22,  5.41s/it]

{'loss': 1.1114, 'grad_norm': 0.3483949601650238, 'learning_rate': 0.00015295718287314927, 'epoch': 0.24}


 24%|██▎       | 2945/12500 [5:09:23<16:00:42,  6.03s/it]

{'loss': 0.8243, 'grad_norm': 0.2323540449142456, 'learning_rate': 0.00015294117647058822, 'epoch': 0.24}


 24%|██▎       | 2946/12500 [5:09:28<15:27:43,  5.83s/it]

{'loss': 0.8492, 'grad_norm': 0.28144463896751404, 'learning_rate': 0.00015292517006802722, 'epoch': 0.24}


 24%|██▎       | 2947/12500 [5:09:32<14:18:18,  5.39s/it]

{'loss': 0.6747, 'grad_norm': 0.30700355768203735, 'learning_rate': 0.0001529091636654662, 'epoch': 0.24}


 24%|██▎       | 2948/12500 [5:09:37<13:25:20,  5.06s/it]

{'loss': 0.6355, 'grad_norm': 0.28739452362060547, 'learning_rate': 0.00015289315726290517, 'epoch': 0.24}


 24%|██▎       | 2949/12500 [5:09:44<15:20:10,  5.78s/it]

{'loss': 0.8968, 'grad_norm': 0.2596598267555237, 'learning_rate': 0.00015287715086034415, 'epoch': 0.24}


 24%|██▎       | 2950/12500 [5:09:49<14:31:49,  5.48s/it]

{'loss': 0.505, 'grad_norm': 0.25922566652297974, 'learning_rate': 0.00015286114445778312, 'epoch': 0.24}


 24%|██▎       | 2951/12500 [5:09:58<16:59:22,  6.41s/it]

{'loss': 0.6235, 'grad_norm': 0.22916701436042786, 'learning_rate': 0.0001528451380552221, 'epoch': 0.24}


 24%|██▎       | 2952/12500 [5:10:04<16:43:51,  6.31s/it]

{'loss': 0.6843, 'grad_norm': 0.2410188913345337, 'learning_rate': 0.00015282913165266107, 'epoch': 0.24}


 24%|██▎       | 2953/12500 [5:10:13<19:13:41,  7.25s/it]

{'loss': 1.0051, 'grad_norm': 0.24761033058166504, 'learning_rate': 0.00015281312525010005, 'epoch': 0.24}


 24%|██▎       | 2954/12500 [5:10:19<17:56:35,  6.77s/it]

{'loss': 0.7318, 'grad_norm': 0.24180465936660767, 'learning_rate': 0.00015279711884753902, 'epoch': 0.24}


 24%|██▎       | 2955/12500 [5:10:26<18:42:02,  7.05s/it]

{'loss': 0.7854, 'grad_norm': 0.24585159122943878, 'learning_rate': 0.000152781112444978, 'epoch': 0.24}


 24%|██▎       | 2956/12500 [5:10:31<16:41:42,  6.30s/it]

{'loss': 0.6224, 'grad_norm': 0.30414044857025146, 'learning_rate': 0.00015276510604241697, 'epoch': 0.24}


 24%|██▎       | 2957/12500 [5:10:38<17:30:19,  6.60s/it]

{'loss': 0.5106, 'grad_norm': 0.18705230951309204, 'learning_rate': 0.00015274909963985595, 'epoch': 0.24}


 24%|██▎       | 2958/12500 [5:10:46<18:19:48,  6.92s/it]

{'loss': 0.769, 'grad_norm': 0.2570631206035614, 'learning_rate': 0.00015273309323729492, 'epoch': 0.24}


 24%|██▎       | 2959/12500 [5:10:52<17:50:33,  6.73s/it]

{'loss': 0.9046, 'grad_norm': 0.20654305815696716, 'learning_rate': 0.0001527170868347339, 'epoch': 0.24}


 24%|██▎       | 2960/12500 [5:10:57<16:02:36,  6.05s/it]

{'loss': 0.68, 'grad_norm': 0.31991270184516907, 'learning_rate': 0.00015270108043217287, 'epoch': 0.24}


 24%|██▎       | 2961/12500 [5:11:02<15:31:39,  5.86s/it]

{'loss': 0.5623, 'grad_norm': 0.2671225070953369, 'learning_rate': 0.00015268507402961185, 'epoch': 0.24}


 24%|██▎       | 2962/12500 [5:11:07<14:38:30,  5.53s/it]

{'loss': 0.6844, 'grad_norm': 0.29570865631103516, 'learning_rate': 0.00015266906762705082, 'epoch': 0.24}


 24%|██▎       | 2963/12500 [5:11:11<13:06:44,  4.95s/it]

{'loss': 0.5974, 'grad_norm': 0.3150278627872467, 'learning_rate': 0.00015265306122448982, 'epoch': 0.24}


 24%|██▎       | 2964/12500 [5:11:18<15:23:00,  5.81s/it]

{'loss': 0.8514, 'grad_norm': 0.21773836016654968, 'learning_rate': 0.00015263705482192877, 'epoch': 0.24}


 24%|██▎       | 2965/12500 [5:11:23<14:28:50,  5.47s/it]

{'loss': 0.765, 'grad_norm': 0.331790566444397, 'learning_rate': 0.00015262104841936775, 'epoch': 0.24}


 24%|██▎       | 2966/12500 [5:11:30<15:32:52,  5.87s/it]

{'loss': 0.8162, 'grad_norm': 0.3126557767391205, 'learning_rate': 0.00015260504201680672, 'epoch': 0.24}


 24%|██▎       | 2967/12500 [5:11:34<14:24:45,  5.44s/it]

{'loss': 0.716, 'grad_norm': 0.3191748559474945, 'learning_rate': 0.00015258903561424572, 'epoch': 0.24}


 24%|██▎       | 2968/12500 [5:11:42<16:21:57,  6.18s/it]

{'loss': 0.6622, 'grad_norm': 0.20930445194244385, 'learning_rate': 0.00015257302921168467, 'epoch': 0.24}


 24%|██▍       | 2969/12500 [5:11:47<15:13:42,  5.75s/it]

{'loss': 0.6628, 'grad_norm': 0.2850915491580963, 'learning_rate': 0.00015255702280912365, 'epoch': 0.24}


 24%|██▍       | 2970/12500 [5:11:51<14:12:07,  5.36s/it]

{'loss': 0.8949, 'grad_norm': 0.26991087198257446, 'learning_rate': 0.00015254101640656265, 'epoch': 0.24}


 24%|██▍       | 2971/12500 [5:11:59<16:10:45,  6.11s/it]

{'loss': 0.4966, 'grad_norm': 0.20581908524036407, 'learning_rate': 0.00015252501000400162, 'epoch': 0.24}


 24%|██▍       | 2972/12500 [5:12:07<17:16:49,  6.53s/it]

{'loss': 0.7211, 'grad_norm': 0.22125603258609772, 'learning_rate': 0.00015250900360144057, 'epoch': 0.24}


 24%|██▍       | 2973/12500 [5:12:14<17:53:38,  6.76s/it]

{'loss': 0.6779, 'grad_norm': 0.21356821060180664, 'learning_rate': 0.00015249299719887954, 'epoch': 0.24}


 24%|██▍       | 2974/12500 [5:12:21<18:00:24,  6.80s/it]

{'loss': 0.6683, 'grad_norm': 0.23134763538837433, 'learning_rate': 0.00015247699079631855, 'epoch': 0.24}


 24%|██▍       | 2975/12500 [5:12:25<15:28:13,  5.85s/it]

{'loss': 0.9113, 'grad_norm': 0.35884836316108704, 'learning_rate': 0.00015246098439375752, 'epoch': 0.24}


 24%|██▍       | 2976/12500 [5:12:33<17:41:09,  6.69s/it]

{'loss': 0.5906, 'grad_norm': 0.24523389339447021, 'learning_rate': 0.00015244497799119647, 'epoch': 0.24}


 24%|██▍       | 2977/12500 [5:12:39<16:55:27,  6.40s/it]

{'loss': 0.9679, 'grad_norm': 0.36761513352394104, 'learning_rate': 0.00015242897158863547, 'epoch': 0.24}


 24%|██▍       | 2978/12500 [5:12:47<18:21:07,  6.94s/it]

{'loss': 0.4994, 'grad_norm': 0.1790696233510971, 'learning_rate': 0.00015241296518607445, 'epoch': 0.24}


 24%|██▍       | 2979/12500 [5:12:55<18:45:41,  7.09s/it]

{'loss': 0.9914, 'grad_norm': 0.24742604792118073, 'learning_rate': 0.00015239695878351342, 'epoch': 0.24}


 24%|██▍       | 2980/12500 [5:12:59<16:42:54,  6.32s/it]

{'loss': 0.6529, 'grad_norm': 0.2622321546077728, 'learning_rate': 0.00015238095238095237, 'epoch': 0.24}


 24%|██▍       | 2981/12500 [5:13:03<15:07:35,  5.72s/it]

{'loss': 0.6449, 'grad_norm': 0.28901275992393494, 'learning_rate': 0.00015236494597839137, 'epoch': 0.24}


 24%|██▍       | 2982/12500 [5:13:08<14:09:50,  5.36s/it]

{'loss': 0.9639, 'grad_norm': 0.37844955921173096, 'learning_rate': 0.00015234893957583035, 'epoch': 0.24}


 24%|██▍       | 2983/12500 [5:13:15<15:08:46,  5.73s/it]

{'loss': 0.8781, 'grad_norm': 0.3054288327693939, 'learning_rate': 0.00015233293317326932, 'epoch': 0.24}


 24%|██▍       | 2984/12500 [5:13:19<14:04:57,  5.33s/it]

{'loss': 1.1766, 'grad_norm': 0.3302757441997528, 'learning_rate': 0.0001523169267707083, 'epoch': 0.24}


 24%|██▍       | 2985/12500 [5:13:28<17:02:28,  6.45s/it]

{'loss': 1.2015, 'grad_norm': 0.23105506598949432, 'learning_rate': 0.00015230092036814727, 'epoch': 0.24}


 24%|██▍       | 2986/12500 [5:13:37<19:02:31,  7.21s/it]

{'loss': 0.8127, 'grad_norm': 0.20293058454990387, 'learning_rate': 0.00015228491396558625, 'epoch': 0.24}


 24%|██▍       | 2987/12500 [5:13:44<19:18:30,  7.31s/it]

{'loss': 0.794, 'grad_norm': 0.2157047986984253, 'learning_rate': 0.00015226890756302522, 'epoch': 0.24}


 24%|██▍       | 2988/12500 [5:13:49<17:22:02,  6.57s/it]

{'loss': 0.532, 'grad_norm': 0.26309192180633545, 'learning_rate': 0.0001522529011604642, 'epoch': 0.24}


 24%|██▍       | 2989/12500 [5:13:53<15:12:17,  5.76s/it]

{'loss': 0.7072, 'grad_norm': 0.35799500346183777, 'learning_rate': 0.00015223689475790317, 'epoch': 0.24}


 24%|██▍       | 2990/12500 [5:13:57<13:25:37,  5.08s/it]

{'loss': 0.525, 'grad_norm': 0.31825798749923706, 'learning_rate': 0.00015222088835534214, 'epoch': 0.24}


 24%|██▍       | 2991/12500 [5:14:03<14:45:24,  5.59s/it]

{'loss': 0.9485, 'grad_norm': 0.26782962679862976, 'learning_rate': 0.00015220488195278112, 'epoch': 0.24}


 24%|██▍       | 2992/12500 [5:14:08<14:00:00,  5.30s/it]

{'loss': 0.5875, 'grad_norm': 0.28185197710990906, 'learning_rate': 0.0001521888755502201, 'epoch': 0.24}


 24%|██▍       | 2993/12500 [5:14:13<13:19:32,  5.05s/it]

{'loss': 0.579, 'grad_norm': 0.2660772502422333, 'learning_rate': 0.00015217286914765907, 'epoch': 0.24}


 24%|██▍       | 2994/12500 [5:14:18<13:47:49,  5.23s/it]

{'loss': 0.5958, 'grad_norm': 0.2675873339176178, 'learning_rate': 0.00015215686274509804, 'epoch': 0.24}


 24%|██▍       | 2995/12500 [5:14:28<17:27:28,  6.61s/it]

{'loss': 0.8435, 'grad_norm': 0.20633037388324738, 'learning_rate': 0.00015214085634253702, 'epoch': 0.24}


 24%|██▍       | 2996/12500 [5:14:34<16:42:20,  6.33s/it]

{'loss': 0.6541, 'grad_norm': 0.2569468021392822, 'learning_rate': 0.000152124849939976, 'epoch': 0.24}


 24%|██▍       | 2997/12500 [5:14:40<16:22:26,  6.20s/it]

{'loss': 0.7496, 'grad_norm': 0.2922634780406952, 'learning_rate': 0.00015210884353741497, 'epoch': 0.24}


 24%|██▍       | 2998/12500 [5:14:47<16:55:35,  6.41s/it]

{'loss': 0.9887, 'grad_norm': 0.3027958869934082, 'learning_rate': 0.00015209283713485394, 'epoch': 0.24}


 24%|██▍       | 2999/12500 [5:14:52<15:49:54,  6.00s/it]

{'loss': 0.803, 'grad_norm': 0.2600303590297699, 'learning_rate': 0.00015207683073229292, 'epoch': 0.24}


 24%|██▍       | 3000/12500 [5:14:57<15:44:49,  5.97s/it]

{'loss': 0.74, 'grad_norm': 0.2874983847141266, 'learning_rate': 0.0001520608243297319, 'epoch': 0.24}


 24%|██▍       | 3001/12500 [5:15:05<17:05:06,  6.48s/it]

{'loss': 0.6565, 'grad_norm': 0.23733416199684143, 'learning_rate': 0.00015204481792717087, 'epoch': 0.24}


 24%|██▍       | 3002/12500 [5:15:10<15:28:27,  5.87s/it]

{'loss': 0.7514, 'grad_norm': 0.3081977069377899, 'learning_rate': 0.00015202881152460987, 'epoch': 0.24}


 24%|██▍       | 3003/12500 [5:15:17<16:35:42,  6.29s/it]

{'loss': 0.705, 'grad_norm': 0.2477712780237198, 'learning_rate': 0.00015201280512204882, 'epoch': 0.24}


 24%|██▍       | 3004/12500 [5:15:23<16:52:03,  6.39s/it]

{'loss': 0.8168, 'grad_norm': 0.2623049318790436, 'learning_rate': 0.0001519967987194878, 'epoch': 0.24}


 24%|██▍       | 3005/12500 [5:15:27<14:37:59,  5.55s/it]

{'loss': 0.5297, 'grad_norm': 0.2795373499393463, 'learning_rate': 0.00015198079231692677, 'epoch': 0.24}


 24%|██▍       | 3006/12500 [5:15:34<16:01:44,  6.08s/it]

{'loss': 0.6238, 'grad_norm': 0.2774331271648407, 'learning_rate': 0.00015196478591436577, 'epoch': 0.24}


 24%|██▍       | 3007/12500 [5:15:43<17:58:12,  6.81s/it]

{'loss': 0.5118, 'grad_norm': 0.22965465486049652, 'learning_rate': 0.00015194877951180472, 'epoch': 0.24}


 24%|██▍       | 3008/12500 [5:15:47<15:59:18,  6.06s/it]

{'loss': 0.8766, 'grad_norm': 0.2849891483783722, 'learning_rate': 0.0001519327731092437, 'epoch': 0.24}


 24%|██▍       | 3009/12500 [5:15:54<16:50:01,  6.39s/it]

{'loss': 0.5567, 'grad_norm': 0.22738800942897797, 'learning_rate': 0.0001519167667066827, 'epoch': 0.24}


 24%|██▍       | 3010/12500 [5:16:00<16:38:38,  6.31s/it]

{'loss': 0.6032, 'grad_norm': 0.290693074464798, 'learning_rate': 0.00015190076030412167, 'epoch': 0.24}


 24%|██▍       | 3011/12500 [5:16:08<17:53:27,  6.79s/it]

{'loss': 0.7529, 'grad_norm': 0.23715941607952118, 'learning_rate': 0.00015188475390156062, 'epoch': 0.24}


 24%|██▍       | 3012/12500 [5:16:14<16:37:24,  6.31s/it]

{'loss': 0.5387, 'grad_norm': 0.29265904426574707, 'learning_rate': 0.0001518687474989996, 'epoch': 0.24}


 24%|██▍       | 3013/12500 [5:16:19<15:54:55,  6.04s/it]

{'loss': 0.7435, 'grad_norm': 0.2589806616306305, 'learning_rate': 0.0001518527410964386, 'epoch': 0.24}


 24%|██▍       | 3014/12500 [5:16:25<15:59:09,  6.07s/it]

{'loss': 0.8333, 'grad_norm': 0.3142741024494171, 'learning_rate': 0.00015183673469387757, 'epoch': 0.24}


 24%|██▍       | 3015/12500 [5:16:30<14:55:42,  5.67s/it]

{'loss': 0.7696, 'grad_norm': 0.2590966522693634, 'learning_rate': 0.00015182072829131652, 'epoch': 0.24}


 24%|██▍       | 3016/12500 [5:16:35<14:41:30,  5.58s/it]

{'loss': 0.5021, 'grad_norm': 0.2642277479171753, 'learning_rate': 0.00015180472188875552, 'epoch': 0.24}


 24%|██▍       | 3017/12500 [5:16:40<14:24:02,  5.47s/it]

{'loss': 0.4402, 'grad_norm': 0.2192462682723999, 'learning_rate': 0.0001517887154861945, 'epoch': 0.24}


 24%|██▍       | 3018/12500 [5:16:48<16:05:30,  6.11s/it]

{'loss': 1.0103, 'grad_norm': 0.26381176710128784, 'learning_rate': 0.00015177270908363347, 'epoch': 0.24}


 24%|██▍       | 3019/12500 [5:16:55<16:39:59,  6.33s/it]

{'loss': 0.9753, 'grad_norm': 0.3833751380443573, 'learning_rate': 0.00015175670268107242, 'epoch': 0.24}


 24%|██▍       | 3020/12500 [5:16:59<14:55:27,  5.67s/it]

{'loss': 0.7394, 'grad_norm': 0.3866516351699829, 'learning_rate': 0.00015174069627851142, 'epoch': 0.24}


 24%|██▍       | 3021/12500 [5:17:06<15:58:01,  6.06s/it]

{'loss': 0.8082, 'grad_norm': 0.3791508078575134, 'learning_rate': 0.0001517246898759504, 'epoch': 0.24}


 24%|██▍       | 3022/12500 [5:17:10<14:35:09,  5.54s/it]

{'loss': 0.7173, 'grad_norm': 0.28086382150650024, 'learning_rate': 0.00015170868347338937, 'epoch': 0.24}


 24%|██▍       | 3023/12500 [5:17:15<13:52:57,  5.27s/it]

{'loss': 0.6974, 'grad_norm': 0.2991422712802887, 'learning_rate': 0.00015169267707082834, 'epoch': 0.24}


 24%|██▍       | 3024/12500 [5:17:20<13:42:02,  5.21s/it]

{'loss': 0.9538, 'grad_norm': 0.28528347611427307, 'learning_rate': 0.00015167667066826732, 'epoch': 0.24}


 24%|██▍       | 3025/12500 [5:17:24<12:52:18,  4.89s/it]

{'loss': 0.6699, 'grad_norm': 0.2861748933792114, 'learning_rate': 0.0001516606642657063, 'epoch': 0.24}


 24%|██▍       | 3026/12500 [5:17:34<16:33:43,  6.29s/it]

{'loss': 0.6413, 'grad_norm': 0.22452804446220398, 'learning_rate': 0.00015164465786314527, 'epoch': 0.24}


 24%|██▍       | 3027/12500 [5:17:40<16:29:37,  6.27s/it]

{'loss': 0.877, 'grad_norm': 0.2849414348602295, 'learning_rate': 0.00015162865146058424, 'epoch': 0.24}


 24%|██▍       | 3028/12500 [5:17:43<14:14:44,  5.41s/it]

{'loss': 0.6157, 'grad_norm': 0.28088101744651794, 'learning_rate': 0.00015161264505802322, 'epoch': 0.24}


 24%|██▍       | 3029/12500 [5:17:49<14:49:05,  5.63s/it]

{'loss': 0.6917, 'grad_norm': 0.2846302092075348, 'learning_rate': 0.0001515966386554622, 'epoch': 0.24}


 24%|██▍       | 3030/12500 [5:17:57<16:03:14,  6.10s/it]

{'loss': 0.898, 'grad_norm': 0.26102930307388306, 'learning_rate': 0.00015158063225290117, 'epoch': 0.24}


 24%|██▍       | 3031/12500 [5:18:05<17:29:15,  6.65s/it]

{'loss': 0.5345, 'grad_norm': 0.22314846515655518, 'learning_rate': 0.00015156462585034014, 'epoch': 0.24}


 24%|██▍       | 3032/12500 [5:18:09<15:47:03,  6.00s/it]

{'loss': 0.6981, 'grad_norm': 0.34105440974235535, 'learning_rate': 0.00015154861944777912, 'epoch': 0.24}


 24%|██▍       | 3033/12500 [5:18:17<17:38:37,  6.71s/it]

{'loss': 1.0203, 'grad_norm': 0.23691236972808838, 'learning_rate': 0.0001515326130452181, 'epoch': 0.24}


 24%|██▍       | 3034/12500 [5:18:22<15:51:08,  6.03s/it]

{'loss': 0.7035, 'grad_norm': 0.3178441822528839, 'learning_rate': 0.00015151660664265707, 'epoch': 0.24}


 24%|██▍       | 3035/12500 [5:18:29<16:22:53,  6.23s/it]

{'loss': 0.7102, 'grad_norm': 0.24708448350429535, 'learning_rate': 0.00015150060024009604, 'epoch': 0.24}


 24%|██▍       | 3036/12500 [5:18:34<15:48:17,  6.01s/it]

{'loss': 0.6549, 'grad_norm': 0.2597169876098633, 'learning_rate': 0.00015148459383753501, 'epoch': 0.24}


 24%|██▍       | 3037/12500 [5:18:39<15:12:47,  5.79s/it]

{'loss': 0.8817, 'grad_norm': 0.37564513087272644, 'learning_rate': 0.00015146858743497402, 'epoch': 0.24}


 24%|██▍       | 3038/12500 [5:18:43<13:36:56,  5.18s/it]

{'loss': 0.5989, 'grad_norm': 0.3460988998413086, 'learning_rate': 0.00015145258103241296, 'epoch': 0.24}


 24%|██▍       | 3039/12500 [5:18:50<14:32:35,  5.53s/it]

{'loss': 0.4788, 'grad_norm': 0.2475849837064743, 'learning_rate': 0.00015143657462985194, 'epoch': 0.24}


 24%|██▍       | 3040/12500 [5:18:55<14:47:39,  5.63s/it]

{'loss': 0.5532, 'grad_norm': 0.2754244804382324, 'learning_rate': 0.00015142056822729091, 'epoch': 0.24}


 24%|██▍       | 3041/12500 [5:19:04<16:54:21,  6.43s/it]

{'loss': 0.6758, 'grad_norm': 0.2564505934715271, 'learning_rate': 0.00015140456182472992, 'epoch': 0.24}


 24%|██▍       | 3042/12500 [5:19:08<14:57:19,  5.69s/it]

{'loss': 0.6898, 'grad_norm': 0.2868666350841522, 'learning_rate': 0.00015138855542216886, 'epoch': 0.24}


 24%|██▍       | 3043/12500 [5:19:16<16:50:31,  6.41s/it]

{'loss': 0.9364, 'grad_norm': 0.23750285804271698, 'learning_rate': 0.00015137254901960784, 'epoch': 0.24}


 24%|██▍       | 3044/12500 [5:19:23<17:21:44,  6.61s/it]

{'loss': 1.1739, 'grad_norm': 0.2757662534713745, 'learning_rate': 0.00015135654261704684, 'epoch': 0.24}


 24%|██▍       | 3045/12500 [5:19:29<17:12:25,  6.55s/it]

{'loss': 0.6755, 'grad_norm': 0.2538618743419647, 'learning_rate': 0.00015134053621448582, 'epoch': 0.24}


 24%|██▍       | 3046/12500 [5:19:36<17:41:21,  6.74s/it]

{'loss': 0.5117, 'grad_norm': 0.25637346506118774, 'learning_rate': 0.00015132452981192476, 'epoch': 0.24}


 24%|██▍       | 3047/12500 [5:19:41<16:00:47,  6.10s/it]

{'loss': 0.5605, 'grad_norm': 0.24585632979869843, 'learning_rate': 0.00015130852340936374, 'epoch': 0.24}


 24%|██▍       | 3048/12500 [5:19:46<15:03:10,  5.73s/it]

{'loss': 0.8245, 'grad_norm': 0.2635709345340729, 'learning_rate': 0.00015129251700680274, 'epoch': 0.24}


 24%|██▍       | 3049/12500 [5:19:52<15:13:19,  5.80s/it]

{'loss': 0.6805, 'grad_norm': 0.22070656716823578, 'learning_rate': 0.00015127651060424172, 'epoch': 0.24}


 24%|██▍       | 3050/12500 [5:19:57<14:58:13,  5.70s/it]

{'loss': 0.8302, 'grad_norm': 0.2889803946018219, 'learning_rate': 0.00015126050420168066, 'epoch': 0.24}


 24%|██▍       | 3051/12500 [5:20:03<15:16:14,  5.82s/it]

{'loss': 0.7766, 'grad_norm': 0.28320395946502686, 'learning_rate': 0.00015124449779911964, 'epoch': 0.24}


 24%|██▍       | 3052/12500 [5:20:09<15:12:25,  5.79s/it]

{'loss': 0.6365, 'grad_norm': 0.2773903012275696, 'learning_rate': 0.00015122849139655864, 'epoch': 0.24}


 24%|██▍       | 3053/12500 [5:20:13<13:35:48,  5.18s/it]

{'loss': 0.7346, 'grad_norm': 0.3237762451171875, 'learning_rate': 0.00015121248499399761, 'epoch': 0.24}


 24%|██▍       | 3054/12500 [5:20:17<12:49:33,  4.89s/it]

{'loss': 0.5402, 'grad_norm': 0.25999075174331665, 'learning_rate': 0.00015119647859143656, 'epoch': 0.24}


 24%|██▍       | 3055/12500 [5:20:22<13:14:20,  5.05s/it]

{'loss': 0.7531, 'grad_norm': 0.24968448281288147, 'learning_rate': 0.00015118047218887556, 'epoch': 0.24}


 24%|██▍       | 3056/12500 [5:20:27<12:44:48,  4.86s/it]

{'loss': 0.8555, 'grad_norm': 0.29662710428237915, 'learning_rate': 0.00015116446578631454, 'epoch': 0.24}


 24%|██▍       | 3057/12500 [5:20:33<13:53:38,  5.30s/it]

{'loss': 0.785, 'grad_norm': 0.23921546339988708, 'learning_rate': 0.00015114845938375351, 'epoch': 0.24}


 24%|██▍       | 3058/12500 [5:20:39<14:13:32,  5.42s/it]

{'loss': 0.5196, 'grad_norm': 0.21013416349887848, 'learning_rate': 0.00015113245298119246, 'epoch': 0.24}


 24%|██▍       | 3059/12500 [5:20:43<13:29:38,  5.15s/it]

{'loss': 0.7043, 'grad_norm': 0.2822631597518921, 'learning_rate': 0.00015111644657863146, 'epoch': 0.24}


 24%|██▍       | 3060/12500 [5:20:48<13:09:07,  5.02s/it]

{'loss': 0.7194, 'grad_norm': 0.3122556805610657, 'learning_rate': 0.00015110044017607044, 'epoch': 0.24}


 24%|██▍       | 3061/12500 [5:20:55<14:27:42,  5.52s/it]

{'loss': 0.5194, 'grad_norm': 0.2569887638092041, 'learning_rate': 0.0001510844337735094, 'epoch': 0.24}


 24%|██▍       | 3062/12500 [5:21:02<15:54:37,  6.07s/it]

{'loss': 0.8899, 'grad_norm': 0.2646139860153198, 'learning_rate': 0.0001510684273709484, 'epoch': 0.24}


 25%|██▍       | 3063/12500 [5:21:11<17:47:42,  6.79s/it]

{'loss': 0.9581, 'grad_norm': 0.2481418401002884, 'learning_rate': 0.00015105242096838736, 'epoch': 0.25}


 25%|██▍       | 3064/12500 [5:21:15<15:57:06,  6.09s/it]

{'loss': 0.5931, 'grad_norm': 0.27279728651046753, 'learning_rate': 0.00015103641456582634, 'epoch': 0.25}


 25%|██▍       | 3065/12500 [5:21:23<17:02:03,  6.50s/it]

{'loss': 0.6064, 'grad_norm': 0.19215360283851624, 'learning_rate': 0.0001510204081632653, 'epoch': 0.25}


 25%|██▍       | 3066/12500 [5:21:28<15:53:12,  6.06s/it]

{'loss': 0.6626, 'grad_norm': 0.303655207157135, 'learning_rate': 0.0001510044017607043, 'epoch': 0.25}


 25%|██▍       | 3067/12500 [5:21:33<15:25:59,  5.89s/it]

{'loss': 0.7235, 'grad_norm': 0.35126203298568726, 'learning_rate': 0.00015098839535814326, 'epoch': 0.25}


 25%|██▍       | 3068/12500 [5:21:41<17:02:10,  6.50s/it]

{'loss': 0.6134, 'grad_norm': 0.22453464567661285, 'learning_rate': 0.00015097238895558224, 'epoch': 0.25}


 25%|██▍       | 3069/12500 [5:21:48<17:34:52,  6.71s/it]

{'loss': 0.4543, 'grad_norm': 0.21751637756824493, 'learning_rate': 0.0001509563825530212, 'epoch': 0.25}


 25%|██▍       | 3070/12500 [5:21:53<15:40:04,  5.98s/it]

{'loss': 0.5673, 'grad_norm': 0.2872353196144104, 'learning_rate': 0.0001509403761504602, 'epoch': 0.25}


 25%|██▍       | 3071/12500 [5:21:58<15:21:06,  5.86s/it]

{'loss': 0.945, 'grad_norm': 0.3078692555427551, 'learning_rate': 0.00015092436974789916, 'epoch': 0.25}


 25%|██▍       | 3072/12500 [5:22:05<16:20:18,  6.24s/it]

{'loss': 0.8939, 'grad_norm': 0.24936650693416595, 'learning_rate': 0.00015090836334533814, 'epoch': 0.25}


 25%|██▍       | 3073/12500 [5:22:12<16:55:49,  6.47s/it]

{'loss': 0.5978, 'grad_norm': 0.263568639755249, 'learning_rate': 0.0001508923569427771, 'epoch': 0.25}


 25%|██▍       | 3074/12500 [5:22:21<19:06:53,  7.30s/it]

{'loss': 0.2925, 'grad_norm': 0.17729605734348297, 'learning_rate': 0.0001508763505402161, 'epoch': 0.25}


 25%|██▍       | 3075/12500 [5:22:26<16:39:38,  6.36s/it]

{'loss': 0.6838, 'grad_norm': 0.31704917550086975, 'learning_rate': 0.00015086034413765506, 'epoch': 0.25}


 25%|██▍       | 3076/12500 [5:22:32<16:56:57,  6.47s/it]

{'loss': 0.7659, 'grad_norm': 0.2533949613571167, 'learning_rate': 0.00015084433773509406, 'epoch': 0.25}


 25%|██▍       | 3077/12500 [5:22:40<17:30:14,  6.69s/it]

{'loss': 0.6293, 'grad_norm': 0.24457497894763947, 'learning_rate': 0.000150828331332533, 'epoch': 0.25}


 25%|██▍       | 3078/12500 [5:22:43<15:10:26,  5.80s/it]

{'loss': 0.4535, 'grad_norm': 0.2811321020126343, 'learning_rate': 0.00015081232492997199, 'epoch': 0.25}


 25%|██▍       | 3079/12500 [5:22:54<18:43:43,  7.16s/it]

{'loss': 0.8125, 'grad_norm': 0.2053414285182953, 'learning_rate': 0.00015079631852741096, 'epoch': 0.25}


 25%|██▍       | 3080/12500 [5:23:02<19:32:55,  7.47s/it]

{'loss': 0.6438, 'grad_norm': 0.19027720391750336, 'learning_rate': 0.00015078031212484996, 'epoch': 0.25}


 25%|██▍       | 3081/12500 [5:23:06<16:38:43,  6.36s/it]

{'loss': 0.5916, 'grad_norm': 0.29133763909339905, 'learning_rate': 0.0001507643057222889, 'epoch': 0.25}


 25%|██▍       | 3082/12500 [5:23:11<15:48:56,  6.05s/it]

{'loss': 0.5499, 'grad_norm': 0.25201377272605896, 'learning_rate': 0.00015074829931972789, 'epoch': 0.25}


 25%|██▍       | 3083/12500 [5:23:15<14:30:56,  5.55s/it]

{'loss': 0.641, 'grad_norm': 0.3354090750217438, 'learning_rate': 0.0001507322929171669, 'epoch': 0.25}


 25%|██▍       | 3084/12500 [5:23:20<13:39:29,  5.22s/it]

{'loss': 0.6891, 'grad_norm': 0.3183298707008362, 'learning_rate': 0.00015071628651460586, 'epoch': 0.25}


 25%|██▍       | 3085/12500 [5:23:28<15:43:14,  6.01s/it]

{'loss': 0.5134, 'grad_norm': 0.20307740569114685, 'learning_rate': 0.0001507002801120448, 'epoch': 0.25}


 25%|██▍       | 3086/12500 [5:23:34<15:40:10,  5.99s/it]

{'loss': 0.6051, 'grad_norm': 0.27742278575897217, 'learning_rate': 0.00015068427370948378, 'epoch': 0.25}


 25%|██▍       | 3087/12500 [5:23:40<15:56:31,  6.10s/it]

{'loss': 0.824, 'grad_norm': 0.26770904660224915, 'learning_rate': 0.0001506682673069228, 'epoch': 0.25}


 25%|██▍       | 3088/12500 [5:23:49<18:07:15,  6.93s/it]

{'loss': 0.5874, 'grad_norm': 0.20155230164527893, 'learning_rate': 0.00015065226090436176, 'epoch': 0.25}


 25%|██▍       | 3089/12500 [5:23:55<17:44:38,  6.79s/it]

{'loss': 0.6089, 'grad_norm': 0.24526114761829376, 'learning_rate': 0.0001506362545018007, 'epoch': 0.25}


 25%|██▍       | 3090/12500 [5:24:02<17:49:45,  6.82s/it]

{'loss': 0.7931, 'grad_norm': 0.2346411794424057, 'learning_rate': 0.0001506202480992397, 'epoch': 0.25}


 25%|██▍       | 3091/12500 [5:24:07<16:29:23,  6.31s/it]

{'loss': 0.7297, 'grad_norm': 0.2988349497318268, 'learning_rate': 0.00015060424169667869, 'epoch': 0.25}


 25%|██▍       | 3092/12500 [5:24:13<16:13:01,  6.21s/it]

{'loss': 0.4912, 'grad_norm': 0.29346466064453125, 'learning_rate': 0.00015058823529411766, 'epoch': 0.25}


 25%|██▍       | 3093/12500 [5:24:16<13:44:35,  5.26s/it]

{'loss': 0.5656, 'grad_norm': 0.3231428861618042, 'learning_rate': 0.0001505722288915566, 'epoch': 0.25}


 25%|██▍       | 3094/12500 [5:24:20<12:17:51,  4.71s/it]

{'loss': 0.6674, 'grad_norm': 0.34139037132263184, 'learning_rate': 0.0001505562224889956, 'epoch': 0.25}


 25%|██▍       | 3095/12500 [5:24:26<13:44:26,  5.26s/it]

{'loss': 0.4511, 'grad_norm': 0.22056777775287628, 'learning_rate': 0.00015054021608643459, 'epoch': 0.25}


 25%|██▍       | 3096/12500 [5:24:32<13:50:07,  5.30s/it]

{'loss': 0.8571, 'grad_norm': 0.2793191373348236, 'learning_rate': 0.00015052420968387356, 'epoch': 0.25}


 25%|██▍       | 3097/12500 [5:24:40<16:08:25,  6.18s/it]

{'loss': 0.7341, 'grad_norm': 0.26078861951828003, 'learning_rate': 0.00015050820328131254, 'epoch': 0.25}


 25%|██▍       | 3098/12500 [5:24:44<14:47:28,  5.66s/it]

{'loss': 0.4517, 'grad_norm': 0.25072234869003296, 'learning_rate': 0.0001504921968787515, 'epoch': 0.25}


 25%|██▍       | 3099/12500 [5:24:51<15:50:59,  6.07s/it]

{'loss': 0.881, 'grad_norm': 0.25667598843574524, 'learning_rate': 0.00015047619047619048, 'epoch': 0.25}


 25%|██▍       | 3100/12500 [5:24:57<15:50:20,  6.07s/it]

{'loss': 0.4597, 'grad_norm': 0.2274484634399414, 'learning_rate': 0.00015046018407362946, 'epoch': 0.25}


 25%|██▍       | 3101/12500 [5:25:05<17:16:53,  6.62s/it]

{'loss': 0.5465, 'grad_norm': 0.23935088515281677, 'learning_rate': 0.00015044417767106843, 'epoch': 0.25}


 25%|██▍       | 3102/12500 [5:25:10<15:36:01,  5.98s/it]

{'loss': 0.8653, 'grad_norm': 0.37213703989982605, 'learning_rate': 0.0001504281712685074, 'epoch': 0.25}


 25%|██▍       | 3103/12500 [5:25:13<13:24:17,  5.14s/it]

{'loss': 0.6303, 'grad_norm': 0.34157201647758484, 'learning_rate': 0.00015041216486594638, 'epoch': 0.25}


 25%|██▍       | 3104/12500 [5:25:19<14:21:36,  5.50s/it]

{'loss': 0.9317, 'grad_norm': 0.2635924518108368, 'learning_rate': 0.00015039615846338536, 'epoch': 0.25}


 25%|██▍       | 3105/12500 [5:25:26<15:19:07,  5.87s/it]

{'loss': 1.0759, 'grad_norm': 0.29333120584487915, 'learning_rate': 0.00015038015206082433, 'epoch': 0.25}


 25%|██▍       | 3106/12500 [5:25:31<14:39:54,  5.62s/it]

{'loss': 0.6469, 'grad_norm': 0.2968801259994507, 'learning_rate': 0.0001503641456582633, 'epoch': 0.25}


 25%|██▍       | 3107/12500 [5:25:38<15:52:23,  6.08s/it]

{'loss': 0.8528, 'grad_norm': 0.3032713532447815, 'learning_rate': 0.00015034813925570228, 'epoch': 0.25}


 25%|██▍       | 3108/12500 [5:25:42<14:16:55,  5.47s/it]

{'loss': 0.61, 'grad_norm': 0.2834434509277344, 'learning_rate': 0.00015033213285314126, 'epoch': 0.25}


 25%|██▍       | 3109/12500 [5:25:47<13:38:18,  5.23s/it]

{'loss': 0.5841, 'grad_norm': 0.2750614583492279, 'learning_rate': 0.00015031612645058023, 'epoch': 0.25}


 25%|██▍       | 3110/12500 [5:25:52<13:44:30,  5.27s/it]

{'loss': 0.5676, 'grad_norm': 0.360152930021286, 'learning_rate': 0.0001503001200480192, 'epoch': 0.25}


 25%|██▍       | 3111/12500 [5:25:59<14:40:07,  5.62s/it]

{'loss': 0.8104, 'grad_norm': 0.26268482208251953, 'learning_rate': 0.00015028411364545818, 'epoch': 0.25}


 25%|██▍       | 3112/12500 [5:26:03<13:58:17,  5.36s/it]

{'loss': 0.5508, 'grad_norm': 0.26217976212501526, 'learning_rate': 0.00015026810724289716, 'epoch': 0.25}


 25%|██▍       | 3113/12500 [5:26:13<17:02:07,  6.53s/it]

{'loss': 0.5806, 'grad_norm': 0.2716047763824463, 'learning_rate': 0.00015025210084033613, 'epoch': 0.25}


 25%|██▍       | 3114/12500 [5:26:18<15:40:56,  6.01s/it]

{'loss': 0.4755, 'grad_norm': 0.2819365859031677, 'learning_rate': 0.0001502360944377751, 'epoch': 0.25}


 25%|██▍       | 3115/12500 [5:26:25<16:44:01,  6.42s/it]

{'loss': 0.8779, 'grad_norm': 0.24471446871757507, 'learning_rate': 0.0001502200880352141, 'epoch': 0.25}


 25%|██▍       | 3116/12500 [5:26:31<16:33:12,  6.35s/it]

{'loss': 0.8727, 'grad_norm': 0.2776339054107666, 'learning_rate': 0.00015020408163265306, 'epoch': 0.25}


 25%|██▍       | 3117/12500 [5:26:39<17:59:22,  6.90s/it]

{'loss': 0.5133, 'grad_norm': 0.24896058440208435, 'learning_rate': 0.00015018807523009203, 'epoch': 0.25}


 25%|██▍       | 3118/12500 [5:26:44<16:03:27,  6.16s/it]

{'loss': 0.7864, 'grad_norm': 0.334054559469223, 'learning_rate': 0.000150172068827531, 'epoch': 0.25}


 25%|██▍       | 3119/12500 [5:26:49<15:42:10,  6.03s/it]

{'loss': 0.7946, 'grad_norm': 0.2989436686038971, 'learning_rate': 0.00015015606242497, 'epoch': 0.25}


 25%|██▍       | 3120/12500 [5:26:55<15:32:54,  5.97s/it]

{'loss': 0.7746, 'grad_norm': 0.26567599177360535, 'learning_rate': 0.00015014005602240896, 'epoch': 0.25}


 25%|██▍       | 3121/12500 [5:27:03<17:11:25,  6.60s/it]

{'loss': 0.8229, 'grad_norm': 0.2759416401386261, 'learning_rate': 0.00015012404961984793, 'epoch': 0.25}


 25%|██▍       | 3122/12500 [5:27:09<16:20:11,  6.27s/it]

{'loss': 0.8081, 'grad_norm': 0.4177769124507904, 'learning_rate': 0.00015010804321728693, 'epoch': 0.25}


 25%|██▍       | 3123/12500 [5:27:13<14:37:08,  5.61s/it]

{'loss': 1.0371, 'grad_norm': 0.37561851739883423, 'learning_rate': 0.0001500920368147259, 'epoch': 0.25}


 25%|██▍       | 3124/12500 [5:27:21<16:09:08,  6.20s/it]

{'loss': 0.9221, 'grad_norm': 0.26547670364379883, 'learning_rate': 0.00015007603041216486, 'epoch': 0.25}


 25%|██▌       | 3125/12500 [5:27:28<17:13:49,  6.62s/it]

{'loss': 0.8059, 'grad_norm': 0.2939283549785614, 'learning_rate': 0.00015006002400960383, 'epoch': 0.25}


 25%|██▌       | 3126/12500 [5:27:34<16:56:02,  6.50s/it]

{'loss': 0.8767, 'grad_norm': 0.26698192954063416, 'learning_rate': 0.00015004401760704283, 'epoch': 0.25}


 25%|██▌       | 3127/12500 [5:27:40<16:36:51,  6.38s/it]

{'loss': 0.7178, 'grad_norm': 0.23666642606258392, 'learning_rate': 0.0001500280112044818, 'epoch': 0.25}


 25%|██▌       | 3128/12500 [5:27:45<15:22:23,  5.91s/it]

{'loss': 0.8337, 'grad_norm': 0.3377906382083893, 'learning_rate': 0.00015001200480192076, 'epoch': 0.25}


 25%|██▌       | 3129/12500 [5:27:52<15:55:01,  6.11s/it]

{'loss': 0.5309, 'grad_norm': 0.26459094882011414, 'learning_rate': 0.00014999599839935976, 'epoch': 0.25}


 25%|██▌       | 3130/12500 [5:28:00<17:38:09,  6.78s/it]

{'loss': 0.5357, 'grad_norm': 0.24456138908863068, 'learning_rate': 0.00014997999199679873, 'epoch': 0.25}


 25%|██▌       | 3131/12500 [5:28:07<17:53:03,  6.87s/it]

{'loss': 0.4951, 'grad_norm': 0.27437129616737366, 'learning_rate': 0.0001499639855942377, 'epoch': 0.25}


 25%|██▌       | 3132/12500 [5:28:14<17:51:37,  6.86s/it]

{'loss': 0.517, 'grad_norm': 0.25408121943473816, 'learning_rate': 0.00014994797919167666, 'epoch': 0.25}


 25%|██▌       | 3133/12500 [5:28:20<17:15:37,  6.63s/it]

{'loss': 0.4749, 'grad_norm': 0.25508421659469604, 'learning_rate': 0.00014993197278911566, 'epoch': 0.25}


 25%|██▌       | 3134/12500 [5:28:26<16:39:03,  6.40s/it]

{'loss': 0.9321, 'grad_norm': 0.28152623772621155, 'learning_rate': 0.00014991596638655463, 'epoch': 0.25}


 25%|██▌       | 3135/12500 [5:28:33<17:18:02,  6.65s/it]

{'loss': 0.7284, 'grad_norm': 0.3017195463180542, 'learning_rate': 0.0001498999599839936, 'epoch': 0.25}


 25%|██▌       | 3136/12500 [5:28:40<17:32:16,  6.74s/it]

{'loss': 0.8772, 'grad_norm': 0.29740625619888306, 'learning_rate': 0.00014988395358143258, 'epoch': 0.25}


 25%|██▌       | 3137/12500 [5:28:44<15:32:46,  5.98s/it]

{'loss': 0.854, 'grad_norm': 0.2821430265903473, 'learning_rate': 0.00014986794717887156, 'epoch': 0.25}


 25%|██▌       | 3138/12500 [5:28:51<15:44:28,  6.05s/it]

{'loss': 0.4394, 'grad_norm': 0.23760215938091278, 'learning_rate': 0.00014985194077631053, 'epoch': 0.25}


 25%|██▌       | 3139/12500 [5:28:56<14:56:54,  5.75s/it]

{'loss': 0.7082, 'grad_norm': 0.2864646017551422, 'learning_rate': 0.0001498359343737495, 'epoch': 0.25}


 25%|██▌       | 3140/12500 [5:29:02<15:41:11,  6.03s/it]

{'loss': 0.7429, 'grad_norm': 0.24635782837867737, 'learning_rate': 0.00014981992797118848, 'epoch': 0.25}


 25%|██▌       | 3141/12500 [5:29:08<15:38:48,  6.02s/it]

{'loss': 0.5134, 'grad_norm': 0.2919648587703705, 'learning_rate': 0.00014980392156862746, 'epoch': 0.25}


 25%|██▌       | 3142/12500 [5:29:14<15:24:37,  5.93s/it]

{'loss': 0.6381, 'grad_norm': 0.31159019470214844, 'learning_rate': 0.00014978791516606643, 'epoch': 0.25}


 25%|██▌       | 3143/12500 [5:29:19<14:42:20,  5.66s/it]

{'loss': 0.6849, 'grad_norm': 0.30629339814186096, 'learning_rate': 0.00014977190876350543, 'epoch': 0.25}


 25%|██▌       | 3144/12500 [5:29:32<20:23:21,  7.85s/it]

{'loss': 0.7302, 'grad_norm': 0.1784806102514267, 'learning_rate': 0.00014975590236094438, 'epoch': 0.25}


 25%|██▌       | 3145/12500 [5:29:39<19:30:24,  7.51s/it]

{'loss': 0.6242, 'grad_norm': 0.2813720703125, 'learning_rate': 0.00014973989595838336, 'epoch': 0.25}


 25%|██▌       | 3146/12500 [5:29:48<20:51:54,  8.03s/it]

{'loss': 0.6986, 'grad_norm': 0.21865957975387573, 'learning_rate': 0.00014972388955582233, 'epoch': 0.25}


 25%|██▌       | 3147/12500 [5:29:55<20:17:01,  7.81s/it]

{'loss': 0.6288, 'grad_norm': 0.2412436455488205, 'learning_rate': 0.00014970788315326133, 'epoch': 0.25}


 25%|██▌       | 3148/12500 [5:29:59<17:10:09,  6.61s/it]

{'loss': 0.5512, 'grad_norm': 0.2995247542858124, 'learning_rate': 0.00014969187675070028, 'epoch': 0.25}


 25%|██▌       | 3149/12500 [5:30:07<18:19:57,  7.06s/it]

{'loss': 0.7337, 'grad_norm': 0.26763859391212463, 'learning_rate': 0.00014967587034813925, 'epoch': 0.25}


 25%|██▌       | 3150/12500 [5:30:12<16:45:22,  6.45s/it]

{'loss': 0.5609, 'grad_norm': 0.2500026226043701, 'learning_rate': 0.00014965986394557826, 'epoch': 0.25}


 25%|██▌       | 3151/12500 [5:30:18<16:25:00,  6.32s/it]

{'loss': 0.6777, 'grad_norm': 0.2762150168418884, 'learning_rate': 0.00014964385754301723, 'epoch': 0.25}


 25%|██▌       | 3152/12500 [5:30:24<15:35:11,  6.00s/it]

{'loss': 0.5647, 'grad_norm': 0.25599396228790283, 'learning_rate': 0.00014962785114045618, 'epoch': 0.25}


 25%|██▌       | 3153/12500 [5:30:28<14:30:45,  5.59s/it]

{'loss': 0.5884, 'grad_norm': 0.30451786518096924, 'learning_rate': 0.00014961184473789515, 'epoch': 0.25}


 25%|██▌       | 3154/12500 [5:30:37<16:46:07,  6.46s/it]

{'loss': 0.7108, 'grad_norm': 0.20929017663002014, 'learning_rate': 0.00014959583833533416, 'epoch': 0.25}


 25%|██▌       | 3155/12500 [5:30:43<16:21:32,  6.30s/it]

{'loss': 0.7482, 'grad_norm': 0.34603041410446167, 'learning_rate': 0.00014957983193277313, 'epoch': 0.25}


 25%|██▌       | 3156/12500 [5:30:51<17:39:44,  6.80s/it]

{'loss': 1.005, 'grad_norm': 0.2325253039598465, 'learning_rate': 0.00014956382553021208, 'epoch': 0.25}


 25%|██▌       | 3157/12500 [5:30:56<16:37:26,  6.41s/it]

{'loss': 0.5412, 'grad_norm': 0.23265671730041504, 'learning_rate': 0.00014954781912765105, 'epoch': 0.25}


 25%|██▌       | 3158/12500 [5:31:07<19:53:28,  7.67s/it]

{'loss': 0.5717, 'grad_norm': 0.15925146639347076, 'learning_rate': 0.00014953181272509006, 'epoch': 0.25}


 25%|██▌       | 3159/12500 [5:31:13<18:59:48,  7.32s/it]

{'loss': 0.5454, 'grad_norm': 0.2514839470386505, 'learning_rate': 0.00014951580632252903, 'epoch': 0.25}


 25%|██▌       | 3160/12500 [5:31:24<21:38:41,  8.34s/it]

{'loss': 0.6645, 'grad_norm': 0.21796616911888123, 'learning_rate': 0.00014949979991996798, 'epoch': 0.25}


 25%|██▌       | 3161/12500 [5:31:30<20:13:14,  7.79s/it]

{'loss': 0.8656, 'grad_norm': 0.2887744903564453, 'learning_rate': 0.00014948379351740698, 'epoch': 0.25}


 25%|██▌       | 3162/12500 [5:31:37<18:55:03,  7.29s/it]

{'loss': 0.8748, 'grad_norm': 0.3059181869029999, 'learning_rate': 0.00014946778711484595, 'epoch': 0.25}


 25%|██▌       | 3163/12500 [5:31:42<17:07:53,  6.61s/it]

{'loss': 0.8604, 'grad_norm': 0.28807562589645386, 'learning_rate': 0.00014945178071228493, 'epoch': 0.25}


 25%|██▌       | 3164/12500 [5:31:50<18:26:51,  7.11s/it]

{'loss': 1.0983, 'grad_norm': 0.2834489345550537, 'learning_rate': 0.00014943577430972388, 'epoch': 0.25}


 25%|██▌       | 3165/12500 [5:31:55<17:12:14,  6.63s/it]

{'loss': 0.6099, 'grad_norm': 0.2785016894340515, 'learning_rate': 0.00014941976790716288, 'epoch': 0.25}


 25%|██▌       | 3166/12500 [5:31:58<14:20:19,  5.53s/it]

{'loss': 0.5891, 'grad_norm': 0.32708778977394104, 'learning_rate': 0.00014940376150460185, 'epoch': 0.25}


 25%|██▌       | 3167/12500 [5:32:07<17:08:04,  6.61s/it]

{'loss': 0.6262, 'grad_norm': 0.1859816461801529, 'learning_rate': 0.00014938775510204083, 'epoch': 0.25}


 25%|██▌       | 3168/12500 [5:32:13<16:08:09,  6.22s/it]

{'loss': 0.5716, 'grad_norm': 0.27438250184059143, 'learning_rate': 0.0001493717486994798, 'epoch': 0.25}


 25%|██▌       | 3169/12500 [5:32:18<15:35:25,  6.01s/it]

{'loss': 0.6346, 'grad_norm': 0.275001585483551, 'learning_rate': 0.00014935574229691878, 'epoch': 0.25}


 25%|██▌       | 3170/12500 [5:32:24<15:33:46,  6.01s/it]

{'loss': 0.6644, 'grad_norm': 0.29836633801460266, 'learning_rate': 0.00014933973589435775, 'epoch': 0.25}


 25%|██▌       | 3171/12500 [5:32:32<16:38:13,  6.42s/it]

{'loss': 1.0247, 'grad_norm': 0.25378644466400146, 'learning_rate': 0.00014932372949179673, 'epoch': 0.25}


 25%|██▌       | 3172/12500 [5:32:44<21:12:03,  8.18s/it]

{'loss': 0.9562, 'grad_norm': 0.19032122194766998, 'learning_rate': 0.0001493077230892357, 'epoch': 0.25}


 25%|██▌       | 3173/12500 [5:32:48<17:48:49,  6.88s/it]

{'loss': 0.8505, 'grad_norm': 0.34043481945991516, 'learning_rate': 0.00014929171668667468, 'epoch': 0.25}


 25%|██▌       | 3174/12500 [5:32:54<17:09:41,  6.62s/it]

{'loss': 0.623, 'grad_norm': 0.22679448127746582, 'learning_rate': 0.00014927571028411365, 'epoch': 0.25}


 25%|██▌       | 3175/12500 [5:32:59<16:02:14,  6.19s/it]

{'loss': 0.6942, 'grad_norm': 0.28129974007606506, 'learning_rate': 0.00014925970388155263, 'epoch': 0.25}


 25%|██▌       | 3176/12500 [5:33:06<16:27:38,  6.36s/it]

{'loss': 0.7564, 'grad_norm': 0.3307913839817047, 'learning_rate': 0.0001492436974789916, 'epoch': 0.25}


 25%|██▌       | 3177/12500 [5:33:09<14:16:53,  5.51s/it]

{'loss': 0.6421, 'grad_norm': 0.32234567403793335, 'learning_rate': 0.00014922769107643058, 'epoch': 0.25}


 25%|██▌       | 3178/12500 [5:33:14<13:21:14,  5.16s/it]

{'loss': 0.9177, 'grad_norm': 0.39103737473487854, 'learning_rate': 0.00014921168467386955, 'epoch': 0.25}


 25%|██▌       | 3179/12500 [5:33:21<14:56:57,  5.77s/it]

{'loss': 0.5487, 'grad_norm': 0.23095734417438507, 'learning_rate': 0.00014919567827130853, 'epoch': 0.25}


 25%|██▌       | 3180/12500 [5:33:30<17:25:27,  6.73s/it]

{'loss': 0.8387, 'grad_norm': 0.25844889879226685, 'learning_rate': 0.0001491796718687475, 'epoch': 0.25}


 25%|██▌       | 3181/12500 [5:33:34<15:35:16,  6.02s/it]

{'loss': 0.6726, 'grad_norm': 0.3196648359298706, 'learning_rate': 0.00014916366546618648, 'epoch': 0.25}


 25%|██▌       | 3182/12500 [5:33:42<17:11:14,  6.64s/it]

{'loss': 0.7645, 'grad_norm': 0.24202582240104675, 'learning_rate': 0.00014914765906362548, 'epoch': 0.25}


 25%|██▌       | 3183/12500 [5:33:49<17:37:36,  6.81s/it]

{'loss': 0.6627, 'grad_norm': 0.23218688368797302, 'learning_rate': 0.00014913165266106443, 'epoch': 0.25}


 25%|██▌       | 3184/12500 [5:33:58<18:46:30,  7.26s/it]

{'loss': 0.5622, 'grad_norm': 0.22962826490402222, 'learning_rate': 0.0001491156462585034, 'epoch': 0.25}


 25%|██▌       | 3185/12500 [5:34:01<15:51:17,  6.13s/it]

{'loss': 0.7767, 'grad_norm': 0.3255191147327423, 'learning_rate': 0.00014909963985594238, 'epoch': 0.25}


 25%|██▌       | 3186/12500 [5:34:08<16:25:32,  6.35s/it]

{'loss': 0.6552, 'grad_norm': 0.25658488273620605, 'learning_rate': 0.00014908363345338138, 'epoch': 0.25}


 25%|██▌       | 3187/12500 [5:34:12<14:51:56,  5.75s/it]

{'loss': 0.7189, 'grad_norm': 0.32558947801589966, 'learning_rate': 0.00014906762705082033, 'epoch': 0.25}


 26%|██▌       | 3188/12500 [5:34:17<14:13:19,  5.50s/it]

{'loss': 0.6027, 'grad_norm': 0.2720087766647339, 'learning_rate': 0.0001490516206482593, 'epoch': 0.26}


 26%|██▌       | 3189/12500 [5:34:22<13:53:11,  5.37s/it]

{'loss': 0.7582, 'grad_norm': 0.31054240465164185, 'learning_rate': 0.0001490356142456983, 'epoch': 0.26}


 26%|██▌       | 3190/12500 [5:34:30<15:16:19,  5.91s/it]

{'loss': 0.6283, 'grad_norm': 0.23578181862831116, 'learning_rate': 0.00014901960784313728, 'epoch': 0.26}


 26%|██▌       | 3191/12500 [5:34:38<17:28:35,  6.76s/it]

{'loss': 0.787, 'grad_norm': 0.23875319957733154, 'learning_rate': 0.00014900360144057623, 'epoch': 0.26}


 26%|██▌       | 3192/12500 [5:34:45<17:03:33,  6.60s/it]

{'loss': 0.9748, 'grad_norm': 0.2606620490550995, 'learning_rate': 0.0001489875950380152, 'epoch': 0.26}


 26%|██▌       | 3193/12500 [5:34:52<17:43:30,  6.86s/it]

{'loss': 0.9529, 'grad_norm': 0.2314128279685974, 'learning_rate': 0.0001489715886354542, 'epoch': 0.26}


 26%|██▌       | 3194/12500 [5:34:59<17:28:58,  6.76s/it]

{'loss': 0.6957, 'grad_norm': 0.26026996970176697, 'learning_rate': 0.00014895558223289318, 'epoch': 0.26}


 26%|██▌       | 3195/12500 [5:35:03<15:28:07,  5.98s/it]

{'loss': 0.8292, 'grad_norm': 0.33094409108161926, 'learning_rate': 0.00014893957583033213, 'epoch': 0.26}


 26%|██▌       | 3196/12500 [5:35:08<14:38:08,  5.66s/it]

{'loss': 0.6364, 'grad_norm': 0.30340465903282166, 'learning_rate': 0.00014892356942777113, 'epoch': 0.26}


 26%|██▌       | 3197/12500 [5:35:15<15:58:50,  6.18s/it]

{'loss': 0.7543, 'grad_norm': 0.24388699233531952, 'learning_rate': 0.0001489075630252101, 'epoch': 0.26}


 26%|██▌       | 3198/12500 [5:35:19<14:37:55,  5.66s/it]

{'loss': 0.6283, 'grad_norm': 0.2934439480304718, 'learning_rate': 0.00014889155662264908, 'epoch': 0.26}


 26%|██▌       | 3199/12500 [5:35:24<13:31:20,  5.23s/it]

{'loss': 0.9053, 'grad_norm': 0.34289711713790894, 'learning_rate': 0.00014887555022008802, 'epoch': 0.26}


 26%|██▌       | 3200/12500 [5:35:29<13:27:31,  5.21s/it]

{'loss': 0.6157, 'grad_norm': 0.2826528251171112, 'learning_rate': 0.00014885954381752703, 'epoch': 0.26}


 26%|██▌       | 3201/12500 [5:35:35<14:00:30,  5.42s/it]

{'loss': 0.835, 'grad_norm': 0.33649390935897827, 'learning_rate': 0.000148843537414966, 'epoch': 0.26}


 26%|██▌       | 3202/12500 [5:35:41<14:42:40,  5.70s/it]

{'loss': 0.5375, 'grad_norm': 0.2737862765789032, 'learning_rate': 0.00014882753101240498, 'epoch': 0.26}


 26%|██▌       | 3203/12500 [5:35:48<15:50:14,  6.13s/it]

{'loss': 0.8353, 'grad_norm': 0.275339275598526, 'learning_rate': 0.00014881152460984395, 'epoch': 0.26}


 26%|██▌       | 3204/12500 [5:35:55<16:34:27,  6.42s/it]

{'loss': 0.5272, 'grad_norm': 0.23643678426742554, 'learning_rate': 0.00014879551820728293, 'epoch': 0.26}


 26%|██▌       | 3205/12500 [5:36:01<15:47:43,  6.12s/it]

{'loss': 0.6381, 'grad_norm': 0.2715562880039215, 'learning_rate': 0.0001487795118047219, 'epoch': 0.26}


 26%|██▌       | 3206/12500 [5:36:07<15:53:20,  6.15s/it]

{'loss': 0.9128, 'grad_norm': 0.2978498041629791, 'learning_rate': 0.00014876350540216088, 'epoch': 0.26}


 26%|██▌       | 3207/12500 [5:36:11<14:07:32,  5.47s/it]

{'loss': 0.6809, 'grad_norm': 0.4012662172317505, 'learning_rate': 0.00014874749899959985, 'epoch': 0.26}


 26%|██▌       | 3208/12500 [5:36:17<14:17:41,  5.54s/it]

{'loss': 0.7144, 'grad_norm': 0.33594000339508057, 'learning_rate': 0.00014873149259703883, 'epoch': 0.26}


 26%|██▌       | 3209/12500 [5:36:22<13:53:19,  5.38s/it]

{'loss': 0.5672, 'grad_norm': 0.2579590678215027, 'learning_rate': 0.0001487154861944778, 'epoch': 0.26}


 26%|██▌       | 3210/12500 [5:36:27<13:31:38,  5.24s/it]

{'loss': 0.822, 'grad_norm': 0.30915912985801697, 'learning_rate': 0.00014869947979191678, 'epoch': 0.26}


 26%|██▌       | 3211/12500 [5:36:31<12:40:44,  4.91s/it]

{'loss': 0.7598, 'grad_norm': 0.30975160002708435, 'learning_rate': 0.00014868347338935575, 'epoch': 0.26}


 26%|██▌       | 3212/12500 [5:36:41<16:34:56,  6.43s/it]

{'loss': 0.658, 'grad_norm': 0.17050053179264069, 'learning_rate': 0.00014866746698679472, 'epoch': 0.26}


 26%|██▌       | 3213/12500 [5:36:47<16:15:14,  6.30s/it]

{'loss': 0.6991, 'grad_norm': 0.2797947824001312, 'learning_rate': 0.0001486514605842337, 'epoch': 0.26}


 26%|██▌       | 3214/12500 [5:36:52<15:46:48,  6.12s/it]

{'loss': 0.5868, 'grad_norm': 0.23939968645572662, 'learning_rate': 0.00014863545418167267, 'epoch': 0.26}


 26%|██▌       | 3215/12500 [5:36:59<15:50:23,  6.14s/it]

{'loss': 0.7038, 'grad_norm': 0.24097804725170135, 'learning_rate': 0.00014861944777911165, 'epoch': 0.26}


 26%|██▌       | 3216/12500 [5:37:06<17:14:50,  6.69s/it]

{'loss': 0.8882, 'grad_norm': 0.24218672513961792, 'learning_rate': 0.00014860344137655062, 'epoch': 0.26}


 26%|██▌       | 3217/12500 [5:37:11<15:46:50,  6.12s/it]

{'loss': 0.5144, 'grad_norm': 0.25839969515800476, 'learning_rate': 0.0001485874349739896, 'epoch': 0.26}


 26%|██▌       | 3218/12500 [5:37:17<15:42:57,  6.10s/it]

{'loss': 0.6213, 'grad_norm': 0.3159621059894562, 'learning_rate': 0.00014857142857142857, 'epoch': 0.26}


 26%|██▌       | 3219/12500 [5:37:23<15:18:41,  5.94s/it]

{'loss': 0.5136, 'grad_norm': 0.22815310955047607, 'learning_rate': 0.00014855542216886755, 'epoch': 0.26}


 26%|██▌       | 3220/12500 [5:37:26<13:20:02,  5.17s/it]

{'loss': 0.8084, 'grad_norm': 0.3457874357700348, 'learning_rate': 0.00014853941576630652, 'epoch': 0.26}


 26%|██▌       | 3221/12500 [5:37:32<13:36:32,  5.28s/it]

{'loss': 0.4825, 'grad_norm': 0.3070802390575409, 'learning_rate': 0.00014852340936374553, 'epoch': 0.26}


 26%|██▌       | 3222/12500 [5:37:38<13:55:47,  5.41s/it]

{'loss': 0.899, 'grad_norm': 0.27560901641845703, 'learning_rate': 0.00014850740296118447, 'epoch': 0.26}


 26%|██▌       | 3223/12500 [5:37:42<13:10:43,  5.11s/it]

{'loss': 0.9672, 'grad_norm': 0.37583813071250916, 'learning_rate': 0.00014849139655862345, 'epoch': 0.26}


 26%|██▌       | 3224/12500 [5:37:49<14:38:01,  5.68s/it]

{'loss': 0.7941, 'grad_norm': 0.24189125001430511, 'learning_rate': 0.00014847539015606242, 'epoch': 0.26}


 26%|██▌       | 3225/12500 [5:37:54<14:15:26,  5.53s/it]

{'loss': 0.8143, 'grad_norm': 0.33647966384887695, 'learning_rate': 0.00014845938375350142, 'epoch': 0.26}


 26%|██▌       | 3226/12500 [5:38:03<16:33:44,  6.43s/it]

{'loss': 0.8553, 'grad_norm': 0.3346257209777832, 'learning_rate': 0.00014844337735094037, 'epoch': 0.26}


 26%|██▌       | 3227/12500 [5:38:06<14:23:30,  5.59s/it]

{'loss': 0.734, 'grad_norm': 0.3097183406352997, 'learning_rate': 0.00014842737094837935, 'epoch': 0.26}


 26%|██▌       | 3228/12500 [5:38:11<13:54:10,  5.40s/it]

{'loss': 0.5924, 'grad_norm': 0.32763659954071045, 'learning_rate': 0.00014841136454581835, 'epoch': 0.26}


 26%|██▌       | 3229/12500 [5:38:17<14:27:34,  5.61s/it]

{'loss': 0.887, 'grad_norm': 0.279714971780777, 'learning_rate': 0.00014839535814325732, 'epoch': 0.26}


 26%|██▌       | 3230/12500 [5:38:21<12:34:58,  4.89s/it]

{'loss': 0.5433, 'grad_norm': 0.2848493754863739, 'learning_rate': 0.00014837935174069627, 'epoch': 0.26}


 26%|██▌       | 3231/12500 [5:38:25<12:37:24,  4.90s/it]

{'loss': 0.6449, 'grad_norm': 0.31743791699409485, 'learning_rate': 0.00014836334533813525, 'epoch': 0.26}


 26%|██▌       | 3232/12500 [5:38:31<13:15:10,  5.15s/it]

{'loss': 0.7963, 'grad_norm': 0.27319663763046265, 'learning_rate': 0.00014834733893557425, 'epoch': 0.26}


 26%|██▌       | 3233/12500 [5:38:39<15:12:01,  5.91s/it]

{'loss': 1.1582, 'grad_norm': 0.25634944438934326, 'learning_rate': 0.00014833133253301322, 'epoch': 0.26}


 26%|██▌       | 3234/12500 [5:38:43<14:07:38,  5.49s/it]

{'loss': 0.8941, 'grad_norm': 0.3189707100391388, 'learning_rate': 0.00014831532613045217, 'epoch': 0.26}


 26%|██▌       | 3235/12500 [5:38:49<14:09:48,  5.50s/it]

{'loss': 0.8776, 'grad_norm': 0.305174320936203, 'learning_rate': 0.00014829931972789117, 'epoch': 0.26}


 26%|██▌       | 3236/12500 [5:38:59<17:30:52,  6.81s/it]

{'loss': 0.9368, 'grad_norm': 0.24139335751533508, 'learning_rate': 0.00014828331332533015, 'epoch': 0.26}


 26%|██▌       | 3237/12500 [5:39:09<20:16:21,  7.88s/it]

{'loss': 0.7646, 'grad_norm': 0.20807424187660217, 'learning_rate': 0.00014826730692276912, 'epoch': 0.26}


 26%|██▌       | 3238/12500 [5:39:14<18:09:14,  7.06s/it]

{'loss': 0.9295, 'grad_norm': 0.2753535211086273, 'learning_rate': 0.00014825130052020807, 'epoch': 0.26}


 26%|██▌       | 3239/12500 [5:39:22<18:35:05,  7.22s/it]

{'loss': 0.4845, 'grad_norm': 0.2262115180492401, 'learning_rate': 0.00014823529411764707, 'epoch': 0.26}


 26%|██▌       | 3240/12500 [5:39:29<18:42:55,  7.28s/it]

{'loss': 0.4868, 'grad_norm': 0.252775102853775, 'learning_rate': 0.00014821928771508605, 'epoch': 0.26}


 26%|██▌       | 3241/12500 [5:39:34<16:21:28,  6.36s/it]

{'loss': 0.665, 'grad_norm': 0.29656100273132324, 'learning_rate': 0.00014820328131252502, 'epoch': 0.26}


 26%|██▌       | 3242/12500 [5:39:39<15:37:40,  6.08s/it]

{'loss': 0.7028, 'grad_norm': 0.29386141896247864, 'learning_rate': 0.000148187274909964, 'epoch': 0.26}


 26%|██▌       | 3243/12500 [5:39:45<15:38:08,  6.08s/it]

{'loss': 0.6137, 'grad_norm': 0.26683154702186584, 'learning_rate': 0.00014817126850740297, 'epoch': 0.26}


 26%|██▌       | 3244/12500 [5:39:49<14:13:48,  5.53s/it]

{'loss': 0.4663, 'grad_norm': 0.2974065840244293, 'learning_rate': 0.00014815526210484195, 'epoch': 0.26}


 26%|██▌       | 3245/12500 [5:39:54<13:28:20,  5.24s/it]

{'loss': 0.8499, 'grad_norm': 0.29514795541763306, 'learning_rate': 0.00014813925570228092, 'epoch': 0.26}


 26%|██▌       | 3246/12500 [5:40:00<13:50:48,  5.39s/it]

{'loss': 0.5657, 'grad_norm': 0.2653123140335083, 'learning_rate': 0.0001481232492997199, 'epoch': 0.26}


 26%|██▌       | 3247/12500 [5:40:07<15:09:15,  5.90s/it]

{'loss': 0.6332, 'grad_norm': 0.2548598647117615, 'learning_rate': 0.00014810724289715887, 'epoch': 0.26}


 26%|██▌       | 3248/12500 [5:40:12<15:03:47,  5.86s/it]

{'loss': 0.6893, 'grad_norm': 0.29168713092803955, 'learning_rate': 0.00014809123649459785, 'epoch': 0.26}


 26%|██▌       | 3249/12500 [5:40:19<15:18:08,  5.95s/it]

{'loss': 0.8448, 'grad_norm': 0.308939665555954, 'learning_rate': 0.00014807523009203682, 'epoch': 0.26}


 26%|██▌       | 3250/12500 [5:40:24<15:10:39,  5.91s/it]

{'loss': 0.763, 'grad_norm': 0.2701050639152527, 'learning_rate': 0.0001480592236894758, 'epoch': 0.26}


 26%|██▌       | 3251/12500 [5:40:33<17:22:04,  6.76s/it]

{'loss': 0.9098, 'grad_norm': 0.23948855698108673, 'learning_rate': 0.00014804321728691477, 'epoch': 0.26}


 26%|██▌       | 3252/12500 [5:40:38<16:07:22,  6.28s/it]

{'loss': 0.5815, 'grad_norm': 0.2975925803184509, 'learning_rate': 0.00014802721088435375, 'epoch': 0.26}


 26%|██▌       | 3253/12500 [5:40:45<16:36:06,  6.46s/it]

{'loss': 0.7633, 'grad_norm': 0.21946845948696136, 'learning_rate': 0.00014801120448179272, 'epoch': 0.26}


 26%|██▌       | 3254/12500 [5:40:51<15:58:40,  6.22s/it]

{'loss': 0.6514, 'grad_norm': 0.263420045375824, 'learning_rate': 0.0001479951980792317, 'epoch': 0.26}


 26%|██▌       | 3255/12500 [5:40:58<16:18:48,  6.35s/it]

{'loss': 0.7064, 'grad_norm': 0.2939308285713196, 'learning_rate': 0.00014797919167667067, 'epoch': 0.26}


 26%|██▌       | 3256/12500 [5:41:05<16:51:56,  6.57s/it]

{'loss': 0.4256, 'grad_norm': 0.19846153259277344, 'learning_rate': 0.00014796318527410967, 'epoch': 0.26}


 26%|██▌       | 3257/12500 [5:41:10<15:50:41,  6.17s/it]

{'loss': 0.7266, 'grad_norm': 0.3254459798336029, 'learning_rate': 0.00014794717887154862, 'epoch': 0.26}


 26%|██▌       | 3258/12500 [5:41:15<14:47:40,  5.76s/it]

{'loss': 0.7473, 'grad_norm': 0.26219475269317627, 'learning_rate': 0.0001479311724689876, 'epoch': 0.26}


 26%|██▌       | 3259/12500 [5:41:19<13:22:44,  5.21s/it]

{'loss': 0.5582, 'grad_norm': 0.29653021693229675, 'learning_rate': 0.00014791516606642657, 'epoch': 0.26}


 26%|██▌       | 3260/12500 [5:41:23<12:34:56,  4.90s/it]

{'loss': 0.7432, 'grad_norm': 0.3212563693523407, 'learning_rate': 0.00014789915966386557, 'epoch': 0.26}


 26%|██▌       | 3261/12500 [5:41:30<14:21:28,  5.59s/it]

{'loss': 0.4363, 'grad_norm': 0.21779227256774902, 'learning_rate': 0.00014788315326130452, 'epoch': 0.26}


 26%|██▌       | 3262/12500 [5:41:34<13:24:53,  5.23s/it]

{'loss': 0.8663, 'grad_norm': 0.29852500557899475, 'learning_rate': 0.0001478671468587435, 'epoch': 0.26}


 26%|██▌       | 3263/12500 [5:41:39<13:00:39,  5.07s/it]

{'loss': 0.5567, 'grad_norm': 0.2736685872077942, 'learning_rate': 0.0001478511404561825, 'epoch': 0.26}


 26%|██▌       | 3264/12500 [5:41:43<11:49:23,  4.61s/it]

{'loss': 0.5123, 'grad_norm': 0.2983475625514984, 'learning_rate': 0.00014783513405362147, 'epoch': 0.26}


 26%|██▌       | 3265/12500 [5:41:50<13:51:15,  5.40s/it]

{'loss': 0.9613, 'grad_norm': 0.27083852887153625, 'learning_rate': 0.00014781912765106042, 'epoch': 0.26}


 26%|██▌       | 3266/12500 [5:41:54<13:13:13,  5.15s/it]

{'loss': 0.5653, 'grad_norm': 0.2806668281555176, 'learning_rate': 0.0001478031212484994, 'epoch': 0.26}


 26%|██▌       | 3267/12500 [5:42:00<13:39:28,  5.33s/it]

{'loss': 0.6692, 'grad_norm': 0.5278120636940002, 'learning_rate': 0.0001477871148459384, 'epoch': 0.26}


 26%|██▌       | 3268/12500 [5:42:09<16:20:04,  6.37s/it]

{'loss': 0.8698, 'grad_norm': 0.278118759393692, 'learning_rate': 0.00014777110844337737, 'epoch': 0.26}


 26%|██▌       | 3269/12500 [5:42:14<15:38:32,  6.10s/it]

{'loss': 0.7546, 'grad_norm': 0.28252631425857544, 'learning_rate': 0.00014775510204081632, 'epoch': 0.26}


 26%|██▌       | 3270/12500 [5:42:20<15:01:44,  5.86s/it]

{'loss': 0.6122, 'grad_norm': 0.2803376317024231, 'learning_rate': 0.0001477390956382553, 'epoch': 0.26}


 26%|██▌       | 3271/12500 [5:42:25<14:37:44,  5.71s/it]

{'loss': 0.8237, 'grad_norm': 0.2927395701408386, 'learning_rate': 0.0001477230892356943, 'epoch': 0.26}


 26%|██▌       | 3272/12500 [5:42:33<15:58:11,  6.23s/it]

{'loss': 0.4363, 'grad_norm': 0.2640935778617859, 'learning_rate': 0.00014770708283313327, 'epoch': 0.26}


 26%|██▌       | 3273/12500 [5:42:39<16:09:58,  6.31s/it]

{'loss': 1.0117, 'grad_norm': 0.29781872034072876, 'learning_rate': 0.00014769107643057222, 'epoch': 0.26}


 26%|██▌       | 3274/12500 [5:42:43<14:38:28,  5.71s/it]

{'loss': 0.6693, 'grad_norm': 0.3054742217063904, 'learning_rate': 0.00014767507002801122, 'epoch': 0.26}


 26%|██▌       | 3275/12500 [5:42:48<13:44:27,  5.36s/it]

{'loss': 0.4814, 'grad_norm': 0.2514050602912903, 'learning_rate': 0.0001476590636254502, 'epoch': 0.26}


 26%|██▌       | 3276/12500 [5:42:56<16:15:11,  6.34s/it]

{'loss': 1.0655, 'grad_norm': 0.3030376434326172, 'learning_rate': 0.00014764305722288917, 'epoch': 0.26}


 26%|██▌       | 3277/12500 [5:43:03<16:29:08,  6.43s/it]

{'loss': 0.826, 'grad_norm': 0.3029618561267853, 'learning_rate': 0.00014762705082032812, 'epoch': 0.26}


 26%|██▌       | 3278/12500 [5:43:09<16:23:13,  6.40s/it]

{'loss': 0.562, 'grad_norm': 0.23928743600845337, 'learning_rate': 0.00014761104441776712, 'epoch': 0.26}


 26%|██▌       | 3279/12500 [5:43:15<15:47:53,  6.17s/it]

{'loss': 0.9259, 'grad_norm': 0.2829029858112335, 'learning_rate': 0.0001475950380152061, 'epoch': 0.26}


 26%|██▌       | 3280/12500 [5:43:23<17:28:08,  6.82s/it]

{'loss': 0.5399, 'grad_norm': 0.21537163853645325, 'learning_rate': 0.00014757903161264507, 'epoch': 0.26}


 26%|██▌       | 3281/12500 [5:43:29<16:07:32,  6.30s/it]

{'loss': 0.5176, 'grad_norm': 0.2710528075695038, 'learning_rate': 0.00014756302521008404, 'epoch': 0.26}


 26%|██▋       | 3282/12500 [5:43:36<17:19:31,  6.77s/it]

{'loss': 0.5919, 'grad_norm': 0.2481529265642166, 'learning_rate': 0.00014754701880752302, 'epoch': 0.26}


 26%|██▋       | 3283/12500 [5:43:43<17:08:24,  6.69s/it]

{'loss': 0.6077, 'grad_norm': 0.32680150866508484, 'learning_rate': 0.000147531012404962, 'epoch': 0.26}


 26%|██▋       | 3284/12500 [5:43:51<18:34:46,  7.26s/it]

{'loss': 0.5655, 'grad_norm': 0.19653962552547455, 'learning_rate': 0.00014751500600240097, 'epoch': 0.26}


 26%|██▋       | 3285/12500 [5:43:57<17:25:24,  6.81s/it]

{'loss': 0.8834, 'grad_norm': 0.24461466073989868, 'learning_rate': 0.00014749899959983994, 'epoch': 0.26}


 26%|██▋       | 3286/12500 [5:44:05<18:08:28,  7.09s/it]

{'loss': 0.7373, 'grad_norm': 0.24387231469154358, 'learning_rate': 0.00014748299319727892, 'epoch': 0.26}


 26%|██▋       | 3287/12500 [5:44:16<20:48:36,  8.13s/it]

{'loss': 0.9554, 'grad_norm': 0.18710483610630035, 'learning_rate': 0.0001474669867947179, 'epoch': 0.26}


 26%|██▋       | 3288/12500 [5:44:22<19:14:57,  7.52s/it]

{'loss': 1.0464, 'grad_norm': 0.2854616343975067, 'learning_rate': 0.00014745098039215687, 'epoch': 0.26}


 26%|██▋       | 3289/12500 [5:44:28<18:30:02,  7.23s/it]

{'loss': 0.7335, 'grad_norm': 0.3162136971950531, 'learning_rate': 0.00014743497398959584, 'epoch': 0.26}


 26%|██▋       | 3290/12500 [5:44:37<19:41:17,  7.70s/it]

{'loss': 0.6822, 'grad_norm': 0.22222939133644104, 'learning_rate': 0.00014741896758703482, 'epoch': 0.26}


 26%|██▋       | 3291/12500 [5:44:42<17:17:43,  6.76s/it]

{'loss': 0.6307, 'grad_norm': 0.2769656479358673, 'learning_rate': 0.0001474029611844738, 'epoch': 0.26}


 26%|██▋       | 3292/12500 [5:44:47<16:08:47,  6.31s/it]

{'loss': 0.4882, 'grad_norm': 0.260618656873703, 'learning_rate': 0.00014738695478191277, 'epoch': 0.26}


 26%|██▋       | 3293/12500 [5:44:53<15:56:31,  6.23s/it]

{'loss': 0.6664, 'grad_norm': 0.2543172836303711, 'learning_rate': 0.00014737094837935174, 'epoch': 0.26}


 26%|██▋       | 3294/12500 [5:44:58<15:20:55,  6.00s/it]

{'loss': 0.5304, 'grad_norm': 0.24635295569896698, 'learning_rate': 0.00014735494197679072, 'epoch': 0.26}


 26%|██▋       | 3295/12500 [5:45:03<14:10:13,  5.54s/it]

{'loss': 0.625, 'grad_norm': 0.2851329743862152, 'learning_rate': 0.00014733893557422972, 'epoch': 0.26}


 26%|██▋       | 3296/12500 [5:45:11<15:57:11,  6.24s/it]

{'loss': 0.5549, 'grad_norm': 0.19747136533260345, 'learning_rate': 0.00014732292917166867, 'epoch': 0.26}


 26%|██▋       | 3297/12500 [5:45:15<14:08:52,  5.53s/it]

{'loss': 0.7656, 'grad_norm': 0.28601348400115967, 'learning_rate': 0.00014730692276910764, 'epoch': 0.26}


 26%|██▋       | 3298/12500 [5:45:23<16:32:22,  6.47s/it]

{'loss': 0.8574, 'grad_norm': 0.31046611070632935, 'learning_rate': 0.00014729091636654662, 'epoch': 0.26}


 26%|██▋       | 3299/12500 [5:45:29<16:04:43,  6.29s/it]

{'loss': 0.5546, 'grad_norm': 0.25348812341690063, 'learning_rate': 0.00014727490996398562, 'epoch': 0.26}


 26%|██▋       | 3300/12500 [5:45:32<13:46:00,  5.39s/it]

{'loss': 0.7266, 'grad_norm': 0.36926645040512085, 'learning_rate': 0.00014725890356142457, 'epoch': 0.26}


 26%|██▋       | 3301/12500 [5:45:41<16:26:29,  6.43s/it]

{'loss': 0.4302, 'grad_norm': 0.22370684146881104, 'learning_rate': 0.00014724289715886354, 'epoch': 0.26}


 26%|██▋       | 3302/12500 [5:45:47<15:35:20,  6.10s/it]

{'loss': 0.7042, 'grad_norm': 0.30630162358283997, 'learning_rate': 0.00014722689075630254, 'epoch': 0.26}


 26%|██▋       | 3303/12500 [5:45:52<15:17:40,  5.99s/it]

{'loss': 0.6967, 'grad_norm': 0.2702351212501526, 'learning_rate': 0.00014721088435374152, 'epoch': 0.26}


 26%|██▋       | 3304/12500 [5:45:57<14:15:12,  5.58s/it]

{'loss': 0.5258, 'grad_norm': 0.26944002509117126, 'learning_rate': 0.00014719487795118047, 'epoch': 0.26}


 26%|██▋       | 3305/12500 [5:46:02<13:34:21,  5.31s/it]

{'loss': 0.7289, 'grad_norm': 0.29714858531951904, 'learning_rate': 0.00014717887154861944, 'epoch': 0.26}


 26%|██▋       | 3306/12500 [5:46:09<14:53:00,  5.83s/it]

{'loss': 0.4602, 'grad_norm': 0.24195337295532227, 'learning_rate': 0.00014716286514605844, 'epoch': 0.26}


 26%|██▋       | 3307/12500 [5:46:13<14:05:06,  5.52s/it]

{'loss': 0.7007, 'grad_norm': 0.2891290783882141, 'learning_rate': 0.00014714685874349742, 'epoch': 0.26}


 26%|██▋       | 3308/12500 [5:46:19<14:11:27,  5.56s/it]

{'loss': 0.7571, 'grad_norm': 0.2890543043613434, 'learning_rate': 0.00014713085234093636, 'epoch': 0.26}


 26%|██▋       | 3309/12500 [5:46:23<13:18:00,  5.21s/it]

{'loss': 0.6603, 'grad_norm': 0.29733148217201233, 'learning_rate': 0.00014711484593837537, 'epoch': 0.26}


 26%|██▋       | 3310/12500 [5:46:28<12:44:52,  4.99s/it]

{'loss': 0.576, 'grad_norm': 0.3208109140396118, 'learning_rate': 0.00014709883953581434, 'epoch': 0.26}


 26%|██▋       | 3311/12500 [5:46:35<14:30:58,  5.69s/it]

{'loss': 0.5201, 'grad_norm': 0.20757615566253662, 'learning_rate': 0.00014708283313325332, 'epoch': 0.26}


 26%|██▋       | 3312/12500 [5:46:41<14:19:56,  5.62s/it]

{'loss': 0.8092, 'grad_norm': 0.37866178154945374, 'learning_rate': 0.00014706682673069226, 'epoch': 0.26}


 27%|██▋       | 3313/12500 [5:46:47<14:55:30,  5.85s/it]

{'loss': 0.976, 'grad_norm': 0.28035250306129456, 'learning_rate': 0.00014705082032813127, 'epoch': 0.27}


 27%|██▋       | 3314/12500 [5:46:54<16:00:39,  6.27s/it]

{'loss': 0.7448, 'grad_norm': 0.24532076716423035, 'learning_rate': 0.00014703481392557024, 'epoch': 0.27}


 27%|██▋       | 3315/12500 [5:47:01<16:38:27,  6.52s/it]

{'loss': 0.791, 'grad_norm': 0.21118085086345673, 'learning_rate': 0.00014701880752300922, 'epoch': 0.27}


 27%|██▋       | 3316/12500 [5:47:10<18:03:30,  7.08s/it]

{'loss': 0.7855, 'grad_norm': 0.2223626971244812, 'learning_rate': 0.0001470028011204482, 'epoch': 0.27}


 27%|██▋       | 3317/12500 [5:47:15<16:34:41,  6.50s/it]

{'loss': 0.477, 'grad_norm': 0.2622756361961365, 'learning_rate': 0.00014698679471788717, 'epoch': 0.27}


 27%|██▋       | 3318/12500 [5:47:23<17:52:23,  7.01s/it]

{'loss': 0.7846, 'grad_norm': 0.23484352231025696, 'learning_rate': 0.00014697078831532614, 'epoch': 0.27}


 27%|██▋       | 3319/12500 [5:47:30<17:21:47,  6.81s/it]

{'loss': 0.7261, 'grad_norm': 0.22453951835632324, 'learning_rate': 0.00014695478191276512, 'epoch': 0.27}


 27%|██▋       | 3320/12500 [5:47:34<15:53:37,  6.23s/it]

{'loss': 0.7592, 'grad_norm': 0.2720023989677429, 'learning_rate': 0.0001469387755102041, 'epoch': 0.27}


 27%|██▋       | 3321/12500 [5:47:42<17:01:57,  6.68s/it]

{'loss': 0.6407, 'grad_norm': 0.21198561787605286, 'learning_rate': 0.00014692276910764307, 'epoch': 0.27}


 27%|██▋       | 3322/12500 [5:47:49<17:09:31,  6.73s/it]

{'loss': 0.8554, 'grad_norm': 0.2873157858848572, 'learning_rate': 0.00014690676270508204, 'epoch': 0.27}


 27%|██▋       | 3323/12500 [5:48:00<20:23:53,  8.00s/it]

{'loss': 0.5697, 'grad_norm': 0.16081561148166656, 'learning_rate': 0.00014689075630252101, 'epoch': 0.27}


 27%|██▋       | 3324/12500 [5:48:04<17:19:57,  6.80s/it]

{'loss': 0.7521, 'grad_norm': 0.3164993226528168, 'learning_rate': 0.00014687474989996, 'epoch': 0.27}


 27%|██▋       | 3325/12500 [5:48:10<17:07:52,  6.72s/it]

{'loss': 0.7645, 'grad_norm': 0.22429727017879486, 'learning_rate': 0.00014685874349739896, 'epoch': 0.27}


 27%|██▋       | 3326/12500 [5:48:20<19:00:59,  7.46s/it]

{'loss': 0.8261, 'grad_norm': 0.22781610488891602, 'learning_rate': 0.00014684273709483794, 'epoch': 0.27}


 27%|██▋       | 3327/12500 [5:48:24<16:24:09,  6.44s/it]

{'loss': 0.7698, 'grad_norm': 0.37176889181137085, 'learning_rate': 0.00014682673069227691, 'epoch': 0.27}


 27%|██▋       | 3328/12500 [5:48:29<15:43:01,  6.17s/it]

{'loss': 0.8583, 'grad_norm': 0.263468474149704, 'learning_rate': 0.0001468107242897159, 'epoch': 0.27}


 27%|██▋       | 3329/12500 [5:48:37<16:51:07,  6.62s/it]

{'loss': 0.8518, 'grad_norm': 0.25071051716804504, 'learning_rate': 0.00014679471788715486, 'epoch': 0.27}


 27%|██▋       | 3330/12500 [5:48:42<15:29:38,  6.08s/it]

{'loss': 0.6485, 'grad_norm': 0.32181599736213684, 'learning_rate': 0.00014677871148459384, 'epoch': 0.27}


 27%|██▋       | 3331/12500 [5:48:46<14:16:06,  5.60s/it]

{'loss': 0.7911, 'grad_norm': 0.29815271496772766, 'learning_rate': 0.00014676270508203281, 'epoch': 0.27}


 27%|██▋       | 3332/12500 [5:48:52<14:34:36,  5.72s/it]

{'loss': 0.793, 'grad_norm': 0.2446068525314331, 'learning_rate': 0.0001467466986794718, 'epoch': 0.27}


 27%|██▋       | 3333/12500 [5:48:57<13:28:51,  5.29s/it]

{'loss': 0.5912, 'grad_norm': 0.2971843183040619, 'learning_rate': 0.00014673069227691076, 'epoch': 0.27}


 27%|██▋       | 3334/12500 [5:49:03<14:21:32,  5.64s/it]

{'loss': 0.7074, 'grad_norm': 0.26200243830680847, 'learning_rate': 0.00014671468587434977, 'epoch': 0.27}


 27%|██▋       | 3335/12500 [5:49:08<13:59:45,  5.50s/it]

{'loss': 0.6162, 'grad_norm': 0.3320298492908478, 'learning_rate': 0.0001466986794717887, 'epoch': 0.27}


 27%|██▋       | 3336/12500 [5:49:13<13:43:42,  5.39s/it]

{'loss': 0.542, 'grad_norm': 0.2505171597003937, 'learning_rate': 0.0001466826730692277, 'epoch': 0.27}


 27%|██▋       | 3337/12500 [5:49:21<15:28:29,  6.08s/it]

{'loss': 0.8019, 'grad_norm': 0.2625948488712311, 'learning_rate': 0.00014666666666666666, 'epoch': 0.27}


 27%|██▋       | 3338/12500 [5:49:26<14:18:47,  5.62s/it]

{'loss': 0.8543, 'grad_norm': 0.3218157887458801, 'learning_rate': 0.00014665066026410566, 'epoch': 0.27}


 27%|██▋       | 3339/12500 [5:49:32<14:40:58,  5.77s/it]

{'loss': 0.8736, 'grad_norm': 0.2600334584712982, 'learning_rate': 0.0001466346538615446, 'epoch': 0.27}


 27%|██▋       | 3340/12500 [5:49:40<16:17:35,  6.40s/it]

{'loss': 0.9761, 'grad_norm': 0.2595033049583435, 'learning_rate': 0.0001466186474589836, 'epoch': 0.27}


 27%|██▋       | 3341/12500 [5:49:47<17:20:37,  6.82s/it]

{'loss': 0.495, 'grad_norm': 0.21114757657051086, 'learning_rate': 0.0001466026410564226, 'epoch': 0.27}


 27%|██▋       | 3342/12500 [5:49:55<17:41:55,  6.96s/it]

{'loss': 0.7153, 'grad_norm': 0.26204025745391846, 'learning_rate': 0.00014658663465386156, 'epoch': 0.27}


 27%|██▋       | 3343/12500 [5:50:00<16:11:42,  6.37s/it]

{'loss': 0.6406, 'grad_norm': 0.2727544903755188, 'learning_rate': 0.0001465706282513005, 'epoch': 0.27}


 27%|██▋       | 3344/12500 [5:50:07<17:02:42,  6.70s/it]

{'loss': 0.9963, 'grad_norm': 0.2565029561519623, 'learning_rate': 0.0001465546218487395, 'epoch': 0.27}


 27%|██▋       | 3345/12500 [5:50:12<15:36:05,  6.13s/it]

{'loss': 0.6168, 'grad_norm': 0.292966365814209, 'learning_rate': 0.0001465386154461785, 'epoch': 0.27}


 27%|██▋       | 3346/12500 [5:50:16<14:17:11,  5.62s/it]

{'loss': 0.8854, 'grad_norm': 0.3012329936027527, 'learning_rate': 0.00014652260904361746, 'epoch': 0.27}


 27%|██▋       | 3347/12500 [5:50:23<14:52:49,  5.85s/it]

{'loss': 0.9101, 'grad_norm': 0.2754915654659271, 'learning_rate': 0.0001465066026410564, 'epoch': 0.27}


 27%|██▋       | 3348/12500 [5:50:27<13:21:23,  5.25s/it]

{'loss': 0.7548, 'grad_norm': 0.3009779751300812, 'learning_rate': 0.0001464905962384954, 'epoch': 0.27}


 27%|██▋       | 3349/12500 [5:50:30<12:03:03,  4.74s/it]

{'loss': 0.8243, 'grad_norm': 0.3296637237071991, 'learning_rate': 0.0001464745898359344, 'epoch': 0.27}


 27%|██▋       | 3350/12500 [5:50:35<12:03:52,  4.75s/it]

{'loss': 0.6054, 'grad_norm': 0.29717886447906494, 'learning_rate': 0.00014645858343337336, 'epoch': 0.27}


 27%|██▋       | 3351/12500 [5:50:45<16:07:15,  6.34s/it]

{'loss': 0.5251, 'grad_norm': 0.19538439810276031, 'learning_rate': 0.0001464425770308123, 'epoch': 0.27}


 27%|██▋       | 3352/12500 [5:50:54<17:48:35,  7.01s/it]

{'loss': 0.9895, 'grad_norm': 0.24918662011623383, 'learning_rate': 0.0001464265706282513, 'epoch': 0.27}


 27%|██▋       | 3353/12500 [5:51:00<17:08:50,  6.75s/it]

{'loss': 0.8682, 'grad_norm': 0.2564311623573303, 'learning_rate': 0.0001464105642256903, 'epoch': 0.27}


 27%|██▋       | 3354/12500 [5:51:08<18:39:48,  7.35s/it]

{'loss': 0.6541, 'grad_norm': 0.22593827545642853, 'learning_rate': 0.00014639455782312926, 'epoch': 0.27}


 27%|██▋       | 3355/12500 [5:51:15<18:21:29,  7.23s/it]

{'loss': 0.7252, 'grad_norm': 0.25407537817955017, 'learning_rate': 0.00014637855142056824, 'epoch': 0.27}


 27%|██▋       | 3356/12500 [5:51:20<16:41:16,  6.57s/it]

{'loss': 0.5362, 'grad_norm': 0.30276182293891907, 'learning_rate': 0.0001463625450180072, 'epoch': 0.27}


 27%|██▋       | 3357/12500 [5:51:28<17:43:50,  6.98s/it]

{'loss': 0.7541, 'grad_norm': 0.25093650817871094, 'learning_rate': 0.0001463465386154462, 'epoch': 0.27}


 27%|██▋       | 3358/12500 [5:51:36<18:12:24,  7.17s/it]

{'loss': 0.8959, 'grad_norm': 0.24728544056415558, 'learning_rate': 0.00014633053221288516, 'epoch': 0.27}


 27%|██▋       | 3359/12500 [5:51:43<18:20:14,  7.22s/it]

{'loss': 0.6062, 'grad_norm': 0.2554972767829895, 'learning_rate': 0.00014631452581032414, 'epoch': 0.27}


 27%|██▋       | 3360/12500 [5:51:47<15:47:24,  6.22s/it]

{'loss': 0.957, 'grad_norm': 0.35732412338256836, 'learning_rate': 0.0001462985194077631, 'epoch': 0.27}


 27%|██▋       | 3361/12500 [5:51:52<14:35:45,  5.75s/it]

{'loss': 0.8791, 'grad_norm': 0.2863965332508087, 'learning_rate': 0.0001462825130052021, 'epoch': 0.27}


 27%|██▋       | 3362/12500 [5:51:57<14:02:25,  5.53s/it]

{'loss': 0.5698, 'grad_norm': 0.2645474970340729, 'learning_rate': 0.00014626650660264106, 'epoch': 0.27}


 27%|██▋       | 3363/12500 [5:52:03<14:13:16,  5.60s/it]

{'loss': 0.678, 'grad_norm': 0.263034850358963, 'learning_rate': 0.00014625050020008004, 'epoch': 0.27}


 27%|██▋       | 3364/12500 [5:52:10<15:23:28,  6.06s/it]

{'loss': 0.7298, 'grad_norm': 0.29499322175979614, 'learning_rate': 0.000146234493797519, 'epoch': 0.27}


 27%|██▋       | 3365/12500 [5:52:16<15:34:29,  6.14s/it]

{'loss': 0.7402, 'grad_norm': 0.24181661009788513, 'learning_rate': 0.00014621848739495799, 'epoch': 0.27}


 27%|██▋       | 3366/12500 [5:52:25<17:56:22,  7.07s/it]

{'loss': 0.3847, 'grad_norm': 0.17109134793281555, 'learning_rate': 0.00014620248099239696, 'epoch': 0.27}


 27%|██▋       | 3367/12500 [5:52:31<16:39:19,  6.57s/it]

{'loss': 0.8405, 'grad_norm': 0.30888181924819946, 'learning_rate': 0.00014618647458983594, 'epoch': 0.27}


 27%|██▋       | 3368/12500 [5:52:38<17:01:11,  6.71s/it]

{'loss': 0.8485, 'grad_norm': 0.23131300508975983, 'learning_rate': 0.0001461704681872749, 'epoch': 0.27}


 27%|██▋       | 3369/12500 [5:52:45<17:28:34,  6.89s/it]

{'loss': 0.9111, 'grad_norm': 0.25900352001190186, 'learning_rate': 0.0001461544617847139, 'epoch': 0.27}


 27%|██▋       | 3370/12500 [5:52:50<16:14:13,  6.40s/it]

{'loss': 0.7826, 'grad_norm': 0.2568698823451996, 'learning_rate': 0.00014613845538215286, 'epoch': 0.27}


 27%|██▋       | 3371/12500 [5:52:57<16:28:59,  6.50s/it]

{'loss': 0.738, 'grad_norm': 0.3389623463153839, 'learning_rate': 0.00014612244897959183, 'epoch': 0.27}


 27%|██▋       | 3372/12500 [5:53:07<19:29:21,  7.69s/it]

{'loss': 0.6294, 'grad_norm': 0.23400026559829712, 'learning_rate': 0.0001461064425770308, 'epoch': 0.27}


 27%|██▋       | 3373/12500 [5:53:12<17:20:53,  6.84s/it]

{'loss': 0.636, 'grad_norm': 0.26008808612823486, 'learning_rate': 0.0001460904361744698, 'epoch': 0.27}


 27%|██▋       | 3374/12500 [5:53:17<15:54:41,  6.28s/it]

{'loss': 0.8317, 'grad_norm': 0.2948058247566223, 'learning_rate': 0.00014607442977190876, 'epoch': 0.27}


 27%|██▋       | 3375/12500 [5:53:23<15:45:27,  6.22s/it]

{'loss': 0.573, 'grad_norm': 0.2945408821105957, 'learning_rate': 0.00014605842336934773, 'epoch': 0.27}


 27%|██▋       | 3376/12500 [5:53:32<17:35:46,  6.94s/it]

{'loss': 1.0266, 'grad_norm': 0.2659807503223419, 'learning_rate': 0.00014604241696678674, 'epoch': 0.27}


 27%|██▋       | 3377/12500 [5:53:36<15:39:19,  6.18s/it]

{'loss': 1.0028, 'grad_norm': 0.3553811311721802, 'learning_rate': 0.0001460264105642257, 'epoch': 0.27}


 27%|██▋       | 3378/12500 [5:53:40<13:36:24,  5.37s/it]

{'loss': 0.8333, 'grad_norm': 0.36462855339050293, 'learning_rate': 0.00014601040416166466, 'epoch': 0.27}


 27%|██▋       | 3379/12500 [5:53:47<14:46:28,  5.83s/it]

{'loss': 0.9907, 'grad_norm': 0.254220575094223, 'learning_rate': 0.00014599439775910363, 'epoch': 0.27}


 27%|██▋       | 3380/12500 [5:53:54<15:59:47,  6.31s/it]

{'loss': 0.4304, 'grad_norm': 0.22769609093666077, 'learning_rate': 0.00014597839135654264, 'epoch': 0.27}


 27%|██▋       | 3381/12500 [5:54:02<16:54:08,  6.67s/it]

{'loss': 0.6491, 'grad_norm': 0.21297018229961395, 'learning_rate': 0.0001459623849539816, 'epoch': 0.27}


 27%|██▋       | 3382/12500 [5:54:08<16:23:46,  6.47s/it]

{'loss': 0.7592, 'grad_norm': 0.33048442006111145, 'learning_rate': 0.00014594637855142056, 'epoch': 0.27}


 27%|██▋       | 3383/12500 [5:54:14<16:13:58,  6.41s/it]

{'loss': 0.9595, 'grad_norm': 0.27214398980140686, 'learning_rate': 0.00014593037214885953, 'epoch': 0.27}


 27%|██▋       | 3384/12500 [5:54:23<18:30:42,  7.31s/it]

{'loss': 0.7103, 'grad_norm': 0.18451760709285736, 'learning_rate': 0.00014591436574629854, 'epoch': 0.27}


 27%|██▋       | 3385/12500 [5:54:31<19:04:01,  7.53s/it]

{'loss': 0.8643, 'grad_norm': 0.24051184952259064, 'learning_rate': 0.0001458983593437375, 'epoch': 0.27}


 27%|██▋       | 3386/12500 [5:54:37<17:18:23,  6.84s/it]

{'loss': 0.5826, 'grad_norm': 0.2749798595905304, 'learning_rate': 0.00014588235294117646, 'epoch': 0.27}


 27%|██▋       | 3387/12500 [5:54:42<15:52:35,  6.27s/it]

{'loss': 0.6698, 'grad_norm': 0.32125669717788696, 'learning_rate': 0.00014586634653861546, 'epoch': 0.27}


 27%|██▋       | 3388/12500 [5:54:49<16:41:23,  6.59s/it]

{'loss': 0.6426, 'grad_norm': 0.26362496614456177, 'learning_rate': 0.00014585034013605443, 'epoch': 0.27}


 27%|██▋       | 3389/12500 [5:54:55<15:53:18,  6.28s/it]

{'loss': 0.5021, 'grad_norm': 0.253634512424469, 'learning_rate': 0.0001458343337334934, 'epoch': 0.27}


 27%|██▋       | 3390/12500 [5:55:00<15:19:05,  6.05s/it]

{'loss': 0.7495, 'grad_norm': 0.35009363293647766, 'learning_rate': 0.00014581832733093236, 'epoch': 0.27}


 27%|██▋       | 3391/12500 [5:55:04<13:45:37,  5.44s/it]

{'loss': 0.483, 'grad_norm': 0.27946892380714417, 'learning_rate': 0.00014580232092837136, 'epoch': 0.27}


 27%|██▋       | 3392/12500 [5:55:09<13:18:26,  5.26s/it]

{'loss': 0.6417, 'grad_norm': 0.303914338350296, 'learning_rate': 0.00014578631452581033, 'epoch': 0.27}


 27%|██▋       | 3393/12500 [5:55:14<13:05:26,  5.17s/it]

{'loss': 0.7026, 'grad_norm': 0.3005256652832031, 'learning_rate': 0.0001457703081232493, 'epoch': 0.27}


 27%|██▋       | 3394/12500 [5:55:18<12:27:35,  4.93s/it]

{'loss': 0.7823, 'grad_norm': 0.31244122982025146, 'learning_rate': 0.00014575430172068828, 'epoch': 0.27}


 27%|██▋       | 3395/12500 [5:55:25<14:00:35,  5.54s/it]

{'loss': 0.4576, 'grad_norm': 0.22903360426425934, 'learning_rate': 0.00014573829531812726, 'epoch': 0.27}


 27%|██▋       | 3396/12500 [5:55:33<15:36:18,  6.17s/it]

{'loss': 0.7106, 'grad_norm': 0.3078915476799011, 'learning_rate': 0.00014572228891556623, 'epoch': 0.27}


 27%|██▋       | 3397/12500 [5:55:40<15:58:09,  6.32s/it]

{'loss': 0.3814, 'grad_norm': 0.253083735704422, 'learning_rate': 0.0001457062825130052, 'epoch': 0.27}


 27%|██▋       | 3398/12500 [5:55:44<14:16:47,  5.65s/it]

{'loss': 0.5909, 'grad_norm': 0.284341961145401, 'learning_rate': 0.00014569027611044418, 'epoch': 0.27}


 27%|██▋       | 3399/12500 [5:55:49<14:04:38,  5.57s/it]

{'loss': 0.5506, 'grad_norm': 0.24551762640476227, 'learning_rate': 0.00014567426970788316, 'epoch': 0.27}


 27%|██▋       | 3400/12500 [5:55:55<14:22:50,  5.69s/it]

{'loss': 0.753, 'grad_norm': 0.2631879448890686, 'learning_rate': 0.00014565826330532213, 'epoch': 0.27}


 27%|██▋       | 3401/12500 [5:56:01<14:33:31,  5.76s/it]

{'loss': 1.0632, 'grad_norm': 0.31092771887779236, 'learning_rate': 0.0001456422569027611, 'epoch': 0.27}


 27%|██▋       | 3402/12500 [5:56:08<15:31:11,  6.14s/it]

{'loss': 0.7112, 'grad_norm': 0.36537328362464905, 'learning_rate': 0.00014562625050020008, 'epoch': 0.27}


 27%|██▋       | 3403/12500 [5:56:13<14:51:02,  5.88s/it]

{'loss': 0.9999, 'grad_norm': 0.3278350532054901, 'learning_rate': 0.00014561024409763906, 'epoch': 0.27}


 27%|██▋       | 3404/12500 [5:56:17<13:09:13,  5.21s/it]

{'loss': 0.6128, 'grad_norm': 0.3097818195819855, 'learning_rate': 0.00014559423769507803, 'epoch': 0.27}


 27%|██▋       | 3405/12500 [5:56:24<14:34:09,  5.77s/it]

{'loss': 0.6029, 'grad_norm': 0.2366843819618225, 'learning_rate': 0.000145578231292517, 'epoch': 0.27}


 27%|██▋       | 3406/12500 [5:56:33<17:13:49,  6.82s/it]

{'loss': 0.5012, 'grad_norm': 0.24703004956245422, 'learning_rate': 0.00014556222488995598, 'epoch': 0.27}


 27%|██▋       | 3407/12500 [5:56:39<16:20:11,  6.47s/it]

{'loss': 0.811, 'grad_norm': 0.3840702474117279, 'learning_rate': 0.00014554621848739496, 'epoch': 0.27}


 27%|██▋       | 3408/12500 [5:56:45<16:06:48,  6.38s/it]

{'loss': 0.8405, 'grad_norm': 0.32878410816192627, 'learning_rate': 0.00014553021208483396, 'epoch': 0.27}


 27%|██▋       | 3409/12500 [5:56:51<15:36:10,  6.18s/it]

{'loss': 0.5406, 'grad_norm': 0.2819153666496277, 'learning_rate': 0.0001455142056822729, 'epoch': 0.27}


 27%|██▋       | 3410/12500 [5:56:56<15:04:22,  5.97s/it]

{'loss': 0.582, 'grad_norm': 0.26261210441589355, 'learning_rate': 0.00014549819927971188, 'epoch': 0.27}


 27%|██▋       | 3411/12500 [5:57:04<16:11:19,  6.41s/it]

{'loss': 0.7782, 'grad_norm': 0.24378958344459534, 'learning_rate': 0.00014548219287715086, 'epoch': 0.27}


 27%|██▋       | 3412/12500 [5:57:07<13:57:22,  5.53s/it]

{'loss': 0.7223, 'grad_norm': 0.337636798620224, 'learning_rate': 0.00014546618647458986, 'epoch': 0.27}


 27%|██▋       | 3413/12500 [5:57:13<14:32:47,  5.76s/it]

{'loss': 0.5651, 'grad_norm': 0.2844264507293701, 'learning_rate': 0.0001454501800720288, 'epoch': 0.27}


 27%|██▋       | 3414/12500 [5:57:21<15:40:28,  6.21s/it]

{'loss': 0.9902, 'grad_norm': 0.2364417165517807, 'learning_rate': 0.00014543417366946778, 'epoch': 0.27}


 27%|██▋       | 3415/12500 [5:57:28<16:47:57,  6.66s/it]

{'loss': 0.6718, 'grad_norm': 0.23672810196876526, 'learning_rate': 0.00014541816726690678, 'epoch': 0.27}


 27%|██▋       | 3416/12500 [5:57:34<16:09:01,  6.40s/it]

{'loss': 0.9088, 'grad_norm': 0.358920693397522, 'learning_rate': 0.00014540216086434576, 'epoch': 0.27}


 27%|██▋       | 3417/12500 [5:57:40<15:59:08,  6.34s/it]

{'loss': 0.994, 'grad_norm': 0.29942482709884644, 'learning_rate': 0.0001453861544617847, 'epoch': 0.27}


 27%|██▋       | 3418/12500 [5:57:45<14:57:23,  5.93s/it]

{'loss': 0.6271, 'grad_norm': 0.28077492117881775, 'learning_rate': 0.00014537014805922368, 'epoch': 0.27}


 27%|██▋       | 3419/12500 [5:57:53<16:10:27,  6.41s/it]

{'loss': 0.7963, 'grad_norm': 0.28490597009658813, 'learning_rate': 0.00014535414165666268, 'epoch': 0.27}


 27%|██▋       | 3420/12500 [5:57:58<14:56:17,  5.92s/it]

{'loss': 0.6028, 'grad_norm': 0.2965436577796936, 'learning_rate': 0.00014533813525410166, 'epoch': 0.27}


 27%|██▋       | 3421/12500 [5:58:02<13:59:20,  5.55s/it]

{'loss': 0.7429, 'grad_norm': 0.281524658203125, 'learning_rate': 0.0001453221288515406, 'epoch': 0.27}


 27%|██▋       | 3422/12500 [5:58:09<15:12:06,  6.03s/it]

{'loss': 0.8242, 'grad_norm': 0.3099183738231659, 'learning_rate': 0.0001453061224489796, 'epoch': 0.27}


 27%|██▋       | 3423/12500 [5:58:18<16:55:11,  6.71s/it]

{'loss': 0.7023, 'grad_norm': 0.22136326134204865, 'learning_rate': 0.00014529011604641858, 'epoch': 0.27}


 27%|██▋       | 3424/12500 [5:58:23<15:55:43,  6.32s/it]

{'loss': 0.6494, 'grad_norm': 0.2792818546295166, 'learning_rate': 0.00014527410964385756, 'epoch': 0.27}


 27%|██▋       | 3425/12500 [5:58:30<16:06:07,  6.39s/it]

{'loss': 1.0597, 'grad_norm': 0.26398003101348877, 'learning_rate': 0.0001452581032412965, 'epoch': 0.27}


 27%|██▋       | 3426/12500 [5:58:39<18:31:39,  7.35s/it]

{'loss': 0.6143, 'grad_norm': 0.3236866593360901, 'learning_rate': 0.0001452420968387355, 'epoch': 0.27}


 27%|██▋       | 3427/12500 [5:58:46<18:08:45,  7.20s/it]

{'loss': 0.6055, 'grad_norm': 0.2543978691101074, 'learning_rate': 0.00014522609043617448, 'epoch': 0.27}


 27%|██▋       | 3428/12500 [5:58:53<18:09:54,  7.21s/it]

{'loss': 0.6963, 'grad_norm': 0.2314581573009491, 'learning_rate': 0.00014521008403361346, 'epoch': 0.27}


 27%|██▋       | 3429/12500 [5:58:59<17:09:52,  6.81s/it]

{'loss': 0.4987, 'grad_norm': 0.2923784554004669, 'learning_rate': 0.00014519407763105243, 'epoch': 0.27}


 27%|██▋       | 3430/12500 [5:59:06<17:21:48,  6.89s/it]

{'loss': 0.3319, 'grad_norm': 0.20432937145233154, 'learning_rate': 0.0001451780712284914, 'epoch': 0.27}


 27%|██▋       | 3431/12500 [5:59:12<16:29:32,  6.55s/it]

{'loss': 0.9999, 'grad_norm': 0.34454599022865295, 'learning_rate': 0.00014516206482593038, 'epoch': 0.27}


 27%|██▋       | 3432/12500 [5:59:20<17:39:07,  7.01s/it]

{'loss': 0.6148, 'grad_norm': 0.24976441264152527, 'learning_rate': 0.00014514605842336936, 'epoch': 0.27}


 27%|██▋       | 3433/12500 [5:59:24<15:12:33,  6.04s/it]

{'loss': 0.7112, 'grad_norm': 0.3402484059333801, 'learning_rate': 0.00014513005202080833, 'epoch': 0.27}


 27%|██▋       | 3434/12500 [5:59:29<14:28:31,  5.75s/it]

{'loss': 1.1215, 'grad_norm': 0.32316938042640686, 'learning_rate': 0.0001451140456182473, 'epoch': 0.27}


 27%|██▋       | 3435/12500 [5:59:37<15:53:45,  6.31s/it]

{'loss': 0.7069, 'grad_norm': 0.3107595145702362, 'learning_rate': 0.00014509803921568628, 'epoch': 0.27}


 27%|██▋       | 3436/12500 [5:59:42<15:14:35,  6.05s/it]

{'loss': 0.5007, 'grad_norm': 0.24029530584812164, 'learning_rate': 0.00014508203281312525, 'epoch': 0.27}


 27%|██▋       | 3437/12500 [5:59:46<13:25:48,  5.33s/it]

{'loss': 0.8132, 'grad_norm': 0.3122701346874237, 'learning_rate': 0.00014506602641056423, 'epoch': 0.27}


 28%|██▊       | 3438/12500 [5:59:56<17:23:29,  6.91s/it]

{'loss': 1.2575, 'grad_norm': 0.22073853015899658, 'learning_rate': 0.0001450500200080032, 'epoch': 0.28}


 28%|██▊       | 3439/12500 [6:00:05<18:42:26,  7.43s/it]

{'loss': 0.7441, 'grad_norm': 0.22789335250854492, 'learning_rate': 0.00014503401360544218, 'epoch': 0.28}


 28%|██▊       | 3440/12500 [6:00:12<18:07:50,  7.20s/it]

{'loss': 0.6909, 'grad_norm': 0.24548037350177765, 'learning_rate': 0.00014501800720288115, 'epoch': 0.28}


 28%|██▊       | 3441/12500 [6:00:16<16:08:06,  6.41s/it]

{'loss': 0.7181, 'grad_norm': 0.279502809047699, 'learning_rate': 0.00014500200080032013, 'epoch': 0.28}


 28%|██▊       | 3442/12500 [6:00:21<14:54:15,  5.92s/it]

{'loss': 0.679, 'grad_norm': 0.33745160698890686, 'learning_rate': 0.0001449859943977591, 'epoch': 0.28}


 28%|██▊       | 3443/12500 [6:00:26<14:03:46,  5.59s/it]

{'loss': 0.5639, 'grad_norm': 0.25729650259017944, 'learning_rate': 0.00014496998799519808, 'epoch': 0.28}


 28%|██▊       | 3444/12500 [6:00:31<13:36:47,  5.41s/it]

{'loss': 0.9397, 'grad_norm': 0.3532424569129944, 'learning_rate': 0.00014495398159263705, 'epoch': 0.28}


 28%|██▊       | 3445/12500 [6:00:35<12:26:12,  4.94s/it]

{'loss': 0.528, 'grad_norm': 0.29649320244789124, 'learning_rate': 0.00014493797519007603, 'epoch': 0.28}


 28%|██▊       | 3446/12500 [6:00:40<12:25:31,  4.94s/it]

{'loss': 0.9126, 'grad_norm': 0.4307694733142853, 'learning_rate': 0.000144921968787515, 'epoch': 0.28}


 28%|██▊       | 3447/12500 [6:00:46<13:20:25,  5.30s/it]

{'loss': 0.7496, 'grad_norm': 0.24867123365402222, 'learning_rate': 0.000144905962384954, 'epoch': 0.28}


 28%|██▊       | 3448/12500 [6:00:51<13:28:30,  5.36s/it]

{'loss': 0.7011, 'grad_norm': 0.2597379684448242, 'learning_rate': 0.00014488995598239295, 'epoch': 0.28}


 28%|██▊       | 3449/12500 [6:00:56<12:53:22,  5.13s/it]

{'loss': 0.4302, 'grad_norm': 0.25784701108932495, 'learning_rate': 0.00014487394957983193, 'epoch': 0.28}


 28%|██▊       | 3450/12500 [6:01:04<14:56:59,  5.95s/it]

{'loss': 0.7286, 'grad_norm': 0.230259507894516, 'learning_rate': 0.0001448579431772709, 'epoch': 0.28}


 28%|██▊       | 3451/12500 [6:01:10<15:03:17,  5.99s/it]

{'loss': 0.6885, 'grad_norm': 0.3215794861316681, 'learning_rate': 0.0001448419367747099, 'epoch': 0.28}


 28%|██▊       | 3452/12500 [6:01:15<14:37:02,  5.82s/it]

{'loss': 1.0562, 'grad_norm': 0.2881144881248474, 'learning_rate': 0.00014482593037214885, 'epoch': 0.28}


 28%|██▊       | 3453/12500 [6:01:19<13:15:45,  5.28s/it]

{'loss': 0.7339, 'grad_norm': 0.3233867883682251, 'learning_rate': 0.00014480992396958783, 'epoch': 0.28}


 28%|██▊       | 3454/12500 [6:01:26<14:17:48,  5.69s/it]

{'loss': 0.7752, 'grad_norm': 0.22596639394760132, 'learning_rate': 0.00014479391756702683, 'epoch': 0.28}


 28%|██▊       | 3455/12500 [6:01:33<15:30:03,  6.17s/it]

{'loss': 0.4017, 'grad_norm': 0.198678120970726, 'learning_rate': 0.0001447779111644658, 'epoch': 0.28}


 28%|██▊       | 3456/12500 [6:01:39<15:15:33,  6.07s/it]

{'loss': 0.9197, 'grad_norm': 0.2843990921974182, 'learning_rate': 0.00014476190476190475, 'epoch': 0.28}


 28%|██▊       | 3457/12500 [6:01:47<16:38:04,  6.62s/it]

{'loss': 0.6905, 'grad_norm': 0.2031034380197525, 'learning_rate': 0.00014474589835934373, 'epoch': 0.28}


 28%|██▊       | 3458/12500 [6:01:52<15:26:21,  6.15s/it]

{'loss': 0.5791, 'grad_norm': 0.28679341077804565, 'learning_rate': 0.00014472989195678273, 'epoch': 0.28}


 28%|██▊       | 3459/12500 [6:02:03<19:12:55,  7.65s/it]

{'loss': 0.6864, 'grad_norm': 0.20227856934070587, 'learning_rate': 0.0001447138855542217, 'epoch': 0.28}


 28%|██▊       | 3460/12500 [6:02:13<20:55:53,  8.34s/it]

{'loss': 0.825, 'grad_norm': 0.2343902289867401, 'learning_rate': 0.00014469787915166065, 'epoch': 0.28}


 28%|██▊       | 3461/12500 [6:02:21<20:41:24,  8.24s/it]

{'loss': 0.5178, 'grad_norm': 0.22530682384967804, 'learning_rate': 0.00014468187274909965, 'epoch': 0.28}


 28%|██▊       | 3462/12500 [6:02:28<19:21:05,  7.71s/it]

{'loss': 0.7228, 'grad_norm': 0.2703658938407898, 'learning_rate': 0.00014466586634653863, 'epoch': 0.28}


 28%|██▊       | 3463/12500 [6:02:35<19:03:34,  7.59s/it]

{'loss': 0.9147, 'grad_norm': 0.2620767652988434, 'learning_rate': 0.0001446498599439776, 'epoch': 0.28}


 28%|██▊       | 3464/12500 [6:02:40<17:08:51,  6.83s/it]

{'loss': 0.5816, 'grad_norm': 0.28041690587997437, 'learning_rate': 0.00014463385354141655, 'epoch': 0.28}


 28%|██▊       | 3465/12500 [6:02:46<16:53:53,  6.73s/it]

{'loss': 1.1761, 'grad_norm': 0.3081073462963104, 'learning_rate': 0.00014461784713885555, 'epoch': 0.28}


 28%|██▊       | 3466/12500 [6:02:55<18:23:24,  7.33s/it]

{'loss': 0.6333, 'grad_norm': 0.19785010814666748, 'learning_rate': 0.00014460184073629453, 'epoch': 0.28}


 28%|██▊       | 3467/12500 [6:03:00<16:22:54,  6.53s/it]

{'loss': 0.587, 'grad_norm': 0.3093029856681824, 'learning_rate': 0.0001445858343337335, 'epoch': 0.28}


 28%|██▊       | 3468/12500 [6:03:06<16:12:25,  6.46s/it]

{'loss': 0.6527, 'grad_norm': 0.2923832833766937, 'learning_rate': 0.00014456982793117248, 'epoch': 0.28}


 28%|██▊       | 3469/12500 [6:03:11<15:24:02,  6.14s/it]

{'loss': 0.6176, 'grad_norm': 0.23735593259334564, 'learning_rate': 0.00014455382152861145, 'epoch': 0.28}


 28%|██▊       | 3470/12500 [6:03:16<13:57:20,  5.56s/it]

{'loss': 0.5446, 'grad_norm': 0.2639414370059967, 'learning_rate': 0.00014453781512605043, 'epoch': 0.28}


 28%|██▊       | 3471/12500 [6:03:22<14:35:02,  5.81s/it]

{'loss': 0.748, 'grad_norm': 0.3520357012748718, 'learning_rate': 0.0001445218087234894, 'epoch': 0.28}


 28%|██▊       | 3472/12500 [6:03:26<13:20:51,  5.32s/it]

{'loss': 0.6077, 'grad_norm': 0.3253685235977173, 'learning_rate': 0.00014450580232092838, 'epoch': 0.28}


 28%|██▊       | 3473/12500 [6:03:32<13:21:20,  5.33s/it]

{'loss': 0.7852, 'grad_norm': 0.29655659198760986, 'learning_rate': 0.00014448979591836735, 'epoch': 0.28}


 28%|██▊       | 3474/12500 [6:03:37<13:24:02,  5.34s/it]

{'loss': 0.5977, 'grad_norm': 0.25852102041244507, 'learning_rate': 0.00014447378951580633, 'epoch': 0.28}


 28%|██▊       | 3475/12500 [6:03:46<15:53:47,  6.34s/it]

{'loss': 0.6748, 'grad_norm': 0.21443770825862885, 'learning_rate': 0.00014445778311324533, 'epoch': 0.28}


 28%|██▊       | 3476/12500 [6:03:51<15:21:09,  6.12s/it]

{'loss': 0.9056, 'grad_norm': 0.33506014943122864, 'learning_rate': 0.00014444177671068428, 'epoch': 0.28}


 28%|██▊       | 3477/12500 [6:04:00<17:14:08,  6.88s/it]

{'loss': 0.846, 'grad_norm': 0.2192799597978592, 'learning_rate': 0.00014442577030812325, 'epoch': 0.28}


 28%|██▊       | 3478/12500 [6:04:06<16:34:36,  6.61s/it]

{'loss': 0.7266, 'grad_norm': 0.2941310703754425, 'learning_rate': 0.00014440976390556223, 'epoch': 0.28}


 28%|██▊       | 3479/12500 [6:04:10<14:42:48,  5.87s/it]

{'loss': 0.7364, 'grad_norm': 0.27081912755966187, 'learning_rate': 0.00014439375750300123, 'epoch': 0.28}


 28%|██▊       | 3480/12500 [6:04:15<14:06:44,  5.63s/it]

{'loss': 0.8121, 'grad_norm': 0.3164975047111511, 'learning_rate': 0.00014437775110044018, 'epoch': 0.28}


 28%|██▊       | 3481/12500 [6:04:21<14:18:01,  5.71s/it]

{'loss': 0.6897, 'grad_norm': 0.3120410442352295, 'learning_rate': 0.00014436174469787915, 'epoch': 0.28}


 28%|██▊       | 3482/12500 [6:04:27<14:52:37,  5.94s/it]

{'loss': 0.9725, 'grad_norm': 0.2444154918193817, 'learning_rate': 0.00014434573829531815, 'epoch': 0.28}


 28%|██▊       | 3483/12500 [6:04:33<14:12:50,  5.67s/it]

{'loss': 0.5337, 'grad_norm': 0.3104306161403656, 'learning_rate': 0.00014432973189275713, 'epoch': 0.28}


 28%|██▊       | 3484/12500 [6:04:39<15:00:55,  6.00s/it]

{'loss': 0.5408, 'grad_norm': 0.28643518686294556, 'learning_rate': 0.00014431372549019607, 'epoch': 0.28}


 28%|██▊       | 3485/12500 [6:04:45<14:46:35,  5.90s/it]

{'loss': 0.6553, 'grad_norm': 0.28154972195625305, 'learning_rate': 0.00014429771908763505, 'epoch': 0.28}


 28%|██▊       | 3486/12500 [6:04:52<15:19:08,  6.12s/it]

{'loss': 0.6015, 'grad_norm': 0.31295058131217957, 'learning_rate': 0.00014428171268507405, 'epoch': 0.28}


 28%|██▊       | 3487/12500 [6:04:57<14:33:41,  5.82s/it]

{'loss': 0.8664, 'grad_norm': 0.31506019830703735, 'learning_rate': 0.00014426570628251303, 'epoch': 0.28}


 28%|██▊       | 3488/12500 [6:05:05<16:24:15,  6.55s/it]

{'loss': 0.5887, 'grad_norm': 0.23559163510799408, 'learning_rate': 0.00014424969987995197, 'epoch': 0.28}


 28%|██▊       | 3489/12500 [6:05:13<17:37:13,  7.04s/it]

{'loss': 0.7982, 'grad_norm': 0.2672775983810425, 'learning_rate': 0.00014423369347739098, 'epoch': 0.28}


 28%|██▊       | 3490/12500 [6:05:21<18:23:19,  7.35s/it]

{'loss': 0.568, 'grad_norm': 0.1884870082139969, 'learning_rate': 0.00014421768707482995, 'epoch': 0.28}


 28%|██▊       | 3491/12500 [6:05:28<18:04:23,  7.22s/it]

{'loss': 0.5765, 'grad_norm': 0.24918852746486664, 'learning_rate': 0.00014420168067226893, 'epoch': 0.28}


 28%|██▊       | 3492/12500 [6:05:36<18:44:03,  7.49s/it]

{'loss': 0.6676, 'grad_norm': 0.21349573135375977, 'learning_rate': 0.00014418567426970787, 'epoch': 0.28}


 28%|██▊       | 3493/12500 [6:05:42<17:33:56,  7.02s/it]

{'loss': 0.4012, 'grad_norm': 0.24297697842121124, 'learning_rate': 0.00014416966786714688, 'epoch': 0.28}


 28%|██▊       | 3494/12500 [6:05:48<16:55:54,  6.77s/it]

{'loss': 0.7866, 'grad_norm': 0.2476402074098587, 'learning_rate': 0.00014415366146458585, 'epoch': 0.28}


 28%|██▊       | 3495/12500 [6:05:54<15:55:30,  6.37s/it]

{'loss': 0.4917, 'grad_norm': 0.26608699560165405, 'learning_rate': 0.00014413765506202483, 'epoch': 0.28}


 28%|██▊       | 3496/12500 [6:05:58<14:04:52,  5.63s/it]

{'loss': 0.6491, 'grad_norm': 0.32550618052482605, 'learning_rate': 0.00014412164865946377, 'epoch': 0.28}


 28%|██▊       | 3497/12500 [6:06:03<13:29:27,  5.39s/it]

{'loss': 0.7666, 'grad_norm': 0.26613035798072815, 'learning_rate': 0.00014410564225690278, 'epoch': 0.28}


 28%|██▊       | 3498/12500 [6:06:07<12:47:57,  5.12s/it]

{'loss': 0.7427, 'grad_norm': 0.2612513601779938, 'learning_rate': 0.00014408963585434175, 'epoch': 0.28}


 28%|██▊       | 3499/12500 [6:06:11<12:16:39,  4.91s/it]

{'loss': 0.7927, 'grad_norm': 0.29192808270454407, 'learning_rate': 0.00014407362945178072, 'epoch': 0.28}


 28%|██▊       | 3500/12500 [6:06:17<12:53:05,  5.15s/it]

{'loss': 0.5992, 'grad_norm': 0.2589874863624573, 'learning_rate': 0.0001440576230492197, 'epoch': 0.28}


 28%|██▊       | 3501/12500 [6:06:23<13:02:57,  5.22s/it]

{'loss': 0.5314, 'grad_norm': 0.3677819073200226, 'learning_rate': 0.00014404161664665867, 'epoch': 0.28}


 28%|██▊       | 3502/12500 [6:06:30<14:27:25,  5.78s/it]

{'loss': 0.8218, 'grad_norm': 0.25450968742370605, 'learning_rate': 0.00014402561024409765, 'epoch': 0.28}


 28%|██▊       | 3503/12500 [6:06:36<14:57:21,  5.98s/it]

{'loss': 0.9125, 'grad_norm': 0.27779337763786316, 'learning_rate': 0.00014400960384153662, 'epoch': 0.28}


 28%|██▊       | 3504/12500 [6:06:40<13:38:49,  5.46s/it]

{'loss': 0.454, 'grad_norm': 0.27241986989974976, 'learning_rate': 0.0001439935974389756, 'epoch': 0.28}


 28%|██▊       | 3505/12500 [6:06:48<15:28:06,  6.19s/it]

{'loss': 0.618, 'grad_norm': 0.23081426322460175, 'learning_rate': 0.00014397759103641457, 'epoch': 0.28}


 28%|██▊       | 3506/12500 [6:06:54<14:59:34,  6.00s/it]

{'loss': 0.8428, 'grad_norm': 0.2792145609855652, 'learning_rate': 0.00014396158463385355, 'epoch': 0.28}


 28%|██▊       | 3507/12500 [6:07:01<16:11:42,  6.48s/it]

{'loss': 1.0824, 'grad_norm': 0.29833605885505676, 'learning_rate': 0.00014394557823129252, 'epoch': 0.28}


 28%|██▊       | 3508/12500 [6:07:09<17:07:54,  6.86s/it]

{'loss': 0.3641, 'grad_norm': 0.17836172878742218, 'learning_rate': 0.0001439295718287315, 'epoch': 0.28}


 28%|██▊       | 3509/12500 [6:07:14<15:57:59,  6.39s/it]

{'loss': 0.7953, 'grad_norm': 0.28282302618026733, 'learning_rate': 0.00014391356542617047, 'epoch': 0.28}


 28%|██▊       | 3510/12500 [6:07:23<17:29:11,  7.00s/it]

{'loss': 0.8569, 'grad_norm': 0.21943749487400055, 'learning_rate': 0.00014389755902360945, 'epoch': 0.28}


 28%|██▊       | 3511/12500 [6:07:28<16:14:32,  6.50s/it]

{'loss': 0.6019, 'grad_norm': 0.2838142514228821, 'learning_rate': 0.00014388155262104842, 'epoch': 0.28}


 28%|██▊       | 3512/12500 [6:07:36<16:57:44,  6.79s/it]

{'loss': 0.8239, 'grad_norm': 0.22979505360126495, 'learning_rate': 0.0001438655462184874, 'epoch': 0.28}


 28%|██▊       | 3513/12500 [6:07:40<15:06:27,  6.05s/it]

{'loss': 0.7716, 'grad_norm': 0.31927329301834106, 'learning_rate': 0.00014384953981592637, 'epoch': 0.28}


 28%|██▊       | 3514/12500 [6:07:45<14:13:24,  5.70s/it]

{'loss': 0.7526, 'grad_norm': 0.26115721464157104, 'learning_rate': 0.00014383353341336537, 'epoch': 0.28}


 28%|██▊       | 3515/12500 [6:07:50<13:39:26,  5.47s/it]

{'loss': 1.0078, 'grad_norm': 0.3605046272277832, 'learning_rate': 0.00014381752701080432, 'epoch': 0.28}


 28%|██▊       | 3516/12500 [6:07:55<13:17:55,  5.33s/it]

{'loss': 0.9345, 'grad_norm': 0.30698418617248535, 'learning_rate': 0.0001438015206082433, 'epoch': 0.28}


 28%|██▊       | 3517/12500 [6:08:04<16:31:01,  6.62s/it]

{'loss': 0.6867, 'grad_norm': 0.2001844048500061, 'learning_rate': 0.00014378551420568227, 'epoch': 0.28}


 28%|██▊       | 3518/12500 [6:08:12<17:13:00,  6.90s/it]

{'loss': 0.9654, 'grad_norm': 0.271634578704834, 'learning_rate': 0.00014376950780312127, 'epoch': 0.28}


 28%|██▊       | 3519/12500 [6:08:18<16:28:34,  6.60s/it]

{'loss': 0.8441, 'grad_norm': 0.2798493206501007, 'learning_rate': 0.00014375350140056022, 'epoch': 0.28}


 28%|██▊       | 3520/12500 [6:08:22<14:55:02,  5.98s/it]

{'loss': 0.5948, 'grad_norm': 0.2586544156074524, 'learning_rate': 0.0001437374949979992, 'epoch': 0.28}


 28%|██▊       | 3521/12500 [6:08:27<13:32:58,  5.43s/it]

{'loss': 0.8705, 'grad_norm': 0.41087064146995544, 'learning_rate': 0.0001437214885954382, 'epoch': 0.28}


 28%|██▊       | 3522/12500 [6:08:34<15:19:20,  6.14s/it]

{'loss': 0.5211, 'grad_norm': 0.23221810162067413, 'learning_rate': 0.00014370548219287717, 'epoch': 0.28}


 28%|██▊       | 3523/12500 [6:08:40<15:15:08,  6.12s/it]

{'loss': 0.4865, 'grad_norm': 0.2600512206554413, 'learning_rate': 0.00014368947579031612, 'epoch': 0.28}


 28%|██▊       | 3524/12500 [6:08:44<13:25:21,  5.38s/it]

{'loss': 0.6921, 'grad_norm': 0.32576048374176025, 'learning_rate': 0.0001436734693877551, 'epoch': 0.28}


 28%|██▊       | 3525/12500 [6:08:49<13:04:26,  5.24s/it]

{'loss': 0.6558, 'grad_norm': 0.31989437341690063, 'learning_rate': 0.0001436574629851941, 'epoch': 0.28}


 28%|██▊       | 3526/12500 [6:08:55<13:42:19,  5.50s/it]

{'loss': 0.6577, 'grad_norm': 0.25746291875839233, 'learning_rate': 0.00014364145658263307, 'epoch': 0.28}


 28%|██▊       | 3527/12500 [6:09:02<15:00:37,  6.02s/it]

{'loss': 0.5822, 'grad_norm': 0.21907687187194824, 'learning_rate': 0.00014362545018007202, 'epoch': 0.28}


 28%|██▊       | 3528/12500 [6:09:07<13:56:18,  5.59s/it]

{'loss': 0.7592, 'grad_norm': 0.29999807476997375, 'learning_rate': 0.00014360944377751102, 'epoch': 0.28}


 28%|██▊       | 3529/12500 [6:09:11<12:57:10,  5.20s/it]

{'loss': 0.5933, 'grad_norm': 0.32113006711006165, 'learning_rate': 0.00014359343737495, 'epoch': 0.28}


 28%|██▊       | 3530/12500 [6:09:20<15:35:43,  6.26s/it]

{'loss': 1.0336, 'grad_norm': 0.2670381963253021, 'learning_rate': 0.00014357743097238897, 'epoch': 0.28}


 28%|██▊       | 3531/12500 [6:09:26<15:39:25,  6.28s/it]

{'loss': 0.9253, 'grad_norm': 0.27535977959632874, 'learning_rate': 0.00014356142456982792, 'epoch': 0.28}


 28%|██▊       | 3532/12500 [6:09:34<16:47:54,  6.74s/it]

{'loss': 1.0726, 'grad_norm': 0.250835120677948, 'learning_rate': 0.00014354541816726692, 'epoch': 0.28}


 28%|██▊       | 3533/12500 [6:09:39<15:40:59,  6.30s/it]

{'loss': 0.8963, 'grad_norm': 0.29207852482795715, 'learning_rate': 0.0001435294117647059, 'epoch': 0.28}


 28%|██▊       | 3534/12500 [6:09:46<16:12:45,  6.51s/it]

{'loss': 0.5745, 'grad_norm': 0.21344459056854248, 'learning_rate': 0.00014351340536214487, 'epoch': 0.28}


 28%|██▊       | 3535/12500 [6:09:53<16:07:53,  6.48s/it]

{'loss': 0.6906, 'grad_norm': 0.29621440172195435, 'learning_rate': 0.00014349739895958385, 'epoch': 0.28}


 28%|██▊       | 3536/12500 [6:09:58<15:07:03,  6.07s/it]

{'loss': 0.9604, 'grad_norm': 0.28532159328460693, 'learning_rate': 0.00014348139255702282, 'epoch': 0.28}


 28%|██▊       | 3537/12500 [6:10:04<14:59:00,  6.02s/it]

{'loss': 0.6855, 'grad_norm': 0.2916940152645111, 'learning_rate': 0.0001434653861544618, 'epoch': 0.28}


 28%|██▊       | 3538/12500 [6:10:09<14:20:45,  5.76s/it]

{'loss': 0.6359, 'grad_norm': 0.26789185404777527, 'learning_rate': 0.00014344937975190077, 'epoch': 0.28}


 28%|██▊       | 3539/12500 [6:10:14<13:40:04,  5.49s/it]

{'loss': 0.6864, 'grad_norm': 0.3389197885990143, 'learning_rate': 0.00014343337334933975, 'epoch': 0.28}


 28%|██▊       | 3540/12500 [6:10:19<13:44:56,  5.52s/it]

{'loss': 0.593, 'grad_norm': 0.308145135641098, 'learning_rate': 0.00014341736694677872, 'epoch': 0.28}


 28%|██▊       | 3541/12500 [6:10:28<15:51:14,  6.37s/it]

{'loss': 0.7275, 'grad_norm': 0.2415716052055359, 'learning_rate': 0.0001434013605442177, 'epoch': 0.28}


 28%|██▊       | 3542/12500 [6:10:36<17:23:22,  6.99s/it]

{'loss': 0.5322, 'grad_norm': 0.21967846155166626, 'learning_rate': 0.00014338535414165667, 'epoch': 0.28}


 28%|██▊       | 3543/12500 [6:10:42<16:41:58,  6.71s/it]

{'loss': 0.8471, 'grad_norm': 0.2847861051559448, 'learning_rate': 0.00014336934773909565, 'epoch': 0.28}


 28%|██▊       | 3544/12500 [6:10:46<14:45:59,  5.94s/it]

{'loss': 0.8659, 'grad_norm': 0.28204792737960815, 'learning_rate': 0.00014335334133653462, 'epoch': 0.28}


 28%|██▊       | 3545/12500 [6:10:52<14:25:28,  5.80s/it]

{'loss': 0.6276, 'grad_norm': 0.266030490398407, 'learning_rate': 0.0001433373349339736, 'epoch': 0.28}


 28%|██▊       | 3546/12500 [6:11:00<16:12:20,  6.52s/it]

{'loss': 0.5542, 'grad_norm': 0.23969413340091705, 'learning_rate': 0.00014332132853141257, 'epoch': 0.28}


 28%|██▊       | 3547/12500 [6:11:05<14:51:57,  5.98s/it]

{'loss': 0.8964, 'grad_norm': 0.33481594920158386, 'learning_rate': 0.00014330532212885154, 'epoch': 0.28}


 28%|██▊       | 3548/12500 [6:11:12<15:44:40,  6.33s/it]

{'loss': 0.4318, 'grad_norm': 0.22190900146961212, 'learning_rate': 0.00014328931572629052, 'epoch': 0.28}


 28%|██▊       | 3549/12500 [6:11:20<16:46:02,  6.74s/it]

{'loss': 0.3797, 'grad_norm': 0.20172835886478424, 'learning_rate': 0.0001432733093237295, 'epoch': 0.28}


 28%|██▊       | 3550/12500 [6:11:27<17:05:13,  6.87s/it]

{'loss': 0.6345, 'grad_norm': 0.2391006499528885, 'learning_rate': 0.00014325730292116847, 'epoch': 0.28}


 28%|██▊       | 3551/12500 [6:11:33<16:22:51,  6.59s/it]

{'loss': 0.7566, 'grad_norm': 0.3072560131549835, 'learning_rate': 0.00014324129651860744, 'epoch': 0.28}


 28%|██▊       | 3552/12500 [6:11:44<19:34:12,  7.87s/it]

{'loss': 0.6911, 'grad_norm': 0.19127438962459564, 'learning_rate': 0.00014322529011604642, 'epoch': 0.28}


 28%|██▊       | 3553/12500 [6:11:50<18:25:58,  7.42s/it]

{'loss': 0.7576, 'grad_norm': 0.2581687569618225, 'learning_rate': 0.00014320928371348542, 'epoch': 0.28}


 28%|██▊       | 3554/12500 [6:11:56<17:17:44,  6.96s/it]

{'loss': 0.7222, 'grad_norm': 0.2929977476596832, 'learning_rate': 0.00014319327731092437, 'epoch': 0.28}


 28%|██▊       | 3555/12500 [6:12:04<18:16:41,  7.36s/it]

{'loss': 0.6432, 'grad_norm': 0.2270488739013672, 'learning_rate': 0.00014317727090836334, 'epoch': 0.28}


 28%|██▊       | 3556/12500 [6:12:08<15:51:58,  6.39s/it]

{'loss': 0.4208, 'grad_norm': 0.2965879440307617, 'learning_rate': 0.00014316126450580232, 'epoch': 0.28}


 28%|██▊       | 3557/12500 [6:12:14<15:32:15,  6.25s/it]

{'loss': 1.052, 'grad_norm': 0.28959837555885315, 'learning_rate': 0.00014314525810324132, 'epoch': 0.28}


KeyboardInterrupt: 
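
The training output above interleaves progress bars with one metrics dictionary per step. If you want to inspect the run after interrupting it, a minimal sketch like the one below can pull those dictionaries out of the captured text. The helper name `parse_training_log` and the `sample` string are illustrative assumptions, not part of Unsloth or TRL; the three metric lines are copied from the last steps of the log above.

```python
import ast

def parse_training_log(text):
    """Hypothetical helper: collect the per-step metric dicts from captured trainer output."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Metric lines are printed as Python dict literals; progress bars are skipped.
        if line.startswith("{'loss'"):
            records.append(ast.literal_eval(line))
    return records

# Sample copied from the final steps of the log above.
sample = """
 28%|██▊       | 3555/12500 [6:12:04<18:16:41,  7.36s/it]

{'loss': 0.6432, 'grad_norm': 0.2270488739013672, 'learning_rate': 0.00014317727090836334, 'epoch': 0.28}

 28%|██▊       | 3556/12500 [6:12:08<15:51:58,  6.39s/it]

{'loss': 0.4208, 'grad_norm': 0.2965879440307617, 'learning_rate': 0.00014316126450580232, 'epoch': 0.28}

 28%|██▊       | 3557/12500 [6:12:14<15:32:15,  6.25s/it]

{'loss': 1.052, 'grad_norm': 0.28959837555885315, 'learning_rate': 0.00014314525810324132, 'epoch': 0.28}
"""

records = parse_training_log(sample)
mean_loss = sum(r["loss"] for r in records) / len(records)
# Difference between consecutive learning rates; positive values mean the rate is decaying.
lr_drops = [a["learning_rate"] - b["learning_rate"] for a, b in zip(records, records[1:])]
print(f"{len(records)} steps parsed, mean loss = {mean_loss:.4f}")
```

Per-step loss is noisy at this batch size, so averaging over a window of steps is usually more informative than reading individual values.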

In [31]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

5.4483 seconds used for training.
0.09 minutes used for training.
Peak reserved memory = 3.658 GB.
Peak reserved memory for training = 0.039 GB.
Peak reserved memory % of max memory = 47.267 %.
Peak reserved memory for training % of max memory = 0.504 %.


<a name="Inference"></a>
### Inference
Let's run the model! Change the user message in `messages` to try your own prompt; the model generates the assistant reply.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
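
To make the two sampling settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering. It assumes the common formulation in which a token is kept only if its probability is at least `min_p` times the probability of the most likely token. The function name `min_p_filter` and the example logits are made up for illustration; this is not the code path that `model.generate` runs.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Illustrative sketch: temperature-scaled softmax, then min-p truncation."""
    # A higher temperature flattens the distribution before filtering.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens whose probability is at least min_p times the top probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(kept)
    return [p / norm for p in kept]

# Hypothetical logits for a four-token vocabulary.
filtered = min_p_filter([5.0, 4.0, 0.0, -3.0])
print(filtered)
```

In this example the two unlikely tokens are removed even though the high temperature raised their probability, which is the intuition behind pairing a high temperature with a min-p cutoff.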

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers. The sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,']
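
The decoded output above shows how the `llama-3.1` chat template wraps each message in header and end-of-turn tokens. The sketch below rebuilds that prompt layout by hand, purely to make the structure visible. `render_prompt` and `DEFAULT_SYSTEM` are hypothetical names, and the system text is copied from the output above; in practice `tokenizer.apply_chat_template` does this for you and should be preferred.

```python
# System block as it appears in the decoded output above (assumed fixed for this illustration).
DEFAULT_SYSTEM = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"

def render_prompt(messages, system=DEFAULT_SYSTEM, add_generation_prompt=True):
    """Hypothetical helper that mimics the prompt layout visible in the decoded output."""
    parts = ["<|begin_of_text|>"]
    parts.append("<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>")
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that it should write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
prompt = render_prompt(messages)
print(prompt)
```

This is why `add_generation_prompt = True` is required for generation: without the trailing assistant header, the prompt ends on the user's turn.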

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

1, 13<|eot_id|>
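
For intuition about the streaming cell above: a streamer is an object that receives new tokens through `put(...)` as they are generated and is told when generation finishes through `end()`. The class below is a hypothetical stand-in that follows the same two-method shape using plain strings instead of token tensors. It is a sketch for illustration only, not the `TextStreamer` implementation.

```python
class ListStreamer:
    """Hypothetical stand-in with the same put/end shape as a text streamer."""

    def __init__(self, skip_prompt=True):
        self.skip_prompt = skip_prompt
        self.next_is_prompt = True  # the first call to put() carries the prompt
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Skip the prompt once, mirroring skip_prompt = True in the cell above.
        if self.skip_prompt and self.next_is_prompt:
            self.next_is_prompt = False
            return
        self.next_is_prompt = False
        self.chunks.append(value)
        print(value, end="", flush=True)  # show each piece as soon as it arrives

    def end(self):
        self.finished = True
        print()  # final newline once generation is done

streamer = ListStreamer(skip_prompt=True)
streamer.put("Continue the sequence: 1, 1, 2, 3, 5, 8,")  # prompt, not echoed
for piece in ["13", ",", " 21"]:  # made-up generated pieces
    streamer.put(piece)
streamer.end()
```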


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

# Never hard-code or commit a real token. Get one at https://huggingface.co/settings/tokens
model.push_to_hub("ID2223JR/lora_model", token = "") # Online saving
tokenizer.push_to_hub("ID2223JR/lora_model", token = "") # Online saving

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]
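A token pasted into a notebook cell is easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. The sketch below assumes the variable is called `HF_TOKEN`. `get_hf_token` is a hypothetical helper, and the `environ` parameter exists only so the function can be exercised without touching the real environment.

```python
import os

def get_hf_token(env_var="HF_TOKEN", environ=None):
    """Read a Hugging Face token from the environment instead of the notebook source."""
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing to the Hub.")
    return token

# Demonstration with a fake mapping, so no real secret is needed to run this cell.
demo_token = get_hf_token(environ={"HF_TOKEN": "hf_placeholder_not_a_real_token"})
print("Token loaded, length:", len(demo_token))
```

It could then be used as `model.push_to_hub("your_name/lora_model", token = get_hf_token())`, where `your_name` is a placeholder for your own account.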

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
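The choice between a 16-bit and a 4-bit merge mostly comes down to file size. As a rough back-of-the-envelope sketch, the weight storage is about `parameters * bits / 8` bytes. The parameter count below is a made-up example, and real 4-bit files are somewhat larger because they also store quantization scales and other metadata.

```python
def estimate_weight_bytes(num_params: int, bits_per_weight: float) -> int:
    """Approximate bytes needed to store the weights alone (no metadata or overhead)."""
    return int(num_params * bits_per_weight / 8)

# Hypothetical parameter count, chosen only for illustration.
example_params = 8_000_000_000

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    size_gb = estimate_weight_bytes(example_params, bits) / 1e9
    print(f"{label}: ~{size_gb:.1f} GB")
```

By the same arithmetic, LoRA adapters alone are far smaller than either merged form, because only the adapter matrices are stored.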

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
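To make the quant names above a little more concrete, here is a simplified pure-Python sketch of the idea behind `q8_0`: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is an illustration based on my understanding of the format, not llama.cpp's code. The function names are hypothetical, and the real format stores the scale in 16-bit floating point, which adds a little extra rounding error that this sketch ignores.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values.

def quantize_block_q8_0(values):
    """Quantise one block to (scale, list of ints in [-127, 127])."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0  # The largest magnitude maps to +/-127.
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE  # All-zero block.
    quants = [round(v / scale) for v in values]
    return scale, quants

def dequantize_block_q8_0(scale, quants):
    """Reconstruct approximate float values from a quantised block."""
    return [scale * q for q in quants]

# Toy block of 32 small weights.
block = [((i % 7) - 3) * 0.01 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block_q8_0(block)
restored = dequantize_block_q8_0(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

Under this layout each block takes 32 bytes of integers plus a 2-byte scale, so roughly 8.5 bits per weight. The `q4_k_m` and `q5_k_m` methods use more elaborate block structures with fewer bits per weight.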

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("ID2223JR/gguf_model_q8", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("ID2223JR/gguf_model_q8", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("ID2223JR/gguf_model_q4", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("ID2223JR/gguf_model_q4", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "ID2223/gguf_model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
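Before copying a `.gguf` file to another machine, a quick sanity check can catch a truncated or mislabeled download. A GGUF file begins with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below reads just those eight bytes, assuming the usual little-endian layout. `read_gguf_version` is a hypothetical helper, and the demo writes a tiny fake header rather than a real model.

```python
import os
import struct
import tempfile

def read_gguf_version(path):
    """Return the GGUF version number, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # Little-endian uint32.
    return version

# Demo on a fake 8-byte header, so no real model file is required.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gguf")
    with open(demo_path, "wb") as f:
        f.write(b"GGUF" + struct.pack("<I", 3))
    print(read_gguf_version(demo_path))  # 3
```

On a real export you would call it as `read_gguf_version("model-unsloth-Q4_K_M.gguf")`, adjusting the path to wherever the file was saved.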

And we're done! If you have any questions about Unsloth, find a bug, need help, want to join projects, or want to keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>