To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for Llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`
3. Trains on Completions / Assistant only via `train_on_responses_only`
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [3]:
#%%capture
#!pip install unsloth
# Also get the latest nightly Unsloth!
#!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
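To get a feel for why 4bit loading matters on a small GPU, here is a rough back-of-the-envelope sketch. It only counts the raw weight storage (parameters × bits per parameter) and ignores quantization constants, activations, LoRA weights and optimizer state, so treat the numbers as a lower bound. The helper name `weight_memory_gib` and the 3-billion-parameter figure are illustrative assumptions, not part of Unsloth.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Hypothetical 3-billion-parameter model (an assumption for illustration).
num_params = 3e9
for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(num_params, bits):.2f} GiB")
```

Going from 16bit to 4bit weights cuts this part of the footprint by 4x, which is what makes larger models fit on cards with 8 to 16 GB of memory.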

In [24]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ID2223JR/recipe_model", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4060. Max memory: 7.739 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [25]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.
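As a rough sanity check on how few parameters LoRA trains, the sketch below counts the adapter weights for the seven target modules above. Each adapted linear layer of shape `d_in -> d_out` gains two small matrices with `r * (d_in + d_out)` weights in total. The layer dimensions are an assumption (Llama-3.2-3B-style: hidden size 3072, MLP size 8192, key/value width 1024, 28 layers); the notebook does not state them, but with `r = 16` they reproduce the "Number of trainable parameters" figure printed in the training banner further down.

```python
def lora_trainable_params(shapes, r, num_layers):
    """Count LoRA weights: r * (d_in + d_out) per adapted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed Llama-3.2-3B-style dimensions (not stated in this notebook).
hidden, intermediate, kv_dim, num_layers = 3072, 8192, 1024, 28

# (d_in, d_out) for each module listed in target_modules.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

total = lora_trainable_params(shapes, r=16, num_layers=num_layers)
print(f"{total:,} trainable LoRA parameters")  # 24,313,856
```

Doubling `r` doubles this count, so the rank is the main knob for trading adapter capacity against memory.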


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. The stock version of this notebook uses [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style; the cells below instead load the `mbien/recipe_nlg` recipe dataset and build the conversations directly. Either way, the data ends up in HuggingFace's normal multi-turn format `("role", "content")` instead of ShareGPT's `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.
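To make the format above concrete, here is a minimal hand-rolled renderer. It is only an illustration of the token layout, not the template that `get_chat_template` installs: the real `llama-3.1` template is a Jinja template on the tokenizer and also injects the date preamble into the system message. The function name `render_llama3` is hypothetical.

```python
def render_llama3(messages):
    """Render role/content messages in the Llama-3 header/eot layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text


messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(messages))
```

In the actual pipeline you should keep using `tokenizer.apply_chat_template`, so that training and inference share exactly the same template.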

In [26]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("mbien/recipe_nlg", split = "train",data_dir="recipe/data/dataset/", trust_remote_code=True)

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
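The mapping shown above can be sketched in a few lines of plain Python. This is only an illustration of the renaming, not the implementation of `standardize_sharegpt`; the helper `sharegpt_to_hf` and the `ROLE_MAP` table are hypothetical names.

```python
# ShareGPT speaker names mapped to HuggingFace-style roles (illustrative).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_hf(turns):
    """Rename "from"/"value" keys to "role"/"content" and normalise roles."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]


sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for turn in sharegpt_to_hf(sharegpt_turns):
    print(turn)
```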

In [27]:
def convert_to_conversation(example):
    # Build one system / user / assistant conversation from a recipe row.
    system_message = {"role": "system", "content": "You are an assistant"}
    # Note: replacing every "," also splits ingredients that contain commas
    # (e.g. "cheese, divided") and leaves a leading space after each "\n".
    ingredients_str = ", ".join(example['ingredients']).strip("'").replace(",", "\n")
    user_message = {"role": "user", "content": f"Compose a meal containing the following ingredients: {ingredients_str}"}
    # One direction per line.
    directions_str = "\n".join(example['directions']).strip("'")
    assistant_message = {"role": "assistant", "content": f"Here is a meal that contains {ingredients_str}!\n\n {example['title']}\n\n {directions_str}\n\n Enjoy your meal!"}
    return {"conversations": [system_message, user_message, assistant_message]}

dataset = dataset.map(convert_to_conversation, remove_columns=dataset.column_names)
print(dataset[0])

{'conversations': [{'content': 'You are an assistant', 'role': 'system'}, {'content': 'Compose a meal containing the following ingredients: 1 c. firmly packed brown sugar\n 1/2 c. evaporated milk\n 1/2 tsp. vanilla\n 1/2 c. broken nuts (pecans)\n 2 Tbsp. butter or margarine\n 3 1/2 c. bite size shredded rice biscuits', 'role': 'user'}, {'content': 'Here is a meal that contains 1 c. firmly packed brown sugar\n 1/2 c. evaporated milk\n 1/2 tsp. vanilla\n 1/2 c. broken nuts (pecans)\n 2 Tbsp. butter or margarine\n 3 1/2 c. bite size shredded rice biscuits!\n\n No-Bake Nut Cookies\n\n In a heavy 2-quart saucepan, mix brown sugar, nuts, evaporated milk and butter or margarine.\nStir over medium heat until mixture bubbles all over top.\nBoil and stir 5 minutes more. Take off heat.\nStir in vanilla and cereal; mix well.\nUsing 2 teaspoons, drop and shape into 30 clusters on wax paper.\nLet stand until firm, about 30 minutes.\n\n Enjoy your meal!', 'role': 'assistant'}]}
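The output above (and item 5 further down, where `divided` and `cooked` land on their own lines) shows a side effect of joining with `", "` and then replacing every comma with a newline. The small sketch below reproduces that behaviour on a made-up two-item list and compares it with joining on `"\n"` directly. The function names are hypothetical, and the alternative is only a suggestion; the outputs in this notebook were produced with the first variant.

```python
def join_as_in_notebook(ingredients):
    """Same string handling as convert_to_conversation uses."""
    return ", ".join(ingredients).strip("'").replace(",", "\n")


def join_one_per_line(ingredients):
    """Alternative: one ingredient per line, commas inside items preserved."""
    return "\n".join(ingredients)


sample = ["shredded Cheddar cheese, divided", "4 green onion, chopped"]
print(repr(join_as_in_notebook(sample)))
print(repr(join_one_per_line(sample)))
```

The first variant breaks "cheese, divided" into two lines and leaves a space after every newline; the second keeps each ingredient intact.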


In [28]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

We look at how the conversations are structured for item 5:

In [29]:
dataset[5]["conversations"]

[{'content': 'You are an assistant', 'role': 'system'},
 {'content': 'Compose a meal containing the following ingredients: 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)',
  'role': 'user'},
 {'content': 'Here is a meal that contains 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)!\n\n Cheeseburger Potato Soup\n\n Wash potatoes; prick several times with a fork.\nMicrowave them with a wet paper towel covering the potatoes on high for 6-8 minutes.\nThe potatoes should be soft,

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` to the system prompt, so do not be alarmed!

In [30]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCompose a meal containing the following ingredients: 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere is a meal that contains 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)!\n\n Cheeseburger Potato Soup\n\n Wash 

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The cell below caps training at `max_steps = 2700` to speed things up, which overrides `num_train_epochs`. For a full pass over the data, keep `num_train_epochs = 1` and remove the `max_steps` setting (or set it to `-1`, the `TrainingArguments` default). We also support TRL's `DPOTrainer`!
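To see how far a step cap is from a full epoch, the sketch below works through the arithmetic for the settings used in the training cell (`per_device_train_batch_size = 2`, `gradient_accumulation_steps = 4`, `max_steps = 2700`) and the 2,231,142 examples reported by the `Map` progress bar. The helper `training_budget` is a hypothetical name, and rounding the steps per epoch up is an approximation of what the trainer does.

```python
import math


def training_budget(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, steps per epoch, fraction of data seen)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    fraction_seen = effective_batch * max_steps / num_examples
    return effective_batch, steps_per_epoch, fraction_seen


effective, per_epoch, fraction = training_budget(
    num_examples=2_231_142, per_device_batch=2, grad_accum=4, max_steps=2700
)
print(f"Effective batch size: {effective}")
print(f"Optimizer steps for one epoch: {per_epoch:,}")
print(f"Share of the dataset seen in 2700 steps: {fraction:.2%}")
```

With these numbers a full epoch needs roughly 279,000 optimizer steps, so 2700 steps cover only about 1% of the dataset.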

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        max_steps = 2700,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "recipe_outputs",
        report_to = "none", # Use this for WandB etc
        save_steps=100,
        save_total_limit=1,
    ),
)

Map (num_proc=2): 100%|██████████| 2231142/2231142 [15:14<00:00, 2439.94 examples/s]
max_steps is given, it will override any value given in num_train_epochs
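The `lr_scheduler_type = "linear"` and `warmup_steps = 5` settings mean the learning rate ramps up linearly for 5 steps and then decays linearly to zero at the last step. The sketch below is a plain-Python version of that shape, not the scheduler object the trainer builds. It assumes `total_steps = 10_000`, taken from the "Total steps = 10,000" line in the training banner further down rather than from `max_steps = 2700` in the cell above, because that is the value the logged learning rates are consistent with.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=10_000):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5, 101, 102, 10_000):
    print(step, linear_lr(step))
```

Under these assumptions, steps 101 and 102 give approximately `1.98079e-4` and `1.98059e-4`, in line with the first two `learning_rate` entries in the training log.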


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map: 100%|██████████| 2231142/2231142 [09:04<00:00, 4097.25 examples/s]


We verify masking is actually done:

In [33]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCompose a meal containing the following ingredients: 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere is a meal that contains 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)!\n\n Cheeseburger Potato Soup\n\n Wash 

In [34]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                  \n\nHere is a meal that contains 6 baking potatoes\n 1 lb. of extra lean ground beef\n 2/3 c. butter or margarine\n 6 c. milk\n 3/4 tsp. salt\n 1/2 tsp. pepper\n 1 1/2 c (6 oz.) shredded Cheddar cheese\n divided\n 12 sliced bacon\n cooked\n crumbled and divided\n 4 green onion\n chopped and divided\n 1 (8 oz.) carton sour cream (optional)!\n\n Cheeseburger Potato Soup\n\n Wash potatoes; prick several times with a fork.\nMicrowave them with a wet paper towel covering the potatoes on high for 6-8 minutes.\nThe potatoes should be soft, ready to eat.\nLet them cool enough to handle.\nCut in half lengthwise; scoop out pulp and reserve.\nDiscard shells.\nBrown ground beef until done.\nDrain any grease from the meat.\nSet aside when done.\nMeat will be added later.\nMelt butter in a large kettle over low heat; add flour, stirring until smooth.\nCoo

We can see the System and Instruction prompts are successfully masked!
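The idea behind this masking can be sketched on plain lists of token ids: every position gets the label `-100` (which the loss ignores) unless it sits between a response marker and the next instruction marker. This is a simplified illustration under that assumption, not Unsloth's implementation of `train_on_responses_only`; the function names and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_subsequence(sequence, pattern, start):
    """Return the first index >= start where pattern occurs, or -1."""
    for i in range(start, len(sequence) - len(pattern) + 1):
        if sequence[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    position = 0
    while True:
        marker = find_subsequence(input_ids, response_ids, position)
        if marker == -1:
            break
        begin = marker + len(response_ids)
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        position = end
    return labels


# Toy ids: [1, 2] stands for the user header, [3, 4] for the assistant header.
toy_ids = [9, 1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
print(mask_non_responses(toy_ids, instruction_ids=[1, 2], response_ids=[3, 4]))
```

Only the assistant tokens (`20, 21` and `22` in the toy example) keep their ids as labels; the system and user tokens, as well as the headers themselves, are masked.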

In [35]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4060. Max memory = 7.739 GB.
5.51 GB of memory reserved.


In [36]:
import os

if len(os.listdir("recipe_outputs")) == 0:
    trainer_stats = trainer.train()
else:
    print("Resuming from checkpoint!")
    trainer_stats = trainer.train(resume_from_checkpoint=True)

Resuming from checkpoint!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,231,142 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 10,000
 "-____-"     Number of trainable parameters = 24,313,856
  1%|          | 101/10000 [00:03<06:04, 27.19it/s]

{'loss': 0.7778, 'grad_norm': 0.38028180599212646, 'learning_rate': 0.00019807903951975987, 'epoch': 0.0}


  1%|          | 102/10000 [00:07<06:04, 27.19it/s]

{'loss': 1.107, 'grad_norm': 0.4123120605945587, 'learning_rate': 0.0001980590295147574, 'epoch': 0.0}


  1%|          | 103/10000 [00:11<06:04, 27.19it/s]

{'loss': 0.953, 'grad_norm': 0.3568883240222931, 'learning_rate': 0.0001980390195097549, 'epoch': 0.0}


  1%|          | 104/10000 [00:15<32:52,  5.02it/s]

{'loss': 1.1506, 'grad_norm': 0.3587881922721863, 'learning_rate': 0.0001980190095047524, 'epoch': 0.0}


  1%|          | 105/10000 [00:19<44:44,  3.69it/s]

{'loss': 0.8945, 'grad_norm': 0.3284980058670044, 'learning_rate': 0.0001979989994997499, 'epoch': 0.0}


  1%|          | 106/10000 [00:23<44:44,  3.69it/s]

{'loss': 1.4919, 'grad_norm': 0.37529560923576355, 'learning_rate': 0.0001979789894947474, 'epoch': 0.0}


  1%|          | 107/10000 [00:26<1:11:29,  2.31it/s]

{'loss': 0.9271, 'grad_norm': 0.4431600570678711, 'learning_rate': 0.00019795897948974488, 'epoch': 0.0}


  1%|          | 108/10000 [00:30<1:29:31,  1.84it/s]

{'loss': 0.9947, 'grad_norm': 0.3958789110183716, 'learning_rate': 0.00019793896948474236, 'epoch': 0.0}


  1%|          | 109/10000 [00:33<1:50:14,  1.50it/s]

{'loss': 0.8211, 'grad_norm': 0.3852190673351288, 'learning_rate': 0.00019791895947973988, 'epoch': 0.0}


  1%|          | 110/10000 [00:37<2:22:52,  1.15it/s]

{'loss': 1.1618, 'grad_norm': 0.4342193305492401, 'learning_rate': 0.00019789894947473737, 'epoch': 0.0}


  1%|          | 111/10000 [00:39<2:51:37,  1.04s/it]

{'loss': 0.9973, 'grad_norm': 0.4144723415374756, 'learning_rate': 0.00019787893946973488, 'epoch': 0.0}


  1%|          | 112/10000 [00:43<3:29:42,  1.27s/it]

{'loss': 0.8945, 'grad_norm': 0.4213809669017792, 'learning_rate': 0.00019785892946473237, 'epoch': 0.0}


  1%|          | 113/10000 [00:46<4:17:26,  1.56s/it]

{'loss': 1.3831, 'grad_norm': 0.4316355288028717, 'learning_rate': 0.00019783891945972988, 'epoch': 0.0}


  1%|          | 114/10000 [00:50<5:20:44,  1.95s/it]

{'loss': 0.9974, 'grad_norm': 0.3520999848842621, 'learning_rate': 0.00019781890945472737, 'epoch': 0.0}


  1%|          | 115/10000 [00:53<5:42:05,  2.08s/it]

{'loss': 0.8195, 'grad_norm': 0.36325693130493164, 'learning_rate': 0.00019779889944972488, 'epoch': 0.0}


  1%|          | 116/10000 [00:56<6:06:25,  2.22s/it]

{'loss': 0.9671, 'grad_norm': 0.38229864835739136, 'learning_rate': 0.00019777888944472237, 'epoch': 0.0}


  1%|          | 117/10000 [00:58<6:29:51,  2.37s/it]

{'loss': 0.8467, 'grad_norm': 0.41430044174194336, 'learning_rate': 0.00019775887943971986, 'epoch': 0.0}


  1%|          | 118/10000 [01:01<6:50:26,  2.49s/it]

{'loss': 0.9248, 'grad_norm': 0.4582471549510956, 'learning_rate': 0.00019773886943471737, 'epoch': 0.0}


  1%|          | 119/10000 [01:05<7:26:40,  2.71s/it]

{'loss': 1.0928, 'grad_norm': 0.37337633967399597, 'learning_rate': 0.00019771885942971486, 'epoch': 0.0}


  1%|          | 120/10000 [01:08<7:52:38,  2.87s/it]

{'loss': 0.9879, 'grad_norm': 0.37454238533973694, 'learning_rate': 0.00019769884942471237, 'epoch': 0.0}


  1%|          | 121/10000 [01:11<7:48:30,  2.85s/it]

{'loss': 0.892, 'grad_norm': 0.4249427318572998, 'learning_rate': 0.00019767883941970986, 'epoch': 0.0}


  1%|          | 122/10000 [01:14<8:29:14,  3.09s/it]

{'loss': 1.3076, 'grad_norm': 0.3725999593734741, 'learning_rate': 0.00019765882941470738, 'epoch': 0.0}


  1%|          | 123/10000 [01:17<8:07:08,  2.96s/it]

{'loss': 0.9827, 'grad_norm': 0.48073533177375793, 'learning_rate': 0.00019763881940970486, 'epoch': 0.0}


  1%|          | 124/10000 [01:20<8:12:31,  2.99s/it]

{'loss': 1.0588, 'grad_norm': 0.46546369791030884, 'learning_rate': 0.00019761880940470238, 'epoch': 0.0}


  1%|▏         | 125/10000 [01:24<9:14:32,  3.37s/it]

{'loss': 1.0608, 'grad_norm': 0.4086732268333435, 'learning_rate': 0.00019759879939969987, 'epoch': 0.0}


  1%|▏         | 126/10000 [01:28<9:22:12,  3.42s/it]

{'loss': 0.9972, 'grad_norm': 0.42835286259651184, 'learning_rate': 0.00019757878939469735, 'epoch': 0.0}


  1%|▏         | 127/10000 [01:31<8:51:59,  3.23s/it]

{'loss': 0.9288, 'grad_norm': 0.39938104152679443, 'learning_rate': 0.00019755877938969484, 'epoch': 0.0}


  1%|▏         | 128/10000 [01:34<9:01:23,  3.29s/it]

{'loss': 0.9987, 'grad_norm': 0.5387094616889954, 'learning_rate': 0.00019753876938469235, 'epoch': 0.0}


  1%|▏         | 129/10000 [01:38<9:42:06,  3.54s/it]

{'loss': 1.0173, 'grad_norm': 0.3492412567138672, 'learning_rate': 0.00019751875937968984, 'epoch': 0.0}


  1%|▏         | 130/10000 [01:42<10:00:49,  3.65s/it]

{'loss': 0.995, 'grad_norm': 0.3474500775337219, 'learning_rate': 0.00019749874937468736, 'epoch': 0.0}


  1%|▏         | 131/10000 [01:46<10:00:48,  3.65s/it]

{'loss': 1.0216, 'grad_norm': 0.39473333954811096, 'learning_rate': 0.00019747873936968487, 'epoch': 0.0}


  1%|▏         | 132/10000 [01:50<10:10:20,  3.71s/it]

{'loss': 0.9095, 'grad_norm': 0.3740091025829315, 'learning_rate': 0.00019745872936468236, 'epoch': 0.0}


  1%|▏         | 133/10000 [01:52<9:18:05,  3.39s/it] 

{'loss': 0.6716, 'grad_norm': 0.4397159814834595, 'learning_rate': 0.00019743871935967985, 'epoch': 0.0}


  1%|▏         | 134/10000 [01:55<9:02:23,  3.30s/it]

{'loss': 0.9445, 'grad_norm': 0.4242704212665558, 'learning_rate': 0.00019741870935467733, 'epoch': 0.0}


  1%|▏         | 135/10000 [01:59<9:06:52,  3.33s/it]

{'loss': 1.2851, 'grad_norm': 0.38603153824806213, 'learning_rate': 0.00019739869934967485, 'epoch': 0.0}


  1%|▏         | 136/10000 [02:02<9:20:23,  3.41s/it]

{'loss': 0.9463, 'grad_norm': 0.395896315574646, 'learning_rate': 0.00019737868934467233, 'epoch': 0.0}


  1%|▏         | 137/10000 [02:06<9:48:20,  3.58s/it]

{'loss': 0.7871, 'grad_norm': 0.34071430563926697, 'learning_rate': 0.00019735867933966985, 'epoch': 0.0}


  1%|▏         | 138/10000 [02:10<9:57:14,  3.63s/it]

{'loss': 1.0115, 'grad_norm': 0.35672104358673096, 'learning_rate': 0.00019733866933466734, 'epoch': 0.0}


  1%|▏         | 139/10000 [02:13<9:33:10,  3.49s/it]

{'loss': 1.2138, 'grad_norm': 0.395330011844635, 'learning_rate': 0.00019731865932966485, 'epoch': 0.0}


  1%|▏         | 140/10000 [02:17<9:31:12,  3.48s/it]

{'loss': 1.1204, 'grad_norm': 0.41166919469833374, 'learning_rate': 0.00019729864932466234, 'epoch': 0.0}


  1%|▏         | 141/10000 [02:20<9:17:55,  3.40s/it]

{'loss': 0.8689, 'grad_norm': 0.4172604978084564, 'learning_rate': 0.00019727863931965985, 'epoch': 0.0}


  1%|▏         | 142/10000 [02:23<9:07:33,  3.33s/it]

{'loss': 0.9961, 'grad_norm': 0.422650545835495, 'learning_rate': 0.00019725862931465734, 'epoch': 0.0}


  1%|▏         | 143/10000 [02:26<8:34:17,  3.13s/it]

{'loss': 0.8651, 'grad_norm': 0.3978389799594879, 'learning_rate': 0.00019723861930965483, 'epoch': 0.0}


  1%|▏         | 144/10000 [02:29<8:23:25,  3.06s/it]

{'loss': 1.2014, 'grad_norm': 0.4099180996417999, 'learning_rate': 0.00019721860930465234, 'epoch': 0.0}


  1%|▏         | 145/10000 [02:33<9:03:36,  3.31s/it]

{'loss': 1.0025, 'grad_norm': 0.3563527762889862, 'learning_rate': 0.00019719859929964983, 'epoch': 0.0}


  1%|▏         | 146/10000 [02:35<8:33:45,  3.13s/it]

{'loss': 1.0447, 'grad_norm': 0.4799491763114929, 'learning_rate': 0.00019717858929464734, 'epoch': 0.0}


  1%|▏         | 147/10000 [02:38<8:21:13,  3.05s/it]

{'loss': 0.8505, 'grad_norm': 0.3920971751213074, 'learning_rate': 0.00019715857928964483, 'epoch': 0.0}


  1%|▏         | 148/10000 [02:41<8:13:21,  3.00s/it]

{'loss': 1.01, 'grad_norm': 0.4143662452697754, 'learning_rate': 0.00019713856928464234, 'epoch': 0.0}


  1%|▏         | 149/10000 [02:44<8:13:42,  3.01s/it]

{'loss': 0.9298, 'grad_norm': 0.3915620446205139, 'learning_rate': 0.00019711855927963983, 'epoch': 0.0}


  2%|▏         | 150/10000 [02:47<8:23:11,  3.07s/it]

{'loss': 0.8058, 'grad_norm': 0.36411190032958984, 'learning_rate': 0.00019709854927463735, 'epoch': 0.0}


  2%|▏         | 151/10000 [02:50<8:24:27,  3.07s/it]

{'loss': 0.9749, 'grad_norm': 0.38967612385749817, 'learning_rate': 0.00019707853926963483, 'epoch': 0.0}


  2%|▏         | 152/10000 [02:53<8:12:25,  3.00s/it]

{'loss': 0.951, 'grad_norm': 0.43461745977401733, 'learning_rate': 0.00019705852926463232, 'epoch': 0.0}


  2%|▏         | 153/10000 [02:58<9:25:23,  3.45s/it]

{'loss': 0.9002, 'grad_norm': 0.29371345043182373, 'learning_rate': 0.0001970385192596298, 'epoch': 0.0}


  2%|▏         | 154/10000 [03:01<9:32:38,  3.49s/it]

{'loss': 1.0334, 'grad_norm': 0.35748133063316345, 'learning_rate': 0.00019701850925462732, 'epoch': 0.0}


  2%|▏         | 155/10000 [03:04<9:11:12,  3.36s/it]

{'loss': 0.7308, 'grad_norm': 0.3606654107570648, 'learning_rate': 0.0001969984992496248, 'epoch': 0.0}


  2%|▏         | 156/10000 [03:07<8:52:27,  3.25s/it]

{'loss': 1.0535, 'grad_norm': 0.4368322491645813, 'learning_rate': 0.00019697848924462232, 'epoch': 0.0}


  2%|▏         | 157/10000 [03:10<8:28:48,  3.10s/it]

{'loss': 0.6932, 'grad_norm': 0.3741081953048706, 'learning_rate': 0.0001969584792396198, 'epoch': 0.0}


  2%|▏         | 158/10000 [03:13<8:11:26,  3.00s/it]

{'loss': 0.8339, 'grad_norm': 0.398960679769516, 'learning_rate': 0.00019693846923461733, 'epoch': 0.0}


  2%|▏         | 159/10000 [03:16<8:28:24,  3.10s/it]

{'loss': 1.1535, 'grad_norm': 0.43380680680274963, 'learning_rate': 0.00019691845922961484, 'epoch': 0.0}


  2%|▏         | 160/10000 [03:20<8:44:34,  3.20s/it]

{'loss': 1.1022, 'grad_norm': 0.42440900206565857, 'learning_rate': 0.00019689844922461233, 'epoch': 0.0}


  2%|▏         | 161/10000 [03:23<9:07:32,  3.34s/it]

{'loss': 1.1793, 'grad_norm': 0.3977767527103424, 'learning_rate': 0.00019687843921960981, 'epoch': 0.0}


  2%|▏         | 162/10000 [03:26<8:45:46,  3.21s/it]

{'loss': 0.7954, 'grad_norm': 0.3751555383205414, 'learning_rate': 0.0001968584292146073, 'epoch': 0.0}


  2%|▏         | 163/10000 [03:29<8:36:16,  3.15s/it]

{'loss': 1.0453, 'grad_norm': 0.4475592076778412, 'learning_rate': 0.00019683841920960482, 'epoch': 0.0}


  2%|▏         | 164/10000 [03:32<8:37:57,  3.16s/it]

{'loss': 0.9113, 'grad_norm': 0.40859347581863403, 'learning_rate': 0.0001968184092046023, 'epoch': 0.0}


  2%|▏         | 165/10000 [03:35<8:22:42,  3.07s/it]

{'loss': 0.9434, 'grad_norm': 0.40518999099731445, 'learning_rate': 0.00019679839919959982, 'epoch': 0.0}


  2%|▏         | 166/10000 [03:38<8:24:52,  3.08s/it]

{'loss': 1.158, 'grad_norm': 0.5025804042816162, 'learning_rate': 0.0001967783891945973, 'epoch': 0.0}


  2%|▏         | 167/10000 [03:41<8:15:04,  3.02s/it]

{'loss': 0.7694, 'grad_norm': 0.3509847819805145, 'learning_rate': 0.00019675837918959482, 'epoch': 0.0}


  2%|▏         | 168/10000 [03:45<8:43:13,  3.19s/it]

{'loss': 1.0038, 'grad_norm': 0.408590167760849, 'learning_rate': 0.0001967383691845923, 'epoch': 0.0}


  2%|▏         | 169/10000 [03:48<8:40:06,  3.17s/it]

{'loss': 0.8162, 'grad_norm': 0.36447325348854065, 'learning_rate': 0.0001967183591795898, 'epoch': 0.0}


  2%|▏         | 170/10000 [03:52<9:24:31,  3.45s/it]

{'loss': 1.008, 'grad_norm': 0.3799092173576355, 'learning_rate': 0.00019669834917458728, 'epoch': 0.0}


  2%|▏         | 171/10000 [03:56<10:06:42,  3.70s/it]

{'loss': 1.0889, 'grad_norm': 0.3564364016056061, 'learning_rate': 0.0001966783391695848, 'epoch': 0.0}


  2%|▏         | 172/10000 [03:59<9:14:11,  3.38s/it] 

{'loss': 0.8256, 'grad_norm': 0.425395131111145, 'learning_rate': 0.0001966583291645823, 'epoch': 0.0}


  2%|▏         | 173/10000 [04:03<9:26:34,  3.46s/it]

{'loss': 0.7733, 'grad_norm': 0.36153706908226013, 'learning_rate': 0.0001966383191595798, 'epoch': 0.0}


  2%|▏         | 174/10000 [04:06<9:11:23,  3.37s/it]

{'loss': 0.8439, 'grad_norm': 0.43375059962272644, 'learning_rate': 0.0001966183091545773, 'epoch': 0.0}


  2%|▏         | 175/10000 [04:10<9:47:22,  3.59s/it]

{'loss': 1.072, 'grad_norm': 0.37569668889045715, 'learning_rate': 0.0001965982991495748, 'epoch': 0.0}


  2%|▏         | 176/10000 [04:13<9:43:46,  3.57s/it]

{'loss': 0.8142, 'grad_norm': 0.3219853341579437, 'learning_rate': 0.0001965782891445723, 'epoch': 0.0}


  2%|▏         | 177/10000 [04:16<9:22:42,  3.44s/it]

{'loss': 1.4676, 'grad_norm': 0.4286434054374695, 'learning_rate': 0.0001965582791395698, 'epoch': 0.0}


  2%|▏         | 178/10000 [04:20<9:53:06,  3.62s/it]

{'loss': 0.7596, 'grad_norm': 0.35110896825790405, 'learning_rate': 0.0001965382691345673, 'epoch': 0.0}


  2%|▏         | 179/10000 [04:24<9:39:34,  3.54s/it]

{'loss': 0.9595, 'grad_norm': 0.38988572359085083, 'learning_rate': 0.00019651825912956477, 'epoch': 0.0}


  2%|▏         | 180/10000 [04:27<9:08:06,  3.35s/it]

{'loss': 0.981, 'grad_norm': 0.3702700138092041, 'learning_rate': 0.0001964982491245623, 'epoch': 0.0}


  2%|▏         | 181/10000 [04:30<8:41:34,  3.19s/it]

{'loss': 0.9286, 'grad_norm': 0.4430692493915558, 'learning_rate': 0.00019647823911955978, 'epoch': 0.0}


  2%|▏         | 182/10000 [04:33<8:33:38,  3.14s/it]

{'loss': 0.9162, 'grad_norm': 0.4022606313228607, 'learning_rate': 0.0001964582291145573, 'epoch': 0.0}


  2%|▏         | 183/10000 [04:36<8:35:10,  3.15s/it]

{'loss': 0.9683, 'grad_norm': 0.38215088844299316, 'learning_rate': 0.00019643821910955478, 'epoch': 0.0}


  2%|▏         | 184/10000 [04:39<9:04:06,  3.33s/it]

{'loss': 1.0169, 'grad_norm': 0.334852933883667, 'learning_rate': 0.0001964182091045523, 'epoch': 0.0}


  2%|▏         | 185/10000 [04:43<9:08:37,  3.35s/it]

{'loss': 1.2589, 'grad_norm': 0.3875662684440613, 'learning_rate': 0.00019639819909954978, 'epoch': 0.0}


  2%|▏         | 186/10000 [04:47<10:00:57,  3.67s/it]

{'loss': 1.1579, 'grad_norm': 0.3746762275695801, 'learning_rate': 0.0001963781890945473, 'epoch': 0.0}


  2%|▏         | 187/10000 [04:50<9:33:19,  3.51s/it] 

{'loss': 0.7474, 'grad_norm': 0.3734791576862335, 'learning_rate': 0.00019635817908954478, 'epoch': 0.0}


  2%|▏         | 188/10000 [04:53<9:08:47,  3.36s/it]

{'loss': 1.0667, 'grad_norm': 0.4231889247894287, 'learning_rate': 0.00019633816908454227, 'epoch': 0.0}


  2%|▏         | 189/10000 [04:57<9:11:22,  3.37s/it]

{'loss': 1.0348, 'grad_norm': 0.38199320435523987, 'learning_rate': 0.00019631815907953978, 'epoch': 0.0}


  2%|▏         | 190/10000 [05:00<9:25:38,  3.46s/it]

{'loss': 0.9621, 'grad_norm': 0.3116518557071686, 'learning_rate': 0.00019629814907453727, 'epoch': 0.0}


  2%|▏         | 191/10000 [05:06<11:08:35,  4.09s/it]

{'loss': 1.0699, 'grad_norm': 0.2969510853290558, 'learning_rate': 0.00019627813906953478, 'epoch': 0.0}


  2%|▏         | 192/10000 [05:10<11:08:13,  4.09s/it]

{'loss': 0.9286, 'grad_norm': 0.3402874171733856, 'learning_rate': 0.00019625812906453227, 'epoch': 0.0}


  2%|▏         | 193/10000 [05:14<10:50:13,  3.98s/it]

{'loss': 0.7939, 'grad_norm': 0.3735104501247406, 'learning_rate': 0.0001962381190595298, 'epoch': 0.0}


  2%|▏         | 194/10000 [05:16<9:35:35,  3.52s/it] 

{'loss': 0.8827, 'grad_norm': 0.4389958679676056, 'learning_rate': 0.00019621810905452727, 'epoch': 0.0}


  2%|▏         | 195/10000 [05:19<9:11:01,  3.37s/it]

{'loss': 0.8807, 'grad_norm': 0.37219715118408203, 'learning_rate': 0.0001961980990495248, 'epoch': 0.0}


  2%|▏         | 196/10000 [05:22<8:43:23,  3.20s/it]

{'loss': 1.1977, 'grad_norm': 0.499068945646286, 'learning_rate': 0.00019617808904452225, 'epoch': 0.0}


  2%|▏         | 197/10000 [05:25<8:43:43,  3.21s/it]

{'loss': 0.8477, 'grad_norm': 0.394352525472641, 'learning_rate': 0.00019615807903951976, 'epoch': 0.0}


  2%|▏         | 198/10000 [05:28<8:32:36,  3.14s/it]

{'loss': 0.8579, 'grad_norm': 0.3742881119251251, 'learning_rate': 0.00019613806903451725, 'epoch': 0.0}


  2%|▏         | 199/10000 [05:31<8:27:01,  3.10s/it]

{'loss': 0.7943, 'grad_norm': 0.36077162623405457, 'learning_rate': 0.00019611805902951476, 'epoch': 0.0}


  2%|▏         | 200/10000 [05:34<7:55:10,  2.91s/it]

{'loss': 0.6753, 'grad_norm': 0.38345372676849365, 'learning_rate': 0.00019609804902451228, 'epoch': 0.0}


  2%|▏         | 201/10000 [05:39<9:37:11,  3.53s/it]

{'loss': 0.7523, 'grad_norm': 0.35377904772758484, 'learning_rate': 0.00019607803901950977, 'epoch': 0.0}


  2%|▏         | 202/10000 [05:43<10:03:45,  3.70s/it]

{'loss': 1.082, 'grad_norm': 0.3559958040714264, 'learning_rate': 0.00019605802901450728, 'epoch': 0.0}


  2%|▏         | 203/10000 [05:47<10:22:47,  3.81s/it]

{'loss': 1.1871, 'grad_norm': 0.3517901301383972, 'learning_rate': 0.00019603801900950477, 'epoch': 0.0}


  2%|▏         | 204/10000 [05:50<9:42:58,  3.57s/it] 

{'loss': 0.7213, 'grad_norm': 0.32779598236083984, 'learning_rate': 0.00019601800900450226, 'epoch': 0.0}


  2%|▏         | 205/10000 [05:54<9:40:42,  3.56s/it]

{'loss': 0.8909, 'grad_norm': 0.3955517113208771, 'learning_rate': 0.00019599799899949974, 'epoch': 0.0}


  2%|▏         | 206/10000 [05:56<8:58:54,  3.30s/it]

{'loss': 0.7677, 'grad_norm': 0.408233642578125, 'learning_rate': 0.00019597798899449726, 'epoch': 0.0}


  2%|▏         | 207/10000 [05:59<8:35:18,  3.16s/it]

{'loss': 1.0146, 'grad_norm': 0.42873865365982056, 'learning_rate': 0.00019595797898949474, 'epoch': 0.0}


  2%|▏         | 208/10000 [06:02<8:34:34,  3.15s/it]

{'loss': 1.2167, 'grad_norm': 0.39640337228775024, 'learning_rate': 0.00019593796898449226, 'epoch': 0.0}


  2%|▏         | 209/10000 [06:06<8:47:43,  3.23s/it]

{'loss': 1.1212, 'grad_norm': 0.37699517607688904, 'learning_rate': 0.00019591795897948975, 'epoch': 0.0}


  2%|▏         | 210/10000 [06:10<9:56:52,  3.66s/it]

{'loss': 1.3368, 'grad_norm': 0.31001976132392883, 'learning_rate': 0.00019589794897448726, 'epoch': 0.0}


  2%|▏         | 211/10000 [06:14<9:54:12,  3.64s/it]

{'loss': 1.0157, 'grad_norm': 0.3355272114276886, 'learning_rate': 0.00019587793896948475, 'epoch': 0.0}


  2%|▏         | 212/10000 [06:17<9:21:40,  3.44s/it]

{'loss': 0.9538, 'grad_norm': 0.3729099929332733, 'learning_rate': 0.00019585792896448226, 'epoch': 0.0}


  2%|▏         | 213/10000 [06:20<9:08:13,  3.36s/it]

{'loss': 0.9422, 'grad_norm': 0.35995060205459595, 'learning_rate': 0.00019583791895947975, 'epoch': 0.0}


  2%|▏         | 214/10000 [06:23<9:00:33,  3.31s/it]

{'loss': 1.214, 'grad_norm': 0.4527627229690552, 'learning_rate': 0.00019581790895447724, 'epoch': 0.0}


  2%|▏         | 215/10000 [06:26<8:56:15,  3.29s/it]

{'loss': 0.9191, 'grad_norm': 0.3558861315250397, 'learning_rate': 0.00019579789894947475, 'epoch': 0.0}


  2%|▏         | 216/10000 [06:29<8:27:19,  3.11s/it]

{'loss': 0.7829, 'grad_norm': 0.4058845341205597, 'learning_rate': 0.00019577788894447224, 'epoch': 0.0}


  2%|▏         | 217/10000 [06:33<9:15:11,  3.41s/it]

{'loss': 1.0367, 'grad_norm': 0.33095693588256836, 'learning_rate': 0.00019575787893946975, 'epoch': 0.0}


  2%|▏         | 218/10000 [06:37<9:32:36,  3.51s/it]

{'loss': 0.9062, 'grad_norm': 0.31793737411499023, 'learning_rate': 0.00019573786893446724, 'epoch': 0.0}


  2%|▏         | 219/10000 [06:40<9:03:48,  3.34s/it]

{'loss': 0.9625, 'grad_norm': 0.37211835384368896, 'learning_rate': 0.00019571785892946475, 'epoch': 0.0}


  2%|▏         | 220/10000 [06:45<10:14:51,  3.77s/it]

{'loss': 1.2631, 'grad_norm': 0.362901896238327, 'learning_rate': 0.00019569784892446224, 'epoch': 0.0}


  2%|▏         | 221/10000 [06:48<9:35:23,  3.53s/it] 

{'loss': 0.7998, 'grad_norm': 0.37768906354904175, 'learning_rate': 0.00019567783891945976, 'epoch': 0.0}


  2%|▏         | 222/10000 [06:50<8:42:59,  3.21s/it]

{'loss': 0.7354, 'grad_norm': 0.3918558955192566, 'learning_rate': 0.00019565782891445724, 'epoch': 0.0}


  2%|▏         | 223/10000 [06:54<9:37:47,  3.55s/it]

{'loss': 0.8919, 'grad_norm': 0.3456965684890747, 'learning_rate': 0.00019563781890945473, 'epoch': 0.0}


  2%|▏         | 224/10000 [06:58<9:55:48,  3.66s/it]

{'loss': 1.0353, 'grad_norm': 0.36453694105148315, 'learning_rate': 0.00019561780890445222, 'epoch': 0.0}


  2%|▏         | 225/10000 [07:02<10:03:56,  3.71s/it]

{'loss': 0.9298, 'grad_norm': 0.40777355432510376, 'learning_rate': 0.00019559779889944973, 'epoch': 0.0}


  2%|▏         | 226/10000 [07:05<9:10:13,  3.38s/it] 

{'loss': 0.9713, 'grad_norm': 0.44679510593414307, 'learning_rate': 0.00019557778889444722, 'epoch': 0.0}


  2%|▏         | 227/10000 [07:08<9:09:47,  3.38s/it]

{'loss': 0.8549, 'grad_norm': 0.36174479126930237, 'learning_rate': 0.00019555777888944473, 'epoch': 0.0}


  2%|▏         | 228/10000 [07:12<9:08:02,  3.36s/it]

{'loss': 1.0025, 'grad_norm': 0.43221232295036316, 'learning_rate': 0.00019553776888444225, 'epoch': 0.0}


  2%|▏         | 229/10000 [07:15<9:38:03,  3.55s/it]

{'loss': 1.1404, 'grad_norm': 0.3443163335323334, 'learning_rate': 0.00019551775887943974, 'epoch': 0.0}


  2%|▏         | 230/10000 [07:19<9:46:07,  3.60s/it]

{'loss': 0.8974, 'grad_norm': 0.342691034078598, 'learning_rate': 0.00019549774887443725, 'epoch': 0.0}


  2%|▏         | 231/10000 [07:22<9:00:40,  3.32s/it]

{'loss': 1.2458, 'grad_norm': 0.44708696007728577, 'learning_rate': 0.0001954777388694347, 'epoch': 0.0}


  2%|▏         | 232/10000 [07:25<8:34:22,  3.16s/it]

{'loss': 0.9169, 'grad_norm': 0.3709349036216736, 'learning_rate': 0.00019545772886443222, 'epoch': 0.0}


  2%|▏         | 233/10000 [07:28<8:48:39,  3.25s/it]

{'loss': 0.9476, 'grad_norm': 0.3759896755218506, 'learning_rate': 0.0001954377188594297, 'epoch': 0.0}


  2%|▏         | 234/10000 [07:31<8:42:48,  3.21s/it]

{'loss': 1.0575, 'grad_norm': 0.400484561920166, 'learning_rate': 0.00019541770885442723, 'epoch': 0.0}


  2%|▏         | 235/10000 [07:34<8:22:40,  3.09s/it]

{'loss': 1.0494, 'grad_norm': 0.4197712540626526, 'learning_rate': 0.0001953976988494247, 'epoch': 0.0}


  2%|▏         | 236/10000 [07:37<8:17:19,  3.06s/it]

{'loss': 1.2445, 'grad_norm': 0.40856096148490906, 'learning_rate': 0.00019537768884442223, 'epoch': 0.0}


  2%|▏         | 237/10000 [07:40<8:18:48,  3.07s/it]

{'loss': 1.1328, 'grad_norm': 0.4095408320426941, 'learning_rate': 0.00019535767883941971, 'epoch': 0.0}


  2%|▏         | 238/10000 [07:43<7:56:04,  2.93s/it]

{'loss': 0.7065, 'grad_norm': 0.39449411630630493, 'learning_rate': 0.00019533766883441723, 'epoch': 0.0}


  2%|▏         | 239/10000 [07:46<7:50:21,  2.89s/it]

{'loss': 0.8203, 'grad_norm': 0.36423730850219727, 'learning_rate': 0.00019531765882941472, 'epoch': 0.0}


  2%|▏         | 240/10000 [07:49<8:43:10,  3.22s/it]

{'loss': 1.1659, 'grad_norm': 0.3650512099266052, 'learning_rate': 0.0001952976488244122, 'epoch': 0.0}


  2%|▏         | 241/10000 [07:52<8:30:50,  3.14s/it]

{'loss': 0.807, 'grad_norm': 0.3838384449481964, 'learning_rate': 0.00019527763881940972, 'epoch': 0.0}


  2%|▏         | 242/10000 [07:55<8:18:35,  3.07s/it]

{'loss': 0.9906, 'grad_norm': 0.46562954783439636, 'learning_rate': 0.0001952576288144072, 'epoch': 0.0}


  2%|▏         | 243/10000 [07:58<7:46:09,  2.87s/it]

{'loss': 0.8677, 'grad_norm': 0.41200384497642517, 'learning_rate': 0.00019523761880940472, 'epoch': 0.0}


  2%|▏         | 244/10000 [08:01<8:03:58,  2.98s/it]

{'loss': 0.9875, 'grad_norm': 0.37494611740112305, 'learning_rate': 0.0001952176088044022, 'epoch': 0.0}


  2%|▏         | 245/10000 [08:04<8:05:14,  2.98s/it]

{'loss': 0.9773, 'grad_norm': 0.43720564246177673, 'learning_rate': 0.00019519759879939972, 'epoch': 0.0}


  2%|▏         | 246/10000 [08:08<8:44:00,  3.22s/it]

{'loss': 1.0514, 'grad_norm': 0.37097465991973877, 'learning_rate': 0.0001951775887943972, 'epoch': 0.0}


  2%|▏         | 247/10000 [08:11<8:39:27,  3.20s/it]

{'loss': 0.9793, 'grad_norm': 0.4175160527229309, 'learning_rate': 0.00019515757878939472, 'epoch': 0.0}


  2%|▏         | 248/10000 [08:14<8:29:53,  3.14s/it]

{'loss': 1.1015, 'grad_norm': 0.4200083017349243, 'learning_rate': 0.0001951375687843922, 'epoch': 0.0}


  2%|▏         | 249/10000 [08:17<8:18:38,  3.07s/it]

{'loss': 0.912, 'grad_norm': 0.5814898014068604, 'learning_rate': 0.0001951175587793897, 'epoch': 0.0}


  2%|▎         | 250/10000 [08:20<8:47:45,  3.25s/it]

{'loss': 1.1792, 'grad_norm': 0.4113352596759796, 'learning_rate': 0.00019509754877438718, 'epoch': 0.0}


  3%|▎         | 251/10000 [08:23<8:36:06,  3.18s/it]

{'loss': 0.7279, 'grad_norm': 0.37738579511642456, 'learning_rate': 0.0001950775387693847, 'epoch': 0.0}


  3%|▎         | 252/10000 [08:26<8:03:14,  2.97s/it]

{'loss': 1.0026, 'grad_norm': 0.438098669052124, 'learning_rate': 0.0001950575287643822, 'epoch': 0.0}


  3%|▎         | 253/10000 [08:30<8:55:12,  3.29s/it]

{'loss': 1.203, 'grad_norm': 0.3781733810901642, 'learning_rate': 0.0001950375187593797, 'epoch': 0.0}


  3%|▎         | 254/10000 [08:35<9:59:50,  3.69s/it]

{'loss': 1.2772, 'grad_norm': 0.3507167100906372, 'learning_rate': 0.0001950175087543772, 'epoch': 0.0}


  3%|▎         | 255/10000 [08:38<10:07:05,  3.74s/it]

{'loss': 0.9015, 'grad_norm': 0.38967981934547424, 'learning_rate': 0.0001949974987493747, 'epoch': 0.0}


  3%|▎         | 256/10000 [08:43<10:24:50,  3.85s/it]

{'loss': 0.7833, 'grad_norm': 0.3593650758266449, 'learning_rate': 0.00019497748874437222, 'epoch': 0.0}


  3%|▎         | 257/10000 [08:46<9:42:34,  3.59s/it] 

{'loss': 0.8038, 'grad_norm': 0.37594708800315857, 'learning_rate': 0.0001949574787393697, 'epoch': 0.0}


  3%|▎         | 258/10000 [08:50<10:05:20,  3.73s/it]

{'loss': 1.0262, 'grad_norm': 0.3641093969345093, 'learning_rate': 0.0001949374687343672, 'epoch': 0.0}


  3%|▎         | 259/10000 [08:54<10:57:57,  4.05s/it]

{'loss': 0.8988, 'grad_norm': 0.3266015350818634, 'learning_rate': 0.00019491745872936468, 'epoch': 0.0}


  3%|▎         | 260/10000 [08:59<11:09:50,  4.13s/it]

{'loss': 1.0456, 'grad_norm': 0.37538692355155945, 'learning_rate': 0.0001948974487243622, 'epoch': 0.0}


  3%|▎         | 261/10000 [09:04<11:53:40,  4.40s/it]

{'loss': 1.0991, 'grad_norm': 0.34204646944999695, 'learning_rate': 0.00019487743871935968, 'epoch': 0.0}


  3%|▎         | 262/10000 [09:08<12:05:31,  4.47s/it]

{'loss': 0.9945, 'grad_norm': 0.3378700017929077, 'learning_rate': 0.0001948574287143572, 'epoch': 0.0}


  3%|▎         | 263/10000 [09:14<13:12:06,  4.88s/it]

{'loss': 1.0341, 'grad_norm': 0.29977333545684814, 'learning_rate': 0.00019483741870935468, 'epoch': 0.0}


  3%|▎         | 264/10000 [09:20<13:49:43,  5.11s/it]

{'loss': 1.1412, 'grad_norm': 0.31017452478408813, 'learning_rate': 0.0001948174087043522, 'epoch': 0.0}


  3%|▎         | 265/10000 [09:23<12:11:01,  4.51s/it]

{'loss': 0.9987, 'grad_norm': 0.4637922942638397, 'learning_rate': 0.00019479739869934968, 'epoch': 0.0}


  3%|▎         | 266/10000 [09:27<11:27:29,  4.24s/it]

{'loss': 1.3304, 'grad_norm': 0.4106317460536957, 'learning_rate': 0.0001947773886943472, 'epoch': 0.0}


  3%|▎         | 267/10000 [09:31<11:39:32,  4.31s/it]

{'loss': 0.9603, 'grad_norm': 0.32923147082328796, 'learning_rate': 0.00019475737868934466, 'epoch': 0.0}


  3%|▎         | 268/10000 [09:35<11:33:28,  4.28s/it]

{'loss': 1.1198, 'grad_norm': 0.3412123918533325, 'learning_rate': 0.00019473736868434217, 'epoch': 0.0}


  3%|▎         | 269/10000 [09:38<10:34:00,  3.91s/it]

{'loss': 0.8649, 'grad_norm': 0.3712521195411682, 'learning_rate': 0.0001947173586793397, 'epoch': 0.0}


  3%|▎         | 270/10000 [09:41<9:44:39,  3.61s/it] 

{'loss': 1.048, 'grad_norm': 0.41579604148864746, 'learning_rate': 0.00019469734867433717, 'epoch': 0.0}


  3%|▎         | 271/10000 [09:45<9:29:34,  3.51s/it]

{'loss': 0.6229, 'grad_norm': 0.31643423438072205, 'learning_rate': 0.0001946773386693347, 'epoch': 0.0}


  3%|▎         | 272/10000 [09:49<10:26:55,  3.87s/it]

{'loss': 1.494, 'grad_norm': 0.37848639488220215, 'learning_rate': 0.00019465732866433218, 'epoch': 0.0}


  3%|▎         | 273/10000 [09:53<10:01:34,  3.71s/it]

{'loss': 0.9935, 'grad_norm': 0.39459502696990967, 'learning_rate': 0.0001946373186593297, 'epoch': 0.0}


  3%|▎         | 274/10000 [09:56<9:53:21,  3.66s/it] 

{'loss': 0.8173, 'grad_norm': 0.390214741230011, 'learning_rate': 0.00019461730865432718, 'epoch': 0.0}


  3%|▎         | 275/10000 [09:59<8:59:50,  3.33s/it]

{'loss': 0.9453, 'grad_norm': 0.4721486568450928, 'learning_rate': 0.00019459729864932467, 'epoch': 0.0}


  3%|▎         | 276/10000 [10:03<9:46:48,  3.62s/it]

{'loss': 0.9848, 'grad_norm': 0.31930848956108093, 'learning_rate': 0.00019457728864432215, 'epoch': 0.0}


  3%|▎         | 277/10000 [10:07<10:29:35,  3.89s/it]

{'loss': 1.0423, 'grad_norm': 0.33903583884239197, 'learning_rate': 0.00019455727863931967, 'epoch': 0.0}


  3%|▎         | 278/10000 [10:11<10:08:10,  3.75s/it]

{'loss': 0.8144, 'grad_norm': 0.3485274314880371, 'learning_rate': 0.00019453726863431715, 'epoch': 0.0}


  3%|▎         | 279/10000 [10:15<10:07:38,  3.75s/it]

{'loss': 0.9244, 'grad_norm': 0.350481241941452, 'learning_rate': 0.00019451725862931467, 'epoch': 0.0}


  3%|▎         | 280/10000 [10:18<9:35:28,  3.55s/it] 

{'loss': 0.6551, 'grad_norm': 0.3451204299926758, 'learning_rate': 0.00019449724862431216, 'epoch': 0.0}


  3%|▎         | 281/10000 [10:22<10:10:21,  3.77s/it]

{'loss': 0.9742, 'grad_norm': 0.3499133884906769, 'learning_rate': 0.00019447723861930967, 'epoch': 0.0}


  3%|▎         | 282/10000 [10:26<10:01:15,  3.71s/it]

{'loss': 0.8961, 'grad_norm': 0.38632747530937195, 'learning_rate': 0.00019445722861430716, 'epoch': 0.0}


  3%|▎         | 283/10000 [10:29<9:52:39,  3.66s/it] 

{'loss': 0.9705, 'grad_norm': 0.3990154266357422, 'learning_rate': 0.00019443721860930467, 'epoch': 0.0}


  3%|▎         | 284/10000 [10:34<10:52:19,  4.03s/it]

{'loss': 1.0313, 'grad_norm': 0.3395130932331085, 'learning_rate': 0.00019441720860430216, 'epoch': 0.0}


  3%|▎         | 285/10000 [10:37<10:06:09,  3.74s/it]

{'loss': 0.7282, 'grad_norm': 0.36295270919799805, 'learning_rate': 0.00019439719859929965, 'epoch': 0.0}


  3%|▎         | 286/10000 [10:41<9:53:17,  3.66s/it] 

{'loss': 0.9009, 'grad_norm': 0.3959553837776184, 'learning_rate': 0.00019437718859429716, 'epoch': 0.0}


  3%|▎         | 287/10000 [10:44<9:22:29,  3.47s/it]

{'loss': 0.8299, 'grad_norm': 0.3805176913738251, 'learning_rate': 0.00019435717858929465, 'epoch': 0.0}


  3%|▎         | 288/10000 [10:47<9:24:55,  3.49s/it]

{'loss': 0.7051, 'grad_norm': 0.354248583316803, 'learning_rate': 0.00019433716858429216, 'epoch': 0.0}


  3%|▎         | 289/10000 [10:52<10:19:30,  3.83s/it]

{'loss': 0.9646, 'grad_norm': 0.32994773983955383, 'learning_rate': 0.00019431715857928965, 'epoch': 0.0}


  3%|▎         | 290/10000 [10:56<10:29:43,  3.89s/it]

{'loss': 1.1995, 'grad_norm': 0.343544602394104, 'learning_rate': 0.00019429714857428716, 'epoch': 0.0}


  3%|▎         | 291/10000 [10:59<9:39:58,  3.58s/it] 

{'loss': 0.7827, 'grad_norm': 0.42232877016067505, 'learning_rate': 0.00019427713856928465, 'epoch': 0.0}


  3%|▎         | 292/10000 [11:02<9:33:03,  3.54s/it]

{'loss': 0.8589, 'grad_norm': 0.3520839810371399, 'learning_rate': 0.00019425712856428217, 'epoch': 0.0}


  3%|▎         | 293/10000 [11:05<9:16:26,  3.44s/it]

{'loss': 0.8734, 'grad_norm': 0.3716122806072235, 'learning_rate': 0.00019423711855927965, 'epoch': 0.0}


  3%|▎         | 294/10000 [11:08<8:59:12,  3.33s/it]

{'loss': 0.8963, 'grad_norm': 0.39792731404304504, 'learning_rate': 0.00019421710855427714, 'epoch': 0.0}


  3%|▎         | 295/10000 [11:12<8:59:24,  3.33s/it]

{'loss': 1.0153, 'grad_norm': 0.3785398602485657, 'learning_rate': 0.00019419709854927463, 'epoch': 0.0}


  3%|▎         | 296/10000 [11:15<8:31:35,  3.16s/it]

{'loss': 0.8716, 'grad_norm': 0.40740641951560974, 'learning_rate': 0.00019417708854427214, 'epoch': 0.0}


  3%|▎         | 297/10000 [11:18<8:23:35,  3.11s/it]

{'loss': 0.8022, 'grad_norm': 0.3487364649772644, 'learning_rate': 0.00019415707853926966, 'epoch': 0.0}


  3%|▎         | 298/10000 [11:20<7:47:34,  2.89s/it]

{'loss': 1.0165, 'grad_norm': 0.44670116901397705, 'learning_rate': 0.00019413706853426714, 'epoch': 0.0}


  3%|▎         | 299/10000 [11:23<8:12:19,  3.04s/it]

{'loss': 0.9565, 'grad_norm': 0.3314405083656311, 'learning_rate': 0.00019411705852926466, 'epoch': 0.0}


  3%|▎         | 300/10000 [11:26<8:10:47,  3.04s/it]

{'loss': 0.9276, 'grad_norm': 0.38931992650032043, 'learning_rate': 0.00019409704852426215, 'epoch': 0.0}


  3%|▎         | 301/10000 [11:31<9:32:01,  3.54s/it]

{'loss': 0.7731, 'grad_norm': 0.3591238856315613, 'learning_rate': 0.00019407703851925966, 'epoch': 0.0}


  3%|▎         | 302/10000 [11:35<9:41:31,  3.60s/it]

{'loss': 0.9408, 'grad_norm': 0.341220885515213, 'learning_rate': 0.00019405702851425712, 'epoch': 0.0}


  3%|▎         | 303/10000 [11:38<9:00:44,  3.35s/it]

{'loss': 0.9211, 'grad_norm': 0.40228933095932007, 'learning_rate': 0.00019403701850925463, 'epoch': 0.0}


  3%|▎         | 304/10000 [11:41<8:48:43,  3.27s/it]

{'loss': 0.8376, 'grad_norm': 0.3587574362754822, 'learning_rate': 0.00019401700850425212, 'epoch': 0.0}


  3%|▎         | 305/10000 [11:44<9:01:22,  3.35s/it]

{'loss': 0.8753, 'grad_norm': 0.37553471326828003, 'learning_rate': 0.00019399699849924964, 'epoch': 0.0}


  3%|▎         | 306/10000 [11:48<9:07:53,  3.39s/it]

{'loss': 1.2996, 'grad_norm': 0.45314526557922363, 'learning_rate': 0.00019397698849424712, 'epoch': 0.0}


  3%|▎         | 307/10000 [11:50<8:33:33,  3.18s/it]

{'loss': 0.885, 'grad_norm': 0.4634645879268646, 'learning_rate': 0.00019395697848924464, 'epoch': 0.0}


  3%|▎         | 308/10000 [11:54<9:00:47,  3.35s/it]

{'loss': 0.7612, 'grad_norm': 0.38649171590805054, 'learning_rate': 0.00019393696848424212, 'epoch': 0.0}


  3%|▎         | 309/10000 [11:57<8:45:48,  3.26s/it]

{'loss': 1.0957, 'grad_norm': 0.7113619446754456, 'learning_rate': 0.00019391695847923964, 'epoch': 0.0}


  3%|▎         | 310/10000 [12:00<8:26:50,  3.14s/it]

{'loss': 0.7914, 'grad_norm': 0.3958480656147003, 'learning_rate': 0.00019389694847423713, 'epoch': 0.0}


  3%|▎         | 311/10000 [12:03<8:09:32,  3.03s/it]

{'loss': 0.8251, 'grad_norm': 0.37428098917007446, 'learning_rate': 0.00019387693846923461, 'epoch': 0.0}


  3%|▎         | 312/10000 [12:06<8:42:45,  3.24s/it]

{'loss': 0.9613, 'grad_norm': 0.3748987317085266, 'learning_rate': 0.00019385692846423213, 'epoch': 0.0}


  3%|▎         | 313/10000 [12:10<8:35:46,  3.19s/it]

{'loss': 0.9611, 'grad_norm': 0.5014151334762573, 'learning_rate': 0.00019383691845922962, 'epoch': 0.0}


  3%|▎         | 314/10000 [12:13<8:30:41,  3.16s/it]

{'loss': 0.9833, 'grad_norm': 0.36205559968948364, 'learning_rate': 0.00019381690845422713, 'epoch': 0.0}


  3%|▎         | 315/10000 [12:17<9:51:25,  3.66s/it]

{'loss': 0.9913, 'grad_norm': 0.31543976068496704, 'learning_rate': 0.00019379689844922462, 'epoch': 0.0}


  3%|▎         | 316/10000 [12:21<9:39:36,  3.59s/it]

{'loss': 0.9847, 'grad_norm': 0.37504374980926514, 'learning_rate': 0.00019377688844422213, 'epoch': 0.0}


  3%|▎         | 317/10000 [12:24<9:01:01,  3.35s/it]

{'loss': 0.849, 'grad_norm': 0.3953101933002472, 'learning_rate': 0.00019375687843921962, 'epoch': 0.0}


  3%|▎         | 318/10000 [12:26<8:23:27,  3.12s/it]

{'loss': 0.6864, 'grad_norm': 0.3518151640892029, 'learning_rate': 0.00019373686843421713, 'epoch': 0.0}


  3%|▎         | 319/10000 [12:29<8:28:40,  3.15s/it]

{'loss': 0.9896, 'grad_norm': 0.3698817491531372, 'learning_rate': 0.00019371685842921462, 'epoch': 0.0}


  3%|▎         | 320/10000 [12:33<8:33:21,  3.18s/it]

{'loss': 1.0721, 'grad_norm': 0.3825753629207611, 'learning_rate': 0.0001936968484242121, 'epoch': 0.0}


  3%|▎         | 321/10000 [12:36<8:31:06,  3.17s/it]

{'loss': 0.8825, 'grad_norm': 0.434841126203537, 'learning_rate': 0.0001936768384192096, 'epoch': 0.0}


  3%|▎         | 322/10000 [12:39<8:08:10,  3.03s/it]

{'loss': 0.5819, 'grad_norm': 0.3598698675632477, 'learning_rate': 0.0001936568284142071, 'epoch': 0.0}


  3%|▎         | 323/10000 [12:41<8:02:34,  2.99s/it]

{'loss': 0.9508, 'grad_norm': 0.3702029585838318, 'learning_rate': 0.0001936368184092046, 'epoch': 0.0}


  3%|▎         | 324/10000 [12:44<8:02:02,  2.99s/it]

{'loss': 0.9456, 'grad_norm': 0.4305853247642517, 'learning_rate': 0.0001936168084042021, 'epoch': 0.0}


  3%|▎         | 325/10000 [12:47<7:42:46,  2.87s/it]

{'loss': 0.9224, 'grad_norm': 0.40479105710983276, 'learning_rate': 0.00019359679839919963, 'epoch': 0.0}


  3%|▎         | 326/10000 [12:50<8:00:28,  2.98s/it]

{'loss': 1.1522, 'grad_norm': 0.45036858320236206, 'learning_rate': 0.0001935767883941971, 'epoch': 0.0}


  3%|▎         | 327/10000 [12:54<8:23:05,  3.12s/it]

{'loss': 0.8694, 'grad_norm': 0.3453178405761719, 'learning_rate': 0.00019355677838919463, 'epoch': 0.0}


  3%|▎         | 328/10000 [12:56<7:55:15,  2.95s/it]

{'loss': 0.6292, 'grad_norm': 0.44567838311195374, 'learning_rate': 0.00019353676838419211, 'epoch': 0.0}


  3%|▎         | 329/10000 [13:01<8:58:57,  3.34s/it]

{'loss': 0.7972, 'grad_norm': 0.3513737916946411, 'learning_rate': 0.0001935167583791896, 'epoch': 0.0}


  3%|▎         | 330/10000 [13:04<9:27:12,  3.52s/it]

{'loss': 0.8915, 'grad_norm': 0.3485163152217865, 'learning_rate': 0.0001934967483741871, 'epoch': 0.0}


  3%|▎         | 331/10000 [13:07<8:46:50,  3.27s/it]

{'loss': 0.8378, 'grad_norm': 0.4495109021663666, 'learning_rate': 0.0001934767383691846, 'epoch': 0.0}


  3%|▎         | 332/10000 [13:10<8:27:12,  3.15s/it]

{'loss': 0.8395, 'grad_norm': 0.3745805323123932, 'learning_rate': 0.0001934567283641821, 'epoch': 0.0}


  3%|▎         | 333/10000 [13:13<8:15:00,  3.07s/it]

{'loss': 0.8895, 'grad_norm': 0.405270516872406, 'learning_rate': 0.0001934367183591796, 'epoch': 0.0}


  3%|▎         | 334/10000 [13:16<8:12:02,  3.05s/it]

{'loss': 0.9811, 'grad_norm': 0.4263317584991455, 'learning_rate': 0.0001934167083541771, 'epoch': 0.0}


  3%|▎         | 335/10000 [13:19<8:32:14,  3.18s/it]

{'loss': 0.9079, 'grad_norm': 0.33685457706451416, 'learning_rate': 0.0001933966983491746, 'epoch': 0.0}


  3%|▎         | 336/10000 [13:23<8:33:45,  3.19s/it]

{'loss': 0.8266, 'grad_norm': 0.4020422697067261, 'learning_rate': 0.0001933766883441721, 'epoch': 0.0}


  3%|▎         | 337/10000 [13:26<8:41:02,  3.24s/it]

{'loss': 0.822, 'grad_norm': 0.36385205388069153, 'learning_rate': 0.00019335667833916958, 'epoch': 0.0}


  3%|▎         | 338/10000 [13:30<9:16:10,  3.45s/it]

{'loss': 0.9648, 'grad_norm': 0.36293426156044006, 'learning_rate': 0.0001933366683341671, 'epoch': 0.0}


  3%|▎         | 339/10000 [13:33<9:03:24,  3.37s/it]

{'loss': 1.1123, 'grad_norm': 0.4213840663433075, 'learning_rate': 0.00019331665832916458, 'epoch': 0.0}


  3%|▎         | 340/10000 [13:37<9:41:40,  3.61s/it]

{'loss': 1.1138, 'grad_norm': 0.34762871265411377, 'learning_rate': 0.0001932966483241621, 'epoch': 0.0}


  3%|▎         | 341/10000 [13:41<9:43:26,  3.62s/it]

{'loss': 1.1602, 'grad_norm': 0.36614906787872314, 'learning_rate': 0.00019327663831915958, 'epoch': 0.0}


  3%|▎         | 342/10000 [13:46<10:32:21,  3.93s/it]

{'loss': 0.8569, 'grad_norm': 0.2823576331138611, 'learning_rate': 0.0001932566283141571, 'epoch': 0.0}


  3%|▎         | 343/10000 [13:48<9:38:19,  3.59s/it] 

{'loss': 0.9371, 'grad_norm': 0.4739949107170105, 'learning_rate': 0.00019323661830915459, 'epoch': 0.0}


  3%|▎         | 344/10000 [13:52<9:28:56,  3.54s/it]

{'loss': 0.6728, 'grad_norm': 0.37796637415885925, 'learning_rate': 0.0001932166083041521, 'epoch': 0.0}


  3%|▎         | 345/10000 [13:56<10:21:06,  3.86s/it]

{'loss': 0.8595, 'grad_norm': 0.30797484517097473, 'learning_rate': 0.0001931965982991496, 'epoch': 0.0}


  3%|▎         | 346/10000 [14:00<9:59:05,  3.72s/it] 

{'loss': 1.0446, 'grad_norm': 0.3876642882823944, 'learning_rate': 0.00019317658829414708, 'epoch': 0.0}


  3%|▎         | 347/10000 [14:03<9:09:24,  3.41s/it]

{'loss': 0.8059, 'grad_norm': 0.41831257939338684, 'learning_rate': 0.00019315657828914456, 'epoch': 0.0}


  3%|▎         | 348/10000 [14:06<8:52:09,  3.31s/it]

{'loss': 0.5603, 'grad_norm': 0.34054210782051086, 'learning_rate': 0.00019313656828414208, 'epoch': 0.0}


  3%|▎         | 349/10000 [14:09<8:47:39,  3.28s/it]

{'loss': 0.8777, 'grad_norm': 0.3557712137699127, 'learning_rate': 0.00019311655827913956, 'epoch': 0.0}


  4%|▎         | 350/10000 [14:12<8:58:51,  3.35s/it]

{'loss': 1.0599, 'grad_norm': 0.4221242368221283, 'learning_rate': 0.00019309654827413708, 'epoch': 0.0}


  4%|▎         | 351/10000 [14:15<8:42:53,  3.25s/it]

{'loss': 0.677, 'grad_norm': 0.3507786989212036, 'learning_rate': 0.00019307653826913457, 'epoch': 0.0}


  4%|▎         | 352/10000 [14:19<8:41:54,  3.25s/it]

{'loss': 0.7336, 'grad_norm': 0.32508963346481323, 'learning_rate': 0.00019305652826413208, 'epoch': 0.0}


  4%|▎         | 353/10000 [14:21<8:19:06,  3.10s/it]

{'loss': 1.0618, 'grad_norm': 0.38809069991111755, 'learning_rate': 0.0001930365182591296, 'epoch': 0.0}


  4%|▎         | 354/10000 [14:25<8:28:59,  3.17s/it]

{'loss': 0.77, 'grad_norm': 0.3964482247829437, 'learning_rate': 0.00019301650825412708, 'epoch': 0.0}


  4%|▎         | 355/10000 [14:29<9:07:20,  3.40s/it]

{'loss': 0.8099, 'grad_norm': 0.3801107704639435, 'learning_rate': 0.00019299649824912457, 'epoch': 0.0}


  4%|▎         | 356/10000 [14:31<8:32:01,  3.19s/it]

{'loss': 1.0323, 'grad_norm': 0.4364209771156311, 'learning_rate': 0.00019297648824412206, 'epoch': 0.0}


  4%|▎         | 357/10000 [14:35<8:38:35,  3.23s/it]

{'loss': 0.8937, 'grad_norm': 0.336706280708313, 'learning_rate': 0.00019295647823911957, 'epoch': 0.0}


  4%|▎         | 358/10000 [14:38<8:58:07,  3.35s/it]

{'loss': 0.809, 'grad_norm': 0.3589794635772705, 'learning_rate': 0.00019293646823411706, 'epoch': 0.0}


  4%|▎         | 359/10000 [14:41<8:41:19,  3.24s/it]

{'loss': 1.0272, 'grad_norm': 0.46139079332351685, 'learning_rate': 0.00019291645822911457, 'epoch': 0.0}


  4%|▎         | 360/10000 [14:45<8:45:17,  3.27s/it]

{'loss': 1.385, 'grad_norm': 0.4899767339229584, 'learning_rate': 0.00019289644822411206, 'epoch': 0.0}


  4%|▎         | 361/10000 [14:49<9:38:41,  3.60s/it]

{'loss': 0.876, 'grad_norm': 0.4402126371860504, 'learning_rate': 0.00019287643821910957, 'epoch': 0.0}


  4%|▎         | 362/10000 [14:52<9:19:55,  3.49s/it]

{'loss': 0.82, 'grad_norm': 0.3386847972869873, 'learning_rate': 0.00019285642821410706, 'epoch': 0.0}


  4%|▎         | 363/10000 [14:55<8:57:01,  3.34s/it]

{'loss': 1.1532, 'grad_norm': 0.39087194204330444, 'learning_rate': 0.00019283641820910458, 'epoch': 0.0}


  4%|▎         | 364/10000 [14:59<9:07:42,  3.41s/it]

{'loss': 1.0902, 'grad_norm': 0.3787471652030945, 'learning_rate': 0.00019281640820410206, 'epoch': 0.0}


  4%|▎         | 365/10000 [15:02<9:09:05,  3.42s/it]

{'loss': 1.0709, 'grad_norm': 0.38108906149864197, 'learning_rate': 0.00019279639819909955, 'epoch': 0.0}


  4%|▎         | 366/10000 [15:05<8:57:32,  3.35s/it]

{'loss': 0.9648, 'grad_norm': 0.41207846999168396, 'learning_rate': 0.00019277638819409706, 'epoch': 0.0}


  4%|▎         | 367/10000 [15:08<8:45:15,  3.27s/it]

{'loss': 0.8403, 'grad_norm': 0.4032456576824188, 'learning_rate': 0.00019275637818909455, 'epoch': 0.0}


  4%|▎         | 368/10000 [15:11<8:16:42,  3.09s/it]

{'loss': 0.9287, 'grad_norm': 0.382731556892395, 'learning_rate': 0.00019273636818409207, 'epoch': 0.0}


  4%|▎         | 369/10000 [15:15<8:46:08,  3.28s/it]

{'loss': 0.8921, 'grad_norm': 0.3537355363368988, 'learning_rate': 0.00019271635817908955, 'epoch': 0.0}


  4%|▎         | 370/10000 [15:19<9:23:05,  3.51s/it]

{'loss': 0.9212, 'grad_norm': 0.3599560558795929, 'learning_rate': 0.00019269634817408707, 'epoch': 0.0}


  4%|▎         | 371/10000 [15:23<9:36:52,  3.59s/it]

{'loss': 1.3025, 'grad_norm': 0.3880796730518341, 'learning_rate': 0.00019267633816908456, 'epoch': 0.0}


  4%|▎         | 372/10000 [15:26<9:08:50,  3.42s/it]

{'loss': 0.7385, 'grad_norm': 0.34193214774131775, 'learning_rate': 0.00019265632816408204, 'epoch': 0.0}


  4%|▎         | 373/10000 [15:29<9:19:33,  3.49s/it]

{'loss': 0.9664, 'grad_norm': 0.3433278203010559, 'learning_rate': 0.00019263631815907953, 'epoch': 0.0}


  4%|▎         | 374/10000 [15:33<9:27:01,  3.53s/it]

{'loss': 0.7904, 'grad_norm': 0.3729091286659241, 'learning_rate': 0.00019261630815407704, 'epoch': 0.0}


  4%|▍         | 375/10000 [15:36<8:39:53,  3.24s/it]

{'loss': 0.8136, 'grad_norm': 0.3702584505081177, 'learning_rate': 0.00019259629814907453, 'epoch': 0.0}


  4%|▍         | 376/10000 [15:39<8:33:52,  3.20s/it]

{'loss': 0.9423, 'grad_norm': 0.35872310400009155, 'learning_rate': 0.00019257628814407205, 'epoch': 0.0}


  4%|▍         | 377/10000 [15:42<8:23:10,  3.14s/it]

{'loss': 0.8204, 'grad_norm': 0.35030218958854675, 'learning_rate': 0.00019255627813906953, 'epoch': 0.0}


  4%|▍         | 378/10000 [15:46<9:08:59,  3.42s/it]

{'loss': 0.9719, 'grad_norm': 0.36176013946533203, 'learning_rate': 0.00019253626813406705, 'epoch': 0.0}


  4%|▍         | 379/10000 [15:48<8:37:21,  3.23s/it]

{'loss': 1.0286, 'grad_norm': 0.44560950994491577, 'learning_rate': 0.00019251625812906453, 'epoch': 0.0}


  4%|▍         | 380/10000 [15:51<8:16:00,  3.09s/it]

{'loss': 0.873, 'grad_norm': 0.4313923418521881, 'learning_rate': 0.00019249624812406205, 'epoch': 0.0}


  4%|▍         | 381/10000 [15:56<9:19:21,  3.49s/it]

{'loss': 1.5601, 'grad_norm': 0.38290631771087646, 'learning_rate': 0.00019247623811905954, 'epoch': 0.0}


  4%|▍         | 382/10000 [15:59<9:11:58,  3.44s/it]

{'loss': 0.9286, 'grad_norm': 0.35980382561683655, 'learning_rate': 0.00019245622811405702, 'epoch': 0.0}


  4%|▍         | 383/10000 [16:02<9:02:54,  3.39s/it]

{'loss': 1.093, 'grad_norm': 0.39385679364204407, 'learning_rate': 0.00019243621810905454, 'epoch': 0.0}


  4%|▍         | 384/10000 [16:05<8:47:54,  3.29s/it]

{'loss': 0.8999, 'grad_norm': 0.3630620837211609, 'learning_rate': 0.00019241620810405203, 'epoch': 0.0}


  4%|▍         | 385/10000 [16:09<9:00:53,  3.38s/it]

{'loss': 1.0838, 'grad_norm': 0.38498616218566895, 'learning_rate': 0.00019239619809904954, 'epoch': 0.0}


  4%|▍         | 386/10000 [16:13<9:22:40,  3.51s/it]

{'loss': 0.7914, 'grad_norm': 0.31701916456222534, 'learning_rate': 0.00019237618809404703, 'epoch': 0.0}


  4%|▍         | 387/10000 [16:16<9:20:37,  3.50s/it]

{'loss': 0.7307, 'grad_norm': 0.33674687147140503, 'learning_rate': 0.00019235617808904454, 'epoch': 0.0}


  4%|▍         | 388/10000 [16:21<10:10:56,  3.81s/it]

{'loss': 1.2529, 'grad_norm': 0.4749969244003296, 'learning_rate': 0.00019233616808404203, 'epoch': 0.0}


  4%|▍         | 389/10000 [16:24<9:21:04,  3.50s/it] 

{'loss': 0.8017, 'grad_norm': 0.4101748466491699, 'learning_rate': 0.00019231615807903954, 'epoch': 0.0}


  4%|▍         | 390/10000 [16:28<10:08:02,  3.80s/it]

{'loss': 1.1378, 'grad_norm': 0.34494754672050476, 'learning_rate': 0.00019229614807403703, 'epoch': 0.0}


  4%|▍         | 391/10000 [16:31<9:45:34,  3.66s/it] 

{'loss': 0.8972, 'grad_norm': 0.4001324772834778, 'learning_rate': 0.00019227613806903452, 'epoch': 0.0}


  4%|▍         | 392/10000 [16:34<8:56:16,  3.35s/it]

{'loss': 0.857, 'grad_norm': 0.41186729073524475, 'learning_rate': 0.000192256128064032, 'epoch': 0.0}


  4%|▍         | 393/10000 [16:37<8:27:36,  3.17s/it]

{'loss': 0.694, 'grad_norm': 0.3741413354873657, 'learning_rate': 0.00019223611805902952, 'epoch': 0.0}


  4%|▍         | 394/10000 [16:40<8:12:34,  3.08s/it]

{'loss': 0.8084, 'grad_norm': 0.3919346332550049, 'learning_rate': 0.00019221610805402703, 'epoch': 0.0}


  4%|▍         | 395/10000 [16:43<8:07:39,  3.05s/it]

{'loss': 1.0293, 'grad_norm': 0.41748785972595215, 'learning_rate': 0.00019219609804902452, 'epoch': 0.0}


  4%|▍         | 396/10000 [16:46<8:24:14,  3.15s/it]

{'loss': 1.1168, 'grad_norm': 0.40244629979133606, 'learning_rate': 0.00019217608804402204, 'epoch': 0.0}


  4%|▍         | 397/10000 [16:50<8:58:29,  3.36s/it]

{'loss': 1.1663, 'grad_norm': 0.36109331250190735, 'learning_rate': 0.00019215607803901952, 'epoch': 0.0}


  4%|▍         | 398/10000 [16:53<9:01:14,  3.38s/it]

{'loss': 0.6123, 'grad_norm': 0.3423836827278137, 'learning_rate': 0.00019213606803401704, 'epoch': 0.0}


  4%|▍         | 399/10000 [16:57<9:31:45,  3.57s/it]

{'loss': 0.972, 'grad_norm': 0.37858498096466064, 'learning_rate': 0.00019211605802901452, 'epoch': 0.0}


  4%|▍         | 400/10000 [17:00<9:09:06,  3.43s/it]

{'loss': 0.7777, 'grad_norm': 0.42743805050849915, 'learning_rate': 0.000192096048024012, 'epoch': 0.0}


  4%|▍         | 401/10000 [17:04<9:21:32,  3.51s/it]

{'loss': 0.9888, 'grad_norm': 0.4420115649700165, 'learning_rate': 0.0001920760380190095, 'epoch': 0.0}


  4%|▍         | 402/10000 [17:07<9:01:32,  3.39s/it]

{'loss': 1.0026, 'grad_norm': 0.39348042011260986, 'learning_rate': 0.000192056028014007, 'epoch': 0.0}


  4%|▍         | 403/10000 [17:10<8:21:44,  3.14s/it]

{'loss': 0.8395, 'grad_norm': 0.3961869180202484, 'learning_rate': 0.0001920360180090045, 'epoch': 0.0}


  4%|▍         | 404/10000 [17:13<8:51:04,  3.32s/it]

{'loss': 0.7514, 'grad_norm': 0.32306233048439026, 'learning_rate': 0.00019201600800400202, 'epoch': 0.0}


  4%|▍         | 405/10000 [17:17<8:59:35,  3.37s/it]

{'loss': 1.2895, 'grad_norm': 0.4170149266719818, 'learning_rate': 0.0001919959979989995, 'epoch': 0.0}


  4%|▍         | 406/10000 [17:19<8:11:10,  3.07s/it]

{'loss': 0.9252, 'grad_norm': 0.4539543688297272, 'learning_rate': 0.00019197598799399702, 'epoch': 0.0}


  4%|▍         | 407/10000 [17:23<8:52:52,  3.33s/it]

{'loss': 0.6948, 'grad_norm': 0.31552931666374207, 'learning_rate': 0.0001919559779889945, 'epoch': 0.0}


  4%|▍         | 408/10000 [17:26<8:45:25,  3.29s/it]

{'loss': 0.9477, 'grad_norm': 0.4218480885028839, 'learning_rate': 0.000191935967983992, 'epoch': 0.0}


  4%|▍         | 409/10000 [17:29<8:29:46,  3.19s/it]

{'loss': 1.0064, 'grad_norm': 0.40342679619789124, 'learning_rate': 0.0001919159579789895, 'epoch': 0.0}


  4%|▍         | 410/10000 [17:32<7:59:35,  3.00s/it]

{'loss': 0.8711, 'grad_norm': 0.3774559199810028, 'learning_rate': 0.000191895947973987, 'epoch': 0.0}


  4%|▍         | 411/10000 [17:36<8:46:36,  3.30s/it]

{'loss': 0.8884, 'grad_norm': 0.32823899388313293, 'learning_rate': 0.0001918759379689845, 'epoch': 0.0}


  4%|▍         | 412/10000 [17:39<8:42:24,  3.27s/it]

{'loss': 0.8101, 'grad_norm': 0.36584025621414185, 'learning_rate': 0.000191855927963982, 'epoch': 0.0}


  4%|▍         | 413/10000 [17:42<8:19:02,  3.12s/it]

{'loss': 0.6547, 'grad_norm': 0.39952537417411804, 'learning_rate': 0.0001918359179589795, 'epoch': 0.0}


  4%|▍         | 414/10000 [17:45<8:20:36,  3.13s/it]

{'loss': 0.9058, 'grad_norm': 0.3921791613101959, 'learning_rate': 0.000191815907953977, 'epoch': 0.0}


  4%|▍         | 415/10000 [17:49<8:35:06,  3.22s/it]

{'loss': 1.0669, 'grad_norm': 0.39390799403190613, 'learning_rate': 0.0001917958979489745, 'epoch': 0.0}


  4%|▍         | 416/10000 [17:52<8:34:47,  3.22s/it]

{'loss': 1.0941, 'grad_norm': 0.3517090678215027, 'learning_rate': 0.000191775887943972, 'epoch': 0.0}


  4%|▍         | 417/10000 [17:55<8:39:57,  3.26s/it]

{'loss': 0.8662, 'grad_norm': 0.37434959411621094, 'learning_rate': 0.00019175587793896949, 'epoch': 0.0}


  4%|▍         | 418/10000 [17:58<8:19:15,  3.13s/it]

{'loss': 1.0468, 'grad_norm': 0.4091353416442871, 'learning_rate': 0.00019173586793396697, 'epoch': 0.0}


  4%|▍         | 419/10000 [18:04<10:45:48,  4.04s/it]

{'loss': 1.3881, 'grad_norm': 0.31586170196533203, 'learning_rate': 0.0001917158579289645, 'epoch': 0.0}


  4%|▍         | 420/10000 [18:10<11:54:20,  4.47s/it]

{'loss': 1.454, 'grad_norm': 0.3494589328765869, 'learning_rate': 0.00019169584792396197, 'epoch': 0.0}


  4%|▍         | 421/10000 [18:13<10:54:07,  4.10s/it]

{'loss': 0.8658, 'grad_norm': 0.3567999303340912, 'learning_rate': 0.0001916758379189595, 'epoch': 0.0}


  4%|▍         | 422/10000 [18:16<9:50:36,  3.70s/it] 

{'loss': 0.6253, 'grad_norm': 0.3788127303123474, 'learning_rate': 0.000191655827913957, 'epoch': 0.0}


  4%|▍         | 423/10000 [18:18<8:52:45,  3.34s/it]

{'loss': 0.7694, 'grad_norm': 0.46962523460388184, 'learning_rate': 0.0001916358179089545, 'epoch': 0.0}


  4%|▍         | 424/10000 [18:21<8:25:28,  3.17s/it]

{'loss': 0.9851, 'grad_norm': 0.4558451473712921, 'learning_rate': 0.000191615807903952, 'epoch': 0.0}


  4%|▍         | 425/10000 [18:25<9:20:56,  3.52s/it]

{'loss': 0.8329, 'grad_norm': 0.325742244720459, 'learning_rate': 0.0001915957978989495, 'epoch': 0.0}


  4%|▍         | 426/10000 [18:29<9:50:14,  3.70s/it]

{'loss': 1.0029, 'grad_norm': 0.36365702748298645, 'learning_rate': 0.00019157578789394698, 'epoch': 0.0}


  4%|▍         | 427/10000 [18:33<9:53:24,  3.72s/it]

{'loss': 0.9734, 'grad_norm': 0.39985862374305725, 'learning_rate': 0.00019155577788894447, 'epoch': 0.0}


  4%|▍         | 428/10000 [18:37<10:17:56,  3.87s/it]

{'loss': 1.0703, 'grad_norm': 0.3453162610530853, 'learning_rate': 0.00019153576788394198, 'epoch': 0.0}


  4%|▍         | 429/10000 [18:41<9:52:47,  3.72s/it] 

{'loss': 0.9121, 'grad_norm': 0.354299396276474, 'learning_rate': 0.00019151575787893947, 'epoch': 0.0}


  4%|▍         | 430/10000 [18:44<9:17:40,  3.50s/it]

{'loss': 1.0362, 'grad_norm': 0.4602927565574646, 'learning_rate': 0.00019149574787393698, 'epoch': 0.0}


  4%|▍         | 431/10000 [18:48<9:40:06,  3.64s/it]

{'loss': 1.0443, 'grad_norm': 0.3336329758167267, 'learning_rate': 0.00019147573786893447, 'epoch': 0.0}


  4%|▍         | 432/10000 [18:50<8:56:10,  3.36s/it]

{'loss': 0.9191, 'grad_norm': 0.4106292128562927, 'learning_rate': 0.00019145572786393198, 'epoch': 0.0}


  4%|▍         | 433/10000 [18:53<8:24:21,  3.16s/it]

{'loss': 0.9451, 'grad_norm': 0.44233429431915283, 'learning_rate': 0.00019143571785892947, 'epoch': 0.0}


  4%|▍         | 434/10000 [18:57<9:13:47,  3.47s/it]

{'loss': 1.0517, 'grad_norm': 0.36705780029296875, 'learning_rate': 0.00019141570785392699, 'epoch': 0.0}


  4%|▍         | 435/10000 [19:00<9:03:36,  3.41s/it]

{'loss': 1.2336, 'grad_norm': 0.37791818380355835, 'learning_rate': 0.00019139569784892447, 'epoch': 0.0}


  4%|▍         | 436/10000 [19:04<9:00:20,  3.39s/it]

{'loss': 0.9221, 'grad_norm': 0.35891249775886536, 'learning_rate': 0.00019137568784392196, 'epoch': 0.0}


  4%|▍         | 437/10000 [19:07<8:57:02,  3.37s/it]

{'loss': 0.8281, 'grad_norm': 0.3750009536743164, 'learning_rate': 0.00019135567783891947, 'epoch': 0.0}


  4%|▍         | 438/10000 [19:11<9:41:08,  3.65s/it]

{'loss': 1.0888, 'grad_norm': 0.3283328413963318, 'learning_rate': 0.00019133566783391696, 'epoch': 0.0}


  4%|▍         | 439/10000 [19:15<9:20:50,  3.52s/it]

{'loss': 1.1064, 'grad_norm': 0.40267789363861084, 'learning_rate': 0.00019131565782891448, 'epoch': 0.0}


  4%|▍         | 440/10000 [19:18<9:36:14,  3.62s/it]

{'loss': 1.1269, 'grad_norm': 0.3772118091583252, 'learning_rate': 0.00019129564782391196, 'epoch': 0.0}


  4%|▍         | 441/10000 [19:22<9:34:12,  3.60s/it]

{'loss': 0.7778, 'grad_norm': 0.3460763394832611, 'learning_rate': 0.00019127563781890948, 'epoch': 0.0}


  4%|▍         | 442/10000 [19:25<9:19:24,  3.51s/it]

{'loss': 0.966, 'grad_norm': 0.3925538659095764, 'learning_rate': 0.00019125562781390697, 'epoch': 0.0}


  4%|▍         | 443/10000 [19:29<9:47:23,  3.69s/it]

{'loss': 0.8997, 'grad_norm': 0.37680354714393616, 'learning_rate': 0.00019123561780890445, 'epoch': 0.0}


  4%|▍         | 444/10000 [19:33<9:18:17,  3.51s/it]

{'loss': 0.6495, 'grad_norm': 0.31564971804618835, 'learning_rate': 0.00019121560780390194, 'epoch': 0.0}


  4%|▍         | 445/10000 [19:37<9:55:00,  3.74s/it]

{'loss': 1.1456, 'grad_norm': 0.3598366975784302, 'learning_rate': 0.00019119559779889945, 'epoch': 0.0}


  4%|▍         | 446/10000 [19:40<9:14:44,  3.48s/it]

{'loss': 0.8538, 'grad_norm': 0.3943187892436981, 'learning_rate': 0.00019117558779389694, 'epoch': 0.0}


  4%|▍         | 447/10000 [19:43<9:09:34,  3.45s/it]

{'loss': 1.0156, 'grad_norm': 0.43358027935028076, 'learning_rate': 0.00019115557778889446, 'epoch': 0.0}


  4%|▍         | 448/10000 [19:47<9:18:05,  3.51s/it]

{'loss': 0.9854, 'grad_norm': 0.36979779601097107, 'learning_rate': 0.00019113556778389194, 'epoch': 0.0}


  4%|▍         | 449/10000 [19:50<9:14:51,  3.49s/it]

{'loss': 0.9324, 'grad_norm': 0.36736926436424255, 'learning_rate': 0.00019111555777888946, 'epoch': 0.0}


  4%|▍         | 450/10000 [19:54<9:24:09,  3.54s/it]

{'loss': 0.8298, 'grad_norm': 0.42592689394950867, 'learning_rate': 0.00019109554777388697, 'epoch': 0.0}


  5%|▍         | 451/10000 [19:59<10:28:52,  3.95s/it]

{'loss': 1.0549, 'grad_norm': 0.32781481742858887, 'learning_rate': 0.00019107553776888446, 'epoch': 0.0}


  5%|▍         | 452/10000 [20:03<10:22:24,  3.91s/it]

{'loss': 1.0701, 'grad_norm': 0.36197343468666077, 'learning_rate': 0.00019105552776388195, 'epoch': 0.0}


  5%|▍         | 453/10000 [20:06<10:09:32,  3.83s/it]

{'loss': 0.816, 'grad_norm': 0.3462102711200714, 'learning_rate': 0.00019103551775887943, 'epoch': 0.0}


  5%|▍         | 454/10000 [20:11<10:32:48,  3.98s/it]

{'loss': 0.9598, 'grad_norm': 0.32030409574508667, 'learning_rate': 0.00019101550775387695, 'epoch': 0.0}


  5%|▍         | 455/10000 [20:13<9:33:21,  3.60s/it] 

{'loss': 1.0954, 'grad_norm': 0.4441370964050293, 'learning_rate': 0.00019099549774887444, 'epoch': 0.0}


  5%|▍         | 456/10000 [20:17<9:36:05,  3.62s/it]

{'loss': 1.143, 'grad_norm': 0.3354131877422333, 'learning_rate': 0.00019097548774387195, 'epoch': 0.0}


  5%|▍         | 457/10000 [20:20<9:33:52,  3.61s/it]

{'loss': 0.9501, 'grad_norm': 0.35964199900627136, 'learning_rate': 0.00019095547773886944, 'epoch': 0.0}


  5%|▍         | 458/10000 [20:23<8:59:25,  3.39s/it]

{'loss': 0.8787, 'grad_norm': 0.39102381467819214, 'learning_rate': 0.00019093546773386695, 'epoch': 0.0}


  5%|▍         | 459/10000 [20:26<8:46:06,  3.31s/it]

{'loss': 1.0012, 'grad_norm': 0.3934084475040436, 'learning_rate': 0.00019091545772886444, 'epoch': 0.0}


  5%|▍         | 460/10000 [20:31<9:25:17,  3.56s/it]

{'loss': 0.7518, 'grad_norm': 0.38774609565734863, 'learning_rate': 0.00019089544772386195, 'epoch': 0.0}


  5%|▍         | 461/10000 [20:34<9:31:13,  3.59s/it]

{'loss': 0.9909, 'grad_norm': 0.5145343542098999, 'learning_rate': 0.00019087543771885944, 'epoch': 0.0}


  5%|▍         | 462/10000 [20:38<9:55:41,  3.75s/it]

{'loss': 1.1655, 'grad_norm': 0.3528408408164978, 'learning_rate': 0.00019085542771385693, 'epoch': 0.0}


  5%|▍         | 463/10000 [20:41<9:19:17,  3.52s/it]

{'loss': 0.8283, 'grad_norm': 0.39041540026664734, 'learning_rate': 0.00019083541770885444, 'epoch': 0.0}


  5%|▍         | 464/10000 [20:45<9:37:06,  3.63s/it]

{'loss': 1.2264, 'grad_norm': 0.39862990379333496, 'learning_rate': 0.00019081540770385193, 'epoch': 0.0}


  5%|▍         | 465/10000 [20:48<8:54:37,  3.36s/it]

{'loss': 0.8123, 'grad_norm': 0.5708790421485901, 'learning_rate': 0.00019079539769884944, 'epoch': 0.0}


  5%|▍         | 466/10000 [20:51<8:47:55,  3.32s/it]

{'loss': 1.1693, 'grad_norm': 0.4126635193824768, 'learning_rate': 0.00019077538769384693, 'epoch': 0.0}


  5%|▍         | 467/10000 [20:55<9:27:01,  3.57s/it]

{'loss': 1.2158, 'grad_norm': 0.3222489655017853, 'learning_rate': 0.00019075537768884445, 'epoch': 0.0}


  5%|▍         | 468/10000 [21:00<10:20:41,  3.91s/it]

{'loss': 1.0578, 'grad_norm': 0.32674315571784973, 'learning_rate': 0.00019073536768384193, 'epoch': 0.0}


  5%|▍         | 469/10000 [21:03<9:55:11,  3.75s/it] 

{'loss': 0.878, 'grad_norm': 0.3595261871814728, 'learning_rate': 0.00019071535767883945, 'epoch': 0.0}


  5%|▍         | 470/10000 [21:06<9:01:17,  3.41s/it]

{'loss': 1.0137, 'grad_norm': 0.39978986978530884, 'learning_rate': 0.00019069534767383693, 'epoch': 0.0}


  5%|▍         | 471/10000 [21:10<9:19:58,  3.53s/it]

{'loss': 1.0273, 'grad_norm': 0.32808202505111694, 'learning_rate': 0.00019067533766883442, 'epoch': 0.0}


  5%|▍         | 472/10000 [21:13<8:43:22,  3.30s/it]

{'loss': 0.8186, 'grad_norm': 0.411885142326355, 'learning_rate': 0.0001906553276638319, 'epoch': 0.0}


  5%|▍         | 473/10000 [21:17<9:16:29,  3.50s/it]

{'loss': 0.946, 'grad_norm': 0.4072096347808838, 'learning_rate': 0.00019063531765882942, 'epoch': 0.0}


  5%|▍         | 474/10000 [21:21<9:42:33,  3.67s/it]

{'loss': 0.8987, 'grad_norm': 0.3381766080856323, 'learning_rate': 0.0001906153076538269, 'epoch': 0.0}


  5%|▍         | 475/10000 [21:25<9:55:19,  3.75s/it]

{'loss': 0.826, 'grad_norm': 0.3541770279407501, 'learning_rate': 0.00019059529764882443, 'epoch': 0.0}


  5%|▍         | 476/10000 [21:28<9:38:03,  3.64s/it]

{'loss': 0.631, 'grad_norm': 0.3251146674156189, 'learning_rate': 0.0001905752876438219, 'epoch': 0.0}


  5%|▍         | 477/10000 [21:32<9:53:25,  3.74s/it]

{'loss': 0.9118, 'grad_norm': 0.29697319865226746, 'learning_rate': 0.00019055527763881943, 'epoch': 0.0}


  5%|▍         | 478/10000 [21:35<9:24:34,  3.56s/it]

{'loss': 0.9023, 'grad_norm': 0.3857351541519165, 'learning_rate': 0.00019053526763381691, 'epoch': 0.0}


  5%|▍         | 479/10000 [21:37<8:11:55,  3.10s/it]

{'loss': 0.9593, 'grad_norm': 0.4707733392715454, 'learning_rate': 0.0001905152576288144, 'epoch': 0.0}


  5%|▍         | 480/10000 [21:41<9:10:02,  3.47s/it]

{'loss': 0.9957, 'grad_norm': 0.3077579438686371, 'learning_rate': 0.00019049524762381192, 'epoch': 0.0}


  5%|▍         | 481/10000 [21:45<8:52:10,  3.35s/it]

{'loss': 1.1636, 'grad_norm': 0.4172825515270233, 'learning_rate': 0.0001904752376188094, 'epoch': 0.0}


  5%|▍         | 482/10000 [21:48<8:54:25,  3.37s/it]

{'loss': 1.1619, 'grad_norm': 0.39740148186683655, 'learning_rate': 0.00019045522761380692, 'epoch': 0.0}


  5%|▍         | 483/10000 [21:52<9:14:20,  3.49s/it]

{'loss': 0.8162, 'grad_norm': 0.3744322955608368, 'learning_rate': 0.0001904352176088044, 'epoch': 0.0}


  5%|▍         | 484/10000 [21:55<9:05:31,  3.44s/it]

{'loss': 0.9964, 'grad_norm': 0.3876047134399414, 'learning_rate': 0.00019041520760380192, 'epoch': 0.0}


  5%|▍         | 485/10000 [21:58<8:53:03,  3.36s/it]

{'loss': 0.9807, 'grad_norm': 0.398304283618927, 'learning_rate': 0.0001903951975987994, 'epoch': 0.0}


  5%|▍         | 486/10000 [22:02<9:04:59,  3.44s/it]

{'loss': 1.0195, 'grad_norm': 0.3633478283882141, 'learning_rate': 0.00019037518759379692, 'epoch': 0.0}


  5%|▍         | 487/10000 [22:04<8:22:02,  3.17s/it]

{'loss': 0.7728, 'grad_norm': 0.3594321608543396, 'learning_rate': 0.0001903551775887944, 'epoch': 0.0}


  5%|▍         | 488/10000 [22:07<8:18:08,  3.14s/it]

{'loss': 0.7427, 'grad_norm': 0.32993170619010925, 'learning_rate': 0.0001903351675837919, 'epoch': 0.0}


  5%|▍         | 489/10000 [22:12<9:20:16,  3.53s/it]

{'loss': 1.0339, 'grad_norm': 0.3423502445220947, 'learning_rate': 0.00019031515757878938, 'epoch': 0.0}


  5%|▍         | 490/10000 [22:16<9:30:18,  3.60s/it]

{'loss': 0.8008, 'grad_norm': 0.3353637754917145, 'learning_rate': 0.0001902951475737869, 'epoch': 0.0}


  5%|▍         | 491/10000 [22:19<9:24:02,  3.56s/it]

{'loss': 1.1036, 'grad_norm': 0.3463229835033417, 'learning_rate': 0.0001902751375687844, 'epoch': 0.0}


  5%|▍         | 492/10000 [22:22<8:57:25,  3.39s/it]

{'loss': 0.9608, 'grad_norm': 0.38134899735450745, 'learning_rate': 0.0001902551275637819, 'epoch': 0.0}


  5%|▍         | 493/10000 [22:25<8:50:59,  3.35s/it]

{'loss': 0.9205, 'grad_norm': 0.3705228269100189, 'learning_rate': 0.0001902351175587794, 'epoch': 0.0}


  5%|▍         | 494/10000 [22:29<9:19:09,  3.53s/it]

{'loss': 1.1648, 'grad_norm': 0.33788296580314636, 'learning_rate': 0.0001902151075537769, 'epoch': 0.0}


  5%|▍         | 495/10000 [22:33<9:45:10,  3.69s/it]

{'loss': 1.0628, 'grad_norm': 0.4300602078437805, 'learning_rate': 0.00019019509754877441, 'epoch': 0.0}


  5%|▍         | 496/10000 [22:38<10:19:24,  3.91s/it]

{'loss': 0.9914, 'grad_norm': 0.3288693130016327, 'learning_rate': 0.0001901750875437719, 'epoch': 0.0}


  5%|▍         | 497/10000 [22:41<9:48:19,  3.71s/it] 

{'loss': 0.7766, 'grad_norm': 0.39878371357917786, 'learning_rate': 0.0001901550775387694, 'epoch': 0.0}


  5%|▍         | 498/10000 [22:44<9:29:46,  3.60s/it]

{'loss': 0.9636, 'grad_norm': 0.3341132700443268, 'learning_rate': 0.00019013506753376688, 'epoch': 0.0}


  5%|▍         | 499/10000 [22:48<9:34:48,  3.63s/it]

{'loss': 0.9276, 'grad_norm': 0.37024518847465515, 'learning_rate': 0.0001901150575287644, 'epoch': 0.0}


  5%|▌         | 500/10000 [22:51<9:13:07,  3.49s/it]

{'loss': 1.0718, 'grad_norm': 0.3820498287677765, 'learning_rate': 0.00019009504752376188, 'epoch': 0.0}


  5%|▌         | 501/10000 [22:56<9:58:18,  3.78s/it]

{'loss': 1.1377, 'grad_norm': 0.4293174147605896, 'learning_rate': 0.0001900750375187594, 'epoch': 0.0}


  5%|▌         | 502/10000 [23:01<10:50:06,  4.11s/it]

{'loss': 1.2106, 'grad_norm': 0.3420525789260864, 'learning_rate': 0.00019005502751375688, 'epoch': 0.0}


  5%|▌         | 503/10000 [23:04<10:24:02,  3.94s/it]

{'loss': 0.8893, 'grad_norm': 0.41626307368278503, 'learning_rate': 0.0001900350175087544, 'epoch': 0.0}


  5%|▌         | 504/10000 [23:07<9:13:30,  3.50s/it] 

{'loss': 0.8109, 'grad_norm': 0.4945557713508606, 'learning_rate': 0.00019001500750375188, 'epoch': 0.0}


  5%|▌         | 505/10000 [23:09<8:40:06,  3.29s/it]

{'loss': 1.1394, 'grad_norm': 0.44723570346832275, 'learning_rate': 0.0001899949974987494, 'epoch': 0.0}


  5%|▌         | 506/10000 [23:13<8:30:12,  3.22s/it]

{'loss': 1.0951, 'grad_norm': 0.4332355260848999, 'learning_rate': 0.00018997498749374688, 'epoch': 0.0}


  5%|▌         | 507/10000 [23:15<8:01:12,  3.04s/it]

{'loss': 1.0155, 'grad_norm': 0.3706508278846741, 'learning_rate': 0.00018995497748874437, 'epoch': 0.0}


  5%|▌         | 508/10000 [23:19<8:24:52,  3.19s/it]

{'loss': 0.8165, 'grad_norm': 0.32619553804397583, 'learning_rate': 0.00018993496748374188, 'epoch': 0.0}


  5%|▌         | 509/10000 [23:21<8:01:33,  3.04s/it]

{'loss': 0.9237, 'grad_norm': 0.4377371072769165, 'learning_rate': 0.00018991495747873937, 'epoch': 0.0}


  5%|▌         | 510/10000 [23:25<8:27:11,  3.21s/it]

{'loss': 1.1011, 'grad_norm': 0.40009063482284546, 'learning_rate': 0.00018989494747373689, 'epoch': 0.0}


  5%|▌         | 511/10000 [23:28<8:25:05,  3.19s/it]

{'loss': 0.791, 'grad_norm': 0.363124817609787, 'learning_rate': 0.00018987493746873437, 'epoch': 0.0}


  5%|▌         | 512/10000 [23:31<7:50:00,  2.97s/it]

{'loss': 0.9048, 'grad_norm': 0.46716660261154175, 'learning_rate': 0.0001898549274637319, 'epoch': 0.0}


  5%|▌         | 513/10000 [23:33<7:44:28,  2.94s/it]

{'loss': 0.8448, 'grad_norm': 0.39662888646125793, 'learning_rate': 0.00018983491745872938, 'epoch': 0.0}


  5%|▌         | 514/10000 [23:37<7:55:28,  3.01s/it]

{'loss': 0.7238, 'grad_norm': 0.35849547386169434, 'learning_rate': 0.00018981490745372686, 'epoch': 0.0}


  5%|▌         | 515/10000 [23:40<7:52:54,  2.99s/it]

{'loss': 0.8452, 'grad_norm': 0.3861943781375885, 'learning_rate': 0.00018979489744872435, 'epoch': 0.0}


  5%|▌         | 516/10000 [23:44<8:39:00,  3.28s/it]

{'loss': 0.9706, 'grad_norm': 0.34374871850013733, 'learning_rate': 0.00018977488744372186, 'epoch': 0.0}


  5%|▌         | 517/10000 [23:46<8:18:53,  3.16s/it]

{'loss': 0.8279, 'grad_norm': 0.39644870162010193, 'learning_rate': 0.00018975487743871935, 'epoch': 0.0}


  5%|▌         | 518/10000 [23:49<8:15:36,  3.14s/it]

{'loss': 1.1928, 'grad_norm': 0.396847665309906, 'learning_rate': 0.00018973486743371687, 'epoch': 0.0}


  5%|▌         | 519/10000 [23:52<7:49:21,  2.97s/it]

{'loss': 0.9714, 'grad_norm': 0.42420098185539246, 'learning_rate': 0.00018971485742871438, 'epoch': 0.0}


  5%|▌         | 520/10000 [23:56<8:15:27,  3.14s/it]

{'loss': 0.9332, 'grad_norm': 0.34560906887054443, 'learning_rate': 0.00018969484742371187, 'epoch': 0.0}


  5%|▌         | 521/10000 [23:58<8:01:05,  3.05s/it]

{'loss': 0.8966, 'grad_norm': 0.4464259445667267, 'learning_rate': 0.00018967483741870938, 'epoch': 0.0}


  5%|▌         | 522/10000 [24:02<8:17:59,  3.15s/it]

{'loss': 0.8586, 'grad_norm': 0.40267515182495117, 'learning_rate': 0.00018965482741370687, 'epoch': 0.0}


  5%|▌         | 523/10000 [24:04<7:51:28,  2.98s/it]

{'loss': 1.0408, 'grad_norm': 0.4557397663593292, 'learning_rate': 0.00018963481740870436, 'epoch': 0.0}


  5%|▌         | 524/10000 [24:08<8:11:32,  3.11s/it]

{'loss': 0.8641, 'grad_norm': 0.3808710277080536, 'learning_rate': 0.00018961480740370184, 'epoch': 0.0}


  5%|▌         | 525/10000 [24:11<8:36:33,  3.27s/it]

{'loss': 1.034, 'grad_norm': 0.4184666574001312, 'learning_rate': 0.00018959479739869936, 'epoch': 0.0}


  5%|▌         | 526/10000 [24:14<8:25:21,  3.20s/it]

{'loss': 0.9124, 'grad_norm': 0.4249965250492096, 'learning_rate': 0.00018957478739369685, 'epoch': 0.0}


  5%|▌         | 527/10000 [24:18<8:42:19,  3.31s/it]

{'loss': 0.8829, 'grad_norm': 0.34025275707244873, 'learning_rate': 0.00018955477738869436, 'epoch': 0.0}


  5%|▌         | 528/10000 [24:21<8:45:46,  3.33s/it]

{'loss': 1.0478, 'grad_norm': 0.41089922189712524, 'learning_rate': 0.00018953476738369185, 'epoch': 0.0}


  5%|▌         | 529/10000 [24:24<8:19:57,  3.17s/it]

{'loss': 0.853, 'grad_norm': 0.38006791472435, 'learning_rate': 0.00018951475737868936, 'epoch': 0.0}


  5%|▌         | 530/10000 [24:28<8:44:14,  3.32s/it]

{'loss': 1.0106, 'grad_norm': 0.3824903070926666, 'learning_rate': 0.00018949474737368685, 'epoch': 0.0}


  5%|▌         | 531/10000 [24:31<8:17:41,  3.15s/it]

{'loss': 0.9702, 'grad_norm': 0.4325298070907593, 'learning_rate': 0.00018947473736868436, 'epoch': 0.0}


  5%|▌         | 532/10000 [24:34<8:03:45,  3.07s/it]

{'loss': 0.7583, 'grad_norm': 0.3683144748210907, 'learning_rate': 0.00018945472736368185, 'epoch': 0.0}


  5%|▌         | 533/10000 [24:37<8:31:29,  3.24s/it]

{'loss': 0.9683, 'grad_norm': 0.4051392078399658, 'learning_rate': 0.00018943471735867934, 'epoch': 0.0}


  5%|▌         | 534/10000 [24:41<8:40:31,  3.30s/it]

{'loss': 0.904, 'grad_norm': 0.3879139721393585, 'learning_rate': 0.00018941470735367685, 'epoch': 0.0}


  5%|▌         | 535/10000 [24:46<10:09:27,  3.86s/it]

{'loss': 1.0177, 'grad_norm': 0.3206978440284729, 'learning_rate': 0.00018939469734867434, 'epoch': 0.0}


  5%|▌         | 536/10000 [24:49<9:40:17,  3.68s/it] 

{'loss': 1.2605, 'grad_norm': 0.4147334396839142, 'learning_rate': 0.00018937468734367185, 'epoch': 0.0}


  5%|▌         | 537/10000 [24:54<10:43:54,  4.08s/it]

{'loss': 0.8131, 'grad_norm': 0.3354165554046631, 'learning_rate': 0.00018935467733866934, 'epoch': 0.0}


  5%|▌         | 538/10000 [25:00<12:24:45,  4.72s/it]

{'loss': 1.0991, 'grad_norm': 0.35257962346076965, 'learning_rate': 0.00018933466733366686, 'epoch': 0.0}


  5%|▌         | 539/10000 [25:04<11:43:24,  4.46s/it]

{'loss': 0.776, 'grad_norm': 0.32179760932922363, 'learning_rate': 0.00018931465732866434, 'epoch': 0.0}


  5%|▌         | 540/10000 [25:07<10:30:49,  4.00s/it]

{'loss': 0.72, 'grad_norm': 0.33864811062812805, 'learning_rate': 0.00018929464732366186, 'epoch': 0.0}


  5%|▌         | 541/10000 [25:10<9:52:06,  3.76s/it] 

{'loss': 1.0433, 'grad_norm': 0.3728007674217224, 'learning_rate': 0.00018927463731865934, 'epoch': 0.0}


  5%|▌         | 542/10000 [25:14<9:55:12,  3.78s/it]

{'loss': 0.8725, 'grad_norm': 0.3482028543949127, 'learning_rate': 0.00018925462731365683, 'epoch': 0.0}


  5%|▌         | 543/10000 [25:18<9:54:32,  3.77s/it]

{'loss': 0.9546, 'grad_norm': 0.38264283537864685, 'learning_rate': 0.00018923461730865432, 'epoch': 0.0}


  5%|▌         | 544/10000 [25:21<9:38:36,  3.67s/it]

{'loss': 1.3838, 'grad_norm': 0.4013059735298157, 'learning_rate': 0.00018921460730365183, 'epoch': 0.0}


  5%|▌         | 545/10000 [25:24<9:01:41,  3.44s/it]

{'loss': 1.1174, 'grad_norm': 0.41682448983192444, 'learning_rate': 0.00018919459729864932, 'epoch': 0.0}


  5%|▌         | 546/10000 [25:27<8:35:59,  3.27s/it]

{'loss': 0.9313, 'grad_norm': 0.4080604314804077, 'learning_rate': 0.00018917458729364684, 'epoch': 0.0}


  5%|▌         | 547/10000 [25:30<8:31:15,  3.25s/it]

{'loss': 0.8014, 'grad_norm': 0.3637869954109192, 'learning_rate': 0.00018915457728864435, 'epoch': 0.0}


  5%|▌         | 548/10000 [25:34<9:16:17,  3.53s/it]

{'loss': 0.9216, 'grad_norm': 0.3663696050643921, 'learning_rate': 0.00018913456728364184, 'epoch': 0.0}


  5%|▌         | 549/10000 [25:40<10:40:52,  4.07s/it]

{'loss': 0.8859, 'grad_norm': 0.2806893289089203, 'learning_rate': 0.00018911455727863932, 'epoch': 0.0}


  6%|▌         | 550/10000 [25:43<10:23:45,  3.96s/it]

{'loss': 0.9799, 'grad_norm': 0.35024958848953247, 'learning_rate': 0.0001890945472736368, 'epoch': 0.0}


  6%|▌         | 551/10000 [25:47<9:44:03,  3.71s/it] 

{'loss': 0.9338, 'grad_norm': 0.3406064212322235, 'learning_rate': 0.00018907453726863433, 'epoch': 0.0}


  6%|▌         | 552/10000 [25:49<8:48:25,  3.36s/it]

{'loss': 0.963, 'grad_norm': 0.4733341336250305, 'learning_rate': 0.0001890545272636318, 'epoch': 0.0}


  6%|▌         | 553/10000 [25:52<8:49:45,  3.36s/it]

{'loss': 0.8575, 'grad_norm': 0.3480241298675537, 'learning_rate': 0.00018903451725862933, 'epoch': 0.0}


  6%|▌         | 554/10000 [25:55<8:32:02,  3.25s/it]

{'loss': 0.8183, 'grad_norm': 0.38469719886779785, 'learning_rate': 0.00018901450725362681, 'epoch': 0.0}


  6%|▌         | 555/10000 [25:59<8:36:21,  3.28s/it]

{'loss': 0.9131, 'grad_norm': 0.4021301865577698, 'learning_rate': 0.00018899449724862433, 'epoch': 0.0}


  6%|▌         | 556/10000 [26:02<8:23:46,  3.20s/it]

{'loss': 0.8518, 'grad_norm': 0.395146906375885, 'learning_rate': 0.00018897448724362182, 'epoch': 0.0}


  6%|▌         | 557/10000 [26:07<9:36:19,  3.66s/it]

{'loss': 0.9088, 'grad_norm': 0.3073000907897949, 'learning_rate': 0.00018895447723861933, 'epoch': 0.0}


  6%|▌         | 558/10000 [26:10<9:21:21,  3.57s/it]

{'loss': 1.1306, 'grad_norm': 0.36667102575302124, 'learning_rate': 0.00018893446723361682, 'epoch': 0.0}


  6%|▌         | 559/10000 [26:13<8:39:19,  3.30s/it]

{'loss': 0.8157, 'grad_norm': 0.41193121671676636, 'learning_rate': 0.0001889144572286143, 'epoch': 0.0}


  6%|▌         | 560/10000 [26:17<9:11:49,  3.51s/it]

{'loss': 0.8685, 'grad_norm': 0.331704318523407, 'learning_rate': 0.00018889444722361182, 'epoch': 0.0}


  6%|▌         | 561/10000 [26:20<9:08:50,  3.49s/it]

{'loss': 0.8334, 'grad_norm': 0.3676672875881195, 'learning_rate': 0.0001888744372186093, 'epoch': 0.0}


  6%|▌         | 562/10000 [26:24<9:14:58,  3.53s/it]

{'loss': 0.8962, 'grad_norm': 0.33551546931266785, 'learning_rate': 0.00018885442721360682, 'epoch': 0.0}


  6%|▌         | 563/10000 [26:29<10:43:49,  4.09s/it]

{'loss': 0.6888, 'grad_norm': 0.3805946707725525, 'learning_rate': 0.0001888344172086043, 'epoch': 0.0}


  6%|▌         | 564/10000 [26:32<9:58:15,  3.80s/it] 

{'loss': 0.8714, 'grad_norm': 0.3786330819129944, 'learning_rate': 0.00018881440720360182, 'epoch': 0.0}


  6%|▌         | 565/10000 [26:36<9:56:44,  3.79s/it]

{'loss': 0.9284, 'grad_norm': 0.32826706767082214, 'learning_rate': 0.0001887943971985993, 'epoch': 0.0}


  6%|▌         | 566/10000 [26:40<9:46:32,  3.73s/it]

{'loss': 0.7692, 'grad_norm': 0.319715291261673, 'learning_rate': 0.00018877438719359682, 'epoch': 0.0}


  6%|▌         | 567/10000 [26:43<9:22:33,  3.58s/it]

{'loss': 1.0893, 'grad_norm': 0.38725000619888306, 'learning_rate': 0.0001887543771885943, 'epoch': 0.0}


  6%|▌         | 568/10000 [26:47<9:34:40,  3.66s/it]

{'loss': 0.9217, 'grad_norm': 0.4157682955265045, 'learning_rate': 0.0001887343671835918, 'epoch': 0.0}


  6%|▌         | 569/10000 [26:51<9:49:33,  3.75s/it]

{'loss': 0.9871, 'grad_norm': 0.3141437768936157, 'learning_rate': 0.0001887143571785893, 'epoch': 0.0}


  6%|▌         | 570/10000 [26:54<9:30:13,  3.63s/it]

{'loss': 1.1876, 'grad_norm': 0.39207378029823303, 'learning_rate': 0.0001886943471735868, 'epoch': 0.0}


  6%|▌         | 571/10000 [26:57<9:02:53,  3.45s/it]

{'loss': 0.7845, 'grad_norm': 0.33634594082832336, 'learning_rate': 0.0001886743371685843, 'epoch': 0.0}


  6%|▌         | 572/10000 [27:00<8:50:04,  3.37s/it]

{'loss': 0.9073, 'grad_norm': 0.3562537729740143, 'learning_rate': 0.0001886543271635818, 'epoch': 0.0}


  6%|▌         | 573/10000 [27:04<8:53:55,  3.40s/it]

{'loss': 0.8216, 'grad_norm': 0.33549070358276367, 'learning_rate': 0.0001886343171585793, 'epoch': 0.0}


  6%|▌         | 574/10000 [27:07<8:29:29,  3.24s/it]

{'loss': 0.9181, 'grad_norm': 0.4658437967300415, 'learning_rate': 0.0001886143071535768, 'epoch': 0.0}


  6%|▌         | 575/10000 [27:10<8:55:07,  3.41s/it]

{'loss': 1.2713, 'grad_norm': 0.4055865406990051, 'learning_rate': 0.00018859429714857432, 'epoch': 0.0}


  6%|▌         | 576/10000 [27:14<8:57:43,  3.42s/it]

{'loss': 0.9775, 'grad_norm': 0.3704172968864441, 'learning_rate': 0.0001885742871435718, 'epoch': 0.0}


  6%|▌         | 577/10000 [27:18<9:29:47,  3.63s/it]

{'loss': 0.8866, 'grad_norm': 0.34355103969573975, 'learning_rate': 0.0001885542771385693, 'epoch': 0.0}


  6%|▌         | 578/10000 [27:22<10:06:39,  3.86s/it]

{'loss': 1.4725, 'grad_norm': 0.3698221445083618, 'learning_rate': 0.00018853426713356678, 'epoch': 0.0}


  6%|▌         | 579/10000 [27:26<9:46:06,  3.73s/it] 

{'loss': 1.0082, 'grad_norm': 0.3442261517047882, 'learning_rate': 0.0001885142571285643, 'epoch': 0.0}


  6%|▌         | 580/10000 [27:30<10:07:12,  3.87s/it]

{'loss': 0.9689, 'grad_norm': 0.30600741505622864, 'learning_rate': 0.00018849424712356178, 'epoch': 0.0}


  6%|▌         | 581/10000 [27:33<9:09:32,  3.50s/it] 

{'loss': 1.0633, 'grad_norm': 0.4432365298271179, 'learning_rate': 0.0001884742371185593, 'epoch': 0.0}


  6%|▌         | 582/10000 [27:36<8:59:48,  3.44s/it]

{'loss': 0.8396, 'grad_norm': 0.33134013414382935, 'learning_rate': 0.00018845422711355678, 'epoch': 0.0}


  6%|▌         | 583/10000 [27:40<9:44:26,  3.72s/it]

{'loss': 0.8499, 'grad_norm': 0.3827453851699829, 'learning_rate': 0.0001884342171085543, 'epoch': 0.0}


  6%|▌         | 584/10000 [27:43<9:23:40,  3.59s/it]

{'loss': 0.8687, 'grad_norm': 0.4880308210849762, 'learning_rate': 0.00018841420710355179, 'epoch': 0.0}


  6%|▌         | 585/10000 [27:47<9:29:58,  3.63s/it]

{'loss': 1.2371, 'grad_norm': 0.37188205122947693, 'learning_rate': 0.00018839419709854927, 'epoch': 0.0}


  6%|▌         | 586/10000 [27:51<9:15:53,  3.54s/it]

{'loss': 0.8683, 'grad_norm': 0.34555909037590027, 'learning_rate': 0.00018837418709354676, 'epoch': 0.0}


  6%|▌         | 587/10000 [27:54<8:50:38,  3.38s/it]

{'loss': 0.8697, 'grad_norm': 0.38097965717315674, 'learning_rate': 0.00018835417708854427, 'epoch': 0.0}


  6%|▌         | 588/10000 [27:57<8:38:50,  3.31s/it]

{'loss': 0.827, 'grad_norm': 0.3284815549850464, 'learning_rate': 0.0001883341670835418, 'epoch': 0.0}


  6%|▌         | 589/10000 [28:00<8:41:40,  3.33s/it]

{'loss': 1.1549, 'grad_norm': 0.3648603558540344, 'learning_rate': 0.00018831415707853928, 'epoch': 0.0}


  6%|▌         | 590/10000 [28:04<9:01:17,  3.45s/it]

{'loss': 0.9773, 'grad_norm': 0.3834102153778076, 'learning_rate': 0.0001882941470735368, 'epoch': 0.0}


  6%|▌         | 591/10000 [28:09<10:11:35,  3.90s/it]

{'loss': 0.9623, 'grad_norm': 0.3152395784854889, 'learning_rate': 0.00018827413706853428, 'epoch': 0.0}


  6%|▌         | 592/10000 [28:12<9:58:53,  3.82s/it] 

{'loss': 0.9417, 'grad_norm': 0.37741488218307495, 'learning_rate': 0.0001882541270635318, 'epoch': 0.0}


  6%|▌         | 593/10000 [28:16<9:35:45,  3.67s/it]

{'loss': 1.0283, 'grad_norm': 0.43601083755493164, 'learning_rate': 0.00018823411705852928, 'epoch': 0.0}


  6%|▌         | 594/10000 [28:19<9:30:59,  3.64s/it]

{'loss': 0.8618, 'grad_norm': 0.3388407528400421, 'learning_rate': 0.00018821410705352677, 'epoch': 0.0}


  6%|▌         | 595/10000 [28:22<9:02:17,  3.46s/it]

{'loss': 1.0625, 'grad_norm': 0.36364296078681946, 'learning_rate': 0.00018819409704852425, 'epoch': 0.0}


  6%|▌         | 596/10000 [28:25<8:44:22,  3.35s/it]

{'loss': 0.8062, 'grad_norm': 0.4362773597240448, 'learning_rate': 0.00018817408704352177, 'epoch': 0.0}


  6%|▌         | 597/10000 [28:29<8:32:53,  3.27s/it]

{'loss': 0.8837, 'grad_norm': 0.3579747676849365, 'learning_rate': 0.00018815407703851926, 'epoch': 0.0}


  6%|▌         | 598/10000 [28:32<8:40:27,  3.32s/it]

{'loss': 0.8621, 'grad_norm': 0.3327762186527252, 'learning_rate': 0.00018813406703351677, 'epoch': 0.0}


  6%|▌         | 599/10000 [28:35<8:09:56,  3.13s/it]

{'loss': 0.901, 'grad_norm': 0.443450927734375, 'learning_rate': 0.00018811405702851426, 'epoch': 0.0}


  6%|▌         | 600/10000 [28:37<7:58:24,  3.05s/it]

{'loss': 0.7629, 'grad_norm': 0.4481765329837799, 'learning_rate': 0.00018809404702351177, 'epoch': 0.0}


  6%|▌         | 601/10000 [28:41<8:28:33,  3.25s/it]

{'loss': 0.8933, 'grad_norm': 0.38860827684402466, 'learning_rate': 0.00018807403701850926, 'epoch': 0.0}


  6%|▌         | 602/10000 [28:44<8:17:25,  3.18s/it]

{'loss': 0.822, 'grad_norm': 0.3740839958190918, 'learning_rate': 0.00018805402701350677, 'epoch': 0.0}


  6%|▌         | 603/10000 [28:48<8:44:34,  3.35s/it]

{'loss': 1.0276, 'grad_norm': 0.3516600430011749, 'learning_rate': 0.00018803401700850426, 'epoch': 0.0}


  6%|▌         | 604/10000 [28:52<9:08:35,  3.50s/it]

{'loss': 0.8386, 'grad_norm': 0.32566338777542114, 'learning_rate': 0.00018801400700350175, 'epoch': 0.0}


  6%|▌         | 605/10000 [28:56<9:17:45,  3.56s/it]

{'loss': 1.011, 'grad_norm': 0.4179529845714569, 'learning_rate': 0.00018799399699849926, 'epoch': 0.0}


  6%|▌         | 606/10000 [28:58<8:42:16,  3.34s/it]

{'loss': 0.8222, 'grad_norm': 0.38271576166152954, 'learning_rate': 0.00018797398699349675, 'epoch': 0.0}


  6%|▌         | 607/10000 [29:02<9:02:49,  3.47s/it]

{'loss': 1.0333, 'grad_norm': 0.38326892256736755, 'learning_rate': 0.00018795397698849426, 'epoch': 0.0}


  6%|▌         | 608/10000 [29:05<8:44:53,  3.35s/it]

{'loss': 1.1869, 'grad_norm': 0.39461153745651245, 'learning_rate': 0.00018793396698349175, 'epoch': 0.0}


  6%|▌         | 609/10000 [29:09<8:48:58,  3.38s/it]

{'loss': 0.8203, 'grad_norm': 0.35152870416641235, 'learning_rate': 0.00018791395697848927, 'epoch': 0.0}


  6%|▌         | 610/10000 [29:12<8:40:43,  3.33s/it]

{'loss': 0.8369, 'grad_norm': 0.38518479466438293, 'learning_rate': 0.00018789394697348675, 'epoch': 0.0}


  6%|▌         | 611/10000 [29:15<8:34:21,  3.29s/it]

{'loss': 0.8215, 'grad_norm': 0.35887250304222107, 'learning_rate': 0.00018787393696848427, 'epoch': 0.0}


  6%|▌         | 612/10000 [29:19<8:49:06,  3.38s/it]

{'loss': 0.9125, 'grad_norm': 0.37530672550201416, 'learning_rate': 0.00018785392696348175, 'epoch': 0.0}


  6%|▌         | 613/10000 [29:22<8:51:55,  3.40s/it]

{'loss': 1.0972, 'grad_norm': 0.40030303597450256, 'learning_rate': 0.00018783391695847924, 'epoch': 0.0}


  6%|▌         | 614/10000 [29:26<9:15:22,  3.55s/it]

{'loss': 1.2625, 'grad_norm': 0.3725889027118683, 'learning_rate': 0.00018781390695347673, 'epoch': 0.0}


  6%|▌         | 615/10000 [29:30<9:14:53,  3.55s/it]

{'loss': 0.8733, 'grad_norm': 0.3702237904071808, 'learning_rate': 0.00018779389694847424, 'epoch': 0.0}


  6%|▌         | 616/10000 [29:33<9:17:54,  3.57s/it]

{'loss': 0.9439, 'grad_norm': 0.3954107463359833, 'learning_rate': 0.00018777388694347176, 'epoch': 0.0}


  6%|▌         | 617/10000 [29:36<8:51:27,  3.40s/it]

{'loss': 1.0206, 'grad_norm': 0.3853943943977356, 'learning_rate': 0.00018775387693846925, 'epoch': 0.0}


  6%|▌         | 618/10000 [29:39<8:46:42,  3.37s/it]

{'loss': 0.9074, 'grad_norm': 0.35898926854133606, 'learning_rate': 0.00018773386693346676, 'epoch': 0.0}


  6%|▌         | 619/10000 [29:43<8:48:16,  3.38s/it]

{'loss': 1.0514, 'grad_norm': 0.3877141773700714, 'learning_rate': 0.00018771385692846425, 'epoch': 0.0}


  6%|▌         | 620/10000 [29:47<9:21:21,  3.59s/it]

{'loss': 0.9448, 'grad_norm': 0.38449469208717346, 'learning_rate': 0.00018769384692346173, 'epoch': 0.0}


  6%|▌         | 621/10000 [29:51<9:23:29,  3.60s/it]

{'loss': 0.9762, 'grad_norm': 0.3781231939792633, 'learning_rate': 0.00018767383691845922, 'epoch': 0.0}


  6%|▌         | 622/10000 [29:54<9:06:49,  3.50s/it]

{'loss': 0.8395, 'grad_norm': 0.4055292308330536, 'learning_rate': 0.00018765382691345674, 'epoch': 0.0}


  6%|▌         | 623/10000 [29:57<9:13:01,  3.54s/it]

{'loss': 1.0875, 'grad_norm': 0.41881707310676575, 'learning_rate': 0.00018763381690845422, 'epoch': 0.0}


  6%|▌         | 624/10000 [30:00<8:43:15,  3.35s/it]

{'loss': 0.8355, 'grad_norm': 0.3993995487689972, 'learning_rate': 0.00018761380690345174, 'epoch': 0.0}


  6%|▋         | 625/10000 [30:04<8:57:24,  3.44s/it]

{'loss': 0.954, 'grad_norm': 0.3822742998600006, 'learning_rate': 0.00018759379689844922, 'epoch': 0.0}


  6%|▋         | 626/10000 [30:08<9:07:48,  3.51s/it]

{'loss': 0.7597, 'grad_norm': 0.30454158782958984, 'learning_rate': 0.00018757378689344674, 'epoch': 0.0}


  6%|▋         | 627/10000 [30:11<8:43:08,  3.35s/it]

{'loss': 0.6525, 'grad_norm': 0.333965003490448, 'learning_rate': 0.00018755377688844423, 'epoch': 0.0}


  6%|▋         | 628/10000 [30:15<9:22:58,  3.60s/it]

{'loss': 1.0241, 'grad_norm': 0.37638208270072937, 'learning_rate': 0.00018753376688344174, 'epoch': 0.0}


  6%|▋         | 629/10000 [30:20<10:18:34,  3.96s/it]

{'loss': 1.0434, 'grad_norm': 0.3263445496559143, 'learning_rate': 0.00018751375687843923, 'epoch': 0.0}


  6%|▋         | 630/10000 [30:24<10:27:26,  4.02s/it]

{'loss': 0.9774, 'grad_norm': 0.38201335072517395, 'learning_rate': 0.00018749374687343672, 'epoch': 0.0}


  6%|▋         | 631/10000 [30:27<9:44:01,  3.74s/it] 

{'loss': 0.6963, 'grad_norm': 0.3648563325405121, 'learning_rate': 0.00018747373686843423, 'epoch': 0.0}


  6%|▋         | 632/10000 [30:30<9:15:22,  3.56s/it]

{'loss': 0.9815, 'grad_norm': 0.3547489047050476, 'learning_rate': 0.00018745372686343172, 'epoch': 0.0}


  6%|▋         | 633/10000 [30:33<8:54:14,  3.42s/it]

{'loss': 0.9848, 'grad_norm': 0.39159658551216125, 'learning_rate': 0.00018743371685842923, 'epoch': 0.0}


  6%|▋         | 634/10000 [30:36<8:40:17,  3.33s/it]

{'loss': 0.7928, 'grad_norm': 0.3731968402862549, 'learning_rate': 0.00018741370685342672, 'epoch': 0.0}


  6%|▋         | 635/10000 [30:40<8:50:17,  3.40s/it]

{'loss': 0.9962, 'grad_norm': 0.4554360806941986, 'learning_rate': 0.00018739369684842423, 'epoch': 0.0}


  6%|▋         | 636/10000 [30:43<8:21:46,  3.22s/it]

{'loss': 0.8691, 'grad_norm': 0.38522544503211975, 'learning_rate': 0.00018737368684342172, 'epoch': 0.0}


  6%|▋         | 637/10000 [30:46<8:15:47,  3.18s/it]

{'loss': 0.9888, 'grad_norm': 0.4331510066986084, 'learning_rate': 0.00018735367683841923, 'epoch': 0.0}


  6%|▋         | 638/10000 [30:49<8:32:45,  3.29s/it]

{'loss': 0.9611, 'grad_norm': 0.35642579197883606, 'learning_rate': 0.00018733366683341672, 'epoch': 0.0}


  6%|▋         | 639/10000 [30:53<8:57:01,  3.44s/it]

{'loss': 0.8309, 'grad_norm': 0.3880768120288849, 'learning_rate': 0.0001873136568284142, 'epoch': 0.0}


  6%|▋         | 640/10000 [30:57<9:00:11,  3.46s/it]

{'loss': 0.9065, 'grad_norm': 0.40299472212791443, 'learning_rate': 0.0001872936468234117, 'epoch': 0.0}


  6%|▋         | 641/10000 [31:00<9:09:20,  3.52s/it]

{'loss': 0.8914, 'grad_norm': 0.3602827489376068, 'learning_rate': 0.0001872736368184092, 'epoch': 0.0}


  6%|▋         | 642/10000 [31:03<8:44:55,  3.37s/it]

{'loss': 0.9195, 'grad_norm': 0.4298144578933716, 'learning_rate': 0.0001872536268134067, 'epoch': 0.0}


  6%|▋         | 643/10000 [31:06<8:37:48,  3.32s/it]

{'loss': 1.1174, 'grad_norm': 0.4103243052959442, 'learning_rate': 0.0001872336168084042, 'epoch': 0.0}


  6%|▋         | 644/10000 [31:10<8:37:43,  3.32s/it]

{'loss': 0.7104, 'grad_norm': 0.34438368678092957, 'learning_rate': 0.00018721360680340173, 'epoch': 0.0}


  6%|▋         | 645/10000 [31:14<9:14:52,  3.56s/it]

{'loss': 1.2406, 'grad_norm': 0.3804425001144409, 'learning_rate': 0.00018719359679839921, 'epoch': 0.0}


  6%|▋         | 646/10000 [31:17<8:37:35,  3.32s/it]

{'loss': 0.9032, 'grad_norm': 0.3963061571121216, 'learning_rate': 0.00018717358679339673, 'epoch': 0.0}


  6%|▋         | 647/10000 [31:19<7:39:36,  2.95s/it]

{'loss': 0.6205, 'grad_norm': 0.3943762481212616, 'learning_rate': 0.00018715357678839422, 'epoch': 0.0}


  6%|▋         | 648/10000 [31:22<7:57:57,  3.07s/it]

{'loss': 1.0092, 'grad_norm': 0.39057326316833496, 'learning_rate': 0.0001871335667833917, 'epoch': 0.0}


  6%|▋         | 649/10000 [31:26<8:21:18,  3.22s/it]

{'loss': 1.0752, 'grad_norm': 0.3773234188556671, 'learning_rate': 0.0001871135567783892, 'epoch': 0.0}


  6%|▋         | 650/10000 [31:29<8:15:22,  3.18s/it]

{'loss': 1.0579, 'grad_norm': 0.38502317667007446, 'learning_rate': 0.0001870935467733867, 'epoch': 0.0}


  7%|▋         | 651/10000 [31:32<8:22:14,  3.22s/it]

{'loss': 0.9469, 'grad_norm': 0.3960608243942261, 'learning_rate': 0.0001870735367683842, 'epoch': 0.0}


  7%|▋         | 652/10000 [31:35<8:32:19,  3.29s/it]

{'loss': 1.2168, 'grad_norm': 0.34919318556785583, 'learning_rate': 0.0001870535267633817, 'epoch': 0.0}


  7%|▋         | 653/10000 [31:39<8:39:23,  3.33s/it]

{'loss': 0.6775, 'grad_norm': 0.2791585624217987, 'learning_rate': 0.0001870335167583792, 'epoch': 0.0}


  7%|▋         | 654/10000 [31:43<9:09:51,  3.53s/it]

{'loss': 0.8871, 'grad_norm': 0.3368067443370819, 'learning_rate': 0.0001870135067533767, 'epoch': 0.0}


  7%|▋         | 655/10000 [31:47<9:31:00,  3.67s/it]

{'loss': 1.0562, 'grad_norm': 0.36559465527534485, 'learning_rate': 0.0001869934967483742, 'epoch': 0.0}


  7%|▋         | 656/10000 [31:50<8:52:07,  3.42s/it]

{'loss': 0.7544, 'grad_norm': 0.3725094795227051, 'learning_rate': 0.00018697348674337168, 'epoch': 0.0}


  7%|▋         | 657/10000 [31:54<9:47:27,  3.77s/it]

{'loss': 1.0655, 'grad_norm': 0.34359294176101685, 'learning_rate': 0.0001869534767383692, 'epoch': 0.0}


  7%|▋         | 658/10000 [31:58<9:46:07,  3.76s/it]

{'loss': 1.0687, 'grad_norm': 0.3706395924091339, 'learning_rate': 0.00018693346673336668, 'epoch': 0.0}


  7%|▋         | 659/10000 [32:01<8:54:42,  3.43s/it]

{'loss': 0.7295, 'grad_norm': 0.4073144197463989, 'learning_rate': 0.0001869134567283642, 'epoch': 0.0}


  7%|▋         | 660/10000 [32:04<9:04:56,  3.50s/it]

{'loss': 1.0243, 'grad_norm': 0.36233389377593994, 'learning_rate': 0.00018689344672336169, 'epoch': 0.0}


  7%|▋         | 661/10000 [32:07<8:34:17,  3.30s/it]

{'loss': 0.872, 'grad_norm': 0.36799973249435425, 'learning_rate': 0.0001868734367183592, 'epoch': 0.0}


  7%|▋         | 662/10000 [32:11<8:43:51,  3.37s/it]

{'loss': 1.0758, 'grad_norm': 0.4122083783149719, 'learning_rate': 0.0001868534267133567, 'epoch': 0.0}


  7%|▋         | 663/10000 [32:15<9:17:39,  3.58s/it]

{'loss': 1.0663, 'grad_norm': 0.3406028747558594, 'learning_rate': 0.0001868334167083542, 'epoch': 0.0}


  7%|▋         | 664/10000 [32:18<9:05:47,  3.51s/it]

{'loss': 1.1302, 'grad_norm': 0.3947858214378357, 'learning_rate': 0.0001868134067033517, 'epoch': 0.0}


  7%|▋         | 665/10000 [32:22<9:15:28,  3.57s/it]

{'loss': 0.8398, 'grad_norm': 0.2899608314037323, 'learning_rate': 0.00018679339669834918, 'epoch': 0.0}


  7%|▋         | 666/10000 [32:26<9:55:06,  3.83s/it]

{'loss': 0.8661, 'grad_norm': 0.359756737947464, 'learning_rate': 0.00018677338669334666, 'epoch': 0.0}


  7%|▋         | 667/10000 [32:30<9:35:42,  3.70s/it]

{'loss': 0.6665, 'grad_norm': 0.32157278060913086, 'learning_rate': 0.00018675337668834418, 'epoch': 0.0}


  7%|▋         | 668/10000 [32:34<9:43:58,  3.75s/it]

{'loss': 0.8281, 'grad_norm': 0.3639836311340332, 'learning_rate': 0.00018673336668334167, 'epoch': 0.0}


  7%|▋         | 669/10000 [32:37<9:20:01,  3.60s/it]

{'loss': 1.055, 'grad_norm': 0.40305086970329285, 'learning_rate': 0.00018671335667833918, 'epoch': 0.0}


  7%|▋         | 670/10000 [32:40<8:57:13,  3.45s/it]

{'loss': 1.1198, 'grad_norm': 0.43184083700180054, 'learning_rate': 0.00018669334667333667, 'epoch': 0.0}


  7%|▋         | 671/10000 [32:43<8:50:35,  3.41s/it]

{'loss': 0.9586, 'grad_norm': 0.43324148654937744, 'learning_rate': 0.00018667333666833418, 'epoch': 0.0}


  7%|▋         | 672/10000 [32:48<10:03:57,  3.88s/it]

{'loss': 0.9253, 'grad_norm': 0.32633060216903687, 'learning_rate': 0.0001866533266633317, 'epoch': 0.0}


  7%|▋         | 673/10000 [32:52<9:53:30,  3.82s/it] 

{'loss': 0.9072, 'grad_norm': 0.3815358877182007, 'learning_rate': 0.00018663331665832918, 'epoch': 0.0}


  7%|▋         | 674/10000 [32:55<8:59:21,  3.47s/it]

{'loss': 1.1629, 'grad_norm': 0.4873751103878021, 'learning_rate': 0.00018661330665332667, 'epoch': 0.0}


  7%|▋         | 675/10000 [32:57<8:32:33,  3.30s/it]

{'loss': 0.9063, 'grad_norm': 0.3854769766330719, 'learning_rate': 0.00018659329664832416, 'epoch': 0.0}


  7%|▋         | 676/10000 [33:01<8:35:14,  3.32s/it]

{'loss': 0.8397, 'grad_norm': 0.3739876449108124, 'learning_rate': 0.00018657328664332167, 'epoch': 0.0}


  7%|▋         | 677/10000 [33:03<8:03:38,  3.11s/it]

{'loss': 0.806, 'grad_norm': 0.3773082494735718, 'learning_rate': 0.00018655327663831916, 'epoch': 0.0}


  7%|▋         | 678/10000 [33:07<8:12:05,  3.17s/it]

{'loss': 0.8634, 'grad_norm': 0.37726345658302307, 'learning_rate': 0.00018653326663331667, 'epoch': 0.0}


  7%|▋         | 679/10000 [33:10<8:16:56,  3.20s/it]

{'loss': 0.9701, 'grad_norm': 0.38584816455841064, 'learning_rate': 0.00018651325662831416, 'epoch': 0.0}


  7%|▋         | 680/10000 [33:17<11:31:16,  4.45s/it]

{'loss': 0.9156, 'grad_norm': 0.36598512530326843, 'learning_rate': 0.00018649324662331168, 'epoch': 0.0}


  7%|▋         | 681/10000 [33:20<9:58:53,  3.86s/it] 

{'loss': 0.8917, 'grad_norm': 0.436066210269928, 'learning_rate': 0.00018647323661830916, 'epoch': 0.0}


  7%|▋         | 682/10000 [33:25<10:49:34,  4.18s/it]

{'loss': 1.0739, 'grad_norm': 0.3619248569011688, 'learning_rate': 0.00018645322661330668, 'epoch': 0.0}


  7%|▋         | 683/10000 [33:29<10:52:29,  4.20s/it]

{'loss': 0.7607, 'grad_norm': 0.3691359758377075, 'learning_rate': 0.00018643321660830414, 'epoch': 0.0}


  7%|▋         | 684/10000 [33:32<9:46:01,  3.77s/it] 

{'loss': 0.9806, 'grad_norm': 0.4582274556159973, 'learning_rate': 0.00018641320660330165, 'epoch': 0.0}


  7%|▋         | 685/10000 [33:35<9:24:02,  3.63s/it]

{'loss': 0.7553, 'grad_norm': 0.32505807280540466, 'learning_rate': 0.00018639319659829917, 'epoch': 0.0}


  7%|▋         | 686/10000 [33:40<10:00:21,  3.87s/it]

{'loss': 0.6627, 'grad_norm': 0.2990792989730835, 'learning_rate': 0.00018637318659329665, 'epoch': 0.0}


  7%|▋         | 687/10000 [33:42<9:12:06,  3.56s/it] 

{'loss': 0.9619, 'grad_norm': 0.4041450321674347, 'learning_rate': 0.00018635317658829417, 'epoch': 0.0}


  7%|▋         | 688/10000 [33:46<9:17:53,  3.59s/it]

{'loss': 0.7538, 'grad_norm': 0.38873597979545593, 'learning_rate': 0.00018633316658329166, 'epoch': 0.0}


  7%|▋         | 689/10000 [33:51<10:02:10,  3.88s/it]

{'loss': 0.8779, 'grad_norm': 0.35729679465293884, 'learning_rate': 0.00018631315657828917, 'epoch': 0.0}


  7%|▋         | 690/10000 [33:54<9:39:15,  3.73s/it] 

{'loss': 0.7417, 'grad_norm': 0.32107093930244446, 'learning_rate': 0.00018629314657328666, 'epoch': 0.0}


  7%|▋         | 691/10000 [33:57<9:28:12,  3.66s/it]

{'loss': 1.2719, 'grad_norm': 0.3595728278160095, 'learning_rate': 0.00018627313656828414, 'epoch': 0.0}


  7%|▋         | 692/10000 [34:01<9:24:07,  3.64s/it]

{'loss': 1.1464, 'grad_norm': 0.40899458527565, 'learning_rate': 0.00018625312656328163, 'epoch': 0.0}


  7%|▋         | 693/10000 [34:04<9:08:36,  3.54s/it]

{'loss': 0.8083, 'grad_norm': 0.35763898491859436, 'learning_rate': 0.00018623311655827915, 'epoch': 0.0}


  7%|▋         | 694/10000 [34:07<8:42:32,  3.37s/it]

{'loss': 0.8876, 'grad_norm': 0.37267589569091797, 'learning_rate': 0.00018621310655327663, 'epoch': 0.0}


  7%|▋         | 695/10000 [34:11<8:50:25,  3.42s/it]

{'loss': 0.6843, 'grad_norm': 0.40252685546875, 'learning_rate': 0.00018619309654827415, 'epoch': 0.0}


  7%|▋         | 696/10000 [34:14<8:54:34,  3.45s/it]

{'loss': 0.748, 'grad_norm': 0.421100378036499, 'learning_rate': 0.00018617308654327163, 'epoch': 0.0}


  7%|▋         | 697/10000 [34:17<8:05:02,  3.13s/it]

{'loss': 0.9968, 'grad_norm': 0.4820738732814789, 'learning_rate': 0.00018615307653826915, 'epoch': 0.0}


  7%|▋         | 698/10000 [34:20<8:06:57,  3.14s/it]

{'loss': 1.0024, 'grad_norm': 0.3702763020992279, 'learning_rate': 0.00018613306653326664, 'epoch': 0.0}


  7%|▋         | 699/10000 [34:24<8:39:29,  3.35s/it]

{'loss': 1.1535, 'grad_norm': 0.3502129018306732, 'learning_rate': 0.00018611305652826415, 'epoch': 0.0}


  7%|▋         | 700/10000 [34:28<9:35:15,  3.71s/it]

{'loss': 0.8629, 'grad_norm': 0.28970909118652344, 'learning_rate': 0.00018609304652326164, 'epoch': 0.0}


  7%|▋         | 701/10000 [34:33<10:07:08,  3.92s/it]

{'loss': 0.6552, 'grad_norm': 0.3724752962589264, 'learning_rate': 0.00018607303651825913, 'epoch': 0.0}


  7%|▋         | 702/10000 [34:36<9:27:16,  3.66s/it] 

{'loss': 0.9941, 'grad_norm': 0.40530577301979065, 'learning_rate': 0.00018605302651325664, 'epoch': 0.0}


  7%|▋         | 703/10000 [34:39<9:09:22,  3.55s/it]

{'loss': 0.7099, 'grad_norm': 0.38461825251579285, 'learning_rate': 0.00018603301650825413, 'epoch': 0.0}


  7%|▋         | 704/10000 [34:42<8:53:37,  3.44s/it]

{'loss': 1.014, 'grad_norm': 0.39363694190979004, 'learning_rate': 0.00018601300650325164, 'epoch': 0.0}


  7%|▋         | 705/10000 [34:45<8:23:18,  3.25s/it]

{'loss': 0.8433, 'grad_norm': 0.42682603001594543, 'learning_rate': 0.00018599299649824913, 'epoch': 0.0}


  7%|▋         | 706/10000 [34:48<8:00:39,  3.10s/it]

{'loss': 0.955, 'grad_norm': 0.4328311085700989, 'learning_rate': 0.00018597298649324664, 'epoch': 0.0}


  7%|▋         | 707/10000 [34:52<8:47:05,  3.40s/it]

{'loss': 1.0771, 'grad_norm': 0.3467220962047577, 'learning_rate': 0.00018595297648824413, 'epoch': 0.0}


  7%|▋         | 708/10000 [34:56<9:38:14,  3.73s/it]

{'loss': 1.064, 'grad_norm': 0.3532979190349579, 'learning_rate': 0.00018593296648324164, 'epoch': 0.0}


  7%|▋         | 709/10000 [35:00<9:40:17,  3.75s/it]

{'loss': 0.9805, 'grad_norm': 0.36199337244033813, 'learning_rate': 0.00018591295647823913, 'epoch': 0.0}


  7%|▋         | 710/10000 [35:03<9:00:25,  3.49s/it]

{'loss': 0.6106, 'grad_norm': 0.3587835133075714, 'learning_rate': 0.00018589294647323662, 'epoch': 0.0}


  7%|▋         | 711/10000 [35:08<9:43:48,  3.77s/it]

{'loss': 0.9414, 'grad_norm': 0.3255600035190582, 'learning_rate': 0.0001858729364682341, 'epoch': 0.0}


  7%|▋         | 712/10000 [35:11<9:46:07,  3.79s/it]

{'loss': 1.1936, 'grad_norm': 0.35396522283554077, 'learning_rate': 0.00018585292646323162, 'epoch': 0.0}


  7%|▋         | 713/10000 [35:16<10:06:06,  3.92s/it]

{'loss': 1.107, 'grad_norm': 0.34730786085128784, 'learning_rate': 0.00018583291645822914, 'epoch': 0.0}


  7%|▋         | 714/10000 [35:18<9:17:53,  3.60s/it] 

{'loss': 1.0072, 'grad_norm': 0.4183470606803894, 'learning_rate': 0.00018581290645322662, 'epoch': 0.0}


  7%|▋         | 715/10000 [35:21<8:23:56,  3.26s/it]

{'loss': 0.8077, 'grad_norm': 0.43214118480682373, 'learning_rate': 0.00018579289644822414, 'epoch': 0.0}


  7%|▋         | 716/10000 [35:24<8:21:39,  3.24s/it]

{'loss': 1.1208, 'grad_norm': 0.43162205815315247, 'learning_rate': 0.00018577288644322162, 'epoch': 0.0}


  7%|▋         | 717/10000 [35:27<7:51:42,  3.05s/it]

{'loss': 0.757, 'grad_norm': 0.3776169419288635, 'learning_rate': 0.00018575287643821914, 'epoch': 0.0}


  7%|▋         | 718/10000 [35:31<9:03:05,  3.51s/it]

{'loss': 0.8801, 'grad_norm': 0.3435533344745636, 'learning_rate': 0.00018573286643321663, 'epoch': 0.0}


  7%|▋         | 719/10000 [35:35<9:04:30,  3.52s/it]

{'loss': 1.2037, 'grad_norm': 0.3901768624782562, 'learning_rate': 0.0001857128564282141, 'epoch': 0.0}


  7%|▋         | 720/10000 [35:38<8:40:37,  3.37s/it]

{'loss': 1.1388, 'grad_norm': 0.41437169909477234, 'learning_rate': 0.0001856928464232116, 'epoch': 0.0}


  7%|▋         | 721/10000 [35:41<8:13:35,  3.19s/it]

{'loss': 1.105, 'grad_norm': 0.4378865361213684, 'learning_rate': 0.00018567283641820911, 'epoch': 0.0}


  7%|▋         | 722/10000 [35:43<7:49:09,  3.03s/it]

{'loss': 0.8737, 'grad_norm': 0.40996745228767395, 'learning_rate': 0.0001856528264132066, 'epoch': 0.0}


  7%|▋         | 723/10000 [35:47<8:05:57,  3.14s/it]

{'loss': 0.9634, 'grad_norm': 0.3455405533313751, 'learning_rate': 0.00018563281640820412, 'epoch': 0.0}


  7%|▋         | 724/10000 [35:50<8:01:22,  3.11s/it]

{'loss': 0.8237, 'grad_norm': 0.3490948975086212, 'learning_rate': 0.0001856128064032016, 'epoch': 0.0}


  7%|▋         | 725/10000 [35:53<8:14:51,  3.20s/it]

{'loss': 0.842, 'grad_norm': 0.347336083650589, 'learning_rate': 0.00018559279639819912, 'epoch': 0.0}


  7%|▋         | 726/10000 [35:58<9:14:31,  3.59s/it]

{'loss': 0.9917, 'grad_norm': 0.3283672332763672, 'learning_rate': 0.0001855727863931966, 'epoch': 0.0}


  7%|▋         | 727/10000 [36:01<8:57:48,  3.48s/it]

{'loss': 1.1522, 'grad_norm': 0.3846411406993866, 'learning_rate': 0.0001855527763881941, 'epoch': 0.0}


  7%|▋         | 728/10000 [36:05<9:08:23,  3.55s/it]

{'loss': 0.9868, 'grad_norm': 0.41495776176452637, 'learning_rate': 0.0001855327663831916, 'epoch': 0.0}


  7%|▋         | 729/10000 [36:07<8:36:38,  3.34s/it]

{'loss': 1.0521, 'grad_norm': 0.44936949014663696, 'learning_rate': 0.0001855127563781891, 'epoch': 0.0}


  7%|▋         | 730/10000 [36:11<8:56:41,  3.47s/it]

{'loss': 0.9087, 'grad_norm': 0.3492748737335205, 'learning_rate': 0.0001854927463731866, 'epoch': 0.0}


  7%|▋         | 731/10000 [36:14<8:26:58,  3.28s/it]

{'loss': 0.9647, 'grad_norm': 0.38667798042297363, 'learning_rate': 0.0001854727363681841, 'epoch': 0.0}


  7%|▋         | 732/10000 [36:19<9:25:52,  3.66s/it]

{'loss': 1.058, 'grad_norm': 0.31199875473976135, 'learning_rate': 0.0001854527263631816, 'epoch': 0.0}


  7%|▋         | 733/10000 [36:22<9:00:42,  3.50s/it]

{'loss': 1.1635, 'grad_norm': 0.4300200939178467, 'learning_rate': 0.0001854327163581791, 'epoch': 0.0}


  7%|▋         | 734/10000 [36:25<8:33:28,  3.32s/it]

{'loss': 0.8121, 'grad_norm': 0.38608434796333313, 'learning_rate': 0.0001854127063531766, 'epoch': 0.0}


  7%|▋         | 735/10000 [36:28<8:48:37,  3.42s/it]

{'loss': 0.8688, 'grad_norm': 0.38631314039230347, 'learning_rate': 0.0001853926963481741, 'epoch': 0.0}


  7%|▋         | 736/10000 [36:31<8:34:55,  3.34s/it]

{'loss': 0.7308, 'grad_norm': 0.36423468589782715, 'learning_rate': 0.0001853726863431716, 'epoch': 0.0}


  7%|▋         | 737/10000 [36:35<8:37:28,  3.35s/it]

{'loss': 0.7626, 'grad_norm': 0.36795008182525635, 'learning_rate': 0.00018535267633816907, 'epoch': 0.0}


  7%|▋         | 738/10000 [36:39<9:05:40,  3.53s/it]

{'loss': 0.6037, 'grad_norm': 0.31462952494621277, 'learning_rate': 0.0001853326663331666, 'epoch': 0.0}


  7%|▋         | 739/10000 [36:42<8:52:26,  3.45s/it]

{'loss': 0.9472, 'grad_norm': 0.3781677186489105, 'learning_rate': 0.00018531265632816408, 'epoch': 0.0}


  7%|▋         | 740/10000 [36:46<9:04:55,  3.53s/it]

{'loss': 0.9999, 'grad_norm': 0.3945625126361847, 'learning_rate': 0.0001852926463231616, 'epoch': 0.0}


  7%|▋         | 741/10000 [36:49<9:10:24,  3.57s/it]

{'loss': 1.0769, 'grad_norm': 0.44426584243774414, 'learning_rate': 0.0001852726363181591, 'epoch': 0.0}


  7%|▋         | 742/10000 [36:54<9:43:08,  3.78s/it]

{'loss': 0.8017, 'grad_norm': 0.3228326439857483, 'learning_rate': 0.0001852526263131566, 'epoch': 0.0}


  7%|▋         | 743/10000 [36:58<9:49:25,  3.82s/it]

{'loss': 0.8555, 'grad_norm': 0.3636711537837982, 'learning_rate': 0.0001852326163081541, 'epoch': 0.0}


  7%|▋         | 744/10000 [37:01<9:10:04,  3.57s/it]

{'loss': 0.759, 'grad_norm': 0.3762734830379486, 'learning_rate': 0.0001852126063031516, 'epoch': 0.0}


  7%|▋         | 745/10000 [37:03<8:24:44,  3.27s/it]

{'loss': 0.9952, 'grad_norm': 0.4358787536621094, 'learning_rate': 0.00018519259629814908, 'epoch': 0.0}


  7%|▋         | 746/10000 [37:06<7:51:04,  3.05s/it]

{'loss': 0.7974, 'grad_norm': 0.4608517587184906, 'learning_rate': 0.00018517258629314657, 'epoch': 0.0}


  7%|▋         | 747/10000 [37:11<9:48:04,  3.81s/it]

{'loss': 1.2828, 'grad_norm': 0.33932024240493774, 'learning_rate': 0.00018515257628814408, 'epoch': 0.0}


  7%|▋         | 748/10000 [37:14<9:20:01,  3.63s/it]

{'loss': 1.1126, 'grad_norm': 0.38393181562423706, 'learning_rate': 0.00018513256628314157, 'epoch': 0.0}


  7%|▋         | 749/10000 [37:18<9:00:26,  3.51s/it]

{'loss': 0.8004, 'grad_norm': 0.409419447183609, 'learning_rate': 0.00018511255627813908, 'epoch': 0.0}


  8%|▊         | 750/10000 [37:21<8:53:19,  3.46s/it]

{'loss': 0.9131, 'grad_norm': 0.417930543422699, 'learning_rate': 0.00018509254627313657, 'epoch': 0.0}


  8%|▊         | 751/10000 [37:25<8:59:05,  3.50s/it]

{'loss': 1.0023, 'grad_norm': 0.39737287163734436, 'learning_rate': 0.00018507253626813409, 'epoch': 0.0}


  8%|▊         | 752/10000 [37:29<9:32:00,  3.71s/it]

{'loss': 1.0579, 'grad_norm': 0.3845842480659485, 'learning_rate': 0.00018505252626313157, 'epoch': 0.0}


  8%|▊         | 753/10000 [37:33<9:31:19,  3.71s/it]

{'loss': 0.8657, 'grad_norm': 0.368190199136734, 'learning_rate': 0.0001850325162581291, 'epoch': 0.0}


  8%|▊         | 754/10000 [37:36<8:58:40,  3.50s/it]

{'loss': 0.8919, 'grad_norm': 0.3887269198894501, 'learning_rate': 0.00018501250625312657, 'epoch': 0.0}


  8%|▊         | 755/10000 [37:39<9:09:57,  3.57s/it]

{'loss': 0.8535, 'grad_norm': 0.30415114760398865, 'learning_rate': 0.00018499249624812406, 'epoch': 0.0}


  8%|▊         | 756/10000 [37:42<8:52:55,  3.46s/it]

{'loss': 1.019, 'grad_norm': 0.4179995357990265, 'learning_rate': 0.00018497248624312158, 'epoch': 0.0}


  8%|▊         | 757/10000 [37:45<8:06:31,  3.16s/it]

{'loss': 0.7301, 'grad_norm': 0.42621979117393494, 'learning_rate': 0.00018495247623811906, 'epoch': 0.0}


  8%|▊         | 758/10000 [37:48<8:07:54,  3.17s/it]

{'loss': 0.8799, 'grad_norm': 0.37148404121398926, 'learning_rate': 0.00018493246623311658, 'epoch': 0.0}


  8%|▊         | 759/10000 [37:52<8:34:04,  3.34s/it]

{'loss': 0.8896, 'grad_norm': 0.3635939061641693, 'learning_rate': 0.00018491245622811407, 'epoch': 0.0}


  8%|▊         | 760/10000 [37:56<8:56:59,  3.49s/it]

{'loss': 0.8911, 'grad_norm': 0.3326662480831146, 'learning_rate': 0.00018489244622311158, 'epoch': 0.0}


  8%|▊         | 761/10000 [38:00<9:44:02,  3.79s/it]

{'loss': 1.0239, 'grad_norm': 0.3238789141178131, 'learning_rate': 0.00018487243621810907, 'epoch': 0.0}


  8%|▊         | 762/10000 [38:04<9:23:26,  3.66s/it]

{'loss': 0.9349, 'grad_norm': 0.4081850051879883, 'learning_rate': 0.00018485242621310655, 'epoch': 0.0}


  8%|▊         | 763/10000 [38:06<8:41:13,  3.39s/it]

{'loss': 0.8133, 'grad_norm': 0.38218432664871216, 'learning_rate': 0.00018483241620810404, 'epoch': 0.0}


  8%|▊         | 764/10000 [38:10<9:06:26,  3.55s/it]

{'loss': 1.1196, 'grad_norm': 0.36922183632850647, 'learning_rate': 0.00018481240620310156, 'epoch': 0.0}


  8%|▊         | 765/10000 [38:13<8:50:26,  3.45s/it]

{'loss': 0.9994, 'grad_norm': 0.4031038284301758, 'learning_rate': 0.00018479239619809904, 'epoch': 0.0}


  8%|▊         | 766/10000 [38:16<8:28:38,  3.31s/it]

{'loss': 1.0243, 'grad_norm': 0.3898797631263733, 'learning_rate': 0.00018477238619309656, 'epoch': 0.0}


  8%|▊         | 767/10000 [38:20<8:43:37,  3.40s/it]

{'loss': 0.8139, 'grad_norm': 0.3474158048629761, 'learning_rate': 0.00018475237618809404, 'epoch': 0.0}


  8%|▊         | 768/10000 [38:23<8:03:56,  3.15s/it]

{'loss': 0.8726, 'grad_norm': 0.39563408493995667, 'learning_rate': 0.00018473236618309156, 'epoch': 0.0}


  8%|▊         | 769/10000 [38:26<7:58:10,  3.11s/it]

{'loss': 0.93, 'grad_norm': 0.38376128673553467, 'learning_rate': 0.00018471235617808907, 'epoch': 0.0}


  8%|▊         | 770/10000 [38:29<8:08:39,  3.18s/it]

{'loss': 1.0601, 'grad_norm': 0.3968249559402466, 'learning_rate': 0.00018469234617308656, 'epoch': 0.0}


  8%|▊         | 771/10000 [38:32<8:00:15,  3.12s/it]

{'loss': 0.7761, 'grad_norm': 0.3842126429080963, 'learning_rate': 0.00018467233616808405, 'epoch': 0.0}


  8%|▊         | 772/10000 [38:34<7:27:37,  2.91s/it]

{'loss': 0.8452, 'grad_norm': 0.40052738785743713, 'learning_rate': 0.00018465232616308154, 'epoch': 0.0}


  8%|▊         | 773/10000 [38:38<8:02:49,  3.14s/it]

{'loss': 0.9367, 'grad_norm': 0.3315902054309845, 'learning_rate': 0.00018463231615807905, 'epoch': 0.0}


  8%|▊         | 774/10000 [38:42<8:46:46,  3.43s/it]

{'loss': 0.9231, 'grad_norm': 0.32421550154685974, 'learning_rate': 0.00018461230615307654, 'epoch': 0.0}


  8%|▊         | 775/10000 [38:45<8:41:01,  3.39s/it]

{'loss': 0.8762, 'grad_norm': 0.40624016523361206, 'learning_rate': 0.00018459229614807405, 'epoch': 0.0}


  8%|▊         | 776/10000 [38:49<8:39:36,  3.38s/it]

{'loss': 0.8628, 'grad_norm': 0.36123111844062805, 'learning_rate': 0.00018457228614307154, 'epoch': 0.0}


  8%|▊         | 777/10000 [38:52<8:11:50,  3.20s/it]

{'loss': 0.9566, 'grad_norm': 0.442329078912735, 'learning_rate': 0.00018455227613806905, 'epoch': 0.0}


  8%|▊         | 778/10000 [38:54<7:45:10,  3.03s/it]

{'loss': 0.8799, 'grad_norm': 0.46675387024879456, 'learning_rate': 0.00018453226613306654, 'epoch': 0.0}


  8%|▊         | 779/10000 [38:58<8:13:13,  3.21s/it]

{'loss': 1.0083, 'grad_norm': 0.3726014494895935, 'learning_rate': 0.00018451225612806405, 'epoch': 0.0}


  8%|▊         | 780/10000 [39:01<8:17:44,  3.24s/it]

{'loss': 0.7148, 'grad_norm': 0.31929269433021545, 'learning_rate': 0.00018449224612306154, 'epoch': 0.0}


  8%|▊         | 781/10000 [39:04<8:12:18,  3.20s/it]

{'loss': 0.7559, 'grad_norm': 0.37341511249542236, 'learning_rate': 0.00018447223611805903, 'epoch': 0.0}


  8%|▊         | 782/10000 [39:08<8:15:51,  3.23s/it]

{'loss': 1.1029, 'grad_norm': 0.41370847821235657, 'learning_rate': 0.00018445222611305654, 'epoch': 0.0}


  8%|▊         | 783/10000 [39:11<8:14:18,  3.22s/it]

{'loss': 0.702, 'grad_norm': 0.32482197880744934, 'learning_rate': 0.00018443221610805403, 'epoch': 0.0}


  8%|▊         | 784/10000 [39:15<8:53:56,  3.48s/it]

{'loss': 1.0616, 'grad_norm': 0.3650408089160919, 'learning_rate': 0.00018441220610305155, 'epoch': 0.0}


  8%|▊         | 785/10000 [39:18<8:40:55,  3.39s/it]

{'loss': 0.8871, 'grad_norm': 0.4328906536102295, 'learning_rate': 0.00018439219609804903, 'epoch': 0.0}


  8%|▊         | 786/10000 [39:23<9:55:40,  3.88s/it]

{'loss': 1.1104, 'grad_norm': 0.35468554496765137, 'learning_rate': 0.00018437218609304655, 'epoch': 0.0}


  8%|▊         | 787/10000 [39:27<9:47:17,  3.82s/it]

{'loss': 1.2459, 'grad_norm': 0.3593335449695587, 'learning_rate': 0.00018435217608804403, 'epoch': 0.0}


  8%|▊         | 788/10000 [39:31<10:09:45,  3.97s/it]

{'loss': 1.0854, 'grad_norm': 0.3359861671924591, 'learning_rate': 0.00018433216608304155, 'epoch': 0.0}


  8%|▊         | 789/10000 [39:34<9:26:51,  3.69s/it]

{'loss': 0.9342, 'grad_norm': 0.3619858920574188, 'learning_rate': 0.000184312156078039, 'epoch': 0.0}


  8%|▊         | 790/10000 [39:38<9:44:23,  3.81s/it]

{'loss': 0.9376, 'grad_norm': 0.39025330543518066, 'learning_rate': 0.00018429214607303652, 'epoch': 0.0}


  8%|▊         | 791/10000 [39:42<10:06:30,  3.95s/it]

{'loss': 0.9946, 'grad_norm': 0.3304947018623352, 'learning_rate': 0.000184272136068034, 'epoch': 0.0}


  8%|▊         | 792/10000 [39:46<9:42:28,  3.80s/it]

{'loss': 0.898, 'grad_norm': 0.35460183024406433, 'learning_rate': 0.00018425212606303152, 'epoch': 0.0}


  8%|▊         | 793/10000 [39:49<9:04:17,  3.55s/it]

{'loss': 0.9454, 'grad_norm': 0.4318462014198303, 'learning_rate': 0.000184232116058029, 'epoch': 0.0}


  8%|▊         | 794/10000 [39:53<9:13:27,  3.61s/it]

{'loss': 0.8095, 'grad_norm': 0.44571107625961304, 'learning_rate': 0.00018421210605302653, 'epoch': 0.0}


  8%|▊         | 795/10000 [39:56<9:15:26,  3.62s/it]

{'loss': 0.7657, 'grad_norm': 0.29285717010498047, 'learning_rate': 0.00018419209604802401, 'epoch': 0.0}


  8%|▊         | 796/10000 [39:59<8:57:08,  3.50s/it]

{'loss': 1.0771, 'grad_norm': 0.3585066795349121, 'learning_rate': 0.00018417208604302153, 'epoch': 0.0}


  8%|▊         | 797/10000 [40:04<9:31:20,  3.72s/it]

{'loss': 0.8753, 'grad_norm': 0.3229823708534241, 'learning_rate': 0.00018415207603801902, 'epoch': 0.0}


  8%|▊         | 798/10000 [40:06<8:48:08,  3.44s/it]

{'loss': 0.8847, 'grad_norm': 0.3884364664554596, 'learning_rate': 0.0001841320660330165, 'epoch': 0.0}


  8%|▊         | 799/10000 [40:09<8:27:28,  3.31s/it]

{'loss': 0.8288, 'grad_norm': 0.4326307475566864, 'learning_rate': 0.00018411205602801402, 'epoch': 0.0}


  8%|▊         | 800/10000 [40:13<8:46:06,  3.43s/it]

{'loss': 1.06, 'grad_norm': 0.36355605721473694, 'learning_rate': 0.0001840920460230115, 'epoch': 0.0}


  8%|▊         | 801/10000 [40:19<10:25:38,  4.08s/it]

{'loss': 1.2694, 'grad_norm': 0.3479868769645691, 'learning_rate': 0.00018407203601800902, 'epoch': 0.0}


  8%|▊         | 802/10000 [40:22<10:01:36,  3.92s/it]

{'loss': 0.9349, 'grad_norm': 0.3588372766971588, 'learning_rate': 0.0001840520260130065, 'epoch': 0.0}


  8%|▊         | 803/10000 [40:26<9:54:55,  3.88s/it]

{'loss': 1.0156, 'grad_norm': 0.374406635761261, 'learning_rate': 0.00018403201600800402, 'epoch': 0.0}


  8%|▊         | 804/10000 [40:30<10:09:19,  3.98s/it]

{'loss': 1.4092, 'grad_norm': 0.3557043969631195, 'learning_rate': 0.0001840120060030015, 'epoch': 0.0}


  8%|▊         | 805/10000 [40:34<9:40:38,  3.79s/it]

{'loss': 1.131, 'grad_norm': 0.40078362822532654, 'learning_rate': 0.00018399199599799902, 'epoch': 0.0}


  8%|▊         | 806/10000 [40:37<9:23:17,  3.68s/it]

{'loss': 1.0465, 'grad_norm': 0.40005427598953247, 'learning_rate': 0.0001839719859929965, 'epoch': 0.0}


  8%|▊         | 807/10000 [40:40<8:50:37,  3.46s/it]

{'loss': 0.7516, 'grad_norm': 0.4432740807533264, 'learning_rate': 0.000183951975987994, 'epoch': 0.0}


  8%|▊         | 808/10000 [40:44<9:03:22,  3.55s/it]

{'loss': 0.9267, 'grad_norm': 0.34560948610305786, 'learning_rate': 0.00018393196598299148, 'epoch': 0.0}


  8%|▊         | 809/10000 [40:47<8:33:34,  3.35s/it]

{'loss': 1.4118, 'grad_norm': 0.4454617202281952, 'learning_rate': 0.000183911955977989, 'epoch': 0.0}


  8%|▊         | 810/10000 [40:49<7:57:55,  3.12s/it]

{'loss': 0.8672, 'grad_norm': 0.4239000976085663, 'learning_rate': 0.0001838919459729865, 'epoch': 0.0}


  8%|▊         | 811/10000 [40:52<7:37:41,  2.99s/it]

{'loss': 0.7444, 'grad_norm': 0.4137166440486908, 'learning_rate': 0.000183871935967984, 'epoch': 0.0}


  8%|▊         | 812/10000 [40:55<7:51:43,  3.08s/it]

{'loss': 1.2406, 'grad_norm': 0.43523550033569336, 'learning_rate': 0.00018385192596298151, 'epoch': 0.0}


  8%|▊         | 813/10000 [41:01<9:36:30,  3.77s/it]

{'loss': 1.0757, 'grad_norm': 0.3374328911304474, 'learning_rate': 0.000183831915957979, 'epoch': 0.0}


  8%|▊         | 814/10000 [41:04<9:20:26,  3.66s/it]

{'loss': 0.6603, 'grad_norm': 0.3360280692577362, 'learning_rate': 0.00018381190595297652, 'epoch': 0.0}


  8%|▊         | 815/10000 [41:07<8:31:30,  3.34s/it]

{'loss': 0.853, 'grad_norm': 0.42301613092422485, 'learning_rate': 0.000183791895947974, 'epoch': 0.0}


  8%|▊         | 816/10000 [41:10<8:28:49,  3.32s/it]

{'loss': 0.7541, 'grad_norm': 0.3593713939189911, 'learning_rate': 0.0001837718859429715, 'epoch': 0.0}


  8%|▊         | 817/10000 [41:13<8:32:53,  3.35s/it]

{'loss': 1.2138, 'grad_norm': 0.39034348726272583, 'learning_rate': 0.00018375187593796898, 'epoch': 0.0}


  8%|▊         | 818/10000 [41:17<8:40:01,  3.40s/it]

{'loss': 0.8208, 'grad_norm': 0.4342907667160034, 'learning_rate': 0.0001837318659329665, 'epoch': 0.0}


  8%|▊         | 819/10000 [41:20<8:15:24,  3.24s/it]

{'loss': 0.9357, 'grad_norm': 0.4380161762237549, 'learning_rate': 0.00018371185592796398, 'epoch': 0.0}


  8%|▊         | 820/10000 [41:24<8:48:55,  3.46s/it]

{'loss': 0.9429, 'grad_norm': 0.3187950849533081, 'learning_rate': 0.0001836918459229615, 'epoch': 0.0}


  8%|▊         | 821/10000 [41:27<8:36:27,  3.38s/it]

{'loss': 1.1908, 'grad_norm': 0.41276609897613525, 'learning_rate': 0.00018367183591795898, 'epoch': 0.0}


  8%|▊         | 822/10000 [41:30<8:47:50,  3.45s/it]

{'loss': 1.08, 'grad_norm': 0.3757339417934418, 'learning_rate': 0.0001836518259129565, 'epoch': 0.0}


  8%|▊         | 823/10000 [41:33<8:22:22,  3.28s/it]

{'loss': 0.8932, 'grad_norm': 0.37843847274780273, 'learning_rate': 0.000183631815907954, 'epoch': 0.0}


  8%|▊         | 824/10000 [41:36<8:01:22,  3.15s/it]

{'loss': 0.8898, 'grad_norm': 0.43091392517089844, 'learning_rate': 0.00018361180590295147, 'epoch': 0.0}


  8%|▊         | 825/10000 [41:39<7:58:54,  3.13s/it]

{'loss': 0.83, 'grad_norm': 0.3958377242088318, 'learning_rate': 0.00018359179589794898, 'epoch': 0.0}


  8%|▊         | 826/10000 [41:42<7:56:02,  3.11s/it]

{'loss': 0.9185, 'grad_norm': 0.400219202041626, 'learning_rate': 0.00018357178589294647, 'epoch': 0.0}


  8%|▊         | 827/10000 [41:46<8:28:40,  3.33s/it]

{'loss': 1.0108, 'grad_norm': 0.3696306347846985, 'learning_rate': 0.00018355177588794399, 'epoch': 0.0}


  8%|▊         | 828/10000 [41:50<8:38:59,  3.40s/it]

{'loss': 0.9895, 'grad_norm': 0.34925949573516846, 'learning_rate': 0.00018353176588294147, 'epoch': 0.0}


  8%|▊         | 829/10000 [41:53<8:51:06,  3.47s/it]

{'loss': 0.7777, 'grad_norm': 0.3448975682258606, 'learning_rate': 0.000183511755877939, 'epoch': 0.0}


  8%|▊         | 830/10000 [41:56<8:24:11,  3.30s/it]

{'loss': 0.7439, 'grad_norm': 0.4143773913383484, 'learning_rate': 0.00018349174587293648, 'epoch': 0.0}


  8%|▊         | 831/10000 [42:01<9:06:45,  3.58s/it]

{'loss': 1.3102, 'grad_norm': 0.3754850924015045, 'learning_rate': 0.000183471735867934, 'epoch': 0.0}


  8%|▊         | 832/10000 [42:04<8:49:32,  3.47s/it]

{'loss': 0.833, 'grad_norm': 0.3623518645763397, 'learning_rate': 0.00018345172586293148, 'epoch': 0.0}


  8%|▊         | 833/10000 [42:07<8:19:20,  3.27s/it]

{'loss': 1.1198, 'grad_norm': 0.4174599349498749, 'learning_rate': 0.00018343171585792896, 'epoch': 0.0}


  8%|▊         | 834/10000 [42:10<8:21:21,  3.28s/it]

{'loss': 0.9942, 'grad_norm': 0.43180713057518005, 'learning_rate': 0.00018341170585292645, 'epoch': 0.0}


  8%|▊         | 835/10000 [42:13<8:37:11,  3.39s/it]

{'loss': 0.8441, 'grad_norm': 0.3747415840625763, 'learning_rate': 0.00018339169584792397, 'epoch': 0.0}


  8%|▊         | 836/10000 [42:17<9:04:08,  3.56s/it]

{'loss': 0.9441, 'grad_norm': 0.3487372100353241, 'learning_rate': 0.00018337168584292145, 'epoch': 0.0}


  8%|▊         | 837/10000 [42:23<10:11:35,  4.00s/it]

{'loss': 1.2254, 'grad_norm': 0.32172900438308716, 'learning_rate': 0.00018335167583791897, 'epoch': 0.0}


  8%|▊         | 838/10000 [42:26<9:50:17,  3.87s/it]

{'loss': 0.9176, 'grad_norm': 0.4158492088317871, 'learning_rate': 0.00018333166583291648, 'epoch': 0.0}


  8%|▊         | 839/10000 [42:29<9:00:15,  3.54s/it]

{'loss': 0.691, 'grad_norm': 0.3788091242313385, 'learning_rate': 0.00018331165582791397, 'epoch': 0.0}


  8%|▊         | 840/10000 [42:33<9:13:23,  3.62s/it]

{'loss': 1.3069, 'grad_norm': 0.3588087856769562, 'learning_rate': 0.00018329164582291148, 'epoch': 0.0}


  8%|▊         | 841/10000 [42:35<8:34:01,  3.37s/it]

{'loss': 0.7902, 'grad_norm': 0.36738207936286926, 'learning_rate': 0.00018327163581790897, 'epoch': 0.0}


  8%|▊         | 842/10000 [42:39<8:22:43,  3.29s/it]

{'loss': 0.9992, 'grad_norm': 0.3622429370880127, 'learning_rate': 0.00018325162581290646, 'epoch': 0.0}


  8%|▊         | 843/10000 [42:42<8:34:14,  3.37s/it]

{'loss': 1.0982, 'grad_norm': 0.43732595443725586, 'learning_rate': 0.00018323161580790395, 'epoch': 0.0}


  8%|▊         | 844/10000 [42:45<8:36:04,  3.38s/it]

{'loss': 1.0378, 'grad_norm': 0.3699037432670593, 'learning_rate': 0.00018321160580290146, 'epoch': 0.0}


  8%|▊         | 845/10000 [42:49<8:27:58,  3.33s/it]

{'loss': 0.7364, 'grad_norm': 0.3301081657409668, 'learning_rate': 0.00018319159579789895, 'epoch': 0.0}


  8%|▊         | 846/10000 [42:52<8:34:12,  3.37s/it]

{'loss': 0.9539, 'grad_norm': 0.3332304060459137, 'learning_rate': 0.00018317158579289646, 'epoch': 0.0}


  8%|▊         | 847/10000 [42:56<9:07:41,  3.59s/it]

{'loss': 1.1516, 'grad_norm': 0.4455883502960205, 'learning_rate': 0.00018315157578789395, 'epoch': 0.0}


  8%|▊         | 848/10000 [43:00<9:12:50,  3.62s/it]

{'loss': 0.8704, 'grad_norm': 0.3461093604564667, 'learning_rate': 0.00018313156578289146, 'epoch': 0.0}


  8%|▊         | 849/10000 [43:03<8:48:55,  3.47s/it]

{'loss': 1.0395, 'grad_norm': 0.45188018679618835, 'learning_rate': 0.00018311155577788895, 'epoch': 0.0}


  8%|▊         | 850/10000 [43:06<8:01:13,  3.16s/it]

{'loss': 0.764, 'grad_norm': 0.4522223174571991, 'learning_rate': 0.00018309154577288646, 'epoch': 0.0}


  9%|▊         | 851/10000 [43:09<7:59:03,  3.14s/it]

{'loss': 0.8807, 'grad_norm': 0.3841095566749573, 'learning_rate': 0.00018307153576788395, 'epoch': 0.0}


  9%|▊         | 852/10000 [43:11<7:42:11,  3.03s/it]

{'loss': 0.9925, 'grad_norm': 0.4321022927761078, 'learning_rate': 0.00018305152576288144, 'epoch': 0.0}


  9%|▊         | 853/10000 [43:15<8:00:50,  3.15s/it]

{'loss': 0.8351, 'grad_norm': 0.3719692826271057, 'learning_rate': 0.00018303151575787895, 'epoch': 0.0}


  9%|▊         | 854/10000 [43:18<7:57:36,  3.13s/it]

{'loss': 0.8125, 'grad_norm': 0.3820810616016388, 'learning_rate': 0.00018301150575287644, 'epoch': 0.0}


  9%|▊         | 855/10000 [43:21<8:00:18,  3.15s/it]

{'loss': 0.9964, 'grad_norm': 0.43797892332077026, 'learning_rate': 0.00018299149574787396, 'epoch': 0.0}


  9%|▊         | 856/10000 [43:24<8:03:05,  3.17s/it]

{'loss': 0.9099, 'grad_norm': 0.39118692278862, 'learning_rate': 0.00018297148574287144, 'epoch': 0.0}


  9%|▊         | 857/10000 [43:28<8:15:44,  3.25s/it]

{'loss': 1.3235, 'grad_norm': 0.48556214570999146, 'learning_rate': 0.00018295147573786896, 'epoch': 0.0}


  9%|▊         | 858/10000 [43:31<8:13:02,  3.24s/it]

{'loss': 0.8122, 'grad_norm': 0.34731757640838623, 'learning_rate': 0.00018293146573286644, 'epoch': 0.0}


  9%|▊         | 859/10000 [43:34<8:01:08,  3.16s/it]

{'loss': 1.0465, 'grad_norm': 0.394272118806839, 'learning_rate': 0.00018291145572786393, 'epoch': 0.0}


  9%|▊         | 860/10000 [43:37<8:19:11,  3.28s/it]

{'loss': 0.8431, 'grad_norm': 0.3815422058105469, 'learning_rate': 0.00018289144572286142, 'epoch': 0.0}


  9%|▊         | 861/10000 [43:40<7:55:28,  3.12s/it]

{'loss': 0.8264, 'grad_norm': 0.4189954400062561, 'learning_rate': 0.00018287143571785893, 'epoch': 0.0}


  9%|▊         | 862/10000 [43:44<8:20:18,  3.29s/it]

{'loss': 0.7713, 'grad_norm': 0.31644725799560547, 'learning_rate': 0.00018285142571285642, 'epoch': 0.0}


  9%|▊         | 863/10000 [43:47<7:57:19,  3.13s/it]

{'loss': 0.9963, 'grad_norm': 0.46348100900650024, 'learning_rate': 0.00018283141570785393, 'epoch': 0.0}


  9%|▊         | 864/10000 [43:50<8:01:23,  3.16s/it]

{'loss': 1.0227, 'grad_norm': 0.3913118839263916, 'learning_rate': 0.00018281140570285142, 'epoch': 0.0}


  9%|▊         | 865/10000 [43:53<8:19:14,  3.28s/it]

{'loss': 0.8976, 'grad_norm': 0.37288913130760193, 'learning_rate': 0.00018279139569784894, 'epoch': 0.0}


  9%|▊         | 866/10000 [43:57<8:20:22,  3.29s/it]

{'loss': 1.0725, 'grad_norm': 0.36176255345344543, 'learning_rate': 0.00018277138569284645, 'epoch': 0.0}


  9%|▊         | 867/10000 [44:00<8:11:45,  3.23s/it]

{'loss': 0.9196, 'grad_norm': 0.3450445234775543, 'learning_rate': 0.00018275137568784394, 'epoch': 0.0}


  9%|▊         | 868/10000 [44:03<7:54:25,  3.12s/it]

{'loss': 0.8012, 'grad_norm': 0.42285650968551636, 'learning_rate': 0.00018273136568284143, 'epoch': 0.0}


  9%|▊         | 869/10000 [44:06<7:54:42,  3.12s/it]

{'loss': 1.1633, 'grad_norm': 0.3583885729312897, 'learning_rate': 0.0001827113556778389, 'epoch': 0.0}


  9%|▊         | 870/10000 [44:09<7:50:54,  3.09s/it]

{'loss': 0.8534, 'grad_norm': 0.49661052227020264, 'learning_rate': 0.00018269134567283643, 'epoch': 0.0}


  9%|▊         | 871/10000 [44:12<8:04:12,  3.18s/it]

{'loss': 0.9939, 'grad_norm': 0.4056093096733093, 'learning_rate': 0.00018267133566783391, 'epoch': 0.0}


  9%|▊         | 872/10000 [44:17<9:09:21,  3.61s/it]

{'loss': 1.23, 'grad_norm': 0.37755218148231506, 'learning_rate': 0.00018265132566283143, 'epoch': 0.0}


  9%|▊         | 873/10000 [44:20<8:51:35,  3.49s/it]

{'loss': 0.9897, 'grad_norm': 0.36676105856895447, 'learning_rate': 0.00018263131565782892, 'epoch': 0.0}


  9%|▊         | 874/10000 [44:23<8:13:37,  3.25s/it]

{'loss': 0.809, 'grad_norm': 0.40249642729759216, 'learning_rate': 0.00018261130565282643, 'epoch': 0.0}


  9%|▉         | 875/10000 [44:26<8:34:55,  3.39s/it]

{'loss': 1.1012, 'grad_norm': 0.46574074029922485, 'learning_rate': 0.00018259129564782392, 'epoch': 0.0}


  9%|▉         | 876/10000 [44:32<9:52:12,  3.89s/it]

{'loss': 1.4649, 'grad_norm': 0.3797416388988495, 'learning_rate': 0.00018257128564282143, 'epoch': 0.0}


  9%|▉         | 877/10000 [44:35<9:40:58,  3.82s/it]

{'loss': 0.8364, 'grad_norm': 0.34266871213912964, 'learning_rate': 0.00018255127563781892, 'epoch': 0.0}


  9%|▉         | 878/10000 [44:39<9:22:08,  3.70s/it]

{'loss': 0.7502, 'grad_norm': 0.3473234176635742, 'learning_rate': 0.0001825312656328164, 'epoch': 0.0}


  9%|▉         | 879/10000 [44:41<8:35:21,  3.39s/it]

{'loss': 0.6912, 'grad_norm': 0.40973758697509766, 'learning_rate': 0.00018251125562781392, 'epoch': 0.0}


  9%|▉         | 880/10000 [44:45<8:45:43,  3.46s/it]

{'loss': 0.7581, 'grad_norm': 0.33782753348350525, 'learning_rate': 0.0001824912456228114, 'epoch': 0.0}


  9%|▉         | 881/10000 [44:49<9:02:29,  3.57s/it]

{'loss': 0.9523, 'grad_norm': 0.3965814709663391, 'learning_rate': 0.00018247123561780892, 'epoch': 0.0}


  9%|▉         | 882/10000 [44:51<8:12:33,  3.24s/it]

{'loss': 0.7442, 'grad_norm': 0.446895033121109, 'learning_rate': 0.0001824512256128064, 'epoch': 0.0}


  9%|▉         | 883/10000 [44:54<8:10:36,  3.23s/it]

{'loss': 1.0974, 'grad_norm': 0.38009732961654663, 'learning_rate': 0.00018243121560780392, 'epoch': 0.0}


  9%|▉         | 884/10000 [44:58<8:43:23,  3.44s/it]

{'loss': 0.9125, 'grad_norm': 0.3382345139980316, 'learning_rate': 0.0001824112056028014, 'epoch': 0.0}


  9%|▉         | 885/10000 [45:02<9:05:50,  3.59s/it]

{'loss': 0.9939, 'grad_norm': 0.33144110441207886, 'learning_rate': 0.00018239119559779893, 'epoch': 0.0}


  9%|▉         | 886/10000 [45:05<8:44:47,  3.45s/it]

{'loss': 0.862, 'grad_norm': 0.4727533757686615, 'learning_rate': 0.0001823711855927964, 'epoch': 0.0}


  9%|▉         | 887/10000 [45:10<9:24:46,  3.72s/it]

{'loss': 0.8315, 'grad_norm': 0.32895171642303467, 'learning_rate': 0.0001823511755877939, 'epoch': 0.0}


  9%|▉         | 888/10000 [45:13<8:46:08,  3.46s/it]

{'loss': 0.9978, 'grad_norm': 0.4664762318134308, 'learning_rate': 0.0001823311655827914, 'epoch': 0.0}


  9%|▉         | 889/10000 [45:16<8:47:50,  3.48s/it]

{'loss': 1.0784, 'grad_norm': 0.44510769844055176, 'learning_rate': 0.0001823111555777889, 'epoch': 0.0}


  9%|▉         | 890/10000 [45:19<8:36:24,  3.40s/it]

{'loss': 0.8785, 'grad_norm': 0.36718547344207764, 'learning_rate': 0.0001822911455727864, 'epoch': 0.0}


  9%|▉         | 891/10000 [45:22<8:07:07,  3.21s/it]

{'loss': 0.7368, 'grad_norm': 0.36629313230514526, 'learning_rate': 0.0001822711355677839, 'epoch': 0.0}


  9%|▉         | 892/10000 [45:25<7:30:50,  2.97s/it]

{'loss': 0.6169, 'grad_norm': 0.3702940046787262, 'learning_rate': 0.0001822511255627814, 'epoch': 0.0}


  9%|▉         | 893/10000 [45:28<8:12:11,  3.24s/it]

{'loss': 0.9965, 'grad_norm': 0.3581346869468689, 'learning_rate': 0.0001822311155577789, 'epoch': 0.0}


  9%|▉         | 894/10000 [45:32<8:35:31,  3.40s/it]

{'loss': 0.9651, 'grad_norm': 0.4111607074737549, 'learning_rate': 0.00018221110555277642, 'epoch': 0.0}


  9%|▉         | 895/10000 [45:35<8:15:40,  3.27s/it]

{'loss': 0.9004, 'grad_norm': 0.4189468026161194, 'learning_rate': 0.00018219109554777388, 'epoch': 0.0}


  9%|▉         | 896/10000 [45:39<8:26:13,  3.34s/it]

{'loss': 0.8692, 'grad_norm': 0.3651188910007477, 'learning_rate': 0.0001821710855427714, 'epoch': 0.0}


  9%|▉         | 897/10000 [45:43<8:57:04,  3.54s/it]

{'loss': 0.9741, 'grad_norm': 0.3266047537326813, 'learning_rate': 0.00018215107553776888, 'epoch': 0.0}


  9%|▉         | 898/10000 [45:47<9:51:21,  3.90s/it]

{'loss': 0.8388, 'grad_norm': 0.3299606740474701, 'learning_rate': 0.0001821310655327664, 'epoch': 0.0}


  9%|▉         | 899/10000 [45:51<9:39:55,  3.82s/it]

{'loss': 0.9845, 'grad_norm': 0.3111754357814789, 'learning_rate': 0.00018211105552776388, 'epoch': 0.0}


  9%|▉         | 900/10000 [45:55<9:24:30,  3.72s/it]

{'loss': 1.0104, 'grad_norm': 0.363521933555603, 'learning_rate': 0.0001820910455227614, 'epoch': 0.0}


  9%|▉         | 901/10000 [45:59<9:54:17,  3.92s/it]

{'loss': 0.8566, 'grad_norm': 0.3442723751068115, 'learning_rate': 0.00018207103551775889, 'epoch': 0.0}


  9%|▉         | 902/10000 [46:02<9:29:40,  3.76s/it]

{'loss': 0.7344, 'grad_norm': 0.3343520760536194, 'learning_rate': 0.0001820510255127564, 'epoch': 0.0}


  9%|▉         | 903/10000 [46:06<9:24:05,  3.72s/it]

{'loss': 1.0474, 'grad_norm': 0.397990882396698, 'learning_rate': 0.0001820310155077539, 'epoch': 0.0}


  9%|▉         | 904/10000 [46:09<8:54:02,  3.52s/it]

{'loss': 0.779, 'grad_norm': 0.31568002700805664, 'learning_rate': 0.00018201100550275137, 'epoch': 0.0}


  9%|▉         | 905/10000 [46:12<8:25:59,  3.34s/it]

{'loss': 0.9982, 'grad_norm': 0.39188942313194275, 'learning_rate': 0.00018199099549774886, 'epoch': 0.0}


  9%|▉         | 906/10000 [46:15<8:14:53,  3.27s/it]

{'loss': 0.8367, 'grad_norm': 0.39482855796813965, 'learning_rate': 0.00018197098549274638, 'epoch': 0.0}


  9%|▉         | 907/10000 [46:19<8:26:11,  3.34s/it]

{'loss': 0.8626, 'grad_norm': 0.36471906304359436, 'learning_rate': 0.0001819509754877439, 'epoch': 0.0}


  9%|▉         | 908/10000 [46:22<8:14:30,  3.26s/it]

{'loss': 0.7828, 'grad_norm': 0.35471540689468384, 'learning_rate': 0.00018193096548274138, 'epoch': 0.0}


  9%|▉         | 909/10000 [46:24<7:52:07,  3.12s/it]

{'loss': 0.6897, 'grad_norm': 0.3991034924983978, 'learning_rate': 0.0001819109554777389, 'epoch': 0.0}


  9%|▉         | 910/10000 [46:27<7:31:53,  2.98s/it]

{'loss': 0.7001, 'grad_norm': 0.3821704387664795, 'learning_rate': 0.00018189094547273638, 'epoch': 0.0}


  9%|▉         | 911/10000 [46:30<7:48:07,  3.09s/it]

{'loss': 0.9861, 'grad_norm': 0.40918442606925964, 'learning_rate': 0.0001818709354677339, 'epoch': 0.0}


  9%|▉         | 912/10000 [46:33<7:49:13,  3.10s/it]

{'loss': 0.7888, 'grad_norm': 0.35801151394844055, 'learning_rate': 0.00018185092546273138, 'epoch': 0.0}


  9%|▉         | 913/10000 [46:37<8:03:10,  3.19s/it]

{'loss': 0.7665, 'grad_norm': 0.35424789786338806, 'learning_rate': 0.00018183091545772887, 'epoch': 0.0}


  9%|▉         | 914/10000 [46:40<7:54:45,  3.14s/it]

{'loss': 1.1789, 'grad_norm': 0.3748167157173157, 'learning_rate': 0.00018181090545272636, 'epoch': 0.0}


  9%|▉         | 915/10000 [46:43<7:33:12,  2.99s/it]

{'loss': 0.8211, 'grad_norm': 0.4425272047519684, 'learning_rate': 0.00018179089544772387, 'epoch': 0.0}


  9%|▉         | 916/10000 [46:47<8:58:11,  3.55s/it]

{'loss': 1.0115, 'grad_norm': 0.29678621888160706, 'learning_rate': 0.00018177088544272136, 'epoch': 0.0}


  9%|▉         | 917/10000 [46:51<8:52:23,  3.52s/it]

{'loss': 0.903, 'grad_norm': 0.4045873284339905, 'learning_rate': 0.00018175087543771887, 'epoch': 0.0}


  9%|▉         | 918/10000 [46:55<8:58:26,  3.56s/it]

{'loss': 1.0175, 'grad_norm': 0.3962373733520508, 'learning_rate': 0.00018173086543271636, 'epoch': 0.0}


  9%|▉         | 919/10000 [46:58<8:36:49,  3.41s/it]

{'loss': 0.9629, 'grad_norm': 0.3700026571750641, 'learning_rate': 0.00018171085542771387, 'epoch': 0.0}


  9%|▉         | 920/10000 [47:02<9:11:59,  3.65s/it]

{'loss': 1.162, 'grad_norm': 0.34819459915161133, 'learning_rate': 0.0001816908454227114, 'epoch': 0.0}


  9%|▉         | 921/10000 [47:05<8:41:52,  3.45s/it]

{'loss': 0.8556, 'grad_norm': 0.37943291664123535, 'learning_rate': 0.00018167083541770887, 'epoch': 0.0}


  9%|▉         | 922/10000 [47:08<8:24:38,  3.34s/it]

{'loss': 0.8606, 'grad_norm': 0.40640443563461304, 'learning_rate': 0.00018165082541270636, 'epoch': 0.0}


  9%|▉         | 923/10000 [47:11<8:17:17,  3.29s/it]

{'loss': 1.0046, 'grad_norm': 0.41623789072036743, 'learning_rate': 0.00018163081540770385, 'epoch': 0.0}


  9%|▉         | 924/10000 [47:14<8:12:37,  3.26s/it]

{'loss': 0.8974, 'grad_norm': 0.3937425911426544, 'learning_rate': 0.00018161080540270136, 'epoch': 0.0}


  9%|▉         | 925/10000 [47:18<8:19:32,  3.30s/it]

{'loss': 1.0649, 'grad_norm': 0.4282214045524597, 'learning_rate': 0.00018159079539769885, 'epoch': 0.0}


  9%|▉         | 926/10000 [47:21<8:09:47,  3.24s/it]

{'loss': 0.7321, 'grad_norm': 0.3883328139781952, 'learning_rate': 0.00018157078539269637, 'epoch': 0.0}


  9%|▉         | 927/10000 [47:23<7:20:39,  2.91s/it]

{'loss': 0.8419, 'grad_norm': 0.43614163994789124, 'learning_rate': 0.00018155077538769385, 'epoch': 0.0}


  9%|▉         | 928/10000 [47:26<7:53:29,  3.13s/it]

{'loss': 0.9727, 'grad_norm': 0.3749972879886627, 'learning_rate': 0.00018153076538269137, 'epoch': 0.0}


  9%|▉         | 929/10000 [47:29<7:36:47,  3.02s/it]

{'loss': 0.9398, 'grad_norm': 0.3874221444129944, 'learning_rate': 0.00018151075537768885, 'epoch': 0.0}


  9%|▉         | 930/10000 [47:35<9:33:39,  3.79s/it]

{'loss': 0.7892, 'grad_norm': 0.31775137782096863, 'learning_rate': 0.00018149074537268634, 'epoch': 0.0}


  9%|▉         | 931/10000 [47:38<9:22:27,  3.72s/it]

{'loss': 1.1429, 'grad_norm': 0.34648755192756653, 'learning_rate': 0.00018147073536768383, 'epoch': 0.0}


  9%|▉         | 932/10000 [47:42<9:09:02,  3.63s/it]

{'loss': 0.7903, 'grad_norm': 0.35602661967277527, 'learning_rate': 0.00018145072536268134, 'epoch': 0.0}


  9%|▉         | 933/10000 [47:46<9:52:05,  3.92s/it]

{'loss': 1.1592, 'grad_norm': 0.34748613834381104, 'learning_rate': 0.00018143071535767883, 'epoch': 0.0}


  9%|▉         | 934/10000 [47:49<9:08:49,  3.63s/it]

{'loss': 0.7842, 'grad_norm': 0.42002519965171814, 'learning_rate': 0.00018141070535267634, 'epoch': 0.0}


  9%|▉         | 935/10000 [47:54<9:46:39,  3.88s/it]

{'loss': 1.5251, 'grad_norm': 0.3930027484893799, 'learning_rate': 0.00018139069534767386, 'epoch': 0.0}


  9%|▉         | 936/10000 [47:57<9:10:12,  3.64s/it]

{'loss': 0.9189, 'grad_norm': 0.41896679997444153, 'learning_rate': 0.00018137068534267135, 'epoch': 0.0}


  9%|▉         | 937/10000 [48:00<8:21:34,  3.32s/it]

{'loss': 0.7026, 'grad_norm': 0.3874433934688568, 'learning_rate': 0.00018135067533766886, 'epoch': 0.0}


  9%|▉         | 938/10000 [48:03<8:17:23,  3.29s/it]

{'loss': 0.9432, 'grad_norm': 0.4046497642993927, 'learning_rate': 0.00018133066533266635, 'epoch': 0.0}


  9%|▉         | 939/10000 [48:06<8:24:36,  3.34s/it]

{'loss': 1.0434, 'grad_norm': 0.34118443727493286, 'learning_rate': 0.00018131065532766384, 'epoch': 0.0}


  9%|▉         | 940/10000 [48:10<8:48:17,  3.50s/it]

{'loss': 0.8093, 'grad_norm': 0.30671417713165283, 'learning_rate': 0.00018129064532266132, 'epoch': 0.0}


  9%|▉         | 941/10000 [48:13<8:43:38,  3.47s/it]

{'loss': 0.8488, 'grad_norm': 0.38541585206985474, 'learning_rate': 0.00018127063531765884, 'epoch': 0.0}


  9%|▉         | 942/10000 [48:17<8:37:32,  3.43s/it]

{'loss': 0.9239, 'grad_norm': 0.35100287199020386, 'learning_rate': 0.00018125062531265632, 'epoch': 0.0}


  9%|▉         | 943/10000 [48:22<9:39:27,  3.84s/it]

{'loss': 1.0416, 'grad_norm': 0.31531086564064026, 'learning_rate': 0.00018123061530765384, 'epoch': 0.0}


  9%|▉         | 944/10000 [48:25<9:16:35,  3.69s/it]

{'loss': 0.9955, 'grad_norm': 0.40441420674324036, 'learning_rate': 0.00018121060530265133, 'epoch': 0.0}


  9%|▉         | 945/10000 [48:28<8:53:15,  3.53s/it]

{'loss': 0.8978, 'grad_norm': 0.37418097257614136, 'learning_rate': 0.00018119059529764884, 'epoch': 0.0}


  9%|▉         | 946/10000 [48:33<9:56:07,  3.95s/it]

{'loss': 1.1096, 'grad_norm': 0.28360843658447266, 'learning_rate': 0.00018117058529264633, 'epoch': 0.0}


  9%|▉         | 947/10000 [48:36<9:07:52,  3.63s/it]

{'loss': 0.717, 'grad_norm': 0.3861745297908783, 'learning_rate': 0.00018115057528764384, 'epoch': 0.0}


  9%|▉         | 948/10000 [48:39<8:52:39,  3.53s/it]

{'loss': 0.9348, 'grad_norm': 0.41300293803215027, 'learning_rate': 0.00018113056528264133, 'epoch': 0.0}


  9%|▉         | 949/10000 [48:42<8:21:52,  3.33s/it]

{'loss': 0.671, 'grad_norm': 0.38055619597435, 'learning_rate': 0.00018111055527763882, 'epoch': 0.0}


 10%|▉         | 950/10000 [48:46<9:00:39,  3.58s/it]

{'loss': 0.6885, 'grad_norm': 0.33112964034080505, 'learning_rate': 0.00018109054527263633, 'epoch': 0.0}


 10%|▉         | 951/10000 [48:50<9:19:44,  3.71s/it]

{'loss': 0.9084, 'grad_norm': 0.3729868531227112, 'learning_rate': 0.00018107053526763382, 'epoch': 0.0}


 10%|▉         | 952/10000 [48:53<8:57:19,  3.56s/it]

{'loss': 0.7975, 'grad_norm': 0.37569957971572876, 'learning_rate': 0.00018105052526263133, 'epoch': 0.0}


 10%|▉         | 953/10000 [48:57<9:07:33,  3.63s/it]

{'loss': 1.014, 'grad_norm': 0.4136328101158142, 'learning_rate': 0.00018103051525762882, 'epoch': 0.0}


 10%|▉         | 954/10000 [49:01<8:56:22,  3.56s/it]

{'loss': 0.8216, 'grad_norm': 0.3669237792491913, 'learning_rate': 0.00018101050525262633, 'epoch': 0.0}


 10%|▉         | 955/10000 [49:03<8:20:30,  3.32s/it]

{'loss': 0.9916, 'grad_norm': 0.3875410556793213, 'learning_rate': 0.00018099049524762382, 'epoch': 0.0}


 10%|▉         | 956/10000 [49:07<8:15:32,  3.29s/it]

{'loss': 0.8948, 'grad_norm': 0.3767755627632141, 'learning_rate': 0.00018097048524262134, 'epoch': 0.0}


 10%|▉         | 957/10000 [49:10<8:17:37,  3.30s/it]

{'loss': 0.8202, 'grad_norm': 0.38637852668762207, 'learning_rate': 0.00018095047523761882, 'epoch': 0.0}


 10%|▉         | 958/10000 [49:13<7:44:40,  3.08s/it]

{'loss': 1.032, 'grad_norm': 0.45814964175224304, 'learning_rate': 0.0001809304652326163, 'epoch': 0.0}


 10%|▉         | 959/10000 [49:15<7:30:04,  2.99s/it]

{'loss': 0.8148, 'grad_norm': 0.3712802827358246, 'learning_rate': 0.0001809104552276138, 'epoch': 0.0}


 10%|▉         | 960/10000 [49:19<7:49:11,  3.11s/it]

{'loss': 0.6707, 'grad_norm': 0.32725703716278076, 'learning_rate': 0.0001808904452226113, 'epoch': 0.0}


 10%|▉         | 961/10000 [49:22<8:11:57,  3.27s/it]

{'loss': 0.8496, 'grad_norm': 0.32964402437210083, 'learning_rate': 0.0001808704352176088, 'epoch': 0.0}


 10%|▉         | 962/10000 [49:27<9:04:54,  3.62s/it]

{'loss': 1.099, 'grad_norm': 0.3317054510116577, 'learning_rate': 0.00018085042521260631, 'epoch': 0.0}


 10%|▉         | 963/10000 [49:30<9:06:00,  3.63s/it]

{'loss': 1.1374, 'grad_norm': 0.3842346966266632, 'learning_rate': 0.00018083041520760383, 'epoch': 0.0}


 10%|▉         | 964/10000 [49:33<8:34:18,  3.42s/it]

{'loss': 0.963, 'grad_norm': 0.3645043969154358, 'learning_rate': 0.00018081040520260132, 'epoch': 0.0}


 10%|▉         | 965/10000 [49:36<7:58:40,  3.18s/it]

{'loss': 0.7671, 'grad_norm': 0.4440784156322479, 'learning_rate': 0.0001807903951975988, 'epoch': 0.0}


 10%|▉         | 966/10000 [49:39<7:49:34,  3.12s/it]

{'loss': 0.8616, 'grad_norm': 0.3840656280517578, 'learning_rate': 0.0001807703851925963, 'epoch': 0.0}


 10%|▉         | 967/10000 [49:43<8:21:39,  3.33s/it]

{'loss': 1.1364, 'grad_norm': 0.3807843029499054, 'learning_rate': 0.0001807503751875938, 'epoch': 0.0}


 10%|▉         | 968/10000 [49:47<8:58:44,  3.58s/it]

{'loss': 0.8879, 'grad_norm': 0.33794814348220825, 'learning_rate': 0.0001807303651825913, 'epoch': 0.0}


 10%|▉         | 969/10000 [49:50<8:56:25,  3.56s/it]

{'loss': 0.8782, 'grad_norm': 0.37616291642189026, 'learning_rate': 0.0001807103551775888, 'epoch': 0.0}


 10%|▉         | 970/10000 [49:55<9:21:16,  3.73s/it]

{'loss': 0.9603, 'grad_norm': 0.33011385798454285, 'learning_rate': 0.0001806903451725863, 'epoch': 0.0}


 10%|▉         | 971/10000 [49:58<8:47:30,  3.51s/it]

{'loss': 0.778, 'grad_norm': 0.3711540997028351, 'learning_rate': 0.0001806703351675838, 'epoch': 0.0}


 10%|▉         | 972/10000 [50:01<8:38:46,  3.45s/it]

{'loss': 0.774, 'grad_norm': 0.3677622675895691, 'learning_rate': 0.0001806503251625813, 'epoch': 0.0}


 10%|▉         | 973/10000 [50:04<8:17:26,  3.31s/it]

{'loss': 0.8968, 'grad_norm': 0.45074933767318726, 'learning_rate': 0.0001806303151575788, 'epoch': 0.0}


 10%|▉         | 974/10000 [50:07<8:09:28,  3.25s/it]

{'loss': 1.1069, 'grad_norm': 0.4071766436100006, 'learning_rate': 0.0001806103051525763, 'epoch': 0.0}


 10%|▉         | 975/10000 [50:10<8:02:05,  3.21s/it]

{'loss': 0.8088, 'grad_norm': 0.33355480432510376, 'learning_rate': 0.00018059029514757378, 'epoch': 0.0}


 10%|▉         | 976/10000 [50:13<7:44:39,  3.09s/it]

{'loss': 0.9225, 'grad_norm': 0.3753012418746948, 'learning_rate': 0.0001805702851425713, 'epoch': 0.0}


 10%|▉         | 977/10000 [50:16<7:41:37,  3.07s/it]

{'loss': 0.7367, 'grad_norm': 0.3962898254394531, 'learning_rate': 0.00018055027513756879, 'epoch': 0.0}


 10%|▉         | 978/10000 [50:19<7:48:11,  3.11s/it]

{'loss': 1.1004, 'grad_norm': 0.41076454520225525, 'learning_rate': 0.0001805302651325663, 'epoch': 0.0}


 10%|▉         | 979/10000 [50:23<8:12:22,  3.27s/it]

{'loss': 0.8823, 'grad_norm': 0.3550373911857605, 'learning_rate': 0.0001805102551275638, 'epoch': 0.0}


 10%|▉         | 980/10000 [50:26<8:15:36,  3.30s/it]

{'loss': 0.9062, 'grad_norm': 0.3583175539970398, 'learning_rate': 0.0001804902451225613, 'epoch': 0.0}


 10%|▉         | 981/10000 [50:29<7:41:39,  3.07s/it]

{'loss': 0.8456, 'grad_norm': 0.4189678132534027, 'learning_rate': 0.0001804702351175588, 'epoch': 0.0}


 10%|▉         | 982/10000 [50:32<7:51:16,  3.14s/it]

{'loss': 0.9699, 'grad_norm': 0.3917098045349121, 'learning_rate': 0.0001804502251125563, 'epoch': 0.0}


 10%|▉         | 983/10000 [50:36<8:33:30,  3.42s/it]

{'loss': 1.0484, 'grad_norm': 0.3354644179344177, 'learning_rate': 0.0001804302151075538, 'epoch': 0.0}


 10%|▉         | 984/10000 [50:39<8:11:08,  3.27s/it]

{'loss': 0.7642, 'grad_norm': 0.3451122045516968, 'learning_rate': 0.00018041020510255128, 'epoch': 0.0}


 10%|▉         | 985/10000 [50:43<8:32:32,  3.41s/it]

{'loss': 1.3069, 'grad_norm': 0.395709753036499, 'learning_rate': 0.00018039019509754877, 'epoch': 0.0}


 10%|▉         | 986/10000 [50:46<8:35:00,  3.43s/it]

{'loss': 1.062, 'grad_norm': 0.370006263256073, 'learning_rate': 0.00018037018509254628, 'epoch': 0.0}


 10%|▉         | 987/10000 [50:49<8:11:56,  3.27s/it]

{'loss': 0.9138, 'grad_norm': 0.3625253140926361, 'learning_rate': 0.00018035017508754377, 'epoch': 0.0}


 10%|▉         | 988/10000 [50:52<7:44:10,  3.09s/it]

{'loss': 0.9682, 'grad_norm': 0.4079478681087494, 'learning_rate': 0.00018033016508254128, 'epoch': 0.0}


 10%|▉         | 989/10000 [50:54<7:25:20,  2.97s/it]

{'loss': 0.9144, 'grad_norm': 0.41130509972572327, 'learning_rate': 0.00018031015507753877, 'epoch': 0.0}


 10%|▉         | 990/10000 [50:58<7:48:54,  3.12s/it]

{'loss': 1.0118, 'grad_norm': 0.3441541790962219, 'learning_rate': 0.00018029014507253628, 'epoch': 0.0}


 10%|▉         | 991/10000 [51:01<7:55:54,  3.17s/it]

{'loss': 1.2241, 'grad_norm': 0.42758479714393616, 'learning_rate': 0.0001802701350675338, 'epoch': 0.0}


 10%|▉         | 992/10000 [51:06<9:01:33,  3.61s/it]

{'loss': 0.7421, 'grad_norm': 0.3052065968513489, 'learning_rate': 0.00018025012506253128, 'epoch': 0.0}


 10%|▉         | 993/10000 [51:08<8:17:15,  3.31s/it]

{'loss': 0.6689, 'grad_norm': 0.34099704027175903, 'learning_rate': 0.00018023011505752877, 'epoch': 0.0}


 10%|▉         | 994/10000 [51:12<8:10:46,  3.27s/it]

{'loss': 0.8171, 'grad_norm': 0.38088229298591614, 'learning_rate': 0.00018021010505252626, 'epoch': 0.0}


 10%|▉         | 995/10000 [51:15<8:02:04,  3.21s/it]

{'loss': 0.7119, 'grad_norm': 0.36287087202072144, 'learning_rate': 0.00018019009504752377, 'epoch': 0.0}


 10%|▉         | 996/10000 [51:19<8:50:07,  3.53s/it]

{'loss': 1.2088, 'grad_norm': 0.3589822053909302, 'learning_rate': 0.00018017008504252126, 'epoch': 0.0}


 10%|▉         | 997/10000 [51:22<8:14:10,  3.29s/it]

{'loss': 0.6492, 'grad_norm': 0.3541155755519867, 'learning_rate': 0.00018015007503751878, 'epoch': 0.0}


 10%|▉         | 998/10000 [51:25<7:56:47,  3.18s/it]

{'loss': 1.0495, 'grad_norm': 0.3750118017196655, 'learning_rate': 0.00018013006503251626, 'epoch': 0.0}


 10%|▉         | 999/10000 [51:29<9:08:06,  3.65s/it]

{'loss': 1.415, 'grad_norm': 0.36885643005371094, 'learning_rate': 0.00018011005502751378, 'epoch': 0.0}


 10%|█         | 1000/10000 [51:33<9:03:57,  3.63s/it]

{'loss': 0.8602, 'grad_norm': 0.3700504004955292, 'learning_rate': 0.00018009004502251126, 'epoch': 0.0}


 10%|█         | 1001/10000 [51:37<9:43:09,  3.89s/it]

{'loss': 0.8753, 'grad_norm': 0.38561826944351196, 'learning_rate': 0.00018007003501750875, 'epoch': 0.0}


 10%|█         | 1002/10000 [51:41<9:51:15,  3.94s/it]

{'loss': 1.0388, 'grad_norm': 0.3387507200241089, 'learning_rate': 0.00018005002501250624, 'epoch': 0.0}


 10%|█         | 1003/10000 [51:44<8:48:08,  3.52s/it]

{'loss': 0.6919, 'grad_norm': 0.4435504078865051, 'learning_rate': 0.00018003001500750375, 'epoch': 0.0}


 10%|█         | 1004/10000 [51:47<8:37:59,  3.45s/it]

{'loss': 0.8777, 'grad_norm': 0.3716079294681549, 'learning_rate': 0.00018001000500250127, 'epoch': 0.0}


 10%|█         | 1005/10000 [51:50<7:57:12,  3.18s/it]

{'loss': 1.0205, 'grad_norm': 0.4209911823272705, 'learning_rate': 0.00017998999499749875, 'epoch': 0.0}


 10%|█         | 1006/10000 [51:54<8:22:36,  3.35s/it]

{'loss': 1.104, 'grad_norm': 0.36746132373809814, 'learning_rate': 0.00017996998499249627, 'epoch': 0.0}


 10%|█         | 1007/10000 [51:57<8:28:02,  3.39s/it]

{'loss': 0.6709, 'grad_norm': 0.32384300231933594, 'learning_rate': 0.00017994997498749376, 'epoch': 0.0}


 10%|█         | 1008/10000 [52:00<8:19:26,  3.33s/it]

{'loss': 1.1884, 'grad_norm': 0.40286529064178467, 'learning_rate': 0.00017992996498249127, 'epoch': 0.0}


 10%|█         | 1009/10000 [52:03<8:06:11,  3.24s/it]

{'loss': 0.8804, 'grad_norm': 0.36580199003219604, 'learning_rate': 0.00017990995497748876, 'epoch': 0.0}


 10%|█         | 1010/10000 [52:06<7:45:03,  3.10s/it]

{'loss': 0.855, 'grad_norm': 0.4541579782962799, 'learning_rate': 0.00017988994497248625, 'epoch': 0.0}


 10%|█         | 1011/10000 [52:09<7:21:49,  2.95s/it]

{'loss': 1.1161, 'grad_norm': 0.47177618741989136, 'learning_rate': 0.00017986993496748373, 'epoch': 0.0}


 10%|█         | 1012/10000 [52:12<7:38:43,  3.06s/it]

{'loss': 1.1417, 'grad_norm': 0.48234525322914124, 'learning_rate': 0.00017984992496248125, 'epoch': 0.0}


 10%|█         | 1013/10000 [52:15<7:48:48,  3.13s/it]

{'loss': 0.6806, 'grad_norm': 0.3848443925380707, 'learning_rate': 0.00017982991495747873, 'epoch': 0.0}


 10%|█         | 1014/10000 [52:20<9:05:48,  3.64s/it]

{'loss': 1.3968, 'grad_norm': 0.35731810331344604, 'learning_rate': 0.00017980990495247625, 'epoch': 0.0}


 10%|█         | 1015/10000 [52:23<8:41:08,  3.48s/it]

{'loss': 0.9539, 'grad_norm': 0.38894176483154297, 'learning_rate': 0.00017978989494747374, 'epoch': 0.0}


 10%|█         | 1016/10000 [52:26<7:54:43,  3.17s/it]

{'loss': 0.5952, 'grad_norm': 0.3665623366832733, 'learning_rate': 0.00017976988494247125, 'epoch': 0.0}


 10%|█         | 1017/10000 [52:29<7:56:33,  3.18s/it]

{'loss': 1.1258, 'grad_norm': 0.46673262119293213, 'learning_rate': 0.00017974987493746876, 'epoch': 0.0}


 10%|█         | 1018/10000 [52:33<8:30:59,  3.41s/it]

{'loss': 1.1405, 'grad_norm': 0.39378389716148376, 'learning_rate': 0.00017972986493246625, 'epoch': 0.0}


 10%|█         | 1019/10000 [52:36<8:35:21,  3.44s/it]

{'loss': 0.7304, 'grad_norm': 0.319276362657547, 'learning_rate': 0.00017970985492746374, 'epoch': 0.0}


 10%|█         | 1020/10000 [52:40<8:30:46,  3.41s/it]

{'loss': 0.8786, 'grad_norm': 0.4254480302333832, 'learning_rate': 0.00017968984492246123, 'epoch': 0.0}


 10%|█         | 1021/10000 [52:42<7:47:10,  3.12s/it]

{'loss': 0.7602, 'grad_norm': 0.47584569454193115, 'learning_rate': 0.00017966983491745874, 'epoch': 0.0}


 10%|█         | 1022/10000 [52:45<7:49:14,  3.14s/it]

{'loss': 0.9398, 'grad_norm': 0.39309096336364746, 'learning_rate': 0.00017964982491245623, 'epoch': 0.0}


 10%|█         | 1023/10000 [52:49<7:52:47,  3.16s/it]

{'loss': 0.9428, 'grad_norm': 0.3681878447532654, 'learning_rate': 0.00017962981490745374, 'epoch': 0.0}


 10%|█         | 1024/10000 [52:52<7:49:48,  3.14s/it]

{'loss': 1.0648, 'grad_norm': 0.40756818652153015, 'learning_rate': 0.00017960980490245123, 'epoch': 0.0}


 10%|█         | 1025/10000 [52:55<8:02:35,  3.23s/it]

{'loss': 0.8976, 'grad_norm': 0.3575495779514313, 'learning_rate': 0.00017958979489744874, 'epoch': 0.0}


 10%|█         | 1026/10000 [53:01<10:01:56,  4.02s/it]

{'loss': 1.1147, 'grad_norm': 0.31863483786582947, 'learning_rate': 0.00017956978489244623, 'epoch': 0.0}


 10%|█         | 1027/10000 [53:06<10:46:04,  4.32s/it]

{'loss': 1.0218, 'grad_norm': 0.3146381676197052, 'learning_rate': 0.00017954977488744375, 'epoch': 0.0}


 10%|█         | 1028/10000 [53:09<9:55:42,  3.98s/it] 

{'loss': 1.089, 'grad_norm': 0.38508471846580505, 'learning_rate': 0.00017952976488244123, 'epoch': 0.0}


 10%|█         | 1029/10000 [53:13<10:04:52,  4.05s/it]

{'loss': 0.8727, 'grad_norm': 0.308122456073761, 'learning_rate': 0.00017950975487743872, 'epoch': 0.0}


 10%|█         | 1030/10000 [53:16<9:16:14,  3.72s/it] 

{'loss': 0.8396, 'grad_norm': 0.42078015208244324, 'learning_rate': 0.0001794897448724362, 'epoch': 0.0}


 10%|█         | 1031/10000 [53:19<8:46:17,  3.52s/it]

{'loss': 1.0094, 'grad_norm': 0.4174982011318207, 'learning_rate': 0.00017946973486743372, 'epoch': 0.0}


 10%|█         | 1032/10000 [53:22<8:23:31,  3.37s/it]

{'loss': 0.9128, 'grad_norm': 0.3661598563194275, 'learning_rate': 0.00017944972486243124, 'epoch': 0.0}


 10%|█         | 1033/10000 [53:25<8:06:24,  3.25s/it]

{'loss': 0.9734, 'grad_norm': 0.39129889011383057, 'learning_rate': 0.00017942971485742872, 'epoch': 0.0}


 10%|█         | 1034/10000 [53:28<7:55:03,  3.18s/it]

{'loss': 1.104, 'grad_norm': 0.4265974462032318, 'learning_rate': 0.00017940970485242624, 'epoch': 0.0}


 10%|█         | 1035/10000 [53:31<7:27:33,  3.00s/it]

{'loss': 0.7698, 'grad_norm': 0.45659515261650085, 'learning_rate': 0.00017938969484742373, 'epoch': 0.0}


 10%|█         | 1036/10000 [53:36<9:16:55,  3.73s/it]

{'loss': 0.8994, 'grad_norm': 0.28124094009399414, 'learning_rate': 0.0001793696848424212, 'epoch': 0.0}


 10%|█         | 1037/10000 [53:40<8:56:13,  3.59s/it]

{'loss': 0.8216, 'grad_norm': 0.3583231270313263, 'learning_rate': 0.0001793496748374187, 'epoch': 0.0}


 10%|█         | 1038/10000 [53:44<9:26:58,  3.80s/it]

{'loss': 0.8427, 'grad_norm': 0.36746180057525635, 'learning_rate': 0.00017932966483241621, 'epoch': 0.0}


 10%|█         | 1039/10000 [53:48<9:25:40,  3.79s/it]

{'loss': 0.9982, 'grad_norm': 0.34446898102760315, 'learning_rate': 0.0001793096548274137, 'epoch': 0.0}


 10%|█         | 1040/10000 [53:51<8:48:37,  3.54s/it]

{'loss': 0.7126, 'grad_norm': 0.4096667468547821, 'learning_rate': 0.00017928964482241122, 'epoch': 0.0}


 10%|█         | 1041/10000 [53:54<8:28:15,  3.40s/it]

{'loss': 0.8993, 'grad_norm': 0.36181750893592834, 'learning_rate': 0.0001792696348174087, 'epoch': 0.0}


 10%|█         | 1042/10000 [53:57<8:03:21,  3.24s/it]

{'loss': 0.6825, 'grad_norm': 0.3512035012245178, 'learning_rate': 0.00017924962481240622, 'epoch': 0.0}


 10%|█         | 1043/10000 [54:01<8:45:33,  3.52s/it]

{'loss': 1.0684, 'grad_norm': 0.33863088488578796, 'learning_rate': 0.0001792296148074037, 'epoch': 0.0}


 10%|█         | 1044/10000 [54:04<8:40:16,  3.49s/it]

{'loss': 1.0864, 'grad_norm': 0.44580399990081787, 'learning_rate': 0.00017920960480240122, 'epoch': 0.0}


 10%|█         | 1045/10000 [54:08<9:06:10,  3.66s/it]

{'loss': 1.1508, 'grad_norm': 0.3972455561161041, 'learning_rate': 0.0001791895947973987, 'epoch': 0.0}


 10%|█         | 1046/10000 [54:11<8:30:09,  3.42s/it]

{'loss': 0.6741, 'grad_norm': 0.3745632469654083, 'learning_rate': 0.0001791695847923962, 'epoch': 0.0}


 10%|█         | 1047/10000 [54:14<8:25:02,  3.38s/it]

{'loss': 0.8349, 'grad_norm': 0.3604218661785126, 'learning_rate': 0.0001791495747873937, 'epoch': 0.0}


 10%|█         | 1048/10000 [54:17<8:06:51,  3.26s/it]

{'loss': 1.1672, 'grad_norm': 0.4213119447231293, 'learning_rate': 0.0001791295647823912, 'epoch': 0.0}


 10%|█         | 1049/10000 [54:21<8:12:14,  3.30s/it]

{'loss': 0.7529, 'grad_norm': 0.3657495677471161, 'learning_rate': 0.0001791095547773887, 'epoch': 0.0}


 10%|█         | 1050/10000 [54:24<7:48:47,  3.14s/it]

{'loss': 0.718, 'grad_norm': 0.41500237584114075, 'learning_rate': 0.0001790895447723862, 'epoch': 0.0}


 11%|█         | 1051/10000 [54:27<7:44:38,  3.12s/it]

{'loss': 0.7816, 'grad_norm': 0.34265244007110596, 'learning_rate': 0.0001790695347673837, 'epoch': 0.0}


 11%|█         | 1052/10000 [54:29<7:29:34,  3.01s/it]

{'loss': 0.905, 'grad_norm': 0.47406572103500366, 'learning_rate': 0.0001790495247623812, 'epoch': 0.0}


 11%|█         | 1053/10000 [54:32<7:22:11,  2.97s/it]

{'loss': 0.7342, 'grad_norm': 0.3654026389122009, 'learning_rate': 0.00017902951475737871, 'epoch': 0.0}


 11%|█         | 1054/10000 [54:35<7:17:53,  2.94s/it]

{'loss': 0.8686, 'grad_norm': 0.3814382553100586, 'learning_rate': 0.0001790095047523762, 'epoch': 0.0}


 11%|█         | 1055/10000 [54:38<7:32:54,  3.04s/it]

{'loss': 0.8349, 'grad_norm': 0.33373507857322693, 'learning_rate': 0.0001789894947473737, 'epoch': 0.0}


 11%|█         | 1056/10000 [54:42<7:42:35,  3.10s/it]

{'loss': 0.8324, 'grad_norm': 0.4059564173221588, 'learning_rate': 0.00017896948474237118, 'epoch': 0.0}


 11%|█         | 1057/10000 [54:45<7:51:31,  3.16s/it]

{'loss': 1.069, 'grad_norm': 0.4012860655784607, 'learning_rate': 0.0001789494747373687, 'epoch': 0.0}


 11%|█         | 1058/10000 [54:48<8:06:32,  3.26s/it]

{'loss': 0.5994, 'grad_norm': 0.3448933959007263, 'learning_rate': 0.00017892946473236618, 'epoch': 0.0}


 11%|█         | 1059/10000 [54:51<7:41:30,  3.10s/it]

{'loss': 1.2174, 'grad_norm': 0.4708961248397827, 'learning_rate': 0.0001789094547273637, 'epoch': 0.0}


 11%|█         | 1060/10000 [54:55<8:34:26,  3.45s/it]

{'loss': 1.2643, 'grad_norm': 0.3698481619358063, 'learning_rate': 0.0001788894447223612, 'epoch': 0.0}


 11%|█         | 1061/10000 [55:00<9:24:50,  3.79s/it]

{'loss': 0.8911, 'grad_norm': 0.3072957992553711, 'learning_rate': 0.0001788694347173587, 'epoch': 0.0}


 11%|█         | 1062/10000 [55:03<8:27:01,  3.40s/it]

{'loss': 0.9721, 'grad_norm': 0.42233410477638245, 'learning_rate': 0.0001788494247123562, 'epoch': 0.0}


 11%|█         | 1063/10000 [55:06<8:36:06,  3.47s/it]

{'loss': 0.749, 'grad_norm': 0.3381412923336029, 'learning_rate': 0.0001788294147073537, 'epoch': 0.0}


 11%|█         | 1064/10000 [55:09<8:23:59,  3.38s/it]

{'loss': 0.8432, 'grad_norm': 0.35448285937309265, 'learning_rate': 0.00017880940470235118, 'epoch': 0.0}


 11%|█         | 1065/10000 [55:13<8:34:56,  3.46s/it]

{'loss': 0.8469, 'grad_norm': 0.35880327224731445, 'learning_rate': 0.00017878939469734867, 'epoch': 0.0}


 11%|█         | 1066/10000 [55:17<8:40:49,  3.50s/it]

{'loss': 0.9559, 'grad_norm': 0.3267068862915039, 'learning_rate': 0.00017876938469234618, 'epoch': 0.0}


 11%|█         | 1067/10000 [55:20<8:18:30,  3.35s/it]

{'loss': 0.938, 'grad_norm': 0.3667980432510376, 'learning_rate': 0.00017874937468734367, 'epoch': 0.0}


 11%|█         | 1068/10000 [55:23<8:09:23,  3.29s/it]

{'loss': 0.7948, 'grad_norm': 0.3312922716140747, 'learning_rate': 0.00017872936468234119, 'epoch': 0.0}


 11%|█         | 1069/10000 [55:26<8:29:31,  3.42s/it]

{'loss': 1.1265, 'grad_norm': 0.3915485441684723, 'learning_rate': 0.00017870935467733867, 'epoch': 0.0}


 11%|█         | 1070/10000 [55:30<8:15:19,  3.33s/it]

{'loss': 0.6485, 'grad_norm': 0.3501192629337311, 'learning_rate': 0.0001786893446723362, 'epoch': 0.0}


 11%|█         | 1071/10000 [55:34<8:44:36,  3.53s/it]

{'loss': 1.1663, 'grad_norm': 0.3935524523258209, 'learning_rate': 0.00017866933466733367, 'epoch': 0.0}


 11%|█         | 1072/10000 [55:37<8:49:13,  3.56s/it]

{'loss': 0.7285, 'grad_norm': 0.3440955579280853, 'learning_rate': 0.00017864932466233116, 'epoch': 0.0}


 11%|█         | 1073/10000 [55:41<9:12:59,  3.72s/it]

{'loss': 1.1904, 'grad_norm': 0.3990284502506256, 'learning_rate': 0.00017862931465732868, 'epoch': 0.0}


 11%|█         | 1074/10000 [55:44<8:28:04,  3.42s/it]

{'loss': 0.7374, 'grad_norm': 0.40299028158187866, 'learning_rate': 0.00017860930465232616, 'epoch': 0.0}


 11%|█         | 1075/10000 [55:46<7:49:15,  3.15s/it]

{'loss': 0.8681, 'grad_norm': 0.43370574712753296, 'learning_rate': 0.00017858929464732368, 'epoch': 0.0}


 11%|█         | 1076/10000 [55:50<7:54:18,  3.19s/it]

{'loss': 1.0756, 'grad_norm': 0.39729487895965576, 'learning_rate': 0.00017856928464232117, 'epoch': 0.0}


 11%|█         | 1077/10000 [55:53<7:40:29,  3.10s/it]

{'loss': 0.8297, 'grad_norm': 0.3699020445346832, 'learning_rate': 0.00017854927463731868, 'epoch': 0.0}


 11%|█         | 1078/10000 [55:55<7:29:01,  3.02s/it]

{'loss': 1.0792, 'grad_norm': 0.41910961270332336, 'learning_rate': 0.00017852926463231617, 'epoch': 0.0}


 11%|█         | 1079/10000 [56:00<8:21:24,  3.37s/it]

{'loss': 0.961, 'grad_norm': 0.35088664293289185, 'learning_rate': 0.00017850925462731368, 'epoch': 0.0}


 11%|█         | 1080/10000 [56:03<8:27:47,  3.42s/it]

{'loss': 0.887, 'grad_norm': 0.36952701210975647, 'learning_rate': 0.00017848924462231117, 'epoch': 0.0}


 11%|█         | 1081/10000 [56:07<8:50:48,  3.57s/it]

{'loss': 0.8359, 'grad_norm': 0.3319911062717438, 'learning_rate': 0.00017846923461730866, 'epoch': 0.0}


 11%|█         | 1082/10000 [56:11<9:17:31,  3.75s/it]

{'loss': 0.9318, 'grad_norm': 0.32578545808792114, 'learning_rate': 0.00017844922461230614, 'epoch': 0.0}


 11%|█         | 1083/10000 [56:14<8:51:30,  3.58s/it]

{'loss': 0.7161, 'grad_norm': 0.3643161356449127, 'learning_rate': 0.00017842921460730366, 'epoch': 0.0}


 11%|█         | 1084/10000 [56:18<8:48:18,  3.56s/it]

{'loss': 0.9997, 'grad_norm': 0.3774770498275757, 'learning_rate': 0.00017840920460230114, 'epoch': 0.0}


 11%|█         | 1085/10000 [56:22<9:26:58,  3.82s/it]

{'loss': 1.2097, 'grad_norm': 0.3800417482852936, 'learning_rate': 0.00017838919459729866, 'epoch': 0.0}


 11%|█         | 1086/10000 [56:26<8:58:33,  3.63s/it]

{'loss': 0.8599, 'grad_norm': 0.4296238124370575, 'learning_rate': 0.00017836918459229615, 'epoch': 0.0}


 11%|█         | 1087/10000 [56:28<8:25:18,  3.40s/it]

{'loss': 0.8332, 'grad_norm': 0.409625768661499, 'learning_rate': 0.00017834917458729366, 'epoch': 0.0}


 11%|█         | 1088/10000 [56:31<8:01:15,  3.24s/it]

{'loss': 0.7229, 'grad_norm': 0.38779839873313904, 'learning_rate': 0.00017832916458229117, 'epoch': 0.0}


 11%|█         | 1089/10000 [56:34<7:50:42,  3.17s/it]

{'loss': 0.9009, 'grad_norm': 0.4037115275859833, 'learning_rate': 0.00017830915457728866, 'epoch': 0.0}


 11%|█         | 1090/10000 [56:38<8:01:22,  3.24s/it]

{'loss': 0.9333, 'grad_norm': 0.39343369007110596, 'learning_rate': 0.00017828914457228615, 'epoch': 0.0}


 11%|█         | 1091/10000 [56:41<7:55:24,  3.20s/it]

{'loss': 1.2178, 'grad_norm': 0.4204346239566803, 'learning_rate': 0.00017826913456728364, 'epoch': 0.0}


 11%|█         | 1092/10000 [56:44<8:03:48,  3.26s/it]

{'loss': 0.6459, 'grad_norm': 0.37094712257385254, 'learning_rate': 0.00017824912456228115, 'epoch': 0.0}


 11%|█         | 1093/10000 [56:49<8:56:19,  3.61s/it]

{'loss': 0.9783, 'grad_norm': 0.31571024656295776, 'learning_rate': 0.00017822911455727864, 'epoch': 0.0}


 11%|█         | 1094/10000 [56:52<8:43:46,  3.53s/it]

{'loss': 1.0261, 'grad_norm': 0.42248982191085815, 'learning_rate': 0.00017820910455227615, 'epoch': 0.0}


 11%|█         | 1095/10000 [56:56<9:03:29,  3.66s/it]

{'loss': 1.11, 'grad_norm': 0.36335211992263794, 'learning_rate': 0.00017818909454727364, 'epoch': 0.0}


 11%|█         | 1096/10000 [56:59<8:47:16,  3.55s/it]

{'loss': 1.0634, 'grad_norm': 0.3767731487751007, 'learning_rate': 0.00017816908454227115, 'epoch': 0.0}


 11%|█         | 1097/10000 [57:03<8:41:32,  3.51s/it]

{'loss': 0.8505, 'grad_norm': 0.3215705156326294, 'learning_rate': 0.00017814907453726864, 'epoch': 0.0}


 11%|█         | 1098/10000 [57:06<8:22:23,  3.39s/it]

{'loss': 0.7189, 'grad_norm': 0.3749426007270813, 'learning_rate': 0.00017812906453226616, 'epoch': 0.0}


 11%|█         | 1099/10000 [57:09<7:59:43,  3.23s/it]

{'loss': 0.844, 'grad_norm': 0.4164092242717743, 'learning_rate': 0.00017810905452726364, 'epoch': 0.0}


 11%|█         | 1100/10000 [57:13<8:42:10,  3.52s/it]

{'loss': 0.8588, 'grad_norm': 0.3426603376865387, 'learning_rate': 0.00017808904452226113, 'epoch': 0.0}


 11%|█         | 1101/10000 [57:18<9:51:46,  3.99s/it]

{'loss': 0.871, 'grad_norm': 0.32729849219322205, 'learning_rate': 0.00017806903451725865, 'epoch': 0.0}


 11%|█         | 1102/10000 [57:21<9:31:35,  3.85s/it]

{'loss': 1.0978, 'grad_norm': 0.38151344656944275, 'learning_rate': 0.00017804902451225613, 'epoch': 0.0}


 11%|█         | 1103/10000 [57:25<9:09:27,  3.71s/it]

{'loss': 0.8231, 'grad_norm': 0.3408070206642151, 'learning_rate': 0.00017802901450725365, 'epoch': 0.0}


 11%|█         | 1104/10000 [57:28<8:24:02,  3.40s/it]

{'loss': 1.0483, 'grad_norm': 0.4523308575153351, 'learning_rate': 0.00017800900450225113, 'epoch': 0.0}


 11%|█         | 1105/10000 [57:31<8:30:18,  3.44s/it]

{'loss': 0.9568, 'grad_norm': 0.4104125499725342, 'learning_rate': 0.00017798899449724865, 'epoch': 0.0}


 11%|█         | 1106/10000 [57:34<7:56:33,  3.21s/it]

{'loss': 0.8818, 'grad_norm': 0.39102330803871155, 'learning_rate': 0.00017796898449224614, 'epoch': 0.0}


 11%|█         | 1107/10000 [57:37<7:57:59,  3.22s/it]

{'loss': 0.9591, 'grad_norm': 0.3862816095352173, 'learning_rate': 0.00017794897448724362, 'epoch': 0.0}


 11%|█         | 1108/10000 [57:40<8:00:43,  3.24s/it]

{'loss': 0.6897, 'grad_norm': 0.3605005741119385, 'learning_rate': 0.0001779289644822411, 'epoch': 0.0}


 11%|█         | 1109/10000 [57:43<7:54:08,  3.20s/it]

{'loss': 0.8262, 'grad_norm': 0.39381399750709534, 'learning_rate': 0.00017790895447723862, 'epoch': 0.0}


 11%|█         | 1110/10000 [57:46<7:29:58,  3.04s/it]

{'loss': 0.914, 'grad_norm': 0.48341861367225647, 'learning_rate': 0.0001778889444722361, 'epoch': 0.0}


 11%|█         | 1111/10000 [57:49<7:22:31,  2.99s/it]

{'loss': 0.856, 'grad_norm': 0.4123803973197937, 'learning_rate': 0.00017786893446723363, 'epoch': 0.0}


 11%|█         | 1112/10000 [57:53<8:32:18,  3.46s/it]

{'loss': 0.9984, 'grad_norm': 0.3656865060329437, 'learning_rate': 0.00017784892446223111, 'epoch': 0.0}


 11%|█         | 1113/10000 [57:57<8:40:10,  3.51s/it]

{'loss': 0.8683, 'grad_norm': 0.40656787157058716, 'learning_rate': 0.00017782891445722863, 'epoch': 0.0}


 11%|█         | 1114/10000 [58:00<8:33:04,  3.46s/it]

{'loss': 0.8627, 'grad_norm': 0.3731459975242615, 'learning_rate': 0.00017780890445222612, 'epoch': 0.0}


 11%|█         | 1115/10000 [58:04<8:34:41,  3.48s/it]

{'loss': 1.3994, 'grad_norm': 0.4189288020133972, 'learning_rate': 0.00017778889444722363, 'epoch': 0.0}


 11%|█         | 1116/10000 [58:07<8:10:57,  3.32s/it]

{'loss': 0.9023, 'grad_norm': 0.37685179710388184, 'learning_rate': 0.00017776888444222112, 'epoch': 0.0}


 11%|█         | 1117/10000 [58:10<8:20:43,  3.38s/it]

{'loss': 1.1973, 'grad_norm': 0.49203869700431824, 'learning_rate': 0.0001777488744372186, 'epoch': 0.0}


 11%|█         | 1118/10000 [58:16<10:10:33,  4.12s/it]

{'loss': 1.1886, 'grad_norm': 0.3275721073150635, 'learning_rate': 0.00017772886443221612, 'epoch': 0.0}


 11%|█         | 1119/10000 [58:19<9:29:13,  3.85s/it] 

{'loss': 1.0764, 'grad_norm': 0.3826620280742645, 'learning_rate': 0.0001777088544272136, 'epoch': 0.0}


 11%|█         | 1120/10000 [58:23<8:56:51,  3.63s/it]

{'loss': 0.9367, 'grad_norm': 0.3783700168132782, 'learning_rate': 0.00017768884442221112, 'epoch': 0.0}


 11%|█         | 1121/10000 [58:26<8:26:53,  3.43s/it]

{'loss': 0.8293, 'grad_norm': 0.34077373147010803, 'learning_rate': 0.0001776688344172086, 'epoch': 0.0}


 11%|█         | 1122/10000 [58:29<8:16:32,  3.36s/it]

{'loss': 1.2431, 'grad_norm': 0.4186904728412628, 'learning_rate': 0.00017764882441220612, 'epoch': 0.0}


 11%|█         | 1123/10000 [58:32<8:09:44,  3.31s/it]

{'loss': 1.0289, 'grad_norm': 0.49050745368003845, 'learning_rate': 0.0001776288144072036, 'epoch': 0.0}


 11%|█         | 1124/10000 [58:38<10:07:12,  4.10s/it]

{'loss': 1.3245, 'grad_norm': 0.32752513885498047, 'learning_rate': 0.00017760880440220112, 'epoch': 0.0}


 11%|█▏        | 1125/10000 [58:41<9:04:11,  3.68s/it] 

{'loss': 0.9544, 'grad_norm': 0.4187333285808563, 'learning_rate': 0.0001775887943971986, 'epoch': 0.0}


 11%|█▏        | 1126/10000 [58:45<9:37:28,  3.90s/it]

{'loss': 1.3813, 'grad_norm': 0.36247125267982483, 'learning_rate': 0.0001775687843921961, 'epoch': 0.0}


 11%|█▏        | 1127/10000 [58:49<9:25:46,  3.83s/it]

{'loss': 0.5921, 'grad_norm': 0.3575524091720581, 'learning_rate': 0.00017754877438719359, 'epoch': 0.0}


 11%|█▏        | 1128/10000 [58:51<8:34:29,  3.48s/it]

{'loss': 0.9382, 'grad_norm': 0.420425146818161, 'learning_rate': 0.0001775287643821911, 'epoch': 0.0}


 11%|█▏        | 1129/10000 [58:56<9:46:49,  3.97s/it]

{'loss': 1.2371, 'grad_norm': 0.3704085946083069, 'learning_rate': 0.00017750875437718861, 'epoch': 0.0}


 11%|█▏        | 1130/10000 [58:59<8:46:11,  3.56s/it]

{'loss': 0.8449, 'grad_norm': 0.46950483322143555, 'learning_rate': 0.0001774887443721861, 'epoch': 0.0}


 11%|█▏        | 1131/10000 [59:03<9:05:10,  3.69s/it]

{'loss': 1.1926, 'grad_norm': 0.39545202255249023, 'learning_rate': 0.00017746873436718362, 'epoch': 0.0}


 11%|█▏        | 1132/10000 [59:06<8:38:39,  3.51s/it]

{'loss': 0.7825, 'grad_norm': 0.32151493430137634, 'learning_rate': 0.0001774487243621811, 'epoch': 0.0}


 11%|█▏        | 1133/10000 [59:10<8:58:41,  3.65s/it]

{'loss': 1.1591, 'grad_norm': 0.369305819272995, 'learning_rate': 0.00017742871435717862, 'epoch': 0.0}


 11%|█▏        | 1134/10000 [59:14<9:21:09,  3.80s/it]

{'loss': 1.1704, 'grad_norm': 0.3309914767742157, 'learning_rate': 0.0001774087043521761, 'epoch': 0.0}


 11%|█▏        | 1135/10000 [59:18<8:59:49,  3.65s/it]

{'loss': 0.9567, 'grad_norm': 0.3632580041885376, 'learning_rate': 0.0001773886943471736, 'epoch': 0.0}


 11%|█▏        | 1136/10000 [59:21<9:09:01,  3.72s/it]

{'loss': 1.0884, 'grad_norm': 0.36128389835357666, 'learning_rate': 0.00017736868434217108, 'epoch': 0.0}


 11%|█▏        | 1137/10000 [59:25<8:40:41,  3.52s/it]

{'loss': 0.8655, 'grad_norm': 0.3592533767223358, 'learning_rate': 0.0001773486743371686, 'epoch': 0.0}


 11%|█▏        | 1138/10000 [59:27<8:10:22,  3.32s/it]

{'loss': 0.9926, 'grad_norm': 0.3987918496131897, 'learning_rate': 0.00017732866433216608, 'epoch': 0.0}


 11%|█▏        | 1139/10000 [59:31<8:08:19,  3.31s/it]

{'loss': 0.6667, 'grad_norm': 0.4018770158290863, 'learning_rate': 0.0001773086543271636, 'epoch': 0.0}


 11%|█▏        | 1140/10000 [59:34<8:14:18,  3.35s/it]

{'loss': 1.0506, 'grad_norm': 0.36013349890708923, 'learning_rate': 0.00017728864432216108, 'epoch': 0.0}


 11%|█▏        | 1141/10000 [59:38<8:32:26,  3.47s/it]

{'loss': 1.0268, 'grad_norm': 0.3720950782299042, 'learning_rate': 0.0001772686343171586, 'epoch': 0.0}


 11%|█▏        | 1142/10000 [59:41<8:35:02,  3.49s/it]

{'loss': 1.0066, 'grad_norm': 0.3419332206249237, 'learning_rate': 0.00017724862431215608, 'epoch': 0.0}


 11%|█▏        | 1143/10000 [59:45<8:21:36,  3.40s/it]

{'loss': 0.6888, 'grad_norm': 0.4097478687763214, 'learning_rate': 0.00017722861430715357, 'epoch': 0.0}


 11%|█▏        | 1144/10000 [59:47<7:43:34,  3.14s/it]

{'loss': 0.78, 'grad_norm': 0.431882381439209, 'learning_rate': 0.00017720860430215109, 'epoch': 0.0}


 11%|█▏        | 1145/10000 [59:50<7:36:58,  3.10s/it]

{'loss': 0.8843, 'grad_norm': 0.4061393141746521, 'learning_rate': 0.00017718859429714857, 'epoch': 0.0}


 11%|█▏        | 1146/10000 [59:53<7:34:59,  3.08s/it]

{'loss': 0.7728, 'grad_norm': 0.35415154695510864, 'learning_rate': 0.0001771685842921461, 'epoch': 0.0}


 11%|█▏        | 1147/10000 [59:57<7:57:55,  3.24s/it]

{'loss': 0.8735, 'grad_norm': 0.3510854244232178, 'learning_rate': 0.00017714857428714358, 'epoch': 0.0}


 11%|█▏        | 1148/10000 [1:00:01<8:33:11,  3.48s/it]

{'loss': 0.8778, 'grad_norm': 0.29606524109840393, 'learning_rate': 0.0001771285642821411, 'epoch': 0.0}


 11%|█▏        | 1149/10000 [1:00:05<8:56:35,  3.64s/it]

{'loss': 0.91, 'grad_norm': 0.31025809049606323, 'learning_rate': 0.00017710855427713858, 'epoch': 0.0}


 12%|█▏        | 1150/10000 [1:00:08<8:39:04,  3.52s/it]

{'loss': 0.8293, 'grad_norm': 0.33598220348358154, 'learning_rate': 0.0001770885442721361, 'epoch': 0.0}


 12%|█▏        | 1151/10000 [1:00:12<8:53:51,  3.62s/it]

{'loss': 0.5714, 'grad_norm': 0.2883164584636688, 'learning_rate': 0.00017706853426713358, 'epoch': 0.0}


 12%|█▏        | 1152/10000 [1:00:15<8:39:47,  3.52s/it]

{'loss': 0.8817, 'grad_norm': 0.3504977822303772, 'learning_rate': 0.00017704852426213107, 'epoch': 0.0}


 12%|█▏        | 1153/10000 [1:00:18<8:30:39,  3.46s/it]

{'loss': 0.6489, 'grad_norm': 0.3565238118171692, 'learning_rate': 0.00017702851425712855, 'epoch': 0.0}


 12%|█▏        | 1154/10000 [1:00:21<7:55:54,  3.23s/it]

{'loss': 0.8261, 'grad_norm': 0.400259792804718, 'learning_rate': 0.00017700850425212607, 'epoch': 0.0}


 12%|█▏        | 1155/10000 [1:00:24<7:56:38,  3.23s/it]

{'loss': 1.1789, 'grad_norm': 0.38549691438674927, 'learning_rate': 0.00017698849424712355, 'epoch': 0.0}


 12%|█▏        | 1156/10000 [1:00:27<7:40:23,  3.12s/it]

{'loss': 0.7585, 'grad_norm': 0.41644972562789917, 'learning_rate': 0.00017696848424212107, 'epoch': 0.0}


 12%|█▏        | 1157/10000 [1:00:30<7:35:22,  3.09s/it]

{'loss': 0.7254, 'grad_norm': 0.31374138593673706, 'learning_rate': 0.00017694847423711858, 'epoch': 0.0}


 12%|█▏        | 1158/10000 [1:00:33<7:11:01,  2.92s/it]

{'loss': 1.0392, 'grad_norm': 0.5658774971961975, 'learning_rate': 0.00017692846423211607, 'epoch': 0.0}


 12%|█▏        | 1159/10000 [1:00:36<7:14:18,  2.95s/it]

{'loss': 0.775, 'grad_norm': 0.3730035424232483, 'learning_rate': 0.00017690845422711359, 'epoch': 0.0}


 12%|█▏        | 1160/10000 [1:00:39<7:40:48,  3.13s/it]

{'loss': 1.2351, 'grad_norm': 0.41334500908851624, 'learning_rate': 0.00017688844422211107, 'epoch': 0.0}


 12%|█▏        | 1161/10000 [1:00:42<7:32:19,  3.07s/it]

{'loss': 1.1609, 'grad_norm': 0.43242761492729187, 'learning_rate': 0.00017686843421710856, 'epoch': 0.0}


 12%|█▏        | 1162/10000 [1:00:45<7:13:26,  2.94s/it]

{'loss': 0.7818, 'grad_norm': 0.41229158639907837, 'learning_rate': 0.00017684842421210605, 'epoch': 0.0}


 12%|█▏        | 1163/10000 [1:00:48<7:11:10,  2.93s/it]

{'loss': 0.8026, 'grad_norm': 0.4561728239059448, 'learning_rate': 0.00017682841420710356, 'epoch': 0.0}


 12%|█▏        | 1164/10000 [1:00:52<7:50:55,  3.20s/it]

{'loss': 0.9278, 'grad_norm': 0.4052114188671112, 'learning_rate': 0.00017680840420210105, 'epoch': 0.0}


 12%|█▏        | 1165/10000 [1:00:57<9:46:00,  3.98s/it]

{'loss': 1.2325, 'grad_norm': 0.32801154255867004, 'learning_rate': 0.00017678839419709856, 'epoch': 0.0}


 12%|█▏        | 1166/10000 [1:01:01<9:11:58,  3.75s/it]

{'loss': 0.7452, 'grad_norm': 0.29917728900909424, 'learning_rate': 0.00017676838419209605, 'epoch': 0.0}


 12%|█▏        | 1167/10000 [1:01:04<9:10:50,  3.74s/it]

{'loss': 0.7406, 'grad_norm': 0.31722405552864075, 'learning_rate': 0.00017674837418709356, 'epoch': 0.0}


 12%|█▏        | 1168/10000 [1:01:08<8:54:00,  3.63s/it]

{'loss': 0.9729, 'grad_norm': 0.3475179672241211, 'learning_rate': 0.00017672836418209105, 'epoch': 0.0}


 12%|█▏        | 1169/10000 [1:01:11<8:44:18,  3.56s/it]

{'loss': 0.9181, 'grad_norm': 0.3692089915275574, 'learning_rate': 0.00017670835417708857, 'epoch': 0.0}


 12%|█▏        | 1170/10000 [1:01:15<8:33:04,  3.49s/it]

{'loss': 0.9227, 'grad_norm': 0.3813936114311218, 'learning_rate': 0.00017668834417208605, 'epoch': 0.0}


 12%|█▏        | 1171/10000 [1:01:17<8:05:47,  3.30s/it]

{'loss': 0.8898, 'grad_norm': 0.39034467935562134, 'learning_rate': 0.00017666833416708354, 'epoch': 0.0}


 12%|█▏        | 1172/10000 [1:01:21<8:40:07,  3.54s/it]

{'loss': 1.0132, 'grad_norm': 0.350498229265213, 'learning_rate': 0.00017664832416208106, 'epoch': 0.0}


 12%|█▏        | 1173/10000 [1:01:25<8:37:53,  3.52s/it]

{'loss': 0.903, 'grad_norm': 0.3783378005027771, 'learning_rate': 0.00017662831415707854, 'epoch': 0.0}


 12%|█▏        | 1174/10000 [1:01:30<9:31:02,  3.88s/it]

{'loss': 1.3807, 'grad_norm': 0.3391728699207306, 'learning_rate': 0.00017660830415207606, 'epoch': 0.0}


 12%|█▏        | 1175/10000 [1:01:33<8:59:40,  3.67s/it]

{'loss': 0.6934, 'grad_norm': 0.34884145855903625, 'learning_rate': 0.00017658829414707354, 'epoch': 0.0}


 12%|█▏        | 1176/10000 [1:01:37<9:03:13,  3.69s/it]

{'loss': 0.9166, 'grad_norm': 0.3689277768135071, 'learning_rate': 0.00017656828414207106, 'epoch': 0.0}


 12%|█▏        | 1177/10000 [1:01:40<8:44:52,  3.57s/it]

{'loss': 0.8446, 'grad_norm': 0.3765999972820282, 'learning_rate': 0.00017654827413706855, 'epoch': 0.0}


 12%|█▏        | 1178/10000 [1:01:44<9:13:28,  3.76s/it]

{'loss': 1.1536, 'grad_norm': 0.33140140771865845, 'learning_rate': 0.00017652826413206603, 'epoch': 0.0}


 12%|█▏        | 1179/10000 [1:01:48<9:21:45,  3.82s/it]

{'loss': 1.2831, 'grad_norm': 0.37201571464538574, 'learning_rate': 0.00017650825412706352, 'epoch': 0.0}


 12%|█▏        | 1180/10000 [1:01:51<8:36:44,  3.52s/it]

{'loss': 0.8765, 'grad_norm': 0.48428571224212646, 'learning_rate': 0.00017648824412206103, 'epoch': 0.0}


 12%|█▏        | 1181/10000 [1:01:54<8:19:37,  3.40s/it]

{'loss': 0.8521, 'grad_norm': 0.3505743741989136, 'learning_rate': 0.00017646823411705852, 'epoch': 0.0}


 12%|█▏        | 1182/10000 [1:01:57<7:44:18,  3.16s/it]

{'loss': 0.7342, 'grad_norm': 0.37484025955200195, 'learning_rate': 0.00017644822411205604, 'epoch': 0.0}


 12%|█▏        | 1183/10000 [1:02:01<8:19:36,  3.40s/it]

{'loss': 1.1452, 'grad_norm': 0.36766335368156433, 'learning_rate': 0.00017642821410705352, 'epoch': 0.0}


 12%|█▏        | 1184/10000 [1:02:05<8:51:22,  3.62s/it]

{'loss': 1.4158, 'grad_norm': 0.40051451325416565, 'learning_rate': 0.00017640820410205104, 'epoch': 0.0}


 12%|█▏        | 1185/10000 [1:02:08<8:27:51,  3.46s/it]

{'loss': 0.8553, 'grad_norm': 0.38210368156433105, 'learning_rate': 0.00017638819409704855, 'epoch': 0.0}


 12%|█▏        | 1186/10000 [1:02:11<8:00:36,  3.27s/it]

{'loss': 0.7471, 'grad_norm': 0.4277380108833313, 'learning_rate': 0.00017636818409204604, 'epoch': 0.0}


 12%|█▏        | 1187/10000 [1:02:13<7:39:36,  3.13s/it]

{'loss': 0.8662, 'grad_norm': 0.3856571316719055, 'learning_rate': 0.00017634817408704353, 'epoch': 0.0}


 12%|█▏        | 1188/10000 [1:02:16<7:20:17,  3.00s/it]

{'loss': 0.8378, 'grad_norm': 0.4236706495285034, 'learning_rate': 0.00017632816408204101, 'epoch': 0.0}


 12%|█▏        | 1189/10000 [1:02:19<7:31:59,  3.08s/it]

{'loss': 0.833, 'grad_norm': 0.38680875301361084, 'learning_rate': 0.00017630815407703853, 'epoch': 0.0}


 12%|█▏        | 1190/10000 [1:02:22<7:28:19,  3.05s/it]

{'loss': 1.0269, 'grad_norm': 0.39948105812072754, 'learning_rate': 0.00017628814407203602, 'epoch': 0.0}


 12%|█▏        | 1191/10000 [1:02:26<7:34:56,  3.10s/it]

{'loss': 0.9793, 'grad_norm': 0.3743472695350647, 'learning_rate': 0.00017626813406703353, 'epoch': 0.0}


 12%|█▏        | 1192/10000 [1:02:29<7:34:22,  3.10s/it]

{'loss': 0.9611, 'grad_norm': 0.37915271520614624, 'learning_rate': 0.00017624812406203102, 'epoch': 0.0}


 12%|█▏        | 1193/10000 [1:02:32<7:43:05,  3.15s/it]

{'loss': 1.1658, 'grad_norm': 0.38740500807762146, 'learning_rate': 0.00017622811405702853, 'epoch': 0.0}


 12%|█▏        | 1194/10000 [1:02:35<7:56:25,  3.25s/it]

{'loss': 0.8631, 'grad_norm': 0.37184566259384155, 'learning_rate': 0.00017620810405202602, 'epoch': 0.0}


 12%|█▏        | 1195/10000 [1:02:38<7:35:16,  3.10s/it]

{'loss': 0.8525, 'grad_norm': 0.4151230752468109, 'learning_rate': 0.00017618809404702353, 'epoch': 0.0}


 12%|█▏        | 1196/10000 [1:02:41<7:35:25,  3.10s/it]

{'loss': 0.9485, 'grad_norm': 0.39373263716697693, 'learning_rate': 0.00017616808404202102, 'epoch': 0.0}


 12%|█▏        | 1197/10000 [1:02:45<7:42:30,  3.15s/it]

{'loss': 1.0535, 'grad_norm': 0.3856266736984253, 'learning_rate': 0.0001761480740370185, 'epoch': 0.0}


 12%|█▏        | 1198/10000 [1:02:48<7:54:49,  3.24s/it]

{'loss': 0.8825, 'grad_norm': 0.3560742139816284, 'learning_rate': 0.00017612806403201602, 'epoch': 0.0}


 12%|█▏        | 1199/10000 [1:02:51<7:39:01,  3.13s/it]

{'loss': 0.9501, 'grad_norm': 0.4406091570854187, 'learning_rate': 0.0001761080540270135, 'epoch': 0.0}


 12%|█▏        | 1200/10000 [1:02:54<7:38:19,  3.12s/it]

{'loss': 0.9133, 'grad_norm': 0.33647769689559937, 'learning_rate': 0.00017608804402201102, 'epoch': 0.0}


 12%|█▏        | 1201/10000 [1:03:02<11:22:14,  4.65s/it]

{'loss': 1.2366, 'grad_norm': 0.31673622131347656, 'learning_rate': 0.0001760680340170085, 'epoch': 0.0}


 12%|█▏        | 1202/10000 [1:03:05<10:23:25,  4.25s/it]

{'loss': 1.0336, 'grad_norm': 0.3408401906490326, 'learning_rate': 0.00017604802401200603, 'epoch': 0.0}


 12%|█▏        | 1203/10000 [1:03:10<10:32:30,  4.31s/it]

{'loss': 1.0865, 'grad_norm': 0.3301355540752411, 'learning_rate': 0.0001760280140070035, 'epoch': 0.0}


 12%|█▏        | 1204/10000 [1:03:14<10:10:17,  4.16s/it]

{'loss': 1.0323, 'grad_norm': 0.36282020807266235, 'learning_rate': 0.00017600800400200103, 'epoch': 0.0}


 12%|█▏        | 1205/10000 [1:03:17<9:41:32,  3.97s/it] 

{'loss': 1.0357, 'grad_norm': 0.3732912540435791, 'learning_rate': 0.00017598799399699851, 'epoch': 0.0}


 12%|█▏        | 1206/10000 [1:03:22<10:02:58,  4.11s/it]

{'loss': 0.9566, 'grad_norm': 0.32776007056236267, 'learning_rate': 0.000175967983991996, 'epoch': 0.0}


 12%|█▏        | 1207/10000 [1:03:26<9:55:08,  4.06s/it] 

{'loss': 1.0606, 'grad_norm': 0.3876120150089264, 'learning_rate': 0.0001759479739869935, 'epoch': 0.0}


 12%|█▏        | 1208/10000 [1:03:28<8:57:53,  3.67s/it]

{'loss': 0.5625, 'grad_norm': 0.32974693179130554, 'learning_rate': 0.000175927963981991, 'epoch': 0.0}


 12%|█▏        | 1209/10000 [1:03:32<9:02:23,  3.70s/it]

{'loss': 0.8476, 'grad_norm': 0.3296140134334564, 'learning_rate': 0.0001759079539769885, 'epoch': 0.0}


 12%|█▏        | 1210/10000 [1:03:35<8:15:11,  3.38s/it]

{'loss': 0.9283, 'grad_norm': 0.4074934720993042, 'learning_rate': 0.000175887943971986, 'epoch': 0.0}


 12%|█▏        | 1211/10000 [1:03:38<8:16:58,  3.39s/it]

{'loss': 1.0205, 'grad_norm': 0.3803088068962097, 'learning_rate': 0.0001758679339669835, 'epoch': 0.0}


 12%|█▏        | 1212/10000 [1:03:41<7:55:23,  3.25s/it]

{'loss': 0.81, 'grad_norm': 0.3711971640586853, 'learning_rate': 0.000175847923961981, 'epoch': 0.0}


 12%|█▏        | 1213/10000 [1:03:44<7:20:48,  3.01s/it]

{'loss': 0.6312, 'grad_norm': 0.46503642201423645, 'learning_rate': 0.0001758279139569785, 'epoch': 0.0}


 12%|█▏        | 1214/10000 [1:03:48<8:07:25,  3.33s/it]

{'loss': 1.0454, 'grad_norm': 0.36537498235702515, 'learning_rate': 0.00017580790395197598, 'epoch': 0.0}


 12%|█▏        | 1215/10000 [1:03:51<8:19:57,  3.41s/it]

{'loss': 0.9638, 'grad_norm': 0.40710175037384033, 'learning_rate': 0.0001757878939469735, 'epoch': 0.0}


 12%|█▏        | 1216/10000 [1:03:55<8:35:02,  3.52s/it]

{'loss': 1.0126, 'grad_norm': 0.3235405385494232, 'learning_rate': 0.00017576788394197098, 'epoch': 0.0}


 12%|█▏        | 1217/10000 [1:03:59<8:35:32,  3.52s/it]

{'loss': 0.9344, 'grad_norm': 0.3837847411632538, 'learning_rate': 0.0001757478739369685, 'epoch': 0.0}


 12%|█▏        | 1218/10000 [1:04:01<7:58:22,  3.27s/it]

{'loss': 0.8469, 'grad_norm': 0.4385201632976532, 'learning_rate': 0.00017572786393196599, 'epoch': 0.0}


 12%|█▏        | 1219/10000 [1:04:04<7:49:34,  3.21s/it]

{'loss': 0.6346, 'grad_norm': 0.3390127420425415, 'learning_rate': 0.0001757078539269635, 'epoch': 0.0}


 12%|█▏        | 1220/10000 [1:04:08<8:07:15,  3.33s/it]

{'loss': 0.7669, 'grad_norm': 0.3653702735900879, 'learning_rate': 0.000175687843921961, 'epoch': 0.0}


 12%|█▏        | 1221/10000 [1:04:11<8:08:38,  3.34s/it]

{'loss': 0.6998, 'grad_norm': 0.32273826003074646, 'learning_rate': 0.0001756678339169585, 'epoch': 0.0}


 12%|█▏        | 1222/10000 [1:04:14<7:48:14,  3.20s/it]

{'loss': 0.8535, 'grad_norm': 0.3741695284843445, 'learning_rate': 0.000175647823911956, 'epoch': 0.0}


 12%|█▏        | 1223/10000 [1:04:18<8:03:40,  3.31s/it]

{'loss': 0.9438, 'grad_norm': 0.34811338782310486, 'learning_rate': 0.00017562781390695348, 'epoch': 0.0}


 12%|█▏        | 1224/10000 [1:04:20<7:27:31,  3.06s/it]

{'loss': 0.8534, 'grad_norm': 0.38080018758773804, 'learning_rate': 0.00017560780390195096, 'epoch': 0.0}


 12%|█▏        | 1225/10000 [1:04:23<7:12:42,  2.96s/it]

{'loss': 0.7411, 'grad_norm': 0.38461196422576904, 'learning_rate': 0.00017558779389694848, 'epoch': 0.0}


 12%|█▏        | 1226/10000 [1:04:26<7:26:28,  3.05s/it]

{'loss': 0.8364, 'grad_norm': 0.3981606364250183, 'learning_rate': 0.000175567783891946, 'epoch': 0.0}


 12%|█▏        | 1227/10000 [1:04:30<7:51:06,  3.22s/it]

{'loss': 0.7925, 'grad_norm': 0.35475778579711914, 'learning_rate': 0.00017554777388694348, 'epoch': 0.0}


 12%|█▏        | 1228/10000 [1:04:35<8:55:13,  3.66s/it]

{'loss': 1.0688, 'grad_norm': 0.3347128629684448, 'learning_rate': 0.000175527763881941, 'epoch': 0.0}


 12%|█▏        | 1229/10000 [1:04:38<8:37:36,  3.54s/it]

{'loss': 0.6555, 'grad_norm': 0.36571481823921204, 'learning_rate': 0.00017550775387693848, 'epoch': 0.0}


 12%|█▏        | 1230/10000 [1:04:41<8:11:52,  3.37s/it]

{'loss': 0.8847, 'grad_norm': 0.38864263892173767, 'learning_rate': 0.000175487743871936, 'epoch': 0.0}


 12%|█▏        | 1231/10000 [1:04:44<7:48:20,  3.20s/it]

{'loss': 0.7131, 'grad_norm': 0.34743714332580566, 'learning_rate': 0.00017546773386693348, 'epoch': 0.0}


 12%|█▏        | 1232/10000 [1:04:47<7:50:38,  3.22s/it]

{'loss': 0.7065, 'grad_norm': 0.34324339032173157, 'learning_rate': 0.00017544772386193097, 'epoch': 0.0}


 12%|█▏        | 1233/10000 [1:04:51<8:35:22,  3.53s/it]

{'loss': 0.8441, 'grad_norm': 0.34020695090293884, 'learning_rate': 0.00017542771385692846, 'epoch': 0.0}


 12%|█▏        | 1234/10000 [1:04:55<8:45:07,  3.59s/it]

{'loss': 0.7089, 'grad_norm': 0.3138328790664673, 'learning_rate': 0.00017540770385192597, 'epoch': 0.0}


 12%|█▏        | 1235/10000 [1:04:57<8:01:03,  3.29s/it]

{'loss': 0.6997, 'grad_norm': 0.3785145878791809, 'learning_rate': 0.00017538769384692346, 'epoch': 0.0}


 12%|█▏        | 1236/10000 [1:05:00<7:47:14,  3.20s/it]

{'loss': 0.9795, 'grad_norm': 0.3716900646686554, 'learning_rate': 0.00017536768384192097, 'epoch': 0.0}


 12%|█▏        | 1237/10000 [1:05:04<7:55:05,  3.25s/it]

{'loss': 0.8288, 'grad_norm': 0.3586982786655426, 'learning_rate': 0.00017534767383691846, 'epoch': 0.0}


 12%|█▏        | 1238/10000 [1:05:07<7:57:18,  3.27s/it]

{'loss': 0.8207, 'grad_norm': 0.38443732261657715, 'learning_rate': 0.00017532766383191597, 'epoch': 0.0}


 12%|█▏        | 1239/10000 [1:05:10<7:37:13,  3.13s/it]

{'loss': 0.965, 'grad_norm': 0.5217766165733337, 'learning_rate': 0.0001753076538269135, 'epoch': 0.0}


 12%|█▏        | 1240/10000 [1:05:13<7:26:23,  3.06s/it]

{'loss': 1.0056, 'grad_norm': 0.38673877716064453, 'learning_rate': 0.00017528764382191098, 'epoch': 0.0}


 12%|█▏        | 1241/10000 [1:05:17<8:16:48,  3.40s/it]

{'loss': 1.0081, 'grad_norm': 0.36942335963249207, 'learning_rate': 0.00017526763381690846, 'epoch': 0.0}


 12%|█▏        | 1242/10000 [1:05:20<8:07:24,  3.34s/it]

{'loss': 0.726, 'grad_norm': 0.40361225605010986, 'learning_rate': 0.00017524762381190595, 'epoch': 0.0}


 12%|█▏        | 1243/10000 [1:05:23<7:36:33,  3.13s/it]

{'loss': 0.8127, 'grad_norm': 0.45693039894104004, 'learning_rate': 0.00017522761380690347, 'epoch': 0.0}


 12%|█▏        | 1244/10000 [1:05:25<7:17:30,  3.00s/it]

{'loss': 0.8365, 'grad_norm': 0.3817235827445984, 'learning_rate': 0.00017520760380190095, 'epoch': 0.0}


 12%|█▏        | 1245/10000 [1:05:29<7:55:17,  3.26s/it]

{'loss': 1.067, 'grad_norm': 0.3423044979572296, 'learning_rate': 0.00017518759379689847, 'epoch': 0.0}


 12%|█▏        | 1246/10000 [1:05:32<7:20:34,  3.02s/it]

{'loss': 0.7574, 'grad_norm': 0.43330201506614685, 'learning_rate': 0.00017516758379189595, 'epoch': 0.0}


 12%|█▏        | 1247/10000 [1:05:35<7:16:07,  2.99s/it]

{'loss': 0.8927, 'grad_norm': 0.4130440950393677, 'learning_rate': 0.00017514757378689347, 'epoch': 0.0}


 12%|█▏        | 1248/10000 [1:05:38<7:47:07,  3.20s/it]

{'loss': 0.9031, 'grad_norm': 0.35345110297203064, 'learning_rate': 0.00017512756378189096, 'epoch': 0.0}


 12%|█▏        | 1249/10000 [1:05:43<8:36:12,  3.54s/it]

{'loss': 1.017, 'grad_norm': 0.33588168025016785, 'learning_rate': 0.00017510755377688844, 'epoch': 0.0}


 12%|█▎        | 1250/10000 [1:05:46<8:12:38,  3.38s/it]

{'loss': 1.0036, 'grad_norm': 0.4075668454170227, 'learning_rate': 0.00017508754377188593, 'epoch': 0.0}


 13%|█▎        | 1251/10000 [1:05:49<8:23:15,  3.45s/it]

{'loss': 1.1006, 'grad_norm': 0.35698893666267395, 'learning_rate': 0.00017506753376688344, 'epoch': 0.0}


 13%|█▎        | 1252/10000 [1:05:52<8:02:37,  3.31s/it]

{'loss': 0.7889, 'grad_norm': 0.4663923680782318, 'learning_rate': 0.00017504752376188093, 'epoch': 0.0}


 13%|█▎        | 1253/10000 [1:05:55<7:29:29,  3.08s/it]

{'loss': 1.1492, 'grad_norm': 0.4901331961154938, 'learning_rate': 0.00017502751375687845, 'epoch': 0.0}


 13%|█▎        | 1254/10000 [1:05:57<7:02:42,  2.90s/it]

{'loss': 0.9149, 'grad_norm': 0.4753643870353699, 'learning_rate': 0.00017500750375187596, 'epoch': 0.0}


 13%|█▎        | 1255/10000 [1:06:01<7:49:30,  3.22s/it]

{'loss': 1.0023, 'grad_norm': 0.3676496744155884, 'learning_rate': 0.00017498749374687345, 'epoch': 0.0}


 13%|█▎        | 1256/10000 [1:06:04<7:38:11,  3.14s/it]

{'loss': 0.8795, 'grad_norm': 0.41413170099258423, 'learning_rate': 0.00017496748374187096, 'epoch': 0.0}


 13%|█▎        | 1257/10000 [1:06:07<7:38:31,  3.15s/it]

{'loss': 0.7381, 'grad_norm': 0.3809734582901001, 'learning_rate': 0.00017494747373686845, 'epoch': 0.0}


 13%|█▎        | 1258/10000 [1:06:11<7:40:29,  3.16s/it]

{'loss': 1.3705, 'grad_norm': 0.434805303812027, 'learning_rate': 0.00017492746373186594, 'epoch': 0.0}


 13%|█▎        | 1259/10000 [1:06:14<7:27:02,  3.07s/it]

{'loss': 0.7824, 'grad_norm': 0.3836490213871002, 'learning_rate': 0.00017490745372686342, 'epoch': 0.0}


 13%|█▎        | 1260/10000 [1:06:16<7:10:06,  2.95s/it]

{'loss': 0.8207, 'grad_norm': 0.4203493595123291, 'learning_rate': 0.00017488744372186094, 'epoch': 0.0}


 13%|█▎        | 1261/10000 [1:06:21<8:34:40,  3.53s/it]

{'loss': 0.8908, 'grad_norm': 0.33673086762428284, 'learning_rate': 0.00017486743371685843, 'epoch': 0.0}


 13%|█▎        | 1262/10000 [1:06:24<8:24:29,  3.46s/it]

{'loss': 0.6564, 'grad_norm': 0.3564157485961914, 'learning_rate': 0.00017484742371185594, 'epoch': 0.0}


 13%|█▎        | 1263/10000 [1:06:28<8:13:22,  3.39s/it]

{'loss': 0.9587, 'grad_norm': 0.3256216049194336, 'learning_rate': 0.00017482741370685343, 'epoch': 0.0}


 13%|█▎        | 1264/10000 [1:06:31<8:17:08,  3.41s/it]

{'loss': 0.7449, 'grad_norm': 0.40974703431129456, 'learning_rate': 0.00017480740370185094, 'epoch': 0.0}


 13%|█▎        | 1265/10000 [1:06:35<8:36:19,  3.55s/it]

{'loss': 0.7746, 'grad_norm': 0.3327452838420868, 'learning_rate': 0.00017478739369684843, 'epoch': 0.0}


 13%|█▎        | 1266/10000 [1:06:39<8:39:02,  3.57s/it]

{'loss': 0.825, 'grad_norm': 0.31505241990089417, 'learning_rate': 0.00017476738369184594, 'epoch': 0.0}


 13%|█▎        | 1267/10000 [1:06:41<7:45:56,  3.20s/it]

{'loss': 0.7277, 'grad_norm': 0.4351380467414856, 'learning_rate': 0.00017474737368684343, 'epoch': 0.0}


 13%|█▎        | 1268/10000 [1:06:44<7:30:43,  3.10s/it]

{'loss': 0.6776, 'grad_norm': 0.3748321831226349, 'learning_rate': 0.00017472736368184092, 'epoch': 0.0}


 13%|█▎        | 1269/10000 [1:06:47<7:38:42,  3.15s/it]

{'loss': 0.9772, 'grad_norm': 0.39721956849098206, 'learning_rate': 0.00017470735367683843, 'epoch': 0.0}


 13%|█▎        | 1270/10000 [1:06:51<8:30:39,  3.51s/it]

{'loss': 0.9132, 'grad_norm': 0.3349105417728424, 'learning_rate': 0.00017468734367183592, 'epoch': 0.0}


 13%|█▎        | 1271/10000 [1:06:54<8:12:32,  3.39s/it]

{'loss': 1.0002, 'grad_norm': 0.374576598405838, 'learning_rate': 0.00017466733366683343, 'epoch': 0.0}


 13%|█▎        | 1272/10000 [1:06:58<8:14:27,  3.40s/it]

{'loss': 0.8668, 'grad_norm': 0.3724781572818756, 'learning_rate': 0.00017464732366183092, 'epoch': 0.0}


 13%|█▎        | 1273/10000 [1:07:02<8:31:47,  3.52s/it]

{'loss': 0.9407, 'grad_norm': 0.39972609281539917, 'learning_rate': 0.00017462731365682844, 'epoch': 0.0}


 13%|█▎        | 1274/10000 [1:07:05<8:12:02,  3.38s/it]

{'loss': 0.9813, 'grad_norm': 0.4097810387611389, 'learning_rate': 0.00017460730365182592, 'epoch': 0.0}


 13%|█▎        | 1275/10000 [1:07:07<7:36:31,  3.14s/it]

{'loss': 0.7713, 'grad_norm': 0.3899793028831482, 'learning_rate': 0.00017458729364682344, 'epoch': 0.0}


 13%|█▎        | 1276/10000 [1:07:11<7:40:17,  3.17s/it]

{'loss': 1.1313, 'grad_norm': 0.4088864326477051, 'learning_rate': 0.0001745672836418209, 'epoch': 0.0}


 13%|█▎        | 1277/10000 [1:07:14<7:41:57,  3.18s/it]

{'loss': 0.9926, 'grad_norm': 0.3655424118041992, 'learning_rate': 0.0001745472736368184, 'epoch': 0.0}


 13%|█▎        | 1278/10000 [1:07:18<8:47:03,  3.63s/it]

{'loss': 1.1676, 'grad_norm': 0.3431672155857086, 'learning_rate': 0.0001745272636318159, 'epoch': 0.0}


 13%|█▎        | 1279/10000 [1:07:21<8:10:55,  3.38s/it]

{'loss': 0.9672, 'grad_norm': 0.36255231499671936, 'learning_rate': 0.00017450725362681341, 'epoch': 0.0}


 13%|█▎        | 1280/10000 [1:07:24<7:43:56,  3.19s/it]

{'loss': 0.8017, 'grad_norm': 0.41243869066238403, 'learning_rate': 0.0001744872436218109, 'epoch': 0.0}


 13%|█▎        | 1281/10000 [1:07:28<8:03:11,  3.33s/it]

{'loss': 0.9515, 'grad_norm': 0.33639052510261536, 'learning_rate': 0.00017446723361680842, 'epoch': 0.0}


 13%|█▎        | 1282/10000 [1:07:31<8:22:12,  3.46s/it]

{'loss': 0.9954, 'grad_norm': 0.35236844420433044, 'learning_rate': 0.00017444722361180593, 'epoch': 0.0}


 13%|█▎        | 1283/10000 [1:07:36<8:51:46,  3.66s/it]

{'loss': 0.7098, 'grad_norm': 0.33568084239959717, 'learning_rate': 0.00017442721360680342, 'epoch': 0.0}


 13%|█▎        | 1284/10000 [1:07:39<8:45:06,  3.61s/it]

{'loss': 0.6715, 'grad_norm': 0.35146841406822205, 'learning_rate': 0.0001744072036018009, 'epoch': 0.0}


 13%|█▎        | 1285/10000 [1:07:42<8:08:31,  3.36s/it]

{'loss': 1.0145, 'grad_norm': 0.4287227392196655, 'learning_rate': 0.0001743871935967984, 'epoch': 0.0}


 13%|█▎        | 1286/10000 [1:07:44<7:36:54,  3.15s/it]

{'loss': 0.6938, 'grad_norm': 0.3747406601905823, 'learning_rate': 0.0001743671835917959, 'epoch': 0.0}


 13%|█▎        | 1287/10000 [1:07:48<8:08:29,  3.36s/it]

{'loss': 0.9476, 'grad_norm': 0.353407084941864, 'learning_rate': 0.0001743471735867934, 'epoch': 0.0}


 13%|█▎        | 1288/10000 [1:07:54<9:53:31,  4.09s/it]

{'loss': 1.2231, 'grad_norm': 0.3576648235321045, 'learning_rate': 0.0001743271635817909, 'epoch': 0.0}


 13%|█▎        | 1289/10000 [1:07:57<9:00:32,  3.72s/it]

{'loss': 0.9699, 'grad_norm': 0.4336948096752167, 'learning_rate': 0.0001743071535767884, 'epoch': 0.0}


 13%|█▎        | 1290/10000 [1:08:01<9:14:26,  3.82s/it]

{'loss': 0.7082, 'grad_norm': 0.3323204219341278, 'learning_rate': 0.0001742871435717859, 'epoch': 0.0}


 13%|█▎        | 1291/10000 [1:08:04<8:58:14,  3.71s/it]

{'loss': 1.1506, 'grad_norm': 0.4036674499511719, 'learning_rate': 0.0001742671335667834, 'epoch': 0.0}


 13%|█▎        | 1292/10000 [1:08:07<8:27:50,  3.50s/it]

{'loss': 1.2543, 'grad_norm': 0.4196392297744751, 'learning_rate': 0.0001742471235617809, 'epoch': 0.0}


 13%|█▎        | 1293/10000 [1:08:10<7:56:24,  3.28s/it]

{'loss': 0.839, 'grad_norm': 0.42899537086486816, 'learning_rate': 0.0001742271135567784, 'epoch': 0.0}


 13%|█▎        | 1294/10000 [1:08:14<8:26:18,  3.49s/it]

{'loss': 1.1597, 'grad_norm': 0.39384832978248596, 'learning_rate': 0.00017420710355177589, 'epoch': 0.0}


 13%|█▎        | 1295/10000 [1:08:17<8:02:38,  3.33s/it]

{'loss': 0.536, 'grad_norm': 0.33593761920928955, 'learning_rate': 0.0001741870935467734, 'epoch': 0.0}


 13%|█▎        | 1296/10000 [1:08:21<8:14:43,  3.41s/it]

{'loss': 0.7087, 'grad_norm': 0.32421770691871643, 'learning_rate': 0.0001741670835417709, 'epoch': 0.0}


 13%|█▎        | 1297/10000 [1:08:24<8:04:25,  3.34s/it]

{'loss': 0.737, 'grad_norm': 0.35339826345443726, 'learning_rate': 0.0001741470735367684, 'epoch': 0.0}


 13%|█▎        | 1298/10000 [1:08:27<8:07:26,  3.36s/it]

{'loss': 0.7882, 'grad_norm': 0.34820160269737244, 'learning_rate': 0.0001741270635317659, 'epoch': 0.0}


 13%|█▎        | 1299/10000 [1:08:30<7:42:29,  3.19s/it]

{'loss': 0.8166, 'grad_norm': 0.3909859359264374, 'learning_rate': 0.0001741070535267634, 'epoch': 0.0}


 13%|█▎        | 1300/10000 [1:08:33<7:25:14,  3.07s/it]

{'loss': 0.8763, 'grad_norm': 0.4385981559753418, 'learning_rate': 0.0001740870435217609, 'epoch': 0.0}


 13%|█▎        | 1301/10000 [1:08:38<9:11:20,  3.80s/it]

{'loss': 0.9472, 'grad_norm': 0.32457873225212097, 'learning_rate': 0.0001740670335167584, 'epoch': 0.0}


 13%|█▎        | 1302/10000 [1:08:43<9:39:18,  4.00s/it]

{'loss': 0.8724, 'grad_norm': 0.32693853974342346, 'learning_rate': 0.0001740470235117559, 'epoch': 0.0}


 13%|█▎        | 1303/10000 [1:08:47<9:46:58,  4.05s/it]

{'loss': 1.0279, 'grad_norm': 0.3151775002479553, 'learning_rate': 0.00017402701350675338, 'epoch': 0.0}


 13%|█▎        | 1304/10000 [1:08:51<9:20:39,  3.87s/it]

{'loss': 1.2266, 'grad_norm': 0.4195396900177002, 'learning_rate': 0.00017400700350175087, 'epoch': 0.0}


 13%|█▎        | 1305/10000 [1:08:53<8:29:48,  3.52s/it]

{'loss': 0.7645, 'grad_norm': 0.3992154896259308, 'learning_rate': 0.00017398699349674838, 'epoch': 0.0}


 13%|█▎        | 1306/10000 [1:08:57<8:22:22,  3.47s/it]

{'loss': 1.0012, 'grad_norm': 0.3711226284503937, 'learning_rate': 0.00017396698349174587, 'epoch': 0.0}


 13%|█▎        | 1307/10000 [1:09:00<8:09:33,  3.38s/it]

{'loss': 0.895, 'grad_norm': 0.3788250982761383, 'learning_rate': 0.00017394697348674338, 'epoch': 0.0}


 13%|█▎        | 1308/10000 [1:09:03<8:06:24,  3.36s/it]

{'loss': 0.8783, 'grad_norm': 0.4378637373447418, 'learning_rate': 0.00017392696348174087, 'epoch': 0.0}


 13%|█▎        | 1309/10000 [1:09:06<7:45:33,  3.21s/it]

{'loss': 1.0739, 'grad_norm': 0.4352087676525116, 'learning_rate': 0.00017390695347673838, 'epoch': 0.0}


 13%|█▎        | 1310/10000 [1:09:09<7:54:34,  3.28s/it]

{'loss': 0.7784, 'grad_norm': 0.35359513759613037, 'learning_rate': 0.0001738869434717359, 'epoch': 0.0}


 13%|█▎        | 1311/10000 [1:09:12<7:46:36,  3.22s/it]

{'loss': 0.7565, 'grad_norm': 0.364404559135437, 'learning_rate': 0.00017386693346673336, 'epoch': 0.0}


 13%|█▎        | 1312/10000 [1:09:16<8:14:38,  3.42s/it]

{'loss': 1.0919, 'grad_norm': 0.404035747051239, 'learning_rate': 0.00017384692346173087, 'epoch': 0.0}


 13%|█▎        | 1313/10000 [1:09:19<7:52:05,  3.26s/it]

{'loss': 1.0613, 'grad_norm': 0.4524078369140625, 'learning_rate': 0.00017382691345672836, 'epoch': 0.0}


 13%|█▎        | 1314/10000 [1:09:22<7:34:19,  3.14s/it]

{'loss': 1.0524, 'grad_norm': 0.41930925846099854, 'learning_rate': 0.00017380690345172588, 'epoch': 0.0}


 13%|█▎        | 1315/10000 [1:09:25<7:19:16,  3.03s/it]

{'loss': 0.7854, 'grad_norm': 0.36270713806152344, 'learning_rate': 0.00017378689344672336, 'epoch': 0.0}


 13%|█▎        | 1316/10000 [1:09:28<7:40:30,  3.18s/it]

{'loss': 0.7686, 'grad_norm': 0.3764456808567047, 'learning_rate': 0.00017376688344172088, 'epoch': 0.0}


 13%|█▎        | 1317/10000 [1:09:32<7:44:53,  3.21s/it]

{'loss': 0.8159, 'grad_norm': 0.3630742132663727, 'learning_rate': 0.00017374687343671836, 'epoch': 0.0}


 13%|█▎        | 1318/10000 [1:09:36<8:41:24,  3.60s/it]

{'loss': 0.8382, 'grad_norm': 0.31925898790359497, 'learning_rate': 0.00017372686343171588, 'epoch': 0.0}


 13%|█▎        | 1319/10000 [1:09:40<8:48:15,  3.65s/it]

{'loss': 0.7964, 'grad_norm': 0.36037251353263855, 'learning_rate': 0.00017370685342671337, 'epoch': 0.0}


 13%|█▎        | 1320/10000 [1:09:44<8:47:43,  3.65s/it]

{'loss': 0.842, 'grad_norm': 0.4093902111053467, 'learning_rate': 0.00017368684342171085, 'epoch': 0.0}


 13%|█▎        | 1321/10000 [1:09:46<8:08:54,  3.38s/it]

{'loss': 0.8012, 'grad_norm': 0.39612507820129395, 'learning_rate': 0.00017366683341670834, 'epoch': 0.0}


 13%|█▎        | 1322/10000 [1:09:49<7:51:35,  3.26s/it]

{'loss': 0.6814, 'grad_norm': 0.35655635595321655, 'learning_rate': 0.00017364682341170585, 'epoch': 0.0}


 13%|█▎        | 1323/10000 [1:09:52<7:20:38,  3.05s/it]

{'loss': 0.952, 'grad_norm': 0.44925302267074585, 'learning_rate': 0.00017362681340670337, 'epoch': 0.0}


 13%|█▎        | 1324/10000 [1:09:55<7:13:08,  3.00s/it]

{'loss': 0.89, 'grad_norm': 0.3947361409664154, 'learning_rate': 0.00017360680340170086, 'epoch': 0.0}


 13%|█▎        | 1325/10000 [1:09:59<7:50:23,  3.25s/it]

{'loss': 0.962, 'grad_norm': 0.40434980392456055, 'learning_rate': 0.00017358679339669837, 'epoch': 0.0}


 13%|█▎        | 1326/10000 [1:10:01<7:16:23,  3.02s/it]

{'loss': 0.71, 'grad_norm': 0.39331910014152527, 'learning_rate': 0.00017356678339169586, 'epoch': 0.0}


 13%|█▎        | 1327/10000 [1:10:06<8:21:37,  3.47s/it]

{'loss': 1.0894, 'grad_norm': 0.3540858328342438, 'learning_rate': 0.00017354677338669337, 'epoch': 0.0}


 13%|█▎        | 1328/10000 [1:10:09<8:38:56,  3.59s/it]

{'loss': 1.1564, 'grad_norm': 0.4368273913860321, 'learning_rate': 0.00017352676338169086, 'epoch': 0.0}


 13%|█▎        | 1329/10000 [1:10:13<8:39:21,  3.59s/it]

{'loss': 1.0174, 'grad_norm': 0.369362473487854, 'learning_rate': 0.00017350675337668835, 'epoch': 0.0}


 13%|█▎        | 1330/10000 [1:10:16<8:29:51,  3.53s/it]

{'loss': 1.1136, 'grad_norm': 0.4380074739456177, 'learning_rate': 0.00017348674337168583, 'epoch': 0.0}


 13%|█▎        | 1331/10000 [1:10:19<7:56:49,  3.30s/it]

{'loss': 1.0095, 'grad_norm': 0.42744961380958557, 'learning_rate': 0.00017346673336668335, 'epoch': 0.0}


 13%|█▎        | 1332/10000 [1:10:22<7:33:20,  3.14s/it]

{'loss': 0.8714, 'grad_norm': 0.3945062756538391, 'learning_rate': 0.00017344672336168084, 'epoch': 0.0}


 13%|█▎        | 1333/10000 [1:10:25<7:12:20,  2.99s/it]

{'loss': 0.94, 'grad_norm': 0.4106748402118683, 'learning_rate': 0.00017342671335667835, 'epoch': 0.0}


 13%|█▎        | 1334/10000 [1:10:29<7:55:21,  3.29s/it]

{'loss': 1.0203, 'grad_norm': 0.35034939646720886, 'learning_rate': 0.00017340670335167584, 'epoch': 0.0}


 13%|█▎        | 1335/10000 [1:10:32<7:43:27,  3.21s/it]

{'loss': 1.1404, 'grad_norm': 0.41633325815200806, 'learning_rate': 0.00017338669334667335, 'epoch': 0.0}


 13%|█▎        | 1336/10000 [1:10:35<7:46:18,  3.23s/it]

{'loss': 1.1939, 'grad_norm': 0.3799617290496826, 'learning_rate': 0.00017336668334167087, 'epoch': 0.0}


 13%|█▎        | 1337/10000 [1:10:38<7:47:28,  3.24s/it]

{'loss': 0.9552, 'grad_norm': 0.3635729253292084, 'learning_rate': 0.00017334667333666835, 'epoch': 0.0}


 13%|█▎        | 1338/10000 [1:10:42<8:32:25,  3.55s/it]

{'loss': 0.8777, 'grad_norm': 0.3717328608036041, 'learning_rate': 0.00017332666333166584, 'epoch': 0.0}


 13%|█▎        | 1339/10000 [1:10:46<8:47:22,  3.65s/it]

{'loss': 0.9169, 'grad_norm': 0.32981792092323303, 'learning_rate': 0.00017330665332666333, 'epoch': 0.0}


 13%|█▎        | 1340/10000 [1:10:50<8:40:57,  3.61s/it]

{'loss': 0.8847, 'grad_norm': 0.3645373582839966, 'learning_rate': 0.00017328664332166084, 'epoch': 0.0}


 13%|█▎        | 1341/10000 [1:10:53<8:26:15,  3.51s/it]

{'loss': 0.8101, 'grad_norm': 0.3735152781009674, 'learning_rate': 0.00017326663331665833, 'epoch': 0.0}


 13%|█▎        | 1342/10000 [1:10:57<8:45:07,  3.64s/it]

{'loss': 0.7935, 'grad_norm': 0.35899215936660767, 'learning_rate': 0.00017324662331165584, 'epoch': 0.0}


 13%|█▎        | 1343/10000 [1:11:01<8:42:35,  3.62s/it]

{'loss': 1.0648, 'grad_norm': 0.3920435309410095, 'learning_rate': 0.00017322661330665333, 'epoch': 0.0}


 13%|█▎        | 1344/10000 [1:11:04<8:13:47,  3.42s/it]

{'loss': 0.7841, 'grad_norm': 0.37982243299484253, 'learning_rate': 0.00017320660330165085, 'epoch': 0.0}


 13%|█▎        | 1345/10000 [1:11:07<8:28:26,  3.52s/it]

{'loss': 0.8172, 'grad_norm': 0.3676750063896179, 'learning_rate': 0.00017318659329664833, 'epoch': 0.0}


 13%|█▎        | 1346/10000 [1:11:10<8:03:40,  3.35s/it]

{'loss': 0.8211, 'grad_norm': 0.36936867237091064, 'learning_rate': 0.00017316658329164585, 'epoch': 0.0}


 13%|█▎        | 1347/10000 [1:11:13<7:26:11,  3.09s/it]

{'loss': 1.0023, 'grad_norm': 0.42657479643821716, 'learning_rate': 0.0001731465732866433, 'epoch': 0.0}


 13%|█▎        | 1348/10000 [1:11:16<7:26:00,  3.09s/it]

{'loss': 1.0232, 'grad_norm': 0.39451584219932556, 'learning_rate': 0.00017312656328164082, 'epoch': 0.0}


 13%|█▎        | 1349/10000 [1:11:19<7:24:18,  3.08s/it]

{'loss': 0.6448, 'grad_norm': 0.3327617645263672, 'learning_rate': 0.0001731065532766383, 'epoch': 0.0}


 14%|█▎        | 1350/10000 [1:11:24<9:00:14,  3.75s/it]

{'loss': 1.13, 'grad_norm': 0.36662477254867554, 'learning_rate': 0.00017308654327163582, 'epoch': 0.0}


 14%|█▎        | 1351/10000 [1:11:28<9:00:02,  3.75s/it]

{'loss': 0.7709, 'grad_norm': 0.33090993762016296, 'learning_rate': 0.00017306653326663334, 'epoch': 0.0}


 14%|█▎        | 1352/10000 [1:11:31<8:45:33,  3.65s/it]

{'loss': 1.0271, 'grad_norm': 0.3714733123779297, 'learning_rate': 0.00017304652326163083, 'epoch': 0.0}


 14%|█▎        | 1353/10000 [1:11:36<9:14:09,  3.85s/it]

{'loss': 1.0882, 'grad_norm': 0.4015108644962311, 'learning_rate': 0.00017302651325662834, 'epoch': 0.0}


 14%|█▎        | 1354/10000 [1:11:40<9:39:24,  4.02s/it]

{'loss': 0.8616, 'grad_norm': 0.35941460728645325, 'learning_rate': 0.00017300650325162583, 'epoch': 0.0}


 14%|█▎        | 1355/10000 [1:11:44<9:38:44,  4.02s/it]

{'loss': 0.965, 'grad_norm': 0.37615370750427246, 'learning_rate': 0.00017298649324662331, 'epoch': 0.0}


 14%|█▎        | 1356/10000 [1:11:47<9:08:23,  3.81s/it]

{'loss': 0.8655, 'grad_norm': 0.3757212460041046, 'learning_rate': 0.0001729664832416208, 'epoch': 0.0}


 14%|█▎        | 1357/10000 [1:11:52<9:21:03,  3.89s/it]

{'loss': 0.9555, 'grad_norm': 0.3386372923851013, 'learning_rate': 0.00017294647323661832, 'epoch': 0.0}


 14%|█▎        | 1358/10000 [1:11:54<8:27:06,  3.52s/it]

{'loss': 0.7476, 'grad_norm': 0.4118034839630127, 'learning_rate': 0.0001729264632316158, 'epoch': 0.0}


 14%|█▎        | 1359/10000 [1:11:57<8:11:29,  3.41s/it]

{'loss': 0.9081, 'grad_norm': 0.39774009585380554, 'learning_rate': 0.00017290645322661332, 'epoch': 0.0}


 14%|█▎        | 1360/10000 [1:12:00<7:50:34,  3.27s/it]

{'loss': 0.8792, 'grad_norm': 0.3765624761581421, 'learning_rate': 0.0001728864432216108, 'epoch': 0.0}


 14%|█▎        | 1361/10000 [1:12:04<7:46:38,  3.24s/it]

{'loss': 0.6971, 'grad_norm': 0.4610882103443146, 'learning_rate': 0.00017286643321660832, 'epoch': 0.0}


 14%|█▎        | 1362/10000 [1:12:07<7:53:04,  3.29s/it]

{'loss': 0.7648, 'grad_norm': 0.3099619150161743, 'learning_rate': 0.0001728464232116058, 'epoch': 0.0}


 14%|█▎        | 1363/10000 [1:12:11<8:27:05,  3.52s/it]

{'loss': 0.9367, 'grad_norm': 0.335971474647522, 'learning_rate': 0.00017282641320660332, 'epoch': 0.0}


 14%|█▎        | 1364/10000 [1:12:14<8:11:39,  3.42s/it]

{'loss': 1.0146, 'grad_norm': 0.3708227276802063, 'learning_rate': 0.0001728064032016008, 'epoch': 0.0}


 14%|█▎        | 1365/10000 [1:12:17<7:38:23,  3.19s/it]

{'loss': 1.1239, 'grad_norm': 0.46527260541915894, 'learning_rate': 0.0001727863931965983, 'epoch': 0.0}


 14%|█▎        | 1366/10000 [1:12:21<8:20:59,  3.48s/it]

{'loss': 1.0651, 'grad_norm': 0.36716917157173157, 'learning_rate': 0.0001727663831915958, 'epoch': 0.0}


 14%|█▎        | 1367/10000 [1:12:24<7:55:08,  3.30s/it]

{'loss': 0.9166, 'grad_norm': 0.38710325956344604, 'learning_rate': 0.0001727463731865933, 'epoch': 0.0}


 14%|█▎        | 1368/10000 [1:12:27<8:05:31,  3.37s/it]

{'loss': 0.8345, 'grad_norm': 0.3178907632827759, 'learning_rate': 0.0001727263631815908, 'epoch': 0.0}


 14%|█▎        | 1369/10000 [1:12:31<8:25:34,  3.51s/it]

{'loss': 0.8398, 'grad_norm': 0.3173413872718811, 'learning_rate': 0.0001727063531765883, 'epoch': 0.0}


 14%|█▎        | 1370/10000 [1:12:34<8:02:33,  3.35s/it]

{'loss': 0.8328, 'grad_norm': 0.3861865699291229, 'learning_rate': 0.0001726863431715858, 'epoch': 0.0}


 14%|█▎        | 1371/10000 [1:12:37<7:50:20,  3.27s/it]

{'loss': 1.1092, 'grad_norm': 0.3949032127857208, 'learning_rate': 0.0001726663331665833, 'epoch': 0.0}


 14%|█▎        | 1372/10000 [1:12:40<7:24:47,  3.09s/it]

{'loss': 0.9643, 'grad_norm': 0.41748833656311035, 'learning_rate': 0.00017264632316158082, 'epoch': 0.0}


 14%|█▎        | 1373/10000 [1:12:43<7:42:37,  3.22s/it]

{'loss': 0.9126, 'grad_norm': 0.38478928804397583, 'learning_rate': 0.0001726263131565783, 'epoch': 0.0}


 14%|█▎        | 1374/10000 [1:12:47<8:01:13,  3.35s/it]

{'loss': 0.954, 'grad_norm': 0.37732550501823425, 'learning_rate': 0.0001726063031515758, 'epoch': 0.0}


 14%|█▍        | 1375/10000 [1:12:50<7:44:55,  3.23s/it]

{'loss': 0.8171, 'grad_norm': 0.3457831144332886, 'learning_rate': 0.00017258629314657328, 'epoch': 0.0}


 14%|█▍        | 1376/10000 [1:12:53<7:38:43,  3.19s/it]

{'loss': 0.8318, 'grad_norm': 0.3869713544845581, 'learning_rate': 0.0001725662831415708, 'epoch': 0.0}


 14%|█▍        | 1377/10000 [1:12:57<8:22:26,  3.50s/it]

{'loss': 0.921, 'grad_norm': 0.3790426254272461, 'learning_rate': 0.00017254627313656828, 'epoch': 0.0}


 14%|█▍        | 1378/10000 [1:13:01<8:20:45,  3.48s/it]

{'loss': 0.917, 'grad_norm': 0.41967466473579407, 'learning_rate': 0.0001725262631315658, 'epoch': 0.0}


 14%|█▍        | 1379/10000 [1:13:04<8:07:55,  3.40s/it]

{'loss': 0.8567, 'grad_norm': 0.4123520255088806, 'learning_rate': 0.0001725062531265633, 'epoch': 0.0}


 14%|█▍        | 1380/10000 [1:13:08<8:44:17,  3.65s/it]

{'loss': 0.9021, 'grad_norm': 0.32571035623550415, 'learning_rate': 0.0001724862431215608, 'epoch': 0.0}


 14%|█▍        | 1381/10000 [1:13:11<8:16:13,  3.45s/it]

{'loss': 0.97, 'grad_norm': 0.3723636269569397, 'learning_rate': 0.0001724662331165583, 'epoch': 0.0}


 14%|█▍        | 1382/10000 [1:13:15<8:33:58,  3.58s/it]

{'loss': 1.0328, 'grad_norm': 0.38551628589630127, 'learning_rate': 0.00017244622311155577, 'epoch': 0.0}


 14%|█▍        | 1383/10000 [1:13:18<7:59:03,  3.34s/it]

{'loss': 0.6968, 'grad_norm': 0.3553923964500427, 'learning_rate': 0.00017242621310655328, 'epoch': 0.0}


 14%|█▍        | 1384/10000 [1:13:21<7:44:40,  3.24s/it]

{'loss': 0.9135, 'grad_norm': 0.38171955943107605, 'learning_rate': 0.00017240620310155077, 'epoch': 0.0}


 14%|█▍        | 1385/10000 [1:13:24<7:33:01,  3.16s/it]

{'loss': 0.6891, 'grad_norm': 0.351460337638855, 'learning_rate': 0.00017238619309654829, 'epoch': 0.0}


 14%|█▍        | 1386/10000 [1:13:28<8:01:51,  3.36s/it]

{'loss': 0.8596, 'grad_norm': 0.3327955901622772, 'learning_rate': 0.00017236618309154577, 'epoch': 0.0}


 14%|█▍        | 1387/10000 [1:13:31<7:40:44,  3.21s/it]

{'loss': 0.9943, 'grad_norm': 0.4208030104637146, 'learning_rate': 0.0001723461730865433, 'epoch': 0.0}


 14%|█▍        | 1388/10000 [1:13:34<7:41:37,  3.22s/it]

{'loss': 0.802, 'grad_norm': 0.383226215839386, 'learning_rate': 0.00017232616308154077, 'epoch': 0.0}


 14%|█▍        | 1389/10000 [1:13:37<7:27:25,  3.12s/it]

{'loss': 0.8901, 'grad_norm': 0.36908072233200073, 'learning_rate': 0.0001723061530765383, 'epoch': 0.0}


 14%|█▍        | 1390/10000 [1:13:39<7:12:43,  3.02s/it]

{'loss': 1.1897, 'grad_norm': 0.41458672285079956, 'learning_rate': 0.00017228614307153578, 'epoch': 0.0}


 14%|█▍        | 1391/10000 [1:13:45<9:06:04,  3.81s/it]

{'loss': 1.1349, 'grad_norm': 0.3335822820663452, 'learning_rate': 0.00017226613306653326, 'epoch': 0.0}


 14%|█▍        | 1392/10000 [1:13:48<8:17:02,  3.46s/it]

{'loss': 0.8808, 'grad_norm': 0.3992074728012085, 'learning_rate': 0.00017224612306153078, 'epoch': 0.0}


 14%|█▍        | 1393/10000 [1:13:51<8:00:41,  3.35s/it]

{'loss': 0.8429, 'grad_norm': 0.38513320684432983, 'learning_rate': 0.00017222611305652826, 'epoch': 0.0}


 14%|█▍        | 1394/10000 [1:13:54<7:58:34,  3.34s/it]

{'loss': 1.0445, 'grad_norm': 0.41263216733932495, 'learning_rate': 0.00017220610305152578, 'epoch': 0.0}


 14%|█▍        | 1395/10000 [1:13:57<7:57:09,  3.33s/it]

{'loss': 0.8341, 'grad_norm': 0.3907777965068817, 'learning_rate': 0.00017218609304652327, 'epoch': 0.01}


 14%|█▍        | 1396/10000 [1:14:01<8:11:31,  3.43s/it]

{'loss': 1.1561, 'grad_norm': 0.4103567600250244, 'learning_rate': 0.00017216608304152078, 'epoch': 0.01}


 14%|█▍        | 1397/10000 [1:14:04<7:38:01,  3.19s/it]

{'loss': 0.9958, 'grad_norm': 0.46520015597343445, 'learning_rate': 0.00017214607303651827, 'epoch': 0.01}


 14%|█▍        | 1398/10000 [1:14:07<7:57:34,  3.33s/it]

{'loss': 1.0661, 'grad_norm': 0.3522918224334717, 'learning_rate': 0.00017212606303151578, 'epoch': 0.01}


 14%|█▍        | 1399/10000 [1:14:10<7:42:34,  3.23s/it]

{'loss': 1.0072, 'grad_norm': 0.3865059018135071, 'learning_rate': 0.00017210605302651327, 'epoch': 0.01}


 14%|█▍        | 1400/10000 [1:14:14<8:02:08,  3.36s/it]

{'loss': 0.7037, 'grad_norm': 0.34695199131965637, 'learning_rate': 0.00017208604302151076, 'epoch': 0.01}


 14%|█▍        | 1401/10000 [1:14:18<8:30:28,  3.56s/it]

{'loss': 0.9228, 'grad_norm': 0.46586987376213074, 'learning_rate': 0.00017206603301650824, 'epoch': 0.01}


 14%|█▍        | 1402/10000 [1:14:21<7:56:15,  3.32s/it]

{'loss': 1.0929, 'grad_norm': 0.43885108828544617, 'learning_rate': 0.00017204602301150576, 'epoch': 0.01}


 14%|█▍        | 1403/10000 [1:14:24<7:30:59,  3.15s/it]

{'loss': 0.6918, 'grad_norm': 0.43310102820396423, 'learning_rate': 0.00017202601300650325, 'epoch': 0.01}


 14%|█▍        | 1404/10000 [1:14:27<7:42:11,  3.23s/it]

{'loss': 1.1146, 'grad_norm': 0.3815459907054901, 'learning_rate': 0.00017200600300150076, 'epoch': 0.01}


 14%|█▍        | 1405/10000 [1:14:30<7:40:45,  3.22s/it]

{'loss': 0.7337, 'grad_norm': 0.3267790377140045, 'learning_rate': 0.00017198599299649825, 'epoch': 0.01}


 14%|█▍        | 1406/10000 [1:14:35<8:28:37,  3.55s/it]

{'loss': 1.0826, 'grad_norm': 0.3267166018486023, 'learning_rate': 0.00017196598299149576, 'epoch': 0.01}


 14%|█▍        | 1407/10000 [1:14:38<8:02:28,  3.37s/it]

{'loss': 0.6595, 'grad_norm': 0.33545157313346863, 'learning_rate': 0.00017194597298649328, 'epoch': 0.01}


 14%|█▍        | 1408/10000 [1:14:40<7:31:07,  3.15s/it]

{'loss': 0.8238, 'grad_norm': 0.4007957875728607, 'learning_rate': 0.00017192596298149076, 'epoch': 0.01}


 14%|█▍        | 1409/10000 [1:14:44<7:42:37,  3.23s/it]

{'loss': 0.8326, 'grad_norm': 0.3520716726779938, 'learning_rate': 0.00017190595297648825, 'epoch': 0.01}


 14%|█▍        | 1410/10000 [1:14:46<7:26:01,  3.12s/it]

{'loss': 0.8658, 'grad_norm': 0.42532870173454285, 'learning_rate': 0.00017188594297148574, 'epoch': 0.01}


 14%|█▍        | 1411/10000 [1:14:50<7:33:51,  3.17s/it]

{'loss': 1.0085, 'grad_norm': 0.3957161009311676, 'learning_rate': 0.00017186593296648325, 'epoch': 0.01}


 14%|█▍        | 1412/10000 [1:14:53<7:35:27,  3.18s/it]

{'loss': 0.9862, 'grad_norm': 0.38682934641838074, 'learning_rate': 0.00017184592296148074, 'epoch': 0.01}


 14%|█▍        | 1413/10000 [1:14:56<7:47:46,  3.27s/it]

{'loss': 0.6607, 'grad_norm': 0.38424113392829895, 'learning_rate': 0.00017182591295647825, 'epoch': 0.01}


 14%|█▍        | 1414/10000 [1:15:00<7:43:52,  3.24s/it]

{'loss': 0.9357, 'grad_norm': 0.4006958305835724, 'learning_rate': 0.00017180590295147574, 'epoch': 0.01}


 14%|█▍        | 1415/10000 [1:15:02<7:00:23,  2.94s/it]

{'loss': 0.5694, 'grad_norm': 0.38899579644203186, 'learning_rate': 0.00017178589294647326, 'epoch': 0.01}


 14%|█▍        | 1416/10000 [1:15:05<7:03:24,  2.96s/it]

{'loss': 0.6952, 'grad_norm': 0.3478260040283203, 'learning_rate': 0.00017176588294147074, 'epoch': 0.01}


 14%|█▍        | 1417/10000 [1:15:08<7:00:20,  2.94s/it]

{'loss': 0.6771, 'grad_norm': 0.360287606716156, 'learning_rate': 0.00017174587293646823, 'epoch': 0.01}


 14%|█▍        | 1418/10000 [1:15:11<6:59:29,  2.93s/it]

{'loss': 0.9827, 'grad_norm': 0.44758954644203186, 'learning_rate': 0.00017172586293146572, 'epoch': 0.01}


 14%|█▍        | 1419/10000 [1:15:13<6:56:23,  2.91s/it]

{'loss': 0.7382, 'grad_norm': 0.43509992957115173, 'learning_rate': 0.00017170585292646323, 'epoch': 0.01}


 14%|█▍        | 1420/10000 [1:15:17<7:08:04,  2.99s/it]

{'loss': 0.7384, 'grad_norm': 0.3606395721435547, 'learning_rate': 0.00017168584292146075, 'epoch': 0.01}


 14%|█▍        | 1421/10000 [1:15:21<8:15:01,  3.46s/it]

{'loss': 1.0066, 'grad_norm': 0.3714359998703003, 'learning_rate': 0.00017166583291645823, 'epoch': 0.01}


 14%|█▍        | 1422/10000 [1:15:26<9:07:30,  3.83s/it]

{'loss': 0.9339, 'grad_norm': 0.35511377453804016, 'learning_rate': 0.00017164582291145575, 'epoch': 0.01}


 14%|█▍        | 1423/10000 [1:15:29<8:52:53,  3.73s/it]

{'loss': 1.0381, 'grad_norm': 0.36257073283195496, 'learning_rate': 0.00017162581290645324, 'epoch': 0.01}


 14%|█▍        | 1424/10000 [1:15:32<8:21:24,  3.51s/it]

{'loss': 0.7899, 'grad_norm': 0.4018288850784302, 'learning_rate': 0.00017160580290145075, 'epoch': 0.01}


 14%|█▍        | 1425/10000 [1:15:35<7:47:43,  3.27s/it]

{'loss': 0.8689, 'grad_norm': 0.382633239030838, 'learning_rate': 0.00017158579289644824, 'epoch': 0.01}


 14%|█▍        | 1426/10000 [1:15:38<7:45:40,  3.26s/it]

{'loss': 1.2457, 'grad_norm': 0.4453807473182678, 'learning_rate': 0.00017156578289144572, 'epoch': 0.01}


 14%|█▍        | 1427/10000 [1:15:41<7:19:45,  3.08s/it]

{'loss': 0.7638, 'grad_norm': 0.39166024327278137, 'learning_rate': 0.0001715457728864432, 'epoch': 0.01}


 14%|█▍        | 1428/10000 [1:15:44<7:36:56,  3.20s/it]

{'loss': 0.8504, 'grad_norm': 0.3822665214538574, 'learning_rate': 0.00017152576288144073, 'epoch': 0.01}


 14%|█▍        | 1429/10000 [1:15:48<8:06:56,  3.41s/it]

{'loss': 0.9891, 'grad_norm': 0.3822897970676422, 'learning_rate': 0.0001715057528764382, 'epoch': 0.01}


 14%|█▍        | 1430/10000 [1:15:53<8:38:23,  3.63s/it]

{'loss': 0.8299, 'grad_norm': 0.32698822021484375, 'learning_rate': 0.00017148574287143573, 'epoch': 0.01}


 14%|█▍        | 1431/10000 [1:15:56<8:19:49,  3.50s/it]

{'loss': 0.9359, 'grad_norm': 0.4053719937801361, 'learning_rate': 0.00017146573286643322, 'epoch': 0.01}


 14%|█▍        | 1432/10000 [1:15:59<8:11:53,  3.44s/it]

{'loss': 0.8381, 'grad_norm': 0.3784032464027405, 'learning_rate': 0.00017144572286143073, 'epoch': 0.01}


 14%|█▍        | 1433/10000 [1:16:02<8:05:45,  3.40s/it]

{'loss': 1.0001, 'grad_norm': 0.4129059910774231, 'learning_rate': 0.00017142571285642824, 'epoch': 0.01}


 14%|█▍        | 1434/10000 [1:16:07<8:56:06,  3.76s/it]

{'loss': 0.8581, 'grad_norm': 0.3228403329849243, 'learning_rate': 0.00017140570285142573, 'epoch': 0.01}


 14%|█▍        | 1435/10000 [1:16:11<9:21:48,  3.94s/it]

{'loss': 1.1078, 'grad_norm': 0.3480488955974579, 'learning_rate': 0.00017138569284642322, 'epoch': 0.01}


 14%|█▍        | 1436/10000 [1:16:16<9:36:21,  4.04s/it]

{'loss': 0.7816, 'grad_norm': 0.30484798550605774, 'learning_rate': 0.0001713656828414207, 'epoch': 0.01}


 14%|█▍        | 1437/10000 [1:16:18<8:29:17,  3.57s/it]

{'loss': 0.945, 'grad_norm': 0.41948485374450684, 'learning_rate': 0.00017134567283641822, 'epoch': 0.01}


 14%|█▍        | 1438/10000 [1:16:22<8:46:37,  3.69s/it]

{'loss': 0.9402, 'grad_norm': 0.33168113231658936, 'learning_rate': 0.0001713256628314157, 'epoch': 0.01}


 14%|█▍        | 1439/10000 [1:16:25<8:25:19,  3.54s/it]

{'loss': 0.703, 'grad_norm': 0.3792058527469635, 'learning_rate': 0.00017130565282641322, 'epoch': 0.01}


 14%|█▍        | 1440/10000 [1:16:28<7:57:59,  3.35s/it]

{'loss': 0.9396, 'grad_norm': 0.3837359547615051, 'learning_rate': 0.0001712856428214107, 'epoch': 0.01}


 14%|█▍        | 1441/10000 [1:16:30<7:14:29,  3.05s/it]

{'loss': 0.7463, 'grad_norm': 0.38626939058303833, 'learning_rate': 0.00017126563281640822, 'epoch': 0.01}


 14%|█▍        | 1442/10000 [1:16:33<7:09:51,  3.01s/it]

{'loss': 0.8603, 'grad_norm': 0.39552539587020874, 'learning_rate': 0.0001712456228114057, 'epoch': 0.01}


 14%|█▍        | 1443/10000 [1:16:36<7:05:51,  2.99s/it]

{'loss': 0.9797, 'grad_norm': 0.4219287931919098, 'learning_rate': 0.00017122561280640323, 'epoch': 0.01}


 14%|█▍        | 1444/10000 [1:16:40<7:22:08,  3.10s/it]

{'loss': 0.8851, 'grad_norm': 0.35413235425949097, 'learning_rate': 0.0001712056028014007, 'epoch': 0.01}


 14%|█▍        | 1445/10000 [1:16:42<6:44:40,  2.84s/it]

{'loss': 0.9804, 'grad_norm': 0.5047053694725037, 'learning_rate': 0.0001711855927963982, 'epoch': 0.01}


 14%|█▍        | 1446/10000 [1:16:45<6:41:58,  2.82s/it]

{'loss': 0.8367, 'grad_norm': 0.5137633085250854, 'learning_rate': 0.0001711655827913957, 'epoch': 0.01}


 14%|█▍        | 1447/10000 [1:16:48<7:10:11,  3.02s/it]

{'loss': 0.9605, 'grad_norm': 0.36291107535362244, 'learning_rate': 0.0001711455727863932, 'epoch': 0.01}


 14%|█▍        | 1448/10000 [1:16:52<8:05:29,  3.41s/it]

{'loss': 0.9822, 'grad_norm': 0.32772335410118103, 'learning_rate': 0.00017112556278139072, 'epoch': 0.01}


 14%|█▍        | 1449/10000 [1:16:55<7:43:04,  3.25s/it]

{'loss': 0.9472, 'grad_norm': 0.4366421401500702, 'learning_rate': 0.0001711055527763882, 'epoch': 0.01}


 14%|█▍        | 1450/10000 [1:16:59<8:11:49,  3.45s/it]

{'loss': 1.4248, 'grad_norm': 0.5140187740325928, 'learning_rate': 0.00017108554277138572, 'epoch': 0.01}


 15%|█▍        | 1451/10000 [1:17:02<7:47:07,  3.28s/it]

{'loss': 0.659, 'grad_norm': 0.39401957392692566, 'learning_rate': 0.0001710655327663832, 'epoch': 0.01}


 15%|█▍        | 1452/10000 [1:17:05<7:22:53,  3.11s/it]

{'loss': 0.995, 'grad_norm': 0.4305788576602936, 'learning_rate': 0.0001710455227613807, 'epoch': 0.01}


 15%|█▍        | 1453/10000 [1:17:08<7:21:46,  3.10s/it]

{'loss': 1.0574, 'grad_norm': 0.4770740568637848, 'learning_rate': 0.00017102551275637818, 'epoch': 0.01}


 15%|█▍        | 1454/10000 [1:17:10<6:55:04,  2.91s/it]

{'loss': 0.8859, 'grad_norm': 0.45572224259376526, 'learning_rate': 0.0001710055027513757, 'epoch': 0.01}


 15%|█▍        | 1455/10000 [1:17:15<7:59:05,  3.36s/it]

{'loss': 1.1265, 'grad_norm': 0.35349419713020325, 'learning_rate': 0.00017098549274637318, 'epoch': 0.01}


 15%|█▍        | 1456/10000 [1:17:18<8:00:28,  3.37s/it]

{'loss': 0.7218, 'grad_norm': 0.3797436058521271, 'learning_rate': 0.0001709654827413707, 'epoch': 0.01}


 15%|█▍        | 1457/10000 [1:17:21<7:43:09,  3.25s/it]

{'loss': 0.6425, 'grad_norm': 0.32194051146507263, 'learning_rate': 0.00017094547273636818, 'epoch': 0.01}


 15%|█▍        | 1458/10000 [1:17:25<7:56:42,  3.35s/it]

{'loss': 0.9723, 'grad_norm': 0.4329527020454407, 'learning_rate': 0.0001709254627313657, 'epoch': 0.01}


 15%|█▍        | 1459/10000 [1:17:28<7:56:10,  3.35s/it]

{'loss': 1.2555, 'grad_norm': 0.4139263927936554, 'learning_rate': 0.00017090545272636318, 'epoch': 0.01}


 15%|█▍        | 1460/10000 [1:17:32<8:31:02,  3.59s/it]

{'loss': 1.0021, 'grad_norm': 0.33518269658088684, 'learning_rate': 0.0001708854427213607, 'epoch': 0.01}


 15%|█▍        | 1461/10000 [1:17:37<9:07:34,  3.85s/it]

{'loss': 1.1818, 'grad_norm': 0.42091602087020874, 'learning_rate': 0.00017086543271635819, 'epoch': 0.01}


 15%|█▍        | 1462/10000 [1:17:41<9:26:18,  3.98s/it]

{'loss': 0.7874, 'grad_norm': 0.3290892541408539, 'learning_rate': 0.00017084542271135567, 'epoch': 0.01}


 15%|█▍        | 1463/10000 [1:17:45<9:06:23,  3.84s/it]

{'loss': 0.7883, 'grad_norm': 0.32383403182029724, 'learning_rate': 0.0001708254127063532, 'epoch': 0.01}


 15%|█▍        | 1464/10000 [1:17:48<8:40:44,  3.66s/it]

{'loss': 0.9327, 'grad_norm': 0.4083958566188812, 'learning_rate': 0.00017080540270135067, 'epoch': 0.01}


 15%|█▍        | 1465/10000 [1:17:51<8:16:07,  3.49s/it]

{'loss': 0.8374, 'grad_norm': 0.3619111478328705, 'learning_rate': 0.0001707853926963482, 'epoch': 0.01}


 15%|█▍        | 1466/10000 [1:17:54<8:13:27,  3.47s/it]

{'loss': 1.1267, 'grad_norm': 0.3917822539806366, 'learning_rate': 0.00017076538269134568, 'epoch': 0.01}


 15%|█▍        | 1467/10000 [1:17:58<8:33:25,  3.61s/it]

{'loss': 0.8039, 'grad_norm': 0.37086424231529236, 'learning_rate': 0.0001707453726863432, 'epoch': 0.01}


 15%|█▍        | 1468/10000 [1:18:02<8:43:14,  3.68s/it]

{'loss': 0.9385, 'grad_norm': 0.3369462490081787, 'learning_rate': 0.00017072536268134068, 'epoch': 0.01}


 15%|█▍        | 1469/10000 [1:18:05<8:01:54,  3.39s/it]

{'loss': 1.0828, 'grad_norm': 0.4712962210178375, 'learning_rate': 0.0001707053526763382, 'epoch': 0.01}


 15%|█▍        | 1470/10000 [1:18:08<7:50:15,  3.31s/it]

{'loss': 0.9353, 'grad_norm': 0.427376389503479, 'learning_rate': 0.00017068534267133568, 'epoch': 0.01}


 15%|█▍        | 1471/10000 [1:18:11<7:32:21,  3.18s/it]

{'loss': 0.7976, 'grad_norm': 0.36557626724243164, 'learning_rate': 0.00017066533266633317, 'epoch': 0.01}


 15%|█▍        | 1472/10000 [1:18:14<7:20:29,  3.10s/it]

{'loss': 0.6621, 'grad_norm': 0.35780489444732666, 'learning_rate': 0.00017064532266133065, 'epoch': 0.01}


 15%|█▍        | 1473/10000 [1:18:17<7:19:37,  3.09s/it]

{'loss': 0.7447, 'grad_norm': 0.3800557851791382, 'learning_rate': 0.00017062531265632817, 'epoch': 0.01}


 15%|█▍        | 1474/10000 [1:18:20<7:09:42,  3.02s/it]

{'loss': 1.023, 'grad_norm': 0.37145307660102844, 'learning_rate': 0.00017060530265132566, 'epoch': 0.01}


 15%|█▍        | 1475/10000 [1:18:23<7:10:07,  3.03s/it]

{'loss': 0.9918, 'grad_norm': 0.34586045145988464, 'learning_rate': 0.00017058529264632317, 'epoch': 0.01}


 15%|█▍        | 1476/10000 [1:18:26<7:12:43,  3.05s/it]

{'loss': 0.8914, 'grad_norm': 0.4248080551624298, 'learning_rate': 0.00017056528264132068, 'epoch': 0.01}


 15%|█▍        | 1477/10000 [1:18:29<7:08:28,  3.02s/it]

{'loss': 0.9175, 'grad_norm': 0.39492031931877136, 'learning_rate': 0.00017054527263631817, 'epoch': 0.01}


 15%|█▍        | 1478/10000 [1:18:32<7:02:33,  2.98s/it]

{'loss': 0.8822, 'grad_norm': 0.4134289622306824, 'learning_rate': 0.0001705252626313157, 'epoch': 0.01}


 15%|█▍        | 1479/10000 [1:18:35<7:13:45,  3.05s/it]

{'loss': 0.8105, 'grad_norm': 0.4144318103790283, 'learning_rate': 0.00017050525262631317, 'epoch': 0.01}


 15%|█▍        | 1480/10000 [1:18:37<6:57:16,  2.94s/it]

{'loss': 0.8195, 'grad_norm': 0.4027648866176605, 'learning_rate': 0.00017048524262131066, 'epoch': 0.01}


 15%|█▍        | 1481/10000 [1:18:40<6:48:40,  2.88s/it]

{'loss': 0.7143, 'grad_norm': 0.40114614367485046, 'learning_rate': 0.00017046523261630815, 'epoch': 0.01}


 15%|█▍        | 1482/10000 [1:18:43<6:55:25,  2.93s/it]

{'loss': 0.7134, 'grad_norm': 0.3250856101512909, 'learning_rate': 0.00017044522261130566, 'epoch': 0.01}


 15%|█▍        | 1483/10000 [1:18:47<7:26:07,  3.14s/it]

{'loss': 1.3088, 'grad_norm': 0.4146709144115448, 'learning_rate': 0.00017042521260630315, 'epoch': 0.01}


 15%|█▍        | 1484/10000 [1:18:50<7:28:19,  3.16s/it]

{'loss': 1.0394, 'grad_norm': 0.4039950966835022, 'learning_rate': 0.00017040520260130066, 'epoch': 0.01}


 15%|█▍        | 1485/10000 [1:18:53<7:33:44,  3.20s/it]

{'loss': 0.9246, 'grad_norm': 0.3636399805545807, 'learning_rate': 0.00017038519259629815, 'epoch': 0.01}


 15%|█▍        | 1486/10000 [1:18:56<7:22:02,  3.12s/it]

{'loss': 0.8614, 'grad_norm': 0.37595558166503906, 'learning_rate': 0.00017036518259129567, 'epoch': 0.01}


 15%|█▍        | 1487/10000 [1:18:59<7:11:34,  3.04s/it]

{'loss': 0.9487, 'grad_norm': 0.4033714234828949, 'learning_rate': 0.00017034517258629315, 'epoch': 0.01}


 15%|█▍        | 1488/10000 [1:19:03<7:24:28,  3.13s/it]

{'loss': 0.9264, 'grad_norm': 0.3463937044143677, 'learning_rate': 0.00017032516258129064, 'epoch': 0.01}


 15%|█▍        | 1489/10000 [1:19:06<7:31:24,  3.18s/it]

{'loss': 0.8723, 'grad_norm': 0.342833548784256, 'learning_rate': 0.00017030515257628816, 'epoch': 0.01}


 15%|█▍        | 1490/10000 [1:19:09<7:26:02,  3.14s/it]

{'loss': 0.8391, 'grad_norm': 0.3972284197807312, 'learning_rate': 0.00017028514257128564, 'epoch': 0.01}


 15%|█▍        | 1491/10000 [1:19:12<7:43:00,  3.26s/it]

{'loss': 0.9469, 'grad_norm': 0.35790595412254333, 'learning_rate': 0.00017026513256628316, 'epoch': 0.01}


 15%|█▍        | 1492/10000 [1:19:16<7:35:56,  3.22s/it]

{'loss': 0.8639, 'grad_norm': 0.3690274655818939, 'learning_rate': 0.00017024512256128064, 'epoch': 0.01}


 15%|█▍        | 1493/10000 [1:19:19<7:48:00,  3.30s/it]

{'loss': 1.2231, 'grad_norm': 0.3900264501571655, 'learning_rate': 0.00017022511255627816, 'epoch': 0.01}


 15%|█▍        | 1494/10000 [1:19:23<7:57:54,  3.37s/it]

{'loss': 0.9056, 'grad_norm': 0.351346492767334, 'learning_rate': 0.00017020510255127565, 'epoch': 0.01}


 15%|█▍        | 1495/10000 [1:19:26<7:44:28,  3.28s/it]

{'loss': 0.7248, 'grad_norm': 0.32200130820274353, 'learning_rate': 0.00017018509254627316, 'epoch': 0.01}


 15%|█▍        | 1496/10000 [1:19:29<7:27:35,  3.16s/it]

{'loss': 1.0554, 'grad_norm': 0.40586191415786743, 'learning_rate': 0.00017016508254127065, 'epoch': 0.01}


 15%|█▍        | 1497/10000 [1:19:31<7:14:14,  3.06s/it]

{'loss': 0.4924, 'grad_norm': 0.4023529589176178, 'learning_rate': 0.00017014507253626813, 'epoch': 0.01}


 15%|█▍        | 1498/10000 [1:19:35<7:55:42,  3.36s/it]

{'loss': 1.0223, 'grad_norm': 0.3717876672744751, 'learning_rate': 0.00017012506253126562, 'epoch': 0.01}


 15%|█▍        | 1499/10000 [1:19:39<8:23:16,  3.55s/it]

{'loss': 0.8731, 'grad_norm': 0.3421216905117035, 'learning_rate': 0.00017010505252626314, 'epoch': 0.01}


 15%|█▌        | 1500/10000 [1:19:43<8:19:19,  3.52s/it]

{'loss': 0.8697, 'grad_norm': 0.35082390904426575, 'learning_rate': 0.00017008504252126062, 'epoch': 0.01}


 15%|█▌        | 1501/10000 [1:19:48<9:39:44,  4.09s/it]

{'loss': 1.072, 'grad_norm': 0.3626386225223541, 'learning_rate': 0.00017006503251625814, 'epoch': 0.01}


 15%|█▌        | 1502/10000 [1:19:51<8:49:13,  3.74s/it]

{'loss': 0.9764, 'grad_norm': 0.4070589244365692, 'learning_rate': 0.00017004502251125563, 'epoch': 0.01}


 15%|█▌        | 1503/10000 [1:19:55<8:48:54,  3.73s/it]

{'loss': 1.0377, 'grad_norm': 0.4733329713344574, 'learning_rate': 0.00017002501250625314, 'epoch': 0.01}


 15%|█▌        | 1504/10000 [1:19:59<8:44:52,  3.71s/it]

{'loss': 1.1898, 'grad_norm': 0.40053683519363403, 'learning_rate': 0.00017000500250125065, 'epoch': 0.01}


 15%|█▌        | 1505/10000 [1:20:01<7:48:39,  3.31s/it]

{'loss': 1.0088, 'grad_norm': 0.4353519082069397, 'learning_rate': 0.00016998499249624814, 'epoch': 0.01}


 15%|█▌        | 1506/10000 [1:20:03<7:12:24,  3.05s/it]

{'loss': 0.8968, 'grad_norm': 0.4359093904495239, 'learning_rate': 0.00016996498249124563, 'epoch': 0.01}


 15%|█▌        | 1507/10000 [1:20:07<7:41:01,  3.26s/it]

{'loss': 0.7952, 'grad_norm': 0.34583941102027893, 'learning_rate': 0.00016994497248624312, 'epoch': 0.01}


 15%|█▌        | 1508/10000 [1:20:10<7:12:54,  3.06s/it]

{'loss': 0.9344, 'grad_norm': 0.4798908531665802, 'learning_rate': 0.00016992496248124063, 'epoch': 0.01}


 15%|█▌        | 1509/10000 [1:20:12<6:57:50,  2.95s/it]

{'loss': 0.7834, 'grad_norm': 0.4002925455570221, 'learning_rate': 0.00016990495247623812, 'epoch': 0.01}


 15%|█▌        | 1510/10000 [1:20:15<6:45:34,  2.87s/it]

{'loss': 0.7214, 'grad_norm': 0.3730875253677368, 'learning_rate': 0.00016988494247123563, 'epoch': 0.01}


 15%|█▌        | 1511/10000 [1:20:18<6:37:20,  2.81s/it]

{'loss': 0.8658, 'grad_norm': 0.4755805432796478, 'learning_rate': 0.00016986493246623312, 'epoch': 0.01}


 15%|█▌        | 1512/10000 [1:20:21<6:55:05,  2.93s/it]

{'loss': 0.8433, 'grad_norm': 0.3710210919380188, 'learning_rate': 0.00016984492246123063, 'epoch': 0.01}


 15%|█▌        | 1513/10000 [1:20:24<7:04:46,  3.00s/it]

{'loss': 0.9003, 'grad_norm': 0.3468468487262726, 'learning_rate': 0.00016982491245622812, 'epoch': 0.01}


 15%|█▌        | 1514/10000 [1:20:29<8:17:03,  3.51s/it]

{'loss': 0.8287, 'grad_norm': 0.29399818181991577, 'learning_rate': 0.00016980490245122564, 'epoch': 0.01}


 15%|█▌        | 1515/10000 [1:20:32<7:44:22,  3.28s/it]

{'loss': 0.9082, 'grad_norm': 0.443471759557724, 'learning_rate': 0.00016978489244622312, 'epoch': 0.01}


 15%|█▌        | 1516/10000 [1:20:34<7:22:35,  3.13s/it]

{'loss': 0.6946, 'grad_norm': 0.3548181354999542, 'learning_rate': 0.0001697648824412206, 'epoch': 0.01}


 15%|█▌        | 1517/10000 [1:20:37<7:08:52,  3.03s/it]

{'loss': 0.6184, 'grad_norm': 0.344803124666214, 'learning_rate': 0.00016974487243621812, 'epoch': 0.01}


 15%|█▌        | 1518/10000 [1:20:40<7:16:30,  3.09s/it]

{'loss': 0.8407, 'grad_norm': 0.36715808510780334, 'learning_rate': 0.0001697248624312156, 'epoch': 0.01}


 15%|█▌        | 1519/10000 [1:20:44<7:22:16,  3.13s/it]

{'loss': 1.169, 'grad_norm': 0.38396281003952026, 'learning_rate': 0.00016970485242621313, 'epoch': 0.01}


 15%|█▌        | 1520/10000 [1:20:47<7:33:00,  3.21s/it]

{'loss': 0.78, 'grad_norm': 0.3625539243221283, 'learning_rate': 0.0001696848424212106, 'epoch': 0.01}


 15%|█▌        | 1521/10000 [1:20:50<7:36:45,  3.23s/it]

{'loss': 1.0924, 'grad_norm': 0.36950206756591797, 'learning_rate': 0.00016966483241620813, 'epoch': 0.01}


 15%|█▌        | 1522/10000 [1:20:53<7:18:10,  3.10s/it]

{'loss': 0.9023, 'grad_norm': 0.4080127775669098, 'learning_rate': 0.00016964482241120561, 'epoch': 0.01}


 15%|█▌        | 1523/10000 [1:20:57<7:31:56,  3.20s/it]

{'loss': 0.7476, 'grad_norm': 0.31897035241127014, 'learning_rate': 0.0001696248124062031, 'epoch': 0.01}


 15%|█▌        | 1524/10000 [1:21:00<7:33:19,  3.21s/it]

{'loss': 1.0117, 'grad_norm': 0.3703524172306061, 'learning_rate': 0.0001696048024012006, 'epoch': 0.01}


 15%|█▌        | 1525/10000 [1:21:03<7:23:41,  3.14s/it]

{'loss': 0.7325, 'grad_norm': 0.3698362410068512, 'learning_rate': 0.0001695847923961981, 'epoch': 0.01}


 15%|█▌        | 1526/10000 [1:21:06<7:26:20,  3.16s/it]

{'loss': 0.8314, 'grad_norm': 0.38873934745788574, 'learning_rate': 0.0001695647823911956, 'epoch': 0.01}


 15%|█▌        | 1527/10000 [1:21:10<8:18:20,  3.53s/it]

{'loss': 0.9942, 'grad_norm': 0.33082857728004456, 'learning_rate': 0.0001695447723861931, 'epoch': 0.01}


 15%|█▌        | 1528/10000 [1:21:13<7:53:42,  3.35s/it]

{'loss': 0.7301, 'grad_norm': 0.36834585666656494, 'learning_rate': 0.0001695247623811906, 'epoch': 0.01}


 15%|█▌        | 1529/10000 [1:21:16<7:46:37,  3.31s/it]

{'loss': 0.6485, 'grad_norm': 0.3153294622898102, 'learning_rate': 0.0001695047523761881, 'epoch': 0.01}


 15%|█▌        | 1530/10000 [1:21:19<7:30:23,  3.19s/it]

{'loss': 1.1898, 'grad_norm': 0.4540819525718689, 'learning_rate': 0.00016948474237118562, 'epoch': 0.01}


 15%|█▌        | 1531/10000 [1:21:22<7:20:53,  3.12s/it]

{'loss': 0.9979, 'grad_norm': 0.4190027117729187, 'learning_rate': 0.0001694647323661831, 'epoch': 0.01}


 15%|█▌        | 1532/10000 [1:21:26<7:50:34,  3.33s/it]

{'loss': 0.8081, 'grad_norm': 0.35358378291130066, 'learning_rate': 0.0001694447223611806, 'epoch': 0.01}


 15%|█▌        | 1533/10000 [1:21:29<7:08:16,  3.03s/it]

{'loss': 1.0751, 'grad_norm': 0.4757066071033478, 'learning_rate': 0.00016942471235617808, 'epoch': 0.01}


 15%|█▌        | 1534/10000 [1:21:32<7:16:18,  3.09s/it]

{'loss': 0.7237, 'grad_norm': 0.3571781814098358, 'learning_rate': 0.0001694047023511756, 'epoch': 0.01}


 15%|█▌        | 1535/10000 [1:21:35<7:02:30,  2.99s/it]

{'loss': 0.9026, 'grad_norm': 0.3833644688129425, 'learning_rate': 0.00016938469234617308, 'epoch': 0.01}


 15%|█▌        | 1536/10000 [1:21:38<7:09:59,  3.05s/it]

{'loss': 0.9687, 'grad_norm': 0.4092556834220886, 'learning_rate': 0.0001693646823411706, 'epoch': 0.01}


 15%|█▌        | 1537/10000 [1:21:41<7:22:14,  3.14s/it]

{'loss': 0.9084, 'grad_norm': 0.3644562363624573, 'learning_rate': 0.0001693446723361681, 'epoch': 0.01}


 15%|█▌        | 1538/10000 [1:21:44<7:11:13,  3.06s/it]

{'loss': 0.8601, 'grad_norm': 0.3743988573551178, 'learning_rate': 0.0001693246623311656, 'epoch': 0.01}


 15%|█▌        | 1539/10000 [1:21:48<7:34:27,  3.22s/it]

{'loss': 0.9034, 'grad_norm': 0.3125931918621063, 'learning_rate': 0.0001693046523261631, 'epoch': 0.01}


 15%|█▌        | 1540/10000 [1:21:51<7:41:48,  3.28s/it]

{'loss': 0.7434, 'grad_norm': 0.3411298394203186, 'learning_rate': 0.0001692846423211606, 'epoch': 0.01}


 15%|█▌        | 1541/10000 [1:21:54<7:52:18,  3.35s/it]

{'loss': 0.9951, 'grad_norm': 0.35388603806495667, 'learning_rate': 0.0001692646323161581, 'epoch': 0.01}


 15%|█▌        | 1542/10000 [1:21:58<7:59:06,  3.40s/it]

{'loss': 0.8893, 'grad_norm': 0.39365890622138977, 'learning_rate': 0.00016924462231115558, 'epoch': 0.01}


 15%|█▌        | 1543/10000 [1:22:02<8:35:52,  3.66s/it]

{'loss': 1.1104, 'grad_norm': 0.36049026250839233, 'learning_rate': 0.00016922461230615306, 'epoch': 0.01}


 15%|█▌        | 1544/10000 [1:22:05<8:16:22,  3.52s/it]

{'loss': 1.091, 'grad_norm': 0.42005714774131775, 'learning_rate': 0.00016920460230115058, 'epoch': 0.01}


 15%|█▌        | 1545/10000 [1:22:08<7:54:45,  3.37s/it]

{'loss': 1.2164, 'grad_norm': 0.4643527865409851, 'learning_rate': 0.0001691845922961481, 'epoch': 0.01}


 15%|█▌        | 1546/10000 [1:22:12<8:02:09,  3.42s/it]

{'loss': 1.0931, 'grad_norm': 0.3833388090133667, 'learning_rate': 0.00016916458229114558, 'epoch': 0.01}


 15%|█▌        | 1547/10000 [1:22:15<7:51:35,  3.35s/it]

{'loss': 1.0767, 'grad_norm': 0.3899494707584381, 'learning_rate': 0.0001691445722861431, 'epoch': 0.01}


 15%|█▌        | 1548/10000 [1:22:18<7:21:09,  3.13s/it]

{'loss': 1.021, 'grad_norm': 0.47711271047592163, 'learning_rate': 0.00016912456228114058, 'epoch': 0.01}


 15%|█▌        | 1549/10000 [1:22:21<7:19:11,  3.12s/it]

{'loss': 0.7921, 'grad_norm': 0.37995654344558716, 'learning_rate': 0.0001691045522761381, 'epoch': 0.01}


 16%|█▌        | 1550/10000 [1:22:24<7:31:36,  3.21s/it]

{'loss': 0.8587, 'grad_norm': 0.38390326499938965, 'learning_rate': 0.00016908454227113558, 'epoch': 0.01}


 16%|█▌        | 1551/10000 [1:22:28<7:53:04,  3.36s/it]

{'loss': 1.0337, 'grad_norm': 0.405215322971344, 'learning_rate': 0.00016906453226613307, 'epoch': 0.01}


 16%|█▌        | 1552/10000 [1:22:32<8:23:39,  3.58s/it]

{'loss': 0.9411, 'grad_norm': 0.3385995924472809, 'learning_rate': 0.00016904452226113056, 'epoch': 0.01}


 16%|█▌        | 1553/10000 [1:22:35<8:01:21,  3.42s/it]

{'loss': 0.7561, 'grad_norm': 0.4158968925476074, 'learning_rate': 0.00016902451225612807, 'epoch': 0.01}


 16%|█▌        | 1554/10000 [1:22:38<7:50:02,  3.34s/it]

{'loss': 0.9005, 'grad_norm': 0.4625011384487152, 'learning_rate': 0.00016900450225112556, 'epoch': 0.01}


 16%|█▌        | 1555/10000 [1:22:43<8:29:39,  3.62s/it]

{'loss': 1.1425, 'grad_norm': 0.34558331966400146, 'learning_rate': 0.00016898449224612307, 'epoch': 0.01}


 16%|█▌        | 1556/10000 [1:22:46<8:16:08,  3.53s/it]

{'loss': 1.1473, 'grad_norm': 0.4382624924182892, 'learning_rate': 0.00016896448224112056, 'epoch': 0.01}


 16%|█▌        | 1557/10000 [1:22:48<7:30:14,  3.20s/it]

{'loss': 0.8177, 'grad_norm': 0.42614686489105225, 'learning_rate': 0.00016894447223611808, 'epoch': 0.01}


 16%|█▌        | 1558/10000 [1:22:52<7:30:02,  3.20s/it]

{'loss': 0.9267, 'grad_norm': 0.37824180722236633, 'learning_rate': 0.00016892446223111556, 'epoch': 0.01}


 16%|█▌        | 1559/10000 [1:22:54<7:12:00,  3.07s/it]

{'loss': 0.6785, 'grad_norm': 0.4450414776802063, 'learning_rate': 0.00016890445222611305, 'epoch': 0.01}


 16%|█▌        | 1560/10000 [1:22:57<6:53:06,  2.94s/it]

{'loss': 1.0434, 'grad_norm': 0.5631009340286255, 'learning_rate': 0.00016888444222111057, 'epoch': 0.01}


 16%|█▌        | 1561/10000 [1:23:03<8:51:40,  3.78s/it]

{'loss': 1.0241, 'grad_norm': 0.3325755298137665, 'learning_rate': 0.00016886443221610805, 'epoch': 0.01}


 16%|█▌        | 1562/10000 [1:23:06<8:41:13,  3.71s/it]

{'loss': 0.7723, 'grad_norm': 0.3657101094722748, 'learning_rate': 0.00016884442221110557, 'epoch': 0.01}


 16%|█▌        | 1563/10000 [1:23:10<8:46:34,  3.74s/it]

{'loss': 1.101, 'grad_norm': 0.42702850699424744, 'learning_rate': 0.00016882441220610305, 'epoch': 0.01}


 16%|█▌        | 1564/10000 [1:23:13<8:11:17,  3.49s/it]

{'loss': 0.8653, 'grad_norm': 0.45645102858543396, 'learning_rate': 0.00016880440220110057, 'epoch': 0.01}


 16%|█▌        | 1565/10000 [1:23:16<8:00:30,  3.42s/it]

{'loss': 0.7354, 'grad_norm': 0.34949827194213867, 'learning_rate': 0.00016878439219609806, 'epoch': 0.01}


 16%|█▌        | 1566/10000 [1:23:19<7:39:08,  3.27s/it]

{'loss': 0.9063, 'grad_norm': 0.3961534798145294, 'learning_rate': 0.00016876438219109557, 'epoch': 0.01}


 16%|█▌        | 1567/10000 [1:23:22<7:40:18,  3.28s/it]

{'loss': 0.8306, 'grad_norm': 0.3799383044242859, 'learning_rate': 0.00016874437218609306, 'epoch': 0.01}


 16%|█▌        | 1568/10000 [1:23:26<8:09:33,  3.48s/it]

{'loss': 0.8475, 'grad_norm': 0.3517885208129883, 'learning_rate': 0.00016872436218109054, 'epoch': 0.01}


 16%|█▌        | 1569/10000 [1:23:29<7:54:09,  3.37s/it]

{'loss': 0.9387, 'grad_norm': 0.39277970790863037, 'learning_rate': 0.00016870435217608803, 'epoch': 0.01}


 16%|█▌        | 1570/10000 [1:23:33<7:52:56,  3.37s/it]

{'loss': 0.8153, 'grad_norm': 0.37196671962738037, 'learning_rate': 0.00016868434217108555, 'epoch': 0.01}


 16%|█▌        | 1571/10000 [1:23:37<8:06:46,  3.46s/it]

{'loss': 0.9719, 'grad_norm': 0.42378556728363037, 'learning_rate': 0.00016866433216608303, 'epoch': 0.01}


 16%|█▌        | 1572/10000 [1:23:41<8:51:21,  3.78s/it]

{'loss': 1.0273, 'grad_norm': 0.3355000615119934, 'learning_rate': 0.00016864432216108055, 'epoch': 0.01}


 16%|█▌        | 1573/10000 [1:23:45<8:59:02,  3.84s/it]

{'loss': 1.0199, 'grad_norm': 0.34157881140708923, 'learning_rate': 0.00016862431215607806, 'epoch': 0.01}


 16%|█▌        | 1574/10000 [1:23:49<8:59:24,  3.84s/it]

{'loss': 0.7478, 'grad_norm': 0.3534340262413025, 'learning_rate': 0.00016860430215107555, 'epoch': 0.01}


 16%|█▌        | 1575/10000 [1:23:53<8:54:52,  3.81s/it]

{'loss': 0.6988, 'grad_norm': 0.33904483914375305, 'learning_rate': 0.00016858429214607306, 'epoch': 0.01}


 16%|█▌        | 1576/10000 [1:23:57<9:13:26,  3.94s/it]

{'loss': 0.9825, 'grad_norm': 0.34909704327583313, 'learning_rate': 0.00016856428214107055, 'epoch': 0.01}


 16%|█▌        | 1577/10000 [1:24:01<9:06:45,  3.89s/it]

{'loss': 0.7207, 'grad_norm': 0.3514634072780609, 'learning_rate': 0.00016854427213606804, 'epoch': 0.01}


 16%|█▌        | 1578/10000 [1:24:05<9:36:11,  4.10s/it]

{'loss': 0.827, 'grad_norm': 0.3173326551914215, 'learning_rate': 0.00016852426213106553, 'epoch': 0.01}


 16%|█▌        | 1579/10000 [1:24:08<8:51:25,  3.79s/it]

{'loss': 0.722, 'grad_norm': 0.3817264437675476, 'learning_rate': 0.00016850425212606304, 'epoch': 0.01}


 16%|█▌        | 1580/10000 [1:24:12<8:36:28,  3.68s/it]

{'loss': 0.9409, 'grad_norm': 0.4039648473262787, 'learning_rate': 0.00016848424212106053, 'epoch': 0.01}


 16%|█▌        | 1581/10000 [1:24:15<8:08:27,  3.48s/it]

{'loss': 0.6741, 'grad_norm': 0.42132681608200073, 'learning_rate': 0.00016846423211605804, 'epoch': 0.01}


 16%|█▌        | 1582/10000 [1:24:18<8:19:44,  3.56s/it]

{'loss': 0.7678, 'grad_norm': 0.35100188851356506, 'learning_rate': 0.00016844422211105553, 'epoch': 0.01}


 16%|█▌        | 1583/10000 [1:24:23<8:45:27,  3.75s/it]

{'loss': 0.9164, 'grad_norm': 0.364421010017395, 'learning_rate': 0.00016842421210605304, 'epoch': 0.01}


 16%|█▌        | 1584/10000 [1:24:27<8:57:34,  3.83s/it]

{'loss': 1.1268, 'grad_norm': 0.3708238899707794, 'learning_rate': 0.00016840420210105053, 'epoch': 0.01}


 16%|█▌        | 1585/10000 [1:24:30<8:49:18,  3.77s/it]

{'loss': 0.7786, 'grad_norm': 0.4233179986476898, 'learning_rate': 0.00016838419209604805, 'epoch': 0.01}


 16%|█▌        | 1586/10000 [1:24:36<10:21:27,  4.43s/it]

{'loss': 1.0371, 'grad_norm': 0.33813539147377014, 'learning_rate': 0.00016836418209104553, 'epoch': 0.01}


 16%|█▌        | 1587/10000 [1:24:40<9:40:31,  4.14s/it]

{'loss': 0.9834, 'grad_norm': 0.35534366965293884, 'learning_rate': 0.00016834417208604302, 'epoch': 0.01}


 16%|█▌        | 1588/10000 [1:24:44<10:00:51,  4.29s/it]

{'loss': 0.9703, 'grad_norm': 0.3400461673736572, 'learning_rate': 0.00016832416208104053, 'epoch': 0.01}


 16%|█▌        | 1589/10000 [1:24:48<9:32:13,  4.08s/it]

{'loss': 1.1102, 'grad_norm': 0.4026965796947479, 'learning_rate': 0.00016830415207603802, 'epoch': 0.01}


 16%|█▌        | 1590/10000 [1:24:51<8:56:35,  3.83s/it]

{'loss': 0.7925, 'grad_norm': 0.3680334985256195, 'learning_rate': 0.00016828414207103554, 'epoch': 0.01}


 16%|█▌        | 1591/10000 [1:24:54<8:20:04,  3.57s/it]

{'loss': 0.8771, 'grad_norm': 0.4352216422557831, 'learning_rate': 0.00016826413206603302, 'epoch': 0.01}


 16%|█▌        | 1592/10000 [1:24:57<7:45:21,  3.32s/it]

{'loss': 0.7568, 'grad_norm': 0.4338636100292206, 'learning_rate': 0.00016824412206103054, 'epoch': 0.01}


 16%|█▌        | 1593/10000 [1:25:00<7:34:45,  3.25s/it]

{'loss': 0.9644, 'grad_norm': 0.3962786793708801, 'learning_rate': 0.00016822411205602802, 'epoch': 0.01}


 16%|█▌        | 1594/10000 [1:25:04<7:54:37,  3.39s/it]

{'loss': 0.9922, 'grad_norm': 0.4156988859176636, 'learning_rate': 0.0001682041020510255, 'epoch': 0.01}


 16%|█▌        | 1595/10000 [1:25:06<7:18:52,  3.13s/it]

{'loss': 0.9683, 'grad_norm': 0.455573171377182, 'learning_rate': 0.000168184092046023, 'epoch': 0.01}


 16%|█▌        | 1596/10000 [1:25:09<6:58:48,  2.99s/it]

{'loss': 0.9931, 'grad_norm': 0.42720773816108704, 'learning_rate': 0.00016816408204102051, 'epoch': 0.01}


 16%|█▌        | 1597/10000 [1:25:12<7:01:09,  3.01s/it]

{'loss': 0.8746, 'grad_norm': 0.3801417052745819, 'learning_rate': 0.000168144072036018, 'epoch': 0.01}


 16%|█▌        | 1598/10000 [1:25:14<6:31:37,  2.80s/it]

{'loss': 0.8953, 'grad_norm': 0.43345677852630615, 'learning_rate': 0.00016812406203101552, 'epoch': 0.01}


 16%|█▌        | 1599/10000 [1:25:19<7:44:12,  3.32s/it]

{'loss': 0.9921, 'grad_norm': 0.33152785897254944, 'learning_rate': 0.000168104052026013, 'epoch': 0.01}


 16%|█▌        | 1600/10000 [1:25:22<7:28:18,  3.20s/it]

{'loss': 0.6855, 'grad_norm': 0.3730037808418274, 'learning_rate': 0.00016808404202101052, 'epoch': 0.01}


 16%|█▌        | 1601/10000 [1:25:27<9:02:37,  3.88s/it]

{'loss': 1.0495, 'grad_norm': 0.3740289509296417, 'learning_rate': 0.00016806403201600803, 'epoch': 0.01}


 16%|█▌        | 1602/10000 [1:25:30<8:31:32,  3.65s/it]

{'loss': 0.9463, 'grad_norm': 0.4044604003429413, 'learning_rate': 0.00016804402201100552, 'epoch': 0.01}


 16%|█▌        | 1603/10000 [1:25:33<7:53:46,  3.39s/it]

{'loss': 0.7059, 'grad_norm': 0.38006722927093506, 'learning_rate': 0.000168024012006003, 'epoch': 0.01}


 16%|█▌        | 1604/10000 [1:25:37<8:00:32,  3.43s/it]

{'loss': 1.3025, 'grad_norm': 0.3809209167957306, 'learning_rate': 0.0001680040020010005, 'epoch': 0.01}


 16%|█▌        | 1605/10000 [1:25:40<7:42:02,  3.30s/it]

{'loss': 0.74, 'grad_norm': 0.3488304913043976, 'learning_rate': 0.000167983991995998, 'epoch': 0.01}


 16%|█▌        | 1606/10000 [1:25:44<8:36:10,  3.69s/it]

{'loss': 1.1478, 'grad_norm': 0.38272908329963684, 'learning_rate': 0.0001679639819909955, 'epoch': 0.01}


 16%|█▌        | 1607/10000 [1:25:47<7:53:30,  3.38s/it]

{'loss': 0.9661, 'grad_norm': 0.41104552149772644, 'learning_rate': 0.000167943971985993, 'epoch': 0.01}


 16%|█▌        | 1608/10000 [1:25:50<7:22:31,  3.16s/it]

{'loss': 0.8613, 'grad_norm': 0.40843862295150757, 'learning_rate': 0.0001679239619809905, 'epoch': 0.01}


 16%|█▌        | 1609/10000 [1:25:53<7:27:11,  3.20s/it]

{'loss': 0.9302, 'grad_norm': 0.3523997366428375, 'learning_rate': 0.000167903951975988, 'epoch': 0.01}


 16%|█▌        | 1610/10000 [1:25:56<7:28:05,  3.20s/it]

{'loss': 1.0625, 'grad_norm': 0.36048924922943115, 'learning_rate': 0.0001678839419709855, 'epoch': 0.01}


 16%|█▌        | 1611/10000 [1:25:59<7:15:45,  3.12s/it]

{'loss': 0.973, 'grad_norm': 0.39850711822509766, 'learning_rate': 0.000167863931965983, 'epoch': 0.01}


 16%|█▌        | 1612/10000 [1:26:03<7:52:07,  3.38s/it]

{'loss': 0.8645, 'grad_norm': 0.34850943088531494, 'learning_rate': 0.0001678439219609805, 'epoch': 0.01}


 16%|█▌        | 1613/10000 [1:26:06<7:28:55,  3.21s/it]

{'loss': 0.9217, 'grad_norm': 0.37401655316352844, 'learning_rate': 0.000167823911955978, 'epoch': 0.01}


 16%|█▌        | 1614/10000 [1:26:10<8:28:13,  3.64s/it]

{'loss': 1.2259, 'grad_norm': 0.3393058478832245, 'learning_rate': 0.0001678039019509755, 'epoch': 0.01}


 16%|█▌        | 1615/10000 [1:26:13<8:07:00,  3.48s/it]

{'loss': 1.0497, 'grad_norm': 0.3828847110271454, 'learning_rate': 0.000167783891945973, 'epoch': 0.01}


 16%|█▌        | 1616/10000 [1:26:18<8:44:40,  3.75s/it]

{'loss': 0.9013, 'grad_norm': 0.3914003372192383, 'learning_rate': 0.0001677638819409705, 'epoch': 0.01}


 16%|█▌        | 1617/10000 [1:26:21<8:25:26,  3.62s/it]

{'loss': 0.5818, 'grad_norm': 0.30716025829315186, 'learning_rate': 0.000167743871935968, 'epoch': 0.01}


 16%|█▌        | 1618/10000 [1:26:24<7:37:24,  3.27s/it]

{'loss': 0.9463, 'grad_norm': 0.4660794138908386, 'learning_rate': 0.0001677238619309655, 'epoch': 0.01}


 16%|█▌        | 1619/10000 [1:26:27<7:30:47,  3.23s/it]

{'loss': 0.669, 'grad_norm': 0.31996506452560425, 'learning_rate': 0.000167703851925963, 'epoch': 0.01}


 16%|█▌        | 1620/10000 [1:26:30<7:46:18,  3.34s/it]

{'loss': 0.8574, 'grad_norm': 0.3461689352989197, 'learning_rate': 0.0001676838419209605, 'epoch': 0.01}


 16%|█▌        | 1621/10000 [1:26:34<7:51:00,  3.37s/it]

{'loss': 0.91, 'grad_norm': 0.4391626715660095, 'learning_rate': 0.000167663831915958, 'epoch': 0.01}


 16%|█▌        | 1622/10000 [1:26:37<7:48:34,  3.36s/it]

{'loss': 1.3577, 'grad_norm': 0.4105829894542694, 'learning_rate': 0.00016764382191095548, 'epoch': 0.01}


 16%|█▌        | 1623/10000 [1:26:40<7:28:02,  3.21s/it]

{'loss': 1.0515, 'grad_norm': 0.39787420630455017, 'learning_rate': 0.00016762381190595297, 'epoch': 0.01}


 16%|█▌        | 1624/10000 [1:26:44<7:47:43,  3.35s/it]

{'loss': 0.8718, 'grad_norm': 0.36587777733802795, 'learning_rate': 0.00016760380190095048, 'epoch': 0.01}


 16%|█▋        | 1625/10000 [1:26:49<8:56:23,  3.84s/it]

{'loss': 1.0379, 'grad_norm': 0.37680283188819885, 'learning_rate': 0.00016758379189594797, 'epoch': 0.01}


 16%|█▋        | 1626/10000 [1:26:52<8:25:28,  3.62s/it]

{'loss': 0.8827, 'grad_norm': 0.3684196472167969, 'learning_rate': 0.00016756378189094548, 'epoch': 0.01}


 16%|█▋        | 1627/10000 [1:26:54<7:43:58,  3.32s/it]

{'loss': 0.718, 'grad_norm': 0.3694789707660675, 'learning_rate': 0.000167543771885943, 'epoch': 0.01}


 16%|█▋        | 1628/10000 [1:26:58<7:53:17,  3.39s/it]

{'loss': 0.7014, 'grad_norm': 0.3358720541000366, 'learning_rate': 0.00016752376188094049, 'epoch': 0.01}


 16%|█▋        | 1629/10000 [1:27:01<7:59:49,  3.44s/it]

{'loss': 1.1546, 'grad_norm': 0.3953404426574707, 'learning_rate': 0.00016750375187593797, 'epoch': 0.01}


 16%|█▋        | 1630/10000 [1:27:04<7:22:11,  3.17s/it]

{'loss': 0.7642, 'grad_norm': 0.4085933268070221, 'learning_rate': 0.00016748374187093546, 'epoch': 0.01}


 16%|█▋        | 1631/10000 [1:27:09<8:29:13,  3.65s/it]

{'loss': 1.1029, 'grad_norm': 0.3488309383392334, 'learning_rate': 0.00016746373186593298, 'epoch': 0.01}


 16%|█▋        | 1632/10000 [1:27:12<8:01:36,  3.45s/it]

{'loss': 0.9081, 'grad_norm': 0.41006460785865784, 'learning_rate': 0.00016744372186093046, 'epoch': 0.01}


 16%|█▋        | 1633/10000 [1:27:15<7:47:37,  3.35s/it]

{'loss': 0.6426, 'grad_norm': 0.3374529778957367, 'learning_rate': 0.00016742371185592798, 'epoch': 0.01}


 16%|█▋        | 1634/10000 [1:27:18<7:23:28,  3.18s/it]

{'loss': 0.4913, 'grad_norm': 0.351780503988266, 'learning_rate': 0.00016740370185092546, 'epoch': 0.01}


 16%|█▋        | 1635/10000 [1:27:21<7:48:09,  3.36s/it]

{'loss': 0.8811, 'grad_norm': 0.3925640285015106, 'learning_rate': 0.00016738369184592298, 'epoch': 0.01}


 16%|█▋        | 1636/10000 [1:27:25<7:59:56,  3.44s/it]

{'loss': 0.8813, 'grad_norm': 0.37227863073349, 'learning_rate': 0.00016736368184092047, 'epoch': 0.01}


 16%|█▋        | 1637/10000 [1:27:29<8:08:48,  3.51s/it]

{'loss': 0.9428, 'grad_norm': 0.40503713488578796, 'learning_rate': 0.00016734367183591798, 'epoch': 0.01}


 16%|█▋        | 1638/10000 [1:27:33<8:53:09,  3.83s/it]

{'loss': 1.1908, 'grad_norm': 0.3850654661655426, 'learning_rate': 0.00016732366183091547, 'epoch': 0.01}


 16%|█▋        | 1639/10000 [1:27:36<8:13:16,  3.54s/it]

{'loss': 0.775, 'grad_norm': 0.38312840461730957, 'learning_rate': 0.00016730365182591295, 'epoch': 0.01}


 16%|█▋        | 1640/10000 [1:27:40<8:05:58,  3.49s/it]

{'loss': 0.7741, 'grad_norm': 0.3558044135570526, 'learning_rate': 0.00016728364182091044, 'epoch': 0.01}


 16%|█▋        | 1641/10000 [1:27:43<8:05:34,  3.49s/it]

{'loss': 0.8037, 'grad_norm': 0.34959694743156433, 'learning_rate': 0.00016726363181590796, 'epoch': 0.01}


 16%|█▋        | 1642/10000 [1:27:46<7:41:44,  3.31s/it]

{'loss': 0.6991, 'grad_norm': 0.3458928167819977, 'learning_rate': 0.00016724362181090547, 'epoch': 0.01}


 16%|█▋        | 1643/10000 [1:27:49<7:36:08,  3.27s/it]

{'loss': 1.0356, 'grad_norm': 0.44065383076667786, 'learning_rate': 0.00016722361180590296, 'epoch': 0.01}


 16%|█▋        | 1644/10000 [1:27:53<7:43:28,  3.33s/it]

{'loss': 0.873, 'grad_norm': 0.34857091307640076, 'learning_rate': 0.00016720360180090047, 'epoch': 0.01}


 16%|█▋        | 1645/10000 [1:27:57<8:16:08,  3.56s/it]

{'loss': 1.1393, 'grad_norm': 0.3311585485935211, 'learning_rate': 0.00016718359179589796, 'epoch': 0.01}


 16%|█▋        | 1646/10000 [1:28:00<8:04:05,  3.48s/it]

{'loss': 0.8825, 'grad_norm': 0.36796870827674866, 'learning_rate': 0.00016716358179089547, 'epoch': 0.01}


 16%|█▋        | 1647/10000 [1:28:03<7:48:00,  3.36s/it]

{'loss': 0.9793, 'grad_norm': 0.4055964946746826, 'learning_rate': 0.00016714357178589296, 'epoch': 0.01}


 16%|█▋        | 1648/10000 [1:28:06<7:22:32,  3.18s/it]

{'loss': 0.6972, 'grad_norm': 0.42113032937049866, 'learning_rate': 0.00016712356178089045, 'epoch': 0.01}


 16%|█▋        | 1649/10000 [1:28:09<7:27:02,  3.21s/it]

{'loss': 1.1703, 'grad_norm': 0.42993655800819397, 'learning_rate': 0.00016710355177588794, 'epoch': 0.01}


 16%|█▋        | 1650/10000 [1:28:12<7:11:41,  3.10s/it]

{'loss': 0.6827, 'grad_norm': 0.329015851020813, 'learning_rate': 0.00016708354177088545, 'epoch': 0.01}


 17%|█▋        | 1651/10000 [1:28:16<7:34:28,  3.27s/it]

{'loss': 1.2983, 'grad_norm': 0.41875743865966797, 'learning_rate': 0.00016706353176588294, 'epoch': 0.01}


 17%|█▋        | 1652/10000 [1:28:19<7:43:40,  3.33s/it]

{'loss': 1.0492, 'grad_norm': 0.35494399070739746, 'learning_rate': 0.00016704352176088045, 'epoch': 0.01}


 17%|█▋        | 1653/10000 [1:28:24<8:47:15,  3.79s/it]

{'loss': 0.9774, 'grad_norm': 0.4098144471645355, 'learning_rate': 0.00016702351175587794, 'epoch': 0.01}


 17%|█▋        | 1654/10000 [1:28:27<8:14:45,  3.56s/it]

{'loss': 1.0095, 'grad_norm': 0.4521881341934204, 'learning_rate': 0.00016700350175087545, 'epoch': 0.01}


 17%|█▋        | 1655/10000 [1:28:32<9:09:53,  3.95s/it]

{'loss': 1.2568, 'grad_norm': 0.3444463908672333, 'learning_rate': 0.00016698349174587297, 'epoch': 0.01}


 17%|█▋        | 1656/10000 [1:28:36<9:02:34,  3.90s/it]

{'loss': 1.1901, 'grad_norm': 0.36608240008354187, 'learning_rate': 0.00016696348174087046, 'epoch': 0.01}


 17%|█▋        | 1657/10000 [1:28:40<9:05:38,  3.92s/it]

{'loss': 0.7701, 'grad_norm': 0.3341697156429291, 'learning_rate': 0.00016694347173586794, 'epoch': 0.01}


 17%|█▋        | 1658/10000 [1:28:43<8:44:09,  3.77s/it]

{'loss': 0.9326, 'grad_norm': 0.36262407898902893, 'learning_rate': 0.00016692346173086543, 'epoch': 0.01}


 17%|█▋        | 1659/10000 [1:28:47<8:43:02,  3.76s/it]

{'loss': 0.8424, 'grad_norm': 0.3619958460330963, 'learning_rate': 0.00016690345172586294, 'epoch': 0.01}


 17%|█▋        | 1660/10000 [1:28:51<8:58:06,  3.87s/it]

{'loss': 0.8224, 'grad_norm': 0.37829095125198364, 'learning_rate': 0.00016688344172086043, 'epoch': 0.01}


 17%|█▋        | 1661/10000 [1:28:55<9:18:50,  4.02s/it]

{'loss': 0.9609, 'grad_norm': 0.3312607407569885, 'learning_rate': 0.00016686343171585795, 'epoch': 0.01}


 17%|█▋        | 1662/10000 [1:28:59<9:02:47,  3.91s/it]

{'loss': 1.1669, 'grad_norm': 0.4685891568660736, 'learning_rate': 0.00016684342171085543, 'epoch': 0.01}


 17%|█▋        | 1663/10000 [1:29:03<9:25:40,  4.07s/it]

{'loss': 1.124, 'grad_norm': 0.39981549978256226, 'learning_rate': 0.00016682341170585295, 'epoch': 0.01}


 17%|█▋        | 1664/10000 [1:29:07<9:22:51,  4.05s/it]

{'loss': 0.9713, 'grad_norm': 0.35458099842071533, 'learning_rate': 0.00016680340170085043, 'epoch': 0.01}


 17%|█▋        | 1665/10000 [1:29:11<9:24:14,  4.06s/it]

{'loss': 0.8426, 'grad_norm': 0.38324296474456787, 'learning_rate': 0.00016678339169584792, 'epoch': 0.01}


 17%|█▋        | 1666/10000 [1:29:16<9:25:32,  4.07s/it]

{'loss': 0.8208, 'grad_norm': 0.3499479293823242, 'learning_rate': 0.0001667633816908454, 'epoch': 0.01}


 17%|█▋        | 1667/10000 [1:29:20<9:29:27,  4.10s/it]

{'loss': 0.8091, 'grad_norm': 0.32229459285736084, 'learning_rate': 0.00016674337168584292, 'epoch': 0.01}


 17%|█▋        | 1668/10000 [1:29:24<9:38:43,  4.17s/it]

{'loss': 1.0892, 'grad_norm': 0.3718658983707428, 'learning_rate': 0.0001667233616808404, 'epoch': 0.01}


 17%|█▋        | 1669/10000 [1:29:29<10:19:49,  4.46s/it]

{'loss': 0.9524, 'grad_norm': 0.30724218487739563, 'learning_rate': 0.00016670335167583793, 'epoch': 0.01}


 17%|█▋        | 1670/10000 [1:29:33<9:37:54,  4.16s/it]

{'loss': 0.8599, 'grad_norm': 0.3626726567745209, 'learning_rate': 0.00016668334167083544, 'epoch': 0.01}


 17%|█▋        | 1671/10000 [1:29:37<9:35:53,  4.15s/it]

{'loss': 0.9192, 'grad_norm': 0.32630211114883423, 'learning_rate': 0.00016666333166583293, 'epoch': 0.01}


 17%|█▋        | 1672/10000 [1:29:40<9:04:18,  3.92s/it]

{'loss': 0.9084, 'grad_norm': 0.3781176507472992, 'learning_rate': 0.00016664332166083044, 'epoch': 0.01}


 17%|█▋        | 1673/10000 [1:29:43<8:26:09,  3.65s/it]

{'loss': 0.8779, 'grad_norm': 0.4625449776649475, 'learning_rate': 0.00016662331165582793, 'epoch': 0.01}


 17%|█▋        | 1674/10000 [1:29:46<8:01:04,  3.47s/it]

{'loss': 0.6733, 'grad_norm': 0.4222520589828491, 'learning_rate': 0.00016660330165082542, 'epoch': 0.01}


 17%|█▋        | 1675/10000 [1:29:50<7:59:43,  3.46s/it]

{'loss': 0.7519, 'grad_norm': 0.3819082975387573, 'learning_rate': 0.0001665832916458229, 'epoch': 0.01}


 17%|█▋        | 1676/10000 [1:29:54<8:22:07,  3.62s/it]

{'loss': 0.5817, 'grad_norm': 0.3211519122123718, 'learning_rate': 0.00016656328164082042, 'epoch': 0.01}


 17%|█▋        | 1677/10000 [1:29:57<8:05:20,  3.50s/it]

{'loss': 0.7616, 'grad_norm': 0.42523932456970215, 'learning_rate': 0.0001665432716358179, 'epoch': 0.01}


 17%|█▋        | 1678/10000 [1:30:04<10:27:01,  4.52s/it]

{'loss': 1.3064, 'grad_norm': 0.3902495801448822, 'learning_rate': 0.00016652326163081542, 'epoch': 0.01}


 17%|█▋        | 1679/10000 [1:30:07<9:19:24,  4.03s/it]

{'loss': 0.8499, 'grad_norm': 0.3502351939678192, 'learning_rate': 0.0001665032516258129, 'epoch': 0.01}


 17%|█▋        | 1680/10000 [1:30:09<8:20:40,  3.61s/it]

{'loss': 0.7659, 'grad_norm': 0.41049325466156006, 'learning_rate': 0.00016648324162081042, 'epoch': 0.01}


 17%|█▋        | 1681/10000 [1:30:14<8:55:45,  3.86s/it]

{'loss': 0.9833, 'grad_norm': 0.3436954617500305, 'learning_rate': 0.0001664632316158079, 'epoch': 0.01}


 17%|█▋        | 1682/10000 [1:30:18<8:58:23,  3.88s/it]

{'loss': 0.8855, 'grad_norm': 0.34387722611427307, 'learning_rate': 0.00016644322161080542, 'epoch': 0.01}


 17%|█▋        | 1683/10000 [1:30:22<9:17:01,  4.02s/it]

{'loss': 1.1576, 'grad_norm': 0.37569472193717957, 'learning_rate': 0.0001664232116058029, 'epoch': 0.01}


 17%|█▋        | 1684/10000 [1:30:25<8:52:14,  3.84s/it]

{'loss': 0.7894, 'grad_norm': 0.3772718012332916, 'learning_rate': 0.0001664032016008004, 'epoch': 0.01}


 17%|█▋        | 1685/10000 [1:30:28<8:10:41,  3.54s/it]

{'loss': 0.8158, 'grad_norm': 0.4406316876411438, 'learning_rate': 0.0001663831915957979, 'epoch': 0.01}


 17%|█▋        | 1686/10000 [1:30:32<8:13:28,  3.56s/it]

{'loss': 0.8237, 'grad_norm': 0.3718106150627136, 'learning_rate': 0.0001663631815907954, 'epoch': 0.01}


 17%|█▋        | 1687/10000 [1:30:35<7:58:49,  3.46s/it]

{'loss': 0.8329, 'grad_norm': 0.37446728348731995, 'learning_rate': 0.0001663431715857929, 'epoch': 0.01}


 17%|█▋        | 1688/10000 [1:30:39<7:59:58,  3.46s/it]

{'loss': 0.946, 'grad_norm': 0.36454522609710693, 'learning_rate': 0.0001663231615807904, 'epoch': 0.01}


 17%|█▋        | 1689/10000 [1:30:41<7:35:19,  3.29s/it]

{'loss': 0.8505, 'grad_norm': 0.4339997470378876, 'learning_rate': 0.00016630315157578791, 'epoch': 0.01}


 17%|█▋        | 1690/10000 [1:30:44<7:22:13,  3.19s/it]

{'loss': 1.1253, 'grad_norm': 0.42905905842781067, 'learning_rate': 0.0001662831415707854, 'epoch': 0.01}


 17%|█▋        | 1691/10000 [1:30:49<8:09:11,  3.53s/it]

{'loss': 1.0428, 'grad_norm': 0.36941763758659363, 'learning_rate': 0.00016626313156578292, 'epoch': 0.01}


 17%|█▋        | 1692/10000 [1:30:51<7:27:26,  3.23s/it]

{'loss': 0.6515, 'grad_norm': 0.3493085205554962, 'learning_rate': 0.0001662431215607804, 'epoch': 0.01}


 17%|█▋        | 1693/10000 [1:30:54<7:22:49,  3.20s/it]

{'loss': 0.7083, 'grad_norm': 0.37483927607536316, 'learning_rate': 0.0001662231115557779, 'epoch': 0.01}


 17%|█▋        | 1694/10000 [1:30:58<7:26:45,  3.23s/it]

{'loss': 0.8693, 'grad_norm': 0.40860480070114136, 'learning_rate': 0.00016620310155077538, 'epoch': 0.01}


 17%|█▋        | 1695/10000 [1:31:01<7:40:15,  3.33s/it]

{'loss': 0.8003, 'grad_norm': 0.34211844205856323, 'learning_rate': 0.0001661830915457729, 'epoch': 0.01}


 17%|█▋        | 1696/10000 [1:31:04<7:34:03,  3.28s/it]

{'loss': 0.812, 'grad_norm': 0.3473621606826782, 'learning_rate': 0.00016616308154077038, 'epoch': 0.01}


 17%|█▋        | 1697/10000 [1:31:07<6:58:48,  3.03s/it]

{'loss': 0.8372, 'grad_norm': 0.42794808745384216, 'learning_rate': 0.0001661430715357679, 'epoch': 0.01}


 17%|█▋        | 1698/10000 [1:31:10<7:08:38,  3.10s/it]

{'loss': 0.6144, 'grad_norm': 0.37323179841041565, 'learning_rate': 0.0001661230615307654, 'epoch': 0.01}


 17%|█▋        | 1699/10000 [1:31:13<7:09:28,  3.10s/it]

{'loss': 0.9945, 'grad_norm': 0.3698316216468811, 'learning_rate': 0.0001661030515257629, 'epoch': 0.01}


 17%|█▋        | 1700/10000 [1:31:16<6:53:51,  2.99s/it]

{'loss': 0.6606, 'grad_norm': 0.3707331717014313, 'learning_rate': 0.00016608304152076038, 'epoch': 0.01}


 17%|█▋        | 1701/10000 [1:31:21<8:38:16,  3.75s/it]

{'loss': 0.9404, 'grad_norm': 0.3532547056674957, 'learning_rate': 0.00016606303151575787, 'epoch': 0.01}


 17%|█▋        | 1702/10000 [1:31:24<7:58:02,  3.46s/it]

{'loss': 0.8415, 'grad_norm': 0.37064048647880554, 'learning_rate': 0.00016604302151075539, 'epoch': 0.01}


 17%|█▋        | 1703/10000 [1:31:27<7:30:09,  3.26s/it]

{'loss': 0.7927, 'grad_norm': 0.3986254632472992, 'learning_rate': 0.00016602301150575287, 'epoch': 0.01}


 17%|█▋        | 1704/10000 [1:31:30<7:33:21,  3.28s/it]

{'loss': 0.7991, 'grad_norm': 0.4113856554031372, 'learning_rate': 0.0001660030015007504, 'epoch': 0.01}


 17%|█▋        | 1705/10000 [1:31:35<8:16:20,  3.59s/it]

{'loss': 1.002, 'grad_norm': 0.3299565017223358, 'learning_rate': 0.00016598299149574787, 'epoch': 0.01}


 17%|█▋        | 1706/10000 [1:31:38<8:09:59,  3.54s/it]

{'loss': 1.2403, 'grad_norm': 0.4184422791004181, 'learning_rate': 0.0001659629814907454, 'epoch': 0.01}


 17%|█▋        | 1707/10000 [1:31:41<7:42:42,  3.35s/it]

{'loss': 0.6881, 'grad_norm': 0.3961773216724396, 'learning_rate': 0.00016594297148574288, 'epoch': 0.01}


 17%|█▋        | 1708/10000 [1:31:44<7:41:58,  3.34s/it]

{'loss': 1.0623, 'grad_norm': 0.45949873328208923, 'learning_rate': 0.0001659229614807404, 'epoch': 0.01}


 17%|█▋        | 1709/10000 [1:31:49<8:16:07,  3.59s/it]

{'loss': 0.8231, 'grad_norm': 0.40188950300216675, 'learning_rate': 0.00016590295147573788, 'epoch': 0.01}


 17%|█▋        | 1710/10000 [1:31:51<7:37:04,  3.31s/it]

{'loss': 0.7111, 'grad_norm': 0.3839484751224518, 'learning_rate': 0.00016588294147073536, 'epoch': 0.01}


 17%|█▋        | 1711/10000 [1:31:54<7:26:47,  3.23s/it]

{'loss': 0.7404, 'grad_norm': 0.47611668705940247, 'learning_rate': 0.00016586293146573288, 'epoch': 0.01}


 17%|█▋        | 1712/10000 [1:31:57<7:14:38,  3.15s/it]

{'loss': 0.7479, 'grad_norm': 0.38865581154823303, 'learning_rate': 0.00016584292146073037, 'epoch': 0.01}


 17%|█▋        | 1713/10000 [1:32:00<7:01:17,  3.05s/it]

{'loss': 0.7392, 'grad_norm': 0.40807631611824036, 'learning_rate': 0.00016582291145572788, 'epoch': 0.01}


 17%|█▋        | 1714/10000 [1:32:03<7:10:10,  3.12s/it]

{'loss': 0.6439, 'grad_norm': 0.3105044364929199, 'learning_rate': 0.00016580290145072537, 'epoch': 0.01}


 17%|█▋        | 1715/10000 [1:32:06<7:13:25,  3.14s/it]

{'loss': 0.8147, 'grad_norm': 0.36205315589904785, 'learning_rate': 0.00016578289144572288, 'epoch': 0.01}


 17%|█▋        | 1716/10000 [1:32:10<7:09:32,  3.11s/it]

{'loss': 0.6644, 'grad_norm': 0.35423481464385986, 'learning_rate': 0.00016576288144072037, 'epoch': 0.01}


 17%|█▋        | 1717/10000 [1:32:13<7:05:00,  3.08s/it]

{'loss': 0.9719, 'grad_norm': 0.41745683550834656, 'learning_rate': 0.00016574287143571788, 'epoch': 0.01}


 17%|█▋        | 1718/10000 [1:32:16<7:39:26,  3.33s/it]

{'loss': 1.0192, 'grad_norm': 0.43220847845077515, 'learning_rate': 0.00016572286143071537, 'epoch': 0.01}


 17%|█▋        | 1719/10000 [1:32:19<7:24:48,  3.22s/it]

{'loss': 0.8232, 'grad_norm': 0.39152827858924866, 'learning_rate': 0.00016570285142571286, 'epoch': 0.01}


 17%|█▋        | 1720/10000 [1:32:23<7:23:11,  3.21s/it]

{'loss': 0.6676, 'grad_norm': 0.3516283631324768, 'learning_rate': 0.00016568284142071035, 'epoch': 0.01}


 17%|█▋        | 1721/10000 [1:32:25<7:05:35,  3.08s/it]

{'loss': 0.6132, 'grad_norm': 0.429335355758667, 'learning_rate': 0.00016566283141570786, 'epoch': 0.01}


 17%|█▋        | 1722/10000 [1:32:29<7:09:55,  3.12s/it]

{'loss': 1.1227, 'grad_norm': 0.43626657128334045, 'learning_rate': 0.00016564282141070535, 'epoch': 0.01}


 17%|█▋        | 1723/10000 [1:32:32<7:17:00,  3.17s/it]

{'loss': 0.8939, 'grad_norm': 0.4147963523864746, 'learning_rate': 0.00016562281140570286, 'epoch': 0.01}


 17%|█▋        | 1724/10000 [1:32:35<7:13:28,  3.14s/it]

{'loss': 0.812, 'grad_norm': 0.46887069940567017, 'learning_rate': 0.00016560280140070038, 'epoch': 0.01}


 17%|█▋        | 1725/10000 [1:32:39<7:42:13,  3.35s/it]

{'loss': 1.048, 'grad_norm': 0.3511951267719269, 'learning_rate': 0.00016558279139569786, 'epoch': 0.01}


 17%|█▋        | 1726/10000 [1:32:42<7:53:31,  3.43s/it]

{'loss': 1.2565, 'grad_norm': 0.4094412922859192, 'learning_rate': 0.00016556278139069538, 'epoch': 0.01}


 17%|█▋        | 1727/10000 [1:32:46<7:59:24,  3.48s/it]

{'loss': 0.9409, 'grad_norm': 0.3514978587627411, 'learning_rate': 0.00016554277138569287, 'epoch': 0.01}


 17%|█▋        | 1728/10000 [1:32:51<9:02:35,  3.94s/it]

{'loss': 0.9377, 'grad_norm': 0.3474338948726654, 'learning_rate': 0.00016552276138069035, 'epoch': 0.01}


 17%|█▋        | 1729/10000 [1:32:54<8:03:50,  3.51s/it]

{'loss': 0.7791, 'grad_norm': 0.4459969997406006, 'learning_rate': 0.00016550275137568784, 'epoch': 0.01}


 17%|█▋        | 1730/10000 [1:32:57<7:45:45,  3.38s/it]

{'loss': 0.9383, 'grad_norm': 0.3920968174934387, 'learning_rate': 0.00016548274137068535, 'epoch': 0.01}


 17%|█▋        | 1731/10000 [1:33:00<7:38:01,  3.32s/it]

{'loss': 0.641, 'grad_norm': 0.42130348086357117, 'learning_rate': 0.00016546273136568284, 'epoch': 0.01}


 17%|█▋        | 1732/10000 [1:33:03<7:53:53,  3.44s/it]

{'loss': 1.0815, 'grad_norm': 0.39098283648490906, 'learning_rate': 0.00016544272136068036, 'epoch': 0.01}


 17%|█▋        | 1733/10000 [1:33:06<7:25:46,  3.24s/it]

{'loss': 0.6419, 'grad_norm': 0.3681292235851288, 'learning_rate': 0.00016542271135567784, 'epoch': 0.01}


 17%|█▋        | 1734/10000 [1:33:09<7:05:08,  3.09s/it]

{'loss': 0.7395, 'grad_norm': 0.36746910214424133, 'learning_rate': 0.00016540270135067536, 'epoch': 0.01}


 17%|█▋        | 1735/10000 [1:33:13<7:25:48,  3.24s/it]

{'loss': 0.9339, 'grad_norm': 0.400605171918869, 'learning_rate': 0.00016538269134567284, 'epoch': 0.01}


 17%|█▋        | 1736/10000 [1:33:16<7:27:08,  3.25s/it]

{'loss': 0.9202, 'grad_norm': 0.38604074716567993, 'learning_rate': 0.00016536268134067033, 'epoch': 0.01}


 17%|█▋        | 1737/10000 [1:33:19<7:15:59,  3.17s/it]

{'loss': 0.9691, 'grad_norm': 0.36357036232948303, 'learning_rate': 0.00016534267133566782, 'epoch': 0.01}


 17%|█▋        | 1738/10000 [1:33:23<7:51:54,  3.43s/it]

{'loss': 0.9658, 'grad_norm': 0.3369028866291046, 'learning_rate': 0.00016532266133066533, 'epoch': 0.01}


 17%|█▋        | 1739/10000 [1:33:26<7:57:51,  3.47s/it]

{'loss': 0.8895, 'grad_norm': 0.37450817227363586, 'learning_rate': 0.00016530265132566285, 'epoch': 0.01}


 17%|█▋        | 1740/10000 [1:33:30<7:43:52,  3.37s/it]

{'loss': 1.1684, 'grad_norm': 0.4674052894115448, 'learning_rate': 0.00016528264132066034, 'epoch': 0.01}


 17%|█▋        | 1741/10000 [1:33:33<7:59:45,  3.49s/it]

{'loss': 0.9707, 'grad_norm': 0.32376885414123535, 'learning_rate': 0.00016526263131565785, 'epoch': 0.01}


 17%|█▋        | 1742/10000 [1:33:37<8:15:36,  3.60s/it]

{'loss': 0.8071, 'grad_norm': 0.34795093536376953, 'learning_rate': 0.00016524262131065534, 'epoch': 0.01}


 17%|█▋        | 1743/10000 [1:33:40<7:34:16,  3.30s/it]

{'loss': 0.7825, 'grad_norm': 0.3884959816932678, 'learning_rate': 0.00016522261130565285, 'epoch': 0.01}


 17%|█▋        | 1744/10000 [1:33:43<7:19:57,  3.20s/it]

{'loss': 0.8131, 'grad_norm': 0.3821171224117279, 'learning_rate': 0.00016520260130065034, 'epoch': 0.01}


 17%|█▋        | 1745/10000 [1:33:45<6:58:44,  3.04s/it]

{'loss': 0.5866, 'grad_norm': 0.38667991757392883, 'learning_rate': 0.00016518259129564783, 'epoch': 0.01}


 17%|█▋        | 1746/10000 [1:33:48<6:36:46,  2.88s/it]

{'loss': 0.7497, 'grad_norm': 0.40031296014785767, 'learning_rate': 0.0001651625812906453, 'epoch': 0.01}


 17%|█▋        | 1747/10000 [1:33:51<6:24:29,  2.80s/it]

{'loss': 1.0644, 'grad_norm': 0.45336565375328064, 'learning_rate': 0.00016514257128564283, 'epoch': 0.01}


 17%|█▋        | 1748/10000 [1:33:54<7:02:33,  3.07s/it]

{'loss': 0.9249, 'grad_norm': 0.3443340063095093, 'learning_rate': 0.00016512256128064031, 'epoch': 0.01}


 17%|█▋        | 1749/10000 [1:33:58<7:38:38,  3.34s/it]

{'loss': 0.8604, 'grad_norm': 0.3299325704574585, 'learning_rate': 0.00016510255127563783, 'epoch': 0.01}


 18%|█▊        | 1750/10000 [1:34:01<7:32:49,  3.29s/it]

{'loss': 0.7584, 'grad_norm': 0.44421377778053284, 'learning_rate': 0.00016508254127063532, 'epoch': 0.01}


 18%|█▊        | 1751/10000 [1:34:05<7:30:13,  3.27s/it]

{'loss': 0.7146, 'grad_norm': 0.3598446846008301, 'learning_rate': 0.00016506253126563283, 'epoch': 0.01}


 18%|█▊        | 1752/10000 [1:34:08<7:30:04,  3.27s/it]

{'loss': 0.8529, 'grad_norm': 0.36322879791259766, 'learning_rate': 0.00016504252126063035, 'epoch': 0.01}


 18%|█▊        | 1753/10000 [1:34:11<7:11:53,  3.14s/it]

{'loss': 0.8162, 'grad_norm': 0.43427395820617676, 'learning_rate': 0.00016502251125562783, 'epoch': 0.01}


 18%|█▊        | 1754/10000 [1:34:14<6:56:53,  3.03s/it]

{'loss': 0.7556, 'grad_norm': 0.3627396523952484, 'learning_rate': 0.00016500250125062532, 'epoch': 0.01}


 18%|█▊        | 1755/10000 [1:34:16<6:33:22,  2.86s/it]

{'loss': 1.0726, 'grad_norm': 0.5225361585617065, 'learning_rate': 0.0001649824912456228, 'epoch': 0.01}


 18%|█▊        | 1756/10000 [1:34:19<6:36:29,  2.89s/it]

{'loss': 0.9343, 'grad_norm': 0.402748167514801, 'learning_rate': 0.00016496248124062032, 'epoch': 0.01}


 18%|█▊        | 1757/10000 [1:34:22<6:47:26,  2.97s/it]

{'loss': 0.7206, 'grad_norm': 0.33747801184654236, 'learning_rate': 0.0001649424712356178, 'epoch': 0.01}


 18%|█▊        | 1758/10000 [1:34:26<7:42:39,  3.37s/it]

{'loss': 0.9757, 'grad_norm': 0.29145845770835876, 'learning_rate': 0.00016492246123061532, 'epoch': 0.01}


 18%|█▊        | 1759/10000 [1:34:29<7:18:14,  3.19s/it]

{'loss': 0.8366, 'grad_norm': 0.36762183904647827, 'learning_rate': 0.0001649024512256128, 'epoch': 0.01}


 18%|█▊        | 1760/10000 [1:34:32<6:50:29,  2.99s/it]

{'loss': 0.9423, 'grad_norm': 0.4032961130142212, 'learning_rate': 0.00016488244122061032, 'epoch': 0.01}


 18%|█▊        | 1761/10000 [1:34:36<7:52:17,  3.44s/it]

{'loss': 0.9379, 'grad_norm': 0.3362009823322296, 'learning_rate': 0.0001648624312156078, 'epoch': 0.01}


 18%|█▊        | 1762/10000 [1:34:39<7:42:36,  3.37s/it]

{'loss': 1.0579, 'grad_norm': 0.3984147012233734, 'learning_rate': 0.00016484242121060533, 'epoch': 0.01}


 18%|█▊        | 1763/10000 [1:34:42<7:22:56,  3.23s/it]

{'loss': 0.8281, 'grad_norm': 0.3558615446090698, 'learning_rate': 0.0001648224112056028, 'epoch': 0.01}


 18%|█▊        | 1764/10000 [1:34:45<7:16:36,  3.18s/it]

{'loss': 0.8227, 'grad_norm': 0.3768102824687958, 'learning_rate': 0.0001648024012006003, 'epoch': 0.01}


 18%|█▊        | 1765/10000 [1:34:48<6:58:21,  3.05s/it]

{'loss': 0.8182, 'grad_norm': 0.4281405508518219, 'learning_rate': 0.0001647823911955978, 'epoch': 0.01}


 18%|█▊        | 1766/10000 [1:34:51<7:04:27,  3.09s/it]

{'loss': 0.9471, 'grad_norm': 0.4113248288631439, 'learning_rate': 0.0001647623811905953, 'epoch': 0.01}


 18%|█▊        | 1767/10000 [1:34:54<6:58:43,  3.05s/it]

{'loss': 1.0034, 'grad_norm': 0.43779677152633667, 'learning_rate': 0.00016474237118559282, 'epoch': 0.01}


 18%|█▊        | 1768/10000 [1:34:57<6:52:28,  3.01s/it]

{'loss': 0.7973, 'grad_norm': 0.35905179381370544, 'learning_rate': 0.0001647223611805903, 'epoch': 0.01}


 18%|█▊        | 1769/10000 [1:35:01<7:09:33,  3.13s/it]

{'loss': 0.9138, 'grad_norm': 0.3323267698287964, 'learning_rate': 0.00016470235117558782, 'epoch': 0.01}


 18%|█▊        | 1770/10000 [1:35:05<7:45:08,  3.39s/it]

{'loss': 0.8403, 'grad_norm': 0.3391384780406952, 'learning_rate': 0.0001646823411705853, 'epoch': 0.01}


 18%|█▊        | 1771/10000 [1:35:07<7:16:39,  3.18s/it]

{'loss': 0.8267, 'grad_norm': 0.4468754231929779, 'learning_rate': 0.0001646623311655828, 'epoch': 0.01}


 18%|█▊        | 1772/10000 [1:35:10<7:18:55,  3.20s/it]

{'loss': 0.6565, 'grad_norm': 0.3501990735530853, 'learning_rate': 0.00016464232116058028, 'epoch': 0.01}


 18%|█▊        | 1773/10000 [1:35:14<7:49:15,  3.42s/it]

{'loss': 0.7398, 'grad_norm': 0.31911638379096985, 'learning_rate': 0.0001646223111555778, 'epoch': 0.01}


 18%|█▊        | 1774/10000 [1:35:18<7:36:23,  3.33s/it]

{'loss': 1.0292, 'grad_norm': 0.3625122308731079, 'learning_rate': 0.00016460230115057528, 'epoch': 0.01}


 18%|█▊        | 1775/10000 [1:35:20<7:03:40,  3.09s/it]

{'loss': 0.775, 'grad_norm': 0.38385164737701416, 'learning_rate': 0.0001645822911455728, 'epoch': 0.01}


 18%|█▊        | 1776/10000 [1:35:22<6:34:47,  2.88s/it]

{'loss': 0.8239, 'grad_norm': 0.4289923310279846, 'learning_rate': 0.00016456228114057028, 'epoch': 0.01}


 18%|█▊        | 1777/10000 [1:35:25<6:28:59,  2.84s/it]

{'loss': 1.0555, 'grad_norm': 0.4364125430583954, 'learning_rate': 0.0001645422711355678, 'epoch': 0.01}


 18%|█▊        | 1778/10000 [1:35:29<6:53:09,  3.02s/it]

{'loss': 0.9344, 'grad_norm': 0.3668121099472046, 'learning_rate': 0.00016452226113056529, 'epoch': 0.01}


 18%|█▊        | 1779/10000 [1:35:32<7:16:22,  3.18s/it]

{'loss': 0.9939, 'grad_norm': 0.385210245847702, 'learning_rate': 0.0001645022511255628, 'epoch': 0.01}


 18%|█▊        | 1780/10000 [1:35:35<6:57:53,  3.05s/it]

{'loss': 0.9064, 'grad_norm': 0.39212915301322937, 'learning_rate': 0.0001644822411205603, 'epoch': 0.01}


 18%|█▊        | 1781/10000 [1:35:39<7:39:34,  3.35s/it]

{'loss': 1.4042, 'grad_norm': 0.467835396528244, 'learning_rate': 0.00016446223111555777, 'epoch': 0.01}


 18%|█▊        | 1782/10000 [1:35:42<7:08:24,  3.13s/it]

{'loss': 0.9234, 'grad_norm': 0.4088902175426483, 'learning_rate': 0.0001644422211105553, 'epoch': 0.01}


 18%|█▊        | 1783/10000 [1:35:47<8:51:22,  3.88s/it]

{'loss': 1.1872, 'grad_norm': 0.34044161438941956, 'learning_rate': 0.00016442221110555278, 'epoch': 0.01}


 18%|█▊        | 1784/10000 [1:35:51<8:30:49,  3.73s/it]

{'loss': 1.3554, 'grad_norm': 0.39357760548591614, 'learning_rate': 0.0001644022011005503, 'epoch': 0.01}


 18%|█▊        | 1785/10000 [1:35:53<7:38:47,  3.35s/it]

{'loss': 0.9373, 'grad_norm': 0.4475862383842468, 'learning_rate': 0.00016438219109554778, 'epoch': 0.01}


 18%|█▊        | 1786/10000 [1:35:56<7:18:17,  3.20s/it]

{'loss': 0.7257, 'grad_norm': 0.35827842354774475, 'learning_rate': 0.0001643621810905453, 'epoch': 0.01}


 18%|█▊        | 1787/10000 [1:35:59<7:15:58,  3.19s/it]

{'loss': 0.9031, 'grad_norm': 0.46225598454475403, 'learning_rate': 0.00016434217108554278, 'epoch': 0.01}


 18%|█▊        | 1788/10000 [1:36:02<7:04:14,  3.10s/it]

{'loss': 1.0528, 'grad_norm': 0.3699670732021332, 'learning_rate': 0.0001643221610805403, 'epoch': 0.01}


 18%|█▊        | 1789/10000 [1:36:05<7:09:09,  3.14s/it]

{'loss': 0.6684, 'grad_norm': 0.44667673110961914, 'learning_rate': 0.00016430215107553778, 'epoch': 0.01}


 18%|█▊        | 1790/10000 [1:36:09<7:38:33,  3.35s/it]

{'loss': 1.0459, 'grad_norm': 0.35312920808792114, 'learning_rate': 0.00016428214107053527, 'epoch': 0.01}


 18%|█▊        | 1791/10000 [1:36:11<6:58:45,  3.06s/it]

{'loss': 0.9289, 'grad_norm': 0.4201183319091797, 'learning_rate': 0.00016426213106553276, 'epoch': 0.01}


 18%|█▊        | 1792/10000 [1:36:14<6:52:22,  3.01s/it]

{'loss': 0.6756, 'grad_norm': 0.3669244647026062, 'learning_rate': 0.00016424212106053027, 'epoch': 0.01}


 18%|█▊        | 1793/10000 [1:36:18<7:24:52,  3.25s/it]

{'loss': 1.0209, 'grad_norm': 0.35403263568878174, 'learning_rate': 0.00016422211105552776, 'epoch': 0.01}


 18%|█▊        | 1794/10000 [1:36:21<7:17:20,  3.20s/it]

{'loss': 1.0083, 'grad_norm': 0.4331936538219452, 'learning_rate': 0.00016420210105052527, 'epoch': 0.01}


 18%|█▊        | 1795/10000 [1:36:25<7:24:17,  3.25s/it]

{'loss': 0.7939, 'grad_norm': 0.3883861303329468, 'learning_rate': 0.00016418209104552279, 'epoch': 0.01}


 18%|█▊        | 1796/10000 [1:36:28<7:22:03,  3.23s/it]

{'loss': 0.993, 'grad_norm': 0.3785313367843628, 'learning_rate': 0.00016416208104052027, 'epoch': 0.01}


 18%|█▊        | 1797/10000 [1:36:31<7:18:00,  3.20s/it]

{'loss': 1.0015, 'grad_norm': 0.4436284005641937, 'learning_rate': 0.0001641420710355178, 'epoch': 0.01}


 18%|█▊        | 1798/10000 [1:36:34<7:13:59,  3.17s/it]

{'loss': 0.7592, 'grad_norm': 0.39098864793777466, 'learning_rate': 0.00016412206103051525, 'epoch': 0.01}


 18%|█▊        | 1799/10000 [1:36:37<7:07:54,  3.13s/it]

{'loss': 0.657, 'grad_norm': 0.3458561301231384, 'learning_rate': 0.00016410205102551276, 'epoch': 0.01}


 18%|█▊        | 1800/10000 [1:36:40<7:03:39,  3.10s/it]

{'loss': 0.9259, 'grad_norm': 0.43105876445770264, 'learning_rate': 0.00016408204102051025, 'epoch': 0.01}


 18%|█▊        | 1801/10000 [1:36:44<7:50:49,  3.45s/it]

{'loss': 0.8863, 'grad_norm': 0.42866405844688416, 'learning_rate': 0.00016406203101550776, 'epoch': 0.01}


 18%|█▊        | 1802/10000 [1:36:47<7:27:58,  3.28s/it]

{'loss': 0.8932, 'grad_norm': 0.34537574648857117, 'learning_rate': 0.00016404202101050525, 'epoch': 0.01}


 18%|█▊        | 1803/10000 [1:36:51<7:40:46,  3.37s/it]

{'loss': 0.9463, 'grad_norm': 0.3435438275337219, 'learning_rate': 0.00016402201100550277, 'epoch': 0.01}


 18%|█▊        | 1804/10000 [1:36:54<7:31:50,  3.31s/it]

{'loss': 0.5807, 'grad_norm': 0.39582911133766174, 'learning_rate': 0.00016400200100050025, 'epoch': 0.01}


 18%|█▊        | 1805/10000 [1:36:57<7:20:59,  3.23s/it]

{'loss': 0.9593, 'grad_norm': 0.419930636882782, 'learning_rate': 0.00016398199099549777, 'epoch': 0.01}


 18%|█▊        | 1806/10000 [1:37:00<7:08:04,  3.13s/it]

{'loss': 0.494, 'grad_norm': 0.31462225317955017, 'learning_rate': 0.00016396198099049525, 'epoch': 0.01}


 18%|█▊        | 1807/10000 [1:37:03<7:02:46,  3.10s/it]

{'loss': 1.0024, 'grad_norm': 0.43014875054359436, 'learning_rate': 0.00016394197098549274, 'epoch': 0.01}


 18%|█▊        | 1808/10000 [1:37:06<7:14:59,  3.19s/it]

{'loss': 0.9017, 'grad_norm': 0.39560410380363464, 'learning_rate': 0.00016392196098049026, 'epoch': 0.01}


 18%|█▊        | 1809/10000 [1:37:09<6:47:32,  2.99s/it]

{'loss': 0.9573, 'grad_norm': 0.43122875690460205, 'learning_rate': 0.00016390195097548774, 'epoch': 0.01}


 18%|█▊        | 1810/10000 [1:37:12<6:35:50,  2.90s/it]

{'loss': 0.9368, 'grad_norm': 0.4209216833114624, 'learning_rate': 0.00016388194097048526, 'epoch': 0.01}


 18%|█▊        | 1811/10000 [1:37:15<7:04:09,  3.11s/it]

{'loss': 0.9682, 'grad_norm': 0.3842174708843231, 'learning_rate': 0.00016386193096548275, 'epoch': 0.01}


 18%|█▊        | 1812/10000 [1:37:19<7:27:59,  3.28s/it]

{'loss': 0.9335, 'grad_norm': 0.3668752610683441, 'learning_rate': 0.00016384192096048026, 'epoch': 0.01}


 18%|█▊        | 1813/10000 [1:37:23<8:07:11,  3.57s/it]

{'loss': 1.0162, 'grad_norm': 0.35606345534324646, 'learning_rate': 0.00016382191095547775, 'epoch': 0.01}


 18%|█▊        | 1814/10000 [1:37:26<7:32:18,  3.32s/it]

{'loss': 1.0604, 'grad_norm': 0.3968253433704376, 'learning_rate': 0.00016380190095047526, 'epoch': 0.01}


 18%|█▊        | 1815/10000 [1:37:29<7:34:27,  3.33s/it]

{'loss': 0.7351, 'grad_norm': 0.3668431341648102, 'learning_rate': 0.00016378189094547275, 'epoch': 0.01}


 18%|█▊        | 1816/10000 [1:37:32<7:03:15,  3.10s/it]

{'loss': 0.9616, 'grad_norm': 0.47115883231163025, 'learning_rate': 0.00016376188094047024, 'epoch': 0.01}


 18%|█▊        | 1817/10000 [1:37:36<7:42:30,  3.39s/it]

{'loss': 0.9634, 'grad_norm': 0.32989996671676636, 'learning_rate': 0.00016374187093546772, 'epoch': 0.01}


 18%|█▊        | 1818/10000 [1:37:38<7:06:31,  3.13s/it]

{'loss': 1.0104, 'grad_norm': 0.5092653632164001, 'learning_rate': 0.00016372186093046524, 'epoch': 0.01}


 18%|█▊        | 1819/10000 [1:37:42<7:29:34,  3.30s/it]

{'loss': 0.733, 'grad_norm': 0.3238060772418976, 'learning_rate': 0.00016370185092546272, 'epoch': 0.01}


 18%|█▊        | 1820/10000 [1:37:46<7:44:06,  3.40s/it]

{'loss': 0.8983, 'grad_norm': 0.4256971478462219, 'learning_rate': 0.00016368184092046024, 'epoch': 0.01}


 18%|█▊        | 1821/10000 [1:37:49<7:30:35,  3.31s/it]

{'loss': 0.7593, 'grad_norm': 0.43255436420440674, 'learning_rate': 0.00016366183091545773, 'epoch': 0.01}


 18%|█▊        | 1822/10000 [1:37:52<7:19:41,  3.23s/it]

{'loss': 0.8293, 'grad_norm': 0.3876338005065918, 'learning_rate': 0.00016364182091045524, 'epoch': 0.01}


 18%|█▊        | 1823/10000 [1:37:56<8:02:36,  3.54s/it]

{'loss': 0.8123, 'grad_norm': 0.3691839873790741, 'learning_rate': 0.00016362181090545276, 'epoch': 0.01}


 18%|█▊        | 1824/10000 [1:38:00<7:59:57,  3.52s/it]

{'loss': 0.9701, 'grad_norm': 0.3557569980621338, 'learning_rate': 0.00016360180090045024, 'epoch': 0.01}


 18%|█▊        | 1825/10000 [1:38:03<8:15:07,  3.63s/it]

{'loss': 0.95, 'grad_norm': 0.33178889751434326, 'learning_rate': 0.00016358179089544773, 'epoch': 0.01}


 18%|█▊        | 1826/10000 [1:38:07<8:24:04,  3.70s/it]

{'loss': 0.902, 'grad_norm': 0.3604142367839813, 'learning_rate': 0.00016356178089044522, 'epoch': 0.01}


 18%|█▊        | 1827/10000 [1:38:10<8:03:09,  3.55s/it]

{'loss': 0.7599, 'grad_norm': 0.3663721978664398, 'learning_rate': 0.00016354177088544273, 'epoch': 0.01}


 18%|█▊        | 1828/10000 [1:38:14<7:54:30,  3.48s/it]

{'loss': 0.9956, 'grad_norm': 0.4165147840976715, 'learning_rate': 0.00016352176088044022, 'epoch': 0.01}


 18%|█▊        | 1829/10000 [1:38:17<7:50:25,  3.45s/it]

{'loss': 0.7838, 'grad_norm': 0.3483366370201111, 'learning_rate': 0.00016350175087543773, 'epoch': 0.01}


 18%|█▊        | 1830/10000 [1:38:21<8:03:50,  3.55s/it]

{'loss': 0.8712, 'grad_norm': 0.3098447024822235, 'learning_rate': 0.00016348174087043522, 'epoch': 0.01}


 18%|█▊        | 1831/10000 [1:38:24<7:51:56,  3.47s/it]

{'loss': 0.6218, 'grad_norm': 0.34088751673698425, 'learning_rate': 0.00016346173086543273, 'epoch': 0.01}


 18%|█▊        | 1832/10000 [1:38:28<8:13:52,  3.63s/it]

{'loss': 0.944, 'grad_norm': 0.3388756811618805, 'learning_rate': 0.00016344172086043022, 'epoch': 0.01}


 18%|█▊        | 1833/10000 [1:38:31<7:48:49,  3.44s/it]

{'loss': 0.7432, 'grad_norm': 0.3669344484806061, 'learning_rate': 0.00016342171085542774, 'epoch': 0.01}


 18%|█▊        | 1834/10000 [1:38:34<7:21:22,  3.24s/it]

{'loss': 1.0047, 'grad_norm': 0.38571205735206604, 'learning_rate': 0.0001634017008504252, 'epoch': 0.01}


 18%|█▊        | 1835/10000 [1:38:38<7:45:47,  3.42s/it]

{'loss': 0.7877, 'grad_norm': 0.36877644062042236, 'learning_rate': 0.0001633816908454227, 'epoch': 0.01}


 18%|█▊        | 1836/10000 [1:38:41<7:28:39,  3.30s/it]

{'loss': 0.7281, 'grad_norm': 0.3476688861846924, 'learning_rate': 0.00016336168084042023, 'epoch': 0.01}


 18%|█▊        | 1837/10000 [1:38:45<7:44:04,  3.41s/it]

{'loss': 1.0402, 'grad_norm': 0.3592277765274048, 'learning_rate': 0.0001633416708354177, 'epoch': 0.01}


 18%|█▊        | 1838/10000 [1:38:48<8:03:00,  3.55s/it]

{'loss': 0.9323, 'grad_norm': 0.35251089930534363, 'learning_rate': 0.00016332166083041523, 'epoch': 0.01}


 18%|█▊        | 1839/10000 [1:38:52<7:53:16,  3.48s/it]

{'loss': 0.7413, 'grad_norm': 0.3450261950492859, 'learning_rate': 0.00016330165082541271, 'epoch': 0.01}


 18%|█▊        | 1840/10000 [1:38:55<7:53:55,  3.48s/it]

{'loss': 0.7467, 'grad_norm': 0.3642633855342865, 'learning_rate': 0.00016328164082041023, 'epoch': 0.01}


 18%|█▊        | 1841/10000 [1:38:59<8:13:52,  3.63s/it]

{'loss': 1.0362, 'grad_norm': 0.34472110867500305, 'learning_rate': 0.00016326163081540772, 'epoch': 0.01}


 18%|█▊        | 1842/10000 [1:39:02<7:35:46,  3.35s/it]

{'loss': 0.7312, 'grad_norm': 0.3717266917228699, 'learning_rate': 0.0001632416208104052, 'epoch': 0.01}


 18%|█▊        | 1843/10000 [1:39:05<7:33:37,  3.34s/it]

{'loss': 0.9773, 'grad_norm': 0.3680633008480072, 'learning_rate': 0.0001632216108054027, 'epoch': 0.01}


 18%|█▊        | 1844/10000 [1:39:09<7:35:54,  3.35s/it]

{'loss': 0.895, 'grad_norm': 0.31495925784111023, 'learning_rate': 0.0001632016008004002, 'epoch': 0.01}


 18%|█▊        | 1845/10000 [1:39:12<7:28:21,  3.30s/it]

{'loss': 0.7038, 'grad_norm': 0.3700241446495056, 'learning_rate': 0.0001631815907953977, 'epoch': 0.01}


 18%|█▊        | 1846/10000 [1:39:15<7:27:12,  3.29s/it]

{'loss': 0.9309, 'grad_norm': 0.3921560049057007, 'learning_rate': 0.0001631615807903952, 'epoch': 0.01}


 18%|█▊        | 1847/10000 [1:39:20<8:28:36,  3.74s/it]

{'loss': 1.142, 'grad_norm': 0.3354659676551819, 'learning_rate': 0.0001631415707853927, 'epoch': 0.01}


 18%|█▊        | 1848/10000 [1:39:23<8:15:57,  3.65s/it]

{'loss': 0.8199, 'grad_norm': 0.3499070405960083, 'learning_rate': 0.0001631215607803902, 'epoch': 0.01}


 18%|█▊        | 1849/10000 [1:39:27<8:10:01,  3.61s/it]

{'loss': 0.7382, 'grad_norm': 0.32603728771209717, 'learning_rate': 0.00016310155077538772, 'epoch': 0.01}


 18%|█▊        | 1850/10000 [1:39:30<7:43:32,  3.41s/it]

{'loss': 0.9154, 'grad_norm': 0.4021395742893219, 'learning_rate': 0.0001630815407703852, 'epoch': 0.01}


 19%|█▊        | 1851/10000 [1:39:33<7:36:49,  3.36s/it]

{'loss': 0.7409, 'grad_norm': 0.3701898157596588, 'learning_rate': 0.0001630615307653827, 'epoch': 0.01}


 19%|█▊        | 1852/10000 [1:39:36<7:06:25,  3.14s/it]

{'loss': 0.6906, 'grad_norm': 0.39807572960853577, 'learning_rate': 0.00016304152076038018, 'epoch': 0.01}


 19%|█▊        | 1853/10000 [1:39:39<6:56:55,  3.07s/it]

{'loss': 0.8527, 'grad_norm': 0.4079901874065399, 'learning_rate': 0.0001630215107553777, 'epoch': 0.01}


 19%|█▊        | 1854/10000 [1:39:42<7:06:36,  3.14s/it]

{'loss': 0.6773, 'grad_norm': 0.37357333302497864, 'learning_rate': 0.00016300150075037519, 'epoch': 0.01}


 19%|█▊        | 1855/10000 [1:39:47<8:22:10,  3.70s/it]

{'loss': 1.1496, 'grad_norm': 0.3098253607749939, 'learning_rate': 0.0001629814907453727, 'epoch': 0.01}


 19%|█▊        | 1856/10000 [1:39:50<7:44:32,  3.42s/it]

{'loss': 0.7244, 'grad_norm': 0.49299317598342896, 'learning_rate': 0.0001629614807403702, 'epoch': 0.01}


 19%|█▊        | 1857/10000 [1:39:52<7:21:27,  3.25s/it]

{'loss': 1.0665, 'grad_norm': 0.4205520451068878, 'learning_rate': 0.0001629414707353677, 'epoch': 0.01}


 19%|█▊        | 1858/10000 [1:39:57<7:59:15,  3.53s/it]

{'loss': 0.9271, 'grad_norm': 0.30438295006752014, 'learning_rate': 0.0001629214607303652, 'epoch': 0.01}


 19%|█▊        | 1859/10000 [1:40:00<7:42:28,  3.41s/it]

{'loss': 0.9572, 'grad_norm': 0.3760424852371216, 'learning_rate': 0.0001629014507253627, 'epoch': 0.01}


 19%|█▊        | 1860/10000 [1:40:04<8:24:08,  3.72s/it]

{'loss': 1.2842, 'grad_norm': 0.37286576628685, 'learning_rate': 0.0001628814407203602, 'epoch': 0.01}


 19%|█▊        | 1861/10000 [1:40:09<8:51:52,  3.92s/it]

{'loss': 0.8867, 'grad_norm': 0.30334755778312683, 'learning_rate': 0.00016286143071535768, 'epoch': 0.01}


 19%|█▊        | 1862/10000 [1:40:12<8:11:15,  3.62s/it]

{'loss': 1.0027, 'grad_norm': 0.41231104731559753, 'learning_rate': 0.00016284142071035517, 'epoch': 0.01}


 19%|█▊        | 1863/10000 [1:40:15<8:14:48,  3.65s/it]

{'loss': 0.8827, 'grad_norm': 0.32355797290802, 'learning_rate': 0.00016282141070535268, 'epoch': 0.01}


 19%|█▊        | 1864/10000 [1:40:19<8:01:51,  3.55s/it]

{'loss': 0.9421, 'grad_norm': 0.3382166922092438, 'learning_rate': 0.0001628014007003502, 'epoch': 0.01}


 19%|█▊        | 1865/10000 [1:40:23<8:16:56,  3.67s/it]

{'loss': 0.9042, 'grad_norm': 0.33666837215423584, 'learning_rate': 0.00016278139069534768, 'epoch': 0.01}


 19%|█▊        | 1866/10000 [1:40:26<7:57:07,  3.52s/it]

{'loss': 0.7646, 'grad_norm': 0.37548765540122986, 'learning_rate': 0.0001627613806903452, 'epoch': 0.01}


 19%|█▊        | 1867/10000 [1:40:31<9:24:38,  4.17s/it]

{'loss': 1.1622, 'grad_norm': 0.31244197487831116, 'learning_rate': 0.00016274137068534268, 'epoch': 0.01}


 19%|█▊        | 1868/10000 [1:40:35<8:52:41,  3.93s/it]

{'loss': 1.1714, 'grad_norm': 0.3953405022621155, 'learning_rate': 0.0001627213606803402, 'epoch': 0.01}


 19%|█▊        | 1869/10000 [1:40:38<8:32:11,  3.78s/it]

{'loss': 0.9031, 'grad_norm': 0.3753165900707245, 'learning_rate': 0.00016270135067533766, 'epoch': 0.01}


 19%|█▊        | 1870/10000 [1:40:41<8:00:48,  3.55s/it]

{'loss': 0.9125, 'grad_norm': 0.35865432024002075, 'learning_rate': 0.00016268134067033517, 'epoch': 0.01}


 19%|█▊        | 1871/10000 [1:40:44<7:35:38,  3.36s/it]

{'loss': 0.7333, 'grad_norm': 0.3280600905418396, 'learning_rate': 0.00016266133066533266, 'epoch': 0.01}


 19%|█▊        | 1872/10000 [1:40:47<7:17:11,  3.23s/it]

{'loss': 0.9223, 'grad_norm': 0.3402714729309082, 'learning_rate': 0.00016264132066033017, 'epoch': 0.01}


 19%|█▊        | 1873/10000 [1:40:51<7:40:34,  3.40s/it]

{'loss': 1.0754, 'grad_norm': 0.3396332561969757, 'learning_rate': 0.00016262131065532766, 'epoch': 0.01}


 19%|█▊        | 1874/10000 [1:40:54<7:50:23,  3.47s/it]

{'loss': 0.9107, 'grad_norm': 0.37585559487342834, 'learning_rate': 0.00016260130065032518, 'epoch': 0.01}


 19%|█▉        | 1875/10000 [1:40:58<7:59:28,  3.54s/it]

{'loss': 0.7576, 'grad_norm': 0.32936233282089233, 'learning_rate': 0.00016258129064532266, 'epoch': 0.01}


 19%|█▉        | 1876/10000 [1:41:01<7:51:04,  3.48s/it]

{'loss': 0.7608, 'grad_norm': 0.4077509641647339, 'learning_rate': 0.00016256128064032018, 'epoch': 0.01}


 19%|█▉        | 1877/10000 [1:41:05<7:36:23,  3.37s/it]

{'loss': 0.9386, 'grad_norm': 0.3929100036621094, 'learning_rate': 0.00016254127063531766, 'epoch': 0.01}


 19%|█▉        | 1878/10000 [1:41:08<7:54:44,  3.51s/it]

{'loss': 0.9691, 'grad_norm': 0.38130685687065125, 'learning_rate': 0.00016252126063031515, 'epoch': 0.01}


 19%|█▉        | 1879/10000 [1:41:12<7:52:19,  3.49s/it]

{'loss': 0.8281, 'grad_norm': 0.33345842361450195, 'learning_rate': 0.00016250125062531267, 'epoch': 0.01}


 19%|█▉        | 1880/10000 [1:41:15<7:29:17,  3.32s/it]

{'loss': 1.0599, 'grad_norm': 0.42054328322410583, 'learning_rate': 0.00016248124062031015, 'epoch': 0.01}


 19%|█▉        | 1881/10000 [1:41:18<7:13:27,  3.20s/it]

{'loss': 1.1997, 'grad_norm': 0.4160386323928833, 'learning_rate': 0.00016246123061530767, 'epoch': 0.01}


 19%|█▉        | 1882/10000 [1:41:21<7:11:49,  3.19s/it]

{'loss': 0.7193, 'grad_norm': 0.3375154435634613, 'learning_rate': 0.00016244122061030516, 'epoch': 0.01}


 19%|█▉        | 1883/10000 [1:41:25<7:36:26,  3.37s/it]

{'loss': 0.9151, 'grad_norm': 0.37085625529289246, 'learning_rate': 0.00016242121060530267, 'epoch': 0.01}


 19%|█▉        | 1884/10000 [1:41:28<7:23:07,  3.28s/it]

{'loss': 1.0464, 'grad_norm': 0.432461678981781, 'learning_rate': 0.00016240120060030016, 'epoch': 0.01}


 19%|█▉        | 1885/10000 [1:41:31<7:25:39,  3.30s/it]

{'loss': 0.6609, 'grad_norm': 0.41297656297683716, 'learning_rate': 0.00016238119059529767, 'epoch': 0.01}


 19%|█▉        | 1886/10000 [1:41:35<7:40:16,  3.40s/it]

{'loss': 0.7636, 'grad_norm': 0.3649722635746002, 'learning_rate': 0.00016236118059029516, 'epoch': 0.01}


 19%|█▉        | 1887/10000 [1:41:38<7:43:26,  3.43s/it]

{'loss': 1.0803, 'grad_norm': 0.4041692614555359, 'learning_rate': 0.00016234117058529265, 'epoch': 0.01}


 19%|█▉        | 1888/10000 [1:41:41<7:24:33,  3.29s/it]

{'loss': 0.9787, 'grad_norm': 0.4536423087120056, 'learning_rate': 0.00016232116058029013, 'epoch': 0.01}


 19%|█▉        | 1889/10000 [1:41:44<7:15:53,  3.22s/it]

{'loss': 0.9452, 'grad_norm': 0.38517478108406067, 'learning_rate': 0.00016230115057528765, 'epoch': 0.01}


 19%|█▉        | 1890/10000 [1:41:48<7:54:17,  3.51s/it]

{'loss': 0.9361, 'grad_norm': 0.3699823319911957, 'learning_rate': 0.00016228114057028514, 'epoch': 0.01}


 19%|█▉        | 1891/10000 [1:41:52<7:47:48,  3.46s/it]

{'loss': 0.8812, 'grad_norm': 0.3226553201675415, 'learning_rate': 0.00016226113056528265, 'epoch': 0.01}


 19%|█▉        | 1892/10000 [1:41:55<7:34:30,  3.36s/it]

{'loss': 0.9682, 'grad_norm': 0.3745967745780945, 'learning_rate': 0.00016224112056028016, 'epoch': 0.01}


 19%|█▉        | 1893/10000 [1:41:58<7:07:00,  3.16s/it]

{'loss': 0.5399, 'grad_norm': 0.33562788367271423, 'learning_rate': 0.00016222111055527765, 'epoch': 0.01}


 19%|█▉        | 1894/10000 [1:42:02<8:03:04,  3.58s/it]

{'loss': 0.7273, 'grad_norm': 0.3107379972934723, 'learning_rate': 0.00016220110055027517, 'epoch': 0.01}


 19%|█▉        | 1895/10000 [1:42:05<7:39:38,  3.40s/it]

{'loss': 0.8316, 'grad_norm': 0.39661622047424316, 'learning_rate': 0.00016218109054527265, 'epoch': 0.01}


 19%|█▉        | 1896/10000 [1:42:09<7:56:39,  3.53s/it]

{'loss': 1.0554, 'grad_norm': 0.38406991958618164, 'learning_rate': 0.00016216108054027014, 'epoch': 0.01}


 19%|█▉        | 1897/10000 [1:42:13<8:18:54,  3.69s/it]

{'loss': 0.9333, 'grad_norm': 0.3175446689128876, 'learning_rate': 0.00016214107053526763, 'epoch': 0.01}


 19%|█▉        | 1898/10000 [1:42:18<8:52:50,  3.95s/it]

{'loss': 1.1286, 'grad_norm': 0.3869690001010895, 'learning_rate': 0.00016212106053026514, 'epoch': 0.01}


 19%|█▉        | 1899/10000 [1:42:20<7:41:41,  3.42s/it]

{'loss': 0.6957, 'grad_norm': 0.3940323293209076, 'learning_rate': 0.00016210105052526263, 'epoch': 0.01}


 19%|█▉        | 1900/10000 [1:42:24<8:22:26,  3.72s/it]

{'loss': 1.0398, 'grad_norm': 0.3575134575366974, 'learning_rate': 0.00016208104052026014, 'epoch': 0.01}


 19%|█▉        | 1901/10000 [1:42:29<9:17:45,  4.13s/it]

{'loss': 1.1373, 'grad_norm': 0.40942347049713135, 'learning_rate': 0.00016206103051525763, 'epoch': 0.01}


 19%|█▉        | 1902/10000 [1:42:33<8:57:42,  3.98s/it]

{'loss': 0.9844, 'grad_norm': 0.37331777811050415, 'learning_rate': 0.00016204102051025515, 'epoch': 0.01}


 19%|█▉        | 1903/10000 [1:42:36<8:19:02,  3.70s/it]

{'loss': 0.8484, 'grad_norm': 0.40784260630607605, 'learning_rate': 0.00016202101050525263, 'epoch': 0.01}


 19%|█▉        | 1904/10000 [1:42:39<7:55:42,  3.53s/it]

{'loss': 0.6578, 'grad_norm': 0.3451729714870453, 'learning_rate': 0.00016200100050025012, 'epoch': 0.01}


 19%|█▉        | 1905/10000 [1:42:43<7:57:36,  3.54s/it]

{'loss': 0.8605, 'grad_norm': 0.42085424065589905, 'learning_rate': 0.00016198099049524763, 'epoch': 0.01}


 19%|█▉        | 1906/10000 [1:42:46<7:33:31,  3.36s/it]

{'loss': 0.6876, 'grad_norm': 0.3133149743080139, 'learning_rate': 0.00016196098049024512, 'epoch': 0.01}


 19%|█▉        | 1907/10000 [1:42:49<7:42:24,  3.43s/it]

{'loss': 0.867, 'grad_norm': 0.39874404668807983, 'learning_rate': 0.00016194097048524264, 'epoch': 0.01}


 19%|█▉        | 1908/10000 [1:42:52<7:05:34,  3.16s/it]

{'loss': 0.895, 'grad_norm': 0.4678138494491577, 'learning_rate': 0.00016192096048024012, 'epoch': 0.01}


 19%|█▉        | 1909/10000 [1:42:55<7:18:50,  3.25s/it]

{'loss': 0.8917, 'grad_norm': 0.34367355704307556, 'learning_rate': 0.00016190095047523764, 'epoch': 0.01}


 19%|█▉        | 1910/10000 [1:42:58<7:10:05,  3.19s/it]

{'loss': 0.7033, 'grad_norm': 0.39264538884162903, 'learning_rate': 0.00016188094047023512, 'epoch': 0.01}


 19%|█▉        | 1911/10000 [1:43:02<7:25:50,  3.31s/it]

{'loss': 0.8991, 'grad_norm': 0.44239893555641174, 'learning_rate': 0.00016186093046523264, 'epoch': 0.01}


 19%|█▉        | 1912/10000 [1:43:06<7:55:18,  3.53s/it]

{'loss': 1.0049, 'grad_norm': 0.3834064304828644, 'learning_rate': 0.00016184092046023013, 'epoch': 0.01}


 19%|█▉        | 1913/10000 [1:43:08<7:17:59,  3.25s/it]

{'loss': 0.7541, 'grad_norm': 0.3997310400009155, 'learning_rate': 0.0001618209104552276, 'epoch': 0.01}


 19%|█▉        | 1914/10000 [1:43:12<7:09:33,  3.19s/it]

{'loss': 0.5938, 'grad_norm': 0.3525809347629547, 'learning_rate': 0.0001618009004502251, 'epoch': 0.01}


 19%|█▉        | 1915/10000 [1:43:15<7:18:06,  3.25s/it]

{'loss': 0.8201, 'grad_norm': 0.3383146822452545, 'learning_rate': 0.00016178089044522262, 'epoch': 0.01}


 19%|█▉        | 1916/10000 [1:43:19<7:45:32,  3.46s/it]

{'loss': 0.7285, 'grad_norm': 0.3238258957862854, 'learning_rate': 0.0001617608804402201, 'epoch': 0.01}


 19%|█▉        | 1917/10000 [1:43:22<7:16:26,  3.24s/it]

{'loss': 0.6615, 'grad_norm': 0.37487199902534485, 'learning_rate': 0.00016174087043521762, 'epoch': 0.01}


 19%|█▉        | 1918/10000 [1:43:24<6:59:16,  3.11s/it]

{'loss': 0.7836, 'grad_norm': 0.4119889736175537, 'learning_rate': 0.0001617208604302151, 'epoch': 0.01}


 19%|█▉        | 1919/10000 [1:43:28<7:21:57,  3.28s/it]

{'loss': 0.9014, 'grad_norm': 0.34895291924476624, 'learning_rate': 0.00016170085042521262, 'epoch': 0.01}


 19%|█▉        | 1920/10000 [1:43:31<7:08:57,  3.19s/it]

{'loss': 0.9955, 'grad_norm': 0.4030710458755493, 'learning_rate': 0.00016168084042021013, 'epoch': 0.01}


 19%|█▉        | 1921/10000 [1:43:35<7:22:31,  3.29s/it]

{'loss': 0.6648, 'grad_norm': 0.3464667797088623, 'learning_rate': 0.00016166083041520762, 'epoch': 0.01}


 19%|█▉        | 1922/10000 [1:43:38<7:25:28,  3.31s/it]

{'loss': 1.0943, 'grad_norm': 0.436122864484787, 'learning_rate': 0.0001616408204102051, 'epoch': 0.01}


 19%|█▉        | 1923/10000 [1:43:43<8:36:04,  3.83s/it]

{'loss': 1.2405, 'grad_norm': 0.33160239458084106, 'learning_rate': 0.0001616208104052026, 'epoch': 0.01}


 19%|█▉        | 1924/10000 [1:43:47<8:59:09,  4.01s/it]

{'loss': 1.1992, 'grad_norm': 0.3474876880645752, 'learning_rate': 0.0001616008004002001, 'epoch': 0.01}


 19%|█▉        | 1925/10000 [1:43:50<8:02:14,  3.58s/it]

{'loss': 0.7279, 'grad_norm': 0.4043610394001007, 'learning_rate': 0.0001615807903951976, 'epoch': 0.01}


 19%|█▉        | 1926/10000 [1:43:53<7:41:40,  3.43s/it]

{'loss': 0.8047, 'grad_norm': 0.3543318212032318, 'learning_rate': 0.0001615607803901951, 'epoch': 0.01}


 19%|█▉        | 1927/10000 [1:43:57<8:10:30,  3.65s/it]

{'loss': 0.7027, 'grad_norm': 0.2736828029155731, 'learning_rate': 0.0001615407703851926, 'epoch': 0.01}


 19%|█▉        | 1928/10000 [1:44:01<7:59:45,  3.57s/it]

{'loss': 1.0935, 'grad_norm': 0.41660407185554504, 'learning_rate': 0.0001615207603801901, 'epoch': 0.01}


 19%|█▉        | 1929/10000 [1:44:04<7:48:14,  3.48s/it]

{'loss': 0.7931, 'grad_norm': 0.35077226161956787, 'learning_rate': 0.0001615007503751876, 'epoch': 0.01}


 19%|█▉        | 1930/10000 [1:44:07<7:36:34,  3.39s/it]

{'loss': 1.1195, 'grad_norm': 0.388837069272995, 'learning_rate': 0.00016148074037018511, 'epoch': 0.01}


 19%|█▉        | 1931/10000 [1:44:11<7:52:10,  3.51s/it]

{'loss': 0.9568, 'grad_norm': 0.3603660762310028, 'learning_rate': 0.0001614607303651826, 'epoch': 0.01}


 19%|█▉        | 1932/10000 [1:44:14<7:56:15,  3.54s/it]

{'loss': 0.8053, 'grad_norm': 0.3619685173034668, 'learning_rate': 0.0001614407203601801, 'epoch': 0.01}


 19%|█▉        | 1933/10000 [1:44:19<8:51:44,  3.95s/it]

{'loss': 0.8465, 'grad_norm': 0.3031861484050751, 'learning_rate': 0.0001614207103551776, 'epoch': 0.01}


 19%|█▉        | 1934/10000 [1:44:22<7:55:48,  3.54s/it]

{'loss': 0.7827, 'grad_norm': 0.39604464173316956, 'learning_rate': 0.0001614007003501751, 'epoch': 0.01}


 19%|█▉        | 1935/10000 [1:44:25<7:38:59,  3.41s/it]

{'loss': 1.0599, 'grad_norm': 0.38914552330970764, 'learning_rate': 0.0001613806903451726, 'epoch': 0.01}


 19%|█▉        | 1936/10000 [1:44:30<8:41:01,  3.88s/it]

{'loss': 0.9315, 'grad_norm': 0.34300971031188965, 'learning_rate': 0.0001613606803401701, 'epoch': 0.01}


 19%|█▉        | 1937/10000 [1:44:34<8:25:07,  3.76s/it]

{'loss': 0.6101, 'grad_norm': 0.3211061358451843, 'learning_rate': 0.0001613406703351676, 'epoch': 0.01}


 19%|█▉        | 1938/10000 [1:44:37<8:18:11,  3.71s/it]

{'loss': 0.9267, 'grad_norm': 0.36080947518348694, 'learning_rate': 0.0001613206603301651, 'epoch': 0.01}


 19%|█▉        | 1939/10000 [1:44:40<7:56:28,  3.55s/it]

{'loss': 0.9275, 'grad_norm': 0.37476247549057007, 'learning_rate': 0.00016130065032516258, 'epoch': 0.01}


 19%|█▉        | 1940/10000 [1:44:44<7:44:51,  3.46s/it]

{'loss': 0.942, 'grad_norm': 0.359554648399353, 'learning_rate': 0.00016128064032016007, 'epoch': 0.01}


 19%|█▉        | 1941/10000 [1:44:46<7:23:52,  3.30s/it]

{'loss': 0.7525, 'grad_norm': 0.38916295766830444, 'learning_rate': 0.00016126063031515758, 'epoch': 0.01}


 19%|█▉        | 1942/10000 [1:44:50<7:39:30,  3.42s/it]

{'loss': 1.1166, 'grad_norm': 0.39467304944992065, 'learning_rate': 0.00016124062031015507, 'epoch': 0.01}


 19%|█▉        | 1943/10000 [1:44:54<7:49:12,  3.49s/it]

{'loss': 0.8135, 'grad_norm': 0.3469390869140625, 'learning_rate': 0.00016122061030515258, 'epoch': 0.01}


 19%|█▉        | 1944/10000 [1:44:57<7:46:29,  3.47s/it]

{'loss': 0.7698, 'grad_norm': 0.3635105788707733, 'learning_rate': 0.00016120060030015007, 'epoch': 0.01}


 19%|█▉        | 1945/10000 [1:45:00<7:22:54,  3.30s/it]

{'loss': 1.0703, 'grad_norm': 0.4062613844871521, 'learning_rate': 0.00016118059029514759, 'epoch': 0.01}


 19%|█▉        | 1946/10000 [1:45:04<7:25:45,  3.32s/it]

{'loss': 0.8299, 'grad_norm': 0.3824959099292755, 'learning_rate': 0.0001611605802901451, 'epoch': 0.01}


 19%|█▉        | 1947/10000 [1:45:08<8:25:07,  3.76s/it]

{'loss': 0.9811, 'grad_norm': 0.3446565568447113, 'learning_rate': 0.0001611405702851426, 'epoch': 0.01}


 19%|█▉        | 1948/10000 [1:45:12<8:18:10,  3.71s/it]

{'loss': 0.9415, 'grad_norm': 0.36574283242225647, 'learning_rate': 0.00016112056028014007, 'epoch': 0.01}


 19%|█▉        | 1949/10000 [1:45:15<7:48:39,  3.49s/it]

{'loss': 0.6512, 'grad_norm': 0.3732019066810608, 'learning_rate': 0.00016110055027513756, 'epoch': 0.01}


 20%|█▉        | 1950/10000 [1:45:19<8:04:59,  3.61s/it]

{'loss': 0.756, 'grad_norm': 0.3403140902519226, 'learning_rate': 0.00016108054027013508, 'epoch': 0.01}


 20%|█▉        | 1951/10000 [1:45:23<8:11:02,  3.66s/it]

{'loss': 1.0667, 'grad_norm': 0.42596369981765747, 'learning_rate': 0.00016106053026513256, 'epoch': 0.01}


 20%|█▉        | 1952/10000 [1:45:26<8:12:19,  3.67s/it]

{'loss': 0.8275, 'grad_norm': 0.42297059297561646, 'learning_rate': 0.00016104052026013008, 'epoch': 0.01}


 20%|█▉        | 1953/10000 [1:45:29<7:29:34,  3.35s/it]

{'loss': 0.7806, 'grad_norm': 0.44503408670425415, 'learning_rate': 0.00016102051025512757, 'epoch': 0.01}


 20%|█▉        | 1954/10000 [1:45:33<7:50:57,  3.51s/it]

{'loss': 0.8992, 'grad_norm': 0.34630221128463745, 'learning_rate': 0.00016100050025012508, 'epoch': 0.01}


 20%|█▉        | 1955/10000 [1:45:36<7:40:55,  3.44s/it]

{'loss': 0.7761, 'grad_norm': 0.3963000178337097, 'learning_rate': 0.00016098049024512257, 'epoch': 0.01}


 20%|█▉        | 1956/10000 [1:45:40<8:07:32,  3.64s/it]

{'loss': 1.0799, 'grad_norm': 0.3993934094905853, 'learning_rate': 0.00016096048024012008, 'epoch': 0.01}


 20%|█▉        | 1957/10000 [1:45:43<7:56:44,  3.56s/it]

{'loss': 0.7538, 'grad_norm': 0.33736902475357056, 'learning_rate': 0.00016094047023511757, 'epoch': 0.01}


 20%|█▉        | 1958/10000 [1:45:48<8:25:36,  3.77s/it]

{'loss': 0.9758, 'grad_norm': 0.334753155708313, 'learning_rate': 0.00016092046023011506, 'epoch': 0.01}


 20%|█▉        | 1959/10000 [1:45:52<8:43:05,  3.90s/it]

{'loss': 0.7696, 'grad_norm': 0.32932940125465393, 'learning_rate': 0.00016090045022511254, 'epoch': 0.01}


 20%|█▉        | 1960/10000 [1:45:56<8:49:10,  3.95s/it]

{'loss': 0.8907, 'grad_norm': 0.3191608786582947, 'learning_rate': 0.00016088044022011006, 'epoch': 0.01}


 20%|█▉        | 1961/10000 [1:46:01<9:23:50,  4.21s/it]

{'loss': 0.9151, 'grad_norm': 0.3175598680973053, 'learning_rate': 0.00016086043021510757, 'epoch': 0.01}


 20%|█▉        | 1962/10000 [1:46:04<8:23:44,  3.76s/it]

{'loss': 0.9309, 'grad_norm': 0.46806252002716064, 'learning_rate': 0.00016084042021010506, 'epoch': 0.01}


 20%|█▉        | 1963/10000 [1:46:07<8:13:42,  3.69s/it]

{'loss': 1.1995, 'grad_norm': 0.4064013361930847, 'learning_rate': 0.00016082041020510257, 'epoch': 0.01}


 20%|█▉        | 1964/10000 [1:46:11<8:15:49,  3.70s/it]

{'loss': 1.0711, 'grad_norm': 0.33202993869781494, 'learning_rate': 0.00016080040020010006, 'epoch': 0.01}


 20%|█▉        | 1965/10000 [1:46:14<7:40:20,  3.44s/it]

{'loss': 0.776, 'grad_norm': 0.37420234084129333, 'learning_rate': 0.00016078039019509758, 'epoch': 0.01}


 20%|█▉        | 1966/10000 [1:46:16<7:15:26,  3.25s/it]

{'loss': 0.9409, 'grad_norm': 0.43756282329559326, 'learning_rate': 0.00016076038019009506, 'epoch': 0.01}


 20%|█▉        | 1967/10000 [1:46:20<7:16:32,  3.26s/it]

{'loss': 0.8705, 'grad_norm': 0.3420713245868683, 'learning_rate': 0.00016074037018509255, 'epoch': 0.01}


 20%|█▉        | 1968/10000 [1:46:23<7:26:11,  3.33s/it]

{'loss': 0.9553, 'grad_norm': 0.3622753322124481, 'learning_rate': 0.00016072036018009004, 'epoch': 0.01}


 20%|█▉        | 1969/10000 [1:46:27<7:36:48,  3.41s/it]

{'loss': 0.8471, 'grad_norm': 0.3447768986225128, 'learning_rate': 0.00016070035017508755, 'epoch': 0.01}


 20%|█▉        | 1970/10000 [1:46:30<7:30:46,  3.37s/it]

{'loss': 1.046, 'grad_norm': 0.4389546811580658, 'learning_rate': 0.00016068034017008504, 'epoch': 0.01}


 20%|█▉        | 1971/10000 [1:46:33<7:27:05,  3.34s/it]

{'loss': 1.1265, 'grad_norm': 0.42406153678894043, 'learning_rate': 0.00016066033016508255, 'epoch': 0.01}


 20%|█▉        | 1972/10000 [1:46:37<7:39:07,  3.43s/it]

{'loss': 0.9622, 'grad_norm': 0.38449299335479736, 'learning_rate': 0.00016064032016008004, 'epoch': 0.01}


 20%|█▉        | 1973/10000 [1:46:40<7:19:19,  3.28s/it]

{'loss': 0.972, 'grad_norm': 0.3958820104598999, 'learning_rate': 0.00016062031015507756, 'epoch': 0.01}


 20%|█▉        | 1974/10000 [1:46:42<6:49:14,  3.06s/it]

{'loss': 0.7667, 'grad_norm': 0.4925359785556793, 'learning_rate': 0.00016060030015007507, 'epoch': 0.01}


 20%|█▉        | 1975/10000 [1:46:46<7:24:49,  3.33s/it]

{'loss': 0.9992, 'grad_norm': 0.35936659574508667, 'learning_rate': 0.00016058029014507253, 'epoch': 0.01}


 20%|█▉        | 1976/10000 [1:46:51<8:13:20,  3.69s/it]

{'loss': 1.0968, 'grad_norm': 0.35511136054992676, 'learning_rate': 0.00016056028014007004, 'epoch': 0.01}


 20%|█▉        | 1977/10000 [1:46:54<7:57:59,  3.57s/it]

{'loss': 0.926, 'grad_norm': 0.35283833742141724, 'learning_rate': 0.00016054027013506753, 'epoch': 0.01}


 20%|█▉        | 1978/10000 [1:46:58<8:10:22,  3.67s/it]

{'loss': 0.9544, 'grad_norm': 0.37812578678131104, 'learning_rate': 0.00016052026013006505, 'epoch': 0.01}


 20%|█▉        | 1979/10000 [1:47:01<7:46:23,  3.49s/it]

{'loss': 0.7434, 'grad_norm': 0.3606273829936981, 'learning_rate': 0.00016050025012506253, 'epoch': 0.01}


 20%|█▉        | 1980/10000 [1:47:06<8:34:59,  3.85s/it]

{'loss': 1.0, 'grad_norm': 0.32156968116760254, 'learning_rate': 0.00016048024012006005, 'epoch': 0.01}


 20%|█▉        | 1981/10000 [1:47:10<8:44:18,  3.92s/it]

{'loss': 1.1967, 'grad_norm': 0.3519893288612366, 'learning_rate': 0.00016046023011505753, 'epoch': 0.01}


 20%|█▉        | 1982/10000 [1:47:14<8:38:14,  3.88s/it]

{'loss': 0.9334, 'grad_norm': 0.4202643036842346, 'learning_rate': 0.00016044022011005505, 'epoch': 0.01}


 20%|█▉        | 1983/10000 [1:47:18<9:04:52,  4.08s/it]

{'loss': 0.8703, 'grad_norm': 0.3397292196750641, 'learning_rate': 0.00016042021010505254, 'epoch': 0.01}


 20%|█▉        | 1984/10000 [1:47:21<8:21:39,  3.75s/it]

{'loss': 0.8783, 'grad_norm': 0.5557051301002502, 'learning_rate': 0.00016040020010005002, 'epoch': 0.01}


 20%|█▉        | 1985/10000 [1:47:25<8:02:58,  3.62s/it]

{'loss': 0.8394, 'grad_norm': 0.3484277129173279, 'learning_rate': 0.0001603801900950475, 'epoch': 0.01}


 20%|█▉        | 1986/10000 [1:47:28<8:11:02,  3.68s/it]

{'loss': 0.9909, 'grad_norm': 0.3495325446128845, 'learning_rate': 0.00016036018009004503, 'epoch': 0.01}


 20%|█▉        | 1987/10000 [1:47:32<8:00:01,  3.59s/it]

{'loss': 0.8856, 'grad_norm': 0.36063840985298157, 'learning_rate': 0.0001603401700850425, 'epoch': 0.01}


 20%|█▉        | 1988/10000 [1:47:35<7:50:13,  3.52s/it]

{'loss': 0.9058, 'grad_norm': 0.42863568663597107, 'learning_rate': 0.00016032016008004003, 'epoch': 0.01}


 20%|█▉        | 1989/10000 [1:47:38<7:22:13,  3.31s/it]

{'loss': 0.9391, 'grad_norm': 0.4276565611362457, 'learning_rate': 0.00016030015007503754, 'epoch': 0.01}


 20%|█▉        | 1990/10000 [1:47:41<7:16:16,  3.27s/it]

{'loss': 0.7598, 'grad_norm': 0.3671536445617676, 'learning_rate': 0.00016028014007003503, 'epoch': 0.01}


 20%|█▉        | 1991/10000 [1:47:44<7:17:28,  3.28s/it]

{'loss': 0.8537, 'grad_norm': 0.3370424807071686, 'learning_rate': 0.00016026013006503254, 'epoch': 0.01}


 20%|█▉        | 1992/10000 [1:47:47<7:00:10,  3.15s/it]

{'loss': 0.7866, 'grad_norm': 0.39452803134918213, 'learning_rate': 0.00016024012006003003, 'epoch': 0.01}


 20%|█▉        | 1993/10000 [1:47:50<6:48:45,  3.06s/it]

{'loss': 0.8329, 'grad_norm': 0.4140584468841553, 'learning_rate': 0.00016022011005502752, 'epoch': 0.01}


 20%|█▉        | 1994/10000 [1:47:53<6:57:16,  3.13s/it]

{'loss': 1.0277, 'grad_norm': 0.40907248854637146, 'learning_rate': 0.000160200100050025, 'epoch': 0.01}


 20%|█▉        | 1995/10000 [1:47:59<8:17:35,  3.73s/it]

{'loss': 1.2128, 'grad_norm': 0.3468044400215149, 'learning_rate': 0.00016018009004502252, 'epoch': 0.01}


 20%|█▉        | 1996/10000 [1:48:02<8:12:57,  3.70s/it]

{'loss': 0.7403, 'grad_norm': 0.3498247563838959, 'learning_rate': 0.00016016008004002, 'epoch': 0.01}


 20%|█▉        | 1997/10000 [1:48:06<7:59:32,  3.60s/it]

{'loss': 0.884, 'grad_norm': 0.384892076253891, 'learning_rate': 0.00016014007003501752, 'epoch': 0.01}


 20%|█▉        | 1998/10000 [1:48:11<8:54:08,  4.01s/it]

{'loss': 0.8768, 'grad_norm': 0.30364179611206055, 'learning_rate': 0.000160120060030015, 'epoch': 0.01}


 20%|█▉        | 1999/10000 [1:48:14<8:22:54,  3.77s/it]

{'loss': 1.0193, 'grad_norm': 0.39459913969039917, 'learning_rate': 0.00016010005002501252, 'epoch': 0.01}


 20%|██        | 2000/10000 [1:48:17<7:52:19,  3.54s/it]

{'loss': 0.9123, 'grad_norm': 0.4138028919696808, 'learning_rate': 0.00016008004002001, 'epoch': 0.01}


 20%|██        | 2001/10000 [1:48:21<8:23:09,  3.77s/it]

{'loss': 0.9102, 'grad_norm': 0.362429678440094, 'learning_rate': 0.00016006003001500752, 'epoch': 0.01}


 20%|██        | 2002/10000 [1:48:24<7:52:00,  3.54s/it]

{'loss': 0.8794, 'grad_norm': 0.4113570749759674, 'learning_rate': 0.000160040020010005, 'epoch': 0.01}


 20%|██        | 2003/10000 [1:48:27<7:28:56,  3.37s/it]

{'loss': 0.7318, 'grad_norm': 0.3648608326911926, 'learning_rate': 0.0001600200100050025, 'epoch': 0.01}


 20%|██        | 2004/10000 [1:48:30<7:06:05,  3.20s/it]

{'loss': 1.1031, 'grad_norm': 0.44835567474365234, 'learning_rate': 0.00016, 'epoch': 0.01}


 20%|██        | 2005/10000 [1:48:33<6:50:38,  3.08s/it]

{'loss': 0.8995, 'grad_norm': 0.42151427268981934, 'learning_rate': 0.0001599799899949975, 'epoch': 0.01}


 20%|██        | 2006/10000 [1:48:36<7:17:17,  3.28s/it]

{'loss': 0.8778, 'grad_norm': 0.3598839342594147, 'learning_rate': 0.00015995997998999501, 'epoch': 0.01}


 20%|██        | 2007/10000 [1:48:39<7:08:37,  3.22s/it]

{'loss': 0.8006, 'grad_norm': 0.364458292722702, 'learning_rate': 0.0001599399699849925, 'epoch': 0.01}


 20%|██        | 2008/10000 [1:48:43<7:39:26,  3.45s/it]

{'loss': 0.9721, 'grad_norm': 0.35809892416000366, 'learning_rate': 0.00015991995997999002, 'epoch': 0.01}


 20%|██        | 2009/10000 [1:48:49<9:18:32,  4.19s/it]

{'loss': 1.1006, 'grad_norm': 0.32939523458480835, 'learning_rate': 0.0001598999499749875, 'epoch': 0.01}


 20%|██        | 2010/10000 [1:48:53<9:03:40,  4.08s/it]

{'loss': 0.7318, 'grad_norm': 0.32699382305145264, 'learning_rate': 0.000159879939969985, 'epoch': 0.01}


 20%|██        | 2011/10000 [1:48:57<8:49:19,  3.98s/it]

{'loss': 0.7969, 'grad_norm': 0.35726839303970337, 'learning_rate': 0.00015985992996498248, 'epoch': 0.01}


 20%|██        | 2012/10000 [1:49:01<8:53:56,  4.01s/it]

{'loss': 0.923, 'grad_norm': 0.3352111876010895, 'learning_rate': 0.00015983991995998, 'epoch': 0.01}


 20%|██        | 2013/10000 [1:49:03<7:50:37,  3.54s/it]

{'loss': 0.7128, 'grad_norm': 0.4279913008213043, 'learning_rate': 0.00015981990995497748, 'epoch': 0.01}


 20%|██        | 2014/10000 [1:49:06<7:30:20,  3.38s/it]

{'loss': 0.9458, 'grad_norm': 0.3914559483528137, 'learning_rate': 0.000159799899949975, 'epoch': 0.01}


 20%|██        | 2015/10000 [1:49:10<7:15:39,  3.27s/it]

{'loss': 0.699, 'grad_norm': 0.4170070290565491, 'learning_rate': 0.00015977988994497248, 'epoch': 0.01}


 20%|██        | 2016/10000 [1:49:12<6:59:21,  3.15s/it]

{'loss': 0.7162, 'grad_norm': 0.39144212007522583, 'learning_rate': 0.00015975987993997, 'epoch': 0.01}


 20%|██        | 2017/10000 [1:49:15<6:55:55,  3.13s/it]

{'loss': 0.957, 'grad_norm': 0.3587154150009155, 'learning_rate': 0.0001597398699349675, 'epoch': 0.01}


 20%|██        | 2018/10000 [1:49:19<7:15:39,  3.27s/it]

{'loss': 0.7385, 'grad_norm': 0.33890438079833984, 'learning_rate': 0.000159719859929965, 'epoch': 0.01}


 20%|██        | 2019/10000 [1:49:22<7:12:44,  3.25s/it]

{'loss': 0.8077, 'grad_norm': 0.35279256105422974, 'learning_rate': 0.00015969984992496248, 'epoch': 0.01}


 20%|██        | 2020/10000 [1:49:27<7:55:52,  3.58s/it]

{'loss': 1.028, 'grad_norm': 0.37882909178733826, 'learning_rate': 0.00015967983991995997, 'epoch': 0.01}


 20%|██        | 2021/10000 [1:49:30<7:39:57,  3.46s/it]

{'loss': 0.8655, 'grad_norm': 0.4078224301338196, 'learning_rate': 0.0001596598299149575, 'epoch': 0.01}


 20%|██        | 2022/10000 [1:49:33<7:41:11,  3.47s/it]

{'loss': 0.8348, 'grad_norm': 0.4468827247619629, 'learning_rate': 0.00015963981990995497, 'epoch': 0.01}


 20%|██        | 2023/10000 [1:49:36<7:20:19,  3.31s/it]

{'loss': 0.9724, 'grad_norm': 0.3957747220993042, 'learning_rate': 0.0001596198099049525, 'epoch': 0.01}


 20%|██        | 2024/10000 [1:49:39<7:06:52,  3.21s/it]

{'loss': 0.9367, 'grad_norm': 0.4162546694278717, 'learning_rate': 0.00015959979989994998, 'epoch': 0.01}


 20%|██        | 2025/10000 [1:49:43<7:32:09,  3.40s/it]

{'loss': 0.8683, 'grad_norm': 0.35490748286247253, 'learning_rate': 0.0001595797898949475, 'epoch': 0.01}


 20%|██        | 2026/10000 [1:49:46<7:12:15,  3.25s/it]

{'loss': 0.994, 'grad_norm': 0.40373849868774414, 'learning_rate': 0.00015955977988994498, 'epoch': 0.01}


 20%|██        | 2027/10000 [1:49:51<8:32:41,  3.86s/it]

{'loss': 1.0543, 'grad_norm': 0.3679581880569458, 'learning_rate': 0.0001595397698849425, 'epoch': 0.01}


 20%|██        | 2028/10000 [1:49:54<7:45:43,  3.51s/it]

{'loss': 0.9921, 'grad_norm': 0.42403653264045715, 'learning_rate': 0.00015951975987993998, 'epoch': 0.01}


 20%|██        | 2029/10000 [1:49:58<7:58:06,  3.60s/it]

{'loss': 0.8505, 'grad_norm': 0.3370181918144226, 'learning_rate': 0.00015949974987493747, 'epoch': 0.01}


 20%|██        | 2030/10000 [1:50:01<7:59:02,  3.61s/it]

{'loss': 0.9815, 'grad_norm': 0.33815592527389526, 'learning_rate': 0.00015947973986993498, 'epoch': 0.01}


 20%|██        | 2031/10000 [1:50:05<8:03:58,  3.64s/it]

{'loss': 0.8845, 'grad_norm': 0.3489671051502228, 'learning_rate': 0.00015945972986493247, 'epoch': 0.01}


 20%|██        | 2032/10000 [1:50:08<7:43:20,  3.49s/it]

{'loss': 0.7148, 'grad_norm': 0.37830227613449097, 'learning_rate': 0.00015943971985992998, 'epoch': 0.01}


 20%|██        | 2033/10000 [1:50:11<7:35:13,  3.43s/it]

{'loss': 0.6792, 'grad_norm': 0.36343732476234436, 'learning_rate': 0.00015941970985492747, 'epoch': 0.01}


 20%|██        | 2034/10000 [1:50:15<7:30:50,  3.40s/it]

{'loss': 0.8201, 'grad_norm': 0.3813229501247406, 'learning_rate': 0.00015939969984992498, 'epoch': 0.01}


 20%|██        | 2035/10000 [1:50:19<7:54:19,  3.57s/it]

{'loss': 0.9065, 'grad_norm': 0.34043142199516296, 'learning_rate': 0.00015937968984492247, 'epoch': 0.01}


 20%|██        | 2036/10000 [1:50:24<9:17:38,  4.20s/it]

{'loss': 0.956, 'grad_norm': 0.29288360476493835, 'learning_rate': 0.00015935967983991999, 'epoch': 0.01}


 20%|██        | 2037/10000 [1:50:28<8:45:30,  3.96s/it]

{'loss': 0.8203, 'grad_norm': 0.36696135997772217, 'learning_rate': 0.00015933966983491747, 'epoch': 0.01}


 20%|██        | 2038/10000 [1:50:33<9:15:18,  4.18s/it]

{'loss': 0.8846, 'grad_norm': 0.33689531683921814, 'learning_rate': 0.00015931965982991496, 'epoch': 0.01}


 20%|██        | 2039/10000 [1:50:36<9:00:00,  4.07s/it]

{'loss': 0.8907, 'grad_norm': 0.37336674332618713, 'learning_rate': 0.00015929964982491245, 'epoch': 0.01}


 20%|██        | 2040/10000 [1:50:41<9:12:54,  4.17s/it]

{'loss': 0.7762, 'grad_norm': 0.349003404378891, 'learning_rate': 0.00015927963981990996, 'epoch': 0.01}


 20%|██        | 2041/10000 [1:50:44<8:30:38,  3.85s/it]

{'loss': 1.0167, 'grad_norm': 0.4182407855987549, 'learning_rate': 0.00015925962981490745, 'epoch': 0.01}


 20%|██        | 2042/10000 [1:50:47<8:12:15,  3.71s/it]

{'loss': 0.9506, 'grad_norm': 0.396017462015152, 'learning_rate': 0.00015923961980990496, 'epoch': 0.01}


 20%|██        | 2043/10000 [1:50:51<8:13:09,  3.72s/it]

{'loss': 0.9843, 'grad_norm': 0.38193434476852417, 'learning_rate': 0.00015921960980490248, 'epoch': 0.01}


 20%|██        | 2044/10000 [1:50:55<8:30:48,  3.85s/it]

{'loss': 1.0617, 'grad_norm': 0.3600168824195862, 'learning_rate': 0.00015919959979989997, 'epoch': 0.01}


 20%|██        | 2045/10000 [1:50:59<8:22:25,  3.79s/it]

{'loss': 0.779, 'grad_norm': 0.3308377265930176, 'learning_rate': 0.00015917958979489745, 'epoch': 0.01}


 20%|██        | 2046/10000 [1:51:03<8:34:58,  3.88s/it]

{'loss': 0.6591, 'grad_norm': 0.38570135831832886, 'learning_rate': 0.00015915957978989494, 'epoch': 0.01}


 20%|██        | 2047/10000 [1:51:06<7:57:46,  3.60s/it]

{'loss': 0.8572, 'grad_norm': 0.40508803725242615, 'learning_rate': 0.00015913956978489245, 'epoch': 0.01}


 20%|██        | 2048/10000 [1:51:09<7:23:16,  3.34s/it]

{'loss': 0.9372, 'grad_norm': 0.42341119050979614, 'learning_rate': 0.00015911955977988994, 'epoch': 0.01}


 20%|██        | 2049/10000 [1:51:11<6:50:11,  3.10s/it]

{'loss': 0.633, 'grad_norm': 0.43558526039123535, 'learning_rate': 0.00015909954977488746, 'epoch': 0.01}


 20%|██        | 2050/10000 [1:51:14<7:00:39,  3.17s/it]

{'loss': 0.9035, 'grad_norm': 0.4036323130130768, 'learning_rate': 0.00015907953976988494, 'epoch': 0.01}


 21%|██        | 2051/10000 [1:51:18<7:20:13,  3.32s/it]

{'loss': 0.7131, 'grad_norm': 0.3245820105075836, 'learning_rate': 0.00015905952976488246, 'epoch': 0.01}


 21%|██        | 2052/10000 [1:51:21<6:53:22,  3.12s/it]

{'loss': 0.9203, 'grad_norm': 0.4253592789173126, 'learning_rate': 0.00015903951975987994, 'epoch': 0.01}


 21%|██        | 2053/10000 [1:51:24<6:53:31,  3.12s/it]

{'loss': 0.5808, 'grad_norm': 0.3822084963321686, 'learning_rate': 0.00015901950975487746, 'epoch': 0.01}


 21%|██        | 2054/10000 [1:51:27<6:52:07,  3.11s/it]

{'loss': 1.4034, 'grad_norm': 0.49720796942710876, 'learning_rate': 0.00015899949974987495, 'epoch': 0.01}


 21%|██        | 2055/10000 [1:51:31<7:24:22,  3.36s/it]

{'loss': 0.8542, 'grad_norm': 0.3495112359523773, 'learning_rate': 0.00015897948974487243, 'epoch': 0.01}


 21%|██        | 2056/10000 [1:51:33<6:52:08,  3.11s/it]

{'loss': 0.7166, 'grad_norm': 0.4037990868091583, 'learning_rate': 0.00015895947973986992, 'epoch': 0.01}


 21%|██        | 2057/10000 [1:51:37<7:05:27,  3.21s/it]

{'loss': 0.9746, 'grad_norm': 0.403696209192276, 'learning_rate': 0.00015893946973486744, 'epoch': 0.01}


 21%|██        | 2058/10000 [1:51:40<6:53:14,  3.12s/it]

{'loss': 0.9563, 'grad_norm': 0.3907957077026367, 'learning_rate': 0.00015891945972986495, 'epoch': 0.01}


 21%|██        | 2059/10000 [1:51:43<6:50:32,  3.10s/it]

{'loss': 0.9303, 'grad_norm': 0.4147312045097351, 'learning_rate': 0.00015889944972486244, 'epoch': 0.01}


 21%|██        | 2060/10000 [1:51:48<8:07:39,  3.69s/it]

{'loss': 0.9735, 'grad_norm': 0.3148840069770813, 'learning_rate': 0.00015887943971985995, 'epoch': 0.01}


 21%|██        | 2061/10000 [1:51:51<7:52:08,  3.57s/it]

{'loss': 1.0113, 'grad_norm': 0.4215625524520874, 'learning_rate': 0.00015885942971485744, 'epoch': 0.01}


 21%|██        | 2062/10000 [1:51:54<7:24:18,  3.36s/it]

{'loss': 1.0743, 'grad_norm': 0.43376800417900085, 'learning_rate': 0.00015883941970985495, 'epoch': 0.01}


 21%|██        | 2063/10000 [1:51:58<7:26:03,  3.37s/it]

{'loss': 0.7388, 'grad_norm': 0.3428128957748413, 'learning_rate': 0.00015881940970485244, 'epoch': 0.01}


 21%|██        | 2064/10000 [1:52:01<7:27:14,  3.38s/it]

{'loss': 1.0932, 'grad_norm': 0.38797488808631897, 'learning_rate': 0.00015879939969984993, 'epoch': 0.01}


 21%|██        | 2065/10000 [1:52:04<7:23:30,  3.35s/it]

{'loss': 0.7693, 'grad_norm': 0.3321132957935333, 'learning_rate': 0.00015877938969484741, 'epoch': 0.01}


 21%|██        | 2066/10000 [1:52:08<7:43:16,  3.50s/it]

{'loss': 0.7882, 'grad_norm': 0.35371407866477966, 'learning_rate': 0.00015875937968984493, 'epoch': 0.01}


 21%|██        | 2067/10000 [1:52:11<7:10:23,  3.26s/it]

{'loss': 0.8705, 'grad_norm': 0.4133656919002533, 'learning_rate': 0.00015873936968484242, 'epoch': 0.01}


 21%|██        | 2068/10000 [1:52:14<6:58:19,  3.16s/it]

{'loss': 0.7759, 'grad_norm': 0.4919561743736267, 'learning_rate': 0.00015871935967983993, 'epoch': 0.01}


 21%|██        | 2069/10000 [1:52:17<6:54:46,  3.14s/it]

{'loss': 0.7018, 'grad_norm': 0.3936193585395813, 'learning_rate': 0.00015869934967483742, 'epoch': 0.01}


 21%|██        | 2070/10000 [1:52:20<7:14:11,  3.29s/it]

{'loss': 0.7742, 'grad_norm': 0.38881340622901917, 'learning_rate': 0.00015867933966983493, 'epoch': 0.01}


 21%|██        | 2071/10000 [1:52:24<7:15:24,  3.29s/it]

{'loss': 0.7965, 'grad_norm': 0.3605753779411316, 'learning_rate': 0.00015865932966483245, 'epoch': 0.01}


 21%|██        | 2072/10000 [1:52:29<8:27:34,  3.84s/it]

{'loss': 1.0158, 'grad_norm': 0.3315802812576294, 'learning_rate': 0.00015863931965982993, 'epoch': 0.01}


 21%|██        | 2073/10000 [1:52:33<8:33:29,  3.89s/it]

{'loss': 0.8859, 'grad_norm': 0.4071836769580841, 'learning_rate': 0.00015861930965482742, 'epoch': 0.01}


 21%|██        | 2074/10000 [1:52:36<7:47:51,  3.54s/it]

{'loss': 0.656, 'grad_norm': 0.3969249427318573, 'learning_rate': 0.0001585992996498249, 'epoch': 0.01}


 21%|██        | 2075/10000 [1:52:40<8:34:17,  3.89s/it]

{'loss': 0.9876, 'grad_norm': 0.3134061396121979, 'learning_rate': 0.00015857928964482242, 'epoch': 0.01}


 21%|██        | 2076/10000 [1:52:43<7:47:20,  3.54s/it]

{'loss': 0.8628, 'grad_norm': 0.44070032238960266, 'learning_rate': 0.0001585592796398199, 'epoch': 0.01}


 21%|██        | 2077/10000 [1:52:48<8:29:25,  3.86s/it]

{'loss': 1.0883, 'grad_norm': 0.334150493144989, 'learning_rate': 0.00015853926963481742, 'epoch': 0.01}


 21%|██        | 2078/10000 [1:52:51<7:53:30,  3.59s/it]

{'loss': 0.7937, 'grad_norm': 0.3743685185909271, 'learning_rate': 0.0001585192596298149, 'epoch': 0.01}


 21%|██        | 2079/10000 [1:52:53<7:25:17,  3.37s/it]

{'loss': 0.802, 'grad_norm': 0.4129033088684082, 'learning_rate': 0.00015849924962481243, 'epoch': 0.01}


 21%|██        | 2080/10000 [1:52:57<7:20:28,  3.34s/it]

{'loss': 0.6999, 'grad_norm': 0.3680350184440613, 'learning_rate': 0.00015847923961980991, 'epoch': 0.01}


 21%|██        | 2081/10000 [1:53:00<7:20:55,  3.34s/it]

{'loss': 1.0471, 'grad_norm': 0.3550835847854614, 'learning_rate': 0.0001584592296148074, 'epoch': 0.01}


 21%|██        | 2082/10000 [1:53:03<7:08:17,  3.25s/it]

{'loss': 0.9812, 'grad_norm': 0.38041216135025024, 'learning_rate': 0.0001584392196098049, 'epoch': 0.01}


 21%|██        | 2083/10000 [1:53:07<7:23:13,  3.36s/it]

{'loss': 1.0198, 'grad_norm': 0.3557642996311188, 'learning_rate': 0.0001584192096048024, 'epoch': 0.01}


 21%|██        | 2084/10000 [1:53:10<7:25:30,  3.38s/it]

{'loss': 0.9999, 'grad_norm': 0.41267767548561096, 'learning_rate': 0.0001583991995997999, 'epoch': 0.01}


 21%|██        | 2085/10000 [1:53:14<7:33:20,  3.44s/it]

{'loss': 0.7009, 'grad_norm': 0.3548409640789032, 'learning_rate': 0.0001583791895947974, 'epoch': 0.01}


 21%|██        | 2086/10000 [1:53:17<7:30:46,  3.42s/it]

{'loss': 0.9787, 'grad_norm': 0.42010045051574707, 'learning_rate': 0.00015835917958979492, 'epoch': 0.01}


 21%|██        | 2087/10000 [1:53:20<7:27:54,  3.40s/it]

{'loss': 0.8554, 'grad_norm': 0.38360723853111267, 'learning_rate': 0.0001583391695847924, 'epoch': 0.01}


 21%|██        | 2088/10000 [1:53:23<7:00:23,  3.19s/it]

{'loss': 0.8224, 'grad_norm': 0.36871007084846497, 'learning_rate': 0.00015831915957978992, 'epoch': 0.01}


 21%|██        | 2089/10000 [1:53:26<6:56:46,  3.16s/it]

{'loss': 0.7195, 'grad_norm': 0.3353426158428192, 'learning_rate': 0.0001582991495747874, 'epoch': 0.01}


 21%|██        | 2090/10000 [1:53:30<7:06:13,  3.23s/it]

{'loss': 1.2785, 'grad_norm': 0.350055068731308, 'learning_rate': 0.0001582791395697849, 'epoch': 0.01}


 21%|██        | 2091/10000 [1:53:32<6:44:38,  3.07s/it]

{'loss': 0.7766, 'grad_norm': 0.3980507254600525, 'learning_rate': 0.00015825912956478238, 'epoch': 0.01}


 21%|██        | 2092/10000 [1:53:36<7:23:03,  3.36s/it]

{'loss': 0.8433, 'grad_norm': 0.3579925298690796, 'learning_rate': 0.0001582391195597799, 'epoch': 0.01}


 21%|██        | 2093/10000 [1:53:39<6:50:16,  3.11s/it]

{'loss': 1.17, 'grad_norm': 0.44481080770492554, 'learning_rate': 0.00015821910955477738, 'epoch': 0.01}


 21%|██        | 2094/10000 [1:53:41<6:28:54,  2.95s/it]

{'loss': 0.7772, 'grad_norm': 0.421037495136261, 'learning_rate': 0.0001581990995497749, 'epoch': 0.01}


 21%|██        | 2095/10000 [1:53:45<6:49:14,  3.11s/it]

{'loss': 1.0624, 'grad_norm': 0.38736554980278015, 'learning_rate': 0.00015817908954477239, 'epoch': 0.01}


 21%|██        | 2096/10000 [1:53:49<7:20:17,  3.34s/it]

{'loss': 1.0428, 'grad_norm': 0.36220771074295044, 'learning_rate': 0.0001581590795397699, 'epoch': 0.01}


 21%|██        | 2097/10000 [1:53:53<7:50:56,  3.58s/it]

{'loss': 0.7651, 'grad_norm': 0.3606891334056854, 'learning_rate': 0.0001581390695347674, 'epoch': 0.01}


 21%|██        | 2098/10000 [1:53:57<8:07:21,  3.70s/it]

{'loss': 0.9293, 'grad_norm': 0.3625938892364502, 'learning_rate': 0.0001581190595297649, 'epoch': 0.01}


 21%|██        | 2099/10000 [1:54:00<7:56:10,  3.62s/it]

{'loss': 0.9576, 'grad_norm': 0.3516460955142975, 'learning_rate': 0.0001580990495247624, 'epoch': 0.01}


 21%|██        | 2100/10000 [1:54:04<7:41:19,  3.50s/it]

{'loss': 1.0066, 'grad_norm': 0.45447978377342224, 'learning_rate': 0.00015807903951975988, 'epoch': 0.01}


 21%|██        | 2101/10000 [1:54:09<9:02:51,  4.12s/it]

{'loss': 1.0711, 'grad_norm': 0.32748037576675415, 'learning_rate': 0.0001580590295147574, 'epoch': 0.01}


 21%|██        | 2102/10000 [1:54:13<8:47:59,  4.01s/it]

{'loss': 0.7786, 'grad_norm': 0.31809669733047485, 'learning_rate': 0.00015803901950975488, 'epoch': 0.01}


 21%|██        | 2103/10000 [1:54:17<8:33:35,  3.90s/it]

{'loss': 1.1244, 'grad_norm': 0.40086761116981506, 'learning_rate': 0.0001580190095047524, 'epoch': 0.01}


 21%|██        | 2104/10000 [1:54:20<8:10:48,  3.73s/it]

{'loss': 1.2585, 'grad_norm': 0.42860177159309387, 'learning_rate': 0.00015799899949974988, 'epoch': 0.01}


 21%|██        | 2105/10000 [1:54:23<7:51:50,  3.59s/it]

{'loss': 0.8845, 'grad_norm': 0.4002354145050049, 'learning_rate': 0.0001579789894947474, 'epoch': 0.01}


 21%|██        | 2106/10000 [1:54:27<8:07:39,  3.71s/it]

{'loss': 0.8651, 'grad_norm': 0.36753979325294495, 'learning_rate': 0.00015795897948974488, 'epoch': 0.01}


 21%|██        | 2107/10000 [1:54:30<7:52:29,  3.59s/it]

{'loss': 0.9538, 'grad_norm': 0.4195956289768219, 'learning_rate': 0.0001579389694847424, 'epoch': 0.01}


 21%|██        | 2108/10000 [1:54:34<7:42:22,  3.52s/it]

{'loss': 0.9432, 'grad_norm': 0.3688858449459076, 'learning_rate': 0.00015791895947973988, 'epoch': 0.01}


 21%|██        | 2109/10000 [1:54:37<7:25:48,  3.39s/it]

{'loss': 0.9269, 'grad_norm': 0.44108298420906067, 'learning_rate': 0.00015789894947473737, 'epoch': 0.01}


 21%|██        | 2110/10000 [1:54:40<7:15:08,  3.31s/it]

{'loss': 1.0064, 'grad_norm': 0.3853030204772949, 'learning_rate': 0.00015787893946973486, 'epoch': 0.01}


 21%|██        | 2111/10000 [1:54:45<8:21:59,  3.82s/it]

{'loss': 1.1427, 'grad_norm': 0.3010329604148865, 'learning_rate': 0.00015785892946473237, 'epoch': 0.01}


 21%|██        | 2112/10000 [1:54:49<8:25:17,  3.84s/it]

{'loss': 1.0483, 'grad_norm': 0.3844146132469177, 'learning_rate': 0.00015783891945972986, 'epoch': 0.01}


 21%|██        | 2113/10000 [1:54:52<7:51:40,  3.59s/it]

{'loss': 0.7324, 'grad_norm': 0.3677295446395874, 'learning_rate': 0.00015781890945472737, 'epoch': 0.01}


 21%|██        | 2114/10000 [1:54:56<8:33:13,  3.90s/it]

{'loss': 1.0789, 'grad_norm': 0.33669915795326233, 'learning_rate': 0.0001577988994497249, 'epoch': 0.01}


 21%|██        | 2115/10000 [1:55:01<8:55:49,  4.08s/it]

{'loss': 1.19, 'grad_norm': 0.3630470931529999, 'learning_rate': 0.00015777888944472238, 'epoch': 0.01}


 21%|██        | 2116/10000 [1:55:04<8:28:26,  3.87s/it]

{'loss': 0.6444, 'grad_norm': 0.36260542273521423, 'learning_rate': 0.00015775887943971986, 'epoch': 0.01}


 21%|██        | 2117/10000 [1:55:07<7:43:58,  3.53s/it]

{'loss': 0.7825, 'grad_norm': 0.37769830226898193, 'learning_rate': 0.00015773886943471735, 'epoch': 0.01}


 21%|██        | 2118/10000 [1:55:10<7:23:35,  3.38s/it]

{'loss': 0.7299, 'grad_norm': 0.37317126989364624, 'learning_rate': 0.00015771885942971486, 'epoch': 0.01}


 21%|██        | 2119/10000 [1:55:13<7:14:44,  3.31s/it]

{'loss': 0.7037, 'grad_norm': 0.37978026270866394, 'learning_rate': 0.00015769884942471235, 'epoch': 0.01}


 21%|██        | 2120/10000 [1:55:17<7:41:58,  3.52s/it]

{'loss': 0.7307, 'grad_norm': 0.290905237197876, 'learning_rate': 0.00015767883941970987, 'epoch': 0.01}


 21%|██        | 2121/10000 [1:55:22<8:20:34,  3.81s/it]

{'loss': 1.1498, 'grad_norm': 0.37303709983825684, 'learning_rate': 0.00015765882941470735, 'epoch': 0.01}


 21%|██        | 2122/10000 [1:55:25<7:59:56,  3.66s/it]

{'loss': 0.7809, 'grad_norm': 0.38263368606567383, 'learning_rate': 0.00015763881940970487, 'epoch': 0.01}


 21%|██        | 2123/10000 [1:55:28<7:44:10,  3.54s/it]

{'loss': 0.7317, 'grad_norm': 0.41405025124549866, 'learning_rate': 0.00015761880940470235, 'epoch': 0.01}


 21%|██        | 2124/10000 [1:55:32<7:55:07,  3.62s/it]

{'loss': 1.1135, 'grad_norm': 0.3418399393558502, 'learning_rate': 0.00015759879939969987, 'epoch': 0.01}


 21%|██▏       | 2125/10000 [1:55:36<8:02:50,  3.68s/it]

{'loss': 0.885, 'grad_norm': 0.39106667041778564, 'learning_rate': 0.00015757878939469736, 'epoch': 0.01}


 21%|██▏       | 2126/10000 [1:55:39<7:48:31,  3.57s/it]

{'loss': 0.8074, 'grad_norm': 0.34400686621665955, 'learning_rate': 0.00015755877938969484, 'epoch': 0.01}


 21%|██▏       | 2127/10000 [1:55:42<7:29:16,  3.42s/it]

{'loss': 0.6969, 'grad_norm': 0.31468749046325684, 'learning_rate': 0.00015753876938469236, 'epoch': 0.01}


 21%|██▏       | 2128/10000 [1:55:47<8:07:55,  3.72s/it]

{'loss': 0.7657, 'grad_norm': 0.3261645436286926, 'learning_rate': 0.00015751875937968985, 'epoch': 0.01}


 21%|██▏       | 2129/10000 [1:55:50<7:52:39,  3.60s/it]

{'loss': 0.9794, 'grad_norm': 0.36830753087997437, 'learning_rate': 0.00015749874937468736, 'epoch': 0.01}


 21%|██▏       | 2130/10000 [1:55:54<8:17:39,  3.79s/it]

{'loss': 0.7442, 'grad_norm': 0.3204596936702728, 'learning_rate': 0.00015747873936968485, 'epoch': 0.01}


 21%|██▏       | 2131/10000 [1:55:59<8:45:20,  4.01s/it]

{'loss': 0.9856, 'grad_norm': 0.3421330749988556, 'learning_rate': 0.00015745872936468236, 'epoch': 0.01}


 21%|██▏       | 2132/10000 [1:56:02<8:30:06,  3.89s/it]

{'loss': 0.8739, 'grad_norm': 0.3420048952102661, 'learning_rate': 0.00015743871935967985, 'epoch': 0.01}


 21%|██▏       | 2133/10000 [1:56:05<7:53:50,  3.61s/it]

{'loss': 0.8334, 'grad_norm': 0.3848956823348999, 'learning_rate': 0.00015741870935467736, 'epoch': 0.01}


 21%|██▏       | 2134/10000 [1:56:08<7:23:50,  3.39s/it]

{'loss': 0.7049, 'grad_norm': 0.4004438519477844, 'learning_rate': 0.00015739869934967485, 'epoch': 0.01}


 21%|██▏       | 2135/10000 [1:56:12<7:20:22,  3.36s/it]

{'loss': 0.7951, 'grad_norm': 0.33717653155326843, 'learning_rate': 0.00015737868934467234, 'epoch': 0.01}


 21%|██▏       | 2136/10000 [1:56:15<7:17:49,  3.34s/it]

{'loss': 0.7984, 'grad_norm': 0.3401224911212921, 'learning_rate': 0.00015735867933966982, 'epoch': 0.01}


 21%|██▏       | 2137/10000 [1:56:19<7:30:58,  3.44s/it]

{'loss': 0.7731, 'grad_norm': 0.3545354902744293, 'learning_rate': 0.00015733866933466734, 'epoch': 0.01}


 21%|██▏       | 2138/10000 [1:56:22<7:28:35,  3.42s/it]

{'loss': 0.6947, 'grad_norm': 0.3484710156917572, 'learning_rate': 0.00015731865932966483, 'epoch': 0.01}


 21%|██▏       | 2139/10000 [1:56:25<7:06:44,  3.26s/it]

{'loss': 0.9579, 'grad_norm': 0.40721309185028076, 'learning_rate': 0.00015729864932466234, 'epoch': 0.01}


 21%|██▏       | 2140/10000 [1:56:29<7:42:45,  3.53s/it]

{'loss': 1.2013, 'grad_norm': 0.4287198483943939, 'learning_rate': 0.00015727863931965986, 'epoch': 0.01}


 21%|██▏       | 2141/10000 [1:56:33<7:46:47,  3.56s/it]

{'loss': 1.0462, 'grad_norm': 0.4961954355239868, 'learning_rate': 0.00015725862931465734, 'epoch': 0.01}


 21%|██▏       | 2142/10000 [1:56:37<8:18:17,  3.80s/it]

{'loss': 1.147, 'grad_norm': 0.39814430475234985, 'learning_rate': 0.00015723861930965486, 'epoch': 0.01}


 21%|██▏       | 2143/10000 [1:56:41<8:24:54,  3.86s/it]

{'loss': 0.8501, 'grad_norm': 0.3918004631996155, 'learning_rate': 0.00015721860930465234, 'epoch': 0.01}


 21%|██▏       | 2144/10000 [1:56:45<8:31:15,  3.90s/it]

{'loss': 1.1426, 'grad_norm': 0.38835299015045166, 'learning_rate': 0.00015719859929964983, 'epoch': 0.01}


 21%|██▏       | 2145/10000 [1:56:48<8:09:02,  3.74s/it]

{'loss': 0.7408, 'grad_norm': 0.3110506534576416, 'learning_rate': 0.00015717858929464732, 'epoch': 0.01}


 21%|██▏       | 2146/10000 [1:56:52<8:04:23,  3.70s/it]

{'loss': 1.0916, 'grad_norm': 0.40467479825019836, 'learning_rate': 0.00015715857928964483, 'epoch': 0.01}


 21%|██▏       | 2147/10000 [1:56:55<7:57:02,  3.64s/it]

{'loss': 1.022, 'grad_norm': 0.3489544987678528, 'learning_rate': 0.00015713856928464232, 'epoch': 0.01}


 21%|██▏       | 2148/10000 [1:56:58<7:27:41,  3.42s/it]

{'loss': 0.6105, 'grad_norm': 0.3940359950065613, 'learning_rate': 0.00015711855927963983, 'epoch': 0.01}


 21%|██▏       | 2149/10000 [1:57:01<7:09:33,  3.28s/it]

{'loss': 0.8275, 'grad_norm': 0.41826972365379333, 'learning_rate': 0.00015709854927463732, 'epoch': 0.01}


 22%|██▏       | 2150/10000 [1:57:06<8:09:27,  3.74s/it]

{'loss': 0.9013, 'grad_norm': 0.3283540904521942, 'learning_rate': 0.00015707853926963484, 'epoch': 0.01}


 22%|██▏       | 2151/10000 [1:57:10<8:17:58,  3.81s/it]

{'loss': 0.9964, 'grad_norm': 0.36888012290000916, 'learning_rate': 0.00015705852926463232, 'epoch': 0.01}


 22%|██▏       | 2152/10000 [1:57:14<8:14:22,  3.78s/it]

{'loss': 0.7866, 'grad_norm': 0.37979355454444885, 'learning_rate': 0.0001570385192596298, 'epoch': 0.01}


 22%|██▏       | 2153/10000 [1:57:18<8:16:59,  3.80s/it]

{'loss': 0.8012, 'grad_norm': 0.37265855073928833, 'learning_rate': 0.0001570185092546273, 'epoch': 0.01}


 22%|██▏       | 2154/10000 [1:57:21<8:12:44,  3.77s/it]

{'loss': 0.9805, 'grad_norm': 0.35684502124786377, 'learning_rate': 0.0001569984992496248, 'epoch': 0.01}


 22%|██▏       | 2155/10000 [1:57:25<8:14:44,  3.78s/it]

{'loss': 0.7623, 'grad_norm': 0.34552738070487976, 'learning_rate': 0.00015697848924462233, 'epoch': 0.01}


 22%|██▏       | 2156/10000 [1:57:30<8:56:57,  4.11s/it]

{'loss': 1.2214, 'grad_norm': 0.37118586897850037, 'learning_rate': 0.00015695847923961981, 'epoch': 0.01}


 22%|██▏       | 2157/10000 [1:57:33<8:21:37,  3.84s/it]

{'loss': 0.6893, 'grad_norm': 0.4169318377971649, 'learning_rate': 0.00015693846923461733, 'epoch': 0.01}


 22%|██▏       | 2158/10000 [1:57:37<8:37:42,  3.96s/it]

{'loss': 1.0117, 'grad_norm': 0.3524588942527771, 'learning_rate': 0.00015691845922961482, 'epoch': 0.01}


 22%|██▏       | 2159/10000 [1:57:40<7:38:16,  3.51s/it]

{'loss': 0.913, 'grad_norm': 0.44270849227905273, 'learning_rate': 0.00015689844922461233, 'epoch': 0.01}


 22%|██▏       | 2160/10000 [1:57:43<7:22:06,  3.38s/it]

{'loss': 0.8765, 'grad_norm': 0.4047793745994568, 'learning_rate': 0.00015687843921960982, 'epoch': 0.01}


 22%|██▏       | 2161/10000 [1:57:48<8:15:13,  3.79s/it]

{'loss': 0.8248, 'grad_norm': 0.34788650274276733, 'learning_rate': 0.0001568584292146073, 'epoch': 0.01}


 22%|██▏       | 2162/10000 [1:57:51<7:37:27,  3.50s/it]

{'loss': 0.799, 'grad_norm': 0.358288049697876, 'learning_rate': 0.0001568384192096048, 'epoch': 0.01}


 22%|██▏       | 2163/10000 [1:57:54<7:34:49,  3.48s/it]

{'loss': 0.8911, 'grad_norm': 0.3857016861438751, 'learning_rate': 0.0001568184092046023, 'epoch': 0.01}


 22%|██▏       | 2164/10000 [1:57:57<7:19:57,  3.37s/it]

{'loss': 0.9784, 'grad_norm': 0.36061346530914307, 'learning_rate': 0.0001567983991995998, 'epoch': 0.01}


 22%|██▏       | 2165/10000 [1:58:00<7:17:26,  3.35s/it]

{'loss': 0.8912, 'grad_norm': 0.36772963404655457, 'learning_rate': 0.0001567783891945973, 'epoch': 0.01}


 22%|██▏       | 2166/10000 [1:58:03<6:58:42,  3.21s/it]

{'loss': 0.9182, 'grad_norm': 0.4099929928779602, 'learning_rate': 0.0001567583791895948, 'epoch': 0.01}


 22%|██▏       | 2167/10000 [1:58:07<7:17:21,  3.35s/it]

{'loss': 1.2115, 'grad_norm': 0.4098248779773712, 'learning_rate': 0.0001567383691845923, 'epoch': 0.01}


 22%|██▏       | 2168/10000 [1:58:10<6:51:02,  3.15s/it]

{'loss': 1.03, 'grad_norm': 0.3966960906982422, 'learning_rate': 0.00015671835917958982, 'epoch': 0.01}


 22%|██▏       | 2169/10000 [1:58:13<6:45:49,  3.11s/it]

{'loss': 0.9928, 'grad_norm': 0.4238848388195038, 'learning_rate': 0.0001566983491745873, 'epoch': 0.01}


 22%|██▏       | 2170/10000 [1:58:15<6:27:16,  2.97s/it]

{'loss': 0.6745, 'grad_norm': 0.347378134727478, 'learning_rate': 0.0001566783391695848, 'epoch': 0.01}


 22%|██▏       | 2171/10000 [1:58:18<6:23:24,  2.94s/it]

{'loss': 0.7732, 'grad_norm': 0.3830730617046356, 'learning_rate': 0.00015665832916458229, 'epoch': 0.01}


 22%|██▏       | 2172/10000 [1:58:21<6:38:06,  3.05s/it]

{'loss': 0.927, 'grad_norm': 0.3857785165309906, 'learning_rate': 0.0001566383191595798, 'epoch': 0.01}


 22%|██▏       | 2173/10000 [1:58:24<6:32:42,  3.01s/it]

{'loss': 1.2934, 'grad_norm': 0.410986065864563, 'learning_rate': 0.0001566183091545773, 'epoch': 0.01}


 22%|██▏       | 2174/10000 [1:58:27<6:23:57,  2.94s/it]

{'loss': 1.0226, 'grad_norm': 0.3658921420574188, 'learning_rate': 0.0001565982991495748, 'epoch': 0.01}


 22%|██▏       | 2175/10000 [1:58:30<6:04:50,  2.80s/it]

{'loss': 0.8894, 'grad_norm': 0.4194508492946625, 'learning_rate': 0.0001565782891445723, 'epoch': 0.01}


 22%|██▏       | 2176/10000 [1:58:33<6:09:18,  2.83s/it]

{'loss': 0.7359, 'grad_norm': 0.3283028304576874, 'learning_rate': 0.0001565582791395698, 'epoch': 0.01}


 22%|██▏       | 2177/10000 [1:58:36<6:13:58,  2.87s/it]

{'loss': 1.0431, 'grad_norm': 0.37572917342185974, 'learning_rate': 0.0001565382691345673, 'epoch': 0.01}


 22%|██▏       | 2178/10000 [1:58:38<6:03:35,  2.79s/it]

{'loss': 0.9068, 'grad_norm': 0.41310441493988037, 'learning_rate': 0.0001565182591295648, 'epoch': 0.01}


 22%|██▏       | 2179/10000 [1:58:41<6:01:50,  2.78s/it]

{'loss': 0.7968, 'grad_norm': 0.4235813021659851, 'learning_rate': 0.0001564982491245623, 'epoch': 0.01}


 22%|██▏       | 2180/10000 [1:58:45<6:43:44,  3.10s/it]

{'loss': 0.8007, 'grad_norm': 0.35714665055274963, 'learning_rate': 0.00015647823911955978, 'epoch': 0.01}


 22%|██▏       | 2181/10000 [1:58:49<7:13:49,  3.33s/it]

{'loss': 1.2415, 'grad_norm': 0.3906586170196533, 'learning_rate': 0.00015645822911455727, 'epoch': 0.01}


 22%|██▏       | 2182/10000 [1:58:52<7:22:00,  3.39s/it]

{'loss': 0.8959, 'grad_norm': 0.33329951763153076, 'learning_rate': 0.00015643821910955478, 'epoch': 0.01}


 22%|██▏       | 2183/10000 [1:58:55<7:20:22,  3.38s/it]

{'loss': 0.7019, 'grad_norm': 0.33333346247673035, 'learning_rate': 0.0001564182091045523, 'epoch': 0.01}


 22%|██▏       | 2184/10000 [1:58:59<7:06:47,  3.28s/it]

{'loss': 0.962, 'grad_norm': 0.3709576427936554, 'learning_rate': 0.00015639819909954978, 'epoch': 0.01}


 22%|██▏       | 2185/10000 [1:59:02<7:22:44,  3.40s/it]

{'loss': 1.1352, 'grad_norm': 0.39134490489959717, 'learning_rate': 0.0001563781890945473, 'epoch': 0.01}


 22%|██▏       | 2186/10000 [1:59:06<7:28:57,  3.45s/it]

{'loss': 0.693, 'grad_norm': 0.3232666850090027, 'learning_rate': 0.00015635817908954479, 'epoch': 0.01}


 22%|██▏       | 2187/10000 [1:59:10<7:43:02,  3.56s/it]

{'loss': 1.0937, 'grad_norm': 0.3647174537181854, 'learning_rate': 0.00015633816908454227, 'epoch': 0.01}


 22%|██▏       | 2188/10000 [1:59:12<7:15:24,  3.34s/it]

{'loss': 0.8754, 'grad_norm': 0.4026385545730591, 'learning_rate': 0.00015631815907953976, 'epoch': 0.01}


 22%|██▏       | 2189/10000 [1:59:16<7:26:34,  3.43s/it]

{'loss': 0.8481, 'grad_norm': 0.35109949111938477, 'learning_rate': 0.00015629814907453727, 'epoch': 0.01}


 22%|██▏       | 2190/10000 [1:59:19<6:53:48,  3.18s/it]

{'loss': 0.7796, 'grad_norm': 0.3970267176628113, 'learning_rate': 0.00015627813906953476, 'epoch': 0.01}


 22%|██▏       | 2191/10000 [1:59:22<7:11:50,  3.32s/it]

{'loss': 0.7967, 'grad_norm': 0.36375364661216736, 'learning_rate': 0.00015625812906453228, 'epoch': 0.01}


 22%|██▏       | 2192/10000 [1:59:26<7:18:05,  3.37s/it]

{'loss': 0.6565, 'grad_norm': 0.3501184284687042, 'learning_rate': 0.00015623811905952976, 'epoch': 0.01}


 22%|██▏       | 2193/10000 [1:59:29<7:18:36,  3.37s/it]

{'loss': 1.0898, 'grad_norm': 0.39131268858909607, 'learning_rate': 0.00015621810905452728, 'epoch': 0.01}


 22%|██▏       | 2194/10000 [1:59:32<7:04:57,  3.27s/it]

{'loss': 0.8808, 'grad_norm': 0.40163278579711914, 'learning_rate': 0.00015619809904952476, 'epoch': 0.01}


 22%|██▏       | 2195/10000 [1:59:35<6:46:41,  3.13s/it]

{'loss': 0.801, 'grad_norm': 0.4469870924949646, 'learning_rate': 0.00015617808904452228, 'epoch': 0.01}


 22%|██▏       | 2196/10000 [1:59:38<6:31:10,  3.01s/it]

{'loss': 0.7926, 'grad_norm': 0.4196857511997223, 'learning_rate': 0.00015615807903951977, 'epoch': 0.01}


 22%|██▏       | 2197/10000 [1:59:41<6:36:31,  3.05s/it]

{'loss': 0.8325, 'grad_norm': 0.40659549832344055, 'learning_rate': 0.00015613806903451725, 'epoch': 0.01}


 22%|██▏       | 2198/10000 [1:59:44<6:44:46,  3.11s/it]

{'loss': 0.853, 'grad_norm': 0.3790194094181061, 'learning_rate': 0.00015611805902951477, 'epoch': 0.01}


 22%|██▏       | 2199/10000 [1:59:48<7:12:05,  3.32s/it]

{'loss': 0.9781, 'grad_norm': 0.3388301134109497, 'learning_rate': 0.00015609804902451226, 'epoch': 0.01}


 22%|██▏       | 2200/10000 [1:59:51<6:58:00,  3.22s/it]

{'loss': 0.9703, 'grad_norm': 0.38129061460494995, 'learning_rate': 0.00015607803901950977, 'epoch': 0.01}


 22%|██▏       | 2201/10000 [1:59:56<8:12:02,  3.79s/it]

{'loss': 0.9322, 'grad_norm': 0.3454027473926544, 'learning_rate': 0.00015605802901450726, 'epoch': 0.01}


 22%|██▏       | 2202/10000 [1:59:59<7:38:55,  3.53s/it]

{'loss': 0.8597, 'grad_norm': 0.37296953797340393, 'learning_rate': 0.00015603801900950477, 'epoch': 0.01}


 22%|██▏       | 2203/10000 [2:00:05<9:16:57,  4.29s/it]

{'loss': 1.0751, 'grad_norm': 0.32122671604156494, 'learning_rate': 0.00015601800900450226, 'epoch': 0.01}


 22%|██▏       | 2204/10000 [2:00:09<8:56:36,  4.13s/it]

{'loss': 0.8327, 'grad_norm': 0.3446526527404785, 'learning_rate': 0.00015599799899949977, 'epoch': 0.01}


 22%|██▏       | 2205/10000 [2:00:12<8:30:59,  3.93s/it]

{'loss': 0.7785, 'grad_norm': 0.36487337946891785, 'learning_rate': 0.00015597798899449726, 'epoch': 0.01}


 22%|██▏       | 2206/10000 [2:00:17<8:58:25,  4.14s/it]

{'loss': 0.9626, 'grad_norm': 0.310366153717041, 'learning_rate': 0.00015595797898949475, 'epoch': 0.01}


 22%|██▏       | 2207/10000 [2:00:20<8:11:31,  3.78s/it]

{'loss': 0.9526, 'grad_norm': 0.3984803259372711, 'learning_rate': 0.00015593796898449223, 'epoch': 0.01}


 22%|██▏       | 2208/10000 [2:00:23<7:34:16,  3.50s/it]

{'loss': 0.5432, 'grad_norm': 0.34207695722579956, 'learning_rate': 0.00015591795897948975, 'epoch': 0.01}


 22%|██▏       | 2209/10000 [2:00:26<7:12:11,  3.33s/it]

{'loss': 0.8128, 'grad_norm': 0.3663557469844818, 'learning_rate': 0.00015589794897448724, 'epoch': 0.01}


 22%|██▏       | 2210/10000 [2:00:30<7:51:20,  3.63s/it]

{'loss': 1.2926, 'grad_norm': 0.3911946415901184, 'learning_rate': 0.00015587793896948475, 'epoch': 0.01}


 22%|██▏       | 2211/10000 [2:00:33<7:40:37,  3.55s/it]

{'loss': 0.9457, 'grad_norm': 0.38462579250335693, 'learning_rate': 0.00015585792896448227, 'epoch': 0.01}


 22%|██▏       | 2212/10000 [2:00:37<7:30:23,  3.47s/it]

{'loss': 1.0674, 'grad_norm': 0.7365245819091797, 'learning_rate': 0.00015583791895947975, 'epoch': 0.01}


 22%|██▏       | 2213/10000 [2:00:40<7:41:27,  3.56s/it]

{'loss': 0.8724, 'grad_norm': 0.3724495470523834, 'learning_rate': 0.00015581790895447727, 'epoch': 0.01}


 22%|██▏       | 2214/10000 [2:00:44<7:33:14,  3.49s/it]

{'loss': 0.732, 'grad_norm': 0.3898261487483978, 'learning_rate': 0.00015579789894947475, 'epoch': 0.01}


 22%|██▏       | 2215/10000 [2:00:46<7:03:38,  3.27s/it]

{'loss': 0.9891, 'grad_norm': 0.4386894404888153, 'learning_rate': 0.00015577788894447224, 'epoch': 0.01}


 22%|██▏       | 2216/10000 [2:00:50<7:01:46,  3.25s/it]

{'loss': 0.759, 'grad_norm': 0.44487571716308594, 'learning_rate': 0.00015575787893946973, 'epoch': 0.01}


 22%|██▏       | 2217/10000 [2:00:53<6:58:48,  3.23s/it]

{'loss': 0.829, 'grad_norm': 0.3882083594799042, 'learning_rate': 0.00015573786893446724, 'epoch': 0.01}


 22%|██▏       | 2218/10000 [2:00:57<7:37:27,  3.53s/it]

{'loss': 0.5545, 'grad_norm': 0.31172123551368713, 'learning_rate': 0.00015571785892946473, 'epoch': 0.01}


 22%|██▏       | 2219/10000 [2:01:01<7:57:48,  3.68s/it]

{'loss': 0.9952, 'grad_norm': 0.35868656635284424, 'learning_rate': 0.00015569784892446224, 'epoch': 0.01}


 22%|██▏       | 2220/10000 [2:01:04<7:21:02,  3.40s/it]

{'loss': 0.8899, 'grad_norm': 0.4242779016494751, 'learning_rate': 0.00015567783891945973, 'epoch': 0.01}


 22%|██▏       | 2221/10000 [2:01:09<8:28:41,  3.92s/it]

{'loss': 1.1887, 'grad_norm': 0.3242068588733673, 'learning_rate': 0.00015565782891445725, 'epoch': 0.01}


 22%|██▏       | 2222/10000 [2:01:12<8:02:07,  3.72s/it]

{'loss': 0.7501, 'grad_norm': 0.34531641006469727, 'learning_rate': 0.00015563781890945473, 'epoch': 0.01}


 22%|██▏       | 2223/10000 [2:01:16<7:47:37,  3.61s/it]

{'loss': 0.9153, 'grad_norm': 0.38249579071998596, 'learning_rate': 0.00015561780890445222, 'epoch': 0.01}


 22%|██▏       | 2224/10000 [2:01:19<7:52:04,  3.64s/it]

{'loss': 0.9399, 'grad_norm': 0.37627702951431274, 'learning_rate': 0.00015559779889944974, 'epoch': 0.01}


 22%|██▏       | 2225/10000 [2:01:23<7:50:38,  3.63s/it]

{'loss': 0.7617, 'grad_norm': 0.35648712515830994, 'learning_rate': 0.00015557778889444722, 'epoch': 0.01}


 22%|██▏       | 2226/10000 [2:01:26<7:46:49,  3.60s/it]

{'loss': 0.7816, 'grad_norm': 0.4016502797603607, 'learning_rate': 0.00015555777888944474, 'epoch': 0.01}


 22%|██▏       | 2227/10000 [2:01:31<8:13:30,  3.81s/it]

{'loss': 0.9993, 'grad_norm': 0.3771038353443146, 'learning_rate': 0.00015553776888444222, 'epoch': 0.01}


 22%|██▏       | 2228/10000 [2:01:34<7:56:41,  3.68s/it]

{'loss': 1.0043, 'grad_norm': 0.44827744364738464, 'learning_rate': 0.00015551775887943974, 'epoch': 0.01}


 22%|██▏       | 2229/10000 [2:01:38<7:55:49,  3.67s/it]

{'loss': 0.9319, 'grad_norm': 0.41783303022384644, 'learning_rate': 0.00015549774887443723, 'epoch': 0.01}


 22%|██▏       | 2230/10000 [2:01:41<7:44:22,  3.59s/it]

{'loss': 0.5835, 'grad_norm': 0.3247295618057251, 'learning_rate': 0.00015547773886943474, 'epoch': 0.01}


 22%|██▏       | 2231/10000 [2:01:46<8:41:13,  4.03s/it]

{'loss': 0.9727, 'grad_norm': 0.35303932428359985, 'learning_rate': 0.00015545772886443223, 'epoch': 0.01}


 22%|██▏       | 2232/10000 [2:01:50<8:39:56,  4.02s/it]

{'loss': 0.7172, 'grad_norm': 0.3573373556137085, 'learning_rate': 0.00015543771885942972, 'epoch': 0.01}


 22%|██▏       | 2233/10000 [2:01:54<8:22:45,  3.88s/it]

{'loss': 0.6421, 'grad_norm': 0.33027809858322144, 'learning_rate': 0.0001554177088544272, 'epoch': 0.01}


 22%|██▏       | 2234/10000 [2:01:57<7:45:53,  3.60s/it]

{'loss': 1.0915, 'grad_norm': 0.4394511878490448, 'learning_rate': 0.00015539769884942472, 'epoch': 0.01}


 22%|██▏       | 2235/10000 [2:02:01<8:12:03,  3.80s/it]

{'loss': 1.0669, 'grad_norm': 0.3311649262905121, 'learning_rate': 0.0001553776888444222, 'epoch': 0.01}


 22%|██▏       | 2236/10000 [2:02:05<8:17:30,  3.84s/it]

{'loss': 1.0187, 'grad_norm': 0.387869268655777, 'learning_rate': 0.00015535767883941972, 'epoch': 0.01}


 22%|██▏       | 2237/10000 [2:02:09<8:23:48,  3.89s/it]

{'loss': 0.8615, 'grad_norm': 0.3746850788593292, 'learning_rate': 0.00015533766883441723, 'epoch': 0.01}


 22%|██▏       | 2238/10000 [2:02:13<8:50:51,  4.10s/it]

{'loss': 1.0365, 'grad_norm': 0.344130277633667, 'learning_rate': 0.00015531765882941472, 'epoch': 0.01}


 22%|██▏       | 2239/10000 [2:02:17<8:28:13,  3.93s/it]

{'loss': 0.7038, 'grad_norm': 0.32798752188682556, 'learning_rate': 0.00015529764882441223, 'epoch': 0.01}


 22%|██▏       | 2240/10000 [2:02:21<8:39:13,  4.01s/it]

{'loss': 1.0925, 'grad_norm': 0.42655035853385925, 'learning_rate': 0.00015527763881940972, 'epoch': 0.01}


 22%|██▏       | 2241/10000 [2:02:24<7:46:17,  3.61s/it]

{'loss': 0.7923, 'grad_norm': 0.41443657875061035, 'learning_rate': 0.0001552576288144072, 'epoch': 0.01}


 22%|██▏       | 2242/10000 [2:02:27<7:11:17,  3.34s/it]

{'loss': 0.7435, 'grad_norm': 0.396436870098114, 'learning_rate': 0.0001552376188094047, 'epoch': 0.01}


 22%|██▏       | 2243/10000 [2:02:30<7:32:34,  3.50s/it]

{'loss': 0.5782, 'grad_norm': 0.338320791721344, 'learning_rate': 0.0001552176088044022, 'epoch': 0.01}


 22%|██▏       | 2244/10000 [2:02:34<7:27:45,  3.46s/it]

{'loss': 0.7923, 'grad_norm': 0.36362579464912415, 'learning_rate': 0.0001551975987993997, 'epoch': 0.01}


 22%|██▏       | 2245/10000 [2:02:39<8:21:14,  3.88s/it]

{'loss': 0.9656, 'grad_norm': 0.39005663990974426, 'learning_rate': 0.0001551775887943972, 'epoch': 0.01}


 22%|██▏       | 2246/10000 [2:02:43<8:44:55,  4.06s/it]

{'loss': 0.8981, 'grad_norm': 0.3334617614746094, 'learning_rate': 0.0001551575787893947, 'epoch': 0.01}


 22%|██▏       | 2247/10000 [2:02:49<9:36:06,  4.46s/it]

{'loss': 1.1471, 'grad_norm': 0.3799051344394684, 'learning_rate': 0.00015513756878439221, 'epoch': 0.01}


 22%|██▏       | 2248/10000 [2:02:52<8:45:16,  4.07s/it]

{'loss': 0.8276, 'grad_norm': 0.38271936774253845, 'learning_rate': 0.0001551175587793897, 'epoch': 0.01}


 22%|██▏       | 2249/10000 [2:02:55<8:30:23,  3.95s/it]

{'loss': 0.9094, 'grad_norm': 0.3772994577884674, 'learning_rate': 0.00015509754877438722, 'epoch': 0.01}


 22%|██▎       | 2250/10000 [2:03:00<9:04:59,  4.22s/it]

{'loss': 0.8866, 'grad_norm': 0.4054929316043854, 'learning_rate': 0.00015507753876938468, 'epoch': 0.01}


 23%|██▎       | 2251/10000 [2:03:05<9:20:46,  4.34s/it]

{'loss': 0.9003, 'grad_norm': 0.38598987460136414, 'learning_rate': 0.0001550575287643822, 'epoch': 0.01}


 23%|██▎       | 2252/10000 [2:03:09<9:10:56,  4.27s/it]

{'loss': 1.0948, 'grad_norm': 0.39529597759246826, 'learning_rate': 0.0001550375187593797, 'epoch': 0.01}


 23%|██▎       | 2253/10000 [2:03:13<8:50:55,  4.11s/it]

{'loss': 1.1864, 'grad_norm': 0.4064047336578369, 'learning_rate': 0.0001550175087543772, 'epoch': 0.01}


 23%|██▎       | 2254/10000 [2:03:16<8:32:05,  3.97s/it]

{'loss': 1.033, 'grad_norm': 0.38998162746429443, 'learning_rate': 0.0001549974987493747, 'epoch': 0.01}


 23%|██▎       | 2255/10000 [2:03:19<7:50:03,  3.64s/it]

{'loss': 0.7213, 'grad_norm': 0.4409729242324829, 'learning_rate': 0.0001549774887443722, 'epoch': 0.01}


 23%|██▎       | 2256/10000 [2:03:24<8:44:59,  4.07s/it]

{'loss': 1.0725, 'grad_norm': 0.3563935458660126, 'learning_rate': 0.0001549574787393697, 'epoch': 0.01}


 23%|██▎       | 2257/10000 [2:03:29<9:03:04,  4.21s/it]

{'loss': 1.4034, 'grad_norm': 0.39725950360298157, 'learning_rate': 0.0001549374687343672, 'epoch': 0.01}


 23%|██▎       | 2258/10000 [2:03:34<9:50:42,  4.58s/it]

{'loss': 1.1731, 'grad_norm': 0.3259916603565216, 'learning_rate': 0.00015491745872936468, 'epoch': 0.01}


 23%|██▎       | 2259/10000 [2:03:39<9:38:53,  4.49s/it]

{'loss': 0.8224, 'grad_norm': 0.35743460059165955, 'learning_rate': 0.00015489744872436217, 'epoch': 0.01}


 23%|██▎       | 2260/10000 [2:03:43<9:29:08,  4.41s/it]

{'loss': 0.8568, 'grad_norm': 0.37251490354537964, 'learning_rate': 0.00015487743871935968, 'epoch': 0.01}


 23%|██▎       | 2261/10000 [2:03:47<9:17:48,  4.32s/it]

{'loss': 0.7915, 'grad_norm': 0.3673419952392578, 'learning_rate': 0.00015485742871435717, 'epoch': 0.01}


 23%|██▎       | 2262/10000 [2:03:51<8:53:44,  4.14s/it]

{'loss': 0.6406, 'grad_norm': 0.33548569679260254, 'learning_rate': 0.00015483741870935469, 'epoch': 0.01}


 23%|██▎       | 2263/10000 [2:03:55<8:50:15,  4.11s/it]

{'loss': 0.9379, 'grad_norm': 0.3706389367580414, 'learning_rate': 0.00015481740870435217, 'epoch': 0.01}


 23%|██▎       | 2264/10000 [2:03:58<8:37:49,  4.02s/it]

{'loss': 0.8349, 'grad_norm': 0.3761135935783386, 'learning_rate': 0.0001547973986993497, 'epoch': 0.01}


 23%|██▎       | 2265/10000 [2:04:03<8:42:33,  4.05s/it]

{'loss': 1.3235, 'grad_norm': 0.44079625606536865, 'learning_rate': 0.0001547773886943472, 'epoch': 0.01}


 23%|██▎       | 2266/10000 [2:04:06<8:25:15,  3.92s/it]

{'loss': 1.0979, 'grad_norm': 0.5264103412628174, 'learning_rate': 0.0001547573786893447, 'epoch': 0.01}


 23%|██▎       | 2267/10000 [2:04:10<8:25:07,  3.92s/it]

{'loss': 0.8927, 'grad_norm': 0.3558235168457031, 'learning_rate': 0.00015473736868434218, 'epoch': 0.01}


 23%|██▎       | 2268/10000 [2:04:14<8:37:13,  4.01s/it]

{'loss': 0.8553, 'grad_norm': 0.34581494331359863, 'learning_rate': 0.00015471735867933966, 'epoch': 0.01}


 23%|██▎       | 2269/10000 [2:04:18<8:22:18,  3.90s/it]

{'loss': 1.471, 'grad_norm': 0.4602019488811493, 'learning_rate': 0.00015469734867433718, 'epoch': 0.01}


 23%|██▎       | 2270/10000 [2:04:21<7:59:57,  3.73s/it]

{'loss': 0.88, 'grad_norm': 0.4075644314289093, 'learning_rate': 0.00015467733866933467, 'epoch': 0.01}


 23%|██▎       | 2271/10000 [2:04:25<7:52:44,  3.67s/it]

{'loss': 0.6707, 'grad_norm': 0.3416637182235718, 'learning_rate': 0.00015465732866433218, 'epoch': 0.01}


 23%|██▎       | 2272/10000 [2:04:29<8:24:53,  3.92s/it]

{'loss': 1.0148, 'grad_norm': 0.3596019148826599, 'learning_rate': 0.00015463731865932967, 'epoch': 0.01}


 23%|██▎       | 2273/10000 [2:04:33<8:16:27,  3.85s/it]

{'loss': 0.9426, 'grad_norm': 0.34093788266181946, 'learning_rate': 0.00015461730865432718, 'epoch': 0.01}


 23%|██▎       | 2274/10000 [2:04:36<7:49:39,  3.65s/it]

{'loss': 0.7582, 'grad_norm': 0.3665321469306946, 'learning_rate': 0.00015459729864932467, 'epoch': 0.01}


 23%|██▎       | 2275/10000 [2:04:40<8:01:13,  3.74s/it]

{'loss': 0.9954, 'grad_norm': 0.37023523449897766, 'learning_rate': 0.00015457728864432218, 'epoch': 0.01}


 23%|██▎       | 2276/10000 [2:04:44<8:03:43,  3.76s/it]

{'loss': 0.9454, 'grad_norm': 0.34298837184906006, 'learning_rate': 0.00015455727863931967, 'epoch': 0.01}


 23%|██▎       | 2277/10000 [2:04:47<7:54:24,  3.69s/it]

{'loss': 0.8696, 'grad_norm': 0.3674922287464142, 'learning_rate': 0.00015453726863431716, 'epoch': 0.01}


 23%|██▎       | 2278/10000 [2:04:51<7:52:55,  3.67s/it]

{'loss': 0.7727, 'grad_norm': 0.3288291394710541, 'learning_rate': 0.00015451725862931464, 'epoch': 0.01}


 23%|██▎       | 2279/10000 [2:04:54<7:27:29,  3.48s/it]

{'loss': 0.8539, 'grad_norm': 0.4268242120742798, 'learning_rate': 0.00015449724862431216, 'epoch': 0.01}


 23%|██▎       | 2280/10000 [2:05:00<8:45:53,  4.09s/it]

{'loss': 1.1306, 'grad_norm': 0.33411160111427307, 'learning_rate': 0.00015447723861930967, 'epoch': 0.01}


 23%|██▎       | 2281/10000 [2:05:03<8:24:46,  3.92s/it]

{'loss': 0.9353, 'grad_norm': 0.39093250036239624, 'learning_rate': 0.00015445722861430716, 'epoch': 0.01}


 23%|██▎       | 2282/10000 [2:05:08<9:04:51,  4.24s/it]

{'loss': 0.7082, 'grad_norm': 0.28442060947418213, 'learning_rate': 0.00015443721860930468, 'epoch': 0.01}


 23%|██▎       | 2283/10000 [2:05:12<8:31:56,  3.98s/it]

{'loss': 0.8348, 'grad_norm': 0.3855447769165039, 'learning_rate': 0.00015441720860430216, 'epoch': 0.01}


 23%|██▎       | 2284/10000 [2:05:15<8:17:11,  3.87s/it]

{'loss': 0.8713, 'grad_norm': 0.44365277886390686, 'learning_rate': 0.00015439719859929968, 'epoch': 0.01}


 23%|██▎       | 2285/10000 [2:05:18<7:55:13,  3.70s/it]

{'loss': 0.7636, 'grad_norm': 0.4277310073375702, 'learning_rate': 0.00015437718859429716, 'epoch': 0.01}


 23%|██▎       | 2286/10000 [2:05:22<7:47:35,  3.64s/it]

{'loss': 0.6927, 'grad_norm': 0.3493466377258301, 'learning_rate': 0.00015435717858929465, 'epoch': 0.01}


 23%|██▎       | 2287/10000 [2:05:26<8:07:28,  3.79s/it]

{'loss': 0.8701, 'grad_norm': 0.3623507022857666, 'learning_rate': 0.00015433716858429214, 'epoch': 0.01}


 23%|██▎       | 2288/10000 [2:05:29<7:41:28,  3.59s/it]

{'loss': 0.6375, 'grad_norm': 0.3862427771091461, 'learning_rate': 0.00015431715857928965, 'epoch': 0.01}


 23%|██▎       | 2289/10000 [2:05:33<7:42:24,  3.60s/it]

{'loss': 0.7681, 'grad_norm': 0.37768232822418213, 'learning_rate': 0.00015429714857428714, 'epoch': 0.01}


 23%|██▎       | 2290/10000 [2:05:37<7:47:48,  3.64s/it]

{'loss': 0.8078, 'grad_norm': 0.4284893870353699, 'learning_rate': 0.00015427713856928465, 'epoch': 0.01}


 23%|██▎       | 2291/10000 [2:05:41<8:20:37,  3.90s/it]

{'loss': 0.9365, 'grad_norm': 0.30548518896102905, 'learning_rate': 0.00015425712856428214, 'epoch': 0.01}


 23%|██▎       | 2292/10000 [2:05:45<8:08:40,  3.80s/it]

{'loss': 0.8626, 'grad_norm': 0.4245395362377167, 'learning_rate': 0.00015423711855927966, 'epoch': 0.01}


 23%|██▎       | 2293/10000 [2:05:48<8:06:58,  3.79s/it]

{'loss': 0.8457, 'grad_norm': 0.38708043098449707, 'learning_rate': 0.00015421710855427714, 'epoch': 0.01}


 23%|██▎       | 2294/10000 [2:05:53<8:22:01,  3.91s/it]

{'loss': 1.3402, 'grad_norm': 0.4461352229118347, 'learning_rate': 0.00015419709854927463, 'epoch': 0.01}


 23%|██▎       | 2295/10000 [2:05:57<8:25:58,  3.94s/it]

{'loss': 0.738, 'grad_norm': 0.3188112676143646, 'learning_rate': 0.00015417708854427215, 'epoch': 0.01}


 23%|██▎       | 2296/10000 [2:05:59<7:37:29,  3.56s/it]

{'loss': 1.0006, 'grad_norm': 0.4445919692516327, 'learning_rate': 0.00015415707853926963, 'epoch': 0.01}


 23%|██▎       | 2297/10000 [2:06:04<8:37:44,  4.03s/it]

{'loss': 0.7247, 'grad_norm': 0.3072349429130554, 'learning_rate': 0.00015413706853426715, 'epoch': 0.01}


 23%|██▎       | 2298/10000 [2:06:10<9:20:18,  4.36s/it]

{'loss': 0.9097, 'grad_norm': 0.39067041873931885, 'learning_rate': 0.00015411705852926463, 'epoch': 0.01}


 23%|██▎       | 2299/10000 [2:06:13<9:01:27,  4.22s/it]

{'loss': 0.9963, 'grad_norm': 0.3858417868614197, 'learning_rate': 0.00015409704852426215, 'epoch': 0.01}


 23%|██▎       | 2300/10000 [2:06:17<8:35:11,  4.01s/it]

{'loss': 1.0537, 'grad_norm': 0.39913663268089294, 'learning_rate': 0.00015407703851925964, 'epoch': 0.01}


 23%|██▎       | 2301/10000 [2:06:22<9:01:03,  4.22s/it]

{'loss': 0.7838, 'grad_norm': 0.39485058188438416, 'learning_rate': 0.00015405702851425715, 'epoch': 0.01}


 23%|██▎       | 2302/10000 [2:06:26<8:56:31,  4.18s/it]

{'loss': 1.0241, 'grad_norm': 0.4176245331764221, 'learning_rate': 0.00015403701850925464, 'epoch': 0.01}


 23%|██▎       | 2303/10000 [2:06:30<8:40:45,  4.06s/it]

{'loss': 0.9437, 'grad_norm': 0.4269455075263977, 'learning_rate': 0.00015401700850425213, 'epoch': 0.01}


 23%|██▎       | 2304/10000 [2:06:34<8:47:17,  4.11s/it]

{'loss': 1.3363, 'grad_norm': 0.3946506381034851, 'learning_rate': 0.0001539969984992496, 'epoch': 0.01}


 23%|██▎       | 2305/10000 [2:06:37<8:10:59,  3.83s/it]

{'loss': 0.9419, 'grad_norm': 0.4472481310367584, 'learning_rate': 0.00015397698849424713, 'epoch': 0.01}


 23%|██▎       | 2306/10000 [2:06:41<8:05:08,  3.78s/it]

{'loss': 0.9063, 'grad_norm': 0.3521101772785187, 'learning_rate': 0.00015395697848924461, 'epoch': 0.01}


 23%|██▎       | 2307/10000 [2:06:44<8:03:46,  3.77s/it]

{'loss': 0.9178, 'grad_norm': 0.36422663927078247, 'learning_rate': 0.00015393696848424213, 'epoch': 0.01}


 23%|██▎       | 2308/10000 [2:06:48<7:47:41,  3.65s/it]

{'loss': 0.6601, 'grad_norm': 0.4969027638435364, 'learning_rate': 0.00015391695847923964, 'epoch': 0.01}


 23%|██▎       | 2309/10000 [2:06:51<7:24:00,  3.46s/it]

{'loss': 1.1097, 'grad_norm': 0.45284369587898254, 'learning_rate': 0.00015389694847423713, 'epoch': 0.01}


 23%|██▎       | 2310/10000 [2:06:55<7:44:32,  3.62s/it]

{'loss': 0.9853, 'grad_norm': 0.4357873797416687, 'learning_rate': 0.00015387693846923464, 'epoch': 0.01}


 23%|██▎       | 2311/10000 [2:07:00<8:43:26,  4.08s/it]

{'loss': 1.0547, 'grad_norm': 0.33109813928604126, 'learning_rate': 0.00015385692846423213, 'epoch': 0.01}


 23%|██▎       | 2312/10000 [2:07:03<8:11:49,  3.84s/it]

{'loss': 0.9677, 'grad_norm': 0.39371171593666077, 'learning_rate': 0.00015383691845922962, 'epoch': 0.01}


 23%|██▎       | 2313/10000 [2:07:08<8:33:28,  4.01s/it]

{'loss': 0.7374, 'grad_norm': 0.34362760186195374, 'learning_rate': 0.0001538169084542271, 'epoch': 0.01}


 23%|██▎       | 2314/10000 [2:07:11<8:04:42,  3.78s/it]

{'loss': 0.7622, 'grad_norm': 0.40440478920936584, 'learning_rate': 0.00015379689844922462, 'epoch': 0.01}


 23%|██▎       | 2315/10000 [2:07:15<8:13:03,  3.85s/it]

{'loss': 0.8036, 'grad_norm': 0.36399680376052856, 'learning_rate': 0.0001537768884442221, 'epoch': 0.01}


 23%|██▎       | 2316/10000 [2:07:18<7:46:58,  3.65s/it]

{'loss': 0.8209, 'grad_norm': 0.3947635889053345, 'learning_rate': 0.00015375687843921962, 'epoch': 0.01}


 23%|██▎       | 2317/10000 [2:07:22<8:15:30,  3.87s/it]

{'loss': 0.9216, 'grad_norm': 0.359464168548584, 'learning_rate': 0.0001537368684342171, 'epoch': 0.01}


 23%|██▎       | 2318/10000 [2:07:26<8:08:19,  3.81s/it]

{'loss': 0.6916, 'grad_norm': 0.34781649708747864, 'learning_rate': 0.00015371685842921462, 'epoch': 0.01}


 23%|██▎       | 2319/10000 [2:07:30<8:11:52,  3.84s/it]

{'loss': 1.0365, 'grad_norm': 0.38483959436416626, 'learning_rate': 0.0001536968484242121, 'epoch': 0.01}


 23%|██▎       | 2320/10000 [2:07:34<8:23:01,  3.93s/it]

{'loss': 0.9487, 'grad_norm': 0.31724682450294495, 'learning_rate': 0.00015367683841920963, 'epoch': 0.01}


 23%|██▎       | 2321/10000 [2:07:38<8:30:14,  3.99s/it]

{'loss': 0.88, 'grad_norm': 0.36550837755203247, 'learning_rate': 0.0001536568284142071, 'epoch': 0.01}


 23%|██▎       | 2322/10000 [2:07:42<8:15:33,  3.87s/it]

{'loss': 0.8462, 'grad_norm': 0.41486284136772156, 'learning_rate': 0.0001536368184092046, 'epoch': 0.01}


 23%|██▎       | 2323/10000 [2:07:45<7:44:35,  3.63s/it]

{'loss': 0.7148, 'grad_norm': 0.36360201239585876, 'learning_rate': 0.00015361680840420211, 'epoch': 0.01}


 23%|██▎       | 2324/10000 [2:07:48<7:20:33,  3.44s/it]

{'loss': 0.8054, 'grad_norm': 0.39500167965888977, 'learning_rate': 0.0001535967983991996, 'epoch': 0.01}


 23%|██▎       | 2325/10000 [2:07:51<7:04:15,  3.32s/it]

{'loss': 0.7046, 'grad_norm': 0.36679911613464355, 'learning_rate': 0.00015357678839419712, 'epoch': 0.01}


 23%|██▎       | 2326/10000 [2:07:54<7:08:17,  3.35s/it]

{'loss': 0.6385, 'grad_norm': 0.3302171528339386, 'learning_rate': 0.0001535567783891946, 'epoch': 0.01}


 23%|██▎       | 2327/10000 [2:07:59<7:40:14,  3.60s/it]

{'loss': 1.0741, 'grad_norm': 0.3175525963306427, 'learning_rate': 0.00015353676838419212, 'epoch': 0.01}


 23%|██▎       | 2328/10000 [2:08:02<7:51:42,  3.69s/it]

{'loss': 1.0427, 'grad_norm': 0.3725601136684418, 'learning_rate': 0.0001535167583791896, 'epoch': 0.01}


 23%|██▎       | 2329/10000 [2:08:07<8:18:29,  3.90s/it]

{'loss': 0.8659, 'grad_norm': 0.3339688181877136, 'learning_rate': 0.0001534967483741871, 'epoch': 0.01}


 23%|██▎       | 2330/10000 [2:08:11<8:12:37,  3.85s/it]

{'loss': 1.1658, 'grad_norm': 0.4294949471950531, 'learning_rate': 0.00015347673836918458, 'epoch': 0.01}


 23%|██▎       | 2331/10000 [2:08:14<8:11:52,  3.85s/it]

{'loss': 0.8147, 'grad_norm': 0.3315899670124054, 'learning_rate': 0.0001534567283641821, 'epoch': 0.01}


 23%|██▎       | 2332/10000 [2:08:19<8:51:44,  4.16s/it]

{'loss': 1.0205, 'grad_norm': 0.32766425609588623, 'learning_rate': 0.00015343671835917958, 'epoch': 0.01}


 23%|██▎       | 2333/10000 [2:08:23<8:50:46,  4.15s/it]

{'loss': 0.977, 'grad_norm': 0.3808588683605194, 'learning_rate': 0.0001534167083541771, 'epoch': 0.01}


 23%|██▎       | 2334/10000 [2:08:27<8:31:34,  4.00s/it]

{'loss': 0.8636, 'grad_norm': 0.33362430334091187, 'learning_rate': 0.0001533966983491746, 'epoch': 0.01}


 23%|██▎       | 2335/10000 [2:08:34<10:19:11,  4.85s/it]

{'loss': 1.422, 'grad_norm': 0.35022205114364624, 'learning_rate': 0.0001533766883441721, 'epoch': 0.01}


 23%|██▎       | 2336/10000 [2:08:38<9:56:54,  4.67s/it]

{'loss': 1.0243, 'grad_norm': 0.3567958474159241, 'learning_rate': 0.0001533566783391696, 'epoch': 0.01}


 23%|██▎       | 2337/10000 [2:08:42<9:27:43,  4.45s/it]

{'loss': 0.6809, 'grad_norm': 0.3159368634223938, 'learning_rate': 0.0001533366683341671, 'epoch': 0.01}


 23%|██▎       | 2338/10000 [2:08:45<8:34:59,  4.03s/it]

{'loss': 0.879, 'grad_norm': 0.3700285255908966, 'learning_rate': 0.00015331665832916459, 'epoch': 0.01}


 23%|██▎       | 2339/10000 [2:08:48<7:55:52,  3.73s/it]

{'loss': 0.6696, 'grad_norm': 0.3829503357410431, 'learning_rate': 0.00015329664832416207, 'epoch': 0.01}


 23%|██▎       | 2340/10000 [2:08:52<7:54:07,  3.71s/it]

{'loss': 1.0862, 'grad_norm': 0.40711453557014465, 'learning_rate': 0.0001532766383191596, 'epoch': 0.01}


 23%|██▎       | 2341/10000 [2:08:57<8:33:50,  4.03s/it]

{'loss': 0.9085, 'grad_norm': 0.35139358043670654, 'learning_rate': 0.00015325662831415708, 'epoch': 0.01}


 23%|██▎       | 2342/10000 [2:09:00<8:26:01,  3.96s/it]

{'loss': 0.5384, 'grad_norm': 0.323822945356369, 'learning_rate': 0.0001532366183091546, 'epoch': 0.01}


 23%|██▎       | 2343/10000 [2:09:05<8:36:05,  4.04s/it]

{'loss': 0.838, 'grad_norm': 0.3586271405220032, 'learning_rate': 0.00015321660830415208, 'epoch': 0.01}


 23%|██▎       | 2344/10000 [2:09:08<8:15:26,  3.88s/it]

{'loss': 0.8458, 'grad_norm': 0.32221606373786926, 'learning_rate': 0.0001531965982991496, 'epoch': 0.01}


 23%|██▎       | 2345/10000 [2:09:12<8:24:11,  3.95s/it]

{'loss': 1.2615, 'grad_norm': 0.41308051347732544, 'learning_rate': 0.00015317658829414708, 'epoch': 0.01}


 23%|██▎       | 2346/10000 [2:09:16<7:57:47,  3.75s/it]

{'loss': 0.8158, 'grad_norm': 0.3782024383544922, 'learning_rate': 0.0001531565782891446, 'epoch': 0.01}


 23%|██▎       | 2347/10000 [2:09:18<7:26:30,  3.50s/it]

{'loss': 0.9377, 'grad_norm': 0.4017813205718994, 'learning_rate': 0.00015313656828414208, 'epoch': 0.01}


 23%|██▎       | 2348/10000 [2:09:22<7:41:23,  3.62s/it]

{'loss': 0.987, 'grad_norm': 0.3631085753440857, 'learning_rate': 0.00015311655827913957, 'epoch': 0.01}


 23%|██▎       | 2349/10000 [2:09:26<7:30:04,  3.53s/it]

{'loss': 0.9079, 'grad_norm': 0.4022698998451233, 'learning_rate': 0.00015309654827413708, 'epoch': 0.01}


 24%|██▎       | 2350/10000 [2:09:29<7:29:27,  3.53s/it]

{'loss': 1.3345, 'grad_norm': 0.4841307997703552, 'learning_rate': 0.00015307653826913457, 'epoch': 0.01}


 24%|██▎       | 2351/10000 [2:09:33<7:41:31,  3.62s/it]

{'loss': 0.9969, 'grad_norm': 0.39872851967811584, 'learning_rate': 0.00015305652826413208, 'epoch': 0.01}


 24%|██▎       | 2352/10000 [2:09:37<7:36:43,  3.58s/it]

{'loss': 0.7283, 'grad_norm': 0.3559131920337677, 'learning_rate': 0.00015303651825912957, 'epoch': 0.01}


 24%|██▎       | 2353/10000 [2:09:40<7:30:33,  3.54s/it]

{'loss': 0.9212, 'grad_norm': 0.3827008903026581, 'learning_rate': 0.00015301650825412709, 'epoch': 0.01}


 24%|██▎       | 2354/10000 [2:09:44<7:57:53,  3.75s/it]

{'loss': 0.9003, 'grad_norm': 0.3596690595149994, 'learning_rate': 0.00015299649824912457, 'epoch': 0.01}


 24%|██▎       | 2355/10000 [2:09:49<8:24:25,  3.96s/it]

{'loss': 0.9744, 'grad_norm': 0.3361451029777527, 'learning_rate': 0.0001529764882441221, 'epoch': 0.01}


 24%|██▎       | 2356/10000 [2:09:52<7:57:55,  3.75s/it]

{'loss': 0.7345, 'grad_norm': 0.3458235263824463, 'learning_rate': 0.00015295647823911955, 'epoch': 0.01}


 24%|██▎       | 2357/10000 [2:09:56<7:53:41,  3.72s/it]

{'loss': 1.0707, 'grad_norm': 0.4671443700790405, 'learning_rate': 0.00015293646823411706, 'epoch': 0.01}


 24%|██▎       | 2358/10000 [2:10:00<8:34:12,  4.04s/it]

{'loss': 0.853, 'grad_norm': 0.3034588396549225, 'learning_rate': 0.00015291645822911455, 'epoch': 0.01}


 24%|██▎       | 2359/10000 [2:10:04<8:15:12,  3.89s/it]

{'loss': 1.1355, 'grad_norm': 0.40782660245895386, 'learning_rate': 0.00015289644822411206, 'epoch': 0.01}


 24%|██▎       | 2360/10000 [2:10:08<8:18:58,  3.92s/it]

{'loss': 0.7269, 'grad_norm': 0.3543781340122223, 'learning_rate': 0.00015287643821910955, 'epoch': 0.01}


 24%|██▎       | 2361/10000 [2:10:13<8:52:44,  4.18s/it]

{'loss': 1.4454, 'grad_norm': 0.3546003997325897, 'learning_rate': 0.00015285642821410706, 'epoch': 0.01}


 24%|██▎       | 2362/10000 [2:10:17<8:42:03,  4.10s/it]

{'loss': 1.101, 'grad_norm': 0.40081512928009033, 'learning_rate': 0.00015283641820910458, 'epoch': 0.01}


 24%|██▎       | 2363/10000 [2:10:20<8:23:00,  3.95s/it]

{'loss': 1.0373, 'grad_norm': 0.40050187706947327, 'learning_rate': 0.00015281640820410207, 'epoch': 0.01}


 24%|██▎       | 2364/10000 [2:10:25<9:07:02,  4.30s/it]

{'loss': 0.8226, 'grad_norm': 0.3180185556411743, 'learning_rate': 0.00015279639819909955, 'epoch': 0.01}


 24%|██▎       | 2365/10000 [2:10:30<9:09:59,  4.32s/it]

{'loss': 1.0431, 'grad_norm': 0.3669346272945404, 'learning_rate': 0.00015277638819409704, 'epoch': 0.01}


 24%|██▎       | 2366/10000 [2:10:34<9:02:04,  4.26s/it]

{'loss': 0.7247, 'grad_norm': 0.31299570202827454, 'learning_rate': 0.00015275637818909456, 'epoch': 0.01}


 24%|██▎       | 2367/10000 [2:10:38<9:17:37,  4.38s/it]

{'loss': 1.0027, 'grad_norm': 0.48540958762168884, 'learning_rate': 0.00015273636818409204, 'epoch': 0.01}


 24%|██▎       | 2368/10000 [2:10:42<8:46:38,  4.14s/it]

{'loss': 1.2238, 'grad_norm': 0.42167577147483826, 'learning_rate': 0.00015271635817908956, 'epoch': 0.01}


 24%|██▎       | 2369/10000 [2:10:46<8:45:03,  4.13s/it]

{'loss': 0.9035, 'grad_norm': 0.3436067998409271, 'learning_rate': 0.00015269634817408704, 'epoch': 0.01}


 24%|██▎       | 2370/10000 [2:10:50<8:20:17,  3.93s/it]

{'loss': 1.2851, 'grad_norm': 0.4868624806404114, 'learning_rate': 0.00015267633816908456, 'epoch': 0.01}


 24%|██▎       | 2371/10000 [2:10:54<8:29:27,  4.01s/it]

{'loss': 0.9341, 'grad_norm': 0.3594681918621063, 'learning_rate': 0.00015265632816408205, 'epoch': 0.01}


 24%|██▎       | 2372/10000 [2:10:58<8:24:02,  3.96s/it]

{'loss': 1.1442, 'grad_norm': 0.43753841519355774, 'learning_rate': 0.00015263631815907956, 'epoch': 0.01}


 24%|██▎       | 2373/10000 [2:11:01<8:08:35,  3.84s/it]

{'loss': 0.8005, 'grad_norm': 0.37898069620132446, 'learning_rate': 0.00015261630815407705, 'epoch': 0.01}


 24%|██▎       | 2374/10000 [2:11:05<7:55:07,  3.74s/it]

{'loss': 1.1183, 'grad_norm': 0.3952663838863373, 'learning_rate': 0.00015259629814907454, 'epoch': 0.01}


 24%|██▍       | 2375/10000 [2:11:08<7:18:58,  3.45s/it]

{'loss': 0.7499, 'grad_norm': 0.4975021183490753, 'learning_rate': 0.00015257628814407202, 'epoch': 0.01}


 24%|██▍       | 2376/10000 [2:11:11<7:29:46,  3.54s/it]

{'loss': 0.83, 'grad_norm': 0.3342004120349884, 'learning_rate': 0.00015255627813906954, 'epoch': 0.01}


 24%|██▍       | 2377/10000 [2:11:15<7:24:23,  3.50s/it]

{'loss': 1.083, 'grad_norm': 0.38315877318382263, 'learning_rate': 0.00015253626813406705, 'epoch': 0.01}


 24%|██▍       | 2378/10000 [2:11:19<7:43:04,  3.65s/it]

{'loss': 0.8605, 'grad_norm': 0.36441653966903687, 'learning_rate': 0.00015251625812906454, 'epoch': 0.01}


 24%|██▍       | 2379/10000 [2:11:22<7:45:38,  3.67s/it]

{'loss': 0.9095, 'grad_norm': 0.4007730782032013, 'learning_rate': 0.00015249624812406205, 'epoch': 0.01}


 24%|██▍       | 2380/10000 [2:11:26<8:02:59,  3.80s/it]

{'loss': 0.7231, 'grad_norm': 0.3326410949230194, 'learning_rate': 0.00015247623811905954, 'epoch': 0.01}


 24%|██▍       | 2381/10000 [2:11:31<8:35:19,  4.06s/it]

{'loss': 1.0046, 'grad_norm': 0.33704113960266113, 'learning_rate': 0.00015245622811405705, 'epoch': 0.01}


 24%|██▍       | 2382/10000 [2:11:35<8:17:56,  3.92s/it]

{'loss': 0.7662, 'grad_norm': 0.37590447068214417, 'learning_rate': 0.00015243621810905454, 'epoch': 0.01}


 24%|██▍       | 2383/10000 [2:11:38<7:56:38,  3.75s/it]

{'loss': 1.2681, 'grad_norm': 0.4312708079814911, 'learning_rate': 0.00015241620810405203, 'epoch': 0.01}


 24%|██▍       | 2384/10000 [2:11:42<7:52:13,  3.72s/it]

{'loss': 0.8197, 'grad_norm': 0.3943769633769989, 'learning_rate': 0.00015239619809904952, 'epoch': 0.01}


 24%|██▍       | 2385/10000 [2:11:45<7:26:32,  3.52s/it]

{'loss': 1.2666, 'grad_norm': 0.4817608892917633, 'learning_rate': 0.00015237618809404703, 'epoch': 0.01}


 24%|██▍       | 2386/10000 [2:11:48<7:28:22,  3.53s/it]

{'loss': 0.9673, 'grad_norm': 0.35826075077056885, 'learning_rate': 0.00015235617808904452, 'epoch': 0.01}


 24%|██▍       | 2387/10000 [2:11:55<9:11:52,  4.35s/it]

{'loss': 1.27, 'grad_norm': 0.32248812913894653, 'learning_rate': 0.00015233616808404203, 'epoch': 0.01}


 24%|██▍       | 2388/10000 [2:11:58<8:37:33,  4.08s/it]

{'loss': 0.6818, 'grad_norm': 0.3308348059654236, 'learning_rate': 0.00015231615807903952, 'epoch': 0.01}


 24%|██▍       | 2389/10000 [2:12:02<8:29:57,  4.02s/it]

{'loss': 0.8947, 'grad_norm': 0.38265129923820496, 'learning_rate': 0.00015229614807403703, 'epoch': 0.01}


 24%|██▍       | 2390/10000 [2:12:06<8:15:03,  3.90s/it]

{'loss': 0.7279, 'grad_norm': 0.3136137127876282, 'learning_rate': 0.00015227613806903455, 'epoch': 0.01}


 24%|██▍       | 2391/10000 [2:12:09<7:57:45,  3.77s/it]

{'loss': 0.9789, 'grad_norm': 0.4306676387786865, 'learning_rate': 0.000152256128064032, 'epoch': 0.01}


 24%|██▍       | 2392/10000 [2:12:13<7:53:35,  3.73s/it]

{'loss': 0.7521, 'grad_norm': 0.3512960970401764, 'learning_rate': 0.00015223611805902952, 'epoch': 0.01}


 24%|██▍       | 2393/10000 [2:12:16<7:44:06,  3.66s/it]

{'loss': 0.8153, 'grad_norm': 0.38166654109954834, 'learning_rate': 0.000152216108054027, 'epoch': 0.01}


 24%|██▍       | 2394/10000 [2:12:20<7:49:31,  3.70s/it]

{'loss': 1.2509, 'grad_norm': 0.3928699195384979, 'learning_rate': 0.00015219609804902452, 'epoch': 0.01}


 24%|██▍       | 2395/10000 [2:12:23<7:06:51,  3.37s/it]

{'loss': 0.9153, 'grad_norm': 0.5029568672180176, 'learning_rate': 0.000152176088044022, 'epoch': 0.01}


 24%|██▍       | 2396/10000 [2:12:26<7:11:47,  3.41s/it]

{'loss': 1.1815, 'grad_norm': 0.39274823665618896, 'learning_rate': 0.00015215607803901953, 'epoch': 0.01}


 24%|██▍       | 2397/10000 [2:12:29<7:10:58,  3.40s/it]

{'loss': 0.9169, 'grad_norm': 0.3940888047218323, 'learning_rate': 0.000152136068034017, 'epoch': 0.01}


 24%|██▍       | 2398/10000 [2:12:33<7:17:21,  3.45s/it]

{'loss': 0.9111, 'grad_norm': 0.3286273181438446, 'learning_rate': 0.00015211605802901453, 'epoch': 0.01}


 24%|██▍       | 2399/10000 [2:12:36<7:16:19,  3.44s/it]

{'loss': 1.1975, 'grad_norm': 0.39094263315200806, 'learning_rate': 0.00015209604802401202, 'epoch': 0.01}


 24%|██▍       | 2400/10000 [2:12:40<7:10:30,  3.40s/it]

{'loss': 1.0631, 'grad_norm': 0.40215012431144714, 'learning_rate': 0.0001520760380190095, 'epoch': 0.01}


 24%|██▍       | 2401/10000 [2:12:44<7:49:26,  3.71s/it]

{'loss': 0.9559, 'grad_norm': 0.42156174778938293, 'learning_rate': 0.000152056028014007, 'epoch': 0.01}


 24%|██▍       | 2402/10000 [2:12:48<7:58:02,  3.78s/it]

{'loss': 0.9251, 'grad_norm': 0.36026257276535034, 'learning_rate': 0.0001520360180090045, 'epoch': 0.01}


 24%|██▍       | 2403/10000 [2:12:52<7:57:00,  3.77s/it]

{'loss': 0.8166, 'grad_norm': 0.35696491599082947, 'learning_rate': 0.000152016008004002, 'epoch': 0.01}


 24%|██▍       | 2404/10000 [2:12:54<7:05:04,  3.36s/it]

{'loss': 0.6469, 'grad_norm': 0.3641773462295532, 'learning_rate': 0.0001519959979989995, 'epoch': 0.01}


 24%|██▍       | 2405/10000 [2:12:57<6:50:20,  3.24s/it]

{'loss': 0.9401, 'grad_norm': 0.49908646941185, 'learning_rate': 0.00015197598799399702, 'epoch': 0.01}


 24%|██▍       | 2406/10000 [2:13:00<6:44:02,  3.19s/it]

{'loss': 0.8795, 'grad_norm': 0.3729507625102997, 'learning_rate': 0.0001519559779889945, 'epoch': 0.01}


 24%|██▍       | 2407/10000 [2:13:03<6:14:40,  2.96s/it]

{'loss': 0.5979, 'grad_norm': 0.3484936058521271, 'learning_rate': 0.00015193596798399202, 'epoch': 0.01}


 24%|██▍       | 2408/10000 [2:13:05<6:06:17,  2.89s/it]

{'loss': 0.9633, 'grad_norm': 0.4700157344341278, 'learning_rate': 0.0001519159579789895, 'epoch': 0.01}


 24%|██▍       | 2409/10000 [2:13:09<6:45:19,  3.20s/it]

{'loss': 0.9045, 'grad_norm': 0.36899203062057495, 'learning_rate': 0.000151895947973987, 'epoch': 0.01}


 24%|██▍       | 2410/10000 [2:13:13<6:43:00,  3.19s/it]

{'loss': 1.0818, 'grad_norm': 0.4272245168685913, 'learning_rate': 0.00015187593796898448, 'epoch': 0.01}


 24%|██▍       | 2411/10000 [2:13:16<6:51:37,  3.25s/it]

{'loss': 1.0011, 'grad_norm': 0.4020968973636627, 'learning_rate': 0.000151855927963982, 'epoch': 0.01}


 24%|██▍       | 2412/10000 [2:13:20<7:17:47,  3.46s/it]

{'loss': 1.0545, 'grad_norm': 0.3515734076499939, 'learning_rate': 0.00015183591795897949, 'epoch': 0.01}


 24%|██▍       | 2413/10000 [2:13:23<6:58:53,  3.31s/it]

{'loss': 1.0438, 'grad_norm': 0.39506247639656067, 'learning_rate': 0.000151815907953977, 'epoch': 0.01}


 24%|██▍       | 2414/10000 [2:13:27<7:40:20,  3.64s/it]

{'loss': 1.0766, 'grad_norm': 0.31401509046554565, 'learning_rate': 0.0001517958979489745, 'epoch': 0.01}


 24%|██▍       | 2415/10000 [2:13:30<7:16:50,  3.46s/it]

{'loss': 0.8703, 'grad_norm': 0.3416178524494171, 'learning_rate': 0.000151775887943972, 'epoch': 0.01}


 24%|██▍       | 2416/10000 [2:13:33<6:38:38,  3.15s/it]

{'loss': 0.7514, 'grad_norm': 0.4271354377269745, 'learning_rate': 0.0001517558779389695, 'epoch': 0.01}


 24%|██▍       | 2417/10000 [2:13:36<6:52:39,  3.27s/it]

{'loss': 0.9309, 'grad_norm': 0.3521256148815155, 'learning_rate': 0.000151735867933967, 'epoch': 0.01}


 24%|██▍       | 2418/10000 [2:13:39<6:46:46,  3.22s/it]

{'loss': 0.7899, 'grad_norm': 0.33998027443885803, 'learning_rate': 0.0001517158579289645, 'epoch': 0.01}


 24%|██▍       | 2419/10000 [2:13:42<6:31:38,  3.10s/it]

{'loss': 1.0023, 'grad_norm': 0.4037005603313446, 'learning_rate': 0.00015169584792396198, 'epoch': 0.01}


 24%|██▍       | 2420/10000 [2:13:45<6:19:56,  3.01s/it]

{'loss': 1.0282, 'grad_norm': 0.46540114283561707, 'learning_rate': 0.0001516758379189595, 'epoch': 0.01}


 24%|██▍       | 2421/10000 [2:13:49<6:43:12,  3.19s/it]

{'loss': 0.7986, 'grad_norm': 0.3239274024963379, 'learning_rate': 0.00015165582791395698, 'epoch': 0.01}


 24%|██▍       | 2422/10000 [2:13:52<6:37:12,  3.15s/it]

{'loss': 1.0941, 'grad_norm': 0.37135928869247437, 'learning_rate': 0.0001516358179089545, 'epoch': 0.01}


 24%|██▍       | 2423/10000 [2:13:55<6:51:16,  3.26s/it]

{'loss': 1.0606, 'grad_norm': 0.3835330307483673, 'learning_rate': 0.00015161580790395198, 'epoch': 0.01}


 24%|██▍       | 2424/10000 [2:13:59<7:00:32,  3.33s/it]

{'loss': 0.784, 'grad_norm': 0.3953121602535248, 'learning_rate': 0.0001515957978989495, 'epoch': 0.01}


 24%|██▍       | 2425/10000 [2:14:02<6:59:29,  3.32s/it]

{'loss': 0.6837, 'grad_norm': 0.3870489001274109, 'learning_rate': 0.00015157578789394698, 'epoch': 0.01}


 24%|██▍       | 2426/10000 [2:14:04<6:29:06,  3.08s/it]

{'loss': 1.0772, 'grad_norm': 0.4270593523979187, 'learning_rate': 0.00015155577788894447, 'epoch': 0.01}


 24%|██▍       | 2427/10000 [2:14:07<6:02:49,  2.87s/it]

{'loss': 0.9659, 'grad_norm': 0.4533591568470001, 'learning_rate': 0.00015153576788394196, 'epoch': 0.01}


 24%|██▍       | 2428/10000 [2:14:11<6:41:02,  3.18s/it]

{'loss': 0.9401, 'grad_norm': 0.4578661620616913, 'learning_rate': 0.00015151575787893947, 'epoch': 0.01}


 24%|██▍       | 2429/10000 [2:14:14<6:41:55,  3.19s/it]

{'loss': 0.8673, 'grad_norm': 0.3610674738883972, 'learning_rate': 0.00015149574787393696, 'epoch': 0.01}


 24%|██▍       | 2430/10000 [2:14:17<6:35:05,  3.13s/it]

{'loss': 0.8031, 'grad_norm': 0.42850661277770996, 'learning_rate': 0.00015147573786893447, 'epoch': 0.01}


 24%|██▍       | 2431/10000 [2:14:20<6:47:00,  3.23s/it]

{'loss': 1.1311, 'grad_norm': 0.3498913645744324, 'learning_rate': 0.000151455727863932, 'epoch': 0.01}


 24%|██▍       | 2432/10000 [2:14:24<7:15:57,  3.46s/it]

{'loss': 0.9962, 'grad_norm': 0.34179267287254333, 'learning_rate': 0.00015143571785892947, 'epoch': 0.01}


 24%|██▍       | 2433/10000 [2:14:27<6:49:24,  3.25s/it]

{'loss': 1.0069, 'grad_norm': 0.385728657245636, 'learning_rate': 0.000151415707853927, 'epoch': 0.01}


 24%|██▍       | 2434/10000 [2:14:30<6:35:58,  3.14s/it]

{'loss': 0.8269, 'grad_norm': 0.3581394553184509, 'learning_rate': 0.00015139569784892448, 'epoch': 0.01}


 24%|██▍       | 2435/10000 [2:14:33<6:38:30,  3.16s/it]

{'loss': 0.9438, 'grad_norm': 0.38985827565193176, 'learning_rate': 0.00015137568784392196, 'epoch': 0.01}


 24%|██▍       | 2436/10000 [2:14:37<6:51:42,  3.27s/it]

{'loss': 0.83, 'grad_norm': 0.37989673018455505, 'learning_rate': 0.00015135567783891945, 'epoch': 0.01}


 24%|██▍       | 2437/10000 [2:14:40<6:41:40,  3.19s/it]

{'loss': 1.0419, 'grad_norm': 0.3741821348667145, 'learning_rate': 0.00015133566783391697, 'epoch': 0.01}


 24%|██▍       | 2438/10000 [2:14:43<6:46:00,  3.22s/it]

{'loss': 0.9549, 'grad_norm': 0.35896843671798706, 'learning_rate': 0.00015131565782891445, 'epoch': 0.01}


 24%|██▍       | 2439/10000 [2:14:46<6:29:59,  3.09s/it]

{'loss': 0.9283, 'grad_norm': 0.4100765287876129, 'learning_rate': 0.00015129564782391197, 'epoch': 0.01}


 24%|██▍       | 2440/10000 [2:14:49<6:34:19,  3.13s/it]

{'loss': 0.9677, 'grad_norm': 0.408038467168808, 'learning_rate': 0.00015127563781890945, 'epoch': 0.01}


 24%|██▍       | 2441/10000 [2:14:52<6:09:03,  2.93s/it]

{'loss': 0.6271, 'grad_norm': 0.4056352376937866, 'learning_rate': 0.00015125562781390697, 'epoch': 0.01}


 24%|██▍       | 2442/10000 [2:14:54<6:05:52,  2.90s/it]

{'loss': 0.8582, 'grad_norm': 0.39272236824035645, 'learning_rate': 0.00015123561780890446, 'epoch': 0.01}


 24%|██▍       | 2443/10000 [2:15:01<8:12:51,  3.91s/it]

{'loss': 0.9446, 'grad_norm': 0.28500744700431824, 'learning_rate': 0.00015121560780390197, 'epoch': 0.01}


 24%|██▍       | 2444/10000 [2:15:04<7:55:40,  3.78s/it]

{'loss': 1.0775, 'grad_norm': 0.3726658821105957, 'learning_rate': 0.00015119559779889946, 'epoch': 0.01}


 24%|██▍       | 2445/10000 [2:15:07<7:13:59,  3.45s/it]

{'loss': 0.7987, 'grad_norm': 0.3842330574989319, 'learning_rate': 0.00015117558779389695, 'epoch': 0.01}


 24%|██▍       | 2446/10000 [2:15:10<6:54:24,  3.29s/it]

{'loss': 0.9114, 'grad_norm': 0.3958493769168854, 'learning_rate': 0.00015115557778889446, 'epoch': 0.01}


 24%|██▍       | 2447/10000 [2:15:13<7:00:12,  3.34s/it]

{'loss': 0.8624, 'grad_norm': 0.3396853804588318, 'learning_rate': 0.00015113556778389195, 'epoch': 0.01}


 24%|██▍       | 2448/10000 [2:15:17<7:02:34,  3.36s/it]

{'loss': 0.8168, 'grad_norm': 0.37073543667793274, 'learning_rate': 0.00015111555777888946, 'epoch': 0.01}


 24%|██▍       | 2449/10000 [2:15:21<7:27:26,  3.56s/it]

{'loss': 1.0452, 'grad_norm': 0.35519859194755554, 'learning_rate': 0.00015109554777388695, 'epoch': 0.01}


 24%|██▍       | 2450/10000 [2:15:24<7:03:17,  3.36s/it]

{'loss': 0.8091, 'grad_norm': 0.4301731288433075, 'learning_rate': 0.00015107553776888446, 'epoch': 0.01}


 25%|██▍       | 2451/10000 [2:15:27<7:17:04,  3.47s/it]

{'loss': 0.7613, 'grad_norm': 0.32898062467575073, 'learning_rate': 0.00015105552776388195, 'epoch': 0.01}


 25%|██▍       | 2452/10000 [2:15:30<7:05:35,  3.38s/it]

{'loss': 0.8038, 'grad_norm': 0.36468008160591125, 'learning_rate': 0.00015103551775887946, 'epoch': 0.01}


 25%|██▍       | 2453/10000 [2:15:34<7:00:02,  3.34s/it]

{'loss': 1.0684, 'grad_norm': 0.39699387550354004, 'learning_rate': 0.00015101550775387695, 'epoch': 0.01}


 25%|██▍       | 2454/10000 [2:15:38<7:42:39,  3.68s/it]

{'loss': 0.9374, 'grad_norm': 0.34194615483283997, 'learning_rate': 0.00015099549774887444, 'epoch': 0.01}


 25%|██▍       | 2455/10000 [2:15:41<7:20:57,  3.51s/it]

{'loss': 0.8154, 'grad_norm': 0.3764342963695526, 'learning_rate': 0.00015097548774387193, 'epoch': 0.01}


 25%|██▍       | 2456/10000 [2:15:44<6:56:54,  3.32s/it]

{'loss': 0.8346, 'grad_norm': 0.3940945565700531, 'learning_rate': 0.00015095547773886944, 'epoch': 0.01}


 25%|██▍       | 2457/10000 [2:15:48<7:14:12,  3.45s/it]

{'loss': 0.9168, 'grad_norm': 0.33913370966911316, 'learning_rate': 0.00015093546773386693, 'epoch': 0.01}


 25%|██▍       | 2458/10000 [2:15:51<6:56:51,  3.32s/it]

{'loss': 0.993, 'grad_norm': 0.42084431648254395, 'learning_rate': 0.00015091545772886444, 'epoch': 0.01}


 25%|██▍       | 2459/10000 [2:15:55<7:11:26,  3.43s/it]

{'loss': 0.9628, 'grad_norm': 0.3397386968135834, 'learning_rate': 0.00015089544772386196, 'epoch': 0.01}


 25%|██▍       | 2460/10000 [2:15:58<7:08:18,  3.41s/it]

{'loss': 0.7766, 'grad_norm': 0.3697040379047394, 'learning_rate': 0.00015087543771885944, 'epoch': 0.01}


 25%|██▍       | 2461/10000 [2:16:01<7:01:14,  3.35s/it]

{'loss': 0.8569, 'grad_norm': 0.36394402384757996, 'learning_rate': 0.00015085542771385696, 'epoch': 0.01}


 25%|██▍       | 2462/10000 [2:16:05<7:33:58,  3.61s/it]

{'loss': 1.1876, 'grad_norm': 0.36089974641799927, 'learning_rate': 0.00015083541770885442, 'epoch': 0.01}


 25%|██▍       | 2463/10000 [2:16:09<7:16:23,  3.47s/it]

{'loss': 1.1324, 'grad_norm': 0.4246528744697571, 'learning_rate': 0.00015081540770385193, 'epoch': 0.01}


 25%|██▍       | 2464/10000 [2:16:14<8:36:10,  4.11s/it]

{'loss': 1.1461, 'grad_norm': 0.3111001253128052, 'learning_rate': 0.00015079539769884942, 'epoch': 0.01}


 25%|██▍       | 2465/10000 [2:16:17<7:53:17,  3.77s/it]

{'loss': 0.7697, 'grad_norm': 0.3905475437641144, 'learning_rate': 0.00015077538769384693, 'epoch': 0.01}


 25%|██▍       | 2466/10000 [2:16:20<7:18:48,  3.49s/it]

{'loss': 0.9616, 'grad_norm': 0.4414423108100891, 'learning_rate': 0.00015075537768884442, 'epoch': 0.01}


 25%|██▍       | 2467/10000 [2:16:24<7:28:56,  3.58s/it]

{'loss': 0.9226, 'grad_norm': 0.36115339398384094, 'learning_rate': 0.00015073536768384194, 'epoch': 0.01}


 25%|██▍       | 2468/10000 [2:16:28<7:43:17,  3.69s/it]

{'loss': 0.8965, 'grad_norm': 0.44455647468566895, 'learning_rate': 0.00015071535767883942, 'epoch': 0.01}


 25%|██▍       | 2469/10000 [2:16:31<7:34:16,  3.62s/it]

{'loss': 0.9844, 'grad_norm': 0.35015860199928284, 'learning_rate': 0.00015069534767383694, 'epoch': 0.01}


 25%|██▍       | 2470/10000 [2:16:35<7:52:31,  3.77s/it]

{'loss': 0.7193, 'grad_norm': 0.3446229100227356, 'learning_rate': 0.00015067533766883443, 'epoch': 0.01}


 25%|██▍       | 2471/10000 [2:16:38<7:32:31,  3.61s/it]

{'loss': 0.6346, 'grad_norm': 0.37645503878593445, 'learning_rate': 0.0001506553276638319, 'epoch': 0.01}


 25%|██▍       | 2472/10000 [2:16:42<7:13:37,  3.46s/it]

{'loss': 0.9389, 'grad_norm': 0.3736218810081482, 'learning_rate': 0.0001506353176588294, 'epoch': 0.01}


 25%|██▍       | 2473/10000 [2:16:44<6:52:57,  3.29s/it]

{'loss': 0.9561, 'grad_norm': 0.43647894263267517, 'learning_rate': 0.00015061530765382691, 'epoch': 0.01}


 25%|██▍       | 2474/10000 [2:16:49<7:29:03,  3.58s/it]

{'loss': 0.9611, 'grad_norm': 0.3510931730270386, 'learning_rate': 0.00015059529764882443, 'epoch': 0.01}


 25%|██▍       | 2475/10000 [2:16:53<7:53:38,  3.78s/it]

{'loss': 1.0064, 'grad_norm': 0.357022225856781, 'learning_rate': 0.00015057528764382192, 'epoch': 0.01}


 25%|██▍       | 2476/10000 [2:16:56<7:23:31,  3.54s/it]

{'loss': 0.953, 'grad_norm': 0.39368000626564026, 'learning_rate': 0.00015055527763881943, 'epoch': 0.01}


 25%|██▍       | 2477/10000 [2:17:01<8:04:58,  3.87s/it]

{'loss': 1.3709, 'grad_norm': 0.3755457103252411, 'learning_rate': 0.00015053526763381692, 'epoch': 0.01}


 25%|██▍       | 2478/10000 [2:17:05<8:14:27,  3.94s/it]

{'loss': 1.3273, 'grad_norm': 0.34987884759902954, 'learning_rate': 0.00015051525762881443, 'epoch': 0.01}


 25%|██▍       | 2479/10000 [2:17:07<7:19:58,  3.51s/it]

{'loss': 0.5453, 'grad_norm': 0.3562852144241333, 'learning_rate': 0.00015049524762381192, 'epoch': 0.01}


 25%|██▍       | 2480/10000 [2:17:11<7:44:50,  3.71s/it]

{'loss': 1.0648, 'grad_norm': 0.33480292558670044, 'learning_rate': 0.0001504752376188094, 'epoch': 0.01}


 25%|██▍       | 2481/10000 [2:17:14<7:09:33,  3.43s/it]

{'loss': 0.8789, 'grad_norm': 0.40437936782836914, 'learning_rate': 0.0001504552276138069, 'epoch': 0.01}


 25%|██▍       | 2482/10000 [2:17:17<7:05:02,  3.39s/it]

{'loss': 0.8365, 'grad_norm': 0.3341946601867676, 'learning_rate': 0.0001504352176088044, 'epoch': 0.01}


 25%|██▍       | 2483/10000 [2:17:21<7:01:08,  3.36s/it]

{'loss': 0.9442, 'grad_norm': 0.40098851919174194, 'learning_rate': 0.0001504152076038019, 'epoch': 0.01}


 25%|██▍       | 2484/10000 [2:17:24<6:51:38,  3.29s/it]

{'loss': 0.8551, 'grad_norm': 0.3772808611392975, 'learning_rate': 0.0001503951975987994, 'epoch': 0.01}


 25%|██▍       | 2485/10000 [2:17:28<7:40:30,  3.68s/it]

{'loss': 0.8809, 'grad_norm': 0.3304588794708252, 'learning_rate': 0.0001503751875937969, 'epoch': 0.01}


 25%|██▍       | 2486/10000 [2:17:33<8:07:08,  3.89s/it]

{'loss': 1.4074, 'grad_norm': 0.3786621689796448, 'learning_rate': 0.0001503551775887944, 'epoch': 0.01}


 25%|██▍       | 2487/10000 [2:17:36<7:33:54,  3.62s/it]

{'loss': 0.7467, 'grad_norm': 0.3847319185733795, 'learning_rate': 0.00015033516758379193, 'epoch': 0.01}


 25%|██▍       | 2488/10000 [2:17:39<7:22:38,  3.54s/it]

{'loss': 0.7559, 'grad_norm': 0.34889981150627136, 'learning_rate': 0.0001503151575787894, 'epoch': 0.01}


 25%|██▍       | 2489/10000 [2:17:44<7:52:40,  3.78s/it]

{'loss': 1.161, 'grad_norm': 0.3398400545120239, 'learning_rate': 0.0001502951475737869, 'epoch': 0.01}


 25%|██▍       | 2490/10000 [2:17:47<8:00:26,  3.84s/it]

{'loss': 1.0041, 'grad_norm': 0.3875654935836792, 'learning_rate': 0.0001502751375687844, 'epoch': 0.01}


 25%|██▍       | 2491/10000 [2:17:51<7:39:49,  3.67s/it]

{'loss': 0.8381, 'grad_norm': 0.36018049716949463, 'learning_rate': 0.0001502551275637819, 'epoch': 0.01}


 25%|██▍       | 2492/10000 [2:17:54<7:26:34,  3.57s/it]

{'loss': 0.9826, 'grad_norm': 0.42849934101104736, 'learning_rate': 0.0001502351175587794, 'epoch': 0.01}


 25%|██▍       | 2493/10000 [2:17:58<7:30:16,  3.60s/it]

{'loss': 0.7553, 'grad_norm': 0.35220545530319214, 'learning_rate': 0.0001502151075537769, 'epoch': 0.01}


 25%|██▍       | 2494/10000 [2:18:01<7:04:30,  3.39s/it]

{'loss': 0.6972, 'grad_norm': 0.38842278718948364, 'learning_rate': 0.0001501950975487744, 'epoch': 0.01}


 25%|██▍       | 2495/10000 [2:18:05<7:35:01,  3.64s/it]

{'loss': 0.9396, 'grad_norm': 0.34843191504478455, 'learning_rate': 0.0001501750875437719, 'epoch': 0.01}


 25%|██▍       | 2496/10000 [2:18:08<7:07:06,  3.42s/it]

{'loss': 0.9748, 'grad_norm': 0.4314466118812561, 'learning_rate': 0.0001501550775387694, 'epoch': 0.01}


 25%|██▍       | 2497/10000 [2:18:12<7:24:57,  3.56s/it]

{'loss': 0.8021, 'grad_norm': 0.3135749101638794, 'learning_rate': 0.00015013506753376688, 'epoch': 0.01}


 25%|██▍       | 2498/10000 [2:18:15<7:12:50,  3.46s/it]

{'loss': 1.0796, 'grad_norm': 0.3690735995769501, 'learning_rate': 0.00015011505752876437, 'epoch': 0.01}


 25%|██▍       | 2499/10000 [2:18:18<7:09:52,  3.44s/it]

{'loss': 0.9352, 'grad_norm': 0.3720654547214508, 'learning_rate': 0.00015009504752376188, 'epoch': 0.01}


 25%|██▌       | 2500/10000 [2:18:22<7:16:46,  3.49s/it]

{'loss': 0.713, 'grad_norm': 0.35036665201187134, 'learning_rate': 0.00015007503751875937, 'epoch': 0.01}


 25%|██▌       | 2501/10000 [2:18:28<8:50:38,  4.25s/it]

{'loss': 0.8502, 'grad_norm': 0.3232753872871399, 'learning_rate': 0.00015005502751375688, 'epoch': 0.01}


 25%|██▌       | 2502/10000 [2:18:31<8:12:35,  3.94s/it]

{'loss': 0.882, 'grad_norm': 0.38938114047050476, 'learning_rate': 0.0001500350175087544, 'epoch': 0.01}


 25%|██▌       | 2503/10000 [2:18:34<7:49:24,  3.76s/it]

{'loss': 0.9226, 'grad_norm': 0.3544483780860901, 'learning_rate': 0.00015001500750375188, 'epoch': 0.01}


 25%|██▌       | 2504/10000 [2:18:39<8:03:34,  3.87s/it]

{'loss': 0.893, 'grad_norm': 0.3684040307998657, 'learning_rate': 0.0001499949974987494, 'epoch': 0.01}


 25%|██▌       | 2505/10000 [2:18:42<7:41:54,  3.70s/it]

{'loss': 1.3946, 'grad_norm': 0.443325400352478, 'learning_rate': 0.0001499749874937469, 'epoch': 0.01}


 25%|██▌       | 2506/10000 [2:18:45<7:08:05,  3.43s/it]

{'loss': 0.8478, 'grad_norm': 0.3922273814678192, 'learning_rate': 0.00014995497748874437, 'epoch': 0.01}


 25%|██▌       | 2507/10000 [2:18:48<6:58:53,  3.35s/it]

{'loss': 1.0952, 'grad_norm': 0.39515823125839233, 'learning_rate': 0.00014993496748374186, 'epoch': 0.01}


 25%|██▌       | 2508/10000 [2:18:51<6:35:59,  3.17s/it]

{'loss': 0.829, 'grad_norm': 0.3925773799419403, 'learning_rate': 0.00014991495747873938, 'epoch': 0.01}


 25%|██▌       | 2509/10000 [2:18:54<7:01:11,  3.37s/it]

{'loss': 0.73, 'grad_norm': 0.3666611909866333, 'learning_rate': 0.00014989494747373686, 'epoch': 0.01}


 25%|██▌       | 2510/10000 [2:18:58<7:09:09,  3.44s/it]

{'loss': 0.844, 'grad_norm': 0.35450249910354614, 'learning_rate': 0.00014987493746873438, 'epoch': 0.01}


 25%|██▌       | 2511/10000 [2:19:02<7:12:57,  3.47s/it]

{'loss': 0.8305, 'grad_norm': 0.33641380071640015, 'learning_rate': 0.00014985492746373186, 'epoch': 0.01}


 25%|██▌       | 2512/10000 [2:19:06<7:28:40,  3.60s/it]

{'loss': 0.8878, 'grad_norm': 0.3766987919807434, 'learning_rate': 0.00014983491745872938, 'epoch': 0.01}


 25%|██▌       | 2513/10000 [2:19:09<7:17:52,  3.51s/it]

{'loss': 1.0259, 'grad_norm': 0.4377650022506714, 'learning_rate': 0.00014981490745372687, 'epoch': 0.01}


 25%|██▌       | 2514/10000 [2:19:13<7:33:57,  3.64s/it]

{'loss': 0.9185, 'grad_norm': 0.36224040389060974, 'learning_rate': 0.00014979489744872438, 'epoch': 0.01}


 25%|██▌       | 2515/10000 [2:19:15<7:00:22,  3.37s/it]

{'loss': 0.6178, 'grad_norm': 0.3736652731895447, 'learning_rate': 0.00014977488744372187, 'epoch': 0.01}


 25%|██▌       | 2516/10000 [2:19:19<7:12:09,  3.46s/it]

{'loss': 0.7701, 'grad_norm': 0.3163284659385681, 'learning_rate': 0.00014975487743871936, 'epoch': 0.01}


 25%|██▌       | 2517/10000 [2:19:22<6:42:11,  3.22s/it]

{'loss': 1.0196, 'grad_norm': 0.4018303155899048, 'learning_rate': 0.00014973486743371687, 'epoch': 0.01}


 25%|██▌       | 2518/10000 [2:19:25<6:56:47,  3.34s/it]

{'loss': 1.0901, 'grad_norm': 0.40319618582725525, 'learning_rate': 0.00014971485742871436, 'epoch': 0.01}


 25%|██▌       | 2519/10000 [2:19:28<6:39:27,  3.20s/it]

{'loss': 0.7132, 'grad_norm': 0.37864869832992554, 'learning_rate': 0.00014969484742371187, 'epoch': 0.01}


 25%|██▌       | 2520/10000 [2:19:33<7:37:46,  3.67s/it]

{'loss': 0.7494, 'grad_norm': 0.28408658504486084, 'learning_rate': 0.00014967483741870936, 'epoch': 0.01}


 25%|██▌       | 2521/10000 [2:19:37<7:55:31,  3.81s/it]

{'loss': 0.8912, 'grad_norm': 0.32845088839530945, 'learning_rate': 0.00014965482741370687, 'epoch': 0.01}


 25%|██▌       | 2522/10000 [2:19:40<7:20:33,  3.53s/it]

{'loss': 1.0354, 'grad_norm': 0.4095049798488617, 'learning_rate': 0.00014963481740870436, 'epoch': 0.01}


 25%|██▌       | 2523/10000 [2:19:44<7:32:04,  3.63s/it]

{'loss': 0.754, 'grad_norm': 0.4016835689544678, 'learning_rate': 0.00014961480740370187, 'epoch': 0.01}


 25%|██▌       | 2524/10000 [2:19:48<7:40:40,  3.70s/it]

{'loss': 0.6532, 'grad_norm': 0.33241134881973267, 'learning_rate': 0.00014959479739869936, 'epoch': 0.01}


 25%|██▌       | 2525/10000 [2:19:51<7:06:07,  3.42s/it]

{'loss': 0.9919, 'grad_norm': 0.41800323128700256, 'learning_rate': 0.00014957478739369685, 'epoch': 0.01}


 25%|██▌       | 2526/10000 [2:19:55<7:41:38,  3.71s/it]

{'loss': 0.8297, 'grad_norm': 0.2944827675819397, 'learning_rate': 0.00014955477738869434, 'epoch': 0.01}


 25%|██▌       | 2527/10000 [2:19:59<7:45:37,  3.74s/it]

{'loss': 0.9453, 'grad_norm': 0.33541157841682434, 'learning_rate': 0.00014953476738369185, 'epoch': 0.01}


 25%|██▌       | 2528/10000 [2:20:02<7:22:02,  3.55s/it]

{'loss': 1.3555, 'grad_norm': 0.4619620442390442, 'learning_rate': 0.00014951475737868934, 'epoch': 0.01}


 25%|██▌       | 2529/10000 [2:20:05<7:22:48,  3.56s/it]

{'loss': 0.8151, 'grad_norm': 0.34440404176712036, 'learning_rate': 0.00014949474737368685, 'epoch': 0.01}


 25%|██▌       | 2530/10000 [2:20:08<6:53:09,  3.32s/it]

{'loss': 0.7188, 'grad_norm': 0.4117242097854614, 'learning_rate': 0.00014947473736868437, 'epoch': 0.01}


 25%|██▌       | 2531/10000 [2:20:11<6:32:12,  3.15s/it]

{'loss': 0.8768, 'grad_norm': 0.45804286003112793, 'learning_rate': 0.00014945472736368185, 'epoch': 0.01}


 25%|██▌       | 2532/10000 [2:20:14<6:37:32,  3.19s/it]

{'loss': 0.8671, 'grad_norm': 0.37566277384757996, 'learning_rate': 0.00014943471735867934, 'epoch': 0.01}


 25%|██▌       | 2533/10000 [2:20:18<6:50:48,  3.30s/it]

{'loss': 0.9729, 'grad_norm': 0.34714698791503906, 'learning_rate': 0.00014941470735367683, 'epoch': 0.01}


 25%|██▌       | 2534/10000 [2:20:22<7:21:01,  3.54s/it]

{'loss': 0.8166, 'grad_norm': 0.34380286931991577, 'learning_rate': 0.00014939469734867434, 'epoch': 0.01}


 25%|██▌       | 2535/10000 [2:20:26<7:53:51,  3.81s/it]

{'loss': 1.0305, 'grad_norm': 0.3362804651260376, 'learning_rate': 0.00014937468734367183, 'epoch': 0.01}


 25%|██▌       | 2536/10000 [2:20:29<7:22:35,  3.56s/it]

{'loss': 0.7136, 'grad_norm': 0.34386730194091797, 'learning_rate': 0.00014935467733866934, 'epoch': 0.01}


 25%|██▌       | 2537/10000 [2:20:34<8:03:53,  3.89s/it]

{'loss': 1.1346, 'grad_norm': 0.3383735716342926, 'learning_rate': 0.00014933466733366683, 'epoch': 0.01}


 25%|██▌       | 2538/10000 [2:20:38<7:58:59,  3.85s/it]

{'loss': 0.9693, 'grad_norm': 0.39450952410697937, 'learning_rate': 0.00014931465732866435, 'epoch': 0.01}


 25%|██▌       | 2539/10000 [2:20:41<7:26:17,  3.59s/it]

{'loss': 0.6977, 'grad_norm': 0.41253870725631714, 'learning_rate': 0.00014929464732366183, 'epoch': 0.01}


 25%|██▌       | 2540/10000 [2:20:46<8:23:24,  4.05s/it]

{'loss': 0.7973, 'grad_norm': 0.3598051369190216, 'learning_rate': 0.00014927463731865935, 'epoch': 0.01}


 25%|██▌       | 2541/10000 [2:20:49<8:01:57,  3.88s/it]

{'loss': 0.7537, 'grad_norm': 0.3531191051006317, 'learning_rate': 0.00014925462731365684, 'epoch': 0.01}


 25%|██▌       | 2542/10000 [2:20:53<8:11:32,  3.95s/it]

{'loss': 0.8728, 'grad_norm': 0.3400598168373108, 'learning_rate': 0.00014923461730865432, 'epoch': 0.01}


 25%|██▌       | 2543/10000 [2:20:58<8:27:22,  4.08s/it]

{'loss': 0.8653, 'grad_norm': 0.317705363035202, 'learning_rate': 0.00014921460730365184, 'epoch': 0.01}


 25%|██▌       | 2544/10000 [2:21:02<8:24:23,  4.06s/it]

{'loss': 0.845, 'grad_norm': 0.337985098361969, 'learning_rate': 0.00014919459729864932, 'epoch': 0.01}


 25%|██▌       | 2545/10000 [2:21:06<8:08:51,  3.93s/it]

{'loss': 0.6639, 'grad_norm': 0.35379642248153687, 'learning_rate': 0.00014917458729364684, 'epoch': 0.01}


 25%|██▌       | 2546/10000 [2:21:09<8:06:12,  3.91s/it]

{'loss': 0.8561, 'grad_norm': 0.31870517134666443, 'learning_rate': 0.00014915457728864433, 'epoch': 0.01}


 25%|██▌       | 2547/10000 [2:21:13<7:52:39,  3.81s/it]

{'loss': 1.1381, 'grad_norm': 0.42454496026039124, 'learning_rate': 0.00014913456728364184, 'epoch': 0.01}


 25%|██▌       | 2548/10000 [2:21:16<7:21:58,  3.56s/it]

{'loss': 0.8618, 'grad_norm': 0.46921631693840027, 'learning_rate': 0.00014911455727863933, 'epoch': 0.01}


 25%|██▌       | 2549/10000 [2:21:19<7:13:39,  3.49s/it]

{'loss': 0.8526, 'grad_norm': 0.38471463322639465, 'learning_rate': 0.00014909454727363684, 'epoch': 0.01}


 26%|██▌       | 2550/10000 [2:21:22<6:55:56,  3.35s/it]

{'loss': 0.9679, 'grad_norm': 0.41277071833610535, 'learning_rate': 0.00014907453726863433, 'epoch': 0.01}


 26%|██▌       | 2551/10000 [2:21:25<6:42:38,  3.24s/it]

{'loss': 0.958, 'grad_norm': 0.38420823216438293, 'learning_rate': 0.00014905452726363182, 'epoch': 0.01}


 26%|██▌       | 2552/10000 [2:21:30<7:30:36,  3.63s/it]

{'loss': 1.1769, 'grad_norm': 0.3496890962123871, 'learning_rate': 0.0001490345172586293, 'epoch': 0.01}


 26%|██▌       | 2553/10000 [2:21:35<8:18:06,  4.01s/it]

{'loss': 1.1716, 'grad_norm': 0.3246352970600128, 'learning_rate': 0.00014901450725362682, 'epoch': 0.01}


 26%|██▌       | 2554/10000 [2:21:38<7:46:58,  3.76s/it]

{'loss': 0.8481, 'grad_norm': 0.35617202520370483, 'learning_rate': 0.0001489944972486243, 'epoch': 0.01}


 26%|██▌       | 2555/10000 [2:21:43<8:31:14,  4.12s/it]

{'loss': 1.0218, 'grad_norm': 0.31296318769454956, 'learning_rate': 0.00014897448724362182, 'epoch': 0.01}


 26%|██▌       | 2556/10000 [2:21:46<8:08:14,  3.94s/it]

{'loss': 0.9152, 'grad_norm': 0.3468243479728699, 'learning_rate': 0.00014895447723861933, 'epoch': 0.01}


 26%|██▌       | 2557/10000 [2:21:50<8:10:50,  3.96s/it]

{'loss': 1.065, 'grad_norm': 0.32228201627731323, 'learning_rate': 0.00014893446723361682, 'epoch': 0.01}


 26%|██▌       | 2558/10000 [2:21:54<8:00:46,  3.88s/it]

{'loss': 1.0573, 'grad_norm': 0.38377583026885986, 'learning_rate': 0.00014891445722861434, 'epoch': 0.01}


 26%|██▌       | 2559/10000 [2:21:58<8:12:02,  3.97s/it]

{'loss': 0.8655, 'grad_norm': 0.3404761850833893, 'learning_rate': 0.00014889444722361182, 'epoch': 0.01}


 26%|██▌       | 2560/10000 [2:22:01<7:32:32,  3.65s/it]

{'loss': 0.7232, 'grad_norm': 0.3667336404323578, 'learning_rate': 0.0001488744372186093, 'epoch': 0.01}


 26%|██▌       | 2561/10000 [2:22:04<7:11:52,  3.48s/it]

{'loss': 0.8109, 'grad_norm': 0.36148086190223694, 'learning_rate': 0.0001488544272136068, 'epoch': 0.01}


 26%|██▌       | 2562/10000 [2:22:08<7:29:43,  3.63s/it]

{'loss': 0.6521, 'grad_norm': 0.29946205019950867, 'learning_rate': 0.0001488344172086043, 'epoch': 0.01}


 26%|██▌       | 2563/10000 [2:22:12<7:20:58,  3.56s/it]

{'loss': 0.7878, 'grad_norm': 0.3700541853904724, 'learning_rate': 0.0001488144072036018, 'epoch': 0.01}


 26%|██▌       | 2564/10000 [2:22:15<7:07:56,  3.45s/it]

{'loss': 0.6122, 'grad_norm': 0.37424328923225403, 'learning_rate': 0.00014879439719859931, 'epoch': 0.01}


 26%|██▌       | 2565/10000 [2:22:19<7:27:09,  3.61s/it]

{'loss': 1.3073, 'grad_norm': 0.37765273451805115, 'learning_rate': 0.0001487743871935968, 'epoch': 0.01}


 26%|██▌       | 2566/10000 [2:22:23<7:37:00,  3.69s/it]

{'loss': 0.876, 'grad_norm': 0.32205966114997864, 'learning_rate': 0.00014875437718859432, 'epoch': 0.01}


 26%|██▌       | 2567/10000 [2:22:26<7:06:48,  3.45s/it]

{'loss': 0.8219, 'grad_norm': 0.4387274384498596, 'learning_rate': 0.0001487343671835918, 'epoch': 0.01}


 26%|██▌       | 2568/10000 [2:22:29<6:57:56,  3.37s/it]

{'loss': 0.7302, 'grad_norm': 0.40223589539527893, 'learning_rate': 0.0001487143571785893, 'epoch': 0.01}


 26%|██▌       | 2569/10000 [2:22:32<6:40:12,  3.23s/it]

{'loss': 0.914, 'grad_norm': 0.4187125563621521, 'learning_rate': 0.00014869434717358678, 'epoch': 0.01}


 26%|██▌       | 2570/10000 [2:22:36<7:09:54,  3.47s/it]

{'loss': 0.7819, 'grad_norm': 0.32813364267349243, 'learning_rate': 0.0001486743371685843, 'epoch': 0.01}


 26%|██▌       | 2571/10000 [2:22:40<7:29:59,  3.63s/it]

{'loss': 0.8906, 'grad_norm': 0.39647403359413147, 'learning_rate': 0.0001486543271635818, 'epoch': 0.01}


 26%|██▌       | 2572/10000 [2:22:44<7:38:46,  3.71s/it]

{'loss': 0.8467, 'grad_norm': 0.4014883041381836, 'learning_rate': 0.0001486343171585793, 'epoch': 0.01}


 26%|██▌       | 2573/10000 [2:22:47<7:14:44,  3.51s/it]

{'loss': 1.0025, 'grad_norm': 0.3839433491230011, 'learning_rate': 0.0001486143071535768, 'epoch': 0.01}


 26%|██▌       | 2574/10000 [2:22:50<7:00:01,  3.39s/it]

{'loss': 0.7074, 'grad_norm': 0.3967689871788025, 'learning_rate': 0.0001485942971485743, 'epoch': 0.01}


 26%|██▌       | 2575/10000 [2:22:53<6:42:36,  3.25s/it]

{'loss': 1.0808, 'grad_norm': 0.47956278920173645, 'learning_rate': 0.0001485742871435718, 'epoch': 0.01}


 26%|██▌       | 2576/10000 [2:22:56<6:35:53,  3.20s/it]

{'loss': 0.9406, 'grad_norm': 0.42327022552490234, 'learning_rate': 0.0001485542771385693, 'epoch': 0.01}


 26%|██▌       | 2577/10000 [2:22:58<6:19:48,  3.07s/it]

{'loss': 0.6754, 'grad_norm': 0.36510607600212097, 'learning_rate': 0.00014853426713356678, 'epoch': 0.01}


 26%|██▌       | 2578/10000 [2:23:01<6:16:37,  3.04s/it]

{'loss': 0.8962, 'grad_norm': 0.38631191849708557, 'learning_rate': 0.00014851425712856427, 'epoch': 0.01}


 26%|██▌       | 2579/10000 [2:23:05<6:52:17,  3.33s/it]

{'loss': 0.6683, 'grad_norm': 0.319804310798645, 'learning_rate': 0.00014849424712356179, 'epoch': 0.01}


 26%|██▌       | 2580/10000 [2:23:09<7:10:14,  3.48s/it]

{'loss': 0.9319, 'grad_norm': 0.34986037015914917, 'learning_rate': 0.00014847423711855927, 'epoch': 0.01}


 26%|██▌       | 2581/10000 [2:23:13<7:12:04,  3.49s/it]

{'loss': 0.7629, 'grad_norm': 0.36765575408935547, 'learning_rate': 0.0001484542271135568, 'epoch': 0.01}


 26%|██▌       | 2582/10000 [2:23:17<7:27:08,  3.62s/it]

{'loss': 1.2012, 'grad_norm': 0.36209836602211, 'learning_rate': 0.00014843421710855427, 'epoch': 0.01}


 26%|██▌       | 2583/10000 [2:23:21<7:49:32,  3.80s/it]

{'loss': 1.1538, 'grad_norm': 0.38864368200302124, 'learning_rate': 0.0001484142071035518, 'epoch': 0.01}


 26%|██▌       | 2584/10000 [2:23:25<7:59:41,  3.88s/it]

{'loss': 0.9192, 'grad_norm': 0.3534688949584961, 'learning_rate': 0.0001483941970985493, 'epoch': 0.01}


 26%|██▌       | 2585/10000 [2:23:29<7:57:12,  3.86s/it]

{'loss': 0.8683, 'grad_norm': 0.3524886965751648, 'learning_rate': 0.0001483741870935468, 'epoch': 0.01}


 26%|██▌       | 2586/10000 [2:23:32<7:28:19,  3.63s/it]

{'loss': 0.8899, 'grad_norm': 0.3941970765590668, 'learning_rate': 0.00014835417708854428, 'epoch': 0.01}


 26%|██▌       | 2587/10000 [2:23:35<7:11:09,  3.49s/it]

{'loss': 0.8038, 'grad_norm': 0.39911383390426636, 'learning_rate': 0.00014833416708354177, 'epoch': 0.01}


 26%|██▌       | 2588/10000 [2:23:39<7:14:00,  3.51s/it]

{'loss': 0.9358, 'grad_norm': 0.38752132654190063, 'learning_rate': 0.00014831415707853928, 'epoch': 0.01}


 26%|██▌       | 2589/10000 [2:23:43<7:40:51,  3.73s/it]

{'loss': 0.998, 'grad_norm': 0.366377592086792, 'learning_rate': 0.00014829414707353677, 'epoch': 0.01}


 26%|██▌       | 2590/10000 [2:23:46<7:28:11,  3.63s/it]

{'loss': 0.923, 'grad_norm': 0.36333999037742615, 'learning_rate': 0.00014827413706853428, 'epoch': 0.01}


 26%|██▌       | 2591/10000 [2:23:49<6:52:54,  3.34s/it]

{'loss': 0.7183, 'grad_norm': 0.40454331040382385, 'learning_rate': 0.00014825412706353177, 'epoch': 0.01}


 26%|██▌       | 2592/10000 [2:23:53<7:19:29,  3.56s/it]

{'loss': 1.0751, 'grad_norm': 0.4044090211391449, 'learning_rate': 0.00014823411705852928, 'epoch': 0.01}


 26%|██▌       | 2593/10000 [2:23:56<6:41:45,  3.25s/it]

{'loss': 1.0394, 'grad_norm': 0.5196859240531921, 'learning_rate': 0.00014821410705352677, 'epoch': 0.01}


 26%|██▌       | 2594/10000 [2:24:00<7:08:48,  3.47s/it]

{'loss': 0.8918, 'grad_norm': 0.33638858795166016, 'learning_rate': 0.00014819409704852428, 'epoch': 0.01}


 26%|██▌       | 2595/10000 [2:24:02<6:30:03,  3.16s/it]

{'loss': 0.6492, 'grad_norm': 0.42335963249206543, 'learning_rate': 0.00014817408704352177, 'epoch': 0.01}


 26%|██▌       | 2596/10000 [2:24:05<6:22:00,  3.10s/it]

{'loss': 0.8229, 'grad_norm': 0.3901205360889435, 'learning_rate': 0.00014815407703851926, 'epoch': 0.01}


 26%|██▌       | 2597/10000 [2:24:10<7:28:38,  3.64s/it]

{'loss': 0.8085, 'grad_norm': 0.4182766377925873, 'learning_rate': 0.00014813406703351675, 'epoch': 0.01}


 26%|██▌       | 2598/10000 [2:24:13<7:16:24,  3.54s/it]

{'loss': 0.6931, 'grad_norm': 0.357524573802948, 'learning_rate': 0.00014811405702851426, 'epoch': 0.01}


 26%|██▌       | 2599/10000 [2:24:16<6:54:30,  3.36s/it]

{'loss': 0.7284, 'grad_norm': 0.36890727281570435, 'learning_rate': 0.00014809404702351178, 'epoch': 0.01}


 26%|██▌       | 2600/10000 [2:24:19<6:29:44,  3.16s/it]

{'loss': 0.7483, 'grad_norm': 0.3975680470466614, 'learning_rate': 0.00014807403701850926, 'epoch': 0.01}


 26%|██▌       | 2601/10000 [2:24:23<7:10:57,  3.49s/it]

{'loss': 0.6757, 'grad_norm': 0.39877259731292725, 'learning_rate': 0.00014805402701350678, 'epoch': 0.01}


 26%|██▌       | 2602/10000 [2:24:26<7:02:35,  3.43s/it]

{'loss': 0.8614, 'grad_norm': 0.3602506220340729, 'learning_rate': 0.00014803401700850426, 'epoch': 0.01}


 26%|██▌       | 2603/10000 [2:24:29<6:45:40,  3.29s/it]

{'loss': 0.7108, 'grad_norm': 0.40021491050720215, 'learning_rate': 0.00014801400700350175, 'epoch': 0.01}


 26%|██▌       | 2604/10000 [2:24:33<6:56:44,  3.38s/it]

{'loss': 0.7987, 'grad_norm': 0.4086190164089203, 'learning_rate': 0.00014799399699849924, 'epoch': 0.01}


 26%|██▌       | 2605/10000 [2:24:36<6:58:27,  3.40s/it]

{'loss': 0.7527, 'grad_norm': 0.38210076093673706, 'learning_rate': 0.00014797398699349675, 'epoch': 0.01}


 26%|██▌       | 2606/10000 [2:24:40<7:03:02,  3.43s/it]

{'loss': 0.766, 'grad_norm': 0.36477795243263245, 'learning_rate': 0.00014795397698849424, 'epoch': 0.01}


 26%|██▌       | 2607/10000 [2:24:44<7:46:21,  3.78s/it]

{'loss': 1.2911, 'grad_norm': 0.3632274568080902, 'learning_rate': 0.00014793396698349175, 'epoch': 0.01}


 26%|██▌       | 2608/10000 [2:24:49<8:00:16,  3.90s/it]

{'loss': 0.9854, 'grad_norm': 0.37437549233436584, 'learning_rate': 0.00014791395697848924, 'epoch': 0.01}


 26%|██▌       | 2609/10000 [2:24:52<7:26:39,  3.63s/it]

{'loss': 0.9502, 'grad_norm': 0.420646071434021, 'learning_rate': 0.00014789394697348676, 'epoch': 0.01}


 26%|██▌       | 2610/10000 [2:24:56<8:01:28,  3.91s/it]

{'loss': 0.8209, 'grad_norm': 0.3445647358894348, 'learning_rate': 0.00014787393696848424, 'epoch': 0.01}


 26%|██▌       | 2611/10000 [2:25:02<9:13:18,  4.49s/it]

{'loss': 0.8007, 'grad_norm': 0.31323423981666565, 'learning_rate': 0.00014785392696348176, 'epoch': 0.01}


 26%|██▌       | 2612/10000 [2:25:06<9:08:13,  4.45s/it]

{'loss': 0.9644, 'grad_norm': 0.3921203315258026, 'learning_rate': 0.00014783391695847925, 'epoch': 0.01}


 26%|██▌       | 2613/10000 [2:25:10<8:48:49,  4.30s/it]

{'loss': 0.7915, 'grad_norm': 0.39231857657432556, 'learning_rate': 0.00014781390695347673, 'epoch': 0.01}


 26%|██▌       | 2614/10000 [2:25:15<8:53:38,  4.34s/it]

{'loss': 1.0355, 'grad_norm': 0.35463765263557434, 'learning_rate': 0.00014779389694847425, 'epoch': 0.01}


 26%|██▌       | 2615/10000 [2:25:18<8:26:46,  4.12s/it]

{'loss': 0.717, 'grad_norm': 0.35254257917404175, 'learning_rate': 0.00014777388694347173, 'epoch': 0.01}


 26%|██▌       | 2616/10000 [2:25:22<8:20:20,  4.07s/it]

{'loss': 0.9218, 'grad_norm': 0.35525208711624146, 'learning_rate': 0.00014775387693846925, 'epoch': 0.01}


 26%|██▌       | 2617/10000 [2:25:25<7:47:24,  3.80s/it]

{'loss': 0.8231, 'grad_norm': 0.3765440583229065, 'learning_rate': 0.00014773386693346674, 'epoch': 0.01}


 26%|██▌       | 2618/10000 [2:25:28<7:08:55,  3.49s/it]

{'loss': 0.8141, 'grad_norm': 0.4217272698879242, 'learning_rate': 0.00014771385692846425, 'epoch': 0.01}


 26%|██▌       | 2619/10000 [2:25:31<6:44:40,  3.29s/it]

{'loss': 0.8276, 'grad_norm': 0.3882737159729004, 'learning_rate': 0.00014769384692346174, 'epoch': 0.01}


 26%|██▌       | 2620/10000 [2:25:34<6:33:28,  3.20s/it]

{'loss': 0.8882, 'grad_norm': 0.41643238067626953, 'learning_rate': 0.00014767383691845925, 'epoch': 0.01}


 26%|██▌       | 2621/10000 [2:25:39<7:28:41,  3.65s/it]

{'loss': 0.8953, 'grad_norm': 0.31617969274520874, 'learning_rate': 0.00014765382691345674, 'epoch': 0.01}


 26%|██▌       | 2622/10000 [2:25:42<7:02:25,  3.44s/it]

{'loss': 0.71, 'grad_norm': 0.3888264298439026, 'learning_rate': 0.00014763381690845423, 'epoch': 0.01}


 26%|██▌       | 2623/10000 [2:25:45<6:41:59,  3.27s/it]

{'loss': 0.8393, 'grad_norm': 0.4274768531322479, 'learning_rate': 0.00014761380690345171, 'epoch': 0.01}


 26%|██▌       | 2624/10000 [2:25:49<7:15:24,  3.54s/it]

{'loss': 1.2685, 'grad_norm': 0.41372478008270264, 'learning_rate': 0.00014759379689844923, 'epoch': 0.01}


 26%|██▋       | 2625/10000 [2:25:52<7:18:55,  3.57s/it]

{'loss': 0.9817, 'grad_norm': 0.36253026127815247, 'learning_rate': 0.00014757378689344672, 'epoch': 0.01}


 26%|██▋       | 2626/10000 [2:25:55<7:01:39,  3.43s/it]

{'loss': 0.8271, 'grad_norm': 0.42499038577079773, 'learning_rate': 0.00014755377688844423, 'epoch': 0.01}


 26%|██▋       | 2627/10000 [2:25:58<6:45:10,  3.30s/it]

{'loss': 0.719, 'grad_norm': 0.38160377740859985, 'learning_rate': 0.00014753376688344174, 'epoch': 0.01}


 26%|██▋       | 2628/10000 [2:26:03<7:12:29,  3.52s/it]

{'loss': 1.1622, 'grad_norm': 0.4473171532154083, 'learning_rate': 0.00014751375687843923, 'epoch': 0.01}


 26%|██▋       | 2629/10000 [2:26:06<7:19:33,  3.58s/it]

{'loss': 0.7103, 'grad_norm': 0.37756502628326416, 'learning_rate': 0.00014749374687343675, 'epoch': 0.01}


 26%|██▋       | 2630/10000 [2:26:10<7:12:09,  3.52s/it]

{'loss': 0.7942, 'grad_norm': 0.37013599276542664, 'learning_rate': 0.00014747373686843423, 'epoch': 0.01}


 26%|██▋       | 2631/10000 [2:26:13<6:52:32,  3.36s/it]

{'loss': 0.7616, 'grad_norm': 0.3803150951862335, 'learning_rate': 0.00014745372686343172, 'epoch': 0.01}


 26%|██▋       | 2632/10000 [2:26:16<7:09:20,  3.50s/it]

{'loss': 1.0307, 'grad_norm': 0.3511042594909668, 'learning_rate': 0.0001474337168584292, 'epoch': 0.01}


 26%|██▋       | 2633/10000 [2:26:19<6:44:07,  3.29s/it]

{'loss': 0.8382, 'grad_norm': 0.37542128562927246, 'learning_rate': 0.00014741370685342672, 'epoch': 0.01}


 26%|██▋       | 2634/10000 [2:26:23<6:59:49,  3.42s/it]

{'loss': 0.8574, 'grad_norm': 0.4190729558467865, 'learning_rate': 0.0001473936968484242, 'epoch': 0.01}


 26%|██▋       | 2635/10000 [2:26:26<6:55:01,  3.38s/it]

{'loss': 1.0588, 'grad_norm': 0.4102950692176819, 'learning_rate': 0.00014737368684342172, 'epoch': 0.01}


 26%|██▋       | 2636/10000 [2:26:30<7:03:14,  3.45s/it]

{'loss': 1.0114, 'grad_norm': 0.351225882768631, 'learning_rate': 0.0001473536768384192, 'epoch': 0.01}


 26%|██▋       | 2637/10000 [2:26:33<6:37:26,  3.24s/it]

{'loss': 0.8742, 'grad_norm': 0.4352162480354309, 'learning_rate': 0.00014733366683341673, 'epoch': 0.01}


 26%|██▋       | 2638/10000 [2:26:37<7:09:03,  3.50s/it]

{'loss': 1.1008, 'grad_norm': 0.37112560868263245, 'learning_rate': 0.0001473136568284142, 'epoch': 0.01}


 26%|██▋       | 2639/10000 [2:26:41<7:29:27,  3.66s/it]

{'loss': 0.8302, 'grad_norm': 0.4326498508453369, 'learning_rate': 0.0001472936468234117, 'epoch': 0.01}


 26%|██▋       | 2640/10000 [2:26:44<7:21:58,  3.60s/it]

{'loss': 0.7825, 'grad_norm': 0.401974618434906, 'learning_rate': 0.00014727363681840921, 'epoch': 0.01}


 26%|██▋       | 2641/10000 [2:26:48<7:21:13,  3.60s/it]

{'loss': 0.766, 'grad_norm': 0.4010576009750366, 'learning_rate': 0.0001472536268134067, 'epoch': 0.01}


 26%|██▋       | 2642/10000 [2:26:53<8:07:24,  3.97s/it]

{'loss': 1.0734, 'grad_norm': 0.36162275075912476, 'learning_rate': 0.00014723361680840422, 'epoch': 0.01}


 26%|██▋       | 2643/10000 [2:26:56<7:51:08,  3.84s/it]

{'loss': 1.0048, 'grad_norm': 0.3627987504005432, 'learning_rate': 0.0001472136068034017, 'epoch': 0.01}


 26%|██▋       | 2644/10000 [2:26:59<7:10:25,  3.51s/it]

{'loss': 0.9549, 'grad_norm': 0.3989817798137665, 'learning_rate': 0.00014719359679839922, 'epoch': 0.01}


 26%|██▋       | 2645/10000 [2:27:02<6:37:32,  3.24s/it]

{'loss': 0.8448, 'grad_norm': 0.4229709208011627, 'learning_rate': 0.0001471735867933967, 'epoch': 0.01}


 26%|██▋       | 2646/10000 [2:27:05<6:40:14,  3.27s/it]

{'loss': 0.7933, 'grad_norm': 0.39920827746391296, 'learning_rate': 0.00014715357678839422, 'epoch': 0.01}


 26%|██▋       | 2647/10000 [2:27:07<6:12:43,  3.04s/it]

{'loss': 0.6777, 'grad_norm': 0.5065255165100098, 'learning_rate': 0.0001471335667833917, 'epoch': 0.01}


 26%|██▋       | 2648/10000 [2:27:12<6:54:04,  3.38s/it]

{'loss': 1.1193, 'grad_norm': 0.3790731430053711, 'learning_rate': 0.0001471135567783892, 'epoch': 0.01}


 26%|██▋       | 2649/10000 [2:27:15<6:51:01,  3.35s/it]

{'loss': 0.7774, 'grad_norm': 0.37926486134529114, 'learning_rate': 0.00014709354677338668, 'epoch': 0.01}


 26%|██▋       | 2650/10000 [2:27:18<6:44:45,  3.30s/it]

{'loss': 0.9982, 'grad_norm': 0.38353100419044495, 'learning_rate': 0.0001470735367683842, 'epoch': 0.01}


 27%|██▋       | 2651/10000 [2:27:21<6:34:25,  3.22s/it]

{'loss': 1.0217, 'grad_norm': 0.46438154578208923, 'learning_rate': 0.00014705352676338168, 'epoch': 0.01}


 27%|██▋       | 2652/10000 [2:27:25<7:06:21,  3.48s/it]

{'loss': 0.8545, 'grad_norm': 0.33466359972953796, 'learning_rate': 0.0001470335167583792, 'epoch': 0.01}


 27%|██▋       | 2653/10000 [2:27:28<7:02:00,  3.45s/it]

{'loss': 0.7228, 'grad_norm': 0.39658111333847046, 'learning_rate': 0.0001470135067533767, 'epoch': 0.01}


 27%|██▋       | 2654/10000 [2:27:33<7:35:44,  3.72s/it]

{'loss': 0.8117, 'grad_norm': 0.3166996240615845, 'learning_rate': 0.0001469934967483742, 'epoch': 0.01}


 27%|██▋       | 2655/10000 [2:27:37<7:37:01,  3.73s/it]

{'loss': 0.9167, 'grad_norm': 0.3995523452758789, 'learning_rate': 0.0001469734867433717, 'epoch': 0.01}


 27%|██▋       | 2656/10000 [2:27:40<7:29:28,  3.67s/it]

{'loss': 1.1558, 'grad_norm': 0.39484959840774536, 'learning_rate': 0.0001469534767383692, 'epoch': 0.01}


 27%|██▋       | 2657/10000 [2:27:43<6:51:30,  3.36s/it]

{'loss': 0.8401, 'grad_norm': 0.4201115369796753, 'learning_rate': 0.0001469334667333667, 'epoch': 0.01}


 27%|██▋       | 2658/10000 [2:27:46<7:00:20,  3.44s/it]

{'loss': 0.8467, 'grad_norm': 0.37790071964263916, 'learning_rate': 0.00014691345672836418, 'epoch': 0.01}


 27%|██▋       | 2659/10000 [2:27:50<6:50:22,  3.35s/it]

{'loss': 0.9735, 'grad_norm': 0.3800516426563263, 'learning_rate': 0.0001468934467233617, 'epoch': 0.01}


 27%|██▋       | 2660/10000 [2:27:53<6:48:01,  3.34s/it]

{'loss': 1.1342, 'grad_norm': 0.40681806206703186, 'learning_rate': 0.00014687343671835918, 'epoch': 0.01}


 27%|██▋       | 2661/10000 [2:27:56<6:58:33,  3.42s/it]

{'loss': 1.2189, 'grad_norm': 0.37878668308258057, 'learning_rate': 0.0001468534267133567, 'epoch': 0.01}


 27%|██▋       | 2662/10000 [2:27:59<6:18:47,  3.10s/it]

{'loss': 0.7414, 'grad_norm': 0.41463690996170044, 'learning_rate': 0.00014683341670835418, 'epoch': 0.01}


 27%|██▋       | 2663/10000 [2:28:03<6:44:34,  3.31s/it]

{'loss': 0.6764, 'grad_norm': 0.34396636486053467, 'learning_rate': 0.0001468134067033517, 'epoch': 0.01}


 27%|██▋       | 2664/10000 [2:28:07<7:20:33,  3.60s/it]

{'loss': 0.7852, 'grad_norm': 0.38565149903297424, 'learning_rate': 0.00014679339669834918, 'epoch': 0.01}


 27%|██▋       | 2665/10000 [2:28:11<7:55:49,  3.89s/it]

{'loss': 1.0992, 'grad_norm': 0.35626405477523804, 'learning_rate': 0.0001467733866933467, 'epoch': 0.01}


 27%|██▋       | 2666/10000 [2:28:14<7:16:28,  3.57s/it]

{'loss': 0.8432, 'grad_norm': 0.4565633535385132, 'learning_rate': 0.00014675337668834418, 'epoch': 0.01}


 27%|██▋       | 2667/10000 [2:28:17<6:57:17,  3.41s/it]

{'loss': 0.6676, 'grad_norm': 0.3360147476196289, 'learning_rate': 0.00014673336668334167, 'epoch': 0.01}


 27%|██▋       | 2668/10000 [2:28:21<6:59:43,  3.43s/it]

{'loss': 0.714, 'grad_norm': 0.3492735028266907, 'learning_rate': 0.00014671335667833918, 'epoch': 0.01}


 27%|██▋       | 2669/10000 [2:28:24<6:39:54,  3.27s/it]

{'loss': 0.7137, 'grad_norm': 0.36024919152259827, 'learning_rate': 0.00014669334667333667, 'epoch': 0.01}


 27%|██▋       | 2670/10000 [2:28:27<6:24:58,  3.15s/it]

{'loss': 0.8726, 'grad_norm': 0.39868873357772827, 'learning_rate': 0.00014667333666833419, 'epoch': 0.01}


 27%|██▋       | 2671/10000 [2:28:30<6:49:24,  3.35s/it]

{'loss': 0.6851, 'grad_norm': 0.36676472425460815, 'learning_rate': 0.00014665332666333167, 'epoch': 0.01}


 27%|██▋       | 2672/10000 [2:28:35<7:17:56,  3.59s/it]

{'loss': 0.9914, 'grad_norm': 0.3723785877227783, 'learning_rate': 0.0001466333166583292, 'epoch': 0.01}


 27%|██▋       | 2673/10000 [2:28:38<7:00:30,  3.44s/it]

{'loss': 0.726, 'grad_norm': 0.35715633630752563, 'learning_rate': 0.00014661330665332667, 'epoch': 0.01}


 27%|██▋       | 2674/10000 [2:28:41<6:55:14,  3.40s/it]

{'loss': 0.9379, 'grad_norm': 0.3615996837615967, 'learning_rate': 0.00014659329664832416, 'epoch': 0.01}


 27%|██▋       | 2675/10000 [2:28:45<7:01:42,  3.45s/it]

{'loss': 0.6951, 'grad_norm': 0.3726578950881958, 'learning_rate': 0.00014657328664332165, 'epoch': 0.01}


 27%|██▋       | 2676/10000 [2:28:48<6:54:37,  3.40s/it]

{'loss': 0.9074, 'grad_norm': 0.3882218599319458, 'learning_rate': 0.00014655327663831916, 'epoch': 0.01}


 27%|██▋       | 2677/10000 [2:28:50<6:26:38,  3.17s/it]

{'loss': 0.7739, 'grad_norm': 0.4226408898830414, 'learning_rate': 0.00014653326663331665, 'epoch': 0.01}


 27%|██▋       | 2678/10000 [2:28:53<6:15:16,  3.08s/it]

{'loss': 0.9231, 'grad_norm': 0.4317275881767273, 'learning_rate': 0.00014651325662831416, 'epoch': 0.01}


 27%|██▋       | 2679/10000 [2:28:57<6:44:54,  3.32s/it]

{'loss': 0.9091, 'grad_norm': 0.42034170031547546, 'learning_rate': 0.00014649324662331165, 'epoch': 0.01}


 27%|██▋       | 2680/10000 [2:29:01<7:07:50,  3.51s/it]

{'loss': 1.1583, 'grad_norm': 0.39948198199272156, 'learning_rate': 0.00014647323661830917, 'epoch': 0.01}


 27%|██▋       | 2681/10000 [2:29:04<6:51:08,  3.37s/it]

{'loss': 0.9479, 'grad_norm': 0.42666569352149963, 'learning_rate': 0.00014645322661330668, 'epoch': 0.01}


 27%|██▋       | 2682/10000 [2:29:07<6:46:52,  3.34s/it]

{'loss': 1.0962, 'grad_norm': 0.43418067693710327, 'learning_rate': 0.00014643321660830417, 'epoch': 0.01}


 27%|██▋       | 2683/10000 [2:29:12<7:47:17,  3.83s/it]

{'loss': 1.0143, 'grad_norm': 0.30756089091300964, 'learning_rate': 0.00014641320660330166, 'epoch': 0.01}


 27%|██▋       | 2684/10000 [2:29:15<7:03:30,  3.47s/it]

{'loss': 0.9512, 'grad_norm': 0.47543856501579285, 'learning_rate': 0.00014639319659829914, 'epoch': 0.01}


 27%|██▋       | 2685/10000 [2:29:19<7:11:53,  3.54s/it]

{'loss': 0.9217, 'grad_norm': 0.3633626699447632, 'learning_rate': 0.00014637318659329666, 'epoch': 0.01}


 27%|██▋       | 2686/10000 [2:29:21<6:38:15,  3.27s/it]

{'loss': 0.7528, 'grad_norm': 0.42988309264183044, 'learning_rate': 0.00014635317658829414, 'epoch': 0.01}


 27%|██▋       | 2687/10000 [2:29:25<7:02:32,  3.47s/it]

{'loss': 0.8846, 'grad_norm': 0.35206034779548645, 'learning_rate': 0.00014633316658329166, 'epoch': 0.01}


 27%|██▋       | 2688/10000 [2:29:31<8:35:00,  4.23s/it]

{'loss': 1.2841, 'grad_norm': 0.3239718973636627, 'learning_rate': 0.00014631315657828915, 'epoch': 0.01}


 27%|██▋       | 2689/10000 [2:29:34<7:52:19,  3.88s/it]

{'loss': 1.2398, 'grad_norm': 0.4977485239505768, 'learning_rate': 0.00014629314657328666, 'epoch': 0.01}


 27%|██▋       | 2690/10000 [2:29:37<7:23:24,  3.64s/it]

{'loss': 1.1367, 'grad_norm': 0.43814826011657715, 'learning_rate': 0.00014627313656828415, 'epoch': 0.01}


 27%|██▋       | 2691/10000 [2:29:41<7:18:40,  3.60s/it]

{'loss': 0.9003, 'grad_norm': 0.39966845512390137, 'learning_rate': 0.00014625312656328166, 'epoch': 0.01}


 27%|██▋       | 2692/10000 [2:29:44<7:12:57,  3.55s/it]

{'loss': 1.1467, 'grad_norm': 0.3964152932167053, 'learning_rate': 0.00014623311655827915, 'epoch': 0.01}


 27%|██▋       | 2693/10000 [2:29:47<6:25:14,  3.16s/it]

{'loss': 1.0015, 'grad_norm': 0.4921661615371704, 'learning_rate': 0.00014621310655327664, 'epoch': 0.01}


 27%|██▋       | 2694/10000 [2:29:50<6:28:29,  3.19s/it]

{'loss': 0.6598, 'grad_norm': 0.375655859708786, 'learning_rate': 0.00014619309654827412, 'epoch': 0.01}


 27%|██▋       | 2695/10000 [2:29:53<6:17:10,  3.10s/it]

{'loss': 1.12, 'grad_norm': 0.497801274061203, 'learning_rate': 0.00014617308654327164, 'epoch': 0.01}


 27%|██▋       | 2696/10000 [2:29:56<6:14:47,  3.08s/it]

{'loss': 0.9577, 'grad_norm': 0.3641660511493683, 'learning_rate': 0.00014615307653826915, 'epoch': 0.01}


 27%|██▋       | 2697/10000 [2:29:59<6:23:59,  3.15s/it]

{'loss': 1.2865, 'grad_norm': 0.3944139778614044, 'learning_rate': 0.00014613306653326664, 'epoch': 0.01}


 27%|██▋       | 2698/10000 [2:30:03<6:32:54,  3.23s/it]

{'loss': 0.8323, 'grad_norm': 0.35415118932724, 'learning_rate': 0.00014611305652826415, 'epoch': 0.01}


 27%|██▋       | 2699/10000 [2:30:06<6:22:54,  3.15s/it]

{'loss': 0.9605, 'grad_norm': 0.436352014541626, 'learning_rate': 0.00014609304652326164, 'epoch': 0.01}


 27%|██▋       | 2700/10000 [2:30:10<7:01:20,  3.46s/it]

{'loss': 0.9394, 'grad_norm': 0.34167852997779846, 'learning_rate': 0.00014607303651825916, 'epoch': 0.01}


 27%|██▋       | 2701/10000 [2:30:14<7:34:44,  3.74s/it]

{'loss': 0.8165, 'grad_norm': 0.4241217374801636, 'learning_rate': 0.00014605302651325664, 'epoch': 0.01}


 27%|██▋       | 2702/10000 [2:30:19<8:11:45,  4.04s/it]

{'loss': 0.8558, 'grad_norm': 0.3367711007595062, 'learning_rate': 0.00014603301650825413, 'epoch': 0.01}


 27%|██▋       | 2703/10000 [2:30:22<7:20:43,  3.62s/it]

{'loss': 0.6152, 'grad_norm': 0.37571585178375244, 'learning_rate': 0.00014601300650325162, 'epoch': 0.01}


 27%|██▋       | 2704/10000 [2:30:26<7:39:49,  3.78s/it]

{'loss': 0.7924, 'grad_norm': 0.3341082036495209, 'learning_rate': 0.00014599299649824913, 'epoch': 0.01}


 27%|██▋       | 2705/10000 [2:30:29<7:09:09,  3.53s/it]

{'loss': 1.1592, 'grad_norm': 0.4377589225769043, 'learning_rate': 0.00014597298649324662, 'epoch': 0.01}


 27%|██▋       | 2706/10000 [2:30:32<7:22:18,  3.64s/it]

{'loss': 0.9505, 'grad_norm': 0.3584674596786499, 'learning_rate': 0.00014595297648824413, 'epoch': 0.01}


 27%|██▋       | 2707/10000 [2:30:35<6:53:12,  3.40s/it]

{'loss': 1.0794, 'grad_norm': 0.4513360559940338, 'learning_rate': 0.00014593296648324162, 'epoch': 0.01}


 27%|██▋       | 2708/10000 [2:30:38<6:39:40,  3.29s/it]

{'loss': 0.774, 'grad_norm': 0.3751724064350128, 'learning_rate': 0.00014591295647823914, 'epoch': 0.01}


 27%|██▋       | 2709/10000 [2:30:42<6:42:32,  3.31s/it]

{'loss': 0.5993, 'grad_norm': 0.3162566125392914, 'learning_rate': 0.00014589294647323662, 'epoch': 0.01}


 27%|██▋       | 2710/10000 [2:30:45<6:32:04,  3.23s/it]

{'loss': 0.8193, 'grad_norm': 0.36941519379615784, 'learning_rate': 0.0001458729364682341, 'epoch': 0.01}


 27%|██▋       | 2711/10000 [2:30:49<6:54:21,  3.41s/it]

{'loss': 0.9078, 'grad_norm': 0.3478935658931732, 'learning_rate': 0.00014585292646323162, 'epoch': 0.01}


 27%|██▋       | 2712/10000 [2:30:52<7:00:33,  3.46s/it]

{'loss': 0.9817, 'grad_norm': 0.40904301404953003, 'learning_rate': 0.0001458329164582291, 'epoch': 0.01}


 27%|██▋       | 2713/10000 [2:30:55<6:32:03,  3.23s/it]

{'loss': 0.6923, 'grad_norm': 0.37080591917037964, 'learning_rate': 0.00014581290645322663, 'epoch': 0.01}


 27%|██▋       | 2714/10000 [2:30:59<6:53:26,  3.40s/it]

{'loss': 0.9685, 'grad_norm': 0.33416181802749634, 'learning_rate': 0.0001457928964482241, 'epoch': 0.01}


 27%|██▋       | 2715/10000 [2:31:03<7:13:25,  3.57s/it]

{'loss': 0.9349, 'grad_norm': 0.3528052866458893, 'learning_rate': 0.00014577288644322163, 'epoch': 0.01}


 27%|██▋       | 2716/10000 [2:31:06<7:06:48,  3.52s/it]

{'loss': 1.142, 'grad_norm': 0.39450323581695557, 'learning_rate': 0.00014575287643821912, 'epoch': 0.01}


 27%|██▋       | 2717/10000 [2:31:09<6:46:11,  3.35s/it]

{'loss': 0.8882, 'grad_norm': 0.4167589843273163, 'learning_rate': 0.00014573286643321663, 'epoch': 0.01}


 27%|██▋       | 2718/10000 [2:31:12<6:47:09,  3.35s/it]

{'loss': 0.8813, 'grad_norm': 0.37492305040359497, 'learning_rate': 0.00014571285642821412, 'epoch': 0.01}


 27%|██▋       | 2719/10000 [2:31:15<6:31:19,  3.22s/it]

{'loss': 0.9878, 'grad_norm': 0.4391044080257416, 'learning_rate': 0.0001456928464232116, 'epoch': 0.01}


 27%|██▋       | 2720/10000 [2:31:19<6:57:38,  3.44s/it]

{'loss': 0.9609, 'grad_norm': 0.44679003953933716, 'learning_rate': 0.0001456728364182091, 'epoch': 0.01}


 27%|██▋       | 2721/10000 [2:31:24<7:33:19,  3.74s/it]

{'loss': 1.249, 'grad_norm': 0.4655280113220215, 'learning_rate': 0.0001456528264132066, 'epoch': 0.01}


 27%|██▋       | 2722/10000 [2:31:26<7:00:24,  3.47s/it]

{'loss': 0.8949, 'grad_norm': 0.40352341532707214, 'learning_rate': 0.0001456328164082041, 'epoch': 0.01}


 27%|██▋       | 2723/10000 [2:31:30<6:56:27,  3.43s/it]

{'loss': 0.739, 'grad_norm': 0.41016682982444763, 'learning_rate': 0.0001456128064032016, 'epoch': 0.01}


 27%|██▋       | 2724/10000 [2:31:33<6:46:49,  3.35s/it]

{'loss': 0.8134, 'grad_norm': 0.38912534713745117, 'learning_rate': 0.00014559279639819912, 'epoch': 0.01}


 27%|██▋       | 2725/10000 [2:31:37<6:56:36,  3.44s/it]

{'loss': 0.7405, 'grad_norm': 0.3221479058265686, 'learning_rate': 0.0001455727863931966, 'epoch': 0.01}


 27%|██▋       | 2726/10000 [2:31:42<8:08:39,  4.03s/it]

{'loss': 1.3809, 'grad_norm': 0.3383910059928894, 'learning_rate': 0.00014555277638819412, 'epoch': 0.01}


 27%|██▋       | 2727/10000 [2:31:45<7:21:28,  3.64s/it]

{'loss': 0.7988, 'grad_norm': 0.38308820128440857, 'learning_rate': 0.0001455327663831916, 'epoch': 0.01}


 27%|██▋       | 2728/10000 [2:31:51<8:42:34,  4.31s/it]

{'loss': 1.1167, 'grad_norm': 0.32142767310142517, 'learning_rate': 0.0001455127563781891, 'epoch': 0.01}


 27%|██▋       | 2729/10000 [2:31:54<8:06:48,  4.02s/it]

{'loss': 0.8037, 'grad_norm': 0.3610258996486664, 'learning_rate': 0.00014549274637318659, 'epoch': 0.01}


 27%|██▋       | 2730/10000 [2:31:57<7:20:00,  3.63s/it]

{'loss': 0.9077, 'grad_norm': 0.441798597574234, 'learning_rate': 0.0001454727363681841, 'epoch': 0.01}


 27%|██▋       | 2731/10000 [2:32:00<7:02:11,  3.48s/it]

{'loss': 0.6712, 'grad_norm': 0.3892030417919159, 'learning_rate': 0.0001454527263631816, 'epoch': 0.01}


 27%|██▋       | 2732/10000 [2:32:02<6:30:14,  3.22s/it]

{'loss': 0.8581, 'grad_norm': 0.40218234062194824, 'learning_rate': 0.0001454327163581791, 'epoch': 0.01}


 27%|██▋       | 2733/10000 [2:32:06<6:50:09,  3.39s/it]

{'loss': 0.8196, 'grad_norm': 0.41136184334754944, 'learning_rate': 0.0001454127063531766, 'epoch': 0.01}


 27%|██▋       | 2734/10000 [2:32:09<6:20:25,  3.14s/it]

{'loss': 0.7789, 'grad_norm': 0.4691413342952728, 'learning_rate': 0.0001453926963481741, 'epoch': 0.01}


 27%|██▋       | 2735/10000 [2:32:12<6:18:10,  3.12s/it]

{'loss': 0.9043, 'grad_norm': 0.3824388384819031, 'learning_rate': 0.0001453726863431716, 'epoch': 0.01}


 27%|██▋       | 2736/10000 [2:32:15<6:24:29,  3.18s/it]

{'loss': 0.9546, 'grad_norm': 0.4047563970088959, 'learning_rate': 0.0001453526763381691, 'epoch': 0.01}


 27%|██▋       | 2737/10000 [2:32:18<6:11:58,  3.07s/it]

{'loss': 0.8022, 'grad_norm': 0.42132797837257385, 'learning_rate': 0.0001453326663331666, 'epoch': 0.01}


 27%|██▋       | 2738/10000 [2:32:22<6:39:30,  3.30s/it]

{'loss': 1.0236, 'grad_norm': 0.38788172602653503, 'learning_rate': 0.00014531265632816408, 'epoch': 0.01}


 27%|██▋       | 2739/10000 [2:32:25<6:50:49,  3.39s/it]

{'loss': 0.7926, 'grad_norm': 0.3212912976741791, 'learning_rate': 0.0001452926463231616, 'epoch': 0.01}


 27%|██▋       | 2740/10000 [2:32:28<6:31:54,  3.24s/it]

{'loss': 0.7474, 'grad_norm': 0.3643733561038971, 'learning_rate': 0.00014527263631815908, 'epoch': 0.01}


 27%|██▋       | 2741/10000 [2:32:33<7:26:30,  3.69s/it]

{'loss': 0.6451, 'grad_norm': 0.26974862813949585, 'learning_rate': 0.0001452526263131566, 'epoch': 0.01}


 27%|██▋       | 2742/10000 [2:32:38<7:57:35,  3.95s/it]

{'loss': 0.8434, 'grad_norm': 0.3185320496559143, 'learning_rate': 0.00014523261630815408, 'epoch': 0.01}


 27%|██▋       | 2743/10000 [2:32:42<8:07:02,  4.03s/it]

{'loss': 1.062, 'grad_norm': 0.3528585731983185, 'learning_rate': 0.0001452126063031516, 'epoch': 0.01}


 27%|██▋       | 2744/10000 [2:32:46<8:12:47,  4.07s/it]

{'loss': 0.617, 'grad_norm': 0.31997543573379517, 'learning_rate': 0.00014519259629814908, 'epoch': 0.01}


 27%|██▋       | 2745/10000 [2:32:50<8:04:37,  4.01s/it]

{'loss': 1.0933, 'grad_norm': 0.35497599840164185, 'learning_rate': 0.00014517258629314657, 'epoch': 0.01}


 27%|██▋       | 2746/10000 [2:32:54<8:00:51,  3.98s/it]

{'loss': 0.9376, 'grad_norm': 0.36631008982658386, 'learning_rate': 0.00014515257628814406, 'epoch': 0.01}


 27%|██▋       | 2747/10000 [2:32:57<7:19:36,  3.64s/it]

{'loss': 0.8618, 'grad_norm': 0.4517797529697418, 'learning_rate': 0.00014513256628314157, 'epoch': 0.01}


 27%|██▋       | 2748/10000 [2:33:00<7:17:57,  3.62s/it]

{'loss': 0.8298, 'grad_norm': 0.35550686717033386, 'learning_rate': 0.00014511255627813906, 'epoch': 0.01}


 27%|██▋       | 2749/10000 [2:33:03<6:39:22,  3.30s/it]

{'loss': 0.6381, 'grad_norm': 0.41199079155921936, 'learning_rate': 0.00014509254627313657, 'epoch': 0.01}


 28%|██▊       | 2750/10000 [2:33:06<6:42:03,  3.33s/it]

{'loss': 0.7258, 'grad_norm': 0.35146403312683105, 'learning_rate': 0.0001450725362681341, 'epoch': 0.01}


 28%|██▊       | 2751/10000 [2:33:09<6:39:17,  3.31s/it]

{'loss': 0.6941, 'grad_norm': 0.38831597566604614, 'learning_rate': 0.00014505252626313158, 'epoch': 0.01}


 28%|██▊       | 2752/10000 [2:33:13<7:02:53,  3.50s/it]

{'loss': 1.1743, 'grad_norm': 0.3620874583721161, 'learning_rate': 0.0001450325162581291, 'epoch': 0.01}


 28%|██▊       | 2753/10000 [2:33:17<7:06:59,  3.54s/it]

{'loss': 0.9967, 'grad_norm': 0.37294691801071167, 'learning_rate': 0.00014501250625312658, 'epoch': 0.01}


 28%|██▊       | 2754/10000 [2:33:20<6:45:10,  3.36s/it]

{'loss': 0.8723, 'grad_norm': 0.40469539165496826, 'learning_rate': 0.00014499249624812407, 'epoch': 0.01}


 28%|██▊       | 2755/10000 [2:33:23<6:29:25,  3.23s/it]

{'loss': 0.7798, 'grad_norm': 0.5115917921066284, 'learning_rate': 0.00014497248624312155, 'epoch': 0.01}


 28%|██▊       | 2756/10000 [2:33:26<6:40:07,  3.31s/it]

{'loss': 0.9278, 'grad_norm': 0.3665541708469391, 'learning_rate': 0.00014495247623811907, 'epoch': 0.01}


 28%|██▊       | 2757/10000 [2:33:30<6:37:12,  3.29s/it]

{'loss': 0.7505, 'grad_norm': 0.380540132522583, 'learning_rate': 0.00014493246623311655, 'epoch': 0.01}


 28%|██▊       | 2758/10000 [2:33:32<6:10:15,  3.07s/it]

{'loss': 1.1207, 'grad_norm': 0.45748212933540344, 'learning_rate': 0.00014491245622811407, 'epoch': 0.01}


 28%|██▊       | 2759/10000 [2:33:35<5:49:10,  2.89s/it]

{'loss': 0.6573, 'grad_norm': 0.42752784490585327, 'learning_rate': 0.00014489244622311156, 'epoch': 0.01}


 28%|██▊       | 2760/10000 [2:33:40<7:23:56,  3.68s/it]

{'loss': 1.3701, 'grad_norm': 0.3540084660053253, 'learning_rate': 0.00014487243621810907, 'epoch': 0.01}


 28%|██▊       | 2761/10000 [2:33:45<8:18:26,  4.13s/it]

{'loss': 1.1424, 'grad_norm': 0.37243813276290894, 'learning_rate': 0.00014485242621310656, 'epoch': 0.01}


 28%|██▊       | 2762/10000 [2:33:48<7:37:05,  3.79s/it]

{'loss': 0.9362, 'grad_norm': 0.4359666407108307, 'learning_rate': 0.00014483241620810407, 'epoch': 0.01}


 28%|██▊       | 2763/10000 [2:33:52<7:17:31,  3.63s/it]

{'loss': 0.8541, 'grad_norm': 0.404092401266098, 'learning_rate': 0.00014481240620310156, 'epoch': 0.01}


 28%|██▊       | 2764/10000 [2:33:55<7:04:49,  3.52s/it]

{'loss': 0.9113, 'grad_norm': 0.41367337107658386, 'learning_rate': 0.00014479239619809905, 'epoch': 0.01}


 28%|██▊       | 2765/10000 [2:33:58<6:46:46,  3.37s/it]

{'loss': 0.7793, 'grad_norm': 0.376711368560791, 'learning_rate': 0.00014477238619309656, 'epoch': 0.01}


 28%|██▊       | 2766/10000 [2:34:02<7:12:13,  3.58s/it]

{'loss': 0.9038, 'grad_norm': 0.3398911654949188, 'learning_rate': 0.00014475237618809405, 'epoch': 0.01}


 28%|██▊       | 2767/10000 [2:34:06<7:21:51,  3.67s/it]

{'loss': 0.9136, 'grad_norm': 0.36118924617767334, 'learning_rate': 0.00014473236618309156, 'epoch': 0.01}


 28%|██▊       | 2768/10000 [2:34:09<7:20:05,  3.65s/it]

{'loss': 0.8994, 'grad_norm': 0.3759812116622925, 'learning_rate': 0.00014471235617808905, 'epoch': 0.01}


 28%|██▊       | 2769/10000 [2:34:12<6:37:14,  3.30s/it]

{'loss': 0.7785, 'grad_norm': 0.4012170135974884, 'learning_rate': 0.00014469234617308656, 'epoch': 0.01}


 28%|██▊       | 2770/10000 [2:34:16<7:13:40,  3.60s/it]

{'loss': 1.0242, 'grad_norm': 0.3260301351547241, 'learning_rate': 0.00014467233616808405, 'epoch': 0.01}


 28%|██▊       | 2771/10000 [2:34:19<6:57:33,  3.47s/it]

{'loss': 0.8567, 'grad_norm': 0.4040706157684326, 'learning_rate': 0.00014465232616308157, 'epoch': 0.01}


 28%|██▊       | 2772/10000 [2:34:22<6:38:33,  3.31s/it]

{'loss': 0.8268, 'grad_norm': 0.4092605710029602, 'learning_rate': 0.00014463231615807905, 'epoch': 0.01}


 28%|██▊       | 2773/10000 [2:34:26<6:43:15,  3.35s/it]

{'loss': 0.7596, 'grad_norm': 0.37157654762268066, 'learning_rate': 0.00014461230615307654, 'epoch': 0.01}


 28%|██▊       | 2774/10000 [2:34:30<7:17:49,  3.64s/it]

{'loss': 1.0437, 'grad_norm': 0.3633498549461365, 'learning_rate': 0.00014459229614807403, 'epoch': 0.01}


 28%|██▊       | 2775/10000 [2:34:33<7:05:04,  3.53s/it]

{'loss': 1.0053, 'grad_norm': 0.3923572301864624, 'learning_rate': 0.00014457228614307154, 'epoch': 0.01}


 28%|██▊       | 2776/10000 [2:34:36<6:45:09,  3.37s/it]

{'loss': 0.7758, 'grad_norm': 0.42205914855003357, 'learning_rate': 0.00014455227613806903, 'epoch': 0.01}


 28%|██▊       | 2777/10000 [2:34:41<7:20:59,  3.66s/it]

{'loss': 0.5509, 'grad_norm': 0.35694679617881775, 'learning_rate': 0.00014453226613306654, 'epoch': 0.01}


KeyboardInterrupt: 

In [20]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

370.4275 seconds used for training.
6.17 minutes used for training.
Peak reserved memory = 5.359 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 69.247 %.
Peak reserved memory for training % of max memory = 0.0 %.
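
The arithmetic behind these stats can be sketched on its own, without a GPU. This is a minimal sketch: `summarize_gpu_memory` and the numbers passed to it are hypothetical, and it only mirrors the rounding and subtraction done in the cell above. It also shows why "memory for training" reads 0.0 whenever the baseline (`start_gpu_memory`) equals the peak, which may be what happened in this run if the baseline was captured after the peak had already been reached (an assumption).

```python
def summarize_gpu_memory(peak_reserved_bytes, start_reserved_gb, total_gb):
    """Return (peak_gb, training_gb, peak_pct, training_pct), each rounded to 3 places."""
    # Convert bytes to GiB, same as dividing by 1024 three times.
    peak_gb = round(peak_reserved_bytes / 1024**3, 3)
    # Memory attributed to training = peak minus what was reserved before training.
    training_gb = round(peak_gb - start_reserved_gb, 3)
    peak_pct = round(peak_gb / total_gb * 100, 3)
    training_pct = round(training_gb / total_gb * 100, 3)
    return peak_gb, training_gb, peak_pct, training_pct

# Hypothetical numbers: 6 GiB peak, 4.5 GiB reserved before training, 16 GiB card.
print(summarize_gpu_memory(6 * 1024**3, 4.5, 16.0))

# If the baseline already equals the peak, the training share collapses to zero.
print(summarize_gpu_memory(6 * 1024**3, 6.0, 16.0))
```
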


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages` - the model generates the assistant reply.

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
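
To make the two settings concrete, here is a minimal pure-Python sketch of min-p sampling. It assumes temperature scaling is applied first and the min-p cutoff second (the ordering is an assumption of this sketch), and the logits are made up. `min_p_filter` is a hypothetical helper, not a library function.

```python
import math
import random

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalised probabilities after temperature scaling and min-p filtering."""
    # Temperature > 1 flattens the distribution, < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # min-p keeps only tokens whose probability is at least min_p * (top probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

def sample(probs, rng=random):
    """Draw one token index from the filtered distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [5.0, 4.0, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
probs = min_p_filter(logits)
print(probs)          # the two unlikely tokens end up with probability 0.0
print(sample(probs))  # index of the sampled token
```

The high temperature adds variety among plausible tokens, while the min-p cutoff removes the long tail that a high temperature would otherwise make reachable.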

In [21]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere is the continuation of the Fibbonachi sequence that you requested.\n####\n 1, 1, 2, 3, 5, 8, 13, 21, 34<|eot_id|>']
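
The decoded output above shows the prompt layout the `llama-3.1` chat template produces: each turn is wrapped in header tokens and closed with `<|eot_id|>`, and `add_generation_prompt = True` appends an open assistant header for the model to complete. A minimal sketch of that layout follows. `render_llama31_prompt` is a hypothetical helper, and it deliberately omits the default system block (the "Cutting Knowledge Date" lines) that the real template inserts.

```python
def render_llama31_prompt(messages, add_generation_prompt=True):
    """Render messages in the header / eot layout seen in the decoded output (simplified)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave an open assistant turn so generation starts with the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]
print(render_llama31_prompt(messages))
```

Without the generation prompt the string ends at the user's `<|eot_id|>`, so the model has no cue that an assistant turn should begin.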

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [22]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

8

 Here are some more numbers in the sequence, starting from the fourth term:

3, 5, 8, 13, 21, 34.

This is because each number is the sum of the two preceding ones; for example, the first number is 1, the second is also 1, and the third is 1 plus 1, or 2.
####
Here are a few numbers in the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, 21, 34). In fact, the Fibonacci sequence is defined in this way.



<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [23]:
#model.save_pretrained("lora_model") # Local saving
#tokenizer.save_pretrained("lora_model")

import os
hf_token = os.environ.get("HF_TOKEN") # Read the token from the environment - never hardcode it in a notebook
model.push_to_hub("ID2223JR/recipe_model", token = hf_token) # Online saving
tokenizer.push_to_hub("ID2223JR/recipe_model", token = hf_token) # Online saving

100%|██████████| 1/1 [00:08<00:00,  8.83s/it]


Saved model to https://huggingface.co/ID2223JR/recipe_model


No files have been modified since last commit. Skipping to prevent empty commit.


Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [21]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Here is a description of a tall tower in the capital of France:

The tallest tower in the capital of France is the Tour du Gallic Hill Tower and the Palais du Congrès of the Centre Pompidou (a complex of art museums, cinemas, restaurants and a stadium). It's situated in the 11th arrondissement. The tower is the largest one in Paris and is known as the Tour du Gallic Hill. It has a height of 60 meters and four observation platforms.

It's a modern, sleek and comfortable observation deck which offers visitors from the city. The observation platforms have a large glass floor which


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [22]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [23]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
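
To give a feel for what an 8-bit quant like `q8_0` does, here is a simplified pure-Python sketch of blockwise 8-bit quantization. It assumes blocks of 32 weights that share one scale, with each weight stored as an integer in [-127, 127]. The real GGUF format packs the scale and integers in a compact binary layout, which this sketch does not reproduce. `quantize_q8_0` and `dequantize_q8_0` are hypothetical helper names.

```python
def quantize_q8_0(values, block_size=32):
    """Quantize floats to (scale, int list) blocks: one scale per block of weights."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 127; guard against all-zero blocks.
        scale = amax / 127 if amax > 0 else 1.0
        quants = [round(v / scale) for v in block]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from the quantized blocks."""
    return [scale * q for scale, quants in blocks for q in quants]

weights = [i / 10 - 3.15 for i in range(64)]  # made-up weights, two blocks of 32
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(len(blocks), "blocks, worst absolute error:", worst)
```

Each restored weight is off by at most half a quantization step of its block, which is why 8-bit quants stay close to the original model. The 4-bit and 5-bit methods trade more error for smaller files.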

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [24]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("ID2223JR/gguf_model_q8", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change ID2223JR to your username!
if False: model.push_to_hub_gguf("ID2223JR/gguf_model_q8", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("ID2223JR/gguf_model_q4", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("ID2223JR/gguf_model_q4", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "ID2223/gguf_model", # Change ID2223 to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>