To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

Use our [Llama-3 8b Instruct](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) notebook for conversational style finetunes.

In [1]:
# #%%capture
# # Installs Unsloth, Xformers (Flash Attention) and all other packages!
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
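
To make the RoPE scaling bullet more concrete, here is a minimal sketch of linear position interpolation, the idea behind kaiokendev's method: positions are divided by a scale factor so that a longer sequence reuses the range of rotary angles seen during pretraining. This is an illustration only, not Unsloth's internal code; `rope_angles`, the tiny `head_dim`, and the two context lengths are hypothetical.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one position; scale > 1 compresses positions linearly."""
    scaled_position = position / scale
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

trained_len = 2048   # hypothetical native context length
target_len = 4096    # hypothetical longer max_seq_length
scale = target_len / trained_len

# Position 4094 in the stretched context gets the same angles
# as position 2047 in the original context.
stretched = rope_angles(4094, scale=scale)
original = rope_angles(2047)
print(stretched == original)
```

The trade-off is that neighbouring tokens end up closer together in angle space, which is one reason finetuning at the longer length is usually recommended.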

In [2]:
import xformers
import torch

from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer

print(f"Torch version: {torch.__version__}")
print(f"Xformers version: {xformers.__version__}")
print("Unsloth is working!")

print("TRL and Transformers are working!")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
Torch version: 2.4.0+cu121
Xformers version: 0.0.27.post2
Unsloth is working!
TRL and Transformers are working!


In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
    'yentinglin/Llama-3-Taiwan-8B-Instruct',
    'MediaTek-Research/Breeze-7B-Instruct-v1_0',
    'yentinglin/Taiwan-LLM-7B-v2.0-base',
    'yentinglin/Taiwan-LLM-7B-v2.1-chat',
    'chinese-alpaca-plus-7b-hf',
] # More models at https://huggingface.co/unsloth


model_name_path = "unsloth/mistral-7b-v0.3-bnb-4bit"
model_name = model_name_path.split("/")[-1]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.1.5: Fast Mistral patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA GeForce RTX 4060 Ti. Max memory: 15.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.4.0+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  _ = torch.tensor([0], device=i)
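
To see why `load_in_4bit = True` matters on a small GPU, here is a rough back-of-the-envelope sketch of the weight footprint. The 7.25 billion parameter count is an assumption for Mistral 7B, and the estimate ignores quantization constants, layers that may stay in higher precision, activations, and the CUDA context, so treat the numbers as a lower bound.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

ASSUMED_PARAMS = 7.25e9  # assumption: approximate parameter count of Mistral 7B

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)
int4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone would take most of a 16 GB card, while the 4-bit weights leave room for LoRA adapters, optimizer state, and activations.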


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
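
As a sanity check on how few parameters LoRA trains, the count can be worked out by hand. The sketch below assumes the usual Mistral 7B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers), which are not stated in this notebook; the rank and target modules come from the cell above.

```python
# Assumed Mistral 7B shapes (not stated in this notebook)
HIDDEN = 4096          # hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim
INTERMEDIATE = 14336   # MLP inner size
NUM_LAYERS = 32
R = 16                 # LoRA rank, as in the cell above

# (in_features, out_features) of each targeted projection
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per module
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")
```

This gives 41,943,040, which agrees with the `Number of trainable parameters` line the trainer prints later in this notebook. Against an assumed total of about 7.25 billion parameters, that is roughly 0.6% of the model.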


<a name="Data"></a>
### Data Prep
This notebook was written for the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a cleaned version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The cell below loads a local dataset from `datasets/train` instead, with `instruction`, `question`, and `output` columns. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["question"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset, Dataset
dataset = load_dataset("datasets/train", split = "train")
dataset = dataset.map(
    formatting_prompts_func,
    batched = True,
    remove_columns=["instruction", "question", "output"],
)
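
To check what the formatting step produces without loading a model, here is a small standalone sketch. It reuses the Alpaca prompt and the column names from the cell above, but the `"</s>"` string is only a stand-in for `tokenizer.eos_token`, and the two-row batch is made up for illustration.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: stand-in for tokenizer.eos_token

def formatting_prompts_func(examples):
    # With batched=True, `examples` is a dict mapping column name -> list of values
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["question"], examples["output"]
    ):
        texts.append(alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}

# Hypothetical two-row batch in the column layout used above
batch = {
    "instruction": ["Translate to French.", "Add the numbers."],
    "question": ["Good morning", "2 and 3"],
    "output": ["Bonjour", "5"],
}
formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

Every formatted string should end with the EOS token; if it does not, the model never sees an example of where a response stops.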

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for `max_steps = 2000`; for one full pass over the data, set `num_train_epochs = 1` and remove `max_steps` instead. We also support TRL's `DPOTrainer`!
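
To relate a step budget to passes over the data, the arithmetic can be sketched as follows. The numbers are taken from the training cell and its output in this section (batch size 2, 4 accumulation steps, 1 GPU, 2,000 steps, 13,550 examples); the variable names are just for illustration.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 2000
num_examples = 13550

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Passes over the dataset (fractional)
passes_over_data = examples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen:,}")
print(f"Passes over the data: {passes_over_data:.2f}")
```

The run covers a bit more than one pass, so the trainer rounds up and reports 2 epochs, with the second one cut short when `max_steps` is reached.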

In [6]:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Create an empty eval dataset
empty_eval_dataset = Dataset.from_dict({"input_ids": [], "attention_mask": []})

training_args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 2000,
        learning_rate = 2e-4, # 5e-05
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps=5,                # Log every 5 steps
        logging_dir="./logs",           # Directory for log files
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = model_name,
        # save_strategy="steps",            # Save checkpoints by step count
        # save_steps=100,                   # Save a checkpoint every 100 steps
        # save_total_limit=3,               # Keep at most 3 checkpoints
        # eval_strategy="steps",            # Evaluate by step count
        # eval_steps=100,                   # Evaluate every 100 steps
        # load_best_model_at_end=True,      # Load the best model when training ends
        # metric_for_best_model="loss",     # Pick the best model by loss
        # greater_is_better=False,          # Lower loss is better
)


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    # eval_dataset=empty_eval_dataset,  # Provide the empty eval dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args,
)

Map: 100%|██████████| 13550/13550 [00:01<00:00, 8014.69 examples/s]
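
The training arguments above ask for a linear schedule with 5 warmup steps and a peak learning rate of 2e-4 over 2,000 steps. The sketch below is a plain-Python approximation of that schedule (linear warmup, then linear decay to zero); `linear_schedule_lr` is a hypothetical helper, not a library function, but it reproduces the learning rates that appear in the training log, for example about 1.995e-4 at step 10.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=2000):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5, 10, 1000, 2000):
    print(step, linear_schedule_lr(step))
```

Because the decay is linear, the learning rate drops by the same amount every step after warmup, reaching zero exactly at the last step.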


In [7]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4060 Ti. Max memory = 15.996 GB.
4.637 GB of memory reserved.


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 13,550 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 2,000
 "-____-"     Number of trainable parameters = 41,943,040
  0%|          | 5/2000 [00:39<4:11:04,  7.55s/it]

{'loss': 2.1622, 'grad_norm': 1.0938102006912231, 'learning_rate': 0.0002, 'epoch': 0.0}


  0%|          | 10/2000 [01:13<3:54:19,  7.06s/it]

{'loss': 1.9234, 'grad_norm': 1.0250017642974854, 'learning_rate': 0.00019949874686716793, 'epoch': 0.01}


  1%|          | 15/2000 [01:49<3:51:35,  7.00s/it]

{'loss': 1.8677, 'grad_norm': 1.0454978942871094, 'learning_rate': 0.00019899749373433585, 'epoch': 0.01}


  1%|          | 20/2000 [02:21<3:36:01,  6.55s/it]

{'loss': 1.8352, 'grad_norm': 0.8707805871963501, 'learning_rate': 0.00019849624060150375, 'epoch': 0.01}


  1%|▏         | 25/2000 [02:56<3:39:22,  6.66s/it]

{'loss': 1.8304, 'grad_norm': 0.8150629997253418, 'learning_rate': 0.0001979949874686717, 'epoch': 0.01}


  2%|▏         | 30/2000 [03:30<3:49:13,  6.98s/it]

{'loss': 1.7577, 'grad_norm': 0.7253011465072632, 'learning_rate': 0.0001974937343358396, 'epoch': 0.02}


  2%|▏         | 35/2000 [04:06<3:55:10,  7.18s/it]

{'loss': 1.8142, 'grad_norm': 0.883719801902771, 'learning_rate': 0.00019699248120300754, 'epoch': 0.02}


  2%|▏         | 40/2000 [04:43<4:04:10,  7.47s/it]

{'loss': 1.7089, 'grad_norm': 0.7924404144287109, 'learning_rate': 0.00019649122807017543, 'epoch': 0.02}


  2%|▏         | 45/2000 [05:16<3:41:09,  6.79s/it]

{'loss': 1.7293, 'grad_norm': 0.8116251826286316, 'learning_rate': 0.00019598997493734338, 'epoch': 0.03}


  2%|▎         | 50/2000 [05:52<3:49:29,  7.06s/it]

{'loss': 1.7139, 'grad_norm': 0.7996013760566711, 'learning_rate': 0.00019548872180451127, 'epoch': 0.03}


  3%|▎         | 55/2000 [06:28<3:50:39,  7.12s/it]

{'loss': 1.7424, 'grad_norm': 0.7409282326698303, 'learning_rate': 0.00019498746867167922, 'epoch': 0.03}


  3%|▎         | 60/2000 [07:05<4:00:11,  7.43s/it]

{'loss': 1.7254, 'grad_norm': 0.8940387964248657, 'learning_rate': 0.0001944862155388471, 'epoch': 0.04}


  3%|▎         | 65/2000 [07:43<4:00:38,  7.46s/it]

{'loss': 1.6961, 'grad_norm': 0.8446363210678101, 'learning_rate': 0.00019398496240601503, 'epoch': 0.04}


  4%|▎         | 70/2000 [08:18<3:50:25,  7.16s/it]

{'loss': 1.7552, 'grad_norm': 0.8779881000518799, 'learning_rate': 0.00019348370927318296, 'epoch': 0.04}


  4%|▍         | 75/2000 [08:52<3:45:12,  7.02s/it]

{'loss': 1.7026, 'grad_norm': 0.7914571166038513, 'learning_rate': 0.00019298245614035088, 'epoch': 0.04}


  4%|▍         | 80/2000 [09:27<3:41:09,  6.91s/it]

{'loss': 1.7319, 'grad_norm': 0.8674939274787903, 'learning_rate': 0.0001924812030075188, 'epoch': 0.05}


  4%|▍         | 85/2000 [10:00<3:24:20,  6.40s/it]

{'loss': 1.6843, 'grad_norm': 0.8480135202407837, 'learning_rate': 0.00019197994987468672, 'epoch': 0.05}


  4%|▍         | 90/2000 [10:35<3:36:23,  6.80s/it]

{'loss': 1.5895, 'grad_norm': 0.960347592830658, 'learning_rate': 0.00019147869674185464, 'epoch': 0.05}


  5%|▍         | 95/2000 [11:11<3:50:44,  7.27s/it]

{'loss': 1.6956, 'grad_norm': 0.8438019156455994, 'learning_rate': 0.00019097744360902256, 'epoch': 0.06}


  5%|▌         | 100/2000 [11:46<3:42:42,  7.03s/it]

{'loss': 1.6895, 'grad_norm': 0.8659358620643616, 'learning_rate': 0.00019047619047619048, 'epoch': 0.06}


  5%|▌         | 105/2000 [12:19<3:31:57,  6.71s/it]

{'loss': 1.6409, 'grad_norm': 0.8603217601776123, 'learning_rate': 0.0001899749373433584, 'epoch': 0.06}


  6%|▌         | 110/2000 [12:58<3:57:35,  7.54s/it]

{'loss': 1.5702, 'grad_norm': 0.8255472183227539, 'learning_rate': 0.00018947368421052632, 'epoch': 0.06}


  6%|▌         | 115/2000 [13:33<3:46:28,  7.21s/it]

{'loss': 1.6155, 'grad_norm': 0.8030120134353638, 'learning_rate': 0.00018897243107769424, 'epoch': 0.07}


  6%|▌         | 120/2000 [14:10<3:44:26,  7.16s/it]

{'loss': 1.5873, 'grad_norm': 0.9447240233421326, 'learning_rate': 0.00018847117794486217, 'epoch': 0.07}


  6%|▋         | 125/2000 [14:44<3:39:18,  7.02s/it]

{'loss': 1.6125, 'grad_norm': 0.9990488290786743, 'learning_rate': 0.00018796992481203009, 'epoch': 0.07}


  6%|▋         | 130/2000 [15:19<3:29:04,  6.71s/it]

{'loss': 1.6194, 'grad_norm': 0.8788908123970032, 'learning_rate': 0.00018746867167919798, 'epoch': 0.08}


  7%|▋         | 135/2000 [15:52<3:29:59,  6.76s/it]

{'loss': 1.6293, 'grad_norm': 0.797059178352356, 'learning_rate': 0.00018696741854636593, 'epoch': 0.08}


  7%|▋         | 140/2000 [16:25<3:30:39,  6.80s/it]

{'loss': 1.6561, 'grad_norm': 0.8816524147987366, 'learning_rate': 0.00018646616541353382, 'epoch': 0.08}


  7%|▋         | 145/2000 [16:59<3:27:14,  6.70s/it]

{'loss': 1.6145, 'grad_norm': 0.9204146862030029, 'learning_rate': 0.00018596491228070177, 'epoch': 0.09}


  8%|▊         | 150/2000 [17:32<3:22:46,  6.58s/it]

{'loss': 1.5848, 'grad_norm': 0.8675138354301453, 'learning_rate': 0.00018546365914786966, 'epoch': 0.09}


  8%|▊         | 155/2000 [18:07<3:38:58,  7.12s/it]

{'loss': 1.623, 'grad_norm': 0.8173717260360718, 'learning_rate': 0.0001849624060150376, 'epoch': 0.09}


  8%|▊         | 160/2000 [18:40<3:19:19,  6.50s/it]

{'loss': 1.4662, 'grad_norm': 0.8775128722190857, 'learning_rate': 0.0001844611528822055, 'epoch': 0.09}


  8%|▊         | 165/2000 [19:15<3:34:58,  7.03s/it]

{'loss': 1.6042, 'grad_norm': 0.9574693441390991, 'learning_rate': 0.00018395989974937345, 'epoch': 0.1}


  8%|▊         | 170/2000 [19:54<3:47:26,  7.46s/it]

{'loss': 1.6074, 'grad_norm': 0.808083176612854, 'learning_rate': 0.00018345864661654135, 'epoch': 0.1}


  9%|▉         | 175/2000 [20:26<3:27:26,  6.82s/it]

{'loss': 1.6371, 'grad_norm': 0.7822176218032837, 'learning_rate': 0.0001829573934837093, 'epoch': 0.1}


  9%|▉         | 180/2000 [21:00<3:23:16,  6.70s/it]

{'loss': 1.5571, 'grad_norm': 0.8280292749404907, 'learning_rate': 0.0001824561403508772, 'epoch': 0.11}


  9%|▉         | 185/2000 [21:33<3:20:30,  6.63s/it]

{'loss': 1.6103, 'grad_norm': 0.9835675954818726, 'learning_rate': 0.0001819548872180451, 'epoch': 0.11}


 10%|▉         | 190/2000 [22:07<3:31:38,  7.02s/it]

{'loss': 1.5804, 'grad_norm': 0.8704354763031006, 'learning_rate': 0.00018145363408521303, 'epoch': 0.11}


 10%|▉         | 195/2000 [22:41<3:24:29,  6.80s/it]

{'loss': 1.5087, 'grad_norm': 0.8361993432044983, 'learning_rate': 0.00018095238095238095, 'epoch': 0.12}


 10%|█         | 200/2000 [23:17<3:32:57,  7.10s/it]

{'loss': 1.5728, 'grad_norm': 1.1597719192504883, 'learning_rate': 0.00018045112781954887, 'epoch': 0.12}


 10%|█         | 205/2000 [23:48<3:09:15,  6.33s/it]

{'loss': 1.6179, 'grad_norm': 0.9409633278846741, 'learning_rate': 0.0001799498746867168, 'epoch': 0.12}


 10%|█         | 210/2000 [24:21<3:10:09,  6.37s/it]

{'loss': 1.5923, 'grad_norm': 0.9709766507148743, 'learning_rate': 0.00017944862155388472, 'epoch': 0.12}


 11%|█         | 215/2000 [24:55<3:19:10,  6.69s/it]

{'loss': 1.5137, 'grad_norm': 0.8982247114181519, 'learning_rate': 0.00017894736842105264, 'epoch': 0.13}


 11%|█         | 220/2000 [25:30<3:22:07,  6.81s/it]

{'loss': 1.5235, 'grad_norm': 0.8663676977157593, 'learning_rate': 0.00017844611528822056, 'epoch': 0.13}


 11%|█▏        | 225/2000 [26:04<3:20:39,  6.78s/it]

{'loss': 1.5497, 'grad_norm': 0.8942726254463196, 'learning_rate': 0.00017794486215538848, 'epoch': 0.13}


 12%|█▏        | 230/2000 [26:39<3:26:49,  7.01s/it]

{'loss': 1.5642, 'grad_norm': 0.8846656680107117, 'learning_rate': 0.0001774436090225564, 'epoch': 0.14}


 12%|█▏        | 235/2000 [27:13<3:17:52,  6.73s/it]

{'loss': 1.5466, 'grad_norm': 0.8901705145835876, 'learning_rate': 0.00017694235588972432, 'epoch': 0.14}


 12%|█▏        | 240/2000 [27:49<3:30:19,  7.17s/it]

{'loss': 1.5386, 'grad_norm': 0.8756357431411743, 'learning_rate': 0.00017644110275689224, 'epoch': 0.14}


 12%|█▏        | 245/2000 [28:24<3:31:28,  7.23s/it]

{'loss': 1.5399, 'grad_norm': 0.9494670033454895, 'learning_rate': 0.00017593984962406016, 'epoch': 0.14}


 12%|█▎        | 250/2000 [28:57<3:21:48,  6.92s/it]

{'loss': 1.4891, 'grad_norm': 1.0095595121383667, 'learning_rate': 0.00017543859649122806, 'epoch': 0.15}


 13%|█▎        | 255/2000 [29:32<3:28:55,  7.18s/it]

{'loss': 1.527, 'grad_norm': 0.8393500447273254, 'learning_rate': 0.000174937343358396, 'epoch': 0.15}


 13%|█▎        | 260/2000 [30:06<3:17:55,  6.82s/it]

{'loss': 1.5012, 'grad_norm': 1.397696852684021, 'learning_rate': 0.0001744360902255639, 'epoch': 0.15}


 13%|█▎        | 265/2000 [30:38<3:19:02,  6.88s/it]

{'loss': 1.5363, 'grad_norm': 0.8534162044525146, 'learning_rate': 0.00017393483709273185, 'epoch': 0.16}


 14%|█▎        | 270/2000 [31:15<3:28:54,  7.25s/it]

{'loss': 1.5281, 'grad_norm': 0.9916231632232666, 'learning_rate': 0.00017343358395989974, 'epoch': 0.16}


 14%|█▍        | 275/2000 [31:49<3:24:38,  7.12s/it]

{'loss': 1.6806, 'grad_norm': 0.90652996301651, 'learning_rate': 0.0001729323308270677, 'epoch': 0.16}


 14%|█▍        | 280/2000 [32:22<3:16:23,  6.85s/it]

{'loss': 1.4367, 'grad_norm': 0.8543688058853149, 'learning_rate': 0.00017243107769423558, 'epoch': 0.17}


 14%|█▍        | 285/2000 [32:56<3:14:24,  6.80s/it]

{'loss': 1.5211, 'grad_norm': 0.9758288264274597, 'learning_rate': 0.00017192982456140353, 'epoch': 0.17}


 14%|█▍        | 290/2000 [33:29<3:09:16,  6.64s/it]

{'loss': 1.5099, 'grad_norm': 1.0217927694320679, 'learning_rate': 0.00017142857142857143, 'epoch': 0.17}


 15%|█▍        | 295/2000 [34:01<2:59:55,  6.33s/it]

{'loss': 1.4535, 'grad_norm': 1.1638081073760986, 'learning_rate': 0.00017092731829573935, 'epoch': 0.17}


 15%|█▌        | 300/2000 [34:37<3:24:50,  7.23s/it]

{'loss': 1.4912, 'grad_norm': 0.9804096221923828, 'learning_rate': 0.00017042606516290727, 'epoch': 0.18}


 15%|█▌        | 305/2000 [35:09<3:03:24,  6.49s/it]

{'loss': 1.4854, 'grad_norm': 0.8706085085868835, 'learning_rate': 0.0001699248120300752, 'epoch': 0.18}


 16%|█▌        | 310/2000 [35:42<3:03:57,  6.53s/it]

{'loss': 1.4787, 'grad_norm': 1.0013046264648438, 'learning_rate': 0.0001694235588972431, 'epoch': 0.18}


 16%|█▌        | 315/2000 [36:20<3:28:16,  7.42s/it]

{'loss': 1.3487, 'grad_norm': 0.8838441371917725, 'learning_rate': 0.00016892230576441103, 'epoch': 0.19}


 16%|█▌        | 320/2000 [36:56<3:21:27,  7.19s/it]

{'loss': 1.4131, 'grad_norm': 0.9804054498672485, 'learning_rate': 0.00016842105263157895, 'epoch': 0.19}


 16%|█▋        | 325/2000 [37:31<3:13:43,  6.94s/it]

{'loss': 1.3774, 'grad_norm': 1.093043565750122, 'learning_rate': 0.00016791979949874687, 'epoch': 0.19}


 16%|█▋        | 330/2000 [38:05<3:08:00,  6.75s/it]

{'loss': 1.5258, 'grad_norm': 1.1693962812423706, 'learning_rate': 0.0001674185463659148, 'epoch': 0.19}


 17%|█▋        | 335/2000 [38:39<3:08:56,  6.81s/it]

{'loss': 1.3937, 'grad_norm': 1.105959415435791, 'learning_rate': 0.00016691729323308271, 'epoch': 0.2}


 17%|█▋        | 340/2000 [39:14<3:20:10,  7.24s/it]

{'loss': 1.4294, 'grad_norm': 1.0210820436477661, 'learning_rate': 0.00016641604010025064, 'epoch': 0.2}


 17%|█▋        | 345/2000 [39:48<3:12:28,  6.98s/it]

{'loss': 1.3831, 'grad_norm': 0.9003434181213379, 'learning_rate': 0.00016591478696741856, 'epoch': 0.2}


 18%|█▊        | 350/2000 [40:20<2:55:01,  6.36s/it]

{'loss': 1.5072, 'grad_norm': 1.1378402709960938, 'learning_rate': 0.00016541353383458648, 'epoch': 0.21}


 18%|█▊        | 355/2000 [40:52<2:53:29,  6.33s/it]

{'loss': 1.434, 'grad_norm': 1.0387107133865356, 'learning_rate': 0.0001649122807017544, 'epoch': 0.21}


 18%|█▊        | 360/2000 [41:24<2:58:02,  6.51s/it]

{'loss': 1.5191, 'grad_norm': 0.9308577179908752, 'learning_rate': 0.0001644110275689223, 'epoch': 0.21}


 18%|█▊        | 365/2000 [41:59<3:03:24,  6.73s/it]

{'loss': 1.4525, 'grad_norm': 1.0414131879806519, 'learning_rate': 0.00016390977443609024, 'epoch': 0.22}


 18%|█▊        | 370/2000 [42:31<3:02:59,  6.74s/it]

{'loss': 1.502, 'grad_norm': 1.0251795053482056, 'learning_rate': 0.00016340852130325813, 'epoch': 0.22}


 19%|█▉        | 375/2000 [43:07<3:14:06,  7.17s/it]

{'loss': 1.3497, 'grad_norm': 0.9890316724777222, 'learning_rate': 0.00016290726817042608, 'epoch': 0.22}


 19%|█▉        | 380/2000 [43:40<3:01:15,  6.71s/it]

{'loss': 1.4572, 'grad_norm': 1.0639238357543945, 'learning_rate': 0.00016240601503759398, 'epoch': 0.22}


 19%|█▉        | 385/2000 [44:18<3:10:25,  7.07s/it]

{'loss': 1.4296, 'grad_norm': 1.1881426572799683, 'learning_rate': 0.00016190476190476192, 'epoch': 0.23}


 20%|█▉        | 390/2000 [44:53<3:13:19,  7.20s/it]

{'loss': 1.405, 'grad_norm': 0.9097644090652466, 'learning_rate': 0.00016140350877192982, 'epoch': 0.23}


 20%|█▉        | 395/2000 [45:28<3:12:12,  7.19s/it]

{'loss': 1.4193, 'grad_norm': 1.0315747261047363, 'learning_rate': 0.00016090225563909777, 'epoch': 0.23}


 20%|██        | 400/2000 [46:01<2:53:33,  6.51s/it]

{'loss': 1.5419, 'grad_norm': 1.0427813529968262, 'learning_rate': 0.00016040100250626566, 'epoch': 0.24}


 20%|██        | 405/2000 [46:34<2:57:19,  6.67s/it]

{'loss': 1.378, 'grad_norm': 1.032768964767456, 'learning_rate': 0.00015989974937343358, 'epoch': 0.24}


 20%|██        | 410/2000 [47:07<2:57:30,  6.70s/it]

{'loss': 1.479, 'grad_norm': 0.9933909773826599, 'learning_rate': 0.0001593984962406015, 'epoch': 0.24}


 21%|██        | 415/2000 [47:38<2:43:30,  6.19s/it]

{'loss': 1.4587, 'grad_norm': 1.0538294315338135, 'learning_rate': 0.00015889724310776942, 'epoch': 0.25}


 21%|██        | 420/2000 [48:13<2:57:34,  6.74s/it]

{'loss': 1.408, 'grad_norm': 0.9578077793121338, 'learning_rate': 0.00015839598997493734, 'epoch': 0.25}


 21%|██▏       | 425/2000 [48:48<3:02:00,  6.93s/it]

{'loss': 1.4255, 'grad_norm': 1.0677573680877686, 'learning_rate': 0.00015789473684210527, 'epoch': 0.25}


 22%|██▏       | 430/2000 [49:22<3:02:10,  6.96s/it]

{'loss': 1.3834, 'grad_norm': 0.9828157424926758, 'learning_rate': 0.00015739348370927319, 'epoch': 0.25}


 22%|██▏       | 435/2000 [49:57<2:57:43,  6.81s/it]

{'loss': 1.3432, 'grad_norm': 1.2171220779418945, 'learning_rate': 0.0001568922305764411, 'epoch': 0.26}


 22%|██▏       | 440/2000 [50:31<3:04:27,  7.09s/it]

{'loss': 1.4073, 'grad_norm': 0.9697394371032715, 'learning_rate': 0.00015639097744360903, 'epoch': 0.26}


 22%|██▏       | 445/2000 [51:06<2:54:50,  6.75s/it]

{'loss': 1.3678, 'grad_norm': 1.093349575996399, 'learning_rate': 0.00015588972431077695, 'epoch': 0.26}


 22%|██▎       | 450/2000 [51:39<2:56:13,  6.82s/it]

{'loss': 1.3499, 'grad_norm': 1.1317431926727295, 'learning_rate': 0.00015538847117794487, 'epoch': 0.27}


 23%|██▎       | 455/2000 [52:15<2:57:05,  6.88s/it]

{'loss': 1.4113, 'grad_norm': 1.354501485824585, 'learning_rate': 0.0001548872180451128, 'epoch': 0.27}


 23%|██▎       | 460/2000 [52:50<3:02:32,  7.11s/it]

{'loss': 1.1988, 'grad_norm': 0.9542083144187927, 'learning_rate': 0.0001543859649122807, 'epoch': 0.27}


 23%|██▎       | 465/2000 [53:27<3:11:14,  7.48s/it]

{'loss': 1.2029, 'grad_norm': 1.0621236562728882, 'learning_rate': 0.00015388471177944863, 'epoch': 0.27}


 24%|██▎       | 470/2000 [53:59<2:46:59,  6.55s/it]

{'loss': 1.4472, 'grad_norm': 1.139036774635315, 'learning_rate': 0.00015338345864661653, 'epoch': 0.28}


 24%|██▍       | 475/2000 [54:34<3:01:19,  7.13s/it]

{'loss': 1.3565, 'grad_norm': 0.93958580493927, 'learning_rate': 0.00015288220551378448, 'epoch': 0.28}


 24%|██▍       | 480/2000 [55:08<2:57:44,  7.02s/it]

{'loss': 1.5148, 'grad_norm': 1.3109173774719238, 'learning_rate': 0.00015238095238095237, 'epoch': 0.28}


 24%|██▍       | 485/2000 [55:43<2:55:20,  6.94s/it]

{'loss': 1.265, 'grad_norm': 1.1942825317382812, 'learning_rate': 0.00015187969924812032, 'epoch': 0.29}


 24%|██▍       | 490/2000 [56:19<2:56:39,  7.02s/it]

{'loss': 1.2291, 'grad_norm': 1.3624187707901, 'learning_rate': 0.0001513784461152882, 'epoch': 0.29}


 25%|██▍       | 495/2000 [56:54<3:01:38,  7.24s/it]

{'loss': 1.2912, 'grad_norm': 0.9766658544540405, 'learning_rate': 0.00015087719298245616, 'epoch': 0.29}


 25%|██▌       | 500/2000 [57:28<2:52:38,  6.91s/it]

{'loss': 1.339, 'grad_norm': 1.2494179010391235, 'learning_rate': 0.00015037593984962405, 'epoch': 0.3}


 25%|██▌       | 505/2000 [58:02<2:47:44,  6.73s/it]

{'loss': 1.3542, 'grad_norm': 1.1983740329742432, 'learning_rate': 0.000149874686716792, 'epoch': 0.3}


 26%|██▌       | 510/2000 [58:37<2:49:49,  6.84s/it]

{'loss': 1.2904, 'grad_norm': 1.1400823593139648, 'learning_rate': 0.0001493734335839599, 'epoch': 0.3}


 26%|██▌       | 515/2000 [59:09<2:40:41,  6.49s/it]

{'loss': 1.2301, 'grad_norm': 1.0930993556976318, 'learning_rate': 0.00014887218045112784, 'epoch': 0.3}


 26%|██▌       | 520/2000 [59:42<2:46:01,  6.73s/it]

{'loss': 1.3645, 'grad_norm': 1.04677414894104, 'learning_rate': 0.00014837092731829574, 'epoch': 0.31}


 26%|██▋       | 525/2000 [1:00:16<2:50:29,  6.94s/it]

{'loss': 1.3432, 'grad_norm': 1.2088792324066162, 'learning_rate': 0.00014786967418546366, 'epoch': 0.31}


 26%|██▋       | 530/2000 [1:00:48<2:35:52,  6.36s/it]

{'loss': 1.4283, 'grad_norm': 1.3488353490829468, 'learning_rate': 0.00014736842105263158, 'epoch': 0.31}


 27%|██▋       | 535/2000 [1:01:22<2:46:13,  6.81s/it]

{'loss': 1.3219, 'grad_norm': 1.1529690027236938, 'learning_rate': 0.0001468671679197995, 'epoch': 0.32}


 27%|██▋       | 540/2000 [1:01:57<2:54:12,  7.16s/it]

{'loss': 1.233, 'grad_norm': 1.0975911617279053, 'learning_rate': 0.00014636591478696742, 'epoch': 0.32}


 27%|██▋       | 545/2000 [1:02:32<2:48:46,  6.96s/it]

{'loss': 1.2237, 'grad_norm': 1.1224992275238037, 'learning_rate': 0.00014586466165413534, 'epoch': 0.32}


 28%|██▊       | 550/2000 [1:03:05<2:44:24,  6.80s/it]

{'loss': 1.2915, 'grad_norm': 1.0287275314331055, 'learning_rate': 0.00014536340852130326, 'epoch': 0.32}


 28%|██▊       | 555/2000 [1:03:39<2:41:23,  6.70s/it]

{'loss': 1.3302, 'grad_norm': 1.5363305807113647, 'learning_rate': 0.00014486215538847118, 'epoch': 0.33}


 28%|██▊       | 560/2000 [1:04:11<2:37:39,  6.57s/it]

{'loss': 1.2475, 'grad_norm': 1.0564631223678589, 'learning_rate': 0.0001443609022556391, 'epoch': 0.33}


 28%|██▊       | 565/2000 [1:04:44<2:32:59,  6.40s/it]

{'loss': 1.2708, 'grad_norm': 1.3009556531906128, 'learning_rate': 0.00014385964912280703, 'epoch': 0.33}


 28%|██▊       | 570/2000 [1:05:21<2:53:51,  7.29s/it]

{'loss': 1.4023, 'grad_norm': 1.1399235725402832, 'learning_rate': 0.00014335839598997495, 'epoch': 0.34}


 29%|██▉       | 575/2000 [1:05:55<2:41:31,  6.80s/it]

{'loss': 1.2216, 'grad_norm': 1.2044274806976318, 'learning_rate': 0.00014285714285714287, 'epoch': 0.34}


 29%|██▉       | 580/2000 [1:06:29<2:42:01,  6.85s/it]

{'loss': 1.3178, 'grad_norm': 1.1081454753875732, 'learning_rate': 0.0001423558897243108, 'epoch': 0.34}


 29%|██▉       | 585/2000 [1:07:02<2:36:09,  6.62s/it]

{'loss': 1.2965, 'grad_norm': 1.164624810218811, 'learning_rate': 0.0001418546365914787, 'epoch': 0.35}


 30%|██▉       | 590/2000 [1:07:36<2:39:45,  6.80s/it]

{'loss': 1.3625, 'grad_norm': 1.2441002130508423, 'learning_rate': 0.0001413533834586466, 'epoch': 0.35}


 30%|██▉       | 595/2000 [1:08:10<2:44:10,  7.01s/it]

{'loss': 1.2692, 'grad_norm': 1.1494401693344116, 'learning_rate': 0.00014085213032581455, 'epoch': 0.35}


 30%|███       | 600/2000 [1:08:46<2:47:17,  7.17s/it]

{'loss': 1.3903, 'grad_norm': 1.0980480909347534, 'learning_rate': 0.00014035087719298245, 'epoch': 0.35}


 30%|███       | 605/2000 [1:09:23<2:51:10,  7.36s/it]

{'loss': 1.2344, 'grad_norm': 1.0210484266281128, 'learning_rate': 0.0001398496240601504, 'epoch': 0.36}


 30%|███       | 610/2000 [1:09:57<2:39:49,  6.90s/it]

{'loss': 1.3361, 'grad_norm': 1.1724189519882202, 'learning_rate': 0.0001393483709273183, 'epoch': 0.36}


 31%|███       | 615/2000 [1:10:32<2:42:42,  7.05s/it]

{'loss': 1.2261, 'grad_norm': 1.098995327949524, 'learning_rate': 0.00013884711779448624, 'epoch': 0.36}


 31%|███       | 620/2000 [1:11:06<2:34:24,  6.71s/it]

{'loss': 1.2786, 'grad_norm': 1.3162239789962769, 'learning_rate': 0.00013834586466165413, 'epoch': 0.37}


 31%|███▏      | 625/2000 [1:11:42<2:40:55,  7.02s/it]

{'loss': 1.1623, 'grad_norm': 1.1246699094772339, 'learning_rate': 0.00013784461152882208, 'epoch': 0.37}


 32%|███▏      | 630/2000 [1:12:14<2:30:25,  6.59s/it]

{'loss': 1.2641, 'grad_norm': 1.245484709739685, 'learning_rate': 0.00013734335839598997, 'epoch': 0.37}


 32%|███▏      | 635/2000 [1:12:49<2:35:32,  6.84s/it]

{'loss': 1.2188, 'grad_norm': 1.2681502103805542, 'learning_rate': 0.0001368421052631579, 'epoch': 0.37}


 32%|███▏      | 640/2000 [1:13:24<2:27:48,  6.52s/it]

{'loss': 1.1708, 'grad_norm': 1.3343188762664795, 'learning_rate': 0.00013634085213032581, 'epoch': 0.38}


 32%|███▏      | 645/2000 [1:13:59<2:38:32,  7.02s/it]

{'loss': 1.2078, 'grad_norm': 1.1543577909469604, 'learning_rate': 0.00013583959899749373, 'epoch': 0.38}


 32%|███▎      | 650/2000 [1:14:32<2:28:20,  6.59s/it]

{'loss': 1.2572, 'grad_norm': 1.1672154664993286, 'learning_rate': 0.00013533834586466166, 'epoch': 0.38}


 33%|███▎      | 655/2000 [1:15:07<2:39:22,  7.11s/it]

{'loss': 1.306, 'grad_norm': 1.1531533002853394, 'learning_rate': 0.00013483709273182958, 'epoch': 0.39}


 33%|███▎      | 660/2000 [1:15:41<2:33:13,  6.86s/it]

{'loss': 1.2857, 'grad_norm': 1.1717712879180908, 'learning_rate': 0.0001343358395989975, 'epoch': 0.39}


 33%|███▎      | 665/2000 [1:16:15<2:30:44,  6.77s/it]

{'loss': 1.2122, 'grad_norm': 1.0072818994522095, 'learning_rate': 0.00013383458646616542, 'epoch': 0.39}


 34%|███▎      | 670/2000 [1:16:52<2:38:55,  7.17s/it]

{'loss': 1.1508, 'grad_norm': 1.3173154592514038, 'learning_rate': 0.00013333333333333334, 'epoch': 0.4}


 34%|███▍      | 675/2000 [1:17:25<2:26:58,  6.66s/it]

{'loss': 1.309, 'grad_norm': 1.5100171566009521, 'learning_rate': 0.00013283208020050126, 'epoch': 0.4}


 34%|███▍      | 680/2000 [1:18:01<2:33:12,  6.96s/it]

{'loss': 1.1446, 'grad_norm': 1.1293449401855469, 'learning_rate': 0.00013233082706766918, 'epoch': 0.4}


 34%|███▍      | 685/2000 [1:18:34<2:30:43,  6.88s/it]

{'loss': 1.2391, 'grad_norm': 1.2165882587432861, 'learning_rate': 0.0001318295739348371, 'epoch': 0.4}


 34%|███▍      | 690/2000 [1:19:11<2:43:04,  7.47s/it]

{'loss': 1.1082, 'grad_norm': 1.1081280708312988, 'learning_rate': 0.00013132832080200502, 'epoch': 0.41}


 35%|███▍      | 695/2000 [1:19:44<2:34:30,  7.10s/it]

{'loss': 1.2043, 'grad_norm': 1.0567681789398193, 'learning_rate': 0.00013082706766917294, 'epoch': 0.41}


 35%|███▌      | 700/2000 [1:20:17<2:30:39,  6.95s/it]

{'loss': 1.2133, 'grad_norm': 1.1160569190979004, 'learning_rate': 0.00013032581453634084, 'epoch': 0.41}


 35%|███▌      | 705/2000 [1:20:52<2:31:07,  7.00s/it]

{'loss': 1.3557, 'grad_norm': 1.3793083429336548, 'learning_rate': 0.0001298245614035088, 'epoch': 0.42}


 36%|███▌      | 710/2000 [1:21:22<2:13:19,  6.20s/it]

{'loss': 1.2054, 'grad_norm': 1.134678602218628, 'learning_rate': 0.00012932330827067668, 'epoch': 0.42}


 36%|███▌      | 715/2000 [1:21:54<2:19:13,  6.50s/it]

{'loss': 1.1914, 'grad_norm': 1.2566311359405518, 'learning_rate': 0.00012882205513784463, 'epoch': 0.42}


 36%|███▌      | 720/2000 [1:22:28<2:25:16,  6.81s/it]

{'loss': 1.1198, 'grad_norm': 1.1027913093566895, 'learning_rate': 0.00012832080200501252, 'epoch': 0.43}


 36%|███▋      | 725/2000 [1:23:08<2:49:34,  7.98s/it]

{'loss': 0.9716, 'grad_norm': 1.1423637866973877, 'learning_rate': 0.00012781954887218047, 'epoch': 0.43}


 36%|███▋      | 730/2000 [1:23:44<2:32:19,  7.20s/it]

{'loss': 1.2748, 'grad_norm': 1.290273666381836, 'learning_rate': 0.00012731829573934836, 'epoch': 0.43}


 37%|███▋      | 735/2000 [1:24:17<2:16:11,  6.46s/it]

{'loss': 1.3565, 'grad_norm': 1.248607873916626, 'learning_rate': 0.0001268170426065163, 'epoch': 0.43}


 37%|███▋      | 740/2000 [1:24:54<2:33:50,  7.33s/it]

{'loss': 1.1194, 'grad_norm': 1.185571551322937, 'learning_rate': 0.0001263157894736842, 'epoch': 0.44}


 37%|███▋      | 745/2000 [1:25:28<2:23:18,  6.85s/it]

{'loss': 1.1695, 'grad_norm': 1.6020134687423706, 'learning_rate': 0.00012581453634085213, 'epoch': 0.44}


 38%|███▊      | 750/2000 [1:26:03<2:26:50,  7.05s/it]

{'loss': 1.1913, 'grad_norm': 1.0854830741882324, 'learning_rate': 0.00012531328320802005, 'epoch': 0.44}


 38%|███▊      | 755/2000 [1:26:35<2:16:48,  6.59s/it]

{'loss': 1.1871, 'grad_norm': 1.0390379428863525, 'learning_rate': 0.00012481203007518797, 'epoch': 0.45}


 38%|███▊      | 760/2000 [1:27:12<2:34:00,  7.45s/it]

{'loss': 1.1135, 'grad_norm': 1.0968196392059326, 'learning_rate': 0.0001243107769423559, 'epoch': 0.45}


 38%|███▊      | 765/2000 [1:27:45<2:17:32,  6.68s/it]

{'loss': 1.0485, 'grad_norm': 1.1436532735824585, 'learning_rate': 0.0001238095238095238, 'epoch': 0.45}


 38%|███▊      | 770/2000 [1:28:20<2:23:57,  7.02s/it]

{'loss': 1.1483, 'grad_norm': 1.1657772064208984, 'learning_rate': 0.00012330827067669173, 'epoch': 0.45}


 39%|███▉      | 775/2000 [1:28:51<2:08:18,  6.28s/it]

{'loss': 1.1352, 'grad_norm': 1.168986201286316, 'learning_rate': 0.00012280701754385965, 'epoch': 0.46}


 39%|███▉      | 780/2000 [1:29:26<2:22:39,  7.02s/it]

{'loss': 1.2077, 'grad_norm': 1.1554542779922485, 'learning_rate': 0.00012230576441102757, 'epoch': 0.46}


 39%|███▉      | 785/2000 [1:30:00<2:16:21,  6.73s/it]

{'loss': 1.0761, 'grad_norm': 1.5006510019302368, 'learning_rate': 0.0001218045112781955, 'epoch': 0.46}


 40%|███▉      | 790/2000 [1:30:33<2:12:46,  6.58s/it]

{'loss': 1.216, 'grad_norm': 1.3811664581298828, 'learning_rate': 0.0001213032581453634, 'epoch': 0.47}


 40%|███▉      | 795/2000 [1:31:11<2:27:28,  7.34s/it]

{'loss': 1.1928, 'grad_norm': 1.2679400444030762, 'learning_rate': 0.00012080200501253134, 'epoch': 0.47}


 40%|████      | 800/2000 [1:31:49<2:32:54,  7.65s/it]

{'loss': 1.1893, 'grad_norm': 1.2572402954101562, 'learning_rate': 0.00012030075187969925, 'epoch': 0.47}


 40%|████      | 805/2000 [1:32:22<2:19:08,  6.99s/it]

{'loss': 1.2399, 'grad_norm': 1.0759947299957275, 'learning_rate': 0.00011979949874686718, 'epoch': 0.48}


 40%|████      | 810/2000 [1:32:56<2:16:31,  6.88s/it]

{'loss': 1.1465, 'grad_norm': 1.2938898801803589, 'learning_rate': 0.00011929824561403509, 'epoch': 0.48}


 41%|████      | 815/2000 [1:33:32<2:15:08,  6.84s/it]

{'loss': 1.143, 'grad_norm': 1.312024474143982, 'learning_rate': 0.00011879699248120302, 'epoch': 0.48}


 41%|████      | 820/2000 [1:34:04<2:06:19,  6.42s/it]

{'loss': 1.2029, 'grad_norm': 1.1961669921875, 'learning_rate': 0.00011829573934837093, 'epoch': 0.48}


 41%|████▏     | 825/2000 [1:34:39<2:13:50,  6.83s/it]

{'loss': 1.1506, 'grad_norm': 1.437454104423523, 'learning_rate': 0.00011779448621553886, 'epoch': 0.49}


 42%|████▏     | 830/2000 [1:35:15<2:22:20,  7.30s/it]

{'loss': 1.0742, 'grad_norm': 1.098257064819336, 'learning_rate': 0.00011729323308270677, 'epoch': 0.49}


 42%|████▏     | 835/2000 [1:35:50<2:16:38,  7.04s/it]

{'loss': 1.17, 'grad_norm': 1.2824513912200928, 'learning_rate': 0.00011679197994987469, 'epoch': 0.49}


 42%|████▏     | 840/2000 [1:36:26<2:18:08,  7.15s/it]

{'loss': 1.1237, 'grad_norm': 1.202103853225708, 'learning_rate': 0.0001162907268170426, 'epoch': 0.5}


 42%|████▏     | 845/2000 [1:37:00<2:15:07,  7.02s/it]

{'loss': 1.1157, 'grad_norm': 0.9880297183990479, 'learning_rate': 0.00011578947368421053, 'epoch': 0.5}


 42%|████▎     | 850/2000 [1:37:34<2:14:28,  7.02s/it]

{'loss': 1.0861, 'grad_norm': 1.095328688621521, 'learning_rate': 0.00011528822055137844, 'epoch': 0.5}


 43%|████▎     | 855/2000 [1:38:06<1:59:25,  6.26s/it]

{'loss': 1.1274, 'grad_norm': 1.3303157091140747, 'learning_rate': 0.00011478696741854638, 'epoch': 0.5}


 43%|████▎     | 860/2000 [1:38:38<1:59:48,  6.31s/it]

{'loss': 1.1308, 'grad_norm': 1.1415557861328125, 'learning_rate': 0.00011428571428571428, 'epoch': 0.51}


 43%|████▎     | 865/2000 [1:39:14<2:12:57,  7.03s/it]

{'loss': 1.191, 'grad_norm': 1.3480119705200195, 'learning_rate': 0.00011378446115288222, 'epoch': 0.51}


 44%|████▎     | 870/2000 [1:39:47<2:04:59,  6.64s/it]

{'loss': 1.2101, 'grad_norm': 1.3083902597427368, 'learning_rate': 0.00011328320802005013, 'epoch': 0.51}


 44%|████▍     | 875/2000 [1:40:26<2:17:14,  7.32s/it]

{'loss': 1.0877, 'grad_norm': 1.254554033279419, 'learning_rate': 0.00011278195488721806, 'epoch': 0.52}


 44%|████▍     | 880/2000 [1:40:59<2:06:13,  6.76s/it]

{'loss': 1.161, 'grad_norm': 1.368010401725769, 'learning_rate': 0.00011228070175438597, 'epoch': 0.52}


 44%|████▍     | 885/2000 [1:41:32<2:10:15,  7.01s/it]

{'loss': 1.1779, 'grad_norm': 1.390325903892517, 'learning_rate': 0.00011177944862155389, 'epoch': 0.52}


 44%|████▍     | 890/2000 [1:42:03<1:56:08,  6.28s/it]

{'loss': 1.2365, 'grad_norm': 1.2487001419067383, 'learning_rate': 0.00011127819548872181, 'epoch': 0.53}


 45%|████▍     | 895/2000 [1:42:37<2:07:55,  6.95s/it]

{'loss': 1.0587, 'grad_norm': 1.2122797966003418, 'learning_rate': 0.00011077694235588973, 'epoch': 0.53}


 45%|████▌     | 900/2000 [1:43:11<2:08:13,  6.99s/it]

{'loss': 1.1026, 'grad_norm': 1.350793719291687, 'learning_rate': 0.00011027568922305764, 'epoch': 0.53}


 45%|████▌     | 905/2000 [1:43:46<2:03:22,  6.76s/it]

{'loss': 1.1012, 'grad_norm': 1.329400897026062, 'learning_rate': 0.00010977443609022557, 'epoch': 0.53}


 46%|████▌     | 910/2000 [1:44:19<2:01:43,  6.70s/it]

{'loss': 1.0423, 'grad_norm': 1.273405909538269, 'learning_rate': 0.00010927318295739348, 'epoch': 0.54}


 46%|████▌     | 915/2000 [1:44:54<2:07:40,  7.06s/it]

{'loss': 0.9251, 'grad_norm': 1.2452001571655273, 'learning_rate': 0.00010877192982456141, 'epoch': 0.54}


 46%|████▌     | 920/2000 [1:45:29<2:06:49,  7.05s/it]

{'loss': 0.983, 'grad_norm': 1.1891984939575195, 'learning_rate': 0.00010827067669172932, 'epoch': 0.54}


 46%|████▋     | 925/2000 [1:46:02<1:57:26,  6.56s/it]

{'loss': 1.2784, 'grad_norm': 1.3849130868911743, 'learning_rate': 0.00010776942355889726, 'epoch': 0.55}


 46%|████▋     | 930/2000 [1:46:35<1:58:30,  6.65s/it]

{'loss': 1.1293, 'grad_norm': 1.2812891006469727, 'learning_rate': 0.00010726817042606516, 'epoch': 0.55}


 47%|████▋     | 935/2000 [1:47:13<2:11:09,  7.39s/it]

{'loss': 0.9756, 'grad_norm': 1.4472376108169556, 'learning_rate': 0.0001067669172932331, 'epoch': 0.55}


 47%|████▋     | 940/2000 [1:47:48<2:09:09,  7.31s/it]

{'loss': 1.1295, 'grad_norm': 1.3342339992523193, 'learning_rate': 0.000106265664160401, 'epoch': 0.55}


 47%|████▋     | 945/2000 [1:48:22<1:56:35,  6.63s/it]

{'loss': 0.9826, 'grad_norm': 1.3883708715438843, 'learning_rate': 0.00010576441102756893, 'epoch': 0.56}


 48%|████▊     | 950/2000 [1:48:56<2:00:51,  6.91s/it]

{'loss': 1.0521, 'grad_norm': 1.157913327217102, 'learning_rate': 0.00010526315789473685, 'epoch': 0.56}


 48%|████▊     | 955/2000 [1:49:33<2:04:08,  7.13s/it]

{'loss': 0.9388, 'grad_norm': 1.354029655456543, 'learning_rate': 0.00010476190476190477, 'epoch': 0.56}


 48%|████▊     | 960/2000 [1:50:09<2:00:12,  6.94s/it]

{'loss': 0.9947, 'grad_norm': 1.249582052230835, 'learning_rate': 0.00010426065162907268, 'epoch': 0.57}


 48%|████▊     | 965/2000 [1:50:42<1:54:59,  6.67s/it]

{'loss': 0.9892, 'grad_norm': 1.298510193824768, 'learning_rate': 0.00010375939849624061, 'epoch': 0.57}


 48%|████▊     | 970/2000 [1:51:17<1:59:51,  6.98s/it]

{'loss': 1.0176, 'grad_norm': 1.2136142253875732, 'learning_rate': 0.00010325814536340852, 'epoch': 0.57}


 49%|████▉     | 975/2000 [1:51:50<1:54:38,  6.71s/it]

{'loss': 1.2657, 'grad_norm': 1.3698294162750244, 'learning_rate': 0.00010275689223057645, 'epoch': 0.58}


 49%|████▉     | 980/2000 [1:52:26<2:00:04,  7.06s/it]

{'loss': 1.0534, 'grad_norm': 1.2841932773590088, 'learning_rate': 0.00010225563909774436, 'epoch': 0.58}


 49%|████▉     | 985/2000 [1:53:03<2:04:47,  7.38s/it]

{'loss': 0.8391, 'grad_norm': 0.9251750707626343, 'learning_rate': 0.0001017543859649123, 'epoch': 0.58}


 50%|████▉     | 990/2000 [1:53:39<2:03:50,  7.36s/it]

{'loss': 0.9364, 'grad_norm': 1.1653392314910889, 'learning_rate': 0.0001012531328320802, 'epoch': 0.58}


 50%|████▉     | 995/2000 [1:54:13<1:59:36,  7.14s/it]

{'loss': 0.9851, 'grad_norm': 1.1236029863357544, 'learning_rate': 0.00010075187969924814, 'epoch': 0.59}


 50%|█████     | 1000/2000 [1:54:48<1:51:35,  6.70s/it]

{'loss': 0.9664, 'grad_norm': 1.4831438064575195, 'learning_rate': 0.00010025062656641604, 'epoch': 0.59}


 50%|█████     | 1005/2000 [1:55:25<1:56:29,  7.02s/it]

{'loss': 1.111, 'grad_norm': 1.463314175605774, 'learning_rate': 9.974937343358397e-05, 'epoch': 0.59}


 50%|█████     | 1010/2000 [1:56:01<2:00:46,  7.32s/it]

{'loss': 0.9572, 'grad_norm': 1.2547234296798706, 'learning_rate': 9.924812030075187e-05, 'epoch': 0.6}


 51%|█████     | 1015/2000 [1:56:33<1:47:22,  6.54s/it]

{'loss': 1.1734, 'grad_norm': 1.3255984783172607, 'learning_rate': 9.87468671679198e-05, 'epoch': 0.6}


 51%|█████     | 1020/2000 [1:57:09<1:57:22,  7.19s/it]

{'loss': 1.0548, 'grad_norm': 1.2319415807724, 'learning_rate': 9.824561403508771e-05, 'epoch': 0.6}


 51%|█████▏    | 1025/2000 [1:57:45<1:53:29,  6.98s/it]

{'loss': 1.0542, 'grad_norm': 1.3325629234313965, 'learning_rate': 9.774436090225564e-05, 'epoch': 0.61}


 52%|█████▏    | 1030/2000 [1:58:18<1:49:55,  6.80s/it]

{'loss': 0.9034, 'grad_norm': 1.323063850402832, 'learning_rate': 9.724310776942356e-05, 'epoch': 0.61}


 52%|█████▏    | 1035/2000 [1:58:55<2:02:02,  7.59s/it]

{'loss': 1.1011, 'grad_norm': 1.335036277770996, 'learning_rate': 9.674185463659148e-05, 'epoch': 0.61}


 52%|█████▏    | 1040/2000 [1:59:26<1:42:20,  6.40s/it]

{'loss': 1.1706, 'grad_norm': 1.4199497699737549, 'learning_rate': 9.62406015037594e-05, 'epoch': 0.61}


 52%|█████▏    | 1045/2000 [2:00:02<1:52:44,  7.08s/it]

{'loss': 1.0243, 'grad_norm': 1.191562294960022, 'learning_rate': 9.573934837092732e-05, 'epoch': 0.62}


 52%|█████▎    | 1050/2000 [2:00:36<1:48:59,  6.88s/it]

{'loss': 0.9592, 'grad_norm': 1.33430016040802, 'learning_rate': 9.523809523809524e-05, 'epoch': 0.62}


 53%|█████▎    | 1055/2000 [2:01:12<1:57:03,  7.43s/it]

{'loss': 1.0211, 'grad_norm': 0.971207857131958, 'learning_rate': 9.473684210526316e-05, 'epoch': 0.62}


 53%|█████▎    | 1060/2000 [2:01:48<1:50:58,  7.08s/it]

{'loss': 0.9254, 'grad_norm': 1.2990907430648804, 'learning_rate': 9.423558897243108e-05, 'epoch': 0.63}


 53%|█████▎    | 1065/2000 [2:02:22<1:43:50,  6.66s/it]

{'loss': 1.013, 'grad_norm': 1.4435092210769653, 'learning_rate': 9.373433583959899e-05, 'epoch': 0.63}


 54%|█████▎    | 1070/2000 [2:02:57<1:50:44,  7.14s/it]

{'loss': 0.8449, 'grad_norm': 1.2378244400024414, 'learning_rate': 9.323308270676691e-05, 'epoch': 0.63}


 54%|█████▍    | 1075/2000 [2:03:31<1:48:14,  7.02s/it]

{'loss': 0.9191, 'grad_norm': 1.15598726272583, 'learning_rate': 9.273182957393483e-05, 'epoch': 0.63}


 54%|█████▍    | 1080/2000 [2:04:06<1:44:52,  6.84s/it]

{'loss': 0.987, 'grad_norm': 1.412122368812561, 'learning_rate': 9.223057644110275e-05, 'epoch': 0.64}


 54%|█████▍    | 1085/2000 [2:04:43<1:51:30,  7.31s/it]

{'loss': 0.8285, 'grad_norm': 1.3682137727737427, 'learning_rate': 9.172932330827067e-05, 'epoch': 0.64}


 55%|█████▍    | 1090/2000 [2:05:18<1:48:06,  7.13s/it]

{'loss': 1.0296, 'grad_norm': 1.3732696771621704, 'learning_rate': 9.12280701754386e-05, 'epoch': 0.64}


 55%|█████▍    | 1095/2000 [2:05:54<1:49:19,  7.25s/it]

{'loss': 0.9491, 'grad_norm': 1.2567834854125977, 'learning_rate': 9.072681704260652e-05, 'epoch': 0.65}


 55%|█████▌    | 1100/2000 [2:06:27<1:37:25,  6.50s/it]

{'loss': 1.1121, 'grad_norm': 1.8421776294708252, 'learning_rate': 9.022556390977444e-05, 'epoch': 0.65}


 55%|█████▌    | 1105/2000 [2:06:59<1:35:41,  6.41s/it]

{'loss': 0.9515, 'grad_norm': 1.4442222118377686, 'learning_rate': 8.972431077694236e-05, 'epoch': 0.65}


 56%|█████▌    | 1110/2000 [2:07:31<1:29:47,  6.05s/it]

{'loss': 1.1868, 'grad_norm': 1.3702119588851929, 'learning_rate': 8.922305764411028e-05, 'epoch': 0.66}


 56%|█████▌    | 1115/2000 [2:08:05<1:41:48,  6.90s/it]

{'loss': 0.9275, 'grad_norm': 1.3770135641098022, 'learning_rate': 8.87218045112782e-05, 'epoch': 0.66}


 56%|█████▌    | 1120/2000 [2:08:40<1:42:14,  6.97s/it]

{'loss': 0.9651, 'grad_norm': 1.5407929420471191, 'learning_rate': 8.822055137844612e-05, 'epoch': 0.66}


 56%|█████▋    | 1125/2000 [2:09:14<1:42:58,  7.06s/it]

{'loss': 0.982, 'grad_norm': 1.3024307489395142, 'learning_rate': 8.771929824561403e-05, 'epoch': 0.66}


 56%|█████▋    | 1130/2000 [2:09:49<1:41:10,  6.98s/it]

{'loss': 0.9738, 'grad_norm': 1.1388189792633057, 'learning_rate': 8.721804511278195e-05, 'epoch': 0.67}


 57%|█████▋    | 1135/2000 [2:10:21<1:31:55,  6.38s/it]

{'loss': 1.2098, 'grad_norm': 1.4600144624710083, 'learning_rate': 8.671679197994987e-05, 'epoch': 0.67}


 57%|█████▋    | 1140/2000 [2:10:55<1:36:17,  6.72s/it]

{'loss': 1.06, 'grad_norm': 1.3104939460754395, 'learning_rate': 8.621553884711779e-05, 'epoch': 0.67}


 57%|█████▋    | 1145/2000 [2:11:30<1:39:25,  6.98s/it]

{'loss': 0.9945, 'grad_norm': 1.0570507049560547, 'learning_rate': 8.571428571428571e-05, 'epoch': 0.68}


 57%|█████▊    | 1150/2000 [2:12:05<1:41:30,  7.17s/it]

{'loss': 0.9474, 'grad_norm': 1.368147373199463, 'learning_rate': 8.521303258145363e-05, 'epoch': 0.68}


 58%|█████▊    | 1155/2000 [2:12:39<1:36:36,  6.86s/it]

{'loss': 0.8869, 'grad_norm': 1.7872812747955322, 'learning_rate': 8.471177944862155e-05, 'epoch': 0.68}


 58%|█████▊    | 1160/2000 [2:13:13<1:35:15,  6.80s/it]

{'loss': 1.1699, 'grad_norm': 1.3931212425231934, 'learning_rate': 8.421052631578948e-05, 'epoch': 0.68}


 58%|█████▊    | 1165/2000 [2:13:47<1:34:58,  6.82s/it]

{'loss': 0.951, 'grad_norm': 1.3165456056594849, 'learning_rate': 8.37092731829574e-05, 'epoch': 0.69}


 58%|█████▊    | 1170/2000 [2:14:19<1:33:34,  6.76s/it]

{'loss': 1.0931, 'grad_norm': 1.180040955543518, 'learning_rate': 8.320802005012532e-05, 'epoch': 0.69}


 59%|█████▉    | 1175/2000 [2:14:54<1:36:25,  7.01s/it]

{'loss': 0.9537, 'grad_norm': 1.5243492126464844, 'learning_rate': 8.270676691729324e-05, 'epoch': 0.69}


 59%|█████▉    | 1180/2000 [2:15:30<1:36:43,  7.08s/it]

{'loss': 1.0107, 'grad_norm': 1.3209506273269653, 'learning_rate': 8.220551378446115e-05, 'epoch': 0.7}


 59%|█████▉    | 1185/2000 [2:16:02<1:30:41,  6.68s/it]

{'loss': 0.8973, 'grad_norm': 1.4733390808105469, 'learning_rate': 8.170426065162907e-05, 'epoch': 0.7}


 60%|█████▉    | 1190/2000 [2:16:40<1:39:25,  7.36s/it]

{'loss': 0.8463, 'grad_norm': 1.4993168115615845, 'learning_rate': 8.120300751879699e-05, 'epoch': 0.7}


 60%|█████▉    | 1195/2000 [2:17:17<1:37:25,  7.26s/it]

{'loss': 0.8013, 'grad_norm': 1.2703211307525635, 'learning_rate': 8.070175438596491e-05, 'epoch': 0.71}


 60%|██████    | 1200/2000 [2:17:51<1:34:49,  7.11s/it]

{'loss': 1.0924, 'grad_norm': 1.246267318725586, 'learning_rate': 8.020050125313283e-05, 'epoch': 0.71}


 60%|██████    | 1205/2000 [2:18:22<1:27:17,  6.59s/it]

{'loss': 1.2103, 'grad_norm': 1.2607594728469849, 'learning_rate': 7.969924812030075e-05, 'epoch': 0.71}


 60%|██████    | 1210/2000 [2:18:58<1:34:58,  7.21s/it]

{'loss': 0.8584, 'grad_norm': 1.0497454404830933, 'learning_rate': 7.919799498746867e-05, 'epoch': 0.71}


 61%|██████    | 1215/2000 [2:19:30<1:24:12,  6.44s/it]

{'loss': 1.0348, 'grad_norm': 1.5773403644561768, 'learning_rate': 7.869674185463659e-05, 'epoch': 0.72}


 61%|██████    | 1220/2000 [2:20:04<1:28:30,  6.81s/it]

{'loss': 0.956, 'grad_norm': 1.2092294692993164, 'learning_rate': 7.819548872180451e-05, 'epoch': 0.72}


 61%|██████▏   | 1225/2000 [2:20:38<1:31:40,  7.10s/it]

{'loss': 0.8583, 'grad_norm': 1.1769514083862305, 'learning_rate': 7.769423558897244e-05, 'epoch': 0.72}


 62%|██████▏   | 1230/2000 [2:21:13<1:27:23,  6.81s/it]

{'loss': 0.8759, 'grad_norm': 1.3763296604156494, 'learning_rate': 7.719298245614036e-05, 'epoch': 0.73}


 62%|██████▏   | 1235/2000 [2:21:50<1:32:33,  7.26s/it]

{'loss': 0.886, 'grad_norm': 1.5465565919876099, 'learning_rate': 7.669172932330826e-05, 'epoch': 0.73}


 62%|██████▏   | 1240/2000 [2:22:24<1:26:23,  6.82s/it]

{'loss': 0.9638, 'grad_norm': 1.2748034000396729, 'learning_rate': 7.619047619047618e-05, 'epoch': 0.73}


 62%|██████▏   | 1245/2000 [2:22:57<1:23:57,  6.67s/it]

{'loss': 0.9219, 'grad_norm': 1.2813693284988403, 'learning_rate': 7.56892230576441e-05, 'epoch': 0.74}


 62%|██████▎   | 1250/2000 [2:23:30<1:26:55,  6.95s/it]

{'loss': 1.0684, 'grad_norm': 1.234945297241211, 'learning_rate': 7.518796992481203e-05, 'epoch': 0.74}


 63%|██████▎   | 1255/2000 [2:24:06<1:29:21,  7.20s/it]

{'loss': 0.9175, 'grad_norm': 1.2503745555877686, 'learning_rate': 7.468671679197995e-05, 'epoch': 0.74}


 63%|██████▎   | 1260/2000 [2:24:42<1:29:51,  7.29s/it]

{'loss': 0.8628, 'grad_norm': 1.4235327243804932, 'learning_rate': 7.418546365914787e-05, 'epoch': 0.74}


 63%|██████▎   | 1265/2000 [2:25:17<1:26:46,  7.08s/it]

{'loss': 0.7346, 'grad_norm': 1.2899856567382812, 'learning_rate': 7.368421052631579e-05, 'epoch': 0.75}


 64%|██████▎   | 1270/2000 [2:25:51<1:23:07,  6.83s/it]

{'loss': 0.974, 'grad_norm': 1.2511154413223267, 'learning_rate': 7.318295739348371e-05, 'epoch': 0.75}


 64%|██████▍   | 1275/2000 [2:26:26<1:22:41,  6.84s/it]

{'loss': 1.0523, 'grad_norm': 1.5289890766143799, 'learning_rate': 7.268170426065163e-05, 'epoch': 0.75}


 64%|██████▍   | 1280/2000 [2:26:58<1:18:59,  6.58s/it]

{'loss': 1.004, 'grad_norm': 1.3704705238342285, 'learning_rate': 7.218045112781955e-05, 'epoch': 0.76}


 64%|██████▍   | 1285/2000 [2:27:32<1:21:31,  6.84s/it]

{'loss': 0.864, 'grad_norm': 1.734082818031311, 'learning_rate': 7.167919799498747e-05, 'epoch': 0.76}


 64%|██████▍   | 1290/2000 [2:28:07<1:21:08,  6.86s/it]

{'loss': 0.825, 'grad_norm': 1.3407783508300781, 'learning_rate': 7.11779448621554e-05, 'epoch': 0.76}


 65%|██████▍   | 1295/2000 [2:28:41<1:19:18,  6.75s/it]

{'loss': 0.8285, 'grad_norm': 1.245031714439392, 'learning_rate': 7.06766917293233e-05, 'epoch': 0.76}


 65%|██████▌   | 1300/2000 [2:29:13<1:18:42,  6.75s/it]

{'loss': 0.9883, 'grad_norm': 1.2235941886901855, 'learning_rate': 7.017543859649122e-05, 'epoch': 0.77}


 65%|██████▌   | 1305/2000 [2:29:44<1:11:18,  6.16s/it]

{'loss': 0.8783, 'grad_norm': 1.532874345779419, 'learning_rate': 6.967418546365914e-05, 'epoch': 0.77}


 66%|██████▌   | 1310/2000 [2:30:18<1:17:04,  6.70s/it]

{'loss': 0.8555, 'grad_norm': 1.526931881904602, 'learning_rate': 6.917293233082706e-05, 'epoch': 0.77}


 66%|██████▌   | 1315/2000 [2:30:51<1:16:50,  6.73s/it]

{'loss': 1.0381, 'grad_norm': 1.0924617052078247, 'learning_rate': 6.867167919799499e-05, 'epoch': 0.78}


 66%|██████▌   | 1320/2000 [2:31:24<1:17:13,  6.81s/it]

{'loss': 0.9501, 'grad_norm': 1.425606608390808, 'learning_rate': 6.817042606516291e-05, 'epoch': 0.78}


 66%|██████▋   | 1325/2000 [2:31:57<1:12:33,  6.45s/it]

{'loss': 1.0138, 'grad_norm': 1.5626469850540161, 'learning_rate': 6.766917293233083e-05, 'epoch': 0.78}


 66%|██████▋   | 1330/2000 [2:32:35<1:24:26,  7.56s/it]

{'loss': 0.9248, 'grad_norm': 0.9604526162147522, 'learning_rate': 6.716791979949875e-05, 'epoch': 0.79}


 67%|██████▋   | 1335/2000 [2:33:07<1:13:04,  6.59s/it]

{'loss': 0.9065, 'grad_norm': 1.5843627452850342, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.79}


 67%|██████▋   | 1340/2000 [2:33:40<1:11:15,  6.48s/it]

{'loss': 1.019, 'grad_norm': 1.5023337602615356, 'learning_rate': 6.616541353383459e-05, 'epoch': 0.79}


 67%|██████▋   | 1345/2000 [2:34:13<1:13:30,  6.73s/it]

{'loss': 1.0414, 'grad_norm': 1.0179755687713623, 'learning_rate': 6.566416040100251e-05, 'epoch': 0.79}


 68%|██████▊   | 1350/2000 [2:34:46<1:12:03,  6.65s/it]

{'loss': 0.8364, 'grad_norm': 1.4267481565475464, 'learning_rate': 6.516290726817042e-05, 'epoch': 0.8}


 68%|██████▊   | 1355/2000 [2:35:26<1:21:54,  7.62s/it]

{'loss': 0.9609, 'grad_norm': 1.7045586109161377, 'learning_rate': 6.466165413533834e-05, 'epoch': 0.8}


 68%|██████▊   | 1360/2000 [2:36:03<1:18:45,  7.38s/it]

{'loss': 0.8446, 'grad_norm': 1.1208314895629883, 'learning_rate': 6.416040100250626e-05, 'epoch': 0.8}


 68%|██████▊   | 1365/2000 [2:36:38<1:15:04,  7.09s/it]

{'loss': 1.0436, 'grad_norm': 1.4000322818756104, 'learning_rate': 6.365914786967418e-05, 'epoch': 0.81}


 68%|██████▊   | 1370/2000 [2:37:12<1:12:16,  6.88s/it]

{'loss': 0.958, 'grad_norm': 1.4463199377059937, 'learning_rate': 6.31578947368421e-05, 'epoch': 0.81}


 69%|██████▉   | 1375/2000 [2:37:44<1:05:49,  6.32s/it]

{'loss': 1.0601, 'grad_norm': 1.8493292331695557, 'learning_rate': 6.265664160401002e-05, 'epoch': 0.81}


 69%|██████▉   | 1380/2000 [2:38:19<1:12:19,  7.00s/it]

{'loss': 0.8849, 'grad_norm': 1.5485039949417114, 'learning_rate': 6.215538847117795e-05, 'epoch': 0.81}


 69%|██████▉   | 1385/2000 [2:38:55<1:11:16,  6.95s/it]

{'loss': 0.9119, 'grad_norm': 1.298280954360962, 'learning_rate': 6.165413533834587e-05, 'epoch': 0.82}


 70%|██████▉   | 1390/2000 [2:39:26<1:07:57,  6.68s/it]

{'loss': 0.9054, 'grad_norm': 1.2036017179489136, 'learning_rate': 6.115288220551379e-05, 'epoch': 0.82}


 70%|██████▉   | 1395/2000 [2:40:01<1:09:23,  6.88s/it]

{'loss': 1.1096, 'grad_norm': 1.5265735387802124, 'learning_rate': 6.06516290726817e-05, 'epoch': 0.82}


 70%|███████   | 1400/2000 [2:40:40<1:19:04,  7.91s/it]

{'loss': 0.8314, 'grad_norm': 1.4458311796188354, 'learning_rate': 6.015037593984962e-05, 'epoch': 0.83}


 70%|███████   | 1405/2000 [2:41:13<1:05:35,  6.61s/it]

{'loss': 0.8369, 'grad_norm': 1.4614320993423462, 'learning_rate': 5.9649122807017544e-05, 'epoch': 0.83}


 70%|███████   | 1410/2000 [2:41:46<1:03:54,  6.50s/it]

{'loss': 0.907, 'grad_norm': 1.5217381715774536, 'learning_rate': 5.9147869674185465e-05, 'epoch': 0.83}


 71%|███████   | 1415/2000 [2:42:23<1:09:32,  7.13s/it]

{'loss': 0.8561, 'grad_norm': 1.3240584135055542, 'learning_rate': 5.8646616541353386e-05, 'epoch': 0.84}


 71%|███████   | 1420/2000 [2:42:56<1:06:15,  6.86s/it]

{'loss': 0.9628, 'grad_norm': 1.4517889022827148, 'learning_rate': 5.81453634085213e-05, 'epoch': 0.84}


 71%|███████▏  | 1425/2000 [2:43:28<1:01:17,  6.40s/it]

{'loss': 0.8513, 'grad_norm': 1.9675098657608032, 'learning_rate': 5.764411027568922e-05, 'epoch': 0.84}


 72%|███████▏  | 1430/2000 [2:44:04<1:07:58,  7.15s/it]

{'loss': 0.8873, 'grad_norm': 1.3222458362579346, 'learning_rate': 5.714285714285714e-05, 'epoch': 0.84}


 72%|███████▏  | 1435/2000 [2:44:37<1:03:17,  6.72s/it]

{'loss': 0.9743, 'grad_norm': 1.4605063199996948, 'learning_rate': 5.664160401002506e-05, 'epoch': 0.85}


 72%|███████▏  | 1440/2000 [2:45:10<1:02:56,  6.74s/it]

{'loss': 0.8672, 'grad_norm': 1.1858711242675781, 'learning_rate': 5.6140350877192984e-05, 'epoch': 0.85}


 72%|███████▏  | 1445/2000 [2:45:47<1:06:21,  7.17s/it]

{'loss': 0.8016, 'grad_norm': 1.2993470430374146, 'learning_rate': 5.5639097744360905e-05, 'epoch': 0.85}


 72%|███████▎  | 1450/2000 [2:46:18<58:48,  6.42s/it]

{'loss': 0.801, 'grad_norm': 1.5968234539031982, 'learning_rate': 5.513784461152882e-05, 'epoch': 0.86}


 73%|███████▎  | 1455/2000 [2:46:49<55:34,  6.12s/it]

{'loss': 0.9702, 'grad_norm': 1.896776556968689, 'learning_rate': 5.463659147869674e-05, 'epoch': 0.86}


 73%|███████▎  | 1460/2000 [2:47:22<57:52,  6.43s/it]

{'loss': 0.8826, 'grad_norm': 1.6068018674850464, 'learning_rate': 5.413533834586466e-05, 'epoch': 0.86}


 73%|███████▎  | 1465/2000 [2:47:54<58:04,  6.51s/it]

{'loss': 0.8631, 'grad_norm': 1.2919517755508423, 'learning_rate': 5.363408521303258e-05, 'epoch': 0.86}


 74%|███████▎  | 1470/2000 [2:48:29<59:10,  6.70s/it]

{'loss': 0.9557, 'grad_norm': 1.6974555253982544, 'learning_rate': 5.31328320802005e-05, 'epoch': 0.87}


 74%|███████▍  | 1475/2000 [2:49:01<56:51,  6.50s/it]

{'loss': 0.8252, 'grad_norm': 1.6577435731887817, 'learning_rate': 5.2631578947368424e-05, 'epoch': 0.87}


 74%|███████▍  | 1480/2000 [2:49:37<1:03:50,  7.37s/it]

{'loss': 0.6424, 'grad_norm': 1.1510124206542969, 'learning_rate': 5.213032581453634e-05, 'epoch': 0.87}


 74%|███████▍  | 1485/2000 [2:50:10<57:55,  6.75s/it]

{'loss': 0.8511, 'grad_norm': 1.094560980796814, 'learning_rate': 5.162907268170426e-05, 'epoch': 0.88}


 74%|███████▍  | 1490/2000 [2:50:44<55:53,  6.57s/it]

{'loss': 0.8398, 'grad_norm': 1.4988173246383667, 'learning_rate': 5.112781954887218e-05, 'epoch': 0.88}


 75%|███████▍  | 1495/2000 [2:51:19<59:28,  7.07s/it]

{'loss': 0.894, 'grad_norm': 1.3090505599975586, 'learning_rate': 5.06265664160401e-05, 'epoch': 0.88}


 75%|███████▌  | 1500/2000 [2:51:56<59:34,  7.15s/it]

{'loss': 0.9456, 'grad_norm': 1.0758816003799438, 'learning_rate': 5.012531328320802e-05, 'epoch': 0.89}


 75%|███████▌  | 1505/2000 [2:52:32<57:41,  6.99s/it]

{'loss': 0.7939, 'grad_norm': 1.5272928476333618, 'learning_rate': 4.9624060150375936e-05, 'epoch': 0.89}


 76%|███████▌  | 1510/2000 [2:53:04<54:20,  6.65s/it]

{'loss': 0.9281, 'grad_norm': 1.551650047302246, 'learning_rate': 4.912280701754386e-05, 'epoch': 0.89}


 76%|███████▌  | 1515/2000 [2:53:39<55:17,  6.84s/it]

{'loss': 0.8027, 'grad_norm': 1.3972851037979126, 'learning_rate': 4.862155388471178e-05, 'epoch': 0.89}


 76%|███████▌  | 1520/2000 [2:54:14<57:26,  7.18s/it]

{'loss': 0.839, 'grad_norm': 1.4774463176727295, 'learning_rate': 4.81203007518797e-05, 'epoch': 0.9}


 76%|███████▋  | 1525/2000 [2:54:52<59:57,  7.57s/it]

{'loss': 0.7233, 'grad_norm': 1.2397618293762207, 'learning_rate': 4.761904761904762e-05, 'epoch': 0.9}


 76%|███████▋  | 1530/2000 [2:55:27<54:19,  6.93s/it]

{'loss': 0.9319, 'grad_norm': 1.645594596862793, 'learning_rate': 4.711779448621554e-05, 'epoch': 0.9}


 77%|███████▋  | 1535/2000 [2:56:00<51:48,  6.68s/it]

{'loss': 0.7638, 'grad_norm': 1.2859647274017334, 'learning_rate': 4.6616541353383456e-05, 'epoch': 0.91}


 77%|███████▋  | 1540/2000 [2:56:34<53:50,  7.02s/it]

{'loss': 0.7374, 'grad_norm': 1.0524531602859497, 'learning_rate': 4.6115288220551377e-05, 'epoch': 0.91}


 77%|███████▋  | 1545/2000 [2:57:10<55:19,  7.29s/it]

{'loss': 0.7406, 'grad_norm': 1.3354268074035645, 'learning_rate': 4.56140350877193e-05, 'epoch': 0.91}


 78%|███████▊  | 1550/2000 [2:57:40<46:19,  6.18s/it]

{'loss': 0.8207, 'grad_norm': 1.5868195295333862, 'learning_rate': 4.511278195488722e-05, 'epoch': 0.92}


 78%|███████▊  | 1555/2000 [2:58:13<47:56,  6.46s/it]

{'loss': 0.838, 'grad_norm': 1.4557265043258667, 'learning_rate': 4.461152882205514e-05, 'epoch': 0.92}


 78%|███████▊  | 1560/2000 [2:58:47<48:53,  6.67s/it]

{'loss': 0.9353, 'grad_norm': 1.5638450384140015, 'learning_rate': 4.411027568922306e-05, 'epoch': 0.92}


 78%|███████▊  | 1565/2000 [2:59:19<47:48,  6.59s/it]

{'loss': 0.843, 'grad_norm': 1.4079711437225342, 'learning_rate': 4.3609022556390975e-05, 'epoch': 0.92}


 78%|███████▊  | 1570/2000 [2:59:51<46:38,  6.51s/it]

{'loss': 0.7831, 'grad_norm': 1.7145686149597168, 'learning_rate': 4.3107769423558896e-05, 'epoch': 0.93}


 79%|███████▉  | 1575/2000 [3:00:28<50:15,  7.10s/it]

{'loss': 0.7623, 'grad_norm': 1.1911547183990479, 'learning_rate': 4.260651629072682e-05, 'epoch': 0.93}


 79%|███████▉  | 1580/2000 [3:01:03<49:49,  7.12s/it]

{'loss': 0.8137, 'grad_norm': 1.7607125043869019, 'learning_rate': 4.210526315789474e-05, 'epoch': 0.93}


 79%|███████▉  | 1585/2000 [3:01:36<44:13,  6.39s/it]

{'loss': 0.7926, 'grad_norm': 1.3125735521316528, 'learning_rate': 4.160401002506266e-05, 'epoch': 0.94}


 80%|███████▉  | 1590/2000 [3:02:12<49:22,  7.22s/it]

{'loss': 0.7223, 'grad_norm': 1.647131085395813, 'learning_rate': 4.110275689223057e-05, 'epoch': 0.94}


 80%|███████▉  | 1595/2000 [3:02:47<45:33,  6.75s/it]

{'loss': 0.8876, 'grad_norm': 1.4910334348678589, 'learning_rate': 4.0601503759398494e-05, 'epoch': 0.94}


 80%|████████  | 1600/2000 [3:03:20<43:39,  6.55s/it]

{'loss': 0.8726, 'grad_norm': 1.569236159324646, 'learning_rate': 4.0100250626566415e-05, 'epoch': 0.94}


 80%|████████  | 1605/2000 [3:03:54<43:36,  6.62s/it]

{'loss': 0.7711, 'grad_norm': 1.377762794494629, 'learning_rate': 3.9598997493734336e-05, 'epoch': 0.95}


 80%|████████  | 1610/2000 [3:04:26<42:51,  6.59s/it]

{'loss': 0.7448, 'grad_norm': 1.587536096572876, 'learning_rate': 3.909774436090226e-05, 'epoch': 0.95}


 81%|████████  | 1615/2000 [3:05:01<43:50,  6.83s/it]

{'loss': 0.8657, 'grad_norm': 1.3537918329238892, 'learning_rate': 3.859649122807018e-05, 'epoch': 0.95}


 81%|████████  | 1620/2000 [3:05:33<39:00,  6.16s/it]

{'loss': 0.9325, 'grad_norm': 1.8587934970855713, 'learning_rate': 3.809523809523809e-05, 'epoch': 0.96}


 81%|████████▏ | 1625/2000 [3:06:05<41:38,  6.66s/it]

{'loss': 0.8236, 'grad_norm': 1.6408116817474365, 'learning_rate': 3.759398496240601e-05, 'epoch': 0.96}


 82%|████████▏ | 1630/2000 [3:06:43<44:52,  7.28s/it]

{'loss': 0.8724, 'grad_norm': 1.355876088142395, 'learning_rate': 3.7092731829573934e-05, 'epoch': 0.96}


 82%|████████▏ | 1635/2000 [3:07:16<40:14,  6.61s/it]

{'loss': 0.8174, 'grad_norm': 1.8919538259506226, 'learning_rate': 3.6591478696741855e-05, 'epoch': 0.97}


 82%|████████▏ | 1640/2000 [3:07:49<40:58,  6.83s/it]

{'loss': 0.8464, 'grad_norm': 1.3345226049423218, 'learning_rate': 3.6090225563909776e-05, 'epoch': 0.97}


 82%|████████▏ | 1645/2000 [3:08:23<40:22,  6.83s/it]

{'loss': 0.8047, 'grad_norm': 1.341935157775879, 'learning_rate': 3.55889724310777e-05, 'epoch': 0.97}


 82%|████████▎ | 1650/2000 [3:08:57<40:04,  6.87s/it]

{'loss': 0.8364, 'grad_norm': 1.1686636209487915, 'learning_rate': 3.508771929824561e-05, 'epoch': 0.97}


 83%|████████▎ | 1655/2000 [3:09:32<41:00,  7.13s/it]

{'loss': 0.8828, 'grad_norm': 1.361989974975586, 'learning_rate': 3.458646616541353e-05, 'epoch': 0.98}


 83%|████████▎ | 1660/2000 [3:10:08<41:01,  7.24s/it]

{'loss': 0.9589, 'grad_norm': 1.673227071762085, 'learning_rate': 3.4085213032581453e-05, 'epoch': 0.98}


 83%|████████▎ | 1665/2000 [3:10:41<37:12,  6.67s/it]

{'loss': 0.8705, 'grad_norm': 1.421875, 'learning_rate': 3.3583959899749374e-05, 'epoch': 0.98}


 84%|████████▎ | 1670/2000 [3:11:14<36:03,  6.56s/it]

{'loss': 0.8399, 'grad_norm': 1.5793256759643555, 'learning_rate': 3.3082706766917295e-05, 'epoch': 0.99}


 84%|████████▍ | 1675/2000 [3:11:47<34:03,  6.29s/it]

{'loss': 0.7154, 'grad_norm': 1.7908761501312256, 'learning_rate': 3.258145363408521e-05, 'epoch': 0.99}


 84%|████████▍ | 1680/2000 [3:12:21<36:15,  6.80s/it]

{'loss': 0.6495, 'grad_norm': 1.1645277738571167, 'learning_rate': 3.208020050125313e-05, 'epoch': 0.99}


 84%|████████▍ | 1685/2000 [3:12:54<34:38,  6.60s/it]

{'loss': 0.796, 'grad_norm': 1.6119542121887207, 'learning_rate': 3.157894736842105e-05, 'epoch': 0.99}


 84%|████████▍ | 1690/2000 [3:13:29<36:15,  7.02s/it]

{'loss': 0.8714, 'grad_norm': 1.568618655204773, 'learning_rate': 3.107769423558897e-05, 'epoch': 1.0}


 85%|████████▍ | 1695/2000 [3:14:06<37:47,  7.44s/it]

{'loss': 0.7699, 'grad_norm': 1.5515520572662354, 'learning_rate': 3.0576441102756894e-05, 'epoch': 1.0}


 85%|████████▌ | 1700/2000 [3:14:40<35:48,  7.16s/it]

{'loss': 0.5775, 'grad_norm': 1.3875118494033813, 'learning_rate': 3.007518796992481e-05, 'epoch': 1.0}


 85%|████████▌ | 1705/2000 [3:15:14<33:06,  6.73s/it]

{'loss': 0.5136, 'grad_norm': 1.7519936561584473, 'learning_rate': 2.9573934837092732e-05, 'epoch': 1.01}


 86%|████████▌ | 1710/2000 [3:15:49<33:19,  6.90s/it]

{'loss': 0.6077, 'grad_norm': 1.008662462234497, 'learning_rate': 2.907268170426065e-05, 'epoch': 1.01}


 86%|████████▌ | 1715/2000 [3:16:20<29:57,  6.31s/it]

{'loss': 0.5921, 'grad_norm': 1.5803648233413696, 'learning_rate': 2.857142857142857e-05, 'epoch': 1.01}


 86%|████████▌ | 1720/2000 [3:16:54<31:03,  6.66s/it]

{'loss': 0.6948, 'grad_norm': 1.996482253074646, 'learning_rate': 2.8070175438596492e-05, 'epoch': 1.02}


 86%|████████▋ | 1725/2000 [3:17:28<30:34,  6.67s/it]

{'loss': 0.6297, 'grad_norm': 1.3235238790512085, 'learning_rate': 2.756892230576441e-05, 'epoch': 1.02}


 86%|████████▋ | 1730/2000 [3:18:03<32:07,  7.14s/it]

{'loss': 0.5539, 'grad_norm': 0.8883875608444214, 'learning_rate': 2.706766917293233e-05, 'epoch': 1.02}


 87%|████████▋ | 1735/2000 [3:18:37<29:31,  6.68s/it]

{'loss': 0.5488, 'grad_norm': 1.129499077796936, 'learning_rate': 2.656641604010025e-05, 'epoch': 1.02}


 87%|████████▋ | 1740/2000 [3:19:10<28:28,  6.57s/it]

{'loss': 0.5577, 'grad_norm': 1.4810580015182495, 'learning_rate': 2.606516290726817e-05, 'epoch': 1.03}


 87%|████████▋ | 1745/2000 [3:19:43<28:00,  6.59s/it]

{'loss': 0.6612, 'grad_norm': 1.3658519983291626, 'learning_rate': 2.556390977443609e-05, 'epoch': 1.03}


 88%|████████▊ | 1750/2000 [3:20:15<27:31,  6.60s/it]

{'loss': 0.7682, 'grad_norm': 1.4350873231887817, 'learning_rate': 2.506265664160401e-05, 'epoch': 1.03}


 88%|████████▊ | 1755/2000 [3:20:50<28:20,  6.94s/it]

{'loss': 0.6819, 'grad_norm': 1.3909707069396973, 'learning_rate': 2.456140350877193e-05, 'epoch': 1.04}


 88%|████████▊ | 1760/2000 [3:21:25<27:16,  6.82s/it]

{'loss': 0.5099, 'grad_norm': 1.3128196001052856, 'learning_rate': 2.406015037593985e-05, 'epoch': 1.04}


 88%|████████▊ | 1765/2000 [3:22:00<27:18,  6.97s/it]

{'loss': 0.4941, 'grad_norm': 1.1890156269073486, 'learning_rate': 2.355889724310777e-05, 'epoch': 1.04}


 88%|████████▊ | 1770/2000 [3:22:33<26:34,  6.93s/it]

{'loss': 0.5999, 'grad_norm': 1.2922918796539307, 'learning_rate': 2.3057644110275688e-05, 'epoch': 1.05}


 89%|████████▉ | 1775/2000 [3:23:10<28:10,  7.51s/it]

{'loss': 0.5069, 'grad_norm': 1.3055009841918945, 'learning_rate': 2.255639097744361e-05, 'epoch': 1.05}


 89%|████████▉ | 1780/2000 [3:23:44<25:28,  6.95s/it]

{'loss': 0.6757, 'grad_norm': 1.7656971216201782, 'learning_rate': 2.205513784461153e-05, 'epoch': 1.05}


 89%|████████▉ | 1785/2000 [3:24:17<23:16,  6.49s/it]

{'loss': 0.5835, 'grad_norm': 1.627235770225525, 'learning_rate': 2.1553884711779448e-05, 'epoch': 1.05}


 90%|████████▉ | 1790/2000 [3:24:51<24:23,  6.97s/it]

{'loss': 0.5289, 'grad_norm': 1.4212807416915894, 'learning_rate': 2.105263157894737e-05, 'epoch': 1.06}


 90%|████████▉ | 1795/2000 [3:25:28<24:35,  7.20s/it]

{'loss': 0.5196, 'grad_norm': 1.1520802974700928, 'learning_rate': 2.0551378446115287e-05, 'epoch': 1.06}


 90%|█████████ | 1800/2000 [3:26:02<23:32,  7.06s/it]

{'loss': 0.6517, 'grad_norm': 0.9399082660675049, 'learning_rate': 2.0050125313283208e-05, 'epoch': 1.06}


 90%|█████████ | 1805/2000 [3:26:38<23:45,  7.31s/it]

{'loss': 0.5626, 'grad_norm': 1.3609808683395386, 'learning_rate': 1.954887218045113e-05, 'epoch': 1.07}


 90%|█████████ | 1810/2000 [3:27:11<21:02,  6.64s/it]

{'loss': 0.5372, 'grad_norm': 1.1078377962112427, 'learning_rate': 1.9047619047619046e-05, 'epoch': 1.07}


 91%|█████████ | 1815/2000 [3:27:46<21:46,  7.06s/it]

{'loss': 0.4922, 'grad_norm': 1.5209760665893555, 'learning_rate': 1.8546365914786967e-05, 'epoch': 1.07}


 91%|█████████ | 1820/2000 [3:28:18<19:34,  6.52s/it]

{'loss': 0.5987, 'grad_norm': 1.0364819765090942, 'learning_rate': 1.8045112781954888e-05, 'epoch': 1.07}


 91%|█████████▏| 1825/2000 [3:28:52<18:58,  6.50s/it]

{'loss': 0.583, 'grad_norm': 1.9161659479141235, 'learning_rate': 1.7543859649122806e-05, 'epoch': 1.08}


 92%|█████████▏| 1830/2000 [3:29:29<20:01,  7.06s/it]

{'loss': 0.485, 'grad_norm': 1.417205810546875, 'learning_rate': 1.7042606516290727e-05, 'epoch': 1.08}


 92%|█████████▏| 1835/2000 [3:30:02<18:02,  6.56s/it]

{'loss': 0.5879, 'grad_norm': 1.666182279586792, 'learning_rate': 1.6541353383458648e-05, 'epoch': 1.08}


 92%|█████████▏| 1840/2000 [3:30:38<18:58,  7.12s/it]

{'loss': 0.5574, 'grad_norm': 1.313354730606079, 'learning_rate': 1.6040100250626565e-05, 'epoch': 1.09}


 92%|█████████▏| 1845/2000 [3:31:11<17:45,  6.88s/it]

{'loss': 0.6221, 'grad_norm': 1.0611518621444702, 'learning_rate': 1.5538847117794486e-05, 'epoch': 1.09}


 92%|█████████▎| 1850/2000 [3:31:44<16:33,  6.63s/it]

{'loss': 0.7641, 'grad_norm': 1.5122733116149902, 'learning_rate': 1.5037593984962406e-05, 'epoch': 1.09}


 93%|█████████▎| 1855/2000 [3:32:18<16:19,  6.76s/it]

{'loss': 0.6535, 'grad_norm': 1.3867439031600952, 'learning_rate': 1.4536340852130325e-05, 'epoch': 1.1}


 93%|█████████▎| 1860/2000 [3:32:51<15:38,  6.70s/it]

{'loss': 0.5974, 'grad_norm': 1.217467188835144, 'learning_rate': 1.4035087719298246e-05, 'epoch': 1.1}


 93%|█████████▎| 1865/2000 [3:33:27<16:05,  7.15s/it]

{'loss': 0.4959, 'grad_norm': 1.3787227869033813, 'learning_rate': 1.3533834586466165e-05, 'epoch': 1.1}


 94%|█████████▎| 1870/2000 [3:34:01<14:33,  6.72s/it]

{'loss': 0.5759, 'grad_norm': 1.570431113243103, 'learning_rate': 1.3032581453634085e-05, 'epoch': 1.1}


 94%|█████████▍| 1875/2000 [3:34:34<13:46,  6.61s/it]

{'loss': 0.6953, 'grad_norm': 1.564875602722168, 'learning_rate': 1.2531328320802006e-05, 'epoch': 1.11}


 94%|█████████▍| 1880/2000 [3:35:05<12:53,  6.45s/it]

{'loss': 0.6666, 'grad_norm': 1.3994468450546265, 'learning_rate': 1.2030075187969925e-05, 'epoch': 1.11}


 94%|█████████▍| 1885/2000 [3:35:44<14:22,  7.50s/it]

{'loss': 0.6472, 'grad_norm': 1.4435492753982544, 'learning_rate': 1.1528822055137844e-05, 'epoch': 1.11}


 94%|█████████▍| 1890/2000 [3:36:20<13:26,  7.33s/it]

{'loss': 0.4981, 'grad_norm': 1.1127601861953735, 'learning_rate': 1.1027568922305765e-05, 'epoch': 1.12}


 95%|█████████▍| 1895/2000 [3:36:56<12:55,  7.38s/it]

{'loss': 0.4387, 'grad_norm': 1.0067707300186157, 'learning_rate': 1.0526315789473684e-05, 'epoch': 1.12}


 95%|█████████▌| 1900/2000 [3:37:32<11:46,  7.06s/it]

{'loss': 0.6275, 'grad_norm': 1.3040039539337158, 'learning_rate': 1.0025062656641604e-05, 'epoch': 1.12}


 95%|█████████▌| 1905/2000 [3:38:06<10:38,  6.72s/it]

{'loss': 0.6036, 'grad_norm': 1.5903748273849487, 'learning_rate': 9.523809523809523e-06, 'epoch': 1.13}


 96%|█████████▌| 1910/2000 [3:38:41<10:20,  6.89s/it]

{'loss': 0.6428, 'grad_norm': 1.534645676612854, 'learning_rate': 9.022556390977444e-06, 'epoch': 1.13}


 96%|█████████▌| 1915/2000 [3:39:19<10:45,  7.60s/it]

{'loss': 0.4994, 'grad_norm': 1.3415799140930176, 'learning_rate': 8.521303258145363e-06, 'epoch': 1.13}


 96%|█████████▌| 1920/2000 [3:39:57<09:59,  7.49s/it]

{'loss': 0.5106, 'grad_norm': 1.7142432928085327, 'learning_rate': 8.020050125313283e-06, 'epoch': 1.13}


 96%|█████████▋| 1925/2000 [3:40:33<09:13,  7.37s/it]

{'loss': 0.5744, 'grad_norm': 1.5055850744247437, 'learning_rate': 7.518796992481203e-06, 'epoch': 1.14}


 96%|█████████▋| 1930/2000 [3:41:07<08:07,  6.97s/it]

{'loss': 0.4429, 'grad_norm': 1.2790148258209229, 'learning_rate': 7.017543859649123e-06, 'epoch': 1.14}


 97%|█████████▋| 1935/2000 [3:41:43<07:39,  7.07s/it]

{'loss': 0.5353, 'grad_norm': 1.296201467514038, 'learning_rate': 6.516290726817042e-06, 'epoch': 1.14}


 97%|█████████▋| 1940/2000 [3:42:16<06:48,  6.81s/it]

{'loss': 0.5814, 'grad_norm': 0.7799281477928162, 'learning_rate': 6.015037593984962e-06, 'epoch': 1.15}


 97%|█████████▋| 1945/2000 [3:42:51<06:34,  7.17s/it]

{'loss': 0.4752, 'grad_norm': 0.9901845455169678, 'learning_rate': 5.5137844611528826e-06, 'epoch': 1.15}


 98%|█████████▊| 1950/2000 [3:43:25<05:45,  6.91s/it]

{'loss': 0.5983, 'grad_norm': 1.5220268964767456, 'learning_rate': 5.012531328320802e-06, 'epoch': 1.15}


 98%|█████████▊| 1955/2000 [3:44:00<05:14,  7.00s/it]

{'loss': 0.5185, 'grad_norm': 1.8356173038482666, 'learning_rate': 4.511278195488722e-06, 'epoch': 1.15}


 98%|█████████▊| 1960/2000 [3:44:35<04:44,  7.11s/it]

{'loss': 0.6338, 'grad_norm': 1.2141913175582886, 'learning_rate': 4.010025062656641e-06, 'epoch': 1.16}


 98%|█████████▊| 1965/2000 [3:45:11<04:11,  7.17s/it]

{'loss': 0.4893, 'grad_norm': 1.4006842374801636, 'learning_rate': 3.5087719298245615e-06, 'epoch': 1.16}


 98%|█████████▊| 1970/2000 [3:45:44<03:21,  6.71s/it]

{'loss': 0.5823, 'grad_norm': 1.566226601600647, 'learning_rate': 3.007518796992481e-06, 'epoch': 1.16}


 99%|█████████▉| 1975/2000 [3:46:17<02:48,  6.73s/it]

{'loss': 0.6066, 'grad_norm': 1.405783772468567, 'learning_rate': 2.506265664160401e-06, 'epoch': 1.17}


 99%|█████████▉| 1980/2000 [3:46:55<02:22,  7.12s/it]

{'loss': 0.5933, 'grad_norm': 2.3382391929626465, 'learning_rate': 2.0050125313283207e-06, 'epoch': 1.17}


 99%|█████████▉| 1985/2000 [3:47:29<01:41,  6.80s/it]

{'loss': 0.5386, 'grad_norm': 1.7002789974212646, 'learning_rate': 1.5037593984962406e-06, 'epoch': 1.17}


100%|█████████▉| 1990/2000 [3:48:03<01:08,  6.83s/it]

{'loss': 0.3771, 'grad_norm': 0.9407487511634827, 'learning_rate': 1.0025062656641603e-06, 'epoch': 1.18}


100%|█████████▉| 1995/2000 [3:48:35<00:32,  6.53s/it]

{'loss': 0.6353, 'grad_norm': 1.6390571594238281, 'learning_rate': 5.012531328320802e-07, 'epoch': 1.18}


100%|██████████| 2000/2000 [3:49:06<00:00,  6.44s/it]

{'loss': 0.5626, 'grad_norm': 1.6668428182601929, 'learning_rate': 0.0, 'epoch': 1.18}


100%|██████████| 2000/2000 [3:49:07<00:00,  6.87s/it]

{'train_runtime': 13747.7148, 'train_samples_per_second': 1.164, 'train_steps_per_second': 0.145, 'train_loss': 1.0849841850996018, 'epoch': 1.18}
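
The logged learning rates and throughput figures can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the step count and runtime come from the log above, while the peak learning rate (`2e-4`), warmup length (5 steps), and effective batch size (8) are assumptions chosen because they reproduce the logged values. They are not read from the training configuration.

```python
# Values read from the training log above.
TOTAL_STEPS = 2000
TRAIN_RUNTIME_S = 13747.7148

# Assumptions (not shown in this section): chosen because they match the log.
PEAK_LR = 2e-4
WARMUP_STEPS = 5
EFFECTIVE_BATCH = 8  # e.g. per-device batch size times gradient accumulation


def linear_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to peak_lr, then linear decay to 0 at step `total`."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return peak_lr * max(0, total - step) / max(1, total - warmup)


# Compare against the 'learning_rate' entries logged at these steps.
for step in (1600, 1800, 1995, 2000):
    print(step, linear_lr(step))

# Throughput implied by the step count, runtime and assumed batch size.
steps_per_second = TOTAL_STEPS / TRAIN_RUNTIME_S
samples_per_second = TOTAL_STEPS * EFFECTIVE_BATCH / TRAIN_RUNTIME_S
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

If the assumptions hold, the printed rates should line up with the `learning_rate` values logged at steps 1600, 1800, 1995 and 2000, and the throughput should agree with `train_steps_per_second` and `train_samples_per_second` in the summary line.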





In [9]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

13747.7148 seconds used for training.
229.13 minutes used for training.
Peak reserved memory = 6.496 GB.
Peak reserved memory for training = 1.859 GB.
Peak reserved memory % of max memory = 40.61 %.
Peak reserved memory for training % of max memory = 11.622 %.
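
To make the arithmetic behind these numbers explicit, here is a minimal sketch that repeats the same calculation without needing a GPU. The helper `memory_report` is hypothetical, and the `start_gb` and `max_gb` inputs are assumptions back-computed from the printed output (they are defined in an earlier cell that is not shown here).

```python
def memory_report(peak_gb, start_gb, max_gb):
    """Repeat the cell's arithmetic: training memory = peak minus baseline."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# peak_gb is taken from the output above; start_gb and max_gb are
# assumptions back-computed from the printed values.
report = memory_report(peak_gb=6.496, start_gb=4.637, max_gb=15.996)
print(report)
```

The difference between the two percentages is the share of GPU memory already reserved before training started, which is mostly the loaded 4bit model.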


In [10]:
model.save_pretrained(f"{model_name}/best_model") # Local saving
tokenizer.save_pretrained(f"{model_name}/best_model")

('mistral-7b-v0.3-bnb-4bit/best_model\\tokenizer_config.json',
 'mistral-7b-v0.3-bnb-4bit/best_model\\special_tokens_map.json',
 'mistral-7b-v0.3-bnb-4bit/best_model\\tokenizer.model',
 'mistral-7b-v0.3-bnb-4bit/best_model\\added_tokens.json',
 'mistral-7b-v0.3-bnb-4bit/best_model\\tokenizer.json')
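
The returned paths mix `/` and `\\` because the save directory was built with an f-string on Windows. One option, sketched below, is to build the directory with `pathlib` and check that the tokenizer files were written. The helper names `best_model_dir` and `missing_tokenizer_files` are hypothetical, and the list of expected files is simply copied from the output above.

```python
from pathlib import Path

# File names copied from the save output above.
EXPECTED_TOKENIZER_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.model",
    "added_tokens.json",
    "tokenizer.json",
)


def best_model_dir(model_name):
    """Build the save directory with the platform's own separator."""
    return Path(model_name) / "best_model"


def missing_tokenizer_files(save_dir):
    """Return the expected tokenizer files that are absent from save_dir."""
    save_dir = Path(save_dir)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (save_dir / name).is_file()]


save_dir = best_model_dir("mistral-7b-v0.3-bnb-4bit")
print(save_dir.as_posix())
print(missing_tokenizer_files(save_dir))
```

`str(save_dir)` can then be passed to `model.save_pretrained` and `tokenizer.save_pretrained` in place of the f-string; an empty list from `missing_tokenizer_files` means every expected tokenizer file is present.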

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>