* PROJEKCT CPM FILIP NIEWCZAS, WIKTORIA DOMAŃSKA, MARCEL PASSON *

Detecting texts generated by large language models
As large language models are increasingly being used to generate texts in the context of everyday communication, blog posts, advertisements etc., there is a practical problem of distinguishing between artificially generated texts and those authored by humans. It is also interesting to investigate if there are any systematic differences between such texts on the language level.
Some datasets of real vs generated texts:
https://github.com/jmpu/DeepfakeTextDetection
https://github.com/liamdugan/human-detection/
https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro
https://github.com/ICTMCG/Awesome-Machine-Generated-Text (long list under Datasets section)
Choose a dataset from a particular usage domain. Formulate some initial hypotheses.
Perform an extensive exploration of the dataset. What features are characteristic for the generated texts?
Try to build your own classifier detecting generated content.
Complement your work with a qualitative analysis of generated texts.


In [1]:
%pip install nlp

Note: you may need to restart the kernel to use updated packages.


In [2]:
from nlp import Dataset
import numpy as np
from transformers import AutoTokenizer, pipeline, Trainer, TrainingArguments, AutoModelForSequenceClassification,DataCollatorWithPadding
import pandas as pd

  from .autonotebook import tqdm as notebook_tqdm
2024-06-11 13:09:02.412903: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
#from google.colab import drive
#drive.mount('/content/drive')

In [4]:


# Load the .jsonl file
file_path = '/Users/filip/Desktop/Cognitive/cpm/project /RedditBot.jsonl' #'/content/drive/MyDrive/CogSCi/CPM II /Projekt CPM /project /RedditBot.jsonl'
data = pd.read_json(file_path, lines=True)

# Display the first few rows
print("First few rows of the dataset:")
print(data.head())

# Display summary information
print("\nSummary information of the dataset:")
print(data.info())

# Display basic statistics
print("\nBasic statistics of the dataset:")
print(data.describe())

# Check for missing values
missing_values = data.isnull().sum()
print("\nMissing values in each column:")
print(missing_values)

# Calculate the length of each text
data['text_length'] = data['text'].apply(len)

First few rows of the dataset:
   id                                               text    label
0   0  To Kill a Mockingbird  is an excellent example...  machine
1   0  there must be a reason to make your claim...so...    human
2   0   God is an idea that humans believe in, and th...  machine
3   0  If you need to make menial philosophical point...    human
4   0   Probably one of the worst inventions is eugen...  machine

Summary information of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1774 entries, 0 to 1773
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      1774 non-null   int64 
 1   text    1774 non-null   object
 2   label   1774 non-null   object
dtypes: int64(1), object(2)
memory usage: 41.7+ KB
None

Basic statistics of the dataset:
           id
count  1774.0
mean      0.0
std       0.0
min       0.0
25%       0.0
50%       0.0
75%       0.0
max       0.0

Missing values in each column:
id

In [5]:
#  transforming dataset - delete columns ID and text_lenght, drop duplicates, drop missing values
data = data.drop(columns=['id', 'text_length']).drop_duplicates().dropna()

In [6]:
data.head(5)

Unnamed: 0,text,label
0,To Kill a Mockingbird is an excellent example...,machine
1,there must be a reason to make your claim...so...,human
2,"God is an idea that humans believe in, and th...",machine
3,If you need to make menial philosophical point...,human
4,Probably one of the worst inventions is eugen...,machine


In [7]:
#Changing values machine-human to 0:1
data['label'] = data['label'].map({'machine': 0, 'human': 1})
data.sample(5)

Unnamed: 0,text,label
715,Okay this may seem stupid but here it goes: ce...,1
1057,My best friends. I drove up like 100 miles nor...,1
105,That's based on old information. Only some chi...,1
574,I don't want to use this as some sort of plug ...,1
25,\n\nI think that the biggest challenge in sca...,0


In [8]:
data['label'].mean() # fifty-fifty almost!

0.49745042492917846

In [9]:
dataset = Dataset.from_pandas(data)

In [10]:
#Parameters for the model
train_fraction = 0.7 # fraction of a dataset used for training (the rest used for validation)
num_train_epochs = 3 # epochs to train
batch_size = 4 # batch size for training and validation
warmup_steps = 50
weight_decay = 0.02


In [11]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased', use_fast=True,low_cpu_mem_usage=False)

In [12]:

def encode_examples(example):
    # Encode the text and return the encoding which includes 'input_ids' and 'attention_mask'
    return tokenizer(example['text'], truncation=True, padding='max_length')


dataset_encoded = dataset.map(encode_examples, batched=True)


print(dataset_encoded)

  0%|          | 0/2 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|██████████| 2/2 [00:01<00:00,  1.05it/s]

Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 1765)





In [13]:
splitted_dataset = dataset_encoded.train_test_split(train_fraction) #0.7 for training, 0.3 for validation

100%|██████████| 1/1 [00:00<00:00,  1.63it/s]
100%|██████████| 2/2 [00:01<00:00,  1.35it/s]


In [14]:
splitted_dataset

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 529),
 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 1236)}

In [15]:
# DataCollatorWithPadding creates batch of data. It also dynamically pads text to the
#  length of the longest element in the batch, making them all the same length.
#  It's possible to pad your text in the tokenizer function with padding=True, dynamic padding is more efficient.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) # Tokenizer - DistilBert

In [16]:
print(splitted_dataset['train'][0].keys())

dict_keys(['__index_level_0__', 'attention_mask', 'input_ids', 'label', 'text'])


In [17]:
tokenizer.decode(splitted_dataset['train'][0]['input_ids'])

"[CLS] I'd like to add to the discourse that stability is not directly linked to allergenicity in a mechanistic way that can be explained by immunologists. Peanut is also reported to interact with a receptor called DC - SIGN that enhances its uptake into the cells responsible for initiating immune responses. Additionally, western preparation of peanuts via dry roasting changes the structure of sugars in the peanut, adding things called advanced glycosylation end products, which are targeted by a specialized receptor, which is also known for uptake and activating the immune system. In general, its not clear why anyone is allergic to specific things. Some allergens ( like peanut, shellfish, house dust mite etc ) have protein cleaving and trypsin inhibiting properties that the immune system seems to react to. Others like milk dont have those properties. Its all still quite a mystery and is very very hard to study because by the time a patient comes to the clinic, they are already fully al

In [18]:
# Loading the model

BERT_MODEL = 'distilbert-base-cased'

model = AutoModelForSequenceClassification.from_pretrained(
    BERT_MODEL, num_labels=2,
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False # Whether the model returns all hidden-states.
)

model.config.id2label = {0: 'ROBOT', 1: 'HUMAN'}

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# number of trainable parameters
print(model.num_parameters(only_trainable=True)/1e6)

65.783042


In [20]:
%pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [21]:
from datasets import load_metric

metric = load_metric("accuracy") # do oceny

def compute_metrics(eval_pred):
    logits, labels = eval_pred #rozdzielenie przewidywania (logits) i etykiet (labels)
    predictions = np.argmax(logits, axis=-1)  #argmax zwraca indeksy największych wartości w kolumnie wzdłuż osi
    return metric.compute(predictions=predictions, references=labels)

  metric = load_metric("accuracy") # do oceny
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [22]:
%pip install transformers[torch]
%pip install mlflow


zsh:1: no matches found: transformers[torch]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [23]:
import accelerate
print(accelerate.__version__)

0.30.1


In [24]:
def is_accelerate_available():
    return True
training_args = TrainingArguments(
    output_dir="./project cpm2 ",
    logging_dir='./logs',
    num_train_epochs=num_train_epochs, # 3
    per_device_train_batch_size=2, #16 - number of samples to process at once per batch
    per_device_eval_batch_size=2, #16
    logging_strategy='steps', # log every step
    logging_first_step=True,
    load_best_model_at_end=True, #trainer will load the best model found during training at the end of training
    logging_steps=1,
    evaluation_strategy='epoch',# when evaluate model - after each epoch
    warmup_steps=warmup_steps, #50
    weight_decay=weight_decay, #0.02
    eval_steps=1,
    save_strategy='epoch',
    report_to="mlflow",  # log to mlflow
)



# Define the trainer:
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=splitted_dataset['train'],
    eval_dataset=splitted_dataset['test'],
    data_collator=data_collator # A function to batch together samples of data.
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [25]:
trainer.evaluate()

100%|██████████| 618/618 [11:04<00:00,  1.10s/it]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|██████████| 618/618 [11:05<00:00,  1.08s/it]


{'eval_loss': 0.6935293078422546,
 'eval_accuracy': 0.5032362459546925,
 'eval_runtime': 666.2794,
 'eval_samples_per_second': 1.855,
 'eval_steps_per_second': 0.928}

In [26]:
trainer.train()

  0%|          | 1/795 [00:04<52:36,  3.98s/it]

{'loss': 0.6845, 'grad_norm': 6.480278968811035, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}


  0%|          | 2/795 [00:07<48:15,  3.65s/it]

{'loss': 0.6951, 'grad_norm': 2.9086215496063232, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.01}


  0%|          | 3/795 [00:10<43:15,  3.28s/it]

{'loss': 0.7107, 'grad_norm': 3.21169114112854, 'learning_rate': 3e-06, 'epoch': 0.01}


  1%|          | 4/795 [00:12<39:34,  3.00s/it]

{'loss': 0.6717, 'grad_norm': 2.8797333240509033, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}


  1%|          | 5/795 [00:15<38:16,  2.91s/it]

{'loss': 0.7054, 'grad_norm': 2.9025964736938477, 'learning_rate': 5e-06, 'epoch': 0.02}


  1%|          | 6/795 [00:18<38:12,  2.91s/it]

{'loss': 0.7351, 'grad_norm': 2.9162111282348633, 'learning_rate': 6e-06, 'epoch': 0.02}


  1%|          | 7/795 [00:21<37:17,  2.84s/it]

{'loss': 0.6472, 'grad_norm': 5.9054083824157715, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.03}


  1%|          | 8/795 [00:24<37:24,  2.85s/it]

{'loss': 0.6853, 'grad_norm': 6.334052562713623, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.03}


  1%|          | 9/795 [00:26<36:43,  2.80s/it]

{'loss': 0.7023, 'grad_norm': 2.7710423469543457, 'learning_rate': 9e-06, 'epoch': 0.03}


  1%|▏         | 10/795 [00:29<36:36,  2.80s/it]

{'loss': 0.729, 'grad_norm': 6.6541876792907715, 'learning_rate': 1e-05, 'epoch': 0.04}


  1%|▏         | 11/795 [00:32<36:28,  2.79s/it]

{'loss': 0.6799, 'grad_norm': 6.212198257446289, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.04}


  2%|▏         | 12/795 [00:35<36:50,  2.82s/it]

{'loss': 0.638, 'grad_norm': 2.670488119125366, 'learning_rate': 1.2e-05, 'epoch': 0.05}


  2%|▏         | 13/795 [00:38<37:04,  2.84s/it]

{'loss': 0.6004, 'grad_norm': 5.650048732757568, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.05}


  2%|▏         | 14/795 [00:40<36:11,  2.78s/it]

{'loss': 0.6291, 'grad_norm': 2.669658899307251, 'learning_rate': 1.4000000000000001e-05, 'epoch': 0.05}


  2%|▏         | 15/795 [00:43<36:06,  2.78s/it]

{'loss': 0.7, 'grad_norm': 6.100147724151611, 'learning_rate': 1.5e-05, 'epoch': 0.06}


  2%|▏         | 16/795 [00:46<35:33,  2.74s/it]

{'loss': 0.7063, 'grad_norm': 2.470055103302002, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.06}


  2%|▏         | 17/795 [00:49<36:24,  2.81s/it]

{'loss': 0.6672, 'grad_norm': 3.0305893421173096, 'learning_rate': 1.7000000000000003e-05, 'epoch': 0.06}


  2%|▏         | 18/795 [00:51<35:46,  2.76s/it]

{'loss': 0.7218, 'grad_norm': 6.89959192276001, 'learning_rate': 1.8e-05, 'epoch': 0.07}


  2%|▏         | 19/795 [00:54<35:22,  2.73s/it]

{'loss': 0.6452, 'grad_norm': 2.6991634368896484, 'learning_rate': 1.9e-05, 'epoch': 0.07}


  3%|▎         | 20/795 [00:57<35:31,  2.75s/it]

{'loss': 0.6512, 'grad_norm': 2.648099184036255, 'learning_rate': 2e-05, 'epoch': 0.08}


  3%|▎         | 21/795 [01:00<35:43,  2.77s/it]

{'loss': 0.6983, 'grad_norm': 2.8912289142608643, 'learning_rate': 2.1e-05, 'epoch': 0.08}


  3%|▎         | 22/795 [01:02<36:08,  2.80s/it]

{'loss': 0.6124, 'grad_norm': 5.908492088317871, 'learning_rate': 2.2000000000000003e-05, 'epoch': 0.08}


  3%|▎         | 23/795 [01:05<36:39,  2.85s/it]

{'loss': 0.6889, 'grad_norm': 6.100847244262695, 'learning_rate': 2.3000000000000003e-05, 'epoch': 0.09}


  3%|▎         | 24/795 [01:08<36:47,  2.86s/it]

{'loss': 0.6019, 'grad_norm': 3.0593972206115723, 'learning_rate': 2.4e-05, 'epoch': 0.09}


  3%|▎         | 25/795 [01:11<35:53,  2.80s/it]

{'loss': 0.6731, 'grad_norm': 6.283253192901611, 'learning_rate': 2.5e-05, 'epoch': 0.09}


  3%|▎         | 26/795 [01:14<36:07,  2.82s/it]

{'loss': 0.6312, 'grad_norm': 3.368927478790283, 'learning_rate': 2.6000000000000002e-05, 'epoch': 0.1}


  3%|▎         | 27/795 [01:17<35:57,  2.81s/it]

{'loss': 0.6093, 'grad_norm': 5.878419399261475, 'learning_rate': 2.7000000000000002e-05, 'epoch': 0.1}


  4%|▎         | 28/795 [01:19<35:58,  2.81s/it]

{'loss': 0.6367, 'grad_norm': 3.3207972049713135, 'learning_rate': 2.8000000000000003e-05, 'epoch': 0.11}


  4%|▎         | 29/795 [01:22<35:20,  2.77s/it]

{'loss': 0.6885, 'grad_norm': 7.1881489753723145, 'learning_rate': 2.9e-05, 'epoch': 0.11}


  4%|▍         | 30/795 [01:25<35:22,  2.78s/it]

{'loss': 0.5047, 'grad_norm': 3.9983787536621094, 'learning_rate': 3e-05, 'epoch': 0.11}


  4%|▍         | 31/795 [01:28<35:18,  2.77s/it]

{'loss': 0.6307, 'grad_norm': 4.085281848907471, 'learning_rate': 3.1e-05, 'epoch': 0.12}


  4%|▍         | 32/795 [01:31<36:12,  2.85s/it]

{'loss': 0.5181, 'grad_norm': 4.547367572784424, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.12}


  4%|▍         | 33/795 [01:33<36:02,  2.84s/it]

{'loss': 0.6545, 'grad_norm': 10.91840934753418, 'learning_rate': 3.3e-05, 'epoch': 0.12}


  4%|▍         | 34/795 [01:36<35:35,  2.81s/it]

{'loss': 0.3383, 'grad_norm': 3.830824136734009, 'learning_rate': 3.4000000000000007e-05, 'epoch': 0.13}


  4%|▍         | 35/795 [01:39<35:23,  2.79s/it]

{'loss': 0.4106, 'grad_norm': 6.180461883544922, 'learning_rate': 3.5e-05, 'epoch': 0.13}


  5%|▍         | 36/795 [01:42<35:37,  2.82s/it]

{'loss': 0.7223, 'grad_norm': 10.299790382385254, 'learning_rate': 3.6e-05, 'epoch': 0.14}


  5%|▍         | 37/795 [01:45<35:23,  2.80s/it]

{'loss': 0.9011, 'grad_norm': 8.521520614624023, 'learning_rate': 3.7e-05, 'epoch': 0.14}


  5%|▍         | 38/795 [01:47<35:24,  2.81s/it]

{'loss': 0.3363, 'grad_norm': 6.159188747406006, 'learning_rate': 3.8e-05, 'epoch': 0.14}


  5%|▍         | 39/795 [01:50<35:19,  2.80s/it]

{'loss': 0.269, 'grad_norm': 3.7899649143218994, 'learning_rate': 3.9000000000000006e-05, 'epoch': 0.15}


  5%|▌         | 40/795 [01:53<35:36,  2.83s/it]

{'loss': 0.2996, 'grad_norm': 7.827201843261719, 'learning_rate': 4e-05, 'epoch': 0.15}


  5%|▌         | 41/795 [01:56<35:17,  2.81s/it]

{'loss': 0.1859, 'grad_norm': 3.654431104660034, 'learning_rate': 4.1e-05, 'epoch': 0.15}


  5%|▌         | 42/795 [01:59<35:03,  2.79s/it]

{'loss': 0.16, 'grad_norm': 2.8432881832122803, 'learning_rate': 4.2e-05, 'epoch': 0.16}


  5%|▌         | 43/795 [02:01<34:41,  2.77s/it]

{'loss': 0.1133, 'grad_norm': 2.139831781387329, 'learning_rate': 4.3e-05, 'epoch': 0.16}


  6%|▌         | 44/795 [02:04<34:39,  2.77s/it]

{'loss': 0.0905, 'grad_norm': 1.9449326992034912, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.17}


  6%|▌         | 45/795 [02:07<34:46,  2.78s/it]

{'loss': 2.3568, 'grad_norm': 34.12663650512695, 'learning_rate': 4.5e-05, 'epoch': 0.17}


  6%|▌         | 46/795 [02:10<34:48,  2.79s/it]

{'loss': 0.07, 'grad_norm': 1.3082739114761353, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.17}


  6%|▌         | 47/795 [02:13<34:51,  2.80s/it]

{'loss': 0.0878, 'grad_norm': 2.3054213523864746, 'learning_rate': 4.7e-05, 'epoch': 0.18}


  6%|▌         | 48/795 [02:15<35:23,  2.84s/it]

{'loss': 0.065, 'grad_norm': 1.294819951057434, 'learning_rate': 4.8e-05, 'epoch': 0.18}


  6%|▌         | 49/795 [02:18<35:08,  2.83s/it]

{'loss': 1.7074, 'grad_norm': 14.613678932189941, 'learning_rate': 4.9e-05, 'epoch': 0.18}


  6%|▋         | 50/795 [02:21<35:34,  2.86s/it]

{'loss': 0.027, 'grad_norm': 0.4947701096534729, 'learning_rate': 5e-05, 'epoch': 0.19}


  6%|▋         | 51/795 [02:24<34:53,  2.81s/it]

{'loss': 0.0583, 'grad_norm': 4.488288879394531, 'learning_rate': 4.9932885906040274e-05, 'epoch': 0.19}


  7%|▋         | 52/795 [02:27<34:27,  2.78s/it]

{'loss': 0.03, 'grad_norm': 0.9851390719413757, 'learning_rate': 4.986577181208054e-05, 'epoch': 0.2}


  7%|▋         | 53/795 [02:29<33:41,  2.72s/it]

{'loss': 0.0192, 'grad_norm': 0.5596448183059692, 'learning_rate': 4.9798657718120805e-05, 'epoch': 0.2}


  7%|▋         | 54/795 [02:32<34:20,  2.78s/it]

{'loss': 0.0229, 'grad_norm': 0.41478970646858215, 'learning_rate': 4.9731543624161077e-05, 'epoch': 0.2}


  7%|▋         | 55/795 [02:35<33:51,  2.74s/it]

{'loss': 0.0121, 'grad_norm': 0.32644715905189514, 'learning_rate': 4.966442953020135e-05, 'epoch': 0.21}


  7%|▋         | 56/795 [02:38<33:47,  2.74s/it]

{'loss': 0.01, 'grad_norm': 0.23785871267318726, 'learning_rate': 4.9597315436241614e-05, 'epoch': 0.21}


  7%|▋         | 57/795 [02:40<34:19,  2.79s/it]

{'loss': 0.1659, 'grad_norm': 48.68952560424805, 'learning_rate': 4.953020134228188e-05, 'epoch': 0.22}


  7%|▋         | 58/795 [02:43<34:06,  2.78s/it]

{'loss': 0.0078, 'grad_norm': 0.21780525147914886, 'learning_rate': 4.946308724832215e-05, 'epoch': 0.22}


  7%|▋         | 59/795 [02:46<33:56,  2.77s/it]

{'loss': 0.0109, 'grad_norm': 0.20986443758010864, 'learning_rate': 4.9395973154362416e-05, 'epoch': 0.22}


  8%|▊         | 60/795 [02:49<34:11,  2.79s/it]

{'loss': 2.4533, 'grad_norm': 35.069908142089844, 'learning_rate': 4.932885906040269e-05, 'epoch': 0.23}


  8%|▊         | 61/795 [02:51<33:46,  2.76s/it]

{'loss': 2.3861, 'grad_norm': 13.634841918945312, 'learning_rate': 4.926174496644296e-05, 'epoch': 0.23}


  8%|▊         | 62/795 [02:54<33:28,  2.74s/it]

{'loss': 2.6179, 'grad_norm': 17.71994972229004, 'learning_rate': 4.9194630872483225e-05, 'epoch': 0.23}


  8%|▊         | 63/795 [02:57<34:09,  2.80s/it]

{'loss': 0.0085, 'grad_norm': 0.17247258126735687, 'learning_rate': 4.912751677852349e-05, 'epoch': 0.24}


  8%|▊         | 64/795 [03:00<34:08,  2.80s/it]

{'loss': 0.0104, 'grad_norm': 0.25554358959198, 'learning_rate': 4.906040268456376e-05, 'epoch': 0.24}


  8%|▊         | 65/795 [03:03<34:37,  2.85s/it]

{'loss': 0.0109, 'grad_norm': 0.3382002115249634, 'learning_rate': 4.8993288590604034e-05, 'epoch': 0.25}


  8%|▊         | 66/795 [03:06<34:21,  2.83s/it]

{'loss': 0.0082, 'grad_norm': 0.1903190165758133, 'learning_rate': 4.89261744966443e-05, 'epoch': 0.25}


  8%|▊         | 67/795 [03:08<33:55,  2.80s/it]

{'loss': 0.0155, 'grad_norm': 0.33770886063575745, 'learning_rate': 4.8859060402684564e-05, 'epoch': 0.25}


  9%|▊         | 68/795 [03:11<34:28,  2.85s/it]

{'loss': 0.0451, 'grad_norm': 13.908136367797852, 'learning_rate': 4.8791946308724836e-05, 'epoch': 0.26}


  9%|▊         | 69/795 [03:14<35:09,  2.91s/it]

{'loss': 0.0114, 'grad_norm': 0.30994167923927307, 'learning_rate': 4.87248322147651e-05, 'epoch': 0.26}


  9%|▉         | 70/795 [03:17<35:33,  2.94s/it]

{'loss': 0.0098, 'grad_norm': 0.22479426860809326, 'learning_rate': 4.865771812080537e-05, 'epoch': 0.26}


  9%|▉         | 71/795 [03:20<35:44,  2.96s/it]

{'loss': 0.0057, 'grad_norm': 0.11433187127113342, 'learning_rate': 4.859060402684564e-05, 'epoch': 0.27}


  9%|▉         | 72/795 [03:24<37:17,  3.09s/it]

{'loss': 0.0061, 'grad_norm': 0.16112513840198517, 'learning_rate': 4.852348993288591e-05, 'epoch': 0.27}


  9%|▉         | 73/795 [03:27<38:17,  3.18s/it]

{'loss': 0.0042, 'grad_norm': 0.08116893470287323, 'learning_rate': 4.8456375838926175e-05, 'epoch': 0.28}


  9%|▉         | 74/795 [03:31<38:56,  3.24s/it]

{'loss': 1.5071, 'grad_norm': 202.45083618164062, 'learning_rate': 4.838926174496645e-05, 'epoch': 0.28}


  9%|▉         | 75/795 [03:34<38:57,  3.25s/it]

{'loss': 0.0045, 'grad_norm': 0.08898302167654037, 'learning_rate': 4.832214765100672e-05, 'epoch': 0.28}


 10%|▉         | 76/795 [03:37<37:55,  3.17s/it]

{'loss': 2.7619, 'grad_norm': 18.473400115966797, 'learning_rate': 4.825503355704698e-05, 'epoch': 0.29}


 10%|▉         | 77/795 [03:40<36:54,  3.08s/it]

{'loss': 0.0034, 'grad_norm': 0.08938699960708618, 'learning_rate': 4.818791946308725e-05, 'epoch': 0.29}


 10%|▉         | 78/795 [03:42<35:42,  2.99s/it]

{'loss': 0.0035, 'grad_norm': 0.08626178652048111, 'learning_rate': 4.812080536912752e-05, 'epoch': 0.29}


 10%|▉         | 79/795 [03:45<35:09,  2.95s/it]

{'loss': 0.0068, 'grad_norm': 0.18423523008823395, 'learning_rate': 4.8053691275167786e-05, 'epoch': 0.3}


 10%|█         | 80/795 [03:48<34:55,  2.93s/it]

{'loss': 0.8717, 'grad_norm': 148.08160400390625, 'learning_rate': 4.798657718120805e-05, 'epoch': 0.3}


 10%|█         | 81/795 [03:51<35:13,  2.96s/it]

{'loss': 2.25, 'grad_norm': 33.54359436035156, 'learning_rate': 4.7919463087248323e-05, 'epoch': 0.31}


 10%|█         | 82/795 [03:54<35:23,  2.98s/it]

{'loss': 2.476, 'grad_norm': 51.9165153503418, 'learning_rate': 4.7852348993288595e-05, 'epoch': 0.31}


 10%|█         | 83/795 [03:57<35:33,  3.00s/it]

{'loss': 0.0025, 'grad_norm': 0.06641370803117752, 'learning_rate': 4.778523489932886e-05, 'epoch': 0.31}


 11%|█         | 84/795 [04:00<35:19,  2.98s/it]

{'loss': 0.0252, 'grad_norm': 4.049863815307617, 'learning_rate': 4.771812080536913e-05, 'epoch': 0.32}


 11%|█         | 85/795 [04:03<35:06,  2.97s/it]

{'loss': 0.058, 'grad_norm': 15.855113983154297, 'learning_rate': 4.76510067114094e-05, 'epoch': 0.32}


 11%|█         | 86/795 [04:06<35:06,  2.97s/it]

{'loss': 1.1148, 'grad_norm': 178.96754455566406, 'learning_rate': 4.758389261744966e-05, 'epoch': 0.32}


 11%|█         | 87/795 [04:09<34:50,  2.95s/it]

{'loss': 0.005, 'grad_norm': 0.12332690507173538, 'learning_rate': 4.7516778523489935e-05, 'epoch': 0.33}


 11%|█         | 88/795 [04:12<35:39,  3.03s/it]

{'loss': 0.0076, 'grad_norm': 0.20596595108509064, 'learning_rate': 4.7449664429530207e-05, 'epoch': 0.33}


 11%|█         | 89/795 [04:15<35:32,  3.02s/it]

{'loss': 0.0059, 'grad_norm': 0.14711640775203705, 'learning_rate': 4.738255033557047e-05, 'epoch': 0.34}


 11%|█▏        | 90/795 [04:18<35:28,  3.02s/it]

{'loss': 2.3861, 'grad_norm': 16.069072723388672, 'learning_rate': 4.731543624161074e-05, 'epoch': 0.34}


 11%|█▏        | 91/795 [04:21<34:39,  2.95s/it]

{'loss': 1.3897, 'grad_norm': 85.25798797607422, 'learning_rate': 4.724832214765101e-05, 'epoch': 0.34}


 12%|█▏        | 92/795 [04:24<34:52,  2.98s/it]

{'loss': 0.0097, 'grad_norm': 0.3180437684059143, 'learning_rate': 4.718120805369128e-05, 'epoch': 0.35}


 12%|█▏        | 93/795 [04:27<34:14,  2.93s/it]

{'loss': 0.0289, 'grad_norm': 1.1974507570266724, 'learning_rate': 4.7114093959731546e-05, 'epoch': 0.35}


 12%|█▏        | 94/795 [04:30<34:51,  2.98s/it]

{'loss': 0.0092, 'grad_norm': 0.5010631084442139, 'learning_rate': 4.704697986577181e-05, 'epoch': 0.35}


 12%|█▏        | 95/795 [04:33<34:49,  2.98s/it]

{'loss': 0.0119, 'grad_norm': 0.3400384485721588, 'learning_rate': 4.697986577181208e-05, 'epoch': 0.36}


 12%|█▏        | 96/795 [04:36<34:54,  3.00s/it]

{'loss': 0.0034, 'grad_norm': 0.14777977764606476, 'learning_rate': 4.691275167785235e-05, 'epoch': 0.36}


 12%|█▏        | 97/795 [04:39<34:09,  2.94s/it]

{'loss': 0.0045, 'grad_norm': 0.09050944447517395, 'learning_rate': 4.684563758389262e-05, 'epoch': 0.37}


 12%|█▏        | 98/795 [04:42<33:58,  2.93s/it]

{'loss': 0.0056, 'grad_norm': 0.14734047651290894, 'learning_rate': 4.677852348993289e-05, 'epoch': 0.37}


 12%|█▏        | 99/795 [04:45<33:52,  2.92s/it]

{'loss': 0.0039, 'grad_norm': 0.07945244759321213, 'learning_rate': 4.671140939597316e-05, 'epoch': 0.37}


 13%|█▎        | 100/795 [04:48<33:47,  2.92s/it]

{'loss': 3.0833, 'grad_norm': 14.950118064880371, 'learning_rate': 4.664429530201342e-05, 'epoch': 0.38}


 13%|█▎        | 101/795 [04:50<33:43,  2.92s/it]

{'loss': 0.0055, 'grad_norm': 0.14027847349643707, 'learning_rate': 4.6577181208053694e-05, 'epoch': 0.38}


 13%|█▎        | 102/795 [04:54<34:08,  2.96s/it]

{'loss': 0.0022, 'grad_norm': 0.04518579691648483, 'learning_rate': 4.6510067114093966e-05, 'epoch': 0.38}


 13%|█▎        | 103/795 [04:57<34:38,  3.00s/it]

{'loss': 0.0036, 'grad_norm': 0.10142692923545837, 'learning_rate': 4.644295302013423e-05, 'epoch': 0.39}


 13%|█▎        | 104/795 [04:59<33:58,  2.95s/it]

{'loss': 1.4622, 'grad_norm': 126.21958923339844, 'learning_rate': 4.6375838926174496e-05, 'epoch': 0.39}


 13%|█▎        | 105/795 [05:02<33:53,  2.95s/it]

{'loss': 0.0048, 'grad_norm': 0.13489636778831482, 'learning_rate': 4.630872483221477e-05, 'epoch': 0.4}


 13%|█▎        | 106/795 [05:05<33:26,  2.91s/it]

{'loss': 0.0039, 'grad_norm': 0.08871396631002426, 'learning_rate': 4.624161073825504e-05, 'epoch': 0.4}


 13%|█▎        | 107/795 [05:08<33:10,  2.89s/it]

{'loss': 0.9261, 'grad_norm': 130.3323974609375, 'learning_rate': 4.6174496644295305e-05, 'epoch': 0.4}


 14%|█▎        | 108/795 [05:11<33:55,  2.96s/it]

{'loss': 0.0024, 'grad_norm': 0.04826531931757927, 'learning_rate': 4.610738255033557e-05, 'epoch': 0.41}


 14%|█▎        | 109/795 [05:14<33:55,  2.97s/it]

{'loss': 0.0028, 'grad_norm': 0.054754406213760376, 'learning_rate': 4.604026845637584e-05, 'epoch': 0.41}


 14%|█▍        | 110/795 [05:17<33:29,  2.93s/it]

{'loss': 0.0039, 'grad_norm': 0.08590951561927795, 'learning_rate': 4.597315436241611e-05, 'epoch': 0.42}


 14%|█▍        | 111/795 [05:20<33:06,  2.90s/it]

{'loss': 0.0042, 'grad_norm': 0.13727086782455444, 'learning_rate': 4.590604026845638e-05, 'epoch': 0.42}


 14%|█▍        | 112/795 [05:23<33:10,  2.91s/it]

{'loss': 2.7585, 'grad_norm': 13.504322052001953, 'learning_rate': 4.583892617449665e-05, 'epoch': 0.42}


 14%|█▍        | 113/795 [05:26<32:49,  2.89s/it]

{'loss': 0.0041, 'grad_norm': 0.10357506573200226, 'learning_rate': 4.5771812080536916e-05, 'epoch': 0.43}


 14%|█▍        | 114/795 [05:29<32:54,  2.90s/it]

{'loss': 1.5626, 'grad_norm': 89.66680908203125, 'learning_rate': 4.570469798657718e-05, 'epoch': 0.43}


 14%|█▍        | 115/795 [05:31<32:35,  2.88s/it]

{'loss': 0.0045, 'grad_norm': 0.1144423633813858, 'learning_rate': 4.5637583892617453e-05, 'epoch': 0.43}


 15%|█▍        | 116/795 [05:34<31:55,  2.82s/it]

{'loss': 0.0061, 'grad_norm': 0.15525327622890472, 'learning_rate': 4.5570469798657725e-05, 'epoch': 0.44}


 15%|█▍        | 117/795 [05:37<31:50,  2.82s/it]

{'loss': 2.3599, 'grad_norm': 19.099384307861328, 'learning_rate': 4.5503355704697984e-05, 'epoch': 0.44}


 15%|█▍        | 118/795 [05:40<32:03,  2.84s/it]

{'loss': 0.6073, 'grad_norm': 88.32684326171875, 'learning_rate': 4.5436241610738256e-05, 'epoch': 0.45}


 15%|█▍        | 119/795 [05:43<32:31,  2.89s/it]

{'loss': 2.1012, 'grad_norm': 15.92974853515625, 'learning_rate': 4.536912751677853e-05, 'epoch': 0.45}


 15%|█▌        | 120/795 [05:46<32:15,  2.87s/it]

{'loss': 0.0056, 'grad_norm': 0.13202962279319763, 'learning_rate': 4.530201342281879e-05, 'epoch': 0.45}


 15%|█▌        | 121/795 [05:49<32:31,  2.89s/it]

{'loss': 2.8086, 'grad_norm': 16.945762634277344, 'learning_rate': 4.5234899328859065e-05, 'epoch': 0.46}


 15%|█▌        | 122/795 [05:52<32:52,  2.93s/it]

{'loss': 0.0076, 'grad_norm': 0.19650636613368988, 'learning_rate': 4.516778523489933e-05, 'epoch': 0.46}


 15%|█▌        | 123/795 [05:55<32:56,  2.94s/it]

{'loss': 0.0175, 'grad_norm': 0.8260939121246338, 'learning_rate': 4.51006711409396e-05, 'epoch': 0.46}


 16%|█▌        | 124/795 [05:57<32:41,  2.92s/it]

{'loss': 0.003, 'grad_norm': 0.09300969541072845, 'learning_rate': 4.503355704697987e-05, 'epoch': 0.47}


 16%|█▌        | 125/795 [06:00<33:06,  2.96s/it]

{'loss': 0.4982, 'grad_norm': 64.53221130371094, 'learning_rate': 4.496644295302014e-05, 'epoch': 0.47}


 16%|█▌        | 126/795 [06:03<32:19,  2.90s/it]

{'loss': 0.0248, 'grad_norm': 0.7209399342536926, 'learning_rate': 4.4899328859060404e-05, 'epoch': 0.48}


 16%|█▌        | 127/795 [06:06<32:07,  2.89s/it]

{'loss': 0.012, 'grad_norm': 0.2540016770362854, 'learning_rate': 4.483221476510067e-05, 'epoch': 0.48}


 16%|█▌        | 128/795 [06:09<31:53,  2.87s/it]

{'loss': 0.0107, 'grad_norm': 0.3542500138282776, 'learning_rate': 4.476510067114094e-05, 'epoch': 0.48}


 16%|█▌        | 129/795 [06:12<31:26,  2.83s/it]

{'loss': 0.0059, 'grad_norm': 0.1291658580303192, 'learning_rate': 4.469798657718121e-05, 'epoch': 0.49}


 16%|█▋        | 130/795 [06:15<31:27,  2.84s/it]

{'loss': 0.0031, 'grad_norm': 0.0972973108291626, 'learning_rate': 4.463087248322148e-05, 'epoch': 0.49}


 16%|█▋        | 131/795 [06:17<31:19,  2.83s/it]

{'loss': 2.2239, 'grad_norm': 13.192072868347168, 'learning_rate': 4.456375838926174e-05, 'epoch': 0.49}


 17%|█▋        | 132/795 [06:20<31:26,  2.85s/it]

{'loss': 0.0082, 'grad_norm': 0.20947499573230743, 'learning_rate': 4.4496644295302015e-05, 'epoch': 0.5}


 17%|█▋        | 133/795 [06:23<31:47,  2.88s/it]

{'loss': 0.0084, 'grad_norm': 0.20161041617393494, 'learning_rate': 4.442953020134229e-05, 'epoch': 0.5}


 17%|█▋        | 134/795 [06:26<32:03,  2.91s/it]

{'loss': 0.007, 'grad_norm': 0.18154114484786987, 'learning_rate': 4.436241610738255e-05, 'epoch': 0.51}


 17%|█▋        | 135/795 [06:29<32:12,  2.93s/it]

{'loss': 0.6696, 'grad_norm': 96.9950180053711, 'learning_rate': 4.4295302013422824e-05, 'epoch': 0.51}


 17%|█▋        | 136/795 [06:32<32:02,  2.92s/it]

{'loss': 0.0077, 'grad_norm': 0.17328467965126038, 'learning_rate': 4.422818791946309e-05, 'epoch': 0.51}


 17%|█▋        | 137/795 [06:35<32:27,  2.96s/it]

{'loss': 0.0061, 'grad_norm': 0.140369713306427, 'learning_rate': 4.4161073825503354e-05, 'epoch': 0.52}


 17%|█▋        | 138/795 [06:38<32:13,  2.94s/it]

{'loss': 0.005, 'grad_norm': 0.11747214198112488, 'learning_rate': 4.4093959731543626e-05, 'epoch': 0.52}


 17%|█▋        | 139/795 [06:41<32:38,  2.99s/it]

{'loss': 0.0048, 'grad_norm': 0.09632740914821625, 'learning_rate': 4.40268456375839e-05, 'epoch': 0.52}


 18%|█▊        | 140/795 [06:44<32:24,  2.97s/it]

{'loss': 0.0055, 'grad_norm': 0.11763374507427216, 'learning_rate': 4.395973154362416e-05, 'epoch': 0.53}


 18%|█▊        | 141/795 [06:47<32:12,  2.96s/it]

{'loss': 0.0023, 'grad_norm': 0.06360442191362381, 'learning_rate': 4.389261744966443e-05, 'epoch': 0.53}


 18%|█▊        | 142/795 [06:50<31:56,  2.93s/it]

{'loss': 2.6577, 'grad_norm': 30.167720794677734, 'learning_rate': 4.38255033557047e-05, 'epoch': 0.54}


 18%|█▊        | 143/795 [06:53<32:27,  2.99s/it]

{'loss': 1.0308, 'grad_norm': 117.90232849121094, 'learning_rate': 4.375838926174497e-05, 'epoch': 0.54}


 18%|█▊        | 144/795 [06:56<32:04,  2.96s/it]

{'loss': 0.0348, 'grad_norm': 5.570077419281006, 'learning_rate': 4.369127516778524e-05, 'epoch': 0.54}


 18%|█▊        | 145/795 [06:59<32:13,  2.98s/it]

{'loss': 0.0248, 'grad_norm': 1.9143973588943481, 'learning_rate': 4.36241610738255e-05, 'epoch': 0.55}


 18%|█▊        | 146/795 [07:02<32:31,  3.01s/it]

{'loss': 0.0102, 'grad_norm': 1.1693063974380493, 'learning_rate': 4.3557046979865775e-05, 'epoch': 0.55}


 18%|█▊        | 147/795 [07:05<32:36,  3.02s/it]

{'loss': 0.0051, 'grad_norm': 0.14762255549430847, 'learning_rate': 4.348993288590604e-05, 'epoch': 0.55}


 19%|█▊        | 148/795 [07:08<32:26,  3.01s/it]

{'loss': 0.0074, 'grad_norm': 0.44344067573547363, 'learning_rate': 4.342281879194631e-05, 'epoch': 0.56}


 19%|█▊        | 149/795 [07:11<31:49,  2.96s/it]

{'loss': 0.0044, 'grad_norm': 0.23372292518615723, 'learning_rate': 4.335570469798658e-05, 'epoch': 0.56}


 19%|█▉        | 150/795 [07:14<31:27,  2.93s/it]

{'loss': 2.0645, 'grad_norm': 44.5358772277832, 'learning_rate': 4.328859060402685e-05, 'epoch': 0.57}


 19%|█▉        | 151/795 [07:17<31:55,  2.97s/it]

{'loss': 0.0025, 'grad_norm': 0.053450360894203186, 'learning_rate': 4.3221476510067114e-05, 'epoch': 0.57}


 19%|█▉        | 152/795 [07:20<31:38,  2.95s/it]

{'loss': 0.0024, 'grad_norm': 0.06742984056472778, 'learning_rate': 4.3154362416107386e-05, 'epoch': 0.57}


 19%|█▉        | 153/795 [07:23<31:49,  2.97s/it]

{'loss': 0.0024, 'grad_norm': 0.04887386038899422, 'learning_rate': 4.308724832214766e-05, 'epoch': 0.58}


 19%|█▉        | 154/795 [07:25<31:09,  2.92s/it]

{'loss': 0.0035, 'grad_norm': 0.0664641335606575, 'learning_rate': 4.3020134228187916e-05, 'epoch': 0.58}


 19%|█▉        | 155/795 [07:29<31:55,  2.99s/it]

{'loss': 0.0026, 'grad_norm': 0.05481472238898277, 'learning_rate': 4.295302013422819e-05, 'epoch': 0.58}


 20%|█▉        | 156/795 [07:31<31:40,  2.97s/it]

{'loss': 0.0035, 'grad_norm': 0.10444216430187225, 'learning_rate': 4.288590604026846e-05, 'epoch': 0.59}


 20%|█▉        | 157/795 [07:34<31:29,  2.96s/it]

{'loss': 0.5747, 'grad_norm': 47.48558044433594, 'learning_rate': 4.2818791946308725e-05, 'epoch': 0.59}


 20%|█▉        | 158/795 [07:37<31:20,  2.95s/it]

{'loss': 0.0012, 'grad_norm': 0.03259506821632385, 'learning_rate': 4.2751677852349e-05, 'epoch': 0.6}


 20%|██        | 159/795 [07:40<31:05,  2.93s/it]

{'loss': 2.0647, 'grad_norm': 76.69412231445312, 'learning_rate': 4.268456375838926e-05, 'epoch': 0.6}


 20%|██        | 160/795 [07:43<31:10,  2.95s/it]

{'loss': 0.0027, 'grad_norm': 0.07286032289266586, 'learning_rate': 4.2617449664429534e-05, 'epoch': 0.6}


 20%|██        | 161/795 [07:46<30:30,  2.89s/it]

{'loss': 0.0106, 'grad_norm': 1.8655428886413574, 'learning_rate': 4.25503355704698e-05, 'epoch': 0.61}


 20%|██        | 162/795 [07:49<30:40,  2.91s/it]

{'loss': 0.0018, 'grad_norm': 0.04916556924581528, 'learning_rate': 4.248322147651007e-05, 'epoch': 0.61}


 21%|██        | 163/795 [07:52<30:12,  2.87s/it]

{'loss': 0.0255, 'grad_norm': 6.783437728881836, 'learning_rate': 4.2416107382550336e-05, 'epoch': 0.62}


 21%|██        | 164/795 [07:55<30:34,  2.91s/it]

{'loss': 0.0032, 'grad_norm': 0.08537586033344269, 'learning_rate': 4.234899328859061e-05, 'epoch': 0.62}


 21%|██        | 165/795 [07:58<30:20,  2.89s/it]

{'loss': 2.4775, 'grad_norm': 55.262542724609375, 'learning_rate': 4.228187919463087e-05, 'epoch': 0.62}


 21%|██        | 166/795 [08:00<30:00,  2.86s/it]

{'loss': 0.0021, 'grad_norm': 0.050879716873168945, 'learning_rate': 4.2214765100671145e-05, 'epoch': 0.63}


 21%|██        | 167/795 [08:03<29:52,  2.85s/it]

{'loss': 0.0021, 'grad_norm': 0.05630562826991081, 'learning_rate': 4.214765100671142e-05, 'epoch': 0.63}


 21%|██        | 168/795 [08:06<30:27,  2.92s/it]

{'loss': 0.0024, 'grad_norm': 0.06370184570550919, 'learning_rate': 4.2080536912751675e-05, 'epoch': 0.63}


 21%|██▏       | 169/795 [08:09<30:39,  2.94s/it]

{'loss': 2.9548, 'grad_norm': 23.26696014404297, 'learning_rate': 4.201342281879195e-05, 'epoch': 0.64}


 21%|██▏       | 170/795 [08:12<30:42,  2.95s/it]

{'loss': 0.9039, 'grad_norm': 78.2514877319336, 'learning_rate': 4.194630872483222e-05, 'epoch': 0.64}


 22%|██▏       | 171/795 [08:15<29:56,  2.88s/it]

{'loss': 0.7451, 'grad_norm': 85.1793212890625, 'learning_rate': 4.1879194630872484e-05, 'epoch': 0.65}


 22%|██▏       | 172/795 [08:18<29:26,  2.83s/it]

{'loss': 0.0023, 'grad_norm': 0.07798624038696289, 'learning_rate': 4.181208053691275e-05, 'epoch': 0.65}


 22%|██▏       | 173/795 [08:20<29:21,  2.83s/it]

{'loss': 0.193, 'grad_norm': 39.140785217285156, 'learning_rate': 4.174496644295302e-05, 'epoch': 0.65}


 22%|██▏       | 174/795 [08:23<29:36,  2.86s/it]

{'loss': 0.0024, 'grad_norm': 0.06026499345898628, 'learning_rate': 4.1677852348993293e-05, 'epoch': 0.66}


 22%|██▏       | 175/795 [08:26<29:49,  2.89s/it]

{'loss': 0.0155, 'grad_norm': 2.9506912231445312, 'learning_rate': 4.161073825503356e-05, 'epoch': 0.66}


 22%|██▏       | 176/795 [08:29<29:29,  2.86s/it]

{'loss': 0.0027, 'grad_norm': 0.06996428221464157, 'learning_rate': 4.154362416107383e-05, 'epoch': 0.66}


 22%|██▏       | 177/795 [08:32<29:47,  2.89s/it]

{'loss': 0.0019, 'grad_norm': 0.05351511389017105, 'learning_rate': 4.1476510067114096e-05, 'epoch': 0.67}


 22%|██▏       | 178/795 [08:35<29:55,  2.91s/it]

{'loss': 0.0779, 'grad_norm': 12.217259407043457, 'learning_rate': 4.140939597315436e-05, 'epoch': 0.67}


 23%|██▎       | 179/795 [08:38<29:40,  2.89s/it]

{'loss': 0.004, 'grad_norm': 0.215010866522789, 'learning_rate': 4.134228187919463e-05, 'epoch': 0.68}


 23%|██▎       | 180/795 [08:41<29:27,  2.87s/it]

{'loss': 0.0051, 'grad_norm': 0.4026161730289459, 'learning_rate': 4.1275167785234905e-05, 'epoch': 0.68}


 23%|██▎       | 181/795 [08:44<29:38,  2.90s/it]

{'loss': 0.0028, 'grad_norm': 0.07174354046583176, 'learning_rate': 4.120805369127517e-05, 'epoch': 0.68}


 23%|██▎       | 182/795 [08:47<30:13,  2.96s/it]

{'loss': 0.0016, 'grad_norm': 0.03484371304512024, 'learning_rate': 4.1140939597315435e-05, 'epoch': 0.69}


 23%|██▎       | 183/795 [08:50<30:27,  2.99s/it]

{'loss': 0.0015, 'grad_norm': 0.04285109415650368, 'learning_rate': 4.107382550335571e-05, 'epoch': 0.69}


 23%|██▎       | 184/795 [08:53<30:17,  2.97s/it]

{'loss': 0.0035, 'grad_norm': 0.09658024460077286, 'learning_rate': 4.100671140939598e-05, 'epoch': 0.69}


 23%|██▎       | 185/795 [08:56<30:02,  2.95s/it]

{'loss': 0.002, 'grad_norm': 0.0741795226931572, 'learning_rate': 4.0939597315436244e-05, 'epoch': 0.7}


 23%|██▎       | 186/795 [08:59<29:58,  2.95s/it]

{'loss': 0.0033, 'grad_norm': 0.27369850873947144, 'learning_rate': 4.087248322147651e-05, 'epoch': 0.7}


 24%|██▎       | 187/795 [09:02<29:35,  2.92s/it]

{'loss': 0.0013, 'grad_norm': 0.039755526930093765, 'learning_rate': 4.080536912751678e-05, 'epoch': 0.71}


 24%|██▎       | 188/795 [09:04<29:25,  2.91s/it]

{'loss': 0.0016, 'grad_norm': 0.04446766525506973, 'learning_rate': 4.0738255033557046e-05, 'epoch': 0.71}


 24%|██▍       | 189/795 [09:08<30:12,  2.99s/it]

{'loss': 0.0015, 'grad_norm': 0.04047966003417969, 'learning_rate': 4.067114093959732e-05, 'epoch': 0.71}


 24%|██▍       | 190/795 [09:10<29:52,  2.96s/it]

{'loss': 0.2999, 'grad_norm': 69.70783233642578, 'learning_rate': 4.060402684563759e-05, 'epoch': 0.72}


 24%|██▍       | 191/795 [09:13<29:46,  2.96s/it]

{'loss': 0.0012, 'grad_norm': 0.03669726103544235, 'learning_rate': 4.0536912751677855e-05, 'epoch': 0.72}


 24%|██▍       | 192/795 [09:16<29:21,  2.92s/it]

{'loss': 0.0026, 'grad_norm': 0.06837458163499832, 'learning_rate': 4.046979865771812e-05, 'epoch': 0.72}


 24%|██▍       | 193/795 [09:19<29:46,  2.97s/it]

{'loss': 0.002, 'grad_norm': 0.04145798459649086, 'learning_rate': 4.040268456375839e-05, 'epoch': 0.73}


 24%|██▍       | 194/795 [09:22<29:33,  2.95s/it]

{'loss': 0.002, 'grad_norm': 0.06034277752041817, 'learning_rate': 4.0335570469798664e-05, 'epoch': 0.73}


 25%|██▍       | 195/795 [09:25<29:20,  2.93s/it]

{'loss': 3.106, 'grad_norm': 19.87137222290039, 'learning_rate': 4.026845637583892e-05, 'epoch': 0.74}


 25%|██▍       | 196/795 [09:28<29:03,  2.91s/it]

{'loss': 0.0058, 'grad_norm': 0.5159128308296204, 'learning_rate': 4.0201342281879194e-05, 'epoch': 0.74}


 25%|██▍       | 197/795 [09:31<28:55,  2.90s/it]

{'loss': 0.0012, 'grad_norm': 0.03048146702349186, 'learning_rate': 4.0134228187919466e-05, 'epoch': 0.74}


 25%|██▍       | 198/795 [09:34<29:15,  2.94s/it]

{'loss': 0.0006, 'grad_norm': 0.0187910795211792, 'learning_rate': 4.006711409395973e-05, 'epoch': 0.75}


 25%|██▌       | 199/795 [09:37<29:12,  2.94s/it]

{'loss': 0.006, 'grad_norm': 0.18272221088409424, 'learning_rate': 4e-05, 'epoch': 0.75}


 25%|██▌       | 200/795 [09:40<29:23,  2.96s/it]

{'loss': 0.0011, 'grad_norm': 0.04701494798064232, 'learning_rate': 3.993288590604027e-05, 'epoch': 0.75}


 25%|██▌       | 201/795 [09:43<29:11,  2.95s/it]

{'loss': 0.0021, 'grad_norm': 0.04914892092347145, 'learning_rate': 3.986577181208054e-05, 'epoch': 0.76}


 25%|██▌       | 202/795 [09:46<29:10,  2.95s/it]

{'loss': 0.0454, 'grad_norm': 7.6154937744140625, 'learning_rate': 3.9798657718120805e-05, 'epoch': 0.76}


 26%|██▌       | 203/795 [09:49<29:05,  2.95s/it]

{'loss': 0.5586, 'grad_norm': 42.9263801574707, 'learning_rate': 3.973154362416108e-05, 'epoch': 0.77}


 26%|██▌       | 204/795 [09:52<28:46,  2.92s/it]

{'loss': 0.0397, 'grad_norm': 3.546494960784912, 'learning_rate': 3.966442953020135e-05, 'epoch': 0.77}


 26%|██▌       | 205/795 [09:54<28:20,  2.88s/it]

{'loss': 0.0018, 'grad_norm': 0.04513748362660408, 'learning_rate': 3.959731543624161e-05, 'epoch': 0.77}


 26%|██▌       | 206/795 [09:57<27:58,  2.85s/it]

{'loss': 0.0016, 'grad_norm': 0.03809221088886261, 'learning_rate': 3.953020134228188e-05, 'epoch': 0.78}


 26%|██▌       | 207/795 [10:00<28:28,  2.91s/it]

{'loss': 0.011, 'grad_norm': 0.9982602000236511, 'learning_rate': 3.946308724832215e-05, 'epoch': 0.78}


 26%|██▌       | 208/795 [10:03<28:40,  2.93s/it]

{'loss': 0.0021, 'grad_norm': 0.05318723991513252, 'learning_rate': 3.939597315436242e-05, 'epoch': 0.78}


 26%|██▋       | 209/795 [10:06<28:44,  2.94s/it]

{'loss': 0.0022, 'grad_norm': 0.05712786316871643, 'learning_rate': 3.932885906040268e-05, 'epoch': 0.79}


 26%|██▋       | 210/795 [10:09<29:31,  3.03s/it]

{'loss': 0.0012, 'grad_norm': 0.03204542398452759, 'learning_rate': 3.9261744966442954e-05, 'epoch': 0.79}


 27%|██▋       | 211/795 [10:12<28:59,  2.98s/it]

{'loss': 0.007, 'grad_norm': 0.2540586292743683, 'learning_rate': 3.9194630872483226e-05, 'epoch': 0.8}


 27%|██▋       | 212/795 [10:15<28:41,  2.95s/it]

{'loss': 2.2934, 'grad_norm': 54.88360595703125, 'learning_rate': 3.912751677852349e-05, 'epoch': 0.8}


 27%|██▋       | 213/795 [10:18<28:35,  2.95s/it]

{'loss': 0.0024, 'grad_norm': 0.1850002408027649, 'learning_rate': 3.906040268456376e-05, 'epoch': 0.8}


 27%|██▋       | 214/795 [10:21<28:17,  2.92s/it]

{'loss': 0.1589, 'grad_norm': 78.65650939941406, 'learning_rate': 3.899328859060403e-05, 'epoch': 0.81}


 27%|██▋       | 215/795 [10:24<28:23,  2.94s/it]

{'loss': 0.0011, 'grad_norm': 0.04133405163884163, 'learning_rate': 3.89261744966443e-05, 'epoch': 0.81}


 27%|██▋       | 216/795 [10:27<28:21,  2.94s/it]

{'loss': 0.0081, 'grad_norm': 0.32567235827445984, 'learning_rate': 3.8859060402684565e-05, 'epoch': 0.82}


 27%|██▋       | 217/795 [10:30<28:18,  2.94s/it]

{'loss': 0.002, 'grad_norm': 0.04623575136065483, 'learning_rate': 3.879194630872484e-05, 'epoch': 0.82}


 27%|██▋       | 218/795 [10:33<28:23,  2.95s/it]

{'loss': 0.0028, 'grad_norm': 0.0736275166273117, 'learning_rate': 3.87248322147651e-05, 'epoch': 0.82}


 28%|██▊       | 219/795 [10:36<28:03,  2.92s/it]

{'loss': 0.0039, 'grad_norm': 0.18847163021564484, 'learning_rate': 3.865771812080537e-05, 'epoch': 0.83}


 28%|██▊       | 220/795 [10:38<27:58,  2.92s/it]

{'loss': 0.0007, 'grad_norm': 0.019604509696364403, 'learning_rate': 3.859060402684564e-05, 'epoch': 0.83}


 28%|██▊       | 221/795 [10:41<27:41,  2.89s/it]

{'loss': 0.0018, 'grad_norm': 0.0503956638276577, 'learning_rate': 3.852348993288591e-05, 'epoch': 0.83}


 28%|██▊       | 222/795 [10:44<27:49,  2.91s/it]

{'loss': 0.0019, 'grad_norm': 0.051699571311473846, 'learning_rate': 3.8456375838926176e-05, 'epoch': 0.84}


 28%|██▊       | 223/795 [10:47<27:56,  2.93s/it]

{'loss': 0.0013, 'grad_norm': 0.030796783044934273, 'learning_rate': 3.838926174496644e-05, 'epoch': 0.84}


 28%|██▊       | 224/795 [10:50<28:00,  2.94s/it]

{'loss': 0.001, 'grad_norm': 0.024577191099524498, 'learning_rate': 3.832214765100671e-05, 'epoch': 0.85}


 28%|██▊       | 225/795 [10:53<27:44,  2.92s/it]

{'loss': 4.4115, 'grad_norm': 166.45526123046875, 'learning_rate': 3.8255033557046985e-05, 'epoch': 0.85}


 28%|██▊       | 226/795 [10:56<27:56,  2.95s/it]

{'loss': 0.0005, 'grad_norm': 0.015710702165961266, 'learning_rate': 3.818791946308725e-05, 'epoch': 0.85}


 29%|██▊       | 227/795 [10:59<27:49,  2.94s/it]

{'loss': 0.0008, 'grad_norm': 0.023153061047196388, 'learning_rate': 3.812080536912752e-05, 'epoch': 0.86}


 29%|██▊       | 228/795 [11:02<28:35,  3.03s/it]

{'loss': 0.0011, 'grad_norm': 0.027132224291563034, 'learning_rate': 3.805369127516779e-05, 'epoch': 0.86}


 29%|██▉       | 229/795 [11:05<28:38,  3.04s/it]

{'loss': 0.0013, 'grad_norm': 0.03618837147951126, 'learning_rate': 3.798657718120805e-05, 'epoch': 0.86}


 29%|██▉       | 230/795 [11:08<28:15,  3.00s/it]

{'loss': 0.0008, 'grad_norm': 0.032948024570941925, 'learning_rate': 3.7919463087248324e-05, 'epoch': 0.87}


 29%|██▉       | 231/795 [11:11<28:12,  3.00s/it]

{'loss': 0.001, 'grad_norm': 0.02083224430680275, 'learning_rate': 3.7852348993288596e-05, 'epoch': 0.87}


 29%|██▉       | 232/795 [11:14<28:17,  3.01s/it]

{'loss': 0.001, 'grad_norm': 0.024022439494729042, 'learning_rate': 3.778523489932886e-05, 'epoch': 0.88}


 29%|██▉       | 233/795 [11:17<27:55,  2.98s/it]

{'loss': 0.001, 'grad_norm': 0.02717413194477558, 'learning_rate': 3.7718120805369127e-05, 'epoch': 0.88}


 29%|██▉       | 234/795 [11:20<27:14,  2.91s/it]

{'loss': 0.0013, 'grad_norm': 0.05897865071892738, 'learning_rate': 3.76510067114094e-05, 'epoch': 0.88}


 30%|██▉       | 235/795 [11:23<28:09,  3.02s/it]

{'loss': 0.0012, 'grad_norm': 0.032055459916591644, 'learning_rate': 3.758389261744967e-05, 'epoch': 0.89}


 30%|██▉       | 236/795 [11:26<28:03,  3.01s/it]

{'loss': 0.0012, 'grad_norm': 0.026808878406882286, 'learning_rate': 3.7516778523489936e-05, 'epoch': 0.89}


 30%|██▉       | 237/795 [11:29<27:40,  2.98s/it]

{'loss': 0.0011, 'grad_norm': 0.024022547528147697, 'learning_rate': 3.74496644295302e-05, 'epoch': 0.89}


 30%|██▉       | 238/795 [11:32<27:38,  2.98s/it]

{'loss': 3.0279, 'grad_norm': 27.005859375, 'learning_rate': 3.738255033557047e-05, 'epoch': 0.9}


 30%|███       | 239/795 [11:35<27:45,  2.99s/it]

{'loss': 0.0011, 'grad_norm': 0.029473213478922844, 'learning_rate': 3.731543624161074e-05, 'epoch': 0.9}


 30%|███       | 240/795 [11:38<27:06,  2.93s/it]

{'loss': 0.0047, 'grad_norm': 0.6733487248420715, 'learning_rate': 3.724832214765101e-05, 'epoch': 0.91}


 30%|███       | 241/795 [11:41<26:36,  2.88s/it]

{'loss': 0.0015, 'grad_norm': 0.039692819118499756, 'learning_rate': 3.7181208053691275e-05, 'epoch': 0.91}


 30%|███       | 242/795 [11:44<26:52,  2.92s/it]

{'loss': 0.0013, 'grad_norm': 0.035719189792871475, 'learning_rate': 3.711409395973155e-05, 'epoch': 0.91}


 31%|███       | 243/795 [11:46<26:39,  2.90s/it]

{'loss': 3.2583, 'grad_norm': 16.56538963317871, 'learning_rate': 3.704697986577181e-05, 'epoch': 0.92}


 31%|███       | 244/795 [11:49<26:51,  2.92s/it]

{'loss': 0.0013, 'grad_norm': 0.027588604018092155, 'learning_rate': 3.6979865771812084e-05, 'epoch': 0.92}


 31%|███       | 245/795 [11:53<27:22,  2.99s/it]

{'loss': 3.165, 'grad_norm': 14.451007843017578, 'learning_rate': 3.6912751677852356e-05, 'epoch': 0.92}


 31%|███       | 246/795 [11:56<27:36,  3.02s/it]

{'loss': 2.8305, 'grad_norm': 63.38199234008789, 'learning_rate': 3.6845637583892614e-05, 'epoch': 0.93}


 31%|███       | 247/795 [11:59<27:26,  3.01s/it]

{'loss': 0.0027, 'grad_norm': 0.0863664299249649, 'learning_rate': 3.6778523489932886e-05, 'epoch': 0.93}


 31%|███       | 248/795 [12:02<27:15,  2.99s/it]

{'loss': 0.0018, 'grad_norm': 0.03909147530794144, 'learning_rate': 3.671140939597316e-05, 'epoch': 0.94}


 31%|███▏      | 249/795 [12:05<27:40,  3.04s/it]

{'loss': 0.0023, 'grad_norm': 0.05363745987415314, 'learning_rate': 3.664429530201342e-05, 'epoch': 0.94}


 31%|███▏      | 250/795 [12:08<27:11,  2.99s/it]

{'loss': 0.002, 'grad_norm': 0.053726259618997574, 'learning_rate': 3.6577181208053695e-05, 'epoch': 0.94}


 32%|███▏      | 251/795 [12:11<27:40,  3.05s/it]

{'loss': 0.0034, 'grad_norm': 0.09364677965641022, 'learning_rate': 3.651006711409396e-05, 'epoch': 0.95}


 32%|███▏      | 252/795 [12:14<27:03,  2.99s/it]

{'loss': 0.0029, 'grad_norm': 0.07210054993629456, 'learning_rate': 3.644295302013423e-05, 'epoch': 0.95}


 32%|███▏      | 253/795 [12:17<26:45,  2.96s/it]

{'loss': 0.0029, 'grad_norm': 0.06855696439743042, 'learning_rate': 3.63758389261745e-05, 'epoch': 0.95}


 32%|███▏      | 254/795 [12:20<26:52,  2.98s/it]

{'loss': 0.0039, 'grad_norm': 0.10663614422082901, 'learning_rate': 3.630872483221477e-05, 'epoch': 0.96}


 32%|███▏      | 255/795 [12:23<26:42,  2.97s/it]

{'loss': 0.0019, 'grad_norm': 0.046267807483673096, 'learning_rate': 3.6241610738255034e-05, 'epoch': 0.96}


 32%|███▏      | 256/795 [12:25<26:08,  2.91s/it]

{'loss': 0.0021, 'grad_norm': 0.0494782030582428, 'learning_rate': 3.61744966442953e-05, 'epoch': 0.97}


 32%|███▏      | 257/795 [12:28<26:02,  2.90s/it]

{'loss': 0.0028, 'grad_norm': 0.09317756444215775, 'learning_rate': 3.610738255033557e-05, 'epoch': 0.97}


 32%|███▏      | 258/795 [12:31<25:32,  2.85s/it]

{'loss': 0.0022, 'grad_norm': 0.05169161409139633, 'learning_rate': 3.604026845637584e-05, 'epoch': 0.97}


 33%|███▎      | 259/795 [12:34<25:07,  2.81s/it]

{'loss': 0.0027, 'grad_norm': 0.07223368436098099, 'learning_rate': 3.597315436241611e-05, 'epoch': 0.98}


 33%|███▎      | 260/795 [12:36<24:47,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.03243487328290939, 'learning_rate': 3.5906040268456373e-05, 'epoch': 0.98}


 33%|███▎      | 261/795 [12:39<24:58,  2.81s/it]

{'loss': 0.0068, 'grad_norm': 0.38141077756881714, 'learning_rate': 3.5838926174496645e-05, 'epoch': 0.98}


 33%|███▎      | 262/795 [12:42<25:17,  2.85s/it]

{'loss': 0.0029, 'grad_norm': 0.08245736360549927, 'learning_rate': 3.577181208053692e-05, 'epoch': 0.99}


 33%|███▎      | 263/795 [12:45<25:25,  2.87s/it]

{'loss': 0.0012, 'grad_norm': 0.04027332738041878, 'learning_rate': 3.570469798657718e-05, 'epoch': 0.99}


 33%|███▎      | 264/795 [12:48<25:42,  2.91s/it]

{'loss': 0.0376, 'grad_norm': 6.774250030517578, 'learning_rate': 3.563758389261745e-05, 'epoch': 1.0}


 33%|███▎      | 265/795 [12:51<25:27,  2.88s/it]

{'loss': 0.0007, 'grad_norm': 0.022608010098338127, 'learning_rate': 3.557046979865772e-05, 'epoch': 1.0}


                                                 
 33%|███▎      | 265/795 [23:28<25:27,  2.88s/it]

{'eval_loss': 0.20711205899715424, 'eval_accuracy': 0.9377022653721683, 'eval_runtime': 637.1324, 'eval_samples_per_second': 1.94, 'eval_steps_per_second': 0.97, 'epoch': 1.0}


 33%|███▎      | 266/795 [23:35<28:40:06, 195.10s/it]

{'loss': 0.0032, 'grad_norm': 0.0998125821352005, 'learning_rate': 3.550335570469799e-05, 'epoch': 1.0}


 34%|███▎      | 267/795 [23:37<20:09:21, 137.43s/it]

{'loss': 0.0037, 'grad_norm': 0.11121796071529388, 'learning_rate': 3.5436241610738257e-05, 'epoch': 1.01}


 34%|███▎      | 268/795 [23:41<14:13:27, 97.17s/it] 

{'loss': 0.0026, 'grad_norm': 0.10494175553321838, 'learning_rate': 3.536912751677853e-05, 'epoch': 1.01}


 34%|███▍      | 269/795 [23:43<10:03:47, 68.87s/it]

{'loss': 0.0017, 'grad_norm': 0.04127974808216095, 'learning_rate': 3.5302013422818794e-05, 'epoch': 1.02}


 34%|███▍      | 270/795 [23:46<7:09:05, 49.04s/it] 

{'loss': 0.0036, 'grad_norm': 0.08455519378185272, 'learning_rate': 3.523489932885906e-05, 'epoch': 1.02}


 34%|███▍      | 271/795 [23:49<5:07:15, 35.18s/it]

{'loss': 0.003, 'grad_norm': 0.13492418825626373, 'learning_rate': 3.516778523489933e-05, 'epoch': 1.02}


 34%|███▍      | 272/795 [23:52<3:42:16, 25.50s/it]

{'loss': 0.0014, 'grad_norm': 0.03391725569963455, 'learning_rate': 3.51006711409396e-05, 'epoch': 1.03}


 34%|███▍      | 273/795 [23:55<2:42:37, 18.69s/it]

{'loss': 0.0012, 'grad_norm': 0.026339834555983543, 'learning_rate': 3.503355704697987e-05, 'epoch': 1.03}


 34%|███▍      | 274/795 [23:58<2:01:35, 14.00s/it]

{'loss': 0.0046, 'grad_norm': 0.5177393555641174, 'learning_rate': 3.496644295302013e-05, 'epoch': 1.03}


 35%|███▍      | 275/795 [24:01<1:32:30, 10.67s/it]

{'loss': 0.0034, 'grad_norm': 0.08624488860368729, 'learning_rate': 3.4899328859060405e-05, 'epoch': 1.04}


 35%|███▍      | 276/795 [24:04<1:12:45,  8.41s/it]

{'loss': 0.0008, 'grad_norm': 0.018156174570322037, 'learning_rate': 3.483221476510068e-05, 'epoch': 1.04}


 35%|███▍      | 277/795 [24:07<58:19,  6.76s/it]  

{'loss': 0.0021, 'grad_norm': 0.05528900399804115, 'learning_rate': 3.476510067114094e-05, 'epoch': 1.05}


 35%|███▍      | 278/795 [24:09<47:44,  5.54s/it]

{'loss': 0.0018, 'grad_norm': 0.0477202869951725, 'learning_rate': 3.469798657718121e-05, 'epoch': 1.05}


 35%|███▌      | 279/795 [24:13<41:09,  4.79s/it]

{'loss': 2.3477, 'grad_norm': 65.2034912109375, 'learning_rate': 3.463087248322148e-05, 'epoch': 1.05}


 35%|███▌      | 280/795 [24:16<36:40,  4.27s/it]

{'loss': 0.0017, 'grad_norm': 0.039619289338588715, 'learning_rate': 3.4563758389261744e-05, 'epoch': 1.06}


 35%|███▌      | 281/795 [24:19<33:52,  3.95s/it]

{'loss': 0.0012, 'grad_norm': 0.027933718636631966, 'learning_rate': 3.4496644295302016e-05, 'epoch': 1.06}


 35%|███▌      | 282/795 [24:22<31:00,  3.63s/it]

{'loss': 0.0006, 'grad_norm': 0.017804794013500214, 'learning_rate': 3.442953020134229e-05, 'epoch': 1.06}


 36%|███▌      | 283/795 [24:25<29:00,  3.40s/it]

{'loss': 0.0012, 'grad_norm': 0.02791886031627655, 'learning_rate': 3.436241610738255e-05, 'epoch': 1.07}


 36%|███▌      | 284/795 [24:28<27:54,  3.28s/it]

{'loss': 0.0012, 'grad_norm': 0.02608605846762657, 'learning_rate': 3.429530201342282e-05, 'epoch': 1.07}


 36%|███▌      | 285/795 [24:31<27:12,  3.20s/it]

{'loss': 0.0005, 'grad_norm': 0.014373289421200752, 'learning_rate': 3.422818791946309e-05, 'epoch': 1.08}


 36%|███▌      | 286/795 [24:33<26:26,  3.12s/it]

{'loss': 0.0013, 'grad_norm': 0.03623606637120247, 'learning_rate': 3.416107382550336e-05, 'epoch': 1.08}


 36%|███▌      | 287/795 [24:36<26:04,  3.08s/it]

{'loss': 1.2621, 'grad_norm': 40.385807037353516, 'learning_rate': 3.409395973154362e-05, 'epoch': 1.08}


 36%|███▌      | 288/795 [24:39<25:13,  2.99s/it]

{'loss': 0.0013, 'grad_norm': 0.06439296156167984, 'learning_rate': 3.402684563758389e-05, 'epoch': 1.09}


 36%|███▋      | 289/795 [24:42<25:13,  2.99s/it]

{'loss': 0.0136, 'grad_norm': 0.8377049565315247, 'learning_rate': 3.3959731543624164e-05, 'epoch': 1.09}


 36%|███▋      | 290/795 [24:45<24:49,  2.95s/it]

{'loss': 0.0006, 'grad_norm': 0.015721648931503296, 'learning_rate': 3.389261744966443e-05, 'epoch': 1.09}


 37%|███▋      | 291/795 [24:48<24:43,  2.94s/it]

{'loss': 0.0005, 'grad_norm': 0.016464058309793472, 'learning_rate': 3.38255033557047e-05, 'epoch': 1.1}


 37%|███▋      | 292/795 [24:51<24:35,  2.93s/it]

{'loss': 0.0015, 'grad_norm': 0.038616664707660675, 'learning_rate': 3.3758389261744966e-05, 'epoch': 1.1}


 37%|███▋      | 293/795 [24:54<24:06,  2.88s/it]

{'loss': 0.0014, 'grad_norm': 0.03767583891749382, 'learning_rate': 3.369127516778524e-05, 'epoch': 1.11}


 37%|███▋      | 294/795 [24:57<24:26,  2.93s/it]

{'loss': 0.0004, 'grad_norm': 0.012175993993878365, 'learning_rate': 3.3624161073825504e-05, 'epoch': 1.11}


 37%|███▋      | 295/795 [24:59<24:00,  2.88s/it]

{'loss': 0.0011, 'grad_norm': 0.024265792220830917, 'learning_rate': 3.3557046979865775e-05, 'epoch': 1.11}


 37%|███▋      | 296/795 [25:02<24:13,  2.91s/it]

{'loss': 0.0004, 'grad_norm': 0.01227585505694151, 'learning_rate': 3.348993288590605e-05, 'epoch': 1.12}


 37%|███▋      | 297/795 [25:05<24:08,  2.91s/it]

{'loss': 0.0008, 'grad_norm': 0.018917525187134743, 'learning_rate': 3.3422818791946306e-05, 'epoch': 1.12}


 37%|███▋      | 298/795 [25:09<24:53,  3.00s/it]

{'loss': 0.0012, 'grad_norm': 0.036127395927906036, 'learning_rate': 3.335570469798658e-05, 'epoch': 1.12}


 38%|███▊      | 299/795 [25:11<23:59,  2.90s/it]

{'loss': 0.0031, 'grad_norm': 0.2884826958179474, 'learning_rate': 3.328859060402685e-05, 'epoch': 1.13}


 38%|███▊      | 300/795 [25:14<24:22,  2.96s/it]

{'loss': 0.003, 'grad_norm': 0.3260033428668976, 'learning_rate': 3.3221476510067115e-05, 'epoch': 1.13}


 38%|███▊      | 301/795 [25:17<23:59,  2.91s/it]

{'loss': 0.0014, 'grad_norm': 0.10100319236516953, 'learning_rate': 3.315436241610738e-05, 'epoch': 1.14}


 38%|███▊      | 302/795 [25:20<24:18,  2.96s/it]

{'loss': 0.001, 'grad_norm': 0.028524935245513916, 'learning_rate': 3.308724832214765e-05, 'epoch': 1.14}


 38%|███▊      | 303/795 [25:23<24:02,  2.93s/it]

{'loss': 0.0009, 'grad_norm': 0.04557056352496147, 'learning_rate': 3.3020134228187924e-05, 'epoch': 1.14}


 38%|███▊      | 304/795 [25:26<23:31,  2.87s/it]

{'loss': 0.0007, 'grad_norm': 0.015271767042577267, 'learning_rate': 3.295302013422819e-05, 'epoch': 1.15}


 38%|███▊      | 305/795 [25:29<23:10,  2.84s/it]

{'loss': 0.0004, 'grad_norm': 0.01108483038842678, 'learning_rate': 3.288590604026846e-05, 'epoch': 1.15}


 38%|███▊      | 306/795 [25:32<23:22,  2.87s/it]

{'loss': 0.0012, 'grad_norm': 0.027862468734383583, 'learning_rate': 3.2818791946308726e-05, 'epoch': 1.15}


 39%|███▊      | 307/795 [25:35<23:41,  2.91s/it]

{'loss': 0.0012, 'grad_norm': 0.02989351935684681, 'learning_rate': 3.275167785234899e-05, 'epoch': 1.16}


 39%|███▊      | 308/795 [25:37<23:32,  2.90s/it]

{'loss': 0.0006, 'grad_norm': 0.012792546302080154, 'learning_rate': 3.268456375838926e-05, 'epoch': 1.16}


 39%|███▉      | 309/795 [25:40<23:25,  2.89s/it]

{'loss': 0.0006, 'grad_norm': 0.020291965454816818, 'learning_rate': 3.2617449664429535e-05, 'epoch': 1.17}


 39%|███▉      | 310/795 [25:43<23:09,  2.87s/it]

{'loss': 0.0004, 'grad_norm': 0.01191309280693531, 'learning_rate': 3.25503355704698e-05, 'epoch': 1.17}


 39%|███▉      | 311/795 [25:46<23:16,  2.88s/it]

{'loss': 0.0006, 'grad_norm': 0.01417987234890461, 'learning_rate': 3.2483221476510065e-05, 'epoch': 1.17}


 39%|███▉      | 312/795 [25:49<23:04,  2.87s/it]

{'loss': 0.0006, 'grad_norm': 0.013918680138885975, 'learning_rate': 3.241610738255034e-05, 'epoch': 1.18}


 39%|███▉      | 313/795 [25:52<22:44,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.01564374566078186, 'learning_rate': 3.234899328859061e-05, 'epoch': 1.18}


 39%|███▉      | 314/795 [25:54<22:43,  2.84s/it]

{'loss': 0.0008, 'grad_norm': 0.022421902045607567, 'learning_rate': 3.2281879194630874e-05, 'epoch': 1.18}


 40%|███▉      | 315/795 [25:57<22:41,  2.84s/it]

{'loss': 0.0005, 'grad_norm': 0.01174350269138813, 'learning_rate': 3.221476510067114e-05, 'epoch': 1.19}


 40%|███▉      | 316/795 [26:00<23:03,  2.89s/it]

{'loss': 0.0008, 'grad_norm': 0.016932666301727295, 'learning_rate': 3.214765100671141e-05, 'epoch': 1.19}


 40%|███▉      | 317/795 [26:03<23:00,  2.89s/it]

{'loss': 0.0008, 'grad_norm': 0.02045455016195774, 'learning_rate': 3.208053691275168e-05, 'epoch': 1.2}


 40%|████      | 318/795 [26:06<22:45,  2.86s/it]

{'loss': 0.0008, 'grad_norm': 0.022691922262310982, 'learning_rate': 3.201342281879195e-05, 'epoch': 1.2}


 40%|████      | 319/795 [26:09<23:26,  2.96s/it]

{'loss': 0.0008, 'grad_norm': 0.02357790246605873, 'learning_rate': 3.194630872483222e-05, 'epoch': 1.2}


 40%|████      | 320/795 [26:12<23:06,  2.92s/it]

{'loss': 0.0007, 'grad_norm': 0.015496835112571716, 'learning_rate': 3.1879194630872485e-05, 'epoch': 1.21}


 40%|████      | 321/795 [26:15<23:44,  3.01s/it]

{'loss': 0.0008, 'grad_norm': 0.019665569067001343, 'learning_rate': 3.181208053691275e-05, 'epoch': 1.21}


 41%|████      | 322/795 [26:18<23:39,  3.00s/it]

{'loss': 0.0003, 'grad_norm': 0.009386356920003891, 'learning_rate': 3.174496644295302e-05, 'epoch': 1.22}


 41%|████      | 323/795 [26:21<23:26,  2.98s/it]

{'loss': 0.0006, 'grad_norm': 0.0136058758944273, 'learning_rate': 3.1677852348993294e-05, 'epoch': 1.22}


 41%|████      | 324/795 [26:24<23:13,  2.96s/it]

{'loss': 0.0005, 'grad_norm': 0.010886097326874733, 'learning_rate': 3.161073825503356e-05, 'epoch': 1.22}


 41%|████      | 325/795 [26:27<22:51,  2.92s/it]

{'loss': 0.0007, 'grad_norm': 0.015357965603470802, 'learning_rate': 3.1543624161073825e-05, 'epoch': 1.23}


 41%|████      | 326/795 [26:30<22:57,  2.94s/it]

{'loss': 1.1084, 'grad_norm': 203.968505859375, 'learning_rate': 3.1476510067114096e-05, 'epoch': 1.23}


 41%|████      | 327/795 [26:33<22:37,  2.90s/it]

{'loss': 0.0006, 'grad_norm': 0.012873231433331966, 'learning_rate': 3.140939597315437e-05, 'epoch': 1.23}


 41%|████▏     | 328/795 [26:35<22:22,  2.87s/it]

{'loss': 0.001, 'grad_norm': 0.08443956077098846, 'learning_rate': 3.1342281879194634e-05, 'epoch': 1.24}


 41%|████▏     | 329/795 [26:38<22:19,  2.87s/it]

{'loss': 0.0006, 'grad_norm': 0.013839595019817352, 'learning_rate': 3.12751677852349e-05, 'epoch': 1.24}


 42%|████▏     | 330/795 [26:41<21:42,  2.80s/it]

{'loss': 0.1568, 'grad_norm': 65.55098724365234, 'learning_rate': 3.120805369127517e-05, 'epoch': 1.25}


 42%|████▏     | 331/795 [26:44<22:10,  2.87s/it]

{'loss': 0.0003, 'grad_norm': 0.010322188027203083, 'learning_rate': 3.1140939597315436e-05, 'epoch': 1.25}


 42%|████▏     | 332/795 [26:47<22:27,  2.91s/it]

{'loss': 0.0266, 'grad_norm': 13.318544387817383, 'learning_rate': 3.107382550335571e-05, 'epoch': 1.25}


 42%|████▏     | 333/795 [26:50<22:12,  2.88s/it]

{'loss': 0.001, 'grad_norm': 0.022916339337825775, 'learning_rate': 3.100671140939597e-05, 'epoch': 1.26}


 42%|████▏     | 334/795 [26:53<22:13,  2.89s/it]

{'loss': 0.001, 'grad_norm': 0.023032451048493385, 'learning_rate': 3.0939597315436245e-05, 'epoch': 1.26}


 42%|████▏     | 335/795 [26:55<21:51,  2.85s/it]

{'loss': 0.0005, 'grad_norm': 0.01172315888106823, 'learning_rate': 3.087248322147651e-05, 'epoch': 1.26}


 42%|████▏     | 336/795 [26:58<21:36,  2.82s/it]

{'loss': 0.0003, 'grad_norm': 0.009961563162505627, 'learning_rate': 3.080536912751678e-05, 'epoch': 1.27}


 42%|████▏     | 337/795 [27:01<21:30,  2.82s/it]

{'loss': 0.0003, 'grad_norm': 0.009983089752495289, 'learning_rate': 3.0738255033557054e-05, 'epoch': 1.27}


 43%|████▎     | 338/795 [27:04<21:59,  2.89s/it]

{'loss': 0.0002, 'grad_norm': 0.0074307057075202465, 'learning_rate': 3.067114093959731e-05, 'epoch': 1.28}


 43%|████▎     | 339/795 [27:07<22:11,  2.92s/it]

{'loss': 0.0005, 'grad_norm': 0.034709323197603226, 'learning_rate': 3.0604026845637584e-05, 'epoch': 1.28}


 43%|████▎     | 340/795 [27:10<21:58,  2.90s/it]

{'loss': 0.0006, 'grad_norm': 0.012844597920775414, 'learning_rate': 3.0536912751677856e-05, 'epoch': 1.28}


 43%|████▎     | 341/795 [27:13<21:35,  2.85s/it]

{'loss': 1.7694, 'grad_norm': 223.67010498046875, 'learning_rate': 3.0469798657718124e-05, 'epoch': 1.29}


 43%|████▎     | 342/795 [27:15<21:24,  2.84s/it]

{'loss': 0.0007, 'grad_norm': 0.02623806521296501, 'learning_rate': 3.0402684563758393e-05, 'epoch': 1.29}


 43%|████▎     | 343/795 [27:18<21:36,  2.87s/it]

{'loss': 0.0429, 'grad_norm': 15.309081077575684, 'learning_rate': 3.0335570469798658e-05, 'epoch': 1.29}


 43%|████▎     | 344/795 [27:21<21:12,  2.82s/it]

{'loss': 0.0006, 'grad_norm': 0.019548337906599045, 'learning_rate': 3.0268456375838927e-05, 'epoch': 1.3}


 43%|████▎     | 345/795 [27:24<21:26,  2.86s/it]

{'loss': 0.0013, 'grad_norm': 0.035507459193468094, 'learning_rate': 3.02013422818792e-05, 'epoch': 1.3}


 44%|████▎     | 346/795 [27:27<21:25,  2.86s/it]

{'loss': 0.0003, 'grad_norm': 0.008515950292348862, 'learning_rate': 3.0134228187919467e-05, 'epoch': 1.31}


 44%|████▎     | 347/795 [27:30<21:31,  2.88s/it]

{'loss': 0.0004, 'grad_norm': 0.009001318365335464, 'learning_rate': 3.0067114093959732e-05, 'epoch': 1.31}


 44%|████▍     | 348/795 [27:33<22:08,  2.97s/it]

{'loss': 0.0007, 'grad_norm': 0.015501399524509907, 'learning_rate': 3e-05, 'epoch': 1.31}


 44%|████▍     | 349/795 [27:36<22:01,  2.96s/it]

{'loss': 0.0007, 'grad_norm': 0.018704187124967575, 'learning_rate': 2.993288590604027e-05, 'epoch': 1.32}


 44%|████▍     | 350/795 [27:39<21:45,  2.93s/it]

{'loss': 0.0005, 'grad_norm': 0.012408338487148285, 'learning_rate': 2.986577181208054e-05, 'epoch': 1.32}


 44%|████▍     | 351/795 [27:42<21:31,  2.91s/it]

{'loss': 0.0004, 'grad_norm': 0.009529477916657925, 'learning_rate': 2.979865771812081e-05, 'epoch': 1.32}


 44%|████▍     | 352/795 [27:45<21:24,  2.90s/it]

{'loss': 0.0008, 'grad_norm': 0.023029446601867676, 'learning_rate': 2.9731543624161075e-05, 'epoch': 1.33}


 44%|████▍     | 353/795 [27:48<21:46,  2.96s/it]

{'loss': 1.8761, 'grad_norm': 134.81591796875, 'learning_rate': 2.9664429530201343e-05, 'epoch': 1.33}


 45%|████▍     | 354/795 [27:50<21:21,  2.91s/it]

{'loss': 0.0007, 'grad_norm': 0.020820729434490204, 'learning_rate': 2.9597315436241612e-05, 'epoch': 1.34}


 45%|████▍     | 355/795 [27:54<21:42,  2.96s/it]

{'loss': 0.2344, 'grad_norm': 128.26747131347656, 'learning_rate': 2.9530201342281884e-05, 'epoch': 1.34}


 45%|████▍     | 356/795 [27:57<22:02,  3.01s/it]

{'loss': 0.0008, 'grad_norm': 0.023108940571546555, 'learning_rate': 2.9463087248322146e-05, 'epoch': 1.34}


 45%|████▍     | 357/795 [28:00<21:34,  2.95s/it]

{'loss': 0.0004, 'grad_norm': 0.012006346136331558, 'learning_rate': 2.9395973154362418e-05, 'epoch': 1.35}


 45%|████▌     | 358/795 [28:02<21:15,  2.92s/it]

{'loss': 0.0008, 'grad_norm': 0.02069883979856968, 'learning_rate': 2.9328859060402686e-05, 'epoch': 1.35}


 45%|████▌     | 359/795 [28:05<21:15,  2.92s/it]

{'loss': 0.0005, 'grad_norm': 0.01082760188728571, 'learning_rate': 2.9261744966442955e-05, 'epoch': 1.35}


 45%|████▌     | 360/795 [28:08<21:18,  2.94s/it]

{'loss': 0.0002, 'grad_norm': 0.007569981273263693, 'learning_rate': 2.9194630872483227e-05, 'epoch': 1.36}


 45%|████▌     | 361/795 [28:11<21:11,  2.93s/it]

{'loss': 0.0005, 'grad_norm': 0.011022127233445644, 'learning_rate': 2.9127516778523488e-05, 'epoch': 1.36}


 46%|████▌     | 362/795 [28:14<21:14,  2.94s/it]

{'loss': 0.0006, 'grad_norm': 0.018083548173308372, 'learning_rate': 2.906040268456376e-05, 'epoch': 1.37}


 46%|████▌     | 363/795 [28:17<21:00,  2.92s/it]

{'loss': 0.0004, 'grad_norm': 0.009815461002290249, 'learning_rate': 2.899328859060403e-05, 'epoch': 1.37}


 46%|████▌     | 364/795 [28:20<21:07,  2.94s/it]

{'loss': 0.0003, 'grad_norm': 0.008343350142240524, 'learning_rate': 2.8926174496644297e-05, 'epoch': 1.37}


 46%|████▌     | 365/795 [28:23<21:00,  2.93s/it]

{'loss': 0.0003, 'grad_norm': 0.008897147141397, 'learning_rate': 2.885906040268457e-05, 'epoch': 1.38}


 46%|████▌     | 366/795 [28:26<20:51,  2.92s/it]

{'loss': 0.001, 'grad_norm': 0.029303623363375664, 'learning_rate': 2.879194630872483e-05, 'epoch': 1.38}


 46%|████▌     | 367/795 [28:29<21:09,  2.97s/it]

{'loss': 0.0007, 'grad_norm': 0.021996911615133286, 'learning_rate': 2.8724832214765103e-05, 'epoch': 1.38}


 46%|████▋     | 368/795 [28:32<20:52,  2.93s/it]

{'loss': 0.0004, 'grad_norm': 0.009407155215740204, 'learning_rate': 2.865771812080537e-05, 'epoch': 1.39}


 46%|████▋     | 369/795 [28:35<20:57,  2.95s/it]

{'loss': 0.0008, 'grad_norm': 0.017648661509156227, 'learning_rate': 2.859060402684564e-05, 'epoch': 1.39}


 47%|████▋     | 370/795 [28:38<20:48,  2.94s/it]

{'loss': 0.0003, 'grad_norm': 0.007921891286969185, 'learning_rate': 2.8523489932885905e-05, 'epoch': 1.4}


 47%|████▋     | 371/795 [28:41<20:48,  2.94s/it]

{'loss': 3.5288, 'grad_norm': 58.42949295043945, 'learning_rate': 2.8456375838926174e-05, 'epoch': 1.4}


 47%|████▋     | 372/795 [28:43<20:21,  2.89s/it]

{'loss': 0.0006, 'grad_norm': 0.01301155798137188, 'learning_rate': 2.8389261744966445e-05, 'epoch': 1.4}


 47%|████▋     | 373/795 [28:46<20:07,  2.86s/it]

{'loss': 0.0009, 'grad_norm': 0.044544149190187454, 'learning_rate': 2.8322147651006714e-05, 'epoch': 1.41}


 47%|████▋     | 374/795 [28:49<20:25,  2.91s/it]

{'loss': 0.0003, 'grad_norm': 0.008994899690151215, 'learning_rate': 2.8255033557046983e-05, 'epoch': 1.41}


 47%|████▋     | 375/795 [28:52<20:41,  2.96s/it]

{'loss': 0.0007, 'grad_norm': 0.02861872687935829, 'learning_rate': 2.8187919463087248e-05, 'epoch': 1.42}


 47%|████▋     | 376/795 [28:55<20:28,  2.93s/it]

{'loss': 0.0004, 'grad_norm': 0.012322966940701008, 'learning_rate': 2.8120805369127516e-05, 'epoch': 1.42}


 47%|████▋     | 377/795 [28:58<20:06,  2.89s/it]

{'loss': 0.0004, 'grad_norm': 0.009521745145320892, 'learning_rate': 2.8053691275167788e-05, 'epoch': 1.42}


 48%|████▊     | 378/795 [29:01<19:57,  2.87s/it]

{'loss': 0.0003, 'grad_norm': 0.0070209261029958725, 'learning_rate': 2.7986577181208057e-05, 'epoch': 1.43}


 48%|████▊     | 379/795 [29:04<19:57,  2.88s/it]

{'loss': 0.0003, 'grad_norm': 0.011248485185205936, 'learning_rate': 2.7919463087248322e-05, 'epoch': 1.43}


 48%|████▊     | 380/795 [29:06<19:23,  2.80s/it]

{'loss': 0.0003, 'grad_norm': 0.008328922092914581, 'learning_rate': 2.785234899328859e-05, 'epoch': 1.43}


 48%|████▊     | 381/795 [29:09<19:50,  2.88s/it]

{'loss': 0.0002, 'grad_norm': 0.008655951358377934, 'learning_rate': 2.778523489932886e-05, 'epoch': 1.44}


 48%|████▊     | 382/795 [29:12<20:04,  2.92s/it]

{'loss': 0.0009, 'grad_norm': 0.02825767919421196, 'learning_rate': 2.771812080536913e-05, 'epoch': 1.44}


 48%|████▊     | 383/795 [29:15<20:19,  2.96s/it]

{'loss': 0.0003, 'grad_norm': 0.007943914271891117, 'learning_rate': 2.76510067114094e-05, 'epoch': 1.45}


 48%|████▊     | 384/795 [29:18<20:01,  2.92s/it]

{'loss': 0.0004, 'grad_norm': 0.012262959964573383, 'learning_rate': 2.7583892617449664e-05, 'epoch': 1.45}


 48%|████▊     | 385/795 [29:21<19:35,  2.87s/it]

{'loss': 0.0004, 'grad_norm': 0.009959486313164234, 'learning_rate': 2.7516778523489933e-05, 'epoch': 1.45}


 49%|████▊     | 386/795 [29:24<19:35,  2.87s/it]

{'loss': 0.0004, 'grad_norm': 0.012035208754241467, 'learning_rate': 2.74496644295302e-05, 'epoch': 1.46}


 49%|████▊     | 387/795 [29:27<19:41,  2.90s/it]

{'loss': 0.0007, 'grad_norm': 0.02059505507349968, 'learning_rate': 2.7382550335570473e-05, 'epoch': 1.46}


 49%|████▉     | 388/795 [29:30<19:32,  2.88s/it]

{'loss': 0.0006, 'grad_norm': 0.017747966572642326, 'learning_rate': 2.7315436241610742e-05, 'epoch': 1.46}


 49%|████▉     | 389/795 [29:33<19:45,  2.92s/it]

{'loss': 0.0003, 'grad_norm': 0.012088117189705372, 'learning_rate': 2.7248322147651007e-05, 'epoch': 1.47}


 49%|████▉     | 390/795 [29:35<19:29,  2.89s/it]

{'loss': 0.0009, 'grad_norm': 0.025298837572336197, 'learning_rate': 2.7181208053691276e-05, 'epoch': 1.47}


 49%|████▉     | 391/795 [29:38<19:23,  2.88s/it]

{'loss': 0.0003, 'grad_norm': 0.007151258178055286, 'learning_rate': 2.7114093959731544e-05, 'epoch': 1.48}


 49%|████▉     | 392/795 [29:41<19:17,  2.87s/it]

{'loss': 0.0003, 'grad_norm': 0.009407141245901585, 'learning_rate': 2.7046979865771816e-05, 'epoch': 1.48}


 49%|████▉     | 393/795 [29:44<19:04,  2.85s/it]

{'loss': 0.0004, 'grad_norm': 0.011225403286516666, 'learning_rate': 2.6979865771812078e-05, 'epoch': 1.48}


 50%|████▉     | 394/795 [29:47<18:59,  2.84s/it]

{'loss': 0.0004, 'grad_norm': 0.012044033035635948, 'learning_rate': 2.691275167785235e-05, 'epoch': 1.49}


 50%|████▉     | 395/795 [29:50<19:17,  2.89s/it]

{'loss': 0.0004, 'grad_norm': 0.009348801337182522, 'learning_rate': 2.6845637583892618e-05, 'epoch': 1.49}


 50%|████▉     | 396/795 [29:53<19:12,  2.89s/it]

{'loss': 0.0005, 'grad_norm': 0.022653058171272278, 'learning_rate': 2.6778523489932887e-05, 'epoch': 1.49}


 50%|████▉     | 397/795 [29:55<18:51,  2.84s/it]

{'loss': 0.0005, 'grad_norm': 0.013781635090708733, 'learning_rate': 2.671140939597316e-05, 'epoch': 1.5}


 50%|█████     | 398/795 [29:58<19:01,  2.87s/it]

{'loss': 0.0005, 'grad_norm': 0.011995124630630016, 'learning_rate': 2.6644295302013424e-05, 'epoch': 1.5}


 50%|█████     | 399/795 [30:01<19:01,  2.88s/it]

{'loss': 0.0005, 'grad_norm': 0.01262214407324791, 'learning_rate': 2.6577181208053692e-05, 'epoch': 1.51}


 50%|█████     | 400/795 [30:04<18:50,  2.86s/it]

{'loss': 0.0003, 'grad_norm': 0.006820796988904476, 'learning_rate': 2.651006711409396e-05, 'epoch': 1.51}


 50%|█████     | 401/795 [30:07<18:35,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.014393273741006851, 'learning_rate': 2.6442953020134233e-05, 'epoch': 1.51}


 51%|█████     | 402/795 [30:10<18:16,  2.79s/it]

{'loss': 0.0007, 'grad_norm': 0.02922435849905014, 'learning_rate': 2.6375838926174495e-05, 'epoch': 1.52}


 51%|█████     | 403/795 [30:12<18:30,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.015048094093799591, 'learning_rate': 2.6308724832214767e-05, 'epoch': 1.52}


 51%|█████     | 404/795 [30:15<18:21,  2.82s/it]

{'loss': 0.0004, 'grad_norm': 0.008196981623768806, 'learning_rate': 2.6241610738255035e-05, 'epoch': 1.52}


 51%|█████     | 405/795 [30:18<18:07,  2.79s/it]

{'loss': 0.0005, 'grad_norm': 0.019000569358468056, 'learning_rate': 2.6174496644295304e-05, 'epoch': 1.53}


 51%|█████     | 406/795 [30:21<18:11,  2.81s/it]

{'loss': 0.0005, 'grad_norm': 0.014356476254761219, 'learning_rate': 2.6107382550335576e-05, 'epoch': 1.53}


 51%|█████     | 407/795 [30:24<18:17,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.010434694588184357, 'learning_rate': 2.6040268456375837e-05, 'epoch': 1.54}


 51%|█████▏    | 408/795 [30:27<18:16,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.013787421397864819, 'learning_rate': 2.597315436241611e-05, 'epoch': 1.54}


 51%|█████▏    | 409/795 [30:29<18:28,  2.87s/it]

{'loss': 0.0003, 'grad_norm': 0.008264429867267609, 'learning_rate': 2.5906040268456378e-05, 'epoch': 1.54}


 52%|█████▏    | 410/795 [30:33<18:53,  2.94s/it]

{'loss': 0.0002, 'grad_norm': 0.004866325296461582, 'learning_rate': 2.5838926174496646e-05, 'epoch': 1.55}


 52%|█████▏    | 411/795 [30:36<18:45,  2.93s/it]

{'loss': 0.0003, 'grad_norm': 0.00663686404004693, 'learning_rate': 2.5771812080536918e-05, 'epoch': 1.55}


 52%|█████▏    | 412/795 [30:38<18:06,  2.84s/it]

{'loss': 0.0003, 'grad_norm': 0.008969474583864212, 'learning_rate': 2.570469798657718e-05, 'epoch': 1.55}


 52%|█████▏    | 413/795 [30:41<17:59,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.01316034235060215, 'learning_rate': 2.5637583892617452e-05, 'epoch': 1.56}


 52%|█████▏    | 414/795 [30:44<18:11,  2.86s/it]

{'loss': 0.0009, 'grad_norm': 0.04487251117825508, 'learning_rate': 2.557046979865772e-05, 'epoch': 1.56}


 52%|█████▏    | 415/795 [30:47<17:59,  2.84s/it]

{'loss': 0.0005, 'grad_norm': 0.010440589860081673, 'learning_rate': 2.550335570469799e-05, 'epoch': 1.57}


 52%|█████▏    | 416/795 [30:49<17:51,  2.83s/it]

{'loss': 0.0005, 'grad_norm': 0.014950579032301903, 'learning_rate': 2.5436241610738254e-05, 'epoch': 1.57}


 52%|█████▏    | 417/795 [30:52<17:52,  2.84s/it]

{'loss': 0.0004, 'grad_norm': 0.011611049063503742, 'learning_rate': 2.5369127516778523e-05, 'epoch': 1.57}


 53%|█████▎    | 418/795 [30:55<17:45,  2.83s/it]

{'loss': 7.6234, 'grad_norm': 32.55048751831055, 'learning_rate': 2.5302013422818795e-05, 'epoch': 1.58}


 53%|█████▎    | 419/795 [30:58<17:55,  2.86s/it]

{'loss': 0.0007, 'grad_norm': 0.018993524834513664, 'learning_rate': 2.5234899328859063e-05, 'epoch': 1.58}


 53%|█████▎    | 420/795 [31:01<18:09,  2.91s/it]

{'loss': 0.0005, 'grad_norm': 0.01580188050866127, 'learning_rate': 2.516778523489933e-05, 'epoch': 1.58}


 53%|█████▎    | 421/795 [31:04<17:59,  2.89s/it]

{'loss': 0.0015, 'grad_norm': 0.09195868670940399, 'learning_rate': 2.5100671140939597e-05, 'epoch': 1.59}


 53%|█████▎    | 422/795 [31:07<17:47,  2.86s/it]

{'loss': 0.0009, 'grad_norm': 0.07373730838298798, 'learning_rate': 2.5033557046979865e-05, 'epoch': 1.59}


 53%|█████▎    | 423/795 [31:10<17:45,  2.86s/it]

{'loss': 0.0003, 'grad_norm': 0.010335531085729599, 'learning_rate': 2.4966442953020137e-05, 'epoch': 1.6}


 53%|█████▎    | 424/795 [31:13<17:58,  2.91s/it]

{'loss': 0.0006, 'grad_norm': 0.01523728296160698, 'learning_rate': 2.4899328859060402e-05, 'epoch': 1.6}


 53%|█████▎    | 425/795 [31:15<17:44,  2.88s/it]

{'loss': 0.001, 'grad_norm': 0.038589149713516235, 'learning_rate': 2.4832214765100674e-05, 'epoch': 1.6}


 54%|█████▎    | 426/795 [31:18<17:42,  2.88s/it]

{'loss': 0.0004, 'grad_norm': 0.011820158921182156, 'learning_rate': 2.476510067114094e-05, 'epoch': 1.61}


 54%|█████▎    | 427/795 [31:21<17:42,  2.89s/it]

{'loss': 0.0005, 'grad_norm': 0.013815294951200485, 'learning_rate': 2.4697986577181208e-05, 'epoch': 1.61}


 54%|█████▍    | 428/795 [31:24<17:34,  2.87s/it]

{'loss': 0.0004, 'grad_norm': 0.009154431521892548, 'learning_rate': 2.463087248322148e-05, 'epoch': 1.62}


 54%|█████▍    | 429/795 [31:27<17:37,  2.89s/it]

{'loss': 0.0003, 'grad_norm': 0.008091594092547894, 'learning_rate': 2.4563758389261745e-05, 'epoch': 1.62}


 54%|█████▍    | 430/795 [31:30<17:57,  2.95s/it]

{'loss': 0.0009, 'grad_norm': 0.029397279024124146, 'learning_rate': 2.4496644295302017e-05, 'epoch': 1.62}


 54%|█████▍    | 431/795 [31:33<17:40,  2.91s/it]

{'loss': 0.0004, 'grad_norm': 0.012957886792719364, 'learning_rate': 2.4429530201342282e-05, 'epoch': 1.63}


 54%|█████▍    | 432/795 [31:36<17:29,  2.89s/it]

{'loss': 0.0011, 'grad_norm': 0.030124276876449585, 'learning_rate': 2.436241610738255e-05, 'epoch': 1.63}


 54%|█████▍    | 433/795 [31:38<17:07,  2.84s/it]

{'loss': 0.0003, 'grad_norm': 0.009460339322686195, 'learning_rate': 2.429530201342282e-05, 'epoch': 1.63}


 55%|█████▍    | 434/795 [31:41<17:06,  2.84s/it]

{'loss': 0.0005, 'grad_norm': 0.01127777062356472, 'learning_rate': 2.4228187919463088e-05, 'epoch': 1.64}


 55%|█████▍    | 435/795 [31:44<16:55,  2.82s/it]

{'loss': 0.0002, 'grad_norm': 0.005370191764086485, 'learning_rate': 2.416107382550336e-05, 'epoch': 1.64}


 55%|█████▍    | 436/795 [31:47<17:04,  2.85s/it]

{'loss': 0.0003, 'grad_norm': 0.008355313912034035, 'learning_rate': 2.4093959731543625e-05, 'epoch': 1.65}


 55%|█████▍    | 437/795 [31:50<17:00,  2.85s/it]

{'loss': 0.0006, 'grad_norm': 0.015991250053048134, 'learning_rate': 2.4026845637583893e-05, 'epoch': 1.65}


 55%|█████▌    | 438/795 [31:53<16:45,  2.82s/it]

{'loss': 0.0007, 'grad_norm': 0.022596312686800957, 'learning_rate': 2.3959731543624162e-05, 'epoch': 1.65}


 55%|█████▌    | 439/795 [31:56<17:10,  2.89s/it]

{'loss': 2.7796, 'grad_norm': 106.79325103759766, 'learning_rate': 2.389261744966443e-05, 'epoch': 1.66}


 55%|█████▌    | 440/795 [31:59<17:20,  2.93s/it]

{'loss': 0.0005, 'grad_norm': 0.012422790750861168, 'learning_rate': 2.38255033557047e-05, 'epoch': 1.66}


 55%|█████▌    | 441/795 [32:02<17:12,  2.92s/it]

{'loss': 0.0006, 'grad_norm': 0.016621774062514305, 'learning_rate': 2.3758389261744967e-05, 'epoch': 1.66}


 56%|█████▌    | 442/795 [32:05<17:15,  2.93s/it]

{'loss': 0.0005, 'grad_norm': 0.012895398773252964, 'learning_rate': 2.3691275167785236e-05, 'epoch': 1.67}


 56%|█████▌    | 443/795 [32:08<17:25,  2.97s/it]

{'loss': 0.0025, 'grad_norm': 0.16806915402412415, 'learning_rate': 2.3624161073825504e-05, 'epoch': 1.67}


 56%|█████▌    | 444/795 [32:11<17:23,  2.97s/it]

{'loss': 0.0005, 'grad_norm': 0.010948040522634983, 'learning_rate': 2.3557046979865773e-05, 'epoch': 1.68}


 56%|█████▌    | 445/795 [32:14<17:26,  2.99s/it]

{'loss': 0.001, 'grad_norm': 0.03390062600374222, 'learning_rate': 2.348993288590604e-05, 'epoch': 1.68}


 56%|█████▌    | 446/795 [32:17<17:26,  3.00s/it]

{'loss': 0.0012, 'grad_norm': 0.04817795380949974, 'learning_rate': 2.342281879194631e-05, 'epoch': 1.68}


 56%|█████▌    | 447/795 [32:19<16:57,  2.92s/it]

{'loss': 0.0007, 'grad_norm': 0.019494658336043358, 'learning_rate': 2.335570469798658e-05, 'epoch': 1.69}


 56%|█████▋    | 448/795 [32:22<17:00,  2.94s/it]

{'loss': 0.0011, 'grad_norm': 0.12128185480833054, 'learning_rate': 2.3288590604026847e-05, 'epoch': 1.69}


 56%|█████▋    | 449/795 [32:25<16:50,  2.92s/it]

{'loss': 0.001, 'grad_norm': 0.027788622304797173, 'learning_rate': 2.3221476510067116e-05, 'epoch': 1.69}


 57%|█████▋    | 450/795 [32:28<17:12,  2.99s/it]

{'loss': 0.0003, 'grad_norm': 0.009251460433006287, 'learning_rate': 2.3154362416107384e-05, 'epoch': 1.7}


 57%|█████▋    | 451/795 [32:31<17:09,  2.99s/it]

{'loss': 0.0005, 'grad_norm': 0.015383764170110226, 'learning_rate': 2.3087248322147653e-05, 'epoch': 1.7}


 57%|█████▋    | 452/795 [32:34<16:59,  2.97s/it]

{'loss': 0.0003, 'grad_norm': 0.007671982049942017, 'learning_rate': 2.302013422818792e-05, 'epoch': 1.71}


 57%|█████▋    | 453/795 [32:37<16:49,  2.95s/it]

{'loss': 0.0098, 'grad_norm': 2.431840658187866, 'learning_rate': 2.295302013422819e-05, 'epoch': 1.71}


 57%|█████▋    | 454/795 [32:40<17:03,  3.00s/it]

{'loss': 0.0006, 'grad_norm': 0.01544857956469059, 'learning_rate': 2.2885906040268458e-05, 'epoch': 1.71}


 57%|█████▋    | 455/795 [32:43<17:03,  3.01s/it]

{'loss': 0.0006, 'grad_norm': 0.017189860343933105, 'learning_rate': 2.2818791946308727e-05, 'epoch': 1.72}


 57%|█████▋    | 456/795 [32:46<17:00,  3.01s/it]

{'loss': 0.0005, 'grad_norm': 0.018174558877944946, 'learning_rate': 2.2751677852348992e-05, 'epoch': 1.72}


 57%|█████▋    | 457/795 [32:49<16:46,  2.98s/it]

{'loss': 0.0008, 'grad_norm': 0.03079187497496605, 'learning_rate': 2.2684563758389264e-05, 'epoch': 1.72}


 58%|█████▊    | 458/795 [32:52<16:46,  2.99s/it]

{'loss': 0.0002, 'grad_norm': 0.00617193104699254, 'learning_rate': 2.2617449664429532e-05, 'epoch': 1.73}


 58%|█████▊    | 459/795 [32:55<16:39,  2.97s/it]

{'loss': 0.0014, 'grad_norm': 0.05740836262702942, 'learning_rate': 2.25503355704698e-05, 'epoch': 1.73}


 58%|█████▊    | 460/795 [32:58<16:37,  2.98s/it]

{'loss': 0.0004, 'grad_norm': 0.00902376789599657, 'learning_rate': 2.248322147651007e-05, 'epoch': 1.74}


 58%|█████▊    | 461/795 [33:01<16:21,  2.94s/it]

{'loss': 0.0006, 'grad_norm': 0.013821136206388474, 'learning_rate': 2.2416107382550335e-05, 'epoch': 1.74}


 58%|█████▊    | 462/795 [33:04<16:17,  2.94s/it]

{'loss': 0.001, 'grad_norm': 0.027141021564602852, 'learning_rate': 2.2348993288590606e-05, 'epoch': 1.74}


 58%|█████▊    | 463/795 [33:07<15:50,  2.86s/it]

{'loss': 0.0005, 'grad_norm': 0.010910139419138432, 'learning_rate': 2.228187919463087e-05, 'epoch': 1.75}


 58%|█████▊    | 464/795 [33:10<16:00,  2.90s/it]

{'loss': 0.0009, 'grad_norm': 0.05080043524503708, 'learning_rate': 2.2214765100671144e-05, 'epoch': 1.75}


 58%|█████▊    | 465/795 [33:12<15:46,  2.87s/it]

{'loss': 0.0011, 'grad_norm': 0.16236422955989838, 'learning_rate': 2.2147651006711412e-05, 'epoch': 1.75}


 59%|█████▊    | 466/795 [33:15<15:49,  2.89s/it]

{'loss': 0.0013, 'grad_norm': 0.05417965352535248, 'learning_rate': 2.2080536912751677e-05, 'epoch': 1.76}


 59%|█████▊    | 467/795 [33:18<15:30,  2.84s/it]

{'loss': 0.0004, 'grad_norm': 0.010337213054299355, 'learning_rate': 2.201342281879195e-05, 'epoch': 1.76}


 59%|█████▉    | 468/795 [33:21<15:33,  2.85s/it]

{'loss': 0.0005, 'grad_norm': 0.016049783676862717, 'learning_rate': 2.1946308724832214e-05, 'epoch': 1.77}


 59%|█████▉    | 469/795 [33:24<15:51,  2.92s/it]

{'loss': 2.4136, 'grad_norm': 56.61780548095703, 'learning_rate': 2.1879194630872486e-05, 'epoch': 1.77}


 59%|█████▉    | 470/795 [33:27<15:50,  2.92s/it]

{'loss': 0.0003, 'grad_norm': 0.010043379850685596, 'learning_rate': 2.181208053691275e-05, 'epoch': 1.77}


 59%|█████▉    | 471/795 [33:30<15:42,  2.91s/it]

{'loss': 0.0007, 'grad_norm': 0.020894398912787437, 'learning_rate': 2.174496644295302e-05, 'epoch': 1.78}


 59%|█████▉    | 472/795 [33:33<15:35,  2.90s/it]

{'loss': 0.0014, 'grad_norm': 0.0693601593375206, 'learning_rate': 2.167785234899329e-05, 'epoch': 1.78}


 59%|█████▉    | 473/795 [33:36<15:30,  2.89s/it]

{'loss': 0.0003, 'grad_norm': 0.0086594233289361, 'learning_rate': 2.1610738255033557e-05, 'epoch': 1.78}


 60%|█████▉    | 474/795 [33:39<15:33,  2.91s/it]

{'loss': 0.0003, 'grad_norm': 0.008222589269280434, 'learning_rate': 2.154362416107383e-05, 'epoch': 1.79}


 60%|█████▉    | 475/795 [33:41<15:13,  2.86s/it]

{'loss': 0.0004, 'grad_norm': 0.010786586441099644, 'learning_rate': 2.1476510067114094e-05, 'epoch': 1.79}


 60%|█████▉    | 476/795 [33:44<15:33,  2.93s/it]

{'loss': 0.0004, 'grad_norm': 0.012386666610836983, 'learning_rate': 2.1409395973154362e-05, 'epoch': 1.8}


 60%|██████    | 477/795 [33:47<15:02,  2.84s/it]

{'loss': 0.0003, 'grad_norm': 0.008740241639316082, 'learning_rate': 2.134228187919463e-05, 'epoch': 1.8}


 60%|██████    | 478/795 [33:50<14:41,  2.78s/it]

{'loss': 0.0006, 'grad_norm': 0.022419514134526253, 'learning_rate': 2.12751677852349e-05, 'epoch': 1.8}


 60%|██████    | 479/795 [33:52<14:25,  2.74s/it]

{'loss': 0.0002, 'grad_norm': 0.006094209384173155, 'learning_rate': 2.1208053691275168e-05, 'epoch': 1.81}


 60%|██████    | 480/795 [33:55<14:30,  2.76s/it]

{'loss': 0.0006, 'grad_norm': 0.021148184314370155, 'learning_rate': 2.1140939597315437e-05, 'epoch': 1.81}


 61%|██████    | 481/795 [33:58<14:09,  2.71s/it]

{'loss': 0.0001, 'grad_norm': 0.003717039944604039, 'learning_rate': 2.107382550335571e-05, 'epoch': 1.82}


 61%|██████    | 482/795 [34:00<14:13,  2.73s/it]

{'loss': 0.0008, 'grad_norm': 0.029273563995957375, 'learning_rate': 2.1006711409395974e-05, 'epoch': 1.82}


 61%|██████    | 483/795 [34:03<14:12,  2.73s/it]

{'loss': 0.0009, 'grad_norm': 0.04166419804096222, 'learning_rate': 2.0939597315436242e-05, 'epoch': 1.82}


 61%|██████    | 484/795 [34:06<14:15,  2.75s/it]

{'loss': 0.0277, 'grad_norm': 6.7515153884887695, 'learning_rate': 2.087248322147651e-05, 'epoch': 1.83}


 61%|██████    | 485/795 [34:09<14:18,  2.77s/it]

{'loss': 0.0004, 'grad_norm': 0.011572974734008312, 'learning_rate': 2.080536912751678e-05, 'epoch': 1.83}


 61%|██████    | 486/795 [34:12<14:13,  2.76s/it]

{'loss': 0.0007, 'grad_norm': 0.04696943610906601, 'learning_rate': 2.0738255033557048e-05, 'epoch': 1.83}


 61%|██████▏   | 487/795 [34:14<14:19,  2.79s/it]

{'loss': 0.0004, 'grad_norm': 0.011744293384253979, 'learning_rate': 2.0671140939597316e-05, 'epoch': 1.84}


 61%|██████▏   | 488/795 [34:17<14:08,  2.76s/it]

{'loss': 0.0002, 'grad_norm': 0.007394977379590273, 'learning_rate': 2.0604026845637585e-05, 'epoch': 1.84}


 62%|██████▏   | 489/795 [34:20<14:12,  2.79s/it]

{'loss': 0.0064, 'grad_norm': 1.8400135040283203, 'learning_rate': 2.0536912751677853e-05, 'epoch': 1.85}


 62%|██████▏   | 490/795 [34:23<13:59,  2.75s/it]

{'loss': 0.0004, 'grad_norm': 0.01059960201382637, 'learning_rate': 2.0469798657718122e-05, 'epoch': 1.85}


 62%|██████▏   | 491/795 [34:25<13:45,  2.72s/it]

{'loss': 0.0003, 'grad_norm': 0.007388430647552013, 'learning_rate': 2.040268456375839e-05, 'epoch': 1.85}


 62%|██████▏   | 492/795 [34:28<13:52,  2.75s/it]

{'loss': 0.0004, 'grad_norm': 0.008848238736391068, 'learning_rate': 2.033557046979866e-05, 'epoch': 1.86}


 62%|██████▏   | 493/795 [34:31<13:34,  2.70s/it]

{'loss': 0.0004, 'grad_norm': 0.008517556823790073, 'learning_rate': 2.0268456375838928e-05, 'epoch': 1.86}


 62%|██████▏   | 494/795 [34:33<13:23,  2.67s/it]

{'loss': 0.0002, 'grad_norm': 0.00921550951898098, 'learning_rate': 2.0201342281879196e-05, 'epoch': 1.86}


 62%|██████▏   | 495/795 [34:36<13:18,  2.66s/it]

{'loss': 0.0058, 'grad_norm': 2.025858163833618, 'learning_rate': 2.013422818791946e-05, 'epoch': 1.87}


 62%|██████▏   | 496/795 [34:38<12:54,  2.59s/it]

{'loss': 1.1565, 'grad_norm': 267.77264404296875, 'learning_rate': 2.0067114093959733e-05, 'epoch': 1.87}


 63%|██████▎   | 497/795 [34:41<13:05,  2.64s/it]

{'loss': 0.0005, 'grad_norm': 0.014650718308985233, 'learning_rate': 2e-05, 'epoch': 1.88}


 63%|██████▎   | 498/795 [34:44<13:05,  2.64s/it]

{'loss': 0.0005, 'grad_norm': 0.016560347750782967, 'learning_rate': 1.993288590604027e-05, 'epoch': 1.88}


 63%|██████▎   | 499/795 [34:46<12:58,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.00912328902631998, 'learning_rate': 1.986577181208054e-05, 'epoch': 1.88}


 63%|██████▎   | 500/795 [34:49<13:01,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.006439743563532829, 'learning_rate': 1.9798657718120804e-05, 'epoch': 1.89}


 63%|██████▎   | 501/795 [34:52<12:52,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.008603288792073727, 'learning_rate': 1.9731543624161076e-05, 'epoch': 1.89}


 63%|██████▎   | 502/795 [34:54<12:57,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.012473923154175282, 'learning_rate': 1.966442953020134e-05, 'epoch': 1.89}


 63%|██████▎   | 503/795 [34:57<12:54,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.007778317201882601, 'learning_rate': 1.9597315436241613e-05, 'epoch': 1.9}


 63%|██████▎   | 504/795 [35:00<12:50,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.012801321223378181, 'learning_rate': 1.953020134228188e-05, 'epoch': 1.9}


 64%|██████▎   | 505/795 [35:02<12:47,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.006304887589067221, 'learning_rate': 1.946308724832215e-05, 'epoch': 1.91}


 64%|██████▎   | 506/795 [35:05<12:33,  2.61s/it]

{'loss': 0.0002, 'grad_norm': 0.006043308414518833, 'learning_rate': 1.939597315436242e-05, 'epoch': 1.91}


 64%|██████▍   | 507/795 [35:07<12:26,  2.59s/it]

{'loss': 0.0004, 'grad_norm': 0.011039982549846172, 'learning_rate': 1.9328859060402684e-05, 'epoch': 1.91}


 64%|██████▍   | 508/795 [35:10<12:36,  2.64s/it]

{'loss': 0.0004, 'grad_norm': 0.008609754964709282, 'learning_rate': 1.9261744966442955e-05, 'epoch': 1.92}


 64%|██████▍   | 509/795 [35:13<12:49,  2.69s/it]

{'loss': 3.1995, 'grad_norm': 29.500120162963867, 'learning_rate': 1.919463087248322e-05, 'epoch': 1.92}


 64%|██████▍   | 510/795 [35:15<12:35,  2.65s/it]

{'loss': 0.0009, 'grad_norm': 0.09980811923742294, 'learning_rate': 1.9127516778523493e-05, 'epoch': 1.92}


 64%|██████▍   | 511/795 [35:18<12:23,  2.62s/it]

{'loss': 0.0003, 'grad_norm': 0.009619567543268204, 'learning_rate': 1.906040268456376e-05, 'epoch': 1.93}


 64%|██████▍   | 512/795 [35:21<12:26,  2.64s/it]

{'loss': 0.0004, 'grad_norm': 0.0236861202865839, 'learning_rate': 1.8993288590604026e-05, 'epoch': 1.93}


 65%|██████▍   | 513/795 [35:23<12:20,  2.63s/it]

{'loss': 0.0006, 'grad_norm': 0.013735225424170494, 'learning_rate': 1.8926174496644298e-05, 'epoch': 1.94}


 65%|██████▍   | 514/795 [35:26<12:15,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.005306712817400694, 'learning_rate': 1.8859060402684563e-05, 'epoch': 1.94}


 65%|██████▍   | 515/795 [35:28<12:12,  2.62s/it]

{'loss': 0.001, 'grad_norm': 0.0269260723143816, 'learning_rate': 1.8791946308724835e-05, 'epoch': 1.94}


 65%|██████▍   | 516/795 [35:31<12:00,  2.58s/it]

{'loss': 0.0005, 'grad_norm': 0.013479376211762428, 'learning_rate': 1.87248322147651e-05, 'epoch': 1.95}


 65%|██████▌   | 517/795 [35:33<11:48,  2.55s/it]

{'loss': 0.0089, 'grad_norm': 3.750173807144165, 'learning_rate': 1.865771812080537e-05, 'epoch': 1.95}


 65%|██████▌   | 518/795 [35:36<11:55,  2.58s/it]

{'loss': 0.0004, 'grad_norm': 0.010613715276122093, 'learning_rate': 1.8590604026845637e-05, 'epoch': 1.95}


 65%|██████▌   | 519/795 [35:39<11:54,  2.59s/it]

{'loss': 0.0003, 'grad_norm': 0.008456528186798096, 'learning_rate': 1.8523489932885906e-05, 'epoch': 1.96}


 65%|██████▌   | 520/795 [35:41<11:51,  2.59s/it]

{'loss': 0.0005, 'grad_norm': 0.015409854240715504, 'learning_rate': 1.8456375838926178e-05, 'epoch': 1.96}


 66%|██████▌   | 521/795 [35:44<12:08,  2.66s/it]

{'loss': 0.0001, 'grad_norm': 0.004630445037037134, 'learning_rate': 1.8389261744966443e-05, 'epoch': 1.97}


 66%|██████▌   | 522/795 [35:47<11:55,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.0094553017988801, 'learning_rate': 1.832214765100671e-05, 'epoch': 1.97}


 66%|██████▌   | 523/795 [35:49<11:49,  2.61s/it]

{'loss': 0.0022, 'grad_norm': 0.4290948212146759, 'learning_rate': 1.825503355704698e-05, 'epoch': 1.97}


 66%|██████▌   | 524/795 [35:52<11:51,  2.63s/it]

{'loss': 0.0005, 'grad_norm': 0.012619487941265106, 'learning_rate': 1.818791946308725e-05, 'epoch': 1.98}


 66%|██████▌   | 525/795 [35:55<11:49,  2.63s/it]

{'loss': 0.0039, 'grad_norm': 0.3709055483341217, 'learning_rate': 1.8120805369127517e-05, 'epoch': 1.98}


 66%|██████▌   | 526/795 [35:57<11:56,  2.66s/it]

{'loss': 0.0006, 'grad_norm': 0.08693284541368484, 'learning_rate': 1.8053691275167786e-05, 'epoch': 1.98}


 66%|██████▋   | 527/795 [36:00<12:00,  2.69s/it]

{'loss': 0.0004, 'grad_norm': 0.010747131891548634, 'learning_rate': 1.7986577181208054e-05, 'epoch': 1.99}


 66%|██████▋   | 528/795 [36:03<11:53,  2.67s/it]

{'loss': 0.0008, 'grad_norm': 0.039835747331380844, 'learning_rate': 1.7919463087248323e-05, 'epoch': 1.99}


 67%|██████▋   | 529/795 [36:05<11:53,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.009755867533385754, 'learning_rate': 1.785234899328859e-05, 'epoch': 2.0}


 67%|██████▋   | 530/795 [36:07<10:24,  2.36s/it]

{'loss': 0.0002, 'grad_norm': 0.0067032091319561005, 'learning_rate': 1.778523489932886e-05, 'epoch': 2.0}


                                                 
 67%|██████▋   | 530/795 [46:42<10:24,  2.36s/it]

{'eval_loss': 0.19468733668327332, 'eval_accuracy': 0.9619741100323624, 'eval_runtime': 634.9486, 'eval_samples_per_second': 1.947, 'eval_steps_per_second': 0.973, 'epoch': 2.0}


 67%|██████▋   | 531/795 [46:48<14:12:47, 193.82s/it]

{'loss': 0.0003, 'grad_norm': 0.007783535402268171, 'learning_rate': 1.7718120805369128e-05, 'epoch': 2.0}


 67%|██████▋   | 532/795 [46:50<9:58:15, 136.48s/it] 

{'loss': 0.0004, 'grad_norm': 0.009561401791870594, 'learning_rate': 1.7651006711409397e-05, 'epoch': 2.01}


 67%|██████▋   | 533/795 [46:53<7:00:37, 96.33s/it] 

{'loss': 0.0001, 'grad_norm': 0.004413078539073467, 'learning_rate': 1.7583892617449665e-05, 'epoch': 2.01}


 67%|██████▋   | 534/795 [46:55<4:56:41, 68.21s/it]

{'loss': 0.0003, 'grad_norm': 0.009861391969025135, 'learning_rate': 1.7516778523489934e-05, 'epoch': 2.02}


 67%|██████▋   | 535/795 [46:58<3:30:19, 48.54s/it]

{'loss': 0.0004, 'grad_norm': 0.012137334793806076, 'learning_rate': 1.7449664429530202e-05, 'epoch': 2.02}


 67%|██████▋   | 536/795 [47:01<2:30:03, 34.76s/it]

{'loss': 0.0003, 'grad_norm': 0.009697365574538708, 'learning_rate': 1.738255033557047e-05, 'epoch': 2.02}


 68%|██████▊   | 537/795 [47:03<1:48:12, 25.17s/it]

{'loss': 0.0001, 'grad_norm': 0.004287198651582003, 'learning_rate': 1.731543624161074e-05, 'epoch': 2.03}


 68%|██████▊   | 538/795 [47:06<1:18:43, 18.38s/it]

{'loss': 0.0005, 'grad_norm': 0.012415130622684956, 'learning_rate': 1.7248322147651008e-05, 'epoch': 2.03}


 68%|██████▊   | 539/795 [47:09<58:09, 13.63s/it]  

{'loss': 0.0002, 'grad_norm': 0.0051464056596159935, 'learning_rate': 1.7181208053691277e-05, 'epoch': 2.03}


 68%|██████▊   | 540/795 [47:11<43:58, 10.35s/it]

{'loss': 0.0003, 'grad_norm': 0.009757443331182003, 'learning_rate': 1.7114093959731545e-05, 'epoch': 2.04}


 68%|██████▊   | 541/795 [47:14<34:09,  8.07s/it]

{'loss': 0.0009, 'grad_norm': 0.04178129881620407, 'learning_rate': 1.704697986577181e-05, 'epoch': 2.04}


 68%|██████▊   | 542/795 [47:17<27:06,  6.43s/it]

{'loss': 0.0007, 'grad_norm': 0.018182558938860893, 'learning_rate': 1.6979865771812082e-05, 'epoch': 2.05}


 68%|██████▊   | 543/795 [47:19<22:06,  5.26s/it]

{'loss': 0.0002, 'grad_norm': 0.007494211662560701, 'learning_rate': 1.691275167785235e-05, 'epoch': 2.05}


 68%|██████▊   | 544/795 [47:22<18:44,  4.48s/it]

{'loss': 0.0006, 'grad_norm': 0.03483275696635246, 'learning_rate': 1.684563758389262e-05, 'epoch': 2.05}


 69%|██████▊   | 545/795 [47:24<16:24,  3.94s/it]

{'loss': 0.0004, 'grad_norm': 0.00972359161823988, 'learning_rate': 1.6778523489932888e-05, 'epoch': 2.06}


 69%|██████▊   | 546/795 [47:27<14:48,  3.57s/it]

{'loss': 0.0002, 'grad_norm': 0.006117410492151976, 'learning_rate': 1.6711409395973153e-05, 'epoch': 2.06}


 69%|██████▉   | 547/795 [47:30<13:34,  3.28s/it]

{'loss': 0.0002, 'grad_norm': 0.004933098331093788, 'learning_rate': 1.6644295302013425e-05, 'epoch': 2.06}


 69%|██████▉   | 548/795 [47:32<12:37,  3.07s/it]

{'loss': 0.0003, 'grad_norm': 0.007416863460093737, 'learning_rate': 1.657718120805369e-05, 'epoch': 2.07}


 69%|██████▉   | 549/795 [47:35<11:59,  2.92s/it]

{'loss': 0.0003, 'grad_norm': 0.0060793496668338776, 'learning_rate': 1.6510067114093962e-05, 'epoch': 2.07}


 69%|██████▉   | 550/795 [47:38<11:33,  2.83s/it]

{'loss': 0.0004, 'grad_norm': 0.015607479959726334, 'learning_rate': 1.644295302013423e-05, 'epoch': 2.08}


 69%|██████▉   | 551/795 [47:40<11:18,  2.78s/it]

{'loss': 0.0006, 'grad_norm': 0.023462221026420593, 'learning_rate': 1.6375838926174496e-05, 'epoch': 2.08}


 69%|██████▉   | 552/795 [47:43<11:03,  2.73s/it]

{'loss': 0.0004, 'grad_norm': 0.011692352592945099, 'learning_rate': 1.6308724832214767e-05, 'epoch': 2.08}


 70%|██████▉   | 553/795 [47:46<10:58,  2.72s/it]

{'loss': 0.0005, 'grad_norm': 0.014276201836764812, 'learning_rate': 1.6241610738255033e-05, 'epoch': 2.09}


 70%|██████▉   | 554/795 [47:48<10:50,  2.70s/it]

{'loss': 0.0002, 'grad_norm': 0.006590588018298149, 'learning_rate': 1.6174496644295304e-05, 'epoch': 2.09}


 70%|██████▉   | 555/795 [47:51<10:43,  2.68s/it]

{'loss': 0.0002, 'grad_norm': 0.007719963788986206, 'learning_rate': 1.610738255033557e-05, 'epoch': 2.09}


 70%|██████▉   | 556/795 [47:53<10:32,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.009025480598211288, 'learning_rate': 1.604026845637584e-05, 'epoch': 2.1}


 70%|███████   | 557/795 [47:56<10:37,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.009405692107975483, 'learning_rate': 1.597315436241611e-05, 'epoch': 2.1}


 70%|███████   | 558/795 [47:59<10:38,  2.70s/it]

{'loss': 0.0002, 'grad_norm': 0.0074730683118104935, 'learning_rate': 1.5906040268456375e-05, 'epoch': 2.11}


 70%|███████   | 559/795 [48:02<10:30,  2.67s/it]

{'loss': 0.0003, 'grad_norm': 0.00788275245577097, 'learning_rate': 1.5838926174496647e-05, 'epoch': 2.11}


 70%|███████   | 560/795 [48:04<10:18,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.005504175554960966, 'learning_rate': 1.5771812080536912e-05, 'epoch': 2.11}


 71%|███████   | 561/795 [48:07<10:13,  2.62s/it]

{'loss': 0.0005, 'grad_norm': 0.015158488415181637, 'learning_rate': 1.5704697986577184e-05, 'epoch': 2.12}


 71%|███████   | 562/795 [48:09<10:18,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.0052150520496070385, 'learning_rate': 1.563758389261745e-05, 'epoch': 2.12}


 71%|███████   | 563/795 [48:12<10:16,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.006073061376810074, 'learning_rate': 1.5570469798657718e-05, 'epoch': 2.12}


 71%|███████   | 564/795 [48:15<10:17,  2.67s/it]

{'loss': 0.0003, 'grad_norm': 0.008271525613963604, 'learning_rate': 1.5503355704697986e-05, 'epoch': 2.13}


 71%|███████   | 565/795 [48:17<10:15,  2.67s/it]

{'loss': 0.0004, 'grad_norm': 0.012641491368412971, 'learning_rate': 1.5436241610738255e-05, 'epoch': 2.13}


 71%|███████   | 566/795 [48:20<10:17,  2.69s/it]

{'loss': 0.0002, 'grad_norm': 0.005112863145768642, 'learning_rate': 1.5369127516778527e-05, 'epoch': 2.14}


 71%|███████▏  | 567/795 [48:23<10:23,  2.74s/it]

{'loss': 0.0004, 'grad_norm': 0.009344140999019146, 'learning_rate': 1.5302013422818792e-05, 'epoch': 2.14}


 71%|███████▏  | 568/795 [48:26<10:15,  2.71s/it]

{'loss': 0.0002, 'grad_norm': 0.0053508467972278595, 'learning_rate': 1.5234899328859062e-05, 'epoch': 2.14}


 72%|███████▏  | 569/795 [48:28<10:05,  2.68s/it]

{'loss': 0.0002, 'grad_norm': 0.005584104917943478, 'learning_rate': 1.5167785234899329e-05, 'epoch': 2.15}


 72%|███████▏  | 570/795 [48:31<10:03,  2.68s/it]

{'loss': 0.0006, 'grad_norm': 0.01622874289751053, 'learning_rate': 1.51006711409396e-05, 'epoch': 2.15}


 72%|███████▏  | 571/795 [48:34<09:58,  2.67s/it]

{'loss': 0.0003, 'grad_norm': 0.0074205645360052586, 'learning_rate': 1.5033557046979866e-05, 'epoch': 2.15}


 72%|███████▏  | 572/795 [48:36<09:48,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.0049379561096429825, 'learning_rate': 1.4966442953020135e-05, 'epoch': 2.16}


 72%|███████▏  | 573/795 [48:39<09:43,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.004183546639978886, 'learning_rate': 1.4899328859060405e-05, 'epoch': 2.16}


 72%|███████▏  | 574/795 [48:41<09:43,  2.64s/it]

{'loss': 0.0001, 'grad_norm': 0.00454983115196228, 'learning_rate': 1.4832214765100672e-05, 'epoch': 2.17}


 72%|███████▏  | 575/795 [48:44<09:38,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.0070951334200799465, 'learning_rate': 1.4765100671140942e-05, 'epoch': 2.17}


 72%|███████▏  | 576/795 [48:47<09:31,  2.61s/it]

{'loss': 3.4132, 'grad_norm': 24.767107009887695, 'learning_rate': 1.4697986577181209e-05, 'epoch': 2.17}


 73%|███████▎  | 577/795 [48:49<09:33,  2.63s/it]

{'loss': 0.0001, 'grad_norm': 0.0035904450342059135, 'learning_rate': 1.4630872483221477e-05, 'epoch': 2.18}


 73%|███████▎  | 578/795 [48:52<09:35,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.009835131466388702, 'learning_rate': 1.4563758389261744e-05, 'epoch': 2.18}


 73%|███████▎  | 579/795 [48:55<09:33,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.013044269755482674, 'learning_rate': 1.4496644295302014e-05, 'epoch': 2.18}


 73%|███████▎  | 580/795 [48:57<09:23,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.009075458161532879, 'learning_rate': 1.4429530201342285e-05, 'epoch': 2.19}


 73%|███████▎  | 581/795 [49:00<09:29,  2.66s/it]

{'loss': 0.0005, 'grad_norm': 0.015316864475607872, 'learning_rate': 1.4362416107382551e-05, 'epoch': 2.19}


 73%|███████▎  | 582/795 [49:03<09:31,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.006562734488397837, 'learning_rate': 1.429530201342282e-05, 'epoch': 2.2}


 73%|███████▎  | 583/795 [49:05<09:22,  2.65s/it]

{'loss': 0.0005, 'grad_norm': 0.013430492952466011, 'learning_rate': 1.4228187919463087e-05, 'epoch': 2.2}


 73%|███████▎  | 584/795 [49:08<09:24,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.007637522649019957, 'learning_rate': 1.4161073825503357e-05, 'epoch': 2.2}


 74%|███████▎  | 585/795 [49:11<09:20,  2.67s/it]

{'loss': 0.0004, 'grad_norm': 0.01162825059145689, 'learning_rate': 1.4093959731543624e-05, 'epoch': 2.21}


 74%|███████▎  | 586/795 [49:13<09:14,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.008261634968221188, 'learning_rate': 1.4026845637583894e-05, 'epoch': 2.21}


 74%|███████▍  | 587/795 [49:16<09:13,  2.66s/it]

{'loss': 0.0006, 'grad_norm': 0.016807951033115387, 'learning_rate': 1.3959731543624161e-05, 'epoch': 2.22}


 74%|███████▍  | 588/795 [49:19<09:08,  2.65s/it]

{'loss': 0.0005, 'grad_norm': 0.011800372041761875, 'learning_rate': 1.389261744966443e-05, 'epoch': 2.22}


 74%|███████▍  | 589/795 [49:21<09:01,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.007824947126209736, 'learning_rate': 1.38255033557047e-05, 'epoch': 2.22}


 74%|███████▍  | 590/795 [49:24<08:50,  2.59s/it]

{'loss': 0.0004, 'grad_norm': 0.009352896362543106, 'learning_rate': 1.3758389261744966e-05, 'epoch': 2.23}


 74%|███████▍  | 591/795 [49:26<08:41,  2.56s/it]

{'loss': 0.0004, 'grad_norm': 0.010494095273315907, 'learning_rate': 1.3691275167785237e-05, 'epoch': 2.23}


 74%|███████▍  | 592/795 [49:29<08:42,  2.58s/it]

{'loss': 0.0001, 'grad_norm': 0.005325828678905964, 'learning_rate': 1.3624161073825504e-05, 'epoch': 2.23}


 75%|███████▍  | 593/795 [49:31<08:34,  2.55s/it]

{'loss': 0.0005, 'grad_norm': 0.015593101270496845, 'learning_rate': 1.3557046979865772e-05, 'epoch': 2.24}


 75%|███████▍  | 594/795 [49:34<08:29,  2.53s/it]

{'loss': 0.0006, 'grad_norm': 0.01683037169277668, 'learning_rate': 1.3489932885906039e-05, 'epoch': 2.24}


 75%|███████▍  | 595/795 [49:36<08:22,  2.51s/it]

{'loss': 0.0007, 'grad_norm': 0.01978354901075363, 'learning_rate': 1.3422818791946309e-05, 'epoch': 2.25}


 75%|███████▍  | 596/795 [49:39<08:31,  2.57s/it]

{'loss': 0.0004, 'grad_norm': 0.011350728571414948, 'learning_rate': 1.335570469798658e-05, 'epoch': 2.25}


 75%|███████▌  | 597/795 [49:41<08:21,  2.53s/it]

{'loss': 0.0004, 'grad_norm': 0.01129324734210968, 'learning_rate': 1.3288590604026846e-05, 'epoch': 2.25}


 75%|███████▌  | 598/795 [49:44<08:28,  2.58s/it]

{'loss': 0.0009, 'grad_norm': 0.027854321524500847, 'learning_rate': 1.3221476510067116e-05, 'epoch': 2.26}


 75%|███████▌  | 599/795 [49:47<08:34,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.009455942548811436, 'learning_rate': 1.3154362416107383e-05, 'epoch': 2.26}


 75%|███████▌  | 600/795 [49:50<08:42,  2.68s/it]

{'loss': 0.0004, 'grad_norm': 0.011189877986907959, 'learning_rate': 1.3087248322147652e-05, 'epoch': 2.26}


 76%|███████▌  | 601/795 [49:52<08:50,  2.73s/it]

{'loss': 0.0012, 'grad_norm': 0.07379350811243057, 'learning_rate': 1.3020134228187919e-05, 'epoch': 2.27}


 76%|███████▌  | 602/795 [49:55<08:45,  2.72s/it]

{'loss': 0.0003, 'grad_norm': 0.008787418715655804, 'learning_rate': 1.2953020134228189e-05, 'epoch': 2.27}


 76%|███████▌  | 603/795 [49:58<08:39,  2.71s/it]

{'loss': 0.0001, 'grad_norm': 0.0033026414457708597, 'learning_rate': 1.2885906040268459e-05, 'epoch': 2.28}


 76%|███████▌  | 604/795 [50:01<08:43,  2.74s/it]

{'loss': 0.0011, 'grad_norm': 0.0996340960264206, 'learning_rate': 1.2818791946308726e-05, 'epoch': 2.28}


 76%|███████▌  | 605/795 [50:03<08:32,  2.70s/it]

{'loss': 0.0002, 'grad_norm': 0.006212861277163029, 'learning_rate': 1.2751677852348994e-05, 'epoch': 2.28}


 76%|███████▌  | 606/795 [50:06<08:17,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.005963131319731474, 'learning_rate': 1.2684563758389261e-05, 'epoch': 2.29}


 76%|███████▋  | 607/795 [50:08<08:14,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.008168850094079971, 'learning_rate': 1.2617449664429532e-05, 'epoch': 2.29}


 76%|███████▋  | 608/795 [50:11<08:12,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.008543654344975948, 'learning_rate': 1.2550335570469798e-05, 'epoch': 2.29}


 77%|███████▋  | 609/795 [50:14<08:09,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.0072490498423576355, 'learning_rate': 1.2483221476510069e-05, 'epoch': 2.3}


 77%|███████▋  | 610/795 [50:16<08:09,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.005675597116351128, 'learning_rate': 1.2416107382550337e-05, 'epoch': 2.3}


 77%|███████▋  | 611/795 [50:19<08:08,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.006387125700712204, 'learning_rate': 1.2348993288590604e-05, 'epoch': 2.31}


 77%|███████▋  | 612/795 [50:22<08:09,  2.67s/it]

{'loss': 0.0005, 'grad_norm': 0.016209321096539497, 'learning_rate': 1.2281879194630872e-05, 'epoch': 2.31}


 77%|███████▋  | 613/795 [50:24<08:06,  2.67s/it]

{'loss': 0.0004, 'grad_norm': 0.01255144365131855, 'learning_rate': 1.2214765100671141e-05, 'epoch': 2.31}


 77%|███████▋  | 614/795 [50:27<08:03,  2.67s/it]

{'loss': 0.0002, 'grad_norm': 0.0058352155610919, 'learning_rate': 1.214765100671141e-05, 'epoch': 2.32}


 77%|███████▋  | 615/795 [50:30<07:53,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.004888255149126053, 'learning_rate': 1.208053691275168e-05, 'epoch': 2.32}


 77%|███████▋  | 616/795 [50:32<07:51,  2.64s/it]

{'loss': 0.0001, 'grad_norm': 0.0033850150648504496, 'learning_rate': 1.2013422818791947e-05, 'epoch': 2.32}


 78%|███████▊  | 617/795 [50:35<07:50,  2.64s/it]

{'loss': 0.0008, 'grad_norm': 0.04813811182975769, 'learning_rate': 1.1946308724832215e-05, 'epoch': 2.33}


 78%|███████▊  | 618/795 [50:38<07:49,  2.65s/it]

{'loss': 0.0006, 'grad_norm': 0.017733031883835793, 'learning_rate': 1.1879194630872484e-05, 'epoch': 2.33}


 78%|███████▊  | 619/795 [50:40<07:48,  2.66s/it]

{'loss': 0.0004, 'grad_norm': 0.013414017856121063, 'learning_rate': 1.1812080536912752e-05, 'epoch': 2.34}


 78%|███████▊  | 620/795 [50:43<07:40,  2.63s/it]

{'loss': 0.0006, 'grad_norm': 0.017379479482769966, 'learning_rate': 1.174496644295302e-05, 'epoch': 2.34}


 78%|███████▊  | 621/795 [50:45<07:35,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.0117215970531106, 'learning_rate': 1.167785234899329e-05, 'epoch': 2.34}


 78%|███████▊  | 622/795 [50:48<07:28,  2.59s/it]

{'loss': 0.0002, 'grad_norm': 0.005736928898841143, 'learning_rate': 1.1610738255033558e-05, 'epoch': 2.35}


 78%|███████▊  | 623/795 [50:51<07:33,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.006408490240573883, 'learning_rate': 1.1543624161073826e-05, 'epoch': 2.35}


 78%|███████▊  | 624/795 [50:53<07:35,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.005257755517959595, 'learning_rate': 1.1476510067114095e-05, 'epoch': 2.35}


 79%|███████▊  | 625/795 [50:56<07:44,  2.73s/it]

{'loss': 0.0001, 'grad_norm': 0.002934084739536047, 'learning_rate': 1.1409395973154363e-05, 'epoch': 2.36}


 79%|███████▊  | 626/795 [50:59<07:29,  2.66s/it]

{'loss': 0.0005, 'grad_norm': 0.01523284986615181, 'learning_rate': 1.1342281879194632e-05, 'epoch': 2.36}


 79%|███████▉  | 627/795 [51:01<07:15,  2.59s/it]

{'loss': 0.0007, 'grad_norm': 0.030981115996837616, 'learning_rate': 1.12751677852349e-05, 'epoch': 2.37}


 79%|███████▉  | 628/795 [51:04<07:17,  2.62s/it]

{'loss': 0.0001, 'grad_norm': 0.003528724191710353, 'learning_rate': 1.1208053691275167e-05, 'epoch': 2.37}


 79%|███████▉  | 629/795 [51:07<07:16,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.006769416853785515, 'learning_rate': 1.1140939597315436e-05, 'epoch': 2.37}


 79%|███████▉  | 630/795 [51:09<07:12,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.005228747148066759, 'learning_rate': 1.1073825503355706e-05, 'epoch': 2.38}


 79%|███████▉  | 631/795 [51:12<07:10,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.013010755181312561, 'learning_rate': 1.1006711409395975e-05, 'epoch': 2.38}


 79%|███████▉  | 632/795 [51:14<07:08,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.004545975476503372, 'learning_rate': 1.0939597315436243e-05, 'epoch': 2.38}


 80%|███████▉  | 633/795 [51:17<07:10,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.012152394279837608, 'learning_rate': 1.087248322147651e-05, 'epoch': 2.39}


 80%|███████▉  | 634/795 [51:20<07:15,  2.71s/it]

{'loss': 0.0001, 'grad_norm': 0.004412841983139515, 'learning_rate': 1.0805369127516778e-05, 'epoch': 2.39}


 80%|███████▉  | 635/795 [51:23<07:08,  2.68s/it]

{'loss': 0.0001, 'grad_norm': 0.004281165543943644, 'learning_rate': 1.0738255033557047e-05, 'epoch': 2.4}


 80%|████████  | 636/795 [51:25<07:06,  2.69s/it]

{'loss': 0.0002, 'grad_norm': 0.004406746942549944, 'learning_rate': 1.0671140939597316e-05, 'epoch': 2.4}


 80%|████████  | 637/795 [51:28<07:04,  2.69s/it]

{'loss': 0.0003, 'grad_norm': 0.006479701027274132, 'learning_rate': 1.0604026845637584e-05, 'epoch': 2.4}


 80%|████████  | 638/795 [51:31<07:01,  2.69s/it]

{'loss': 0.0005, 'grad_norm': 0.01573651097714901, 'learning_rate': 1.0536912751677854e-05, 'epoch': 2.41}


 80%|████████  | 639/795 [51:33<06:55,  2.66s/it]

{'loss': 0.0007, 'grad_norm': 0.021284813061356544, 'learning_rate': 1.0469798657718121e-05, 'epoch': 2.41}


 81%|████████  | 640/795 [51:36<06:54,  2.67s/it]

{'loss': 0.0005, 'grad_norm': 0.014517569914460182, 'learning_rate': 1.040268456375839e-05, 'epoch': 2.42}


 81%|████████  | 641/795 [51:39<06:56,  2.71s/it]

{'loss': 0.0002, 'grad_norm': 0.0037851682864129543, 'learning_rate': 1.0335570469798658e-05, 'epoch': 2.42}


 81%|████████  | 642/795 [51:41<06:54,  2.71s/it]

{'loss': 0.001, 'grad_norm': 0.10705012083053589, 'learning_rate': 1.0268456375838927e-05, 'epoch': 2.42}


 81%|████████  | 643/795 [51:44<06:54,  2.73s/it]

{'loss': 0.0004, 'grad_norm': 0.013657917268574238, 'learning_rate': 1.0201342281879195e-05, 'epoch': 2.43}


 81%|████████  | 644/795 [51:47<06:48,  2.70s/it]

{'loss': 0.0002, 'grad_norm': 0.005222797859460115, 'learning_rate': 1.0134228187919464e-05, 'epoch': 2.43}


 81%|████████  | 645/795 [51:50<06:53,  2.75s/it]

{'loss': 0.0002, 'grad_norm': 0.007829712703824043, 'learning_rate': 1.006711409395973e-05, 'epoch': 2.43}


 81%|████████▏ | 646/795 [51:52<06:45,  2.72s/it]

{'loss': 0.0001, 'grad_norm': 0.0037024980410933495, 'learning_rate': 1e-05, 'epoch': 2.44}


 81%|████████▏ | 647/795 [51:55<06:41,  2.71s/it]

{'loss': 0.0003, 'grad_norm': 0.008495910093188286, 'learning_rate': 9.93288590604027e-06, 'epoch': 2.44}


 82%|████████▏ | 648/795 [51:58<06:34,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.00635483767837286, 'learning_rate': 9.865771812080538e-06, 'epoch': 2.45}


 82%|████████▏ | 649/795 [52:00<06:30,  2.68s/it]

{'loss': 0.0002, 'grad_norm': 0.004790757782757282, 'learning_rate': 9.798657718120806e-06, 'epoch': 2.45}


 82%|████████▏ | 650/795 [52:03<06:25,  2.66s/it]

{'loss': 0.0003, 'grad_norm': 0.006056411191821098, 'learning_rate': 9.731543624161075e-06, 'epoch': 2.45}


 82%|████████▏ | 651/795 [52:06<06:20,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.00604014378041029, 'learning_rate': 9.664429530201342e-06, 'epoch': 2.46}


 82%|████████▏ | 652/795 [52:08<06:17,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.025580130517482758, 'learning_rate': 9.59731543624161e-06, 'epoch': 2.46}


 82%|████████▏ | 653/795 [52:11<06:09,  2.60s/it]

{'loss': 0.0002, 'grad_norm': 0.004718187265098095, 'learning_rate': 9.53020134228188e-06, 'epoch': 2.46}


 82%|████████▏ | 654/795 [52:13<06:10,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.007551256567239761, 'learning_rate': 9.463087248322149e-06, 'epoch': 2.47}


 82%|████████▏ | 655/795 [52:16<06:07,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.00911344401538372, 'learning_rate': 9.395973154362418e-06, 'epoch': 2.47}


 83%|████████▎ | 656/795 [52:19<06:10,  2.66s/it]

{'loss': 0.0003, 'grad_norm': 0.007483477238565683, 'learning_rate': 9.328859060402684e-06, 'epoch': 2.48}


 83%|████████▎ | 657/795 [52:21<06:07,  2.67s/it]

{'loss': 0.0003, 'grad_norm': 0.010562513023614883, 'learning_rate': 9.261744966442953e-06, 'epoch': 2.48}


 83%|████████▎ | 658/795 [52:24<06:05,  2.67s/it]

{'loss': 0.0005, 'grad_norm': 0.01335353497415781, 'learning_rate': 9.194630872483221e-06, 'epoch': 2.48}


 83%|████████▎ | 659/795 [52:27<05:55,  2.61s/it]

{'loss': 0.0005, 'grad_norm': 0.01443843636661768, 'learning_rate': 9.12751677852349e-06, 'epoch': 2.49}


 83%|████████▎ | 660/795 [52:29<05:49,  2.59s/it]

{'loss': 0.0003, 'grad_norm': 0.007522102911025286, 'learning_rate': 9.060402684563759e-06, 'epoch': 2.49}


 83%|████████▎ | 661/795 [52:32<05:55,  2.65s/it]

{'loss': 0.0005, 'grad_norm': 0.01452125795185566, 'learning_rate': 8.993288590604027e-06, 'epoch': 2.49}


 83%|████████▎ | 662/795 [52:34<05:48,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.006773656699806452, 'learning_rate': 8.926174496644296e-06, 'epoch': 2.5}


 83%|████████▎ | 663/795 [52:37<05:48,  2.64s/it]

{'loss': 0.0003, 'grad_norm': 0.010043062269687653, 'learning_rate': 8.859060402684564e-06, 'epoch': 2.5}


 84%|████████▎ | 664/795 [52:40<05:41,  2.61s/it]

{'loss': 0.0005, 'grad_norm': 0.013996905647218227, 'learning_rate': 8.791946308724833e-06, 'epoch': 2.51}


 84%|████████▎ | 665/795 [52:43<05:49,  2.69s/it]

{'loss': 0.0002, 'grad_norm': 0.00515930587425828, 'learning_rate': 8.724832214765101e-06, 'epoch': 2.51}


 84%|████████▍ | 666/795 [52:45<05:50,  2.72s/it]

{'loss': 0.0004, 'grad_norm': 0.0170651376247406, 'learning_rate': 8.65771812080537e-06, 'epoch': 2.51}


 84%|████████▍ | 667/795 [52:48<05:44,  2.69s/it]

{'loss': 0.0002, 'grad_norm': 0.00474423635751009, 'learning_rate': 8.590604026845638e-06, 'epoch': 2.52}


 84%|████████▍ | 668/795 [52:51<05:38,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.00498888548463583, 'learning_rate': 8.523489932885905e-06, 'epoch': 2.52}


 84%|████████▍ | 669/795 [52:53<05:37,  2.68s/it]

{'loss': 0.0002, 'grad_norm': 0.005076369270682335, 'learning_rate': 8.456375838926175e-06, 'epoch': 2.52}


 84%|████████▍ | 670/795 [52:56<05:34,  2.67s/it]

{'loss': 0.0002, 'grad_norm': 0.005287211388349533, 'learning_rate': 8.389261744966444e-06, 'epoch': 2.53}


 84%|████████▍ | 671/795 [52:59<05:30,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.005097250919789076, 'learning_rate': 8.322147651006712e-06, 'epoch': 2.53}


 85%|████████▍ | 672/795 [53:01<05:27,  2.66s/it]

{'loss': 0.0002, 'grad_norm': 0.007470032665878534, 'learning_rate': 8.255033557046981e-06, 'epoch': 2.54}


 85%|████████▍ | 673/795 [53:04<05:22,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.004453893285244703, 'learning_rate': 8.187919463087248e-06, 'epoch': 2.54}


 85%|████████▍ | 674/795 [53:06<05:16,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.00541383121162653, 'learning_rate': 8.120805369127516e-06, 'epoch': 2.54}


 85%|████████▍ | 675/795 [53:09<05:18,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.008571977727115154, 'learning_rate': 8.053691275167785e-06, 'epoch': 2.55}


 85%|████████▌ | 676/795 [53:12<05:17,  2.67s/it]

{'loss': 0.0003, 'grad_norm': 0.0071525597013533115, 'learning_rate': 7.986577181208055e-06, 'epoch': 2.55}


 85%|████████▌ | 677/795 [53:14<05:10,  2.63s/it]

{'loss': 0.0005, 'grad_norm': 0.01371125876903534, 'learning_rate': 7.919463087248324e-06, 'epoch': 2.55}


 85%|████████▌ | 678/795 [53:17<05:10,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.007692055776715279, 'learning_rate': 7.852348993288592e-06, 'epoch': 2.56}


 85%|████████▌ | 679/795 [53:20<05:12,  2.70s/it]

{'loss': 0.0004, 'grad_norm': 0.011240548454225063, 'learning_rate': 7.785234899328859e-06, 'epoch': 2.56}


 86%|████████▌ | 680/795 [53:23<05:11,  2.71s/it]

{'loss': 0.0004, 'grad_norm': 0.014716526493430138, 'learning_rate': 7.718120805369127e-06, 'epoch': 2.57}


 86%|████████▌ | 681/795 [53:25<05:10,  2.72s/it]

{'loss': 0.0003, 'grad_norm': 0.024280505254864693, 'learning_rate': 7.651006711409396e-06, 'epoch': 2.57}


 86%|████████▌ | 682/795 [53:28<05:05,  2.71s/it]

{'loss': 0.0003, 'grad_norm': 0.007646803278476, 'learning_rate': 7.5838926174496645e-06, 'epoch': 2.57}


 86%|████████▌ | 683/795 [53:31<05:00,  2.68s/it]

{'loss': 0.0004, 'grad_norm': 0.013739078305661678, 'learning_rate': 7.516778523489933e-06, 'epoch': 2.58}


 86%|████████▌ | 684/795 [53:33<04:57,  2.68s/it]

{'loss': 0.0004, 'grad_norm': 0.009046031162142754, 'learning_rate': 7.4496644295302024e-06, 'epoch': 2.58}


 86%|████████▌ | 685/795 [53:36<04:51,  2.65s/it]

{'loss': 0.0005, 'grad_norm': 0.013853834941983223, 'learning_rate': 7.382550335570471e-06, 'epoch': 2.58}


 86%|████████▋ | 686/795 [53:39<04:53,  2.69s/it]

{'loss': 0.0003, 'grad_norm': 0.006944600492715836, 'learning_rate': 7.315436241610739e-06, 'epoch': 2.59}


 86%|████████▋ | 687/795 [53:41<04:46,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.009155615232884884, 'learning_rate': 7.248322147651007e-06, 'epoch': 2.59}


 87%|████████▋ | 688/795 [53:44<04:48,  2.70s/it]

{'loss': 0.0004, 'grad_norm': 0.01083577424287796, 'learning_rate': 7.181208053691276e-06, 'epoch': 2.6}


 87%|████████▋ | 689/795 [53:47<04:45,  2.70s/it]

{'loss': 0.0002, 'grad_norm': 0.0047560264356434345, 'learning_rate': 7.114093959731543e-06, 'epoch': 2.6}


 87%|████████▋ | 690/795 [53:49<04:41,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.0072987619787454605, 'learning_rate': 7.046979865771812e-06, 'epoch': 2.6}


 87%|████████▋ | 691/795 [53:52<04:33,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.014934873208403587, 'learning_rate': 6.9798657718120805e-06, 'epoch': 2.61}


 87%|████████▋ | 692/795 [53:55<04:33,  2.65s/it]

{'loss': 0.0006, 'grad_norm': 0.0281132310628891, 'learning_rate': 6.91275167785235e-06, 'epoch': 2.61}


 87%|████████▋ | 693/795 [53:57<04:27,  2.62s/it]

{'loss': 0.0004, 'grad_norm': 0.010111309587955475, 'learning_rate': 6.845637583892618e-06, 'epoch': 2.62}


 87%|████████▋ | 694/795 [54:00<04:24,  2.62s/it]

{'loss': 0.0001, 'grad_norm': 0.00466804439201951, 'learning_rate': 6.778523489932886e-06, 'epoch': 2.62}


 87%|████████▋ | 695/795 [54:02<04:24,  2.64s/it]

{'loss': 0.0003, 'grad_norm': 0.007694182451814413, 'learning_rate': 6.7114093959731546e-06, 'epoch': 2.62}


 88%|████████▊ | 696/795 [54:05<04:20,  2.63s/it]

{'loss': 0.0005, 'grad_norm': 0.013307592831552029, 'learning_rate': 6.644295302013423e-06, 'epoch': 2.63}


 88%|████████▊ | 697/795 [54:08<04:22,  2.68s/it]

{'loss': 0.0002, 'grad_norm': 0.004985718056559563, 'learning_rate': 6.577181208053692e-06, 'epoch': 2.63}


 88%|████████▊ | 698/795 [54:11<04:18,  2.66s/it]

{'loss': 0.0004, 'grad_norm': 0.01548148225992918, 'learning_rate': 6.510067114093959e-06, 'epoch': 2.63}


 88%|████████▊ | 699/795 [54:13<04:12,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.005353553220629692, 'learning_rate': 6.4429530201342295e-06, 'epoch': 2.64}


 88%|████████▊ | 700/795 [54:16<04:09,  2.63s/it]

{'loss': 0.0002, 'grad_norm': 0.0053162649273872375, 'learning_rate': 6.375838926174497e-06, 'epoch': 2.64}


 88%|████████▊ | 701/795 [54:18<04:07,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.012693468481302261, 'learning_rate': 6.308724832214766e-06, 'epoch': 2.65}


 88%|████████▊ | 702/795 [54:21<04:03,  2.62s/it]

{'loss': 0.0005, 'grad_norm': 0.014024647884070873, 'learning_rate': 6.241610738255034e-06, 'epoch': 2.65}


 88%|████████▊ | 703/795 [54:23<03:59,  2.60s/it]

{'loss': 0.0003, 'grad_norm': 0.00874284841120243, 'learning_rate': 6.174496644295302e-06, 'epoch': 2.65}


 89%|████████▊ | 704/795 [54:26<03:54,  2.57s/it]

{'loss': 0.0006, 'grad_norm': 0.01560733001679182, 'learning_rate': 6.1073825503355705e-06, 'epoch': 2.66}


 89%|████████▊ | 705/795 [54:29<03:53,  2.60s/it]

{'loss': 0.0002, 'grad_norm': 0.004443306941539049, 'learning_rate': 6.04026845637584e-06, 'epoch': 2.66}


 89%|████████▉ | 706/795 [54:31<03:46,  2.54s/it]

{'loss': 0.0003, 'grad_norm': 0.008530301973223686, 'learning_rate': 5.9731543624161076e-06, 'epoch': 2.66}


 89%|████████▉ | 707/795 [54:34<03:45,  2.56s/it]

{'loss': 0.0004, 'grad_norm': 0.01212973240762949, 'learning_rate': 5.906040268456376e-06, 'epoch': 2.67}


 89%|████████▉ | 708/795 [54:36<03:44,  2.58s/it]

{'loss': 0.0002, 'grad_norm': 0.00520802428945899, 'learning_rate': 5.838926174496645e-06, 'epoch': 2.67}


 89%|████████▉ | 709/795 [54:39<03:47,  2.65s/it]

{'loss': 0.0003, 'grad_norm': 0.006902650464326143, 'learning_rate': 5.771812080536913e-06, 'epoch': 2.68}


 89%|████████▉ | 710/795 [54:42<03:41,  2.61s/it]

{'loss': 0.0002, 'grad_norm': 0.003700920846313238, 'learning_rate': 5.704697986577182e-06, 'epoch': 2.68}


 89%|████████▉ | 711/795 [54:44<03:34,  2.55s/it]

{'loss': 0.0001, 'grad_norm': 0.0024159587919712067, 'learning_rate': 5.63758389261745e-06, 'epoch': 2.68}


 90%|████████▉ | 712/795 [54:46<03:29,  2.53s/it]

{'loss': 0.0002, 'grad_norm': 0.00488232122734189, 'learning_rate': 5.570469798657718e-06, 'epoch': 2.69}


 90%|████████▉ | 713/795 [54:49<03:30,  2.56s/it]

{'loss': 0.0002, 'grad_norm': 0.004757107235491276, 'learning_rate': 5.503355704697987e-06, 'epoch': 2.69}


 90%|████████▉ | 714/795 [54:52<03:30,  2.60s/it]

{'loss': 0.0004, 'grad_norm': 0.009541156701743603, 'learning_rate': 5.436241610738255e-06, 'epoch': 2.69}


 90%|████████▉ | 715/795 [54:54<03:26,  2.58s/it]

{'loss': 0.0004, 'grad_norm': 0.01107726339250803, 'learning_rate': 5.3691275167785235e-06, 'epoch': 2.7}


 90%|█████████ | 716/795 [54:57<03:26,  2.61s/it]

{'loss': 0.0001, 'grad_norm': 0.0035562580451369286, 'learning_rate': 5.302013422818792e-06, 'epoch': 2.7}


 90%|█████████ | 717/795 [55:00<03:24,  2.62s/it]

{'loss': 0.0003, 'grad_norm': 0.009130183607339859, 'learning_rate': 5.2348993288590606e-06, 'epoch': 2.71}


 90%|█████████ | 718/795 [55:03<03:27,  2.69s/it]

{'loss': 0.0001, 'grad_norm': 0.0035555425565689802, 'learning_rate': 5.167785234899329e-06, 'epoch': 2.71}


 90%|█████████ | 719/795 [55:05<03:19,  2.63s/it]

{'loss': 0.0005, 'grad_norm': 0.01614115759730339, 'learning_rate': 5.100671140939598e-06, 'epoch': 2.71}


 91%|█████████ | 720/795 [55:08<03:17,  2.64s/it]

{'loss': 0.0001, 'grad_norm': 0.004479729104787111, 'learning_rate': 5.033557046979865e-06, 'epoch': 2.72}


 91%|█████████ | 721/795 [55:10<03:16,  2.66s/it]

{'loss': 0.0001, 'grad_norm': 0.003347056917846203, 'learning_rate': 4.966442953020135e-06, 'epoch': 2.72}


 91%|█████████ | 722/795 [55:13<03:11,  2.63s/it]

{'loss': 0.0001, 'grad_norm': 0.0037628060672432184, 'learning_rate': 4.899328859060403e-06, 'epoch': 2.72}


 91%|█████████ | 723/795 [55:15<03:05,  2.58s/it]

{'loss': 0.0002, 'grad_norm': 0.005290593486279249, 'learning_rate': 4.832214765100671e-06, 'epoch': 2.73}


 91%|█████████ | 724/795 [55:18<03:02,  2.58s/it]

{'loss': 0.0003, 'grad_norm': 0.009238428436219692, 'learning_rate': 4.76510067114094e-06, 'epoch': 2.73}


 91%|█████████ | 725/795 [55:20<02:58,  2.55s/it]

{'loss': 0.0002, 'grad_norm': 0.007595130708068609, 'learning_rate': 4.697986577181209e-06, 'epoch': 2.74}


 91%|█████████▏| 726/795 [55:23<02:56,  2.55s/it]

{'loss': 0.0002, 'grad_norm': 0.004996937233954668, 'learning_rate': 4.6308724832214765e-06, 'epoch': 2.74}


 91%|█████████▏| 727/795 [55:25<02:51,  2.53s/it]

{'loss': 0.0003, 'grad_norm': 0.009061390534043312, 'learning_rate': 4.563758389261745e-06, 'epoch': 2.74}


 92%|█████████▏| 728/795 [55:28<02:50,  2.54s/it]

{'loss': 0.0001, 'grad_norm': 0.003450600663200021, 'learning_rate': 4.4966442953020135e-06, 'epoch': 2.75}


 92%|█████████▏| 729/795 [55:31<02:49,  2.57s/it]

{'loss': 2.2846, 'grad_norm': 272.3068542480469, 'learning_rate': 4.429530201342282e-06, 'epoch': 2.75}


 92%|█████████▏| 730/795 [55:33<02:49,  2.61s/it]

{'loss': 0.0002, 'grad_norm': 0.0056541962549090385, 'learning_rate': 4.362416107382551e-06, 'epoch': 2.75}


 92%|█████████▏| 731/795 [55:36<02:49,  2.64s/it]

{'loss': 0.0004, 'grad_norm': 0.012743568979203701, 'learning_rate': 4.295302013422819e-06, 'epoch': 2.76}


 92%|█████████▏| 732/795 [55:39<02:47,  2.66s/it]

{'loss': 0.0001, 'grad_norm': 0.004732990637421608, 'learning_rate': 4.228187919463088e-06, 'epoch': 2.76}


 92%|█████████▏| 733/795 [55:41<02:43,  2.64s/it]

{'loss': 2.9168, 'grad_norm': 50.32946014404297, 'learning_rate': 4.161073825503356e-06, 'epoch': 2.77}


 92%|█████████▏| 734/795 [55:44<02:40,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.004683722276240587, 'learning_rate': 4.093959731543624e-06, 'epoch': 2.77}


 92%|█████████▏| 735/795 [55:47<02:40,  2.67s/it]

{'loss': 0.0002, 'grad_norm': 0.005029688123613596, 'learning_rate': 4.026845637583892e-06, 'epoch': 2.77}


 93%|█████████▎| 736/795 [55:49<02:36,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.005961999762803316, 'learning_rate': 3.959731543624162e-06, 'epoch': 2.78}


 93%|█████████▎| 737/795 [55:52<02:34,  2.67s/it]

{'loss': 0.0001, 'grad_norm': 0.0031547911930829287, 'learning_rate': 3.8926174496644295e-06, 'epoch': 2.78}


 93%|█████████▎| 738/795 [55:55<02:30,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.005167447961866856, 'learning_rate': 3.825503355704698e-06, 'epoch': 2.78}


 93%|█████████▎| 739/795 [55:58<02:31,  2.71s/it]

{'loss': 0.0005, 'grad_norm': 0.017457714304327965, 'learning_rate': 3.7583892617449665e-06, 'epoch': 2.79}


 93%|█████████▎| 740/795 [56:00<02:28,  2.70s/it]

{'loss': 0.0003, 'grad_norm': 0.0075087170116603374, 'learning_rate': 3.6912751677852355e-06, 'epoch': 2.79}


 93%|█████████▎| 741/795 [56:03<02:25,  2.70s/it]

{'loss': 0.0001, 'grad_norm': 0.004508475307375193, 'learning_rate': 3.6241610738255036e-06, 'epoch': 2.8}


 93%|█████████▎| 742/795 [56:05<02:20,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.004338506143540144, 'learning_rate': 3.5570469798657717e-06, 'epoch': 2.8}


 93%|█████████▎| 743/795 [56:08<02:19,  2.67s/it]

{'loss': 0.0005, 'grad_norm': 0.015378881245851517, 'learning_rate': 3.4899328859060402e-06, 'epoch': 2.8}


 94%|█████████▎| 744/795 [56:11<02:15,  2.66s/it]

{'loss': 0.0001, 'grad_norm': 0.004282478243112564, 'learning_rate': 3.422818791946309e-06, 'epoch': 2.81}


 94%|█████████▎| 745/795 [56:13<02:11,  2.64s/it]

{'loss': 0.04, 'grad_norm': 7.150835990905762, 'learning_rate': 3.3557046979865773e-06, 'epoch': 2.81}


 94%|█████████▍| 746/795 [56:16<02:08,  2.63s/it]

{'loss': 0.0001, 'grad_norm': 0.003809408051893115, 'learning_rate': 3.288590604026846e-06, 'epoch': 2.82}


 94%|█████████▍| 747/795 [56:19<02:06,  2.64s/it]

{'loss': 0.0004, 'grad_norm': 0.010153482668101788, 'learning_rate': 3.2214765100671148e-06, 'epoch': 2.82}


 94%|█████████▍| 748/795 [56:21<02:04,  2.66s/it]

{'loss': 0.0004, 'grad_norm': 0.020102934911847115, 'learning_rate': 3.154362416107383e-06, 'epoch': 2.82}


 94%|█████████▍| 749/795 [56:24<02:03,  2.69s/it]

{'loss': 0.0001, 'grad_norm': 0.0030213361606001854, 'learning_rate': 3.087248322147651e-06, 'epoch': 2.83}


 94%|█████████▍| 750/795 [56:27<02:01,  2.70s/it]

{'loss': 0.0006, 'grad_norm': 0.025161201134324074, 'learning_rate': 3.02013422818792e-06, 'epoch': 2.83}


 94%|█████████▍| 751/795 [56:30<01:59,  2.71s/it]

{'loss': 0.0002, 'grad_norm': 0.005702278111129999, 'learning_rate': 2.953020134228188e-06, 'epoch': 2.83}


 95%|█████████▍| 752/795 [56:32<01:58,  2.75s/it]

{'loss': 0.0002, 'grad_norm': 0.004742846358567476, 'learning_rate': 2.8859060402684566e-06, 'epoch': 2.84}


 95%|█████████▍| 753/795 [56:35<01:54,  2.72s/it]

{'loss': 0.0014, 'grad_norm': 0.10460353642702103, 'learning_rate': 2.818791946308725e-06, 'epoch': 2.84}


 95%|█████████▍| 754/795 [56:38<01:50,  2.70s/it]

{'loss': 0.0006, 'grad_norm': 0.01413438469171524, 'learning_rate': 2.7516778523489936e-06, 'epoch': 2.85}


 95%|█████████▍| 755/795 [56:40<01:47,  2.68s/it]

{'loss': 0.0004, 'grad_norm': 0.013044865801930428, 'learning_rate': 2.6845637583892617e-06, 'epoch': 2.85}


 95%|█████████▌| 756/795 [56:43<01:44,  2.68s/it]

{'loss': 0.0003, 'grad_norm': 0.009074189700186253, 'learning_rate': 2.6174496644295303e-06, 'epoch': 2.85}


 95%|█████████▌| 757/795 [56:46<01:43,  2.72s/it]

{'loss': 0.0003, 'grad_norm': 0.013590866699814796, 'learning_rate': 2.550335570469799e-06, 'epoch': 2.86}


 95%|█████████▌| 758/795 [56:49<01:40,  2.73s/it]

{'loss': 0.0002, 'grad_norm': 0.005762686021625996, 'learning_rate': 2.4832214765100673e-06, 'epoch': 2.86}


 95%|█████████▌| 759/795 [56:51<01:37,  2.72s/it]

{'loss': 0.0002, 'grad_norm': 0.005104287061840296, 'learning_rate': 2.4161073825503354e-06, 'epoch': 2.86}


 96%|█████████▌| 760/795 [56:54<01:35,  2.72s/it]

{'loss': 0.0004, 'grad_norm': 0.011133952997624874, 'learning_rate': 2.3489932885906044e-06, 'epoch': 2.87}


 96%|█████████▌| 761/795 [56:57<01:32,  2.71s/it]

{'loss': 0.0002, 'grad_norm': 0.006120200734585524, 'learning_rate': 2.2818791946308725e-06, 'epoch': 2.87}


 96%|█████████▌| 762/795 [56:59<01:29,  2.70s/it]

{'loss': 0.0005, 'grad_norm': 0.031090030446648598, 'learning_rate': 2.214765100671141e-06, 'epoch': 2.88}


 96%|█████████▌| 763/795 [57:02<01:25,  2.67s/it]

{'loss': 0.0002, 'grad_norm': 0.005815795622766018, 'learning_rate': 2.1476510067114096e-06, 'epoch': 2.88}


 96%|█████████▌| 764/795 [57:05<01:22,  2.66s/it]

{'loss': 0.0004, 'grad_norm': 0.01250295527279377, 'learning_rate': 2.080536912751678e-06, 'epoch': 2.88}


 96%|█████████▌| 765/795 [57:07<01:19,  2.64s/it]

{'loss': 0.0002, 'grad_norm': 0.0048135085962712765, 'learning_rate': 2.013422818791946e-06, 'epoch': 2.89}


 96%|█████████▋| 766/795 [57:10<01:16,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.012507508508861065, 'learning_rate': 1.9463087248322147e-06, 'epoch': 2.89}


 96%|█████████▋| 767/795 [57:13<01:13,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.06651055812835693, 'learning_rate': 1.8791946308724833e-06, 'epoch': 2.89}


 97%|█████████▋| 768/795 [57:15<01:11,  2.64s/it]

{'loss': 0.0001, 'grad_norm': 0.0032965890131890774, 'learning_rate': 1.8120805369127518e-06, 'epoch': 2.9}


 97%|█████████▋| 769/795 [57:18<01:09,  2.67s/it]

{'loss': 0.0004, 'grad_norm': 0.02857608161866665, 'learning_rate': 1.7449664429530201e-06, 'epoch': 2.9}


 97%|█████████▋| 770/795 [57:21<01:07,  2.71s/it]

{'loss': 0.0004, 'grad_norm': 0.01082827802747488, 'learning_rate': 1.6778523489932886e-06, 'epoch': 2.91}


 97%|█████████▋| 771/795 [57:23<01:05,  2.71s/it]

{'loss': 0.0003, 'grad_norm': 0.008626251481473446, 'learning_rate': 1.6107382550335574e-06, 'epoch': 2.91}


 97%|█████████▋| 772/795 [57:26<01:01,  2.68s/it]

{'loss': 0.0004, 'grad_norm': 0.012754485942423344, 'learning_rate': 1.5436241610738255e-06, 'epoch': 2.91}


 97%|█████████▋| 773/795 [57:29<00:58,  2.64s/it]

{'loss': 0.0001, 'grad_norm': 0.004087765235453844, 'learning_rate': 1.476510067114094e-06, 'epoch': 2.92}


 97%|█████████▋| 774/795 [57:31<00:55,  2.63s/it]

{'loss': 0.002, 'grad_norm': 0.24983613193035126, 'learning_rate': 1.4093959731543626e-06, 'epoch': 2.92}


 97%|█████████▋| 775/795 [57:34<00:53,  2.70s/it]

{'loss': 0.0003, 'grad_norm': 0.007454277947545052, 'learning_rate': 1.3422818791946309e-06, 'epoch': 2.92}


 98%|█████████▊| 776/795 [57:37<00:50,  2.64s/it]

{'loss': 0.0004, 'grad_norm': 0.010362323373556137, 'learning_rate': 1.2751677852348994e-06, 'epoch': 2.93}


 98%|█████████▊| 777/795 [57:39<00:47,  2.62s/it]

{'loss': 0.0001, 'grad_norm': 0.0033651653211563826, 'learning_rate': 1.2080536912751677e-06, 'epoch': 2.93}


 98%|█████████▊| 778/795 [57:42<00:45,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.013724175281822681, 'learning_rate': 1.1409395973154363e-06, 'epoch': 2.94}


 98%|█████████▊| 779/795 [57:44<00:42,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.011808857321739197, 'learning_rate': 1.0738255033557048e-06, 'epoch': 2.94}


 98%|█████████▊| 780/795 [57:47<00:39,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.008603926748037338, 'learning_rate': 1.006711409395973e-06, 'epoch': 2.94}


 98%|█████████▊| 781/795 [57:50<00:36,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.006079991348087788, 'learning_rate': 9.395973154362416e-07, 'epoch': 2.95}


 98%|█████████▊| 782/795 [57:52<00:34,  2.62s/it]

{'loss': 0.0001, 'grad_norm': 0.0042761825025081635, 'learning_rate': 8.724832214765101e-07, 'epoch': 2.95}


 98%|█████████▊| 783/795 [57:55<00:31,  2.60s/it]

{'loss': 0.0001, 'grad_norm': 0.0031442728359252214, 'learning_rate': 8.053691275167787e-07, 'epoch': 2.95}


 99%|█████████▊| 784/795 [57:57<00:28,  2.60s/it]

{'loss': 0.0003, 'grad_norm': 0.008051147684454918, 'learning_rate': 7.38255033557047e-07, 'epoch': 2.96}


 99%|█████████▊| 785/795 [58:00<00:26,  2.62s/it]

{'loss': 0.0002, 'grad_norm': 0.004447645973414183, 'learning_rate': 6.711409395973154e-07, 'epoch': 2.96}


 99%|█████████▉| 786/795 [58:03<00:23,  2.63s/it]

{'loss': 0.0004, 'grad_norm': 0.011865965090692043, 'learning_rate': 6.040268456375839e-07, 'epoch': 2.97}


 99%|█████████▉| 787/795 [58:05<00:21,  2.66s/it]

{'loss': 0.0001, 'grad_norm': 0.003465501358732581, 'learning_rate': 5.369127516778524e-07, 'epoch': 2.97}


 99%|█████████▉| 788/795 [58:08<00:18,  2.71s/it]

{'loss': 0.0002, 'grad_norm': 0.004510985221713781, 'learning_rate': 4.697986577181208e-07, 'epoch': 2.97}


 99%|█████████▉| 789/795 [58:11<00:16,  2.72s/it]

{'loss': 0.0001, 'grad_norm': 0.004536142107099295, 'learning_rate': 4.0268456375838935e-07, 'epoch': 2.98}


 99%|█████████▉| 790/795 [58:14<00:13,  2.71s/it]

{'loss': 0.0001, 'grad_norm': 0.002929429989308119, 'learning_rate': 3.355704697986577e-07, 'epoch': 2.98}


 99%|█████████▉| 791/795 [58:16<00:10,  2.66s/it]

{'loss': 0.0003, 'grad_norm': 0.008475862443447113, 'learning_rate': 2.684563758389262e-07, 'epoch': 2.98}


100%|█████████▉| 792/795 [58:19<00:07,  2.65s/it]

{'loss': 0.0002, 'grad_norm': 0.0064002820290625095, 'learning_rate': 2.0134228187919467e-07, 'epoch': 2.99}


100%|█████████▉| 793/795 [58:22<00:05,  2.63s/it]

{'loss': 0.0003, 'grad_norm': 0.008923356421291828, 'learning_rate': 1.342281879194631e-07, 'epoch': 2.99}


100%|█████████▉| 794/795 [58:24<00:02,  2.65s/it]

{'loss': 0.0004, 'grad_norm': 0.010488552041351795, 'learning_rate': 6.711409395973155e-08, 'epoch': 3.0}


100%|██████████| 795/795 [58:26<00:00,  2.33s/it]

{'loss': 0.0001, 'grad_norm': 0.0033402987755835056, 'learning_rate': 0.0, 'epoch': 3.0}


                                                 
100%|██████████| 795/795 [1:08:57<00:00,  2.33s/it]

{'eval_loss': 0.18705609440803528, 'eval_accuracy': 0.9676375404530745, 'eval_runtime': 631.4271, 'eval_samples_per_second': 1.957, 'eval_steps_per_second': 0.979, 'epoch': 3.0}


100%|██████████| 795/795 [1:09:01<00:00,  5.21s/it]

{'train_runtime': 4141.4021, 'train_samples_per_second': 0.383, 'train_steps_per_second': 0.192, 'train_loss': 0.1910579867478231, 'epoch': 3.0}





TrainOutput(global_step=795, training_loss=0.1910579867478231, metrics={'train_runtime': 4141.4021, 'train_samples_per_second': 0.383, 'train_steps_per_second': 0.192, 'train_loss': 0.1910579867478231, 'epoch': 3.0})

In [27]:
trainer.evaluate()

100%|██████████| 618/618 [10:28<00:00,  1.02s/it]


{'eval_loss': 0.18705609440803528,
 'eval_accuracy': 0.9676375404530745,
 'eval_runtime': 629.8774,
 'eval_samples_per_second': 1.962,
 'eval_steps_per_second': 0.981,
 'epoch': 3.0}

In [28]:
trainer.save_model()

In [29]:
tokenizer.save_vocabulary(save_directory="./project cpm2 ")

('./project cpm2 /vocab.txt',)

In [30]:
#from google.colab import drive
#drive.mount('/content/drive')

In [34]:
%pip install tf-keras

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting tf-keras
  Downloading tf_keras-2.16.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.16.0-py3-none-any.whl (1.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hInstalling collected packages: tf-keras
Successfully installed tf-keras-2.16.0
Note: you may need to restart the kernel to use updated packages.


In [36]:
pipe = pipeline("text-classification", "/Users/filip/Desktop/Cognitive/cpm/project cpm2 ", tokenizer=BERT_MODEL)



sample_text = '''
I'm human and I'm proud of it!
'''


pipe(sample_text, top_k=None)

[{'label': 'HUMAN', 'score': 0.9989688396453857},
 {'label': 'ROBOT', 'score': 0.00103111588396132}]