## Train a character-level GPT on some text data

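The inputs here are simple text files, which the original example chops up into individual characters and then trains a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. The original example feeds it some Shakespeare and predicts at the character level; the cells below instead tokenize Russian text with the `DeepPavlov/rubert-base-cased` subword tokenizer, so the model predicts the next subword token rather than the next character.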
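
As a minimal sketch of the character-level chop-and-encode step described above (not the code used in the cells below), the snippet builds a character vocabulary from a short sample string and maps it to integers and back. The names `stoi` and `itos` and the sample text are illustrative assumptions.

```python
# Toy character-level encoding; the sample text is made up for illustration.
text = "to be or not to be"

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

encoded = [stoi[ch] for ch in text]
decoded = "".join(itos[i] for i in encoded)

print(len(text), "characters,", len(chars), "unique")
print(decoded == text)
```

A subword setup works the same way, except that the vocabulary entries are word pieces instead of single characters.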

In [1]:
from allennlp.data.token_indexers import TokenIndexer, PretrainedTransformerIndexer
from allennlp.data.tokenizers import Token, Tokenizer, PretrainedTransformerTokenizer
import nltk
#nltk.download('punkt')
import numpy as np
from os import listdir
from os.path import join as pathjoin
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from minGPT.mingpt.model import GPT, GPTConfig
from minGPT.mingpt.trainer import Trainer, TrainerConfig
from minGPT.mingpt.utils import sample, set_seed

# make deterministic
set_seed(42)

In [2]:
DATA_DIR = '/home/mlepekhin/data'
MODELS_DIR = '/home/mlepekhin/models'
transformer_model = 'DeepPavlov/rubert-base-cased'

In [3]:
from allennlp.data import Vocabulary


tokenizer = PretrainedTransformerTokenizer(transformer_model)
indexer = PretrainedTransformerIndexer(transformer_model)
# from_files is a classmethod, so no throwaway Vocabulary() instance is needed
bert_vocab = Vocabulary.from_files(
    pathjoin(MODELS_DIR, 'allennlp_rubert_from_discriminator', 'vocab')
)

In [4]:
indexer.tokens_to_indices(tokenizer.tokenize('присоединились'), bert_vocab)

{'token_ids': [101, 29895, 102],
 'mask': [True, True, True],
 'type_ids': [0, 0, 0]}

In [5]:
bert_token_to_index = bert_vocab.get_token_to_index_vocabulary('tags')
bert_token_to_index['присоединились']

29895
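
The two outputs above are consistent with each other: the word maps to a single vocabulary id, and the indexer wraps it in two extra ids. A minimal sketch of that lookup with a toy dictionary follows. The assumption that 101 and 102 are the `[CLS]` and `[SEP]` ids (as in standard BERT vocabularies) and the helper name `encode_with_specials` are illustrative, not taken from the notebook.

```python
# Toy stand-in for bert_token_to_index; only the id 29895 comes from the output above.
# The ids for [CLS] and [SEP] are an assumption based on standard BERT vocabularies.
toy_token_to_index = {"[CLS]": 101, "[SEP]": 102, "присоединились": 29895}
toy_index_to_token = {i: t for t, i in toy_token_to_index.items()}


def encode_with_specials(tokens):
    """Hypothetical helper: wrap tokens in [CLS] ... [SEP] and look up their ids."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    return [toy_token_to_index[t] for t in wrapped]


ids = encode_with_specials(["присоединились"])
print(ids)
print([toy_index_to_token[i] for i in ids])
```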

In [6]:
import math
from torch.utils.data import Dataset

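def detokenize(tokens):
    # drop the first and last tokens (the special tokens added by the tokenizer)
    # and glue '##' continuation pieces back onto the preceding piece
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')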

class BPEDataset(Dataset):
    def __init__(self, data, block_size):
        data_size, vocab_size = len(data), len(bert_token_to_index)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) tokens from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every token to an integer
        dix = [bert_token_to_index[word] for word in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
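
The input/target arrangement described in the docstring can be sketched without any model. This toy version uses characters and plain lists instead of tensors; `windows` is a hypothetical helper that mirrors the slicing in `__getitem__` and the count in `__len__`.

```python
def windows(data, block_size):
    """Yield (x, y) pairs where y is x shifted one position to the right."""
    for idx in range(len(data) - block_size):
        chunk = data[idx:idx + block_size + 1]
        yield chunk[:-1], chunk[1:]


data = list("hello world")
pairs = list(windows(data, block_size=4))

x, y = pairs[0]
print("".join(x), "->", "".join(y))  # hell -> ello

# Each pair packs block_size next-token examples into one sequence.
for t in range(len(x)):
    print("given", repr("".join(x[:t + 1])), "predict", repr(y[t]))
```

Consecutive windows overlap almost entirely, which is why the dataset length is `len(data) - block_size` rather than `len(data) / block_size`.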


In [7]:
block_size = 128
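
`block_size` is also the longest context the model sees at once, so generation has to proceed one token at a time and crop the context once it grows past that length. The loop below is a toy sketch of that procedure; `toy_next_token` is a made-up stand-in for a model forward pass, and a small `block_size` is used so the cropping is visible.

```python
def toy_next_token(context):
    """Made-up stand-in for a forward pass: returns (sum of context) mod 10."""
    return sum(context) % 10


def generate(prompt, steps, block_size):
    """Autoregressive loop: one 'forward pass' per generated token."""
    sequence = list(prompt)
    for _ in range(steps):
        context = sequence[-block_size:]  # keep at most block_size tokens
        sequence.append(toy_next_token(context))
    return sequence


out = generate([1, 2, 3], steps=4, block_size=3)
print(out)
```

Generating `steps` tokens takes `steps` calls to the model, in contrast to training, where one pass scores every position of the block.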

In [8]:
def train_gpt_generator(train_text_file, state_dict_file, n_layer=4, n_head=4, n_embd=256,
                        max_epochs=2, batch_size=64):
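    # read the corpus (closing the file afterwards) and split it into sentences
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # tokenize each sentence and flatten everything into one list of token strings
    tokens = [str(token) for sent in text_sentences for token in tokenizer.tokenize(sent)]
    train_dataset = BPEDataset(tokens, block_size)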
    
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    tconf = TrainerConfig(
        max_epochs=max_epochs, batch_size=batch_size, learning_rate=6e-4,
        lr_decay=True, warmup_tokens=batch_size*20,
        # decay over the whole run: max_epochs passes over the dataset
        final_tokens=max_epochs*len(train_dataset)*block_size,
    )
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), state_dict_file)
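
The `lr_decay`, `warmup_tokens` and `final_tokens` settings describe a schedule measured in tokens processed rather than in optimizer steps. The sketch below is an assumption about how such a schedule behaves (linear warmup, then cosine decay with a 10% floor), modeled on minGPT's trainer rather than copied from it; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(tokens_seen, base_lr, warmup_tokens, final_tokens):
    """Learning rate after `tokens_seen` tokens: linear warmup, then cosine decay."""
    if tokens_seen < warmup_tokens:
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult


batch_size, block_size = 64, 128
dataset_len = 199175 - block_size  # token count taken from the run logged further down
warmup = batch_size * 20
final = 2 * dataset_len * block_size

for step in (1, 2, 1323):
    tokens = step * batch_size * block_size
    print(step, f"{lr_at(tokens, 6e-4, warmup, final):e}")
```

Under these assumptions the values for steps 1, 2 and 1323 should land close to the learning rates printed in the training log. Note that with `warmup_tokens=batch_size*20` and 128 tokens per sequence, the warmup is already over after the first batch.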

In [9]:
GENRE_DATA_DIR = '/home/mlepekhin/data/train_genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_bpe_tuned/'
LANG = 'ru'
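
How these constants combine into per-genre input and output paths can be sketched as follows. The file names in `toy_files` are placeholders, and `os.path.splitext` is used here as one way to drop the extension.

```python
from os.path import join as pathjoin, splitext

GENRE_DATA_DIR = '/home/mlepekhin/data/train_genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_bpe_tuned/'
LANG = 'ru'

# Placeholder file names; the real directory listing is not shown in the notebook.
toy_files = ['A1.txt', 'A4.txt']

for name in toy_files:
    label, _ext = splitext(name)  # 'A1.txt' -> 'A1'
    source = pathjoin(GENRE_DATA_DIR, LANG, name)
    target = pathjoin(GPT_MODELS_DIR, LANG, label)
    print(source, '->', target)
```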

In [10]:
#train_gpt_generator(
#        pathjoin(GENRE_DATA_DIR, LANG, 'A1.txt'),
#        pathjoin(GPT_MODELS_DIR, LANG, 'A1')
#)

In [11]:
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    # drop the 4-character extension (e.g. '.txt') to get the genre label
    label = train_text_file[:-4]
    train_gpt_generator(
        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
        pathjoin(GPT_MODELS_DIR, LANG, label)
    )

  0%|          | 0/10 [00:00<?, ?it/s]

data has 199175 characters, 119547 unique.




epoch 1 iter 0: train loss 11.73162. lr 6.000000e-04:   0%|          | 0/3111 [00:02<?, ?it/s][A
epoch 1 iter 0: train loss 11.73162. lr 6.000000e-04:   0%|          | 1/3111 [00:02<2:26:22,  2.82s/it][A
epoch 1 iter 1: train loss 11.26046. lr 5.999999e-04:   0%|          | 1/3111 [00:03<2:26:22,  2.82s/it][A
epoch 1 iter 1: train loss 11.26046. lr 5.999999e-04:   0%|          | 2/3111 [00:03<1:48:01,  2.08s/it][A
epoch 1 iter 2: train loss 11.01675. lr 5.999997e-04:   0%|          | 2/3111 [00:03<1:48:01,  2.08s/it][A
epoch 1 iter 2: train loss 11.01675. lr 5.999997e-04:   0%|          | 3/3111 [00:03<1:21:07,  1.57s/it][A
epoch 1 iter 3: train loss 10.82698. lr 5.999994e-04:   0%|          | 3/3111 [00:03<1:21:07,  1.57s/it][A
epoch 1 iter 3: train loss 10.82698. lr 5.999994e-04:   0%|          | 4/3111 [00:03<1:02:16,  1.20s/it][A
epoch 1 iter 4: train loss 10.64299. lr 5.999991e-04:   0%|          | 4/3111 [00:04<1:02:16,  1.20s/it][A
epoch 1 iter 4: train loss 10.64299.

epoch 1 iter 36: train loss 6.79981. lr 5.999481e-04:   1%|          | 37/3111 [00:15<18:16,  2.80it/s][A
epoch 1 iter 37: train loss 6.82705. lr 5.999452e-04:   1%|          | 37/3111 [00:15<18:16,  2.80it/s][A
epoch 1 iter 37: train loss 6.82705. lr 5.999452e-04:   1%|          | 38/3111 [00:15<18:15,  2.81it/s][A
epoch 1 iter 38: train loss 6.75523. lr 5.999423e-04:   1%|          | 38/3111 [00:16<18:15,  2.81it/s][A
epoch 1 iter 38: train loss 6.75523. lr 5.999423e-04:   1%|▏         | 39/3111 [00:16<18:14,  2.81it/s][A
epoch 1 iter 39: train loss 6.71545. lr 5.999393e-04:   1%|▏         | 39/3111 [00:16<18:14,  2.81it/s][A
epoch 1 iter 39: train loss 6.71545. lr 5.999393e-04:   1%|▏         | 40/3111 [00:16<18:13,  2.81it/s][A
epoch 1 iter 40: train loss 6.72391. lr 5.999362e-04:   1%|▏         | 40/3111 [00:17<18:13,  2.81it/s][A
epoch 1 iter 40: train loss 6.72391. lr 5.999362e-04:   1%|▏         | 41/3111 [00:17<18:12,  2.81it/s][A
epoch 1 iter 41: train loss 6.81989. 

epoch 1 iter 74: train loss 6.25344. lr 5.997857e-04:   2%|▏         | 75/3111 [00:29<17:59,  2.81it/s][A
epoch 1 iter 75: train loss 6.36253. lr 5.997799e-04:   2%|▏         | 75/3111 [00:29<17:59,  2.81it/s][A
epoch 1 iter 75: train loss 6.36253. lr 5.997799e-04:   2%|▏         | 76/3111 [00:29<17:59,  2.81it/s][A
epoch 1 iter 76: train loss 5.91137. lr 5.997741e-04:   2%|▏         | 76/3111 [00:29<17:59,  2.81it/s][A
epoch 1 iter 76: train loss 5.91137. lr 5.997741e-04:   2%|▏         | 77/3111 [00:29<17:59,  2.81it/s][A
epoch 1 iter 77: train loss 6.21149. lr 5.997682e-04:   2%|▏         | 77/3111 [00:30<17:59,  2.81it/s][A
epoch 1 iter 77: train loss 6.21149. lr 5.997682e-04:   3%|▎         | 78/3111 [00:30<17:58,  2.81it/s][A
epoch 1 iter 78: train loss 6.18482. lr 5.997622e-04:   3%|▎         | 78/3111 [00:30<17:58,  2.81it/s][A
epoch 1 iter 78: train loss 6.18482. lr 5.997622e-04:   3%|▎         | 79/3111 [00:30<17:58,  2.81it/s][A
epoch 1 iter 79: train loss 6.11024. 

epoch 1 iter 112: train loss 5.61508. lr 5.995129e-04:   4%|▎         | 113/3111 [00:42<17:47,  2.81it/s][A
epoch 1 iter 113: train loss 5.67203. lr 5.995042e-04:   4%|▎         | 113/3111 [00:43<17:47,  2.81it/s][A
epoch 1 iter 113: train loss 5.67203. lr 5.995042e-04:   4%|▎         | 114/3111 [00:43<17:47,  2.81it/s][A
epoch 1 iter 114: train loss 5.59683. lr 5.994955e-04:   4%|▎         | 114/3111 [00:43<17:47,  2.81it/s][A
epoch 1 iter 114: train loss 5.59683. lr 5.994955e-04:   4%|▎         | 115/3111 [00:43<17:46,  2.81it/s][A
epoch 1 iter 115: train loss 5.73192. lr 5.994866e-04:   4%|▎         | 115/3111 [00:43<17:46,  2.81it/s][A
epoch 1 iter 115: train loss 5.73192. lr 5.994866e-04:   4%|▎         | 116/3111 [00:43<17:46,  2.81it/s][A
epoch 1 iter 116: train loss 5.59015. lr 5.994777e-04:   4%|▎         | 116/3111 [00:44<17:46,  2.81it/s][A
epoch 1 iter 116: train loss 5.59015. lr 5.994777e-04:   4%|▍         | 117/3111 [00:44<17:46,  2.81it/s][A
epoch 1 iter 117: t

epoch 1 iter 150: train loss 5.30927. lr 5.991297e-04:   5%|▍         | 150/3111 [00:56<17:33,  2.81it/s][A
epoch 1 iter 150: train loss 5.30927. lr 5.991297e-04:   5%|▍         | 151/3111 [00:56<17:32,  2.81it/s][A
epoch 1 iter 151: train loss 5.29876. lr 5.991182e-04:   5%|▍         | 151/3111 [00:56<17:32,  2.81it/s][A
epoch 1 iter 151: train loss 5.29876. lr 5.991182e-04:   5%|▍         | 152/3111 [00:56<17:32,  2.81it/s][A
epoch 1 iter 152: train loss 5.24560. lr 5.991065e-04:   5%|▍         | 152/3111 [00:56<17:32,  2.81it/s][A
epoch 1 iter 152: train loss 5.24560. lr 5.991065e-04:   5%|▍         | 153/3111 [00:56<17:31,  2.81it/s][A
epoch 1 iter 153: train loss 5.30597. lr 5.990948e-04:   5%|▍         | 153/3111 [00:57<17:31,  2.81it/s][A
epoch 1 iter 153: train loss 5.30597. lr 5.990948e-04:   5%|▍         | 154/3111 [00:57<17:31,  2.81it/s][A
epoch 1 iter 154: train loss 5.27852. lr 5.990830e-04:   5%|▍         | 154/3111 [00:57<17:31,  2.81it/s][A
epoch 1 iter 154: t

epoch 1 iter 187: train loss 4.92864. lr 5.986508e-04:   6%|▌         | 188/3111 [01:09<17:21,  2.81it/s][A
epoch 1 iter 188: train loss 5.05868. lr 5.986364e-04:   6%|▌         | 188/3111 [01:09<17:21,  2.81it/s][A
epoch 1 iter 188: train loss 5.05868. lr 5.986364e-04:   6%|▌         | 189/3111 [01:09<17:20,  2.81it/s][A
epoch 1 iter 189: train loss 5.09041. lr 5.986220e-04:   6%|▌         | 189/3111 [01:10<17:20,  2.81it/s][A
epoch 1 iter 189: train loss 5.09041. lr 5.986220e-04:   6%|▌         | 190/3111 [01:10<17:19,  2.81it/s][A
epoch 1 iter 190: train loss 4.91657. lr 5.986074e-04:   6%|▌         | 190/3111 [01:10<17:19,  2.81it/s][A
epoch 1 iter 190: train loss 4.91657. lr 5.986074e-04:   6%|▌         | 191/3111 [01:10<17:18,  2.81it/s][A
epoch 1 iter 191: train loss 5.01608. lr 5.985928e-04:   6%|▌         | 191/3111 [01:10<17:18,  2.81it/s][A
epoch 1 iter 191: train loss 5.01608. lr 5.985928e-04:   6%|▌         | 192/3111 [01:10<17:18,  2.81it/s][A
epoch 1 iter 192: t

epoch 1 iter 225: train loss 4.72046. lr 5.980504e-04:   7%|▋         | 225/3111 [01:22<17:07,  2.81it/s][A
epoch 1 iter 225: train loss 4.72046. lr 5.980504e-04:   7%|▋         | 226/3111 [01:22<17:06,  2.81it/s][A
epoch 1 iter 226: train loss 4.85751. lr 5.980331e-04:   7%|▋         | 226/3111 [01:23<17:06,  2.81it/s][A
epoch 1 iter 226: train loss 4.85751. lr 5.980331e-04:   7%|▋         | 227/3111 [01:23<17:06,  2.81it/s][A
epoch 1 iter 227: train loss 4.78520. lr 5.980157e-04:   7%|▋         | 227/3111 [01:23<17:06,  2.81it/s][A
epoch 1 iter 227: train loss 4.78520. lr 5.980157e-04:   7%|▋         | 228/3111 [01:23<17:05,  2.81it/s][A
epoch 1 iter 228: train loss 4.69332. lr 5.979983e-04:   7%|▋         | 228/3111 [01:23<17:05,  2.81it/s][A
epoch 1 iter 228: train loss 4.69332. lr 5.979983e-04:   7%|▋         | 229/3111 [01:23<17:05,  2.81it/s][A
epoch 1 iter 229: train loss 4.62493. lr 5.979808e-04:   7%|▋         | 229/3111 [01:24<17:05,  2.81it/s][A
epoch 1 iter 229: t

epoch 1 iter 262: train loss 4.63595. lr 5.973603e-04:   8%|▊         | 263/3111 [01:36<16:54,  2.81it/s][A
epoch 1 iter 263: train loss 4.51972. lr 5.973402e-04:   8%|▊         | 263/3111 [01:36<16:54,  2.81it/s][A
epoch 1 iter 263: train loss 4.51972. lr 5.973402e-04:   8%|▊         | 264/3111 [01:36<16:54,  2.81it/s][A
epoch 1 iter 264: train loss 4.49837. lr 5.973200e-04:   8%|▊         | 264/3111 [01:36<16:54,  2.81it/s][A
epoch 1 iter 264: train loss 4.49837. lr 5.973200e-04:   9%|▊         | 265/3111 [01:36<16:53,  2.81it/s][A
epoch 1 iter 265: train loss 4.39655. lr 5.972998e-04:   9%|▊         | 265/3111 [01:37<16:53,  2.81it/s][A
epoch 1 iter 265: train loss 4.39655. lr 5.972998e-04:   9%|▊         | 266/3111 [01:37<16:53,  2.81it/s][A
epoch 1 iter 266: train loss 4.55501. lr 5.972794e-04:   9%|▊         | 266/3111 [01:37<16:53,  2.81it/s][A
epoch 1 iter 266: train loss 4.55501. lr 5.972794e-04:   9%|▊         | 267/3111 [01:37<16:53,  2.81it/s][A
epoch 1 iter 267: t

epoch 1 iter 300: train loss 4.30797. lr 5.965434e-04:  10%|▉         | 300/3111 [01:49<16:40,  2.81it/s][A
epoch 1 iter 300: train loss 4.30797. lr 5.965434e-04:  10%|▉         | 301/3111 [01:49<16:40,  2.81it/s][A
epoch 1 iter 301: train loss 4.29190. lr 5.965204e-04:  10%|▉         | 301/3111 [01:49<16:40,  2.81it/s][A
epoch 1 iter 301: train loss 4.29190. lr 5.965204e-04:  10%|▉         | 302/3111 [01:49<16:39,  2.81it/s][A
epoch 1 iter 302: train loss 4.30995. lr 5.964974e-04:  10%|▉         | 302/3111 [01:50<16:39,  2.81it/s][A
epoch 1 iter 302: train loss 4.30995. lr 5.964974e-04:  10%|▉         | 303/3111 [01:50<16:39,  2.81it/s][A
epoch 1 iter 303: train loss 4.24681. lr 5.964743e-04:  10%|▉         | 303/3111 [01:50<16:39,  2.81it/s][A
epoch 1 iter 303: train loss 4.24681. lr 5.964743e-04:  10%|▉         | 304/3111 [01:50<16:39,  2.81it/s][A
epoch 1 iter 304: train loss 4.19774. lr 5.964511e-04:  10%|▉         | 304/3111 [01:51<16:39,  2.81it/s][A
epoch 1 iter 304: t

epoch 1 iter 337: train loss 4.10750. lr 5.956431e-04:  11%|█         | 338/3111 [02:02<16:30,  2.80it/s][A
epoch 1 iter 338: train loss 4.08425. lr 5.956173e-04:  11%|█         | 338/3111 [02:03<16:30,  2.80it/s][A
epoch 1 iter 338: train loss 4.08425. lr 5.956173e-04:  11%|█         | 339/3111 [02:03<16:28,  2.80it/s][A
epoch 1 iter 339: train loss 3.94106. lr 5.955915e-04:  11%|█         | 339/3111 [02:03<16:28,  2.80it/s][A
epoch 1 iter 339: train loss 3.94106. lr 5.955915e-04:  11%|█         | 340/3111 [02:03<16:28,  2.80it/s][A
epoch 1 iter 340: train loss 4.04471. lr 5.955656e-04:  11%|█         | 340/3111 [02:03<16:28,  2.80it/s][A
epoch 1 iter 340: train loss 4.04471. lr 5.955656e-04:  11%|█         | 341/3111 [02:03<16:28,  2.80it/s][A
epoch 1 iter 341: train loss 4.09309. lr 5.955396e-04:  11%|█         | 341/3111 [02:04<16:28,  2.80it/s][A
epoch 1 iter 341: train loss 4.09309. lr 5.955396e-04:  11%|█         | 342/3111 [02:04<16:27,  2.80it/s][A
epoch 1 iter 342: t

epoch 1 iter 375: train loss 3.71651. lr 5.946110e-04:  12%|█▏        | 375/3111 [02:16<16:16,  2.80it/s][A
epoch 1 iter 375: train loss 3.71651. lr 5.946110e-04:  12%|█▏        | 376/3111 [02:16<16:15,  2.80it/s][A
epoch 1 iter 376: train loss 3.75587. lr 5.945823e-04:  12%|█▏        | 376/3111 [02:16<16:15,  2.80it/s][A
epoch 1 iter 376: train loss 3.75587. lr 5.945823e-04:  12%|█▏        | 377/3111 [02:16<16:15,  2.80it/s][A
epoch 1 iter 377: train loss 3.83119. lr 5.945536e-04:  12%|█▏        | 377/3111 [02:17<16:15,  2.80it/s][A
epoch 1 iter 377: train loss 3.83119. lr 5.945536e-04:  12%|█▏        | 378/3111 [02:17<16:15,  2.80it/s][A
epoch 1 iter 378: train loss 3.79385. lr 5.945248e-04:  12%|█▏        | 378/3111 [02:17<16:15,  2.80it/s][A
epoch 1 iter 378: train loss 3.79385. lr 5.945248e-04:  12%|█▏        | 379/3111 [02:17<16:15,  2.80it/s][A
epoch 1 iter 379: train loss 3.88286. lr 5.944960e-04:  12%|█▏        | 379/3111 [02:17<16:15,  2.80it/s][A
epoch 1 iter 379: t

epoch 1 iter 412: train loss 3.64077. lr 5.935017e-04:  13%|█▎        | 413/3111 [02:29<16:03,  2.80it/s][A
epoch 1 iter 413: train loss 3.58570. lr 5.934703e-04:  13%|█▎        | 413/3111 [02:29<16:03,  2.80it/s][A
epoch 1 iter 413: train loss 3.58570. lr 5.934703e-04:  13%|█▎        | 414/3111 [02:29<16:03,  2.80it/s][A
epoch 1 iter 414: train loss 3.58374. lr 5.934388e-04:  13%|█▎        | 414/3111 [02:30<16:03,  2.80it/s][A
epoch 1 iter 414: train loss 3.58374. lr 5.934388e-04:  13%|█▎        | 415/3111 [02:30<16:02,  2.80it/s][A
epoch 1 iter 415: train loss 3.60333. lr 5.934073e-04:  13%|█▎        | 415/3111 [02:30<16:02,  2.80it/s][A
epoch 1 iter 415: train loss 3.60333. lr 5.934073e-04:  13%|█▎        | 416/3111 [02:30<16:02,  2.80it/s][A
epoch 1 iter 416: train loss 3.59608. lr 5.933756e-04:  13%|█▎        | 416/3111 [02:30<16:02,  2.80it/s][A
epoch 1 iter 416: train loss 3.59608. lr 5.933756e-04:  13%|█▎        | 417/3111 [02:30<16:01,  2.80it/s][A
epoch 1 iter 417: t

epoch 1 iter 450: train loss 3.39375. lr 5.922558e-04:  14%|█▍        | 450/3111 [02:43<15:50,  2.80it/s][A
epoch 1 iter 450: train loss 3.39375. lr 5.922558e-04:  14%|█▍        | 451/3111 [02:43<15:49,  2.80it/s][A
epoch 1 iter 451: train loss 3.49904. lr 5.922216e-04:  14%|█▍        | 451/3111 [02:43<15:49,  2.80it/s][A
epoch 1 iter 451: train loss 3.49904. lr 5.922216e-04:  15%|█▍        | 452/3111 [02:43<15:49,  2.80it/s][A
epoch 1 iter 452: train loss 3.43159. lr 5.921872e-04:  15%|█▍        | 452/3111 [02:43<15:49,  2.80it/s][A
epoch 1 iter 452: train loss 3.43159. lr 5.921872e-04:  15%|█▍        | 453/3111 [02:43<15:49,  2.80it/s][A
epoch 1 iter 453: train loss 3.42052. lr 5.921529e-04:  15%|█▍        | 453/3111 [02:44<15:49,  2.80it/s][A
epoch 1 iter 453: train loss 3.42052. lr 5.921529e-04:  15%|█▍        | 454/3111 [02:44<15:49,  2.80it/s][A
epoch 1 iter 454: train loss 3.32317. lr 5.921184e-04:  15%|█▍        | 454/3111 [02:44<15:49,  2.80it/s][A
epoch 1 iter 454: t

epoch 1 iter 487: train loss 3.12297. lr 5.909392e-04:  16%|█▌        | 488/3111 [02:56<15:39,  2.79it/s][A
epoch 1 iter 488: train loss 3.28032. lr 5.909022e-04:  16%|█▌        | 488/3111 [02:56<15:39,  2.79it/s][A
epoch 1 iter 488: train loss 3.28032. lr 5.909022e-04:  16%|█▌        | 489/3111 [02:56<15:39,  2.79it/s][A
epoch 1 iter 489: train loss 3.20531. lr 5.908652e-04:  16%|█▌        | 489/3111 [02:57<15:39,  2.79it/s][A
epoch 1 iter 489: train loss 3.20531. lr 5.908652e-04:  16%|█▌        | 490/3111 [02:57<15:38,  2.79it/s][A
epoch 1 iter 490: train loss 3.19986. lr 5.908280e-04:  16%|█▌        | 490/3111 [02:57<15:38,  2.79it/s][A
epoch 1 iter 490: train loss 3.19986. lr 5.908280e-04:  16%|█▌        | 491/3111 [02:57<15:38,  2.79it/s][A
epoch 1 iter 491: train loss 3.10766. lr 5.907908e-04:  16%|█▌        | 491/3111 [02:57<15:38,  2.79it/s][A
epoch 1 iter 491: train loss 3.10766. lr 5.907908e-04:  16%|█▌        | 492/3111 [02:57<15:37,  2.79it/s][A
epoch 1 iter 492: t

epoch 1 iter 525: train loss 2.88081. lr 5.894813e-04:  17%|█▋        | 525/3111 [03:09<15:25,  2.79it/s][A
epoch 1 iter 525: train loss 2.88081. lr 5.894813e-04:  17%|█▋        | 526/3111 [03:09<15:25,  2.79it/s][A
epoch 1 iter 526: train loss 2.89670. lr 5.894415e-04:  17%|█▋        | 526/3111 [03:10<15:25,  2.79it/s][A
epoch 1 iter 526: train loss 2.89670. lr 5.894415e-04:  17%|█▋        | 527/3111 [03:10<15:24,  2.79it/s][A
epoch 1 iter 527: train loss 3.02709. lr 5.894016e-04:  17%|█▋        | 527/3111 [03:10<15:24,  2.79it/s][A
epoch 1 iter 527: train loss 3.02709. lr 5.894016e-04:  17%|█▋        | 528/3111 [03:10<15:24,  2.79it/s][A
epoch 1 iter 528: train loss 3.03342. lr 5.893617e-04:  17%|█▋        | 528/3111 [03:11<15:24,  2.79it/s][A
epoch 1 iter 528: train loss 3.03342. lr 5.893617e-04:  17%|█▋        | 529/3111 [03:11<15:24,  2.79it/s][A
epoch 1 iter 529: train loss 2.93586. lr 5.893217e-04:  17%|█▋        | 529/3111 [03:11<15:24,  2.79it/s][A
epoch 1 iter 529: t

epoch 1 iter 562: train loss 2.88142. lr 5.879593e-04:  18%|█▊        | 563/3111 [03:23<15:12,  2.79it/s][A
epoch 1 iter 563: train loss 2.80847. lr 5.879168e-04:  18%|█▊        | 563/3111 [03:23<15:12,  2.79it/s][A
epoch 1 iter 563: train loss 2.80847. lr 5.879168e-04:  18%|█▊        | 564/3111 [03:23<15:11,  2.79it/s][A
epoch 1 iter 564: train loss 2.85514. lr 5.878742e-04:  18%|█▊        | 564/3111 [03:23<15:11,  2.79it/s][A
epoch 1 iter 564: train loss 2.85514. lr 5.878742e-04:  18%|█▊        | 565/3111 [03:23<15:11,  2.79it/s][A
epoch 1 iter 565: train loss 2.75944. lr 5.878315e-04:  18%|█▊        | 565/3111 [03:24<15:11,  2.79it/s][A
epoch 1 iter 565: train loss 2.75944. lr 5.878315e-04:  18%|█▊        | 566/3111 [03:24<15:11,  2.79it/s][A
epoch 1 iter 566: train loss 2.81569. lr 5.877888e-04:  18%|█▊        | 566/3111 [03:24<15:11,  2.79it/s][A
epoch 1 iter 566: train loss 2.81569. lr 5.877888e-04:  18%|█▊        | 567/3111 [03:24<15:10,  2.79it/s][A
epoch 1 iter 567: t

epoch 1 iter 600: train loss 2.74636. lr 5.862915e-04:  19%|█▉        | 600/3111 [03:36<14:58,  2.79it/s][A
epoch 1 iter 600: train loss 2.74636. lr 5.862915e-04:  19%|█▉        | 601/3111 [03:36<14:58,  2.79it/s][A
epoch 1 iter 601: train loss 2.57258. lr 5.862462e-04:  19%|█▉        | 601/3111 [03:37<14:58,  2.79it/s][A
epoch 1 iter 601: train loss 2.57258. lr 5.862462e-04:  19%|█▉        | 602/3111 [03:37<14:58,  2.79it/s][A
epoch 1 iter 602: train loss 2.68424. lr 5.862008e-04:  19%|█▉        | 602/3111 [03:37<14:58,  2.79it/s][A
epoch 1 iter 602: train loss 2.68424. lr 5.862008e-04:  19%|█▉        | 603/3111 [03:37<14:58,  2.79it/s][A
epoch 1 iter 603: train loss 2.65269. lr 5.861554e-04:  19%|█▉        | 603/3111 [03:37<14:58,  2.79it/s][A
epoch 1 iter 603: train loss 2.65269. lr 5.861554e-04:  19%|█▉        | 604/3111 [03:37<14:57,  2.79it/s][A
epoch 1 iter 604: train loss 2.58975. lr 5.861098e-04:  19%|█▉        | 604/3111 [03:38<14:57,  2.79it/s][A
epoch 1 iter 604: t

epoch 1 iter 637: train loss 2.48310. lr 5.845663e-04:  21%|██        | 638/3111 [03:50<14:46,  2.79it/s][A
epoch 1 iter 638: train loss 2.48229. lr 5.845183e-04:  21%|██        | 638/3111 [03:50<14:46,  2.79it/s][A
epoch 1 iter 638: train loss 2.48229. lr 5.845183e-04:  21%|██        | 639/3111 [03:50<14:45,  2.79it/s][A
epoch 1 iter 639: train loss 2.51419. lr 5.844702e-04:  21%|██        | 639/3111 [03:50<14:45,  2.79it/s][A
epoch 1 iter 639: train loss 2.51419. lr 5.844702e-04:  21%|██        | 640/3111 [03:50<14:45,  2.79it/s][A
epoch 1 iter 640: train loss 2.47333. lr 5.844220e-04:  21%|██        | 640/3111 [03:51<14:45,  2.79it/s][A
epoch 1 iter 640: train loss 2.47333. lr 5.844220e-04:  21%|██        | 641/3111 [03:51<14:44,  2.79it/s][A
epoch 1 iter 641: train loss 2.55469. lr 5.843738e-04:  21%|██        | 641/3111 [03:51<14:44,  2.79it/s][A
epoch 1 iter 641: train loss 2.55469. lr 5.843738e-04:  21%|██        | 642/3111 [03:51<14:44,  2.79it/s][A
epoch 1 iter 642: t

epoch 1 iter 675: train loss 2.32225. lr 5.826910e-04:  22%|██▏       | 675/3111 [04:03<14:33,  2.79it/s][A
epoch 1 iter 675: train loss 2.32225. lr 5.826910e-04:  22%|██▏       | 676/3111 [04:03<14:32,  2.79it/s][A
epoch 1 iter 676: train loss 2.18339. lr 5.826402e-04:  22%|██▏       | 676/3111 [04:04<14:32,  2.79it/s][A
epoch 1 iter 676: train loss 2.18339. lr 5.826402e-04:  22%|██▏       | 677/3111 [04:04<14:32,  2.79it/s][A
epoch 1 iter 677: train loss 2.33022. lr 5.825894e-04:  22%|██▏       | 677/3111 [04:04<14:32,  2.79it/s][A
epoch 1 iter 677: train loss 2.33022. lr 5.825894e-04:  22%|██▏       | 678/3111 [04:04<14:31,  2.79it/s][A
epoch 1 iter 678: train loss 2.25695. lr 5.825385e-04:  22%|██▏       | 678/3111 [04:04<14:31,  2.79it/s][A
epoch 1 iter 678: train loss 2.25695. lr 5.825385e-04:  22%|██▏       | 679/3111 [04:04<14:31,  2.79it/s][A
epoch 1 iter 679: train loss 2.19643. lr 5.824875e-04:  22%|██▏       | 679/3111 [04:05<14:31,  2.79it/s][A
epoch 1 iter 679: t

epoch 1 iter 712: train loss 2.19475. lr 5.807649e-04:  23%|██▎       | 713/3111 [04:16<14:19,  2.79it/s][A
epoch 1 iter 713: train loss 2.07246. lr 5.807115e-04:  23%|██▎       | 713/3111 [04:17<14:19,  2.79it/s][A
epoch 1 iter 713: train loss 2.07246. lr 5.807115e-04:  23%|██▎       | 714/3111 [04:17<14:19,  2.79it/s][A
epoch 1 iter 714: train loss 2.06308. lr 5.806580e-04:  23%|██▎       | 714/3111 [04:17<14:19,  2.79it/s][A
epoch 1 iter 714: train loss 2.06308. lr 5.806580e-04:  23%|██▎       | 715/3111 [04:17<14:18,  2.79it/s][A
epoch 1 iter 715: train loss 2.07691. lr 5.806045e-04:  23%|██▎       | 715/3111 [04:18<14:18,  2.79it/s][A
epoch 1 iter 715: train loss 2.07691. lr 5.806045e-04:  23%|██▎       | 716/3111 [04:18<14:18,  2.79it/s][A
epoch 1 iter 716: train loss 2.12139. lr 5.805508e-04:  23%|██▎       | 716/3111 [04:18<14:18,  2.79it/s][A
epoch 1 iter 716: train loss 2.12139. lr 5.805508e-04:  23%|██▎       | 717/3111 [04:18<14:17,  2.79it/s][A
epoch 1 iter 717: t

epoch 1 iter 750: train loss 1.95669. lr 5.786848e-04:  24%|██▍       | 750/3111 [04:30<14:06,  2.79it/s][A
epoch 1 iter 750: train loss 1.95669. lr 5.786848e-04:  24%|██▍       | 751/3111 [04:30<14:05,  2.79it/s][A
epoch 1 iter 751: train loss 1.95417. lr 5.786287e-04:  24%|██▍       | 751/3111 [04:30<14:05,  2.79it/s][A
epoch 1 iter 751: train loss 1.95417. lr 5.786287e-04:  24%|██▍       | 752/3111 [04:30<14:05,  2.79it/s][A
epoch 1 iter 752: train loss 1.86735. lr 5.785725e-04:  24%|██▍       | 752/3111 [04:31<14:05,  2.79it/s][A
epoch 1 iter 752: train loss 1.86735. lr 5.785725e-04:  24%|██▍       | 753/3111 [04:31<14:04,  2.79it/s][A
epoch 1 iter 753: train loss 1.87134. lr 5.785162e-04:  24%|██▍       | 753/3111 [04:31<14:04,  2.79it/s][A
epoch 1 iter 753: train loss 1.87134. lr 5.785162e-04:  24%|██▍       | 754/3111 [04:31<14:04,  2.79it/s][A
epoch 1 iter 754: train loss 1.92540. lr 5.784598e-04:  24%|██▍       | 754/3111 [04:31<14:04,  2.79it/s][A
epoch 1 iter 754: t

epoch 1 iter 787: train loss 1.81879. lr 5.765608e-04:  25%|██▌       | 788/3111 [04:43<13:53,  2.79it/s][A
epoch 1 iter 788: train loss 1.84804. lr 5.765020e-04:  25%|██▌       | 788/3111 [04:44<13:53,  2.79it/s][A
epoch 1 iter 788: train loss 1.84804. lr 5.765020e-04:  25%|██▌       | 789/3111 [04:44<13:52,  2.79it/s][A
epoch 1 iter 789: train loss 1.88353. lr 5.764432e-04:  25%|██▌       | 789/3111 [04:44<13:52,  2.79it/s][A
epoch 1 iter 789: train loss 1.88353. lr 5.764432e-04:  25%|██▌       | 790/3111 [04:44<13:51,  2.79it/s][A
epoch 1 iter 790: train loss 1.75915. lr 5.763843e-04:  25%|██▌       | 790/3111 [04:44<13:51,  2.79it/s][A
epoch 1 iter 790: train loss 1.75915. lr 5.763843e-04:  25%|██▌       | 791/3111 [04:44<13:51,  2.79it/s][A
epoch 1 iter 791: train loss 1.82681. lr 5.763253e-04:  25%|██▌       | 791/3111 [04:45<13:51,  2.79it/s][A
epoch 1 iter 791: train loss 1.82681. lr 5.763253e-04:  25%|██▌       | 792/3111 [04:45<13:51,  2.79it/s][A
epoch 1 iter 792: t

epoch 1 iter 825: train loss 1.67961. lr 5.742788e-04:  27%|██▋       | 825/3111 [04:57<13:40,  2.79it/s][A
epoch 1 iter 825: train loss 1.67961. lr 5.742788e-04:  27%|██▋       | 826/3111 [04:57<13:40,  2.79it/s][A
epoch 1 iter 826: train loss 1.72372. lr 5.742174e-04:  27%|██▋       | 826/3111 [04:57<13:40,  2.79it/s][A
epoch 1 iter 826: train loss 1.72372. lr 5.742174e-04:  27%|██▋       | 827/3111 [04:57<13:39,  2.79it/s][A
epoch 1 iter 827: train loss 1.64413. lr 5.741559e-04:  27%|██▋       | 827/3111 [04:58<13:39,  2.79it/s][A
epoch 1 iter 827: train loss 1.64413. lr 5.741559e-04:  27%|██▋       | 828/3111 [04:58<13:38,  2.79it/s][A
epoch 1 iter 828: train loss 1.71524. lr 5.740943e-04:  27%|██▋       | 828/3111 [04:58<13:38,  2.79it/s][A
epoch 1 iter 828: train loss 1.71524. lr 5.740943e-04:  27%|██▋       | 829/3111 [04:58<13:38,  2.79it/s][A
epoch 1 iter 829: train loss 1.75596. lr 5.740327e-04:  27%|██▋       | 829/3111 [04:58<13:38,  2.79it/s][A
epoch 1 iter 829: t

epoch 1 iter 862: train loss 1.55611. lr 5.719598e-04:  28%|██▊       | 863/3111 [05:10<13:25,  2.79it/s][A
epoch 1 iter 863: train loss 1.61339. lr 5.718958e-04:  28%|██▊       | 863/3111 [05:11<13:25,  2.79it/s][A
epoch 1 iter 863: train loss 1.61339. lr 5.718958e-04:  28%|██▊       | 864/3111 [05:11<13:25,  2.79it/s][A
epoch 1 iter 864: train loss 1.56998. lr 5.718317e-04:  28%|██▊       | 864/3111 [05:11<13:25,  2.79it/s][A
epoch 1 iter 864: train loss 1.56998. lr 5.718317e-04:  28%|██▊       | 865/3111 [05:11<13:25,  2.79it/s][A
epoch 1 iter 865: train loss 1.50752. lr 5.717676e-04:  28%|██▊       | 865/3111 [05:11<13:25,  2.79it/s][A
epoch 1 iter 865: train loss 1.50752. lr 5.717676e-04:  28%|██▊       | 866/3111 [05:11<13:24,  2.79it/s][A
epoch 1 iter 866: train loss 1.57643. lr 5.717034e-04:  28%|██▊       | 866/3111 [05:12<13:24,  2.79it/s][A
epoch 1 iter 866: train loss 1.57643. lr 5.717034e-04:  28%|██▊       | 867/3111 [05:12<13:24,  2.79it/s][A
epoch 1 iter 867: t

epoch 1 iter 900: train loss 1.38577. lr 5.694792e-04:  29%|██▉       | 900/3111 [05:24<13:13,  2.79it/s][A
epoch 1 iter 900: train loss 1.38577. lr 5.694792e-04:  29%|██▉       | 901/3111 [05:24<13:12,  2.79it/s][A
epoch 1 iter 901: train loss 1.48923. lr 5.694126e-04:  29%|██▉       | 901/3111 [05:24<13:12,  2.79it/s][A
epoch 1 iter 901: train loss 1.48923. lr 5.694126e-04:  29%|██▉       | 902/3111 [05:24<13:12,  2.79it/s][A
epoch 1 iter 902: train loss 1.49299. lr 5.693459e-04:  29%|██▉       | 902/3111 [05:25<13:12,  2.79it/s][A
epoch 1 iter 902: train loss 1.49299. lr 5.693459e-04:  29%|██▉       | 903/3111 [05:25<13:12,  2.79it/s][A
epoch 1 iter 903: train loss 1.47873. lr 5.692792e-04:  29%|██▉       | 903/3111 [05:25<13:12,  2.79it/s][A
epoch 1 iter 903: train loss 1.47873. lr 5.692792e-04:  29%|██▉       | 904/3111 [05:25<13:11,  2.79it/s][A
epoch 1 iter 904: train loss 1.51380. lr 5.692123e-04:  29%|██▉       | 904/3111 [05:25<13:11,  2.79it/s][A
epoch 1 iter 904: t

epoch 1 iter 1246: train loss 0.79346. lr 5.424527e-04:  40%|████      | 1246/3111 [07:28<11:15,  2.76it/s][A
epoch 1 iter 1246: train loss 0.79346. lr 5.424527e-04:  40%|████      | 1247/3111 [07:28<11:17,  2.75it/s][A
epoch 1 iter 1247: train loss 0.82435. lr 5.423634e-04:  40%|████      | 1247/3111 [07:29<11:17,  2.75it/s][A
epoch 1 iter 1247: train loss 0.82435. lr 5.423634e-04:  40%|████      | 1248/3111 [07:29<11:16,  2.75it/s][A
epoch 1 iter 1248: train loss 0.80077. lr 5.422741e-04:  40%|████      | 1248/3111 [07:29<11:16,  2.75it/s][A
epoch 1 iter 1248: train loss 0.80077. lr 5.422741e-04:  40%|████      | 1249/3111 [07:29<11:17,  2.75it/s][A
epoch 1 iter 1249: train loss 0.74790. lr 5.421847e-04:  40%|████      | 1249/3111 [07:29<11:17,  2.75it/s][A
epoch 1 iter 1249: train loss 0.74790. lr 5.421847e-04:  40%|████      | 1250/3111 [07:29<11:15,  2.75it/s][A
epoch 1 iter 1250: train loss 0.81282. lr 5.420952e-04:  40%|████      | 1250/3111 [07:30<11:15,  2.75it/s][A

epoch 1 iter 1282: train loss 0.76353. lr 5.392002e-04:  41%|████      | 1283/3111 [07:41<11:05,  2.75it/s][A
epoch 1 iter 1283: train loss 0.77595. lr 5.391088e-04:  41%|████      | 1283/3111 [07:42<11:05,  2.75it/s][A
epoch 1 iter 1283: train loss 0.77595. lr 5.391088e-04:  41%|████▏     | 1284/3111 [07:42<11:05,  2.74it/s][A
epoch 1 iter 1284: train loss 0.79242. lr 5.390172e-04:  41%|████▏     | 1284/3111 [07:42<11:05,  2.74it/s][A
epoch 1 iter 1284: train loss 0.79242. lr 5.390172e-04:  41%|████▏     | 1285/3111 [07:42<11:04,  2.75it/s][A
epoch 1 iter 1285: train loss 0.74538. lr 5.389256e-04:  41%|████▏     | 1285/3111 [07:43<11:04,  2.75it/s][A
epoch 1 iter 1285: train loss 0.74538. lr 5.389256e-04:  41%|████▏     | 1286/3111 [07:43<11:02,  2.75it/s][A
epoch 1 iter 1286: train loss 0.71285. lr 5.388339e-04:  41%|████▏     | 1286/3111 [07:43<11:02,  2.75it/s][A
epoch 1 iter 1286: train loss 0.71285. lr 5.388339e-04:  41%|████▏     | 1287/3111 [07:43<10:59,  2.76it/s][A

epoch 1 iter 1319: train loss 0.70212. lr 5.357750e-04:  42%|████▏     | 1319/3111 [07:55<10:52,  2.74it/s][A
epoch 1 iter 1319: train loss 0.70212. lr 5.357750e-04:  42%|████▏     | 1320/3111 [07:55<10:52,  2.74it/s][A
epoch 1 iter 1320: train loss 0.71363. lr 5.356813e-04:  42%|████▏     | 1320/3111 [07:55<10:52,  2.74it/s][A
epoch 1 iter 1320: train loss 0.71363. lr 5.356813e-04:  42%|████▏     | 1321/3111 [07:55<10:50,  2.75it/s][A
epoch 1 iter 1321: train loss 0.73063. lr 5.355875e-04:  42%|████▏     | 1321/3111 [07:56<10:50,  2.75it/s][A
epoch 1 iter 1321: train loss 0.73063. lr 5.355875e-04:  42%|████▏     | 1322/3111 [07:56<10:51,  2.75it/s][A
epoch 1 iter 1322: train loss 0.70283. lr 5.354937e-04:  42%|████▏     | 1322/3111 [07:56<10:51,  2.75it/s][A
epoch 1 iter 1322: train loss 0.70283. lr 5.354937e-04:  43%|████▎     | 1323/3111 [07:56<10:51,  2.74it/s][A
epoch 1 iter 1323: train loss 0.69778. lr 5.353998e-04:  43%|████▎     | 1323/3111 [07:56<10:51,  2.74it/s][A

epoch 1 iter 1355: train loss 0.68379. lr 5.323634e-04:  44%|████▎     | 1356/3111 [08:08<10:40,  2.74it/s][A
epoch 1 iter 1356: train loss 0.70268. lr 5.322675e-04:  44%|████▎     | 1356/3111 [08:08<10:40,  2.74it/s][A
epoch 1 iter 1356: train loss 0.70268. lr 5.322675e-04:  44%|████▎     | 1357/3111 [08:08<10:38,  2.75it/s][A
epoch 1 iter 1357: train loss 0.65622. lr 5.321716e-04:  44%|████▎     | 1357/3111 [08:09<10:38,  2.75it/s][A
epoch 1 iter 1357: train loss 0.65622. lr 5.321716e-04:  44%|████▎     | 1358/3111 [08:09<10:38,  2.75it/s][A
epoch 1 iter 1358: train loss 0.72751. lr 5.320756e-04:  44%|████▎     | 1358/3111 [08:09<10:38,  2.75it/s][A
epoch 1 iter 1358: train loss 0.72751. lr 5.320756e-04:  44%|████▎     | 1359/3111 [08:09<10:39,  2.74it/s][A
epoch 1 iter 1359: train loss 0.70324. lr 5.319795e-04:  44%|████▎     | 1359/3111 [08:09<10:39,  2.74it/s][A
epoch 1 iter 1359: train loss 0.70324. lr 5.319795e-04:  44%|████▎     | 1360/3111 [08:09<10:39,  2.74it/s][A

epoch 1 iter 1392: train loss 0.65247. lr 5.287769e-04:  45%|████▍     | 1392/3111 [08:21<10:27,  2.74it/s][A
epoch 1 iter 1392: train loss 0.65247. lr 5.287769e-04:  45%|████▍     | 1393/3111 [08:21<10:26,  2.74it/s][A
epoch 1 iter 1393: train loss 0.63230. lr 5.286788e-04:  45%|████▍     | 1393/3111 [08:22<10:26,  2.74it/s][A
epoch 1 iter 1393: train loss 0.63230. lr 5.286788e-04:  45%|████▍     | 1394/3111 [08:22<10:25,  2.74it/s][A
epoch 1 iter 1394: train loss 0.64723. lr 5.285807e-04:  45%|████▍     | 1394/3111 [08:22<10:25,  2.74it/s][A
epoch 1 iter 1394: train loss 0.64723. lr 5.285807e-04:  45%|████▍     | 1395/3111 [08:22<11:22,  2.51it/s][A
epoch 1 iter 1395: train loss 0.67366. lr 5.284826e-04:  45%|████▍     | 1395/3111 [08:23<11:22,  2.51it/s][A
epoch 1 iter 1395: train loss 0.67366. lr 5.284826e-04:  45%|████▍     | 1396/3111 [08:23<11:05,  2.58it/s][A
epoch 1 iter 1396: train loss 0.69384. lr 5.283844e-04:  45%|████▍     | 1396/3111 [08:23<11:05,  2.58it/s][A

epoch 1 iter 1428: train loss 0.60716. lr 5.252107e-04:  46%|████▌     | 1429/3111 [08:35<10:10,  2.76it/s][A
epoch 1 iter 1429: train loss 0.63358. lr 5.251105e-04:  46%|████▌     | 1429/3111 [08:35<10:10,  2.76it/s][A
epoch 1 iter 1429: train loss 0.63358. lr 5.251105e-04:  46%|████▌     | 1430/3111 [08:35<10:13,  2.74it/s][A
epoch 1 iter 1430: train loss 0.62615. lr 5.250103e-04:  46%|████▌     | 1430/3111 [08:35<10:13,  2.74it/s][A
epoch 1 iter 1430: train loss 0.62615. lr 5.250103e-04:  46%|████▌     | 1431/3111 [08:35<10:13,  2.74it/s][A
epoch 1 iter 1431: train loss 0.67460. lr 5.249101e-04:  46%|████▌     | 1431/3111 [08:36<10:13,  2.74it/s][A
epoch 1 iter 1431: train loss 0.67460. lr 5.249101e-04:  46%|████▌     | 1432/3111 [08:36<10:11,  2.74it/s][A
epoch 1 iter 1432: train loss 0.67778. lr 5.248098e-04:  46%|████▌     | 1432/3111 [08:36<10:11,  2.74it/s][A
epoch 1 iter 1432: train loss 0.67778. lr 5.248098e-04:  46%|████▌     | 1433/3111 [08:36<10:10,  2.75it/s][A

epoch 1 iter 1465: train loss 0.60719. lr 5.214678e-04:  47%|████▋     | 1465/3111 [08:48<10:00,  2.74it/s][A
epoch 1 iter 1465: train loss 0.60719. lr 5.214678e-04:  47%|████▋     | 1466/3111 [08:48<10:01,  2.74it/s][A
epoch 1 iter 1466: train loss 0.62834. lr 5.213655e-04:  47%|████▋     | 1466/3111 [08:49<10:01,  2.74it/s][A
epoch 1 iter 1466: train loss 0.62834. lr 5.213655e-04:  47%|████▋     | 1467/3111 [08:49<10:01,  2.73it/s][A
epoch 1 iter 1467: train loss 0.56797. lr 5.212632e-04:  47%|████▋     | 1467/3111 [08:49<10:01,  2.73it/s][A
epoch 1 iter 1467: train loss 0.56797. lr 5.212632e-04:  47%|████▋     | 1468/3111 [08:49<10:00,  2.74it/s][A
epoch 1 iter 1468: train loss 0.61293. lr 5.211609e-04:  47%|████▋     | 1468/3111 [08:49<10:00,  2.74it/s][A
epoch 1 iter 1468: train loss 0.61293. lr 5.211609e-04:  47%|████▋     | 1469/3111 [08:49<09:58,  2.74it/s][A
epoch 1 iter 1469: train loss 0.59065. lr 5.210585e-04:  47%|████▋     | 1469/3111 [08:50<09:58,  2.74it/s][A

epoch 1 iter 1501: train loss 0.55277. lr 5.177518e-04:  48%|████▊     | 1502/3111 [09:01<09:47,  2.74it/s][A
epoch 1 iter 1502: train loss 0.55408. lr 5.176476e-04:  48%|████▊     | 1502/3111 [09:02<09:47,  2.74it/s][A
epoch 1 iter 1502: train loss 0.55408. lr 5.176476e-04:  48%|████▊     | 1503/3111 [09:02<09:47,  2.74it/s][A
epoch 1 iter 1503: train loss 0.59032. lr 5.175433e-04:  48%|████▊     | 1503/3111 [09:02<09:47,  2.74it/s][A
epoch 1 iter 1503: train loss 0.59032. lr 5.175433e-04:  48%|████▊     | 1504/3111 [09:02<09:48,  2.73it/s][A
epoch 1 iter 1504: train loss 0.59132. lr 5.174389e-04:  48%|████▊     | 1504/3111 [09:02<09:48,  2.73it/s][A
epoch 1 iter 1504: train loss 0.59132. lr 5.174389e-04:  48%|████▊     | 1505/3111 [09:02<09:48,  2.73it/s][A
epoch 1 iter 1505: train loss 0.64823. lr 5.173345e-04:  48%|████▊     | 1505/3111 [09:03<09:48,  2.73it/s][A
epoch 1 iter 1505: train loss 0.64823. lr 5.173345e-04:  48%|████▊     | 1506/3111 [09:03<09:46,  2.74it/s][A

epoch 1 iter 1538: train loss 0.55566. lr 5.138576e-04:  49%|████▉     | 1538/3111 [09:15<09:34,  2.74it/s][A
epoch 1 iter 1538: train loss 0.55566. lr 5.138576e-04:  49%|████▉     | 1539/3111 [09:15<09:32,  2.75it/s][A
epoch 1 iter 1539: train loss 0.59440. lr 5.137513e-04:  49%|████▉     | 1539/3111 [09:15<09:32,  2.75it/s][A
epoch 1 iter 1539: train loss 0.59440. lr 5.137513e-04:  50%|████▉     | 1540/3111 [09:15<09:33,  2.74it/s][A
epoch 1 iter 1540: train loss 0.56790. lr 5.136450e-04:  50%|████▉     | 1540/3111 [09:16<09:33,  2.74it/s][A
epoch 1 iter 1540: train loss 0.56790. lr 5.136450e-04:  50%|████▉     | 1541/3111 [09:16<09:35,  2.73it/s][A
epoch 1 iter 1541: train loss 0.57848. lr 5.135386e-04:  50%|████▉     | 1541/3111 [09:16<09:35,  2.73it/s][A
epoch 1 iter 1541: train loss 0.57848. lr 5.135386e-04:  50%|████▉     | 1542/3111 [09:16<09:34,  2.73it/s][A
epoch 1 iter 1542: train loss 0.60735. lr 5.134321e-04:  50%|████▉     | 1542/3111 [09:16<09:34,  2.73it/s][A

epoch 1 iter 1574: train loss 0.57627. lr 5.099970e-04:  51%|█████     | 1575/3111 [09:28<09:19,  2.74it/s][A
epoch 1 iter 1575: train loss 0.58947. lr 5.098888e-04:  51%|█████     | 1575/3111 [09:28<09:19,  2.74it/s][A
epoch 1 iter 1575: train loss 0.58947. lr 5.098888e-04:  51%|█████     | 1576/3111 [09:28<09:18,  2.75it/s][A
epoch 1 iter 1576: train loss 0.55770. lr 5.097805e-04:  51%|█████     | 1576/3111 [09:29<09:18,  2.75it/s][A
epoch 1 iter 1576: train loss 0.55770. lr 5.097805e-04:  51%|█████     | 1577/3111 [09:29<09:20,  2.74it/s][A
epoch 1 iter 1577: train loss 0.53561. lr 5.096721e-04:  51%|█████     | 1577/3111 [09:29<09:20,  2.74it/s][A
epoch 1 iter 1577: train loss 0.53561. lr 5.096721e-04:  51%|█████     | 1578/3111 [09:29<09:20,  2.74it/s][A
epoch 1 iter 1578: train loss 0.55610. lr 5.095637e-04:  51%|█████     | 1578/3111 [09:29<09:20,  2.74it/s][A
epoch 1 iter 1578: train loss 0.55610. lr 5.095637e-04:  51%|█████     | 1579/3111 [09:29<09:19,  2.74it/s][A

epoch 1 iter 1611: train loss 0.54326. lr 5.059568e-04:  52%|█████▏    | 1611/3111 [09:42<09:09,  2.73it/s][A
epoch 1 iter 1611: train loss 0.54326. lr 5.059568e-04:  52%|█████▏    | 1612/3111 [09:42<09:07,  2.74it/s][A
epoch 1 iter 1612: train loss 0.57491. lr 5.058466e-04:  52%|█████▏    | 1612/3111 [09:42<09:07,  2.74it/s][A
epoch 1 iter 1612: train loss 0.57491. lr 5.058466e-04:  52%|█████▏    | 1613/3111 [09:42<09:07,  2.74it/s][A
epoch 1 iter 1613: train loss 0.58201. lr 5.057363e-04:  52%|█████▏    | 1613/3111 [09:42<09:07,  2.74it/s][A
epoch 1 iter 1613: train loss 0.58201. lr 5.057363e-04:  52%|█████▏    | 1614/3111 [09:42<09:07,  2.73it/s][A
epoch 1 iter 1614: train loss 0.53963. lr 5.056260e-04:  52%|█████▏    | 1614/3111 [09:43<09:07,  2.73it/s][A
epoch 1 iter 1614: train loss 0.53963. lr 5.056260e-04:  52%|█████▏    | 1615/3111 [09:43<09:07,  2.73it/s][A
epoch 1 iter 1615: train loss 0.55016. lr 5.055157e-04:  52%|█████▏    | 1615/3111 [09:43<09:07,  2.73it/s][A

epoch 1 iter 1647: train loss 0.56478. lr 5.019567e-04:  53%|█████▎    | 1648/3111 [09:55<08:56,  2.72it/s][A
epoch 1 iter 1648: train loss 0.50783. lr 5.018447e-04:  53%|█████▎    | 1648/3111 [09:55<08:56,  2.72it/s][A
epoch 1 iter 1648: train loss 0.50783. lr 5.018447e-04:  53%|█████▎    | 1649/3111 [09:55<08:55,  2.73it/s][A
epoch 1 iter 1649: train loss 0.51646. lr 5.017325e-04:  53%|█████▎    | 1649/3111 [09:55<08:55,  2.73it/s][A
epoch 1 iter 1649: train loss 0.51646. lr 5.017325e-04:  53%|█████▎    | 1650/3111 [09:55<08:53,  2.74it/s][A
epoch 1 iter 1650: train loss 0.55696. lr 5.016204e-04:  53%|█████▎    | 1650/3111 [09:56<08:53,  2.74it/s][A
epoch 1 iter 1650: train loss 0.55696. lr 5.016204e-04:  53%|█████▎    | 1651/3111 [09:56<08:54,  2.73it/s][A
epoch 1 iter 1651: train loss 0.51866. lr 5.015081e-04:  53%|█████▎    | 1651/3111 [09:56<08:54,  2.73it/s][A
epoch 1 iter 1651: train loss 0.51866. lr 5.015081e-04:  53%|█████▎    | 1652/3111 [09:56<08:54,  2.73it/s][A

epoch 1 iter 1684: train loss 0.53314. lr 4.977760e-04:  54%|█████▍    | 1684/3111 [10:08<08:43,  2.73it/s][A
epoch 1 iter 1684: train loss 0.53314. lr 4.977760e-04:  54%|█████▍    | 1685/3111 [10:08<08:43,  2.73it/s][A
epoch 1 iter 1685: train loss 0.52094. lr 4.976620e-04:  54%|█████▍    | 1685/3111 [10:09<08:43,  2.73it/s][A
epoch 1 iter 1685: train loss 0.52094. lr 4.976620e-04:  54%|█████▍    | 1686/3111 [10:09<08:41,  2.73it/s][A
epoch 1 iter 1686: train loss 0.53676. lr 4.975480e-04:  54%|█████▍    | 1686/3111 [10:09<08:41,  2.73it/s][A
epoch 1 iter 1686: train loss 0.53676. lr 4.975480e-04:  54%|█████▍    | 1687/3111 [10:09<08:41,  2.73it/s][A
epoch 1 iter 1687: train loss 0.53056. lr 4.974340e-04:  54%|█████▍    | 1687/3111 [10:09<08:41,  2.73it/s][A
epoch 1 iter 1687: train loss 0.53056. lr 4.974340e-04:  54%|█████▍    | 1688/3111 [10:09<08:41,  2.73it/s][A
epoch 1 iter 1688: train loss 0.52021. lr 4.973199e-04:  54%|█████▍    | 1688/3111 [10:10<08:41,  2.73it/s][A

epoch 1 iter 1720: train loss 0.50992. lr 4.936420e-04:  55%|█████▌    | 1721/3111 [10:21<08:27,  2.74it/s][A
epoch 1 iter 1721: train loss 0.49055. lr 4.935262e-04:  55%|█████▌    | 1721/3111 [10:22<08:27,  2.74it/s][A
epoch 1 iter 1721: train loss 0.49055. lr 4.935262e-04:  55%|█████▌    | 1722/3111 [10:22<08:27,  2.74it/s][A
epoch 1 iter 1722: train loss 0.47962. lr 4.934104e-04:  55%|█████▌    | 1722/3111 [10:22<08:27,  2.74it/s][A
epoch 1 iter 1722: train loss 0.47962. lr 4.934104e-04:  55%|█████▌    | 1723/3111 [10:22<08:26,  2.74it/s][A
epoch 1 iter 1723: train loss 0.50007. lr 4.932945e-04:  55%|█████▌    | 1723/3111 [10:22<08:26,  2.74it/s][A
epoch 1 iter 1723: train loss 0.50007. lr 4.932945e-04:  55%|█████▌    | 1724/3111 [10:22<08:25,  2.74it/s][A
epoch 1 iter 1724: train loss 0.52995. lr 4.931786e-04:  55%|█████▌    | 1724/3111 [10:23<08:25,  2.74it/s][A
epoch 1 iter 1724: train loss 0.52995. lr 4.931786e-04:  55%|█████▌    | 1725/3111 [10:23<08:26,  2.73it/s][A

epoch 2 iter 798: train loss 0.22081. lr 1.822040e-04:  26%|██▌       | 798/3111 [04:52<14:05,  2.74it/s][A
epoch 2 iter 798: train loss 0.22081. lr 1.822040e-04:  26%|██▌       | 799/3111 [04:52<14:08,  2.73it/s][A
epoch 2 iter 799: train loss 0.22843. lr 1.820647e-04:  26%|██▌       | 799/3111 [04:52<14:08,  2.73it/s][A
epoch 2 iter 799: train loss 0.22843. lr 1.820647e-04:  26%|██▌       | 800/3111 [04:52<14:08,  2.72it/s][A
epoch 2 iter 800: train loss 0.22972. lr 1.819254e-04:  26%|██▌       | 800/3111 [04:53<14:08,  2.72it/s][A
epoch 2 iter 800: train loss 0.22972. lr 1.819254e-04:  26%|██▌       | 801/3111 [04:53<14:07,  2.72it/s][A
epoch 2 iter 801: train loss 0.21854. lr 1.817861e-04:  26%|██▌       | 801/3111 [04:53<14:07,  2.72it/s][A
epoch 2 iter 801: train loss 0.21854. lr 1.817861e-04:  26%|██▌       | 802/3111 [04:53<14:09,  2.72it/s][A
epoch 2 iter 802: train loss 0.21836. lr 1.816469e-04:  26%|██▌       | 802/3111 [04:54<14:09,  2.72it/s][A

epoch 2 iter 835: train loss 0.23037. lr 1.770688e-04:  27%|██▋       | 836/3111 [05:06<13:50,  2.74it/s][A
epoch 2 iter 836: train loss 0.24739. lr 1.769306e-04:  27%|██▋       | 836/3111 [05:06<13:50,  2.74it/s][A
epoch 2 iter 836: train loss 0.24739. lr 1.769306e-04:  27%|██▋       | 837/3111 [05:06<13:55,  2.72it/s][A
epoch 2 iter 837: train loss 0.23296. lr 1.767925e-04:  27%|██▋       | 837/3111 [05:06<13:55,  2.72it/s][A
epoch 2 iter 837: train loss 0.23296. lr 1.767925e-04:  27%|██▋       | 838/3111 [05:06<13:57,  2.72it/s][A
epoch 2 iter 838: train loss 0.23203. lr 1.766543e-04:  27%|██▋       | 838/3111 [05:07<13:57,  2.72it/s][A
epoch 2 iter 838: train loss 0.23203. lr 1.766543e-04:  27%|██▋       | 839/3111 [05:07<13:54,  2.72it/s][A
epoch 2 iter 839: train loss 0.22447. lr 1.765162e-04:  27%|██▋       | 839/3111 [05:07<13:54,  2.72it/s][A
epoch 2 iter 839: train loss 0.22447. lr 1.765162e-04:  27%|██▋       | 840/3111 [05:07<13:53,  2.73it/s][A

epoch 2 iter 873: train loss 0.22130. lr 1.718396e-04:  28%|██▊       | 873/3111 [05:20<13:42,  2.72it/s][A
epoch 2 iter 873: train loss 0.22130. lr 1.718396e-04:  28%|██▊       | 874/3111 [05:20<13:39,  2.73it/s][A
epoch 2 iter 874: train loss 0.23816. lr 1.717026e-04:  28%|██▊       | 874/3111 [05:20<13:39,  2.73it/s][A
epoch 2 iter 874: train loss 0.23816. lr 1.717026e-04:  28%|██▊       | 875/3111 [05:20<13:43,  2.72it/s][A
epoch 2 iter 875: train loss 0.22671. lr 1.715656e-04:  28%|██▊       | 875/3111 [05:20<13:43,  2.72it/s][A
epoch 2 iter 875: train loss 0.22671. lr 1.715656e-04:  28%|██▊       | 876/3111 [05:20<13:42,  2.72it/s][A
epoch 2 iter 876: train loss 0.23637. lr 1.714287e-04:  28%|██▊       | 876/3111 [05:21<13:42,  2.72it/s][A
epoch 2 iter 876: train loss 0.23637. lr 1.714287e-04:  28%|██▊       | 877/3111 [05:21<13:41,  2.72it/s][A
epoch 2 iter 877: train loss 0.22527. lr 1.712918e-04:  28%|██▊       | 877/3111 [05:21<13:41,  2.72it/s][A

epoch 2 iter 910: train loss 0.24463. lr 1.667932e-04:  29%|██▉       | 911/3111 [05:33<13:25,  2.73it/s][A
epoch 2 iter 911: train loss 0.22485. lr 1.666575e-04:  29%|██▉       | 911/3111 [05:34<13:25,  2.73it/s][A
epoch 2 iter 911: train loss 0.22485. lr 1.666575e-04:  29%|██▉       | 912/3111 [05:34<13:26,  2.73it/s][A
epoch 2 iter 912: train loss 0.22051. lr 1.665218e-04:  29%|██▉       | 912/3111 [05:34<13:26,  2.73it/s][A
epoch 2 iter 912: train loss 0.22051. lr 1.665218e-04:  29%|██▉       | 913/3111 [05:34<13:27,  2.72it/s][A
epoch 2 iter 913: train loss 0.21968. lr 1.663861e-04:  29%|██▉       | 913/3111 [05:34<13:27,  2.72it/s][A
epoch 2 iter 913: train loss 0.21968. lr 1.663861e-04:  29%|██▉       | 914/3111 [05:34<13:24,  2.73it/s][A
epoch 2 iter 914: train loss 0.22594. lr 1.662504e-04:  29%|██▉       | 914/3111 [05:35<13:24,  2.73it/s][A
epoch 2 iter 914: train loss 0.22594. lr 1.662504e-04:  29%|██▉       | 915/3111 [05:35<13:21,  2.74it/s][A

epoch 2 iter 948: train loss 0.24675. lr 1.616590e-04:  30%|███       | 948/3111 [05:47<13:09,  2.74it/s][A
epoch 2 iter 948: train loss 0.24675. lr 1.616590e-04:  31%|███       | 949/3111 [05:47<13:06,  2.75it/s][A
epoch 2 iter 949: train loss 0.22523. lr 1.615245e-04:  31%|███       | 949/3111 [05:47<13:06,  2.75it/s][A
epoch 2 iter 949: train loss 0.22523. lr 1.615245e-04:  31%|███       | 950/3111 [05:47<13:09,  2.74it/s][A
epoch 2 iter 950: train loss 0.21725. lr 1.613901e-04:  31%|███       | 950/3111 [05:48<13:09,  2.74it/s][A
epoch 2 iter 950: train loss 0.21725. lr 1.613901e-04:  31%|███       | 951/3111 [05:48<13:10,  2.73it/s][A
epoch 2 iter 951: train loss 0.21551. lr 1.612558e-04:  31%|███       | 951/3111 [05:48<13:10,  2.73it/s][A
epoch 2 iter 951: train loss 0.21551. lr 1.612558e-04:  31%|███       | 952/3111 [05:48<13:08,  2.74it/s][A
epoch 2 iter 952: train loss 0.20818. lr 1.611215e-04:  31%|███       | 952/3111 [05:49<13:08,  2.74it/s][A

epoch 2 iter 985: train loss 0.21385. lr 1.567088e-04:  32%|███▏      | 986/3111 [06:01<13:00,  2.72it/s][A
epoch 2 iter 986: train loss 0.21656. lr 1.565757e-04:  32%|███▏      | 986/3111 [06:01<13:00,  2.72it/s][A
epoch 2 iter 986: train loss 0.21656. lr 1.565757e-04:  32%|███▏      | 987/3111 [06:01<13:00,  2.72it/s][A
epoch 2 iter 987: train loss 0.24862. lr 1.564426e-04:  32%|███▏      | 987/3111 [06:01<13:00,  2.72it/s][A
epoch 2 iter 987: train loss 0.24862. lr 1.564426e-04:  32%|███▏      | 988/3111 [06:01<13:01,  2.72it/s][A
epoch 2 iter 988: train loss 0.22170. lr 1.563096e-04:  32%|███▏      | 988/3111 [06:02<13:01,  2.72it/s][A
epoch 2 iter 988: train loss 0.22170. lr 1.563096e-04:  32%|███▏      | 989/3111 [06:02<13:01,  2.72it/s][A
epoch 2 iter 989: train loss 0.24349. lr 1.561766e-04:  32%|███▏      | 989/3111 [06:02<13:01,  2.72it/s][A
epoch 2 iter 989: train loss 0.24349. lr 1.561766e-04:  32%|███▏      | 990/3111 [06:02<12:57,  2.73it/s][A

epoch 2 iter 1022: train loss 0.22387. lr 1.518086e-04:  33%|███▎      | 1023/3111 [06:14<12:46,  2.73it/s][A
epoch 2 iter 1023: train loss 0.23074. lr 1.516769e-04:  33%|███▎      | 1023/3111 [06:15<12:46,  2.73it/s][A
epoch 2 iter 1023: train loss 0.23074. lr 1.516769e-04:  33%|███▎      | 1024/3111 [06:15<12:46,  2.72it/s][A
epoch 2 iter 1024: train loss 0.22456. lr 1.515452e-04:  33%|███▎      | 1024/3111 [06:15<12:46,  2.72it/s][A
epoch 2 iter 1024: train loss 0.22456. lr 1.515452e-04:  33%|███▎      | 1025/3111 [06:15<12:46,  2.72it/s][A
epoch 2 iter 1025: train loss 0.20541. lr 1.514135e-04:  33%|███▎      | 1025/3111 [06:15<12:46,  2.72it/s][A
epoch 2 iter 1025: train loss 0.20541. lr 1.514135e-04:  33%|███▎      | 1026/3111 [06:15<12:47,  2.72it/s][A
epoch 2 iter 1026: train loss 0.21355. lr 1.512819e-04:  33%|███▎      | 1026/3111 [06:16<12:47,  2.72it/s][A
epoch 2 iter 1026: train loss 0.21355. lr 1.512819e-04:  33%|███▎      | 1027/3111 [06:16<12:45,  2.72it/s][A

epoch 2 iter 1059: train loss 0.22881. lr 1.469602e-04:  34%|███▍      | 1059/3111 [06:28<12:33,  2.72it/s][A
epoch 2 iter 1059: train loss 0.22881. lr 1.469602e-04:  34%|███▍      | 1060/3111 [06:28<12:33,  2.72it/s][A
epoch 2 iter 1060: train loss 0.21275. lr 1.468299e-04:  34%|███▍      | 1060/3111 [06:28<12:33,  2.72it/s][A
epoch 2 iter 1060: train loss 0.21275. lr 1.468299e-04:  34%|███▍      | 1061/3111 [06:28<12:30,  2.73it/s][A
epoch 2 iter 1061: train loss 0.21516. lr 1.466996e-04:  34%|███▍      | 1061/3111 [06:28<12:30,  2.73it/s][A
epoch 2 iter 1061: train loss 0.21516. lr 1.466996e-04:  34%|███▍      | 1062/3111 [06:28<12:29,  2.73it/s][A
epoch 2 iter 1062: train loss 0.22019. lr 1.465694e-04:  34%|███▍      | 1062/3111 [06:29<12:29,  2.73it/s][A
epoch 2 iter 1062: train loss 0.22019. lr 1.465694e-04:  34%|███▍      | 1063/3111 [06:29<12:30,  2.73it/s][A
epoch 2 iter 1063: train loss 0.21539. lr 1.464392e-04:  34%|███▍      | 1063/3111 [06:29<12:30,  2.73it/s][A

epoch 2 iter 1095: train loss 0.21319. lr 1.422941e-04:  35%|███▌      | 1096/3111 [06:41<12:20,  2.72it/s][A
epoch 2 iter 1096: train loss 0.22166. lr 1.421652e-04:  35%|███▌      | 1096/3111 [06:41<12:20,  2.72it/s][A
epoch 2 iter 1096: train loss 0.22166. lr 1.421652e-04:  35%|███▌      | 1097/3111 [06:41<12:20,  2.72it/s][A
epoch 2 iter 1097: train loss 0.23115. lr 1.420364e-04:  35%|███▌      | 1097/3111 [06:42<12:20,  2.72it/s][A
epoch 2 iter 1097: train loss 0.23115. lr 1.420364e-04:  35%|███▌      | 1098/3111 [06:42<12:20,  2.72it/s][A
epoch 2 iter 1098: train loss 0.23183. lr 1.419076e-04:  35%|███▌      | 1098/3111 [06:42<12:20,  2.72it/s][A
epoch 2 iter 1098: train loss 0.23183. lr 1.419076e-04:  35%|███▌      | 1099/3111 [06:42<12:19,  2.72it/s][A
epoch 2 iter 1099: train loss 0.22797. lr 1.417788e-04:  35%|███▌      | 1099/3111 [06:42<12:19,  2.72it/s][A
epoch 2 iter 1099: train loss 0.22797. lr 1.417788e-04:  35%|███▌      | 1100/3111 [06:42<12:17,  2.73it/s][A

epoch 2 iter 1132: train loss 0.22904. lr 1.375527e-04:  36%|███▋      | 1132/3111 [06:54<12:09,  2.71it/s][A
epoch 2 iter 1132: train loss 0.22904. lr 1.375527e-04:  36%|███▋      | 1133/3111 [06:54<12:08,  2.72it/s][A
epoch 2 iter 1133: train loss 0.21993. lr 1.374254e-04:  36%|███▋      | 1133/3111 [06:55<12:08,  2.72it/s][A
epoch 2 iter 1133: train loss 0.21993. lr 1.374254e-04:  36%|███▋      | 1134/3111 [06:55<12:09,  2.71it/s][A
epoch 2 iter 1134: train loss 0.23066. lr 1.372980e-04:  36%|███▋      | 1134/3111 [06:55<12:09,  2.71it/s][A
epoch 2 iter 1134: train loss 0.23066. lr 1.372980e-04:  36%|███▋      | 1135/3111 [06:55<12:08,  2.71it/s][A
epoch 2 iter 1135: train loss 0.18751. lr 1.371707e-04:  36%|███▋      | 1135/3111 [06:56<12:08,  2.71it/s][A
epoch 2 iter 1135: train loss 0.18751. lr 1.371707e-04:  37%|███▋      | 1136/3111 [06:56<12:07,  2.72it/s][A
epoch 2 iter 1136: train loss 0.21973. lr 1.370435e-04:  37%|███▋      | 1136/3111 [06:56<12:07,  2.72it/s][A

epoch 2 iter 1168: train loss 0.22316. lr 1.329939e-04:  38%|███▊      | 1169/3111 [07:08<11:51,  2.73it/s][A
epoch 2 iter 1169: train loss 0.22466. lr 1.328681e-04:  38%|███▊      | 1169/3111 [07:08<11:51,  2.73it/s][A
epoch 2 iter 1169: train loss 0.22466. lr 1.328681e-04:  38%|███▊      | 1170/3111 [07:08<11:50,  2.73it/s][A
epoch 2 iter 1170: train loss 0.19439. lr 1.327423e-04:  38%|███▊      | 1170/3111 [07:08<11:50,  2.73it/s][A
epoch 2 iter 1170: train loss 0.19439. lr 1.327423e-04:  38%|███▊      | 1171/3111 [07:08<11:47,  2.74it/s][A
epoch 2 iter 1171: train loss 0.20681. lr 1.326165e-04:  38%|███▊      | 1171/3111 [07:09<11:47,  2.74it/s][A
epoch 2 iter 1171: train loss 0.20681. lr 1.326165e-04:  38%|███▊      | 1172/3111 [07:09<11:51,  2.72it/s][A
epoch 2 iter 1172: train loss 0.21748. lr 1.324908e-04:  38%|███▊      | 1172/3111 [07:09<11:51,  2.72it/s][A
epoch 2 iter 1172: train loss 0.21748. lr 1.324908e-04:  38%|███▊      | 1173/3111 [07:09<11:52,  2.72it/s][A

epoch 2 iter 1205: train loss 0.20879. lr 1.283661e-04:  39%|███▊      | 1205/3111 [07:21<11:35,  2.74it/s][A
epoch 2 iter 1205: train loss 0.20879. lr 1.283661e-04:  39%|███▉      | 1206/3111 [07:21<11:35,  2.74it/s][A
epoch 2 iter 1206: train loss 0.20623. lr 1.282418e-04:  39%|███▉      | 1206/3111 [07:22<11:35,  2.74it/s][A
epoch 2 iter 1206: train loss 0.20623. lr 1.282418e-04:  39%|███▉      | 1207/3111 [07:22<11:34,  2.74it/s][A
epoch 2 iter 1207: train loss 0.21560. lr 1.281176e-04:  39%|███▉      | 1207/3111 [07:22<11:34,  2.74it/s][A
epoch 2 iter 1207: train loss 0.21560. lr 1.281176e-04:  39%|███▉      | 1208/3111 [07:22<11:36,  2.73it/s][A
epoch 2 iter 1208: train loss 0.22460. lr 1.279934e-04:  39%|███▉      | 1208/3111 [07:22<11:36,  2.73it/s][A
epoch 2 iter 1208: train loss 0.22460. lr 1.279934e-04:  39%|███▉      | 1209/3111 [07:22<11:37,  2.73it/s][A
epoch 2 iter 1209: train loss 0.20847. lr 1.278693e-04:  39%|███▉      | 1209/3111 [07:23<11:37,  2.73it/s][A

epoch 2 iter 1241: train loss 0.21168. lr 1.239208e-04:  40%|███▉      | 1242/3111 [07:34<11:27,  2.72it/s][A
epoch 2 iter 1242: train loss 0.22436. lr 1.237981e-04:  40%|███▉      | 1242/3111 [07:35<11:27,  2.72it/s][A
epoch 2 iter 1242: train loss 0.22436. lr 1.237981e-04:  40%|███▉      | 1243/3111 [07:35<11:26,  2.72it/s][A
epoch 2 iter 1243: train loss 0.21860. lr 1.236755e-04:  40%|███▉      | 1243/3111 [07:35<11:26,  2.72it/s][A
epoch 2 iter 1243: train loss 0.21860. lr 1.236755e-04:  40%|███▉      | 1244/3111 [07:35<11:23,  2.73it/s][A
epoch 2 iter 1244: train loss 0.23171. lr 1.235530e-04:  40%|███▉      | 1244/3111 [07:35<11:23,  2.73it/s][A
epoch 2 iter 1244: train loss 0.23171. lr 1.235530e-04:  40%|████      | 1245/3111 [07:35<11:21,  2.74it/s][A
epoch 2 iter 1245: train loss 0.22611. lr 1.234304e-04:  40%|████      | 1245/3111 [07:36<11:21,  2.74it/s][A
epoch 2 iter 1245: train loss 0.22611. lr 1.234304e-04:  40%|████      | 1246/3111 [07:36<11:24,  2.73it/s][A

epoch 2 iter 1278: train loss 0.21771. lr 1.194127e-04:  41%|████      | 1278/3111 [07:48<11:10,  2.73it/s][A
epoch 2 iter 1278: train loss 0.21771. lr 1.194127e-04:  41%|████      | 1279/3111 [07:48<11:11,  2.73it/s][A
epoch 2 iter 1279: train loss 0.19837. lr 1.192918e-04:  41%|████      | 1279/3111 [07:48<11:11,  2.73it/s][A
epoch 2 iter 1279: train loss 0.19837. lr 1.192918e-04:  41%|████      | 1280/3111 [07:48<11:12,  2.72it/s][A
epoch 2 iter 1280: train loss 0.20964. lr 1.191708e-04:  41%|████      | 1280/3111 [07:49<11:12,  2.72it/s][A
epoch 2 iter 1280: train loss 0.20964. lr 1.191708e-04:  41%|████      | 1281/3111 [07:49<11:11,  2.73it/s][A
epoch 2 iter 1281: train loss 0.21588. lr 1.190499e-04:  41%|████      | 1281/3111 [07:49<11:11,  2.73it/s][A
epoch 2 iter 1281: train loss 0.21588. lr 1.190499e-04:  41%|████      | 1282/3111 [07:49<11:09,  2.73it/s][A
epoch 2 iter 1282: train loss 0.20394. lr 1.189291e-04:  41%|████      | 1282/3111 [07:49<11:09,  2.73it/s][A

epoch 1 iter 342: train loss 2.41863. lr 5.912936e-04:  15%|█▌        | 343/2230 [02:05<11:35,  2.71it/s][A
epoch 1 iter 343: train loss 2.53313. lr 5.912430e-04:  15%|█▌        | 343/2230 [02:05<11:35,  2.71it/s][A
epoch 1 iter 343: train loss 2.53313. lr 5.912430e-04:  15%|█▌        | 344/2230 [02:05<11:35,  2.71it/s][A
epoch 1 iter 344: train loss 2.44198. lr 5.911922e-04:  15%|█▌        | 344/2230 [02:06<11:35,  2.71it/s][A
epoch 1 iter 344: train loss 2.44198. lr 5.911922e-04:  15%|█▌        | 345/2230 [02:06<11:36,  2.71it/s][A
epoch 1 iter 345: train loss 2.38279. lr 5.911413e-04:  15%|█▌        | 345/2230 [02:06<11:36,  2.71it/s][A
epoch 1 iter 345: train loss 2.38279. lr 5.911413e-04:  16%|█▌        | 346/2230 [02:06<11:34,  2.71it/s][A
epoch 1 iter 346: train loss 2.60506. lr 5.910903e-04:  16%|█▌        | 346/2230 [02:06<11:34,  2.71it/s][A
epoch 1 iter 346: train loss 2.60506. lr 5.910903e-04:  16%|█▌        | 347/2230 [02:06<11:32,  2.72it/s][A

epoch 1 iter 380: train loss 2.22599. lr 5.892689e-04:  17%|█▋        | 380/2230 [02:19<11:19,  2.72it/s][A
epoch 1 iter 380: train loss 2.22599. lr 5.892689e-04:  17%|█▋        | 381/2230 [02:19<11:18,  2.72it/s][A
epoch 1 iter 381: train loss 2.36688. lr 5.892128e-04:  17%|█▋        | 381/2230 [02:19<11:18,  2.72it/s][A
epoch 1 iter 381: train loss 2.36688. lr 5.892128e-04:  17%|█▋        | 382/2230 [02:19<11:17,  2.73it/s][A
epoch 1 iter 382: train loss 2.34702. lr 5.891566e-04:  17%|█▋        | 382/2230 [02:20<11:17,  2.73it/s][A
epoch 1 iter 382: train loss 2.34702. lr 5.891566e-04:  17%|█▋        | 383/2230 [02:20<11:17,  2.73it/s][A
epoch 1 iter 383: train loss 2.09533. lr 5.891002e-04:  17%|█▋        | 383/2230 [02:20<11:17,  2.73it/s][A
epoch 1 iter 383: train loss 2.09533. lr 5.891002e-04:  17%|█▋        | 384/2230 [02:20<11:20,  2.71it/s][A
epoch 1 iter 384: train loss 2.31403. lr 5.890437e-04:  17%|█▋        | 384/2230 [02:20<11:20,  2.71it/s][A

epoch 1 iter 417: train loss 2.03473. lr 5.870983e-04:  19%|█▊        | 418/2230 [02:33<11:06,  2.72it/s][A
epoch 1 iter 418: train loss 2.05069. lr 5.870369e-04:  19%|█▊        | 418/2230 [02:33<11:06,  2.72it/s][A
epoch 1 iter 418: train loss 2.05069. lr 5.870369e-04:  19%|█▉        | 419/2230 [02:33<11:07,  2.71it/s][A
epoch 1 iter 419: train loss 2.01967. lr 5.869754e-04:  19%|█▉        | 419/2230 [02:33<11:07,  2.71it/s][A
epoch 1 iter 419: train loss 2.01967. lr 5.869754e-04:  19%|█▉        | 420/2230 [02:33<11:07,  2.71it/s][A
epoch 1 iter 420: train loss 2.04915. lr 5.869137e-04:  19%|█▉        | 420/2230 [02:34<11:07,  2.71it/s][A
epoch 1 iter 420: train loss 2.04915. lr 5.869137e-04:  19%|█▉        | 421/2230 [02:34<11:05,  2.72it/s][A
epoch 1 iter 421: train loss 2.06008. lr 5.868519e-04:  19%|█▉        | 421/2230 [02:34<11:05,  2.72it/s][A
epoch 1 iter 421: train loss 2.06008. lr 5.868519e-04:  19%|█▉        | 422/2230 [02:34<11:05,  2.72it/s][A

epoch 1 iter 455: train loss 1.86162. lr 5.846660e-04:  20%|██        | 455/2230 [02:47<10:54,  2.71it/s][A
epoch 1 iter 455: train loss 1.86162. lr 5.846660e-04:  20%|██        | 456/2230 [02:47<10:52,  2.72it/s][A
epoch 1 iter 456: train loss 1.82822. lr 5.845992e-04:  20%|██        | 456/2230 [02:47<10:52,  2.72it/s][A
epoch 1 iter 456: train loss 1.82822. lr 5.845992e-04:  20%|██        | 457/2230 [02:47<10:51,  2.72it/s][A
epoch 1 iter 457: train loss 1.93656. lr 5.845323e-04:  20%|██        | 457/2230 [02:47<10:51,  2.72it/s][A
epoch 1 iter 457: train loss 1.93656. lr 5.845323e-04:  21%|██        | 458/2230 [02:47<10:52,  2.72it/s][A
epoch 1 iter 458: train loss 1.88309. lr 5.844653e-04:  21%|██        | 458/2230 [02:48<10:52,  2.72it/s][A
epoch 1 iter 458: train loss 1.88309. lr 5.844653e-04:  21%|██        | 459/2230 [02:48<10:53,  2.71it/s][A
epoch 1 iter 459: train loss 1.87739. lr 5.843981e-04:  21%|██        | 459/2230 [02:48<10:53,  2.71it/s][A

epoch 1 iter 492: train loss 1.64147. lr 5.821018e-04:  22%|██▏       | 493/2230 [03:00<10:40,  2.71it/s][A
epoch 1 iter 493: train loss 1.76144. lr 5.820298e-04:  22%|██▏       | 493/2230 [03:01<10:40,  2.71it/s][A
epoch 1 iter 493: train loss 1.76144. lr 5.820298e-04:  22%|██▏       | 494/2230 [03:01<10:39,  2.71it/s][A
epoch 1 iter 494: train loss 1.59415. lr 5.819577e-04:  22%|██▏       | 494/2230 [03:01<10:39,  2.71it/s][A
epoch 1 iter 494: train loss 1.59415. lr 5.819577e-04:  22%|██▏       | 495/2230 [03:01<10:37,  2.72it/s][A
epoch 1 iter 495: train loss 1.74918. lr 5.818854e-04:  22%|██▏       | 495/2230 [03:01<10:37,  2.72it/s][A
epoch 1 iter 495: train loss 1.74918. lr 5.818854e-04:  22%|██▏       | 496/2230 [03:01<10:38,  2.72it/s][A
epoch 1 iter 496: train loss 1.63183. lr 5.818131e-04:  22%|██▏       | 496/2230 [03:02<10:38,  2.72it/s][A
epoch 1 iter 496: train loss 1.63183. lr 5.818131e-04:  22%|██▏       | 497/2230 [03:02<10:37,  2.72it/s][A

epoch 1 iter 530: train loss 1.53825. lr 5.792688e-04:  24%|██▍       | 530/2230 [03:14<10:23,  2.73it/s][A
epoch 1 iter 530: train loss 1.53825. lr 5.792688e-04:  24%|██▍       | 531/2230 [03:14<10:23,  2.72it/s][A
epoch 1 iter 531: train loss 1.43336. lr 5.791915e-04:  24%|██▍       | 531/2230 [03:15<10:23,  2.72it/s][A
epoch 1 iter 531: train loss 1.43336. lr 5.791915e-04:  24%|██▍       | 532/2230 [03:15<10:24,  2.72it/s][A
epoch 1 iter 532: train loss 1.53556. lr 5.791141e-04:  24%|██▍       | 532/2230 [03:15<10:24,  2.72it/s][A
epoch 1 iter 532: train loss 1.53556. lr 5.791141e-04:  24%|██▍       | 533/2230 [03:15<10:23,  2.72it/s][A
epoch 1 iter 533: train loss 1.54770. lr 5.790366e-04:  24%|██▍       | 533/2230 [03:15<10:23,  2.72it/s][A
epoch 1 iter 533: train loss 1.54770. lr 5.790366e-04:  24%|██▍       | 534/2230 [03:15<10:21,  2.73it/s][A
epoch 1 iter 534: train loss 1.54622. lr 5.789589e-04:  24%|██▍       | 534/2230 [03:16<10:21,  2.73it/s][A

epoch 1 iter 567: train loss 1.39989. lr 5.763181e-04:  25%|██▌       | 568/2230 [03:28<10:10,  2.72it/s][A
epoch 1 iter 568: train loss 1.38351. lr 5.762357e-04:  25%|██▌       | 568/2230 [03:28<10:10,  2.72it/s][A
epoch 1 iter 568: train loss 1.38351. lr 5.762357e-04:  26%|██▌       | 569/2230 [03:28<10:09,  2.73it/s][A
epoch 1 iter 569: train loss 1.45903. lr 5.761532e-04:  26%|██▌       | 569/2230 [03:28<10:09,  2.73it/s][A
epoch 1 iter 569: train loss 1.45903. lr 5.761532e-04:  26%|██▌       | 570/2230 [03:28<10:10,  2.72it/s][A
epoch 1 iter 570: train loss 1.36718. lr 5.760706e-04:  26%|██▌       | 570/2230 [03:29<10:10,  2.72it/s][A
epoch 1 iter 570: train loss 1.36718. lr 5.760706e-04:  26%|██▌       | 571/2230 [03:29<10:10,  2.72it/s][A
epoch 1 iter 571: train loss 1.34286. lr 5.759878e-04:  26%|██▌       | 571/2230 [03:29<10:10,  2.72it/s][A
epoch 1 iter 571: train loss 1.34286. lr 5.759878e-04:  26%|██▌       | 572/2230 [03:29<10:09,  2.72it/s][A

epoch 1 iter 605: train loss 1.21553. lr 5.730922e-04:  27%|██▋       | 605/2230 [03:42<09:58,  2.72it/s][A
epoch 1 iter 605: train loss 1.21553. lr 5.730922e-04:  27%|██▋       | 606/2230 [03:42<09:56,  2.72it/s][A
epoch 1 iter 606: train loss 1.20361. lr 5.730047e-04:  27%|██▋       | 606/2230 [03:42<09:56,  2.72it/s][A
epoch 1 iter 606: train loss 1.20361. lr 5.730047e-04:  27%|██▋       | 607/2230 [03:42<09:55,  2.73it/s][A
epoch 1 iter 607: train loss 1.22430. lr 5.729170e-04:  27%|██▋       | 607/2230 [03:42<09:55,  2.73it/s][A
epoch 1 iter 607: train loss 1.22430. lr 5.729170e-04:  27%|██▋       | 608/2230 [03:42<09:52,  2.74it/s][A
epoch 1 iter 608: train loss 1.15803. lr 5.728292e-04:  27%|██▋       | 608/2230 [03:43<09:52,  2.74it/s][A
epoch 1 iter 608: train loss 1.15803. lr 5.728292e-04:  27%|██▋       | 609/2230 [03:43<09:55,  2.72it/s][A
epoch 1 iter 609: train loss 1.21403. lr 5.727413e-04:  27%|██▋       | 609/2230 [03:43<09:55,  2.72it/s][A

epoch 1 iter 642: train loss 1.11686. lr 5.697633e-04:  29%|██▉       | 643/2230 [03:55<09:44,  2.72it/s][A
epoch 1 iter 643: train loss 1.13889. lr 5.696708e-04:  29%|██▉       | 643/2230 [03:56<09:44,  2.72it/s][A
epoch 1 iter 643: train loss 1.13889. lr 5.696708e-04:  29%|██▉       | 644/2230 [03:56<09:44,  2.71it/s][A
epoch 1 iter 644: train loss 1.16097. lr 5.695781e-04:  29%|██▉       | 644/2230 [03:56<09:44,  2.71it/s][A
epoch 1 iter 644: train loss 1.16097. lr 5.695781e-04:  29%|██▉       | 645/2230 [03:56<09:44,  2.71it/s][A
epoch 1 iter 645: train loss 1.08260. lr 5.694853e-04:  29%|██▉       | 645/2230 [03:56<09:44,  2.71it/s][A
epoch 1 iter 645: train loss 1.08260. lr 5.694853e-04:  29%|██▉       | 646/2230 [03:56<09:44,  2.71it/s][A
epoch 1 iter 646: train loss 1.01291. lr 5.693924e-04:  29%|██▉       | 646/2230 [03:57<09:44,  2.71it/s][A
epoch 1 iter 646: train loss 1.01291. lr 5.693924e-04:  29%|██▉       | 647/2230 [03:57<09:43,  2.71it/s][A

epoch 1 iter 680: train loss 1.01723. lr 5.661536e-04:  30%|███       | 680/2230 [04:09<09:30,  2.72it/s][A
epoch 1 iter 680: train loss 1.01723. lr 5.661536e-04:  31%|███       | 681/2230 [04:09<09:30,  2.71it/s][A
epoch 1 iter 681: train loss 0.98566. lr 5.660561e-04:  31%|███       | 681/2230 [04:10<09:30,  2.71it/s][A
epoch 1 iter 681: train loss 0.98566. lr 5.660561e-04:  31%|███       | 682/2230 [04:10<09:30,  2.71it/s][A
epoch 1 iter 682: train loss 0.99092. lr 5.659583e-04:  31%|███       | 682/2230 [04:10<09:30,  2.71it/s][A
epoch 1 iter 682: train loss 0.99092. lr 5.659583e-04:  31%|███       | 683/2230 [04:10<09:29,  2.72it/s][A
epoch 1 iter 683: train loss 0.93958. lr 5.658605e-04:  31%|███       | 683/2230 [04:10<09:29,  2.72it/s][A
epoch 1 iter 683: train loss 0.93958. lr 5.658605e-04:  31%|███       | 684/2230 [04:10<09:28,  2.72it/s][A
epoch 1 iter 684: train loss 0.91585. lr 5.657625e-04:  31%|███       | 684/2230 [04:11<09:28,  2.72it/s][A

epoch 1 iter 717: train loss 0.89525. lr 5.624557e-04:  32%|███▏      | 718/2230 [04:23<09:11,  2.74it/s][A
epoch 1 iter 718: train loss 0.94797. lr 5.623533e-04:  32%|███▏      | 718/2230 [04:23<09:11,  2.74it/s][A
epoch 1 iter 718: train loss 0.94797. lr 5.623533e-04:  32%|███▏      | 719/2230 [04:23<09:13,  2.73it/s][A
epoch 1 iter 719: train loss 0.88400. lr 5.622507e-04:  32%|███▏      | 719/2230 [04:24<09:13,  2.73it/s][A
epoch 1 iter 719: train loss 0.88400. lr 5.622507e-04:  32%|███▏      | 720/2230 [04:24<09:12,  2.73it/s][A
epoch 1 iter 720: train loss 0.88906. lr 5.621480e-04:  32%|███▏      | 720/2230 [04:24<09:12,  2.73it/s][A
epoch 1 iter 720: train loss 0.88906. lr 5.621480e-04:  32%|███▏      | 721/2230 [04:24<09:11,  2.74it/s][A
epoch 1 iter 721: train loss 0.89825. lr 5.620452e-04:  32%|███▏      | 721/2230 [04:24<09:11,  2.74it/s][A
epoch 1 iter 721: train loss 0.89825. lr 5.620452e-04:  32%|███▏      | 722/2230 [04:24<09:12,  2.73it/s][A

epoch 1 iter 755: train loss 0.83861. lr 5.584723e-04:  34%|███▍      | 755/2230 [04:37<09:01,  2.73it/s][A
epoch 1 iter 755: train loss 0.83861. lr 5.584723e-04:  34%|███▍      | 756/2230 [04:37<08:59,  2.73it/s][A
epoch 1 iter 756: train loss 0.86167. lr 5.583650e-04:  34%|███▍      | 756/2230 [04:37<08:59,  2.73it/s][A
epoch 1 iter 756: train loss 0.86167. lr 5.583650e-04:  34%|███▍      | 757/2230 [04:37<09:02,  2.72it/s][A
epoch 1 iter 757: train loss 0.82302. lr 5.582575e-04:  34%|███▍      | 757/2230 [04:38<09:02,  2.72it/s][A
epoch 1 iter 757: train loss 0.82302. lr 5.582575e-04:  34%|███▍      | 758/2230 [04:38<09:02,  2.71it/s][A
epoch 1 iter 758: train loss 0.78835. lr 5.581499e-04:  34%|███▍      | 758/2230 [04:38<09:02,  2.71it/s][A
epoch 1 iter 758: train loss 0.78835. lr 5.581499e-04:  34%|███▍      | 759/2230 [04:38<09:01,  2.72it/s][A
epoch 1 iter 759: train loss 0.81677. lr 5.580422e-04:  34%|███▍      | 759/2230 [04:38<09:01,  2.72it/s][A

epoch 1 iter 792: train loss 0.76617. lr 5.544158e-04:  36%|███▌      | 793/2230 [04:51<08:49,  2.71it/s][A
epoch 1 iter 793: train loss 0.75134. lr 5.543037e-04:  36%|███▌      | 793/2230 [04:51<08:49,  2.71it/s][A
epoch 1 iter 793: train loss 0.75134. lr 5.543037e-04:  36%|███▌      | 794/2230 [04:51<08:48,  2.72it/s][A
epoch 1 iter 794: train loss 0.73570. lr 5.541915e-04:  36%|███▌      | 794/2230 [04:51<08:48,  2.72it/s][A
epoch 1 iter 794: train loss 0.73570. lr 5.541915e-04:  36%|███▌      | 795/2230 [04:51<08:48,  2.72it/s][A
epoch 1 iter 795: train loss 0.72761. lr 5.540792e-04:  36%|███▌      | 795/2230 [04:52<08:48,  2.72it/s][A
epoch 1 iter 795: train loss 0.72761. lr 5.540792e-04:  36%|███▌      | 796/2230 [04:52<08:48,  2.72it/s][A
epoch 1 iter 796: train loss 0.69414. lr 5.539668e-04:  36%|███▌      | 796/2230 [04:52<08:48,  2.72it/s][A
epoch 1 iter 796: train loss 0.69414. lr 5.539668e-04:  36%|███▌      | 797/2230 [04:52<08:48,  2.71it/s][A

epoch 2 iter 743: train loss 0.14908. lr 1.498866e-04:  33%|███▎      | 744/2230 [04:33<09:54,  2.50it/s][A
epoch 2 iter 744: train loss 0.18012. lr 1.497037e-04:  33%|███▎      | 744/2230 [04:33<09:54,  2.50it/s][A
epoch 2 iter 744: train loss 0.18012. lr 1.497037e-04:  33%|███▎      | 745/2230 [04:33<09:37,  2.57it/s][A
epoch 2 iter 745: train loss 0.16186. lr 1.495208e-04:  33%|███▎      | 745/2230 [04:34<09:37,  2.57it/s][A
epoch 2 iter 745: train loss 0.16186. lr 1.495208e-04:  33%|███▎      | 746/2230 [04:34<09:29,  2.60it/s][A
epoch 2 iter 746: train loss 0.16514. lr 1.493380e-04:  33%|███▎      | 746/2230 [04:34<09:29,  2.60it/s][A
epoch 2 iter 746: train loss 0.16514. lr 1.493380e-04:  33%|███▎      | 747/2230 [04:34<09:23,  2.63it/s][A
epoch 2 iter 747: train loss 0.18066. lr 1.491553e-04:  33%|███▎      | 747/2230 [04:35<09:23,  2.63it/s][A
epoch 2 iter 747: train loss 0.18066. lr 1.491553e-04:  34%|███▎      | 748/2230 [04:35<09:17,  2.66it/s][A

epoch 2 iter 781: train loss 0.15802. lr 1.429884e-04:  35%|███▌      | 781/2230 [04:47<08:52,  2.72it/s][A
epoch 2 iter 781: train loss 0.15802. lr 1.429884e-04:  35%|███▌      | 782/2230 [04:47<08:51,  2.73it/s][A
epoch 2 iter 782: train loss 0.16297. lr 1.428084e-04:  35%|███▌      | 782/2230 [04:47<08:51,  2.73it/s][A
epoch 2 iter 782: train loss 0.16297. lr 1.428084e-04:  35%|███▌      | 783/2230 [04:47<08:50,  2.73it/s][A
epoch 2 iter 783: train loss 0.16206. lr 1.426284e-04:  35%|███▌      | 783/2230 [04:48<08:50,  2.73it/s][A
epoch 2 iter 783: train loss 0.16206. lr 1.426284e-04:  35%|███▌      | 784/2230 [04:48<08:48,  2.74it/s][A
epoch 2 iter 784: train loss 0.15876. lr 1.424485e-04:  35%|███▌      | 784/2230 [04:48<08:48,  2.74it/s][A
epoch 2 iter 784: train loss 0.15876. lr 1.424485e-04:  35%|███▌      | 785/2230 [04:48<08:49,  2.73it/s][A
epoch 2 iter 785: train loss 0.15846. lr 1.422687e-04:  35%|███▌      | 785/2230 [04:48<08:49,  2.73it/s][A

epoch 2 iter 818: train loss 0.17090. lr 1.363798e-04:  37%|███▋      | 819/2230 [05:01<08:40,  2.71it/s][A
epoch 2 iter 819: train loss 0.16262. lr 1.362027e-04:  37%|███▋      | 819/2230 [05:01<08:40,  2.71it/s][A
epoch 2 iter 819: train loss 0.16262. lr 1.362027e-04:  37%|███▋      | 820/2230 [05:01<08:40,  2.71it/s][A
epoch 2 iter 820: train loss 0.16303. lr 1.360257e-04:  37%|███▋      | 820/2230 [05:01<08:40,  2.71it/s][A
epoch 2 iter 820: train loss 0.16303. lr 1.360257e-04:  37%|███▋      | 821/2230 [05:01<08:38,  2.72it/s][A
epoch 2 iter 821: train loss 0.15908. lr 1.358488e-04:  37%|███▋      | 821/2230 [05:02<08:38,  2.72it/s][A
epoch 2 iter 821: train loss 0.15908. lr 1.358488e-04:  37%|███▋      | 822/2230 [05:02<08:38,  2.72it/s][A
epoch 2 iter 822: train loss 0.17211. lr 1.356719e-04:  37%|███▋      | 822/2230 [05:02<08:38,  2.72it/s][A
epoch 2 iter 822: train loss 0.17211. lr 1.356719e-04:  37%|███▋      | 823/2230 [05:02<08:38,  2.71it/s][A

epoch 2 iter 856: train loss 0.14622. lr 1.297083e-04:  38%|███▊      | 856/2230 [05:15<08:25,  2.72it/s][A
epoch 2 iter 856: train loss 0.14622. lr 1.297083e-04:  38%|███▊      | 857/2230 [05:15<08:25,  2.72it/s][A
epoch 2 iter 857: train loss 0.15733. lr 1.295344e-04:  38%|███▊      | 857/2230 [05:15<08:25,  2.72it/s][A
epoch 2 iter 857: train loss 0.15733. lr 1.295344e-04:  38%|███▊      | 858/2230 [05:15<08:25,  2.72it/s][A
epoch 2 iter 858: train loss 0.16863. lr 1.293605e-04:  38%|███▊      | 858/2230 [05:15<08:25,  2.72it/s][A
epoch 2 iter 858: train loss 0.16863. lr 1.293605e-04:  39%|███▊      | 859/2230 [05:15<08:26,  2.71it/s][A
epoch 2 iter 859: train loss 0.16151. lr 1.291868e-04:  39%|███▊      | 859/2230 [05:16<08:26,  2.71it/s][A
epoch 2 iter 859: train loss 0.16151. lr 1.291868e-04:  39%|███▊      | 860/2230 [05:16<08:25,  2.71it/s][A
epoch 2 iter 860: train loss 0.17483. lr 1.290131e-04:  39%|███▊      | 860/2230 [05:16<08:25,  2.71it/s][A

epoch 2 iter 893: train loss 0.16612. lr 1.233296e-04:  40%|████      | 894/2230 [05:28<08:12,  2.71it/s][A
epoch 2 iter 894: train loss 0.16400. lr 1.231589e-04:  40%|████      | 894/2230 [05:29<08:12,  2.71it/s][A
epoch 2 iter 894: train loss 0.16400. lr 1.231589e-04:  40%|████      | 895/2230 [05:29<08:10,  2.72it/s][A
epoch 2 iter 895: train loss 0.16092. lr 1.229882e-04:  40%|████      | 895/2230 [05:29<08:10,  2.72it/s][A
epoch 2 iter 895: train loss 0.16092. lr 1.229882e-04:  40%|████      | 896/2230 [05:29<08:09,  2.73it/s][A
epoch 2 iter 896: train loss 0.15469. lr 1.228176e-04:  40%|████      | 896/2230 [05:29<08:09,  2.73it/s][A
epoch 2 iter 896: train loss 0.15469. lr 1.228176e-04:  40%|████      | 897/2230 [05:29<08:06,  2.74it/s][A
epoch 2 iter 897: train loss 0.16322. lr 1.226471e-04:  40%|████      | 897/2230 [05:30<08:06,  2.74it/s][A
epoch 2 iter 897: train loss 0.16322. lr 1.226471e-04:  40%|████      | 898/2230 [05:30<08:08,  2.73it/s][A

epoch 2 iter 931: train loss 0.15446. lr 1.169034e-04:  42%|████▏     | 931/2230 [05:42<07:54,  2.74it/s][A
epoch 2 iter 931: train loss 0.15446. lr 1.169034e-04:  42%|████▏     | 932/2230 [05:42<07:56,  2.72it/s][A
epoch 2 iter 932: train loss 0.15430. lr 1.167361e-04:  42%|████▏     | 932/2230 [05:43<07:56,  2.72it/s][A
epoch 2 iter 932: train loss 0.15430. lr 1.167361e-04:  42%|████▏     | 933/2230 [05:43<07:56,  2.72it/s][A
epoch 2 iter 933: train loss 0.15564. lr 1.165688e-04:  42%|████▏     | 933/2230 [05:43<07:56,  2.72it/s][A
epoch 2 iter 933: train loss 0.15564. lr 1.165688e-04:  42%|████▏     | 934/2230 [05:43<07:55,  2.73it/s][A
epoch 2 iter 934: train loss 0.15621. lr 1.164016e-04:  42%|████▏     | 934/2230 [05:43<07:55,  2.73it/s][A
epoch 2 iter 934: train loss 0.15621. lr 1.164016e-04:  42%|████▏     | 935/2230 [05:43<07:54,  2.73it/s][A
epoch 2 iter 935: train loss 0.15661. lr 1.162346e-04:  42%|████▏     | 935/2230 [05:44<07:54,  2.73it/s][A

epoch 2 iter 968: train loss 0.16730. lr 1.107724e-04:  43%|████▎     | 969/2230 [05:56<07:43,  2.72it/s][A
epoch 2 iter 969: train loss 0.15669. lr 1.106085e-04:  43%|████▎     | 969/2230 [05:56<07:43,  2.72it/s][A
epoch 2 iter 969: train loss 0.15669. lr 1.106085e-04:  43%|████▎     | 970/2230 [05:56<07:42,  2.72it/s][A
epoch 2 iter 970: train loss 0.15564. lr 1.104446e-04:  43%|████▎     | 970/2230 [05:57<07:42,  2.72it/s][A
epoch 2 iter 970: train loss 0.15564. lr 1.104446e-04:  44%|████▎     | 971/2230 [05:57<07:44,  2.71it/s][A
epoch 2 iter 971: train loss 0.16373. lr 1.102809e-04:  44%|████▎     | 971/2230 [05:57<07:44,  2.71it/s][A
epoch 2 iter 971: train loss 0.16373. lr 1.102809e-04:  44%|████▎     | 972/2230 [05:57<07:44,  2.71it/s][A
epoch 2 iter 972: train loss 0.16223. lr 1.101172e-04:  44%|████▎     | 972/2230 [05:57<07:44,  2.71it/s][A
epoch 2 iter 972: train loss 0.16223. lr 1.101172e-04:  44%|████▎     | 973/2230 [05:57<07:43,  2.71it/s][A

epoch 2 iter 1005: train loss 0.15844. lr 1.047699e-04:  45%|████▌     | 1006/2230 [06:09<07:28,  2.73it/s][A
epoch 2 iter 1006: train loss 0.15140. lr 1.046095e-04:  45%|████▌     | 1006/2230 [06:10<07:28,  2.73it/s][A
epoch 2 iter 1006: train loss 0.15140. lr 1.046095e-04:  45%|████▌     | 1007/2230 [06:10<07:29,  2.72it/s][A
epoch 2 iter 1007: train loss 0.15179. lr 1.044492e-04:  45%|████▌     | 1007/2230 [06:10<07:29,  2.72it/s][A
epoch 2 iter 1007: train loss 0.15179. lr 1.044492e-04:  45%|████▌     | 1008/2230 [06:10<07:28,  2.73it/s][A
epoch 2 iter 1008: train loss 0.17433. lr 1.042890e-04:  45%|████▌     | 1008/2230 [06:11<07:28,  2.73it/s][A
epoch 2 iter 1008: train loss 0.17433. lr 1.042890e-04:  45%|████▌     | 1009/2230 [06:11<07:27,  2.73it/s][A
epoch 2 iter 1009: train loss 0.16738. lr 1.041289e-04:  45%|████▌     | 1009/2230 [06:11<07:27,  2.73it/s][A
epoch 2 iter 1009: train loss 0.16738. lr 1.041289e-04:  45%|████▌     | 1010/2230 [06:11<07:28,  2.72it/s][A

epoch 2 iter 1042: train loss 0.15984. lr 9.890004e-05:  47%|████▋     | 1042/2230 [06:23<07:16,  2.72it/s][A
epoch 2 iter 1042: train loss 0.15984. lr 9.890004e-05:  47%|████▋     | 1043/2230 [06:23<07:16,  2.72it/s][A
epoch 2 iter 1043: train loss 0.14102. lr 9.874327e-05:  47%|████▋     | 1043/2230 [06:23<07:16,  2.72it/s][A
epoch 2 iter 1043: train loss 0.14102. lr 9.874327e-05:  47%|████▋     | 1044/2230 [06:23<07:17,  2.71it/s][A
epoch 2 iter 1044: train loss 0.14866. lr 9.858660e-05:  47%|████▋     | 1044/2230 [06:24<07:17,  2.71it/s][A
epoch 2 iter 1044: train loss 0.14866. lr 9.858660e-05:  47%|████▋     | 1045/2230 [06:24<07:17,  2.71it/s][A
epoch 2 iter 1045: train loss 0.16989. lr 9.843004e-05:  47%|████▋     | 1045/2230 [06:24<07:17,  2.71it/s][A
epoch 2 iter 1045: train loss 0.16989. lr 9.843004e-05:  47%|████▋     | 1046/2230 [06:24<07:16,  2.71it/s][A
epoch 2 iter 1046: train loss 0.15447. lr 9.827357e-05:  47%|████▋     | 1046/2230 [06:24<07:16,  2.71it/s][A

epoch 2 iter 1078: train loss 0.15387. lr 9.331988e-05:  48%|████▊     | 1079/2230 [06:36<07:03,  2.72it/s][A
epoch 2 iter 1079: train loss 0.14474. lr 9.316676e-05:  48%|████▊     | 1079/2230 [06:37<07:03,  2.72it/s][A
epoch 2 iter 1079: train loss 0.14474. lr 9.316676e-05:  48%|████▊     | 1080/2230 [06:37<07:04,  2.71it/s][A
epoch 2 iter 1080: train loss 0.14788. lr 9.301374e-05:  48%|████▊     | 1080/2230 [06:37<07:04,  2.71it/s][A
epoch 2 iter 1080: train loss 0.14788. lr 9.301374e-05:  48%|████▊     | 1081/2230 [06:37<07:02,  2.72it/s][A
epoch 2 iter 1081: train loss 0.13336. lr 9.286082e-05:  48%|████▊     | 1081/2230 [06:37<07:02,  2.72it/s][A
epoch 2 iter 1081: train loss 0.13336. lr 9.286082e-05:  49%|████▊     | 1082/2230 [06:37<07:01,  2.73it/s][A
epoch 2 iter 1082: train loss 0.15631. lr 9.270801e-05:  49%|████▊     | 1082/2230 [06:38<07:01,  2.73it/s][A
epoch 2 iter 1082: train loss 0.15631. lr 9.270801e-05:  49%|████▊     | 1083/2230 [06:38<07:01,  2.72it/s][A

epoch 2 iter 1115: train loss 0.15835. lr 8.772325e-05:  50%|█████     | 1115/2230 [06:50<06:50,  2.72it/s][A
epoch 2 iter 1115: train loss 0.15835. lr 8.772325e-05:  50%|█████     | 1116/2230 [06:50<06:49,  2.72it/s][A
epoch 2 iter 1116: train loss 0.15981. lr 8.757398e-05:  50%|█████     | 1116/2230 [06:50<06:49,  2.72it/s][A
epoch 2 iter 1116: train loss 0.15981. lr 8.757398e-05:  50%|█████     | 1117/2230 [06:50<06:49,  2.72it/s][A
epoch 2 iter 1117: train loss 0.15050. lr 8.742481e-05:  50%|█████     | 1117/2230 [06:51<06:49,  2.72it/s][A
epoch 2 iter 1117: train loss 0.15050. lr 8.742481e-05:  50%|█████     | 1118/2230 [06:51<06:49,  2.72it/s][A
epoch 2 iter 1118: train loss 0.14159. lr 8.727574e-05:  50%|█████     | 1118/2230 [06:51<06:49,  2.72it/s][A
epoch 2 iter 1118: train loss 0.14159. lr 8.727574e-05:  50%|█████     | 1119/2230 [06:51<06:49,  2.71it/s][A
epoch 2 iter 1119: train loss 0.15246. lr 8.712678e-05:  50%|█████     | 1119/2230 [06:51<06:49,  2.71it/s][A

epoch 2 iter 1151: train loss 0.16153. lr 8.241625e-05:  52%|█████▏    | 1152/2230 [07:03<06:35,  2.72it/s][A
epoch 2 iter 1152: train loss 0.13685. lr 8.227082e-05:  52%|█████▏    | 1152/2230 [07:04<06:35,  2.72it/s][A
epoch 2 iter 1152: train loss 0.13685. lr 8.227082e-05:  52%|█████▏    | 1153/2230 [07:04<06:36,  2.72it/s][A
epoch 2 iter 1153: train loss 0.15165. lr 8.212549e-05:  52%|█████▏    | 1153/2230 [07:04<06:36,  2.72it/s][A
epoch 2 iter 1153: train loss 0.15165. lr 8.212549e-05:  52%|█████▏    | 1154/2230 [07:04<06:36,  2.71it/s][A
epoch 2 iter 1154: train loss 0.15544. lr 8.198027e-05:  52%|█████▏    | 1154/2230 [07:04<06:36,  2.71it/s][A
epoch 2 iter 1154: train loss 0.15544. lr 8.198027e-05:  52%|█████▏    | 1155/2230 [07:04<06:35,  2.72it/s][A
epoch 2 iter 1155: train loss 0.16039. lr 8.183516e-05:  52%|█████▏    | 1155/2230 [07:05<06:35,  2.72it/s][A
epoch 2 iter 1155: train loss 0.16039. lr 8.183516e-05:  52%|█████▏    | 1156/2230 [07:05<06:34,  2.72it/s][A

epoch 2 iter 1188: train loss 0.17418. lr 7.710767e-05:  53%|█████▎    | 1188/2230 [07:17<06:24,  2.71it/s][A
epoch 2 iter 1188: train loss 0.17418. lr 7.710767e-05:  53%|█████▎    | 1189/2230 [07:17<06:23,  2.71it/s][A
epoch 2 iter 1189: train loss 0.14223. lr 7.696628e-05:  53%|█████▎    | 1189/2230 [07:17<06:23,  2.71it/s][A
epoch 2 iter 1189: train loss 0.14223. lr 7.696628e-05:  53%|█████▎    | 1190/2230 [07:17<06:22,  2.72it/s][A
epoch 2 iter 1190: train loss 0.15151. lr 7.682500e-05:  53%|█████▎    | 1190/2230 [07:17<06:22,  2.72it/s][A
epoch 2 iter 1190: train loss 0.15151. lr 7.682500e-05:  53%|█████▎    | 1191/2230 [07:17<06:22,  2.72it/s][A
epoch 2 iter 1191: train loss 0.15503. lr 7.668383e-05:  53%|█████▎    | 1191/2230 [07:18<06:22,  2.72it/s][A
epoch 2 iter 1191: train loss 0.15503. lr 7.668383e-05:  53%|█████▎    | 1192/2230 [07:18<06:22,  2.72it/s][A
epoch 2 iter 1192: train loss 0.15734. lr 7.654278e-05:  53%|█████▎    | 1192/2230 [07:18<06:22,  2.72it/s][A

epoch 2 iter 1224: train loss 0.14621. lr 7.208786e-05:  55%|█████▍    | 1225/2230 [07:30<06:08,  2.72it/s][A
epoch 2 iter 1225: train loss 0.15024. lr 7.195049e-05:  55%|█████▍    | 1225/2230 [07:30<06:08,  2.72it/s][A
epoch 2 iter 1225: train loss 0.15024. lr 7.195049e-05:  55%|█████▍    | 1226/2230 [07:30<06:08,  2.72it/s][A
epoch 2 iter 1226: train loss 0.15285. lr 7.181325e-05:  55%|█████▍    | 1226/2230 [07:31<06:08,  2.72it/s][A
epoch 2 iter 1226: train loss 0.15285. lr 7.181325e-05:  55%|█████▌    | 1227/2230 [07:31<06:09,  2.72it/s][A
epoch 2 iter 1227: train loss 0.14542. lr 7.167611e-05:  55%|█████▌    | 1227/2230 [07:31<06:09,  2.72it/s][A
epoch 2 iter 1227: train loss 0.14542. lr 7.167611e-05:  55%|█████▌    | 1228/2230 [07:31<06:09,  2.71it/s][A
epoch 2 iter 1228: train loss 0.16252. lr 7.153909e-05:  55%|█████▌    | 1228/2230 [07:31<06:09,  2.71it/s][A
epoch 2 iter 1228: train loss 0.16252. lr 7.153909e-05:  55%|█████▌    | 1229/2230 [07:31<06:08,  2.72it/s][A

epoch 1 iter 1000: train loss 3.20268. lr 5.958428e-04:  11%|█         | 1001/9433 [06:08<51:44,  2.72it/s][A
epoch 1 iter 1001: train loss 3.08882. lr 5.958345e-04:  11%|█         | 1001/9433 [06:08<51:44,  2.72it/s][A
epoch 1 iter 1001: train loss 3.08882. lr 5.958345e-04:  11%|█         | 1002/9433 [06:08<51:43,  2.72it/s][A
epoch 1 iter 1002: train loss 3.09913. lr 5.958262e-04:  11%|█         | 1002/9433 [06:09<51:43,  2.72it/s][A
epoch 1 iter 1002: train loss 3.09913. lr 5.958262e-04:  11%|█         | 1003/9433 [06:09<51:42,  2.72it/s][A
epoch 1 iter 1003: train loss 3.18534. lr 5.958179e-04:  11%|█         | 1003/9433 [06:09<51:42,  2.72it/s][A
epoch 1 iter 1003: train loss 3.18534. lr 5.958179e-04:  11%|█         | 1004/9433 [06:09<51:50,  2.71it/s][A
epoch 1 iter 1004: train loss 3.24448. lr 5.958096e-04:  11%|█         | 1004/9433 [06:09<51:50,  2.71it/s][A
epoch 1 iter 1004: train loss 3.24448. lr 5.958096e-04:  11%|█         | 1005/9433 [06:09<51:46,  2.71it/s][A

epoch 1 iter 1037: train loss 3.12905. lr 5.955305e-04:  11%|█         | 1037/9433 [06:22<51:33,  2.71it/s][A
epoch 1 iter 1037: train loss 3.12905. lr 5.955305e-04:  11%|█         | 1038/9433 [06:22<51:41,  2.71it/s][A
epoch 1 iter 1038: train loss 3.05065. lr 5.955220e-04:  11%|█         | 1038/9433 [06:22<51:41,  2.71it/s][A
epoch 1 iter 1038: train loss 3.05065. lr 5.955220e-04:  11%|█         | 1039/9433 [06:22<51:39,  2.71it/s][A
epoch 1 iter 1039: train loss 3.15755. lr 5.955133e-04:  11%|█         | 1039/9433 [06:22<51:39,  2.71it/s][A
epoch 1 iter 1039: train loss 3.15755. lr 5.955133e-04:  11%|█         | 1040/9433 [06:22<51:33,  2.71it/s][A
epoch 1 iter 1040: train loss 3.08223. lr 5.955047e-04:  11%|█         | 1040/9433 [06:23<51:33,  2.71it/s][A
epoch 1 iter 1040: train loss 3.08223. lr 5.955047e-04:  11%|█         | 1041/9433 [06:23<51:32,  2.71it/s][A
epoch 1 iter 1041: train loss 2.89180. lr 5.954961e-04:  11%|█         | 1041/9433 [06:23<51:32,  2.71it/s][A

epoch 1 iter 1073: train loss 3.04191. lr 5.952159e-04:  11%|█▏        | 1074/9433 [06:35<51:26,  2.71it/s][A
epoch 1 iter 1074: train loss 2.83171. lr 5.952071e-04:  11%|█▏        | 1074/9433 [06:35<51:26,  2.71it/s][A
epoch 1 iter 1074: train loss 2.83171. lr 5.952071e-04:  11%|█▏        | 1075/9433 [06:35<51:18,  2.71it/s][A
epoch 1 iter 1075: train loss 2.87642. lr 5.951982e-04:  11%|█▏        | 1075/9433 [06:36<51:18,  2.71it/s][A
epoch 1 iter 1075: train loss 2.87642. lr 5.951982e-04:  11%|█▏        | 1076/9433 [06:36<51:16,  2.72it/s][A
epoch 1 iter 1076: train loss 3.01095. lr 5.951892e-04:  11%|█▏        | 1076/9433 [06:36<51:16,  2.72it/s][A
epoch 1 iter 1076: train loss 3.01095. lr 5.951892e-04:  11%|█▏        | 1077/9433 [06:36<51:18,  2.71it/s][A
epoch 1 iter 1077: train loss 3.04821. lr 5.951803e-04:  11%|█▏        | 1077/9433 [06:36<51:18,  2.71it/s][A
epoch 1 iter 1077: train loss 3.04821. lr 5.951803e-04:  11%|█▏        | 1078/9433 [06:36<51:27,  2.71it/s][A

epoch 1 iter 1110: train loss 2.88504. lr 5.948815e-04:  12%|█▏        | 1110/9433 [06:48<51:06,  2.71it/s][A
epoch 1 iter 1110: train loss 2.88504. lr 5.948815e-04:  12%|█▏        | 1111/9433 [06:48<51:05,  2.71it/s][A
epoch 1 iter 1111: train loss 2.94526. lr 5.948724e-04:  12%|█▏        | 1111/9433 [06:49<51:05,  2.71it/s][A
epoch 1 iter 1111: train loss 2.94526. lr 5.948724e-04:  12%|█▏        | 1112/9433 [06:49<50:55,  2.72it/s][A
epoch 1 iter 1112: train loss 2.91045. lr 5.948632e-04:  12%|█▏        | 1112/9433 [06:49<50:55,  2.72it/s][A
epoch 1 iter 1112: train loss 2.91045. lr 5.948632e-04:  12%|█▏        | 1113/9433 [06:49<51:04,  2.71it/s][A
epoch 1 iter 1113: train loss 2.84942. lr 5.948539e-04:  12%|█▏        | 1113/9433 [06:50<51:04,  2.71it/s][A
epoch 1 iter 1113: train loss 2.84942. lr 5.948539e-04:  12%|█▏        | 1114/9433 [06:50<51:06,  2.71it/s][A
epoch 1 iter 1114: train loss 2.96225. lr 5.948447e-04:  12%|█▏        | 1114/9433 [06:50<51:06,  2.71it/s][A

epoch 1 iter 1146: train loss 2.62022. lr 5.945454e-04:  12%|█▏        | 1147/9433 [07:02<50:56,  2.71it/s][A
epoch 1 iter 1147: train loss 2.65445. lr 5.945360e-04:  12%|█▏        | 1147/9433 [07:02<50:56,  2.71it/s][A
epoch 1 iter 1147: train loss 2.65445. lr 5.945360e-04:  12%|█▏        | 1148/9433 [07:02<51:07,  2.70it/s][A
epoch 1 iter 1148: train loss 2.82839. lr 5.945265e-04:  12%|█▏        | 1148/9433 [07:02<51:07,  2.70it/s][A
epoch 1 iter 1148: train loss 2.82839. lr 5.945265e-04:  12%|█▏        | 1149/9433 [07:02<51:02,  2.71it/s][A
epoch 1 iter 1149: train loss 2.91232. lr 5.945170e-04:  12%|█▏        | 1149/9433 [07:03<51:02,  2.71it/s][A
epoch 1 iter 1149: train loss 2.91232. lr 5.945170e-04:  12%|█▏        | 1150/9433 [07:03<50:54,  2.71it/s][A
epoch 1 iter 1150: train loss 2.89457. lr 5.945074e-04:  12%|█▏        | 1150/9433 [07:03<50:54,  2.71it/s][A
epoch 1 iter 1150: train loss 2.89457. lr 5.945074e-04:  12%|█▏        | 1151/9433 [07:03<50:55,  2.71it/s][A

epoch 1 iter 1183: train loss 2.91304. lr 5.941890e-04:  13%|█▎        | 1183/9433 [07:15<50:44,  2.71it/s][A
epoch 1 iter 1183: train loss 2.91304. lr 5.941890e-04:  13%|█▎        | 1184/9433 [07:15<50:35,  2.72it/s][A
epoch 1 iter 1184: train loss 2.87362. lr 5.941792e-04:  13%|█▎        | 1184/9433 [07:16<50:35,  2.72it/s][A
epoch 1 iter 1184: train loss 2.87362. lr 5.941792e-04:  13%|█▎        | 1185/9433 [07:16<50:30,  2.72it/s][A
epoch 1 iter 1185: train loss 2.68051. lr 5.941694e-04:  13%|█▎        | 1185/9433 [07:16<50:30,  2.72it/s][A
epoch 1 iter 1185: train loss 2.68051. lr 5.941694e-04:  13%|█▎        | 1186/9433 [07:16<50:33,  2.72it/s][A
epoch 1 iter 1186: train loss 2.80127. lr 5.941596e-04:  13%|█▎        | 1186/9433 [07:16<50:33,  2.72it/s][A
epoch 1 iter 1186: train loss 2.80127. lr 5.941596e-04:  13%|█▎        | 1187/9433 [07:16<50:41,  2.71it/s][A
epoch 1 iter 1187: train loss 2.81047. lr 5.941498e-04:  13%|█▎        | 1187/9433 [07:17<50:41,  2.71it/s][A

epoch 1 iter 1219: train loss 2.80486. lr 5.938314e-04:  13%|█▎        | 1220/9433 [07:29<50:25,  2.71it/s][A
epoch 1 iter 1220: train loss 2.76949. lr 5.938213e-04:  13%|█▎        | 1220/9433 [07:29<50:25,  2.71it/s][A
epoch 1 iter 1220: train loss 2.76949. lr 5.938213e-04:  13%|█▎        | 1221/9433 [07:29<50:28,  2.71it/s][A
epoch 1 iter 1221: train loss 2.62185. lr 5.938112e-04:  13%|█▎        | 1221/9433 [07:29<50:28,  2.71it/s][A
epoch 1 iter 1221: train loss 2.62185. lr 5.938112e-04:  13%|█▎        | 1222/9433 [07:29<50:35,  2.71it/s][A
epoch 1 iter 1222: train loss 2.61460. lr 5.938011e-04:  13%|█▎        | 1222/9433 [07:30<50:35,  2.71it/s][A
epoch 1 iter 1222: train loss 2.61460. lr 5.938011e-04:  13%|█▎        | 1223/9433 [07:30<50:33,  2.71it/s][A
epoch 1 iter 1223: train loss 2.51209. lr 5.937910e-04:  13%|█▎        | 1223/9433 [07:30<50:33,  2.71it/s][A
epoch 1 iter 1223: train loss 2.51209. lr 5.937910e-04:  13%|█▎        | 1224/9433 [07:30<50:26,  2.71it/s][A

epoch 1 iter 1256: train loss 2.65709. lr 5.934529e-04:  13%|█▎        | 1256/9433 [07:42<50:06,  2.72it/s][A
epoch 1 iter 1256: train loss 2.65709. lr 5.934529e-04:  13%|█▎        | 1257/9433 [07:42<50:13,  2.71it/s][A
epoch 1 iter 1257: train loss 2.58683. lr 5.934425e-04:  13%|█▎        | 1257/9433 [07:43<50:13,  2.71it/s][A
epoch 1 iter 1257: train loss 2.58683. lr 5.934425e-04:  13%|█▎        | 1258/9433 [07:43<50:15,  2.71it/s][A
epoch 1 iter 1258: train loss 2.49135. lr 5.934321e-04:  13%|█▎        | 1258/9433 [07:43<50:15,  2.71it/s][A
epoch 1 iter 1258: train loss 2.49135. lr 5.934321e-04:  13%|█▎        | 1259/9433 [07:43<50:08,  2.72it/s][A
epoch 1 iter 1259: train loss 2.53675. lr 5.934217e-04:  13%|█▎        | 1259/9433 [07:43<50:08,  2.72it/s][A
epoch 1 iter 1259: train loss 2.53675. lr 5.934217e-04:  13%|█▎        | 1260/9433 [07:43<50:14,  2.71it/s][A
epoch 1 iter 1260: train loss 2.65088. lr 5.934113e-04:  13%|█▎        | 1260/9433 [07:44<50:14,  2.71it/s][A

epoch 1 iter 1292: train loss 2.46413. lr 5.930740e-04:  14%|█▎        | 1293/9433 [07:55<50:03,  2.71it/s][A
epoch 1 iter 1293: train loss 2.64528. lr 5.930633e-04:  14%|█▎        | 1293/9433 [07:56<50:03,  2.71it/s][A
epoch 1 iter 1293: train loss 2.64528. lr 5.930633e-04:  14%|█▎        | 1294/9433 [07:56<49:56,  2.72it/s][A
epoch 1 iter 1294: train loss 2.50079. lr 5.930526e-04:  14%|█▎        | 1294/9433 [07:56<49:56,  2.72it/s][A
epoch 1 iter 1294: train loss 2.50079. lr 5.930526e-04:  14%|█▎        | 1295/9433 [07:56<49:51,  2.72it/s][A
epoch 1 iter 1295: train loss 2.44763. lr 5.930419e-04:  14%|█▎        | 1295/9433 [07:57<49:51,  2.72it/s][A
epoch 1 iter 1295: train loss 2.44763. lr 5.930419e-04:  14%|█▎        | 1296/9433 [07:57<50:05,  2.71it/s][A
epoch 1 iter 1296: train loss 2.54285. lr 5.930312e-04:  14%|█▎        | 1296/9433 [07:57<50:05,  2.71it/s][A
epoch 1 iter 1296: train loss 2.54285. lr 5.930312e-04:  14%|█▎        | 1297/9433 [07:57<50:09,  2.70it/s][A

epoch 1 iter 1329: train loss 2.46051. lr 5.926735e-04:  14%|█▍        | 1329/9433 [08:09<49:46,  2.71it/s][A
epoch 1 iter 1329: train loss 2.46051. lr 5.926735e-04:  14%|█▍        | 1330/9433 [08:09<49:47,  2.71it/s][A
epoch 1 iter 1330: train loss 2.47406. lr 5.926625e-04:  14%|█▍        | 1330/9433 [08:09<49:47,  2.71it/s][A
epoch 1 iter 1330: train loss 2.47406. lr 5.926625e-04:  14%|█▍        | 1331/9433 [08:09<49:37,  2.72it/s][A
epoch 1 iter 1331: train loss 2.43886. lr 5.926515e-04:  14%|█▍        | 1331/9433 [08:10<49:37,  2.72it/s][A
epoch 1 iter 1331: train loss 2.43886. lr 5.926515e-04:  14%|█▍        | 1332/9433 [08:10<49:45,  2.71it/s][A
epoch 1 iter 1332: train loss 2.57290. lr 5.926405e-04:  14%|█▍        | 1332/9433 [08:10<49:45,  2.71it/s][A
epoch 1 iter 1332: train loss 2.57290. lr 5.926405e-04:  14%|█▍        | 1333/9433 [08:10<49:48,  2.71it/s][A
epoch 1 iter 1333: train loss 2.34873. lr 5.926295e-04:  14%|█▍        | 1333/9433 [08:11<49:48,  2.71it/s][A

epoch 1 iter 1520: train loss 2.18886. lr 5.904299e-04:  16%|█▌        | 1520/9433 [09:19<48:32,  2.72it/s][A
epoch 1 iter 1520: train loss 2.18886. lr 5.904299e-04:  16%|█▌        | 1521/9433 [09:19<48:37,  2.71it/s][A
epoch 1 iter 1521: train loss 2.16829. lr 5.904173e-04:  16%|█▌        | 1521/9433 [09:20<48:37,  2.71it/s][A
epoch 1 iter 1521: train loss 2.16829. lr 5.904173e-04:  16%|█▌        | 1522/9433 [09:20<48:34,  2.71it/s][A
epoch 1 iter 1522: train loss 2.10840. lr 5.904048e-04:  16%|█▌        | 1522/9433 [09:20<48:34,  2.71it/s][A
epoch 1 iter 1522: train loss 2.10840. lr 5.904048e-04:  16%|█▌        | 1523/9433 [09:20<48:29,  2.72it/s][A
epoch 1 iter 1523: train loss 2.14303. lr 5.903923e-04:  16%|█▌        | 1523/9433 [09:21<48:29,  2.72it/s][A
epoch 1 iter 1523: train loss 2.14303. lr 5.903923e-04:  16%|█▌        | 1524/9433 [09:21<48:11,  2.74it/s][A
epoch 1 iter 1524: train loss 2.15422. lr 5.903797e-04:  16%|█▌        | 1524/9433 [09:21<48:11,  2.74it/s][A

epoch 1 iter 1556: train loss 1.93329. lr 5.899740e-04:  17%|█▋        | 1557/9433 [09:33<48:12,  2.72it/s][A
epoch 1 iter 1557: train loss 2.11542. lr 5.899612e-04:  17%|█▋        | 1557/9433 [09:33<48:12,  2.72it/s][A
epoch 1 iter 1557: train loss 2.11542. lr 5.899612e-04:  17%|█▋        | 1558/9433 [09:33<48:13,  2.72it/s][A
epoch 1 iter 1558: train loss 2.24135. lr 5.899484e-04:  17%|█▋        | 1558/9433 [09:33<48:13,  2.72it/s][A
epoch 1 iter 1558: train loss 2.24135. lr 5.899484e-04:  17%|█▋        | 1559/9433 [09:33<48:20,  2.71it/s][A
epoch 1 iter 1559: train loss 2.24853. lr 5.899355e-04:  17%|█▋        | 1559/9433 [09:34<48:20,  2.71it/s][A
epoch 1 iter 1559: train loss 2.24853. lr 5.899355e-04:  17%|█▋        | 1560/9433 [09:34<48:24,  2.71it/s][A
epoch 1 iter 1560: train loss 2.10871. lr 5.899227e-04:  17%|█▋        | 1560/9433 [09:34<48:24,  2.71it/s][A
epoch 1 iter 1560: train loss 2.10871. lr 5.899227e-04:  17%|█▋        | 1561/9433 [09:34<48:18,  2.72it/s][A

epoch 1 iter 1593: train loss 2.14048. lr 5.894946e-04:  17%|█▋        | 1593/9433 [09:46<48:09,  2.71it/s][A
epoch 1 iter 1593: train loss 2.14048. lr 5.894946e-04:  17%|█▋        | 1594/9433 [09:46<48:16,  2.71it/s][A
epoch 1 iter 1594: train loss 2.08800. lr 5.894815e-04:  17%|█▋        | 1594/9433 [09:47<48:16,  2.71it/s][A
epoch 1 iter 1594: train loss 2.08800. lr 5.894815e-04:  17%|█▋        | 1595/9433 [09:47<48:13,  2.71it/s][A
epoch 1 iter 1595: train loss 2.28510. lr 5.894684e-04:  17%|█▋        | 1595/9433 [09:47<48:13,  2.71it/s][A
epoch 1 iter 1595: train loss 2.28510. lr 5.894684e-04:  17%|█▋        | 1596/9433 [09:47<48:06,  2.71it/s][A
epoch 1 iter 1596: train loss 2.06114. lr 5.894553e-04:  17%|█▋        | 1596/9433 [09:47<48:06,  2.71it/s][A
epoch 1 iter 1596: train loss 2.06114. lr 5.894553e-04:  17%|█▋        | 1597/9433 [09:47<48:04,  2.72it/s][A
epoch 1 iter 1597: train loss 2.16690. lr 5.894421e-04:  17%|█▋        | 1597/9433 [09:48<48:04,  2.72it/s][A

epoch 1 iter 3585: train loss 0.76999. lr 5.480829e-04:  38%|███▊      | 3586/9433 [22:00<35:54,  2.71it/s][A
epoch 1 iter 3586: train loss 0.75871. lr 5.480548e-04:  38%|███▊      | 3586/9433 [22:01<35:54,  2.71it/s][A
epoch 1 iter 3586: train loss 0.75871. lr 5.480548e-04:  38%|███▊      | 3587/9433 [22:01<35:50,  2.72it/s][A
epoch 1 iter 3587: train loss 0.81823. lr 5.480267e-04:  38%|███▊      | 3587/9433 [22:01<35:50,  2.72it/s][A
epoch 1 iter 3587: train loss 0.81823. lr 5.480267e-04:  38%|███▊      | 3588/9433 [22:01<35:47,  2.72it/s][A
epoch 1 iter 3588: train loss 0.78076. lr 5.479986e-04:  38%|███▊      | 3588/9433 [22:01<35:47,  2.72it/s][A
epoch 1 iter 3588: train loss 0.78076. lr 5.479986e-04:  38%|███▊      | 3589/9433 [22:01<35:36,  2.74it/s][A
epoch 1 iter 3589: train loss 0.81434. lr 5.479705e-04:  38%|███▊      | 3589/9433 [22:02<35:36,  2.74it/s][A
epoch 1 iter 3589: train loss 0.81434. lr 5.479705e-04:  38%|███▊      | 3590/9433 [22:02<35:47,  2.72it/s]

epoch 1 iter 3622: train loss 0.75375. lr 5.470388e-04:  38%|███▊      | 3622/9433 [22:14<35:44,  2.71it/s][A
epoch 1 iter 3622: train loss 0.75375. lr 5.470388e-04:  38%|███▊      | 3623/9433 [22:14<35:45,  2.71it/s][A
epoch 1 iter 3623: train loss 0.82797. lr 5.470105e-04:  38%|███▊      | 3623/9433 [22:14<35:45,  2.71it/s][A
epoch 1 iter 3623: train loss 0.82797. lr 5.470105e-04:  38%|███▊      | 3624/9433 [22:14<35:37,  2.72it/s][A
epoch 1 iter 3624: train loss 0.78123. lr 5.469821e-04:  38%|███▊      | 3624/9433 [22:15<35:37,  2.72it/s][A
epoch 1 iter 3624: train loss 0.78123. lr 5.469821e-04:  38%|███▊      | 3625/9433 [22:15<35:42,  2.71it/s][A
epoch 1 iter 3625: train loss 0.80549. lr 5.469537e-04:  38%|███▊      | 3625/9433 [22:15<35:42,  2.71it/s][A
epoch 1 iter 3625: train loss 0.80549. lr 5.469537e-04:  38%|███▊      | 3626/9433 [22:15<35:42,  2.71it/s][A
epoch 1 iter 3626: train loss 0.77938. lr 5.469254e-04:  38%|███▊      | 3626/9433 [22:15<35:42,  2.71it/s][A

epoch 1 iter 3658: train loss 0.77974. lr 5.460140e-04:  39%|███▉      | 3659/9433 [22:27<35:22,  2.72it/s][A
epoch 1 iter 3659: train loss 0.76265. lr 5.459854e-04:  39%|███▉      | 3659/9433 [22:28<35:22,  2.72it/s][A
epoch 1 iter 3659: train loss 0.76265. lr 5.459854e-04:  39%|███▉      | 3660/9433 [22:28<35:27,  2.71it/s][A
epoch 1 iter 3660: train loss 0.79525. lr 5.459568e-04:  39%|███▉      | 3660/9433 [22:28<35:27,  2.71it/s][A
epoch 1 iter 3660: train loss 0.79525. lr 5.459568e-04:  39%|███▉      | 3661/9433 [22:28<35:28,  2.71it/s][A
epoch 1 iter 3661: train loss 0.76361. lr 5.459282e-04:  39%|███▉      | 3661/9433 [22:28<35:28,  2.71it/s][A
epoch 1 iter 3661: train loss 0.76361. lr 5.459282e-04:  39%|███▉      | 3662/9433 [22:28<35:24,  2.72it/s][A
epoch 1 iter 3662: train loss 0.76504. lr 5.458995e-04:  39%|███▉      | 3662/9433 [22:29<35:24,  2.72it/s][A
epoch 1 iter 3662: train loss 0.76504. lr 5.458995e-04:  39%|███▉      | 3663/9433 [22:29<35:22,  2.72it/s][A

epoch 1 iter 3695: train loss 0.76021. lr 5.449514e-04:  39%|███▉      | 3695/9433 [22:41<35:13,  2.72it/s][A
epoch 1 iter 3695: train loss 0.76021. lr 5.449514e-04:  39%|███▉      | 3696/9433 [22:41<35:08,  2.72it/s][A
epoch 1 iter 3696: train loss 0.78714. lr 5.449226e-04:  39%|███▉      | 3696/9433 [22:41<35:08,  2.72it/s][A
epoch 1 iter 3696: train loss 0.78714. lr 5.449226e-04:  39%|███▉      | 3697/9433 [22:41<35:05,  2.72it/s][A
epoch 1 iter 3697: train loss 0.75418. lr 5.448937e-04:  39%|███▉      | 3697/9433 [22:42<35:05,  2.72it/s][A
epoch 1 iter 3697: train loss 0.75418. lr 5.448937e-04:  39%|███▉      | 3698/9433 [22:42<34:55,  2.74it/s][A
epoch 1 iter 3698: train loss 0.77826. lr 5.448649e-04:  39%|███▉      | 3698/9433 [22:42<34:55,  2.74it/s][A
epoch 1 iter 3698: train loss 0.77826. lr 5.448649e-04:  39%|███▉      | 3699/9433 [22:42<35:05,  2.72it/s][A
epoch 1 iter 3699: train loss 0.79239. lr 5.448360e-04:  39%|███▉      | 3699/9433 [22:42<35:05,  2.72it/s][A

epoch 1 iter 3731: train loss 0.76710. lr 5.439087e-04:  40%|███▉      | 3732/9433 [22:54<35:00,  2.71it/s][A
epoch 1 iter 3732: train loss 0.74332. lr 5.438796e-04:  40%|███▉      | 3732/9433 [22:54<35:00,  2.71it/s][A
epoch 1 iter 3732: train loss 0.74332. lr 5.438796e-04:  40%|███▉      | 3733/9433 [22:54<34:53,  2.72it/s][A
epoch 1 iter 3733: train loss 0.76737. lr 5.438505e-04:  40%|███▉      | 3733/9433 [22:55<34:53,  2.72it/s][A
epoch 1 iter 3733: train loss 0.76737. lr 5.438505e-04:  40%|███▉      | 3734/9433 [22:55<34:58,  2.72it/s][A
epoch 1 iter 3734: train loss 0.80516. lr 5.438214e-04:  40%|███▉      | 3734/9433 [22:55<34:58,  2.72it/s][A
epoch 1 iter 3734: train loss 0.80516. lr 5.438214e-04:  40%|███▉      | 3735/9433 [22:55<34:58,  2.71it/s][A
epoch 1 iter 3735: train loss 0.77044. lr 5.437923e-04:  40%|███▉      | 3735/9433 [22:56<34:58,  2.71it/s][A
epoch 1 iter 3735: train loss 0.77044. lr 5.437923e-04:  40%|███▉      | 3736/9433 [22:56<34:55,  2.72it/s][A

epoch 1 iter 3768: train loss 0.72618. lr 5.428278e-04:  40%|███▉      | 3768/9433 [23:08<34:49,  2.71it/s][A
epoch 1 iter 3768: train loss 0.72618. lr 5.428278e-04:  40%|███▉      | 3769/9433 [23:08<34:54,  2.70it/s][A
epoch 1 iter 3769: train loss 0.83436. lr 5.427985e-04:  40%|███▉      | 3769/9433 [23:08<34:54,  2.70it/s][A
epoch 1 iter 3769: train loss 0.83436. lr 5.427985e-04:  40%|███▉      | 3770/9433 [23:08<34:53,  2.71it/s][A
epoch 1 iter 3770: train loss 0.74200. lr 5.427691e-04:  40%|███▉      | 3770/9433 [23:08<34:53,  2.71it/s][A
epoch 1 iter 3770: train loss 0.74200. lr 5.427691e-04:  40%|███▉      | 3771/9433 [23:08<34:47,  2.71it/s][A
epoch 1 iter 3771: train loss 0.74821. lr 5.427398e-04:  40%|███▉      | 3771/9433 [23:09<34:47,  2.71it/s][A
epoch 1 iter 3771: train loss 0.74821. lr 5.427398e-04:  40%|███▉      | 3772/9433 [23:09<34:49,  2.71it/s][A
epoch 1 iter 3772: train loss 0.76103. lr 5.427104e-04:  40%|███▉      | 3772/9433 [23:09<34:49,  2.71it/s][A

epoch 1 iter 3804: train loss 0.76129. lr 5.417673e-04:  40%|████      | 3805/9433 [23:21<34:37,  2.71it/s][A
epoch 1 iter 3805: train loss 0.77453. lr 5.417378e-04:  40%|████      | 3805/9433 [23:21<34:37,  2.71it/s][A
epoch 1 iter 3805: train loss 0.77453. lr 5.417378e-04:  40%|████      | 3806/9433 [23:21<34:32,  2.72it/s][A
epoch 1 iter 3806: train loss 0.82598. lr 5.417082e-04:  40%|████      | 3806/9433 [23:22<34:32,  2.72it/s][A
epoch 1 iter 3806: train loss 0.82598. lr 5.417082e-04:  40%|████      | 3807/9433 [23:22<34:32,  2.71it/s][A
epoch 1 iter 3807: train loss 0.77298. lr 5.416786e-04:  40%|████      | 3807/9433 [23:22<34:32,  2.71it/s][A
epoch 1 iter 3807: train loss 0.77298. lr 5.416786e-04:  40%|████      | 3808/9433 [23:22<34:29,  2.72it/s][A
epoch 1 iter 3808: train loss 0.72112. lr 5.416490e-04:  40%|████      | 3808/9433 [23:22<34:29,  2.72it/s][A
epoch 1 iter 3808: train loss 0.72112. lr 5.416490e-04:  40%|████      | 3809/9433 [23:22<34:33,  2.71it/s][A

epoch 1 iter 3841: train loss 0.73579. lr 5.406683e-04:  41%|████      | 3841/9433 [23:35<34:19,  2.72it/s][A
epoch 1 iter 3841: train loss 0.73579. lr 5.406683e-04:  41%|████      | 3842/9433 [23:35<34:18,  2.72it/s][A
epoch 1 iter 3842: train loss 0.72016. lr 5.406385e-04:  41%|████      | 3842/9433 [23:35<34:18,  2.72it/s][A
epoch 1 iter 3842: train loss 0.72016. lr 5.406385e-04:  41%|████      | 3843/9433 [23:35<34:17,  2.72it/s][A
epoch 1 iter 3843: train loss 0.74206. lr 5.406087e-04:  41%|████      | 3843/9433 [23:35<34:17,  2.72it/s][A
epoch 1 iter 3843: train loss 0.74206. lr 5.406087e-04:  41%|████      | 3844/9433 [23:35<34:22,  2.71it/s][A
epoch 1 iter 3844: train loss 0.72786. lr 5.405788e-04:  41%|████      | 3844/9433 [23:36<34:22,  2.71it/s][A
epoch 1 iter 3844: train loss 0.72786. lr 5.405788e-04:  41%|████      | 3845/9433 [23:36<34:25,  2.71it/s][A
epoch 1 iter 3845: train loss 0.73583. lr 5.405490e-04:  41%|████      | 3845/9433 [23:36<34:25,  2.71it/s][A

epoch 1 iter 3877: train loss 0.64923. lr 5.395903e-04:  41%|████      | 3878/9433 [23:48<34:02,  2.72it/s][A
epoch 1 iter 3878: train loss 0.71886. lr 5.395602e-04:  41%|████      | 3878/9433 [23:48<34:02,  2.72it/s][A
epoch 1 iter 3878: train loss 0.71886. lr 5.395602e-04:  41%|████      | 3879/9433 [23:48<34:06,  2.71it/s][A
epoch 1 iter 3879: train loss 0.71586. lr 5.395301e-04:  41%|████      | 3879/9433 [23:49<34:06,  2.71it/s][A
epoch 1 iter 3879: train loss 0.71586. lr 5.395301e-04:  41%|████      | 3880/9433 [23:49<34:05,  2.71it/s][A
epoch 1 iter 3880: train loss 0.70992. lr 5.395000e-04:  41%|████      | 3880/9433 [23:49<34:05,  2.71it/s][A
epoch 1 iter 3880: train loss 0.70992. lr 5.395000e-04:  41%|████      | 3881/9433 [23:49<34:02,  2.72it/s][A
epoch 1 iter 3881: train loss 0.73926. lr 5.394700e-04:  41%|████      | 3881/9433 [23:49<34:02,  2.72it/s][A
epoch 1 iter 3881: train loss 0.73926. lr 5.394700e-04:  41%|████      | 3882/9433 [23:49<34:06,  2.71it/s][A

epoch 1 iter 3914: train loss 0.80998. lr 5.384733e-04:  41%|████▏     | 3914/9433 [24:02<33:58,  2.71it/s][A
epoch 1 iter 3914: train loss 0.80998. lr 5.384733e-04:  42%|████▏     | 3915/9433 [24:02<33:58,  2.71it/s][A
epoch 1 iter 3915: train loss 0.76895. lr 5.384430e-04:  42%|████▏     | 3915/9433 [24:02<33:58,  2.71it/s][A
epoch 1 iter 3915: train loss 0.76895. lr 5.384430e-04:  42%|████▏     | 3916/9433 [24:02<33:53,  2.71it/s][A
epoch 1 iter 3916: train loss 0.73675. lr 5.384127e-04:  42%|████▏     | 3916/9433 [24:02<33:53,  2.71it/s][A
epoch 1 iter 3916: train loss 0.73675. lr 5.384127e-04:  42%|████▏     | 3917/9433 [24:02<33:54,  2.71it/s][A
epoch 1 iter 3917: train loss 0.72408. lr 5.383823e-04:  42%|████▏     | 3917/9433 [24:03<33:54,  2.71it/s][A
epoch 1 iter 3917: train loss 0.72408. lr 5.383823e-04:  42%|████▏     | 3918/9433 [24:03<33:47,  2.72it/s][A
epoch 1 iter 3918: train loss 0.72857. lr 5.383520e-04:  42%|████▏     | 3918/9433 [24:03<33:47,  2.72it/s][A

epoch 1 iter 3950: train loss 0.72783. lr 5.373778e-04:  42%|████▏     | 3951/9433 [24:15<33:35,  2.72it/s][A
epoch 1 iter 3951: train loss 0.73087. lr 5.373473e-04:  42%|████▏     | 3951/9433 [24:15<33:35,  2.72it/s][A
epoch 1 iter 3951: train loss 0.73087. lr 5.373473e-04:  42%|████▏     | 3952/9433 [24:15<33:23,  2.74it/s][A
epoch 1 iter 3952: train loss 0.78313. lr 5.373167e-04:  42%|████▏     | 3952/9433 [24:16<33:23,  2.74it/s][A
epoch 1 iter 3952: train loss 0.78313. lr 5.373167e-04:  42%|████▏     | 3953/9433 [24:16<33:35,  2.72it/s][A
epoch 1 iter 3953: train loss 0.74481. lr 5.372861e-04:  42%|████▏     | 3953/9433 [24:16<33:35,  2.72it/s][A
epoch 1 iter 3953: train loss 0.74481. lr 5.372861e-04:  42%|████▏     | 3954/9433 [24:16<33:36,  2.72it/s][A
epoch 1 iter 3954: train loss 0.79196. lr 5.372556e-04:  42%|████▏     | 3954/9433 [24:16<33:36,  2.72it/s][A
epoch 1 iter 3954: train loss 0.79196. lr 5.372556e-04:  42%|████▏     | 3955/9433 [24:16<33:33,  2.72it/s][A

epoch 1 iter 4141: train loss 0.71203. lr 5.314239e-04:  44%|████▍     | 4141/9433 [25:25<32:32,  2.71it/s][A
epoch 1 iter 4141: train loss 0.71203. lr 5.314239e-04:  44%|████▍     | 4142/9433 [25:25<32:34,  2.71it/s][A
epoch 1 iter 4142: train loss 0.70554. lr 5.313921e-04:  44%|████▍     | 4142/9433 [25:25<32:34,  2.71it/s][A
epoch 1 iter 4142: train loss 0.70554. lr 5.313921e-04:  44%|████▍     | 4143/9433 [25:25<32:31,  2.71it/s][A
epoch 1 iter 4143: train loss 0.69714. lr 5.313603e-04:  44%|████▍     | 4143/9433 [25:26<32:31,  2.71it/s][A
epoch 1 iter 4143: train loss 0.69714. lr 5.313603e-04:  44%|████▍     | 4144/9433 [25:26<32:25,  2.72it/s][A
epoch 1 iter 4144: train loss 0.70379. lr 5.313285e-04:  44%|████▍     | 4144/9433 [25:26<32:25,  2.72it/s][A
epoch 1 iter 4144: train loss 0.70379. lr 5.313285e-04:  44%|████▍     | 4145/9433 [25:26<32:27,  2.72it/s][A
epoch 1 iter 4145: train loss 0.74029. lr 5.312967e-04:  44%|████▍     | 4145/9433 [25:27<32:27,  2.72it/s][A

epoch 1 iter 4177: train loss 0.66839. lr 5.302753e-04:  44%|████▍     | 4178/9433 [25:38<32:15,  2.72it/s][A
epoch 1 iter 4178: train loss 0.62112. lr 5.302433e-04:  44%|████▍     | 4178/9433 [25:39<32:15,  2.72it/s][A
epoch 1 iter 4178: train loss 0.62112. lr 5.302433e-04:  44%|████▍     | 4179/9433 [25:39<32:12,  2.72it/s][A
epoch 1 iter 4179: train loss 0.66434. lr 5.302112e-04:  44%|████▍     | 4179/9433 [25:39<32:12,  2.72it/s][A
epoch 1 iter 4179: train loss 0.66434. lr 5.302112e-04:  44%|████▍     | 4180/9433 [25:39<32:02,  2.73it/s][A
epoch 1 iter 4180: train loss 0.69224. lr 5.301792e-04:  44%|████▍     | 4180/9433 [25:39<32:02,  2.73it/s][A
epoch 1 iter 4180: train loss 0.69224. lr 5.301792e-04:  44%|████▍     | 4181/9433 [25:39<32:12,  2.72it/s][A
epoch 1 iter 4181: train loss 0.65919. lr 5.301472e-04:  44%|████▍     | 4181/9433 [25:40<32:12,  2.72it/s][A
epoch 1 iter 4181: train loss 0.65919. lr 5.301472e-04:  44%|████▍     | 4182/9433 [25:40<32:14,  2.71it/s][A

epoch 1 iter 4214: train loss 0.65468. lr 5.290862e-04:  45%|████▍     | 4214/9433 [25:52<31:43,  2.74it/s][A
epoch 1 iter 4214: train loss 0.65468. lr 5.290862e-04:  45%|████▍     | 4215/9433 [25:52<31:51,  2.73it/s][A
epoch 1 iter 4215: train loss 0.69479. lr 5.290539e-04:  45%|████▍     | 4215/9433 [25:52<31:51,  2.73it/s][A
epoch 1 iter 4215: train loss 0.69479. lr 5.290539e-04:  45%|████▍     | 4216/9433 [25:52<31:54,  2.72it/s][A
epoch 1 iter 4216: train loss 0.69917. lr 5.290216e-04:  45%|████▍     | 4216/9433 [25:53<31:54,  2.72it/s][A
epoch 1 iter 4216: train loss 0.69917. lr 5.290216e-04:  45%|████▍     | 4217/9433 [25:53<31:52,  2.73it/s][A
epoch 1 iter 4217: train loss 0.67798. lr 5.289894e-04:  45%|████▍     | 4217/9433 [25:53<31:52,  2.73it/s][A
epoch 1 iter 4217: train loss 0.67798. lr 5.289894e-04:  45%|████▍     | 4218/9433 [25:53<31:50,  2.73it/s][A
epoch 1 iter 4218: train loss 0.64980. lr 5.289571e-04:  45%|████▍     | 4218/9433 [25:53<31:50,  2.73it/s][A

epoch 1 iter 6215: train loss 0.52304. lr 4.531271e-04:  66%|██████▌   | 6216/9433 [38:09<19:44,  2.72it/s][A
epoch 1 iter 6216: train loss 0.48050. lr 4.530841e-04:  66%|██████▌   | 6216/9433 [38:09<19:44,  2.72it/s][A
epoch 1 iter 6216: train loss 0.48050. lr 4.530841e-04:  66%|██████▌   | 6217/9433 [38:09<19:44,  2.72it/s][A
epoch 1 iter 6217: train loss 0.54108. lr 4.530412e-04:  66%|██████▌   | 6217/9433 [38:10<19:44,  2.72it/s][A
epoch 1 iter 6217: train loss 0.54108. lr 4.530412e-04:  66%|██████▌   | 6218/9433 [38:10<19:44,  2.71it/s][A
epoch 1 iter 6218: train loss 0.51126. lr 4.529982e-04:  66%|██████▌   | 6218/9433 [38:10<19:44,  2.71it/s][A
epoch 1 iter 6218: train loss 0.51126. lr 4.529982e-04:  66%|██████▌   | 6219/9433 [38:10<19:45,  2.71it/s][A
epoch 1 iter 6219: train loss 0.49712. lr 4.529552e-04:  66%|██████▌   | 6219/9433 [38:10<19:45,  2.71it/s][A
epoch 1 iter 6219: train loss 0.49712. lr 4.529552e-04:  66%|██████▌   | 6220/9433 [38:10<19:44,  2.71it/s][A

epoch 1 iter 6252: train loss 0.48937. lr 4.515346e-04:  66%|██████▋   | 6252/9433 [38:23<19:29,  2.72it/s][A
epoch 1 iter 6252: train loss 0.48937. lr 4.515346e-04:  66%|██████▋   | 6253/9433 [38:23<19:31,  2.72it/s][A
epoch 1 iter 6253: train loss 0.47438. lr 4.514915e-04:  66%|██████▋   | 6253/9433 [38:23<19:31,  2.72it/s][A
epoch 1 iter 6253: train loss 0.47438. lr 4.514915e-04:  66%|██████▋   | 6254/9433 [38:23<19:32,  2.71it/s][A
epoch 1 iter 6254: train loss 0.50236. lr 4.514484e-04:  66%|██████▋   | 6254/9433 [38:23<19:32,  2.71it/s][A
epoch 1 iter 6254: train loss 0.50236. lr 4.514484e-04:  66%|██████▋   | 6255/9433 [38:23<19:28,  2.72it/s][A
epoch 1 iter 6255: train loss 0.53662. lr 4.514053e-04:  66%|██████▋   | 6255/9433 [38:24<19:28,  2.72it/s][A
epoch 1 iter 6255: train loss 0.53662. lr 4.514053e-04:  66%|██████▋   | 6256/9433 [38:24<19:26,  2.72it/s][A
epoch 1 iter 6256: train loss 0.47128. lr 4.513621e-04:  66%|██████▋   | 6256/9433 [38:24<19:26,  2.72it/s][A

epoch 1 iter 6288: train loss 0.48799. lr 4.499797e-04:  67%|██████▋   | 6289/9433 [38:36<19:20,  2.71it/s][A
epoch 1 iter 6289: train loss 0.49329. lr 4.499364e-04:  67%|██████▋   | 6289/9433 [38:36<19:20,  2.71it/s][A
epoch 1 iter 6289: train loss 0.49329. lr 4.499364e-04:  67%|██████▋   | 6290/9433 [38:36<19:17,  2.71it/s][A
epoch 1 iter 6290: train loss 0.54008. lr 4.498932e-04:  67%|██████▋   | 6290/9433 [38:37<19:17,  2.71it/s][A
epoch 1 iter 6290: train loss 0.54008. lr 4.498932e-04:  67%|██████▋   | 6291/9433 [38:37<19:16,  2.72it/s][A
epoch 1 iter 6291: train loss 0.50968. lr 4.498499e-04:  67%|██████▋   | 6291/9433 [38:37<19:16,  2.72it/s][A
epoch 1 iter 6291: train loss 0.50968. lr 4.498499e-04:  67%|██████▋   | 6292/9433 [38:37<19:15,  2.72it/s][A
epoch 1 iter 6292: train loss 0.49099. lr 4.498066e-04:  67%|██████▋   | 6292/9433 [38:37<19:15,  2.72it/s][A
epoch 1 iter 6292: train loss 0.49099. lr 4.498066e-04:  67%|██████▋   | 6293/9433 [38:37<19:18,  2.71it/s][A

epoch 1 iter 6325: train loss 0.50583. lr 4.483760e-04:  67%|██████▋   | 6325/9433 [38:49<19:05,  2.71it/s][A
epoch 1 iter 6325: train loss 0.50583. lr 4.483760e-04:  67%|██████▋   | 6326/9433 [38:49<19:04,  2.71it/s][A
epoch 1 iter 6326: train loss 0.52555. lr 4.483326e-04:  67%|██████▋   | 6326/9433 [38:50<19:04,  2.71it/s][A
epoch 1 iter 6326: train loss 0.52555. lr 4.483326e-04:  67%|██████▋   | 6327/9433 [38:50<19:03,  2.72it/s][A
epoch 1 iter 6327: train loss 0.50220. lr 4.482891e-04:  67%|██████▋   | 6327/9433 [38:50<19:03,  2.72it/s][A
epoch 1 iter 6327: train loss 0.50220. lr 4.482891e-04:  67%|██████▋   | 6328/9433 [38:50<19:05,  2.71it/s][A
epoch 1 iter 6328: train loss 0.51448. lr 4.482457e-04:  67%|██████▋   | 6328/9433 [38:51<19:05,  2.71it/s][A
epoch 1 iter 6328: train loss 0.51448. lr 4.482457e-04:  67%|██████▋   | 6329/9433 [38:51<19:05,  2.71it/s][A
epoch 1 iter 6329: train loss 0.48303. lr 4.482023e-04:  67%|██████▋   | 6329/9433 [38:51<19:05,  2.71it/s][A

epoch 1 iter 6361: train loss 0.53329. lr 4.468102e-04:  67%|██████▋   | 6362/9433 [39:03<18:48,  2.72it/s][A
epoch 1 iter 6362: train loss 0.49430. lr 4.467666e-04:  67%|██████▋   | 6362/9433 [39:03<18:48,  2.72it/s][A
epoch 1 iter 6362: train loss 0.49430. lr 4.467666e-04:  67%|██████▋   | 6363/9433 [39:03<18:50,  2.71it/s][A
epoch 1 iter 6363: train loss 0.45307. lr 4.467230e-04:  67%|██████▋   | 6363/9433 [39:03<18:50,  2.71it/s][A
epoch 1 iter 6363: train loss 0.45307. lr 4.467230e-04:  67%|██████▋   | 6364/9433 [39:03<18:51,  2.71it/s][A
epoch 1 iter 6364: train loss 0.49548. lr 4.466795e-04:  67%|██████▋   | 6364/9433 [39:04<18:51,  2.71it/s][A
epoch 1 iter 6364: train loss 0.49548. lr 4.466795e-04:  67%|██████▋   | 6365/9433 [39:04<18:49,  2.72it/s][A
epoch 1 iter 6365: train loss 0.50499. lr 4.466359e-04:  67%|██████▋   | 6365/9433 [39:04<18:49,  2.72it/s][A
epoch 1 iter 6365: train loss 0.50499. lr 4.466359e-04:  67%|██████▋   | 6366/9433 [39:04<18:48,  2.72it/s][A

epoch 1 iter 6398: train loss 0.49667. lr 4.451954e-04:  68%|██████▊   | 6398/9433 [39:16<18:37,  2.72it/s][A
epoch 1 iter 6398: train loss 0.49667. lr 4.451954e-04:  68%|██████▊   | 6399/9433 [39:16<18:34,  2.72it/s][A
epoch 1 iter 6399: train loss 0.50336. lr 4.451517e-04:  68%|██████▊   | 6399/9433 [39:17<18:34,  2.72it/s][A
epoch 1 iter 6399: train loss 0.50336. lr 4.451517e-04:  68%|██████▊   | 6400/9433 [39:17<18:32,  2.73it/s][A
epoch 1 iter 6400: train loss 0.48140. lr 4.451080e-04:  68%|██████▊   | 6400/9433 [39:17<18:32,  2.73it/s][A
epoch 1 iter 6400: train loss 0.48140. lr 4.451080e-04:  68%|██████▊   | 6401/9433 [39:17<18:29,  2.73it/s][A
epoch 1 iter 6401: train loss 0.51498. lr 4.450642e-04:  68%|██████▊   | 6401/9433 [39:17<18:29,  2.73it/s][A
epoch 1 iter 6401: train loss 0.51498. lr 4.450642e-04:  68%|██████▊   | 6402/9433 [39:17<18:34,  2.72it/s][A
epoch 1 iter 6402: train loss 0.52715. lr 4.450205e-04:  68%|██████▊   | 6402/9433 [39:18<18:34,  2.72it/s][A

epoch 1 iter 6434: train loss 0.53425. lr 4.436190e-04:  68%|██████▊   | 6435/9433 [39:30<18:24,  2.71it/s][A
epoch 1 iter 6435: train loss 0.51019. lr 4.435751e-04:  68%|██████▊   | 6435/9433 [39:30<18:24,  2.71it/s][A
epoch 1 iter 6435: train loss 0.51019. lr 4.435751e-04:  68%|██████▊   | 6436/9433 [39:30<18:20,  2.72it/s][A
epoch 1 iter 6436: train loss 0.50383. lr 4.435312e-04:  68%|██████▊   | 6436/9433 [39:30<18:20,  2.72it/s][A
epoch 1 iter 6436: train loss 0.50383. lr 4.435312e-04:  68%|██████▊   | 6437/9433 [39:30<18:22,  2.72it/s][A
epoch 1 iter 6437: train loss 0.50810. lr 4.434874e-04:  68%|██████▊   | 6437/9433 [39:31<18:22,  2.72it/s][A
epoch 1 iter 6437: train loss 0.50810. lr 4.434874e-04:  68%|██████▊   | 6438/9433 [39:31<18:23,  2.71it/s][A
epoch 1 iter 6438: train loss 0.49707. lr 4.434435e-04:  68%|██████▊   | 6438/9433 [39:31<18:23,  2.71it/s][A
epoch 1 iter 6438: train loss 0.49707. lr 4.434435e-04:  68%|██████▊   | 6439/9433 [39:31<18:21,  2.72it/s][A

epoch 1 iter 6471: train loss 0.51232. lr 4.419934e-04:  69%|██████▊   | 6471/9433 [39:43<18:11,  2.71it/s][A
epoch 1 iter 6471: train loss 0.51232. lr 4.419934e-04:  69%|██████▊   | 6472/9433 [39:43<18:14,  2.71it/s][A
epoch 1 iter 6472: train loss 0.48747. lr 4.419493e-04:  69%|██████▊   | 6472/9433 [39:44<18:14,  2.71it/s][A
epoch 1 iter 6472: train loss 0.48747. lr 4.419493e-04:  69%|██████▊   | 6473/9433 [39:44<18:13,  2.71it/s][A
epoch 1 iter 6473: train loss 0.51478. lr 4.419053e-04:  69%|██████▊   | 6473/9433 [39:44<18:13,  2.71it/s][A
epoch 1 iter 6473: train loss 0.51478. lr 4.419053e-04:  69%|██████▊   | 6474/9433 [39:44<18:10,  2.71it/s][A
epoch 1 iter 6474: train loss 0.50308. lr 4.418613e-04:  69%|██████▊   | 6474/9433 [39:44<18:10,  2.71it/s][A
epoch 1 iter 6474: train loss 0.50308. lr 4.418613e-04:  69%|██████▊   | 6475/9433 [39:44<18:10,  2.71it/s][A
epoch 1 iter 6475: train loss 0.50004. lr 4.418173e-04:  69%|██████▊   | 6475/9433 [39:45<18:10,  2.71it/s][A

epoch 1 iter 6507: train loss 0.51803. lr 4.404065e-04:  69%|██████▉   | 6508/9433 [39:56<17:55,  2.72it/s][A
epoch 1 iter 6508: train loss 0.49502. lr 4.403624e-04:  69%|██████▉   | 6508/9433 [39:57<17:55,  2.72it/s][A
epoch 1 iter 6508: train loss 0.49502. lr 4.403624e-04:  69%|██████▉   | 6509/9433 [39:57<17:54,  2.72it/s][A
epoch 1 iter 6509: train loss 0.48809. lr 4.403182e-04:  69%|██████▉   | 6509/9433 [39:57<17:54,  2.72it/s][A
epoch 1 iter 6509: train loss 0.48809. lr 4.403182e-04:  69%|██████▉   | 6510/9433 [39:57<17:56,  2.72it/s][A
epoch 1 iter 6510: train loss 0.49186. lr 4.402740e-04:  69%|██████▉   | 6510/9433 [39:58<17:56,  2.72it/s][A
epoch 1 iter 6510: train loss 0.49186. lr 4.402740e-04:  69%|██████▉   | 6511/9433 [39:58<18:00,  2.70it/s][A
epoch 1 iter 6511: train loss 0.48527. lr 4.402299e-04:  69%|██████▉   | 6511/9433 [39:58<18:00,  2.70it/s][A
epoch 1 iter 6511: train loss 0.48527. lr 4.402299e-04:  69%|██████▉   | 6512/9433 [39:58<17:59,  2.71it/s][A

epoch 1 iter 6544: train loss 0.48935. lr 4.387703e-04:  69%|██████▉   | 6544/9433 [40:10<17:44,  2.72it/s][A
epoch 1 iter 6544: train loss 0.48935. lr 4.387703e-04:  69%|██████▉   | 6545/9433 [40:10<17:41,  2.72it/s][A
epoch 1 iter 6545: train loss 0.51198. lr 4.387260e-04:  69%|██████▉   | 6545/9433 [40:10<17:41,  2.72it/s][A
epoch 1 iter 6545: train loss 0.51198. lr 4.387260e-04:  69%|██████▉   | 6546/9433 [40:10<17:44,  2.71it/s][A
epoch 1 iter 6546: train loss 0.50098. lr 4.386817e-04:  69%|██████▉   | 6546/9433 [40:11<17:44,  2.71it/s][A
epoch 1 iter 6546: train loss 0.50098. lr 4.386817e-04:  69%|██████▉   | 6547/9433 [40:11<17:44,  2.71it/s][A
epoch 1 iter 6547: train loss 0.50744. lr 4.386374e-04:  69%|██████▉   | 6547/9433 [40:11<17:44,  2.71it/s][A
epoch 1 iter 6547: train loss 0.50744. lr 4.386374e-04:  69%|██████▉   | 6548/9433 [40:11<17:41,  2.72it/s][A
epoch 1 iter 6548: train loss 0.51450. lr 4.385931e-04:  69%|██████▉   | 6548/9433 [40:12<17:41,  2.72it/s][A

epoch 1 iter 6580: train loss 0.46108. lr 4.371733e-04:  70%|██████▉   | 6581/9433 [40:23<17:34,  2.71it/s][A
epoch 1 iter 6581: train loss 0.49074. lr 4.371289e-04:  70%|██████▉   | 6581/9433 [40:24<17:34,  2.71it/s][A
epoch 1 iter 6581: train loss 0.49074. lr 4.371289e-04:  70%|██████▉   | 6582/9433 [40:24<17:32,  2.71it/s][A
epoch 1 iter 6582: train loss 0.51308. lr 4.370844e-04:  70%|██████▉   | 6582/9433 [40:24<17:32,  2.71it/s][A
epoch 1 iter 6582: train loss 0.51308. lr 4.370844e-04:  70%|██████▉   | 6583/9433 [40:24<17:30,  2.71it/s][A
epoch 1 iter 6583: train loss 0.55026. lr 4.370400e-04:  70%|██████▉   | 6583/9433 [40:24<17:30,  2.71it/s][A
epoch 1 iter 6583: train loss 0.55026. lr 4.370400e-04:  70%|██████▉   | 6584/9433 [40:24<17:29,  2.72it/s][A
epoch 1 iter 6584: train loss 0.50653. lr 4.369956e-04:  70%|██████▉   | 6584/9433 [40:25<17:29,  2.72it/s][A
epoch 1 iter 6584: train loss 0.50653. lr 4.369956e-04:  70%|██████▉   | 6585/9433 [40:25<17:29,  2.71it/s][A

epoch 1 iter 6774: train loss 0.48792. lr 4.284837e-04:  72%|███████▏  | 6774/9433 [41:35<16:17,  2.72it/s][A
epoch 1 iter 6774: train loss 0.48792. lr 4.284837e-04:  72%|███████▏  | 6775/9433 [41:35<16:17,  2.72it/s][A
epoch 1 iter 6775: train loss 0.49215. lr 4.284386e-04:  72%|███████▏  | 6775/9433 [41:35<16:17,  2.72it/s][A
epoch 1 iter 6775: train loss 0.49215. lr 4.284386e-04:  72%|███████▏  | 6776/9433 [41:35<16:18,  2.71it/s][A
epoch 1 iter 6776: train loss 0.48864. lr 4.283934e-04:  72%|███████▏  | 6776/9433 [41:36<16:18,  2.71it/s][A
epoch 1 iter 6776: train loss 0.48864. lr 4.283934e-04:  72%|███████▏  | 6777/9433 [41:36<16:16,  2.72it/s][A
epoch 1 iter 6777: train loss 0.46980. lr 4.283483e-04:  72%|███████▏  | 6777/9433 [41:36<16:16,  2.72it/s][A
epoch 1 iter 6777: train loss 0.46980. lr 4.283483e-04:  72%|███████▏  | 6778/9433 [41:36<16:14,  2.72it/s][A
epoch 1 iter 6778: train loss 0.47704. lr 4.283031e-04:  72%|███████▏  | 6778/9433 [41:36<16:14,  2.72it/s][A

epoch 1 iter 6810: train loss 0.43802. lr 4.268562e-04:  72%|███████▏  | 6811/9433 [41:48<16:06,  2.71it/s][A
epoch 1 iter 6811: train loss 0.46583. lr 4.268109e-04:  72%|███████▏  | 6811/9433 [41:48<16:06,  2.71it/s][A
epoch 1 iter 6811: train loss 0.46583. lr 4.268109e-04:  72%|███████▏  | 6812/9433 [41:48<16:03,  2.72it/s][A
epoch 1 iter 6812: train loss 0.48919. lr 4.267657e-04:  72%|███████▏  | 6812/9433 [41:49<16:03,  2.72it/s][A
epoch 1 iter 6812: train loss 0.48919. lr 4.267657e-04:  72%|███████▏  | 6813/9433 [41:49<16:04,  2.72it/s][A
epoch 1 iter 6813: train loss 0.46021. lr 4.267204e-04:  72%|███████▏  | 6813/9433 [41:49<16:04,  2.72it/s][A
epoch 1 iter 6813: train loss 0.46021. lr 4.267204e-04:  72%|███████▏  | 6814/9433 [41:49<16:04,  2.71it/s][A
epoch 1 iter 6814: train loss 0.51141. lr 4.266751e-04:  72%|███████▏  | 6814/9433 [41:50<16:04,  2.71it/s][A
epoch 1 iter 6814: train loss 0.51141. lr 4.266751e-04:  72%|███████▏  | 6815/9433 [41:50<16:08,  2.70it/s][A

epoch 1 iter 6847: train loss 0.47512. lr 4.251787e-04:  73%|███████▎  | 6847/9433 [42:02<15:51,  2.72it/s][A
epoch 1 iter 6847: train loss 0.47512. lr 4.251787e-04:  73%|███████▎  | 6848/9433 [42:02<15:52,  2.71it/s][A
epoch 1 iter 6848: train loss 0.45269. lr 4.251333e-04:  73%|███████▎  | 6848/9433 [42:02<15:52,  2.71it/s][A
epoch 1 iter 6848: train loss 0.45269. lr 4.251333e-04:  73%|███████▎  | 6849/9433 [42:02<15:52,  2.71it/s][A
epoch 1 iter 6849: train loss 0.44823. lr 4.250879e-04:  73%|███████▎  | 6849/9433 [42:02<15:52,  2.71it/s][A
epoch 1 iter 6849: train loss 0.44823. lr 4.250879e-04:  73%|███████▎  | 6850/9433 [42:02<15:54,  2.71it/s][A
epoch 1 iter 6850: train loss 0.47877. lr 4.250425e-04:  73%|███████▎  | 6850/9433 [42:03<15:54,  2.71it/s][A
epoch 1 iter 6850: train loss 0.47877. lr 4.250425e-04:  73%|███████▎  | 6851/9433 [42:03<15:52,  2.71it/s][A
epoch 1 iter 6851: train loss 0.44796. lr 4.249971e-04:  73%|███████▎  | 6851/9433 [42:03<15:52,  2.71it/s][A

epoch 1 iter 8834: train loss 0.39944. lr 3.298120e-04:  94%|█████████▎| 8835/9433 [54:13<03:38,  2.74it/s][A
epoch 1 iter 8835: train loss 0.38194. lr 3.297623e-04:  94%|█████████▎| 8835/9433 [54:14<03:38,  2.74it/s][A
epoch 1 iter 8835: train loss 0.38194. lr 3.297623e-04:  94%|█████████▎| 8836/9433 [54:14<03:38,  2.73it/s][A
epoch 1 iter 8836: train loss 0.42038. lr 3.297126e-04:  94%|█████████▎| 8836/9433 [54:14<03:38,  2.73it/s][A
epoch 1 iter 8836: train loss 0.42038. lr 3.297126e-04:  94%|█████████▎| 8837/9433 [54:14<03:37,  2.74it/s][A
epoch 1 iter 8837: train loss 0.36883. lr 3.296628e-04:  94%|█████████▎| 8837/9433 [54:14<03:37,  2.74it/s][A
epoch 1 iter 8837: train loss 0.36883. lr 3.296628e-04:  94%|█████████▎| 8838/9433 [54:14<03:36,  2.74it/s][A
epoch 1 iter 8838: train loss 0.40012. lr 3.296131e-04:  94%|█████████▎| 8838/9433 [54:15<03:36,  2.74it/s][A
epoch 1 iter 8838: train loss 0.40012. lr 3.296131e-04:  94%|█████████▎| 8839/9433 [54:15<03:36,  2.74it/s][A

epoch 1 iter 8871: train loss 0.42360. lr 3.279721e-04:  94%|█████████▍| 8871/9433 [54:27<03:25,  2.74it/s][A
epoch 1 iter 8871: train loss 0.42360. lr 3.279721e-04:  94%|█████████▍| 8872/9433 [54:27<03:25,  2.74it/s][A
epoch 1 iter 8872: train loss 0.42723. lr 3.279224e-04:  94%|█████████▍| 8872/9433 [54:27<03:25,  2.74it/s][A
epoch 1 iter 8872: train loss 0.42723. lr 3.279224e-04:  94%|█████████▍| 8873/9433 [54:27<03:25,  2.73it/s][A
epoch 1 iter 8873: train loss 0.40116. lr 3.278726e-04:  94%|█████████▍| 8873/9433 [54:28<03:25,  2.73it/s][A
epoch 1 iter 8873: train loss 0.40116. lr 3.278726e-04:  94%|█████████▍| 8874/9433 [54:28<03:24,  2.73it/s][A
epoch 1 iter 8874: train loss 0.41825. lr 3.278229e-04:  94%|█████████▍| 8874/9433 [54:28<03:24,  2.73it/s][A
epoch 1 iter 8874: train loss 0.41825. lr 3.278229e-04:  94%|█████████▍| 8875/9433 [54:28<03:24,  2.74it/s][A
epoch 1 iter 8875: train loss 0.38367. lr 3.277731e-04:  94%|█████████▍| 8875/9433 [54:28<03:24,  2.74it/s][A

epoch 1 iter 8907: train loss 0.39957. lr 3.261809e-04:  94%|█████████▍| 8908/9433 [54:40<03:11,  2.75it/s][A
epoch 1 iter 8908: train loss 0.40559. lr 3.261312e-04:  94%|█████████▍| 8908/9433 [54:40<03:11,  2.75it/s][A
epoch 1 iter 8908: train loss 0.40559. lr 3.261312e-04:  94%|█████████▍| 8909/9433 [54:40<03:11,  2.74it/s][A
epoch 1 iter 8909: train loss 0.38494. lr 3.260814e-04:  94%|█████████▍| 8909/9433 [54:41<03:11,  2.74it/s][A
epoch 1 iter 8909: train loss 0.38494. lr 3.260814e-04:  94%|█████████▍| 8910/9433 [54:41<03:11,  2.73it/s][A
epoch 1 iter 8910: train loss 0.36867. lr 3.260316e-04:  94%|█████████▍| 8910/9433 [54:41<03:11,  2.73it/s][A
epoch 1 iter 8910: train loss 0.36867. lr 3.260316e-04:  94%|█████████▍| 8911/9433 [54:41<03:10,  2.74it/s][A
epoch 1 iter 8911: train loss 0.41826. lr 3.259819e-04:  94%|█████████▍| 8911/9433 [54:41<03:10,  2.74it/s][A
epoch 1 iter 8911: train loss 0.41826. lr 3.259819e-04:  94%|█████████▍| 8912/9433 [54:41<03:09,  2.74it/s][A

epoch 1 iter 8944: train loss 0.39008. lr 3.243390e-04:  95%|█████████▍| 8944/9433 [54:53<02:58,  2.74it/s][A
epoch 1 iter 8944: train loss 0.39008. lr 3.243390e-04:  95%|█████████▍| 8945/9433 [54:53<02:57,  2.75it/s][A
epoch 1 iter 8945: train loss 0.37958. lr 3.242892e-04:  95%|█████████▍| 8945/9433 [54:54<02:57,  2.75it/s][A
epoch 1 iter 8945: train loss 0.37958. lr 3.242892e-04:  95%|█████████▍| 8946/9433 [54:54<02:58,  2.73it/s][A
epoch 1 iter 8946: train loss 0.44025. lr 3.242394e-04:  95%|█████████▍| 8946/9433 [54:54<02:58,  2.73it/s][A
epoch 1 iter 8946: train loss 0.44025. lr 3.242394e-04:  95%|█████████▍| 8947/9433 [54:54<02:58,  2.73it/s][A
epoch 1 iter 8947: train loss 0.38417. lr 3.241896e-04:  95%|█████████▍| 8947/9433 [54:55<02:58,  2.73it/s][A
epoch 1 iter 8947: train loss 0.38417. lr 3.241896e-04:  95%|█████████▍| 8948/9433 [54:55<02:57,  2.73it/s][A
epoch 1 iter 8948: train loss 0.39984. lr 3.241399e-04:  95%|█████████▍| 8948/9433 [54:55<02:57,  2.73it/s][A

epoch 1 iter 8980: train loss 0.39591. lr 3.225460e-04:  95%|█████████▌| 8981/9433 [55:07<02:45,  2.73it/s][A
epoch 1 iter 8981: train loss 0.39779. lr 3.224962e-04:  95%|█████████▌| 8981/9433 [55:07<02:45,  2.73it/s][A
epoch 1 iter 8981: train loss 0.39779. lr 3.224962e-04:  95%|█████████▌| 8982/9433 [55:07<02:44,  2.74it/s][A
epoch 1 iter 8982: train loss 0.38761. lr 3.224464e-04:  95%|█████████▌| 8982/9433 [55:07<02:44,  2.74it/s][A
epoch 1 iter 8982: train loss 0.38761. lr 3.224464e-04:  95%|█████████▌| 8983/9433 [55:07<02:43,  2.75it/s][A
epoch 1 iter 8983: train loss 0.41535. lr 3.223966e-04:  95%|█████████▌| 8983/9433 [55:08<02:43,  2.75it/s][A
epoch 1 iter 8983: train loss 0.41535. lr 3.223966e-04:  95%|█████████▌| 8984/9433 [55:08<02:43,  2.74it/s][A
epoch 1 iter 8984: train loss 0.38914. lr 3.223467e-04:  95%|█████████▌| 8984/9433 [55:08<02:43,  2.74it/s][A
epoch 1 iter 8984: train loss 0.38914. lr 3.223467e-04:  95%|█████████▌| 8985/9433 [55:08<02:43,  2.74it/s][A

epoch 1 iter 9017: train loss 0.37936. lr 3.207024e-04:  96%|█████████▌| 9017/9433 [55:20<02:32,  2.73it/s][A
epoch 1 iter 9017: train loss 0.37936. lr 3.207024e-04:  96%|█████████▌| 9018/9433 [55:20<02:31,  2.74it/s][A
epoch 1 iter 9018: train loss 0.40722. lr 3.206525e-04:  96%|█████████▌| 9018/9433 [55:20<02:31,  2.74it/s][A
epoch 1 iter 9018: train loss 0.40722. lr 3.206525e-04:  96%|█████████▌| 9019/9433 [55:20<02:31,  2.74it/s][A
epoch 1 iter 9019: train loss 0.38714. lr 3.206027e-04:  96%|█████████▌| 9019/9433 [55:21<02:31,  2.74it/s][A
epoch 1 iter 9019: train loss 0.38714. lr 3.206027e-04:  96%|█████████▌| 9020/9433 [55:21<02:30,  2.75it/s][A
epoch 1 iter 9020: train loss 0.38895. lr 3.205528e-04:  96%|█████████▌| 9020/9433 [55:21<02:30,  2.75it/s][A
epoch 1 iter 9020: train loss 0.38895. lr 3.205528e-04:  96%|█████████▌| 9021/9433 [55:21<02:30,  2.74it/s][A
epoch 1 iter 9021: train loss 0.40885. lr 3.205030e-04:  96%|█████████▌| 9021/9433 [55:22<02:30,  2.74it/s][A

epoch 1 iter 9053: train loss 0.41286. lr 3.189078e-04:  96%|█████████▌| 9054/9433 [55:33<02:18,  2.73it/s][A
epoch 1 iter 9054: train loss 0.40054. lr 3.188579e-04:  96%|█████████▌| 9054/9433 [55:34<02:18,  2.73it/s][A
epoch 1 iter 9054: train loss 0.40054. lr 3.188579e-04:  96%|█████████▌| 9055/9433 [55:34<02:18,  2.72it/s][A
epoch 1 iter 9055: train loss 0.40219. lr 3.188080e-04:  96%|█████████▌| 9055/9433 [55:34<02:18,  2.72it/s][A
epoch 1 iter 9055: train loss 0.40219. lr 3.188080e-04:  96%|█████████▌| 9056/9433 [55:34<02:18,  2.72it/s][A
epoch 1 iter 9056: train loss 0.38239. lr 3.187582e-04:  96%|█████████▌| 9056/9433 [55:34<02:18,  2.72it/s][A
epoch 1 iter 9056: train loss 0.38239. lr 3.187582e-04:  96%|█████████▌| 9057/9433 [55:34<02:17,  2.73it/s][A
epoch 1 iter 9057: train loss 0.42772. lr 3.187083e-04:  96%|█████████▌| 9057/9433 [55:35<02:17,  2.73it/s][A
epoch 1 iter 9057: train loss 0.42772. lr 3.187083e-04:  96%|█████████▌| 9058/9433 [55:35<02:16,  2.74it/s][A

epoch 1 iter 9090: train loss 0.40322. lr 3.170626e-04:  96%|█████████▋| 9090/9433 [55:47<02:05,  2.74it/s][A
epoch 1 iter 9090: train loss 0.40322. lr 3.170626e-04:  96%|█████████▋| 9091/9433 [55:47<02:04,  2.75it/s][A
epoch 1 iter 9091: train loss 0.41852. lr 3.170127e-04:  96%|█████████▋| 9091/9433 [55:47<02:04,  2.75it/s][A
epoch 1 iter 9091: train loss 0.41852. lr 3.170127e-04:  96%|█████████▋| 9092/9433 [55:47<02:04,  2.74it/s][A
epoch 1 iter 9092: train loss 0.42009. lr 3.169629e-04:  96%|█████████▋| 9092/9433 [55:48<02:04,  2.74it/s][A
epoch 1 iter 9092: train loss 0.42009. lr 3.169629e-04:  96%|█████████▋| 9093/9433 [55:48<02:04,  2.73it/s][A
epoch 1 iter 9093: train loss 0.40328. lr 3.169130e-04:  96%|█████████▋| 9093/9433 [55:48<02:04,  2.73it/s][A
epoch 1 iter 9093: train loss 0.40328. lr 3.169130e-04:  96%|█████████▋| 9094/9433 [55:48<02:03,  2.74it/s][A
epoch 1 iter 9094: train loss 0.40714. lr 3.168631e-04:  96%|█████████▋| 9094/9433 [55:48<02:03,  2.74it/s][A

epoch 1 iter 9126: train loss 0.37988. lr 3.152667e-04:  97%|█████████▋| 9127/9433 [56:00<01:51,  2.74it/s][A
epoch 1 iter 9127: train loss 0.40951. lr 3.152168e-04:  97%|█████████▋| 9127/9433 [56:00<01:51,  2.74it/s][A
epoch 1 iter 9127: train loss 0.40951. lr 3.152168e-04:  97%|█████████▋| 9128/9433 [56:00<01:50,  2.75it/s][A
epoch 1 iter 9128: train loss 0.39433. lr 3.151669e-04:  97%|█████████▋| 9128/9433 [56:01<01:50,  2.75it/s][A
epoch 1 iter 9128: train loss 0.39433. lr 3.151669e-04:  97%|█████████▋| 9129/9433 [56:01<01:51,  2.73it/s][A
epoch 1 iter 9129: train loss 0.41199. lr 3.151170e-04:  97%|█████████▋| 9129/9433 [56:01<01:51,  2.73it/s][A
epoch 1 iter 9129: train loss 0.41199. lr 3.151170e-04:  97%|█████████▋| 9130/9433 [56:01<01:51,  2.73it/s][A
epoch 1 iter 9130: train loss 0.38343. lr 3.150671e-04:  97%|█████████▋| 9130/9433 [56:01<01:51,  2.73it/s][A
epoch 1 iter 9130: train loss 0.38343. lr 3.150671e-04:  97%|█████████▋| 9131/9433 [56:01<01:50,  2.73it/s][A

epoch 1 iter 9163: train loss 0.39426. lr 3.134204e-04:  97%|█████████▋| 9163/9433 [56:14<01:38,  2.73it/s][A
epoch 1 iter 9163: train loss 0.39426. lr 3.134204e-04:  97%|█████████▋| 9164/9433 [56:14<01:38,  2.74it/s][A
epoch 1 iter 9164: train loss 0.39421. lr 3.133705e-04:  97%|█████████▋| 9164/9433 [56:14<01:38,  2.74it/s][A
epoch 1 iter 9164: train loss 0.39421. lr 3.133705e-04:  97%|█████████▋| 9165/9433 [56:14<01:37,  2.74it/s][A
epoch 1 iter 9165: train loss 0.38195. lr 3.133206e-04:  97%|█████████▋| 9165/9433 [56:14<01:37,  2.74it/s][A
epoch 1 iter 9165: train loss 0.38195. lr 3.133206e-04:  97%|█████████▋| 9166/9433 [56:14<01:37,  2.73it/s][A
epoch 1 iter 9166: train loss 0.39015. lr 3.132706e-04:  97%|█████████▋| 9166/9433 [56:15<01:37,  2.73it/s][A
epoch 1 iter 9166: train loss 0.39015. lr 3.132706e-04:  97%|█████████▋| 9167/9433 [56:15<01:37,  2.73it/s][A
epoch 1 iter 9167: train loss 0.41954. lr 3.132207e-04:  97%|█████████▋| 9167/9433 [56:15<01:37,  2.73it/s][A

epoch 1 iter 9199: train loss 0.40001. lr 3.116234e-04:  98%|█████████▊| 9200/9433 [56:27<01:25,  2.73it/s][A
epoch 1 iter 9200: train loss 0.37743. lr 3.115735e-04:  98%|█████████▊| 9200/9433 [56:27<01:25,  2.73it/s][A
epoch 1 iter 9200: train loss 0.37743. lr 3.115735e-04:  98%|█████████▊| 9201/9433 [56:27<01:24,  2.73it/s][A
epoch 1 iter 9201: train loss 0.41401. lr 3.115236e-04:  98%|█████████▊| 9201/9433 [56:27<01:24,  2.73it/s][A
epoch 1 iter 9201: train loss 0.41401. lr 3.115236e-04:  98%|█████████▊| 9202/9433 [56:27<01:24,  2.74it/s][A
epoch 1 iter 9202: train loss 0.38041. lr 3.114737e-04:  98%|█████████▊| 9202/9433 [56:28<01:24,  2.74it/s][A
epoch 1 iter 9202: train loss 0.38041. lr 3.114737e-04:  98%|█████████▊| 9203/9433 [56:28<01:23,  2.75it/s][A
epoch 1 iter 9203: train loss 0.36986. lr 3.114237e-04:  98%|█████████▊| 9203/9433 [56:28<01:23,  2.75it/s][A
epoch 1 iter 9203: train loss 0.36986. lr 3.114237e-04:  98%|█████████▊| 9204/9433 [56:28<01:23,  2.74it/s][A

epoch 1 iter 9236: train loss 0.36774. lr 3.097761e-04:  98%|█████████▊| 9236/9433 [56:40<01:12,  2.73it/s][A
epoch 1 iter 9236: train loss 0.36774. lr 3.097761e-04:  98%|█████████▊| 9237/9433 [56:40<01:11,  2.72it/s][A
epoch 1 iter 9237: train loss 0.40335. lr 3.097262e-04:  98%|█████████▊| 9237/9433 [56:41<01:11,  2.72it/s][A
epoch 1 iter 9237: train loss 0.40335. lr 3.097262e-04:  98%|█████████▊| 9238/9433 [56:41<01:11,  2.73it/s][A
epoch 1 iter 9238: train loss 0.40866. lr 3.096763e-04:  98%|█████████▊| 9238/9433 [56:41<01:11,  2.73it/s][A
epoch 1 iter 9238: train loss 0.40866. lr 3.096763e-04:  98%|█████████▊| 9239/9433 [56:41<01:10,  2.73it/s][A
epoch 1 iter 9239: train loss 0.41593. lr 3.096263e-04:  98%|█████████▊| 9239/9433 [56:42<01:10,  2.73it/s][A
epoch 1 iter 9239: train loss 0.41593. lr 3.096263e-04:  98%|█████████▊| 9240/9433 [56:42<01:10,  2.73it/s][A
epoch 1 iter 9240: train loss 0.44369. lr 3.095764e-04:  98%|█████████▊| 9240/9433 [56:42<01:10,  2.73it/s][A

epoch 1 iter 9272: train loss 0.41425. lr 3.079784e-04:  98%|█████████▊| 9273/9433 [56:54<00:58,  2.74it/s][A
epoch 1 iter 9273: train loss 0.41039. lr 3.079285e-04:  98%|█████████▊| 9273/9433 [56:54<00:58,  2.74it/s][A
epoch 1 iter 9273: train loss 0.41039. lr 3.079285e-04:  98%|█████████▊| 9274/9433 [56:54<00:58,  2.73it/s][A
epoch 1 iter 9274: train loss 0.38016. lr 3.078785e-04:  98%|█████████▊| 9274/9433 [56:54<00:58,  2.73it/s][A
epoch 1 iter 9274: train loss 0.38016. lr 3.078785e-04:  98%|█████████▊| 9275/9433 [56:54<00:57,  2.74it/s][A
epoch 1 iter 9275: train loss 0.38329. lr 3.078286e-04:  98%|█████████▊| 9275/9433 [56:55<00:57,  2.74it/s][A
epoch 1 iter 9275: train loss 0.38329. lr 3.078286e-04:  98%|█████████▊| 9276/9433 [56:55<00:57,  2.73it/s][A
epoch 1 iter 9276: train loss 0.40349. lr 3.077786e-04:  98%|█████████▊| 9276/9433 [56:55<00:57,  2.73it/s][A
epoch 1 iter 9276: train loss 0.40349. lr 3.077786e-04:  98%|█████████▊| 9277/9433 [56:55<00:57,  2.72it/s][A

epoch 1 iter 9309: train loss 0.37620. lr 3.061304e-04:  99%|█████████▊| 9309/9433 [57:07<00:45,  2.75it/s][A
epoch 1 iter 9309: train loss 0.37620. lr 3.061304e-04:  99%|█████████▊| 9310/9433 [57:07<00:44,  2.74it/s][A
epoch 1 iter 9310: train loss 0.37941. lr 3.060805e-04:  99%|█████████▊| 9310/9433 [57:07<00:44,  2.74it/s][A
epoch 1 iter 9310: train loss 0.37941. lr 3.060805e-04:  99%|█████████▊| 9311/9433 [57:07<00:44,  2.73it/s][A
epoch 1 iter 9311: train loss 0.41877. lr 3.060305e-04:  99%|█████████▊| 9311/9433 [57:08<00:44,  2.73it/s][A
epoch 1 iter 9311: train loss 0.41877. lr 3.060305e-04:  99%|█████████▊| 9312/9433 [57:08<00:44,  2.74it/s][A
epoch 1 iter 9312: train loss 0.39252. lr 3.059806e-04:  99%|█████████▊| 9312/9433 [57:08<00:44,  2.74it/s][A
epoch 1 iter 9312: train loss 0.39252. lr 3.059806e-04:  99%|█████████▊| 9313/9433 [57:08<00:43,  2.74it/s][A
epoch 1 iter 9313: train loss 0.37797. lr 3.059307e-04:  99%|█████████▊| 9313/9433 [57:09<00:43,  2.74it/s][A

epoch 2 iter 2028: train loss 0.35592. lr 2.005552e-04:  22%|██▏       | 2029/9433 [12:27<45:22,  2.72it/s][A
epoch 2 iter 2029: train loss 0.34371. lr 2.005081e-04:  22%|██▏       | 2029/9433 [12:27<45:22,  2.72it/s][A
epoch 2 iter 2029: train loss 0.34371. lr 2.005081e-04:  22%|██▏       | 2030/9433 [12:27<45:21,  2.72it/s][A
epoch 2 iter 2030: train loss 0.34029. lr 2.004610e-04:  22%|██▏       | 2030/9433 [12:27<45:21,  2.72it/s][A
epoch 2 iter 2030: train loss 0.34029. lr 2.004610e-04:  22%|██▏       | 2031/9433 [12:27<45:28,  2.71it/s][A
epoch 2 iter 2031: train loss 0.33330. lr 2.004138e-04:  22%|██▏       | 2031/9433 [12:28<45:28,  2.71it/s][A
epoch 2 iter 2031: train loss 0.33330. lr 2.004138e-04:  22%|██▏       | 2032/9433 [12:28<45:29,  2.71it/s][A
epoch 2 iter 2032: train loss 0.33023. lr 2.003667e-04:  22%|██▏       | 2032/9433 [12:28<45:29,  2.71it/s][A
epoch 2 iter 2032: train loss 0.33023. lr 2.003667e-04:  22%|██▏       | 2033/9433 [12:28<45:29,  2.71it/s][A

epoch 2 iter 2065: train loss 0.32377. lr 1.988132e-04:  22%|██▏       | 2065/9433 [12:40<45:10,  2.72it/s][A
epoch 2 iter 2065: train loss 0.32377. lr 1.988132e-04:  22%|██▏       | 2066/9433 [12:40<45:14,  2.71it/s][A
epoch 2 iter 2066: train loss 0.32418. lr 1.987661e-04:  22%|██▏       | 2066/9433 [12:41<45:14,  2.71it/s][A
epoch 2 iter 2066: train loss 0.32418. lr 1.987661e-04:  22%|██▏       | 2067/9433 [12:41<45:07,  2.72it/s][A
epoch 2 iter 2067: train loss 0.32719. lr 1.987191e-04:  22%|██▏       | 2067/9433 [12:41<45:07,  2.72it/s][A
epoch 2 iter 2067: train loss 0.32719. lr 1.987191e-04:  22%|██▏       | 2068/9433 [12:41<45:00,  2.73it/s][A
epoch 2 iter 2068: train loss 0.31514. lr 1.986721e-04:  22%|██▏       | 2068/9433 [12:41<45:00,  2.73it/s][A
epoch 2 iter 2068: train loss 0.31514. lr 1.986721e-04:  22%|██▏       | 2069/9433 [12:41<44:44,  2.74it/s][A
epoch 2 iter 2069: train loss 0.32582. lr 1.986251e-04:  22%|██▏       | 2069/9433 [12:42<44:44,  2.74it/s][A

epoch 2 iter 2101: train loss 0.34428. lr 1.971219e-04:  22%|██▏       | 2102/9433 [12:54<44:56,  2.72it/s][A
epoch 2 iter 2102: train loss 0.34129. lr 1.970749e-04:  22%|██▏       | 2102/9433 [12:54<44:56,  2.72it/s][A
epoch 2 iter 2102: train loss 0.34129. lr 1.970749e-04:  22%|██▏       | 2103/9433 [12:54<44:44,  2.73it/s][A
epoch 2 iter 2103: train loss 0.33578. lr 1.970280e-04:  22%|██▏       | 2103/9433 [12:54<44:44,  2.73it/s][A
epoch 2 iter 2103: train loss 0.33578. lr 1.970280e-04:  22%|██▏       | 2104/9433 [12:54<44:59,  2.72it/s][A
epoch 2 iter 2104: train loss 0.33465. lr 1.969811e-04:  22%|██▏       | 2104/9433 [12:55<44:59,  2.72it/s][A
epoch 2 iter 2104: train loss 0.33465. lr 1.969811e-04:  22%|██▏       | 2105/9433 [12:55<45:00,  2.71it/s][A
epoch 2 iter 2105: train loss 0.32328. lr 1.969342e-04:  22%|██▏       | 2105/9433 [12:55<45:00,  2.71it/s][A
epoch 2 iter 2105: train loss 0.32328. lr 1.969342e-04:  22%|██▏       | 2106/9433 [12:55<44:56,  2.72it/s][A

epoch 2 iter 2138: train loss 0.33756. lr 1.953874e-04:  23%|██▎       | 2138/9433 [13:07<46:55,  2.59it/s][A
epoch 2 iter 2138: train loss 0.33756. lr 1.953874e-04:  23%|██▎       | 2139/9433 [13:07<46:12,  2.63it/s][A
epoch 2 iter 2139: train loss 0.32975. lr 1.953406e-04:  23%|██▎       | 2139/9433 [13:08<46:12,  2.63it/s][A
epoch 2 iter 2139: train loss 0.32975. lr 1.953406e-04:  23%|██▎       | 2140/9433 [13:08<45:50,  2.65it/s][A
epoch 2 iter 2140: train loss 0.32549. lr 1.952938e-04:  23%|██▎       | 2140/9433 [13:08<45:50,  2.65it/s][A
epoch 2 iter 2140: train loss 0.32549. lr 1.952938e-04:  23%|██▎       | 2141/9433 [13:08<45:23,  2.68it/s][A
epoch 2 iter 2141: train loss 0.34038. lr 1.952470e-04:  23%|██▎       | 2141/9433 [13:08<45:23,  2.68it/s][A
epoch 2 iter 2141: train loss 0.34038. lr 1.952470e-04:  23%|██▎       | 2142/9433 [13:08<45:15,  2.68it/s][A
epoch 2 iter 2142: train loss 0.34072. lr 1.952002e-04:  23%|██▎       | 2142/9433 [13:09<45:15,  2.68it/s][A

epoch 2 iter 2174: train loss 0.32398. lr 1.937037e-04:  23%|██▎       | 2175/9433 [13:21<44:31,  2.72it/s][A
epoch 2 iter 2175: train loss 0.32851. lr 1.936570e-04:  23%|██▎       | 2175/9433 [13:21<44:31,  2.72it/s][A
epoch 2 iter 2175: train loss 0.32851. lr 1.936570e-04:  23%|██▎       | 2176/9433 [13:21<44:38,  2.71it/s][A
epoch 2 iter 2176: train loss 0.30482. lr 1.936103e-04:  23%|██▎       | 2176/9433 [13:21<44:38,  2.71it/s][A
epoch 2 iter 2176: train loss 0.30482. lr 1.936103e-04:  23%|██▎       | 2177/9433 [13:21<44:34,  2.71it/s][A
epoch 2 iter 2177: train loss 0.35518. lr 1.935636e-04:  23%|██▎       | 2177/9433 [13:22<44:34,  2.71it/s][A
epoch 2 iter 2177: train loss 0.35518. lr 1.935636e-04:  23%|██▎       | 2178/9433 [13:22<44:27,  2.72it/s][A
epoch 2 iter 2178: train loss 0.33811. lr 1.935169e-04:  23%|██▎       | 2178/9433 [13:22<44:27,  2.72it/s][A
epoch 2 iter 2178: train loss 0.33811. lr 1.935169e-04:  23%|██▎       | 2179/9433 [13:22<44:31,  2.72it/s][A

epoch 2 iter 2211: train loss 0.31479. lr 1.919772e-04:  23%|██▎       | 2211/9433 [13:34<44:27,  2.71it/s][A
epoch 2 iter 2211: train loss 0.31479. lr 1.919772e-04:  23%|██▎       | 2212/9433 [13:34<44:18,  2.72it/s][A
epoch 2 iter 2212: train loss 0.34048. lr 1.919306e-04:  23%|██▎       | 2212/9433 [13:35<44:18,  2.72it/s][A
epoch 2 iter 2212: train loss 0.34048. lr 1.919306e-04:  23%|██▎       | 2213/9433 [13:35<44:13,  2.72it/s][A
epoch 2 iter 2213: train loss 0.35093. lr 1.918840e-04:  23%|██▎       | 2213/9433 [13:35<44:13,  2.72it/s][A
epoch 2 iter 2213: train loss 0.35093. lr 1.918840e-04:  23%|██▎       | 2214/9433 [13:35<44:14,  2.72it/s][A
epoch 2 iter 2214: train loss 0.33123. lr 1.918374e-04:  23%|██▎       | 2214/9433 [13:35<44:14,  2.72it/s][A
epoch 2 iter 2214: train loss 0.33123. lr 1.918374e-04:  23%|██▎       | 2215/9433 [13:35<44:16,  2.72it/s][A
epoch 2 iter 2215: train loss 0.32332. lr 1.917908e-04:  23%|██▎       | 2215/9433 [13:36<44:16,  2.72it/s][A

epoch 2 iter 2247: train loss 0.32870. lr 1.903013e-04:  24%|██▍       | 2248/9433 [13:47<44:10,  2.71it/s][A
epoch 2 iter 2248: train loss 0.34274. lr 1.902548e-04:  24%|██▍       | 2248/9433 [13:48<44:10,  2.71it/s][A
epoch 2 iter 2248: train loss 0.34274. lr 1.902548e-04:  24%|██▍       | 2249/9433 [13:48<44:06,  2.71it/s][A
epoch 2 iter 2249: train loss 0.36411. lr 1.902083e-04:  24%|██▍       | 2249/9433 [13:48<44:06,  2.71it/s][A
epoch 2 iter 2249: train loss 0.36411. lr 1.902083e-04:  24%|██▍       | 2250/9433 [13:48<44:12,  2.71it/s][A
epoch 2 iter 2250: train loss 0.33676. lr 1.901618e-04:  24%|██▍       | 2250/9433 [13:49<44:12,  2.71it/s][A
epoch 2 iter 2250: train loss 0.33676. lr 1.901618e-04:  24%|██▍       | 2251/9433 [13:49<44:13,  2.71it/s][A
epoch 2 iter 2251: train loss 0.31263. lr 1.901153e-04:  24%|██▍       | 2251/9433 [13:49<44:13,  2.71it/s][A
epoch 2 iter 2251: train loss 0.31263. lr 1.901153e-04:  24%|██▍       | 2252/9433 [13:49<44:05,  2.71it/s][A

epoch 2 iter 2284: train loss 0.34150. lr 1.885829e-04:  24%|██▍       | 2284/9433 [14:01<43:51,  2.72it/s][A
epoch 2 iter 2284: train loss 0.34150. lr 1.885829e-04:  24%|██▍       | 2285/9433 [14:01<43:56,  2.71it/s][A
epoch 2 iter 2285: train loss 0.33145. lr 1.885365e-04:  24%|██▍       | 2285/9433 [14:01<43:56,  2.71it/s][A
epoch 2 iter 2285: train loss 0.33145. lr 1.885365e-04:  24%|██▍       | 2286/9433 [14:01<43:57,  2.71it/s][A
epoch 2 iter 2286: train loss 0.32102. lr 1.884901e-04:  24%|██▍       | 2286/9433 [14:02<43:57,  2.71it/s][A
epoch 2 iter 2286: train loss 0.32102. lr 1.884901e-04:  24%|██▍       | 2287/9433 [14:02<43:51,  2.72it/s][A
epoch 2 iter 2287: train loss 0.33648. lr 1.884438e-04:  24%|██▍       | 2287/9433 [14:02<43:51,  2.72it/s][A
epoch 2 iter 2287: train loss 0.33648. lr 1.884438e-04:  24%|██▍       | 2288/9433 [14:02<43:43,  2.72it/s][A
epoch 2 iter 2288: train loss 0.34112. lr 1.883974e-04:  24%|██▍       | 2288/9433 [14:03<43:43,  2.72it/s][A

epoch 2 iter 2320: train loss 0.32152. lr 1.869150e-04:  25%|██▍       | 2321/9433 [14:14<43:30,  2.72it/s][A
epoch 2 iter 2321: train loss 0.32172. lr 1.868688e-04:  25%|██▍       | 2321/9433 [14:15<43:30,  2.72it/s][A
epoch 2 iter 2321: train loss 0.32172. lr 1.868688e-04:  25%|██▍       | 2322/9433 [14:15<43:31,  2.72it/s][A
epoch 2 iter 2322: train loss 0.33857. lr 1.868225e-04:  25%|██▍       | 2322/9433 [14:15<43:31,  2.72it/s][A
epoch 2 iter 2322: train loss 0.33857. lr 1.868225e-04:  25%|██▍       | 2323/9433 [14:15<43:35,  2.72it/s][A
epoch 2 iter 2323: train loss 0.32655. lr 1.867762e-04:  25%|██▍       | 2323/9433 [14:15<43:35,  2.72it/s][A
epoch 2 iter 2323: train loss 0.32655. lr 1.867762e-04:  25%|██▍       | 2324/9433 [14:15<43:45,  2.71it/s][A
epoch 2 iter 2324: train loss 0.33716. lr 1.867300e-04:  25%|██▍       | 2324/9433 [14:16<43:45,  2.71it/s][A
epoch 2 iter 2324: train loss 0.33716. lr 1.867300e-04:  25%|██▍       | 2325/9433 [14:16<43:42,  2.71it/s][A

epoch 2 iter 2357: train loss 0.33365. lr 1.852051e-04:  25%|██▍       | 2357/9433 [14:28<43:15,  2.73it/s][A
epoch 2 iter 2357: train loss 0.33365. lr 1.852051e-04:  25%|██▍       | 2358/9433 [14:28<43:24,  2.72it/s][A
epoch 2 iter 2358: train loss 0.32050. lr 1.851589e-04:  25%|██▍       | 2358/9433 [14:28<43:24,  2.72it/s][A
epoch 2 iter 2358: train loss 0.32050. lr 1.851589e-04:  25%|██▌       | 2359/9433 [14:28<43:24,  2.72it/s][A
epoch 2 iter 2359: train loss 0.30757. lr 1.851128e-04:  25%|██▌       | 2359/9433 [14:29<43:24,  2.72it/s][A
epoch 2 iter 2359: train loss 0.30757. lr 1.851128e-04:  25%|██▌       | 2360/9433 [14:29<43:17,  2.72it/s][A
epoch 2 iter 2360: train loss 0.35172. lr 1.850666e-04:  25%|██▌       | 2360/9433 [14:29<43:17,  2.72it/s][A
epoch 2 iter 2360: train loss 0.35172. lr 1.850666e-04:  25%|██▌       | 2361/9433 [14:29<43:13,  2.73it/s][A
epoch 2 iter 2361: train loss 0.31140. lr 1.850205e-04:  25%|██▌       | 2361/9433 [14:29<43:13,  2.73it/s][A

epoch 2 iter 2393: train loss 0.34106. lr 1.835455e-04:  25%|██▌       | 2394/9433 [14:41<43:07,  2.72it/s][A
epoch 2 iter 2394: train loss 0.32110. lr 1.834995e-04:  25%|██▌       | 2394/9433 [14:41<43:07,  2.72it/s][A
epoch 2 iter 2394: train loss 0.32110. lr 1.834995e-04:  25%|██▌       | 2395/9433 [14:41<43:08,  2.72it/s][A
epoch 2 iter 2395: train loss 0.33512. lr 1.834534e-04:  25%|██▌       | 2395/9433 [14:42<43:08,  2.72it/s][A
epoch 2 iter 2395: train loss 0.33512. lr 1.834534e-04:  25%|██▌       | 2396/9433 [14:42<43:12,  2.71it/s][A
epoch 2 iter 2396: train loss 0.32576. lr 1.834074e-04:  25%|██▌       | 2396/9433 [14:42<43:12,  2.71it/s][A
epoch 2 iter 2396: train loss 0.32576. lr 1.834074e-04:  25%|██▌       | 2397/9433 [14:42<43:15,  2.71it/s][A
epoch 2 iter 2397: train loss 0.35007. lr 1.833614e-04:  25%|██▌       | 2397/9433 [14:43<43:15,  2.71it/s][A
epoch 2 iter 2397: train loss 0.35007. lr 1.833614e-04:  25%|██▌       | 2398/9433 [14:43<43:15,  2.71it/s][A

epoch 2 iter 2430: train loss 0.32747. lr 1.818442e-04:  26%|██▌       | 2430/9433 [14:55<42:52,  2.72it/s][A
epoch 2 iter 2430: train loss 0.32747. lr 1.818442e-04:  26%|██▌       | 2431/9433 [14:55<42:59,  2.71it/s][A
epoch 2 iter 2431: train loss 0.36054. lr 1.817983e-04:  26%|██▌       | 2431/9433 [14:55<42:59,  2.71it/s][A
epoch 2 iter 2431: train loss 0.36054. lr 1.817983e-04:  26%|██▌       | 2432/9433 [14:55<43:01,  2.71it/s][A
epoch 2 iter 2432: train loss 0.33807. lr 1.817524e-04:  26%|██▌       | 2432/9433 [14:55<43:01,  2.71it/s][A
epoch 2 iter 2432: train loss 0.33807. lr 1.817524e-04:  26%|██▌       | 2433/9433 [14:55<42:51,  2.72it/s][A
epoch 2 iter 2433: train loss 0.32788. lr 1.817065e-04:  26%|██▌       | 2433/9433 [14:56<42:51,  2.72it/s][A
epoch 2 iter 2433: train loss 0.32788. lr 1.817065e-04:  26%|██▌       | 2434/9433 [14:56<42:51,  2.72it/s][A
epoch 2 iter 2434: train loss 0.33342. lr 1.816605e-04:  26%|██▌       | 2434/9433 [14:56<42:51,  2.72it/s][A

epoch 2 iter 2466: train loss 0.32309. lr 1.801932e-04:  26%|██▌       | 2467/9433 [15:08<42:23,  2.74it/s][A
epoch 2 iter 2467: train loss 0.30761. lr 1.801474e-04:  26%|██▌       | 2467/9433 [15:08<42:23,  2.74it/s][A
epoch 2 iter 2467: train loss 0.30761. lr 1.801474e-04:  26%|██▌       | 2468/9433 [15:08<42:34,  2.73it/s][A
epoch 2 iter 2468: train loss 0.34316. lr 1.801016e-04:  26%|██▌       | 2468/9433 [15:09<42:34,  2.73it/s][A
epoch 2 iter 2468: train loss 0.34316. lr 1.801016e-04:  26%|██▌       | 2469/9433 [15:09<42:37,  2.72it/s][A
epoch 2 iter 2469: train loss 0.30750. lr 1.800558e-04:  26%|██▌       | 2469/9433 [15:09<42:37,  2.72it/s][A
epoch 2 iter 2469: train loss 0.30750. lr 1.800558e-04:  26%|██▌       | 2470/9433 [15:09<42:32,  2.73it/s][A
epoch 2 iter 2470: train loss 0.34842. lr 1.800100e-04:  26%|██▌       | 2470/9433 [15:09<42:32,  2.73it/s][A
epoch 2 iter 2470: train loss 0.34842. lr 1.800100e-04:  26%|██▌       | 2471/9433 [15:09<42:26,  2.73it/s][A

epoch 2 iter 4451: train loss 0.31563. lr 9.740928e-05:  47%|████▋     | 4451/9433 [27:18<30:26,  2.73it/s][A
epoch 2 iter 4451: train loss 0.31563. lr 9.740928e-05:  47%|████▋     | 4452/9433 [27:18<30:28,  2.72it/s][A
epoch 2 iter 4452: train loss 0.30309. lr 9.737244e-05:  47%|████▋     | 4452/9433 [27:18<30:28,  2.72it/s][A
epoch 2 iter 4452: train loss 0.30309. lr 9.737244e-05:  47%|████▋     | 4453/9433 [27:18<30:31,  2.72it/s][A
epoch 2 iter 4453: train loss 0.27898. lr 9.733560e-05:  47%|████▋     | 4453/9433 [27:18<30:31,  2.72it/s][A
epoch 2 iter 4453: train loss 0.27898. lr 9.733560e-05:  47%|████▋     | 4454/9433 [27:18<30:33,  2.72it/s][A
epoch 2 iter 4454: train loss 0.27700. lr 9.729877e-05:  47%|████▋     | 4454/9433 [27:19<30:33,  2.72it/s][A
epoch 2 iter 4454: train loss 0.27700. lr 9.729877e-05:  47%|████▋     | 4455/9433 [27:19<30:29,  2.72it/s][A
epoch 2 iter 4455: train loss 0.30005. lr 9.726194e-05:  47%|████▋     | 4455/9433 [27:19<30:29,  2.72it/s][A

epoch 2 iter 4487: train loss 0.27812. lr 9.608645e-05:  48%|████▊     | 4488/9433 [27:31<30:20,  2.72it/s][A
epoch 2 iter 4488: train loss 0.30775. lr 9.604981e-05:  48%|████▊     | 4488/9433 [27:31<30:20,  2.72it/s][A
epoch 2 iter 4488: train loss 0.30775. lr 9.604981e-05:  48%|████▊     | 4489/9433 [27:31<30:17,  2.72it/s][A
epoch 2 iter 4489: train loss 0.29141. lr 9.601318e-05:  48%|████▊     | 4489/9433 [27:32<30:17,  2.72it/s][A
epoch 2 iter 4489: train loss 0.29141. lr 9.601318e-05:  48%|████▊     | 4490/9433 [27:32<33:29,  2.46it/s][A
epoch 2 iter 4490: train loss 0.30642. lr 9.597655e-05:  48%|████▊     | 4490/9433 [27:32<33:29,  2.46it/s][A
epoch 2 iter 4490: train loss 0.30642. lr 9.597655e-05:  48%|████▊     | 4491/9433 [27:32<32:35,  2.53it/s][A
epoch 2 iter 4491: train loss 0.28303. lr 9.593992e-05:  48%|████▊     | 4491/9433 [27:32<32:35,  2.53it/s][A
epoch 2 iter 4491: train loss 0.28303. lr 9.593992e-05:  48%|████▊     | 4492/9433 [27:32<31:51,  2.59it/s][A

epoch 2 iter 4524: train loss 0.30031. lr 9.473451e-05:  48%|████▊     | 4524/9433 [27:45<30:07,  2.72it/s][A
epoch 2 iter 4524: train loss 0.30031. lr 9.473451e-05:  48%|████▊     | 4525/9433 [27:45<30:11,  2.71it/s][A
epoch 2 iter 4525: train loss 0.30428. lr 9.469808e-05:  48%|████▊     | 4525/9433 [27:45<30:11,  2.71it/s][A
epoch 2 iter 4525: train loss 0.30428. lr 9.469808e-05:  48%|████▊     | 4526/9433 [27:45<30:10,  2.71it/s][A
epoch 2 iter 4526: train loss 0.26567. lr 9.466166e-05:  48%|████▊     | 4526/9433 [27:45<30:10,  2.71it/s][A
epoch 2 iter 4526: train loss 0.26567. lr 9.466166e-05:  48%|████▊     | 4527/9433 [27:45<30:04,  2.72it/s][A
epoch 2 iter 4527: train loss 0.27359. lr 9.462524e-05:  48%|████▊     | 4527/9433 [27:46<30:04,  2.72it/s][A
epoch 2 iter 4527: train loss 0.27359. lr 9.462524e-05:  48%|████▊     | 4528/9433 [27:46<30:04,  2.72it/s][A
epoch 2 iter 4528: train loss 0.29027. lr 9.458883e-05:  48%|████▊     | 4528/9433 [27:46<30:04,  2.72it/s][A

epoch 2 iter 4560: train loss 0.29058. lr 9.342660e-05:  48%|████▊     | 4561/9433 [27:58<29:53,  2.72it/s][A
epoch 2 iter 4561: train loss 0.27998. lr 9.339037e-05:  48%|████▊     | 4561/9433 [27:58<29:53,  2.72it/s][A
epoch 2 iter 4561: train loss 0.27998. lr 9.339037e-05:  48%|████▊     | 4562/9433 [27:58<29:48,  2.72it/s][A
epoch 2 iter 4562: train loss 0.30340. lr 9.335415e-05:  48%|████▊     | 4562/9433 [27:59<29:48,  2.72it/s][A
epoch 2 iter 4562: train loss 0.30340. lr 9.335415e-05:  48%|████▊     | 4563/9433 [27:59<29:45,  2.73it/s][A
epoch 2 iter 4563: train loss 0.29907. lr 9.331794e-05:  48%|████▊     | 4563/9433 [27:59<29:45,  2.73it/s][A
epoch 2 iter 4563: train loss 0.29907. lr 9.331794e-05:  48%|████▊     | 4564/9433 [27:59<29:52,  2.72it/s][A
epoch 2 iter 4564: train loss 0.29445. lr 9.328173e-05:  48%|████▊     | 4564/9433 [27:59<29:52,  2.72it/s][A
epoch 2 iter 4564: train loss 0.29445. lr 9.328173e-05:  48%|████▊     | 4565/9433 [27:59<29:53,  2.71it/s][A

epoch 2 iter 4783: train loss 0.28704. lr 8.549199e-05:  51%|█████     | 4784/9433 [29:20<28:32,  2.71it/s][A
epoch 2 iter 4784: train loss 0.30603. lr 8.545707e-05:  51%|█████     | 4784/9433 [29:20<28:32,  2.71it/s][A
epoch 2 iter 4784: train loss 0.30603. lr 8.545707e-05:  51%|█████     | 4785/9433 [29:20<28:36,  2.71it/s][A
epoch 2 iter 4785: train loss 0.30776. lr 8.542215e-05:  51%|█████     | 4785/9433 [29:21<28:36,  2.71it/s][A
epoch 2 iter 4785: train loss 0.30776. lr 8.542215e-05:  51%|█████     | 4786/9433 [29:21<28:35,  2.71it/s][A
epoch 2 iter 4786: train loss 0.29875. lr 8.538724e-05:  51%|█████     | 4786/9433 [29:21<28:35,  2.71it/s][A
epoch 2 iter 4786: train loss 0.29875. lr 8.538724e-05:  51%|█████     | 4787/9433 [29:21<28:28,  2.72it/s][A
epoch 2 iter 4787: train loss 0.28976. lr 8.535233e-05:  51%|█████     | 4787/9433 [29:21<28:28,  2.72it/s][A
epoch 2 iter 4787: train loss 0.28976. lr 8.535233e-05:  51%|█████     | 4788/9433 [29:21<28:26,  2.72it/s][A

epoch 2 iter 4820: train loss 0.28781. lr 8.420381e-05:  51%|█████     | 4820/9433 [29:33<28:20,  2.71it/s][A
epoch 2 iter 4820: train loss 0.28781. lr 8.420381e-05:  51%|█████     | 4821/9433 [29:33<28:17,  2.72it/s][A
epoch 2 iter 4821: train loss 0.29273. lr 8.416911e-05:  51%|█████     | 4821/9433 [29:34<28:17,  2.72it/s][A
epoch 2 iter 4821: train loss 0.29273. lr 8.416911e-05:  51%|█████     | 4822/9433 [29:34<28:13,  2.72it/s][A
epoch 2 iter 4822: train loss 0.29192. lr 8.413441e-05:  51%|█████     | 4822/9433 [29:34<28:13,  2.72it/s][A
epoch 2 iter 4822: train loss 0.29192. lr 8.413441e-05:  51%|█████     | 4823/9433 [29:34<28:05,  2.74it/s][A
epoch 2 iter 4823: train loss 0.29157. lr 8.409972e-05:  51%|█████     | 4823/9433 [29:35<28:05,  2.74it/s][A
epoch 2 iter 4823: train loss 0.29157. lr 8.409972e-05:  51%|█████     | 4824/9433 [29:35<28:16,  2.72it/s][A
epoch 2 iter 4824: train loss 0.31404. lr 8.406503e-05:  51%|█████     | 4824/9433 [29:35<28:16,  2.72it/s][A

epoch 2 iter 4856: train loss 0.27451. lr 8.295830e-05:  51%|█████▏    | 4857/9433 [29:47<27:59,  2.72it/s][A
epoch 2 iter 4857: train loss 0.28263. lr 8.292382e-05:  51%|█████▏    | 4857/9433 [29:47<27:59,  2.72it/s][A
epoch 2 iter 4857: train loss 0.28263. lr 8.292382e-05:  52%|█████▏    | 4858/9433 [29:47<28:02,  2.72it/s][A
epoch 2 iter 4858: train loss 0.28605. lr 8.288934e-05:  52%|█████▏    | 4858/9433 [29:47<28:02,  2.72it/s][A
epoch 2 iter 4858: train loss 0.28605. lr 8.288934e-05:  52%|█████▏    | 4859/9433 [29:47<28:03,  2.72it/s][A
epoch 2 iter 4859: train loss 0.29555. lr 8.285486e-05:  52%|█████▏    | 4859/9433 [29:48<28:03,  2.72it/s][A
epoch 2 iter 4859: train loss 0.29555. lr 8.285486e-05:  52%|█████▏    | 4860/9433 [29:48<27:59,  2.72it/s][A
epoch 2 iter 4860: train loss 0.28011. lr 8.282040e-05:  52%|█████▏    | 4860/9433 [29:48<27:59,  2.72it/s][A
epoch 2 iter 4860: train loss 0.28011. lr 8.282040e-05:  52%|█████▏    | 4861/9433 [29:48<27:55,  2.73it/s][A

epoch 2 iter 4893: train loss 0.28921. lr 8.168633e-05:  52%|█████▏    | 4893/9433 [30:00<27:49,  2.72it/s][A
epoch 2 iter 4893: train loss 0.28921. lr 8.168633e-05:  52%|█████▏    | 4894/9433 [30:00<27:45,  2.73it/s][A
epoch 2 iter 4894: train loss 0.28630. lr 8.165207e-05:  52%|█████▏    | 4894/9433 [30:01<27:45,  2.73it/s][A
epoch 2 iter 4894: train loss 0.28630. lr 8.165207e-05:  52%|█████▏    | 4895/9433 [30:01<27:45,  2.72it/s][A
epoch 2 iter 4895: train loss 0.29079. lr 8.161781e-05:  52%|█████▏    | 4895/9433 [30:01<27:45,  2.72it/s][A
epoch 2 iter 4895: train loss 0.29079. lr 8.161781e-05:  52%|█████▏    | 4896/9433 [30:01<27:48,  2.72it/s][A
epoch 2 iter 4896: train loss 0.28387. lr 8.158356e-05:  52%|█████▏    | 4896/9433 [30:01<27:48,  2.72it/s][A
epoch 2 iter 4896: train loss 0.28387. lr 8.158356e-05:  52%|█████▏    | 4897/9433 [30:01<27:52,  2.71it/s][A
epoch 2 iter 4897: train loss 0.27757. lr 8.154932e-05:  52%|█████▏    | 4897/9433 [30:02<27:52,  2.71it/s][A

epoch 2 iter 4929: train loss 0.27723. lr 8.045669e-05:  52%|█████▏    | 4930/9433 [30:13<27:28,  2.73it/s][A
epoch 2 iter 4930: train loss 0.27865. lr 8.042265e-05:  52%|█████▏    | 4930/9433 [30:14<27:28,  2.73it/s][A
epoch 2 iter 4930: train loss 0.27865. lr 8.042265e-05:  52%|█████▏    | 4931/9433 [30:14<27:35,  2.72it/s][A
epoch 2 iter 4931: train loss 0.27708. lr 8.038861e-05:  52%|█████▏    | 4931/9433 [30:14<27:35,  2.72it/s][A
epoch 2 iter 4931: train loss 0.27708. lr 8.038861e-05:  52%|█████▏    | 4932/9433 [30:14<27:38,  2.71it/s][A
epoch 2 iter 4932: train loss 0.27914. lr 8.035458e-05:  52%|█████▏    | 4932/9433 [30:15<27:38,  2.71it/s][A
epoch 2 iter 4932: train loss 0.27914. lr 8.035458e-05:  52%|█████▏    | 4933/9433 [30:15<27:35,  2.72it/s][A
epoch 2 iter 4933: train loss 0.30317. lr 8.032055e-05:  52%|█████▏    | 4933/9433 [30:15<27:35,  2.72it/s][A
epoch 2 iter 4933: train loss 0.30317. lr 8.032055e-05:  52%|█████▏    | 4934/9433 [30:15<27:32,  2.72it/s][A

epoch 2 iter 4966: train loss 0.27292. lr 7.920112e-05:  53%|█████▎    | 4966/9433 [30:27<27:24,  2.72it/s][A
epoch 2 iter 4966: train loss 0.27292. lr 7.920112e-05:  53%|█████▎    | 4967/9433 [30:27<27:20,  2.72it/s][A
epoch 2 iter 4967: train loss 0.28626. lr 7.916730e-05:  53%|█████▎    | 4967/9433 [30:27<27:20,  2.72it/s][A
epoch 2 iter 4967: train loss 0.28626. lr 7.916730e-05:  53%|█████▎    | 4968/9433 [30:27<27:18,  2.72it/s][A
epoch 2 iter 4968: train loss 0.27409. lr 7.913349e-05:  53%|█████▎    | 4968/9433 [30:28<27:18,  2.72it/s][A
epoch 2 iter 4968: train loss 0.27409. lr 7.913349e-05:  53%|█████▎    | 4969/9433 [30:28<27:09,  2.74it/s][A
epoch 2 iter 4969: train loss 0.30208. lr 7.909968e-05:  53%|█████▎    | 4969/9433 [30:28<27:09,  2.74it/s][A
epoch 2 iter 4969: train loss 0.30208. lr 7.909968e-05:  53%|█████▎    | 4970/9433 [30:28<27:13,  2.73it/s][A
epoch 2 iter 4970: train loss 0.29488. lr 7.906588e-05:  53%|█████▎    | 4970/9433 [30:29<27:13,  2.73it/s][A

epoch 2 iter 5002: train loss 0.27611. lr 7.798753e-05:  53%|█████▎    | 5003/9433 [30:40<27:09,  2.72it/s][A
epoch 2 iter 5003: train loss 0.28125. lr 7.795393e-05:  53%|█████▎    | 5003/9433 [30:41<27:09,  2.72it/s][A
epoch 2 iter 5003: train loss 0.28125. lr 7.795393e-05:  53%|█████▎    | 5004/9433 [30:41<27:10,  2.72it/s][A
epoch 2 iter 5004: train loss 0.27593. lr 7.792034e-05:  53%|█████▎    | 5004/9433 [30:41<27:10,  2.72it/s][A
epoch 2 iter 5004: train loss 0.27593. lr 7.792034e-05:  53%|█████▎    | 5005/9433 [30:41<27:06,  2.72it/s][A
epoch 2 iter 5005: train loss 0.26152. lr 7.788675e-05:  53%|█████▎    | 5005/9433 [30:41<27:06,  2.72it/s][A
epoch 2 iter 5005: train loss 0.26152. lr 7.788675e-05:  53%|█████▎    | 5006/9433 [30:41<27:02,  2.73it/s][A
epoch 2 iter 5006: train loss 0.28467. lr 7.785318e-05:  53%|█████▎    | 5006/9433 [30:42<27:02,  2.73it/s][A
epoch 2 iter 5006: train loss 0.28467. lr 7.785318e-05:  53%|█████▎    | 5007/9433 [30:42<26:54,  2.74it/s][A

epoch 2 iter 5039: train loss 0.29845. lr 7.674854e-05:  53%|█████▎    | 5039/9433 [30:54<26:56,  2.72it/s][A
epoch 2 iter 5039: train loss 0.29845. lr 7.674854e-05:  53%|█████▎    | 5040/9433 [30:54<26:53,  2.72it/s][A
epoch 2 iter 5040: train loss 0.26879. lr 7.671517e-05:  53%|█████▎    | 5040/9433 [30:54<26:53,  2.72it/s][A
epoch 2 iter 5040: train loss 0.26879. lr 7.671517e-05:  53%|█████▎    | 5041/9433 [30:54<26:53,  2.72it/s][A
epoch 2 iter 5041: train loss 0.29278. lr 7.668181e-05:  53%|█████▎    | 5041/9433 [30:55<26:53,  2.72it/s][A
epoch 2 iter 5041: train loss 0.29278. lr 7.668181e-05:  53%|█████▎    | 5042/9433 [30:55<26:57,  2.72it/s][A
epoch 2 iter 5042: train loss 0.27299. lr 7.664845e-05:  53%|█████▎    | 5042/9433 [30:55<26:57,  2.72it/s][A
epoch 2 iter 5042: train loss 0.27299. lr 7.664845e-05:  53%|█████▎    | 5043/9433 [30:55<26:57,  2.71it/s][A
epoch 2 iter 5043: train loss 0.28562. lr 7.661510e-05:  53%|█████▎    | 5043/9433 [30:55<26:57,  2.71it/s][A

epoch 2 iter 5075: train loss 0.29102. lr 7.555117e-05:  54%|█████▍    | 5076/9433 [31:07<26:45,  2.71it/s][A
epoch 2 iter 5076: train loss 0.27580. lr 7.551802e-05:  54%|█████▍    | 5076/9433 [31:07<26:45,  2.71it/s][A
epoch 2 iter 5076: train loss 0.27580. lr 7.551802e-05:  54%|█████▍    | 5077/9433 [31:07<26:49,  2.71it/s][A
epoch 2 iter 5077: train loss 0.28367. lr 7.548489e-05:  54%|█████▍    | 5077/9433 [31:08<26:49,  2.71it/s][A
epoch 2 iter 5077: train loss 0.28367. lr 7.548489e-05:  54%|█████▍    | 5078/9433 [31:08<26:47,  2.71it/s][A
epoch 2 iter 5078: train loss 0.27686. lr 7.545175e-05:  54%|█████▍    | 5078/9433 [31:08<26:47,  2.71it/s][A
epoch 2 iter 5078: train loss 0.27686. lr 7.545175e-05:  54%|█████▍    | 5079/9433 [31:08<26:42,  2.72it/s][A
epoch 2 iter 5079: train loss 0.30433. lr 7.541863e-05:  54%|█████▍    | 5079/9433 [31:09<26:42,  2.72it/s][A
epoch 2 iter 5079: train loss 0.30433. lr 7.541863e-05:  54%|█████▍    | 5080/9433 [31:09<26:40,  2.72it/s][A

epoch 2 iter 5112: train loss 0.28196. lr 7.432895e-05:  54%|█████▍    | 5112/9433 [31:21<26:33,  2.71it/s][A
epoch 2 iter 5112: train loss 0.28196. lr 7.432895e-05:  54%|█████▍    | 5113/9433 [31:21<26:28,  2.72it/s][A
epoch 2 iter 5113: train loss 0.29341. lr 7.429603e-05:  54%|█████▍    | 5113/9433 [31:21<26:28,  2.72it/s][A
epoch 2 iter 5113: train loss 0.29341. lr 7.429603e-05:  54%|█████▍    | 5114/9433 [31:21<26:25,  2.72it/s][A
epoch 2 iter 5114: train loss 0.26722. lr 7.426313e-05:  54%|█████▍    | 5114/9433 [31:21<26:25,  2.72it/s][A
epoch 2 iter 5114: train loss 0.26722. lr 7.426313e-05:  54%|█████▍    | 5115/9433 [31:21<26:23,  2.73it/s][A
epoch 2 iter 5115: train loss 0.27347. lr 7.423022e-05:  54%|█████▍    | 5115/9433 [31:22<26:23,  2.73it/s][A
epoch 2 iter 5115: train loss 0.27347. lr 7.423022e-05:  54%|█████▍    | 5116/9433 [31:22<26:38,  2.70it/s][A
epoch 2 iter 5116: train loss 0.30722. lr 7.419733e-05:  54%|█████▍    | 5116/9433 [31:22<26:38,  2.70it/s][A

epoch 2 iter 7068: train loss 0.25106. lr 6.000000e-05:  75%|███████▍  | 7069/9433 [43:21<14:32,  2.71it/s][A
epoch 2 iter 7069: train loss 0.28832. lr 6.000000e-05:  75%|███████▍  | 7069/9433 [43:21<14:32,  2.71it/s][A
epoch 2 iter 7069: train loss 0.28832. lr 6.000000e-05:  75%|███████▍  | 7070/9433 [43:21<14:31,  2.71it/s][A
epoch 2 iter 7070: train loss 0.27604. lr 6.000000e-05:  75%|███████▍  | 7070/9433 [43:21<14:31,  2.71it/s][A
epoch 2 iter 7070: train loss 0.27604. lr 6.000000e-05:  75%|███████▍  | 7071/9433 [43:21<14:28,  2.72it/s][A
epoch 2 iter 7071: train loss 0.27945. lr 6.000000e-05:  75%|███████▍  | 7071/9433 [43:22<14:28,  2.72it/s][A
epoch 2 iter 7071: train loss 0.27945. lr 6.000000e-05:  75%|███████▍  | 7072/9433 [43:22<14:28,  2.72it/s][A
epoch 2 iter 7072: train loss 0.27007. lr 6.000000e-05:  75%|███████▍  | 7072/9433 [43:22<14:28,  2.72it/s][A
epoch 2 iter 7072: train loss 0.27007. lr 6.000000e-05:  75%|███████▍  | 7073/9433 [43:22<14:29,  2.71it/s][A

epoch 2 iter 7105: train loss 0.27992. lr 6.000000e-05:  75%|███████▌  | 7105/9433 [43:34<14:17,  2.71it/s][A
epoch 2 iter 7105: train loss 0.27992. lr 6.000000e-05:  75%|███████▌  | 7106/9433 [43:34<14:17,  2.71it/s][A
epoch 2 iter 7106: train loss 0.27870. lr 6.000000e-05:  75%|███████▌  | 7106/9433 [43:35<14:17,  2.71it/s][A
epoch 2 iter 7106: train loss 0.27870. lr 6.000000e-05:  75%|███████▌  | 7107/9433 [43:35<14:17,  2.71it/s][A
epoch 2 iter 7107: train loss 0.25080. lr 6.000000e-05:  75%|███████▌  | 7107/9433 [43:35<14:17,  2.71it/s][A
epoch 2 iter 7107: train loss 0.25080. lr 6.000000e-05:  75%|███████▌  | 7108/9433 [43:35<14:19,  2.71it/s][A
epoch 2 iter 7108: train loss 0.27300. lr 6.000000e-05:  75%|███████▌  | 7108/9433 [43:35<14:19,  2.71it/s][A
epoch 2 iter 7108: train loss 0.27300. lr 6.000000e-05:  75%|███████▌  | 7109/9433 [43:35<14:18,  2.71it/s][A
epoch 2 iter 7109: train loss 0.29537. lr 6.000000e-05:  75%|███████▌  | 7109/9433 [43:36<14:18,  2.71it/s][A

epoch 2 iter 7141: train loss 0.25664. lr 6.000000e-05:  76%|███████▌  | 7142/9433 [43:48<14:04,  2.71it/s][A
epoch 2 iter 7142: train loss 0.26804. lr 6.000000e-05:  76%|███████▌  | 7142/9433 [43:48<14:04,  2.71it/s][A
epoch 2 iter 7142: train loss 0.26804. lr 6.000000e-05:  76%|███████▌  | 7143/9433 [43:48<14:06,  2.71it/s][A
epoch 2 iter 7143: train loss 0.26229. lr 6.000000e-05:  76%|███████▌  | 7143/9433 [43:48<14:06,  2.71it/s][A
epoch 2 iter 7143: train loss 0.26229. lr 6.000000e-05:  76%|███████▌  | 7144/9433 [43:48<14:04,  2.71it/s][A
epoch 2 iter 7144: train loss 0.25022. lr 6.000000e-05:  76%|███████▌  | 7144/9433 [43:49<14:04,  2.71it/s][A
epoch 2 iter 7144: train loss 0.25022. lr 6.000000e-05:  76%|███████▌  | 7145/9433 [43:49<14:01,  2.72it/s][A
epoch 2 iter 7145: train loss 0.24606. lr 6.000000e-05:  76%|███████▌  | 7145/9433 [43:49<14:01,  2.72it/s][A
epoch 2 iter 7145: train loss 0.24606. lr 6.000000e-05:  76%|███████▌  | 7146/9433 [43:49<14:01,  2.72it/s][A

epoch 2 iter 7178: train loss 0.26238. lr 6.000000e-05:  76%|███████▌  | 7178/9433 [44:01<13:55,  2.70it/s][A
epoch 2 iter 7178: train loss 0.26238. lr 6.000000e-05:  76%|███████▌  | 7179/9433 [44:01<13:55,  2.70it/s][A
epoch 2 iter 7179: train loss 0.26636. lr 6.000000e-05:  76%|███████▌  | 7179/9433 [44:02<13:55,  2.70it/s][A
epoch 2 iter 7179: train loss 0.26636. lr 6.000000e-05:  76%|███████▌  | 7180/9433 [44:02<13:53,  2.70it/s][A
epoch 2 iter 7180: train loss 0.26733. lr 6.000000e-05:  76%|███████▌  | 7180/9433 [44:02<13:53,  2.70it/s][A
epoch 2 iter 7180: train loss 0.26733. lr 6.000000e-05:  76%|███████▌  | 7181/9433 [44:02<13:54,  2.70it/s][A
epoch 2 iter 7181: train loss 0.26417. lr 6.000000e-05:  76%|███████▌  | 7181/9433 [44:02<13:54,  2.70it/s][A
epoch 2 iter 7181: train loss 0.26417. lr 6.000000e-05:  76%|███████▌  | 7182/9433 [44:02<13:53,  2.70it/s][A
epoch 2 iter 7182: train loss 0.26972. lr 6.000000e-05:  76%|███████▌  | 7182/9433 [44:03<13:53,  2.70it/s][A

epoch 2 iter 7390: train loss 0.29185. lr 6.000000e-05:  78%|███████▊  | 7390/9433 [45:19<12:33,  2.71it/s][A
epoch 2 iter 7390: train loss 0.29185. lr 6.000000e-05:  78%|███████▊  | 7391/9433 [45:19<12:34,  2.71it/s][A
epoch 2 iter 7391: train loss 0.26353. lr 6.000000e-05:  78%|███████▊  | 7391/9433 [45:20<12:34,  2.71it/s][A
epoch 2 iter 7391: train loss 0.26353. lr 6.000000e-05:  78%|███████▊  | 7392/9433 [45:20<12:32,  2.71it/s][A
epoch 2 iter 7392: train loss 0.27379. lr 6.000000e-05:  78%|███████▊  | 7392/9433 [45:20<12:32,  2.71it/s][A
epoch 2 iter 7392: train loss 0.27379. lr 6.000000e-05:  78%|███████▊  | 7393/9433 [45:20<12:31,  2.71it/s][A
epoch 2 iter 7393: train loss 0.25528. lr 6.000000e-05:  78%|███████▊  | 7393/9433 [45:20<12:31,  2.71it/s][A
epoch 2 iter 7393: train loss 0.25528. lr 6.000000e-05:  78%|███████▊  | 7394/9433 [45:20<12:28,  2.72it/s][A
epoch 2 iter 7394: train loss 0.24760. lr 6.000000e-05:  78%|███████▊  | 7394/9433 [45:21<12:28,  2.72it/s][A

epoch 2 iter 7426: train loss 0.24790. lr 6.000000e-05:  79%|███████▊  | 7427/9433 [45:32<12:16,  2.72it/s][A
epoch 2 iter 7427: train loss 0.27740. lr 6.000000e-05:  79%|███████▊  | 7427/9433 [45:33<12:16,  2.72it/s][A
epoch 2 iter 7427: train loss 0.27740. lr 6.000000e-05:  79%|███████▊  | 7428/9433 [45:33<12:17,  2.72it/s][A
epoch 2 iter 7428: train loss 0.26223. lr 6.000000e-05:  79%|███████▊  | 7428/9433 [45:33<12:17,  2.72it/s][A
epoch 2 iter 7428: train loss 0.26223. lr 6.000000e-05:  79%|███████▉  | 7429/9433 [45:33<12:20,  2.71it/s][A
epoch 2 iter 7429: train loss 0.26904. lr 6.000000e-05:  79%|███████▉  | 7429/9433 [45:34<12:20,  2.71it/s][A
epoch 2 iter 7429: train loss 0.26904. lr 6.000000e-05:  79%|███████▉  | 7430/9433 [45:34<12:20,  2.70it/s][A
epoch 2 iter 7430: train loss 0.25572. lr 6.000000e-05:  79%|███████▉  | 7430/9433 [45:34<12:20,  2.70it/s][A
epoch 2 iter 7430: train loss 0.25572. lr 6.000000e-05:  79%|███████▉  | 7431/9433 [45:34<12:19,  2.71it/s][A

epoch 2 iter 7463: train loss 0.26154. lr 6.000000e-05:  79%|███████▉  | 7463/9433 [45:46<12:05,  2.72it/s][A
epoch 2 iter 7463: train loss 0.26154. lr 6.000000e-05:  79%|███████▉  | 7464/9433 [45:46<12:05,  2.71it/s][A
epoch 2 iter 7464: train loss 0.27125. lr 6.000000e-05:  79%|███████▉  | 7464/9433 [45:46<12:05,  2.71it/s][A
epoch 2 iter 7464: train loss 0.27125. lr 6.000000e-05:  79%|███████▉  | 7465/9433 [45:46<12:06,  2.71it/s][A
epoch 2 iter 7465: train loss 0.26745. lr 6.000000e-05:  79%|███████▉  | 7465/9433 [45:47<12:06,  2.71it/s][A
epoch 2 iter 7465: train loss 0.26745. lr 6.000000e-05:  79%|███████▉  | 7466/9433 [45:47<12:05,  2.71it/s][A
epoch 2 iter 7466: train loss 0.25928. lr 6.000000e-05:  79%|███████▉  | 7466/9433 [45:47<12:05,  2.71it/s][A
epoch 2 iter 7466: train loss 0.25928. lr 6.000000e-05:  79%|███████▉  | 7467/9433 [45:47<12:03,  2.72it/s][A
epoch 2 iter 7467: train loss 0.28251. lr 6.000000e-05:  79%|███████▉  | 7467/9433 [45:48<12:03,  2.72it/s][A

epoch 2 iter 7499: train loss 0.26421. lr 6.000000e-05:  80%|███████▉  | 7500/9433 [45:59<11:52,  2.71it/s][A
epoch 2 iter 7500: train loss 0.25912. lr 6.000000e-05:  80%|███████▉  | 7500/9433 [46:00<11:52,  2.71it/s][A
epoch 2 iter 7500: train loss 0.25912. lr 6.000000e-05:  80%|███████▉  | 7501/9433 [46:00<11:51,  2.72it/s][A
epoch 2 iter 7501: train loss 0.24908. lr 6.000000e-05:  80%|███████▉  | 7501/9433 [46:00<11:51,  2.72it/s][A
epoch 2 iter 7501: train loss 0.24908. lr 6.000000e-05:  80%|███████▉  | 7502/9433 [46:00<11:50,  2.72it/s][A
epoch 2 iter 7502: train loss 0.27770. lr 6.000000e-05:  80%|███████▉  | 7502/9433 [46:00<11:50,  2.72it/s][A
epoch 2 iter 7502: train loss 0.27770. lr 6.000000e-05:  80%|███████▉  | 7503/9433 [46:00<11:50,  2.72it/s][A
epoch 2 iter 7503: train loss 0.25710. lr 6.000000e-05:  80%|███████▉  | 7503/9433 [46:01<11:50,  2.72it/s][A
epoch 2 iter 7503: train loss 0.25710. lr 6.000000e-05:  80%|███████▉  | 7504/9433 [46:01<11:48,  2.72it/s][A

epoch 2 iter 7536: train loss 0.25252. lr 6.000000e-05:  80%|███████▉  | 7536/9433 [46:13<11:39,  2.71it/s][A
epoch 2 iter 7536: train loss 0.25252. lr 6.000000e-05:  80%|███████▉  | 7537/9433 [46:13<11:38,  2.71it/s][A
epoch 2 iter 7537: train loss 0.28308. lr 6.000000e-05:  80%|███████▉  | 7537/9433 [46:13<11:38,  2.71it/s][A
epoch 2 iter 7537: train loss 0.28308. lr 6.000000e-05:  80%|███████▉  | 7538/9433 [46:13<11:37,  2.72it/s][A
epoch 2 iter 7538: train loss 0.27098. lr 6.000000e-05:  80%|███████▉  | 7538/9433 [46:14<11:37,  2.72it/s][A
epoch 2 iter 7538: train loss 0.27098. lr 6.000000e-05:  80%|███████▉  | 7539/9433 [46:14<11:35,  2.72it/s][A
epoch 2 iter 7539: train loss 0.27003. lr 6.000000e-05:  80%|███████▉  | 7539/9433 [46:14<11:35,  2.72it/s][A
epoch 2 iter 7539: train loss 0.27003. lr 6.000000e-05:  80%|███████▉  | 7540/9433 [46:14<11:37,  2.71it/s][A
epoch 2 iter 7540: train loss 0.28272. lr 6.000000e-05:  80%|███████▉  | 7540/9433 [46:14<11:37,  2.71it/s][A

epoch 2 iter 7572: train loss 0.27706. lr 6.000000e-05:  80%|████████  | 7573/9433 [46:26<11:25,  2.71it/s][A
epoch 2 iter 7573: train loss 0.26998. lr 6.000000e-05:  80%|████████  | 7573/9433 [46:27<11:25,  2.71it/s][A
epoch 2 iter 7573: train loss 0.26998. lr 6.000000e-05:  80%|████████  | 7574/9433 [46:27<11:26,  2.71it/s][A
epoch 2 iter 7574: train loss 0.27675. lr 6.000000e-05:  80%|████████  | 7574/9433 [46:27<11:26,  2.71it/s][A
epoch 2 iter 7574: train loss 0.27675. lr 6.000000e-05:  80%|████████  | 7575/9433 [46:27<11:27,  2.70it/s][A
epoch 2 iter 7575: train loss 0.27020. lr 6.000000e-05:  80%|████████  | 7575/9433 [46:27<11:27,  2.70it/s][A
epoch 2 iter 7575: train loss 0.27020. lr 6.000000e-05:  80%|████████  | 7576/9433 [46:27<11:26,  2.70it/s][A
epoch 2 iter 7576: train loss 0.28043. lr 6.000000e-05:  80%|████████  | 7576/9433 [46:28<11:26,  2.70it/s][A
epoch 2 iter 7576: train loss 0.28043. lr 6.000000e-05:  80%|████████  | 7577/9433 [46:28<11:25,  2.71it/s][A

epoch 2 iter 7609: train loss 0.26353. lr 6.000000e-05:  81%|████████  | 7609/9433 [46:40<11:12,  2.71it/s][A
epoch 2 iter 7609: train loss 0.26353. lr 6.000000e-05:  81%|████████  | 7610/9433 [46:40<11:13,  2.71it/s][A
epoch 2 iter 7610: train loss 0.26948. lr 6.000000e-05:  81%|████████  | 7610/9433 [46:40<11:13,  2.71it/s][A
epoch 2 iter 7610: train loss 0.26948. lr 6.000000e-05:  81%|████████  | 7611/9433 [46:40<11:11,  2.71it/s][A
epoch 2 iter 7611: train loss 0.26113. lr 6.000000e-05:  81%|████████  | 7611/9433 [46:41<11:11,  2.71it/s][A
epoch 2 iter 7611: train loss 0.26113. lr 6.000000e-05:  81%|████████  | 7612/9433 [46:41<11:10,  2.72it/s][A
epoch 2 iter 7612: train loss 0.26085. lr 6.000000e-05:  81%|████████  | 7612/9433 [46:41<11:10,  2.72it/s][A
epoch 2 iter 7612: train loss 0.26085. lr 6.000000e-05:  81%|████████  | 7613/9433 [46:41<11:08,  2.72it/s][A
epoch 2 iter 7613: train loss 0.27305. lr 6.000000e-05:  81%|████████  | 7613/9433 [46:41<11:08,  2.72it/s][A

epoch 2 iter 7645: train loss 0.25385. lr 6.000000e-05:  81%|████████  | 7646/9433 [46:53<10:58,  2.72it/s][A
epoch 2 iter 7646: train loss 0.25893. lr 6.000000e-05:  81%|████████  | 7646/9433 [46:54<10:58,  2.72it/s][A
epoch 2 iter 7646: train loss 0.25893. lr 6.000000e-05:  81%|████████  | 7647/9433 [46:54<10:57,  2.72it/s][A
epoch 2 iter 7647: train loss 0.29084. lr 6.000000e-05:  81%|████████  | 7647/9433 [46:54<10:57,  2.72it/s][A
epoch 2 iter 7647: train loss 0.29084. lr 6.000000e-05:  81%|████████  | 7648/9433 [46:54<10:58,  2.71it/s][A
epoch 2 iter 7648: train loss 0.26995. lr 6.000000e-05:  81%|████████  | 7648/9433 [46:54<10:58,  2.71it/s][A
epoch 2 iter 7648: train loss 0.26995. lr 6.000000e-05:  81%|████████  | 7649/9433 [46:54<10:56,  2.72it/s][A
epoch 2 iter 7649: train loss 0.25920. lr 6.000000e-05:  81%|████████  | 7649/9433 [46:55<10:56,  2.72it/s][A
epoch 2 iter 7649: train loss 0.25920. lr 6.000000e-05:  81%|████████  | 7650/9433 [46:55<10:56,  2.72it/s][A

epoch 2 iter 7682: train loss 0.25526. lr 6.000000e-05:  81%|████████▏ | 7682/9433 [47:07<10:49,  2.70it/s][A
epoch 2 iter 7682: train loss 0.25526. lr 6.000000e-05:  81%|████████▏ | 7683/9433 [47:07<10:47,  2.70it/s][A
epoch 2 iter 7683: train loss 0.27246. lr 6.000000e-05:  81%|████████▏ | 7683/9433 [47:07<10:47,  2.70it/s][A
epoch 2 iter 7683: train loss 0.27246. lr 6.000000e-05:  81%|████████▏ | 7684/9433 [47:07<10:43,  2.72it/s][A
epoch 2 iter 7684: train loss 0.26297. lr 6.000000e-05:  81%|████████▏ | 7684/9433 [47:08<10:43,  2.72it/s][A
epoch 2 iter 7684: train loss 0.26297. lr 6.000000e-05:  81%|████████▏ | 7685/9433 [47:08<10:45,  2.71it/s][A
epoch 2 iter 7685: train loss 0.27127. lr 6.000000e-05:  81%|████████▏ | 7685/9433 [47:08<10:45,  2.71it/s][A
epoch 2 iter 7685: train loss 0.27127. lr 6.000000e-05:  81%|████████▏ | 7686/9433 [47:08<10:46,  2.70it/s][A
epoch 2 iter 7686: train loss 0.27669. lr 6.000000e-05:  81%|████████▏ | 7686/9433 [47:08<10:46,  2.70it/s][A

epoch 2 iter 7718: train loss 0.26405. lr 6.000000e-05:  82%|████████▏ | 7719/9433 [47:20<10:32,  2.71it/s][A
epoch 2 iter 7719: train loss 0.27188. lr 6.000000e-05:  82%|████████▏ | 7719/9433 [47:21<10:32,  2.71it/s][A
epoch 2 iter 7719: train loss 0.27188. lr 6.000000e-05:  82%|████████▏ | 7720/9433 [47:21<10:29,  2.72it/s][A
epoch 2 iter 7720: train loss 0.27053. lr 6.000000e-05:  82%|████████▏ | 7720/9433 [47:21<10:29,  2.72it/s][A
epoch 2 iter 7720: train loss 0.27053. lr 6.000000e-05:  82%|████████▏ | 7721/9433 [47:21<10:31,  2.71it/s][A
epoch 2 iter 7721: train loss 0.24633. lr 6.000000e-05:  82%|████████▏ | 7721/9433 [47:21<10:31,  2.71it/s][A
epoch 2 iter 7721: train loss 0.24633. lr 6.000000e-05:  82%|████████▏ | 7722/9433 [47:21<10:32,  2.71it/s][A
epoch 2 iter 7722: train loss 0.28848. lr 6.000000e-05:  82%|████████▏ | 7722/9433 [47:22<10:32,  2.71it/s][A
epoch 2 iter 7722: train loss 0.28848. lr 6.000000e-05:  82%|████████▏ | 7723/9433 [47:22<10:30,  2.71it/s][A

epoch 1 iter 269: train loss 3.58122. lr 5.880700e-04:  18%|█▊        | 269/1499 [01:39<07:34,  2.71it/s][A
epoch 1 iter 269: train loss 3.58122. lr 5.880700e-04:  18%|█▊        | 270/1499 [01:39<07:32,  2.71it/s][A
epoch 1 iter 270: train loss 3.60371. lr 5.879820e-04:  18%|█▊        | 270/1499 [01:39<07:32,  2.71it/s][A
epoch 1 iter 270: train loss 3.60371. lr 5.879820e-04:  18%|█▊        | 271/1499 [01:39<07:32,  2.72it/s][A
epoch 1 iter 271: train loss 3.54645. lr 5.878937e-04:  18%|█▊        | 271/1499 [01:39<07:32,  2.72it/s][A
epoch 1 iter 271: train loss 3.54645. lr 5.878937e-04:  18%|█▊        | 272/1499 [01:39<07:32,  2.71it/s][A
epoch 1 iter 272: train loss 3.55668. lr 5.878051e-04:  18%|█▊        | 272/1499 [01:40<07:32,  2.71it/s][A
epoch 1 iter 272: train loss 3.55668. lr 5.878051e-04:  18%|█▊        | 273/1499 [01:40<07:33,  2.71it/s][A
epoch 1 iter 273: train loss 3.60858. lr 5.877161e-04:  18%|█▊        | 273/1499 [01:40<07:33,  2.71it/s][A

epoch 1 iter 306: train loss 3.19460. lr 5.846042e-04:  20%|██        | 307/1499 [01:52<07:17,  2.72it/s][A
epoch 1 iter 307: train loss 3.19643. lr 5.845046e-04:  20%|██        | 307/1499 [01:53<07:17,  2.72it/s][A
epoch 1 iter 307: train loss 3.19643. lr 5.845046e-04:  21%|██        | 308/1499 [01:53<07:18,  2.72it/s][A
epoch 1 iter 308: train loss 3.12427. lr 5.844047e-04:  21%|██        | 308/1499 [01:53<07:18,  2.72it/s][A
epoch 1 iter 308: train loss 3.12427. lr 5.844047e-04:  21%|██        | 309/1499 [01:53<07:18,  2.71it/s][A
epoch 1 iter 309: train loss 3.11573. lr 5.843044e-04:  21%|██        | 309/1499 [01:53<07:18,  2.71it/s][A
epoch 1 iter 309: train loss 3.11573. lr 5.843044e-04:  21%|██        | 310/1499 [01:53<07:17,  2.72it/s][A
epoch 1 iter 310: train loss 3.13565. lr 5.842038e-04:  21%|██        | 310/1499 [01:54<07:17,  2.72it/s][A
epoch 1 iter 310: train loss 3.13565. lr 5.842038e-04:  21%|██        | 311/1499 [01:54<07:17,  2.72it/s][A

epoch 1 iter 344: train loss 2.76982. lr 5.805990e-04:  23%|██▎       | 344/1499 [02:06<07:07,  2.70it/s][A
epoch 1 iter 344: train loss 2.76982. lr 5.805990e-04:  23%|██▎       | 345/1499 [02:06<07:06,  2.71it/s][A
epoch 1 iter 345: train loss 2.78505. lr 5.804876e-04:  23%|██▎       | 345/1499 [02:07<07:06,  2.71it/s][A
epoch 1 iter 345: train loss 2.78505. lr 5.804876e-04:  23%|██▎       | 346/1499 [02:07<07:04,  2.71it/s][A
epoch 1 iter 346: train loss 2.79164. lr 5.803758e-04:  23%|██▎       | 346/1499 [02:07<07:04,  2.71it/s][A
epoch 1 iter 346: train loss 2.79164. lr 5.803758e-04:  23%|██▎       | 347/1499 [02:07<07:05,  2.70it/s][A
epoch 1 iter 347: train loss 2.77208. lr 5.802638e-04:  23%|██▎       | 347/1499 [02:07<07:05,  2.70it/s][A
epoch 1 iter 347: train loss 2.77208. lr 5.802638e-04:  23%|██▎       | 348/1499 [02:07<07:06,  2.70it/s][A
epoch 1 iter 348: train loss 2.69297. lr 5.801514e-04:  23%|██▎       | 348/1499 [02:08<07:06,  2.70it/s][A

epoch 1 iter 381: train loss 2.38565. lr 5.762711e-04:  25%|██▌       | 382/1499 [02:20<06:54,  2.70it/s][A
epoch 1 iter 382: train loss 2.38365. lr 5.761483e-04:  25%|██▌       | 382/1499 [02:20<06:54,  2.70it/s][A

epoch 1 iter 567: train loss 1.05543. lr 5.483697e-04:  38%|███▊      | 567/1499 [03:28<05:44,  2.71it/s][A
epoch 1 iter 567: train loss 1.05543. lr 5.483697e-04:  38%|███▊      | 568/1499 [03:28<05:44,  2.71it/s][A
epoch 1 iter 568: train loss 1.05080. lr 5.481932e-04:  38%|███▊      | 568/1499 [03:29<05:44,  2.71it/s][A
epoch 1 iter 568: train loss 1.05080. lr 5.481932e-04:  38%|███▊      | 569/1499 [03:29<05:43,  2.71it/s][A

epoch 1 iter 602: train loss 0.93124. lr 5.420284e-04:  40%|████      | 602/1499 [03:41<05:30,  2.71it/s][A
epoch 1 iter 602: train loss 0.93124. lr 5.420284e-04:  40%|████      | 603/1499 [03:41<05:30,  2.71it/s][A
epoch 1 iter 603: train loss 0.90585. lr 5.418424e-04:  40%|████      | 603/1499 [03:42<05:30,  2.71it/s][A
epoch 1 iter 603: train loss 0.90585. lr 5.418424e-04:  40%|████      | 604/1499 [03:42<05:29,  2.71it/s][A
epoch 1 iter 604: train loss 0.89832. lr 5.416561e-04:  40%|████      | 604/1499 [03:42<05:29,  2.71it/s][A
epoch 1 iter 604: train loss 0.89832. lr 5.416561e-04:  40%|████      | 605/1499 [03:42<05:28,  2.72it/s][A
epoch 1 iter 605: train loss 0.89582. lr 5.414696e-04:  40%|████      | 605/1499 [03:42<05:28,  2.72it/s][A
epoch 1 iter 605: train loss 0.89582. lr 5.414696e-04:  40%|████      | 606/1499 [03:42<05:27,  2.73it/s][A
epoch 1 iter 606: train loss 0.84410. lr 5.412828e-04:  40%|████      | 606/1499 [03:43<05:27,  2.73it/s][A

epoch 1 iter 639: train loss 0.80389. lr 5.349705e-04:  43%|████▎     | 640/1499 [03:55<05:16,  2.71it/s][A
epoch 1 iter 640: train loss 0.76009. lr 5.347747e-04:  43%|████▎     | 640/1499 [03:55<05:16,  2.71it/s][A
epoch 1 iter 640: train loss 0.76009. lr 5.347747e-04:  43%|████▎     | 641/1499 [03:55<05:16,  2.71it/s][A
epoch 1 iter 641: train loss 0.76899. lr 5.345788e-04:  43%|████▎     | 641/1499 [03:56<05:16,  2.71it/s][A
epoch 1 iter 641: train loss 0.76899. lr 5.345788e-04:  43%|████▎     | 642/1499 [03:56<05:15,  2.72it/s][A
epoch 1 iter 642: train loss 0.79166. lr 5.343825e-04:  43%|████▎     | 642/1499 [03:56<05:15,  2.72it/s][A
epoch 1 iter 642: train loss 0.79166. lr 5.343825e-04:  43%|████▎     | 643/1499 [03:56<05:15,  2.71it/s][A
epoch 1 iter 643: train loss 0.81483. lr 5.341861e-04:  43%|████▎     | 643/1499 [03:56<05:15,  2.71it/s][A
epoch 1 iter 643: train loss 0.81483. lr 5.341861e-04:  43%|████▎     | 644/1499 [03:56<05:15,  2.71it/s][A

epoch 1 iter 677: train loss 0.65151. lr 5.273537e-04:  45%|████▌     | 677/1499 [04:09<05:02,  2.72it/s][A
epoch 1 iter 677: train loss 0.65151. lr 5.273537e-04:  45%|████▌     | 678/1499 [04:09<05:02,  2.71it/s][A
epoch 1 iter 678: train loss 0.67832. lr 5.271484e-04:  45%|████▌     | 678/1499 [04:09<05:02,  2.71it/s][A
epoch 1 iter 678: train loss 0.67832. lr 5.271484e-04:  45%|████▌     | 679/1499 [04:09<05:01,  2.72it/s][A
epoch 1 iter 679: train loss 0.64635. lr 5.269427e-04:  45%|████▌     | 679/1499 [04:10<05:01,  2.72it/s][A
epoch 1 iter 679: train loss 0.64635. lr 5.269427e-04:  45%|████▌     | 680/1499 [04:10<05:00,  2.72it/s][A
epoch 1 iter 680: train loss 0.69337. lr 5.267369e-04:  45%|████▌     | 680/1499 [04:10<05:00,  2.72it/s][A
epoch 1 iter 680: train loss 0.69337. lr 5.267369e-04:  45%|████▌     | 681/1499 [04:10<04:59,  2.73it/s][A
epoch 1 iter 681: train loss 0.69372. lr 5.265308e-04:  45%|████▌     | 681/1499 [04:10<04:59,  2.73it/s][A

epoch 1 iter 714: train loss 0.56614. lr 5.195905e-04:  48%|████▊     | 715/1499 [04:23<04:49,  2.71it/s][A
epoch 1 iter 715: train loss 0.55100. lr 5.193760e-04:  48%|████▊     | 715/1499 [04:23<04:49,  2.71it/s][A
epoch 1 iter 715: train loss 0.55100. lr 5.193760e-04:  48%|████▊     | 716/1499 [04:23<04:49,  2.71it/s][A
epoch 1 iter 716: train loss 0.56322. lr 5.191614e-04:  48%|████▊     | 716/1499 [04:23<04:49,  2.71it/s][A
epoch 1 iter 716: train loss 0.56322. lr 5.191614e-04:  48%|████▊     | 717/1499 [04:23<04:48,  2.71it/s][A
epoch 1 iter 717: train loss 0.55397. lr 5.189464e-04:  48%|████▊     | 717/1499 [04:24<04:48,  2.71it/s][A
epoch 1 iter 717: train loss 0.55397. lr 5.189464e-04:  48%|████▊     | 718/1499 [04:24<04:48,  2.71it/s][A
epoch 1 iter 718: train loss 0.54763. lr 5.187312e-04:  48%|████▊     | 718/1499 [04:24<04:48,  2.71it/s][A
epoch 1 iter 718: train loss 0.54763. lr 5.187312e-04:  48%|████▊     | 719/1499 [04:24<04:47,  2.72it/s][A

epoch 1 iter 752: train loss 0.48966. lr 5.112736e-04:  50%|█████     | 752/1499 [04:37<04:35,  2.72it/s][A
epoch 1 iter 752: train loss 0.48966. lr 5.112736e-04:  50%|█████     | 753/1499 [04:37<04:35,  2.71it/s][A
epoch 1 iter 753: train loss 0.48220. lr 5.110501e-04:  50%|█████     | 753/1499 [04:37<04:35,  2.71it/s][A
epoch 1 iter 753: train loss 0.48220. lr 5.110501e-04:  50%|█████     | 754/1499 [04:37<04:34,  2.72it/s][A
epoch 1 iter 754: train loss 0.50947. lr 5.108264e-04:  50%|█████     | 754/1499 [04:37<04:34,  2.72it/s][A
epoch 1 iter 754: train loss 0.50947. lr 5.108264e-04:  50%|█████     | 755/1499 [04:37<04:33,  2.72it/s][A
epoch 1 iter 755: train loss 0.52599. lr 5.106025e-04:  50%|█████     | 755/1499 [04:38<04:33,  2.72it/s][A
epoch 1 iter 755: train loss 0.52599. lr 5.106025e-04:  50%|█████     | 756/1499 [04:38<04:31,  2.73it/s][A
epoch 1 iter 756: train loss 0.49709. lr 5.103783e-04:  50%|█████     | 756/1499 [04:38<04:31,  2.73it/s][A

epoch 1 iter 789: train loss 0.42195. lr 5.028531e-04:  53%|█████▎    | 790/1499 [04:50<04:20,  2.72it/s][A
epoch 1 iter 790: train loss 0.43133. lr 5.026212e-04:  53%|█████▎    | 790/1499 [04:51<04:20,  2.72it/s][A
epoch 1 iter 790: train loss 0.43133. lr 5.026212e-04:  53%|█████▎    | 791/1499 [04:51<04:19,  2.73it/s][A
epoch 1 iter 791: train loss 0.44671. lr 5.023891e-04:  53%|█████▎    | 791/1499 [04:51<04:19,  2.73it/s][A
epoch 1 iter 791: train loss 0.44671. lr 5.023891e-04:  53%|█████▎    | 792/1499 [04:51<04:20,  2.71it/s][A
epoch 1 iter 792: train loss 0.44243. lr 5.021568e-04:  53%|█████▎    | 792/1499 [04:51<04:20,  2.71it/s][A
epoch 1 iter 792: train loss 0.44243. lr 5.021568e-04:  53%|█████▎    | 793/1499 [04:51<04:21,  2.70it/s][A
epoch 1 iter 793: train loss 0.44883. lr 5.019242e-04:  53%|█████▎    | 793/1499 [04:52<04:21,  2.70it/s][A
epoch 1 iter 793: train loss 0.44883. lr 5.019242e-04:  53%|█████▎    | 794/1499 [04:52<04:20,  2.71it/s][A

epoch 1 iter 827: train loss 0.41754. lr 4.938873e-04:  55%|█████▌    | 827/1499 [05:04<04:06,  2.72it/s][A
epoch 1 iter 827: train loss 0.41754. lr 4.938873e-04:  55%|█████▌    | 828/1499 [05:04<04:06,  2.72it/s][A
epoch 1 iter 828: train loss 0.40518. lr 4.936471e-04:  55%|█████▌    | 828/1499 [05:05<04:06,  2.72it/s][A
epoch 1 iter 828: train loss 0.40518. lr 4.936471e-04:  55%|█████▌    | 829/1499 [05:05<04:06,  2.72it/s][A
epoch 1 iter 829: train loss 0.40041. lr 4.934068e-04:  55%|█████▌    | 829/1499 [05:05<04:06,  2.72it/s][A
epoch 1 iter 829: train loss 0.40041. lr 4.934068e-04:  55%|█████▌    | 830/1499 [05:05<04:05,  2.72it/s][A
epoch 1 iter 830: train loss 0.40999. lr 4.931662e-04:  55%|█████▌    | 830/1499 [05:05<04:05,  2.72it/s][A
epoch 1 iter 830: train loss 0.40999. lr 4.931662e-04:  55%|█████▌    | 831/1499 [05:05<04:04,  2.73it/s][A
epoch 1 iter 831: train loss 0.42542. lr 4.929254e-04:  55%|█████▌    | 831/1499 [05:06<04:04,  2.73it/s][A

epoch 1 iter 864: train loss 0.37392. lr 4.848616e-04:  58%|█████▊    | 865/1499 [05:18<03:53,  2.72it/s][A
epoch 1 iter 865: train loss 0.36598. lr 4.846137e-04:  58%|█████▊    | 865/1499 [05:18<03:53,  2.72it/s][A
epoch 1 iter 865: train loss 0.36598. lr 4.846137e-04:  58%|█████▊    | 866/1499 [05:18<03:53,  2.71it/s][A
epoch 1 iter 866: train loss 0.36303. lr 4.843657e-04:  58%|█████▊    | 866/1499 [05:19<03:53,  2.71it/s][A
epoch 1 iter 866: train loss 0.36303. lr 4.843657e-04:  58%|█████▊    | 867/1499 [05:19<03:53,  2.71it/s][A
epoch 1 iter 867: train loss 0.36627. lr 4.841174e-04:  58%|█████▊    | 867/1499 [05:19<03:53,  2.71it/s][A
epoch 1 iter 867: train loss 0.36627. lr 4.841174e-04:  58%|█████▊    | 868/1499 [05:19<03:53,  2.71it/s][A
epoch 1 iter 868: train loss 0.35678. lr 4.838689e-04:  58%|█████▊    | 868/1499 [05:19<03:53,  2.71it/s][A
epoch 1 iter 868: train loss 0.35678. lr 4.838689e-04:  58%|█████▊    | 869/1499 [05:19<03:52,  2.71it/s][A

epoch 1 iter 902: train loss 0.34569. lr 4.753025e-04:  60%|██████    | 902/1499 [05:32<03:40,  2.70it/s][A
epoch 1 iter 902: train loss 0.34569. lr 4.753025e-04:  60%|██████    | 903/1499 [05:32<03:40,  2.71it/s][A
epoch 1 iter 903: train loss 0.34464. lr 4.750471e-04:  60%|██████    | 903/1499 [05:32<03:40,  2.71it/s][A
epoch 1 iter 903: train loss 0.34464. lr 4.750471e-04:  60%|██████    | 904/1499 [05:32<03:39,  2.71it/s][A
epoch 1 iter 904: train loss 0.34789. lr 4.747915e-04:  60%|██████    | 904/1499 [05:33<03:39,  2.71it/s][A
epoch 1 iter 904: train loss 0.34789. lr 4.747915e-04:  60%|██████    | 905/1499 [05:33<03:38,  2.72it/s][A
epoch 1 iter 905: train loss 0.35622. lr 4.745357e-04:  60%|██████    | 905/1499 [05:33<03:38,  2.72it/s][A
epoch 1 iter 905: train loss 0.35622. lr 4.745357e-04:  60%|██████    | 906/1499 [05:33<03:37,  2.73it/s][A
epoch 1 iter 906: train loss 0.35941. lr 4.742798e-04:  60%|██████    | 906/1499 [05:33<03:37,  2.73it/s][A

epoch 2 iter 1372: train loss 0.11519. lr 6.000000e-05:  92%|█████████▏| 1373/1499 [08:25<00:46,  2.71it/s][A
epoch 2 iter 1373: train loss 0.13006. lr 6.000000e-05:  92%|█████████▏| 1373/1499 [08:25<00:46,  2.71it/s][A
epoch 2 iter 1373: train loss 0.13006. lr 6.000000e-05:  92%|█████████▏| 1374/1499 [08:25<00:46,  2.71it/s][A
epoch 2 iter 1374: train loss 0.10463. lr 6.000000e-05:  92%|█████████▏| 1374/1499 [08:26<00:46,  2.71it/s][A
epoch 2 iter 1374: train loss 0.10463. lr 6.000000e-05:  92%|█████████▏| 1375/1499 [08:26<00:45,  2.71it/s][A
epoch 2 iter 1375: train loss 0.12875. lr 6.000000e-05:  92%|█████████▏| 1375/1499 [08:26<00:45,  2.71it/s][A
epoch 2 iter 1375: train loss 0.12875. lr 6.000000e-05:  92%|█████████▏| 1376/1499 [08:26<00:45,  2.71it/s][A
epoch 2 iter 1376: train loss 0.11891. lr 6.000000e-05:  92%|█████████▏| 1376/1499 [08:26<00:45,  2.71it/s][A
epoch 2 iter 1376: train loss 0.11891. lr 6.000000e-05:  92%|█████████▏| 1377/1499 [08:26<00:44,  2.71it/s][A

epoch 2 iter 1409: train loss 0.12631. lr 6.000000e-05:  94%|█████████▍| 1409/1499 [08:39<00:33,  2.71it/s][A
epoch 2 iter 1409: train loss 0.12631. lr 6.000000e-05:  94%|█████████▍| 1410/1499 [08:39<00:32,  2.72it/s][A
epoch 2 iter 1410: train loss 0.12228. lr 6.000000e-05:  94%|█████████▍| 1410/1499 [08:39<00:32,  2.72it/s][A
epoch 2 iter 1410: train loss 0.12228. lr 6.000000e-05:  94%|█████████▍| 1411/1499 [08:39<00:32,  2.72it/s][A
epoch 2 iter 1411: train loss 0.14444. lr 6.000000e-05:  94%|█████████▍| 1411/1499 [08:39<00:32,  2.72it/s][A
epoch 2 iter 1411: train loss 0.14444. lr 6.000000e-05:  94%|█████████▍| 1412/1499 [08:39<00:32,  2.71it/s][A
epoch 2 iter 1412: train loss 0.13326. lr 6.000000e-05:  94%|█████████▍| 1412/1499 [08:40<00:32,  2.71it/s][A
epoch 2 iter 1412: train loss 0.13326. lr 6.000000e-05:  94%|█████████▍| 1413/1499 [08:40<00:31,  2.71it/s][A
epoch 2 iter 1413: train loss 0.12423. lr 6.000000e-05:  94%|█████████▍| 1413/1499 [08:40<00:31,  2.71it/s][A

epoch 2 iter 1445: train loss 0.13499. lr 6.000000e-05:  96%|█████████▋| 1446/1499 [08:52<00:19,  2.72it/s][A
epoch 2 iter 1446: train loss 0.12972. lr 6.000000e-05:  96%|█████████▋| 1446/1499 [08:52<00:19,  2.72it/s][A
epoch 2 iter 1446: train loss 0.12972. lr 6.000000e-05:  97%|█████████▋| 1447/1499 [08:52<00:19,  2.71it/s][A
epoch 2 iter 1447: train loss 0.12223. lr 6.000000e-05:  97%|█████████▋| 1447/1499 [08:53<00:19,  2.71it/s][A
epoch 2 iter 1447: train loss 0.12223. lr 6.000000e-05:  97%|█████████▋| 1448/1499 [08:53<00:18,  2.71it/s][A
epoch 2 iter 1448: train loss 0.14184. lr 6.000000e-05:  97%|█████████▋| 1448/1499 [08:53<00:18,  2.71it/s][A
epoch 2 iter 1448: train loss 0.14184. lr 6.000000e-05:  97%|█████████▋| 1449/1499 [08:53<00:18,  2.71it/s][A
epoch 2 iter 1449: train loss 0.12201. lr 6.000000e-05:  97%|█████████▋| 1449/1499 [08:53<00:18,  2.71it/s][A
epoch 2 iter 1449: train loss 0.12201. lr 6.000000e-05:  97%|█████████▋| 1450/1499 [08:53<00:18,  2.72it/s][A

epoch 2 iter 1482: train loss 0.11385. lr 6.000000e-05:  99%|█████████▉| 1482/1499 [09:05<00:06,  2.71it/s][A
epoch 2 iter 1482: train loss 0.11385. lr 6.000000e-05:  99%|█████████▉| 1483/1499 [09:05<00:05,  2.71it/s][A
epoch 2 iter 1483: train loss 0.12784. lr 6.000000e-05:  99%|█████████▉| 1483/1499 [09:06<00:05,  2.71it/s][A
epoch 2 iter 1483: train loss 0.12784. lr 6.000000e-05:  99%|█████████▉| 1484/1499 [09:06<00:05,  2.72it/s][A
epoch 2 iter 1484: train loss 0.12084. lr 6.000000e-05:  99%|█████████▉| 1484/1499 [09:06<00:05,  2.72it/s][A
epoch 2 iter 1484: train loss 0.12084. lr 6.000000e-05:  99%|█████████▉| 1485/1499 [09:06<00:05,  2.72it/s][A
epoch 2 iter 1485: train loss 0.12086. lr 6.000000e-05:  99%|█████████▉| 1485/1499 [09:07<00:05,  2.72it/s][A
epoch 2 iter 1485: train loss 0.12086. lr 6.000000e-05:  99%|█████████▉| 1486/1499 [09:07<00:04,  2.71it/s][A
epoch 2 iter 1486: train loss 0.12823. lr 6.000000e-05:  99%|█████████▉| 1486/1499 [09:07<00:04,  2.71it/s][A

data has 287136 characters, 119547 unique.



  0%|          | 0/4485 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.73241. lr 6.000000e-04:   0%|          | 0/4485 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.73241. lr 6.000000e-04:   0%|          | 1/4485 [00:00<28:37,  2.61it/s][A
epoch 1 iter 1: train loss 11.34875. lr 5.999999e-04:   0%|          | 1/4485 [00:00<28:37,  2.61it/s][A

epoch 1 iter 187: train loss 4.38426. lr 5.993508e-04:   4%|▍         | 187/4485 [01:08<26:14,  2.73it/s][A
epoch 1 iter 187: train loss 4.38426. lr 5.993508e-04:   4%|▍         | 188/4485 [01:08<26:16,  2.73it/s][A
epoch 1 iter 188: train loss 4.50966. lr 5.993439e-04:   4%|▍         | 188/4485 [01:08<26:16,  2.73it/

epoch 1 iter 221: train loss 4.20912. lr 5.990947e-04:   5%|▍         | 221/4485 [01:21<26:09,  2.72it/s][A
epoch 1 iter 221: train loss 4.20912. lr 5.990947e-04:   5%|▍         | 222/4485 [01:21<26:10,  2.71it/s][A
epoch 1 iter 222: train loss 4.04049. lr 5.990865e-04:   5%|▍         | 222/4485 [01:21<26:10,  2.71it/s][A
epoch 1 iter 222: train loss 4.04049. lr 5.990865e-04:   5%|▍         | 223/4485 [01:21<26:06,  2.72it/s][A
epoch 1 iter 223: train loss 4.16398. lr 5.990783e-04:   5%|▍         | 223/4485 [01:21<26:06,  2.72it/s][A
epoch 1 iter 223: train loss 4.16398. lr 5.990783e-04:   5%|▍         | 224/4485 [01:21<26:02,  2.73it/s][A
epoch 1 iter 224: train loss 4.11959. lr 5.990701e-04:   5%|▍         | 224/4485 [01:22<26:02,  2.73it/s][A
epoch 1 iter 224: train loss 4.11959. lr 5.990701e-04:   5%|▌         | 225/4485 [01:22<25:58,  2.73it/s][A
epoch 1 iter 225: train loss 4.04872. lr 5.990618e-04:   5%|▌         | 225/4485 [01:22<25:58,  2.73it/s][A

epoch 1 iter 258: train loss 3.90870. lr 5.987678e-04:   6%|▌         | 259/4485 [01:34<25:56,  2.71it/s][A
epoch 1 iter 259: train loss 3.81619. lr 5.987582e-04:   6%|▌         | 259/4485 [01:35<25:56,  2.71it/s][A
epoch 1 iter 259: train loss 3.81619. lr 5.987582e-04:   6%|▌         | 260/4485 [01:35<25:55,  2.72it/s][A
epoch 1 iter 260: train loss 4.01541. lr 5.987487e-04:   6%|▌         | 260/4485 [01:35<25:55,  2.72it/s][A
epoch 1 iter 260: train loss 4.01541. lr 5.987487e-04:   6%|▌         | 261/4485 [01:35<25:58,  2.71it/s][A
epoch 1 iter 261: train loss 3.92664. lr 5.987390e-04:   6%|▌         | 261/4485 [01:35<25:58,  2.71it/s][A
epoch 1 iter 261: train loss 3.92664. lr 5.987390e-04:   6%|▌         | 262/4485 [01:35<25:57,  2.71it/s][A
epoch 1 iter 262: train loss 3.86004. lr 5.987294e-04:   6%|▌         | 262/4485 [01:36<25:57,  2.71it/s][A
epoch 1 iter 262: train loss 3.86004. lr 5.987294e-04:   6%|▌         | 263/4485 [01:36<25:52,  2.72it/s][A

epoch 1 iter 296: train loss 3.64894. lr 5.983797e-04:   7%|▋         | 296/4485 [01:48<25:44,  2.71it/s][A
epoch 1 iter 296: train loss 3.64894. lr 5.983797e-04:   7%|▋         | 297/4485 [01:48<25:44,  2.71it/s][A
epoch 1 iter 297: train loss 3.59973. lr 5.983688e-04:   7%|▋         | 297/4485 [01:49<25:44,  2.71it/s][A
epoch 1 iter 297: train loss 3.59973. lr 5.983688e-04:   7%|▋         | 298/4485 [01:49<25:39,  2.72it/s][A
epoch 1 iter 298: train loss 3.71938. lr 5.983579e-04:   7%|▋         | 298/4485 [01:49<25:39,  2.72it/s][A
epoch 1 iter 298: train loss 3.71938. lr 5.983579e-04:   7%|▋         | 299/4485 [01:49<25:37,  2.72it/s][A
epoch 1 iter 299: train loss 3.63767. lr 5.983469e-04:   7%|▋         | 299/4485 [01:49<25:37,  2.72it/s][A
epoch 1 iter 299: train loss 3.63767. lr 5.983469e-04:   7%|▋         | 300/4485 [01:49<25:40,  2.72it/s][A
epoch 1 iter 300: train loss 3.49756. lr 5.983358e-04:   7%|▋         | 300/4485 [01:50<25:40,  2.72it/s][A

epoch 1 iter 333: train loss 3.45960. lr 5.979512e-04:   7%|▋         | 334/4485 [02:02<25:24,  2.72it/s][A
epoch 1 iter 334: train loss 3.49490. lr 5.979389e-04:   7%|▋         | 334/4485 [02:02<25:24,  2.72it/s][A
epoch 1 iter 334: train loss 3.49490. lr 5.979389e-04:   7%|▋         | 335/4485 [02:02<25:29,  2.71it/s][A
epoch 1 iter 335: train loss 3.53127. lr 5.979266e-04:   7%|▋         | 335/4485 [02:03<25:29,  2.71it/s][A
epoch 1 iter 335: train loss 3.53127. lr 5.979266e-04:   7%|▋         | 336/4485 [02:03<25:29,  2.71it/s][A
epoch 1 iter 336: train loss 3.48494. lr 5.979142e-04:   7%|▋         | 336/4485 [02:03<25:29,  2.71it/s][A
epoch 1 iter 336: train loss 3.48494. lr 5.979142e-04:   8%|▊         | 337/4485 [02:03<25:26,  2.72it/s][A
epoch 1 iter 337: train loss 3.27915. lr 5.979018e-04:   8%|▊         | 337/4485 [02:03<25:26,  2.72it/s][A
epoch 1 iter 337: train loss 3.27915. lr 5.979018e-04:   8%|▊         | 338/4485 [02:03<25:25,  2.72it/s][A

epoch 1 iter 371: train loss 3.29537. lr 5.974589e-04:   8%|▊         | 371/4485 [02:16<25:19,  2.71it/s][A
epoch 1 iter 371: train loss 3.29537. lr 5.974589e-04:   8%|▊         | 372/4485 [02:16<25:15,  2.71it/s][A
epoch 1 iter 372: train loss 3.34237. lr 5.974452e-04:   8%|▊         | 372/4485 [02:16<25:15,  2.71it/s][A
epoch 1 iter 372: train loss 3.34237. lr 5.974452e-04:   8%|▊         | 373/4485 [02:16<25:14,  2.72it/s][A
epoch 1 iter 373: train loss 3.30357. lr 5.974315e-04:   8%|▊         | 373/4485 [02:17<25:14,  2.72it/s][A
epoch 1 iter 373: train loss 3.30357. lr 5.974315e-04:   8%|▊         | 374/4485 [02:17<25:09,  2.72it/s][A
epoch 1 iter 374: train loss 3.24826. lr 5.974178e-04:   8%|▊         | 374/4485 [02:17<25:09,  2.72it/s][A
epoch 1 iter 374: train loss 3.24826. lr 5.974178e-04:   8%|▊         | 375/4485 [02:17<25:13,  2.72it/s][A
epoch 1 iter 375: train loss 3.37788. lr 5.974040e-04:   8%|▊         | 375/4485 [02:17<25:13,  2.72it/s][A

epoch 1 iter 408: train loss 2.98254. lr 5.969289e-04:   9%|▉         | 409/4485 [02:29<25:05,  2.71it/s][A
epoch 1 iter 409: train loss 3.18198. lr 5.969139e-04:   9%|▉         | 409/4485 [02:30<25:05,  2.71it/s][A
epoch 1 iter 409: train loss 3.18198. lr 5.969139e-04:   9%|▉         | 410/4485 [02:30<25:07,  2.70it/s][A
epoch 1 iter 410: train loss 3.01333. lr 5.968989e-04:   9%|▉         | 410/4485 [02:30<25:07,  2.70it/s][A
epoch 1 iter 410: train loss 3.01333. lr 5.968989e-04:   9%|▉         | 411/4485 [02:30<25:05,  2.71it/s][A
epoch 1 iter 411: train loss 3.12556. lr 5.968838e-04:   9%|▉         | 411/4485 [02:31<25:05,  2.71it/s][A
epoch 1 iter 411: train loss 3.12556. lr 5.968838e-04:   9%|▉         | 412/4485 [02:31<25:01,  2.71it/s][A
epoch 1 iter 412: train loss 3.03514. lr 5.968686e-04:   9%|▉         | 412/4485 [02:31<25:01,  2.71it/s][A
epoch 1 iter 412: train loss 3.03514. lr 5.968686e-04:   9%|▉         | 413/4485 [02:31<25:00,  2.71it/s][A

epoch 1 iter 446: train loss 3.05612. lr 5.963327e-04:  10%|▉         | 446/4485 [02:43<24:49,  2.71it/s][A
epoch 1 iter 446: train loss 3.05612. lr 5.963327e-04:  10%|▉         | 447/4485 [02:43<24:46,  2.72it/s][A
epoch 1 iter 447: train loss 2.96166. lr 5.963163e-04:  10%|▉         | 447/4485 [02:44<24:46,  2.72it/s][A
epoch 1 iter 447: train loss 2.96166. lr 5.963163e-04:  10%|▉         | 448/4485 [02:44<24:43,  2.72it/s][A
epoch 1 iter 448: train loss 2.91675. lr 5.962999e-04:  10%|▉         | 448/4485 [02:44<24:43,  2.72it/s][A
epoch 1 iter 448: train loss 2.91675. lr 5.962999e-04:  10%|█         | 449/4485 [02:44<24:49,  2.71it/s][A
epoch 1 iter 449: train loss 2.70171. lr 5.962834e-04:  10%|█         | 449/4485 [02:45<24:49,  2.71it/s][A
epoch 1 iter 449: train loss 2.70171. lr 5.962834e-04:  10%|█         | 450/4485 [02:45<24:51,  2.71it/s][A
epoch 1 iter 450: train loss 2.92446. lr 5.962669e-04:  10%|█         | 450/4485 [02:45<24:51,  2.71it/s][A

epoch 1 iter 483: train loss 2.84440. lr 5.957018e-04:  11%|█         | 484/4485 [02:57<24:35,  2.71it/s][A
epoch 1 iter 484: train loss 2.84339. lr 5.956840e-04:  11%|█         | 484/4485 [02:57<24:35,  2.71it/s][A
epoch 1 iter 484: train loss 2.84339. lr 5.956840e-04:  11%|█         | 485/4485 [02:57<24:35,  2.71it/s][A
epoch 1 iter 485: train loss 2.69884. lr 5.956663e-04:  11%|█         | 485/4485 [02:58<24:35,  2.71it/s][A
epoch 1 iter 485: train loss 2.69884. lr 5.956663e-04:  11%|█         | 486/4485 [02:58<24:33,  2.71it/s][A
epoch 1 iter 486: train loss 2.74850. lr 5.956484e-04:  11%|█         | 486/4485 [02:58<24:33,  2.71it/s][A
epoch 1 iter 486: train loss 2.74850. lr 5.956484e-04:  11%|█         | 487/4485 [02:58<24:29,  2.72it/s][A
epoch 1 iter 487: train loss 2.83046. lr 5.956306e-04:  11%|█         | 487/4485 [02:59<24:29,  2.72it/s][A
epoch 1 iter 487: train loss 2.83046. lr 5.956306e-04:  11%|█         | 488/4485 [02:59<24:24,  2.73it/s][A

epoch 1 iter 521: train loss 2.62740. lr 5.950021e-04:  12%|█▏        | 521/4485 [03:11<24:18,  2.72it/s][A
epoch 1 iter 521: train loss 2.62740. lr 5.950021e-04:  12%|█▏        | 522/4485 [03:11<24:15,  2.72it/s][A
epoch 1 iter 522: train loss 2.76089. lr 5.949830e-04:  12%|█▏        | 522/4485 [03:11<24:15,  2.72it/s][A
epoch 1 iter 522: train loss 2.76089. lr 5.949830e-04:  12%|█▏        | 523/4485 [03:11<24:10,  2.73it/s][A
epoch 1 iter 523: train loss 2.69215. lr 5.949638e-04:  12%|█▏        | 523/4485 [03:12<24:10,  2.73it/s][A
epoch 1 iter 523: train loss 2.69215. lr 5.949638e-04:  12%|█▏        | 524/4485 [03:12<24:20,  2.71it/s][A
epoch 1 iter 524: train loss 2.54234. lr 5.949446e-04:  12%|█▏        | 524/4485 [03:12<24:20,  2.71it/s][A
epoch 1 iter 524: train loss 2.54234. lr 5.949446e-04:  12%|█▏        | 525/4485 [03:12<24:20,  2.71it/s][A
epoch 1 iter 525: train loss 2.64921. lr 5.949254e-04:  12%|█▏        | 525/4485 [03:13<24:20,  2.71it/s][A

epoch 1 iter 2489: train loss 0.40216. lr 4.929571e-04:  56%|█████▌    | 2490/4485 [15:16<12:12,  2.72it/s][A
epoch 1 iter 2490: train loss 0.41180. lr 4.928767e-04:  56%|█████▌    | 2490/4485 [15:17<12:12,  2.72it/s][A
epoch 1 iter 2490: train loss 0.41180. lr 4.928767e-04:  56%|█████▌    | 2491/4485 [15:17<12:13,  2.72it/s][A
epoch 1 iter 2491: train loss 0.42804. lr 4.927962e-04:  56%|█████▌    | 2491/4485 [15:17<12:13,  2.72it/s][A
epoch 1 iter 2491: train loss 0.42804. lr 4.927962e-04:  56%|█████▌    | 2492/4485 [15:17<12:13,  2.72it/s][A
epoch 1 iter 2492: train loss 0.43366. lr 4.927156e-04:  56%|█████▌    | 2492/4485 [15:18<12:13,  2.72it/s][A
epoch 1 iter 2492: train loss 0.43366. lr 4.927156e-04:  56%|█████▌    | 2493/4485 [15:18<12:12,  2.72it/s][A
epoch 1 iter 2493: train loss 0.45771. lr 4.926351e-04:  56%|█████▌    | 2493/4485 [15:18<12:12,  2.72it/s][A
epoch 1 iter 2493: train loss 0.45771. lr 4.926351e-04:  56%|█████▌    | 2494/4485 [15:18<12:12,  2.72it/s][A

epoch 1 iter 2526: train loss 0.42118. lr 4.899639e-04:  56%|█████▋    | 2526/4485 [15:30<12:04,  2.70it/s][A
epoch 1 iter 2526: train loss 0.42118. lr 4.899639e-04:  56%|█████▋    | 2527/4485 [15:30<12:03,  2.71it/s][A
epoch 1 iter 2527: train loss 0.44686. lr 4.898825e-04:  56%|█████▋    | 2527/4485 [15:30<12:03,  2.71it/s][A
epoch 1 iter 2527: train loss 0.44686. lr 4.898825e-04:  56%|█████▋    | 2528/4485 [15:30<12:01,  2.71it/s][A
epoch 1 iter 2528: train loss 0.40857. lr 4.898012e-04:  56%|█████▋    | 2528/4485 [15:31<12:01,  2.71it/s][A
epoch 1 iter 2528: train loss 0.40857. lr 4.898012e-04:  56%|█████▋    | 2529/4485 [15:31<11:58,  2.72it/s][A
epoch 1 iter 2529: train loss 0.43417. lr 4.897198e-04:  56%|█████▋    | 2529/4485 [15:31<11:58,  2.72it/s][A
epoch 1 iter 2529: train loss 0.43417. lr 4.897198e-04:  56%|█████▋    | 2530/4485 [15:31<12:01,  2.71it/s][A
epoch 1 iter 2530: train loss 0.42012. lr 4.896384e-04:  56%|█████▋    | 2530/4485 [15:32<12:01,  2.71it/s][A

epoch 1 iter 2562: train loss 0.42102. lr 4.870209e-04:  57%|█████▋    | 2563/4485 [15:43<11:52,  2.70it/s][A
epoch 1 iter 2563: train loss 0.43519. lr 4.869387e-04:  57%|█████▋    | 2563/4485 [15:44<11:52,  2.70it/s][A
epoch 1 iter 2563: train loss 0.43519. lr 4.869387e-04:  57%|█████▋    | 2564/4485 [15:44<11:49,  2.71it/s][A
epoch 1 iter 2564: train loss 0.42650. lr 4.868565e-04:  57%|█████▋    | 2564/4485 [15:44<11:49,  2.71it/s][A
epoch 1 iter 2564: train loss 0.42650. lr 4.868565e-04:  57%|█████▋    | 2565/4485 [15:44<11:46,  2.72it/s][A
epoch 1 iter 2565: train loss 0.38930. lr 4.867743e-04:  57%|█████▋    | 2565/4485 [15:44<11:46,  2.72it/s][A
epoch 1 iter 2565: train loss 0.38930. lr 4.867743e-04:  57%|█████▋    | 2566/4485 [15:44<11:49,  2.71it/s][A
epoch 1 iter 2566: train loss 0.42224. lr 4.866921e-04:  57%|█████▋    | 2566/4485 [15:45<11:49,  2.71it/s][A
epoch 1 iter 2566: train loss 0.42224. lr 4.866921e-04:  57%|█████▋    | 2567/4485 [15:45<11:49,  2.70it/s][A

epoch 1 iter 2599: train loss 0.42673. lr 4.839652e-04:  58%|█████▊    | 2599/4485 [15:57<11:33,  2.72it/s][A
epoch 1 iter 2599: train loss 0.42673. lr 4.839652e-04:  58%|█████▊    | 2600/4485 [15:57<11:35,  2.71it/s][A
epoch 1 iter 2600: train loss 0.40371. lr 4.838822e-04:  58%|█████▊    | 2600/4485 [15:57<11:35,  2.71it/s][A
epoch 1 iter 2600: train loss 0.40371. lr 4.838822e-04:  58%|█████▊    | 2601/4485 [15:57<11:36,  2.70it/s][A
epoch 1 iter 2601: train loss 0.40344. lr 4.837991e-04:  58%|█████▊    | 2601/4485 [15:58<11:36,  2.70it/s][A
epoch 1 iter 2601: train loss 0.40344. lr 4.837991e-04:  58%|█████▊    | 2602/4485 [15:58<11:34,  2.71it/s][A
epoch 1 iter 2602: train loss 0.40443. lr 4.837161e-04:  58%|█████▊    | 2602/4485 [15:58<11:34,  2.71it/s][A
epoch 1 iter 2602: train loss 0.40443. lr 4.837161e-04:  58%|█████▊    | 2603/4485 [15:58<11:32,  2.72it/s][A
epoch 1 iter 2603: train loss 0.39098. lr 4.836330e-04:  58%|█████▊    | 2603/4485 [15:58<11:32,  2.72it/s][A

epoch 1 iter 2805: train loss 0.37514. lr 4.664019e-04:  63%|██████▎   | 2806/4485 [17:13<10:16,  2.72it/s][A
epoch 1 iter 2806: train loss 0.39260. lr 4.663144e-04:  63%|██████▎   | 2806/4485 [17:13<10:16,  2.72it/s][A
epoch 1 iter 2806: train loss 0.39260. lr 4.663144e-04:  63%|██████▎   | 2807/4485 [17:13<10:16,  2.72it/s][A
epoch 1 iter 2807: train loss 0.40113. lr 4.662269e-04:  63%|██████▎   | 2807/4485 [17:14<10:16,  2.72it/s][A
epoch 1 iter 2807: train loss 0.40113. lr 4.662269e-04:  63%|██████▎   | 2808/4485 [17:14<10:17,  2.72it/s][A
epoch 1 iter 2808: train loss 0.41673. lr 4.661395e-04:  63%|██████▎   | 2808/4485 [17:14<10:17,  2.72it/s][A
epoch 1 iter 2808: train loss 0.41673. lr 4.661395e-04:  63%|██████▎   | 2809/4485 [17:14<10:19,  2.70it/s][A
epoch 1 iter 2809: train loss 0.40994. lr 4.660519e-04:  63%|██████▎   | 2809/4485 [17:14<10:19,  2.70it/s][A
epoch 1 iter 2809: train loss 0.40994. lr 4.660519e-04:  63%|██████▎   | 2810/4485 [17:14<10:20,  2.70it/s][A

epoch 1 iter 2842: train loss 0.39192. lr 4.631528e-04:  63%|██████▎   | 2842/4485 [17:27<10:05,  2.71it/s][A
epoch 1 iter 2842: train loss 0.39192. lr 4.631528e-04:  63%|██████▎   | 2843/4485 [17:27<10:05,  2.71it/s][A
epoch 1 iter 2843: train loss 0.36952. lr 4.630646e-04:  63%|██████▎   | 2843/4485 [17:27<10:05,  2.71it/s][A
epoch 1 iter 2843: train loss 0.36952. lr 4.630646e-04:  63%|██████▎   | 2844/4485 [17:27<10:04,  2.72it/s][A
epoch 1 iter 2844: train loss 0.38458. lr 4.629764e-04:  63%|██████▎   | 2844/4485 [17:27<10:04,  2.72it/s][A
epoch 1 iter 2844: train loss 0.38458. lr 4.629764e-04:  63%|██████▎   | 2845/4485 [17:27<10:04,  2.71it/s][A
epoch 1 iter 2845: train loss 0.37705. lr 4.628882e-04:  63%|██████▎   | 2845/4485 [17:28<10:04,  2.71it/s][A
epoch 1 iter 2845: train loss 0.37705. lr 4.628882e-04:  63%|██████▎   | 2846/4485 [17:28<10:04,  2.71it/s][A
epoch 1 iter 2846: train loss 0.38044. lr 4.627999e-04:  63%|██████▎   | 2846/4485 [17:28<10:04,  2.71it/s][A

epoch 1 iter 2878: train loss 0.37689. lr 4.599653e-04:  64%|██████▍   | 2879/4485 [17:40<09:53,  2.71it/s][A
epoch 1 iter 2879: train loss 0.38721. lr 4.598764e-04:  64%|██████▍   | 2879/4485 [17:40<09:53,  2.71it/s][A
epoch 1 iter 2879: train loss 0.38721. lr 4.598764e-04:  64%|██████▍   | 2880/4485 [17:40<09:54,  2.70it/s][A
epoch 1 iter 2880: train loss 0.38951. lr 4.597874e-04:  64%|██████▍   | 2880/4485 [17:41<09:54,  2.70it/s][A
epoch 1 iter 2880: train loss 0.38951. lr 4.597874e-04:  64%|██████▍   | 2881/4485 [17:41<09:53,  2.70it/s][A
epoch 1 iter 2881: train loss 0.37814. lr 4.596985e-04:  64%|██████▍   | 2881/4485 [17:41<09:53,  2.70it/s][A
epoch 1 iter 2881: train loss 0.37814. lr 4.596985e-04:  64%|██████▍   | 2882/4485 [17:41<09:51,  2.71it/s][A
epoch 1 iter 2882: train loss 0.36154. lr 4.596095e-04:  64%|██████▍   | 2882/4485 [17:41<09:51,  2.71it/s][A
epoch 1 iter 2882: train loss 0.36154. lr 4.596095e-04:  64%|██████▍   | 2883/4485 [17:41<09:50,  2.71it/s][A

epoch 1 iter 2915: train loss 0.37374. lr 4.566627e-04:  65%|██████▍   | 2915/4485 [17:53<09:38,  2.72it/s][A
epoch 1 iter 2915: train loss 0.37374. lr 4.566627e-04:  65%|██████▌   | 2916/4485 [17:53<09:37,  2.72it/s][A
epoch 1 iter 2916: train loss 0.36357. lr 4.565731e-04:  65%|██████▌   | 2916/4485 [17:54<09:37,  2.72it/s][A
epoch 1 iter 2916: train loss 0.36357. lr 4.565731e-04:  65%|██████▌   | 2917/4485 [17:54<09:35,  2.72it/s][A
epoch 1 iter 2917: train loss 0.36736. lr 4.564834e-04:  65%|██████▌   | 2917/4485 [17:54<09:35,  2.72it/s][A
epoch 1 iter 2917: train loss 0.36736. lr 4.564834e-04:  65%|██████▌   | 2918/4485 [17:54<09:35,  2.72it/s][A
epoch 1 iter 2918: train loss 0.38544. lr 4.563938e-04:  65%|██████▌   | 2918/4485 [17:55<09:35,  2.72it/s][A
epoch 1 iter 2918: train loss 0.38544. lr 4.563938e-04:  65%|██████▌   | 2919/4485 [17:55<09:36,  2.72it/s][A
epoch 1 iter 2919: train loss 0.36924. lr 4.563041e-04:  65%|██████▌   | 2919/4485 [17:55<09:36,  2.72it/s][A

epoch 1 iter 2951: train loss 0.39577. lr 4.534241e-04:  66%|██████▌   | 2952/4485 [18:07<09:24,  2.72it/s][A
epoch 1 iter 2952: train loss 0.36512. lr 4.533338e-04:  66%|██████▌   | 2952/4485 [18:07<09:24,  2.72it/s][A
epoch 1 iter 2952: train loss 0.36512. lr 4.533338e-04:  66%|██████▌   | 2953/4485 [18:07<09:20,  2.73it/s][A
epoch 1 iter 2953: train loss 0.36568. lr 4.532435e-04:  66%|██████▌   | 2953/4485 [18:07<09:20,  2.73it/s][A
epoch 1 iter 2953: train loss 0.36568. lr 4.532435e-04:  66%|██████▌   | 2954/4485 [18:07<09:22,  2.72it/s][A
epoch 1 iter 2954: train loss 0.35876. lr 4.531531e-04:  66%|██████▌   | 2954/4485 [18:08<09:22,  2.72it/s][A
epoch 1 iter 2954: train loss 0.35876. lr 4.531531e-04:  66%|██████▌   | 2955/4485 [18:08<09:23,  2.72it/s][A
epoch 1 iter 2955: train loss 0.37720. lr 4.530627e-04:  66%|██████▌   | 2955/4485 [18:08<09:23,  2.72it/s][A
epoch 1 iter 2955: train loss 0.37720. lr 4.530627e-04:  66%|██████▌   | 2956/4485 [18:08<09:22,  2.72it/s][A

epoch 1 iter 2988: train loss 0.36147. lr 4.500701e-04:  67%|██████▋   | 2988/4485 [18:20<09:10,  2.72it/s][A
epoch 1 iter 2988: train loss 0.36147. lr 4.500701e-04:  67%|██████▋   | 2989/4485 [18:20<09:11,  2.71it/s][A
epoch 1 iter 2989: train loss 0.39384. lr 4.499791e-04:  67%|██████▋   | 2989/4485 [18:21<09:11,  2.71it/s][A
epoch 1 iter 2989: train loss 0.39384. lr 4.499791e-04:  67%|██████▋   | 2990/4485 [18:21<09:11,  2.71it/s][A
epoch 1 iter 2990: train loss 0.37275. lr 4.498881e-04:  67%|██████▋   | 2990/4485 [18:21<09:11,  2.71it/s][A
epoch 1 iter 2990: train loss 0.37275. lr 4.498881e-04:  67%|██████▋   | 2991/4485 [18:21<09:10,  2.71it/s][A
epoch 1 iter 2991: train loss 0.35361. lr 4.497971e-04:  67%|██████▋   | 2991/4485 [18:21<09:10,  2.71it/s][A
epoch 1 iter 2991: train loss 0.35361. lr 4.497971e-04:  67%|██████▋   | 2992/4485 [18:21<09:10,  2.71it/s][A
epoch 1 iter 2992: train loss 0.39270. lr 4.497060e-04:  67%|██████▋   | 2992/4485 [18:22<09:10,  2.71it/s][A

epoch 1 iter 3024: train loss 0.36353. lr 4.467826e-04:  67%|██████▋   | 3025/4485 [18:34<08:58,  2.71it/s][A
epoch 1 iter 3025: train loss 0.33684. lr 4.466910e-04:  67%|██████▋   | 3025/4485 [18:34<08:58,  2.71it/s][A
epoch 1 iter 3025: train loss 0.33684. lr 4.466910e-04:  67%|██████▋   | 3026/4485 [18:34<08:57,  2.72it/s][A
epoch 1 iter 3026: train loss 0.35327. lr 4.465993e-04:  67%|██████▋   | 3026/4485 [18:34<08:57,  2.72it/s][A
epoch 1 iter 3026: train loss 0.35327. lr 4.465993e-04:  67%|██████▋   | 3027/4485 [18:34<08:58,  2.71it/s][A
epoch 1 iter 3027: train loss 0.36495. lr 4.465076e-04:  67%|██████▋   | 3027/4485 [18:35<08:58,  2.71it/s][A
epoch 1 iter 3027: train loss 0.36495. lr 4.465076e-04:  68%|██████▊   | 3028/4485 [18:35<08:56,  2.72it/s][A
epoch 1 iter 3028: train loss 0.34908. lr 4.464159e-04:  68%|██████▊   | 3028/4485 [18:35<08:56,  2.72it/s][A
epoch 1 iter 3028: train loss 0.34908. lr 4.464159e-04:  68%|██████▊   | 3029/4485 [18:35<08:57,  2.71it/s][A

epoch 1 iter 3061: train loss 0.34597. lr 4.433795e-04:  68%|██████▊   | 3061/4485 [18:47<08:52,  2.67it/s][A
epoch 1 iter 3061: train loss 0.34597. lr 4.433795e-04:  68%|██████▊   | 3062/4485 [18:47<08:51,  2.68it/s][A
epoch 1 iter 3062: train loss 0.37174. lr 4.432872e-04:  68%|██████▊   | 3062/4485 [18:48<08:51,  2.68it/s][A
epoch 1 iter 3062: train loss 0.37174. lr 4.432872e-04:  68%|██████▊   | 3063/4485 [18:48<08:49,  2.68it/s][A
epoch 1 iter 3063: train loss 0.34714. lr 4.431948e-04:  68%|██████▊   | 3063/4485 [18:48<08:49,  2.68it/s][A
epoch 1 iter 3063: train loss 0.34714. lr 4.431948e-04:  68%|██████▊   | 3064/4485 [18:48<08:46,  2.70it/s][A
epoch 1 iter 3064: train loss 0.36057. lr 4.431025e-04:  68%|██████▊   | 3064/4485 [18:48<08:46,  2.70it/s][A
epoch 1 iter 3064: train loss 0.36057. lr 4.431025e-04:  68%|██████▊   | 3065/4485 [18:48<08:48,  2.69it/s][A
epoch 1 iter 3065: train loss 0.37953. lr 4.430101e-04:  68%|██████▊   | 3065/4485 [18:49<08:48,  2.69it/s][A

epoch 1 iter 3097: train loss 0.37688. lr 4.400452e-04:  69%|██████▉   | 3098/4485 [19:01<08:32,  2.71it/s][A
epoch 1 iter 3098: train loss 0.33233. lr 4.399522e-04:  69%|██████▉   | 3098/4485 [19:01<08:32,  2.71it/s][A
epoch 1 iter 3098: train loss 0.33233. lr 4.399522e-04:  69%|██████▉   | 3099/4485 [19:01<08:31,  2.71it/s][A
epoch 1 iter 3099: train loss 0.38129. lr 4.398593e-04:  69%|██████▉   | 3099/4485 [19:01<08:31,  2.71it/s][A
epoch 1 iter 3099: train loss 0.38129. lr 4.398593e-04:  69%|██████▉   | 3100/4485 [19:01<08:30,  2.71it/s][A
epoch 1 iter 3100: train loss 0.36055. lr 4.397663e-04:  69%|██████▉   | 3100/4485 [19:02<08:30,  2.71it/s][A
epoch 1 iter 3100: train loss 0.36055. lr 4.397663e-04:  69%|██████▉   | 3101/4485 [19:02<08:31,  2.71it/s][A
epoch 1 iter 3101: train loss 0.34761. lr 4.396733e-04:  69%|██████▉   | 3101/4485 [19:02<08:31,  2.71it/s][A
epoch 1 iter 3101: train loss 0.34761. lr 4.396733e-04:  69%|██████▉   | 3102/4485 [19:02<08:31,  2.70it/s][A

epoch 1 iter 3134: train loss 0.37659. lr 4.365951e-04:  70%|██████▉   | 3134/4485 [19:14<08:21,  2.69it/s][A
epoch 1 iter 3134: train loss 0.37659. lr 4.365951e-04:  70%|██████▉   | 3135/4485 [19:14<08:19,  2.70it/s][A
epoch 1 iter 3135: train loss 0.36773. lr 4.365015e-04:  70%|██████▉   | 3135/4485 [19:15<08:19,  2.70it/s][A
epoch 1 iter 3135: train loss 0.36773. lr 4.365015e-04:  70%|██████▉   | 3136/4485 [19:15<08:18,  2.71it/s][A
epoch 1 iter 3136: train loss 0.34824. lr 4.364079e-04:  70%|██████▉   | 3136/4485 [19:15<08:18,  2.71it/s][A
epoch 1 iter 3136: train loss 0.34824. lr 4.364079e-04:  70%|██████▉   | 3137/4485 [19:15<08:16,  2.71it/s][A
epoch 1 iter 3137: train loss 0.35546. lr 4.363143e-04:  70%|██████▉   | 3137/4485 [19:15<08:16,  2.71it/s][A
epoch 1 iter 3137: train loss 0.35546. lr 4.363143e-04:  70%|██████▉   | 3138/4485 [19:15<08:17,  2.71it/s][A
epoch 1 iter 3138: train loss 0.35360. lr 4.362207e-04:  70%|██████▉   | 3138/4485 [19:16<08:17,  2.71it/s][A

epoch 2 iter 460: train loss 0.22869. lr 2.517748e-04:  10%|█         | 461/4485 [02:49<24:42,  2.71it/s][A
epoch 2 iter 461: train loss 0.24074. lr 2.516711e-04:  10%|█         | 461/4485 [02:50<24:42,  2.71it/s][A
epoch 2 iter 461: train loss 0.24074. lr 2.516711e-04:  10%|█         | 462/4485 [02:50<24:45,  2.71it/s][A
epoch 2 iter 462: train loss 0.25512. lr 2.515674e-04:  10%|█         | 462/4485 [02:50<24:45,  2.71it/s][A
epoch 2 iter 462: train loss 0.25512. lr 2.515674e-04:  10%|█         | 463/4485 [02:50<24:43,  2.71it/s][A
epoch 2 iter 463: train loss 0.24129. lr 2.514637e-04:  10%|█         | 463/4485 [02:50<24:43,  2.71it/s][A
epoch 2 iter 463: train loss 0.24129. lr 2.514637e-04:  10%|█         | 464/4485 [02:50<24:39,  2.72it/s][A
epoch 2 iter 464: train loss 0.23496. lr 2.513600e-04:  10%|█         | 464/4485 [02:51<24:39,  2.72it/s][A
epoch 2 iter 464: train loss 0.23496. lr 2.513600e-04:  10%|█         | 465/4485 [02:51<24:39,  2.72it/s][A

epoch 2 iter 498: train loss 0.23000. lr 2.478380e-04:  11%|█         | 498/4485 [03:03<24:32,  2.71it/s][A
epoch 2 iter 498: train loss 0.23000. lr 2.478380e-04:  11%|█         | 499/4485 [03:03<24:27,  2.72it/s][A
epoch 2 iter 499: train loss 0.24019. lr 2.477345e-04:  11%|█         | 499/4485 [03:04<24:27,  2.72it/s][A
epoch 2 iter 499: train loss 0.24019. lr 2.477345e-04:  11%|█         | 500/4485 [03:04<24:25,  2.72it/s][A
epoch 2 iter 500: train loss 0.23080. lr 2.476310e-04:  11%|█         | 500/4485 [03:04<24:25,  2.72it/s][A
epoch 2 iter 500: train loss 0.23080. lr 2.476310e-04:  11%|█         | 501/4485 [03:04<24:28,  2.71it/s][A
epoch 2 iter 501: train loss 0.22753. lr 2.475276e-04:  11%|█         | 501/4485 [03:04<24:28,  2.71it/s][A
epoch 2 iter 501: train loss 0.22753. lr 2.475276e-04:  11%|█         | 502/4485 [03:04<24:32,  2.71it/s][A
epoch 2 iter 502: train loss 0.23350. lr 2.474241e-04:  11%|█         | 502/4485 [03:05<24:32,  2.71it/s][A

epoch 2 iter 535: train loss 0.22727. lr 2.440136e-04:  12%|█▏        | 536/4485 [03:17<24:12,  2.72it/s][A
epoch 2 iter 536: train loss 0.22638. lr 2.439104e-04:  12%|█▏        | 536/4485 [03:17<24:12,  2.72it/s][A
epoch 2 iter 536: train loss 0.22638. lr 2.439104e-04:  12%|█▏        | 537/4485 [03:17<24:15,  2.71it/s][A
epoch 2 iter 537: train loss 0.22994. lr 2.438072e-04:  12%|█▏        | 537/4485 [03:18<24:15,  2.71it/s][A
epoch 2 iter 537: train loss 0.22994. lr 2.438072e-04:  12%|█▏        | 538/4485 [03:18<24:12,  2.72it/s][A
epoch 2 iter 538: train loss 0.22417. lr 2.437039e-04:  12%|█▏        | 538/4485 [03:18<24:12,  2.72it/s][A
epoch 2 iter 538: train loss 0.22417. lr 2.437039e-04:  12%|█▏        | 539/4485 [03:18<24:08,  2.72it/s][A
epoch 2 iter 539: train loss 0.22574. lr 2.436007e-04:  12%|█▏        | 539/4485 [03:18<24:08,  2.72it/s][A
epoch 2 iter 539: train loss 0.22574. lr 2.436007e-04:  12%|█▏        | 540/4485 [03:18<24:06,  2.73it/s][A

epoch 2 iter 573: train loss 0.22585. lr 2.400957e-04:  13%|█▎        | 573/4485 [03:31<24:00,  2.72it/s][A
epoch 2 iter 573: train loss 0.22585. lr 2.400957e-04:  13%|█▎        | 574/4485 [03:31<23:59,  2.72it/s][A
epoch 2 iter 574: train loss 0.24067. lr 2.399927e-04:  13%|█▎        | 574/4485 [03:31<23:59,  2.72it/s][A
epoch 2 iter 574: train loss 0.24067. lr 2.399927e-04:  13%|█▎        | 575/4485 [03:31<24:00,  2.71it/s][A
epoch 2 iter 575: train loss 0.21968. lr 2.398897e-04:  13%|█▎        | 575/4485 [03:32<24:00,  2.71it/s][A
epoch 2 iter 575: train loss 0.21968. lr 2.398897e-04:  13%|█▎        | 576/4485 [03:32<24:03,  2.71it/s][A
epoch 2 iter 576: train loss 0.22620. lr 2.397868e-04:  13%|█▎        | 576/4485 [03:32<24:03,  2.71it/s][A
epoch 2 iter 576: train loss 0.22620. lr 2.397868e-04:  13%|█▎        | 577/4485 [03:32<24:02,  2.71it/s][A
epoch 2 iter 577: train loss 0.22270. lr 2.396839e-04:  13%|█▎        | 577/4485 [03:32<24:02,  2.71it/s][A

epoch 2 iter 780: train loss 0.22454. lr 2.189575e-04:  17%|█▋        | 780/4485 [04:47<22:46,  2.71it/s][A
epoch 2 iter 780: train loss 0.22454. lr 2.189575e-04:  17%|█▋        | 781/4485 [04:47<22:42,  2.72it/s][A
epoch 2 iter 781: train loss 0.22184. lr 2.188563e-04:  17%|█▋        | 781/4485 [04:47<22:42,  2.72it/s][A
epoch 2 iter 781: train loss 0.22184. lr 2.188563e-04:  17%|█▋        | 782/4485 [04:47<22:41,  2.72it/s][A
epoch 2 iter 782: train loss 0.22544. lr 2.187552e-04:  17%|█▋        | 782/4485 [04:48<22:41,  2.72it/s][A
epoch 2 iter 782: train loss 0.22544. lr 2.187552e-04:  17%|█▋        | 783/4485 [04:48<22:43,  2.71it/s][A
epoch 2 iter 783: train loss 0.22448. lr 2.186540e-04:  17%|█▋        | 783/4485 [04:48<22:43,  2.71it/s][A
epoch 2 iter 783: train loss 0.22448. lr 2.186540e-04:  17%|█▋        | 784/4485 [04:48<22:46,  2.71it/s][A
epoch 2 iter 784: train loss 0.20229. lr 2.185529e-04:  17%|█▋        | 784/4485 [04:49<22:46,  2.71it/s][A

epoch 2 iter 817: train loss 0.20209. lr 2.152209e-04:  18%|█▊        | 818/4485 [05:01<22:29,  2.72it/s][A
epoch 2 iter 818: train loss 0.23533. lr 2.151201e-04:  18%|█▊        | 818/4485 [05:01<22:29,  2.72it/s][A
epoch 2 iter 818: train loss 0.23533. lr 2.151201e-04:  18%|█▊        | 819/4485 [05:01<22:31,  2.71it/s][A
epoch 2 iter 819: train loss 0.23755. lr 2.150193e-04:  18%|█▊        | 819/4485 [05:01<22:31,  2.71it/s][A
epoch 2 iter 819: train loss 0.23755. lr 2.150193e-04:  18%|█▊        | 820/4485 [05:01<22:30,  2.71it/s][A
epoch 2 iter 820: train loss 0.21161. lr 2.149185e-04:  18%|█▊        | 820/4485 [05:02<22:30,  2.71it/s][A
epoch 2 iter 820: train loss 0.21161. lr 2.149185e-04:  18%|█▊        | 821/4485 [05:02<22:28,  2.72it/s][A
epoch 2 iter 821: train loss 0.21820. lr 2.148178e-04:  18%|█▊        | 821/4485 [05:02<22:28,  2.72it/s][A
epoch 2 iter 821: train loss 0.21820. lr 2.148178e-04:  18%|█▊        | 822/4485 [05:02<22:21,  2.73it/s][A

epoch 2 iter 855: train loss 0.22933. lr 2.113981e-04:  19%|█▉        | 855/4485 [05:15<22:19,  2.71it/s][A
epoch 2 iter 855: train loss 0.22933. lr 2.113981e-04:  19%|█▉        | 856/4485 [05:15<22:17,  2.71it/s][A
epoch 2 iter 856: train loss 0.22429. lr 2.112977e-04:  19%|█▉        | 856/4485 [05:15<22:17,  2.71it/s][A
epoch 2 iter 856: train loss 0.22429. lr 2.112977e-04:  19%|█▉        | 857/4485 [05:15<22:20,  2.71it/s][A
epoch 2 iter 857: train loss 0.22514. lr 2.111973e-04:  19%|█▉        | 857/4485 [05:15<22:20,  2.71it/s][A
epoch 2 iter 857: train loss 0.22514. lr 2.111973e-04:  19%|█▉        | 858/4485 [05:15<22:21,  2.70it/s][A
epoch 2 iter 858: train loss 0.23212. lr 2.110970e-04:  19%|█▉        | 858/4485 [05:16<22:21,  2.70it/s][A
epoch 2 iter 858: train loss 0.23212. lr 2.110970e-04:  19%|█▉        | 859/4485 [05:16<22:16,  2.71it/s][A
epoch 2 iter 859: train loss 0.21176. lr 2.109966e-04:  19%|█▉        | 859/4485 [05:16<22:16,  2.71it/s][A

epoch 2 iter 892: train loss 0.23056. lr 2.076910e-04:  20%|█▉        | 893/4485 [05:28<21:58,  2.72it/s][A
epoch 2 iter 893: train loss 0.24133. lr 2.075910e-04:  20%|█▉        | 893/4485 [05:29<21:58,  2.72it/s][A
epoch 2 iter 893: train loss 0.24133. lr 2.075910e-04:  20%|█▉        | 894/4485 [05:29<21:58,  2.72it/s][A
epoch 2 iter 894: train loss 0.21214. lr 2.074911e-04:  20%|█▉        | 894/4485 [05:29<21:58,  2.72it/s][A
epoch 2 iter 894: train loss 0.21214. lr 2.074911e-04:  20%|█▉        | 895/4485 [05:29<22:00,  2.72it/s][A
epoch 2 iter 895: train loss 0.21218. lr 2.073911e-04:  20%|█▉        | 895/4485 [05:29<22:00,  2.72it/s][A
epoch 2 iter 895: train loss 0.21218. lr 2.073911e-04:  20%|█▉        | 896/4485 [05:29<22:03,  2.71it/s][A
epoch 2 iter 896: train loss 0.21542. lr 2.072912e-04:  20%|█▉        | 896/4485 [05:30<22:03,  2.71it/s][A
epoch 2 iter 896: train loss 0.21542. lr 2.072912e-04:  20%|██        | 897/4485 [05:30<22:03,  2.71it/s][A

epoch 2 iter 930: train loss 0.22936. lr 2.038999e-04:  21%|██        | 930/4485 [05:42<21:48,  2.72it/s][A
epoch 2 iter 930: train loss 0.22936. lr 2.038999e-04:  21%|██        | 931/4485 [05:42<21:51,  2.71it/s][A
epoch 2 iter 931: train loss 0.22800. lr 2.038003e-04:  21%|██        | 931/4485 [05:43<21:51,  2.71it/s][A
epoch 2 iter 931: train loss 0.22800. lr 2.038003e-04:  21%|██        | 932/4485 [05:43<21:49,  2.71it/s][A
epoch 2 iter 932: train loss 0.22247. lr 2.037008e-04:  21%|██        | 932/4485 [05:43<21:49,  2.71it/s][A
epoch 2 iter 932: train loss 0.22247. lr 2.037008e-04:  21%|██        | 933/4485 [05:43<21:45,  2.72it/s][A
epoch 2 iter 933: train loss 0.21640. lr 2.036013e-04:  21%|██        | 933/4485 [05:43<21:45,  2.72it/s][A
epoch 2 iter 933: train loss 0.21640. lr 2.036013e-04:  21%|██        | 934/4485 [05:43<21:43,  2.72it/s][A
epoch 2 iter 934: train loss 0.21270. lr 2.035018e-04:  21%|██        | 934/4485 [05:44<21:43,  2.72it/s][A

epoch 2 iter 967: train loss 0.21719. lr 2.002248e-04:  22%|██▏       | 968/4485 [05:56<21:39,  2.71it/s][A
epoch 2 iter 968: train loss 0.22253. lr 2.001257e-04:  22%|██▏       | 968/4485 [05:56<21:39,  2.71it/s][A
epoch 2 iter 968: train loss 0.22253. lr 2.001257e-04:  22%|██▏       | 969/4485 [05:56<21:39,  2.71it/s][A
epoch 2 iter 969: train loss 0.21330. lr 2.000266e-04:  22%|██▏       | 969/4485 [05:57<21:39,  2.71it/s][A
epoch 2 iter 969: train loss 0.21330. lr 2.000266e-04:  22%|██▏       | 970/4485 [05:57<21:39,  2.70it/s][A
epoch 2 iter 970: train loss 0.21803. lr 1.999276e-04:  22%|██▏       | 970/4485 [05:57<21:39,  2.70it/s][A
epoch 2 iter 970: train loss 0.21803. lr 1.999276e-04:  22%|██▏       | 971/4485 [05:57<21:40,  2.70it/s][A
epoch 2 iter 971: train loss 0.22327. lr 1.998285e-04:  22%|██▏       | 971/4485 [05:57<21:40,  2.70it/s][A
epoch 2 iter 971: train loss 0.22327. lr 1.998285e-04:  22%|██▏       | 972/4485 [05:57<21:36,  2.71it/s][A

epoch 2 iter 1004: train loss 0.20610. lr 1.965665e-04:  22%|██▏       | 1005/4485 [06:10<21:23,  2.71it/s][A
epoch 2 iter 1005: train loss 0.21971. lr 1.964679e-04:  22%|██▏       | 1005/4485 [06:10<21:23,  2.71it/s][A
epoch 2 iter 1005: train loss 0.21971. lr 1.964679e-04:  22%|██▏       | 1006/4485 [06:10<21:22,  2.71it/s][A
epoch 2 iter 1006: train loss 0.22461. lr 1.963693e-04:  22%|██▏       | 1006/4485 [06:10<21:22,  2.71it/s][A
epoch 2 iter 1006: train loss 0.22461. lr 1.963693e-04:  22%|██▏       | 1007/4485 [06:10<21:18,  2.72it/s][A
epoch 2 iter 1007: train loss 0.23792. lr 1.962707e-04:  22%|██▏       | 1007/4485 [06:11<21:18,  2.72it/s][A
epoch 2 iter 1007: train loss 0.23792. lr 1.962707e-04:  22%|██▏       | 1008/4485 [06:11<21:16,  2.72it/s][A
epoch 2 iter 1008: train loss 0.23122. lr 1.961721e-04:  22%|██▏       | 1008/4485 [06:11<21:16,  2.72it/s][A
epoch 2 iter 1008: train loss 0.23122. lr 1.961721e-04:  22%|██▏       | 1009/4485 [06:11<21:19,  2.72it/s][A

epoch 2 iter 1700: train loss 0.19013. lr 1.316496e-04:  38%|███▊      | 1701/4485 [10:26<17:03,  2.72it/s][A
epoch 2 iter 1701: train loss 0.19588. lr 1.315626e-04:  38%|███▊      | 1701/4485 [10:26<17:03,  2.72it/s][A
epoch 2 iter 1701: train loss 0.19588. lr 1.315626e-04:  38%|███▊      | 1702/4485 [10:26<17:04,  2.72it/s][A
epoch 2 iter 1702: train loss 0.20663. lr 1.314757e-04:  38%|███▊      | 1702/4485 [10:27<17:04,  2.72it/s][A
epoch 2 iter 1702: train loss 0.20663. lr 1.314757e-04:  38%|███▊      | 1703/4485 [10:27<17:06,  2.71it/s][A
epoch 2 iter 1703: train loss 0.18762. lr 1.313888e-04:  38%|███▊      | 1703/4485 [10:27<17:06,  2.71it/s][A
epoch 2 iter 1703: train loss 0.18762. lr 1.313888e-04:  38%|███▊      | 1704/4485 [10:27<17:03,  2.72it/s][A
epoch 2 iter 1704: train loss 0.19520. lr 1.313019e-04:  38%|███▊      | 1704/4485 [10:27<17:03,  2.72it/s][A
epoch 2 iter 1704: train loss 0.19520. lr 1.313019e-04:  38%|███▊      | 1705/4485 [10:27<17:01,  2.72it/s][A

epoch 2 iter 1737: train loss 0.18530. lr 1.284456e-04:  39%|███▊      | 1737/4485 [10:39<16:51,  2.72it/s][A
epoch 2 iter 1737: train loss 0.18530. lr 1.284456e-04:  39%|███▉      | 1738/4485 [10:39<16:52,  2.71it/s][A
epoch 2 iter 1738: train loss 0.20493. lr 1.283595e-04:  39%|███▉      | 1738/4485 [10:40<16:52,  2.71it/s][A
epoch 2 iter 1738: train loss 0.20493. lr 1.283595e-04:  39%|███▉      | 1739/4485 [10:40<16:49,  2.72it/s][A
epoch 2 iter 1739: train loss 0.19865. lr 1.282733e-04:  39%|███▉      | 1739/4485 [10:40<16:49,  2.72it/s][A
epoch 2 iter 1739: train loss 0.19865. lr 1.282733e-04:  39%|███▉      | 1740/4485 [10:40<16:49,  2.72it/s][A
epoch 2 iter 1740: train loss 0.20255. lr 1.281871e-04:  39%|███▉      | 1740/4485 [10:40<16:49,  2.72it/s][A
epoch 2 iter 1740: train loss 0.20255. lr 1.281871e-04:  39%|███▉      | 1741/4485 [10:41<16:50,  2.71it/s][A
epoch 2 iter 1741: train loss 0.20526. lr 1.281010e-04:  39%|███▉      | 1741/4485 [10:41<16:50,  2.71it/s][A

epoch 2 iter 1773: train loss 0.20044. lr 1.253559e-04:  40%|███▉      | 1774/4485 [10:53<16:38,  2.71it/s][A
epoch 2 iter 1774: train loss 0.19886. lr 1.252705e-04:  40%|███▉      | 1774/4485 [10:53<16:38,  2.71it/s][A
epoch 2 iter 1774: train loss 0.19886. lr 1.252705e-04:  40%|███▉      | 1775/4485 [10:53<16:38,  2.72it/s][A
epoch 2 iter 1775: train loss 0.20227. lr 1.251851e-04:  40%|███▉      | 1775/4485 [10:53<16:38,  2.72it/s][A
epoch 2 iter 1775: train loss 0.20227. lr 1.251851e-04:  40%|███▉      | 1776/4485 [10:53<16:35,  2.72it/s][A
epoch 2 iter 1776: train loss 0.20969. lr 1.250997e-04:  40%|███▉      | 1776/4485 [10:54<16:35,  2.72it/s][A
epoch 2 iter 1776: train loss 0.20969. lr 1.250997e-04:  40%|███▉      | 1777/4485 [10:54<16:36,  2.72it/s][A
epoch 2 iter 1777: train loss 0.18782. lr 1.250143e-04:  40%|███▉      | 1777/4485 [10:54<16:36,  2.72it/s][A
epoch 2 iter 1777: train loss 0.18782. lr 1.250143e-04:  40%|███▉      | 1778/4485 [10:54<16:37,  2.71it/s][A

epoch 2 iter 1810: train loss 0.19485. lr 1.222094e-04:  40%|████      | 1810/4485 [11:06<16:25,  2.72it/s][A
epoch 2 iter 1810: train loss 0.19485. lr 1.222094e-04:  40%|████      | 1811/4485 [11:06<16:24,  2.72it/s][A
epoch 2 iter 1811: train loss 0.20819. lr 1.221247e-04:  40%|████      | 1811/4485 [11:07<16:24,  2.72it/s][A
epoch 2 iter 1811: train loss 0.20819. lr 1.221247e-04:  40%|████      | 1812/4485 [11:07<16:25,  2.71it/s][A
epoch 2 iter 1812: train loss 0.20857. lr 1.220401e-04:  40%|████      | 1812/4485 [11:07<16:25,  2.71it/s][A
epoch 2 iter 1812: train loss 0.20857. lr 1.220401e-04:  40%|████      | 1813/4485 [11:07<16:25,  2.71it/s][A
epoch 2 iter 1813: train loss 0.19557. lr 1.219555e-04:  40%|████      | 1813/4485 [11:07<16:25,  2.71it/s][A
epoch 2 iter 1813: train loss 0.19557. lr 1.219555e-04:  40%|████      | 1814/4485 [11:07<16:23,  2.72it/s][A
epoch 2 iter 1814: train loss 0.18990. lr 1.218710e-04:  40%|████      | 1814/4485 [11:08<16:23,  2.72it/s][A

epoch 2 iter 1846: train loss 0.21083. lr 1.191765e-04:  41%|████      | 1847/4485 [11:20<16:15,  2.70it/s][A
epoch 2 iter 1847: train loss 0.18992. lr 1.190926e-04:  41%|████      | 1847/4485 [11:20<16:15,  2.70it/s][A
epoch 2 iter 1847: train loss 0.18992. lr 1.190926e-04:  41%|████      | 1848/4485 [11:20<16:13,  2.71it/s][A
epoch 2 iter 1848: train loss 0.18585. lr 1.190088e-04:  41%|████      | 1848/4485 [11:20<16:13,  2.71it/s][A
epoch 2 iter 1848: train loss 0.18585. lr 1.190088e-04:  41%|████      | 1849/4485 [11:20<16:10,  2.71it/s][A
epoch 2 iter 1849: train loss 0.20781. lr 1.189250e-04:  41%|████      | 1849/4485 [11:21<16:10,  2.71it/s][A
epoch 2 iter 1849: train loss 0.20781. lr 1.189250e-04:  41%|████      | 1850/4485 [11:21<16:05,  2.73it/s][A
epoch 2 iter 1850: train loss 0.20245. lr 1.188412e-04:  41%|████      | 1850/4485 [11:21<16:05,  2.73it/s][A
epoch 2 iter 1850: train loss 0.20245. lr 1.188412e-04:  41%|████▏     | 1851/4485 [11:21<16:09,  2.72it/s][A

epoch 2 iter 1883: train loss 0.18948. lr 1.160893e-04:  42%|████▏     | 1883/4485 [11:33<16:05,  2.70it/s][A
epoch 2 iter 1883: train loss 0.18948. lr 1.160893e-04:  42%|████▏     | 1884/4485 [11:33<16:03,  2.70it/s][A
epoch 2 iter 1884: train loss 0.18942. lr 1.160063e-04:  42%|████▏     | 1884/4485 [11:34<16:03,  2.70it/s][A
epoch 2 iter 1884: train loss 0.18942. lr 1.160063e-04:  42%|████▏     | 1885/4485 [11:34<16:04,  2.69it/s][A
epoch 2 iter 1885: train loss 0.19866. lr 1.159233e-04:  42%|████▏     | 1885/4485 [11:34<16:04,  2.69it/s][A
epoch 2 iter 1885: train loss 0.19866. lr 1.159233e-04:  42%|████▏     | 1886/4485 [11:34<15:57,  2.72it/s][A
epoch 2 iter 1886: train loss 0.19282. lr 1.158403e-04:  42%|████▏     | 1886/4485 [11:34<15:57,  2.72it/s][A
epoch 2 iter 1886: train loss 0.19282. lr 1.158403e-04:  42%|████▏     | 1887/4485 [11:34<15:57,  2.71it/s][A
epoch 2 iter 1887: train loss 0.21075. lr 1.157574e-04:  42%|████▏     | 1887/4485 [11:35<15:57,  2.71it/s][A

epoch 2 iter 1919: train loss 0.19772. lr 1.131152e-04:  43%|████▎     | 1920/4485 [11:46<15:44,  2.72it/s][A
epoch 2 iter 1920: train loss 0.20427. lr 1.130330e-04:  43%|████▎     | 1920/4485 [11:47<15:44,  2.72it/s][A
epoch 2 iter 1920: train loss 0.20427. lr 1.130330e-04:  43%|████▎     | 1921/4485 [11:47<15:45,  2.71it/s][A
epoch 2 iter 1921: train loss 0.20157. lr 1.129509e-04:  43%|████▎     | 1921/4485 [11:47<15:45,  2.71it/s][A
epoch 2 iter 1921: train loss 0.20157. lr 1.129509e-04:  43%|████▎     | 1922/4485 [11:47<15:45,  2.71it/s][A
epoch 2 iter 1922: train loss 0.19473. lr 1.128687e-04:  43%|████▎     | 1922/4485 [11:48<15:45,  2.71it/s][A
epoch 2 iter 1922: train loss 0.19473. lr 1.128687e-04:  43%|████▎     | 1923/4485 [11:48<15:42,  2.72it/s][A
epoch 2 iter 1923: train loss 0.18698. lr 1.127866e-04:  43%|████▎     | 1923/4485 [11:48<15:42,  2.72it/s][A
epoch 2 iter 1923: train loss 0.18698. lr 1.127866e-04:  43%|████▎     | 1924/4485 [11:48<15:41,  2.72it/s][A

epoch 2 iter 2259: train loss 0.19059. lr 8.655604e-05:  50%|█████     | 2259/4485 [13:52<13:40,  2.71it/s][A
epoch 2 iter 2259: train loss 0.19059. lr 8.655604e-05:  50%|█████     | 2260/4485 [13:52<13:40,  2.71it/s][A
epoch 2 iter 2260: train loss 0.17599. lr 8.648221e-05:  50%|█████     | 2260/4485 [13:52<13:40,  2.71it/s][A
epoch 2 iter 2260: train loss 0.17599. lr 8.648221e-05:  50%|█████     | 2261/4485 [13:52<13:41,  2.71it/s][A
epoch 2 iter 2261: train loss 0.17581. lr 8.640841e-05:  50%|█████     | 2261/4485 [13:53<13:41,  2.71it/s][A
epoch 2 iter 2261: train loss 0.17581. lr 8.640841e-05:  50%|█████     | 2262/4485 [13:53<13:41,  2.71it/s][A
epoch 2 iter 2262: train loss 0.17499. lr 8.633463e-05:  50%|█████     | 2262/4485 [13:53<13:41,  2.71it/s][A
epoch 2 iter 2262: train loss 0.17499. lr 8.633463e-05:  50%|█████     | 2263/4485 [13:53<13:38,  2.72it/s][A
epoch 2 iter 2263: train loss 0.18013. lr 8.626088e-05:  50%|█████     | 2263/4485 [13:53<13:38,  2.72it/s][A

epoch 2 iter 2295: train loss 0.18339. lr 8.391474e-05:  51%|█████     | 2296/4485 [14:05<13:28,  2.71it/s][A
epoch 2 iter 2296: train loss 0.18189. lr 8.384186e-05:  51%|█████     | 2296/4485 [14:05<13:28,  2.71it/s][A
epoch 2 iter 2296: train loss 0.18189. lr 8.384186e-05:  51%|█████     | 2297/4485 [14:05<13:27,  2.71it/s][A
epoch 2 iter 2297: train loss 0.19109. lr 8.376900e-05:  51%|█████     | 2297/4485 [14:06<13:27,  2.71it/s][A
epoch 2 iter 2297: train loss 0.19109. lr 8.376900e-05:  51%|█████     | 2298/4485 [14:06<13:24,  2.72it/s][A
epoch 2 iter 2298: train loss 0.18510. lr 8.369617e-05:  51%|█████     | 2298/4485 [14:06<13:24,  2.72it/s][A
epoch 2 iter 2298: train loss 0.18510. lr 8.369617e-05:  51%|█████▏    | 2299/4485 [14:06<13:23,  2.72it/s][A
epoch 2 iter 2299: train loss 0.16565. lr 8.362337e-05:  51%|█████▏    | 2299/4485 [14:07<13:23,  2.72it/s][A
epoch 2 iter 2299: train loss 0.16565. lr 8.362337e-05:  51%|█████▏    | 2300/4485 [14:07<13:26,  2.71it/s][A

epoch 2 iter 2332: train loss 0.18293. lr 8.123587e-05:  52%|█████▏    | 2332/4485 [14:19<13:15,  2.71it/s][A
epoch 2 iter 2332: train loss 0.18293. lr 8.123587e-05:  52%|█████▏    | 2333/4485 [14:19<13:12,  2.71it/s][A
epoch 2 iter 2333: train loss 0.19438. lr 8.116398e-05:  52%|█████▏    | 2333/4485 [14:19<13:12,  2.71it/s][A
epoch 2 iter 2333: train loss 0.19438. lr 8.116398e-05:  52%|█████▏    | 2334/4485 [14:19<13:11,  2.72it/s][A
epoch 2 iter 2334: train loss 0.19046. lr 8.109211e-05:  52%|█████▏    | 2334/4485 [14:19<13:11,  2.72it/s][A
epoch 2 iter 2334: train loss 0.19046. lr 8.109211e-05:  52%|█████▏    | 2335/4485 [14:19<13:10,  2.72it/s][A
epoch 2 iter 2335: train loss 0.18147. lr 8.102027e-05:  52%|█████▏    | 2335/4485 [14:20<13:10,  2.72it/s][A
epoch 2 iter 2335: train loss 0.18147. lr 8.102027e-05:  52%|█████▏    | 2336/4485 [14:20<13:12,  2.71it/s][A
epoch 2 iter 2336: train loss 0.19662. lr 8.094846e-05:  52%|█████▏    | 2336/4485 [14:20<13:12,  2.71it/s][A

epoch 2 iter 2368: train loss 0.20030. lr 7.866468e-05:  53%|█████▎    | 2369/4485 [14:32<12:58,  2.72it/s][A
epoch 2 iter 2369: train loss 0.17522. lr 7.859375e-05:  53%|█████▎    | 2369/4485 [14:32<12:58,  2.72it/s][A
epoch 2 iter 2369: train loss 0.17522. lr 7.859375e-05:  53%|█████▎    | 2370/4485 [14:32<12:59,  2.71it/s][A
epoch 2 iter 2370: train loss 0.18566. lr 7.852286e-05:  53%|█████▎    | 2370/4485 [14:33<12:59,  2.71it/s][A
epoch 2 iter 2370: train loss 0.18566. lr 7.852286e-05:  53%|█████▎    | 2371/4485 [14:33<13:01,  2.71it/s][A
epoch 2 iter 2371: train loss 0.17762. lr 7.845199e-05:  53%|█████▎    | 2371/4485 [14:33<13:01,  2.71it/s][A
epoch 2 iter 2371: train loss 0.17762. lr 7.845199e-05:  53%|█████▎    | 2372/4485 [14:33<13:00,  2.71it/s][A
epoch 2 iter 2372: train loss 0.17408. lr 7.838115e-05:  53%|█████▎    | 2372/4485 [14:33<13:00,  2.71it/s][A
epoch 2 iter 2372: train loss 0.17408. lr 7.838115e-05:  53%|█████▎    | 2373/4485 [14:33<12:57,  2.72it/s][A

epoch 2 iter 2405: train loss 0.16026. lr 7.605873e-05:  54%|█████▎    | 2405/4485 [14:46<12:46,  2.71it/s][A
epoch 2 iter 2405: train loss 0.16026. lr 7.605873e-05:  54%|█████▎    | 2406/4485 [14:46<12:47,  2.71it/s][A
epoch 2 iter 2406: train loss 0.18736. lr 7.598882e-05:  54%|█████▎    | 2406/4485 [14:46<12:47,  2.71it/s][A
epoch 2 iter 2406: train loss 0.18736. lr 7.598882e-05:  54%|█████▎    | 2407/4485 [14:46<12:46,  2.71it/s][A
epoch 2 iter 2407: train loss 0.18213. lr 7.591894e-05:  54%|█████▎    | 2407/4485 [14:46<12:46,  2.71it/s][A
epoch 2 iter 2407: train loss 0.18213. lr 7.591894e-05:  54%|█████▎    | 2408/4485 [14:46<12:43,  2.72it/s][A
epoch 2 iter 2408: train loss 0.18121. lr 7.584908e-05:  54%|█████▎    | 2408/4485 [14:47<12:43,  2.72it/s][A
epoch 2 iter 2408: train loss 0.18121. lr 7.584908e-05:  54%|█████▎    | 2409/4485 [14:47<12:42,  2.72it/s][A
epoch 2 iter 2409: train loss 0.18735. lr 7.577926e-05:  54%|█████▎    | 2409/4485 [14:47<12:42,  2.72it/s][A

epoch 2 iter 2441: train loss 0.17010. lr 7.355932e-05:  54%|█████▍    | 2442/4485 [14:59<12:30,  2.72it/s][A
epoch 2 iter 2442: train loss 0.17485. lr 7.349041e-05:  54%|█████▍    | 2442/4485 [14:59<12:30,  2.72it/s][A
epoch 2 iter 2442: train loss 0.17485. lr 7.349041e-05:  54%|█████▍    | 2443/4485 [14:59<12:28,  2.73it/s][A
epoch 2 iter 2443: train loss 0.17944. lr 7.342152e-05:  54%|█████▍    | 2443/4485 [15:00<12:28,  2.73it/s][A
epoch 2 iter 2443: train loss 0.17944. lr 7.342152e-05:  54%|█████▍    | 2444/4485 [15:00<12:26,  2.73it/s][A
epoch 2 iter 2444: train loss 0.18735. lr 7.335266e-05:  54%|█████▍    | 2444/4485 [15:00<12:26,  2.73it/s][A
epoch 2 iter 2444: train loss 0.18735. lr 7.335266e-05:  55%|█████▍    | 2445/4485 [15:00<12:30,  2.72it/s][A
epoch 2 iter 2445: train loss 0.18150. lr 7.328383e-05:  55%|█████▍    | 2445/4485 [15:00<12:30,  2.72it/s][A
epoch 2 iter 2445: train loss 0.18150. lr 7.328383e-05:  55%|█████▍    | 2446/4485 [15:00<12:31,  2.71it/s][A

epoch 2 iter 2478: train loss 0.18227. lr 7.102801e-05:  55%|█████▌    | 2478/4485 [15:12<12:12,  2.74it/s][A
epoch 2 iter 2478: train loss 0.18227. lr 7.102801e-05:  55%|█████▌    | 2479/4485 [15:12<12:14,  2.73it/s][A
epoch 2 iter 2479: train loss 0.19237. lr 7.096013e-05:  55%|█████▌    | 2479/4485 [15:13<12:14,  2.73it/s][A
epoch 2 iter 2479: train loss 0.19237. lr 7.096013e-05:  55%|█████▌    | 2480/4485 [15:13<12:15,  2.73it/s][A
epoch 2 iter 2480: train loss 0.18097. lr 7.089227e-05:  55%|█████▌    | 2480/4485 [15:13<12:15,  2.73it/s][A
epoch 2 iter 2480: train loss 0.18097. lr 7.089227e-05:  55%|█████▌    | 2481/4485 [15:13<12:13,  2.73it/s][A
epoch 2 iter 2481: train loss 0.18039. lr 7.082445e-05:  55%|█████▌    | 2481/4485 [15:14<12:13,  2.73it/s][A
epoch 2 iter 2481: train loss 0.18039. lr 7.082445e-05:  55%|█████▌    | 2482/4485 [15:14<12:13,  2.73it/s][A
epoch 2 iter 2482: train loss 0.18419. lr 7.075665e-05:  55%|█████▌    | 2482/4485 [15:14<12:13,  2.73it/s][A

epoch 2 iter 2514: train loss 0.17836. lr 6.860202e-05:  56%|█████▌    | 2515/4485 [15:26<12:03,  2.72it/s][A
epoch 2 iter 2515: train loss 0.17659. lr 6.853516e-05:  56%|█████▌    | 2515/4485 [15:26<12:03,  2.72it/s][A
epoch 2 iter 2515: train loss 0.17659. lr 6.853516e-05:  56%|█████▌    | 2516/4485 [15:26<12:03,  2.72it/s][A
epoch 2 iter 2516: train loss 0.17745. lr 6.846832e-05:  56%|█████▌    | 2516/4485 [15:26<12:03,  2.72it/s][A
epoch 2 iter 2516: train loss 0.17745. lr 6.846832e-05:  56%|█████▌    | 2517/4485 [15:26<12:04,  2.72it/s][A
epoch 2 iter 2517: train loss 0.19564. lr 6.840151e-05:  56%|█████▌    | 2517/4485 [15:27<12:04,  2.72it/s][A
epoch 2 iter 2517: train loss 0.19564. lr 6.840151e-05:  56%|█████▌    | 2518/4485 [15:27<12:05,  2.71it/s][A
epoch 2 iter 2518: train loss 0.17470. lr 6.833473e-05:  56%|█████▌    | 2518/4485 [15:27<12:05,  2.71it/s][A
epoch 2 iter 2518: train loss 0.17470. lr 6.833473e-05:  56%|█████▌    | 2519/4485 [15:27<12:04,  2.71it/s][A

epoch 2 iter 2551: train loss 0.17902. lr 6.614699e-05:  57%|█████▋    | 2551/4485 [15:39<11:51,  2.72it/s][A
epoch 2 iter 2551: train loss 0.17902. lr 6.614699e-05:  57%|█████▋    | 2552/4485 [15:39<11:52,  2.71it/s][A
epoch 2 iter 2552: train loss 0.18324. lr 6.608118e-05:  57%|█████▋    | 2552/4485 [15:40<11:52,  2.71it/s][A
epoch 2 iter 2552: train loss 0.18324. lr 6.608118e-05:  57%|█████▋    | 2553/4485 [15:40<11:51,  2.72it/s][A
epoch 2 iter 2553: train loss 0.18398. lr 6.601540e-05:  57%|█████▋    | 2553/4485 [15:40<11:51,  2.72it/s][A
epoch 2 iter 2553: train loss 0.18398. lr 6.601540e-05:  57%|█████▋    | 2554/4485 [15:40<11:49,  2.72it/s][A
epoch 2 iter 2554: train loss 0.18784. lr 6.594965e-05:  57%|█████▋    | 2554/4485 [15:40<11:49,  2.72it/s][A
epoch 2 iter 2554: train loss 0.18784. lr 6.594965e-05:  57%|█████▋    | 2555/4485 [15:40<11:48,  2.72it/s][A
epoch 2 iter 2555: train loss 0.18038. lr 6.588393e-05:  57%|█████▋    | 2555/4485 [15:41<11:48,  2.72it/s][A

epoch 2 iter 2587: train loss 0.19430. lr 6.379601e-05:  58%|█████▊    | 2588/4485 [15:52<11:33,  2.73it/s][A
epoch 2 iter 2588: train loss 0.18716. lr 6.373124e-05:  58%|█████▊    | 2588/4485 [15:53<11:33,  2.73it/s][A
epoch 2 iter 2588: train loss 0.18716. lr 6.373124e-05:  58%|█████▊    | 2589/4485 [15:53<11:30,  2.75it/s][A
epoch 2 iter 2589: train loss 0.17202. lr 6.366650e-05:  58%|█████▊    | 2589/4485 [15:53<11:30,  2.75it/s][A
epoch 2 iter 2589: train loss 0.17202. lr 6.366650e-05:  58%|█████▊    | 2590/4485 [15:53<11:33,  2.73it/s][A
epoch 2 iter 2590: train loss 0.18473. lr 6.360178e-05:  58%|█████▊    | 2590/4485 [15:54<11:33,  2.73it/s][A
epoch 2 iter 2590: train loss 0.18473. lr 6.360178e-05:  58%|█████▊    | 2591/4485 [15:54<11:34,  2.73it/s][A
epoch 2 iter 2591: train loss 0.19330. lr 6.353710e-05:  58%|█████▊    | 2591/4485 [15:54<11:34,  2.73it/s][A
epoch 2 iter 2591: train loss 0.19330. lr 6.353710e-05:  58%|█████▊    | 2592/4485 [15:54<11:33,  2.73it/s][A

epoch 2 iter 2624: train loss 0.18482. lr 6.141886e-05:  59%|█████▊    | 2624/4485 [16:06<11:27,  2.71it/s][A
epoch 2 iter 2624: train loss 0.18482. lr 6.141886e-05:  59%|█████▊    | 2625/4485 [16:06<11:25,  2.71it/s][A
epoch 2 iter 2625: train loss 0.19594. lr 6.135517e-05:  59%|█████▊    | 2625/4485 [16:06<11:25,  2.71it/s][A
epoch 2 iter 2625: train loss 0.19594. lr 6.135517e-05:  59%|█████▊    | 2626/4485 [16:06<11:23,  2.72it/s][A
epoch 2 iter 2626: train loss 0.19180. lr 6.129151e-05:  59%|█████▊    | 2626/4485 [16:07<11:23,  2.72it/s][A
epoch 2 iter 2626: train loss 0.19180. lr 6.129151e-05:  59%|█████▊    | 2627/4485 [16:07<11:22,  2.72it/s][A
epoch 2 iter 2627: train loss 0.17940. lr 6.122787e-05:  59%|█████▊    | 2627/4485 [16:07<11:22,  2.72it/s][A
epoch 2 iter 2627: train loss 0.17940. lr 6.122787e-05:  59%|█████▊    | 2628/4485 [16:07<11:23,  2.72it/s][A
epoch 2 iter 2628: train loss 0.19133. lr 6.116427e-05:  59%|█████▊    | 2628/4485 [16:08<11:23,  2.72it/s][A

epoch 2 iter 2660: train loss 0.18625. lr 6.000000e-05:  59%|█████▉    | 2661/4485 [16:19<11:10,  2.72it/s][A
epoch 2 iter 2661: train loss 0.18021. lr 6.000000e-05:  59%|█████▉    | 2661/4485 [16:20<11:10,  2.72it/s][A
epoch 2 iter 2661: train loss 0.18021. lr 6.000000e-05:  59%|█████▉    | 2662/4485 [16:20<11:11,  2.71it/s][A
epoch 2 iter 2662: train loss 0.19260. lr 6.000000e-05:  59%|█████▉    | 2662/4485 [16:20<11:11,  2.71it/s][A
epoch 2 iter 2662: train loss 0.19260. lr 6.000000e-05:  59%|█████▉    | 2663/4485 [16:20<11:12,  2.71it/s][A
epoch 2 iter 2663: train loss 0.18672. lr 6.000000e-05:  59%|█████▉    | 2663/4485 [16:20<11:12,  2.71it/s][A
epoch 2 iter 2663: train loss 0.18672. lr 6.000000e-05:  59%|█████▉    | 2664/4485 [16:20<11:09,  2.72it/s][A
epoch 2 iter 2664: train loss 0.18365. lr 6.000000e-05:  59%|█████▉    | 2664/4485 [16:21<11:09,  2.72it/s][A
epoch 2 iter 2664: train loss 0.18365. lr 6.000000e-05:  59%|█████▉    | 2665/4485 [16:21<11:08,  2.72it/s][A

epoch 2 iter 2697: train loss 0.19088. lr 6.000000e-05:  60%|██████    | 2697/4485 [16:33<11:00,  2.71it/s][A
epoch 2 iter 2697: train loss 0.19088. lr 6.000000e-05:  60%|██████    | 2698/4485 [16:33<10:58,  2.71it/s][A
epoch 2 iter 2698: train loss 0.19547. lr 6.000000e-05:  60%|██████    | 2698/4485 [16:33<10:58,  2.71it/s][A
epoch 2 iter 2698: train loss 0.19547. lr 6.000000e-05:  60%|██████    | 2699/4485 [16:33<10:56,  2.72it/s][A
epoch 2 iter 2699: train loss 0.17164. lr 6.000000e-05:  60%|██████    | 2699/4485 [16:34<10:56,  2.72it/s][A
epoch 2 iter 2699: train loss 0.17164. lr 6.000000e-05:  60%|██████    | 2700/4485 [16:34<10:55,  2.72it/s][A
epoch 2 iter 2700: train loss 0.17585. lr 6.000000e-05:  60%|██████    | 2700/4485 [16:34<10:55,  2.72it/s][A
epoch 2 iter 2700: train loss 0.17585. lr 6.000000e-05:  60%|██████    | 2701/4485 [16:34<10:54,  2.73it/s][A
epoch 2 iter 2701: train loss 0.18638. lr 6.000000e-05:  60%|██████    | 2701/4485 [16:34<10:54,  2.73it/s][A

epoch 1 iter 400: train loss 1.87658. lr 5.698712e-04:  29%|██▉       | 401/1394 [02:27<06:05,  2.71it/s][A
epoch 1 iter 401: train loss 1.91723. lr 5.697233e-04:  29%|██▉       | 401/1394 [02:27<06:05,  2.71it/s][A
epoch 1 iter 401: train loss 1.91723. lr 5.697233e-04:  29%|██▉       | 402/1394 [02:27<06:06,  2.71it/s][A
epoch 1 iter 402: train loss 1.89640. lr 5.695750e-04:  29%|██▉       | 402/1394 [02:28<06:06,  2.71it/s][A
epoch 1 iter 402: train loss 1.89640. lr 5.695750e-04:  29%|██▉       | 403/1394 [02:28<06:05,  2.71it/s][A
epoch 1 iter 403: train loss 1.89662. lr 5.694264e-04:  29%|██▉       | 403/1394 [02:28<06:05,  2.71it/s][A
epoch 1 iter 403: train loss 1.89662. lr 5.694264e-04:  29%|██▉       | 404/1394 [02:28<06:04,  2.72it/s][A
epoch 1 iter 404: train loss 1.84760. lr 5.692774e-04:  29%|██▉       | 404/1394 [02:28<06:04,  2.72it/s][A
epoch 1 iter 404: train loss 1.84760. lr 5.692774e-04:  29%|██▉       | 405/1394 [02:28<06:03,  2.72it/s][A

epoch 1 iter 438: train loss 1.61017. lr 5.640104e-04:  31%|███▏      | 438/1394 [02:41<05:51,  2.72it/s][A
epoch 1 iter 438: train loss 1.61017. lr 5.640104e-04:  31%|███▏      | 439/1394 [02:41<05:51,  2.72it/s][A
epoch 1 iter 439: train loss 1.64157. lr 5.638495e-04:  31%|███▏      | 439/1394 [02:41<05:51,  2.72it/s][A
epoch 1 iter 439: train loss 1.64157. lr 5.638495e-04:  32%|███▏      | 440/1394 [02:41<05:48,  2.73it/s][A
epoch 1 iter 440: train loss 1.65014. lr 5.636884e-04:  32%|███▏      | 440/1394 [02:42<05:48,  2.73it/s][A
epoch 1 iter 440: train loss 1.65014. lr 5.636884e-04:  32%|███▏      | 441/1394 [02:42<05:50,  2.72it/s][A
epoch 1 iter 441: train loss 1.59969. lr 5.635269e-04:  32%|███▏      | 441/1394 [02:42<05:50,  2.72it/s][A
epoch 1 iter 441: train loss 1.59969. lr 5.635269e-04:  32%|███▏      | 442/1394 [02:42<05:50,  2.72it/s][A
epoch 1 iter 442: train loss 1.62898. lr 5.633650e-04:  32%|███▏      | 442/1394 [02:42<05:50,  2.72it/s][A

epoch 1 iter 475: train loss 1.40391. lr 5.578379e-04:  34%|███▍      | 476/1394 [02:55<05:39,  2.71it/s][A
epoch 1 iter 476: train loss 1.30468. lr 5.576648e-04:  34%|███▍      | 476/1394 [02:55<05:39,  2.71it/s][A
epoch 1 iter 476: train loss 1.30468. lr 5.576648e-04:  34%|███▍      | 477/1394 [02:55<05:38,  2.71it/s][A
epoch 1 iter 477: train loss 1.40340. lr 5.574914e-04:  34%|███▍      | 477/1394 [02:55<05:38,  2.71it/s][A
epoch 1 iter 477: train loss 1.40340. lr 5.574914e-04:  34%|███▍      | 478/1394 [02:55<05:37,  2.71it/s][A
epoch 1 iter 478: train loss 1.38093. lr 5.573176e-04:  34%|███▍      | 478/1394 [02:56<05:37,  2.71it/s][A
epoch 1 iter 478: train loss 1.38093. lr 5.573176e-04:  34%|███▍      | 479/1394 [02:56<05:37,  2.71it/s][A
epoch 1 iter 479: train loss 1.33769. lr 5.571436e-04:  34%|███▍      | 479/1394 [02:56<05:37,  2.71it/s][A
epoch 1 iter 479: train loss 1.33769. lr 5.571436e-04:  34%|███▍      | 480/1394 [02:56<05:36,  2.72it/s][A

epoch 1 iter 513: train loss 1.17476. lr 5.510316e-04:  37%|███▋      | 513/1394 [03:09<05:24,  2.72it/s][A
epoch 1 iter 513: train loss 1.17476. lr 5.510316e-04:  37%|███▋      | 514/1394 [03:09<05:23,  2.72it/s][A
epoch 1 iter 514: train loss 1.17238. lr 5.508463e-04:  37%|███▋      | 514/1394 [03:09<05:23,  2.72it/s][A
epoch 1 iter 514: train loss 1.17238. lr 5.508463e-04:  37%|███▋      | 515/1394 [03:09<05:23,  2.71it/s][A
epoch 1 iter 515: train loss 1.12295. lr 5.506605e-04:  37%|███▋      | 515/1394 [03:09<05:23,  2.71it/s][A
epoch 1 iter 515: train loss 1.12295. lr 5.506605e-04:  37%|███▋      | 516/1394 [03:09<05:24,  2.71it/s][A
epoch 1 iter 516: train loss 1.15641. lr 5.504745e-04:  37%|███▋      | 516/1394 [03:10<05:24,  2.71it/s][A
epoch 1 iter 516: train loss 1.15641. lr 5.504745e-04:  37%|███▋      | 517/1394 [03:10<05:23,  2.71it/s][A
epoch 1 iter 517: train loss 1.14199. lr 5.502882e-04:  37%|███▋      | 517/1394 [03:10<05:23,  2.71it/s][A

epoch 1 iter 550: train loss 1.07896. lr 5.439615e-04:  40%|███▉      | 551/1394 [03:22<05:10,  2.71it/s][A
epoch 1 iter 551: train loss 0.98675. lr 5.437645e-04:  40%|███▉      | 551/1394 [03:23<05:10,  2.71it/s][A
epoch 1 iter 551: train loss 0.98675. lr 5.437645e-04:  40%|███▉      | 552/1394 [03:23<05:09,  2.72it/s][A
epoch 1 iter 552: train loss 0.96169. lr 5.435671e-04:  40%|███▉      | 552/1394 [03:23<05:09,  2.72it/s][A
epoch 1 iter 552: train loss 0.96169. lr 5.435671e-04:  40%|███▉      | 553/1394 [03:23<05:08,  2.72it/s][A
epoch 1 iter 553: train loss 1.00830. lr 5.433695e-04:  40%|███▉      | 553/1394 [03:23<05:08,  2.72it/s][A
epoch 1 iter 553: train loss 1.00830. lr 5.433695e-04:  40%|███▉      | 554/1394 [03:23<05:06,  2.74it/s][A
epoch 1 iter 554: train loss 0.96152. lr 5.431715e-04:  40%|███▉      | 554/1394 [03:24<05:06,  2.74it/s][A
epoch 1 iter 554: train loss 0.96152. lr 5.431715e-04:  40%|███▉      | 555/1394 [03:24<05:07,  2.73it/s][A

epoch 1 iter 588: train loss 0.82971. lr 5.362584e-04:  42%|████▏     | 588/1394 [03:36<04:56,  2.71it/s][A
epoch 1 iter 588: train loss 0.82971. lr 5.362584e-04:  42%|████▏     | 589/1394 [03:36<04:56,  2.71it/s][A
epoch 1 iter 589: train loss 0.84197. lr 5.360498e-04:  42%|████▏     | 589/1394 [03:37<04:56,  2.71it/s][A
epoch 1 iter 589: train loss 0.84197. lr 5.360498e-04:  42%|████▏     | 590/1394 [03:37<04:57,  2.71it/s][A
epoch 1 iter 590: train loss 0.79565. lr 5.358409e-04:  42%|████▏     | 590/1394 [03:37<04:57,  2.71it/s][A
epoch 1 iter 590: train loss 0.79565. lr 5.358409e-04:  42%|████▏     | 591/1394 [03:37<04:56,  2.71it/s][A
epoch 1 iter 591: train loss 0.82637. lr 5.356316e-04:  42%|████▏     | 591/1394 [03:37<04:56,  2.71it/s][A
epoch 1 iter 591: train loss 0.82637. lr 5.356316e-04:  42%|████▏     | 592/1394 [03:37<04:55,  2.71it/s][A
epoch 1 iter 592: train loss 0.82385. lr 5.354221e-04:  42%|████▏     | 592/1394 [03:38<04:55,  2.71it/s][A

epoch 1 iter 625: train loss 0.71930. lr 5.283412e-04:  45%|████▍     | 626/1394 [03:50<04:43,  2.71it/s][A
epoch 1 iter 626: train loss 0.69284. lr 5.281216e-04:  45%|████▍     | 626/1394 [03:50<04:43,  2.71it/s][A
epoch 1 iter 626: train loss 0.69284. lr 5.281216e-04:  45%|████▍     | 627/1394 [03:50<04:42,  2.72it/s][A
epoch 1 iter 627: train loss 0.65115. lr 5.279018e-04:  45%|████▍     | 627/1394 [03:51<04:42,  2.72it/s][A
epoch 1 iter 627: train loss 0.65115. lr 5.279018e-04:  45%|████▌     | 628/1394 [03:51<04:42,  2.72it/s][A
epoch 1 iter 628: train loss 0.70902. lr 5.276816e-04:  45%|████▌     | 628/1394 [03:51<04:42,  2.72it/s][A
epoch 1 iter 628: train loss 0.70902. lr 5.276816e-04:  45%|████▌     | 629/1394 [03:51<04:40,  2.72it/s][A
epoch 1 iter 629: train loss 0.69699. lr 5.274612e-04:  45%|████▌     | 629/1394 [03:51<04:40,  2.72it/s][A
epoch 1 iter 629: train loss 0.69699. lr 5.274612e-04:  45%|████▌     | 630/1394 [03:51<04:41,  2.72it/s][A

epoch 1 iter 663: train loss 0.63451. lr 5.197964e-04:  48%|████▊     | 663/1394 [04:04<04:28,  2.73it/s][A
epoch 1 iter 663: train loss 0.63451. lr 5.197964e-04:  48%|████▊     | 664/1394 [04:04<04:29,  2.71it/s][A
epoch 1 iter 664: train loss 0.60647. lr 5.195660e-04:  48%|████▊     | 664/1394 [04:04<04:29,  2.71it/s][A
epoch 1 iter 664: train loss 0.60647. lr 5.195660e-04:  48%|████▊     | 665/1394 [04:04<04:29,  2.70it/s][A
epoch 1 iter 665: train loss 0.61497. lr 5.193353e-04:  48%|████▊     | 665/1394 [04:05<04:29,  2.70it/s][A
epoch 1 iter 665: train loss 0.61497. lr 5.193353e-04:  48%|████▊     | 666/1394 [04:05<04:29,  2.70it/s][A
epoch 1 iter 666: train loss 0.58748. lr 5.191044e-04:  48%|████▊     | 666/1394 [04:05<04:29,  2.70it/s][A
epoch 1 iter 666: train loss 0.58748. lr 5.191044e-04:  48%|████▊     | 667/1394 [04:05<04:28,  2.71it/s][A
epoch 1 iter 667: train loss 0.62592. lr 5.188732e-04:  48%|████▊     | 667/1394 [04:05<04:28,  2.71it/s][A

epoch 1 iter 700: train loss 0.53241. lr 5.110885e-04:  50%|█████     | 701/1394 [04:17<04:14,  2.72it/s][A
epoch 1 iter 701: train loss 0.49519. lr 5.108480e-04:  50%|█████     | 701/1394 [04:18<04:14,  2.72it/s][A
epoch 1 iter 701: train loss 0.49519. lr 5.108480e-04:  50%|█████     | 702/1394 [04:18<04:14,  2.72it/s][A
epoch 1 iter 702: train loss 0.50393. lr 5.106072e-04:  50%|█████     | 702/1394 [04:18<04:14,  2.72it/s][A
epoch 1 iter 702: train loss 0.50393. lr 5.106072e-04:  50%|█████     | 703/1394 [04:18<04:14,  2.72it/s][A
epoch 1 iter 703: train loss 0.56120. lr 5.103662e-04:  50%|█████     | 703/1394 [04:19<04:14,  2.72it/s][A
epoch 1 iter 703: train loss 0.56120. lr 5.103662e-04:  51%|█████     | 704/1394 [04:19<04:14,  2.71it/s][A
epoch 1 iter 704: train loss 0.53349. lr 5.101249e-04:  51%|█████     | 704/1394 [04:19<04:14,  2.71it/s][A
epoch 1 iter 704: train loss 0.53349. lr 5.101249e-04:  51%|█████     | 705/1394 [04:19<04:14,  2.71it/s][A

epoch 1 iter 738: train loss 0.47169. lr 5.017631e-04:  53%|█████▎    | 738/1394 [04:31<04:01,  2.72it/s][A
epoch 1 iter 738: train loss 0.47169. lr 5.017631e-04:  53%|█████▎    | 739/1394 [04:31<04:01,  2.71it/s][A
epoch 1 iter 739: train loss 0.45598. lr 5.015126e-04:  53%|█████▎    | 739/1394 [04:32<04:01,  2.71it/s][A
epoch 1 iter 739: train loss 0.45598. lr 5.015126e-04:  53%|█████▎    | 740/1394 [04:32<04:01,  2.71it/s][A
epoch 1 iter 740: train loss 0.45967. lr 5.012619e-04:  53%|█████▎    | 740/1394 [04:32<04:01,  2.71it/s][A
epoch 1 iter 740: train loss 0.45967. lr 5.012619e-04:  53%|█████▎    | 741/1394 [04:32<04:00,  2.72it/s][A
epoch 1 iter 741: train loss 0.47808. lr 5.010109e-04:  53%|█████▎    | 741/1394 [04:33<04:00,  2.72it/s][A
epoch 1 iter 741: train loss 0.47808. lr 5.010109e-04:  53%|█████▎    | 742/1394 [04:33<04:00,  2.72it/s][A
epoch 1 iter 742: train loss 0.47640. lr 5.007596e-04:  53%|█████▎    | 742/1394 [04:33<04:00,  2.72it/s][A

epoch 1 iter 775: train loss 0.45730. lr 4.923270e-04:  56%|█████▌    | 776/1394 [04:45<03:47,  2.72it/s][A
epoch 1 iter 776: train loss 0.43414. lr 4.920672e-04:  56%|█████▌    | 776/1394 [04:45<03:47,  2.72it/s][A
epoch 1 iter 776: train loss 0.43414. lr 4.920672e-04:  56%|█████▌    | 777/1394 [04:45<03:47,  2.72it/s][A
epoch 1 iter 777: train loss 0.41822. lr 4.918072e-04:  56%|█████▌    | 777/1394 [04:46<03:47,  2.72it/s][A
epoch 1 iter 777: train loss 0.41822. lr 4.918072e-04:  56%|█████▌    | 778/1394 [04:46<03:46,  2.72it/s][A
epoch 1 iter 778: train loss 0.42087. lr 4.915470e-04:  56%|█████▌    | 778/1394 [04:46<03:46,  2.72it/s][A
epoch 1 iter 778: train loss 0.42087. lr 4.915470e-04:  56%|█████▌    | 779/1394 [04:46<03:46,  2.71it/s][A
epoch 1 iter 779: train loss 0.44249. lr 4.912865e-04:  56%|█████▌    | 779/1394 [04:47<03:46,  2.71it/s][A
epoch 1 iter 779: train loss 0.44249. lr 4.912865e-04:  56%|█████▌    | 780/1394 [04:47<03:46,  2.71it/s][A

epoch 1 iter 813: train loss 0.36445. lr 4.822875e-04:  58%|█████▊    | 813/1394 [04:59<03:33,  2.72it/s][A
epoch 1 iter 813: train loss 0.36445. lr 4.822875e-04:  58%|█████▊    | 814/1394 [04:59<03:34,  2.71it/s][A
epoch 1 iter 814: train loss 0.39654. lr 4.820187e-04:  58%|█████▊    | 814/1394 [04:59<03:34,  2.71it/s][A
epoch 1 iter 814: train loss 0.39654. lr 4.820187e-04:  58%|█████▊    | 815/1394 [04:59<03:33,  2.71it/s][A
epoch 1 iter 815: train loss 0.38745. lr 4.817497e-04:  58%|█████▊    | 815/1394 [05:00<03:33,  2.71it/s][A
epoch 1 iter 815: train loss 0.38745. lr 4.817497e-04:  59%|█████▊    | 816/1394 [05:00<03:32,  2.72it/s][A
epoch 1 iter 816: train loss 0.38454. lr 4.814804e-04:  59%|█████▊    | 816/1394 [05:00<03:32,  2.72it/s][A
epoch 1 iter 816: train loss 0.38454. lr 4.814804e-04:  59%|█████▊    | 817/1394 [05:00<03:31,  2.73it/s][A
epoch 1 iter 817: train loss 0.37382. lr 4.812109e-04:  59%|█████▊    | 817/1394 [05:01<03:31,  2.73it/s][A

epoch 1 iter 850: train loss 0.39709. lr 4.721905e-04:  61%|██████    | 851/1394 [05:13<03:18,  2.73it/s][A
epoch 1 iter 851: train loss 0.36397. lr 4.719134e-04:  61%|██████    | 851/1394 [05:13<03:18,  2.73it/s][A
epoch 1 iter 851: train loss 0.36397. lr 4.719134e-04:  61%|██████    | 852/1394 [05:13<03:19,  2.72it/s][A
epoch 1 iter 852: train loss 0.33912. lr 4.716361e-04:  61%|██████    | 852/1394 [05:13<03:19,  2.72it/s][A
epoch 1 iter 852: train loss 0.33912. lr 4.716361e-04:  61%|██████    | 853/1394 [05:13<03:19,  2.71it/s][A
epoch 1 iter 853: train loss 0.38916. lr 4.713585e-04:  61%|██████    | 853/1394 [05:14<03:19,  2.71it/s][A
epoch 1 iter 853: train loss 0.38916. lr 4.713585e-04:  61%|██████▏   | 854/1394 [05:14<03:18,  2.72it/s][A
epoch 1 iter 854: train loss 0.35281. lr 4.710807e-04:  61%|██████▏   | 854/1394 [05:14<03:18,  2.72it/s][A
epoch 1 iter 854: train loss 0.35281. lr 4.710807e-04:  61%|██████▏   | 855/1394 [05:14<03:18,  2.72it/s][A

epoch 1 iter 888: train loss 0.35509. lr 4.615089e-04:  64%|██████▎   | 888/1394 [05:27<03:07,  2.69it/s][A
epoch 1 iter 888: train loss 0.35509. lr 4.615089e-04:  64%|██████▍   | 889/1394 [05:27<03:07,  2.70it/s][A
epoch 1 iter 889: train loss 0.33995. lr 4.612237e-04:  64%|██████▍   | 889/1394 [05:27<03:07,  2.70it/s][A
epoch 1 iter 889: train loss 0.33995. lr 4.612237e-04:  64%|██████▍   | 890/1394 [05:27<03:06,  2.70it/s][A
epoch 1 iter 890: train loss 0.32675. lr 4.609383e-04:  64%|██████▍   | 890/1394 [05:27<03:06,  2.70it/s][A
epoch 1 iter 890: train loss 0.32675. lr 4.609383e-04:  64%|██████▍   | 891/1394 [05:27<03:06,  2.70it/s][A
epoch 1 iter 891: train loss 0.32796. lr 4.606527e-04:  64%|██████▍   | 891/1394 [05:28<03:06,  2.70it/s][A
epoch 1 iter 891: train loss 0.32796. lr 4.606527e-04:  64%|██████▍   | 892/1394 [05:28<03:05,  2.70it/s][A
epoch 1 iter 892: train loss 0.37247. lr 4.603669e-04:  64%|██████▍   | 892/1394 [05:28<03:05,  2.70it/s][A

epoch 1 iter 267: train loss 4.28457. lr 5.971821e-04:   9%|▊         | 268/3068 [02:01<22:09,  2.11it/s][A
epoch 1 iter 268: train loss 4.29257. lr 5.971611e-04:   9%|▊         | 268/3068 [02:01<22:09,  2.11it/s][A
epoch 1 iter 268: train loss 4.29257. lr 5.971611e-04:   9%|▉         | 269/3068 [02:01<22:09,  2.11it/s][A
epoch 1 iter 269: train loss 4.23005. lr 5.971400e-04:   9%|▉         | 269/3068 [02:01<22:09,  2.11it/s][A
epoch 1 iter 269: train loss 4.23005. lr 5.971400e-04:   9%|▉         | 270/3068 [02:01<22:04,  2.11it/s][A
epoch 1 iter 270: train loss 4.36512. lr 5.971188e-04:   9%|▉         | 270/3068 [02:02<22:04,  2.11it/s][A
epoch 1 iter 270: train loss 4.36512. lr 5.971188e-04:   9%|▉         | 271/3068 [02:02<21:29,  2.17it/s][A
epoch 1 iter 271: train loss 4.37328. lr 5.970975e-04:   9%|▉         | 271/3068 [02:02<21:29,  2.17it/s][A
epoch 1 iter 271: train loss 4.37328. lr 5.970975e-04:   9%|▉         | 272/3068 [02:02<21:37,  2.15it/s][A

epoch 1 iter 305: train loss 4.08033. lr 5.963276e-04:  10%|▉         | 305/3068 [02:18<21:55,  2.10it/s][A
epoch 1 iter 305: train loss 4.08033. lr 5.963276e-04:  10%|▉         | 306/3068 [02:18<22:01,  2.09it/s][A
epoch 1 iter 306: train loss 4.18140. lr 5.963036e-04:  10%|▉         | 306/3068 [02:19<22:01,  2.09it/s][A
epoch 1 iter 306: train loss 4.18140. lr 5.963036e-04:  10%|█         | 307/3068 [02:19<22:15,  2.07it/s][A
epoch 1 iter 307: train loss 4.03753. lr 5.962795e-04:  10%|█         | 307/3068 [02:19<22:15,  2.07it/s][A
epoch 1 iter 307: train loss 4.03753. lr 5.962795e-04:  10%|█         | 308/3068 [02:19<22:15,  2.07it/s][A
epoch 1 iter 308: train loss 4.05657. lr 5.962554e-04:  10%|█         | 308/3068 [02:20<22:15,  2.07it/s][A
epoch 1 iter 308: train loss 4.05657. lr 5.962554e-04:  10%|█         | 309/3068 [02:20<22:10,  2.07it/s][A
epoch 1 iter 309: train loss 4.08931. lr 5.962311e-04:  10%|█         | 309/3068 [02:20<22:10,  2.07it/s][A

epoch 1 iter 342: train loss 3.80877. lr 5.953877e-04:  11%|█         | 343/3068 [02:36<21:18,  2.13it/s][A
epoch 1 iter 343: train loss 3.83172. lr 5.953609e-04:  11%|█         | 343/3068 [02:37<21:18,  2.13it/s][A
epoch 1 iter 343: train loss 3.83172. lr 5.953609e-04:  11%|█         | 344/3068 [02:37<21:18,  2.13it/s][A
epoch 1 iter 344: train loss 3.74267. lr 5.953339e-04:  11%|█         | 344/3068 [02:37<21:18,  2.13it/s][A
epoch 1 iter 344: train loss 3.74267. lr 5.953339e-04:  11%|█         | 345/3068 [02:37<21:25,  2.12it/s][A
epoch 1 iter 345: train loss 3.70091. lr 5.953069e-04:  11%|█         | 345/3068 [02:37<21:25,  2.12it/s][A
epoch 1 iter 345: train loss 3.70091. lr 5.953069e-04:  11%|█▏        | 346/3068 [02:37<21:26,  2.12it/s][A
epoch 1 iter 346: train loss 3.71226. lr 5.952798e-04:  11%|█▏        | 346/3068 [02:38<21:26,  2.12it/s][A
epoch 1 iter 346: train loss 3.71226. lr 5.952798e-04:  11%|█▏        | 347/3068 [02:38<21:23,  2.12it/s][A

epoch 1 iter 380: train loss 3.52985. lr 5.943121e-04:  12%|█▏        | 380/3068 [02:54<21:04,  2.13it/s][A
epoch 1 iter 380: train loss 3.52985. lr 5.943121e-04:  12%|█▏        | 381/3068 [02:54<21:09,  2.12it/s][A
epoch 1 iter 381: train loss 3.41702. lr 5.942823e-04:  12%|█▏        | 381/3068 [02:54<21:09,  2.12it/s][A
epoch 1 iter 381: train loss 3.41702. lr 5.942823e-04:  12%|█▏        | 382/3068 [02:54<21:21,  2.10it/s][A
epoch 1 iter 382: train loss 3.44815. lr 5.942524e-04:  12%|█▏        | 382/3068 [02:55<21:21,  2.10it/s][A
epoch 1 iter 382: train loss 3.44815. lr 5.942524e-04:  12%|█▏        | 383/3068 [02:55<21:17,  2.10it/s][A
epoch 1 iter 383: train loss 3.50228. lr 5.942224e-04:  12%|█▏        | 383/3068 [02:55<21:17,  2.10it/s][A
epoch 1 iter 383: train loss 3.50228. lr 5.942224e-04:  13%|█▎        | 384/3068 [02:55<21:10,  2.11it/s][A
epoch 1 iter 384: train loss 3.52310. lr 5.941924e-04:  13%|█▎        | 384/3068 [02:56<21:10,  2.11it/s][A

epoch 1 iter 417: train loss 3.26736. lr 5.931576e-04:  14%|█▎        | 418/3068 [03:12<21:02,  2.10it/s][A
epoch 1 iter 418: train loss 3.41455. lr 5.931250e-04:  14%|█▎        | 418/3068 [03:12<21:02,  2.10it/s][A
epoch 1 iter 418: train loss 3.41455. lr 5.931250e-04:  14%|█▎        | 419/3068 [03:12<21:00,  2.10it/s][A
epoch 1 iter 419: train loss 3.29327. lr 5.930922e-04:  14%|█▎        | 419/3068 [03:12<21:00,  2.10it/s][A
epoch 1 iter 419: train loss 3.29327. lr 5.930922e-04:  14%|█▎        | 420/3068 [03:12<21:01,  2.10it/s][A
epoch 1 iter 420: train loss 3.33975. lr 5.930594e-04:  14%|█▎        | 420/3068 [03:13<21:01,  2.10it/s][A
epoch 1 iter 420: train loss 3.33975. lr 5.930594e-04:  14%|█▎        | 421/3068 [03:13<21:03,  2.10it/s][A
epoch 1 iter 421: train loss 3.20176. lr 5.930265e-04:  14%|█▎        | 421/3068 [03:13<21:03,  2.10it/s][A
epoch 1 iter 421: train loss 3.20176. lr 5.930265e-04:  14%|█▍        | 422/3068 [03:13<21:04,  2.09it/s][A

epoch 1 iter 455: train loss 2.99611. lr 5.918624e-04:  15%|█▍        | 455/3068 [03:30<20:43,  2.10it/s][A
epoch 1 iter 455: train loss 2.99611. lr 5.918624e-04:  15%|█▍        | 456/3068 [03:30<20:39,  2.11it/s][A
epoch 1 iter 456: train loss 3.12702. lr 5.918268e-04:  15%|█▍        | 456/3068 [03:30<20:39,  2.11it/s][A
epoch 1 iter 456: train loss 3.12702. lr 5.918268e-04:  15%|█▍        | 457/3068 [03:30<20:40,  2.10it/s][A
epoch 1 iter 457: train loss 2.94448. lr 5.917912e-04:  15%|█▍        | 457/3068 [03:31<20:40,  2.10it/s][A
epoch 1 iter 457: train loss 2.94448. lr 5.917912e-04:  15%|█▍        | 458/3068 [03:31<20:38,  2.11it/s][A
epoch 1 iter 458: train loss 3.12722. lr 5.917554e-04:  15%|█▍        | 458/3068 [03:31<20:38,  2.11it/s][A
epoch 1 iter 458: train loss 3.12722. lr 5.917554e-04:  15%|█▍        | 459/3068 [03:31<20:33,  2.11it/s][A
epoch 1 iter 459: train loss 3.03598. lr 5.917196e-04:  15%|█▍        | 459/3068 [03:31<20:33,  2.11it/s][A

epoch 1 iter 492: train loss 2.79382. lr 5.904951e-04:  16%|█▌        | 493/3068 [03:47<20:40,  2.08it/s][A
epoch 1 iter 493: train loss 2.72728. lr 5.904567e-04:  16%|█▌        | 493/3068 [03:48<20:40,  2.08it/s][A
epoch 1 iter 493: train loss 2.72728. lr 5.904567e-04:  16%|█▌        | 494/3068 [03:48<20:42,  2.07it/s][A
epoch 1 iter 494: train loss 2.81472. lr 5.904182e-04:  16%|█▌        | 494/3068 [03:48<20:42,  2.07it/s][A
epoch 1 iter 494: train loss 2.81472. lr 5.904182e-04:  16%|█▌        | 495/3068 [03:48<20:33,  2.09it/s][A
epoch 1 iter 495: train loss 2.75490. lr 5.903796e-04:  16%|█▌        | 495/3068 [03:49<20:33,  2.09it/s][A
epoch 1 iter 495: train loss 2.75490. lr 5.903796e-04:  16%|█▌        | 496/3068 [03:49<20:26,  2.10it/s][A
epoch 1 iter 496: train loss 2.85308. lr 5.903410e-04:  16%|█▌        | 496/3068 [03:49<20:26,  2.10it/s][A
epoch 1 iter 496: train loss 2.85308. lr 5.903410e-04:  16%|█▌        | 497/3068 [03:49<20:22,  2.10it/s][A

epoch 1 iter 530: train loss 2.58349. lr 5.889822e-04:  17%|█▋        | 530/3068 [04:05<20:07,  2.10it/s][A
epoch 1 iter 530: train loss 2.58349. lr 5.889822e-04:  17%|█▋        | 531/3068 [04:05<20:05,  2.10it/s][A
epoch 1 iter 531: train loss 2.64723. lr 5.889409e-04:  17%|█▋        | 531/3068 [04:06<20:05,  2.10it/s][A
epoch 1 iter 531: train loss 2.64723. lr 5.889409e-04:  17%|█▋        | 532/3068 [04:06<20:05,  2.10it/s][A
epoch 1 iter 532: train loss 2.61781. lr 5.888996e-04:  17%|█▋        | 532/3068 [04:06<20:05,  2.10it/s][A
epoch 1 iter 532: train loss 2.61781. lr 5.888996e-04:  17%|█▋        | 533/3068 [04:06<20:10,  2.09it/s][A
epoch 1 iter 533: train loss 2.58004. lr 5.888581e-04:  17%|█▋        | 533/3068 [04:07<20:10,  2.09it/s][A
epoch 1 iter 533: train loss 2.58004. lr 5.888581e-04:  17%|█▋        | 534/3068 [04:07<20:17,  2.08it/s][A
epoch 1 iter 534: train loss 2.57315. lr 5.888166e-04:  17%|█▋        | 534/3068 [04:07<20:17,  2.08it/s][A

epoch 1 iter 567: train loss 2.48883. lr 5.874040e-04:  19%|█▊        | 568/3068 [04:23<19:35,  2.13it/s][A
epoch 1 iter 568: train loss 2.43445. lr 5.873599e-04:  19%|█▊        | 568/3068 [04:23<19:35,  2.13it/s][A
epoch 1 iter 568: train loss 2.43445. lr 5.873599e-04:  19%|█▊        | 569/3068 [04:23<19:33,  2.13it/s][A
epoch 1 iter 569: train loss 2.31322. lr 5.873158e-04:  19%|█▊        | 569/3068 [04:24<19:33,  2.13it/s][A
epoch 1 iter 569: train loss 2.31322. lr 5.873158e-04:  19%|█▊        | 570/3068 [04:24<19:40,  2.12it/s][A
epoch 1 iter 570: train loss 2.32443. lr 5.872715e-04:  19%|█▊        | 570/3068 [04:24<19:40,  2.12it/s][A
epoch 1 iter 570: train loss 2.32443. lr 5.872715e-04:  19%|█▊        | 571/3068 [04:24<19:39,  2.12it/s][A
epoch 1 iter 571: train loss 2.33758. lr 5.872272e-04:  19%|█▊        | 571/3068 [04:25<19:39,  2.12it/s][A
epoch 1 iter 571: train loss 2.33758. lr 5.872272e-04:  19%|█▊        | 572/3068 [04:25<19:51,  2.10it/s][A

epoch 1 iter 605: train loss 2.22101. lr 5.856758e-04:  20%|█▉        | 605/3068 [04:41<19:25,  2.11it/s][A
epoch 1 iter 605: train loss 2.22101. lr 5.856758e-04:  20%|█▉        | 606/3068 [04:41<19:21,  2.12it/s][A
epoch 1 iter 606: train loss 2.26718. lr 5.856288e-04:  20%|█▉        | 606/3068 [04:41<19:21,  2.12it/s][A
epoch 1 iter 606: train loss 2.26718. lr 5.856288e-04:  20%|█▉        | 607/3068 [04:41<19:29,  2.10it/s][A
epoch 1 iter 607: train loss 2.16964. lr 5.855818e-04:  20%|█▉        | 607/3068 [04:42<19:29,  2.10it/s][A
epoch 1 iter 607: train loss 2.16964. lr 5.855818e-04:  20%|█▉        | 608/3068 [04:42<19:30,  2.10it/s][A
epoch 1 iter 608: train loss 2.21313. lr 5.855347e-04:  20%|█▉        | 608/3068 [04:42<19:30,  2.10it/s][A
epoch 1 iter 608: train loss 2.21313. lr 5.855347e-04:  20%|█▉        | 609/3068 [04:42<19:29,  2.10it/s][A
epoch 1 iter 609: train loss 2.09934. lr 5.854875e-04:  20%|█▉        | 609/3068 [04:43<19:29,  2.10it/s][A

epoch 1 iter 642: train loss 2.01079. lr 5.838890e-04:  21%|██        | 643/3068 [04:58<18:50,  2.14it/s][A
epoch 1 iter 643: train loss 2.00897. lr 5.838393e-04:  21%|██        | 643/3068 [04:59<18:50,  2.14it/s][A
epoch 1 iter 643: train loss 2.00897. lr 5.838393e-04:  21%|██        | 644/3068 [04:59<18:51,  2.14it/s][A
epoch 1 iter 644: train loss 1.99696. lr 5.837895e-04:  21%|██        | 644/3068 [04:59<18:51,  2.14it/s][A
epoch 1 iter 644: train loss 1.99696. lr 5.837895e-04:  21%|██        | 645/3068 [04:59<18:56,  2.13it/s][A
epoch 1 iter 645: train loss 2.02736. lr 5.837397e-04:  21%|██        | 645/3068 [05:00<18:56,  2.13it/s][A
epoch 1 iter 645: train loss 2.02736. lr 5.837397e-04:  21%|██        | 646/3068 [05:00<18:58,  2.13it/s][A
epoch 1 iter 646: train loss 1.96050. lr 5.836897e-04:  21%|██        | 646/3068 [05:00<18:58,  2.13it/s][A
epoch 1 iter 646: train loss 1.96050. lr 5.836897e-04:  21%|██        | 647/3068 [05:00<18:53,  2.14it/s][A

epoch 1 iter 680: train loss 1.92020. lr 5.819479e-04:  22%|██▏       | 680/3068 [05:16<18:56,  2.10it/s][A
epoch 1 iter 680: train loss 1.92020. lr 5.819479e-04:  22%|██▏       | 681/3068 [05:16<18:57,  2.10it/s][A
epoch 1 iter 681: train loss 1.81220. lr 5.818954e-04:  22%|██▏       | 681/3068 [05:17<18:57,  2.10it/s][A
epoch 1 iter 681: train loss 1.81220. lr 5.818954e-04:  22%|██▏       | 682/3068 [05:17<18:55,  2.10it/s][A
epoch 1 iter 682: train loss 1.79029. lr 5.818428e-04:  22%|██▏       | 682/3068 [05:17<18:55,  2.10it/s][A
epoch 1 iter 682: train loss 1.79029. lr 5.818428e-04:  22%|██▏       | 683/3068 [05:17<18:55,  2.10it/s][A
epoch 1 iter 683: train loss 1.84431. lr 5.817901e-04:  22%|██▏       | 683/3068 [05:18<18:55,  2.10it/s][A
epoch 1 iter 683: train loss 1.84431. lr 5.817901e-04:  22%|██▏       | 684/3068 [05:18<19:13,  2.07it/s][A
epoch 1 iter 684: train loss 1.81712. lr 5.817374e-04:  22%|██▏       | 684/3068 [05:18<19:13,  2.07it/s][A

epoch 1 iter 717: train loss 1.73350. lr 5.799553e-04:  23%|██▎       | 718/3068 [05:34<18:36,  2.11it/s][A
epoch 1 iter 718: train loss 1.58216. lr 5.799000e-04:  23%|██▎       | 718/3068 [05:34<18:36,  2.11it/s][A
epoch 1 iter 718: train loss 1.58216. lr 5.799000e-04:  23%|██▎       | 719/3068 [05:34<18:45,  2.09it/s][A
epoch 1 iter 719: train loss 1.72124. lr 5.798447e-04:  23%|██▎       | 719/3068 [05:35<18:45,  2.09it/s][A
epoch 1 iter 719: train loss 1.72124. lr 5.798447e-04:  23%|██▎       | 720/3068 [05:35<18:30,  2.11it/s][A
epoch 1 iter 720: train loss 1.67542. lr 5.797893e-04:  23%|██▎       | 720/3068 [05:35<18:30,  2.11it/s][A
epoch 1 iter 720: train loss 1.67542. lr 5.797893e-04:  24%|██▎       | 721/3068 [05:35<18:37,  2.10it/s][A
epoch 1 iter 721: train loss 1.63913. lr 5.797338e-04:  24%|██▎       | 721/3068 [05:36<18:37,  2.10it/s][A
epoch 1 iter 721: train loss 1.63913. lr 5.797338e-04:  24%|██▎       | 722/3068 [05:36<18:33,  2.11it/s][A

epoch 1 iter 2784: train loss 0.20877. lr 3.432344e-04:  91%|█████████ | 2785/3068 [19:55<01:44,  2.72it/s][A
epoch 1 iter 2785: train loss 0.21716. lr 3.430824e-04:  91%|█████████ | 2785/3068 [19:55<01:44,  2.72it/s][A
epoch 1 iter 2785: train loss 0.21716. lr 3.430824e-04:  91%|█████████ | 2786/3068 [19:55<01:43,  2.72it/s][A
epoch 1 iter 2786: train loss 0.21107. lr 3.429303e-04:  91%|█████████ | 2786/3068 [19:56<01:43,  2.72it/s][A
epoch 1 iter 2786: train loss 0.21107. lr 3.429303e-04:  91%|█████████ | 2787/3068 [19:56<01:43,  2.71it/s][A
epoch 1 iter 2787: train loss 0.21238. lr 3.427783e-04:  91%|█████████ | 2787/3068 [19:56<01:43,  2.71it/s][A
epoch 1 iter 2787: train loss 0.21238. lr 3.427783e-04:  91%|█████████ | 2788/3068 [19:56<01:43,  2.70it/s][A
epoch 1 iter 2788: train loss 0.22378. lr 3.426262e-04:  91%|█████████ | 2788/3068 [19:56<01:43,  2.70it/s][A
epoch 1 iter 2788: train loss 0.22378. lr 3.426262e-04:  91%|█████████ | 2789/3068 [19:56<01:43,  2.71it/s][A

epoch 1 iter 2821: train loss 0.21136. lr 3.376018e-04:  92%|█████████▏| 2821/3068 [20:08<01:31,  2.71it/s][A
epoch 1 iter 2821: train loss 0.21136. lr 3.376018e-04:  92%|█████████▏| 2822/3068 [20:08<01:30,  2.72it/s][A
epoch 1 iter 2822: train loss 0.20794. lr 3.374494e-04:  92%|█████████▏| 2822/3068 [20:09<01:30,  2.72it/s][A
epoch 1 iter 2822: train loss 0.20794. lr 3.374494e-04:  92%|█████████▏| 2823/3068 [20:09<01:30,  2.71it/s][A
epoch 1 iter 2823: train loss 0.19933. lr 3.372969e-04:  92%|█████████▏| 2823/3068 [20:09<01:30,  2.71it/s][A
epoch 1 iter 2823: train loss 0.19933. lr 3.372969e-04:  92%|█████████▏| 2824/3068 [20:09<01:29,  2.72it/s][A
epoch 1 iter 2824: train loss 0.21910. lr 3.371445e-04:  92%|█████████▏| 2824/3068 [20:10<01:29,  2.72it/s][A
epoch 1 iter 2824: train loss 0.21910. lr 3.371445e-04:  92%|█████████▏| 2825/3068 [20:10<01:29,  2.72it/s][A
epoch 1 iter 2825: train loss 0.21832. lr 3.369920e-04:  92%|█████████▏| 2825/3068 [20:10<01:29,  2.72it/s][A

epoch 1 iter 2857: train loss 0.22255. lr 3.321084e-04:  93%|█████████▎| 2858/3068 [20:22<01:17,  2.72it/s][A
epoch 1 iter 2858: train loss 0.21654. lr 3.319556e-04:  93%|█████████▎| 2858/3068 [20:22<01:17,  2.72it/s][A
epoch 1 iter 2858: train loss 0.21654. lr 3.319556e-04:  93%|█████████▎| 2859/3068 [20:22<01:16,  2.72it/s][A
epoch 1 iter 2859: train loss 0.21560. lr 3.318029e-04:  93%|█████████▎| 2859/3068 [20:22<01:16,  2.72it/s][A
epoch 1 iter 2859: train loss 0.21560. lr 3.318029e-04:  93%|█████████▎| 2860/3068 [20:22<01:16,  2.72it/s][A
epoch 1 iter 2860: train loss 0.19548. lr 3.316501e-04:  93%|█████████▎| 2860/3068 [20:23<01:16,  2.72it/s][A
epoch 1 iter 2860: train loss 0.19548. lr 3.316501e-04:  93%|█████████▎| 2861/3068 [20:23<01:16,  2.72it/s][A
epoch 1 iter 2861: train loss 0.21936. lr 3.314973e-04:  93%|█████████▎| 2861/3068 [20:23<01:16,  2.72it/s][A
epoch 1 iter 2861: train loss 0.21936. lr 3.314973e-04:  93%|█████████▎| 2862/3068 [20:23<01:16,  2.71it/s][A

epoch 1 iter 2894: train loss 0.19639. lr 3.264511e-04:  94%|█████████▍| 2894/3068 [20:35<01:04,  2.71it/s][A
epoch 1 iter 2894: train loss 0.19639. lr 3.264511e-04:  94%|█████████▍| 2895/3068 [20:35<01:03,  2.71it/s][A
epoch 1 iter 2895: train loss 0.21375. lr 3.262980e-04:  94%|█████████▍| 2895/3068 [20:36<01:03,  2.71it/s][A
epoch 1 iter 2895: train loss 0.21375. lr 3.262980e-04:  94%|█████████▍| 2896/3068 [20:36<01:03,  2.71it/s][A
epoch 1 iter 2896: train loss 0.19919. lr 3.261450e-04:  94%|█████████▍| 2896/3068 [20:36<01:03,  2.71it/s][A
epoch 1 iter 2896: train loss 0.19919. lr 3.261450e-04:  94%|█████████▍| 2897/3068 [20:36<01:03,  2.70it/s][A
epoch 1 iter 2897: train loss 0.19805. lr 3.259919e-04:  94%|█████████▍| 2897/3068 [20:36<01:03,  2.70it/s][A
epoch 1 iter 2897: train loss 0.19805. lr 3.259919e-04:  94%|█████████▍| 2898/3068 [20:36<01:03,  2.69it/s][A
epoch 1 iter 2898: train loss 0.21051. lr 3.258389e-04:  94%|█████████▍| 2898/3068 [20:37<01:03,  2.69it/s][A

epoch 1 iter 2930: train loss 0.20851. lr 3.209375e-04:  96%|█████████▌| 2931/3068 [20:49<00:50,  2.71it/s][A
epoch 1 iter 2931: train loss 0.20310. lr 3.207843e-04:  96%|█████████▌| 2931/3068 [20:49<00:50,  2.71it/s][A
epoch 1 iter 2931: train loss 0.20310. lr 3.207843e-04:  96%|█████████▌| 2932/3068 [20:49<00:50,  2.72it/s][A
epoch 1 iter 2932: train loss 0.19597. lr 3.206310e-04:  96%|█████████▌| 2932/3068 [20:49<00:50,  2.72it/s][A
epoch 1 iter 2932: train loss 0.19597. lr 3.206310e-04:  96%|█████████▌| 2933/3068 [20:49<00:49,  2.71it/s][A
epoch 1 iter 2933: train loss 0.21225. lr 3.204777e-04:  96%|█████████▌| 2933/3068 [20:50<00:49,  2.71it/s][A
epoch 1 iter 2933: train loss 0.21225. lr 3.204777e-04:  96%|█████████▌| 2934/3068 [20:50<00:49,  2.71it/s][A
epoch 1 iter 2934: train loss 0.20625. lr 3.203244e-04:  96%|█████████▌| 2934/3068 [20:50<00:49,  2.71it/s][A
epoch 1 iter 2934: train loss 0.20625. lr 3.203244e-04:  96%|█████████▌| 2935/3068 [20:50<00:49,  2.71it/s][A

epoch 1 iter 2967: train loss 0.21892. lr 3.152634e-04:  97%|█████████▋| 2967/3068 [21:02<00:37,  2.70it/s][A
epoch 1 iter 2967: train loss 0.21892. lr 3.152634e-04:  97%|█████████▋| 2968/3068 [21:02<00:37,  2.70it/s][A
epoch 1 iter 2968: train loss 0.19800. lr 3.151100e-04:  97%|█████████▋| 2968/3068 [21:03<00:37,  2.70it/s][A
epoch 1 iter 2968: train loss 0.19800. lr 3.151100e-04:  97%|█████████▋| 2969/3068 [21:03<00:36,  2.70it/s][A
epoch 1 iter 2969: train loss 0.17832. lr 3.149565e-04:  97%|█████████▋| 2969/3068 [21:03<00:36,  2.70it/s][A
epoch 1 iter 2969: train loss 0.17832. lr 3.149565e-04:  97%|█████████▋| 2970/3068 [21:03<00:36,  2.71it/s][A
epoch 1 iter 2970: train loss 0.22423. lr 3.148031e-04:  97%|█████████▋| 2970/3068 [21:03<00:36,  2.71it/s][A
epoch 1 iter 2970: train loss 0.22423. lr 3.148031e-04:  97%|█████████▋| 2971/3068 [21:03<00:35,  2.71it/s][A
epoch 1 iter 2971: train loss 0.20159. lr 3.146496e-04:  97%|█████████▋| 2971/3068 [21:04<00:35,  2.71it/s][A

epoch 1 iter 3003: train loss 0.19227. lr 3.097374e-04:  98%|█████████▊| 3004/3068 [21:16<00:23,  2.71it/s][A
epoch 1 iter 3004: train loss 0.22245. lr 3.095838e-04:  98%|█████████▊| 3004/3068 [21:16<00:23,  2.71it/s][A
epoch 1 iter 3004: train loss 0.22245. lr 3.095838e-04:  98%|█████████▊| 3005/3068 [21:16<00:23,  2.71it/s][A
epoch 1 iter 3005: train loss 0.20036. lr 3.094303e-04:  98%|█████████▊| 3005/3068 [21:16<00:23,  2.71it/s][A
epoch 1 iter 3005: train loss 0.20036. lr 3.094303e-04:  98%|█████████▊| 3006/3068 [21:16<00:22,  2.72it/s][A
epoch 1 iter 3006: train loss 0.21438. lr 3.092767e-04:  98%|█████████▊| 3006/3068 [21:17<00:22,  2.72it/s][A
epoch 1 iter 3006: train loss 0.21438. lr 3.092767e-04:  98%|█████████▊| 3007/3068 [21:17<00:22,  2.72it/s][A
epoch 1 iter 3007: train loss 0.19992. lr 3.091232e-04:  98%|█████████▊| 3007/3068 [21:17<00:22,  2.72it/s][A
epoch 1 iter 3007: train loss 0.19992. lr 3.091232e-04:  98%|█████████▊| 3008/3068 [21:17<00:22,  2.71it/s][A

epoch 1 iter 3040: train loss 0.20918. lr 3.040544e-04:  99%|█████████▉| 3040/3068 [21:29<00:10,  2.71it/s][A
epoch 1 iter 3040: train loss 0.20918. lr 3.040544e-04:  99%|█████████▉| 3041/3068 [21:29<00:09,  2.72it/s][A
epoch 1 iter 3041: train loss 0.19820. lr 3.039008e-04:  99%|█████████▉| 3041/3068 [21:30<00:09,  2.72it/s][A
epoch 1 iter 3041: train loss 0.19820. lr 3.039008e-04:  99%|█████████▉| 3042/3068 [21:30<00:09,  2.71it/s][A
epoch 1 iter 3042: train loss 0.22192. lr 3.037472e-04:  99%|█████████▉| 3042/3068 [21:30<00:09,  2.71it/s][A
epoch 1 iter 3042: train loss 0.22192. lr 3.037472e-04:  99%|█████████▉| 3043/3068 [21:30<00:09,  2.70it/s][A
epoch 1 iter 3043: train loss 0.21159. lr 3.035936e-04:  99%|█████████▉| 3043/3068 [21:30<00:09,  2.70it/s][A
epoch 1 iter 3043: train loss 0.21159. lr 3.035936e-04:  99%|█████████▉| 3044/3068 [21:30<00:08,  2.70it/s][A
epoch 1 iter 3044: train loss 0.19943. lr 3.034399e-04:  99%|█████████▉| 3044/3068 [21:31<00:08,  2.70it/s][A

epoch 2 iter 9: train loss 0.17023. lr 2.984756e-04:   0%|          | 9/3068 [00:03<18:45,  2.72it/s][A
epoch 2 iter 9: train loss 0.17023. lr 2.984756e-04:   0%|          | 10/3068 [00:03<18:52,  2.70it/s][A
epoch 2 iter 10: train loss 0.19791. lr 2.983220e-04:   0%|          | 10/3068 [00:04<18:52,  2.70it/s][A
epoch 2 iter 10: train loss 0.19791. lr 2.983220e-04:   0%|          | 11/3068 [00:04<18:54,  2.69it/s][A
epoch 2 iter 11: train loss 0.18506. lr 2.981684e-04:   0%|          | 11/3068 [00:04<18:54,  2.69it/s][A
epoch 2 iter 11: train loss 0.18506. lr 2.981684e-04:   0%|          | 12/3068 [00:04<18:53,  2.70it/s][A
epoch 2 iter 12: train loss 0.18459. lr 2.980147e-04:   0%|          | 12/3068 [00:04<18:53,  2.70it/s][A
epoch 2 iter 12: train loss 0.18459. lr 2.980147e-04:   0%|          | 13/3068 [00:04<18:51,  2.70it/s][A
epoch 2 iter 13: train loss 0.19482. lr 2.978611e-04:   0%|          | 13/3068 [00:05<18:51,  2.70it/s][A

epoch 2 iter 47: train loss 0.18471. lr 2.926382e-04:   2%|▏         | 47/3068 [00:17<18:34,  2.71it/s][A
epoch 2 iter 47: train loss 0.18471. lr 2.926382e-04:   2%|▏         | 48/3068 [00:17<18:33,  2.71it/s][A
epoch 2 iter 48: train loss 0.17999. lr 2.924846e-04:   2%|▏         | 48/3068 [00:18<18:33,  2.71it/s][A
epoch 2 iter 48: train loss 0.17999. lr 2.924846e-04:   2%|▏         | 49/3068 [00:18<18:32,  2.71it/s][A
epoch 2 iter 49: train loss 0.19208. lr 2.923310e-04:   2%|▏         | 49/3068 [00:18<18:32,  2.71it/s][A
epoch 2 iter 49: train loss 0.19208. lr 2.923310e-04:   2%|▏         | 50/3068 [00:18<18:34,  2.71it/s][A
epoch 2 iter 50: train loss 0.21004. lr 2.921774e-04:   2%|▏         | 50/3068 [00:18<18:34,  2.71it/s][A
epoch 2 iter 50: train loss 0.21004. lr 2.921774e-04:   2%|▏         | 51/3068 [00:18<18:37,  2.70it/s][A
epoch 2 iter 51: train loss 0.18784. lr 2.920238e-04:   2%|▏         | 51/3068 [00:19<18:37,  2.70it/s][A

epoch 2 iter 85: train loss 0.17725. lr 2.868035e-04:   3%|▎         | 85/3068 [00:31<18:21,  2.71it/s][A
epoch 2 iter 85: train loss 0.17725. lr 2.868035e-04:   3%|▎         | 86/3068 [00:31<18:15,  2.72it/s][A
epoch 2 iter 86: train loss 0.19060. lr 2.866500e-04:   3%|▎         | 86/3068 [00:32<18:15,  2.72it/s][A
epoch 2 iter 86: train loss 0.19060. lr 2.866500e-04:   3%|▎         | 87/3068 [00:32<18:19,  2.71it/s][A
epoch 2 iter 87: train loss 0.18835. lr 2.864966e-04:   3%|▎         | 87/3068 [00:32<18:19,  2.71it/s][A
epoch 2 iter 87: train loss 0.18835. lr 2.864966e-04:   3%|▎         | 88/3068 [00:32<18:21,  2.71it/s][A
epoch 2 iter 88: train loss 0.18984. lr 2.863431e-04:   3%|▎         | 88/3068 [00:32<18:21,  2.71it/s][A
epoch 2 iter 88: train loss 0.18984. lr 2.863431e-04:   3%|▎         | 89/3068 [00:32<18:19,  2.71it/s][A
epoch 2 iter 89: train loss 0.18691. lr 2.861896e-04:   3%|▎         | 89/3068 [00:33<18:19,  2.71it/s][A

epoch 2 iter 122: train loss 0.18293. lr 2.811272e-04:   4%|▍         | 123/3068 [00:45<18:12,  2.70it/s][A
epoch 2 iter 123: train loss 0.18297. lr 2.809739e-04:   4%|▍         | 123/3068 [00:45<18:12,  2.70it/s][A
epoch 2 iter 123: train loss 0.18297. lr 2.809739e-04:   4%|▍         | 124/3068 [00:45<18:12,  2.70it/s][A
epoch 2 iter 124: train loss 0.17788. lr 2.808205e-04:   4%|▍         | 124/3068 [00:46<18:12,  2.70it/s][A
epoch 2 iter 124: train loss 0.17788. lr 2.808205e-04:   4%|▍         | 125/3068 [00:46<18:09,  2.70it/s][A
epoch 2 iter 125: train loss 0.19670. lr 2.806672e-04:   4%|▍         | 125/3068 [00:46<18:09,  2.70it/s][A
epoch 2 iter 125: train loss 0.19670. lr 2.806672e-04:   4%|▍         | 126/3068 [00:46<18:07,  2.70it/s][A
epoch 2 iter 126: train loss 0.19656. lr 2.805139e-04:   4%|▍         | 126/3068 [00:46<18:07,  2.70it/s][A
epoch 2 iter 126: train loss 0.19656. lr 2.805139e-04:   4%|▍         | 127/3068 [00:46<18:02,  2.72it/s][A

epoch 2 iter 160: train loss 0.17943. lr 2.753045e-04:   5%|▌         | 160/3068 [00:59<17:57,  2.70it/s][A
epoch 2 iter 160: train loss 0.17943. lr 2.753045e-04:   5%|▌         | 161/3068 [00:59<17:58,  2.70it/s][A
epoch 2 iter 161: train loss 0.19253. lr 2.751514e-04:   5%|▌         | 161/3068 [00:59<17:58,  2.70it/s][A
epoch 2 iter 161: train loss 0.19253. lr 2.751514e-04:   5%|▌         | 162/3068 [00:59<17:56,  2.70it/s][A
epoch 2 iter 162: train loss 0.16877. lr 2.749983e-04:   5%|▌         | 162/3068 [01:00<17:56,  2.70it/s][A
epoch 2 iter 162: train loss 0.16877. lr 2.749983e-04:   5%|▌         | 163/3068 [01:00<17:55,  2.70it/s][A
epoch 2 iter 163: train loss 0.18645. lr 2.748452e-04:   5%|▌         | 163/3068 [01:00<17:55,  2.70it/s][A
epoch 2 iter 163: train loss 0.18645. lr 2.748452e-04:   5%|▌         | 164/3068 [01:00<18:05,  2.68it/s][A
epoch 2 iter 164: train loss 0.17830. lr 2.746921e-04:   5%|▌         | 164/3068 [01:01<18:05,  2.68it/s][A

epoch 2 iter 197: train loss 0.17309. lr 2.696440e-04:   6%|▋         | 198/3068 [01:13<17:44,  2.70it/s][A
epoch 2 iter 198: train loss 0.17078. lr 2.694912e-04:   6%|▋         | 198/3068 [01:13<17:44,  2.70it/s][A
epoch 2 iter 198: train loss 0.17078. lr 2.694912e-04:   6%|▋         | 199/3068 [01:13<17:42,  2.70it/s][A
epoch 2 iter 199: train loss 0.17911. lr 2.693384e-04:   6%|▋         | 199/3068 [01:13<17:42,  2.70it/s][A
epoch 2 iter 199: train loss 0.17911. lr 2.693384e-04:   7%|▋         | 200/3068 [01:13<17:38,  2.71it/s][A
epoch 2 iter 200: train loss 0.19128. lr 2.691855e-04:   7%|▋         | 200/3068 [01:14<17:38,  2.71it/s][A
epoch 2 iter 200: train loss 0.19128. lr 2.691855e-04:   7%|▋         | 201/3068 [01:14<17:44,  2.69it/s][A
epoch 2 iter 201: train loss 0.19063. lr 2.690327e-04:   7%|▋         | 201/3068 [01:14<17:44,  2.69it/s][A
epoch 2 iter 201: train loss 0.19063. lr 2.690327e-04:   7%|▋         | 202/3068 [01:14<17:48,  2.68it/s][A

epoch 2 iter 2389: train loss 0.12430. lr 6.000000e-05:  78%|███████▊  | 2390/3068 [14:42<04:09,  2.72it/s][A
epoch 2 iter 2390: train loss 0.13429. lr 6.000000e-05:  78%|███████▊  | 2390/3068 [14:42<04:09,  2.72it/s][A
epoch 2 iter 2390: train loss 0.13429. lr 6.000000e-05:  78%|███████▊  | 2391/3068 [14:42<04:09,  2.71it/s][A
epoch 2 iter 2391: train loss 0.13117. lr 6.000000e-05:  78%|███████▊  | 2391/3068 [14:43<04:09,  2.71it/s][A
epoch 2 iter 2391: train loss 0.13117. lr 6.000000e-05:  78%|███████▊  | 2392/3068 [14:43<04:09,  2.71it/s][A
epoch 2 iter 2392: train loss 0.14134. lr 6.000000e-05:  78%|███████▊  | 2392/3068 [14:43<04:09,  2.71it/s][A
epoch 2 iter 2392: train loss 0.14134. lr 6.000000e-05:  78%|███████▊  | 2393/3068 [14:43<04:08,  2.71it/s][A
epoch 2 iter 2393: train loss 0.14751. lr 6.000000e-05:  78%|███████▊  | 2393/3068 [14:43<04:08,  2.71it/s][A
epoch 2 iter 2393: train loss 0.14751. lr 6.000000e-05:  78%|███████▊  | 2394/3068 [14:43<04:08,  2.72it/s][A

epoch 2 iter 2426: train loss 0.13977. lr 6.000000e-05:  79%|███████▉  | 2426/3068 [14:55<03:56,  2.71it/s][A
epoch 2 iter 2426: train loss 0.13977. lr 6.000000e-05:  79%|███████▉  | 2427/3068 [14:55<03:56,  2.71it/s][A
epoch 2 iter 2427: train loss 0.13132. lr 6.000000e-05:  79%|███████▉  | 2427/3068 [14:56<03:56,  2.71it/s][A
epoch 2 iter 2427: train loss 0.13132. lr 6.000000e-05:  79%|███████▉  | 2428/3068 [14:56<03:55,  2.71it/s][A
epoch 2 iter 2428: train loss 0.14492. lr 6.000000e-05:  79%|███████▉  | 2428/3068 [14:56<03:55,  2.71it/s][A
epoch 2 iter 2428: train loss 0.14492. lr 6.000000e-05:  79%|███████▉  | 2429/3068 [14:56<03:55,  2.71it/s][A
epoch 2 iter 2429: train loss 0.13893. lr 6.000000e-05:  79%|███████▉  | 2429/3068 [14:57<03:55,  2.71it/s][A
epoch 2 iter 2429: train loss 0.13893. lr 6.000000e-05:  79%|███████▉  | 2430/3068 [14:57<03:54,  2.72it/s][A
epoch 2 iter 2430: train loss 0.13347. lr 6.000000e-05:  79%|███████▉  | 2430/3068 [14:57<03:54,  2.72it/s][A

epoch 2 iter 2462: train loss 0.13487. lr 6.000000e-05:  80%|████████  | 2463/3068 [15:09<03:43,  2.71it/s][A
epoch 2 iter 2463: train loss 0.13521. lr 6.000000e-05:  80%|████████  | 2463/3068 [15:09<03:43,  2.71it/s][A
epoch 2 iter 2463: train loss 0.13521. lr 6.000000e-05:  80%|████████  | 2464/3068 [15:09<03:42,  2.71it/s][A
epoch 2 iter 2464: train loss 0.13974. lr 6.000000e-05:  80%|████████  | 2464/3068 [15:09<03:42,  2.71it/s][A
epoch 2 iter 2464: train loss 0.13974. lr 6.000000e-05:  80%|████████  | 2465/3068 [15:09<03:41,  2.72it/s][A
epoch 2 iter 2465: train loss 0.13056. lr 6.000000e-05:  80%|████████  | 2465/3068 [15:10<03:41,  2.72it/s][A
epoch 2 iter 2465: train loss 0.13056. lr 6.000000e-05:  80%|████████  | 2466/3068 [15:10<03:42,  2.71it/s][A
epoch 2 iter 2466: train loss 0.13666. lr 6.000000e-05:  80%|████████  | 2466/3068 [15:10<03:42,  2.71it/s][A
epoch 2 iter 2466: train loss 0.13666. lr 6.000000e-05:  80%|████████  | 2467/3068 [15:10<03:42,  2.71it/s][A

epoch 2 iter 2499: train loss 0.13342. lr 6.000000e-05:  81%|████████▏ | 2499/3068 [15:22<03:29,  2.71it/s][A
epoch 2 iter 2499: train loss 0.13342. lr 6.000000e-05:  81%|████████▏ | 2500/3068 [15:22<03:28,  2.72it/s][A
epoch 2 iter 2500: train loss 0.12997. lr 6.000000e-05:  81%|████████▏ | 2500/3068 [15:23<03:28,  2.72it/s][A
epoch 2 iter 2500: train loss 0.12997. lr 6.000000e-05:  82%|████████▏ | 2501/3068 [15:23<03:29,  2.71it/s][A
epoch 2 iter 2501: train loss 0.12976. lr 6.000000e-05:  82%|████████▏ | 2501/3068 [15:23<03:29,  2.71it/s][A
epoch 2 iter 2501: train loss 0.12976. lr 6.000000e-05:  82%|████████▏ | 2502/3068 [15:23<03:29,  2.71it/s][A
epoch 2 iter 2502: train loss 0.14274. lr 6.000000e-05:  82%|████████▏ | 2502/3068 [15:23<03:29,  2.71it/s][A
epoch 2 iter 2502: train loss 0.14274. lr 6.000000e-05:  82%|████████▏ | 2503/3068 [15:23<03:28,  2.71it/s][A
epoch 2 iter 2503: train loss 0.13874. lr 6.000000e-05:  82%|████████▏ | 2503/3068 [15:24<03:28,  2.71it/s][A

epoch 2 iter 2535: train loss 0.12606. lr 6.000000e-05:  83%|████████▎ | 2536/3068 [15:36<03:16,  2.71it/s][A
epoch 2 iter 2536: train loss 0.13734. lr 6.000000e-05:  83%|████████▎ | 2536/3068 [15:36<03:16,  2.71it/s][A
epoch 2 iter 2536: train loss 0.13734. lr 6.000000e-05:  83%|████████▎ | 2537/3068 [15:36<03:15,  2.71it/s][A
epoch 2 iter 2537: train loss 0.13770. lr 6.000000e-05:  83%|████████▎ | 2537/3068 [15:36<03:15,  2.71it/s][A
epoch 2 iter 2537: train loss 0.13770. lr 6.000000e-05:  83%|████████▎ | 2538/3068 [15:36<03:15,  2.71it/s][A
epoch 2 iter 2538: train loss 0.13707. lr 6.000000e-05:  83%|████████▎ | 2538/3068 [15:37<03:15,  2.71it/s][A
epoch 2 iter 2538: train loss 0.13707. lr 6.000000e-05:  83%|████████▎ | 2539/3068 [15:37<03:14,  2.72it/s][A
epoch 2 iter 2539: train loss 0.13191. lr 6.000000e-05:  83%|████████▎ | 2539/3068 [15:37<03:14,  2.72it/s][A
epoch 2 iter 2539: train loss 0.13191. lr 6.000000e-05:  83%|████████▎ | 2540/3068 [15:37<03:13,  2.72it/s][A

epoch 2 iter 2572: train loss 0.12683. lr 6.000000e-05:  84%|████████▍ | 2572/3068 [15:49<03:02,  2.71it/s][A
epoch 2 iter 2572: train loss 0.12683. lr 6.000000e-05:  84%|████████▍ | 2573/3068 [15:49<03:02,  2.71it/s][A
epoch 2 iter 2573: train loss 0.14190. lr 6.000000e-05:  84%|████████▍ | 2573/3068 [15:50<03:02,  2.71it/s][A
epoch 2 iter 2573: train loss 0.14190. lr 6.000000e-05:  84%|████████▍ | 2574/3068 [15:50<03:01,  2.72it/s][A
epoch 2 iter 2574: train loss 0.13555. lr 6.000000e-05:  84%|████████▍ | 2574/3068 [15:50<03:01,  2.72it/s][A
epoch 2 iter 2574: train loss 0.13555. lr 6.000000e-05:  84%|████████▍ | 2575/3068 [15:50<03:02,  2.71it/s][A
epoch 2 iter 2575: train loss 0.13786. lr 6.000000e-05:  84%|████████▍ | 2575/3068 [15:50<03:02,  2.71it/s][A
epoch 2 iter 2575: train loss 0.13786. lr 6.000000e-05:  84%|████████▍ | 2576/3068 [15:50<03:02,  2.70it/s][A
epoch 2 iter 2576: train loss 0.14381. lr 6.000000e-05:  84%|████████▍ | 2576/3068 [15:51<03:02,  2.70it/s][A

epoch 2 iter 2608: train loss 0.14088. lr 6.000000e-05:  85%|████████▌ | 2609/3068 [16:03<02:49,  2.71it/s][A
epoch 2 iter 2609: train loss 0.12216. lr 6.000000e-05:  85%|████████▌ | 2609/3068 [16:03<02:49,  2.71it/s][A
epoch 2 iter 2609: train loss 0.12216. lr 6.000000e-05:  85%|████████▌ | 2610/3068 [16:03<02:49,  2.71it/s][A
epoch 2 iter 2610: train loss 0.12808. lr 6.000000e-05:  85%|████████▌ | 2610/3068 [16:03<02:49,  2.71it/s][A
epoch 2 iter 2610: train loss 0.12808. lr 6.000000e-05:  85%|████████▌ | 2611/3068 [16:03<02:49,  2.70it/s][A
epoch 2 iter 2611: train loss 0.13842. lr 6.000000e-05:  85%|████████▌ | 2611/3068 [16:04<02:49,  2.70it/s][A
epoch 2 iter 2611: train loss 0.13842. lr 6.000000e-05:  85%|████████▌ | 2612/3068 [16:04<02:48,  2.70it/s][A
epoch 2 iter 2612: train loss 0.14681. lr 6.000000e-05:  85%|████████▌ | 2612/3068 [16:04<02:48,  2.70it/s][A
epoch 2 iter 2612: train loss 0.14681. lr 6.000000e-05:  85%|████████▌ | 2613/3068 [16:04<02:48,  2.71it/s][A

epoch 2 iter 2645: train loss 0.12785. lr 6.000000e-05:  86%|████████▌ | 2645/3068 [16:16<02:35,  2.72it/s][A
epoch 2 iter 2645: train loss 0.12785. lr 6.000000e-05:  86%|████████▌ | 2646/3068 [16:16<02:35,  2.71it/s][A
epoch 2 iter 2646: train loss 0.14223. lr 6.000000e-05:  86%|████████▌ | 2646/3068 [16:17<02:35,  2.71it/s][A
epoch 2 iter 2646: train loss 0.14223. lr 6.000000e-05:  86%|████████▋ | 2647/3068 [16:17<02:35,  2.71it/s][A
epoch 2 iter 2647: train loss 0.12838. lr 6.000000e-05:  86%|████████▋ | 2647/3068 [16:17<02:35,  2.71it/s][A
epoch 2 iter 2647: train loss 0.12838. lr 6.000000e-05:  86%|████████▋ | 2648/3068 [16:17<02:34,  2.71it/s][A
epoch 2 iter 2648: train loss 0.12387. lr 6.000000e-05:  86%|████████▋ | 2648/3068 [16:17<02:34,  2.71it/s][A
epoch 2 iter 2648: train loss 0.12387. lr 6.000000e-05:  86%|████████▋ | 2649/3068 [16:17<02:34,  2.71it/s][A
epoch 2 iter 2649: train loss 0.12362. lr 6.000000e-05:  86%|████████▋ | 2649/3068 [16:18<02:34,  2.71it/s][A

epoch 2 iter 2681: train loss 0.14660. lr 6.000000e-05:  87%|████████▋ | 2682/3068 [16:29<02:22,  2.72it/s][A
epoch 2 iter 2682: train loss 0.13567. lr 6.000000e-05:  87%|████████▋ | 2682/3068 [16:30<02:22,  2.72it/s][A
epoch 2 iter 2682: train loss 0.13567. lr 6.000000e-05:  87%|████████▋ | 2683/3068 [16:30<02:21,  2.72it/s][A
epoch 2 iter 2683: train loss 0.13849. lr 6.000000e-05:  87%|████████▋ | 2683/3068 [16:30<02:21,  2.72it/s][A
epoch 2 iter 2683: train loss 0.13849. lr 6.000000e-05:  87%|████████▋ | 2684/3068 [16:30<02:21,  2.72it/s][A
epoch 2 iter 2684: train loss 0.13349. lr 6.000000e-05:  87%|████████▋ | 2684/3068 [16:31<02:21,  2.72it/s][A
epoch 2 iter 2684: train loss 0.13349. lr 6.000000e-05:  88%|████████▊ | 2685/3068 [16:31<02:21,  2.70it/s][A
epoch 2 iter 2685: train loss 0.13134. lr 6.000000e-05:  88%|████████▊ | 2685/3068 [16:31<02:21,  2.70it/s][A
epoch 2 iter 2685: train loss 0.13134. lr 6.000000e-05:  88%|████████▊ | 2686/3068 [16:31<02:21,  2.70it/s][A

epoch 2 iter 2718: train loss 0.12877. lr 6.000000e-05:  89%|████████▊ | 2718/3068 [16:43<02:08,  2.71it/s][A
epoch 2 iter 2718: train loss 0.12877. lr 6.000000e-05:  89%|████████▊ | 2719/3068 [16:43<02:08,  2.71it/s][A
epoch 2 iter 2719: train loss 0.13705. lr 6.000000e-05:  89%|████████▊ | 2719/3068 [16:43<02:08,  2.71it/s][A
epoch 2 iter 2719: train loss 0.13705. lr 6.000000e-05:  89%|████████▊ | 2720/3068 [16:43<02:08,  2.71it/s][A
epoch 2 iter 2720: train loss 0.13535. lr 6.000000e-05:  89%|████████▊ | 2720/3068 [16:44<02:08,  2.71it/s][A
epoch 2 iter 2720: train loss 0.13535. lr 6.000000e-05:  89%|████████▊ | 2721/3068 [16:44<02:08,  2.70it/s][A
epoch 2 iter 2721: train loss 0.12529. lr 6.000000e-05:  89%|████████▊ | 2721/3068 [16:44<02:08,  2.70it/s][A
epoch 2 iter 2721: train loss 0.12529. lr 6.000000e-05:  89%|████████▊ | 2722/3068 [16:44<02:08,  2.70it/s][A
epoch 2 iter 2722: train loss 0.12780. lr 6.000000e-05:  89%|████████▊ | 2722/3068 [16:45<02:08,  2.70it/s][A

epoch 2 iter 2754: train loss 0.13954. lr 6.000000e-05:  90%|████████▉ | 2755/3068 [16:56<01:55,  2.72it/s][A
epoch 2 iter 2755: train loss 0.14008. lr 6.000000e-05:  90%|████████▉ | 2755/3068 [16:57<01:55,  2.72it/s][A
epoch 2 iter 2755: train loss 0.14008. lr 6.000000e-05:  90%|████████▉ | 2756/3068 [16:57<01:55,  2.71it/s][A
epoch 2 iter 2756: train loss 0.15282. lr 6.000000e-05:  90%|████████▉ | 2756/3068 [16:57<01:55,  2.71it/s][A
epoch 2 iter 2756: train loss 0.15282. lr 6.000000e-05:  90%|████████▉ | 2757/3068 [16:57<01:54,  2.71it/s][A
epoch 2 iter 2757: train loss 0.12857. lr 6.000000e-05:  90%|████████▉ | 2757/3068 [16:57<01:54,  2.71it/s][A
epoch 2 iter 2757: train loss 0.12857. lr 6.000000e-05:  90%|████████▉ | 2758/3068 [16:57<01:54,  2.71it/s][A
epoch 2 iter 2758: train loss 0.13507. lr 6.000000e-05:  90%|████████▉ | 2758/3068 [16:58<01:54,  2.71it/s][A
epoch 2 iter 2758: train loss 0.13507. lr 6.000000e-05:  90%|████████▉ | 2759/3068 [16:58<01:53,  2.71it/s][A

epoch 2 iter 2791: train loss 0.13031. lr 6.000000e-05:  91%|█████████ | 2791/3068 [17:10<01:42,  2.71it/s][A
epoch 2 iter 2791: train loss 0.13031. lr 6.000000e-05:  91%|█████████ | 2792/3068 [17:10<01:41,  2.71it/s][A
epoch 2 iter 2792: train loss 0.12505. lr 6.000000e-05:  91%|█████████ | 2792/3068 [17:10<01:41,  2.71it/s][A
epoch 2 iter 2792: train loss 0.12505. lr 6.000000e-05:  91%|█████████ | 2793/3068 [17:10<01:41,  2.71it/s][A
epoch 2 iter 2793: train loss 0.12558. lr 6.000000e-05:  91%|█████████ | 2793/3068 [17:11<01:41,  2.71it/s][A
epoch 2 iter 2793: train loss 0.12558. lr 6.000000e-05:  91%|█████████ | 2794/3068 [17:11<01:41,  2.71it/s][A
epoch 2 iter 2794: train loss 0.13762. lr 6.000000e-05:  91%|█████████ | 2794/3068 [17:11<01:41,  2.71it/s][A
epoch 2 iter 2794: train loss 0.13762. lr 6.000000e-05:  91%|█████████ | 2795/3068 [17:11<01:40,  2.72it/s][A
epoch 2 iter 2795: train loss 0.13574. lr 6.000000e-05:  91%|█████████ | 2795/3068 [17:11<01:40,  2.72it/s][A

epoch 2 iter 2827: train loss 0.14237. lr 6.000000e-05:  92%|█████████▏| 2828/3068 [17:23<01:33,  2.55it/s][A
epoch 2 iter 2828: train loss 0.13848. lr 6.000000e-05:  92%|█████████▏| 2828/3068 [17:24<01:33,  2.55it/s][A
epoch 2 iter 2828: train loss 0.13848. lr 6.000000e-05:  92%|█████████▏| 2829/3068 [17:24<01:33,  2.57it/s][A
epoch 2 iter 2829: train loss 0.13248. lr 6.000000e-05:  92%|█████████▏| 2829/3068 [17:24<01:33,  2.57it/s][A
epoch 2 iter 2829: train loss 0.13248. lr 6.000000e-05:  92%|█████████▏| 2830/3068 [17:24<01:32,  2.58it/s][A
epoch 2 iter 2830: train loss 0.14294. lr 6.000000e-05:  92%|█████████▏| 2830/3068 [17:25<01:32,  2.58it/s][A
epoch 2 iter 2830: train loss 0.14294. lr 6.000000e-05:  92%|█████████▏| 2831/3068 [17:25<01:31,  2.58it/s][A
epoch 2 iter 2831: train loss 0.13331. lr 6.000000e-05:  92%|█████████▏| 2831/3068 [17:25<01:31,  2.58it/s][A
epoch 2 iter 2831: train loss 0.13331. lr 6.000000e-05:  92%|█████████▏| 2832/3068 [17:25<01:31,  2.57it/s][A

epoch 1 iter 1931: train loss 0.21719. lr 3.801807e-04:  83%|████████▎ | 1932/2334 [12:34<02:28,  2.70it/s][A
epoch 1 iter 1932: train loss 0.21522. lr 3.799861e-04:  83%|████████▎ | 1932/2334 [12:34<02:28,  2.70it/s][A
epoch 1 iter 1932: train loss 0.21522. lr 3.799861e-04:  83%|████████▎ | 1933/2334 [12:34<02:28,  2.70it/s][A
epoch 1 iter 1933: train loss 0.22861. lr 3.797915e-04:  83%|████████▎ | 1933/2334 [12:34<02:28,  2.70it/s][A
epoch 1 iter 1933: train loss 0.22861. lr 3.797915e-04:  83%|████████▎ | 1934/2334 [12:34<02:27,  2.71it/s][A
epoch 1 iter 1934: train loss 0.21866. lr 3.795968e-04:  83%|████████▎ | 1934/2334 [12:35<02:27,  2.71it/s][A
epoch 1 iter 1934: train loss 0.21866. lr 3.795968e-04:  83%|████████▎ | 1935/2334 [12:35<02:27,  2.71it/s][A
epoch 1 iter 1935: train loss 0.23779. lr 3.794021e-04:  83%|████████▎ | 1935/2334 [12:35<02:27,  2.71it/s][A
epoch 1 iter 1935: train loss 0.23779. lr 3.794021e-04:  83%|████████▎ | 1936/2334 [12:35<02:26,  2.72it/s][A

epoch 1 iter 1968: train loss 0.22270. lr 3.729575e-04:  84%|████████▍ | 1968/2334 [12:47<02:15,  2.70it/s][A
epoch 1 iter 1968: train loss 0.22270. lr 3.729575e-04:  84%|████████▍ | 1969/2334 [12:47<02:14,  2.71it/s][A
epoch 1 iter 1969: train loss 0.20334. lr 3.727616e-04:  84%|████████▍ | 1969/2334 [12:48<02:14,  2.71it/s][A
epoch 1 iter 1969: train loss 0.20334. lr 3.727616e-04:  84%|████████▍ | 1970/2334 [12:48<02:14,  2.71it/s][A
epoch 1 iter 1970: train loss 0.22626. lr 3.725657e-04:  84%|████████▍ | 1970/2334 [12:48<02:14,  2.71it/s][A
epoch 1 iter 1970: train loss 0.22626. lr 3.725657e-04:  84%|████████▍ | 1971/2334 [12:48<02:14,  2.70it/s][A
epoch 1 iter 1971: train loss 0.20011. lr 3.723697e-04:  84%|████████▍ | 1971/2334 [12:48<02:14,  2.70it/s][A
epoch 1 iter 1971: train loss 0.20011. lr 3.723697e-04:  84%|████████▍ | 1972/2334 [12:48<02:14,  2.70it/s][A
epoch 1 iter 1972: train loss 0.21905. lr 3.721738e-04:  84%|████████▍ | 1972/2334 [12:49<02:14,  2.70it/s][A

epoch 1 iter 2004: train loss 0.20838. lr 3.658860e-04:  86%|████████▌ | 2005/2334 [13:00<02:01,  2.71it/s][A
epoch 1 iter 2005: train loss 0.20352. lr 3.656890e-04:  86%|████████▌ | 2005/2334 [13:01<02:01,  2.71it/s][A
epoch 1 iter 2005: train loss 0.20352. lr 3.656890e-04:  86%|████████▌ | 2006/2334 [13:01<02:00,  2.72it/s][A
epoch 1 iter 2006: train loss 0.22469. lr 3.654919e-04:  86%|████████▌ | 2006/2334 [13:01<02:00,  2.72it/s][A
epoch 1 iter 2006: train loss 0.22469. lr 3.654919e-04:  86%|████████▌ | 2007/2334 [13:01<02:00,  2.71it/s][A
epoch 1 iter 2007: train loss 0.21667. lr 3.652949e-04:  86%|████████▌ | 2007/2334 [13:02<02:00,  2.71it/s][A
epoch 1 iter 2007: train loss 0.21667. lr 3.652949e-04:  86%|████████▌ | 2008/2334 [13:02<02:00,  2.71it/s][A
epoch 1 iter 2008: train loss 0.22304. lr 3.650978e-04:  86%|████████▌ | 2008/2334 [13:02<02:00,  2.71it/s][A
epoch 1 iter 2008: train loss 0.22304. lr 3.650978e-04:  86%|████████▌ | 2009/2334 [13:02<01:59,  2.71it/s][A

epoch 1 iter 2041: train loss 0.20344. lr 3.585778e-04:  87%|████████▋ | 2041/2334 [13:14<01:48,  2.71it/s][A
epoch 1 iter 2041: train loss 0.20344. lr 3.585778e-04:  87%|████████▋ | 2042/2334 [13:14<01:47,  2.71it/s][A
epoch 1 iter 2042: train loss 0.18414. lr 3.583797e-04:  87%|████████▋ | 2042/2334 [13:14<01:47,  2.71it/s][A
epoch 1 iter 2042: train loss 0.18414. lr 3.583797e-04:  88%|████████▊ | 2043/2334 [13:14<01:47,  2.72it/s][A
epoch 1 iter 2043: train loss 0.20187. lr 3.581817e-04:  88%|████████▊ | 2043/2334 [13:15<01:47,  2.72it/s][A
epoch 1 iter 2043: train loss 0.20187. lr 3.581817e-04:  88%|████████▊ | 2044/2334 [13:15<01:46,  2.72it/s][A
epoch 1 iter 2044: train loss 0.20114. lr 3.579836e-04:  88%|████████▊ | 2044/2334 [13:15<01:46,  2.72it/s][A
epoch 1 iter 2044: train loss 0.20114. lr 3.579836e-04:  88%|████████▊ | 2045/2334 [13:15<01:46,  2.71it/s][A
epoch 1 iter 2045: train loss 0.20867. lr 3.577854e-04:  88%|████████▊ | 2045/2334 [13:16<01:46,  2.71it/s][A

epoch 1 iter 2077: train loss 0.20512. lr 3.514322e-04:  89%|████████▉ | 2078/2334 [13:27<01:34,  2.71it/s][A
epoch 1 iter 2078: train loss 0.20748. lr 3.512332e-04:  89%|████████▉ | 2078/2334 [13:28<01:34,  2.71it/s][A
epoch 1 iter 2078: train loss 0.20748. lr 3.512332e-04:  89%|████████▉ | 2079/2334 [13:28<01:33,  2.72it/s][A
epoch 1 iter 2079: train loss 0.20998. lr 3.510343e-04:  89%|████████▉ | 2079/2334 [13:28<01:33,  2.72it/s][A
epoch 1 iter 2079: train loss 0.20998. lr 3.510343e-04:  89%|████████▉ | 2080/2334 [13:28<01:33,  2.71it/s][A
epoch 1 iter 2080: train loss 0.21480. lr 3.508353e-04:  89%|████████▉ | 2080/2334 [13:28<01:33,  2.71it/s][A
epoch 1 iter 2080: train loss 0.21480. lr 3.508353e-04:  89%|████████▉ | 2081/2334 [13:28<01:33,  2.71it/s][A
epoch 1 iter 2081: train loss 0.19696. lr 3.506363e-04:  89%|████████▉ | 2081/2334 [13:29<01:33,  2.71it/s][A
epoch 1 iter 2081: train loss 0.19696. lr 3.506363e-04:  89%|████████▉ | 2082/2334 [13:29<01:33,  2.70it/s][A

epoch 1 iter 2114: train loss 0.21313. lr 3.440567e-04:  91%|█████████ | 2114/2334 [13:41<01:21,  2.71it/s][A
epoch 1 iter 2114: train loss 0.21313. lr 3.440567e-04:  91%|█████████ | 2115/2334 [13:41<01:20,  2.71it/s][A
epoch 1 iter 2115: train loss 0.21067. lr 3.438570e-04:  91%|█████████ | 2115/2334 [13:41<01:20,  2.71it/s][A
epoch 1 iter 2115: train loss 0.21067. lr 3.438570e-04:  91%|█████████ | 2116/2334 [13:41<01:20,  2.72it/s][A
epoch 1 iter 2116: train loss 0.20825. lr 3.436572e-04:  91%|█████████ | 2116/2334 [13:42<01:20,  2.72it/s][A
epoch 1 iter 2116: train loss 0.20825. lr 3.436572e-04:  91%|█████████ | 2117/2334 [13:42<01:20,  2.71it/s][A
epoch 1 iter 2117: train loss 0.20172. lr 3.434574e-04:  91%|█████████ | 2117/2334 [13:42<01:20,  2.71it/s][A
epoch 1 iter 2117: train loss 0.20172. lr 3.434574e-04:  91%|█████████ | 2118/2334 [13:42<01:19,  2.71it/s][A
epoch 1 iter 2118: train loss 0.21838. lr 3.432576e-04:  91%|█████████ | 2118/2334 [13:42<01:19,  2.71it/s][A

epoch 1 iter 2150: train loss 0.20098. lr 3.368543e-04:  92%|█████████▏| 2151/2334 [13:54<01:07,  2.70it/s][A
epoch 1 iter 2151: train loss 0.18253. lr 3.366539e-04:  92%|█████████▏| 2151/2334 [13:55<01:07,  2.70it/s][A
epoch 1 iter 2151: train loss 0.18253. lr 3.366539e-04:  92%|█████████▏| 2152/2334 [13:55<01:07,  2.70it/s][A
epoch 1 iter 2152: train loss 0.21147. lr 3.364535e-04:  92%|█████████▏| 2152/2334 [13:55<01:07,  2.70it/s][A
epoch 1 iter 2152: train loss 0.21147. lr 3.364535e-04:  92%|█████████▏| 2153/2334 [13:55<01:07,  2.70it/s][A
epoch 1 iter 2153: train loss 0.21125. lr 3.362530e-04:  92%|█████████▏| 2153/2334 [13:55<01:07,  2.70it/s][A
epoch 1 iter 2153: train loss 0.21125. lr 3.362530e-04:  92%|█████████▏| 2154/2334 [13:55<01:06,  2.71it/s][A
epoch 1 iter 2154: train loss 0.21192. lr 3.360526e-04:  92%|█████████▏| 2154/2334 [13:56<01:06,  2.71it/s][A
epoch 1 iter 2154: train loss 0.21192. lr 3.360526e-04:  92%|█████████▏| 2155/2334 [13:56<01:06,  2.71it/s][A

epoch 1 iter 2187: train loss 0.19794. lr 3.294293e-04:  94%|█████████▎| 2187/2334 [14:08<00:54,  2.71it/s][A
epoch 1 iter 2187: train loss 0.19794. lr 3.294293e-04:  94%|█████████▎| 2188/2334 [14:08<00:53,  2.72it/s][A
epoch 1 iter 2188: train loss 0.19593. lr 3.292283e-04:  94%|█████████▎| 2188/2334 [14:08<00:53,  2.72it/s][A
epoch 1 iter 2188: train loss 0.19593. lr 3.292283e-04:  94%|█████████▍| 2189/2334 [14:08<00:53,  2.72it/s][A
epoch 1 iter 2189: train loss 0.20372. lr 3.290274e-04:  94%|█████████▍| 2189/2334 [14:09<00:53,  2.72it/s][A
epoch 1 iter 2189: train loss 0.20372. lr 3.290274e-04:  94%|█████████▍| 2190/2334 [14:09<00:53,  2.71it/s][A
epoch 1 iter 2190: train loss 0.20217. lr 3.288264e-04:  94%|█████████▍| 2190/2334 [14:09<00:53,  2.71it/s][A
epoch 1 iter 2190: train loss 0.20217. lr 3.288264e-04:  94%|█████████▍| 2191/2334 [14:09<00:52,  2.72it/s][A
epoch 1 iter 2191: train loss 0.20425. lr 3.286254e-04:  94%|█████████▍| 2191/2334 [14:09<00:52,  2.72it/s][A

epoch 1 iter 2223: train loss 0.18510. lr 3.221874e-04:  95%|█████████▌| 2224/2334 [14:21<00:40,  2.71it/s][A
epoch 1 iter 2224: train loss 0.20259. lr 3.219860e-04:  95%|█████████▌| 2224/2334 [14:22<00:40,  2.71it/s][A
epoch 1 iter 2224: train loss 0.20259. lr 3.219860e-04:  95%|█████████▌| 2225/2334 [14:22<00:40,  2.71it/s][A
epoch 1 iter 2225: train loss 0.19535. lr 3.217847e-04:  95%|█████████▌| 2225/2334 [14:22<00:40,  2.71it/s][A
epoch 1 iter 2225: train loss 0.19535. lr 3.217847e-04:  95%|█████████▌| 2226/2334 [14:22<00:39,  2.72it/s][A
epoch 1 iter 2226: train loss 0.19790. lr 3.215833e-04:  95%|█████████▌| 2226/2334 [14:22<00:39,  2.72it/s][A
epoch 1 iter 2226: train loss 0.19790. lr 3.215833e-04:  95%|█████████▌| 2227/2334 [14:22<00:39,  2.71it/s][A
epoch 1 iter 2227: train loss 0.21760. lr 3.213819e-04:  95%|█████████▌| 2227/2334 [14:23<00:39,  2.71it/s][A
epoch 1 iter 2227: train loss 0.21760. lr 3.213819e-04:  95%|█████████▌| 2228/2334 [14:23<00:39,  2.72it/s][A

epoch 1 iter 2260: train loss 0.20664. lr 3.147308e-04:  97%|█████████▋| 2260/2334 [14:35<00:27,  2.71it/s][A
epoch 1 iter 2260: train loss 0.20664. lr 3.147308e-04:  97%|█████████▋| 2261/2334 [14:35<00:26,  2.72it/s][A
epoch 1 iter 2261: train loss 0.19979. lr 3.145292e-04:  97%|█████████▋| 2261/2334 [14:35<00:26,  2.72it/s][A
epoch 1 iter 2261: train loss 0.19979. lr 3.145292e-04:  97%|█████████▋| 2262/2334 [14:35<00:26,  2.71it/s][A
epoch 1 iter 2262: train loss 0.20300. lr 3.143275e-04:  97%|█████████▋| 2262/2334 [14:36<00:26,  2.71it/s][A
epoch 1 iter 2262: train loss 0.20300. lr 3.143275e-04:  97%|█████████▋| 2263/2334 [14:36<00:26,  2.71it/s][A
epoch 1 iter 2263: train loss 0.18235. lr 3.141258e-04:  97%|█████████▋| 2263/2334 [14:36<00:26,  2.71it/s][A
epoch 1 iter 2263: train loss 0.18235. lr 3.141258e-04:  97%|█████████▋| 2264/2334 [14:36<00:25,  2.71it/s][A
epoch 1 iter 2264: train loss 0.20314. lr 3.139241e-04:  97%|█████████▋| 2264/2334 [14:36<00:25,  2.71it/s][A

epoch 1 iter 2296: train loss 0.20497. lr 3.074670e-04:  98%|█████████▊| 2297/2334 [14:48<00:13,  2.71it/s][A
epoch 1 iter 2297: train loss 0.19657. lr 3.072651e-04:  98%|█████████▊| 2297/2334 [14:48<00:13,  2.71it/s][A
epoch 1 iter 2297: train loss 0.19657. lr 3.072651e-04:  98%|█████████▊| 2298/2334 [14:48<00:13,  2.71it/s][A
epoch 1 iter 2298: train loss 0.18504. lr 3.070633e-04:  98%|█████████▊| 2298/2334 [14:49<00:13,  2.71it/s][A
epoch 1 iter 2298: train loss 0.18504. lr 3.070633e-04:  99%|█████████▊| 2299/2334 [14:49<00:12,  2.72it/s][A
epoch 1 iter 2299: train loss 0.18253. lr 3.068614e-04:  99%|█████████▊| 2299/2334 [14:49<00:12,  2.72it/s][A
epoch 1 iter 2299: train loss 0.18253. lr 3.068614e-04:  99%|█████████▊| 2300/2334 [14:49<00:12,  2.72it/s][A
epoch 1 iter 2300: train loss 0.21220. lr 3.066595e-04:  99%|█████████▊| 2300/2334 [14:50<00:12,  2.72it/s][A
epoch 1 iter 2300: train loss 0.21220. lr 3.066595e-04:  99%|█████████▊| 2301/2334 [14:50<00:12,  2.71it/s][A

epoch 1 iter 2333: train loss 0.19660. lr 3.000158e-04: 100%|█████████▉| 2333/2334 [15:02<00:00,  2.71it/s][A
epoch 1 iter 2333: train loss 0.19660. lr 3.000158e-04: 100%|██████████| 2334/2334 [15:02<00:00,  2.59it/s][A

  0%|          | 0/2334 [00:00<?, ?it/s][A
epoch 2 iter 0: train loss 0.18090. lr 2.998139e-04:   0%|          | 0/2334 [00:00<?, ?it/s][A
epoch 2 iter 0: train loss 0.18090. lr 2.998139e-04:   0%|          | 1/2334 [00:00<14:10,  2.74it/s][A
epoch 2 iter 1: train loss 0.17723. lr 2.996119e-04:   0%|          | 1/2334 [00:00<14:10,  2.74it/s][A
epoch 2 iter 1: train loss 0.17723. lr 2.996119e-04:   0%|          | 2/2334 [00:00<14:15,  2.73it/s][A
epoch 2 iter 2: train loss 0.17011. lr 2.994100e-04:   0%|          | 2/2334 [00:01<14:15,  2.73it/s][A
epoch 2 iter 2: train loss 0.17011. lr 2.994100e-04:   0%|          | 3/2334 [00:01<14:18,  2.72it/s][A
epoch 2 iter 3: train loss 0.17291. lr 2.992081e-04:   0%|          | 3/2334 [00:01<14:18,  2.72it/s][A

epoch 2 iter 37: train loss 0.19095. lr 2.923438e-04:   2%|▏         | 37/2334 [00:13<14:06,  2.71it/s][A
epoch 2 iter 37: train loss 0.19095. lr 2.923438e-04:   2%|▏         | 38/2334 [00:14<14:07,  2.71it/s][A
epoch 2 iter 38: train loss 0.17621. lr 2.921419e-04:   2%|▏         | 38/2334 [00:14<14:07,  2.71it/s][A
epoch 2 iter 38: train loss 0.17621. lr 2.921419e-04:   2%|▏         | 39/2334 [00:14<14:06,  2.71it/s][A
epoch 2 iter 39: train loss 0.17390. lr 2.919401e-04:   2%|▏         | 39/2334 [00:14<14:06,  2.71it/s][A
epoch 2 iter 39: train loss 0.17390. lr 2.919401e-04:   2%|▏         | 40/2334 [00:14<14:07,  2.71it/s][A
epoch 2 iter 40: train loss 0.18207. lr 2.917382e-04:   2%|▏         | 40/2334 [00:15<14:07,  2.71it/s][A
epoch 2 iter 40: train loss 0.18207. lr 2.917382e-04:   2%|▏         | 41/2334 [00:15<14:04,  2.71it/s][A
epoch 2 iter 41: train loss 0.18808. lr 2.915364e-04:   2%|▏         | 41/2334 [00:15<14:04,  2.71it/s][A

epoch 2 iter 75: train loss 0.17096. lr 2.846768e-04:   3%|▎         | 75/2334 [00:27<13:52,  2.71it/s][A
epoch 2 iter 75: train loss 0.17096. lr 2.846768e-04:   3%|▎         | 76/2334 [00:27<13:49,  2.72it/s][A
epoch 2 iter 76: train loss 0.18590. lr 2.844751e-04:   3%|▎         | 76/2334 [00:28<13:49,  2.72it/s][A
epoch 2 iter 76: train loss 0.18590. lr 2.844751e-04:   3%|▎         | 77/2334 [00:28<13:50,  2.72it/s][A
epoch 2 iter 77: train loss 0.18629. lr 2.842735e-04:   3%|▎         | 77/2334 [00:28<13:50,  2.72it/s][A
epoch 2 iter 77: train loss 0.18629. lr 2.842735e-04:   3%|▎         | 78/2334 [00:28<13:50,  2.72it/s][A
epoch 2 iter 78: train loss 0.16312. lr 2.840718e-04:   3%|▎         | 78/2334 [00:29<13:50,  2.72it/s][A
epoch 2 iter 78: train loss 0.16312. lr 2.840718e-04:   3%|▎         | 79/2334 [00:29<13:49,  2.72it/s][A
epoch 2 iter 79: train loss 0.17280. lr 2.838702e-04:   3%|▎         | 79/2334 [00:29<13:49,  2.72it/s][A

epoch 2 iter 2104: train loss 0.11900. lr 6.000000e-05:  90%|█████████ | 2104/2334 [12:56<01:24,  2.71it/s][A
epoch 2 iter 2104: train loss 0.11900. lr 6.000000e-05:  90%|█████████ | 2105/2334 [12:56<01:24,  2.72it/s][A
epoch 2 iter 2105: train loss 0.13069. lr 6.000000e-05:  90%|█████████ | 2105/2334 [12:56<01:24,  2.72it/s][A
epoch 2 iter 2105: train loss 0.13069. lr 6.000000e-05:  90%|█████████ | 2106/2334 [12:56<01:24,  2.71it/s][A
epoch 2 iter 2106: train loss 0.12728. lr 6.000000e-05:  90%|█████████ | 2106/2334 [12:57<01:24,  2.71it/s][A
epoch 2 iter 2106: train loss 0.12728. lr 6.000000e-05:  90%|█████████ | 2107/2334 [12:57<01:23,  2.71it/s][A
epoch 2 iter 2107: train loss 0.12072. lr 6.000000e-05:  90%|█████████ | 2107/2334 [12:57<01:23,  2.71it/s][A
epoch 2 iter 2107: train loss 0.12072. lr 6.000000e-05:  90%|█████████ | 2108/2334 [12:57<01:23,  2.71it/s][A
epoch 2 iter 2108: train loss 0.12421. lr 6.000000e-05:  90%|█████████ | 2108/2334 [12:57<01:23,  2.71it/s][A

epoch 2 iter 2140: train loss 0.13139. lr 6.000000e-05:  92%|█████████▏| 2141/2334 [13:09<01:11,  2.71it/s][A
epoch 2 iter 2141: train loss 0.12853. lr 6.000000e-05:  92%|█████████▏| 2141/2334 [13:09<01:11,  2.71it/s][A
epoch 2 iter 2141: train loss 0.12853. lr 6.000000e-05:  92%|█████████▏| 2142/2334 [13:09<01:10,  2.71it/s][A
epoch 2 iter 2142: train loss 0.11984. lr 6.000000e-05:  92%|█████████▏| 2142/2334 [13:10<01:10,  2.71it/s][A
epoch 2 iter 2142: train loss 0.11984. lr 6.000000e-05:  92%|█████████▏| 2143/2334 [13:10<01:10,  2.71it/s][A
epoch 2 iter 2143: train loss 0.12354. lr 6.000000e-05:  92%|█████████▏| 2143/2334 [13:10<01:10,  2.71it/s][A
epoch 2 iter 2143: train loss 0.12354. lr 6.000000e-05:  92%|█████████▏| 2144/2334 [13:10<01:09,  2.71it/s][A
epoch 2 iter 2144: train loss 0.11619. lr 6.000000e-05:  92%|█████████▏| 2144/2334 [13:11<01:09,  2.71it/s][A
epoch 2 iter 2144: train loss 0.11619. lr 6.000000e-05:  92%|█████████▏| 2145/2334 [13:11<01:09,  2.72it/s][A

epoch 2 iter 2177: train loss 0.12504. lr 6.000000e-05:  93%|█████████▎| 2177/2334 [13:23<00:57,  2.71it/s][A
epoch 2 iter 2177: train loss 0.12504. lr 6.000000e-05:  93%|█████████▎| 2178/2334 [13:23<00:57,  2.71it/s][A
epoch 2 iter 2178: train loss 0.10896. lr 6.000000e-05:  93%|█████████▎| 2178/2334 [13:23<00:57,  2.71it/s][A
epoch 2 iter 2178: train loss 0.10896. lr 6.000000e-05:  93%|█████████▎| 2179/2334 [13:23<00:57,  2.71it/s][A
epoch 2 iter 2179: train loss 0.12295. lr 6.000000e-05:  93%|█████████▎| 2179/2334 [13:23<00:57,  2.71it/s][A
epoch 2 iter 2179: train loss 0.12295. lr 6.000000e-05:  93%|█████████▎| 2180/2334 [13:23<00:56,  2.72it/s][A
epoch 2 iter 2180: train loss 0.12931. lr 6.000000e-05:  93%|█████████▎| 2180/2334 [13:24<00:56,  2.72it/s][A
epoch 2 iter 2180: train loss 0.12931. lr 6.000000e-05:  93%|█████████▎| 2181/2334 [13:24<00:56,  2.72it/s][A
epoch 2 iter 2181: train loss 0.11879. lr 6.000000e-05:  93%|█████████▎| 2181/2334 [13:24<00:56,  2.72it/s][A

epoch 2 iter 2213: train loss 0.11864. lr 6.000000e-05:  95%|█████████▍| 2214/2334 [13:36<00:44,  2.71it/s][A
epoch 2 iter 2214: train loss 0.11997. lr 6.000000e-05:  95%|█████████▍| 2214/2334 [13:36<00:44,  2.71it/s][A
epoch 2 iter 2214: train loss 0.11997. lr 6.000000e-05:  95%|█████████▍| 2215/2334 [13:36<00:43,  2.71it/s][A
epoch 2 iter 2215: train loss 0.13651. lr 6.000000e-05:  95%|█████████▍| 2215/2334 [13:37<00:43,  2.71it/s][A
epoch 2 iter 2215: train loss 0.13651. lr 6.000000e-05:  95%|█████████▍| 2216/2334 [13:37<00:43,  2.70it/s][A
epoch 2 iter 2216: train loss 0.12115. lr 6.000000e-05:  95%|█████████▍| 2216/2334 [13:37<00:43,  2.70it/s][A
epoch 2 iter 2216: train loss 0.12115. lr 6.000000e-05:  95%|█████████▍| 2217/2334 [13:37<00:43,  2.70it/s][A
epoch 2 iter 2217: train loss 0.13234. lr 6.000000e-05:  95%|█████████▍| 2217/2334 [13:37<00:43,  2.70it/s][A
epoch 2 iter 2217: train loss 0.13234. lr 6.000000e-05:  95%|█████████▌| 2218/2334 [13:38<00:42,  2.71it/s][A

epoch 2 iter 2250: train loss 0.11857. lr 6.000000e-05:  96%|█████████▋| 2250/2334 [13:50<00:30,  2.72it/s][A
epoch 2 iter 2250: train loss 0.11857. lr 6.000000e-05:  96%|█████████▋| 2251/2334 [13:50<00:30,  2.71it/s][A
epoch 2 iter 2251: train loss 0.11997. lr 6.000000e-05:  96%|█████████▋| 2251/2334 [13:50<00:30,  2.71it/s][A
epoch 2 iter 2251: train loss 0.11997. lr 6.000000e-05:  96%|█████████▋| 2252/2334 [13:50<00:30,  2.71it/s][A
epoch 2 iter 2252: train loss 0.13180. lr 6.000000e-05:  96%|█████████▋| 2252/2334 [13:50<00:30,  2.71it/s][A
epoch 2 iter 2252: train loss 0.13180. lr 6.000000e-05:  97%|█████████▋| 2253/2334 [13:50<00:29,  2.71it/s][A
epoch 2 iter 2253: train loss 0.11099. lr 6.000000e-05:  97%|█████████▋| 2253/2334 [13:51<00:29,  2.71it/s][A
epoch 2 iter 2253: train loss 0.11099. lr 6.000000e-05:  97%|█████████▋| 2254/2334 [13:51<00:29,  2.72it/s][A
epoch 2 iter 2254: train loss 0.11201. lr 6.000000e-05:  97%|█████████▋| 2254/2334 [13:51<00:29,  2.72it/s][A

epoch 2 iter 2286: train loss 0.13413. lr 6.000000e-05:  98%|█████████▊| 2287/2334 [14:03<00:17,  2.71it/s][A
epoch 2 iter 2287: train loss 0.11808. lr 6.000000e-05:  98%|█████████▊| 2287/2334 [14:03<00:17,  2.71it/s][A
epoch 2 iter 2287: train loss 0.11808. lr 6.000000e-05:  98%|█████████▊| 2288/2334 [14:03<00:16,  2.72it/s][A
epoch 2 iter 2288: train loss 0.12718. lr 6.000000e-05:  98%|█████████▊| 2288/2334 [14:04<00:16,  2.72it/s][A
epoch 2 iter 2288: train loss 0.12718. lr 6.000000e-05:  98%|█████████▊| 2289/2334 [14:04<00:16,  2.72it/s][A
epoch 2 iter 2289: train loss 0.12204. lr 6.000000e-05:  98%|█████████▊| 2289/2334 [14:04<00:16,  2.72it/s][A
epoch 2 iter 2289: train loss 0.12204. lr 6.000000e-05:  98%|█████████▊| 2290/2334 [14:04<00:16,  2.71it/s][A
epoch 2 iter 2290: train loss 0.13157. lr 6.000000e-05:  98%|█████████▊| 2290/2334 [14:04<00:16,  2.71it/s][A
epoch 2 iter 2290: train loss 0.13157. lr 6.000000e-05:  98%|█████████▊| 2291/2334 [14:04<00:15,  2.70it/s][A

epoch 2 iter 2323: train loss 0.12803. lr 6.000000e-05: 100%|█████████▉| 2323/2334 [14:17<00:04,  2.71it/s][A
epoch 2 iter 2323: train loss 0.12803. lr 6.000000e-05: 100%|█████████▉| 2324/2334 [14:17<00:03,  2.71it/s][A
epoch 2 iter 2324: train loss 0.12311. lr 6.000000e-05: 100%|█████████▉| 2324/2334 [14:17<00:03,  2.71it/s][A
epoch 2 iter 2324: train loss 0.12311. lr 6.000000e-05: 100%|█████████▉| 2325/2334 [14:17<00:03,  2.71it/s][A
epoch 2 iter 2325: train loss 0.10847. lr 6.000000e-05: 100%|█████████▉| 2325/2334 [14:17<00:03,  2.71it/s][A
epoch 2 iter 2325: train loss 0.10847. lr 6.000000e-05: 100%|█████████▉| 2326/2334 [14:17<00:02,  2.71it/s][A
epoch 2 iter 2326: train loss 0.12952. lr 6.000000e-05: 100%|█████████▉| 2326/2334 [14:18<00:02,  2.71it/s][A
epoch 2 iter 2326: train loss 0.12952. lr 6.000000e-05: 100%|█████████▉| 2327/2334 [14:18<00:02,  2.70it/s][A
epoch 2 iter 2327: train loss 0.13578. lr 6.000000e-05: 100%|█████████▉| 2327/2334 [14:18<00:02,  2.70it/s][A

data has 595386 characters, 119547 unique.



  0%|          | 0/9301 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.74520. lr 6.000000e-04:   0%|          | 0/9301 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.74520. lr 6.000000e-04:   0%|          | 1/9301 [00:00<1:04:10,  2.42it/s][A
epoch 1 iter 1: train loss 11.30823. lr 6.000000e-04:   0%|          | 1/9301 [00:00<1:04:10,  2.42it/s][A
epoch 1 iter 1: train loss 11.30823. lr 6.000000e-04:   0%|          | 2/9301 [00:00<1:01:43,  2.51it/s][A
epoch 1 iter 2: train loss 11.05606. lr 6.000000e-04:   0%|          | 2/9301 [00:01<1:01:43,  2.51it/s][A
epoch 1 iter 2: train loss 11.05606. lr 6.000000e-04:   0%|          | 3/9301 [00:01<1:00:00,  2.58it/s][A
epoch 1 iter 3: train loss 10.89508. lr 5.999999e-04:   0%|          | 3/9301 [00:01<1:00:00,  2.58it/s][A
epoch 1 iter 3: train loss 10.89508. lr 5.999999e-04:   0%|          | 4/9301 [00:01<58:44,  2.64it/s]  [A
epoch 1 iter 4: train loss 10.61887. lr 5.999999e-04:   0%|          | 4/9301 [00:01<58:44,  2.64it/s

epoch 1 iter 38: train loss 7.29183. lr 5.999935e-04:   0%|          | 38/9301 [00:14<55:45,  2.77it/s][A
epoch 1 iter 38: train loss 7.29183. lr 5.999935e-04:   0%|          | 39/9301 [00:14<55:45,  2.77it/s][A
epoch 1 iter 39: train loss 7.14823. lr 5.999932e-04:   0%|          | 39/9301 [00:14<55:45,  2.77it/s][A
epoch 1 iter 39: train loss 7.14823. lr 5.999932e-04:   0%|          | 40/9301 [00:14<55:43,  2.77it/s][A
epoch 1 iter 40: train loss 7.23578. lr 5.999929e-04:   0%|          | 40/9301 [00:14<55:43,  2.77it/s][A
epoch 1 iter 40: train loss 7.23578. lr 5.999929e-04:   0%|          | 41/9301 [00:14<55:41,  2.77it/s][A
epoch 1 iter 41: train loss 7.14036. lr 5.999925e-04:   0%|          | 41/9301 [00:15<55:41,  2.77it/s][A
epoch 1 iter 41: train loss 7.14036. lr 5.999925e-04:   0%|          | 42/9301 [00:15<55:49,  2.76it/s][A
epoch 1 iter 42: train loss 6.96278. lr 5.999921e-04:   0%|          | 42/9301 [00:15<55:49,  2.76it/s][A

epoch 1 iter 76: train loss 6.70616. lr 5.999747e-04:   1%|          | 76/9301 [00:27<55:42,  2.76it/s][A
epoch 1 iter 76: train loss 6.70616. lr 5.999747e-04:   1%|          | 77/9301 [00:27<55:49,  2.75it/s][A
epoch 1 iter 77: train loss 6.48520. lr 5.999741e-04:   1%|          | 77/9301 [00:28<55:49,  2.75it/s][A
epoch 1 iter 77: train loss 6.48520. lr 5.999741e-04:   1%|          | 78/9301 [00:28<55:50,  2.75it/s][A
epoch 1 iter 78: train loss 6.58763. lr 5.999734e-04:   1%|          | 78/9301 [00:28<55:50,  2.75it/s][A
epoch 1 iter 78: train loss 6.58763. lr 5.999734e-04:   1%|          | 79/9301 [00:28<55:50,  2.75it/s][A
epoch 1 iter 79: train loss 6.64271. lr 5.999727e-04:   1%|          | 79/9301 [00:29<55:50,  2.75it/s][A
epoch 1 iter 79: train loss 6.64271. lr 5.999727e-04:   1%|          | 80/9301 [00:29<55:46,  2.76it/s][A
epoch 1 iter 80: train loss 6.51988. lr 5.999720e-04:   1%|          | 80/9301 [00:29<55:46,  2.76it/s][A

epoch 1 iter 114: train loss 6.34834. lr 5.999436e-04:   1%|          | 114/9301 [00:41<55:59,  2.73it/s][A
epoch 1 iter 114: train loss 6.34834. lr 5.999436e-04:   1%|          | 115/9301 [00:41<56:06,  2.73it/s][A
epoch 1 iter 115: train loss 6.21290. lr 5.999426e-04:   1%|          | 115/9301 [00:42<56:06,  2.73it/s][A
epoch 1 iter 115: train loss 6.21290. lr 5.999426e-04:   1%|          | 116/9301 [00:42<56:01,  2.73it/s][A
epoch 1 iter 116: train loss 5.96371. lr 5.999416e-04:   1%|          | 116/9301 [00:42<56:01,  2.73it/s][A
epoch 1 iter 116: train loss 5.96371. lr 5.999416e-04:   1%|▏         | 117/9301 [00:42<55:54,  2.74it/s][A
epoch 1 iter 117: train loss 6.32182. lr 5.999406e-04:   1%|▏         | 117/9301 [00:42<55:54,  2.74it/s][A
epoch 1 iter 117: train loss 6.32182. lr 5.999406e-04:   1%|▏         | 118/9301 [00:42<55:48,  2.74it/s][A
epoch 1 iter 118: train loss 6.04319. lr 5.999396e-04:   1%|▏         | 118/9301 [00:43<55:48,  2.74it/s][A

epoch 1 iter 151: train loss 5.84608. lr 5.999014e-04:   2%|▏         | 152/9301 [00:55<55:52,  2.73it/s][A
epoch 1 iter 152: train loss 5.85802. lr 5.999001e-04:   2%|▏         | 152/9301 [00:55<55:52,  2.73it/s][A
epoch 1 iter 152: train loss 5.85802. lr 5.999001e-04:   2%|▏         | 153/9301 [00:55<55:52,  2.73it/s][A
epoch 1 iter 153: train loss 5.93519. lr 5.998987e-04:   2%|▏         | 153/9301 [00:56<55:52,  2.73it/s][A
epoch 1 iter 153: train loss 5.93519. lr 5.998987e-04:   2%|▏         | 154/9301 [00:56<55:46,  2.73it/s][A
epoch 1 iter 154: train loss 5.90169. lr 5.998974e-04:   2%|▏         | 154/9301 [00:56<55:46,  2.73it/s][A
epoch 1 iter 154: train loss 5.90169. lr 5.998974e-04:   2%|▏         | 155/9301 [00:56<55:51,  2.73it/s][A
epoch 1 iter 155: train loss 5.86598. lr 5.998961e-04:   2%|▏         | 155/9301 [00:56<55:51,  2.73it/s][A
epoch 1 iter 155: train loss 5.86598. lr 5.998961e-04:   2%|▏         | 156/9301 [00:56<55:52,  2.73it/s][A

epoch 1 iter 189: train loss 5.54694. lr 5.998458e-04:   2%|▏         | 189/9301 [01:09<55:28,  2.74it/s][A
epoch 1 iter 189: train loss 5.54694. lr 5.998458e-04:   2%|▏         | 190/9301 [01:09<55:42,  2.73it/s][A
epoch 1 iter 190: train loss 5.41527. lr 5.998442e-04:   2%|▏         | 190/9301 [01:09<55:42,  2.73it/s][A
epoch 1 iter 190: train loss 5.41527. lr 5.998442e-04:   2%|▏         | 191/9301 [01:09<55:50,  2.72it/s][A
epoch 1 iter 191: train loss 5.58892. lr 5.998425e-04:   2%|▏         | 191/9301 [01:09<55:50,  2.72it/s][A
epoch 1 iter 191: train loss 5.58892. lr 5.998425e-04:   2%|▏         | 192/9301 [01:09<55:40,  2.73it/s][A
epoch 1 iter 192: train loss 5.43592. lr 5.998409e-04:   2%|▏         | 192/9301 [01:10<55:40,  2.73it/s][A
epoch 1 iter 192: train loss 5.43592. lr 5.998409e-04:   2%|▏         | 193/9301 [01:10<55:34,  2.73it/s][A
epoch 1 iter 193: train loss 5.53196. lr 5.998392e-04:   2%|▏         | 193/9301 [01:10<55:34,  2.73it/s][A

epoch 1 iter 226: train loss 5.32914. lr 5.997799e-04:   2%|▏         | 227/9301 [01:22<55:32,  2.72it/s][A
epoch 1 iter 227: train loss 5.22032. lr 5.997779e-04:   2%|▏         | 227/9301 [01:23<55:32,  2.72it/s][A
epoch 1 iter 227: train loss 5.22032. lr 5.997779e-04:   2%|▏         | 228/9301 [01:23<55:36,  2.72it/s][A
epoch 1 iter 228: train loss 5.45999. lr 5.997760e-04:   2%|▏         | 228/9301 [01:23<55:36,  2.72it/s][A
epoch 1 iter 228: train loss 5.45999. lr 5.997760e-04:   2%|▏         | 229/9301 [01:23<55:41,  2.71it/s][A
epoch 1 iter 229: train loss 5.40127. lr 5.997740e-04:   2%|▏         | 229/9301 [01:23<55:41,  2.71it/s][A
epoch 1 iter 229: train loss 5.40127. lr 5.997740e-04:   2%|▏         | 230/9301 [01:23<55:41,  2.71it/s][A
epoch 1 iter 230: train loss 5.30747. lr 5.997720e-04:   2%|▏         | 230/9301 [01:24<55:41,  2.71it/s][A
epoch 1 iter 230: train loss 5.30747. lr 5.997720e-04:   2%|▏         | 231/9301 [01:24<55:32,  2.72it/s][A

epoch 1 iter 2415: train loss 1.37853. lr 5.753741e-04:  26%|██▌       | 2415/9301 [14:50<42:18,  2.71it/s][A
epoch 1 iter 2415: train loss 1.37853. lr 5.753741e-04:  26%|██▌       | 2416/9301 [14:50<42:16,  2.71it/s][A
epoch 1 iter 2416: train loss 1.42399. lr 5.753540e-04:  26%|██▌       | 2416/9301 [14:50<42:16,  2.71it/s][A
epoch 1 iter 2416: train loss 1.42399. lr 5.753540e-04:  26%|██▌       | 2417/9301 [14:50<42:10,  2.72it/s][A
epoch 1 iter 2417: train loss 1.45488. lr 5.753339e-04:  26%|██▌       | 2417/9301 [14:50<42:10,  2.72it/s][A
epoch 1 iter 2417: train loss 1.45488. lr 5.753339e-04:  26%|██▌       | 2418/9301 [14:50<42:17,  2.71it/s][A
epoch 1 iter 2418: train loss 1.39441. lr 5.753137e-04:  26%|██▌       | 2418/9301 [14:51<42:17,  2.71it/s][A
epoch 1 iter 2418: train loss 1.39441. lr 5.753137e-04:  26%|██▌       | 2419/9301 [14:51<42:23,  2.71it/s][A
epoch 1 iter 2419: train loss 1.39258. lr 5.752936e-04:  26%|██▌       | 2419/9301 [14:51<42:23,  2.71it/s][A

epoch 1 iter 2451: train loss 1.47106. lr 5.746453e-04:  26%|██▋       | 2452/9301 [15:03<41:58,  2.72it/s][A
epoch 1 iter 2452: train loss 1.44352. lr 5.746249e-04:  26%|██▋       | 2452/9301 [15:03<41:58,  2.72it/s][A
epoch 1 iter 2452: train loss 1.44352. lr 5.746249e-04:  26%|██▋       | 2453/9301 [15:03<42:13,  2.70it/s][A
epoch 1 iter 2453: train loss 1.41560. lr 5.746045e-04:  26%|██▋       | 2453/9301 [15:04<42:13,  2.70it/s][A
epoch 1 iter 2453: train loss 1.41560. lr 5.746045e-04:  26%|██▋       | 2454/9301 [15:04<42:26,  2.69it/s][A
epoch 1 iter 2454: train loss 1.47684. lr 5.745841e-04:  26%|██▋       | 2454/9301 [15:04<42:26,  2.69it/s][A
epoch 1 iter 2454: train loss 1.47684. lr 5.745841e-04:  26%|██▋       | 2455/9301 [15:04<42:24,  2.69it/s][A
epoch 1 iter 2455: train loss 1.31421. lr 5.745637e-04:  26%|██▋       | 2455/9301 [15:04<42:24,  2.69it/s][A
epoch 1 iter 2455: train loss 1.31421. lr 5.745637e-04:  26%|██▋       | 2456/9301 [15:04<42:21,  2.69it/s][A

epoch 1 iter 2488: train loss 1.32406. lr 5.738856e-04:  27%|██▋       | 2488/9301 [15:17<41:42,  2.72it/s][A
epoch 1 iter 2488: train loss 1.32406. lr 5.738856e-04:  27%|██▋       | 2489/9301 [15:17<41:49,  2.71it/s][A
epoch 1 iter 2489: train loss 1.35342. lr 5.738650e-04:  27%|██▋       | 2489/9301 [15:17<41:49,  2.71it/s][A
epoch 1 iter 2489: train loss 1.35342. lr 5.738650e-04:  27%|██▋       | 2490/9301 [15:17<41:53,  2.71it/s][A
epoch 1 iter 2490: train loss 1.31693. lr 5.738443e-04:  27%|██▋       | 2490/9301 [15:17<41:53,  2.71it/s][A
epoch 1 iter 2490: train loss 1.31693. lr 5.738443e-04:  27%|██▋       | 2491/9301 [15:17<41:50,  2.71it/s][A
epoch 1 iter 2491: train loss 1.43164. lr 5.738236e-04:  27%|██▋       | 2491/9301 [15:18<41:50,  2.71it/s][A
epoch 1 iter 2491: train loss 1.43164. lr 5.738236e-04:  27%|██▋       | 2492/9301 [15:18<41:47,  2.71it/s][A
epoch 1 iter 2492: train loss 1.40690. lr 5.738029e-04:  27%|██▋       | 2492/9301 [15:18<41:47,  2.71it/s][A

epoch 1 iter 2524: train loss 1.34656. lr 5.731363e-04:  27%|██▋       | 2525/9301 [15:30<41:39,  2.71it/s][A
epoch 1 iter 2525: train loss 1.37268. lr 5.731153e-04:  27%|██▋       | 2525/9301 [15:30<41:39,  2.71it/s][A
epoch 1 iter 2525: train loss 1.37268. lr 5.731153e-04:  27%|██▋       | 2526/9301 [15:30<41:38,  2.71it/s][A
epoch 1 iter 2526: train loss 1.40368. lr 5.730944e-04:  27%|██▋       | 2526/9301 [15:31<41:38,  2.71it/s][A
epoch 1 iter 2526: train loss 1.40368. lr 5.730944e-04:  27%|██▋       | 2527/9301 [15:31<41:36,  2.71it/s][A
epoch 1 iter 2527: train loss 1.37760. lr 5.730734e-04:  27%|██▋       | 2527/9301 [15:31<41:36,  2.71it/s][A
epoch 1 iter 2527: train loss 1.37760. lr 5.730734e-04:  27%|██▋       | 2528/9301 [15:31<41:29,  2.72it/s][A
epoch 1 iter 2528: train loss 1.35698. lr 5.730524e-04:  27%|██▋       | 2528/9301 [15:31<41:29,  2.72it/s][A
epoch 1 iter 2528: train loss 1.35698. lr 5.730524e-04:  27%|██▋       | 2529/9301 [15:31<41:36,  2.71it/s][A

epoch 1 iter 2561: train loss 1.32298. lr 5.723556e-04:  28%|██▊       | 2561/9301 [15:44<41:27,  2.71it/s][A
epoch 1 iter 2561: train loss 1.32298. lr 5.723556e-04:  28%|██▊       | 2562/9301 [15:44<41:32,  2.70it/s][A
epoch 1 iter 2562: train loss 1.41342. lr 5.723343e-04:  28%|██▊       | 2562/9301 [15:44<41:32,  2.70it/s][A
epoch 1 iter 2562: train loss 1.41342. lr 5.723343e-04:  28%|██▊       | 2563/9301 [15:44<41:23,  2.71it/s][A
epoch 1 iter 2563: train loss 1.28843. lr 5.723131e-04:  28%|██▊       | 2563/9301 [15:44<41:23,  2.71it/s][A
epoch 1 iter 2563: train loss 1.28843. lr 5.723131e-04:  28%|██▊       | 2564/9301 [15:44<41:27,  2.71it/s][A
epoch 1 iter 2564: train loss 1.23153. lr 5.722918e-04:  28%|██▊       | 2564/9301 [15:45<41:27,  2.71it/s][A
epoch 1 iter 2564: train loss 1.23153. lr 5.722918e-04:  28%|██▊       | 2565/9301 [15:45<41:29,  2.71it/s][A
epoch 1 iter 2565: train loss 1.24006. lr 5.722705e-04:  28%|██▊       | 2565/9301 [15:45<41:29,  2.71it/s][A

epoch 1 iter 2597: train loss 1.38198. lr 5.715858e-04:  28%|██▊       | 2598/9301 [15:57<41:02,  2.72it/s][A
epoch 1 iter 2598: train loss 1.27516. lr 5.715642e-04:  28%|██▊       | 2598/9301 [15:57<41:02,  2.72it/s][A
epoch 1 iter 2598: train loss 1.27516. lr 5.715642e-04:  28%|██▊       | 2599/9301 [15:57<41:09,  2.71it/s][A
epoch 1 iter 2599: train loss 1.36361. lr 5.715427e-04:  28%|██▊       | 2599/9301 [15:58<41:09,  2.71it/s][A
epoch 1 iter 2599: train loss 1.36361. lr 5.715427e-04:  28%|██▊       | 2600/9301 [15:58<41:13,  2.71it/s][A
epoch 1 iter 2600: train loss 1.29312. lr 5.715212e-04:  28%|██▊       | 2600/9301 [15:58<41:13,  2.71it/s][A
epoch 1 iter 2600: train loss 1.29312. lr 5.715212e-04:  28%|██▊       | 2601/9301 [15:58<41:10,  2.71it/s][A
epoch 1 iter 2601: train loss 1.29503. lr 5.714996e-04:  28%|██▊       | 2601/9301 [15:58<41:10,  2.71it/s][A
epoch 1 iter 2601: train loss 1.29503. lr 5.714996e-04:  28%|██▊       | 2602/9301 [15:58<41:07,  2.71it/s][A

epoch 1 iter 2634: train loss 1.24173. lr 5.707841e-04:  28%|██▊       | 2634/9301 [16:10<41:01,  2.71it/s][A
epoch 1 iter 2634: train loss 1.24173. lr 5.707841e-04:  28%|██▊       | 2635/9301 [16:10<41:03,  2.71it/s][A
epoch 1 iter 2635: train loss 1.27374. lr 5.707623e-04:  28%|██▊       | 2635/9301 [16:11<41:03,  2.71it/s][A
epoch 1 iter 2635: train loss 1.27374. lr 5.707623e-04:  28%|██▊       | 2636/9301 [16:11<40:59,  2.71it/s][A
epoch 1 iter 2636: train loss 1.38065. lr 5.707405e-04:  28%|██▊       | 2636/9301 [16:11<40:59,  2.71it/s][A
epoch 1 iter 2636: train loss 1.38065. lr 5.707405e-04:  28%|██▊       | 2637/9301 [16:11<40:55,  2.71it/s][A
epoch 1 iter 2637: train loss 1.34347. lr 5.707186e-04:  28%|██▊       | 2637/9301 [16:12<40:55,  2.71it/s][A
epoch 1 iter 2637: train loss 1.34347. lr 5.707186e-04:  28%|██▊       | 2638/9301 [16:12<40:48,  2.72it/s][A
epoch 1 iter 2638: train loss 1.29880. lr 5.706968e-04:  28%|██▊       | 2638/9301 [16:12<40:48,  2.72it/s][A

epoch 1 iter 2670: train loss 1.36018. lr 5.699940e-04:  29%|██▊       | 2671/9301 [16:24<40:46,  2.71it/s][A
epoch 1 iter 2671: train loss 1.24144. lr 5.699719e-04:  29%|██▊       | 2671/9301 [16:24<40:46,  2.71it/s][A
epoch 1 iter 2671: train loss 1.24144. lr 5.699719e-04:  29%|██▊       | 2672/9301 [16:24<40:45,  2.71it/s][A
epoch 1 iter 2672: train loss 1.36291. lr 5.699498e-04:  29%|██▊       | 2672/9301 [16:24<40:45,  2.71it/s][A
epoch 1 iter 2672: train loss 1.36291. lr 5.699498e-04:  29%|██▊       | 2673/9301 [16:24<40:38,  2.72it/s][A
epoch 1 iter 2673: train loss 1.33659. lr 5.699277e-04:  29%|██▊       | 2673/9301 [16:25<40:38,  2.72it/s][A
epoch 1 iter 2673: train loss 1.33659. lr 5.699277e-04:  29%|██▊       | 2674/9301 [16:25<40:44,  2.71it/s][A
epoch 1 iter 2674: train loss 1.25521. lr 5.699056e-04:  29%|██▊       | 2674/9301 [16:25<40:44,  2.71it/s][A
epoch 1 iter 2674: train loss 1.25521. lr 5.699056e-04:  29%|██▉       | 2675/9301 [16:25<40:45,  2.71it/s][A

epoch 1 iter 2707: train loss 1.22092. lr 5.691715e-04:  29%|██▉       | 2707/9301 [16:37<40:33,  2.71it/s][A
epoch 1 iter 2707: train loss 1.22092. lr 5.691715e-04:  29%|██▉       | 2708/9301 [16:37<40:35,  2.71it/s][A
epoch 1 iter 2708: train loss 1.20584. lr 5.691491e-04:  29%|██▉       | 2708/9301 [16:38<40:35,  2.71it/s][A
epoch 1 iter 2708: train loss 1.20584. lr 5.691491e-04:  29%|██▉       | 2709/9301 [16:38<40:40,  2.70it/s][A
epoch 1 iter 2709: train loss 1.22310. lr 5.691267e-04:  29%|██▉       | 2709/9301 [16:38<40:40,  2.70it/s][A
epoch 1 iter 2709: train loss 1.22310. lr 5.691267e-04:  29%|██▉       | 2710/9301 [16:38<40:40,  2.70it/s][A
epoch 1 iter 2710: train loss 1.24414. lr 5.691043e-04:  29%|██▉       | 2710/9301 [16:38<40:40,  2.70it/s][A
epoch 1 iter 2710: train loss 1.24414. lr 5.691043e-04:  29%|██▉       | 2711/9301 [16:38<40:35,  2.71it/s][A
epoch 1 iter 2711: train loss 1.28884. lr 5.690819e-04:  29%|██▉       | 2711/9301 [16:39<40:35,  2.71it/s][A

epoch 1 iter 2743: train loss 1.18123. lr 5.683611e-04:  30%|██▉       | 2744/9301 [16:51<40:19,  2.71it/s][A
epoch 1 iter 2744: train loss 1.21466. lr 5.683385e-04:  30%|██▉       | 2744/9301 [16:51<40:19,  2.71it/s][A
epoch 1 iter 2744: train loss 1.21466. lr 5.683385e-04:  30%|██▉       | 2745/9301 [16:51<40:21,  2.71it/s][A
epoch 1 iter 2745: train loss 1.21979. lr 5.683158e-04:  30%|██▉       | 2745/9301 [16:51<40:21,  2.71it/s][A
epoch 1 iter 2745: train loss 1.21979. lr 5.683158e-04:  30%|██▉       | 2746/9301 [16:51<40:17,  2.71it/s][A
epoch 1 iter 2746: train loss 1.27826. lr 5.682932e-04:  30%|██▉       | 2746/9301 [16:52<40:17,  2.71it/s][A
epoch 1 iter 2746: train loss 1.27826. lr 5.682932e-04:  30%|██▉       | 2747/9301 [16:52<44:30,  2.45it/s][A
epoch 1 iter 2747: train loss 1.26130. lr 5.682705e-04:  30%|██▉       | 2747/9301 [16:52<44:30,  2.45it/s][A
epoch 1 iter 2747: train loss 1.26130. lr 5.682705e-04:  30%|██▉       | 2748/9301 [16:52<43:19,  2.52it/s][A

epoch 1 iter 2780: train loss 1.31315. lr 5.675179e-04:  30%|██▉       | 2780/9301 [17:04<40:08,  2.71it/s][A
epoch 1 iter 2780: train loss 1.31315. lr 5.675179e-04:  30%|██▉       | 2781/9301 [17:04<40:08,  2.71it/s][A
epoch 1 iter 2781: train loss 1.20149. lr 5.674950e-04:  30%|██▉       | 2781/9301 [17:05<40:08,  2.71it/s][A
epoch 1 iter 2781: train loss 1.20149. lr 5.674950e-04:  30%|██▉       | 2782/9301 [17:05<40:15,  2.70it/s][A
epoch 1 iter 2782: train loss 1.24859. lr 5.674721e-04:  30%|██▉       | 2782/9301 [17:05<40:15,  2.70it/s][A
epoch 1 iter 2782: train loss 1.24859. lr 5.674721e-04:  30%|██▉       | 2783/9301 [17:05<40:19,  2.69it/s][A
epoch 1 iter 2783: train loss 1.19072. lr 5.674491e-04:  30%|██▉       | 2783/9301 [17:06<40:19,  2.69it/s][A
epoch 1 iter 2783: train loss 1.19072. lr 5.674491e-04:  30%|██▉       | 2784/9301 [17:06<40:18,  2.69it/s][A
epoch 1 iter 2784: train loss 1.26777. lr 5.674262e-04:  30%|██▉       | 2784/9301 [17:06<40:18,  2.69it/s][A

epoch 1 iter 2816: train loss 1.21779. lr 5.666875e-04:  30%|███       | 2817/9301 [17:18<39:41,  2.72it/s][A
epoch 1 iter 2817: train loss 1.25482. lr 5.666643e-04:  30%|███       | 2817/9301 [17:18<39:41,  2.72it/s][A
epoch 1 iter 2817: train loss 1.25482. lr 5.666643e-04:  30%|███       | 2818/9301 [17:18<39:49,  2.71it/s][A
epoch 1 iter 2818: train loss 1.23954. lr 5.666411e-04:  30%|███       | 2818/9301 [17:18<39:49,  2.71it/s][A
epoch 1 iter 2818: train loss 1.23954. lr 5.666411e-04:  30%|███       | 2819/9301 [17:18<39:53,  2.71it/s][A
epoch 1 iter 2819: train loss 1.19034. lr 5.666179e-04:  30%|███       | 2819/9301 [17:19<39:53,  2.71it/s][A
epoch 1 iter 2819: train loss 1.19034. lr 5.666179e-04:  30%|███       | 2820/9301 [17:19<39:53,  2.71it/s][A
epoch 1 iter 2820: train loss 1.29289. lr 5.665946e-04:  30%|███       | 2820/9301 [17:19<39:53,  2.71it/s][A
epoch 1 iter 2820: train loss 1.29289. lr 5.665946e-04:  30%|███       | 2821/9301 [17:19<39:49,  2.71it/s][A

epoch 1 iter 2853: train loss 1.16711. lr 5.658237e-04:  31%|███       | 2853/9301 [17:31<39:38,  2.71it/s][A
epoch 1 iter 2853: train loss 1.16711. lr 5.658237e-04:  31%|███       | 2854/9301 [17:31<39:41,  2.71it/s][A
epoch 1 iter 2854: train loss 1.16922. lr 5.658003e-04:  31%|███       | 2854/9301 [17:32<39:41,  2.71it/s][A
epoch 1 iter 2854: train loss 1.16922. lr 5.658003e-04:  31%|███       | 2855/9301 [17:32<39:36,  2.71it/s][A
epoch 1 iter 2855: train loss 1.17188. lr 5.657768e-04:  31%|███       | 2855/9301 [17:32<39:36,  2.71it/s][A
epoch 1 iter 2855: train loss 1.17188. lr 5.657768e-04:  31%|███       | 2856/9301 [17:32<39:35,  2.71it/s][A
epoch 1 iter 2856: train loss 1.20108. lr 5.657533e-04:  31%|███       | 2856/9301 [17:32<39:35,  2.71it/s][A
epoch 1 iter 2856: train loss 1.20108. lr 5.657533e-04:  31%|███       | 2857/9301 [17:32<39:27,  2.72it/s][A
epoch 1 iter 2857: train loss 1.15803. lr 5.657297e-04:  31%|███       | 2857/9301 [17:33<39:27,  2.72it/s][A

epoch 1 iter 2889: train loss 1.10454. lr 5.649734e-04:  31%|███       | 2890/9301 [17:45<39:24,  2.71it/s][A
epoch 1 iter 2890: train loss 1.18846. lr 5.649496e-04:  31%|███       | 2890/9301 [17:45<39:24,  2.71it/s][A
epoch 1 iter 2890: train loss 1.18846. lr 5.649496e-04:  31%|███       | 2891/9301 [17:45<39:21,  2.71it/s][A
epoch 1 iter 2891: train loss 1.09222. lr 5.649258e-04:  31%|███       | 2891/9301 [17:45<39:21,  2.71it/s][A
epoch 1 iter 2891: train loss 1.09222. lr 5.649258e-04:  31%|███       | 2892/9301 [17:45<39:14,  2.72it/s][A
epoch 1 iter 2892: train loss 1.18084. lr 5.649020e-04:  31%|███       | 2892/9301 [17:46<39:14,  2.72it/s][A
epoch 1 iter 2892: train loss 1.18084. lr 5.649020e-04:  31%|███       | 2893/9301 [17:46<39:20,  2.71it/s][A
epoch 1 iter 2893: train loss 1.08672. lr 5.648783e-04:  31%|███       | 2893/9301 [17:46<39:20,  2.71it/s][A
epoch 1 iter 2893: train loss 1.08672. lr 5.648783e-04:  31%|███       | 2894/9301 [17:46<39:25,  2.71it/s][A

epoch 1 iter 5048: train loss 0.73104. lr 4.973884e-04:  54%|█████▍    | 5048/9301 [31:01<26:04,  2.72it/s][A
epoch 1 iter 5048: train loss 0.73104. lr 4.973884e-04:  54%|█████▍    | 5049/9301 [31:01<26:08,  2.71it/s][A
epoch 1 iter 5049: train loss 0.65150. lr 4.973502e-04:  54%|█████▍    | 5049/9301 [31:02<26:08,  2.71it/s][A
epoch 1 iter 5049: train loss 0.65150. lr 4.973502e-04:  54%|█████▍    | 5050/9301 [31:02<26:10,  2.71it/s][A
epoch 1 iter 5050: train loss 0.70753. lr 4.973121e-04:  54%|█████▍    | 5050/9301 [31:02<26:10,  2.71it/s][A
epoch 1 iter 5050: train loss 0.70753. lr 4.973121e-04:  54%|█████▍    | 5051/9301 [31:02<26:07,  2.71it/s][A
epoch 1 iter 5051: train loss 0.67911. lr 4.972739e-04:  54%|█████▍    | 5051/9301 [31:02<26:07,  2.71it/s][A
epoch 1 iter 5051: train loss 0.67911. lr 4.972739e-04:  54%|█████▍    | 5052/9301 [31:02<26:05,  2.71it/s][A
epoch 1 iter 5052: train loss 0.65873. lr 4.972357e-04:  54%|█████▍    | 5052/9301 [31:03<26:05,  2.71it/s][A

epoch 1 iter 5084: train loss 0.64886. lr 4.960112e-04:  55%|█████▍    | 5085/9301 [31:15<25:58,  2.71it/s][A
epoch 1 iter 5085: train loss 0.66076. lr 4.959728e-04:  55%|█████▍    | 5085/9301 [31:15<25:58,  2.71it/s][A
epoch 1 iter 5085: train loss 0.66076. lr 4.959728e-04:  55%|█████▍    | 5086/9301 [31:15<25:55,  2.71it/s][A
epoch 1 iter 5086: train loss 0.63814. lr 4.959345e-04:  55%|█████▍    | 5086/9301 [31:15<25:55,  2.71it/s][A
epoch 1 iter 5086: train loss 0.63814. lr 4.959345e-04:  55%|█████▍    | 5087/9301 [31:15<25:54,  2.71it/s][A
epoch 1 iter 5087: train loss 0.67836. lr 4.958961e-04:  55%|█████▍    | 5087/9301 [31:16<25:54,  2.71it/s][A
epoch 1 iter 5087: train loss 0.67836. lr 4.958961e-04:  55%|█████▍    | 5088/9301 [31:16<25:55,  2.71it/s][A
epoch 1 iter 5088: train loss 0.67827. lr 4.958577e-04:  55%|█████▍    | 5088/9301 [31:16<25:55,  2.71it/s][A
epoch 1 iter 5088: train loss 0.67827. lr 4.958577e-04:  55%|█████▍    | 5089/9301 [31:16<25:58,  2.70it/s][A

epoch 1 iter 5121: train loss 0.57818. lr 4.945882e-04:  55%|█████▌    | 5121/9301 [31:28<25:35,  2.72it/s][A
epoch 1 iter 5121: train loss 0.57818. lr 4.945882e-04:  55%|█████▌    | 5122/9301 [31:28<25:39,  2.71it/s][A
epoch 1 iter 5122: train loss 0.60114. lr 4.945496e-04:  55%|█████▌    | 5122/9301 [31:29<25:39,  2.71it/s][A
epoch 1 iter 5122: train loss 0.60114. lr 4.945496e-04:  55%|█████▌    | 5123/9301 [31:29<25:41,  2.71it/s][A
epoch 1 iter 5123: train loss 0.66902. lr 4.945110e-04:  55%|█████▌    | 5123/9301 [31:29<25:41,  2.71it/s][A
epoch 1 iter 5123: train loss 0.66902. lr 4.945110e-04:  55%|█████▌    | 5124/9301 [31:29<25:38,  2.71it/s][A
epoch 1 iter 5124: train loss 0.63907. lr 4.944725e-04:  55%|█████▌    | 5124/9301 [31:29<25:38,  2.71it/s][A
epoch 1 iter 5124: train loss 0.63907. lr 4.944725e-04:  55%|█████▌    | 5125/9301 [31:29<25:38,  2.71it/s][A
epoch 1 iter 5125: train loss 0.70548. lr 4.944339e-04:  55%|█████▌    | 5125/9301 [31:30<25:38,  2.71it/s][A

epoch 1 iter 5157: train loss 0.67446. lr 4.931963e-04:  55%|█████▌    | 5158/9301 [31:42<25:35,  2.70it/s][A
epoch 1 iter 5158: train loss 0.66769. lr 4.931576e-04:  55%|█████▌    | 5158/9301 [31:42<25:35,  2.70it/s][A
epoch 1 iter 5158: train loss 0.66769. lr 4.931576e-04:  55%|█████▌    | 5159/9301 [31:42<25:31,  2.70it/s][A
epoch 1 iter 5159: train loss 0.66810. lr 4.931188e-04:  55%|█████▌    | 5159/9301 [31:42<25:31,  2.70it/s][A
epoch 1 iter 5159: train loss 0.66810. lr 4.931188e-04:  55%|█████▌    | 5160/9301 [31:42<25:31,  2.70it/s][A
epoch 1 iter 5160: train loss 0.62251. lr 4.930800e-04:  55%|█████▌    | 5160/9301 [31:43<25:31,  2.70it/s][A
epoch 1 iter 5160: train loss 0.62251. lr 4.930800e-04:  55%|█████▌    | 5161/9301 [31:43<25:28,  2.71it/s][A
epoch 1 iter 5161: train loss 0.63859. lr 4.930413e-04:  55%|█████▌    | 5161/9301 [31:43<25:28,  2.71it/s][A
epoch 1 iter 5161: train loss 0.63859. lr 4.930413e-04:  55%|█████▌    | 5162/9301 [31:43<25:30,  2.70it/s][A

epoch 1 iter 5194: train loss 0.65665. lr 4.917584e-04:  56%|█████▌    | 5194/9301 [31:55<25:13,  2.71it/s][A
epoch 1 iter 5194: train loss 0.65665. lr 4.917584e-04:  56%|█████▌    | 5195/9301 [31:55<25:10,  2.72it/s][A
epoch 1 iter 5195: train loss 0.69490. lr 4.917194e-04:  56%|█████▌    | 5195/9301 [31:56<25:10,  2.72it/s][A
epoch 1 iter 5195: train loss 0.69490. lr 4.917194e-04:  56%|█████▌    | 5196/9301 [31:56<25:16,  2.71it/s][A
epoch 1 iter 5196: train loss 0.64238. lr 4.916805e-04:  56%|█████▌    | 5196/9301 [31:56<25:16,  2.71it/s][A
epoch 1 iter 5196: train loss 0.64238. lr 4.916805e-04:  56%|█████▌    | 5197/9301 [31:56<25:19,  2.70it/s][A
epoch 1 iter 5197: train loss 0.63938. lr 4.916415e-04:  56%|█████▌    | 5197/9301 [31:56<25:19,  2.70it/s][A
epoch 1 iter 5197: train loss 0.63938. lr 4.916415e-04:  56%|█████▌    | 5198/9301 [31:56<25:19,  2.70it/s][A
epoch 1 iter 5198: train loss 0.66599. lr 4.916025e-04:  56%|█████▌    | 5198/9301 [31:57<25:19,  2.70it/s][A

epoch 1 iter 5230: train loss 0.65035. lr 4.903521e-04:  56%|█████▌    | 5231/9301 [32:09<24:58,  2.72it/s][A
epoch 1 iter 5231: train loss 0.63684. lr 4.903130e-04:  56%|█████▌    | 5231/9301 [32:09<24:58,  2.72it/s][A
epoch 1 iter 5231: train loss 0.63684. lr 4.903130e-04:  56%|█████▋    | 5232/9301 [32:09<25:01,  2.71it/s][A
epoch 1 iter 5232: train loss 0.68121. lr 4.902738e-04:  56%|█████▋    | 5232/9301 [32:09<25:01,  2.71it/s][A
epoch 1 iter 5232: train loss 0.68121. lr 4.902738e-04:  56%|█████▋    | 5233/9301 [32:09<25:03,  2.71it/s][A
epoch 1 iter 5233: train loss 0.60696. lr 4.902346e-04:  56%|█████▋    | 5233/9301 [32:10<25:03,  2.71it/s][A
epoch 1 iter 5233: train loss 0.60696. lr 4.902346e-04:  56%|█████▋    | 5234/9301 [32:10<25:00,  2.71it/s][A
epoch 1 iter 5234: train loss 0.65360. lr 4.901955e-04:  56%|█████▋    | 5234/9301 [32:10<25:00,  2.71it/s][A
epoch 1 iter 5234: train loss 0.65360. lr 4.901955e-04:  56%|█████▋    | 5235/9301 [32:10<24:57,  2.72it/s][A

epoch 1 iter 5267: train loss 0.66848. lr 4.888995e-04:  57%|█████▋    | 5267/9301 [32:22<24:47,  2.71it/s][A
epoch 1 iter 5267: train loss 0.66848. lr 4.888995e-04:  57%|█████▋    | 5268/9301 [32:22<24:48,  2.71it/s][A
epoch 1 iter 5268: train loss 0.66632. lr 4.888601e-04:  57%|█████▋    | 5268/9301 [32:23<24:48,  2.71it/s][A
epoch 1 iter 5268: train loss 0.66632. lr 4.888601e-04:  57%|█████▋    | 5269/9301 [32:23<24:46,  2.71it/s][A
epoch 1 iter 5269: train loss 0.64480. lr 4.888207e-04:  57%|█████▋    | 5269/9301 [32:23<24:46,  2.71it/s][A
epoch 1 iter 5269: train loss 0.64480. lr 4.888207e-04:  57%|█████▋    | 5270/9301 [32:23<24:45,  2.71it/s][A
epoch 1 iter 5270: train loss 0.64470. lr 4.887814e-04:  57%|█████▋    | 5270/9301 [32:23<24:45,  2.71it/s][A
epoch 1 iter 5270: train loss 0.64470. lr 4.887814e-04:  57%|█████▋    | 5271/9301 [32:23<24:42,  2.72it/s][A
epoch 1 iter 5271: train loss 0.64803. lr 4.887420e-04:  57%|█████▋    | 5271/9301 [32:24<24:42,  2.72it/s][A

epoch 1 iter 5303: train loss 0.65656. lr 4.874790e-04:  57%|█████▋    | 5304/9301 [32:35<24:29,  2.72it/s][A
epoch 1 iter 5304: train loss 0.61836. lr 4.874394e-04:  57%|█████▋    | 5304/9301 [32:36<24:29,  2.72it/s][A
epoch 1 iter 5304: train loss 0.61836. lr 4.874394e-04:  57%|█████▋    | 5305/9301 [32:36<24:32,  2.71it/s][A
epoch 1 iter 5305: train loss 0.64976. lr 4.873999e-04:  57%|█████▋    | 5305/9301 [32:36<24:32,  2.71it/s][A
epoch 1 iter 5305: train loss 0.64976. lr 4.873999e-04:  57%|█████▋    | 5306/9301 [32:36<24:31,  2.71it/s][A
epoch 1 iter 5306: train loss 0.65213. lr 4.873603e-04:  57%|█████▋    | 5306/9301 [32:37<24:31,  2.71it/s][A
epoch 1 iter 5306: train loss 0.65213. lr 4.873603e-04:  57%|█████▋    | 5307/9301 [32:37<24:35,  2.71it/s][A
epoch 1 iter 5307: train loss 0.64051. lr 4.873207e-04:  57%|█████▋    | 5307/9301 [32:37<24:35,  2.71it/s][A
epoch 1 iter 5307: train loss 0.64051. lr 4.873207e-04:  57%|█████▋    | 5308/9301 [32:37<24:35,  2.71it/s][A

epoch 1 iter 5340: train loss 0.62413. lr 4.860118e-04:  57%|█████▋    | 5340/9301 [32:49<24:20,  2.71it/s][A
epoch 1 iter 5340: train loss 0.62413. lr 4.860118e-04:  57%|█████▋    | 5341/9301 [32:49<24:21,  2.71it/s][A
epoch 1 iter 5341: train loss 0.67675. lr 4.859721e-04:  57%|█████▋    | 5341/9301 [32:49<24:21,  2.71it/s][A
epoch 1 iter 5341: train loss 0.67675. lr 4.859721e-04:  57%|█████▋    | 5342/9301 [32:49<24:24,  2.70it/s][A
epoch 1 iter 5342: train loss 0.64264. lr 4.859323e-04:  57%|█████▋    | 5342/9301 [32:50<24:24,  2.70it/s][A
epoch 1 iter 5342: train loss 0.64264. lr 4.859323e-04:  57%|█████▋    | 5343/9301 [32:50<24:25,  2.70it/s][A
epoch 1 iter 5343: train loss 0.64209. lr 4.858926e-04:  57%|█████▋    | 5343/9301 [32:50<24:25,  2.70it/s][A
epoch 1 iter 5343: train loss 0.64209. lr 4.858926e-04:  57%|█████▋    | 5344/9301 [32:50<24:21,  2.71it/s][A
epoch 1 iter 5344: train loss 0.69901. lr 4.858528e-04:  57%|█████▋    | 5344/9301 [32:51<24:21,  2.71it/s][A

epoch 1 iter 5376: train loss 0.64365. lr 4.845774e-04:  58%|█████▊    | 5377/9301 [33:02<24:08,  2.71it/s][A
epoch 1 iter 5377: train loss 0.67715. lr 4.845374e-04:  58%|█████▊    | 5377/9301 [33:03<24:08,  2.71it/s][A
epoch 1 iter 5377: train loss 0.67715. lr 4.845374e-04:  58%|█████▊    | 5378/9301 [33:03<24:07,  2.71it/s][A
epoch 1 iter 5378: train loss 0.60368. lr 4.844975e-04:  58%|█████▊    | 5378/9301 [33:03<24:07,  2.71it/s][A
epoch 1 iter 5378: train loss 0.60368. lr 4.844975e-04:  58%|█████▊    | 5379/9301 [33:03<24:05,  2.71it/s][A
epoch 1 iter 5379: train loss 0.70540. lr 4.844575e-04:  58%|█████▊    | 5379/9301 [33:03<24:05,  2.71it/s][A
epoch 1 iter 5379: train loss 0.70540. lr 4.844575e-04:  58%|█████▊    | 5380/9301 [33:03<24:04,  2.72it/s][A
epoch 1 iter 5380: train loss 0.63875. lr 4.844176e-04:  58%|█████▊    | 5380/9301 [33:04<24:04,  2.72it/s][A
epoch 1 iter 5380: train loss 0.63875. lr 4.844176e-04:  58%|█████▊    | 5381/9301 [33:04<24:09,  2.70it/s][A

epoch 1 iter 5413: train loss 0.62784. lr 4.830959e-04:  58%|█████▊    | 5413/9301 [33:16<23:50,  2.72it/s][A
epoch 1 iter 5413: train loss 0.62784. lr 4.830959e-04:  58%|█████▊    | 5414/9301 [33:16<23:49,  2.72it/s][A
epoch 1 iter 5414: train loss 0.62701. lr 4.830558e-04:  58%|█████▊    | 5414/9301 [33:16<23:49,  2.72it/s][A
epoch 1 iter 5414: train loss 0.62701. lr 4.830558e-04:  58%|█████▊    | 5415/9301 [33:16<23:51,  2.71it/s][A
epoch 1 iter 5415: train loss 0.64715. lr 4.830157e-04:  58%|█████▊    | 5415/9301 [33:17<23:51,  2.71it/s][A
epoch 1 iter 5415: train loss 0.64715. lr 4.830157e-04:  58%|█████▊    | 5416/9301 [33:17<23:47,  2.72it/s][A
epoch 1 iter 5416: train loss 0.64921. lr 4.829755e-04:  58%|█████▊    | 5416/9301 [33:17<23:47,  2.72it/s][A
epoch 1 iter 5416: train loss 0.64921. lr 4.829755e-04:  58%|█████▊    | 5417/9301 [33:17<23:49,  2.72it/s][A
epoch 1 iter 5417: train loss 0.66804. lr 4.829353e-04:  58%|█████▊    | 5417/9301 [33:17<23:49,  2.72it/s][A

epoch 1 iter 5449: train loss 0.65240. lr 4.816477e-04:  59%|█████▊    | 5450/9301 [33:29<23:39,  2.71it/s][A
epoch 1 iter 5450: train loss 0.60246. lr 4.816074e-04:  59%|█████▊    | 5450/9301 [33:30<23:39,  2.71it/s][A
epoch 1 iter 5450: train loss 0.60246. lr 4.816074e-04:  59%|█████▊    | 5451/9301 [33:30<23:36,  2.72it/s][A
epoch 1 iter 5451: train loss 0.67733. lr 4.815670e-04:  59%|█████▊    | 5451/9301 [33:30<23:36,  2.72it/s][A
epoch 1 iter 5451: train loss 0.67733. lr 4.815670e-04:  59%|█████▊    | 5452/9301 [33:30<23:40,  2.71it/s][A
epoch 1 iter 5452: train loss 0.63154. lr 4.815267e-04:  59%|█████▊    | 5452/9301 [33:30<23:40,  2.71it/s][A
epoch 1 iter 5452: train loss 0.63154. lr 4.815267e-04:  59%|█████▊    | 5453/9301 [33:30<23:41,  2.71it/s][A
epoch 1 iter 5453: train loss 0.66300. lr 4.814863e-04:  59%|█████▊    | 5453/9301 [33:31<23:41,  2.71it/s][A
epoch 1 iter 5453: train loss 0.66300. lr 4.814863e-04:  59%|█████▊    | 5454/9301 [33:31<23:39,  2.71it/s][A

epoch 1 iter 5486: train loss 0.65858. lr 4.801522e-04:  59%|█████▉    | 5486/9301 [33:43<23:21,  2.72it/s][A
epoch 1 iter 5486: train loss 0.65858. lr 4.801522e-04:  59%|█████▉    | 5487/9301 [33:43<23:24,  2.72it/s][A
epoch 1 iter 5487: train loss 0.65000. lr 4.801117e-04:  59%|█████▉    | 5487/9301 [33:43<23:24,  2.72it/s][A
epoch 1 iter 5487: train loss 0.65000. lr 4.801117e-04:  59%|█████▉    | 5488/9301 [33:43<23:21,  2.72it/s][A
epoch 1 iter 5488: train loss 0.62070. lr 4.800712e-04:  59%|█████▉    | 5488/9301 [33:44<23:21,  2.72it/s][A
epoch 1 iter 5488: train loss 0.62070. lr 4.800712e-04:  59%|█████▉    | 5489/9301 [33:44<23:20,  2.72it/s][A
epoch 1 iter 5489: train loss 0.66792. lr 4.800306e-04:  59%|█████▉    | 5489/9301 [33:44<23:20,  2.72it/s][A
epoch 1 iter 5489: train loss 0.66792. lr 4.800306e-04:  59%|█████▉    | 5490/9301 [33:44<23:23,  2.71it/s][A
epoch 1 iter 5490: train loss 0.63518. lr 4.799901e-04:  59%|█████▉    | 5490/9301 [33:44<23:23,  2.71it/s][A

epoch 1 iter 5522: train loss 0.71538. lr 4.786904e-04:  59%|█████▉    | 5523/9301 [33:56<23:13,  2.71it/s][A
epoch 1 iter 5523: train loss 0.62348. lr 4.786497e-04:  59%|█████▉    | 5523/9301 [33:57<23:13,  2.71it/s][A
epoch 1 iter 5523: train loss 0.62348. lr 4.786497e-04:  59%|█████▉    | 5524/9301 [33:57<23:12,  2.71it/s][A
epoch 1 iter 5524: train loss 0.62275. lr 4.786090e-04:  59%|█████▉    | 5524/9301 [33:57<23:12,  2.71it/s][A
epoch 1 iter 5524: train loss 0.62275. lr 4.786090e-04:  59%|█████▉    | 5525/9301 [33:57<23:10,  2.71it/s][A
epoch 1 iter 5525: train loss 0.67065. lr 4.785683e-04:  59%|█████▉    | 5525/9301 [33:57<23:10,  2.71it/s][A
epoch 1 iter 5525: train loss 0.67065. lr 4.785683e-04:  59%|█████▉    | 5526/9301 [33:57<23:04,  2.73it/s][A
epoch 1 iter 5526: train loss 0.61731. lr 4.785275e-04:  59%|█████▉    | 5526/9301 [33:58<23:04,  2.73it/s][A
epoch 1 iter 5526: train loss 0.61731. lr 4.785275e-04:  59%|█████▉    | 5527/9301 [33:58<23:07,  2.72it/s][A

epoch 1 iter 7688: train loss 0.51149. lr 3.806682e-04:  83%|████████▎ | 7689/9301 [47:15<09:57,  2.70it/s][A
epoch 1 iter 7689: train loss 0.51192. lr 3.806194e-04:  83%|████████▎ | 7689/9301 [47:15<09:57,  2.70it/s][A
epoch 1 iter 7689: train loss 0.51192. lr 3.806194e-04:  83%|████████▎ | 7690/9301 [47:15<09:55,  2.70it/s][A
epoch 1 iter 7690: train loss 0.50376. lr 3.805706e-04:  83%|████████▎ | 7690/9301 [47:16<09:55,  2.70it/s][A
epoch 1 iter 7690: train loss 0.50376. lr 3.805706e-04:  83%|████████▎ | 7691/9301 [47:16<09:54,  2.71it/s][A
epoch 1 iter 7691: train loss 0.49134. lr 3.805218e-04:  83%|████████▎ | 7691/9301 [47:16<09:54,  2.71it/s][A
epoch 1 iter 7691: train loss 0.49134. lr 3.805218e-04:  83%|████████▎ | 7692/9301 [47:16<09:51,  2.72it/s][A
epoch 1 iter 7692: train loss 0.50357. lr 3.804730e-04:  83%|████████▎ | 7692/9301 [47:17<09:51,  2.72it/s][A
epoch 1 iter 7692: train loss 0.50357. lr 3.804730e-04:  83%|████████▎ | 7693/9301 [47:17<09:52,  2.71it/s][A

epoch 1 iter 7725: train loss 0.50096. lr 3.788610e-04:  83%|████████▎ | 7725/9301 [47:29<09:41,  2.71it/s][A
epoch 1 iter 7725: train loss 0.50096. lr 3.788610e-04:  83%|████████▎ | 7726/9301 [47:29<09:40,  2.71it/s][A
epoch 1 iter 7726: train loss 0.49355. lr 3.788121e-04:  83%|████████▎ | 7726/9301 [47:29<09:40,  2.71it/s][A
epoch 1 iter 7726: train loss 0.49355. lr 3.788121e-04:  83%|████████▎ | 7727/9301 [47:29<09:38,  2.72it/s][A
epoch 1 iter 7727: train loss 0.49455. lr 3.787632e-04:  83%|████████▎ | 7727/9301 [47:29<09:38,  2.72it/s][A
epoch 1 iter 7727: train loss 0.49455. lr 3.787632e-04:  83%|████████▎ | 7728/9301 [47:29<09:39,  2.71it/s][A
epoch 1 iter 7728: train loss 0.50679. lr 3.787143e-04:  83%|████████▎ | 7728/9301 [47:30<09:39,  2.71it/s][A
epoch 1 iter 7728: train loss 0.50679. lr 3.787143e-04:  83%|████████▎ | 7729/9301 [47:30<09:40,  2.71it/s][A
epoch 1 iter 7729: train loss 0.51220. lr 3.786654e-04:  83%|████████▎ | 7729/9301 [47:30<09:40,  2.71it/s][A

epoch 1 iter 7761: train loss 0.49822. lr 3.770997e-04:  83%|████████▎ | 7762/9301 [47:42<09:28,  2.71it/s][A
epoch 1 iter 7762: train loss 0.50833. lr 3.770508e-04:  83%|████████▎ | 7762/9301 [47:42<09:28,  2.71it/s][A
epoch 1 iter 7762: train loss 0.50833. lr 3.770508e-04:  83%|████████▎ | 7763/9301 [47:42<09:28,  2.70it/s][A
epoch 1 iter 7763: train loss 0.51046. lr 3.770018e-04:  83%|████████▎ | 7763/9301 [47:43<09:28,  2.70it/s][A
epoch 1 iter 7763: train loss 0.51046. lr 3.770018e-04:  83%|████████▎ | 7764/9301 [47:43<09:28,  2.71it/s][A
epoch 1 iter 7764: train loss 0.47247. lr 3.769528e-04:  83%|████████▎ | 7764/9301 [47:43<09:28,  2.71it/s][A
epoch 1 iter 7764: train loss 0.47247. lr 3.769528e-04:  83%|████████▎ | 7765/9301 [47:43<09:26,  2.71it/s][A
epoch 1 iter 7765: train loss 0.47266. lr 3.769038e-04:  83%|████████▎ | 7765/9301 [47:43<09:26,  2.71it/s][A
epoch 1 iter 7765: train loss 0.47266. lr 3.769038e-04:  83%|████████▎ | 7766/9301 [47:43<09:26,  2.71it/s][A

epoch 1 iter 7798: train loss 0.47713. lr 3.752865e-04:  84%|████████▍ | 7798/9301 [47:56<09:17,  2.69it/s][A
epoch 1 iter 7798: train loss 0.47713. lr 3.752865e-04:  84%|████████▍ | 7799/9301 [47:56<09:17,  2.69it/s][A
epoch 1 iter 7799: train loss 0.47762. lr 3.752375e-04:  84%|████████▍ | 7799/9301 [47:56<09:17,  2.69it/s][A
epoch 1 iter 7799: train loss 0.47762. lr 3.752375e-04:  84%|████████▍ | 7800/9301 [47:56<09:15,  2.70it/s][A
epoch 1 iter 7800: train loss 0.49579. lr 3.751884e-04:  84%|████████▍ | 7800/9301 [47:56<09:15,  2.70it/s][A
epoch 1 iter 7800: train loss 0.49579. lr 3.751884e-04:  84%|████████▍ | 7801/9301 [47:56<09:13,  2.71it/s][A
epoch 1 iter 7801: train loss 0.46331. lr 3.751394e-04:  84%|████████▍ | 7801/9301 [47:57<09:13,  2.71it/s][A
epoch 1 iter 7801: train loss 0.46331. lr 3.751394e-04:  84%|████████▍ | 7802/9301 [47:57<09:10,  2.72it/s][A
epoch 1 iter 7802: train loss 0.52993. lr 3.750903e-04:  84%|████████▍ | 7802/9301 [47:57<09:10,  2.72it/s][A

epoch 1 iter 7834: train loss 0.51370. lr 3.735195e-04:  84%|████████▍ | 7835/9301 [48:09<09:00,  2.71it/s][A
epoch 1 iter 7835: train loss 0.51357. lr 3.734704e-04:  84%|████████▍ | 7835/9301 [48:09<09:00,  2.71it/s][A
epoch 1 iter 7835: train loss 0.51357. lr 3.734704e-04:  84%|████████▍ | 7836/9301 [48:09<09:00,  2.71it/s][A
epoch 1 iter 7836: train loss 0.49541. lr 3.734213e-04:  84%|████████▍ | 7836/9301 [48:10<09:00,  2.71it/s][A
epoch 1 iter 7836: train loss 0.49541. lr 3.734213e-04:  84%|████████▍ | 7837/9301 [48:10<09:01,  2.70it/s][A
epoch 1 iter 7837: train loss 0.49047. lr 3.733722e-04:  84%|████████▍ | 7837/9301 [48:10<09:01,  2.70it/s][A
epoch 1 iter 7837: train loss 0.49047. lr 3.733722e-04:  84%|████████▍ | 7838/9301 [48:10<09:02,  2.70it/s][A
epoch 1 iter 7838: train loss 0.48420. lr 3.733230e-04:  84%|████████▍ | 7838/9301 [48:10<09:02,  2.70it/s][A
epoch 1 iter 7838: train loss 0.48420. lr 3.733230e-04:  84%|████████▍ | 7839/9301 [48:10<09:02,  2.70it/s][A

epoch 1 iter 7871: train loss 0.49117. lr 3.717006e-04:  85%|████████▍ | 7871/9301 [48:23<08:46,  2.71it/s][A
epoch 1 iter 7871: train loss 0.49117. lr 3.717006e-04:  85%|████████▍ | 7872/9301 [48:23<08:44,  2.73it/s][A
epoch 1 iter 7872: train loss 0.45671. lr 3.716514e-04:  85%|████████▍ | 7872/9301 [48:23<08:44,  2.73it/s][A
epoch 1 iter 7872: train loss 0.45671. lr 3.716514e-04:  85%|████████▍ | 7873/9301 [48:23<08:45,  2.71it/s][A
epoch 1 iter 7873: train loss 0.50258. lr 3.716022e-04:  85%|████████▍ | 7873/9301 [48:23<08:45,  2.71it/s][A
epoch 1 iter 7873: train loss 0.50258. lr 3.716022e-04:  85%|████████▍ | 7874/9301 [48:23<08:46,  2.71it/s][A
epoch 1 iter 7874: train loss 0.51964. lr 3.715530e-04:  85%|████████▍ | 7874/9301 [48:24<08:46,  2.71it/s][A
epoch 1 iter 7874: train loss 0.51964. lr 3.715530e-04:  85%|████████▍ | 7875/9301 [48:24<08:45,  2.71it/s][A
epoch 1 iter 7875: train loss 0.53545. lr 3.715038e-04:  85%|████████▍ | 7875/9301 [48:24<08:45,  2.71it/s][A

epoch 1 iter 7907: train loss 0.48410. lr 3.699282e-04:  85%|████████▌ | 7908/9301 [48:36<08:33,  2.71it/s][A
epoch 1 iter 7908: train loss 0.50015. lr 3.698789e-04:  85%|████████▌ | 7908/9301 [48:36<08:33,  2.71it/s][A
epoch 1 iter 7908: train loss 0.50015. lr 3.698789e-04:  85%|████████▌ | 7909/9301 [48:36<08:33,  2.71it/s][A
epoch 1 iter 7909: train loss 0.48631. lr 3.698296e-04:  85%|████████▌ | 7909/9301 [48:37<08:33,  2.71it/s][A
epoch 1 iter 7909: train loss 0.48631. lr 3.698296e-04:  85%|████████▌ | 7910/9301 [48:37<08:32,  2.71it/s][A
epoch 1 iter 7910: train loss 0.51054. lr 3.697804e-04:  85%|████████▌ | 7910/9301 [48:37<08:32,  2.71it/s][A
epoch 1 iter 7910: train loss 0.51054. lr 3.697804e-04:  85%|████████▌ | 7911/9301 [48:37<08:32,  2.71it/s][A
epoch 1 iter 7911: train loss 0.52047. lr 3.697311e-04:  85%|████████▌ | 7911/9301 [48:37<08:32,  2.71it/s][A
epoch 1 iter 7911: train loss 0.52047. lr 3.697311e-04:  85%|████████▌ | 7912/9301 [48:37<08:30,  2.72it/s][A

epoch 1 iter 7944: train loss 0.51125. lr 3.681038e-04:  85%|████████▌ | 7944/9301 [48:49<08:20,  2.71it/s][A
epoch 1 iter 7944: train loss 0.51125. lr 3.681038e-04:  85%|████████▌ | 7945/9301 [48:49<08:19,  2.72it/s][A
epoch 1 iter 7945: train loss 0.46467. lr 3.680545e-04:  85%|████████▌ | 7945/9301 [48:50<08:19,  2.72it/s][A
epoch 1 iter 7945: train loss 0.46467. lr 3.680545e-04:  85%|████████▌ | 7946/9301 [48:50<08:18,  2.72it/s][A
epoch 1 iter 7946: train loss 0.49873. lr 3.680051e-04:  85%|████████▌ | 7946/9301 [48:50<08:18,  2.72it/s][A
epoch 1 iter 7946: train loss 0.49873. lr 3.680051e-04:  85%|████████▌ | 7947/9301 [48:50<08:17,  2.72it/s][A
epoch 1 iter 7947: train loss 0.49741. lr 3.679558e-04:  85%|████████▌ | 7947/9301 [48:51<08:17,  2.72it/s][A
epoch 1 iter 7947: train loss 0.49741. lr 3.679558e-04:  85%|████████▌ | 7948/9301 [48:51<08:18,  2.71it/s][A
epoch 1 iter 7948: train loss 0.52706. lr 3.679064e-04:  85%|████████▌ | 7948/9301 [48:51<08:18,  2.71it/s][A

epoch 1 iter 7980: train loss 0.51051. lr 3.663262e-04:  86%|████████▌ | 7981/9301 [49:03<08:05,  2.72it/s][A
epoch 1 iter 7981: train loss 0.47400. lr 3.662768e-04:  86%|████████▌ | 7981/9301 [49:03<08:05,  2.72it/s][A
epoch 1 iter 7981: train loss 0.47400. lr 3.662768e-04:  86%|████████▌ | 7982/9301 [49:03<08:04,  2.72it/s][A
epoch 1 iter 7982: train loss 0.46867. lr 3.662274e-04:  86%|████████▌ | 7982/9301 [49:03<08:04,  2.72it/s][A
epoch 1 iter 7982: train loss 0.46867. lr 3.662274e-04:  86%|████████▌ | 7983/9301 [49:03<08:05,  2.72it/s][A
epoch 1 iter 7983: train loss 0.48288. lr 3.661780e-04:  86%|████████▌ | 7983/9301 [49:04<08:05,  2.72it/s][A
epoch 1 iter 7983: train loss 0.48288. lr 3.661780e-04:  86%|████████▌ | 7984/9301 [49:04<08:05,  2.71it/s][A
epoch 1 iter 7984: train loss 0.48639. lr 3.661285e-04:  86%|████████▌ | 7984/9301 [49:04<08:05,  2.71it/s][A
epoch 1 iter 7984: train loss 0.48639. lr 3.661285e-04:  86%|████████▌ | 7985/9301 [49:04<08:04,  2.71it/s][A

epoch 1 iter 8017: train loss 0.52491. lr 3.644967e-04:  86%|████████▌ | 8017/9301 [49:16<07:54,  2.71it/s][A
epoch 1 iter 8017: train loss 0.52491. lr 3.644967e-04:  86%|████████▌ | 8018/9301 [49:16<07:55,  2.70it/s][A
epoch 1 iter 8018: train loss 0.53404. lr 3.644472e-04:  86%|████████▌ | 8018/9301 [49:17<07:55,  2.70it/s][A
epoch 1 iter 8018: train loss 0.53404. lr 3.644472e-04:  86%|████████▌ | 8019/9301 [49:17<07:55,  2.70it/s][A
epoch 1 iter 8019: train loss 0.47915. lr 3.643977e-04:  86%|████████▌ | 8019/9301 [49:17<07:55,  2.70it/s][A
epoch 1 iter 8019: train loss 0.47915. lr 3.643977e-04:  86%|████████▌ | 8020/9301 [49:17<07:53,  2.71it/s][A
epoch 1 iter 8020: train loss 0.47906. lr 3.643482e-04:  86%|████████▌ | 8020/9301 [49:17<07:53,  2.71it/s][A
epoch 1 iter 8020: train loss 0.47906. lr 3.643482e-04:  86%|████████▌ | 8021/9301 [49:17<07:52,  2.71it/s][A
epoch 1 iter 8021: train loss 0.51300. lr 3.642987e-04:  86%|████████▌ | 8021/9301 [49:18<07:52,  2.71it/s][A

epoch 1 iter 8053: train loss 0.48590. lr 3.627141e-04:  87%|████████▋ | 8054/9301 [49:30<07:42,  2.70it/s][A
epoch 1 iter 8054: train loss 0.50013. lr 3.626646e-04:  87%|████████▋ | 8054/9301 [49:30<07:42,  2.70it/s][A
epoch 1 iter 8054: train loss 0.50013. lr 3.626646e-04:  87%|████████▋ | 8055/9301 [49:30<07:41,  2.70it/s][A
epoch 1 iter 8055: train loss 0.48521. lr 3.626150e-04:  87%|████████▋ | 8055/9301 [49:30<07:41,  2.70it/s][A
epoch 1 iter 8055: train loss 0.48521. lr 3.626150e-04:  87%|████████▋ | 8056/9301 [49:30<07:39,  2.71it/s][A
epoch 1 iter 8056: train loss 0.50489. lr 3.625655e-04:  87%|████████▋ | 8056/9301 [49:31<07:39,  2.71it/s][A
epoch 1 iter 8056: train loss 0.50489. lr 3.625655e-04:  87%|████████▋ | 8057/9301 [49:31<07:38,  2.71it/s][A
epoch 1 iter 8057: train loss 0.49752. lr 3.625159e-04:  87%|████████▋ | 8057/9301 [49:31<07:38,  2.71it/s][A
epoch 1 iter 8057: train loss 0.49752. lr 3.625159e-04:  87%|████████▋ | 8058/9301 [49:31<07:39,  2.71it/s][A

epoch 1 iter 8090: train loss 0.48966. lr 3.608797e-04:  87%|████████▋ | 8090/9301 [49:43<07:25,  2.72it/s][A
epoch 1 iter 8090: train loss 0.48966. lr 3.608797e-04:  87%|████████▋ | 8091/9301 [49:43<07:24,  2.72it/s][A
epoch 1 iter 8091: train loss 0.47409. lr 3.608301e-04:  87%|████████▋ | 8091/9301 [49:44<07:24,  2.72it/s][A
epoch 1 iter 8091: train loss 0.47409. lr 3.608301e-04:  87%|████████▋ | 8092/9301 [49:44<07:26,  2.71it/s][A
epoch 1 iter 8092: train loss 0.48010. lr 3.607805e-04:  87%|████████▋ | 8092/9301 [49:44<07:26,  2.71it/s][A
epoch 1 iter 8092: train loss 0.48010. lr 3.607805e-04:  87%|████████▋ | 8093/9301 [49:44<07:27,  2.70it/s][A
epoch 1 iter 8093: train loss 0.50749. lr 3.607308e-04:  87%|████████▋ | 8093/9301 [49:44<07:27,  2.70it/s][A
epoch 1 iter 8093: train loss 0.50749. lr 3.607308e-04:  87%|████████▋ | 8094/9301 [49:44<07:27,  2.70it/s][A
epoch 1 iter 8094: train loss 0.50083. lr 3.606812e-04:  87%|████████▋ | 8094/9301 [49:45<07:27,  2.70it/s][A

epoch 1 iter 8126: train loss 0.48154. lr 3.590925e-04:  87%|████████▋ | 8127/9301 [49:57<07:11,  2.72it/s][A
epoch 1 iter 8127: train loss 0.46195. lr 3.590429e-04:  87%|████████▋ | 8127/9301 [49:57<07:11,  2.72it/s][A
epoch 1 iter 8127: train loss 0.46195. lr 3.590429e-04:  87%|████████▋ | 8128/9301 [49:57<07:12,  2.71it/s][A
epoch 1 iter 8128: train loss 0.51019. lr 3.589932e-04:  87%|████████▋ | 8128/9301 [49:57<07:12,  2.71it/s][A
epoch 1 iter 8128: train loss 0.51019. lr 3.589932e-04:  87%|████████▋ | 8129/9301 [49:57<07:12,  2.71it/s][A
epoch 1 iter 8129: train loss 0.48416. lr 3.589435e-04:  87%|████████▋ | 8129/9301 [49:58<07:12,  2.71it/s][A
epoch 1 iter 8129: train loss 0.48416. lr 3.589435e-04:  87%|████████▋ | 8130/9301 [49:58<07:11,  2.71it/s][A
epoch 1 iter 8130: train loss 0.48599. lr 3.588938e-04:  87%|████████▋ | 8130/9301 [49:58<07:11,  2.71it/s][A
epoch 1 iter 8130: train loss 0.48599. lr 3.588938e-04:  87%|████████▋ | 8131/9301 [49:58<07:10,  2.72it/s][A

epoch 2 iter 993: train loss 0.39702. lr 2.498778e-04:  11%|█         | 993/9301 [06:13<51:13,  2.70it/s][A
epoch 2 iter 993: train loss 0.39702. lr 2.498778e-04:  11%|█         | 994/9301 [06:13<51:12,  2.70it/s][A
epoch 2 iter 994: train loss 0.40205. lr 2.498278e-04:  11%|█         | 994/9301 [06:14<51:12,  2.70it/s][A
epoch 2 iter 994: train loss 0.40205. lr 2.498278e-04:  11%|█         | 995/9301 [06:14<50:58,  2.72it/s][A
epoch 2 iter 995: train loss 0.39730. lr 2.497779e-04:  11%|█         | 995/9301 [06:14<50:58,  2.72it/s][A
epoch 2 iter 995: train loss 0.39730. lr 2.497779e-04:  11%|█         | 996/9301 [06:14<50:57,  2.72it/s][A
epoch 2 iter 996: train loss 0.40462. lr 2.497279e-04:  11%|█         | 996/9301 [06:14<50:57,  2.72it/s][A
epoch 2 iter 996: train loss 0.40462. lr 2.497279e-04:  11%|█         | 997/9301 [06:14<50:56,  2.72it/s][A
epoch 2 iter 997: train loss 0.40630. lr 2.496780e-04:  11%|█         | 997/9301 [06:15<50:56,  2.72it/s][A

epoch 2 iter 1030: train loss 0.39276. lr 2.480305e-04:  11%|█         | 1030/9301 [06:27<51:07,  2.70it/s][A
epoch 2 iter 1030: train loss 0.39276. lr 2.480305e-04:  11%|█         | 1031/9301 [06:27<51:04,  2.70it/s][A
epoch 2 iter 1031: train loss 0.42221. lr 2.479806e-04:  11%|█         | 1031/9301 [06:27<51:04,  2.70it/s][A
epoch 2 iter 1031: train loss 0.42221. lr 2.479806e-04:  11%|█         | 1032/9301 [06:27<50:58,  2.70it/s][A
epoch 2 iter 1032: train loss 0.37084. lr 2.479307e-04:  11%|█         | 1032/9301 [06:28<50:58,  2.70it/s][A
epoch 2 iter 1032: train loss 0.37084. lr 2.479307e-04:  11%|█         | 1033/9301 [06:28<51:02,  2.70it/s][A
epoch 2 iter 1033: train loss 0.40090. lr 2.478808e-04:  11%|█         | 1033/9301 [06:28<51:02,  2.70it/s][A
epoch 2 iter 1033: train loss 0.40090. lr 2.478808e-04:  11%|█         | 1034/9301 [06:28<50:58,  2.70it/s][A
epoch 2 iter 1034: train loss 0.39480. lr 2.478309e-04:  11%|█         | 1034/9301 [06:28<50:58,  2.70it/s][A

epoch 2 iter 1066: train loss 0.41408. lr 2.462350e-04:  11%|█▏        | 1067/9301 [06:40<50:07,  2.74it/s][A
epoch 2 iter 1067: train loss 0.43192. lr 2.461852e-04:  11%|█▏        | 1067/9301 [06:40<50:07,  2.74it/s][A
epoch 2 iter 1067: train loss 0.43192. lr 2.461852e-04:  11%|█▏        | 1068/9301 [06:40<50:15,  2.73it/s][A
epoch 2 iter 1068: train loss 0.38472. lr 2.461353e-04:  11%|█▏        | 1068/9301 [06:41<50:15,  2.73it/s][A
epoch 2 iter 1068: train loss 0.38472. lr 2.461353e-04:  11%|█▏        | 1069/9301 [06:41<50:11,  2.73it/s][A
epoch 2 iter 1069: train loss 0.43450. lr 2.460855e-04:  11%|█▏        | 1069/9301 [06:41<50:11,  2.73it/s][A
epoch 2 iter 1069: train loss 0.43450. lr 2.460855e-04:  12%|█▏        | 1070/9301 [06:41<50:08,  2.74it/s][A
epoch 2 iter 1070: train loss 0.37460. lr 2.460357e-04:  12%|█▏        | 1070/9301 [06:41<50:08,  2.74it/s][A
epoch 2 iter 1070: train loss 0.37460. lr 2.460357e-04:  12%|█▏        | 1071/9301 [06:41<50:03,  2.74it/s][A

epoch 2 iter 1103: train loss 0.45847. lr 2.443918e-04:  12%|█▏        | 1103/9301 [06:54<50:13,  2.72it/s][A
epoch 2 iter 1103: train loss 0.45847. lr 2.443918e-04:  12%|█▏        | 1104/9301 [06:54<50:17,  2.72it/s][A
epoch 2 iter 1104: train loss 0.43730. lr 2.443420e-04:  12%|█▏        | 1104/9301 [06:54<50:17,  2.72it/s][A
epoch 2 iter 1104: train loss 0.43730. lr 2.443420e-04:  12%|█▏        | 1105/9301 [06:54<50:25,  2.71it/s][A
epoch 2 iter 1105: train loss 0.40007. lr 2.442922e-04:  12%|█▏        | 1105/9301 [06:54<50:25,  2.71it/s][A
epoch 2 iter 1105: train loss 0.40007. lr 2.442922e-04:  12%|█▏        | 1106/9301 [06:54<50:25,  2.71it/s][A
epoch 2 iter 1106: train loss 0.41177. lr 2.442424e-04:  12%|█▏        | 1106/9301 [06:55<50:25,  2.71it/s][A
epoch 2 iter 1106: train loss 0.41177. lr 2.442424e-04:  12%|█▏        | 1107/9301 [06:55<50:20,  2.71it/s][A
epoch 2 iter 1107: train loss 0.38337. lr 2.441926e-04:  12%|█▏        | 1107/9301 [06:55<50:20,  2.71it/s][A

epoch 2 iter 1139: train loss 0.38838. lr 2.426004e-04:  12%|█▏        | 1140/9301 [07:07<50:18,  2.70it/s][A
epoch 2 iter 1140: train loss 0.42423. lr 2.425507e-04:  12%|█▏        | 1140/9301 [07:07<50:18,  2.70it/s][A
epoch 2 iter 1140: train loss 0.42423. lr 2.425507e-04:  12%|█▏        | 1141/9301 [07:07<50:14,  2.71it/s][A
epoch 2 iter 1141: train loss 0.42189. lr 2.425010e-04:  12%|█▏        | 1141/9301 [07:08<50:14,  2.71it/s][A
epoch 2 iter 1141: train loss 0.42189. lr 2.425010e-04:  12%|█▏        | 1142/9301 [07:08<50:08,  2.71it/s][A
epoch 2 iter 1142: train loss 0.41241. lr 2.424513e-04:  12%|█▏        | 1142/9301 [07:08<50:08,  2.71it/s][A
epoch 2 iter 1142: train loss 0.41241. lr 2.424513e-04:  12%|█▏        | 1143/9301 [07:08<49:59,  2.72it/s][A
epoch 2 iter 1143: train loss 0.40634. lr 2.424015e-04:  12%|█▏        | 1143/9301 [07:08<49:59,  2.72it/s][A
epoch 2 iter 1143: train loss 0.40634. lr 2.424015e-04:  12%|█▏        | 1144/9301 [07:08<50:10,  2.71it/s][A

epoch 2 iter 1176: train loss 0.38232. lr 2.407616e-04:  13%|█▎        | 1176/9301 [07:20<50:04,  2.70it/s][A
epoch 2 iter 1176: train loss 0.38232. lr 2.407616e-04:  13%|█▎        | 1177/9301 [07:20<49:55,  2.71it/s][A
epoch 2 iter 1177: train loss 0.41063. lr 2.407119e-04:  13%|█▎        | 1177/9301 [07:21<49:55,  2.71it/s][A
epoch 2 iter 1177: train loss 0.41063. lr 2.407119e-04:  13%|█▎        | 1178/9301 [07:21<49:51,  2.72it/s][A
epoch 2 iter 1178: train loss 0.41579. lr 2.406622e-04:  13%|█▎        | 1178/9301 [07:21<49:51,  2.72it/s][A
epoch 2 iter 1178: train loss 0.41579. lr 2.406622e-04:  13%|█▎        | 1179/9301 [07:21<49:53,  2.71it/s][A
epoch 2 iter 1179: train loss 0.41114. lr 2.406126e-04:  13%|█▎        | 1179/9301 [07:22<49:53,  2.71it/s][A
epoch 2 iter 1179: train loss 0.41114. lr 2.406126e-04:  13%|█▎        | 1180/9301 [07:22<50:01,  2.71it/s][A
epoch 2 iter 1180: train loss 0.41015. lr 2.405629e-04:  13%|█▎        | 1180/9301 [07:22<50:01,  2.71it/s][A

epoch 2 iter 1212: train loss 0.40844. lr 2.389746e-04:  13%|█▎        | 1213/9301 [07:34<49:31,  2.72it/s][A
epoch 2 iter 1213: train loss 0.38784. lr 2.389250e-04:  13%|█▎        | 1213/9301 [07:34<49:31,  2.72it/s][A
epoch 2 iter 1213: train loss 0.38784. lr 2.389250e-04:  13%|█▎        | 1214/9301 [07:34<49:44,  2.71it/s][A
epoch 2 iter 1214: train loss 0.39125. lr 2.388754e-04:  13%|█▎        | 1214/9301 [07:34<49:44,  2.71it/s][A
epoch 2 iter 1214: train loss 0.39125. lr 2.388754e-04:  13%|█▎        | 1215/9301 [07:34<49:50,  2.70it/s][A
epoch 2 iter 1215: train loss 0.40125. lr 2.388258e-04:  13%|█▎        | 1215/9301 [07:35<49:50,  2.70it/s][A
epoch 2 iter 1215: train loss 0.40125. lr 2.388258e-04:  13%|█▎        | 1216/9301 [07:35<49:51,  2.70it/s][A
epoch 2 iter 1216: train loss 0.40009. lr 2.387762e-04:  13%|█▎        | 1216/9301 [07:35<49:51,  2.70it/s][A
epoch 2 iter 1216: train loss 0.40009. lr 2.387762e-04:  13%|█▎        | 1217/9301 [07:35<49:42,  2.71it/s][A

epoch 2 iter 1249: train loss 0.40925. lr 2.371403e-04:  13%|█▎        | 1249/9301 [07:47<49:25,  2.72it/s][A
epoch 2 iter 1249: train loss 0.40925. lr 2.371403e-04:  13%|█▎        | 1250/9301 [07:47<49:35,  2.71it/s][A
epoch 2 iter 1250: train loss 0.40914. lr 2.370908e-04:  13%|█▎        | 1250/9301 [07:48<49:35,  2.71it/s][A
epoch 2 iter 1250: train loss 0.40914. lr 2.370908e-04:  13%|█▎        | 1251/9301 [07:48<49:36,  2.70it/s][A
epoch 2 iter 1251: train loss 0.39189. lr 2.370413e-04:  13%|█▎        | 1251/9301 [07:48<49:36,  2.70it/s][A
epoch 2 iter 1251: train loss 0.39189. lr 2.370413e-04:  13%|█▎        | 1252/9301 [07:48<49:24,  2.71it/s][A
epoch 2 iter 1252: train loss 0.41650. lr 2.369917e-04:  13%|█▎        | 1252/9301 [07:48<49:24,  2.71it/s][A
epoch 2 iter 1252: train loss 0.41650. lr 2.369917e-04:  13%|█▎        | 1253/9301 [07:48<49:27,  2.71it/s][A
epoch 2 iter 1253: train loss 0.40872. lr 2.369422e-04:  13%|█▎        | 1253/9301 [07:49<49:27,  2.71it/s][A

epoch 2 iter 1285: train loss 0.40224. lr 2.353580e-04:  14%|█▍        | 1286/9301 [08:01<49:18,  2.71it/s][A
epoch 2 iter 1286: train loss 0.36010. lr 2.353085e-04:  14%|█▍        | 1286/9301 [08:01<49:18,  2.71it/s][A
epoch 2 iter 1286: train loss 0.36010. lr 2.353085e-04:  14%|█▍        | 1287/9301 [08:01<49:13,  2.71it/s][A
epoch 2 iter 1287: train loss 0.38500. lr 2.352591e-04:  14%|█▍        | 1287/9301 [08:01<49:13,  2.71it/s][A
epoch 2 iter 1287: train loss 0.38500. lr 2.352591e-04:  14%|█▍        | 1288/9301 [08:01<49:15,  2.71it/s][A
epoch 2 iter 1288: train loss 0.39361. lr 2.352096e-04:  14%|█▍        | 1288/9301 [08:02<49:15,  2.71it/s][A
epoch 2 iter 1288: train loss 0.39361. lr 2.352096e-04:  14%|█▍        | 1289/9301 [08:02<49:18,  2.71it/s][A
epoch 2 iter 1289: train loss 0.40317. lr 2.351601e-04:  14%|█▍        | 1289/9301 [08:02<49:18,  2.71it/s][A
epoch 2 iter 1289: train loss 0.40317. lr 2.351601e-04:  14%|█▍        | 1290/9301 [08:02<49:23,  2.70it/s][A

epoch 2 iter 1322: train loss 0.41689. lr 2.335287e-04:  14%|█▍        | 1322/9301 [08:14<48:48,  2.72it/s][A
epoch 2 iter 1322: train loss 0.41689. lr 2.335287e-04:  14%|█▍        | 1323/9301 [08:14<48:45,  2.73it/s][A
epoch 2 iter 1323: train loss 0.39729. lr 2.334793e-04:  14%|█▍        | 1323/9301 [08:15<48:45,  2.73it/s][A
epoch 2 iter 1323: train loss 0.39729. lr 2.334793e-04:  14%|█▍        | 1324/9301 [08:15<49:00,  2.71it/s][A
epoch 2 iter 1324: train loss 0.37950. lr 2.334299e-04:  14%|█▍        | 1324/9301 [08:15<49:00,  2.71it/s][A
epoch 2 iter 1324: train loss 0.37950. lr 2.334299e-04:  14%|█▍        | 1325/9301 [08:15<49:11,  2.70it/s][A
epoch 2 iter 1325: train loss 0.39244. lr 2.333804e-04:  14%|█▍        | 1325/9301 [08:15<49:11,  2.70it/s][A
epoch 2 iter 1325: train loss 0.39244. lr 2.333804e-04:  14%|█▍        | 1326/9301 [08:15<49:09,  2.70it/s][A
epoch 2 iter 1326: train loss 0.41069. lr 2.333310e-04:  14%|█▍        | 1326/9301 [08:16<49:09,  2.70it/s][A

epoch 2 iter 1358: train loss 0.41753. lr 2.317512e-04:  15%|█▍        | 1359/9301 [08:28<48:48,  2.71it/s][A
epoch 2 iter 1359: train loss 0.40457. lr 2.317019e-04:  15%|█▍        | 1359/9301 [08:28<48:48,  2.71it/s][A
epoch 2 iter 1359: train loss 0.40457. lr 2.317019e-04:  15%|█▍        | 1360/9301 [08:28<48:55,  2.70it/s][A
epoch 2 iter 1360: train loss 0.42627. lr 2.316526e-04:  15%|█▍        | 1360/9301 [08:28<48:55,  2.70it/s][A
epoch 2 iter 1360: train loss 0.42627. lr 2.316526e-04:  15%|█▍        | 1361/9301 [08:28<48:57,  2.70it/s][A
epoch 2 iter 1361: train loss 0.38081. lr 2.316032e-04:  15%|█▍        | 1361/9301 [08:29<48:57,  2.70it/s][A
epoch 2 iter 1361: train loss 0.38081. lr 2.316032e-04:  15%|█▍        | 1362/9301 [08:29<48:48,  2.71it/s][A
epoch 2 iter 1362: train loss 0.41536. lr 2.315539e-04:  15%|█▍        | 1362/9301 [08:29<48:48,  2.71it/s][A
epoch 2 iter 1362: train loss 0.41536. lr 2.315539e-04:  15%|█▍        | 1363/9301 [08:29<48:46,  2.71it/s][A

epoch 2 iter 1395: train loss 0.44329. lr 2.299271e-04:  15%|█▍        | 1395/9301 [08:41<48:34,  2.71it/s][A
epoch 2 iter 1395: train loss 0.44329. lr 2.299271e-04:  15%|█▌        | 1396/9301 [08:41<48:31,  2.72it/s][A
epoch 2 iter 1396: train loss 0.37860. lr 2.298778e-04:  15%|█▌        | 1396/9301 [08:42<48:31,  2.72it/s][A
epoch 2 iter 1396: train loss 0.37860. lr 2.298778e-04:  15%|█▌        | 1397/9301 [08:42<48:26,  2.72it/s][A
epoch 2 iter 1397: train loss 0.39230. lr 2.298286e-04:  15%|█▌        | 1397/9301 [08:42<48:26,  2.72it/s][A
epoch 2 iter 1397: train loss 0.39230. lr 2.298286e-04:  15%|█▌        | 1398/9301 [08:42<48:15,  2.73it/s][A
epoch 2 iter 1398: train loss 0.43324. lr 2.297793e-04:  15%|█▌        | 1398/9301 [08:42<48:15,  2.73it/s][A
epoch 2 iter 1398: train loss 0.43324. lr 2.297793e-04:  15%|█▌        | 1399/9301 [08:42<48:30,  2.72it/s][A
epoch 2 iter 1399: train loss 0.39588. lr 2.297301e-04:  15%|█▌        | 1399/9301 [08:43<48:30,  2.72it/s][A

epoch 2 iter 1431: train loss 0.37057. lr 2.281549e-04:  15%|█▌        | 1432/9301 [08:54<48:05,  2.73it/s][A
epoch 2 iter 1432: train loss 0.39243. lr 2.281057e-04:  15%|█▌        | 1432/9301 [08:55<48:05,  2.73it/s][A
epoch 2 iter 1432: train loss 0.39243. lr 2.281057e-04:  15%|█▌        | 1433/9301 [08:55<48:08,  2.72it/s][A
epoch 2 iter 1433: train loss 0.41954. lr 2.280565e-04:  15%|█▌        | 1433/9301 [08:55<48:08,  2.72it/s][A
epoch 2 iter 1433: train loss 0.41954. lr 2.280565e-04:  15%|█▌        | 1434/9301 [08:55<48:15,  2.72it/s][A
epoch 2 iter 1434: train loss 0.41938. lr 2.280073e-04:  15%|█▌        | 1434/9301 [08:56<48:15,  2.72it/s][A
epoch 2 iter 1434: train loss 0.41938. lr 2.280073e-04:  15%|█▌        | 1435/9301 [08:56<48:23,  2.71it/s][A
epoch 2 iter 1435: train loss 0.38748. lr 2.279581e-04:  15%|█▌        | 1435/9301 [08:56<48:23,  2.71it/s][A
epoch 2 iter 1435: train loss 0.38748. lr 2.279581e-04:  15%|█▌        | 1436/9301 [08:56<48:12,  2.72it/s][A

epoch 2 iter 1468: train loss 0.37594. lr 2.263362e-04:  16%|█▌        | 1468/9301 [09:08<48:02,  2.72it/s][A
epoch 2 iter 1468: train loss 0.37594. lr 2.263362e-04:  16%|█▌        | 1469/9301 [09:08<48:08,  2.71it/s][A
epoch 2 iter 1469: train loss 0.40153. lr 2.262871e-04:  16%|█▌        | 1469/9301 [09:08<48:08,  2.71it/s][A
epoch 2 iter 1469: train loss 0.40153. lr 2.262871e-04:  16%|█▌        | 1470/9301 [09:08<48:09,  2.71it/s][A
epoch 2 iter 1470: train loss 0.41835. lr 2.262380e-04:  16%|█▌        | 1470/9301 [09:09<48:09,  2.71it/s][A
epoch 2 iter 1470: train loss 0.41835. lr 2.262380e-04:  16%|█▌        | 1471/9301 [09:09<48:04,  2.71it/s][A
epoch 2 iter 1471: train loss 0.38814. lr 2.261888e-04:  16%|█▌        | 1471/9301 [09:09<48:04,  2.71it/s][A
epoch 2 iter 1471: train loss 0.38814. lr 2.261888e-04:  16%|█▌        | 1472/9301 [09:09<48:05,  2.71it/s][A
epoch 2 iter 1472: train loss 0.41883. lr 2.261397e-04:  16%|█▌        | 1472/9301 [09:10<48:05,  2.71it/s][A

epoch 2 iter 3214: train loss 0.35437. lr 1.449980e-04:  35%|███▍      | 3214/9301 [19:52<37:16,  2.72it/s][A
epoch 2 iter 3214: train loss 0.35437. lr 1.449980e-04:  35%|███▍      | 3215/9301 [19:52<37:22,  2.71it/s][A
epoch 2 iter 3215: train loss 0.37255. lr 1.449546e-04:  35%|███▍      | 3215/9301 [19:52<37:22,  2.71it/s][A
epoch 2 iter 3215: train loss 0.37255. lr 1.449546e-04:  35%|███▍      | 3216/9301 [19:52<37:28,  2.71it/s][A
epoch 2 iter 3216: train loss 0.36116. lr 1.449112e-04:  35%|███▍      | 3216/9301 [19:52<37:28,  2.71it/s][A
epoch 2 iter 3216: train loss 0.36116. lr 1.449112e-04:  35%|███▍      | 3217/9301 [19:52<37:20,  2.71it/s][A
epoch 2 iter 3217: train loss 0.36217. lr 1.448678e-04:  35%|███▍      | 3217/9301 [19:53<37:20,  2.71it/s][A
epoch 2 iter 3217: train loss 0.36217. lr 1.448678e-04:  35%|███▍      | 3218/9301 [19:53<37:16,  2.72it/s][A
epoch 2 iter 3218: train loss 0.35584. lr 1.448245e-04:  35%|███▍      | 3218/9301 [19:53<37:16,  2.72it/s][A

epoch 2 iter 3250: train loss 0.36621. lr 1.434392e-04:  35%|███▍      | 3251/9301 [20:05<37:06,  2.72it/s][A
epoch 2 iter 3251: train loss 0.38251. lr 1.433960e-04:  35%|███▍      | 3251/9301 [20:05<37:06,  2.72it/s][A
epoch 2 iter 3251: train loss 0.38251. lr 1.433960e-04:  35%|███▍      | 3252/9301 [20:05<37:11,  2.71it/s][A
epoch 2 iter 3252: train loss 0.37849. lr 1.433527e-04:  35%|███▍      | 3252/9301 [20:06<37:11,  2.71it/s][A
epoch 2 iter 3252: train loss 0.37849. lr 1.433527e-04:  35%|███▍      | 3253/9301 [20:06<37:10,  2.71it/s][A
epoch 2 iter 3253: train loss 0.34192. lr 1.433095e-04:  35%|███▍      | 3253/9301 [20:06<37:10,  2.71it/s][A
epoch 2 iter 3253: train loss 0.34192. lr 1.433095e-04:  35%|███▍      | 3254/9301 [20:06<37:06,  2.72it/s][A
epoch 2 iter 3254: train loss 0.35918. lr 1.432663e-04:  35%|███▍      | 3254/9301 [20:06<37:06,  2.72it/s][A
epoch 2 iter 3254: train loss 0.35918. lr 1.432663e-04:  35%|███▍      | 3255/9301 [20:06<37:04,  2.72it/s][A

epoch 2 iter 3287: train loss 0.34337. lr 1.418431e-04:  35%|███▌      | 3287/9301 [20:19<37:04,  2.70it/s][A
epoch 2 iter 3287: train loss 0.34337. lr 1.418431e-04:  35%|███▌      | 3288/9301 [20:19<37:02,  2.71it/s][A
epoch 2 iter 3288: train loss 0.35669. lr 1.418001e-04:  35%|███▌      | 3288/9301 [20:19<37:02,  2.71it/s][A
epoch 2 iter 3288: train loss 0.35669. lr 1.418001e-04:  35%|███▌      | 3289/9301 [20:19<36:57,  2.71it/s][A
epoch 2 iter 3289: train loss 0.35642. lr 1.417570e-04:  35%|███▌      | 3289/9301 [20:19<36:57,  2.71it/s][A
epoch 2 iter 3289: train loss 0.35642. lr 1.417570e-04:  35%|███▌      | 3290/9301 [20:19<36:53,  2.72it/s][A
epoch 2 iter 3290: train loss 0.34861. lr 1.417140e-04:  35%|███▌      | 3290/9301 [20:20<36:53,  2.72it/s][A
epoch 2 iter 3290: train loss 0.34861. lr 1.417140e-04:  35%|███▌      | 3291/9301 [20:20<36:47,  2.72it/s][A
epoch 2 iter 3291: train loss 0.36597. lr 1.416709e-04:  35%|███▌      | 3291/9301 [20:20<36:47,  2.72it/s][A

epoch 2 iter 3323: train loss 0.35103. lr 1.402961e-04:  36%|███▌      | 3324/9301 [20:32<36:44,  2.71it/s][A
epoch 2 iter 3324: train loss 0.36644. lr 1.402532e-04:  36%|███▌      | 3324/9301 [20:32<36:44,  2.71it/s][A
epoch 2 iter 3324: train loss 0.36644. lr 1.402532e-04:  36%|███▌      | 3325/9301 [20:32<36:41,  2.71it/s][A
epoch 2 iter 3325: train loss 0.34444. lr 1.402103e-04:  36%|███▌      | 3325/9301 [20:33<36:41,  2.71it/s][A
epoch 2 iter 3325: train loss 0.34444. lr 1.402103e-04:  36%|███▌      | 3326/9301 [20:33<36:32,  2.72it/s][A
epoch 2 iter 3326: train loss 0.36615. lr 1.401675e-04:  36%|███▌      | 3326/9301 [20:33<36:32,  2.72it/s][A
epoch 2 iter 3326: train loss 0.36615. lr 1.401675e-04:  36%|███▌      | 3327/9301 [20:33<36:38,  2.72it/s][A
epoch 2 iter 3327: train loss 0.33993. lr 1.401246e-04:  36%|███▌      | 3327/9301 [20:33<36:38,  2.72it/s][A
epoch 2 iter 3327: train loss 0.33993. lr 1.401246e-04:  36%|███▌      | 3328/9301 [20:33<36:39,  2.72it/s][A

epoch 2 iter 3360: train loss 0.36427. lr 1.387123e-04:  36%|███▌      | 3360/9301 [20:45<36:40,  2.70it/s][A
epoch 2 iter 3360: train loss 0.36427. lr 1.387123e-04:  36%|███▌      | 3361/9301 [20:45<36:24,  2.72it/s][A
epoch 2 iter 3361: train loss 0.36401. lr 1.386696e-04:  36%|███▌      | 3361/9301 [20:46<36:24,  2.72it/s][A
epoch 2 iter 3361: train loss 0.36401. lr 1.386696e-04:  36%|███▌      | 3362/9301 [20:46<36:25,  2.72it/s][A
epoch 2 iter 3362: train loss 0.34593. lr 1.386269e-04:  36%|███▌      | 3362/9301 [20:46<36:25,  2.72it/s][A
epoch 2 iter 3362: train loss 0.34593. lr 1.386269e-04:  36%|███▌      | 3363/9301 [20:46<36:24,  2.72it/s][A
epoch 2 iter 3363: train loss 0.36189. lr 1.385842e-04:  36%|███▌      | 3363/9301 [20:47<36:24,  2.72it/s][A
epoch 2 iter 3363: train loss 0.36189. lr 1.385842e-04:  36%|███▌      | 3364/9301 [20:47<36:20,  2.72it/s][A
epoch 2 iter 3364: train loss 0.33098. lr 1.385414e-04:  36%|███▌      | 3364/9301 [20:47<36:20,  2.72it/s][A

epoch 2 iter 3396: train loss 0.36075. lr 1.371773e-04:  37%|███▋      | 3397/9301 [20:59<36:22,  2.70it/s][A
epoch 2 iter 3397: train loss 0.34967. lr 1.371348e-04:  37%|███▋      | 3397/9301 [20:59<36:22,  2.70it/s][A
epoch 2 iter 3397: train loss 0.34967. lr 1.371348e-04:  37%|███▋      | 3398/9301 [20:59<36:21,  2.71it/s][A
epoch 2 iter 3398: train loss 0.35434. lr 1.370922e-04:  37%|███▋      | 3398/9301 [20:59<36:21,  2.71it/s][A
epoch 2 iter 3398: train loss 0.35434. lr 1.370922e-04:  37%|███▋      | 3399/9301 [20:59<36:16,  2.71it/s][A
epoch 2 iter 3399: train loss 0.36581. lr 1.370497e-04:  37%|███▋      | 3399/9301 [21:00<36:16,  2.71it/s][A
epoch 2 iter 3399: train loss 0.36581. lr 1.370497e-04:  37%|███▋      | 3400/9301 [21:00<36:14,  2.71it/s][A
epoch 2 iter 3400: train loss 0.34413. lr 1.370071e-04:  37%|███▋      | 3400/9301 [21:00<36:14,  2.71it/s][A
epoch 2 iter 3400: train loss 0.34413. lr 1.370071e-04:  37%|███▋      | 3401/9301 [21:00<36:07,  2.72it/s][A

epoch 2 iter 3433: train loss 0.35831. lr 1.356060e-04:  37%|███▋      | 3433/9301 [21:12<36:19,  2.69it/s][A
epoch 2 iter 3433: train loss 0.35831. lr 1.356060e-04:  37%|███▋      | 3434/9301 [21:12<36:17,  2.69it/s][A
epoch 2 iter 3434: train loss 0.35078. lr 1.355636e-04:  37%|███▋      | 3434/9301 [21:13<36:17,  2.69it/s][A
epoch 2 iter 3434: train loss 0.35078. lr 1.355636e-04:  37%|███▋      | 3435/9301 [21:13<36:11,  2.70it/s][A
epoch 2 iter 3435: train loss 0.36795. lr 1.355212e-04:  37%|███▋      | 3435/9301 [21:13<36:11,  2.70it/s][A
epoch 2 iter 3435: train loss 0.36795. lr 1.355212e-04:  37%|███▋      | 3436/9301 [21:13<36:01,  2.71it/s][A
epoch 2 iter 3436: train loss 0.36669. lr 1.354789e-04:  37%|███▋      | 3436/9301 [21:13<36:01,  2.71it/s][A
epoch 2 iter 3436: train loss 0.36669. lr 1.354789e-04:  37%|███▋      | 3437/9301 [21:13<36:14,  2.70it/s][A
epoch 2 iter 3437: train loss 0.36403. lr 1.354365e-04:  37%|███▋      | 3437/9301 [21:14<36:14,  2.70it/s][A

epoch 2 iter 3469: train loss 0.35557. lr 1.340833e-04:  37%|███▋      | 3470/9301 [21:27<46:22,  2.10it/s][A
epoch 2 iter 3470: train loss 0.38729. lr 1.340411e-04:  37%|███▋      | 3470/9301 [21:27<46:22,  2.10it/s][A
epoch 2 iter 3470: train loss 0.38729. lr 1.340411e-04:  37%|███▋      | 3471/9301 [21:27<46:39,  2.08it/s][A
epoch 2 iter 3471: train loss 0.34521. lr 1.339989e-04:  37%|███▋      | 3471/9301 [21:28<46:39,  2.08it/s][A
epoch 2 iter 3471: train loss 0.34521. lr 1.339989e-04:  37%|███▋      | 3472/9301 [21:28<46:51,  2.07it/s][A
epoch 2 iter 3472: train loss 0.37004. lr 1.339567e-04:  37%|███▋      | 3472/9301 [21:28<46:51,  2.07it/s][A
epoch 2 iter 3472: train loss 0.37004. lr 1.339567e-04:  37%|███▋      | 3473/9301 [21:28<46:44,  2.08it/s][A
epoch 2 iter 3473: train loss 0.36204. lr 1.339145e-04:  37%|███▋      | 3473/9301 [21:29<46:44,  2.08it/s][A
epoch 2 iter 3473: train loss 0.36204. lr 1.339145e-04:  37%|███▋      | 3474/9301 [21:29<46:54,  2.07it/s][A

epoch 2 iter 3506: train loss 0.36562. lr 1.325247e-04:  38%|███▊      | 3506/9301 [21:44<45:52,  2.11it/s][A
epoch 2 iter 3506: train loss 0.36562. lr 1.325247e-04:  38%|███▊      | 3507/9301 [21:44<46:00,  2.10it/s][A
epoch 2 iter 3507: train loss 0.35416. lr 1.324827e-04:  38%|███▊      | 3507/9301 [21:45<46:00,  2.10it/s][A
epoch 2 iter 3507: train loss 0.35416. lr 1.324827e-04:  38%|███▊      | 3508/9301 [21:45<46:02,  2.10it/s][A
epoch 2 iter 3508: train loss 0.35438. lr 1.324406e-04:  38%|███▊      | 3508/9301 [21:45<46:02,  2.10it/s][A
epoch 2 iter 3508: train loss 0.35438. lr 1.324406e-04:  38%|███▊      | 3509/9301 [21:45<46:14,  2.09it/s][A
epoch 2 iter 3509: train loss 0.35068. lr 1.323986e-04:  38%|███▊      | 3509/9301 [21:46<46:14,  2.09it/s][A
epoch 2 iter 3509: train loss 0.35068. lr 1.323986e-04:  38%|███▊      | 3510/9301 [21:46<46:24,  2.08it/s][A
epoch 2 iter 3510: train loss 0.35559. lr 1.323566e-04:  38%|███▊      | 3510/9301 [21:46<46:24,  2.08it/s][A

epoch 2 iter 3542: train loss 0.35541. lr 1.310145e-04:  38%|███▊      | 3543/9301 [22:01<45:23,  2.11it/s][A
epoch 2 iter 3543: train loss 0.35974. lr 1.309726e-04:  38%|███▊      | 3543/9301 [22:02<45:23,  2.11it/s][A
epoch 2 iter 3543: train loss 0.35974. lr 1.309726e-04:  38%|███▊      | 3544/9301 [22:02<45:11,  2.12it/s][A
epoch 2 iter 3544: train loss 0.34850. lr 1.309308e-04:  38%|███▊      | 3544/9301 [22:02<45:11,  2.12it/s][A
epoch 2 iter 3544: train loss 0.34850. lr 1.309308e-04:  38%|███▊      | 3545/9301 [22:02<45:17,  2.12it/s][A
epoch 2 iter 3545: train loss 0.35599. lr 1.308889e-04:  38%|███▊      | 3545/9301 [22:03<45:17,  2.12it/s][A
epoch 2 iter 3545: train loss 0.35599. lr 1.308889e-04:  38%|███▊      | 3546/9301 [22:03<45:31,  2.11it/s][A
epoch 2 iter 3546: train loss 0.35726. lr 1.308471e-04:  38%|███▊      | 3546/9301 [22:03<45:31,  2.11it/s][A
epoch 2 iter 3546: train loss 0.35726. lr 1.308471e-04:  38%|███▊      | 3547/9301 [22:03<45:41,  2.10it/s][A

epoch 2 iter 3579: train loss 0.39010. lr 1.294688e-04:  38%|███▊      | 3579/9301 [22:19<45:08,  2.11it/s][A
epoch 2 iter 3579: train loss 0.39010. lr 1.294688e-04:  38%|███▊      | 3580/9301 [22:19<45:23,  2.10it/s][A
epoch 2 iter 3580: train loss 0.36357. lr 1.294271e-04:  38%|███▊      | 3580/9301 [22:20<45:23,  2.10it/s][A
epoch 2 iter 3580: train loss 0.36357. lr 1.294271e-04:  39%|███▊      | 3581/9301 [22:20<45:22,  2.10it/s][A
epoch 2 iter 3581: train loss 0.33598. lr 1.293855e-04:  39%|███▊      | 3581/9301 [22:20<45:22,  2.10it/s][A
epoch 2 iter 3581: train loss 0.33598. lr 1.293855e-04:  39%|███▊      | 3582/9301 [22:20<44:56,  2.12it/s][A
epoch 2 iter 3582: train loss 0.35065. lr 1.293438e-04:  39%|███▊      | 3582/9301 [22:20<44:56,  2.12it/s][A
epoch 2 iter 3582: train loss 0.35065. lr 1.293438e-04:  39%|███▊      | 3583/9301 [22:20<45:06,  2.11it/s][A
epoch 2 iter 3583: train loss 0.35605. lr 1.293021e-04:  39%|███▊      | 3583/9301 [22:21<45:06,  2.11it/s][A

epoch 2 iter 3615: train loss 0.35702. lr 1.279713e-04:  39%|███▉      | 3616/9301 [22:36<44:56,  2.11it/s][A
epoch 2 iter 3616: train loss 0.34185. lr 1.279298e-04:  39%|███▉      | 3616/9301 [22:37<44:56,  2.11it/s][A
epoch 2 iter 3616: train loss 0.34185. lr 1.279298e-04:  39%|███▉      | 3617/9301 [22:37<44:56,  2.11it/s][A
epoch 2 iter 3617: train loss 0.33200. lr 1.278883e-04:  39%|███▉      | 3617/9301 [22:37<44:56,  2.11it/s][A
epoch 2 iter 3617: train loss 0.33200. lr 1.278883e-04:  39%|███▉      | 3618/9301 [22:37<45:04,  2.10it/s][A
epoch 2 iter 3618: train loss 0.34721. lr 1.278468e-04:  39%|███▉      | 3618/9301 [22:38<45:04,  2.10it/s][A
epoch 2 iter 3618: train loss 0.34721. lr 1.278468e-04:  39%|███▉      | 3619/9301 [22:38<44:47,  2.11it/s][A
epoch 2 iter 3619: train loss 0.36145. lr 1.278054e-04:  39%|███▉      | 3619/9301 [22:38<44:47,  2.11it/s][A
epoch 2 iter 3619: train loss 0.36145. lr 1.278054e-04:  39%|███▉      | 3620/9301 [22:38<44:32,  2.13it/s][A

epoch 2 iter 3652: train loss 0.34329. lr 1.264389e-04:  39%|███▉      | 3652/9301 [22:54<45:31,  2.07it/s][A
epoch 2 iter 3652: train loss 0.34329. lr 1.264389e-04:  39%|███▉      | 3653/9301 [22:54<45:12,  2.08it/s][A
epoch 2 iter 3653: train loss 0.33816. lr 1.263976e-04:  39%|███▉      | 3653/9301 [22:54<45:12,  2.08it/s][A
epoch 2 iter 3653: train loss 0.33816. lr 1.263976e-04:  39%|███▉      | 3654/9301 [22:54<45:16,  2.08it/s][A
epoch 2 iter 3654: train loss 0.34205. lr 1.263563e-04:  39%|███▉      | 3654/9301 [22:55<45:16,  2.08it/s][A
epoch 2 iter 3654: train loss 0.34205. lr 1.263563e-04:  39%|███▉      | 3655/9301 [22:55<45:48,  2.05it/s][A
epoch 2 iter 3655: train loss 0.36939. lr 1.263149e-04:  39%|███▉      | 3655/9301 [22:55<45:48,  2.05it/s][A
epoch 2 iter 3655: train loss 0.36939. lr 1.263149e-04:  39%|███▉      | 3656/9301 [22:55<45:08,  2.08it/s][A
epoch 2 iter 3656: train loss 0.35414. lr 1.262736e-04:  39%|███▉      | 3656/9301 [22:56<45:08,  2.08it/s][A

epoch 2 iter 3688: train loss 0.37497. lr 1.249544e-04:  40%|███▉      | 3689/9301 [23:11<44:14,  2.11it/s][A
epoch 2 iter 3689: train loss 0.33883. lr 1.249132e-04:  40%|███▉      | 3689/9301 [23:12<44:14,  2.11it/s][A
epoch 2 iter 3689: train loss 0.33883. lr 1.249132e-04:  40%|███▉      | 3690/9301 [23:12<44:27,  2.10it/s][A
epoch 2 iter 3690: train loss 0.35096. lr 1.248721e-04:  40%|███▉      | 3690/9301 [23:12<44:27,  2.10it/s][A
epoch 2 iter 3690: train loss 0.35096. lr 1.248721e-04:  40%|███▉      | 3691/9301 [23:12<43:49,  2.13it/s][A
epoch 2 iter 3691: train loss 0.34377. lr 1.248309e-04:  40%|███▉      | 3691/9301 [23:12<43:49,  2.13it/s][A
epoch 2 iter 3691: train loss 0.34377. lr 1.248309e-04:  40%|███▉      | 3692/9301 [23:12<44:23,  2.11it/s][A
epoch 2 iter 3692: train loss 0.35284. lr 1.247898e-04:  40%|███▉      | 3692/9301 [23:13<44:23,  2.11it/s][A
epoch 2 iter 3692: train loss 0.35284. lr 1.247898e-04:  40%|███▉      | 3693/9301 [23:13<44:17,  2.11it/s][A

epoch 2 iter 3725: train loss 0.33532. lr 1.234353e-04:  40%|████      | 3725/9301 [23:29<44:03,  2.11it/s][A
epoch 2 iter 3725: train loss 0.33532. lr 1.234353e-04:  40%|████      | 3726/9301 [23:29<44:34,  2.08it/s][A
epoch 2 iter 3726: train loss 0.35764. lr 1.233944e-04:  40%|████      | 3726/9301 [23:29<44:34,  2.08it/s][A
epoch 2 iter 3726: train loss 0.35764. lr 1.233944e-04:  40%|████      | 3727/9301 [23:29<44:37,  2.08it/s][A
epoch 2 iter 3727: train loss 0.35007. lr 1.233534e-04:  40%|████      | 3727/9301 [23:30<44:37,  2.08it/s][A
epoch 2 iter 3727: train loss 0.35007. lr 1.233534e-04:  40%|████      | 3728/9301 [23:30<44:58,  2.07it/s][A
epoch 2 iter 3728: train loss 0.36013. lr 1.233125e-04:  40%|████      | 3728/9301 [23:30<44:58,  2.07it/s][A
epoch 2 iter 3728: train loss 0.36013. lr 1.233125e-04:  40%|████      | 3729/9301 [23:30<45:16,  2.05it/s][A
epoch 2 iter 3729: train loss 0.32950. lr 1.232715e-04:  40%|████      | 3729/9301 [23:31<45:16,  2.05it/s][A

epoch 2 iter 3761: train loss 0.35094. lr 1.219640e-04:  40%|████      | 3762/9301 [23:46<43:37,  2.12it/s][A
epoch 2 iter 3762: train loss 0.34854. lr 1.219232e-04:  40%|████      | 3762/9301 [23:46<43:37,  2.12it/s][A
epoch 2 iter 3762: train loss 0.34854. lr 1.219232e-04:  40%|████      | 3763/9301 [23:46<43:29,  2.12it/s][A
epoch 2 iter 3763: train loss 0.36871. lr 1.218824e-04:  40%|████      | 3763/9301 [23:47<43:29,  2.12it/s][A
epoch 2 iter 3763: train loss 0.36871. lr 1.218824e-04:  40%|████      | 3764/9301 [23:47<43:38,  2.11it/s][A
epoch 2 iter 3764: train loss 0.33185. lr 1.218417e-04:  40%|████      | 3764/9301 [23:47<43:38,  2.11it/s][A
epoch 2 iter 3764: train loss 0.33185. lr 1.218417e-04:  40%|████      | 3765/9301 [23:47<43:36,  2.12it/s][A
epoch 2 iter 3765: train loss 0.36584. lr 1.218009e-04:  40%|████      | 3765/9301 [23:48<43:36,  2.12it/s][A
epoch 2 iter 3765: train loss 0.36584. lr 1.218009e-04:  40%|████      | 3766/9301 [23:48<43:24,  2.12it/s][A

epoch 2 iter 3798: train loss 0.34023. lr 1.204586e-04:  41%|████      | 3798/9301 [24:04<43:14,  2.12it/s][A
epoch 2 iter 3798: train loss 0.34023. lr 1.204586e-04:  41%|████      | 3799/9301 [24:04<42:54,  2.14it/s][A
epoch 2 iter 3799: train loss 0.35124. lr 1.204180e-04:  41%|████      | 3799/9301 [24:04<42:54,  2.14it/s][A
epoch 2 iter 3799: train loss 0.35124. lr 1.204180e-04:  41%|████      | 3800/9301 [24:04<43:03,  2.13it/s][A
epoch 2 iter 3800: train loss 0.33790. lr 1.203775e-04:  41%|████      | 3800/9301 [24:05<43:03,  2.13it/s][A
epoch 2 iter 3800: train loss 0.33790. lr 1.203775e-04:  41%|████      | 3801/9301 [24:05<43:22,  2.11it/s][A
epoch 2 iter 3801: train loss 0.34302. lr 1.203369e-04:  41%|████      | 3801/9301 [24:05<43:22,  2.11it/s][A
epoch 2 iter 3801: train loss 0.34302. lr 1.203369e-04:  41%|████      | 3802/9301 [24:05<43:08,  2.12it/s][A
epoch 2 iter 3802: train loss 0.31469. lr 1.202963e-04:  41%|████      | 3802/9301 [24:05<43:08,  2.12it/s][A

epoch 2 iter 3834: train loss 0.32433. lr 1.190007e-04:  41%|████      | 3835/9301 [24:21<43:37,  2.09it/s][A
epoch 2 iter 3835: train loss 0.33234. lr 1.189603e-04:  41%|████      | 3835/9301 [24:21<43:37,  2.09it/s][A
epoch 2 iter 3835: train loss 0.33234. lr 1.189603e-04:  41%|████      | 3836/9301 [24:21<42:48,  2.13it/s][A
epoch 2 iter 3836: train loss 0.32416. lr 1.189199e-04:  41%|████      | 3836/9301 [24:22<42:48,  2.13it/s][A
epoch 2 iter 3836: train loss 0.32416. lr 1.189199e-04:  41%|████▏     | 3837/9301 [24:22<42:28,  2.14it/s][A
epoch 2 iter 3837: train loss 0.34514. lr 1.188795e-04:  41%|████▏     | 3837/9301 [24:22<42:28,  2.14it/s][A
epoch 2 iter 3837: train loss 0.34514. lr 1.188795e-04:  41%|████▏     | 3838/9301 [24:22<42:33,  2.14it/s][A
epoch 2 iter 3838: train loss 0.33592. lr 1.188391e-04:  41%|████▏     | 3838/9301 [24:23<42:33,  2.14it/s][A
epoch 2 iter 3838: train loss 0.33592. lr 1.188391e-04:  41%|████▏     | 3839/9301 [24:23<42:51,  2.12it/s][A

epoch 2 iter 3871: train loss 0.35807. lr 1.175092e-04:  42%|████▏     | 3871/9301 [24:39<42:38,  2.12it/s][A
epoch 2 iter 3871: train loss 0.35807. lr 1.175092e-04:  42%|████▏     | 3872/9301 [24:39<42:27,  2.13it/s][A
epoch 2 iter 3872: train loss 0.34935. lr 1.174690e-04:  42%|████▏     | 3872/9301 [24:39<42:27,  2.13it/s][A
epoch 2 iter 3872: train loss 0.34935. lr 1.174690e-04:  42%|████▏     | 3873/9301 [24:39<42:34,  2.12it/s][A
epoch 2 iter 3873: train loss 0.34054. lr 1.174288e-04:  42%|████▏     | 3873/9301 [24:40<42:34,  2.12it/s][A
epoch 2 iter 3873: train loss 0.34054. lr 1.174288e-04:  42%|████▏     | 3874/9301 [24:40<42:33,  2.13it/s][A
epoch 2 iter 3874: train loss 0.35464. lr 1.173886e-04:  42%|████▏     | 3874/9301 [24:40<42:33,  2.13it/s][A
epoch 2 iter 3874: train loss 0.35464. lr 1.173886e-04:  42%|████▏     | 3875/9301 [24:40<42:23,  2.13it/s][A
epoch 2 iter 3875: train loss 0.38856. lr 1.173484e-04:  42%|████▏     | 3875/9301 [24:41<42:23,  2.13it/s][A

epoch 2 iter 3907: train loss 0.36707. lr 1.160649e-04:  42%|████▏     | 3908/9301 [24:56<43:04,  2.09it/s][A
epoch 2 iter 3908: train loss 0.37647. lr 1.160249e-04:  42%|████▏     | 3908/9301 [24:56<43:04,  2.09it/s][A
epoch 2 iter 3908: train loss 0.37647. lr 1.160249e-04:  42%|████▏     | 3909/9301 [24:56<43:18,  2.08it/s][A
epoch 2 iter 3909: train loss 0.35276. lr 1.159848e-04:  42%|████▏     | 3909/9301 [24:57<43:18,  2.08it/s][A
epoch 2 iter 3909: train loss 0.35276. lr 1.159848e-04:  42%|████▏     | 3910/9301 [24:57<42:48,  2.10it/s][A
epoch 2 iter 3910: train loss 0.36991. lr 1.159448e-04:  42%|████▏     | 3910/9301 [24:57<42:48,  2.10it/s][A
epoch 2 iter 3910: train loss 0.36991. lr 1.159448e-04:  42%|████▏     | 3911/9301 [24:57<43:23,  2.07it/s][A
epoch 2 iter 3911: train loss 0.35618. lr 1.159048e-04:  42%|████▏     | 3911/9301 [24:58<43:23,  2.07it/s][A
epoch 2 iter 3911: train loss 0.35618. lr 1.159048e-04:  42%|████▏     | 3912/9301 [24:58<43:26,  2.07it/s][A

epoch 2 iter 3944: train loss 0.32895. lr 1.145875e-04:  42%|████▏     | 3944/9301 [25:14<42:44,  2.09it/s][A
epoch 2 iter 3944: train loss 0.32895. lr 1.145875e-04:  42%|████▏     | 3945/9301 [25:14<43:11,  2.07it/s][A
epoch 2 iter 3945: train loss 0.33907. lr 1.145477e-04:  42%|████▏     | 3945/9301 [25:14<43:11,  2.07it/s][A
epoch 2 iter 3945: train loss 0.33907. lr 1.145477e-04:  42%|████▏     | 3946/9301 [25:14<43:23,  2.06it/s][A
epoch 2 iter 3946: train loss 0.33247. lr 1.145079e-04:  42%|████▏     | 3946/9301 [25:14<43:23,  2.06it/s][A
epoch 2 iter 3946: train loss 0.33247. lr 1.145079e-04:  42%|████▏     | 3947/9301 [25:14<43:16,  2.06it/s][A
epoch 2 iter 3947: train loss 0.34884. lr 1.144680e-04:  42%|████▏     | 3947/9301 [25:15<43:16,  2.06it/s][A
epoch 2 iter 3947: train loss 0.34884. lr 1.144680e-04:  42%|████▏     | 3948/9301 [25:15<43:18,  2.06it/s][A
epoch 2 iter 3948: train loss 0.35593. lr 1.144282e-04:  42%|████▏     | 3948/9301 [25:15<43:18,  2.06it/s][A

epoch 2 iter 3980: train loss 0.34977. lr 1.131570e-04:  43%|████▎     | 3981/9301 [25:29<36:25,  2.43it/s][A
epoch 2 iter 3981: train loss 0.33747. lr 1.131174e-04:  43%|████▎     | 3981/9301 [25:30<36:25,  2.43it/s][A
epoch 2 iter 3981: train loss 0.33747. lr 1.131174e-04:  43%|████▎     | 3982/9301 [25:30<36:14,  2.45it/s][A
epoch 2 iter 3982: train loss 0.35518. lr 1.130778e-04:  43%|████▎     | 3982/9301 [25:30<36:14,  2.45it/s][A
epoch 2 iter 3982: train loss 0.35518. lr 1.130778e-04:  43%|████▎     | 3983/9301 [25:30<36:08,  2.45it/s][A
epoch 2 iter 3983: train loss 0.34893. lr 1.130381e-04:  43%|████▎     | 3983/9301 [25:30<36:08,  2.45it/s][A
epoch 2 iter 3983: train loss 0.34893. lr 1.130381e-04:  43%|████▎     | 3984/9301 [25:30<38:52,  2.28it/s][A
epoch 2 iter 3984: train loss 0.32676. lr 1.129985e-04:  43%|████▎     | 3984/9301 [25:31<38:52,  2.28it/s][A
epoch 2 iter 3984: train loss 0.32676. lr 1.129985e-04:  43%|████▎     | 3985/9301 [25:31<38:10,  2.32it/s][A

epoch 2 iter 4017: train loss 0.33979. lr 1.116940e-04:  43%|████▎     | 4017/9301 [25:46<42:07,  2.09it/s][A
epoch 2 iter 4017: train loss 0.33979. lr 1.116940e-04:  43%|████▎     | 4018/9301 [25:46<41:45,  2.11it/s][A
epoch 2 iter 4018: train loss 0.32272. lr 1.116546e-04:  43%|████▎     | 4018/9301 [25:47<41:45,  2.11it/s][A
epoch 2 iter 4018: train loss 0.32272. lr 1.116546e-04:  43%|████▎     | 4019/9301 [25:47<41:30,  2.12it/s][A
epoch 2 iter 4019: train loss 0.33714. lr 1.116151e-04:  43%|████▎     | 4019/9301 [25:47<41:30,  2.12it/s][A
epoch 2 iter 4019: train loss 0.33714. lr 1.116151e-04:  43%|████▎     | 4020/9301 [25:47<42:05,  2.09it/s][A
epoch 2 iter 4020: train loss 0.34536. lr 1.115757e-04:  43%|████▎     | 4020/9301 [25:48<42:05,  2.09it/s][A
epoch 2 iter 4020: train loss 0.34536. lr 1.115757e-04:  43%|████▎     | 4021/9301 [25:48<42:20,  2.08it/s][A
epoch 2 iter 4021: train loss 0.32837. lr 1.115363e-04:  43%|████▎     | 4021/9301 [25:48<42:20,  2.08it/s][A

epoch 2 iter 4053: train loss 0.34035. lr 1.102776e-04:  44%|████▎     | 4054/9301 [26:04<41:26,  2.11it/s][A
epoch 2 iter 4054: train loss 0.33902. lr 1.102383e-04:  44%|████▎     | 4054/9301 [26:04<41:26,  2.11it/s][A
epoch 2 iter 4054: train loss 0.33902. lr 1.102383e-04:  44%|████▎     | 4055/9301 [26:04<41:37,  2.10it/s][A
epoch 2 iter 4055: train loss 0.32983. lr 1.101991e-04:  44%|████▎     | 4055/9301 [26:05<41:37,  2.10it/s][A
epoch 2 iter 4055: train loss 0.32983. lr 1.101991e-04:  44%|████▎     | 4056/9301 [26:05<42:05,  2.08it/s][A
epoch 2 iter 4056: train loss 0.33840. lr 1.101599e-04:  44%|████▎     | 4056/9301 [26:05<42:05,  2.08it/s][A
epoch 2 iter 4056: train loss 0.33840. lr 1.101599e-04:  44%|████▎     | 4057/9301 [26:05<41:42,  2.10it/s][A
epoch 2 iter 4057: train loss 0.35030. lr 1.101206e-04:  44%|████▎     | 4057/9301 [26:06<41:42,  2.10it/s][A
epoch 2 iter 4057: train loss 0.35030. lr 1.101206e-04:  44%|████▎     | 4058/9301 [26:06<41:28,  2.11it/s][A

epoch 2 iter 4090: train loss 0.34780. lr 1.088291e-04:  44%|████▍     | 4090/9301 [26:21<41:10,  2.11it/s][A
epoch 2 iter 4090: train loss 0.34780. lr 1.088291e-04:  44%|████▍     | 4091/9301 [26:21<41:14,  2.11it/s][A
epoch 2 iter 4091: train loss 0.33502. lr 1.087901e-04:  44%|████▍     | 4091/9301 [26:22<41:14,  2.11it/s][A
epoch 2 iter 4091: train loss 0.33502. lr 1.087901e-04:  44%|████▍     | 4092/9301 [26:22<41:24,  2.10it/s][A
epoch 2 iter 4092: train loss 0.36968. lr 1.087510e-04:  44%|████▍     | 4092/9301 [26:22<41:24,  2.10it/s][A
epoch 2 iter 4092: train loss 0.36968. lr 1.087510e-04:  44%|████▍     | 4093/9301 [26:22<41:55,  2.07it/s][A
epoch 2 iter 4093: train loss 0.33775. lr 1.087120e-04:  44%|████▍     | 4093/9301 [26:23<41:55,  2.07it/s][A
epoch 2 iter 4093: train loss 0.33775. lr 1.087120e-04:  44%|████▍     | 4094/9301 [26:23<42:01,  2.07it/s][A
epoch 2 iter 4094: train loss 0.34170. lr 1.086730e-04:  44%|████▍     | 4094/9301 [26:23<42:01,  2.07it/s][A

epoch 2 iter 4126: train loss 0.34058. lr 1.074270e-04:  44%|████▍     | 4127/9301 [26:39<41:42,  2.07it/s][A
epoch 2 iter 4127: train loss 0.34853. lr 1.073881e-04:  44%|████▍     | 4127/9301 [26:39<41:42,  2.07it/s][A
epoch 2 iter 4127: train loss 0.34853. lr 1.073881e-04:  44%|████▍     | 4128/9301 [26:39<41:38,  2.07it/s][A
epoch 2 iter 4128: train loss 0.33840. lr 1.073493e-04:  44%|████▍     | 4128/9301 [26:40<41:38,  2.07it/s][A
epoch 2 iter 4128: train loss 0.33840. lr 1.073493e-04:  44%|████▍     | 4129/9301 [26:40<41:42,  2.07it/s][A
epoch 2 iter 4129: train loss 0.34402. lr 1.073104e-04:  44%|████▍     | 4129/9301 [26:40<41:42,  2.07it/s][A
epoch 2 iter 4129: train loss 0.34402. lr 1.073104e-04:  44%|████▍     | 4130/9301 [26:40<41:53,  2.06it/s][A
epoch 2 iter 4130: train loss 0.33955. lr 1.072716e-04:  44%|████▍     | 4130/9301 [26:41<41:53,  2.06it/s][A
epoch 2 iter 4130: train loss 0.33955. lr 1.072716e-04:  44%|████▍     | 4131/9301 [26:41<41:50,  2.06it/s][A

epoch 2 iter 4163: train loss 0.34163. lr 1.059933e-04:  45%|████▍     | 4163/9301 [26:57<41:48,  2.05it/s][A
epoch 2 iter 4163: train loss 0.34163. lr 1.059933e-04:  45%|████▍     | 4164/9301 [26:57<42:22,  2.02it/s][A
epoch 2 iter 4164: train loss 0.34610. lr 1.059547e-04:  45%|████▍     | 4164/9301 [26:57<42:22,  2.02it/s][A
epoch 2 iter 4164: train loss 0.34610. lr 1.059547e-04:  45%|████▍     | 4165/9301 [26:57<42:42,  2.00it/s][A
epoch 2 iter 4165: train loss 0.36476. lr 1.059160e-04:  45%|████▍     | 4165/9301 [26:58<42:42,  2.00it/s][A
epoch 2 iter 4165: train loss 0.36476. lr 1.059160e-04:  45%|████▍     | 4166/9301 [26:58<42:46,  2.00it/s][A
epoch 2 iter 4166: train loss 0.34523. lr 1.058774e-04:  45%|████▍     | 4166/9301 [26:58<42:46,  2.00it/s][A
epoch 2 iter 4166: train loss 0.34523. lr 1.058774e-04:  45%|████▍     | 4167/9301 [26:58<42:47,  2.00it/s][A
epoch 2 iter 4167: train loss 0.35645. lr 1.058388e-04:  45%|████▍     | 4167/9301 [26:59<42:47,  2.00it/s][A

epoch 2 iter 4199: train loss 0.34221. lr 1.046056e-04:  45%|████▌     | 4200/9301 [27:14<40:31,  2.10it/s][A
epoch 2 iter 4200: train loss 0.34650. lr 1.045672e-04:  45%|████▌     | 4200/9301 [27:14<40:31,  2.10it/s][A
epoch 2 iter 4200: train loss 0.34650. lr 1.045672e-04:  45%|████▌     | 4201/9301 [27:14<40:32,  2.10it/s][A
epoch 2 iter 4201: train loss 0.32191. lr 1.045288e-04:  45%|████▌     | 4201/9301 [27:15<40:32,  2.10it/s][A
epoch 2 iter 4201: train loss 0.32191. lr 1.045288e-04:  45%|████▌     | 4202/9301 [27:15<40:42,  2.09it/s][A
epoch 2 iter 4202: train loss 0.32833. lr 1.044903e-04:  45%|████▌     | 4202/9301 [27:15<40:42,  2.09it/s][A
epoch 2 iter 4202: train loss 0.32833. lr 1.044903e-04:  45%|████▌     | 4203/9301 [27:15<40:27,  2.10it/s][A
epoch 2 iter 4203: train loss 0.35225. lr 1.044519e-04:  45%|████▌     | 4203/9301 [27:16<40:27,  2.10it/s][A
epoch 2 iter 4203: train loss 0.35225. lr 1.044519e-04:  45%|████▌     | 4204/9301 [27:16<40:18,  2.11it/s][A

epoch 2 iter 4236: train loss 0.36651. lr 1.031870e-04:  46%|████▌     | 4236/9301 [27:31<39:52,  2.12it/s][A
epoch 2 iter 4236: train loss 0.36651. lr 1.031870e-04:  46%|████▌     | 4237/9301 [27:31<39:37,  2.13it/s][A
epoch 2 iter 4237: train loss 0.32912. lr 1.031487e-04:  46%|████▌     | 4237/9301 [27:32<39:37,  2.13it/s][A
epoch 2 iter 4237: train loss 0.32912. lr 1.031487e-04:  46%|████▌     | 4238/9301 [27:32<40:14,  2.10it/s][A
epoch 2 iter 4238: train loss 0.35649. lr 1.031105e-04:  46%|████▌     | 4238/9301 [27:32<40:14,  2.10it/s][A
epoch 2 iter 4238: train loss 0.35649. lr 1.031105e-04:  46%|████▌     | 4239/9301 [27:32<40:25,  2.09it/s][A
epoch 2 iter 4239: train loss 0.32964. lr 1.030723e-04:  46%|████▌     | 4239/9301 [27:33<40:25,  2.09it/s][A
epoch 2 iter 4239: train loss 0.32964. lr 1.030723e-04:  46%|████▌     | 4240/9301 [27:33<40:14,  2.10it/s][A
epoch 2 iter 4240: train loss 0.33446. lr 1.030340e-04:  46%|████▌     | 4240/9301 [27:33<40:14,  2.10it/s][A

epoch 2 iter 4272: train loss 0.33262. lr 1.018140e-04:  46%|████▌     | 4273/9301 [27:49<40:45,  2.06it/s][A
epoch 2 iter 4273: train loss 0.35057. lr 1.017760e-04:  46%|████▌     | 4273/9301 [27:49<40:45,  2.06it/s][A
epoch 2 iter 4273: train loss 0.35057. lr 1.017760e-04:  46%|████▌     | 4274/9301 [27:49<40:45,  2.06it/s][A
epoch 2 iter 4274: train loss 0.35794. lr 1.017379e-04:  46%|████▌     | 4274/9301 [27:50<40:45,  2.06it/s][A
epoch 2 iter 4274: train loss 0.35794. lr 1.017379e-04:  46%|████▌     | 4275/9301 [27:50<40:54,  2.05it/s][A
epoch 2 iter 4275: train loss 0.33412. lr 1.016999e-04:  46%|████▌     | 4275/9301 [27:50<40:54,  2.05it/s][A
epoch 2 iter 4275: train loss 0.33412. lr 1.016999e-04:  46%|████▌     | 4276/9301 [27:50<40:14,  2.08it/s][A
epoch 2 iter 4276: train loss 0.35490. lr 1.016619e-04:  46%|████▌     | 4276/9301 [27:51<40:14,  2.08it/s][A
epoch 2 iter 4276: train loss 0.35490. lr 1.016619e-04:  46%|████▌     | 4277/9301 [27:51<40:09,  2.09it/s][A

epoch 2 iter 4309: train loss 0.32897. lr 1.004105e-04:  46%|████▋     | 4309/9301 [28:06<40:04,  2.08it/s][A
epoch 2 iter 4309: train loss 0.32897. lr 1.004105e-04:  46%|████▋     | 4310/9301 [28:06<40:43,  2.04it/s][A
epoch 2 iter 4310: train loss 0.32594. lr 1.003727e-04:  46%|████▋     | 4310/9301 [28:07<40:43,  2.04it/s][A
epoch 2 iter 4310: train loss 0.32594. lr 1.003727e-04:  46%|████▋     | 4311/9301 [28:07<40:47,  2.04it/s][A
epoch 2 iter 4311: train loss 0.33584. lr 1.003349e-04:  46%|████▋     | 4311/9301 [28:07<40:47,  2.04it/s][A
epoch 2 iter 4311: train loss 0.33584. lr 1.003349e-04:  46%|████▋     | 4312/9301 [28:07<40:32,  2.05it/s][A
epoch 2 iter 4312: train loss 0.31214. lr 1.002971e-04:  46%|████▋     | 4312/9301 [28:08<40:32,  2.05it/s][A
epoch 2 iter 4312: train loss 0.31214. lr 1.002971e-04:  46%|████▋     | 4313/9301 [28:08<40:06,  2.07it/s][A
epoch 2 iter 4313: train loss 0.34742. lr 1.002593e-04:  46%|████▋     | 4313/9301 [28:08<40:06,  2.07it/s][A

epoch 2 iter 4345: train loss 0.30666. lr 9.905248e-05:  47%|████▋     | 4346/9301 [28:23<38:47,  2.13it/s][A
epoch 2 iter 4346: train loss 0.33724. lr 9.901487e-05:  47%|████▋     | 4346/9301 [28:24<38:47,  2.13it/s][A
epoch 2 iter 4346: train loss 0.33724. lr 9.901487e-05:  47%|████▋     | 4347/9301 [28:24<39:23,  2.10it/s][A
epoch 2 iter 4347: train loss 0.33677. lr 9.897725e-05:  47%|████▋     | 4347/9301 [28:24<39:23,  2.10it/s][A
epoch 2 iter 4347: train loss 0.33677. lr 9.897725e-05:  47%|████▋     | 4348/9301 [28:24<39:21,  2.10it/s][A
epoch 2 iter 4348: train loss 0.33035. lr 9.893965e-05:  47%|████▋     | 4348/9301 [28:25<39:21,  2.10it/s][A
epoch 2 iter 4348: train loss 0.33035. lr 9.893965e-05:  47%|████▋     | 4349/9301 [28:25<39:39,  2.08it/s][A
epoch 2 iter 4349: train loss 0.32978. lr 9.890205e-05:  47%|████▋     | 4349/9301 [28:25<39:39,  2.08it/s][A
epoch 2 iter 4349: train loss 0.32978. lr 9.890205e-05:  47%|████▋     | 4350/9301 [28:25<39:46,  2.07it/s][A

epoch 2 iter 4382: train loss 0.35538. lr 9.766445e-05:  47%|████▋     | 4382/9301 [28:41<40:25,  2.03it/s][A
epoch 2 iter 4382: train loss 0.35538. lr 9.766445e-05:  47%|████▋     | 4383/9301 [28:41<39:43,  2.06it/s][A
epoch 2 iter 4383: train loss 0.35261. lr 9.762704e-05:  47%|████▋     | 4383/9301 [28:42<39:43,  2.06it/s][A
epoch 2 iter 4383: train loss 0.35261. lr 9.762704e-05:  47%|████▋     | 4384/9301 [28:42<39:33,  2.07it/s][A
epoch 2 iter 4384: train loss 0.33931. lr 9.758965e-05:  47%|████▋     | 4384/9301 [28:42<39:33,  2.07it/s][A
epoch 2 iter 4384: train loss 0.33931. lr 9.758965e-05:  47%|████▋     | 4385/9301 [28:42<39:55,  2.05it/s][A
epoch 2 iter 4385: train loss 0.34922. lr 9.755225e-05:  47%|████▋     | 4385/9301 [28:43<39:55,  2.05it/s][A
epoch 2 iter 4385: train loss 0.34922. lr 9.755225e-05:  47%|████▋     | 4386/9301 [28:43<39:27,  2.08it/s][A
epoch 2 iter 4386: train loss 0.34169. lr 9.751486e-05:  47%|████▋     | 4386/9301 [28:43<39:27,  2.08it/s][A

epoch 2 iter 4418: train loss 0.33286. lr 9.632151e-05:  48%|████▊     | 4419/9301 [28:58<41:14,  1.97it/s][A
epoch 2 iter 4419: train loss 0.33576. lr 9.628432e-05:  48%|████▊     | 4419/9301 [28:59<41:14,  1.97it/s][A
epoch 2 iter 4419: train loss 0.33576. lr 9.628432e-05:  48%|████▊     | 4420/9301 [28:59<41:30,  1.96it/s][A
epoch 2 iter 4420: train loss 0.33289. lr 9.624713e-05:  48%|████▊     | 4420/9301 [28:59<41:30,  1.96it/s][A
epoch 2 iter 4420: train loss 0.33289. lr 9.624713e-05:  48%|████▊     | 4421/9301 [28:59<40:53,  1.99it/s][A
epoch 2 iter 4421: train loss 0.32341. lr 9.620994e-05:  48%|████▊     | 4421/9301 [29:00<40:53,  1.99it/s][A
epoch 2 iter 4421: train loss 0.32341. lr 9.620994e-05:  48%|████▊     | 4422/9301 [29:00<40:49,  1.99it/s][A
epoch 2 iter 4422: train loss 0.35014. lr 9.617276e-05:  48%|████▊     | 4422/9301 [29:00<40:49,  1.99it/s][A
epoch 2 iter 4422: train loss 0.35014. lr 9.617276e-05:  48%|████▊     | 4423/9301 [29:00<39:38,  2.05it/s][A

epoch 2 iter 4455: train loss 0.36265. lr 9.494912e-05:  48%|████▊     | 4455/9301 [29:16<38:00,  2.12it/s][A
epoch 2 iter 4455: train loss 0.36265. lr 9.494912e-05:  48%|████▊     | 4456/9301 [29:16<38:13,  2.11it/s][A
epoch 2 iter 4456: train loss 0.32271. lr 9.491214e-05:  48%|████▊     | 4456/9301 [29:17<38:13,  2.11it/s][A
epoch 2 iter 4456: train loss 0.32271. lr 9.491214e-05:  48%|████▊     | 4457/9301 [29:17<38:14,  2.11it/s][A
epoch 2 iter 4457: train loss 0.34730. lr 9.487516e-05:  48%|████▊     | 4457/9301 [29:17<38:14,  2.11it/s][A
epoch 2 iter 4457: train loss 0.34730. lr 9.487516e-05:  48%|████▊     | 4458/9301 [29:17<38:09,  2.12it/s][A
epoch 2 iter 4458: train loss 0.33291. lr 9.483819e-05:  48%|████▊     | 4458/9301 [29:18<38:09,  2.12it/s][A
epoch 2 iter 4458: train loss 0.33291. lr 9.483819e-05:  48%|████▊     | 4459/9301 [29:18<38:21,  2.10it/s][A
epoch 2 iter 4459: train loss 0.33487. lr 9.480123e-05:  48%|████▊     | 4459/9301 [29:18<38:21,  2.10it/s][A

epoch 2 iter 4491: train loss 0.33812. lr 9.362150e-05:  48%|████▊     | 4492/9301 [29:32<33:21,  2.40it/s][A
epoch 2 iter 4492: train loss 0.34339. lr 9.358473e-05:  48%|████▊     | 4492/9301 [29:32<33:21,  2.40it/s][A
epoch 2 iter 4492: train loss 0.34339. lr 9.358473e-05:  48%|████▊     | 4493/9301 [29:32<33:20,  2.40it/s][A
epoch 2 iter 4493: train loss 0.33190. lr 9.354797e-05:  48%|████▊     | 4493/9301 [29:33<33:20,  2.40it/s][A
epoch 2 iter 4493: train loss 0.33190. lr 9.354797e-05:  48%|████▊     | 4494/9301 [29:33<33:17,  2.41it/s][A
epoch 2 iter 4494: train loss 0.34520. lr 9.351121e-05:  48%|████▊     | 4494/9301 [29:33<33:17,  2.41it/s][A
epoch 2 iter 4494: train loss 0.34520. lr 9.351121e-05:  48%|████▊     | 4495/9301 [29:33<32:48,  2.44it/s][A
epoch 2 iter 4495: train loss 0.31139. lr 9.347446e-05:  48%|████▊     | 4495/9301 [29:34<32:48,  2.44it/s][A
epoch 2 iter 4495: train loss 0.31139. lr 9.347446e-05:  48%|████▊     | 4496/9301 [29:34<31:57,  2.51it/s][A

epoch 2 iter 4528: train loss 0.34562. lr 9.226495e-05:  49%|████▊     | 4528/9301 [29:46<29:19,  2.71it/s][A
epoch 2 iter 4528: train loss 0.34562. lr 9.226495e-05:  49%|████▊     | 4529/9301 [29:46<29:17,  2.72it/s][A
epoch 2 iter 4529: train loss 0.33917. lr 9.222840e-05:  49%|████▊     | 4529/9301 [29:46<29:17,  2.72it/s][A
epoch 2 iter 4529: train loss 0.33917. lr 9.222840e-05:  49%|████▊     | 4530/9301 [29:46<29:18,  2.71it/s][A
epoch 2 iter 4530: train loss 0.32704. lr 9.219186e-05:  49%|████▊     | 4530/9301 [29:47<29:18,  2.71it/s][A
epoch 2 iter 4530: train loss 0.32704. lr 9.219186e-05:  49%|████▊     | 4531/9301 [29:47<29:15,  2.72it/s][A
epoch 2 iter 4531: train loss 0.34987. lr 9.215532e-05:  49%|████▊     | 4531/9301 [29:47<29:15,  2.72it/s][A
epoch 2 iter 4531: train loss 0.34987. lr 9.215532e-05:  49%|████▊     | 4532/9301 [29:47<29:19,  2.71it/s][A
epoch 2 iter 4532: train loss 0.32762. lr 9.211879e-05:  49%|████▊     | 4532/9301 [29:47<29:19,  2.71it/s][A

epoch 2 iter 4564: train loss 0.33586. lr 9.095286e-05:  49%|████▉     | 4565/9301 [29:59<29:06,  2.71it/s][A
epoch 2 iter 4565: train loss 0.34445. lr 9.091652e-05:  49%|████▉     | 4565/9301 [29:59<29:06,  2.71it/s][A
epoch 2 iter 4565: train loss 0.34445. lr 9.091652e-05:  49%|████▉     | 4566/9301 [29:59<29:01,  2.72it/s][A
epoch 2 iter 4566: train loss 0.32744. lr 9.088019e-05:  49%|████▉     | 4566/9301 [30:00<29:01,  2.72it/s][A
epoch 2 iter 4566: train loss 0.32744. lr 9.088019e-05:  49%|████▉     | 4567/9301 [30:00<29:05,  2.71it/s][A
epoch 2 iter 4567: train loss 0.32901. lr 9.084386e-05:  49%|████▉     | 4567/9301 [30:00<29:05,  2.71it/s][A
epoch 2 iter 4567: train loss 0.32901. lr 9.084386e-05:  49%|████▉     | 4568/9301 [30:00<29:07,  2.71it/s][A
epoch 2 iter 4568: train loss 0.34194. lr 9.080755e-05:  49%|████▉     | 4568/9301 [30:01<29:07,  2.71it/s][A
epoch 2 iter 4568: train loss 0.34194. lr 9.080755e-05:  49%|████▉     | 4569/9301 [30:01<29:03,  2.71it/s][A

epoch 2 iter 4601: train loss 0.34783. lr 8.961237e-05:  49%|████▉     | 4601/9301 [30:13<28:46,  2.72it/s][A
epoch 2 iter 4601: train loss 0.34783. lr 8.961237e-05:  49%|████▉     | 4602/9301 [30:13<28:49,  2.72it/s][A
epoch 2 iter 4602: train loss 0.33030. lr 8.957625e-05:  49%|████▉     | 4602/9301 [30:13<28:49,  2.72it/s][A
epoch 2 iter 4602: train loss 0.33030. lr 8.957625e-05:  49%|████▉     | 4603/9301 [30:13<28:47,  2.72it/s][A
epoch 2 iter 4603: train loss 0.33033. lr 8.954014e-05:  49%|████▉     | 4603/9301 [30:13<28:47,  2.72it/s][A
epoch 2 iter 4603: train loss 0.33033. lr 8.954014e-05:  50%|████▉     | 4604/9301 [30:13<28:46,  2.72it/s][A
epoch 2 iter 4604: train loss 0.32720. lr 8.950404e-05:  50%|████▉     | 4604/9301 [30:14<28:46,  2.72it/s][A
epoch 2 iter 4604: train loss 0.32720. lr 8.950404e-05:  50%|████▉     | 4605/9301 [30:14<28:51,  2.71it/s][A
epoch 2 iter 4605: train loss 0.34078. lr 8.946794e-05:  50%|████▉     | 4605/9301 [30:14<28:51,  2.71it/s][A

epoch 2 iter 4637: train loss 0.35468. lr 8.831599e-05:  50%|████▉     | 4638/9301 [30:26<28:40,  2.71it/s][A
epoch 2 iter 4638: train loss 0.35691. lr 8.828009e-05:  50%|████▉     | 4638/9301 [30:26<28:40,  2.71it/s][A
epoch 2 iter 4638: train loss 0.35691. lr 8.828009e-05:  50%|████▉     | 4639/9301 [30:26<28:37,  2.71it/s][A
epoch 2 iter 4639: train loss 0.34847. lr 8.824420e-05:  50%|████▉     | 4639/9301 [30:27<28:37,  2.71it/s][A
epoch 2 iter 4639: train loss 0.34847. lr 8.824420e-05:  50%|████▉     | 4640/9301 [30:27<28:35,  2.72it/s][A
epoch 2 iter 4640: train loss 0.33099. lr 8.820831e-05:  50%|████▉     | 4640/9301 [30:27<28:35,  2.72it/s][A
epoch 2 iter 4640: train loss 0.33099. lr 8.820831e-05:  50%|████▉     | 4641/9301 [30:27<28:32,  2.72it/s][A
epoch 2 iter 4641: train loss 0.34177. lr 8.817243e-05:  50%|████▉     | 4641/9301 [30:27<28:32,  2.72it/s][A
epoch 2 iter 4641: train loss 0.34177. lr 8.817243e-05:  50%|████▉     | 4642/9301 [30:27<28:36,  2.71it/s][A

epoch 2 iter 4674: train loss 0.33545. lr 8.699176e-05:  50%|█████     | 4674/9301 [30:40<28:25,  2.71it/s][A
epoch 2 iter 4674: train loss 0.33545. lr 8.699176e-05:  50%|█████     | 4675/9301 [30:40<28:23,  2.71it/s][A
epoch 2 iter 4675: train loss 0.31909. lr 8.695608e-05:  50%|█████     | 4675/9301 [30:40<28:23,  2.71it/s][A
epoch 2 iter 4675: train loss 0.31909. lr 8.695608e-05:  50%|█████     | 4676/9301 [30:40<28:30,  2.70it/s][A
epoch 2 iter 4676: train loss 0.32902. lr 8.692041e-05:  50%|█████     | 4676/9301 [30:40<28:30,  2.70it/s][A
epoch 2 iter 4676: train loss 0.32902. lr 8.692041e-05:  50%|█████     | 4677/9301 [30:40<28:33,  2.70it/s][A
epoch 2 iter 4677: train loss 0.33131. lr 8.688475e-05:  50%|█████     | 4677/9301 [30:41<28:33,  2.70it/s][A
epoch 2 iter 4677: train loss 0.33131. lr 8.688475e-05:  50%|█████     | 4678/9301 [30:41<28:34,  2.70it/s][A
epoch 2 iter 4678: train loss 0.32663. lr 8.684909e-05:  50%|█████     | 4678/9301 [30:41<28:34,  2.70it/s][A

epoch 2 iter 4710: train loss 0.34366. lr 8.571130e-05:  51%|█████     | 4711/9301 [30:53<28:11,  2.71it/s][A
epoch 2 iter 4711: train loss 0.35151. lr 8.567584e-05:  51%|█████     | 4711/9301 [30:53<28:11,  2.71it/s][A
epoch 2 iter 4711: train loss 0.35151. lr 8.567584e-05:  51%|█████     | 4712/9301 [30:53<28:14,  2.71it/s][A
epoch 2 iter 4712: train loss 0.30942. lr 8.564039e-05:  51%|█████     | 4712/9301 [30:54<28:14,  2.71it/s][A
epoch 2 iter 4712: train loss 0.30942. lr 8.564039e-05:  51%|█████     | 4713/9301 [30:54<28:12,  2.71it/s][A
epoch 2 iter 4713: train loss 0.33269. lr 8.560495e-05:  51%|█████     | 4713/9301 [30:54<28:12,  2.71it/s][A
epoch 2 iter 4713: train loss 0.33269. lr 8.560495e-05:  51%|█████     | 4714/9301 [30:54<28:08,  2.72it/s][A
epoch 2 iter 4714: train loss 0.33387. lr 8.556951e-05:  51%|█████     | 4714/9301 [30:54<28:08,  2.72it/s][A
epoch 2 iter 4714: train loss 0.33387. lr 8.556951e-05:  51%|█████     | 4715/9301 [30:54<28:10,  2.71it/s][A

epoch 2 iter 4747: train loss 0.34387. lr 8.440352e-05:  51%|█████     | 4747/9301 [31:07<27:55,  2.72it/s][A
epoch 2 iter 4747: train loss 0.34387. lr 8.440352e-05:  51%|█████     | 4748/9301 [31:07<27:59,  2.71it/s][A
epoch 2 iter 4748: train loss 0.34621. lr 8.436829e-05:  51%|█████     | 4748/9301 [31:07<27:59,  2.71it/s][A
epoch 2 iter 4748: train loss 0.34621. lr 8.436829e-05:  51%|█████     | 4749/9301 [31:07<27:56,  2.72it/s][A
epoch 2 iter 4749: train loss 0.33974. lr 8.433307e-05:  51%|█████     | 4749/9301 [31:07<27:56,  2.72it/s][A
epoch 2 iter 4749: train loss 0.33974. lr 8.433307e-05:  51%|█████     | 4750/9301 [31:07<27:55,  2.72it/s][A
epoch 2 iter 4750: train loss 0.33026. lr 8.429785e-05:  51%|█████     | 4750/9301 [31:08<27:55,  2.72it/s][A
epoch 2 iter 4750: train loss 0.33026. lr 8.429785e-05:  51%|█████     | 4751/9301 [31:08<27:49,  2.73it/s][A
epoch 2 iter 4751: train loss 0.31567. lr 8.426264e-05:  51%|█████     | 4751/9301 [31:08<27:49,  2.73it/s][A

epoch 2 iter 4783: train loss 0.33524. lr 8.313917e-05:  51%|█████▏    | 4784/9301 [31:20<27:52,  2.70it/s][A
epoch 2 iter 4784: train loss 0.33776. lr 8.310417e-05:  51%|█████▏    | 4784/9301 [31:20<27:52,  2.70it/s][A
epoch 2 iter 4784: train loss 0.33776. lr 8.310417e-05:  51%|█████▏    | 4785/9301 [31:20<27:48,  2.71it/s][A
epoch 2 iter 4785: train loss 0.35505. lr 8.306917e-05:  51%|█████▏    | 4785/9301 [31:21<27:48,  2.71it/s][A
epoch 2 iter 4785: train loss 0.35505. lr 8.306917e-05:  51%|█████▏    | 4786/9301 [31:21<27:44,  2.71it/s][A
epoch 2 iter 4786: train loss 0.33499. lr 8.303417e-05:  51%|█████▏    | 4786/9301 [31:21<27:44,  2.71it/s][A
epoch 2 iter 4786: train loss 0.33499. lr 8.303417e-05:  51%|█████▏    | 4787/9301 [31:21<27:52,  2.70it/s][A
epoch 2 iter 4787: train loss 0.33974. lr 8.299918e-05:  51%|█████▏    | 4787/9301 [31:21<27:52,  2.70it/s][A
epoch 2 iter 4787: train loss 0.33974. lr 8.299918e-05:  51%|█████▏    | 4788/9301 [31:21<27:54,  2.70it/s][A

epoch 2 iter 4820: train loss 0.32850. lr 8.184806e-05:  52%|█████▏    | 4820/9301 [31:33<27:34,  2.71it/s][A
epoch 2 iter 4820: train loss 0.32850. lr 8.184806e-05:  52%|█████▏    | 4821/9301 [31:33<27:30,  2.71it/s][A
epoch 2 iter 4821: train loss 0.34226. lr 8.181328e-05:  52%|█████▏    | 4821/9301 [31:34<27:30,  2.71it/s][A
epoch 2 iter 4821: train loss 0.34226. lr 8.181328e-05:  52%|█████▏    | 4822/9301 [31:34<27:27,  2.72it/s][A
epoch 2 iter 4822: train loss 0.33488. lr 8.177851e-05:  52%|█████▏    | 4822/9301 [31:34<27:27,  2.72it/s][A
epoch 2 iter 4822: train loss 0.33488. lr 8.177851e-05:  52%|█████▏    | 4823/9301 [31:34<27:31,  2.71it/s][A
epoch 2 iter 4823: train loss 0.33435. lr 8.174375e-05:  52%|█████▏    | 4823/9301 [31:35<27:31,  2.71it/s][A
epoch 2 iter 4823: train loss 0.33435. lr 8.174375e-05:  52%|█████▏    | 4824/9301 [31:35<27:35,  2.70it/s][A
epoch 2 iter 4824: train loss 0.32749. lr 8.170899e-05:  52%|█████▏    | 4824/9301 [31:35<27:35,  2.70it/s][A

epoch 2 iter 4856: train loss 0.32873. lr 8.060001e-05:  52%|█████▏    | 4857/9301 [31:47<27:17,  2.71it/s][A
epoch 2 iter 4857: train loss 0.34839. lr 8.056546e-05:  52%|█████▏    | 4857/9301 [31:47<27:17,  2.71it/s][A
epoch 2 iter 4857: train loss 0.34839. lr 8.056546e-05:  52%|█████▏    | 4858/9301 [31:47<27:14,  2.72it/s][A
epoch 2 iter 4858: train loss 0.33060. lr 8.053092e-05:  52%|█████▏    | 4858/9301 [31:48<27:14,  2.72it/s][A
epoch 2 iter 4858: train loss 0.33060. lr 8.053092e-05:  52%|█████▏    | 4859/9301 [31:48<27:18,  2.71it/s][A
epoch 2 iter 4859: train loss 0.34099. lr 8.049638e-05:  52%|█████▏    | 4859/9301 [31:48<27:18,  2.71it/s][A
epoch 2 iter 4859: train loss 0.34099. lr 8.049638e-05:  52%|█████▏    | 4860/9301 [31:48<27:22,  2.70it/s][A
epoch 2 iter 4860: train loss 0.33506. lr 8.046184e-05:  52%|█████▏    | 4860/9301 [31:48<27:22,  2.70it/s][A
epoch 2 iter 4860: train loss 0.33506. lr 8.046184e-05:  52%|█████▏    | 4861/9301 [31:48<27:19,  2.71it/s][A

epoch 2 iter 4893: train loss 0.34861. lr 7.932575e-05:  53%|█████▎    | 4893/9301 [32:00<27:03,  2.72it/s][A
epoch 2 iter 4893: train loss 0.34861. lr 7.932575e-05:  53%|█████▎    | 4894/9301 [32:00<26:58,  2.72it/s][A
epoch 2 iter 4894: train loss 0.32964. lr 7.929143e-05:  53%|█████▎    | 4894/9301 [32:01<26:58,  2.72it/s][A
epoch 2 iter 4894: train loss 0.32964. lr 7.929143e-05:  53%|█████▎    | 4895/9301 [32:01<27:02,  2.72it/s][A
epoch 2 iter 4895: train loss 0.31086. lr 7.925712e-05:  53%|█████▎    | 4895/9301 [32:01<27:02,  2.72it/s][A
epoch 2 iter 4895: train loss 0.31086. lr 7.925712e-05:  53%|█████▎    | 4896/9301 [32:01<27:07,  2.71it/s][A
epoch 2 iter 4896: train loss 0.35705. lr 7.922281e-05:  53%|█████▎    | 4896/9301 [32:02<27:07,  2.71it/s][A
epoch 2 iter 4896: train loss 0.35705. lr 7.922281e-05:  53%|█████▎    | 4897/9301 [32:02<27:04,  2.71it/s][A
epoch 2 iter 4897: train loss 0.33492. lr 7.918851e-05:  53%|█████▎    | 4897/9301 [32:02<27:04,  2.71it/s][A

epoch 2 iter 4929: train loss 0.32588. lr 7.809420e-05:  53%|█████▎    | 4930/9301 [32:14<26:53,  2.71it/s][A
epoch 2 iter 4930: train loss 0.31973. lr 7.806011e-05:  53%|█████▎    | 4930/9301 [32:14<26:53,  2.71it/s][A
epoch 2 iter 4930: train loss 0.31973. lr 7.806011e-05:  53%|█████▎    | 4931/9301 [32:14<26:56,  2.70it/s][A
epoch 2 iter 4931: train loss 0.31256. lr 7.802602e-05:  53%|█████▎    | 4931/9301 [32:14<26:56,  2.70it/s][A
epoch 2 iter 4931: train loss 0.31256. lr 7.802602e-05:  53%|█████▎    | 4932/9301 [32:14<26:51,  2.71it/s][A
epoch 2 iter 4932: train loss 0.31683. lr 7.799194e-05:  53%|█████▎    | 4932/9301 [32:15<26:51,  2.71it/s][A
epoch 2 iter 4932: train loss 0.31683. lr 7.799194e-05:  53%|█████▎    | 4933/9301 [32:15<26:48,  2.72it/s][A
epoch 2 iter 4933: train loss 0.32917. lr 7.795787e-05:  53%|█████▎    | 4933/9301 [32:15<26:48,  2.72it/s][A
epoch 2 iter 4933: train loss 0.32917. lr 7.795787e-05:  53%|█████▎    | 4934/9301 [32:15<26:43,  2.72it/s][A

epoch 2 iter 4966: train loss 0.33478. lr 7.683699e-05:  53%|█████▎    | 4966/9301 [32:27<26:42,  2.71it/s][A
epoch 2 iter 4966: train loss 0.33478. lr 7.683699e-05:  53%|█████▎    | 4967/9301 [32:27<26:37,  2.71it/s][A
epoch 2 iter 4967: train loss 0.32561. lr 7.680313e-05:  53%|█████▎    | 4967/9301 [32:28<26:37,  2.71it/s][A
epoch 2 iter 4967: train loss 0.32561. lr 7.680313e-05:  53%|█████▎    | 4968/9301 [32:28<26:34,  2.72it/s][A
epoch 2 iter 4968: train loss 0.30956. lr 7.676928e-05:  53%|█████▎    | 4968/9301 [32:28<26:34,  2.72it/s][A
epoch 2 iter 4968: train loss 0.30956. lr 7.676928e-05:  53%|█████▎    | 4969/9301 [32:28<26:31,  2.72it/s][A
epoch 2 iter 4969: train loss 0.30945. lr 7.673544e-05:  53%|█████▎    | 4969/9301 [32:28<26:31,  2.72it/s][A
epoch 2 iter 4969: train loss 0.30945. lr 7.673544e-05:  53%|█████▎    | 4970/9301 [32:28<26:35,  2.71it/s][A
epoch 2 iter 4970: train loss 0.34375. lr 7.670160e-05:  53%|█████▎    | 4970/9301 [32:29<26:35,  2.71it/s][A

epoch 2 iter 5002: train loss 0.32447. lr 7.562212e-05:  54%|█████▍    | 5003/9301 [32:41<26:22,  2.72it/s][A
epoch 2 iter 5003: train loss 0.32256. lr 7.558849e-05:  54%|█████▍    | 5003/9301 [32:41<26:22,  2.72it/s][A
epoch 2 iter 5003: train loss 0.32256. lr 7.558849e-05:  54%|█████▍    | 5004/9301 [32:41<26:18,  2.72it/s][A
epoch 2 iter 5004: train loss 0.33525. lr 7.555487e-05:  54%|█████▍    | 5004/9301 [32:41<26:18,  2.72it/s][A
epoch 2 iter 5004: train loss 0.33525. lr 7.555487e-05:  54%|█████▍    | 5005/9301 [32:41<26:22,  2.71it/s][A
epoch 2 iter 5005: train loss 0.34353. lr 7.552126e-05:  54%|█████▍    | 5005/9301 [32:42<26:22,  2.71it/s][A
epoch 2 iter 5005: train loss 0.34353. lr 7.552126e-05:  54%|█████▍    | 5006/9301 [32:42<26:26,  2.71it/s][A
epoch 2 iter 5006: train loss 0.34116. lr 7.548765e-05:  54%|█████▍    | 5006/9301 [32:42<26:26,  2.71it/s][A
epoch 2 iter 5006: train loss 0.34116. lr 7.548765e-05:  54%|█████▍    | 5007/9301 [32:42<26:23,  2.71it/s][A

epoch 2 iter 5039: train loss 0.32307. lr 7.438215e-05:  54%|█████▍    | 5039/9301 [32:54<26:03,  2.73it/s][A
epoch 2 iter 5039: train loss 0.32307. lr 7.438215e-05:  54%|█████▍    | 5040/9301 [32:54<26:06,  2.72it/s][A
epoch 2 iter 5040: train loss 0.33300. lr 7.434876e-05:  54%|█████▍    | 5040/9301 [32:55<26:06,  2.72it/s][A
epoch 2 iter 5040: train loss 0.33300. lr 7.434876e-05:  54%|█████▍    | 5041/9301 [32:55<26:06,  2.72it/s][A
epoch 2 iter 5041: train loss 0.34098. lr 7.431537e-05:  54%|█████▍    | 5041/9301 [32:55<26:06,  2.72it/s][A
epoch 2 iter 5041: train loss 0.34098. lr 7.431537e-05:  54%|█████▍    | 5042/9301 [32:55<26:05,  2.72it/s][A
epoch 2 iter 5042: train loss 0.31989. lr 7.428200e-05:  54%|█████▍    | 5042/9301 [32:55<26:05,  2.72it/s][A
epoch 2 iter 5042: train loss 0.31989. lr 7.428200e-05:  54%|█████▍    | 5043/9301 [32:55<26:09,  2.71it/s][A
epoch 2 iter 5043: train loss 0.32759. lr 7.424862e-05:  54%|█████▍    | 5043/9301 [32:56<26:09,  2.71it/s][A

epoch 2 iter 5075: train loss 0.33259. lr 7.318414e-05:  55%|█████▍    | 5076/9301 [33:08<25:59,  2.71it/s][A
epoch 2 iter 5076: train loss 0.33173. lr 7.315099e-05:  55%|█████▍    | 5076/9301 [33:08<25:59,  2.71it/s][A
epoch 2 iter 5076: train loss 0.33173. lr 7.315099e-05:  55%|█████▍    | 5077/9301 [33:08<25:57,  2.71it/s][A
epoch 2 iter 5077: train loss 0.33005. lr 7.311783e-05:  55%|█████▍    | 5077/9301 [33:08<25:57,  2.71it/s][A
epoch 2 iter 5077: train loss 0.33005. lr 7.311783e-05:  55%|█████▍    | 5078/9301 [33:08<25:56,  2.71it/s][A
epoch 2 iter 5078: train loss 0.33958. lr 7.308469e-05:  55%|█████▍    | 5078/9301 [33:09<25:56,  2.71it/s][A
epoch 2 iter 5078: train loss 0.33958. lr 7.308469e-05:  55%|█████▍    | 5079/9301 [33:09<25:58,  2.71it/s][A
epoch 2 iter 5079: train loss 0.32641. lr 7.305155e-05:  55%|█████▍    | 5079/9301 [33:09<25:58,  2.71it/s][A
epoch 2 iter 5079: train loss 0.32641. lr 7.305155e-05:  55%|█████▍    | 5080/9301 [33:09<26:01,  2.70it/s][A

epoch 2 iter 5112: train loss 0.33440. lr 7.196160e-05:  55%|█████▍    | 5112/9301 [33:21<25:38,  2.72it/s][A
epoch 2 iter 5112: train loss 0.33440. lr 7.196160e-05:  55%|█████▍    | 5113/9301 [33:21<25:41,  2.72it/s][A
epoch 2 iter 5113: train loss 0.35045. lr 7.192868e-05:  55%|█████▍    | 5113/9301 [33:22<25:41,  2.72it/s][A
epoch 2 iter 5113: train loss 0.35045. lr 7.192868e-05:  55%|█████▍    | 5114/9301 [33:22<25:45,  2.71it/s][A
epoch 2 iter 5114: train loss 0.33665. lr 7.189577e-05:  55%|█████▍    | 5114/9301 [33:22<25:45,  2.71it/s][A
epoch 2 iter 5114: train loss 0.33665. lr 7.189577e-05:  55%|█████▍    | 5115/9301 [33:22<25:46,  2.71it/s][A
epoch 2 iter 5115: train loss 0.30967. lr 7.186286e-05:  55%|█████▍    | 5115/9301 [33:22<25:46,  2.71it/s][A
epoch 2 iter 5115: train loss 0.30967. lr 7.186286e-05:  55%|█████▌    | 5116/9301 [33:22<25:46,  2.71it/s][A
epoch 2 iter 5116: train loss 0.32757. lr 7.182996e-05:  55%|█████▌    | 5116/9301 [33:23<25:46,  2.71it/s][A

epoch 2 iter 5148: train loss 0.30899. lr 7.078064e-05:  55%|█████▌    | 5149/9301 [33:34<25:35,  2.70it/s][A
epoch 2 iter 5149: train loss 0.31726. lr 7.074796e-05:  55%|█████▌    | 5149/9301 [33:35<25:35,  2.70it/s][A
epoch 2 iter 5149: train loss 0.31726. lr 7.074796e-05:  55%|█████▌    | 5150/9301 [33:35<25:38,  2.70it/s][A
epoch 2 iter 5150: train loss 0.32061. lr 7.071528e-05:  55%|█████▌    | 5150/9301 [33:35<25:38,  2.70it/s][A
epoch 2 iter 5150: train loss 0.32061. lr 7.071528e-05:  55%|█████▌    | 5151/9301 [33:35<25:37,  2.70it/s][A
epoch 2 iter 5151: train loss 0.33511. lr 7.068261e-05:  55%|█████▌    | 5151/9301 [33:36<25:37,  2.70it/s][A
epoch 2 iter 5151: train loss 0.33511. lr 7.068261e-05:  55%|█████▌    | 5152/9301 [33:36<25:34,  2.70it/s][A
epoch 2 iter 5152: train loss 0.31626. lr 7.064995e-05:  55%|█████▌    | 5152/9301 [33:36<25:34,  2.70it/s][A
epoch 2 iter 5152: train loss 0.31626. lr 7.064995e-05:  55%|█████▌    | 5153/9301 [33:36<25:29,  2.71it/s][A

epoch 2 iter 5185: train loss 0.32532. lr 6.957571e-05:  56%|█████▌    | 5185/9301 [33:48<25:13,  2.72it/s][A
epoch 2 iter 5185: train loss 0.32532. lr 6.957571e-05:  56%|█████▌    | 5186/9301 [33:48<25:15,  2.72it/s][A
epoch 2 iter 5186: train loss 0.32725. lr 6.954327e-05:  56%|█████▌    | 5186/9301 [33:48<25:15,  2.72it/s][A
epoch 2 iter 5186: train loss 0.32725. lr 6.954327e-05:  56%|█████▌    | 5187/9301 [33:48<25:14,  2.72it/s][A
epoch 2 iter 5187: train loss 0.30572. lr 6.951084e-05:  56%|█████▌    | 5187/9301 [33:49<25:14,  2.72it/s][A
epoch 2 iter 5187: train loss 0.30572. lr 6.951084e-05:  56%|█████▌    | 5188/9301 [33:49<25:14,  2.72it/s][A
epoch 2 iter 5188: train loss 0.31100. lr 6.947841e-05:  56%|█████▌    | 5188/9301 [33:49<25:14,  2.72it/s][A
epoch 2 iter 5188: train loss 0.31100. lr 6.947841e-05:  56%|█████▌    | 5189/9301 [33:49<25:16,  2.71it/s][A
epoch 2 iter 5189: train loss 0.30397. lr 6.944599e-05:  56%|█████▌    | 5189/9301 [33:50<25:16,  2.71it/s][A

epoch 2 iter 5221: train loss 0.32766. lr 6.841198e-05:  56%|█████▌    | 5222/9301 [34:02<25:41,  2.65it/s][A
epoch 2 iter 5222: train loss 0.33616. lr 6.837978e-05:  56%|█████▌    | 5222/9301 [34:02<25:41,  2.65it/s][A
epoch 2 iter 5222: train loss 0.33616. lr 6.837978e-05:  56%|█████▌    | 5223/9301 [34:02<25:27,  2.67it/s][A
epoch 2 iter 5223: train loss 0.32323. lr 6.834758e-05:  56%|█████▌    | 5223/9301 [34:02<25:27,  2.67it/s][A
epoch 2 iter 5223: train loss 0.32323. lr 6.834758e-05:  56%|█████▌    | 5224/9301 [34:02<25:18,  2.69it/s][A
epoch 2 iter 5224: train loss 0.33007. lr 6.831539e-05:  56%|█████▌    | 5224/9301 [34:03<25:18,  2.69it/s][A
epoch 2 iter 5224: train loss 0.33007. lr 6.831539e-05:  56%|█████▌    | 5225/9301 [34:03<25:07,  2.70it/s][A
epoch 2 iter 5225: train loss 0.32659. lr 6.828321e-05:  56%|█████▌    | 5225/9301 [34:03<25:07,  2.70it/s][A
epoch 2 iter 5225: train loss 0.32659. lr 6.828321e-05:  56%|█████▌    | 5226/9301 [34:03<25:07,  2.70it/s][A

epoch 2 iter 5258: train loss 0.32190. lr 6.722485e-05:  57%|█████▋    | 5258/9301 [34:15<24:48,  2.72it/s][A
epoch 2 iter 5258: train loss 0.32190. lr 6.722485e-05:  57%|█████▋    | 5259/9301 [34:15<24:45,  2.72it/s][A
epoch 2 iter 5259: train loss 0.31152. lr 6.719289e-05:  57%|█████▋    | 5259/9301 [34:16<24:45,  2.72it/s][A
epoch 2 iter 5259: train loss 0.31152. lr 6.719289e-05:  57%|█████▋    | 5260/9301 [34:16<24:51,  2.71it/s][A
epoch 2 iter 5260: train loss 0.34329. lr 6.716094e-05:  57%|█████▋    | 5260/9301 [34:16<24:51,  2.71it/s][A
epoch 2 iter 5260: train loss 0.34329. lr 6.716094e-05:  57%|█████▋    | 5261/9301 [34:16<24:53,  2.70it/s][A
epoch 2 iter 5261: train loss 0.32050. lr 6.712899e-05:  57%|█████▋    | 5261/9301 [34:16<24:53,  2.70it/s][A
epoch 2 iter 5261: train loss 0.32050. lr 6.712899e-05:  57%|█████▋    | 5262/9301 [34:16<24:53,  2.70it/s][A
epoch 2 iter 5262: train loss 0.31059. lr 6.709705e-05:  57%|█████▋    | 5262/9301 [34:17<24:53,  2.70it/s][A
epoch 2 iter 5294: train loss 0.32012. lr 6.607852e-05:  57%|█████▋    | 5295/9301 [34:29<24:36,  2.71it/s][A
epoch 2 iter 5295: train loss 0.32935. lr 6.604680e-05:  57%|█████▋    | 5295/9301 [34:29<24:36,  2.71it/s][A
epoch 2 iter 5295: train loss 0.32935. lr 6.604680e-05:  57%|█████▋    | 5296/9301 [34:29<24:38,  2.71it/s][A
epoch 2 iter 5296: train loss 0.30900. lr 6.601509e-05:  57%|█████▋    | 5296/9301 [34:29<24:38,  2.71it/s][A
epoch 2 iter 5296: train loss 0.30900. lr 6.601509e-05:  57%|█████▋    | 5297/9301 [34:29<24:35,  2.71it/s][A
epoch 2 iter 5297: train loss 0.29550. lr 6.598338e-05:  57%|█████▋    | 5297/9301 [34:30<24:35,  2.71it/s][A
epoch 2 iter 5297: train loss 0.29550. lr 6.598338e-05:  57%|█████▋    | 5298/9301 [34:30<24:32,  2.72it/s][A
epoch 2 iter 5298: train loss 0.31509. lr 6.595169e-05:  57%|█████▋    | 5298/9301 [34:30<24:32,  2.72it/s][A
epoch 2 iter 5298: train loss 0.31509. lr 6.595169e-05:  57%|█████▋    | 5299/9301 [34:30<24:35,  2.71it/s][A
epoch 2 iter 5331: train loss 0.31985. lr 6.490936e-05:  57%|█████▋    | 5331/9301 [34:42<24:23,  2.71it/s][A
epoch 2 iter 5331: train loss 0.31985. lr 6.490936e-05:  57%|█████▋    | 5332/9301 [34:42<24:24,  2.71it/s][A
epoch 2 iter 5332: train loss 0.31960. lr 6.487789e-05:  57%|█████▋    | 5332/9301 [34:43<24:24,  2.71it/s][A
epoch 2 iter 5332: train loss 0.31960. lr 6.487789e-05:  57%|█████▋    | 5333/9301 [34:43<24:21,  2.71it/s][A
epoch 2 iter 5333: train loss 0.31961. lr 6.484643e-05:  57%|█████▋    | 5333/9301 [34:43<24:21,  2.71it/s][A
epoch 2 iter 5333: train loss 0.31961. lr 6.484643e-05:  57%|█████▋    | 5334/9301 [34:43<24:22,  2.71it/s][A
epoch 2 iter 5334: train loss 0.31964. lr 6.481497e-05:  57%|█████▋    | 5334/9301 [34:43<24:22,  2.71it/s][A
epoch 2 iter 5334: train loss 0.31964. lr 6.481497e-05:  57%|█████▋    | 5335/9301 [34:43<24:17,  2.72it/s][A
epoch 2 iter 5335: train loss 0.31670. lr 6.478352e-05:  57%|█████▋    | 5335/9301 [34:44<24:17,  2.72it/s][A
epoch 2 iter 5367: train loss 0.32476. lr 6.378062e-05:  58%|█████▊    | 5368/9301 [34:55<24:10,  2.71it/s][A
epoch 2 iter 5368: train loss 0.30860. lr 6.374939e-05:  58%|█████▊    | 5368/9301 [34:56<24:10,  2.71it/s][A
epoch 2 iter 5368: train loss 0.30860. lr 6.374939e-05:  58%|█████▊    | 5369/9301 [34:56<24:09,  2.71it/s][A
epoch 2 iter 5369: train loss 0.32474. lr 6.371816e-05:  58%|█████▊    | 5369/9301 [34:56<24:09,  2.71it/s][A
epoch 2 iter 5369: train loss 0.32474. lr 6.371816e-05:  58%|█████▊    | 5370/9301 [34:56<24:14,  2.70it/s][A
epoch 2 iter 5370: train loss 0.29614. lr 6.368695e-05:  58%|█████▊    | 5370/9301 [34:57<24:14,  2.70it/s][A
epoch 2 iter 5370: train loss 0.29614. lr 6.368695e-05:  58%|█████▊    | 5371/9301 [34:57<24:15,  2.70it/s][A
epoch 2 iter 5371: train loss 0.31712. lr 6.365574e-05:  58%|█████▊    | 5371/9301 [34:57<24:15,  2.70it/s][A
epoch 2 iter 5371: train loss 0.31712. lr 6.365574e-05:  58%|█████▊    | 5372/9301 [34:57<24:16,  2.70it/s][A
epoch 2 iter 5404: train loss 0.35629. lr 6.262962e-05:  58%|█████▊    | 5404/9301 [35:09<23:49,  2.73it/s][A
epoch 2 iter 5404: train loss 0.35629. lr 6.262962e-05:  58%|█████▊    | 5405/9301 [35:09<23:54,  2.72it/s][A
epoch 2 iter 5405: train loss 0.32317. lr 6.259864e-05:  58%|█████▊    | 5405/9301 [35:09<23:54,  2.72it/s][A
epoch 2 iter 5405: train loss 0.32317. lr 6.259864e-05:  58%|█████▊    | 5406/9301 [35:09<23:57,  2.71it/s][A
epoch 2 iter 5406: train loss 0.32462. lr 6.256766e-05:  58%|█████▊    | 5406/9301 [35:10<23:57,  2.71it/s][A
epoch 2 iter 5406: train loss 0.32462. lr 6.256766e-05:  58%|█████▊    | 5407/9301 [35:10<23:56,  2.71it/s][A
epoch 2 iter 5407: train loss 0.30947. lr 6.253670e-05:  58%|█████▊    | 5407/9301 [35:10<23:56,  2.71it/s][A
epoch 2 iter 5407: train loss 0.30947. lr 6.253670e-05:  58%|█████▊    | 5408/9301 [35:10<23:55,  2.71it/s][A
epoch 2 iter 5408: train loss 0.32528. lr 6.250574e-05:  58%|█████▊    | 5408/9301 [35:11<23:55,  2.71it/s][A
epoch 2 iter 5440: train loss 0.31193. lr 6.151862e-05:  58%|█████▊    | 5441/9301 [35:22<23:44,  2.71it/s][A
epoch 2 iter 5441: train loss 0.32455. lr 6.148788e-05:  58%|█████▊    | 5441/9301 [35:23<23:44,  2.71it/s][A
epoch 2 iter 5441: train loss 0.32455. lr 6.148788e-05:  59%|█████▊    | 5442/9301 [35:23<23:45,  2.71it/s][A
epoch 2 iter 5442: train loss 0.32464. lr 6.145715e-05:  59%|█████▊    | 5442/9301 [35:23<23:45,  2.71it/s][A
epoch 2 iter 5442: train loss 0.32464. lr 6.145715e-05:  59%|█████▊    | 5443/9301 [35:23<23:42,  2.71it/s][A
epoch 2 iter 5443: train loss 0.31704. lr 6.142643e-05:  59%|█████▊    | 5443/9301 [35:23<23:42,  2.71it/s][A
epoch 2 iter 5443: train loss 0.31704. lr 6.142643e-05:  59%|█████▊    | 5444/9301 [35:23<23:44,  2.71it/s][A
epoch 2 iter 5444: train loss 0.32728. lr 6.139572e-05:  59%|█████▊    | 5444/9301 [35:24<23:44,  2.71it/s][A
epoch 2 iter 5444: train loss 0.32728. lr 6.139572e-05:  59%|█████▊    | 5445/9301 [35:24<23:39,  2.72it/s][A
epoch 2 iter 5477: train loss 0.32544. lr 6.038595e-05:  59%|█████▉    | 5477/9301 [35:36<23:31,  2.71it/s][A
epoch 2 iter 5477: train loss 0.32544. lr 6.038595e-05:  59%|█████▉    | 5478/9301 [35:36<23:29,  2.71it/s][A
epoch 2 iter 5478: train loss 0.31859. lr 6.035546e-05:  59%|█████▉    | 5478/9301 [35:36<23:29,  2.71it/s][A
epoch 2 iter 5478: train loss 0.31859. lr 6.035546e-05:  59%|█████▉    | 5479/9301 [35:36<23:29,  2.71it/s][A
epoch 2 iter 5479: train loss 0.32843. lr 6.032499e-05:  59%|█████▉    | 5479/9301 [35:37<23:29,  2.71it/s][A
epoch 2 iter 5479: train loss 0.32843. lr 6.032499e-05:  59%|█████▉    | 5480/9301 [35:37<23:25,  2.72it/s][A
epoch 2 iter 5480: train loss 0.33078. lr 6.029452e-05:  59%|█████▉    | 5480/9301 [35:37<23:25,  2.72it/s][A
epoch 2 iter 5480: train loss 0.33078. lr 6.029452e-05:  59%|█████▉    | 5481/9301 [35:37<23:28,  2.71it/s][A
epoch 2 iter 5481: train loss 0.32306. lr 6.026405e-05:  59%|█████▉    | 5481/9301 [35:37<23:28,  2.71it/s][A
epoch 2 iter 5513: train loss 0.34099. lr 6.000000e-05:  59%|█████▉    | 5514/9301 [35:49<23:16,  2.71it/s][A
epoch 2 iter 5514: train loss 0.32597. lr 6.000000e-05:  59%|█████▉    | 5514/9301 [35:50<23:16,  2.71it/s][A
epoch 2 iter 5514: train loss 0.32597. lr 6.000000e-05:  59%|█████▉    | 5515/9301 [35:50<23:08,  2.73it/s][A
epoch 2 iter 5515: train loss 0.32523. lr 6.000000e-05:  59%|█████▉    | 5515/9301 [35:50<23:08,  2.73it/s][A
epoch 2 iter 5515: train loss 0.32523. lr 6.000000e-05:  59%|█████▉    | 5516/9301 [35:50<23:12,  2.72it/s][A
epoch 2 iter 5516: train loss 0.32164. lr 6.000000e-05:  59%|█████▉    | 5516/9301 [35:50<23:12,  2.72it/s][A
epoch 2 iter 5516: train loss 0.32164. lr 6.000000e-05:  59%|█████▉    | 5517/9301 [35:50<23:14,  2.71it/s][A
epoch 2 iter 5517: train loss 0.32097. lr 6.000000e-05:  59%|█████▉    | 5517/9301 [35:51<23:14,  2.71it/s][A
epoch 2 iter 5517: train loss 0.32097. lr 6.000000e-05:  59%|█████▉    | 5518/9301 [35:51<23:12,  2.72it/s][A
epoch 2 iter 5550: train loss 0.33292. lr 6.000000e-05:  60%|█████▉    | 5550/9301 [36:03<22:59,  2.72it/s][A
epoch 2 iter 5550: train loss 0.33292. lr 6.000000e-05:  60%|█████▉    | 5551/9301 [36:03<23:03,  2.71it/s][A
epoch 2 iter 5551: train loss 0.33095. lr 6.000000e-05:  60%|█████▉    | 5551/9301 [36:03<23:03,  2.71it/s][A
epoch 2 iter 5551: train loss 0.33095. lr 6.000000e-05:  60%|█████▉    | 5552/9301 [36:03<23:03,  2.71it/s][A
epoch 2 iter 5552: train loss 0.31847. lr 6.000000e-05:  60%|█████▉    | 5552/9301 [36:04<23:03,  2.71it/s][A
epoch 2 iter 5552: train loss 0.31847. lr 6.000000e-05:  60%|█████▉    | 5553/9301 [36:04<23:01,  2.71it/s][A
epoch 2 iter 5553: train loss 0.31909. lr 6.000000e-05:  60%|█████▉    | 5553/9301 [36:04<23:01,  2.71it/s][A
epoch 2 iter 5553: train loss 0.31909. lr 6.000000e-05:  60%|█████▉    | 5554/9301 [36:04<23:00,  2.71it/s][A
epoch 2 iter 5554: train loss 0.31569. lr 6.000000e-05:  60%|█████▉    | 5554/9301 [36:04<23:00,  2.71it/s][A
epoch 2 iter 5586: train loss 0.29530. lr 6.000000e-05:  60%|██████    | 5587/9301 [36:16<22:50,  2.71it/s][A
epoch 2 iter 5587: train loss 0.33176. lr 6.000000e-05:  60%|██████    | 5587/9301 [36:17<22:50,  2.71it/s][A
epoch 2 iter 5587: train loss 0.33176. lr 6.000000e-05:  60%|██████    | 5588/9301 [36:17<22:48,  2.71it/s][A
epoch 2 iter 5588: train loss 0.30604. lr 6.000000e-05:  60%|██████    | 5588/9301 [36:17<22:48,  2.71it/s][A
epoch 2 iter 5588: train loss 0.30604. lr 6.000000e-05:  60%|██████    | 5589/9301 [36:17<22:47,  2.71it/s][A
epoch 2 iter 5589: train loss 0.32660. lr 6.000000e-05:  60%|██████    | 5589/9301 [36:17<22:47,  2.71it/s][A
epoch 2 iter 5589: train loss 0.32660. lr 6.000000e-05:  60%|██████    | 5590/9301 [36:17<22:44,  2.72it/s][A
epoch 2 iter 5590: train loss 0.30627. lr 6.000000e-05:  60%|██████    | 5590/9301 [36:18<22:44,  2.72it/s][A
epoch 2 iter 5590: train loss 0.30627. lr 6.000000e-05:  60%|██████    | 5591/9301 [36:18<22:47,  2.71it/s][A
epoch 2 iter 5623: train loss 0.31329. lr 6.000000e-05:  60%|██████    | 5623/9301 [36:30<22:39,  2.70it/s][A
epoch 2 iter 5623: train loss 0.31329. lr 6.000000e-05:  60%|██████    | 5624/9301 [36:30<22:35,  2.71it/s][A
epoch 2 iter 5624: train loss 0.31517. lr 6.000000e-05:  60%|██████    | 5624/9301 [36:30<22:35,  2.71it/s][A
epoch 2 iter 5624: train loss 0.31517. lr 6.000000e-05:  60%|██████    | 5625/9301 [36:30<22:28,  2.73it/s][A
epoch 2 iter 5625: train loss 0.33339. lr 6.000000e-05:  60%|██████    | 5625/9301 [36:31<22:28,  2.73it/s][A
epoch 2 iter 5625: train loss 0.33339. lr 6.000000e-05:  60%|██████    | 5626/9301 [36:31<22:33,  2.71it/s][A
epoch 2 iter 5626: train loss 0.32259. lr 6.000000e-05:  60%|██████    | 5626/9301 [36:31<22:33,  2.71it/s][A
epoch 2 iter 5626: train loss 0.32259. lr 6.000000e-05:  60%|██████    | 5627/9301 [36:31<22:34,  2.71it/s][A
epoch 2 iter 5627: train loss 0.31771. lr 6.000000e-05:  60%|██████    | 5627/9301 [36:31<22:34,  2.71it/s][A
epoch 2 iter 5659: train loss 0.31426. lr 6.000000e-05:  61%|██████    | 5660/9301 [36:43<22:19,  2.72it/s][A
epoch 2 iter 5660: train loss 0.32045. lr 6.000000e-05:  61%|██████    | 5660/9301 [36:43<22:19,  2.72it/s][A
epoch 2 iter 5660: train loss 0.32045. lr 6.000000e-05:  61%|██████    | 5661/9301 [36:43<22:21,  2.71it/s][A
epoch 2 iter 5661: train loss 0.32854. lr 6.000000e-05:  61%|██████    | 5661/9301 [36:44<22:21,  2.71it/s][A
epoch 2 iter 5661: train loss 0.32854. lr 6.000000e-05:  61%|██████    | 5662/9301 [36:44<22:23,  2.71it/s][A
epoch 2 iter 5662: train loss 0.32199. lr 6.000000e-05:  61%|██████    | 5662/9301 [36:44<22:23,  2.71it/s][A
epoch 2 iter 5662: train loss 0.32199. lr 6.000000e-05:  61%|██████    | 5663/9301 [36:44<22:20,  2.71it/s][A
epoch 2 iter 5663: train loss 0.32338. lr 6.000000e-05:  61%|██████    | 5663/9301 [36:45<22:20,  2.71it/s][A
epoch 2 iter 5663: train loss 0.32338. lr 6.000000e-05:  61%|██████    | 5664/9301 [36:45<22:23,  2.71it/s][A
epoch 2 iter 5696: train loss 0.31575. lr 6.000000e-05:  61%|██████    | 5696/9301 [36:57<22:11,  2.71it/s][A
epoch 2 iter 5696: train loss 0.31575. lr 6.000000e-05:  61%|██████▏   | 5697/9301 [36:57<22:09,  2.71it/s][A
epoch 2 iter 5697: train loss 0.29623. lr 6.000000e-05:  61%|██████▏   | 5697/9301 [36:57<22:09,  2.71it/s][A
epoch 2 iter 5697: train loss 0.29623. lr 6.000000e-05:  61%|██████▏   | 5698/9301 [36:57<22:06,  2.72it/s][A
epoch 2 iter 5698: train loss 0.30033. lr 6.000000e-05:  61%|██████▏   | 5698/9301 [36:57<22:06,  2.72it/s][A
epoch 2 iter 5698: train loss 0.30033. lr 6.000000e-05:  61%|██████▏   | 5699/9301 [36:57<22:07,  2.71it/s][A
epoch 2 iter 5699: train loss 0.33150. lr 6.000000e-05:  61%|██████▏   | 5699/9301 [36:58<22:07,  2.71it/s][A
epoch 2 iter 5699: train loss 0.33150. lr 6.000000e-05:  61%|██████▏   | 5700/9301 [36:58<22:07,  2.71it/s][A
epoch 2 iter 5700: train loss 0.32048. lr 6.000000e-05:  61%|██████▏   | 5700/9301 [36:58<22:07,  2.71it/s][A
epoch 2 iter 5732: train loss 0.33861. lr 6.000000e-05:  62%|██████▏   | 5733/9301 [37:10<21:56,  2.71it/s][A
epoch 2 iter 5733: train loss 0.31739. lr 6.000000e-05:  62%|██████▏   | 5733/9301 [37:10<21:56,  2.71it/s][A
epoch 2 iter 5733: train loss 0.31739. lr 6.000000e-05:  62%|██████▏   | 5734/9301 [37:10<21:55,  2.71it/s][A
epoch 2 iter 5734: train loss 0.32295. lr 6.000000e-05:  62%|██████▏   | 5734/9301 [37:11<21:55,  2.71it/s][A
epoch 2 iter 5734: train loss 0.32295. lr 6.000000e-05:  62%|██████▏   | 5735/9301 [37:11<21:51,  2.72it/s][A
epoch 2 iter 5735: train loss 0.34140. lr 6.000000e-05:  62%|██████▏   | 5735/9301 [37:11<21:51,  2.72it/s][A
epoch 2 iter 5735: train loss 0.34140. lr 6.000000e-05:  62%|██████▏   | 5736/9301 [37:11<21:55,  2.71it/s][A
epoch 2 iter 5736: train loss 0.30551. lr 6.000000e-05:  62%|██████▏   | 5736/9301 [37:11<21:55,  2.71it/s][A
epoch 2 iter 5736: train loss 0.30551. lr 6.000000e-05:  62%|██████▏   | 5737/9301 [37:11<21:56,  2.71it/s][A
epoch 2 iter 5769: train loss 0.29986. lr 6.000000e-05:  62%|██████▏   | 5769/9301 [37:24<21:38,  2.72it/s][A
epoch 2 iter 5769: train loss 0.29986. lr 6.000000e-05:  62%|██████▏   | 5770/9301 [37:24<21:34,  2.73it/s][A
epoch 2 iter 5770: train loss 0.30025. lr 6.000000e-05:  62%|██████▏   | 5770/9301 [37:24<21:34,  2.73it/s][A
epoch 2 iter 5770: train loss 0.30025. lr 6.000000e-05:  62%|██████▏   | 5771/9301 [37:24<21:37,  2.72it/s][A
epoch 2 iter 5771: train loss 0.32363. lr 6.000000e-05:  62%|██████▏   | 5771/9301 [37:24<21:37,  2.72it/s][A
epoch 2 iter 5771: train loss 0.32363. lr 6.000000e-05:  62%|██████▏   | 5772/9301 [37:24<21:40,  2.71it/s][A
epoch 2 iter 5772: train loss 0.30946. lr 6.000000e-05:  62%|██████▏   | 5772/9301 [37:25<21:40,  2.71it/s][A
epoch 2 iter 5772: train loss 0.30946. lr 6.000000e-05:  62%|██████▏   | 5773/9301 [37:25<21:38,  2.72it/s][A
epoch 2 iter 5773: train loss 0.35367. lr 6.000000e-05:  62%|██████▏   | 5773/9301 [37:25<21:38,  2.72it/s][A
epoch 2 iter 5805: train loss 0.33103. lr 6.000000e-05:  62%|██████▏   | 5806/9301 [37:37<21:27,  2.72it/s][A
epoch 2 iter 5806: train loss 0.30439. lr 6.000000e-05:  62%|██████▏   | 5806/9301 [37:37<21:27,  2.72it/s][A
epoch 2 iter 5806: train loss 0.30439. lr 6.000000e-05:  62%|██████▏   | 5807/9301 [37:37<21:28,  2.71it/s][A
epoch 2 iter 5807: train loss 0.34120. lr 6.000000e-05:  62%|██████▏   | 5807/9301 [37:38<21:28,  2.71it/s][A
epoch 2 iter 5807: train loss 0.34120. lr 6.000000e-05:  62%|██████▏   | 5808/9301 [37:38<21:25,  2.72it/s][A
epoch 2 iter 5808: train loss 0.31552. lr 6.000000e-05:  62%|██████▏   | 5808/9301 [37:38<21:25,  2.72it/s][A
epoch 2 iter 5808: train loss 0.31552. lr 6.000000e-05:  62%|██████▏   | 5809/9301 [37:38<21:28,  2.71it/s][A
epoch 2 iter 5809: train loss 0.32068. lr 6.000000e-05:  62%|██████▏   | 5809/9301 [37:38<21:28,  2.71it/s][A
epoch 2 iter 5809: train loss 0.32068. lr 6.000000e-05:  62%|██████▏   | 5810/9301 [37:38<21:25,  2.72it/s][A
epoch 2 iter 5842: train loss 0.30725. lr 6.000000e-05:  63%|██████▎   | 5842/9301 [37:51<21:22,  2.70it/s][A
epoch 2 iter 5842: train loss 0.30725. lr 6.000000e-05:  63%|██████▎   | 5843/9301 [37:51<21:19,  2.70it/s][A
epoch 2 iter 5843: train loss 0.32695. lr 6.000000e-05:  63%|██████▎   | 5843/9301 [37:51<21:19,  2.70it/s][A
epoch 2 iter 5843: train loss 0.32695. lr 6.000000e-05:  63%|██████▎   | 5844/9301 [37:51<21:17,  2.71it/s][A
epoch 2 iter 5844: train loss 0.32366. lr 6.000000e-05:  63%|██████▎   | 5844/9301 [37:51<21:17,  2.71it/s][A
epoch 2 iter 5844: train loss 0.32366. lr 6.000000e-05:  63%|██████▎   | 5845/9301 [37:51<21:15,  2.71it/s][A
epoch 2 iter 5845: train loss 0.32076. lr 6.000000e-05:  63%|██████▎   | 5845/9301 [37:52<21:15,  2.71it/s][A
epoch 2 iter 5845: train loss 0.32076. lr 6.000000e-05:  63%|██████▎   | 5846/9301 [37:52<21:17,  2.70it/s][A
epoch 2 iter 5846: train loss 0.32706. lr 6.000000e-05:  63%|██████▎   | 5846/9301 [37:52<21:17,  2.70it/s][A
epoch 2 iter 5878: train loss 0.30215. lr 6.000000e-05:  63%|██████▎   | 5879/9301 [38:04<20:58,  2.72it/s][A
epoch 2 iter 5879: train loss 0.34626. lr 6.000000e-05:  63%|██████▎   | 5879/9301 [38:04<20:58,  2.72it/s][A
epoch 2 iter 5879: train loss 0.34626. lr 6.000000e-05:  63%|██████▎   | 5880/9301 [38:04<21:02,  2.71it/s][A
epoch 2 iter 5880: train loss 0.33814. lr 6.000000e-05:  63%|██████▎   | 5880/9301 [38:05<21:02,  2.71it/s][A
epoch 2 iter 5880: train loss 0.33814. lr 6.000000e-05:  63%|██████▎   | 5881/9301 [38:05<21:05,  2.70it/s][A
epoch 2 iter 5881: train loss 0.33525. lr 6.000000e-05:  63%|██████▎   | 5881/9301 [38:05<21:05,  2.70it/s][A
epoch 2 iter 5881: train loss 0.33525. lr 6.000000e-05:  63%|██████▎   | 5882/9301 [38:05<21:05,  2.70it/s][A
epoch 2 iter 5882: train loss 0.31290. lr 6.000000e-05:  63%|██████▎   | 5882/9301 [38:05<21:05,  2.70it/s][A
epoch 2 iter 5882: train loss 0.31290. lr 6.000000e-05:  63%|██████▎   | 5883/9301 [38:05<21:01,  2.71it/s][A
epoch 2 iter 5915: train loss 0.33702. lr 6.000000e-05:  64%|██████▎   | 5915/9301 [38:17<20:50,  2.71it/s][A
epoch 2 iter 5915: train loss 0.33702. lr 6.000000e-05:  64%|██████▎   | 5916/9301 [38:17<20:52,  2.70it/s][A
epoch 2 iter 5916: train loss 0.31360. lr 6.000000e-05:  64%|██████▎   | 5916/9301 [38:18<20:52,  2.70it/s][A
epoch 2 iter 5916: train loss 0.31360. lr 6.000000e-05:  64%|██████▎   | 5917/9301 [38:18<20:51,  2.70it/s][A
epoch 2 iter 5917: train loss 0.31660. lr 6.000000e-05:  64%|██████▎   | 5917/9301 [38:18<20:51,  2.70it/s][A
epoch 2 iter 5917: train loss 0.31660. lr 6.000000e-05:  64%|██████▎   | 5918/9301 [38:18<20:48,  2.71it/s][A
epoch 2 iter 5918: train loss 0.31716. lr 6.000000e-05:  64%|██████▎   | 5918/9301 [38:19<20:48,  2.71it/s][A
epoch 2 iter 5918: train loss 0.31716. lr 6.000000e-05:  64%|██████▎   | 5919/9301 [38:19<20:49,  2.71it/s][A
epoch 2 iter 5919: train loss 0.32267. lr 6.000000e-05:  64%|██████▎   | 5919/9301 [38:19<20:49,  2.71it/s][A
epoch 2 iter 5951: train loss 0.31677. lr 6.000000e-05:  64%|██████▍   | 5952/9301 [38:31<20:35,  2.71it/s][A
epoch 2 iter 5952: train loss 0.34659. lr 6.000000e-05:  64%|██████▍   | 5952/9301 [38:31<20:35,  2.71it/s][A
epoch 2 iter 5952: train loss 0.34659. lr 6.000000e-05:  64%|██████▍   | 5953/9301 [38:31<20:34,  2.71it/s][A
epoch 2 iter 5953: train loss 0.34038. lr 6.000000e-05:  64%|██████▍   | 5953/9301 [38:31<20:34,  2.71it/s][A
epoch 2 iter 5953: train loss 0.34038. lr 6.000000e-05:  64%|██████▍   | 5954/9301 [38:31<20:32,  2.72it/s][A
epoch 2 iter 5954: train loss 0.30489. lr 6.000000e-05:  64%|██████▍   | 5954/9301 [38:32<20:32,  2.72it/s][A
epoch 2 iter 5954: train loss 0.30489. lr 6.000000e-05:  64%|██████▍   | 5955/9301 [38:32<20:26,  2.73it/s][A
epoch 2 iter 5955: train loss 0.31866. lr 6.000000e-05:  64%|██████▍   | 5955/9301 [38:32<20:26,  2.73it/s][A
epoch 2 iter 5955: train loss 0.31866. lr 6.000000e-05:  64%|██████▍   | 5956/9301 [38:32<20:29,  2.72it/s][A
epoch 2 iter 5988: train loss 0.31862. lr 6.000000e-05:  64%|██████▍   | 5988/9301 [38:44<20:20,  2.72it/s][A
epoch 2 iter 5988: train loss 0.31862. lr 6.000000e-05:  64%|██████▍   | 5989/9301 [38:44<20:21,  2.71it/s][A
epoch 2 iter 5989: train loss 0.33381. lr 6.000000e-05:  64%|██████▍   | 5989/9301 [38:45<20:21,  2.71it/s][A
epoch 2 iter 5989: train loss 0.33381. lr 6.000000e-05:  64%|██████▍   | 5990/9301 [38:45<20:17,  2.72it/s][A
epoch 2 iter 5990: train loss 0.31153. lr 6.000000e-05:  64%|██████▍   | 5990/9301 [38:45<20:17,  2.72it/s][A
epoch 2 iter 5990: train loss 0.31153. lr 6.000000e-05:  64%|██████▍   | 5991/9301 [38:45<20:20,  2.71it/s][A
epoch 2 iter 5991: train loss 0.30258. lr 6.000000e-05:  64%|██████▍   | 5991/9301 [38:45<20:20,  2.71it/s][A
epoch 2 iter 5991: train loss 0.30258. lr 6.000000e-05:  64%|██████▍   | 5992/9301 [38:45<20:19,  2.71it/s][A
epoch 2 iter 5992: train loss 0.33421. lr 6.000000e-05:  64%|██████▍   | 5992/9301 [38:46<20:19,  2.71it/s][A
epoch 2 iter 6024: train loss 0.30199. lr 6.000000e-05:  65%|██████▍   | 6025/9301 [38:58<20:03,  2.72it/s][A
epoch 2 iter 6025: train loss 0.30843. lr 6.000000e-05:  65%|██████▍   | 6025/9301 [38:58<20:03,  2.72it/s][A
epoch 2 iter 6025: train loss 0.30843. lr 6.000000e-05:  65%|██████▍   | 6026/9301 [38:58<20:06,  2.71it/s][A
epoch 2 iter 6026: train loss 0.31539. lr 6.000000e-05:  65%|██████▍   | 6026/9301 [38:58<20:06,  2.71it/s][A
epoch 2 iter 6026: train loss 0.31539. lr 6.000000e-05:  65%|██████▍   | 6027/9301 [38:58<20:08,  2.71it/s][A
epoch 2 iter 6027: train loss 0.30418. lr 6.000000e-05:  65%|██████▍   | 6027/9301 [38:59<20:08,  2.71it/s][A
epoch 2 iter 6027: train loss 0.30418. lr 6.000000e-05:  65%|██████▍   | 6028/9301 [38:59<20:06,  2.71it/s][A
epoch 2 iter 6028: train loss 0.32469. lr 6.000000e-05:  65%|██████▍   | 6028/9301 [38:59<20:06,  2.71it/s][A
epoch 2 iter 6028: train loss 0.32469. lr 6.000000e-05:  65%|██████▍   | 6029/9301 [38:59<20:07,  2.71it/s][A
epoch 2 iter 6061: train loss 0.32126. lr 6.000000e-05:  65%|██████▌   | 6061/9301 [39:11<20:01,  2.70it/s][A
epoch 2 iter 6061: train loss 0.32126. lr 6.000000e-05:  65%|██████▌   | 6062/9301 [39:11<20:00,  2.70it/s][A
epoch 2 iter 6062: train loss 0.29681. lr 6.000000e-05:  65%|██████▌   | 6062/9301 [39:12<20:00,  2.70it/s][A
epoch 2 iter 6062: train loss 0.29681. lr 6.000000e-05:  65%|██████▌   | 6063/9301 [39:12<19:57,  2.70it/s][A
epoch 2 iter 6063: train loss 0.31085. lr 6.000000e-05:  65%|██████▌   | 6063/9301 [39:12<19:57,  2.70it/s][A
epoch 2 iter 6063: train loss 0.31085. lr 6.000000e-05:  65%|██████▌   | 6064/9301 [39:12<19:54,  2.71it/s][A
epoch 2 iter 6064: train loss 0.31597. lr 6.000000e-05:  65%|██████▌   | 6064/9301 [39:12<19:54,  2.71it/s][A
epoch 2 iter 6064: train loss 0.31597. lr 6.000000e-05:  65%|██████▌   | 6065/9301 [39:12<19:49,  2.72it/s][A
epoch 2 iter 6065: train loss 0.34253. lr 6.000000e-05:  65%|██████▌   | 6065/9301 [39:13<19:49,  2.72it/s][A
epoch 2 iter 6097: train loss 0.30557. lr 6.000000e-05:  66%|██████▌   | 6098/9301 [39:25<19:38,  2.72it/s][A
epoch 2 iter 6098: train loss 0.33668. lr 6.000000e-05:  66%|██████▌   | 6098/9301 [39:25<19:38,  2.72it/s][A
epoch 2 iter 6098: train loss 0.33668. lr 6.000000e-05:  66%|██████▌   | 6099/9301 [39:25<19:38,  2.72it/s][A
epoch 2 iter 6099: train loss 0.29895. lr 6.000000e-05:  66%|██████▌   | 6099/9301 [39:25<19:38,  2.72it/s][A
epoch 2 iter 6099: train loss 0.29895. lr 6.000000e-05:  66%|██████▌   | 6100/9301 [39:25<19:42,  2.71it/s][A
epoch 2 iter 6100: train loss 0.34321. lr 6.000000e-05:  66%|██████▌   | 6100/9301 [39:26<19:42,  2.71it/s][A
epoch 2 iter 6100: train loss 0.34321. lr 6.000000e-05:  66%|██████▌   | 6101/9301 [39:26<19:45,  2.70it/s][A
epoch 2 iter 6101: train loss 0.35131. lr 6.000000e-05:  66%|██████▌   | 6101/9301 [39:26<19:45,  2.70it/s][A
epoch 2 iter 6101: train loss 0.35131. lr 6.000000e-05:  66%|██████▌   | 6102/9301 [39:26<19:45,  2.70it/s][A
epoch 2 iter 6134: train loss 0.32312. lr 6.000000e-05:  66%|██████▌   | 6134/9301 [39:38<19:27,  2.71it/s][A
epoch 2 iter 6134: train loss 0.32312. lr 6.000000e-05:  66%|██████▌   | 6135/9301 [39:38<19:25,  2.72it/s][A
epoch 2 iter 6135: train loss 0.33080. lr 6.000000e-05:  66%|██████▌   | 6135/9301 [39:39<19:25,  2.72it/s][A
epoch 2 iter 6135: train loss 0.33080. lr 6.000000e-05:  66%|██████▌   | 6136/9301 [39:39<19:26,  2.71it/s][A
epoch 2 iter 6136: train loss 0.30931. lr 6.000000e-05:  66%|██████▌   | 6136/9301 [39:39<19:26,  2.71it/s][A
epoch 2 iter 6136: train loss 0.30931. lr 6.000000e-05:  66%|██████▌   | 6137/9301 [39:39<19:25,  2.71it/s][A
epoch 2 iter 6137: train loss 0.31931. lr 6.000000e-05:  66%|██████▌   | 6137/9301 [39:39<19:25,  2.71it/s][A
epoch 2 iter 6137: train loss 0.31931. lr 6.000000e-05:  66%|██████▌   | 6138/9301 [39:39<19:24,  2.72it/s][A
epoch 2 iter 6138: train loss 0.30979. lr 6.000000e-05:  66%|██████▌   | 6138/9301 [39:40<19:24,  2.72it/s][A
epoch 2 iter 6170: train loss 0.33090. lr 6.000000e-05:  66%|██████▋   | 6171/9301 [39:51<19:13,  2.71it/s][A
epoch 2 iter 6171: train loss 0.30672. lr 6.000000e-05:  66%|██████▋   | 6171/9301 [39:52<19:13,  2.71it/s][A
epoch 2 iter 6171: train loss 0.30672. lr 6.000000e-05:  66%|██████▋   | 6172/9301 [39:52<19:15,  2.71it/s][A
epoch 2 iter 6172: train loss 0.31736. lr 6.000000e-05:  66%|██████▋   | 6172/9301 [39:52<19:15,  2.71it/s][A
epoch 2 iter 6172: train loss 0.31736. lr 6.000000e-05:  66%|██████▋   | 6173/9301 [39:52<19:13,  2.71it/s][A
epoch 2 iter 6173: train loss 0.31510. lr 6.000000e-05:  66%|██████▋   | 6173/9301 [39:53<19:13,  2.71it/s][A
epoch 2 iter 6173: train loss 0.31510. lr 6.000000e-05:  66%|██████▋   | 6174/9301 [39:53<19:13,  2.71it/s][A
epoch 2 iter 6174: train loss 0.30174. lr 6.000000e-05:  66%|██████▋   | 6174/9301 [39:53<19:13,  2.71it/s][A
epoch 2 iter 6174: train loss 0.30174. lr 6.000000e-05:  66%|██████▋   | 6175/9301 [39:53<19:09,  2.72it/s][A
epoch 2 iter 6207: train loss 0.32859. lr 6.000000e-05:  67%|██████▋   | 6207/9301 [40:05<19:00,  2.71it/s][A
epoch 2 iter 6207: train loss 0.32859. lr 6.000000e-05:  67%|██████▋   | 6208/9301 [40:05<18:59,  2.71it/s][A
epoch 2 iter 6208: train loss 0.32447. lr 6.000000e-05:  67%|██████▋   | 6208/9301 [40:05<18:59,  2.71it/s][A
epoch 2 iter 6208: train loss 0.32447. lr 6.000000e-05:  67%|██████▋   | 6209/9301 [40:05<18:57,  2.72it/s][A
epoch 2 iter 6209: train loss 0.32078. lr 6.000000e-05:  67%|██████▋   | 6209/9301 [40:06<18:57,  2.72it/s][A
epoch 2 iter 6209: train loss 0.32078. lr 6.000000e-05:  67%|██████▋   | 6210/9301 [40:06<18:54,  2.72it/s][A
epoch 2 iter 6210: train loss 0.31697. lr 6.000000e-05:  67%|██████▋   | 6210/9301 [40:06<18:54,  2.72it/s][A
epoch 2 iter 6210: train loss 0.31697. lr 6.000000e-05:  67%|██████▋   | 6211/9301 [40:06<18:57,  2.72it/s][A
epoch 2 iter 6211: train loss 0.31774. lr 6.000000e-05:  67%|██████▋   | 6211/9301 [40:07<18:57,  2.72it/s][A
epoch 2 iter 6243: train loss 0.31760. lr 6.000000e-05:  67%|██████▋   | 6244/9301 [40:18<18:46,  2.71it/s][A
epoch 2 iter 6244: train loss 0.30511. lr 6.000000e-05:  67%|██████▋   | 6244/9301 [40:19<18:46,  2.71it/s][A
epoch 2 iter 6244: train loss 0.30511. lr 6.000000e-05:  67%|██████▋   | 6245/9301 [40:19<18:43,  2.72it/s][A
epoch 2 iter 6245: train loss 0.30095. lr 6.000000e-05:  67%|██████▋   | 6245/9301 [40:19<18:43,  2.72it/s][A
epoch 2 iter 6245: train loss 0.30095. lr 6.000000e-05:  67%|██████▋   | 6246/9301 [40:19<18:46,  2.71it/s][A
epoch 2 iter 6246: train loss 0.32998. lr 6.000000e-05:  67%|██████▋   | 6246/9301 [40:19<18:46,  2.71it/s][A
epoch 2 iter 6246: train loss 0.32998. lr 6.000000e-05:  67%|██████▋   | 6247/9301 [40:19<18:47,  2.71it/s][A
epoch 2 iter 6247: train loss 0.32723. lr 6.000000e-05:  67%|██████▋   | 6247/9301 [40:20<18:47,  2.71it/s][A
epoch 2 iter 6247: train loss 0.32723. lr 6.000000e-05:  67%|██████▋   | 6248/9301 [40:20<18:45,  2.71it/s][A
epoch 2 iter 6280: train loss 0.34573. lr 6.000000e-05:  68%|██████▊   | 6280/9301 [40:32<18:30,  2.72it/s][A
epoch 2 iter 6280: train loss 0.34573. lr 6.000000e-05:  68%|██████▊   | 6281/9301 [40:32<18:32,  2.71it/s][A
epoch 2 iter 6281: train loss 0.31957. lr 6.000000e-05:  68%|██████▊   | 6281/9301 [40:32<18:32,  2.71it/s][A
epoch 2 iter 6281: train loss 0.31957. lr 6.000000e-05:  68%|██████▊   | 6282/9301 [40:32<18:32,  2.71it/s][A
epoch 2 iter 6282: train loss 0.31577. lr 6.000000e-05:  68%|██████▊   | 6282/9301 [40:33<18:32,  2.71it/s][A
epoch 2 iter 6282: train loss 0.31577. lr 6.000000e-05:  68%|██████▊   | 6283/9301 [40:33<18:32,  2.71it/s][A
epoch 2 iter 6283: train loss 0.31479. lr 6.000000e-05:  68%|██████▊   | 6283/9301 [40:33<18:32,  2.71it/s][A
epoch 2 iter 6283: train loss 0.31479. lr 6.000000e-05:  68%|██████▊   | 6284/9301 [40:33<18:31,  2.71it/s][A
epoch 2 iter 6284: train loss 0.32712. lr 6.000000e-05:  68%|██████▊   | 6284/9301 [40:33<18:31,  2.71it/s][A
epoch 2 iter 6316: train loss 0.30355. lr 6.000000e-05:  68%|██████▊   | 6317/9301 [40:45<18:19,  2.71it/s][A
epoch 2 iter 6317: train loss 0.29979. lr 6.000000e-05:  68%|██████▊   | 6317/9301 [40:46<18:19,  2.71it/s][A
epoch 2 iter 6317: train loss 0.29979. lr 6.000000e-05:  68%|██████▊   | 6318/9301 [40:46<18:18,  2.71it/s][A
epoch 2 iter 6318: train loss 0.30159. lr 6.000000e-05:  68%|██████▊   | 6318/9301 [40:46<18:18,  2.71it/s][A
epoch 2 iter 6318: train loss 0.30159. lr 6.000000e-05:  68%|██████▊   | 6319/9301 [40:46<18:18,  2.71it/s][A
epoch 2 iter 6319: train loss 0.32017. lr 6.000000e-05:  68%|██████▊   | 6319/9301 [40:46<18:18,  2.71it/s][A
epoch 2 iter 6319: train loss 0.32017. lr 6.000000e-05:  68%|██████▊   | 6320/9301 [40:46<18:14,  2.72it/s][A
epoch 2 iter 6320: train loss 0.31521. lr 6.000000e-05:  68%|██████▊   | 6320/9301 [40:47<18:14,  2.72it/s][A
epoch 2 iter 6320: train loss 0.31521. lr 6.000000e-05:  68%|██████▊   | 6321/9301 [40:47<18:17,  2.72it/s][A
epoch 2 iter 6353: train loss 0.30021. lr 6.000000e-05:  68%|██████▊   | 6353/9301 [40:59<18:06,  2.71it/s][A
epoch 2 iter 6353: train loss 0.30021. lr 6.000000e-05:  68%|██████▊   | 6354/9301 [40:59<18:05,  2.71it/s][A
epoch 2 iter 6354: train loss 0.32684. lr 6.000000e-05:  68%|██████▊   | 6354/9301 [40:59<18:05,  2.71it/s][A
epoch 2 iter 6354: train loss 0.32684. lr 6.000000e-05:  68%|██████▊   | 6355/9301 [40:59<18:01,  2.72it/s][A
epoch 2 iter 6355: train loss 0.29220. lr 6.000000e-05:  68%|██████▊   | 6355/9301 [41:00<18:01,  2.72it/s][A
epoch 2 iter 6355: train loss 0.29220. lr 6.000000e-05:  68%|██████▊   | 6356/9301 [41:00<18:03,  2.72it/s][A
epoch 2 iter 6356: train loss 0.32225. lr 6.000000e-05:  68%|██████▊   | 6356/9301 [41:00<18:03,  2.72it/s][A
epoch 2 iter 6356: train loss 0.32225. lr 6.000000e-05:  68%|██████▊   | 6357/9301 [41:00<18:03,  2.72it/s][A
epoch 2 iter 6357: train loss 0.32269. lr 6.000000e-05:  68%|██████▊   | 6357/9301 [41:00<18:03,  2.72it/s][A
epoch 2 iter 6389: train loss 0.32545. lr 6.000000e-05:  69%|██████▊   | 6390/9301 [41:12<17:48,  2.73it/s][A
epoch 2 iter 6390: train loss 0.29122. lr 6.000000e-05:  69%|██████▊   | 6390/9301 [41:13<17:48,  2.73it/s][A
epoch 2 iter 6390: train loss 0.29122. lr 6.000000e-05:  69%|██████▊   | 6391/9301 [41:13<17:51,  2.71it/s][A
epoch 2 iter 6391: train loss 0.30876. lr 6.000000e-05:  69%|██████▊   | 6391/9301 [41:13<17:51,  2.71it/s][A
epoch 2 iter 6391: train loss 0.30876. lr 6.000000e-05:  69%|██████▊   | 6392/9301 [41:13<17:51,  2.72it/s][A
epoch 2 iter 6392: train loss 0.29339. lr 6.000000e-05:  69%|██████▊   | 6392/9301 [41:13<17:51,  2.72it/s][A
epoch 2 iter 6392: train loss 0.29339. lr 6.000000e-05:  69%|██████▊   | 6393/9301 [41:13<19:43,  2.46it/s][A
epoch 2 iter 6393: train loss 0.31366. lr 6.000000e-05:  69%|██████▊   | 6393/9301 [41:14<19:43,  2.46it/s][A
epoch 2 iter 6393: train loss 0.31366. lr 6.000000e-05:  69%|██████▊   | 6394/9301 [41:14<19:11,  2.52it/s][A
epoch 2 iter 6426: train loss 0.31934. lr 6.000000e-05:  69%|██████▉   | 6426/9301 [41:26<17:43,  2.70it/s][A
epoch 2 iter 6426: train loss 0.31934. lr 6.000000e-05:  69%|██████▉   | 6427/9301 [41:26<17:41,  2.71it/s][A
epoch 2 iter 6427: train loss 0.34099. lr 6.000000e-05:  69%|██████▉   | 6427/9301 [41:26<17:41,  2.71it/s][A
epoch 2 iter 6427: train loss 0.34099. lr 6.000000e-05:  69%|██████▉   | 6428/9301 [41:26<17:37,  2.72it/s][A
epoch 2 iter 6428: train loss 0.31673. lr 6.000000e-05:  69%|██████▉   | 6428/9301 [41:27<17:37,  2.72it/s][A
epoch 2 iter 6428: train loss 0.31673. lr 6.000000e-05:  69%|██████▉   | 6429/9301 [41:27<17:39,  2.71it/s][A
epoch 2 iter 6429: train loss 0.30841. lr 6.000000e-05:  69%|██████▉   | 6429/9301 [41:27<17:39,  2.71it/s][A
epoch 2 iter 6429: train loss 0.30841. lr 6.000000e-05:  69%|██████▉   | 6430/9301 [41:27<17:41,  2.70it/s][A
epoch 2 iter 6430: train loss 0.32593. lr 6.000000e-05:  69%|██████▉   | 6430/9301 [41:27<17:41,  2.70it/s][A

epoch 2 iter 6462: train loss 0.32740. lr 6.000000e-05:  69%|██████▉   | 6463/9301 [41:39<17:28,  2.71it/s][A
epoch 2 iter 6463: train loss 0.31682. lr 6.000000e-05:  69%|██████▉   | 6463/9301 [41:40<17:28,  2.71it/s][A
epoch 2 iter 6463: train loss 0.31682. lr 6.000000e-05:  69%|██████▉   | 6464/9301 [41:40<17:31,  2.70it/s][A
epoch 2 iter 6464: train loss 0.32212. lr 6.000000e-05:  69%|██████▉   | 6464/9301 [41:40<17:31,  2.70it/s][A
epoch 2 iter 6464: train loss 0.32212. lr 6.000000e-05:  70%|██████▉   | 6465/9301 [41:40<17:30,  2.70it/s][A
epoch 2 iter 6465: train loss 0.29660. lr 6.000000e-05:  70%|██████▉   | 6465/9301 [41:40<17:30,  2.70it/s][A
epoch 2 iter 6465: train loss 0.29660. lr 6.000000e-05:  70%|██████▉   | 6466/9301 [41:40<17:28,  2.70it/s][A
epoch 2 iter 6466: train loss 0.31268. lr 6.000000e-05:  70%|██████▉   | 6466/9301 [41:41<17:28,  2.70it/s][A
epoch 2 iter 6466: train loss 0.31268. lr 6.000000e-05:  70%|██████▉   | 6467/9301 [41:41<17:26,  2.71it/s][A

epoch 2 iter 6499: train loss 0.32898. lr 6.000000e-05:  70%|██████▉   | 6499/9301 [41:53<17:11,  2.72it/s][A
epoch 2 iter 6499: train loss 0.32898. lr 6.000000e-05:  70%|██████▉   | 6500/9301 [41:53<17:12,  2.71it/s][A
epoch 2 iter 6500: train loss 0.32028. lr 6.000000e-05:  70%|██████▉   | 6500/9301 [41:53<17:12,  2.71it/s][A
epoch 2 iter 6500: train loss 0.32028. lr 6.000000e-05:  70%|██████▉   | 6501/9301 [41:53<17:14,  2.71it/s][A
epoch 2 iter 6501: train loss 0.33053. lr 6.000000e-05:  70%|██████▉   | 6501/9301 [41:54<17:14,  2.71it/s][A
epoch 2 iter 6501: train loss 0.33053. lr 6.000000e-05:  70%|██████▉   | 6502/9301 [41:54<17:12,  2.71it/s][A
epoch 2 iter 6502: train loss 0.30835. lr 6.000000e-05:  70%|██████▉   | 6502/9301 [41:54<17:12,  2.71it/s][A
epoch 2 iter 6502: train loss 0.30835. lr 6.000000e-05:  70%|██████▉   | 6503/9301 [41:54<17:11,  2.71it/s][A
epoch 2 iter 6503: train loss 0.33531. lr 6.000000e-05:  70%|██████▉   | 6503/9301 [41:54<17:11,  2.71it/s][A

epoch 2 iter 6535: train loss 0.30691. lr 6.000000e-05:  70%|███████   | 6536/9301 [42:06<17:04,  2.70it/s][A
epoch 2 iter 6536: train loss 0.32320. lr 6.000000e-05:  70%|███████   | 6536/9301 [42:07<17:04,  2.70it/s][A
epoch 2 iter 6536: train loss 0.32320. lr 6.000000e-05:  70%|███████   | 6537/9301 [42:07<17:01,  2.71it/s][A
epoch 2 iter 6537: train loss 0.30326. lr 6.000000e-05:  70%|███████   | 6537/9301 [42:07<17:01,  2.71it/s][A
epoch 2 iter 6537: train loss 0.30326. lr 6.000000e-05:  70%|███████   | 6538/9301 [42:07<16:59,  2.71it/s][A
epoch 2 iter 6538: train loss 0.33166. lr 6.000000e-05:  70%|███████   | 6538/9301 [42:07<16:59,  2.71it/s][A
epoch 2 iter 6538: train loss 0.33166. lr 6.000000e-05:  70%|███████   | 6539/9301 [42:07<16:55,  2.72it/s][A
epoch 2 iter 6539: train loss 0.32927. lr 6.000000e-05:  70%|███████   | 6539/9301 [42:08<16:55,  2.72it/s][A
epoch 2 iter 6539: train loss 0.32927. lr 6.000000e-05:  70%|███████   | 6540/9301 [42:08<16:58,  2.71it/s][A

epoch 2 iter 6572: train loss 0.32422. lr 6.000000e-05:  71%|███████   | 6572/9301 [42:20<16:44,  2.72it/s][A
epoch 2 iter 6572: train loss 0.32422. lr 6.000000e-05:  71%|███████   | 6573/9301 [42:20<16:46,  2.71it/s][A
epoch 2 iter 6573: train loss 0.31285. lr 6.000000e-05:  71%|███████   | 6573/9301 [42:20<16:46,  2.71it/s][A
epoch 2 iter 6573: train loss 0.31285. lr 6.000000e-05:  71%|███████   | 6574/9301 [42:20<16:48,  2.70it/s][A
epoch 2 iter 6574: train loss 0.29891. lr 6.000000e-05:  71%|███████   | 6574/9301 [42:21<16:48,  2.70it/s][A
epoch 2 iter 6574: train loss 0.29891. lr 6.000000e-05:  71%|███████   | 6575/9301 [42:21<16:49,  2.70it/s][A
epoch 2 iter 6575: train loss 0.31644. lr 6.000000e-05:  71%|███████   | 6575/9301 [42:21<16:49,  2.70it/s][A
epoch 2 iter 6575: train loss 0.31644. lr 6.000000e-05:  71%|███████   | 6576/9301 [42:21<16:49,  2.70it/s][A
epoch 2 iter 6576: train loss 0.32234. lr 6.000000e-05:  71%|███████   | 6576/9301 [42:21<16:49,  2.70it/s][A

epoch 2 iter 6608: train loss 0.30212. lr 6.000000e-05:  71%|███████   | 6609/9301 [42:33<16:27,  2.73it/s][A
epoch 2 iter 6609: train loss 0.34768. lr 6.000000e-05:  71%|███████   | 6609/9301 [42:33<16:27,  2.73it/s][A
epoch 2 iter 6609: train loss 0.34768. lr 6.000000e-05:  71%|███████   | 6610/9301 [42:33<16:28,  2.72it/s][A
epoch 2 iter 6610: train loss 0.32170. lr 6.000000e-05:  71%|███████   | 6610/9301 [42:34<16:28,  2.72it/s][A
epoch 2 iter 6610: train loss 0.32170. lr 6.000000e-05:  71%|███████   | 6611/9301 [42:34<16:29,  2.72it/s][A
epoch 2 iter 6611: train loss 0.29820. lr 6.000000e-05:  71%|███████   | 6611/9301 [42:34<16:29,  2.72it/s][A
epoch 2 iter 6611: train loss 0.29820. lr 6.000000e-05:  71%|███████   | 6612/9301 [42:34<16:28,  2.72it/s][A
epoch 2 iter 6612: train loss 0.32595. lr 6.000000e-05:  71%|███████   | 6612/9301 [42:35<16:28,  2.72it/s][A
epoch 2 iter 6612: train loss 0.32595. lr 6.000000e-05:  71%|███████   | 6613/9301 [42:35<16:32,  2.71it/s][A

epoch 2 iter 6645: train loss 0.30768. lr 6.000000e-05:  71%|███████▏  | 6645/9301 [42:47<16:19,  2.71it/s][A
epoch 2 iter 6645: train loss 0.30768. lr 6.000000e-05:  71%|███████▏  | 6646/9301 [42:47<16:20,  2.71it/s][A
epoch 2 iter 6646: train loss 0.31806. lr 6.000000e-05:  71%|███████▏  | 6646/9301 [42:47<16:20,  2.71it/s][A
epoch 2 iter 6646: train loss 0.31806. lr 6.000000e-05:  71%|███████▏  | 6647/9301 [42:47<16:19,  2.71it/s][A
epoch 2 iter 6647: train loss 0.31658. lr 6.000000e-05:  71%|███████▏  | 6647/9301 [42:47<16:19,  2.71it/s][A
epoch 2 iter 6647: train loss 0.31658. lr 6.000000e-05:  71%|███████▏  | 6648/9301 [42:47<16:18,  2.71it/s][A
epoch 2 iter 6648: train loss 0.30602. lr 6.000000e-05:  71%|███████▏  | 6648/9301 [42:48<16:18,  2.71it/s][A
epoch 2 iter 6648: train loss 0.30602. lr 6.000000e-05:  71%|███████▏  | 6649/9301 [42:48<16:14,  2.72it/s][A
epoch 2 iter 6649: train loss 0.30464. lr 6.000000e-05:  71%|███████▏  | 6649/9301 [42:48<16:14,  2.72it/s][A

epoch 2 iter 6681: train loss 0.33149. lr 6.000000e-05:  72%|███████▏  | 6682/9301 [43:00<16:04,  2.72it/s][A
epoch 2 iter 6682: train loss 0.32350. lr 6.000000e-05:  72%|███████▏  | 6682/9301 [43:00<16:04,  2.72it/s][A
epoch 2 iter 6682: train loss 0.32350. lr 6.000000e-05:  72%|███████▏  | 6683/9301 [43:00<16:03,  2.72it/s][A
epoch 2 iter 6683: train loss 0.31901. lr 6.000000e-05:  72%|███████▏  | 6683/9301 [43:01<16:03,  2.72it/s][A
epoch 2 iter 6683: train loss 0.31901. lr 6.000000e-05:  72%|███████▏  | 6684/9301 [43:01<16:07,  2.71it/s][A
epoch 2 iter 6684: train loss 0.31414. lr 6.000000e-05:  72%|███████▏  | 6684/9301 [43:01<16:07,  2.71it/s][A
epoch 2 iter 6684: train loss 0.31414. lr 6.000000e-05:  72%|███████▏  | 6685/9301 [43:01<16:08,  2.70it/s][A
epoch 2 iter 6685: train loss 0.30166. lr 6.000000e-05:  72%|███████▏  | 6685/9301 [43:01<16:08,  2.70it/s][A
epoch 2 iter 6685: train loss 0.30166. lr 6.000000e-05:  72%|███████▏  | 6686/9301 [43:01<16:09,  2.70it/s][A

epoch 2 iter 6718: train loss 0.33880. lr 6.000000e-05:  72%|███████▏  | 6718/9301 [43:14<15:54,  2.71it/s][A
epoch 2 iter 6718: train loss 0.33880. lr 6.000000e-05:  72%|███████▏  | 6719/9301 [43:14<15:50,  2.72it/s][A
epoch 2 iter 6719: train loss 0.31921. lr 6.000000e-05:  72%|███████▏  | 6719/9301 [43:14<15:50,  2.72it/s][A
epoch 2 iter 6719: train loss 0.31921. lr 6.000000e-05:  72%|███████▏  | 6720/9301 [43:14<15:52,  2.71it/s][A
epoch 2 iter 6720: train loss 0.33657. lr 6.000000e-05:  72%|███████▏  | 6720/9301 [43:14<15:52,  2.71it/s][A
epoch 2 iter 6720: train loss 0.33657. lr 6.000000e-05:  72%|███████▏  | 6721/9301 [43:14<15:52,  2.71it/s][A
epoch 2 iter 6721: train loss 0.32986. lr 6.000000e-05:  72%|███████▏  | 6721/9301 [43:15<15:52,  2.71it/s][A
epoch 2 iter 6721: train loss 0.32986. lr 6.000000e-05:  72%|███████▏  | 6722/9301 [43:15<15:51,  2.71it/s][A
epoch 2 iter 6722: train loss 0.31454. lr 6.000000e-05:  72%|███████▏  | 6722/9301 [43:15<15:51,  2.71it/s][A

epoch 2 iter 6754: train loss 0.29844. lr 6.000000e-05:  73%|███████▎  | 6755/9301 [43:27<15:39,  2.71it/s][A
epoch 2 iter 6755: train loss 0.31527. lr 6.000000e-05:  73%|███████▎  | 6755/9301 [43:27<15:39,  2.71it/s][A
epoch 2 iter 6755: train loss 0.31527. lr 6.000000e-05:  73%|███████▎  | 6756/9301 [43:27<15:40,  2.71it/s][A
epoch 2 iter 6756: train loss 0.31679. lr 6.000000e-05:  73%|███████▎  | 6756/9301 [43:28<15:40,  2.71it/s][A
epoch 2 iter 6756: train loss 0.31679. lr 6.000000e-05:  73%|███████▎  | 6757/9301 [43:28<15:38,  2.71it/s][A
epoch 2 iter 6757: train loss 0.31730. lr 6.000000e-05:  73%|███████▎  | 6757/9301 [43:28<15:38,  2.71it/s][A
epoch 2 iter 6757: train loss 0.31730. lr 6.000000e-05:  73%|███████▎  | 6758/9301 [43:28<15:38,  2.71it/s][A
epoch 2 iter 6758: train loss 0.31510. lr 6.000000e-05:  73%|███████▎  | 6758/9301 [43:28<15:38,  2.71it/s][A
epoch 2 iter 6758: train loss 0.31510. lr 6.000000e-05:  73%|███████▎  | 6759/9301 [43:28<15:35,  2.72it/s][A

epoch 2 iter 6791: train loss 0.32015. lr 6.000000e-05:  73%|███████▎  | 6791/9301 [43:41<15:25,  2.71it/s][A
epoch 2 iter 6791: train loss 0.32015. lr 6.000000e-05:  73%|███████▎  | 6792/9301 [43:41<15:25,  2.71it/s][A
epoch 2 iter 6792: train loss 0.31808. lr 6.000000e-05:  73%|███████▎  | 6792/9301 [43:41<15:25,  2.71it/s][A
epoch 2 iter 6792: train loss 0.31808. lr 6.000000e-05:  73%|███████▎  | 6793/9301 [43:41<15:24,  2.71it/s][A
epoch 2 iter 6793: train loss 0.31310. lr 6.000000e-05:  73%|███████▎  | 6793/9301 [43:41<15:24,  2.71it/s][A
epoch 2 iter 6793: train loss 0.31310. lr 6.000000e-05:  73%|███████▎  | 6794/9301 [43:41<15:20,  2.72it/s][A
epoch 2 iter 6794: train loss 0.32196. lr 6.000000e-05:  73%|███████▎  | 6794/9301 [43:42<15:20,  2.72it/s][A
epoch 2 iter 6794: train loss 0.32196. lr 6.000000e-05:  73%|███████▎  | 6795/9301 [43:42<15:23,  2.71it/s][A
epoch 2 iter 6795: train loss 0.32418. lr 6.000000e-05:  73%|███████▎  | 6795/9301 [43:42<15:23,  2.71it/s][A

epoch 2 iter 6827: train loss 0.30720. lr 6.000000e-05:  73%|███████▎  | 6828/9301 [43:54<15:12,  2.71it/s][A
epoch 2 iter 6828: train loss 0.33774. lr 6.000000e-05:  73%|███████▎  | 6828/9301 [43:54<15:12,  2.71it/s][A
epoch 2 iter 6828: train loss 0.33774. lr 6.000000e-05:  73%|███████▎  | 6829/9301 [43:54<15:08,  2.72it/s][A
epoch 2 iter 6829: train loss 0.32940. lr 6.000000e-05:  73%|███████▎  | 6829/9301 [43:55<15:08,  2.72it/s][A
epoch 2 iter 6829: train loss 0.32940. lr 6.000000e-05:  73%|███████▎  | 6830/9301 [43:55<15:11,  2.71it/s][A
epoch 2 iter 6830: train loss 0.32140. lr 6.000000e-05:  73%|███████▎  | 6830/9301 [43:55<15:11,  2.71it/s][A
epoch 2 iter 6830: train loss 0.32140. lr 6.000000e-05:  73%|███████▎  | 6831/9301 [43:55<15:12,  2.71it/s][A
epoch 2 iter 6831: train loss 0.32306. lr 6.000000e-05:  73%|███████▎  | 6831/9301 [43:55<15:12,  2.71it/s][A
epoch 2 iter 6831: train loss 0.32306. lr 6.000000e-05:  73%|███████▎  | 6832/9301 [43:55<15:11,  2.71it/s][A

epoch 2 iter 6864: train loss 0.29572. lr 6.000000e-05:  74%|███████▍  | 6864/9301 [44:07<14:58,  2.71it/s][A
epoch 2 iter 6864: train loss 0.29572. lr 6.000000e-05:  74%|███████▍  | 6865/9301 [44:07<14:54,  2.72it/s][A
epoch 2 iter 6865: train loss 0.32829. lr 6.000000e-05:  74%|███████▍  | 6865/9301 [44:08<14:54,  2.72it/s][A
epoch 2 iter 6865: train loss 0.32829. lr 6.000000e-05:  74%|███████▍  | 6866/9301 [44:08<14:56,  2.72it/s][A
epoch 2 iter 6866: train loss 0.31878. lr 6.000000e-05:  74%|███████▍  | 6866/9301 [44:08<14:56,  2.72it/s][A
epoch 2 iter 6866: train loss 0.31878. lr 6.000000e-05:  74%|███████▍  | 6867/9301 [44:08<14:57,  2.71it/s][A
epoch 2 iter 6867: train loss 0.29601. lr 6.000000e-05:  74%|███████▍  | 6867/9301 [44:09<14:57,  2.71it/s][A
epoch 2 iter 6867: train loss 0.29601. lr 6.000000e-05:  74%|███████▍  | 6868/9301 [44:09<14:56,  2.71it/s][A
epoch 2 iter 6868: train loss 0.32574. lr 6.000000e-05:  74%|███████▍  | 6868/9301 [44:09<14:56,  2.71it/s][A

epoch 2 iter 6900: train loss 0.31678. lr 6.000000e-05:  74%|███████▍  | 6901/9301 [44:21<14:44,  2.71it/s][A
epoch 2 iter 6901: train loss 0.32371. lr 6.000000e-05:  74%|███████▍  | 6901/9301 [44:21<14:44,  2.71it/s][A
epoch 2 iter 6901: train loss 0.32371. lr 6.000000e-05:  74%|███████▍  | 6902/9301 [44:21<14:45,  2.71it/s][A
epoch 2 iter 6902: train loss 0.30779. lr 6.000000e-05:  74%|███████▍  | 6902/9301 [44:22<14:45,  2.71it/s][A
epoch 2 iter 6902: train loss 0.30779. lr 6.000000e-05:  74%|███████▍  | 6903/9301 [44:22<14:44,  2.71it/s][A
epoch 2 iter 6903: train loss 0.33201. lr 6.000000e-05:  74%|███████▍  | 6903/9301 [44:22<14:44,  2.71it/s][A
epoch 2 iter 6903: train loss 0.33201. lr 6.000000e-05:  74%|███████▍  | 6904/9301 [44:22<14:42,  2.72it/s][A
epoch 2 iter 6904: train loss 0.31114. lr 6.000000e-05:  74%|███████▍  | 6904/9301 [44:22<14:42,  2.72it/s][A
epoch 2 iter 6904: train loss 0.31114. lr 6.000000e-05:  74%|███████▍  | 6905/9301 [44:22<14:40,  2.72it/s][A

epoch 2 iter 6937: train loss 0.31886. lr 6.000000e-05:  75%|███████▍  | 6937/9301 [44:34<14:33,  2.71it/s][A
epoch 2 iter 6937: train loss 0.31886. lr 6.000000e-05:  75%|███████▍  | 6938/9301 [44:34<14:32,  2.71it/s][A
epoch 2 iter 6938: train loss 0.30596. lr 6.000000e-05:  75%|███████▍  | 6938/9301 [44:35<14:32,  2.71it/s][A
epoch 2 iter 6938: train loss 0.30596. lr 6.000000e-05:  75%|███████▍  | 6939/9301 [44:35<14:32,  2.71it/s][A
epoch 2 iter 6939: train loss 0.31515. lr 6.000000e-05:  75%|███████▍  | 6939/9301 [44:35<14:32,  2.71it/s][A
epoch 2 iter 6939: train loss 0.31515. lr 6.000000e-05:  75%|███████▍  | 6940/9301 [44:35<14:28,  2.72it/s][A
epoch 2 iter 6940: train loss 0.29186. lr 6.000000e-05:  75%|███████▍  | 6940/9301 [44:36<14:28,  2.72it/s][A
epoch 2 iter 6940: train loss 0.29186. lr 6.000000e-05:  75%|███████▍  | 6941/9301 [44:36<14:30,  2.71it/s][A
epoch 2 iter 6941: train loss 0.30560. lr 6.000000e-05:  75%|███████▍  | 6941/9301 [44:36<14:30,  2.71it/s][A

epoch 2 iter 6973: train loss 0.32201. lr 6.000000e-05:  75%|███████▍  | 6974/9301 [44:48<14:18,  2.71it/s][A
epoch 2 iter 6974: train loss 0.31100. lr 6.000000e-05:  75%|███████▍  | 6974/9301 [44:48<14:18,  2.71it/s][A
epoch 2 iter 6974: train loss 0.31100. lr 6.000000e-05:  75%|███████▍  | 6975/9301 [44:48<14:16,  2.72it/s][A
epoch 2 iter 6975: train loss 0.31461. lr 6.000000e-05:  75%|███████▍  | 6975/9301 [44:48<14:16,  2.72it/s][A
epoch 2 iter 6975: train loss 0.31461. lr 6.000000e-05:  75%|███████▌  | 6976/9301 [44:48<14:18,  2.71it/s][A
epoch 2 iter 6976: train loss 0.30360. lr 6.000000e-05:  75%|███████▌  | 6976/9301 [44:49<14:18,  2.71it/s][A
epoch 2 iter 6976: train loss 0.30360. lr 6.000000e-05:  75%|███████▌  | 6977/9301 [44:49<14:18,  2.71it/s][A
epoch 2 iter 6977: train loss 0.31103. lr 6.000000e-05:  75%|███████▌  | 6977/9301 [44:49<14:18,  2.71it/s][A
epoch 2 iter 6977: train loss 0.31103. lr 6.000000e-05:  75%|███████▌  | 6978/9301 [44:49<14:17,  2.71it/s][A

epoch 2 iter 7010: train loss 0.31454. lr 6.000000e-05:  75%|███████▌  | 7010/9301 [45:01<14:08,  2.70it/s][A
epoch 2 iter 7010: train loss 0.31454. lr 6.000000e-05:  75%|███████▌  | 7011/9301 [45:01<14:05,  2.71it/s][A
epoch 2 iter 7011: train loss 0.31488. lr 6.000000e-05:  75%|███████▌  | 7011/9301 [45:02<14:05,  2.71it/s][A
epoch 2 iter 7011: train loss 0.31488. lr 6.000000e-05:  75%|███████▌  | 7012/9301 [45:02<14:08,  2.70it/s][A
epoch 2 iter 7012: train loss 0.29748. lr 6.000000e-05:  75%|███████▌  | 7012/9301 [45:02<14:08,  2.70it/s][A
epoch 2 iter 7012: train loss 0.29748. lr 6.000000e-05:  75%|███████▌  | 7013/9301 [45:02<14:09,  2.69it/s][A
epoch 2 iter 7013: train loss 0.30990. lr 6.000000e-05:  75%|███████▌  | 7013/9301 [45:03<14:09,  2.69it/s][A
epoch 2 iter 7013: train loss 0.30990. lr 6.000000e-05:  75%|███████▌  | 7014/9301 [45:03<14:09,  2.69it/s][A
epoch 2 iter 7014: train loss 0.33429. lr 6.000000e-05:  75%|███████▌  | 7014/9301 [45:03<14:09,  2.69it/s][A

epoch 2 iter 7046: train loss 0.34321. lr 6.000000e-05:  76%|███████▌  | 7047/9301 [45:15<13:52,  2.71it/s][A
epoch 2 iter 7047: train loss 0.32963. lr 6.000000e-05:  76%|███████▌  | 7047/9301 [45:15<13:52,  2.71it/s][A
epoch 2 iter 7047: train loss 0.32963. lr 6.000000e-05:  76%|███████▌  | 7048/9301 [45:15<13:52,  2.71it/s][A
epoch 2 iter 7048: train loss 0.30277. lr 6.000000e-05:  76%|███████▌  | 7048/9301 [45:15<13:52,  2.71it/s][A
epoch 2 iter 7048: train loss 0.30277. lr 6.000000e-05:  76%|███████▌  | 7049/9301 [45:15<13:54,  2.70it/s][A
epoch 2 iter 7049: train loss 0.30345. lr 6.000000e-05:  76%|███████▌  | 7049/9301 [45:16<13:54,  2.70it/s][A
epoch 2 iter 7049: train loss 0.30345. lr 6.000000e-05:  76%|███████▌  | 7050/9301 [45:16<13:53,  2.70it/s][A
epoch 2 iter 7050: train loss 0.31465. lr 6.000000e-05:  76%|███████▌  | 7050/9301 [45:16<13:53,  2.70it/s][A
epoch 2 iter 7050: train loss 0.31465. lr 6.000000e-05:  76%|███████▌  | 7051/9301 [45:16<13:51,  2.70it/s][A

epoch 2 iter 7083: train loss 0.31075. lr 6.000000e-05:  76%|███████▌  | 7083/9301 [45:28<13:35,  2.72it/s][A
epoch 2 iter 7083: train loss 0.31075. lr 6.000000e-05:  76%|███████▌  | 7084/9301 [45:28<13:36,  2.71it/s][A
epoch 2 iter 7084: train loss 0.31398. lr 6.000000e-05:  76%|███████▌  | 7084/9301 [45:29<13:36,  2.71it/s][A
epoch 2 iter 7084: train loss 0.31398. lr 6.000000e-05:  76%|███████▌  | 7085/9301 [45:29<13:38,  2.71it/s][A
epoch 2 iter 7085: train loss 0.30573. lr 6.000000e-05:  76%|███████▌  | 7085/9301 [45:29<13:38,  2.71it/s][A
epoch 2 iter 7085: train loss 0.30573. lr 6.000000e-05:  76%|███████▌  | 7086/9301 [45:29<13:37,  2.71it/s][A
epoch 2 iter 7086: train loss 0.31439. lr 6.000000e-05:  76%|███████▌  | 7086/9301 [45:29<13:37,  2.71it/s][A
epoch 2 iter 7086: train loss 0.31439. lr 6.000000e-05:  76%|███████▌  | 7087/9301 [45:29<13:36,  2.71it/s][A
epoch 2 iter 7087: train loss 0.32486. lr 6.000000e-05:  76%|███████▌  | 7087/9301 [45:30<13:36,  2.71it/s][A

epoch 2 iter 7119: train loss 0.31657. lr 6.000000e-05:  77%|███████▋  | 7120/9301 [45:42<13:24,  2.71it/s][A
epoch 2 iter 7120: train loss 0.31107. lr 6.000000e-05:  77%|███████▋  | 7120/9301 [45:42<13:24,  2.71it/s][A
epoch 2 iter 7120: train loss 0.31107. lr 6.000000e-05:  77%|███████▋  | 7121/9301 [45:42<13:25,  2.71it/s][A
epoch 2 iter 7121: train loss 0.32220. lr 6.000000e-05:  77%|███████▋  | 7121/9301 [45:42<13:25,  2.71it/s][A
epoch 2 iter 7121: train loss 0.32220. lr 6.000000e-05:  77%|███████▋  | 7122/9301 [45:42<13:25,  2.71it/s][A
epoch 2 iter 7122: train loss 0.32664. lr 6.000000e-05:  77%|███████▋  | 7122/9301 [45:43<13:25,  2.71it/s][A
epoch 2 iter 7122: train loss 0.32664. lr 6.000000e-05:  77%|███████▋  | 7123/9301 [45:43<13:24,  2.71it/s][A
epoch 2 iter 7123: train loss 0.31070. lr 6.000000e-05:  77%|███████▋  | 7123/9301 [45:43<13:24,  2.71it/s][A
epoch 2 iter 7123: train loss 0.31070. lr 6.000000e-05:  77%|███████▋  | 7124/9301 [45:43<13:23,  2.71it/s][A

epoch 2 iter 7156: train loss 0.31908. lr 6.000000e-05:  77%|███████▋  | 7156/9301 [45:55<13:10,  2.71it/s][A
epoch 2 iter 7156: train loss 0.31908. lr 6.000000e-05:  77%|███████▋  | 7157/9301 [45:55<13:11,  2.71it/s][A
epoch 2 iter 7157: train loss 0.30483. lr 6.000000e-05:  77%|███████▋  | 7157/9301 [45:56<13:11,  2.71it/s][A
epoch 2 iter 7157: train loss 0.30483. lr 6.000000e-05:  77%|███████▋  | 7158/9301 [45:56<13:11,  2.71it/s][A
epoch 2 iter 7158: train loss 0.32700. lr 6.000000e-05:  77%|███████▋  | 7158/9301 [45:56<13:11,  2.71it/s][A
epoch 2 iter 7158: train loss 0.32700. lr 6.000000e-05:  77%|███████▋  | 7159/9301 [45:56<13:10,  2.71it/s][A
epoch 2 iter 7159: train loss 0.31466. lr 6.000000e-05:  77%|███████▋  | 7159/9301 [45:56<13:10,  2.71it/s][A
epoch 2 iter 7159: train loss 0.31466. lr 6.000000e-05:  77%|███████▋  | 7160/9301 [45:56<13:07,  2.72it/s][A
epoch 2 iter 7160: train loss 0.29371. lr 6.000000e-05:  77%|███████▋  | 7160/9301 [45:57<13:07,  2.72it/s][A

epoch 2 iter 7192: train loss 0.33583. lr 6.000000e-05:  77%|███████▋  | 7193/9301 [46:09<12:57,  2.71it/s][A
epoch 2 iter 7193: train loss 0.30562. lr 6.000000e-05:  77%|███████▋  | 7193/9301 [46:09<12:57,  2.71it/s][A
epoch 2 iter 7193: train loss 0.30562. lr 6.000000e-05:  77%|███████▋  | 7194/9301 [46:09<12:56,  2.71it/s][A
epoch 2 iter 7194: train loss 0.30620. lr 6.000000e-05:  77%|███████▋  | 7194/9301 [46:09<12:56,  2.71it/s][A
epoch 2 iter 7194: train loss 0.30620. lr 6.000000e-05:  77%|███████▋  | 7195/9301 [46:09<12:53,  2.72it/s][A
epoch 2 iter 7195: train loss 0.29336. lr 6.000000e-05:  77%|███████▋  | 7195/9301 [46:10<12:53,  2.72it/s][A
epoch 2 iter 7195: train loss 0.29336. lr 6.000000e-05:  77%|███████▋  | 7196/9301 [46:10<12:56,  2.71it/s][A
epoch 2 iter 7196: train loss 0.32319. lr 6.000000e-05:  77%|███████▋  | 7196/9301 [46:10<12:56,  2.71it/s][A
epoch 2 iter 7196: train loss 0.32319. lr 6.000000e-05:  77%|███████▋  | 7197/9301 [46:10<12:57,  2.71it/s][A

epoch 2 iter 7229: train loss 0.32227. lr 6.000000e-05:  78%|███████▊  | 7229/9301 [46:22<12:46,  2.70it/s][A
epoch 2 iter 7229: train loss 0.32227. lr 6.000000e-05:  78%|███████▊  | 7230/9301 [46:22<12:44,  2.71it/s][A
epoch 2 iter 7230: train loss 0.30536. lr 6.000000e-05:  78%|███████▊  | 7230/9301 [46:23<12:44,  2.71it/s][A
epoch 2 iter 7230: train loss 0.30536. lr 6.000000e-05:  78%|███████▊  | 7231/9301 [46:23<12:44,  2.71it/s][A
epoch 2 iter 7231: train loss 0.29838. lr 6.000000e-05:  78%|███████▊  | 7231/9301 [46:23<12:44,  2.71it/s][A
epoch 2 iter 7231: train loss 0.29838. lr 6.000000e-05:  78%|███████▊  | 7232/9301 [46:23<12:45,  2.70it/s][A
epoch 2 iter 7232: train loss 0.31998. lr 6.000000e-05:  78%|███████▊  | 7232/9301 [46:23<12:45,  2.70it/s][A
epoch 2 iter 7232: train loss 0.31998. lr 6.000000e-05:  78%|███████▊  | 7233/9301 [46:23<12:46,  2.70it/s][A
epoch 2 iter 7233: train loss 0.31488. lr 6.000000e-05:  78%|███████▊  | 7233/9301 [46:24<12:46,  2.70it/s][A

epoch 2 iter 7265: train loss 0.29845. lr 6.000000e-05:  78%|███████▊  | 7266/9301 [46:36<12:32,  2.70it/s][A
epoch 2 iter 7266: train loss 0.33632. lr 6.000000e-05:  78%|███████▊  | 7266/9301 [46:36<12:32,  2.70it/s][A
epoch 2 iter 7266: train loss 0.33632. lr 6.000000e-05:  78%|███████▊  | 7267/9301 [46:36<12:30,  2.71it/s][A
epoch 2 iter 7267: train loss 0.33116. lr 6.000000e-05:  78%|███████▊  | 7267/9301 [46:36<12:30,  2.71it/s][A
epoch 2 iter 7267: train loss 0.33116. lr 6.000000e-05:  78%|███████▊  | 7268/9301 [46:36<12:27,  2.72it/s][A
epoch 2 iter 7268: train loss 0.31210. lr 6.000000e-05:  78%|███████▊  | 7268/9301 [46:37<12:27,  2.72it/s][A
epoch 2 iter 7268: train loss 0.31210. lr 6.000000e-05:  78%|███████▊  | 7269/9301 [46:37<12:28,  2.71it/s][A
epoch 2 iter 7269: train loss 0.32580. lr 6.000000e-05:  78%|███████▊  | 7269/9301 [46:37<12:28,  2.71it/s][A
epoch 2 iter 7269: train loss 0.32580. lr 6.000000e-05:  78%|███████▊  | 7270/9301 [46:37<12:29,  2.71it/s][A

epoch 2 iter 7302: train loss 0.30003. lr 6.000000e-05:  79%|███████▊  | 7302/9301 [46:49<12:16,  2.71it/s][A
epoch 2 iter 7302: train loss 0.30003. lr 6.000000e-05:  79%|███████▊  | 7303/9301 [46:49<12:13,  2.72it/s][A
epoch 2 iter 7303: train loss 0.30055. lr 6.000000e-05:  79%|███████▊  | 7303/9301 [46:50<12:13,  2.72it/s][A
epoch 2 iter 7303: train loss 0.30055. lr 6.000000e-05:  79%|███████▊  | 7304/9301 [46:50<12:15,  2.72it/s][A
epoch 2 iter 7304: train loss 0.31229. lr 6.000000e-05:  79%|███████▊  | 7304/9301 [46:50<12:15,  2.72it/s][A
epoch 2 iter 7304: train loss 0.31229. lr 6.000000e-05:  79%|███████▊  | 7305/9301 [46:50<12:15,  2.71it/s][A
epoch 2 iter 7305: train loss 0.30479. lr 6.000000e-05:  79%|███████▊  | 7305/9301 [46:50<12:15,  2.71it/s][A
epoch 2 iter 7305: train loss 0.30479. lr 6.000000e-05:  79%|███████▊  | 7306/9301 [46:50<12:14,  2.71it/s][A
epoch 2 iter 7306: train loss 0.32466. lr 6.000000e-05:  79%|███████▊  | 7306/9301 [46:51<12:14,  2.71it/s][A

epoch 2 iter 7338: train loss 0.32652. lr 6.000000e-05:  79%|███████▉  | 7339/9301 [47:03<12:00,  2.72it/s][A
epoch 2 iter 7339: train loss 0.31525. lr 6.000000e-05:  79%|███████▉  | 7339/9301 [47:03<12:00,  2.72it/s][A
epoch 2 iter 7339: train loss 0.31525. lr 6.000000e-05:  79%|███████▉  | 7340/9301 [47:03<12:02,  2.72it/s][A
epoch 2 iter 7340: train loss 0.31750. lr 6.000000e-05:  79%|███████▉  | 7340/9301 [47:03<12:02,  2.72it/s][A
epoch 2 iter 7340: train loss 0.31750. lr 6.000000e-05:  79%|███████▉  | 7341/9301 [47:03<12:01,  2.72it/s][A
epoch 2 iter 7341: train loss 0.30383. lr 6.000000e-05:  79%|███████▉  | 7341/9301 [47:04<12:01,  2.72it/s][A
epoch 2 iter 7341: train loss 0.30383. lr 6.000000e-05:  79%|███████▉  | 7342/9301 [47:04<12:01,  2.71it/s][A
epoch 2 iter 7342: train loss 0.31838. lr 6.000000e-05:  79%|███████▉  | 7342/9301 [47:04<12:01,  2.71it/s][A
epoch 2 iter 7342: train loss 0.31838. lr 6.000000e-05:  79%|███████▉  | 7343/9301 [47:04<12:02,  2.71it/s][A

epoch 2 iter 7375: train loss 0.32646. lr 6.000000e-05:  79%|███████▉  | 7375/9301 [47:16<11:51,  2.71it/s][A
epoch 2 iter 7375: train loss 0.32646. lr 6.000000e-05:  79%|███████▉  | 7376/9301 [47:16<11:50,  2.71it/s][A
epoch 2 iter 7376: train loss 0.31852. lr 6.000000e-05:  79%|███████▉  | 7376/9301 [47:17<11:50,  2.71it/s][A
epoch 2 iter 7376: train loss 0.31852. lr 6.000000e-05:  79%|███████▉  | 7377/9301 [47:17<11:49,  2.71it/s][A
epoch 2 iter 7377: train loss 0.31144. lr 6.000000e-05:  79%|███████▉  | 7377/9301 [47:17<11:49,  2.71it/s][A
epoch 2 iter 7377: train loss 0.31144. lr 6.000000e-05:  79%|███████▉  | 7378/9301 [47:17<11:46,  2.72it/s][A
epoch 2 iter 7378: train loss 0.31903. lr 6.000000e-05:  79%|███████▉  | 7378/9301 [47:17<11:46,  2.72it/s][A
epoch 2 iter 7378: train loss 0.31903. lr 6.000000e-05:  79%|███████▉  | 7379/9301 [47:17<11:48,  2.71it/s][A
epoch 2 iter 7379: train loss 0.32228. lr 6.000000e-05:  79%|███████▉  | 7379/9301 [47:18<11:48,  2.71it/s][A

epoch 2 iter 7411: train loss 0.31779. lr 6.000000e-05:  80%|███████▉  | 7412/9301 [47:30<11:38,  2.70it/s][A
epoch 2 iter 7412: train loss 0.31643. lr 6.000000e-05:  80%|███████▉  | 7412/9301 [47:30<11:38,  2.70it/s][A
epoch 2 iter 7412: train loss 0.31643. lr 6.000000e-05:  80%|███████▉  | 7413/9301 [47:30<11:37,  2.71it/s][A
epoch 2 iter 7413: train loss 0.30787. lr 6.000000e-05:  80%|███████▉  | 7413/9301 [47:30<11:37,  2.71it/s][A
epoch 2 iter 7413: train loss 0.30787. lr 6.000000e-05:  80%|███████▉  | 7414/9301 [47:30<11:34,  2.72it/s][A
epoch 2 iter 7414: train loss 0.31299. lr 6.000000e-05:  80%|███████▉  | 7414/9301 [47:31<11:34,  2.72it/s][A
epoch 2 iter 7414: train loss 0.31299. lr 6.000000e-05:  80%|███████▉  | 7415/9301 [47:31<11:37,  2.70it/s][A
epoch 2 iter 7415: train loss 0.32050. lr 6.000000e-05:  80%|███████▉  | 7415/9301 [47:31<11:37,  2.70it/s][A
epoch 2 iter 7415: train loss 0.32050. lr 6.000000e-05:  80%|███████▉  | 7416/9301 [47:31<11:39,  2.70it/s][A

epoch 2 iter 7448: train loss 0.30933. lr 6.000000e-05:  80%|████████  | 7448/9301 [47:43<11:23,  2.71it/s][A
epoch 2 iter 7448: train loss 0.30933. lr 6.000000e-05:  80%|████████  | 7449/9301 [47:43<11:22,  2.71it/s][A
epoch 2 iter 7449: train loss 0.30675. lr 6.000000e-05:  80%|████████  | 7449/9301 [47:44<11:22,  2.71it/s][A
epoch 2 iter 7449: train loss 0.30675. lr 6.000000e-05:  80%|████████  | 7450/9301 [47:44<11:18,  2.73it/s][A
epoch 2 iter 7450: train loss 0.30684. lr 6.000000e-05:  80%|████████  | 7450/9301 [47:44<11:18,  2.73it/s][A
epoch 2 iter 7450: train loss 0.30684. lr 6.000000e-05:  80%|████████  | 7451/9301 [47:44<11:21,  2.71it/s][A
epoch 2 iter 7451: train loss 0.31741. lr 6.000000e-05:  80%|████████  | 7451/9301 [47:44<11:21,  2.71it/s][A
epoch 2 iter 7451: train loss 0.31741. lr 6.000000e-05:  80%|████████  | 7452/9301 [47:44<11:21,  2.71it/s][A
epoch 2 iter 7452: train loss 0.31490. lr 6.000000e-05:  80%|████████  | 7452/9301 [47:45<11:21,  2.71it/s][A

epoch 2 iter 7484: train loss 0.31654. lr 6.000000e-05:  80%|████████  | 7485/9301 [47:56<11:06,  2.72it/s][A
epoch 2 iter 7485: train loss 0.30452. lr 6.000000e-05:  80%|████████  | 7485/9301 [47:57<11:06,  2.72it/s][A
epoch 2 iter 7485: train loss 0.30452. lr 6.000000e-05:  80%|████████  | 7486/9301 [47:57<11:08,  2.72it/s][A
epoch 2 iter 7486: train loss 0.33993. lr 6.000000e-05:  80%|████████  | 7486/9301 [47:57<11:08,  2.72it/s][A
epoch 2 iter 7486: train loss 0.33993. lr 6.000000e-05:  80%|████████  | 7487/9301 [47:57<11:08,  2.71it/s][A
epoch 2 iter 7487: train loss 0.30212. lr 6.000000e-05:  80%|████████  | 7487/9301 [47:58<11:08,  2.71it/s][A
epoch 2 iter 7487: train loss 0.30212. lr 6.000000e-05:  81%|████████  | 7488/9301 [47:58<11:07,  2.71it/s][A
epoch 2 iter 7488: train loss 0.31857. lr 6.000000e-05:  81%|████████  | 7488/9301 [47:58<11:07,  2.71it/s][A
epoch 2 iter 7488: train loss 0.31857. lr 6.000000e-05:  81%|████████  | 7489/9301 [47:58<11:07,  2.71it/s][A

epoch 2 iter 7521: train loss 0.30205. lr 6.000000e-05:  81%|████████  | 7521/9301 [48:10<10:55,  2.71it/s][A
epoch 2 iter 7521: train loss 0.30205. lr 6.000000e-05:  81%|████████  | 7522/9301 [48:10<10:56,  2.71it/s][A
epoch 2 iter 7522: train loss 0.30398. lr 6.000000e-05:  81%|████████  | 7522/9301 [48:10<10:56,  2.71it/s][A
epoch 2 iter 7522: train loss 0.30398. lr 6.000000e-05:  81%|████████  | 7523/9301 [48:10<10:55,  2.71it/s][A
epoch 2 iter 7523: train loss 0.31027. lr 6.000000e-05:  81%|████████  | 7523/9301 [48:11<10:55,  2.71it/s][A
epoch 2 iter 7523: train loss 0.31027. lr 6.000000e-05:  81%|████████  | 7524/9301 [48:11<10:55,  2.71it/s][A
epoch 2 iter 7524: train loss 0.31141. lr 6.000000e-05:  81%|████████  | 7524/9301 [48:11<10:55,  2.71it/s][A
epoch 2 iter 7524: train loss 0.31141. lr 6.000000e-05:  81%|████████  | 7525/9301 [48:11<10:52,  2.72it/s][A
epoch 2 iter 7525: train loss 0.30439. lr 6.000000e-05:  81%|████████  | 7525/9301 [48:12<10:52,  2.72it/s][A

epoch 2 iter 7557: train loss 0.31191. lr 6.000000e-05:  81%|████████▏ | 7558/9301 [48:23<10:43,  2.71it/s][A
epoch 2 iter 7558: train loss 0.31112. lr 6.000000e-05:  81%|████████▏ | 7558/9301 [48:24<10:43,  2.71it/s][A
epoch 2 iter 7558: train loss 0.31112. lr 6.000000e-05:  81%|████████▏ | 7559/9301 [48:24<10:42,  2.71it/s][A
epoch 2 iter 7559: train loss 0.31733. lr 6.000000e-05:  81%|████████▏ | 7559/9301 [48:24<10:42,  2.71it/s][A
epoch 2 iter 7559: train loss 0.31733. lr 6.000000e-05:  81%|████████▏ | 7560/9301 [48:24<10:39,  2.72it/s][A
epoch 2 iter 7560: train loss 0.30566. lr 6.000000e-05:  81%|████████▏ | 7560/9301 [48:24<10:39,  2.72it/s][A
epoch 2 iter 7560: train loss 0.30566. lr 6.000000e-05:  81%|████████▏ | 7561/9301 [48:24<10:41,  2.71it/s][A
epoch 2 iter 7561: train loss 0.31531. lr 6.000000e-05:  81%|████████▏ | 7561/9301 [48:25<10:41,  2.71it/s][A
epoch 2 iter 7561: train loss 0.31531. lr 6.000000e-05:  81%|████████▏ | 7562/9301 [48:25<10:41,  2.71it/s][A

epoch 2 iter 7594: train loss 0.30043. lr 6.000000e-05:  82%|████████▏ | 7594/9301 [48:37<10:28,  2.72it/s][A
epoch 2 iter 7594: train loss 0.30043. lr 6.000000e-05:  82%|████████▏ | 7595/9301 [48:37<10:29,  2.71it/s][A
epoch 2 iter 7595: train loss 0.31696. lr 6.000000e-05:  82%|████████▏ | 7595/9301 [48:38<10:29,  2.71it/s][A
epoch 2 iter 7595: train loss 0.31696. lr 6.000000e-05:  82%|████████▏ | 7596/9301 [48:38<10:29,  2.71it/s][A
epoch 2 iter 7596: train loss 0.29721. lr 6.000000e-05:  82%|████████▏ | 7596/9301 [48:38<10:29,  2.71it/s][A
epoch 2 iter 7596: train loss 0.29721. lr 6.000000e-05:  82%|████████▏ | 7597/9301 [48:38<10:28,  2.71it/s][A
epoch 2 iter 7597: train loss 0.31391. lr 6.000000e-05:  82%|████████▏ | 7597/9301 [48:38<10:28,  2.71it/s][A
epoch 2 iter 7597: train loss 0.31391. lr 6.000000e-05:  82%|████████▏ | 7598/9301 [48:38<10:27,  2.72it/s][A
epoch 2 iter 7598: train loss 0.32343. lr 6.000000e-05:  82%|████████▏ | 7598/9301 [48:39<10:27,  2.72it/s][A

epoch 2 iter 7630: train loss 0.31559. lr 6.000000e-05:  82%|████████▏ | 7631/9301 [48:50<10:14,  2.72it/s][A
epoch 2 iter 7631: train loss 0.31732. lr 6.000000e-05:  82%|████████▏ | 7631/9301 [48:51<10:14,  2.72it/s][A
epoch 2 iter 7631: train loss 0.31732. lr 6.000000e-05:  82%|████████▏ | 7632/9301 [48:51<10:13,  2.72it/s][A
epoch 2 iter 7632: train loss 0.30633. lr 6.000000e-05:  82%|████████▏ | 7632/9301 [48:51<10:13,  2.72it/s][A
epoch 2 iter 7632: train loss 0.30633. lr 6.000000e-05:  82%|████████▏ | 7633/9301 [48:51<10:15,  2.71it/s][A
epoch 2 iter 7633: train loss 0.32612. lr 6.000000e-05:  82%|████████▏ | 7633/9301 [48:52<10:15,  2.71it/s][A
epoch 2 iter 7633: train loss 0.32612. lr 6.000000e-05:  82%|████████▏ | 7634/9301 [48:52<10:16,  2.71it/s][A
epoch 2 iter 7634: train loss 0.31554. lr 6.000000e-05:  82%|████████▏ | 7634/9301 [48:52<10:16,  2.71it/s][A
epoch 2 iter 7634: train loss 0.31554. lr 6.000000e-05:  82%|████████▏ | 7635/9301 [48:52<10:16,  2.70it/s][A

epoch 2 iter 7667: train loss 0.32252. lr 6.000000e-05:  82%|████████▏ | 7667/9301 [49:04<10:01,  2.71it/s][A
epoch 2 iter 7667: train loss 0.32252. lr 6.000000e-05:  82%|████████▏ | 7668/9301 [49:04<10:01,  2.72it/s][A
epoch 2 iter 7668: train loss 0.30616. lr 6.000000e-05:  82%|████████▏ | 7668/9301 [49:04<10:01,  2.72it/s][A
epoch 2 iter 7668: train loss 0.30616. lr 6.000000e-05:  82%|████████▏ | 7669/9301 [49:04<09:58,  2.73it/s][A
epoch 2 iter 7669: train loss 0.30977. lr 6.000000e-05:  82%|████████▏ | 7669/9301 [49:05<09:58,  2.73it/s][A
epoch 2 iter 7669: train loss 0.30977. lr 6.000000e-05:  82%|████████▏ | 7670/9301 [49:05<09:59,  2.72it/s][A
epoch 2 iter 7670: train loss 0.32834. lr 6.000000e-05:  82%|████████▏ | 7670/9301 [49:05<09:59,  2.72it/s][A
epoch 2 iter 7670: train loss 0.32834. lr 6.000000e-05:  82%|████████▏ | 7671/9301 [49:05<10:00,  2.72it/s][A
epoch 2 iter 7671: train loss 0.30101. lr 6.000000e-05:  82%|████████▏ | 7671/9301 [49:06<10:00,  2.72it/s][A

epoch 2 iter 7703: train loss 0.31973. lr 6.000000e-05:  83%|████████▎ | 7704/9301 [49:17<09:50,  2.71it/s][A
epoch 2 iter 7704: train loss 0.30337. lr 6.000000e-05:  83%|████████▎ | 7704/9301 [49:18<09:50,  2.71it/s][A
epoch 2 iter 7704: train loss 0.30337. lr 6.000000e-05:  83%|████████▎ | 7705/9301 [49:18<09:50,  2.70it/s][A
epoch 2 iter 7705: train loss 0.29572. lr 6.000000e-05:  83%|████████▎ | 7705/9301 [49:18<09:50,  2.70it/s][A
epoch 2 iter 7705: train loss 0.29572. lr 6.000000e-05:  83%|████████▎ | 7706/9301 [49:18<09:51,  2.70it/s][A
epoch 2 iter 7706: train loss 0.30599. lr 6.000000e-05:  83%|████████▎ | 7706/9301 [49:18<09:51,  2.70it/s][A
epoch 2 iter 7706: train loss 0.30599. lr 6.000000e-05:  83%|████████▎ | 7707/9301 [49:18<09:49,  2.70it/s][A
epoch 2 iter 7707: train loss 0.31974. lr 6.000000e-05:  83%|████████▎ | 7707/9301 [49:19<09:49,  2.70it/s][A
epoch 2 iter 7707: train loss 0.31974. lr 6.000000e-05:  83%|████████▎ | 7708/9301 [49:19<09:49,  2.70it/s][A

epoch 2 iter 7740: train loss 0.30105. lr 6.000000e-05:  83%|████████▎ | 7740/9301 [49:31<09:35,  2.71it/s][A
epoch 2 iter 7740: train loss 0.30105. lr 6.000000e-05:  83%|████████▎ | 7741/9301 [49:31<09:35,  2.71it/s][A
epoch 2 iter 7741: train loss 0.31600. lr 6.000000e-05:  83%|████████▎ | 7741/9301 [49:31<09:35,  2.71it/s][A
epoch 2 iter 7741: train loss 0.31600. lr 6.000000e-05:  83%|████████▎ | 7742/9301 [49:31<09:35,  2.71it/s][A
epoch 2 iter 7742: train loss 0.30452. lr 6.000000e-05:  83%|████████▎ | 7742/9301 [49:32<09:35,  2.71it/s][A
epoch 2 iter 7742: train loss 0.30452. lr 6.000000e-05:  83%|████████▎ | 7743/9301 [49:32<09:34,  2.71it/s][A
epoch 2 iter 7743: train loss 0.31805. lr 6.000000e-05:  83%|████████▎ | 7743/9301 [49:32<09:34,  2.71it/s][A
epoch 2 iter 7743: train loss 0.31805. lr 6.000000e-05:  83%|████████▎ | 7744/9301 [49:32<09:31,  2.72it/s][A
epoch 2 iter 7744: train loss 0.29923. lr 6.000000e-05:  83%|████████▎ | 7744/9301 [49:32<09:31,  2.72it/s][A

epoch 2 iter 7776: train loss 0.29685. lr 6.000000e-05:  84%|████████▎ | 7777/9301 [49:44<09:20,  2.72it/s][A
epoch 2 iter 7777: train loss 0.30010. lr 6.000000e-05:  84%|████████▎ | 7777/9301 [49:45<09:20,  2.72it/s][A
epoch 2 iter 7777: train loss 0.30010. lr 6.000000e-05:  84%|████████▎ | 7778/9301 [49:45<09:21,  2.71it/s][A
epoch 2 iter 7778: train loss 0.30132. lr 6.000000e-05:  84%|████████▎ | 7778/9301 [49:45<09:21,  2.71it/s][A
epoch 2 iter 7778: train loss 0.30132. lr 6.000000e-05:  84%|████████▎ | 7779/9301 [49:45<09:22,  2.71it/s][A
epoch 2 iter 7779: train loss 0.32441. lr 6.000000e-05:  84%|████████▎ | 7779/9301 [49:45<09:22,  2.71it/s][A
epoch 2 iter 7779: train loss 0.32441. lr 6.000000e-05:  84%|████████▎ | 7780/9301 [49:45<09:22,  2.70it/s][A
epoch 2 iter 7780: train loss 0.30159. lr 6.000000e-05:  84%|████████▎ | 7780/9301 [49:46<09:22,  2.70it/s][A
epoch 2 iter 7780: train loss 0.30159. lr 6.000000e-05:  84%|████████▎ | 7781/9301 [49:46<09:23,  2.70it/s][A

epoch 2 iter 7813: train loss 0.33829. lr 6.000000e-05:  84%|████████▍ | 7813/9301 [49:58<09:09,  2.71it/s][A
epoch 2 iter 7813: train loss 0.33829. lr 6.000000e-05:  84%|████████▍ | 7814/9301 [49:58<09:07,  2.72it/s][A
epoch 2 iter 7814: train loss 0.31328. lr 6.000000e-05:  84%|████████▍ | 7814/9301 [49:58<09:07,  2.72it/s][A
epoch 2 iter 7814: train loss 0.31328. lr 6.000000e-05:  84%|████████▍ | 7815/9301 [49:58<09:07,  2.71it/s][A
epoch 2 iter 7815: train loss 0.30642. lr 6.000000e-05:  84%|████████▍ | 7815/9301 [49:59<09:07,  2.71it/s][A
epoch 2 iter 7815: train loss 0.30642. lr 6.000000e-05:  84%|████████▍ | 7816/9301 [49:59<09:08,  2.71it/s][A
epoch 2 iter 7816: train loss 0.31329. lr 6.000000e-05:  84%|████████▍ | 7816/9301 [49:59<09:08,  2.71it/s][A
epoch 2 iter 7816: train loss 0.31329. lr 6.000000e-05:  84%|████████▍ | 7817/9301 [49:59<09:07,  2.71it/s][A
epoch 2 iter 7817: train loss 0.31370. lr 6.000000e-05:  84%|████████▍ | 7817/9301 [49:59<09:07,  2.71it/s][A

epoch 2 iter 7849: train loss 0.30099. lr 6.000000e-05:  84%|████████▍ | 7850/9301 [50:11<08:55,  2.71it/s][A
epoch 2 iter 7850: train loss 0.32608. lr 6.000000e-05:  84%|████████▍ | 7850/9301 [50:12<08:55,  2.71it/s][A
epoch 2 iter 7850: train loss 0.32608. lr 6.000000e-05:  84%|████████▍ | 7851/9301 [50:12<08:55,  2.71it/s][A
epoch 2 iter 7851: train loss 0.31472. lr 6.000000e-05:  84%|████████▍ | 7851/9301 [50:12<08:55,  2.71it/s][A
epoch 2 iter 7851: train loss 0.31472. lr 6.000000e-05:  84%|████████▍ | 7852/9301 [50:12<08:54,  2.71it/s][A
epoch 2 iter 7852: train loss 0.30510. lr 6.000000e-05:  84%|████████▍ | 7852/9301 [50:12<08:54,  2.71it/s][A
epoch 2 iter 7852: train loss 0.30510. lr 6.000000e-05:  84%|████████▍ | 7853/9301 [50:12<08:53,  2.71it/s][A
epoch 2 iter 7853: train loss 0.32153. lr 6.000000e-05:  84%|████████▍ | 7853/9301 [50:13<08:53,  2.71it/s][A
epoch 2 iter 7853: train loss 0.32153. lr 6.000000e-05:  84%|████████▍ | 7854/9301 [50:13<08:51,  2.72it/s][A

epoch 2 iter 7886: train loss 0.31432. lr 6.000000e-05:  85%|████████▍ | 7886/9301 [50:25<08:42,  2.71it/s][A
epoch 2 iter 7886: train loss 0.31432. lr 6.000000e-05:  85%|████████▍ | 7887/9301 [50:25<08:41,  2.71it/s][A
epoch 2 iter 7887: train loss 0.31260. lr 6.000000e-05:  85%|████████▍ | 7887/9301 [50:25<08:41,  2.71it/s][A
epoch 2 iter 7887: train loss 0.31260. lr 6.000000e-05:  85%|████████▍ | 7888/9301 [50:25<08:40,  2.71it/s][A
epoch 2 iter 7888: train loss 0.32163. lr 6.000000e-05:  85%|████████▍ | 7888/9301 [50:26<08:40,  2.71it/s][A
epoch 2 iter 7888: train loss 0.32163. lr 6.000000e-05:  85%|████████▍ | 7889/9301 [50:26<08:39,  2.72it/s][A
epoch 2 iter 7889: train loss 0.31288. lr 6.000000e-05:  85%|████████▍ | 7889/9301 [50:26<08:39,  2.72it/s][A
epoch 2 iter 7889: train loss 0.31288. lr 6.000000e-05:  85%|████████▍ | 7890/9301 [50:26<08:39,  2.71it/s][A
epoch 2 iter 7890: train loss 0.30464. lr 6.000000e-05:  85%|████████▍ | 7890/9301 [50:26<08:39,  2.71it/s][A

epoch 2 iter 7922: train loss 0.32299. lr 6.000000e-05:  85%|████████▌ | 7923/9301 [50:38<08:27,  2.71it/s][A
epoch 2 iter 7923: train loss 0.31578. lr 6.000000e-05:  85%|████████▌ | 7923/9301 [50:38<08:27,  2.71it/s][A
epoch 2 iter 7923: train loss 0.31578. lr 6.000000e-05:  85%|████████▌ | 7924/9301 [50:38<08:29,  2.70it/s][A
epoch 2 iter 7924: train loss 0.31250. lr 6.000000e-05:  85%|████████▌ | 7924/9301 [50:39<08:29,  2.70it/s][A
epoch 2 iter 7924: train loss 0.31250. lr 6.000000e-05:  85%|████████▌ | 7925/9301 [50:39<08:30,  2.70it/s][A
epoch 2 iter 7925: train loss 0.29927. lr 6.000000e-05:  85%|████████▌ | 7925/9301 [50:39<08:30,  2.70it/s][A
epoch 2 iter 7925: train loss 0.29927. lr 6.000000e-05:  85%|████████▌ | 7926/9301 [50:39<08:29,  2.70it/s][A
epoch 2 iter 7926: train loss 0.32330. lr 6.000000e-05:  85%|████████▌ | 7926/9301 [50:40<08:29,  2.70it/s][A
epoch 2 iter 7926: train loss 0.32330. lr 6.000000e-05:  85%|████████▌ | 7927/9301 [50:40<08:28,  2.70it/s][A

epoch 2 iter 7959: train loss 0.30824. lr 6.000000e-05:  86%|████████▌ | 7959/9301 [50:52<08:24,  2.66it/s][A
epoch 2 iter 7959: train loss 0.30824. lr 6.000000e-05:  86%|████████▌ | 7960/9301 [50:52<08:21,  2.67it/s][A
epoch 2 iter 7960: train loss 0.31634. lr 6.000000e-05:  86%|████████▌ | 7960/9301 [50:52<08:21,  2.67it/s][A
epoch 2 iter 7960: train loss 0.31634. lr 6.000000e-05:  86%|████████▌ | 7961/9301 [50:52<08:22,  2.67it/s][A
epoch 2 iter 7961: train loss 0.31617. lr 6.000000e-05:  86%|████████▌ | 7961/9301 [50:53<08:22,  2.67it/s][A
epoch 2 iter 7961: train loss 0.31617. lr 6.000000e-05:  86%|████████▌ | 7962/9301 [50:53<08:22,  2.67it/s][A
epoch 2 iter 7962: train loss 0.30364. lr 6.000000e-05:  86%|████████▌ | 7962/9301 [50:53<08:22,  2.67it/s][A
epoch 2 iter 7962: train loss 0.30364. lr 6.000000e-05:  86%|████████▌ | 7963/9301 [50:53<08:22,  2.66it/s][A
epoch 2 iter 7963: train loss 0.29795. lr 6.000000e-05:  86%|████████▌ | 7963/9301 [50:53<08:22,  2.66it/s][A

epoch 2 iter 7995: train loss 0.30146. lr 6.000000e-05:  86%|████████▌ | 7996/9301 [51:05<08:01,  2.71it/s][A
epoch 2 iter 7996: train loss 0.31987. lr 6.000000e-05:  86%|████████▌ | 7996/9301 [51:05<08:01,  2.71it/s][A
epoch 2 iter 7996: train loss 0.31987. lr 6.000000e-05:  86%|████████▌ | 7997/9301 [51:05<07:59,  2.72it/s][A
epoch 2 iter 7997: train loss 0.31352. lr 6.000000e-05:  86%|████████▌ | 7997/9301 [51:06<07:59,  2.72it/s][A
epoch 2 iter 7997: train loss 0.31352. lr 6.000000e-05:  86%|████████▌ | 7998/9301 [51:06<08:00,  2.71it/s][A
epoch 2 iter 7998: train loss 0.30002. lr 6.000000e-05:  86%|████████▌ | 7998/9301 [51:06<08:00,  2.71it/s][A
epoch 2 iter 7998: train loss 0.30002. lr 6.000000e-05:  86%|████████▌ | 7999/9301 [51:06<08:00,  2.71it/s][A
epoch 2 iter 7999: train loss 0.28584. lr 6.000000e-05:  86%|████████▌ | 7999/9301 [51:07<08:00,  2.71it/s][A
epoch 2 iter 7999: train loss 0.28584. lr 6.000000e-05:  86%|████████▌ | 8000/9301 [51:07<07:59,  2.71it/s][A

epoch 2 iter 8032: train loss 0.31421. lr 6.000000e-05:  86%|████████▋ | 8032/9301 [51:19<07:47,  2.72it/s][A
epoch 2 iter 8032: train loss 0.31421. lr 6.000000e-05:  86%|████████▋ | 8033/9301 [51:19<07:47,  2.71it/s][A
epoch 2 iter 8033: train loss 0.29595. lr 6.000000e-05:  86%|████████▋ | 8033/9301 [51:19<07:47,  2.71it/s][A
epoch 2 iter 8033: train loss 0.29595. lr 6.000000e-05:  86%|████████▋ | 8034/9301 [51:19<07:47,  2.71it/s][A
epoch 2 iter 8034: train loss 0.31016. lr 6.000000e-05:  86%|████████▋ | 8034/9301 [51:20<07:47,  2.71it/s][A
epoch 2 iter 8034: train loss 0.31016. lr 6.000000e-05:  86%|████████▋ | 8035/9301 [51:20<07:46,  2.71it/s][A
epoch 2 iter 8035: train loss 0.29305. lr 6.000000e-05:  86%|████████▋ | 8035/9301 [51:20<07:46,  2.71it/s][A
epoch 2 iter 8035: train loss 0.29305. lr 6.000000e-05:  86%|████████▋ | 8036/9301 [51:20<07:46,  2.71it/s][A
epoch 2 iter 8036: train loss 0.32094. lr 6.000000e-05:  86%|████████▋ | 8036/9301 [51:20<07:46,  2.71it/s][A

epoch 2 iter 8068: train loss 0.32799. lr 6.000000e-05:  87%|████████▋ | 8069/9301 [51:32<07:36,  2.70it/s][A
epoch 2 iter 8069: train loss 0.31883. lr 6.000000e-05:  87%|████████▋ | 8069/9301 [51:32<07:36,  2.70it/s][A
epoch 2 iter 8069: train loss 0.31883. lr 6.000000e-05:  87%|████████▋ | 8070/9301 [51:32<07:34,  2.71it/s][A
epoch 2 iter 8070: train loss 0.30967. lr 6.000000e-05:  87%|████████▋ | 8070/9301 [51:33<07:34,  2.71it/s][A
epoch 2 iter 8070: train loss 0.30967. lr 6.000000e-05:  87%|████████▋ | 8071/9301 [51:33<07:34,  2.71it/s][A
epoch 2 iter 8071: train loss 0.30543. lr 6.000000e-05:  87%|████████▋ | 8071/9301 [51:33<07:34,  2.71it/s][A
epoch 2 iter 8071: train loss 0.30543. lr 6.000000e-05:  87%|████████▋ | 8072/9301 [51:33<07:34,  2.70it/s][A
epoch 2 iter 8072: train loss 0.30632. lr 6.000000e-05:  87%|████████▋ | 8072/9301 [51:34<07:34,  2.70it/s][A
epoch 2 iter 8072: train loss 0.30632. lr 6.000000e-05:  87%|████████▋ | 8073/9301 [51:34<07:35,  2.70it/s][A

epoch 2 iter 8105: train loss 0.31432. lr 6.000000e-05:  87%|████████▋ | 8105/9301 [51:46<07:19,  2.72it/s][A
epoch 2 iter 8105: train loss 0.31432. lr 6.000000e-05:  87%|████████▋ | 8106/9301 [51:46<07:19,  2.72it/s][A
epoch 2 iter 8106: train loss 0.31068. lr 6.000000e-05:  87%|████████▋ | 8106/9301 [51:46<07:19,  2.72it/s][A
epoch 2 iter 8106: train loss 0.31068. lr 6.000000e-05:  87%|████████▋ | 8107/9301 [51:46<07:20,  2.71it/s][A
epoch 2 iter 8107: train loss 0.30105. lr 6.000000e-05:  87%|████████▋ | 8107/9301 [51:46<07:20,  2.71it/s][A
epoch 2 iter 8107: train loss 0.30105. lr 6.000000e-05:  87%|████████▋ | 8108/9301 [51:46<07:20,  2.71it/s][A
epoch 2 iter 8108: train loss 0.31012. lr 6.000000e-05:  87%|████████▋ | 8108/9301 [51:47<07:20,  2.71it/s][A
epoch 2 iter 8108: train loss 0.31012. lr 6.000000e-05:  87%|████████▋ | 8109/9301 [51:47<07:20,  2.71it/s][A
epoch 2 iter 8109: train loss 0.32386. lr 6.000000e-05:  87%|████████▋ | 8109/9301 [51:47<07:20,  2.71it/s][A

epoch 2 iter 8141: train loss 0.30566. lr 6.000000e-05:  88%|████████▊ | 8142/9301 [51:59<07:05,  2.72it/s][A
epoch 2 iter 8142: train loss 0.30515. lr 6.000000e-05:  88%|████████▊ | 8142/9301 [51:59<07:05,  2.72it/s][A
epoch 2 iter 8142: train loss 0.30515. lr 6.000000e-05:  88%|████████▊ | 8143/9301 [51:59<07:06,  2.72it/s][A
epoch 2 iter 8143: train loss 0.31770. lr 6.000000e-05:  88%|████████▊ | 8143/9301 [52:00<07:06,  2.72it/s][A
epoch 2 iter 8143: train loss 0.31770. lr 6.000000e-05:  88%|████████▊ | 8144/9301 [52:00<07:06,  2.71it/s][A
epoch 2 iter 8144: train loss 0.30219. lr 6.000000e-05:  88%|████████▊ | 8144/9301 [52:00<07:06,  2.71it/s][A
epoch 2 iter 8144: train loss 0.30219. lr 6.000000e-05:  88%|████████▊ | 8145/9301 [52:00<07:06,  2.71it/s][A
epoch 2 iter 8145: train loss 0.31604. lr 6.000000e-05:  88%|████████▊ | 8145/9301 [52:00<07:06,  2.71it/s][A
epoch 2 iter 8145: train loss 0.31604. lr 6.000000e-05:  88%|████████▊ | 8146/9301 [52:00<07:05,  2.71it/s][A

epoch 2 iter 8178: train loss 0.32829. lr 6.000000e-05:  88%|████████▊ | 8178/9301 [52:13<06:53,  2.72it/s][A
epoch 2 iter 8178: train loss 0.32829. lr 6.000000e-05:  88%|████████▊ | 8179/9301 [52:13<06:53,  2.71it/s][A
epoch 2 iter 8179: train loss 0.30625. lr 6.000000e-05:  88%|████████▊ | 8179/9301 [52:13<06:53,  2.71it/s][A
epoch 2 iter 8179: train loss 0.30625. lr 6.000000e-05:  88%|████████▊ | 8180/9301 [52:13<06:52,  2.71it/s][A
epoch 2 iter 8180: train loss 0.31874. lr 6.000000e-05:  88%|████████▊ | 8180/9301 [52:13<06:52,  2.71it/s][A
epoch 2 iter 8180: train loss 0.31874. lr 6.000000e-05:  88%|████████▊ | 8181/9301 [52:13<06:52,  2.71it/s][A
epoch 2 iter 8181: train loss 0.29321. lr 6.000000e-05:  88%|████████▊ | 8181/9301 [52:14<06:52,  2.71it/s][A
epoch 2 iter 8181: train loss 0.29321. lr 6.000000e-05:  88%|████████▊ | 8182/9301 [52:14<06:50,  2.72it/s][A
epoch 2 iter 8182: train loss 0.31815. lr 6.000000e-05:  88%|████████▊ | 8182/9301 [52:14<06:50,  2.72it/s][A

epoch 2 iter 8214: train loss 0.31011. lr 6.000000e-05:  88%|████████▊ | 8215/9301 [52:26<06:40,  2.71it/s][A
epoch 2 iter 8215: train loss 0.30948. lr 6.000000e-05:  88%|████████▊ | 8215/9301 [52:26<06:40,  2.71it/s][A
epoch 2 iter 8215: train loss 0.30948. lr 6.000000e-05:  88%|████████▊ | 8216/9301 [52:26<06:39,  2.71it/s][A
epoch 2 iter 8216: train loss 0.29958. lr 6.000000e-05:  88%|████████▊ | 8216/9301 [52:27<06:39,  2.71it/s][A
epoch 2 iter 8216: train loss 0.29958. lr 6.000000e-05:  88%|████████▊ | 8217/9301 [52:27<06:38,  2.72it/s][A
epoch 2 iter 8217: train loss 0.31553. lr 6.000000e-05:  88%|████████▊ | 8217/9301 [52:27<06:38,  2.72it/s][A
epoch 2 iter 8217: train loss 0.31553. lr 6.000000e-05:  88%|████████▊ | 8218/9301 [52:27<06:39,  2.71it/s][A
epoch 2 iter 8218: train loss 0.31400. lr 6.000000e-05:  88%|████████▊ | 8218/9301 [52:27<06:39,  2.71it/s][A
epoch 2 iter 8218: train loss 0.31400. lr 6.000000e-05:  88%|████████▊ | 8219/9301 [52:27<06:39,  2.71it/s][A

epoch 2 iter 8251: train loss 0.31244. lr 6.000000e-05:  89%|████████▊ | 8251/9301 [52:39<06:27,  2.71it/s][A
epoch 2 iter 8251: train loss 0.31244. lr 6.000000e-05:  89%|████████▊ | 8252/9301 [52:39<06:26,  2.72it/s][A
epoch 2 iter 8252: train loss 0.31075. lr 6.000000e-05:  89%|████████▊ | 8252/9301 [52:40<06:26,  2.72it/s][A
epoch 2 iter 8252: train loss 0.31075. lr 6.000000e-05:  89%|████████▊ | 8253/9301 [52:40<06:26,  2.71it/s][A
epoch 2 iter 8253: train loss 0.29815. lr 6.000000e-05:  89%|████████▊ | 8253/9301 [52:40<06:26,  2.71it/s][A
epoch 2 iter 8253: train loss 0.29815. lr 6.000000e-05:  89%|████████▊ | 8254/9301 [52:40<06:26,  2.71it/s][A
epoch 2 iter 8254: train loss 0.31735. lr 6.000000e-05:  89%|████████▊ | 8254/9301 [52:41<06:26,  2.71it/s][A
epoch 2 iter 8254: train loss 0.31735. lr 6.000000e-05:  89%|████████▉ | 8255/9301 [52:41<06:25,  2.71it/s][A
epoch 2 iter 8255: train loss 0.31678. lr 6.000000e-05:  89%|████████▉ | 8255/9301 [52:41<06:25,  2.71it/s][A

epoch 2 iter 8287: train loss 0.29085. lr 6.000000e-05:  89%|████████▉ | 8288/9301 [52:53<06:13,  2.71it/s][A
epoch 2 iter 8288: train loss 0.32397. lr 6.000000e-05:  89%|████████▉ | 8288/9301 [52:53<06:13,  2.71it/s][A
epoch 2 iter 8288: train loss 0.32397. lr 6.000000e-05:  89%|████████▉ | 8289/9301 [52:53<06:13,  2.71it/s][A
epoch 2 iter 8289: train loss 0.30935. lr 6.000000e-05:  89%|████████▉ | 8289/9301 [52:54<06:13,  2.71it/s][A
epoch 2 iter 8289: train loss 0.30935. lr 6.000000e-05:  89%|████████▉ | 8290/9301 [52:54<06:12,  2.71it/s][A
epoch 2 iter 8290: train loss 0.31400. lr 6.000000e-05:  89%|████████▉ | 8290/9301 [52:54<06:12,  2.71it/s][A
epoch 2 iter 8290: train loss 0.31400. lr 6.000000e-05:  89%|████████▉ | 8291/9301 [52:54<06:12,  2.71it/s][A
epoch 2 iter 8291: train loss 0.31330. lr 6.000000e-05:  89%|████████▉ | 8291/9301 [52:54<06:12,  2.71it/s][A
epoch 2 iter 8291: train loss 0.31330. lr 6.000000e-05:  89%|████████▉ | 8292/9301 [52:54<06:10,  2.72it/s][A

epoch 2 iter 8324: train loss 0.29968. lr 6.000000e-05:  89%|████████▉ | 8324/9301 [53:06<06:00,  2.71it/s][A
epoch 2 iter 8324: train loss 0.29968. lr 6.000000e-05:  90%|████████▉ | 8325/9301 [53:06<05:59,  2.71it/s][A
epoch 2 iter 8325: train loss 0.31121. lr 6.000000e-05:  90%|████████▉ | 8325/9301 [53:07<05:59,  2.71it/s][A
epoch 2 iter 8325: train loss 0.31121. lr 6.000000e-05:  90%|████████▉ | 8326/9301 [53:07<05:59,  2.71it/s][A
epoch 2 iter 8326: train loss 0.32301. lr 6.000000e-05:  90%|████████▉ | 8326/9301 [53:07<05:59,  2.71it/s][A
epoch 2 iter 8326: train loss 0.32301. lr 6.000000e-05:  90%|████████▉ | 8327/9301 [53:07<05:57,  2.72it/s][A
epoch 2 iter 8327: train loss 0.28620. lr 6.000000e-05:  90%|████████▉ | 8327/9301 [53:08<05:57,  2.72it/s][A
epoch 2 iter 8327: train loss 0.28620. lr 6.000000e-05:  90%|████████▉ | 8328/9301 [53:08<05:58,  2.72it/s][A
epoch 2 iter 8328: train loss 0.29991. lr 6.000000e-05:  90%|████████▉ | 8328/9301 [53:08<05:58,  2.72it/s][A

epoch 2 iter 8360: train loss 0.32694. lr 6.000000e-05:  90%|████████▉ | 8361/9301 [53:20<05:46,  2.72it/s][A
epoch 2 iter 8361: train loss 0.28597. lr 6.000000e-05:  90%|████████▉ | 8361/9301 [53:20<05:46,  2.72it/s][A
epoch 2 iter 8361: train loss 0.28597. lr 6.000000e-05:  90%|████████▉ | 8362/9301 [53:20<05:44,  2.72it/s][A
epoch 2 iter 8362: train loss 0.30546. lr 6.000000e-05:  90%|████████▉ | 8362/9301 [53:20<05:44,  2.72it/s][A
epoch 2 iter 8362: train loss 0.30546. lr 6.000000e-05:  90%|████████▉ | 8363/9301 [53:20<05:45,  2.72it/s][A
epoch 2 iter 8363: train loss 0.31198. lr 6.000000e-05:  90%|████████▉ | 8363/9301 [53:21<05:45,  2.72it/s][A
epoch 2 iter 8363: train loss 0.31198. lr 6.000000e-05:  90%|████████▉ | 8364/9301 [53:21<05:45,  2.71it/s][A
epoch 2 iter 8364: train loss 0.32612. lr 6.000000e-05:  90%|████████▉ | 8364/9301 [53:21<05:45,  2.71it/s][A
epoch 2 iter 8364: train loss 0.32612. lr 6.000000e-05:  90%|████████▉ | 8365/9301 [53:21<05:44,  2.71it/s][A

epoch 2 iter 8397: train loss 0.33832. lr 6.000000e-05:  90%|█████████ | 8397/9301 [53:33<05:31,  2.73it/s][A
epoch 2 iter 8397: train loss 0.33832. lr 6.000000e-05:  90%|█████████ | 8398/9301 [53:33<05:31,  2.72it/s][A
epoch 2 iter 8398: train loss 0.31983. lr 6.000000e-05:  90%|█████████ | 8398/9301 [53:34<05:31,  2.72it/s][A
epoch 2 iter 8398: train loss 0.31983. lr 6.000000e-05:  90%|█████████ | 8399/9301 [53:34<05:31,  2.72it/s][A
epoch 2 iter 8399: train loss 0.29769. lr 6.000000e-05:  90%|█████████ | 8399/9301 [53:34<05:31,  2.72it/s][A
epoch 2 iter 8399: train loss 0.29769. lr 6.000000e-05:  90%|█████████ | 8400/9301 [53:34<05:31,  2.72it/s][A
epoch 2 iter 8400: train loss 0.29835. lr 6.000000e-05:  90%|█████████ | 8400/9301 [53:34<05:31,  2.72it/s][A
epoch 2 iter 8400: train loss 0.29835. lr 6.000000e-05:  90%|█████████ | 8401/9301 [53:34<05:31,  2.71it/s][A
epoch 2 iter 8401: train loss 0.31447. lr 6.000000e-05:  90%|█████████ | 8401/9301 [53:35<05:31,  2.71it/s][A

epoch 2 iter 8433: train loss 0.29976. lr 6.000000e-05:  91%|█████████ | 8434/9301 [53:47<05:19,  2.71it/s][A
epoch 2 iter 8434: train loss 0.32962. lr 6.000000e-05:  91%|█████████ | 8434/9301 [53:47<05:19,  2.71it/s][A
epoch 2 iter 8434: train loss 0.32962. lr 6.000000e-05:  91%|█████████ | 8435/9301 [53:47<05:19,  2.71it/s][A
epoch 2 iter 8435: train loss 0.29689. lr 6.000000e-05:  91%|█████████ | 8435/9301 [53:47<05:19,  2.71it/s][A
epoch 2 iter 8435: train loss 0.29689. lr 6.000000e-05:  91%|█████████ | 8436/9301 [53:47<05:19,  2.71it/s][A
epoch 2 iter 8436: train loss 0.31594. lr 6.000000e-05:  91%|█████████ | 8436/9301 [53:48<05:19,  2.71it/s][A
epoch 2 iter 8436: train loss 0.31594. lr 6.000000e-05:  91%|█████████ | 8437/9301 [53:48<05:18,  2.72it/s][A
epoch 2 iter 8437: train loss 0.31940. lr 6.000000e-05:  91%|█████████ | 8437/9301 [53:48<05:18,  2.72it/s][A
epoch 2 iter 8437: train loss 0.31940. lr 6.000000e-05:  91%|█████████ | 8438/9301 [53:48<05:18,  2.71it/s][A

epoch 2 iter 8470: train loss 0.28478. lr 6.000000e-05:  91%|█████████ | 8470/9301 [54:00<05:07,  2.70it/s][A
epoch 2 iter 8470: train loss 0.28478. lr 6.000000e-05:  91%|█████████ | 8471/9301 [54:00<05:06,  2.71it/s][A
epoch 2 iter 8471: train loss 0.29636. lr 6.000000e-05:  91%|█████████ | 8471/9301 [54:01<05:06,  2.71it/s][A
epoch 2 iter 8471: train loss 0.29636. lr 6.000000e-05:  91%|█████████ | 8472/9301 [54:01<05:05,  2.71it/s][A
epoch 2 iter 8472: train loss 0.32608. lr 6.000000e-05:  91%|█████████ | 8472/9301 [54:01<05:05,  2.71it/s][A
epoch 2 iter 8472: train loss 0.32608. lr 6.000000e-05:  91%|█████████ | 8473/9301 [54:01<05:04,  2.72it/s][A
epoch 2 iter 8473: train loss 0.32652. lr 6.000000e-05:  91%|█████████ | 8473/9301 [54:01<05:04,  2.72it/s][A
epoch 2 iter 8473: train loss 0.32652. lr 6.000000e-05:  91%|█████████ | 8474/9301 [54:01<05:04,  2.71it/s][A
epoch 2 iter 8474: train loss 0.30340. lr 6.000000e-05:  91%|█████████ | 8474/9301 [54:02<05:04,  2.71it/s][A

epoch 2 iter 8506: train loss 0.30079. lr 6.000000e-05:  91%|█████████▏| 8507/9301 [54:14<04:52,  2.71it/s][A
epoch 2 iter 8507: train loss 0.29300. lr 6.000000e-05:  91%|█████████▏| 8507/9301 [54:14<04:52,  2.71it/s][A
epoch 2 iter 8507: train loss 0.29300. lr 6.000000e-05:  91%|█████████▏| 8508/9301 [54:14<04:51,  2.72it/s][A
epoch 2 iter 8508: train loss 0.31424. lr 6.000000e-05:  91%|█████████▏| 8508/9301 [54:14<04:51,  2.72it/s][A
epoch 2 iter 8508: train loss 0.31424. lr 6.000000e-05:  91%|█████████▏| 8509/9301 [54:14<04:51,  2.72it/s][A
epoch 2 iter 8509: train loss 0.29859. lr 6.000000e-05:  91%|█████████▏| 8509/9301 [54:15<04:51,  2.72it/s][A
epoch 2 iter 8509: train loss 0.29859. lr 6.000000e-05:  91%|█████████▏| 8510/9301 [54:15<04:51,  2.71it/s][A
epoch 2 iter 8510: train loss 0.31581. lr 6.000000e-05:  91%|█████████▏| 8510/9301 [54:15<04:51,  2.71it/s][A
epoch 2 iter 8510: train loss 0.31581. lr 6.000000e-05:  92%|█████████▏| 8511/9301 [54:15<04:51,  2.71it/s][A

epoch 2 iter 8543: train loss 0.33130. lr 6.000000e-05:  92%|█████████▏| 8543/9301 [54:27<04:38,  2.72it/s][A
epoch 2 iter 8543: train loss 0.33130. lr 6.000000e-05:  92%|█████████▏| 8544/9301 [54:27<04:39,  2.71it/s][A
epoch 2 iter 8544: train loss 0.30405. lr 6.000000e-05:  92%|█████████▏| 8544/9301 [54:28<04:39,  2.71it/s][A
epoch 2 iter 8544: train loss 0.30405. lr 6.000000e-05:  92%|█████████▏| 8545/9301 [54:28<04:39,  2.71it/s][A
epoch 2 iter 8545: train loss 0.30622. lr 6.000000e-05:  92%|█████████▏| 8545/9301 [54:28<04:39,  2.71it/s][A
epoch 2 iter 8545: train loss 0.30622. lr 6.000000e-05:  92%|█████████▏| 8546/9301 [54:28<04:38,  2.71it/s][A
epoch 2 iter 8546: train loss 0.30829. lr 6.000000e-05:  92%|█████████▏| 8546/9301 [54:28<04:38,  2.71it/s][A
epoch 2 iter 8546: train loss 0.30829. lr 6.000000e-05:  92%|█████████▏| 8547/9301 [54:28<04:37,  2.71it/s][A
epoch 2 iter 8547: train loss 0.30152. lr 6.000000e-05:  92%|█████████▏| 8547/9301 [54:29<04:37,  2.71it/s][A

epoch 2 iter 8579: train loss 0.29376. lr 6.000000e-05:  92%|█████████▏| 8580/9301 [54:40<04:25,  2.71it/s][A
epoch 2 iter 8580: train loss 0.31288. lr 6.000000e-05:  92%|█████████▏| 8580/9301 [54:41<04:25,  2.71it/s][A
epoch 2 iter 8580: train loss 0.31288. lr 6.000000e-05:  92%|█████████▏| 8581/9301 [54:41<04:25,  2.71it/s][A
epoch 2 iter 8581: train loss 0.30614. lr 6.000000e-05:  92%|█████████▏| 8581/9301 [54:41<04:25,  2.71it/s][A
epoch 2 iter 8581: train loss 0.30614. lr 6.000000e-05:  92%|█████████▏| 8582/9301 [54:41<04:24,  2.72it/s][A
epoch 2 iter 8582: train loss 0.31081. lr 6.000000e-05:  92%|█████████▏| 8582/9301 [54:42<04:24,  2.72it/s][A
epoch 2 iter 8582: train loss 0.31081. lr 6.000000e-05:  92%|█████████▏| 8583/9301 [54:42<04:23,  2.73it/s][A
epoch 2 iter 8583: train loss 0.29861. lr 6.000000e-05:  92%|█████████▏| 8583/9301 [54:42<04:23,  2.73it/s][A
epoch 2 iter 8583: train loss 0.29861. lr 6.000000e-05:  92%|█████████▏| 8584/9301 [54:42<04:23,  2.72it/s][A

epoch 2 iter 8616: train loss 0.29709. lr 6.000000e-05:  93%|█████████▎| 8616/9301 [54:54<04:11,  2.73it/s][A
epoch 2 iter 8616: train loss 0.29709. lr 6.000000e-05:  93%|█████████▎| 8617/9301 [54:54<04:09,  2.74it/s][A
epoch 2 iter 8617: train loss 0.33419. lr 6.000000e-05:  93%|█████████▎| 8617/9301 [54:54<04:09,  2.74it/s][A
epoch 2 iter 8617: train loss 0.33419. lr 6.000000e-05:  93%|█████████▎| 8618/9301 [54:54<04:10,  2.73it/s][A
epoch 2 iter 8618: train loss 0.30605. lr 6.000000e-05:  93%|█████████▎| 8618/9301 [54:55<04:10,  2.73it/s][A
epoch 2 iter 8618: train loss 0.30605. lr 6.000000e-05:  93%|█████████▎| 8619/9301 [54:55<04:10,  2.72it/s][A
epoch 2 iter 8619: train loss 0.30600. lr 6.000000e-05:  93%|█████████▎| 8619/9301 [54:55<04:10,  2.72it/s][A
epoch 2 iter 8619: train loss 0.30600. lr 6.000000e-05:  93%|█████████▎| 8620/9301 [54:55<04:10,  2.72it/s][A
epoch 2 iter 8620: train loss 0.30756. lr 6.000000e-05:  93%|█████████▎| 8620/9301 [54:56<04:10,  2.72it/s][A

epoch 2 iter 8652: train loss 0.29693. lr 6.000000e-05:  93%|█████████▎| 8653/9301 [55:07<03:58,  2.72it/s][A
epoch 2 iter 8653: train loss 0.29434. lr 6.000000e-05:  93%|█████████▎| 8653/9301 [55:08<03:58,  2.72it/s][A
epoch 2 iter 8653: train loss 0.29434. lr 6.000000e-05:  93%|█████████▎| 8654/9301 [55:08<03:58,  2.71it/s][A
epoch 2 iter 8654: train loss 0.31479. lr 6.000000e-05:  93%|█████████▎| 8654/9301 [55:08<03:58,  2.71it/s][A
epoch 2 iter 8654: train loss 0.31479. lr 6.000000e-05:  93%|█████████▎| 8655/9301 [55:08<03:57,  2.72it/s][A
epoch 2 iter 8655: train loss 0.32220. lr 6.000000e-05:  93%|█████████▎| 8655/9301 [55:08<03:57,  2.72it/s][A
epoch 2 iter 8655: train loss 0.32220. lr 6.000000e-05:  93%|█████████▎| 8656/9301 [55:08<03:55,  2.73it/s][A
epoch 2 iter 8656: train loss 0.30782. lr 6.000000e-05:  93%|█████████▎| 8656/9301 [55:09<03:55,  2.73it/s][A
epoch 2 iter 8656: train loss 0.30782. lr 6.000000e-05:  93%|█████████▎| 8657/9301 [55:09<03:57,  2.71it/s][A

epoch 2 iter 8689: train loss 0.33506. lr 6.000000e-05:  93%|█████████▎| 8689/9301 [55:21<03:46,  2.70it/s][A
epoch 2 iter 8689: train loss 0.33506. lr 6.000000e-05:  93%|█████████▎| 8690/9301 [55:21<03:45,  2.71it/s][A
epoch 2 iter 8690: train loss 0.31033. lr 6.000000e-05:  93%|█████████▎| 8690/9301 [55:21<03:45,  2.71it/s][A
epoch 2 iter 8690: train loss 0.31033. lr 6.000000e-05:  93%|█████████▎| 8691/9301 [55:21<03:45,  2.71it/s][A
epoch 2 iter 8691: train loss 0.31249. lr 6.000000e-05:  93%|█████████▎| 8691/9301 [55:22<03:45,  2.71it/s][A
epoch 2 iter 8691: train loss 0.31249. lr 6.000000e-05:  93%|█████████▎| 8692/9301 [55:22<03:44,  2.71it/s][A
epoch 2 iter 8692: train loss 0.29700. lr 6.000000e-05:  93%|█████████▎| 8692/9301 [55:22<03:44,  2.71it/s][A
epoch 2 iter 8692: train loss 0.29700. lr 6.000000e-05:  93%|█████████▎| 8693/9301 [55:22<03:45,  2.70it/s][A
epoch 2 iter 8693: train loss 0.29763. lr 6.000000e-05:  93%|█████████▎| 8693/9301 [55:22<03:45,  2.70it/s][A

epoch 2 iter 8725: train loss 0.30975. lr 6.000000e-05:  94%|█████████▍| 8726/9301 [55:34<03:31,  2.71it/s][A
epoch 2 iter 8726: train loss 0.29639. lr 6.000000e-05:  94%|█████████▍| 8726/9301 [55:35<03:31,  2.71it/s][A
epoch 2 iter 8726: train loss 0.29639. lr 6.000000e-05:  94%|█████████▍| 8727/9301 [55:35<03:30,  2.73it/s][A
epoch 2 iter 8727: train loss 0.29236. lr 6.000000e-05:  94%|█████████▍| 8727/9301 [55:35<03:30,  2.73it/s][A
epoch 2 iter 8727: train loss 0.29236. lr 6.000000e-05:  94%|█████████▍| 8728/9301 [55:35<03:30,  2.72it/s][A
epoch 2 iter 8728: train loss 0.31597. lr 6.000000e-05:  94%|█████████▍| 8728/9301 [55:35<03:30,  2.72it/s][A
epoch 2 iter 8728: train loss 0.31597. lr 6.000000e-05:  94%|█████████▍| 8729/9301 [55:35<03:30,  2.72it/s][A
epoch 2 iter 8729: train loss 0.31918. lr 6.000000e-05:  94%|█████████▍| 8729/9301 [55:36<03:30,  2.72it/s][A
epoch 2 iter 8729: train loss 0.31918. lr 6.000000e-05:  94%|█████████▍| 8730/9301 [55:36<03:30,  2.72it/s][A

epoch 2 iter 8762: train loss 0.29508. lr 6.000000e-05:  94%|█████████▍| 8762/9301 [55:48<03:19,  2.70it/s][A
epoch 2 iter 8762: train loss 0.29508. lr 6.000000e-05:  94%|█████████▍| 8763/9301 [55:48<03:19,  2.70it/s][A
epoch 2 iter 8763: train loss 0.30674. lr 6.000000e-05:  94%|█████████▍| 8763/9301 [55:48<03:19,  2.70it/s][A
epoch 2 iter 8763: train loss 0.30674. lr 6.000000e-05:  94%|█████████▍| 8764/9301 [55:48<03:18,  2.71it/s][A
epoch 2 iter 8764: train loss 0.29433. lr 6.000000e-05:  94%|█████████▍| 8764/9301 [55:49<03:18,  2.71it/s][A
epoch 2 iter 8764: train loss 0.29433. lr 6.000000e-05:  94%|█████████▍| 8765/9301 [55:49<03:17,  2.72it/s][A
epoch 2 iter 8765: train loss 0.31374. lr 6.000000e-05:  94%|█████████▍| 8765/9301 [55:49<03:17,  2.72it/s][A
epoch 2 iter 8765: train loss 0.31374. lr 6.000000e-05:  94%|█████████▍| 8766/9301 [55:49<03:18,  2.70it/s][A
epoch 2 iter 8766: train loss 0.32804. lr 6.000000e-05:  94%|█████████▍| 8766/9301 [55:49<03:18,  2.70it/s][A

epoch 2 iter 8798: train loss 0.30974. lr 6.000000e-05:  95%|█████████▍| 8799/9301 [56:01<03:05,  2.71it/s][A
epoch 2 iter 8799: train loss 0.29124. lr 6.000000e-05:  95%|█████████▍| 8799/9301 [56:02<03:05,  2.71it/s][A
epoch 2 iter 8799: train loss 0.29124. lr 6.000000e-05:  95%|█████████▍| 8800/9301 [56:02<03:04,  2.71it/s][A
epoch 2 iter 8800: train loss 0.31746. lr 6.000000e-05:  95%|█████████▍| 8800/9301 [56:02<03:04,  2.71it/s][A
epoch 2 iter 8800: train loss 0.31746. lr 6.000000e-05:  95%|█████████▍| 8801/9301 [56:02<03:03,  2.72it/s][A
epoch 2 iter 8801: train loss 0.31595. lr 6.000000e-05:  95%|█████████▍| 8801/9301 [56:02<03:03,  2.72it/s][A
epoch 2 iter 8801: train loss 0.31595. lr 6.000000e-05:  95%|█████████▍| 8802/9301 [56:02<03:03,  2.72it/s][A
epoch 2 iter 8802: train loss 0.30652. lr 6.000000e-05:  95%|█████████▍| 8802/9301 [56:03<03:03,  2.72it/s][A
epoch 2 iter 8802: train loss 0.30652. lr 6.000000e-05:  95%|█████████▍| 8803/9301 [56:03<03:03,  2.71it/s][A

epoch 2 iter 8835: train loss 0.32255. lr 6.000000e-05:  95%|█████████▍| 8835/9301 [56:15<02:51,  2.71it/s][A
epoch 2 iter 8835: train loss 0.32255. lr 6.000000e-05:  95%|█████████▌| 8836/9301 [56:15<02:50,  2.72it/s][A
epoch 2 iter 8836: train loss 0.31256. lr 6.000000e-05:  95%|█████████▌| 8836/9301 [56:15<02:50,  2.72it/s][A
epoch 2 iter 8836: train loss 0.31256. lr 6.000000e-05:  95%|█████████▌| 8837/9301 [56:15<02:50,  2.72it/s][A
epoch 2 iter 8837: train loss 0.32038. lr 6.000000e-05:  95%|█████████▌| 8837/9301 [56:16<02:50,  2.72it/s][A
epoch 2 iter 8837: train loss 0.32038. lr 6.000000e-05:  95%|█████████▌| 8838/9301 [56:16<02:50,  2.71it/s][A
epoch 2 iter 8838: train loss 0.29549. lr 6.000000e-05:  95%|█████████▌| 8838/9301 [56:16<02:50,  2.71it/s][A
epoch 2 iter 8838: train loss 0.29549. lr 6.000000e-05:  95%|█████████▌| 8839/9301 [56:16<02:50,  2.71it/s][A
epoch 2 iter 8839: train loss 0.30701. lr 6.000000e-05:  95%|█████████▌| 8839/9301 [56:16<02:50,  2.71it/s][A

epoch 2 iter 8871: train loss 0.30723. lr 6.000000e-05:  95%|█████████▌| 8872/9301 [56:28<02:38,  2.70it/s][A
epoch 2 iter 8872: train loss 0.29809. lr 6.000000e-05:  95%|█████████▌| 8872/9301 [56:29<02:38,  2.70it/s][A
epoch 2 iter 8872: train loss 0.29809. lr 6.000000e-05:  95%|█████████▌| 8873/9301 [56:29<02:38,  2.70it/s][A
epoch 2 iter 8873: train loss 0.29682. lr 6.000000e-05:  95%|█████████▌| 8873/9301 [56:29<02:38,  2.70it/s][A
epoch 2 iter 8873: train loss 0.29682. lr 6.000000e-05:  95%|█████████▌| 8874/9301 [56:29<02:37,  2.70it/s][A
epoch 2 iter 8874: train loss 0.31031. lr 6.000000e-05:  95%|█████████▌| 8874/9301 [56:29<02:37,  2.70it/s][A
epoch 2 iter 8874: train loss 0.31031. lr 6.000000e-05:  95%|█████████▌| 8875/9301 [56:29<02:37,  2.71it/s][A
epoch 2 iter 8875: train loss 0.30395. lr 6.000000e-05:  95%|█████████▌| 8875/9301 [56:30<02:37,  2.71it/s][A
epoch 2 iter 8875: train loss 0.30395. lr 6.000000e-05:  95%|█████████▌| 8876/9301 [56:30<02:36,  2.71it/s][A

epoch 2 iter 8908: train loss 0.30665. lr 6.000000e-05:  96%|█████████▌| 8908/9301 [56:42<02:25,  2.70it/s][A
epoch 2 iter 8908: train loss 0.30665. lr 6.000000e-05:  96%|█████████▌| 8909/9301 [56:42<02:25,  2.70it/s][A
epoch 2 iter 8909: train loss 0.30028. lr 6.000000e-05:  96%|█████████▌| 8909/9301 [56:42<02:25,  2.70it/s][A
epoch 2 iter 8909: train loss 0.30028. lr 6.000000e-05:  96%|█████████▌| 8910/9301 [56:42<02:24,  2.70it/s][A
epoch 2 iter 8910: train loss 0.33978. lr 6.000000e-05:  96%|█████████▌| 8910/9301 [56:43<02:24,  2.70it/s][A
epoch 2 iter 8910: train loss 0.33978. lr 6.000000e-05:  96%|█████████▌| 8911/9301 [56:43<02:24,  2.71it/s][A
epoch 2 iter 8911: train loss 0.30913. lr 6.000000e-05:  96%|█████████▌| 8911/9301 [56:43<02:24,  2.71it/s][A
epoch 2 iter 8911: train loss 0.30913. lr 6.000000e-05:  96%|█████████▌| 8912/9301 [56:43<02:22,  2.72it/s][A
epoch 2 iter 8912: train loss 0.32804. lr 6.000000e-05:  96%|█████████▌| 8912/9301 [56:43<02:22,  2.72it/s][A

epoch 2 iter 8944: train loss 0.31661. lr 6.000000e-05:  96%|█████████▌| 8945/9301 [56:55<02:11,  2.71it/s][A
epoch 2 iter 8945: train loss 0.29743. lr 6.000000e-05:  96%|█████████▌| 8945/9301 [56:56<02:11,  2.71it/s][A
epoch 2 iter 8945: train loss 0.29743. lr 6.000000e-05:  96%|█████████▌| 8946/9301 [56:56<02:10,  2.71it/s][A
epoch 2 iter 8946: train loss 0.31074. lr 6.000000e-05:  96%|█████████▌| 8946/9301 [56:56<02:10,  2.71it/s][A
epoch 2 iter 8946: train loss 0.31074. lr 6.000000e-05:  96%|█████████▌| 8947/9301 [56:56<02:09,  2.72it/s][A
epoch 2 iter 8947: train loss 0.32052. lr 6.000000e-05:  96%|█████████▌| 8947/9301 [56:56<02:09,  2.72it/s][A
epoch 2 iter 8947: train loss 0.32052. lr 6.000000e-05:  96%|█████████▌| 8948/9301 [56:56<02:10,  2.71it/s][A
epoch 2 iter 8948: train loss 0.30726. lr 6.000000e-05:  96%|█████████▌| 8948/9301 [56:57<02:10,  2.71it/s][A
epoch 2 iter 8948: train loss 0.30726. lr 6.000000e-05:  96%|█████████▌| 8949/9301 [56:57<02:09,  2.71it/s][A

epoch 2 iter 8981: train loss 0.30859. lr 6.000000e-05:  97%|█████████▋| 8981/9301 [57:09<01:59,  2.68it/s][A
epoch 2 iter 8981: train loss 0.30859. lr 6.000000e-05:  97%|█████████▋| 8982/9301 [57:09<01:59,  2.68it/s][A
epoch 2 iter 8982: train loss 0.32060. lr 6.000000e-05:  97%|█████████▋| 8982/9301 [57:09<01:59,  2.68it/s][A
epoch 2 iter 8982: train loss 0.32060. lr 6.000000e-05:  97%|█████████▋| 8983/9301 [57:09<01:58,  2.68it/s][A
epoch 2 iter 8983: train loss 0.29735. lr 6.000000e-05:  97%|█████████▋| 8983/9301 [57:10<01:58,  2.68it/s][A
epoch 2 iter 8983: train loss 0.29735. lr 6.000000e-05:  97%|█████████▋| 8984/9301 [57:10<01:57,  2.69it/s][A
epoch 2 iter 8984: train loss 0.30782. lr 6.000000e-05:  97%|█████████▋| 8984/9301 [57:10<01:57,  2.69it/s][A
epoch 2 iter 8984: train loss 0.30782. lr 6.000000e-05:  97%|█████████▋| 8985/9301 [57:10<01:56,  2.71it/s][A
epoch 2 iter 8985: train loss 0.30117. lr 6.000000e-05:  97%|█████████▋| 8985/9301 [57:10<01:56,  2.71it/s][A

epoch 2 iter 9017: train loss 0.33546. lr 6.000000e-05:  97%|█████████▋| 9018/9301 [57:22<01:44,  2.70it/s][A
epoch 2 iter 9018: train loss 0.28090. lr 6.000000e-05:  97%|█████████▋| 9018/9301 [57:23<01:44,  2.70it/s][A
epoch 2 iter 9018: train loss 0.28090. lr 6.000000e-05:  97%|█████████▋| 9019/9301 [57:23<01:44,  2.70it/s][A
epoch 2 iter 9019: train loss 0.31162. lr 6.000000e-05:  97%|█████████▋| 9019/9301 [57:23<01:44,  2.70it/s][A
epoch 2 iter 9019: train loss 0.31162. lr 6.000000e-05:  97%|█████████▋| 9020/9301 [57:23<01:43,  2.71it/s][A
epoch 2 iter 9020: train loss 0.30122. lr 6.000000e-05:  97%|█████████▋| 9020/9301 [57:23<01:43,  2.71it/s][A
epoch 2 iter 9020: train loss 0.30122. lr 6.000000e-05:  97%|█████████▋| 9021/9301 [57:23<01:43,  2.69it/s][A
epoch 2 iter 9021: train loss 0.30690. lr 6.000000e-05:  97%|█████████▋| 9021/9301 [57:24<01:43,  2.69it/s][A
epoch 2 iter 9021: train loss 0.30690. lr 6.000000e-05:  97%|█████████▋| 9022/9301 [57:24<01:44,  2.68it/s][A

epoch 2 iter 9054: train loss 0.31109. lr 6.000000e-05:  97%|█████████▋| 9054/9301 [57:36<01:31,  2.71it/s][A
epoch 2 iter 9054: train loss 0.31109. lr 6.000000e-05:  97%|█████████▋| 9055/9301 [57:36<01:30,  2.71it/s][A
epoch 2 iter 9055: train loss 0.29613. lr 6.000000e-05:  97%|█████████▋| 9055/9301 [57:36<01:30,  2.71it/s][A
epoch 2 iter 9055: train loss 0.29613. lr 6.000000e-05:  97%|█████████▋| 9056/9301 [57:36<01:29,  2.73it/s][A
epoch 2 iter 9056: train loss 0.30875. lr 6.000000e-05:  97%|█████████▋| 9056/9301 [57:37<01:29,  2.73it/s][A
epoch 2 iter 9056: train loss 0.30875. lr 6.000000e-05:  97%|█████████▋| 9057/9301 [57:37<01:29,  2.72it/s][A
epoch 2 iter 9057: train loss 0.33170. lr 6.000000e-05:  97%|█████████▋| 9057/9301 [57:37<01:29,  2.72it/s][A
epoch 2 iter 9057: train loss 0.33170. lr 6.000000e-05:  97%|█████████▋| 9058/9301 [57:37<01:29,  2.71it/s][A
epoch 2 iter 9058: train loss 0.31046. lr 6.000000e-05:  97%|█████████▋| 9058/9301 [57:37<01:29,  2.71it/s][A

epoch 2 iter 9090: train loss 0.30653. lr 6.000000e-05:  98%|█████████▊| 9091/9301 [57:49<01:17,  2.73it/s][A
epoch 2 iter 9091: train loss 0.32531. lr 6.000000e-05:  98%|█████████▊| 9091/9301 [57:50<01:17,  2.73it/s][A
epoch 2 iter 9091: train loss 0.32531. lr 6.000000e-05:  98%|█████████▊| 9092/9301 [57:50<01:16,  2.72it/s][A
epoch 2 iter 9092: train loss 0.29472. lr 6.000000e-05:  98%|█████████▊| 9092/9301 [57:50<01:16,  2.72it/s][A
epoch 2 iter 9092: train loss 0.29472. lr 6.000000e-05:  98%|█████████▊| 9093/9301 [57:50<01:16,  2.71it/s][A
epoch 2 iter 9093: train loss 0.31790. lr 6.000000e-05:  98%|█████████▊| 9093/9301 [57:50<01:16,  2.71it/s][A
epoch 2 iter 9093: train loss 0.31790. lr 6.000000e-05:  98%|█████████▊| 9094/9301 [57:50<01:16,  2.71it/s][A
epoch 2 iter 9094: train loss 0.31261. lr 6.000000e-05:  98%|█████████▊| 9094/9301 [57:51<01:16,  2.71it/s][A
epoch 2 iter 9094: train loss 0.31261. lr 6.000000e-05:  98%|█████████▊| 9095/9301 [57:51<01:16,  2.71it/s][A

epoch 2 iter 9127: train loss 0.32903. lr 6.000000e-05:  98%|█████████▊| 9127/9301 [58:03<01:04,  2.72it/s][A
epoch 2 iter 9127: train loss 0.32903. lr 6.000000e-05:  98%|█████████▊| 9128/9301 [58:03<01:03,  2.71it/s][A
epoch 2 iter 9128: train loss 0.30297. lr 6.000000e-05:  98%|█████████▊| 9128/9301 [58:03<01:03,  2.71it/s][A
epoch 2 iter 9128: train loss 0.30297. lr 6.000000e-05:  98%|█████████▊| 9129/9301 [58:03<01:03,  2.71it/s][A
epoch 2 iter 9129: train loss 0.30069. lr 6.000000e-05:  98%|█████████▊| 9129/9301 [58:04<01:03,  2.71it/s][A
epoch 2 iter 9129: train loss 0.30069. lr 6.000000e-05:  98%|█████████▊| 9130/9301 [58:04<01:03,  2.71it/s][A
epoch 2 iter 9130: train loss 0.32110. lr 6.000000e-05:  98%|█████████▊| 9130/9301 [58:04<01:03,  2.71it/s][A
epoch 2 iter 9130: train loss 0.32110. lr 6.000000e-05:  98%|█████████▊| 9131/9301 [58:04<01:02,  2.72it/s][A
epoch 2 iter 9131: train loss 0.30788. lr 6.000000e-05:  98%|█████████▊| 9131/9301 [58:04<01:02,  2.72it/s][A

epoch 2 iter 9163: train loss 0.29430. lr 6.000000e-05:  99%|█████████▊| 9164/9301 [58:16<00:50,  2.71it/s][A
epoch 2 iter 9164: train loss 0.32264. lr 6.000000e-05:  99%|█████████▊| 9164/9301 [58:16<00:50,  2.71it/s][A
epoch 2 iter 9164: train loss 0.32264. lr 6.000000e-05:  99%|█████████▊| 9165/9301 [58:16<00:50,  2.71it/s][A
epoch 2 iter 9165: train loss 0.29576. lr 6.000000e-05:  99%|█████████▊| 9165/9301 [58:17<00:50,  2.71it/s][A
epoch 2 iter 9165: train loss 0.29576. lr 6.000000e-05:  99%|█████████▊| 9166/9301 [58:17<00:49,  2.71it/s][A
epoch 2 iter 9166: train loss 0.32312. lr 6.000000e-05:  99%|█████████▊| 9166/9301 [58:17<00:49,  2.71it/s][A
epoch 2 iter 9166: train loss 0.32312. lr 6.000000e-05:  99%|█████████▊| 9167/9301 [58:17<00:49,  2.70it/s][A
epoch 2 iter 9167: train loss 0.29705. lr 6.000000e-05:  99%|█████████▊| 9167/9301 [58:18<00:49,  2.70it/s][A
epoch 2 iter 9167: train loss 0.29705. lr 6.000000e-05:  99%|█████████▊| 9168/9301 [58:18<00:49,  2.70it/s][A

epoch 2 iter 9200: train loss 0.31615. lr 6.000000e-05:  99%|█████████▉| 9200/9301 [58:30<00:37,  2.72it/s][A
epoch 2 iter 9200: train loss 0.31615. lr 6.000000e-05:  99%|█████████▉| 9201/9301 [58:30<00:36,  2.72it/s][A
epoch 2 iter 9201: train loss 0.32827. lr 6.000000e-05:  99%|█████████▉| 9201/9301 [58:30<00:36,  2.72it/s][A
epoch 2 iter 9201: train loss 0.32827. lr 6.000000e-05:  99%|█████████▉| 9202/9301 [58:30<00:36,  2.70it/s][A
epoch 2 iter 9202: train loss 0.29951. lr 6.000000e-05:  99%|█████████▉| 9202/9301 [58:30<00:36,  2.70it/s][A
epoch 2 iter 9202: train loss 0.29951. lr 6.000000e-05:  99%|█████████▉| 9203/9301 [58:30<00:36,  2.70it/s][A
epoch 2 iter 9203: train loss 0.29218. lr 6.000000e-05:  99%|█████████▉| 9203/9301 [58:31<00:36,  2.70it/s][A
epoch 2 iter 9203: train loss 0.29218. lr 6.000000e-05:  99%|█████████▉| 9204/9301 [58:31<00:35,  2.70it/s][A
epoch 2 iter 9204: train loss 0.28894. lr 6.000000e-05:  99%|█████████▉| 9204/9301 [58:31<00:35,  2.70it/s][A

epoch 2 iter 9236: train loss 0.31625. lr 6.000000e-05:  99%|█████████▉| 9237/9301 [58:43<00:23,  2.73it/s][A
epoch 2 iter 9237: train loss 0.29704. lr 6.000000e-05:  99%|█████████▉| 9237/9301 [58:43<00:23,  2.73it/s][A
epoch 2 iter 9237: train loss 0.29704. lr 6.000000e-05:  99%|█████████▉| 9238/9301 [58:43<00:23,  2.72it/s][A
epoch 2 iter 9238: train loss 0.34412. lr 6.000000e-05:  99%|█████████▉| 9238/9301 [58:44<00:23,  2.72it/s][A
epoch 2 iter 9238: train loss 0.34412. lr 6.000000e-05:  99%|█████████▉| 9239/9301 [58:44<00:22,  2.72it/s][A
epoch 2 iter 9239: train loss 0.31484. lr 6.000000e-05:  99%|█████████▉| 9239/9301 [58:44<00:22,  2.72it/s][A
epoch 2 iter 9239: train loss 0.31484. lr 6.000000e-05:  99%|█████████▉| 9240/9301 [58:44<00:22,  2.71it/s][A
epoch 2 iter 9240: train loss 0.31464. lr 6.000000e-05:  99%|█████████▉| 9240/9301 [58:44<00:22,  2.71it/s][A
epoch 2 iter 9240: train loss 0.31464. lr 6.000000e-05:  99%|█████████▉| 9241/9301 [58:44<00:22,  2.71it/s][A

epoch 2 iter 9273: train loss 0.30134. lr 6.000000e-05: 100%|█████████▉| 9273/9301 [58:57<00:10,  2.70it/s][A
epoch 2 iter 9273: train loss 0.30134. lr 6.000000e-05: 100%|█████████▉| 9274/9301 [58:57<00:10,  2.69it/s][A
epoch 2 iter 9274: train loss 0.30867. lr 6.000000e-05: 100%|█████████▉| 9274/9301 [58:57<00:10,  2.69it/s][A
epoch 2 iter 9274: train loss 0.30867. lr 6.000000e-05: 100%|█████████▉| 9275/9301 [58:57<00:09,  2.69it/s][A
epoch 2 iter 9275: train loss 0.28801. lr 6.000000e-05: 100%|█████████▉| 9275/9301 [58:57<00:09,  2.69it/s][A
epoch 2 iter 9275: train loss 0.28801. lr 6.000000e-05: 100%|█████████▉| 9276/9301 [58:57<00:09,  2.70it/s][A
epoch 2 iter 9276: train loss 0.30277. lr 6.000000e-05: 100%|█████████▉| 9276/9301 [58:58<00:09,  2.70it/s][A
epoch 2 iter 9276: train loss 0.30277. lr 6.000000e-05: 100%|█████████▉| 9277/9301 [58:58<00:08,  2.70it/s][A
epoch 2 iter 9277: train loss 0.32234. lr 6.000000e-05: 100%|█████████▉| 9277/9301 [58:58<00:08,  2.70it/s][A

data has 217484 characters, 119547 unique.



  0%|          | 0/3397 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.74256. lr 6.000000e-04:   0%|          | 0/3397 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 11.74256. lr 6.000000e-04:   0%|          | 1/3397 [00:00<21:25,  2.64it/s][A
epoch 1 iter 1: train loss 11.40341. lr 5.999999e-04:   0%|          | 1/3397 [00:00<21:25,  2.64it/s][A
epoch 1 iter 1: train loss 11.40341. lr 5.999999e-04:   0%|          | 2/3397 [00:00<21:10,  2.67it/s][A
epoch 1 iter 2: train loss 11.15278. lr 5.999997e-04:   0%|          | 2/3397 [00:01<21:10,  2.67it/s][A
epoch 1 iter 2: train loss 11.15278. lr 5.999997e-04:   0%|          | 3/3397 [00:01<21:02,  2.69it/s][A
epoch 1 iter 3: train loss 10.98933. lr 5.999995e-04:   0%|          | 3/3397 [00:01<21:02,  2.69it/s][A
epoch 1 iter 3: train loss 10.98933. lr 5.999995e-04:   0%|          | 4/3397 [00:01<20:55,  2.70it/s][A
epoch 1 iter 4: train loss 10.80853. lr 5.999992e-04:   0%|          | 4/3397 [00:01<20:55,  2.70it/s][A

epoch 1 iter 38: train loss 7.14599. lr 5.999516e-04:   1%|          | 38/3397 [00:14<20:33,  2.72it/s][A
epoch 1 iter 38: train loss 7.14599. lr 5.999516e-04:   1%|          | 39/3397 [00:14<20:32,  2.72it/s][A
epoch 1 iter 39: train loss 7.31268. lr 5.999491e-04:   1%|          | 39/3397 [00:14<20:32,  2.72it/s][A
epoch 1 iter 39: train loss 7.31268. lr 5.999491e-04:   1%|          | 40/3397 [00:14<20:35,  2.72it/s][A
epoch 1 iter 40: train loss 7.17824. lr 5.999465e-04:   1%|          | 40/3397 [00:15<20:35,  2.72it/s][A
epoch 1 iter 40: train loss 7.17824. lr 5.999465e-04:   1%|          | 41/3397 [00:15<20:36,  2.71it/s][A
epoch 1 iter 41: train loss 7.27034. lr 5.999438e-04:   1%|          | 41/3397 [00:15<20:36,  2.71it/s][A
epoch 1 iter 41: train loss 7.27034. lr 5.999438e-04:   1%|          | 42/3397 [00:15<20:36,  2.71it/s][A
epoch 1 iter 42: train loss 7.14298. lr 5.999411e-04:   1%|          | 42/3397 [00:15<20:36,  2.71it/s][A

epoch 1 iter 76: train loss 6.42430. lr 5.998105e-04:   2%|▏         | 76/3397 [00:28<20:26,  2.71it/s][A
epoch 1 iter 76: train loss 6.42430. lr 5.998105e-04:   2%|▏         | 77/3397 [00:28<20:25,  2.71it/s][A
epoch 1 iter 77: train loss 6.48944. lr 5.998056e-04:   2%|▏         | 77/3397 [00:28<20:25,  2.71it/s][A
epoch 1 iter 77: train loss 6.48944. lr 5.998056e-04:   2%|▏         | 78/3397 [00:28<20:22,  2.71it/s][A
epoch 1 iter 78: train loss 6.35707. lr 5.998005e-04:   2%|▏         | 78/3397 [00:29<20:22,  2.71it/s][A
epoch 1 iter 78: train loss 6.35707. lr 5.998005e-04:   2%|▏         | 79/3397 [00:29<20:23,  2.71it/s][A
epoch 1 iter 79: train loss 6.48284. lr 5.997954e-04:   2%|▏         | 79/3397 [00:29<20:23,  2.71it/s][A
epoch 1 iter 79: train loss 6.48284. lr 5.997954e-04:   2%|▏         | 80/3397 [00:29<20:18,  2.72it/s][A
epoch 1 iter 80: train loss 6.42485. lr 5.997903e-04:   2%|▏         | 80/3397 [00:29<20:18,  2.72it/s][A

epoch 1 iter 114: train loss 5.72385. lr 5.995769e-04:   3%|▎         | 114/3397 [00:42<20:07,  2.72it/s][A
epoch 1 iter 114: train loss 5.72385. lr 5.995769e-04:   3%|▎         | 115/3397 [00:42<20:08,  2.72it/s][A
epoch 1 iter 115: train loss 5.68026. lr 5.995695e-04:   3%|▎         | 115/3397 [00:42<20:08,  2.72it/s][A
epoch 1 iter 115: train loss 5.68026. lr 5.995695e-04:   3%|▎         | 116/3397 [00:42<20:10,  2.71it/s][A
epoch 1 iter 116: train loss 5.62108. lr 5.995620e-04:   3%|▎         | 116/3397 [00:43<20:10,  2.71it/s][A
epoch 1 iter 116: train loss 5.62108. lr 5.995620e-04:   3%|▎         | 117/3397 [00:43<20:09,  2.71it/s][A
epoch 1 iter 117: train loss 5.83283. lr 5.995545e-04:   3%|▎         | 117/3397 [00:43<20:09,  2.71it/s][A
epoch 1 iter 117: train loss 5.83283. lr 5.995545e-04:   3%|▎         | 118/3397 [00:43<20:06,  2.72it/s][A
epoch 1 iter 118: train loss 5.65242. lr 5.995469e-04:   3%|▎         | 118/3397 [00:43<20:06,  2.72it/s][A

epoch 1 iter 151: train loss 5.24595. lr 5.992604e-04:   4%|▍         | 152/3397 [00:55<19:58,  2.71it/s][A
epoch 1 iter 152: train loss 5.42331. lr 5.992507e-04:   4%|▍         | 152/3397 [00:56<19:58,  2.71it/s][A
epoch 1 iter 152: train loss 5.42331. lr 5.992507e-04:   5%|▍         | 153/3397 [00:56<19:56,  2.71it/s][A
epoch 1 iter 153: train loss 5.25537. lr 5.992408e-04:   5%|▍         | 153/3397 [00:56<19:56,  2.71it/s][A
epoch 1 iter 153: train loss 5.25537. lr 5.992408e-04:   5%|▍         | 154/3397 [00:56<19:54,  2.71it/s][A
epoch 1 iter 154: train loss 5.35512. lr 5.992309e-04:   5%|▍         | 154/3397 [00:57<19:54,  2.71it/s][A
epoch 1 iter 154: train loss 5.35512. lr 5.992309e-04:   5%|▍         | 155/3397 [00:57<19:51,  2.72it/s][A
epoch 1 iter 155: train loss 5.10215. lr 5.992210e-04:   5%|▍         | 155/3397 [00:57<19:51,  2.72it/s][A
epoch 1 iter 155: train loss 5.10215. lr 5.992210e-04:   5%|▍         | 156/3397 [00:57<19:53,  2.71it/s][A

epoch 1 iter 189: train loss 4.84145. lr 5.988442e-04:   6%|▌         | 189/3397 [01:09<19:42,  2.71it/s][A
epoch 1 iter 189: train loss 4.84145. lr 5.988442e-04:   6%|▌         | 190/3397 [01:09<19:38,  2.72it/s][A
epoch 1 iter 190: train loss 4.78354. lr 5.988320e-04:   6%|▌         | 190/3397 [01:10<19:38,  2.72it/s][A
epoch 1 iter 190: train loss 4.78354. lr 5.988320e-04:   6%|▌         | 191/3397 [01:10<19:42,  2.71it/s][A
epoch 1 iter 191: train loss 4.74989. lr 5.988197e-04:   6%|▌         | 191/3397 [01:10<19:42,  2.71it/s][A
epoch 1 iter 191: train loss 4.74989. lr 5.988197e-04:   6%|▌         | 192/3397 [01:10<19:45,  2.70it/s][A
epoch 1 iter 192: train loss 4.77314. lr 5.988074e-04:   6%|▌         | 192/3397 [01:11<19:45,  2.70it/s][A
epoch 1 iter 192: train loss 4.77314. lr 5.988074e-04:   6%|▌         | 193/3397 [01:11<19:48,  2.70it/s][A
epoch 1 iter 193: train loss 4.67153. lr 5.987950e-04:   6%|▌         | 193/3397 [01:11<19:48,  2.70it/s][A

epoch 1 iter 226: train loss 4.72904. lr 5.983502e-04:   7%|▋         | 227/3397 [01:23<19:30,  2.71it/s][A
epoch 1 iter 227: train loss 4.51381. lr 5.983357e-04:   7%|▋         | 227/3397 [01:23<19:30,  2.71it/s][A
epoch 1 iter 227: train loss 4.51381. lr 5.983357e-04:   7%|▋         | 228/3397 [01:23<19:28,  2.71it/s][A
epoch 1 iter 228: train loss 4.60143. lr 5.983210e-04:   7%|▋         | 228/3397 [01:24<19:28,  2.71it/s][A
epoch 1 iter 228: train loss 4.60143. lr 5.983210e-04:   7%|▋         | 229/3397 [01:24<19:24,  2.72it/s][A
epoch 1 iter 229: train loss 4.45612. lr 5.983063e-04:   7%|▋         | 229/3397 [01:24<19:24,  2.72it/s][A
epoch 1 iter 229: train loss 4.45612. lr 5.983063e-04:   7%|▋         | 230/3397 [01:24<19:18,  2.73it/s][A
epoch 1 iter 230: train loss 4.47002. lr 5.982916e-04:   7%|▋         | 230/3397 [01:25<19:18,  2.73it/s][A
epoch 1 iter 230: train loss 4.47002. lr 5.982916e-04:   7%|▋         | 231/3397 [01:25<19:25,  2.72it/s][A

epoch 1 iter 264: train loss 4.23952. lr 5.977520e-04:   8%|▊         | 264/3397 [01:37<19:15,  2.71it/s][A
epoch 1 iter 264: train loss 4.23952. lr 5.977520e-04:   8%|▊         | 265/3397 [01:37<19:10,  2.72it/s][A
epoch 1 iter 265: train loss 4.15977. lr 5.977350e-04:   8%|▊         | 265/3397 [01:38<19:10,  2.72it/s][A
epoch 1 iter 265: train loss 4.15977. lr 5.977350e-04:   8%|▊         | 266/3397 [01:38<19:15,  2.71it/s][A
epoch 1 iter 266: train loss 4.17650. lr 5.977179e-04:   8%|▊         | 266/3397 [01:38<19:15,  2.71it/s][A
epoch 1 iter 266: train loss 4.17650. lr 5.977179e-04:   8%|▊         | 267/3397 [01:38<19:16,  2.71it/s][A
epoch 1 iter 267: train loss 4.08645. lr 5.977008e-04:   8%|▊         | 267/3397 [01:38<19:16,  2.71it/s][A
epoch 1 iter 267: train loss 4.08645. lr 5.977008e-04:   8%|▊         | 268/3397 [01:38<19:13,  2.71it/s][A
epoch 1 iter 268: train loss 4.17422. lr 5.976836e-04:   8%|▊         | 268/3397 [01:39<19:13,  2.71it/s][A

epoch 1 iter 301: train loss 3.84915. lr 5.970811e-04:   9%|▉         | 302/3397 [01:51<19:02,  2.71it/s][A
epoch 1 iter 302: train loss 3.92329. lr 5.970617e-04:   9%|▉         | 302/3397 [01:51<19:02,  2.71it/s][A
epoch 1 iter 302: train loss 3.92329. lr 5.970617e-04:   9%|▉         | 303/3397 [01:51<19:02,  2.71it/s][A
epoch 1 iter 303: train loss 3.98457. lr 5.970423e-04:   9%|▉         | 303/3397 [01:52<19:02,  2.71it/s][A
epoch 1 iter 303: train loss 3.98457. lr 5.970423e-04:   9%|▉         | 304/3397 [01:52<19:00,  2.71it/s][A
epoch 1 iter 304: train loss 3.90593. lr 5.970228e-04:   9%|▉         | 304/3397 [01:52<19:00,  2.71it/s][A
epoch 1 iter 304: train loss 3.90593. lr 5.970228e-04:   9%|▉         | 305/3397 [01:52<18:57,  2.72it/s][A
epoch 1 iter 305: train loss 3.93337. lr 5.970033e-04:   9%|▉         | 305/3397 [01:52<18:57,  2.72it/s][A
epoch 1 iter 305: train loss 3.93337. lr 5.970033e-04:   9%|▉         | 306/3397 [01:52<18:53,  2.73it/s][A

epoch 1 iter 339: train loss 3.62972. lr 5.963014e-04:  10%|▉         | 339/3397 [02:05<18:46,  2.72it/s][A
epoch 1 iter 339: train loss 3.62972. lr 5.963014e-04:  10%|█         | 340/3397 [02:05<18:47,  2.71it/s][A
epoch 1 iter 340: train loss 3.61561. lr 5.962797e-04:  10%|█         | 340/3397 [02:05<18:47,  2.71it/s][A
epoch 1 iter 340: train loss 3.61561. lr 5.962797e-04:  10%|█         | 341/3397 [02:05<18:44,  2.72it/s][A
epoch 1 iter 341: train loss 3.54690. lr 5.962579e-04:  10%|█         | 341/3397 [02:06<18:44,  2.72it/s][A
epoch 1 iter 341: train loss 3.54690. lr 5.962579e-04:  10%|█         | 342/3397 [02:06<18:46,  2.71it/s][A
epoch 1 iter 342: train loss 3.68117. lr 5.962360e-04:  10%|█         | 342/3397 [02:06<18:46,  2.71it/s][A
epoch 1 iter 342: train loss 3.68117. lr 5.962360e-04:  10%|█         | 343/3397 [02:06<18:45,  2.71it/s][A
epoch 1 iter 343: train loss 3.42655. lr 5.962141e-04:  10%|█         | 343/3397 [02:06<18:45,  2.71it/s][A

epoch 1 iter 376: train loss 3.30694. lr 5.954544e-04:  11%|█         | 377/3397 [02:18<18:34,  2.71it/s][A
epoch 1 iter 377: train loss 3.54190. lr 5.954303e-04:  11%|█         | 377/3397 [02:19<18:34,  2.71it/s][A
epoch 1 iter 377: train loss 3.54190. lr 5.954303e-04:  11%|█         | 378/3397 [02:19<18:35,  2.71it/s][A
epoch 1 iter 378: train loss 3.37396. lr 5.954062e-04:  11%|█         | 378/3397 [02:19<18:35,  2.71it/s][A
epoch 1 iter 378: train loss 3.37396. lr 5.954062e-04:  11%|█         | 379/3397 [02:19<18:33,  2.71it/s][A
epoch 1 iter 379: train loss 3.39300. lr 5.953819e-04:  11%|█         | 379/3397 [02:20<18:33,  2.71it/s][A
epoch 1 iter 379: train loss 3.39300. lr 5.953819e-04:  11%|█         | 380/3397 [02:20<18:32,  2.71it/s][A
epoch 1 iter 380: train loss 3.27203. lr 5.953576e-04:  11%|█         | 380/3397 [02:20<18:32,  2.71it/s][A
epoch 1 iter 380: train loss 3.27203. lr 5.953576e-04:  11%|█         | 381/3397 [02:20<18:28,  2.72it/s][A

epoch 1 iter 414: train loss 3.14155. lr 5.944944e-04:  12%|█▏        | 414/3397 [02:32<18:26,  2.70it/s][A
epoch 1 iter 414: train loss 3.14155. lr 5.944944e-04:  12%|█▏        | 415/3397 [02:32<18:24,  2.70it/s][A
epoch 1 iter 415: train loss 3.20488. lr 5.944679e-04:  12%|█▏        | 415/3397 [02:33<18:24,  2.70it/s][A
epoch 1 iter 415: train loss 3.20488. lr 5.944679e-04:  12%|█▏        | 416/3397 [02:33<18:21,  2.71it/s][A
epoch 1 iter 416: train loss 3.16036. lr 5.944414e-04:  12%|█▏        | 416/3397 [02:33<18:21,  2.71it/s][A
epoch 1 iter 416: train loss 3.16036. lr 5.944414e-04:  12%|█▏        | 417/3397 [02:33<18:18,  2.71it/s][A
epoch 1 iter 417: train loss 3.15782. lr 5.944147e-04:  12%|█▏        | 417/3397 [02:34<18:18,  2.71it/s][A
epoch 1 iter 417: train loss 3.15782. lr 5.944147e-04:  12%|█▏        | 418/3397 [02:34<18:20,  2.71it/s][A
epoch 1 iter 418: train loss 3.25609. lr 5.943881e-04:  12%|█▏        | 418/3397 [02:34<18:20,  2.71it/s][A

epoch 1 iter 451: train loss 3.06074. lr 5.934723e-04:  13%|█▎        | 452/3397 [02:46<18:04,  2.72it/s][A
epoch 1 iter 452: train loss 2.94026. lr 5.934434e-04:  13%|█▎        | 452/3397 [02:47<18:04,  2.72it/s][A
epoch 1 iter 452: train loss 2.94026. lr 5.934434e-04:  13%|█▎        | 453/3397 [02:47<18:03,  2.72it/s][A
epoch 1 iter 453: train loss 2.89829. lr 5.934146e-04:  13%|█▎        | 453/3397 [02:47<18:03,  2.72it/s][A
epoch 1 iter 453: train loss 2.89829. lr 5.934146e-04:  13%|█▎        | 454/3397 [02:47<18:05,  2.71it/s][A
epoch 1 iter 454: train loss 2.95164. lr 5.933856e-04:  13%|█▎        | 454/3397 [02:47<18:05,  2.71it/s][A
epoch 1 iter 454: train loss 2.95164. lr 5.933856e-04:  13%|█▎        | 455/3397 [02:47<18:04,  2.71it/s][A
epoch 1 iter 455: train loss 3.03548. lr 5.933566e-04:  13%|█▎        | 455/3397 [02:48<18:04,  2.71it/s][A
epoch 1 iter 455: train loss 3.03548. lr 5.933566e-04:  13%|█▎        | 456/3397 [02:48<18:03,  2.71it/s][A

epoch 1 iter 489: train loss 2.72942. lr 5.923330e-04:  14%|█▍        | 489/3397 [03:00<17:49,  2.72it/s][A
epoch 1 iter 489: train loss 2.72942. lr 5.923330e-04:  14%|█▍        | 490/3397 [03:00<17:51,  2.71it/s][A
epoch 1 iter 490: train loss 2.69689. lr 5.923018e-04:  14%|█▍        | 490/3397 [03:01<17:51,  2.71it/s][A
epoch 1 iter 490: train loss 2.69689. lr 5.923018e-04:  14%|█▍        | 491/3397 [03:01<17:53,  2.71it/s][A
epoch 1 iter 491: train loss 2.84147. lr 5.922706e-04:  14%|█▍        | 491/3397 [03:01<17:53,  2.71it/s][A
epoch 1 iter 491: train loss 2.84147. lr 5.922706e-04:  14%|█▍        | 492/3397 [03:01<17:53,  2.71it/s][A
epoch 1 iter 492: train loss 2.64469. lr 5.922392e-04:  14%|█▍        | 492/3397 [03:01<17:53,  2.71it/s][A
epoch 1 iter 492: train loss 2.64469. lr 5.922392e-04:  15%|█▍        | 493/3397 [03:01<17:51,  2.71it/s][A
epoch 1 iter 493: train loss 2.69265. lr 5.922078e-04:  15%|█▍        | 493/3397 [03:02<17:51,  2.71it/s][A

epoch 1 iter 526: train loss 2.41285. lr 5.911370e-04:  16%|█▌        | 527/3397 [03:14<17:39,  2.71it/s][A
epoch 1 iter 527: train loss 2.69217. lr 5.911035e-04:  16%|█▌        | 527/3397 [03:14<17:39,  2.71it/s][A
epoch 1 iter 527: train loss 2.69217. lr 5.911035e-04:  16%|█▌        | 528/3397 [03:14<17:38,  2.71it/s][A
epoch 1 iter 528: train loss 2.63366. lr 5.910699e-04:  16%|█▌        | 528/3397 [03:15<17:38,  2.71it/s][A
epoch 1 iter 528: train loss 2.63366. lr 5.910699e-04:  16%|█▌        | 529/3397 [03:15<17:38,  2.71it/s][A
epoch 1 iter 529: train loss 2.52474. lr 5.910363e-04:  16%|█▌        | 529/3397 [03:15<17:38,  2.71it/s][A
epoch 1 iter 529: train loss 2.52474. lr 5.910363e-04:  16%|█▌        | 530/3397 [03:15<17:33,  2.72it/s][A
epoch 1 iter 530: train loss 2.50170. lr 5.910026e-04:  16%|█▌        | 530/3397 [03:15<17:33,  2.72it/s][A
epoch 1 iter 530: train loss 2.50170. lr 5.910026e-04:  16%|█▌        | 531/3397 [03:15<17:35,  2.72it/s][A

epoch 1 iter 564: train loss 2.32859. lr 5.898199e-04:  17%|█▋        | 564/3397 [03:28<17:25,  2.71it/s][A
epoch 1 iter 564: train loss 2.32859. lr 5.898199e-04:  17%|█▋        | 565/3397 [03:28<17:21,  2.72it/s][A
epoch 1 iter 565: train loss 2.38913. lr 5.897840e-04:  17%|█▋        | 565/3397 [03:28<17:21,  2.72it/s][A
epoch 1 iter 565: train loss 2.38913. lr 5.897840e-04:  17%|█▋        | 566/3397 [03:28<17:23,  2.71it/s][A
epoch 1 iter 566: train loss 2.46441. lr 5.897481e-04:  17%|█▋        | 566/3397 [03:29<17:23,  2.71it/s][A
epoch 1 iter 566: train loss 2.46441. lr 5.897481e-04:  17%|█▋        | 567/3397 [03:29<17:24,  2.71it/s][A
epoch 1 iter 567: train loss 2.52082. lr 5.897121e-04:  17%|█▋        | 567/3397 [03:29<17:24,  2.71it/s][A
epoch 1 iter 567: train loss 2.52082. lr 5.897121e-04:  17%|█▋        | 568/3397 [03:29<17:23,  2.71it/s][A
epoch 1 iter 568: train loss 2.33847. lr 5.896760e-04:  17%|█▋        | 568/3397 [03:29<17:23,  2.71it/s][A

epoch 1 iter 601: train loss 2.26748. lr 5.884514e-04:  18%|█▊        | 602/3397 [03:42<17:11,  2.71it/s][A
epoch 1 iter 602: train loss 2.21993. lr 5.884133e-04:  18%|█▊        | 602/3397 [03:42<17:11,  2.71it/s][A
epoch 1 iter 602: train loss 2.21993. lr 5.884133e-04:  18%|█▊        | 603/3397 [03:42<17:12,  2.71it/s][A
epoch 1 iter 603: train loss 2.15824. lr 5.883750e-04:  18%|█▊        | 603/3397 [03:42<17:12,  2.71it/s][A
epoch 1 iter 603: train loss 2.15824. lr 5.883750e-04:  18%|█▊        | 604/3397 [03:42<17:11,  2.71it/s][A
epoch 1 iter 604: train loss 2.21440. lr 5.883367e-04:  18%|█▊        | 604/3397 [03:43<17:11,  2.71it/s][A
epoch 1 iter 604: train loss 2.21440. lr 5.883367e-04:  18%|█▊        | 605/3397 [03:43<17:19,  2.68it/s][A
epoch 1 iter 605: train loss 2.15372. lr 5.882984e-04:  18%|█▊        | 605/3397 [03:43<17:19,  2.68it/s][A
epoch 1 iter 605: train loss 2.15372. lr 5.882984e-04:  18%|█▊        | 606/3397 [03:43<17:26,  2.67it/s][A

epoch 1 iter 639: train loss 1.97608. lr 5.869580e-04:  19%|█▉        | 639/3397 [03:56<16:58,  2.71it/s][A
epoch 1 iter 639: train loss 1.97608. lr 5.869580e-04:  19%|█▉        | 640/3397 [03:56<16:56,  2.71it/s][A
epoch 1 iter 640: train loss 1.96264. lr 5.869175e-04:  19%|█▉        | 640/3397 [03:56<16:56,  2.71it/s][A
epoch 1 iter 640: train loss 1.96264. lr 5.869175e-04:  19%|█▉        | 641/3397 [03:56<16:52,  2.72it/s][A
epoch 1 iter 641: train loss 1.99055. lr 5.868770e-04:  19%|█▉        | 641/3397 [03:57<16:52,  2.72it/s][A
epoch 1 iter 641: train loss 1.99055. lr 5.868770e-04:  19%|█▉        | 642/3397 [03:57<16:56,  2.71it/s][A
epoch 1 iter 642: train loss 2.04183. lr 5.868363e-04:  19%|█▉        | 642/3397 [03:57<16:56,  2.71it/s][A
epoch 1 iter 642: train loss 2.04183. lr 5.868363e-04:  19%|█▉        | 643/3397 [03:57<16:58,  2.70it/s][A
epoch 1 iter 643: train loss 1.91554. lr 5.867957e-04:  19%|█▉        | 643/3397 [03:57<16:58,  2.70it/s][A

epoch 1 iter 676: train loss 1.97638. lr 5.854188e-04:  20%|█▉        | 677/3397 [04:10<16:48,  2.70it/s][A
epoch 1 iter 677: train loss 1.87942. lr 5.853760e-04:  20%|█▉        | 677/3397 [04:10<16:48,  2.70it/s][A
epoch 1 iter 677: train loss 1.87942. lr 5.853760e-04:  20%|█▉        | 678/3397 [04:10<16:45,  2.71it/s][A
epoch 1 iter 678: train loss 1.84359. lr 5.853332e-04:  20%|█▉        | 678/3397 [04:10<16:45,  2.71it/s][A
epoch 1 iter 678: train loss 1.84359. lr 5.853332e-04:  20%|█▉        | 679/3397 [04:10<16:51,  2.69it/s][A
epoch 1 iter 679: train loss 1.82901. lr 5.852903e-04:  20%|█▉        | 679/3397 [04:11<16:51,  2.69it/s][A
epoch 1 iter 679: train loss 1.82901. lr 5.852903e-04:  20%|██        | 680/3397 [04:11<16:58,  2.67it/s][A
epoch 1 iter 680: train loss 1.85236. lr 5.852473e-04:  20%|██        | 680/3397 [04:11<16:58,  2.67it/s][A
epoch 1 iter 680: train loss 1.85236. lr 5.852473e-04:  20%|██        | 681/3397 [04:11<16:59,  2.67it/s][A

epoch 1 iter 714: train loss 1.69370. lr 5.837509e-04:  21%|██        | 714/3397 [04:24<16:32,  2.70it/s][A
epoch 1 iter 714: train loss 1.69370. lr 5.837509e-04:  21%|██        | 715/3397 [04:24<16:28,  2.71it/s][A
epoch 1 iter 715: train loss 1.81761. lr 5.837058e-04:  21%|██        | 715/3397 [04:24<16:28,  2.71it/s][A
epoch 1 iter 715: train loss 1.81761. lr 5.837058e-04:  21%|██        | 716/3397 [04:24<16:31,  2.71it/s][A
epoch 1 iter 716: train loss 1.82472. lr 5.836607e-04:  21%|██        | 716/3397 [04:24<16:31,  2.71it/s][A
epoch 1 iter 716: train loss 1.82472. lr 5.836607e-04:  21%|██        | 717/3397 [04:24<16:31,  2.70it/s][A
epoch 1 iter 717: train loss 1.77992. lr 5.836155e-04:  21%|██        | 717/3397 [04:25<16:31,  2.70it/s][A
epoch 1 iter 717: train loss 1.77992. lr 5.836155e-04:  21%|██        | 718/3397 [04:25<16:31,  2.70it/s][A
epoch 1 iter 718: train loss 1.67357. lr 5.835702e-04:  21%|██        | 718/3397 [04:25<16:31,  2.70it/s][A

epoch 1 iter 751: train loss 1.59100. lr 5.820427e-04:  22%|██▏       | 752/3397 [04:37<16:17,  2.70it/s][A
epoch 1 iter 752: train loss 1.58933. lr 5.819953e-04:  22%|██▏       | 752/3397 [04:38<16:17,  2.70it/s][A
epoch 1 iter 752: train loss 1.58933. lr 5.819953e-04:  22%|██▏       | 753/3397 [04:38<16:14,  2.71it/s][A
epoch 1 iter 753: train loss 1.58239. lr 5.819480e-04:  22%|██▏       | 753/3397 [04:38<16:14,  2.71it/s][A
epoch 1 iter 753: train loss 1.58239. lr 5.819480e-04:  22%|██▏       | 754/3397 [04:38<16:15,  2.71it/s][A
epoch 1 iter 754: train loss 1.51662. lr 5.819005e-04:  22%|██▏       | 754/3397 [04:38<16:15,  2.71it/s][A
epoch 1 iter 754: train loss 1.51662. lr 5.819005e-04:  22%|██▏       | 755/3397 [04:38<16:17,  2.70it/s][A
epoch 1 iter 755: train loss 1.58313. lr 5.818530e-04:  22%|██▏       | 755/3397 [04:39<16:17,  2.70it/s][A
epoch 1 iter 755: train loss 1.58313. lr 5.818530e-04:  22%|██▏       | 756/3397 [04:39<16:16,  2.71it/s][A

epoch 1 iter 789: train loss 1.54947. lr 5.802023e-04:  23%|██▎       | 789/3397 [04:51<16:04,  2.70it/s][A
epoch 1 iter 789: train loss 1.54947. lr 5.802023e-04:  23%|██▎       | 790/3397 [04:51<16:05,  2.70it/s][A
epoch 1 iter 790: train loss 1.44865. lr 5.801527e-04:  23%|██▎       | 790/3397 [04:52<16:05,  2.70it/s][A
epoch 1 iter 790: train loss 1.44865. lr 5.801527e-04:  23%|██▎       | 791/3397 [04:52<16:05,  2.70it/s][A
epoch 1 iter 791: train loss 1.40215. lr 5.801030e-04:  23%|██▎       | 791/3397 [04:52<16:05,  2.70it/s][A
epoch 1 iter 791: train loss 1.40215. lr 5.801030e-04:  23%|██▎       | 792/3397 [04:52<16:05,  2.70it/s][A
epoch 1 iter 792: train loss 1.43891. lr 5.800533e-04:  23%|██▎       | 792/3397 [04:52<16:05,  2.70it/s][A
epoch 1 iter 792: train loss 1.43891. lr 5.800533e-04:  23%|██▎       | 793/3397 [04:52<16:01,  2.71it/s][A
epoch 1 iter 793: train loss 1.41585. lr 5.800035e-04:  23%|██▎       | 793/3397 [04:53<16:01,  2.71it/s][A

epoch 1 iter 826: train loss 1.30511. lr 5.783272e-04:  24%|██▍       | 827/3397 [05:05<15:53,  2.70it/s][A
epoch 1 iter 827: train loss 1.39895. lr 5.782754e-04:  24%|██▍       | 827/3397 [05:05<15:53,  2.70it/s][A
epoch 1 iter 827: train loss 1.39895. lr 5.782754e-04:  24%|██▍       | 828/3397 [05:05<15:50,  2.70it/s][A
epoch 1 iter 828: train loss 1.43091. lr 5.782235e-04:  24%|██▍       | 828/3397 [05:06<15:50,  2.70it/s][A
epoch 1 iter 828: train loss 1.43091. lr 5.782235e-04:  24%|██▍       | 829/3397 [05:06<15:46,  2.71it/s][A
epoch 1 iter 829: train loss 1.40300. lr 5.781716e-04:  24%|██▍       | 829/3397 [05:06<15:46,  2.71it/s][A
epoch 1 iter 829: train loss 1.40300. lr 5.781716e-04:  24%|██▍       | 830/3397 [05:06<15:52,  2.69it/s][A
epoch 1 iter 830: train loss 1.39106. lr 5.781196e-04:  24%|██▍       | 830/3397 [05:07<15:52,  2.69it/s][A
epoch 1 iter 830: train loss 1.39106. lr 5.781196e-04:  24%|██▍       | 831/3397 [05:07<15:58,  2.68it/s][A

epoch 1 iter 864: train loss 1.19469. lr 5.763166e-04:  25%|██▌       | 864/3397 [05:19<15:37,  2.70it/s][A
epoch 1 iter 864: train loss 1.19469. lr 5.763166e-04:  25%|██▌       | 865/3397 [05:19<15:36,  2.70it/s][A
epoch 1 iter 865: train loss 1.35085. lr 5.762625e-04:  25%|██▌       | 865/3397 [05:20<15:36,  2.70it/s][A
epoch 1 iter 865: train loss 1.35085. lr 5.762625e-04:  25%|██▌       | 866/3397 [05:20<15:31,  2.72it/s][A
epoch 1 iter 866: train loss 1.23500. lr 5.762084e-04:  25%|██▌       | 866/3397 [05:20<15:31,  2.72it/s][A
epoch 1 iter 866: train loss 1.23500. lr 5.762084e-04:  26%|██▌       | 867/3397 [05:20<15:36,  2.70it/s][A
epoch 1 iter 867: train loss 1.24625. lr 5.761542e-04:  26%|██▌       | 867/3397 [05:20<15:36,  2.70it/s][A
epoch 1 iter 867: train loss 1.24625. lr 5.761542e-04:  26%|██▌       | 868/3397 [05:20<15:39,  2.69it/s][A
epoch 1 iter 868: train loss 1.25024. lr 5.761000e-04:  26%|██▌       | 868/3397 [05:21<15:39,  2.69it/s][A

epoch 1 iter 901: train loss 1.19051. lr 5.742769e-04:  27%|██▋       | 902/3397 [05:33<15:29,  2.69it/s][A
epoch 1 iter 902: train loss 1.17832. lr 5.742206e-04:  27%|██▋       | 902/3397 [05:33<15:29,  2.69it/s][A
epoch 1 iter 902: train loss 1.17832. lr 5.742206e-04:  27%|██▋       | 903/3397 [05:33<15:29,  2.68it/s][A
epoch 1 iter 903: train loss 1.14304. lr 5.741643e-04:  27%|██▋       | 903/3397 [05:34<15:29,  2.68it/s][A
epoch 1 iter 903: train loss 1.14304. lr 5.741643e-04:  27%|██▋       | 904/3397 [05:34<15:23,  2.70it/s][A
epoch 1 iter 904: train loss 1.09940. lr 5.741079e-04:  27%|██▋       | 904/3397 [05:34<15:23,  2.70it/s][A
epoch 1 iter 904: train loss 1.09940. lr 5.741079e-04:  27%|██▋       | 905/3397 [05:34<15:23,  2.70it/s][A
epoch 1 iter 905: train loss 1.14660. lr 5.740515e-04:  27%|██▋       | 905/3397 [05:34<15:23,  2.70it/s][A
epoch 1 iter 905: train loss 1.14660. lr 5.740515e-04:  27%|██▋       | 906/3397 [05:34<15:22,  2.70it/s][A

epoch 1 iter 939: train loss 1.10834. lr 5.720984e-04:  28%|██▊       | 939/3397 [05:47<15:09,  2.70it/s][A
epoch 1 iter 939: train loss 1.10834. lr 5.720984e-04:  28%|██▊       | 940/3397 [05:47<15:08,  2.70it/s][A
epoch 1 iter 940: train loss 1.04522. lr 5.720399e-04:  28%|██▊       | 940/3397 [05:47<15:08,  2.70it/s][A
epoch 1 iter 940: train loss 1.04522. lr 5.720399e-04:  28%|██▊       | 941/3397 [05:47<15:07,  2.70it/s][A
epoch 1 iter 941: train loss 1.03864. lr 5.719814e-04:  28%|██▊       | 941/3397 [05:48<15:07,  2.70it/s][A
epoch 1 iter 941: train loss 1.03864. lr 5.719814e-04:  28%|██▊       | 942/3397 [05:48<15:09,  2.70it/s][A
epoch 1 iter 942: train loss 1.04645. lr 5.719228e-04:  28%|██▊       | 942/3397 [05:48<15:09,  2.70it/s][A
epoch 1 iter 942: train loss 1.04645. lr 5.719228e-04:  28%|██▊       | 943/3397 [05:48<15:09,  2.70it/s][A
epoch 1 iter 943: train loss 1.05313. lr 5.718642e-04:  28%|██▊       | 943/3397 [05:48<15:09,  2.70it/s][A

epoch 1 iter 976: train loss 0.94863. lr 5.698965e-04:  29%|██▉       | 977/3397 [06:01<14:55,  2.70it/s][A
epoch 1 iter 977: train loss 1.00716. lr 5.698359e-04:  29%|██▉       | 977/3397 [06:01<14:55,  2.70it/s][A
epoch 1 iter 977: train loss 1.00716. lr 5.698359e-04:  29%|██▉       | 978/3397 [06:01<15:02,  2.68it/s][A
epoch 1 iter 978: train loss 0.98020. lr 5.697752e-04:  29%|██▉       | 978/3397 [06:01<15:02,  2.68it/s][A
epoch 1 iter 978: train loss 0.98020. lr 5.697752e-04:  29%|██▉       | 979/3397 [06:01<15:05,  2.67it/s][A
epoch 1 iter 979: train loss 0.98772. lr 5.697145e-04:  29%|██▉       | 979/3397 [06:02<15:05,  2.67it/s][A
epoch 1 iter 979: train loss 0.98772. lr 5.697145e-04:  29%|██▉       | 980/3397 [06:02<15:06,  2.67it/s][A
epoch 1 iter 980: train loss 0.93713. lr 5.696537e-04:  29%|██▉       | 980/3397 [06:02<15:06,  2.67it/s][A
epoch 1 iter 980: train loss 0.93713. lr 5.696537e-04:  29%|██▉       | 981/3397 [06:02<15:06,  2.66it/s][A

epoch 1 iter 1013: train loss 0.92961. lr 5.676155e-04:  30%|██▉       | 1014/3397 [06:14<14:46,  2.69it/s][A
epoch 1 iter 1014: train loss 0.94962. lr 5.675528e-04:  30%|██▉       | 1014/3397 [06:15<14:46,  2.69it/s][A
epoch 1 iter 1014: train loss 0.94962. lr 5.675528e-04:  30%|██▉       | 1015/3397 [06:15<14:47,  2.68it/s][A
epoch 1 iter 1015: train loss 0.87356. lr 5.674900e-04:  30%|██▉       | 1015/3397 [06:15<14:47,  2.68it/s][A
epoch 1 iter 1015: train loss 0.87356. lr 5.674900e-04:  30%|██▉       | 1016/3397 [06:15<14:48,  2.68it/s][A
epoch 1 iter 1016: train loss 0.90303. lr 5.674271e-04:  30%|██▉       | 1016/3397 [06:16<14:48,  2.68it/s][A
epoch 1 iter 1016: train loss 0.90303. lr 5.674271e-04:  30%|██▉       | 1017/3397 [06:16<14:48,  2.68it/s][A
epoch 1 iter 1017: train loss 0.93385. lr 5.673642e-04:  30%|██▉       | 1017/3397 [06:16<14:48,  2.68it/s][A
epoch 1 iter 1017: train loss 0.93385. lr 5.673642e-04:  30%|██▉       | 1018/3397 [06:16<14:45,  2.69it/s][A

epoch 1 iter 1050: train loss 0.86467. lr 5.652562e-04:  31%|███       | 1050/3397 [06:28<14:28,  2.70it/s][A
epoch 1 iter 1050: train loss 0.86467. lr 5.652562e-04:  31%|███       | 1051/3397 [06:28<14:23,  2.72it/s][A
epoch 1 iter 1051: train loss 0.88317. lr 5.651914e-04:  31%|███       | 1051/3397 [06:29<14:23,  2.72it/s][A
epoch 1 iter 1051: train loss 0.88317. lr 5.651914e-04:  31%|███       | 1052/3397 [06:29<14:26,  2.71it/s][A
epoch 1 iter 1052: train loss 0.88984. lr 5.651265e-04:  31%|███       | 1052/3397 [06:29<14:26,  2.71it/s][A
epoch 1 iter 1052: train loss 0.88984. lr 5.651265e-04:  31%|███       | 1053/3397 [06:29<14:28,  2.70it/s][A
epoch 1 iter 1053: train loss 0.83789. lr 5.650615e-04:  31%|███       | 1053/3397 [06:29<14:28,  2.70it/s][A
epoch 1 iter 1053: train loss 0.83789. lr 5.650615e-04:  31%|███       | 1054/3397 [06:29<14:27,  2.70it/s][A
epoch 1 iter 1054: train loss 0.84156. lr 5.649965e-04:  31%|███       | 1054/3397 [06:30<14:27,  2.70it/s][A

epoch 1 iter 1086: train loss 0.80220. lr 5.628861e-04:  32%|███▏      | 1087/3397 [06:41<14:09,  2.72it/s][A
epoch 1 iter 1087: train loss 0.80284. lr 5.628192e-04:  32%|███▏      | 1087/3397 [06:42<14:09,  2.72it/s][A
epoch 1 iter 1087: train loss 0.80284. lr 5.628192e-04:  32%|███▏      | 1088/3397 [06:42<14:13,  2.70it/s][A
epoch 1 iter 1088: train loss 0.83734. lr 5.627523e-04:  32%|███▏      | 1088/3397 [06:42<14:13,  2.70it/s][A
epoch 1 iter 1088: train loss 0.83734. lr 5.627523e-04:  32%|███▏      | 1089/3397 [06:42<14:16,  2.69it/s][A
epoch 1 iter 1089: train loss 0.78147. lr 5.626853e-04:  32%|███▏      | 1089/3397 [06:43<14:16,  2.69it/s][A
epoch 1 iter 1089: train loss 0.78147. lr 5.626853e-04:  32%|███▏      | 1090/3397 [06:43<14:16,  2.69it/s][A
epoch 1 iter 1090: train loss 0.81337. lr 5.626182e-04:  32%|███▏      | 1090/3397 [06:43<14:16,  2.69it/s][A
epoch 1 iter 1090: train loss 0.81337. lr 5.626182e-04:  32%|███▏      | 1091/3397 [06:43<14:14,  2.70it/s][A

epoch 1 iter 1123: train loss 0.75926. lr 5.603742e-04:  33%|███▎      | 1123/3397 [06:55<14:00,  2.71it/s][A
epoch 1 iter 1123: train loss 0.75926. lr 5.603742e-04:  33%|███▎      | 1124/3397 [06:55<13:56,  2.72it/s][A
epoch 1 iter 1124: train loss 0.76769. lr 5.603052e-04:  33%|███▎      | 1124/3397 [06:56<13:56,  2.72it/s][A
epoch 1 iter 1124: train loss 0.76769. lr 5.603052e-04:  33%|███▎      | 1125/3397 [06:56<14:04,  2.69it/s][A
epoch 1 iter 1125: train loss 0.74142. lr 5.602362e-04:  33%|███▎      | 1125/3397 [06:56<14:04,  2.69it/s][A
epoch 1 iter 1125: train loss 0.74142. lr 5.602362e-04:  33%|███▎      | 1126/3397 [06:56<14:05,  2.68it/s][A
epoch 1 iter 1126: train loss 0.71485. lr 5.601671e-04:  33%|███▎      | 1126/3397 [06:56<14:05,  2.68it/s][A
epoch 1 iter 1126: train loss 0.71485. lr 5.601671e-04:  33%|███▎      | 1127/3397 [06:56<14:04,  2.69it/s][A
epoch 1 iter 1127: train loss 0.73603. lr 5.600980e-04:  33%|███▎      | 1127/3397 [06:57<14:04,  2.69it/s][A

epoch 1 iter 1159: train loss 0.63951. lr 5.578569e-04:  34%|███▍      | 1160/3397 [07:09<13:49,  2.70it/s][A
epoch 1 iter 1160: train loss 0.71352. lr 5.577860e-04:  34%|███▍      | 1160/3397 [07:09<13:49,  2.70it/s][A
epoch 1 iter 1160: train loss 0.71352. lr 5.577860e-04:  34%|███▍      | 1161/3397 [07:09<13:46,  2.71it/s][A
epoch 1 iter 1161: train loss 0.68993. lr 5.577150e-04:  34%|███▍      | 1161/3397 [07:09<13:46,  2.71it/s][A
epoch 1 iter 1161: train loss 0.68993. lr 5.577150e-04:  34%|███▍      | 1162/3397 [07:09<13:53,  2.68it/s][A
epoch 1 iter 1162: train loss 0.70686. lr 5.576439e-04:  34%|███▍      | 1162/3397 [07:10<13:53,  2.68it/s][A
epoch 1 iter 1162: train loss 0.70686. lr 5.576439e-04:  34%|███▍      | 1163/3397 [07:10<13:57,  2.67it/s][A
epoch 1 iter 1163: train loss 0.70005. lr 5.575728e-04:  34%|███▍      | 1163/3397 [07:10<13:57,  2.67it/s][A
epoch 1 iter 1163: train loss 0.70005. lr 5.575728e-04:  34%|███▍      | 1164/3397 [07:10<14:00,  2.66it/s][A

epoch 1 iter 1196: train loss 0.68894. lr 5.551953e-04:  35%|███▌      | 1196/3397 [07:22<13:36,  2.70it/s][A
epoch 1 iter 1196: train loss 0.68894. lr 5.551953e-04:  35%|███▌      | 1197/3397 [07:22<13:36,  2.69it/s][A
epoch 1 iter 1197: train loss 0.62010. lr 5.551223e-04:  35%|███▌      | 1197/3397 [07:23<13:36,  2.69it/s][A
epoch 1 iter 1197: train loss 0.62010. lr 5.551223e-04:  35%|███▌      | 1198/3397 [07:23<13:35,  2.70it/s][A
epoch 1 iter 1198: train loss 0.69424. lr 5.550493e-04:  35%|███▌      | 1198/3397 [07:23<13:35,  2.70it/s][A
epoch 1 iter 1198: train loss 0.69424. lr 5.550493e-04:  35%|███▌      | 1199/3397 [07:23<13:45,  2.66it/s][A
epoch 1 iter 1199: train loss 0.67205. lr 5.549762e-04:  35%|███▌      | 1199/3397 [07:23<13:45,  2.66it/s][A
epoch 1 iter 1199: train loss 0.67205. lr 5.549762e-04:  35%|███▌      | 1200/3397 [07:23<13:50,  2.64it/s][A
epoch 1 iter 1200: train loss 0.69508. lr 5.549031e-04:  35%|███▌      | 1200/3397 [07:24<13:50,  2.64it/s][A

epoch 1 iter 1232: train loss 0.64051. lr 5.525339e-04:  36%|███▋      | 1233/3397 [07:36<13:19,  2.71it/s][A
epoch 1 iter 1233: train loss 0.63862. lr 5.524589e-04:  36%|███▋      | 1233/3397 [07:36<13:19,  2.71it/s][A
epoch 1 iter 1233: train loss 0.63862. lr 5.524589e-04:  36%|███▋      | 1234/3397 [07:36<13:18,  2.71it/s][A
epoch 1 iter 1234: train loss 0.61253. lr 5.523839e-04:  36%|███▋      | 1234/3397 [07:36<13:18,  2.71it/s][A
epoch 1 iter 1234: train loss 0.61253. lr 5.523839e-04:  36%|███▋      | 1235/3397 [07:36<13:16,  2.71it/s][A
epoch 1 iter 1235: train loss 0.58399. lr 5.523089e-04:  36%|███▋      | 1235/3397 [07:37<13:16,  2.71it/s][A
epoch 1 iter 1235: train loss 0.58399. lr 5.523089e-04:  36%|███▋      | 1236/3397 [07:37<13:13,  2.72it/s][A
epoch 1 iter 1236: train loss 0.67304. lr 5.522338e-04:  36%|███▋      | 1236/3397 [07:37<13:13,  2.72it/s][A
epoch 1 iter 1236: train loss 0.67304. lr 5.522338e-04:  36%|███▋      | 1237/3397 [07:37<13:15,  2.71it/s][A

epoch 1 iter 1269: train loss 0.61053. lr 5.497255e-04:  37%|███▋      | 1269/3397 [07:49<13:09,  2.70it/s][A
epoch 1 iter 1269: train loss 0.61053. lr 5.497255e-04:  37%|███▋      | 1270/3397 [07:49<13:08,  2.70it/s][A
epoch 1 iter 1270: train loss 0.59905. lr 5.496486e-04:  37%|███▋      | 1270/3397 [07:50<13:08,  2.70it/s][A
epoch 1 iter 1270: train loss 0.59905. lr 5.496486e-04:  37%|███▋      | 1271/3397 [07:50<13:07,  2.70it/s][A
epoch 1 iter 1271: train loss 0.55060. lr 5.495716e-04:  37%|███▋      | 1271/3397 [07:50<13:07,  2.70it/s][A
epoch 1 iter 1271: train loss 0.55060. lr 5.495716e-04:  37%|███▋      | 1272/3397 [07:50<13:03,  2.71it/s][A
epoch 1 iter 1272: train loss 0.57313. lr 5.494946e-04:  37%|███▋      | 1272/3397 [07:50<13:03,  2.71it/s][A
epoch 1 iter 1272: train loss 0.57313. lr 5.494946e-04:  37%|███▋      | 1273/3397 [07:50<13:08,  2.69it/s][A
epoch 1 iter 1273: train loss 0.61127. lr 5.494175e-04:  37%|███▋      | 1273/3397 [07:51<13:08,  2.69it/s][A

epoch 1 iter 1305: train loss 0.59057. lr 5.469229e-04:  38%|███▊      | 1306/3397 [08:03<13:02,  2.67it/s][A
epoch 1 iter 1306: train loss 0.58077. lr 5.468441e-04:  38%|███▊      | 1306/3397 [08:03<13:02,  2.67it/s][A
epoch 1 iter 1306: train loss 0.58077. lr 5.468441e-04:  38%|███▊      | 1307/3397 [08:03<13:00,  2.68it/s][A
epoch 1 iter 1307: train loss 0.57537. lr 5.467652e-04:  38%|███▊      | 1307/3397 [08:03<13:00,  2.68it/s][A
epoch 1 iter 1307: train loss 0.57537. lr 5.467652e-04:  39%|███▊      | 1308/3397 [08:03<12:58,  2.68it/s][A
epoch 1 iter 1308: train loss 0.57206. lr 5.466863e-04:  39%|███▊      | 1308/3397 [08:04<12:58,  2.68it/s][A
epoch 1 iter 1308: train loss 0.57206. lr 5.466863e-04:  39%|███▊      | 1309/3397 [08:04<12:58,  2.68it/s][A
epoch 1 iter 1309: train loss 0.54040. lr 5.466073e-04:  39%|███▊      | 1309/3397 [08:04<12:58,  2.68it/s][A
epoch 1 iter 1309: train loss 0.54040. lr 5.466073e-04:  39%|███▊      | 1310/3397 [08:04<12:58,  2.68it/s][A

epoch 1 iter 1342: train loss 0.58278. lr 5.439711e-04:  40%|███▉      | 1342/3397 [08:16<12:56,  2.65it/s][A
epoch 1 iter 1342: train loss 0.58278. lr 5.439711e-04:  40%|███▉      | 1343/3397 [08:16<12:56,  2.64it/s][A
epoch 1 iter 1343: train loss 0.55386. lr 5.438903e-04:  40%|███▉      | 1343/3397 [08:17<12:56,  2.64it/s][A
epoch 1 iter 1343: train loss 0.55386. lr 5.438903e-04:  40%|███▉      | 1344/3397 [08:17<12:54,  2.65it/s][A
epoch 1 iter 1344: train loss 0.52123. lr 5.438095e-04:  40%|███▉      | 1344/3397 [08:17<12:54,  2.65it/s][A
epoch 1 iter 1344: train loss 0.52123. lr 5.438095e-04:  40%|███▉      | 1345/3397 [08:17<12:52,  2.65it/s][A
epoch 1 iter 1345: train loss 0.55784. lr 5.437286e-04:  40%|███▉      | 1345/3397 [08:18<12:52,  2.65it/s][A
epoch 1 iter 1345: train loss 0.55784. lr 5.437286e-04:  40%|███▉      | 1346/3397 [08:18<12:49,  2.67it/s][A
epoch 1 iter 1346: train loss 0.54401. lr 5.436477e-04:  40%|███▉      | 1346/3397 [08:18<12:49,  2.67it/s][A

epoch 1 iter 1378: train loss 0.52347. lr 5.410305e-04:  41%|████      | 1379/3397 [08:30<12:24,  2.71it/s][A
epoch 1 iter 1379: train loss 0.53987. lr 5.409478e-04:  41%|████      | 1379/3397 [08:30<12:24,  2.71it/s][A
epoch 1 iter 1379: train loss 0.53987. lr 5.409478e-04:  41%|████      | 1380/3397 [08:30<12:25,  2.71it/s][A
epoch 1 iter 1380: train loss 0.51201. lr 5.408652e-04:  41%|████      | 1380/3397 [08:31<12:25,  2.71it/s][A
epoch 1 iter 1380: train loss 0.51201. lr 5.408652e-04:  41%|████      | 1381/3397 [08:31<12:25,  2.70it/s][A
epoch 1 iter 1381: train loss 0.52441. lr 5.407824e-04:  41%|████      | 1381/3397 [08:31<12:25,  2.70it/s][A
epoch 1 iter 1381: train loss 0.52441. lr 5.407824e-04:  41%|████      | 1382/3397 [08:31<12:24,  2.71it/s][A
epoch 1 iter 1382: train loss 0.52442. lr 5.406996e-04:  41%|████      | 1382/3397 [08:31<12:24,  2.71it/s][A
epoch 1 iter 1382: train loss 0.52442. lr 5.406996e-04:  41%|████      | 1383/3397 [08:31<12:22,  2.71it/s][A

epoch 1 iter 1415: train loss 0.49624. lr 5.379386e-04:  42%|████▏     | 1415/3397 [08:43<12:08,  2.72it/s][A
epoch 1 iter 1415: train loss 0.49624. lr 5.379386e-04:  42%|████▏     | 1416/3397 [08:43<12:08,  2.72it/s][A
epoch 1 iter 1416: train loss 0.51981. lr 5.378540e-04:  42%|████▏     | 1416/3397 [08:44<12:08,  2.72it/s][A
epoch 1 iter 1416: train loss 0.51981. lr 5.378540e-04:  42%|████▏     | 1417/3397 [08:44<12:09,  2.72it/s][A
epoch 1 iter 1417: train loss 0.49203. lr 5.377694e-04:  42%|████▏     | 1417/3397 [08:44<12:09,  2.72it/s][A
epoch 1 iter 1417: train loss 0.49203. lr 5.377694e-04:  42%|████▏     | 1418/3397 [08:44<12:09,  2.71it/s][A
epoch 1 iter 1418: train loss 0.50967. lr 5.376848e-04:  42%|████▏     | 1418/3397 [08:45<12:09,  2.71it/s][A
epoch 1 iter 1418: train loss 0.50967. lr 5.376848e-04:  42%|████▏     | 1419/3397 [08:45<12:08,  2.72it/s][A
epoch 1 iter 1419: train loss 0.50497. lr 5.376001e-04:  42%|████▏     | 1419/3397 [08:45<12:08,  2.72it/s][A

epoch 1 iter 1451: train loss 0.51662. lr 5.348633e-04:  43%|████▎     | 1452/3397 [08:57<12:00,  2.70it/s][A
epoch 1 iter 1452: train loss 0.51494. lr 5.347770e-04:  43%|████▎     | 1452/3397 [08:57<12:00,  2.70it/s][A
epoch 1 iter 1452: train loss 0.51494. lr 5.347770e-04:  43%|████▎     | 1453/3397 [08:57<11:59,  2.70it/s][A
epoch 1 iter 1453: train loss 0.48066. lr 5.346905e-04:  43%|████▎     | 1453/3397 [08:57<11:59,  2.70it/s][A
epoch 1 iter 1453: train loss 0.48066. lr 5.346905e-04:  43%|████▎     | 1454/3397 [08:57<11:57,  2.71it/s][A
epoch 1 iter 1454: train loss 0.45758. lr 5.346041e-04:  43%|████▎     | 1454/3397 [08:58<11:57,  2.71it/s][A
epoch 1 iter 1454: train loss 0.45758. lr 5.346041e-04:  43%|████▎     | 1455/3397 [08:58<11:53,  2.72it/s][A
epoch 1 iter 1455: train loss 0.50681. lr 5.345176e-04:  43%|████▎     | 1455/3397 [08:58<11:53,  2.72it/s][A
epoch 1 iter 1455: train loss 0.50681. lr 5.345176e-04:  43%|████▎     | 1456/3397 [08:58<11:53,  2.72it/s][A

epoch 1 iter 1488: train loss 0.47617. lr 5.316348e-04:  44%|████▍     | 1488/3397 [09:10<11:52,  2.68it/s][A
epoch 1 iter 1488: train loss 0.47617. lr 5.316348e-04:  44%|████▍     | 1489/3397 [09:10<11:52,  2.68it/s][A
epoch 1 iter 1489: train loss 0.45488. lr 5.315466e-04:  44%|████▍     | 1489/3397 [09:11<11:52,  2.68it/s][A
epoch 1 iter 1489: train loss 0.45488. lr 5.315466e-04:  44%|████▍     | 1490/3397 [09:11<11:49,  2.69it/s][A
epoch 1 iter 1490: train loss 0.47238. lr 5.314583e-04:  44%|████▍     | 1490/3397 [09:11<11:49,  2.69it/s][A
epoch 1 iter 1490: train loss 0.47238. lr 5.314583e-04:  44%|████▍     | 1491/3397 [09:11<11:48,  2.69it/s][A
epoch 1 iter 1491: train loss 0.45593. lr 5.313700e-04:  44%|████▍     | 1491/3397 [09:12<11:48,  2.69it/s][A
epoch 1 iter 1491: train loss 0.45593. lr 5.313700e-04:  44%|████▍     | 1492/3397 [09:12<11:45,  2.70it/s][A
epoch 1 iter 1492: train loss 0.49543. lr 5.312817e-04:  44%|████▍     | 1492/3397 [09:12<11:45,  2.70it/s][A

epoch 1 iter 1524: train loss 0.45384. lr 5.284284e-04:  45%|████▍     | 1525/3397 [09:24<11:32,  2.70it/s][A
epoch 1 iter 1525: train loss 0.45281. lr 5.283384e-04:  45%|████▍     | 1525/3397 [09:24<11:32,  2.70it/s][A
epoch 1 iter 1525: train loss 0.45281. lr 5.283384e-04:  45%|████▍     | 1526/3397 [09:24<11:33,  2.70it/s][A
epoch 1 iter 1526: train loss 0.48648. lr 5.282484e-04:  45%|████▍     | 1526/3397 [09:25<11:33,  2.70it/s][A
epoch 1 iter 1526: train loss 0.48648. lr 5.282484e-04:  45%|████▍     | 1527/3397 [09:25<11:32,  2.70it/s][A
epoch 1 iter 1527: train loss 0.44619. lr 5.281583e-04:  45%|████▍     | 1527/3397 [09:25<11:32,  2.70it/s][A
epoch 1 iter 1527: train loss 0.44619. lr 5.281583e-04:  45%|████▍     | 1528/3397 [09:25<11:31,  2.70it/s][A
epoch 1 iter 1528: train loss 0.44033. lr 5.280682e-04:  45%|████▍     | 1528/3397 [09:25<11:31,  2.70it/s][A
epoch 1 iter 1528: train loss 0.44033. lr 5.280682e-04:  45%|████▌     | 1529/3397 [09:25<11:28,  2.71it/s][A

epoch 1 iter 1561: train loss 0.47232. lr 5.250670e-04:  46%|████▌     | 1561/3397 [09:38<11:26,  2.67it/s][A
epoch 1 iter 1561: train loss 0.47232. lr 5.250670e-04:  46%|████▌     | 1562/3397 [09:38<11:24,  2.68it/s][A
epoch 1 iter 1562: train loss 0.44141. lr 5.249752e-04:  46%|████▌     | 1562/3397 [09:38<11:24,  2.68it/s][A
epoch 1 iter 1562: train loss 0.44141. lr 5.249752e-04:  46%|████▌     | 1563/3397 [09:38<11:23,  2.68it/s][A
epoch 1 iter 1563: train loss 0.44730. lr 5.248834e-04:  46%|████▌     | 1563/3397 [09:38<11:23,  2.68it/s][A
epoch 1 iter 1563: train loss 0.44730. lr 5.248834e-04:  46%|████▌     | 1564/3397 [09:38<11:21,  2.69it/s][A
epoch 1 iter 1564: train loss 0.44179. lr 5.247915e-04:  46%|████▌     | 1564/3397 [09:39<11:21,  2.69it/s][A
epoch 1 iter 1564: train loss 0.44179. lr 5.247915e-04:  46%|████▌     | 1565/3397 [09:39<11:19,  2.70it/s][A
epoch 1 iter 1565: train loss 0.47320. lr 5.246996e-04:  46%|████▌     | 1565/3397 [09:39<11:19,  2.70it/s][A

epoch 1 iter 1597: train loss 0.44295. lr 5.217331e-04:  47%|████▋     | 1598/3397 [09:51<11:04,  2.71it/s][A
epoch 1 iter 1598: train loss 0.45641. lr 5.216396e-04:  47%|████▋     | 1598/3397 [09:51<11:04,  2.71it/s][A
epoch 1 iter 1598: train loss 0.45641. lr 5.216396e-04:  47%|████▋     | 1599/3397 [09:51<12:08,  2.47it/s][A
epoch 1 iter 1599: train loss 0.42131. lr 5.215461e-04:  47%|████▋     | 1599/3397 [09:52<12:08,  2.47it/s][A
epoch 1 iter 1599: train loss 0.42131. lr 5.215461e-04:  47%|████▋     | 1600/3397 [09:52<11:49,  2.53it/s][A
epoch 1 iter 1600: train loss 0.41750. lr 5.214525e-04:  47%|████▋     | 1600/3397 [09:52<11:49,  2.53it/s][A
epoch 1 iter 1600: train loss 0.41750. lr 5.214525e-04:  47%|████▋     | 1601/3397 [09:52<11:35,  2.58it/s][A
epoch 1 iter 1601: train loss 0.39082. lr 5.213588e-04:  47%|████▋     | 1601/3397 [09:52<11:35,  2.58it/s][A
epoch 1 iter 1601: train loss 0.39082. lr 5.213588e-04:  47%|████▋     | 1602/3397 [09:52<11:25,  2.62it/s][A

epoch 1 iter 1634: train loss 0.43038. lr 5.182426e-04:  48%|████▊     | 1634/3397 [10:05<10:47,  2.72it/s][A
epoch 1 iter 1634: train loss 0.43038. lr 5.182426e-04:  48%|████▊     | 1635/3397 [10:05<11:12,  2.62it/s][A
epoch 1 iter 1635: train loss 0.42351. lr 5.181473e-04:  48%|████▊     | 1635/3397 [10:05<11:12,  2.62it/s][A
epoch 1 iter 1635: train loss 0.42351. lr 5.181473e-04:  48%|████▊     | 1636/3397 [10:05<11:21,  2.58it/s][A
epoch 1 iter 1636: train loss 0.38733. lr 5.180521e-04:  48%|████▊     | 1636/3397 [10:05<11:21,  2.58it/s][A
epoch 1 iter 1636: train loss 0.38733. lr 5.180521e-04:  48%|████▊     | 1637/3397 [10:05<11:26,  2.56it/s][A
epoch 1 iter 1637: train loss 0.40080. lr 5.179567e-04:  48%|████▊     | 1637/3397 [10:06<11:26,  2.56it/s][A
epoch 1 iter 1637: train loss 0.40080. lr 5.179567e-04:  48%|████▊     | 1638/3397 [10:06<11:27,  2.56it/s][A
epoch 1 iter 1638: train loss 0.41527. lr 5.178614e-04:  48%|████▊     | 1638/3397 [10:06<11:27,  2.56it/s][A

epoch 1 iter 1670: train loss 0.42054. lr 5.147850e-04:  49%|████▉     | 1671/3397 [10:18<10:36,  2.71it/s][A
epoch 1 iter 1671: train loss 0.41586. lr 5.146881e-04:  49%|████▉     | 1671/3397 [10:18<10:36,  2.71it/s][A
epoch 1 iter 1671: train loss 0.41586. lr 5.146881e-04:  49%|████▉     | 1672/3397 [10:18<10:37,  2.71it/s][A
epoch 1 iter 1672: train loss 0.39145. lr 5.145912e-04:  49%|████▉     | 1672/3397 [10:19<10:37,  2.71it/s][A
epoch 1 iter 1672: train loss 0.39145. lr 5.145912e-04:  49%|████▉     | 1673/3397 [10:19<10:33,  2.72it/s][A
epoch 1 iter 1673: train loss 0.38015. lr 5.144942e-04:  49%|████▉     | 1673/3397 [10:19<10:33,  2.72it/s][A
epoch 1 iter 1673: train loss 0.38015. lr 5.144942e-04:  49%|████▉     | 1674/3397 [10:19<10:34,  2.71it/s][A
epoch 1 iter 1674: train loss 0.41290. lr 5.143972e-04:  49%|████▉     | 1674/3397 [10:19<10:34,  2.71it/s][A
epoch 1 iter 1674: train loss 0.41290. lr 5.143972e-04:  49%|████▉     | 1675/3397 [10:19<10:35,  2.71it/s][A

epoch 1 iter 1707: train loss 0.41937. lr 5.111694e-04:  50%|█████     | 1707/3397 [10:32<10:31,  2.68it/s][A
epoch 1 iter 1707: train loss 0.41937. lr 5.111694e-04:  50%|█████     | 1708/3397 [10:32<10:29,  2.68it/s][A
epoch 1 iter 1708: train loss 0.39421. lr 5.110708e-04:  50%|█████     | 1708/3397 [10:32<10:29,  2.68it/s][A
epoch 1 iter 1708: train loss 0.39421. lr 5.110708e-04:  50%|█████     | 1709/3397 [10:32<10:24,  2.70it/s][A
epoch 1 iter 1709: train loss 0.37754. lr 5.109722e-04:  50%|█████     | 1709/3397 [10:33<10:24,  2.70it/s][A
epoch 1 iter 1709: train loss 0.37754. lr 5.109722e-04:  50%|█████     | 1710/3397 [10:33<10:27,  2.69it/s][A
epoch 1 iter 1710: train loss 0.38143. lr 5.108735e-04:  50%|█████     | 1710/3397 [10:33<10:27,  2.69it/s][A
epoch 1 iter 1710: train loss 0.38143. lr 5.108735e-04:  50%|█████     | 1711/3397 [10:33<10:29,  2.68it/s][A
epoch 1 iter 1711: train loss 0.37897. lr 5.107748e-04:  50%|█████     | 1711/3397 [10:33<10:29,  2.68it/s][A

epoch 1 iter 1743: train loss 0.37099. lr 5.075921e-04:  51%|█████▏    | 1744/3397 [10:45<10:09,  2.71it/s][A
epoch 1 iter 1744: train loss 0.39251. lr 5.074919e-04:  51%|█████▏    | 1744/3397 [10:46<10:09,  2.71it/s][A
epoch 1 iter 1744: train loss 0.39251. lr 5.074919e-04:  51%|█████▏    | 1745/3397 [10:46<10:06,  2.73it/s][A
epoch 1 iter 1745: train loss 0.40302. lr 5.073917e-04:  51%|█████▏    | 1745/3397 [10:46<10:06,  2.73it/s][A
epoch 1 iter 1745: train loss 0.40302. lr 5.073917e-04:  51%|█████▏    | 1746/3397 [10:46<10:06,  2.72it/s][A
epoch 1 iter 1746: train loss 0.39996. lr 5.072914e-04:  51%|█████▏    | 1746/3397 [10:46<10:06,  2.72it/s][A
epoch 1 iter 1746: train loss 0.39996. lr 5.072914e-04:  51%|█████▏    | 1747/3397 [10:46<10:07,  2.72it/s][A
epoch 1 iter 1747: train loss 0.39311. lr 5.071911e-04:  51%|█████▏    | 1747/3397 [10:47<10:07,  2.72it/s][A
epoch 1 iter 1747: train loss 0.39311. lr 5.071911e-04:  51%|█████▏    | 1748/3397 [10:47<10:07,  2.72it/s][A

epoch 1 iter 1780: train loss 0.37045. lr 5.038555e-04:  52%|█████▏    | 1780/3397 [10:59<09:54,  2.72it/s][A
epoch 1 iter 1780: train loss 0.37045. lr 5.038555e-04:  52%|█████▏    | 1781/3397 [10:59<09:55,  2.72it/s][A
epoch 1 iter 1781: train loss 0.38505. lr 5.037537e-04:  52%|█████▏    | 1781/3397 [10:59<09:55,  2.72it/s][A
epoch 1 iter 1781: train loss 0.38505. lr 5.037537e-04:  52%|█████▏    | 1782/3397 [10:59<09:56,  2.71it/s][A
epoch 1 iter 1782: train loss 0.37867. lr 5.036518e-04:  52%|█████▏    | 1782/3397 [11:00<09:56,  2.71it/s][A
epoch 1 iter 1782: train loss 0.37867. lr 5.036518e-04:  52%|█████▏    | 1783/3397 [11:00<09:55,  2.71it/s][A
epoch 1 iter 1783: train loss 0.36163. lr 5.035499e-04:  52%|█████▏    | 1783/3397 [11:00<09:55,  2.71it/s][A
epoch 1 iter 1783: train loss 0.36163. lr 5.035499e-04:  53%|█████▎    | 1784/3397 [11:00<09:55,  2.71it/s][A
epoch 1 iter 1784: train loss 0.37317. lr 5.034479e-04:  53%|█████▎    | 1784/3397 [11:00<09:55,  2.71it/s][A

epoch 1 iter 1816: train loss 0.35729. lr 5.001626e-04:  53%|█████▎    | 1817/3397 [11:12<10:01,  2.62it/s][A
epoch 1 iter 1817: train loss 0.37958. lr 5.000592e-04:  53%|█████▎    | 1817/3397 [11:13<10:01,  2.62it/s][A
epoch 1 iter 1817: train loss 0.37958. lr 5.000592e-04:  54%|█████▎    | 1818/3397 [11:13<09:58,  2.64it/s][A
epoch 1 iter 1818: train loss 0.38305. lr 4.999558e-04:  54%|█████▎    | 1818/3397 [11:13<09:58,  2.64it/s][A
epoch 1 iter 1818: train loss 0.38305. lr 4.999558e-04:  54%|█████▎    | 1819/3397 [11:13<09:55,  2.65it/s][A
epoch 1 iter 1819: train loss 0.37194. lr 4.998523e-04:  54%|█████▎    | 1819/3397 [11:13<09:55,  2.65it/s][A
epoch 1 iter 1819: train loss 0.37194. lr 4.998523e-04:  54%|█████▎    | 1820/3397 [11:13<09:53,  2.66it/s][A
epoch 1 iter 1820: train loss 0.39513. lr 4.997488e-04:  54%|█████▎    | 1820/3397 [11:14<09:53,  2.66it/s][A
epoch 1 iter 1820: train loss 0.39513. lr 4.997488e-04:  54%|█████▎    | 1821/3397 [11:14<09:48,  2.68it/s][A

epoch 1 iter 1853: train loss 0.34132. lr 4.963092e-04:  55%|█████▍    | 1853/3397 [11:26<09:29,  2.71it/s][A
epoch 1 iter 1853: train loss 0.34132. lr 4.963092e-04:  55%|█████▍    | 1854/3397 [11:26<09:29,  2.71it/s][A
epoch 1 iter 1854: train loss 0.34424. lr 4.962043e-04:  55%|█████▍    | 1854/3397 [11:26<09:29,  2.71it/s][A
epoch 1 iter 1854: train loss 0.34424. lr 4.962043e-04:  55%|█████▍    | 1855/3397 [11:26<09:29,  2.71it/s][A
epoch 1 iter 1855: train loss 0.36505. lr 4.960993e-04:  55%|█████▍    | 1855/3397 [11:27<09:29,  2.71it/s][A
epoch 1 iter 1855: train loss 0.36505. lr 4.960993e-04:  55%|█████▍    | 1856/3397 [11:27<09:26,  2.72it/s][A
epoch 1 iter 1856: train loss 0.34935. lr 4.959943e-04:  55%|█████▍    | 1856/3397 [11:27<09:26,  2.72it/s][A
epoch 1 iter 1856: train loss 0.34935. lr 4.959943e-04:  55%|█████▍    | 1857/3397 [11:27<09:27,  2.71it/s][A
epoch 1 iter 1857: train loss 0.38389. lr 4.958892e-04:  55%|█████▍    | 1857/3397 [11:27<09:27,  2.71it/s][A

epoch 1 iter 1889: train loss 0.35259. lr 4.925048e-04:  56%|█████▌    | 1890/3397 [11:39<09:15,  2.71it/s][A
epoch 1 iter 1890: train loss 0.35768. lr 4.923984e-04:  56%|█████▌    | 1890/3397 [11:40<09:15,  2.71it/s][A
epoch 1 iter 1890: train loss 0.35768. lr 4.923984e-04:  56%|█████▌    | 1891/3397 [11:40<09:12,  2.72it/s][A
epoch 1 iter 1891: train loss 0.38051. lr 4.922919e-04:  56%|█████▌    | 1891/3397 [11:40<09:12,  2.72it/s][A
epoch 1 iter 1891: train loss 0.38051. lr 4.922919e-04:  56%|█████▌    | 1892/3397 [11:40<09:13,  2.72it/s][A
epoch 1 iter 1892: train loss 0.35233. lr 4.921854e-04:  56%|█████▌    | 1892/3397 [11:40<09:13,  2.72it/s][A
epoch 1 iter 1892: train loss 0.35233. lr 4.921854e-04:  56%|█████▌    | 1893/3397 [11:40<09:14,  2.71it/s][A
epoch 1 iter 1893: train loss 0.36508. lr 4.920788e-04:  56%|█████▌    | 1893/3397 [11:41<09:14,  2.71it/s][A
epoch 1 iter 1893: train loss 0.36508. lr 4.920788e-04:  56%|█████▌    | 1894/3397 [11:41<09:14,  2.71it/s][A

epoch 1 iter 1926: train loss 0.36642. lr 4.885392e-04:  57%|█████▋    | 1926/3397 [11:53<09:02,  2.71it/s][A
epoch 1 iter 1926: train loss 0.36642. lr 4.885392e-04:  57%|█████▋    | 1927/3397 [11:53<09:05,  2.70it/s][A
epoch 1 iter 1927: train loss 0.31589. lr 4.884312e-04:  57%|█████▋    | 1927/3397 [11:54<09:05,  2.70it/s][A
epoch 1 iter 1927: train loss 0.31589. lr 4.884312e-04:  57%|█████▋    | 1928/3397 [11:54<09:07,  2.68it/s][A
epoch 1 iter 1928: train loss 0.36543. lr 4.883232e-04:  57%|█████▋    | 1928/3397 [11:54<09:07,  2.68it/s][A
epoch 1 iter 1928: train loss 0.36543. lr 4.883232e-04:  57%|█████▋    | 1929/3397 [11:54<09:06,  2.69it/s][A
epoch 1 iter 1929: train loss 0.33130. lr 4.882152e-04:  57%|█████▋    | 1929/3397 [11:54<09:06,  2.69it/s][A
epoch 1 iter 1929: train loss 0.33130. lr 4.882152e-04:  57%|█████▋    | 1930/3397 [11:54<09:05,  2.69it/s][A
epoch 1 iter 1930: train loss 0.34943. lr 4.881071e-04:  57%|█████▋    | 1930/3397 [11:55<09:05,  2.69it/s][A

epoch 1 iter 1962: train loss 0.35525. lr 4.846277e-04:  58%|█████▊    | 1963/3397 [12:06<08:49,  2.71it/s][A
epoch 1 iter 1963: train loss 0.34110. lr 4.845183e-04:  58%|█████▊    | 1963/3397 [12:07<08:49,  2.71it/s][A
epoch 1 iter 1963: train loss 0.34110. lr 4.845183e-04:  58%|█████▊    | 1964/3397 [12:07<08:50,  2.70it/s][A
epoch 1 iter 1964: train loss 0.30680. lr 4.844088e-04:  58%|█████▊    | 1964/3397 [12:07<08:50,  2.70it/s][A
epoch 1 iter 1964: train loss 0.30680. lr 4.844088e-04:  58%|█████▊    | 1965/3397 [12:07<08:49,  2.70it/s][A
epoch 1 iter 1965: train loss 0.35692. lr 4.842994e-04:  58%|█████▊    | 1965/3397 [12:08<08:49,  2.70it/s][A
epoch 1 iter 1965: train loss 0.35692. lr 4.842994e-04:  58%|█████▊    | 1966/3397 [12:08<08:48,  2.71it/s][A
epoch 1 iter 1966: train loss 0.34033. lr 4.841899e-04:  58%|█████▊    | 1966/3397 [12:08<08:48,  2.71it/s][A
epoch 1 iter 1966: train loss 0.34033. lr 4.841899e-04:  58%|█████▊    | 1967/3397 [12:08<08:47,  2.71it/s][A

epoch 1 iter 1999: train loss 0.33128. lr 4.805542e-04:  59%|█████▉    | 1999/3397 [12:20<08:36,  2.70it/s][A
epoch 1 iter 1999: train loss 0.33128. lr 4.805542e-04:  59%|█████▉    | 2000/3397 [12:20<08:35,  2.71it/s][A
epoch 1 iter 2000: train loss 0.32300. lr 4.804434e-04:  59%|█████▉    | 2000/3397 [12:20<08:35,  2.71it/s][A
epoch 1 iter 2000: train loss 0.32300. lr 4.804434e-04:  59%|█████▉    | 2001/3397 [12:20<08:33,  2.72it/s][A
epoch 1 iter 2001: train loss 0.34292. lr 4.803325e-04:  59%|█████▉    | 2001/3397 [12:21<08:33,  2.72it/s][A
epoch 1 iter 2001: train loss 0.34292. lr 4.803325e-04:  59%|█████▉    | 2002/3397 [12:21<08:34,  2.71it/s][A
epoch 1 iter 2002: train loss 0.35267. lr 4.802216e-04:  59%|█████▉    | 2002/3397 [12:21<08:34,  2.71it/s][A
epoch 1 iter 2002: train loss 0.35267. lr 4.802216e-04:  59%|█████▉    | 2003/3397 [12:21<08:35,  2.70it/s][A
epoch 1 iter 2003: train loss 0.35066. lr 4.801106e-04:  59%|█████▉    | 2003/3397 [12:22<08:35,  2.70it/s][A

epoch 1 iter 2035: train loss 0.31905. lr 4.765400e-04:  60%|█████▉    | 2036/3397 [12:33<08:22,  2.71it/s][A
epoch 1 iter 2036: train loss 0.32366. lr 4.764278e-04:  60%|█████▉    | 2036/3397 [12:34<08:22,  2.71it/s][A
epoch 1 iter 2036: train loss 0.32366. lr 4.764278e-04:  60%|█████▉    | 2037/3397 [12:34<08:22,  2.71it/s][A
epoch 1 iter 2037: train loss 0.33856. lr 4.763156e-04:  60%|█████▉    | 2037/3397 [12:34<08:22,  2.71it/s][A
epoch 1 iter 2037: train loss 0.33856. lr 4.763156e-04:  60%|█████▉    | 2038/3397 [12:34<08:22,  2.70it/s][A
epoch 1 iter 2038: train loss 0.33738. lr 4.762033e-04:  60%|█████▉    | 2038/3397 [12:34<08:22,  2.70it/s][A
epoch 1 iter 2038: train loss 0.33738. lr 4.762033e-04:  60%|██████    | 2039/3397 [12:34<08:22,  2.70it/s][A
epoch 1 iter 2039: train loss 0.33030. lr 4.760910e-04:  60%|██████    | 2039/3397 [12:35<08:22,  2.70it/s][A
epoch 1 iter 2039: train loss 0.33030. lr 4.760910e-04:  60%|██████    | 2040/3397 [12:35<08:21,  2.71it/s][A

epoch 1 iter 2072: train loss 0.31608. lr 4.723634e-04:  61%|██████    | 2072/3397 [12:47<08:08,  2.71it/s][A
epoch 1 iter 2072: train loss 0.31608. lr 4.723634e-04:  61%|██████    | 2073/3397 [12:47<08:10,  2.70it/s][A
epoch 1 iter 2073: train loss 0.31398. lr 4.722498e-04:  61%|██████    | 2073/3397 [12:47<08:10,  2.70it/s][A
epoch 1 iter 2073: train loss 0.31398. lr 4.722498e-04:  61%|██████    | 2074/3397 [12:47<08:11,  2.69it/s][A
epoch 1 iter 2074: train loss 0.31421. lr 4.721362e-04:  61%|██████    | 2074/3397 [12:48<08:11,  2.69it/s][A
epoch 1 iter 2074: train loss 0.31421. lr 4.721362e-04:  61%|██████    | 2075/3397 [12:48<08:10,  2.69it/s][A
epoch 1 iter 2075: train loss 0.32965. lr 4.720225e-04:  61%|██████    | 2075/3397 [12:48<08:10,  2.69it/s][A
epoch 1 iter 2075: train loss 0.32965. lr 4.720225e-04:  61%|██████    | 2076/3397 [12:48<08:09,  2.70it/s][A
epoch 1 iter 2076: train loss 0.32678. lr 4.719088e-04:  61%|██████    | 2076/3397 [12:48<08:09,  2.70it/s][A

epoch 1 iter 2108: train loss 0.31739. lr 4.682512e-04:  62%|██████▏   | 2109/3397 [13:00<08:02,  2.67it/s][A
epoch 1 iter 2109: train loss 0.32403. lr 4.681363e-04:  62%|██████▏   | 2109/3397 [13:01<08:02,  2.67it/s][A
epoch 1 iter 2109: train loss 0.32403. lr 4.681363e-04:  62%|██████▏   | 2110/3397 [13:01<08:05,  2.65it/s][A
epoch 1 iter 2110: train loss 0.31305. lr 4.680213e-04:  62%|██████▏   | 2110/3397 [13:01<08:05,  2.65it/s][A
epoch 1 iter 2110: train loss 0.31305. lr 4.680213e-04:  62%|██████▏   | 2111/3397 [13:01<08:05,  2.65it/s][A
epoch 1 iter 2111: train loss 0.32836. lr 4.679063e-04:  62%|██████▏   | 2111/3397 [13:01<08:05,  2.65it/s][A
epoch 1 iter 2111: train loss 0.32836. lr 4.679063e-04:  62%|██████▏   | 2112/3397 [13:01<08:04,  2.65it/s][A
epoch 1 iter 2112: train loss 0.32253. lr 4.677913e-04:  62%|██████▏   | 2112/3397 [13:02<08:04,  2.65it/s][A
epoch 1 iter 2112: train loss 0.32253. lr 4.677913e-04:  62%|██████▏   | 2113/3397 [13:02<08:03,  2.66it/s][A

epoch 1 iter 2145: train loss 0.31167. lr 4.639761e-04:  63%|██████▎   | 2145/3397 [13:14<07:42,  2.71it/s][A
epoch 1 iter 2145: train loss 0.31167. lr 4.639761e-04:  63%|██████▎   | 2146/3397 [13:14<07:40,  2.72it/s][A
epoch 1 iter 2146: train loss 0.30898. lr 4.638599e-04:  63%|██████▎   | 2146/3397 [13:14<07:40,  2.72it/s][A
epoch 1 iter 2146: train loss 0.30898. lr 4.638599e-04:  63%|██████▎   | 2147/3397 [13:14<07:42,  2.70it/s][A
epoch 1 iter 2147: train loss 0.30968. lr 4.637436e-04:  63%|██████▎   | 2147/3397 [13:15<07:42,  2.70it/s][A
epoch 1 iter 2147: train loss 0.30968. lr 4.637436e-04:  63%|██████▎   | 2148/3397 [13:15<07:44,  2.69it/s][A
epoch 1 iter 2148: train loss 0.30892. lr 4.636274e-04:  63%|██████▎   | 2148/3397 [13:15<07:44,  2.69it/s][A
epoch 1 iter 2148: train loss 0.30892. lr 4.636274e-04:  63%|██████▎   | 2149/3397 [13:15<07:43,  2.69it/s][A
epoch 1 iter 2149: train loss 0.30947. lr 4.635110e-04:  63%|██████▎   | 2149/3397 [13:16<07:43,  2.69it/s][A

epoch 1 iter 2181: train loss 0.28370. lr 4.597705e-04:  64%|██████▍   | 2182/3397 [13:27<07:28,  2.71it/s][A
epoch 1 iter 2182: train loss 0.30920. lr 4.596530e-04:  64%|██████▍   | 2182/3397 [13:28<07:28,  2.71it/s][A
epoch 1 iter 2182: train loss 0.30920. lr 4.596530e-04:  64%|██████▍   | 2183/3397 [13:28<07:27,  2.71it/s][A
epoch 1 iter 2183: train loss 0.29116. lr 4.595355e-04:  64%|██████▍   | 2183/3397 [13:28<07:27,  2.71it/s][A
epoch 1 iter 2183: train loss 0.29116. lr 4.595355e-04:  64%|██████▍   | 2184/3397 [13:28<07:25,  2.72it/s][A
epoch 1 iter 2184: train loss 0.29002. lr 4.594180e-04:  64%|██████▍   | 2184/3397 [13:29<07:25,  2.72it/s][A
epoch 1 iter 2184: train loss 0.29002. lr 4.594180e-04:  64%|██████▍   | 2185/3397 [13:29<07:27,  2.71it/s][A
epoch 1 iter 2185: train loss 0.32897. lr 4.593004e-04:  64%|██████▍   | 2185/3397 [13:29<07:27,  2.71it/s][A
epoch 1 iter 2185: train loss 0.32897. lr 4.593004e-04:  64%|██████▍   | 2186/3397 [13:29<07:27,  2.70it/s][A

epoch 1 iter 2218: train loss 0.31573. lr 4.554019e-04:  65%|██████▌   | 2218/3397 [13:41<07:14,  2.71it/s][A
epoch 1 iter 2218: train loss 0.31573. lr 4.554019e-04:  65%|██████▌   | 2219/3397 [13:41<07:13,  2.71it/s][A
epoch 1 iter 2219: train loss 0.33781. lr 4.552832e-04:  65%|██████▌   | 2219/3397 [13:41<07:13,  2.71it/s][A
epoch 1 iter 2219: train loss 0.33781. lr 4.552832e-04:  65%|██████▌   | 2220/3397 [13:41<07:11,  2.73it/s][A
epoch 1 iter 2220: train loss 0.27910. lr 4.551644e-04:  65%|██████▌   | 2220/3397 [13:42<07:11,  2.73it/s][A
epoch 1 iter 2220: train loss 0.27910. lr 4.551644e-04:  65%|██████▌   | 2221/3397 [13:42<07:12,  2.72it/s][A
epoch 1 iter 2221: train loss 0.30049. lr 4.550457e-04:  65%|██████▌   | 2221/3397 [13:42<07:12,  2.72it/s][A
epoch 1 iter 2221: train loss 0.30049. lr 4.550457e-04:  65%|██████▌   | 2222/3397 [13:42<07:13,  2.71it/s][A
epoch 1 iter 2222: train loss 0.28441. lr 4.549269e-04:  65%|██████▌   | 2222/3397 [13:43<07:13,  2.71it/s][A

epoch 1 iter 2254: train loss 0.29941. lr 4.511077e-04:  66%|██████▋   | 2255/3397 [13:54<06:59,  2.72it/s][A
epoch 1 iter 2255: train loss 0.29911. lr 4.509878e-04:  66%|██████▋   | 2255/3397 [13:55<06:59,  2.72it/s][A
epoch 1 iter 2255: train loss 0.29911. lr 4.509878e-04:  66%|██████▋   | 2256/3397 [13:55<07:00,  2.71it/s][A
epoch 1 iter 2256: train loss 0.28937. lr 4.508679e-04:  66%|██████▋   | 2256/3397 [13:55<07:00,  2.71it/s][A
epoch 1 iter 2256: train loss 0.28937. lr 4.508679e-04:  66%|██████▋   | 2257/3397 [13:55<07:00,  2.71it/s][A
epoch 1 iter 2257: train loss 0.29099. lr 4.507479e-04:  66%|██████▋   | 2257/3397 [13:55<07:00,  2.71it/s][A
epoch 1 iter 2257: train loss 0.29099. lr 4.507479e-04:  66%|██████▋   | 2258/3397 [13:55<07:00,  2.71it/s][A
epoch 1 iter 2258: train loss 0.30513. lr 4.506279e-04:  66%|██████▋   | 2258/3397 [13:56<07:00,  2.71it/s][A
epoch 1 iter 2258: train loss 0.30513. lr 4.506279e-04:  66%|██████▋   | 2259/3397 [13:56<06:59,  2.71it/s][A

epoch 1 iter 2291: train loss 0.28502. lr 4.466505e-04:  67%|██████▋   | 2291/3397 [14:08<06:47,  2.71it/s][A
epoch 1 iter 2291: train loss 0.28502. lr 4.466505e-04:  67%|██████▋   | 2292/3397 [14:08<06:48,  2.71it/s][A
epoch 1 iter 2292: train loss 0.26964. lr 4.465295e-04:  67%|██████▋   | 2292/3397 [14:08<06:48,  2.71it/s][A
epoch 1 iter 2292: train loss 0.26964. lr 4.465295e-04:  68%|██████▊   | 2293/3397 [14:08<06:47,  2.71it/s][A
epoch 1 iter 2293: train loss 0.30450. lr 4.464084e-04:  68%|██████▊   | 2293/3397 [14:09<06:47,  2.71it/s][A
epoch 1 iter 2293: train loss 0.30450. lr 4.464084e-04:  68%|██████▊   | 2294/3397 [14:09<06:46,  2.71it/s][A
epoch 1 iter 2294: train loss 0.31011. lr 4.462873e-04:  68%|██████▊   | 2294/3397 [14:09<06:46,  2.71it/s][A
epoch 1 iter 2294: train loss 0.31011. lr 4.462873e-04:  68%|██████▊   | 2295/3397 [14:09<06:44,  2.72it/s][A
epoch 1 iter 2295: train loss 0.30603. lr 4.461661e-04:  68%|██████▊   | 2295/3397 [14:10<06:44,  2.72it/s][A

epoch 1 iter 2327: train loss 0.29782. lr 4.422726e-04:  69%|██████▊   | 2328/3397 [14:21<06:34,  2.71it/s][A
epoch 1 iter 2328: train loss 0.29272. lr 4.421505e-04:  69%|██████▊   | 2328/3397 [14:22<06:34,  2.71it/s][A
epoch 1 iter 2328: train loss 0.29272. lr 4.421505e-04:  69%|██████▊   | 2329/3397 [14:22<06:33,  2.71it/s][A
epoch 1 iter 2329: train loss 0.29814. lr 4.420282e-04:  69%|██████▊   | 2329/3397 [14:22<06:33,  2.71it/s][A
epoch 1 iter 2329: train loss 0.29814. lr 4.420282e-04:  69%|██████▊   | 2330/3397 [14:22<06:33,  2.71it/s][A
epoch 1 iter 2330: train loss 0.28131. lr 4.419060e-04:  69%|██████▊   | 2330/3397 [14:22<06:33,  2.71it/s][A
epoch 1 iter 2330: train loss 0.28131. lr 4.419060e-04:  69%|██████▊   | 2331/3397 [14:22<06:31,  2.73it/s][A
epoch 1 iter 2331: train loss 0.27716. lr 4.417837e-04:  69%|██████▊   | 2331/3397 [14:23<06:31,  2.73it/s][A
epoch 1 iter 2331: train loss 0.27716. lr 4.417837e-04:  69%|██████▊   | 2332/3397 [14:23<06:32,  2.71it/s][A

epoch 1 iter 2364: train loss 0.28988. lr 4.377320e-04:  70%|██████▉   | 2364/3397 [14:35<06:21,  2.71it/s][A
epoch 1 iter 2364: train loss 0.28988. lr 4.377320e-04:  70%|██████▉   | 2365/3397 [14:35<06:21,  2.71it/s][A
epoch 1 iter 2365: train loss 0.28044. lr 4.376087e-04:  70%|██████▉   | 2365/3397 [14:35<06:21,  2.71it/s][A
epoch 1 iter 2365: train loss 0.28044. lr 4.376087e-04:  70%|██████▉   | 2366/3397 [14:35<06:20,  2.71it/s][A
epoch 1 iter 2366: train loss 0.28533. lr 4.374854e-04:  70%|██████▉   | 2366/3397 [14:36<06:20,  2.71it/s][A
epoch 1 iter 2366: train loss 0.28533. lr 4.374854e-04:  70%|██████▉   | 2367/3397 [14:36<06:18,  2.72it/s][A
epoch 1 iter 2367: train loss 0.30438. lr 4.373621e-04:  70%|██████▉   | 2367/3397 [14:36<06:18,  2.72it/s][A
epoch 1 iter 2367: train loss 0.30438. lr 4.373621e-04:  70%|██████▉   | 2368/3397 [14:36<06:20,  2.71it/s][A
epoch 1 iter 2368: train loss 0.28403. lr 4.372387e-04:  70%|██████▉   | 2368/3397 [14:37<06:20,  2.71it/s][A

epoch 1 iter 2400: train loss 0.28897. lr 4.332754e-04:  71%|███████   | 2401/3397 [14:48<06:08,  2.70it/s][A
epoch 1 iter 2401: train loss 0.30418. lr 4.331511e-04:  71%|███████   | 2401/3397 [14:49<06:08,  2.70it/s][A
epoch 1 iter 2401: train loss 0.30418. lr 4.331511e-04:  71%|███████   | 2402/3397 [14:49<06:07,  2.71it/s][A
epoch 1 iter 2402: train loss 0.27097. lr 4.330267e-04:  71%|███████   | 2402/3397 [14:49<06:07,  2.71it/s][A
epoch 1 iter 2402: train loss 0.27097. lr 4.330267e-04:  71%|███████   | 2403/3397 [14:49<06:06,  2.71it/s][A
epoch 1 iter 2403: train loss 0.29502. lr 4.329023e-04:  71%|███████   | 2403/3397 [14:49<06:06,  2.71it/s][A
epoch 1 iter 2403: train loss 0.29502. lr 4.329023e-04:  71%|███████   | 2404/3397 [14:49<06:04,  2.73it/s][A
epoch 1 iter 2404: train loss 0.29780. lr 4.327779e-04:  71%|███████   | 2404/3397 [14:50<06:04,  2.73it/s][A
epoch 1 iter 2404: train loss 0.29780. lr 4.327779e-04:  71%|███████   | 2405/3397 [14:50<06:05,  2.71it/s][A

epoch 1 iter 2437: train loss 0.27756. lr 4.286565e-04:  72%|███████▏  | 2437/3397 [15:02<05:53,  2.72it/s][A
epoch 1 iter 2437: train loss 0.27756. lr 4.286565e-04:  72%|███████▏  | 2438/3397 [15:02<05:53,  2.71it/s][A
epoch 1 iter 2438: train loss 0.28177. lr 4.285311e-04:  72%|███████▏  | 2438/3397 [15:02<05:53,  2.71it/s][A
epoch 1 iter 2438: train loss 0.28177. lr 4.285311e-04:  72%|███████▏  | 2439/3397 [15:02<05:53,  2.71it/s][A
epoch 1 iter 2439: train loss 0.28171. lr 4.284057e-04:  72%|███████▏  | 2439/3397 [15:03<05:53,  2.71it/s][A
epoch 1 iter 2439: train loss 0.28171. lr 4.284057e-04:  72%|███████▏  | 2440/3397 [15:03<05:53,  2.71it/s][A
epoch 1 iter 2440: train loss 0.26572. lr 4.282803e-04:  72%|███████▏  | 2440/3397 [15:03<05:53,  2.71it/s][A
epoch 1 iter 2440: train loss 0.26572. lr 4.282803e-04:  72%|███████▏  | 2441/3397 [15:03<05:53,  2.71it/s][A
epoch 1 iter 2441: train loss 0.27762. lr 4.281549e-04:  72%|███████▏  | 2441/3397 [15:03<05:53,  2.71it/s][A

epoch 1 iter 2473: train loss 0.29855. lr 4.241262e-04:  73%|███████▎  | 2474/3397 [15:15<05:39,  2.72it/s][A
epoch 1 iter 2474: train loss 0.29327. lr 4.239999e-04:  73%|███████▎  | 2474/3397 [15:16<05:39,  2.72it/s][A
epoch 1 iter 2474: train loss 0.29327. lr 4.239999e-04:  73%|███████▎  | 2475/3397 [15:16<05:39,  2.72it/s][A
epoch 1 iter 2475: train loss 0.29436. lr 4.238735e-04:  73%|███████▎  | 2475/3397 [15:16<05:39,  2.72it/s][A
epoch 1 iter 2475: train loss 0.29436. lr 4.238735e-04:  73%|███████▎  | 2476/3397 [15:16<05:39,  2.71it/s][A
epoch 1 iter 2476: train loss 0.27464. lr 4.237471e-04:  73%|███████▎  | 2476/3397 [15:16<05:39,  2.71it/s][A
epoch 1 iter 2476: train loss 0.27464. lr 4.237471e-04:  73%|███████▎  | 2477/3397 [15:16<05:38,  2.72it/s][A
epoch 1 iter 2477: train loss 0.26950. lr 4.236207e-04:  73%|███████▎  | 2477/3397 [15:17<05:38,  2.72it/s][A
epoch 1 iter 2477: train loss 0.26950. lr 4.236207e-04:  73%|███████▎  | 2478/3397 [15:17<05:37,  2.72it/s][A

epoch 1 iter 2510: train loss 0.28115. lr 4.194343e-04:  74%|███████▍  | 2510/3397 [15:29<05:26,  2.72it/s][A
epoch 1 iter 2510: train loss 0.28115. lr 4.194343e-04:  74%|███████▍  | 2511/3397 [15:29<05:25,  2.72it/s][A
epoch 1 iter 2511: train loss 0.27086. lr 4.193070e-04:  74%|███████▍  | 2511/3397 [15:29<05:25,  2.72it/s][A
epoch 1 iter 2511: train loss 0.27086. lr 4.193070e-04:  74%|███████▍  | 2512/3397 [15:29<05:25,  2.72it/s][A
epoch 1 iter 2512: train loss 0.28793. lr 4.191797e-04:  74%|███████▍  | 2512/3397 [15:30<05:25,  2.72it/s][A
epoch 1 iter 2512: train loss 0.28793. lr 4.191797e-04:  74%|███████▍  | 2513/3397 [15:30<05:25,  2.72it/s][A
epoch 1 iter 2513: train loss 0.28236. lr 4.190523e-04:  74%|███████▍  | 2513/3397 [15:30<05:25,  2.72it/s][A
epoch 1 iter 2513: train loss 0.28236. lr 4.190523e-04:  74%|███████▍  | 2514/3397 [15:30<05:24,  2.72it/s][A
epoch 1 iter 2514: train loss 0.27899. lr 4.189249e-04:  74%|███████▍  | 2514/3397 [15:30<05:24,  2.72it/s][A

epoch 1 iter 2546: train loss 0.26961. lr 4.148356e-04:  75%|███████▍  | 2547/3397 [15:42<05:12,  2.72it/s][A
epoch 1 iter 2547: train loss 0.25654. lr 4.147074e-04:  75%|███████▍  | 2547/3397 [15:43<05:12,  2.72it/s][A
epoch 1 iter 2547: train loss 0.25654. lr 4.147074e-04:  75%|███████▌  | 2548/3397 [15:43<05:12,  2.72it/s][A
epoch 1 iter 2548: train loss 0.27528. lr 4.145791e-04:  75%|███████▌  | 2548/3397 [15:43<05:12,  2.72it/s][A
epoch 1 iter 2548: train loss 0.27528. lr 4.145791e-04:  75%|███████▌  | 2549/3397 [15:43<05:12,  2.71it/s][A
epoch 1 iter 2549: train loss 0.27303. lr 4.144509e-04:  75%|███████▌  | 2549/3397 [15:43<05:12,  2.71it/s][A
epoch 1 iter 2549: train loss 0.27303. lr 4.144509e-04:  75%|███████▌  | 2550/3397 [15:43<05:12,  2.71it/s][A
epoch 1 iter 2550: train loss 0.26319. lr 4.143226e-04:  75%|███████▌  | 2550/3397 [15:44<05:12,  2.71it/s][A
epoch 1 iter 2550: train loss 0.26319. lr 4.143226e-04:  75%|███████▌  | 2551/3397 [15:44<05:12,  2.71it/s][A

epoch 1 iter 2583: train loss 0.25911. lr 4.100760e-04:  76%|███████▌  | 2583/3397 [15:56<05:00,  2.71it/s][A
epoch 1 iter 2583: train loss 0.25911. lr 4.100760e-04:  76%|███████▌  | 2584/3397 [15:56<04:59,  2.72it/s][A
epoch 1 iter 2584: train loss 0.25656. lr 4.099469e-04:  76%|███████▌  | 2584/3397 [15:56<04:59,  2.72it/s][A
epoch 1 iter 2584: train loss 0.25656. lr 4.099469e-04:  76%|███████▌  | 2585/3397 [15:56<05:00,  2.70it/s][A
epoch 1 iter 2585: train loss 0.28231. lr 4.098177e-04:  76%|███████▌  | 2585/3397 [15:57<05:00,  2.70it/s][A
epoch 1 iter 2585: train loss 0.28231. lr 4.098177e-04:  76%|███████▌  | 2586/3397 [15:57<05:01,  2.69it/s][A
epoch 1 iter 2586: train loss 0.28111. lr 4.096886e-04:  76%|███████▌  | 2586/3397 [15:57<05:01,  2.69it/s][A
epoch 1 iter 2586: train loss 0.28111. lr 4.096886e-04:  76%|███████▌  | 2587/3397 [15:57<05:01,  2.69it/s][A
epoch 1 iter 2587: train loss 0.28495. lr 4.095594e-04:  76%|███████▌  | 2587/3397 [15:57<05:01,  2.69it/s][A

epoch 1 iter 2619: train loss 0.25477. lr 4.054140e-04:  77%|███████▋  | 2620/3397 [16:09<04:47,  2.70it/s][A
epoch 1 iter 2620: train loss 0.26264. lr 4.052841e-04:  77%|███████▋  | 2620/3397 [16:10<04:47,  2.70it/s][A
epoch 1 iter 2620: train loss 0.26264. lr 4.052841e-04:  77%|███████▋  | 2621/3397 [16:10<04:47,  2.70it/s][A
epoch 1 iter 2621: train loss 0.27509. lr 4.051541e-04:  77%|███████▋  | 2621/3397 [16:10<04:47,  2.70it/s][A
epoch 1 iter 2621: train loss 0.27509. lr 4.051541e-04:  77%|███████▋  | 2622/3397 [16:10<04:47,  2.70it/s][A
epoch 1 iter 2622: train loss 0.26321. lr 4.050242e-04:  77%|███████▋  | 2622/3397 [16:10<04:47,  2.70it/s][A
epoch 1 iter 2622: train loss 0.26321. lr 4.050242e-04:  77%|███████▋  | 2623/3397 [16:10<04:47,  2.69it/s][A
epoch 1 iter 2623: train loss 0.26080. lr 4.048942e-04:  77%|███████▋  | 2623/3397 [16:11<04:47,  2.69it/s][A
epoch 1 iter 2623: train loss 0.26080. lr 4.048942e-04:  77%|███████▋  | 2624/3397 [16:11<04:47,  2.69it/s][A

epoch 1 iter 2656: train loss 0.25696. lr 4.005921e-04:  78%|███████▊  | 2656/3397 [16:23<04:36,  2.68it/s][A
epoch 1 iter 2656: train loss 0.25696. lr 4.005921e-04:  78%|███████▊  | 2657/3397 [16:23<04:36,  2.68it/s][A
epoch 1 iter 2657: train loss 0.26875. lr 4.004614e-04:  78%|███████▊  | 2657/3397 [16:23<04:36,  2.68it/s][A
epoch 1 iter 2657: train loss 0.26875. lr 4.004614e-04:  78%|███████▊  | 2658/3397 [16:23<04:35,  2.68it/s][A
epoch 1 iter 2658: train loss 0.23921. lr 4.003306e-04:  78%|███████▊  | 2658/3397 [16:24<04:35,  2.68it/s][A
epoch 1 iter 2658: train loss 0.23921. lr 4.003306e-04:  78%|███████▊  | 2659/3397 [16:24<04:34,  2.69it/s][A
epoch 1 iter 2659: train loss 0.24604. lr 4.001999e-04:  78%|███████▊  | 2659/3397 [16:24<04:34,  2.69it/s][A
epoch 1 iter 2659: train loss 0.24604. lr 4.001999e-04:  78%|███████▊  | 2660/3397 [16:24<04:33,  2.70it/s][A
epoch 1 iter 2660: train loss 0.26903. lr 4.000691e-04:  78%|███████▊  | 2660/3397 [16:24<04:33,  2.70it/s][A

epoch 1 iter 2692: train loss 0.26594. lr 3.958723e-04:  79%|███████▉  | 2693/3397 [16:36<04:19,  2.71it/s][A
epoch 1 iter 2693: train loss 0.29255. lr 3.957408e-04:  79%|███████▉  | 2693/3397 [16:37<04:19,  2.71it/s][A
epoch 1 iter 2693: train loss 0.29255. lr 3.957408e-04:  79%|███████▉  | 2694/3397 [16:37<04:19,  2.71it/s][A
epoch 1 iter 2694: train loss 0.25111. lr 3.956093e-04:  79%|███████▉  | 2694/3397 [16:37<04:19,  2.71it/s][A
epoch 1 iter 2694: train loss 0.25111. lr 3.956093e-04:  79%|███████▉  | 2695/3397 [16:37<04:18,  2.71it/s][A
epoch 1 iter 2695: train loss 0.25168. lr 3.954777e-04:  79%|███████▉  | 2695/3397 [16:37<04:18,  2.71it/s][A
epoch 1 iter 2695: train loss 0.25168. lr 3.954777e-04:  79%|███████▉  | 2696/3397 [16:37<04:16,  2.73it/s][A
epoch 1 iter 2696: train loss 0.25767. lr 3.953462e-04:  79%|███████▉  | 2696/3397 [16:38<04:16,  2.73it/s][A
epoch 1 iter 2696: train loss 0.25767. lr 3.953462e-04:  79%|███████▉  | 2697/3397 [16:38<04:17,  2.72it/s][A

epoch 1 iter 2729: train loss 0.26415. lr 3.909936e-04:  80%|████████  | 2729/3397 [16:50<04:07,  2.70it/s][A
epoch 1 iter 2729: train loss 0.26415. lr 3.909936e-04:  80%|████████  | 2730/3397 [16:50<04:06,  2.70it/s][A
epoch 1 iter 2730: train loss 0.24724. lr 3.908614e-04:  80%|████████  | 2730/3397 [16:50<04:06,  2.70it/s][A
epoch 1 iter 2730: train loss 0.24724. lr 3.908614e-04:  80%|████████  | 2731/3397 [16:50<04:05,  2.71it/s][A
epoch 1 iter 2731: train loss 0.23841. lr 3.907292e-04:  80%|████████  | 2731/3397 [16:51<04:05,  2.71it/s][A
epoch 1 iter 2731: train loss 0.23841. lr 3.907292e-04:  80%|████████  | 2732/3397 [16:51<04:06,  2.70it/s][A
epoch 1 iter 2732: train loss 0.26461. lr 3.905969e-04:  80%|████████  | 2732/3397 [16:51<04:06,  2.70it/s][A
epoch 1 iter 2732: train loss 0.26461. lr 3.905969e-04:  80%|████████  | 2733/3397 [16:51<04:07,  2.69it/s][A
epoch 1 iter 2733: train loss 0.25674. lr 3.904646e-04:  80%|████████  | 2733/3397 [16:51<04:07,  2.69it/s][A

epoch 1 iter 2765: train loss 0.26318. lr 3.862213e-04:  81%|████████▏ | 2766/3397 [17:03<03:52,  2.71it/s][A
epoch 1 iter 2766: train loss 0.25356. lr 3.860884e-04:  81%|████████▏ | 2766/3397 [17:04<03:52,  2.71it/s][A
epoch 1 iter 2766: train loss 0.25356. lr 3.860884e-04:  81%|████████▏ | 2767/3397 [17:04<03:51,  2.72it/s][A
epoch 1 iter 2767: train loss 0.27002. lr 3.859554e-04:  81%|████████▏ | 2767/3397 [17:04<03:51,  2.72it/s][A
epoch 1 iter 2767: train loss 0.27002. lr 3.859554e-04:  81%|████████▏ | 2768/3397 [17:04<03:52,  2.71it/s][A
epoch 1 iter 2768: train loss 0.24992. lr 3.858225e-04:  81%|████████▏ | 2768/3397 [17:04<03:52,  2.71it/s][A
epoch 1 iter 2768: train loss 0.24992. lr 3.858225e-04:  82%|████████▏ | 2769/3397 [17:04<03:52,  2.71it/s][A
epoch 1 iter 2769: train loss 0.24502. lr 3.856895e-04:  82%|████████▏ | 2769/3397 [17:05<03:52,  2.71it/s][A
epoch 1 iter 2769: train loss 0.24502. lr 3.856895e-04:  82%|████████▏ | 2770/3397 [17:05<03:51,  2.71it/s][A

epoch 1 iter 2802: train loss 0.26393. lr 3.812914e-04:  82%|████████▏ | 2802/3397 [17:17<03:39,  2.71it/s][A
epoch 1 iter 2802: train loss 0.26393. lr 3.812914e-04:  83%|████████▎ | 2803/3397 [17:17<03:39,  2.71it/s][A
epoch 1 iter 2803: train loss 0.24897. lr 3.811579e-04:  83%|████████▎ | 2803/3397 [17:17<03:39,  2.71it/s][A
epoch 1 iter 2803: train loss 0.24897. lr 3.811579e-04:  83%|████████▎ | 2804/3397 [17:17<03:38,  2.71it/s][A
epoch 1 iter 2804: train loss 0.25230. lr 3.810243e-04:  83%|████████▎ | 2804/3397 [17:18<03:38,  2.71it/s][A
epoch 1 iter 2804: train loss 0.25230. lr 3.810243e-04:  83%|████████▎ | 2805/3397 [17:18<03:37,  2.72it/s][A
epoch 1 iter 2805: train loss 0.29519. lr 3.808906e-04:  83%|████████▎ | 2805/3397 [17:18<03:37,  2.72it/s][A
epoch 1 iter 2805: train loss 0.29519. lr 3.808906e-04:  83%|████████▎ | 2806/3397 [17:18<03:37,  2.71it/s][A
epoch 1 iter 2806: train loss 0.24422. lr 3.807570e-04:  83%|████████▎ | 2806/3397 [17:18<03:37,  2.71it/s][A

epoch 1 iter 2838: train loss 0.26885. lr 3.764720e-04:  84%|████████▎ | 2839/3397 [17:30<03:26,  2.71it/s][A
epoch 1 iter 2839: train loss 0.25616. lr 3.763378e-04:  84%|████████▎ | 2839/3397 [17:31<03:26,  2.71it/s][A
epoch 1 iter 2839: train loss 0.25616. lr 3.763378e-04:  84%|████████▎ | 2840/3397 [17:31<03:25,  2.71it/s][A
epoch 1 iter 2840: train loss 0.23179. lr 3.762036e-04:  84%|████████▎ | 2840/3397 [17:31<03:25,  2.71it/s][A
epoch 1 iter 2840: train loss 0.23179. lr 3.762036e-04:  84%|████████▎ | 2841/3397 [17:31<03:25,  2.71it/s][A
epoch 1 iter 2841: train loss 0.24284. lr 3.760694e-04:  84%|████████▎ | 2841/3397 [17:31<03:25,  2.71it/s][A
epoch 1 iter 2841: train loss 0.24284. lr 3.760694e-04:  84%|████████▎ | 2842/3397 [17:31<03:25,  2.71it/s][A
epoch 1 iter 2842: train loss 0.23635. lr 3.759351e-04:  84%|████████▎ | 2842/3397 [17:32<03:25,  2.71it/s][A
epoch 1 iter 2842: train loss 0.23635. lr 3.759351e-04:  84%|████████▎ | 2843/3397 [17:32<03:24,  2.71it/s][A

epoch 1 iter 2875: train loss 0.25054. lr 3.714965e-04:  85%|████████▍ | 2875/3397 [17:44<03:12,  2.71it/s][A
epoch 1 iter 2875: train loss 0.25054. lr 3.714965e-04:  85%|████████▍ | 2876/3397 [17:44<03:12,  2.71it/s][A
epoch 1 iter 2876: train loss 0.24603. lr 3.713618e-04:  85%|████████▍ | 2876/3397 [17:44<03:12,  2.71it/s][A
epoch 1 iter 2876: train loss 0.24603. lr 3.713618e-04:  85%|████████▍ | 2877/3397 [17:44<03:12,  2.71it/s][A
epoch 1 iter 2877: train loss 0.22659. lr 3.712270e-04:  85%|████████▍ | 2877/3397 [17:45<03:12,  2.71it/s][A
epoch 1 iter 2877: train loss 0.22659. lr 3.712270e-04:  85%|████████▍ | 2878/3397 [17:45<03:11,  2.71it/s][A
epoch 1 iter 2878: train loss 0.24898. lr 3.710922e-04:  85%|████████▍ | 2878/3397 [17:45<03:11,  2.71it/s][A
epoch 1 iter 2878: train loss 0.24898. lr 3.710922e-04:  85%|████████▍ | 2879/3397 [17:45<03:10,  2.72it/s][A
epoch 1 iter 2879: train loss 0.23740. lr 3.709574e-04:  85%|████████▍ | 2879/3397 [17:45<03:10,  2.72it/s][A

epoch 1 iter 2911: train loss 0.25069. lr 3.666355e-04:  86%|████████▌ | 2912/3397 [17:57<02:58,  2.71it/s][A
epoch 1 iter 2912: train loss 0.22865. lr 3.665002e-04:  86%|████████▌ | 2912/3397 [17:58<02:58,  2.71it/s][A
epoch 1 iter 2912: train loss 0.22865. lr 3.665002e-04:  86%|████████▌ | 2913/3397 [17:58<02:58,  2.71it/s][A
epoch 1 iter 2913: train loss 0.23714. lr 3.663649e-04:  86%|████████▌ | 2913/3397 [17:58<02:58,  2.71it/s][A
epoch 1 iter 2913: train loss 0.23714. lr 3.663649e-04:  86%|████████▌ | 2914/3397 [17:58<02:58,  2.71it/s][A
epoch 1 iter 2914: train loss 0.24161. lr 3.662295e-04:  86%|████████▌ | 2914/3397 [17:58<02:58,  2.71it/s][A
epoch 1 iter 2914: train loss 0.24161. lr 3.662295e-04:  86%|████████▌ | 2915/3397 [17:58<02:57,  2.72it/s][A
epoch 1 iter 2915: train loss 0.23256. lr 3.660942e-04:  86%|████████▌ | 2915/3397 [17:59<02:57,  2.72it/s][A
epoch 1 iter 2915: train loss 0.23256. lr 3.660942e-04:  86%|████████▌ | 2916/3397 [17:59<02:57,  2.71it/s][A

epoch 1 iter 2948: train loss 0.22825. lr 3.616202e-04:  87%|████████▋ | 2948/3397 [18:11<02:48,  2.67it/s][A
epoch 1 iter 2948: train loss 0.22825. lr 3.616202e-04:  87%|████████▋ | 2949/3397 [18:11<02:47,  2.67it/s][A
epoch 1 iter 2949: train loss 0.24792. lr 3.614844e-04:  87%|████████▋ | 2949/3397 [18:11<02:47,  2.67it/s][A
epoch 1 iter 2949: train loss 0.24792. lr 3.614844e-04:  87%|████████▋ | 2950/3397 [18:11<02:46,  2.68it/s][A
epoch 1 iter 2950: train loss 0.23047. lr 3.613485e-04:  87%|████████▋ | 2950/3397 [18:12<02:46,  2.68it/s][A
epoch 1 iter 2950: train loss 0.23047. lr 3.613485e-04:  87%|████████▋ | 2951/3397 [18:12<02:45,  2.70it/s][A
epoch 1 iter 2951: train loss 0.23025. lr 3.612127e-04:  87%|████████▋ | 2951/3397 [18:12<02:45,  2.70it/s][A
epoch 1 iter 2951: train loss 0.23025. lr 3.612127e-04:  87%|████████▋ | 2952/3397 [18:12<02:44,  2.70it/s][A
epoch 1 iter 2952: train loss 0.24495. lr 3.610769e-04:  87%|████████▋ | 2952/3397 [18:12<02:44,  2.70it/s][A

epoch 1 iter 2984: train loss 0.22820. lr 3.567231e-04:  88%|████████▊ | 2985/3397 [18:24<02:31,  2.71it/s][A
epoch 1 iter 2985: train loss 0.23727. lr 3.565868e-04:  88%|████████▊ | 2985/3397 [18:25<02:31,  2.71it/s][A
epoch 1 iter 2985: train loss 0.23727. lr 3.565868e-04:  88%|████████▊ | 2986/3397 [18:25<02:30,  2.72it/s][A
epoch 1 iter 2986: train loss 0.23776. lr 3.564505e-04:  88%|████████▊ | 2986/3397 [18:25<02:30,  2.72it/s][A
epoch 1 iter 2986: train loss 0.23776. lr 3.564505e-04:  88%|████████▊ | 2987/3397 [18:25<02:30,  2.72it/s][A
epoch 1 iter 2987: train loss 0.23054. lr 3.563142e-04:  88%|████████▊ | 2987/3397 [18:25<02:30,  2.72it/s][A
epoch 1 iter 2987: train loss 0.23054. lr 3.563142e-04:  88%|████████▊ | 2988/3397 [18:25<02:30,  2.71it/s][A
epoch 1 iter 2988: train loss 0.24098. lr 3.561779e-04:  88%|████████▊ | 2988/3397 [18:26<02:30,  2.71it/s][A
epoch 1 iter 2988: train loss 0.24098. lr 3.561779e-04:  88%|████████▊ | 2989/3397 [18:26<02:30,  2.71it/s][A

epoch 1 iter 3021: train loss 0.25145. lr 3.516735e-04:  89%|████████▉ | 3021/3397 [18:38<02:17,  2.72it/s][A
epoch 1 iter 3021: train loss 0.25145. lr 3.516735e-04:  89%|████████▉ | 3022/3397 [18:38<02:18,  2.71it/s][A
epoch 1 iter 3022: train loss 0.23238. lr 3.515368e-04:  89%|████████▉ | 3022/3397 [18:38<02:18,  2.71it/s][A
epoch 1 iter 3022: train loss 0.23238. lr 3.515368e-04:  89%|████████▉ | 3023/3397 [18:38<02:18,  2.71it/s][A
epoch 1 iter 3023: train loss 0.23353. lr 3.514001e-04:  89%|████████▉ | 3023/3397 [18:39<02:18,  2.71it/s][A
epoch 1 iter 3023: train loss 0.23353. lr 3.514001e-04:  89%|████████▉ | 3024/3397 [18:39<02:17,  2.71it/s][A
epoch 1 iter 3024: train loss 0.21619. lr 3.512634e-04:  89%|████████▉ | 3024/3397 [18:39<02:17,  2.71it/s][A
epoch 1 iter 3024: train loss 0.21619. lr 3.512634e-04:  89%|████████▉ | 3025/3397 [18:39<02:17,  2.71it/s][A
epoch 1 iter 3025: train loss 0.23383. lr 3.511267e-04:  89%|████████▉ | 3025/3397 [18:39<02:17,  2.71it/s][A

epoch 1 iter 3057: train loss 0.21849. lr 3.467460e-04:  90%|█████████ | 3058/3397 [18:51<02:04,  2.71it/s][A
epoch 1 iter 3058: train loss 0.23842. lr 3.466089e-04:  90%|█████████ | 3058/3397 [18:51<02:04,  2.71it/s][A
epoch 1 iter 3058: train loss 0.23842. lr 3.466089e-04:  90%|█████████ | 3059/3397 [18:51<02:04,  2.71it/s][A
epoch 1 iter 3059: train loss 0.25285. lr 3.464718e-04:  90%|█████████ | 3059/3397 [18:52<02:04,  2.71it/s][A
epoch 1 iter 3059: train loss 0.25285. lr 3.464718e-04:  90%|█████████ | 3060/3397 [18:52<02:04,  2.71it/s][A
epoch 1 iter 3060: train loss 0.24890. lr 3.463347e-04:  90%|█████████ | 3060/3397 [18:52<02:04,  2.71it/s][A
epoch 1 iter 3060: train loss 0.24890. lr 3.463347e-04:  90%|█████████ | 3061/3397 [18:52<02:03,  2.71it/s][A
epoch 1 iter 3061: train loss 0.24247. lr 3.461976e-04:  90%|█████████ | 3061/3397 [18:53<02:03,  2.71it/s][A
epoch 1 iter 3061: train loss 0.24247. lr 3.461976e-04:  90%|█████████ | 3062/3397 [18:53<02:03,  2.72it/s][A

epoch 1 iter 3094: train loss 0.22758. lr 3.416680e-04:  91%|█████████ | 3094/3397 [19:05<01:51,  2.71it/s][A
epoch 1 iter 3094: train loss 0.22758. lr 3.416680e-04:  91%|█████████ | 3095/3397 [19:05<01:51,  2.71it/s][A
epoch 1 iter 3095: train loss 0.21335. lr 3.415306e-04:  91%|█████████ | 3095/3397 [19:05<01:51,  2.71it/s][A
epoch 1 iter 3095: train loss 0.21335. lr 3.415306e-04:  91%|█████████ | 3096/3397 [19:05<01:51,  2.71it/s][A
epoch 1 iter 3096: train loss 0.22871. lr 3.413932e-04:  91%|█████████ | 3096/3397 [19:06<01:51,  2.71it/s][A
epoch 1 iter 3096: train loss 0.22871. lr 3.413932e-04:  91%|█████████ | 3097/3397 [19:06<01:50,  2.71it/s][A
epoch 1 iter 3097: train loss 0.24036. lr 3.412557e-04:  91%|█████████ | 3097/3397 [19:06<01:50,  2.71it/s][A
epoch 1 iter 3097: train loss 0.24036. lr 3.412557e-04:  91%|█████████ | 3098/3397 [19:06<01:49,  2.72it/s][A
epoch 1 iter 3098: train loss 0.22437. lr 3.411183e-04:  91%|█████████ | 3098/3397 [19:06<01:49,  2.72it/s][A

epoch 1 iter 3130: train loss 0.21547. lr 3.367156e-04:  92%|█████████▏| 3131/3397 [19:18<01:38,  2.71it/s][A
epoch 1 iter 3131: train loss 0.21767. lr 3.365779e-04:  92%|█████████▏| 3131/3397 [19:18<01:38,  2.71it/s][A
epoch 1 iter 3131: train loss 0.21767. lr 3.365779e-04:  92%|█████████▏| 3132/3397 [19:18<01:37,  2.71it/s][A
epoch 1 iter 3132: train loss 0.22214. lr 3.364401e-04:  92%|█████████▏| 3132/3397 [19:19<01:37,  2.71it/s][A
epoch 1 iter 3132: train loss 0.22214. lr 3.364401e-04:  92%|█████████▏| 3133/3397 [19:19<01:37,  2.72it/s][A
epoch 1 iter 3133: train loss 0.21710. lr 3.363024e-04:  92%|█████████▏| 3133/3397 [19:19<01:37,  2.72it/s][A
epoch 1 iter 3133: train loss 0.21710. lr 3.363024e-04:  92%|█████████▏| 3134/3397 [19:19<01:36,  2.71it/s][A
epoch 1 iter 3134: train loss 0.21759. lr 3.361646e-04:  92%|█████████▏| 3134/3397 [19:20<01:36,  2.71it/s][A
epoch 1 iter 3134: train loss 0.21759. lr 3.361646e-04:  92%|█████████▏| 3135/3397 [19:20<01:36,  2.71it/s][A

epoch 1 iter 3167: train loss 0.22484. lr 3.316150e-04:  93%|█████████▎| 3167/3397 [19:32<01:25,  2.70it/s][A
epoch 1 iter 3167: train loss 0.22484. lr 3.316150e-04:  93%|█████████▎| 3168/3397 [19:32<01:24,  2.70it/s][A
epoch 1 iter 3168: train loss 0.22268. lr 3.314770e-04:  93%|█████████▎| 3168/3397 [19:32<01:24,  2.70it/s][A
epoch 1 iter 3168: train loss 0.22268. lr 3.314770e-04:  93%|█████████▎| 3169/3397 [19:32<01:24,  2.70it/s][A
epoch 1 iter 3169: train loss 0.24124. lr 3.313390e-04:  93%|█████████▎| 3169/3397 [19:32<01:24,  2.70it/s][A
epoch 1 iter 3169: train loss 0.24124. lr 3.313390e-04:  93%|█████████▎| 3170/3397 [19:32<01:24,  2.70it/s][A
epoch 1 iter 3170: train loss 0.23796. lr 3.312010e-04:  93%|█████████▎| 3170/3397 [19:33<01:24,  2.70it/s][A
epoch 1 iter 3170: train loss 0.23796. lr 3.312010e-04:  93%|█████████▎| 3171/3397 [19:33<01:23,  2.69it/s][A
epoch 1 iter 3171: train loss 0.23717. lr 3.310630e-04:  93%|█████████▎| 3171/3397 [19:33<01:23,  2.69it/s][A

epoch 1 iter 3203: train loss 0.20903. lr 3.266433e-04:  94%|█████████▍| 3204/3397 [19:45<01:11,  2.71it/s][A
epoch 1 iter 3204: train loss 0.22761. lr 3.265051e-04:  94%|█████████▍| 3204/3397 [19:45<01:11,  2.71it/s][A
epoch 1 iter 3204: train loss 0.22761. lr 3.265051e-04:  94%|█████████▍| 3205/3397 [19:45<01:10,  2.71it/s][A
epoch 1 iter 3205: train loss 0.22826. lr 3.263669e-04:  94%|█████████▍| 3205/3397 [19:46<01:10,  2.71it/s][A
epoch 1 iter 3205: train loss 0.22826. lr 3.263669e-04:  94%|█████████▍| 3206/3397 [19:46<01:10,  2.71it/s][A
epoch 1 iter 3206: train loss 0.24076. lr 3.262287e-04:  94%|█████████▍| 3206/3397 [19:46<01:10,  2.71it/s][A
epoch 1 iter 3206: train loss 0.24076. lr 3.262287e-04:  94%|█████████▍| 3207/3397 [19:46<01:10,  2.71it/s][A
epoch 1 iter 3207: train loss 0.22300. lr 3.260904e-04:  94%|█████████▍| 3207/3397 [19:46<01:10,  2.71it/s][A
epoch 1 iter 3207: train loss 0.22300. lr 3.260904e-04:  94%|█████████▍| 3208/3397 [19:46<01:09,  2.71it/s][A

epoch 1 iter 3240: train loss 0.20971. lr 3.215259e-04:  95%|█████████▌| 3240/3397 [19:59<00:57,  2.72it/s][A
epoch 1 iter 3240: train loss 0.20971. lr 3.215259e-04:  95%|█████████▌| 3241/3397 [19:59<00:57,  2.71it/s][A
epoch 1 iter 3241: train loss 0.22343. lr 3.213875e-04:  95%|█████████▌| 3241/3397 [19:59<00:57,  2.71it/s][A
epoch 1 iter 3241: train loss 0.22343. lr 3.213875e-04:  95%|█████████▌| 3242/3397 [19:59<00:57,  2.71it/s][A
epoch 1 iter 3242: train loss 0.21344. lr 3.212491e-04:  95%|█████████▌| 3242/3397 [19:59<00:57,  2.71it/s][A
epoch 1 iter 3242: train loss 0.21344. lr 3.212491e-04:  95%|█████████▌| 3243/3397 [19:59<00:56,  2.71it/s][A
epoch 1 iter 3243: train loss 0.21839. lr 3.211107e-04:  95%|█████████▌| 3243/3397 [20:00<00:56,  2.71it/s][A
epoch 1 iter 3243: train loss 0.21839. lr 3.211107e-04:  95%|█████████▌| 3244/3397 [20:00<00:56,  2.71it/s][A
epoch 1 iter 3244: train loss 0.21653. lr 3.209723e-04:  95%|█████████▌| 3244/3397 [20:00<00:56,  2.71it/s][A

epoch 1 iter 3276: train loss 0.21782. lr 3.165407e-04:  96%|█████████▋| 3277/3397 [20:12<00:45,  2.64it/s][A
epoch 1 iter 3277: train loss 0.22055. lr 3.164022e-04:  96%|█████████▋| 3277/3397 [20:12<00:45,  2.64it/s][A
epoch 1 iter 3277: train loss 0.22055. lr 3.164022e-04:  96%|█████████▋| 3278/3397 [20:12<00:45,  2.63it/s][A
epoch 1 iter 3278: train loss 0.20842. lr 3.162636e-04:  96%|█████████▋| 3278/3397 [20:13<00:45,  2.63it/s][A
epoch 1 iter 3278: train loss 0.20842. lr 3.162636e-04:  97%|█████████▋| 3279/3397 [20:13<00:44,  2.63it/s][A
epoch 1 iter 3279: train loss 0.21438. lr 3.161251e-04:  97%|█████████▋| 3279/3397 [20:13<00:44,  2.63it/s][A
epoch 1 iter 3279: train loss 0.21438. lr 3.161251e-04:  97%|█████████▋| 3280/3397 [20:13<00:44,  2.63it/s][A
epoch 1 iter 3280: train loss 0.22453. lr 3.159865e-04:  97%|█████████▋| 3280/3397 [20:14<00:44,  2.63it/s][A
epoch 1 iter 3280: train loss 0.22453. lr 3.159865e-04:  97%|█████████▋| 3281/3397 [20:14<00:43,  2.65it/s][A

epoch 1 iter 3313: train loss 0.21629. lr 3.114123e-04:  98%|█████████▊| 3313/3397 [20:26<00:31,  2.71it/s][A
epoch 1 iter 3313: train loss 0.21629. lr 3.114123e-04:  98%|█████████▊| 3314/3397 [20:26<00:30,  2.71it/s][A
epoch 1 iter 3314: train loss 0.23120. lr 3.112736e-04:  98%|█████████▊| 3314/3397 [20:26<00:30,  2.71it/s][A
epoch 1 iter 3314: train loss 0.23120. lr 3.112736e-04:  98%|█████████▊| 3315/3397 [20:26<00:30,  2.72it/s][A
epoch 1 iter 3315: train loss 0.22256. lr 3.111350e-04:  98%|█████████▊| 3315/3397 [20:27<00:30,  2.72it/s][A
epoch 1 iter 3315: train loss 0.22256. lr 3.111350e-04:  98%|█████████▊| 3316/3397 [20:27<00:29,  2.71it/s][A
epoch 1 iter 3316: train loss 0.21173. lr 3.109963e-04:  98%|█████████▊| 3316/3397 [20:27<00:29,  2.71it/s][A
epoch 1 iter 3316: train loss 0.21173. lr 3.109963e-04:  98%|█████████▊| 3317/3397 [20:27<00:29,  2.71it/s][A
epoch 1 iter 3317: train loss 0.21716. lr 3.108576e-04:  98%|█████████▊| 3317/3397 [20:27<00:29,  2.71it/s][A

epoch 1 iter 3349: train loss 0.20495. lr 3.064193e-04:  99%|█████████▊| 3350/3397 [20:39<00:17,  2.71it/s][A
epoch 1 iter 3350: train loss 0.21954. lr 3.062805e-04:  99%|█████████▊| 3350/3397 [20:39<00:17,  2.71it/s][A
epoch 1 iter 3350: train loss 0.21954. lr 3.062805e-04:  99%|█████████▊| 3351/3397 [20:39<00:16,  2.72it/s][A
epoch 1 iter 3351: train loss 0.23070. lr 3.061418e-04:  99%|█████████▊| 3351/3397 [20:40<00:16,  2.72it/s][A
epoch 1 iter 3351: train loss 0.23070. lr 3.061418e-04:  99%|█████████▊| 3352/3397 [20:40<00:16,  2.71it/s][A
epoch 1 iter 3352: train loss 0.21044. lr 3.060031e-04:  99%|█████████▊| 3352/3397 [20:40<00:16,  2.71it/s][A
epoch 1 iter 3352: train loss 0.21044. lr 3.060031e-04:  99%|█████████▊| 3353/3397 [20:40<00:16,  2.70it/s][A
epoch 1 iter 3353: train loss 0.21386. lr 3.058643e-04:  99%|█████████▊| 3353/3397 [20:41<00:16,  2.70it/s][A
epoch 1 iter 3353: train loss 0.21386. lr 3.058643e-04:  99%|█████████▊| 3354/3397 [20:41<00:15,  2.70it/s][A

epoch 1 iter 3386: train loss 0.21071. lr 3.012857e-04: 100%|█████████▉| 3386/3397 [20:53<00:04,  2.71it/s][A
epoch 1 iter 3386: train loss 0.21071. lr 3.012857e-04: 100%|█████████▉| 3387/3397 [20:53<00:03,  2.71it/s][A
epoch 1 iter 3387: train loss 0.22081. lr 3.011469e-04: 100%|█████████▉| 3387/3397 [20:53<00:03,  2.71it/s][A
epoch 1 iter 3387: train loss 0.22081. lr 3.011469e-04: 100%|█████████▉| 3388/3397 [20:53<00:03,  2.71it/s][A
epoch 1 iter 3388: train loss 0.21085. lr 3.010082e-04: 100%|█████████▉| 3388/3397 [20:54<00:03,  2.71it/s][A
epoch 1 iter 3388: train loss 0.21085. lr 3.010082e-04: 100%|█████████▉| 3389/3397 [20:54<00:02,  2.70it/s][A
epoch 1 iter 3389: train loss 0.21064. lr 3.008694e-04: 100%|█████████▉| 3389/3397 [20:54<00:02,  2.70it/s][A
epoch 1 iter 3389: train loss 0.21064. lr 3.008694e-04: 100%|█████████▉| 3390/3397 [20:54<00:02,  2.70it/s][A
epoch 1 iter 3390: train loss 0.21570. lr 3.007306e-04: 100%|█████████▉| 3390/3397 [20:54<00:02,  2.70it/s][A

epoch 2 iter 26: train loss 0.20013. lr 2.962645e-04:   1%|          | 27/3397 [00:10<20:45,  2.71it/s][A
epoch 2 iter 27: train loss 0.18970. lr 2.961257e-04:   1%|          | 27/3397 [00:10<20:45,  2.71it/s][A
epoch 2 iter 27: train loss 0.18970. lr 2.961257e-04:   1%|          | 28/3397 [00:10<20:50,  2.69it/s][A
epoch 2 iter 28: train loss 0.20524. lr 2.959870e-04:   1%|          | 28/3397 [00:10<20:50,  2.69it/s][A
epoch 2 iter 28: train loss 0.20524. lr 2.959870e-04:   1%|          | 29/3397 [00:10<20:50,  2.69it/s][A
epoch 2 iter 29: train loss 0.19979. lr 2.958482e-04:   1%|          | 29/3397 [00:11<20:50,  2.69it/s][A
epoch 2 iter 29: train loss 0.19979. lr 2.958482e-04:   1%|          | 30/3397 [00:11<20:48,  2.70it/s][A
epoch 2 iter 30: train loss 0.19425. lr 2.957095e-04:   1%|          | 30/3397 [00:11<20:48,  2.70it/s][A
epoch 2 iter 30: train loss 0.19425. lr 2.957095e-04:   1%|          | 31/3397 [00:11<20:45,  2.70it/s][A
epoch 2 iter 31: train loss 0.21558. 

epoch 2 iter 64: train loss 0.19410. lr 2.909929e-04:   2%|▏         | 65/3397 [00:24<20:26,  2.72it/s][A
epoch 2 iter 65: train loss 0.19867. lr 2.908542e-04:   2%|▏         | 65/3397 [00:24<20:26,  2.72it/s][A
epoch 2 iter 65: train loss 0.19867. lr 2.908542e-04:   2%|▏         | 66/3397 [00:24<20:21,  2.73it/s][A
epoch 2 iter 66: train loss 0.19070. lr 2.907155e-04:   2%|▏         | 66/3397 [00:24<20:21,  2.73it/s][A
epoch 2 iter 66: train loss 0.19070. lr 2.907155e-04:   2%|▏         | 67/3397 [00:24<20:18,  2.73it/s][A
epoch 2 iter 67: train loss 0.20625. lr 2.905768e-04:   2%|▏         | 67/3397 [00:25<20:18,  2.73it/s][A
epoch 2 iter 67: train loss 0.20625. lr 2.905768e-04:   2%|▏         | 68/3397 [00:25<20:29,  2.71it/s][A
epoch 2 iter 68: train loss 0.19471. lr 2.904381e-04:   2%|▏         | 68/3397 [00:25<20:29,  2.71it/s][A
epoch 2 iter 68: train loss 0.19471. lr 2.904381e-04:   2%|▏         | 69/3397 [00:25<20:40,  2.68it/s][A
epoch 2 iter 69: train loss 0.20456. 

epoch 2 iter 102: train loss 0.20363. lr 2.857241e-04:   3%|▎         | 103/3397 [00:38<20:23,  2.69it/s][A
epoch 2 iter 103: train loss 0.19212. lr 2.855855e-04:   3%|▎         | 103/3397 [00:38<20:23,  2.69it/s][A
epoch 2 iter 103: train loss 0.19212. lr 2.855855e-04:   3%|▎         | 104/3397 [00:38<20:12,  2.72it/s][A
epoch 2 iter 104: train loss 0.20966. lr 2.854469e-04:   3%|▎         | 104/3397 [00:38<20:12,  2.72it/s][A
epoch 2 iter 104: train loss 0.20966. lr 2.854469e-04:   3%|▎         | 105/3397 [00:38<20:13,  2.71it/s][A
epoch 2 iter 105: train loss 0.17803. lr 2.853083e-04:   3%|▎         | 105/3397 [00:39<20:13,  2.71it/s][A
epoch 2 iter 105: train loss 0.17803. lr 2.853083e-04:   3%|▎         | 106/3397 [00:39<20:14,  2.71it/s][A
epoch 2 iter 106: train loss 0.20426. lr 2.851697e-04:   3%|▎         | 106/3397 [00:39<20:14,  2.71it/s][A
epoch 2 iter 106: train loss 0.20426. lr 2.851697e-04:   3%|▎         | 107/3397 [00:39<20:11,  2.72it/s][A
epoch 2 iter 107: t

epoch 2 iter 140: train loss 0.19413. lr 2.804597e-04:   4%|▍         | 140/3397 [00:52<20:01,  2.71it/s][A
epoch 2 iter 140: train loss 0.19413. lr 2.804597e-04:   4%|▍         | 141/3397 [00:52<20:00,  2.71it/s][A
epoch 2 iter 141: train loss 0.20301. lr 2.803213e-04:   4%|▍         | 141/3397 [00:52<20:00,  2.71it/s][A
epoch 2 iter 141: train loss 0.20301. lr 2.803213e-04:   4%|▍         | 142/3397 [00:52<19:58,  2.72it/s][A
epoch 2 iter 142: train loss 0.20617. lr 2.801828e-04:   4%|▍         | 142/3397 [00:52<19:58,  2.72it/s][A
epoch 2 iter 142: train loss 0.20617. lr 2.801828e-04:   4%|▍         | 143/3397 [00:52<19:58,  2.72it/s][A
epoch 2 iter 143: train loss 0.20382. lr 2.800444e-04:   4%|▍         | 143/3397 [00:53<19:58,  2.72it/s][A
epoch 2 iter 143: train loss 0.20382. lr 2.800444e-04:   4%|▍         | 144/3397 [00:53<20:00,  2.71it/s][A
epoch 2 iter 144: train loss 0.20171. lr 2.799059e-04:   4%|▍         | 144/3397 [00:53<20:00,  2.71it/s][A
epoch 2 iter 144: t

epoch 2 iter 177: train loss 0.20673. lr 2.753397e-04:   5%|▌         | 178/3397 [01:05<19:50,  2.70it/s][A
epoch 2 iter 178: train loss 0.18902. lr 2.752014e-04:   5%|▌         | 178/3397 [01:06<19:50,  2.70it/s][A
epoch 2 iter 178: train loss 0.18902. lr 2.752014e-04:   5%|▌         | 179/3397 [01:06<19:45,  2.71it/s][A
epoch 2 iter 179: train loss 0.19869. lr 2.750631e-04:   5%|▌         | 179/3397 [01:06<19:45,  2.71it/s][A
epoch 2 iter 179: train loss 0.19869. lr 2.750631e-04:   5%|▌         | 180/3397 [01:06<19:47,  2.71it/s][A
epoch 2 iter 180: train loss 0.20654. lr 2.749248e-04:   5%|▌         | 180/3397 [01:06<19:47,  2.71it/s][A
epoch 2 iter 180: train loss 0.20654. lr 2.749248e-04:   5%|▌         | 181/3397 [01:06<19:48,  2.71it/s][A
epoch 2 iter 181: train loss 0.19077. lr 2.747866e-04:   5%|▌         | 181/3397 [01:07<19:48,  2.71it/s][A
epoch 2 iter 181: train loss 0.19077. lr 2.747866e-04:   5%|▌         | 182/3397 [01:07<19:47,  2.71it/s][A
epoch 2 iter 182: t

epoch 2 iter 215: train loss 0.20290. lr 2.700888e-04:   6%|▋         | 215/3397 [01:19<19:31,  2.72it/s][A
epoch 2 iter 215: train loss 0.20290. lr 2.700888e-04:   6%|▋         | 216/3397 [01:19<19:32,  2.71it/s][A
epoch 2 iter 216: train loss 0.18885. lr 2.699507e-04:   6%|▋         | 216/3397 [01:20<19:32,  2.71it/s][A
epoch 2 iter 216: train loss 0.18885. lr 2.699507e-04:   6%|▋         | 217/3397 [01:20<19:30,  2.72it/s][A
epoch 2 iter 217: train loss 0.18306. lr 2.698127e-04:   6%|▋         | 217/3397 [01:20<19:30,  2.72it/s][A
epoch 2 iter 217: train loss 0.18306. lr 2.698127e-04:   6%|▋         | 218/3397 [01:20<19:28,  2.72it/s][A
epoch 2 iter 218: train loss 0.17731. lr 2.696746e-04:   6%|▋         | 218/3397 [01:20<19:28,  2.72it/s][A
epoch 2 iter 218: train loss 0.17731. lr 2.696746e-04:   6%|▋         | 219/3397 [01:20<19:31,  2.71it/s][A
epoch 2 iter 219: train loss 0.19477. lr 2.695366e-04:   6%|▋         | 219/3397 [01:21<19:31,  2.71it/s][A
epoch 2 iter 219: t

epoch 2 iter 252: train loss 0.20242. lr 2.649849e-04:   7%|▋         | 253/3397 [01:33<19:16,  2.72it/s][A
epoch 2 iter 253: train loss 0.19521. lr 2.648471e-04:   7%|▋         | 253/3397 [01:33<19:16,  2.72it/s][A
epoch 2 iter 253: train loss 0.19521. lr 2.648471e-04:   7%|▋         | 254/3397 [01:33<19:11,  2.73it/s][A
epoch 2 iter 254: train loss 0.20066. lr 2.647093e-04:   7%|▋         | 254/3397 [01:34<19:11,  2.73it/s][A
epoch 2 iter 254: train loss 0.20066. lr 2.647093e-04:   8%|▊         | 255/3397 [01:34<19:13,  2.72it/s][A
epoch 2 iter 255: train loss 0.20100. lr 2.645715e-04:   8%|▊         | 255/3397 [01:34<19:13,  2.72it/s][A
epoch 2 iter 255: train loss 0.20100. lr 2.645715e-04:   8%|▊         | 256/3397 [01:34<19:16,  2.72it/s][A
epoch 2 iter 256: train loss 0.17879. lr 2.644338e-04:   8%|▊         | 256/3397 [01:34<19:16,  2.72it/s][A
epoch 2 iter 256: train loss 0.17879. lr 2.644338e-04:   8%|▊         | 257/3397 [01:34<19:15,  2.72it/s][A
epoch 2 iter 257: t

epoch 2 iter 290: train loss 0.19624. lr 2.597538e-04:   9%|▊         | 290/3397 [01:47<19:35,  2.64it/s][A
epoch 2 iter 290: train loss 0.19624. lr 2.597538e-04:   9%|▊         | 291/3397 [01:47<19:29,  2.66it/s][A
epoch 2 iter 291: train loss 0.18561. lr 2.596163e-04:   9%|▊         | 291/3397 [01:47<19:29,  2.66it/s][A
epoch 2 iter 291: train loss 0.18561. lr 2.596163e-04:   9%|▊         | 292/3397 [01:47<19:23,  2.67it/s][A
epoch 2 iter 292: train loss 0.18535. lr 2.594788e-04:   9%|▊         | 292/3397 [01:48<19:23,  2.67it/s][A
epoch 2 iter 292: train loss 0.18535. lr 2.594788e-04:   9%|▊         | 293/3397 [01:48<19:16,  2.68it/s][A
epoch 2 iter 293: train loss 0.20522. lr 2.593414e-04:   9%|▊         | 293/3397 [01:48<19:16,  2.68it/s][A
epoch 2 iter 293: train loss 0.20522. lr 2.593414e-04:   9%|▊         | 294/3397 [01:48<19:28,  2.66it/s][A
epoch 2 iter 294: train loss 0.18455. lr 2.592039e-04:   9%|▊         | 294/3397 [01:49<19:28,  2.66it/s][A
epoch 2 iter 294: t

epoch 2 iter 327: train loss 0.21272. lr 2.546723e-04:  10%|▉         | 328/3397 [02:01<18:48,  2.72it/s][A
epoch 2 iter 328: train loss 0.19646. lr 2.545352e-04:  10%|▉         | 328/3397 [02:01<18:48,  2.72it/s][A
epoch 2 iter 328: train loss 0.19646. lr 2.545352e-04:  10%|▉         | 329/3397 [02:01<18:49,  2.72it/s][A
epoch 2 iter 329: train loss 0.19083. lr 2.543980e-04:  10%|▉         | 329/3397 [02:02<18:49,  2.72it/s][A
epoch 2 iter 329: train loss 0.19083. lr 2.543980e-04:  10%|▉         | 330/3397 [02:02<18:45,  2.72it/s][A
epoch 2 iter 330: train loss 0.19706. lr 2.542609e-04:  10%|▉         | 330/3397 [02:02<18:45,  2.72it/s][A
epoch 2 iter 330: train loss 0.19706. lr 2.542609e-04:  10%|▉         | 331/3397 [02:02<18:47,  2.72it/s][A
epoch 2 iter 331: train loss 0.18495. lr 2.541237e-04:  10%|▉         | 331/3397 [02:02<18:47,  2.72it/s][A
epoch 2 iter 331: train loss 0.18495. lr 2.541237e-04:  10%|▉         | 332/3397 [02:02<18:48,  2.72it/s][A
epoch 2 iter 332: t

epoch 2 iter 365: train loss 0.17994. lr 2.494673e-04:  11%|█         | 365/3397 [02:15<18:56,  2.67it/s][A
epoch 2 iter 365: train loss 0.17994. lr 2.494673e-04:  11%|█         | 366/3397 [02:15<18:57,  2.66it/s][A
epoch 2 iter 366: train loss 0.18867. lr 2.493305e-04:  11%|█         | 366/3397 [02:15<18:57,  2.66it/s][A
epoch 2 iter 366: train loss 0.18867. lr 2.493305e-04:  11%|█         | 367/3397 [02:15<18:45,  2.69it/s][A
epoch 2 iter 367: train loss 0.18373. lr 2.491938e-04:  11%|█         | 367/3397 [02:16<18:45,  2.69it/s][A
epoch 2 iter 367: train loss 0.18373. lr 2.491938e-04:  11%|█         | 368/3397 [02:16<18:41,  2.70it/s][A
epoch 2 iter 368: train loss 0.19967. lr 2.490570e-04:  11%|█         | 368/3397 [02:16<18:41,  2.70it/s][A
epoch 2 iter 368: train loss 0.19967. lr 2.490570e-04:  11%|█         | 369/3397 [02:16<18:39,  2.70it/s][A
epoch 2 iter 369: train loss 0.19808. lr 2.489203e-04:  11%|█         | 369/3397 [02:16<18:39,  2.70it/s][A
epoch 2 iter 369: t

epoch 2 iter 402: train loss 0.18972. lr 2.444142e-04:  12%|█▏        | 403/3397 [02:29<18:22,  2.72it/s][A
epoch 2 iter 403: train loss 0.17844. lr 2.442779e-04:  12%|█▏        | 403/3397 [02:29<18:22,  2.72it/s][A
epoch 2 iter 403: train loss 0.17844. lr 2.442779e-04:  12%|█▏        | 404/3397 [02:29<18:23,  2.71it/s][A
epoch 2 iter 404: train loss 0.17576. lr 2.441416e-04:  12%|█▏        | 404/3397 [02:29<18:23,  2.71it/s][A
epoch 2 iter 404: train loss 0.17576. lr 2.441416e-04:  12%|█▏        | 405/3397 [02:29<18:21,  2.72it/s][A
epoch 2 iter 405: train loss 0.19678. lr 2.440052e-04:  12%|█▏        | 405/3397 [02:30<18:21,  2.72it/s][A
epoch 2 iter 405: train loss 0.19678. lr 2.440052e-04:  12%|█▏        | 406/3397 [02:30<18:19,  2.72it/s][A
epoch 2 iter 406: train loss 0.20581. lr 2.438689e-04:  12%|█▏        | 406/3397 [02:30<18:19,  2.72it/s][A
epoch 2 iter 406: train loss 0.20581. lr 2.438689e-04:  12%|█▏        | 407/3397 [02:30<18:20,  2.72it/s][A
epoch 2 iter 407: t

epoch 2 iter 440: train loss 0.18764. lr 2.392416e-04:  13%|█▎        | 440/3397 [02:43<18:08,  2.72it/s][A
epoch 2 iter 440: train loss 0.18764. lr 2.392416e-04:  13%|█▎        | 441/3397 [02:43<18:10,  2.71it/s][A
epoch 2 iter 441: train loss 0.18883. lr 2.391057e-04:  13%|█▎        | 441/3397 [02:43<18:10,  2.71it/s][A
epoch 2 iter 441: train loss 0.18883. lr 2.391057e-04:  13%|█▎        | 442/3397 [02:43<18:05,  2.72it/s][A
epoch 2 iter 442: train loss 0.18042. lr 2.389698e-04:  13%|█▎        | 442/3397 [02:43<18:05,  2.72it/s][A
epoch 2 iter 442: train loss 0.18042. lr 2.389698e-04:  13%|█▎        | 443/3397 [02:43<18:07,  2.72it/s][A
epoch 2 iter 443: train loss 0.18789. lr 2.388340e-04:  13%|█▎        | 443/3397 [02:44<18:07,  2.72it/s][A
epoch 2 iter 443: train loss 0.18789. lr 2.388340e-04:  13%|█▎        | 444/3397 [02:44<18:07,  2.71it/s][A
epoch 2 iter 444: train loss 0.20131. lr 2.386982e-04:  13%|█▎        | 444/3397 [02:44<18:07,  2.71it/s][A
epoch 2 iter 444: t

epoch 2 iter 477: train loss 0.18108. lr 2.342231e-04:  14%|█▍        | 478/3397 [02:56<18:00,  2.70it/s][A
epoch 2 iter 478: train loss 0.19675. lr 2.340877e-04:  14%|█▍        | 478/3397 [02:57<18:00,  2.70it/s][A
epoch 2 iter 478: train loss 0.19675. lr 2.340877e-04:  14%|█▍        | 479/3397 [02:57<17:59,  2.70it/s][A
epoch 2 iter 479: train loss 0.19949. lr 2.339523e-04:  14%|█▍        | 479/3397 [02:57<17:59,  2.70it/s][A
epoch 2 iter 479: train loss 0.19949. lr 2.339523e-04:  14%|█▍        | 480/3397 [02:57<17:57,  2.71it/s][A
epoch 2 iter 480: train loss 0.19304. lr 2.338170e-04:  14%|█▍        | 480/3397 [02:57<17:57,  2.71it/s][A
epoch 2 iter 480: train loss 0.19304. lr 2.338170e-04:  14%|█▍        | 481/3397 [02:57<17:55,  2.71it/s][A
epoch 2 iter 481: train loss 0.19479. lr 2.336816e-04:  14%|█▍        | 481/3397 [02:58<17:55,  2.71it/s][A
epoch 2 iter 481: train loss 0.19479. lr 2.336816e-04:  14%|█▍        | 482/3397 [02:58<17:56,  2.71it/s][A
epoch 2 iter 482: t

epoch 2 iter 515: train loss 0.17234. lr 2.290890e-04:  15%|█▌        | 515/3397 [03:10<17:39,  2.72it/s][A
epoch 2 iter 515: train loss 0.17234. lr 2.290890e-04:  15%|█▌        | 516/3397 [03:10<17:41,  2.71it/s][A
epoch 2 iter 516: train loss 0.16872. lr 2.289541e-04:  15%|█▌        | 516/3397 [03:11<17:41,  2.71it/s][A
epoch 2 iter 516: train loss 0.16872. lr 2.289541e-04:  15%|█▌        | 517/3397 [03:11<17:43,  2.71it/s][A
epoch 2 iter 517: train loss 0.18477. lr 2.288193e-04:  15%|█▌        | 517/3397 [03:11<17:43,  2.71it/s][A
epoch 2 iter 517: train loss 0.18477. lr 2.288193e-04:  15%|█▌        | 518/3397 [03:11<17:41,  2.71it/s][A
epoch 2 iter 518: train loss 0.18732. lr 2.286846e-04:  15%|█▌        | 518/3397 [03:11<17:41,  2.71it/s][A
epoch 2 iter 518: train loss 0.18732. lr 2.286846e-04:  15%|█▌        | 519/3397 [03:11<17:40,  2.71it/s][A
epoch 2 iter 519: train loss 0.17735. lr 2.285498e-04:  15%|█▌        | 519/3397 [03:12<17:40,  2.71it/s][A
epoch 2 iter 519: t

epoch 2 iter 552: train loss 0.17833. lr 2.241110e-04:  16%|█▋        | 553/3397 [03:24<17:29,  2.71it/s][A
epoch 2 iter 553: train loss 0.19518. lr 2.239768e-04:  16%|█▋        | 553/3397 [03:24<17:29,  2.71it/s][A
epoch 2 iter 553: train loss 0.19518. lr 2.239768e-04:  16%|█▋        | 554/3397 [03:24<17:28,  2.71it/s][A
epoch 2 iter 554: train loss 0.17498. lr 2.238426e-04:  16%|█▋        | 554/3397 [03:25<17:28,  2.71it/s][A
epoch 2 iter 554: train loss 0.17498. lr 2.238426e-04:  16%|█▋        | 555/3397 [03:25<17:24,  2.72it/s][A
epoch 2 iter 555: train loss 0.17830. lr 2.237083e-04:  16%|█▋        | 555/3397 [03:25<17:24,  2.72it/s][A
epoch 2 iter 555: train loss 0.17830. lr 2.237083e-04:  16%|█▋        | 556/3397 [03:25<19:03,  2.48it/s][A
epoch 2 iter 556: train loss 0.17612. lr 2.235742e-04:  16%|█▋        | 556/3397 [03:26<19:03,  2.48it/s][A
epoch 2 iter 556: train loss 0.17612. lr 2.235742e-04:  16%|█▋        | 557/3397 [03:26<18:32,  2.55it/s][A
epoch 2 iter 557: t

epoch 2 iter 590: train loss 0.18516. lr 2.190217e-04:  17%|█▋        | 590/3397 [03:38<17:13,  2.72it/s][A
epoch 2 iter 590: train loss 0.18516. lr 2.190217e-04:  17%|█▋        | 591/3397 [03:38<17:15,  2.71it/s][A
epoch 2 iter 591: train loss 0.17334. lr 2.188881e-04:  17%|█▋        | 591/3397 [03:39<17:15,  2.71it/s][A
epoch 2 iter 591: train loss 0.17334. lr 2.188881e-04:  17%|█▋        | 592/3397 [03:39<17:15,  2.71it/s][A
epoch 2 iter 592: train loss 0.18983. lr 2.187545e-04:  17%|█▋        | 592/3397 [03:39<17:15,  2.71it/s][A
epoch 2 iter 592: train loss 0.18983. lr 2.187545e-04:  17%|█▋        | 593/3397 [03:39<17:14,  2.71it/s][A
epoch 2 iter 593: train loss 0.19508. lr 2.186209e-04:  17%|█▋        | 593/3397 [03:39<17:14,  2.71it/s][A
epoch 2 iter 593: train loss 0.19508. lr 2.186209e-04:  17%|█▋        | 594/3397 [03:39<17:09,  2.72it/s][A
epoch 2 iter 594: train loss 0.17904. lr 2.184874e-04:  17%|█▋        | 594/3397 [03:40<17:09,  2.72it/s][A
epoch 2 iter 594: t

epoch 2 iter 627: train loss 0.18453. lr 2.140903e-04:  18%|█▊        | 628/3397 [03:52<17:01,  2.71it/s][A
epoch 2 iter 628: train loss 0.19347. lr 2.139573e-04:  18%|█▊        | 628/3397 [03:52<17:01,  2.71it/s][A
epoch 2 iter 628: train loss 0.19347. lr 2.139573e-04:  19%|█▊        | 629/3397 [03:52<16:58,  2.72it/s][A
epoch 2 iter 629: train loss 0.17245. lr 2.138244e-04:  19%|█▊        | 629/3397 [03:53<16:58,  2.72it/s][A
epoch 2 iter 629: train loss 0.17245. lr 2.138244e-04:  19%|█▊        | 630/3397 [03:53<17:00,  2.71it/s][A
epoch 2 iter 630: train loss 0.18055. lr 2.136915e-04:  19%|█▊        | 630/3397 [03:53<17:00,  2.71it/s][A
epoch 2 iter 630: train loss 0.18055. lr 2.136915e-04:  19%|█▊        | 631/3397 [03:53<17:00,  2.71it/s][A
epoch 2 iter 631: train loss 0.19261. lr 2.135586e-04:  19%|█▊        | 631/3397 [03:53<17:00,  2.71it/s][A
epoch 2 iter 631: train loss 0.19261. lr 2.135586e-04:  19%|█▊        | 632/3397 [03:53<16:59,  2.71it/s][A
epoch 2 iter 632: t

epoch 2 iter 665: train loss 0.17919. lr 2.090518e-04:  20%|█▉        | 665/3397 [04:06<16:47,  2.71it/s][A
epoch 2 iter 665: train loss 0.17919. lr 2.090518e-04:  20%|█▉        | 666/3397 [04:06<16:48,  2.71it/s][A
epoch 2 iter 666: train loss 0.18521. lr 2.089196e-04:  20%|█▉        | 666/3397 [04:06<16:48,  2.71it/s][A
epoch 2 iter 666: train loss 0.18521. lr 2.089196e-04:  20%|█▉        | 667/3397 [04:06<16:47,  2.71it/s][A
epoch 2 iter 667: train loss 0.17544. lr 2.087874e-04:  20%|█▉        | 667/3397 [04:07<16:47,  2.71it/s][A
epoch 2 iter 667: train loss 0.17544. lr 2.087874e-04:  20%|█▉        | 668/3397 [04:07<16:46,  2.71it/s][A
epoch 2 iter 668: train loss 0.16864. lr 2.086552e-04:  20%|█▉        | 668/3397 [04:07<16:46,  2.71it/s][A
epoch 2 iter 668: train loss 0.16864. lr 2.086552e-04:  20%|█▉        | 669/3397 [04:07<16:43,  2.72it/s][A
epoch 2 iter 669: train loss 0.17124. lr 2.085230e-04:  20%|█▉        | 669/3397 [04:07<16:43,  2.72it/s][A
epoch 2 iter 669: t

epoch 2 iter 702: train loss 0.18672. lr 2.041729e-04:  21%|██        | 703/3397 [04:20<16:55,  2.65it/s][A
epoch 2 iter 703: train loss 0.16539. lr 2.040414e-04:  21%|██        | 703/3397 [04:20<16:55,  2.65it/s][A
epoch 2 iter 703: train loss 0.16539. lr 2.040414e-04:  21%|██        | 704/3397 [04:20<17:03,  2.63it/s][A
epoch 2 iter 704: train loss 0.16381. lr 2.039100e-04:  21%|██        | 704/3397 [04:20<17:03,  2.63it/s][A
epoch 2 iter 704: train loss 0.16381. lr 2.039100e-04:  21%|██        | 705/3397 [04:20<17:04,  2.63it/s][A
epoch 2 iter 705: train loss 0.18157. lr 2.037785e-04:  21%|██        | 705/3397 [04:21<17:04,  2.63it/s][A
epoch 2 iter 705: train loss 0.18157. lr 2.037785e-04:  21%|██        | 706/3397 [04:21<17:00,  2.64it/s][A
epoch 2 iter 706: train loss 0.17552. lr 2.036471e-04:  21%|██        | 706/3397 [04:21<17:00,  2.64it/s][A
epoch 2 iter 706: train loss 0.17552. lr 2.036471e-04:  21%|██        | 707/3397 [04:21<16:56,  2.65it/s][A
epoch 2 iter 707: t

epoch 2 iter 740: train loss 0.18757. lr 1.991914e-04:  22%|██▏       | 740/3397 [04:34<16:20,  2.71it/s][A
epoch 2 iter 740: train loss 0.18757. lr 1.991914e-04:  22%|██▏       | 741/3397 [04:34<16:21,  2.71it/s][A
epoch 2 iter 741: train loss 0.17019. lr 1.990607e-04:  22%|██▏       | 741/3397 [04:34<16:21,  2.71it/s][A
epoch 2 iter 741: train loss 0.17019. lr 1.990607e-04:  22%|██▏       | 742/3397 [04:34<16:21,  2.70it/s][A
epoch 2 iter 742: train loss 0.17316. lr 1.989300e-04:  22%|██▏       | 742/3397 [04:34<16:21,  2.70it/s][A
epoch 2 iter 742: train loss 0.17316. lr 1.989300e-04:  22%|██▏       | 743/3397 [04:34<16:21,  2.70it/s][A
epoch 2 iter 743: train loss 0.18934. lr 1.987994e-04:  22%|██▏       | 743/3397 [04:35<16:21,  2.70it/s][A
epoch 2 iter 743: train loss 0.18934. lr 1.987994e-04:  22%|██▏       | 744/3397 [04:35<16:21,  2.70it/s][A
epoch 2 iter 744: train loss 0.17185. lr 1.986688e-04:  22%|██▏       | 744/3397 [04:35<16:21,  2.70it/s][A
epoch 2 iter 744: t

epoch 2 iter 777: train loss 0.17858. lr 1.943709e-04:  23%|██▎       | 778/3397 [04:47<16:04,  2.72it/s][A
epoch 2 iter 778: train loss 0.17722. lr 1.942410e-04:  23%|██▎       | 778/3397 [04:48<16:04,  2.72it/s][A
epoch 2 iter 778: train loss 0.17722. lr 1.942410e-04:  23%|██▎       | 779/3397 [04:48<16:05,  2.71it/s][A
epoch 2 iter 779: train loss 0.17330. lr 1.941112e-04:  23%|██▎       | 779/3397 [04:48<16:05,  2.71it/s][A
epoch 2 iter 779: train loss 0.17330. lr 1.941112e-04:  23%|██▎       | 780/3397 [04:48<16:06,  2.71it/s][A
epoch 2 iter 780: train loss 0.16653. lr 1.939813e-04:  23%|██▎       | 780/3397 [04:49<16:06,  2.71it/s][A
epoch 2 iter 780: train loss 0.16653. lr 1.939813e-04:  23%|██▎       | 781/3397 [04:49<16:05,  2.71it/s][A
epoch 2 iter 781: train loss 0.18217. lr 1.938515e-04:  23%|██▎       | 781/3397 [04:49<16:05,  2.71it/s][A
epoch 2 iter 781: train loss 0.18217. lr 1.938515e-04:  23%|██▎       | 782/3397 [04:49<16:04,  2.71it/s][A

epoch 2 iter 815: train loss 0.16858. lr 1.894523e-04:  24%|██▍       | 815/3397 [05:01<15:56,  2.70it/s][A
epoch 2 iter 815: train loss 0.16858. lr 1.894523e-04:  24%|██▍       | 816/3397 [05:01<15:54,  2.70it/s][A
epoch 2 iter 816: train loss 0.17234. lr 1.893233e-04:  24%|██▍       | 816/3397 [05:02<15:54,  2.70it/s][A
epoch 2 iter 816: train loss 0.17234. lr 1.893233e-04:  24%|██▍       | 817/3397 [05:02<15:53,  2.71it/s][A
epoch 2 iter 817: train loss 0.17858. lr 1.891943e-04:  24%|██▍       | 817/3397 [05:02<15:53,  2.71it/s][A
epoch 2 iter 817: train loss 0.17858. lr 1.891943e-04:  24%|██▍       | 818/3397 [05:02<15:48,  2.72it/s][A
epoch 2 iter 818: train loss 0.18420. lr 1.890654e-04:  24%|██▍       | 818/3397 [05:03<15:48,  2.72it/s][A
epoch 2 iter 818: train loss 0.18420. lr 1.890654e-04:  24%|██▍       | 819/3397 [05:03<15:51,  2.71it/s][A
epoch 2 iter 819: train loss 0.16103. lr 1.889365e-04:  24%|██▍       | 819/3397 [05:03<15:51,  2.71it/s][A

epoch 2 iter 852: train loss 0.17797. lr 1.846959e-04:  25%|██▌       | 853/3397 [05:15<15:37,  2.71it/s][A
epoch 2 iter 853: train loss 0.17004. lr 1.845678e-04:  25%|██▌       | 853/3397 [05:15<15:37,  2.71it/s][A
epoch 2 iter 853: train loss 0.17004. lr 1.845678e-04:  25%|██▌       | 854/3397 [05:15<15:33,  2.72it/s][A
epoch 2 iter 854: train loss 0.15640. lr 1.844397e-04:  25%|██▌       | 854/3397 [05:16<15:33,  2.72it/s][A
epoch 2 iter 854: train loss 0.15640. lr 1.844397e-04:  25%|██▌       | 855/3397 [05:16<15:35,  2.72it/s][A
epoch 2 iter 855: train loss 0.16737. lr 1.843117e-04:  25%|██▌       | 855/3397 [05:16<15:35,  2.72it/s][A
epoch 2 iter 855: train loss 0.16737. lr 1.843117e-04:  25%|██▌       | 856/3397 [05:16<15:36,  2.71it/s][A
epoch 2 iter 856: train loss 0.15092. lr 1.841837e-04:  25%|██▌       | 856/3397 [05:17<15:36,  2.71it/s][A
epoch 2 iter 856: train loss 0.15092. lr 1.841837e-04:  25%|██▌       | 857/3397 [05:17<15:36,  2.71it/s][A

epoch 2 iter 890: train loss 0.19068. lr 1.798461e-04:  26%|██▌       | 890/3397 [05:29<15:23,  2.72it/s][A
epoch 2 iter 890: train loss 0.19068. lr 1.798461e-04:  26%|██▌       | 891/3397 [05:29<15:29,  2.70it/s][A
epoch 2 iter 891: train loss 0.16812. lr 1.797190e-04:  26%|██▌       | 891/3397 [05:30<15:29,  2.70it/s][A
epoch 2 iter 891: train loss 0.16812. lr 1.797190e-04:  26%|██▋       | 892/3397 [05:30<15:34,  2.68it/s][A
epoch 2 iter 892: train loss 0.17841. lr 1.795919e-04:  26%|██▋       | 892/3397 [05:30<15:34,  2.68it/s][A
epoch 2 iter 892: train loss 0.17841. lr 1.795919e-04:  26%|██▋       | 893/3397 [05:30<15:36,  2.67it/s][A
epoch 2 iter 893: train loss 0.18249. lr 1.794648e-04:  26%|██▋       | 893/3397 [05:30<15:36,  2.67it/s][A
epoch 2 iter 893: train loss 0.18249. lr 1.794648e-04:  26%|██▋       | 894/3397 [05:30<15:35,  2.68it/s][A
epoch 2 iter 894: train loss 0.16144. lr 1.793378e-04:  26%|██▋       | 894/3397 [05:31<15:35,  2.68it/s][A

epoch 2 iter 927: train loss 0.16446. lr 1.751597e-04:  27%|██▋       | 928/3397 [05:43<15:08,  2.72it/s][A
epoch 2 iter 928: train loss 0.16864. lr 1.750335e-04:  27%|██▋       | 928/3397 [05:43<15:08,  2.72it/s][A
epoch 2 iter 928: train loss 0.16864. lr 1.750335e-04:  27%|██▋       | 929/3397 [05:43<15:08,  2.72it/s][A
epoch 2 iter 929: train loss 0.16561. lr 1.749074e-04:  27%|██▋       | 929/3397 [05:44<15:08,  2.72it/s][A
epoch 2 iter 929: train loss 0.16561. lr 1.749074e-04:  27%|██▋       | 930/3397 [05:44<15:08,  2.71it/s][A
epoch 2 iter 930: train loss 0.17090. lr 1.747813e-04:  27%|██▋       | 930/3397 [05:44<15:08,  2.71it/s][A
epoch 2 iter 930: train loss 0.17090. lr 1.747813e-04:  27%|██▋       | 931/3397 [05:44<15:08,  2.72it/s][A
epoch 2 iter 931: train loss 0.15945. lr 1.746552e-04:  27%|██▋       | 931/3397 [05:44<15:08,  2.72it/s][A
epoch 2 iter 931: train loss 0.15945. lr 1.746552e-04:  27%|██▋       | 932/3397 [05:44<15:03,  2.73it/s][A

epoch 2 iter 965: train loss 0.17834. lr 1.703846e-04:  28%|██▊       | 965/3397 [05:57<14:57,  2.71it/s][A
epoch 2 iter 965: train loss 0.17834. lr 1.703846e-04:  28%|██▊       | 966/3397 [05:57<14:56,  2.71it/s][A
epoch 2 iter 966: train loss 0.17475. lr 1.702595e-04:  28%|██▊       | 966/3397 [05:57<14:56,  2.71it/s][A
epoch 2 iter 966: train loss 0.17475. lr 1.702595e-04:  28%|██▊       | 967/3397 [05:57<14:52,  2.72it/s][A
epoch 2 iter 967: train loss 0.15971. lr 1.701344e-04:  28%|██▊       | 967/3397 [05:58<14:52,  2.72it/s][A
epoch 2 iter 967: train loss 0.15971. lr 1.701344e-04:  28%|██▊       | 968/3397 [05:58<14:54,  2.71it/s][A
epoch 2 iter 968: train loss 0.17013. lr 1.700093e-04:  28%|██▊       | 968/3397 [05:58<14:54,  2.71it/s][A
epoch 2 iter 968: train loss 0.17013. lr 1.700093e-04:  29%|██▊       | 969/3397 [05:58<14:55,  2.71it/s][A
epoch 2 iter 969: train loss 0.18536. lr 1.698843e-04:  29%|██▊       | 969/3397 [05:58<14:55,  2.71it/s][A

epoch 2 iter 1002: train loss 0.16611. lr 1.657737e-04:  30%|██▉       | 1003/3397 [06:11<14:42,  2.71it/s][A
epoch 2 iter 1003: train loss 0.17813. lr 1.656496e-04:  30%|██▉       | 1003/3397 [06:11<14:42,  2.71it/s][A
epoch 2 iter 1003: train loss 0.17813. lr 1.656496e-04:  30%|██▉       | 1004/3397 [06:11<14:43,  2.71it/s][A
epoch 2 iter 1004: train loss 0.17931. lr 1.655255e-04:  30%|██▉       | 1004/3397 [06:11<14:43,  2.71it/s][A
epoch 2 iter 1004: train loss 0.17931. lr 1.655255e-04:  30%|██▉       | 1005/3397 [06:11<14:42,  2.71it/s][A
epoch 2 iter 1005: train loss 0.16995. lr 1.654015e-04:  30%|██▉       | 1005/3397 [06:12<14:42,  2.71it/s][A
epoch 2 iter 1005: train loss 0.16995. lr 1.654015e-04:  30%|██▉       | 1006/3397 [06:12<14:41,  2.71it/s][A
epoch 2 iter 1006: train loss 0.17359. lr 1.652775e-04:  30%|██▉       | 1006/3397 [06:12<14:41,  2.71it/s][A
epoch 2 iter 1006: train loss 0.17359. lr 1.652775e-04:  30%|██▉       | 1007/3397 [06:12<14:37,  2.72it/s][A

epoch 2 iter 1039: train loss 0.16300. lr 1.612020e-04:  31%|███       | 1039/3397 [06:24<14:28,  2.72it/s][A
epoch 2 iter 1039: train loss 0.16300. lr 1.612020e-04:  31%|███       | 1040/3397 [06:24<14:32,  2.70it/s][A
epoch 2 iter 1040: train loss 0.15800. lr 1.610790e-04:  31%|███       | 1040/3397 [06:25<14:32,  2.70it/s][A
epoch 2 iter 1040: train loss 0.15800. lr 1.610790e-04:  31%|███       | 1041/3397 [06:25<14:36,  2.69it/s][A
epoch 2 iter 1041: train loss 0.17072. lr 1.609561e-04:  31%|███       | 1041/3397 [06:25<14:36,  2.69it/s][A
epoch 2 iter 1041: train loss 0.17072. lr 1.609561e-04:  31%|███       | 1042/3397 [06:25<14:36,  2.69it/s][A
epoch 2 iter 1042: train loss 0.17381. lr 1.608331e-04:  31%|███       | 1042/3397 [06:25<14:36,  2.69it/s][A
epoch 2 iter 1042: train loss 0.17381. lr 1.608331e-04:  31%|███       | 1043/3397 [06:25<14:34,  2.69it/s][A
epoch 2 iter 1043: train loss 0.16231. lr 1.607102e-04:  31%|███       | 1043/3397 [06:26<14:34,  2.69it/s][A

epoch 2 iter 1075: train loss 0.16427. lr 1.567930e-04:  32%|███▏      | 1076/3397 [06:38<14:14,  2.72it/s][A
epoch 2 iter 1076: train loss 0.17195. lr 1.566710e-04:  32%|███▏      | 1076/3397 [06:38<14:14,  2.72it/s][A
epoch 2 iter 1076: train loss 0.17195. lr 1.566710e-04:  32%|███▏      | 1077/3397 [06:38<14:15,  2.71it/s][A
epoch 2 iter 1077: train loss 0.16856. lr 1.565492e-04:  32%|███▏      | 1077/3397 [06:38<14:15,  2.71it/s][A
epoch 2 iter 1077: train loss 0.16856. lr 1.565492e-04:  32%|███▏      | 1078/3397 [06:38<14:14,  2.71it/s][A
epoch 2 iter 1078: train loss 0.17933. lr 1.564273e-04:  32%|███▏      | 1078/3397 [06:39<14:14,  2.71it/s][A
epoch 2 iter 1078: train loss 0.17933. lr 1.564273e-04:  32%|███▏      | 1079/3397 [06:39<14:14,  2.71it/s][A
epoch 2 iter 1079: train loss 0.16622. lr 1.563055e-04:  32%|███▏      | 1079/3397 [06:39<14:14,  2.71it/s][A
epoch 2 iter 1079: train loss 0.16622. lr 1.563055e-04:  32%|███▏      | 1080/3397 [06:39<14:11,  2.72it/s][A

epoch 2 iter 1112: train loss 0.15680. lr 1.523028e-04:  33%|███▎      | 1112/3397 [06:51<14:03,  2.71it/s][A
epoch 2 iter 1112: train loss 0.15680. lr 1.523028e-04:  33%|███▎      | 1113/3397 [06:51<14:02,  2.71it/s][A
epoch 2 iter 1113: train loss 0.16632. lr 1.521820e-04:  33%|███▎      | 1113/3397 [06:52<14:02,  2.71it/s][A
epoch 2 iter 1113: train loss 0.16632. lr 1.521820e-04:  33%|███▎      | 1114/3397 [06:52<14:02,  2.71it/s][A
epoch 2 iter 1114: train loss 0.16722. lr 1.520613e-04:  33%|███▎      | 1114/3397 [06:52<14:02,  2.71it/s][A
epoch 2 iter 1114: train loss 0.16722. lr 1.520613e-04:  33%|███▎      | 1115/3397 [06:52<13:58,  2.72it/s][A
epoch 2 iter 1115: train loss 0.16164. lr 1.519406e-04:  33%|███▎      | 1115/3397 [06:52<13:58,  2.72it/s][A
epoch 2 iter 1115: train loss 0.16164. lr 1.519406e-04:  33%|███▎      | 1116/3397 [06:52<14:00,  2.72it/s][A
epoch 2 iter 1116: train loss 0.15277. lr 1.518199e-04:  33%|███▎      | 1116/3397 [06:53<14:00,  2.72it/s][A

epoch 2 iter 1148: train loss 0.17423. lr 1.479755e-04:  34%|███▍      | 1149/3397 [07:04<13:50,  2.71it/s][A
epoch 2 iter 1149: train loss 0.15593. lr 1.478559e-04:  34%|███▍      | 1149/3397 [07:05<13:50,  2.71it/s][A
epoch 2 iter 1149: train loss 0.15593. lr 1.478559e-04:  34%|███▍      | 1150/3397 [07:05<13:46,  2.72it/s][A
epoch 2 iter 1150: train loss 0.16531. lr 1.477363e-04:  34%|███▍      | 1150/3397 [07:05<13:46,  2.72it/s][A
epoch 2 iter 1150: train loss 0.16531. lr 1.477363e-04:  34%|███▍      | 1151/3397 [07:05<13:48,  2.71it/s][A
epoch 2 iter 1151: train loss 0.15929. lr 1.476168e-04:  34%|███▍      | 1151/3397 [07:06<13:48,  2.71it/s][A
epoch 2 iter 1151: train loss 0.15929. lr 1.476168e-04:  34%|███▍      | 1152/3397 [07:06<13:49,  2.71it/s][A
epoch 2 iter 1152: train loss 0.15882. lr 1.474973e-04:  34%|███▍      | 1152/3397 [07:06<13:49,  2.71it/s][A
epoch 2 iter 1152: train loss 0.15882. lr 1.474973e-04:  34%|███▍      | 1153/3397 [07:06<13:48,  2.71it/s][A

epoch 2 iter 1185: train loss 0.15717. lr 1.435719e-04:  35%|███▍      | 1185/3397 [07:18<13:35,  2.71it/s][A
epoch 2 iter 1185: train loss 0.15717. lr 1.435719e-04:  35%|███▍      | 1186/3397 [07:18<13:36,  2.71it/s][A
epoch 2 iter 1186: train loss 0.17117. lr 1.434535e-04:  35%|███▍      | 1186/3397 [07:18<13:36,  2.71it/s][A
epoch 2 iter 1186: train loss 0.17117. lr 1.434535e-04:  35%|███▍      | 1187/3397 [07:18<13:36,  2.71it/s][A
epoch 2 iter 1187: train loss 0.15659. lr 1.433352e-04:  35%|███▍      | 1187/3397 [07:19<13:36,  2.71it/s][A
epoch 2 iter 1187: train loss 0.15659. lr 1.433352e-04:  35%|███▍      | 1188/3397 [07:19<13:34,  2.71it/s][A
epoch 2 iter 1188: train loss 0.14997. lr 1.432169e-04:  35%|███▍      | 1188/3397 [07:19<13:34,  2.71it/s][A
epoch 2 iter 1188: train loss 0.14997. lr 1.432169e-04:  35%|███▌      | 1189/3397 [07:19<13:31,  2.72it/s][A
epoch 2 iter 1189: train loss 0.16423. lr 1.430986e-04:  35%|███▌      | 1189/3397 [07:20<13:31,  2.72it/s][A

epoch 2 iter 1221: train loss 0.17069. lr 1.393313e-04:  36%|███▌      | 1222/3397 [07:31<13:22,  2.71it/s][A
epoch 2 iter 1222: train loss 0.15935. lr 1.392142e-04:  36%|███▌      | 1222/3397 [07:32<13:22,  2.71it/s][A
epoch 2 iter 1222: train loss 0.15935. lr 1.392142e-04:  36%|███▌      | 1223/3397 [07:32<13:21,  2.71it/s][A
epoch 2 iter 1223: train loss 0.16968. lr 1.390971e-04:  36%|███▌      | 1223/3397 [07:32<13:21,  2.71it/s][A
epoch 2 iter 1223: train loss 0.16968. lr 1.390971e-04:  36%|███▌      | 1224/3397 [07:32<13:19,  2.72it/s][A
epoch 2 iter 1224: train loss 0.17445. lr 1.389800e-04:  36%|███▌      | 1224/3397 [07:33<13:19,  2.72it/s][A
epoch 2 iter 1224: train loss 0.17445. lr 1.389800e-04:  36%|███▌      | 1225/3397 [07:33<13:18,  2.72it/s][A
epoch 2 iter 1225: train loss 0.15071. lr 1.388629e-04:  36%|███▌      | 1225/3397 [07:33<13:18,  2.72it/s][A
epoch 2 iter 1225: train loss 0.15071. lr 1.388629e-04:  36%|███▌      | 1226/3397 [07:33<13:20,  2.71it/s][A

epoch 2 iter 1258: train loss 0.16030. lr 1.350194e-04:  37%|███▋      | 1258/3397 [07:45<13:13,  2.70it/s][A
epoch 2 iter 1258: train loss 0.16030. lr 1.350194e-04:  37%|███▋      | 1259/3397 [07:45<13:11,  2.70it/s][A
epoch 2 iter 1259: train loss 0.16604. lr 1.349035e-04:  37%|███▋      | 1259/3397 [07:45<13:11,  2.70it/s][A
epoch 2 iter 1259: train loss 0.16604. lr 1.349035e-04:  37%|███▋      | 1260/3397 [07:45<13:08,  2.71it/s][A
epoch 2 iter 1260: train loss 0.16180. lr 1.347877e-04:  37%|███▋      | 1260/3397 [07:46<13:08,  2.71it/s][A
epoch 2 iter 1260: train loss 0.16180. lr 1.347877e-04:  37%|███▋      | 1261/3397 [07:46<13:04,  2.72it/s][A
epoch 2 iter 1261: train loss 0.15823. lr 1.346719e-04:  37%|███▋      | 1261/3397 [07:46<13:04,  2.72it/s][A
epoch 2 iter 1261: train loss 0.15823. lr 1.346719e-04:  37%|███▋      | 1262/3397 [07:46<13:08,  2.71it/s][A
epoch 2 iter 1262: train loss 0.16753. lr 1.345561e-04:  37%|███▋      | 1262/3397 [07:47<13:08,  2.71it/s][A

epoch 2 iter 1599: train loss 0.15750. lr 9.771164e-05:  47%|████▋     | 1600/3397 [09:51<11:03,  2.71it/s][A
epoch 2 iter 1600: train loss 0.13094. lr 9.760919e-05:  47%|████▋     | 1600/3397 [09:52<11:03,  2.71it/s][A
epoch 2 iter 1600: train loss 0.13094. lr 9.760919e-05:  47%|████▋     | 1601/3397 [09:52<11:00,  2.72it/s][A
epoch 2 iter 1601: train loss 0.14845. lr 9.750679e-05:  47%|████▋     | 1601/3397 [09:52<11:00,  2.72it/s][A
epoch 2 iter 1601: train loss 0.14845. lr 9.750679e-05:  47%|████▋     | 1602/3397 [09:52<11:01,  2.71it/s][A
epoch 2 iter 1602: train loss 0.16424. lr 9.740443e-05:  47%|████▋     | 1602/3397 [09:52<11:01,  2.71it/s][A
epoch 2 iter 1602: train loss 0.16424. lr 9.740443e-05:  47%|████▋     | 1603/3397 [09:52<11:01,  2.71it/s][A
epoch 2 iter 1603: train loss 0.14435. lr 9.730211e-05:  47%|████▋     | 1603/3397 [09:53<11:01,  2.71it/s][A
epoch 2 iter 1603: train loss 0.14435. lr 9.730211e-05:  47%|████▋     | 1604/3397 [09:53<11:01,  2.71it/s][A

epoch 2 iter 1636: train loss 0.14332. lr 9.395013e-05:  48%|████▊     | 1636/3397 [10:05<10:48,  2.72it/s][A
epoch 2 iter 1636: train loss 0.14332. lr 9.395013e-05:  48%|████▊     | 1637/3397 [10:05<10:49,  2.71it/s][A
epoch 2 iter 1637: train loss 0.16313. lr 9.384930e-05:  48%|████▊     | 1637/3397 [10:05<10:49,  2.71it/s][A
epoch 2 iter 1637: train loss 0.16313. lr 9.384930e-05:  48%|████▊     | 1638/3397 [10:05<10:49,  2.71it/s][A
epoch 2 iter 1638: train loss 0.15840. lr 9.374852e-05:  48%|████▊     | 1638/3397 [10:06<10:49,  2.71it/s][A
epoch 2 iter 1638: train loss 0.15840. lr 9.374852e-05:  48%|████▊     | 1639/3397 [10:06<10:48,  2.71it/s][A
epoch 2 iter 1639: train loss 0.15317. lr 9.364778e-05:  48%|████▊     | 1639/3397 [10:06<10:48,  2.71it/s][A
epoch 2 iter 1639: train loss 0.15317. lr 9.364778e-05:  48%|████▊     | 1640/3397 [10:06<10:48,  2.71it/s][A
epoch 2 iter 1640: train loss 0.16063. lr 9.354708e-05:  48%|████▊     | 1640/3397 [10:06<10:48,  2.71it/s][A

epoch 2 iter 1672: train loss 0.15169. lr 9.034820e-05:  49%|████▉     | 1673/3397 [10:18<10:35,  2.71it/s][A
epoch 2 iter 1673: train loss 0.16144. lr 9.024897e-05:  49%|████▉     | 1673/3397 [10:19<10:35,  2.71it/s][A
epoch 2 iter 1673: train loss 0.16144. lr 9.024897e-05:  49%|████▉     | 1674/3397 [10:19<10:35,  2.71it/s][A
epoch 2 iter 1674: train loss 0.14703. lr 9.014979e-05:  49%|████▉     | 1674/3397 [10:19<10:35,  2.71it/s][A
epoch 2 iter 1674: train loss 0.14703. lr 9.014979e-05:  49%|████▉     | 1675/3397 [10:19<10:34,  2.71it/s][A
epoch 2 iter 1675: train loss 0.14609. lr 9.005065e-05:  49%|████▉     | 1675/3397 [10:19<10:34,  2.71it/s][A
epoch 2 iter 1675: train loss 0.14609. lr 9.005065e-05:  49%|████▉     | 1676/3397 [10:19<10:35,  2.71it/s][A
epoch 2 iter 1676: train loss 0.15137. lr 8.995156e-05:  49%|████▉     | 1676/3397 [10:20<10:35,  2.71it/s][A
epoch 2 iter 1676: train loss 0.15137. lr 8.995156e-05:  49%|████▉     | 1677/3397 [10:20<10:35,  2.71it/s][A

epoch 2 iter 1709: train loss 0.15393. lr 8.670679e-05:  50%|█████     | 1709/3397 [10:32<10:22,  2.71it/s][A
epoch 2 iter 1709: train loss 0.15393. lr 8.670679e-05:  50%|█████     | 1710/3397 [10:32<10:20,  2.72it/s][A
epoch 2 iter 1710: train loss 0.15187. lr 8.660924e-05:  50%|█████     | 1710/3397 [10:32<10:20,  2.72it/s][A
epoch 2 iter 1710: train loss 0.15187. lr 8.660924e-05:  50%|█████     | 1711/3397 [10:32<10:21,  2.71it/s][A
epoch 2 iter 1711: train loss 0.15585. lr 8.651173e-05:  50%|█████     | 1711/3397 [10:33<10:21,  2.71it/s][A
epoch 2 iter 1711: train loss 0.15585. lr 8.651173e-05:  50%|█████     | 1712/3397 [10:33<10:21,  2.71it/s][A
epoch 2 iter 1712: train loss 0.15408. lr 8.641427e-05:  50%|█████     | 1712/3397 [10:33<10:21,  2.71it/s][A
epoch 2 iter 1712: train loss 0.15408. lr 8.641427e-05:  50%|█████     | 1713/3397 [10:33<10:21,  2.71it/s][A
epoch 2 iter 1713: train loss 0.17813. lr 8.631685e-05:  50%|█████     | 1713/3397 [10:33<10:21,  2.71it/s][A

epoch 2 iter 1745: train loss 0.14349. lr 8.322375e-05:  51%|█████▏    | 1746/3397 [10:45<10:09,  2.71it/s][A
epoch 2 iter 1746: train loss 0.14848. lr 8.312785e-05:  51%|█████▏    | 1746/3397 [10:46<10:09,  2.71it/s][A
epoch 2 iter 1746: train loss 0.14848. lr 8.312785e-05:  51%|█████▏    | 1747/3397 [10:46<10:08,  2.71it/s][A
epoch 2 iter 1747: train loss 0.14764. lr 8.303200e-05:  51%|█████▏    | 1747/3397 [10:46<10:08,  2.71it/s][A
epoch 2 iter 1747: train loss 0.14764. lr 8.303200e-05:  51%|█████▏    | 1748/3397 [10:46<10:06,  2.72it/s][A
epoch 2 iter 1748: train loss 0.15187. lr 8.293620e-05:  51%|█████▏    | 1748/3397 [10:46<10:06,  2.72it/s][A
epoch 2 iter 1748: train loss 0.15187. lr 8.293620e-05:  51%|█████▏    | 1749/3397 [10:46<10:06,  2.72it/s][A
epoch 2 iter 1749: train loss 0.15613. lr 8.284044e-05:  51%|█████▏    | 1749/3397 [10:47<10:06,  2.72it/s][A
epoch 2 iter 1749: train loss 0.15613. lr 8.284044e-05:  52%|█████▏    | 1750/3397 [10:47<10:07,  2.71it/s][A

epoch 2 iter 1782: train loss 0.15661. lr 7.970660e-05:  52%|█████▏    | 1782/3397 [10:59<09:59,  2.69it/s][A
epoch 2 iter 1782: train loss 0.15661. lr 7.970660e-05:  52%|█████▏    | 1783/3397 [10:59<09:58,  2.70it/s][A
epoch 2 iter 1783: train loss 0.15512. lr 7.961243e-05:  52%|█████▏    | 1783/3397 [10:59<09:58,  2.70it/s][A
epoch 2 iter 1783: train loss 0.15512. lr 7.961243e-05:  53%|█████▎    | 1784/3397 [10:59<09:57,  2.70it/s][A
epoch 2 iter 1784: train loss 0.14149. lr 7.951831e-05:  53%|█████▎    | 1784/3397 [11:00<09:57,  2.70it/s][A
epoch 2 iter 1784: train loss 0.14149. lr 7.951831e-05:  53%|█████▎    | 1785/3397 [11:00<09:56,  2.70it/s][A
epoch 2 iter 1785: train loss 0.17047. lr 7.942423e-05:  53%|█████▎    | 1785/3397 [11:00<09:56,  2.70it/s][A
epoch 2 iter 1785: train loss 0.17047. lr 7.942423e-05:  53%|█████▎    | 1786/3397 [11:00<09:56,  2.70it/s][A
epoch 2 iter 1786: train loss 0.14697. lr 7.933021e-05:  53%|█████▎    | 1786/3397 [11:01<09:56,  2.70it/s][A

epoch 2 iter 1818: train loss 0.15089. lr 7.634641e-05:  54%|█████▎    | 1819/3397 [11:12<09:42,  2.71it/s][A
epoch 2 iter 1819: train loss 0.14530. lr 7.625396e-05:  54%|█████▎    | 1819/3397 [11:13<09:42,  2.71it/s][A
epoch 2 iter 1819: train loss 0.14530. lr 7.625396e-05:  54%|█████▎    | 1820/3397 [11:13<09:41,  2.71it/s][A
epoch 2 iter 1820: train loss 0.14409. lr 7.616155e-05:  54%|█████▎    | 1820/3397 [11:13<09:41,  2.71it/s][A
epoch 2 iter 1820: train loss 0.14409. lr 7.616155e-05:  54%|█████▎    | 1821/3397 [11:13<09:39,  2.72it/s][A
epoch 2 iter 1821: train loss 0.15802. lr 7.606918e-05:  54%|█████▎    | 1821/3397 [11:13<09:39,  2.72it/s][A
epoch 2 iter 1821: train loss 0.15802. lr 7.606918e-05:  54%|█████▎    | 1822/3397 [11:13<09:40,  2.71it/s][A
epoch 2 iter 1822: train loss 0.15471. lr 7.597687e-05:  54%|█████▎    | 1822/3397 [11:14<09:40,  2.71it/s][A
epoch 2 iter 1822: train loss 0.15471. lr 7.597687e-05:  54%|█████▎    | 1823/3397 [11:14<09:40,  2.71it/s][A

epoch 2 iter 1855: train loss 0.16020. lr 7.295752e-05:  55%|█████▍    | 1855/3397 [11:26<09:28,  2.71it/s][A
epoch 2 iter 1855: train loss 0.16020. lr 7.295752e-05:  55%|█████▍    | 1856/3397 [11:26<09:28,  2.71it/s][A
epoch 2 iter 1856: train loss 0.15501. lr 7.286684e-05:  55%|█████▍    | 1856/3397 [11:26<09:28,  2.71it/s][A
epoch 2 iter 1856: train loss 0.15501. lr 7.286684e-05:  55%|█████▍    | 1857/3397 [11:26<09:29,  2.71it/s][A
epoch 2 iter 1857: train loss 0.14831. lr 7.277622e-05:  55%|█████▍    | 1857/3397 [11:27<09:29,  2.71it/s][A
epoch 2 iter 1857: train loss 0.14831. lr 7.277622e-05:  55%|█████▍    | 1858/3397 [11:27<09:30,  2.70it/s][A
epoch 2 iter 1858: train loss 0.15404. lr 7.268564e-05:  55%|█████▍    | 1858/3397 [11:27<09:30,  2.70it/s][A
epoch 2 iter 1858: train loss 0.15404. lr 7.268564e-05:  55%|█████▍    | 1859/3397 [11:27<09:28,  2.70it/s][A
epoch 2 iter 1859: train loss 0.15672. lr 7.259512e-05:  55%|█████▍    | 1859/3397 [11:27<09:28,  2.70it/s][A

epoch 2 iter 1891: train loss 0.15148. lr 6.972403e-05:  56%|█████▌    | 1892/3397 [11:39<09:14,  2.71it/s][A
epoch 2 iter 1892: train loss 0.13756. lr 6.963511e-05:  56%|█████▌    | 1892/3397 [11:40<09:14,  2.71it/s][A
epoch 2 iter 1892: train loss 0.13756. lr 6.963511e-05:  56%|█████▌    | 1893/3397 [11:40<09:15,  2.71it/s][A
epoch 2 iter 1893: train loss 0.13818. lr 6.954625e-05:  56%|█████▌    | 1893/3397 [11:40<09:15,  2.71it/s][A
epoch 2 iter 1893: train loss 0.13818. lr 6.954625e-05:  56%|█████▌    | 1894/3397 [11:40<09:14,  2.71it/s][A
epoch 2 iter 1894: train loss 0.16147. lr 6.945744e-05:  56%|█████▌    | 1894/3397 [11:40<09:14,  2.71it/s][A
epoch 2 iter 1894: train loss 0.16147. lr 6.945744e-05:  56%|█████▌    | 1895/3397 [11:40<09:14,  2.71it/s][A
epoch 2 iter 1895: train loss 0.15736. lr 6.936867e-05:  56%|█████▌    | 1895/3397 [11:41<09:14,  2.71it/s][A
epoch 2 iter 1895: train loss 0.15736. lr 6.936867e-05:  56%|█████▌    | 1896/3397 [11:41<09:12,  2.72it/s][A

epoch 2 iter 1928: train loss 0.14427. lr 6.646725e-05:  57%|█████▋    | 1928/3397 [11:53<09:12,  2.66it/s][A
epoch 2 iter 1928: train loss 0.14427. lr 6.646725e-05:  57%|█████▋    | 1929/3397 [11:53<09:10,  2.67it/s][A
epoch 2 iter 1929: train loss 0.13993. lr 6.638018e-05:  57%|█████▋    | 1929/3397 [11:53<09:10,  2.67it/s][A
epoch 2 iter 1929: train loss 0.13993. lr 6.638018e-05:  57%|█████▋    | 1930/3397 [11:53<09:05,  2.69it/s][A
epoch 2 iter 1930: train loss 0.13433. lr 6.629315e-05:  57%|█████▋    | 1930/3397 [11:54<09:05,  2.69it/s][A
epoch 2 iter 1930: train loss 0.13433. lr 6.629315e-05:  57%|█████▋    | 1931/3397 [11:54<09:07,  2.68it/s][A
epoch 2 iter 1931: train loss 0.13794. lr 6.620618e-05:  57%|█████▋    | 1931/3397 [11:54<09:07,  2.68it/s][A
epoch 2 iter 1931: train loss 0.13794. lr 6.620618e-05:  57%|█████▋    | 1932/3397 [11:54<09:08,  2.67it/s][A
epoch 2 iter 1932: train loss 0.15857. lr 6.611925e-05:  57%|█████▋    | 1932/3397 [11:55<09:08,  2.67it/s][A

epoch 2 iter 1964: train loss 0.14611. lr 6.336414e-05:  58%|█████▊    | 1965/3397 [12:06<08:47,  2.71it/s][A
epoch 2 iter 1965: train loss 0.15811. lr 6.327888e-05:  58%|█████▊    | 1965/3397 [12:07<08:47,  2.71it/s][A
epoch 2 iter 1965: train loss 0.15811. lr 6.327888e-05:  58%|█████▊    | 1966/3397 [12:07<08:46,  2.72it/s][A
epoch 2 iter 1966: train loss 0.15022. lr 6.319366e-05:  58%|█████▊    | 1966/3397 [12:07<08:46,  2.72it/s][A
epoch 2 iter 1966: train loss 0.15022. lr 6.319366e-05:  58%|█████▊    | 1967/3397 [12:07<08:46,  2.71it/s][A
epoch 2 iter 1967: train loss 0.14922. lr 6.310850e-05:  58%|█████▊    | 1967/3397 [12:07<08:46,  2.71it/s][A
epoch 2 iter 1967: train loss 0.14922. lr 6.310850e-05:  58%|█████▊    | 1968/3397 [12:07<08:47,  2.71it/s][A
epoch 2 iter 1968: train loss 0.14580. lr 6.302338e-05:  58%|█████▊    | 1968/3397 [12:08<08:47,  2.71it/s][A
epoch 2 iter 1968: train loss 0.14580. lr 6.302338e-05:  58%|█████▊    | 1969/3397 [12:08<08:46,  2.71it/s][A

epoch 2 iter 2001: train loss 0.14461. lr 6.024320e-05:  59%|█████▉    | 2001/3397 [12:20<08:32,  2.72it/s][A
epoch 2 iter 2001: train loss 0.14461. lr 6.024320e-05:  59%|█████▉    | 2002/3397 [12:20<08:33,  2.72it/s][A
epoch 2 iter 2002: train loss 0.13860. lr 6.015982e-05:  59%|█████▉    | 2002/3397 [12:20<08:33,  2.72it/s][A
epoch 2 iter 2002: train loss 0.13860. lr 6.015982e-05:  59%|█████▉    | 2003/3397 [12:20<08:33,  2.71it/s][A
epoch 2 iter 2003: train loss 0.15497. lr 6.007650e-05:  59%|█████▉    | 2003/3397 [12:21<08:33,  2.71it/s][A
epoch 2 iter 2003: train loss 0.15497. lr 6.007650e-05:  59%|█████▉    | 2004/3397 [12:21<08:33,  2.71it/s][A
epoch 2 iter 2004: train loss 0.14440. lr 6.000000e-05:  59%|█████▉    | 2004/3397 [12:21<08:33,  2.71it/s][A
epoch 2 iter 2004: train loss 0.14440. lr 6.000000e-05:  59%|█████▉    | 2005/3397 [12:21<08:33,  2.71it/s][A
epoch 2 iter 2005: train loss 0.15386. lr 6.000000e-05:  59%|█████▉    | 2005/3397 [12:21<08:33,  2.71it/s][A

epoch 2 iter 2037: train loss 0.14903. lr 6.000000e-05:  60%|█████▉    | 2038/3397 [12:33<08:21,  2.71it/s][A
epoch 2 iter 2038: train loss 0.14004. lr 6.000000e-05:  60%|█████▉    | 2038/3397 [12:34<08:21,  2.71it/s][A
epoch 2 iter 2038: train loss 0.14004. lr 6.000000e-05:  60%|██████    | 2039/3397 [12:34<08:20,  2.71it/s][A
epoch 2 iter 2039: train loss 0.13944. lr 6.000000e-05:  60%|██████    | 2039/3397 [12:34<08:20,  2.71it/s][A
epoch 2 iter 2039: train loss 0.13944. lr 6.000000e-05:  60%|██████    | 2040/3397 [12:34<08:20,  2.71it/s][A
epoch 2 iter 2040: train loss 0.14765. lr 6.000000e-05:  60%|██████    | 2040/3397 [12:34<08:20,  2.71it/s][A
epoch 2 iter 2040: train loss 0.14765. lr 6.000000e-05:  60%|██████    | 2041/3397 [12:34<08:17,  2.72it/s][A
epoch 2 iter 2041: train loss 0.13657. lr 6.000000e-05:  60%|██████    | 2041/3397 [12:35<08:17,  2.72it/s][A
epoch 2 iter 2041: train loss 0.13657. lr 6.000000e-05:  60%|██████    | 2042/3397 [12:35<08:18,  2.72it/s][A

epoch 2 iter 2074: train loss 0.14828. lr 6.000000e-05:  61%|██████    | 2074/3397 [12:47<08:08,  2.71it/s][A
epoch 2 iter 2074: train loss 0.14828. lr 6.000000e-05:  61%|██████    | 2075/3397 [12:47<08:07,  2.71it/s][A
epoch 2 iter 2075: train loss 0.13841. lr 6.000000e-05:  61%|██████    | 2075/3397 [12:47<08:07,  2.71it/s][A
epoch 2 iter 2075: train loss 0.13841. lr 6.000000e-05:  61%|██████    | 2076/3397 [12:47<08:05,  2.72it/s][A
epoch 2 iter 2076: train loss 0.16375. lr 6.000000e-05:  61%|██████    | 2076/3397 [12:48<08:05,  2.72it/s][A
epoch 2 iter 2076: train loss 0.16375. lr 6.000000e-05:  61%|██████    | 2077/3397 [12:48<08:05,  2.72it/s][A

epoch 2 iter 2978: train loss 0.15443. lr 6.000000e-05:  88%|████████▊ | 2978/3397 [18:21<02:34,  2.71it/s][A
epoch 2 iter 2978: train loss 0.15443. lr 6.000000e-05:  88%|████████▊ | 2979/3397 [18:21<02:33,  2.72it/s][A
epoch 2 iter 2979: train loss 0.15669. lr 6.000000e-05:  88%|████████▊ | 2979/3397 [18:21<02:33,  2.72it/s][A
epoch 2 iter 2979: train loss 0.15669. lr 6.000000e-05:  88%|████████▊ | 2980/3397 [18:21<02:33,  2.71it/s][A
epoch 2 iter 2980: train loss 0.14618. lr 6.000000e-05:  88%|████████▊ | 2980/3397 [18:21<02:33,  2.71it/s][A
epoch 2 iter 2980: train loss 0.14618. lr 6.000000e-05:  88%|████████▊ | 2981/3397 [18:21<02:33,  2.71it/s][A
epoch 2 iter 2981: train loss 0.14036. lr 6.000000e-05:  88%|████████▊ | 2981/3397 [18:22<02:33,  2.71it/s][A
epoch 2 iter 2981: train loss 0.14036. lr 6.000000e-05:  88%|████████▊ | 2982/3397 [18:22<02:32,  2.71it/s][A
epoch 2 iter 2982: train loss 0.13695. lr 6.000000e-05:  88%|████████▊ | 2982/3397 [18:22<02:32,  2.71it/s][A

epoch 2 iter 3014: train loss 0.14356. lr 6.000000e-05:  89%|████████▉ | 3015/3397 [18:34<02:21,  2.70it/s][A
epoch 2 iter 3015: train loss 0.15126. lr 6.000000e-05:  89%|████████▉ | 3015/3397 [18:34<02:21,  2.70it/s][A
epoch 2 iter 3015: train loss 0.15126. lr 6.000000e-05:  89%|████████▉ | 3016/3397 [18:34<02:21,  2.69it/s][A
epoch 2 iter 3016: train loss 0.14257. lr 6.000000e-05:  89%|████████▉ | 3016/3397 [18:35<02:21,  2.69it/s][A
epoch 2 iter 3016: train loss 0.14257. lr 6.000000e-05:  89%|████████▉ | 3017/3397 [18:35<02:20,  2.70it/s][A
epoch 2 iter 3017: train loss 0.14916. lr 6.000000e-05:  89%|████████▉ | 3017/3397 [18:35<02:20,  2.70it/s][A
epoch 2 iter 3017: train loss 0.14916. lr 6.000000e-05:  89%|████████▉ | 3018/3397 [18:35<02:19,  2.71it/s][A
epoch 2 iter 3018: train loss 0.14250. lr 6.000000e-05:  89%|████████▉ | 3018/3397 [18:35<02:19,  2.71it/s][A
epoch 2 iter 3018: train loss 0.14250. lr 6.000000e-05:  89%|████████▉ | 3019/3397 [18:35<02:19,  2.71it/s][A

epoch 2 iter 3051: train loss 0.13806. lr 6.000000e-05:  90%|████████▉ | 3051/3397 [18:48<02:07,  2.70it/s][A
epoch 2 iter 3051: train loss 0.13806. lr 6.000000e-05:  90%|████████▉ | 3052/3397 [18:48<02:07,  2.71it/s][A
epoch 2 iter 3052: train loss 0.13283. lr 6.000000e-05:  90%|████████▉ | 3052/3397 [18:48<02:07,  2.71it/s][A
epoch 2 iter 3052: train loss 0.13283. lr 6.000000e-05:  90%|████████▉ | 3053/3397 [18:48<02:06,  2.71it/s][A
epoch 2 iter 3053: train loss 0.14349. lr 6.000000e-05:  90%|████████▉ | 3053/3397 [18:48<02:06,  2.71it/s][A
epoch 2 iter 3053: train loss 0.14349. lr 6.000000e-05:  90%|████████▉ | 3054/3397 [18:48<02:06,  2.72it/s][A
epoch 2 iter 3054: train loss 0.14376. lr 6.000000e-05:  90%|████████▉ | 3054/3397 [18:49<02:06,  2.72it/s][A
epoch 2 iter 3054: train loss 0.14376. lr 6.000000e-05:  90%|████████▉ | 3055/3397 [18:49<02:06,  2.71it/s][A
epoch 2 iter 3055: train loss 0.14701. lr 6.000000e-05:  90%|████████▉ | 3055/3397 [18:49<02:06,  2.71it/s][A

epoch 2 iter 3087: train loss 0.14143. lr 6.000000e-05:  91%|█████████ | 3088/3397 [19:01<01:53,  2.71it/s][A
epoch 2 iter 3088: train loss 0.14587. lr 6.000000e-05:  91%|█████████ | 3088/3397 [19:01<01:53,  2.71it/s][A
epoch 2 iter 3088: train loss 0.14587. lr 6.000000e-05:  91%|█████████ | 3089/3397 [19:01<01:53,  2.71it/s][A
epoch 2 iter 3089: train loss 0.15617. lr 6.000000e-05:  91%|█████████ | 3089/3397 [19:02<01:53,  2.71it/s][A
epoch 2 iter 3089: train loss 0.15617. lr 6.000000e-05:  91%|█████████ | 3090/3397 [19:02<01:52,  2.72it/s][A
epoch 2 iter 3090: train loss 0.14577. lr 6.000000e-05:  91%|█████████ | 3090/3397 [19:02<01:52,  2.72it/s][A
epoch 2 iter 3090: train loss 0.14577. lr 6.000000e-05:  91%|█████████ | 3091/3397 [19:02<01:52,  2.72it/s][A
epoch 2 iter 3091: train loss 0.14056. lr 6.000000e-05:  91%|█████████ | 3091/3397 [19:02<01:52,  2.72it/s][A
epoch 2 iter 3091: train loss 0.14056. lr 6.000000e-05:  91%|█████████ | 3092/3397 [19:02<01:52,  2.71it/s][A

epoch 2 iter 3124: train loss 0.15107. lr 6.000000e-05:  92%|█████████▏| 3124/3397 [19:15<01:41,  2.68it/s][A
epoch 2 iter 3124: train loss 0.15107. lr 6.000000e-05:  92%|█████████▏| 3125/3397 [19:15<01:41,  2.69it/s][A
epoch 2 iter 3125: train loss 0.14520. lr 6.000000e-05:  92%|█████████▏| 3125/3397 [19:15<01:41,  2.69it/s][A
epoch 2 iter 3125: train loss 0.14520. lr 6.000000e-05:  92%|█████████▏| 3126/3397 [19:15<01:40,  2.71it/s][A
epoch 2 iter 3126: train loss 0.13185. lr 6.000000e-05:  92%|█████████▏| 3126/3397 [19:15<01:40,  2.71it/s][A
epoch 2 iter 3126: train loss 0.13185. lr 6.000000e-05:  92%|█████████▏| 3127/3397 [19:15<01:39,  2.71it/s][A
epoch 2 iter 3127: train loss 0.15325. lr 6.000000e-05:  92%|█████████▏| 3127/3397 [19:16<01:39,  2.71it/s][A
epoch 2 iter 3127: train loss 0.15325. lr 6.000000e-05:  92%|█████████▏| 3128/3397 [19:16<01:39,  2.70it/s][A
epoch 2 iter 3128: train loss 0.14485. lr 6.000000e-05:  92%|█████████▏| 3128/3397 [19:16<01:39,  2.70it/s][A

epoch 2 iter 3160: train loss 0.14539. lr 6.000000e-05:  93%|█████████▎| 3161/3397 [19:28<01:27,  2.71it/s][A
epoch 2 iter 3161: train loss 0.14211. lr 6.000000e-05:  93%|█████████▎| 3161/3397 [19:28<01:27,  2.71it/s][A
epoch 2 iter 3161: train loss 0.14211. lr 6.000000e-05:  93%|█████████▎| 3162/3397 [19:28<01:27,  2.67it/s][A
epoch 2 iter 3162: train loss 0.14578. lr 6.000000e-05:  93%|█████████▎| 3162/3397 [19:29<01:27,  2.67it/s][A
epoch 2 iter 3162: train loss 0.14578. lr 6.000000e-05:  93%|█████████▎| 3163/3397 [19:29<01:27,  2.66it/s][A
epoch 2 iter 3163: train loss 0.13580. lr 6.000000e-05:  93%|█████████▎| 3163/3397 [19:29<01:27,  2.66it/s][A
epoch 2 iter 3163: train loss 0.13580. lr 6.000000e-05:  93%|█████████▎| 3164/3397 [19:29<01:27,  2.66it/s][A
epoch 2 iter 3164: train loss 0.14144. lr 6.000000e-05:  93%|█████████▎| 3164/3397 [19:29<01:27,  2.66it/s][A
epoch 2 iter 3164: train loss 0.14144. lr 6.000000e-05:  93%|█████████▎| 3165/3397 [19:29<01:27,  2.66it/s][A

epoch 2 iter 3197: train loss 0.13728. lr 6.000000e-05:  94%|█████████▍| 3197/3397 [19:42<01:13,  2.71it/s][A
epoch 2 iter 3197: train loss 0.13728. lr 6.000000e-05:  94%|█████████▍| 3198/3397 [19:42<01:13,  2.71it/s][A
epoch 2 iter 3198: train loss 0.13116. lr 6.000000e-05:  94%|█████████▍| 3198/3397 [19:42<01:13,  2.71it/s][A
epoch 2 iter 3198: train loss 0.13116. lr 6.000000e-05:  94%|█████████▍| 3199/3397 [19:42<01:13,  2.71it/s][A
epoch 2 iter 3199: train loss 0.15171. lr 6.000000e-05:  94%|█████████▍| 3199/3397 [19:42<01:13,  2.71it/s][A
epoch 2 iter 3199: train loss 0.15171. lr 6.000000e-05:  94%|█████████▍| 3200/3397 [19:42<01:12,  2.71it/s][A
epoch 2 iter 3200: train loss 0.13966. lr 6.000000e-05:  94%|█████████▍| 3200/3397 [19:43<01:12,  2.71it/s][A
epoch 2 iter 3200: train loss 0.13966. lr 6.000000e-05:  94%|█████████▍| 3201/3397 [19:43<01:12,  2.71it/s][A
epoch 2 iter 3201: train loss 0.13569. lr 6.000000e-05:  94%|█████████▍| 3201/3397 [19:43<01:12,  2.71it/s][A

epoch 2 iter 3233: train loss 0.15369. lr 6.000000e-05:  95%|█████████▌| 3234/3397 [19:55<00:59,  2.72it/s][A
epoch 2 iter 3234: train loss 0.14953. lr 6.000000e-05:  95%|█████████▌| 3234/3397 [19:55<00:59,  2.72it/s][A
epoch 2 iter 3234: train loss 0.14953. lr 6.000000e-05:  95%|█████████▌| 3235/3397 [19:55<00:59,  2.72it/s][A
epoch 2 iter 3235: train loss 0.15018. lr 6.000000e-05:  95%|█████████▌| 3235/3397 [19:56<00:59,  2.72it/s][A
epoch 2 iter 3235: train loss 0.15018. lr 6.000000e-05:  95%|█████████▌| 3236/3397 [19:56<00:59,  2.71it/s][A
epoch 2 iter 3236: train loss 0.14158. lr 6.000000e-05:  95%|█████████▌| 3236/3397 [19:56<00:59,  2.71it/s][A
epoch 2 iter 3236: train loss 0.14158. lr 6.000000e-05:  95%|█████████▌| 3237/3397 [19:56<00:58,  2.71it/s][A
epoch 2 iter 3237: train loss 0.13085. lr 6.000000e-05:  95%|█████████▌| 3237/3397 [19:56<00:58,  2.71it/s][A
epoch 2 iter 3237: train loss 0.13085. lr 6.000000e-05:  95%|█████████▌| 3238/3397 [19:56<00:58,  2.71it/s][A

epoch 2 iter 3270: train loss 0.14341. lr 6.000000e-05:  96%|█████████▋| 3270/3397 [20:09<00:46,  2.71it/s][A
epoch 2 iter 3270: train loss 0.14341. lr 6.000000e-05:  96%|█████████▋| 3271/3397 [20:09<00:46,  2.71it/s][A
epoch 2 iter 3271: train loss 0.14679. lr 6.000000e-05:  96%|█████████▋| 3271/3397 [20:09<00:46,  2.71it/s][A
epoch 2 iter 3271: train loss 0.14679. lr 6.000000e-05:  96%|█████████▋| 3272/3397 [20:09<00:46,  2.72it/s][A
epoch 2 iter 3272: train loss 0.15316. lr 6.000000e-05:  96%|█████████▋| 3272/3397 [20:09<00:46,  2.72it/s][A
epoch 2 iter 3272: train loss 0.15316. lr 6.000000e-05:  96%|█████████▋| 3273/3397 [20:09<00:45,  2.71it/s][A
epoch 2 iter 3273: train loss 0.13134. lr 6.000000e-05:  96%|█████████▋| 3273/3397 [20:10<00:45,  2.71it/s][A
epoch 2 iter 3273: train loss 0.13134. lr 6.000000e-05:  96%|█████████▋| 3274/3397 [20:10<00:45,  2.72it/s][A
epoch 2 iter 3274: train loss 0.13986. lr 6.000000e-05:  96%|█████████▋| 3274/3397 [20:10<00:45,  2.72it/s][A

epoch 2 iter 3306: train loss 0.14033. lr 6.000000e-05:  97%|█████████▋| 3307/3397 [20:22<00:33,  2.70it/s][A
epoch 2 iter 3307: train loss 0.13036. lr 6.000000e-05:  97%|█████████▋| 3307/3397 [20:22<00:33,  2.70it/s][A
epoch 2 iter 3307: train loss 0.13036. lr 6.000000e-05:  97%|█████████▋| 3308/3397 [20:22<00:32,  2.71it/s][A
epoch 2 iter 3308: train loss 0.15005. lr 6.000000e-05:  97%|█████████▋| 3308/3397 [20:23<00:32,  2.71it/s][A
epoch 2 iter 3308: train loss 0.15005. lr 6.000000e-05:  97%|█████████▋| 3309/3397 [20:23<00:32,  2.72it/s][A
epoch 2 iter 3309: train loss 0.14595. lr 6.000000e-05:  97%|█████████▋| 3309/3397 [20:23<00:32,  2.72it/s][A
epoch 2 iter 3309: train loss 0.14595. lr 6.000000e-05:  97%|█████████▋| 3310/3397 [20:23<00:32,  2.72it/s][A
epoch 2 iter 3310: train loss 0.15426. lr 6.000000e-05:  97%|█████████▋| 3310/3397 [20:23<00:32,  2.72it/s][A
epoch 2 iter 3310: train loss 0.15426. lr 6.000000e-05:  97%|█████████▋| 3311/3397 [20:23<00:31,  2.71it/s][A

epoch 2 iter 3343: train loss 0.14778. lr 6.000000e-05:  98%|█████████▊| 3343/3397 [20:35<00:19,  2.71it/s][A
epoch 2 iter 3343: train loss 0.14778. lr 6.000000e-05:  98%|█████████▊| 3344/3397 [20:35<00:19,  2.72it/s][A
epoch 2 iter 3344: train loss 0.14134. lr 6.000000e-05:  98%|█████████▊| 3344/3397 [20:36<00:19,  2.72it/s][A
epoch 2 iter 3344: train loss 0.14134. lr 6.000000e-05:  98%|█████████▊| 3345/3397 [20:36<00:19,  2.72it/s][A
epoch 2 iter 3345: train loss 0.13387. lr 6.000000e-05:  98%|█████████▊| 3345/3397 [20:36<00:19,  2.72it/s][A
epoch 2 iter 3345: train loss 0.13387. lr 6.000000e-05:  98%|█████████▊| 3346/3397 [20:36<00:18,  2.71it/s][A
epoch 2 iter 3346: train loss 0.14708. lr 6.000000e-05:  98%|█████████▊| 3346/3397 [20:37<00:18,  2.71it/s][A
epoch 2 iter 3346: train loss 0.14708. lr 6.000000e-05:  99%|█████████▊| 3347/3397 [20:37<00:18,  2.72it/s][A
epoch 2 iter 3347: train loss 0.13840. lr 6.000000e-05:  99%|█████████▊| 3347/3397 [20:37<00:18,  2.72it/s][A

epoch 2 iter 3379: train loss 0.14560. lr 6.000000e-05:  99%|█████████▉| 3380/3397 [20:49<00:06,  2.72it/s][A
epoch 2 iter 3380: train loss 0.13876. lr 6.000000e-05:  99%|█████████▉| 3380/3397 [20:49<00:06,  2.72it/s][A
epoch 2 iter 3380: train loss 0.13876. lr 6.000000e-05: 100%|█████████▉| 3381/3397 [20:49<00:05,  2.71it/s][A
epoch 2 iter 3381: train loss 0.14916. lr 6.000000e-05: 100%|█████████▉| 3381/3397 [20:49<00:05,  2.71it/s][A
epoch 2 iter 3381: train loss 0.14916. lr 6.000000e-05: 100%|█████████▉| 3382/3397 [20:49<00:05,  2.71it/s][A
epoch 2 iter 3382: train loss 0.14151. lr 6.000000e-05: 100%|█████████▉| 3382/3397 [20:50<00:05,  2.71it/s][A
epoch 2 iter 3382: train loss 0.14151. lr 6.000000e-05: 100%|█████████▉| 3383/3397 [20:50<00:05,  2.71it/s][A
epoch 2 iter 3383: train loss 0.14498. lr 6.000000e-05: 100%|█████████▉| 3383/3397 [20:50<00:05,  2.71it/s][A
epoch 2 iter 3383: train loss 0.14498. lr 6.000000e-05: 100%|█████████▉| 3384/3397 [20:50<00:04,  2.72it/s][A