## Train a character-level GPT on some text data

The inputs here are simple text files, which we chop up into individual characters and then train GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare and have it predict the text one character at a time.

In [1]:
from allennlp.data.token_indexers import TokenIndexer, PretrainedTransformerIndexer
from allennlp.data.tokenizers import Token, Tokenizer, PretrainedTransformerTokenizer

import nltk
#nltk.download('punkt')
import numpy as np
from os import listdir
from os.path import join as pathjoin
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from minGPT.mingpt.model import GPT, GPTConfig
from minGPT.mingpt.trainer import Trainer, TrainerConfig
from minGPT.mingpt.utils import sample, set_seed
# make deterministic
set_seed(42)

In [2]:
DATA_DIR = '/home/mlepekhin/data/big'
#MODELS_DIR = '/home/mlepekhin/models/big'
transformer_model = 'bert-base-cased'

In [3]:
import math
from torch.utils.data import Dataset


def detokenize(tokens):
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')

class BPEDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
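
The shift described in the docstring above can be checked on a toy input. The following is a minimal sketch in plain Python (no torch), using the docstring's own "hello" example; the names `chunk` and `examples` are illustrative and not part of the notebook.

```python
# Toy illustration of the (x, y) shift used in __getitem__.
chunk = "hello"                  # block_size + 1 = 5 symbols
block_size = len(chunk) - 1      # 4

# Same vocabulary construction as in the dataset class: sorted unique symbols.
stoi = {ch: i for i, ch in enumerate(sorted(set(chunk)))}
dix = [stoi[ch] for ch in chunk]

# Inputs are all symbols but the last; targets are the same sequence shifted by one.
x, y = dix[:-1], dix[1:]

# Each prefix of x is one training example whose target is the matching entry of y.
examples = [(chunk[:t + 1], chunk[t + 1]) for t in range(block_size)]
for context, target in examples:
    print(f"given {context!r} predict {target!r}")
```

One chunk of length `block_size + 1` therefore yields `block_size` prediction tasks in a single forward pass, which is the amortization the docstring refers to.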


In [4]:
block_size = 128
tokenizer = PretrainedTransformerTokenizer(transformer_model)
#indexer = PretrainedTransformerIndexer(transformer_model)

In [5]:
def train_gpt_generator(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512,
                        max_epochs=1, batch_size=256):
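    # read the training text, closing the file once it has been consumed
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # tokenize sentence by sentence, dropping the special tokens at both ends
    tokens = [str(token) for sent in text_sentences for token in tokenizer.tokenize(sent)[1:-1]]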
    train_dataset = BPEDataset(tokens, block_size) 
    
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    tconf = TrainerConfig(
        max_epochs=max_epochs, batch_size=batch_size, learning_rate=6e-4,
        lr_decay=True, warmup_tokens=batch_size*20, final_tokens=2*len(train_dataset)*block_size,
        num_workers=4
    )
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), state_dict_file)

In [6]:
GENRE_DATA_DIR = '/home/mlepekhin/data/big/genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_big_bpe/'
LANG = 'en'

In [7]:
#train_gpt_generator(
#        pathjoin(GENRE_DATA_DIR, LANG, 'A1.txt'),
#        pathjoin(GPT_MODELS_DIR, LANG, 'A1')
#)

In [None]:
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    label = train_text_file[:-4]
    train_gpt_generator(
        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
        pathjoin(GPT_MODELS_DIR, LANG, label)
    )

  0%|          | 0/11 [00:00<?, ?it/s]

data has 1353356 characters, 22161 unique.
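
Two quick sanity checks on the numbers reported for this run, as a minimal sketch: a model that predicts uniformly over the vocabulary has cross-entropy `ln(vocab_size)`, which is the scale to expect for the very first logged loss, and the number of batches per epoch follows from `__len__` and the batch size. The variable names are illustrative; the values are the ones printed above and the defaults of `train_gpt_generator`.

```python
import math

data_size, vocab_size = 1353356, 22161   # from the dataset printout
block_size, batch_size = 128, 256        # notebook settings

# Cross-entropy (in nats) of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)

# __len__ returns len(data) - block_size; the loader rounds the last batch up.
n_examples = data_size - block_size
iters_per_epoch = math.ceil(n_examples / batch_size)

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"iterations per epoch: {iters_per_epoch}")
```

An initial training loss near the uniform-guess value suggests the model starts out with no preference among symbols, and the iteration count matches the total shown in the progress bar.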




epoch 1 iter 0: train loss 10.11153. lr 6.000000e-04:   0%|          | 0/5287 [00:07<?, ?it/s][A
epoch 1 iter 0: train loss 10.11153. lr 6.000000e-04:   0%|          | 1/5287 [00:07<10:59:54,  7.49s/it][A
epoch 1 iter 1: train loss 9.40282. lr 6.000000e-04:   0%|          | 1/5287 [00:07<10:59:54,  7.49s/it] [A
epoch 1 iter 1: train loss 9.40282. lr 6.000000e-04:   0%|          | 2/5287 [00:07<7:54:05,  5.38s/it] [A
epoch 1 iter 2: train loss 8.94643. lr 5.999999e-04:   0%|          | 2/5287 [00:08<7:54:05,  5.38s/it][A
epoch 1 iter 2: train loss 8.94643. lr 5.999999e-04:   0%|          | 3/5287 [00:08<5:43:39,  3.90s/it][A
epoch 1 iter 3: train loss 8.67606. lr 5.999998e-04:   0%|          | 3/5287 [00:08<5:43:39,  3.90s/it][A
epoch 1 iter 3: train loss 8.67606. lr 5.999998e-04:   0%|          | 4/5287 [00:08<4:12:24,  2.87s/it][A
epoch 1 iter 4: train loss 8.40319. lr 5.999997e-04:   0%|          | 4/5287 [00:09<4:12:24,  2.87s/it][A
epoch 1 iter 4: train loss 8.40319. lr 

epoch 1 iter 36: train loss 6.62182. lr 5.999820e-04:   1%|          | 37/5287 [00:23<39:36,  2.21it/s][A
epoch 1 iter 37: train loss 6.65813. lr 5.999810e-04:   1%|          | 37/5287 [00:24<39:36,  2.21it/s][A
epoch 1 iter 37: train loss 6.65813. lr 5.999810e-04:   1%|          | 38/5287 [00:24<39:36,  2.21it/s][A
epoch 1 iter 38: train loss 6.71436. lr 5.999800e-04:   1%|          | 38/5287 [00:24<39:36,  2.21it/s][A
epoch 1 iter 38: train loss 6.71436. lr 5.999800e-04:   1%|          | 39/5287 [00:24<39:38,  2.21it/s][A
epoch 1 iter 39: train loss 6.61231. lr 5.999790e-04:   1%|          | 39/5287 [00:25<39:38,  2.21it/s][A
epoch 1 iter 39: train loss 6.61231. lr 5.999790e-04:   1%|          | 40/5287 [00:25<39:36,  2.21it/s][A
epoch 1 iter 40: train loss 6.63065. lr 5.999779e-04:   1%|          | 40/5287 [00:25<39:36,  2.21it/s][A
epoch 1 iter 40: train loss 6.63065. lr 5.999779e-04:   1%|          | 41/5287 [00:25<39:37,  2.21it/s][A
epoch 1 iter 41: train loss 6.59261. 

epoch 1 iter 74: train loss 6.22153. lr 5.999258e-04:   1%|▏         | 75/5287 [00:41<39:48,  2.18it/s][A
epoch 1 iter 75: train loss 6.25726. lr 5.999238e-04:   1%|▏         | 75/5287 [00:41<39:48,  2.18it/s][A
epoch 1 iter 75: train loss 6.25726. lr 5.999238e-04:   1%|▏         | 76/5287 [00:41<39:38,  2.19it/s][A
epoch 1 iter 76: train loss 6.22580. lr 5.999218e-04:   1%|▏         | 76/5287 [00:42<39:38,  2.19it/s][A
epoch 1 iter 76: train loss 6.22580. lr 5.999218e-04:   1%|▏         | 77/5287 [00:42<39:33,  2.19it/s][A
epoch 1 iter 77: train loss 6.19895. lr 5.999197e-04:   1%|▏         | 77/5287 [00:42<39:33,  2.19it/s][A
epoch 1 iter 77: train loss 6.19895. lr 5.999197e-04:   1%|▏         | 78/5287 [00:42<39:28,  2.20it/s][A
epoch 1 iter 78: train loss 6.17545. lr 5.999177e-04:   1%|▏         | 78/5287 [00:43<39:28,  2.20it/s][A
epoch 1 iter 78: train loss 6.17545. lr 5.999177e-04:   1%|▏         | 79/5287 [00:43<39:27,  2.20it/s][A
epoch 1 iter 79: train loss 6.24111. 

epoch 1 iter 112: train loss 5.96703. lr 5.998313e-04:   2%|▏         | 113/5287 [00:58<39:36,  2.18it/s][A
epoch 1 iter 113: train loss 5.93330. lr 5.998283e-04:   2%|▏         | 113/5287 [00:59<39:36,  2.18it/s][A
epoch 1 iter 113: train loss 5.93330. lr 5.998283e-04:   2%|▏         | 114/5287 [00:59<39:20,  2.19it/s][A
epoch 1 iter 114: train loss 5.92517. lr 5.998253e-04:   2%|▏         | 114/5287 [00:59<39:20,  2.19it/s][A
epoch 1 iter 114: train loss 5.92517. lr 5.998253e-04:   2%|▏         | 115/5287 [00:59<39:10,  2.20it/s][A
epoch 1 iter 115: train loss 5.96815. lr 5.998223e-04:   2%|▏         | 115/5287 [01:00<39:10,  2.20it/s][A
epoch 1 iter 115: train loss 5.96815. lr 5.998223e-04:   2%|▏         | 116/5287 [01:00<39:01,  2.21it/s][A
epoch 1 iter 116: train loss 5.92616. lr 5.998192e-04:   2%|▏         | 116/5287 [01:00<39:01,  2.21it/s][A
epoch 1 iter 116: train loss 5.92616. lr 5.998192e-04:   2%|▏         | 117/5287 [01:00<38:53,  2.22it/s][A
epoch 1 iter 117: t

epoch 1 iter 150: train loss 5.76490. lr 5.996987e-04:   3%|▎         | 150/5287 [01:16<38:39,  2.21it/s][A
epoch 1 iter 150: train loss 5.76490. lr 5.996987e-04:   3%|▎         | 151/5287 [01:16<38:38,  2.22it/s][A
epoch 1 iter 151: train loss 5.73633. lr 5.996946e-04:   3%|▎         | 151/5287 [01:16<38:38,  2.22it/s][A
epoch 1 iter 151: train loss 5.73633. lr 5.996946e-04:   3%|▎         | 152/5287 [01:16<38:35,  2.22it/s][A
epoch 1 iter 152: train loss 5.69355. lr 5.996906e-04:   3%|▎         | 152/5287 [01:16<38:35,  2.22it/s][A
epoch 1 iter 152: train loss 5.69355. lr 5.996906e-04:   3%|▎         | 153/5287 [01:16<38:32,  2.22it/s][A
epoch 1 iter 153: train loss 5.69661. lr 5.996866e-04:   3%|▎         | 153/5287 [01:17<38:32,  2.22it/s][A
epoch 1 iter 153: train loss 5.69661. lr 5.996866e-04:   3%|▎         | 154/5287 [01:17<38:29,  2.22it/s][A
epoch 1 iter 154: train loss 5.72832. lr 5.996825e-04:   3%|▎         | 154/5287 [01:17<38:29,  2.22it/s][A
epoch 1 iter 154: t

epoch 1 iter 187: train loss 5.52859. lr 5.995327e-04:   4%|▎         | 188/5287 [01:32<38:15,  2.22it/s][A
epoch 1 iter 188: train loss 5.63372. lr 5.995277e-04:   4%|▎         | 188/5287 [01:33<38:15,  2.22it/s][A
epoch 1 iter 188: train loss 5.63372. lr 5.995277e-04:   4%|▎         | 189/5287 [01:33<38:13,  2.22it/s][A
epoch 1 iter 189: train loss 5.57798. lr 5.995227e-04:   4%|▎         | 189/5287 [01:33<38:13,  2.22it/s][A
epoch 1 iter 189: train loss 5.57798. lr 5.995227e-04:   4%|▎         | 190/5287 [01:33<38:15,  2.22it/s][A
epoch 1 iter 190: train loss 5.54303. lr 5.995177e-04:   4%|▎         | 190/5287 [01:34<38:15,  2.22it/s][A
epoch 1 iter 190: train loss 5.54303. lr 5.995177e-04:   4%|▎         | 191/5287 [01:34<38:15,  2.22it/s][A
epoch 1 iter 191: train loss 5.58530. lr 5.995126e-04:   4%|▎         | 191/5287 [01:34<38:15,  2.22it/s][A
epoch 1 iter 191: train loss 5.58530. lr 5.995126e-04:   4%|▎         | 192/5287 [01:34<38:14,  2.22it/s][A
epoch 1 iter 192: t

epoch 1 iter 225: train loss 5.44333. lr 5.993246e-04:   4%|▍         | 225/5287 [01:50<38:03,  2.22it/s][A
epoch 1 iter 225: train loss 5.44333. lr 5.993246e-04:   4%|▍         | 226/5287 [01:50<41:28,  2.03it/s][A
epoch 1 iter 226: train loss 5.52502. lr 5.993186e-04:   4%|▍         | 226/5287 [01:50<41:28,  2.03it/s][A
epoch 1 iter 226: train loss 5.52502. lr 5.993186e-04:   4%|▍         | 227/5287 [01:50<40:26,  2.09it/s][A
epoch 1 iter 227: train loss 5.38235. lr 5.993126e-04:   4%|▍         | 227/5287 [01:51<40:26,  2.09it/s][A
epoch 1 iter 227: train loss 5.38235. lr 5.993126e-04:   4%|▍         | 228/5287 [01:51<39:44,  2.12it/s][A
epoch 1 iter 228: train loss 5.40222. lr 5.993066e-04:   4%|▍         | 228/5287 [01:51<39:44,  2.12it/s][A
epoch 1 iter 228: train loss 5.40222. lr 5.993066e-04:   4%|▍         | 229/5287 [01:51<39:13,  2.15it/s][A
epoch 1 iter 229: train loss 5.54319. lr 5.993005e-04:   4%|▍         | 229/5287 [01:51<39:13,  2.15it/s][A
epoch 1 iter 229: t

epoch 1 iter 262: train loss 5.30430. lr 5.990853e-04:   5%|▍         | 263/5287 [02:06<37:57,  2.21it/s][A
epoch 1 iter 263: train loss 5.33477. lr 5.990784e-04:   5%|▍         | 263/5287 [02:07<37:57,  2.21it/s][A
epoch 1 iter 263: train loss 5.33477. lr 5.990784e-04:   5%|▍         | 264/5287 [02:07<37:56,  2.21it/s][A
epoch 1 iter 264: train loss 5.32661. lr 5.990714e-04:   5%|▍         | 264/5287 [02:07<37:56,  2.21it/s][A
epoch 1 iter 264: train loss 5.32661. lr 5.990714e-04:   5%|▌         | 265/5287 [02:07<37:52,  2.21it/s][A
epoch 1 iter 265: train loss 5.28413. lr 5.990644e-04:   5%|▌         | 265/5287 [02:08<37:52,  2.21it/s][A
epoch 1 iter 265: train loss 5.28413. lr 5.990644e-04:   5%|▌         | 266/5287 [02:08<37:52,  2.21it/s][A
epoch 1 iter 266: train loss 5.30251. lr 5.990573e-04:   5%|▌         | 266/5287 [02:08<37:52,  2.21it/s][A
epoch 1 iter 266: train loss 5.30251. lr 5.990573e-04:   5%|▌         | 267/5287 [02:08<37:51,  2.21it/s][A
epoch 1 iter 267: t

epoch 1 iter 300: train loss 5.20376. lr 5.988020e-04:   6%|▌         | 300/5287 [02:24<37:36,  2.21it/s][A
epoch 1 iter 300: train loss 5.20376. lr 5.988020e-04:   6%|▌         | 301/5287 [02:24<37:37,  2.21it/s][A
epoch 1 iter 301: train loss 5.19037. lr 5.987940e-04:   6%|▌         | 301/5287 [02:24<37:37,  2.21it/s][A
epoch 1 iter 301: train loss 5.19037. lr 5.987940e-04:   6%|▌         | 302/5287 [02:24<37:35,  2.21it/s][A
epoch 1 iter 302: train loss 5.16235. lr 5.987860e-04:   6%|▌         | 302/5287 [02:25<37:35,  2.21it/s][A
epoch 1 iter 302: train loss 5.16235. lr 5.987860e-04:   6%|▌         | 303/5287 [02:25<37:36,  2.21it/s][A
epoch 1 iter 303: train loss 5.22714. lr 5.987780e-04:   6%|▌         | 303/5287 [02:25<37:36,  2.21it/s][A
epoch 1 iter 303: train loss 5.22714. lr 5.987780e-04:   6%|▌         | 304/5287 [02:25<37:34,  2.21it/s][A
epoch 1 iter 304: train loss 5.15341. lr 5.987699e-04:   6%|▌         | 304/5287 [02:26<37:34,  2.21it/s][A
epoch 1 iter 304: t

epoch 1 iter 337: train loss 4.99684. lr 5.984894e-04:   6%|▋         | 338/5287 [02:41<40:11,  2.05it/s][A
epoch 1 iter 338: train loss 5.03805. lr 5.984805e-04:   6%|▋         | 338/5287 [02:41<40:11,  2.05it/s][A
epoch 1 iter 338: train loss 5.03805. lr 5.984805e-04:   6%|▋         | 339/5287 [02:41<39:19,  2.10it/s][A
epoch 1 iter 339: train loss 5.09300. lr 5.984715e-04:   6%|▋         | 339/5287 [02:42<39:19,  2.10it/s][A
epoch 1 iter 339: train loss 5.09300. lr 5.984715e-04:   6%|▋         | 340/5287 [02:42<38:45,  2.13it/s][A
epoch 1 iter 340: train loss 5.18588. lr 5.984625e-04:   6%|▋         | 340/5287 [02:42<38:45,  2.13it/s][A
epoch 1 iter 340: train loss 5.18588. lr 5.984625e-04:   6%|▋         | 341/5287 [02:42<38:18,  2.15it/s][A
epoch 1 iter 341: train loss 5.05860. lr 5.984535e-04:   6%|▋         | 341/5287 [02:43<38:18,  2.15it/s][A
epoch 1 iter 341: train loss 5.05860. lr 5.984535e-04:   6%|▋         | 342/5287 [02:43<38:00,  2.17it/s][A
epoch 1 iter 342: t

epoch 1 iter 375: train loss 4.93490. lr 5.981308e-04:   7%|▋         | 375/5287 [02:58<37:16,  2.20it/s][A
epoch 1 iter 375: train loss 4.93490. lr 5.981308e-04:   7%|▋         | 376/5287 [02:58<37:13,  2.20it/s][A
epoch 1 iter 376: train loss 4.96161. lr 5.981209e-04:   7%|▋         | 376/5287 [02:59<37:13,  2.20it/s][A
epoch 1 iter 376: train loss 4.96161. lr 5.981209e-04:   7%|▋         | 377/5287 [02:59<37:11,  2.20it/s][A
epoch 1 iter 377: train loss 5.01270. lr 5.981109e-04:   7%|▋         | 377/5287 [02:59<37:11,  2.20it/s][A
epoch 1 iter 377: train loss 5.01270. lr 5.981109e-04:   7%|▋         | 378/5287 [02:59<37:08,  2.20it/s][A
epoch 1 iter 378: train loss 4.94878. lr 5.981009e-04:   7%|▋         | 378/5287 [02:59<37:08,  2.20it/s][A
epoch 1 iter 378: train loss 4.94878. lr 5.981009e-04:   7%|▋         | 379/5287 [02:59<37:07,  2.20it/s][A
epoch 1 iter 379: train loss 4.97201. lr 5.980909e-04:   7%|▋         | 379/5287 [03:00<37:07,  2.20it/s][A
epoch 1 iter 379: t

epoch 1 iter 412: train loss 4.81565. lr 5.977452e-04:   8%|▊         | 413/5287 [03:15<36:54,  2.20it/s][A
epoch 1 iter 413: train loss 4.85149. lr 5.977343e-04:   8%|▊         | 413/5287 [03:16<36:54,  2.20it/s][A
epoch 1 iter 413: train loss 4.85149. lr 5.977343e-04:   8%|▊         | 414/5287 [03:16<36:53,  2.20it/s][A
epoch 1 iter 414: train loss 4.86658. lr 5.977233e-04:   8%|▊         | 414/5287 [03:16<36:53,  2.20it/s][A
epoch 1 iter 414: train loss 4.86658. lr 5.977233e-04:   8%|▊         | 415/5287 [03:16<36:52,  2.20it/s][A
epoch 1 iter 415: train loss 4.79973. lr 5.977124e-04:   8%|▊         | 415/5287 [03:16<36:52,  2.20it/s][A
epoch 1 iter 415: train loss 4.79973. lr 5.977124e-04:   8%|▊         | 416/5287 [03:16<36:53,  2.20it/s][A
epoch 1 iter 416: train loss 4.79440. lr 5.977013e-04:   8%|▊         | 416/5287 [03:17<36:53,  2.20it/s][A
epoch 1 iter 416: train loss 4.79440. lr 5.977013e-04:   8%|▊         | 417/5287 [03:17<36:52,  2.20it/s][A
epoch 1 iter 417: t

epoch 1 iter 450: train loss 4.78544. lr 5.973117e-04:   9%|▊         | 450/5287 [03:33<39:28,  2.04it/s][A
epoch 1 iter 450: train loss 4.78544. lr 5.973117e-04:   9%|▊         | 451/5287 [03:33<38:37,  2.09it/s][A
epoch 1 iter 451: train loss 4.74743. lr 5.972997e-04:   9%|▊         | 451/5287 [03:33<38:37,  2.09it/s][A
epoch 1 iter 451: train loss 4.74743. lr 5.972997e-04:   9%|▊         | 452/5287 [03:33<38:02,  2.12it/s][A
epoch 1 iter 452: train loss 4.83584. lr 5.972878e-04:   9%|▊         | 452/5287 [03:33<38:02,  2.12it/s][A
epoch 1 iter 452: train loss 4.83584. lr 5.972878e-04:   9%|▊         | 453/5287 [03:33<37:36,  2.14it/s][A
epoch 1 iter 453: train loss 4.78356. lr 5.972758e-04:   9%|▊         | 453/5287 [03:34<37:36,  2.14it/s][A
epoch 1 iter 453: train loss 4.78356. lr 5.972758e-04:   9%|▊         | 454/5287 [03:34<37:20,  2.16it/s][A
epoch 1 iter 454: train loss 4.76739. lr 5.972638e-04:   9%|▊         | 454/5287 [03:34<37:20,  2.16it/s][A
epoch 1 iter 454: t

epoch 1 iter 487: train loss 4.66203. lr 5.968531e-04:   9%|▉         | 488/5287 [03:50<36:28,  2.19it/s][A
epoch 1 iter 488: train loss 4.68233. lr 5.968402e-04:   9%|▉         | 488/5287 [03:50<36:28,  2.19it/s][A
epoch 1 iter 488: train loss 4.68233. lr 5.968402e-04:   9%|▉         | 489/5287 [03:50<36:25,  2.20it/s][A
epoch 1 iter 489: train loss 4.62380. lr 5.968273e-04:   9%|▉         | 489/5287 [03:50<36:25,  2.20it/s][A
epoch 1 iter 489: train loss 4.62380. lr 5.968273e-04:   9%|▉         | 490/5287 [03:50<36:25,  2.20it/s][A
epoch 1 iter 490: train loss 4.61471. lr 5.968143e-04:   9%|▉         | 490/5287 [03:51<36:25,  2.20it/s][A
epoch 1 iter 490: train loss 4.61471. lr 5.968143e-04:   9%|▉         | 491/5287 [03:51<36:22,  2.20it/s][A
epoch 1 iter 491: train loss 4.63968. lr 5.968014e-04:   9%|▉         | 491/5287 [03:51<36:22,  2.20it/s][A
epoch 1 iter 491: train loss 4.63968. lr 5.968014e-04:   9%|▉         | 492/5287 [03:51<36:23,  2.20it/s][A
epoch 1 iter 492: t

epoch 1 iter 525: train loss 4.58922. lr 5.963448e-04:  10%|▉         | 525/5287 [04:07<36:08,  2.20it/s][A
epoch 1 iter 525: train loss 4.58922. lr 5.963448e-04:  10%|▉         | 526/5287 [04:07<36:09,  2.19it/s][A
epoch 1 iter 526: train loss 4.50002. lr 5.963309e-04:  10%|▉         | 526/5287 [04:07<36:09,  2.19it/s][A
epoch 1 iter 526: train loss 4.50002. lr 5.963309e-04:  10%|▉         | 527/5287 [04:07<36:07,  2.20it/s][A
epoch 1 iter 527: train loss 4.51494. lr 5.963170e-04:  10%|▉         | 527/5287 [04:08<36:07,  2.20it/s][A
epoch 1 iter 527: train loss 4.51494. lr 5.963170e-04:  10%|▉         | 528/5287 [04:08<36:08,  2.20it/s][A
epoch 1 iter 528: train loss 4.46901. lr 5.963031e-04:  10%|▉         | 528/5287 [04:08<36:08,  2.20it/s][A
epoch 1 iter 528: train loss 4.46901. lr 5.963031e-04:  10%|█         | 529/5287 [04:08<36:05,  2.20it/s][A
epoch 1 iter 529: train loss 4.59929. lr 5.962891e-04:  10%|█         | 529/5287 [04:09<36:05,  2.20it/s][A
epoch 1 iter 529: t

epoch 1 iter 562: train loss 4.35763. lr 5.958136e-04:  11%|█         | 563/5287 [04:24<38:12,  2.06it/s][A
epoch 1 iter 563: train loss 4.42583. lr 5.957987e-04:  11%|█         | 563/5287 [04:25<38:12,  2.06it/s][A
epoch 1 iter 563: train loss 4.42583. lr 5.957987e-04:  11%|█         | 564/5287 [04:25<37:31,  2.10it/s][A
epoch 1 iter 564: train loss 4.49346. lr 5.957838e-04:  11%|█         | 564/5287 [04:25<37:31,  2.10it/s][A
epoch 1 iter 564: train loss 4.49346. lr 5.957838e-04:  11%|█         | 565/5287 [04:25<37:00,  2.13it/s][A
epoch 1 iter 565: train loss 4.37661. lr 5.957689e-04:  11%|█         | 565/5287 [04:25<37:00,  2.13it/s][A
epoch 1 iter 565: train loss 4.37661. lr 5.957689e-04:  11%|█         | 566/5287 [04:25<36:40,  2.15it/s][A
epoch 1 iter 566: train loss 4.40064. lr 5.957540e-04:  11%|█         | 566/5287 [04:26<36:40,  2.15it/s][A
epoch 1 iter 566: train loss 4.40064. lr 5.957540e-04:  11%|█         | 567/5287 [04:26<36:26,  2.16it/s][A
epoch 1 iter 567: t

epoch 1 iter 600: train loss 4.35584. lr 5.952307e-04:  11%|█▏        | 600/5287 [04:42<35:49,  2.18it/s][A
epoch 1 iter 600: train loss 4.35584. lr 5.952307e-04:  11%|█▏        | 601/5287 [04:42<35:49,  2.18it/s][A
epoch 1 iter 601: train loss 4.33293. lr 5.952149e-04:  11%|█▏        | 601/5287 [04:42<35:49,  2.18it/s][A
epoch 1 iter 601: train loss 4.33293. lr 5.952149e-04:  11%|█▏        | 602/5287 [04:42<35:49,  2.18it/s][A
epoch 1 iter 602: train loss 4.36625. lr 5.951990e-04:  11%|█▏        | 602/5287 [04:42<35:49,  2.18it/s][A
epoch 1 iter 602: train loss 4.36625. lr 5.951990e-04:  11%|█▏        | 603/5287 [04:42<35:48,  2.18it/s][A
epoch 1 iter 603: train loss 4.28696. lr 5.951831e-04:  11%|█▏        | 603/5287 [04:43<35:48,  2.18it/s][A
epoch 1 iter 603: train loss 4.28696. lr 5.951831e-04:  11%|█▏        | 604/5287 [04:43<35:46,  2.18it/s][A
epoch 1 iter 604: train loss 4.31854. lr 5.951672e-04:  11%|█▏        | 604/5287 [04:43<35:46,  2.18it/s][A
epoch 1 iter 604: t

epoch 1 iter 637: train loss 4.19622. lr 5.946271e-04:  12%|█▏        | 638/5287 [04:59<35:34,  2.18it/s][A
epoch 1 iter 638: train loss 4.27482. lr 5.946103e-04:  12%|█▏        | 638/5287 [04:59<35:34,  2.18it/s][A
epoch 1 iter 638: train loss 4.27482. lr 5.946103e-04:  12%|█▏        | 639/5287 [04:59<35:33,  2.18it/s][A
epoch 1 iter 639: train loss 4.25141. lr 5.945934e-04:  12%|█▏        | 639/5287 [05:00<35:33,  2.18it/s][A
epoch 1 iter 639: train loss 4.25141. lr 5.945934e-04:  12%|█▏        | 640/5287 [05:00<35:33,  2.18it/s][A
epoch 1 iter 640: train loss 4.23167. lr 5.945766e-04:  12%|█▏        | 640/5287 [05:00<35:33,  2.18it/s][A
epoch 1 iter 640: train loss 4.23167. lr 5.945766e-04:  12%|█▏        | 641/5287 [05:00<35:34,  2.18it/s][A
epoch 1 iter 641: train loss 4.20963. lr 5.945597e-04:  12%|█▏        | 641/5287 [05:00<35:34,  2.18it/s][A
epoch 1 iter 641: train loss 4.20963. lr 5.945597e-04:  12%|█▏        | 642/5287 [05:00<35:35,  2.18it/s][A
epoch 1 iter 642: t

epoch 1 iter 675: train loss 4.20004. lr 5.939700e-04:  13%|█▎        | 675/5287 [05:16<37:20,  2.06it/s][A
epoch 1 iter 675: train loss 4.20004. lr 5.939700e-04:  13%|█▎        | 676/5287 [05:16<36:47,  2.09it/s][A
epoch 1 iter 676: train loss 4.08225. lr 5.939522e-04:  13%|█▎        | 676/5287 [05:17<36:47,  2.09it/s][A
epoch 1 iter 676: train loss 4.08225. lr 5.939522e-04:  13%|█▎        | 677/5287 [05:17<36:20,  2.11it/s][A
epoch 1 iter 677: train loss 4.23752. lr 5.939344e-04:  13%|█▎        | 677/5287 [05:17<36:20,  2.11it/s][A
epoch 1 iter 677: train loss 4.23752. lr 5.939344e-04:  13%|█▎        | 678/5287 [05:17<36:02,  2.13it/s][A
epoch 1 iter 678: train loss 4.14352. lr 5.939166e-04:  13%|█▎        | 678/5287 [05:18<36:02,  2.13it/s][A
epoch 1 iter 678: train loss 4.14352. lr 5.939166e-04:  13%|█▎        | 679/5287 [05:18<35:54,  2.14it/s][A
epoch 1 iter 679: train loss 4.11594. lr 5.938987e-04:  13%|█▎        | 679/5287 [05:18<35:54,  2.14it/s][A
epoch 1 iter 679: t

epoch 1 iter 712: train loss 4.02191. lr 5.932943e-04:  13%|█▎        | 713/5287 [05:34<35:38,  2.14it/s][A
epoch 1 iter 713: train loss 4.02851. lr 5.932755e-04:  13%|█▎        | 713/5287 [05:34<35:38,  2.14it/s][A
epoch 1 iter 713: train loss 4.02851. lr 5.932755e-04:  14%|█▎        | 714/5287 [05:34<35:48,  2.13it/s][A
epoch 1 iter 714: train loss 4.09541. lr 5.932567e-04:  14%|█▎        | 714/5287 [05:35<35:48,  2.13it/s][A
epoch 1 iter 714: train loss 4.09541. lr 5.932567e-04:  14%|█▎        | 715/5287 [05:35<35:58,  2.12it/s][A
epoch 1 iter 715: train loss 4.02275. lr 5.932379e-04:  14%|█▎        | 715/5287 [05:35<35:58,  2.12it/s][A
epoch 1 iter 715: train loss 4.02275. lr 5.932379e-04:  14%|█▎        | 716/5287 [05:35<35:44,  2.13it/s][A
epoch 1 iter 716: train loss 4.06448. lr 5.932191e-04:  14%|█▎        | 716/5287 [05:36<35:44,  2.13it/s][A
epoch 1 iter 716: train loss 4.06448. lr 5.932191e-04:  14%|█▎        | 717/5287 [05:36<35:46,  2.13it/s][A
epoch 1 iter 717: t

epoch 1 iter 750: train loss 3.96098. lr 5.925633e-04:  14%|█▍        | 750/5287 [05:52<37:12,  2.03it/s][A
epoch 1 iter 750: train loss 3.96098. lr 5.925633e-04:  14%|█▍        | 751/5287 [05:52<37:38,  2.01it/s][A
epoch 1 iter 751: train loss 4.03009. lr 5.925436e-04:  14%|█▍        | 751/5287 [05:53<37:38,  2.01it/s][A
epoch 1 iter 751: train loss 4.03009. lr 5.925436e-04:  14%|█▍        | 752/5287 [05:53<37:34,  2.01it/s][A
epoch 1 iter 752: train loss 3.99150. lr 5.925238e-04:  14%|█▍        | 752/5287 [05:53<37:34,  2.01it/s][A
epoch 1 iter 752: train loss 3.99150. lr 5.925238e-04:  14%|█▍        | 753/5287 [05:53<37:15,  2.03it/s][A
epoch 1 iter 753: train loss 3.95966. lr 5.925040e-04:  14%|█▍        | 753/5287 [05:53<37:15,  2.03it/s][A
epoch 1 iter 753: train loss 3.95966. lr 5.925040e-04:  14%|█▍        | 754/5287 [05:53<37:08,  2.03it/s][A
epoch 1 iter 754: train loss 4.01132. lr 5.924842e-04:  14%|█▍        | 754/5287 [05:54<37:08,  2.03it/s][A
epoch 1 iter 754: t

epoch 1 iter 787: train loss 3.86319. lr 5.918158e-04:  15%|█▍        | 788/5287 [06:10<39:47,  1.88it/s][A
epoch 1 iter 788: train loss 3.89382. lr 5.917951e-04:  15%|█▍        | 788/5287 [06:11<39:47,  1.88it/s][A
epoch 1 iter 788: train loss 3.89382. lr 5.917951e-04:  15%|█▍        | 789/5287 [06:11<39:02,  1.92it/s][A
epoch 1 iter 789: train loss 3.83381. lr 5.917744e-04:  15%|█▍        | 789/5287 [06:11<39:02,  1.92it/s][A
epoch 1 iter 789: train loss 3.83381. lr 5.917744e-04:  15%|█▍        | 790/5287 [06:11<38:25,  1.95it/s][A
epoch 1 iter 790: train loss 3.84320. lr 5.917536e-04:  15%|█▍        | 790/5287 [06:12<38:25,  1.95it/s][A
epoch 1 iter 790: train loss 3.84320. lr 5.917536e-04:  15%|█▍        | 791/5287 [06:12<37:51,  1.98it/s][A
epoch 1 iter 791: train loss 3.80391. lr 5.917328e-04:  15%|█▍        | 791/5287 [06:12<37:51,  1.98it/s][A
epoch 1 iter 791: train loss 3.80391. lr 5.917328e-04:  15%|█▍        | 792/5287 [06:12<37:01,  2.02it/s][A
epoch 1 iter 792: t

epoch 1 iter 2838: train loss 0.83192. lr 4.994332e-04:  54%|█████▎    | 2839/5287 [23:06<20:03,  2.03it/s][A
epoch 1 iter 2839: train loss 0.82014. lr 4.993666e-04:  54%|█████▎    | 2839/5287 [23:07<20:03,  2.03it/s][A
epoch 1 iter 2839: train loss 0.82014. lr 4.993666e-04:  54%|█████▎    | 2840/5287 [23:07<19:48,  2.06it/s][A
epoch 1 iter 2840: train loss 0.82374. lr 4.993000e-04:  54%|█████▎    | 2840/5287 [23:07<19:48,  2.06it/s][A
epoch 1 iter 2840: train loss 0.82374. lr 4.993000e-04:  54%|█████▎    | 2841/5287 [23:07<19:42,  2.07it/s][A
epoch 1 iter 2841: train loss 0.82007. lr 4.992334e-04:  54%|█████▎    | 2841/5287 [23:08<19:42,  2.07it/s][A
epoch 1 iter 2841: train loss 0.82007. lr 4.992334e-04:  54%|█████▍    | 2842/5287 [23:08<19:43,  2.07it/s][A
epoch 1 iter 2842: train loss 0.84401. lr 4.991667e-04:  54%|█████▍    | 2842/5287 [23:08<19:43,  2.07it/s][A
epoch 1 iter 2842: train loss 0.84401. lr 4.991667e-04:  54%|█████▍    | 2843/5287 [23:08<19:33,  2.08it/s][A
e

epoch 1 iter 2875: train loss 0.81366. lr 4.969571e-04:  54%|█████▍    | 2875/5287 [23:24<19:12,  2.09it/s][A
epoch 1 iter 2875: train loss 0.81366. lr 4.969571e-04:  54%|█████▍    | 2876/5287 [23:24<19:11,  2.09it/s][A
epoch 1 iter 2876: train loss 0.78705. lr 4.968899e-04:  54%|█████▍    | 2876/5287 [23:25<19:11,  2.09it/s][A
epoch 1 iter 2876: train loss 0.78705. lr 4.968899e-04:  54%|█████▍    | 2877/5287 [23:25<19:07,  2.10it/s][A
epoch 1 iter 2877: train loss 0.82800. lr 4.968226e-04:  54%|█████▍    | 2877/5287 [23:25<19:07,  2.10it/s][A
epoch 1 iter 2877: train loss 0.82800. lr 4.968226e-04:  54%|█████▍    | 2878/5287 [23:25<19:09,  2.10it/s][A
epoch 1 iter 2878: train loss 0.82660. lr 4.967553e-04:  54%|█████▍    | 2878/5287 [23:26<19:09,  2.10it/s][A
epoch 1 iter 2878: train loss 0.82660. lr 4.967553e-04:  54%|█████▍    | 2879/5287 [23:26<19:23,  2.07it/s][A
epoch 1 iter 2879: train loss 0.80983. lr 4.966880e-04:  54%|█████▍    | 2879/5287 [23:26<19:23,  2.07it/s][A
e

epoch 1 iter 2911: train loss 0.79577. lr 4.945251e-04:  55%|█████▌    | 2912/5287 [23:42<18:59,  2.08it/s][A
epoch 1 iter 2912: train loss 0.79250. lr 4.944572e-04:  55%|█████▌    | 2912/5287 [23:42<18:59,  2.08it/s][A
epoch 1 iter 2912: train loss 0.79250. lr 4.944572e-04:  55%|█████▌    | 2913/5287 [23:42<19:00,  2.08it/s][A
epoch 1 iter 2913: train loss 0.76896. lr 4.943893e-04:  55%|█████▌    | 2913/5287 [23:43<19:00,  2.08it/s][A
epoch 1 iter 2913: train loss 0.76896. lr 4.943893e-04:  55%|█████▌    | 2914/5287 [23:43<20:24,  1.94it/s][A
epoch 1 iter 2914: train loss 0.80323. lr 4.943214e-04:  55%|█████▌    | 2914/5287 [23:43<20:24,  1.94it/s][A
epoch 1 iter 2914: train loss 0.80323. lr 4.943214e-04:  55%|█████▌    | 2915/5287 [23:43<19:55,  1.98it/s][A
epoch 1 iter 2915: train loss 0.81762. lr 4.942534e-04:  55%|█████▌    | 2915/5287 [23:44<19:55,  1.98it/s][A
epoch 1 iter 2915: train loss 0.81762. lr 4.942534e-04:  55%|█████▌    | 2916/5287 [23:44<19:37,  2.01it/s][A
e

epoch 1 iter 2948: train loss 0.79920. lr 4.920022e-04:  56%|█████▌    | 2948/5287 [24:00<19:03,  2.05it/s][A
epoch 1 iter 2948: train loss 0.79920. lr 4.920022e-04:  56%|█████▌    | 2949/5287 [24:00<19:00,  2.05it/s][A
epoch 1 iter 2949: train loss 0.78817. lr 4.919337e-04:  56%|█████▌    | 2949/5287 [24:00<19:00,  2.05it/s][A
epoch 1 iter 2949: train loss 0.78817. lr 4.919337e-04:  56%|█████▌    | 2950/5287 [24:00<18:53,  2.06it/s][A
epoch 1 iter 2950: train loss 0.77142. lr 4.918652e-04:  56%|█████▌    | 2950/5287 [24:01<18:53,  2.06it/s][A
epoch 1 iter 2950: train loss 0.77142. lr 4.918652e-04:  56%|█████▌    | 2951/5287 [24:01<18:52,  2.06it/s][A
epoch 1 iter 2951: train loss 0.77789. lr 4.917967e-04:  56%|█████▌    | 2951/5287 [24:01<18:52,  2.06it/s][A
epoch 1 iter 2951: train loss 0.77789. lr 4.917967e-04:  56%|█████▌    | 2952/5287 [24:01<18:48,  2.07it/s][A
epoch 1 iter 2952: train loss 0.77372. lr 4.917281e-04:  56%|█████▌    | 2952/5287 [24:02<18:48,  2.07it/s][A
e

epoch 1 iter 2984: train loss 0.78298. lr 4.895253e-04:  56%|█████▋    | 2985/5287 [24:17<18:27,  2.08it/s][A
epoch 1 iter 2985: train loss 0.77616. lr 4.894562e-04:  56%|█████▋    | 2985/5287 [24:18<18:27,  2.08it/s][A
epoch 1 iter 2985: train loss 0.77616. lr 4.894562e-04:  56%|█████▋    | 2986/5287 [24:18<18:23,  2.08it/s][A
epoch 1 iter 2986: train loss 0.76166. lr 4.893871e-04:  56%|█████▋    | 2986/5287 [24:18<18:23,  2.08it/s][A
epoch 1 iter 2986: train loss 0.76166. lr 4.893871e-04:  56%|█████▋    | 2987/5287 [24:18<18:20,  2.09it/s][A
epoch 1 iter 2987: train loss 0.75971. lr 4.893179e-04:  56%|█████▋    | 2987/5287 [24:19<18:20,  2.09it/s][A
epoch 1 iter 2987: train loss 0.75971. lr 4.893179e-04:  57%|█████▋    | 2988/5287 [24:19<18:20,  2.09it/s][A
epoch 1 iter 2988: train loss 0.79153. lr 4.892488e-04:  57%|█████▋    | 2988/5287 [24:19<18:20,  2.09it/s][A
epoch 1 iter 2988: train loss 0.79153. lr 4.892488e-04:  57%|█████▋    | 2989/5287 [24:19<18:21,  2.09it/s][A

epoch 1 iter 3021: train loss 0.74665. lr 4.869570e-04:  57%|█████▋    | 3021/5287 [24:36<18:35,  2.03it/s][A
epoch 1 iter 3021: train loss 0.74665. lr 4.869570e-04:  57%|█████▋    | 3022/5287 [24:36<18:29,  2.04it/s][A
epoch 1 iter 3022: train loss 0.75588. lr 4.868873e-04:  57%|█████▋    | 3022/5287 [24:36<18:29,  2.04it/s][A
epoch 1 iter 3022: train loss 0.75588. lr 4.868873e-04:  57%|█████▋    | 3023/5287 [24:36<18:24,  2.05it/s][A
epoch 1 iter 3023: train loss 0.76026. lr 4.868175e-04:  57%|█████▋    | 3023/5287 [24:37<18:24,  2.05it/s][A
epoch 1 iter 3023: train loss 0.76026. lr 4.868175e-04:  57%|█████▋    | 3024/5287 [24:37<18:19,  2.06it/s][A
epoch 1 iter 3024: train loss 0.78027. lr 4.867478e-04:  57%|█████▋    | 3024/5287 [24:37<18:19,  2.06it/s][A
epoch 1 iter 3024: train loss 0.78027. lr 4.867478e-04:  57%|█████▋    | 3025/5287 [24:37<18:18,  2.06it/s][A
epoch 1 iter 3025: train loss 0.78150. lr 4.866780e-04:  57%|█████▋    | 3025/5287 [24:38<18:18,  2.06it/s][A

epoch 1 iter 3057: train loss 0.75640. lr 4.844364e-04:  58%|█████▊    | 3058/5287 [24:54<18:55,  1.96it/s][A
epoch 1 iter 3058: train loss 0.76309. lr 4.843661e-04:  58%|█████▊    | 3058/5287 [24:54<18:55,  1.96it/s][A
epoch 1 iter 3058: train loss 0.76309. lr 4.843661e-04:  58%|█████▊    | 3059/5287 [24:54<18:40,  1.99it/s][A
epoch 1 iter 3059: train loss 0.75055. lr 4.842957e-04:  58%|█████▊    | 3059/5287 [24:55<18:40,  1.99it/s][A
epoch 1 iter 3059: train loss 0.75055. lr 4.842957e-04:  58%|█████▊    | 3060/5287 [24:55<18:22,  2.02it/s][A
epoch 1 iter 3060: train loss 0.75862. lr 4.842254e-04:  58%|█████▊    | 3060/5287 [24:55<18:22,  2.02it/s][A
epoch 1 iter 3060: train loss 0.75862. lr 4.842254e-04:  58%|█████▊    | 3061/5287 [24:55<18:15,  2.03it/s][A
epoch 1 iter 3061: train loss 0.75193. lr 4.841550e-04:  58%|█████▊    | 3061/5287 [24:56<18:15,  2.03it/s][A
epoch 1 iter 3061: train loss 0.75193. lr 4.841550e-04:  58%|█████▊    | 3062/5287 [24:56<18:02,  2.06it/s][A

epoch 1 iter 3094: train loss 0.74700. lr 4.818238e-04:  59%|█████▊    | 3094/5287 [25:12<17:42,  2.06it/s][A
epoch 1 iter 3094: train loss 0.74700. lr 4.818238e-04:  59%|█████▊    | 3095/5287 [25:12<17:43,  2.06it/s][A
epoch 1 iter 3095: train loss 0.73268. lr 4.817529e-04:  59%|█████▊    | 3095/5287 [25:12<17:43,  2.06it/s][A
epoch 1 iter 3095: train loss 0.73268. lr 4.817529e-04:  59%|█████▊    | 3096/5287 [25:12<17:37,  2.07it/s][A
epoch 1 iter 3096: train loss 0.73495. lr 4.816819e-04:  59%|█████▊    | 3096/5287 [25:13<17:37,  2.07it/s][A
epoch 1 iter 3096: train loss 0.73495. lr 4.816819e-04:  59%|█████▊    | 3097/5287 [25:13<17:41,  2.06it/s][A
epoch 1 iter 3097: train loss 0.75677. lr 4.816110e-04:  59%|█████▊    | 3097/5287 [25:13<17:41,  2.06it/s][A
epoch 1 iter 3097: train loss 0.75677. lr 4.816110e-04:  59%|█████▊    | 3098/5287 [25:13<17:45,  2.05it/s][A
epoch 1 iter 3098: train loss 0.73957. lr 4.815400e-04:  59%|█████▊    | 3098/5287 [25:14<17:45,  2.05it/s][A

epoch 1 iter 3130: train loss 0.70386. lr 4.792607e-04:  59%|█████▉    | 3131/5287 [25:30<17:21,  2.07it/s][A
epoch 1 iter 3131: train loss 0.72526. lr 4.791892e-04:  59%|█████▉    | 3131/5287 [25:30<17:21,  2.07it/s][A
epoch 1 iter 3131: train loss 0.72526. lr 4.791892e-04:  59%|█████▉    | 3132/5287 [25:30<17:23,  2.07it/s][A
epoch 1 iter 3132: train loss 0.73836. lr 4.791177e-04:  59%|█████▉    | 3132/5287 [25:31<17:23,  2.07it/s][A
epoch 1 iter 3132: train loss 0.73836. lr 4.791177e-04:  59%|█████▉    | 3133/5287 [25:31<17:16,  2.08it/s][A
epoch 1 iter 3133: train loss 0.74184. lr 4.790462e-04:  59%|█████▉    | 3133/5287 [25:31<17:16,  2.08it/s][A
epoch 1 iter 3133: train loss 0.74184. lr 4.790462e-04:  59%|█████▉    | 3134/5287 [25:31<17:15,  2.08it/s][A
epoch 1 iter 3134: train loss 0.72942. lr 4.789746e-04:  59%|█████▉    | 3134/5287 [25:32<17:15,  2.08it/s][A
epoch 1 iter 3134: train loss 0.72942. lr 4.789746e-04:  59%|█████▉    | 3135/5287 [25:32<17:21,  2.07it/s][A

epoch 1 iter 3167: train loss 0.73988. lr 4.766050e-04:  60%|█████▉    | 3167/5287 [25:48<18:14,  1.94it/s][A
epoch 1 iter 3167: train loss 0.73988. lr 4.766050e-04:  60%|█████▉    | 3168/5287 [25:48<17:50,  1.98it/s][A
epoch 1 iter 3168: train loss 0.73029. lr 4.765330e-04:  60%|█████▉    | 3168/5287 [25:49<17:50,  1.98it/s][A
epoch 1 iter 3168: train loss 0.73029. lr 4.765330e-04:  60%|█████▉    | 3169/5287 [25:49<17:38,  2.00it/s][A
epoch 1 iter 3169: train loss 0.71829. lr 4.764609e-04:  60%|█████▉    | 3169/5287 [25:49<17:38,  2.00it/s][A
epoch 1 iter 3169: train loss 0.71829. lr 4.764609e-04:  60%|█████▉    | 3170/5287 [25:49<17:27,  2.02it/s][A
epoch 1 iter 3170: train loss 0.72497. lr 4.763888e-04:  60%|█████▉    | 3170/5287 [25:50<17:27,  2.02it/s][A
epoch 1 iter 3170: train loss 0.72497. lr 4.763888e-04:  60%|█████▉    | 3171/5287 [25:50<17:16,  2.04it/s][A
epoch 1 iter 3171: train loss 0.72548. lr 4.763166e-04:  60%|█████▉    | 3171/5287 [25:50<17:16,  2.04it/s][A

epoch 1 iter 3203: train loss 0.68704. lr 4.740006e-04:  61%|██████    | 3204/5287 [26:06<16:58,  2.05it/s][A
epoch 1 iter 3204: train loss 0.70927. lr 4.739280e-04:  61%|██████    | 3204/5287 [26:06<16:58,  2.05it/s][A
epoch 1 iter 3204: train loss 0.70927. lr 4.739280e-04:  61%|██████    | 3205/5287 [26:06<16:59,  2.04it/s][A
epoch 1 iter 3205: train loss 0.72553. lr 4.738554e-04:  61%|██████    | 3205/5287 [26:07<16:59,  2.04it/s][A
epoch 1 iter 3205: train loss 0.72553. lr 4.738554e-04:  61%|██████    | 3206/5287 [26:07<16:58,  2.04it/s][A
epoch 1 iter 3206: train loss 0.70146. lr 4.737827e-04:  61%|██████    | 3206/5287 [26:07<16:58,  2.04it/s][A
epoch 1 iter 3206: train loss 0.70146. lr 4.737827e-04:  61%|██████    | 3207/5287 [26:07<16:51,  2.06it/s][A
epoch 1 iter 3207: train loss 0.72371. lr 4.737100e-04:  61%|██████    | 3207/5287 [26:08<16:51,  2.06it/s][A
epoch 1 iter 3207: train loss 0.72371. lr 4.737100e-04:  61%|██████    | 3208/5287 [26:08<16:50,  2.06it/s][A

epoch 1 iter 3240: train loss 0.70363. lr 4.713032e-04:  61%|██████▏   | 3240/5287 [26:24<17:03,  2.00it/s][A
epoch 1 iter 3240: train loss 0.70363. lr 4.713032e-04:  61%|██████▏   | 3241/5287 [26:24<16:55,  2.01it/s][A
epoch 1 iter 3241: train loss 0.69668. lr 4.712300e-04:  61%|██████▏   | 3241/5287 [26:25<16:55,  2.01it/s][A
epoch 1 iter 3241: train loss 0.69668. lr 4.712300e-04:  61%|██████▏   | 3242/5287 [26:25<16:46,  2.03it/s][A
epoch 1 iter 3242: train loss 0.69021. lr 4.711568e-04:  61%|██████▏   | 3242/5287 [26:25<16:46,  2.03it/s][A
epoch 1 iter 3242: train loss 0.69021. lr 4.711568e-04:  61%|██████▏   | 3243/5287 [26:25<16:42,  2.04it/s][A
epoch 1 iter 3243: train loss 0.69328. lr 4.710835e-04:  61%|██████▏   | 3243/5287 [26:26<16:42,  2.04it/s][A
epoch 1 iter 3243: train loss 0.69328. lr 4.710835e-04:  61%|██████▏   | 3244/5287 [26:26<16:40,  2.04it/s][A
epoch 1 iter 3244: train loss 0.69310. lr 4.710103e-04:  61%|██████▏   | 3244/5287 [26:26<16:40,  2.04it/s][A

epoch 1 iter 3276: train loss 0.70169. lr 4.686587e-04:  62%|██████▏   | 3277/5287 [26:42<17:11,  1.95it/s][A
epoch 1 iter 3277: train loss 0.69454. lr 4.685850e-04:  62%|██████▏   | 3277/5287 [26:43<17:11,  1.95it/s][A
epoch 1 iter 3277: train loss 0.69454. lr 4.685850e-04:  62%|██████▏   | 3278/5287 [26:43<18:15,  1.83it/s][A
epoch 1 iter 3278: train loss 0.70085. lr 4.685112e-04:  62%|██████▏   | 3278/5287 [26:43<18:15,  1.83it/s][A
epoch 1 iter 3278: train loss 0.70085. lr 4.685112e-04:  62%|██████▏   | 3279/5287 [26:43<17:44,  1.89it/s][A
epoch 1 iter 3279: train loss 0.69684. lr 4.684375e-04:  62%|██████▏   | 3279/5287 [26:44<17:44,  1.89it/s][A
epoch 1 iter 3279: train loss 0.69684. lr 4.684375e-04:  62%|██████▏   | 3280/5287 [26:44<17:22,  1.92it/s][A
epoch 1 iter 3280: train loss 0.67485. lr 4.683637e-04:  62%|██████▏   | 3280/5287 [26:44<17:22,  1.92it/s][A
epoch 1 iter 3280: train loss 0.67485. lr 4.683637e-04:  62%|██████▏   | 3281/5287 [26:44<17:05,  1.96it/s][A

epoch 1 iter 3313: train loss 0.68453. lr 4.659207e-04:  63%|██████▎   | 3313/5287 [27:00<16:16,  2.02it/s][A
epoch 1 iter 3313: train loss 0.68453. lr 4.659207e-04:  63%|██████▎   | 3314/5287 [27:00<16:11,  2.03it/s][A
epoch 1 iter 3314: train loss 0.69969. lr 4.658464e-04:  63%|██████▎   | 3314/5287 [27:01<16:11,  2.03it/s][A
epoch 1 iter 3314: train loss 0.69969. lr 4.658464e-04:  63%|██████▎   | 3315/5287 [27:01<16:16,  2.02it/s][A
epoch 1 iter 3315: train loss 0.67197. lr 4.657721e-04:  63%|██████▎   | 3315/5287 [27:01<16:16,  2.02it/s][A
IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

epoch 1 iter 126: train loss 6.16449. lr 5.999777e-04:   1%|          | 126/16329 [01:01<2:10:20,  2.07it/s][A

epoch 1 iter 158: train loss 5.93502. lr 5.999650e-04:   1%|          | 159/16329 [01:16<2:08:56,  2.09it/s][A
epoch 1 iter 159: train loss 6.02163. lr 5.999645e-04:   1%|          | 159/16329 [01:17<2:08:56,  2.09it/s][A
epoch 1 iter 159: train loss 6.02163. lr 5.999645e-04:   1%|          | 160/16329 [01:17<2:08:58,  2.09it/s][A
epoch 1 iter 160: train loss 5.97657. lr 5.999641e-04:   1%|          | 160/16329 [01:17<2:08:58,  2.09it/s][A
epoch 1 iter 160: train loss 5.97657. lr 5.999641e-04:   1%|          | 161/16329 [01:17<2:08:57,  2.09it/s][A
epoch 1 iter 161: train loss 5.99533. lr 5.999636e-04:   1%|          | 161/16329 [01:18<2:08:57,  2.09it/s][A
epoch 1 iter 161: train loss 5.99533. lr 5.999636e-04:   1%|          | 162/16329 [01:18<2:08:59,  2.09it/s][A
epoch 1 iter 162: train loss 5.94194. lr 5.999632e-04:   1%|          | 162/16329 [01:18<2:08:59,  2.09it/s][A
epoch 1 iter 162: train loss 5.94194. lr 5.999632e-04:   1%|          | 163/16329 [01:18<2:08:59,  2.09i

epoch 1 iter 195: train loss 5.83952. lr 5.999468e-04:   1%|          | 195/16329 [01:34<2:08:47,  2.09it/s][A
epoch 1 iter 195: train loss 5.83952. lr 5.999468e-04:   1%|          | 196/16329 [01:34<2:08:46,  2.09it/s][A
epoch 1 iter 196: train loss 5.79618. lr 5.999462e-04:   1%|          | 196/16329 [01:34<2:08:46,  2.09it/s][A
epoch 1 iter 196: train loss 5.79618. lr 5.999462e-04:   1%|          | 197/16329 [01:34<2:08:57,  2.08it/s][A
epoch 1 iter 197: train loss 5.84495. lr 5.999457e-04:   1%|          | 197/16329 [01:35<2:08:57,  2.08it/s][A
epoch 1 iter 197: train loss 5.84495. lr 5.999457e-04:   1%|          | 198/16329 [01:35<2:08:56,  2.09it/s][A
epoch 1 iter 198: train loss 5.76150. lr 5.999451e-04:   1%|          | 198/16329 [01:35<2:08:56,  2.09it/s][A
epoch 1 iter 198: train loss 5.76150. lr 5.999451e-04:   1%|          | 199/16329 [01:35<2:08:52,  2.09it/s][A
epoch 1 iter 199: train loss 5.79216. lr 5.999446e-04:   1%|          | 199/16329 [01:36<2:08:52,  2.09i

epoch 1 iter 231: train loss 5.68531. lr 5.999254e-04:   1%|▏         | 232/16329 [01:51<2:09:04,  2.08it/s][A
epoch 1 iter 232: train loss 5.63006. lr 5.999247e-04:   1%|▏         | 232/16329 [01:52<2:09:04,  2.08it/s][A
epoch 1 iter 232: train loss 5.63006. lr 5.999247e-04:   1%|▏         | 233/16329 [01:52<2:09:07,  2.08it/s][A
epoch 1 iter 233: train loss 5.63101. lr 5.999241e-04:   1%|▏         | 233/16329 [01:52<2:09:07,  2.08it/s][A
epoch 1 iter 233: train loss 5.63101. lr 5.999241e-04:   1%|▏         | 234/16329 [01:52<2:09:07,  2.08it/s][A
epoch 1 iter 234: train loss 5.64733. lr 5.999234e-04:   1%|▏         | 234/16329 [01:53<2:09:07,  2.08it/s][A
epoch 1 iter 234: train loss 5.64733. lr 5.999234e-04:   1%|▏         | 235/16329 [01:53<2:09:06,  2.08it/s][A
epoch 1 iter 235: train loss 5.63722. lr 5.999228e-04:   1%|▏         | 235/16329 [01:53<2:09:06,  2.08it/s][A
epoch 1 iter 235: train loss 5.63722. lr 5.999228e-04:   1%|▏         | 236/16329 [01:53<2:09:04,  2.08i

epoch 1 iter 268: train loss 5.59180. lr 5.998997e-04:   2%|▏         | 268/16329 [02:09<2:09:38,  2.06it/s][A
epoch 1 iter 268: train loss 5.59180. lr 5.998997e-04:   2%|▏         | 269/16329 [02:09<2:10:13,  2.06it/s][A
epoch 1 iter 269: train loss 5.60404. lr 5.998989e-04:   2%|▏         | 269/16329 [02:09<2:10:13,  2.06it/s][A
epoch 1 iter 269: train loss 5.60404. lr 5.998989e-04:   2%|▏         | 270/16329 [02:09<2:09:58,  2.06it/s][A
epoch 1 iter 270: train loss 5.57377. lr 5.998982e-04:   2%|▏         | 270/16329 [02:10<2:09:58,  2.06it/s][A
epoch 1 iter 270: train loss 5.57377. lr 5.998982e-04:   2%|▏         | 271/16329 [02:10<2:09:52,  2.06it/s][A
epoch 1 iter 271: train loss 5.52320. lr 5.998974e-04:   2%|▏         | 271/16329 [02:10<2:09:52,  2.06it/s][A
epoch 1 iter 271: train loss 5.52320. lr 5.998974e-04:   2%|▏         | 272/16329 [02:10<2:09:40,  2.06it/s][A
epoch 1 iter 272: train loss 5.50254. lr 5.998967e-04:   2%|▏         | 272/16329 [02:11<2:09:40,  2.06i

epoch 1 iter 304: train loss 5.37448. lr 5.998710e-04:   2%|▏         | 305/16329 [02:27<2:10:27,  2.05it/s][A
epoch 1 iter 305: train loss 5.38482. lr 5.998702e-04:   2%|▏         | 305/16329 [02:27<2:10:27,  2.05it/s][A
epoch 1 iter 305: train loss 5.38482. lr 5.998702e-04:   2%|▏         | 306/16329 [02:27<2:10:22,  2.05it/s][A
epoch 1 iter 306: train loss 5.36579. lr 5.998693e-04:   2%|▏         | 306/16329 [02:28<2:10:22,  2.05it/s][A
epoch 1 iter 306: train loss 5.36579. lr 5.998693e-04:   2%|▏         | 307/16329 [02:28<2:10:26,  2.05it/s][A
epoch 1 iter 307: train loss 5.38004. lr 5.998685e-04:   2%|▏         | 307/16329 [02:28<2:10:26,  2.05it/s][A
epoch 1 iter 307: train loss 5.38004. lr 5.998685e-04:   2%|▏         | 308/16329 [02:28<2:10:57,  2.04it/s][A
epoch 1 iter 308: train loss 5.42412. lr 5.998676e-04:   2%|▏         | 308/16329 [02:28<2:10:57,  2.04it/s][A
epoch 1 iter 308: train loss 5.42412. lr 5.998676e-04:   2%|▏         | 309/16329 [02:28<2:10:54,  2.04i

epoch 1 iter 341: train loss 5.31048. lr 5.998378e-04:   2%|▏         | 341/16329 [02:45<2:14:56,  1.97it/s][A
epoch 1 iter 341: train loss 5.31048. lr 5.998378e-04:   2%|▏         | 342/16329 [02:45<2:14:28,  1.98it/s][A
epoch 1 iter 342: train loss 5.35940. lr 5.998368e-04:   2%|▏         | 342/16329 [02:46<2:14:28,  1.98it/s][A
epoch 1 iter 342: train loss 5.35940. lr 5.998368e-04:   2%|▏         | 343/16329 [02:46<2:13:43,  1.99it/s][A
epoch 1 iter 343: train loss 5.24128. lr 5.998359e-04:   2%|▏         | 343/16329 [02:46<2:13:43,  1.99it/s][A
epoch 1 iter 343: train loss 5.24128. lr 5.998359e-04:   2%|▏         | 344/16329 [02:46<2:13:28,  2.00it/s][A
epoch 1 iter 344: train loss 5.36845. lr 5.998349e-04:   2%|▏         | 344/16329 [02:47<2:13:28,  2.00it/s][A
epoch 1 iter 344: train loss 5.36845. lr 5.998349e-04:   2%|▏         | 345/16329 [02:47<2:13:16,  2.00it/s][A
epoch 1 iter 345: train loss 5.26613. lr 5.998340e-04:   2%|▏         | 345/16329 [02:47<2:13:16,  2.00i

epoch 1 iter 377: train loss 5.19126. lr 5.998018e-04:   2%|▏         | 378/16329 [03:04<2:21:10,  1.88it/s][A
epoch 1 iter 378: train loss 5.17157. lr 5.998008e-04:   2%|▏         | 378/16329 [03:04<2:21:10,  1.88it/s][A
epoch 1 iter 378: train loss 5.17157. lr 5.998008e-04:   2%|▏         | 379/16329 [03:04<2:18:07,  1.92it/s][A
epoch 1 iter 379: train loss 5.16210. lr 5.997997e-04:   2%|▏         | 379/16329 [03:05<2:18:07,  1.92it/s][A
epoch 1 iter 379: train loss 5.16210. lr 5.997997e-04:   2%|▏         | 380/16329 [03:05<2:15:58,  1.95it/s][A
epoch 1 iter 380: train loss 5.15891. lr 5.997987e-04:   2%|▏         | 380/16329 [03:05<2:15:58,  1.95it/s][A
epoch 1 iter 380: train loss 5.15891. lr 5.997987e-04:   2%|▏         | 381/16329 [03:05<2:15:02,  1.97it/s][A
epoch 1 iter 381: train loss 5.15637. lr 5.997976e-04:   2%|▏         | 381/16329 [03:06<2:15:02,  1.97it/s][A
epoch 1 iter 381: train loss 5.15637. lr 5.997976e-04:   2%|▏         | 382/16329 [03:06<2:13:55,  1.98i

epoch 1 iter 414: train loss 5.09645. lr 5.997611e-04:   3%|▎         | 414/16329 [03:22<2:12:55,  2.00it/s][A
epoch 1 iter 414: train loss 5.09645. lr 5.997611e-04:   3%|▎         | 415/16329 [03:22<2:12:28,  2.00it/s][A
epoch 1 iter 415: train loss 5.09217. lr 5.997600e-04:   3%|▎         | 415/16329 [03:23<2:12:28,  2.00it/s][A
epoch 1 iter 415: train loss 5.09217. lr 5.997600e-04:   3%|▎         | 416/16329 [03:23<2:28:29,  1.79it/s][A
epoch 1 iter 416: train loss 5.09752. lr 5.997588e-04:   3%|▎         | 416/16329 [03:23<2:28:29,  1.79it/s][A
epoch 1 iter 416: train loss 5.09752. lr 5.997588e-04:   3%|▎         | 417/16329 [03:23<2:23:26,  1.85it/s][A
epoch 1 iter 417: train loss 5.09515. lr 5.997577e-04:   3%|▎         | 417/16329 [03:24<2:23:26,  1.85it/s][A
epoch 1 iter 417: train loss 5.09515. lr 5.997577e-04:   3%|▎         | 418/16329 [03:24<2:20:01,  1.89it/s][A
epoch 1 iter 418: train loss 5.07115. lr 5.997565e-04:   3%|▎         | 418/16329 [03:24<2:20:01,  1.89i

epoch 1 iter 450: train loss 5.03757. lr 5.997179e-04:   3%|▎         | 451/16329 [03:41<2:25:33,  1.82it/s][A
epoch 1 iter 451: train loss 4.96721. lr 5.997166e-04:   3%|▎         | 451/16329 [03:41<2:25:33,  1.82it/s][A
epoch 1 iter 451: train loss 4.96721. lr 5.997166e-04:   3%|▎         | 452/16329 [03:41<2:21:35,  1.87it/s][A
epoch 1 iter 452: train loss 5.08537. lr 5.997154e-04:   3%|▎         | 452/16329 [03:42<2:21:35,  1.87it/s][A
epoch 1 iter 452: train loss 5.08537. lr 5.997154e-04:   3%|▎         | 453/16329 [03:42<2:19:22,  1.90it/s][A
epoch 1 iter 453: train loss 4.94469. lr 5.997141e-04:   3%|▎         | 453/16329 [03:42<2:19:22,  1.90it/s][A
epoch 1 iter 453: train loss 4.94469. lr 5.997141e-04:   3%|▎         | 454/16329 [03:42<2:17:31,  1.92it/s][A
epoch 1 iter 454: train loss 4.97389. lr 5.997129e-04:   3%|▎         | 454/16329 [03:43<2:17:31,  1.92it/s][A
epoch 1 iter 454: train loss 4.97389. lr 5.997129e-04:   3%|▎         | 455/16329 [03:43<2:16:11,  1.94i

epoch 1 iter 487: train loss 4.93057. lr 5.996697e-04:   3%|▎         | 487/16329 [04:00<2:13:16,  1.98it/s][A
epoch 1 iter 487: train loss 4.93057. lr 5.996697e-04:   3%|▎         | 488/16329 [04:00<2:12:58,  1.99it/s][A
epoch 1 iter 488: train loss 4.97426. lr 5.996683e-04:   3%|▎         | 488/16329 [04:00<2:12:58,  1.99it/s][A
epoch 1 iter 488: train loss 4.97426. lr 5.996683e-04:   3%|▎         | 489/16329 [04:00<2:12:56,  1.99it/s][A
epoch 1 iter 489: train loss 4.94429. lr 5.996670e-04:   3%|▎         | 489/16329 [04:01<2:12:56,  1.99it/s][A
epoch 1 iter 489: train loss 4.94429. lr 5.996670e-04:   3%|▎         | 490/16329 [04:01<2:12:39,  1.99it/s][A
epoch 1 iter 490: train loss 4.85322. lr 5.996656e-04:   3%|▎         | 490/16329 [04:01<2:12:39,  1.99it/s][A
epoch 1 iter 490: train loss 4.85322. lr 5.996656e-04:   3%|▎         | 491/16329 [04:01<2:12:39,  1.99it/s][A
epoch 1 iter 491: train loss 4.95849. lr 5.996642e-04:   3%|▎         | 491/16329 [04:02<2:12:39,  1.99i

epoch 1 iter 523: train loss 4.85761. lr 5.996191e-04:   3%|▎         | 524/16329 [04:18<2:13:19,  1.98it/s][A
epoch 1 iter 524: train loss 4.83223. lr 5.996177e-04:   3%|▎         | 524/16329 [04:19<2:13:19,  1.98it/s][A
epoch 1 iter 524: train loss 4.83223. lr 5.996177e-04:   3%|▎         | 525/16329 [04:19<2:13:11,  1.98it/s][A
epoch 1 iter 525: train loss 4.84499. lr 5.996162e-04:   3%|▎         | 525/16329 [04:19<2:13:11,  1.98it/s][A
epoch 1 iter 525: train loss 4.84499. lr 5.996162e-04:   3%|▎         | 526/16329 [04:19<2:12:43,  1.98it/s][A
epoch 1 iter 526: train loss 4.79301. lr 5.996148e-04:   3%|▎         | 526/16329 [04:20<2:12:43,  1.98it/s][A
epoch 1 iter 526: train loss 4.79301. lr 5.996148e-04:   3%|▎         | 527/16329 [04:20<2:12:42,  1.98it/s][A
epoch 1 iter 527: train loss 4.84564. lr 5.996133e-04:   3%|▎         | 527/16329 [04:20<2:12:42,  1.98it/s][A
epoch 1 iter 527: train loss 4.84564. lr 5.996133e-04:   3%|▎         | 528/16329 [04:20<2:12:18,  1.99i

epoch 1 iter 560: train loss 4.74763. lr 5.995635e-04:   3%|▎         | 560/16329 [04:37<2:13:11,  1.97it/s][A
epoch 1 iter 560: train loss 4.74763. lr 5.995635e-04:   3%|▎         | 561/16329 [04:37<2:12:40,  1.98it/s][A
epoch 1 iter 561: train loss 4.82786. lr 5.995619e-04:   3%|▎         | 561/16329 [04:38<2:12:40,  1.98it/s][A
epoch 1 iter 561: train loss 4.82786. lr 5.995619e-04:   3%|▎         | 562/16329 [04:38<2:12:16,  1.99it/s][A
epoch 1 iter 562: train loss 4.81648. lr 5.995603e-04:   3%|▎         | 562/16329 [04:38<2:12:16,  1.99it/s][A
epoch 1 iter 562: train loss 4.81648. lr 5.995603e-04:   3%|▎         | 563/16329 [04:38<2:12:09,  1.99it/s][A
epoch 1 iter 563: train loss 4.75342. lr 5.995588e-04:   3%|▎         | 563/16329 [04:39<2:12:09,  1.99it/s][A
epoch 1 iter 563: train loss 4.75342. lr 5.995588e-04:   3%|▎         | 564/16329 [04:39<2:12:24,  1.98it/s][A
epoch 1 iter 564: train loss 4.83799. lr 5.995572e-04:   3%|▎         | 564/16329 [04:39<2:12:24,  1.98i

epoch 1 iter 596: train loss 4.75926. lr 5.995056e-04:   4%|▎         | 597/16329 [04:56<2:12:27,  1.98it/s][A
epoch 1 iter 597: train loss 4.70465. lr 5.995040e-04:   4%|▎         | 597/16329 [04:57<2:12:27,  1.98it/s][A
epoch 1 iter 597: train loss 4.70465. lr 5.995040e-04:   4%|▎         | 598/16329 [04:57<2:12:07,  1.98it/s][A
epoch 1 iter 598: train loss 4.74142. lr 5.995023e-04:   4%|▎         | 598/16329 [04:57<2:12:07,  1.98it/s][A
epoch 1 iter 598: train loss 4.74142. lr 5.995023e-04:   4%|▎         | 599/16329 [04:57<2:11:58,  1.99it/s][A
epoch 1 iter 599: train loss 4.71991. lr 5.995006e-04:   4%|▎         | 599/16329 [04:58<2:11:58,  1.99it/s][A
epoch 1 iter 599: train loss 4.71991. lr 5.995006e-04:   4%|▎         | 600/16329 [04:58<2:12:12,  1.98it/s][A
epoch 1 iter 600: train loss 4.77146. lr 5.994990e-04:   4%|▎         | 600/16329 [04:58<2:12:12,  1.98it/s][A
epoch 1 iter 600: train loss 4.77146. lr 5.994990e-04:   4%|▎         | 601/16329 [04:58<2:12:11,  1.98i

epoch 1 iter 2286: train loss 3.13383. lr 5.927694e-04:  14%|█▍        | 2287/16329 [19:19<1:56:43,  2.00it/s][A
epoch 1 iter 2287: train loss 3.05883. lr 5.927632e-04:  14%|█▍        | 2287/16329 [19:19<1:56:43,  2.00it/s][A
epoch 1 iter 2287: train loss 3.05883. lr 5.927632e-04:  14%|█▍        | 2288/16329 [19:19<1:56:33,  2.01it/s][A
epoch 1 iter 2288: train loss 3.09726. lr 5.927568e-04:  14%|█▍        | 2288/16329 [19:20<1:56:33,  2.01it/s][A
epoch 1 iter 2288: train loss 3.09726. lr 5.927568e-04:  14%|█▍        | 2289/16329 [19:20<1:56:12,  2.01it/s][A
epoch 1 iter 2289: train loss 3.14285. lr 5.927505e-04:  14%|█▍        | 2289/16329 [19:20<1:56:12,  2.01it/s][A
epoch 1 iter 2289: train loss 3.14285. lr 5.927505e-04:  14%|█▍        | 2290/16329 [19:20<1:55:33,  2.02it/s][A
epoch 1 iter 2290: train loss 3.18466. lr 5.927442e-04:  14%|█▍        | 2290/16329 [19:21<1:55:33,  2.02it/s][A
epoch 1 iter 2290: train loss 3.18466. lr 5.927442e-04:  14%|█▍        | 2291/16329 [19:

epoch 1 iter 2322: train loss 3.04563. lr 5.925410e-04:  14%|█▍        | 2322/16329 [19:37<2:04:32,  1.87it/s][A
epoch 1 iter 2322: train loss 3.04563. lr 5.925410e-04:  14%|█▍        | 2323/16329 [19:37<2:01:55,  1.91it/s][A
epoch 1 iter 2323: train loss 3.01654. lr 5.925346e-04:  14%|█▍        | 2323/16329 [19:37<2:01:55,  1.91it/s][A
epoch 1 iter 2323: train loss 3.01654. lr 5.925346e-04:  14%|█▍        | 2324/16329 [19:37<1:59:43,  1.95it/s][A
epoch 1 iter 2324: train loss 3.03635. lr 5.925282e-04:  14%|█▍        | 2324/16329 [19:38<1:59:43,  1.95it/s][A
epoch 1 iter 2324: train loss 3.03635. lr 5.925282e-04:  14%|█▍        | 2325/16329 [19:38<1:58:52,  1.96it/s][A
epoch 1 iter 2325: train loss 3.10719. lr 5.925218e-04:  14%|█▍        | 2325/16329 [19:38<1:58:52,  1.96it/s][A
epoch 1 iter 2325: train loss 3.10719. lr 5.925218e-04:  14%|█▍        | 2326/16329 [19:38<1:57:57,  1.98it/s][A
epoch 1 iter 2326: train loss 3.01858. lr 5.925154e-04:  14%|█▍        | 2326/16329 [19:

epoch 1 iter 2357: train loss 3.02982. lr 5.923155e-04:  14%|█▍        | 2358/16329 [19:54<2:01:15,  1.92it/s][A
epoch 1 iter 2358: train loss 2.97521. lr 5.923090e-04:  14%|█▍        | 2358/16329 [19:55<2:01:15,  1.92it/s][A
epoch 1 iter 2358: train loss 2.97521. lr 5.923090e-04:  14%|█▍        | 2359/16329 [19:55<1:59:25,  1.95it/s][A
epoch 1 iter 2359: train loss 3.05864. lr 5.923025e-04:  14%|█▍        | 2359/16329 [19:55<1:59:25,  1.95it/s][A
epoch 1 iter 2359: train loss 3.05864. lr 5.923025e-04:  14%|█▍        | 2360/16329 [19:55<1:57:46,  1.98it/s][A
epoch 1 iter 2360: train loss 3.06198. lr 5.922960e-04:  14%|█▍        | 2360/16329 [19:56<1:57:46,  1.98it/s][A
epoch 1 iter 2360: train loss 3.06198. lr 5.922960e-04:  14%|█▍        | 2361/16329 [19:56<1:57:07,  1.99it/s][A
epoch 1 iter 2361: train loss 3.00102. lr 5.922895e-04:  14%|█▍        | 2361/16329 [19:56<1:57:07,  1.99it/s][A
epoch 1 iter 2361: train loss 3.00102. lr 5.922895e-04:  14%|█▍        | 2362/16329 [19:

epoch 1 iter 2393: train loss 2.93376. lr 5.920801e-04:  15%|█▍        | 2393/16329 [20:12<1:55:20,  2.01it/s][A
epoch 1 iter 2393: train loss 2.93376. lr 5.920801e-04:  15%|█▍        | 2394/16329 [20:12<1:55:07,  2.02it/s][A
epoch 1 iter 2394: train loss 3.04660. lr 5.920735e-04:  15%|█▍        | 2394/16329 [20:13<1:55:07,  2.02it/s][A
epoch 1 iter 2394: train loss 3.04660. lr 5.920735e-04:  15%|█▍        | 2395/16329 [20:13<1:55:08,  2.02it/s][A
epoch 1 iter 2395: train loss 3.03953. lr 5.920669e-04:  15%|█▍        | 2395/16329 [20:13<1:55:08,  2.02it/s][A
epoch 1 iter 2395: train loss 3.03953. lr 5.920669e-04:  15%|█▍        | 2396/16329 [20:13<1:54:55,  2.02it/s][A
epoch 1 iter 2396: train loss 3.07816. lr 5.920603e-04:  15%|█▍        | 2396/16329 [20:14<1:54:55,  2.02it/s][A
epoch 1 iter 2396: train loss 3.07816. lr 5.920603e-04:  15%|█▍        | 2397/16329 [20:14<1:54:25,  2.03it/s][A
epoch 1 iter 2397: train loss 3.05018. lr 5.920537e-04:  15%|█▍        | 2397/16329 [20:

epoch 1 iter 2428: train loss 3.06797. lr 5.918478e-04:  15%|█▍        | 2429/16329 [20:30<1:54:17,  2.03it/s][A
epoch 1 iter 2429: train loss 3.02429. lr 5.918411e-04:  15%|█▍        | 2429/16329 [20:30<1:54:17,  2.03it/s][A
epoch 1 iter 2429: train loss 3.02429. lr 5.918411e-04:  15%|█▍        | 2430/16329 [20:30<1:54:05,  2.03it/s][A
epoch 1 iter 2430: train loss 3.04833. lr 5.918345e-04:  15%|█▍        | 2430/16329 [20:31<1:54:05,  2.03it/s][A
epoch 1 iter 2430: train loss 3.04833. lr 5.918345e-04:  15%|█▍        | 2431/16329 [20:31<1:54:12,  2.03it/s][A
epoch 1 iter 2431: train loss 3.02203. lr 5.918278e-04:  15%|█▍        | 2431/16329 [20:31<1:54:12,  2.03it/s][A
epoch 1 iter 2431: train loss 3.02203. lr 5.918278e-04:  15%|█▍        | 2432/16329 [20:31<1:54:16,  2.03it/s][A
epoch 1 iter 2432: train loss 3.06889. lr 5.918211e-04:  15%|█▍        | 2432/16329 [20:32<1:54:16,  2.03it/s][A
epoch 1 iter 2432: train loss 3.06889. lr 5.918211e-04:  15%|█▍        | 2433/16329 [20:

epoch 1 iter 2464: train loss 2.97205. lr 5.916055e-04:  15%|█▌        | 2464/16329 [20:48<1:54:29,  2.02it/s][A
epoch 1 iter 2464: train loss 2.97205. lr 5.916055e-04:  15%|█▌        | 2465/16329 [20:48<1:54:23,  2.02it/s][A
epoch 1 iter 2465: train loss 3.00744. lr 5.915987e-04:  15%|█▌        | 2465/16329 [20:48<1:54:23,  2.02it/s][A
epoch 1 iter 2465: train loss 3.00744. lr 5.915987e-04:  15%|█▌        | 2466/16329 [20:48<1:54:20,  2.02it/s][A
epoch 1 iter 2466: train loss 2.97335. lr 5.915920e-04:  15%|█▌        | 2466/16329 [20:49<1:54:20,  2.02it/s][A
epoch 1 iter 2466: train loss 2.97335. lr 5.915920e-04:  15%|█▌        | 2467/16329 [20:49<1:53:48,  2.03it/s][A
epoch 1 iter 2467: train loss 2.98777. lr 5.915852e-04:  15%|█▌        | 2467/16329 [20:49<1:53:48,  2.03it/s][A
epoch 1 iter 2467: train loss 2.98777. lr 5.915852e-04:  15%|█▌        | 2468/16329 [20:49<1:54:10,  2.02it/s][A
epoch 1 iter 2468: train loss 3.01447. lr 5.915784e-04:  15%|█▌        | 2468/16329 [20:

epoch 1 iter 2499: train loss 2.97723. lr 5.913666e-04:  15%|█▌        | 2500/16329 [21:05<1:53:57,  2.02it/s][A
epoch 1 iter 2500: train loss 2.92384. lr 5.913597e-04:  15%|█▌        | 2500/16329 [21:06<1:53:57,  2.02it/s][A
epoch 1 iter 2500: train loss 2.92384. lr 5.913597e-04:  15%|█▌        | 2501/16329 [21:06<1:53:33,  2.03it/s][A
epoch 1 iter 2501: train loss 2.91686. lr 5.913528e-04:  15%|█▌        | 2501/16329 [21:06<1:53:33,  2.03it/s][A
epoch 1 iter 2501: train loss 2.91686. lr 5.913528e-04:  15%|█▌        | 2502/16329 [21:06<1:53:40,  2.03it/s][A
epoch 1 iter 2502: train loss 2.92749. lr 5.913460e-04:  15%|█▌        | 2502/16329 [21:07<1:53:40,  2.03it/s][A
epoch 1 iter 2502: train loss 2.92749. lr 5.913460e-04:  15%|█▌        | 2503/16329 [21:07<1:53:30,  2.03it/s][A
epoch 1 iter 2503: train loss 2.97610. lr 5.913391e-04:  15%|█▌        | 2503/16329 [21:07<1:53:30,  2.03it/s][A
epoch 1 iter 2503: train loss 2.97610. lr 5.913391e-04:  15%|█▌        | 2504/16329 [21:

epoch 1 iter 2535: train loss 2.92853. lr 5.911174e-04:  16%|█▌        | 2535/16329 [21:23<2:05:39,  1.83it/s][A
epoch 1 iter 2535: train loss 2.92853. lr 5.911174e-04:  16%|█▌        | 2536/16329 [21:23<2:02:16,  1.88it/s][A
epoch 1 iter 2536: train loss 2.92519. lr 5.911104e-04:  16%|█▌        | 2536/16329 [21:24<2:02:16,  1.88it/s][A
epoch 1 iter 2536: train loss 2.92519. lr 5.911104e-04:  16%|█▌        | 2537/16329 [21:24<1:59:36,  1.92it/s][A
epoch 1 iter 2537: train loss 2.93142. lr 5.911034e-04:  16%|█▌        | 2537/16329 [21:24<1:59:36,  1.92it/s][A
epoch 1 iter 2537: train loss 2.93142. lr 5.911034e-04:  16%|█▌        | 2538/16329 [21:24<1:57:53,  1.95it/s][A
epoch 1 iter 2538: train loss 2.88920. lr 5.910965e-04:  16%|█▌        | 2538/16329 [21:25<1:57:53,  1.95it/s][A
epoch 1 iter 2538: train loss 2.88920. lr 5.910965e-04:  16%|█▌        | 2539/16329 [21:25<1:56:42,  1.97it/s][A
epoch 1 iter 2539: train loss 3.00669. lr 5.910895e-04:  16%|█▌        | 2539/16329 [21:

epoch 1 iter 2570: train loss 3.01409. lr 5.908718e-04:  16%|█▌        | 2571/16329 [21:41<1:53:09,  2.03it/s][A
epoch 1 iter 2571: train loss 2.91295. lr 5.908647e-04:  16%|█▌        | 2571/16329 [21:41<1:53:09,  2.03it/s][A
epoch 1 iter 2571: train loss 2.91295. lr 5.908647e-04:  16%|█▌        | 2572/16329 [21:41<1:53:08,  2.03it/s][A
epoch 1 iter 2572: train loss 2.99866. lr 5.908576e-04:  16%|█▌        | 2572/16329 [21:42<1:53:08,  2.03it/s][A
epoch 1 iter 2572: train loss 2.99866. lr 5.908576e-04:  16%|█▌        | 2573/16329 [21:42<1:53:26,  2.02it/s][A
epoch 1 iter 2573: train loss 2.84753. lr 5.908505e-04:  16%|█▌        | 2573/16329 [21:42<1:53:26,  2.02it/s][A
epoch 1 iter 2573: train loss 2.84753. lr 5.908505e-04:  16%|█▌        | 2574/16329 [21:42<1:53:22,  2.02it/s][A
epoch 1 iter 2574: train loss 2.95952. lr 5.908435e-04:  16%|█▌        | 2574/16329 [21:43<1:53:22,  2.02it/s][A
epoch 1 iter 2574: train loss 2.95952. lr 5.908435e-04:  16%|█▌        | 2575/16329 [21:

epoch 1 iter 2606: train loss 2.84416. lr 5.906157e-04:  16%|█▌        | 2606/16329 [21:59<2:03:38,  1.85it/s][A
epoch 1 iter 2606: train loss 2.84416. lr 5.906157e-04:  16%|█▌        | 2607/16329 [21:59<2:05:21,  1.82it/s][A
epoch 1 iter 2607: train loss 2.94925. lr 5.906085e-04:  16%|█▌        | 2607/16329 [22:00<2:05:21,  1.82it/s][A
epoch 1 iter 2607: train loss 2.94925. lr 5.906085e-04:  16%|█▌        | 2608/16329 [22:00<2:05:26,  1.82it/s][A
epoch 1 iter 2608: train loss 2.96360. lr 5.906013e-04:  16%|█▌        | 2608/16329 [22:00<2:05:26,  1.82it/s][A
epoch 1 iter 2608: train loss 2.96360. lr 5.906013e-04:  16%|█▌        | 2609/16329 [22:00<2:04:17,  1.84it/s][A
epoch 1 iter 2609: train loss 2.93835. lr 5.905942e-04:  16%|█▌        | 2609/16329 [22:01<2:04:17,  1.84it/s][A
epoch 1 iter 2609: train loss 2.93835. lr 5.905942e-04:  16%|█▌        | 2610/16329 [22:01<2:14:06,  1.70it/s][A
epoch 1 iter 2610: train loss 2.90423. lr 5.905870e-04:  16%|█▌        | 2610/16329 [22:

epoch 1 iter 2641: train loss 2.88536. lr 5.903634e-04:  16%|█▌        | 2642/16329 [22:17<1:53:20,  2.01it/s][A
epoch 1 iter 2642: train loss 2.81557. lr 5.903561e-04:  16%|█▌        | 2642/16329 [22:17<1:53:20,  2.01it/s][A
epoch 1 iter 2642: train loss 2.81557. lr 5.903561e-04:  16%|█▌        | 2643/16329 [22:17<1:53:53,  2.00it/s][A
epoch 1 iter 2643: train loss 2.91664. lr 5.903488e-04:  16%|█▌        | 2643/16329 [22:18<1:53:53,  2.00it/s][A
epoch 1 iter 2643: train loss 2.91664. lr 5.903488e-04:  16%|█▌        | 2644/16329 [22:18<1:54:23,  1.99it/s][A
epoch 1 iter 2644: train loss 2.81046. lr 5.903416e-04:  16%|█▌        | 2644/16329 [22:18<1:54:23,  1.99it/s][A
epoch 1 iter 2644: train loss 2.81046. lr 5.903416e-04:  16%|█▌        | 2645/16329 [22:18<1:53:36,  2.01it/s][A
epoch 1 iter 2645: train loss 2.91495. lr 5.903343e-04:  16%|█▌        | 2645/16329 [22:19<1:53:36,  2.01it/s][A
epoch 1 iter 2645: train loss 2.91495. lr 5.903343e-04:  16%|█▌        | 2646/16329 [22:

epoch 1 iter 2677: train loss 2.82446. lr 5.901004e-04:  16%|█▋        | 2677/16329 [22:35<1:52:20,  2.03it/s][A
epoch 1 iter 2677: train loss 2.82446. lr 5.901004e-04:  16%|█▋        | 2678/16329 [22:35<1:51:54,  2.03it/s][A
epoch 1 iter 2678: train loss 2.90445. lr 5.900930e-04:  16%|█▋        | 2678/16329 [22:35<1:51:54,  2.03it/s][A
epoch 1 iter 2678: train loss 2.90445. lr 5.900930e-04:  16%|█▋        | 2679/16329 [22:35<1:52:19,  2.03it/s][A
epoch 1 iter 2679: train loss 2.87353. lr 5.900857e-04:  16%|█▋        | 2679/16329 [22:36<1:52:19,  2.03it/s][A
epoch 1 iter 2679: train loss 2.87353. lr 5.900857e-04:  16%|█▋        | 2680/16329 [22:36<1:52:20,  2.02it/s][A
epoch 1 iter 2680: train loss 2.82705. lr 5.900783e-04:  16%|█▋        | 2680/16329 [22:36<1:52:20,  2.02it/s][A
epoch 1 iter 2680: train loss 2.82705. lr 5.900783e-04:  16%|█▋        | 2681/16329 [22:36<1:52:24,  2.02it/s][A
epoch 1 iter 2681: train loss 2.87109. lr 5.900710e-04:  16%|█▋        | 2681/16329 [22:

epoch 1 iter 2712: train loss 2.82426. lr 5.898414e-04:  17%|█▋        | 2713/16329 [22:52<1:52:11,  2.02it/s][A
epoch 1 iter 2713: train loss 2.87457. lr 5.898340e-04:  17%|█▋        | 2713/16329 [22:53<1:52:11,  2.02it/s][A
epoch 1 iter 2713: train loss 2.87457. lr 5.898340e-04:  17%|█▋        | 2714/16329 [22:53<1:52:33,  2.02it/s][A
epoch 1 iter 2714: train loss 2.82076. lr 5.898265e-04:  17%|█▋        | 2714/16329 [22:53<1:52:33,  2.02it/s][A
epoch 1 iter 2714: train loss 2.82076. lr 5.898265e-04:  17%|█▋        | 2715/16329 [22:53<1:52:15,  2.02it/s][A
epoch 1 iter 2715: train loss 2.89609. lr 5.898190e-04:  17%|█▋        | 2715/16329 [22:54<1:52:15,  2.02it/s][A
epoch 1 iter 2715: train loss 2.89609. lr 5.898190e-04:  17%|█▋        | 2716/16329 [22:54<1:51:55,  2.03it/s][A
epoch 1 iter 2716: train loss 2.80998. lr 5.898116e-04:  17%|█▋        | 2716/16329 [22:54<1:51:55,  2.03it/s][A
epoch 1 iter 2716: train loss 2.80998. lr 5.898116e-04:  17%|█▋        | 2717/16329 [22:

epoch 1 iter 2748: train loss 2.82022. lr 5.895716e-04:  17%|█▋        | 2748/16329 [23:10<1:51:51,  2.02it/s][A
epoch 1 iter 2748: train loss 2.82022. lr 5.895716e-04:  17%|█▋        | 2749/16329 [23:10<1:51:52,  2.02it/s][A
epoch 1 iter 2749: train loss 2.82137. lr 5.895640e-04:  17%|█▋        | 2749/16329 [23:11<1:51:52,  2.02it/s][A
epoch 1 iter 2749: train loss 2.82137. lr 5.895640e-04:  17%|█▋        | 2750/16329 [23:11<1:51:44,  2.03it/s][A
epoch 1 iter 2750: train loss 2.76363. lr 5.895565e-04:  17%|█▋        | 2750/16329 [23:11<1:51:44,  2.03it/s][A
epoch 1 iter 2750: train loss 2.76363. lr 5.895565e-04:  17%|█▋        | 2751/16329 [23:11<1:51:51,  2.02it/s][A
epoch 1 iter 2751: train loss 2.84085. lr 5.895489e-04:  17%|█▋        | 2751/16329 [23:12<1:51:51,  2.02it/s][A
epoch 1 iter 2751: train loss 2.84085. lr 5.895489e-04:  17%|█▋        | 2752/16329 [23:12<1:51:37,  2.03it/s][A
epoch 1 iter 2752: train loss 2.79910. lr 5.895414e-04:  17%|█▋        | 2752/16329 [23:

epoch 1 iter 2783: train loss 2.86493. lr 5.893059e-04:  17%|█▋        | 2784/16329 [23:28<1:51:14,  2.03it/s][A
epoch 1 iter 2784: train loss 2.81559. lr 5.892983e-04:  17%|█▋        | 2784/16329 [23:28<1:51:14,  2.03it/s][A
epoch 1 iter 2784: train loss 2.81559. lr 5.892983e-04:  17%|█▋        | 2785/16329 [23:28<1:51:33,  2.02it/s][A
epoch 1 iter 2785: train loss 2.86033. lr 5.892906e-04:  17%|█▋        | 2785/16329 [23:29<1:51:33,  2.02it/s][A
epoch 1 iter 2785: train loss 2.86033. lr 5.892906e-04:  17%|█▋        | 2786/16329 [23:29<1:51:36,  2.02it/s][A
epoch 1 iter 2786: train loss 2.76046. lr 5.892830e-04:  17%|█▋        | 2786/16329 [23:29<1:51:36,  2.02it/s][A
epoch 1 iter 2786: train loss 2.76046. lr 5.892830e-04:  17%|█▋        | 2787/16329 [23:29<1:51:20,  2.03it/s][A
epoch 1 iter 2787: train loss 2.82461. lr 5.892754e-04:  17%|█▋        | 2787/16329 [23:30<1:51:20,  2.03it/s][A
epoch 1 iter 2787: train loss 2.82461. lr 5.892754e-04:  17%|█▋        | 2788/16329 [23:

epoch 1 iter 2819: train loss 2.77072. lr 5.890293e-04:  17%|█▋        | 2819/16329 [23:46<1:50:51,  2.03it/s][A
epoch 1 iter 2819: train loss 2.77072. lr 5.890293e-04:  17%|█▋        | 2820/16329 [23:46<1:51:00,  2.03it/s][A
epoch 1 iter 2820: train loss 2.83075. lr 5.890215e-04:  17%|█▋        | 2820/16329 [23:46<1:51:00,  2.03it/s][A
epoch 1 iter 2820: train loss 2.83075. lr 5.890215e-04:  17%|█▋        | 2821/16329 [23:46<1:50:59,  2.03it/s][A
epoch 1 iter 2821: train loss 2.81159. lr 5.890138e-04:  17%|█▋        | 2821/16329 [23:47<1:50:59,  2.03it/s][A
epoch 1 iter 2821: train loss 2.81159. lr 5.890138e-04:  17%|█▋        | 2822/16329 [23:47<1:51:13,  2.02it/s][A
epoch 1 iter 2822: train loss 2.76696. lr 5.890060e-04:  17%|█▋        | 2822/16329 [23:47<1:51:13,  2.02it/s][A
epoch 1 iter 2822: train loss 2.76696. lr 5.890060e-04:  17%|█▋        | 2823/16329 [23:47<1:51:15,  2.02it/s][A
epoch 1 iter 2823: train loss 2.81838. lr 5.889983e-04:  17%|█▋        | 2823/16329 [23:

epoch 1 iter 2854: train loss 2.87015. lr 5.887570e-04:  17%|█▋        | 2855/16329 [24:03<1:50:55,  2.02it/s][A
epoch 1 iter 2855: train loss 2.71571. lr 5.887491e-04:  17%|█▋        | 2855/16329 [24:04<1:50:55,  2.02it/s][A
epoch 1 iter 2855: train loss 2.71571. lr 5.887491e-04:  17%|█▋        | 2856/16329 [24:04<1:50:49,  2.03it/s][A
epoch 1 iter 2856: train loss 2.73885. lr 5.887413e-04:  17%|█▋        | 2856/16329 [24:04<1:50:49,  2.03it/s][A
epoch 1 iter 2856: train loss 2.73885. lr 5.887413e-04:  17%|█▋        | 2857/16329 [24:04<1:50:52,  2.03it/s][A
epoch 1 iter 2857: train loss 2.78357. lr 5.887335e-04:  17%|█▋        | 2857/16329 [24:05<1:50:52,  2.03it/s][A
epoch 1 iter 2857: train loss 2.78357. lr 5.887335e-04:  18%|█▊        | 2858/16329 [24:05<1:50:45,  2.03it/s][A
epoch 1 iter 2858: train loss 2.75669. lr 5.887256e-04:  18%|█▊        | 2858/16329 [24:05<1:50:45,  2.03it/s][A
epoch 1 iter 2858: train loss 2.75669. lr 5.887256e-04:  18%|█▊        | 2859/16329 [24:

epoch 1 iter 2890: train loss 2.70753. lr 5.884735e-04:  18%|█▊        | 2890/16329 [24:22<1:59:08,  1.88it/s][A
epoch 1 iter 2890: train loss 2.70753. lr 5.884735e-04:  18%|█▊        | 2891/16329 [24:22<1:56:00,  1.93it/s][A
epoch 1 iter 2891: train loss 2.75425. lr 5.884655e-04:  18%|█▊        | 2891/16329 [24:22<1:56:00,  1.93it/s][A
epoch 1 iter 2891: train loss 2.75425. lr 5.884655e-04:  18%|█▊        | 2892/16329 [24:22<1:54:27,  1.96it/s][A
epoch 1 iter 2892: train loss 2.71724. lr 5.884576e-04:  18%|█▊        | 2892/16329 [24:23<1:54:27,  1.96it/s][A
epoch 1 iter 2892: train loss 2.71724. lr 5.884576e-04:  18%|█▊        | 2893/16329 [24:23<1:53:19,  1.98it/s][A
epoch 1 iter 2893: train loss 2.78177. lr 5.884497e-04:  18%|█▊        | 2893/16329 [24:23<1:53:19,  1.98it/s][A
epoch 1 iter 2893: train loss 2.78177. lr 5.884497e-04:  18%|█▊        | 2894/16329 [24:23<1:52:25,  1.99it/s][A
epoch 1 iter 2894: train loss 2.70497. lr 5.884417e-04:  18%|█▊        | 2894/16329 [24:

epoch 1 iter 2925: train loss 2.69546. lr 5.881945e-04:  18%|█▊        | 2926/16329 [24:39<1:51:04,  2.01it/s][A
epoch 1 iter 2926: train loss 2.79772. lr 5.881865e-04:  18%|█▊        | 2926/16329 [24:39<1:51:04,  2.01it/s][A
epoch 1 iter 2926: train loss 2.79772. lr 5.881865e-04:  18%|█▊        | 2927/16329 [24:39<1:50:53,  2.01it/s][A
epoch 1 iter 2927: train loss 2.70951. lr 5.881785e-04:  18%|█▊        | 2927/16329 [24:40<1:50:53,  2.01it/s][A
epoch 1 iter 2927: train loss 2.70951. lr 5.881785e-04:  18%|█▊        | 2928/16329 [24:40<1:50:18,  2.02it/s][A
epoch 1 iter 2928: train loss 2.75316. lr 5.881705e-04:  18%|█▊        | 2928/16329 [24:40<1:50:18,  2.02it/s][A
epoch 1 iter 2928: train loss 2.75316. lr 5.881705e-04:  18%|█▊        | 2929/16329 [24:40<1:50:17,  2.02it/s][A
epoch 1 iter 2929: train loss 2.68730. lr 5.881624e-04:  18%|█▊        | 2929/16329 [24:41<1:50:17,  2.02it/s][A
epoch 1 iter 2929: train loss 2.68730. lr 5.881624e-04:  18%|█▊        | 2930/16329 [24:

epoch 1 iter 2961: train loss 2.75024. lr 5.879042e-04:  18%|█▊        | 2961/16329 [24:57<1:51:41,  1.99it/s][A
epoch 1 iter 2961: train loss 2.75024. lr 5.879042e-04:  18%|█▊        | 2962/16329 [24:57<1:51:25,  2.00it/s][A
epoch 1 iter 2962: train loss 2.66837. lr 5.878961e-04:  18%|█▊        | 2962/16329 [24:58<1:51:25,  2.00it/s][A
epoch 1 iter 2962: train loss 2.66837. lr 5.878961e-04:  18%|█▊        | 2963/16329 [24:58<1:50:52,  2.01it/s][A
epoch 1 iter 2963: train loss 2.78210. lr 5.878880e-04:  18%|█▊        | 2963/16329 [24:58<1:50:52,  2.01it/s][A
epoch 1 iter 2963: train loss 2.78210. lr 5.878880e-04:  18%|█▊        | 2964/16329 [24:58<1:50:44,  2.01it/s][A
epoch 1 iter 2964: train loss 2.75081. lr 5.878798e-04:  18%|█▊        | 2964/16329 [24:59<1:50:44,  2.01it/s][A
epoch 1 iter 2964: train loss 2.75081. lr 5.878798e-04:  18%|█▊        | 2965/16329 [24:59<1:50:32,  2.01it/s][A
epoch 1 iter 2965: train loss 2.75547. lr 5.878717e-04:  18%|█▊        | 2965/16329 [24:

epoch 1 iter 2996: train loss 2.72755. lr 5.876186e-04:  18%|█▊        | 2997/16329 [25:15<1:50:43,  2.01it/s][A
epoch 1 iter 2997: train loss 2.65435. lr 5.876104e-04:  18%|█▊        | 2997/16329 [25:15<1:50:43,  2.01it/s][A
epoch 1 iter 2997: train loss 2.65435. lr 5.876104e-04:  18%|█▊        | 2998/16329 [25:15<1:50:32,  2.01it/s][A
epoch 1 iter 2998: train loss 2.71926. lr 5.876022e-04:  18%|█▊        | 2998/16329 [25:16<1:50:32,  2.01it/s][A
epoch 1 iter 2998: train loss 2.71926. lr 5.876022e-04:  18%|█▊        | 2999/16329 [25:16<1:49:56,  2.02it/s][A
epoch 1 iter 2999: train loss 2.73631. lr 5.875940e-04:  18%|█▊        | 2999/16329 [25:16<1:49:56,  2.02it/s][A
epoch 1 iter 2999: train loss 2.73631. lr 5.875940e-04:  18%|█▊        | 3000/16329 [25:16<1:50:02,  2.02it/s][A
epoch 1 iter 3000: train loss 2.69687. lr 5.875858e-04:  18%|█▊        | 3000/16329 [25:17<1:50:02,  2.02it/s][A
epoch 1 iter 3000: train loss 2.69687. lr 5.875858e-04:  18%|█▊        | 3001/16329 [25:

epoch 1 iter 3032: train loss 2.66362. lr 5.873215e-04:  19%|█▊        | 3032/16329 [25:33<1:49:23,  2.03it/s][A
epoch 1 iter 3032: train loss 2.66362. lr 5.873215e-04:  19%|█▊        | 3033/16329 [25:33<1:49:20,  2.03it/s][A
epoch 1 iter 3033: train loss 2.77328. lr 5.873132e-04:  19%|█▊        | 3033/16329 [25:33<1:49:20,  2.03it/s][A
epoch 1 iter 3033: train loss 2.77328. lr 5.873132e-04:  19%|█▊        | 3034/16329 [25:33<1:49:23,  2.03it/s][A
epoch 1 iter 3034: train loss 2.67832. lr 5.873049e-04:  19%|█▊        | 3034/16329 [25:34<1:49:23,  2.03it/s][A
epoch 1 iter 3034: train loss 2.67832. lr 5.873049e-04:  19%|█▊        | 3035/16329 [25:34<1:49:24,  2.03it/s][A
epoch 1 iter 3035: train loss 2.71955. lr 5.872966e-04:  19%|█▊        | 3035/16329 [25:34<1:49:24,  2.03it/s][A
epoch 1 iter 3035: train loss 2.71955. lr 5.872966e-04:  19%|█▊        | 3036/16329 [25:34<1:49:09,  2.03it/s][A
epoch 1 iter 3036: train loss 2.67930. lr 5.872883e-04:  19%|█▊        | 3036/16329 [25:

epoch 1 iter 3067: train loss 2.64349. lr 5.870293e-04:  19%|█▉        | 3068/16329 [25:50<1:49:26,  2.02it/s][A
epoch 1 iter 3068: train loss 2.63712. lr 5.870209e-04:  19%|█▉        | 3068/16329 [25:51<1:49:26,  2.02it/s][A
epoch 1 iter 3068: train loss 2.63712. lr 5.870209e-04:  19%|█▉        | 3069/16329 [25:51<1:49:27,  2.02it/s][A
epoch 1 iter 3069: train loss 2.72014. lr 5.870125e-04:  19%|█▉        | 3069/16329 [25:51<1:49:27,  2.02it/s][A
epoch 1 iter 3069: train loss 2.72014. lr 5.870125e-04:  19%|█▉        | 3070/16329 [25:51<1:49:27,  2.02it/s][A
epoch 1 iter 3070: train loss 2.64746. lr 5.870041e-04:  19%|█▉        | 3070/16329 [25:52<1:49:27,  2.02it/s][A
epoch 1 iter 3070: train loss 2.64746. lr 5.870041e-04:  19%|█▉        | 3071/16329 [25:52<1:49:18,  2.02it/s][A
epoch 1 iter 3071: train loss 2.57193. lr 5.869957e-04:  19%|█▉        | 3071/16329 [25:52<1:49:18,  2.02it/s][A
epoch 1 iter 3071: train loss 2.57193. lr 5.869957e-04:  19%|█▉        | 3072/16329 [25:

epoch 1 iter 3103: train loss 2.68138. lr 5.867254e-04:  19%|█▉        | 3103/16329 [26:08<1:49:02,  2.02it/s][A
epoch 1 iter 3103: train loss 2.68138. lr 5.867254e-04:  19%|█▉        | 3104/16329 [26:08<1:48:50,  2.03it/s][A
epoch 1 iter 3104: train loss 2.58801. lr 5.867169e-04:  19%|█▉        | 3104/16329 [26:09<1:48:50,  2.03it/s][A
epoch 1 iter 3104: train loss 2.58801. lr 5.867169e-04:  19%|█▉        | 3105/16329 [26:09<1:48:34,  2.03it/s][A
epoch 1 iter 3105: train loss 2.65803. lr 5.867084e-04:  19%|█▉        | 3105/16329 [26:09<1:48:34,  2.03it/s][A
epoch 1 iter 3105: train loss 2.65803. lr 5.867084e-04:  19%|█▉        | 3106/16329 [26:09<1:50:50,  1.99it/s][A
epoch 1 iter 3106: train loss 2.67784. lr 5.866999e-04:  19%|█▉        | 3106/16329 [26:10<1:50:50,  1.99it/s][A
epoch 1 iter 3106: train loss 2.67784. lr 5.866999e-04:  19%|█▉        | 3107/16329 [26:10<1:52:54,  1.95it/s][A
epoch 1 iter 3107: train loss 2.64878. lr 5.866914e-04:  19%|█▉        | 3107/16329 [26:

epoch 1 iter 3138: train loss 2.58400. lr 5.864266e-04:  19%|█▉        | 3139/16329 [26:26<1:48:25,  2.03it/s][A
epoch 1 iter 3139: train loss 2.70031. lr 5.864180e-04:  19%|█▉        | 3139/16329 [26:26<1:48:25,  2.03it/s][A
epoch 1 iter 3139: train loss 2.70031. lr 5.864180e-04:  19%|█▉        | 3140/16329 [26:26<1:48:22,  2.03it/s][A
epoch 1 iter 3140: train loss 2.64992. lr 5.864095e-04:  19%|█▉        | 3140/16329 [26:27<1:48:22,  2.03it/s][A
epoch 1 iter 3140: train loss 2.64992. lr 5.864095e-04:  19%|█▉        | 3141/16329 [26:27<1:48:37,  2.02it/s][A
epoch 1 iter 3141: train loss 2.62725. lr 5.864009e-04:  19%|█▉        | 3141/16329 [26:27<1:48:37,  2.02it/s][A
epoch 1 iter 3141: train loss 2.62725. lr 5.864009e-04:  19%|█▉        | 3142/16329 [26:27<1:48:42,  2.02it/s][A
epoch 1 iter 3142: train loss 2.65517. lr 5.863923e-04:  19%|█▉        | 3142/16329 [26:28<1:48:42,  2.02it/s][A
epoch 1 iter 3142: train loss 2.65517. lr 5.863923e-04:  19%|█▉        | 3143/16329 [26:

epoch 1 iter 3174: train loss 2.60895. lr 5.861159e-04:  19%|█▉        | 3174/16329 [26:44<1:53:57,  1.92it/s][A
epoch 1 iter 3174: train loss 2.60895. lr 5.861159e-04:  19%|█▉        | 3175/16329 [26:44<1:52:12,  1.95it/s][A
epoch 1 iter 3175: train loss 2.60728. lr 5.861073e-04:  19%|█▉        | 3175/16329 [26:45<1:52:12,  1.95it/s][A
epoch 1 iter 3175: train loss 2.60728. lr 5.861073e-04:  19%|█▉        | 3176/16329 [26:45<1:51:04,  1.97it/s][A
epoch 1 iter 3176: train loss 2.63620. lr 5.860986e-04:  19%|█▉        | 3176/16329 [26:45<1:51:04,  1.97it/s][A
epoch 1 iter 3176: train loss 2.63620. lr 5.860986e-04:  19%|█▉        | 3177/16329 [26:45<1:50:02,  1.99it/s][A
epoch 1 iter 3177: train loss 2.61643. lr 5.860899e-04:  19%|█▉        | 3177/16329 [26:46<1:50:02,  1.99it/s][A
epoch 1 iter 3177: train loss 2.61643. lr 5.860899e-04:  19%|█▉        | 3178/16329 [26:46<1:49:52,  1.99it/s][A
epoch 1 iter 3178: train loss 2.60768. lr 5.860812e-04:  19%|█▉        | 3178/16329 [26:

epoch 1 iter 3209: train loss 2.61654. lr 5.858106e-04:  20%|█▉        | 3210/16329 [27:02<1:58:39,  1.84it/s][A
epoch 1 iter 3210: train loss 2.56759. lr 5.858018e-04:  20%|█▉        | 3210/16329 [27:02<1:58:39,  1.84it/s][A
epoch 1 iter 3210: train loss 2.56759. lr 5.858018e-04:  20%|█▉        | 3211/16329 [27:02<1:55:23,  1.89it/s][A
epoch 1 iter 3211: train loss 2.58767. lr 5.857930e-04:  20%|█▉        | 3211/16329 [27:03<1:55:23,  1.89it/s][A
epoch 1 iter 3211: train loss 2.58767. lr 5.857930e-04:  20%|█▉        | 3212/16329 [27:03<1:53:08,  1.93it/s][A
epoch 1 iter 3212: train loss 2.72680. lr 5.857842e-04:  20%|█▉        | 3212/16329 [27:03<1:53:08,  1.93it/s][A
epoch 1 iter 3212: train loss 2.72680. lr 5.857842e-04:  20%|█▉        | 3213/16329 [27:03<1:51:45,  1.96it/s][A
epoch 1 iter 3213: train loss 2.61692. lr 5.857755e-04:  20%|█▉        | 3213/16329 [27:04<1:51:45,  1.96it/s][A
epoch 1 iter 3213: train loss 2.61692. lr 5.857755e-04:  20%|█▉        | 3214/16329 [27:

epoch 1 iter 3245: train loss 2.61275. lr 5.854931e-04:  20%|█▉        | 3245/16329 [27:20<1:59:04,  1.83it/s][A
epoch 1 iter 3245: train loss 2.61275. lr 5.854931e-04:  20%|█▉        | 3246/16329 [27:20<1:55:30,  1.89it/s][A
epoch 1 iter 3246: train loss 2.59102. lr 5.854842e-04:  20%|█▉        | 3246/16329 [27:21<1:55:30,  1.89it/s][A
epoch 1 iter 3246: train loss 2.59102. lr 5.854842e-04:  20%|█▉        | 3247/16329 [27:21<1:53:34,  1.92it/s][A
epoch 1 iter 3247: train loss 2.63575. lr 5.854754e-04:  20%|█▉        | 3247/16329 [27:21<1:53:34,  1.92it/s][A
epoch 1 iter 3247: train loss 2.63575. lr 5.854754e-04:  20%|█▉        | 3248/16329 [27:21<1:51:42,  1.95it/s][A
epoch 1 iter 3248: train loss 2.59334. lr 5.854665e-04:  20%|█▉        | 3248/16329 [27:22<1:51:42,  1.95it/s][A
epoch 1 iter 3248: train loss 2.59334. lr 5.854665e-04:  20%|█▉        | 3249/16329 [27:22<1:49:57,  1.98it/s][A
epoch 1 iter 3249: train loss 2.57380. lr 5.854576e-04:  20%|█▉        | 3249/16329 [27:

epoch 1 iter 3280: train loss 2.58527. lr 5.851812e-04:  20%|██        | 3281/16329 [27:38<1:47:18,  2.03it/s][A
epoch 1 iter 3281: train loss 2.59662. lr 5.851722e-04:  20%|██        | 3281/16329 [27:38<1:47:18,  2.03it/s][A
epoch 1 iter 3281: train loss 2.59662. lr 5.851722e-04:  20%|██        | 3282/16329 [27:38<1:46:45,  2.04it/s][A
epoch 1 iter 3282: train loss 2.61229. lr 5.851633e-04:  20%|██        | 3282/16329 [27:39<1:46:45,  2.04it/s][A
epoch 1 iter 3282: train loss 2.61229. lr 5.851633e-04:  20%|██        | 3283/16329 [27:39<1:47:00,  2.03it/s][A
epoch 1 iter 3283: train loss 2.57126. lr 5.851543e-04:  20%|██        | 3283/16329 [27:39<1:47:00,  2.03it/s][A
epoch 1 iter 3283: train loss 2.57126. lr 5.851543e-04:  20%|██        | 3284/16329 [27:39<1:47:06,  2.03it/s][A
epoch 1 iter 3284: train loss 2.53664. lr 5.851453e-04:  20%|██        | 3284/16329 [27:40<1:47:06,  2.03it/s][A
epoch 1 iter 3284: train loss 2.53664. lr 5.851453e-04:  20%|██        | 3285/16329 [27:

epoch 1 iter 3316: train loss 2.58033. lr 5.848570e-04:  20%|██        | 3316/16329 [27:56<1:53:05,  1.92it/s][A
epoch 1 iter 3316: train loss 2.58033. lr 5.848570e-04:  20%|██        | 3317/16329 [27:56<1:52:27,  1.93it/s][A
epoch 1 iter 3317: train loss 2.57634. lr 5.848479e-04:  20%|██        | 3317/16329 [27:56<1:52:27,  1.93it/s][A
epoch 1 iter 3317: train loss 2.57634. lr 5.848479e-04:  20%|██        | 3318/16329 [27:56<1:51:23,  1.95it/s][A
epoch 1 iter 3318: train loss 2.57515. lr 5.848389e-04:  20%|██        | 3318/16329 [27:57<1:51:23,  1.95it/s][A
epoch 1 iter 3318: train loss 2.57515. lr 5.848389e-04:  20%|██        | 3319/16329 [27:57<1:50:30,  1.96it/s][A
epoch 1 iter 3319: train loss 2.56075. lr 5.848298e-04:  20%|██        | 3319/16329 [27:57<1:50:30,  1.96it/s][A
epoch 1 iter 3319: train loss 2.56075. lr 5.848298e-04:  20%|██        | 3320/16329 [27:57<1:49:37,  1.98it/s][A
epoch 1 iter 3320: train loss 2.64768. lr 5.848207e-04:  20%|██        | 3320/16329 [27:

epoch 1 iter 3351: train loss 2.55454. lr 5.845385e-04:  21%|██        | 3352/16329 [28:13<1:47:32,  2.01it/s][A
epoch 1 iter 3352: train loss 2.60309. lr 5.845293e-04:  21%|██        | 3352/16329 [28:14<1:47:32,  2.01it/s][A
epoch 1 iter 3352: train loss 2.60309. lr 5.845293e-04:  21%|██        | 3353/16329 [28:14<1:47:21,  2.01it/s][A
epoch 1 iter 3353: train loss 2.51076. lr 5.845202e-04:  21%|██        | 3353/16329 [28:14<1:47:21,  2.01it/s][A
epoch 1 iter 3353: train loss 2.51076. lr 5.845202e-04:  21%|██        | 3354/16329 [28:14<1:47:23,  2.01it/s][A
epoch 1 iter 3354: train loss 2.48850. lr 5.845110e-04:  21%|██        | 3354/16329 [28:15<1:47:23,  2.01it/s][A
epoch 1 iter 3354: train loss 2.48850. lr 5.845110e-04:  21%|██        | 3355/16329 [28:15<1:46:54,  2.02it/s][A
epoch 1 iter 3355: train loss 2.59908. lr 5.845019e-04:  21%|██        | 3355/16329 [28:15<1:46:54,  2.02it/s][A
epoch 1 iter 3355: train loss 2.59908. lr 5.845019e-04:  21%|██        | 3356/16329 [28:

epoch 1 iter 3387: train loss 2.58108. lr 5.842075e-04:  21%|██        | 3387/16329 [28:32<1:47:46,  2.00it/s][A
epoch 1 iter 3387: train loss 2.58108. lr 5.842075e-04:  21%|██        | 3388/16329 [28:32<1:47:20,  2.01it/s][A
epoch 1 iter 3388: train loss 2.51330. lr 5.841983e-04:  21%|██        | 3388/16329 [28:32<1:47:20,  2.01it/s][A
epoch 1 iter 3388: train loss 2.51330. lr 5.841983e-04:  21%|██        | 3389/16329 [28:32<1:46:52,  2.02it/s][A
epoch 1 iter 3389: train loss 2.55038. lr 5.841891e-04:  21%|██        | 3389/16329 [28:33<1:46:52,  2.02it/s][A
epoch 1 iter 3389: train loss 2.55038. lr 5.841891e-04:  21%|██        | 3390/16329 [28:33<1:46:55,  2.02it/s][A
epoch 1 iter 3390: train loss 2.44250. lr 5.841798e-04:  21%|██        | 3390/16329 [28:33<1:46:55,  2.02it/s][A
epoch 1 iter 3390: train loss 2.44250. lr 5.841798e-04:  21%|██        | 3391/16329 [28:33<1:46:32,  2.02it/s][A
epoch 1 iter 3391: train loss 2.58146. lr 5.841706e-04:  21%|██        | 3391/16329 [28:

epoch 1 iter 3422: train loss 2.54303. lr 5.838825e-04:  21%|██        | 3423/16329 [28:49<1:45:44,  2.03it/s][A
epoch 1 iter 3423: train loss 2.54310. lr 5.838732e-04:  21%|██        | 3423/16329 [28:50<1:45:44,  2.03it/s][A
epoch 1 iter 3423: train loss 2.54310. lr 5.838732e-04:  21%|██        | 3424/16329 [28:50<1:57:11,  1.84it/s][A
epoch 1 iter 3424: train loss 2.49475. lr 5.838638e-04:  21%|██        | 3424/16329 [28:51<1:57:11,  1.84it/s][A
epoch 1 iter 3424: train loss 2.49475. lr 5.838638e-04:  21%|██        | 3425/16329 [28:51<1:54:02,  1.89it/s][A
epoch 1 iter 3425: train loss 2.50985. lr 5.838545e-04:  21%|██        | 3425/16329 [28:51<1:54:02,  1.89it/s][A
epoch 1 iter 3425: train loss 2.50985. lr 5.838545e-04:  21%|██        | 3426/16329 [28:51<1:51:49,  1.92it/s][A
epoch 1 iter 3426: train loss 2.49310. lr 5.838452e-04:  21%|██        | 3426/16329 [28:52<1:51:49,  1.92it/s][A
epoch 1 iter 3426: train loss 2.49310. lr 5.838452e-04:  21%|██        | 3427/16329 [28:

epoch 1 iter 3458: train loss 2.46755. lr 5.835449e-04:  21%|██        | 3458/16329 [29:08<1:45:47,  2.03it/s][A
epoch 1 iter 3458: train loss 2.46755. lr 5.835449e-04:  21%|██        | 3459/16329 [29:08<1:46:03,  2.02it/s][A
epoch 1 iter 3459: train loss 2.51804. lr 5.835354e-04:  21%|██        | 3459/16329 [29:08<1:46:03,  2.02it/s][A
epoch 1 iter 3459: train loss 2.51804. lr 5.835354e-04:  21%|██        | 3460/16329 [29:08<1:45:57,  2.02it/s][A
epoch 1 iter 3460: train loss 2.51114. lr 5.835260e-04:  21%|██        | 3460/16329 [29:09<1:45:57,  2.02it/s][A
epoch 1 iter 3460: train loss 2.51114. lr 5.835260e-04:  21%|██        | 3461/16329 [29:09<1:46:04,  2.02it/s][A
epoch 1 iter 3461: train loss 2.50319. lr 5.835166e-04:  21%|██        | 3461/16329 [29:09<1:46:04,  2.02it/s][A
epoch 1 iter 3461: train loss 2.50319. lr 5.835166e-04:  21%|██        | 3462/16329 [29:09<1:45:59,  2.02it/s][A
epoch 1 iter 3462: train loss 2.49169. lr 5.835071e-04:  21%|██        | 3462/16329 [29:

epoch 1 iter 3493: train loss 2.48799. lr 5.832133e-04:  21%|██▏       | 3494/16329 [29:25<1:45:50,  2.02it/s][A
epoch 1 iter 3494: train loss 2.49753. lr 5.832038e-04:  21%|██▏       | 3494/16329 [29:26<1:45:50,  2.02it/s][A
epoch 1 iter 3494: train loss 2.49753. lr 5.832038e-04:  21%|██▏       | 3495/16329 [29:26<1:45:44,  2.02it/s][A
epoch 1 iter 3495: train loss 2.54646. lr 5.831943e-04:  21%|██▏       | 3495/16329 [29:26<1:45:44,  2.02it/s][A
epoch 1 iter 3495: train loss 2.54646. lr 5.831943e-04:  21%|██▏       | 3496/16329 [29:26<1:45:41,  2.02it/s][A
epoch 1 iter 3496: train loss 2.51519. lr 5.831847e-04:  21%|██▏       | 3496/16329 [29:27<1:45:41,  2.02it/s][A
epoch 1 iter 3496: train loss 2.51519. lr 5.831847e-04:  21%|██▏       | 3497/16329 [29:27<1:45:28,  2.03it/s][A
epoch 1 iter 3497: train loss 2.50431. lr 5.831752e-04:  21%|██▏       | 3497/16329 [29:27<1:45:28,  2.03it/s][A
epoch 1 iter 3497: train loss 2.50431. lr 5.831752e-04:  21%|██▏       | 3498/16329 [29:

epoch 1 iter 3529: train loss 2.48299. lr 5.828689e-04:  22%|██▏       | 3529/16329 [29:44<1:47:06,  1.99it/s][A
epoch 1 iter 3529: train loss 2.48299. lr 5.828689e-04:  22%|██▏       | 3530/16329 [29:44<1:46:50,  2.00it/s][A
epoch 1 iter 3530: train loss 2.50731. lr 5.828593e-04:  22%|██▏       | 3530/16329 [29:44<1:46:50,  2.00it/s][A
epoch 1 iter 3530: train loss 2.50731. lr 5.828593e-04:  22%|██▏       | 3531/16329 [29:44<1:46:38,  2.00it/s][A
epoch 1 iter 3531: train loss 2.45442. lr 5.828497e-04:  22%|██▏       | 3531/16329 [29:45<1:46:38,  2.00it/s][A
epoch 1 iter 3531: train loss 2.45442. lr 5.828497e-04:  22%|██▏       | 3532/16329 [29:45<1:46:26,  2.00it/s][A
epoch 1 iter 3532: train loss 2.47186. lr 5.828401e-04:  22%|██▏       | 3532/16329 [29:45<1:46:26,  2.00it/s][A
epoch 1 iter 3532: train loss 2.47186. lr 5.828401e-04:  22%|██▏       | 3533/16329 [29:45<1:46:34,  2.00it/s][A
epoch 1 iter 3533: train loss 2.49472. lr 5.828305e-04:  22%|██▏       | 3533/16329 [29:

epoch 1 iter 3564: train loss 2.43853. lr 5.825309e-04:  22%|██▏       | 3565/16329 [30:01<1:45:09,  2.02it/s][A
epoch 1 iter 3565: train loss 2.45394. lr 5.825212e-04:  22%|██▏       | 3565/16329 [30:02<1:45:09,  2.02it/s][A
epoch 1 iter 3565: train loss 2.45394. lr 5.825212e-04:  22%|██▏       | 3566/16329 [30:02<1:44:56,  2.03it/s][A
epoch 1 iter 3566: train loss 2.47174. lr 5.825115e-04:  22%|██▏       | 3566/16329 [30:02<1:44:56,  2.03it/s][A
epoch 1 iter 3566: train loss 2.47174. lr 5.825115e-04:  22%|██▏       | 3567/16329 [30:02<1:45:00,  2.03it/s][A
epoch 1 iter 3567: train loss 2.46926. lr 5.825018e-04:  22%|██▏       | 3567/16329 [30:03<1:45:00,  2.03it/s][A
epoch 1 iter 3567: train loss 2.46926. lr 5.825018e-04:  22%|██▏       | 3568/16329 [30:03<1:44:52,  2.03it/s][A
epoch 1 iter 3568: train loss 2.42535. lr 5.824920e-04:  22%|██▏       | 3568/16329 [30:03<1:44:52,  2.03it/s][A
epoch 1 iter 3568: train loss 2.42535. lr 5.824920e-04:  22%|██▏       | 3569/16329 [30:

epoch 1 iter 3600: train loss 2.44780. lr 5.821798e-04:  22%|██▏       | 3600/16329 [30:19<1:46:48,  1.99it/s][A
epoch 1 iter 3600: train loss 2.44780. lr 5.821798e-04:  22%|██▏       | 3601/16329 [30:19<1:46:06,  2.00it/s][A
epoch 1 iter 3601: train loss 2.44996. lr 5.821700e-04:  22%|██▏       | 3601/16329 [30:20<1:46:06,  2.00it/s][A
epoch 1 iter 3601: train loss 2.44996. lr 5.821700e-04:  22%|██▏       | 3602/16329 [30:20<1:45:59,  2.00it/s][A
epoch 1 iter 3602: train loss 2.49615. lr 5.821602e-04:  22%|██▏       | 3602/16329 [30:20<1:45:59,  2.00it/s][A
epoch 1 iter 3602: train loss 2.49615. lr 5.821602e-04:  22%|██▏       | 3603/16329 [30:20<1:45:24,  2.01it/s][A
epoch 1 iter 3603: train loss 2.48420. lr 5.821504e-04:  22%|██▏       | 3603/16329 [30:21<1:45:24,  2.01it/s][A
epoch 1 iter 3603: train loss 2.48420. lr 5.821504e-04:  22%|██▏       | 3604/16329 [30:21<1:45:25,  2.01it/s][A
epoch 1 iter 3604: train loss 2.47366. lr 5.821406e-04:  22%|██▏       | 3604/16329 [30:

epoch 1 iter 3635: train loss 2.48491. lr 5.818353e-04:  22%|██▏       | 3636/16329 [30:37<1:44:41,  2.02it/s][A
epoch 1 iter 3636: train loss 2.47374. lr 5.818254e-04:  22%|██▏       | 3636/16329 [30:38<1:44:41,  2.02it/s][A
epoch 1 iter 3636: train loss 2.47374. lr 5.818254e-04:  22%|██▏       | 3637/16329 [30:38<1:44:07,  2.03it/s][A
epoch 1 iter 3637: train loss 2.51830. lr 5.818155e-04:  22%|██▏       | 3637/16329 [30:38<1:44:07,  2.03it/s][A
epoch 1 iter 3637: train loss 2.51830. lr 5.818155e-04:  22%|██▏       | 3638/16329 [30:38<1:44:21,  2.03it/s][A
epoch 1 iter 3638: train loss 2.43050. lr 5.818056e-04:  22%|██▏       | 3638/16329 [30:39<1:44:21,  2.03it/s][A
epoch 1 iter 3638: train loss 2.43050. lr 5.818056e-04:  22%|██▏       | 3639/16329 [30:39<1:44:08,  2.03it/s][A
epoch 1 iter 3639: train loss 2.46541. lr 5.817957e-04:  22%|██▏       | 3639/16329 [30:39<1:44:08,  2.03it/s][A
epoch 1 iter 3639: train loss 2.46541. lr 5.817957e-04:  22%|██▏       | 3640/16329 [30:

epoch 1 iter 3671: train loss 2.39909. lr 5.814775e-04:  22%|██▏       | 3671/16329 [30:55<1:44:18,  2.02it/s][A
epoch 1 iter 3671: train loss 2.39909. lr 5.814775e-04:  22%|██▏       | 3672/16329 [30:55<1:44:05,  2.03it/s][A
epoch 1 iter 3672: train loss 2.44203. lr 5.814676e-04:  22%|██▏       | 3672/16329 [30:56<1:44:05,  2.03it/s][A
epoch 1 iter 3672: train loss 2.44203. lr 5.814676e-04:  22%|██▏       | 3673/16329 [30:56<1:43:47,  2.03it/s][A
epoch 1 iter 3673: train loss 2.48215. lr 5.814576e-04:  22%|██▏       | 3673/16329 [30:56<1:43:47,  2.03it/s][A
epoch 1 iter 3673: train loss 2.48215. lr 5.814576e-04:  22%|██▏       | 3674/16329 [30:56<1:43:56,  2.03it/s][A
epoch 1 iter 3674: train loss 2.41903. lr 5.814476e-04:  22%|██▏       | 3674/16329 [30:57<1:43:56,  2.03it/s][A
epoch 1 iter 3674: train loss 2.41903. lr 5.814476e-04:  23%|██▎       | 3675/16329 [30:57<1:43:39,  2.03it/s][A
epoch 1 iter 3675: train loss 2.42177. lr 5.814376e-04:  23%|██▎       | 3675/16329 [30:

epoch 1 iter 3706: train loss 2.43248. lr 5.811265e-04:  23%|██▎       | 3707/16329 [31:13<1:43:28,  2.03it/s][A
epoch 1 iter 3707: train loss 2.43329. lr 5.811164e-04:  23%|██▎       | 3707/16329 [31:13<1:43:28,  2.03it/s][A
epoch 1 iter 3707: train loss 2.43329. lr 5.811164e-04:  23%|██▎       | 3708/16329 [31:13<1:43:52,  2.02it/s][A
epoch 1 iter 3708: train loss 2.47264. lr 5.811064e-04:  23%|██▎       | 3708/16329 [31:14<1:43:52,  2.02it/s][A
epoch 1 iter 3708: train loss 2.47264. lr 5.811064e-04:  23%|██▎       | 3709/16329 [31:14<1:44:00,  2.02it/s][A
epoch 1 iter 3709: train loss 2.40749. lr 5.810963e-04:  23%|██▎       | 3709/16329 [31:14<1:44:00,  2.02it/s][A
epoch 1 iter 3709: train loss 2.40749. lr 5.810963e-04:  23%|██▎       | 3710/16329 [31:14<1:44:05,  2.02it/s][A
epoch 1 iter 3710: train loss 2.44197. lr 5.810862e-04:  23%|██▎       | 3710/16329 [31:15<1:44:05,  2.02it/s][A
epoch 1 iter 3710: train loss 2.44197. lr 5.810862e-04:  23%|██▎       | 3711/16329 [31:

epoch 1 iter 3742: train loss 2.41378. lr 5.807621e-04:  23%|██▎       | 3743/16329 [31:31<1:43:35,  2.03it/s]
epoch 1 iter 3743: train loss 2.36722. lr 5.807520e-04:  23%|██▎       | 3744/16329 [31:31<1:43:14,  2.03it/s]
epoch 1 iter 3744: train loss 2.35915. lr 5.807418e-04:  23%|██▎       | 3745/16329 [31:32<1:43:19,  2.03it/s]
epoch 1 iter 3745: train loss 2.35299. lr 5.807316e-04:  23%|██▎       | 3746/16329 [31:32<1:43:17,  2.03it/s]
epoch 1 iter 3746: train loss 2.45575. lr 5.807214e-04:  23%|██▎       | 3746/16329 [31:

epoch 1 iter 3777: train loss 2.36445. lr 5.804046e-04:  23%|██▎       | 3778/16329 [31:49<1:55:06,  1.82it/s]
epoch 1 iter 3778: train loss 2.35168. lr 5.803944e-04:  23%|██▎       | 3779/16329 [31:49<1:51:54,  1.87it/s]
epoch 1 iter 3779: train loss 2.40653. lr 5.803841e-04:  23%|██▎       | 3780/16329 [31:50<1:48:45,  1.92it/s]
epoch 1 iter 3780: train loss 2.42021. lr 5.803739e-04:  23%|██▎       | 3781/16329 [31:50<1:47:15,  1.95it/s]
epoch 1 iter 3781: train loss 2.46252. lr 5.803636e-04:  23%|██▎       | 3781/16329 [31:51<1:47:15,  1.95it/s]

epoch 1 iter 3813: train loss 2.39962. lr 5.800336e-04:  23%|██▎       | 3814/16329 [32:07<1:44:27,  2.00it/s]
epoch 1 iter 3814: train loss 2.39340. lr 5.800233e-04:  23%|██▎       | 3815/16329 [32:07<1:44:36,  1.99it/s]
epoch 1 iter 3815: train loss 2.39030. lr 5.800129e-04:  23%|██▎       | 3816/16329 [32:08<1:44:08,  2.00it/s]
epoch 1 iter 3816: train loss 2.36349. lr 5.800026e-04:  23%|██▎       | 3817/16329 [32:08<1:43:49,  2.01it/s]
epoch 1 iter 3817: train loss 2.35258. lr 5.799922e-04:  23%|██▎       | 3817/16329 [32:

epoch 1 iter 3848: train loss 2.36961. lr 5.796697e-04:  24%|██▎       | 3849/16329 [32:24<1:44:57,  1.98it/s]
epoch 1 iter 3849: train loss 2.33158. lr 5.796592e-04:  24%|██▎       | 3850/16329 [32:25<1:44:28,  1.99it/s]
epoch 1 iter 3850: train loss 2.37729. lr 5.796488e-04:  24%|██▎       | 3851/16329 [32:25<1:43:39,  2.01it/s]
epoch 1 iter 3851: train loss 2.35421. lr 5.796384e-04:  24%|██▎       | 3852/16329 [32:26<1:43:26,  2.01it/s]
epoch 1 iter 3852: train loss 2.40090. lr 5.796279e-04:  24%|██▎       | 3852/16329 [32:26<1:43:26,  2.01it/s]

epoch 1 iter 3884: train loss 2.37477. lr 5.792921e-04:  24%|██▍       | 3885/16329 [32:43<1:43:49,  2.00it/s]
epoch 1 iter 3885: train loss 2.32853. lr 5.792815e-04:  24%|██▍       | 3886/16329 [32:43<1:43:28,  2.00it/s]
epoch 1 iter 3886: train loss 2.33438. lr 5.792710e-04:  24%|██▍       | 3887/16329 [32:44<1:42:52,  2.02it/s]
epoch 1 iter 3887: train loss 2.35352. lr 5.792604e-04:  24%|██▍       | 3888/16329 [32:44<1:42:58,  2.01it/s]
epoch 1 iter 3888: train loss 2.37749. lr 5.792499e-04:  24%|██▍       | 3888/16329 [32:

epoch 1 iter 3919: train loss 2.39924. lr 5.789217e-04:  24%|██▍       | 3920/16329 [33:00<1:42:27,  2.02it/s]
epoch 1 iter 3920: train loss 2.32046. lr 5.789111e-04:  24%|██▍       | 3921/16329 [33:01<1:42:30,  2.02it/s]
epoch 1 iter 3921: train loss 2.35347. lr 5.789004e-04:  24%|██▍       | 3922/16329 [33:01<1:42:43,  2.01it/s]
epoch 1 iter 3922: train loss 2.29078. lr 5.788898e-04:  24%|██▍       | 3923/16329 [33:02<1:42:26,  2.02it/s]
epoch 1 iter 3923: train loss 2.30130. lr 5.788792e-04:  24%|██▍       | 3923/16329 [33:02<1:42:26,  2.02it/s]

epoch 1 iter 3955: train loss 2.39412. lr 5.785375e-04:  24%|██▍       | 3956/16329 [33:18<1:44:23,  1.98it/s]
epoch 1 iter 3956: train loss 2.29222. lr 5.785267e-04:  24%|██▍       | 3957/16329 [33:19<1:43:27,  1.99it/s]
epoch 1 iter 3957: train loss 2.37532. lr 5.785160e-04:  24%|██▍       | 3958/16329 [33:19<1:42:58,  2.00it/s]
epoch 1 iter 3958: train loss 2.31608. lr 5.785053e-04:  24%|██▍       | 3959/16329 [33:20<1:42:30,  2.01it/s]
epoch 1 iter 3959: train loss 2.32755. lr 5.784946e-04:  24%|██▍       | 3959/16329 [33:

epoch 1 iter 3990: train loss 2.31916. lr 5.781607e-04:  24%|██▍       | 3991/16329 [33:36<1:41:44,  2.02it/s]
epoch 1 iter 3991: train loss 2.33776. lr 5.781499e-04:  24%|██▍       | 3992/16329 [33:37<1:41:41,  2.02it/s]
epoch 1 iter 3992: train loss 2.28053. lr 5.781391e-04:  24%|██▍       | 3993/16329 [33:37<1:41:16,  2.03it/s]
epoch 1 iter 3993: train loss 2.31388. lr 5.781282e-04:  24%|██▍       | 3994/16329 [33:38<1:41:19,  2.03it/s]
epoch 1 iter 3994: train loss 2.36086. lr 5.781174e-04:  24%|██▍       | 3994/16329 [33:38<1:41:19,  2.03it/s]

epoch 1 iter 4026: train loss 2.34612. lr 5.777699e-04:  25%|██▍       | 4027/16329 [33:54<1:41:02,  2.03it/s]
epoch 1 iter 4027: train loss 2.35494. lr 5.777590e-04:  25%|██▍       | 4028/16329 [33:55<1:41:25,  2.02it/s]
epoch 1 iter 4028: train loss 2.32231. lr 5.777480e-04:  25%|██▍       | 4029/16329 [33:55<1:41:18,  2.02it/s]
epoch 1 iter 4029: train loss 2.35621. lr 5.777371e-04:  25%|██▍       | 4030/16329 [33:56<1:40:57,  2.03it/s]
epoch 1 iter 4030: train loss 2.31625. lr 5.777262e-04:  25%|██▍       | 4030/16329 [33:

epoch 1 iter 4061: train loss 2.32083. lr 5.773867e-04:  25%|██▍       | 4062/16329 [34:12<1:46:12,  1.92it/s]
epoch 1 iter 4062: train loss 2.27574. lr 5.773757e-04:  25%|██▍       | 4063/16329 [34:13<1:44:16,  1.96it/s]
epoch 1 iter 4063: train loss 2.23653. lr 5.773647e-04:  25%|██▍       | 4064/16329 [34:13<1:43:31,  1.97it/s]
epoch 1 iter 4064: train loss 2.34708. lr 5.773537e-04:  25%|██▍       | 4065/16329 [34:14<1:42:32,  1.99it/s]
epoch 1 iter 4065: train loss 2.33465. lr 5.773427e-04:  25%|██▍       | 4065/16329 [34:14<1:42:32,  1.99it/s]

epoch 1 iter 4097: train loss 2.27655. lr 5.769893e-04:  25%|██▌       | 4098/16329 [34:30<1:41:17,  2.01it/s]
epoch 1 iter 4098: train loss 2.27513. lr 5.769782e-04:  25%|██▌       | 4099/16329 [34:31<1:54:32,  1.78it/s]
epoch 1 iter 4099: train loss 2.30748. lr 5.769671e-04:  25%|██▌       | 4100/16329 [34:31<1:50:05,  1.85it/s]
epoch 1 iter 4100: train loss 2.29444. lr 5.769560e-04:  25%|██▌       | 4101/16329 [34:32<1:51:14,  1.83it/s]
epoch 1 iter 4101: train loss 2.29802. lr 5.769449e-04:  25%|██▌       | 4101/16329 [34:

epoch 1 iter 4132: train loss 2.24803. lr 5.765998e-04:  25%|██▌       | 4133/16329 [34:48<1:40:43,  2.02it/s]
epoch 1 iter 4133: train loss 2.29865. lr 5.765886e-04:  25%|██▌       | 4134/16329 [34:49<1:50:52,  1.83it/s]
epoch 1 iter 4134: train loss 2.26054. lr 5.765774e-04:  25%|██▌       | 4135/16329 [34:49<1:47:57,  1.88it/s]
epoch 1 iter 4135: train loss 2.26152. lr 5.765662e-04:  25%|██▌       | 4136/16329 [34:50<1:45:31,  1.93it/s]
epoch 1 iter 4136: train loss 2.25814. lr 5.765550e-04:  25%|██▌       | 4136/16329 [34:50<1:45:31,  1.93it/s]

epoch 1 iter 4168: train loss 2.31020. lr 5.761958e-04:  26%|██▌       | 4169/16329 [35:06<1:40:20,  2.02it/s]
epoch 1 iter 4169: train loss 2.30190. lr 5.761846e-04:  26%|██▌       | 4170/16329 [35:07<1:39:54,  2.03it/s]
epoch 1 iter 4170: train loss 2.33994. lr 5.761733e-04:  26%|██▌       | 4171/16329 [35:07<1:39:57,  2.03it/s]
epoch 1 iter 4171: train loss 2.30243. lr 5.761620e-04:  26%|██▌       | 4172/16329 [35:08<1:39:49,  2.03it/s]
epoch 1 iter 4172: train loss 2.26502. lr 5.761507e-04:  26%|██▌       | 4172/16329 [35:

epoch 1 iter 4203: train loss 2.30598. lr 5.757999e-04:  26%|██▌       | 4204/16329 [35:24<1:39:18,  2.03it/s]
epoch 1 iter 4204: train loss 2.25250. lr 5.757886e-04:  26%|██▌       | 4205/16329 [35:24<1:39:35,  2.03it/s]
epoch 1 iter 4205: train loss 2.29853. lr 5.757772e-04:  26%|██▌       | 4206/16329 [35:25<1:39:35,  2.03it/s]
epoch 1 iter 4206: train loss 2.29034. lr 5.757658e-04:  26%|██▌       | 4207/16329 [35:25<1:39:53,  2.02it/s]
epoch 1 iter 4207: train loss 2.25280. lr 5.757545e-04:  26%|██▌       | 4207/16329 [35:26<1:39:53,  2.02it/s]

epoch 1 iter 4239: train loss 2.25071. lr 5.753895e-04:  26%|██▌       | 4240/16329 [35:42<1:39:56,  2.02it/s]
epoch 1 iter 4240: train loss 2.18138. lr 5.753780e-04:  26%|██▌       | 4241/16329 [35:42<1:39:51,  2.02it/s]
epoch 1 iter 4241: train loss 2.25772. lr 5.753666e-04:  26%|██▌       | 4242/16329 [35:43<1:39:57,  2.02it/s]
epoch 1 iter 4242: train loss 2.28086. lr 5.753551e-04:  26%|██▌       | 4243/16329 [35:43<1:39:38,  2.02it/s]
epoch 1 iter 4243: train loss 2.32592. lr 5.753437e-04:  26%|██▌       | 4243/16329 [35:

epoch 1 iter 4274: train loss 2.24444. lr 5.749872e-04:  26%|██▌       | 4275/16329 [35:59<1:39:46,  2.01it/s]
epoch 1 iter 4275: train loss 2.26078. lr 5.749757e-04:  26%|██▌       | 4276/16329 [36:00<1:40:10,  2.01it/s]
epoch 1 iter 4276: train loss 2.23700. lr 5.749642e-04:  26%|██▌       | 4277/16329 [36:00<1:40:09,  2.01it/s]
epoch 1 iter 4277: train loss 2.23465. lr 5.749526e-04:  26%|██▌       | 4278/16329 [36:01<1:39:55,  2.01it/s]
epoch 1 iter 4278: train loss 2.26007. lr 5.749411e-04:  26%|██▌       | 4278/16329 [36:01<1:39:55,  2.01it/s]

epoch 1 iter 4310: train loss 2.26089. lr 5.745703e-04:  26%|██▋       | 4311/16329 [36:17<1:38:54,  2.03it/s]
epoch 1 iter 4311: train loss 2.26070. lr 5.745586e-04:  26%|██▋       | 4312/16329 [36:18<1:39:02,  2.02it/s]
epoch 1 iter 4312: train loss 2.18898. lr 5.745470e-04:  26%|██▋       | 4313/16329 [36:19<1:54:27,  1.75it/s]
epoch 1 iter 4313: train loss 2.23530. lr 5.745354e-04:  26%|██▋       | 4314/16329 [36:19<1:49:44,  1.82it/s]
epoch 1 iter 4314: train loss 2.22198. lr 5.745237e-04:  26%|██▋       | 4314/16329 [36:

epoch 1 iter 4345: train loss 2.21044. lr 5.741617e-04:  27%|██▋       | 4346/16329 [36:35<1:41:14,  1.97it/s]
epoch 1 iter 4346: train loss 2.22092. lr 5.741500e-04:  27%|██▋       | 4347/16329 [36:36<1:40:43,  1.98it/s]
epoch 1 iter 4347: train loss 2.17864. lr 5.741383e-04:  27%|██▋       | 4348/16329 [36:36<1:39:52,  2.00it/s]
epoch 1 iter 4348: train loss 2.16641. lr 5.741265e-04:  27%|██▋       | 4349/16329 [36:37<1:39:33,  2.01it/s]
epoch 1 iter 4349: train loss 2.22424. lr 5.741148e-04:  27%|██▋       | 4349/16329 [36:37<1:39:33,  2.01it/s]

epoch 1 iter 4381: train loss 2.17703. lr 5.737382e-04:  27%|██▋       | 4382/16329 [36:53<1:37:52,  2.03it/s]
epoch 1 iter 4382: train loss 2.23648. lr 5.737264e-04:  27%|██▋       | 4383/16329 [36:54<1:37:37,  2.04it/s]
epoch 1 iter 4383: train loss 2.14184. lr 5.737146e-04:  27%|██▋       | 4384/16329 [36:54<1:37:50,  2.03it/s]
epoch 1 iter 4384: train loss 2.19759. lr 5.737028e-04:  27%|██▋       | 4385/16329 [36:55<1:37:53,  2.03it/s]
epoch 1 iter 4385: train loss 2.15483. lr 5.736910e-04:  27%|██▋       | 4385/16329 [36:

epoch 1 iter 4416: train loss 2.21961. lr 5.733234e-04:  27%|██▋       | 4417/16329 [37:11<1:40:17,  1.98it/s]
epoch 1 iter 4417: train loss 2.11921. lr 5.733115e-04:  27%|██▋       | 4418/16329 [37:11<1:39:44,  1.99it/s]
epoch 1 iter 4418: train loss 2.23657. lr 5.732996e-04:  27%|██▋       | 4419/16329 [37:12<1:38:54,  2.01it/s]
epoch 1 iter 4419: train loss 2.23307. lr 5.732877e-04:  27%|██▋       | 4420/16329 [37:12<1:38:48,  2.01it/s]
epoch 1 iter 4420: train loss 2.17281. lr 5.732758e-04:  27%|██▋       | 4420/16329 [37:13<1:38:48,  2.01it/s]

epoch 1 iter 4452: train loss 2.13915. lr 5.728934e-04:  27%|██▋       | 4453/16329 [37:29<1:37:44,  2.03it/s]
epoch 1 iter 4453: train loss 2.20980. lr 5.728815e-04:  27%|██▋       | 4454/16329 [37:30<1:42:11,  1.94it/s]
epoch 1 iter 4454: train loss 2.23242. lr 5.728695e-04:  27%|██▋       | 4455/16329 [37:30<1:44:52,  1.89it/s]
epoch 1 iter 4455: train loss 2.18243. lr 5.728575e-04:  27%|██▋       | 4456/16329 [37:31<1:45:49,  1.87it/s]
epoch 1 iter 4456: train loss 2.19391. lr 5.728455e-04:  27%|██▋       | 4456/16329 [37:

epoch 1 iter 4487: train loss 2.23564. lr 5.724723e-04:  27%|██▋       | 4488/16329 [37:47<1:42:08,  1.93it/s]
epoch 1 iter 4488: train loss 2.20690. lr 5.724602e-04:  27%|██▋       | 4489/16329 [37:47<1:42:08,  1.93it/s]
epoch 1 iter 4489: train loss 2.16359. lr 5.724482e-04:  27%|██▋       | 4490/16329 [37:48<1:41:27,  1.94it/s]
epoch 1 iter 4490: train loss 2.19672. lr 5.724361e-04:  28%|██▊       | 4491/16329 [37:48<1:40:57,  1.95it/s]
epoch 1 iter 4491: train loss 2.18398. lr 5.724240e-04:  28%|██▊       | 4491/16329 [37:49<1:40:57,  1.95it/s]

epoch 1 iter 4523: train loss 2.15726. lr 5.720359e-04:  28%|██▊       | 4524/16329 [38:05<1:42:59,  1.91it/s]
epoch 1 iter 4524: train loss 2.23271. lr 5.720238e-04:  28%|██▊       | 4525/16329 [38:05<1:42:37,  1.92it/s]
epoch 1 iter 4525: train loss 2.07830. lr 5.720116e-04:  28%|██▊       | 4526/16329 [38:06<1:42:05,  1.93it/s]
epoch 1 iter 4526: train loss 2.18409. lr 5.719994e-04:  28%|██▊       | 4527/16329 [38:06<1:41:07,  1.95it/s]
epoch 1 iter 4527: train loss 2.18624. lr 5.719872e-04:  28%|██▊       | 4527/16329 [38:

epoch 1 iter 4558: train loss 2.13443. lr 5.716085e-04:  28%|██▊       | 4559/16329 [38:22<1:36:38,  2.03it/s]
epoch 1 iter 4559: train loss 2.20727. lr 5.715963e-04:  28%|██▊       | 4560/16329 [38:23<1:36:50,  2.03it/s]
epoch 1 iter 4560: train loss 2.15807. lr 5.715840e-04:  28%|██▊       | 4561/16329 [38:23<1:36:46,  2.03it/s]
epoch 1 iter 4561: train loss 2.12689. lr 5.715718e-04:  28%|██▊       | 4562/16329 [38:24<1:36:26,  2.03it/s]
epoch 1 iter 4562: train loss 2.16392. lr 5.715595e-04:  28%|██▊       | 4562/16329 [38:24<1:36:26,  2.03it/s]

epoch 1 iter 4594: train loss 2.09862. lr 5.711657e-04:  28%|██▊       | 4595/16329 [38:41<1:36:33,  2.03it/s]
epoch 1 iter 4595: train loss 2.09442. lr 5.711534e-04:  28%|██▊       | 4596/16329 [38:41<1:36:00,  2.04it/s]
epoch 1 iter 4596: train loss 2.15944. lr 5.711410e-04:  28%|██▊       | 4597/16329 [38:42<1:36:24,  2.03it/s]
epoch 1 iter 4597: train loss 2.15398. lr 5.711287e-04:  28%|██▊       | 4598/16329 [38:42<1:36:20,  2.03it/s]
epoch 1 iter 4598: train loss 2.16600. lr 5.711163e-04:  28%|██▊       | 4598/16329 [38:

epoch 1 iter 4629: train loss 2.09710. lr 5.707321e-04:  28%|██▊       | 4630/16329 [38:58<1:36:23,  2.02it/s]
epoch 1 iter 4630: train loss 2.13486. lr 5.707196e-04:  28%|██▊       | 4631/16329 [38:59<1:36:12,  2.03it/s]
epoch 1 iter 4631: train loss 2.17186. lr 5.707072e-04:  28%|██▊       | 4632/16329 [38:59<1:36:17,  2.02it/s]
epoch 1 iter 4632: train loss 2.14883. lr 5.706948e-04:  28%|██▊       | 4633/16329 [39:00<1:36:11,  2.03it/s]
epoch 1 iter 4633: train loss 2.16391. lr 5.706823e-04:  28%|██▊       | 4633/16329 [39:00<1:36:11,  2.03it/s]

epoch 1 iter 4665: train loss 2.14970. lr 5.702828e-04:  29%|██▊       | 4666/16329 [39:16<1:37:27,  1.99it/s]
epoch 1 iter 4666: train loss 2.09571. lr 5.702703e-04:  29%|██▊       | 4667/16329 [39:17<1:48:56,  1.78it/s]
epoch 1 iter 4667: train loss 2.15488. lr 5.702578e-04:  29%|██▊       | 4668/16329 [39:18<1:45:28,  1.84it/s]
epoch 1 iter 4668: train loss 2.12651. lr 5.702453e-04:  29%|██▊       | 4669/16329 [39:18<1:42:24,  1.90it/s]
epoch 1 iter 4669: train loss 2.12889. lr 5.702327e-04:  29%|██▊       | 4669/16329 [39:

epoch 1 iter 4700: train loss 2.11694. lr 5.698430e-04:  29%|██▉       | 4701/16329 [39:34<1:37:09,  1.99it/s]
epoch 1 iter 4701: train loss 2.17429. lr 5.698304e-04:  29%|██▉       | 4702/16329 [39:35<1:36:42,  2.00it/s]
epoch 1 iter 4702: train loss 2.12322. lr 5.698178e-04:  29%|██▉       | 4703/16329 [39:35<1:36:45,  2.00it/s]
epoch 1 iter 4703: train loss 2.13098. lr 5.698051e-04:  29%|██▉       | 4704/16329 [39:36<1:36:20,  2.01it/s]
epoch 1 iter 4704: train loss 2.07302. lr 5.697925e-04:  29%|██▉       | 4704/16329 [39:36<1:36:20,  2.01it/s]

epoch 1 iter 4736: train loss 2.15135. lr 5.693874e-04:  29%|██▉       | 4737/16329 [39:53<1:39:19,  1.95it/s]
epoch 1 iter 4737: train loss 2.13476. lr 5.693747e-04:  29%|██▉       | 4738/16329 [39:53<1:38:14,  1.97it/s]
epoch 1 iter 4738: train loss 2.12041. lr 5.693620e-04:  29%|██▉       | 4739/16329 [39:54<1:37:12,  1.99it/s]
epoch 1 iter 4739: train loss 2.08981. lr 5.693493e-04:  29%|██▉       | 4740/16329 [39:54<1:37:01,  1.99it/s]
epoch 1 iter 4740: train loss 2.08771. lr 5.693366e-04:  29%|██▉       | 4740/16329 [39:

epoch 1 iter 4771: train loss 2.13471. lr 5.689413e-04:  29%|██▉       | 4772/16329 [40:10<1:38:51,  1.95it/s]
epoch 1 iter 4772: train loss 2.13895. lr 5.689285e-04:  29%|██▉       | 4773/16329 [40:11<1:37:42,  1.97it/s]
epoch 1 iter 4773: train loss 2.09984. lr 5.689157e-04:  29%|██▉       | 4774/16329 [40:11<1:37:06,  1.98it/s]
epoch 1 iter 4774: train loss 2.13899. lr 5.689029e-04:  29%|██▉       | 4775/16329 [40:12<1:36:37,  1.99it/s]
epoch 1 iter 4775: train loss 2.15741. lr 5.688901e-04:  29%|██▉       | 4775/16329 [40:12<1:36:37,  1.99it/s]

epoch 1 iter 4807: train loss 2.08578. lr 5.684793e-04:  29%|██▉       | 4808/16329 [40:29<1:36:24,  1.99it/s]
epoch 1 iter 4808: train loss 2.04879. lr 5.684665e-04:  29%|██▉       | 4809/16329 [40:29<1:36:07,  2.00it/s]
epoch 1 iter 4809: train loss 2.04552. lr 5.684536e-04:  29%|██▉       | 4810/16329 [40:30<1:35:59,  2.00it/s]
epoch 1 iter 4810: train loss 2.04888. lr 5.684407e-04:  29%|██▉       | 4811/16329 [40:30<1:36:10,  2.00it/s]
epoch 1 iter 4811: train loss 2.02683. lr 5.684278e-04:  29%|██▉       | 4811/16329 [40:

epoch 1 iter 4842: train loss 2.11359. lr 5.680271e-04:  30%|██▉       | 4843/16329 [40:46<1:36:15,  1.99it/s]
epoch 1 iter 4843: train loss 2.07773. lr 5.680141e-04:  30%|██▉       | 4844/16329 [40:47<1:36:38,  1.98it/s]
epoch 1 iter 4844: train loss 2.07924. lr 5.680012e-04:  30%|██▉       | 4845/16329 [40:47<1:36:14,  1.99it/s]
epoch 1 iter 4845: train loss 2.08910. lr 5.679882e-04:  30%|██▉       | 4846/16329 [40:48<1:36:02,  1.99it/s]
epoch 1 iter 4846: train loss 2.11288. lr 5.679752e-04:  30%|██▉       | 4846/16329 [40:48<1:36:02,  1.99it/s]

epoch 1 iter 4878: train loss 2.11117. lr 5.675588e-04:  30%|██▉       | 4879/16329 [41:05<1:38:59,  1.93it/s]
epoch 1 iter 4879: train loss 2.09163. lr 5.675457e-04:  30%|██▉       | 4880/16329 [41:05<1:38:15,  1.94it/s]
epoch 1 iter 4880: train loss 2.13740. lr 5.675327e-04:  30%|██▉       | 4881/16329 [41:06<1:37:44,  1.95it/s]
epoch 1 iter 4881: train loss 2.01467. lr 5.675196e-04:  30%|██▉       | 4882/16329 [41:06<1:37:14,  1.96it/s]
epoch 1 iter 4882: train loss 2.10486. lr 5.675065e-04:  30%|██▉       | 4882/16329 [41:

epoch 1 iter 4913: train loss 2.09996. lr 5.671004e-04:  30%|███       | 4914/16329 [41:23<1:34:57,  2.00it/s]
epoch 1 iter 4914: train loss 2.12619. lr 5.670872e-04:  30%|███       | 4914/16329 [41:23<1:34:57,  2.00it/s]
epoch 1 iter 4914: train loss 2.12619. lr 5.670872e-04:  30%|███       | 4915/16329 [41:23<1:34:35,  2.01it/s]
epoch 1 iter 4915: train loss 2.08822. lr 5.670741e-04:  30%|███       | 4915/16329 [41:24<1:34:35,  2.01it/s]
epoch 1 iter 4915: train loss 2.08822. lr 5.670741e-04:  30%|███       | 4916/16329 [41:24<1:34:27,  2.01it/s]
epoch 1 iter 4916: train loss 2.08703. lr 5.670609e-04:  30%|███       | 4916/16329 [41:24<1:34:27,  2.01it/s]
epoch 1 iter 4916: train loss 2.08703. lr 5.670609e-04:  30%|███       | 4917/16329 [41:24<1:34:08,  2.02it/s]
epoch 1 iter 4917: train loss 2.08686. lr 5.670478e-04:  30%|███       | 4917/16329 [41:25<1:34:08,  2.02it/s]
epoch 1 iter 4917: train loss 2.08686. lr 5.670478e-04:  30%|███       | 4918/16329 [41:

epoch 1 iter 4949: train loss 1.99805. lr 5.666257e-04:  30%|███       | 4949/16329 [41:41<1:40:58,  1.88it/s]
epoch 1 iter 4949: train loss 1.99805. lr 5.666257e-04:  30%|███       | 4950/16329 [41:41<1:39:03,  1.91it/s]
epoch 1 iter 4950: train loss 2.08692. lr 5.666125e-04:  30%|███       | 4950/16329 [41:41<1:39:03,  1.91it/s]
epoch 1 iter 4950: train loss 2.08692. lr 5.666125e-04:  30%|███       | 4951/16329 [41:41<1:37:36,  1.94it/s]
epoch 1 iter 4951: train loss 1.96093. lr 5.665993e-04:  30%|███       | 4951/16329 [41:42<1:37:36,  1.94it/s]
epoch 1 iter 4951: train loss 1.96093. lr 5.665993e-04:  30%|███       | 4952/16329 [41:42<1:36:42,  1.96it/s]
epoch 1 iter 4952: train loss 2.02624. lr 5.665860e-04:  30%|███       | 4952/16329 [41:42<1:36:42,  1.96it/s]
epoch 1 iter 4952: train loss 2.02624. lr 5.665860e-04:  30%|███       | 4953/16329 [41:42<1:35:49,  1.98it/s]
epoch 1 iter 4953: train loss 2.06743. lr 5.665728e-04:  30%|███       | 4953/16329 [41:

epoch 1 iter 4984: train loss 2.02907. lr 5.661612e-04:  31%|███       | 4985/16329 [41:58<1:38:16,  1.92it/s]
epoch 1 iter 4985: train loss 2.01235. lr 5.661479e-04:  31%|███       | 4985/16329 [41:59<1:38:16,  1.92it/s]
epoch 1 iter 4985: train loss 2.01235. lr 5.661479e-04:  31%|███       | 4986/16329 [41:59<1:38:03,  1.93it/s]
epoch 1 iter 4986: train loss 2.07436. lr 5.661346e-04:  31%|███       | 4986/16329 [41:59<1:38:03,  1.93it/s]
epoch 1 iter 4986: train loss 2.07436. lr 5.661346e-04:  31%|███       | 4987/16329 [41:59<1:37:46,  1.93it/s]
epoch 1 iter 4987: train loss 2.11921. lr 5.661212e-04:  31%|███       | 4987/16329 [42:00<1:37:46,  1.93it/s]
epoch 1 iter 4987: train loss 2.11921. lr 5.661212e-04:  31%|███       | 4988/16329 [42:00<1:47:25,  1.76it/s]
epoch 1 iter 4988: train loss 2.08587. lr 5.661079e-04:  31%|███       | 4988/16329 [42:01<1:47:25,  1.76it/s]
epoch 1 iter 4988: train loss 2.08587. lr 5.661079e-04:  31%|███       | 4989/16329 [42:

epoch 1 iter 5020: train loss 2.05987. lr 5.656802e-04:  31%|███       | 5020/16329 [42:17<1:40:17,  1.88it/s]
epoch 1 iter 5020: train loss 2.05987. lr 5.656802e-04:  31%|███       | 5021/16329 [42:17<1:40:14,  1.88it/s]
epoch 1 iter 5021: train loss 2.07225. lr 5.656668e-04:  31%|███       | 5021/16329 [42:17<1:40:14,  1.88it/s]
epoch 1 iter 5021: train loss 2.07225. lr 5.656668e-04:  31%|███       | 5022/16329 [42:17<1:39:24,  1.90it/s]
epoch 1 iter 5022: train loss 2.04017. lr 5.656534e-04:  31%|███       | 5022/16329 [42:18<1:39:24,  1.90it/s]
epoch 1 iter 5022: train loss 2.04017. lr 5.656534e-04:  31%|███       | 5023/16329 [42:18<1:48:13,  1.74it/s]
epoch 1 iter 5023: train loss 2.04244. lr 5.656400e-04:  31%|███       | 5023/16329 [42:18<1:48:13,  1.74it/s]
epoch 1 iter 5023: train loss 2.04244. lr 5.656400e-04:  31%|███       | 5024/16329 [42:18<1:44:11,  1.81it/s]
epoch 1 iter 5024: train loss 2.06573. lr 5.656266e-04:  31%|███       | 5024/16329 [42:

epoch 1 iter 5055: train loss 2.00720. lr 5.652096e-04:  31%|███       | 5056/16329 [42:35<1:34:02,  2.00it/s]
epoch 1 iter 5056: train loss 2.00713. lr 5.651961e-04:  31%|███       | 5056/16329 [42:35<1:34:02,  2.00it/s]
epoch 1 iter 5056: train loss 2.00713. lr 5.651961e-04:  31%|███       | 5057/16329 [42:35<1:33:38,  2.01it/s]
epoch 1 iter 5057: train loss 2.08981. lr 5.651826e-04:  31%|███       | 5057/16329 [42:36<1:33:38,  2.01it/s]
epoch 1 iter 5057: train loss 2.08981. lr 5.651826e-04:  31%|███       | 5058/16329 [42:36<1:33:21,  2.01it/s]
epoch 1 iter 5058: train loss 2.04997. lr 5.651691e-04:  31%|███       | 5058/16329 [42:36<1:33:21,  2.01it/s]
epoch 1 iter 5058: train loss 2.04997. lr 5.651691e-04:  31%|███       | 5059/16329 [42:36<1:33:17,  2.01it/s]
epoch 1 iter 5059: train loss 2.06915. lr 5.651556e-04:  31%|███       | 5059/16329 [42:37<1:33:17,  2.01it/s]
epoch 1 iter 5059: train loss 2.06915. lr 5.651556e-04:  31%|███       | 5060/16329 [42:

epoch 1 iter 5091: train loss 1.98172. lr 5.647224e-04:  31%|███       | 5091/16329 [42:53<1:36:06,  1.95it/s]
epoch 1 iter 5091: train loss 1.98172. lr 5.647224e-04:  31%|███       | 5092/16329 [42:53<1:35:48,  1.95it/s]
epoch 1 iter 5092: train loss 2.04243. lr 5.647088e-04:  31%|███       | 5092/16329 [42:53<1:35:48,  1.95it/s]
epoch 1 iter 5092: train loss 2.04243. lr 5.647088e-04:  31%|███       | 5093/16329 [42:53<1:35:45,  1.96it/s]
epoch 1 iter 5093: train loss 2.01027. lr 5.646952e-04:  31%|███       | 5093/16329 [42:54<1:35:45,  1.96it/s]
epoch 1 iter 5093: train loss 2.01027. lr 5.646952e-04:  31%|███       | 5094/16329 [42:54<1:35:19,  1.96it/s]
epoch 1 iter 5094: train loss 2.02527. lr 5.646816e-04:  31%|███       | 5094/16329 [42:54<1:35:19,  1.96it/s]
epoch 1 iter 5094: train loss 2.02527. lr 5.646816e-04:  31%|███       | 5095/16329 [42:54<1:34:58,  1.97it/s]
epoch 1 iter 5095: train loss 1.97048. lr 5.646680e-04:  31%|███       | 5095/16329 [42:

epoch 1 iter 5126: train loss 2.00930. lr 5.642456e-04:  31%|███▏      | 5127/16329 [43:10<1:32:34,  2.02it/s]
epoch 1 iter 5127: train loss 2.02224. lr 5.642320e-04:  31%|███▏      | 5127/16329 [43:11<1:32:34,  2.02it/s]
epoch 1 iter 5127: train loss 2.02224. lr 5.642320e-04:  31%|███▏      | 5128/16329 [43:11<1:32:52,  2.01it/s]
epoch 1 iter 5128: train loss 2.02796. lr 5.642183e-04:  31%|███▏      | 5128/16329 [43:11<1:32:52,  2.01it/s]
epoch 1 iter 5128: train loss 2.02796. lr 5.642183e-04:  31%|███▏      | 5129/16329 [43:11<1:32:40,  2.01it/s]
epoch 1 iter 5129: train loss 2.04059. lr 5.642046e-04:  31%|███▏      | 5129/16329 [43:12<1:32:40,  2.01it/s]
epoch 1 iter 5129: train loss 2.04059. lr 5.642046e-04:  31%|███▏      | 5130/16329 [43:12<1:32:31,  2.02it/s]
epoch 1 iter 5130: train loss 1.99156. lr 5.641909e-04:  31%|███▏      | 5130/16329 [43:12<1:32:31,  2.02it/s]
epoch 1 iter 5130: train loss 1.99156. lr 5.641909e-04:  31%|███▏      | 5131/16329 [43:

epoch 1 iter 5162: train loss 2.05451. lr 5.637521e-04:  32%|███▏      | 5162/16329 [43:29<1:32:35,  2.01it/s]
epoch 1 iter 5162: train loss 2.05451. lr 5.637521e-04:  32%|███▏      | 5163/16329 [43:29<1:32:25,  2.01it/s]
epoch 1 iter 5163: train loss 2.04442. lr 5.637384e-04:  32%|███▏      | 5163/16329 [43:29<1:32:25,  2.01it/s]
epoch 1 iter 5163: train loss 2.04442. lr 5.637384e-04:  32%|███▏      | 5164/16329 [43:29<1:32:13,  2.02it/s]
epoch 1 iter 5164: train loss 2.01991. lr 5.637246e-04:  32%|███▏      | 5164/16329 [43:30<1:32:13,  2.02it/s]
epoch 1 iter 5164: train loss 2.01991. lr 5.637246e-04:  32%|███▏      | 5165/16329 [43:30<1:32:35,  2.01it/s]
epoch 1 iter 5165: train loss 1.97055. lr 5.637109e-04:  32%|███▏      | 5165/16329 [43:30<1:32:35,  2.01it/s]
epoch 1 iter 5165: train loss 1.97055. lr 5.637109e-04:  32%|███▏      | 5166/16329 [43:30<1:33:13,  2.00it/s]
epoch 1 iter 5166: train loss 2.03277. lr 5.636971e-04:  32%|███▏      | 5166/16329 [43:

epoch 1 iter 5197: train loss 2.03511. lr 5.632693e-04:  32%|███▏      | 5198/16329 [43:46<1:31:59,  2.02it/s]
epoch 1 iter 5198: train loss 2.03672. lr 5.632555e-04:  32%|███▏      | 5198/16329 [43:47<1:31:59,  2.02it/s]
epoch 1 iter 5198: train loss 2.03672. lr 5.632555e-04:  32%|███▏      | 5199/16329 [43:47<1:31:46,  2.02it/s]
epoch 1 iter 5199: train loss 1.99088. lr 5.632416e-04:  32%|███▏      | 5199/16329 [43:47<1:31:46,  2.02it/s]
epoch 1 iter 5199: train loss 1.99088. lr 5.632416e-04:  32%|███▏      | 5200/16329 [43:47<1:32:01,  2.02it/s]
epoch 1 iter 5200: train loss 1.95448. lr 5.632278e-04:  32%|███▏      | 5200/16329 [43:48<1:32:01,  2.02it/s]
epoch 1 iter 5200: train loss 1.95448. lr 5.632278e-04:  32%|███▏      | 5201/16329 [43:48<1:31:51,  2.02it/s]
epoch 1 iter 5201: train loss 1.98645. lr 5.632139e-04:  32%|███▏      | 5201/16329 [43:48<1:31:51,  2.02it/s]
epoch 1 iter 5201: train loss 1.98645. lr 5.632139e-04:  32%|███▏      | 5202/16329 [43:

epoch 1 iter 5233: train loss 2.01969. lr 5.627696e-04:  32%|███▏      | 5233/16329 [44:04<1:31:57,  2.01it/s]
epoch 1 iter 5233: train loss 2.01969. lr 5.627696e-04:  32%|███▏      | 5234/16329 [44:04<1:31:43,  2.02it/s]
epoch 1 iter 5234: train loss 2.05116. lr 5.627557e-04:  32%|███▏      | 5234/16329 [44:05<1:31:43,  2.02it/s]
epoch 1 iter 5234: train loss 2.05116. lr 5.627557e-04:  32%|███▏      | 5235/16329 [44:05<1:32:01,  2.01it/s]
epoch 1 iter 5235: train loss 1.96100. lr 5.627417e-04:  32%|███▏      | 5235/16329 [44:05<1:32:01,  2.01it/s]
epoch 1 iter 5235: train loss 1.96100. lr 5.627417e-04:  32%|███▏      | 5236/16329 [44:05<1:32:09,  2.01it/s]
epoch 1 iter 5236: train loss 2.07995. lr 5.627278e-04:  32%|███▏      | 5236/16329 [44:06<1:32:09,  2.01it/s]
epoch 1 iter 5236: train loss 2.07995. lr 5.627278e-04:  32%|███▏      | 5237/16329 [44:06<1:32:12,  2.00it/s]
epoch 1 iter 5237: train loss 2.00779. lr 5.627139e-04:  32%|███▏      | 5237/16329 [44:

epoch 1 iter 5268: train loss 1.96821. lr 5.622807e-04:  32%|███▏      | 5269/16329 [44:22<1:31:50,  2.01it/s]
epoch 1 iter 5269: train loss 1.98838. lr 5.622667e-04:  32%|███▏      | 5269/16329 [44:22<1:31:50,  2.01it/s]
epoch 1 iter 5269: train loss 1.98838. lr 5.622667e-04:  32%|███▏      | 5270/16329 [44:22<1:31:54,  2.01it/s]
epoch 1 iter 5270: train loss 2.00086. lr 5.622527e-04:  32%|███▏      | 5270/16329 [44:23<1:31:54,  2.01it/s]
epoch 1 iter 5270: train loss 2.00086. lr 5.622527e-04:  32%|███▏      | 5271/16329 [44:23<1:31:47,  2.01it/s]
epoch 1 iter 5271: train loss 1.98735. lr 5.622387e-04:  32%|███▏      | 5271/16329 [44:23<1:31:47,  2.01it/s]
epoch 1 iter 5271: train loss 1.98735. lr 5.622387e-04:  32%|███▏      | 5272/16329 [44:23<1:31:26,  2.02it/s]
epoch 1 iter 5272: train loss 2.00532. lr 5.622247e-04:  32%|███▏      | 5272/16329 [44:24<1:31:26,  2.02it/s]
epoch 1 iter 5272: train loss 2.00532. lr 5.622247e-04:  32%|███▏      | 5273/16329 [44:

epoch 1 iter 5304: train loss 1.98063. lr 5.617748e-04:  32%|███▏      | 5304/16329 [44:40<1:37:39,  1.88it/s]
epoch 1 iter 5304: train loss 1.98063. lr 5.617748e-04:  32%|███▏      | 5305/16329 [44:40<1:35:45,  1.92it/s]
epoch 1 iter 5305: train loss 1.99476. lr 5.617607e-04:  32%|███▏      | 5305/16329 [44:41<1:35:45,  1.92it/s]
epoch 1 iter 5305: train loss 1.99476. lr 5.617607e-04:  32%|███▏      | 5306/16329 [44:41<1:34:22,  1.95it/s]
epoch 1 iter 5306: train loss 2.01225. lr 5.617466e-04:  32%|███▏      | 5306/16329 [44:41<1:34:22,  1.95it/s]
epoch 1 iter 5306: train loss 2.01225. lr 5.617466e-04:  33%|███▎      | 5307/16329 [44:41<1:33:03,  1.97it/s]
epoch 1 iter 5307: train loss 2.02646. lr 5.617325e-04:  33%|███▎      | 5307/16329 [44:42<1:33:03,  1.97it/s]
epoch 1 iter 5307: train loss 2.02646. lr 5.617325e-04:  33%|███▎      | 5308/16329 [44:42<1:32:28,  1.99it/s]
epoch 1 iter 5308: train loss 1.94058. lr 5.617184e-04:  33%|███▎      | 5308/16329 [44:

epoch 1 iter 5339: train loss 1.98263. lr 5.612799e-04:  33%|███▎      | 5340/16329 [44:58<1:34:09,  1.95it/s]
epoch 1 iter 5340: train loss 1.98221. lr 5.612657e-04:  33%|███▎      | 5340/16329 [44:59<1:34:09,  1.95it/s]
epoch 1 iter 5340: train loss 1.98221. lr 5.612657e-04:  33%|███▎      | 5341/16329 [44:59<1:34:03,  1.95it/s]
epoch 1 iter 5341: train loss 2.04355. lr 5.612515e-04:  33%|███▎      | 5341/16329 [44:59<1:34:03,  1.95it/s]
epoch 1 iter 5341: train loss 2.04355. lr 5.612515e-04:  33%|███▎      | 5342/16329 [44:59<1:33:47,  1.95it/s]
epoch 1 iter 5342: train loss 1.99894. lr 5.612374e-04:  33%|███▎      | 5342/16329 [45:00<1:33:47,  1.95it/s]
epoch 1 iter 5342: train loss 1.99894. lr 5.612374e-04:  33%|███▎      | 5343/16329 [45:00<1:32:38,  1.98it/s]
epoch 1 iter 5343: train loss 2.01143. lr 5.612232e-04:  33%|███▎      | 5343/16329 [45:00<1:32:38,  1.98it/s]
epoch 1 iter 5343: train loss 2.01143. lr 5.612232e-04:  33%|███▎      | 5344/16329 [45:

epoch 1 iter 5375: train loss 2.01273. lr 5.607678e-04:  33%|███▎      | 5375/16329 [45:16<1:31:32,  1.99it/s]
epoch 1 iter 5375: train loss 2.01273. lr 5.607678e-04:  33%|███▎      | 5376/16329 [45:16<1:31:15,  2.00it/s]
epoch 1 iter 5376: train loss 1.93669. lr 5.607535e-04:  33%|███▎      | 5376/16329 [45:17<1:31:15,  2.00it/s]
epoch 1 iter 5376: train loss 1.93669. lr 5.607535e-04:  33%|███▎      | 5377/16329 [45:17<1:30:59,  2.01it/s]
epoch 1 iter 5377: train loss 1.92422. lr 5.607393e-04:  33%|███▎      | 5377/16329 [45:17<1:30:59,  2.01it/s]
epoch 1 iter 5377: train loss 1.92422. lr 5.607393e-04:  33%|███▎      | 5378/16329 [45:17<1:31:06,  2.00it/s]
epoch 1 iter 5378: train loss 1.96387. lr 5.607250e-04:  33%|███▎      | 5378/16329 [45:18<1:31:06,  2.00it/s]
epoch 1 iter 5378: train loss 1.96387. lr 5.607250e-04:  33%|███▎      | 5379/16329 [45:18<1:30:54,  2.01it/s]
epoch 1 iter 5379: train loss 1.94842. lr 5.607107e-04:  33%|███▎      | 5379/16329 [45:

epoch 1 iter 5410: train loss 1.98226. lr 5.602669e-04:  33%|███▎      | 5411/16329 [45:34<1:31:17,  1.99it/s]
epoch 1 iter 5411: train loss 1.98255. lr 5.602525e-04:  33%|███▎      | 5411/16329 [45:34<1:31:17,  1.99it/s]
epoch 1 iter 5411: train loss 1.98255. lr 5.602525e-04:  33%|███▎      | 5412/16329 [45:34<1:30:59,  2.00it/s]
epoch 1 iter 5412: train loss 1.97899. lr 5.602382e-04:  33%|███▎      | 5412/16329 [45:35<1:30:59,  2.00it/s]
epoch 1 iter 5412: train loss 1.97899. lr 5.602382e-04:  33%|███▎      | 5413/16329 [45:35<1:30:40,  2.01it/s]
epoch 1 iter 5413: train loss 1.99703. lr 5.602238e-04:  33%|███▎      | 5413/16329 [45:35<1:30:40,  2.01it/s]
epoch 1 iter 5413: train loss 1.99703. lr 5.602238e-04:  33%|███▎      | 5414/16329 [45:35<1:30:50,  2.00it/s]
epoch 1 iter 5414: train loss 1.99210. lr 5.602095e-04:  33%|███▎      | 5414/16329 [45:36<1:30:50,  2.00it/s]
epoch 1 iter 5414: train loss 1.99210. lr 5.602095e-04:  33%|███▎      | 5415/16329 [45:

epoch 1 iter 5446: train loss 1.97256. lr 5.597486e-04:  33%|███▎      | 5446/16329 [45:52<1:30:06,  2.01it/s]
epoch 1 iter 5446: train loss 1.97256. lr 5.597486e-04:  33%|███▎      | 5447/16329 [45:52<1:30:16,  2.01it/s]
epoch 1 iter 5447: train loss 1.96672. lr 5.597342e-04:  33%|███▎      | 5447/16329 [45:53<1:30:16,  2.01it/s]
epoch 1 iter 5447: train loss 1.96672. lr 5.597342e-04:  33%|███▎      | 5448/16329 [45:53<1:30:04,  2.01it/s]
epoch 1 iter 5448: train loss 1.93513. lr 5.597197e-04:  33%|███▎      | 5448/16329 [45:53<1:30:04,  2.01it/s]
epoch 1 iter 5448: train loss 1.93513. lr 5.597197e-04:  33%|███▎      | 5449/16329 [45:53<1:29:57,  2.02it/s]
epoch 1 iter 5449: train loss 1.97049. lr 5.597053e-04:  33%|███▎      | 5449/16329 [45:54<1:29:57,  2.02it/s]
epoch 1 iter 5449: train loss 1.97049. lr 5.597053e-04:  33%|███▎      | 5450/16329 [45:54<1:30:06,  2.01it/s]
epoch 1 iter 5450: train loss 1.94919. lr 5.596908e-04:  33%|███▎      | 5450/16329 [45:

epoch 1 iter 5481: train loss 1.96383. lr 5.592418e-04:  34%|███▎      | 5482/16329 [46:10<1:31:09,  1.98it/s]
epoch 1 iter 5482: train loss 1.95485. lr 5.592272e-04:  34%|███▎      | 5482/16329 [46:10<1:31:09,  1.98it/s]
epoch 1 iter 5482: train loss 1.95485. lr 5.592272e-04:  34%|███▎      | 5483/16329 [46:10<1:31:13,  1.98it/s]
epoch 1 iter 5483: train loss 1.97255. lr 5.592127e-04:  34%|███▎      | 5483/16329 [46:11<1:31:13,  1.98it/s]
epoch 1 iter 5483: train loss 1.97255. lr 5.592127e-04:  34%|███▎      | 5484/16329 [46:11<1:31:13,  1.98it/s]
epoch 1 iter 5484: train loss 1.93337. lr 5.591982e-04:  34%|███▎      | 5484/16329 [46:11<1:31:13,  1.98it/s]
epoch 1 iter 5484: train loss 1.93337. lr 5.591982e-04:  34%|███▎      | 5485/16329 [46:11<1:30:36,  1.99it/s]
epoch 1 iter 5485: train loss 1.94313. lr 5.591836e-04:  34%|███▎      | 5485/16329 [46:12<1:30:36,  1.99it/s]
epoch 1 iter 5485: train loss 1.94313. lr 5.591836e-04:  34%|███▎      | 5486/16329 [46:

epoch 1 iter 5517: train loss 1.99713. lr 5.587173e-04:  34%|███▍      | 5517/16329 [46:28<1:29:32,  2.01it/s]
epoch 1 iter 5517: train loss 1.99713. lr 5.587173e-04:  34%|███▍      | 5518/16329 [46:28<1:29:28,  2.01it/s]
epoch 1 iter 5518: train loss 1.97784. lr 5.587027e-04:  34%|███▍      | 5518/16329 [46:28<1:29:28,  2.01it/s]
epoch 1 iter 5518: train loss 1.97784. lr 5.587027e-04:  34%|███▍      | 5519/16329 [46:28<1:29:12,  2.02it/s]
epoch 1 iter 5519: train loss 1.97082. lr 5.586881e-04:  34%|███▍      | 5519/16329 [46:29<1:29:12,  2.02it/s]
epoch 1 iter 5519: train loss 1.97082. lr 5.586881e-04:  34%|███▍      | 5520/16329 [46:29<1:29:32,  2.01it/s]
epoch 1 iter 5520: train loss 1.97468. lr 5.586735e-04:  34%|███▍      | 5520/16329 [46:29<1:29:32,  2.01it/s]
epoch 1 iter 5520: train loss 1.97468. lr 5.586735e-04:  34%|███▍      | 5521/16329 [46:29<1:29:25,  2.01it/s]
epoch 1 iter 5521: train loss 1.93353. lr 5.586589e-04:  34%|███▍      | 5521/16329 [46:

epoch 1 iter 5552: train loss 1.93150. lr 5.582045e-04:  34%|███▍      | 5553/16329 [46:45<1:28:57,  2.02it/s]
epoch 1 iter 5553: train loss 1.98277. lr 5.581898e-04:  34%|███▍      | 5553/16329 [46:46<1:28:57,  2.02it/s]
epoch 1 iter 5553: train loss 1.98277. lr 5.581898e-04:  34%|███▍      | 5554/16329 [46:46<1:29:03,  2.02it/s]
epoch 1 iter 5554: train loss 1.99680. lr 5.581751e-04:  34%|███▍      | 5554/16329 [46:46<1:29:03,  2.02it/s]
epoch 1 iter 5554: train loss 1.99680. lr 5.581751e-04:  34%|███▍      | 5555/16329 [46:46<1:28:51,  2.02it/s]
epoch 1 iter 5555: train loss 1.95811. lr 5.581604e-04:  34%|███▍      | 5555/16329 [46:47<1:28:51,  2.02it/s]
epoch 1 iter 5555: train loss 1.95811. lr 5.581604e-04:  34%|███▍      | 5556/16329 [46:47<1:44:29,  1.72it/s]
epoch 1 iter 5556: train loss 1.90496. lr 5.581457e-04:  34%|███▍      | 5556/16329 [46:48<1:44:29,  1.72it/s]
epoch 1 iter 5556: train loss 1.90496. lr 5.581457e-04:  34%|███▍      | 5557/16329 [46:

epoch 1 iter 5588: train loss 1.93770. lr 5.576740e-04:  34%|███▍      | 5588/16329 [47:04<1:35:16,  1.88it/s]
epoch 1 iter 5588: train loss 1.93770. lr 5.576740e-04:  34%|███▍      | 5589/16329 [47:04<1:36:57,  1.85it/s]
epoch 1 iter 5589: train loss 1.90449. lr 5.576592e-04:  34%|███▍      | 5589/16329 [47:04<1:36:57,  1.85it/s]
epoch 1 iter 5589: train loss 1.90449. lr 5.576592e-04:  34%|███▍      | 5590/16329 [47:04<1:37:18,  1.84it/s]
epoch 1 iter 5590: train loss 1.99884. lr 5.576444e-04:  34%|███▍      | 5590/16329 [47:05<1:37:18,  1.84it/s]
epoch 1 iter 5590: train loss 1.99884. lr 5.576444e-04:  34%|███▍      | 5591/16329 [47:05<1:36:49,  1.85it/s]
epoch 1 iter 5591: train loss 1.93860. lr 5.576296e-04:  34%|███▍      | 5591/16329 [47:05<1:36:49,  1.85it/s]
epoch 1 iter 5591: train loss 1.93860. lr 5.576296e-04:  34%|███▍      | 5592/16329 [47:05<1:35:49,  1.87it/s]
epoch 1 iter 5592: train loss 1.97351. lr 5.576148e-04:  34%|███▍      | 5592/16329 [47:

epoch 1 iter 5623: train loss 1.94967. lr 5.571552e-04:  34%|███▍      | 5624/16329 [47:21<1:35:11,  1.87it/s]
epoch 1 iter 5624: train loss 1.91188. lr 5.571403e-04:  34%|███▍      | 5624/16329 [47:22<1:35:11,  1.87it/s]
epoch 1 iter 5624: train loss 1.91188. lr 5.571403e-04:  34%|███▍      | 5625/16329 [47:22<1:33:04,  1.92it/s]
epoch 1 iter 5625: train loss 1.91615. lr 5.571255e-04:  34%|███▍      | 5625/16329 [47:22<1:33:04,  1.92it/s]
epoch 1 iter 5625: train loss 1.91615. lr 5.571255e-04:  34%|███▍      | 5626/16329 [47:22<1:31:38,  1.95it/s]
epoch 1 iter 5626: train loss 1.88701. lr 5.571106e-04:  34%|███▍      | 5626/16329 [47:23<1:31:38,  1.95it/s]
epoch 1 iter 5626: train loss 1.88701. lr 5.571106e-04:  34%|███▍      | 5627/16329 [47:23<1:30:58,  1.96it/s]
epoch 1 iter 5627: train loss 1.92789. lr 5.570957e-04:  34%|███▍      | 5627/16329 [47:23<1:30:58,  1.96it/s]
epoch 1 iter 5627: train loss 1.92789. lr 5.570957e-04:  34%|███▍      | 5628/16329 [47:

epoch 1 iter 5659: train loss 1.89818. lr 5.566186e-04:  35%|███▍      | 5659/16329 [47:40<1:38:41,  1.80it/s]
epoch 1 iter 5659: train loss 1.89818. lr 5.566186e-04:  35%|███▍      | 5660/16329 [47:40<1:35:35,  1.86it/s]
epoch 1 iter 5660: train loss 1.92218. lr 5.566036e-04:  35%|███▍      | 5660/16329 [47:40<1:35:35,  1.86it/s]
epoch 1 iter 5660: train loss 1.92218. lr 5.566036e-04:  35%|███▍      | 5661/16329 [47:40<1:33:21,  1.90it/s]
epoch 1 iter 5661: train loss 1.98676. lr 5.565887e-04:  35%|███▍      | 5661/16329 [47:41<1:33:21,  1.90it/s]
epoch 1 iter 5661: train loss 1.98676. lr 5.565887e-04:  35%|███▍      | 5662/16329 [47:41<1:35:00,  1.87it/s]
epoch 1 iter 5662: train loss 1.93508. lr 5.565737e-04:  35%|███▍      | 5662/16329 [47:41<1:35:00,  1.87it/s]
epoch 1 iter 5662: train loss 1.93508. lr 5.565737e-04:  35%|███▍      | 5663/16329 [47:41<1:34:51,  1.87it/s]
epoch 1 iter 5663: train loss 1.93536. lr 5.565588e-04:  35%|███▍      | 5663/16329 [47:

epoch 1 iter 5694: train loss 1.88441. lr 5.560939e-04:  35%|███▍      | 5695/16329 [47:58<1:28:10,  2.01it/s]
epoch 1 iter 5695: train loss 1.94234. lr 5.560789e-04:  35%|███▍      | 5695/16329 [47:58<1:28:10,  2.01it/s]
epoch 1 iter 5695: train loss 1.94234. lr 5.560789e-04:  35%|███▍      | 5696/16329 [47:58<1:27:53,  2.02it/s]
epoch 1 iter 5696: train loss 1.93560. lr 5.560639e-04:  35%|███▍      | 5696/16329 [47:59<1:27:53,  2.02it/s]
epoch 1 iter 5696: train loss 1.93560. lr 5.560639e-04:  35%|███▍      | 5697/16329 [47:59<1:28:08,  2.01it/s]
epoch 1 iter 5697: train loss 1.89513. lr 5.560488e-04:  35%|███▍      | 5697/16329 [47:59<1:28:08,  2.01it/s]
epoch 1 iter 5697: train loss 1.89513. lr 5.560488e-04:  35%|███▍      | 5698/16329 [47:59<1:28:01,  2.01it/s]
epoch 1 iter 5698: train loss 1.91638. lr 5.560338e-04:  35%|███▍      | 5698/16329 [48:00<1:28:01,  2.01it/s]
epoch 1 iter 5698: train loss 1.91638. lr 5.560338e-04:  35%|███▍      | 5699/16329 [48:

epoch 1 iter 5730: train loss 1.90835. lr 5.555512e-04:  35%|███▌      | 5730/16329 [48:16<1:27:44,  2.01it/s]
epoch 1 iter 5730: train loss 1.90835. lr 5.555512e-04:  35%|███▌      | 5731/16329 [48:16<1:27:35,  2.02it/s]
epoch 1 iter 5731: train loss 1.89059. lr 5.555361e-04:  35%|███▌      | 5731/16329 [48:16<1:27:35,  2.02it/s]
epoch 1 iter 5731: train loss 1.89059. lr 5.555361e-04:  35%|███▌      | 5732/16329 [48:16<1:27:24,  2.02it/s]
epoch 1 iter 5732: train loss 1.90500. lr 5.555210e-04:  35%|███▌      | 5732/16329 [48:17<1:27:24,  2.02it/s]
epoch 1 iter 5732: train loss 1.90500. lr 5.555210e-04:  35%|███▌      | 5733/16329 [48:17<1:27:11,  2.03it/s]
epoch 1 iter 5733: train loss 1.89601. lr 5.555059e-04:  35%|███▌      | 5733/16329 [48:17<1:27:11,  2.03it/s]
epoch 1 iter 5733: train loss 1.89601. lr 5.555059e-04:  35%|███▌      | 5734/16329 [48:17<1:27:29,  2.02it/s]
epoch 1 iter 5734: train loss 1.90195. lr 5.554908e-04:  35%|███▌      | 5734/16329 [48:

epoch 1 iter 5765: train loss 1.88810. lr 5.550207e-04:  35%|███▌      | 5766/16329 [48:34<1:27:39,  2.01it/s]
epoch 1 iter 5766: train loss 1.89892. lr 5.550055e-04:  35%|███▌      | 5766/16329 [48:34<1:27:39,  2.01it/s]
epoch 1 iter 5766: train loss 1.89892. lr 5.550055e-04:  35%|███▌      | 5767/16329 [48:34<1:27:49,  2.00it/s]
epoch 1 iter 5767: train loss 1.91161. lr 5.549903e-04:  35%|███▌      | 5767/16329 [48:35<1:27:49,  2.00it/s]
epoch 1 iter 5767: train loss 1.91161. lr 5.549903e-04:  35%|███▌      | 5768/16329 [48:35<1:27:39,  2.01it/s]
epoch 1 iter 5768: train loss 1.87888. lr 5.549751e-04:  35%|███▌      | 5768/16329 [48:35<1:27:39,  2.01it/s]
epoch 1 iter 5768: train loss 1.87888. lr 5.549751e-04:  35%|███▌      | 5769/16329 [48:35<1:27:25,  2.01it/s]
epoch 1 iter 5769: train loss 1.91384. lr 5.549599e-04:  35%|███▌      | 5769/16329 [48:36<1:27:25,  2.01it/s]
epoch 1 iter 5769: train loss 1.91384. lr 5.549599e-04:  35%|███▌      | 5770/16329 [48:

epoch 1 iter 5801: train loss 1.90126. lr 5.544720e-04:  36%|███▌      | 5801/16329 [48:52<1:29:47,  1.95it/s]
epoch 1 iter 5801: train loss 1.90126. lr 5.544720e-04:  36%|███▌      | 5802/16329 [48:52<1:28:56,  1.97it/s]
epoch 1 iter 5802: train loss 1.90798. lr 5.544567e-04:  36%|███▌      | 5802/16329 [48:52<1:28:56,  1.97it/s]
epoch 1 iter 5802: train loss 1.90798. lr 5.544567e-04:  36%|███▌      | 5803/16329 [48:52<1:28:25,  1.98it/s]
epoch 1 iter 5803: train loss 1.90972. lr 5.544414e-04:  36%|███▌      | 5803/16329 [48:53<1:28:25,  1.98it/s]
epoch 1 iter 5803: train loss 1.90972. lr 5.544414e-04:  36%|███▌      | 5804/16329 [48:53<1:27:55,  2.00it/s]
epoch 1 iter 5804: train loss 1.90128. lr 5.544261e-04:  36%|███▌      | 5804/16329 [48:53<1:27:55,  2.00it/s]
epoch 1 iter 5804: train loss 1.90128. lr 5.544261e-04:  36%|███▌      | 5805/16329 [48:53<1:27:45,  2.00it/s]
epoch 1 iter 5805: train loss 1.90639. lr 5.544108e-04:  36%|███▌      | 5805/16329 [48:

epoch 1 iter 5836: train loss 1.87031. lr 5.539356e-04:  36%|███▌      | 5837/16329 [49:10<1:39:19,  1.76it/s]
epoch 1 iter 5837: train loss 1.90576. lr 5.539202e-04:  36%|███▌      | 5837/16329 [49:10<1:39:19,  1.76it/s]
epoch 1 iter 5837: train loss 1.90576. lr 5.539202e-04:  36%|███▌      | 5838/16329 [49:10<1:35:24,  1.83it/s]
epoch 1 iter 5838: train loss 1.85939. lr 5.539048e-04:  36%|███▌      | 5838/16329 [49:11<1:35:24,  1.83it/s]
epoch 1 iter 5838: train loss 1.85939. lr 5.539048e-04:  36%|███▌      | 5839/16329 [49:11<1:33:01,  1.88it/s]
epoch 1 iter 5839: train loss 1.90411. lr 5.538894e-04:  36%|███▌      | 5839/16329 [49:11<1:33:01,  1.88it/s]
epoch 1 iter 5839: train loss 1.90411. lr 5.538894e-04:  36%|███▌      | 5840/16329 [49:11<1:31:02,  1.92it/s]
epoch 1 iter 5840: train loss 1.90848. lr 5.538741e-04:  36%|███▌      | 5840/16329 [49:12<1:31:02,  1.92it/s]
epoch 1 iter 5840: train loss 1.90848. lr 5.538741e-04:  36%|███▌      | 5841/16329 [49:

epoch 1 iter 5872: train loss 1.93762. lr 5.533808e-04:  36%|███▌      | 5872/16329 [49:28<1:26:42,  2.01it/s]
epoch 1 iter 5872: train loss 1.93762. lr 5.533808e-04:  36%|███▌      | 5873/16329 [49:28<1:26:38,  2.01it/s]
epoch 1 iter 5873: train loss 1.86253. lr 5.533654e-04:  36%|███▌      | 5873/16329 [49:28<1:26:38,  2.01it/s]
epoch 1 iter 5873: train loss 1.86253. lr 5.533654e-04:  36%|███▌      | 5874/16329 [49:28<1:26:54,  2.00it/s]
epoch 1 iter 5874: train loss 1.87318. lr 5.533499e-04:  36%|███▌      | 5874/16329 [49:29<1:26:54,  2.00it/s]
epoch 1 iter 5874: train loss 1.87318. lr 5.533499e-04:  36%|███▌      | 5875/16329 [49:29<1:26:51,  2.01it/s]
epoch 1 iter 5875: train loss 1.86418. lr 5.533345e-04:  36%|███▌      | 5875/16329 [49:29<1:26:51,  2.01it/s]
epoch 1 iter 5875: train loss 1.86418. lr 5.533345e-04:  36%|███▌      | 5876/16329 [49:29<1:26:43,  2.01it/s]
epoch 1 iter 5876: train loss 1.90143. lr 5.533190e-04:  36%|███▌      | 5876/16329 [49:

epoch 1 iter 5907: train loss 1.85297. lr 5.528386e-04:  36%|███▌      | 5908/16329 [49:45<1:26:59,  2.00it/s]
epoch 1 iter 5908: train loss 1.94361. lr 5.528230e-04:  36%|███▌      | 5908/16329 [49:46<1:26:59,  2.00it/s]
epoch 1 iter 5908: train loss 1.94361. lr 5.528230e-04:  36%|███▌      | 5909/16329 [49:46<1:27:19,  1.99it/s]
epoch 1 iter 5909: train loss 1.81251. lr 5.528075e-04:  36%|███▌      | 5909/16329 [49:46<1:27:19,  1.99it/s]
epoch 1 iter 5909: train loss 1.81251. lr 5.528075e-04:  36%|███▌      | 5910/16329 [49:46<1:26:59,  2.00it/s]
epoch 1 iter 5910: train loss 1.82180. lr 5.527920e-04:  36%|███▌      | 5910/16329 [49:47<1:26:59,  2.00it/s]
epoch 1 iter 5910: train loss 1.82180. lr 5.527920e-04:  36%|███▌      | 5911/16329 [49:47<1:26:59,  2.00it/s]
epoch 1 iter 5911: train loss 1.85736. lr 5.527764e-04:  36%|███▌      | 5911/16329 [49:47<1:26:59,  2.00it/s]
epoch 1 iter 5911: train loss 1.85736. lr 5.527764e-04:  36%|███▌      | 5912/16329 [49:

epoch 1 iter 5943: train loss 1.84418. lr 5.522779e-04:  36%|███▋      | 5943/16329 [50:04<1:30:43,  1.91it/s]
epoch 1 iter 5943: train loss 1.84418. lr 5.522779e-04:  36%|███▋      | 5944/16329 [50:04<1:29:47,  1.93it/s]
epoch 1 iter 5944: train loss 1.89284. lr 5.522622e-04:  36%|███▋      | 5944/16329 [50:04<1:29:47,  1.93it/s]
epoch 1 iter 5944: train loss 1.89284. lr 5.522622e-04:  36%|███▋      | 5945/16329 [50:04<1:28:27,  1.96it/s]
epoch 1 iter 5945: train loss 1.85495. lr 5.522466e-04:  36%|███▋      | 5945/16329 [50:05<1:28:27,  1.96it/s]
epoch 1 iter 5945: train loss 1.85495. lr 5.522466e-04:  36%|███▋      | 5946/16329 [50:05<1:27:28,  1.98it/s]
epoch 1 iter 5946: train loss 1.84134. lr 5.522310e-04:  36%|███▋      | 5946/16329 [50:05<1:27:28,  1.98it/s]
epoch 1 iter 5946: train loss 1.84134. lr 5.522310e-04:  36%|███▋      | 5947/16329 [50:05<1:27:03,  1.99it/s]
epoch 1 iter 5947: train loss 1.86489. lr 5.522154e-04:  36%|███▋      | 5947/16329 [50:

epoch 1 iter 5978: train loss 1.88668. lr 5.517298e-04:  37%|███▋      | 5979/16329 [50:21<1:25:31,  2.02it/s]
epoch 1 iter 5979: train loss 1.86615. lr 5.517141e-04:  37%|███▋      | 5979/16329 [50:22<1:25:31,  2.02it/s]
epoch 1 iter 5979: train loss 1.86615. lr 5.517141e-04:  37%|███▋      | 5980/16329 [50:22<1:27:36,  1.97it/s]
epoch 1 iter 5980: train loss 1.89143. lr 5.516984e-04:  37%|███▋      | 5980/16329 [50:22<1:27:36,  1.97it/s]
epoch 1 iter 5980: train loss 1.89143. lr 5.516984e-04:  37%|███▋      | 5981/16329 [50:22<1:28:21,  1.95it/s]
epoch 1 iter 5981: train loss 1.84691. lr 5.516827e-04:  37%|███▋      | 5981/16329 [50:23<1:28:21,  1.95it/s]
epoch 1 iter 5981: train loss 1.84691. lr 5.516827e-04:  37%|███▋      | 5982/16329 [50:23<1:28:20,  1.95it/s]
epoch 1 iter 5982: train loss 1.84919. lr 5.516670e-04:  37%|███▋      | 5982/16329 [50:23<1:28:20,  1.95it/s]
epoch 1 iter 5982: train loss 1.84919. lr 5.516670e-04:  37%|███▋      | 5983/16329 [50:

epoch 1 iter 6014: train loss 1.82686. lr 5.511631e-04:  37%|███▋      | 6014/16329 [50:40<1:25:27,  2.01it/s]
epoch 1 iter 6014: train loss 1.82686. lr 5.511631e-04:  37%|███▋      | 6015/16329 [50:40<1:25:12,  2.02it/s]
epoch 1 iter 6015: train loss 1.89673. lr 5.511473e-04:  37%|███▋      | 6015/16329 [50:40<1:25:12,  2.02it/s]
epoch 1 iter 6015: train loss 1.89673. lr 5.511473e-04:  37%|███▋      | 6016/16329 [50:40<1:25:27,  2.01it/s]
epoch 1 iter 6016: train loss 1.85224. lr 5.511315e-04:  37%|███▋      | 6016/16329 [50:41<1:25:27,  2.01it/s]
epoch 1 iter 6016: train loss 1.85224. lr 5.511315e-04:  37%|███▋      | 6017/16329 [50:41<1:25:18,  2.01it/s]
epoch 1 iter 6017: train loss 1.89679. lr 5.511158e-04:  37%|███▋      | 6017/16329 [50:41<1:25:18,  2.01it/s]
epoch 1 iter 6017: train loss 1.89679. lr 5.511158e-04:  37%|███▋      | 6018/16329 [50:41<1:25:33,  2.01it/s]
epoch 1 iter 6018: train loss 1.80809. lr 5.511000e-04:  37%|███▋      | 6018/16329 [50:

epoch 1 iter 6049: train loss 1.83546. lr 5.506093e-04:  37%|███▋      | 6050/16329 [50:57<1:25:29,  2.00it/s][A
epoch 1 iter 6050: train loss 1.86023. lr 5.505934e-04:  37%|███▋      | 6050/16329 [50:58<1:25:29,  2.00it/s][A
epoch 1 iter 6050: train loss 1.86023. lr 5.505934e-04:  37%|███▋      | 6051/16329 [50:58<1:25:40,  2.00it/s][A
epoch 1 iter 6051: train loss 1.79738. lr 5.505775e-04:  37%|███▋      | 6051/16329 [50:58<1:25:40,  2.00it/s][A
epoch 1 iter 6051: train loss 1.79738. lr 5.505775e-04:  37%|███▋      | 6052/16329 [50:58<1:25:25,  2.00it/s][A
epoch 1 iter 6052: train loss 1.91764. lr 5.505617e-04:  37%|███▋      | 6052/16329 [50:59<1:25:25,  2.00it/s][A
epoch 1 iter 6052: train loss 1.91764. lr 5.505617e-04:  37%|███▋      | 6053/16329 [50:59<1:25:25,  2.00it/s][A
epoch 1 iter 6053: train loss 1.83420. lr 5.505458e-04:  37%|███▋      | 6053/16329 [50:59<1:25:25,  2.00it/s][A
epoch 1 iter 6053: train loss 1.83420. lr 5.505458e-04:  37%|███▋      | 6054/16329 [50:

epoch 1 iter 6085: train loss 1.84927. lr 5.500367e-04:  37%|███▋      | 6085/16329 [51:15<1:24:45,  2.01it/s][A
epoch 1 iter 6085: train loss 1.84927. lr 5.500367e-04:  37%|███▋      | 6086/16329 [51:15<1:24:29,  2.02it/s][A
epoch 1 iter 6086: train loss 1.81673. lr 5.500207e-04:  37%|███▋      | 6086/16329 [51:16<1:24:29,  2.02it/s][A
epoch 1 iter 6086: train loss 1.81673. lr 5.500207e-04:  37%|███▋      | 6087/16329 [51:16<1:24:35,  2.02it/s][A
epoch 1 iter 6087: train loss 1.87733. lr 5.500048e-04:  37%|███▋      | 6087/16329 [51:16<1:24:35,  2.02it/s][A
epoch 1 iter 6087: train loss 1.87733. lr 5.500048e-04:  37%|███▋      | 6088/16329 [51:16<1:24:24,  2.02it/s][A
epoch 1 iter 6088: train loss 1.87437. lr 5.499888e-04:  37%|███▋      | 6088/16329 [51:17<1:24:24,  2.02it/s][A
epoch 1 iter 6088: train loss 1.87437. lr 5.499888e-04:  37%|███▋      | 6089/16329 [51:17<1:24:39,  2.02it/s][A
epoch 1 iter 6089: train loss 1.81178. lr 5.499729e-04:  37%|███▋      | 6089/16329 [51:

epoch 1 iter 6120: train loss 1.86302. lr 5.494771e-04:  37%|███▋      | 6121/16329 [51:33<1:24:27,  2.01it/s][A
epoch 1 iter 6121: train loss 1.84977. lr 5.494610e-04:  37%|███▋      | 6121/16329 [51:33<1:24:27,  2.01it/s][A
epoch 1 iter 6121: train loss 1.84977. lr 5.494610e-04:  37%|███▋      | 6122/16329 [51:33<1:24:23,  2.02it/s][A
epoch 1 iter 6122: train loss 1.75983. lr 5.494450e-04:  37%|███▋      | 6122/16329 [51:34<1:24:23,  2.02it/s][A
epoch 1 iter 6122: train loss 1.75983. lr 5.494450e-04:  37%|███▋      | 6123/16329 [51:34<1:24:33,  2.01it/s][A
epoch 1 iter 6123: train loss 1.82905. lr 5.494290e-04:  37%|███▋      | 6123/16329 [51:34<1:24:33,  2.01it/s][A
epoch 1 iter 6123: train loss 1.82905. lr 5.494290e-04:  38%|███▊      | 6124/16329 [51:34<1:24:18,  2.02it/s][A
epoch 1 iter 6124: train loss 1.80049. lr 5.494129e-04:  38%|███▊      | 6124/16329 [51:35<1:24:18,  2.02it/s][A
epoch 1 iter 6124: train loss 1.80049. lr 5.494129e-04:  38%|███▊      | 6125/16329 [51:

epoch 1 iter 6156: train loss 1.83921. lr 5.488985e-04:  38%|███▊      | 6156/16329 [51:51<1:24:05,  2.02it/s][A
epoch 1 iter 6156: train loss 1.83921. lr 5.488985e-04:  38%|███▊      | 6157/16329 [51:51<1:24:27,  2.01it/s][A
epoch 1 iter 6157: train loss 1.82988. lr 5.488824e-04:  38%|███▊      | 6157/16329 [51:52<1:24:27,  2.01it/s][A
epoch 1 iter 6157: train loss 1.82988. lr 5.488824e-04:  38%|███▊      | 6158/16329 [51:52<1:24:27,  2.01it/s][A
epoch 1 iter 6158: train loss 1.87567. lr 5.488663e-04:  38%|███▊      | 6158/16329 [51:52<1:24:27,  2.01it/s][A
epoch 1 iter 6158: train loss 1.87567. lr 5.488663e-04:  38%|███▊      | 6159/16329 [51:52<1:24:36,  2.00it/s][A
epoch 1 iter 6159: train loss 1.85602. lr 5.488502e-04:  38%|███▊      | 6159/16329 [51:53<1:24:36,  2.00it/s][A
epoch 1 iter 6159: train loss 1.85602. lr 5.488502e-04:  38%|███▊      | 6160/16329 [51:53<1:24:25,  2.01it/s][A
epoch 1 iter 6160: train loss 1.80886. lr 5.488341e-04:  38%|███▊      | 6160/16329 [51:

epoch 1 iter 6191: train loss 1.82594. lr 5.483332e-04:  38%|███▊      | 6192/16329 [52:09<1:31:32,  1.85it/s][A
epoch 1 iter 6192: train loss 1.84626. lr 5.483170e-04:  38%|███▊      | 6192/16329 [52:09<1:31:32,  1.85it/s][A
epoch 1 iter 6192: train loss 1.84626. lr 5.483170e-04:  38%|███▊      | 6193/16329 [52:09<1:29:19,  1.89it/s][A
epoch 1 iter 6193: train loss 1.82887. lr 5.483008e-04:  38%|███▊      | 6193/16329 [52:10<1:29:19,  1.89it/s][A
epoch 1 iter 6193: train loss 1.82887. lr 5.483008e-04:  38%|███▊      | 6194/16329 [52:10<1:27:30,  1.93it/s][A
epoch 1 iter 6194: train loss 1.85333. lr 5.482846e-04:  38%|███▊      | 6194/16329 [52:10<1:27:30,  1.93it/s][A
epoch 1 iter 6194: train loss 1.85333. lr 5.482846e-04:  38%|███▊      | 6195/16329 [52:10<1:26:26,  1.95it/s][A
epoch 1 iter 6195: train loss 1.86641. lr 5.482684e-04:  38%|███▊      | 6195/16329 [52:11<1:26:26,  1.95it/s][A
epoch 1 iter 6195: train loss 1.86641. lr 5.482684e-04:  38%|███▊      | 6196/16329 [52:

epoch 1 iter 6227: train loss 1.88360. lr 5.477488e-04:  38%|███▊      | 6227/16329 [52:27<1:25:10,  1.98it/s][A
epoch 1 iter 6227: train loss 1.88360. lr 5.477488e-04:  38%|███▊      | 6228/16329 [52:27<1:24:25,  1.99it/s][A
epoch 1 iter 6228: train loss 1.82598. lr 5.477325e-04:  38%|███▊      | 6228/16329 [52:28<1:24:25,  1.99it/s][A
epoch 1 iter 6228: train loss 1.82598. lr 5.477325e-04:  38%|███▊      | 6229/16329 [52:28<1:24:09,  2.00it/s][A
epoch 1 iter 6229: train loss 1.82181. lr 5.477163e-04:  38%|███▊      | 6229/16329 [52:28<1:24:09,  2.00it/s][A
epoch 1 iter 6229: train loss 1.82181. lr 5.477163e-04:  38%|███▊      | 6230/16329 [52:28<1:23:47,  2.01it/s][A
epoch 1 iter 6230: train loss 1.81665. lr 5.477000e-04:  38%|███▊      | 6230/16329 [52:29<1:23:47,  2.01it/s][A
epoch 1 iter 6230: train loss 1.81665. lr 5.477000e-04:  38%|███▊      | 6231/16329 [52:29<1:23:21,  2.02it/s][A
epoch 1 iter 6231: train loss 1.82438. lr 5.476837e-04:  38%|███▊      | 6231/16329 [52:

epoch 1 iter 6262: train loss 1.80869. lr 5.471778e-04:  38%|███▊      | 6263/16329 [52:45<1:24:19,  1.99it/s][A
epoch 1 iter 6263: train loss 1.78335. lr 5.471614e-04:  38%|███▊      | 6263/16329 [52:45<1:24:19,  1.99it/s][A
epoch 1 iter 6263: train loss 1.78335. lr 5.471614e-04:  38%|███▊      | 6264/16329 [52:45<1:23:43,  2.00it/s][A
epoch 1 iter 6264: train loss 1.83402. lr 5.471451e-04:  38%|███▊      | 6264/16329 [52:46<1:23:43,  2.00it/s][A
epoch 1 iter 6264: train loss 1.83402. lr 5.471451e-04:  38%|███▊      | 6265/16329 [52:46<1:23:28,  2.01it/s][A
epoch 1 iter 6265: train loss 1.85076. lr 5.471287e-04:  38%|███▊      | 6265/16329 [52:46<1:23:28,  2.01it/s][A
epoch 1 iter 6265: train loss 1.85076. lr 5.471287e-04:  38%|███▊      | 6266/16329 [52:46<1:23:27,  2.01it/s][A
epoch 1 iter 6266: train loss 1.80958. lr 5.471123e-04:  38%|███▊      | 6266/16329 [52:47<1:23:27,  2.01it/s][A
epoch 1 iter 6266: train loss 1.80958. lr 5.471123e-04:  38%|███▊      | 6267/16329 [52:

epoch 1 iter 6298: train loss 1.84208. lr 5.465875e-04:  39%|███▊      | 6298/16329 [53:03<1:24:05,  1.99it/s][A
epoch 1 iter 6298: train loss 1.84208. lr 5.465875e-04:  39%|███▊      | 6299/16329 [53:03<1:23:28,  2.00it/s][A
epoch 1 iter 6299: train loss 1.81325. lr 5.465711e-04:  39%|███▊      | 6299/16329 [53:03<1:23:28,  2.00it/s][A
epoch 1 iter 6299: train loss 1.81325. lr 5.465711e-04:  39%|███▊      | 6300/16329 [53:03<1:23:12,  2.01it/s][A
epoch 1 iter 6300: train loss 1.84984. lr 5.465546e-04:  39%|███▊      | 6300/16329 [53:04<1:23:12,  2.01it/s][A
epoch 1 iter 6300: train loss 1.84984. lr 5.465546e-04:  39%|███▊      | 6301/16329 [53:04<1:23:03,  2.01it/s][A
epoch 1 iter 6301: train loss 1.77927. lr 5.465382e-04:  39%|███▊      | 6301/16329 [53:04<1:23:03,  2.01it/s][A
epoch 1 iter 6301: train loss 1.77927. lr 5.465382e-04:  39%|███▊      | 6302/16329 [53:04<1:22:45,  2.02it/s][A
epoch 1 iter 6302: train loss 1.83005. lr 5.465218e-04:  39%|███▊      | 6302/16329 [53:

epoch 1 iter 6333: train loss 1.79781. lr 5.460108e-04:  39%|███▉      | 6334/16329 [53:20<1:23:17,  2.00it/s][A
epoch 1 iter 6334: train loss 1.78920. lr 5.459943e-04:  39%|███▉      | 6334/16329 [53:21<1:23:17,  2.00it/s][A
epoch 1 iter 6334: train loss 1.78920. lr 5.459943e-04:  39%|███▉      | 6335/16329 [53:21<1:22:51,  2.01it/s][A
epoch 1 iter 6335: train loss 1.84766. lr 5.459778e-04:  39%|███▉      | 6335/16329 [53:21<1:22:51,  2.01it/s][A
epoch 1 iter 6335: train loss 1.84766. lr 5.459778e-04:  39%|███▉      | 6336/16329 [53:21<1:22:51,  2.01it/s][A
epoch 1 iter 6336: train loss 1.79143. lr 5.459613e-04:  39%|███▉      | 6336/16329 [53:22<1:22:51,  2.01it/s][A
epoch 1 iter 6336: train loss 1.79143. lr 5.459613e-04:  39%|███▉      | 6337/16329 [53:22<1:22:28,  2.02it/s][A
epoch 1 iter 6337: train loss 1.78146. lr 5.459447e-04:  39%|███▉      | 6337/16329 [53:22<1:22:28,  2.02it/s][A
epoch 1 iter 6337: train loss 1.78146. lr 5.459447e-04:  39%|███▉      | 6338/16329 [53:

epoch 1 iter 6369: train loss 1.79958. lr 5.454147e-04:  39%|███▉      | 6369/16329 [53:38<1:22:18,  2.02it/s][A
epoch 1 iter 6369: train loss 1.79958. lr 5.454147e-04:  39%|███▉      | 6370/16329 [53:38<1:24:46,  1.96it/s][A
epoch 1 iter 6370: train loss 1.84857. lr 5.453981e-04:  39%|███▉      | 6370/16329 [53:39<1:24:46,  1.96it/s][A
epoch 1 iter 6370: train loss 1.84857. lr 5.453981e-04:  39%|███▉      | 6371/16329 [53:39<1:26:27,  1.92it/s][A
epoch 1 iter 6371: train loss 1.79417. lr 5.453815e-04:  39%|███▉      | 6371/16329 [53:40<1:26:27,  1.92it/s][A
epoch 1 iter 6371: train loss 1.79417. lr 5.453815e-04:  39%|███▉      | 6372/16329 [53:40<1:27:03,  1.91it/s][A
epoch 1 iter 6372: train loss 1.79134. lr 5.453649e-04:  39%|███▉      | 6372/16329 [53:40<1:27:03,  1.91it/s][A
epoch 1 iter 6372: train loss 1.79134. lr 5.453649e-04:  39%|███▉      | 6373/16329 [53:40<1:26:50,  1.91it/s][A
epoch 1 iter 6373: train loss 1.74043. lr 5.453483e-04:  39%|███▉      | 6373/16329 [53:

epoch 1 iter 6404: train loss 1.76044. lr 5.448324e-04:  39%|███▉      | 6405/16329 [53:56<1:22:35,  2.00it/s][A
epoch 1 iter 6405: train loss 1.83657. lr 5.448157e-04:  39%|███▉      | 6405/16329 [53:57<1:22:35,  2.00it/s][A
epoch 1 iter 6405: train loss 1.83657. lr 5.448157e-04:  39%|███▉      | 6406/16329 [53:57<1:22:06,  2.01it/s][A
epoch 1 iter 6406: train loss 1.77945. lr 5.447990e-04:  39%|███▉      | 6406/16329 [53:57<1:22:06,  2.01it/s][A
epoch 1 iter 6406: train loss 1.77945. lr 5.447990e-04:  39%|███▉      | 6407/16329 [53:57<1:21:37,  2.03it/s][A
epoch 1 iter 6407: train loss 1.77672. lr 5.447823e-04:  39%|███▉      | 6407/16329 [53:58<1:21:37,  2.03it/s][A
epoch 1 iter 6407: train loss 1.77672. lr 5.447823e-04:  39%|███▉      | 6408/16329 [53:58<1:21:40,  2.02it/s][A
epoch 1 iter 6408: train loss 1.83339. lr 5.447656e-04:  39%|███▉      | 6408/16329 [53:58<1:21:40,  2.02it/s][A
epoch 1 iter 6408: train loss 1.83339. lr 5.447656e-04:  39%|███▉      | 6409/16329 [53:

epoch 1 iter 6440: train loss 1.81552. lr 5.442305e-04:  39%|███▉      | 6440/16329 [54:14<1:22:01,  2.01it/s][A
epoch 1 iter 6440: train loss 1.81552. lr 5.442305e-04:  39%|███▉      | 6441/16329 [54:14<1:21:58,  2.01it/s][A
epoch 1 iter 6441: train loss 1.80004. lr 5.442137e-04:  39%|███▉      | 6441/16329 [54:15<1:21:58,  2.01it/s][A
epoch 1 iter 6441: train loss 1.80004. lr 5.442137e-04:  39%|███▉      | 6442/16329 [54:15<1:22:06,  2.01it/s][A
epoch 1 iter 6442: train loss 1.76679. lr 5.441970e-04:  39%|███▉      | 6442/16329 [54:15<1:22:06,  2.01it/s][A
epoch 1 iter 6442: train loss 1.76679. lr 5.441970e-04:  39%|███▉      | 6443/16329 [54:15<1:22:02,  2.01it/s][A
epoch 1 iter 6443: train loss 1.78918. lr 5.441802e-04:  39%|███▉      | 6443/16329 [54:16<1:22:02,  2.01it/s][A
epoch 1 iter 6443: train loss 1.78918. lr 5.441802e-04:  39%|███▉      | 6444/16329 [54:16<1:22:15,  2.00it/s][A
epoch 1 iter 6444: train loss 1.75843. lr 5.441634e-04:  39%|███▉      | 6444/16329 [54:

epoch 1 iter 6475: train loss 1.78830. lr 5.436425e-04:  40%|███▉      | 6476/16329 [54:33<1:29:14,  1.84it/s][A
epoch 1 iter 6476: train loss 1.82940. lr 5.436257e-04:  40%|███▉      | 6476/16329 [54:33<1:29:14,  1.84it/s][A
epoch 1 iter 6476: train loss 1.82940. lr 5.436257e-04:  40%|███▉      | 6477/16329 [54:33<1:27:00,  1.89it/s][A
epoch 1 iter 6477: train loss 1.79560. lr 5.436088e-04:  40%|███▉      | 6477/16329 [54:34<1:27:00,  1.89it/s][A
epoch 1 iter 6477: train loss 1.79560. lr 5.436088e-04:  40%|███▉      | 6478/16329 [54:34<1:25:33,  1.92it/s][A
epoch 1 iter 6478: train loss 1.81016. lr 5.435920e-04:  40%|███▉      | 6478/16329 [54:34<1:25:33,  1.92it/s][A
epoch 1 iter 6478: train loss 1.81016. lr 5.435920e-04:  40%|███▉      | 6479/16329 [54:34<1:24:22,  1.95it/s][A
epoch 1 iter 6479: train loss 1.70343. lr 5.435751e-04:  40%|███▉      | 6479/16329 [54:35<1:24:22,  1.95it/s][A
epoch 1 iter 6479: train loss 1.70343. lr 5.435751e-04:  40%|███▉      | 6480/16329 [54:

epoch 1 iter 6511: train loss 1.75502. lr 5.430349e-04:  40%|███▉      | 6511/16329 [54:51<1:21:31,  2.01it/s][A
epoch 1 iter 6511: train loss 1.75502. lr 5.430349e-04:  40%|███▉      | 6512/16329 [54:51<1:30:04,  1.82it/s][A
epoch 1 iter 6512: train loss 1.74041. lr 5.430179e-04:  40%|███▉      | 6512/16329 [54:52<1:30:04,  1.82it/s][A
epoch 1 iter 6512: train loss 1.74041. lr 5.430179e-04:  40%|███▉      | 6513/16329 [54:52<1:27:37,  1.87it/s][A
epoch 1 iter 6513: train loss 1.76092. lr 5.430010e-04:  40%|███▉      | 6513/16329 [54:52<1:27:37,  1.87it/s][A
epoch 1 iter 6513: train loss 1.76092. lr 5.430010e-04:  40%|███▉      | 6514/16329 [54:52<1:25:25,  1.91it/s][A
epoch 1 iter 6514: train loss 1.81595. lr 5.429841e-04:  40%|███▉      | 6514/16329 [54:53<1:25:25,  1.91it/s][A
epoch 1 iter 6514: train loss 1.81595. lr 5.429841e-04:  40%|███▉      | 6515/16329 [54:53<1:24:30,  1.94it/s][A
epoch 1 iter 6515: train loss 1.79159. lr 5.429672e-04:  40%|███▉      | 6515/16329 [54:

epoch 1 iter 6546: train loss 1.78637. lr 5.424413e-04:  40%|████      | 6547/16329 [55:09<1:30:05,  1.81it/s][A
epoch 1 iter 6547: train loss 1.75088. lr 5.424243e-04:  40%|████      | 6547/16329 [55:09<1:30:05,  1.81it/s][A
epoch 1 iter 6547: train loss 1.75088. lr 5.424243e-04:  40%|████      | 6548/16329 [55:09<1:27:12,  1.87it/s][A
epoch 1 iter 6548: train loss 1.73059. lr 5.424073e-04:  40%|████      | 6548/16329 [55:10<1:27:12,  1.87it/s][A
epoch 1 iter 6548: train loss 1.73059. lr 5.424073e-04:  40%|████      | 6549/16329 [55:10<1:25:24,  1.91it/s][A
epoch 1 iter 6549: train loss 1.74347. lr 5.423903e-04:  40%|████      | 6549/16329 [55:10<1:25:24,  1.91it/s][A
epoch 1 iter 6549: train loss 1.74347. lr 5.423903e-04:  40%|████      | 6550/16329 [55:10<1:23:56,  1.94it/s][A
epoch 1 iter 6550: train loss 1.72819. lr 5.423733e-04:  40%|████      | 6550/16329 [55:11<1:23:56,  1.94it/s][A
epoch 1 iter 6550: train loss 1.72819. lr 5.423733e-04:  40%|████      | 6551/16329 [55:

epoch 1 iter 6582: train loss 1.80522. lr 5.418279e-04:  40%|████      | 6582/16329 [55:27<1:21:17,  2.00it/s][A
epoch 1 iter 6582: train loss 1.80522. lr 5.418279e-04:  40%|████      | 6583/16329 [55:27<1:21:04,  2.00it/s][A
epoch 1 iter 6583: train loss 1.72127. lr 5.418108e-04:  40%|████      | 6583/16329 [55:27<1:21:04,  2.00it/s][A
epoch 1 iter 6583: train loss 1.72127. lr 5.418108e-04:  40%|████      | 6584/16329 [55:27<1:20:43,  2.01it/s][A
epoch 1 iter 6584: train loss 1.74689. lr 5.417937e-04:  40%|████      | 6584/16329 [55:28<1:20:43,  2.01it/s][A
epoch 1 iter 6584: train loss 1.74689. lr 5.417937e-04:  40%|████      | 6585/16329 [55:28<1:20:49,  2.01it/s][A
epoch 1 iter 6585: train loss 1.76890. lr 5.417766e-04:  40%|████      | 6585/16329 [55:28<1:20:49,  2.01it/s][A
epoch 1 iter 6585: train loss 1.76890. lr 5.417766e-04:  40%|████      | 6586/16329 [55:28<1:20:44,  2.01it/s][A
epoch 1 iter 6586: train loss 1.78991. lr 5.417595e-04:  40%|████      | 6586/16329 [55:

epoch 1 iter 6617: train loss 1.73785. lr 5.412287e-04:  41%|████      | 6618/16329 [55:45<1:21:21,  1.99it/s][A
epoch 1 iter 6618: train loss 1.82317. lr 5.412116e-04:  41%|████      | 6618/16329 [55:45<1:21:21,  1.99it/s][A
epoch 1 iter 6618: train loss 1.82317. lr 5.412116e-04:  41%|████      | 6619/16329 [55:45<1:24:46,  1.91it/s][A
epoch 1 iter 6619: train loss 1.77721. lr 5.411944e-04:  41%|████      | 6619/16329 [55:46<1:24:46,  1.91it/s][A
epoch 1 iter 6619: train loss 1.77721. lr 5.411944e-04:  41%|████      | 6620/16329 [55:46<1:26:49,  1.86it/s][A
epoch 1 iter 6620: train loss 1.72719. lr 5.411773e-04:  41%|████      | 6620/16329 [55:46<1:26:49,  1.86it/s][A
epoch 1 iter 6620: train loss 1.72719. lr 5.411773e-04:  41%|████      | 6621/16329 [55:46<1:27:21,  1.85it/s][A
epoch 1 iter 6621: train loss 1.79448. lr 5.411601e-04:  41%|████      | 6621/16329 [55:47<1:27:21,  1.85it/s][A
epoch 1 iter 6621: train loss 1.79448. lr 5.411601e-04:  41%|████      | 6622/16329 [55:

epoch 1 iter 6653: train loss 1.69253. lr 5.406096e-04:  41%|████      | 6653/16329 [56:03<1:20:06,  2.01it/s][A
epoch 1 iter 6653: train loss 1.69253. lr 5.406096e-04:  41%|████      | 6654/16329 [56:03<1:20:01,  2.02it/s][A
epoch 1 iter 6654: train loss 1.78479. lr 5.405924e-04:  41%|████      | 6654/16329 [56:04<1:20:01,  2.02it/s][A
epoch 1 iter 6654: train loss 1.78479. lr 5.405924e-04:  41%|████      | 6655/16329 [56:04<1:20:18,  2.01it/s][A
epoch 1 iter 6655: train loss 1.76346. lr 5.405751e-04:  41%|████      | 6655/16329 [56:04<1:20:18,  2.01it/s][A
epoch 1 iter 6655: train loss 1.76346. lr 5.405751e-04:  41%|████      | 6656/16329 [56:04<1:20:16,  2.01it/s][A
epoch 1 iter 6656: train loss 1.73513. lr 5.405579e-04:  41%|████      | 6656/16329 [56:05<1:20:16,  2.01it/s][A
epoch 1 iter 6656: train loss 1.73513. lr 5.405579e-04:  41%|████      | 6657/16329 [56:05<1:20:14,  2.01it/s][A
epoch 1 iter 6657: train loss 1.77278. lr 5.405407e-04:  41%|████      | 6657/16329 [56:

epoch 1 iter 6688: train loss 1.75444. lr 5.400049e-04:  41%|████      | 6689/16329 [56:21<1:20:02,  2.01it/s][A
epoch 1 iter 6689: train loss 1.72691. lr 5.399876e-04:  41%|████      | 6689/16329 [56:21<1:20:02,  2.01it/s][A
epoch 1 iter 6689: train loss 1.72691. lr 5.399876e-04:  41%|████      | 6690/16329 [56:21<1:19:59,  2.01it/s][A
epoch 1 iter 6690: train loss 1.78764. lr 5.399703e-04:  41%|████      | 6690/16329 [56:22<1:19:59,  2.01it/s][A
epoch 1 iter 6690: train loss 1.78764. lr 5.399703e-04:  41%|████      | 6691/16329 [56:22<1:20:01,  2.01it/s][A
epoch 1 iter 6691: train loss 1.73291. lr 5.399530e-04:  41%|████      | 6691/16329 [56:22<1:20:01,  2.01it/s][A
epoch 1 iter 6691: train loss 1.73291. lr 5.399530e-04:  41%|████      | 6692/16329 [56:22<1:19:55,  2.01it/s][A
epoch 1 iter 6692: train loss 1.71189. lr 5.399357e-04:  41%|████      | 6692/16329 [56:23<1:19:55,  2.01it/s][A
epoch 1 iter 6692: train loss 1.71189. lr 5.399357e-04:  41%|████      | 6693/16329 [56:

epoch 1 iter 6724: train loss 1.77894. lr 5.393801e-04:  41%|████      | 6724/16329 [56:39<1:19:55,  2.00it/s][A
epoch 1 iter 6724: train loss 1.77894. lr 5.393801e-04:  41%|████      | 6725/16329 [56:39<1:19:52,  2.00it/s][A
epoch 1 iter 6725: train loss 1.71145. lr 5.393627e-04:  41%|████      | 6725/16329 [56:40<1:19:52,  2.00it/s][A
epoch 1 iter 6725: train loss 1.71145. lr 5.393627e-04:  41%|████      | 6726/16329 [56:40<1:28:02,  1.82it/s][A
epoch 1 iter 6726: train loss 1.70655. lr 5.393453e-04:  41%|████      | 6726/16329 [56:40<1:28:02,  1.82it/s][A
epoch 1 iter 6726: train loss 1.70655. lr 5.393453e-04:  41%|████      | 6727/16329 [56:40<1:28:00,  1.82it/s][A
epoch 1 iter 6727: train loss 1.71800. lr 5.393279e-04:  41%|████      | 6727/16329 [56:41<1:28:00,  1.82it/s][A
epoch 1 iter 6727: train loss 1.71800. lr 5.393279e-04:  41%|████      | 6728/16329 [56:41<1:27:10,  1.84it/s][A
epoch 1 iter 6728: train loss 1.79682. lr 5.393105e-04:  41%|████      | 6728/16329 [56:

epoch 1 iter 6759: train loss 1.72444. lr 5.387699e-04:  41%|████▏     | 6760/16329 [56:57<1:19:28,  2.01it/s][A
epoch 1 iter 6760: train loss 1.73880. lr 5.387525e-04:  41%|████▏     | 6760/16329 [56:57<1:19:28,  2.01it/s][A
epoch 1 iter 6760: train loss 1.73880. lr 5.387525e-04:  41%|████▏     | 6761/16329 [56:57<1:19:11,  2.01it/s][A
epoch 1 iter 6761: train loss 1.71292. lr 5.387350e-04:  41%|████▏     | 6761/16329 [56:58<1:19:11,  2.01it/s][A
epoch 1 iter 6761: train loss 1.71292. lr 5.387350e-04:  41%|████▏     | 6762/16329 [56:58<1:18:57,  2.02it/s][A
epoch 1 iter 6762: train loss 1.73646. lr 5.387175e-04:  41%|████▏     | 6762/16329 [56:58<1:18:57,  2.02it/s][A
epoch 1 iter 6762: train loss 1.73646. lr 5.387175e-04:  41%|████▏     | 6763/16329 [56:58<1:18:36,  2.03it/s][A
epoch 1 iter 6763: train loss 1.69827. lr 5.387000e-04:  41%|████▏     | 6763/16329 [56:59<1:18:36,  2.03it/s][A
epoch 1 iter 6763: train loss 1.69827. lr 5.387000e-04:  41%|████▏     | 6764/16329 [56:

epoch 1 iter 6795: train loss 1.75554. lr 5.381395e-04:  42%|████▏     | 6795/16329 [57:15<1:22:39,  1.92it/s][A
epoch 1 iter 6795: train loss 1.75554. lr 5.381395e-04:  42%|████▏     | 6796/16329 [57:15<1:22:00,  1.94it/s][A
epoch 1 iter 6796: train loss 1.74445. lr 5.381219e-04:  42%|████▏     | 6796/16329 [57:16<1:22:00,  1.94it/s][A
epoch 1 iter 6796: train loss 1.74445. lr 5.381219e-04:  42%|████▏     | 6797/16329 [57:16<1:21:25,  1.95it/s][A
epoch 1 iter 6797: train loss 1.69638. lr 5.381044e-04:  42%|████▏     | 6797/16329 [57:16<1:21:25,  1.95it/s][A
epoch 1 iter 6797: train loss 1.69638. lr 5.381044e-04:  42%|████▏     | 6798/16329 [57:16<1:20:48,  1.97it/s][A
epoch 1 iter 6798: train loss 1.72160. lr 5.380868e-04:  42%|████▏     | 6798/16329 [57:17<1:20:48,  1.97it/s][A
epoch 1 iter 6798: train loss 1.72160. lr 5.380868e-04:  42%|████▏     | 6799/16329 [57:17<1:20:00,  1.99it/s][A
epoch 1 iter 6799: train loss 1.74714. lr 5.380693e-04:  42%|████▏     | 6799/16329 [57:

epoch 1 iter 6830: train loss 1.74847. lr 5.375238e-04:  42%|████▏     | 6831/16329 [57:33<1:19:32,  1.99it/s][A
epoch 1 iter 6831: train loss 1.74272. lr 5.375062e-04:  42%|████▏     | 6831/16329 [57:33<1:19:32,  1.99it/s][A
epoch 1 iter 6831: train loss 1.74272. lr 5.375062e-04:  42%|████▏     | 6832/16329 [57:33<1:18:57,  2.00it/s][A
epoch 1 iter 6832: train loss 1.77107. lr 5.374886e-04:  42%|████▏     | 6832/16329 [57:34<1:18:57,  2.00it/s][A
epoch 1 iter 6832: train loss 1.77107. lr 5.374886e-04:  42%|████▏     | 6833/16329 [57:34<1:18:52,  2.01it/s][A
epoch 1 iter 6833: train loss 1.76944. lr 5.374709e-04:  42%|████▏     | 6833/16329 [57:34<1:18:52,  2.01it/s][A
epoch 1 iter 6833: train loss 1.76944. lr 5.374709e-04:  42%|████▏     | 6834/16329 [57:34<1:18:31,  2.02it/s][A
epoch 1 iter 6834: train loss 1.70331. lr 5.374533e-04:  42%|████▏     | 6834/16329 [57:35<1:18:31,  2.02it/s][A
epoch 1 iter 6834: train loss 1.70331. lr 5.374533e-04:  42%|████▏     | 6835/16329 [57:

epoch 1 iter 6866: train loss 1.72371. lr 5.368877e-04:  42%|████▏     | 6866/16329 [57:51<1:17:41,  2.03it/s][A
epoch 1 iter 6866: train loss 1.72371. lr 5.368877e-04:  42%|████▏     | 6867/16329 [57:51<1:17:43,  2.03it/s][A
epoch 1 iter 6867: train loss 1.74510. lr 5.368700e-04:  42%|████▏     | 6867/16329 [57:52<1:17:43,  2.03it/s][A
epoch 1 iter 6867: train loss 1.74510. lr 5.368700e-04:  42%|████▏     | 6868/16329 [57:52<1:17:37,  2.03it/s][A
epoch 1 iter 6868: train loss 1.75057. lr 5.368523e-04:  42%|████▏     | 6868/16329 [57:52<1:17:37,  2.03it/s][A
epoch 1 iter 6868: train loss 1.75057. lr 5.368523e-04:  42%|████▏     | 6869/16329 [57:52<1:17:48,  2.03it/s][A
epoch 1 iter 6869: train loss 1.73244. lr 5.368346e-04:  42%|████▏     | 6869/16329 [57:53<1:17:48,  2.03it/s][A
epoch 1 iter 6869: train loss 1.73244. lr 5.368346e-04:  42%|████▏     | 6870/16329 [57:53<1:17:53,  2.02it/s][A
epoch 1 iter 6870: train loss 1.66735. lr 5.368169e-04:  42%|████▏     | 6870/16329 [57:

epoch 1 iter 6901: train loss 1.70837. lr 5.362666e-04:  42%|████▏     | 6902/16329 [58:09<1:17:51,  2.02it/s][A
epoch 1 iter 6902: train loss 1.77473. lr 5.362488e-04:  42%|████▏     | 6902/16329 [58:09<1:17:51,  2.02it/s][A
epoch 1 iter 6902: train loss 1.77473. lr 5.362488e-04:  42%|████▏     | 6903/16329 [58:09<1:17:35,  2.02it/s][A
epoch 1 iter 6903: train loss 1.69439. lr 5.362310e-04:  42%|████▏     | 6903/16329 [58:10<1:17:35,  2.02it/s][A
epoch 1 iter 6903: train loss 1.69439. lr 5.362310e-04:  42%|████▏     | 6904/16329 [58:10<1:17:35,  2.02it/s][A
epoch 1 iter 6904: train loss 1.73440. lr 5.362132e-04:  42%|████▏     | 6904/16329 [58:10<1:17:35,  2.02it/s][A
epoch 1 iter 6904: train loss 1.73440. lr 5.362132e-04:  42%|████▏     | 6905/16329 [58:10<1:17:31,  2.03it/s][A
epoch 1 iter 6905: train loss 1.73542. lr 5.361954e-04:  42%|████▏     | 6905/16329 [58:11<1:17:31,  2.03it/s][A
epoch 1 iter 6905: train loss 1.73542. lr 5.361954e-04:  42%|████▏     | 6906/16329 [58:

epoch 1 iter 6937: train loss 1.68077. lr 5.356249e-04:  42%|████▏     | 6937/16329 [58:27<1:18:00,  2.01it/s][A
epoch 1 iter 6937: train loss 1.68077. lr 5.356249e-04:  42%|████▏     | 6938/16329 [58:27<1:17:57,  2.01it/s][A
epoch 1 iter 6938: train loss 1.71789. lr 5.356071e-04:  42%|████▏     | 6938/16329 [58:27<1:17:57,  2.01it/s][A
epoch 1 iter 6938: train loss 1.71789. lr 5.356071e-04:  42%|████▏     | 6939/16329 [58:27<1:17:45,  2.01it/s][A
epoch 1 iter 6939: train loss 1.79735. lr 5.355892e-04:  42%|████▏     | 6939/16329 [58:28<1:17:45,  2.01it/s][A
epoch 1 iter 6939: train loss 1.79735. lr 5.355892e-04:  43%|████▎     | 6940/16329 [58:28<1:17:31,  2.02it/s][A
epoch 1 iter 6940: train loss 1.68115. lr 5.355713e-04:  43%|████▎     | 6940/16329 [58:28<1:17:31,  2.02it/s][A
epoch 1 iter 6940: train loss 1.68115. lr 5.355713e-04:  43%|████▎     | 6941/16329 [58:28<1:17:30,  2.02it/s][A
epoch 1 iter 6941: train loss 1.70683. lr 5.355535e-04:  43%|████▎     | 6941/16329 [58:

epoch 1 iter 6972: train loss 1.72949. lr 5.349984e-04:  43%|████▎     | 6973/16329 [58:44<1:16:59,  2.03it/s][A
epoch 1 iter 6973: train loss 1.69796. lr 5.349804e-04:  43%|████▎     | 6973/16329 [58:45<1:16:59,  2.03it/s][A
epoch 1 iter 6973: train loss 1.69796. lr 5.349804e-04:  43%|████▎     | 6974/16329 [58:45<1:17:03,  2.02it/s][A
epoch 1 iter 6974: train loss 1.71992. lr 5.349625e-04:  43%|████▎     | 6974/16329 [58:45<1:17:03,  2.02it/s][A
epoch 1 iter 6974: train loss 1.71992. lr 5.349625e-04:  43%|████▎     | 6975/16329 [58:45<1:16:57,  2.03it/s][A
epoch 1 iter 6975: train loss 1.69895. lr 5.349445e-04:  43%|████▎     | 6975/16329 [58:46<1:16:57,  2.03it/s][A
epoch 1 iter 6975: train loss 1.69895. lr 5.349445e-04:  43%|████▎     | 6976/16329 [58:46<1:16:49,  2.03it/s][A
epoch 1 iter 6976: train loss 1.67295. lr 5.349266e-04:  43%|████▎     | 6976/16329 [58:46<1:16:49,  2.03it/s][A
epoch 1 iter 6976: train loss 1.67295. lr 5.349266e-04:  43%|████▎     | 6977/16329 [58:

epoch 1 iter 7008: train loss 1.69304. lr 5.343511e-04:  43%|████▎     | 7008/16329 [59:02<1:17:08,  2.01it/s][A
epoch 1 iter 7008: train loss 1.69304. lr 5.343511e-04:  43%|████▎     | 7009/16329 [59:02<1:16:56,  2.02it/s][A
epoch 1 iter 7009: train loss 1.72716. lr 5.343331e-04:  43%|████▎     | 7009/16329 [59:03<1:16:56,  2.02it/s][A
epoch 1 iter 7009: train loss 1.72716. lr 5.343331e-04:  43%|████▎     | 7010/16329 [59:03<1:17:07,  2.01it/s][A
epoch 1 iter 7010: train loss 1.70864. lr 5.343151e-04:  43%|████▎     | 7010/16329 [59:03<1:17:07,  2.01it/s][A
epoch 1 iter 7010: train loss 1.70864. lr 5.343151e-04:  43%|████▎     | 7011/16329 [59:03<1:16:47,  2.02it/s][A
epoch 1 iter 7011: train loss 1.69911. lr 5.342971e-04:  43%|████▎     | 7011/16329 [59:04<1:16:47,  2.02it/s][A
epoch 1 iter 7011: train loss 1.69911. lr 5.342971e-04:  43%|████▎     | 7012/16329 [59:04<1:16:58,  2.02it/s][A
epoch 1 iter 7012: train loss 1.67341. lr 5.342790e-04:  43%|████▎     | 7012/16329 [59:

epoch 1 iter 7043: train loss 1.65994. lr 5.337192e-04:  43%|████▎     | 7044/16329 [59:20<1:17:01,  2.01it/s][A
epoch 1 iter 7044: train loss 1.71178. lr 5.337011e-04:  43%|████▎     | 7044/16329 [59:20<1:17:01,  2.01it/s][A
epoch 1 iter 7044: train loss 1.71178. lr 5.337011e-04:  43%|████▎     | 7045/16329 [59:20<1:17:00,  2.01it/s][A
epoch 1 iter 7045: train loss 1.72803. lr 5.336830e-04:  43%|████▎     | 7045/16329 [59:21<1:17:00,  2.01it/s][A
epoch 1 iter 7045: train loss 1.72803. lr 5.336830e-04:  43%|████▎     | 7046/16329 [59:21<1:16:49,  2.01it/s][A
epoch 1 iter 7046: train loss 1.68316. lr 5.336649e-04:  43%|████▎     | 7046/16329 [59:21<1:16:49,  2.01it/s][A
epoch 1 iter 7046: train loss 1.68316. lr 5.336649e-04:  43%|████▎     | 7047/16329 [59:21<1:16:56,  2.01it/s][A
epoch 1 iter 7047: train loss 1.76275. lr 5.336468e-04:  43%|████▎     | 7047/16329 [59:22<1:16:56,  2.01it/s][A
epoch 1 iter 7047: train loss 1.76275. lr 5.336468e-04:  43%|████▎     | 7048/16329 [59:

epoch 1 iter 7079: train loss 1.71242. lr 5.330664e-04:  43%|████▎     | 7079/16329 [59:38<1:16:25,  2.02it/s][A
epoch 1 iter 7079: train loss 1.71242. lr 5.330664e-04:  43%|████▎     | 7080/16329 [59:38<1:24:11,  1.83it/s][A
epoch 1 iter 7080: train loss 1.67931. lr 5.330482e-04:  43%|████▎     | 7080/16329 [59:39<1:24:11,  1.83it/s][A
epoch 1 iter 7080: train loss 1.67931. lr 5.330482e-04:  43%|████▎     | 7081/16329 [59:39<1:22:07,  1.88it/s][A
epoch 1 iter 7081: train loss 1.71049. lr 5.330300e-04:  43%|████▎     | 7081/16329 [59:39<1:22:07,  1.88it/s][A
epoch 1 iter 7081: train loss 1.71049. lr 5.330300e-04:  43%|████▎     | 7082/16329 [59:39<1:20:28,  1.92it/s][A
epoch 1 iter 7082: train loss 1.65538. lr 5.330119e-04:  43%|████▎     | 7082/16329 [59:40<1:20:28,  1.92it/s][A
epoch 1 iter 7082: train loss 1.65538. lr 5.330119e-04:  43%|████▎     | 7083/16329 [59:40<1:19:01,  1.95it/s][A
epoch 1 iter 7083: train loss 1.72424. lr 5.329937e-04:  43%|████▎     | 7083/16329 [59:

epoch 1 iter 7114: train loss 1.68441. lr 5.324291e-04:  44%|████▎     | 7115/16329 [59:56<1:16:39,  2.00it/s][A
epoch 1 iter 7115: train loss 1.73186. lr 5.324108e-04:  44%|████▎     | 7115/16329 [59:56<1:16:39,  2.00it/s][A
epoch 1 iter 7115: train loss 1.73186. lr 5.324108e-04:  44%|████▎     | 7116/16329 [59:56<1:16:20,  2.01it/s][A
epoch 1 iter 7116: train loss 1.66507. lr 5.323926e-04:  44%|████▎     | 7116/16329 [59:57<1:16:20,  2.01it/s][A
epoch 1 iter 7116: train loss 1.66507. lr 5.323926e-04:  44%|████▎     | 7117/16329 [59:57<1:16:24,  2.01it/s][A
epoch 1 iter 7117: train loss 1.69734. lr 5.323743e-04:  44%|████▎     | 7117/16329 [59:57<1:16:24,  2.01it/s][A
epoch 1 iter 7117: train loss 1.69734. lr 5.323743e-04:  44%|████▎     | 7118/16329 [59:57<1:16:10,  2.02it/s][A
epoch 1 iter 7118: train loss 1.70757. lr 5.323561e-04:  44%|████▎     | 7118/16329 [59:58<1:16:10,  2.02it/s][A
epoch 1 iter 7118: train loss 1.70757. lr 5.323561e-04:  44%|████▎     | 7119/16329 [59:

epoch 1 iter 7149: train loss 1.68637. lr 5.317891e-04:  44%|████▍     | 7150/16329 [1:00:13<1:18:21,  1.95it/s][A
epoch 1 iter 7150: train loss 1.65382. lr 5.317708e-04:  44%|████▍     | 7150/16329 [1:00:14<1:18:21,  1.95it/s][A
epoch 1 iter 7150: train loss 1.65382. lr 5.317708e-04:  44%|████▍     | 7151/16329 [1:00:14<1:17:21,  1.98it/s][A
epoch 1 iter 7151: train loss 1.71787. lr 5.317525e-04:  44%|████▍     | 7151/16329 [1:00:14<1:17:21,  1.98it/s][A
epoch 1 iter 7151: train loss 1.71787. lr 5.317525e-04:  44%|████▍     | 7152/16329 [1:00:14<1:16:38,  2.00it/s][A
epoch 1 iter 7152: train loss 1.72482. lr 5.317341e-04:  44%|████▍     | 7152/16329 [1:00:15<1:16:38,  2.00it/s][A
epoch 1 iter 7152: train loss 1.72482. lr 5.317341e-04:  44%|████▍     | 7153/16329 [1:00:15<1:16:33,  2.00it/s][A
epoch 1 iter 7153: train loss 1.73013. lr 5.317158e-04:  44%|████▍     | 7153/16329 [1:00:15<1:16:33,  2.00it/s][A
epoch 1 iter 7153: train loss 1.73013. lr 5.317158e-04:  44%|████▍     |

epoch 1 iter 7184: train loss 1.69625. lr 5.311465e-04:  44%|████▍     | 7185/16329 [1:00:31<1:19:33,  1.92it/s][A
epoch 1 iter 7185: train loss 1.65713. lr 5.311281e-04:  44%|████▍     | 7185/16329 [1:00:32<1:19:33,  1.92it/s][A
epoch 1 iter 7185: train loss 1.65713. lr 5.311281e-04:  44%|████▍     | 7186/16329 [1:00:32<1:18:32,  1.94it/s][A
epoch 1 iter 7186: train loss 1.71549. lr 5.311097e-04:  44%|████▍     | 7186/16329 [1:00:32<1:18:32,  1.94it/s][A
epoch 1 iter 7186: train loss 1.71549. lr 5.311097e-04:  44%|████▍     | 7187/16329 [1:00:32<1:17:41,  1.96it/s][A
epoch 1 iter 7187: train loss 1.69094. lr 5.310913e-04:  44%|████▍     | 7187/16329 [1:00:32<1:17:41,  1.96it/s][A
epoch 1 iter 7187: train loss 1.69094. lr 5.310913e-04:  44%|████▍     | 7188/16329 [1:00:32<1:17:05,  1.98it/s][A
epoch 1 iter 7188: train loss 1.68169. lr 5.310729e-04:  44%|████▍     | 7188/16329 [1:00:33<1:17:05,  1.98it/s][A
epoch 1 iter 7188: train loss 1.68169. lr 5.310729e-04:  44%|████▍     |

epoch 1 iter 7219: train loss 1.69855. lr 5.305013e-04:  44%|████▍     | 7220/16329 [1:00:49<1:18:20,  1.94it/s][A
epoch 1 iter 7220: train loss 1.63321. lr 5.304828e-04:  44%|████▍     | 7220/16329 [1:00:49<1:18:20,  1.94it/s][A
epoch 1 iter 7220: train loss 1.63321. lr 5.304828e-04:  44%|████▍     | 7221/16329 [1:00:49<1:18:18,  1.94it/s][A
epoch 1 iter 7221: train loss 1.70591. lr 5.304644e-04:  44%|████▍     | 7221/16329 [1:00:50<1:18:18,  1.94it/s][A
epoch 1 iter 7221: train loss 1.70591. lr 5.304644e-04:  44%|████▍     | 7222/16329 [1:00:50<1:17:56,  1.95it/s][A
epoch 1 iter 7222: train loss 1.66458. lr 5.304459e-04:  44%|████▍     | 7222/16329 [1:00:50<1:17:56,  1.95it/s][A
epoch 1 iter 7222: train loss 1.66458. lr 5.304459e-04:  44%|████▍     | 7223/16329 [1:00:50<1:17:33,  1.96it/s][A
epoch 1 iter 7223: train loss 1.65770. lr 5.304274e-04:  44%|████▍     | 7223/16329 [1:00:51<1:17:33,  1.96it/s][A
epoch 1 iter 7223: train loss 1.65770. lr 5.304274e-04:  44%|████▍     |

epoch 1 iter 7254: train loss 1.66629. lr 5.298535e-04:  44%|████▍     | 7255/16329 [1:01:06<1:18:11,  1.93it/s][A
epoch 1 iter 7255: train loss 1.67652. lr 5.298349e-04:  44%|████▍     | 7255/16329 [1:01:07<1:18:11,  1.93it/s][A
epoch 1 iter 7255: train loss 1.67652. lr 5.298349e-04:  44%|████▍     | 7256/16329 [1:01:07<1:17:22,  1.95it/s][A
epoch 1 iter 7256: train loss 1.67923. lr 5.298164e-04:  44%|████▍     | 7256/16329 [1:01:07<1:17:22,  1.95it/s][A
epoch 1 iter 7256: train loss 1.67923. lr 5.298164e-04:  44%|████▍     | 7257/16329 [1:01:07<1:16:44,  1.97it/s][A
epoch 1 iter 7257: train loss 1.67893. lr 5.297978e-04:  44%|████▍     | 7257/16329 [1:01:08<1:16:44,  1.97it/s][A
epoch 1 iter 7257: train loss 1.67893. lr 5.297978e-04:  44%|████▍     | 7258/16329 [1:01:08<1:16:15,  1.98it/s][A
epoch 1 iter 7258: train loss 1.68867. lr 5.297793e-04:  44%|████▍     | 7258/16329 [1:01:08<1:16:15,  1.98it/s][A
epoch 1 iter 7258: train loss 1.68867. lr 5.297793e-04:  44%|████▍     |

epoch 1 iter 7289: train loss 1.66565. lr 5.292031e-04:  45%|████▍     | 7290/16329 [1:01:24<1:14:13,  2.03it/s][A
epoch 1 iter 7290: train loss 1.67309. lr 5.291844e-04:  45%|████▍     | 7290/16329 [1:01:25<1:14:13,  2.03it/s][A
epoch 1 iter 7290: train loss 1.67309. lr 5.291844e-04:  45%|████▍     | 7291/16329 [1:01:25<1:14:11,  2.03it/s][A
epoch 1 iter 7291: train loss 1.64901. lr 5.291658e-04:  45%|████▍     | 7291/16329 [1:01:25<1:14:11,  2.03it/s][A
epoch 1 iter 7291: train loss 1.64901. lr 5.291658e-04:  45%|████▍     | 7292/16329 [1:01:25<1:14:20,  2.03it/s][A
epoch 1 iter 7292: train loss 1.64690. lr 5.291472e-04:  45%|████▍     | 7292/16329 [1:01:26<1:14:20,  2.03it/s][A
epoch 1 iter 7292: train loss 1.64690. lr 5.291472e-04:  45%|████▍     | 7293/16329 [1:01:26<1:13:59,  2.04it/s][A
epoch 1 iter 7293: train loss 1.65209. lr 5.291286e-04:  45%|████▍     | 7293/16329 [1:01:26<1:13:59,  2.04it/s][A
epoch 1 iter 7293: train loss 1.65209. lr 5.291286e-04:  45%|████▍     |

epoch 1 iter 7324: train loss 1.69642. lr 5.285500e-04:  45%|████▍     | 7325/16329 [1:01:42<1:14:11,  2.02it/s][A
epoch 1 iter 7325: train loss 1.66612. lr 5.285313e-04:  45%|████▍     | 7325/16329 [1:01:42<1:14:11,  2.02it/s][A
epoch 1 iter 7325: train loss 1.66612. lr 5.285313e-04:  45%|████▍     | 7326/16329 [1:01:42<1:14:03,  2.03it/s][A
epoch 1 iter 7326: train loss 1.63759. lr 5.285126e-04:  45%|████▍     | 7326/16329 [1:01:43<1:14:03,  2.03it/s][A
epoch 1 iter 7326: train loss 1.63759. lr 5.285126e-04:  45%|████▍     | 7327/16329 [1:01:43<1:13:56,  2.03it/s][A
epoch 1 iter 7327: train loss 1.69989. lr 5.284939e-04:  45%|████▍     | 7327/16329 [1:01:43<1:13:56,  2.03it/s][A
epoch 1 iter 7327: train loss 1.69989. lr 5.284939e-04:  45%|████▍     | 7328/16329 [1:01:43<1:13:59,  2.03it/s][A
epoch 1 iter 7328: train loss 1.68636. lr 5.284752e-04:  45%|████▍     | 7328/16329 [1:01:44<1:13:59,  2.03it/s][A
epoch 1 iter 7328: train loss 1.68636. lr 5.284752e-04:  45%|████▍     |

epoch 1 iter 7359: train loss 1.63380. lr 5.278944e-04:  45%|████▌     | 7360/16329 [1:01:59<1:14:12,  2.01it/s][A
epoch 1 iter 7360: train loss 1.67234. lr 5.278757e-04:  45%|████▌     | 7360/16329 [1:02:00<1:14:12,  2.01it/s][A
epoch 1 iter 7360: train loss 1.67234. lr 5.278757e-04:  45%|████▌     | 7361/16329 [1:02:00<1:21:41,  1.83it/s][A
epoch 1 iter 7361: train loss 1.68284. lr 5.278569e-04:  45%|████▌     | 7361/16329 [1:02:00<1:21:41,  1.83it/s][A
epoch 1 iter 7361: train loss 1.68284. lr 5.278569e-04:  45%|████▌     | 7362/16329 [1:02:00<1:19:35,  1.88it/s][A
epoch 1 iter 7362: train loss 1.66698. lr 5.278381e-04:  45%|████▌     | 7362/16329 [1:02:01<1:19:35,  1.88it/s][A
epoch 1 iter 7362: train loss 1.66698. lr 5.278381e-04:  45%|████▌     | 7363/16329 [1:02:01<1:19:14,  1.89it/s][A
epoch 1 iter 7363: train loss 1.66755. lr 5.278193e-04:  45%|████▌     | 7363/16329 [1:02:01<1:19:14,  1.89it/s][A
epoch 1 iter 7363: train loss 1.66755. lr 5.278193e-04:  45%|████▌     |

epoch 1 iter 7394: train loss 1.61718. lr 5.272362e-04:  45%|████▌     | 7395/16329 [1:02:17<1:13:50,  2.02it/s][A
epoch 1 iter 7395: train loss 1.71836. lr 5.272174e-04:  45%|████▌     | 7395/16329 [1:02:18<1:13:50,  2.02it/s][A
epoch 1 iter 7395: train loss 1.71836. lr 5.272174e-04:  45%|████▌     | 7396/16329 [1:02:18<1:13:56,  2.01it/s][A
epoch 1 iter 7396: train loss 1.68708. lr 5.271985e-04:  45%|████▌     | 7396/16329 [1:02:18<1:13:56,  2.01it/s][A
epoch 1 iter 7396: train loss 1.68708. lr 5.271985e-04:  45%|████▌     | 7397/16329 [1:02:18<1:13:47,  2.02it/s][A
epoch 1 iter 7397: train loss 1.65808. lr 5.271797e-04:  45%|████▌     | 7397/16329 [1:02:19<1:13:47,  2.02it/s][A
epoch 1 iter 7397: train loss 1.65808. lr 5.271797e-04:  45%|████▌     | 7398/16329 [1:02:19<1:13:42,  2.02it/s][A
epoch 1 iter 7398: train loss 1.62732. lr 5.271608e-04:  45%|████▌     | 7398/16329 [1:02:19<1:13:42,  2.02it/s][A
epoch 1 iter 7398: train loss 1.62732. lr 5.271608e-04:  45%|████▌     |

epoch 1 iter 7429: train loss 1.69830. lr 5.265754e-04:  46%|████▌     | 7430/16329 [1:02:35<1:13:57,  2.01it/s][A
epoch 1 iter 7430: train loss 1.64921. lr 5.265565e-04:  46%|████▌     | 7430/16329 [1:02:35<1:13:57,  2.01it/s][A
epoch 1 iter 7430: train loss 1.64921. lr 5.265565e-04:  46%|████▌     | 7431/16329 [1:02:35<1:13:41,  2.01it/s][A
epoch 1 iter 7431: train loss 1.65224. lr 5.265376e-04:  46%|████▌     | 7431/16329 [1:02:36<1:13:41,  2.01it/s][A
epoch 1 iter 7431: train loss 1.65224. lr 5.265376e-04:  46%|████▌     | 7432/16329 [1:02:36<1:16:04,  1.95it/s][A
epoch 1 iter 7432: train loss 1.64697. lr 5.265187e-04:  46%|████▌     | 7432/16329 [1:02:36<1:16:04,  1.95it/s][A
epoch 1 iter 7432: train loss 1.64697. lr 5.265187e-04:  46%|████▌     | 7433/16329 [1:02:36<1:17:43,  1.91it/s][A
epoch 1 iter 7433: train loss 1.63662. lr 5.264998e-04:  46%|████▌     | 7433/16329 [1:02:37<1:17:43,  1.91it/s][A
epoch 1 iter 7433: train loss 1.63662. lr 5.264998e-04:  46%|████▌     |

epoch 1 iter 7464: train loss 1.67779. lr 5.259121e-04:  46%|████▌     | 7465/16329 [1:02:52<1:15:05,  1.97it/s][A
epoch 1 iter 7465: train loss 1.64127. lr 5.258931e-04:  46%|████▌     | 7465/16329 [1:02:53<1:15:05,  1.97it/s][A
epoch 1 iter 7465: train loss 1.64127. lr 5.258931e-04:  46%|████▌     | 7466/16329 [1:02:53<1:14:34,  1.98it/s][A
epoch 1 iter 7466: train loss 1.68301. lr 5.258741e-04:  46%|████▌     | 7466/16329 [1:02:53<1:14:34,  1.98it/s][A
epoch 1 iter 7466: train loss 1.68301. lr 5.258741e-04:  46%|████▌     | 7467/16329 [1:02:53<1:13:52,  2.00it/s][A
epoch 1 iter 7467: train loss 1.67431. lr 5.258551e-04:  46%|████▌     | 7467/16329 [1:02:54<1:13:52,  2.00it/s][A
epoch 1 iter 7467: train loss 1.67431. lr 5.258551e-04:  46%|████▌     | 7468/16329 [1:02:54<1:13:48,  2.00it/s][A
epoch 1 iter 7468: train loss 1.64866. lr 5.258361e-04:  46%|████▌     | 7468/16329 [1:02:54<1:13:48,  2.00it/s][A
epoch 1 iter 7468: train loss 1.64866. lr 5.258361e-04:  46%|████▌     |

epoch 1 iter 7499: train loss 1.65168. lr 5.252462e-04:  46%|████▌     | 7500/16329 [1:03:10<1:13:11,  2.01it/s][A
epoch 1 iter 7500: train loss 1.65749. lr 5.252271e-04:  46%|████▌     | 7500/16329 [1:03:11<1:13:11,  2.01it/s][A
epoch 1 iter 7500: train loss 1.65749. lr 5.252271e-04:  46%|████▌     | 7501/16329 [1:03:11<1:13:08,  2.01it/s][A
epoch 1 iter 7501: train loss 1.63556. lr 5.252081e-04:  46%|████▌     | 7501/16329 [1:03:11<1:13:08,  2.01it/s][A
epoch 1 iter 7501: train loss 1.63556. lr 5.252081e-04:  46%|████▌     | 7502/16329 [1:03:11<1:12:49,  2.02it/s][A
epoch 1 iter 7502: train loss 1.61037. lr 5.251890e-04:  46%|████▌     | 7502/16329 [1:03:12<1:12:49,  2.02it/s][A
epoch 1 iter 7502: train loss 1.61037. lr 5.251890e-04:  46%|████▌     | 7503/16329 [1:03:12<1:15:47,  1.94it/s][A
epoch 1 iter 7503: train loss 1.62295. lr 5.251699e-04:  46%|████▌     | 7503/16329 [1:03:12<1:15:47,  1.94it/s][A
epoch 1 iter 7503: train loss 1.62295. lr 5.251699e-04:  46%|████▌     |

epoch 1 iter 7534: train loss 1.63461. lr 5.245777e-04:  46%|████▌     | 7535/16329 [1:03:28<1:13:10,  2.00it/s][A
epoch 1 iter 7535: train loss 1.64858. lr 5.245586e-04:  46%|████▌     | 7535/16329 [1:03:29<1:13:10,  2.00it/s][A
epoch 1 iter 7535: train loss 1.64858. lr 5.245586e-04:  46%|████▌     | 7536/16329 [1:03:29<1:13:10,  2.00it/s][A
epoch 1 iter 7536: train loss 1.64796. lr 5.245395e-04:  46%|████▌     | 7536/16329 [1:03:29<1:13:10,  2.00it/s][A
epoch 1 iter 7536: train loss 1.64796. lr 5.245395e-04:  46%|████▌     | 7537/16329 [1:03:29<1:12:58,  2.01it/s][A
epoch 1 iter 7537: train loss 1.66700. lr 5.245203e-04:  46%|████▌     | 7537/16329 [1:03:30<1:12:58,  2.01it/s][A
epoch 1 iter 7537: train loss 1.66700. lr 5.245203e-04:  46%|████▌     | 7538/16329 [1:03:30<1:12:51,  2.01it/s][A
epoch 1 iter 7538: train loss 1.68293. lr 5.245012e-04:  46%|████▌     | 7538/16329 [1:03:30<1:12:51,  2.01it/s][A
epoch 1 iter 7538: train loss 1.68293. lr 5.245012e-04:  46%|████▌     |

epoch 1 iter 7569: train loss 1.61165. lr 5.239067e-04:  46%|████▋     | 7570/16329 [1:03:46<1:12:58,  2.00it/s][A
epoch 1 iter 7570: train loss 1.67611. lr 5.238875e-04:  46%|████▋     | 7570/16329 [1:03:46<1:12:58,  2.00it/s][A
epoch 1 iter 7570: train loss 1.67611. lr 5.238875e-04:  46%|████▋     | 7571/16329 [1:03:46<1:12:42,  2.01it/s][A
epoch 1 iter 7571: train loss 1.62983. lr 5.238683e-04:  46%|████▋     | 7571/16329 [1:03:47<1:12:42,  2.01it/s][A
epoch 1 iter 7571: train loss 1.62983. lr 5.238683e-04:  46%|████▋     | 7572/16329 [1:03:47<1:12:42,  2.01it/s][A
epoch 1 iter 7572: train loss 1.66663. lr 5.238491e-04:  46%|████▋     | 7572/16329 [1:03:47<1:12:42,  2.01it/s][A
epoch 1 iter 7572: train loss 1.66663. lr 5.238491e-04:  46%|████▋     | 7573/16329 [1:03:47<1:12:25,  2.01it/s][A
epoch 1 iter 7573: train loss 1.69458. lr 5.238299e-04:  46%|████▋     | 7573/16329 [1:03:48<1:12:25,  2.01it/s][A
epoch 1 iter 7573: train loss 1.69458. lr 5.238299e-04:  46%|████▋     |

epoch 1 iter 7604: train loss 1.61026. lr 5.232332e-04:  47%|████▋     | 7605/16329 [1:04:03<1:11:56,  2.02it/s][A
epoch 1 iter 7605: train loss 1.62868. lr 5.232139e-04:  47%|████▋     | 7605/16329 [1:04:04<1:11:56,  2.02it/s][A
epoch 1 iter 7605: train loss 1.62868. lr 5.232139e-04:  47%|████▋     | 7606/16329 [1:04:04<1:11:59,  2.02it/s][A
epoch 1 iter 7606: train loss 1.64894. lr 5.231946e-04:  47%|████▋     | 7606/16329 [1:04:04<1:11:59,  2.02it/s][A
epoch 1 iter 7606: train loss 1.64894. lr 5.231946e-04:  47%|████▋     | 7607/16329 [1:04:04<1:11:55,  2.02it/s][A
epoch 1 iter 7607: train loss 1.64929. lr 5.231753e-04:  47%|████▋     | 7607/16329 [1:04:05<1:11:55,  2.02it/s][A
epoch 1 iter 7607: train loss 1.64929. lr 5.231753e-04:  47%|████▋     | 7608/16329 [1:04:05<1:12:00,  2.02it/s][A
epoch 1 iter 7608: train loss 1.61454. lr 5.231560e-04:  47%|████▋     | 7608/16329 [1:04:05<1:12:00,  2.02it/s][A
epoch 1 iter 7608: train loss 1.61454. lr 5.231560e-04:  47%|████▋     |

epoch 1 iter 7639: train loss 1.63058. lr 5.225571e-04:  47%|████▋     | 7640/16329 [1:04:21<1:11:53,  2.01it/s][A
epoch 1 iter 7640: train loss 1.62727. lr 5.225378e-04:  47%|████▋     | 7640/16329 [1:04:21<1:11:53,  2.01it/s][A
epoch 1 iter 7640: train loss 1.62727. lr 5.225378e-04:  47%|████▋     | 7641/16329 [1:04:21<1:12:01,  2.01it/s][A
epoch 1 iter 7641: train loss 1.60765. lr 5.225184e-04:  47%|████▋     | 7641/16329 [1:04:22<1:12:01,  2.01it/s][A
epoch 1 iter 7641: train loss 1.60765. lr 5.225184e-04:  47%|████▋     | 7642/16329 [1:04:22<1:12:09,  2.01it/s][A
epoch 1 iter 7642: train loss 1.67776. lr 5.224990e-04:  47%|████▋     | 7642/16329 [1:04:22<1:12:09,  2.01it/s][A
epoch 1 iter 7642: train loss 1.67776. lr 5.224990e-04:  47%|████▋     | 7643/16329 [1:04:22<1:12:11,  2.01it/s][A
epoch 1 iter 7643: train loss 1.67013. lr 5.224797e-04:  47%|████▋     | 7643/16329 [1:04:23<1:12:11,  2.01it/s][A
epoch 1 iter 7643: train loss 1.67013. lr 5.224797e-04:  47%|████▋     |

epoch 1 iter 7674: train loss 1.65579. lr 5.218785e-04:  47%|████▋     | 7675/16329 [1:04:38<1:11:49,  2.01it/s][A
epoch 1 iter 7675: train loss 1.58845. lr 5.218591e-04:  47%|████▋     | 7675/16329 [1:04:39<1:11:49,  2.01it/s][A
epoch 1 iter 7675: train loss 1.58845. lr 5.218591e-04:  47%|████▋     | 7676/16329 [1:04:39<1:11:56,  2.00it/s][A
epoch 1 iter 7676: train loss 1.62084. lr 5.218397e-04:  47%|████▋     | 7676/16329 [1:04:39<1:11:56,  2.00it/s][A
epoch 1 iter 7676: train loss 1.62084. lr 5.218397e-04:  47%|████▋     | 7677/16329 [1:04:39<1:11:42,  2.01it/s][A
epoch 1 iter 7677: train loss 1.61193. lr 5.218202e-04:  47%|████▋     | 7677/16329 [1:04:40<1:11:42,  2.01it/s][A
epoch 1 iter 7677: train loss 1.61193. lr 5.218202e-04:  47%|████▋     | 7678/16329 [1:04:40<1:11:36,  2.01it/s][A
epoch 1 iter 7678: train loss 1.65553. lr 5.218008e-04:  47%|████▋     | 7678/16329 [1:04:40<1:11:36,  2.01it/s][A
epoch 1 iter 7678: train loss 1.65553. lr 5.218008e-04:  47%|████▋     |

epoch 1 iter 7709: train loss 1.61397. lr 5.211974e-04:  47%|████▋     | 7710/16329 [1:04:56<1:11:24,  2.01it/s][A
epoch 1 iter 7710: train loss 1.61140. lr 5.211779e-04:  47%|████▋     | 7710/16329 [1:04:56<1:11:24,  2.01it/s][A
epoch 1 iter 7710: train loss 1.61140. lr 5.211779e-04:  47%|████▋     | 7711/16329 [1:04:56<1:11:10,  2.02it/s][A
epoch 1 iter 7711: train loss 1.59767. lr 5.211584e-04:  47%|████▋     | 7711/16329 [1:04:57<1:11:10,  2.02it/s][A
epoch 1 iter 7711: train loss 1.59767. lr 5.211584e-04:  47%|████▋     | 7712/16329 [1:04:57<1:14:07,  1.94it/s][A
epoch 1 iter 7712: train loss 1.63841. lr 5.211389e-04:  47%|████▋     | 7712/16329 [1:04:58<1:14:07,  1.94it/s][A
epoch 1 iter 7712: train loss 1.63841. lr 5.211389e-04:  47%|████▋     | 7713/16329 [1:04:58<1:16:08,  1.89it/s][A
epoch 1 iter 7713: train loss 1.62197. lr 5.211194e-04:  47%|████▋     | 7713/16329 [1:04:58<1:16:08,  1.89it/s][A
epoch 1 iter 7713: train loss 1.62197. lr 5.211194e-04:  47%|████▋     |

epoch 1 iter 7744: train loss 1.58285. lr 5.205138e-04:  47%|████▋     | 7745/16329 [1:05:14<1:13:20,  1.95it/s][A
epoch 1 iter 7745: train loss 1.60467. lr 5.204942e-04:  47%|████▋     | 7745/16329 [1:05:14<1:13:20,  1.95it/s][A
epoch 1 iter 7745: train loss 1.60467. lr 5.204942e-04:  47%|████▋     | 7746/16329 [1:05:14<1:12:45,  1.97it/s][A
epoch 1 iter 7746: train loss 1.65920. lr 5.204746e-04:  47%|████▋     | 7746/16329 [1:05:15<1:12:45,  1.97it/s][A
epoch 1 iter 7746: train loss 1.65920. lr 5.204746e-04:  47%|████▋     | 7747/16329 [1:05:15<1:12:15,  1.98it/s][A
epoch 1 iter 7747: train loss 1.65922. lr 5.204551e-04:  47%|████▋     | 7747/16329 [1:05:15<1:12:15,  1.98it/s][A
epoch 1 iter 7747: train loss 1.65922. lr 5.204551e-04:  47%|████▋     | 7748/16329 [1:05:15<1:11:42,  1.99it/s][A
epoch 1 iter 7748: train loss 1.63471. lr 5.204355e-04:  47%|████▋     | 7748/16329 [1:05:16<1:11:42,  1.99it/s][A
epoch 1 iter 7748: train loss 1.63471. lr 5.204355e-04:  47%|████▋     |

epoch 1 iter 7779: train loss 1.62298. lr 5.198277e-04:  48%|████▊     | 7780/16329 [1:05:31<1:10:49,  2.01it/s][A
epoch 1 iter 7780: train loss 1.64926. lr 5.198080e-04:  48%|████▊     | 7780/16329 [1:05:32<1:10:49,  2.01it/s][A
epoch 1 iter 7780: train loss 1.64926. lr 5.198080e-04:  48%|████▊     | 7781/16329 [1:05:32<1:10:59,  2.01it/s][A
epoch 1 iter 7781: train loss 1.58205. lr 5.197884e-04:  48%|████▊     | 7781/16329 [1:05:33<1:10:59,  2.01it/s][A
epoch 1 iter 7781: train loss 1.58205. lr 5.197884e-04:  48%|████▊     | 7782/16329 [1:05:33<1:18:39,  1.81it/s][A
epoch 1 iter 7782: train loss 1.63699. lr 5.197687e-04:  48%|████▊     | 7782/16329 [1:05:33<1:18:39,  1.81it/s][A
epoch 1 iter 7782: train loss 1.63699. lr 5.197687e-04:  48%|████▊     | 7783/16329 [1:05:33<1:16:06,  1.87it/s][A
epoch 1 iter 7783: train loss 1.62674. lr 5.197491e-04:  48%|████▊     | 7783/16329 [1:05:34<1:16:06,  1.87it/s][A
epoch 1 iter 7783: train loss 1.62674. lr 5.197491e-04:  48%|████▊     |

epoch 1 iter 7814: train loss 1.62660. lr 5.191390e-04:  48%|████▊     | 7815/16329 [1:05:49<1:10:35,  2.01it/s][A
epoch 1 iter 7815: train loss 1.62593. lr 5.191193e-04:  48%|████▊     | 7815/16329 [1:05:49<1:10:35,  2.01it/s][A
epoch 1 iter 7815: train loss 1.62593. lr 5.191193e-04:  48%|████▊     | 7816/16329 [1:05:49<1:10:24,  2.02it/s][A
epoch 1 iter 7816: train loss 1.60238. lr 5.190996e-04:  48%|████▊     | 7816/16329 [1:05:50<1:10:24,  2.02it/s][A
epoch 1 iter 7816: train loss 1.60238. lr 5.190996e-04:  48%|████▊     | 7817/16329 [1:05:50<1:17:53,  1.82it/s][A
epoch 1 iter 7817: train loss 1.57684. lr 5.190799e-04:  48%|████▊     | 7817/16329 [1:05:51<1:17:53,  1.82it/s][A
epoch 1 iter 7817: train loss 1.57684. lr 5.190799e-04:  48%|████▊     | 7818/16329 [1:05:51<1:15:45,  1.87it/s][A
epoch 1 iter 7818: train loss 1.63773. lr 5.190602e-04:  48%|████▊     | 7818/16329 [1:05:51<1:15:45,  1.87it/s][A
epoch 1 iter 7818: train loss 1.63773. lr 5.190602e-04:  48%|████▊     |

epoch 1 iter 7849: train loss 1.57505. lr 5.184479e-04:  48%|████▊     | 7850/16329 [1:06:07<1:10:12,  2.01it/s][A
epoch 1 iter 7850: train loss 1.63388. lr 5.184282e-04:  48%|████▊     | 7850/16329 [1:06:07<1:10:12,  2.01it/s][A
epoch 1 iter 7850: train loss 1.63388. lr 5.184282e-04:  48%|████▊     | 7851/16329 [1:06:07<1:11:39,  1.97it/s][A
epoch 1 iter 7851: train loss 1.62310. lr 5.184084e-04:  48%|████▊     | 7851/16329 [1:06:08<1:11:39,  1.97it/s][A
epoch 1 iter 7851: train loss 1.62310. lr 5.184084e-04:  48%|████▊     | 7852/16329 [1:06:08<1:12:09,  1.96it/s][A
epoch 1 iter 7852: train loss 1.59601. lr 5.183886e-04:  48%|████▊     | 7852/16329 [1:06:08<1:12:09,  1.96it/s][A
epoch 1 iter 7852: train loss 1.59601. lr 5.183886e-04:  48%|████▊     | 7853/16329 [1:06:08<1:12:17,  1.95it/s][A
epoch 1 iter 7853: train loss 1.58483. lr 5.183688e-04:  48%|████▊     | 7853/16329 [1:06:09<1:12:17,  1.95it/s][A
epoch 1 iter 7853: train loss 1.58483. lr 5.183688e-04:  48%|████▊     |

epoch 1 iter 7884: train loss 1.57958. lr 5.177544e-04:  48%|████▊     | 7885/16329 [1:06:24<1:14:02,  1.90it/s][A
epoch 1 iter 7885: train loss 1.60102. lr 5.177345e-04:  48%|████▊     | 7885/16329 [1:06:25<1:14:02,  1.90it/s][A
epoch 1 iter 7885: train loss 1.60102. lr 5.177345e-04:  48%|████▊     | 7886/16329 [1:06:25<1:13:40,  1.91it/s][A
epoch 1 iter 7886: train loss 1.57205. lr 5.177147e-04:  48%|████▊     | 7886/16329 [1:06:25<1:13:40,  1.91it/s][A
epoch 1 iter 7886: train loss 1.57205. lr 5.177147e-04:  48%|████▊     | 7887/16329 [1:06:25<1:13:26,  1.92it/s][A
epoch 1 iter 7887: train loss 1.62119. lr 5.176948e-04:  48%|████▊     | 7887/16329 [1:06:26<1:13:26,  1.92it/s][A
epoch 1 iter 7887: train loss 1.62119. lr 5.176948e-04:  48%|████▊     | 7888/16329 [1:06:26<1:12:46,  1.93it/s][A
epoch 1 iter 7888: train loss 1.58518. lr 5.176749e-04:  48%|████▊     | 7888/16329 [1:06:26<1:12:46,  1.93it/s][A
epoch 1 iter 7888: train loss 1.58518. lr 5.176749e-04:  48%|████▊     |

epoch 1 iter 7919: train loss 1.59534. lr 5.170583e-04:  49%|████▊     | 7920/16329 [1:06:42<1:09:10,  2.03it/s][A
epoch 1 iter 7920: train loss 1.58833. lr 5.170384e-04:  49%|████▊     | 7920/16329 [1:06:42<1:09:10,  2.03it/s][A
epoch 1 iter 7920: train loss 1.58833. lr 5.170384e-04:  49%|████▊     | 7921/16329 [1:06:42<1:09:01,  2.03it/s][A
epoch 1 iter 7921: train loss 1.60737. lr 5.170185e-04:  49%|████▊     | 7921/16329 [1:06:43<1:09:01,  2.03it/s][A
epoch 1 iter 7921: train loss 1.60737. lr 5.170185e-04:  49%|████▊     | 7922/16329 [1:06:43<1:08:57,  2.03it/s][A
epoch 1 iter 7922: train loss 1.57021. lr 5.169985e-04:  49%|████▊     | 7922/16329 [1:06:43<1:08:57,  2.03it/s][A
epoch 1 iter 7922: train loss 1.57021. lr 5.169985e-04:  49%|████▊     | 7923/16329 [1:06:43<1:09:07,  2.03it/s][A
epoch 1 iter 7923: train loss 1.63589. lr 5.169786e-04:  49%|████▊     | 7923/16329 [1:06:44<1:09:07,  2.03it/s][A
epoch 1 iter 7923: train loss 1.63589. lr 5.169786e-04:  49%|████▊     |

epoch 1 iter 7954: train loss 1.60619. lr 5.163598e-04:  49%|████▊     | 7955/16329 [1:06:59<1:09:24,  2.01it/s][A
epoch 1 iter 7955: train loss 1.61057. lr 5.163398e-04:  49%|████▊     | 7955/16329 [1:07:00<1:09:24,  2.01it/s][A
epoch 1 iter 7955: train loss 1.61057. lr 5.163398e-04:  49%|████▊     | 7956/16329 [1:07:00<1:09:17,  2.01it/s][A
epoch 1 iter 7956: train loss 1.63583. lr 5.163198e-04:  49%|████▊     | 7956/16329 [1:07:00<1:09:17,  2.01it/s][A
epoch 1 iter 7956: train loss 1.63583. lr 5.163198e-04:  49%|████▊     | 7957/16329 [1:07:00<1:08:56,  2.02it/s][A
epoch 1 iter 7957: train loss 1.58449. lr 5.162998e-04:  49%|████▊     | 7957/16329 [1:07:01<1:08:56,  2.02it/s][A
epoch 1 iter 7957: train loss 1.58449. lr 5.162998e-04:  49%|████▊     | 7958/16329 [1:07:01<1:09:12,  2.02it/s][A
epoch 1 iter 7958: train loss 1.59938. lr 5.162798e-04:  49%|████▊     | 7958/16329 [1:07:01<1:09:12,  2.02it/s][A
epoch 1 iter 7958: train loss 1.59938. lr 5.162798e-04:  49%|████▊     |

epoch 1 iter 7989: train loss 1.55030. lr 5.156589e-04:  49%|████▉     | 7990/16329 [1:07:17<1:08:59,  2.01it/s][A
epoch 1 iter 7990: train loss 1.55824. lr 5.156388e-04:  49%|████▉     | 7990/16329 [1:07:18<1:08:59,  2.01it/s][A
epoch 1 iter 7990: train loss 1.55824. lr 5.156388e-04:  49%|████▉     | 7991/16329 [1:07:18<1:08:55,  2.02it/s][A
epoch 1 iter 7991: train loss 1.63125. lr 5.156187e-04:  49%|████▉     | 7991/16329 [1:07:18<1:08:55,  2.02it/s][A
epoch 1 iter 7991: train loss 1.63125. lr 5.156187e-04:  49%|████▉     | 7992/16329 [1:07:18<1:08:47,  2.02it/s][A
epoch 1 iter 7992: train loss 1.61448. lr 5.155987e-04:  49%|████▉     | 7992/16329 [1:07:19<1:08:47,  2.02it/s][A
epoch 1 iter 7992: train loss 1.61448. lr 5.155987e-04:  49%|████▉     | 7993/16329 [1:07:19<1:08:49,  2.02it/s][A
epoch 1 iter 7993: train loss 1.59084. lr 5.155786e-04:  49%|████▉     | 7993/16329 [1:07:19<1:08:49,  2.02it/s][A
epoch 1 iter 7993: train loss 1.59084. lr 5.155786e-04:  49%|████▉     |

epoch 1 iter 8024: train loss 1.58504. lr 5.149555e-04:  49%|████▉     | 8025/16329 [1:07:35<1:08:22,  2.02it/s][A
epoch 1 iter 8025: train loss 1.62437. lr 5.149353e-04:  49%|████▉     | 8025/16329 [1:07:35<1:08:22,  2.02it/s][A
epoch 1 iter 8025: train loss 1.62437. lr 5.149353e-04:  49%|████▉     | 8026/16329 [1:07:35<1:08:27,  2.02it/s][A
epoch 1 iter 8026: train loss 1.58081. lr 5.149152e-04:  49%|████▉     | 8026/16329 [1:07:36<1:08:27,  2.02it/s][A
epoch 1 iter 8026: train loss 1.58081. lr 5.149152e-04:  49%|████▉     | 8027/16329 [1:07:36<1:08:40,  2.01it/s][A
epoch 1 iter 8027: train loss 1.51549. lr 5.148951e-04:  49%|████▉     | 8027/16329 [1:07:36<1:08:40,  2.01it/s][A
epoch 1 iter 8027: train loss 1.51549. lr 5.148951e-04:  49%|████▉     | 8028/16329 [1:07:36<1:08:44,  2.01it/s][A
epoch 1 iter 8028: train loss 1.63288. lr 5.148749e-04:  49%|████▉     | 8028/16329 [1:07:37<1:08:44,  2.01it/s][A
epoch 1 iter 8028: train loss 1.63288. lr 5.148749e-04:  49%|████▉     |

epoch 1 iter 8059: train loss 1.59687. lr 5.142496e-04:  49%|████▉     | 8060/16329 [1:07:53<1:11:20,  1.93it/s][A
epoch 1 iter 8060: train loss 1.63657. lr 5.142294e-04:  49%|████▉     | 8060/16329 [1:07:53<1:11:20,  1.93it/s][A
epoch 1 iter 8060: train loss 1.63657. lr 5.142294e-04:  49%|████▉     | 8061/16329 [1:07:53<1:10:52,  1.94it/s][A
epoch 1 iter 8061: train loss 1.55454. lr 5.142092e-04:  49%|████▉     | 8061/16329 [1:07:54<1:10:52,  1.94it/s][A
epoch 1 iter 8061: train loss 1.55454. lr 5.142092e-04:  49%|████▉     | 8062/16329 [1:07:54<1:10:31,  1.95it/s][A
epoch 1 iter 8062: train loss 1.57566. lr 5.141890e-04:  49%|████▉     | 8062/16329 [1:07:54<1:10:31,  1.95it/s][A
epoch 1 iter 8062: train loss 1.57566. lr 5.141890e-04:  49%|████▉     | 8063/16329 [1:07:54<1:09:59,  1.97it/s][A
epoch 1 iter 8063: train loss 1.55125. lr 5.141688e-04:  49%|████▉     | 8063/16329 [1:07:55<1:09:59,  1.97it/s][A
epoch 1 iter 8063: train loss 1.55125. lr 5.141688e-04:  49%|████▉     |

epoch 1 iter 8094: train loss 1.51910. lr 5.135414e-04:  50%|████▉     | 8095/16329 [1:08:10<1:08:13,  2.01it/s][A
epoch 1 iter 8095: train loss 1.55576. lr 5.135211e-04:  50%|████▉     | 8095/16329 [1:08:11<1:08:13,  2.01it/s][A
epoch 1 iter 8095: train loss 1.55576. lr 5.135211e-04:  50%|████▉     | 8096/16329 [1:08:11<1:15:20,  1.82it/s][A
epoch 1 iter 8096: train loss 1.54514. lr 5.135008e-04:  50%|████▉     | 8096/16329 [1:08:11<1:15:20,  1.82it/s][A
epoch 1 iter 8096: train loss 1.54514. lr 5.135008e-04:  50%|████▉     | 8097/16329 [1:08:11<1:13:10,  1.87it/s][A
epoch 1 iter 8097: train loss 1.55855. lr 5.134805e-04:  50%|████▉     | 8097/16329 [1:08:12<1:13:10,  1.87it/s][A
epoch 1 iter 8097: train loss 1.55855. lr 5.134805e-04:  50%|████▉     | 8098/16329 [1:08:12<1:11:48,  1.91it/s][A
epoch 1 iter 8098: train loss 1.51293. lr 5.134603e-04:  50%|████▉     | 8098/16329 [1:08:12<1:11:48,  1.91it/s][A
epoch 1 iter 8098: train loss 1.51293. lr 5.134603e-04:  50%|████▉     |

epoch 1 iter 8129: train loss 1.56539. lr 5.128307e-04:  50%|████▉     | 8130/16329 [1:08:28<1:08:29,  2.00it/s][A
epoch 1 iter 8130: train loss 1.57161. lr 5.128103e-04:  50%|████▉     | 8130/16329 [1:08:28<1:08:29,  2.00it/s][A
epoch 1 iter 8130: train loss 1.57161. lr 5.128103e-04:  50%|████▉     | 8131/16329 [1:08:28<1:08:19,  2.00it/s][A
epoch 1 iter 8131: train loss 1.60288. lr 5.127900e-04:  50%|████▉     | 8131/16329 [1:08:29<1:08:19,  2.00it/s][A
epoch 1 iter 8131: train loss 1.60288. lr 5.127900e-04:  50%|████▉     | 8132/16329 [1:08:29<1:07:58,  2.01it/s][A
epoch 1 iter 8132: train loss 1.59622. lr 5.127696e-04:  50%|████▉     | 8132/16329 [1:08:29<1:07:58,  2.01it/s][A
epoch 1 iter 8132: train loss 1.59622. lr 5.127696e-04:  50%|████▉     | 8133/16329 [1:08:29<1:07:57,  2.01it/s][A
epoch 1 iter 8133: train loss 1.59642. lr 5.127493e-04:  50%|████▉     | 8133/16329 [1:08:30<1:07:57,  2.01it/s][A
epoch 1 iter 8133: train loss 1.59642. lr 5.127493e-04:  50%|████▉     |

epoch 1 iter 8164: train loss 1.53461. lr 5.121176e-04:  50%|█████     | 8165/16329 [1:08:46<1:13:32,  1.85it/s][A
epoch 1 iter 8165: train loss 1.56449. lr 5.120972e-04:  50%|█████     | 8165/16329 [1:08:46<1:13:32,  1.85it/s][A
epoch 1 iter 8165: train loss 1.56449. lr 5.120972e-04:  50%|█████     | 8166/16329 [1:08:46<1:12:11,  1.88it/s][A
epoch 1 iter 8166: train loss 1.57879. lr 5.120767e-04:  50%|█████     | 8166/16329 [1:08:47<1:12:11,  1.88it/s][A
epoch 1 iter 8166: train loss 1.57879. lr 5.120767e-04:  50%|█████     | 8167/16329 [1:08:47<1:10:57,  1.92it/s][A
epoch 1 iter 8167: train loss 1.53569. lr 5.120563e-04:  50%|█████     | 8167/16329 [1:08:47<1:10:57,  1.92it/s][A
epoch 1 iter 8167: train loss 1.53569. lr 5.120563e-04:  50%|█████     | 8168/16329 [1:08:47<1:10:11,  1.94it/s][A
epoch 1 iter 8168: train loss 1.55282. lr 5.120359e-04:  50%|█████     | 8168/16329 [1:08:48<1:10:11,  1.94it/s][A
epoch 1 iter 8168: train loss 1.55282. lr 5.120359e-04:  50%|█████     |

epoch 1 iter 8199: train loss 1.58210. lr 5.114021e-04:  50%|█████     | 8200/16329 [1:09:03<1:10:49,  1.91it/s][A
epoch 1 iter 8200: train loss 1.59769. lr 5.113816e-04:  50%|█████     | 8200/16329 [1:09:04<1:10:49,  1.91it/s][A
epoch 1 iter 8200: train loss 1.59769. lr 5.113816e-04:  50%|█████     | 8201/16329 [1:09:04<1:09:32,  1.95it/s][A
epoch 1 iter 8201: train loss 1.54294. lr 5.113611e-04:  50%|█████     | 8201/16329 [1:09:04<1:09:32,  1.95it/s][A
epoch 1 iter 8201: train loss 1.54294. lr 5.113611e-04:  50%|█████     | 8202/16329 [1:09:04<1:08:37,  1.97it/s][A
epoch 1 iter 8202: train loss 1.55604. lr 5.113406e-04:  50%|█████     | 8202/16329 [1:09:05<1:08:37,  1.97it/s][A
epoch 1 iter 8202: train loss 1.55604. lr 5.113406e-04:  50%|█████     | 8203/16329 [1:09:05<1:08:16,  1.98it/s][A
epoch 1 iter 8203: train loss 1.59799. lr 5.113201e-04:  50%|█████     | 8203/16329 [1:09:05<1:08:16,  1.98it/s][A
epoch 1 iter 8203: train loss 1.59799. lr 5.113201e-04:  50%|█████     |

epoch 1 iter 8234: train loss 1.60030. lr 5.106842e-04:  50%|█████     | 8235/16329 [1:09:21<1:06:46,  2.02it/s][A
epoch 1 iter 8235: train loss 1.57419. lr 5.106636e-04:  50%|█████     | 8235/16329 [1:09:21<1:06:46,  2.02it/s][A
epoch 1 iter 8235: train loss 1.57419. lr 5.106636e-04:  50%|█████     | 8236/16329 [1:09:21<1:06:46,  2.02it/s][A
epoch 1 iter 8236: train loss 1.60258. lr 5.106431e-04:  50%|█████     | 8236/16329 [1:09:22<1:06:46,  2.02it/s][A
epoch 1 iter 8236: train loss 1.60258. lr 5.106431e-04:  50%|█████     | 8237/16329 [1:09:22<1:06:47,  2.02it/s][A
epoch 1 iter 8237: train loss 1.58851. lr 5.106225e-04:  50%|█████     | 8237/16329 [1:09:22<1:06:47,  2.02it/s][A
epoch 1 iter 8237: train loss 1.58851. lr 5.106225e-04:  50%|█████     | 8238/16329 [1:09:22<1:06:44,  2.02it/s][A
epoch 1 iter 8238: train loss 1.57945. lr 5.106020e-04:  50%|█████     | 8238/16329 [1:09:23<1:06:44,  2.02it/s][A
epoch 1 iter 8238: train loss 1.57945. lr 5.106020e-04:  50%|█████     |

epoch 1 iter 8269: train loss 1.57609. lr 5.099639e-04:  51%|█████     | 8270/16329 [1:09:38<1:06:27,  2.02it/s][A
epoch 1 iter 8270: train loss 1.58368. lr 5.099433e-04:  51%|█████     | 8270/16329 [1:09:39<1:06:27,  2.02it/s][A
epoch 1 iter 8270: train loss 1.58368. lr 5.099433e-04:  51%|█████     | 8271/16329 [1:09:39<1:06:27,  2.02it/s][A
epoch 1 iter 8271: train loss 1.56833. lr 5.099226e-04:  51%|█████     | 8271/16329 [1:09:39<1:06:27,  2.02it/s][A
epoch 1 iter 8271: train loss 1.56833. lr 5.099226e-04:  51%|█████     | 8272/16329 [1:09:39<1:06:34,  2.02it/s][A
epoch 1 iter 8272: train loss 1.58463. lr 5.099020e-04:  51%|█████     | 8272/16329 [1:09:40<1:06:34,  2.02it/s][A
epoch 1 iter 8272: train loss 1.58463. lr 5.099020e-04:  51%|█████     | 8273/16329 [1:09:40<1:06:47,  2.01it/s][A
epoch 1 iter 8273: train loss 1.56628. lr 5.098814e-04:  51%|█████     | 8273/16329 [1:09:40<1:06:47,  2.01it/s][A

epoch 1 iter 8304: train loss 1.54805. lr 5.092412e-04:  51%|█████     | 8305/16329 [1:09:56<1:06:22,  2.01it/s][A
epoch 1 iter 8305: train loss 1.55897. lr 5.092205e-04:  51%|█████     | 8305/16329 [1:09:56<1:06:22,  2.01it/s][A
epoch 1 iter 8305: train loss 1.55897. lr 5.092205e-04:  51%|█████     | 8306/16329 [1:09:56<1:06:14,  2.02it/s][A
epoch 1 iter 8306: train loss 1.52368. lr 5.091998e-04:  51%|█████     | 8306/16329 [1:09:57<1:06:14,  2.02it/s][A
epoch 1 iter 8306: train loss 1.52368. lr 5.091998e-04:  51%|█████     | 8307/16329 [1:09:57<1:06:05,  2.02it/s][A
epoch 1 iter 8307: train loss 1.56628. lr 5.091791e-04:  51%|█████     | 8307/16329 [1:09:57<1:06:05,  2.02it/s][A
epoch 1 iter 8307: train loss 1.56628. lr 5.091791e-04:  51%|█████     | 8308/16329 [1:09:57<1:06:06,  2.02it/s][A
epoch 1 iter 8308: train loss 1.51785. lr 5.091585e-04:  51%|█████     | 8308/16329 [1:09:58<1:06:06,  2.02it/s][A

epoch 1 iter 8339: train loss 1.55129. lr 5.085162e-04:  51%|█████     | 8340/16329 [1:10:13<1:06:27,  2.00it/s][A
epoch 1 iter 8340: train loss 1.56896. lr 5.084954e-04:  51%|█████     | 8340/16329 [1:10:14<1:06:27,  2.00it/s][A
epoch 1 iter 8340: train loss 1.56896. lr 5.084954e-04:  51%|█████     | 8341/16329 [1:10:14<1:06:14,  2.01it/s][A
epoch 1 iter 8341: train loss 1.53329. lr 5.084746e-04:  51%|█████     | 8341/16329 [1:10:14<1:06:14,  2.01it/s][A
epoch 1 iter 8341: train loss 1.53329. lr 5.084746e-04:  51%|█████     | 8342/16329 [1:10:14<1:06:03,  2.01it/s][A
epoch 1 iter 8342: train loss 1.56255. lr 5.084539e-04:  51%|█████     | 8342/16329 [1:10:15<1:06:03,  2.01it/s][A
epoch 1 iter 8342: train loss 1.56255. lr 5.084539e-04:  51%|█████     | 8343/16329 [1:10:15<1:05:47,  2.02it/s][A
epoch 1 iter 8343: train loss 1.60639. lr 5.084331e-04:  51%|█████     | 8343/16329 [1:10:15<1:05:47,  2.02it/s][A

epoch 1 iter 8374: train loss 1.52640. lr 5.077887e-04:  51%|█████▏    | 8375/16329 [1:10:31<1:05:25,  2.03it/s][A
epoch 1 iter 8375: train loss 1.56468. lr 5.077679e-04:  51%|█████▏    | 8375/16329 [1:10:31<1:05:25,  2.03it/s][A
epoch 1 iter 8375: train loss 1.56468. lr 5.077679e-04:  51%|█████▏    | 8376/16329 [1:10:31<1:05:28,  2.02it/s][A
epoch 1 iter 8376: train loss 1.57599. lr 5.077471e-04:  51%|█████▏    | 8376/16329 [1:10:32<1:05:28,  2.02it/s][A
epoch 1 iter 8376: train loss 1.57599. lr 5.077471e-04:  51%|█████▏    | 8377/16329 [1:10:32<1:12:25,  1.83it/s][A
epoch 1 iter 8377: train loss 1.55824. lr 5.077263e-04:  51%|█████▏    | 8377/16329 [1:10:33<1:12:25,  1.83it/s][A
epoch 1 iter 8377: train loss 1.55824. lr 5.077263e-04:  51%|█████▏    | 8378/16329 [1:10:33<1:10:29,  1.88it/s][A
epoch 1 iter 8378: train loss 1.51041. lr 5.077055e-04:  51%|█████▏    | 8378/16329 [1:10:33<1:10:29,  1.88it/s][A

epoch 1 iter 8409: train loss 1.58559. lr 5.070590e-04:  52%|█████▏    | 8410/16329 [1:10:48<1:05:30,  2.01it/s][A
epoch 1 iter 8410: train loss 1.61080. lr 5.070381e-04:  52%|█████▏    | 8410/16329 [1:10:49<1:05:30,  2.01it/s][A
epoch 1 iter 8410: train loss 1.61080. lr 5.070381e-04:  52%|█████▏    | 8411/16329 [1:10:49<1:05:26,  2.02it/s][A
epoch 1 iter 8411: train loss 1.54025. lr 5.070172e-04:  52%|█████▏    | 8411/16329 [1:10:49<1:05:26,  2.02it/s][A
epoch 1 iter 8411: train loss 1.54025. lr 5.070172e-04:  52%|█████▏    | 8412/16329 [1:10:49<1:05:26,  2.02it/s][A
epoch 1 iter 8412: train loss 1.55493. lr 5.069963e-04:  52%|█████▏    | 8412/16329 [1:10:50<1:05:26,  2.02it/s][A
epoch 1 iter 8412: train loss 1.55493. lr 5.069963e-04:  52%|█████▏    | 8413/16329 [1:10:50<1:05:21,  2.02it/s][A
epoch 1 iter 8413: train loss 1.49304. lr 5.069754e-04:  52%|█████▏    | 8413/16329 [1:10:50<1:05:21,  2.02it/s][A

epoch 1 iter 8444: train loss 1.52444. lr 5.063269e-04:  52%|█████▏    | 8445/16329 [1:11:06<1:05:09,  2.02it/s][A
epoch 1 iter 8445: train loss 1.55564. lr 5.063059e-04:  52%|█████▏    | 8445/16329 [1:11:07<1:05:09,  2.02it/s][A
epoch 1 iter 8445: train loss 1.55564. lr 5.063059e-04:  52%|█████▏    | 8446/16329 [1:11:07<1:05:13,  2.01it/s][A
epoch 1 iter 8446: train loss 1.51211. lr 5.062850e-04:  52%|█████▏    | 8446/16329 [1:11:07<1:05:13,  2.01it/s][A
epoch 1 iter 8446: train loss 1.51211. lr 5.062850e-04:  52%|█████▏    | 8447/16329 [1:11:07<1:05:00,  2.02it/s][A
epoch 1 iter 8447: train loss 1.54777. lr 5.062640e-04:  52%|█████▏    | 8447/16329 [1:11:08<1:05:00,  2.02it/s][A
epoch 1 iter 8447: train loss 1.54777. lr 5.062640e-04:  52%|█████▏    | 8448/16329 [1:11:08<1:05:10,  2.02it/s][A
epoch 1 iter 8448: train loss 1.51250. lr 5.062431e-04:  52%|█████▏    | 8448/16329 [1:11:08<1:05:10,  2.02it/s][A

epoch 1 iter 8479: train loss 1.52616. lr 5.055924e-04:  52%|█████▏    | 8480/16329 [1:11:24<1:07:49,  1.93it/s][A
epoch 1 iter 8480: train loss 1.55999. lr 5.055714e-04:  52%|█████▏    | 8480/16329 [1:11:24<1:07:49,  1.93it/s][A
epoch 1 iter 8480: train loss 1.55999. lr 5.055714e-04:  52%|█████▏    | 8481/16329 [1:11:24<1:06:50,  1.96it/s][A
epoch 1 iter 8481: train loss 1.56162. lr 5.055504e-04:  52%|█████▏    | 8481/16329 [1:11:25<1:06:50,  1.96it/s][A
epoch 1 iter 8481: train loss 1.56162. lr 5.055504e-04:  52%|█████▏    | 8482/16329 [1:11:25<1:06:22,  1.97it/s][A
epoch 1 iter 8482: train loss 1.55815. lr 5.055294e-04:  52%|█████▏    | 8482/16329 [1:11:25<1:06:22,  1.97it/s][A
epoch 1 iter 8482: train loss 1.55815. lr 5.055294e-04:  52%|█████▏    | 8483/16329 [1:11:25<1:06:35,  1.96it/s][A
epoch 1 iter 8483: train loss 1.55676. lr 5.055083e-04:  52%|█████▏    | 8483/16329 [1:11:26<1:06:35,  1.96it/s][A

epoch 1 iter 8514: train loss 1.55461. lr 5.048556e-04:  52%|█████▏    | 8515/16329 [1:11:41<1:04:51,  2.01it/s][A
epoch 1 iter 8515: train loss 1.52945. lr 5.048346e-04:  52%|█████▏    | 8515/16329 [1:11:42<1:04:51,  2.01it/s][A
epoch 1 iter 8515: train loss 1.52945. lr 5.048346e-04:  52%|█████▏    | 8516/16329 [1:11:42<1:04:55,  2.01it/s][A
epoch 1 iter 8516: train loss 1.54229. lr 5.048135e-04:  52%|█████▏    | 8516/16329 [1:11:42<1:04:55,  2.01it/s][A
epoch 1 iter 8516: train loss 1.54229. lr 5.048135e-04:  52%|█████▏    | 8517/16329 [1:11:42<1:04:36,  2.02it/s][A
epoch 1 iter 8517: train loss 1.48202. lr 5.047924e-04:  52%|█████▏    | 8517/16329 [1:11:43<1:04:36,  2.02it/s][A
epoch 1 iter 8517: train loss 1.48202. lr 5.047924e-04:  52%|█████▏    | 8518/16329 [1:11:43<1:04:35,  2.02it/s][A
epoch 1 iter 8518: train loss 1.52118. lr 5.047713e-04:  52%|█████▏    | 8518/16329 [1:11:43<1:04:35,  2.02it/s][A

epoch 1 iter 8549: train loss 1.48109. lr 5.041165e-04:  52%|█████▏    | 8550/16329 [1:11:59<1:05:12,  1.99it/s][A
epoch 1 iter 8550: train loss 1.53846. lr 5.040954e-04:  52%|█████▏    | 8550/16329 [1:12:00<1:05:12,  1.99it/s][A
epoch 1 iter 8550: train loss 1.53846. lr 5.040954e-04:  52%|█████▏    | 8551/16329 [1:12:00<1:04:57,  2.00it/s][A
epoch 1 iter 8551: train loss 1.51763. lr 5.040742e-04:  52%|█████▏    | 8551/16329 [1:12:00<1:04:57,  2.00it/s][A
epoch 1 iter 8551: train loss 1.51763. lr 5.040742e-04:  52%|█████▏    | 8552/16329 [1:12:00<1:04:42,  2.00it/s][A
epoch 1 iter 8552: train loss 1.50970. lr 5.040531e-04:  52%|█████▏    | 8552/16329 [1:12:01<1:04:42,  2.00it/s][A
epoch 1 iter 8552: train loss 1.50970. lr 5.040531e-04:  52%|█████▏    | 8553/16329 [1:12:01<1:04:26,  2.01it/s][A
epoch 1 iter 8553: train loss 1.54979. lr 5.040319e-04:  52%|█████▏    | 8553/16329 [1:12:01<1:04:26,  2.01it/s][A

epoch 1 iter 8584: train loss 1.49246. lr 5.033751e-04:  53%|█████▎    | 8585/16329 [1:12:17<1:04:49,  1.99it/s][A
epoch 1 iter 8585: train loss 1.51845. lr 5.033539e-04:  53%|█████▎    | 8585/16329 [1:12:17<1:04:49,  1.99it/s][A
epoch 1 iter 8585: train loss 1.51845. lr 5.033539e-04:  53%|█████▎    | 8586/16329 [1:12:17<1:05:41,  1.96it/s][A
epoch 1 iter 8586: train loss 1.53241. lr 5.033327e-04:  53%|█████▎    | 8586/16329 [1:12:18<1:05:41,  1.96it/s][A
epoch 1 iter 8586: train loss 1.53241. lr 5.033327e-04:  53%|█████▎    | 8587/16329 [1:12:18<1:05:27,  1.97it/s][A
epoch 1 iter 8587: train loss 1.54220. lr 5.033115e-04:  53%|█████▎    | 8587/16329 [1:12:18<1:05:27,  1.97it/s][A
epoch 1 iter 8587: train loss 1.54220. lr 5.033115e-04:  53%|█████▎    | 8588/16329 [1:12:18<1:04:59,  1.99it/s][A
epoch 1 iter 8588: train loss 1.52404. lr 5.032902e-04:  53%|█████▎    | 8588/16329 [1:12:19<1:04:59,  1.99it/s][A

epoch 1 iter 8619: train loss 1.53923. lr 5.026314e-04:  53%|█████▎    | 8620/16329 [1:12:35<1:08:18,  1.88it/s][A
epoch 1 iter 8620: train loss 1.46324. lr 5.026101e-04:  53%|█████▎    | 8620/16329 [1:12:35<1:08:18,  1.88it/s][A
epoch 1 iter 8620: train loss 1.46324. lr 5.026101e-04:  53%|█████▎    | 8621/16329 [1:12:35<1:07:43,  1.90it/s][A
epoch 1 iter 8621: train loss 1.49049. lr 5.025888e-04:  53%|█████▎    | 8621/16329 [1:12:36<1:07:43,  1.90it/s][A
epoch 1 iter 8621: train loss 1.49049. lr 5.025888e-04:  53%|█████▎    | 8622/16329 [1:12:36<1:06:57,  1.92it/s][A
epoch 1 iter 8622: train loss 1.49698. lr 5.025675e-04:  53%|█████▎    | 8622/16329 [1:12:36<1:06:57,  1.92it/s][A
epoch 1 iter 8622: train loss 1.49698. lr 5.025675e-04:  53%|█████▎    | 8623/16329 [1:12:36<1:06:20,  1.94it/s][A
epoch 1 iter 8623: train loss 1.50050. lr 5.025463e-04:  53%|█████▎    | 8623/16329 [1:12:37<1:06:20,  1.94it/s][A

epoch 1 iter 8654: train loss 1.52407. lr 5.018854e-04:  53%|█████▎    | 8655/16329 [1:12:52<1:03:15,  2.02it/s][A
epoch 1 iter 8655: train loss 1.52527. lr 5.018640e-04:  53%|█████▎    | 8655/16329 [1:12:53<1:03:15,  2.02it/s][A
epoch 1 iter 8655: train loss 1.52527. lr 5.018640e-04:  53%|█████▎    | 8656/16329 [1:12:53<1:03:08,  2.03it/s][A
epoch 1 iter 8656: train loss 1.51621. lr 5.018427e-04:  53%|█████▎    | 8656/16329 [1:12:53<1:03:08,  2.03it/s][A
epoch 1 iter 8656: train loss 1.51621. lr 5.018427e-04:  53%|█████▎    | 8657/16329 [1:12:53<1:03:19,  2.02it/s][A
epoch 1 iter 8657: train loss 1.52907. lr 5.018213e-04:  53%|█████▎    | 8657/16329 [1:12:54<1:03:19,  2.02it/s][A
epoch 1 iter 8657: train loss 1.52907. lr 5.018213e-04:  53%|█████▎    | 8658/16329 [1:12:54<1:03:12,  2.02it/s][A
epoch 1 iter 8658: train loss 1.54737. lr 5.018000e-04:  53%|█████▎    | 8658/16329 [1:12:54<1:03:12,  2.02it/s][A

epoch 1 iter 8689: train loss 1.50307. lr 5.011371e-04:  53%|█████▎    | 8690/16329 [1:13:10<1:03:07,  2.02it/s][A
epoch 1 iter 8690: train loss 1.53415. lr 5.011157e-04:  53%|█████▎    | 8690/16329 [1:13:10<1:03:07,  2.02it/s][A
epoch 1 iter 8690: train loss 1.53415. lr 5.011157e-04:  53%|█████▎    | 8691/16329 [1:13:10<1:03:02,  2.02it/s][A
epoch 1 iter 8691: train loss 1.53596. lr 5.010942e-04:  53%|█████▎    | 8691/16329 [1:13:11<1:03:02,  2.02it/s][A
epoch 1 iter 8691: train loss 1.53596. lr 5.010942e-04:  53%|█████▎    | 8692/16329 [1:13:11<1:03:02,  2.02it/s][A
epoch 1 iter 8692: train loss 1.53515. lr 5.010728e-04:  53%|█████▎    | 8692/16329 [1:13:11<1:03:02,  2.02it/s][A
epoch 1 iter 8692: train loss 1.53515. lr 5.010728e-04:  53%|█████▎    | 8693/16329 [1:13:11<1:03:12,  2.01it/s][A
epoch 1 iter 8693: train loss 1.48772. lr 5.010514e-04:  53%|█████▎    | 8693/16329 [1:13:12<1:03:12,  2.01it/s][A

epoch 1 iter 8724: train loss 1.56483. lr 5.003865e-04:  53%|█████▎    | 8725/16329 [1:13:28<1:02:48,  2.02it/s][A
epoch 1 iter 8725: train loss 1.51334. lr 5.003650e-04:  53%|█████▎    | 8725/16329 [1:13:28<1:02:48,  2.02it/s][A
epoch 1 iter 8725: train loss 1.51334. lr 5.003650e-04:  53%|█████▎    | 8726/16329 [1:13:28<1:02:58,  2.01it/s][A
epoch 1 iter 8726: train loss 1.53374. lr 5.003435e-04:  53%|█████▎    | 8726/16329 [1:13:29<1:02:58,  2.01it/s][A
epoch 1 iter 8726: train loss 1.53374. lr 5.003435e-04:  53%|█████▎    | 8727/16329 [1:13:29<1:02:56,  2.01it/s][A
epoch 1 iter 8727: train loss 1.54926. lr 5.003220e-04:  53%|█████▎    | 8727/16329 [1:13:29<1:02:56,  2.01it/s][A
epoch 1 iter 8727: train loss 1.54926. lr 5.003220e-04:  53%|█████▎    | 8728/16329 [1:13:29<1:03:02,  2.01it/s][A
epoch 1 iter 8728: train loss 1.53191. lr 5.003006e-04:  53%|█████▎    | 8728/16329 [1:13:30<1:03:02,  2.01it/s][A

epoch 1 iter 8759: train loss 1.48739. lr 4.996336e-04:  54%|█████▎    | 8760/16329 [1:13:45<1:07:16,  1.88it/s][A
epoch 1 iter 8760: train loss 1.48760. lr 4.996121e-04:  54%|█████▎    | 8760/16329 [1:13:46<1:07:16,  1.88it/s][A
epoch 1 iter 8760: train loss 1.48760. lr 4.996121e-04:  54%|█████▎    | 8761/16329 [1:13:46<1:07:29,  1.87it/s][A
epoch 1 iter 8761: train loss 1.49817. lr 4.995905e-04:  54%|█████▎    | 8761/16329 [1:13:46<1:07:29,  1.87it/s][A
epoch 1 iter 8761: train loss 1.49817. lr 4.995905e-04:  54%|█████▎    | 8762/16329 [1:13:46<1:07:12,  1.88it/s][A
epoch 1 iter 8762: train loss 1.49553. lr 4.995690e-04:  54%|█████▎    | 8762/16329 [1:13:47<1:07:12,  1.88it/s][A
epoch 1 iter 8762: train loss 1.49553. lr 4.995690e-04:  54%|█████▎    | 8763/16329 [1:13:47<1:06:35,  1.89it/s][A
epoch 1 iter 8763: train loss 1.49632. lr 4.995474e-04:  54%|█████▎    | 8763/16329 [1:13:47<1:06:35,  1.89it/s][A

epoch 1 iter 8794: train loss 1.48775. lr 4.988785e-04:  54%|█████▍    | 8795/16329 [1:14:03<1:03:45,  1.97it/s][A
epoch 1 iter 8795: train loss 1.49365. lr 4.988569e-04:  54%|█████▍    | 8795/16329 [1:14:03<1:03:45,  1.97it/s][A
epoch 1 iter 8795: train loss 1.49365. lr 4.988569e-04:  54%|█████▍    | 8796/16329 [1:14:03<1:03:13,  1.99it/s][A
epoch 1 iter 8796: train loss 1.49165. lr 4.988353e-04:  54%|█████▍    | 8796/16329 [1:14:04<1:03:13,  1.99it/s][A
epoch 1 iter 8796: train loss 1.49165. lr 4.988353e-04:  54%|█████▍    | 8797/16329 [1:14:04<1:02:57,  1.99it/s][A
epoch 1 iter 8797: train loss 1.48593. lr 4.988137e-04:  54%|█████▍    | 8797/16329 [1:14:05<1:02:57,  1.99it/s][A
epoch 1 iter 8797: train loss 1.48593. lr 4.988137e-04:  54%|█████▍    | 8798/16329 [1:14:05<1:11:47,  1.75it/s][A
epoch 1 iter 8798: train loss 1.52073. lr 4.987921e-04:  54%|█████▍    | 8798/16329 [1:14:05<1:11:47,  1.75it/s][A

epoch 1 iter 8829: train loss 1.51595. lr 4.981211e-04:  54%|█████▍    | 8830/16329 [1:14:21<1:03:09,  1.98it/s][A
epoch 1 iter 8830: train loss 1.51258. lr 4.980994e-04:  54%|█████▍    | 8830/16329 [1:14:21<1:03:09,  1.98it/s][A
epoch 1 iter 8830: train loss 1.51258. lr 4.980994e-04:  54%|█████▍    | 8831/16329 [1:14:21<1:02:39,  1.99it/s][A
epoch 1 iter 8831: train loss 1.51121. lr 4.980778e-04:  54%|█████▍    | 8831/16329 [1:14:22<1:02:39,  1.99it/s][A
epoch 1 iter 8831: train loss 1.51121. lr 4.980778e-04:  54%|█████▍    | 8832/16329 [1:14:22<1:02:32,  2.00it/s][A
epoch 1 iter 8832: train loss 1.49072. lr 4.980561e-04:  54%|█████▍    | 8832/16329 [1:14:22<1:02:32,  2.00it/s][A
epoch 1 iter 8832: train loss 1.49072. lr 4.980561e-04:  54%|█████▍    | 8833/16329 [1:14:22<1:08:50,  1.81it/s][A
epoch 1 iter 8833: train loss 1.55121. lr 4.980344e-04:  54%|█████▍    | 8833/16329 [1:14:23<1:08:50,  1.81it/s][A

epoch 1 iter 8864: train loss 1.48325. lr 4.973615e-04:  54%|█████▍    | 8865/16329 [1:14:39<1:01:54,  2.01it/s][A
epoch 1 iter 8865: train loss 1.51161. lr 4.973398e-04:  54%|█████▍    | 8865/16329 [1:14:39<1:01:54,  2.01it/s][A
epoch 1 iter 8865: train loss 1.51161. lr 4.973398e-04:  54%|█████▍    | 8866/16329 [1:14:39<1:01:58,  2.01it/s][A
epoch 1 iter 8866: train loss 1.49295. lr 4.973180e-04:  54%|█████▍    | 8866/16329 [1:14:40<1:01:58,  2.01it/s][A
epoch 1 iter 8866: train loss 1.49295. lr 4.973180e-04:  54%|█████▍    | 8867/16329 [1:14:40<1:01:44,  2.01it/s][A
epoch 1 iter 8867: train loss 1.52313. lr 4.972963e-04:  54%|█████▍    | 8867/16329 [1:14:40<1:01:44,  2.01it/s][A
epoch 1 iter 8867: train loss 1.52313. lr 4.972963e-04:  54%|█████▍    | 8868/16329 [1:14:40<1:01:49,  2.01it/s][A
epoch 1 iter 8868: train loss 1.49608. lr 4.972745e-04:  54%|█████▍    | 8868/16329 [1:14:41<1:01:49,  2.01it/s][A

epoch 1 iter 8899: train loss 1.43728. lr 4.965996e-04:  55%|█████▍    | 8900/16329 [1:14:56<1:01:33,  2.01it/s][A
epoch 1 iter 8900: train loss 1.51569. lr 4.965778e-04:  55%|█████▍    | 8900/16329 [1:14:57<1:01:33,  2.01it/s][A
epoch 1 iter 8900: train loss 1.51569. lr 4.965778e-04:  55%|█████▍    | 8901/16329 [1:14:57<1:01:23,  2.02it/s][A
epoch 1 iter 8901: train loss 1.45017. lr 4.965560e-04:  55%|█████▍    | 8901/16329 [1:14:57<1:01:23,  2.02it/s][A
epoch 1 iter 8901: train loss 1.45017. lr 4.965560e-04:  55%|█████▍    | 8902/16329 [1:14:57<1:01:28,  2.01it/s][A
epoch 1 iter 8902: train loss 1.50329. lr 4.965342e-04:  55%|█████▍    | 8902/16329 [1:14:58<1:01:28,  2.01it/s][A
epoch 1 iter 8902: train loss 1.50329. lr 4.965342e-04:  55%|█████▍    | 8903/16329 [1:14:58<1:01:29,  2.01it/s][A
epoch 1 iter 8903: train loss 1.51445. lr 4.965124e-04:  55%|█████▍    | 8903/16329 [1:14:58<1:01:29,  2.01it/s][A

epoch 1 iter 8934: train loss 1.46513. lr 4.958355e-04:  55%|█████▍    | 8935/16329 [1:15:14<1:01:24,  2.01it/s][A
epoch 1 iter 8935: train loss 1.45466. lr 4.958137e-04:  55%|█████▍    | 8935/16329 [1:15:14<1:01:24,  2.01it/s][A
epoch 1 iter 8935: train loss 1.45466. lr 4.958137e-04:  55%|█████▍    | 8936/16329 [1:15:14<1:01:08,  2.02it/s][A
epoch 1 iter 8936: train loss 1.45272. lr 4.957918e-04:  55%|█████▍    | 8936/16329 [1:15:15<1:01:08,  2.02it/s][A
epoch 1 iter 8936: train loss 1.45272. lr 4.957918e-04:  55%|█████▍    | 8937/16329 [1:15:15<1:01:10,  2.01it/s][A
epoch 1 iter 8937: train loss 1.46094. lr 4.957699e-04:  55%|█████▍    | 8937/16329 [1:15:15<1:01:10,  2.01it/s][A
epoch 1 iter 8937: train loss 1.46094. lr 4.957699e-04:  55%|█████▍    | 8938/16329 [1:15:15<1:01:10,  2.01it/s][A
epoch 1 iter 8938: train loss 1.46976. lr 4.957481e-04:  55%|█████▍    | 8938/16329 [1:15:16<1:01:10,  2.01it/s][A

epoch 1 iter 8969: train loss 1.48697. lr 4.950692e-04:  55%|█████▍    | 8970/16329 [1:15:32<1:00:54,  2.01it/s][A
epoch 1 iter 8970: train loss 1.47189. lr 4.950473e-04:  55%|█████▍    | 8970/16329 [1:15:32<1:00:54,  2.01it/s][A
epoch 1 iter 8970: train loss 1.47189. lr 4.950473e-04:  55%|█████▍    | 8971/16329 [1:15:32<1:01:01,  2.01it/s][A
epoch 1 iter 8971: train loss 1.48880. lr 4.950254e-04:  55%|█████▍    | 8971/16329 [1:15:33<1:01:01,  2.01it/s][A
epoch 1 iter 8971: train loss 1.48880. lr 4.950254e-04:  55%|█████▍    | 8972/16329 [1:15:33<1:00:51,  2.02it/s][A
epoch 1 iter 8972: train loss 1.48384. lr 4.950034e-04:  55%|█████▍    | 8972/16329 [1:15:33<1:00:51,  2.02it/s][A
epoch 1 iter 8972: train loss 1.48384. lr 4.950034e-04:  55%|█████▍    | 8973/16329 [1:15:33<1:01:02,  2.01it/s][A
epoch 1 iter 8973: train loss 1.50786. lr 4.949815e-04:  55%|█████▍    | 8973/16329 [1:15:34<1:01:02,  2.01it/s][A

epoch 1 iter 9004: train loss 1.47646. lr 4.943007e-04:  55%|█████▌    | 9005/16329 [1:15:49<1:04:20,  1.90it/s][A
epoch 1 iter 9005: train loss 1.49861. lr 4.942787e-04:  55%|█████▌    | 9005/16329 [1:15:50<1:04:20,  1.90it/s][A
epoch 1 iter 9005: train loss 1.49861. lr 4.942787e-04:  55%|█████▌    | 9006/16329 [1:15:50<1:04:11,  1.90it/s][A
epoch 1 iter 9006: train loss 1.47433. lr 4.942567e-04:  55%|█████▌    | 9006/16329 [1:15:51<1:04:11,  1.90it/s][A
epoch 1 iter 9006: train loss 1.47433. lr 4.942567e-04:  55%|█████▌    | 9007/16329 [1:15:51<1:03:46,  1.91it/s][A
epoch 1 iter 9007: train loss 1.46248. lr 4.942347e-04:  55%|█████▌    | 9007/16329 [1:15:51<1:03:46,  1.91it/s][A
epoch 1 iter 9007: train loss 1.46248. lr 4.942347e-04:  55%|█████▌    | 9008/16329 [1:15:51<1:03:19,  1.93it/s][A
epoch 1 iter 9008: train loss 1.51831. lr 4.942127e-04:  55%|█████▌    | 9008/16329 [1:15:52<1:03:19,  1.93it/s][A

epoch 1 iter 9039: train loss 1.51372. lr 4.935300e-04:  55%|█████▌    | 9040/16329 [1:16:07<1:00:14,  2.02it/s][A
epoch 1 iter 9040: train loss 1.46359. lr 4.935079e-04:  55%|█████▌    | 9040/16329 [1:16:08<1:00:14,  2.02it/s][A
epoch 1 iter 9040: train loss 1.46359. lr 4.935079e-04:  55%|█████▌    | 9041/16329 [1:16:08<1:00:13,  2.02it/s][A
epoch 1 iter 9041: train loss 1.49031. lr 4.934859e-04:  55%|█████▌    | 9041/16329 [1:16:08<1:00:13,  2.02it/s][A
epoch 1 iter 9041: train loss 1.49031. lr 4.934859e-04:  55%|█████▌    | 9042/16329 [1:16:08<1:00:05,  2.02it/s][A
epoch 1 iter 9042: train loss 1.46779. lr 4.934638e-04:  55%|█████▌    | 9042/16329 [1:16:09<1:00:05,  2.02it/s][A
epoch 1 iter 9042: train loss 1.46779. lr 4.934638e-04:  55%|█████▌    | 9043/16329 [1:16:09<59:57,  2.03it/s]  [A
epoch 1 iter 9043: train loss 1.50015. lr 4.934418e-04:  55%|█████▌    | 9043/16329 [1:16:09<59:57,  2.03it/s][A

epoch 1 iter 10528: train loss 1.34496. lr 4.588245e-04:  64%|██████▍   | 10528/16329 [1:28:39<49:59,  1.93it/s][A
epoch 1 iter 10528: train loss 1.34496. lr 4.588245e-04:  64%|██████▍   | 10529/16329 [1:28:39<51:14,  1.89it/s][A
epoch 1 iter 10529: train loss 1.33383. lr 4.588000e-04:  64%|██████▍   | 10529/16329 [1:28:40<51:14,  1.89it/s][A
epoch 1 iter 10529: train loss 1.33383. lr 4.588000e-04:  64%|██████▍   | 10530/16329 [1:28:40<51:46,  1.87it/s][A
epoch 1 iter 10530: train loss 1.32989. lr 4.587755e-04:  64%|██████▍   | 10530/16329 [1:28:41<51:46,  1.87it/s][A
epoch 1 iter 10530: train loss 1.32989. lr 4.587755e-04:  64%|██████▍   | 10531/16329 [1:28:41<51:37,  1.87it/s][A
epoch 1 iter 10531: train loss 1.33089. lr 4.587510e-04:  64%|██████▍   | 10531/16329 [1:28:41<51:37,  1.87it/s][A
epoch 1 iter 10531: train loss 1.33089. lr 4.587510e-04:  64%|██████▍   | 10532/16329 [1:28:41<51:14,  1.89it/s][A
epoch 1 iter 10532: train loss 1.36000. lr 4.587265e-04:  64%|██████▍   

epoch 1 iter 10563: train loss 1.35857. lr 4.579666e-04:  65%|██████▍   | 10563/16329 [1:28:57<47:34,  2.02it/s][A
epoch 1 iter 10563: train loss 1.35857. lr 4.579666e-04:  65%|██████▍   | 10564/16329 [1:28:57<47:30,  2.02it/s][A
epoch 1 iter 10564: train loss 1.38539. lr 4.579421e-04:  65%|██████▍   | 10564/16329 [1:28:58<47:30,  2.02it/s][A
epoch 1 iter 10564: train loss 1.38539. lr 4.579421e-04:  65%|██████▍   | 10565/16329 [1:28:58<47:34,  2.02it/s][A
epoch 1 iter 10565: train loss 1.36180. lr 4.579176e-04:  65%|██████▍   | 10565/16329 [1:28:58<47:34,  2.02it/s][A
epoch 1 iter 10565: train loss 1.36180. lr 4.579176e-04:  65%|██████▍   | 10566/16329 [1:28:58<47:34,  2.02it/s][A
epoch 1 iter 10566: train loss 1.36164. lr 4.578930e-04:  65%|██████▍   | 10566/16329 [1:28:59<47:34,  2.02it/s][A
epoch 1 iter 10566: train loss 1.36164. lr 4.578930e-04:  65%|██████▍   | 10567/16329 [1:28:59<47:37,  2.02it/s][A
epoch 1 iter 10567: train loss 1.34994. lr 4.578685e-04:  65%|██████▍   

epoch 1 iter 10598: train loss 1.39626. lr 4.571070e-04:  65%|██████▍   | 10598/16329 [1:29:15<47:17,  2.02it/s][A
epoch 1 iter 10598: train loss 1.39626. lr 4.571070e-04:  65%|██████▍   | 10599/16329 [1:29:15<47:18,  2.02it/s][A
epoch 1 iter 10599: train loss 1.38377. lr 4.570824e-04:  65%|██████▍   | 10599/16329 [1:29:15<47:18,  2.02it/s][A
epoch 1 iter 10599: train loss 1.38377. lr 4.570824e-04:  65%|██████▍   | 10600/16329 [1:29:15<47:10,  2.02it/s][A
epoch 1 iter 10600: train loss 1.35215. lr 4.570578e-04:  65%|██████▍   | 10600/16329 [1:29:16<47:10,  2.02it/s][A
epoch 1 iter 10600: train loss 1.35215. lr 4.570578e-04:  65%|██████▍   | 10601/16329 [1:29:16<47:15,  2.02it/s][A
epoch 1 iter 10601: train loss 1.33823. lr 4.570332e-04:  65%|██████▍   | 10601/16329 [1:29:16<47:15,  2.02it/s][A
epoch 1 iter 10601: train loss 1.33823. lr 4.570332e-04:  65%|██████▍   | 10602/16329 [1:29:16<47:10,  2.02it/s][A
epoch 1 iter 10602: train loss 1.37373. lr 4.570086e-04:  65%|██████▍   

epoch 1 iter 10633: train loss 1.32715. lr 4.562456e-04:  65%|██████▌   | 10633/16329 [1:29:32<48:59,  1.94it/s][A
epoch 1 iter 10633: train loss 1.32715. lr 4.562456e-04:  65%|██████▌   | 10634/16329 [1:29:32<48:52,  1.94it/s][A
epoch 1 iter 10634: train loss 1.36048. lr 4.562209e-04:  65%|██████▌   | 10634/16329 [1:29:33<48:52,  1.94it/s][A
epoch 1 iter 10634: train loss 1.36048. lr 4.562209e-04:  65%|██████▌   | 10635/16329 [1:29:33<48:38,  1.95it/s][A
epoch 1 iter 10635: train loss 1.38198. lr 4.561963e-04:  65%|██████▌   | 10635/16329 [1:29:33<48:38,  1.95it/s][A
epoch 1 iter 10635: train loss 1.38198. lr 4.561963e-04:  65%|██████▌   | 10636/16329 [1:29:33<53:22,  1.78it/s][A
epoch 1 iter 10636: train loss 1.33111. lr 4.561717e-04:  65%|██████▌   | 10636/16329 [1:29:34<53:22,  1.78it/s][A
epoch 1 iter 10636: train loss 1.33111. lr 4.561717e-04:  65%|██████▌   | 10637/16329 [1:29:34<51:26,  1.84it/s][A
epoch 1 iter 10637: train loss 1.30661. lr 4.561470e-04:  65%|██████▌   

epoch 1 iter 10668: train loss 1.31434. lr 4.553824e-04:  65%|██████▌   | 10668/16329 [1:29:50<48:05,  1.96it/s][A
epoch 1 iter 10668: train loss 1.31434. lr 4.553824e-04:  65%|██████▌   | 10669/16329 [1:29:50<47:38,  1.98it/s][A
epoch 1 iter 10669: train loss 1.36509. lr 4.553577e-04:  65%|██████▌   | 10669/16329 [1:29:51<47:38,  1.98it/s][A
epoch 1 iter 10669: train loss 1.36509. lr 4.553577e-04:  65%|██████▌   | 10670/16329 [1:29:51<47:24,  1.99it/s][A
epoch 1 iter 10670: train loss 1.33433. lr 4.553330e-04:  65%|██████▌   | 10670/16329 [1:29:51<47:24,  1.99it/s][A
epoch 1 iter 10670: train loss 1.33433. lr 4.553330e-04:  65%|██████▌   | 10671/16329 [1:29:51<47:11,  2.00it/s][A
epoch 1 iter 10671: train loss 1.33626. lr 4.553083e-04:  65%|██████▌   | 10671/16329 [1:29:52<47:11,  2.00it/s][A
epoch 1 iter 10671: train loss 1.33626. lr 4.553083e-04:  65%|██████▌   | 10672/16329 [1:29:52<47:01,  2.00it/s][A
epoch 1 iter 10672: train loss 1.37192. lr 4.552836e-04:  65%|██████▌   

epoch 1 iter 10703: train loss 1.38178. lr 4.545175e-04:  66%|██████▌   | 10703/16329 [1:30:08<54:02,  1.73it/s][A
epoch 1 iter 10703: train loss 1.38178. lr 4.545175e-04:  66%|██████▌   | 10704/16329 [1:30:08<52:00,  1.80it/s][A
epoch 1 iter 10704: train loss 1.35977. lr 4.544927e-04:  66%|██████▌   | 10704/16329 [1:30:08<52:00,  1.80it/s][A
epoch 1 iter 10704: train loss 1.35977. lr 4.544927e-04:  66%|██████▌   | 10705/16329 [1:30:08<50:29,  1.86it/s][A
epoch 1 iter 10705: train loss 1.33478. lr 4.544680e-04:  66%|██████▌   | 10705/16329 [1:30:09<50:29,  1.86it/s][A
epoch 1 iter 10705: train loss 1.33478. lr 4.544680e-04:  66%|██████▌   | 10706/16329 [1:30:09<49:18,  1.90it/s][A
epoch 1 iter 10706: train loss 1.33860. lr 4.544432e-04:  66%|██████▌   | 10706/16329 [1:30:09<49:18,  1.90it/s][A
epoch 1 iter 10706: train loss 1.33860. lr 4.544432e-04:  66%|██████▌   | 10707/16329 [1:30:09<48:35,  1.93it/s][A
epoch 1 iter 10707: train loss 1.33270. lr 4.544185e-04:  66%|██████▌   

epoch 1 iter 10738: train loss 1.34811. lr 4.536508e-04:  66%|██████▌   | 10738/16329 [1:30:25<51:37,  1.81it/s][A
epoch 1 iter 10738: train loss 1.34811. lr 4.536508e-04:  66%|██████▌   | 10739/16329 [1:30:25<49:48,  1.87it/s][A
epoch 1 iter 10739: train loss 1.37406. lr 4.536260e-04:  66%|██████▌   | 10739/16329 [1:30:26<49:48,  1.87it/s][A
epoch 1 iter 10739: train loss 1.37406. lr 4.536260e-04:  66%|██████▌   | 10740/16329 [1:30:26<48:52,  1.91it/s][A
epoch 1 iter 10740: train loss 1.37459. lr 4.536012e-04:  66%|██████▌   | 10740/16329 [1:30:26<48:52,  1.91it/s][A
epoch 1 iter 10740: train loss 1.37459. lr 4.536012e-04:  66%|██████▌   | 10741/16329 [1:30:26<48:01,  1.94it/s][A
epoch 1 iter 10741: train loss 1.35764. lr 4.535764e-04:  66%|██████▌   | 10741/16329 [1:30:27<48:01,  1.94it/s][A
epoch 1 iter 10741: train loss 1.35764. lr 4.535764e-04:  66%|██████▌   | 10742/16329 [1:30:27<47:35,  1.96it/s][A
epoch 1 iter 10742: train loss 1.35062. lr 4.535516e-04:  66%|██████▌   

epoch 1 iter 10773: train loss 1.35911. lr 4.527823e-04:  66%|██████▌   | 10773/16329 [1:30:43<46:00,  2.01it/s][A
epoch 1 iter 10773: train loss 1.35911. lr 4.527823e-04:  66%|██████▌   | 10774/16329 [1:30:43<45:57,  2.01it/s][A
epoch 1 iter 10774: train loss 1.32303. lr 4.527575e-04:  66%|██████▌   | 10774/16329 [1:30:44<45:57,  2.01it/s][A
epoch 1 iter 10774: train loss 1.32303. lr 4.527575e-04:  66%|██████▌   | 10775/16329 [1:30:44<45:52,  2.02it/s][A
epoch 1 iter 10775: train loss 1.35416. lr 4.527326e-04:  66%|██████▌   | 10775/16329 [1:30:44<45:52,  2.02it/s][A
epoch 1 iter 10775: train loss 1.35416. lr 4.527326e-04:  66%|██████▌   | 10776/16329 [1:30:44<45:49,  2.02it/s][A
epoch 1 iter 10776: train loss 1.34916. lr 4.527078e-04:  66%|██████▌   | 10776/16329 [1:30:45<45:49,  2.02it/s][A
epoch 1 iter 10776: train loss 1.34916. lr 4.527078e-04:  66%|██████▌   | 10777/16329 [1:30:45<45:55,  2.01it/s][A
epoch 1 iter 10777: train loss 1.37738. lr 4.526830e-04:  66%|██████▌   

epoch 1 iter 10808: train loss 1.37666. lr 4.519122e-04:  66%|██████▌   | 10808/16329 [1:31:01<47:38,  1.93it/s][A
epoch 1 iter 10808: train loss 1.37666. lr 4.519122e-04:  66%|██████▌   | 10809/16329 [1:31:01<47:37,  1.93it/s][A
epoch 1 iter 10809: train loss 1.30956. lr 4.518873e-04:  66%|██████▌   | 10809/16329 [1:31:01<47:37,  1.93it/s][A
epoch 1 iter 10809: train loss 1.30956. lr 4.518873e-04:  66%|██████▌   | 10810/16329 [1:31:01<47:24,  1.94it/s][A
epoch 1 iter 10810: train loss 1.32163. lr 4.518624e-04:  66%|██████▌   | 10810/16329 [1:31:02<47:24,  1.94it/s][A
epoch 1 iter 10810: train loss 1.32163. lr 4.518624e-04:  66%|██████▌   | 10811/16329 [1:31:02<47:05,  1.95it/s][A
epoch 1 iter 10811: train loss 1.33886. lr 4.518375e-04:  66%|██████▌   | 10811/16329 [1:31:02<47:05,  1.95it/s][A
epoch 1 iter 10811: train loss 1.33886. lr 4.518375e-04:  66%|██████▌   | 10812/16329 [1:31:02<46:46,  1.97it/s][A
epoch 1 iter 10812: train loss 1.30643. lr 4.518126e-04:  66%|██████▌   

epoch 1 iter 10843: train loss 1.31889. lr 4.510403e-04:  66%|██████▋   | 10843/16329 [1:31:18<45:16,  2.02it/s]
epoch 1 iter 10844: train loss 1.33020. lr 4.510153e-04:  66%|██████▋   | 10844/16329 [1:31:19<45:15,  2.02it/s]
epoch 1 iter 10845: train loss 1.31181. lr 4.509904e-04:  66%|██████▋   | 10845/16329 [1:31:19<45:19,  2.02it/s]
epoch 1 iter 10846: train loss 1.32344. lr 4.509655e-04:  66%|██████▋   | 10846/16329 [1:31:20<45:09,  2.02it/s]
epoch 1 iter 10847: train loss 1.33812. lr 4.509405e-04:  66%|██████▋   

epoch 1 iter 10878: train loss 1.33060. lr 4.501667e-04:  67%|██████▋   | 10878/16329 [1:31:36<45:06,  2.01it/s]
epoch 1 iter 10879: train loss 1.31984. lr 4.501417e-04:  67%|██████▋   | 10879/16329 [1:31:37<45:05,  2.01it/s]
epoch 1 iter 10880: train loss 1.34706. lr 4.501167e-04:  67%|██████▋   | 10880/16329 [1:31:37<45:08,  2.01it/s]
epoch 1 iter 10881: train loss 1.33200. lr 4.500917e-04:  67%|██████▋   | 10881/16329 [1:31:38<45:06,  2.01it/s]
epoch 1 iter 10882: train loss 1.31574. lr 4.500667e-04:  67%|██████▋   

epoch 1 iter 10913: train loss 1.32256. lr 4.492914e-04:  67%|██████▋   | 10913/16329 [1:31:54<46:52,  1.93it/s]
epoch 1 iter 10914: train loss 1.32822. lr 4.492663e-04:  67%|██████▋   | 10914/16329 [1:31:54<46:42,  1.93it/s]
epoch 1 iter 10915: train loss 1.30623. lr 4.492413e-04:  67%|██████▋   | 10915/16329 [1:31:55<46:26,  1.94it/s]
epoch 1 iter 10916: train loss 1.29473. lr 4.492162e-04:  67%|██████▋   | 10916/16329 [1:31:55<46:12,  1.95it/s]
epoch 1 iter 10917: train loss 1.31160. lr 4.491912e-04:  67%|██████▋   

epoch 1 iter 10948: train loss 1.31536. lr 4.484144e-04:  67%|██████▋   | 10948/16329 [1:32:11<44:30,  2.02it/s]
epoch 1 iter 10949: train loss 1.32938. lr 4.483893e-04:  67%|██████▋   | 10949/16329 [1:32:12<44:33,  2.01it/s]
epoch 1 iter 10950: train loss 1.32029. lr 4.483642e-04:  67%|██████▋   | 10950/16329 [1:32:12<44:33,  2.01it/s]
epoch 1 iter 10951: train loss 1.32221. lr 4.483391e-04:  67%|██████▋   | 10951/16329 [1:32:13<44:37,  2.01it/s]
epoch 1 iter 10952: train loss 1.32350. lr 4.483140e-04:  67%|██████▋   

epoch 1 iter 10983: train loss 1.36662. lr 4.475357e-04:  67%|██████▋   | 10983/16329 [1:32:29<45:29,  1.96it/s]
epoch 1 iter 10984: train loss 1.32683. lr 4.475105e-04:  67%|██████▋   | 10984/16329 [1:32:30<45:31,  1.96it/s]
epoch 1 iter 10985: train loss 1.34501. lr 4.474854e-04:  67%|██████▋   | 10985/16329 [1:32:30<45:30,  1.96it/s]
epoch 1 iter 10986: train loss 1.32526. lr 4.474603e-04:  67%|██████▋   | 10986/16329 [1:32:31<45:15,  1.97it/s]
epoch 1 iter 10987: train loss 1.33811. lr 4.474351e-04:  67%|██████▋   

epoch 1 iter 11018: train loss 1.31541. lr 4.466553e-04:  67%|██████▋   | 11018/16329 [1:32:47<47:10,  1.88it/s]
epoch 1 iter 11019: train loss 1.33697. lr 4.466301e-04:  67%|██████▋   | 11019/16329 [1:32:48<46:04,  1.92it/s]
epoch 1 iter 11020: train loss 1.38006. lr 4.466050e-04:  67%|██████▋   | 11020/16329 [1:32:48<45:25,  1.95it/s]
epoch 1 iter 11021: train loss 1.33705. lr 4.465798e-04:  67%|██████▋   | 11021/16329 [1:32:49<44:55,  1.97it/s]
epoch 1 iter 11022: train loss 1.32598. lr 4.465546e-04:  67%|██████▋   

epoch 1 iter 11053: train loss 1.29391. lr 4.457733e-04:  68%|██████▊   | 11053/16329 [1:33:05<43:31,  2.02it/s]
epoch 1 iter 11054: train loss 1.30710. lr 4.457481e-04:  68%|██████▊   | 11054/16329 [1:33:05<43:34,  2.02it/s]
epoch 1 iter 11055: train loss 1.30059. lr 4.457228e-04:  68%|██████▊   | 11055/16329 [1:33:06<43:24,  2.02it/s]
epoch 1 iter 11056: train loss 1.32898. lr 4.456976e-04:  68%|██████▊   | 11056/16329 [1:33:06<43:17,  2.03it/s]
epoch 1 iter 11057: train loss 1.30734. lr 4.456724e-04:  68%|██████▊   

epoch 1 iter 11088: train loss 1.31611. lr 4.448896e-04:  68%|██████▊   | 11088/16329 [1:33:22<44:31,  1.96it/s]
epoch 1 iter 11089: train loss 1.30944. lr 4.448644e-04:  68%|██████▊   | 11089/16329 [1:33:23<44:09,  1.98it/s]
epoch 1 iter 11090: train loss 1.30666. lr 4.448391e-04:  68%|██████▊   | 11090/16329 [1:33:23<43:48,  1.99it/s]
epoch 1 iter 11091: train loss 1.31804. lr 4.448138e-04:  68%|██████▊   | 11091/16329 [1:33:24<43:38,  2.00it/s]
epoch 1 iter 11092: train loss 1.29498. lr 4.447885e-04:  68%|██████▊   

epoch 1 iter 11123: train loss 1.32516. lr 4.440043e-04:  68%|██████▊   | 11123/16329 [1:33:40<44:08,  1.97it/s]
epoch 1 iter 11124: train loss 1.31255. lr 4.439790e-04:  68%|██████▊   | 11124/16329 [1:33:41<43:44,  1.98it/s]
epoch 1 iter 11125: train loss 1.34573. lr 4.439537e-04:  68%|██████▊   | 11125/16329 [1:33:41<43:22,  2.00it/s]
epoch 1 iter 11126: train loss 1.29459. lr 4.439284e-04:  68%|██████▊   | 11126/16329 [1:33:42<43:21,  2.00it/s]
epoch 1 iter 11127: train loss 1.34939. lr 4.439030e-04:  68%|██████▊   

epoch 1 iter 11158: train loss 1.31066. lr 4.431174e-04:  68%|██████▊   | 11158/16329 [1:33:58<44:02,  1.96it/s]
epoch 1 iter 11159: train loss 1.32179. lr 4.430920e-04:  68%|██████▊   | 11159/16329 [1:33:58<44:03,  1.96it/s]
epoch 1 iter 11160: train loss 1.31850. lr 4.430666e-04:  68%|██████▊   | 11160/16329 [1:33:59<43:46,  1.97it/s]
epoch 1 iter 11161: train loss 1.31731. lr 4.430413e-04:  68%|██████▊   | 11161/16329 [1:33:59<43:36,  1.98it/s]
epoch 1 iter 11162: train loss 1.33045. lr 4.430159e-04:  68%|██████▊   

epoch 1 iter 11193: train loss 1.29773. lr 4.422288e-04:  69%|██████▊   | 11193/16329 [1:34:15<42:19,  2.02it/s]
epoch 1 iter 11194: train loss 1.31773. lr 4.422034e-04:  69%|██████▊   | 11194/16329 [1:34:16<42:22,  2.02it/s]
epoch 1 iter 11195: train loss 1.31324. lr 4.421780e-04:  69%|██████▊   | 11195/16329 [1:34:16<42:19,  2.02it/s]
epoch 1 iter 11196: train loss 1.36002. lr 4.421526e-04:  69%|██████▊   | 11196/16329 [1:34:17<42:25,  2.02it/s]
epoch 1 iter 11197: train loss 1.29831. lr 4.421272e-04:  69%|██████▊   

epoch 1 iter 11228: train loss 1.30848. lr 4.413386e-04:  69%|██████▉   | 11228/16329 [1:34:33<42:02,  2.02it/s]
epoch 1 iter 11229: train loss 1.29742. lr 4.413132e-04:  69%|██████▉   | 11229/16329 [1:34:34<41:55,  2.03it/s]
epoch 1 iter 11230: train loss 1.29558. lr 4.412877e-04:  69%|██████▉   | 11230/16329 [1:34:34<42:02,  2.02it/s]
epoch 1 iter 11231: train loss 1.30011. lr 4.412622e-04:  69%|██████▉   | 11231/16329 [1:34:34<42:02,  2.02it/s]
epoch 1 iter 11232: train loss 1.28928. lr 4.412368e-04:  69%|██████▉   

epoch 1 iter 11263: train loss 1.33091. lr 4.404468e-04:  69%|██████▉   | 11263/16329 [1:34:51<42:01,  2.01it/s]
epoch 1 iter 11264: train loss 1.31894. lr 4.404213e-04:  69%|██████▉   | 11264/16329 [1:34:51<44:06,  1.91it/s]
epoch 1 iter 11265: train loss 1.28559. lr 4.403958e-04:  69%|██████▉   | 11265/16329 [1:34:52<45:10,  1.87it/s]
epoch 1 iter 11266: train loss 1.31076. lr 4.403703e-04:  69%|██████▉   | 11266/16329 [1:34:52<45:21,  1.86it/s]
epoch 1 iter 11267: train loss 1.30372. lr 4.403448e-04:  69%|██████▉   

epoch 1 iter 11298: train loss 1.28284. lr 4.395535e-04:  69%|██████▉   | 11298/16329 [1:35:08<45:48,  1.83it/s]
epoch 1 iter 11299: train loss 1.33264. lr 4.395279e-04:  69%|██████▉   | 11299/16329 [1:35:09<44:24,  1.89it/s]
epoch 1 iter 11300: train loss 1.34406. lr 4.395024e-04:  69%|██████▉   | 11300/16329 [1:35:09<43:33,  1.92it/s]
epoch 1 iter 11301: train loss 1.35103. lr 4.394768e-04:  69%|██████▉   | 11301/16329 [1:35:10<42:45,  1.96it/s]
epoch 1 iter 11302: train loss 1.31145. lr 4.394513e-04:  69%|██████▉   

epoch 1 iter 11333: train loss 1.28248. lr 4.386585e-04:  69%|██████▉   | 11333/16329 [1:35:26<41:06,  2.03it/s]
epoch 1 iter 11334: train loss 1.32128. lr 4.386329e-04:  69%|██████▉   | 11334/16329 [1:35:26<41:06,  2.02it/s]
epoch 1 iter 11335: train loss 1.29350. lr 4.386073e-04:  69%|██████▉   | 11335/16329 [1:35:27<41:09,  2.02it/s]
epoch 1 iter 11336: train loss 1.29846. lr 4.385817e-04:  69%|██████▉   | 11336/16329 [1:35:27<41:13,  2.02it/s]
epoch 1 iter 11337: train loss 1.28088. lr 4.385561e-04:  69%|██████▉   

epoch 1 iter 11368: train loss 1.27686. lr 4.377620e-04:  70%|██████▉   | 11368/16329 [1:35:44<40:54,  2.02it/s]
epoch 1 iter 11369: train loss 1.28729. lr 4.377363e-04:  70%|██████▉   | 11369/16329 [1:35:44<40:49,  2.02it/s]
epoch 1 iter 11370: train loss 1.27167. lr 4.377107e-04:  70%|██████▉   | 11370/16329 [1:35:45<40:47,  2.03it/s]
epoch 1 iter 11371: train loss 1.30355. lr 4.376851e-04:  70%|██████▉   | 11371/16329 [1:35:45<40:55,  2.02it/s]
epoch 1 iter 11372: train loss 1.32420. lr 4.376594e-04:  70%|██████▉   

epoch 1 iter 11403: train loss 1.32432. lr 4.368639e-04:  70%|██████▉   | 11403/16329 [1:36:01<41:19,  1.99it/s]
epoch 1 iter 11404: train loss 1.27690. lr 4.368382e-04:  70%|██████▉   | 11404/16329 [1:36:02<41:00,  2.00it/s]
epoch 1 iter 11405: train loss 1.31378. lr 4.368125e-04:  70%|██████▉   | 11405/16329 [1:36:02<40:58,  2.00it/s]
epoch 1 iter 11406: train loss 1.30085. lr 4.367868e-04:  70%|██████▉   | 11406/16329 [1:36:03<40:46,  2.01it/s]
epoch 1 iter 11407: train loss 1.29577. lr 4.367612e-04:  70%|██████▉   

epoch 1 iter 11438: train loss 1.26814. lr 4.359642e-04:  70%|███████   | 11438/16329 [1:36:19<40:18,  2.02it/s]
epoch 1 iter 11439: train loss 1.28268. lr 4.359385e-04:  70%|███████   | 11439/16329 [1:36:20<41:20,  1.97it/s]
epoch 1 iter 11440: train loss 1.30432. lr 4.359128e-04:  70%|███████   | 11440/16329 [1:36:20<41:49,  1.95it/s]
epoch 1 iter 11441: train loss 1.26232. lr 4.358871e-04:  70%|███████   | 11441/16329 [1:36:21<41:45,  1.95it/s]
epoch 1 iter 11442: train loss 1.29177. lr 4.358613e-04:  70%|███████   

epoch 1 iter 11473: train loss 1.30410. lr 4.350631e-04:  70%|███████   | 11473/16329 [1:36:37<40:15,  2.01it/s]
epoch 1 iter 11474: train loss 1.30987. lr 4.350373e-04:  70%|███████   | 11474/16329 [1:36:37<40:15,  2.01it/s]
epoch 1 iter 11475: train loss 1.30252. lr 4.350115e-04:  70%|███████   | 11475/16329 [1:36:38<40:08,  2.02it/s]
epoch 1 iter 11476: train loss 1.32620. lr 4.349858e-04:  70%|███████   | 11476/16329 [1:36:38<40:09,  2.01it/s]
epoch 1 iter 11477: train loss 1.29241. lr 4.349600e-04:  70%|███████   

epoch 1 iter 11508: train loss 1.29295. lr 4.341604e-04:  70%|███████   | 11508/16329 [1:36:54<40:03,  2.01it/s]
epoch 1 iter 11509: train loss 1.26793. lr 4.341345e-04:  70%|███████   | 11509/16329 [1:36:55<39:56,  2.01it/s]
epoch 1 iter 11510: train loss 1.27226. lr 4.341087e-04:  70%|███████   | 11510/16329 [1:36:55<39:57,  2.01it/s]
epoch 1 iter 11511: train loss 1.27693. lr 4.340829e-04:  70%|███████   | 11511/16329 [1:36:56<39:48,  2.02it/s]
epoch 1 iter 11512: train loss 1.34385. lr 4.340571e-04:  71%|███████   

epoch 1 iter 11543: train loss 1.27144. lr 4.332561e-04:  71%|███████   | 11543/16329 [1:37:12<39:44,  2.01it/s]
epoch 1 iter 11544: train loss 1.31126. lr 4.332303e-04:  71%|███████   | 11544/16329 [1:37:12<39:41,  2.01it/s]
epoch 1 iter 11545: train loss 1.30727. lr 4.332044e-04:  71%|███████   | 11545/16329 [1:37:13<39:34,  2.01it/s]
epoch 1 iter 11546: train loss 1.28309. lr 4.331785e-04:  71%|███████   | 11546/16329 [1:37:13<39:33,  2.02it/s]
epoch 1 iter 11547: train loss 1.27931. lr 4.331527e-04:  71%|███████   

epoch 1 iter 11578: train loss 1.26581. lr 4.323504e-04:  71%|███████   | 11578/16329 [1:37:29<39:13,  2.02it/s]
epoch 1 iter 11579: train loss 1.30653. lr 4.323245e-04:  71%|███████   | 11579/16329 [1:37:30<39:20,  2.01it/s]
epoch 1 iter 11580: train loss 1.27931. lr 4.322986e-04:  71%|███████   | 11580/16329 [1:37:30<39:17,  2.01it/s]
epoch 1 iter 11581: train loss 1.29083. lr 4.322727e-04:  71%|███████   | 11581/16329 [1:37:31<39:20,  2.01it/s]
epoch 1 iter 11582: train loss 1.27546. lr 4.322468e-04:  71%|███████   

epoch 1 iter 11613: train loss 1.29060. lr 4.314431e-04:  71%|███████   | 11613/16329 [1:37:47<39:14,  2.00it/s]
epoch 1 iter 11614: train loss 1.31262. lr 4.314172e-04:  71%|███████   | 11614/16329 [1:37:48<39:06,  2.01it/s]
epoch 1 iter 11615: train loss 1.25360. lr 4.313912e-04:  71%|███████   | 11615/16329 [1:37:48<39:00,  2.01it/s]
epoch 1 iter 11616: train loss 1.32056. lr 4.313653e-04:  71%|███████   | 11616/16329 [1:37:49<38:55,  2.02it/s]
epoch 1 iter 11617: train loss 1.24631. lr 4.313393e-04:  71%|███████   

epoch 1 iter 11648: train loss 1.27463. lr 4.305344e-04:  71%|███████▏  | 11648/16329 [1:38:05<38:39,  2.02it/s]
epoch 1 iter 11649: train loss 1.26184. lr 4.305084e-04:  71%|███████▏  | 11649/16329 [1:38:05<38:41,  2.02it/s]
epoch 1 iter 11650: train loss 1.30403. lr 4.304824e-04:  71%|███████▏  | 11650/16329 [1:38:06<38:37,  2.02it/s]
epoch 1 iter 11651: train loss 1.29517. lr 4.304564e-04:  71%|███████▏  | 11651/16329 [1:38:06<38:33,  2.02it/s]
epoch 1 iter 11652: train loss 1.27799. lr 4.304304e-04:  71%|███████▏  

epoch 1 iter 11683: train loss 1.26504. lr 4.296242e-04:  72%|███████▏  | 11683/16329 [1:38:23<39:59,  1.94it/s]
epoch 1 iter 11684: train loss 1.28002. lr 4.295981e-04:  72%|███████▏  | 11684/16329 [1:38:23<39:18,  1.97it/s]
epoch 1 iter 11685: train loss 1.25981. lr 4.295721e-04:  72%|███████▏  | 11685/16329 [1:38:24<39:02,  1.98it/s]
epoch 1 iter 11686: train loss 1.28375. lr 4.295461e-04:  72%|███████▏  | 11686/16329 [1:38:24<38:48,  1.99it/s]
epoch 1 iter 11687: train loss 1.27665. lr 4.295201e-04:  72%|███████▏  

epoch 1 iter 11718: train loss 1.24582. lr 4.287125e-04:  72%|███████▏  | 11718/16329 [1:38:40<38:11,  2.01it/s]
epoch 1 iter 11719: train loss 1.30848. lr 4.286864e-04:  72%|███████▏  | 11719/16329 [1:38:41<42:16,  1.82it/s]
epoch 1 iter 11720: train loss 1.23818. lr 4.286603e-04:  72%|███████▏  | 11720/16329 [1:38:41<40:56,  1.88it/s]
epoch 1 iter 11721: train loss 1.24742. lr 4.286343e-04:  72%|███████▏  | 11721/16329 [1:38:42<40:12,  1.91it/s]
epoch 1 iter 11722: train loss 1.26655. lr 4.286082e-04:  72%|███████▏  

epoch 1 iter 11753: train loss 1.28509. lr 4.277993e-04:  72%|███████▏  | 11753/16329 [1:38:58<37:49,  2.02it/s]
epoch 1 iter 11754: train loss 1.23732. lr 4.277732e-04:  72%|███████▏  | 11754/16329 [1:38:58<41:44,  1.83it/s]
epoch 1 iter 11755: train loss 1.25338. lr 4.277471e-04:  72%|███████▏  | 11755/16329 [1:38:59<40:36,  1.88it/s]
epoch 1 iter 11756: train loss 1.26820. lr 4.277210e-04:  72%|███████▏  | 11756/16329 [1:38:59<39:42,  1.92it/s]
epoch 1 iter 11757: train loss 1.27104. lr 4.276949e-04:  72%|███████▏  

epoch 1 iter 11788: train loss 1.27679. lr 4.268847e-04:  72%|███████▏  | 11788/16329 [1:39:15<37:38,  2.01it/s]
epoch 1 iter 11789: train loss 1.32104. lr 4.268586e-04:  72%|███████▏  | 11789/16329 [1:39:16<37:28,  2.02it/s]
epoch 1 iter 11790: train loss 1.30185. lr 4.268324e-04:  72%|███████▏  | 11790/16329 [1:39:16<37:28,  2.02it/s]
epoch 1 iter 11791: train loss 1.30406. lr 4.268063e-04:  72%|███████▏  | 11791/16329 [1:39:17<37:27,  2.02it/s]
epoch 1 iter 11792: train loss 1.24384. lr 4.267801e-04:  72%|███████▏  

epoch 1 iter 11823: train loss 1.26121. lr 4.259687e-04:  72%|███████▏  | 11823/16329 [1:39:33<37:09,  2.02it/s]
epoch 1 iter 11824: train loss 1.26259. lr 4.259425e-04:  72%|███████▏  | 11824/16329 [1:39:34<37:06,  2.02it/s]
epoch 1 iter 11825: train loss 1.28271. lr 4.259163e-04:  72%|███████▏  | 11825/16329 [1:39:34<37:07,  2.02it/s]
epoch 1 iter 11826: train loss 1.27996. lr 4.258901e-04:  72%|███████▏  | 11826/16329 [1:39:35<37:01,  2.03it/s]
epoch 1 iter 11827: train loss 1.27631. lr 4.258639e-04:  72%|███████▏  

epoch 1 iter 11858: train loss 1.27956. lr 4.250512e-04:  73%|███████▎  | 11858/16329 [1:39:51<36:48,  2.02it/s]
epoch 1 iter 11859: train loss 1.25480. lr 4.250250e-04:  73%|███████▎  | 11859/16329 [1:39:51<36:52,  2.02it/s]
epoch 1 iter 11860: train loss 1.25490. lr 4.249988e-04:  73%|███████▎  | 11860/16329 [1:39:52<36:50,  2.02it/s]
epoch 1 iter 11861: train loss 1.26851. lr 4.249725e-04:  73%|███████▎  | 11861/16329 [1:39:52<36:51,  2.02it/s]
epoch 1 iter 11862: train loss 1.23453. lr 4.249463e-04:  73%|███████▎  

epoch 1 iter 11893: train loss 1.28649. lr 4.241324e-04:  73%|███████▎  | 11893/16329 [1:40:08<36:33,  2.02it/s]
epoch 1 iter 11894: train loss 1.27157. lr 4.241061e-04:  73%|███████▎  | 11894/16329 [1:40:09<36:32,  2.02it/s]
epoch 1 iter 11895: train loss 1.27180. lr 4.240798e-04:  73%|███████▎  | 11895/16329 [1:40:09<36:29,  2.03it/s]
epoch 1 iter 11896: train loss 1.25566. lr 4.240535e-04:  73%|███████▎  | 11896/16329 [1:40:10<36:34,  2.02it/s]
epoch 1 iter 11897: train loss 1.29099. lr 4.240273e-04:  73%|███████▎  

epoch 1 iter 11928: train loss 1.23268. lr 4.232121e-04:  73%|███████▎  | 11928/16329 [1:40:26<36:11,  2.03it/s]
epoch 1 iter 11929: train loss 1.23631. lr 4.231858e-04:  73%|███████▎  | 11929/16329 [1:40:26<36:13,  2.02it/s]
epoch 1 iter 11930: train loss 1.25955. lr 4.231595e-04:  73%|███████▎  | 11930/16329 [1:40:27<36:20,  2.02it/s]
epoch 1 iter 11931: train loss 1.26414. lr 4.231331e-04:  73%|███████▎  | 11931/16329 [1:40:27<36:24,  2.01it/s]
epoch 1 iter 11932: train loss 1.26106. lr 4.231068e-04:  73%|███████▎  

epoch 1 iter 11963: train loss 1.22647. lr 4.222904e-04:  73%|███████▎  | 11963/16329 [1:40:44<36:27,  2.00it/s]
epoch 1 iter 11964: train loss 1.26875. lr 4.222640e-04:  73%|███████▎  | 11964/16329 [1:40:44<36:59,  1.97it/s]
epoch 1 iter 11965: train loss 1.26353. lr 4.222377e-04:  73%|███████▎  | 11965/16329 [1:40:45<37:22,  1.95it/s]
epoch 1 iter 11966: train loss 1.28748. lr 4.222113e-04:  73%|███████▎  | 11966/16329 [1:40:45<37:31,  1.94it/s]
epoch 1 iter 11967: train loss 1.26049. lr 4.221850e-04:  73%|███████▎  

epoch 1 iter 11998: train loss 1.25365. lr 4.213673e-04:  73%|███████▎  | 11998/16329 [1:41:01<36:43,  1.97it/s][A
epoch 1 iter 11998: train loss 1.25365. lr 4.213673e-04:  73%|███████▎  | 11999/16329 [1:41:01<36:32,  1.97it/s][A
epoch 1 iter 11999: train loss 1.26039. lr 4.213409e-04:  73%|███████▎  | 11999/16329 [1:41:02<36:32,  1.97it/s][A
epoch 1 iter 11999: train loss 1.26039. lr 4.213409e-04:  73%|███████▎  | 12000/16329 [1:41:02<36:24,  1.98it/s][A
epoch 1 iter 12000: train loss 1.26262. lr 4.213145e-04:  73%|███████▎  | 12000/16329 [1:41:02<36:24,  1.98it/s][A
epoch 1 iter 12000: train loss 1.26262. lr 4.213145e-04:  73%|███████▎  | 12001/16329 [1:41:02<36:11,  1.99it/s][A
epoch 1 iter 12001: train loss 1.26132. lr 4.212881e-04:  73%|███████▎  | 12001/16329 [1:41:03<36:11,  1.99it/s][A
epoch 1 iter 12001: train loss 1.26132. lr 4.212881e-04:  74%|███████▎  | 12002/16329 [1:41:03<35:59,  2.00it/s][A
epoch 1 iter 12002: train loss 1.24308. lr 4.212618e-04:  74%|███████▎  

epoch 1 iter 12033: train loss 1.22340. lr 4.204429e-04:  74%|███████▎  | 12033/16329 [1:41:19<39:09,  1.83it/s][A
epoch 1 iter 12033: train loss 1.22340. lr 4.204429e-04:  74%|███████▎  | 12034/16329 [1:41:19<38:03,  1.88it/s][A
epoch 1 iter 12034: train loss 1.22596. lr 4.204165e-04:  74%|███████▎  | 12034/16329 [1:41:20<38:03,  1.88it/s][A
epoch 1 iter 12034: train loss 1.22596. lr 4.204165e-04:  74%|███████▎  | 12035/16329 [1:41:20<37:21,  1.92it/s][A
epoch 1 iter 12035: train loss 1.32703. lr 4.203900e-04:  74%|███████▎  | 12035/16329 [1:41:20<37:21,  1.92it/s][A
epoch 1 iter 12035: train loss 1.32703. lr 4.203900e-04:  74%|███████▎  | 12036/16329 [1:41:20<36:44,  1.95it/s][A
epoch 1 iter 12036: train loss 1.26892. lr 4.203636e-04:  74%|███████▎  | 12036/16329 [1:41:21<36:44,  1.95it/s][A
epoch 1 iter 12036: train loss 1.26892. lr 4.203636e-04:  74%|███████▎  | 12037/16329 [1:41:21<36:13,  1.97it/s][A
epoch 1 iter 12037: train loss 1.26197. lr 4.203371e-04:  74%|███████▎  

epoch 1 iter 12068: train loss 1.23326. lr 4.195171e-04:  74%|███████▍  | 12068/16329 [1:41:37<35:25,  2.00it/s][A
epoch 1 iter 12068: train loss 1.23326. lr 4.195171e-04:  74%|███████▍  | 12069/16329 [1:41:37<35:23,  2.01it/s][A
epoch 1 iter 12069: train loss 1.22894. lr 4.194906e-04:  74%|███████▍  | 12069/16329 [1:41:37<35:23,  2.01it/s][A
epoch 1 iter 12069: train loss 1.22894. lr 4.194906e-04:  74%|███████▍  | 12070/16329 [1:41:37<35:15,  2.01it/s][A
epoch 1 iter 12070: train loss 1.25242. lr 4.194641e-04:  74%|███████▍  | 12070/16329 [1:41:38<35:15,  2.01it/s][A
epoch 1 iter 12070: train loss 1.25242. lr 4.194641e-04:  74%|███████▍  | 12071/16329 [1:41:38<35:12,  2.02it/s][A
epoch 1 iter 12071: train loss 1.24550. lr 4.194377e-04:  74%|███████▍  | 12071/16329 [1:41:38<35:12,  2.02it/s][A
epoch 1 iter 12071: train loss 1.24550. lr 4.194377e-04:  74%|███████▍  | 12072/16329 [1:41:38<35:09,  2.02it/s][A
epoch 1 iter 12072: train loss 1.25846. lr 4.194112e-04:  74%|███████▍  

epoch 1 iter 12103: train loss 1.26663. lr 4.185899e-04:  74%|███████▍  | 12103/16329 [1:41:54<36:11,  1.95it/s][A
epoch 1 iter 12103: train loss 1.26663. lr 4.185899e-04:  74%|███████▍  | 12104/16329 [1:41:54<35:46,  1.97it/s][A
epoch 1 iter 12104: train loss 1.24176. lr 4.185634e-04:  74%|███████▍  | 12104/16329 [1:41:55<35:46,  1.97it/s][A
epoch 1 iter 12104: train loss 1.24176. lr 4.185634e-04:  74%|███████▍  | 12105/16329 [1:41:55<35:34,  1.98it/s][A
epoch 1 iter 12105: train loss 1.24421. lr 4.185369e-04:  74%|███████▍  | 12105/16329 [1:41:55<35:34,  1.98it/s][A
epoch 1 iter 12105: train loss 1.24421. lr 4.185369e-04:  74%|███████▍  | 12106/16329 [1:41:55<35:29,  1.98it/s][A
epoch 1 iter 12106: train loss 1.28734. lr 4.185104e-04:  74%|███████▍  | 12106/16329 [1:41:56<35:29,  1.98it/s][A
epoch 1 iter 12106: train loss 1.28734. lr 4.185104e-04:  74%|███████▍  | 12107/16329 [1:41:56<35:15,  2.00it/s][A
epoch 1 iter 12107: train loss 1.28809. lr 4.184839e-04:  74%|███████▍  

epoch 1 iter 12138: train loss 1.24178. lr 4.176614e-04:  74%|███████▍  | 12138/16329 [1:42:12<36:14,  1.93it/s][A
epoch 1 iter 12138: train loss 1.24178. lr 4.176614e-04:  74%|███████▍  | 12139/16329 [1:42:12<35:51,  1.95it/s][A
epoch 1 iter 12139: train loss 1.25246. lr 4.176349e-04:  74%|███████▍  | 12139/16329 [1:42:12<35:51,  1.95it/s][A
epoch 1 iter 12139: train loss 1.25246. lr 4.176349e-04:  74%|███████▍  | 12140/16329 [1:42:12<35:29,  1.97it/s][A
epoch 1 iter 12140: train loss 1.26570. lr 4.176083e-04:  74%|███████▍  | 12140/16329 [1:42:13<35:29,  1.97it/s][A
epoch 1 iter 12140: train loss 1.26570. lr 4.176083e-04:  74%|███████▍  | 12141/16329 [1:42:13<35:10,  1.98it/s][A
epoch 1 iter 12141: train loss 1.24222. lr 4.175818e-04:  74%|███████▍  | 12141/16329 [1:42:13<35:10,  1.98it/s][A
epoch 1 iter 12141: train loss 1.24222. lr 4.175818e-04:  74%|███████▍  | 12142/16329 [1:42:13<36:09,  1.93it/s][A
epoch 1 iter 12142: train loss 1.26258. lr 4.175552e-04:  74%|███████▍  

epoch 1 iter 12173: train loss 1.25215. lr 4.167316e-04:  75%|███████▍  | 12173/16329 [1:42:30<35:01,  1.98it/s][A
epoch 1 iter 12173: train loss 1.25215. lr 4.167316e-04:  75%|███████▍  | 12174/16329 [1:42:30<34:49,  1.99it/s][A
epoch 1 iter 12174: train loss 1.27434. lr 4.167050e-04:  75%|███████▍  | 12174/16329 [1:42:30<34:49,  1.99it/s][A
epoch 1 iter 12174: train loss 1.27434. lr 4.167050e-04:  75%|███████▍  | 12175/16329 [1:42:30<34:39,  2.00it/s][A
epoch 1 iter 12175: train loss 1.24487. lr 4.166784e-04:  75%|███████▍  | 12175/16329 [1:42:31<34:39,  2.00it/s][A
epoch 1 iter 12175: train loss 1.24487. lr 4.166784e-04:  75%|███████▍  | 12176/16329 [1:42:31<34:36,  2.00it/s][A
epoch 1 iter 12176: train loss 1.26099. lr 4.166518e-04:  75%|███████▍  | 12176/16329 [1:42:31<34:36,  2.00it/s][A
epoch 1 iter 12176: train loss 1.26099. lr 4.166518e-04:  75%|███████▍  | 12177/16329 [1:42:31<34:26,  2.01it/s][A
epoch 1 iter 12177: train loss 1.26334. lr 4.166252e-04:  75%|███████▍  

epoch 1 iter 12208: train loss 1.24434. lr 4.158004e-04:  75%|███████▍  | 12208/16329 [1:42:47<33:59,  2.02it/s][A
epoch 1 iter 12208: train loss 1.24434. lr 4.158004e-04:  75%|███████▍  | 12209/16329 [1:42:47<34:01,  2.02it/s][A
epoch 1 iter 12209: train loss 1.25890. lr 4.157738e-04:  75%|███████▍  | 12209/16329 [1:42:48<34:01,  2.02it/s][A
epoch 1 iter 12209: train loss 1.25890. lr 4.157738e-04:  75%|███████▍  | 12210/16329 [1:42:48<33:59,  2.02it/s][A
epoch 1 iter 12210: train loss 1.23827. lr 4.157471e-04:  75%|███████▍  | 12210/16329 [1:42:48<33:59,  2.02it/s][A
epoch 1 iter 12210: train loss 1.23827. lr 4.157471e-04:  75%|███████▍  | 12211/16329 [1:42:48<33:56,  2.02it/s][A
epoch 1 iter 12211: train loss 1.26046. lr 4.157205e-04:  75%|███████▍  | 12211/16329 [1:42:49<33:56,  2.02it/s][A
epoch 1 iter 12211: train loss 1.26046. lr 4.157205e-04:  75%|███████▍  | 12212/16329 [1:42:49<33:56,  2.02it/s][A
epoch 1 iter 12212: train loss 1.21570. lr 4.156939e-04:  75%|███████▍  

epoch 1 iter 12252: train loss 1.24030. lr 4.146279e-04:  75%|███████▌  | 12252/16329 [1:43:10<34:29,  1.97it/s][A
epoch 1 iter 12252: train loss 1.24030. lr 4.146279e-04:  75%|███████▌  | 12253/16329 [1:43:10<34:10,  1.99it/s][A
epoch 1 iter 12253: train loss 1.21958. lr 4.146012e-04:  75%|███████▌  | 12253/16329 [1:43:10<34:10,  1.99it/s][A
epoch 1 iter 12253: train loss 1.21958. lr 4.146012e-04:  75%|███████▌  | 12254/16329 [1:43:10<33:57,  2.00it/s][A
epoch 1 iter 12254: train loss 1.23828. lr 4.145746e-04:  75%|███████▌  | 12254/16329 [1:43:11<33:57,  2.00it/s][A
epoch 1 iter 12254: train loss 1.23828. lr 4.145746e-04:  75%|███████▌  | 12255/16329 [1:43:11<33:51,  2.00it/s][A
epoch 1 iter 12255: train loss 1.21729. lr 4.145479e-04:  75%|███████▌  | 12255/16329 [1:43:11<33:51,  2.00it/s][A
epoch 1 iter 12255: train loss 1.21729. lr 4.145479e-04:  75%|███████▌  | 12256/16329 [1:43:11<33:42,  2.01it/s][A
epoch 1 iter 12256: train loss 1.23692. lr 4.145212e-04:  75%|███████▌  

epoch 1 iter 12287: train loss 1.24091. lr 4.136938e-04:  75%|███████▌  | 12287/16329 [1:43:27<36:46,  1.83it/s][A
epoch 1 iter 12287: train loss 1.24091. lr 4.136938e-04:  75%|███████▌  | 12288/16329 [1:43:27<35:45,  1.88it/s][A
epoch 1 iter 12288: train loss 1.20729. lr 4.136671e-04:  75%|███████▌  | 12288/16329 [1:43:28<35:45,  1.88it/s][A
epoch 1 iter 12288: train loss 1.20729. lr 4.136671e-04:  75%|███████▌  | 12289/16329 [1:43:28<35:04,  1.92it/s][A
epoch 1 iter 12289: train loss 1.25248. lr 4.136404e-04:  75%|███████▌  | 12289/16329 [1:43:28<35:04,  1.92it/s][A
epoch 1 iter 12289: train loss 1.25248. lr 4.136404e-04:  75%|███████▌  | 12290/16329 [1:43:28<34:31,  1.95it/s][A
epoch 1 iter 12290: train loss 1.22448. lr 4.136137e-04:  75%|███████▌  | 12290/16329 [1:43:29<34:31,  1.95it/s][A
epoch 1 iter 12290: train loss 1.22448. lr 4.136137e-04:  75%|███████▌  | 12291/16329 [1:43:29<34:15,  1.96it/s][A
epoch 1 iter 12291: train loss 1.24837. lr 4.135870e-04:  75%|███████▌  

epoch 1 iter 12322: train loss 1.24491. lr 4.127584e-04:  75%|███████▌  | 12322/16329 [1:43:45<34:22,  1.94it/s][A
epoch 1 iter 12322: train loss 1.24491. lr 4.127584e-04:  75%|███████▌  | 12323/16329 [1:43:45<34:15,  1.95it/s][A
epoch 1 iter 12323: train loss 1.23865. lr 4.127317e-04:  75%|███████▌  | 12323/16329 [1:43:45<34:15,  1.95it/s][A
epoch 1 iter 12323: train loss 1.23865. lr 4.127317e-04:  75%|███████▌  | 12324/16329 [1:43:45<33:59,  1.96it/s][A
epoch 1 iter 12324: train loss 1.23598. lr 4.127049e-04:  75%|███████▌  | 12324/16329 [1:43:46<33:59,  1.96it/s][A
epoch 1 iter 12324: train loss 1.23598. lr 4.127049e-04:  75%|███████▌  | 12325/16329 [1:43:46<33:50,  1.97it/s][A
epoch 1 iter 12325: train loss 1.23093. lr 4.126782e-04:  75%|███████▌  | 12325/16329 [1:43:46<33:50,  1.97it/s][A
epoch 1 iter 12325: train loss 1.23093. lr 4.126782e-04:  75%|███████▌  | 12326/16329 [1:43:46<33:39,  1.98it/s][A
epoch 1 iter 12326: train loss 1.21251. lr 4.126514e-04:  75%|███████▌  

epoch 1 iter 12357: train loss 1.24563. lr 4.118217e-04:  76%|███████▌  | 12357/16329 [1:44:03<34:59,  1.89it/s][A
epoch 1 iter 12357: train loss 1.24563. lr 4.118217e-04:  76%|███████▌  | 12358/16329 [1:44:03<34:29,  1.92it/s][A
epoch 1 iter 12358: train loss 1.22802. lr 4.117949e-04:  76%|███████▌  | 12358/16329 [1:44:03<34:29,  1.92it/s][A
epoch 1 iter 12358: train loss 1.22802. lr 4.117949e-04:  76%|███████▌  | 12359/16329 [1:44:03<34:05,  1.94it/s][A
epoch 1 iter 12359: train loss 1.21498. lr 4.117682e-04:  76%|███████▌  | 12359/16329 [1:44:04<34:05,  1.94it/s][A
epoch 1 iter 12359: train loss 1.21498. lr 4.117682e-04:  76%|███████▌  | 12360/16329 [1:44:04<33:43,  1.96it/s][A
epoch 1 iter 12360: train loss 1.22288. lr 4.117414e-04:  76%|███████▌  | 12360/16329 [1:44:04<33:43,  1.96it/s][A
epoch 1 iter 12360: train loss 1.22288. lr 4.117414e-04:  76%|███████▌  | 12361/16329 [1:44:04<33:26,  1.98it/s][A
epoch 1 iter 12361: train loss 1.24226. lr 4.117146e-04:  76%|███████▌  

epoch 1 iter 12392: train loss 1.25523. lr 4.108838e-04:  76%|███████▌  | 12392/16329 [1:44:20<33:46,  1.94it/s][A
epoch 1 iter 12392: train loss 1.25523. lr 4.108838e-04:  76%|███████▌  | 12393/16329 [1:44:20<33:17,  1.97it/s][A
epoch 1 iter 12393: train loss 1.25298. lr 4.108569e-04:  76%|███████▌  | 12393/16329 [1:44:21<33:17,  1.97it/s][A
epoch 1 iter 12393: train loss 1.25298. lr 4.108569e-04:  76%|███████▌  | 12394/16329 [1:44:21<33:08,  1.98it/s][A
epoch 1 iter 12394: train loss 1.23417. lr 4.108301e-04:  76%|███████▌  | 12394/16329 [1:44:21<33:08,  1.98it/s][A
epoch 1 iter 12394: train loss 1.23417. lr 4.108301e-04:  76%|███████▌  | 12395/16329 [1:44:21<33:00,  1.99it/s][A
epoch 1 iter 12395: train loss 1.24300. lr 4.108033e-04:  76%|███████▌  | 12395/16329 [1:44:22<33:00,  1.99it/s][A
epoch 1 iter 12395: train loss 1.24300. lr 4.108033e-04:  76%|███████▌  | 12396/16329 [1:44:22<32:52,  1.99it/s][A
epoch 1 iter 12396: train loss 1.22054. lr 4.107765e-04:  76%|███████▌  

epoch 1 iter 12427: train loss 1.24120. lr 4.099446e-04:  76%|███████▌  | 12427/16329 [1:44:38<32:07,  2.02it/s][A
epoch 1 iter 12427: train loss 1.24120. lr 4.099446e-04:  76%|███████▌  | 12428/16329 [1:44:38<32:02,  2.03it/s][A
epoch 1 iter 12428: train loss 1.23207. lr 4.099177e-04:  76%|███████▌  | 12428/16329 [1:44:38<32:02,  2.03it/s][A
epoch 1 iter 12428: train loss 1.23207. lr 4.099177e-04:  76%|███████▌  | 12429/16329 [1:44:38<32:00,  2.03it/s][A
epoch 1 iter 12429: train loss 1.25938. lr 4.098909e-04:  76%|███████▌  | 12429/16329 [1:44:39<32:00,  2.03it/s][A
epoch 1 iter 12429: train loss 1.25938. lr 4.098909e-04:  76%|███████▌  | 12430/16329 [1:44:39<32:00,  2.03it/s][A
epoch 1 iter 12430: train loss 1.23526. lr 4.098640e-04:  76%|███████▌  | 12430/16329 [1:44:39<32:00,  2.03it/s][A
epoch 1 iter 12430: train loss 1.23526. lr 4.098640e-04:  76%|███████▌  | 12431/16329 [1:44:39<31:58,  2.03it/s][A
epoch 1 iter 12431: train loss 1.24207. lr 4.098371e-04:  76%|███████▌  

epoch 1 iter 12462: train loss 1.22705. lr 4.090041e-04:  76%|███████▋  | 12462/16329 [1:44:56<33:15,  1.94it/s][A
epoch 1 iter 12462: train loss 1.22705. lr 4.090041e-04:  76%|███████▋  | 12463/16329 [1:44:56<33:00,  1.95it/s][A
epoch 1 iter 12463: train loss 1.24874. lr 4.089772e-04:  76%|███████▋  | 12463/16329 [1:44:56<33:00,  1.95it/s][A
epoch 1 iter 12463: train loss 1.24874. lr 4.089772e-04:  76%|███████▋  | 12464/16329 [1:44:56<32:46,  1.97it/s][A
epoch 1 iter 12464: train loss 1.25322. lr 4.089503e-04:  76%|███████▋  | 12464/16329 [1:44:57<32:46,  1.97it/s][A
epoch 1 iter 12464: train loss 1.25322. lr 4.089503e-04:  76%|███████▋  | 12465/16329 [1:44:57<32:26,  1.99it/s][A
epoch 1 iter 12465: train loss 1.21236. lr 4.089234e-04:  76%|███████▋  | 12465/16329 [1:44:57<32:26,  1.99it/s][A
epoch 1 iter 12465: train loss 1.21236. lr 4.089234e-04:  76%|███████▋  | 12466/16329 [1:44:57<32:15,  2.00it/s][A
epoch 1 iter 12466: train loss 1.21965. lr 4.088965e-04:  76%|███████▋  

epoch 1 iter 12497: train loss 1.22893. lr 4.080624e-04:  77%|███████▋  | 12497/16329 [1:45:13<31:29,  2.03it/s][A
epoch 1 iter 12497: train loss 1.22893. lr 4.080624e-04:  77%|███████▋  | 12498/16329 [1:45:13<31:30,  2.03it/s][A
epoch 1 iter 12498: train loss 1.20783. lr 4.080355e-04:  77%|███████▋  | 12498/16329 [1:45:14<31:30,  2.03it/s][A
epoch 1 iter 12498: train loss 1.20783. lr 4.080355e-04:  77%|███████▋  | 12499/16329 [1:45:14<32:05,  1.99it/s][A
epoch 1 iter 12499: train loss 1.15370. lr 4.080086e-04:  77%|███████▋  | 12499/16329 [1:45:14<32:05,  1.99it/s][A
epoch 1 iter 12499: train loss 1.15370. lr 4.080086e-04:  77%|███████▋  | 12500/16329 [1:45:14<32:23,  1.97it/s][A
epoch 1 iter 12500: train loss 1.21164. lr 4.079816e-04:  77%|███████▋  | 12500/16329 [1:45:15<32:23,  1.97it/s][A
epoch 1 iter 12500: train loss 1.21164. lr 4.079816e-04:  77%|███████▋  | 12501/16329 [1:45:15<32:20,  1.97it/s][A
epoch 1 iter 12501: train loss 1.21162. lr 4.079547e-04:  77%|███████▋  

epoch 1 iter 12532: train loss 1.22819. lr 4.071195e-04:  77%|███████▋  | 12532/16329 [1:45:31<31:12,  2.03it/s][A
epoch 1 iter 12532: train loss 1.22819. lr 4.071195e-04:  77%|███████▋  | 12533/16329 [1:45:31<31:09,  2.03it/s][A
epoch 1 iter 12533: train loss 1.23870. lr 4.070925e-04:  77%|███████▋  | 12533/16329 [1:45:31<31:09,  2.03it/s][A
epoch 1 iter 12533: train loss 1.23870. lr 4.070925e-04:  77%|███████▋  | 12534/16329 [1:45:31<31:04,  2.04it/s][A
epoch 1 iter 12534: train loss 1.18656. lr 4.070656e-04:  77%|███████▋  | 12534/16329 [1:45:32<31:04,  2.04it/s][A
epoch 1 iter 12534: train loss 1.18656. lr 4.070656e-04:  77%|███████▋  | 12535/16329 [1:45:32<31:06,  2.03it/s][A
epoch 1 iter 12535: train loss 1.21715. lr 4.070386e-04:  77%|███████▋  | 12535/16329 [1:45:32<31:06,  2.03it/s][A
epoch 1 iter 12535: train loss 1.21715. lr 4.070386e-04:  77%|███████▋  | 12536/16329 [1:45:32<31:08,  2.03it/s][A
epoch 1 iter 12536: train loss 1.18099. lr 4.070117e-04:  77%|███████▋  

epoch 1 iter 12567: train loss 1.18553. lr 4.061754e-04:  77%|███████▋  | 12567/16329 [1:45:48<30:54,  2.03it/s][A
epoch 1 iter 12567: train loss 1.18553. lr 4.061754e-04:  77%|███████▋  | 12568/16329 [1:45:48<34:33,  1.81it/s][A
epoch 1 iter 12568: train loss 1.16719. lr 4.061484e-04:  77%|███████▋  | 12568/16329 [1:45:49<34:33,  1.81it/s][A
epoch 1 iter 12568: train loss 1.16719. lr 4.061484e-04:  77%|███████▋  | 12569/16329 [1:45:49<33:33,  1.87it/s][A
epoch 1 iter 12569: train loss 1.27188. lr 4.061214e-04:  77%|███████▋  | 12569/16329 [1:45:49<33:33,  1.87it/s][A
epoch 1 iter 12569: train loss 1.27188. lr 4.061214e-04:  77%|███████▋  | 12570/16329 [1:45:49<32:42,  1.92it/s][A
epoch 1 iter 12570: train loss 1.24205. lr 4.060944e-04:  77%|███████▋  | 12570/16329 [1:45:50<32:42,  1.92it/s][A
epoch 1 iter 12570: train loss 1.24205. lr 4.060944e-04:  77%|███████▋  | 12571/16329 [1:45:50<32:05,  1.95it/s][A
epoch 1 iter 12571: train loss 1.20311. lr 4.060674e-04:  77%|███████▋  

epoch 1 iter 12602: train loss 1.18472. lr 4.052300e-04:  77%|███████▋  | 12602/16329 [1:46:06<30:37,  2.03it/s][A
epoch 1 iter 12602: train loss 1.18472. lr 4.052300e-04:  77%|███████▋  | 12603/16329 [1:46:06<30:37,  2.03it/s][A
epoch 1 iter 12603: train loss 1.20029. lr 4.052030e-04:  77%|███████▋  | 12603/16329 [1:46:06<30:37,  2.03it/s][A
epoch 1 iter 12603: train loss 1.20029. lr 4.052030e-04:  77%|███████▋  | 12604/16329 [1:46:06<30:35,  2.03it/s][A
epoch 1 iter 12604: train loss 1.25070. lr 4.051760e-04:  77%|███████▋  | 12604/16329 [1:46:07<30:35,  2.03it/s][A
epoch 1 iter 12604: train loss 1.25070. lr 4.051760e-04:  77%|███████▋  | 12605/16329 [1:46:07<30:41,  2.02it/s][A
epoch 1 iter 12605: train loss 1.26639. lr 4.051490e-04:  77%|███████▋  | 12605/16329 [1:46:07<30:41,  2.02it/s][A
epoch 1 iter 12605: train loss 1.26639. lr 4.051490e-04:  77%|███████▋  | 12606/16329 [1:46:07<30:42,  2.02it/s][A
epoch 1 iter 12606: train loss 1.22619. lr 4.051219e-04:  77%|███████▋  

epoch 1 iter 12637: train loss 1.17261. lr 4.042835e-04:  77%|███████▋  | 12637/16329 [1:46:23<30:28,  2.02it/s][A
epoch 1 iter 12637: train loss 1.17261. lr 4.042835e-04:  77%|███████▋  | 12638/16329 [1:46:23<31:01,  1.98it/s][A
epoch 1 iter 12638: train loss 1.24474. lr 4.042564e-04:  77%|███████▋  | 12638/16329 [1:46:24<31:01,  1.98it/s][A
epoch 1 iter 12638: train loss 1.24474. lr 4.042564e-04:  77%|███████▋  | 12639/16329 [1:46:24<31:10,  1.97it/s][A
epoch 1 iter 12639: train loss 1.22483. lr 4.042294e-04:  77%|███████▋  | 12639/16329 [1:46:24<31:10,  1.97it/s][A
epoch 1 iter 12639: train loss 1.22483. lr 4.042294e-04:  77%|███████▋  | 12640/16329 [1:46:24<31:12,  1.97it/s][A
epoch 1 iter 12640: train loss 1.22922. lr 4.042023e-04:  77%|███████▋  | 12640/16329 [1:46:25<31:12,  1.97it/s][A
epoch 1 iter 12640: train loss 1.22922. lr 4.042023e-04:  77%|███████▋  | 12641/16329 [1:46:25<31:44,  1.94it/s][A
epoch 1 iter 12641: train loss 1.20117. lr 4.041753e-04:  77%|███████▋  

epoch 1 iter 12672: train loss 1.24258. lr 4.033358e-04:  78%|███████▊  | 12672/16329 [1:46:41<30:56,  1.97it/s][A
epoch 1 iter 12672: train loss 1.24258. lr 4.033358e-04:  78%|███████▊  | 12673/16329 [1:46:41<30:45,  1.98it/s][A
epoch 1 iter 12673: train loss 1.26989. lr 4.033087e-04:  78%|███████▊  | 12673/16329 [1:46:42<30:45,  1.98it/s][A
epoch 1 iter 12673: train loss 1.26989. lr 4.033087e-04:  78%|███████▊  | 12674/16329 [1:46:42<30:36,  1.99it/s][A
epoch 1 iter 12674: train loss 1.23778. lr 4.032816e-04:  78%|███████▊  | 12674/16329 [1:46:42<30:36,  1.99it/s][A
epoch 1 iter 12674: train loss 1.23778. lr 4.032816e-04:  78%|███████▊  | 12675/16329 [1:46:42<30:31,  1.99it/s][A
epoch 1 iter 12675: train loss 1.21840. lr 4.032545e-04:  78%|███████▊  | 12675/16329 [1:46:43<30:31,  1.99it/s][A
epoch 1 iter 12675: train loss 1.21840. lr 4.032545e-04:  78%|███████▊  | 12676/16329 [1:46:43<30:22,  2.00it/s][A
epoch 1 iter 12676: train loss 1.21177. lr 4.032274e-04:  78%|███████▊  

epoch 1 iter 12707: train loss 1.17766. lr 4.023869e-04:  78%|███████▊  | 12707/16329 [1:46:59<30:34,  1.97it/s][A
epoch 1 iter 12707: train loss 1.17766. lr 4.023869e-04:  78%|███████▊  | 12708/16329 [1:46:59<30:27,  1.98it/s][A
epoch 1 iter 12708: train loss 1.19864. lr 4.023598e-04:  78%|███████▊  | 12708/16329 [1:47:00<30:27,  1.98it/s][A
epoch 1 iter 12708: train loss 1.19864. lr 4.023598e-04:  78%|███████▊  | 12709/16329 [1:47:00<30:22,  1.99it/s][A
epoch 1 iter 12709: train loss 1.20812. lr 4.023327e-04:  78%|███████▊  | 12709/16329 [1:47:00<30:22,  1.99it/s][A
epoch 1 iter 12709: train loss 1.20812. lr 4.023327e-04:  78%|███████▊  | 12710/16329 [1:47:00<30:13,  2.00it/s][A
epoch 1 iter 12710: train loss 1.23918. lr 4.023055e-04:  78%|███████▊  | 12710/16329 [1:47:01<30:13,  2.00it/s][A
epoch 1 iter 12710: train loss 1.23918. lr 4.023055e-04:  78%|███████▊  | 12711/16329 [1:47:01<30:10,  2.00it/s][A
epoch 1 iter 12711: train loss 1.23215. lr 4.022784e-04:  78%|███████▊  

epoch 1 iter 12742: train loss 1.16913. lr 4.014369e-04:  78%|███████▊  | 12742/16329 [1:47:17<30:02,  1.99it/s][A
epoch 1 iter 12742: train loss 1.16913. lr 4.014369e-04:  78%|███████▊  | 12743/16329 [1:47:17<29:58,  1.99it/s][A
epoch 1 iter 12743: train loss 1.22502. lr 4.014097e-04:  78%|███████▊  | 12743/16329 [1:47:17<29:58,  1.99it/s][A
epoch 1 iter 12743: train loss 1.22502. lr 4.014097e-04:  78%|███████▊  | 12744/16329 [1:47:17<29:50,  2.00it/s][A
epoch 1 iter 12744: train loss 1.18975. lr 4.013826e-04:  78%|███████▊  | 12744/16329 [1:47:18<29:50,  2.00it/s][A
epoch 1 iter 12744: train loss 1.18975. lr 4.013826e-04:  78%|███████▊  | 12745/16329 [1:47:18<29:49,  2.00it/s][A
epoch 1 iter 12745: train loss 1.18094. lr 4.013554e-04:  78%|███████▊  | 12745/16329 [1:47:18<29:49,  2.00it/s][A
epoch 1 iter 12745: train loss 1.18094. lr 4.013554e-04:  78%|███████▊  | 12746/16329 [1:47:18<29:43,  2.01it/s][A
epoch 1 iter 12746: train loss 1.23825. lr 4.013282e-04:  78%|███████▊  

epoch 1 iter 12777: train loss 1.21311. lr 4.004857e-04:  78%|███████▊  | 12777/16329 [1:47:34<29:47,  1.99it/s][A
epoch 1 iter 12777: train loss 1.21311. lr 4.004857e-04:  78%|███████▊  | 12778/16329 [1:47:34<29:41,  1.99it/s][A
epoch 1 iter 12778: train loss 1.20774. lr 4.004585e-04:  78%|███████▊  | 12778/16329 [1:47:35<29:41,  1.99it/s][A
epoch 1 iter 12778: train loss 1.20774. lr 4.004585e-04:  78%|███████▊  | 12779/16329 [1:47:35<29:37,  2.00it/s][A
epoch 1 iter 12779: train loss 1.18998. lr 4.004313e-04:  78%|███████▊  | 12779/16329 [1:47:35<29:37,  2.00it/s][A
epoch 1 iter 12779: train loss 1.18998. lr 4.004313e-04:  78%|███████▊  | 12780/16329 [1:47:35<29:27,  2.01it/s][A
epoch 1 iter 12780: train loss 1.16212. lr 4.004041e-04:  78%|███████▊  | 12780/16329 [1:47:36<29:27,  2.01it/s][A
epoch 1 iter 12780: train loss 1.16212. lr 4.004041e-04:  78%|███████▊  | 12781/16329 [1:47:36<29:28,  2.01it/s][A
epoch 1 iter 12781: train loss 1.18922. lr 4.003769e-04:  78%|███████▊  

epoch 1 iter 12812: train loss 1.23008. lr 3.995334e-04:  78%|███████▊  | 12812/16329 [1:47:52<29:02,  2.02it/s][A
epoch 1 iter 12812: train loss 1.23008. lr 3.995334e-04:  78%|███████▊  | 12813/16329 [1:47:52<28:59,  2.02it/s][A
epoch 1 iter 12813: train loss 1.23200. lr 3.995061e-04:  78%|███████▊  | 12813/16329 [1:47:52<28:59,  2.02it/s][A
epoch 1 iter 12813: train loss 1.23200. lr 3.995061e-04:  78%|███████▊  | 12814/16329 [1:47:52<29:04,  2.01it/s][A
epoch 1 iter 12814: train loss 1.22738. lr 3.994789e-04:  78%|███████▊  | 12814/16329 [1:47:53<29:04,  2.01it/s][A
epoch 1 iter 12814: train loss 1.22738. lr 3.994789e-04:  78%|███████▊  | 12815/16329 [1:47:53<29:04,  2.01it/s][A
epoch 1 iter 12815: train loss 1.21599. lr 3.994517e-04:  78%|███████▊  | 12815/16329 [1:47:53<29:04,  2.01it/s][A
epoch 1 iter 12815: train loss 1.21599. lr 3.994517e-04:  78%|███████▊  | 12816/16329 [1:47:53<29:05,  2.01it/s][A
epoch 1 iter 12816: train loss 1.20678. lr 3.994245e-04:  78%|███████▊  

epoch 1 iter 12847: train loss 1.19176. lr 3.985799e-04:  79%|███████▊  | 12847/16329 [1:48:10<28:46,  2.02it/s][A
epoch 1 iter 12847: train loss 1.19176. lr 3.985799e-04:  79%|███████▊  | 12848/16329 [1:48:10<28:46,  2.02it/s][A
epoch 1 iter 12848: train loss 1.19899. lr 3.985526e-04:  79%|███████▊  | 12848/16329 [1:48:10<28:46,  2.02it/s][A
epoch 1 iter 12848: train loss 1.19899. lr 3.985526e-04:  79%|███████▊  | 12849/16329 [1:48:10<28:51,  2.01it/s][A
epoch 1 iter 12849: train loss 1.20407. lr 3.985254e-04:  79%|███████▊  | 12849/16329 [1:48:10<28:51,  2.01it/s][A
epoch 1 iter 12849: train loss 1.20407. lr 3.985254e-04:  79%|███████▊  | 12850/16329 [1:48:11<28:48,  2.01it/s][A
epoch 1 iter 12850: train loss 1.18473. lr 3.984981e-04:  79%|███████▊  | 12850/16329 [1:48:11<28:48,  2.01it/s][A
epoch 1 iter 12850: train loss 1.18473. lr 3.984981e-04:  79%|███████▊  | 12851/16329 [1:48:11<28:52,  2.01it/s][A
epoch 1 iter 12851: train loss 1.20012. lr 3.984709e-04:  79%|███████▊  

epoch 1 iter 12882: train loss 1.14912. lr 3.976253e-04:  79%|███████▉  | 12882/16329 [1:48:27<28:34,  2.01it/s][A
epoch 1 iter 12882: train loss 1.14912. lr 3.976253e-04:  79%|███████▉  | 12883/16329 [1:48:27<28:36,  2.01it/s][A
epoch 1 iter 12883: train loss 1.23167. lr 3.975980e-04:  79%|███████▉  | 12883/16329 [1:48:28<28:36,  2.01it/s][A
epoch 1 iter 12883: train loss 1.23167. lr 3.975980e-04:  79%|███████▉  | 12884/16329 [1:48:28<28:32,  2.01it/s][A
epoch 1 iter 12884: train loss 1.21867. lr 3.975707e-04:  79%|███████▉  | 12884/16329 [1:48:28<28:32,  2.01it/s][A
epoch 1 iter 12884: train loss 1.21867. lr 3.975707e-04:  79%|███████▉  | 12885/16329 [1:48:28<28:28,  2.02it/s][A
epoch 1 iter 12885: train loss 1.22774. lr 3.975435e-04:  79%|███████▉  | 12885/16329 [1:48:29<28:28,  2.02it/s][A
epoch 1 iter 12885: train loss 1.22774. lr 3.975435e-04:  79%|███████▉  | 12886/16329 [1:48:29<28:27,  2.02it/s][A
epoch 1 iter 12886: train loss 1.20271. lr 3.975162e-04:  79%|███████▉  

epoch 1 iter 12917: train loss 1.21494. lr 3.966696e-04:  79%|███████▉  | 12917/16329 [1:48:45<28:15,  2.01it/s][A
epoch 1 iter 12917: train loss 1.21494. lr 3.966696e-04:  79%|███████▉  | 12918/16329 [1:48:45<28:17,  2.01it/s][A
epoch 1 iter 12918: train loss 1.18122. lr 3.966423e-04:  79%|███████▉  | 12918/16329 [1:48:45<28:17,  2.01it/s][A
epoch 1 iter 12918: train loss 1.18122. lr 3.966423e-04:  79%|███████▉  | 12919/16329 [1:48:45<28:15,  2.01it/s][A
epoch 1 iter 12919: train loss 1.22877. lr 3.966150e-04:  79%|███████▉  | 12919/16329 [1:48:46<28:15,  2.01it/s][A
epoch 1 iter 12919: train loss 1.22877. lr 3.966150e-04:  79%|███████▉  | 12920/16329 [1:48:46<28:20,  2.00it/s][A
epoch 1 iter 12920: train loss 1.17855. lr 3.965877e-04:  79%|███████▉  | 12920/16329 [1:48:46<28:20,  2.00it/s][A
epoch 1 iter 12920: train loss 1.17855. lr 3.965877e-04:  79%|███████▉  | 12921/16329 [1:48:46<28:22,  2.00it/s][A
epoch 1 iter 12921: train loss 1.17747. lr 3.965604e-04:  79%|███████▉  

epoch 1 iter 12952: train loss 1.23326. lr 3.957129e-04:  79%|███████▉  | 12952/16329 [1:49:03<28:59,  1.94it/s][A
epoch 1 iter 12952: train loss 1.23326. lr 3.957129e-04:  79%|███████▉  | 12953/16329 [1:49:03<28:41,  1.96it/s][A
epoch 1 iter 12953: train loss 1.22105. lr 3.956855e-04:  79%|███████▉  | 12953/16329 [1:49:03<28:41,  1.96it/s][A
epoch 1 iter 12953: train loss 1.22105. lr 3.956855e-04:  79%|███████▉  | 12954/16329 [1:49:03<28:20,  1.99it/s][A
epoch 1 iter 12954: train loss 1.19393. lr 3.956582e-04:  79%|███████▉  | 12954/16329 [1:49:04<28:20,  1.99it/s][A
epoch 1 iter 12954: train loss 1.19393. lr 3.956582e-04:  79%|███████▉  | 12955/16329 [1:49:04<28:15,  1.99it/s][A
epoch 1 iter 12955: train loss 1.20479. lr 3.956308e-04:  79%|███████▉  | 12955/16329 [1:49:04<28:15,  1.99it/s][A
epoch 1 iter 12955: train loss 1.20479. lr 3.956308e-04:  79%|███████▉  | 12956/16329 [1:49:04<28:06,  2.00it/s][A
epoch 1 iter 12956: train loss 1.17292. lr 3.956035e-04:  79%|███████▉  

epoch 1 iter 12987: train loss 1.16755. lr 3.947550e-04:  80%|███████▉  | 12987/16329 [1:49:20<27:44,  2.01it/s][A
epoch 1 iter 12987: train loss 1.16755. lr 3.947550e-04:  80%|███████▉  | 12988/16329 [1:49:20<27:44,  2.01it/s][A
epoch 1 iter 12988: train loss 1.19961. lr 3.947276e-04:  80%|███████▉  | 12988/16329 [1:49:21<27:44,  2.01it/s][A
epoch 1 iter 12988: train loss 1.19961. lr 3.947276e-04:  80%|███████▉  | 12989/16329 [1:49:21<30:36,  1.82it/s][A
epoch 1 iter 12989: train loss 1.16659. lr 3.947002e-04:  80%|███████▉  | 12989/16329 [1:49:21<30:36,  1.82it/s][A
epoch 1 iter 12989: train loss 1.16659. lr 3.947002e-04:  80%|███████▉  | 12990/16329 [1:49:21<29:46,  1.87it/s][A
epoch 1 iter 12990: train loss 1.20269. lr 3.946729e-04:  80%|███████▉  | 12990/16329 [1:49:22<29:46,  1.87it/s][A
epoch 1 iter 12990: train loss 1.20269. lr 3.946729e-04:  80%|███████▉  | 12991/16329 [1:49:22<29:04,  1.91it/s][A
epoch 1 iter 12991: train loss 1.18408. lr 3.946455e-04:  80%|███████▉  

epoch 1 iter 13022: train loss 1.16827. lr 3.937961e-04:  80%|███████▉  | 13022/16329 [1:49:38<27:20,  2.02it/s][A
epoch 1 iter 13022: train loss 1.16827. lr 3.937961e-04:  80%|███████▉  | 13023/16329 [1:49:38<27:23,  2.01it/s][A
epoch 1 iter 13023: train loss 1.22341. lr 3.937687e-04:  80%|███████▉  | 13023/16329 [1:49:38<27:23,  2.01it/s][A
epoch 1 iter 13023: train loss 1.22341. lr 3.937687e-04:  80%|███████▉  | 13024/16329 [1:49:38<30:21,  1.81it/s][A
epoch 1 iter 13024: train loss 1.24182. lr 3.937412e-04:  80%|███████▉  | 13024/16329 [1:49:39<30:21,  1.81it/s][A
epoch 1 iter 13024: train loss 1.24182. lr 3.937412e-04:  80%|███████▉  | 13025/16329 [1:49:39<29:22,  1.87it/s][A
epoch 1 iter 13025: train loss 1.19716. lr 3.937138e-04:  80%|███████▉  | 13025/16329 [1:49:39<29:22,  1.87it/s][A
epoch 1 iter 13025: train loss 1.19716. lr 3.937138e-04:  80%|███████▉  | 13026/16329 [1:49:39<28:46,  1.91it/s][A
epoch 1 iter 13026: train loss 1.19189. lr 3.936864e-04:  80%|███████▉  

epoch 1 iter 13057: train loss 1.19951. lr 3.928361e-04:  80%|███████▉  | 13057/16329 [1:49:55<27:22,  1.99it/s][A
epoch 1 iter 13057: train loss 1.19951. lr 3.928361e-04:  80%|███████▉  | 13058/16329 [1:49:55<27:15,  2.00it/s][A
epoch 1 iter 13058: train loss 1.19043. lr 3.928086e-04:  80%|███████▉  | 13058/16329 [1:49:56<27:15,  2.00it/s][A
epoch 1 iter 13058: train loss 1.19043. lr 3.928086e-04:  80%|███████▉  | 13059/16329 [1:49:56<27:12,  2.00it/s][A
epoch 1 iter 13059: train loss 1.20169. lr 3.927812e-04:  80%|███████▉  | 13059/16329 [1:49:56<27:12,  2.00it/s][A
epoch 1 iter 13059: train loss 1.20169. lr 3.927812e-04:  80%|███████▉  | 13060/16329 [1:49:56<27:06,  2.01it/s][A
epoch 1 iter 13060: train loss 1.19036. lr 3.927537e-04:  80%|███████▉  | 13060/16329 [1:49:57<27:06,  2.01it/s][A
epoch 1 iter 13060: train loss 1.19036. lr 3.927537e-04:  80%|███████▉  | 13061/16329 [1:49:57<27:05,  2.01it/s][A
epoch 1 iter 13061: train loss 1.19180. lr 3.927263e-04:  80%|███████▉  

epoch 1 iter 13092: train loss 1.18914. lr 3.918750e-04:  80%|████████  | 13092/16329 [1:50:13<26:52,  2.01it/s][A
epoch 1 iter 13092: train loss 1.18914. lr 3.918750e-04:  80%|████████  | 13093/16329 [1:50:13<26:49,  2.01it/s][A
epoch 1 iter 13093: train loss 1.17917. lr 3.918475e-04:  80%|████████  | 13093/16329 [1:50:14<26:49,  2.01it/s][A
epoch 1 iter 13093: train loss 1.17917. lr 3.918475e-04:  80%|████████  | 13094/16329 [1:50:14<26:48,  2.01it/s][A
epoch 1 iter 13094: train loss 1.23278. lr 3.918201e-04:  80%|████████  | 13094/16329 [1:50:14<26:48,  2.01it/s][A
epoch 1 iter 13094: train loss 1.23278. lr 3.918201e-04:  80%|████████  | 13095/16329 [1:50:14<26:41,  2.02it/s][A
epoch 1 iter 13095: train loss 1.20925. lr 3.917926e-04:  80%|████████  | 13095/16329 [1:50:15<26:41,  2.02it/s][A
epoch 1 iter 13095: train loss 1.20925. lr 3.917926e-04:  80%|████████  | 13096/16329 [1:50:15<26:44,  2.01it/s][A
epoch 1 iter 13096: train loss 1.20848. lr 3.917651e-04:  80%|████████  

epoch 1 iter 13127: train loss 1.20263. lr 3.909129e-04:  80%|████████  | 13127/16329 [1:50:31<27:27,  1.94it/s][A
epoch 1 iter 13127: train loss 1.20263. lr 3.909129e-04:  80%|████████  | 13128/16329 [1:50:31<27:15,  1.96it/s][A
epoch 1 iter 13128: train loss 1.19036. lr 3.908854e-04:  80%|████████  | 13128/16329 [1:50:31<27:15,  1.96it/s][A
epoch 1 iter 13128: train loss 1.19036. lr 3.908854e-04:  80%|████████  | 13129/16329 [1:50:31<26:56,  1.98it/s][A
epoch 1 iter 13129: train loss 1.16981. lr 3.908579e-04:  80%|████████  | 13129/16329 [1:50:32<26:56,  1.98it/s][A
epoch 1 iter 13129: train loss 1.16981. lr 3.908579e-04:  80%|████████  | 13130/16329 [1:50:32<26:47,  1.99it/s][A
epoch 1 iter 13130: train loss 1.13805. lr 3.908304e-04:  80%|████████  | 13130/16329 [1:50:32<26:47,  1.99it/s][A
epoch 1 iter 13130: train loss 1.13805. lr 3.908304e-04:  80%|████████  | 13131/16329 [1:50:32<26:42,  2.00it/s][A
epoch 1 iter 13131: train loss 1.20000. lr 3.908029e-04:  80%|████████  

epoch 1 iter 13162: train loss 1.19267. lr 3.899498e-04:  81%|████████  | 13162/16329 [1:50:49<26:18,  2.01it/s][A
epoch 1 iter 13162: train loss 1.19267. lr 3.899498e-04:  81%|████████  | 13163/16329 [1:50:49<26:16,  2.01it/s][A
epoch 1 iter 13163: train loss 1.20583. lr 3.899223e-04:  81%|████████  | 13163/16329 [1:50:49<26:16,  2.01it/s][A
epoch 1 iter 13163: train loss 1.20583. lr 3.899223e-04:  81%|████████  | 13164/16329 [1:50:49<26:09,  2.02it/s][A
epoch 1 iter 13164: train loss 1.19983. lr 3.898947e-04:  81%|████████  | 13164/16329 [1:50:50<26:09,  2.02it/s][A
epoch 1 iter 13164: train loss 1.19983. lr 3.898947e-04:  81%|████████  | 13165/16329 [1:50:50<26:14,  2.01it/s][A
epoch 1 iter 13165: train loss 1.17684. lr 3.898672e-04:  81%|████████  | 13165/16329 [1:50:50<26:14,  2.01it/s][A
epoch 1 iter 13165: train loss 1.17684. lr 3.898672e-04:  81%|████████  | 13166/16329 [1:50:50<26:10,  2.01it/s][A
epoch 1 iter 13166: train loss 1.21880. lr 3.898397e-04:  81%|████████  

epoch 1 iter 13197: train loss 1.18984. lr 3.889856e-04:  81%|████████  | 13197/16329 [1:51:06<25:46,  2.03it/s][A
epoch 1 iter 13197: train loss 1.18984. lr 3.889856e-04:  81%|████████  | 13198/16329 [1:51:06<25:48,  2.02it/s][A
epoch 1 iter 13198: train loss 1.18655. lr 3.889581e-04:  81%|████████  | 13198/16329 [1:51:07<25:48,  2.02it/s][A
epoch 1 iter 13198: train loss 1.18655. lr 3.889581e-04:  81%|████████  | 13199/16329 [1:51:07<25:50,  2.02it/s][A
epoch 1 iter 13199: train loss 1.20035. lr 3.889305e-04:  81%|████████  | 13199/16329 [1:51:07<25:50,  2.02it/s][A
epoch 1 iter 13199: train loss 1.20035. lr 3.889305e-04:  81%|████████  | 13200/16329 [1:51:07<25:45,  2.02it/s][A
epoch 1 iter 13200: train loss 1.18147. lr 3.889030e-04:  81%|████████  | 13200/16329 [1:51:08<25:45,  2.02it/s][A
epoch 1 iter 13200: train loss 1.18147. lr 3.889030e-04:  81%|████████  | 13201/16329 [1:51:08<25:48,  2.02it/s][A
epoch 1 iter 13201: train loss 1.13891. lr 3.888754e-04:  81%|████████  

epoch 1 iter 13232: train loss 1.19628. lr 3.880205e-04:  81%|████████  | 13232/16329 [1:51:24<27:05,  1.91it/s][A
epoch 1 iter 13232: train loss 1.19628. lr 3.880205e-04:  81%|████████  | 13233/16329 [1:51:24<27:10,  1.90it/s][A
epoch 1 iter 13233: train loss 1.21031. lr 3.879929e-04:  81%|████████  | 13233/16329 [1:51:25<27:10,  1.90it/s][A
epoch 1 iter 13233: train loss 1.21031. lr 3.879929e-04:  81%|████████  | 13234/16329 [1:51:25<27:02,  1.91it/s][A
epoch 1 iter 13234: train loss 1.17583. lr 3.879653e-04:  81%|████████  | 13234/16329 [1:51:25<27:02,  1.91it/s][A
epoch 1 iter 13234: train loss 1.17583. lr 3.879653e-04:  81%|████████  | 13235/16329 [1:51:25<26:50,  1.92it/s][A
epoch 1 iter 13235: train loss 1.16880. lr 3.879377e-04:  81%|████████  | 13235/16329 [1:51:26<26:50,  1.92it/s][A
epoch 1 iter 13235: train loss 1.16880. lr 3.879377e-04:  81%|████████  | 13236/16329 [1:51:26<26:33,  1.94it/s][A
epoch 1 iter 13236: train loss 1.19647. lr 3.879101e-04:  81%|████████  

epoch 1 iter 13267: train loss 1.16677. lr 3.870543e-04:  81%|████████  | 13267/16329 [1:51:42<25:19,  2.02it/s][A
epoch 1 iter 13267: train loss 1.16677. lr 3.870543e-04:  81%|████████▏ | 13268/16329 [1:51:42<25:18,  2.02it/s][A
epoch 1 iter 13268: train loss 1.18119. lr 3.870267e-04:  81%|████████▏ | 13268/16329 [1:51:42<25:18,  2.02it/s][A
epoch 1 iter 13268: train loss 1.18119. lr 3.870267e-04:  81%|████████▏ | 13269/16329 [1:51:42<25:20,  2.01it/s][A
epoch 1 iter 13269: train loss 1.19944. lr 3.869991e-04:  81%|████████▏ | 13269/16329 [1:51:43<25:20,  2.01it/s][A
epoch 1 iter 13269: train loss 1.19944. lr 3.869991e-04:  81%|████████▏ | 13270/16329 [1:51:43<25:16,  2.02it/s][A
epoch 1 iter 13270: train loss 1.17137. lr 3.869715e-04:  81%|████████▏ | 13270/16329 [1:51:43<25:16,  2.02it/s][A
epoch 1 iter 13270: train loss 1.17137. lr 3.869715e-04:  81%|████████▏ | 13271/16329 [1:51:43<25:14,  2.02it/s][A
epoch 1 iter 13271: train loss 1.18180. lr 3.869439e-04:  81%|████████▏ 

epoch 1 iter 13302: train loss 1.15850. lr 3.860872e-04:  81%|████████▏ | 13302/16329 [1:51:59<24:55,  2.02it/s][A
epoch 1 iter 13302: train loss 1.15850. lr 3.860872e-04:  81%|████████▏ | 13303/16329 [1:51:59<27:42,  1.82it/s][A
epoch 1 iter 13303: train loss 1.17134. lr 3.860596e-04:  81%|████████▏ | 13303/16329 [1:52:00<27:42,  1.82it/s][A
epoch 1 iter 13303: train loss 1.17134. lr 3.860596e-04:  81%|████████▏ | 13304/16329 [1:52:00<26:52,  1.88it/s][A
epoch 1 iter 13304: train loss 1.20112. lr 3.860319e-04:  81%|████████▏ | 13304/16329 [1:52:00<26:52,  1.88it/s][A
epoch 1 iter 13304: train loss 1.20112. lr 3.860319e-04:  81%|████████▏ | 13305/16329 [1:52:00<26:14,  1.92it/s][A
epoch 1 iter 13305: train loss 1.19862. lr 3.860043e-04:  81%|████████▏ | 13305/16329 [1:52:01<26:14,  1.92it/s][A
epoch 1 iter 13305: train loss 1.19862. lr 3.860043e-04:  81%|████████▏ | 13306/16329 [1:52:01<25:51,  1.95it/s][A
epoch 1 iter 13306: train loss 1.19144. lr 3.859766e-04:  81%|████████▏ 

epoch 1 iter 13337: train loss 1.17830. lr 3.851191e-04:  82%|████████▏ | 13337/16329 [1:52:17<25:02,  1.99it/s][A
epoch 1 iter 13337: train loss 1.17830. lr 3.851191e-04:  82%|████████▏ | 13338/16329 [1:52:17<24:56,  2.00it/s][A
epoch 1 iter 13338: train loss 1.18165. lr 3.850914e-04:  82%|████████▏ | 13338/16329 [1:52:17<24:56,  2.00it/s][A
epoch 1 iter 13338: train loss 1.18165. lr 3.850914e-04:  82%|████████▏ | 13339/16329 [1:52:17<24:53,  2.00it/s][A
epoch 1 iter 13339: train loss 1.16873. lr 3.850637e-04:  82%|████████▏ | 13339/16329 [1:52:18<24:53,  2.00it/s][A
epoch 1 iter 13339: train loss 1.16873. lr 3.850637e-04:  82%|████████▏ | 13340/16329 [1:52:18<24:46,  2.01it/s][A
epoch 1 iter 13340: train loss 1.14483. lr 3.850361e-04:  82%|████████▏ | 13340/16329 [1:52:18<24:46,  2.01it/s][A
epoch 1 iter 13340: train loss 1.14483. lr 3.850361e-04:  82%|████████▏ | 13341/16329 [1:52:18<24:46,  2.01it/s][A
epoch 1 iter 13341: train loss 1.16532. lr 3.850084e-04:  82%|████████▏ 

epoch 1 iter 13372: train loss 1.19846. lr 3.841500e-04:  82%|████████▏ | 13372/16329 [1:52:34<25:43,  1.92it/s][A
epoch 1 iter 13372: train loss 1.19846. lr 3.841500e-04:  82%|████████▏ | 13373/16329 [1:52:34<25:20,  1.94it/s][A
epoch 1 iter 13373: train loss 1.16808. lr 3.841223e-04:  82%|████████▏ | 13373/16329 [1:52:35<25:20,  1.94it/s][A
epoch 1 iter 13373: train loss 1.16808. lr 3.841223e-04:  82%|████████▏ | 13374/16329 [1:52:35<24:57,  1.97it/s][A
epoch 1 iter 13374: train loss 1.17626. lr 3.840946e-04:  82%|████████▏ | 13374/16329 [1:52:35<24:57,  1.97it/s][A
epoch 1 iter 13374: train loss 1.17626. lr 3.840946e-04:  82%|████████▏ | 13375/16329 [1:52:35<24:52,  1.98it/s][A
epoch 1 iter 13375: train loss 1.20932. lr 3.840669e-04:  82%|████████▏ | 13375/16329 [1:52:36<24:52,  1.98it/s][A
epoch 1 iter 13375: train loss 1.20932. lr 3.840669e-04:  82%|████████▏ | 13376/16329 [1:52:36<24:44,  1.99it/s][A
epoch 1 iter 13376: train loss 1.12821. lr 3.840392e-04:  82%|████████▏ 

epoch 1 iter 13407: train loss 1.17657. lr 3.831800e-04:  82%|████████▏ | 13407/16329 [1:52:52<25:25,  1.92it/s][A
epoch 1 iter 13407: train loss 1.17657. lr 3.831800e-04:  82%|████████▏ | 13408/16329 [1:52:52<25:07,  1.94it/s][A
epoch 1 iter 13408: train loss 1.18450. lr 3.831522e-04:  82%|████████▏ | 13408/16329 [1:52:53<25:07,  1.94it/s][A
epoch 1 iter 13408: train loss 1.18450. lr 3.831522e-04:  82%|████████▏ | 13409/16329 [1:52:53<24:52,  1.96it/s][A
epoch 1 iter 13409: train loss 1.18386. lr 3.831245e-04:  82%|████████▏ | 13409/16329 [1:52:53<24:52,  1.96it/s][A
epoch 1 iter 13409: train loss 1.18386. lr 3.831245e-04:  82%|████████▏ | 13410/16329 [1:52:53<24:38,  1.97it/s][A
epoch 1 iter 13410: train loss 1.15341. lr 3.830968e-04:  82%|████████▏ | 13410/16329 [1:52:54<24:38,  1.97it/s][A
epoch 1 iter 13410: train loss 1.15341. lr 3.830968e-04:  82%|████████▏ | 13411/16329 [1:52:54<24:29,  1.99it/s][A
epoch 1 iter 13411: train loss 1.20373. lr 3.830690e-04:  82%|████████▏ 

epoch 1 iter 13442: train loss 1.16270. lr 3.822090e-04:  82%|████████▏ | 13442/16329 [1:53:10<23:55,  2.01it/s][A
epoch 1 iter 13442: train loss 1.16270. lr 3.822090e-04:  82%|████████▏ | 13443/16329 [1:53:10<23:50,  2.02it/s][A
epoch 1 iter 13443: train loss 1.14925. lr 3.821812e-04:  82%|████████▏ | 13443/16329 [1:53:10<23:50,  2.02it/s][A
epoch 1 iter 13443: train loss 1.14925. lr 3.821812e-04:  82%|████████▏ | 13444/16329 [1:53:10<23:52,  2.01it/s][A
epoch 1 iter 13444: train loss 1.16130. lr 3.821535e-04:  82%|████████▏ | 13444/16329 [1:53:11<23:52,  2.01it/s][A
epoch 1 iter 13444: train loss 1.16130. lr 3.821535e-04:  82%|████████▏ | 13445/16329 [1:53:11<23:51,  2.02it/s][A
epoch 1 iter 13445: train loss 1.17996. lr 3.821257e-04:  82%|████████▏ | 13445/16329 [1:53:11<23:51,  2.02it/s][A
epoch 1 iter 13445: train loss 1.17996. lr 3.821257e-04:  82%|████████▏ | 13446/16329 [1:53:11<23:52,  2.01it/s][A
epoch 1 iter 13446: train loss 1.15788. lr 3.820980e-04:  82%|████████▏ 

epoch 1 iter 13477: train loss 1.18019. lr 3.812371e-04:  83%|████████▎ | 13477/16329 [1:53:27<23:35,  2.01it/s][A
epoch 1 iter 13477: train loss 1.18019. lr 3.812371e-04:  83%|████████▎ | 13478/16329 [1:53:27<23:35,  2.01it/s][A
epoch 1 iter 13478: train loss 1.15229. lr 3.812093e-04:  83%|████████▎ | 13478/16329 [1:53:28<23:35,  2.01it/s][A
epoch 1 iter 13478: train loss 1.15229. lr 3.812093e-04:  83%|████████▎ | 13479/16329 [1:53:28<23:34,  2.02it/s][A
epoch 1 iter 13479: train loss 1.17679. lr 3.811815e-04:  83%|████████▎ | 13479/16329 [1:53:28<23:34,  2.02it/s][A
epoch 1 iter 13479: train loss 1.17679. lr 3.811815e-04:  83%|████████▎ | 13480/16329 [1:53:28<23:31,  2.02it/s][A
epoch 1 iter 13480: train loss 1.17943. lr 3.811537e-04:  83%|████████▎ | 13480/16329 [1:53:29<23:31,  2.02it/s][A
epoch 1 iter 13480: train loss 1.17943. lr 3.811537e-04:  83%|████████▎ | 13481/16329 [1:53:29<23:30,  2.02it/s][A
epoch 1 iter 13481: train loss 1.17998. lr 3.811260e-04:  83%|████████▎ 

epoch 1 iter 13512: train loss 1.18225. lr 3.802643e-04:  83%|████████▎ | 13512/16329 [1:53:45<23:15,  2.02it/s][A
epoch 1 iter 13512: train loss 1.18225. lr 3.802643e-04:  83%|████████▎ | 13513/16329 [1:53:45<23:18,  2.01it/s][A
epoch 1 iter 13513: train loss 1.19352. lr 3.802364e-04:  83%|████████▎ | 13513/16329 [1:53:45<23:18,  2.01it/s][A
epoch 1 iter 13513: train loss 1.19352. lr 3.802364e-04:  83%|████████▎ | 13514/16329 [1:53:45<23:29,  2.00it/s][A
epoch 1 iter 13514: train loss 1.16262. lr 3.802086e-04:  83%|████████▎ | 13514/16329 [1:53:46<23:29,  2.00it/s][A
epoch 1 iter 13514: train loss 1.16262. lr 3.802086e-04:  83%|████████▎ | 13515/16329 [1:53:46<23:35,  1.99it/s][A
epoch 1 iter 13515: train loss 1.12264. lr 3.801808e-04:  83%|████████▎ | 13515/16329 [1:53:46<23:35,  1.99it/s][A
epoch 1 iter 13515: train loss 1.12264. lr 3.801808e-04:  83%|████████▎ | 13516/16329 [1:53:46<23:27,  2.00it/s][A
epoch 1 iter 13516: train loss 1.19766. lr 3.801530e-04:  83%|████████▎ 

epoch 1 iter 13547: train loss 1.19070. lr 3.792905e-04:  83%|████████▎ | 13547/16329 [1:54:02<23:01,  2.01it/s][A
epoch 1 iter 13547: train loss 1.19070. lr 3.792905e-04:  83%|████████▎ | 13548/16329 [1:54:02<23:01,  2.01it/s][A
epoch 1 iter 13548: train loss 1.16734. lr 3.792627e-04:  83%|████████▎ | 13548/16329 [1:54:03<23:01,  2.01it/s][A
epoch 1 iter 13548: train loss 1.16734. lr 3.792627e-04:  83%|████████▎ | 13549/16329 [1:54:03<23:00,  2.01it/s][A
epoch 1 iter 13549: train loss 1.16357. lr 3.792348e-04:  83%|████████▎ | 13549/16329 [1:54:03<23:00,  2.01it/s][A
epoch 1 iter 13549: train loss 1.16357. lr 3.792348e-04:  83%|████████▎ | 13550/16329 [1:54:03<23:04,  2.01it/s][A
epoch 1 iter 13550: train loss 1.15092. lr 3.792070e-04:  83%|████████▎ | 13550/16329 [1:54:04<23:04,  2.01it/s][A
epoch 1 iter 13550: train loss 1.15092. lr 3.792070e-04:  83%|████████▎ | 13551/16329 [1:54:04<23:02,  2.01it/s][A
epoch 1 iter 13551: train loss 1.17295. lr 3.791792e-04:  83%|████████▎ 

epoch 1 iter 13582: train loss 1.15757. lr 3.783159e-04:  83%|████████▎ | 13582/16329 [1:54:20<22:40,  2.02it/s][A
epoch 1 iter 13582: train loss 1.15757. lr 3.783159e-04:  83%|████████▎ | 13583/16329 [1:54:20<22:40,  2.02it/s][A
epoch 1 iter 13583: train loss 1.16409. lr 3.782880e-04:  83%|████████▎ | 13583/16329 [1:54:21<22:40,  2.02it/s][A
epoch 1 iter 13583: train loss 1.16409. lr 3.782880e-04:  83%|████████▎ | 13584/16329 [1:54:21<25:15,  1.81it/s][A
epoch 1 iter 13584: train loss 1.17782. lr 3.782602e-04:  83%|████████▎ | 13584/16329 [1:54:21<25:15,  1.81it/s][A
epoch 1 iter 13584: train loss 1.17782. lr 3.782602e-04:  83%|████████▎ | 13585/16329 [1:54:21<24:31,  1.87it/s][A
epoch 1 iter 13585: train loss 1.16439. lr 3.782323e-04:  83%|████████▎ | 13585/16329 [1:54:22<24:31,  1.87it/s][A
epoch 1 iter 13585: train loss 1.16439. lr 3.782323e-04:  83%|████████▎ | 13586/16329 [1:54:22<23:56,  1.91it/s][A
epoch 1 iter 13586: train loss 1.15327. lr 3.782044e-04:  83%|████████▎ 

epoch 1 iter 13617: train loss 1.13826. lr 3.773403e-04:  83%|████████▎ | 13617/16329 [1:54:38<22:37,  2.00it/s][A
epoch 1 iter 13617: train loss 1.13826. lr 3.773403e-04:  83%|████████▎ | 13618/16329 [1:54:38<22:44,  1.99it/s][A
epoch 1 iter 13618: train loss 1.19082. lr 3.773125e-04:  83%|████████▎ | 13618/16329 [1:54:38<22:44,  1.99it/s][A
epoch 1 iter 13618: train loss 1.19082. lr 3.773125e-04:  83%|████████▎ | 13619/16329 [1:54:38<22:36,  2.00it/s][A
epoch 1 iter 13619: train loss 1.18248. lr 3.772846e-04:  83%|████████▎ | 13619/16329 [1:54:39<22:36,  2.00it/s][A
epoch 1 iter 13619: train loss 1.18248. lr 3.772846e-04:  83%|████████▎ | 13620/16329 [1:54:39<22:35,  2.00it/s][A
epoch 1 iter 13620: train loss 1.15316. lr 3.772567e-04:  83%|████████▎ | 13620/16329 [1:54:39<22:35,  2.00it/s][A
epoch 1 iter 13620: train loss 1.15316. lr 3.772567e-04:  83%|████████▎ | 13621/16329 [1:54:39<22:29,  2.01it/s][A
epoch 1 iter 13621: train loss 1.14389. lr 3.772288e-04:  83%|████████▎ 

epoch 1 iter 13652: train loss 1.15291. lr 3.763639e-04:  84%|████████▎ | 13652/16329 [1:54:56<22:09,  2.01it/s][A
epoch 1 iter 13652: train loss 1.15291. lr 3.763639e-04:  84%|████████▎ | 13653/16329 [1:54:56<22:06,  2.02it/s][A
epoch 1 iter 13653: train loss 1.16021. lr 3.763360e-04:  84%|████████▎ | 13653/16329 [1:54:56<22:06,  2.02it/s][A
epoch 1 iter 13653: train loss 1.16021. lr 3.763360e-04:  84%|████████▎ | 13654/16329 [1:54:56<22:04,  2.02it/s][A
epoch 1 iter 13654: train loss 1.16238. lr 3.763081e-04:  84%|████████▎ | 13654/16329 [1:54:57<22:04,  2.02it/s][A
epoch 1 iter 13654: train loss 1.16238. lr 3.763081e-04:  84%|████████▎ | 13655/16329 [1:54:57<22:04,  2.02it/s][A
epoch 1 iter 13655: train loss 1.15783. lr 3.762802e-04:  84%|████████▎ | 13655/16329 [1:54:57<22:04,  2.02it/s][A
epoch 1 iter 13655: train loss 1.15783. lr 3.762802e-04:  84%|████████▎ | 13656/16329 [1:54:57<22:03,  2.02it/s][A
epoch 1 iter 13656: train loss 1.17884. lr 3.762523e-04:  84%|████████▎ 

epoch 1 iter 13687: train loss 1.14233. lr 3.753867e-04:  84%|████████▍ | 13687/16329 [1:55:14<24:23,  1.80it/s][A
epoch 1 iter 13687: train loss 1.14233. lr 3.753867e-04:  84%|████████▍ | 13688/16329 [1:55:14<24:01,  1.83it/s][A
epoch 1 iter 13688: train loss 1.17460. lr 3.753587e-04:  84%|████████▍ | 13688/16329 [1:55:14<24:01,  1.83it/s][A
epoch 1 iter 13688: train loss 1.17460. lr 3.753587e-04:  84%|████████▍ | 13689/16329 [1:55:14<23:36,  1.86it/s][A
epoch 1 iter 13689: train loss 1.16760. lr 3.753308e-04:  84%|████████▍ | 13689/16329 [1:55:15<23:36,  1.86it/s][A
epoch 1 iter 13689: train loss 1.16760. lr 3.753308e-04:  84%|████████▍ | 13690/16329 [1:55:15<23:16,  1.89it/s][A
epoch 1 iter 13690: train loss 1.14358. lr 3.753029e-04:  84%|████████▍ | 13690/16329 [1:55:15<23:16,  1.89it/s][A
epoch 1 iter 13690: train loss 1.14358. lr 3.753029e-04:  84%|████████▍ | 13691/16329 [1:55:15<22:57,  1.91it/s][A
epoch 1 iter 13691: train loss 1.17029. lr 3.752749e-04:  84%|████████▍ 

epoch 1 iter 13722: train loss 1.15250. lr 3.744086e-04:  84%|████████▍ | 13722/16329 [1:55:31<21:33,  2.02it/s][A
epoch 1 iter 13722: train loss 1.15250. lr 3.744086e-04:  84%|████████▍ | 13723/16329 [1:55:31<21:32,  2.02it/s][A
epoch 1 iter 13723: train loss 1.17806. lr 3.743806e-04:  84%|████████▍ | 13723/16329 [1:55:32<21:32,  2.02it/s][A
epoch 1 iter 13723: train loss 1.17806. lr 3.743806e-04:  84%|████████▍ | 13724/16329 [1:55:32<21:30,  2.02it/s][A
epoch 1 iter 13724: train loss 1.18032. lr 3.743526e-04:  84%|████████▍ | 13724/16329 [1:55:32<21:30,  2.02it/s][A
epoch 1 iter 13724: train loss 1.18032. lr 3.743526e-04:  84%|████████▍ | 13725/16329 [1:55:32<21:32,  2.01it/s][A
epoch 1 iter 13725: train loss 1.19908. lr 3.743247e-04:  84%|████████▍ | 13725/16329 [1:55:33<21:32,  2.01it/s][A
epoch 1 iter 13725: train loss 1.19908. lr 3.743247e-04:  84%|████████▍ | 13726/16329 [1:55:33<21:29,  2.02it/s][A
epoch 1 iter 13726: train loss 1.14971. lr 3.742967e-04:  84%|████████▍ 

epoch 1 iter 13757: train loss 1.15997. lr 3.734296e-04:  84%|████████▍ | 13757/16329 [1:55:49<21:41,  1.98it/s][A
epoch 1 iter 13757: train loss 1.15997. lr 3.734296e-04:  84%|████████▍ | 13758/16329 [1:55:49<21:32,  1.99it/s][A
epoch 1 iter 13758: train loss 1.14218. lr 3.734016e-04:  84%|████████▍ | 13758/16329 [1:55:50<21:32,  1.99it/s][A
epoch 1 iter 13758: train loss 1.14218. lr 3.734016e-04:  84%|████████▍ | 13759/16329 [1:55:50<21:23,  2.00it/s][A
epoch 1 iter 13759: train loss 1.14739. lr 3.733736e-04:  84%|████████▍ | 13759/16329 [1:55:50<21:23,  2.00it/s][A
epoch 1 iter 13759: train loss 1.14739. lr 3.733736e-04:  84%|████████▍ | 13760/16329 [1:55:50<22:04,  1.94it/s][A
epoch 1 iter 13760: train loss 1.18219. lr 3.733456e-04:  84%|████████▍ | 13760/16329 [1:55:51<22:04,  1.94it/s][A
epoch 1 iter 13760: train loss 1.18219. lr 3.733456e-04:  84%|████████▍ | 13761/16329 [1:55:51<22:32,  1.90it/s][A
epoch 1 iter 13761: train loss 1.14649. lr 3.733176e-04:  84%|████████▍ 

epoch 1 iter 13792: train loss 1.17008. lr 3.724498e-04:  84%|████████▍ | 13792/16329 [1:56:07<21:18,  1.98it/s][A
epoch 1 iter 13792: train loss 1.17008. lr 3.724498e-04:  84%|████████▍ | 13793/16329 [1:56:07<21:14,  1.99it/s][A
epoch 1 iter 13793: train loss 1.14647. lr 3.724218e-04:  84%|████████▍ | 13793/16329 [1:56:07<21:14,  1.99it/s][A
epoch 1 iter 13793: train loss 1.14647. lr 3.724218e-04:  84%|████████▍ | 13794/16329 [1:56:07<21:11,  1.99it/s][A
epoch 1 iter 13794: train loss 1.17077. lr 3.723938e-04:  84%|████████▍ | 13794/16329 [1:56:08<21:11,  1.99it/s][A
epoch 1 iter 13794: train loss 1.17077. lr 3.723938e-04:  84%|████████▍ | 13795/16329 [1:56:08<21:10,  1.99it/s][A
epoch 1 iter 13795: train loss 1.15004. lr 3.723658e-04:  84%|████████▍ | 13795/16329 [1:56:08<21:10,  1.99it/s][A
epoch 1 iter 13795: train loss 1.15004. lr 3.723658e-04:  84%|████████▍ | 13796/16329 [1:56:08<21:04,  2.00it/s][A
epoch 1 iter 13796: train loss 1.14198. lr 3.723378e-04:  84%|████████▍ 

epoch 1 iter 13827: train loss 1.12824. lr 3.714692e-04:  85%|████████▍ | 13827/16329 [1:56:24<20:40,  2.02it/s][A
epoch 1 iter 13827: train loss 1.12824. lr 3.714692e-04:  85%|████████▍ | 13828/16329 [1:56:24<20:41,  2.02it/s][A
epoch 1 iter 13828: train loss 1.11791. lr 3.714411e-04:  85%|████████▍ | 13828/16329 [1:56:25<20:41,  2.02it/s][A
epoch 1 iter 13828: train loss 1.11791. lr 3.714411e-04:  85%|████████▍ | 13829/16329 [1:56:25<20:38,  2.02it/s][A
epoch 1 iter 13829: train loss 1.15968. lr 3.714131e-04:  85%|████████▍ | 13829/16329 [1:56:25<20:38,  2.02it/s][A
epoch 1 iter 13829: train loss 1.15968. lr 3.714131e-04:  85%|████████▍ | 13830/16329 [1:56:25<20:41,  2.01it/s][A
epoch 1 iter 13830: train loss 1.14341. lr 3.713851e-04:  85%|████████▍ | 13830/16329 [1:56:26<20:41,  2.01it/s][A
epoch 1 iter 13830: train loss 1.14341. lr 3.713851e-04:  85%|████████▍ | 13831/16329 [1:56:26<20:42,  2.01it/s][A
epoch 1 iter 13831: train loss 1.15135. lr 3.713570e-04:  85%|████████▍ 

epoch 1 iter 13862: train loss 1.13737. lr 3.704877e-04:  85%|████████▍ | 13862/16329 [1:56:42<20:24,  2.01it/s][A
epoch 1 iter 13862: train loss 1.13737. lr 3.704877e-04:  85%|████████▍ | 13863/16329 [1:56:42<20:24,  2.01it/s][A
epoch 1 iter 13863: train loss 1.14184. lr 3.704597e-04:  85%|████████▍ | 13863/16329 [1:56:43<20:24,  2.01it/s][A
epoch 1 iter 13863: train loss 1.14184. lr 3.704597e-04:  85%|████████▍ | 13864/16329 [1:56:43<20:22,  2.02it/s][A
epoch 1 iter 13864: train loss 1.13129. lr 3.704316e-04:  85%|████████▍ | 13864/16329 [1:56:43<20:22,  2.02it/s][A
epoch 1 iter 13864: train loss 1.13129. lr 3.704316e-04:  85%|████████▍ | 13865/16329 [1:56:43<20:24,  2.01it/s][A
epoch 1 iter 13865: train loss 1.17329. lr 3.704036e-04:  85%|████████▍ | 13865/16329 [1:56:44<20:24,  2.01it/s][A
epoch 1 iter 13865: train loss 1.17329. lr 3.704036e-04:  85%|████████▍ | 13866/16329 [1:56:44<20:22,  2.02it/s][A
epoch 1 iter 13866: train loss 1.13549. lr 3.703755e-04:  85%|████████▍ 

epoch 1 iter 13897: train loss 1.16462. lr 3.695055e-04:  85%|████████▌ | 13897/16329 [1:57:00<20:14,  2.00it/s][A
epoch 1 iter 13897: train loss 1.16462. lr 3.695055e-04:  85%|████████▌ | 13898/16329 [1:57:00<20:09,  2.01it/s][A
epoch 1 iter 13898: train loss 1.13630. lr 3.694774e-04:  85%|████████▌ | 13898/16329 [1:57:00<20:09,  2.01it/s][A
epoch 1 iter 13898: train loss 1.13630. lr 3.694774e-04:  85%|████████▌ | 13899/16329 [1:57:00<20:08,  2.01it/s][A
epoch 1 iter 13899: train loss 1.16365. lr 3.694493e-04:  85%|████████▌ | 13899/16329 [1:57:01<20:08,  2.01it/s][A
epoch 1 iter 13899: train loss 1.16365. lr 3.694493e-04:  85%|████████▌ | 13900/16329 [1:57:01<20:07,  2.01it/s][A
epoch 1 iter 13900: train loss 1.15679. lr 3.694213e-04:  85%|████████▌ | 13900/16329 [1:57:01<20:07,  2.01it/s][A
epoch 1 iter 13900: train loss 1.15679. lr 3.694213e-04:  85%|████████▌ | 13901/16329 [1:57:01<20:05,  2.01it/s][A
epoch 1 iter 13901: train loss 1.17233. lr 3.693932e-04:  85%|████████▌ 

epoch 1 iter 13932: train loss 1.13244. lr 3.685225e-04:  85%|████████▌ | 13932/16329 [1:57:17<19:47,  2.02it/s][A
epoch 1 iter 13932: train loss 1.13244. lr 3.685225e-04:  85%|████████▌ | 13933/16329 [1:57:17<19:47,  2.02it/s][A
epoch 1 iter 13933: train loss 1.14526. lr 3.684944e-04:  85%|████████▌ | 13933/16329 [1:57:18<19:47,  2.02it/s][A
epoch 1 iter 13933: train loss 1.14526. lr 3.684944e-04:  85%|████████▌ | 13934/16329 [1:57:18<19:44,  2.02it/s][A
epoch 1 iter 13934: train loss 1.12838. lr 3.684663e-04:  85%|████████▌ | 13934/16329 [1:57:18<19:44,  2.02it/s][A
epoch 1 iter 13934: train loss 1.12838. lr 3.684663e-04:  85%|████████▌ | 13935/16329 [1:57:18<19:47,  2.02it/s][A
epoch 1 iter 13935: train loss 1.13138. lr 3.684382e-04:  85%|████████▌ | 13935/16329 [1:57:19<19:47,  2.02it/s][A
epoch 1 iter 13935: train loss 1.13138. lr 3.684382e-04:  85%|████████▌ | 13936/16329 [1:57:19<19:48,  2.01it/s][A
epoch 1 iter 13936: train loss 1.13853. lr 3.684101e-04:  85%|████████▌ 

epoch 1 iter 13967: train loss 1.15752. lr 3.675387e-04:  86%|████████▌ | 13967/16329 [1:57:35<20:40,  1.90it/s][A
epoch 1 iter 13967: train loss 1.15752. lr 3.675387e-04:  86%|████████▌ | 13968/16329 [1:57:35<20:22,  1.93it/s][A
epoch 1 iter 13968: train loss 1.12114. lr 3.675106e-04:  86%|████████▌ | 13968/16329 [1:57:36<20:22,  1.93it/s][A
epoch 1 iter 13968: train loss 1.12114. lr 3.675106e-04:  86%|████████▌ | 13969/16329 [1:57:36<20:04,  1.96it/s][A
epoch 1 iter 13969: train loss 1.12844. lr 3.674824e-04:  86%|████████▌ | 13969/16329 [1:57:36<20:04,  1.96it/s][A
epoch 1 iter 13969: train loss 1.12844. lr 3.674824e-04:  86%|████████▌ | 13970/16329 [1:57:36<19:55,  1.97it/s][A
epoch 1 iter 13970: train loss 1.12475. lr 3.674543e-04:  86%|████████▌ | 13970/16329 [1:57:37<19:55,  1.97it/s][A
epoch 1 iter 13970: train loss 1.12475. lr 3.674543e-04:  86%|████████▌ | 13971/16329 [1:57:37<19:45,  1.99it/s][A
epoch 1 iter 13971: train loss 1.13160. lr 3.674262e-04:  86%|████████▌ 

epoch 1 iter 14002: train loss 1.13919. lr 3.665541e-04:  86%|████████▌ | 14002/16329 [1:57:53<19:24,  2.00it/s][A
epoch 1 iter 14002: train loss 1.13919. lr 3.665541e-04:  86%|████████▌ | 14003/16329 [1:57:53<19:25,  2.00it/s][A
epoch 1 iter 14003: train loss 1.10893. lr 3.665260e-04:  86%|████████▌ | 14003/16329 [1:57:53<19:25,  2.00it/s][A
epoch 1 iter 14003: train loss 1.10893. lr 3.665260e-04:  86%|████████▌ | 14004/16329 [1:57:53<19:21,  2.00it/s][A
epoch 1 iter 14004: train loss 1.14839. lr 3.664978e-04:  86%|████████▌ | 14004/16329 [1:57:54<19:21,  2.00it/s][A
epoch 1 iter 14004: train loss 1.14839. lr 3.664978e-04:  86%|████████▌ | 14005/16329 [1:57:54<21:21,  1.81it/s][A
epoch 1 iter 14005: train loss 1.11305. lr 3.664697e-04:  86%|████████▌ | 14005/16329 [1:57:54<21:21,  1.81it/s][A
epoch 1 iter 14005: train loss 1.11305. lr 3.664697e-04:  86%|████████▌ | 14006/16329 [1:57:54<20:43,  1.87it/s][A
epoch 1 iter 14006: train loss 1.14922. lr 3.664416e-04:  86%|████████▌ 

epoch 1 iter 14037: train loss 1.14026. lr 3.655688e-04:  86%|████████▌ | 14037/16329 [1:58:11<19:57,  1.91it/s][A
epoch 1 iter 14037: train loss 1.14026. lr 3.655688e-04:  86%|████████▌ | 14038/16329 [1:58:11<19:44,  1.93it/s][A
epoch 1 iter 14038: train loss 1.13654. lr 3.655406e-04:  86%|████████▌ | 14038/16329 [1:58:11<19:44,  1.93it/s][A
epoch 1 iter 14038: train loss 1.13654. lr 3.655406e-04:  86%|████████▌ | 14039/16329 [1:58:11<19:34,  1.95it/s][A
epoch 1 iter 14039: train loss 1.12562. lr 3.655125e-04:  86%|████████▌ | 14039/16329 [1:58:12<19:34,  1.95it/s][A
epoch 1 iter 14039: train loss 1.12562. lr 3.655125e-04:  86%|████████▌ | 14040/16329 [1:58:12<21:29,  1.78it/s][A
epoch 1 iter 14040: train loss 1.14043. lr 3.654843e-04:  86%|████████▌ | 14040/16329 [1:58:12<21:29,  1.78it/s][A
epoch 1 iter 14040: train loss 1.14043. lr 3.654843e-04:  86%|████████▌ | 14041/16329 [1:58:12<20:45,  1.84it/s][A
epoch 1 iter 14041: train loss 1.11281. lr 3.654562e-04:  86%|████████▌ 

epoch 1 iter 14072: train loss 1.12909. lr 3.645828e-04:  86%|████████▌ | 14072/16329 [1:58:29<18:47,  2.00it/s][A
epoch 1 iter 14072: train loss 1.12909. lr 3.645828e-04:  86%|████████▌ | 14073/16329 [1:58:29<18:47,  2.00it/s][A
epoch 1 iter 14073: train loss 1.13564. lr 3.645546e-04:  86%|████████▌ | 14073/16329 [1:58:29<18:47,  2.00it/s][A
epoch 1 iter 14073: train loss 1.13564. lr 3.645546e-04:  86%|████████▌ | 14074/16329 [1:58:29<18:42,  2.01it/s][A
epoch 1 iter 14074: train loss 1.15242. lr 3.645264e-04:  86%|████████▌ | 14074/16329 [1:58:29<18:42,  2.01it/s][A
epoch 1 iter 14074: train loss 1.15242. lr 3.645264e-04:  86%|████████▌ | 14075/16329 [1:58:29<18:43,  2.01it/s][A
epoch 1 iter 14075: train loss 1.16379. lr 3.644982e-04:  86%|████████▌ | 14075/16329 [1:58:30<18:43,  2.01it/s][A
epoch 1 iter 14075: train loss 1.16379. lr 3.644982e-04:  86%|████████▌ | 14076/16329 [1:58:30<18:42,  2.01it/s][A
epoch 1 iter 14076: train loss 1.13357. lr 3.644700e-04:  86%|████████▌ 

epoch 1 iter 14107: train loss 1.11274. lr 3.635960e-04:  86%|████████▋ | 14107/16329 [1:58:46<18:52,  1.96it/s][A
epoch 1 iter 14107: train loss 1.11274. lr 3.635960e-04:  86%|████████▋ | 14108/16329 [1:58:46<18:45,  1.97it/s][A
epoch 1 iter 14108: train loss 1.15868. lr 3.635678e-04:  86%|████████▋ | 14108/16329 [1:58:47<18:45,  1.97it/s][A
epoch 1 iter 14108: train loss 1.15868. lr 3.635678e-04:  86%|████████▋ | 14109/16329 [1:58:47<18:40,  1.98it/s][A
epoch 1 iter 14109: train loss 1.13928. lr 3.635395e-04:  86%|████████▋ | 14109/16329 [1:58:47<18:40,  1.98it/s][A
epoch 1 iter 14109: train loss 1.13928. lr 3.635395e-04:  86%|████████▋ | 14110/16329 [1:58:47<18:33,  1.99it/s][A
epoch 1 iter 14110: train loss 1.16140. lr 3.635113e-04:  86%|████████▋ | 14110/16329 [1:58:48<18:33,  1.99it/s][A
epoch 1 iter 14110: train loss 1.16140. lr 3.635113e-04:  86%|████████▋ | 14111/16329 [1:58:48<18:28,  2.00it/s][A
epoch 1 iter 14111: train loss 1.14839. lr 3.634831e-04:  86%|████████▋ 

epoch 1 iter 14142: train loss 1.10665. lr 3.626084e-04:  87%|████████▋ | 14142/16329 [1:59:04<18:04,  2.02it/s][A
epoch 1 iter 14142: train loss 1.10665. lr 3.626084e-04:  87%|████████▋ | 14143/16329 [1:59:04<18:05,  2.01it/s][A
epoch 1 iter 14143: train loss 1.12539. lr 3.625802e-04:  87%|████████▋ | 14143/16329 [1:59:04<18:05,  2.01it/s][A
epoch 1 iter 14143: train loss 1.12539. lr 3.625802e-04:  87%|████████▋ | 14144/16329 [1:59:04<18:04,  2.01it/s][A
epoch 1 iter 14144: train loss 1.14048. lr 3.625520e-04:  87%|████████▋ | 14144/16329 [1:59:05<18:04,  2.01it/s][A
epoch 1 iter 14144: train loss 1.14048. lr 3.625520e-04:  87%|████████▋ | 14145/16329 [1:59:05<18:05,  2.01it/s][A
epoch 1 iter 14145: train loss 1.11697. lr 3.625238e-04:  87%|████████▋ | 14145/16329 [1:59:05<18:05,  2.01it/s][A
epoch 1 iter 14145: train loss 1.11697. lr 3.625238e-04:  87%|████████▋ | 14146/16329 [1:59:05<18:04,  2.01it/s][A
epoch 1 iter 14146: train loss 1.13405. lr 3.624955e-04:  87%|████████▋ 

epoch 1 iter 14177: train loss 1.09030. lr 3.616202e-04:  87%|████████▋ | 14177/16329 [1:59:21<17:56,  2.00it/s][A
epoch 1 iter 14177: train loss 1.09030. lr 3.616202e-04:  87%|████████▋ | 14178/16329 [1:59:21<17:56,  2.00it/s][A
epoch 1 iter 14178: train loss 1.14485. lr 3.615920e-04:  87%|████████▋ | 14178/16329 [1:59:22<17:56,  2.00it/s][A
epoch 1 iter 14178: train loss 1.14485. lr 3.615920e-04:  87%|████████▋ | 14179/16329 [1:59:22<17:56,  2.00it/s][A
epoch 1 iter 14179: train loss 1.12426. lr 3.615637e-04:  87%|████████▋ | 14179/16329 [1:59:22<17:56,  2.00it/s][A
epoch 1 iter 14179: train loss 1.12426. lr 3.615637e-04:  87%|████████▋ | 14180/16329 [1:59:22<17:55,  2.00it/s][A
epoch 1 iter 14180: train loss 1.15272. lr 3.615355e-04:  87%|████████▋ | 14180/16329 [1:59:23<17:55,  2.00it/s][A
epoch 1 iter 14180: train loss 1.15272. lr 3.615355e-04:  87%|████████▋ | 14181/16329 [1:59:23<17:54,  2.00it/s][A
epoch 1 iter 14181: train loss 1.13674. lr 3.615072e-04:  87%|████████▋ 

epoch 1 iter 14212: train loss 1.11786. lr 3.606313e-04
epoch 1 iter 14213: train loss 1.11339. lr 3.606030e-04
epoch 1 iter 14214: train loss 1.11916. lr 3.605748e-04
epoch 1 iter 14215: train loss 1.13837. lr 3.605465e-04
epoch 1 iter 14216: train loss 1.12766. lr 3.605182e-04

epoch 1 iter 14247: train loss 1.15705. lr 3.596417e-04
epoch 1 iter 14248: train loss 1.13067. lr 3.596134e-04
epoch 1 iter 14249: train loss 1.12552. lr 3.595851e-04
epoch 1 iter 14250: train loss 1.13642. lr 3.595568e-04
epoch 1 iter 14251: train loss 1.10949. lr 3.595286e-04

epoch 1 iter 14282: train loss 1.12491. lr 3.586514e-04
epoch 1 iter 14283: train loss 1.11752. lr 3.586231e-04
epoch 1 iter 14284: train loss 1.09571. lr 3.585948e-04
epoch 1 iter 14285: train loss 1.10787. lr 3.585665e-04
epoch 1 iter 14286: train loss 1.11348. lr 3.585382e-04

epoch 1 iter 14317: train loss 1.12068. lr 3.576605e-04
epoch 1 iter 14318: train loss 1.12674. lr 3.576321e-04
epoch 1 iter 14319: train loss 1.14033. lr 3.576038e-04
epoch 1 iter 14320: train loss 1.11814. lr 3.575755e-04
epoch 1 iter 14321: train loss 1.11942. lr 3.575472e-04

epoch 1 iter 14352: train loss 1.13034. lr 3.566689e-04
epoch 1 iter 14353: train loss 1.12640. lr 3.566405e-04
epoch 1 iter 14354: train loss 1.13490. lr 3.566122e-04
epoch 1 iter 14355: train loss 1.13058. lr 3.565838e-04
epoch 1 iter 14356: train loss 1.12406. lr 3.565555e-04

epoch 1 iter 14387: train loss 1.11595. lr 3.556766e-04
epoch 1 iter 14388: train loss 1.12608. lr 3.556483e-04
epoch 1 iter 14389: train loss 1.12991. lr 3.556199e-04
epoch 1 iter 14390: train loss 1.13715. lr 3.555915e-04
epoch 1 iter 14391: train loss 1.12445. lr 3.555632e-04

epoch 1 iter 14422: train loss 1.12234. lr 3.546837e-04
epoch 1 iter 14423: train loss 1.09492. lr 3.546554e-04
epoch 1 iter 14424: train loss 1.09450. lr 3.546270e-04
epoch 1 iter 14425: train loss 1.14586. lr 3.545986e-04
epoch 1 iter 14426: train loss 1.10093. lr 3.545702e-04

epoch 1 iter 14457: train loss 1.14919. lr 3.536902e-04
epoch 1 iter 14458: train loss 1.11518. lr 3.536618e-04
epoch 1 iter 14459: train loss 1.11387. lr 3.536335e-04
epoch 1 iter 14460: train loss 1.08290. lr 3.536051e-04
epoch 1 iter 14461: train loss 1.12050. lr 3.535767e-04

epoch 1 iter 14492: train loss 1.10780. lr 3.526961e-04
epoch 1 iter 14493: train loss 1.13147. lr 3.526677e-04
epoch 1 iter 14494: train loss 1.10487. lr 3.526393e-04
epoch 1 iter 14495: train loss 1.09432. lr 3.526109e-04
epoch 1 iter 14496: train loss 1.12368. lr 3.525825e-04

epoch 1 iter 14527: train loss 1.11119. lr 3.517014e-04
epoch 1 iter 14528: train loss 1.11851. lr 3.516730e-04
epoch 1 iter 14529: train loss 1.12403. lr 3.516446e-04
epoch 1 iter 14530: train loss 1.11307. lr 3.516161e-04
epoch 1 iter 14531: train loss 1.08817. lr 3.515877e-04

epoch 1 iter 14562: train loss 1.09118. lr 3.507061e-04
epoch 1 iter 14563: train loss 1.11528. lr 3.506777e-04
epoch 1 iter 14564: train loss 1.08966. lr 3.506493e-04
epoch 1 iter 14565: train loss 1.07801. lr 3.506208e-04
epoch 1 iter 14566: train loss 1.10660. lr 3.505924e-04

epoch 1 iter 14597: train loss 1.12235. lr 3.497103e-04
epoch 1 iter 14598: train loss 1.12944. lr 3.496818e-04
epoch 1 iter 14599: train loss 1.10826. lr 3.496534e-04
epoch 1 iter 14600: train loss 1.13571. lr 3.496249e-04
epoch 1 iter 14601: train loss 1.12618. lr 3.495964e-04

epoch 1 iter 14632: train loss 1.11520. lr 3.487139e-04
epoch 1 iter 14633: train loss 1.11139. lr 3.486854e-04
epoch 1 iter 14634: train loss 1.10445. lr 3.486569e-04
epoch 1 iter 14635: train loss 1.11804. lr 3.486284e-04
epoch 1 iter 14636: train loss 1.09191. lr 3.485999e-04

epoch 1 iter 14667: train loss 1.10297. lr 3.477169e-04
epoch 1 iter 14668: train loss 1.10393. lr 3.476884e-04
epoch 1 iter 14669: train loss 1.07701. lr 3.476599e-04
epoch 1 iter 14670: train loss 1.09814. lr 3.476314e-04
epoch 1 iter 14671: train loss 1.09167. lr 3.476029e-04

epoch 1 iter 14702: train loss 1.10658. lr 3.467194e-04
epoch 1 iter 14703: train loss 1.11485. lr 3.466908e-04
epoch 1 iter 14704: train loss 1.08860. lr 3.466623e-04
epoch 1 iter 14705: train loss 1.09954. lr 3.466338e-04
epoch 1 iter 14706: train loss 1.07986. lr 3.466053e-04

epoch 1 iter 14737: train loss 1.07224. lr 3.457213e-04
epoch 1 iter 14738: train loss 1.08994. lr 3.456928e-04
epoch 1 iter 14739: train loss 1.10138. lr 3.456643e-04
epoch 1 iter 14740: train loss 1.08387. lr 3.456357e-04
epoch 1 iter 14741: train loss 1.07065. lr 3.456072e-04

epoch 1 iter 14772: train loss 1.12586. lr 3.447227e-04
epoch 1 iter 14773: train loss 1.08820. lr 3.446942e-04
epoch 1 iter 14774: train loss 1.10124. lr 3.446657e-04
epoch 1 iter 14775: train loss 1.10261. lr 3.446371e-04
epoch 1 iter 14776: train loss 1.08837. lr 3.446086e-04

epoch 1 iter 14807: train loss 1.12266. lr 3.437237e-04
epoch 1 iter 14808: train loss 1.08393. lr 3.436951e-04
epoch 1 iter 14809: train loss 1.09962. lr 3.436666e-04
epoch 1 iter 14810: train loss 1.07891. lr 3.436380e-04
epoch 1 iter 14811: train loss 1.07368. lr 3.436095e-04

epoch 1 iter 14842: train loss 1.10611. lr 3.427241e-04
epoch 1 iter 14843: train loss 1.12382. lr 3.426955e-04
epoch 1 iter 14844: train loss 1.10368. lr 3.426670e-04
epoch 1 iter 14845: train loss 1.09262. lr 3.426384e-04
epoch 1 iter 14846: train loss 1.11365. lr 3.426098e-04

1
