## Train a character-level GPT on some text data

The inputs here are simple text files, which we chop up into individual characters and then train a GPT on. You could call this a char-transformer instead of a char-rnn, though it doesn't roll off the tongue quite as well. In this notebook we train one model per text file in a directory of genre-labelled data, and each model learns to predict its text one character at a time.

In [1]:
import numpy as np
from os import listdir
from os.path import join as pathjoin
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from mingpt.model import GPT, GPTConfig
from mingpt.trainer import Trainer, TrainerConfig
from mingpt.utils import sample, set_seed
# make deterministic
set_seed(42)

In [2]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
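
The shifted-by-one arrangement described in the docstring can be checked without PyTorch. The following is a minimal sketch that uses plain Python lists in place of tensors; `text`, `pairs`, and the tiny four-character vocabulary are illustrative and not part of the notebook.

```python
# Illustrative only: plain lists stand in for the tensors CharDataset returns.
text = "hello"
block_size = 4

chars = sorted(set(text))                      # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

chunk = text[0:block_size + 1]                 # block_size + 1 characters
dix = [stoi[ch] for ch in chunk]
x, y = dix[:-1], dix[1:]                       # inputs and targets shifted by one

# Position t sees x[:t+1] as context and must predict y[t].
pairs = [("".join(itos[i] for i in x[:t + 1]), itos[y[t]])
         for t in range(block_size)]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```

A single `(x, y)` pair of length `block_size` therefore carries `block_size` separate training examples.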


In [3]:
block_size = 128
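
`block_size` also caps how much context the model can see when generating text. The sketch below illustrates the train/test asymmetry from the `CharDataset` docstring: each new character costs one forward pass. It assumes the sampler crops the context to the last `block_size` characters. `toy_next_char` and `generate` are hypothetical stand-ins and not minGPT functions.

```python
# Illustrative only: a toy "model" that maps a context string to the next character.
def toy_next_char(context):
    # Hypothetical stand-in for one forward pass of a trained network.
    cycle = "abc"
    return cycle[(cycle.index(context[-1]) + 1) % len(cycle)]

def generate(prompt, steps, block_size):
    out = prompt
    forward_passes = 0
    for _ in range(steps):
        context = out[-block_size:]        # crop to the last block_size characters
        out += toy_next_char(context)      # one forward pass yields one new character
        forward_passes += 1
    return out, forward_passes

text, n_passes = generate("a", steps=5, block_size=4)
print(text, n_passes)
```

Generating `steps` characters takes `steps` sequential calls, whereas training scores every position of a block in one pass.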

In [4]:
def train_gpt_generator(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512,
                        max_epochs=2, batch_size=512):
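    # read the whole corpus and close the file afterwards; UTF-8 is an assumption about the data files
    with open(train_text_file, 'r', encoding='utf-8') as f:
        text = f.read()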
    train_dataset = CharDataset(text, block_size) 
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    tconf = TrainerConfig(
        max_epochs=max_epochs, batch_size=batch_size, learning_rate=6e-4,
        lr_decay=True, warmup_tokens=batch_size*20, final_tokens=max_epochs*len(train_dataset)*block_size,
        num_workers=4
    )
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), state_dict_file)
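
The `warmup_tokens` and `final_tokens` settings control how the learning rate changes over the run. The following is a sketch of one plausible schedule, not the trainer's actual code. It assumes a linear warmup followed by cosine decay over the number of tokens processed, with a floor at 10% of the base rate. The function name `lr_at` and the `floor` parameter are made up for illustration. The default `final_tokens` uses the 745066-character file from the run output, which gives 745066 - 128 = 744938 examples.

```python
import math

def lr_at(tokens_seen, learning_rate=6e-4, warmup_tokens=512 * 20,
          final_tokens=2 * 744938 * 128, floor=0.1):
    """Assumed schedule: linear warmup, then cosine decay down to floor * learning_rate."""
    if tokens_seen < warmup_tokens:
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * mult

tokens_per_batch = 512 * 128          # batch_size * block_size
for it in (0, 1, 36):
    # tokens processed once iteration `it` has finished
    print(it, "%e" % lr_at((it + 1) * tokens_per_batch))
```

The printed values can be compared against the `lr` column of the training log for the same iterations. Because a single batch of 512 * 128 tokens already exceeds `warmup_tokens`, the warmup phase is over after the first step under this assumption.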

In [5]:
GENRE_DATA_DIR = '/home/mlepekhin/data/genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt/'
LANG = 'ru'

In [None]:
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    label = train_text_file[:-4]
    train_gpt_generator(
        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
        pathjoin(GPT_MODELS_DIR, LANG, label)
    )
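
The loop above derives the label by slicing off the last four characters of the file name and writes into `GPT_MODELS_DIR/LANG`, which must already exist because `torch.save` does not create directories. A slightly more defensive variant might look like the sketch below. The helper name `label_and_output_path` and the file name `fiction.txt` are hypothetical.

```python
import os
import tempfile

def label_and_output_path(filename, models_dir, lang):
    """Derive a label from a data file name and make sure the output directory exists."""
    label, _ext = os.path.splitext(filename)      # 'fiction.txt' -> 'fiction'
    out_dir = os.path.join(models_dir, lang)
    os.makedirs(out_dir, exist_ok=True)           # create the target directory if needed
    return label, os.path.join(out_dir, label)

# Usage with a throwaway directory and a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    label, out_path = label_and_output_path("fiction.txt", tmp, "ru")
    print(label, os.path.isdir(os.path.dirname(out_path)))
```

Using `os.path.splitext` keeps the label correct for extensions that are not exactly three characters long.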

  0%|          | 0/10 [00:00<?, ?it/s]

data has 745066 characters, 184 unique.




epoch 1 iter 0: train loss 5.34787. lr 5.999999e-04:   0%|          | 0/1455 [00:07<?, ?it/s][A
epoch 1 iter 0: train loss 5.34787. lr 5.999999e-04:   0%|          | 1/1455 [00:07<2:59:08,  7.39s/it][A
epoch 1 iter 1: train loss 3.88315. lr 5.999994e-04:   0%|          | 1/1455 [00:07<2:59:08,  7.39s/it][A
epoch 1 iter 1: train loss 3.88315. lr 5.999994e-04:   0%|          | 2/1455 [00:07<2:07:49,  5.28s/it][A
epoch 1 iter 2: train loss 3.94659. lr 5.999986e-04:   0%|          | 2/1455 [00:08<2:07:49,  5.28s/it][A
epoch 1 iter 2: train loss 3.94659. lr 5.999986e-04:   0%|          | 3/1455 [00:08<1:31:51,  3.80s/it][A
epoch 1 iter 3: train loss 3.46090. lr 5.999974e-04:   0%|          | 3/1455 [00:08<1:31:51,  3.80s/it][A
epoch 1 iter 3: train loss 3.46090. lr 5.999974e-04:   0%|          | 4/1455 [00:08<1:06:59,  2.77s/it][A
epoch 1 iter 4: train loss 3.48075. lr 5.999959e-04:   0%|          | 4/1455 [00:08<1:06:59,  2.77s/it][A
epoch 1 iter 4: train loss 3.48075. lr 5.9999

epoch 1 iter 36: train loss 2.71652. lr 5.997627e-04:   3%|▎         | 37/1455 [00:19<07:51,  3.01it/s][A
epoch 1 iter 37: train loss 2.71328. lr 5.997496e-04:   3%|▎         | 37/1455 [00:20<07:51,  3.01it/s][A
epoch 1 iter 37: train loss 2.71328. lr 5.997496e-04:   3%|▎         | 38/1455 [00:20<07:46,  3.04it/s][A
epoch 1 iter 38: train loss 2.71492. lr 5.997362e-04:   3%|▎         | 38/1455 [00:20<07:46,  3.04it/s][A
epoch 1 iter 38: train loss 2.71492. lr 5.997362e-04:   3%|▎         | 39/1455 [00:20<07:42,  3.06it/s][A
epoch 1 iter 39: train loss 2.69224. lr 5.997225e-04:   3%|▎         | 39/1455 [00:20<07:42,  3.06it/s][A
epoch 1 iter 39: train loss 2.69224. lr 5.997225e-04:   3%|▎         | 40/1455 [00:20<07:40,  3.07it/s][A
epoch 1 iter 40: train loss 2.69743. lr 5.997084e-04:   3%|▎         | 40/1455 [00:20<07:40,  3.07it/s][A
epoch 1 iter 40: train loss 2.69743. lr 5.997084e-04:   3%|▎         | 41/1455 [00:20<07:38,  3.09it/s][A
epoch 1 iter 41: train loss 2.70664. 

epoch 1 iter 74: train loss 2.61741. lr 5.990211e-04:   5%|▌         | 75/1455 [00:32<07:28,  3.08it/s][A
epoch 1 iter 75: train loss 2.61303. lr 5.989947e-04:   5%|▌         | 75/1455 [00:32<07:28,  3.08it/s][A
epoch 1 iter 75: train loss 2.61303. lr 5.989947e-04:   5%|▌         | 76/1455 [00:32<07:27,  3.08it/s][A
epoch 1 iter 76: train loss 2.60827. lr 5.989681e-04:   5%|▌         | 76/1455 [00:32<07:27,  3.08it/s][A
epoch 1 iter 76: train loss 2.60827. lr 5.989681e-04:   5%|▌         | 77/1455 [00:32<07:27,  3.08it/s][A
epoch 1 iter 77: train loss 2.60747. lr 5.989411e-04:   5%|▌         | 77/1455 [00:33<07:27,  3.08it/s][A
epoch 1 iter 77: train loss 2.60747. lr 5.989411e-04:   5%|▌         | 78/1455 [00:33<07:41,  2.99it/s][A
epoch 1 iter 78: train loss 2.59457. lr 5.989137e-04:   5%|▌         | 78/1455 [00:33<07:41,  2.99it/s][A
epoch 1 iter 78: train loss 2.59457. lr 5.989137e-04:   5%|▌         | 79/1455 [00:33<07:36,  3.01it/s][A
epoch 1 iter 79: train loss 2.60484. 

epoch 1 iter 112: train loss 2.54612. lr 5.977762e-04:   8%|▊         | 113/1455 [00:44<07:20,  3.05it/s][A
epoch 1 iter 113: train loss 2.52788. lr 5.977367e-04:   8%|▊         | 113/1455 [00:44<07:20,  3.05it/s][A
epoch 1 iter 113: train loss 2.52788. lr 5.977367e-04:   8%|▊         | 114/1455 [00:44<07:18,  3.06it/s][A
epoch 1 iter 114: train loss 2.53914. lr 5.976968e-04:   8%|▊         | 114/1455 [00:45<07:18,  3.06it/s][A
epoch 1 iter 114: train loss 2.53914. lr 5.976968e-04:   8%|▊         | 115/1455 [00:45<07:17,  3.07it/s][A
epoch 1 iter 115: train loss 2.53685. lr 5.976565e-04:   8%|▊         | 115/1455 [00:45<07:17,  3.07it/s][A
epoch 1 iter 115: train loss 2.53685. lr 5.976565e-04:   8%|▊         | 116/1455 [00:45<07:16,  3.07it/s][A
epoch 1 iter 116: train loss 2.52204. lr 5.976160e-04:   8%|▊         | 116/1455 [00:45<07:16,  3.07it/s][A
epoch 1 iter 116: train loss 2.52204. lr 5.976160e-04:   8%|▊         | 117/1455 [00:45<07:16,  3.07it/s][A
epoch 1 iter 117: t

epoch 1 iter 150: train loss 2.47735. lr 5.960302e-04:  10%|█         | 150/1455 [00:57<07:38,  2.84it/s][A
epoch 1 iter 150: train loss 2.47735. lr 5.960302e-04:  10%|█         | 151/1455 [00:57<07:39,  2.84it/s][A
epoch 1 iter 151: train loss 2.47563. lr 5.959775e-04:  10%|█         | 151/1455 [00:57<07:39,  2.84it/s][A
epoch 1 iter 151: train loss 2.47563. lr 5.959775e-04:  10%|█         | 152/1455 [00:57<07:53,  2.75it/s][A
epoch 1 iter 152: train loss 2.47255. lr 5.959244e-04:  10%|█         | 152/1455 [00:57<07:53,  2.75it/s][A
epoch 1 iter 152: train loss 2.47255. lr 5.959244e-04:  11%|█         | 153/1455 [00:57<07:51,  2.76it/s][A
epoch 1 iter 153: train loss 2.47646. lr 5.958711e-04:  11%|█         | 153/1455 [00:58<07:51,  2.76it/s][A
epoch 1 iter 153: train loss 2.47646. lr 5.958711e-04:  11%|█         | 154/1455 [00:58<07:46,  2.79it/s][A
epoch 1 iter 154: train loss 2.46879. lr 5.958173e-04:  11%|█         | 154/1455 [00:58<07:46,  2.79it/s][A
epoch 1 iter 154: t

epoch 1 iter 187: train loss 2.40934. lr 5.938513e-04:  13%|█▎        | 188/1455 [01:10<07:35,  2.78it/s][A
epoch 1 iter 188: train loss 2.40442. lr 5.937859e-04:  13%|█▎        | 188/1455 [01:10<07:35,  2.78it/s][A
epoch 1 iter 188: train loss 2.40442. lr 5.937859e-04:  13%|█▎        | 189/1455 [01:10<07:35,  2.78it/s][A
epoch 1 iter 189: train loss 2.40138. lr 5.937202e-04:  13%|█▎        | 189/1455 [01:11<07:35,  2.78it/s][A
epoch 1 iter 189: train loss 2.40138. lr 5.937202e-04:  13%|█▎        | 190/1455 [01:11<07:38,  2.76it/s][A
epoch 1 iter 190: train loss 2.39917. lr 5.936541e-04:  13%|█▎        | 190/1455 [01:11<07:38,  2.76it/s][A
epoch 1 iter 190: train loss 2.39917. lr 5.936541e-04:  13%|█▎        | 191/1455 [01:11<07:39,  2.75it/s][A
epoch 1 iter 191: train loss 2.40069. lr 5.935876e-04:  13%|█▎        | 191/1455 [01:12<07:39,  2.75it/s][A
epoch 1 iter 191: train loss 2.40069. lr 5.935876e-04:  13%|█▎        | 192/1455 [01:12<07:40,  2.74it/s][A
epoch 1 iter 192: t

epoch 1 iter 225: train loss 2.31079. lr 5.911256e-04:  15%|█▌        | 225/1455 [01:23<06:55,  2.96it/s][A
epoch 1 iter 225: train loss 2.31079. lr 5.911256e-04:  16%|█▌        | 226/1455 [01:23<07:11,  2.85it/s][A
epoch 1 iter 226: train loss 2.30507. lr 5.910472e-04:  16%|█▌        | 226/1455 [01:24<07:11,  2.85it/s][A
epoch 1 iter 226: train loss 2.30507. lr 5.910472e-04:  16%|█▌        | 227/1455 [01:24<07:06,  2.88it/s][A
epoch 1 iter 227: train loss 2.31050. lr 5.909685e-04:  16%|█▌        | 227/1455 [01:24<07:06,  2.88it/s][A
epoch 1 iter 227: train loss 2.31050. lr 5.909685e-04:  16%|█▌        | 228/1455 [01:24<07:03,  2.90it/s][A
epoch 1 iter 228: train loss 2.31636. lr 5.908894e-04:  16%|█▌        | 228/1455 [01:24<07:03,  2.90it/s][A
epoch 1 iter 228: train loss 2.31636. lr 5.908894e-04:  16%|█▌        | 229/1455 [01:24<07:00,  2.92it/s][A
epoch 1 iter 229: train loss 2.29274. lr 5.908101e-04:  16%|█▌        | 229/1455 [01:25<07:00,  2.92it/s][A
epoch 1 iter 229: t

epoch 1 iter 262: train loss 2.20081. lr 5.880007e-04:  18%|█▊        | 263/1455 [01:36<06:44,  2.95it/s][A
epoch 1 iter 263: train loss 2.20488. lr 5.879098e-04:  18%|█▊        | 263/1455 [01:36<06:44,  2.95it/s][A
epoch 1 iter 263: train loss 2.20488. lr 5.879098e-04:  18%|█▊        | 264/1455 [01:36<06:43,  2.95it/s][A
epoch 1 iter 264: train loss 2.19950. lr 5.878186e-04:  18%|█▊        | 264/1455 [01:37<06:43,  2.95it/s][A
epoch 1 iter 264: train loss 2.19950. lr 5.878186e-04:  18%|█▊        | 265/1455 [01:37<06:42,  2.95it/s][A
epoch 1 iter 265: train loss 2.21641. lr 5.877271e-04:  18%|█▊        | 265/1455 [01:37<06:42,  2.95it/s][A
epoch 1 iter 265: train loss 2.21641. lr 5.877271e-04:  18%|█▊        | 266/1455 [01:37<06:42,  2.95it/s][A
epoch 1 iter 266: train loss 2.21038. lr 5.876353e-04:  18%|█▊        | 266/1455 [01:37<06:42,  2.95it/s][A
epoch 1 iter 266: train loss 2.21038. lr 5.876353e-04:  18%|█▊        | 267/1455 [01:37<06:42,  2.95it/s][A
epoch 1 iter 267: t

epoch 1 iter 300: train loss 2.10697. lr 5.843131e-04:  21%|██        | 300/1455 [01:49<06:41,  2.88it/s][A
epoch 1 iter 300: train loss 2.10697. lr 5.843131e-04:  21%|██        | 301/1455 [01:49<06:39,  2.89it/s][A
epoch 1 iter 301: train loss 2.11237. lr 5.842095e-04:  21%|██        | 301/1455 [01:49<06:39,  2.89it/s][A
epoch 1 iter 301: train loss 2.11237. lr 5.842095e-04:  21%|██        | 302/1455 [01:49<06:37,  2.90it/s][A
epoch 1 iter 302: train loss 2.10141. lr 5.841057e-04:  21%|██        | 302/1455 [01:50<06:37,  2.90it/s][A
epoch 1 iter 302: train loss 2.10141. lr 5.841057e-04:  21%|██        | 303/1455 [01:50<06:35,  2.91it/s][A
epoch 1 iter 303: train loss 2.08111. lr 5.840015e-04:  21%|██        | 303/1455 [01:50<06:35,  2.91it/s][A
epoch 1 iter 303: train loss 2.08111. lr 5.840015e-04:  21%|██        | 304/1455 [01:50<06:34,  2.92it/s][A
epoch 1 iter 304: train loss 2.10043. lr 5.838970e-04:  21%|██        | 304/1455 [01:50<06:34,  2.92it/s][A
epoch 1 iter 304: t

epoch 1 iter 337: train loss 2.02170. lr 5.802627e-04:  23%|██▎       | 338/1455 [02:02<06:21,  2.93it/s][A
epoch 1 iter 338: train loss 2.00430. lr 5.801470e-04:  23%|██▎       | 338/1455 [02:02<06:21,  2.93it/s][A
epoch 1 iter 338: train loss 2.00430. lr 5.801470e-04:  23%|██▎       | 339/1455 [02:02<06:20,  2.93it/s][A
epoch 1 iter 339: train loss 1.99907. lr 5.800309e-04:  23%|██▎       | 339/1455 [02:02<06:20,  2.93it/s][A
epoch 1 iter 339: train loss 1.99907. lr 5.800309e-04:  23%|██▎       | 340/1455 [02:02<06:20,  2.93it/s][A
epoch 1 iter 340: train loss 1.99694. lr 5.799146e-04:  23%|██▎       | 340/1455 [02:03<06:20,  2.93it/s][A
epoch 1 iter 340: train loss 1.99694. lr 5.799146e-04:  23%|██▎       | 341/1455 [02:03<06:20,  2.93it/s][A
epoch 1 iter 341: train loss 1.99002. lr 5.797979e-04:  23%|██▎       | 341/1455 [02:03<06:20,  2.93it/s][A
epoch 1 iter 341: train loss 1.99002. lr 5.797979e-04:  24%|██▎       | 342/1455 [02:03<06:19,  2.93it/s][A
epoch 1 iter 342: t

epoch 1 iter 375: train loss 1.92642. lr 5.756374e-04:  26%|██▌       | 375/1455 [02:15<06:18,  2.86it/s][A
epoch 1 iter 375: train loss 1.92642. lr 5.756374e-04:  26%|██▌       | 376/1455 [02:15<06:15,  2.88it/s][A
epoch 1 iter 376: train loss 1.91127. lr 5.755093e-04:  26%|██▌       | 376/1455 [02:15<06:15,  2.88it/s][A
epoch 1 iter 376: train loss 1.91127. lr 5.755093e-04:  26%|██▌       | 377/1455 [02:15<06:13,  2.89it/s][A
epoch 1 iter 377: train loss 1.89690. lr 5.753810e-04:  26%|██▌       | 377/1455 [02:16<06:13,  2.89it/s][A
epoch 1 iter 377: train loss 1.89690. lr 5.753810e-04:  26%|██▌       | 378/1455 [02:16<06:11,  2.90it/s][A
epoch 1 iter 378: train loss 1.93850. lr 5.752523e-04:  26%|██▌       | 378/1455 [02:16<06:11,  2.90it/s][A
epoch 1 iter 378: train loss 1.93850. lr 5.752523e-04:  26%|██▌       | 379/1455 [02:16<06:10,  2.90it/s][A
epoch 1 iter 379: train loss 1.93455. lr 5.751234e-04:  26%|██▌       | 379/1455 [02:16<06:10,  2.90it/s][A
epoch 1 iter 379: t

epoch 1 iter 412: train loss 1.82925. lr 5.706879e-04:  28%|██▊       | 413/1455 [02:28<06:01,  2.88it/s][A
epoch 1 iter 413: train loss 1.82788. lr 5.705481e-04:  28%|██▊       | 413/1455 [02:28<06:01,  2.88it/s][A
epoch 1 iter 413: train loss 1.82788. lr 5.705481e-04:  28%|██▊       | 414/1455 [02:28<06:00,  2.89it/s][A
epoch 1 iter 414: train loss 1.83177. lr 5.704080e-04:  28%|██▊       | 414/1455 [02:28<06:00,  2.89it/s][A
epoch 1 iter 414: train loss 1.83177. lr 5.704080e-04:  29%|██▊       | 415/1455 [02:28<05:59,  2.89it/s][A
epoch 1 iter 415: train loss 1.82244. lr 5.702676e-04:  29%|██▊       | 415/1455 [02:29<05:59,  2.89it/s][A
epoch 1 iter 415: train loss 1.82244. lr 5.702676e-04:  29%|██▊       | 416/1455 [02:29<05:58,  2.90it/s][A
epoch 1 iter 416: train loss 1.80898. lr 5.701269e-04:  29%|██▊       | 416/1455 [02:29<05:58,  2.90it/s][A
epoch 1 iter 416: train loss 1.80898. lr 5.701269e-04:  29%|██▊       | 417/1455 [02:29<06:11,  2.79it/s][A
epoch 1 iter 417: t

epoch 1 iter 450: train loss 1.75201. lr 5.651553e-04:  31%|███       | 450/1455 [02:41<05:46,  2.90it/s][A
epoch 1 iter 450: train loss 1.75201. lr 5.651553e-04:  31%|███       | 451/1455 [02:41<05:44,  2.91it/s][A
epoch 1 iter 451: train loss 1.74920. lr 5.650036e-04:  31%|███       | 451/1455 [02:41<05:44,  2.91it/s][A
epoch 1 iter 451: train loss 1.74920. lr 5.650036e-04:  31%|███       | 452/1455 [02:41<05:43,  2.92it/s][A
epoch 1 iter 452: train loss 1.75893. lr 5.648516e-04:  31%|███       | 452/1455 [02:41<05:43,  2.92it/s][A
epoch 1 iter 452: train loss 1.75893. lr 5.648516e-04:  31%|███       | 453/1455 [02:41<05:42,  2.92it/s][A
epoch 1 iter 453: train loss 1.75599. lr 5.646993e-04:  31%|███       | 453/1455 [02:42<05:42,  2.92it/s][A
epoch 1 iter 453: train loss 1.75599. lr 5.646993e-04:  31%|███       | 454/1455 [02:42<05:41,  2.93it/s][A
epoch 1 iter 454: train loss 1.75348. lr 5.645467e-04:  31%|███       | 454/1455 [02:42<05:41,  2.93it/s][A
epoch 1 iter 454: t

epoch 1 iter 487: train loss 1.69164. lr 5.593393e-04:  34%|███▎      | 488/1455 [02:54<05:36,  2.87it/s][A
epoch 1 iter 488: train loss 1.67900. lr 5.591763e-04:  34%|███▎      | 488/1455 [02:54<05:36,  2.87it/s][A
epoch 1 iter 488: train loss 1.67900. lr 5.591763e-04:  34%|███▎      | 489/1455 [02:54<05:35,  2.88it/s][A
epoch 1 iter 489: train loss 1.67036. lr 5.590130e-04:  34%|███▎      | 489/1455 [02:54<05:35,  2.88it/s][A
epoch 1 iter 489: train loss 1.67036. lr 5.590130e-04:  34%|███▎      | 490/1455 [02:54<05:35,  2.88it/s][A
epoch 1 iter 490: train loss 1.67705. lr 5.588494e-04:  34%|███▎      | 490/1455 [02:55<05:35,  2.88it/s][A
epoch 1 iter 490: train loss 1.67705. lr 5.588494e-04:  34%|███▎      | 491/1455 [02:55<05:48,  2.77it/s][A
epoch 1 iter 491: train loss 1.67217. lr 5.586856e-04:  34%|███▎      | 491/1455 [02:55<05:48,  2.77it/s][A
epoch 1 iter 491: train loss 1.67217. lr 5.586856e-04:  34%|███▍      | 492/1455 [02:55<05:43,  2.80it/s][A
epoch 1 iter 492: t

epoch 1 iter 525: train loss 1.62954. lr 5.529355e-04:  36%|███▌      | 525/1455 [03:07<05:31,  2.81it/s][A
epoch 1 iter 525: train loss 1.62954. lr 5.529355e-04:  36%|███▌      | 526/1455 [03:07<05:29,  2.82it/s][A
epoch 1 iter 526: train loss 1.62493. lr 5.527611e-04:  36%|███▌      | 526/1455 [03:07<05:29,  2.82it/s][A
epoch 1 iter 526: train loss 1.62493. lr 5.527611e-04:  36%|███▌      | 527/1455 [03:07<05:27,  2.84it/s][A
epoch 1 iter 527: train loss 1.62825. lr 5.525865e-04:  36%|███▌      | 527/1455 [03:08<05:27,  2.84it/s][A
epoch 1 iter 527: train loss 1.62825. lr 5.525865e-04:  36%|███▋      | 528/1455 [03:08<05:29,  2.82it/s][A
epoch 1 iter 528: train loss 1.59982. lr 5.524116e-04:  36%|███▋      | 528/1455 [03:08<05:29,  2.82it/s][A
epoch 1 iter 528: train loss 1.59982. lr 5.524116e-04:  36%|███▋      | 529/1455 [03:08<05:29,  2.81it/s][A
epoch 1 iter 529: train loss 1.62151. lr 5.522364e-04:  36%|███▋      | 529/1455 [03:08<05:29,  2.81it/s][A
epoch 1 iter 529: t

epoch 1 iter 562: train loss 1.57057. lr 5.462910e-04:  39%|███▊      | 563/1455 [03:20<05:27,  2.72it/s][A
epoch 1 iter 563: train loss 1.56961. lr 5.461060e-04:  39%|███▊      | 563/1455 [03:21<05:27,  2.72it/s][A
epoch 1 iter 563: train loss 1.56961. lr 5.461060e-04:  39%|███▉      | 564/1455 [03:21<05:26,  2.73it/s][A
epoch 1 iter 564: train loss 1.56286. lr 5.459206e-04:  39%|███▉      | 564/1455 [03:21<05:26,  2.73it/s][A
epoch 1 iter 564: train loss 1.56286. lr 5.459206e-04:  39%|███▉      | 565/1455 [03:21<05:40,  2.61it/s][A
epoch 1 iter 565: train loss 1.54230. lr 5.457349e-04:  39%|███▉      | 565/1455 [03:22<05:40,  2.61it/s][A
epoch 1 iter 565: train loss 1.54230. lr 5.457349e-04:  39%|███▉      | 566/1455 [03:22<05:40,  2.61it/s][A
epoch 1 iter 566: train loss 1.55612. lr 5.455490e-04:  39%|███▉      | 566/1455 [03:22<05:40,  2.61it/s][A
epoch 1 iter 566: train loss 1.55612. lr 5.455490e-04:  39%|███▉      | 567/1455 [03:22<05:37,  2.63it/s][A
epoch 1 iter 567: t

epoch 1 iter 600: train loss 1.51226. lr 5.390581e-04:  41%|████      | 600/1455 [03:35<05:24,  2.63it/s][A
epoch 1 iter 600: train loss 1.51226. lr 5.390581e-04:  41%|████▏     | 601/1455 [03:35<05:24,  2.63it/s][A
epoch 1 iter 601: train loss 1.50793. lr 5.388622e-04:  41%|████▏     | 601/1455 [03:35<05:24,  2.63it/s][A
epoch 1 iter 601: train loss 1.50793. lr 5.388622e-04:  41%|████▏     | 602/1455 [03:35<05:17,  2.69it/s][A
epoch 1 iter 602: train loss 1.50569. lr 5.386661e-04:  41%|████▏     | 602/1455 [03:36<05:17,  2.69it/s][A
epoch 1 iter 602: train loss 1.50569. lr 5.386661e-04:  41%|████▏     | 603/1455 [03:36<05:16,  2.69it/s][A
epoch 1 iter 603: train loss 1.50582. lr 5.384697e-04:  41%|████▏     | 603/1455 [03:36<05:16,  2.69it/s][A
epoch 1 iter 603: train loss 1.50582. lr 5.384697e-04:  42%|████▏     | 604/1455 [03:36<05:19,  2.67it/s][A
epoch 1 iter 604: train loss 1.50441. lr 5.382731e-04:  42%|████▏     | 604/1455 [03:36<05:19,  2.67it/s][A
epoch 1 iter 604: t

epoch 1 iter 637: train loss 1.45478. lr 5.316287e-04:  44%|████▍     | 638/1455 [03:49<05:16,  2.58it/s][A
epoch 1 iter 638: train loss 1.45991. lr 5.314228e-04:  44%|████▍     | 638/1455 [03:49<05:16,  2.58it/s][A
epoch 1 iter 638: train loss 1.45991. lr 5.314228e-04:  44%|████▍     | 639/1455 [03:49<05:28,  2.48it/s][A
epoch 1 iter 639: train loss 1.46490. lr 5.312165e-04:  44%|████▍     | 639/1455 [03:50<05:28,  2.48it/s][A
epoch 1 iter 639: train loss 1.46490. lr 5.312165e-04:  44%|████▍     | 640/1455 [03:50<05:24,  2.51it/s][A
epoch 1 iter 640: train loss 1.44646. lr 5.310100e-04:  44%|████▍     | 640/1455 [03:50<05:24,  2.51it/s][A
epoch 1 iter 640: train loss 1.44646. lr 5.310100e-04:  44%|████▍     | 641/1455 [03:50<05:21,  2.53it/s][A
epoch 1 iter 641: train loss 1.48388. lr 5.308032e-04:  44%|████▍     | 641/1455 [03:51<05:21,  2.53it/s][A
epoch 1 iter 641: train loss 1.48388. lr 5.308032e-04:  44%|████▍     | 642/1455 [03:51<05:19,  2.54it/s][A
epoch 1 iter 642: t

epoch 1 iter 675: train loss 1.39202. lr 5.236140e-04:  46%|████▋     | 675/1455 [04:04<05:06,  2.55it/s][A
epoch 1 iter 675: train loss 1.39202. lr 5.236140e-04:  46%|████▋     | 676/1455 [04:04<05:08,  2.53it/s][A
epoch 1 iter 676: train loss 1.39973. lr 5.233980e-04:  46%|████▋     | 676/1455 [04:04<05:08,  2.53it/s][A
epoch 1 iter 676: train loss 1.39973. lr 5.233980e-04:  47%|████▋     | 677/1455 [04:04<05:09,  2.51it/s][A
epoch 1 iter 677: train loss 1.40800. lr 5.231816e-04:  47%|████▋     | 677/1455 [04:05<05:09,  2.51it/s][A
epoch 1 iter 677: train loss 1.40800. lr 5.231816e-04:  47%|████▋     | 678/1455 [04:05<05:10,  2.50it/s][A
epoch 1 iter 678: train loss 1.39178. lr 5.229651e-04:  47%|████▋     | 678/1455 [04:05<05:10,  2.50it/s][A
epoch 1 iter 678: train loss 1.39178. lr 5.229651e-04:  47%|████▋     | 679/1455 [04:05<05:10,  2.50it/s][A
epoch 1 iter 679: train loss 1.39969. lr 5.227482e-04:  47%|████▋     | 679/1455 [04:06<05:10,  2.50it/s][A
epoch 1 iter 679: t

epoch 1 iter 712: train loss 1.35788. lr 5.154485e-04:  49%|████▉     | 713/1455 [04:20<05:30,  2.24it/s][A
epoch 1 iter 713: train loss 1.36846. lr 5.152229e-04:  49%|████▉     | 713/1455 [04:20<05:30,  2.24it/s][A
epoch 1 iter 713: train loss 1.36846. lr 5.152229e-04:  49%|████▉     | 714/1455 [04:20<05:18,  2.33it/s][A
epoch 1 iter 714: train loss 1.36605. lr 5.149972e-04:  49%|████▉     | 714/1455 [04:21<05:18,  2.33it/s][A
epoch 1 iter 714: train loss 1.36605. lr 5.149972e-04:  49%|████▉     | 715/1455 [04:21<05:09,  2.39it/s][A
epoch 1 iter 715: train loss 1.35689. lr 5.147711e-04:  49%|████▉     | 715/1455 [04:21<05:09,  2.39it/s][A
epoch 1 iter 715: train loss 1.35689. lr 5.147711e-04:  49%|████▉     | 716/1455 [04:21<05:01,  2.45it/s][A
epoch 1 iter 716: train loss 1.36277. lr 5.145449e-04:  49%|████▉     | 716/1455 [04:21<05:01,  2.45it/s][A
epoch 1 iter 716: train loss 1.36277. lr 5.145449e-04:  49%|████▉     | 717/1455 [04:21<04:56,  2.49it/s][A
epoch 1 iter 717: t

epoch 1 iter 750: train loss 1.32120. lr 5.067045e-04:  52%|█████▏    | 750/1455 [04:35<04:49,  2.44it/s][A
epoch 1 iter 750: train loss 1.32120. lr 5.067045e-04:  52%|█████▏    | 751/1455 [04:35<04:48,  2.44it/s][A
epoch 1 iter 751: train loss 1.31321. lr 5.064696e-04:  52%|█████▏    | 751/1455 [04:36<04:48,  2.44it/s][A
epoch 1 iter 751: train loss 1.31321. lr 5.064696e-04:  52%|█████▏    | 752/1455 [04:36<04:47,  2.44it/s][A
epoch 1 iter 752: train loss 1.31212. lr 5.062345e-04:  52%|█████▏    | 752/1455 [04:36<04:47,  2.44it/s][A
epoch 1 iter 752: train loss 1.31212. lr 5.062345e-04:  52%|█████▏    | 753/1455 [04:36<04:48,  2.44it/s][A
epoch 1 iter 753: train loss 1.30887. lr 5.059992e-04:  52%|█████▏    | 753/1455 [04:36<04:48,  2.44it/s][A
epoch 1 iter 753: train loss 1.30887. lr 5.059992e-04:  52%|█████▏    | 754/1455 [04:36<04:47,  2.44it/s][A
epoch 1 iter 754: train loss 1.30866. lr 5.057636e-04:  52%|█████▏    | 754/1455 [04:37<04:47,  2.44it/s][A
epoch 1 iter 754: t

epoch 1 iter 787: train loss 1.27657. lr 4.978563e-04:  54%|█████▍    | 788/1455 [04:51<04:56,  2.25it/s][A
epoch 1 iter 788: train loss 1.27519. lr 4.976127e-04:  54%|█████▍    | 788/1455 [04:51<04:56,  2.25it/s][A
epoch 1 iter 788: train loss 1.27519. lr 4.976127e-04:  54%|█████▍    | 789/1455 [04:51<04:56,  2.25it/s][A
epoch 1 iter 789: train loss 1.26668. lr 4.973688e-04:  54%|█████▍    | 789/1455 [04:52<04:56,  2.25it/s][A
epoch 1 iter 789: train loss 1.26668. lr 4.973688e-04:  54%|█████▍    | 790/1455 [04:52<04:53,  2.27it/s][A
epoch 1 iter 790: train loss 1.26727. lr 4.971248e-04:  54%|█████▍    | 790/1455 [04:52<04:53,  2.27it/s][A
epoch 1 iter 790: train loss 1.26727. lr 4.971248e-04:  54%|█████▍    | 791/1455 [04:52<04:50,  2.28it/s][A
epoch 1 iter 791: train loss 1.26740. lr 4.968805e-04:  54%|█████▍    | 791/1455 [04:52<04:50,  2.28it/s][A
epoch 1 iter 791: train loss 1.26740. lr 4.968805e-04:  54%|█████▍    | 792/1455 [04:52<04:48,  2.29it/s][A
epoch 1 iter 792: t

epoch 1 iter 825: train loss 1.20871. lr 4.884404e-04:  57%|█████▋    | 825/1455 [05:07<04:08,  2.54it/s][A
epoch 1 iter 825: train loss 1.20871. lr 4.884404e-04:  57%|█████▋    | 826/1455 [05:07<04:09,  2.53it/s][A
epoch 1 iter 826: train loss 1.21879. lr 4.881882e-04:  57%|█████▋    | 826/1455 [05:08<04:09,  2.53it/s][A
epoch 1 iter 826: train loss 1.21879. lr 4.881882e-04:  57%|█████▋    | 827/1455 [05:08<04:15,  2.46it/s][A
epoch 1 iter 827: train loss 1.22278. lr 4.879359e-04:  57%|█████▋    | 827/1455 [05:08<04:15,  2.46it/s][A
epoch 1 iter 827: train loss 1.22278. lr 4.879359e-04:  57%|█████▋    | 828/1455 [05:08<04:23,  2.38it/s][A
epoch 1 iter 828: train loss 1.22136. lr 4.876833e-04:  57%|█████▋    | 828/1455 [05:09<04:23,  2.38it/s][A
epoch 1 iter 828: train loss 1.22136. lr 4.876833e-04:  57%|█████▋    | 829/1455 [05:09<04:17,  2.43it/s][A
epoch 1 iter 829: train loss 1.21629. lr 4.874305e-04:  57%|█████▋    | 829/1455 [05:09<04:17,  2.43it/s][A
epoch 1 iter 829: t

epoch 1 iter 862: train loss 1.17065. lr 4.789674e-04:  59%|█████▉    | 863/1455 [05:25<04:59,  1.98it/s][A
epoch 1 iter 863: train loss 1.17507. lr 4.787073e-04:  59%|█████▉    | 863/1455 [05:26<04:59,  1.98it/s][A
epoch 1 iter 863: train loss 1.17507. lr 4.787073e-04:  59%|█████▉    | 864/1455 [05:26<04:59,  1.97it/s][A
epoch 1 iter 864: train loss 1.17588. lr 4.784471e-04:  59%|█████▉    | 864/1455 [05:26<04:59,  1.97it/s][A
epoch 1 iter 864: train loss 1.17588. lr 4.784471e-04:  59%|█████▉    | 865/1455 [05:26<04:57,  1.99it/s][A
epoch 1 iter 865: train loss 1.17781. lr 4.781866e-04:  59%|█████▉    | 865/1455 [05:27<04:57,  1.99it/s][A
epoch 1 iter 865: train loss 1.17781. lr 4.781866e-04:  60%|█████▉    | 866/1455 [05:27<04:48,  2.04it/s][A
epoch 1 iter 866: train loss 1.17861. lr 4.779259e-04:  60%|█████▉    | 866/1455 [05:27<04:48,  2.04it/s][A
epoch 1 iter 866: train loss 1.17861. lr 4.779259e-04:  60%|█████▉    | 867/1455 [05:27<04:41,  2.09it/s][A
epoch 1 iter 867: t

epoch 1 iter 900: train loss 1.13713. lr 4.689413e-04:  62%|██████▏   | 900/1455 [05:42<04:23,  2.11it/s][A
epoch 1 iter 900: train loss 1.13713. lr 4.689413e-04:  62%|██████▏   | 901/1455 [05:42<04:22,  2.11it/s][A
epoch 1 iter 901: train loss 1.13279. lr 4.686735e-04:  62%|██████▏   | 901/1455 [05:43<04:22,  2.11it/s][A
epoch 1 iter 901: train loss 1.13279. lr 4.686735e-04:  62%|██████▏   | 902/1455 [05:43<04:14,  2.17it/s][A
epoch 1 iter 902: train loss 1.12330. lr 4.684056e-04:  62%|██████▏   | 902/1455 [05:43<04:14,  2.17it/s][A
epoch 1 iter 902: train loss 1.12330. lr 4.684056e-04:  62%|██████▏   | 903/1455 [05:43<04:11,  2.19it/s][A
epoch 1 iter 903: train loss 1.12420. lr 4.681374e-04:  62%|██████▏   | 903/1455 [05:44<04:11,  2.19it/s][A
epoch 1 iter 903: train loss 1.12420. lr 4.681374e-04:  62%|██████▏   | 904/1455 [05:44<04:09,  2.21it/s][A
epoch 1 iter 904: train loss 1.13452. lr 4.678691e-04:  62%|██████▏   | 904/1455 [05:44<04:09,  2.21it/s][A
epoch 1 iter 904: t

epoch 1 iter 937: train loss 1.10147. lr 4.589057e-04:  64%|██████▍   | 938/1455 [06:02<04:02,  2.13it/s][A
epoch 1 iter 938: train loss 1.09640. lr 4.586309e-04:  64%|██████▍   | 938/1455 [06:02<04:02,  2.13it/s][A
epoch 1 iter 938: train loss 1.09640. lr 4.586309e-04:  65%|██████▍   | 939/1455 [06:02<03:57,  2.17it/s][A
epoch 1 iter 939: train loss 1.08061. lr 4.583559e-04:  65%|██████▍   | 939/1455 [06:02<03:57,  2.17it/s][A
epoch 1 iter 939: train loss 1.08061. lr 4.583559e-04:  65%|██████▍   | 940/1455 [06:02<03:55,  2.19it/s][A
epoch 1 iter 940: train loss 1.08266. lr 4.580807e-04:  65%|██████▍   | 940/1455 [06:03<03:55,  2.19it/s][A
epoch 1 iter 940: train loss 1.08266. lr 4.580807e-04:  65%|██████▍   | 941/1455 [06:03<03:53,  2.20it/s][A
epoch 1 iter 941: train loss 1.08196. lr 4.578053e-04:  65%|██████▍   | 941/1455 [06:03<03:53,  2.20it/s][A
epoch 1 iter 941: train loss 1.08196. lr 4.578053e-04:  65%|██████▍   | 942/1455 [06:03<03:51,  2.22it/s][A
epoch 1 iter 942: t

epoch 1 iter 975: train loss 1.05539. lr 4.483351e-04:  67%|██████▋   | 975/1455 [06:19<03:43,  2.15it/s][A
epoch 1 iter 975: train loss 1.05539. lr 4.483351e-04:  67%|██████▋   | 976/1455 [06:19<03:38,  2.19it/s][A
epoch 1 iter 976: train loss 1.03619. lr 4.480535e-04:  67%|██████▋   | 976/1455 [06:20<03:38,  2.19it/s][A
epoch 1 iter 976: train loss 1.03619. lr 4.480535e-04:  67%|██████▋   | 977/1455 [06:20<03:34,  2.23it/s][A
epoch 1 iter 977: train loss 1.04019. lr 4.477717e-04:  67%|██████▋   | 977/1455 [06:20<03:34,  2.23it/s][A
epoch 1 iter 977: train loss 1.04019. lr 4.477717e-04:  67%|██████▋   | 978/1455 [06:20<03:44,  2.13it/s][A
epoch 1 iter 978: train loss 1.05102. lr 4.474897e-04:  67%|██████▋   | 978/1455 [06:21<03:44,  2.13it/s][A
epoch 1 iter 978: train loss 1.05102. lr 4.474897e-04:  67%|██████▋   | 979/1455 [06:21<03:38,  2.17it/s][A
epoch 1 iter 979: train loss 1.04249. lr 4.472075e-04:  67%|██████▋   | 979/1455 [06:21<03:38,  2.17it/s][A
epoch 1 iter 979: t

epoch 1 iter 1012: train loss 0.98994. lr 4.378026e-04:  70%|██████▉   | 1012/1455 [06:38<03:48,  1.94it/s][A
epoch 1 iter 1012: train loss 0.98994. lr 4.378026e-04:  70%|██████▉   | 1013/1455 [06:38<03:35,  2.05it/s][A
epoch 1 iter 1013: train loss 0.99484. lr 4.375148e-04:  70%|██████▉   | 1013/1455 [06:38<03:35,  2.05it/s][A
epoch 1 iter 1013: train loss 0.99484. lr 4.375148e-04:  70%|██████▉   | 1014/1455 [06:38<03:26,  2.14it/s][A
epoch 1 iter 1014: train loss 0.99681. lr 4.372269e-04:  70%|██████▉   | 1014/1455 [06:39<03:26,  2.14it/s][A
epoch 1 iter 1014: train loss 0.99681. lr 4.372269e-04:  70%|██████▉   | 1015/1455 [06:39<03:20,  2.20it/s][A
epoch 1 iter 1015: train loss 0.99378. lr 4.369388e-04:  70%|██████▉   | 1015/1455 [06:39<03:20,  2.20it/s][A
epoch 1 iter 1015: train loss 0.99378. lr 4.369388e-04:  70%|██████▉   | 1016/1455 [06:39<03:15,  2.24it/s][A
epoch 1 iter 1016: train loss 0.99229. lr 4.366505e-04:  70%|██████▉   | 1016/1455 [06:40<03:15,  2.24it/s][A

epoch 1 iter 1048: train loss 0.95087. lr 4.273436e-04:  72%|███████▏  | 1049/1455 [06:59<04:29,  1.51it/s][A
epoch 1 iter 1049: train loss 0.94895. lr 4.270502e-04:  72%|███████▏  | 1049/1455 [06:59<04:29,  1.51it/s][A
epoch 1 iter 1049: train loss 0.94895. lr 4.270502e-04:  72%|███████▏  | 1050/1455 [06:59<04:20,  1.56it/s][A
epoch 1 iter 1050: train loss 0.94473. lr 4.267567e-04:  72%|███████▏  | 1050/1455 [07:00<04:20,  1.56it/s][A
epoch 1 iter 1050: train loss 0.94473. lr 4.267567e-04:  72%|███████▏  | 1051/1455 [07:00<04:10,  1.62it/s][A
epoch 1 iter 1051: train loss 0.93842. lr 4.264631e-04:  72%|███████▏  | 1051/1455 [07:00<04:10,  1.62it/s][A
epoch 1 iter 1051: train loss 0.93842. lr 4.264631e-04:  72%|███████▏  | 1052/1455 [07:00<04:06,  1.64it/s][A
epoch 1 iter 1052: train loss 0.94696. lr 4.261693e-04:  72%|███████▏  | 1052/1455 [07:01<04:06,  1.64it/s][A
epoch 1 iter 1052: train loss 0.94696. lr 4.261693e-04:  72%|███████▏  | 1053/1455 [07:01<03:54,  1.71it/s][A

epoch 1 iter 1085: train loss 0.90958. lr 4.163938e-04:  75%|███████▍  | 1085/1455 [07:17<02:37,  2.35it/s][A
epoch 1 iter 1085: train loss 0.90958. lr 4.163938e-04:  75%|███████▍  | 1086/1455 [07:17<02:34,  2.38it/s][A
epoch 1 iter 1086: train loss 0.90050. lr 4.160952e-04:  75%|███████▍  | 1086/1455 [07:18<02:34,  2.38it/s][A
epoch 1 iter 1086: train loss 0.90050. lr 4.160952e-04:  75%|███████▍  | 1087/1455 [07:18<02:32,  2.41it/s][A
epoch 1 iter 1087: train loss 0.90339. lr 4.157964e-04:  75%|███████▍  | 1087/1455 [07:18<02:32,  2.41it/s][A
epoch 1 iter 1087: train loss 0.90339. lr 4.157964e-04:  75%|███████▍  | 1088/1455 [07:18<02:31,  2.42it/s][A
epoch 1 iter 1088: train loss 0.89758. lr 4.154976e-04:  75%|███████▍  | 1088/1455 [07:19<02:31,  2.42it/s][A
epoch 1 iter 1088: train loss 0.89758. lr 4.154976e-04:  75%|███████▍  | 1089/1455 [07:19<02:31,  2.42it/s][A
epoch 1 iter 1089: train loss 0.89726. lr 4.151986e-04:  75%|███████▍  | 1089/1455 [07:19<02:31,  2.42it/s][A

epoch 1 iter 1121: train loss 0.85241. lr 4.055615e-04:  77%|███████▋  | 1122/1455 [07:39<02:41,  2.07it/s][A
epoch 1 iter 1122: train loss 0.85906. lr 4.052582e-04:  77%|███████▋  | 1122/1455 [07:40<02:41,  2.07it/s][A
epoch 1 iter 1122: train loss 0.85906. lr 4.052582e-04:  77%|███████▋  | 1123/1455 [07:40<02:37,  2.11it/s][A
epoch 1 iter 1123: train loss 0.84862. lr 4.049548e-04:  77%|███████▋  | 1123/1455 [07:40<02:37,  2.11it/s][A
epoch 1 iter 1123: train loss 0.84862. lr 4.049548e-04:  77%|███████▋  | 1124/1455 [07:40<02:35,  2.14it/s][A
epoch 1 iter 1124: train loss 0.86379. lr 4.046513e-04:  77%|███████▋  | 1124/1455 [07:41<02:35,  2.14it/s][A
epoch 1 iter 1124: train loss 0.86379. lr 4.046513e-04:  77%|███████▋  | 1125/1455 [07:41<02:36,  2.11it/s][A
epoch 1 iter 1125: train loss 0.84858. lr 4.043477e-04:  77%|███████▋  | 1125/1455 [07:41<02:36,  2.11it/s][A
epoch 1 iter 1125: train loss 0.84858. lr 4.043477e-04:  77%|███████▋  | 1126/1455 [07:41<02:34,  2.13it/s][A

epoch 1 iter 1158: train loss 0.81724. lr 3.942622e-04:  80%|███████▉  | 1158/1455 [07:59<02:24,  2.06it/s][A
epoch 1 iter 1158: train loss 0.81724. lr 3.942622e-04:  80%|███████▉  | 1159/1455 [07:59<02:21,  2.09it/s][A
epoch 1 iter 1159: train loss 0.82159. lr 3.939547e-04:  80%|███████▉  | 1159/1455 [07:59<02:21,  2.09it/s][A
epoch 1 iter 1159: train loss 0.82159. lr 3.939547e-04:  80%|███████▉  | 1160/1455 [07:59<02:36,  1.89it/s][A
epoch 1 iter 1160: train loss 0.81617. lr 3.936470e-04:  80%|███████▉  | 1160/1455 [08:00<02:36,  1.89it/s][A
epoch 1 iter 1160: train loss 0.81617. lr 3.936470e-04:  80%|███████▉  | 1161/1455 [08:00<02:37,  1.86it/s][A
epoch 1 iter 1161: train loss 0.80361. lr 3.933393e-04:  80%|███████▉  | 1161/1455 [08:00<02:37,  1.86it/s][A
epoch 1 iter 1161: train loss 0.80361. lr 3.933393e-04:  80%|███████▉  | 1162/1455 [08:00<02:38,  1.85it/s][A
epoch 1 iter 1162: train loss 0.81115. lr 3.930314e-04:  80%|███████▉  | 1162/1455 [08:01<02:38,  1.85it/s][A

epoch 1 iter 1194: train loss 0.76976. lr 3.831239e-04:  82%|████████▏ | 1195/1455 [08:17<02:07,  2.04it/s][A
epoch 1 iter 1195: train loss 0.77309. lr 3.828126e-04:  82%|████████▏ | 1195/1455 [08:17<02:07,  2.04it/s][A
epoch 1 iter 1195: train loss 0.77309. lr 3.828126e-04:  82%|████████▏ | 1196/1455 [08:17<02:04,  2.07it/s][A
epoch 1 iter 1196: train loss 0.77949. lr 3.825013e-04:  82%|████████▏ | 1196/1455 [08:18<02:04,  2.07it/s][A
epoch 1 iter 1196: train loss 0.77949. lr 3.825013e-04:  82%|████████▏ | 1197/1455 [08:18<02:03,  2.09it/s][A
epoch 1 iter 1197: train loss 0.77453. lr 3.821898e-04:  82%|████████▏ | 1197/1455 [08:18<02:03,  2.09it/s][A
epoch 1 iter 1197: train loss 0.77453. lr 3.821898e-04:  82%|████████▏ | 1198/1455 [08:18<01:59,  2.15it/s][A
epoch 1 iter 1198: train loss 0.76604. lr 3.818782e-04:  82%|████████▏ | 1198/1455 [08:19<01:59,  2.15it/s][A
epoch 1 iter 1198: train loss 0.76604. lr 3.818782e-04:  82%|████████▏ | 1199/1455 [08:19<01:59,  2.14it/s][A

epoch 1 iter 1231: train loss 0.73481. lr 3.715455e-04:  85%|████████▍ | 1231/1455 [08:39<02:23,  1.56it/s][A
epoch 1 iter 1231: train loss 0.73481. lr 3.715455e-04:  85%|████████▍ | 1232/1455 [08:39<02:17,  1.62it/s][A
epoch 1 iter 1232: train loss 0.73174. lr 3.712309e-04:  85%|████████▍ | 1232/1455 [08:40<02:17,  1.62it/s][A
epoch 1 iter 1232: train loss 0.73174. lr 3.712309e-04:  85%|████████▍ | 1233/1455 [08:40<02:13,  1.67it/s][A
epoch 1 iter 1233: train loss 0.73895. lr 3.709162e-04:  85%|████████▍ | 1233/1455 [08:40<02:13,  1.67it/s][A
epoch 1 iter 1233: train loss 0.73895. lr 3.709162e-04:  85%|████████▍ | 1234/1455 [08:40<02:07,  1.74it/s][A
epoch 1 iter 1234: train loss 0.73026. lr 3.706014e-04:  85%|████████▍ | 1234/1455 [08:41<02:07,  1.74it/s][A
epoch 1 iter 1234: train loss 0.73026. lr 3.706014e-04:  85%|████████▍ | 1235/1455 [08:41<02:02,  1.79it/s][A
epoch 1 iter 1235: train loss 0.73101. lr 3.702866e-04:  85%|████████▍ | 1235/1455 [08:41<02:02,  1.79it/s][A

epoch 1 iter 1267: train loss 0.69246. lr 3.601703e-04:  87%|████████▋ | 1268/1455 [08:58<01:45,  1.77it/s][A
epoch 1 iter 1268: train loss 0.70105. lr 3.598529e-04:  87%|████████▋ | 1268/1455 [08:58<01:45,  1.77it/s][A
epoch 1 iter 1268: train loss 0.70105. lr 3.598529e-04:  87%|████████▋ | 1269/1455 [08:58<01:41,  1.84it/s][A
epoch 1 iter 1269: train loss 0.69412. lr 3.595355e-04:  87%|████████▋ | 1269/1455 [08:59<01:41,  1.84it/s][A
epoch 1 iter 1269: train loss 0.69412. lr 3.595355e-04:  87%|████████▋ | 1270/1455 [08:59<01:38,  1.88it/s][A
epoch 1 iter 1270: train loss 0.68820. lr 3.592180e-04:  87%|████████▋ | 1270/1455 [08:59<01:38,  1.88it/s][A
epoch 1 iter 1270: train loss 0.68820. lr 3.592180e-04:  87%|████████▋ | 1271/1455 [08:59<01:35,  1.92it/s][A
epoch 1 iter 1271: train loss 0.69608. lr 3.589004e-04:  87%|████████▋ | 1271/1455 [09:00<01:35,  1.92it/s][A
epoch 1 iter 1271: train loss 0.69608. lr 3.589004e-04:  87%|████████▋ | 1272/1455 [09:00<01:33,  1.95it/s][A

epoch 1 iter 1304: train loss 0.64896. lr 3.483845e-04:  90%|████████▉ | 1304/1455 [09:15<01:07,  2.24it/s][A
epoch 1 iter 1304: train loss 0.64896. lr 3.483845e-04:  90%|████████▉ | 1305/1455 [09:15<01:06,  2.27it/s][A
epoch 1 iter 1305: train loss 0.65458. lr 3.480648e-04:  90%|████████▉ | 1305/1455 [09:16<01:06,  2.27it/s][A
epoch 1 iter 1305: train loss 0.65458. lr 3.480648e-04:  90%|████████▉ | 1306/1455 [09:16<01:04,  2.30it/s][A
epoch 1 iter 1306: train loss 0.65691. lr 3.477451e-04:  90%|████████▉ | 1306/1455 [09:16<01:04,  2.30it/s][A
epoch 1 iter 1306: train loss 0.65691. lr 3.477451e-04:  90%|████████▉ | 1307/1455 [09:16<01:02,  2.35it/s][A
epoch 1 iter 1307: train loss 0.65691. lr 3.474253e-04:  90%|████████▉ | 1307/1455 [09:17<01:02,  2.35it/s][A
epoch 1 iter 1307: train loss 0.65691. lr 3.474253e-04:  90%|████████▉ | 1308/1455 [09:17<01:01,  2.39it/s][A
epoch 1 iter 1308: train loss 0.65103. lr 3.471054e-04:  90%|████████▉ | 1308/1455 [09:17<01:01,  2.39it/s][A

epoch 1 iter 1340: train loss 0.62490. lr 3.368430e-04:  92%|█████████▏| 1341/1455 [09:35<00:53,  2.12it/s][A
epoch 1 iter 1341: train loss 0.62883. lr 3.365216e-04:  92%|█████████▏| 1341/1455 [09:36<00:53,  2.12it/s][A
epoch 1 iter 1341: train loss 0.62883. lr 3.365216e-04:  92%|█████████▏| 1342/1455 [09:36<00:53,  2.13it/s][A
epoch 1 iter 1342: train loss 0.62454. lr 3.362000e-04:  92%|█████████▏| 1342/1455 [09:36<00:53,  2.13it/s][A
epoch 1 iter 1342: train loss 0.62454. lr 3.362000e-04:  92%|█████████▏| 1343/1455 [09:36<00:52,  2.13it/s][A
epoch 1 iter 1343: train loss 0.62156. lr 3.358785e-04:  92%|█████████▏| 1343/1455 [09:36<00:52,  2.13it/s][A
epoch 1 iter 1343: train loss 0.62156. lr 3.358785e-04:  92%|█████████▏| 1344/1455 [09:36<00:51,  2.17it/s][A
epoch 1 iter 1344: train loss 0.62797. lr 3.355569e-04:  92%|█████████▏| 1344/1455 [09:37<00:51,  2.17it/s][A
epoch 1 iter 1344: train loss 0.62797. lr 3.355569e-04:  92%|█████████▏| 1345/1455 [09:37<00:50,  2.19it/s][A

epoch 1 iter 1377: train loss 0.59414. lr 3.249231e-04:  95%|█████████▍| 1377/1455 [09:54<00:45,  1.71it/s][A
epoch 1 iter 1377: train loss 0.59414. lr 3.249231e-04:  95%|█████████▍| 1378/1455 [09:54<00:42,  1.81it/s][A
epoch 1 iter 1378: train loss 0.58554. lr 3.246003e-04:  95%|█████████▍| 1378/1455 [09:54<00:42,  1.81it/s][A
epoch 1 iter 1378: train loss 0.58554. lr 3.246003e-04:  95%|█████████▍| 1379/1455 [09:54<00:40,  1.89it/s][A
epoch 1 iter 1379: train loss 0.59605. lr 3.242775e-04:  95%|█████████▍| 1379/1455 [09:55<00:40,  1.89it/s][A
epoch 1 iter 1379: train loss 0.59605. lr 3.242775e-04:  95%|█████████▍| 1380/1455 [09:55<00:38,  1.97it/s][A
epoch 1 iter 1380: train loss 0.58643. lr 3.239546e-04:  95%|█████████▍| 1380/1455 [09:55<00:38,  1.97it/s][A
epoch 1 iter 1380: train loss 0.58643. lr 3.239546e-04:  95%|█████████▍| 1381/1455 [09:55<00:36,  2.03it/s][A
epoch 1 iter 1381: train loss 0.59286. lr 3.236318e-04:  95%|█████████▍| 1381/1455 [09:56<00:36,  2.03it/s][A

epoch 1 iter 1413: train loss 0.56855. lr 3.132870e-04:  97%|█████████▋| 1414/1455 [10:09<00:16,  2.44it/s][A
epoch 1 iter 1414: train loss 0.56079. lr 3.129634e-04:  97%|█████████▋| 1414/1455 [10:09<00:16,  2.44it/s][A
epoch 1 iter 1414: train loss 0.56079. lr 3.129634e-04:  97%|█████████▋| 1415/1455 [10:09<00:16,  2.44it/s][A
epoch 1 iter 1415: train loss 0.56025. lr 3.126398e-04:  97%|█████████▋| 1415/1455 [10:10<00:16,  2.44it/s][A
epoch 1 iter 1415: train loss 0.56025. lr 3.126398e-04:  97%|█████████▋| 1416/1455 [10:10<00:15,  2.44it/s][A
epoch 1 iter 1416: train loss 0.56210. lr 3.123162e-04:  97%|█████████▋| 1416/1455 [10:10<00:15,  2.44it/s][A
epoch 1 iter 1416: train loss 0.56210. lr 3.123162e-04:  97%|█████████▋| 1417/1455 [10:10<00:15,  2.44it/s][A
epoch 1 iter 1417: train loss 0.55667. lr 3.119926e-04:  97%|█████████▋| 1417/1455 [10:10<00:15,  2.44it/s][A
epoch 1 iter 1417: train loss 0.55667. lr 3.119926e-04:  97%|█████████▋| 1418/1455 [10:10<00:15,  2.45it/s][A

epoch 1 iter 1450: train loss 0.52905. lr 3.013070e-04: 100%|█████████▉| 1450/1455 [10:33<00:02,  2.36it/s][A
epoch 1 iter 1450: train loss 0.52905. lr 3.013070e-04: 100%|█████████▉| 1451/1455 [10:33<00:01,  2.39it/s][A
epoch 1 iter 1451: train loss 0.52888. lr 3.009831e-04: 100%|█████████▉| 1451/1455 [10:34<00:01,  2.39it/s][A
epoch 1 iter 1451: train loss 0.52888. lr 3.009831e-04: 100%|█████████▉| 1452/1455 [10:34<00:01,  2.43it/s][A
epoch 1 iter 1452: train loss 0.52274. lr 3.006592e-04: 100%|█████████▉| 1452/1455 [10:34<00:01,  2.43it/s][A
epoch 1 iter 1452: train loss 0.52274. lr 3.006592e-04: 100%|█████████▉| 1453/1455 [10:34<00:00,  2.46it/s][A
epoch 1 iter 1453: train loss 0.53663. lr 3.003353e-04: 100%|█████████▉| 1453/1455 [10:35<00:00,  2.46it/s][A
epoch 1 iter 1453: train loss 0.53663. lr 3.003353e-04: 100%|█████████▉| 1454/1455 [10:35<00:00,  2.49it/s][A
epoch 1 iter 1454: train loss 0.52595. lr 3.000253e-04: 100%|█████████▉| 1454/1455 [10:35<00:00,  2.49it/s][A

epoch 2 iter 33: train loss 0.50714. lr 2.890151e-04:   2%|▏         | 33/1455 [00:17<14:55,  1.59it/s][A
epoch 2 iter 33: train loss 0.50714. lr 2.890151e-04:   2%|▏         | 34/1455 [00:17<14:47,  1.60it/s][A
epoch 2 iter 34: train loss 0.49767. lr 2.886914e-04:   2%|▏         | 34/1455 [00:18<14:47,  1.60it/s][A
epoch 2 iter 34: train loss 0.49767. lr 2.886914e-04:   2%|▏         | 35/1455 [00:18<14:38,  1.62it/s][A
epoch 2 iter 35: train loss 0.49911. lr 2.883677e-04:   2%|▏         | 35/1455 [00:18<14:38,  1.62it/s][A
epoch 2 iter 35: train loss 0.49911. lr 2.883677e-04:   2%|▏         | 36/1455 [00:18<14:32,  1.63it/s][A
epoch 2 iter 36: train loss 0.49260. lr 2.880441e-04:   2%|▏         | 36/1455 [00:19<14:32,  1.63it/s][A
epoch 2 iter 36: train loss 0.49260. lr 2.880441e-04:   3%|▎         | 37/1455 [00:19<14:28,  1.63it/s][A
epoch 2 iter 37: train loss 0.49094. lr 2.877204e-04:   3%|▎         | 37/1455 [00:20<14:28,  1.63it/s][A

epoch 2 iter 71: train loss 0.47163. lr 2.767277e-04:   5%|▍         | 71/1455 [00:37<12:37,  1.83it/s][A
epoch 2 iter 71: train loss 0.47163. lr 2.767277e-04:   5%|▍         | 72/1455 [00:37<12:17,  1.87it/s][A
epoch 2 iter 72: train loss 0.47045. lr 2.764048e-04:   5%|▍         | 72/1455 [00:38<12:17,  1.87it/s][A
epoch 2 iter 72: train loss 0.47045. lr 2.764048e-04:   5%|▌         | 73/1455 [00:38<12:03,  1.91it/s][A
epoch 2 iter 73: train loss 0.46547. lr 2.760819e-04:   5%|▌         | 73/1455 [00:38<12:03,  1.91it/s][A
epoch 2 iter 73: train loss 0.46547. lr 2.760819e-04:   5%|▌         | 74/1455 [00:38<11:52,  1.94it/s][A
epoch 2 iter 74: train loss 0.47477. lr 2.757591e-04:   5%|▌         | 74/1455 [00:39<11:52,  1.94it/s][A
epoch 2 iter 74: train loss 0.47477. lr 2.757591e-04:   5%|▌         | 75/1455 [00:39<11:44,  1.96it/s][A
epoch 2 iter 75: train loss 0.46306. lr 2.754362e-04:   5%|▌         | 75/1455 [00:39<11:44,  1.96it/s][A

epoch 2 iter 109: train loss 0.44504. lr 2.644796e-04:   7%|▋         | 109/1455 [00:55<10:26,  2.15it/s][A
epoch 2 iter 109: train loss 0.44504. lr 2.644796e-04:   8%|▊         | 110/1455 [00:55<10:21,  2.16it/s][A
epoch 2 iter 110: train loss 0.45394. lr 2.641579e-04:   8%|▊         | 110/1455 [00:55<10:21,  2.16it/s][A
epoch 2 iter 110: train loss 0.45394. lr 2.641579e-04:   8%|▊         | 111/1455 [00:55<10:17,  2.18it/s][A
epoch 2 iter 111: train loss 0.44733. lr 2.638364e-04:   8%|▊         | 111/1455 [00:56<10:17,  2.18it/s][A
epoch 2 iter 111: train loss 0.44733. lr 2.638364e-04:   8%|▊         | 112/1455 [00:56<10:09,  2.20it/s][A
epoch 2 iter 112: train loss 0.44254. lr 2.635149e-04:   8%|▊         | 112/1455 [00:56<10:09,  2.20it/s][A
epoch 2 iter 112: train loss 0.44254. lr 2.635149e-04:   8%|▊         | 113/1455 [00:56<10:18,  2.17it/s][A
epoch 2 iter 113: train loss 0.44132. lr 2.631934e-04:   8%|▊         | 113/1455 [00:56<10:18,  2.17it/s][A

epoch 2 iter 146: train loss 0.42594. lr 2.526110e-04:  10%|█         | 147/1455 [01:13<09:19,  2.34it/s][A
epoch 2 iter 147: train loss 0.43137. lr 2.522912e-04:  10%|█         | 147/1455 [01:13<09:19,  2.34it/s][A
epoch 2 iter 147: train loss 0.43137. lr 2.522912e-04:  10%|█         | 148/1455 [01:13<09:10,  2.38it/s][A
epoch 2 iter 148: train loss 0.43082. lr 2.519714e-04:  10%|█         | 148/1455 [01:14<09:10,  2.38it/s][A
epoch 2 iter 148: train loss 0.43082. lr 2.519714e-04:  10%|█         | 149/1455 [01:14<09:03,  2.40it/s][A
epoch 2 iter 149: train loss 0.43084. lr 2.516517e-04:  10%|█         | 149/1455 [01:14<09:03,  2.40it/s][A
epoch 2 iter 149: train loss 0.43084. lr 2.516517e-04:  10%|█         | 150/1455 [01:14<08:58,  2.42it/s][A
epoch 2 iter 150: train loss 0.42854. lr 2.513321e-04:  10%|█         | 150/1455 [01:14<08:58,  2.42it/s][A
epoch 2 iter 150: train loss 0.42854. lr 2.513321e-04:  10%|█         | 151/1455 [01:14<09:00,  2.41it/s][A

epoch 2 iter 184: train loss 0.40768. lr 2.405005e-04:  13%|█▎        | 184/1455 [01:28<08:41,  2.44it/s][A
epoch 2 iter 184: train loss 0.40768. lr 2.405005e-04:  13%|█▎        | 185/1455 [01:28<08:43,  2.43it/s][A
epoch 2 iter 185: train loss 0.40666. lr 2.401831e-04:  13%|█▎        | 185/1455 [01:29<08:43,  2.43it/s][A
epoch 2 iter 185: train loss 0.40666. lr 2.401831e-04:  13%|█▎        | 186/1455 [01:29<08:44,  2.42it/s][A
epoch 2 iter 186: train loss 0.41046. lr 2.398657e-04:  13%|█▎        | 186/1455 [01:31<08:44,  2.42it/s][A
epoch 2 iter 186: train loss 0.41046. lr 2.398657e-04:  13%|█▎        | 187/1455 [01:31<17:29,  1.21it/s][A
epoch 2 iter 187: train loss 0.40799. lr 2.395484e-04:  13%|█▎        | 187/1455 [01:33<17:29,  1.21it/s][A
epoch 2 iter 187: train loss 0.40799. lr 2.395484e-04:  13%|█▎        | 188/1455 [01:33<26:34,  1.26s/it][A
epoch 2 iter 188: train loss 0.41113. lr 2.392312e-04:  13%|█▎        | 188/1455 [01:35<26:34,  1.26s/it][A

epoch 2 iter 221: train loss 0.39941. lr 2.288048e-04:  15%|█▌        | 222/1455 [01:53<08:39,  2.38it/s][A
epoch 2 iter 222: train loss 0.39120. lr 2.284902e-04:  15%|█▌        | 222/1455 [01:54<08:39,  2.38it/s][A
epoch 2 iter 222: train loss 0.39120. lr 2.284902e-04:  15%|█▌        | 223/1455 [01:54<08:37,  2.38it/s][A
epoch 2 iter 223: train loss 0.39278. lr 2.281756e-04:  15%|█▌        | 223/1455 [01:54<08:37,  2.38it/s][A
epoch 2 iter 223: train loss 0.39278. lr 2.281756e-04:  15%|█▌        | 224/1455 [01:54<09:19,  2.20it/s][A
epoch 2 iter 224: train loss 0.38444. lr 2.278612e-04:  15%|█▌        | 224/1455 [01:55<09:19,  2.20it/s][A
epoch 2 iter 224: train loss 0.38444. lr 2.278612e-04:  15%|█▌        | 225/1455 [01:55<09:01,  2.27it/s][A
epoch 2 iter 225: train loss 0.39046. lr 2.275468e-04:  15%|█▌        | 225/1455 [01:55<09:01,  2.27it/s][A
epoch 2 iter 225: train loss 0.39046. lr 2.275468e-04:  16%|█▌        | 226/1455 [01:55<08:46,  2.34it/s][A

epoch 2 iter 259: train loss 0.37610. lr 2.169114e-04:  18%|█▊        | 259/1455 [02:14<10:44,  1.86it/s][A
epoch 2 iter 259: train loss 0.37610. lr 2.169114e-04:  18%|█▊        | 260/1455 [02:14<10:42,  1.86it/s][A
epoch 2 iter 260: train loss 0.37780. lr 2.166002e-04:  18%|█▊        | 260/1455 [02:15<10:42,  1.86it/s][A
epoch 2 iter 260: train loss 0.37780. lr 2.166002e-04:  18%|█▊        | 261/1455 [02:15<10:30,  1.89it/s][A
epoch 2 iter 261: train loss 0.36744. lr 2.162891e-04:  18%|█▊        | 261/1455 [02:15<10:30,  1.89it/s][A
epoch 2 iter 261: train loss 0.36744. lr 2.162891e-04:  18%|█▊        | 262/1455 [02:15<10:21,  1.92it/s][A
epoch 2 iter 262: train loss 0.37385. lr 2.159781e-04:  18%|█▊        | 262/1455 [02:16<10:21,  1.92it/s][A
epoch 2 iter 262: train loss 0.37385. lr 2.159781e-04:  18%|█▊        | 263/1455 [02:16<10:14,  1.94it/s][A
epoch 2 iter 263: train loss 0.37001. lr 2.156672e-04:  18%|█▊        | 263/1455 [02:16<10:14,  1.94it/s][A

epoch 2 iter 296: train loss 0.35786. lr 2.054651e-04:  20%|██        | 297/1455 [02:32<08:56,  2.16it/s][A
epoch 2 iter 297: train loss 0.36371. lr 2.051578e-04:  20%|██        | 297/1455 [02:33<08:56,  2.16it/s][A
epoch 2 iter 297: train loss 0.36371. lr 2.051578e-04:  20%|██        | 298/1455 [02:33<08:53,  2.17it/s][A
epoch 2 iter 298: train loss 0.35790. lr 2.048506e-04:  20%|██        | 298/1455 [02:33<08:53,  2.17it/s][A
epoch 2 iter 298: train loss 0.35790. lr 2.048506e-04:  21%|██        | 299/1455 [02:33<08:51,  2.18it/s][A
epoch 2 iter 299: train loss 0.36323. lr 2.045434e-04:  21%|██        | 299/1455 [02:33<08:51,  2.18it/s][A
epoch 2 iter 299: train loss 0.36323. lr 2.045434e-04:  21%|██        | 300/1455 [02:33<08:49,  2.18it/s][A
epoch 2 iter 300: train loss 0.36091. lr 2.042364e-04:  21%|██        | 300/1455 [02:34<08:49,  2.18it/s][A
epoch 2 iter 300: train loss 0.36091. lr 2.042364e-04:  21%|██        | 301/1455 [02:34<08:48,  2.18it/s][A

epoch 2 iter 334: train loss 0.35089. lr 1.938667e-04:  23%|██▎       | 334/1455 [02:50<08:41,  2.15it/s][A
epoch 2 iter 334: train loss 0.35089. lr 1.938667e-04:  23%|██▎       | 335/1455 [02:50<08:30,  2.19it/s][A
epoch 2 iter 335: train loss 0.34048. lr 1.935638e-04:  23%|██▎       | 335/1455 [02:50<08:30,  2.19it/s][A
epoch 2 iter 335: train loss 0.34048. lr 1.935638e-04:  23%|██▎       | 336/1455 [02:50<08:23,  2.22it/s][A
epoch 2 iter 336: train loss 0.34618. lr 1.932611e-04:  23%|██▎       | 336/1455 [02:51<08:23,  2.22it/s][A
epoch 2 iter 336: train loss 0.34618. lr 1.932611e-04:  23%|██▎       | 337/1455 [02:51<08:18,  2.24it/s][A
epoch 2 iter 337: train loss 0.34779. lr 1.929584e-04:  23%|██▎       | 337/1455 [02:51<08:18,  2.24it/s][A
epoch 2 iter 337: train loss 0.34779. lr 1.929584e-04:  23%|██▎       | 338/1455 [02:51<08:13,  2.26it/s][A
epoch 2 iter 338: train loss 0.34551. lr 1.926559e-04:  23%|██▎       | 338/1455 [02:52<08:13,  2.26it/s][A

epoch 2 iter 371: train loss 0.33344. lr 1.827450e-04:  26%|██▌       | 372/1455 [03:06<07:22,  2.45it/s][A
epoch 2 iter 372: train loss 0.33230. lr 1.824470e-04:  26%|██▌       | 372/1455 [03:06<07:22,  2.45it/s][A
epoch 2 iter 372: train loss 0.33230. lr 1.824470e-04:  26%|██▌       | 373/1455 [03:06<07:17,  2.47it/s][A
epoch 2 iter 373: train loss 0.33372. lr 1.821490e-04:  26%|██▌       | 373/1455 [03:07<07:17,  2.47it/s][A
epoch 2 iter 373: train loss 0.33372. lr 1.821490e-04:  26%|██▌       | 374/1455 [03:07<07:16,  2.47it/s][A
epoch 2 iter 374: train loss 0.33558. lr 1.818512e-04:  26%|██▌       | 374/1455 [03:07<07:16,  2.47it/s][A
epoch 2 iter 374: train loss 0.33558. lr 1.818512e-04:  26%|██▌       | 375/1455 [03:07<07:17,  2.47it/s][A
epoch 2 iter 375: train loss 0.33461. lr 1.815536e-04:  26%|██▌       | 375/1455 [03:08<07:17,  2.47it/s][A
epoch 2 iter 375: train loss 0.33461. lr 1.815536e-04:  26%|██▌       | 376/1455 [03:08<08:43,  2.06it/s][A

epoch 2 iter 409: train loss 0.32250. lr 1.715177e-04:  28%|██▊       | 409/1455 [03:29<07:36,  2.29it/s][A
epoch 2 iter 409: train loss 0.32250. lr 1.715177e-04:  28%|██▊       | 410/1455 [03:29<07:19,  2.38it/s][A
epoch 2 iter 410: train loss 0.32050. lr 1.712250e-04:  28%|██▊       | 410/1455 [03:29<07:19,  2.38it/s][A
epoch 2 iter 410: train loss 0.32050. lr 1.712250e-04:  28%|██▊       | 411/1455 [03:29<07:23,  2.36it/s][A
epoch 2 iter 411: train loss 0.32036. lr 1.709326e-04:  28%|██▊       | 411/1455 [03:30<07:23,  2.36it/s][A
epoch 2 iter 411: train loss 0.32036. lr 1.709326e-04:  28%|██▊       | 412/1455 [03:30<07:56,  2.19it/s][A
epoch 2 iter 412: train loss 0.32156. lr 1.706403e-04:  28%|██▊       | 412/1455 [03:30<07:56,  2.19it/s][A
epoch 2 iter 412: train loss 0.32156. lr 1.706403e-04:  28%|██▊       | 413/1455 [03:30<08:03,  2.16it/s][A
epoch 2 iter 413: train loss 0.32440. lr 1.703481e-04:  28%|██▊       | 413/1455 [03:31<08:03,  2.16it/s][A

epoch 2 iter 446: train loss 0.31380. lr 1.607934e-04:  31%|███       | 447/1455 [03:48<08:19,  2.02it/s][A
epoch 2 iter 447: train loss 0.31379. lr 1.605065e-04:  31%|███       | 447/1455 [03:48<08:19,  2.02it/s][A
epoch 2 iter 447: train loss 0.31379. lr 1.605065e-04:  31%|███       | 448/1455 [03:48<08:10,  2.05it/s][A
epoch 2 iter 448: train loss 0.31289. lr 1.602199e-04:  31%|███       | 448/1455 [03:49<08:10,  2.05it/s][A
epoch 2 iter 448: train loss 0.31289. lr 1.602199e-04:  31%|███       | 449/1455 [03:49<08:04,  2.08it/s][A
epoch 2 iter 449: train loss 0.31352. lr 1.599333e-04:  31%|███       | 449/1455 [03:49<08:04,  2.08it/s][A
epoch 2 iter 449: train loss 0.31352. lr 1.599333e-04:  31%|███       | 450/1455 [03:49<08:00,  2.09it/s][A
epoch 2 iter 450: train loss 0.31071. lr 1.596470e-04:  31%|███       | 450/1455 [03:50<08:00,  2.09it/s][A
epoch 2 iter 450: train loss 0.31071. lr 1.596470e-04:  31%|███       | 451/1455 [03:50<07:52,  2.13it/s][A

epoch 2 iter 484: train loss 0.30833. lr 1.500106e-04:  33%|███▎      | 484/1455 [04:05<06:37,  2.44it/s][A
epoch 2 iter 484: train loss 0.30833. lr 1.500106e-04:  33%|███▎      | 485/1455 [04:05<06:46,  2.38it/s][A
epoch 2 iter 485: train loss 0.30655. lr 1.497302e-04:  33%|███▎      | 485/1455 [04:06<06:46,  2.38it/s][A
epoch 2 iter 485: train loss 0.30655. lr 1.497302e-04:  33%|███▎      | 486/1455 [04:06<06:40,  2.42it/s][A
epoch 2 iter 486: train loss 0.30277. lr 1.494499e-04:  33%|███▎      | 486/1455 [04:06<06:40,  2.42it/s][A
epoch 2 iter 486: train loss 0.30277. lr 1.494499e-04:  33%|███▎      | 487/1455 [04:06<06:36,  2.44it/s][A
epoch 2 iter 487: train loss 0.30254. lr 1.491698e-04:  33%|███▎      | 487/1455 [04:06<06:36,  2.44it/s][A
epoch 2 iter 487: train loss 0.30254. lr 1.491698e-04:  34%|███▎      | 488/1455 [04:06<06:33,  2.46it/s][A
epoch 2 iter 488: train loss 0.30399. lr 1.488899e-04:  34%|███▎      | 488/1455 [04:07<06:33,  2.46it/s][A

epoch 2 iter 521: train loss 0.29818. lr 1.397540e-04:  36%|███▌      | 522/1455 [04:24<06:36,  2.35it/s][A
epoch 2 iter 522: train loss 0.29337. lr 1.394803e-04:  36%|███▌      | 522/1455 [04:24<06:36,  2.35it/s][A
epoch 2 iter 522: train loss 0.29337. lr 1.394803e-04:  36%|███▌      | 523/1455 [04:24<06:23,  2.43it/s][A
epoch 2 iter 523: train loss 0.29141. lr 1.392067e-04:  36%|███▌      | 523/1455 [04:25<06:23,  2.43it/s][A
epoch 2 iter 523: train loss 0.29141. lr 1.392067e-04:  36%|███▌      | 524/1455 [04:25<06:16,  2.47it/s][A
epoch 2 iter 524: train loss 0.29773. lr 1.389334e-04:  36%|███▌      | 524/1455 [04:25<06:16,  2.47it/s][A
epoch 2 iter 524: train loss 0.29773. lr 1.389334e-04:  36%|███▌      | 525/1455 [04:25<06:19,  2.45it/s][A
epoch 2 iter 525: train loss 0.29181. lr 1.386602e-04:  36%|███▌      | 525/1455 [04:26<06:19,  2.45it/s][A
epoch 2 iter 525: train loss 0.29181. lr 1.386602e-04:  36%|███▌      | 526/1455 [04:26<07:17,  2.12it/s][A

epoch 2 iter 559: train loss 0.29342. lr 1.294865e-04:  38%|███▊      | 559/1455 [04:44<06:48,  2.19it/s][A
epoch 2 iter 559: train loss 0.29342. lr 1.294865e-04:  38%|███▊      | 560/1455 [04:44<06:44,  2.21it/s][A
epoch 2 iter 560: train loss 0.28405. lr 1.292201e-04:  38%|███▊      | 560/1455 [04:44<06:44,  2.21it/s][A
epoch 2 iter 560: train loss 0.28405. lr 1.292201e-04:  39%|███▊      | 561/1455 [04:44<06:42,  2.22it/s][A
epoch 2 iter 561: train loss 0.28465. lr 1.289539e-04:  39%|███▊      | 561/1455 [04:45<06:42,  2.22it/s][A
epoch 2 iter 561: train loss 0.28465. lr 1.289539e-04:  39%|███▊      | 562/1455 [04:45<06:37,  2.24it/s][A
epoch 2 iter 562: train loss 0.28654. lr 1.286879e-04:  39%|███▊      | 562/1455 [04:45<06:37,  2.24it/s][A
epoch 2 iter 562: train loss 0.28654. lr 1.286879e-04:  39%|███▊      | 563/1455 [04:45<06:34,  2.26it/s][A
epoch 2 iter 563: train loss 0.28793. lr 1.284221e-04:  39%|███▊      | 563/1455 [04:45<06:34,  2.26it/s][A

epoch 2 iter 596: train loss 0.27641. lr 1.197648e-04:  41%|████      | 597/1455 [04:59<05:52,  2.44it/s][A
epoch 2 iter 597: train loss 0.27959. lr 1.195059e-04:  41%|████      | 597/1455 [05:00<05:52,  2.44it/s][A
epoch 2 iter 597: train loss 0.27959. lr 1.195059e-04:  41%|████      | 598/1455 [05:00<05:49,  2.45it/s][A
epoch 2 iter 598: train loss 0.27973. lr 1.192473e-04:  41%|████      | 598/1455 [05:00<05:49,  2.45it/s][A
epoch 2 iter 598: train loss 0.27973. lr 1.192473e-04:  41%|████      | 599/1455 [05:00<05:44,  2.48it/s][A
epoch 2 iter 599: train loss 0.27621. lr 1.189889e-04:  41%|████      | 599/1455 [05:01<05:44,  2.48it/s][A
epoch 2 iter 599: train loss 0.27621. lr 1.189889e-04:  41%|████      | 600/1455 [05:01<05:49,  2.45it/s][A
epoch 2 iter 600: train loss 0.28066. lr 1.187307e-04:  41%|████      | 600/1455 [05:01<05:49,  2.45it/s][A
epoch 2 iter 600: train loss 0.28066. lr 1.187307e-04:  41%|████▏     | 601/1455 [05:01<05:47,  2.46it/s][A

epoch 2 iter 634: train loss 0.27275. lr 1.100798e-04:  44%|████▎     | 634/1455 [05:23<07:16,  1.88it/s][A
epoch 2 iter 634: train loss 0.27275. lr 1.100798e-04:  44%|████▎     | 635/1455 [05:23<07:06,  1.92it/s][A
epoch 2 iter 635: train loss 0.27493. lr 1.098292e-04:  44%|████▎     | 635/1455 [05:24<07:06,  1.92it/s][A
epoch 2 iter 635: train loss 0.27493. lr 1.098292e-04:  44%|████▎     | 636/1455 [05:24<06:59,  1.95it/s][A
epoch 2 iter 636: train loss 0.26762. lr 1.095788e-04:  44%|████▎     | 636/1455 [05:24<06:59,  1.95it/s][A
epoch 2 iter 636: train loss 0.26762. lr 1.095788e-04:  44%|████▍     | 637/1455 [05:24<06:53,  1.98it/s][A
epoch 2 iter 637: train loss 0.27165. lr 1.093286e-04:  44%|████▍     | 637/1455 [05:25<06:53,  1.98it/s][A
epoch 2 iter 637: train loss 0.27165. lr 1.093286e-04:  44%|████▍     | 638/1455 [05:25<06:49,  1.99it/s][A
epoch 2 iter 638: train loss 0.26960. lr 1.090787e-04:  44%|████▍     | 638/1455 [05:25<06:49,  1.99it/s][A

epoch 2 iter 671: train loss 0.26701. lr 1.009567e-04:  46%|████▌     | 672/1455 [05:41<06:37,  1.97it/s][A
epoch 2 iter 672: train loss 0.26779. lr 1.007145e-04:  46%|████▌     | 672/1455 [05:41<06:37,  1.97it/s][A
epoch 2 iter 672: train loss 0.26779. lr 1.007145e-04:  46%|████▋     | 673/1455 [05:41<06:40,  1.95it/s][A
epoch 2 iter 673: train loss 0.26944. lr 1.004725e-04:  46%|████▋     | 673/1455 [05:42<06:40,  1.95it/s][A
epoch 2 iter 673: train loss 0.26944. lr 1.004725e-04:  46%|████▋     | 674/1455 [05:42<06:09,  2.11it/s][A
epoch 2 iter 674: train loss 0.26477. lr 1.002307e-04:  46%|████▋     | 674/1455 [05:42<06:09,  2.11it/s][A
epoch 2 iter 674: train loss 0.26477. lr 1.002307e-04:  46%|████▋     | 675/1455 [05:42<05:49,  2.23it/s][A
epoch 2 iter 675: train loss 0.26580. lr 9.998921e-05:  46%|████▋     | 675/1455 [05:43<05:49,  2.23it/s][A
epoch 2 iter 675: train loss 0.26580. lr 9.998921e-05:  46%|████▋     | 676/1455 [05:43<05:47,  2.24it/s][A

epoch 2 iter 709: train loss 0.26188. lr 9.191778e-05:  49%|████▊     | 709/1455 [06:01<08:49,  1.41it/s][A
epoch 2 iter 709: train loss 0.26188. lr 9.191778e-05:  49%|████▉     | 710/1455 [06:01<08:27,  1.47it/s][A
epoch 2 iter 710: train loss 0.25720. lr 9.168458e-05:  49%|████▉     | 710/1455 [06:01<08:27,  1.47it/s][A
epoch 2 iter 710: train loss 0.25720. lr 9.168458e-05:  49%|████▉     | 711/1455 [06:01<08:00,  1.55it/s][A
epoch 2 iter 711: train loss 0.26249. lr 9.145162e-05:  49%|████▉     | 711/1455 [06:02<08:00,  1.55it/s][A
epoch 2 iter 711: train loss 0.26249. lr 9.145162e-05:  49%|████▉     | 712/1455 [06:02<07:40,  1.61it/s][A
epoch 2 iter 712: train loss 0.26320. lr 9.121890e-05:  49%|████▉     | 712/1455 [06:02<07:40,  1.61it/s][A
epoch 2 iter 712: train loss 0.26320. lr 9.121890e-05:  49%|████▉     | 713/1455 [06:02<07:26,  1.66it/s][A
epoch 2 iter 713: train loss 0.25708. lr 9.098643e-05:  49%|████▉     | 713/1455 [06:03<07:26,  1.66it/s][A

epoch 2 iter 746: train loss 0.25572. lr 8.345309e-05:  51%|█████▏    | 747/1455 [06:19<08:56,  1.32it/s][A
epoch 2 iter 747: train loss 0.25369. lr 8.322905e-05:  51%|█████▏    | 747/1455 [06:20<08:56,  1.32it/s][A
epoch 2 iter 747: train loss 0.25369. lr 8.322905e-05:  51%|█████▏    | 748/1455 [06:20<11:31,  1.02it/s][A
epoch 2 iter 748: train loss 0.25523. lr 8.300527e-05:  51%|█████▏    | 748/1455 [06:21<11:31,  1.02it/s][A
epoch 2 iter 748: train loss 0.25523. lr 8.300527e-05:  51%|█████▏    | 749/1455 [06:21<10:39,  1.10it/s][A
epoch 2 iter 749: train loss 0.25329. lr 8.278173e-05:  51%|█████▏    | 749/1455 [06:22<10:39,  1.10it/s][A
epoch 2 iter 749: train loss 0.25329. lr 8.278173e-05:  52%|█████▏    | 750/1455 [06:22<10:04,  1.17it/s][A
epoch 2 iter 750: train loss 0.25674. lr 8.255845e-05:  52%|█████▏    | 750/1455 [06:22<10:04,  1.17it/s][A
epoch 2 iter 750: train loss 0.25674. lr 8.255845e-05:  52%|█████▏    | 751/1455 [06:22<09:24,  1.25it/s][A

epoch 2 iter 784: train loss 0.25442. lr 7.511941e-05:  54%|█████▍    | 784/1455 [06:39<04:36,  2.43it/s][A
epoch 2 iter 784: train loss 0.25442. lr 7.511941e-05:  54%|█████▍    | 785/1455 [06:39<04:34,  2.44it/s][A
epoch 2 iter 785: train loss 0.25326. lr 7.490516e-05:  54%|█████▍    | 785/1455 [06:39<04:34,  2.44it/s][A
epoch 2 iter 785: train loss 0.25326. lr 7.490516e-05:  54%|█████▍    | 786/1455 [06:39<04:32,  2.45it/s][A
epoch 2 iter 786: train loss 0.25197. lr 7.469116e-05:  54%|█████▍    | 786/1455 [06:40<04:32,  2.45it/s][A
epoch 2 iter 786: train loss 0.25197. lr 7.469116e-05:  54%|█████▍    | 787/1455 [06:40<04:39,  2.39it/s][A
epoch 2 iter 787: train loss 0.24759. lr 7.447743e-05:  54%|█████▍    | 787/1455 [06:40<04:39,  2.39it/s][A
epoch 2 iter 787: train loss 0.24759. lr 7.447743e-05:  54%|█████▍    | 788/1455 [06:40<04:34,  2.43it/s][A
epoch 2 iter 788: train loss 0.25245. lr 7.426396e-05:  54%|█████▍    | 788/1455 [06:41<04:34,  2.43it/s][A

epoch 2 iter 821: train loss 0.24657. lr 6.736860e-05:  56%|█████▋    | 822/1455 [06:58<06:27,  1.63it/s][A
epoch 2 iter 822: train loss 0.24314. lr 6.716422e-05:  56%|█████▋    | 822/1455 [06:59<06:27,  1.63it/s][A
epoch 2 iter 822: train loss 0.24314. lr 6.716422e-05:  57%|█████▋    | 823/1455 [06:59<06:25,  1.64it/s][A
epoch 2 iter 823: train loss 0.24642. lr 6.696010e-05:  57%|█████▋    | 823/1455 [06:59<06:25,  1.64it/s][A
epoch 2 iter 823: train loss 0.24642. lr 6.696010e-05:  57%|█████▋    | 824/1455 [06:59<06:24,  1.64it/s][A
epoch 2 iter 824: train loss 0.24777. lr 6.675626e-05:  57%|█████▋    | 824/1455 [07:00<06:24,  1.64it/s][A
epoch 2 iter 824: train loss 0.24777. lr 6.675626e-05:  57%|█████▋    | 825/1455 [07:00<06:23,  1.64it/s][A
epoch 2 iter 825: train loss 0.24586. lr 6.655269e-05:  57%|█████▋    | 825/1455 [07:00<06:23,  1.64it/s][A
epoch 2 iter 825: train loss 0.24586. lr 6.655269e-05:  57%|█████▋    | 826/1455 [07:00<06:14,  1.68it/s][A

epoch 2 iter 859: train loss 0.24385. lr 6.000000e-05:  59%|█████▉    | 859/1455 [07:19<07:17,  1.36it/s][A
epoch 2 iter 859: train loss 0.24385. lr 6.000000e-05:  59%|█████▉    | 860/1455 [07:19<07:17,  1.36it/s][A
epoch 2 iter 860: train loss 0.23906. lr 6.000000e-05:  59%|█████▉    | 860/1455 [07:19<07:17,  1.36it/s][A
epoch 2 iter 860: train loss 0.23906. lr 6.000000e-05:  59%|█████▉    | 861/1455 [07:19<07:04,  1.40it/s][A
epoch 2 iter 861: train loss 0.24318. lr 6.000000e-05:  59%|█████▉    | 861/1455 [07:20<07:04,  1.40it/s][A
epoch 2 iter 861: train loss 0.24318. lr 6.000000e-05:  59%|█████▉    | 862/1455 [07:20<06:56,  1.43it/s][A
epoch 2 iter 862: train loss 0.23701. lr 6.000000e-05:  59%|█████▉    | 862/1455 [07:21<06:56,  1.43it/s][A
epoch 2 iter 862: train loss 0.23701. lr 6.000000e-05:  59%|█████▉    | 863/1455 [07:21<06:48,  1.45it/s][A
epoch 2 iter 863: train loss 0.24224. lr 6.000000e-05:  59%|█████▉    | 863/1455 [07:21<06:48,  1.45it/s][A

epoch 2 iter 896: train loss 0.23739. lr 6.000000e-05:  62%|██████▏   | 897/1455 [07:38<04:52,  1.91it/s][A
epoch 2 iter 897: train loss 0.23549. lr 6.000000e-05:  62%|██████▏   | 897/1455 [07:39<04:52,  1.91it/s][A
epoch 2 iter 897: train loss 0.23549. lr 6.000000e-05:  62%|██████▏   | 898/1455 [07:39<04:48,  1.93it/s][A
epoch 2 iter 898: train loss 0.23960. lr 6.000000e-05:  62%|██████▏   | 898/1455 [07:39<04:48,  1.93it/s][A
epoch 2 iter 898: train loss 0.23960. lr 6.000000e-05:  62%|██████▏   | 899/1455 [07:39<04:25,  2.09it/s][A
epoch 2 iter 899: train loss 0.23921. lr 6.000000e-05:  62%|██████▏   | 899/1455 [07:39<04:25,  2.09it/s][A
epoch 2 iter 899: train loss 0.23921. lr 6.000000e-05:  62%|██████▏   | 900/1455 [07:39<04:10,  2.22it/s][A
epoch 2 iter 900: train loss 0.23604. lr 6.000000e-05:  62%|██████▏   | 900/1455 [07:40<04:10,  2.22it/s][A
epoch 2 iter 900: train loss 0.23604. lr 6.000000e-05:  62%|██████▏   | 901/1455 [07:40<03:59,  2.31it/s][A

epoch 2 iter 934: train loss 0.23898. lr 6.000000e-05:  64%|██████▍   | 934/1455 [07:57<03:58,  2.18it/s][A
epoch 2 iter 934: train loss 0.23898. lr 6.000000e-05:  64%|██████▍   | 935/1455 [07:57<03:56,  2.20it/s][A
epoch 2 iter 935: train loss 0.23589. lr 6.000000e-05:  64%|██████▍   | 935/1455 [07:57<03:56,  2.20it/s][A
epoch 2 iter 935: train loss 0.23589. lr 6.000000e-05:  64%|██████▍   | 936/1455 [07:57<03:55,  2.20it/s][A
epoch 2 iter 936: train loss 0.23459. lr 6.000000e-05:  64%|██████▍   | 936/1455 [07:58<03:55,  2.20it/s][A
epoch 2 iter 936: train loss 0.23459. lr 6.000000e-05:  64%|██████▍   | 937/1455 [07:58<03:52,  2.23it/s][A
epoch 2 iter 937: train loss 0.23777. lr 6.000000e-05:  64%|██████▍   | 937/1455 [07:58<03:52,  2.23it/s][A
epoch 2 iter 937: train loss 0.23777. lr 6.000000e-05:  64%|██████▍   | 938/1455 [07:58<03:50,  2.24it/s][A
epoch 2 iter 938: train loss 0.23956. lr 6.000000e-05:  64%|██████▍   | 938/1455 [07:59<03:50,  2.24it/s][A

epoch 2 iter 971: train loss 0.23452. lr 6.000000e-05:  67%|██████▋   | 972/1455 [08:15<03:24,  2.36it/s][A
epoch 2 iter 972: train loss 0.23591. lr 6.000000e-05:  67%|██████▋   | 972/1455 [08:15<03:24,  2.36it/s][A
epoch 2 iter 972: train loss 0.23591. lr 6.000000e-05:  67%|██████▋   | 973/1455 [08:15<03:20,  2.40it/s][A
epoch 2 iter 973: train loss 0.23548. lr 6.000000e-05:  67%|██████▋   | 973/1455 [08:16<03:20,  2.40it/s][A
epoch 2 iter 973: train loss 0.23548. lr 6.000000e-05:  67%|██████▋   | 974/1455 [08:16<03:18,  2.43it/s][A
epoch 2 iter 974: train loss 0.23805. lr 6.000000e-05:  67%|██████▋   | 974/1455 [08:16<03:18,  2.43it/s][A
epoch 2 iter 974: train loss 0.23805. lr 6.000000e-05:  67%|██████▋   | 975/1455 [08:16<03:19,  2.41it/s][A
epoch 2 iter 975: train loss 0.23856. lr 6.000000e-05:  67%|██████▋   | 975/1455 [08:16<03:19,  2.41it/s][A
epoch 2 iter 975: train loss 0.23856. lr 6.000000e-05:  67%|██████▋   | 976/1455 [08:16<03:21,  2.38it/s][A

epoch 2 iter 1008: train loss 0.23258. lr 6.000000e-05:  69%|██████▉   | 1009/1455 [08:35<03:07,  2.38it/s][A
epoch 2 iter 1009: train loss 0.23189. lr 6.000000e-05:  69%|██████▉   | 1009/1455 [08:35<03:07,  2.38it/s][A
epoch 2 iter 1009: train loss 0.23189. lr 6.000000e-05:  69%|██████▉   | 1010/1455 [08:35<03:05,  2.40it/s][A
epoch 2 iter 1010: train loss 0.23151. lr 6.000000e-05:  69%|██████▉   | 1010/1455 [08:36<03:05,  2.40it/s][A
epoch 2 iter 1010: train loss 0.23151. lr 6.000000e-05:  69%|██████▉   | 1011/1455 [08:36<03:04,  2.41it/s][A
epoch 2 iter 1011: train loss 0.22944. lr 6.000000e-05:  69%|██████▉   | 1011/1455 [08:36<03:04,  2.41it/s][A
epoch 2 iter 1011: train loss 0.22944. lr 6.000000e-05:  70%|██████▉   | 1012/1455 [08:36<03:03,  2.41it/s][A
epoch 2 iter 1012: train loss 0.23458. lr 6.000000e-05:  70%|██████▉   | 1012/1455 [08:37<03:03,  2.41it/s][A
epoch 2 iter 1012: train loss 0.23458. lr 6.000000e-05:  70%|██████▉   | 1013/1455 [08:37<03:22,  2.18it/s][A

epoch 2 iter 1045: train loss 0.23145. lr 6.000000e-05:  72%|███████▏  | 1045/1455 [08:54<03:46,  1.81it/s][A
epoch 2 iter 1045: train loss 0.23145. lr 6.000000e-05:  72%|███████▏  | 1046/1455 [08:54<03:41,  1.85it/s][A
epoch 2 iter 1046: train loss 0.23052. lr 6.000000e-05:  72%|███████▏  | 1046/1455 [08:54<03:41,  1.85it/s][A
epoch 2 iter 1046: train loss 0.23052. lr 6.000000e-05:  72%|███████▏  | 1047/1455 [08:54<03:37,  1.87it/s][A
epoch 2 iter 1047: train loss 0.23145. lr 6.000000e-05:  72%|███████▏  | 1047/1455 [08:55<03:37,  1.87it/s][A
epoch 2 iter 1047: train loss 0.23145. lr 6.000000e-05:  72%|███████▏  | 1048/1455 [08:55<03:31,  1.93it/s][A
epoch 2 iter 1048: train loss 0.23168. lr 6.000000e-05:  72%|███████▏  | 1048/1455 [08:55<03:31,  1.93it/s][A
epoch 2 iter 1048: train loss 0.23168. lr 6.000000e-05:  72%|███████▏  | 1049/1455 [08:55<03:26,  1.96it/s][A
epoch 2 iter 1049: train loss 0.23163. lr 6.000000e-05:  72%|███████▏  | 1049/1455 [08:56<03:26,  1.96it/s][A

epoch 2 iter 1081: train loss 0.23117. lr 6.000000e-05:  74%|███████▍  | 1082/1455 [09:11<02:35,  2.40it/s][A
epoch 2 iter 1082: train loss 0.22999. lr 6.000000e-05:  74%|███████▍  | 1082/1455 [09:11<02:35,  2.40it/s][A
epoch 2 iter 1082: train loss 0.22999. lr 6.000000e-05:  74%|███████▍  | 1083/1455 [09:11<02:34,  2.41it/s][A
epoch 2 iter 1083: train loss 0.23378. lr 6.000000e-05:  74%|███████▍  | 1083/1455 [09:11<02:34,  2.41it/s][A
epoch 2 iter 1083: train loss 0.23378. lr 6.000000e-05:  75%|███████▍  | 1084/1455 [09:11<02:37,  2.35it/s][A
epoch 2 iter 1084: train loss 0.23341. lr 6.000000e-05:  75%|███████▍  | 1084/1455 [09:12<02:37,  2.35it/s][A
epoch 2 iter 1084: train loss 0.23341. lr 6.000000e-05:  75%|███████▍  | 1085/1455 [09:12<02:36,  2.37it/s][A
epoch 2 iter 1085: train loss 0.23054. lr 6.000000e-05:  75%|███████▍  | 1085/1455 [09:12<02:36,  2.37it/s][A
epoch 2 iter 1085: train loss 0.23054. lr 6.000000e-05:  75%|███████▍  | 1086/1455 [09:12<02:51,  2.15it/s][A

epoch 2 iter 1118: train loss 0.23092. lr 6.000000e-05:  77%|███████▋  | 1118/1455 [09:29<02:25,  2.31it/s][A
epoch 2 iter 1118: train loss 0.23092. lr 6.000000e-05:  77%|███████▋  | 1119/1455 [09:29<02:23,  2.35it/s][A
epoch 2 iter 1119: train loss 0.22968. lr 6.000000e-05:  77%|███████▋  | 1119/1455 [09:30<02:23,  2.35it/s][A
epoch 2 iter 1119: train loss 0.22968. lr 6.000000e-05:  77%|███████▋  | 1120/1455 [09:30<02:21,  2.38it/s][A
epoch 2 iter 1120: train loss 0.23386. lr 6.000000e-05:  77%|███████▋  | 1120/1455 [09:30<02:21,  2.38it/s][A
epoch 2 iter 1120: train loss 0.23386. lr 6.000000e-05:  77%|███████▋  | 1121/1455 [09:30<02:19,  2.39it/s][A
epoch 2 iter 1121: train loss 0.23145. lr 6.000000e-05:  77%|███████▋  | 1121/1455 [09:30<02:19,  2.39it/s][A
epoch 2 iter 1121: train loss 0.23145. lr 6.000000e-05:  77%|███████▋  | 1122/1455 [09:30<02:18,  2.41it/s][A
epoch 2 iter 1122: train loss 0.22968. lr 6.000000e-05:  77%|███████▋  | 1122/1455 [09:31<02:18,  2.41it/s][A

epoch 2 iter 1154: train loss 0.22689. lr 6.000000e-05:  79%|███████▉  | 1155/1455 [09:50<02:34,  1.94it/s][A
epoch 2 iter 1155: train loss 0.22764. lr 6.000000e-05:  79%|███████▉  | 1155/1455 [09:50<02:34,  1.94it/s][A
epoch 2 iter 1155: train loss 0.22764. lr 6.000000e-05:  79%|███████▉  | 1156/1455 [09:50<02:42,  1.84it/s][A
epoch 2 iter 1156: train loss 0.23070. lr 6.000000e-05:  79%|███████▉  | 1156/1455 [09:51<02:42,  1.84it/s][A
epoch 2 iter 1156: train loss 0.23070. lr 6.000000e-05:  80%|███████▉  | 1157/1455 [09:51<02:47,  1.78it/s][A
epoch 2 iter 1157: train loss 0.22665. lr 6.000000e-05:  80%|███████▉  | 1157/1455 [09:52<02:47,  1.78it/s][A
epoch 2 iter 1157: train loss 0.22665. lr 6.000000e-05:  80%|███████▉  | 1158/1455 [09:52<02:50,  1.74it/s][A
epoch 2 iter 1158: train loss 0.23034. lr 6.000000e-05:  80%|███████▉  | 1158/1455 [09:52<02:50,  1.74it/s][A
epoch 2 iter 1158: train loss 0.23034. lr 6.000000e-05:  80%|███████▉  | 1159/1455 [09:52<02:53,  1.71it/s][A

epoch 2 iter 1191: train loss 0.22662. lr 6.000000e-05:  82%|████████▏ | 1191/1455 [10:08<01:50,  2.39it/s][A
epoch 2 iter 1191: train loss 0.22662. lr 6.000000e-05:  82%|████████▏ | 1192/1455 [10:08<01:48,  2.42it/s][A
epoch 2 iter 1192: train loss 0.22720. lr 6.000000e-05:  82%|████████▏ | 1192/1455 [10:08<01:48,  2.42it/s][A
epoch 2 iter 1192: train loss 0.22720. lr 6.000000e-05:  82%|████████▏ | 1193/1455 [10:08<01:47,  2.44it/s][A
epoch 2 iter 1193: train loss 0.22404. lr 6.000000e-05:  82%|████████▏ | 1193/1455 [10:09<01:47,  2.44it/s][A
epoch 2 iter 1193: train loss 0.22404. lr 6.000000e-05:  82%|████████▏ | 1194/1455 [10:09<01:46,  2.45it/s][A
epoch 2 iter 1194: train loss 0.22474. lr 6.000000e-05:  82%|████████▏ | 1194/1455 [10:09<01:46,  2.45it/s][A
epoch 2 iter 1194: train loss 0.22474. lr 6.000000e-05:  82%|████████▏ | 1195/1455 [10:09<01:45,  2.47it/s][A
epoch 2 iter 1195: train loss 0.22429. lr 6.000000e-05:  82%|████████▏ | 1195/1455 [10:10<01:45,  2.47it/s][A

epoch 2 iter 1227: train loss 0.22223. lr 6.000000e-05:  84%|████████▍ | 1228/1455 [10:27<01:55,  1.97it/s][A
epoch 2 iter 1228: train loss 0.22356. lr 6.000000e-05:  84%|████████▍ | 1228/1455 [10:28<01:55,  1.97it/s][A
epoch 2 iter 1228: train loss 0.22356. lr 6.000000e-05:  84%|████████▍ | 1229/1455 [10:28<01:51,  2.04it/s][A
epoch 2 iter 1229: train loss 0.22342. lr 6.000000e-05:  84%|████████▍ | 1229/1455 [10:28<01:51,  2.04it/s][A
epoch 2 iter 1229: train loss 0.22342. lr 6.000000e-05:  85%|████████▍ | 1230/1455 [10:28<01:48,  2.08it/s][A
epoch 2 iter 1230: train loss 0.22745. lr 6.000000e-05:  85%|████████▍ | 1230/1455 [10:29<01:48,  2.08it/s][A
epoch 2 iter 1230: train loss 0.22745. lr 6.000000e-05:  85%|████████▍ | 1231/1455 [10:29<01:53,  1.97it/s][A
epoch 2 iter 1231: train loss 0.22379. lr 6.000000e-05:  85%|████████▍ | 1231/1455 [10:29<01:53,  1.97it/s][A
epoch 2 iter 1231: train loss 0.22379. lr 6.000000e-05:  85%|████████▍ | 1232/1455 [10:29<01:53,  1.97it/s][A

epoch 2 iter 1264: train loss 0.21876. lr 6.000000e-05:  87%|████████▋ | 1264/1455 [10:46<02:07,  1.50it/s][A
epoch 2 iter 1264: train loss 0.21876. lr 6.000000e-05:  87%|████████▋ | 1265/1455 [10:46<02:00,  1.57it/s][A
epoch 2 iter 1265: train loss 0.22495. lr 6.000000e-05:  87%|████████▋ | 1265/1455 [10:46<02:00,  1.57it/s][A
epoch 2 iter 1265: train loss 0.22495. lr 6.000000e-05:  87%|████████▋ | 1266/1455 [10:46<01:53,  1.66it/s][A
epoch 2 iter 1266: train loss 0.22676. lr 6.000000e-05:  87%|████████▋ | 1266/1455 [10:47<01:53,  1.66it/s][A
epoch 2 iter 1266: train loss 0.22676. lr 6.000000e-05:  87%|████████▋ | 1267/1455 [10:47<01:48,  1.73it/s][A
epoch 2 iter 1267: train loss 0.22596. lr 6.000000e-05:  87%|████████▋ | 1267/1455 [10:47<01:48,  1.73it/s][A
epoch 2 iter 1267: train loss 0.22596. lr 6.000000e-05:  87%|████████▋ | 1268/1455 [10:47<01:43,  1.82it/s][A
epoch 2 iter 1268: train loss 0.22326. lr 6.000000e-05:  87%|████████▋ | 1268/1455 [10:48<01:43,  1.82it/s][A

epoch 2 iter 1300: train loss 0.22356. lr 6.000000e-05:  89%|████████▉ | 1301/1455 [11:03<01:27,  1.75it/s][A
epoch 2 iter 1301: train loss 0.22301. lr 6.000000e-05:  89%|████████▉ | 1301/1455 [11:03<01:27,  1.75it/s][A
epoch 2 iter 1301: train loss 0.22301. lr 6.000000e-05:  89%|████████▉ | 1302/1455 [11:03<01:29,  1.71it/s][A
epoch 2 iter 1302: train loss 0.22419. lr 6.000000e-05:  89%|████████▉ | 1302/1455 [11:04<01:29,  1.71it/s][A
epoch 2 iter 1302: train loss 0.22419. lr 6.000000e-05:  90%|████████▉ | 1303/1455 [11:04<01:30,  1.68it/s][A
epoch 2 iter 1303: train loss 0.21959. lr 6.000000e-05:  90%|████████▉ | 1303/1455 [11:05<01:30,  1.68it/s][A
epoch 2 iter 1303: train loss 0.21959. lr 6.000000e-05:  90%|████████▉ | 1304/1455 [11:05<01:30,  1.66it/s][A
epoch 2 iter 1304: train loss 0.22350. lr 6.000000e-05:  90%|████████▉ | 1304/1455 [11:05<01:30,  1.66it/s][A
epoch 2 iter 1304: train loss 0.22350. lr 6.000000e-05:  90%|████████▉ | 1305/1455 [11:05<01:30,  1.65it/s][A

epoch 2 iter 1337: train loss 0.22214. lr 6.000000e-05:  92%|█████████▏| 1337/1455 [11:22<00:49,  2.40it/s][A
epoch 2 iter 1337: train loss 0.22214. lr 6.000000e-05:  92%|█████████▏| 1338/1455 [11:22<00:47,  2.44it/s][A
epoch 2 iter 1338: train loss 0.21781. lr 6.000000e-05:  92%|█████████▏| 1338/1455 [11:22<00:47,  2.44it/s][A
epoch 2 iter 1338: train loss 0.21781. lr 6.000000e-05:  92%|█████████▏| 1339/1455 [11:22<00:46,  2.47it/s][A
epoch 2 iter 1339: train loss 0.21984. lr 6.000000e-05:  92%|█████████▏| 1339/1455 [11:23<00:46,  2.47it/s][A
epoch 2 iter 1339: train loss 0.21984. lr 6.000000e-05:  92%|█████████▏| 1340/1455 [11:23<00:46,  2.49it/s][A
epoch 2 iter 1340: train loss 0.22174. lr 6.000000e-05:  92%|█████████▏| 1340/1455 [11:24<00:46,  2.49it/s][A
epoch 2 iter 1340: train loss 0.22174. lr 6.000000e-05:  92%|█████████▏| 1341/1455 [11:24<01:27,  1.30it/s][A
epoch 2 iter 1341: train loss 0.22353. lr 6.000000e-05:  92%|█████████▏| 1341/1455 [11:25<01:27,  1.30it/s][A

epoch 2 iter 1373: train loss 0.22122. lr 6.000000e-05:  94%|█████████▍| 1374/1455 [11:39<00:37,  2.19it/s][A
epoch 2 iter 1374: train loss 0.21773. lr 6.000000e-05:  94%|█████████▍| 1374/1455 [11:40<00:37,  2.19it/s][A
epoch 2 iter 1374: train loss 0.21773. lr 6.000000e-05:  95%|█████████▍| 1375/1455 [11:40<00:34,  2.29it/s][A
epoch 2 iter 1375: train loss 0.21923. lr 6.000000e-05:  95%|█████████▍| 1375/1455 [11:40<00:34,  2.29it/s][A
epoch 2 iter 1375: train loss 0.21923. lr 6.000000e-05:  95%|█████████▍| 1376/1455 [11:40<00:33,  2.35it/s][A
epoch 2 iter 1376: train loss 0.22036. lr 6.000000e-05:  95%|█████████▍| 1376/1455 [11:41<00:33,  2.35it/s][A
epoch 2 iter 1376: train loss 0.22036. lr 6.000000e-05:  95%|█████████▍| 1377/1455 [11:41<00:35,  2.23it/s][A
epoch 2 iter 1377: train loss 0.22349. lr 6.000000e-05:  95%|█████████▍| 1377/1455 [11:41<00:35,  2.23it/s][A
epoch 2 iter 1377: train loss 0.22349. lr 6.000000e-05:  95%|█████████▍| 1378/1455 [11:41<00:37,  2.06it/s][A

epoch 2 iter 1410: train loss 0.21744. lr 6.000000e-05:  97%|█████████▋| 1410/1455 [11:58<00:28,  1.56it/s][A
epoch 2 iter 1410: train loss 0.21744. lr 6.000000e-05:  97%|█████████▋| 1411/1455 [11:58<00:31,  1.41it/s][A
epoch 2 iter 1411: train loss 0.21966. lr 6.000000e-05:  97%|█████████▋| 1411/1455 [11:59<00:31,  1.41it/s][A
epoch 2 iter 1411: train loss 0.21966. lr 6.000000e-05:  97%|█████████▋| 1412/1455 [11:59<00:32,  1.33it/s][A
epoch 2 iter 1412: train loss 0.21811. lr 6.000000e-05:  97%|█████████▋| 1412/1455 [12:00<00:32,  1.33it/s][A
epoch 2 iter 1412: train loss 0.21811. lr 6.000000e-05:  97%|█████████▋| 1413/1455 [12:00<00:32,  1.28it/s][A
epoch 2 iter 1413: train loss 0.22079. lr 6.000000e-05:  97%|█████████▋| 1413/1455 [12:01<00:32,  1.28it/s][A
epoch 2 iter 1413: train loss 0.22079. lr 6.000000e-05:  97%|█████████▋| 1414/1455 [12:01<00:32,  1.25it/s][A
epoch 2 iter 1414: train loss 0.21686. lr 6.000000e-05:  97%|█████████▋| 1414/1455 [12:02<00:32,  1.25it/s][A

epoch 2 iter 1446: train loss 0.21702. lr 6.000000e-05:  99%|█████████▉| 1447/1455 [12:17<00:03,  2.46it/s][A
epoch 2 iter 1447: train loss 0.21830. lr 6.000000e-05:  99%|█████████▉| 1447/1455 [12:18<00:03,  2.46it/s][A
epoch 2 iter 1447: train loss 0.21830. lr 6.000000e-05: 100%|█████████▉| 1448/1455 [12:18<00:02,  2.49it/s][A
epoch 2 iter 1448: train loss 0.21533. lr 6.000000e-05: 100%|█████████▉| 1448/1455 [12:18<00:02,  2.49it/s][A
epoch 2 iter 1448: train loss 0.21533. lr 6.000000e-05: 100%|█████████▉| 1449/1455 [12:18<00:02,  2.50it/s][A
epoch 2 iter 1449: train loss 0.21783. lr 6.000000e-05: 100%|█████████▉| 1449/1455 [12:19<00:02,  2.50it/s][A
epoch 2 iter 1449: train loss 0.21783. lr 6.000000e-05: 100%|█████████▉| 1450/1455 [12:19<00:02,  2.37it/s][A
epoch 2 iter 1450: train loss 0.21884. lr 6.000000e-05: 100%|█████████▉| 1450/1455 [12:19<00:02,  2.37it/s][A
epoch 2 iter 1450: train loss 0.21884. lr 6.000000e-05: 100%|█████████▉| 1451/1455 [12:19<00:01,  2.42it/s][A

data has 599662 characters, 186 unique.



  0%|          | 0/1171 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.37209. lr 5.999998e-04:   0%|          | 0/1171 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.37209. lr 5.999998e-04:   0%|          | 1/1171 [00:00<09:37,  2.03it/s][A
epoch 1 iter 1: train loss 4.17997. lr 5.999991e-04:   0%|          | 1/1171 [00:00<09:37,  2.03it/s][A
epoch 1 iter 1: train loss 4.17997. lr 5.999991e-04:   0%|          | 2/1171 [00:00<09:00,  2.16it/s][A
epoch 1 iter 2: train loss 4.13829. lr 5.999978e-04:   0%|          | 2/1171 [00:01<09:00,  2.16it/s][A
epoch 1 iter 2: train loss 4.13829. lr 5.999978e-04:   0%|          | 3/1171 [00:01<08:31,  2.28it/s][A
epoch 1 iter 3: train loss 3.71260. lr 5.999960e-04:   0%|          | 3/1171 [00:01<08:31,  2.28it/s][A
epoch 1 iter 3: train loss 3.71260. lr 5.999960e-04:   0%|          | 4/1171 [00:01<08:11,  2.37it/s][A
epoch 1 iter 4: train loss 3.75604. lr 5.999937e-04:   0%|          | 4/1171 [00:02<08:11,  2.37it/s][A

epoch 1 iter 38: train loss 2.94531. lr 5.995928e-04:   3%|▎         | 38/1171 [00:20<08:40,  2.17it/s][A
epoch 1 iter 38: train loss 2.94531. lr 5.995928e-04:   3%|▎         | 39/1171 [00:20<08:35,  2.20it/s][A
epoch 1 iter 39: train loss 2.94591. lr 5.995715e-04:   3%|▎         | 39/1171 [00:20<08:35,  2.20it/s][A
epoch 1 iter 39: train loss 2.94591. lr 5.995715e-04:   3%|▎         | 40/1171 [00:20<08:30,  2.22it/s][A
epoch 1 iter 40: train loss 2.89440. lr 5.995498e-04:   3%|▎         | 40/1171 [00:20<08:30,  2.22it/s][A
epoch 1 iter 40: train loss 2.89440. lr 5.995498e-04:   4%|▎         | 41/1171 [00:20<08:25,  2.24it/s][A
epoch 1 iter 41: train loss 2.94757. lr 5.995274e-04:   4%|▎         | 41/1171 [00:21<08:25,  2.24it/s][A
epoch 1 iter 41: train loss 2.94757. lr 5.995274e-04:   4%|▎         | 42/1171 [00:21<08:21,  2.25it/s][A
epoch 1 iter 42: train loss 2.87570. lr 5.995046e-04:   4%|▎         | 42/1171 [00:21<08:21,  2.25it/s][A

epoch 1 iter 76: train loss 2.64607. lr 5.984073e-04:   6%|▋         | 76/1171 [00:37<08:43,  2.09it/s][A
epoch 1 iter 76: train loss 2.64607. lr 5.984073e-04:   7%|▋         | 77/1171 [00:37<08:27,  2.16it/s][A
epoch 1 iter 77: train loss 2.65967. lr 5.983656e-04:   7%|▋         | 77/1171 [00:38<08:27,  2.16it/s][A
epoch 1 iter 77: train loss 2.65967. lr 5.983656e-04:   7%|▋         | 78/1171 [00:38<08:16,  2.20it/s][A
epoch 1 iter 78: train loss 2.67654. lr 5.983234e-04:   7%|▋         | 78/1171 [00:38<08:16,  2.20it/s][A
epoch 1 iter 78: train loss 2.67654. lr 5.983234e-04:   7%|▋         | 79/1171 [00:38<08:08,  2.23it/s][A
epoch 1 iter 79: train loss 2.66894. lr 5.982806e-04:   7%|▋         | 79/1171 [00:39<08:08,  2.23it/s][A
epoch 1 iter 79: train loss 2.66894. lr 5.982806e-04:   7%|▋         | 80/1171 [00:39<08:04,  2.25it/s][A
epoch 1 iter 80: train loss 2.65557. lr 5.982373e-04:   7%|▋         | 80/1171 [00:39<08:04,  2.25it/s][A

epoch 1 iter 114: train loss 2.58181. lr 5.964465e-04:  10%|▉         | 114/1171 [00:57<07:49,  2.25it/s][A
epoch 1 iter 114: train loss 2.58181. lr 5.964465e-04:  10%|▉         | 115/1171 [00:57<07:28,  2.36it/s][A
epoch 1 iter 115: train loss 2.57728. lr 5.963845e-04:  10%|▉         | 115/1171 [00:58<07:28,  2.36it/s][A
epoch 1 iter 115: train loss 2.57728. lr 5.963845e-04:  10%|▉         | 116/1171 [00:58<07:47,  2.26it/s][A
epoch 1 iter 116: train loss 2.57481. lr 5.963219e-04:  10%|▉         | 116/1171 [00:58<07:47,  2.26it/s][A
epoch 1 iter 116: train loss 2.57481. lr 5.963219e-04:  10%|▉         | 117/1171 [00:58<07:47,  2.25it/s][A
epoch 1 iter 117: train loss 2.57821. lr 5.962588e-04:  10%|▉         | 117/1171 [00:59<07:47,  2.25it/s][A
epoch 1 iter 117: train loss 2.57821. lr 5.962588e-04:  10%|█         | 118/1171 [00:59<07:55,  2.21it/s][A
epoch 1 iter 118: train loss 2.58337. lr 5.961952e-04:  10%|█         | 118/1171 [00:59<07:55,  2.21it/s][A

epoch 1 iter 151: train loss 2.48850. lr 5.937971e-04:  13%|█▎        | 152/1171 [01:15<12:48,  1.33it/s][A
epoch 1 iter 152: train loss 2.46896. lr 5.937154e-04:  13%|█▎        | 152/1171 [01:16<12:48,  1.33it/s][A
epoch 1 iter 152: train loss 2.46896. lr 5.937154e-04:  13%|█▎        | 153/1171 [01:16<13:04,  1.30it/s][A
epoch 1 iter 153: train loss 2.47802. lr 5.936332e-04:  13%|█▎        | 153/1171 [01:17<13:04,  1.30it/s][A
epoch 1 iter 153: train loss 2.47802. lr 5.936332e-04:  13%|█▎        | 154/1171 [01:17<13:06,  1.29it/s][A
epoch 1 iter 154: train loss 2.46777. lr 5.935505e-04:  13%|█▎        | 154/1171 [01:18<13:06,  1.29it/s][A
epoch 1 iter 154: train loss 2.46777. lr 5.935505e-04:  13%|█▎        | 155/1171 [01:18<12:46,  1.33it/s][A
epoch 1 iter 155: train loss 2.46758. lr 5.934672e-04:  13%|█▎        | 155/1171 [01:18<12:46,  1.33it/s][A
epoch 1 iter 155: train loss 2.46758. lr 5.934672e-04:  13%|█▎        | 156/1171 [01:18<12:15,  1.38it/s][A

epoch 1 iter 189: train loss 2.33738. lr 5.903229e-04:  16%|█▌        | 189/1171 [01:35<09:33,  1.71it/s][A
epoch 1 iter 189: train loss 2.33738. lr 5.903229e-04:  16%|█▌        | 190/1171 [01:35<09:41,  1.69it/s][A
epoch 1 iter 190: train loss 2.30298. lr 5.902212e-04:  16%|█▌        | 190/1171 [01:36<09:41,  1.69it/s][A
epoch 1 iter 190: train loss 2.30298. lr 5.902212e-04:  16%|█▋        | 191/1171 [01:36<08:53,  1.84it/s][A
epoch 1 iter 191: train loss 2.31458. lr 5.901191e-04:  16%|█▋        | 191/1171 [01:36<08:53,  1.84it/s][A
epoch 1 iter 191: train loss 2.31458. lr 5.901191e-04:  16%|█▋        | 192/1171 [01:36<08:02,  2.03it/s][A
epoch 1 iter 192: train loss 2.30888. lr 5.900164e-04:  16%|█▋        | 192/1171 [01:36<08:02,  2.03it/s][A
epoch 1 iter 192: train loss 2.30888. lr 5.900164e-04:  16%|█▋        | 193/1171 [01:36<07:28,  2.18it/s][A
epoch 1 iter 193: train loss 2.31240. lr 5.899131e-04:  16%|█▋        | 193/1171 [01:37<07:28,  2.18it/s][A

epoch 1 iter 226: train loss 2.12860. lr 5.862152e-04:  19%|█▉        | 227/1171 [01:55<13:34,  1.16it/s][A
epoch 1 iter 227: train loss 2.11591. lr 5.860943e-04:  19%|█▉        | 227/1171 [01:56<13:34,  1.16it/s][A
epoch 1 iter 227: train loss 2.11591. lr 5.860943e-04:  19%|█▉        | 228/1171 [01:56<12:17,  1.28it/s][A
epoch 1 iter 228: train loss 2.11892. lr 5.859730e-04:  19%|█▉        | 228/1171 [01:57<12:17,  1.28it/s][A
epoch 1 iter 228: train loss 2.11892. lr 5.859730e-04:  20%|█▉        | 229/1171 [01:57<11:17,  1.39it/s][A
epoch 1 iter 229: train loss 2.14337. lr 5.858511e-04:  20%|█▉        | 229/1171 [01:57<11:17,  1.39it/s][A
epoch 1 iter 229: train loss 2.14337. lr 5.858511e-04:  20%|█▉        | 230/1171 [01:57<10:35,  1.48it/s][A
epoch 1 iter 230: train loss 2.14098. lr 5.857287e-04:  20%|█▉        | 230/1171 [01:58<10:35,  1.48it/s][A
epoch 1 iter 230: train loss 2.14098. lr 5.857287e-04:  20%|█▉        | 231/1171 [01:58<10:05,  1.55it/s][A

epoch 1 iter 264: train loss 1.97791. lr 5.812627e-04:  23%|██▎       | 264/1171 [02:14<06:42,  2.25it/s][A
epoch 1 iter 264: train loss 1.97791. lr 5.812627e-04:  23%|██▎       | 265/1171 [02:14<06:41,  2.25it/s][A
epoch 1 iter 265: train loss 1.92509. lr 5.811224e-04:  23%|██▎       | 265/1171 [02:15<06:41,  2.25it/s][A
epoch 1 iter 265: train loss 1.92509. lr 5.811224e-04:  23%|██▎       | 266/1171 [02:15<06:33,  2.30it/s][A
epoch 1 iter 266: train loss 1.90433. lr 5.809817e-04:  23%|██▎       | 266/1171 [02:15<06:33,  2.30it/s][A
epoch 1 iter 266: train loss 1.90433. lr 5.809817e-04:  23%|██▎       | 267/1171 [02:15<06:28,  2.33it/s][A
epoch 1 iter 267: train loss 1.93802. lr 5.808404e-04:  23%|██▎       | 267/1171 [02:16<06:28,  2.33it/s][A
epoch 1 iter 267: train loss 1.93802. lr 5.808404e-04:  23%|██▎       | 268/1171 [02:16<07:07,  2.11it/s][A
epoch 1 iter 268: train loss 1.93125. lr 5.806986e-04:  23%|██▎       | 268/1171 [02:16<07:07,  2.11it/s][A

epoch 1 iter 301: train loss 1.78615. lr 5.757382e-04:  26%|██▌       | 302/1171 [02:31<06:06,  2.37it/s][A
epoch 1 iter 302: train loss 1.77423. lr 5.755794e-04:  26%|██▌       | 302/1171 [02:33<06:06,  2.37it/s][A
epoch 1 iter 302: train loss 1.77423. lr 5.755794e-04:  26%|██▌       | 303/1171 [02:33<10:10,  1.42it/s][A
epoch 1 iter 303: train loss 1.77563. lr 5.754201e-04:  26%|██▌       | 303/1171 [02:33<10:10,  1.42it/s][A
epoch 1 iter 303: train loss 1.77563. lr 5.754201e-04:  26%|██▌       | 304/1171 [02:33<09:14,  1.56it/s][A
epoch 1 iter 304: train loss 1.76419. lr 5.752603e-04:  26%|██▌       | 304/1171 [02:34<09:14,  1.56it/s][A
epoch 1 iter 304: train loss 1.76419. lr 5.752603e-04:  26%|██▌       | 305/1171 [02:34<08:23,  1.72it/s][A
epoch 1 iter 305: train loss 1.79106. lr 5.751000e-04:  26%|██▌       | 305/1171 [02:34<08:23,  1.72it/s][A
epoch 1 iter 305: train loss 1.79106. lr 5.751000e-04:  26%|██▌       | 306/1171 [02:34<07:41,  1.87it/s][A

epoch 1 iter 339: train loss 1.63923. lr 5.693575e-04:  29%|██▉       | 339/1171 [02:52<09:02,  1.53it/s][A
epoch 1 iter 339: train loss 1.63923. lr 5.693575e-04:  29%|██▉       | 340/1171 [02:52<08:47,  1.58it/s][A
epoch 1 iter 340: train loss 1.60255. lr 5.691801e-04:  29%|██▉       | 340/1171 [02:53<08:47,  1.58it/s][A
epoch 1 iter 340: train loss 1.60255. lr 5.691801e-04:  29%|██▉       | 341/1171 [02:53<08:32,  1.62it/s][A
epoch 1 iter 341: train loss 1.59348. lr 5.690021e-04:  29%|██▉       | 341/1171 [02:53<08:32,  1.62it/s][A
epoch 1 iter 341: train loss 1.59348. lr 5.690021e-04:  29%|██▉       | 342/1171 [02:53<08:04,  1.71it/s][A
epoch 1 iter 342: train loss 1.62040. lr 5.688237e-04:  29%|██▉       | 342/1171 [02:54<08:04,  1.71it/s][A
epoch 1 iter 342: train loss 1.62040. lr 5.688237e-04:  29%|██▉       | 343/1171 [02:54<07:45,  1.78it/s][A
epoch 1 iter 343: train loss 1.61319. lr 5.686448e-04:  29%|██▉       | 343/1171 [02:54<07:45,  1.78it/s][A

epoch 1 iter 376: train loss 1.52576. lr 5.624721e-04:  32%|███▏      | 377/1171 [03:08<06:54,  1.91it/s][A
epoch 1 iter 377: train loss 1.51657. lr 5.622770e-04:  32%|███▏      | 377/1171 [03:09<06:54,  1.91it/s][A
epoch 1 iter 377: train loss 1.51657. lr 5.622770e-04:  32%|███▏      | 378/1171 [03:09<06:37,  1.99it/s][A
epoch 1 iter 378: train loss 1.46674. lr 5.620813e-04:  32%|███▏      | 378/1171 [03:09<06:37,  1.99it/s][A
epoch 1 iter 378: train loss 1.46674. lr 5.620813e-04:  32%|███▏      | 379/1171 [03:09<06:28,  2.04it/s][A
epoch 1 iter 379: train loss 1.48279. lr 5.618853e-04:  32%|███▏      | 379/1171 [03:10<06:28,  2.04it/s][A
epoch 1 iter 379: train loss 1.48279. lr 5.618853e-04:  32%|███▏      | 380/1171 [03:10<06:38,  1.98it/s][A
epoch 1 iter 380: train loss 1.50290. lr 5.616887e-04:  32%|███▏      | 380/1171 [03:10<06:38,  1.98it/s][A
epoch 1 iter 380: train loss 1.50290. lr 5.616887e-04:  33%|███▎      | 381/1171 [03:10<06:50,  1.92it/s][A

epoch 1 iter 414: train loss 1.34694. lr 5.547278e-04:  35%|███▌      | 414/1171 [03:28<05:18,  2.38it/s][A
epoch 1 iter 414: train loss 1.34694. lr 5.547278e-04:  35%|███▌      | 415/1171 [03:28<05:14,  2.40it/s][A
epoch 1 iter 415: train loss 1.39198. lr 5.545149e-04:  35%|███▌      | 415/1171 [03:28<05:14,  2.40it/s][A
epoch 1 iter 415: train loss 1.39198. lr 5.545149e-04:  36%|███▌      | 416/1171 [03:28<05:11,  2.42it/s][A
epoch 1 iter 416: train loss 1.35967. lr 5.543017e-04:  36%|███▌      | 416/1171 [03:28<05:11,  2.42it/s][A
epoch 1 iter 416: train loss 1.35967. lr 5.543017e-04:  36%|███▌      | 417/1171 [03:28<05:08,  2.44it/s][A
epoch 1 iter 417: train loss 1.36993. lr 5.540879e-04:  36%|███▌      | 417/1171 [03:29<05:08,  2.44it/s][A
epoch 1 iter 417: train loss 1.36993. lr 5.540879e-04:  36%|███▌      | 418/1171 [03:29<05:07,  2.45it/s][A
epoch 1 iter 418: train loss 1.36822. lr 5.538737e-04:  36%|███▌      | 418/1171 [03:29<05:07,  2.45it/s][A

epoch 1 iter 451: train loss 1.26461. lr 5.465511e-04:  39%|███▊      | 452/1171 [03:49<05:29,  2.18it/s][A
epoch 1 iter 452: train loss 1.27740. lr 5.463216e-04:  39%|███▊      | 452/1171 [03:49<05:29,  2.18it/s][A
epoch 1 iter 452: train loss 1.27740. lr 5.463216e-04:  39%|███▊      | 453/1171 [03:49<05:26,  2.20it/s][A
epoch 1 iter 453: train loss 1.28068. lr 5.460916e-04:  39%|███▊      | 453/1171 [03:50<05:26,  2.20it/s][A
epoch 1 iter 453: train loss 1.28068. lr 5.460916e-04:  39%|███▉      | 454/1171 [03:50<05:24,  2.21it/s][A
epoch 1 iter 454: train loss 1.27824. lr 5.458612e-04:  39%|███▉      | 454/1171 [03:50<05:24,  2.21it/s][A
epoch 1 iter 454: train loss 1.27824. lr 5.458612e-04:  39%|███▉      | 455/1171 [03:50<05:08,  2.32it/s][A
epoch 1 iter 455: train loss 1.25646. lr 5.456304e-04:  39%|███▉      | 455/1171 [03:51<05:08,  2.32it/s][A
epoch 1 iter 455: train loss 1.25646. lr 5.456304e-04:  39%|███▉      | 456/1171 [03:51<04:56,  2.41it/s][A

epoch 1 iter 489: train loss 1.19556. lr 5.375215e-04:  42%|████▏     | 489/1171 [04:05<05:05,  2.23it/s][A
epoch 1 iter 489: train loss 1.19556. lr 5.375215e-04:  42%|████▏     | 490/1171 [04:05<05:01,  2.26it/s][A
epoch 1 iter 490: train loss 1.17382. lr 5.372754e-04:  42%|████▏     | 490/1171 [04:05<05:01,  2.26it/s][A
epoch 1 iter 490: train loss 1.17382. lr 5.372754e-04:  42%|████▏     | 491/1171 [04:05<05:00,  2.26it/s][A
epoch 1 iter 491: train loss 1.22273. lr 5.370289e-04:  42%|████▏     | 491/1171 [04:06<05:00,  2.26it/s][A
epoch 1 iter 491: train loss 1.22273. lr 5.370289e-04:  42%|████▏     | 492/1171 [04:06<05:01,  2.25it/s][A
epoch 1 iter 492: train loss 1.17728. lr 5.367820e-04:  42%|████▏     | 492/1171 [04:06<05:01,  2.25it/s][A
epoch 1 iter 492: train loss 1.17728. lr 5.367820e-04:  42%|████▏     | 493/1171 [04:06<05:04,  2.22it/s][A
epoch 1 iter 493: train loss 1.18650. lr 5.365346e-04:  42%|████▏     | 493/1171 [04:07<05:04,  2.22it/s][A

epoch 1 iter 526: train loss 1.14098. lr 5.281362e-04:  45%|████▌     | 527/1171 [04:21<04:26,  2.42it/s][A
epoch 1 iter 527: train loss 1.09761. lr 5.278747e-04:  45%|████▌     | 527/1171 [04:22<04:26,  2.42it/s][A
epoch 1 iter 527: train loss 1.09761. lr 5.278747e-04:  45%|████▌     | 528/1171 [04:22<04:27,  2.41it/s][A
epoch 1 iter 528: train loss 1.12248. lr 5.276127e-04:  45%|████▌     | 528/1171 [04:22<04:27,  2.41it/s][A
epoch 1 iter 528: train loss 1.12248. lr 5.276127e-04:  45%|████▌     | 529/1171 [04:22<04:29,  2.38it/s][A
epoch 1 iter 529: train loss 1.12598. lr 5.273503e-04:  45%|████▌     | 529/1171 [04:23<04:29,  2.38it/s][A
epoch 1 iter 529: train loss 1.12598. lr 5.273503e-04:  45%|████▌     | 530/1171 [04:23<04:22,  2.44it/s][A
epoch 1 iter 530: train loss 1.12980. lr 5.270875e-04:  45%|████▌     | 530/1171 [04:23<04:22,  2.44it/s][A
epoch 1 iter 530: train loss 1.12980. lr 5.270875e-04:  45%|████▌     | 531/1171 [04:23<04:23,  2.43it/s][A

epoch 1 iter 564: train loss 1.05928. lr 5.179126e-04:  48%|████▊     | 564/1171 [04:40<04:11,  2.41it/s][A
epoch 1 iter 564: train loss 1.05928. lr 5.179126e-04:  48%|████▊     | 565/1171 [04:40<04:13,  2.39it/s][A
epoch 1 iter 565: train loss 1.05037. lr 5.176358e-04:  48%|████▊     | 565/1171 [04:40<04:13,  2.39it/s][A
epoch 1 iter 565: train loss 1.05037. lr 5.176358e-04:  48%|████▊     | 566/1171 [04:40<04:13,  2.39it/s][A
epoch 1 iter 566: train loss 1.07504. lr 5.173586e-04:  48%|████▊     | 566/1171 [04:41<04:13,  2.39it/s][A
epoch 1 iter 566: train loss 1.07504. lr 5.173586e-04:  48%|████▊     | 567/1171 [04:41<04:14,  2.37it/s][A
epoch 1 iter 567: train loss 1.06656. lr 5.170810e-04:  48%|████▊     | 567/1171 [04:41<04:14,  2.37it/s][A
epoch 1 iter 567: train loss 1.06656. lr 5.170810e-04:  49%|████▊     | 568/1171 [04:41<04:47,  2.10it/s][A
epoch 1 iter 568: train loss 1.05473. lr 5.168030e-04:  49%|████▊     | 568/1171 [04:42<04:47,  2.10it/s][A

epoch 1 iter 601: train loss 1.01244. lr 5.074138e-04:  51%|█████▏    | 602/1171 [04:57<06:06,  1.55it/s][A
epoch 1 iter 602: train loss 0.99344. lr 5.071228e-04:  51%|█████▏    | 602/1171 [04:58<06:06,  1.55it/s][A
epoch 1 iter 602: train loss 0.99344. lr 5.071228e-04:  51%|█████▏    | 603/1171 [04:58<06:09,  1.54it/s][A
epoch 1 iter 603: train loss 1.00277. lr 5.068315e-04:  51%|█████▏    | 603/1171 [04:59<06:09,  1.54it/s][A
epoch 1 iter 603: train loss 1.00277. lr 5.068315e-04:  52%|█████▏    | 604/1171 [04:59<06:12,  1.52it/s][A
epoch 1 iter 604: train loss 1.01468. lr 5.065398e-04:  52%|█████▏    | 604/1171 [04:59<06:12,  1.52it/s][A
epoch 1 iter 604: train loss 1.01468. lr 5.065398e-04:  52%|█████▏    | 605/1171 [04:59<05:53,  1.60it/s][A
epoch 1 iter 605: train loss 0.98439. lr 5.062477e-04:  52%|█████▏    | 605/1171 [05:00<05:53,  1.60it/s][A
epoch 1 iter 605: train loss 0.98439. lr 5.062477e-04:  52%|█████▏    | 606/1171 [05:00<05:39,  1.66it/s][A

epoch 1 iter 639: train loss 0.94457. lr 4.960996e-04:  55%|█████▍    | 639/1171 [05:15<05:01,  1.76it/s][A
epoch 1 iter 639: train loss 0.94457. lr 4.960996e-04:  55%|█████▍    | 640/1171 [05:15<04:41,  1.89it/s][A
epoch 1 iter 640: train loss 0.93477. lr 4.957948e-04:  55%|█████▍    | 640/1171 [05:16<04:41,  1.89it/s][A
epoch 1 iter 640: train loss 0.93477. lr 4.957948e-04:  55%|█████▍    | 641/1171 [05:16<04:24,  2.01it/s][A
epoch 1 iter 641: train loss 0.91654. lr 4.954897e-04:  55%|█████▍    | 641/1171 [05:16<04:24,  2.01it/s][A
epoch 1 iter 641: train loss 0.91654. lr 4.954897e-04:  55%|█████▍    | 642/1171 [05:16<04:10,  2.11it/s][A
epoch 1 iter 642: train loss 0.93210. lr 4.951843e-04:  55%|█████▍    | 642/1171 [05:17<04:10,  2.11it/s][A
epoch 1 iter 642: train loss 0.93210. lr 4.951843e-04:  55%|█████▍    | 643/1171 [05:17<04:00,  2.20it/s][A
epoch 1 iter 643: train loss 0.92149. lr 4.948785e-04:  55%|█████▍    | 643/1171 [05:17<04:00,  2.20it/s][A

epoch 2 iter 1014: train loss 0.16978. lr 6.000000e-05:  87%|████████▋ | 1015/1171 [07:41<01:04,  2.43it/s][A
epoch 2 iter 1015: train loss 0.17033. lr 6.000000e-05:  87%|████████▋ | 1015/1171 [07:41<01:04,  2.43it/s][A
epoch 2 iter 1015: train loss 0.17033. lr 6.000000e-05:  87%|████████▋ | 1016/1171 [07:41<01:03,  2.46it/s][A
epoch 2 iter 1016: train loss 0.17093. lr 6.000000e-05:  87%|████████▋ | 1016/1171 [07:41<01:03,  2.46it/s][A
epoch 2 iter 1016: train loss 0.17093. lr 6.000000e-05:  87%|████████▋ | 1017/1171 [07:41<01:02,  2.47it/s][A
epoch 2 iter 1017: train loss 0.16763. lr 6.000000e-05:  87%|████████▋ | 1017/1171 [07:42<01:02,  2.47it/s][A
epoch 2 iter 1017: train loss 0.16763. lr 6.000000e-05:  87%|████████▋ | 1018/1171 [07:42<01:01,  2.49it/s][A
epoch 2 iter 1018: train loss 0.16771. lr 6.000000e-05:  87%|████████▋ | 1018/1171 [07:42<01:01,  2.49it/s][A
epoch 2 iter 1018: train loss 0.16771. lr 6.000000e-05:  87%|████████▋ | 1019/1171 [07:42<01:00,  2.50it/s][A

epoch 2 iter 1051: train loss 0.17062. lr 6.000000e-05:  90%|████████▉ | 1051/1171 [07:56<00:49,  2.42it/s][A
epoch 2 iter 1051: train loss 0.17062. lr 6.000000e-05:  90%|████████▉ | 1052/1171 [07:56<00:48,  2.44it/s][A
epoch 2 iter 1052: train loss 0.16897. lr 6.000000e-05:  90%|████████▉ | 1052/1171 [07:56<00:48,  2.44it/s][A
epoch 2 iter 1052: train loss 0.16897. lr 6.000000e-05:  90%|████████▉ | 1053/1171 [07:56<00:47,  2.48it/s][A
epoch 2 iter 1053: train loss 0.16987. lr 6.000000e-05:  90%|████████▉ | 1053/1171 [07:56<00:47,  2.48it/s][A
epoch 2 iter 1053: train loss 0.16987. lr 6.000000e-05:  90%|█████████ | 1054/1171 [07:56<00:46,  2.50it/s][A
epoch 2 iter 1054: train loss 0.16749. lr 6.000000e-05:  90%|█████████ | 1054/1171 [07:57<00:46,  2.50it/s][A
epoch 2 iter 1054: train loss 0.16749. lr 6.000000e-05:  90%|█████████ | 1055/1171 [07:57<00:45,  2.53it/s][A
epoch 2 iter 1055: train loss 0.16994. lr 6.000000e-05:  90%|█████████ | 1055/1171 [07:57<00:45,  2.53it/s][A

epoch 2 iter 1087: train loss 0.16801. lr 6.000000e-05:  93%|█████████▎| 1088/1171 [08:14<00:34,  2.42it/s][A
epoch 2 iter 1088: train loss 0.17032. lr 6.000000e-05:  93%|█████████▎| 1088/1171 [08:15<00:34,  2.42it/s][A
epoch 2 iter 1088: train loss 0.17032. lr 6.000000e-05:  93%|█████████▎| 1089/1171 [08:15<00:34,  2.41it/s][A
epoch 2 iter 1089: train loss 0.16940. lr 6.000000e-05:  93%|█████████▎| 1089/1171 [08:15<00:34,  2.41it/s][A
epoch 2 iter 1089: train loss 0.16940. lr 6.000000e-05:  93%|█████████▎| 1090/1171 [08:15<00:33,  2.40it/s][A
epoch 2 iter 1090: train loss 0.16766. lr 6.000000e-05:  93%|█████████▎| 1090/1171 [08:16<00:33,  2.40it/s][A
epoch 2 iter 1090: train loss 0.16766. lr 6.000000e-05:  93%|█████████▎| 1091/1171 [08:16<00:38,  2.10it/s][A
epoch 2 iter 1091: train loss 0.16631. lr 6.000000e-05:  93%|█████████▎| 1091/1171 [08:16<00:38,  2.10it/s][A
epoch 2 iter 1091: train loss 0.16631. lr 6.000000e-05:  93%|█████████▎| 1092/1171 [08:16<00:38,  2.07it/s][A

epoch 2 iter 1124: train loss 0.16394. lr 6.000000e-05:  96%|█████████▌| 1124/1171 [08:30<00:18,  2.52it/s][A
epoch 2 iter 1124: train loss 0.16394. lr 6.000000e-05:  96%|█████████▌| 1125/1171 [08:30<00:18,  2.43it/s][A
epoch 2 iter 1125: train loss 0.16690. lr 6.000000e-05:  96%|█████████▌| 1125/1171 [08:30<00:18,  2.43it/s][A
epoch 2 iter 1125: train loss 0.16690. lr 6.000000e-05:  96%|█████████▌| 1126/1171 [08:30<00:19,  2.35it/s][A
epoch 2 iter 1126: train loss 0.16880. lr 6.000000e-05:  96%|█████████▌| 1126/1171 [08:31<00:19,  2.35it/s][A
epoch 2 iter 1126: train loss 0.16880. lr 6.000000e-05:  96%|█████████▌| 1127/1171 [08:31<00:18,  2.33it/s][A
epoch 2 iter 1127: train loss 0.16759. lr 6.000000e-05:  96%|█████████▌| 1127/1171 [08:32<00:18,  2.33it/s][A
epoch 2 iter 1127: train loss 0.16759. lr 6.000000e-05:  96%|█████████▋| 1128/1171 [08:32<00:20,  2.09it/s][A
epoch 2 iter 1128: train loss 0.16713. lr 6.000000e-05:  96%|█████████▋| 1128/1171 [08:32<00:20,  2.09it/s][A

epoch 2 iter 1160: train loss 0.16608. lr 6.000000e-05:  99%|█████████▉| 1161/1171 [08:46<00:04,  2.19it/s][A
epoch 2 iter 1161: train loss 0.16599. lr 6.000000e-05:  99%|█████████▉| 1161/1171 [08:46<00:04,  2.19it/s][A
epoch 2 iter 1161: train loss 0.16599. lr 6.000000e-05:  99%|█████████▉| 1162/1171 [08:46<00:04,  2.11it/s][A
epoch 2 iter 1162: train loss 0.16931. lr 6.000000e-05:  99%|█████████▉| 1162/1171 [08:47<00:04,  2.11it/s][A
epoch 2 iter 1162: train loss 0.16931. lr 6.000000e-05:  99%|█████████▉| 1163/1171 [08:47<00:03,  2.05it/s][A
epoch 2 iter 1163: train loss 0.16428. lr 6.000000e-05:  99%|█████████▉| 1163/1171 [08:47<00:03,  2.05it/s][A
epoch 2 iter 1163: train loss 0.16428. lr 6.000000e-05:  99%|█████████▉| 1164/1171 [08:47<00:03,  2.01it/s][A
epoch 2 iter 1164: train loss 0.16467. lr 6.000000e-05:  99%|█████████▉| 1164/1171 [08:48<00:03,  2.01it/s][A
epoch 2 iter 1164: train loss 0.16467. lr 6.000000e-05:  99%|█████████▉| 1165/1171 [08:48<00:02,  2.03it/s][A

data has 2609807 characters, 266 unique.



  0%|          | 0/5098 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.75955. lr 6.000000e-04:   0%|          | 0/5098 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.75955. lr 6.000000e-04:   0%|          | 1/5098 [00:00<41:24,  2.05it/s][A
epoch 1 iter 1: train loss 4.40882. lr 6.000000e-04:   0%|          | 1/5098 [00:00<41:24,  2.05it/s][A
epoch 1 iter 1: train loss 4.40882. lr 6.000000e-04:   0%|          | 2/5098 [00:00<40:04,  2.12it/s][A
epoch 1 iter 2: train loss 4.08023. lr 5.999999e-04:   0%|          | 2/5098 [00:01<40:04,  2.12it/s][A
epoch 1 iter 2: train loss 4.08023. lr 5.999999e-04:   0%|          | 3/5098 [00:01<38:52,  2.18it/s][A
epoch 1 iter 3: train loss 3.80968. lr 5.999998e-04:   0%|          | 3/5098 [00:01<38:52,  2.18it/s][A
epoch 1 iter 3: train loss 3.80968. lr 5.999998e-04:   0%|          | 4/5098 [00:01<36:57,  2.30it/s][A
epoch 1 iter 4: train loss 3.68302. lr 5.999997e-04:   0%|          | 4/5098 [00:02<36:57,  2.30it/s][A

epoch 1 iter 38: train loss 2.77425. lr 5.999785e-04:   1%|          | 38/5098 [00:17<34:58,  2.41it/s][A
epoch 1 iter 38: train loss 2.77425. lr 5.999785e-04:   1%|          | 39/5098 [00:17<35:19,  2.39it/s][A
epoch 1 iter 39: train loss 2.76551. lr 5.999774e-04:   1%|          | 39/5098 [00:17<35:19,  2.39it/s][A
epoch 1 iter 39: train loss 2.76551. lr 5.999774e-04:   1%|          | 40/5098 [00:17<34:42,  2.43it/s][A
epoch 1 iter 40: train loss 2.76827. lr 5.999762e-04:   1%|          | 40/5098 [00:17<34:42,  2.43it/s][A
epoch 1 iter 40: train loss 2.76827. lr 5.999762e-04:   1%|          | 41/5098 [00:17<34:21,  2.45it/s][A
epoch 1 iter 41: train loss 2.74810. lr 5.999751e-04:   1%|          | 41/5098 [00:18<34:21,  2.45it/s][A
epoch 1 iter 41: train loss 2.74810. lr 5.999751e-04:   1%|          | 42/5098 [00:18<34:06,  2.47it/s][A
epoch 1 iter 42: train loss 2.75843. lr 5.999738e-04:   1%|          | 42/5098 [00:18<34:06,  2.47it/s][A

epoch 1 iter 76: train loss 2.65843. lr 5.999159e-04:   1%|▏         | 76/5098 [00:36<34:58,  2.39it/s][A
epoch 1 iter 76: train loss 2.65843. lr 5.999159e-04:   2%|▏         | 77/5098 [00:36<34:25,  2.43it/s][A
epoch 1 iter 77: train loss 2.62061. lr 5.999137e-04:   2%|▏         | 77/5098 [00:36<34:25,  2.43it/s][A
epoch 1 iter 77: train loss 2.62061. lr 5.999137e-04:   2%|▏         | 78/5098 [00:36<34:04,  2.45it/s][A
epoch 1 iter 78: train loss 2.62352. lr 5.999114e-04:   2%|▏         | 78/5098 [00:37<34:04,  2.45it/s][A
epoch 1 iter 78: train loss 2.62352. lr 5.999114e-04:   2%|▏         | 79/5098 [00:37<33:52,  2.47it/s][A
epoch 1 iter 79: train loss 2.63343. lr 5.999092e-04:   2%|▏         | 79/5098 [00:37<33:52,  2.47it/s][A
epoch 1 iter 79: train loss 2.63343. lr 5.999092e-04:   2%|▏         | 80/5098 [00:37<33:43,  2.48it/s][A
epoch 1 iter 80: train loss 2.61802. lr 5.999069e-04:   2%|▏         | 80/5098 [00:37<33:43,  2.48it/s][A

epoch 1 iter 114: train loss 2.56198. lr 5.998121e-04:   2%|▏         | 114/5098 [00:53<39:51,  2.08it/s][A
epoch 1 iter 114: train loss 2.56198. lr 5.998121e-04:   2%|▏         | 115/5098 [00:53<41:03,  2.02it/s][A
epoch 1 iter 115: train loss 2.56823. lr 5.998088e-04:   2%|▏         | 115/5098 [00:53<41:03,  2.02it/s][A
epoch 1 iter 115: train loss 2.56823. lr 5.998088e-04:   2%|▏         | 116/5098 [00:53<40:57,  2.03it/s][A
epoch 1 iter 116: train loss 2.56431. lr 5.998055e-04:   2%|▏         | 116/5098 [00:54<40:57,  2.03it/s][A
epoch 1 iter 116: train loss 2.56431. lr 5.998055e-04:   2%|▏         | 117/5098 [00:54<41:05,  2.02it/s][A
epoch 1 iter 117: train loss 2.56094. lr 5.998022e-04:   2%|▏         | 117/5098 [00:54<41:05,  2.02it/s][A
epoch 1 iter 117: train loss 2.56094. lr 5.998022e-04:   2%|▏         | 118/5098 [00:54<41:00,  2.02it/s][A
epoch 1 iter 118: train loss 2.56194. lr 5.997988e-04:   2%|▏         | 118/5098 [00:55<41:00,  2.02it/s][A

epoch 1 iter 151: train loss 2.50280. lr 5.996716e-04:   3%|▎         | 152/5098 [01:10<36:09,  2.28it/s][A
epoch 1 iter 152: train loss 2.49995. lr 5.996672e-04:   3%|▎         | 152/5098 [01:10<36:09,  2.28it/s][A
epoch 1 iter 152: train loss 2.49995. lr 5.996672e-04:   3%|▎         | 153/5098 [01:10<36:10,  2.28it/s][A
epoch 1 iter 153: train loss 2.50224. lr 5.996629e-04:   3%|▎         | 153/5098 [01:11<36:10,  2.28it/s][A
epoch 1 iter 153: train loss 2.50224. lr 5.996629e-04:   3%|▎         | 154/5098 [01:11<36:07,  2.28it/s][A
epoch 1 iter 154: train loss 2.49438. lr 5.996585e-04:   3%|▎         | 154/5098 [01:11<36:07,  2.28it/s][A
epoch 1 iter 154: train loss 2.49438. lr 5.996585e-04:   3%|▎         | 155/5098 [01:11<36:11,  2.28it/s][A
epoch 1 iter 155: train loss 2.48462. lr 5.996541e-04:   3%|▎         | 155/5098 [01:12<36:11,  2.28it/s][A
epoch 1 iter 155: train loss 2.48462. lr 5.996541e-04:   3%|▎         | 156/5098 [01:12<36:03,  2.28it/s][A

epoch 1 iter 189: train loss 2.36574. lr 5.994867e-04:   4%|▎         | 189/5098 [01:28<36:48,  2.22it/s][A
epoch 1 iter 189: train loss 2.36574. lr 5.994867e-04:   4%|▎         | 190/5098 [01:28<36:21,  2.25it/s][A
epoch 1 iter 190: train loss 2.35756. lr 5.994813e-04:   4%|▎         | 190/5098 [01:28<36:21,  2.25it/s][A
epoch 1 iter 190: train loss 2.35756. lr 5.994813e-04:   4%|▎         | 191/5098 [01:28<35:55,  2.28it/s][A
epoch 1 iter 191: train loss 2.34675. lr 5.994758e-04:   4%|▎         | 191/5098 [01:29<35:55,  2.28it/s][A
epoch 1 iter 191: train loss 2.34675. lr 5.994758e-04:   4%|▍         | 192/5098 [01:29<38:07,  2.14it/s][A
epoch 1 iter 192: train loss 2.34369. lr 5.994703e-04:   4%|▍         | 192/5098 [01:29<38:07,  2.14it/s][A
epoch 1 iter 192: train loss 2.34369. lr 5.994703e-04:   4%|▍         | 193/5098 [01:29<39:35,  2.07it/s][A
epoch 1 iter 193: train loss 2.35565. lr 5.994648e-04:   4%|▍         | 193/5098 [01:30<39:35,  2.07it/s][A

epoch 1 iter 226: train loss 2.23583. lr 5.992672e-04:   4%|▍         | 227/5098 [01:44<33:14,  2.44it/s][A
epoch 1 iter 227: train loss 2.24017. lr 5.992607e-04:   4%|▍         | 227/5098 [01:44<33:14,  2.44it/s][A
epoch 1 iter 227: train loss 2.24017. lr 5.992607e-04:   4%|▍         | 228/5098 [01:44<33:11,  2.45it/s][A
epoch 1 iter 228: train loss 2.23180. lr 5.992542e-04:   4%|▍         | 228/5098 [01:45<33:11,  2.45it/s][A
epoch 1 iter 228: train loss 2.23180. lr 5.992542e-04:   4%|▍         | 229/5098 [01:45<34:50,  2.33it/s][A
epoch 1 iter 229: train loss 2.23140. lr 5.992477e-04:   4%|▍         | 229/5098 [01:45<34:50,  2.33it/s][A
epoch 1 iter 229: train loss 2.23140. lr 5.992477e-04:   5%|▍         | 230/5098 [01:45<37:30,  2.16it/s][A
epoch 1 iter 230: train loss 2.23600. lr 5.992411e-04:   5%|▍         | 230/5098 [01:46<37:30,  2.16it/s][A
epoch 1 iter 230: train loss 2.23600. lr 5.992411e-04:   5%|▍         | 231/5098 [01:46<39:16,  2.07it/s][A

epoch 1 iter 264: train loss 2.08470. lr 5.990013e-04:   5%|▌         | 264/5098 [02:02<44:27,  1.81it/s][A
epoch 1 iter 264: train loss 2.08470. lr 5.990013e-04:   5%|▌         | 265/5098 [02:02<44:32,  1.81it/s][A
epoch 1 iter 265: train loss 2.08590. lr 5.989937e-04:   5%|▌         | 265/5098 [02:02<44:32,  1.81it/s][A
epoch 1 iter 265: train loss 2.08590. lr 5.989937e-04:   5%|▌         | 266/5098 [02:02<44:26,  1.81it/s][A
epoch 1 iter 266: train loss 2.12263. lr 5.989861e-04:   5%|▌         | 266/5098 [02:03<44:26,  1.81it/s][A
epoch 1 iter 266: train loss 2.12263. lr 5.989861e-04:   5%|▌         | 267/5098 [02:03<44:28,  1.81it/s][A
epoch 1 iter 267: train loss 2.09903. lr 5.989785e-04:   5%|▌         | 267/5098 [02:03<44:28,  1.81it/s][A
epoch 1 iter 267: train loss 2.09903. lr 5.989785e-04:   5%|▌         | 268/5098 [02:03<44:43,  1.80it/s][A
epoch 1 iter 268: train loss 2.09799. lr 5.989709e-04:   5%|▌         | 268/5098 [02:04<44:43,  1.80it/s][A

epoch 1 iter 301: train loss 1.98257. lr 5.987029e-04:   6%|▌         | 302/5098 [02:19<39:59,  2.00it/s][A
epoch 1 iter 302: train loss 1.96864. lr 5.986943e-04:   6%|▌         | 302/5098 [02:20<39:59,  2.00it/s][A
epoch 1 iter 302: train loss 1.96864. lr 5.986943e-04:   6%|▌         | 303/5098 [02:20<39:29,  2.02it/s][A
epoch 1 iter 303: train loss 1.95308. lr 5.986857e-04:   6%|▌         | 303/5098 [02:20<39:29,  2.02it/s][A
epoch 1 iter 303: train loss 1.95308. lr 5.986857e-04:   6%|▌         | 304/5098 [02:20<38:42,  2.06it/s][A
epoch 1 iter 304: train loss 1.96919. lr 5.986770e-04:   6%|▌         | 304/5098 [02:21<38:42,  2.06it/s][A
epoch 1 iter 304: train loss 1.96919. lr 5.986770e-04:   6%|▌         | 305/5098 [02:21<37:51,  2.11it/s][A
epoch 1 iter 305: train loss 1.96394. lr 5.986684e-04:   6%|▌         | 305/5098 [02:21<37:51,  2.11it/s][A
epoch 1 iter 305: train loss 1.96394. lr 5.986684e-04:   6%|▌         | 306/5098 [02:21<37:10,  2.15it/s][A

epoch 1 iter 339: train loss 1.83027. lr 5.983561e-04:   7%|▋         | 339/5098 [02:35<32:56,  2.41it/s][A
epoch 1 iter 339: train loss 1.83027. lr 5.983561e-04:   7%|▋         | 340/5098 [02:35<33:09,  2.39it/s][A
epoch 1 iter 340: train loss 1.83054. lr 5.983464e-04:   7%|▋         | 340/5098 [02:36<33:09,  2.39it/s][A
epoch 1 iter 340: train loss 1.83054. lr 5.983464e-04:   7%|▋         | 341/5098 [02:36<33:04,  2.40it/s][A
epoch 1 iter 341: train loss 1.87011. lr 5.983367e-04:   7%|▋         | 341/5098 [02:36<33:04,  2.40it/s][A
epoch 1 iter 341: train loss 1.87011. lr 5.983367e-04:   7%|▋         | 342/5098 [02:36<33:19,  2.38it/s][A
epoch 1 iter 342: train loss 1.82643. lr 5.983270e-04:   7%|▋         | 342/5098 [02:36<33:19,  2.38it/s][A
epoch 1 iter 342: train loss 1.82643. lr 5.983270e-04:   7%|▋         | 343/5098 [02:36<32:44,  2.42it/s][A
epoch 1 iter 343: train loss 1.80538. lr 5.983172e-04:   7%|▋         | 343/5098 [02:37<32:44,  2.42it/s][A

epoch 1 iter 376: train loss 1.74452. lr 5.979791e-04:   7%|▋         | 377/5098 [02:53<39:56,  1.97it/s][A
epoch 1 iter 377: train loss 1.73511. lr 5.979684e-04:   7%|▋         | 377/5098 [02:53<39:56,  1.97it/s][A
epoch 1 iter 377: train loss 1.73511. lr 5.979684e-04:   7%|▋         | 378/5098 [02:53<41:29,  1.90it/s][A
epoch 1 iter 378: train loss 1.73813. lr 5.979576e-04:   7%|▋         | 378/5098 [02:54<41:29,  1.90it/s][A
epoch 1 iter 378: train loss 1.73813. lr 5.979576e-04:   7%|▋         | 379/5098 [02:54<42:20,  1.86it/s][A
epoch 1 iter 379: train loss 1.70948. lr 5.979468e-04:   7%|▋         | 379/5098 [02:54<42:20,  1.86it/s][A
epoch 1 iter 379: train loss 1.70948. lr 5.979468e-04:   7%|▋         | 380/5098 [02:54<41:24,  1.90it/s][A
epoch 1 iter 380: train loss 1.71600. lr 5.979360e-04:   7%|▋         | 380/5098 [02:55<41:24,  1.90it/s][A
epoch 1 iter 380: train loss 1.71600. lr 5.979360e-04:   7%|▋         | 381/5098 [02:55<40:41,  1.93it/s][A

epoch 1 iter 414: train loss 1.65513. lr 5.975516e-04:   8%|▊         | 414/5098 [03:10<32:58,  2.37it/s][A
epoch 1 iter 414: train loss 1.65513. lr 5.975516e-04:   8%|▊         | 415/5098 [03:10<32:11,  2.42it/s][A
epoch 1 iter 415: train loss 1.64785. lr 5.975398e-04:   8%|▊         | 415/5098 [03:11<32:11,  2.42it/s][A
epoch 1 iter 415: train loss 1.64785. lr 5.975398e-04:   8%|▊         | 416/5098 [03:11<32:49,  2.38it/s][A
epoch 1 iter 416: train loss 1.63907. lr 5.975279e-04:   8%|▊         | 416/5098 [03:11<32:49,  2.38it/s][A
epoch 1 iter 416: train loss 1.63907. lr 5.975279e-04:   8%|▊         | 417/5098 [03:11<31:55,  2.44it/s][A
epoch 1 iter 417: train loss 1.65427. lr 5.975161e-04:   8%|▊         | 417/5098 [03:11<31:55,  2.44it/s][A
epoch 1 iter 417: train loss 1.65427. lr 5.975161e-04:   8%|▊         | 418/5098 [03:11<31:11,  2.50it/s][A
epoch 1 iter 418: train loss 1.63224. lr 5.975042e-04:   8%|▊         | 418/5098 [03:12<31:11,  2.50it/s][A

epoch 1 iter 451: train loss 1.56961. lr 5.970961e-04:   9%|▉         | 452/5098 [03:27<31:57,  2.42it/s][A
epoch 1 iter 452: train loss 1.57621. lr 5.970832e-04:   9%|▉         | 452/5098 [03:27<31:57,  2.42it/s][A
epoch 1 iter 452: train loss 1.57621. lr 5.970832e-04:   9%|▉         | 453/5098 [03:27<31:57,  2.42it/s][A
epoch 1 iter 453: train loss 1.57830. lr 5.970704e-04:   9%|▉         | 453/5098 [03:28<31:57,  2.42it/s][A
epoch 1 iter 453: train loss 1.57830. lr 5.970704e-04:   9%|▉         | 454/5098 [03:28<38:56,  1.99it/s][A
epoch 1 iter 454: train loss 1.55987. lr 5.970575e-04:   9%|▉         | 454/5098 [03:29<38:56,  1.99it/s][A
epoch 1 iter 454: train loss 1.55987. lr 5.970575e-04:   9%|▉         | 455/5098 [03:29<39:39,  1.95it/s][A
epoch 1 iter 455: train loss 1.54677. lr 5.970445e-04:   9%|▉         | 455/5098 [03:29<39:39,  1.95it/s][A
epoch 1 iter 455: train loss 1.54677. lr 5.970445e-04:   9%|▉         | 456/5098 [03:29<38:33,  2.01it/s][A

epoch 1 iter 489: train loss 1.50287. lr 5.965881e-04:  10%|▉         | 489/5098 [03:44<40:48,  1.88it/s][A
epoch 1 iter 489: train loss 1.50287. lr 5.965881e-04:  10%|▉         | 490/5098 [03:44<39:57,  1.92it/s][A
epoch 1 iter 490: train loss 1.51155. lr 5.965742e-04:  10%|▉         | 490/5098 [03:45<39:57,  1.92it/s][A
epoch 1 iter 490: train loss 1.51155. lr 5.965742e-04:  10%|▉         | 491/5098 [03:45<39:20,  1.95it/s][A
epoch 1 iter 491: train loss 1.50966. lr 5.965602e-04:  10%|▉         | 491/5098 [03:45<39:20,  1.95it/s][A
epoch 1 iter 491: train loss 1.50966. lr 5.965602e-04:  10%|▉         | 492/5098 [03:45<38:53,  1.97it/s][A
epoch 1 iter 492: train loss 1.51379. lr 5.965462e-04:  10%|▉         | 492/5098 [03:46<38:53,  1.97it/s][A
epoch 1 iter 492: train loss 1.51379. lr 5.965462e-04:  10%|▉         | 493/5098 [03:46<38:34,  1.99it/s][A
epoch 1 iter 493: train loss 1.50366. lr 5.965322e-04:  10%|▉         | 493/5098 [03:46<38:34,  1.99it/s][A

epoch 1 iter 526: train loss 1.45165. lr 5.960543e-04:  10%|█         | 527/5098 [04:01<32:51,  2.32it/s][A
epoch 1 iter 527: train loss 1.46728. lr 5.960394e-04:  10%|█         | 527/5098 [04:01<32:51,  2.32it/s][A
epoch 1 iter 527: train loss 1.46728. lr 5.960394e-04:  10%|█         | 528/5098 [04:01<33:10,  2.30it/s][A
epoch 1 iter 528: train loss 1.45815. lr 5.960244e-04:  10%|█         | 528/5098 [04:02<33:10,  2.30it/s][A
epoch 1 iter 528: train loss 1.45815. lr 5.960244e-04:  10%|█         | 529/5098 [04:02<33:40,  2.26it/s][A
epoch 1 iter 529: train loss 1.46623. lr 5.960094e-04:  10%|█         | 529/5098 [04:02<33:40,  2.26it/s][A
epoch 1 iter 529: train loss 1.46623. lr 5.960094e-04:  10%|█         | 530/5098 [04:02<33:39,  2.26it/s][A
epoch 1 iter 530: train loss 1.45520. lr 5.959943e-04:  10%|█         | 530/5098 [04:02<33:39,  2.26it/s][A
epoch 1 iter 530: train loss 1.45520. lr 5.959943e-04:  10%|█         | 531/5098 [04:02<33:38,  2.26it/s][A

epoch 1 iter 564: train loss 1.41868. lr 5.954661e-04:  11%|█         | 564/5098 [04:18<32:17,  2.34it/s][A
epoch 1 iter 564: train loss 1.41868. lr 5.954661e-04:  11%|█         | 565/5098 [04:18<31:54,  2.37it/s][A
epoch 1 iter 565: train loss 1.40238. lr 5.954501e-04:  11%|█         | 565/5098 [04:18<31:54,  2.37it/s][A
epoch 1 iter 565: train loss 1.40238. lr 5.954501e-04:  11%|█         | 566/5098 [04:18<31:37,  2.39it/s][A
epoch 1 iter 566: train loss 1.40695. lr 5.954340e-04:  11%|█         | 566/5098 [04:18<31:37,  2.39it/s][A
epoch 1 iter 566: train loss 1.40695. lr 5.954340e-04:  11%|█         | 567/5098 [04:18<31:24,  2.40it/s][A
epoch 1 iter 567: train loss 1.40025. lr 5.954180e-04:  11%|█         | 567/5098 [04:19<31:24,  2.40it/s][A
epoch 1 iter 567: train loss 1.40025. lr 5.954180e-04:  11%|█         | 568/5098 [04:19<31:20,  2.41it/s][A
epoch 1 iter 568: train loss 1.39103. lr 5.954018e-04:  11%|█         | 568/5098 [04:19<31:20,  2.41it/s][A

epoch 1 iter 601: train loss 1.36309. lr 5.948544e-04:  12%|█▏        | 602/5098 [04:34<32:00,  2.34it/s][A
epoch 1 iter 602: train loss 1.35828. lr 5.948374e-04:  12%|█▏        | 602/5098 [04:34<32:00,  2.34it/s][A
epoch 1 iter 602: train loss 1.35828. lr 5.948374e-04:  12%|█▏        | 603/5098 [04:34<30:55,  2.42it/s][A
epoch 1 iter 603: train loss 1.42326. lr 5.948203e-04:  12%|█▏        | 603/5098 [04:34<30:55,  2.42it/s][A
epoch 1 iter 603: train loss 1.42326. lr 5.948203e-04:  12%|█▏        | 604/5098 [04:34<30:22,  2.47it/s][A
epoch 1 iter 604: train loss 1.36801. lr 5.948032e-04:  12%|█▏        | 604/5098 [04:35<30:22,  2.47it/s][A
epoch 1 iter 604: train loss 1.36801. lr 5.948032e-04:  12%|█▏        | 605/5098 [04:35<30:36,  2.45it/s][A
epoch 1 iter 605: train loss 1.40714. lr 5.947860e-04:  12%|█▏        | 605/5098 [04:35<30:36,  2.45it/s][A
epoch 1 iter 605: train loss 1.40714. lr 5.947860e-04:  12%|█▏        | 606/5098 [04:35<31:07,  2.41it/s][A

epoch 1 iter 639: train loss 1.37277. lr 5.941863e-04:  13%|█▎        | 639/5098 [04:52<37:10,  2.00it/s][A
epoch 1 iter 639: train loss 1.37277. lr 5.941863e-04:  13%|█▎        | 640/5098 [04:52<39:06,  1.90it/s][A
epoch 1 iter 640: train loss 1.35349. lr 5.941682e-04:  13%|█▎        | 640/5098 [04:52<39:06,  1.90it/s][A
epoch 1 iter 640: train loss 1.35349. lr 5.941682e-04:  13%|█▎        | 641/5098 [04:52<40:04,  1.85it/s][A
epoch 1 iter 641: train loss 1.34005. lr 5.941500e-04:  13%|█▎        | 641/5098 [04:53<40:04,  1.85it/s][A
epoch 1 iter 641: train loss 1.34005. lr 5.941500e-04:  13%|█▎        | 642/5098 [04:53<39:49,  1.86it/s][A
epoch 1 iter 642: train loss 1.34641. lr 5.941319e-04:  13%|█▎        | 642/5098 [04:53<39:49,  1.86it/s][A
epoch 1 iter 642: train loss 1.34641. lr 5.941319e-04:  13%|█▎        | 643/5098 [04:53<38:21,  1.94it/s][A
epoch 1 iter 643: train loss 1.32808. lr 5.941137e-04:  13%|█▎        | 643/5098 [04:54<38:21,  1.94it/s][A

epoch 1 iter 676: train loss 1.31350. lr 5.934970e-04:  13%|█▎        | 677/5098 [05:09<32:39,  2.26it/s][A
epoch 1 iter 677: train loss 1.31322. lr 5.934779e-04:  13%|█▎        | 677/5098 [05:10<32:39,  2.26it/s][A
epoch 1 iter 677: train loss 1.31322. lr 5.934779e-04:  13%|█▎        | 678/5098 [05:10<32:29,  2.27it/s][A
epoch 1 iter 678: train loss 1.29905. lr 5.934587e-04:  13%|█▎        | 678/5098 [05:10<32:29,  2.27it/s][A
epoch 1 iter 678: train loss 1.29905. lr 5.934587e-04:  13%|█▎        | 679/5098 [05:10<32:13,  2.29it/s][A
epoch 1 iter 679: train loss 1.29584. lr 5.934395e-04:  13%|█▎        | 679/5098 [05:10<32:13,  2.29it/s][A
epoch 1 iter 679: train loss 1.29584. lr 5.934395e-04:  13%|█▎        | 680/5098 [05:10<31:47,  2.32it/s][A
epoch 1 iter 680: train loss 1.31906. lr 5.934202e-04:  13%|█▎        | 680/5098 [05:11<31:47,  2.32it/s][A
epoch 1 iter 680: train loss 1.31906. lr 5.934202e-04:  13%|█▎        | 681/5098 [05:11<32:14,  2.28it/s][A

epoch 1 iter 714: train loss 1.33300. lr 5.927494e-04:  14%|█▍        | 714/5098 [05:26<31:28,  2.32it/s][A
epoch 1 iter 714: train loss 1.33300. lr 5.927494e-04:  14%|█▍        | 715/5098 [05:26<31:16,  2.34it/s][A
epoch 1 iter 715: train loss 1.29449. lr 5.927292e-04:  14%|█▍        | 715/5098 [05:26<31:16,  2.34it/s][A
epoch 1 iter 715: train loss 1.29449. lr 5.927292e-04:  14%|█▍        | 716/5098 [05:26<30:55,  2.36it/s][A
epoch 1 iter 716: train loss 1.29595. lr 5.927089e-04:  14%|█▍        | 716/5098 [05:27<30:55,  2.36it/s][A
epoch 1 iter 716: train loss 1.29595. lr 5.927089e-04:  14%|█▍        | 717/5098 [05:27<30:46,  2.37it/s][A
epoch 1 iter 717: train loss 1.28727. lr 5.926886e-04:  14%|█▍        | 717/5098 [05:27<30:46,  2.37it/s][A
epoch 1 iter 717: train loss 1.28727. lr 5.926886e-04:  14%|█▍        | 718/5098 [05:27<30:59,  2.36it/s][A
epoch 1 iter 718: train loss 1.28003. lr 5.926683e-04:  14%|█▍        | 718/5098 [05:28<30:59,  2.36it/s][A

epoch 1 iter 751: train loss 1.28265. lr 5.919828e-04:  15%|█▍        | 752/5098 [05:42<28:55,  2.50it/s][A
epoch 1 iter 752: train loss 1.26941. lr 5.919616e-04:  15%|█▍        | 752/5098 [05:43<28:55,  2.50it/s][A
epoch 1 iter 752: train loss 1.26941. lr 5.919616e-04:  15%|█▍        | 753/5098 [05:43<29:16,  2.47it/s][A
epoch 1 iter 753: train loss 1.29093. lr 5.919403e-04:  15%|█▍        | 753/5098 [05:43<29:16,  2.47it/s][A
epoch 1 iter 753: train loss 1.29093. lr 5.919403e-04:  15%|█▍        | 754/5098 [05:43<29:33,  2.45it/s][A
epoch 1 iter 754: train loss 1.27498. lr 5.919190e-04:  15%|█▍        | 754/5098 [05:43<29:33,  2.45it/s][A
epoch 1 iter 754: train loss 1.27498. lr 5.919190e-04:  15%|█▍        | 755/5098 [05:43<30:01,  2.41it/s][A
epoch 1 iter 755: train loss 1.27396. lr 5.918977e-04:  15%|█▍        | 755/5098 [05:44<30:01,  2.41it/s][A
epoch 1 iter 755: train loss 1.27396. lr 5.918977e-04:  15%|█▍        | 756/5098 [05:44<29:32,  2.45it/s][A

epoch 1 iter 789: train loss 1.23462. lr 5.911560e-04:  15%|█▌        | 789/5098 [06:00<35:48,  2.01it/s][A
epoch 1 iter 789: train loss 1.23462. lr 5.911560e-04:  15%|█▌        | 790/5098 [06:00<35:01,  2.05it/s][A
epoch 1 iter 790: train loss 1.25433. lr 5.911337e-04:  15%|█▌        | 790/5098 [06:00<35:01,  2.05it/s][A
epoch 1 iter 790: train loss 1.25433. lr 5.911337e-04:  16%|█▌        | 791/5098 [06:00<34:25,  2.08it/s][A
epoch 1 iter 791: train loss 1.24105. lr 5.911114e-04:  16%|█▌        | 791/5098 [06:01<34:25,  2.08it/s][A
epoch 1 iter 791: train loss 1.24105. lr 5.911114e-04:  16%|█▌        | 792/5098 [06:01<34:00,  2.11it/s][A
epoch 1 iter 792: train loss 1.23705. lr 5.910891e-04:  16%|█▌        | 792/5098 [06:01<34:00,  2.11it/s][A
epoch 1 iter 792: train loss 1.23705. lr 5.910891e-04:  16%|█▌        | 793/5098 [06:01<33:46,  2.12it/s][A
epoch 1 iter 793: train loss 1.24858. lr 5.910667e-04:  16%|█▌        | 793/5098 [06:02<33:46,  2.12it/s][A

epoch 1 iter 826: train loss 1.23683. lr 5.903126e-04:  16%|█▌        | 827/5098 [06:16<28:44,  2.48it/s][A
epoch 1 iter 827: train loss 1.20517. lr 5.902893e-04:  16%|█▌        | 827/5098 [06:16<28:44,  2.48it/s][A
epoch 1 iter 827: train loss 1.20517. lr 5.902893e-04:  16%|█▌        | 828/5098 [06:16<29:21,  2.42it/s][A
epoch 1 iter 828: train loss 1.20939. lr 5.902660e-04:  16%|█▌        | 828/5098 [06:17<29:21,  2.42it/s][A
epoch 1 iter 828: train loss 1.20939. lr 5.902660e-04:  16%|█▋        | 829/5098 [06:17<30:17,  2.35it/s][A
epoch 1 iter 829: train loss 1.22319. lr 5.902426e-04:  16%|█▋        | 829/5098 [06:17<30:17,  2.35it/s][A
epoch 1 iter 829: train loss 1.22319. lr 5.902426e-04:  16%|█▋        | 830/5098 [06:17<29:40,  2.40it/s][A
epoch 1 iter 830: train loss 1.20807. lr 5.902192e-04:  16%|█▋        | 830/5098 [06:18<29:40,  2.40it/s][A
epoch 1 iter 830: train loss 1.20807. lr 5.902192e-04:  16%|█▋        | 831/5098 [06:18<29:07,  2.44it/s][A

epoch 1 iter 864: train loss 1.19872. lr 5.894071e-04:  17%|█▋        | 864/5098 [06:35<36:54,  1.91it/s][A
epoch 1 iter 864: train loss 1.19872. lr 5.894071e-04:  17%|█▋        | 865/5098 [06:35<35:42,  1.98it/s][A
epoch 1 iter 865: train loss 1.17956. lr 5.893828e-04:  17%|█▋        | 865/5098 [06:35<35:42,  1.98it/s][A
epoch 1 iter 865: train loss 1.17956. lr 5.893828e-04:  17%|█▋        | 866/5098 [06:35<34:47,  2.03it/s][A
epoch 1 iter 866: train loss 1.20313. lr 5.893584e-04:  17%|█▋        | 866/5098 [06:36<34:47,  2.03it/s][A
epoch 1 iter 866: train loss 1.20313. lr 5.893584e-04:  17%|█▋        | 867/5098 [06:36<35:41,  1.98it/s][A
epoch 1 iter 867: train loss 1.20023. lr 5.893340e-04:  17%|█▋        | 867/5098 [06:36<35:41,  1.98it/s][A
epoch 1 iter 867: train loss 1.20023. lr 5.893340e-04:  17%|█▋        | 868/5098 [06:36<34:40,  2.03it/s][A
epoch 1 iter 868: train loss 1.20941. lr 5.893095e-04:  17%|█▋        | 868/5098 [06:37<34:40,  2.03it/s][A

epoch 1 iter 901: train loss 1.18853. lr 5.884874e-04:  18%|█▊        | 902/5098 [06:51<30:57,  2.26it/s][A
epoch 1 iter 902: train loss 1.18034. lr 5.884620e-04:  18%|█▊        | 902/5098 [06:51<30:57,  2.26it/s][A
epoch 1 iter 902: train loss 1.18034. lr 5.884620e-04:  18%|█▊        | 903/5098 [06:51<30:49,  2.27it/s][A
epoch 1 iter 903: train loss 1.19588. lr 5.884366e-04:  18%|█▊        | 903/5098 [06:51<30:49,  2.27it/s][A
epoch 1 iter 903: train loss 1.19588. lr 5.884366e-04:  18%|█▊        | 904/5098 [06:51<30:43,  2.27it/s][A
epoch 1 iter 904: train loss 1.18978. lr 5.884111e-04:  18%|█▊        | 904/5098 [06:52<30:43,  2.27it/s][A
epoch 1 iter 904: train loss 1.18978. lr 5.884111e-04:  18%|█▊        | 905/5098 [06:52<30:59,  2.26it/s][A
epoch 1 iter 905: train loss 1.18099. lr 5.883857e-04:  18%|█▊        | 905/5098 [06:52<30:59,  2.26it/s][A
epoch 1 iter 905: train loss 1.18099. lr 5.883857e-04:  18%|█▊        | 906/5098 [06:52<30:26,  2.30it/s][A

epoch 1 iter 939: train loss 1.16439. lr 5.875037e-04:  18%|█▊        | 939/5098 [07:08<31:49,  2.18it/s][A
epoch 1 iter 939: train loss 1.16439. lr 5.875037e-04:  18%|█▊        | 940/5098 [07:08<31:52,  2.17it/s][A
epoch 1 iter 940: train loss 1.16832. lr 5.874772e-04:  18%|█▊        | 940/5098 [07:08<31:52,  2.17it/s][A
epoch 1 iter 940: train loss 1.16832. lr 5.874772e-04:  18%|█▊        | 941/5098 [07:08<31:53,  2.17it/s][A
epoch 1 iter 941: train loss 1.17341. lr 5.874508e-04:  18%|█▊        | 941/5098 [07:09<31:53,  2.17it/s][A
epoch 1 iter 941: train loss 1.17341. lr 5.874508e-04:  18%|█▊        | 942/5098 [07:09<31:50,  2.18it/s][A
epoch 1 iter 942: train loss 1.16192. lr 5.874243e-04:  18%|█▊        | 942/5098 [07:09<31:50,  2.18it/s][A
epoch 1 iter 942: train loss 1.16192. lr 5.874243e-04:  18%|█▊        | 943/5098 [07:09<31:13,  2.22it/s][A
epoch 1 iter 943: train loss 1.18005. lr 5.873978e-04:  18%|█▊        | 943/5098 [07:10<31:13,  2.22it/s][A

epoch 1 iter 976: train loss 1.13452. lr 5.865080e-04:  19%|█▉        | 977/5098 [07:25<37:31,  1.83it/s][A
epoch 1 iter 977: train loss 1.14289. lr 5.864805e-04:  19%|█▉        | 977/5098 [07:26<37:31,  1.83it/s][A
epoch 1 iter 977: train loss 1.14289. lr 5.864805e-04:  19%|█▉        | 978/5098 [07:26<36:56,  1.86it/s][A
epoch 1 iter 978: train loss 1.15659. lr 5.864531e-04:  19%|█▉        | 978/5098 [07:26<36:56,  1.86it/s][A
epoch 1 iter 978: train loss 1.15659. lr 5.864531e-04:  19%|█▉        | 979/5098 [07:26<36:25,  1.89it/s][A
epoch 1 iter 979: train loss 1.15216. lr 5.864256e-04:  19%|█▉        | 979/5098 [07:27<36:25,  1.89it/s][A
epoch 1 iter 979: train loss 1.15216. lr 5.864256e-04:  19%|█▉        | 980/5098 [07:27<36:01,  1.91it/s][A
epoch 1 iter 980: train loss 1.15115. lr 5.863981e-04:  19%|█▉        | 980/5098 [07:27<36:01,  1.91it/s][A
epoch 1 iter 980: train loss 1.15115. lr 5.863981e-04:  19%|█▉        | 981/5098 [07:27<35:06,  1.95it/s][A

epoch 1 iter 1013: train loss 1.12857. lr 5.854750e-04:  20%|█▉        | 1014/5098 [07:42<29:13,  2.33it/s][A
epoch 1 iter 1014: train loss 1.12462. lr 5.854466e-04:  20%|█▉        | 1014/5098 [07:43<29:13,  2.33it/s][A
epoch 1 iter 1014: train loss 1.12462. lr 5.854466e-04:  20%|█▉        | 1015/5098 [07:43<28:34,  2.38it/s][A
epoch 1 iter 1015: train loss 1.12326. lr 5.854181e-04:  20%|█▉        | 1015/5098 [07:43<28:34,  2.38it/s][A
epoch 1 iter 1015: train loss 1.12326. lr 5.854181e-04:  20%|█▉        | 1016/5098 [07:43<28:39,  2.37it/s][A
epoch 1 iter 1016: train loss 1.14549. lr 5.853896e-04:  20%|█▉        | 1016/5098 [07:43<28:39,  2.37it/s][A
epoch 1 iter 1016: train loss 1.14549. lr 5.853896e-04:  20%|█▉        | 1017/5098 [07:43<28:49,  2.36it/s][A
epoch 1 iter 1017: train loss 1.13023. lr 5.853611e-04:  20%|█▉        | 1017/5098 [07:44<28:49,  2.36it/s][A
epoch 1 iter 1017: train loss 1.13023. lr 5.853611e-04:  20%|█▉        | 1018/5098 [07:44<28:18,  2.40it/s][A

epoch 1 iter 1050: train loss 1.12149. lr 5.844049e-04:  21%|██        | 1050/5098 [07:59<28:33,  2.36it/s][A
epoch 1 iter 1050: train loss 1.12149. lr 5.844049e-04:  21%|██        | 1051/5098 [07:59<28:25,  2.37it/s][A
epoch 1 iter 1051: train loss 1.12842. lr 5.843755e-04:  21%|██        | 1051/5098 [08:00<28:25,  2.37it/s][A
epoch 1 iter 1051: train loss 1.12842. lr 5.843755e-04:  21%|██        | 1052/5098 [08:00<28:07,  2.40it/s][A
epoch 1 iter 1052: train loss 1.11844. lr 5.843460e-04:  21%|██        | 1052/5098 [08:00<28:07,  2.40it/s][A
epoch 1 iter 1052: train loss 1.11844. lr 5.843460e-04:  21%|██        | 1053/5098 [08:00<27:48,  2.42it/s][A
epoch 1 iter 1053: train loss 1.13368. lr 5.843165e-04:  21%|██        | 1053/5098 [08:00<27:48,  2.42it/s][A
epoch 1 iter 1053: train loss 1.13368. lr 5.843165e-04:  21%|██        | 1054/5098 [08:00<28:13,  2.39it/s][A
epoch 1 iter 1054: train loss 1.10590. lr 5.842870e-04:  21%|██        | 1054/5098 [08:01<28:13,  2.39it/s][A

epoch 1 iter 1086: train loss 1.09852. lr 5.833283e-04:  21%|██▏       | 1087/5098 [08:15<27:08,  2.46it/s][A
epoch 1 iter 1087: train loss 1.09607. lr 5.832979e-04:  21%|██▏       | 1087/5098 [08:16<27:08,  2.46it/s][A
epoch 1 iter 1087: train loss 1.09607. lr 5.832979e-04:  21%|██▏       | 1088/5098 [08:16<26:56,  2.48it/s][A
epoch 1 iter 1088: train loss 1.09729. lr 5.832675e-04:  21%|██▏       | 1088/5098 [08:16<26:56,  2.48it/s][A
epoch 1 iter 1088: train loss 1.09729. lr 5.832675e-04:  21%|██▏       | 1089/5098 [08:16<26:48,  2.49it/s][A
epoch 1 iter 1089: train loss 1.11506. lr 5.832370e-04:  21%|██▏       | 1089/5098 [08:17<26:48,  2.49it/s][A
epoch 1 iter 1089: train loss 1.11506. lr 5.832370e-04:  21%|██▏       | 1090/5098 [08:17<26:44,  2.50it/s][A
epoch 1 iter 1090: train loss 1.10237. lr 5.832065e-04:  21%|██▏       | 1090/5098 [08:17<26:44,  2.50it/s][A
epoch 1 iter 1090: train loss 1.10237. lr 5.832065e-04:  21%|██▏       | 1091/5098 [08:17<26:39,  2.50it/s][A

epoch 1 iter 1123: train loss 1.09770. lr 5.821854e-04:  22%|██▏       | 1123/5098 [08:32<29:43,  2.23it/s][A
epoch 1 iter 1123: train loss 1.09770. lr 5.821854e-04:  22%|██▏       | 1124/5098 [08:32<29:55,  2.21it/s][A
epoch 1 iter 1124: train loss 1.06599. lr 5.821540e-04:  22%|██▏       | 1124/5098 [08:32<29:55,  2.21it/s][A
epoch 1 iter 1124: train loss 1.06599. lr 5.821540e-04:  22%|██▏       | 1125/5098 [08:32<29:04,  2.28it/s][A
epoch 1 iter 1125: train loss 1.08848. lr 5.821226e-04:  22%|██▏       | 1125/5098 [08:32<29:04,  2.28it/s][A
epoch 1 iter 1125: train loss 1.08848. lr 5.821226e-04:  22%|██▏       | 1126/5098 [08:32<28:20,  2.34it/s][A
epoch 1 iter 1126: train loss 1.08725. lr 5.820911e-04:  22%|██▏       | 1126/5098 [08:33<28:20,  2.34it/s][A
epoch 1 iter 1126: train loss 1.08725. lr 5.820911e-04:  22%|██▏       | 1127/5098 [08:33<27:39,  2.39it/s][A
epoch 1 iter 1127: train loss 1.06880. lr 5.820596e-04:  22%|██▏       | 1127/5098 [08:33<27:39,  2.39it/s][A

epoch 1 iter 1159: train loss 1.06198. lr 5.810382e-04:  23%|██▎       | 1160/5098 [08:48<28:35,  2.30it/s][A
epoch 1 iter 1160: train loss 1.06608. lr 5.810058e-04:  23%|██▎       | 1160/5098 [08:48<28:35,  2.30it/s][A
epoch 1 iter 1160: train loss 1.06608. lr 5.810058e-04:  23%|██▎       | 1161/5098 [08:48<29:03,  2.26it/s][A
epoch 1 iter 1161: train loss 1.07517. lr 5.809734e-04:  23%|██▎       | 1161/5098 [08:49<29:03,  2.26it/s][A
epoch 1 iter 1161: train loss 1.07517. lr 5.809734e-04:  23%|██▎       | 1162/5098 [08:49<29:18,  2.24it/s][A
epoch 1 iter 1162: train loss 1.06426. lr 5.809410e-04:  23%|██▎       | 1162/5098 [08:49<29:18,  2.24it/s][A
epoch 1 iter 1162: train loss 1.06426. lr 5.809410e-04:  23%|██▎       | 1163/5098 [08:49<29:23,  2.23it/s][A
epoch 1 iter 1163: train loss 1.09464. lr 5.809086e-04:  23%|██▎       | 1163/5098 [08:49<29:23,  2.23it/s][A
epoch 1 iter 1163: train loss 1.09464. lr 5.809086e-04:  23%|██▎       | 1164/5098 [08:49<29:30,  2.22it/s][A

epoch 1 iter 1196: train loss 1.04625. lr 5.798230e-04:  23%|██▎       | 1196/5098 [09:04<31:40,  2.05it/s][A
epoch 1 iter 1196: train loss 1.04625. lr 5.798230e-04:  23%|██▎       | 1197/5098 [09:04<34:03,  1.91it/s][A
epoch 1 iter 1197: train loss 1.05523. lr 5.797897e-04:  23%|██▎       | 1197/5098 [09:04<34:03,  1.91it/s][A
epoch 1 iter 1197: train loss 1.05523. lr 5.797897e-04:  23%|██▎       | 1198/5098 [09:04<34:56,  1.86it/s][A
epoch 1 iter 1198: train loss 1.03962. lr 5.797563e-04:  23%|██▎       | 1198/5098 [09:05<34:56,  1.86it/s][A
epoch 1 iter 1198: train loss 1.03962. lr 5.797563e-04:  24%|██▎       | 1199/5098 [09:05<35:33,  1.83it/s][A
epoch 1 iter 1199: train loss 1.04529. lr 5.797229e-04:  24%|██▎       | 1199/5098 [09:06<35:33,  1.83it/s][A
epoch 1 iter 1199: train loss 1.04529. lr 5.797229e-04:  24%|██▎       | 1200/5098 [09:06<35:59,  1.81it/s][A
epoch 1 iter 1200: train loss 1.04630. lr 5.796895e-04:  24%|██▎       | 1200/5098 [09:06<35:59,  1.81it/s][A

epoch 1 iter 1232: train loss 1.02482. lr 5.786058e-04:  24%|██▍       | 1233/5098 [09:20<28:12,  2.28it/s][A
epoch 1 iter 1233: train loss 1.03905. lr 5.785715e-04:  24%|██▍       | 1233/5098 [09:21<28:12,  2.28it/s][A
epoch 1 iter 1233: train loss 1.03905. lr 5.785715e-04:  24%|██▍       | 1234/5098 [09:21<29:53,  2.15it/s][A
epoch 1 iter 1234: train loss 1.04165. lr 5.785372e-04:  24%|██▍       | 1234/5098 [09:21<29:53,  2.15it/s][A
epoch 1 iter 1234: train loss 1.04165. lr 5.785372e-04:  24%|██▍       | 1235/5098 [09:21<31:04,  2.07it/s][A
epoch 1 iter 1235: train loss 1.01965. lr 5.785028e-04:  24%|██▍       | 1235/5098 [09:22<31:04,  2.07it/s][A
epoch 1 iter 1235: train loss 1.01965. lr 5.785028e-04:  24%|██▍       | 1236/5098 [09:22<31:12,  2.06it/s][A
epoch 1 iter 1236: train loss 1.03700. lr 5.784685e-04:  24%|██▍       | 1236/5098 [09:22<31:12,  2.06it/s][A
epoch 1 iter 1236: train loss 1.03700. lr 5.784685e-04:  24%|██▍       | 1237/5098 [09:22<31:16,  2.06it/s][A

epoch 1 iter 1269: train loss 1.05053. lr 5.773191e-04:  25%|██▍       | 1269/5098 [09:38<34:47,  1.83it/s][A
epoch 1 iter 1269: train loss 1.05053. lr 5.773191e-04:  25%|██▍       | 1270/5098 [09:38<33:50,  1.89it/s][A
epoch 1 iter 1270: train loss 1.02834. lr 5.772838e-04:  25%|██▍       | 1270/5098 [09:38<33:50,  1.89it/s][A
epoch 1 iter 1270: train loss 1.02834. lr 5.772838e-04:  25%|██▍       | 1271/5098 [09:38<33:28,  1.91it/s][A
epoch 1 iter 1271: train loss 1.00382. lr 5.772485e-04:  25%|██▍       | 1271/5098 [09:39<33:28,  1.91it/s][A
epoch 1 iter 1271: train loss 1.00382. lr 5.772485e-04:  25%|██▍       | 1272/5098 [09:39<32:19,  1.97it/s][A
epoch 1 iter 1272: train loss 1.02158. lr 5.772132e-04:  25%|██▍       | 1272/5098 [09:39<32:19,  1.97it/s][A
epoch 1 iter 1272: train loss 1.02158. lr 5.772132e-04:  25%|██▍       | 1273/5098 [09:39<31:35,  2.02it/s][A
epoch 1 iter 1273: train loss 1.01105. lr 5.771778e-04:  25%|██▍       | 1273/5098 [09:40<31:35,  2.02it/s][A

epoch 1 iter 1305: train loss 1.00572. lr 5.760325e-04:  26%|██▌       | 1306/5098 [09:54<27:46,  2.28it/s][A
epoch 1 iter 1306: train loss 1.01345. lr 5.759963e-04:  26%|██▌       | 1306/5098 [09:55<27:46,  2.28it/s][A
epoch 1 iter 1306: train loss 1.01345. lr 5.759963e-04:  26%|██▌       | 1307/5098 [09:55<27:36,  2.29it/s][A
epoch 1 iter 1307: train loss 0.99905. lr 5.759600e-04:  26%|██▌       | 1307/5098 [09:55<27:36,  2.29it/s][A
epoch 1 iter 1307: train loss 0.99905. lr 5.759600e-04:  26%|██▌       | 1308/5098 [09:55<28:58,  2.18it/s][A
epoch 1 iter 1308: train loss 1.02075. lr 5.759237e-04:  26%|██▌       | 1308/5098 [09:56<28:58,  2.18it/s][A
epoch 1 iter 1308: train loss 1.02075. lr 5.759237e-04:  26%|██▌       | 1309/5098 [09:56<29:03,  2.17it/s][A
epoch 1 iter 1309: train loss 1.00630. lr 5.758874e-04:  26%|██▌       | 1309/5098 [09:56<29:03,  2.17it/s][A
epoch 1 iter 1309: train loss 1.00630. lr 5.758874e-04:  26%|██▌       | 1310/5098 [09:56<28:46,  2.19it/s][A

epoch 1 iter 1342: train loss 1.00210. lr 5.746747e-04:  26%|██▋       | 1342/5098 [10:11<26:47,  2.34it/s][A
epoch 1 iter 1342: train loss 1.00210. lr 5.746747e-04:  26%|██▋       | 1343/5098 [10:11<25:53,  2.42it/s][A
epoch 1 iter 1343: train loss 0.99688. lr 5.746376e-04:  26%|██▋       | 1343/5098 [10:11<25:53,  2.42it/s][A
epoch 1 iter 1343: train loss 0.99688. lr 5.746376e-04:  26%|██▋       | 1344/5098 [10:11<25:28,  2.46it/s][A
epoch 1 iter 1344: train loss 1.00961. lr 5.746003e-04:  26%|██▋       | 1344/5098 [10:12<25:28,  2.46it/s][A
epoch 1 iter 1344: train loss 1.00961. lr 5.746003e-04:  26%|██▋       | 1345/5098 [10:12<25:47,  2.43it/s][A
epoch 1 iter 1345: train loss 0.99154. lr 5.745631e-04:  26%|██▋       | 1345/5098 [10:12<25:47,  2.43it/s][A
epoch 1 iter 1345: train loss 0.99154. lr 5.745631e-04:  26%|██▋       | 1346/5098 [10:12<25:40,  2.44it/s][A
epoch 1 iter 1346: train loss 0.99768. lr 5.745258e-04:  26%|██▋       | 1346/5098 [10:12<25:40,  2.44it/s][A

epoch 1 iter 1378: train loss 0.97697. lr 5.733194e-04:  27%|██▋       | 1379/5098 [10:27<25:45,  2.41it/s][A
epoch 1 iter 1379: train loss 0.97170. lr 5.732813e-04:  27%|██▋       | 1379/5098 [10:28<25:45,  2.41it/s][A
epoch 1 iter 1379: train loss 0.97170. lr 5.732813e-04:  27%|██▋       | 1380/5098 [10:28<25:56,  2.39it/s][A
epoch 1 iter 1380: train loss 0.96374. lr 5.732431e-04:  27%|██▋       | 1380/5098 [10:28<25:56,  2.39it/s][A
epoch 1 iter 1380: train loss 0.96374. lr 5.732431e-04:  27%|██▋       | 1381/5098 [10:28<26:05,  2.37it/s][A
epoch 1 iter 1381: train loss 0.97720. lr 5.732050e-04:  27%|██▋       | 1381/5098 [10:28<26:05,  2.37it/s][A
epoch 1 iter 1381: train loss 0.97720. lr 5.732050e-04:  27%|██▋       | 1382/5098 [10:28<26:11,  2.37it/s][A
epoch 1 iter 1382: train loss 0.97575. lr 5.731668e-04:  27%|██▋       | 1382/5098 [10:29<26:11,  2.37it/s][A
epoch 1 iter 1382: train loss 0.97575. lr 5.731668e-04:  27%|██▋       | 1383/5098 [10:29<26:15,  2.36it/s][A

epoch 1 iter 1415: train loss 0.97190. lr 5.718914e-04:  28%|██▊       | 1415/5098 [10:43<26:12,  2.34it/s][A
epoch 1 iter 1415: train loss 0.97190. lr 5.718914e-04:  28%|██▊       | 1416/5098 [10:43<25:38,  2.39it/s][A
epoch 1 iter 1416: train loss 0.96320. lr 5.718523e-04:  28%|██▊       | 1416/5098 [10:43<25:38,  2.39it/s][A
epoch 1 iter 1416: train loss 0.96320. lr 5.718523e-04:  28%|██▊       | 1417/5098 [10:43<25:13,  2.43it/s][A
epoch 1 iter 1417: train loss 0.96741. lr 5.718132e-04:  28%|██▊       | 1417/5098 [10:43<25:13,  2.43it/s][A
epoch 1 iter 1417: train loss 0.96741. lr 5.718132e-04:  28%|██▊       | 1418/5098 [10:43<24:53,  2.46it/s][A
epoch 1 iter 1418: train loss 0.95985. lr 5.717741e-04:  28%|██▊       | 1418/5098 [10:44<24:53,  2.46it/s][A
epoch 1 iter 1418: train loss 0.95985. lr 5.717741e-04:  28%|██▊       | 1419/5098 [10:44<25:24,  2.41it/s][A
epoch 1 iter 1419: train loss 0.97138. lr 5.717349e-04:  28%|██▊       | 1419/5098 [10:44<25:24,  2.41it/s][A

epoch 1 iter 1451: train loss 0.95351. lr 5.704680e-04:  28%|██▊       | 1452/5098 [10:59<27:33,  2.21it/s][A
epoch 1 iter 1452: train loss 0.95765. lr 5.704280e-04:  28%|██▊       | 1452/5098 [10:59<27:33,  2.21it/s][A
epoch 1 iter 1452: train loss 0.95765. lr 5.704280e-04:  29%|██▊       | 1453/5098 [10:59<31:32,  1.93it/s][A
epoch 1 iter 1453: train loss 0.95200. lr 5.703880e-04:  29%|██▊       | 1453/5098 [11:00<31:32,  1.93it/s][A
epoch 1 iter 1453: train loss 0.95200. lr 5.703880e-04:  29%|██▊       | 1454/5098 [11:00<34:16,  1.77it/s][A
epoch 1 iter 1454: train loss 0.94521. lr 5.703479e-04:  29%|██▊       | 1454/5098 [11:01<34:16,  1.77it/s][A
epoch 1 iter 1454: train loss 0.94521. lr 5.703479e-04:  29%|██▊       | 1455/5098 [11:01<34:53,  1.74it/s][A
epoch 1 iter 1455: train loss 0.96168. lr 5.703078e-04:  29%|██▊       | 1455/5098 [11:01<34:53,  1.74it/s][A
epoch 1 iter 1455: train loss 0.96168. lr 5.703078e-04:  29%|██▊       | 1456/5098 [11:01<35:22,  1.72it/s][A

epoch 1 iter 1488: train loss 0.93308. lr 5.689704e-04:  29%|██▉       | 1488/5098 [11:16<27:16,  2.21it/s][A
epoch 1 iter 1488: train loss 0.93308. lr 5.689704e-04:  29%|██▉       | 1489/5098 [11:16<27:52,  2.16it/s][A
epoch 1 iter 1489: train loss 0.94561. lr 5.689295e-04:  29%|██▉       | 1489/5098 [11:17<27:52,  2.16it/s][A
epoch 1 iter 1489: train loss 0.94561. lr 5.689295e-04:  29%|██▉       | 1490/5098 [11:17<28:15,  2.13it/s][A
epoch 1 iter 1490: train loss 0.96353. lr 5.688885e-04:  29%|██▉       | 1490/5098 [11:17<28:15,  2.13it/s][A
epoch 1 iter 1490: train loss 0.96353. lr 5.688885e-04:  29%|██▉       | 1491/5098 [11:17<27:24,  2.19it/s][A
epoch 1 iter 1491: train loss 0.94869. lr 5.688475e-04:  29%|██▉       | 1491/5098 [11:18<27:24,  2.19it/s][A
epoch 1 iter 1491: train loss 0.94869. lr 5.688475e-04:  29%|██▉       | 1492/5098 [11:18<27:21,  2.20it/s][A
epoch 1 iter 1492: train loss 0.94092. lr 5.688064e-04:  29%|██▉       | 1492/5098 [11:18<27:21,  2.20it/s][A

epoch 1 iter 1524: train loss 0.93392. lr 5.674798e-04:  30%|██▉       | 1525/5098 [11:33<27:37,  2.16it/s][A
epoch 1 iter 1525: train loss 0.92486. lr 5.674379e-04:  30%|██▉       | 1525/5098 [11:33<27:37,  2.16it/s][A
epoch 1 iter 1525: train loss 0.92486. lr 5.674379e-04:  30%|██▉       | 1526/5098 [11:33<27:32,  2.16it/s][A
epoch 1 iter 1526: train loss 0.92189. lr 5.673960e-04:  30%|██▉       | 1526/5098 [11:33<27:32,  2.16it/s][A
epoch 1 iter 1526: train loss 0.92189. lr 5.673960e-04:  30%|██▉       | 1527/5098 [11:33<27:09,  2.19it/s][A
epoch 1 iter 1527: train loss 0.91297. lr 5.673541e-04:  30%|██▉       | 1527/5098 [11:34<27:09,  2.19it/s][A
epoch 1 iter 1527: train loss 0.91297. lr 5.673541e-04:  30%|██▉       | 1528/5098 [11:34<26:51,  2.22it/s][A
epoch 1 iter 1528: train loss 0.93167. lr 5.673121e-04:  30%|██▉       | 1528/5098 [11:34<26:51,  2.22it/s][A
epoch 1 iter 1528: train loss 0.93167. lr 5.673121e-04:  30%|██▉       | 1529/5098 [11:34<26:39,  2.23it/s][A

epoch 1 iter 1561: train loss 0.90935. lr 5.659134e-04:  31%|███       | 1561/5098 [11:48<24:04,  2.45it/s][A
epoch 1 iter 1561: train loss 0.90935. lr 5.659134e-04:  31%|███       | 1562/5098 [11:48<23:52,  2.47it/s][A
epoch 1 iter 1562: train loss 0.92045. lr 5.658706e-04:  31%|███       | 1562/5098 [11:48<23:52,  2.47it/s][A
epoch 1 iter 1562: train loss 0.92045. lr 5.658706e-04:  31%|███       | 1563/5098 [11:48<23:49,  2.47it/s][A
epoch 1 iter 1563: train loss 0.93626. lr 5.658277e-04:  31%|███       | 1563/5098 [11:49<23:49,  2.47it/s][A
epoch 1 iter 1563: train loss 0.93626. lr 5.658277e-04:  31%|███       | 1564/5098 [11:49<23:47,  2.48it/s][A
epoch 1 iter 1564: train loss 0.90341. lr 5.657848e-04:  31%|███       | 1564/5098 [11:49<23:47,  2.48it/s][A
epoch 1 iter 1564: train loss 0.90341. lr 5.657848e-04:  31%|███       | 1565/5098 [11:49<24:11,  2.43it/s][A
epoch 1 iter 1565: train loss 0.92391. lr 5.657420e-04:  31%|███       | 1565/5098 [11:50<24:11,  2.43it/s][A

epoch 1 iter 1597: train loss 0.92466. lr 5.643561e-04:  31%|███▏      | 1598/5098 [12:06<32:35,  1.79it/s][A
epoch 1 iter 1598: train loss 0.92726. lr 5.643124e-04:  31%|███▏      | 1598/5098 [12:06<32:35,  1.79it/s][A
epoch 1 iter 1598: train loss 0.92726. lr 5.643124e-04:  31%|███▏      | 1599/5098 [12:06<31:59,  1.82it/s][A
epoch 1 iter 1599: train loss 0.90211. lr 5.642686e-04:  31%|███▏      | 1599/5098 [12:07<31:59,  1.82it/s][A
epoch 1 iter 1599: train loss 0.90211. lr 5.642686e-04:  31%|███▏      | 1600/5098 [12:07<30:58,  1.88it/s][A
epoch 1 iter 1600: train loss 0.90348. lr 5.642249e-04:  31%|███▏      | 1600/5098 [12:07<30:58,  1.88it/s][A
epoch 1 iter 1600: train loss 0.90348. lr 5.642249e-04:  31%|███▏      | 1601/5098 [12:07<30:20,  1.92it/s][A
epoch 1 iter 1601: train loss 0.90991. lr 5.641811e-04:  31%|███▏      | 1601/5098 [12:08<30:20,  1.92it/s][A
epoch 1 iter 1601: train loss 0.90991. lr 5.641811e-04:  31%|███▏      | 1602/5098 [12:08<29:48,  1.96it/s][A

epoch 1 iter 1634: train loss 0.88880. lr 5.627217e-04:  32%|███▏      | 1634/5098 [12:22<23:45,  2.43it/s][A
epoch 1 iter 1634: train loss 0.88880. lr 5.627217e-04:  32%|███▏      | 1635/5098 [12:22<23:44,  2.43it/s][A
epoch 1 iter 1635: train loss 0.89169. lr 5.626771e-04:  32%|███▏      | 1635/5098 [12:23<23:44,  2.43it/s][A
epoch 1 iter 1635: train loss 0.89169. lr 5.626771e-04:  32%|███▏      | 1636/5098 [12:23<23:41,  2.44it/s][A
epoch 1 iter 1636: train loss 0.87920. lr 5.626324e-04:  32%|███▏      | 1636/5098 [12:23<23:41,  2.44it/s][A
epoch 1 iter 1636: train loss 0.87920. lr 5.626324e-04:  32%|███▏      | 1637/5098 [12:23<23:38,  2.44it/s][A
epoch 1 iter 1637: train loss 0.89024. lr 5.625877e-04:  32%|███▏      | 1637/5098 [12:24<23:38,  2.44it/s][A
epoch 1 iter 1637: train loss 0.89024. lr 5.625877e-04:  32%|███▏      | 1638/5098 [12:24<23:36,  2.44it/s][A
epoch 1 iter 1638: train loss 0.89277. lr 5.625430e-04:  32%|███▏      | 1638/5098 [12:24<23:36,  2.44it/s][A

epoch 1 iter 1670: train loss 0.85782. lr 5.610987e-04:  33%|███▎      | 1671/5098 [12:37<24:25,  2.34it/s][A
epoch 1 iter 1671: train loss 0.87771. lr 5.610531e-04:  33%|███▎      | 1671/5098 [12:38<24:25,  2.34it/s][A
epoch 1 iter 1671: train loss 0.87771. lr 5.610531e-04:  33%|███▎      | 1672/5098 [12:38<24:20,  2.35it/s][A
epoch 1 iter 1672: train loss 0.88011. lr 5.610076e-04:  33%|███▎      | 1672/5098 [12:38<24:20,  2.35it/s][A
epoch 1 iter 1672: train loss 0.88011. lr 5.610076e-04:  33%|███▎      | 1673/5098 [12:38<24:19,  2.35it/s][A
epoch 1 iter 1673: train loss 0.86439. lr 5.609620e-04:  33%|███▎      | 1673/5098 [12:39<24:19,  2.35it/s][A
epoch 1 iter 1673: train loss 0.86439. lr 5.609620e-04:  33%|███▎      | 1674/5098 [12:39<24:16,  2.35it/s][A
epoch 1 iter 1674: train loss 0.88656. lr 5.609164e-04:  33%|███▎      | 1674/5098 [12:39<24:16,  2.35it/s][A
epoch 1 iter 1674: train loss 0.88656. lr 5.609164e-04:  33%|███▎      | 1675/5098 [12:39<24:13,  2.35it/s][A

epoch 1 iter 1707: train loss 0.86485. lr 5.593971e-04:  33%|███▎      | 1707/5098 [12:53<23:31,  2.40it/s][A
epoch 1 iter 1707: train loss 0.86485. lr 5.593971e-04:  34%|███▎      | 1708/5098 [12:53<23:06,  2.45it/s][A
epoch 1 iter 1708: train loss 0.86929. lr 5.593506e-04:  34%|███▎      | 1708/5098 [12:54<23:06,  2.45it/s][A
epoch 1 iter 1708: train loss 0.86929. lr 5.593506e-04:  34%|███▎      | 1709/5098 [12:54<23:35,  2.39it/s][A
epoch 1 iter 1709: train loss 0.86806. lr 5.593041e-04:  34%|███▎      | 1709/5098 [12:54<23:35,  2.39it/s][A
epoch 1 iter 1709: train loss 0.86806. lr 5.593041e-04:  34%|███▎      | 1710/5098 [12:54<24:35,  2.30it/s][A
epoch 1 iter 1710: train loss 0.86242. lr 5.592576e-04:  34%|███▎      | 1710/5098 [12:55<24:35,  2.30it/s][A
epoch 1 iter 1710: train loss 0.86242. lr 5.592576e-04:  34%|███▎      | 1711/5098 [12:55<25:12,  2.24it/s][A
epoch 1 iter 1711: train loss 0.86549. lr 5.592111e-04:  34%|███▎      | 1711/5098 [12:55<25:12,  2.24it/s][A

epoch 1 iter 1743: train loss 0.85906. lr 5.577091e-04:  34%|███▍      | 1744/5098 [13:09<25:38,  2.18it/s][A
epoch 1 iter 1744: train loss 0.85484. lr 5.576617e-04:  34%|███▍      | 1744/5098 [13:10<25:38,  2.18it/s][A
epoch 1 iter 1744: train loss 0.85484. lr 5.576617e-04:  34%|███▍      | 1745/5098 [13:10<24:19,  2.30it/s][A
epoch 1 iter 1745: train loss 0.84058. lr 5.576144e-04:  34%|███▍      | 1745/5098 [13:10<24:19,  2.30it/s][A
epoch 1 iter 1745: train loss 0.84058. lr 5.576144e-04:  34%|███▍      | 1746/5098 [13:10<23:21,  2.39it/s][A
epoch 1 iter 1746: train loss 0.86208. lr 5.575670e-04:  34%|███▍      | 1746/5098 [13:10<23:21,  2.39it/s][A
epoch 1 iter 1746: train loss 0.86208. lr 5.575670e-04:  34%|███▍      | 1747/5098 [13:10<22:25,  2.49it/s][A
epoch 1 iter 1747: train loss 0.85498. lr 5.575196e-04:  34%|███▍      | 1747/5098 [13:11<22:25,  2.49it/s][A
epoch 1 iter 1747: train loss 0.85498. lr 5.575196e-04:  34%|███▍      | 1748/5098 [13:11<23:26,  2.38it/s][A

epoch 1 iter 1780: train loss 0.83841. lr 5.559412e-04:  35%|███▍      | 1780/5098 [13:27<25:32,  2.17it/s][A
epoch 1 iter 1780: train loss 0.83841. lr 5.559412e-04:  35%|███▍      | 1781/5098 [13:27<27:01,  2.05it/s][A
epoch 1 iter 1781: train loss 0.84065. lr 5.558929e-04:  35%|███▍      | 1781/5098 [13:28<27:01,  2.05it/s][A
epoch 1 iter 1781: train loss 0.84065. lr 5.558929e-04:  35%|███▍      | 1782/5098 [13:28<27:57,  1.98it/s][A
epoch 1 iter 1782: train loss 0.83819. lr 5.558446e-04:  35%|███▍      | 1782/5098 [13:28<27:57,  1.98it/s][A
epoch 1 iter 1782: train loss 0.83819. lr 5.558446e-04:  35%|███▍      | 1783/5098 [13:28<28:28,  1.94it/s][A
epoch 1 iter 1783: train loss 0.81633. lr 5.557964e-04:  35%|███▍      | 1783/5098 [13:29<28:28,  1.94it/s][A
epoch 1 iter 1783: train loss 0.81633. lr 5.557964e-04:  35%|███▍      | 1784/5098 [13:29<28:19,  1.95it/s][A
epoch 1 iter 1784: train loss 0.82869. lr 5.557480e-04:  35%|███▍      | 1784/5098 [13:29<28:19,  1.95it/s][A

epoch 1 iter 1816: train loss 0.82761. lr 5.541891e-04:  36%|███▌      | 1817/5098 [13:44<28:53,  1.89it/s][A
epoch 1 iter 1817: train loss 0.84019. lr 5.541400e-04:  36%|███▌      | 1817/5098 [13:44<28:53,  1.89it/s][A
epoch 1 iter 1817: train loss 0.84019. lr 5.541400e-04:  36%|███▌      | 1818/5098 [13:44<29:00,  1.88it/s][A
epoch 1 iter 1818: train loss 0.82742. lr 5.540908e-04:  36%|███▌      | 1818/5098 [13:45<29:00,  1.88it/s][A
epoch 1 iter 1818: train loss 0.82742. lr 5.540908e-04:  36%|███▌      | 1819/5098 [13:45<29:03,  1.88it/s][A
epoch 1 iter 1819: train loss 0.82422. lr 5.540417e-04:  36%|███▌      | 1819/5098 [13:45<29:03,  1.88it/s][A
epoch 1 iter 1819: train loss 0.82422. lr 5.540417e-04:  36%|███▌      | 1820/5098 [13:45<28:30,  1.92it/s][A
epoch 1 iter 1820: train loss 0.83905. lr 5.539925e-04:  36%|███▌      | 1820/5098 [13:46<28:30,  1.92it/s][A
epoch 1 iter 1820: train loss 0.83905. lr 5.539925e-04:  36%|███▌      | 1821/5098 [13:46<27:59,  1.95it/s][A

epoch 1 iter 1853: train loss 0.80803. lr 5.523557e-04:  36%|███▋      | 1853/5098 [14:01<21:52,  2.47it/s][A
epoch 1 iter 1853: train loss 0.80803. lr 5.523557e-04:  36%|███▋      | 1854/5098 [14:01<31:07,  1.74it/s][A
epoch 1 iter 1854: train loss 0.81398. lr 5.523057e-04:  36%|███▋      | 1854/5098 [14:02<31:07,  1.74it/s][A
epoch 1 iter 1854: train loss 0.81398. lr 5.523057e-04:  36%|███▋      | 1855/5098 [14:02<30:46,  1.76it/s][A
epoch 1 iter 1855: train loss 0.83132. lr 5.522557e-04:  36%|███▋      | 1855/5098 [14:02<30:46,  1.76it/s][A
epoch 1 iter 1855: train loss 0.83132. lr 5.522557e-04:  36%|███▋      | 1856/5098 [14:02<29:10,  1.85it/s][A
epoch 1 iter 1856: train loss 0.80046. lr 5.522056e-04:  36%|███▋      | 1856/5098 [14:03<29:10,  1.85it/s][A
epoch 1 iter 1856: train loss 0.80046. lr 5.522056e-04:  36%|███▋      | 1857/5098 [14:03<27:46,  1.94it/s][A
epoch 1 iter 1857: train loss 0.82310. lr 5.521555e-04:  36%|███▋      | 1857/5098 [14:03<27:46,  1.94it/s][A

epoch 1 iter 1889: train loss 0.80653. lr 5.505404e-04:  37%|███▋      | 1890/5098 [14:17<32:11,  1.66it/s][A
epoch 1 iter 1890: train loss 0.79872. lr 5.504895e-04:  37%|███▋      | 1890/5098 [14:18<32:11,  1.66it/s][A
epoch 1 iter 1890: train loss 0.79872. lr 5.504895e-04:  37%|███▋      | 1891/5098 [14:18<30:56,  1.73it/s][A
epoch 1 iter 1891: train loss 0.79821. lr 5.504387e-04:  37%|███▋      | 1891/5098 [14:18<30:56,  1.73it/s][A
epoch 1 iter 1891: train loss 0.79821. lr 5.504387e-04:  37%|███▋      | 1892/5098 [14:18<28:49,  1.85it/s][A
epoch 1 iter 1892: train loss 0.79464. lr 5.503877e-04:  37%|███▋      | 1892/5098 [14:19<28:49,  1.85it/s][A
epoch 1 iter 1892: train loss 0.79464. lr 5.503877e-04:  37%|███▋      | 1893/5098 [14:19<27:25,  1.95it/s][A
epoch 1 iter 1893: train loss 0.80525. lr 5.503368e-04:  37%|███▋      | 1893/5098 [14:19<27:25,  1.95it/s][A
epoch 1 iter 1893: train loss 0.80525. lr 5.503368e-04:  37%|███▋      | 1894/5098 [14:19<26:12,  2.04it/s][A

epoch 1 iter 1926: train loss 0.79077. lr 5.486425e-04:  38%|███▊      | 1926/5098 [14:33<21:39,  2.44it/s][A
epoch 1 iter 1926: train loss 0.79077. lr 5.486425e-04:  38%|███▊      | 1927/5098 [14:33<21:36,  2.45it/s][A
epoch 1 iter 1927: train loss 0.79860. lr 5.485908e-04:  38%|███▊      | 1927/5098 [14:33<21:36,  2.45it/s][A
epoch 1 iter 1927: train loss 0.79860. lr 5.485908e-04:  38%|███▊      | 1928/5098 [14:33<21:33,  2.45it/s][A
epoch 1 iter 1928: train loss 0.80232. lr 5.485390e-04:  38%|███▊      | 1928/5098 [14:34<21:33,  2.45it/s][A
epoch 1 iter 1928: train loss 0.80232. lr 5.485390e-04:  38%|███▊      | 1929/5098 [14:34<21:29,  2.46it/s][A
epoch 1 iter 1929: train loss 0.78525. lr 5.484872e-04:  38%|███▊      | 1929/5098 [14:34<21:29,  2.46it/s][A
epoch 1 iter 1929: train loss 0.78525. lr 5.484872e-04:  38%|███▊      | 1930/5098 [14:34<21:30,  2.45it/s][A
epoch 1 iter 1930: train loss 0.78433. lr 5.484354e-04:  38%|███▊      | 1930/5098 [14:34<21:30,  2.45it/s][A

epoch 1 iter 1962: train loss 0.77789. lr 5.467649e-04:  39%|███▊      | 1963/5098 [14:50<22:27,  2.33it/s][A
epoch 1 iter 1963: train loss 0.78773. lr 5.467124e-04:  39%|███▊      | 1963/5098 [14:51<22:27,  2.33it/s][A
epoch 1 iter 1963: train loss 0.78773. lr 5.467124e-04:  39%|███▊      | 1964/5098 [14:51<22:57,  2.28it/s][A
epoch 1 iter 1964: train loss 0.79199. lr 5.466597e-04:  39%|███▊      | 1964/5098 [14:51<22:57,  2.28it/s][A
epoch 1 iter 1964: train loss 0.79199. lr 5.466597e-04:  39%|███▊      | 1965/5098 [14:51<22:36,  2.31it/s][A
epoch 1 iter 1965: train loss 0.79294. lr 5.466071e-04:  39%|███▊      | 1965/5098 [14:52<22:36,  2.31it/s][A
epoch 1 iter 1965: train loss 0.79294. lr 5.466071e-04:  39%|███▊      | 1966/5098 [14:52<22:21,  2.34it/s][A
epoch 1 iter 1966: train loss 0.77119. lr 5.465544e-04:  39%|███▊      | 1966/5098 [14:53<22:21,  2.34it/s][A
epoch 1 iter 1966: train loss 0.77119. lr 5.465544e-04:  39%|███▊      | 1967/5098 [14:53<28:10,  1.85it/s][A

epoch 1 iter 1999: train loss 0.76295. lr 5.448035e-04:  39%|███▉      | 1999/5098 [15:07<24:51,  2.08it/s][A
epoch 1 iter 1999: train loss 0.76295. lr 5.448035e-04:  39%|███▉      | 2000/5098 [15:07<25:29,  2.03it/s][A
epoch 1 iter 2000: train loss 0.78198. lr 5.447501e-04:  39%|███▉      | 2000/5098 [15:08<25:29,  2.03it/s][A
epoch 1 iter 2000: train loss 0.78198. lr 5.447501e-04:  39%|███▉      | 2001/5098 [15:08<25:54,  1.99it/s][A
epoch 1 iter 2001: train loss 0.76632. lr 5.446966e-04:  39%|███▉      | 2001/5098 [15:08<25:54,  1.99it/s][A
epoch 1 iter 2001: train loss 0.76632. lr 5.446966e-04:  39%|███▉      | 2002/5098 [15:08<25:44,  2.00it/s][A
epoch 1 iter 2002: train loss 0.77670. lr 5.446431e-04:  39%|███▉      | 2002/5098 [15:09<25:44,  2.00it/s][A
epoch 1 iter 2002: train loss 0.77670. lr 5.446431e-04:  39%|███▉      | 2003/5098 [15:09<25:34,  2.02it/s][A
epoch 1 iter 2003: train loss 0.77573. lr 5.445896e-04:  39%|███▉      | 2003/5098 [15:09<25:34,  2.02it/s][A

epoch 1 iter 2035: train loss 0.76383. lr 5.428646e-04:  40%|███▉      | 2036/5098 [15:23<20:52,  2.44it/s][A
epoch 1 iter 2036: train loss 0.76913. lr 5.428103e-04:  40%|███▉      | 2036/5098 [15:23<20:52,  2.44it/s][A
epoch 1 iter 2036: train loss 0.76913. lr 5.428103e-04:  40%|███▉      | 2037/5098 [15:23<21:04,  2.42it/s][A
epoch 1 iter 2037: train loss 0.75659. lr 5.427560e-04:  40%|███▉      | 2037/5098 [15:24<21:04,  2.42it/s][A
epoch 1 iter 2037: train loss 0.75659. lr 5.427560e-04:  40%|███▉      | 2038/5098 [15:24<24:20,  2.10it/s][A
epoch 1 iter 2038: train loss 0.74667. lr 5.427017e-04:  40%|███▉      | 2038/5098 [15:24<24:20,  2.10it/s][A
epoch 1 iter 2038: train loss 0.74667. lr 5.427017e-04:  40%|███▉      | 2039/5098 [15:25<23:16,  2.19it/s][A
epoch 1 iter 2039: train loss 0.76514. lr 5.426473e-04:  40%|███▉      | 2039/5098 [15:25<23:16,  2.19it/s][A
epoch 1 iter 2039: train loss 0.76514. lr 5.426473e-04:  40%|████      | 2040/5098 [15:25<22:16,  2.29it/s][A

epoch 1 iter 2072: train loss 0.76538. lr 5.408406e-04:  41%|████      | 2072/5098 [15:40<21:08,  2.39it/s][A
epoch 1 iter 2072: train loss 0.76538. lr 5.408406e-04:  41%|████      | 2073/5098 [15:40<21:01,  2.40it/s][A
epoch 1 iter 2073: train loss 0.74874. lr 5.407855e-04:  41%|████      | 2073/5098 [15:40<21:01,  2.40it/s][A
epoch 1 iter 2073: train loss 0.74874. lr 5.407855e-04:  41%|████      | 2074/5098 [15:40<20:55,  2.41it/s][A
epoch 1 iter 2074: train loss 0.75159. lr 5.407303e-04:  41%|████      | 2074/5098 [15:41<20:55,  2.41it/s][A
epoch 1 iter 2074: train loss 0.75159. lr 5.407303e-04:  41%|████      | 2075/5098 [15:41<20:48,  2.42it/s][A
epoch 1 iter 2075: train loss 0.74569. lr 5.406752e-04:  41%|████      | 2075/5098 [15:41<20:48,  2.42it/s][A
epoch 1 iter 2075: train loss 0.74569. lr 5.406752e-04:  41%|████      | 2076/5098 [15:41<20:45,  2.43it/s][A
epoch 1 iter 2076: train loss 0.74862. lr 5.406199e-04:  41%|████      | 2076/5098 [15:41<20:45,  2.43it/s][A

epoch 1 iter 2108: train loss 0.74749. lr 5.388413e-04:  41%|████▏     | 2109/5098 [15:57<22:36,  2.20it/s][A
epoch 1 iter 2109: train loss 0.74699. lr 5.387854e-04:  41%|████▏     | 2109/5098 [15:57<22:36,  2.20it/s][A
epoch 1 iter 2109: train loss 0.74699. lr 5.387854e-04:  41%|████▏     | 2110/5098 [15:57<22:30,  2.21it/s][A
epoch 1 iter 2110: train loss 0.73643. lr 5.387294e-04:  41%|████▏     | 2110/5098 [15:58<22:30,  2.21it/s][A
epoch 1 iter 2110: train loss 0.73643. lr 5.387294e-04:  41%|████▏     | 2111/5098 [15:58<22:25,  2.22it/s][A
epoch 1 iter 2111: train loss 0.74893. lr 5.386734e-04:  41%|████▏     | 2111/5098 [15:58<22:25,  2.22it/s][A
epoch 1 iter 2111: train loss 0.74893. lr 5.386734e-04:  41%|████▏     | 2112/5098 [15:58<22:52,  2.18it/s][A
epoch 1 iter 2112: train loss 0.73751. lr 5.386174e-04:  41%|████▏     | 2112/5098 [15:59<22:52,  2.18it/s][A
epoch 1 iter 2112: train loss 0.73751. lr 5.386174e-04:  41%|████▏     | 2113/5098 [15:59<22:27,  2.21it/s][A

epoch 1 iter 2145: train loss 0.73742. lr 5.367558e-04:  42%|████▏     | 2145/5098 [16:14<20:45,  2.37it/s][A
epoch 1 iter 2145: train loss 0.73742. lr 5.367558e-04:  42%|████▏     | 2146/5098 [16:14<20:39,  2.38it/s][A
epoch 1 iter 2146: train loss 0.72976. lr 5.366990e-04:  42%|████▏     | 2146/5098 [16:14<20:39,  2.38it/s][A
epoch 1 iter 2146: train loss 0.72976. lr 5.366990e-04:  42%|████▏     | 2147/5098 [16:14<20:36,  2.39it/s][A
epoch 1 iter 2147: train loss 0.73079. lr 5.366422e-04:  42%|████▏     | 2147/5098 [16:14<20:36,  2.39it/s][A
epoch 1 iter 2147: train loss 0.73079. lr 5.366422e-04:  42%|████▏     | 2148/5098 [16:14<20:32,  2.39it/s][A
epoch 1 iter 2148: train loss 0.73551. lr 5.365854e-04:  42%|████▏     | 2148/5098 [16:15<20:32,  2.39it/s][A
epoch 1 iter 2148: train loss 0.73551. lr 5.365854e-04:  42%|████▏     | 2149/5098 [16:15<20:34,  2.39it/s][A
epoch 1 iter 2149: train loss 0.74491. lr 5.365285e-04:  42%|████▏     | 2149/5098 [16:15<20:34,  2.39it/s][A

epoch 1 iter 2181: train loss 0.72270. lr 5.346972e-04:  43%|████▎     | 2182/5098 [16:30<22:39,  2.14it/s][A
epoch 1 iter 2182: train loss 0.73264. lr 5.346396e-04:  43%|████▎     | 2182/5098 [16:30<22:39,  2.14it/s][A
epoch 1 iter 2182: train loss 0.73264. lr 5.346396e-04:  43%|████▎     | 2183/5098 [16:30<21:19,  2.28it/s][A
epoch 1 iter 2183: train loss 0.70708. lr 5.345819e-04:  43%|████▎     | 2183/5098 [16:31<21:19,  2.28it/s][A
epoch 1 iter 2183: train loss 0.70708. lr 5.345819e-04:  43%|████▎     | 2184/5098 [16:31<20:14,  2.40it/s][A
epoch 1 iter 2184: train loss 0.70977. lr 5.345243e-04:  43%|████▎     | 2184/5098 [16:31<20:14,  2.40it/s][A
epoch 1 iter 2184: train loss 0.70977. lr 5.345243e-04:  43%|████▎     | 2185/5098 [16:31<20:50,  2.33it/s][A
epoch 1 iter 2185: train loss 0.70305. lr 5.344666e-04:  43%|████▎     | 2185/5098 [16:31<20:50,  2.33it/s][A
epoch 1 iter 2185: train loss 0.70305. lr 5.344666e-04:  43%|████▎     | 2186/5098 [16:31<20:55,  2.32it/s][A

epoch 1 iter 2218: train loss 0.70431. lr 5.325512e-04:  44%|████▎     | 2218/5098 [16:48<19:44,  2.43it/s][A
epoch 1 iter 2218: train loss 0.70431. lr 5.325512e-04:  44%|████▎     | 2219/5098 [16:48<19:40,  2.44it/s][A
epoch 1 iter 2219: train loss 0.70090. lr 5.324928e-04:  44%|████▎     | 2219/5098 [16:48<19:40,  2.44it/s][A
epoch 1 iter 2219: train loss 0.70090. lr 5.324928e-04:  44%|████▎     | 2220/5098 [16:48<19:38,  2.44it/s][A
epoch 1 iter 2220: train loss 0.71192. lr 5.324344e-04:  44%|████▎     | 2220/5098 [16:48<19:38,  2.44it/s][A
epoch 1 iter 2220: train loss 0.71192. lr 5.324344e-04:  44%|████▎     | 2221/5098 [16:48<19:35,  2.45it/s][A
epoch 1 iter 2221: train loss 0.72062. lr 5.323759e-04:  44%|████▎     | 2221/5098 [16:49<19:35,  2.45it/s][A
epoch 1 iter 2221: train loss 0.72062. lr 5.323759e-04:  44%|████▎     | 2222/5098 [16:49<19:34,  2.45it/s][A
epoch 1 iter 2222: train loss 0.69255. lr 5.323174e-04:  44%|████▎     | 2222/5098 [16:49<19:34,  2.45it/s][A

epoch 1 iter 2254: train loss 0.68496. lr 5.304342e-04:  44%|████▍     | 2255/5098 [17:04<19:49,  2.39it/s][A
epoch 1 iter 2255: train loss 0.71766. lr 5.303750e-04:  44%|████▍     | 2255/5098 [17:04<19:49,  2.39it/s][A
epoch 1 iter 2255: train loss 0.71766. lr 5.303750e-04:  44%|████▍     | 2256/5098 [17:04<19:26,  2.44it/s][A
epoch 1 iter 2256: train loss 0.68880. lr 5.303158e-04:  44%|████▍     | 2256/5098 [17:05<19:26,  2.44it/s][A
epoch 1 iter 2256: train loss 0.68880. lr 5.303158e-04:  44%|████▍     | 2257/5098 [17:05<19:53,  2.38it/s][A
epoch 1 iter 2257: train loss 0.70395. lr 5.302565e-04:  44%|████▍     | 2257/5098 [17:05<19:53,  2.38it/s][A
epoch 1 iter 2257: train loss 0.70395. lr 5.302565e-04:  44%|████▍     | 2258/5098 [17:05<19:47,  2.39it/s][A
epoch 1 iter 2258: train loss 0.69968. lr 5.301973e-04:  44%|████▍     | 2258/5098 [17:05<19:47,  2.39it/s][A
epoch 1 iter 2258: train loss 0.69968. lr 5.301973e-04:  44%|████▍     | 2259/5098 [17:05<19:51,  2.38it/s][A

epoch 1 iter 2291: train loss 0.68656. lr 5.282289e-04:  45%|████▍     | 2291/5098 [17:21<25:18,  1.85it/s][A
epoch 1 iter 2291: train loss 0.68656. lr 5.282289e-04:  45%|████▍     | 2292/5098 [17:21<26:01,  1.80it/s][A
epoch 1 iter 2292: train loss 0.69163. lr 5.281689e-04:  45%|████▍     | 2292/5098 [17:22<26:01,  1.80it/s][A
epoch 1 iter 2292: train loss 0.69163. lr 5.281689e-04:  45%|████▍     | 2293/5098 [17:22<26:33,  1.76it/s][A
epoch 1 iter 2293: train loss 0.69713. lr 5.281088e-04:  45%|████▍     | 2293/5098 [17:22<26:33,  1.76it/s][A
epoch 1 iter 2293: train loss 0.69713. lr 5.281088e-04:  45%|████▍     | 2294/5098 [17:22<26:55,  1.74it/s][A
epoch 1 iter 2294: train loss 0.69700. lr 5.280488e-04:  45%|████▍     | 2294/5098 [17:23<26:55,  1.74it/s][A
epoch 1 iter 2294: train loss 0.69700. lr 5.280488e-04:  45%|████▌     | 2295/5098 [17:23<27:09,  1.72it/s][A
epoch 1 iter 2295: train loss 0.69230. lr 5.279887e-04:  45%|████▌     | 2295/5098 [17:23<27:09,  1.72it/s][A

epoch 1 iter 2327: train loss 0.68210. lr 5.260547e-04:  46%|████▌     | 2328/5098 [17:38<22:59,  2.01it/s][A
epoch 1 iter 2328: train loss 0.67442. lr 5.259939e-04:  46%|████▌     | 2328/5098 [17:39<22:59,  2.01it/s][A
epoch 1 iter 2328: train loss 0.67442. lr 5.259939e-04:  46%|████▌     | 2329/5098 [17:39<22:51,  2.02it/s][A
epoch 1 iter 2329: train loss 0.67464. lr 5.259331e-04:  46%|████▌     | 2329/5098 [17:39<22:51,  2.02it/s][A
epoch 1 iter 2329: train loss 0.67464. lr 5.259331e-04:  46%|████▌     | 2330/5098 [17:39<22:44,  2.03it/s][A
epoch 1 iter 2330: train loss 0.67778. lr 5.258722e-04:  46%|████▌     | 2330/5098 [17:40<22:44,  2.03it/s][A
epoch 1 iter 2330: train loss 0.67778. lr 5.258722e-04:  46%|████▌     | 2331/5098 [17:40<22:25,  2.06it/s][A
epoch 1 iter 2331: train loss 0.68216. lr 5.258114e-04:  46%|████▌     | 2331/5098 [17:40<22:25,  2.06it/s][A
epoch 1 iter 2331: train loss 0.68216. lr 5.258114e-04:  46%|████▌     | 2332/5098 [17:40<22:04,  2.09it/s][A

epoch 1 iter 2364: train loss 0.67906. lr 5.237911e-04:  46%|████▋     | 2364/5098 [17:56<22:42,  2.01it/s][A
epoch 1 iter 2364: train loss 0.67906. lr 5.237911e-04:  46%|████▋     | 2365/5098 [17:56<22:56,  1.98it/s][A
epoch 1 iter 2365: train loss 0.67891. lr 5.237295e-04:  46%|████▋     | 2365/5098 [17:56<22:56,  1.98it/s][A
epoch 1 iter 2365: train loss 0.67891. lr 5.237295e-04:  46%|████▋     | 2366/5098 [17:56<22:27,  2.03it/s][A
epoch 1 iter 2366: train loss 0.65613. lr 5.236679e-04:  46%|████▋     | 2366/5098 [17:56<22:27,  2.03it/s][A
epoch 1 iter 2366: train loss 0.65613. lr 5.236679e-04:  46%|████▋     | 2367/5098 [17:56<21:37,  2.11it/s][A
epoch 1 iter 2367: train loss 0.65612. lr 5.236062e-04:  46%|████▋     | 2367/5098 [17:57<21:37,  2.11it/s][A
epoch 1 iter 2367: train loss 0.65612. lr 5.236062e-04:  46%|████▋     | 2368/5098 [17:57<20:49,  2.19it/s][A
epoch 1 iter 2368: train loss 0.66918. lr 5.235446e-04:  46%|████▋     | 2368/5098 [17:57<20:49,  2.19it/s][A

epoch 1 iter 2400: train loss 0.65018. lr 5.215607e-04:  47%|████▋     | 2401/5098 [18:12<32:27,  1.38it/s][A
epoch 1 iter 2401: train loss 0.65483. lr 5.214984e-04:  47%|████▋     | 2401/5098 [18:13<32:27,  1.38it/s][A
epoch 1 iter 2401: train loss 0.65483. lr 5.214984e-04:  47%|████▋     | 2402/5098 [18:13<32:28,  1.38it/s][A
epoch 1 iter 2402: train loss 0.65098. lr 5.214360e-04:  47%|████▋     | 2402/5098 [18:14<32:28,  1.38it/s][A
epoch 1 iter 2402: train loss 0.65098. lr 5.214360e-04:  47%|████▋     | 2403/5098 [18:14<31:25,  1.43it/s][A
epoch 1 iter 2403: train loss 0.66517. lr 5.213736e-04:  47%|████▋     | 2403/5098 [18:14<31:25,  1.43it/s][A
epoch 1 iter 2403: train loss 0.66517. lr 5.213736e-04:  47%|████▋     | 2404/5098 [18:14<30:25,  1.48it/s][A
epoch 1 iter 2404: train loss 0.65457. lr 5.213112e-04:  47%|████▋     | 2404/5098 [18:15<30:25,  1.48it/s][A
epoch 1 iter 2404: train loss 0.65457. lr 5.213112e-04:  47%|████▋     | 2405/5098 [18:15<28:57,  1.55it/s][A

epoch 1 iter 2437: train loss 0.65032. lr 5.192400e-04:  48%|████▊     | 2437/5098 [18:29<21:06,  2.10it/s][A
epoch 1 iter 2437: train loss 0.65032. lr 5.192400e-04:  48%|████▊     | 2438/5098 [18:29<20:24,  2.17it/s][A
epoch 1 iter 2438: train loss 0.63967. lr 5.191769e-04:  48%|████▊     | 2438/5098 [18:30<20:24,  2.17it/s][A
epoch 1 iter 2438: train loss 0.63967. lr 5.191769e-04:  48%|████▊     | 2439/5098 [18:30<19:53,  2.23it/s][A
epoch 1 iter 2439: train loss 0.64060. lr 5.191137e-04:  48%|████▊     | 2439/5098 [18:30<19:53,  2.23it/s][A
epoch 1 iter 2439: train loss 0.64060. lr 5.191137e-04:  48%|████▊     | 2440/5098 [18:30<19:34,  2.26it/s][A
epoch 1 iter 2440: train loss 0.65475. lr 5.190505e-04:  48%|████▊     | 2440/5098 [18:31<19:34,  2.26it/s][A
epoch 1 iter 2440: train loss 0.65475. lr 5.190505e-04:  48%|████▊     | 2441/5098 [18:31<19:16,  2.30it/s][A
epoch 1 iter 2441: train loss 0.65621. lr 5.189874e-04:  48%|████▊     | 2441/5098 [18:31<19:16,  2.30it/s][A

epoch 1 iter 2473: train loss 0.63417. lr 5.169546e-04:  49%|████▊     | 2474/5098 [18:46<23:36,  1.85it/s][A
epoch 1 iter 2474: train loss 0.64831. lr 5.168907e-04:  49%|████▊     | 2474/5098 [18:47<23:36,  1.85it/s][A
epoch 1 iter 2474: train loss 0.64831. lr 5.168907e-04:  49%|████▊     | 2475/5098 [18:47<23:37,  1.85it/s][A
epoch 1 iter 2475: train loss 0.64432. lr 5.168268e-04:  49%|████▊     | 2475/5098 [18:47<23:37,  1.85it/s][A
epoch 1 iter 2475: train loss 0.64432. lr 5.168268e-04:  49%|████▊     | 2476/5098 [18:47<23:09,  1.89it/s][A
epoch 1 iter 2476: train loss 0.63572. lr 5.167629e-04:  49%|████▊     | 2476/5098 [18:48<23:09,  1.89it/s][A
epoch 1 iter 2476: train loss 0.63572. lr 5.167629e-04:  49%|████▊     | 2477/5098 [18:48<22:50,  1.91it/s][A
epoch 1 iter 2477: train loss 0.63205. lr 5.166990e-04:  49%|████▊     | 2477/5098 [18:48<22:50,  1.91it/s][A
epoch 1 iter 2477: train loss 0.63205. lr 5.166990e-04:  49%|████▊     | 2478/5098 [18:48<22:17,  1.96it/s][A

epoch 1 iter 2510: train loss 0.63464. lr 5.145779e-04:  49%|████▉     | 2510/5098 [19:03<18:26,  2.34it/s][A
epoch 1 iter 2510: train loss 0.63464. lr 5.145779e-04:  49%|████▉     | 2511/5098 [19:03<18:18,  2.35it/s][A
epoch 1 iter 2511: train loss 0.62555. lr 5.145133e-04:  49%|████▉     | 2511/5098 [19:03<18:18,  2.35it/s][A
epoch 1 iter 2511: train loss 0.62555. lr 5.145133e-04:  49%|████▉     | 2512/5098 [19:03<18:15,  2.36it/s][A
epoch 1 iter 2512: train loss 0.62525. lr 5.144487e-04:  49%|████▉     | 2512/5098 [19:03<18:15,  2.36it/s][A
epoch 1 iter 2512: train loss 0.62525. lr 5.144487e-04:  49%|████▉     | 2513/5098 [19:03<18:14,  2.36it/s][A
epoch 1 iter 2513: train loss 0.62865. lr 5.143840e-04:  49%|████▉     | 2513/5098 [19:04<18:14,  2.36it/s][A
epoch 1 iter 2513: train loss 0.62865. lr 5.143840e-04:  49%|████▉     | 2514/5098 [19:04<18:25,  2.34it/s][A
epoch 1 iter 2514: train loss 0.62458. lr 5.143193e-04:  49%|████▉     | 2514/5098 [19:04<18:25,  2.34it/s][A

epoch 1 iter 2546: train loss 0.62804. lr 5.122387e-04:  50%|████▉     | 2547/5098 [19:21<20:19,  2.09it/s][A
epoch 1 iter 2547: train loss 0.62622. lr 5.121733e-04:  50%|████▉     | 2547/5098 [19:21<20:19,  2.09it/s][A
epoch 1 iter 2547: train loss 0.62622. lr 5.121733e-04:  50%|████▉     | 2548/5098 [19:21<19:17,  2.20it/s][A
epoch 1 iter 2548: train loss 0.61981. lr 5.121080e-04:  50%|████▉     | 2548/5098 [19:22<19:17,  2.20it/s][A
epoch 1 iter 2548: train loss 0.61981. lr 5.121080e-04:  50%|█████     | 2549/5098 [19:22<18:27,  2.30it/s][A
epoch 1 iter 2549: train loss 0.62572. lr 5.120426e-04:  50%|█████     | 2549/5098 [19:22<18:27,  2.30it/s][A
epoch 1 iter 2549: train loss 0.62572. lr 5.120426e-04:  50%|█████     | 2550/5098 [19:22<17:57,  2.36it/s][A
epoch 1 iter 2550: train loss 0.62349. lr 5.119772e-04:  50%|█████     | 2550/5098 [19:23<17:57,  2.36it/s][A
epoch 1 iter 2550: train loss 0.62349. lr 5.119772e-04:  50%|█████     | 2551/5098 [19:23<17:28,  2.43it/s][A

epoch 1 iter 2583: train loss 0.62045. lr 5.098073e-04:  51%|█████     | 2583/5098 [19:37<18:14,  2.30it/s][A
epoch 1 iter 2583: train loss 0.62045. lr 5.098073e-04:  51%|█████     | 2584/5098 [19:37<18:08,  2.31it/s][A
epoch 1 iter 2584: train loss 0.60269. lr 5.097412e-04:  51%|█████     | 2584/5098 [19:37<18:08,  2.31it/s][A
epoch 1 iter 2584: train loss 0.60269. lr 5.097412e-04:  51%|█████     | 2585/5098 [19:37<17:31,  2.39it/s][A
epoch 1 iter 2585: train loss 0.61439. lr 5.096751e-04:  51%|█████     | 2585/5098 [19:38<17:31,  2.39it/s][A
epoch 1 iter 2585: train loss 0.61439. lr 5.096751e-04:  51%|█████     | 2586/5098 [19:38<17:29,  2.39it/s][A
epoch 1 iter 2586: train loss 0.60924. lr 5.096089e-04:  51%|█████     | 2586/5098 [19:38<17:29,  2.39it/s][A
epoch 1 iter 2586: train loss 0.60924. lr 5.096089e-04:  51%|█████     | 2587/5098 [19:38<17:39,  2.37it/s][A
epoch 1 iter 2587: train loss 0.60290. lr 5.095428e-04:  51%|█████     | 2587/5098 [19:38<17:39,  2.37it/s][A

epoch 1 iter 2619: train loss 0.60010. lr 5.074154e-04:  51%|█████▏    | 2620/5098 [19:55<18:18,  2.26it/s][A
epoch 1 iter 2620: train loss 0.61268. lr 5.073486e-04:  51%|█████▏    | 2620/5098 [19:55<18:18,  2.26it/s][A
epoch 1 iter 2620: train loss 0.61268. lr 5.073486e-04:  51%|█████▏    | 2621/5098 [19:55<18:07,  2.28it/s][A
epoch 1 iter 2621: train loss 0.61344. lr 5.072817e-04:  51%|█████▏    | 2621/5098 [19:56<18:07,  2.28it/s][A
epoch 1 iter 2621: train loss 0.61344. lr 5.072817e-04:  51%|█████▏    | 2622/5098 [19:56<18:01,  2.29it/s][A
epoch 1 iter 2622: train loss 0.59907. lr 5.072149e-04:  51%|█████▏    | 2622/5098 [19:56<18:01,  2.29it/s][A
epoch 1 iter 2622: train loss 0.59907. lr 5.072149e-04:  51%|█████▏    | 2623/5098 [19:56<17:58,  2.30it/s][A
epoch 1 iter 2623: train loss 0.61105. lr 5.071480e-04:  51%|█████▏    | 2623/5098 [19:56<17:58,  2.30it/s][A
epoch 1 iter 2623: train loss 0.61105. lr 5.071480e-04:  51%|█████▏    | 2624/5098 [19:56<17:51,  2.31it/s][A

epoch 1 iter 2656: train loss 0.60245. lr 5.049304e-04:  52%|█████▏    | 2656/5098 [20:12<22:40,  1.80it/s][A
epoch 1 iter 2656: train loss 0.60245. lr 5.049304e-04:  52%|█████▏    | 2657/5098 [20:12<22:04,  1.84it/s][A
epoch 1 iter 2657: train loss 0.60201. lr 5.048629e-04:  52%|█████▏    | 2657/5098 [20:12<22:04,  1.84it/s][A
epoch 1 iter 2657: train loss 0.60201. lr 5.048629e-04:  52%|█████▏    | 2658/5098 [20:12<22:15,  1.83it/s][A
epoch 1 iter 2658: train loss 0.59030. lr 5.047953e-04:  52%|█████▏    | 2658/5098 [20:13<22:15,  1.83it/s][A
epoch 1 iter 2658: train loss 0.59030. lr 5.047953e-04:  52%|█████▏    | 2659/5098 [20:13<21:50,  1.86it/s][A
epoch 1 iter 2659: train loss 0.59278. lr 5.047278e-04:  52%|█████▏    | 2659/5098 [20:13<21:50,  1.86it/s][A
epoch 1 iter 2659: train loss 0.59278. lr 5.047278e-04:  52%|█████▏    | 2660/5098 [20:13<21:27,  1.89it/s][A
epoch 1 iter 2660: train loss 0.58607. lr 5.046602e-04:  52%|█████▏    | 2660/5098 [20:14<21:27,  1.89it/s][A

epoch 1 iter 2692: train loss 0.58415. lr 5.024871e-04:  53%|█████▎    | 2693/5098 [20:28<16:50,  2.38it/s][A
epoch 1 iter 2693: train loss 0.59668. lr 5.024188e-04:  53%|█████▎    | 2693/5098 [20:28<16:50,  2.38it/s][A
epoch 1 iter 2693: train loss 0.59668. lr 5.024188e-04:  53%|█████▎    | 2694/5098 [20:28<16:26,  2.44it/s][A
epoch 1 iter 2694: train loss 0.59321. lr 5.023506e-04:  53%|█████▎    | 2694/5098 [20:29<16:26,  2.44it/s][A
epoch 1 iter 2694: train loss 0.59321. lr 5.023506e-04:  53%|█████▎    | 2695/5098 [20:29<16:09,  2.48it/s][A
epoch 1 iter 2695: train loss 0.57695. lr 5.022823e-04:  53%|█████▎    | 2695/5098 [20:29<16:09,  2.48it/s][A
epoch 1 iter 2695: train loss 0.57695. lr 5.022823e-04:  53%|█████▎    | 2696/5098 [20:29<17:14,  2.32it/s][A
epoch 1 iter 2696: train loss 0.58916. lr 5.022140e-04:  53%|█████▎    | 2696/5098 [20:30<17:14,  2.32it/s][A
epoch 1 iter 2696: train loss 0.58916. lr 5.022140e-04:  53%|█████▎    | 2697/5098 [20:30<17:25,  2.30it/s][A

epoch 1 iter 2729: train loss 0.58315. lr 4.999499e-04:  54%|█████▎    | 2729/5098 [20:44<17:20,  2.28it/s][A
epoch 1 iter 2729: train loss 0.58315. lr 4.999499e-04:  54%|█████▎    | 2730/5098 [20:44<16:47,  2.35it/s][A
epoch 1 iter 2730: train loss 0.57833. lr 4.998809e-04:  54%|█████▎    | 2730/5098 [20:45<16:47,  2.35it/s][A
epoch 1 iter 2730: train loss 0.57833. lr 4.998809e-04:  54%|█████▎    | 2731/5098 [20:45<17:03,  2.31it/s][A
epoch 1 iter 2731: train loss 0.58661. lr 4.998120e-04:  54%|█████▎    | 2731/5098 [20:45<17:03,  2.31it/s][A
epoch 1 iter 2731: train loss 0.58661. lr 4.998120e-04:  54%|█████▎    | 2732/5098 [20:45<16:45,  2.35it/s][A
epoch 1 iter 2732: train loss 0.57392. lr 4.997430e-04:  54%|█████▎    | 2732/5098 [20:45<16:45,  2.35it/s][A
epoch 1 iter 2732: train loss 0.57392. lr 4.997430e-04:  54%|█████▎    | 2733/5098 [20:45<16:37,  2.37it/s][A
epoch 1 iter 2733: train loss 0.58310. lr 4.996740e-04:  54%|█████▎    | 2733/5098 [20:46<16:37,  2.37it/s][A

epoch 1 iter 2765: train loss 0.57332. lr 4.974563e-04:  54%|█████▍    | 2766/5098 [21:01<17:58,  2.16it/s][A
epoch 1 iter 2766: train loss 0.57633. lr 4.973867e-04:  54%|█████▍    | 2766/5098 [21:02<17:58,  2.16it/s][A
epoch 1 iter 2766: train loss 0.57633. lr 4.973867e-04:  54%|█████▍    | 2767/5098 [21:02<17:41,  2.20it/s][A
epoch 1 iter 2767: train loss 0.57932. lr 4.973170e-04:  54%|█████▍    | 2767/5098 [21:02<17:41,  2.20it/s][A
epoch 1 iter 2767: train loss 0.57932. lr 4.973170e-04:  54%|█████▍    | 2768/5098 [21:02<17:31,  2.22it/s][A
epoch 1 iter 2768: train loss 0.57115. lr 4.972474e-04:  54%|█████▍    | 2768/5098 [21:03<17:31,  2.22it/s][A
epoch 1 iter 2768: train loss 0.57115. lr 4.972474e-04:  54%|█████▍    | 2769/5098 [21:03<17:25,  2.23it/s][A
epoch 1 iter 2769: train loss 0.57756. lr 4.971777e-04:  54%|█████▍    | 2769/5098 [21:03<17:25,  2.23it/s][A
epoch 1 iter 2769: train loss 0.57756. lr 4.971777e-04:  54%|█████▍    | 2770/5098 [21:03<17:18,  2.24it/s][A

epoch 1 iter 2802: train loss 0.56472. lr 4.948681e-04:  55%|█████▍    | 2802/5098 [21:17<15:41,  2.44it/s][A
epoch 1 iter 2802: train loss 0.56472. lr 4.948681e-04:  55%|█████▍    | 2803/5098 [21:17<15:36,  2.45it/s][A
epoch 1 iter 2803: train loss 0.56436. lr 4.947978e-04:  55%|█████▍    | 2803/5098 [21:17<15:36,  2.45it/s][A
epoch 1 iter 2803: train loss 0.56436. lr 4.947978e-04:  55%|█████▌    | 2804/5098 [21:17<15:44,  2.43it/s][A
epoch 1 iter 2804: train loss 0.56769. lr 4.947275e-04:  55%|█████▌    | 2804/5098 [21:18<15:44,  2.43it/s][A
epoch 1 iter 2804: train loss 0.56769. lr 4.947275e-04:  55%|█████▌    | 2805/5098 [21:18<16:30,  2.32it/s][A
epoch 1 iter 2805: train loss 0.55473. lr 4.946571e-04:  55%|█████▌    | 2805/5098 [21:18<16:30,  2.32it/s][A
epoch 1 iter 2805: train loss 0.55473. lr 4.946571e-04:  55%|█████▌    | 2806/5098 [21:18<16:24,  2.33it/s][A
epoch 1 iter 2806: train loss 0.56277. lr 4.945868e-04:  55%|█████▌    | 2806/5098 [21:19<16:24,  2.33it/s][A

epoch 1 iter 2838: train loss 0.56518. lr 4.923256e-04:  56%|█████▌    | 2839/5098 [21:34<16:21,  2.30it/s][A
epoch 1 iter 2839: train loss 0.55739. lr 4.922546e-04:  56%|█████▌    | 2839/5098 [21:34<16:21,  2.30it/s][A
epoch 1 iter 2839: train loss 0.55739. lr 4.922546e-04:  56%|█████▌    | 2840/5098 [21:34<16:06,  2.34it/s][A
epoch 1 iter 2840: train loss 0.55916. lr 4.921836e-04:  56%|█████▌    | 2840/5098 [21:35<16:06,  2.34it/s][A
epoch 1 iter 2840: train loss 0.55916. lr 4.921836e-04:  56%|█████▌    | 2841/5098 [21:35<15:46,  2.38it/s][A
epoch 1 iter 2841: train loss 0.55664. lr 4.921126e-04:  56%|█████▌    | 2841/5098 [21:35<15:46,  2.38it/s][A
epoch 1 iter 2841: train loss 0.55664. lr 4.921126e-04:  56%|█████▌    | 2842/5098 [21:35<15:29,  2.43it/s][A
epoch 1 iter 2842: train loss 0.56014. lr 4.920416e-04:  56%|█████▌    | 2842/5098 [21:35<15:29,  2.43it/s][A
epoch 1 iter 2842: train loss 0.56014. lr 4.920416e-04:  56%|█████▌    | 2843/5098 [21:35<15:16,  2.46it/s][A

epoch 1 iter 2875: train loss 0.54255. lr 4.896877e-04:  56%|█████▋    | 2875/5098 [21:51<15:14,  2.43it/s][A
epoch 1 iter 2875: train loss 0.54255. lr 4.896877e-04:  56%|█████▋    | 2876/5098 [21:51<14:58,  2.47it/s][A
epoch 1 iter 2876: train loss 0.54784. lr 4.896161e-04:  56%|█████▋    | 2876/5098 [21:51<14:58,  2.47it/s][A
epoch 1 iter 2876: train loss 0.54784. lr 4.896161e-04:  56%|█████▋    | 2877/5098 [21:51<14:52,  2.49it/s][A
epoch 1 iter 2877: train loss 0.55077. lr 4.895444e-04:  56%|█████▋    | 2877/5098 [21:51<14:52,  2.49it/s][A
epoch 1 iter 2877: train loss 0.55077. lr 4.895444e-04:  56%|█████▋    | 2878/5098 [21:51<15:30,  2.39it/s][A
epoch 1 iter 2878: train loss 0.55453. lr 4.894728e-04:  56%|█████▋    | 2878/5098 [21:52<15:30,  2.39it/s][A
epoch 1 iter 2878: train loss 0.55453. lr 4.894728e-04:  56%|█████▋    | 2879/5098 [21:52<16:36,  2.23it/s][A
epoch 1 iter 2879: train loss 0.55620. lr 4.894011e-04:  56%|█████▋    | 2879/5098 [21:52<16:36,  2.23it/s][A

epoch 1 iter 2911: train loss 0.53757. lr 4.870975e-04:  57%|█████▋    | 2912/5098 [22:06<15:42,  2.32it/s][A
epoch 1 iter 2912: train loss 0.54117. lr 4.870252e-04:  57%|█████▋    | 2912/5098 [22:06<15:42,  2.32it/s][A
epoch 1 iter 2912: train loss 0.54117. lr 4.870252e-04:  57%|█████▋    | 2913/5098 [22:06<15:11,  2.40it/s][A
epoch 1 iter 2913: train loss 0.55620. lr 4.869529e-04:  57%|█████▋    | 2913/5098 [22:07<15:11,  2.40it/s][A
epoch 1 iter 2913: train loss 0.55620. lr 4.869529e-04:  57%|█████▋    | 2914/5098 [22:07<14:46,  2.46it/s][A
epoch 1 iter 2914: train loss 0.55084. lr 4.868806e-04:  57%|█████▋    | 2914/5098 [22:07<14:46,  2.46it/s][A
epoch 1 iter 2914: train loss 0.55084. lr 4.868806e-04:  57%|█████▋    | 2915/5098 [22:07<14:53,  2.44it/s][A
epoch 1 iter 2915: train loss 0.53895. lr 4.868083e-04:  57%|█████▋    | 2915/5098 [22:08<14:53,  2.44it/s][A
epoch 1 iter 2915: train loss 0.53895. lr 4.868083e-04:  57%|█████▋    | 2916/5098 [22:08<15:45,  2.31it/s][A

epoch 1 iter 2948: train loss 0.53895. lr 4.844114e-04:  58%|█████▊    | 2948/5098 [22:23<18:45,  1.91it/s][A
epoch 1 iter 2948: train loss 0.53895. lr 4.844114e-04:  58%|█████▊    | 2949/5098 [22:23<18:14,  1.96it/s][A
epoch 1 iter 2949: train loss 0.53391. lr 4.843384e-04:  58%|█████▊    | 2949/5098 [22:24<18:14,  1.96it/s][A
epoch 1 iter 2949: train loss 0.53391. lr 4.843384e-04:  58%|█████▊    | 2950/5098 [22:24<17:54,  2.00it/s][A
epoch 1 iter 2950: train loss 0.54546. lr 4.842655e-04:  58%|█████▊    | 2950/5098 [22:24<17:54,  2.00it/s][A
epoch 1 iter 2950: train loss 0.54546. lr 4.842655e-04:  58%|█████▊    | 2951/5098 [22:24<17:39,  2.03it/s][A
epoch 1 iter 2951: train loss 0.53942. lr 4.841925e-04:  58%|█████▊    | 2951/5098 [22:25<17:39,  2.03it/s][A
epoch 1 iter 2951: train loss 0.53942. lr 4.841925e-04:  58%|█████▊    | 2952/5098 [22:25<17:18,  2.07it/s][A
epoch 1 iter 2952: train loss 0.53813. lr 4.841195e-04:  58%|█████▊    | 2952/5098 [22:25<17:18,  2.07it/s][A

epoch 1 iter 2984: train loss 0.52965. lr 4.817748e-04:  59%|█████▊    | 2985/5098 [22:39<13:48,  2.55it/s][A
epoch 1 iter 2985: train loss 0.53285. lr 4.817012e-04:  59%|█████▊    | 2985/5098 [22:39<13:48,  2.55it/s][A
epoch 1 iter 2985: train loss 0.53285. lr 4.817012e-04:  59%|█████▊    | 2986/5098 [22:39<14:05,  2.50it/s][A
epoch 1 iter 2986: train loss 0.52780. lr 4.816276e-04:  59%|█████▊    | 2986/5098 [22:40<14:05,  2.50it/s][A
epoch 1 iter 2986: train loss 0.52780. lr 4.816276e-04:  59%|█████▊    | 2987/5098 [22:40<14:56,  2.36it/s][A
epoch 1 iter 2987: train loss 0.52262. lr 4.815540e-04:  59%|█████▊    | 2987/5098 [22:40<14:56,  2.36it/s][A
epoch 1 iter 2987: train loss 0.52262. lr 4.815540e-04:  59%|█████▊    | 2988/5098 [22:40<14:41,  2.39it/s][A
epoch 1 iter 2988: train loss 0.53154. lr 4.814804e-04:  59%|█████▊    | 2988/5098 [22:41<14:41,  2.39it/s][A
epoch 1 iter 2988: train loss 0.53154. lr 4.814804e-04:  59%|█████▊    | 2989/5098 [22:41<14:21,  2.45it/s][A

epoch 1 iter 3021: train loss 0.52423. lr 4.790416e-04:  59%|█████▉    | 3021/5098 [22:57<13:51,  2.50it/s][A
epoch 1 iter 3021: train loss 0.52423. lr 4.790416e-04:  59%|█████▉    | 3022/5098 [22:57<13:54,  2.49it/s][A
epoch 1 iter 3022: train loss 0.52090. lr 4.789674e-04:  59%|█████▉    | 3022/5098 [22:58<13:54,  2.49it/s][A
epoch 1 iter 3022: train loss 0.52090. lr 4.789674e-04:  59%|█████▉    | 3023/5098 [22:58<14:01,  2.47it/s][A
epoch 1 iter 3023: train loss 0.52707. lr 4.788932e-04:  59%|█████▉    | 3023/5098 [22:58<14:01,  2.47it/s][A
epoch 1 iter 3023: train loss 0.52707. lr 4.788932e-04:  59%|█████▉    | 3024/5098 [22:58<14:06,  2.45it/s][A
epoch 1 iter 3024: train loss 0.52404. lr 4.788190e-04:  59%|█████▉    | 3024/5098 [22:59<14:06,  2.45it/s][A
epoch 1 iter 3024: train loss 0.52404. lr 4.788190e-04:  59%|█████▉    | 3025/5098 [22:59<14:08,  2.44it/s][A
epoch 1 iter 3025: train loss 0.51325. lr 4.787448e-04:  59%|█████▉    | 3025/5098 [22:59<14:08,  2.44it/s][A

epoch 1 iter 3057: train loss 0.53396. lr 4.763600e-04:  60%|█████▉    | 3058/5098 [23:13<14:32,  2.34it/s][A
epoch 1 iter 3058: train loss 0.51219. lr 4.762852e-04:  60%|█████▉    | 3058/5098 [23:13<14:32,  2.34it/s][A
epoch 1 iter 3058: train loss 0.51219. lr 4.762852e-04:  60%|██████    | 3059/5098 [23:13<16:34,  2.05it/s][A
epoch 1 iter 3059: train loss 0.51294. lr 4.762104e-04:  60%|██████    | 3059/5098 [23:14<16:34,  2.05it/s][A
epoch 1 iter 3059: train loss 0.51294. lr 4.762104e-04:  60%|██████    | 3060/5098 [23:14<16:39,  2.04it/s][A
epoch 1 iter 3060: train loss 0.51296. lr 4.761356e-04:  60%|██████    | 3060/5098 [23:14<16:39,  2.04it/s][A
epoch 1 iter 3060: train loss 0.51296. lr 4.761356e-04:  60%|██████    | 3061/5098 [23:14<16:10,  2.10it/s][A
epoch 1 iter 3061: train loss 0.50637. lr 4.760607e-04:  60%|██████    | 3061/5098 [23:15<16:10,  2.10it/s][A
epoch 1 iter 3061: train loss 0.50637. lr 4.760607e-04:  60%|██████    | 3062/5098 [23:15<15:40,  2.16it/s][A

epoch 1 iter 3094: train loss 0.50955. lr 4.735813e-04:  61%|██████    | 3094/5098 [23:29<16:15,  2.06it/s][A
epoch 1 iter 3094: train loss 0.50955. lr 4.735813e-04:  61%|██████    | 3095/5098 [23:29<17:03,  1.96it/s][A
epoch 1 iter 3095: train loss 0.51006. lr 4.735059e-04:  61%|██████    | 3095/5098 [23:30<17:03,  1.96it/s][A
epoch 1 iter 3095: train loss 0.51006. lr 4.735059e-04:  61%|██████    | 3096/5098 [23:30<17:34,  1.90it/s][A
epoch 1 iter 3096: train loss 0.50689. lr 4.734305e-04:  61%|██████    | 3096/5098 [23:31<17:34,  1.90it/s][A
epoch 1 iter 3096: train loss 0.50689. lr 4.734305e-04:  61%|██████    | 3097/5098 [23:31<17:56,  1.86it/s][A
epoch 1 iter 3097: train loss 0.51421. lr 4.733550e-04:  61%|██████    | 3097/5098 [23:31<17:56,  1.86it/s][A
epoch 1 iter 3097: train loss 0.51421. lr 4.733550e-04:  61%|██████    | 3098/5098 [23:31<18:14,  1.83it/s][A
epoch 1 iter 3098: train loss 0.51070. lr 4.732796e-04:  61%|██████    | 3098/5098 [23:32<18:14,  1.83it/s][A

epoch 1 iter 3130: train loss 0.50341. lr 4.708560e-04:  61%|██████▏   | 3131/5098 [23:46<16:06,  2.03it/s][A
epoch 1 iter 3131: train loss 0.50648. lr 4.707800e-04:  61%|██████▏   | 3131/5098 [23:47<16:06,  2.03it/s][A
epoch 1 iter 3131: train loss 0.50648. lr 4.707800e-04:  61%|██████▏   | 3132/5098 [23:47<15:39,  2.09it/s][A
epoch 1 iter 3132: train loss 0.50110. lr 4.707040e-04:  61%|██████▏   | 3132/5098 [23:47<15:39,  2.09it/s][A
epoch 1 iter 3132: train loss 0.50110. lr 4.707040e-04:  61%|██████▏   | 3133/5098 [23:47<15:22,  2.13it/s][A
epoch 1 iter 3133: train loss 0.50729. lr 4.706280e-04:  61%|██████▏   | 3133/5098 [23:48<15:22,  2.13it/s][A
epoch 1 iter 3133: train loss 0.50729. lr 4.706280e-04:  61%|██████▏   | 3134/5098 [23:48<15:08,  2.16it/s][A
epoch 1 iter 3134: train loss 0.49753. lr 4.705519e-04:  61%|██████▏   | 3134/5098 [23:48<15:08,  2.16it/s][A
epoch 1 iter 3134: train loss 0.49753. lr 4.705519e-04:  61%|██████▏   | 3135/5098 [23:48<14:57,  2.19it/s][A

epoch 1 iter 3167: train loss 0.49297. lr 4.680331e-04:  62%|██████▏   | 3167/5098 [24:02<13:56,  2.31it/s][A
epoch 1 iter 3167: train loss 0.49297. lr 4.680331e-04:  62%|██████▏   | 3168/5098 [24:02<15:38,  2.06it/s][A
epoch 1 iter 3168: train loss 0.49323. lr 4.679565e-04:  62%|██████▏   | 3168/5098 [24:02<15:38,  2.06it/s][A
epoch 1 iter 3168: train loss 0.49323. lr 4.679565e-04:  62%|██████▏   | 3169/5098 [24:02<16:16,  1.98it/s][A
epoch 1 iter 3169: train loss 0.49816. lr 4.678799e-04:  62%|██████▏   | 3169/5098 [24:03<16:16,  1.98it/s][A
epoch 1 iter 3169: train loss 0.49816. lr 4.678799e-04:  62%|██████▏   | 3170/5098 [24:03<15:38,  2.05it/s][A
epoch 1 iter 3170: train loss 0.49829. lr 4.678033e-04:  62%|██████▏   | 3170/5098 [24:03<15:38,  2.05it/s][A
epoch 1 iter 3170: train loss 0.49829. lr 4.678033e-04:  62%|██████▏   | 3171/5098 [24:03<14:59,  2.14it/s][A
epoch 1 iter 3171: train loss 0.50452. lr 4.677266e-04:  62%|██████▏   | 3171/5098 [24:04<14:59,  2.14it/s][A

epoch 1 iter 3203: train loss 0.48441. lr 4.652656e-04:  63%|██████▎   | 3204/5098 [24:18<14:45,  2.14it/s][A
epoch 1 iter 3204: train loss 0.48430. lr 4.651884e-04:  63%|██████▎   | 3204/5098 [24:19<14:45,  2.14it/s][A
epoch 1 iter 3204: train loss 0.48430. lr 4.651884e-04:  63%|██████▎   | 3205/5098 [24:19<13:59,  2.25it/s][A
epoch 1 iter 3205: train loss 0.49241. lr 4.651112e-04:  63%|██████▎   | 3205/5098 [24:19<13:59,  2.25it/s][A
epoch 1 iter 3205: train loss 0.49241. lr 4.651112e-04:  63%|██████▎   | 3206/5098 [24:19<13:26,  2.35it/s][A
epoch 1 iter 3206: train loss 0.48752. lr 4.650340e-04:  63%|██████▎   | 3206/5098 [24:20<13:26,  2.35it/s][A
epoch 1 iter 3206: train loss 0.48752. lr 4.650340e-04:  63%|██████▎   | 3207/5098 [24:20<13:23,  2.35it/s][A
epoch 1 iter 3207: train loss 0.48800. lr 4.649568e-04:  63%|██████▎   | 3207/5098 [24:20<13:23,  2.35it/s][A
epoch 1 iter 3207: train loss 0.48800. lr 4.649568e-04:  63%|██████▎   | 3208/5098 [24:20<13:28,  2.34it/s][A

epoch 1 iter 3240: train loss 0.48575. lr 4.623999e-04:  64%|██████▎   | 3240/5098 [24:34<14:08,  2.19it/s][A
epoch 1 iter 3240: train loss 0.48575. lr 4.623999e-04:  64%|██████▎   | 3241/5098 [24:34<14:03,  2.20it/s][A
epoch 1 iter 3241: train loss 0.49115. lr 4.623222e-04:  64%|██████▎   | 3241/5098 [24:35<14:03,  2.20it/s][A
epoch 1 iter 3241: train loss 0.49115. lr 4.623222e-04:  64%|██████▎   | 3242/5098 [24:35<13:52,  2.23it/s][A
epoch 1 iter 3242: train loss 0.48412. lr 4.622444e-04:  64%|██████▎   | 3242/5098 [24:35<13:52,  2.23it/s][A
epoch 1 iter 3242: train loss 0.48412. lr 4.622444e-04:  64%|██████▎   | 3243/5098 [24:35<13:49,  2.24it/s][A
epoch 1 iter 3243: train loss 0.48520. lr 4.621666e-04:  64%|██████▎   | 3243/5098 [24:36<13:49,  2.24it/s][A
epoch 1 iter 3243: train loss 0.48520. lr 4.621666e-04:  64%|██████▎   | 3244/5098 [24:36<18:05,  1.71it/s][A
epoch 1 iter 3244: train loss 0.48599. lr 4.620888e-04:  64%|██████▎   | 3244/5098 [24:37<18:05,  1.71it/s][A

epoch 1 iter 4266: train loss 0.35871. lr 3.759132e-04:  84%|████████▎ | 4266/5098 [32:28<06:56,  2.00it/s][A
epoch 1 iter 4266: train loss 0.35871. lr 3.759132e-04:  84%|████████▎ | 4267/5098 [32:28<06:42,  2.06it/s][A
epoch 1 iter 4267: train loss 0.36740. lr 3.758238e-04:  84%|████████▎ | 4267/5098 [32:29<06:42,  2.06it/s][A
epoch 1 iter 4267: train loss 0.36740. lr 3.758238e-04:  84%|████████▎ | 4268/5098 [32:29<06:30,  2.12it/s][A
epoch 1 iter 4268: train loss 0.36017. lr 3.757343e-04:  84%|████████▎ | 4268/5098 [32:29<06:30,  2.12it/s][A
epoch 1 iter 4268: train loss 0.36017. lr 3.757343e-04:  84%|████████▎ | 4269/5098 [32:29<06:20,  2.18it/s][A
epoch 1 iter 4269: train loss 0.36209. lr 3.756448e-04:  84%|████████▎ | 4269/5098 [32:30<06:20,  2.18it/s][A
epoch 1 iter 4269: train loss 0.36209. lr 3.756448e-04:  84%|████████▍ | 4270/5098 [32:30<06:14,  2.21it/s][A
epoch 1 iter 4270: train loss 0.36353. lr 3.755554e-04:  84%|████████▍ | 4270/5098 [32:30<06:14,  2.21it/s][A

epoch 1 iter 4302: train loss 0.36452. lr 3.726886e-04:  84%|████████▍ | 4303/5098 [32:44<05:25,  2.44it/s][A
epoch 1 iter 4303: train loss 0.35724. lr 3.725988e-04:  84%|████████▍ | 4303/5098 [32:44<05:25,  2.44it/s][A
epoch 1 iter 4303: train loss 0.35724. lr 3.725988e-04:  84%|████████▍ | 4304/5098 [32:44<05:20,  2.48it/s][A
epoch 1 iter 4304: train loss 0.35881. lr 3.725091e-04:  84%|████████▍ | 4304/5098 [32:44<05:20,  2.48it/s][A
epoch 1 iter 4304: train loss 0.35881. lr 3.725091e-04:  84%|████████▍ | 4305/5098 [32:44<05:17,  2.49it/s][A
epoch 1 iter 4305: train loss 0.34896. lr 3.724194e-04:  84%|████████▍ | 4305/5098 [32:45<05:17,  2.49it/s][A
epoch 1 iter 4305: train loss 0.34896. lr 3.724194e-04:  84%|████████▍ | 4306/5098 [32:45<05:16,  2.50it/s][A
epoch 1 iter 4306: train loss 0.35586. lr 3.723297e-04:  84%|████████▍ | 4306/5098 [32:45<05:16,  2.50it/s][A
epoch 1 iter 4306: train loss 0.35586. lr 3.723297e-04:  84%|████████▍ | 4307/5098 [32:45<05:26,  2.42it/s][A

epoch 1 iter 4339: train loss 0.36065. lr 3.693650e-04:  85%|████████▌ | 4339/5098 [33:02<07:23,  1.71it/s][A
epoch 1 iter 4339: train loss 0.36065. lr 3.693650e-04:  85%|████████▌ | 4340/5098 [33:02<07:08,  1.77it/s][A
epoch 1 iter 4340: train loss 0.36271. lr 3.692750e-04:  85%|████████▌ | 4340/5098 [33:02<07:08,  1.77it/s][A
epoch 1 iter 4340: train loss 0.36271. lr 3.692750e-04:  85%|████████▌ | 4341/5098 [33:02<06:46,  1.86it/s][A
epoch 1 iter 4341: train loss 0.35890. lr 3.691851e-04:  85%|████████▌ | 4341/5098 [33:03<06:46,  1.86it/s][A
epoch 1 iter 4341: train loss 0.35890. lr 3.691851e-04:  85%|████████▌ | 4342/5098 [33:03<06:31,  1.93it/s][A
epoch 1 iter 4342: train loss 0.35040. lr 3.690951e-04:  85%|████████▌ | 4342/5098 [33:03<06:31,  1.93it/s][A
epoch 1 iter 4342: train loss 0.35040. lr 3.690951e-04:  85%|████████▌ | 4343/5098 [33:03<06:21,  1.98it/s][A
epoch 1 iter 4343: train loss 0.35597. lr 3.690051e-04:  85%|████████▌ | 4343/5098 [33:04<06:21,  1.98it/s][A

epoch 1 iter 4375: train loss 0.35550. lr 3.661226e-04:  86%|████████▌ | 4376/5098 [33:18<04:57,  2.42it/s][A
epoch 1 iter 4376: train loss 0.34950. lr 3.660324e-04:  86%|████████▌ | 4376/5098 [33:18<04:57,  2.42it/s][A
epoch 1 iter 4376: train loss 0.34950. lr 3.660324e-04:  86%|████████▌ | 4377/5098 [33:18<04:56,  2.43it/s][A
epoch 1 iter 4377: train loss 0.34835. lr 3.659422e-04:  86%|████████▌ | 4377/5098 [33:19<04:56,  2.43it/s][A
epoch 1 iter 4377: train loss 0.34835. lr 3.659422e-04:  86%|████████▌ | 4378/5098 [33:19<04:55,  2.44it/s][A
epoch 1 iter 4378: train loss 0.35106. lr 3.658520e-04:  86%|████████▌ | 4378/5098 [33:19<04:55,  2.44it/s][A
epoch 1 iter 4378: train loss 0.35106. lr 3.658520e-04:  86%|████████▌ | 4379/5098 [33:19<04:54,  2.44it/s][A
epoch 1 iter 4379: train loss 0.35078. lr 3.657618e-04:  86%|████████▌ | 4379/5098 [33:19<04:54,  2.44it/s][A
epoch 1 iter 4379: train loss 0.35078. lr 3.657618e-04:  86%|████████▌ | 4380/5098 [33:19<04:53,  2.45it/s][A

epoch 1 iter 4412: train loss 0.35293. lr 3.627817e-04:  87%|████████▋ | 4412/5098 [33:35<04:41,  2.44it/s][A
epoch 1 iter 4412: train loss 0.35293. lr 3.627817e-04:  87%|████████▋ | 4413/5098 [33:35<04:51,  2.35it/s][A
epoch 1 iter 4413: train loss 0.34596. lr 3.626913e-04:  87%|████████▋ | 4413/5098 [33:35<04:51,  2.35it/s][A
epoch 1 iter 4413: train loss 0.34596. lr 3.626913e-04:  87%|████████▋ | 4414/5098 [33:35<04:51,  2.35it/s][A
epoch 1 iter 4414: train loss 0.35287. lr 3.626008e-04:  87%|████████▋ | 4414/5098 [33:35<04:51,  2.35it/s][A
epoch 1 iter 4414: train loss 0.35287. lr 3.626008e-04:  87%|████████▋ | 4415/5098 [33:35<04:50,  2.35it/s][A
epoch 1 iter 4415: train loss 0.34684. lr 3.625104e-04:  87%|████████▋ | 4415/5098 [33:36<04:50,  2.35it/s][A
epoch 1 iter 4415: train loss 0.34684. lr 3.625104e-04:  87%|████████▋ | 4416/5098 [33:36<04:45,  2.39it/s][A
epoch 1 iter 4416: train loss 0.35691. lr 3.624200e-04:  87%|████████▋ | 4416/5098 [33:36<04:45,  2.39it/s][A

epoch 1 iter 4448: train loss 0.34503. lr 3.595232e-04:  87%|████████▋ | 4449/5098 [33:51<05:15,  2.06it/s][A
epoch 1 iter 4449: train loss 0.34166. lr 3.594326e-04:  87%|████████▋ | 4449/5098 [33:51<05:15,  2.06it/s][A
epoch 1 iter 4449: train loss 0.34166. lr 3.594326e-04:  87%|████████▋ | 4450/5098 [33:51<05:13,  2.06it/s][A
epoch 1 iter 4450: train loss 0.35080. lr 3.593419e-04:  87%|████████▋ | 4450/5098 [33:52<05:13,  2.06it/s][A
epoch 1 iter 4450: train loss 0.35080. lr 3.593419e-04:  87%|████████▋ | 4451/5098 [33:52<05:13,  2.06it/s][A
epoch 1 iter 4451: train loss 0.34345. lr 3.592513e-04:  87%|████████▋ | 4451/5098 [33:52<05:13,  2.06it/s][A
epoch 1 iter 4451: train loss 0.34345. lr 3.592513e-04:  87%|████████▋ | 4452/5098 [33:52<05:08,  2.09it/s][A
epoch 1 iter 4452: train loss 0.34842. lr 3.591607e-04:  87%|████████▋ | 4452/5098 [33:53<05:08,  2.09it/s][A
epoch 1 iter 4452: train loss 0.34842. lr 3.591607e-04:  87%|████████▋ | 4453/5098 [33:53<05:04,  2.12it/s][A

epoch 1 iter 4485: train loss 0.34430. lr 3.561666e-04:  88%|████████▊ | 4485/5098 [34:08<04:10,  2.44it/s][A
epoch 1 iter 4485: train loss 0.34430. lr 3.561666e-04:  88%|████████▊ | 4486/5098 [34:08<04:09,  2.45it/s][A
epoch 1 iter 4486: train loss 0.34098. lr 3.560757e-04:  88%|████████▊ | 4486/5098 [34:09<04:09,  2.45it/s][A
epoch 1 iter 4486: train loss 0.34098. lr 3.560757e-04:  88%|████████▊ | 4487/5098 [34:09<04:18,  2.37it/s][A
epoch 1 iter 4487: train loss 0.34660. lr 3.559849e-04:  88%|████████▊ | 4487/5098 [34:09<04:18,  2.37it/s][A
epoch 1 iter 4487: train loss 0.34660. lr 3.559849e-04:  88%|████████▊ | 4488/5098 [34:09<05:21,  1.89it/s][A
epoch 1 iter 4488: train loss 0.34336. lr 3.558941e-04:  88%|████████▊ | 4488/5098 [34:10<05:21,  1.89it/s][A
epoch 1 iter 4488: train loss 0.34336. lr 3.558941e-04:  88%|████████▊ | 4489/5098 [34:10<05:44,  1.77it/s][A
epoch 1 iter 4489: train loss 0.33897. lr 3.558032e-04:  88%|████████▊ | 4489/5098 [34:11<05:44,  1.77it/s][A

epoch 1 iter 4521: train loss 0.34489. lr 3.528936e-04:  89%|████████▊ | 4522/5098 [34:25<04:06,  2.34it/s][A
epoch 1 iter 4522: train loss 0.34673. lr 3.528026e-04:  89%|████████▊ | 4522/5098 [34:26<04:06,  2.34it/s][A
epoch 1 iter 4522: train loss 0.34673. lr 3.528026e-04:  89%|████████▊ | 4523/5098 [34:26<04:05,  2.34it/s][A
epoch 1 iter 4523: train loss 0.34351. lr 3.527116e-04:  89%|████████▊ | 4523/5098 [34:26<04:05,  2.34it/s][A
epoch 1 iter 4523: train loss 0.34351. lr 3.527116e-04:  89%|████████▊ | 4524/5098 [34:26<04:04,  2.35it/s][A
epoch 1 iter 4524: train loss 0.34546. lr 3.526206e-04:  89%|████████▊ | 4524/5098 [34:26<04:04,  2.35it/s][A
epoch 1 iter 4524: train loss 0.34546. lr 3.526206e-04:  89%|████████▉ | 4525/5098 [34:26<04:02,  2.36it/s][A
epoch 1 iter 4525: train loss 0.34306. lr 3.525296e-04:  89%|████████▉ | 4525/5098 [34:27<04:02,  2.36it/s][A
epoch 1 iter 4525: train loss 0.34306. lr 3.525296e-04:  89%|████████▉ | 4526/5098 [34:27<03:58,  2.40it/s][A

epoch 1 iter 4558: train loss 0.33849. lr 3.495230e-04:  89%|████████▉ | 4558/5098 [34:43<05:51,  1.53it/s][A
epoch 1 iter 4558: train loss 0.33849. lr 3.495230e-04:  89%|████████▉ | 4559/5098 [34:43<05:36,  1.60it/s][A
epoch 1 iter 4559: train loss 0.34252. lr 3.494318e-04:  89%|████████▉ | 4559/5098 [34:44<05:36,  1.60it/s][A
epoch 1 iter 4559: train loss 0.34252. lr 3.494318e-04:  89%|████████▉ | 4560/5098 [34:44<05:21,  1.67it/s][A
epoch 1 iter 4560: train loss 0.34055. lr 3.493406e-04:  89%|████████▉ | 4560/5098 [34:44<05:21,  1.67it/s][A
epoch 1 iter 4560: train loss 0.34055. lr 3.493406e-04:  89%|████████▉ | 4561/5098 [34:44<04:51,  1.84it/s][A
epoch 1 iter 4561: train loss 0.34067. lr 3.492494e-04:  89%|████████▉ | 4561/5098 [34:44<04:51,  1.84it/s][A
epoch 1 iter 4561: train loss 0.34067. lr 3.492494e-04:  89%|████████▉ | 4562/5098 [34:44<04:27,  2.00it/s][A
epoch 1 iter 4562: train loss 0.33892. lr 3.491582e-04:  89%|████████▉ | 4562/5098 [34:45<04:27,  2.00it/s][A

epoch 1 iter 4594: train loss 0.33415. lr 3.462373e-04:  90%|█████████ | 4595/5098 [34:59<04:22,  1.92it/s][A
epoch 1 iter 4595: train loss 0.33981. lr 3.461460e-04:  90%|█████████ | 4595/5098 [34:59<04:22,  1.92it/s][A
epoch 1 iter 4595: train loss 0.33981. lr 3.461460e-04:  90%|█████████ | 4596/5098 [34:59<04:38,  1.80it/s][A
epoch 1 iter 4596: train loss 0.33313. lr 3.460546e-04:  90%|█████████ | 4596/5098 [35:00<04:38,  1.80it/s][A
epoch 1 iter 4596: train loss 0.33313. lr 3.460546e-04:  90%|█████████ | 4597/5098 [35:00<04:41,  1.78it/s][A
epoch 1 iter 4597: train loss 0.33401. lr 3.459633e-04:  90%|█████████ | 4597/5098 [35:00<04:41,  1.78it/s][A
epoch 1 iter 4597: train loss 0.33401. lr 3.459633e-04:  90%|█████████ | 4598/5098 [35:00<04:36,  1.81it/s][A
epoch 1 iter 4598: train loss 0.33943. lr 3.458719e-04:  90%|█████████ | 4598/5098 [35:01<04:36,  1.81it/s][A
epoch 1 iter 4598: train loss 0.33943. lr 3.458719e-04:  90%|█████████ | 4599/5098 [35:01<04:32,  1.83it/s][A

epoch 1 iter 4631: train loss 0.33407. lr 3.428544e-04:  91%|█████████ | 4631/5098 [35:17<03:09,  2.46it/s][A
epoch 1 iter 4631: train loss 0.33407. lr 3.428544e-04:  91%|█████████ | 4632/5098 [35:17<03:08,  2.47it/s][A
epoch 1 iter 4632: train loss 0.33053. lr 3.427629e-04:  91%|█████████ | 4632/5098 [35:18<03:08,  2.47it/s][A
epoch 1 iter 4632: train loss 0.33053. lr 3.427629e-04:  91%|█████████ | 4633/5098 [35:18<03:10,  2.45it/s][A
epoch 1 iter 4633: train loss 0.33841. lr 3.426714e-04:  91%|█████████ | 4633/5098 [35:18<03:10,  2.45it/s][A
epoch 1 iter 4633: train loss 0.33841. lr 3.426714e-04:  91%|█████████ | 4634/5098 [35:18<03:19,  2.33it/s][A
epoch 1 iter 4634: train loss 0.32942. lr 3.425799e-04:  91%|█████████ | 4634/5098 [35:19<03:19,  2.33it/s][A
epoch 1 iter 4634: train loss 0.32942. lr 3.425799e-04:  91%|█████████ | 4635/5098 [35:19<03:19,  2.32it/s][A
epoch 1 iter 4635: train loss 0.33279. lr 3.424884e-04:  91%|█████████ | 4635/5098 [35:19<03:19,  2.32it/s][A

epoch 1 iter 4667: train loss 0.32895. lr 3.395576e-04:  92%|█████████▏| 4668/5098 [35:33<03:21,  2.13it/s][A
epoch 1 iter 4668: train loss 0.32668. lr 3.394660e-04:  92%|█████████▏| 4668/5098 [35:34<03:21,  2.13it/s][A
epoch 1 iter 4668: train loss 0.32668. lr 3.394660e-04:  92%|█████████▏| 4669/5098 [35:34<03:10,  2.25it/s][A
epoch 1 iter 4669: train loss 0.33544. lr 3.393743e-04:  92%|█████████▏| 4669/5098 [35:34<03:10,  2.25it/s][A
epoch 1 iter 4669: train loss 0.33544. lr 3.393743e-04:  92%|█████████▏| 4670/5098 [35:34<03:03,  2.34it/s][A
epoch 1 iter 4670: train loss 0.32692. lr 3.392827e-04:  92%|█████████▏| 4670/5098 [35:35<03:03,  2.34it/s][A
epoch 1 iter 4670: train loss 0.32692. lr 3.392827e-04:  92%|█████████▏| 4671/5098 [35:35<02:58,  2.39it/s][A
epoch 1 iter 4671: train loss 0.32924. lr 3.391910e-04:  92%|█████████▏| 4671/5098 [35:35<02:58,  2.39it/s][A
epoch 1 iter 4671: train loss 0.32924. lr 3.391910e-04:  92%|█████████▏| 4672/5098 [35:35<02:58,  2.39it/s][A

epoch 1 iter 4704: train loss 0.32254. lr 3.361641e-04:  92%|█████████▏| 4704/5098 [35:50<02:40,  2.45it/s][A
epoch 1 iter 4704: train loss 0.32254. lr 3.361641e-04:  92%|█████████▏| 4705/5098 [35:50<04:13,  1.55it/s][A
epoch 1 iter 4705: train loss 0.32731. lr 3.360724e-04:  92%|█████████▏| 4705/5098 [35:51<04:13,  1.55it/s][A
epoch 1 iter 4705: train loss 0.32731. lr 3.360724e-04:  92%|█████████▏| 4706/5098 [35:51<04:26,  1.47it/s][A
epoch 1 iter 4706: train loss 0.32816. lr 3.359806e-04:  92%|█████████▏| 4706/5098 [35:52<04:26,  1.47it/s][A
epoch 1 iter 4706: train loss 0.32816. lr 3.359806e-04:  92%|█████████▏| 4707/5098 [35:52<04:21,  1.49it/s][A
epoch 1 iter 4707: train loss 0.32767. lr 3.358888e-04:  92%|█████████▏| 4707/5098 [35:52<04:21,  1.49it/s][A
epoch 1 iter 4707: train loss 0.32767. lr 3.358888e-04:  92%|█████████▏| 4708/5098 [35:52<04:20,  1.50it/s][A
epoch 1 iter 4708: train loss 0.32036. lr 3.357970e-04:  92%|█████████▏| 4708/5098 [35:53<04:20,  1.50it/s][A

epoch 1 iter 4740: train loss 0.31829. lr 3.328579e-04:  93%|█████████▎| 4741/5098 [36:07<02:25,  2.46it/s][A
epoch 1 iter 4741: train loss 0.33314. lr 3.327660e-04:  93%|█████████▎| 4741/5098 [36:07<02:25,  2.46it/s][A
epoch 1 iter 4741: train loss 0.33314. lr 3.327660e-04:  93%|█████████▎| 4742/5098 [36:07<02:28,  2.39it/s][A
epoch 1 iter 4742: train loss 0.32359. lr 3.326741e-04:  93%|█████████▎| 4742/5098 [36:08<02:28,  2.39it/s][A
epoch 1 iter 4742: train loss 0.32359. lr 3.326741e-04:  93%|█████████▎| 4743/5098 [36:08<02:39,  2.22it/s][A
epoch 1 iter 4743: train loss 0.32084. lr 3.325822e-04:  93%|█████████▎| 4743/5098 [36:08<02:39,  2.22it/s][A
epoch 1 iter 4743: train loss 0.32084. lr 3.325822e-04:  93%|█████████▎| 4744/5098 [36:08<02:46,  2.13it/s][A
epoch 1 iter 4744: train loss 0.32225. lr 3.324903e-04:  93%|█████████▎| 4744/5098 [36:09<02:46,  2.13it/s][A
epoch 1 iter 4744: train loss 0.32225. lr 3.324903e-04:  93%|█████████▎| 4745/5098 [36:09<02:44,  2.14it/s][A

epoch 1 iter 4777: train loss 0.32175. lr 3.294556e-04:  94%|█████████▎| 4777/5098 [36:24<02:22,  2.25it/s][A
epoch 1 iter 4777: train loss 0.32175. lr 3.294556e-04:  94%|█████████▎| 4778/5098 [36:24<02:20,  2.28it/s][A
epoch 1 iter 4778: train loss 0.32382. lr 3.293636e-04:  94%|█████████▎| 4778/5098 [36:25<02:20,  2.28it/s][A
epoch 1 iter 4778: train loss 0.32382. lr 3.293636e-04:  94%|█████████▎| 4779/5098 [36:25<02:17,  2.31it/s][A
epoch 1 iter 4779: train loss 0.32241. lr 3.292715e-04:  94%|█████████▎| 4779/5098 [36:25<02:17,  2.31it/s][A
epoch 1 iter 4779: train loss 0.32241. lr 3.292715e-04:  94%|█████████▍| 4780/5098 [36:25<02:16,  2.32it/s][A
epoch 1 iter 4780: train loss 0.31961. lr 3.291795e-04:  94%|█████████▍| 4780/5098 [36:25<02:16,  2.32it/s][A
epoch 1 iter 4780: train loss 0.31961. lr 3.291795e-04:  94%|█████████▍| 4781/5098 [36:25<02:15,  2.34it/s][A
epoch 1 iter 4781: train loss 0.32185. lr 3.290875e-04:  94%|█████████▍| 4781/5098 [36:26<02:15,  2.34it/s][A

epoch 1 iter 4813: train loss 0.32504. lr 3.261415e-04:  94%|█████████▍| 4814/5098 [36:40<01:58,  2.41it/s][A
epoch 1 iter 4814: train loss 0.31907. lr 3.260494e-04:  94%|█████████▍| 4814/5098 [36:41<01:58,  2.41it/s][A
epoch 1 iter 4814: train loss 0.31907. lr 3.260494e-04:  94%|█████████▍| 4815/5098 [36:41<01:56,  2.43it/s][A
epoch 1 iter 4815: train loss 0.31429. lr 3.259573e-04:  94%|█████████▍| 4815/5098 [36:41<01:56,  2.43it/s][A
epoch 1 iter 4815: train loss 0.31429. lr 3.259573e-04:  94%|█████████▍| 4816/5098 [36:41<01:55,  2.44it/s][A
epoch 1 iter 4816: train loss 0.31653. lr 3.258652e-04:  94%|█████████▍| 4816/5098 [36:42<01:55,  2.44it/s][A
epoch 1 iter 4816: train loss 0.31653. lr 3.258652e-04:  94%|█████████▍| 4817/5098 [36:42<01:55,  2.44it/s][A
epoch 1 iter 4817: train loss 0.32155. lr 3.257731e-04:  94%|█████████▍| 4817/5098 [36:42<01:55,  2.44it/s][A
epoch 1 iter 4817: train loss 0.32155. lr 3.257731e-04:  95%|█████████▍| 4818/5098 [36:42<01:55,  2.43it/s][A

epoch 1 iter 4850: train loss 0.31720. lr 3.227321e-04:  95%|█████████▌| 4850/5098 [36:57<02:44,  1.51it/s][A
epoch 1 iter 4850: train loss 0.31720. lr 3.227321e-04:  95%|█████████▌| 4851/5098 [36:57<02:36,  1.58it/s][A
epoch 1 iter 4851: train loss 0.32064. lr 3.226399e-04:  95%|█████████▌| 4851/5098 [36:58<02:36,  1.58it/s][A
epoch 1 iter 4851: train loss 0.32064. lr 3.226399e-04:  95%|█████████▌| 4852/5098 [36:58<02:31,  1.63it/s][A
epoch 1 iter 4852: train loss 0.32063. lr 3.225477e-04:  95%|█████████▌| 4852/5098 [36:58<02:31,  1.63it/s][A
epoch 1 iter 4852: train loss 0.32063. lr 3.225477e-04:  95%|█████████▌| 4853/5098 [36:58<02:26,  1.67it/s][A
epoch 1 iter 4853: train loss 0.31224. lr 3.224555e-04:  95%|█████████▌| 4853/5098 [36:59<02:26,  1.67it/s][A
epoch 1 iter 4853: train loss 0.31224. lr 3.224555e-04:  95%|█████████▌| 4854/5098 [36:59<02:20,  1.74it/s][A
epoch 1 iter 4854: train loss 0.32083. lr 3.223633e-04:  95%|█████████▌| 4854/5098 [36:59<02:20,  1.74it/s][A

epoch 1 iter 4886: train loss 0.31755. lr 3.194119e-04:  96%|█████████▌| 4887/5098 [37:14<01:32,  2.29it/s][A
epoch 1 iter 4887: train loss 0.32354. lr 3.193197e-04:  96%|█████████▌| 4887/5098 [37:14<01:32,  2.29it/s][A
epoch 1 iter 4887: train loss 0.32354. lr 3.193197e-04:  96%|█████████▌| 4888/5098 [37:14<01:29,  2.35it/s][A
epoch 1 iter 4888: train loss 0.31185. lr 3.192274e-04:  96%|█████████▌| 4888/5098 [37:14<01:29,  2.35it/s][A
epoch 1 iter 4888: train loss 0.31185. lr 3.192274e-04:  96%|█████████▌| 4889/5098 [37:14<01:27,  2.40it/s][A
epoch 1 iter 4889: train loss 0.31733. lr 3.191351e-04:  96%|█████████▌| 4889/5098 [37:15<01:27,  2.40it/s][A
epoch 1 iter 4889: train loss 0.31733. lr 3.191351e-04:  96%|█████████▌| 4890/5098 [37:15<01:24,  2.45it/s][A
epoch 1 iter 4890: train loss 0.31417. lr 3.190429e-04:  96%|█████████▌| 4890/5098 [37:15<01:24,  2.45it/s][A
epoch 1 iter 4890: train loss 0.31417. lr 3.190429e-04:  96%|█████████▌| 4891/5098 [37:15<01:23,  2.48it/s][A

epoch 1 iter 4923: train loss 0.31002. lr 3.159971e-04:  97%|█████████▋| 4923/5098 [37:30<01:19,  2.20it/s][A
epoch 1 iter 4923: train loss 0.31002. lr 3.159971e-04:  97%|█████████▋| 4924/5098 [37:30<01:15,  2.31it/s][A
epoch 1 iter 4924: train loss 0.31350. lr 3.159047e-04:  97%|█████████▋| 4924/5098 [37:30<01:15,  2.31it/s][A
epoch 1 iter 4924: train loss 0.31350. lr 3.159047e-04:  97%|█████████▋| 4925/5098 [37:30<01:12,  2.40it/s][A
epoch 1 iter 4925: train loss 0.30797. lr 3.158124e-04:  97%|█████████▋| 4925/5098 [37:30<01:12,  2.40it/s][A
epoch 1 iter 4925: train loss 0.30797. lr 3.158124e-04:  97%|█████████▋| 4926/5098 [37:30<01:11,  2.40it/s][A
epoch 1 iter 4926: train loss 0.32079. lr 3.157201e-04:  97%|█████████▋| 4926/5098 [37:31<01:11,  2.40it/s][A
epoch 1 iter 4926: train loss 0.32079. lr 3.157201e-04:  97%|█████████▋| 4927/5098 [37:31<01:16,  2.23it/s][A
epoch 1 iter 4927: train loss 0.30995. lr 3.156278e-04:  97%|█████████▋| 4927/5098 [37:31<01:16,  2.23it/s][A

epoch 1 iter 4959: train loss 0.30824. lr 3.126725e-04:  97%|█████████▋| 4960/5098 [37:48<02:08,  1.07it/s][A
epoch 1 iter 4960: train loss 0.31372. lr 3.125801e-04:  97%|█████████▋| 4960/5098 [37:48<02:08,  1.07it/s][A
epoch 1 iter 4960: train loss 0.31372. lr 3.125801e-04:  97%|█████████▋| 4961/5098 [37:48<01:59,  1.14it/s][A
epoch 1 iter 4961: train loss 0.30731. lr 3.124878e-04:  97%|█████████▋| 4961/5098 [37:49<01:59,  1.14it/s][A
epoch 1 iter 4961: train loss 0.30731. lr 3.124878e-04:  97%|█████████▋| 4962/5098 [37:49<01:53,  1.20it/s][A
epoch 1 iter 4962: train loss 0.31567. lr 3.123954e-04:  97%|█████████▋| 4962/5098 [37:50<01:53,  1.20it/s][A
epoch 1 iter 4962: train loss 0.31567. lr 3.123954e-04:  97%|█████████▋| 4963/5098 [37:50<01:45,  1.28it/s][A
epoch 1 iter 4963: train loss 0.30592. lr 3.123030e-04:  97%|█████████▋| 4963/5098 [37:50<01:45,  1.28it/s][A
epoch 1 iter 4963: train loss 0.30592. lr 3.123030e-04:  97%|█████████▋| 4964/5098 [37:50<01:38,  1.36it/s][A

epoch 1 iter 4996: train loss 0.30964. lr 3.092540e-04:  98%|█████████▊| 4996/5098 [38:05<00:48,  2.09it/s][A
epoch 1 iter 4996: train loss 0.30964. lr 3.092540e-04:  98%|█████████▊| 4997/5098 [38:05<00:48,  2.08it/s][A
epoch 1 iter 4997: train loss 0.30862. lr 3.091616e-04:  98%|█████████▊| 4997/5098 [38:06<00:48,  2.08it/s][A
epoch 1 iter 4997: train loss 0.30862. lr 3.091616e-04:  98%|█████████▊| 4998/5098 [38:06<00:48,  2.06it/s][A
epoch 1 iter 4998: train loss 0.31197. lr 3.090691e-04:  98%|█████████▊| 4998/5098 [38:06<00:48,  2.06it/s][A
epoch 1 iter 4998: train loss 0.31197. lr 3.090691e-04:  98%|█████████▊| 4999/5098 [38:06<00:48,  2.05it/s][A
epoch 1 iter 4999: train loss 0.30888. lr 3.089767e-04:  98%|█████████▊| 4999/5098 [38:06<00:48,  2.05it/s][A
epoch 1 iter 4999: train loss 0.30888. lr 3.089767e-04:  98%|█████████▊| 5000/5098 [38:06<00:47,  2.05it/s][A
epoch 1 iter 5000: train loss 0.30680. lr 3.088843e-04:  98%|█████████▊| 5000/5098 [38:07<00:47,  2.05it/s][A

epoch 1 iter 5032: train loss 0.30525. lr 3.059267e-04:  99%|█████████▊| 5033/5098 [38:21<00:30,  2.11it/s][A
epoch 1 iter 5033: train loss 0.30849. lr 3.058342e-04:  99%|█████████▊| 5033/5098 [38:21<00:30,  2.11it/s][A
epoch 1 iter 5033: train loss 0.30849. lr 3.058342e-04:  99%|█████████▊| 5034/5098 [38:21<00:28,  2.22it/s][A
epoch 1 iter 5034: train loss 0.30238. lr 3.057418e-04:  99%|█████████▊| 5034/5098 [38:21<00:28,  2.22it/s][A
epoch 1 iter 5034: train loss 0.30238. lr 3.057418e-04:  99%|█████████▉| 5035/5098 [38:21<00:27,  2.31it/s][A
epoch 1 iter 5035: train loss 0.30383. lr 3.056494e-04:  99%|█████████▉| 5035/5098 [38:22<00:27,  2.31it/s][A
epoch 1 iter 5035: train loss 0.30383. lr 3.056494e-04:  99%|█████████▉| 5036/5098 [38:22<00:26,  2.36it/s][A
epoch 1 iter 5036: train loss 0.30304. lr 3.055569e-04:  99%|█████████▉| 5036/5098 [38:22<00:26,  2.36it/s][A
epoch 1 iter 5036: train loss 0.30304. lr 3.055569e-04:  99%|█████████▉| 5037/5098 [38:22<00:26,  2.27it/s][A

epoch 1 iter 5069: train loss 0.30484. lr 3.025062e-04:  99%|█████████▉| 5069/5098 [38:39<00:11,  2.48it/s][A
epoch 1 iter 5069: train loss 0.30484. lr 3.025062e-04:  99%|█████████▉| 5070/5098 [38:39<00:11,  2.50it/s][A
epoch 1 iter 5070: train loss 0.30313. lr 3.024137e-04:  99%|█████████▉| 5070/5098 [38:40<00:11,  2.50it/s][A
epoch 1 iter 5070: train loss 0.30313. lr 3.024137e-04:  99%|█████████▉| 5071/5098 [38:40<00:10,  2.51it/s][A
epoch 1 iter 5071: train loss 0.30244. lr 3.023213e-04:  99%|█████████▉| 5071/5098 [38:40<00:10,  2.51it/s][A
epoch 1 iter 5071: train loss 0.30244. lr 3.023213e-04:  99%|█████████▉| 5072/5098 [38:40<00:10,  2.45it/s][A
epoch 1 iter 5072: train loss 0.30603. lr 3.022288e-04:  99%|█████████▉| 5072/5098 [38:41<00:10,  2.45it/s][A
epoch 1 iter 5072: train loss 0.30603. lr 3.022288e-04: 100%|█████████▉| 5073/5098 [38:41<00:10,  2.40it/s][A
epoch 1 iter 5073: train loss 0.30310. lr 3.021364e-04: 100%|█████████▉| 5073/5098 [38:41<00:10,  2.40it/s][A

epoch 2 iter 8: train loss 0.30681. lr 2.991751e-04:   0%|          | 8/5098 [00:03<38:08,  2.22it/s][A
epoch 2 iter 8: train loss 0.30681. lr 2.991751e-04:   0%|          | 9/5098 [00:03<36:23,  2.33it/s][A
epoch 2 iter 9: train loss 0.30892. lr 2.990827e-04:   0%|          | 9/5098 [00:04<36:23,  2.33it/s][A
epoch 2 iter 9: train loss 0.30892. lr 2.990827e-04:   0%|          | 10/5098 [00:04<36:34,  2.32it/s][A
epoch 2 iter 10: train loss 0.31340. lr 2.989902e-04:   0%|          | 10/5098 [00:04<36:34,  2.32it/s][A
epoch 2 iter 10: train loss 0.31340. lr 2.989902e-04:   0%|          | 11/5098 [00:04<37:02,  2.29it/s][A
epoch 2 iter 11: train loss 0.30884. lr 2.988978e-04:   0%|          | 11/5098 [00:05<37:02,  2.29it/s][A
epoch 2 iter 11: train loss 0.30884. lr 2.988978e-04:   0%|          | 12/5098 [00:05<37:26,  2.26it/s][A
epoch 2 iter 12: train loss 0.30721. lr 2.988053e-04:   0%|          | 12/5098 [00:05<37:26,  2.26it/s][A

epoch 2 iter 46: train loss 0.29598. lr 2.956620e-04:   1%|          | 46/5098 [00:21<41:04,  2.05it/s][A
epoch 2 iter 46: train loss 0.29598. lr 2.956620e-04:   1%|          | 47/5098 [00:21<42:04,  2.00it/s][A
epoch 2 iter 47: train loss 0.29897. lr 2.955695e-04:   1%|          | 47/5098 [00:21<42:04,  2.00it/s][A
epoch 2 iter 47: train loss 0.29897. lr 2.955695e-04:   1%|          | 48/5098 [00:21<39:38,  2.12it/s][A
epoch 2 iter 48: train loss 0.29755. lr 2.954771e-04:   1%|          | 48/5098 [00:22<39:38,  2.12it/s][A
epoch 2 iter 48: train loss 0.29755. lr 2.954771e-04:   1%|          | 49/5098 [00:22<37:22,  2.25it/s][A
epoch 2 iter 49: train loss 0.29162. lr 2.953847e-04:   1%|          | 49/5098 [00:22<37:22,  2.25it/s][A
epoch 2 iter 49: train loss 0.29162. lr 2.953847e-04:   1%|          | 50/5098 [00:22<35:46,  2.35it/s][A
epoch 2 iter 50: train loss 0.29644. lr 2.952922e-04:   1%|          | 50/5098 [00:23<35:46,  2.35it/s][A

epoch 2 iter 84: train loss 0.29488. lr 2.921494e-04:   2%|▏         | 84/5098 [00:39<39:48,  2.10it/s][A
epoch 2 iter 84: train loss 0.29488. lr 2.921494e-04:   2%|▏         | 85/5098 [00:39<40:17,  2.07it/s][A
epoch 2 iter 85: train loss 0.29611. lr 2.920570e-04:   2%|▏         | 85/5098 [00:39<40:17,  2.07it/s][A
epoch 2 iter 85: train loss 0.29611. lr 2.920570e-04:   2%|▏         | 86/5098 [00:39<40:30,  2.06it/s][A
epoch 2 iter 86: train loss 0.29531. lr 2.919646e-04:   2%|▏         | 86/5098 [00:40<40:30,  2.06it/s][A
epoch 2 iter 86: train loss 0.29531. lr 2.919646e-04:   2%|▏         | 87/5098 [00:40<40:40,  2.05it/s][A
epoch 2 iter 87: train loss 0.29649. lr 2.918722e-04:   2%|▏         | 87/5098 [00:40<40:40,  2.05it/s][A
epoch 2 iter 87: train loss 0.29649. lr 2.918722e-04:   2%|▏         | 88/5098 [00:40<40:24,  2.07it/s][A
epoch 2 iter 88: train loss 0.29265. lr 2.917798e-04:   2%|▏         | 88/5098 [00:41<40:24,  2.07it/s][A

epoch 2 iter 121: train loss 0.29049. lr 2.887304e-04:   2%|▏         | 122/5098 [00:56<46:28,  1.78it/s][A
epoch 2 iter 122: train loss 0.29363. lr 2.886380e-04:   2%|▏         | 122/5098 [00:56<46:28,  1.78it/s][A
epoch 2 iter 122: train loss 0.29363. lr 2.886380e-04:   2%|▏         | 123/5098 [00:56<47:09,  1.76it/s][A
epoch 2 iter 123: train loss 0.29202. lr 2.885456e-04:   2%|▏         | 123/5098 [00:57<47:09,  1.76it/s][A
epoch 2 iter 123: train loss 0.29202. lr 2.885456e-04:   2%|▏         | 124/5098 [00:57<47:58,  1.73it/s][A
epoch 2 iter 124: train loss 0.29518. lr 2.884532e-04:   2%|▏         | 124/5098 [00:58<47:58,  1.73it/s][A
epoch 2 iter 124: train loss 0.29518. lr 2.884532e-04:   2%|▏         | 125/5098 [00:58<47:59,  1.73it/s][A
epoch 2 iter 125: train loss 0.29262. lr 2.883608e-04:   2%|▏         | 125/5098 [00:58<47:59,  1.73it/s][A
epoch 2 iter 125: train loss 0.29262. lr 2.883608e-04:   2%|▏         | 126/5098 [00:58<49:00,  1.69it/s][A

epoch 2 iter 159: train loss 0.29846. lr 2.852204e-04:   3%|▎         | 159/5098 [01:13<33:09,  2.48it/s][A
epoch 2 iter 159: train loss 0.29846. lr 2.852204e-04:   3%|▎         | 160/5098 [01:13<32:22,  2.54it/s][A
epoch 2 iter 160: train loss 0.29144. lr 2.851281e-04:   3%|▎         | 160/5098 [01:14<32:22,  2.54it/s][A
epoch 2 iter 160: train loss 0.29144. lr 2.851281e-04:   3%|▎         | 161/5098 [01:14<32:54,  2.50it/s][A
epoch 2 iter 161: train loss 0.28667. lr 2.850357e-04:   3%|▎         | 161/5098 [01:14<32:54,  2.50it/s][A
epoch 2 iter 161: train loss 0.28667. lr 2.850357e-04:   3%|▎         | 162/5098 [01:14<34:23,  2.39it/s][A
epoch 2 iter 162: train loss 0.29295. lr 2.849434e-04:   3%|▎         | 162/5098 [01:15<34:23,  2.39it/s][A
epoch 2 iter 162: train loss 0.29295. lr 2.849434e-04:   3%|▎         | 163/5098 [01:15<35:30,  2.32it/s][A
epoch 2 iter 163: train loss 0.29442. lr 2.848510e-04:   3%|▎         | 163/5098 [01:16<35:30,  2.32it/s][A

epoch 2 iter 196: train loss 0.29220. lr 2.818048e-04:   4%|▍         | 197/5098 [01:31<32:54,  2.48it/s][A
epoch 2 iter 197: train loss 0.28890. lr 2.817125e-04:   4%|▍         | 197/5098 [01:32<32:54,  2.48it/s][A
epoch 2 iter 197: train loss 0.28890. lr 2.817125e-04:   4%|▍         | 198/5098 [01:32<38:00,  2.15it/s][A
epoch 2 iter 198: train loss 0.29170. lr 2.816202e-04:   4%|▍         | 198/5098 [01:32<38:00,  2.15it/s][A
epoch 2 iter 198: train loss 0.29170. lr 2.816202e-04:   4%|▍         | 199/5098 [01:32<39:28,  2.07it/s][A
epoch 2 iter 199: train loss 0.28879. lr 2.815279e-04:   4%|▍         | 199/5098 [01:33<39:28,  2.07it/s][A
epoch 2 iter 199: train loss 0.28879. lr 2.815279e-04:   4%|▍         | 200/5098 [01:33<40:14,  2.03it/s][A
epoch 2 iter 200: train loss 0.28631. lr 2.814356e-04:   4%|▍         | 200/5098 [01:33<40:14,  2.03it/s][A
epoch 2 iter 200: train loss 0.28631. lr 2.814356e-04:   4%|▍         | 201/5098 [01:33<38:49,  2.10it/s][A

epoch 2 iter 234: train loss 0.29070. lr 2.782993e-04:   5%|▍         | 234/5098 [01:48<35:05,  2.31it/s][A
epoch 2 iter 234: train loss 0.29070. lr 2.782993e-04:   5%|▍         | 235/5098 [01:48<33:47,  2.40it/s][A
epoch 2 iter 235: train loss 0.28523. lr 2.782070e-04:   5%|▍         | 235/5098 [01:48<33:47,  2.40it/s][A
epoch 2 iter 235: train loss 0.28523. lr 2.782070e-04:   5%|▍         | 236/5098 [01:48<33:50,  2.39it/s][A
epoch 2 iter 236: train loss 0.28655. lr 2.781148e-04:   5%|▍         | 236/5098 [01:48<33:50,  2.39it/s][A
epoch 2 iter 236: train loss 0.28655. lr 2.781148e-04:   5%|▍         | 237/5098 [01:48<34:14,  2.37it/s][A
epoch 2 iter 237: train loss 0.28805. lr 2.780226e-04:   5%|▍         | 237/5098 [01:49<34:14,  2.37it/s][A
epoch 2 iter 237: train loss 0.28805. lr 2.780226e-04:   5%|▍         | 238/5098 [01:49<34:28,  2.35it/s][A
epoch 2 iter 238: train loss 0.28952. lr 2.779304e-04:   5%|▍         | 238/5098 [01:49<34:28,  2.35it/s][A

epoch 2 iter 271: train loss 0.28447. lr 2.748889e-04:   5%|▌         | 272/5098 [02:06<50:19,  1.60it/s][A
epoch 2 iter 272: train loss 0.28593. lr 2.747967e-04:   5%|▌         | 272/5098 [02:07<50:19,  1.60it/s][A
epoch 2 iter 272: train loss 0.28593. lr 2.747967e-04:   5%|▌         | 273/5098 [02:07<48:17,  1.67it/s][A
epoch 2 iter 273: train loss 0.28871. lr 2.747046e-04:   5%|▌         | 273/5098 [02:07<48:17,  1.67it/s][A
epoch 2 iter 273: train loss 0.28871. lr 2.747046e-04:   5%|▌         | 274/5098 [02:07<46:37,  1.72it/s][A
epoch 2 iter 274: train loss 0.29067. lr 2.746125e-04:   5%|▌         | 274/5098 [02:08<46:37,  1.72it/s][A
epoch 2 iter 274: train loss 0.29067. lr 2.746125e-04:   5%|▌         | 275/5098 [02:08<44:25,  1.81it/s][A
epoch 2 iter 275: train loss 0.28793. lr 2.745204e-04:   5%|▌         | 275/5098 [02:08<44:25,  1.81it/s][A
epoch 2 iter 275: train loss 0.28793. lr 2.745204e-04:   5%|▌         | 276/5098 [02:08<42:17,  1.90it/s][A

epoch 2 iter 309: train loss 0.28263. lr 2.713897e-04:   6%|▌         | 309/5098 [02:23<33:38,  2.37it/s][A
epoch 2 iter 309: train loss 0.28263. lr 2.713897e-04:   6%|▌         | 310/5098 [02:23<33:01,  2.42it/s][A
epoch 2 iter 310: train loss 0.28375. lr 2.712977e-04:   6%|▌         | 310/5098 [02:23<33:01,  2.42it/s][A
epoch 2 iter 310: train loss 0.28375. lr 2.712977e-04:   6%|▌         | 311/5098 [02:23<32:35,  2.45it/s][A
epoch 2 iter 311: train loss 0.27897. lr 2.712056e-04:   6%|▌         | 311/5098 [02:23<32:35,  2.45it/s][A
epoch 2 iter 311: train loss 0.27897. lr 2.712056e-04:   6%|▌         | 312/5098 [02:23<33:20,  2.39it/s][A
epoch 2 iter 312: train loss 0.28598. lr 2.711136e-04:   6%|▌         | 312/5098 [02:24<33:20,  2.39it/s][A
epoch 2 iter 312: train loss 0.28598. lr 2.711136e-04:   6%|▌         | 313/5098 [02:24<33:42,  2.37it/s][A
epoch 2 iter 313: train loss 0.28495. lr 2.710216e-04:   6%|▌         | 313/5098 [02:24<33:42,  2.37it/s][A

epoch 2 iter 346: train loss 0.28358. lr 2.679864e-04:   7%|▋         | 347/5098 [02:39<36:20,  2.18it/s][A
epoch 2 iter 347: train loss 0.28161. lr 2.678945e-04:   7%|▋         | 347/5098 [02:40<36:20,  2.18it/s][A
epoch 2 iter 347: train loss 0.28161. lr 2.678945e-04:   7%|▋         | 348/5098 [02:40<40:42,  1.94it/s][A
epoch 2 iter 348: train loss 0.28158. lr 2.678025e-04:   7%|▋         | 348/5098 [02:40<40:42,  1.94it/s][A
epoch 2 iter 348: train loss 0.28158. lr 2.678025e-04:   7%|▋         | 349/5098 [02:40<44:32,  1.78it/s][A
epoch 2 iter 349: train loss 0.28058. lr 2.677106e-04:   7%|▋         | 349/5098 [02:41<44:32,  1.78it/s][A
epoch 2 iter 349: train loss 0.28058. lr 2.677106e-04:   7%|▋         | 350/5098 [02:41<47:16,  1.67it/s][A
epoch 2 iter 350: train loss 0.28146. lr 2.676187e-04:   7%|▋         | 350/5098 [02:42<47:16,  1.67it/s][A
epoch 2 iter 350: train loss 0.28146. lr 2.676187e-04:   7%|▋         | 351/5098 [02:42<49:01,  1.61it/s][A

epoch 2 iter 384: train loss 0.28077. lr 2.644954e-04:   8%|▊         | 384/5098 [02:58<39:23,  1.99it/s][A
epoch 2 iter 384: train loss 0.28077. lr 2.644954e-04:   8%|▊         | 385/5098 [02:58<38:44,  2.03it/s][A
epoch 2 iter 385: train loss 0.28574. lr 2.644036e-04:   8%|▊         | 385/5098 [02:59<38:44,  2.03it/s][A
epoch 2 iter 385: train loss 0.28574. lr 2.644036e-04:   8%|▊         | 386/5098 [02:59<38:32,  2.04it/s][A
epoch 2 iter 386: train loss 0.28045. lr 2.643118e-04:   8%|▊         | 386/5098 [02:59<38:32,  2.04it/s][A
epoch 2 iter 386: train loss 0.28045. lr 2.643118e-04:   8%|▊         | 387/5098 [02:59<37:42,  2.08it/s][A
epoch 2 iter 387: train loss 0.27934. lr 2.642200e-04:   8%|▊         | 387/5098 [03:00<37:42,  2.08it/s][A
epoch 2 iter 387: train loss 0.27934. lr 2.642200e-04:   8%|▊         | 388/5098 [03:00<36:56,  2.12it/s][A
epoch 2 iter 388: train loss 0.28204. lr 2.641282e-04:   8%|▊         | 388/5098 [03:00<36:56,  2.12it/s][A

epoch 2 iter 421: train loss 0.28048. lr 2.611010e-04:   8%|▊         | 422/5098 [03:15<32:04,  2.43it/s][A
epoch 2 iter 422: train loss 0.27685. lr 2.610093e-04:   8%|▊         | 422/5098 [03:15<32:04,  2.43it/s][A
epoch 2 iter 422: train loss 0.27685. lr 2.610093e-04:   8%|▊         | 423/5098 [03:15<31:55,  2.44it/s][A
epoch 2 iter 423: train loss 0.27381. lr 2.609177e-04:   8%|▊         | 423/5098 [03:16<31:55,  2.44it/s][A
epoch 2 iter 423: train loss 0.27381. lr 2.609177e-04:   8%|▊         | 424/5098 [03:16<31:47,  2.45it/s][A
epoch 2 iter 424: train loss 0.28255. lr 2.608260e-04:   8%|▊         | 424/5098 [03:16<31:47,  2.45it/s][A
epoch 2 iter 424: train loss 0.28255. lr 2.608260e-04:   8%|▊         | 425/5098 [03:16<31:43,  2.46it/s][A
epoch 2 iter 425: train loss 0.27846. lr 2.607344e-04:   8%|▊         | 425/5098 [03:17<31:43,  2.46it/s][A
epoch 2 iter 425: train loss 0.27846. lr 2.607344e-04:   8%|▊         | 426/5098 [03:17<31:46,  2.45it/s][A

epoch 2 iter 459: train loss 0.27899. lr 2.576201e-04:   9%|▉         | 459/5098 [03:34<46:47,  1.65it/s][A
epoch 2 iter 459: train loss 0.27899. lr 2.576201e-04:   9%|▉         | 460/5098 [03:34<45:50,  1.69it/s][A
epoch 2 iter 460: train loss 0.27625. lr 2.575286e-04:   9%|▉         | 460/5098 [03:34<45:50,  1.69it/s][A
epoch 2 iter 460: train loss 0.27625. lr 2.575286e-04:   9%|▉         | 461/5098 [03:34<44:08,  1.75it/s][A
epoch 2 iter 461: train loss 0.27676. lr 2.574371e-04:   9%|▉         | 461/5098 [03:35<44:08,  1.75it/s][A
epoch 2 iter 461: train loss 0.27676. lr 2.574371e-04:   9%|▉         | 462/5098 [03:35<42:53,  1.80it/s][A
epoch 2 iter 462: train loss 0.27512. lr 2.573456e-04:   9%|▉         | 462/5098 [03:35<42:53,  1.80it/s][A
epoch 2 iter 462: train loss 0.27512. lr 2.573456e-04:   9%|▉         | 463/5098 [03:35<41:15,  1.87it/s][A
epoch 2 iter 463: train loss 0.28432. lr 2.572541e-04:   9%|▉         | 463/5098 [03:36<41:15,  1.87it/s][A

epoch 2 iter 496: train loss 0.27742. lr 2.542364e-04:  10%|▉         | 497/5098 [03:51<41:21,  1.85it/s][A
epoch 2 iter 497: train loss 0.26814. lr 2.541451e-04:  10%|▉         | 497/5098 [03:51<41:21,  1.85it/s][A
epoch 2 iter 497: train loss 0.26814. lr 2.541451e-04:  10%|▉         | 498/5098 [03:51<38:15,  2.00it/s][A
epoch 2 iter 498: train loss 0.27238. lr 2.540537e-04:  10%|▉         | 498/5098 [03:51<38:15,  2.00it/s][A
epoch 2 iter 498: train loss 0.27238. lr 2.540537e-04:  10%|▉         | 499/5098 [03:51<35:51,  2.14it/s][A
epoch 2 iter 499: train loss 0.27286. lr 2.539623e-04:  10%|▉         | 499/5098 [03:52<35:51,  2.14it/s][A
epoch 2 iter 499: train loss 0.27286. lr 2.539623e-04:  10%|▉         | 500/5098 [03:52<34:00,  2.25it/s][A
epoch 2 iter 500: train loss 0.27104. lr 2.538710e-04:  10%|▉         | 500/5098 [03:52<34:00,  2.25it/s][A
epoch 2 iter 500: train loss 0.27104. lr 2.538710e-04:  10%|▉         | 501/5098 [03:52<32:57,  2.32it/s][A

epoch 2 iter 534: train loss 0.27337. lr 2.507675e-04:  10%|█         | 534/5098 [04:09<32:04,  2.37it/s][A
epoch 2 iter 534: train loss 0.27337. lr 2.507675e-04:  10%|█         | 535/5098 [04:09<32:09,  2.36it/s][A
epoch 2 iter 535: train loss 0.26686. lr 2.506763e-04:  10%|█         | 535/5098 [04:09<32:09,  2.36it/s][A
epoch 2 iter 535: train loss 0.26686. lr 2.506763e-04:  11%|█         | 536/5098 [04:09<32:14,  2.36it/s][A
epoch 2 iter 536: train loss 0.27665. lr 2.505851e-04:  11%|█         | 536/5098 [04:10<32:14,  2.36it/s][A
epoch 2 iter 536: train loss 0.27665. lr 2.505851e-04:  11%|█         | 537/5098 [04:10<31:23,  2.42it/s][A
epoch 2 iter 537: train loss 0.27803. lr 2.504939e-04:  11%|█         | 537/5098 [04:10<31:23,  2.42it/s][A
epoch 2 iter 537: train loss 0.27803. lr 2.504939e-04:  11%|█         | 538/5098 [04:10<30:45,  2.47it/s][A
epoch 2 iter 538: train loss 0.27534. lr 2.504027e-04:  11%|█         | 538/5098 [04:11<30:45,  2.47it/s][A

epoch 2 iter 571: train loss 0.27166. lr 2.473963e-04:  11%|█         | 572/5098 [04:25<35:29,  2.13it/s][A
epoch 2 iter 572: train loss 0.26910. lr 2.473053e-04:  11%|█         | 572/5098 [04:26<35:29,  2.13it/s][A
epoch 2 iter 572: train loss 0.26910. lr 2.473053e-04:  11%|█         | 573/5098 [04:26<34:17,  2.20it/s][A
epoch 2 iter 573: train loss 0.27505. lr 2.472143e-04:  11%|█         | 573/5098 [04:26<34:17,  2.20it/s][A
epoch 2 iter 573: train loss 0.27505. lr 2.472143e-04:  11%|█▏        | 574/5098 [04:26<40:14,  1.87it/s][A
epoch 2 iter 574: train loss 0.27316. lr 2.471232e-04:  11%|█▏        | 574/5098 [04:27<40:14,  1.87it/s][A
epoch 2 iter 574: train loss 0.27316. lr 2.471232e-04:  11%|█▏        | 575/5098 [04:27<42:46,  1.76it/s][A
epoch 2 iter 575: train loss 0.26940. lr 2.470322e-04:  11%|█▏        | 575/5098 [04:28<42:46,  1.76it/s][A
epoch 2 iter 575: train loss 0.26940. lr 2.470322e-04:  11%|█▏        | 576/5098 [04:28<42:35,  1.77it/s][A

epoch 2 iter 609: train loss 0.27276. lr 2.439411e-04:  12%|█▏        | 609/5098 [04:43<31:01,  2.41it/s][A
epoch 2 iter 609: train loss 0.27276. lr 2.439411e-04:  12%|█▏        | 610/5098 [04:43<30:54,  2.42it/s][A
epoch 2 iter 610: train loss 0.27247. lr 2.438503e-04:  12%|█▏        | 610/5098 [04:44<30:54,  2.42it/s][A
epoch 2 iter 610: train loss 0.27247. lr 2.438503e-04:  12%|█▏        | 611/5098 [04:44<30:52,  2.42it/s][A
epoch 2 iter 611: train loss 0.27153. lr 2.437595e-04:  12%|█▏        | 611/5098 [04:44<30:52,  2.42it/s][A
epoch 2 iter 611: train loss 0.27153. lr 2.437595e-04:  12%|█▏        | 612/5098 [04:44<30:49,  2.43it/s][A
epoch 2 iter 612: train loss 0.26562. lr 2.436687e-04:  12%|█▏        | 612/5098 [04:45<30:49,  2.43it/s][A
epoch 2 iter 612: train loss 0.26562. lr 2.436687e-04:  12%|█▏        | 613/5098 [04:45<31:54,  2.34it/s][A
epoch 2 iter 613: train loss 0.26865. lr 2.435779e-04:  12%|█▏        | 613/5098 [04:45<31:54,  2.34it/s][A

epoch 2 iter 646: train loss 0.26663. lr 2.405843e-04:  13%|█▎        | 647/5098 [05:00<34:16,  2.16it/s][A
epoch 2 iter 647: train loss 0.27119. lr 2.404936e-04:  13%|█▎        | 647/5098 [05:00<34:16,  2.16it/s][A
epoch 2 iter 647: train loss 0.27119. lr 2.404936e-04:  13%|█▎        | 648/5098 [05:00<33:48,  2.19it/s][A
epoch 2 iter 648: train loss 0.26505. lr 2.404030e-04:  13%|█▎        | 648/5098 [05:01<33:48,  2.19it/s][A
epoch 2 iter 648: train loss 0.26505. lr 2.404030e-04:  13%|█▎        | 649/5098 [05:01<33:19,  2.23it/s][A
epoch 2 iter 649: train loss 0.26527. lr 2.403124e-04:  13%|█▎        | 649/5098 [05:01<33:19,  2.23it/s][A
epoch 2 iter 649: train loss 0.26527. lr 2.403124e-04:  13%|█▎        | 650/5098 [05:01<32:57,  2.25it/s][A
epoch 2 iter 650: train loss 0.27030. lr 2.402218e-04:  13%|█▎        | 650/5098 [05:02<32:57,  2.25it/s][A
epoch 2 iter 650: train loss 0.27030. lr 2.402218e-04:  13%|█▎        | 651/5098 [05:02<32:39,  2.27it/s][A

epoch 2 iter 684: train loss 0.26854. lr 2.371447e-04:  13%|█▎        | 684/5098 [05:17<31:22,  2.34it/s][A
epoch 2 iter 684: train loss 0.26854. lr 2.371447e-04:  13%|█▎        | 685/5098 [05:17<30:37,  2.40it/s][A
epoch 2 iter 685: train loss 0.26721. lr 2.370543e-04:  13%|█▎        | 685/5098 [05:17<30:37,  2.40it/s][A
epoch 2 iter 685: train loss 0.26721. lr 2.370543e-04:  13%|█▎        | 686/5098 [05:17<30:20,  2.42it/s][A
epoch 2 iter 686: train loss 0.26843. lr 2.369639e-04:  13%|█▎        | 686/5098 [05:18<30:20,  2.42it/s][A
epoch 2 iter 686: train loss 0.26843. lr 2.369639e-04:  13%|█▎        | 687/5098 [05:18<31:12,  2.36it/s][A
epoch 2 iter 687: train loss 0.26371. lr 2.368735e-04:  13%|█▎        | 687/5098 [05:18<31:12,  2.36it/s][A
epoch 2 iter 687: train loss 0.26371. lr 2.368735e-04:  13%|█▎        | 688/5098 [05:18<30:50,  2.38it/s][A
epoch 2 iter 688: train loss 0.25906. lr 2.367831e-04:  13%|█▎        | 688/5098 [05:19<30:50,  2.38it/s][A

epoch 2 iter 721: train loss 0.26005. lr 2.338040e-04:  14%|█▍        | 722/5098 [05:35<37:31,  1.94it/s][A
epoch 2 iter 722: train loss 0.26589. lr 2.337138e-04:  14%|█▍        | 722/5098 [05:35<37:31,  1.94it/s][A
epoch 2 iter 722: train loss 0.26589. lr 2.337138e-04:  14%|█▍        | 723/5098 [05:35<36:53,  1.98it/s][A
epoch 2 iter 723: train loss 0.26689. lr 2.336236e-04:  14%|█▍        | 723/5098 [05:36<36:53,  1.98it/s][A
epoch 2 iter 723: train loss 0.26689. lr 2.336236e-04:  14%|█▍        | 724/5098 [05:36<35:47,  2.04it/s][A
epoch 2 iter 724: train loss 0.26526. lr 2.335335e-04:  14%|█▍        | 724/5098 [05:36<35:47,  2.04it/s][A
epoch 2 iter 724: train loss 0.26526. lr 2.335335e-04:  14%|█▍        | 725/5098 [05:36<35:07,  2.08it/s][A
epoch 2 iter 725: train loss 0.26446. lr 2.334433e-04:  14%|█▍        | 725/5098 [05:37<35:07,  2.08it/s][A
epoch 2 iter 725: train loss 0.26446. lr 2.334433e-04:  14%|█▍        | 726/5098 [05:37<34:37,  2.10it/s][A

epoch 2 iter 759: train loss 0.26860. lr 2.303819e-04:  15%|█▍        | 759/5098 [05:52<40:16,  1.80it/s][A
epoch 2 iter 759: train loss 0.26860. lr 2.303819e-04:  15%|█▍        | 760/5098 [05:52<40:30,  1.78it/s][A
epoch 2 iter 760: train loss 0.26232. lr 2.302919e-04:  15%|█▍        | 760/5098 [05:53<40:30,  1.78it/s][A
epoch 2 iter 760: train loss 0.26232. lr 2.302919e-04:  15%|█▍        | 761/5098 [05:53<38:37,  1.87it/s][A
epoch 2 iter 761: train loss 0.26635. lr 2.302020e-04:  15%|█▍        | 761/5098 [05:53<38:37,  1.87it/s][A
epoch 2 iter 761: train loss 0.26635. lr 2.302020e-04:  15%|█▍        | 762/5098 [05:53<37:05,  1.95it/s][A
epoch 2 iter 762: train loss 0.25919. lr 2.301121e-04:  15%|█▍        | 762/5098 [05:54<37:05,  1.95it/s][A
epoch 2 iter 762: train loss 0.25919. lr 2.301121e-04:  15%|█▍        | 763/5098 [05:54<36:01,  2.01it/s][A
epoch 2 iter 763: train loss 0.26631. lr 2.300222e-04:  15%|█▍        | 763/5098 [05:54<36:01,  2.01it/s][A

epoch 2 iter 796: train loss 0.25674. lr 2.270590e-04:  16%|█▌        | 797/5098 [06:09<38:33,  1.86it/s][A
epoch 2 iter 797: train loss 0.25559. lr 2.269693e-04:  16%|█▌        | 797/5098 [06:10<38:33,  1.86it/s][A
epoch 2 iter 797: train loss 0.25559. lr 2.269693e-04:  16%|█▌        | 798/5098 [06:10<36:18,  1.97it/s][A
epoch 2 iter 798: train loss 0.26025. lr 2.268797e-04:  16%|█▌        | 798/5098 [06:10<36:18,  1.97it/s][A
epoch 2 iter 798: train loss 0.26025. lr 2.268797e-04:  16%|█▌        | 799/5098 [06:10<36:48,  1.95it/s][A
epoch 2 iter 799: train loss 0.26177. lr 2.267900e-04:  16%|█▌        | 799/5098 [06:11<36:48,  1.95it/s][A
epoch 2 iter 799: train loss 0.26177. lr 2.267900e-04:  16%|█▌        | 800/5098 [06:11<37:02,  1.93it/s][A
epoch 2 iter 800: train loss 0.26130. lr 2.267004e-04:  16%|█▌        | 800/5098 [06:11<37:02,  1.93it/s][A
epoch 2 iter 800: train loss 0.26130. lr 2.267004e-04:  16%|█▌        | 801/5098 [06:11<36:25,  1.97it/s][A

epoch 2 iter 834: train loss 0.26129. lr 2.236562e-04:  16%|█▋        | 834/5098 [06:26<28:36,  2.48it/s][A
epoch 2 iter 834: train loss 0.26129. lr 2.236562e-04:  16%|█▋        | 835/5098 [06:26<29:25,  2.42it/s][A
epoch 2 iter 835: train loss 0.25818. lr 2.235668e-04:  16%|█▋        | 835/5098 [06:26<29:25,  2.42it/s][A
epoch 2 iter 835: train loss 0.25818. lr 2.235668e-04:  16%|█▋        | 836/5098 [06:26<30:26,  2.33it/s][A
epoch 2 iter 836: train loss 0.25723. lr 2.234774e-04:  16%|█▋        | 836/5098 [06:27<30:26,  2.33it/s][A
epoch 2 iter 836: train loss 0.25723. lr 2.234774e-04:  16%|█▋        | 837/5098 [06:27<31:08,  2.28it/s][A
epoch 2 iter 837: train loss 0.26081. lr 2.233880e-04:  16%|█▋        | 837/5098 [06:27<31:08,  2.28it/s][A
epoch 2 iter 837: train loss 0.26081. lr 2.233880e-04:  16%|█▋        | 838/5098 [06:27<31:40,  2.24it/s][A
epoch 2 iter 838: train loss 0.26032. lr 2.232986e-04:  16%|█▋        | 838/5098 [06:28<31:40,  2.24it/s][A

epoch 2 iter 871: train loss 0.25291. lr 2.203531e-04:  17%|█▋        | 872/5098 [06:42<30:04,  2.34it/s][A
epoch 2 iter 872: train loss 0.25549. lr 2.202639e-04:  17%|█▋        | 872/5098 [06:42<30:04,  2.34it/s][A
epoch 2 iter 872: train loss 0.25549. lr 2.202639e-04:  17%|█▋        | 873/5098 [06:42<30:21,  2.32it/s][A
epoch 2 iter 873: train loss 0.25750. lr 2.201748e-04:  17%|█▋        | 873/5098 [06:43<30:21,  2.32it/s][A
epoch 2 iter 873: train loss 0.25750. lr 2.201748e-04:  17%|█▋        | 874/5098 [06:43<32:58,  2.13it/s][A
epoch 2 iter 874: train loss 0.25955. lr 2.200857e-04:  17%|█▋        | 874/5098 [06:43<32:58,  2.13it/s][A
epoch 2 iter 874: train loss 0.25955. lr 2.200857e-04:  17%|█▋        | 875/5098 [06:43<32:24,  2.17it/s][A
epoch 2 iter 875: train loss 0.25652. lr 2.199966e-04:  17%|█▋        | 875/5098 [06:44<32:24,  2.17it/s][A
epoch 2 iter 875: train loss 0.25652. lr 2.199966e-04:  17%|█▋        | 876/5098 [06:44<31:33,  2.23it/s][A

epoch 2 iter 909: train loss 0.25875. lr 2.169714e-04:  18%|█▊        | 909/5098 [07:02<28:36,  2.44it/s][A
epoch 2 iter 909: train loss 0.25875. lr 2.169714e-04:  18%|█▊        | 910/5098 [07:02<29:00,  2.41it/s][A
epoch 2 iter 910: train loss 0.25452. lr 2.168825e-04:  18%|█▊        | 910/5098 [07:02<29:00,  2.41it/s][A
epoch 2 iter 910: train loss 0.25452. lr 2.168825e-04:  18%|█▊        | 911/5098 [07:02<29:04,  2.40it/s][A
epoch 2 iter 911: train loss 0.26050. lr 2.167937e-04:  18%|█▊        | 911/5098 [07:02<29:04,  2.40it/s][A
epoch 2 iter 911: train loss 0.26050. lr 2.167937e-04:  18%|█▊        | 912/5098 [07:02<28:59,  2.41it/s][A
epoch 2 iter 912: train loss 0.25107. lr 2.167049e-04:  18%|█▊        | 912/5098 [07:03<28:59,  2.41it/s][A
epoch 2 iter 912: train loss 0.25107. lr 2.167049e-04:  18%|█▊        | 913/5098 [07:03<28:50,  2.42it/s][A
epoch 2 iter 913: train loss 0.25909. lr 2.166161e-04:  18%|█▊        | 913/5098 [07:03<28:50,  2.42it/s][A

epoch 2 iter 946: train loss 0.25800. lr 2.136896e-04:  19%|█▊        | 947/5098 [07:19<40:24,  1.71it/s][A
epoch 2 iter 947: train loss 0.25429. lr 2.136011e-04:  19%|█▊        | 947/5098 [07:19<40:24,  1.71it/s][A
epoch 2 iter 947: train loss 0.25429. lr 2.136011e-04:  19%|█▊        | 948/5098 [07:19<40:49,  1.69it/s][A
epoch 2 iter 948: train loss 0.25414. lr 2.135126e-04:  19%|█▊        | 948/5098 [07:20<40:49,  1.69it/s][A
epoch 2 iter 948: train loss 0.25414. lr 2.135126e-04:  19%|█▊        | 949/5098 [07:20<41:08,  1.68it/s][A
epoch 2 iter 949: train loss 0.25890. lr 2.134240e-04:  19%|█▊        | 949/5098 [07:20<41:08,  1.68it/s][A
epoch 2 iter 949: train loss 0.25890. lr 2.134240e-04:  19%|█▊        | 950/5098 [07:20<40:23,  1.71it/s][A
epoch 2 iter 950: train loss 0.25256. lr 2.133355e-04:  19%|█▊        | 950/5098 [07:21<40:23,  1.71it/s][A
epoch 2 iter 950: train loss 0.25256. lr 2.133355e-04:  19%|█▊        | 951/5098 [07:21<39:17,  1.76it/s][A

epoch 2 iter 984: train loss 0.25521. lr 2.103309e-04:  19%|█▉        | 984/5098 [07:37<36:08,  1.90it/s][A
epoch 2 iter 984: train loss 0.25521. lr 2.103309e-04:  19%|█▉        | 985/5098 [07:37<35:01,  1.96it/s][A
epoch 2 iter 985: train loss 0.25514. lr 2.102427e-04:  19%|█▉        | 985/5098 [07:37<35:01,  1.96it/s][A
epoch 2 iter 985: train loss 0.25514. lr 2.102427e-04:  19%|█▉        | 986/5098 [07:37<33:48,  2.03it/s][A
epoch 2 iter 986: train loss 0.25575. lr 2.101544e-04:  19%|█▉        | 986/5098 [07:38<33:48,  2.03it/s][A
epoch 2 iter 986: train loss 0.25575. lr 2.101544e-04:  19%|█▉        | 987/5098 [07:38<32:44,  2.09it/s][A
epoch 2 iter 987: train loss 0.25202. lr 2.100662e-04:  19%|█▉        | 987/5098 [07:38<32:44,  2.09it/s][A
epoch 2 iter 987: train loss 0.25202. lr 2.100662e-04:  19%|█▉        | 988/5098 [07:38<31:44,  2.16it/s][A
epoch 2 iter 988: train loss 0.26108. lr 2.099780e-04:  19%|█▉        | 988/5098 [07:39<31:44,  2.16it/s][A

epoch 2 iter 1021: train loss 0.25760. lr 2.070723e-04:  20%|██        | 1021/5098 [07:53<31:12,  2.18it/s][A
epoch 2 iter 1021: train loss 0.25760. lr 2.070723e-04:  20%|██        | 1022/5098 [07:53<30:51,  2.20it/s][A
epoch 2 iter 1022: train loss 0.25177. lr 2.069844e-04:  20%|██        | 1022/5098 [07:54<30:51,  2.20it/s][A
epoch 2 iter 1022: train loss 0.25177. lr 2.069844e-04:  20%|██        | 1023/5098 [07:54<30:20,  2.24it/s][A
epoch 2 iter 1023: train loss 0.25462. lr 2.068965e-04:  20%|██        | 1023/5098 [07:54<30:20,  2.24it/s][A
epoch 2 iter 1023: train loss 0.25462. lr 2.068965e-04:  20%|██        | 1024/5098 [07:54<30:02,  2.26it/s][A
epoch 2 iter 1024: train loss 0.24769. lr 2.068086e-04:  20%|██        | 1024/5098 [07:54<30:02,  2.26it/s][A
epoch 2 iter 1024: train loss 0.24769. lr 2.068086e-04:  20%|██        | 1025/5098 [07:54<29:49,  2.28it/s][A
epoch 2 iter 1025: train loss 0.25507. lr 2.067208e-04:  20%|██        | 1025/5098 [07:55<29:49,  2.28it/s][A

epoch 2 iter 1057: train loss 0.24969. lr 2.039134e-04:  21%|██        | 1058/5098 [08:09<31:32,  2.13it/s][A
epoch 2 iter 1058: train loss 0.25297. lr 2.038259e-04:  21%|██        | 1058/5098 [08:10<31:32,  2.13it/s][A
epoch 2 iter 1058: train loss 0.25297. lr 2.038259e-04:  21%|██        | 1059/5098 [08:10<31:49,  2.12it/s][A
epoch 2 iter 1059: train loss 0.24923. lr 2.037383e-04:  21%|██        | 1059/5098 [08:10<31:49,  2.12it/s][A
epoch 2 iter 1059: train loss 0.24923. lr 2.037383e-04:  21%|██        | 1060/5098 [08:10<32:32,  2.07it/s][A
epoch 2 iter 1060: train loss 0.25045. lr 2.036507e-04:  21%|██        | 1060/5098 [08:11<32:32,  2.07it/s][A
epoch 2 iter 1060: train loss 0.25045. lr 2.036507e-04:  21%|██        | 1061/5098 [08:11<33:03,  2.04it/s][A
epoch 2 iter 1061: train loss 0.25145. lr 2.035632e-04:  21%|██        | 1061/5098 [08:11<33:03,  2.04it/s][A
epoch 2 iter 1061: train loss 0.25145. lr 2.035632e-04:  21%|██        | 1062/5098 [08:11<31:13,  2.15it/s][A

epoch 2 iter 1094: train loss 0.25001. lr 2.006791e-04:  21%|██▏       | 1094/5098 [08:28<35:32,  1.88it/s][A
epoch 2 iter 1094: train loss 0.25001. lr 2.006791e-04:  21%|██▏       | 1095/5098 [08:28<34:19,  1.94it/s][A
epoch 2 iter 1095: train loss 0.24771. lr 2.005919e-04:  21%|██▏       | 1095/5098 [08:28<34:19,  1.94it/s][A
epoch 2 iter 1095: train loss 0.24771. lr 2.005919e-04:  21%|██▏       | 1096/5098 [08:28<33:30,  1.99it/s][A
epoch 2 iter 1096: train loss 0.25298. lr 2.005047e-04:  21%|██▏       | 1096/5098 [08:29<33:30,  1.99it/s][A
epoch 2 iter 1096: train loss 0.25298. lr 2.005047e-04:  22%|██▏       | 1097/5098 [08:29<32:57,  2.02it/s][A
epoch 2 iter 1097: train loss 0.24612. lr 2.004174e-04:  22%|██▏       | 1097/5098 [08:29<32:57,  2.02it/s][A
epoch 2 iter 1097: train loss 0.24612. lr 2.004174e-04:  22%|██▏       | 1098/5098 [08:29<32:07,  2.07it/s][A
epoch 2 iter 1098: train loss 0.24931. lr 2.003302e-04:  22%|██▏       | 1098/5098 [08:29<32:07,  2.07it/s][A

epoch 2 iter 1130: train loss 0.24355. lr 1.975446e-04:  22%|██▏       | 1131/5098 [08:43<26:59,  2.45it/s][A
epoch 2 iter 1131: train loss 0.24719. lr 1.974577e-04:  22%|██▏       | 1131/5098 [08:43<26:59,  2.45it/s][A
epoch 2 iter 1131: train loss 0.24719. lr 1.974577e-04:  22%|██▏       | 1132/5098 [08:43<26:46,  2.47it/s][A
epoch 2 iter 1132: train loss 0.24811. lr 1.973708e-04:  22%|██▏       | 1132/5098 [08:44<26:46,  2.47it/s][A
epoch 2 iter 1132: train loss 0.24811. lr 1.973708e-04:  22%|██▏       | 1133/5098 [08:44<26:54,  2.46it/s][A
epoch 2 iter 1133: train loss 0.24922. lr 1.972840e-04:  22%|██▏       | 1133/5098 [08:44<26:54,  2.46it/s][A
epoch 2 iter 1133: train loss 0.24922. lr 1.972840e-04:  22%|██▏       | 1134/5098 [08:44<27:06,  2.44it/s][A
epoch 2 iter 1134: train loss 0.24628. lr 1.971971e-04:  22%|██▏       | 1134/5098 [08:45<27:06,  2.44it/s][A
epoch 2 iter 1134: train loss 0.24628. lr 1.971971e-04:  22%|██▏       | 1135/5098 [08:45<27:18,  2.42it/s][A

epoch 2 iter 1167: train loss 0.24522. lr 1.943362e-04:  23%|██▎       | 1167/5098 [08:58<26:36,  2.46it/s][A
epoch 2 iter 1167: train loss 0.24522. lr 1.943362e-04:  23%|██▎       | 1168/5098 [08:58<25:55,  2.53it/s][A
epoch 2 iter 1168: train loss 0.24666. lr 1.942497e-04:  23%|██▎       | 1168/5098 [08:59<25:55,  2.53it/s][A
epoch 2 iter 1168: train loss 0.24666. lr 1.942497e-04:  23%|██▎       | 1169/5098 [08:59<26:38,  2.46it/s][A
epoch 2 iter 1169: train loss 0.24645. lr 1.941631e-04:  23%|██▎       | 1169/5098 [08:59<26:38,  2.46it/s][A
epoch 2 iter 1169: train loss 0.24645. lr 1.941631e-04:  23%|██▎       | 1170/5098 [08:59<30:04,  2.18it/s][A
epoch 2 iter 1170: train loss 0.24618. lr 1.940766e-04:  23%|██▎       | 1170/5098 [09:00<30:04,  2.18it/s][A
epoch 2 iter 1170: train loss 0.24618. lr 1.940766e-04:  23%|██▎       | 1171/5098 [09:00<32:59,  1.98it/s][A
epoch 2 iter 1171: train loss 0.24785. lr 1.939901e-04:  23%|██▎       | 1171/5098 [09:01<32:59,  1.98it/s][A

epoch 2 iter 1203: train loss 0.24361. lr 1.912277e-04:  24%|██▎       | 1204/5098 [09:16<28:50,  2.25it/s][A
epoch 2 iter 1204: train loss 0.24675. lr 1.911415e-04:  24%|██▎       | 1204/5098 [09:17<28:50,  2.25it/s][A
epoch 2 iter 1204: train loss 0.24675. lr 1.911415e-04:  24%|██▎       | 1205/5098 [09:17<28:21,  2.29it/s][A
epoch 2 iter 1205: train loss 0.24403. lr 1.910553e-04:  24%|██▎       | 1205/5098 [09:17<28:21,  2.29it/s][A
epoch 2 iter 1205: train loss 0.24403. lr 1.910553e-04:  24%|██▎       | 1206/5098 [09:17<28:00,  2.32it/s][A
epoch 2 iter 1206: train loss 0.24332. lr 1.909692e-04:  24%|██▎       | 1206/5098 [09:18<28:00,  2.32it/s][A
epoch 2 iter 1206: train loss 0.24332. lr 1.909692e-04:  24%|██▎       | 1207/5098 [09:18<27:46,  2.33it/s][A
epoch 2 iter 1207: train loss 0.24924. lr 1.908831e-04:  24%|██▎       | 1207/5098 [09:18<27:46,  2.33it/s][A
epoch 2 iter 1207: train loss 0.24924. lr 1.908831e-04:  24%|██▎       | 1208/5098 [09:18<27:37,  2.35it/s][A

epoch 2 iter 1240: train loss 0.24092. lr 1.880467e-04:  24%|██▍       | 1240/5098 [09:33<31:01,  2.07it/s][A
epoch 2 iter 1240: train loss 0.24092. lr 1.880467e-04:  24%|██▍       | 1241/5098 [09:33<31:24,  2.05it/s][A
epoch 2 iter 1241: train loss 0.24375. lr 1.879610e-04:  24%|██▍       | 1241/5098 [09:34<31:24,  2.05it/s][A
epoch 2 iter 1241: train loss 0.24375. lr 1.879610e-04:  24%|██▍       | 1242/5098 [09:34<30:48,  2.09it/s][A
epoch 2 iter 1242: train loss 0.24512. lr 1.878752e-04:  24%|██▍       | 1242/5098 [09:34<30:48,  2.09it/s][A
epoch 2 iter 1242: train loss 0.24512. lr 1.878752e-04:  24%|██▍       | 1243/5098 [09:34<30:09,  2.13it/s][A
epoch 2 iter 1243: train loss 0.24095. lr 1.877894e-04:  24%|██▍       | 1243/5098 [09:35<30:09,  2.13it/s][A
epoch 2 iter 1243: train loss 0.24095. lr 1.877894e-04:  24%|██▍       | 1244/5098 [09:35<29:47,  2.16it/s][A
epoch 2 iter 1244: train loss 0.24188. lr 1.877037e-04:  24%|██▍       | 1244/5098 [09:35<29:47,  2.16it/s][A

epoch 2 iter 1276: train loss 0.23937. lr 1.849657e-04:  25%|██▌       | 1277/5098 [09:49<26:52,  2.37it/s][A
epoch 2 iter 1277: train loss 0.23989. lr 1.848804e-04:  25%|██▌       | 1277/5098 [09:50<26:52,  2.37it/s][A
epoch 2 iter 1277: train loss 0.23989. lr 1.848804e-04:  25%|██▌       | 1278/5098 [09:50<26:36,  2.39it/s][A
epoch 2 iter 1278: train loss 0.24483. lr 1.847950e-04:  25%|██▌       | 1278/5098 [09:50<26:36,  2.39it/s][A
epoch 2 iter 1278: train loss 0.24483. lr 1.847950e-04:  25%|██▌       | 1279/5098 [09:50<26:22,  2.41it/s][A
epoch 2 iter 1279: train loss 0.24560. lr 1.847096e-04:  25%|██▌       | 1279/5098 [09:50<26:22,  2.41it/s][A
epoch 2 iter 1279: train loss 0.24560. lr 1.847096e-04:  25%|██▌       | 1280/5098 [09:50<26:15,  2.42it/s][A
epoch 2 iter 1280: train loss 0.23955. lr 1.846243e-04:  25%|██▌       | 1280/5098 [09:51<26:15,  2.42it/s][A
epoch 2 iter 1280: train loss 0.23955. lr 1.846243e-04:  25%|██▌       | 1281/5098 [09:51<26:10,  2.43it/s][A

epoch 2 iter 1313: train loss 0.24271. lr 1.818139e-04:  26%|██▌       | 1313/5098 [10:05<26:15,  2.40it/s][A
epoch 2 iter 1313: train loss 0.24271. lr 1.818139e-04:  26%|██▌       | 1314/5098 [10:05<26:24,  2.39it/s][A
epoch 2 iter 1314: train loss 0.24344. lr 1.817290e-04:  26%|██▌       | 1314/5098 [10:05<26:24,  2.39it/s][A
epoch 2 iter 1314: train loss 0.24344. lr 1.817290e-04:  26%|██▌       | 1315/5098 [10:05<26:21,  2.39it/s][A
epoch 2 iter 1315: train loss 0.24287. lr 1.816440e-04:  26%|██▌       | 1315/5098 [10:06<26:21,  2.39it/s][A
epoch 2 iter 1315: train loss 0.24287. lr 1.816440e-04:  26%|██▌       | 1316/5098 [10:06<26:08,  2.41it/s][A
epoch 2 iter 1316: train loss 0.24347. lr 1.815590e-04:  26%|██▌       | 1316/5098 [10:06<26:08,  2.41it/s][A
epoch 2 iter 1316: train loss 0.24347. lr 1.815590e-04:  26%|██▌       | 1317/5098 [10:06<25:50,  2.44it/s][A
epoch 2 iter 1317: train loss 0.24038. lr 1.814741e-04:  26%|██▌       | 1317/5098 [10:07<25:50,  2.44it/s][A

epoch 2 iter 1349: train loss 0.23796. lr 1.787621e-04:  26%|██▋       | 1350/5098 [10:22<27:44,  2.25it/s][A
epoch 2 iter 1350: train loss 0.23930. lr 1.786775e-04:  26%|██▋       | 1350/5098 [10:22<27:44,  2.25it/s][A
epoch 2 iter 1350: train loss 0.23930. lr 1.786775e-04:  27%|██▋       | 1351/5098 [10:22<26:48,  2.33it/s][A
epoch 2 iter 1351: train loss 0.24129. lr 1.785929e-04:  27%|██▋       | 1351/5098 [10:22<26:48,  2.33it/s][A
epoch 2 iter 1351: train loss 0.24129. lr 1.785929e-04:  27%|██▋       | 1352/5098 [10:22<26:09,  2.39it/s][A
epoch 2 iter 1352: train loss 0.23598. lr 1.785084e-04:  27%|██▋       | 1352/5098 [10:23<26:09,  2.39it/s][A
epoch 2 iter 1352: train loss 0.23598. lr 1.785084e-04:  27%|██▋       | 1353/5098 [10:23<25:43,  2.43it/s][A
epoch 2 iter 1353: train loss 0.23920. lr 1.784239e-04:  27%|██▋       | 1353/5098 [10:23<25:43,  2.43it/s][A
epoch 2 iter 1353: train loss 0.23920. lr 1.784239e-04:  27%|██▋       | 1354/5098 [10:23<24:54,  2.51it/s][A

epoch 2 iter 1386: train loss 0.24297. lr 1.756410e-04:  27%|██▋       | 1386/5098 [10:41<26:51,  2.30it/s][A
epoch 2 iter 1386: train loss 0.24297. lr 1.756410e-04:  27%|██▋       | 1387/5098 [10:41<45:06,  1.37it/s][A
epoch 2 iter 1387: train loss 0.23769. lr 1.755568e-04:  27%|██▋       | 1387/5098 [10:42<45:06,  1.37it/s][A
epoch 2 iter 1387: train loss 0.23769. lr 1.755568e-04:  27%|██▋       | 1388/5098 [10:42<43:42,  1.41it/s][A
epoch 2 iter 1388: train loss 0.24043. lr 1.754727e-04:  27%|██▋       | 1388/5098 [10:42<43:42,  1.41it/s][A
epoch 2 iter 1388: train loss 0.24043. lr 1.754727e-04:  27%|██▋       | 1389/5098 [10:42<42:42,  1.45it/s][A
epoch 2 iter 1389: train loss 0.23761. lr 1.753886e-04:  27%|██▋       | 1389/5098 [10:43<42:42,  1.45it/s][A
epoch 2 iter 1389: train loss 0.23761. lr 1.753886e-04:  27%|██▋       | 1390/5098 [10:43<40:45,  1.52it/s][A
epoch 2 iter 1390: train loss 0.24128. lr 1.753045e-04:  27%|██▋       | 1390/5098 [10:43<40:45,  1.52it/s][A

epoch 2 iter 1422: train loss 0.23625. lr 1.726197e-04:  28%|██▊       | 1423/5098 [10:57<27:53,  2.20it/s][A
epoch 2 iter 1423: train loss 0.23703. lr 1.725360e-04:  28%|██▊       | 1423/5098 [10:58<27:53,  2.20it/s][A
epoch 2 iter 1423: train loss 0.23703. lr 1.725360e-04:  28%|██▊       | 1424/5098 [10:58<28:01,  2.19it/s][A
epoch 2 iter 1424: train loss 0.23728. lr 1.724523e-04:  28%|██▊       | 1424/5098 [10:58<28:01,  2.19it/s][A
epoch 2 iter 1424: train loss 0.23728. lr 1.724523e-04:  28%|██▊       | 1425/5098 [10:58<27:32,  2.22it/s][A
epoch 2 iter 1425: train loss 0.23842. lr 1.723687e-04:  28%|██▊       | 1425/5098 [10:59<27:32,  2.22it/s][A
epoch 2 iter 1425: train loss 0.23842. lr 1.723687e-04:  28%|██▊       | 1426/5098 [10:59<27:12,  2.25it/s][A
epoch 2 iter 1426: train loss 0.23651. lr 1.722850e-04:  28%|██▊       | 1426/5098 [10:59<27:12,  2.25it/s][A
epoch 2 iter 1426: train loss 0.23651. lr 1.722850e-04:  28%|██▊       | 1427/5098 [10:59<26:12,  2.33it/s][A

epoch 2 iter 1459: train loss 0.23744. lr 1.695309e-04:  29%|██▊       | 1459/5098 [11:14<25:48,  2.35it/s][A
epoch 2 iter 1459: train loss 0.23744. lr 1.695309e-04:  29%|██▊       | 1460/5098 [11:14<26:27,  2.29it/s][A
epoch 2 iter 1460: train loss 0.23281. lr 1.694477e-04:  29%|██▊       | 1460/5098 [11:14<26:27,  2.29it/s][A
epoch 2 iter 1460: train loss 0.23281. lr 1.694477e-04:  29%|██▊       | 1461/5098 [11:14<25:37,  2.36it/s][A
epoch 2 iter 1461: train loss 0.23617. lr 1.693644e-04:  29%|██▊       | 1461/5098 [11:15<25:37,  2.36it/s][A
epoch 2 iter 1461: train loss 0.23617. lr 1.693644e-04:  29%|██▊       | 1462/5098 [11:15<24:57,  2.43it/s][A
epoch 2 iter 1462: train loss 0.23794. lr 1.692812e-04:  29%|██▊       | 1462/5098 [11:15<24:57,  2.43it/s][A
epoch 2 iter 1462: train loss 0.23794. lr 1.692812e-04:  29%|██▊       | 1463/5098 [11:15<24:10,  2.51it/s][A
epoch 2 iter 1463: train loss 0.23374. lr 1.691980e-04:  29%|██▊       | 1463/5098 [11:15<24:10,  2.51it/s][A

epoch 2 iter 1495: train loss 0.23105. lr 1.665419e-04:  29%|██▉       | 1496/5098 [11:30<27:14,  2.20it/s][A
epoch 2 iter 1496: train loss 0.23822. lr 1.664591e-04:  29%|██▉       | 1496/5098 [11:30<27:14,  2.20it/s][A
epoch 2 iter 1496: train loss 0.23822. lr 1.664591e-04:  29%|██▉       | 1497/5098 [11:30<27:28,  2.18it/s][A
epoch 2 iter 1497: train loss 0.24071. lr 1.663763e-04:  29%|██▉       | 1497/5098 [11:31<27:28,  2.18it/s][A
epoch 2 iter 1497: train loss 0.24071. lr 1.663763e-04:  29%|██▉       | 1498/5098 [11:31<28:43,  2.09it/s][A
epoch 2 iter 1498: train loss 0.23768. lr 1.662935e-04:  29%|██▉       | 1498/5098 [11:31<28:43,  2.09it/s][A
epoch 2 iter 1498: train loss 0.23768. lr 1.662935e-04:  29%|██▉       | 1499/5098 [11:31<27:58,  2.14it/s][A
epoch 2 iter 1499: train loss 0.23299. lr 1.662107e-04:  29%|██▉       | 1499/5098 [11:32<27:58,  2.14it/s][A
epoch 2 iter 1499: train loss 0.23299. lr 1.662107e-04:  29%|██▉       | 1500/5098 [11:32<26:46,  2.24it/s][A

epoch 2 iter 1532: train loss 0.23722. lr 1.634869e-04:  30%|███       | 1532/5098 [11:49<24:32,  2.42it/s][A
epoch 2 iter 1532: train loss 0.23722. lr 1.634869e-04:  30%|███       | 1533/5098 [11:49<24:27,  2.43it/s][A
epoch 2 iter 1533: train loss 0.23422. lr 1.634046e-04:  30%|███       | 1533/5098 [11:49<24:27,  2.43it/s][A
epoch 2 iter 1533: train loss 0.23422. lr 1.634046e-04:  30%|███       | 1534/5098 [11:49<24:05,  2.47it/s][A
epoch 2 iter 1534: train loss 0.23409. lr 1.633223e-04:  30%|███       | 1534/5098 [11:49<24:05,  2.47it/s][A
epoch 2 iter 1534: train loss 0.23409. lr 1.633223e-04:  30%|███       | 1535/5098 [11:49<23:50,  2.49it/s][A
epoch 2 iter 1535: train loss 0.23755. lr 1.632400e-04:  30%|███       | 1535/5098 [11:50<23:50,  2.49it/s][A
epoch 2 iter 1535: train loss 0.23755. lr 1.632400e-04:  30%|███       | 1536/5098 [11:50<23:44,  2.50it/s][A
epoch 2 iter 1536: train loss 0.23088. lr 1.631577e-04:  30%|███       | 1536/5098 [11:50<23:44,  2.50it/s][A

epoch 2 iter 1568: train loss 0.23240. lr 1.605315e-04:  31%|███       | 1569/5098 [12:06<24:56,  2.36it/s][A
epoch 2 iter 1569: train loss 0.23068. lr 1.604497e-04:  31%|███       | 1569/5098 [12:06<24:56,  2.36it/s][A
epoch 2 iter 1569: train loss 0.23068. lr 1.604497e-04:  31%|███       | 1570/5098 [12:06<25:26,  2.31it/s][A
epoch 2 iter 1570: train loss 0.23077. lr 1.603679e-04:  31%|███       | 1570/5098 [12:07<25:26,  2.31it/s][A
epoch 2 iter 1570: train loss 0.23077. lr 1.603679e-04:  31%|███       | 1571/5098 [12:07<24:36,  2.39it/s][A
epoch 2 iter 1571: train loss 0.23108. lr 1.602860e-04:  31%|███       | 1571/5098 [12:07<24:36,  2.39it/s][A
epoch 2 iter 1571: train loss 0.23108. lr 1.602860e-04:  31%|███       | 1572/5098 [12:07<25:02,  2.35it/s][A
epoch 2 iter 1572: train loss 0.23187. lr 1.602042e-04:  31%|███       | 1572/5098 [12:08<25:02,  2.35it/s][A
epoch 2 iter 1572: train loss 0.23187. lr 1.602042e-04:  31%|███       | 1573/5098 [12:08<24:06,  2.44it/s][A

epoch 2 iter 1605: train loss 0.23409. lr 1.575120e-04:  31%|███▏      | 1605/5098 [12:22<28:08,  2.07it/s][A
epoch 2 iter 1605: train loss 0.23409. lr 1.575120e-04:  32%|███▏      | 1606/5098 [12:22<27:37,  2.11it/s][A
epoch 2 iter 1606: train loss 0.22855. lr 1.574306e-04:  32%|███▏      | 1606/5098 [12:22<27:37,  2.11it/s][A
epoch 2 iter 1606: train loss 0.22855. lr 1.574306e-04:  32%|███▏      | 1607/5098 [12:22<27:03,  2.15it/s][A
epoch 2 iter 1607: train loss 0.23068. lr 1.573493e-04:  32%|███▏      | 1607/5098 [12:23<27:03,  2.15it/s][A
epoch 2 iter 1607: train loss 0.23068. lr 1.573493e-04:  32%|███▏      | 1608/5098 [12:23<26:34,  2.19it/s][A
epoch 2 iter 1608: train loss 0.22666. lr 1.572680e-04:  32%|███▏      | 1608/5098 [12:23<26:34,  2.19it/s][A
epoch 2 iter 1608: train loss 0.22666. lr 1.572680e-04:  32%|███▏      | 1609/5098 [12:23<26:19,  2.21it/s][A
epoch 2 iter 1609: train loss 0.22769. lr 1.571866e-04:  32%|███▏      | 1609/5098 [12:24<26:19,  2.21it/s][A

epoch 2 iter 1641: train loss 0.23095. lr 1.545918e-04:  32%|███▏      | 1642/5098 [12:40<36:40,  1.57it/s][A
epoch 2 iter 1642: train loss 0.22865. lr 1.545109e-04:  32%|███▏      | 1642/5098 [12:40<36:40,  1.57it/s][A
epoch 2 iter 1642: train loss 0.22865. lr 1.545109e-04:  32%|███▏      | 1643/5098 [12:40<35:43,  1.61it/s][A
epoch 2 iter 1643: train loss 0.22897. lr 1.544301e-04:  32%|███▏      | 1643/5098 [12:41<35:43,  1.61it/s][A
epoch 2 iter 1643: train loss 0.22897. lr 1.544301e-04:  32%|███▏      | 1644/5098 [12:41<34:38,  1.66it/s][A
epoch 2 iter 1644: train loss 0.22903. lr 1.543493e-04:  32%|███▏      | 1644/5098 [12:41<34:38,  1.66it/s][A
epoch 2 iter 1644: train loss 0.22903. lr 1.543493e-04:  32%|███▏      | 1645/5098 [12:41<34:35,  1.66it/s][A
epoch 2 iter 1645: train loss 0.23174. lr 1.542684e-04:  32%|███▏      | 1645/5098 [12:42<34:35,  1.66it/s][A
epoch 2 iter 1645: train loss 0.23174. lr 1.542684e-04:  32%|███▏      | 1646/5098 [12:42<32:24,  1.78it/s][A

epoch 2 iter 1678: train loss 0.22537. lr 1.516092e-04:  33%|███▎      | 1678/5098 [12:56<23:38,  2.41it/s][A
epoch 2 iter 1678: train loss 0.22537. lr 1.516092e-04:  33%|███▎      | 1679/5098 [12:56<23:40,  2.41it/s][A
epoch 2 iter 1679: train loss 0.22897. lr 1.515288e-04:  33%|███▎      | 1679/5098 [12:56<23:40,  2.41it/s][A
epoch 2 iter 1679: train loss 0.22897. lr 1.515288e-04:  33%|███▎      | 1680/5098 [12:56<23:42,  2.40it/s][A
epoch 2 iter 1680: train loss 0.22997. lr 1.514485e-04:  33%|███▎      | 1680/5098 [12:57<23:42,  2.40it/s][A
epoch 2 iter 1680: train loss 0.22997. lr 1.514485e-04:  33%|███▎      | 1681/5098 [12:57<23:43,  2.40it/s][A
epoch 2 iter 1681: train loss 0.22625. lr 1.513682e-04:  33%|███▎      | 1681/5098 [12:57<23:43,  2.40it/s][A
epoch 2 iter 1681: train loss 0.22625. lr 1.513682e-04:  33%|███▎      | 1682/5098 [12:57<23:43,  2.40it/s][A
epoch 2 iter 1682: train loss 0.22909. lr 1.512879e-04:  33%|███▎      | 1682/5098 [12:58<23:43,  2.40it/s][A

epoch 2 iter 1714: train loss 0.22966. lr 1.487257e-04:  34%|███▎      | 1715/5098 [13:13<24:27,  2.31it/s][A
epoch 2 iter 1715: train loss 0.22826. lr 1.486458e-04:  34%|███▎      | 1715/5098 [13:14<24:27,  2.31it/s][A
epoch 2 iter 1715: train loss 0.22826. lr 1.486458e-04:  34%|███▎      | 1716/5098 [13:14<24:10,  2.33it/s][A
epoch 2 iter 1716: train loss 0.22199. lr 1.485660e-04:  34%|███▎      | 1716/5098 [13:14<24:10,  2.33it/s][A
epoch 2 iter 1716: train loss 0.22199. lr 1.485660e-04:  34%|███▎      | 1717/5098 [13:14<27:17,  2.06it/s][A
epoch 2 iter 1717: train loss 0.22926. lr 1.484862e-04:  34%|███▎      | 1717/5098 [13:15<27:17,  2.06it/s][A
epoch 2 iter 1717: train loss 0.22926. lr 1.484862e-04:  34%|███▎      | 1718/5098 [13:15<28:50,  1.95it/s][A
epoch 2 iter 1718: train loss 0.23087. lr 1.484064e-04:  34%|███▎      | 1718/5098 [13:16<28:50,  1.95it/s][A
epoch 2 iter 1718: train loss 0.23087. lr 1.484064e-04:  34%|███▎      | 1719/5098 [13:16<29:49,  1.89it/s][A

epoch 2 iter 1751: train loss 0.22489. lr 1.457815e-04:  34%|███▍      | 1751/5098 [13:30<23:01,  2.42it/s][A
epoch 2 iter 1751: train loss 0.22489. lr 1.457815e-04:  34%|███▍      | 1752/5098 [13:30<22:51,  2.44it/s][A
epoch 2 iter 1752: train loss 0.22819. lr 1.457022e-04:  34%|███▍      | 1752/5098 [13:31<22:51,  2.44it/s][A
epoch 2 iter 1752: train loss 0.22819. lr 1.457022e-04:  34%|███▍      | 1753/5098 [13:31<23:07,  2.41it/s][A
epoch 2 iter 1753: train loss 0.22762. lr 1.456229e-04:  34%|███▍      | 1753/5098 [13:31<23:07,  2.41it/s][A
epoch 2 iter 1753: train loss 0.22762. lr 1.456229e-04:  34%|███▍      | 1754/5098 [13:31<22:59,  2.42it/s][A
epoch 2 iter 1754: train loss 0.22803. lr 1.455436e-04:  34%|███▍      | 1754/5098 [13:32<22:59,  2.42it/s][A
epoch 2 iter 1754: train loss 0.22803. lr 1.455436e-04:  34%|███▍      | 1755/5098 [13:32<22:49,  2.44it/s][A
epoch 2 iter 1755: train loss 0.23105. lr 1.454644e-04:  34%|███▍      | 1755/5098 [13:32<22:49,  2.44it/s][A

epoch 2 iter 1787: train loss 0.22740. lr 1.429361e-04:  35%|███▌      | 1788/5098 [13:46<21:50,  2.53it/s][A
epoch 2 iter 1788: train loss 0.22544. lr 1.428573e-04:  35%|███▌      | 1788/5098 [13:47<21:50,  2.53it/s][A
epoch 2 iter 1788: train loss 0.22544. lr 1.428573e-04:  35%|███▌      | 1789/5098 [13:47<22:02,  2.50it/s][A
epoch 2 iter 1789: train loss 0.22804. lr 1.427786e-04:  35%|███▌      | 1789/5098 [13:47<22:02,  2.50it/s][A
epoch 2 iter 1789: train loss 0.22804. lr 1.427786e-04:  35%|███▌      | 1790/5098 [13:47<23:37,  2.33it/s][A
epoch 2 iter 1790: train loss 0.22848. lr 1.426998e-04:  35%|███▌      | 1790/5098 [13:48<23:37,  2.33it/s][A
epoch 2 iter 1790: train loss 0.22848. lr 1.426998e-04:  35%|███▌      | 1791/5098 [13:48<23:25,  2.35it/s][A
epoch 2 iter 1791: train loss 0.22307. lr 1.426211e-04:  35%|███▌      | 1791/5098 [13:48<23:25,  2.35it/s][A
epoch 2 iter 1791: train loss 0.22307. lr 1.426211e-04:  35%|███▌      | 1792/5098 [13:48<22:46,  2.42it/s][A

epoch 2 iter 1824: train loss 0.22565. lr 1.400318e-04:  36%|███▌      | 1824/5098 [14:03<21:49,  2.50it/s][A
epoch 2 iter 1824: train loss 0.22565. lr 1.400318e-04:  36%|███▌      | 1825/5098 [14:03<22:01,  2.48it/s][A
epoch 2 iter 1825: train loss 0.22687. lr 1.399536e-04:  36%|███▌      | 1825/5098 [14:04<22:01,  2.48it/s][A
epoch 2 iter 1825: train loss 0.22687. lr 1.399536e-04:  36%|███▌      | 1826/5098 [14:04<22:31,  2.42it/s][A
epoch 2 iter 1826: train loss 0.22694. lr 1.398754e-04:  36%|███▌      | 1826/5098 [14:04<22:31,  2.42it/s][A
epoch 2 iter 1826: train loss 0.22694. lr 1.398754e-04:  36%|███▌      | 1827/5098 [14:04<22:50,  2.39it/s][A
epoch 2 iter 1827: train loss 0.22250. lr 1.397973e-04:  36%|███▌      | 1827/5098 [14:04<22:50,  2.39it/s][A
epoch 2 iter 1827: train loss 0.22250. lr 1.397973e-04:  36%|███▌      | 1828/5098 [14:04<22:32,  2.42it/s][A
epoch 2 iter 1828: train loss 0.22106. lr 1.397191e-04:  36%|███▌      | 1828/5098 [14:05<22:32,  2.42it/s][A

epoch 2 iter 1860: train loss 0.22522. lr 1.372260e-04:  37%|███▋      | 1861/5098 [14:21<26:09,  2.06it/s][A
epoch 2 iter 1861: train loss 0.22810. lr 1.371484e-04:  37%|███▋      | 1861/5098 [14:21<26:09,  2.06it/s][A
epoch 2 iter 1861: train loss 0.22810. lr 1.371484e-04:  37%|███▋      | 1862/5098 [14:21<25:42,  2.10it/s][A
epoch 2 iter 1862: train loss 0.22441. lr 1.370707e-04:  37%|███▋      | 1862/5098 [14:22<25:42,  2.10it/s][A
epoch 2 iter 1862: train loss 0.22441. lr 1.370707e-04:  37%|███▋      | 1863/5098 [14:22<25:00,  2.16it/s][A
epoch 2 iter 1863: train loss 0.21871. lr 1.369931e-04:  37%|███▋      | 1863/5098 [14:22<25:00,  2.16it/s][A
epoch 2 iter 1863: train loss 0.21871. lr 1.369931e-04:  37%|███▋      | 1864/5098 [14:22<24:29,  2.20it/s][A
epoch 2 iter 1864: train loss 0.22166. lr 1.369155e-04:  37%|███▋      | 1864/5098 [14:23<24:29,  2.20it/s][A
epoch 2 iter 1864: train loss 0.22166. lr 1.369155e-04:  37%|███▋      | 1865/5098 [14:23<24:10,  2.23it/s][A

epoch 2 iter 1897: train loss 0.21754. lr 1.343631e-04:  37%|███▋      | 1897/5098 [14:37<22:47,  2.34it/s][A
epoch 2 iter 1897: train loss 0.21754. lr 1.343631e-04:  37%|███▋      | 1898/5098 [14:37<22:31,  2.37it/s][A
epoch 2 iter 1898: train loss 0.22113. lr 1.342861e-04:  37%|███▋      | 1898/5098 [14:38<22:31,  2.37it/s][A
epoch 2 iter 1898: train loss 0.22113. lr 1.342861e-04:  37%|███▋      | 1899/5098 [14:38<22:18,  2.39it/s][A
epoch 2 iter 1899: train loss 0.22411. lr 1.342090e-04:  37%|███▋      | 1899/5098 [14:38<22:18,  2.39it/s][A
epoch 2 iter 1899: train loss 0.22411. lr 1.342090e-04:  37%|███▋      | 1900/5098 [14:38<22:11,  2.40it/s][A
epoch 2 iter 1900: train loss 0.21971. lr 1.341320e-04:  37%|███▋      | 1900/5098 [14:39<22:11,  2.40it/s][A
epoch 2 iter 1900: train loss 0.21971. lr 1.341320e-04:  37%|███▋      | 1901/5098 [14:39<22:12,  2.40it/s][A
epoch 2 iter 1901: train loss 0.21971. lr 1.340549e-04:  37%|███▋      | 1901/5098 [14:39<22:12,  2.40it/s][A

epoch 2 iter 1933: train loss 0.21798. lr 1.315983e-04:  38%|███▊      | 1934/5098 [14:54<24:40,  2.14it/s][A
epoch 2 iter 1934: train loss 0.22018. lr 1.315218e-04:  38%|███▊      | 1934/5098 [14:54<24:40,  2.14it/s][A
epoch 2 iter 1934: train loss 0.22018. lr 1.315218e-04:  38%|███▊      | 1935/5098 [14:54<24:04,  2.19it/s][A
epoch 2 iter 1935: train loss 0.22217. lr 1.314453e-04:  38%|███▊      | 1935/5098 [14:55<24:04,  2.19it/s][A
epoch 2 iter 1935: train loss 0.22217. lr 1.314453e-04:  38%|███▊      | 1936/5098 [14:55<23:28,  2.25it/s][A
epoch 2 iter 1936: train loss 0.22270. lr 1.313688e-04:  38%|███▊      | 1936/5098 [14:55<23:28,  2.25it/s][A
epoch 2 iter 1936: train loss 0.22270. lr 1.313688e-04:  38%|███▊      | 1937/5098 [14:55<23:01,  2.29it/s][A
epoch 2 iter 1937: train loss 0.22256. lr 1.312924e-04:  38%|███▊      | 1937/5098 [14:56<23:01,  2.29it/s][A
epoch 2 iter 1937: train loss 0.22256. lr 1.312924e-04:  38%|███▊      | 1938/5098 [14:56<22:29,  2.34it/s][A

epoch 2 iter 1970: train loss 0.21698. lr 1.287783e-04:  39%|███▊      | 1970/5098 [15:12<32:48,  1.59it/s][A
epoch 2 iter 1970: train loss 0.21698. lr 1.287783e-04:  39%|███▊      | 1971/5098 [15:12<32:42,  1.59it/s][A
epoch 2 iter 1971: train loss 0.22002. lr 1.287024e-04:  39%|███▊      | 1971/5098 [15:13<32:42,  1.59it/s][A
epoch 2 iter 1971: train loss 0.22002. lr 1.287024e-04:  39%|███▊      | 1972/5098 [15:13<31:50,  1.64it/s][A
epoch 2 iter 1972: train loss 0.21676. lr 1.286265e-04:  39%|███▊      | 1972/5098 [15:13<31:50,  1.64it/s][A
epoch 2 iter 1972: train loss 0.21676. lr 1.286265e-04:  39%|███▊      | 1973/5098 [15:13<31:14,  1.67it/s][A
epoch 2 iter 1973: train loss 0.22087. lr 1.285506e-04:  39%|███▊      | 1973/5098 [15:14<31:14,  1.67it/s][A
epoch 2 iter 1973: train loss 0.22087. lr 1.285506e-04:  39%|███▊      | 1974/5098 [15:14<30:30,  1.71it/s][A
epoch 2 iter 1974: train loss 0.21970. lr 1.284747e-04:  39%|███▊      | 1974/5098 [15:14<30:30,  1.71it/s][A

epoch 2 iter 2006: train loss 0.21979. lr 1.260558e-04:  39%|███▉      | 2007/5098 [15:29<24:59,  2.06it/s][A
epoch 2 iter 2007: train loss 0.22160. lr 1.259805e-04:  39%|███▉      | 2007/5098 [15:30<24:59,  2.06it/s][A
epoch 2 iter 2007: train loss 0.22160. lr 1.259805e-04:  39%|███▉      | 2008/5098 [15:30<24:37,  2.09it/s][A
epoch 2 iter 2008: train loss 0.22129. lr 1.259052e-04:  39%|███▉      | 2008/5098 [15:30<24:37,  2.09it/s][A
epoch 2 iter 2008: train loss 0.22129. lr 1.259052e-04:  39%|███▉      | 2009/5098 [15:30<24:26,  2.11it/s][A
epoch 2 iter 2009: train loss 0.21873. lr 1.258299e-04:  39%|███▉      | 2009/5098 [15:31<24:26,  2.11it/s][A
epoch 2 iter 2009: train loss 0.21873. lr 1.258299e-04:  39%|███▉      | 2010/5098 [15:31<24:02,  2.14it/s][A
epoch 2 iter 2010: train loss 0.21834. lr 1.257547e-04:  39%|███▉      | 2010/5098 [15:31<24:02,  2.14it/s][A
epoch 2 iter 2010: train loss 0.21834. lr 1.257547e-04:  39%|███▉      | 2011/5098 [15:31<23:33,  2.18it/s][A

epoch 2 iter 2043: train loss 0.21870. lr 1.232801e-04:  40%|████      | 2043/5098 [15:46<21:39,  2.35it/s][A
epoch 2 iter 2043: train loss 0.21870. lr 1.232801e-04:  40%|████      | 2044/5098 [15:46<22:10,  2.30it/s][A
epoch 2 iter 2044: train loss 0.21692. lr 1.232054e-04:  40%|████      | 2044/5098 [15:46<22:10,  2.30it/s][A
epoch 2 iter 2044: train loss 0.21692. lr 1.232054e-04:  40%|████      | 2045/5098 [15:46<21:33,  2.36it/s][A
epoch 2 iter 2045: train loss 0.21766. lr 1.231307e-04:  40%|████      | 2045/5098 [15:46<21:33,  2.36it/s][A
epoch 2 iter 2045: train loss 0.21766. lr 1.231307e-04:  40%|████      | 2046/5098 [15:46<21:07,  2.41it/s][A
epoch 2 iter 2046: train loss 0.21970. lr 1.230560e-04:  40%|████      | 2046/5098 [15:47<21:07,  2.41it/s][A
epoch 2 iter 2046: train loss 0.21970. lr 1.230560e-04:  40%|████      | 2047/5098 [15:47<21:14,  2.39it/s][A
epoch 2 iter 2047: train loss 0.21489. lr 1.229814e-04:  40%|████      | 2047/5098 [15:47<21:14,  2.39it/s][A

epoch 2 iter 2079: train loss 0.22068. lr 1.206014e-04:  41%|████      | 2080/5098 [16:02<23:49,  2.11it/s][A
epoch 2 iter 2080: train loss 0.21893. lr 1.205273e-04:  41%|████      | 2080/5098 [16:02<23:49,  2.11it/s][A
epoch 2 iter 2080: train loss 0.21893. lr 1.205273e-04:  41%|████      | 2081/5098 [16:02<22:46,  2.21it/s][A
epoch 2 iter 2081: train loss 0.21586. lr 1.204532e-04:  41%|████      | 2081/5098 [16:03<22:46,  2.21it/s][A
epoch 2 iter 2081: train loss 0.21586. lr 1.204532e-04:  41%|████      | 2082/5098 [16:03<22:04,  2.28it/s][A
epoch 2 iter 2082: train loss 0.21606. lr 1.203792e-04:  41%|████      | 2082/5098 [16:04<22:04,  2.28it/s][A
epoch 2 iter 2082: train loss 0.21606. lr 1.203792e-04:  41%|████      | 2083/5098 [16:04<35:45,  1.41it/s][A
epoch 2 iter 2083: train loss 0.21871. lr 1.203051e-04:  41%|████      | 2083/5098 [16:04<35:45,  1.41it/s][A
epoch 2 iter 2083: train loss 0.21871. lr 1.203051e-04:  41%|████      | 2084/5098 [16:04<32:16,  1.56it/s][A

epoch 2 iter 2116: train loss 0.21593. lr 1.178713e-04:  42%|████▏     | 2116/5098 [16:20<21:42,  2.29it/s][A
epoch 2 iter 2116: train loss 0.21593. lr 1.178713e-04:  42%|████▏     | 2117/5098 [16:20<21:52,  2.27it/s][A
epoch 2 iter 2117: train loss 0.21710. lr 1.177979e-04:  42%|████▏     | 2117/5098 [16:20<21:52,  2.27it/s][A
epoch 2 iter 2117: train loss 0.21710. lr 1.177979e-04:  42%|████▏     | 2118/5098 [16:21<21:03,  2.36it/s][A
epoch 2 iter 2118: train loss 0.21455. lr 1.177244e-04:  42%|████▏     | 2118/5098 [16:21<21:03,  2.36it/s][A
epoch 2 iter 2118: train loss 0.21455. lr 1.177244e-04:  42%|████▏     | 2119/5098 [16:21<20:23,  2.43it/s][A
epoch 2 iter 2119: train loss 0.21811. lr 1.176510e-04:  42%|████▏     | 2119/5098 [16:21<20:23,  2.43it/s][A
epoch 2 iter 2119: train loss 0.21811. lr 1.176510e-04:  42%|████▏     | 2120/5098 [16:21<20:52,  2.38it/s][A
epoch 2 iter 2120: train loss 0.21581. lr 1.175776e-04:  42%|████▏     | 2120/5098 [16:22<20:52,  2.38it/s][A

epoch 2 iter 2152: train loss 0.21432. lr 1.152378e-04:  42%|████▏     | 2153/5098 [16:37<20:04,  2.45it/s][A
epoch 2 iter 2153: train loss 0.21963. lr 1.151649e-04:  42%|████▏     | 2153/5098 [16:37<20:04,  2.45it/s][A
epoch 2 iter 2153: train loss 0.21963. lr 1.151649e-04:  42%|████▏     | 2154/5098 [16:37<20:06,  2.44it/s][A
epoch 2 iter 2154: train loss 0.21496. lr 1.150921e-04:  42%|████▏     | 2154/5098 [16:37<20:06,  2.44it/s][A
epoch 2 iter 2154: train loss 0.21496. lr 1.150921e-04:  42%|████▏     | 2155/5098 [16:37<20:09,  2.43it/s][A
epoch 2 iter 2155: train loss 0.21584. lr 1.150193e-04:  42%|████▏     | 2155/5098 [16:38<20:09,  2.43it/s][A
epoch 2 iter 2155: train loss 0.21584. lr 1.150193e-04:  42%|████▏     | 2156/5098 [16:38<20:09,  2.43it/s][A
epoch 2 iter 2156: train loss 0.21554. lr 1.149465e-04:  42%|████▏     | 2156/5098 [16:38<20:09,  2.43it/s][A
epoch 2 iter 2156: train loss 0.21554. lr 1.149465e-04:  42%|████▏     | 2157/5098 [16:38<20:09,  2.43it/s][A

epoch 2 iter 2189: train loss 0.21354. lr 1.125547e-04:  43%|████▎     | 2189/5098 [16:53<22:01,  2.20it/s][A
epoch 2 iter 2189: train loss 0.21354. lr 1.125547e-04:  43%|████▎     | 2190/5098 [16:53<21:46,  2.23it/s][A
epoch 2 iter 2190: train loss 0.21158. lr 1.124826e-04:  43%|████▎     | 2190/5098 [16:53<21:46,  2.23it/s][A
epoch 2 iter 2190: train loss 0.21158. lr 1.124826e-04:  43%|████▎     | 2191/5098 [16:53<22:09,  2.19it/s][A
epoch 2 iter 2191: train loss 0.21497. lr 1.124104e-04:  43%|████▎     | 2191/5098 [16:54<22:09,  2.19it/s][A
epoch 2 iter 2191: train loss 0.21497. lr 1.124104e-04:  43%|████▎     | 2192/5098 [16:54<21:33,  2.25it/s][A
epoch 2 iter 2192: train loss 0.21011. lr 1.123383e-04:  43%|████▎     | 2192/5098 [16:54<21:33,  2.25it/s][A
epoch 2 iter 2192: train loss 0.21011. lr 1.123383e-04:  43%|████▎     | 2193/5098 [16:54<21:04,  2.30it/s][A
epoch 2 iter 2193: train loss 0.21810. lr 1.122661e-04:  43%|████▎     | 2193/5098 [16:55<21:04,  2.30it/s][A

epoch 2 iter 2225: train loss 0.21663. lr 1.099676e-04:  44%|████▎     | 2226/5098 [17:09<21:09,  2.26it/s][A
epoch 2 iter 2226: train loss 0.21448. lr 1.098961e-04:  44%|████▎     | 2226/5098 [17:09<21:09,  2.26it/s][A
epoch 2 iter 2226: train loss 0.21448. lr 1.098961e-04:  44%|████▎     | 2227/5098 [17:09<20:30,  2.33it/s][A
epoch 2 iter 2227: train loss 0.21172. lr 1.098246e-04:  44%|████▎     | 2227/5098 [17:10<20:30,  2.33it/s][A
epoch 2 iter 2227: train loss 0.21172. lr 1.098246e-04:  44%|████▎     | 2228/5098 [17:10<19:44,  2.42it/s][A
epoch 2 iter 2228: train loss 0.21383. lr 1.097531e-04:  44%|████▎     | 2228/5098 [17:10<19:44,  2.42it/s][A
epoch 2 iter 2228: train loss 0.21383. lr 1.097531e-04:  44%|████▎     | 2229/5098 [17:10<19:16,  2.48it/s][A
epoch 2 iter 2229: train loss 0.21095. lr 1.096816e-04:  44%|████▎     | 2229/5098 [17:11<19:16,  2.48it/s][A
epoch 2 iter 2229: train loss 0.21095. lr 1.096816e-04:  44%|████▎     | 2230/5098 [17:11<19:15,  2.48it/s][A

epoch 2 iter 2262: train loss 0.21480. lr 1.073330e-04:  44%|████▍     | 2262/5098 [17:26<21:43,  2.18it/s][A
epoch 2 iter 2262: train loss 0.21480. lr 1.073330e-04:  44%|████▍     | 2263/5098 [17:26<21:17,  2.22it/s][A
epoch 2 iter 2263: train loss 0.21481. lr 1.072622e-04:  44%|████▍     | 2263/5098 [17:26<21:17,  2.22it/s][A
epoch 2 iter 2263: train loss 0.21481. lr 1.072622e-04:  44%|████▍     | 2264/5098 [17:26<21:00,  2.25it/s][A
epoch 2 iter 2264: train loss 0.21421. lr 1.071913e-04:  44%|████▍     | 2264/5098 [17:26<21:00,  2.25it/s][A
epoch 2 iter 2264: train loss 0.21421. lr 1.071913e-04:  44%|████▍     | 2265/5098 [17:26<21:26,  2.20it/s][A
epoch 2 iter 2265: train loss 0.21226. lr 1.071205e-04:  44%|████▍     | 2265/5098 [17:27<21:26,  2.20it/s][A
epoch 2 iter 2265: train loss 0.21226. lr 1.071205e-04:  44%|████▍     | 2266/5098 [17:27<21:06,  2.24it/s][A
epoch 2 iter 2266: train loss 0.21334. lr 1.070497e-04:  44%|████▍     | 2266/5098 [17:27<21:06,  2.24it/s][A

epoch 2 iter 2298: train loss 0.20986. lr 1.047937e-04:  45%|████▌     | 2299/5098 [17:42<19:27,  2.40it/s][A
epoch 2 iter 2299: train loss 0.21327. lr 1.047235e-04:  45%|████▌     | 2299/5098 [17:43<19:27,  2.40it/s][A
epoch 2 iter 2299: train loss 0.21327. lr 1.047235e-04:  45%|████▌     | 2300/5098 [17:43<19:08,  2.44it/s][A
epoch 2 iter 2300: train loss 0.21378. lr 1.046533e-04:  45%|████▌     | 2300/5098 [17:43<19:08,  2.44it/s][A
epoch 2 iter 2300: train loss 0.21378. lr 1.046533e-04:  45%|████▌     | 2301/5098 [17:43<19:03,  2.45it/s][A
epoch 2 iter 2301: train loss 0.21334. lr 1.045831e-04:  45%|████▌     | 2301/5098 [17:44<19:03,  2.45it/s][A
epoch 2 iter 2301: train loss 0.21334. lr 1.045831e-04:  45%|████▌     | 2302/5098 [17:44<18:57,  2.46it/s][A
epoch 2 iter 2302: train loss 0.20645. lr 1.045130e-04:  45%|████▌     | 2302/5098 [17:44<18:57,  2.46it/s][A
epoch 2 iter 2302: train loss 0.20645. lr 1.045130e-04:  45%|████▌     | 2303/5098 [17:44<18:51,  2.47it/s][A

epoch 2 iter 2335: train loss 0.20773. lr 1.022088e-04:  46%|████▌     | 2335/5098 [17:59<19:06,  2.41it/s][A
epoch 2 iter 2335: train loss 0.20773. lr 1.022088e-04:  46%|████▌     | 2336/5098 [17:59<18:58,  2.43it/s][A
epoch 2 iter 2336: train loss 0.21367. lr 1.021393e-04:  46%|████▌     | 2336/5098 [17:59<18:58,  2.43it/s][A
epoch 2 iter 2336: train loss 0.21367. lr 1.021393e-04:  46%|████▌     | 2337/5098 [17:59<18:49,  2.44it/s][A
epoch 2 iter 2337: train loss 0.21225. lr 1.020698e-04:  46%|████▌     | 2337/5098 [18:00<18:49,  2.44it/s][A
epoch 2 iter 2337: train loss 0.21225. lr 1.020698e-04:  46%|████▌     | 2338/5098 [18:00<18:44,  2.45it/s][A
epoch 2 iter 2338: train loss 0.21046. lr 1.020004e-04:  46%|████▌     | 2338/5098 [18:00<18:44,  2.45it/s][A
epoch 2 iter 2338: train loss 0.21046. lr 1.020004e-04:  46%|████▌     | 2339/5098 [18:00<19:18,  2.38it/s][A
epoch 2 iter 2339: train loss 0.21005. lr 1.019309e-04:  46%|████▌     | 2339/5098 [18:01<19:18,  2.38it/s][A

epoch 2 iter 2371: train loss 0.20993. lr 9.971851e-05:  47%|████▋     | 2372/5098 [18:15<23:13,  1.96it/s][A
epoch 2 iter 2372: train loss 0.21255. lr 9.964969e-05:  47%|████▋     | 2372/5098 [18:15<23:13,  1.96it/s][A
epoch 2 iter 2372: train loss 0.21255. lr 9.964969e-05:  47%|████▋     | 2373/5098 [18:15<22:44,  2.00it/s][A
epoch 2 iter 2373: train loss 0.21044. lr 9.958088e-05:  47%|████▋     | 2373/5098 [18:16<22:44,  2.00it/s][A
epoch 2 iter 2373: train loss 0.21044. lr 9.958088e-05:  47%|████▋     | 2374/5098 [18:16<22:29,  2.02it/s][A
epoch 2 iter 2374: train loss 0.20817. lr 9.951209e-05:  47%|████▋     | 2374/5098 [18:16<22:29,  2.02it/s][A
epoch 2 iter 2374: train loss 0.20817. lr 9.951209e-05:  47%|████▋     | 2375/5098 [18:16<22:19,  2.03it/s][A
epoch 2 iter 2375: train loss 0.20977. lr 9.944333e-05:  47%|████▋     | 2375/5098 [18:17<22:19,  2.03it/s][A
epoch 2 iter 2375: train loss 0.20977. lr 9.944333e-05:  47%|████▋     | 2376/5098 [18:17<21:54,  2.07it/s][A

epoch 2 iter 2408: train loss 0.20891. lr 9.718472e-05:  47%|████▋     | 2408/5098 [18:31<19:15,  2.33it/s][A
epoch 2 iter 2408: train loss 0.20891. lr 9.718472e-05:  47%|████▋     | 2409/5098 [18:31<19:37,  2.28it/s][A
epoch 2 iter 2409: train loss 0.20691. lr 9.711660e-05:  47%|████▋     | 2409/5098 [18:31<19:37,  2.28it/s][A
epoch 2 iter 2409: train loss 0.20691. lr 9.711660e-05:  47%|████▋     | 2410/5098 [18:31<19:03,  2.35it/s][A
epoch 2 iter 2410: train loss 0.20944. lr 9.704851e-05:  47%|████▋     | 2410/5098 [18:32<19:03,  2.35it/s][A
epoch 2 iter 2410: train loss 0.20944. lr 9.704851e-05:  47%|████▋     | 2411/5098 [18:32<18:37,  2.41it/s][A
epoch 2 iter 2411: train loss 0.20843. lr 9.698043e-05:  47%|████▋     | 2411/5098 [18:32<18:37,  2.41it/s][A
epoch 2 iter 2411: train loss 0.20843. lr 9.698043e-05:  47%|████▋     | 2412/5098 [18:32<19:22,  2.31it/s][A
epoch 2 iter 2412: train loss 0.20891. lr 9.691237e-05:  47%|████▋     | 2412/5098 [18:33<19:22,  2.31it/s][A

epoch 2 iter 2444: train loss 0.20752. lr 9.474472e-05:  48%|████▊     | 2445/5098 [18:49<21:01,  2.10it/s][A
epoch 2 iter 2445: train loss 0.20796. lr 9.467730e-05:  48%|████▊     | 2445/5098 [18:49<21:01,  2.10it/s][A
epoch 2 iter 2445: train loss 0.20796. lr 9.467730e-05:  48%|████▊     | 2446/5098 [18:49<20:42,  2.13it/s][A
epoch 2 iter 2446: train loss 0.20359. lr 9.460990e-05:  48%|████▊     | 2446/5098 [18:50<20:42,  2.13it/s][A
epoch 2 iter 2446: train loss 0.20359. lr 9.460990e-05:  48%|████▊     | 2447/5098 [18:50<20:18,  2.18it/s][A
epoch 2 iter 2447: train loss 0.20690. lr 9.454252e-05:  48%|████▊     | 2447/5098 [18:50<20:18,  2.18it/s][A
epoch 2 iter 2447: train loss 0.20690. lr 9.454252e-05:  48%|████▊     | 2448/5098 [18:50<20:00,  2.21it/s][A
epoch 2 iter 2448: train loss 0.20858. lr 9.447516e-05:  48%|████▊     | 2448/5098 [18:51<20:00,  2.21it/s][A
epoch 2 iter 2448: train loss 0.20858. lr 9.447516e-05:  48%|████▊     | 2449/5098 [18:51<19:49,  2.23it/s][A

epoch 2 iter 2481: train loss 0.20616. lr 9.226327e-05:  49%|████▊     | 2481/5098 [19:05<17:57,  2.43it/s][A
epoch 2 iter 2481: train loss 0.20616. lr 9.226327e-05:  49%|████▊     | 2482/5098 [19:05<17:55,  2.43it/s][A
epoch 2 iter 2482: train loss 0.20520. lr 9.219658e-05:  49%|████▊     | 2482/5098 [19:05<17:55,  2.43it/s][A
epoch 2 iter 2482: train loss 0.20520. lr 9.219658e-05:  49%|████▊     | 2483/5098 [19:05<17:53,  2.44it/s][A
epoch 2 iter 2483: train loss 0.20785. lr 9.212990e-05:  49%|████▊     | 2483/5098 [19:05<17:53,  2.44it/s][A
epoch 2 iter 2483: train loss 0.20785. lr 9.212990e-05:  49%|████▊     | 2484/5098 [19:05<18:09,  2.40it/s][A
epoch 2 iter 2484: train loss 0.20495. lr 9.206325e-05:  49%|████▊     | 2484/5098 [19:06<18:09,  2.40it/s][A
epoch 2 iter 2484: train loss 0.20495. lr 9.206325e-05:  49%|████▊     | 2485/5098 [19:06<17:52,  2.44it/s][A
epoch 2 iter 2485: train loss 0.20697. lr 9.199662e-05:  49%|████▊     | 2485/5098 [19:06<17:52,  2.44it/s][A

epoch 2 iter 2517: train loss 0.20479. lr 8.987481e-05:  49%|████▉     | 2518/5098 [19:20<19:08,  2.25it/s][A
epoch 2 iter 2518: train loss 0.20833. lr 8.980883e-05:  49%|████▉     | 2518/5098 [19:21<19:08,  2.25it/s][A
epoch 2 iter 2518: train loss 0.20833. lr 8.980883e-05:  49%|████▉     | 2519/5098 [19:21<19:14,  2.23it/s][A
epoch 2 iter 2519: train loss 0.21109. lr 8.974287e-05:  49%|████▉     | 2519/5098 [19:21<19:14,  2.23it/s][A
epoch 2 iter 2519: train loss 0.21109. lr 8.974287e-05:  49%|████▉     | 2520/5098 [19:21<19:24,  2.21it/s][A
epoch 2 iter 2520: train loss 0.20715. lr 8.967693e-05:  49%|████▉     | 2520/5098 [19:22<19:24,  2.21it/s][A
epoch 2 iter 2520: train loss 0.20715. lr 8.967693e-05:  49%|████▉     | 2521/5098 [19:22<19:27,  2.21it/s][A
epoch 2 iter 2521: train loss 0.20443. lr 8.961101e-05:  49%|████▉     | 2521/5098 [19:22<19:27,  2.21it/s][A
epoch 2 iter 2521: train loss 0.20443. lr 8.961101e-05:  49%|████▉     | 2522/5098 [19:22<19:21,  2.22it/s][A

epoch 2 iter 2554: train loss 0.20851. lr 8.744695e-05:  50%|█████     | 2554/5098 [19:38<24:54,  1.70it/s][A
epoch 2 iter 2554: train loss 0.20851. lr 8.744695e-05:  50%|█████     | 2555/5098 [19:38<26:03,  1.63it/s][A
epoch 2 iter 2555: train loss 0.20598. lr 8.738172e-05:  50%|█████     | 2555/5098 [19:38<26:03,  1.63it/s][A
epoch 2 iter 2555: train loss 0.20598. lr 8.738172e-05:  50%|█████     | 2556/5098 [19:38<26:00,  1.63it/s][A
epoch 2 iter 2556: train loss 0.20696. lr 8.731650e-05:  50%|█████     | 2556/5098 [19:39<26:00,  1.63it/s][A
epoch 2 iter 2556: train loss 0.20696. lr 8.731650e-05:  50%|█████     | 2557/5098 [19:39<25:20,  1.67it/s][A
epoch 2 iter 2557: train loss 0.20638. lr 8.725131e-05:  50%|█████     | 2557/5098 [19:40<25:20,  1.67it/s][A
epoch 2 iter 2557: train loss 0.20638. lr 8.725131e-05:  50%|█████     | 2558/5098 [19:40<24:53,  1.70it/s][A
epoch 2 iter 2558: train loss 0.20433. lr 8.718613e-05:  50%|█████     | 2558/5098 [19:40<24:53,  1.70it/s][A

epoch 2 iter 2590: train loss 0.20521. lr 8.511124e-05:  51%|█████     | 2591/5098 [19:54<17:46,  2.35it/s][A
epoch 2 iter 2591: train loss 0.20356. lr 8.504674e-05:  51%|█████     | 2591/5098 [19:55<17:46,  2.35it/s][A
epoch 2 iter 2591: train loss 0.20356. lr 8.504674e-05:  51%|█████     | 2592/5098 [19:55<17:38,  2.37it/s][A
epoch 2 iter 2592: train loss 0.20435. lr 8.498225e-05:  51%|█████     | 2592/5098 [19:55<17:38,  2.37it/s][A
epoch 2 iter 2592: train loss 0.20435. lr 8.498225e-05:  51%|█████     | 2593/5098 [19:55<17:31,  2.38it/s][A
epoch 2 iter 2593: train loss 0.20955. lr 8.491779e-05:  51%|█████     | 2593/5098 [19:55<17:31,  2.38it/s][A
epoch 2 iter 2593: train loss 0.20955. lr 8.491779e-05:  51%|█████     | 2594/5098 [19:55<17:15,  2.42it/s][A
epoch 2 iter 2594: train loss 0.20654. lr 8.485335e-05:  51%|█████     | 2594/5098 [19:56<17:15,  2.42it/s][A
epoch 2 iter 2594: train loss 0.20654. lr 8.485335e-05:  51%|█████     | 2595/5098 [19:56<17:03,  2.45it/s][A

epoch 2 iter 2627: train loss 0.20843. lr 8.273822e-05:  52%|█████▏    | 2627/5098 [20:10<19:34,  2.10it/s][A
epoch 2 iter 2627: train loss 0.20843. lr 8.273822e-05:  52%|█████▏    | 2628/5098 [20:10<19:29,  2.11it/s][A
epoch 2 iter 2628: train loss 0.20321. lr 8.267447e-05:  52%|█████▏    | 2628/5098 [20:11<19:29,  2.11it/s][A
epoch 2 iter 2628: train loss 0.20321. lr 8.267447e-05:  52%|█████▏    | 2629/5098 [20:11<19:31,  2.11it/s][A
epoch 2 iter 2629: train loss 0.20017. lr 8.261075e-05:  52%|█████▏    | 2629/5098 [20:11<19:31,  2.11it/s][A
epoch 2 iter 2629: train loss 0.20017. lr 8.261075e-05:  52%|█████▏    | 2630/5098 [20:11<19:30,  2.11it/s][A
epoch 2 iter 2630: train loss 0.20634. lr 8.254704e-05:  52%|█████▏    | 2630/5098 [20:12<19:30,  2.11it/s][A
epoch 2 iter 2630: train loss 0.20634. lr 8.254704e-05:  52%|█████▏    | 2631/5098 [20:12<19:25,  2.12it/s][A
epoch 2 iter 2631: train loss 0.20193. lr 8.248336e-05:  52%|█████▏    | 2631/5098 [20:12<19:25,  2.12it/s][A

epoch 2 iter 2663: train loss 0.20285. lr 8.045644e-05:  52%|█████▏    | 2664/5098 [20:29<22:20,  1.82it/s][A
epoch 2 iter 2664: train loss 0.20295. lr 8.039344e-05:  52%|█████▏    | 2664/5098 [20:29<22:20,  1.82it/s][A
epoch 2 iter 2664: train loss 0.20295. lr 8.039344e-05:  52%|█████▏    | 2665/5098 [20:29<21:33,  1.88it/s][A
epoch 2 iter 2665: train loss 0.20612. lr 8.033046e-05:  52%|█████▏    | 2665/5098 [20:30<21:33,  1.88it/s][A
epoch 2 iter 2665: train loss 0.20612. lr 8.033046e-05:  52%|█████▏    | 2666/5098 [20:30<20:40,  1.96it/s][A
epoch 2 iter 2666: train loss 0.19974. lr 8.026750e-05:  52%|█████▏    | 2666/5098 [20:30<20:40,  1.96it/s][A
epoch 2 iter 2666: train loss 0.19974. lr 8.026750e-05:  52%|█████▏    | 2667/5098 [20:30<20:06,  2.01it/s][A
epoch 2 iter 2667: train loss 0.20286. lr 8.020457e-05:  52%|█████▏    | 2667/5098 [20:31<20:06,  2.01it/s][A
epoch 2 iter 2667: train loss 0.20286. lr 8.020457e-05:  52%|█████▏    | 2668/5098 [20:31<19:40,  2.06it/s][A

epoch 2 iter 2700: train loss 0.20586. lr 7.813944e-05:  53%|█████▎    | 2700/5098 [20:45<17:13,  2.32it/s][A
epoch 2 iter 2700: train loss 0.20586. lr 7.813944e-05:  53%|█████▎    | 2701/5098 [20:45<17:08,  2.33it/s][A
epoch 2 iter 2701: train loss 0.20225. lr 7.807721e-05:  53%|█████▎    | 2701/5098 [20:45<17:08,  2.33it/s][A
epoch 2 iter 2701: train loss 0.20225. lr 7.807721e-05:  53%|█████▎    | 2702/5098 [20:45<16:55,  2.36it/s][A
epoch 2 iter 2702: train loss 0.20132. lr 7.801501e-05:  53%|█████▎    | 2702/5098 [20:46<16:55,  2.36it/s][A
epoch 2 iter 2702: train loss 0.20132. lr 7.801501e-05:  53%|█████▎    | 2703/5098 [20:46<16:41,  2.39it/s][A
epoch 2 iter 2703: train loss 0.20393. lr 7.795283e-05:  53%|█████▎    | 2703/5098 [20:46<16:41,  2.39it/s][A
epoch 2 iter 2703: train loss 0.20393. lr 7.795283e-05:  53%|█████▎    | 2704/5098 [20:46<16:26,  2.43it/s][A
epoch 2 iter 2704: train loss 0.20217. lr 7.789067e-05:  53%|█████▎    | 2704/5098 [20:47<16:26,  2.43it/s][A

epoch 2 iter 2736: train loss 0.20009. lr 7.591274e-05:  54%|█████▎    | 2737/5098 [21:01<19:21,  2.03it/s][A
epoch 2 iter 2737: train loss 0.20144. lr 7.585128e-05:  54%|█████▎    | 2737/5098 [21:02<19:21,  2.03it/s][A
epoch 2 iter 2737: train loss 0.20144. lr 7.585128e-05:  54%|█████▎    | 2738/5098 [21:02<19:04,  2.06it/s][A
epoch 2 iter 2738: train loss 0.20382. lr 7.578984e-05:  54%|█████▎    | 2738/5098 [21:02<19:04,  2.06it/s][A
epoch 2 iter 2738: train loss 0.20382. lr 7.578984e-05:  54%|█████▎    | 2739/5098 [21:02<18:43,  2.10it/s][A
epoch 2 iter 2739: train loss 0.20027. lr 7.572843e-05:  54%|█████▎    | 2739/5098 [21:03<18:43,  2.10it/s][A
epoch 2 iter 2739: train loss 0.20027. lr 7.572843e-05:  54%|█████▎    | 2740/5098 [21:03<18:28,  2.13it/s][A
epoch 2 iter 2740: train loss 0.20068. lr 7.566703e-05:  54%|█████▎    | 2740/5098 [21:03<18:28,  2.13it/s][A
epoch 2 iter 2740: train loss 0.20068. lr 7.566703e-05:  54%|█████▍    | 2741/5098 [21:03<18:23,  2.14it/s][A

epoch 2 iter 2773: train loss 0.19850. lr 7.365294e-05:  54%|█████▍    | 2773/5098 [21:18<16:11,  2.39it/s][A
epoch 2 iter 2773: train loss 0.19850. lr 7.365294e-05:  54%|█████▍    | 2774/5098 [21:18<16:07,  2.40it/s][A
epoch 2 iter 2774: train loss 0.20277. lr 7.359227e-05:  54%|█████▍    | 2774/5098 [21:18<16:07,  2.40it/s][A
epoch 2 iter 2774: train loss 0.20277. lr 7.359227e-05:  54%|█████▍    | 2775/5098 [21:18<16:12,  2.39it/s][A
epoch 2 iter 2775: train loss 0.20450. lr 7.353163e-05:  54%|█████▍    | 2775/5098 [21:19<16:12,  2.39it/s][A
epoch 2 iter 2775: train loss 0.20450. lr 7.353163e-05:  54%|█████▍    | 2776/5098 [21:19<16:07,  2.40it/s][A
epoch 2 iter 2776: train loss 0.20031. lr 7.347100e-05:  54%|█████▍    | 2776/5098 [21:19<16:07,  2.40it/s][A
epoch 2 iter 2776: train loss 0.20031. lr 7.347100e-05:  54%|█████▍    | 2777/5098 [21:19<16:06,  2.40it/s][A
epoch 2 iter 2777: train loss 0.20350. lr 7.341040e-05:  54%|█████▍    | 2777/5098 [21:20<16:06,  2.40it/s][A

epoch 2 iter 2809: train loss 0.19711. lr 7.148246e-05:  55%|█████▌    | 2810/5098 [21:34<15:45,  2.42it/s][A
epoch 2 iter 2810: train loss 0.19954. lr 7.142257e-05:  55%|█████▌    | 2810/5098 [21:34<15:45,  2.42it/s][A
epoch 2 iter 2810: train loss 0.19954. lr 7.142257e-05:  55%|█████▌    | 2811/5098 [21:34<16:08,  2.36it/s][A
epoch 2 iter 2811: train loss 0.20420. lr 7.136270e-05:  55%|█████▌    | 2811/5098 [21:34<16:08,  2.36it/s][A
epoch 2 iter 2811: train loss 0.20420. lr 7.136270e-05:  55%|█████▌    | 2812/5098 [21:34<15:47,  2.41it/s][A
epoch 2 iter 2812: train loss 0.20143. lr 7.130286e-05:  55%|█████▌    | 2812/5098 [21:35<15:47,  2.41it/s][A
epoch 2 iter 2812: train loss 0.20143. lr 7.130286e-05:  55%|█████▌    | 2813/5098 [21:35<15:38,  2.44it/s][A
epoch 2 iter 2813: train loss 0.20274. lr 7.124303e-05:  55%|█████▌    | 2813/5098 [21:35<15:38,  2.44it/s][A
epoch 2 iter 2813: train loss 0.20274. lr 7.124303e-05:  55%|█████▌    | 2814/5098 [21:35<15:36,  2.44it/s][A

epoch 2 iter 2846: train loss 0.20453. lr 6.928101e-05:  56%|█████▌    | 2846/5098 [21:53<23:36,  1.59it/s][A
epoch 2 iter 2846: train loss 0.20453. lr 6.928101e-05:  56%|█████▌    | 2847/5098 [21:53<22:26,  1.67it/s][A
epoch 2 iter 2847: train loss 0.20071. lr 6.922192e-05:  56%|█████▌    | 2847/5098 [21:54<22:26,  1.67it/s][A
epoch 2 iter 2847: train loss 0.20071. lr 6.922192e-05:  56%|█████▌    | 2848/5098 [21:54<20:58,  1.79it/s][A
epoch 2 iter 2848: train loss 0.19786. lr 6.916286e-05:  56%|█████▌    | 2848/5098 [21:54<20:58,  1.79it/s][A
epoch 2 iter 2848: train loss 0.19786. lr 6.916286e-05:  56%|█████▌    | 2849/5098 [21:54<19:56,  1.88it/s][A
epoch 2 iter 2849: train loss 0.19974. lr 6.910382e-05:  56%|█████▌    | 2849/5098 [21:55<19:56,  1.88it/s][A
epoch 2 iter 2849: train loss 0.19974. lr 6.910382e-05:  56%|█████▌    | 2850/5098 [21:55<19:15,  1.95it/s][A
epoch 2 iter 2850: train loss 0.20124. lr 6.904480e-05:  56%|█████▌    | 2850/5098 [21:55<19:15,  1.95it/s][A

epoch 2 iter 2882: train loss 0.19803. lr 6.716784e-05:  57%|█████▋    | 2883/5098 [22:08<15:18,  2.41it/s][A
epoch 2 iter 2883: train loss 0.19831. lr 6.710955e-05:  57%|█████▋    | 2883/5098 [22:09<15:18,  2.41it/s][A
epoch 2 iter 2883: train loss 0.19831. lr 6.710955e-05:  57%|█████▋    | 2884/5098 [22:09<15:51,  2.33it/s][A
epoch 2 iter 2884: train loss 0.19825. lr 6.705128e-05:  57%|█████▋    | 2884/5098 [22:09<15:51,  2.33it/s][A
epoch 2 iter 2884: train loss 0.19825. lr 6.705128e-05:  57%|█████▋    | 2885/5098 [22:09<15:31,  2.38it/s][A
epoch 2 iter 2885: train loss 0.19798. lr 6.699303e-05:  57%|█████▋    | 2885/5098 [22:10<15:31,  2.38it/s][A
epoch 2 iter 2885: train loss 0.19798. lr 6.699303e-05:  57%|█████▋    | 2886/5098 [22:10<15:17,  2.41it/s][A
epoch 2 iter 2886: train loss 0.20052. lr 6.693481e-05:  57%|█████▋    | 2886/5098 [22:10<15:17,  2.41it/s][A
epoch 2 iter 2886: train loss 0.20052. lr 6.693481e-05:  57%|█████▋    | 2887/5098 [22:10<15:08,  2.43it/s][A

epoch 2 iter 2919: train loss 0.19679. lr 6.502584e-05:  57%|█████▋    | 2919/5098 [22:27<14:59,  2.42it/s][A
epoch 2 iter 2919: train loss 0.19679. lr 6.502584e-05:  57%|█████▋    | 2920/5098 [22:27<14:56,  2.43it/s][A
epoch 2 iter 2920: train loss 0.19726. lr 6.496837e-05:  57%|█████▋    | 2920/5098 [22:27<14:56,  2.43it/s][A
epoch 2 iter 2920: train loss 0.19726. lr 6.496837e-05:  57%|█████▋    | 2921/5098 [22:27<14:57,  2.43it/s][A
epoch 2 iter 2921: train loss 0.19859. lr 6.491092e-05:  57%|█████▋    | 2921/5098 [22:28<14:57,  2.43it/s][A
epoch 2 iter 2921: train loss 0.19859. lr 6.491092e-05:  57%|█████▋    | 2922/5098 [22:28<14:54,  2.43it/s][A
epoch 2 iter 2922: train loss 0.19700. lr 6.485350e-05:  57%|█████▋    | 2922/5098 [22:28<14:54,  2.43it/s][A
epoch 2 iter 2922: train loss 0.19700. lr 6.485350e-05:  57%|█████▋    | 2923/5098 [22:28<14:52,  2.44it/s][A
epoch 2 iter 2923: train loss 0.19793. lr 6.479610e-05:  57%|█████▋    | 2923/5098 [22:29<14:52,  2.44it/s][A

epoch 2 iter 2955: train loss 0.19925. lr 6.297105e-05:  58%|█████▊    | 2956/5098 [22:43<15:55,  2.24it/s][A
epoch 2 iter 2956: train loss 0.19953. lr 6.291439e-05:  58%|█████▊    | 2956/5098 [22:44<15:55,  2.24it/s][A
epoch 2 iter 2956: train loss 0.19953. lr 6.291439e-05:  58%|█████▊    | 2957/5098 [22:44<15:38,  2.28it/s][A
epoch 2 iter 2957: train loss 0.19556. lr 6.285775e-05:  58%|█████▊    | 2957/5098 [22:44<15:38,  2.28it/s][A
epoch 2 iter 2957: train loss 0.19556. lr 6.285775e-05:  58%|█████▊    | 2958/5098 [22:44<15:51,  2.25it/s][A
epoch 2 iter 2958: train loss 0.19698. lr 6.280113e-05:  58%|█████▊    | 2958/5098 [22:45<15:51,  2.25it/s][A
epoch 2 iter 2958: train loss 0.19698. lr 6.280113e-05:  58%|█████▊    | 2959/5098 [22:45<15:28,  2.30it/s][A
epoch 2 iter 2959: train loss 0.19681. lr 6.274454e-05:  58%|█████▊    | 2959/5098 [22:45<15:28,  2.30it/s][A
epoch 2 iter 2959: train loss 0.19681. lr 6.274454e-05:  58%|█████▊    | 2960/5098 [22:45<15:12,  2.34it/s][A

epoch 2 iter 2992: train loss 0.19948. lr 6.088959e-05:  59%|█████▊    | 2992/5098 [22:59<14:32,  2.41it/s][A
epoch 2 iter 2992: train loss 0.19948. lr 6.088959e-05:  59%|█████▊    | 2993/5098 [22:59<14:31,  2.42it/s][A
epoch 2 iter 2993: train loss 0.19542. lr 6.083377e-05:  59%|█████▊    | 2993/5098 [23:00<14:31,  2.42it/s][A
epoch 2 iter 2993: train loss 0.19542. lr 6.083377e-05:  59%|█████▊    | 2994/5098 [23:00<14:27,  2.42it/s][A
epoch 2 iter 2994: train loss 0.19807. lr 6.077797e-05:  59%|█████▊    | 2994/5098 [23:00<14:27,  2.42it/s][A
epoch 2 iter 2994: train loss 0.19807. lr 6.077797e-05:  59%|█████▊    | 2995/5098 [23:00<14:23,  2.44it/s][A
epoch 2 iter 2995: train loss 0.19458. lr 6.072219e-05:  59%|█████▊    | 2995/5098 [23:01<14:23,  2.44it/s][A
epoch 2 iter 2995: train loss 0.19458. lr 6.072219e-05:  59%|█████▉    | 2996/5098 [23:01<14:17,  2.45it/s][A
epoch 2 iter 2996: train loss 0.19742. lr 6.066643e-05:  59%|█████▉    | 2996/5098 [23:01<14:17,  2.45it/s][A

epoch 2 iter 3028: train loss 0.19875. lr 6.000000e-05:  59%|█████▉    | 3029/5098 [23:16<14:16,  2.42it/s][A
epoch 2 iter 3029: train loss 0.19725. lr 6.000000e-05:  59%|█████▉    | 3029/5098 [23:17<14:16,  2.42it/s][A
epoch 2 iter 3029: train loss 0.19725. lr 6.000000e-05:  59%|█████▉    | 3030/5098 [23:17<14:21,  2.40it/s][A
epoch 2 iter 3030: train loss 0.19745. lr 6.000000e-05:  59%|█████▉    | 3030/5098 [23:17<14:21,  2.40it/s][A
epoch 2 iter 3030: train loss 0.19745. lr 6.000000e-05:  59%|█████▉    | 3031/5098 [23:17<14:22,  2.40it/s][A
epoch 2 iter 3031: train loss 0.19876. lr 6.000000e-05:  59%|█████▉    | 3031/5098 [23:17<14:22,  2.40it/s][A
epoch 2 iter 3031: train loss 0.19876. lr 6.000000e-05:  59%|█████▉    | 3032/5098 [23:17<14:24,  2.39it/s][A
epoch 2 iter 3032: train loss 0.19680. lr 6.000000e-05:  59%|█████▉    | 3032/5098 [23:18<14:24,  2.39it/s][A
epoch 2 iter 3032: train loss 0.19680. lr 6.000000e-05:  59%|█████▉    | 3033/5098 [23:18<13:53,  2.48it/s][A

epoch 2 iter 3065: train loss 0.19554. lr 6.000000e-05:  60%|██████    | 3065/5098 [23:33<15:00,  2.26it/s][A
epoch 2 iter 3065: train loss 0.19554. lr 6.000000e-05:  60%|██████    | 3066/5098 [23:33<14:48,  2.29it/s][A
epoch 2 iter 3066: train loss 0.19514. lr 6.000000e-05:  60%|██████    | 3066/5098 [23:34<14:48,  2.29it/s][A
epoch 2 iter 3066: train loss 0.19514. lr 6.000000e-05:  60%|██████    | 3067/5098 [23:34<14:36,  2.32it/s][A
epoch 2 iter 3067: train loss 0.19861. lr 6.000000e-05:  60%|██████    | 3067/5098 [23:34<14:36,  2.32it/s][A
epoch 2 iter 3067: train loss 0.19861. lr 6.000000e-05:  60%|██████    | 3068/5098 [23:34<14:30,  2.33it/s][A
epoch 2 iter 3068: train loss 0.19962. lr 6.000000e-05:  60%|██████    | 3068/5098 [23:34<14:30,  2.33it/s][A
epoch 2 iter 3068: train loss 0.19962. lr 6.000000e-05:  60%|██████    | 3069/5098 [23:34<14:23,  2.35it/s][A
epoch 2 iter 3069: train loss 0.19635. lr 6.000000e-05:  60%|██████    | 3069/5098 [23:35<14:23,  2.35it/s][A

epoch 2 iter 3101: train loss 0.19551. lr 6.000000e-05:  61%|██████    | 3102/5098 [23:50<15:39,  2.12it/s][A
epoch 2 iter 3102: train loss 0.19497. lr 6.000000e-05:  61%|██████    | 3102/5098 [23:50<15:39,  2.12it/s][A
epoch 2 iter 3102: train loss 0.19497. lr 6.000000e-05:  61%|██████    | 3103/5098 [23:50<15:06,  2.20it/s][A
epoch 2 iter 3103: train loss 0.19542. lr 6.000000e-05:  61%|██████    | 3103/5098 [23:51<15:06,  2.20it/s][A
epoch 2 iter 3103: train loss 0.19542. lr 6.000000e-05:  61%|██████    | 3104/5098 [23:51<14:40,  2.26it/s][A
epoch 2 iter 3104: train loss 0.19657. lr 6.000000e-05:  61%|██████    | 3104/5098 [23:51<14:40,  2.26it/s][A
epoch 2 iter 3104: train loss 0.19657. lr 6.000000e-05:  61%|██████    | 3105/5098 [23:51<14:22,  2.31it/s][A
epoch 2 iter 3105: train loss 0.19608. lr 6.000000e-05:  61%|██████    | 3105/5098 [23:51<14:22,  2.31it/s][A
epoch 2 iter 3105: train loss 0.19608. lr 6.000000e-05:  61%|██████    | 3106/5098 [23:51<14:36,  2.27it/s][A

epoch 2 iter 3138: train loss 0.19502. lr 6.000000e-05:  62%|██████▏   | 3138/5098 [24:06<13:26,  2.43it/s][A
epoch 2 iter 3138: train loss 0.19502. lr 6.000000e-05:  62%|██████▏   | 3139/5098 [24:06<13:04,  2.50it/s][A
epoch 2 iter 3139: train loss 0.19497. lr 6.000000e-05:  62%|██████▏   | 3139/5098 [24:06<13:04,  2.50it/s][A
epoch 2 iter 3139: train loss 0.19497. lr 6.000000e-05:  62%|██████▏   | 3140/5098 [24:06<12:43,  2.56it/s][A
epoch 2 iter 3140: train loss 0.19758. lr 6.000000e-05:  62%|██████▏   | 3140/5098 [24:06<12:43,  2.56it/s][A
epoch 2 iter 3140: train loss 0.19758. lr 6.000000e-05:  62%|██████▏   | 3141/5098 [24:06<13:07,  2.48it/s][A
epoch 2 iter 3141: train loss 0.19850. lr 6.000000e-05:  62%|██████▏   | 3141/5098 [24:07<13:07,  2.48it/s][A
epoch 2 iter 3141: train loss 0.19850. lr 6.000000e-05:  62%|██████▏   | 3142/5098 [24:07<13:42,  2.38it/s][A
epoch 2 iter 3142: train loss 0.19723. lr 6.000000e-05:  62%|██████▏   | 3142/5098 [24:07<13:42,  2.38it/s][A

epoch 2 iter 3174: train loss 0.19110. lr 6.000000e-05:  62%|██████▏   | 3175/5098 [24:22<13:03,  2.46it/s][A
epoch 2 iter 3175: train loss 0.19744. lr 6.000000e-05:  62%|██████▏   | 3175/5098 [24:22<13:03,  2.46it/s][A
epoch 2 iter 3175: train loss 0.19744. lr 6.000000e-05:  62%|██████▏   | 3176/5098 [24:22<13:50,  2.31it/s][A
epoch 2 iter 3176: train loss 0.18926. lr 6.000000e-05:  62%|██████▏   | 3176/5098 [24:23<13:50,  2.31it/s][A
epoch 2 iter 3176: train loss 0.18926. lr 6.000000e-05:  62%|██████▏   | 3177/5098 [24:23<14:39,  2.18it/s][A
epoch 2 iter 3177: train loss 0.19461. lr 6.000000e-05:  62%|██████▏   | 3177/5098 [24:23<14:39,  2.18it/s][A
epoch 2 iter 3177: train loss 0.19461. lr 6.000000e-05:  62%|██████▏   | 3178/5098 [24:23<15:13,  2.10it/s][A
epoch 2 iter 3178: train loss 0.19766. lr 6.000000e-05:  62%|██████▏   | 3178/5098 [24:24<15:13,  2.10it/s][A
epoch 2 iter 3178: train loss 0.19766. lr 6.000000e-05:  62%|██████▏   | 3179/5098 [24:24<15:37,  2.05it/s][A

epoch 2 iter 3211: train loss 0.19407. lr 6.000000e-05:  63%|██████▎   | 3211/5098 [24:39<12:49,  2.45it/s][A
epoch 2 iter 3211: train loss 0.19407. lr 6.000000e-05:  63%|██████▎   | 3212/5098 [24:39<12:55,  2.43it/s][A
epoch 2 iter 3212: train loss 0.19133. lr 6.000000e-05:  63%|██████▎   | 3212/5098 [24:39<12:55,  2.43it/s][A
epoch 2 iter 3212: train loss 0.19133. lr 6.000000e-05:  63%|██████▎   | 3213/5098 [24:39<13:50,  2.27it/s][A
epoch 2 iter 3213: train loss 0.19521. lr 6.000000e-05:  63%|██████▎   | 3213/5098 [24:40<13:50,  2.27it/s][A
epoch 2 iter 3213: train loss 0.19521. lr 6.000000e-05:  63%|██████▎   | 3214/5098 [24:40<14:18,  2.19it/s][A
epoch 2 iter 3214: train loss 0.19189. lr 6.000000e-05:  63%|██████▎   | 3214/5098 [24:40<14:18,  2.19it/s][A
epoch 2 iter 3214: train loss 0.19189. lr 6.000000e-05:  63%|██████▎   | 3215/5098 [24:40<14:41,  2.14it/s][A
epoch 2 iter 3215: train loss 0.19534. lr 6.000000e-05:  63%|██████▎   | 3215/5098 [24:41<14:41,  2.14it/s][A

epoch 2 iter 3247: train loss 0.19544. lr 6.000000e-05:  64%|██████▎   | 3248/5098 [24:56<12:49,  2.41it/s][A
epoch 2 iter 3248: train loss 0.19597. lr 6.000000e-05:  64%|██████▎   | 3248/5098 [24:56<12:49,  2.41it/s][A
epoch 2 iter 3248: train loss 0.19597. lr 6.000000e-05:  64%|██████▎   | 3249/5098 [24:56<12:45,  2.42it/s][A
epoch 2 iter 3249: train loss 0.19648. lr 6.000000e-05:  64%|██████▎   | 3249/5098 [24:57<12:45,  2.42it/s][A
epoch 2 iter 3249: train loss 0.19648. lr 6.000000e-05:  64%|██████▍   | 3250/5098 [24:57<12:44,  2.42it/s][A
epoch 2 iter 3250: train loss 0.19724. lr 6.000000e-05:  64%|██████▍   | 3250/5098 [24:57<12:44,  2.42it/s][A
epoch 2 iter 3250: train loss 0.19724. lr 6.000000e-05:  64%|██████▍   | 3251/5098 [24:57<12:43,  2.42it/s][A
epoch 2 iter 3251: train loss 0.19189. lr 6.000000e-05:  64%|██████▍   | 3251/5098 [24:58<12:43,  2.42it/s][A
epoch 2 iter 3251: train loss 0.19189. lr 6.000000e-05:  64%|██████▍   | 3252/5098 [24:58<12:41,  2.43it/s][A

epoch 2 iter 3284: train loss 0.19405. lr 6.000000e-05:  64%|██████▍   | 3284/5098 [25:13<12:35,  2.40it/s][A
epoch 2 iter 3284: train loss 0.19405. lr 6.000000e-05:  64%|██████▍   | 3285/5098 [25:13<12:24,  2.44it/s][A
epoch 2 iter 3285: train loss 0.19687. lr 6.000000e-05:  64%|██████▍   | 3285/5098 [25:14<12:24,  2.44it/s][A
epoch 2 iter 3285: train loss 0.19687. lr 6.000000e-05:  64%|██████▍   | 3286/5098 [25:14<12:40,  2.38it/s][A
epoch 2 iter 3286: train loss 0.19973. lr 6.000000e-05:  64%|██████▍   | 3286/5098 [25:14<12:40,  2.38it/s][A
epoch 2 iter 3286: train loss 0.19973. lr 6.000000e-05:  64%|██████▍   | 3287/5098 [25:14<12:58,  2.33it/s][A
epoch 2 iter 3287: train loss 0.19342. lr 6.000000e-05:  64%|██████▍   | 3287/5098 [25:14<12:58,  2.33it/s][A
epoch 2 iter 3287: train loss 0.19342. lr 6.000000e-05:  64%|██████▍   | 3288/5098 [25:14<12:40,  2.38it/s][A
epoch 2 iter 3288: train loss 0.19517. lr 6.000000e-05:  64%|██████▍   | 3288/5098 [25:15<12:40,  2.38it/s][A

epoch 2 iter 3320: train loss 0.19613. lr 6.000000e-05:  65%|██████▌   | 3321/5098 [25:29<12:31,  2.36it/s][A
epoch 2 iter 3321: train loss 0.19270. lr 6.000000e-05:  65%|██████▌   | 3321/5098 [25:30<12:31,  2.36it/s][A
epoch 2 iter 3321: train loss 0.19270. lr 6.000000e-05:  65%|██████▌   | 3322/5098 [25:30<12:39,  2.34it/s][A
epoch 2 iter 3322: train loss 0.19547. lr 6.000000e-05:  65%|██████▌   | 3322/5098 [25:30<12:39,  2.34it/s][A
epoch 2 iter 3322: train loss 0.19547. lr 6.000000e-05:  65%|██████▌   | 3323/5098 [25:30<12:16,  2.41it/s][A
epoch 2 iter 3323: train loss 0.19654. lr 6.000000e-05:  65%|██████▌   | 3323/5098 [25:30<12:16,  2.41it/s][A
epoch 2 iter 3323: train loss 0.19654. lr 6.000000e-05:  65%|██████▌   | 3324/5098 [25:30<12:07,  2.44it/s][A
epoch 2 iter 3324: train loss 0.19660. lr 6.000000e-05:  65%|██████▌   | 3324/5098 [25:31<12:07,  2.44it/s][A
epoch 2 iter 3324: train loss 0.19660. lr 6.000000e-05:  65%|██████▌   | 3325/5098 [25:31<12:50,  2.30it/s][A

epoch 2 iter 3357: train loss 0.19427. lr 6.000000e-05:  66%|██████▌   | 3357/5098 [25:46<11:54,  2.44it/s][A
epoch 2 iter 3357: train loss 0.19427. lr 6.000000e-05:  66%|██████▌   | 3358/5098 [25:46<11:48,  2.45it/s][A
epoch 2 iter 3358: train loss 0.19651. lr 6.000000e-05:  66%|██████▌   | 3358/5098 [25:47<11:48,  2.45it/s][A
epoch 2 iter 3358: train loss 0.19651. lr 6.000000e-05:  66%|██████▌   | 3359/5098 [25:47<11:44,  2.47it/s][A
epoch 2 iter 3359: train loss 0.19571. lr 6.000000e-05:  66%|██████▌   | 3359/5098 [25:47<11:44,  2.47it/s][A
epoch 2 iter 3359: train loss 0.19571. lr 6.000000e-05:  66%|██████▌   | 3360/5098 [25:47<11:42,  2.47it/s][A
epoch 2 iter 3360: train loss 0.19591. lr 6.000000e-05:  66%|██████▌   | 3360/5098 [25:48<11:42,  2.47it/s][A
epoch 2 iter 3360: train loss 0.19591. lr 6.000000e-05:  66%|██████▌   | 3361/5098 [25:48<11:38,  2.49it/s][A
epoch 2 iter 3361: train loss 0.19268. lr 6.000000e-05:  66%|██████▌   | 3361/5098 [25:48<11:38,  2.49it/s][A

epoch 2 iter 3393: train loss 0.19114. lr 6.000000e-05:  67%|██████▋   | 3394/5098 [26:03<13:58,  2.03it/s][A
epoch 2 iter 3394: train loss 0.19382. lr 6.000000e-05:  67%|██████▋   | 3394/5098 [26:04<13:58,  2.03it/s][A
epoch 2 iter 3394: train loss 0.19382. lr 6.000000e-05:  67%|██████▋   | 3395/5098 [26:04<13:35,  2.09it/s][A
epoch 2 iter 3395: train loss 0.19596. lr 6.000000e-05:  67%|██████▋   | 3395/5098 [26:04<13:35,  2.09it/s][A
epoch 2 iter 3395: train loss 0.19596. lr 6.000000e-05:  67%|██████▋   | 3396/5098 [26:04<13:13,  2.14it/s][A
epoch 2 iter 3396: train loss 0.19570. lr 6.000000e-05:  67%|██████▋   | 3396/5098 [26:05<13:13,  2.14it/s][A
epoch 2 iter 3396: train loss 0.19570. lr 6.000000e-05:  67%|██████▋   | 3397/5098 [26:05<12:59,  2.18it/s][A
epoch 2 iter 3397: train loss 0.19623. lr 6.000000e-05:  67%|██████▋   | 3397/5098 [26:05<12:59,  2.18it/s][A
epoch 2 iter 3397: train loss 0.19623. lr 6.000000e-05:  67%|██████▋   | 3398/5098 [26:05<12:48,  2.21it/s][A

epoch 2 iter 3430: train loss 0.19371. lr 6.000000e-05:  67%|██████▋   | 3430/5098 [26:19<11:43,  2.37it/s][A
epoch 2 iter 3430: train loss 0.19371. lr 6.000000e-05:  67%|██████▋   | 3431/5098 [26:19<11:37,  2.39it/s][A
epoch 2 iter 3431: train loss 0.19265. lr 6.000000e-05:  67%|██████▋   | 3431/5098 [26:20<11:37,  2.39it/s][A
epoch 2 iter 3431: train loss 0.19265. lr 6.000000e-05:  67%|██████▋   | 3432/5098 [26:20<11:31,  2.41it/s][A
epoch 2 iter 3432: train loss 0.19269. lr 6.000000e-05:  67%|██████▋   | 3432/5098 [26:20<11:31,  2.41it/s][A
epoch 2 iter 3432: train loss 0.19269. lr 6.000000e-05:  67%|██████▋   | 3433/5098 [26:20<11:29,  2.42it/s][A
epoch 2 iter 3433: train loss 0.19654. lr 6.000000e-05:  67%|██████▋   | 3433/5098 [26:20<11:29,  2.42it/s][A
epoch 2 iter 3433: train loss 0.19654. lr 6.000000e-05:  67%|██████▋   | 3434/5098 [26:20<11:25,  2.43it/s][A
epoch 2 iter 3434: train loss 0.19512. lr 6.000000e-05:  67%|██████▋   | 3434/5098 [26:21<11:25,  2.43it/s][A

epoch 2 iter 3466: train loss 0.19284. lr 6.000000e-05:  68%|██████▊   | 3467/5098 [26:34<12:13,  2.22it/s][A
epoch 2 iter 3467: train loss 0.19223. lr 6.000000e-05:  68%|██████▊   | 3467/5098 [26:35<12:13,  2.22it/s][A
epoch 2 iter 3467: train loss 0.19223. lr 6.000000e-05:  68%|██████▊   | 3468/5098 [26:35<12:17,  2.21it/s][A
epoch 2 iter 3468: train loss 0.19724. lr 6.000000e-05:  68%|██████▊   | 3468/5098 [26:35<12:17,  2.21it/s][A
epoch 2 iter 3468: train loss 0.19724. lr 6.000000e-05:  68%|██████▊   | 3469/5098 [26:35<11:48,  2.30it/s][A
epoch 2 iter 3469: train loss 0.19437. lr 6.000000e-05:  68%|██████▊   | 3469/5098 [26:35<11:48,  2.30it/s][A
epoch 2 iter 3469: train loss 0.19437. lr 6.000000e-05:  68%|██████▊   | 3470/5098 [26:35<11:23,  2.38it/s][A
epoch 2 iter 3470: train loss 0.19513. lr 6.000000e-05:  68%|██████▊   | 3470/5098 [26:36<11:23,  2.38it/s][A
epoch 2 iter 3470: train loss 0.19513. lr 6.000000e-05:  68%|██████▊   | 3471/5098 [26:36<11:33,  2.35it/s][A

epoch 2 iter 3503: train loss 0.19283. lr 6.000000e-05:  69%|██████▊   | 3503/5098 [26:50<11:43,  2.27it/s][A
epoch 2 iter 3503: train loss 0.19283. lr 6.000000e-05:  69%|██████▊   | 3504/5098 [26:50<13:07,  2.02it/s][A
epoch 2 iter 3504: train loss 0.19341. lr 6.000000e-05:  69%|██████▊   | 3504/5098 [26:51<13:07,  2.02it/s][A
epoch 2 iter 3504: train loss 0.19341. lr 6.000000e-05:  69%|██████▉   | 3505/5098 [26:51<12:28,  2.13it/s][A
epoch 2 iter 3505: train loss 0.19296. lr 6.000000e-05:  69%|██████▉   | 3505/5098 [26:51<12:28,  2.13it/s][A
epoch 2 iter 3505: train loss 0.19296. lr 6.000000e-05:  69%|██████▉   | 3506/5098 [26:51<12:03,  2.20it/s][A
epoch 2 iter 3506: train loss 0.19339. lr 6.000000e-05:  69%|██████▉   | 3506/5098 [26:52<12:03,  2.20it/s][A
epoch 2 iter 3506: train loss 0.19339. lr 6.000000e-05:  69%|██████▉   | 3507/5098 [26:52<11:34,  2.29it/s][A
epoch 2 iter 3507: train loss 0.18981. lr 6.000000e-05:  69%|██████▉   | 3507/5098 [26:52<11:34,  2.29it/s][A

epoch 2 iter 3539: train loss 0.19414. lr 6.000000e-05:  69%|██████▉   | 3540/5098 [27:09<11:40,  2.22it/s][A
epoch 2 iter 3540: train loss 0.19642. lr 6.000000e-05:  69%|██████▉   | 3540/5098 [27:09<11:40,  2.22it/s][A
epoch 2 iter 3540: train loss 0.19642. lr 6.000000e-05:  69%|██████▉   | 3541/5098 [27:09<11:50,  2.19it/s][A
epoch 2 iter 3541: train loss 0.19528. lr 6.000000e-05:  69%|██████▉   | 3541/5098 [27:10<11:50,  2.19it/s][A
epoch 2 iter 3541: train loss 0.19528. lr 6.000000e-05:  69%|██████▉   | 3542/5098 [27:10<11:58,  2.17it/s][A
epoch 2 iter 3542: train loss 0.19012. lr 6.000000e-05:  69%|██████▉   | 3542/5098 [27:10<11:58,  2.17it/s][A
epoch 2 iter 3542: train loss 0.19012. lr 6.000000e-05:  69%|██████▉   | 3543/5098 [27:10<11:24,  2.27it/s][A
epoch 2 iter 3543: train loss 0.19050. lr 6.000000e-05:  69%|██████▉   | 3543/5098 [27:11<11:24,  2.27it/s][A
epoch 2 iter 3543: train loss 0.19050. lr 6.000000e-05:  70%|██████▉   | 3544/5098 [27:11<10:57,  2.36it/s][A

epoch 2 iter 3576: train loss 0.19509. lr 6.000000e-05:  70%|███████   | 3576/5098 [27:27<12:43,  1.99it/s][A
epoch 2 iter 3576: train loss 0.19509. lr 6.000000e-05:  70%|███████   | 3577/5098 [27:27<12:36,  2.01it/s][A
epoch 2 iter 3577: train loss 0.19257. lr 6.000000e-05:  70%|███████   | 3577/5098 [27:28<12:36,  2.01it/s][A
epoch 2 iter 3577: train loss 0.19257. lr 6.000000e-05:  70%|███████   | 3578/5098 [27:28<12:39,  2.00it/s][A
epoch 2 iter 3578: train loss 0.19518. lr 6.000000e-05:  70%|███████   | 3578/5098 [27:28<12:39,  2.00it/s][A
epoch 2 iter 3578: train loss 0.19518. lr 6.000000e-05:  70%|███████   | 3579/5098 [27:28<12:23,  2.04it/s][A
epoch 2 iter 3579: train loss 0.19118. lr 6.000000e-05:  70%|███████   | 3579/5098 [27:29<12:23,  2.04it/s][A
epoch 2 iter 3579: train loss 0.19118. lr 6.000000e-05:  70%|███████   | 3580/5098 [27:29<12:13,  2.07it/s][A
epoch 2 iter 3580: train loss 0.19233. lr 6.000000e-05:  70%|███████   | 3580/5098 [27:29<12:13,  2.07it/s][A

epoch 2 iter 3612: train loss 0.19073. lr 6.000000e-05:  71%|███████   | 3613/5098 [27:43<10:27,  2.37it/s][A
epoch 2 iter 3613: train loss 0.19483. lr 6.000000e-05:  71%|███████   | 3613/5098 [27:43<10:27,  2.37it/s][A
epoch 2 iter 3613: train loss 0.19483. lr 6.000000e-05:  71%|███████   | 3614/5098 [27:43<10:32,  2.34it/s][A
epoch 2 iter 3614: train loss 0.19086. lr 6.000000e-05:  71%|███████   | 3614/5098 [27:44<10:32,  2.34it/s][A
epoch 2 iter 3614: train loss 0.19086. lr 6.000000e-05:  71%|███████   | 3615/5098 [27:44<11:02,  2.24it/s][A
epoch 2 iter 3615: train loss 0.19262. lr 6.000000e-05:  71%|███████   | 3615/5098 [27:44<11:02,  2.24it/s][A
epoch 2 iter 3615: train loss 0.19262. lr 6.000000e-05:  71%|███████   | 3616/5098 [27:44<10:42,  2.31it/s][A
epoch 2 iter 3616: train loss 0.19546. lr 6.000000e-05:  71%|███████   | 3616/5098 [27:44<10:42,  2.31it/s][A
epoch 2 iter 3616: train loss 0.19546. lr 6.000000e-05:  71%|███████   | 3617/5098 [27:44<10:22,  2.38it/s][A

epoch 2 iter 3649: train loss 0.19497. lr 6.000000e-05:  72%|███████▏  | 3649/5098 [28:02<13:52,  1.74it/s][A
epoch 2 iter 3649: train loss 0.19497. lr 6.000000e-05:  72%|███████▏  | 3650/5098 [28:02<13:34,  1.78it/s][A
epoch 2 iter 3650: train loss 0.19548. lr 6.000000e-05:  72%|███████▏  | 3650/5098 [28:02<13:34,  1.78it/s][A
epoch 2 iter 3650: train loss 0.19548. lr 6.000000e-05:  72%|███████▏  | 3651/5098 [28:02<13:31,  1.78it/s][A
epoch 2 iter 3651: train loss 0.19272. lr 6.000000e-05:  72%|███████▏  | 3651/5098 [28:03<13:31,  1.78it/s][A
epoch 2 iter 3651: train loss 0.19272. lr 6.000000e-05:  72%|███████▏  | 3652/5098 [28:03<13:06,  1.84it/s][A
epoch 2 iter 3652: train loss 0.19174. lr 6.000000e-05:  72%|███████▏  | 3652/5098 [28:03<13:06,  1.84it/s][A
epoch 2 iter 3652: train loss 0.19174. lr 6.000000e-05:  72%|███████▏  | 3653/5098 [28:03<12:31,  1.92it/s][A
epoch 2 iter 3653: train loss 0.19280. lr 6.000000e-05:  72%|███████▏  | 3653/5098 [28:04<12:31,  1.92it/s][A

epoch 2 iter 3685: train loss 0.19028. lr 6.000000e-05:  72%|███████▏  | 3686/5098 [28:19<11:37,  2.03it/s][A
epoch 2 iter 3686: train loss 0.19234. lr 6.000000e-05:  72%|███████▏  | 3686/5098 [28:19<11:37,  2.03it/s][A
epoch 2 iter 3686: train loss 0.19234. lr 6.000000e-05:  72%|███████▏  | 3687/5098 [28:19<11:22,  2.07it/s][A
epoch 2 iter 3687: train loss 0.19354. lr 6.000000e-05:  72%|███████▏  | 3687/5098 [28:20<11:22,  2.07it/s][A
epoch 2 iter 3687: train loss 0.19354. lr 6.000000e-05:  72%|███████▏  | 3688/5098 [28:20<11:11,  2.10it/s][A
epoch 2 iter 3688: train loss 0.18991. lr 6.000000e-05:  72%|███████▏  | 3688/5098 [28:20<11:11,  2.10it/s][A
epoch 2 iter 3688: train loss 0.18991. lr 6.000000e-05:  72%|███████▏  | 3689/5098 [28:20<11:03,  2.12it/s][A
epoch 2 iter 3689: train loss 0.18910. lr 6.000000e-05:  72%|███████▏  | 3689/5098 [28:20<11:03,  2.12it/s][A
epoch 2 iter 3689: train loss 0.18910. lr 6.000000e-05:  72%|███████▏  | 3690/5098 [28:20<10:52,  2.16it/s][A

epoch 2 iter 3722: train loss 0.19581. lr 6.000000e-05:  73%|███████▎  | 3722/5098 [28:35<08:57,  2.56it/s][A
epoch 2 iter 3722: train loss 0.19581. lr 6.000000e-05:  73%|███████▎  | 3723/5098 [28:35<09:17,  2.47it/s][A
epoch 2 iter 3723: train loss 0.19650. lr 6.000000e-05:  73%|███████▎  | 3723/5098 [28:35<09:17,  2.47it/s][A
epoch 2 iter 3723: train loss 0.19650. lr 6.000000e-05:  73%|███████▎  | 3724/5098 [28:35<09:45,  2.34it/s][A
epoch 2 iter 3724: train loss 0.19517. lr 6.000000e-05:  73%|███████▎  | 3724/5098 [28:36<09:45,  2.34it/s][A
epoch 2 iter 3724: train loss 0.19517. lr 6.000000e-05:  73%|███████▎  | 3725/5098 [28:36<10:28,  2.18it/s][A
epoch 2 iter 3725: train loss 0.19147. lr 6.000000e-05:  73%|███████▎  | 3725/5098 [28:36<10:28,  2.18it/s][A
epoch 2 iter 3725: train loss 0.19147. lr 6.000000e-05:  73%|███████▎  | 3726/5098 [28:36<10:35,  2.16it/s][A
epoch 2 iter 3726: train loss 0.19408. lr 6.000000e-05:  73%|███████▎  | 3726/5098 [28:37<10:35,  2.16it/s][A

epoch 2 iter 3758: train loss 0.19334. lr 6.000000e-05:  74%|███████▎  | 3759/5098 [28:53<14:58,  1.49it/s][A
epoch 2 iter 3759: train loss 0.19239. lr 6.000000e-05:  74%|███████▎  | 3759/5098 [28:54<14:58,  1.49it/s][A
epoch 2 iter 3759: train loss 0.19239. lr 6.000000e-05:  74%|███████▍  | 3760/5098 [28:54<14:16,  1.56it/s][A
epoch 2 iter 3760: train loss 0.18942. lr 6.000000e-05:  74%|███████▍  | 3760/5098 [28:54<14:16,  1.56it/s][A
epoch 2 iter 3760: train loss 0.18942. lr 6.000000e-05:  74%|███████▍  | 3761/5098 [28:54<13:38,  1.63it/s][A
epoch 2 iter 3761: train loss 0.19324. lr 6.000000e-05:  74%|███████▍  | 3761/5098 [28:55<13:38,  1.63it/s][A
epoch 2 iter 3761: train loss 0.19324. lr 6.000000e-05:  74%|███████▍  | 3762/5098 [28:55<12:48,  1.74it/s][A
epoch 2 iter 3762: train loss 0.19415. lr 6.000000e-05:  74%|███████▍  | 3762/5098 [28:55<12:48,  1.74it/s][A
epoch 2 iter 3762: train loss 0.19415. lr 6.000000e-05:  74%|███████▍  | 3763/5098 [28:55<12:12,  1.82it/s][A

epoch 2 iter 3795: train loss 0.19457. lr 6.000000e-05:  74%|███████▍  | 3795/5098 [29:09<08:27,  2.57it/s][A
epoch 2 iter 3795: train loss 0.19457. lr 6.000000e-05:  74%|███████▍  | 3796/5098 [29:09<09:14,  2.35it/s][A
epoch 2 iter 3796: train loss 0.18830. lr 6.000000e-05:  74%|███████▍  | 3796/5098 [29:09<09:14,  2.35it/s][A
epoch 2 iter 3796: train loss 0.18830. lr 6.000000e-05:  74%|███████▍  | 3797/5098 [29:09<09:09,  2.37it/s][A
epoch 2 iter 3797: train loss 0.19294. lr 6.000000e-05:  74%|███████▍  | 3797/5098 [29:10<09:09,  2.37it/s][A
epoch 2 iter 3797: train loss 0.19294. lr 6.000000e-05:  74%|███████▍  | 3798/5098 [29:10<08:52,  2.44it/s][A
epoch 2 iter 3798: train loss 0.19311. lr 6.000000e-05:  74%|███████▍  | 3798/5098 [29:10<08:52,  2.44it/s][A
epoch 2 iter 3798: train loss 0.19311. lr 6.000000e-05:  75%|███████▍  | 3799/5098 [29:10<09:08,  2.37it/s][A
epoch 2 iter 3799: train loss 0.19130. lr 6.000000e-05:  75%|███████▍  | 3799/5098 [29:11<09:08,  2.37it/s][A

epoch 2 iter 3831: train loss 0.19164. lr 6.000000e-05:  75%|███████▌  | 3832/5098 [29:26<08:42,  2.42it/s][A
epoch 2 iter 3832: train loss 0.19247. lr 6.000000e-05:  75%|███████▌  | 3832/5098 [29:26<08:42,  2.42it/s][A
epoch 2 iter 3832: train loss 0.19247. lr 6.000000e-05:  75%|███████▌  | 3833/5098 [29:26<08:49,  2.39it/s][A
epoch 2 iter 3833: train loss 0.19611. lr 6.000000e-05:  75%|███████▌  | 3833/5098 [29:27<08:49,  2.39it/s][A
epoch 2 iter 3833: train loss 0.19611. lr 6.000000e-05:  75%|███████▌  | 3834/5098 [29:27<08:58,  2.35it/s][A
epoch 2 iter 3834: train loss 0.19283. lr 6.000000e-05:  75%|███████▌  | 3834/5098 [29:27<08:58,  2.35it/s][A
epoch 2 iter 3834: train loss 0.19283. lr 6.000000e-05:  75%|███████▌  | 3835/5098 [29:27<08:42,  2.42it/s][A
epoch 2 iter 3835: train loss 0.19228. lr 6.000000e-05:  75%|███████▌  | 3835/5098 [29:28<08:42,  2.42it/s][A
epoch 2 iter 3835: train loss 0.19228. lr 6.000000e-05:  75%|███████▌  | 3836/5098 [29:28<08:32,  2.46it/s][A

epoch 2 iter 3868: train loss 0.18943. lr 6.000000e-05:  76%|███████▌  | 3868/5098 [29:42<08:41,  2.36it/s][A
epoch 2 iter 3868: train loss 0.18943. lr 6.000000e-05:  76%|███████▌  | 3869/5098 [29:42<08:34,  2.39it/s][A
epoch 2 iter 3869: train loss 0.19100. lr 6.000000e-05:  76%|███████▌  | 3869/5098 [29:42<08:34,  2.39it/s][A
epoch 2 iter 3869: train loss 0.19100. lr 6.000000e-05:  76%|███████▌  | 3870/5098 [29:42<08:21,  2.45it/s][A
epoch 2 iter 3870: train loss 0.19201. lr 6.000000e-05:  76%|███████▌  | 3870/5098 [29:43<08:21,  2.45it/s][A
epoch 2 iter 3870: train loss 0.19201. lr 6.000000e-05:  76%|███████▌  | 3871/5098 [29:43<08:05,  2.53it/s][A
epoch 2 iter 3871: train loss 0.19366. lr 6.000000e-05:  76%|███████▌  | 3871/5098 [29:43<08:05,  2.53it/s][A
epoch 2 iter 3871: train loss 0.19366. lr 6.000000e-05:  76%|███████▌  | 3872/5098 [29:43<08:07,  2.52it/s][A
epoch 2 iter 3872: train loss 0.19592. lr 6.000000e-05:  76%|███████▌  | 3872/5098 [29:44<08:07,  2.52it/s][A

epoch 2 iter 3904: train loss 0.19216. lr 6.000000e-05:  77%|███████▋  | 3905/5098 [29:59<08:16,  2.40it/s][A
epoch 2 iter 3905: train loss 0.19182. lr 6.000000e-05:  77%|███████▋  | 3905/5098 [30:00<08:16,  2.40it/s][A
epoch 2 iter 3905: train loss 0.19182. lr 6.000000e-05:  77%|███████▋  | 3906/5098 [30:00<08:35,  2.31it/s][A
epoch 2 iter 3906: train loss 0.19176. lr 6.000000e-05:  77%|███████▋  | 3906/5098 [30:00<08:35,  2.31it/s][A
epoch 2 iter 3906: train loss 0.19176. lr 6.000000e-05:  77%|███████▋  | 3907/5098 [30:00<08:16,  2.40it/s][A
epoch 2 iter 3907: train loss 0.19142. lr 6.000000e-05:  77%|███████▋  | 3907/5098 [30:00<08:16,  2.40it/s][A
epoch 2 iter 3907: train loss 0.19142. lr 6.000000e-05:  77%|███████▋  | 3908/5098 [30:00<08:03,  2.46it/s][A
epoch 2 iter 3908: train loss 0.19207. lr 6.000000e-05:  77%|███████▋  | 3908/5098 [30:01<08:03,  2.46it/s][A
epoch 2 iter 3908: train loss 0.19207. lr 6.000000e-05:  77%|███████▋  | 3909/5098 [30:01<07:47,  2.54it/s][A

epoch 2 iter 3941: train loss 0.19380. lr 6.000000e-05:  77%|███████▋  | 3941/5098 [30:17<12:55,  1.49it/s][A
epoch 2 iter 3941: train loss 0.19380. lr 6.000000e-05:  77%|███████▋  | 3942/5098 [30:17<12:52,  1.50it/s][A
epoch 2 iter 3942: train loss 0.19157. lr 6.000000e-05:  77%|███████▋  | 3942/5098 [30:18<12:52,  1.50it/s][A
epoch 2 iter 3942: train loss 0.19157. lr 6.000000e-05:  77%|███████▋  | 3943/5098 [30:18<12:26,  1.55it/s][A
epoch 2 iter 3943: train loss 0.19240. lr 6.000000e-05:  77%|███████▋  | 3943/5098 [30:18<12:26,  1.55it/s][A
epoch 2 iter 3943: train loss 0.19240. lr 6.000000e-05:  77%|███████▋  | 3944/5098 [30:18<11:51,  1.62it/s][A
epoch 2 iter 3944: train loss 0.19246. lr 6.000000e-05:  77%|███████▋  | 3944/5098 [30:19<11:51,  1.62it/s][A
epoch 2 iter 3944: train loss 0.19246. lr 6.000000e-05:  77%|███████▋  | 3945/5098 [30:19<11:25,  1.68it/s][A
epoch 2 iter 3945: train loss 0.19250. lr 6.000000e-05:  77%|███████▋  | 3945/5098 [30:19<11:25,  1.68it/s][A

epoch 2 iter 3977: train loss 0.19240. lr 6.000000e-05:  78%|███████▊  | 3978/5098 [30:34<09:33,  1.95it/s][A
epoch 2 iter 3978: train loss 0.19246. lr 6.000000e-05:  78%|███████▊  | 3978/5098 [30:35<09:33,  1.95it/s][A
epoch 2 iter 3978: train loss 0.19246. lr 6.000000e-05:  78%|███████▊  | 3979/5098 [30:35<09:20,  2.00it/s][A
epoch 2 iter 3979: train loss 0.18705. lr 6.000000e-05:  78%|███████▊  | 3979/5098 [30:35<09:20,  2.00it/s][A
epoch 2 iter 3979: train loss 0.18705. lr 6.000000e-05:  78%|███████▊  | 3980/5098 [30:35<09:06,  2.05it/s][A
epoch 2 iter 3980: train loss 0.19181. lr 6.000000e-05:  78%|███████▊  | 3980/5098 [30:36<09:06,  2.05it/s][A
epoch 2 iter 3980: train loss 0.19181. lr 6.000000e-05:  78%|███████▊  | 3981/5098 [30:36<08:55,  2.09it/s][A
epoch 2 iter 3981: train loss 0.19374. lr 6.000000e-05:  78%|███████▊  | 3981/5098 [30:36<08:55,  2.09it/s][A
epoch 2 iter 3981: train loss 0.19374. lr 6.000000e-05:  78%|███████▊  | 3982/5098 [30:36<08:47,  2.11it/s][A

epoch 2 iter 4014: train loss 0.18901. lr 6.000000e-05:  79%|███████▊  | 4014/5098 [30:51<08:36,  2.10it/s][A
epoch 2 iter 4014: train loss 0.18901. lr 6.000000e-05:  79%|███████▉  | 4015/5098 [30:51<08:45,  2.06it/s][A
epoch 2 iter 4015: train loss 0.19486. lr 6.000000e-05:  79%|███████▉  | 4015/5098 [30:51<08:45,  2.06it/s][A
epoch 2 iter 4015: train loss 0.19486. lr 6.000000e-05:  79%|███████▉  | 4016/5098 [30:51<08:42,  2.07it/s][A
epoch 2 iter 4016: train loss 0.19297. lr 6.000000e-05:  79%|███████▉  | 4016/5098 [30:52<08:42,  2.07it/s][A
epoch 2 iter 4016: train loss 0.19297. lr 6.000000e-05:  79%|███████▉  | 4017/5098 [30:52<08:39,  2.08it/s][A
epoch 2 iter 4017: train loss 0.19304. lr 6.000000e-05:  79%|███████▉  | 4017/5098 [30:52<08:39,  2.08it/s][A
epoch 2 iter 4017: train loss 0.19304. lr 6.000000e-05:  79%|███████▉  | 4018/5098 [30:52<08:34,  2.10it/s][A
epoch 2 iter 4018: train loss 0.19346. lr 6.000000e-05:  79%|███████▉  | 4018/5098 [30:53<08:34,  2.10it/s][A

epoch 2 iter 4050: train loss 0.19297. lr 6.000000e-05:  79%|███████▉  | 4051/5098 [31:07<08:13,  2.12it/s][A
epoch 2 iter 4051: train loss 0.19009. lr 6.000000e-05:  79%|███████▉  | 4051/5098 [31:07<08:13,  2.12it/s][A
epoch 2 iter 4051: train loss 0.19009. lr 6.000000e-05:  79%|███████▉  | 4052/5098 [31:07<08:04,  2.16it/s][A
epoch 2 iter 4052: train loss 0.19181. lr 6.000000e-05:  79%|███████▉  | 4052/5098 [31:08<08:04,  2.16it/s][A
epoch 2 iter 4052: train loss 0.19181. lr 6.000000e-05:  80%|███████▉  | 4053/5098 [31:08<07:54,  2.20it/s][A
epoch 2 iter 4053: train loss 0.19135. lr 6.000000e-05:  80%|███████▉  | 4053/5098 [31:08<07:54,  2.20it/s][A
epoch 2 iter 4053: train loss 0.19135. lr 6.000000e-05:  80%|███████▉  | 4054/5098 [31:08<07:34,  2.30it/s][A
epoch 2 iter 4054: train loss 0.19118. lr 6.000000e-05:  80%|███████▉  | 4054/5098 [31:09<07:34,  2.30it/s][A
epoch 2 iter 4054: train loss 0.19118. lr 6.000000e-05:  80%|███████▉  | 4055/5098 [31:09<08:00,  2.17it/s][A

epoch 2 iter 4087: train loss 0.19265. lr 6.000000e-05:  80%|████████  | 4087/5098 [31:24<07:50,  2.15it/s][A
epoch 2 iter 4087: train loss 0.19265. lr 6.000000e-05:  80%|████████  | 4088/5098 [31:24<07:42,  2.18it/s][A
epoch 2 iter 4088: train loss 0.19233. lr 6.000000e-05:  80%|████████  | 4088/5098 [31:25<07:42,  2.18it/s][A
epoch 2 iter 4088: train loss 0.19233. lr 6.000000e-05:  80%|████████  | 4089/5098 [31:25<07:36,  2.21it/s][A
epoch 2 iter 4089: train loss 0.18966. lr 6.000000e-05:  80%|████████  | 4089/5098 [31:25<07:36,  2.21it/s][A
epoch 2 iter 4089: train loss 0.18966. lr 6.000000e-05:  80%|████████  | 4090/5098 [31:25<07:31,  2.23it/s][A
epoch 2 iter 4090: train loss 0.18935. lr 6.000000e-05:  80%|████████  | 4090/5098 [31:26<07:31,  2.23it/s][A
epoch 2 iter 4090: train loss 0.18935. lr 6.000000e-05:  80%|████████  | 4091/5098 [31:26<07:25,  2.26it/s][A
epoch 2 iter 4091: train loss 0.18939. lr 6.000000e-05:  80%|████████  | 4091/5098 [31:26<07:25,  2.26it/s][A

epoch 2 iter 4123: train loss 0.19195. lr 6.000000e-05:  81%|████████  | 4124/5098 [31:40<08:32,  1.90it/s][A
epoch 2 iter 4124: train loss 0.18922. lr 6.000000e-05:  81%|████████  | 4124/5098 [31:41<08:32,  1.90it/s][A
epoch 2 iter 4124: train loss 0.18922. lr 6.000000e-05:  81%|████████  | 4125/5098 [31:41<09:03,  1.79it/s][A
epoch 2 iter 4125: train loss 0.19295. lr 6.000000e-05:  81%|████████  | 4125/5098 [31:41<09:03,  1.79it/s][A
epoch 2 iter 4125: train loss 0.19295. lr 6.000000e-05:  81%|████████  | 4126/5098 [31:41<09:23,  1.72it/s][A
epoch 2 iter 4126: train loss 0.18879. lr 6.000000e-05:  81%|████████  | 4126/5098 [31:42<09:23,  1.72it/s][A
epoch 2 iter 4126: train loss 0.18879. lr 6.000000e-05:  81%|████████  | 4127/5098 [31:42<09:36,  1.68it/s][A
epoch 2 iter 4127: train loss 0.18875. lr 6.000000e-05:  81%|████████  | 4127/5098 [31:42<09:36,  1.68it/s][A
epoch 2 iter 4127: train loss 0.18875. lr 6.000000e-05:  81%|████████  | 4128/5098 [31:42<09:30,  1.70it/s][A

epoch 2 iter 4160: train loss 0.18982. lr 6.000000e-05:  82%|████████▏ | 4160/5098 [31:58<07:46,  2.01it/s][A
epoch 2 iter 4160: train loss 0.18982. lr 6.000000e-05:  82%|████████▏ | 4161/5098 [31:58<07:44,  2.02it/s][A
epoch 2 iter 4161: train loss 0.19187. lr 6.000000e-05:  82%|████████▏ | 4161/5098 [31:58<07:44,  2.02it/s][A
epoch 2 iter 4161: train loss 0.19187. lr 6.000000e-05:  82%|████████▏ | 4162/5098 [31:58<07:35,  2.06it/s][A
epoch 2 iter 4162: train loss 0.18916. lr 6.000000e-05:  82%|████████▏ | 4162/5098 [31:59<07:35,  2.06it/s][A
epoch 2 iter 4162: train loss 0.18916. lr 6.000000e-05:  82%|████████▏ | 4163/5098 [31:59<07:26,  2.10it/s][A
epoch 2 iter 4163: train loss 0.18784. lr 6.000000e-05:  82%|████████▏ | 4163/5098 [31:59<07:26,  2.10it/s][A
epoch 2 iter 4163: train loss 0.18784. lr 6.000000e-05:  82%|████████▏ | 4164/5098 [31:59<07:18,  2.13it/s][A
epoch 2 iter 4164: train loss 0.18928. lr 6.000000e-05:  82%|████████▏ | 4164/5098 [32:00<07:18,  2.13it/s][A

epoch 2 iter 4196: train loss 0.19200. lr 6.000000e-05:  82%|████████▏ | 4197/5098 [32:14<08:14,  1.82it/s][A
epoch 2 iter 4197: train loss 0.19131. lr 6.000000e-05:  82%|████████▏ | 4197/5098 [32:14<08:14,  1.82it/s][A
epoch 2 iter 4197: train loss 0.19131. lr 6.000000e-05:  82%|████████▏ | 4198/5098 [32:14<07:51,  1.91it/s][A
epoch 2 iter 4198: train loss 0.18561. lr 6.000000e-05:  82%|████████▏ | 4198/5098 [32:15<07:51,  1.91it/s][A
epoch 2 iter 4198: train loss 0.18561. lr 6.000000e-05:  82%|████████▏ | 4199/5098 [32:15<07:34,  1.98it/s][A
epoch 2 iter 4199: train loss 0.19177. lr 6.000000e-05:  82%|████████▏ | 4199/5098 [32:15<07:34,  1.98it/s][A
epoch 2 iter 4199: train loss 0.19177. lr 6.000000e-05:  82%|████████▏ | 4200/5098 [32:15<07:22,  2.03it/s][A
epoch 2 iter 4200: train loss 0.19003. lr 6.000000e-05:  82%|████████▏ | 4200/5098 [32:16<07:22,  2.03it/s][A
epoch 2 iter 4200: train loss 0.19003. lr 6.000000e-05:  82%|████████▏ | 4201/5098 [32:16<07:11,  2.08it/s][A

epoch 2 iter 4233: train loss 0.18897. lr 6.000000e-05:  83%|████████▎ | 4233/5098 [32:31<06:34,  2.19it/s][A
epoch 2 iter 4233: train loss 0.18897. lr 6.000000e-05:  83%|████████▎ | 4234/5098 [32:31<06:27,  2.23it/s][A
epoch 2 iter 4234: train loss 0.18858. lr 6.000000e-05:  83%|████████▎ | 4234/5098 [32:31<06:27,  2.23it/s][A
epoch 2 iter 4234: train loss 0.18858. lr 6.000000e-05:  83%|████████▎ | 4235/5098 [32:31<06:20,  2.27it/s][A
epoch 2 iter 4235: train loss 0.19075. lr 6.000000e-05:  83%|████████▎ | 4235/5098 [32:32<06:20,  2.27it/s][A
epoch 2 iter 4235: train loss 0.19075. lr 6.000000e-05:  83%|████████▎ | 4236/5098 [32:32<06:15,  2.29it/s][A
epoch 2 iter 4236: train loss 0.18783. lr 6.000000e-05:  83%|████████▎ | 4236/5098 [32:32<06:15,  2.29it/s][A
epoch 2 iter 4236: train loss 0.18783. lr 6.000000e-05:  83%|████████▎ | 4237/5098 [32:32<06:11,  2.32it/s][A
epoch 2 iter 4237: train loss 0.19117. lr 6.000000e-05:  83%|████████▎ | 4237/5098 [32:33<06:11,  2.32it/s][A

epoch 2 iter 4269: train loss 0.19162. lr 6.000000e-05:  84%|████████▍ | 4270/5098 [32:46<05:34,  2.47it/s][A
epoch 2 iter 4270: train loss 0.19030. lr 6.000000e-05:  84%|████████▍ | 4270/5098 [32:46<05:34,  2.47it/s][A
epoch 2 iter 4270: train loss 0.19030. lr 6.000000e-05:  84%|████████▍ | 4271/5098 [32:46<05:46,  2.39it/s][A
epoch 2 iter 4271: train loss 0.19326. lr 6.000000e-05:  84%|████████▍ | 4271/5098 [32:47<05:46,  2.39it/s][A
epoch 2 iter 4271: train loss 0.19326. lr 6.000000e-05:  84%|████████▍ | 4272/5098 [32:47<05:43,  2.41it/s][A
epoch 2 iter 4272: train loss 0.18786. lr 6.000000e-05:  84%|████████▍ | 4272/5098 [32:47<05:43,  2.41it/s][A
epoch 2 iter 4272: train loss 0.18786. lr 6.000000e-05:  84%|████████▍ | 4273/5098 [32:47<05:44,  2.39it/s][A
epoch 2 iter 4273: train loss 0.18998. lr 6.000000e-05:  84%|████████▍ | 4273/5098 [32:48<05:44,  2.39it/s][A
epoch 2 iter 4273: train loss 0.18998. lr 6.000000e-05:  84%|████████▍ | 4274/5098 [32:48<05:45,  2.38it/s][A

epoch 2 iter 4306: train loss 0.19137. lr 6.000000e-05:  84%|████████▍ | 4306/5098 [33:03<05:54,  2.23it/s][A
epoch 2 iter 4306: train loss 0.19137. lr 6.000000e-05:  84%|████████▍ | 4307/5098 [33:03<05:52,  2.25it/s][A
epoch 2 iter 4307: train loss 0.19166. lr 6.000000e-05:  84%|████████▍ | 4307/5098 [33:04<05:52,  2.25it/s][A
epoch 2 iter 4307: train loss 0.19166. lr 6.000000e-05:  85%|████████▍ | 4308/5098 [33:04<05:50,  2.25it/s][A
epoch 2 iter 4308: train loss 0.19006. lr 6.000000e-05:  85%|████████▍ | 4308/5098 [33:04<05:50,  2.25it/s][A
epoch 2 iter 4308: train loss 0.19006. lr 6.000000e-05:  85%|████████▍ | 4309/5098 [33:04<05:48,  2.26it/s][A
epoch 2 iter 4309: train loss 0.19242. lr 6.000000e-05:  85%|████████▍ | 4309/5098 [33:04<05:48,  2.26it/s][A
epoch 2 iter 4309: train loss 0.19242. lr 6.000000e-05:  85%|████████▍ | 4310/5098 [33:04<05:45,  2.28it/s][A
epoch 2 iter 4310: train loss 0.18927. lr 6.000000e-05:  85%|████████▍ | 4310/5098 [33:05<05:45,  2.28it/s][A

epoch 2 iter 4342: train loss 0.18790. lr 6.000000e-05:  85%|████████▌ | 4343/5098 [33:20<06:16,  2.00it/s][A
epoch 2 iter 4343: train loss 0.19125. lr 6.000000e-05:  85%|████████▌ | 4343/5098 [33:20<06:16,  2.00it/s][A
epoch 2 iter 4343: train loss 0.19125. lr 6.000000e-05:  85%|████████▌ | 4344/5098 [33:20<06:01,  2.09it/s][A
epoch 2 iter 4344: train loss 0.19164. lr 6.000000e-05:  85%|████████▌ | 4344/5098 [33:21<06:01,  2.09it/s][A
epoch 2 iter 4344: train loss 0.19164. lr 6.000000e-05:  85%|████████▌ | 4345/5098 [33:21<06:02,  2.08it/s][A
epoch 2 iter 4345: train loss 0.18962. lr 6.000000e-05:  85%|████████▌ | 4345/5098 [33:21<06:02,  2.08it/s][A
epoch 2 iter 4345: train loss 0.18962. lr 6.000000e-05:  85%|████████▌ | 4346/5098 [33:21<05:48,  2.16it/s][A
epoch 2 iter 4346: train loss 0.18747. lr 6.000000e-05:  85%|████████▌ | 4346/5098 [33:21<05:48,  2.16it/s][A
epoch 2 iter 4346: train loss 0.18747. lr 6.000000e-05:  85%|████████▌ | 4347/5098 [33:21<05:35,  2.24it/s][A

epoch 2 iter 4379: train loss 0.19157. lr 6.000000e-05:  86%|████████▌ | 4379/5098 [33:36<06:33,  1.83it/s][A
epoch 2 iter 4379: train loss 0.19157. lr 6.000000e-05:  86%|████████▌ | 4380/5098 [33:36<06:31,  1.84it/s][A
epoch 2 iter 4380: train loss 0.18618. lr 6.000000e-05:  86%|████████▌ | 4380/5098 [33:37<06:31,  1.84it/s][A
epoch 2 iter 4380: train loss 0.18618. lr 6.000000e-05:  86%|████████▌ | 4381/5098 [33:37<06:27,  1.85it/s][A
epoch 2 iter 4381: train loss 0.19126. lr 6.000000e-05:  86%|████████▌ | 4381/5098 [33:37<06:27,  1.85it/s][A
epoch 2 iter 4381: train loss 0.19126. lr 6.000000e-05:  86%|████████▌ | 4382/5098 [33:37<06:16,  1.90it/s][A
epoch 2 iter 4382: train loss 0.18792. lr 6.000000e-05:  86%|████████▌ | 4382/5098 [33:38<06:16,  1.90it/s][A
epoch 2 iter 4382: train loss 0.18792. lr 6.000000e-05:  86%|████████▌ | 4383/5098 [33:38<06:03,  1.96it/s][A
epoch 2 iter 4383: train loss 0.18850. lr 6.000000e-05:  86%|████████▌ | 4383/5098 [33:38<06:03,  1.96it/s][A

epoch 2 iter 4415: train loss 0.18845. lr 6.000000e-05:  87%|████████▋ | 4416/5098 [33:52<04:51,  2.34it/s][A
epoch 2 iter 4416: train loss 0.18992. lr 6.000000e-05:  87%|████████▋ | 4416/5098 [33:53<04:51,  2.34it/s][A
epoch 2 iter 4416: train loss 0.18992. lr 6.000000e-05:  87%|████████▋ | 4417/5098 [33:53<04:46,  2.37it/s][A
epoch 2 iter 4417: train loss 0.18980. lr 6.000000e-05:  87%|████████▋ | 4417/5098 [33:53<04:46,  2.37it/s][A
epoch 2 iter 4417: train loss 0.18980. lr 6.000000e-05:  87%|████████▋ | 4418/5098 [33:53<05:22,  2.11it/s][A
epoch 2 iter 4418: train loss 0.18680. lr 6.000000e-05:  87%|████████▋ | 4418/5098 [33:54<05:22,  2.11it/s][A
epoch 2 iter 4418: train loss 0.18680. lr 6.000000e-05:  87%|████████▋ | 4419/5098 [33:54<05:07,  2.21it/s][A
epoch 2 iter 4419: train loss 0.18679. lr 6.000000e-05:  87%|████████▋ | 4419/5098 [33:54<05:07,  2.21it/s][A
epoch 2 iter 4419: train loss 0.18679. lr 6.000000e-05:  87%|████████▋ | 4420/5098 [33:54<04:53,  2.31it/s][A

epoch 2 iter 4452: train loss 0.19162. lr 6.000000e-05:  87%|████████▋ | 4452/5098 [34:11<05:07,  2.10it/s][A
epoch 2 iter 4452: train loss 0.19162. lr 6.000000e-05:  87%|████████▋ | 4453/5098 [34:11<05:01,  2.14it/s][A
epoch 2 iter 4453: train loss 0.18978. lr 6.000000e-05:  87%|████████▋ | 4453/5098 [34:11<05:01,  2.14it/s][A
epoch 2 iter 4453: train loss 0.18978. lr 6.000000e-05:  87%|████████▋ | 4454/5098 [34:11<04:53,  2.19it/s][A
epoch 2 iter 4454: train loss 0.18795. lr 6.000000e-05:  87%|████████▋ | 4454/5098 [34:12<04:53,  2.19it/s][A
epoch 2 iter 4454: train loss 0.18795. lr 6.000000e-05:  87%|████████▋ | 4455/5098 [34:12<04:48,  2.23it/s][A
epoch 2 iter 4455: train loss 0.18767. lr 6.000000e-05:  87%|████████▋ | 4455/5098 [34:12<04:48,  2.23it/s][A
epoch 2 iter 4455: train loss 0.18767. lr 6.000000e-05:  87%|████████▋ | 4456/5098 [34:12<04:45,  2.25it/s][A
epoch 2 iter 4456: train loss 0.18701. lr 6.000000e-05:  87%|████████▋ | 4456/5098 [34:12<04:45,  2.25it/s][A

epoch 2 iter 4488: train loss 0.18889. lr 6.000000e-05:  88%|████████▊ | 4489/5098 [34:26<04:17,  2.37it/s][A
epoch 2 iter 4489: train loss 0.19083. lr 6.000000e-05:  88%|████████▊ | 4489/5098 [34:26<04:17,  2.37it/s][A
epoch 2 iter 4489: train loss 0.19083. lr 6.000000e-05:  88%|████████▊ | 4490/5098 [34:26<04:12,  2.41it/s][A
epoch 2 iter 4490: train loss 0.19312. lr 6.000000e-05:  88%|████████▊ | 4490/5098 [34:27<04:12,  2.41it/s][A
epoch 2 iter 4490: train loss 0.19312. lr 6.000000e-05:  88%|████████▊ | 4491/5098 [34:27<04:07,  2.45it/s][A
epoch 2 iter 4491: train loss 0.18691. lr 6.000000e-05:  88%|████████▊ | 4491/5098 [34:27<04:07,  2.45it/s][A
epoch 2 iter 4491: train loss 0.18691. lr 6.000000e-05:  88%|████████▊ | 4492/5098 [34:27<04:58,  2.03it/s][A
epoch 2 iter 4492: train loss 0.19082. lr 6.000000e-05:  88%|████████▊ | 4492/5098 [34:28<04:58,  2.03it/s][A
epoch 2 iter 4492: train loss 0.19082. lr 6.000000e-05:  88%|████████▊ | 4493/5098 [34:28<05:14,  1.92it/s][A

epoch 2 iter 4525: train loss 0.18937. lr 6.000000e-05:  89%|████████▉ | 4525/5098 [34:43<03:42,  2.57it/s][A
epoch 2 iter 4525: train loss 0.18937. lr 6.000000e-05:  89%|████████▉ | 4526/5098 [34:43<03:45,  2.54it/s][A
epoch 2 iter 4526: train loss 0.19261. lr 6.000000e-05:  89%|████████▉ | 4526/5098 [34:44<03:45,  2.54it/s][A
epoch 2 iter 4526: train loss 0.19261. lr 6.000000e-05:  89%|████████▉ | 4527/5098 [34:44<03:50,  2.47it/s][A
epoch 2 iter 4527: train loss 0.19175. lr 6.000000e-05:  89%|████████▉ | 4527/5098 [34:44<03:50,  2.47it/s][A
epoch 2 iter 4527: train loss 0.19175. lr 6.000000e-05:  89%|████████▉ | 4528/5098 [34:44<03:54,  2.43it/s][A
epoch 2 iter 4528: train loss 0.19027. lr 6.000000e-05:  89%|████████▉ | 4528/5098 [34:45<03:54,  2.43it/s][A
epoch 2 iter 4528: train loss 0.19027. lr 6.000000e-05:  89%|████████▉ | 4529/5098 [34:45<03:58,  2.39it/s][A
epoch 2 iter 4529: train loss 0.18846. lr 6.000000e-05:  89%|████████▉ | 4529/5098 [34:45<03:58,  2.39it/s][A

epoch 2 iter 4561: train loss 0.18573. lr 6.000000e-05:  89%|████████▉ | 4562/5098 [34:59<04:03,  2.20it/s][A
epoch 2 iter 4562: train loss 0.18830. lr 6.000000e-05:  89%|████████▉ | 4562/5098 [35:00<04:03,  2.20it/s][A
epoch 2 iter 4562: train loss 0.18830. lr 6.000000e-05:  90%|████████▉ | 4563/5098 [35:00<04:43,  1.88it/s][A
epoch 2 iter 4563: train loss 0.18771. lr 6.000000e-05:  90%|████████▉ | 4563/5098 [35:00<04:43,  1.88it/s][A
epoch 2 iter 4563: train loss 0.18771. lr 6.000000e-05:  90%|████████▉ | 4564/5098 [35:00<05:03,  1.76it/s][A
epoch 2 iter 4564: train loss 0.18927. lr 6.000000e-05:  90%|████████▉ | 4564/5098 [35:01<05:03,  1.76it/s][A
epoch 2 iter 4564: train loss 0.18927. lr 6.000000e-05:  90%|████████▉ | 4565/5098 [35:01<05:17,  1.68it/s][A
epoch 2 iter 4565: train loss 0.19045. lr 6.000000e-05:  90%|████████▉ | 4565/5098 [35:02<05:17,  1.68it/s][A
epoch 2 iter 4565: train loss 0.19045. lr 6.000000e-05:  90%|████████▉ | 4566/5098 [35:02<05:35,  1.58it/s][A

epoch 2 iter 4598: train loss 0.19377. lr 6.000000e-05:  90%|█████████ | 4598/5098 [35:17<03:35,  2.32it/s][A
epoch 2 iter 4598: train loss 0.19377. lr 6.000000e-05:  90%|█████████ | 4599/5098 [35:17<03:33,  2.34it/s][A
epoch 2 iter 4599: train loss 0.19017. lr 6.000000e-05:  90%|█████████ | 4599/5098 [35:18<03:33,  2.34it/s][A
epoch 2 iter 4599: train loss 0.19017. lr 6.000000e-05:  90%|█████████ | 4600/5098 [35:18<03:32,  2.35it/s][A
epoch 2 iter 4600: train loss 0.18772. lr 6.000000e-05:  90%|█████████ | 4600/5098 [35:18<03:32,  2.35it/s][A
epoch 2 iter 4600: train loss 0.18772. lr 6.000000e-05:  90%|█████████ | 4601/5098 [35:18<03:31,  2.35it/s][A
epoch 2 iter 4601: train loss 0.19006. lr 6.000000e-05:  90%|█████████ | 4601/5098 [35:19<03:31,  2.35it/s][A
epoch 2 iter 4601: train loss 0.19006. lr 6.000000e-05:  90%|█████████ | 4602/5098 [35:19<03:30,  2.35it/s][A
epoch 2 iter 4602: train loss 0.19092. lr 6.000000e-05:  90%|█████████ | 4602/5098 [35:19<03:30,  2.35it/s][A

epoch 2 iter 4634: train loss 0.18766. lr 6.000000e-05:  91%|█████████ | 4635/5098 [35:33<03:27,  2.23it/s][A
epoch 2 iter 4635: train loss 0.18928. lr 6.000000e-05:  91%|█████████ | 4635/5098 [35:33<03:27,  2.23it/s][A
epoch 2 iter 4635: train loss 0.18928. lr 6.000000e-05:  91%|█████████ | 4636/5098 [35:33<03:27,  2.23it/s][A
epoch 2 iter 4636: train loss 0.18861. lr 6.000000e-05:  91%|█████████ | 4636/5098 [35:34<03:27,  2.23it/s][A
epoch 2 iter 4636: train loss 0.18861. lr 6.000000e-05:  91%|█████████ | 4637/5098 [35:34<03:25,  2.24it/s][A
epoch 2 iter 4637: train loss 0.18935. lr 6.000000e-05:  91%|█████████ | 4637/5098 [35:34<03:25,  2.24it/s][A
epoch 2 iter 4637: train loss 0.18935. lr 6.000000e-05:  91%|█████████ | 4638/5098 [35:34<03:45,  2.04it/s][A
epoch 2 iter 4638: train loss 0.18988. lr 6.000000e-05:  91%|█████████ | 4638/5098 [35:35<03:45,  2.04it/s][A
epoch 2 iter 4638: train loss 0.18988. lr 6.000000e-05:  91%|█████████ | 4639/5098 [35:35<03:58,  1.92it/s][A

epoch 2 iter 4671: train loss 0.18891. lr 6.000000e-05:  92%|█████████▏| 4671/5098 [35:50<03:02,  2.34it/s][A
epoch 2 iter 4671: train loss 0.18891. lr 6.000000e-05:  92%|█████████▏| 4672/5098 [35:50<03:00,  2.36it/s][A
epoch 2 iter 4672: train loss 0.18957. lr 6.000000e-05:  92%|█████████▏| 4672/5098 [35:50<03:00,  2.36it/s][A
epoch 2 iter 4672: train loss 0.18957. lr 6.000000e-05:  92%|█████████▏| 4673/5098 [35:50<02:58,  2.38it/s][A
epoch 2 iter 4673: train loss 0.19190. lr 6.000000e-05:  92%|█████████▏| 4673/5098 [35:51<02:58,  2.38it/s][A
epoch 2 iter 4673: train loss 0.19190. lr 6.000000e-05:  92%|█████████▏| 4674/5098 [35:51<02:57,  2.39it/s][A
epoch 2 iter 4674: train loss 0.19142. lr 6.000000e-05:  92%|█████████▏| 4674/5098 [35:51<02:57,  2.39it/s][A
epoch 2 iter 4674: train loss 0.19142. lr 6.000000e-05:  92%|█████████▏| 4675/5098 [35:51<02:56,  2.40it/s][A
epoch 2 iter 4675: train loss 0.19067. lr 6.000000e-05:  92%|█████████▏| 4675/5098 [35:51<02:56,  2.40it/s][A

epoch 2 iter 4707: train loss 0.18782. lr 6.000000e-05:  92%|█████████▏| 4708/5098 [36:05<02:40,  2.43it/s][A
epoch 2 iter 4708: train loss 0.18943. lr 6.000000e-05:  92%|█████████▏| 4708/5098 [36:05<02:40,  2.43it/s][A
epoch 2 iter 4708: train loss 0.18943. lr 6.000000e-05:  92%|█████████▏| 4709/5098 [36:05<02:38,  2.45it/s][A
epoch 2 iter 4709: train loss 0.19059. lr 6.000000e-05:  92%|█████████▏| 4709/5098 [36:05<02:38,  2.45it/s][A
epoch 2 iter 4709: train loss 0.19059. lr 6.000000e-05:  92%|█████████▏| 4710/5098 [36:05<02:37,  2.46it/s][A
epoch 2 iter 4710: train loss 0.19040. lr 6.000000e-05:  92%|█████████▏| 4710/5098 [36:06<02:37,  2.46it/s][A
epoch 2 iter 4710: train loss 0.19040. lr 6.000000e-05:  92%|█████████▏| 4711/5098 [36:06<02:38,  2.45it/s][A
epoch 2 iter 4711: train loss 0.18691. lr 6.000000e-05:  92%|█████████▏| 4711/5098 [36:06<02:38,  2.45it/s][A
epoch 2 iter 4711: train loss 0.18691. lr 6.000000e-05:  92%|█████████▏| 4712/5098 [36:06<02:39,  2.42it/s][A

epoch 2 iter 4744: train loss 0.18768. lr 6.000000e-05:  93%|█████████▎| 4744/5098 [36:21<02:33,  2.31it/s][A
epoch 2 iter 4744: train loss 0.18768. lr 6.000000e-05:  93%|█████████▎| 4745/5098 [36:21<02:31,  2.33it/s][A
epoch 2 iter 4745: train loss 0.18956. lr 6.000000e-05:  93%|█████████▎| 4745/5098 [36:22<02:31,  2.33it/s][A
epoch 2 iter 4745: train loss 0.18956. lr 6.000000e-05:  93%|█████████▎| 4746/5098 [36:22<02:30,  2.34it/s][A
epoch 2 iter 4746: train loss 0.19088. lr 6.000000e-05:  93%|█████████▎| 4746/5098 [36:22<02:30,  2.34it/s][A
epoch 2 iter 4746: train loss 0.19088. lr 6.000000e-05:  93%|█████████▎| 4747/5098 [36:22<02:29,  2.35it/s][A
epoch 2 iter 4747: train loss 0.18852. lr 6.000000e-05:  93%|█████████▎| 4747/5098 [36:23<02:29,  2.35it/s][A
epoch 2 iter 4747: train loss 0.18852. lr 6.000000e-05:  93%|█████████▎| 4748/5098 [36:23<02:28,  2.36it/s][A
epoch 2 iter 4748: train loss 0.18725. lr 6.000000e-05:  93%|█████████▎| 4748/5098 [36:23<02:28,  2.36it/s][A

epoch 2 iter 4780: train loss 0.19177. lr 6.000000e-05:  94%|█████████▍| 4781/5098 [36:37<02:29,  2.12it/s][A
epoch 2 iter 4781: train loss 0.18534. lr 6.000000e-05:  94%|█████████▍| 4781/5098 [36:38<02:29,  2.12it/s][A
epoch 2 iter 4781: train loss 0.18534. lr 6.000000e-05:  94%|█████████▍| 4782/5098 [36:38<02:27,  2.14it/s][A
epoch 2 iter 4782: train loss 0.18836. lr 6.000000e-05:  94%|█████████▍| 4782/5098 [36:38<02:27,  2.14it/s][A
epoch 2 iter 4782: train loss 0.18836. lr 6.000000e-05:  94%|█████████▍| 4783/5098 [36:38<02:26,  2.15it/s][A
epoch 2 iter 4783: train loss 0.18859. lr 6.000000e-05:  94%|█████████▍| 4783/5098 [36:39<02:26,  2.15it/s][A
epoch 2 iter 4783: train loss 0.18859. lr 6.000000e-05:  94%|█████████▍| 4784/5098 [36:39<02:25,  2.16it/s][A
epoch 2 iter 4784: train loss 0.19093. lr 6.000000e-05:  94%|█████████▍| 4784/5098 [36:39<02:25,  2.16it/s][A
epoch 2 iter 4784: train loss 0.19093. lr 6.000000e-05:  94%|█████████▍| 4785/5098 [36:39<02:24,  2.17it/s][A

epoch 2 iter 4817: train loss 0.18852. lr 6.000000e-05:  94%|█████████▍| 4817/5098 [36:54<02:00,  2.34it/s][A
epoch 2 iter 4817: train loss 0.18852. lr 6.000000e-05:  95%|█████████▍| 4818/5098 [36:54<01:58,  2.36it/s][A
epoch 2 iter 4818: train loss 0.18624. lr 6.000000e-05:  95%|█████████▍| 4818/5098 [36:55<01:58,  2.36it/s][A
epoch 2 iter 4818: train loss 0.18624. lr 6.000000e-05:  95%|█████████▍| 4819/5098 [36:55<01:57,  2.38it/s][A
epoch 2 iter 4819: train loss 0.18855. lr 6.000000e-05:  95%|█████████▍| 4819/5098 [36:55<01:57,  2.38it/s][A
epoch 2 iter 4819: train loss 0.18855. lr 6.000000e-05:  95%|█████████▍| 4820/5098 [36:55<01:56,  2.39it/s][A
epoch 2 iter 4820: train loss 0.18689. lr 6.000000e-05:  95%|█████████▍| 4820/5098 [36:55<01:56,  2.39it/s][A
epoch 2 iter 4820: train loss 0.18689. lr 6.000000e-05:  95%|█████████▍| 4821/5098 [36:55<01:55,  2.40it/s][A
epoch 2 iter 4821: train loss 0.18822. lr 6.000000e-05:  95%|█████████▍| 4821/5098 [36:56<01:55,  2.40it/s][A

epoch 2 iter 4853: train loss 0.18649. lr 6.000000e-05:  95%|█████████▌| 4854/5098 [37:11<02:41,  1.51it/s][A
epoch 2 iter 4854: train loss 0.19178. lr 6.000000e-05:  95%|█████████▌| 4854/5098 [37:12<02:41,  1.51it/s][A
epoch 2 iter 4854: train loss 0.19178. lr 6.000000e-05:  95%|█████████▌| 4855/5098 [37:12<02:24,  1.68it/s][A
epoch 2 iter 4855: train loss 0.18930. lr 6.000000e-05:  95%|█████████▌| 4855/5098 [37:12<02:24,  1.68it/s][A
epoch 2 iter 4855: train loss 0.18930. lr 6.000000e-05:  95%|█████████▌| 4856/5098 [37:12<02:12,  1.82it/s][A
epoch 2 iter 4856: train loss 0.18525. lr 6.000000e-05:  95%|█████████▌| 4856/5098 [37:12<02:12,  1.82it/s][A
epoch 2 iter 4856: train loss 0.18525. lr 6.000000e-05:  95%|█████████▌| 4857/5098 [37:12<02:04,  1.94it/s][A
epoch 2 iter 4857: train loss 0.19104. lr 6.000000e-05:  95%|█████████▌| 4857/5098 [37:13<02:04,  1.94it/s][A
epoch 2 iter 4857: train loss 0.19104. lr 6.000000e-05:  95%|█████████▌| 4858/5098 [37:13<01:58,  2.02it/s][A

epoch 2 iter 4890: train loss 0.18741. lr 6.000000e-05:  96%|█████████▌| 4890/5098 [37:27<01:47,  1.93it/s][A
epoch 2 iter 4890: train loss 0.18741. lr 6.000000e-05:  96%|█████████▌| 4891/5098 [37:27<01:47,  1.92it/s][A
epoch 2 iter 4891: train loss 0.18642. lr 6.000000e-05:  96%|█████████▌| 4891/5098 [37:28<01:47,  1.92it/s][A
epoch 2 iter 4891: train loss 0.18642. lr 6.000000e-05:  96%|█████████▌| 4892/5098 [37:28<01:45,  1.95it/s][A
epoch 2 iter 4892: train loss 0.18810. lr 6.000000e-05:  96%|█████████▌| 4892/5098 [37:28<01:45,  1.95it/s][A
epoch 2 iter 4892: train loss 0.18810. lr 6.000000e-05:  96%|█████████▌| 4893/5098 [37:28<01:44,  1.96it/s][A
epoch 2 iter 4893: train loss 0.18880. lr 6.000000e-05:  96%|█████████▌| 4893/5098 [37:29<01:44,  1.96it/s][A
epoch 2 iter 4893: train loss 0.18880. lr 6.000000e-05:  96%|█████████▌| 4894/5098 [37:29<01:41,  2.00it/s][A
epoch 2 iter 4894: train loss 0.19018. lr 6.000000e-05:  96%|█████████▌| 4894/5098 [37:29<01:41,  2.00it/s][A

epoch 2 iter 4926: train loss 0.18669. lr 6.000000e-05:  97%|█████████▋| 4927/5098 [37:46<01:59,  1.43it/s][A
epoch 2 iter 4927: train loss 0.18366. lr 6.000000e-05:  97%|█████████▋| 4927/5098 [37:47<01:59,  1.43it/s][A
epoch 2 iter 4927: train loss 0.18366. lr 6.000000e-05:  97%|█████████▋| 4928/5098 [37:47<01:53,  1.50it/s][A
epoch 2 iter 4928: train loss 0.18908. lr 6.000000e-05:  97%|█████████▋| 4928/5098 [37:47<01:53,  1.50it/s][A
epoch 2 iter 4928: train loss 0.18908. lr 6.000000e-05:  97%|█████████▋| 4929/5098 [37:47<01:48,  1.56it/s][A
epoch 2 iter 4929: train loss 0.19063. lr 6.000000e-05:  97%|█████████▋| 4929/5098 [37:48<01:48,  1.56it/s][A
epoch 2 iter 4929: train loss 0.19063. lr 6.000000e-05:  97%|█████████▋| 4930/5098 [37:48<01:43,  1.63it/s][A
epoch 2 iter 4930: train loss 0.18677. lr 6.000000e-05:  97%|█████████▋| 4930/5098 [37:48<01:43,  1.63it/s][A
epoch 2 iter 4930: train loss 0.18677. lr 6.000000e-05:  97%|█████████▋| 4931/5098 [37:48<01:37,  1.72it/s][A

epoch 2 iter 4963: train loss 0.18707. lr 6.000000e-05:  97%|█████████▋| 4963/5098 [38:03<00:54,  2.50it/s][A
epoch 2 iter 4963: train loss 0.18707. lr 6.000000e-05:  97%|█████████▋| 4964/5098 [38:03<00:56,  2.39it/s][A
epoch 2 iter 4964: train loss 0.18861. lr 6.000000e-05:  97%|█████████▋| 4964/5098 [38:03<00:56,  2.39it/s][A
epoch 2 iter 4964: train loss 0.18861. lr 6.000000e-05:  97%|█████████▋| 4965/5098 [38:03<00:55,  2.38it/s][A
epoch 2 iter 4965: train loss 0.18680. lr 6.000000e-05:  97%|█████████▋| 4965/5098 [38:03<00:55,  2.38it/s][A
epoch 2 iter 4965: train loss 0.18680. lr 6.000000e-05:  97%|█████████▋| 4966/5098 [38:03<00:55,  2.38it/s][A
epoch 2 iter 4966: train loss 0.18571. lr 6.000000e-05:  97%|█████████▋| 4966/5098 [38:04<00:55,  2.38it/s][A
epoch 2 iter 4966: train loss 0.18571. lr 6.000000e-05:  97%|█████████▋| 4967/5098 [38:04<00:55,  2.37it/s][A
epoch 2 iter 4967: train loss 0.18793. lr 6.000000e-05:  97%|█████████▋| 4967/5098 [38:04<00:55,  2.37it/s][A

epoch 2 iter 4999: train loss 0.18983. lr 6.000000e-05:  98%|█████████▊| 5000/5098 [38:20<00:43,  2.28it/s][A
epoch 2 iter 5000: train loss 0.18826. lr 6.000000e-05:  98%|█████████▊| 5000/5098 [38:21<00:43,  2.28it/s][A
epoch 2 iter 5000: train loss 0.18826. lr 6.000000e-05:  98%|█████████▊| 5001/5098 [38:21<00:40,  2.41it/s][A
epoch 2 iter 5001: train loss 0.18854. lr 6.000000e-05:  98%|█████████▊| 5001/5098 [38:21<00:40,  2.41it/s][A
epoch 2 iter 5001: train loss 0.18854. lr 6.000000e-05:  98%|█████████▊| 5002/5098 [38:21<00:38,  2.46it/s][A
epoch 2 iter 5002: train loss 0.19065. lr 6.000000e-05:  98%|█████████▊| 5002/5098 [38:21<00:38,  2.46it/s][A
epoch 2 iter 5002: train loss 0.19065. lr 6.000000e-05:  98%|█████████▊| 5003/5098 [38:21<00:38,  2.46it/s][A
epoch 2 iter 5003: train loss 0.18674. lr 6.000000e-05:  98%|█████████▊| 5003/5098 [38:22<00:38,  2.46it/s][A
epoch 2 iter 5003: train loss 0.18674. lr 6.000000e-05:  98%|█████████▊| 5004/5098 [38:22<00:38,  2.44it/s][A

epoch 2 iter 5036: train loss 0.19068. lr 6.000000e-05:  99%|█████████▉| 5036/5098 [38:36<00:26,  2.34it/s][A
epoch 2 iter 5036: train loss 0.19068. lr 6.000000e-05:  99%|█████████▉| 5037/5098 [38:36<00:26,  2.32it/s][A
epoch 2 iter 5037: train loss 0.18867. lr 6.000000e-05:  99%|█████████▉| 5037/5098 [38:36<00:26,  2.32it/s][A
epoch 2 iter 5037: train loss 0.18867. lr 6.000000e-05:  99%|█████████▉| 5038/5098 [38:36<00:27,  2.22it/s][A
epoch 2 iter 5038: train loss 0.18694. lr 6.000000e-05:  99%|█████████▉| 5038/5098 [38:37<00:27,  2.22it/s][A
epoch 2 iter 5038: train loss 0.18694. lr 6.000000e-05:  99%|█████████▉| 5039/5098 [38:37<00:26,  2.24it/s][A
epoch 2 iter 5039: train loss 0.18584. lr 6.000000e-05:  99%|█████████▉| 5039/5098 [38:37<00:26,  2.24it/s][A
epoch 2 iter 5039: train loss 0.18584. lr 6.000000e-05:  99%|█████████▉| 5040/5098 [38:37<00:24,  2.35it/s][A
epoch 2 iter 5040: train loss 0.18773. lr 6.000000e-05:  99%|█████████▉| 5040/5098 [38:37<00:24,  2.35it/s][A

epoch 2 iter 5072: train loss 0.18803. lr 6.000000e-05: 100%|█████████▉| 5073/5098 [38:53<00:10,  2.42it/s][A
epoch 2 iter 5073: train loss 0.18901. lr 6.000000e-05: 100%|█████████▉| 5073/5098 [38:53<00:10,  2.42it/s][A
epoch 2 iter 5073: train loss 0.18901. lr 6.000000e-05: 100%|█████████▉| 5074/5098 [38:53<00:09,  2.42it/s][A
epoch 2 iter 5074: train loss 0.18652. lr 6.000000e-05: 100%|█████████▉| 5074/5098 [38:54<00:09,  2.42it/s][A
epoch 2 iter 5074: train loss 0.18652. lr 6.000000e-05: 100%|█████████▉| 5075/5098 [38:54<00:09,  2.42it/s][A
epoch 2 iter 5075: train loss 0.18905. lr 6.000000e-05: 100%|█████████▉| 5075/5098 [38:54<00:09,  2.42it/s][A
epoch 2 iter 5075: train loss 0.18905. lr 6.000000e-05: 100%|█████████▉| 5076/5098 [38:54<00:09,  2.41it/s][A
epoch 2 iter 5076: train loss 0.18735. lr 6.000000e-05: 100%|█████████▉| 5076/5098 [38:54<00:09,  2.41it/s][A
epoch 2 iter 5076: train loss 0.18735. lr 6.000000e-05: 100%|█████████▉| 5077/5098 [38:54<00:08,  2.40it/s][A

data has 389250 characters, 169 unique.



  0%|          | 0/761 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.24234. lr 5.999995e-04:   0%|          | 0/761 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.24234. lr 5.999995e-04:   0%|          | 1/761 [00:00<06:17,  2.01it/s][A
epoch 1 iter 1: train loss 4.08599. lr 5.999978e-04:   0%|          | 1/761 [00:00<06:17,  2.01it/s][A
epoch 1 iter 1: train loss 4.08599. lr 5.999978e-04:   0%|          | 2/761 [00:00<06:09,  2.05it/s][A
epoch 1 iter 2: train loss 4.88229. lr 5.999948e-04:   0%|          | 2/761 [00:01<06:09,  2.05it/s][A
epoch 1 iter 2: train loss 4.88229. lr 5.999948e-04:   0%|          | 3/761 [00:01<05:44,  2.20it/s][A
epoch 1 iter 3: train loss 3.95327. lr 5.999905e-04:   0%|          | 3/761 [00:01<05:44,  2.20it/s][A
epoch 1 iter 3: train loss 3.95327. lr 5.999905e-04:   1%|          | 4/761 [00:01<05:32,  2.28it/s][A
epoch 1 iter 4: train loss 3.58429. lr 5.999850e-04:   1%|          | 4/761 [00:02<05:32,  2.28it/s][A
epoch 1 iter 4: train loss 3

epoch 1 iter 38: train loss 2.82431. lr 5.990335e-04:   5%|▌         | 39/761 [00:17<05:44,  2.10it/s][A
epoch 1 iter 39: train loss 2.84310. lr 5.989831e-04:   5%|▌         | 39/761 [00:18<05:44,  2.10it/s][A
epoch 1 iter 39: train loss 2.84310. lr 5.989831e-04:   5%|▌         | 40/761 [00:18<05:41,  2.11it/s][A
epoch 1 iter 40: train loss 2.83185. lr 5.989315e-04:   5%|▌         | 40/761 [00:18<05:41,  2.11it/s][A
epoch 1 iter 40: train loss 2.83185. lr 5.989315e-04:   5%|▌         | 41/761 [00:18<05:38,  2.12it/s][A
epoch 1 iter 41: train loss 2.82361. lr 5.988786e-04:   5%|▌         | 41/761 [00:19<05:38,  2.12it/s][A
epoch 1 iter 41: train loss 2.82361. lr 5.988786e-04:   6%|▌         | 42/761 [00:19<05:36,  2.14it/s][A
epoch 1 iter 42: train loss 2.82153. lr 5.988243e-04:   6%|▌         | 42/761 [00:19<05:36,  2.14it/s][A
epoch 1 iter 42: train loss 2.82153. lr 5.988243e-04:   6%|▌         | 43/761 [00:19<05:37,  2.13it/s][A
epoch 1 iter 43: train loss 2.80589. lr 5.9876

epoch 1 iter 77: train loss 2.69374. lr 5.961248e-04:  10%|█         | 77/761 [00:37<04:41,  2.43it/s][A
epoch 1 iter 77: train loss 2.69374. lr 5.961248e-04:  10%|█         | 78/761 [00:37<04:37,  2.46it/s][A
epoch 1 iter 78: train loss 2.67734. lr 5.960248e-04:  10%|█         | 78/761 [00:37<04:37,  2.46it/s][A
epoch 1 iter 78: train loss 2.67734. lr 5.960248e-04:  10%|█         | 79/761 [00:37<04:32,  2.50it/s][A
epoch 1 iter 79: train loss 2.66734. lr 5.959235e-04:  10%|█         | 79/761 [00:38<04:32,  2.50it/s][A
epoch 1 iter 79: train loss 2.66734. lr 5.959235e-04:  11%|█         | 80/761 [00:38<04:30,  2.52it/s][A
epoch 1 iter 80: train loss 2.69060. lr 5.958210e-04:  11%|█         | 80/761 [00:38<04:30,  2.52it/s][A
epoch 1 iter 80: train loss 2.69060. lr 5.958210e-04:  11%|█         | 81/761 [00:38<04:29,  2.52it/s][A
epoch 1 iter 81: train loss 2.66811. lr 5.957172e-04:  11%|█         | 81/761 [00:38<04:29,  2.52it/s][A
epoch 1 iter 81: train loss 2.66811. lr 5.9571

epoch 1 iter 115: train loss 2.60534. lr 5.914403e-04:  15%|█▌        | 115/761 [00:55<04:59,  2.16it/s][A
epoch 1 iter 115: train loss 2.60534. lr 5.914403e-04:  15%|█▌        | 116/761 [00:55<04:59,  2.15it/s][A
epoch 1 iter 116: train loss 2.60632. lr 5.912926e-04:  15%|█▌        | 116/761 [00:55<04:59,  2.15it/s][A
epoch 1 iter 116: train loss 2.60632. lr 5.912926e-04:  15%|█▌        | 117/761 [00:55<05:03,  2.12it/s][A
epoch 1 iter 117: train loss 2.61254. lr 5.911437e-04:  15%|█▌        | 117/761 [00:56<05:03,  2.12it/s][A
epoch 1 iter 117: train loss 2.61254. lr 5.911437e-04:  16%|█▌        | 118/761 [00:56<04:59,  2.14it/s][A
epoch 1 iter 118: train loss 2.60673. lr 5.909935e-04:  16%|█▌        | 118/761 [00:56<04:59,  2.14it/s][A
epoch 1 iter 118: train loss 2.60673. lr 5.909935e-04:  16%|█▌        | 119/761 [00:56<04:58,  2.15it/s][A
epoch 1 iter 119: train loss 2.61158. lr 5.908421e-04:  16%|█▌        | 119/761 [00:57<04:58,  2.15it/s][A
epoch 1 iter 119: train loss

epoch 1 iter 152: train loss 2.55960. lr 5.851520e-04:  20%|██        | 153/761 [01:11<04:56,  2.05it/s][A
epoch 1 iter 153: train loss 2.56015. lr 5.849587e-04:  20%|██        | 153/761 [01:12<04:56,  2.05it/s][A
epoch 1 iter 153: train loss 2.56015. lr 5.849587e-04:  20%|██        | 154/761 [01:12<04:53,  2.07it/s][A
epoch 1 iter 154: train loss 2.55911. lr 5.847642e-04:  20%|██        | 154/761 [01:12<04:53,  2.07it/s][A
epoch 1 iter 154: train loss 2.55911. lr 5.847642e-04:  20%|██        | 155/761 [01:12<04:51,  2.08it/s][A
epoch 1 iter 155: train loss 2.54891. lr 5.845685e-04:  20%|██        | 155/761 [01:13<04:51,  2.08it/s][A
epoch 1 iter 155: train loss 2.54891. lr 5.845685e-04:  20%|██        | 156/761 [01:13<04:46,  2.11it/s][A
epoch 1 iter 156: train loss 2.55337. lr 5.843716e-04:  20%|██        | 156/761 [01:13<04:46,  2.11it/s][A
epoch 1 iter 156: train loss 2.55337. lr 5.843716e-04:  21%|██        | 157/761 [01:13<04:39,  2.16it/s][A
epoch 1 iter 157: train loss

epoch 1 iter 190: train loss 2.49099. lr 5.769588e-04:  25%|██▍       | 190/761 [01:28<03:57,  2.40it/s][A
epoch 1 iter 190: train loss 2.49099. lr 5.769588e-04:  25%|██▌       | 191/761 [01:28<03:55,  2.42it/s][A
epoch 1 iter 191: train loss 2.48656. lr 5.767199e-04:  25%|██▌       | 191/761 [01:28<03:55,  2.42it/s][A
epoch 1 iter 191: train loss 2.48656. lr 5.767199e-04:  25%|██▌       | 192/761 [01:28<03:54,  2.43it/s][A
epoch 1 iter 192: train loss 2.50448. lr 5.764798e-04:  25%|██▌       | 192/761 [01:28<03:54,  2.43it/s][A
epoch 1 iter 192: train loss 2.50448. lr 5.764798e-04:  25%|██▌       | 193/761 [01:28<03:53,  2.44it/s][A
epoch 1 iter 193: train loss 2.48376. lr 5.762385e-04:  25%|██▌       | 193/761 [01:29<03:53,  2.44it/s][A
epoch 1 iter 193: train loss 2.48376. lr 5.762385e-04:  25%|██▌       | 194/761 [01:29<03:52,  2.44it/s][A
epoch 1 iter 194: train loss 2.50713. lr 5.759960e-04:  25%|██▌       | 194/761 [01:29<03:52,  2.44it/s][A
epoch 1 iter 194: train loss

epoch 1 iter 227: train loss 2.39844. lr 5.673397e-04:  30%|██▉       | 228/761 [01:45<03:55,  2.27it/s][A
epoch 1 iter 228: train loss 2.40987. lr 5.670577e-04:  30%|██▉       | 228/761 [01:45<03:55,  2.27it/s][A
epoch 1 iter 228: train loss 2.40987. lr 5.670577e-04:  30%|███       | 229/761 [01:45<03:48,  2.33it/s][A
epoch 1 iter 229: train loss 2.39474. lr 5.667746e-04:  30%|███       | 229/761 [01:45<03:48,  2.33it/s][A
epoch 1 iter 229: train loss 2.39474. lr 5.667746e-04:  30%|███       | 230/761 [01:45<03:40,  2.41it/s][A
epoch 1 iter 230: train loss 2.40207. lr 5.664904e-04:  30%|███       | 230/761 [01:46<03:40,  2.41it/s][A
epoch 1 iter 230: train loss 2.40207. lr 5.664904e-04:  30%|███       | 231/761 [01:46<03:33,  2.48it/s][A
epoch 1 iter 231: train loss 2.39281. lr 5.662051e-04:  30%|███       | 231/761 [01:46<03:33,  2.48it/s][A
epoch 1 iter 231: train loss 2.39281. lr 5.662051e-04:  30%|███       | 232/761 [01:46<03:32,  2.48it/s][A
epoch 1 iter 232: train loss

epoch 1 iter 265: train loss 2.31399. lr 5.558342e-04:  35%|███▍      | 265/761 [02:01<03:39,  2.26it/s][A
epoch 1 iter 265: train loss 2.31399. lr 5.558342e-04:  35%|███▍      | 266/761 [02:01<03:36,  2.29it/s][A
epoch 1 iter 266: train loss 2.29989. lr 5.555098e-04:  35%|███▍      | 266/761 [02:02<03:36,  2.29it/s][A
epoch 1 iter 266: train loss 2.29989. lr 5.555098e-04:  35%|███▌      | 267/761 [02:02<03:34,  2.31it/s][A
epoch 1 iter 267: train loss 2.30220. lr 5.551843e-04:  35%|███▌      | 267/761 [02:02<03:34,  2.31it/s][A
epoch 1 iter 267: train loss 2.30220. lr 5.551843e-04:  35%|███▌      | 268/761 [02:02<03:32,  2.32it/s][A
epoch 1 iter 268: train loss 2.29721. lr 5.548577e-04:  35%|███▌      | 268/761 [02:02<03:32,  2.32it/s][A
epoch 1 iter 268: train loss 2.29721. lr 5.548577e-04:  35%|███▌      | 269/761 [02:02<03:30,  2.34it/s][A
epoch 1 iter 269: train loss 2.28779. lr 5.545301e-04:  35%|███▌      | 269/761 [02:03<03:30,  2.34it/s][A
epoch 1 iter 269: train loss

epoch 1 iter 302: train loss 2.19236. lr 5.431151e-04:  40%|███▉      | 303/761 [02:18<03:06,  2.46it/s][A
epoch 1 iter 303: train loss 2.19914. lr 5.427512e-04:  40%|███▉      | 303/761 [02:18<03:06,  2.46it/s][A
epoch 1 iter 303: train loss 2.19914. lr 5.427512e-04:  40%|███▉      | 304/761 [02:18<03:06,  2.46it/s][A
epoch 1 iter 304: train loss 2.19937. lr 5.423863e-04:  40%|███▉      | 304/761 [02:19<03:06,  2.46it/s][A
epoch 1 iter 304: train loss 2.19937. lr 5.423863e-04:  40%|████      | 305/761 [02:19<03:05,  2.46it/s][A
epoch 1 iter 305: train loss 2.18792. lr 5.420204e-04:  40%|████      | 305/761 [02:19<03:05,  2.46it/s][A
epoch 1 iter 305: train loss 2.18792. lr 5.420204e-04:  40%|████      | 306/761 [02:19<03:05,  2.45it/s][A
epoch 1 iter 306: train loss 2.19108. lr 5.416535e-04:  40%|████      | 306/761 [02:19<03:05,  2.45it/s][A
epoch 1 iter 306: train loss 2.19108. lr 5.416535e-04:  40%|████      | 307/761 [02:19<03:01,  2.50it/s][A
epoch 1 iter 307: train loss

epoch 1 iter 340: train loss 2.07094. lr 5.285734e-04:  45%|████▍     | 340/761 [02:34<03:01,  2.32it/s][A
epoch 1 iter 340: train loss 2.07094. lr 5.285734e-04:  45%|████▍     | 341/761 [02:34<02:55,  2.39it/s][A
epoch 1 iter 341: train loss 2.08263. lr 5.281713e-04:  45%|████▍     | 341/761 [02:34<02:55,  2.39it/s][A
epoch 1 iter 341: train loss 2.08263. lr 5.281713e-04:  45%|████▍     | 342/761 [02:34<02:47,  2.50it/s][A
epoch 1 iter 342: train loss 2.06536. lr 5.277682e-04:  45%|████▍     | 342/761 [02:35<02:47,  2.50it/s][A
epoch 1 iter 342: train loss 2.06536. lr 5.277682e-04:  45%|████▌     | 343/761 [02:35<02:45,  2.53it/s][A
epoch 1 iter 343: train loss 2.06491. lr 5.273641e-04:  45%|████▌     | 343/761 [02:35<02:45,  2.53it/s][A
epoch 1 iter 343: train loss 2.06491. lr 5.273641e-04:  45%|████▌     | 344/761 [02:35<02:48,  2.47it/s][A
epoch 1 iter 344: train loss 2.06293. lr 5.269590e-04:  45%|████▌     | 344/761 [02:35<02:48,  2.47it/s][A
epoch 1 iter 344: train loss

epoch 1 iter 377: train loss 1.98648. lr 5.130592e-04:  50%|████▉     | 378/761 [02:50<02:54,  2.19it/s][A
epoch 1 iter 378: train loss 1.96199. lr 5.126222e-04:  50%|████▉     | 378/761 [02:51<02:54,  2.19it/s][A
epoch 1 iter 378: train loss 1.96199. lr 5.126222e-04:  50%|████▉     | 379/761 [02:51<02:52,  2.21it/s][A
epoch 1 iter 379: train loss 1.95763. lr 5.121843e-04:  50%|████▉     | 379/761 [02:51<02:52,  2.21it/s][A
epoch 1 iter 379: train loss 1.95763. lr 5.121843e-04:  50%|████▉     | 380/761 [02:51<02:51,  2.23it/s][A
epoch 1 iter 380: train loss 1.94820. lr 5.117454e-04:  50%|████▉     | 380/761 [02:52<02:51,  2.23it/s][A
epoch 1 iter 380: train loss 1.94820. lr 5.117454e-04:  50%|█████     | 381/761 [02:52<02:49,  2.24it/s][A
epoch 1 iter 381: train loss 1.95442. lr 5.113057e-04:  50%|█████     | 381/761 [02:52<02:49,  2.24it/s][A
epoch 1 iter 381: train loss 1.95442. lr 5.113057e-04:  50%|█████     | 382/761 [02:52<02:50,  2.22it/s][A
epoch 1 iter 382: train loss

epoch 1 iter 415: train loss 1.87092. lr 4.958301e-04:  55%|█████▍    | 415/761 [03:09<02:57,  1.95it/s][A
epoch 1 iter 415: train loss 1.87092. lr 4.958301e-04:  55%|█████▍    | 416/761 [03:09<02:51,  2.01it/s][A
epoch 1 iter 416: train loss 1.85901. lr 4.953599e-04:  55%|█████▍    | 416/761 [03:10<02:51,  2.01it/s][A
epoch 1 iter 416: train loss 1.85901. lr 4.953599e-04:  55%|█████▍    | 417/761 [03:10<02:47,  2.05it/s][A
epoch 1 iter 417: train loss 1.85241. lr 4.948888e-04:  55%|█████▍    | 417/761 [03:10<02:47,  2.05it/s][A
epoch 1 iter 417: train loss 1.85241. lr 4.948888e-04:  55%|█████▍    | 418/761 [03:10<02:43,  2.10it/s][A
epoch 1 iter 418: train loss 1.86466. lr 4.944170e-04:  55%|█████▍    | 418/761 [03:10<02:43,  2.10it/s][A
epoch 1 iter 418: train loss 1.86466. lr 4.944170e-04:  55%|█████▌    | 419/761 [03:10<02:40,  2.14it/s][A
epoch 1 iter 419: train loss 1.84977. lr 4.939443e-04:  55%|█████▌    | 419/761 [03:11<02:40,  2.14it/s][A
epoch 1 iter 419: train loss

epoch 1 iter 452: train loss 1.74951. lr 4.778930e-04:  60%|█████▉    | 453/761 [03:25<02:05,  2.45it/s][A
epoch 1 iter 453: train loss 1.75952. lr 4.773933e-04:  60%|█████▉    | 453/761 [03:25<02:05,  2.45it/s][A
epoch 1 iter 453: train loss 1.75952. lr 4.773933e-04:  60%|█████▉    | 454/761 [03:25<02:05,  2.45it/s][A
epoch 1 iter 454: train loss 1.75348. lr 4.768928e-04:  60%|█████▉    | 454/761 [03:25<02:05,  2.45it/s][A
epoch 1 iter 454: train loss 1.75348. lr 4.768928e-04:  60%|█████▉    | 455/761 [03:25<02:04,  2.45it/s][A
epoch 1 iter 455: train loss 1.74092. lr 4.763916e-04:  60%|█████▉    | 455/761 [03:26<02:04,  2.45it/s][A
epoch 1 iter 455: train loss 1.74092. lr 4.763916e-04:  60%|█████▉    | 456/761 [03:26<02:07,  2.39it/s][A
epoch 1 iter 456: train loss 1.75785. lr 4.758896e-04:  60%|█████▉    | 456/761 [03:26<02:07,  2.39it/s][A
epoch 1 iter 456: train loss 1.75785. lr 4.758896e-04:  60%|██████    | 457/761 [03:26<02:06,  2.41it/s][A
epoch 1 iter 457: train loss

epoch 1 iter 490: train loss 1.67944. lr 4.583896e-04:  64%|██████▍   | 490/761 [03:41<01:54,  2.37it/s][A
epoch 1 iter 490: train loss 1.67944. lr 4.583896e-04:  65%|██████▍   | 491/761 [03:41<01:53,  2.38it/s][A
epoch 1 iter 491: train loss 1.66518. lr 4.578626e-04:  65%|██████▍   | 491/761 [03:42<01:53,  2.38it/s][A
epoch 1 iter 491: train loss 1.66518. lr 4.578626e-04:  65%|██████▍   | 492/761 [03:42<01:52,  2.39it/s][A
epoch 1 iter 492: train loss 1.64804. lr 4.573350e-04:  65%|██████▍   | 492/761 [03:42<01:52,  2.39it/s][A
epoch 1 iter 492: train loss 1.64804. lr 4.573350e-04:  65%|██████▍   | 493/761 [03:42<01:51,  2.39it/s][A
epoch 1 iter 493: train loss 1.65812. lr 4.568067e-04:  65%|██████▍   | 493/761 [03:43<01:51,  2.39it/s][A
epoch 1 iter 493: train loss 1.65812. lr 4.568067e-04:  65%|██████▍   | 494/761 [03:43<01:51,  2.40it/s][A
epoch 1 iter 494: train loss 1.66291. lr 4.562777e-04:  65%|██████▍   | 494/761 [03:43<01:51,  2.40it/s][A
epoch 1 iter 494: train loss

epoch 1 iter 527: train loss 1.57355. lr 4.384599e-04:  69%|██████▉   | 528/761 [03:58<01:54,  2.03it/s][A
epoch 1 iter 528: train loss 1.59195. lr 4.379095e-04:  69%|██████▉   | 528/761 [03:58<01:54,  2.03it/s][A
epoch 1 iter 528: train loss 1.59195. lr 4.379095e-04:  70%|██████▉   | 529/761 [03:58<01:51,  2.08it/s][A
epoch 1 iter 529: train loss 1.58883. lr 4.373585e-04:  70%|██████▉   | 529/761 [03:59<01:51,  2.08it/s][A
epoch 1 iter 529: train loss 1.58883. lr 4.373585e-04:  70%|██████▉   | 530/761 [03:59<01:46,  2.16it/s][A
epoch 1 iter 530: train loss 1.58299. lr 4.368069e-04:  70%|██████▉   | 530/761 [03:59<01:46,  2.16it/s][A
epoch 1 iter 530: train loss 1.58299. lr 4.368069e-04:  70%|██████▉   | 531/761 [03:59<01:42,  2.25it/s][A
epoch 1 iter 531: train loss 1.58175. lr 4.362548e-04:  70%|██████▉   | 531/761 [03:59<01:42,  2.25it/s][A
epoch 1 iter 531: train loss 1.58175. lr 4.362548e-04:  70%|██████▉   | 532/761 [03:59<01:41,  2.26it/s][A
epoch 1 iter 532: train loss

epoch 1 iter 565: train loss 1.51223. lr 4.171501e-04:  74%|███████▍  | 565/761 [04:16<02:23,  1.36it/s][A
epoch 1 iter 565: train loss 1.51223. lr 4.171501e-04:  74%|███████▍  | 566/761 [04:16<02:21,  1.38it/s][A
epoch 1 iter 566: train loss 1.51225. lr 4.165790e-04:  74%|███████▍  | 566/761 [04:17<02:21,  1.38it/s][A
epoch 1 iter 566: train loss 1.51225. lr 4.165790e-04:  75%|███████▍  | 567/761 [04:17<02:16,  1.42it/s][A
epoch 1 iter 567: train loss 1.49469. lr 4.160074e-04:  75%|███████▍  | 567/761 [04:17<02:16,  1.42it/s][A
epoch 1 iter 567: train loss 1.49469. lr 4.160074e-04:  75%|███████▍  | 568/761 [04:17<02:08,  1.50it/s][A
epoch 1 iter 568: train loss 1.50196. lr 4.154353e-04:  75%|███████▍  | 568/761 [04:18<02:08,  1.50it/s][A
epoch 1 iter 568: train loss 1.50196. lr 4.154353e-04:  75%|███████▍  | 569/761 [04:18<02:03,  1.56it/s][A
epoch 1 iter 569: train loss 1.50346. lr 4.148627e-04:  75%|███████▍  | 569/761 [04:18<02:03,  1.56it/s][A
epoch 1 iter 569: train loss

epoch 1 iter 602: train loss 1.42164. lr 3.957058e-04:  79%|███████▉  | 603/761 [04:33<01:07,  2.35it/s][A
epoch 1 iter 603: train loss 1.43731. lr 3.951179e-04:  79%|███████▉  | 603/761 [04:33<01:07,  2.35it/s][A
epoch 1 iter 603: train loss 1.43731. lr 3.951179e-04:  79%|███████▉  | 604/761 [04:33<01:06,  2.37it/s][A
epoch 1 iter 604: train loss 1.42740. lr 3.945296e-04:  79%|███████▉  | 604/761 [04:34<01:06,  2.37it/s][A
epoch 1 iter 604: train loss 1.42740. lr 3.945296e-04:  80%|███████▉  | 605/761 [04:34<01:05,  2.38it/s][A
epoch 1 iter 605: train loss 1.42238. lr 3.939409e-04:  80%|███████▉  | 605/761 [04:34<01:05,  2.38it/s][A
epoch 1 iter 605: train loss 1.42238. lr 3.939409e-04:  80%|███████▉  | 606/761 [04:34<01:04,  2.39it/s][A
epoch 1 iter 606: train loss 1.42858. lr 3.933518e-04:  80%|███████▉  | 606/761 [04:35<01:04,  2.39it/s][A
epoch 1 iter 606: train loss 1.42858. lr 3.933518e-04:  80%|███████▉  | 607/761 [04:35<01:04,  2.40it/s][A
epoch 1 iter 607: train loss

epoch 1 iter 640: train loss 1.34982. lr 3.731007e-04:  84%|████████▍ | 640/761 [04:49<00:48,  2.48it/s][A
epoch 1 iter 640: train loss 1.34982. lr 3.731007e-04:  84%|████████▍ | 641/761 [04:49<00:47,  2.51it/s][A
epoch 1 iter 641: train loss 1.36809. lr 3.724992e-04:  84%|████████▍ | 641/761 [04:50<00:47,  2.51it/s][A
epoch 1 iter 641: train loss 1.36809. lr 3.724992e-04:  84%|████████▍ | 642/761 [04:50<00:47,  2.49it/s][A
epoch 1 iter 642: train loss 1.35050. lr 3.718973e-04:  84%|████████▍ | 642/761 [04:50<00:47,  2.49it/s][A
epoch 1 iter 642: train loss 1.35050. lr 3.718973e-04:  84%|████████▍ | 643/761 [04:50<00:47,  2.47it/s][A
epoch 1 iter 643: train loss 1.35441. lr 3.712951e-04:  84%|████████▍ | 643/761 [04:51<00:47,  2.47it/s][A
epoch 1 iter 643: train loss 1.35441. lr 3.712951e-04:  85%|████████▍ | 644/761 [04:51<00:47,  2.46it/s][A
epoch 1 iter 644: train loss 1.35992. lr 3.706926e-04:  85%|████████▍ | 644/761 [04:51<00:47,  2.46it/s][A
epoch 1 iter 644: train loss

epoch 1 iter 677: train loss 1.27418. lr 3.506562e-04:  89%|████████▉ | 678/761 [05:07<00:36,  2.29it/s][A
epoch 1 iter 678: train loss 1.27398. lr 3.500449e-04:  89%|████████▉ | 678/761 [05:07<00:36,  2.29it/s][A
epoch 1 iter 678: train loss 1.27398. lr 3.500449e-04:  89%|████████▉ | 679/761 [05:07<00:35,  2.34it/s][A
epoch 1 iter 679: train loss 1.26328. lr 3.494333e-04:  89%|████████▉ | 679/761 [05:08<00:35,  2.34it/s][A
epoch 1 iter 679: train loss 1.26328. lr 3.494333e-04:  89%|████████▉ | 680/761 [05:08<00:33,  2.39it/s][A
epoch 1 iter 680: train loss 1.27458. lr 3.488216e-04:  89%|████████▉ | 680/761 [05:08<00:33,  2.39it/s][A
epoch 1 iter 680: train loss 1.27458. lr 3.488216e-04:  89%|████████▉ | 681/761 [05:08<00:32,  2.43it/s][A
epoch 1 iter 681: train loss 1.27458. lr 3.482096e-04:  89%|████████▉ | 681/761 [05:09<00:32,  2.43it/s][A
epoch 1 iter 681: train loss 1.27458. lr 3.482096e-04:  90%|████████▉ | 682/761 [05:09<00:32,  2.46it/s][A
epoch 1 iter 682: train loss

epoch 1 iter 715: train loss 1.19681. lr 3.272980e-04:  94%|█████████▍| 715/761 [05:23<00:22,  2.07it/s][A
epoch 1 iter 715: train loss 1.19681. lr 3.272980e-04:  94%|█████████▍| 716/761 [05:23<00:21,  2.08it/s][A
epoch 1 iter 716: train loss 1.19426. lr 3.266804e-04:  94%|█████████▍| 716/761 [05:23<00:21,  2.08it/s][A
epoch 1 iter 716: train loss 1.19426. lr 3.266804e-04:  94%|█████████▍| 717/761 [05:23<00:20,  2.10it/s][A
epoch 1 iter 717: train loss 1.18934. lr 3.260627e-04:  94%|█████████▍| 717/761 [05:23<00:20,  2.10it/s][A
epoch 1 iter 717: train loss 1.18934. lr 3.260627e-04:  94%|█████████▍| 718/761 [05:23<00:20,  2.11it/s][A
epoch 1 iter 718: train loss 1.19122. lr 3.254448e-04:  94%|█████████▍| 718/761 [05:24<00:20,  2.11it/s][A
epoch 1 iter 718: train loss 1.19122. lr 3.254448e-04:  94%|█████████▍| 719/761 [05:24<00:19,  2.20it/s][A
epoch 1 iter 719: train loss 1.18873. lr 3.248269e-04:  94%|█████████▍| 719/761 [05:24<00:19,  2.20it/s][A
epoch 1 iter 719: train loss

epoch 1 iter 752: train loss 1.12811. lr 3.043915e-04:  99%|█████████▉| 753/761 [05:39<00:03,  2.55it/s][A
epoch 1 iter 753: train loss 1.12292. lr 3.037714e-04:  99%|█████████▉| 753/761 [05:40<00:03,  2.55it/s][A
epoch 1 iter 753: train loss 1.12292. lr 3.037714e-04:  99%|█████████▉| 754/761 [05:40<00:02,  2.56it/s][A
epoch 1 iter 754: train loss 1.11730. lr 3.031514e-04:  99%|█████████▉| 754/761 [05:40<00:02,  2.56it/s][A
epoch 1 iter 754: train loss 1.11730. lr 3.031514e-04:  99%|█████████▉| 755/761 [05:40<00:02,  2.54it/s][A
epoch 1 iter 755: train loss 1.12601. lr 3.025313e-04:  99%|█████████▉| 755/761 [05:40<00:02,  2.54it/s][A
epoch 1 iter 755: train loss 1.12601. lr 3.025313e-04:  99%|█████████▉| 756/761 [05:40<00:01,  2.52it/s][A
epoch 1 iter 756: train loss 1.11213. lr 3.019112e-04:  99%|█████████▉| 756/761 [05:41<00:01,  2.52it/s][A
epoch 1 iter 756: train loss 1.11213. lr 3.019112e-04:  99%|█████████▉| 757/761 [05:41<00:01,  2.51it/s][A
epoch 1 iter 757: train loss

epoch 2 iter 30: train loss 1.05345. lr 2.808380e-04:   4%|▍         | 30/761 [00:15<08:34,  1.42it/s][A
epoch 2 iter 30: train loss 1.05345. lr 2.808380e-04:   4%|▍         | 31/761 [00:15<08:06,  1.50it/s][A
epoch 2 iter 31: train loss 1.05266. lr 2.802192e-04:   4%|▍         | 31/761 [00:15<08:06,  1.50it/s][A
epoch 2 iter 31: train loss 1.05266. lr 2.802192e-04:   4%|▍         | 32/761 [00:15<07:36,  1.60it/s][A
epoch 2 iter 32: train loss 1.04500. lr 2.796005e-04:   4%|▍         | 32/761 [00:16<07:36,  1.60it/s][A
epoch 2 iter 32: train loss 1.04500. lr 2.796005e-04:   4%|▍         | 33/761 [00:16<07:14,  1.67it/s][A
epoch 2 iter 33: train loss 1.04634. lr 2.789819e-04:   4%|▍         | 33/761 [00:16<07:14,  1.67it/s][A
epoch 2 iter 33: train loss 1.04634. lr 2.789819e-04:   4%|▍         | 34/761 [00:16<06:51,  1.77it/s][A
epoch 2 iter 34: train loss 1.05963. lr 2.783633e-04:   4%|▍         | 34/761 [00:17<06:51,  1.77it/s][A
epoch 2 iter 34: train loss 1.05963. lr 2.7836

epoch 2 iter 68: train loss 0.98203. lr 2.574052e-04:   9%|▉         | 69/761 [00:31<04:48,  2.40it/s][A
epoch 2 iter 69: train loss 0.98636. lr 2.567914e-04:   9%|▉         | 69/761 [00:32<04:48,  2.40it/s][A
epoch 2 iter 69: train loss 0.98636. lr 2.567914e-04:   9%|▉         | 70/761 [00:32<04:52,  2.36it/s][A
epoch 2 iter 70: train loss 0.97530. lr 2.561779e-04:   9%|▉         | 70/761 [00:32<04:52,  2.36it/s][A
epoch 2 iter 70: train loss 0.97530. lr 2.561779e-04:   9%|▉         | 71/761 [00:32<04:48,  2.39it/s][A
epoch 2 iter 71: train loss 0.98353. lr 2.555645e-04:   9%|▉         | 71/761 [00:32<04:48,  2.39it/s][A
epoch 2 iter 71: train loss 0.98353. lr 2.555645e-04:   9%|▉         | 72/761 [00:32<04:46,  2.41it/s][A
epoch 2 iter 72: train loss 0.97601. lr 2.549513e-04:   9%|▉         | 72/761 [00:33<04:46,  2.41it/s][A
epoch 2 iter 72: train loss 0.97601. lr 2.549513e-04:  10%|▉         | 73/761 [00:33<04:35,  2.50it/s][A
epoch 2 iter 73: train loss 0.97910. lr 2.5433

epoch 2 iter 107: train loss 0.91290. lr 2.336301e-04:  14%|█▍        | 107/761 [00:50<04:40,  2.33it/s][A
epoch 2 iter 107: train loss 0.91290. lr 2.336301e-04:  14%|█▍        | 108/761 [00:50<04:44,  2.30it/s][A
epoch 2 iter 108: train loss 0.91444. lr 2.330255e-04:  14%|█▍        | 108/761 [00:50<04:44,  2.30it/s][A
epoch 2 iter 108: train loss 0.91444. lr 2.330255e-04:  14%|█▍        | 109/761 [00:50<04:41,  2.32it/s][A
epoch 2 iter 109: train loss 0.91909. lr 2.324211e-04:  14%|█▍        | 109/761 [00:50<04:41,  2.32it/s][A
epoch 2 iter 109: train loss 0.91909. lr 2.324211e-04:  14%|█▍        | 110/761 [00:50<04:39,  2.33it/s][A
epoch 2 iter 110: train loss 0.91904. lr 2.318171e-04:  14%|█▍        | 110/761 [00:51<04:39,  2.33it/s][A
epoch 2 iter 110: train loss 0.91904. lr 2.318171e-04:  15%|█▍        | 111/761 [00:51<04:38,  2.34it/s][A
epoch 2 iter 111: train loss 0.91809. lr 2.312134e-04:  15%|█▍        | 111/761 [00:51<04:38,  2.34it/s][A
epoch 2 iter 111: train loss

epoch 2 iter 144: train loss 0.86139. lr 2.114703e-04:  19%|█▉        | 145/761 [01:05<04:17,  2.40it/s][A
epoch 2 iter 145: train loss 0.85155. lr 2.108780e-04:  19%|█▉        | 145/761 [01:05<04:17,  2.40it/s][A
epoch 2 iter 145: train loss 0.85155. lr 2.108780e-04:  19%|█▉        | 146/761 [01:05<04:16,  2.39it/s][A
epoch 2 iter 146: train loss 0.85022. lr 2.102861e-04:  19%|█▉        | 146/761 [01:05<04:16,  2.39it/s][A
epoch 2 iter 146: train loss 0.85022. lr 2.102861e-04:  19%|█▉        | 147/761 [01:05<04:15,  2.40it/s][A
epoch 2 iter 147: train loss 0.85357. lr 2.096945e-04:  19%|█▉        | 147/761 [01:06<04:15,  2.40it/s][A
epoch 2 iter 147: train loss 0.85357. lr 2.096945e-04:  19%|█▉        | 148/761 [01:06<04:16,  2.39it/s][A
epoch 2 iter 148: train loss 0.85015. lr 2.091034e-04:  19%|█▉        | 148/761 [01:06<04:16,  2.39it/s][A
epoch 2 iter 148: train loss 0.85015. lr 2.091034e-04:  20%|█▉        | 149/761 [01:06<04:15,  2.39it/s][A
epoch 2 iter 149: train loss

epoch 2 iter 182: train loss 0.79775. lr 1.892515e-04:  24%|██▍       | 182/761 [01:23<04:25,  2.18it/s][A
epoch 2 iter 182: train loss 0.79775. lr 1.892515e-04:  24%|██▍       | 183/761 [01:23<04:21,  2.21it/s][A
epoch 2 iter 183: train loss 0.78403. lr 1.886755e-04:  24%|██▍       | 183/761 [01:23<04:21,  2.21it/s][A
epoch 2 iter 183: train loss 0.78403. lr 1.886755e-04:  24%|██▍       | 184/761 [01:23<04:19,  2.22it/s][A
epoch 2 iter 184: train loss 0.79433. lr 1.880999e-04:  24%|██▍       | 184/761 [01:24<04:19,  2.22it/s][A
epoch 2 iter 184: train loss 0.79433. lr 1.880999e-04:  24%|██▍       | 185/761 [01:24<04:17,  2.24it/s][A
epoch 2 iter 185: train loss 0.79434. lr 1.875247e-04:  24%|██▍       | 185/761 [01:24<04:17,  2.24it/s][A
epoch 2 iter 185: train loss 0.79434. lr 1.875247e-04:  24%|██▍       | 186/761 [01:24<04:06,  2.33it/s][A
epoch 2 iter 186: train loss 0.78970. lr 1.869501e-04:  24%|██▍       | 186/761 [01:25<04:06,  2.33it/s][A
epoch 2 iter 186: train loss

epoch 2 iter 219: train loss 0.74618. lr 1.682726e-04:  29%|██▉       | 220/761 [01:40<04:36,  1.95it/s][A
epoch 2 iter 220: train loss 0.75325. lr 1.677157e-04:  29%|██▉       | 220/761 [01:40<04:36,  1.95it/s][A
epoch 2 iter 220: train loss 0.75325. lr 1.677157e-04:  29%|██▉       | 221/761 [01:40<04:40,  1.93it/s][A
epoch 2 iter 221: train loss 0.73653. lr 1.671594e-04:  29%|██▉       | 221/761 [01:41<04:40,  1.93it/s][A
epoch 2 iter 221: train loss 0.73653. lr 1.671594e-04:  29%|██▉       | 222/761 [01:41<04:20,  2.07it/s][A
epoch 2 iter 222: train loss 0.73937. lr 1.666037e-04:  29%|██▉       | 222/761 [01:41<04:20,  2.07it/s][A
epoch 2 iter 222: train loss 0.73937. lr 1.666037e-04:  29%|██▉       | 223/761 [01:41<04:05,  2.19it/s][A
epoch 2 iter 223: train loss 0.74478. lr 1.660486e-04:  29%|██▉       | 223/761 [01:41<04:05,  2.19it/s][A
epoch 2 iter 223: train loss 0.74478. lr 1.660486e-04:  29%|██▉       | 224/761 [01:41<03:55,  2.28it/s][A
epoch 2 iter 224: train loss

epoch 2 iter 257: train loss 0.70582. lr 1.475294e-04:  34%|███▍      | 257/761 [01:57<04:19,  1.94it/s][A
epoch 2 iter 257: train loss 0.70582. lr 1.475294e-04:  34%|███▍      | 258/761 [01:57<04:03,  2.07it/s][A
epoch 2 iter 258: train loss 0.69357. lr 1.469956e-04:  34%|███▍      | 258/761 [01:57<04:03,  2.07it/s][A
epoch 2 iter 258: train loss 0.69357. lr 1.469956e-04:  34%|███▍      | 259/761 [01:57<03:50,  2.18it/s][A
epoch 2 iter 259: train loss 0.69436. lr 1.464626e-04:  34%|███▍      | 259/761 [01:58<03:50,  2.18it/s][A
epoch 2 iter 259: train loss 0.69436. lr 1.464626e-04:  34%|███▍      | 260/761 [01:58<03:40,  2.27it/s][A
epoch 2 iter 260: train loss 0.68641. lr 1.459302e-04:  34%|███▍      | 260/761 [01:58<03:40,  2.27it/s][A
epoch 2 iter 260: train loss 0.68641. lr 1.459302e-04:  34%|███▍      | 261/761 [01:58<03:29,  2.39it/s][A
epoch 2 iter 261: train loss 0.68447. lr 1.453984e-04:  34%|███▍      | 261/761 [01:59<03:29,  2.39it/s][A
epoch 2 iter 261: train loss

epoch 2 iter 294: train loss 0.65672. lr 1.282344e-04:  39%|███▉      | 295/761 [02:15<03:42,  2.10it/s][A
epoch 2 iter 295: train loss 0.64851. lr 1.277264e-04:  39%|███▉      | 295/761 [02:16<03:42,  2.10it/s][A
epoch 2 iter 295: train loss 0.64851. lr 1.277264e-04:  39%|███▉      | 296/761 [02:16<03:39,  2.12it/s][A
epoch 2 iter 296: train loss 0.65899. lr 1.272191e-04:  39%|███▉      | 296/761 [02:16<03:39,  2.12it/s][A
epoch 2 iter 296: train loss 0.65899. lr 1.272191e-04:  39%|███▉      | 297/761 [02:16<03:35,  2.15it/s][A
epoch 2 iter 297: train loss 0.65168. lr 1.267125e-04:  39%|███▉      | 297/761 [02:17<03:35,  2.15it/s][A
epoch 2 iter 297: train loss 0.65168. lr 1.267125e-04:  39%|███▉      | 298/761 [02:17<03:32,  2.18it/s][A
epoch 2 iter 298: train loss 0.65785. lr 1.262067e-04:  39%|███▉      | 298/761 [02:17<03:32,  2.18it/s][A
epoch 2 iter 298: train loss 0.65785. lr 1.262067e-04:  39%|███▉      | 299/761 [02:17<03:30,  2.19it/s][A
epoch 2 iter 299: train loss

epoch 2 iter 332: train loss 0.61509. lr 1.094643e-04:  44%|████▎     | 332/761 [02:32<04:58,  1.44it/s][A
epoch 2 iter 332: train loss 0.61509. lr 1.094643e-04:  44%|████▍     | 333/761 [02:32<04:25,  1.61it/s][A
epoch 2 iter 333: train loss 0.61967. lr 1.089857e-04:  44%|████▍     | 333/761 [02:33<04:25,  1.61it/s][A
epoch 2 iter 333: train loss 0.61967. lr 1.089857e-04:  44%|████▍     | 334/761 [02:33<04:00,  1.78it/s][A
epoch 2 iter 334: train loss 0.61437. lr 1.085080e-04:  44%|████▍     | 334/761 [02:33<04:00,  1.78it/s][A
epoch 2 iter 334: train loss 0.61437. lr 1.085080e-04:  44%|████▍     | 335/761 [02:33<03:42,  1.91it/s][A
epoch 2 iter 335: train loss 0.61961. lr 1.080310e-04:  44%|████▍     | 335/761 [02:34<03:42,  1.91it/s][A
epoch 2 iter 335: train loss 0.61961. lr 1.080310e-04:  44%|████▍     | 336/761 [02:34<03:30,  2.02it/s][A
epoch 2 iter 336: train loss 0.60988. lr 1.075549e-04:  44%|████▍     | 336/761 [02:34<03:30,  2.02it/s][A
epoch 2 iter 336: train loss

epoch 2 iter 369: train loss 0.59149. lr 9.231618e-05:  49%|████▊     | 370/761 [02:50<03:35,  1.81it/s][A
epoch 2 iter 370: train loss 0.58370. lr 9.186913e-05:  49%|████▊     | 370/761 [02:51<03:35,  1.81it/s][A
epoch 2 iter 370: train loss 0.58370. lr 9.186913e-05:  49%|████▉     | 371/761 [02:51<03:28,  1.87it/s][A
epoch 2 iter 371: train loss 0.58550. lr 9.142297e-05:  49%|████▉     | 371/761 [02:51<03:28,  1.87it/s][A
epoch 2 iter 371: train loss 0.58550. lr 9.142297e-05:  49%|████▉     | 372/761 [02:51<03:23,  1.91it/s][A
epoch 2 iter 372: train loss 0.58992. lr 9.097770e-05:  49%|████▉     | 372/761 [02:51<03:23,  1.91it/s][A
epoch 2 iter 372: train loss 0.58992. lr 9.097770e-05:  49%|████▉     | 373/761 [02:52<03:16,  1.97it/s][A
epoch 2 iter 373: train loss 0.58331. lr 9.053333e-05:  49%|████▉     | 373/761 [02:52<03:16,  1.97it/s][A
epoch 2 iter 373: train loss 0.58331. lr 9.053333e-05:  49%|████▉     | 374/761 [02:52<03:14,  1.99it/s][A
epoch 2 iter 374: train loss

epoch 2 iter 407: train loss 0.56148. lr 7.596934e-05:  53%|█████▎    | 407/761 [03:07<02:56,  2.01it/s][A
epoch 2 iter 407: train loss 0.56148. lr 7.596934e-05:  54%|█████▎    | 408/761 [03:07<02:54,  2.03it/s][A
epoch 2 iter 408: train loss 0.56213. lr 7.555740e-05:  54%|█████▎    | 408/761 [03:08<02:54,  2.03it/s][A
epoch 2 iter 408: train loss 0.56213. lr 7.555740e-05:  54%|█████▎    | 409/761 [03:08<02:49,  2.08it/s][A
epoch 2 iter 409: train loss 0.56160. lr 7.514641e-05:  54%|█████▎    | 409/761 [03:08<02:49,  2.08it/s][A
epoch 2 iter 409: train loss 0.56160. lr 7.514641e-05:  54%|█████▍    | 410/761 [03:08<02:47,  2.10it/s][A
epoch 2 iter 410: train loss 0.56283. lr 7.473638e-05:  54%|█████▍    | 410/761 [03:09<02:47,  2.10it/s][A
epoch 2 iter 410: train loss 0.56283. lr 7.473638e-05:  54%|█████▍    | 411/761 [03:09<02:45,  2.12it/s][A
epoch 2 iter 411: train loss 0.56185. lr 7.432732e-05:  54%|█████▍    | 411/761 [03:09<02:45,  2.12it/s][A
epoch 2 iter 411: train loss

epoch 2 iter 444: train loss 0.54100. lr 6.137933e-05:  58%|█████▊    | 445/761 [03:23<02:04,  2.53it/s][A
epoch 2 iter 445: train loss 0.53402. lr 6.100401e-05:  58%|█████▊    | 445/761 [03:23<02:04,  2.53it/s][A
epoch 2 iter 445: train loss 0.53402. lr 6.100401e-05:  59%|█████▊    | 446/761 [03:23<02:04,  2.53it/s][A
epoch 2 iter 446: train loss 0.54045. lr 6.062970e-05:  59%|█████▊    | 446/761 [03:24<02:04,  2.53it/s][A
epoch 2 iter 446: train loss 0.54045. lr 6.062970e-05:  59%|█████▊    | 447/761 [03:24<02:07,  2.46it/s][A
epoch 2 iter 447: train loss 0.53758. lr 6.025641e-05:  59%|█████▊    | 447/761 [03:24<02:07,  2.46it/s][A
epoch 2 iter 447: train loss 0.53758. lr 6.025641e-05:  59%|█████▉    | 448/761 [03:24<02:17,  2.28it/s][A
epoch 2 iter 448: train loss 0.53465. lr 6.000000e-05:  59%|█████▉    | 448/761 [03:25<02:17,  2.28it/s][A
epoch 2 iter 448: train loss 0.53465. lr 6.000000e-05:  59%|█████▉    | 449/761 [03:25<02:24,  2.16it/s][A
epoch 2 iter 449: train loss

epoch 2 iter 482: train loss 0.52108. lr 6.000000e-05:  63%|██████▎   | 482/761 [03:40<01:55,  2.42it/s][A
epoch 2 iter 482: train loss 0.52108. lr 6.000000e-05:  63%|██████▎   | 483/761 [03:40<01:55,  2.41it/s][A
epoch 2 iter 483: train loss 0.52830. lr 6.000000e-05:  63%|██████▎   | 483/761 [03:40<01:55,  2.41it/s][A
epoch 2 iter 483: train loss 0.52830. lr 6.000000e-05:  64%|██████▎   | 484/761 [03:40<01:55,  2.41it/s][A
epoch 2 iter 484: train loss 0.51800. lr 6.000000e-05:  64%|██████▎   | 484/761 [03:41<01:55,  2.41it/s][A
epoch 2 iter 484: train loss 0.51800. lr 6.000000e-05:  64%|██████▎   | 485/761 [03:41<01:55,  2.40it/s][A
epoch 2 iter 485: train loss 0.52702. lr 6.000000e-05:  64%|██████▎   | 485/761 [03:41<01:55,  2.40it/s][A
epoch 2 iter 485: train loss 0.52702. lr 6.000000e-05:  64%|██████▍   | 486/761 [03:41<01:54,  2.39it/s][A
epoch 2 iter 486: train loss 0.52328. lr 6.000000e-05:  64%|██████▍   | 486/761 [03:42<01:54,  2.39it/s][A
epoch 2 iter 486: train loss

epoch 2 iter 519: train loss 0.50542. lr 6.000000e-05:  68%|██████▊   | 520/761 [03:57<02:23,  1.68it/s][A
epoch 2 iter 520: train loss 0.50464. lr 6.000000e-05:  68%|██████▊   | 520/761 [03:58<02:23,  1.68it/s][A
epoch 2 iter 520: train loss 0.50464. lr 6.000000e-05:  68%|██████▊   | 521/761 [03:58<02:31,  1.58it/s][A
epoch 2 iter 521: train loss 0.51181. lr 6.000000e-05:  68%|██████▊   | 521/761 [03:59<02:31,  1.58it/s][A
epoch 2 iter 521: train loss 0.51181. lr 6.000000e-05:  69%|██████▊   | 522/761 [03:59<02:29,  1.60it/s][A
epoch 2 iter 522: train loss 0.50372. lr 6.000000e-05:  69%|██████▊   | 522/761 [03:59<02:29,  1.60it/s][A
epoch 2 iter 522: train loss 0.50372. lr 6.000000e-05:  69%|██████▊   | 523/761 [03:59<02:24,  1.65it/s][A
epoch 2 iter 523: train loss 0.50989. lr 6.000000e-05:  69%|██████▊   | 523/761 [04:00<02:24,  1.65it/s][A
epoch 2 iter 523: train loss 0.50989. lr 6.000000e-05:  69%|██████▉   | 524/761 [04:00<02:20,  1.69it/s][A
epoch 2 iter 524: train loss

epoch 2 iter 557: train loss 0.49124. lr 6.000000e-05:  73%|███████▎  | 557/761 [04:15<01:48,  1.88it/s][A
epoch 2 iter 557: train loss 0.49124. lr 6.000000e-05:  73%|███████▎  | 558/761 [04:15<01:46,  1.91it/s][A
epoch 2 iter 558: train loss 0.49599. lr 6.000000e-05:  73%|███████▎  | 558/761 [04:16<01:46,  1.91it/s][A
epoch 2 iter 558: train loss 0.49599. lr 6.000000e-05:  73%|███████▎  | 559/761 [04:16<01:43,  1.95it/s][A
epoch 2 iter 559: train loss 0.49438. lr 6.000000e-05:  73%|███████▎  | 559/761 [04:16<01:43,  1.95it/s][A
epoch 2 iter 559: train loss 0.49438. lr 6.000000e-05:  74%|███████▎  | 560/761 [04:16<01:39,  2.02it/s][A
epoch 2 iter 560: train loss 0.49229. lr 6.000000e-05:  74%|███████▎  | 560/761 [04:17<01:39,  2.02it/s][A
epoch 2 iter 560: train loss 0.49229. lr 6.000000e-05:  74%|███████▎  | 561/761 [04:17<01:36,  2.06it/s][A
epoch 2 iter 561: train loss 0.49112. lr 6.000000e-05:  74%|███████▎  | 561/761 [04:17<01:36,  2.06it/s][A
epoch 2 iter 561: train loss

epoch 2 iter 594: train loss 0.47962. lr 6.000000e-05:  78%|███████▊  | 595/761 [04:31<01:09,  2.40it/s][A
epoch 2 iter 595: train loss 0.47735. lr 6.000000e-05:  78%|███████▊  | 595/761 [04:31<01:09,  2.40it/s][A
epoch 2 iter 595: train loss 0.47735. lr 6.000000e-05:  78%|███████▊  | 596/761 [04:31<01:08,  2.42it/s][A
epoch 2 iter 596: train loss 0.48081. lr 6.000000e-05:  78%|███████▊  | 596/761 [04:32<01:08,  2.42it/s][A
epoch 2 iter 596: train loss 0.48081. lr 6.000000e-05:  78%|███████▊  | 597/761 [04:32<01:07,  2.44it/s][A
epoch 2 iter 597: train loss 0.48483. lr 6.000000e-05:  78%|███████▊  | 597/761 [04:32<01:07,  2.44it/s][A
epoch 2 iter 597: train loss 0.48483. lr 6.000000e-05:  79%|███████▊  | 598/761 [04:32<01:06,  2.44it/s][A
epoch 2 iter 598: train loss 0.48316. lr 6.000000e-05:  79%|███████▊  | 598/761 [04:32<01:06,  2.44it/s][A
epoch 2 iter 598: train loss 0.48316. lr 6.000000e-05:  79%|███████▊  | 599/761 [04:32<01:06,  2.44it/s][A
epoch 2 iter 599: train loss

epoch 2 iter 632: train loss 0.46545. lr 6.000000e-05:  83%|████████▎ | 632/761 [04:46<00:52,  2.44it/s][A
epoch 2 iter 632: train loss 0.46545. lr 6.000000e-05:  83%|████████▎ | 633/761 [04:46<00:51,  2.47it/s][A
epoch 2 iter 633: train loss 0.46746. lr 6.000000e-05:  83%|████████▎ | 633/761 [04:47<00:51,  2.47it/s][A
epoch 2 iter 633: train loss 0.46746. lr 6.000000e-05:  83%|████████▎ | 634/761 [04:47<00:49,  2.56it/s][A
epoch 2 iter 634: train loss 0.47144. lr 6.000000e-05:  83%|████████▎ | 634/761 [04:47<00:49,  2.56it/s][A
epoch 2 iter 634: train loss 0.47144. lr 6.000000e-05:  83%|████████▎ | 635/761 [04:47<00:48,  2.59it/s][A
epoch 2 iter 635: train loss 0.45842. lr 6.000000e-05:  83%|████████▎ | 635/761 [04:48<00:48,  2.59it/s][A
epoch 2 iter 635: train loss 0.45842. lr 6.000000e-05:  84%|████████▎ | 636/761 [04:48<00:48,  2.59it/s][A
epoch 2 iter 636: train loss 0.47239. lr 6.000000e-05:  84%|████████▎ | 636/761 [04:48<00:48,  2.59it/s][A
epoch 2 iter 636: train loss

epoch 2 iter 669: train loss 0.45772. lr 6.000000e-05:  88%|████████▊ | 670/761 [05:04<00:37,  2.40it/s][A
epoch 2 iter 670: train loss 0.45450. lr 6.000000e-05:  88%|████████▊ | 670/761 [05:05<00:37,  2.40it/s][A
epoch 2 iter 670: train loss 0.45450. lr 6.000000e-05:  88%|████████▊ | 671/761 [05:05<00:37,  2.38it/s][A
epoch 2 iter 671: train loss 0.45067. lr 6.000000e-05:  88%|████████▊ | 671/761 [05:05<00:37,  2.38it/s][A
epoch 2 iter 671: train loss 0.45067. lr 6.000000e-05:  88%|████████▊ | 672/761 [05:05<00:38,  2.33it/s][A
epoch 2 iter 672: train loss 0.45656. lr 6.000000e-05:  88%|████████▊ | 672/761 [05:06<00:38,  2.33it/s][A
epoch 2 iter 672: train loss 0.45656. lr 6.000000e-05:  88%|████████▊ | 673/761 [05:06<00:38,  2.30it/s][A
epoch 2 iter 673: train loss 0.45030. lr 6.000000e-05:  88%|████████▊ | 673/761 [05:06<00:38,  2.30it/s][A
epoch 2 iter 673: train loss 0.45030. lr 6.000000e-05:  89%|████████▊ | 674/761 [05:06<00:38,  2.28it/s][A
epoch 2 iter 674: train loss

epoch 2 iter 707: train loss 0.43879. lr 6.000000e-05:  93%|█████████▎| 707/761 [05:21<00:22,  2.35it/s][A
epoch 2 iter 707: train loss 0.43879. lr 6.000000e-05:  93%|█████████▎| 708/761 [05:21<00:22,  2.38it/s][A
epoch 2 iter 708: train loss 0.43932. lr 6.000000e-05:  93%|█████████▎| 708/761 [05:22<00:22,  2.38it/s][A
epoch 2 iter 708: train loss 0.43932. lr 6.000000e-05:  93%|█████████▎| 709/761 [05:22<00:27,  1.88it/s][A
epoch 2 iter 709: train loss 0.44472. lr 6.000000e-05:  93%|█████████▎| 709/761 [05:23<00:27,  1.88it/s][A
epoch 2 iter 709: train loss 0.44472. lr 6.000000e-05:  93%|█████████▎| 710/761 [05:23<00:29,  1.72it/s][A
epoch 2 iter 710: train loss 0.44181. lr 6.000000e-05:  93%|█████████▎| 710/761 [05:23<00:29,  1.72it/s][A
epoch 2 iter 710: train loss 0.44181. lr 6.000000e-05:  93%|█████████▎| 711/761 [05:23<00:29,  1.71it/s][A
epoch 2 iter 711: train loss 0.44158. lr 6.000000e-05:  93%|█████████▎| 711/761 [05:24<00:29,  1.71it/s][A
epoch 2 iter 711: train loss

epoch 2 iter 744: train loss 0.43377. lr 6.000000e-05:  98%|█████████▊| 745/761 [05:39<00:06,  2.54it/s][A
epoch 2 iter 745: train loss 0.43314. lr 6.000000e-05:  98%|█████████▊| 745/761 [05:39<00:06,  2.54it/s][A
epoch 2 iter 745: train loss 0.43314. lr 6.000000e-05:  98%|█████████▊| 746/761 [05:39<00:05,  2.54it/s][A
epoch 2 iter 746: train loss 0.43135. lr 6.000000e-05:  98%|█████████▊| 746/761 [05:40<00:05,  2.54it/s][A
epoch 2 iter 746: train loss 0.43135. lr 6.000000e-05:  98%|█████████▊| 747/761 [05:40<00:05,  2.51it/s][A
epoch 2 iter 747: train loss 0.42692. lr 6.000000e-05:  98%|█████████▊| 747/761 [05:40<00:05,  2.51it/s][A
epoch 2 iter 747: train loss 0.42692. lr 6.000000e-05:  98%|█████████▊| 748/761 [05:40<00:05,  2.49it/s][A
epoch 2 iter 748: train loss 0.43486. lr 6.000000e-05:  98%|█████████▊| 748/761 [05:40<00:05,  2.49it/s][A
epoch 2 iter 748: train loss 0.43486. lr 6.000000e-05:  98%|█████████▊| 749/761 [05:40<00:04,  2.48it/s][A
epoch 2 iter 749: train loss

data has 1559798 characters, 168 unique.



  0%|          | 0/3047 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.19066. lr 6.000000e-04:   0%|          | 0/3047 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.19066. lr 6.000000e-04:   0%|          | 1/3047 [00:00<21:17,  2.39it/s][A
epoch 1 iter 1: train loss 4.14485. lr 5.999999e-04:   0%|          | 1/3047 [00:00<21:17,  2.39it/s][A
epoch 1 iter 1: train loss 4.14485. lr 5.999999e-04:   0%|          | 2/3047 [00:00<20:24,  2.49it/s][A
epoch 1 iter 2: train loss 3.86063. lr 5.999997e-04:   0%|          | 2/3047 [00:01<20:24,  2.49it/s][A
epoch 1 iter 2: train loss 3.86063. lr 5.999997e-04:   0%|          | 3/3047 [00:01<20:03,  2.53it/s][A
epoch 1 iter 3: train loss 3.54645. lr 5.999994e-04:   0%|          | 3/3047 [00:01<20:03,  2.53it/s][A
epoch 1 iter 3: train loss 3.54645. lr 5.999994e-04:   0%|          | 4/3047 [00:01<20:45,  2.44it/s][A
epoch 1 iter 4: train loss 3.51769. lr 5.999991e-04:   0%|          | 4/3047 [00:02<20:45,  2.44it/s][A
epoch 1 iter 4: tr

epoch 1 iter 38: train loss 2.73612. lr 5.999398e-04:   1%|          | 38/3047 [00:17<20:56,  2.39it/s][A
epoch 1 iter 38: train loss 2.73612. lr 5.999398e-04:   1%|▏         | 39/3047 [00:17<20:38,  2.43it/s][A
epoch 1 iter 39: train loss 2.71499. lr 5.999367e-04:   1%|▏         | 39/3047 [00:17<20:38,  2.43it/s][A
epoch 1 iter 39: train loss 2.71499. lr 5.999367e-04:   1%|▏         | 40/3047 [00:17<20:15,  2.47it/s][A
epoch 1 iter 40: train loss 2.69895. lr 5.999335e-04:   1%|▏         | 40/3047 [00:17<20:15,  2.47it/s][A
epoch 1 iter 40: train loss 2.69895. lr 5.999335e-04:   1%|▏         | 41/3047 [00:17<20:36,  2.43it/s][A
epoch 1 iter 41: train loss 2.69368. lr 5.999302e-04:   1%|▏         | 41/3047 [00:18<20:36,  2.43it/s][A
epoch 1 iter 41: train loss 2.69368. lr 5.999302e-04:   1%|▏         | 42/3047 [00:18<20:47,  2.41it/s][A
epoch 1 iter 42: train loss 2.68915. lr 5.999268e-04:   1%|▏         | 42/3047 [00:18<20:47,  2.41it/s][A
epoch 1 iter 42: train loss 2.68915. 

epoch 1 iter 76: train loss 2.51345. lr 5.997645e-04:   2%|▏         | 76/3047 [00:33<21:21,  2.32it/s][A
epoch 1 iter 76: train loss 2.51345. lr 5.997645e-04:   3%|▎         | 77/3047 [00:33<21:14,  2.33it/s][A
epoch 1 iter 77: train loss 2.51294. lr 5.997583e-04:   3%|▎         | 77/3047 [00:34<21:14,  2.33it/s][A
epoch 1 iter 77: train loss 2.51294. lr 5.997583e-04:   3%|▎         | 78/3047 [00:34<34:38,  1.43it/s][A
epoch 1 iter 78: train loss 2.51258. lr 5.997521e-04:   3%|▎         | 78/3047 [00:35<34:38,  1.43it/s][A
epoch 1 iter 78: train loss 2.51258. lr 5.997521e-04:   3%|▎         | 79/3047 [00:35<36:27,  1.36it/s][A
epoch 1 iter 79: train loss 2.51643. lr 5.997458e-04:   3%|▎         | 79/3047 [00:36<36:27,  1.36it/s][A
epoch 1 iter 79: train loss 2.51643. lr 5.997458e-04:   3%|▎         | 80/3047 [00:36<37:36,  1.31it/s][A
epoch 1 iter 80: train loss 2.50260. lr 5.997393e-04:   3%|▎         | 80/3047 [00:37<37:36,  1.31it/s][A
epoch 1 iter 80: train loss 2.50260. 

epoch 1 iter 114: train loss 2.45947. lr 5.994741e-04:   4%|▎         | 114/3047 [00:51<19:43,  2.48it/s][A
epoch 1 iter 114: train loss 2.45947. lr 5.994741e-04:   4%|▍         | 115/3047 [00:51<20:29,  2.39it/s][A
epoch 1 iter 115: train loss 2.43469. lr 5.994649e-04:   4%|▍         | 115/3047 [00:52<20:29,  2.39it/s][A
epoch 1 iter 115: train loss 2.43469. lr 5.994649e-04:   4%|▍         | 116/3047 [00:52<20:50,  2.34it/s][A
epoch 1 iter 116: train loss 2.43793. lr 5.994556e-04:   4%|▍         | 116/3047 [00:52<20:50,  2.34it/s][A
epoch 1 iter 116: train loss 2.43793. lr 5.994556e-04:   4%|▍         | 117/3047 [00:52<20:17,  2.41it/s][A
epoch 1 iter 117: train loss 2.43028. lr 5.994463e-04:   4%|▍         | 117/3047 [00:52<20:17,  2.41it/s][A
epoch 1 iter 117: train loss 2.43028. lr 5.994463e-04:   4%|▍         | 118/3047 [00:52<19:29,  2.50it/s][A
epoch 1 iter 118: train loss 2.45024. lr 5.994368e-04:   4%|▍         | 118/3047 [00:53<19:29,  2.50it/s][A
epoch 1 iter 118: t

epoch 1 iter 151: train loss 2.32257. lr 5.990808e-04:   5%|▍         | 152/3047 [01:09<20:58,  2.30it/s][A
epoch 1 iter 152: train loss 2.30081. lr 5.990687e-04:   5%|▍         | 152/3047 [01:09<20:58,  2.30it/s][A
epoch 1 iter 152: train loss 2.30081. lr 5.990687e-04:   5%|▌         | 153/3047 [01:09<20:52,  2.31it/s][A
epoch 1 iter 153: train loss 2.33300. lr 5.990565e-04:   5%|▌         | 153/3047 [01:10<20:52,  2.31it/s][A
epoch 1 iter 153: train loss 2.33300. lr 5.990565e-04:   5%|▌         | 154/3047 [01:10<20:52,  2.31it/s][A
epoch 1 iter 154: train loss 2.32234. lr 5.990442e-04:   5%|▌         | 154/3047 [01:10<20:52,  2.31it/s][A
epoch 1 iter 154: train loss 2.32234. lr 5.990442e-04:   5%|▌         | 155/3047 [01:10<20:55,  2.30it/s][A
epoch 1 iter 155: train loss 2.30626. lr 5.990318e-04:   5%|▌         | 155/3047 [01:11<20:55,  2.30it/s][A
epoch 1 iter 155: train loss 2.30626. lr 5.990318e-04:   5%|▌         | 156/3047 [01:11<20:55,  2.30it/s][A
epoch 1 iter 156: t

epoch 1 iter 189: train loss 2.12330. lr 5.985636e-04:   6%|▌         | 189/3047 [01:25<20:17,  2.35it/s][A
epoch 1 iter 189: train loss 2.12330. lr 5.985636e-04:   6%|▌         | 190/3047 [01:25<19:49,  2.40it/s][A
epoch 1 iter 190: train loss 2.11601. lr 5.985484e-04:   6%|▌         | 190/3047 [01:26<19:49,  2.40it/s][A
epoch 1 iter 190: train loss 2.11601. lr 5.985484e-04:   6%|▋         | 191/3047 [01:26<35:11,  1.35it/s][A
epoch 1 iter 191: train loss 2.11520. lr 5.985332e-04:   6%|▋         | 191/3047 [01:27<35:11,  1.35it/s][A
epoch 1 iter 191: train loss 2.11520. lr 5.985332e-04:   6%|▋         | 192/3047 [01:27<31:58,  1.49it/s][A
epoch 1 iter 192: train loss 2.09318. lr 5.985179e-04:   6%|▋         | 192/3047 [01:27<31:58,  1.49it/s][A
epoch 1 iter 192: train loss 2.09318. lr 5.985179e-04:   6%|▋         | 193/3047 [01:27<29:41,  1.60it/s][A
epoch 1 iter 193: train loss 2.09226. lr 5.985025e-04:   6%|▋         | 193/3047 [01:28<29:41,  1.60it/s][A
epoch 1 iter 193: t

epoch 1 iter 226: train loss 1.89301. lr 5.979498e-04:   7%|▋         | 227/3047 [01:43<22:13,  2.11it/s][A
epoch 1 iter 227: train loss 1.86370. lr 5.979318e-04:   7%|▋         | 227/3047 [01:43<22:13,  2.11it/s][A
epoch 1 iter 227: train loss 1.86370. lr 5.979318e-04:   7%|▋         | 228/3047 [01:43<22:02,  2.13it/s][A
epoch 1 iter 228: train loss 1.86658. lr 5.979136e-04:   7%|▋         | 228/3047 [01:44<22:02,  2.13it/s][A
epoch 1 iter 228: train loss 1.86658. lr 5.979136e-04:   8%|▊         | 229/3047 [01:44<23:02,  2.04it/s][A
epoch 1 iter 229: train loss 1.86692. lr 5.978953e-04:   8%|▊         | 229/3047 [01:44<23:02,  2.04it/s][A
epoch 1 iter 229: train loss 1.86692. lr 5.978953e-04:   8%|▊         | 230/3047 [01:44<22:34,  2.08it/s][A
epoch 1 iter 230: train loss 1.86582. lr 5.978770e-04:   8%|▊         | 230/3047 [01:45<22:34,  2.08it/s][A
epoch 1 iter 230: train loss 1.86582. lr 5.978770e-04:   8%|▊         | 231/3047 [01:45<22:02,  2.13it/s][A
epoch 1 iter 231: t

epoch 1 iter 264: train loss 1.67930. lr 5.972066e-04:   9%|▊         | 264/3047 [01:59<18:46,  2.47it/s][A
epoch 1 iter 264: train loss 1.67930. lr 5.972066e-04:   9%|▊         | 265/3047 [01:59<20:09,  2.30it/s][A
epoch 1 iter 265: train loss 1.66336. lr 5.971855e-04:   9%|▊         | 265/3047 [01:59<20:09,  2.30it/s][A
epoch 1 iter 265: train loss 1.66336. lr 5.971855e-04:   9%|▊         | 266/3047 [01:59<21:26,  2.16it/s][A
epoch 1 iter 266: train loss 1.65199. lr 5.971643e-04:   9%|▊         | 266/3047 [02:00<21:26,  2.16it/s][A
epoch 1 iter 266: train loss 1.65199. lr 5.971643e-04:   9%|▉         | 267/3047 [02:00<22:16,  2.08it/s][A
epoch 1 iter 267: train loss 1.64772. lr 5.971431e-04:   9%|▉         | 267/3047 [02:00<22:16,  2.08it/s][A
epoch 1 iter 267: train loss 1.64772. lr 5.971431e-04:   9%|▉         | 268/3047 [02:00<22:25,  2.07it/s][A
epoch 1 iter 268: train loss 1.63952. lr 5.971217e-04:   9%|▉         | 268/3047 [02:01<22:25,  2.07it/s][A
epoch 1 iter 268: t

epoch 1 iter 301: train loss 1.51987. lr 5.963733e-04:  10%|▉         | 302/3047 [02:15<22:32,  2.03it/s][A
epoch 1 iter 302: train loss 1.51657. lr 5.963492e-04:  10%|▉         | 302/3047 [02:16<22:32,  2.03it/s][A
epoch 1 iter 302: train loss 1.51657. lr 5.963492e-04:  10%|▉         | 303/3047 [02:16<22:25,  2.04it/s][A
epoch 1 iter 303: train loss 1.50061. lr 5.963251e-04:  10%|▉         | 303/3047 [02:16<22:25,  2.04it/s][A
epoch 1 iter 303: train loss 1.50061. lr 5.963251e-04:  10%|▉         | 304/3047 [02:16<21:08,  2.16it/s][A
epoch 1 iter 304: train loss 1.49171. lr 5.963010e-04:  10%|▉         | 304/3047 [02:16<21:08,  2.16it/s][A
epoch 1 iter 304: train loss 1.49171. lr 5.963010e-04:  10%|█         | 305/3047 [02:16<20:12,  2.26it/s][A
epoch 1 iter 305: train loss 1.49191. lr 5.962767e-04:  10%|█         | 305/3047 [02:17<20:12,  2.26it/s][A
epoch 1 iter 305: train loss 1.49191. lr 5.962767e-04:  10%|█         | 306/3047 [02:17<20:07,  2.27it/s][A
epoch 1 iter 306: t

epoch 1 iter 339: train loss 1.35968. lr 5.954051e-04:  11%|█         | 339/3047 [02:34<19:30,  2.31it/s][A
epoch 1 iter 339: train loss 1.35968. lr 5.954051e-04:  11%|█         | 340/3047 [02:34<19:23,  2.33it/s][A
epoch 1 iter 340: train loss 1.39959. lr 5.953781e-04:  11%|█         | 340/3047 [02:35<19:23,  2.33it/s][A
epoch 1 iter 340: train loss 1.39959. lr 5.953781e-04:  11%|█         | 341/3047 [02:35<19:14,  2.34it/s][A
epoch 1 iter 341: train loss 1.34277. lr 5.953510e-04:  11%|█         | 341/3047 [02:35<19:14,  2.34it/s][A
epoch 1 iter 341: train loss 1.34277. lr 5.953510e-04:  11%|█         | 342/3047 [02:35<19:15,  2.34it/s][A
epoch 1 iter 342: train loss 1.36164. lr 5.953238e-04:  11%|█         | 342/3047 [02:36<19:15,  2.34it/s][A
epoch 1 iter 342: train loss 1.36164. lr 5.953238e-04:  11%|█▏        | 343/3047 [02:36<19:11,  2.35it/s][A
epoch 1 iter 343: train loss 1.32322. lr 5.952966e-04:  11%|█▏        | 343/3047 [02:36<19:11,  2.35it/s][A
epoch 1 iter 343: t

epoch 1 iter 376: train loss 1.24859. lr 5.943534e-04:  12%|█▏        | 377/3047 [02:51<20:27,  2.18it/s][A
epoch 1 iter 377: train loss 1.24489. lr 5.943235e-04:  12%|█▏        | 377/3047 [02:52<20:27,  2.18it/s][A
epoch 1 iter 377: train loss 1.24489. lr 5.943235e-04:  12%|█▏        | 378/3047 [02:52<20:08,  2.21it/s][A
epoch 1 iter 378: train loss 1.25282. lr 5.942935e-04:  12%|█▏        | 378/3047 [02:52<20:08,  2.21it/s][A
epoch 1 iter 378: train loss 1.25282. lr 5.942935e-04:  12%|█▏        | 379/3047 [02:52<19:54,  2.23it/s][A
epoch 1 iter 379: train loss 1.24354. lr 5.942635e-04:  12%|█▏        | 379/3047 [02:52<19:54,  2.23it/s][A
epoch 1 iter 379: train loss 1.24354. lr 5.942635e-04:  12%|█▏        | 380/3047 [02:52<19:23,  2.29it/s][A
epoch 1 iter 380: train loss 1.23902. lr 5.942333e-04:  12%|█▏        | 380/3047 [02:53<19:23,  2.29it/s][A
epoch 1 iter 380: train loss 1.23902. lr 5.942333e-04:  13%|█▎        | 381/3047 [02:53<18:25,  2.41it/s][A
epoch 1 iter 381: t

epoch 1 iter 414: train loss 1.19430. lr 5.931618e-04:  14%|█▎        | 414/3047 [03:07<18:51,  2.33it/s][A
epoch 1 iter 414: train loss 1.19430. lr 5.931618e-04:  14%|█▎        | 415/3047 [03:07<18:50,  2.33it/s][A
epoch 1 iter 415: train loss 1.19361. lr 5.931289e-04:  14%|█▎        | 415/3047 [03:07<18:50,  2.33it/s][A
epoch 1 iter 415: train loss 1.19361. lr 5.931289e-04:  14%|█▎        | 416/3047 [03:07<18:44,  2.34it/s][A
epoch 1 iter 416: train loss 1.15408. lr 5.930960e-04:  14%|█▎        | 416/3047 [03:08<18:44,  2.34it/s][A
epoch 1 iter 416: train loss 1.15408. lr 5.930960e-04:  14%|█▎        | 417/3047 [03:08<18:40,  2.35it/s][A
epoch 1 iter 417: train loss 1.16126. lr 5.930630e-04:  14%|█▎        | 417/3047 [03:08<18:40,  2.35it/s][A
epoch 1 iter 417: train loss 1.16126. lr 5.930630e-04:  14%|█▎        | 418/3047 [03:08<18:39,  2.35it/s][A
epoch 1 iter 418: train loss 1.18416. lr 5.930298e-04:  14%|█▎        | 418/3047 [03:09<18:39,  2.35it/s][A
epoch 1 iter 418: t

epoch 1 iter 451: train loss 1.12311. lr 5.918934e-04:  15%|█▍        | 452/3047 [03:23<19:52,  2.18it/s][A
epoch 1 iter 452: train loss 1.11244. lr 5.918576e-04:  15%|█▍        | 452/3047 [03:24<19:52,  2.18it/s][A
epoch 1 iter 452: train loss 1.11244. lr 5.918576e-04:  15%|█▍        | 453/3047 [03:24<19:42,  2.19it/s][A
epoch 1 iter 453: train loss 1.12488. lr 5.918218e-04:  15%|█▍        | 453/3047 [03:25<19:42,  2.19it/s][A
epoch 1 iter 453: train loss 1.12488. lr 5.918218e-04:  15%|█▍        | 454/3047 [03:25<25:05,  1.72it/s][A
epoch 1 iter 454: train loss 1.07389. lr 5.917859e-04:  15%|█▍        | 454/3047 [03:25<25:05,  1.72it/s][A
epoch 1 iter 454: train loss 1.07389. lr 5.917859e-04:  15%|█▍        | 455/3047 [03:25<24:28,  1.77it/s][A
epoch 1 iter 455: train loss 1.09300. lr 5.917499e-04:  15%|█▍        | 455/3047 [03:26<24:28,  1.77it/s][A
epoch 1 iter 455: train loss 1.09300. lr 5.917499e-04:  15%|█▍        | 456/3047 [03:26<23:20,  1.85it/s][A
epoch 1 iter 456: t

epoch 1 iter 489: train loss 1.07270. lr 5.904801e-04:  16%|█▌        | 489/3047 [03:41<17:53,  2.38it/s][A
epoch 1 iter 489: train loss 1.07270. lr 5.904801e-04:  16%|█▌        | 490/3047 [03:41<17:51,  2.39it/s][A
epoch 1 iter 490: train loss 1.07868. lr 5.904414e-04:  16%|█▌        | 490/3047 [03:41<17:51,  2.39it/s][A
epoch 1 iter 490: train loss 1.07868. lr 5.904414e-04:  16%|█▌        | 491/3047 [03:41<17:47,  2.39it/s][A
epoch 1 iter 491: train loss 1.05622. lr 5.904026e-04:  16%|█▌        | 491/3047 [03:41<17:47,  2.39it/s][A
epoch 1 iter 491: train loss 1.05622. lr 5.904026e-04:  16%|█▌        | 492/3047 [03:41<17:45,  2.40it/s][A
epoch 1 iter 492: train loss 1.04559. lr 5.903638e-04:  16%|█▌        | 492/3047 [03:42<17:45,  2.40it/s][A
epoch 1 iter 492: train loss 1.04559. lr 5.903638e-04:  16%|█▌        | 493/3047 [03:42<17:42,  2.40it/s][A
epoch 1 iter 493: train loss 1.03402. lr 5.903248e-04:  16%|█▌        | 493/3047 [03:42<17:42,  2.40it/s][A
epoch 1 iter 493: t

epoch 1 iter 526: train loss 1.00172. lr 5.889968e-04:  17%|█▋        | 527/3047 [03:56<17:42,  2.37it/s][A
epoch 1 iter 527: train loss 1.00691. lr 5.889553e-04:  17%|█▋        | 527/3047 [03:57<17:42,  2.37it/s][A
epoch 1 iter 527: train loss 1.00691. lr 5.889553e-04:  17%|█▋        | 528/3047 [03:57<18:04,  2.32it/s][A
epoch 1 iter 528: train loss 1.01591. lr 5.889136e-04:  17%|█▋        | 528/3047 [03:57<18:04,  2.32it/s][A
epoch 1 iter 528: train loss 1.01591. lr 5.889136e-04:  17%|█▋        | 529/3047 [03:57<17:44,  2.37it/s][A
epoch 1 iter 529: train loss 1.01458. lr 5.888719e-04:  17%|█▋        | 529/3047 [03:58<17:44,  2.37it/s][A
epoch 1 iter 529: train loss 1.01458. lr 5.888719e-04:  17%|█▋        | 530/3047 [03:58<17:29,  2.40it/s][A
epoch 1 iter 530: train loss 1.00290. lr 5.888301e-04:  17%|█▋        | 530/3047 [03:58<17:29,  2.40it/s][A
epoch 1 iter 530: train loss 1.00290. lr 5.888301e-04:  17%|█▋        | 531/3047 [03:58<17:17,  2.43it/s][A
epoch 1 iter 531: t

epoch 1 iter 564: train loss 0.95363. lr 5.873639e-04:  19%|█▊        | 564/3047 [04:15<25:49,  1.60it/s][A
epoch 1 iter 564: train loss 0.95363. lr 5.873639e-04:  19%|█▊        | 565/3047 [04:15<25:15,  1.64it/s][A
epoch 1 iter 565: train loss 0.98001. lr 5.873195e-04:  19%|█▊        | 565/3047 [04:16<25:15,  1.64it/s][A
epoch 1 iter 565: train loss 0.98001. lr 5.873195e-04:  19%|█▊        | 566/3047 [04:16<24:24,  1.69it/s][A
epoch 1 iter 566: train loss 0.97960. lr 5.872749e-04:  19%|█▊        | 566/3047 [04:16<24:24,  1.69it/s][A
epoch 1 iter 566: train loss 0.97960. lr 5.872749e-04:  19%|█▊        | 567/3047 [04:16<23:37,  1.75it/s][A
epoch 1 iter 567: train loss 0.96775. lr 5.872303e-04:  19%|█▊        | 567/3047 [04:17<23:37,  1.75it/s][A
epoch 1 iter 567: train loss 0.96775. lr 5.872303e-04:  19%|█▊        | 568/3047 [04:17<22:42,  1.82it/s][A
epoch 1 iter 568: train loss 0.98101. lr 5.871856e-04:  19%|█▊        | 568/3047 [04:17<22:42,  1.82it/s][A
epoch 1 iter 568: t

epoch 1 iter 601: train loss 0.91974. lr 5.856680e-04:  20%|█▉        | 602/3047 [04:32<17:03,  2.39it/s][A
epoch 1 iter 602: train loss 0.93914. lr 5.856207e-04:  20%|█▉        | 602/3047 [04:32<17:03,  2.39it/s][A
epoch 1 iter 602: train loss 0.93914. lr 5.856207e-04:  20%|█▉        | 603/3047 [04:32<16:55,  2.41it/s][A
epoch 1 iter 603: train loss 0.91267. lr 5.855734e-04:  20%|█▉        | 603/3047 [04:32<16:55,  2.41it/s][A
epoch 1 iter 603: train loss 0.91267. lr 5.855734e-04:  20%|█▉        | 604/3047 [04:32<16:50,  2.42it/s][A
epoch 1 iter 604: train loss 0.92151. lr 5.855259e-04:  20%|█▉        | 604/3047 [04:33<16:50,  2.42it/s][A
epoch 1 iter 604: train loss 0.92151. lr 5.855259e-04:  20%|█▉        | 605/3047 [04:33<16:46,  2.43it/s][A
epoch 1 iter 605: train loss 0.93675. lr 5.854784e-04:  20%|█▉        | 605/3047 [04:33<16:46,  2.43it/s][A
epoch 1 iter 605: train loss 0.93675. lr 5.854784e-04:  20%|█▉        | 606/3047 [04:33<16:46,  2.43it/s][A
epoch 1 iter 606: t

epoch 1 iter 639: train loss 0.89951. lr 5.838180e-04:  21%|██        | 639/3047 [04:49<17:00,  2.36it/s][A
epoch 1 iter 639: train loss 0.89951. lr 5.838180e-04:  21%|██        | 640/3047 [04:49<16:48,  2.39it/s][A
epoch 1 iter 640: train loss 0.90470. lr 5.837678e-04:  21%|██        | 640/3047 [04:49<16:48,  2.39it/s][A
epoch 1 iter 640: train loss 0.90470. lr 5.837678e-04:  21%|██        | 641/3047 [04:49<16:40,  2.40it/s][A
epoch 1 iter 641: train loss 0.89560. lr 5.837176e-04:  21%|██        | 641/3047 [04:50<16:40,  2.40it/s][A
epoch 1 iter 641: train loss 0.89560. lr 5.837176e-04:  21%|██        | 642/3047 [04:50<16:37,  2.41it/s][A
epoch 1 iter 642: train loss 0.91841. lr 5.836673e-04:  21%|██        | 642/3047 [04:50<16:37,  2.41it/s][A
epoch 1 iter 642: train loss 0.91841. lr 5.836673e-04:  21%|██        | 643/3047 [04:50<16:43,  2.40it/s][A
epoch 1 iter 643: train loss 0.90532. lr 5.836169e-04:  21%|██        | 643/3047 [04:51<16:43,  2.40it/s][A

epoch 1 iter 676: train loss 0.85060. lr 5.819120e-04:  22%|██▏       | 677/3047 [05:04<18:22,  2.15it/s][A
epoch 1 iter 677: train loss 0.87610. lr 5.818590e-04:  22%|██▏       | 677/3047 [05:05<18:22,  2.15it/s][A
epoch 1 iter 677: train loss 0.87610. lr 5.818590e-04:  22%|██▏       | 678/3047 [05:05<18:41,  2.11it/s][A
epoch 1 iter 678: train loss 0.85499. lr 5.818060e-04:  22%|██▏       | 678/3047 [05:05<18:41,  2.11it/s][A
epoch 1 iter 678: train loss 0.85499. lr 5.818060e-04:  22%|██▏       | 679/3047 [05:05<18:54,  2.09it/s][A
epoch 1 iter 679: train loss 0.85736. lr 5.817529e-04:  22%|██▏       | 679/3047 [05:06<18:54,  2.09it/s][A
epoch 1 iter 679: train loss 0.85736. lr 5.817529e-04:  22%|██▏       | 680/3047 [05:06<18:45,  2.10it/s][A
epoch 1 iter 680: train loss 0.87447. lr 5.816997e-04:  22%|██▏       | 680/3047 [05:06<18:45,  2.10it/s][A
epoch 1 iter 680: train loss 0.87447. lr 5.816997e-04:  22%|██▏       | 681/3047 [05:06<18:38,  2.12it/s][A

epoch 1 iter 714: train loss 0.85557. lr 5.798476e-04:  23%|██▎       | 714/3047 [05:22<18:50,  2.06it/s][A
epoch 1 iter 714: train loss 0.85557. lr 5.798476e-04:  23%|██▎       | 715/3047 [05:22<18:37,  2.09it/s][A
epoch 1 iter 715: train loss 0.86125. lr 5.797918e-04:  23%|██▎       | 715/3047 [05:23<18:37,  2.09it/s][A
epoch 1 iter 715: train loss 0.86125. lr 5.797918e-04:  23%|██▎       | 716/3047 [05:23<18:16,  2.13it/s][A
epoch 1 iter 716: train loss 0.85448. lr 5.797360e-04:  23%|██▎       | 716/3047 [05:23<18:16,  2.13it/s][A
epoch 1 iter 716: train loss 0.85448. lr 5.797360e-04:  24%|██▎       | 717/3047 [05:23<17:56,  2.16it/s][A
epoch 1 iter 717: train loss 0.85115. lr 5.796800e-04:  24%|██▎       | 717/3047 [05:24<17:56,  2.16it/s][A
epoch 1 iter 717: train loss 0.85115. lr 5.796800e-04:  24%|██▎       | 718/3047 [05:24<17:39,  2.20it/s][A
epoch 1 iter 718: train loss 0.83669. lr 5.796240e-04:  24%|██▎       | 718/3047 [05:24<17:39,  2.20it/s][A

epoch 1 iter 751: train loss 0.79855. lr 5.777343e-04:  25%|██▍       | 752/3047 [05:38<19:35,  1.95it/s][A
epoch 1 iter 752: train loss 0.82327. lr 5.776758e-04:  25%|██▍       | 752/3047 [05:39<19:35,  1.95it/s][A
epoch 1 iter 752: train loss 0.82327. lr 5.776758e-04:  25%|██▍       | 753/3047 [05:39<20:47,  1.84it/s][A
epoch 1 iter 753: train loss 0.82750. lr 5.776172e-04:  25%|██▍       | 753/3047 [05:39<20:47,  1.84it/s][A
epoch 1 iter 753: train loss 0.82750. lr 5.776172e-04:  25%|██▍       | 754/3047 [05:39<21:38,  1.77it/s][A
epoch 1 iter 754: train loss 0.82771. lr 5.775585e-04:  25%|██▍       | 754/3047 [05:40<21:38,  1.77it/s][A
epoch 1 iter 754: train loss 0.82771. lr 5.775585e-04:  25%|██▍       | 755/3047 [05:40<22:12,  1.72it/s][A
epoch 1 iter 755: train loss 0.83230. lr 5.774998e-04:  25%|██▍       | 755/3047 [05:40<22:12,  1.72it/s][A
epoch 1 iter 755: train loss 0.83230. lr 5.774998e-04:  25%|██▍       | 756/3047 [05:40<22:33,  1.69it/s][A

epoch 1 iter 789: train loss 0.80490. lr 5.754587e-04:  26%|██▌       | 789/3047 [05:56<15:42,  2.40it/s][A
epoch 1 iter 789: train loss 0.80490. lr 5.754587e-04:  26%|██▌       | 790/3047 [05:56<15:32,  2.42it/s][A
epoch 1 iter 790: train loss 0.78419. lr 5.753974e-04:  26%|██▌       | 790/3047 [05:57<15:32,  2.42it/s][A
epoch 1 iter 790: train loss 0.78419. lr 5.753974e-04:  26%|██▌       | 791/3047 [05:57<15:24,  2.44it/s][A
epoch 1 iter 791: train loss 0.78598. lr 5.753360e-04:  26%|██▌       | 791/3047 [05:57<15:24,  2.44it/s][A
epoch 1 iter 791: train loss 0.78598. lr 5.753360e-04:  26%|██▌       | 792/3047 [05:57<15:01,  2.50it/s][A
epoch 1 iter 792: train loss 0.77187. lr 5.752745e-04:  26%|██▌       | 792/3047 [05:58<15:01,  2.50it/s][A
epoch 1 iter 792: train loss 0.77187. lr 5.752745e-04:  26%|██▌       | 793/3047 [05:58<15:25,  2.43it/s][A
epoch 1 iter 793: train loss 0.79267. lr 5.752130e-04:  26%|██▌       | 793/3047 [05:58<15:25,  2.43it/s][A

epoch 1 iter 826: train loss 0.75730. lr 5.731413e-04:  27%|██▋       | 827/3047 [06:12<17:34,  2.11it/s][A
epoch 1 iter 827: train loss 0.75307. lr 5.730773e-04:  27%|██▋       | 827/3047 [06:13<17:34,  2.11it/s][A
epoch 1 iter 827: train loss 0.75307. lr 5.730773e-04:  27%|██▋       | 828/3047 [06:13<17:52,  2.07it/s][A
epoch 1 iter 828: train loss 0.76273. lr 5.730132e-04:  27%|██▋       | 828/3047 [06:13<17:52,  2.07it/s][A
epoch 1 iter 828: train loss 0.76273. lr 5.730132e-04:  27%|██▋       | 829/3047 [06:13<18:03,  2.05it/s][A
epoch 1 iter 829: train loss 0.75932. lr 5.729490e-04:  27%|██▋       | 829/3047 [06:14<18:03,  2.05it/s][A
epoch 1 iter 829: train loss 0.75932. lr 5.729490e-04:  27%|██▋       | 830/3047 [06:14<18:09,  2.03it/s][A
epoch 1 iter 830: train loss 0.75604. lr 5.728848e-04:  27%|██▋       | 830/3047 [06:14<18:09,  2.03it/s][A
epoch 1 iter 830: train loss 0.75604. lr 5.728848e-04:  27%|██▋       | 831/3047 [06:14<18:19,  2.02it/s][A

epoch 1 iter 864: train loss 0.73329. lr 5.706578e-04:  28%|██▊       | 864/3047 [06:29<15:04,  2.41it/s][A
epoch 1 iter 864: train loss 0.73329. lr 5.706578e-04:  28%|██▊       | 865/3047 [06:29<15:22,  2.37it/s][A
epoch 1 iter 865: train loss 0.73814. lr 5.705910e-04:  28%|██▊       | 865/3047 [06:30<15:22,  2.37it/s][A
epoch 1 iter 865: train loss 0.73814. lr 5.705910e-04:  28%|██▊       | 866/3047 [06:30<14:44,  2.47it/s][A
epoch 1 iter 866: train loss 0.73007. lr 5.705242e-04:  28%|██▊       | 866/3047 [06:30<14:44,  2.47it/s][A
epoch 1 iter 866: train loss 0.73007. lr 5.705242e-04:  28%|██▊       | 867/3047 [06:30<14:38,  2.48it/s][A
epoch 1 iter 867: train loss 0.72936. lr 5.704573e-04:  28%|██▊       | 867/3047 [06:31<14:38,  2.48it/s][A
epoch 1 iter 867: train loss 0.72936. lr 5.704573e-04:  28%|██▊       | 868/3047 [06:31<14:43,  2.47it/s][A
epoch 1 iter 868: train loss 0.74427. lr 5.703903e-04:  28%|██▊       | 868/3047 [06:31<14:43,  2.47it/s][A

epoch 1 iter 901: train loss 0.70993. lr 5.681398e-04:  30%|██▉       | 902/3047 [06:48<17:51,  2.00it/s][A
epoch 1 iter 902: train loss 0.71771. lr 5.680703e-04:  30%|██▉       | 902/3047 [06:48<17:51,  2.00it/s][A
epoch 1 iter 902: train loss 0.71771. lr 5.680703e-04:  30%|██▉       | 903/3047 [06:48<17:43,  2.02it/s][A
epoch 1 iter 903: train loss 0.70841. lr 5.680009e-04:  30%|██▉       | 903/3047 [06:49<17:43,  2.02it/s][A
epoch 1 iter 903: train loss 0.70841. lr 5.680009e-04:  30%|██▉       | 904/3047 [06:49<17:09,  2.08it/s][A
epoch 1 iter 904: train loss 0.70423. lr 5.679313e-04:  30%|██▉       | 904/3047 [06:49<17:09,  2.08it/s][A
epoch 1 iter 904: train loss 0.70423. lr 5.679313e-04:  30%|██▉       | 905/3047 [06:49<16:37,  2.15it/s][A
epoch 1 iter 905: train loss 0.71976. lr 5.678617e-04:  30%|██▉       | 905/3047 [06:49<16:37,  2.15it/s][A
epoch 1 iter 905: train loss 0.71976. lr 5.678617e-04:  30%|██▉       | 906/3047 [06:49<16:12,  2.20it/s][A

epoch 1 iter 939: train loss 0.71117. lr 5.654521e-04:  31%|███       | 939/3047 [07:04<14:38,  2.40it/s][A
epoch 1 iter 939: train loss 0.71117. lr 5.654521e-04:  31%|███       | 940/3047 [07:04<14:26,  2.43it/s][A
epoch 1 iter 940: train loss 0.69711. lr 5.653800e-04:  31%|███       | 940/3047 [07:05<14:26,  2.43it/s][A
epoch 1 iter 940: train loss 0.69711. lr 5.653800e-04:  31%|███       | 941/3047 [07:05<16:15,  2.16it/s][A
epoch 1 iter 941: train loss 0.68439. lr 5.653078e-04:  31%|███       | 941/3047 [07:05<16:15,  2.16it/s][A
epoch 1 iter 941: train loss 0.68439. lr 5.653078e-04:  31%|███       | 942/3047 [07:05<17:04,  2.05it/s][A
epoch 1 iter 942: train loss 0.68556. lr 5.652356e-04:  31%|███       | 942/3047 [07:06<17:04,  2.05it/s][A
epoch 1 iter 942: train loss 0.68556. lr 5.652356e-04:  31%|███       | 943/3047 [07:06<17:35,  1.99it/s][A
epoch 1 iter 943: train loss 0.67668. lr 5.651632e-04:  31%|███       | 943/3047 [07:06<17:35,  1.99it/s][A

epoch 1 iter 976: train loss 0.66505. lr 5.627372e-04:  32%|███▏      | 977/3047 [07:23<20:29,  1.68it/s][A
epoch 1 iter 977: train loss 0.66322. lr 5.626625e-04:  32%|███▏      | 977/3047 [07:23<20:29,  1.68it/s][A
epoch 1 iter 977: train loss 0.66322. lr 5.626625e-04:  32%|███▏      | 978/3047 [07:23<19:27,  1.77it/s][A
epoch 1 iter 978: train loss 0.67149. lr 5.625877e-04:  32%|███▏      | 978/3047 [07:24<19:27,  1.77it/s][A
epoch 1 iter 978: train loss 0.67149. lr 5.625877e-04:  32%|███▏      | 979/3047 [07:24<18:35,  1.85it/s][A
epoch 1 iter 979: train loss 0.66904. lr 5.625129e-04:  32%|███▏      | 979/3047 [07:24<18:35,  1.85it/s][A
epoch 1 iter 979: train loss 0.66904. lr 5.625129e-04:  32%|███▏      | 980/3047 [07:24<17:51,  1.93it/s][A
epoch 1 iter 980: train loss 0.66560. lr 5.624380e-04:  32%|███▏      | 980/3047 [07:25<17:51,  1.93it/s][A
epoch 1 iter 980: train loss 0.66560. lr 5.624380e-04:  32%|███▏      | 981/3047 [07:25<17:15,  1.99it/s][A

epoch 1 iter 1013: train loss 0.64732. lr 5.599267e-04:  33%|███▎      | 1014/3047 [07:39<14:43,  2.30it/s][A
epoch 1 iter 1014: train loss 0.65122. lr 5.598494e-04:  33%|███▎      | 1014/3047 [07:39<14:43,  2.30it/s][A
epoch 1 iter 1014: train loss 0.65122. lr 5.598494e-04:  33%|███▎      | 1015/3047 [07:39<16:36,  2.04it/s][A
epoch 1 iter 1015: train loss 0.64160. lr 5.597721e-04:  33%|███▎      | 1015/3047 [07:40<16:36,  2.04it/s][A
epoch 1 iter 1015: train loss 0.64160. lr 5.597721e-04:  33%|███▎      | 1016/3047 [07:40<16:39,  2.03it/s][A
epoch 1 iter 1016: train loss 0.65659. lr 5.596946e-04:  33%|███▎      | 1016/3047 [07:40<16:39,  2.03it/s][A
epoch 1 iter 1016: train loss 0.65659. lr 5.596946e-04:  33%|███▎      | 1017/3047 [07:40<16:14,  2.08it/s][A
epoch 1 iter 1017: train loss 0.65672. lr 5.596172e-04:  33%|███▎      | 1017/3047 [07:41<16:14,  2.08it/s][A
epoch 1 iter 1017: train loss 0.65672. lr 5.596172e-04:  33%|███▎      | 1018/3047 [07:41<15:59,  2.12it/s][A

epoch 1 iter 1050: train loss 0.62156. lr 5.570215e-04:  34%|███▍      | 1050/3047 [07:56<16:50,  1.98it/s][A
epoch 1 iter 1050: train loss 0.62156. lr 5.570215e-04:  34%|███▍      | 1051/3047 [07:56<16:28,  2.02it/s][A
epoch 1 iter 1051: train loss 0.63699. lr 5.569417e-04:  34%|███▍      | 1051/3047 [07:56<16:28,  2.02it/s][A
epoch 1 iter 1051: train loss 0.63699. lr 5.569417e-04:  35%|███▍      | 1052/3047 [07:56<16:12,  2.05it/s][A
epoch 1 iter 1052: train loss 0.64387. lr 5.568618e-04:  35%|███▍      | 1052/3047 [07:57<16:12,  2.05it/s][A
epoch 1 iter 1052: train loss 0.64387. lr 5.568618e-04:  35%|███▍      | 1053/3047 [07:57<15:51,  2.10it/s][A
epoch 1 iter 1053: train loss 0.62777. lr 5.567819e-04:  35%|███▍      | 1053/3047 [07:57<15:51,  2.10it/s][A
epoch 1 iter 1053: train loss 0.62777. lr 5.567819e-04:  35%|███▍      | 1054/3047 [07:57<15:38,  2.12it/s][A
epoch 1 iter 1054: train loss 0.62852. lr 5.567019e-04:  35%|███▍      | 1054/3047 [07:57<15:38,  2.12it/s][A

epoch 1 iter 1086: train loss 0.62784. lr 5.541051e-04:  36%|███▌      | 1087/3047 [08:11<13:24,  2.44it/s][A
epoch 1 iter 1087: train loss 0.61446. lr 5.540228e-04:  36%|███▌      | 1087/3047 [08:11<13:24,  2.44it/s][A
epoch 1 iter 1087: train loss 0.61446. lr 5.540228e-04:  36%|███▌      | 1088/3047 [08:11<13:44,  2.38it/s][A
epoch 1 iter 1088: train loss 0.60754. lr 5.539405e-04:  36%|███▌      | 1088/3047 [08:12<13:44,  2.38it/s][A
epoch 1 iter 1088: train loss 0.60754. lr 5.539405e-04:  36%|███▌      | 1089/3047 [08:12<13:38,  2.39it/s][A
epoch 1 iter 1089: train loss 0.61728. lr 5.538581e-04:  36%|███▌      | 1089/3047 [08:12<13:38,  2.39it/s][A
epoch 1 iter 1089: train loss 0.61728. lr 5.538581e-04:  36%|███▌      | 1090/3047 [08:12<13:34,  2.40it/s][A
epoch 1 iter 1090: train loss 0.61819. lr 5.537756e-04:  36%|███▌      | 1090/3047 [08:13<13:34,  2.40it/s][A
epoch 1 iter 1090: train loss 0.61819. lr 5.537756e-04:  36%|███▌      | 1091/3047 [08:13<13:28,  2.42it/s][A

epoch 1 iter 1123: train loss 0.59201. lr 5.510164e-04:  37%|███▋      | 1123/3047 [08:26<13:14,  2.42it/s][A
epoch 1 iter 1123: train loss 0.59201. lr 5.510164e-04:  37%|███▋      | 1124/3047 [08:26<13:13,  2.42it/s][A
epoch 1 iter 1124: train loss 0.58563. lr 5.509317e-04:  37%|███▋      | 1124/3047 [08:27<13:13,  2.42it/s][A
epoch 1 iter 1124: train loss 0.58563. lr 5.509317e-04:  37%|███▋      | 1125/3047 [08:27<13:14,  2.42it/s][A
epoch 1 iter 1125: train loss 0.59019. lr 5.508468e-04:  37%|███▋      | 1125/3047 [08:27<13:14,  2.42it/s][A
epoch 1 iter 1125: train loss 0.59019. lr 5.508468e-04:  37%|███▋      | 1126/3047 [08:27<13:16,  2.41it/s][A
epoch 1 iter 1126: train loss 0.59107. lr 5.507620e-04:  37%|███▋      | 1126/3047 [08:28<13:16,  2.41it/s][A
epoch 1 iter 1126: train loss 0.59107. lr 5.507620e-04:  37%|███▋      | 1127/3047 [08:28<13:13,  2.42it/s][A
epoch 1 iter 1127: train loss 0.58513. lr 5.506770e-04:  37%|███▋      | 1127/3047 [08:28<13:13,  2.42it/s][A

epoch 1 iter 1159: train loss 0.58075. lr 5.479235e-04:  38%|███▊      | 1160/3047 [08:43<15:52,  1.98it/s][A
epoch 1 iter 1160: train loss 0.57433. lr 5.478364e-04:  38%|███▊      | 1160/3047 [08:43<15:52,  1.98it/s][A
epoch 1 iter 1160: train loss 0.57433. lr 5.478364e-04:  38%|███▊      | 1161/3047 [08:43<16:00,  1.96it/s][A
epoch 1 iter 1161: train loss 0.56070. lr 5.477491e-04:  38%|███▊      | 1161/3047 [08:44<16:00,  1.96it/s][A
epoch 1 iter 1161: train loss 0.56070. lr 5.477491e-04:  38%|███▊      | 1162/3047 [08:44<16:13,  1.94it/s][A
epoch 1 iter 1162: train loss 0.57060. lr 5.476619e-04:  38%|███▊      | 1162/3047 [08:44<16:13,  1.94it/s][A
epoch 1 iter 1162: train loss 0.57060. lr 5.476619e-04:  38%|███▊      | 1163/3047 [08:44<15:59,  1.96it/s][A
epoch 1 iter 1163: train loss 0.56915. lr 5.475745e-04:  38%|███▊      | 1163/3047 [08:45<15:59,  1.96it/s][A
epoch 1 iter 1163: train loss 0.56915. lr 5.475745e-04:  38%|███▊      | 1164/3047 [08:45<15:45,  1.99it/s][A

epoch 1 iter 1196: train loss 0.53991. lr 5.446556e-04:  39%|███▉      | 1196/3047 [09:00<13:23,  2.30it/s][A
epoch 1 iter 1196: train loss 0.53991. lr 5.446556e-04:  39%|███▉      | 1197/3047 [09:00<13:17,  2.32it/s][A
epoch 1 iter 1197: train loss 0.56743. lr 5.445661e-04:  39%|███▉      | 1197/3047 [09:00<13:17,  2.32it/s][A
epoch 1 iter 1197: train loss 0.56743. lr 5.445661e-04:  39%|███▉      | 1198/3047 [09:00<13:13,  2.33it/s][A
epoch 1 iter 1198: train loss 0.55073. lr 5.444764e-04:  39%|███▉      | 1198/3047 [09:01<13:13,  2.33it/s][A
epoch 1 iter 1198: train loss 0.55073. lr 5.444764e-04:  39%|███▉      | 1199/3047 [09:01<13:11,  2.33it/s][A
epoch 1 iter 1199: train loss 0.56689. lr 5.443868e-04:  39%|███▉      | 1199/3047 [09:01<13:11,  2.33it/s][A
epoch 1 iter 1199: train loss 0.56689. lr 5.443868e-04:  39%|███▉      | 1200/3047 [09:01<13:09,  2.34it/s][A
epoch 1 iter 1200: train loss 0.55785. lr 5.442970e-04:  39%|███▉      | 1200/3047 [09:02<13:09,  2.34it/s][A

epoch 1 iter 1232: train loss 0.54095. lr 5.413906e-04:  40%|████      | 1233/3047 [09:15<12:27,  2.43it/s][A
epoch 1 iter 1233: train loss 0.53923. lr 5.412987e-04:  40%|████      | 1233/3047 [09:15<12:27,  2.43it/s][A
epoch 1 iter 1233: train loss 0.53923. lr 5.412987e-04:  40%|████      | 1234/3047 [09:15<12:27,  2.42it/s][A
epoch 1 iter 1234: train loss 0.53621. lr 5.412068e-04:  40%|████      | 1234/3047 [09:15<12:27,  2.42it/s][A
epoch 1 iter 1234: train loss 0.53621. lr 5.412068e-04:  41%|████      | 1235/3047 [09:15<12:30,  2.42it/s][A
epoch 1 iter 1235: train loss 0.52526. lr 5.411147e-04:  41%|████      | 1235/3047 [09:16<12:30,  2.42it/s][A
epoch 1 iter 1235: train loss 0.52526. lr 5.411147e-04:  41%|████      | 1236/3047 [09:16<12:49,  2.35it/s][A
epoch 1 iter 1236: train loss 0.52371. lr 5.410227e-04:  41%|████      | 1236/3047 [09:16<12:49,  2.35it/s][A
epoch 1 iter 1236: train loss 0.52371. lr 5.410227e-04:  41%|████      | 1237/3047 [09:16<12:42,  2.38it/s][A

epoch 1 iter 1269: train loss 0.52484. lr 5.379482e-04:  42%|████▏     | 1269/3047 [09:32<13:09,  2.25it/s][A
epoch 1 iter 1269: train loss 0.52484. lr 5.379482e-04:  42%|████▏     | 1270/3047 [09:32<13:05,  2.26it/s][A
epoch 1 iter 1270: train loss 0.52345. lr 5.378540e-04:  42%|████▏     | 1270/3047 [09:32<13:05,  2.26it/s][A
epoch 1 iter 1270: train loss 0.52345. lr 5.378540e-04:  42%|████▏     | 1271/3047 [09:32<12:29,  2.37it/s][A
epoch 1 iter 1271: train loss 0.51868. lr 5.377596e-04:  42%|████▏     | 1271/3047 [09:33<12:29,  2.37it/s][A
epoch 1 iter 1271: train loss 0.51868. lr 5.377596e-04:  42%|████▏     | 1272/3047 [09:33<11:56,  2.48it/s][A
epoch 1 iter 1272: train loss 0.52090. lr 5.376653e-04:  42%|████▏     | 1272/3047 [09:33<11:56,  2.48it/s][A
epoch 1 iter 1272: train loss 0.52090. lr 5.376653e-04:  42%|████▏     | 1273/3047 [09:33<11:44,  2.52it/s][A
epoch 1 iter 1273: train loss 0.52097. lr 5.375708e-04:  42%|████▏     | 1273/3047 [09:33<11:44,  2.52it/s][A

epoch 1 iter 1305: train loss 0.50696. lr 5.345157e-04:  43%|████▎     | 1306/3047 [09:47<12:10,  2.38it/s][A
epoch 1 iter 1306: train loss 0.50768. lr 5.344192e-04:  43%|████▎     | 1306/3047 [09:47<12:10,  2.38it/s][A
epoch 1 iter 1306: train loss 0.50768. lr 5.344192e-04:  43%|████▎     | 1307/3047 [09:47<11:47,  2.46it/s][A
epoch 1 iter 1307: train loss 0.51898. lr 5.343226e-04:  43%|████▎     | 1307/3047 [09:47<11:47,  2.46it/s][A
epoch 1 iter 1307: train loss 0.51898. lr 5.343226e-04:  43%|████▎     | 1308/3047 [09:47<11:52,  2.44it/s][A
epoch 1 iter 1308: train loss 0.50601. lr 5.342260e-04:  43%|████▎     | 1308/3047 [09:48<11:52,  2.44it/s][A
epoch 1 iter 1308: train loss 0.50601. lr 5.342260e-04:  43%|████▎     | 1309/3047 [09:48<12:11,  2.37it/s][A
epoch 1 iter 1309: train loss 0.50066. lr 5.341293e-04:  43%|████▎     | 1309/3047 [09:48<12:11,  2.37it/s][A
epoch 1 iter 1309: train loss 0.50066. lr 5.341293e-04:  43%|████▎     | 1310/3047 [09:48<11:57,  2.42it/s][A

epoch 1 iter 1342: train loss 0.48590. lr 5.309036e-04:  44%|████▍     | 1342/3047 [10:04<16:24,  1.73it/s][A
epoch 1 iter 1342: train loss 0.48590. lr 5.309036e-04:  44%|████▍     | 1343/3047 [10:04<16:21,  1.74it/s][A
epoch 1 iter 1343: train loss 0.48346. lr 5.308048e-04:  44%|████▍     | 1343/3047 [10:04<16:21,  1.74it/s][A
epoch 1 iter 1343: train loss 0.48346. lr 5.308048e-04:  44%|████▍     | 1344/3047 [10:04<15:56,  1.78it/s][A
epoch 1 iter 1344: train loss 0.48016. lr 5.307060e-04:  44%|████▍     | 1344/3047 [10:05<15:56,  1.78it/s][A
epoch 1 iter 1344: train loss 0.48016. lr 5.307060e-04:  44%|████▍     | 1345/3047 [10:05<15:20,  1.85it/s][A
epoch 1 iter 1345: train loss 0.48609. lr 5.306071e-04:  44%|████▍     | 1345/3047 [10:05<15:20,  1.85it/s][A
epoch 1 iter 1345: train loss 0.48609. lr 5.306071e-04:  44%|████▍     | 1346/3047 [10:05<14:42,  1.93it/s][A
epoch 1 iter 1346: train loss 0.49203. lr 5.305081e-04:  44%|████▍     | 1346/3047 [10:05<14:42,  1.93it/s][A

epoch 1 iter 1378: train loss 0.46747. lr 5.273085e-04:  45%|████▌     | 1379/3047 [10:20<12:27,  2.23it/s][A
epoch 1 iter 1379: train loss 0.48255. lr 5.272075e-04:  45%|████▌     | 1379/3047 [10:21<12:27,  2.23it/s][A
epoch 1 iter 1379: train loss 0.48255. lr 5.272075e-04:  45%|████▌     | 1380/3047 [10:21<12:23,  2.24it/s][A
epoch 1 iter 1380: train loss 0.46819. lr 5.271065e-04:  45%|████▌     | 1380/3047 [10:21<12:23,  2.24it/s][A
epoch 1 iter 1380: train loss 0.46819. lr 5.271065e-04:  45%|████▌     | 1381/3047 [10:21<12:21,  2.25it/s][A
epoch 1 iter 1381: train loss 0.47526. lr 5.270054e-04:  45%|████▌     | 1381/3047 [10:22<12:21,  2.25it/s][A
epoch 1 iter 1381: train loss 0.47526. lr 5.270054e-04:  45%|████▌     | 1382/3047 [10:22<12:13,  2.27it/s][A
epoch 1 iter 1382: train loss 0.47975. lr 5.269042e-04:  45%|████▌     | 1382/3047 [10:22<12:13,  2.27it/s][A
epoch 1 iter 1382: train loss 0.47975. lr 5.269042e-04:  45%|████▌     | 1383/3047 [10:22<12:19,  2.25it/s][A

epoch 1 iter 1415: train loss 0.45033. lr 5.235319e-04:  46%|████▋     | 1415/3047 [10:35<11:07,  2.44it/s][A
epoch 1 iter 1415: train loss 0.45033. lr 5.235319e-04:  46%|████▋     | 1416/3047 [10:35<11:05,  2.45it/s][A
epoch 1 iter 1416: train loss 0.46558. lr 5.234287e-04:  46%|████▋     | 1416/3047 [10:36<11:05,  2.45it/s][A
epoch 1 iter 1416: train loss 0.46558. lr 5.234287e-04:  47%|████▋     | 1417/3047 [10:36<11:03,  2.46it/s][A
epoch 1 iter 1417: train loss 0.46827. lr 5.233254e-04:  47%|████▋     | 1417/3047 [10:36<11:03,  2.46it/s][A
epoch 1 iter 1417: train loss 0.46827. lr 5.233254e-04:  47%|████▋     | 1418/3047 [10:36<11:01,  2.46it/s][A
epoch 1 iter 1418: train loss 0.46371. lr 5.232221e-04:  47%|████▋     | 1418/3047 [10:37<11:01,  2.46it/s][A
epoch 1 iter 1418: train loss 0.46371. lr 5.232221e-04:  47%|████▋     | 1419/3047 [10:37<11:03,  2.45it/s][A
epoch 1 iter 1419: train loss 0.45217. lr 5.231187e-04:  47%|████▋     | 1419/3047 [10:37<11:03,  2.45it/s][A

epoch 1 iter 1451: train loss 0.43826. lr 5.197792e-04:  48%|████▊     | 1452/3047 [10:50<12:23,  2.14it/s][A
epoch 1 iter 1452: train loss 0.44122. lr 5.196739e-04:  48%|████▊     | 1452/3047 [10:51<12:23,  2.14it/s][A
epoch 1 iter 1452: train loss 0.44122. lr 5.196739e-04:  48%|████▊     | 1453/3047 [10:51<12:20,  2.15it/s][A
epoch 1 iter 1453: train loss 0.44340. lr 5.195685e-04:  48%|████▊     | 1453/3047 [10:51<12:20,  2.15it/s][A
epoch 1 iter 1453: train loss 0.44340. lr 5.195685e-04:  48%|████▊     | 1454/3047 [10:51<12:17,  2.16it/s][A
epoch 1 iter 1454: train loss 0.44416. lr 5.194631e-04:  48%|████▊     | 1454/3047 [10:52<12:17,  2.16it/s][A
epoch 1 iter 1454: train loss 0.44416. lr 5.194631e-04:  48%|████▊     | 1455/3047 [10:52<12:09,  2.18it/s][A
epoch 1 iter 1455: train loss 0.43379. lr 5.193576e-04:  48%|████▊     | 1455/3047 [10:52<12:09,  2.18it/s][A
epoch 1 iter 1455: train loss 0.43379. lr 5.193576e-04:  48%|████▊     | 1456/3047 [10:52<12:02,  2.20it/s][A

epoch 1 iter 1488: train loss 0.42740. lr 5.158434e-04:  49%|████▉     | 1488/3047 [11:08<12:06,  2.15it/s][A
epoch 1 iter 1488: train loss 0.42740. lr 5.158434e-04:  49%|████▉     | 1489/3047 [11:08<11:54,  2.18it/s][A
epoch 1 iter 1489: train loss 0.42258. lr 5.157360e-04:  49%|████▉     | 1489/3047 [11:08<11:54,  2.18it/s][A
epoch 1 iter 1489: train loss 0.42258. lr 5.157360e-04:  49%|████▉     | 1490/3047 [11:08<11:53,  2.18it/s][A
epoch 1 iter 1490: train loss 0.42468. lr 5.156284e-04:  49%|████▉     | 1490/3047 [11:09<11:53,  2.18it/s][A
epoch 1 iter 1490: train loss 0.42468. lr 5.156284e-04:  49%|████▉     | 1491/3047 [11:09<12:00,  2.16it/s][A
epoch 1 iter 1491: train loss 0.43280. lr 5.155209e-04:  49%|████▉     | 1491/3047 [11:09<12:00,  2.16it/s][A
epoch 1 iter 1491: train loss 0.43280. lr 5.155209e-04:  49%|████▉     | 1492/3047 [11:09<11:54,  2.18it/s][A
epoch 1 iter 1492: train loss 0.43192. lr 5.154132e-04:  49%|████▉     | 1492/3047 [11:09<11:54,  2.18it/s][A

epoch 1 iter 1524: train loss 0.43800. lr 5.119386e-04:  50%|█████     | 1525/3047 [11:24<10:39,  2.38it/s][A
epoch 1 iter 1525: train loss 0.41395. lr 5.118291e-04:  50%|█████     | 1525/3047 [11:25<10:39,  2.38it/s][A
epoch 1 iter 1525: train loss 0.41395. lr 5.118291e-04:  50%|█████     | 1526/3047 [11:25<10:38,  2.38it/s][A
epoch 1 iter 1526: train loss 0.41166. lr 5.117195e-04:  50%|█████     | 1526/3047 [11:25<10:38,  2.38it/s][A
epoch 1 iter 1526: train loss 0.41166. lr 5.117195e-04:  50%|█████     | 1527/3047 [11:25<10:38,  2.38it/s][A
epoch 1 iter 1527: train loss 0.42415. lr 5.116099e-04:  50%|█████     | 1527/3047 [11:25<10:38,  2.38it/s][A
epoch 1 iter 1527: train loss 0.42415. lr 5.116099e-04:  50%|█████     | 1528/3047 [11:25<10:21,  2.44it/s][A
epoch 1 iter 1528: train loss 0.42088. lr 5.115002e-04:  50%|█████     | 1528/3047 [11:26<10:21,  2.44it/s][A
epoch 1 iter 1528: train loss 0.42088. lr 5.115002e-04:  50%|█████     | 1529/3047 [11:26<09:59,  2.53it/s][A

epoch 1 iter 1561: train loss 0.39660. lr 5.078492e-04:  51%|█████     | 1561/3047 [11:40<13:22,  1.85it/s][A
epoch 1 iter 1561: train loss 0.39660. lr 5.078492e-04:  51%|█████▏    | 1562/3047 [11:40<14:10,  1.75it/s][A
epoch 1 iter 1562: train loss 0.40496. lr 5.077376e-04:  51%|█████▏    | 1562/3047 [11:41<14:10,  1.75it/s][A
epoch 1 iter 1562: train loss 0.40496. lr 5.077376e-04:  51%|█████▏    | 1563/3047 [11:41<14:42,  1.68it/s][A
epoch 1 iter 1563: train loss 0.40609. lr 5.076260e-04:  51%|█████▏    | 1563/3047 [11:41<14:42,  1.68it/s][A
epoch 1 iter 1563: train loss 0.40609. lr 5.076260e-04:  51%|█████▏    | 1564/3047 [11:41<15:06,  1.64it/s][A
epoch 1 iter 1564: train loss 0.41228. lr 5.075143e-04:  51%|█████▏    | 1564/3047 [11:42<15:06,  1.64it/s][A
epoch 1 iter 1564: train loss 0.41228. lr 5.075143e-04:  51%|█████▏    | 1565/3047 [11:42<15:17,  1.62it/s][A
epoch 1 iter 1565: train loss 0.40214. lr 5.074025e-04:  51%|█████▏    | 1565/3047 [11:43<15:17,  1.62it/s][A

epoch 1 iter 1597: train loss 0.40327. lr 5.037976e-04:  52%|█████▏    | 1598/3047 [11:57<09:41,  2.49it/s][A
epoch 1 iter 1598: train loss 0.40478. lr 5.036841e-04:  52%|█████▏    | 1598/3047 [11:58<09:41,  2.49it/s][A
epoch 1 iter 1598: train loss 0.40478. lr 5.036841e-04:  52%|█████▏    | 1599/3047 [11:58<09:40,  2.50it/s][A
epoch 1 iter 1599: train loss 0.38323. lr 5.035705e-04:  52%|█████▏    | 1599/3047 [11:58<09:40,  2.50it/s][A
epoch 1 iter 1599: train loss 0.38323. lr 5.035705e-04:  53%|█████▎    | 1600/3047 [11:58<09:40,  2.49it/s][A
epoch 1 iter 1600: train loss 0.39512. lr 5.034568e-04:  53%|█████▎    | 1600/3047 [11:59<09:40,  2.49it/s][A
epoch 1 iter 1600: train loss 0.39512. lr 5.034568e-04:  53%|█████▎    | 1601/3047 [11:59<09:39,  2.49it/s][A
epoch 1 iter 1601: train loss 0.38556. lr 5.033431e-04:  53%|█████▎    | 1601/3047 [11:59<09:39,  2.49it/s][A
epoch 1 iter 1601: train loss 0.38556. lr 5.033431e-04:  53%|█████▎    | 1602/3047 [11:59<09:36,  2.51it/s][A

epoch 1 iter 1634: train loss 0.37478. lr 4.995604e-04:  54%|█████▎    | 1634/3047 [12:14<13:59,  1.68it/s][A
epoch 1 iter 1634: train loss 0.37478. lr 4.995604e-04:  54%|█████▎    | 1635/3047 [12:14<13:54,  1.69it/s][A
epoch 1 iter 1635: train loss 0.39311. lr 4.994449e-04:  54%|█████▎    | 1635/3047 [12:14<13:54,  1.69it/s][A
epoch 1 iter 1635: train loss 0.39311. lr 4.994449e-04:  54%|█████▎    | 1636/3047 [12:14<12:30,  1.88it/s][A
epoch 1 iter 1636: train loss 0.38900. lr 4.993293e-04:  54%|█████▎    | 1636/3047 [12:14<12:30,  1.88it/s][A
epoch 1 iter 1636: train loss 0.38900. lr 4.993293e-04:  54%|█████▎    | 1637/3047 [12:14<11:33,  2.03it/s][A
epoch 1 iter 1637: train loss 0.37838. lr 4.992136e-04:  54%|█████▎    | 1637/3047 [12:15<11:33,  2.03it/s][A
epoch 1 iter 1637: train loss 0.37838. lr 4.992136e-04:  54%|█████▍    | 1638/3047 [12:15<11:01,  2.13it/s][A
epoch 1 iter 1638: train loss 0.38187. lr 4.990979e-04:  54%|█████▍    | 1638/3047 [12:15<11:01,  2.13it/s][A

epoch 1 iter 1670: train loss 0.36743. lr 4.953679e-04:  55%|█████▍    | 1671/3047 [12:31<13:41,  1.68it/s][A
epoch 1 iter 1671: train loss 0.38208. lr 4.952505e-04:  55%|█████▍    | 1671/3047 [12:32<13:41,  1.68it/s][A
epoch 1 iter 1671: train loss 0.38208. lr 4.952505e-04:  55%|█████▍    | 1672/3047 [12:32<13:17,  1.73it/s][A
epoch 1 iter 1672: train loss 0.36962. lr 4.951330e-04:  55%|█████▍    | 1672/3047 [12:32<13:17,  1.73it/s][A
epoch 1 iter 1672: train loss 0.36962. lr 4.951330e-04:  55%|█████▍    | 1673/3047 [12:32<12:33,  1.82it/s][A
epoch 1 iter 1673: train loss 0.37151. lr 4.950155e-04:  55%|█████▍    | 1673/3047 [12:33<12:33,  1.82it/s][A
epoch 1 iter 1673: train loss 0.37151. lr 4.950155e-04:  55%|█████▍    | 1674/3047 [12:33<12:03,  1.90it/s][A
epoch 1 iter 1674: train loss 0.37261. lr 4.948979e-04:  55%|█████▍    | 1674/3047 [12:33<12:03,  1.90it/s][A
epoch 1 iter 1674: train loss 0.37261. lr 4.948979e-04:  55%|█████▍    | 1675/3047 [12:33<11:42,  1.95it/s][A

epoch 1 iter 1707: train loss 0.36446. lr 4.909889e-04:  56%|█████▌    | 1707/3047 [12:47<09:10,  2.43it/s][A
epoch 1 iter 1707: train loss 0.36446. lr 4.909889e-04:  56%|█████▌    | 1708/3047 [12:47<09:15,  2.41it/s][A
epoch 1 iter 1708: train loss 0.36420. lr 4.908696e-04:  56%|█████▌    | 1708/3047 [12:47<09:15,  2.41it/s][A
epoch 1 iter 1708: train loss 0.36420. lr 4.908696e-04:  56%|█████▌    | 1709/3047 [12:47<09:13,  2.42it/s][A
epoch 1 iter 1709: train loss 0.36086. lr 4.907502e-04:  56%|█████▌    | 1709/3047 [12:48<09:13,  2.42it/s][A
epoch 1 iter 1709: train loss 0.36086. lr 4.907502e-04:  56%|█████▌    | 1710/3047 [12:48<09:15,  2.41it/s][A
epoch 1 iter 1710: train loss 0.36403. lr 4.906308e-04:  56%|█████▌    | 1710/3047 [12:48<09:15,  2.41it/s][A
epoch 1 iter 1710: train loss 0.36403. lr 4.906308e-04:  56%|█████▌    | 1711/3047 [12:48<09:24,  2.37it/s][A
epoch 1 iter 1711: train loss 0.36955. lr 4.905113e-04:  56%|█████▌    | 1711/3047 [12:49<09:24,  2.37it/s][A

epoch 1 iter 1743: train loss 0.35476. lr 4.866614e-04:  57%|█████▋    | 1744/3047 [13:03<08:58,  2.42it/s][A
epoch 1 iter 1744: train loss 0.35673. lr 4.865403e-04:  57%|█████▋    | 1744/3047 [13:04<08:58,  2.42it/s][A
epoch 1 iter 1744: train loss 0.35673. lr 4.865403e-04:  57%|█████▋    | 1745/3047 [13:04<08:58,  2.42it/s][A
epoch 1 iter 1745: train loss 0.35873. lr 4.864191e-04:  57%|█████▋    | 1745/3047 [13:04<08:58,  2.42it/s][A
epoch 1 iter 1745: train loss 0.35873. lr 4.864191e-04:  57%|█████▋    | 1746/3047 [13:04<08:59,  2.41it/s][A
epoch 1 iter 1746: train loss 0.35667. lr 4.862979e-04:  57%|█████▋    | 1746/3047 [13:05<08:59,  2.41it/s][A
epoch 1 iter 1746: train loss 0.35667. lr 4.862979e-04:  57%|█████▋    | 1747/3047 [13:05<09:00,  2.41it/s][A
epoch 1 iter 1747: train loss 0.35205. lr 4.861766e-04:  57%|█████▋    | 1747/3047 [13:05<09:00,  2.41it/s][A
epoch 1 iter 1747: train loss 0.35205. lr 4.861766e-04:  57%|█████▋    | 1748/3047 [13:05<09:00,  2.40it/s][A

epoch 1 iter 1780: train loss 0.35208. lr 4.821468e-04:  58%|█████▊    | 1780/3047 [13:20<08:35,  2.46it/s][A
epoch 1 iter 1780: train loss 0.35208. lr 4.821468e-04:  58%|█████▊    | 1781/3047 [13:20<08:25,  2.50it/s][A
epoch 1 iter 1781: train loss 0.34964. lr 4.820238e-04:  58%|█████▊    | 1781/3047 [13:21<08:25,  2.50it/s][A
epoch 1 iter 1781: train loss 0.34964. lr 4.820238e-04:  58%|█████▊    | 1782/3047 [13:21<08:14,  2.56it/s][A
epoch 1 iter 1782: train loss 0.34929. lr 4.819008e-04:  58%|█████▊    | 1782/3047 [13:21<08:14,  2.56it/s][A
epoch 1 iter 1782: train loss 0.34929. lr 4.819008e-04:  59%|█████▊    | 1783/3047 [13:21<08:14,  2.56it/s][A
epoch 1 iter 1783: train loss 0.34925. lr 4.817778e-04:  59%|█████▊    | 1783/3047 [13:22<08:14,  2.56it/s][A
epoch 1 iter 1783: train loss 0.34925. lr 4.817778e-04:  59%|█████▊    | 1784/3047 [13:22<08:18,  2.53it/s][A
epoch 1 iter 1784: train loss 0.34866. lr 4.816547e-04:  59%|█████▊    | 1784/3047 [13:22<08:18,  2.53it/s][A

epoch 1 iter 1816: train loss 0.33563. lr 4.776904e-04:  60%|█████▉    | 1817/3047 [13:37<08:51,  2.32it/s][A
epoch 1 iter 1817: train loss 0.34455. lr 4.775658e-04:  60%|█████▉    | 1817/3047 [13:37<08:51,  2.32it/s][A
epoch 1 iter 1817: train loss 0.34455. lr 4.775658e-04:  60%|█████▉    | 1818/3047 [13:37<09:04,  2.26it/s][A
epoch 1 iter 1818: train loss 0.34134. lr 4.774411e-04:  60%|█████▉    | 1818/3047 [13:38<09:04,  2.26it/s][A
epoch 1 iter 1818: train loss 0.34134. lr 4.774411e-04:  60%|█████▉    | 1819/3047 [13:38<09:05,  2.25it/s][A
epoch 1 iter 1819: train loss 0.33041. lr 4.773163e-04:  60%|█████▉    | 1819/3047 [13:38<09:05,  2.25it/s][A
epoch 1 iter 1819: train loss 0.33041. lr 4.773163e-04:  60%|█████▉    | 1820/3047 [13:38<09:07,  2.24it/s][A
epoch 1 iter 1820: train loss 0.32748. lr 4.771915e-04:  60%|█████▉    | 1820/3047 [13:39<09:07,  2.24it/s][A
epoch 1 iter 1820: train loss 0.32748. lr 4.771915e-04:  60%|█████▉    | 1821/3047 [13:39<09:06,  2.25it/s][A

epoch 1 iter 1853: train loss 0.33005. lr 4.730466e-04:  61%|██████    | 1853/3047 [13:54<08:35,  2.32it/s][A
epoch 1 iter 1853: train loss 0.33005. lr 4.730466e-04:  61%|██████    | 1854/3047 [13:54<08:34,  2.32it/s][A
epoch 1 iter 1854: train loss 0.32321. lr 4.729202e-04:  61%|██████    | 1854/3047 [13:54<08:34,  2.32it/s][A
epoch 1 iter 1854: train loss 0.32321. lr 4.729202e-04:  61%|██████    | 1855/3047 [13:54<08:42,  2.28it/s][A
epoch 1 iter 1855: train loss 0.33619. lr 4.727937e-04:  61%|██████    | 1855/3047 [13:54<08:42,  2.28it/s][A
epoch 1 iter 1855: train loss 0.33619. lr 4.727937e-04:  61%|██████    | 1856/3047 [13:54<08:31,  2.33it/s][A
epoch 1 iter 1856: train loss 0.32293. lr 4.726672e-04:  61%|██████    | 1856/3047 [13:55<08:31,  2.33it/s][A
epoch 1 iter 1856: train loss 0.32293. lr 4.726672e-04:  61%|██████    | 1857/3047 [13:55<08:22,  2.37it/s][A
epoch 1 iter 1857: train loss 0.33074. lr 4.725407e-04:  61%|██████    | 1857/3047 [13:55<08:22,  2.37it/s][A

epoch 1 iter 1889: train loss 0.32695. lr 4.684677e-04:  62%|██████▏   | 1890/3047 [14:09<08:33,  2.25it/s][A
epoch 1 iter 1890: train loss 0.32467. lr 4.683397e-04:  62%|██████▏   | 1890/3047 [14:09<08:33,  2.25it/s][A
epoch 1 iter 1890: train loss 0.32467. lr 4.683397e-04:  62%|██████▏   | 1891/3047 [14:09<09:18,  2.07it/s][A
epoch 1 iter 1891: train loss 0.32341. lr 4.682116e-04:  62%|██████▏   | 1891/3047 [14:10<09:18,  2.07it/s][A
epoch 1 iter 1891: train loss 0.32341. lr 4.682116e-04:  62%|██████▏   | 1892/3047 [14:10<08:47,  2.19it/s][A
epoch 1 iter 1892: train loss 0.32285. lr 4.680835e-04:  62%|██████▏   | 1892/3047 [14:10<08:47,  2.19it/s][A
epoch 1 iter 1892: train loss 0.32285. lr 4.680835e-04:  62%|██████▏   | 1893/3047 [14:10<08:23,  2.29it/s][A
epoch 1 iter 1893: train loss 0.32279. lr 4.679553e-04:  62%|██████▏   | 1893/3047 [14:11<08:23,  2.29it/s][A
epoch 1 iter 1893: train loss 0.32279. lr 4.679553e-04:  62%|██████▏   | 1894/3047 [14:11<07:53,  2.43it/s][A

epoch 1 iter 1926: train loss 0.31385. lr 4.637012e-04:  63%|██████▎   | 1926/3047 [14:27<07:04,  2.64it/s][A
epoch 1 iter 1926: train loss 0.31385. lr 4.637012e-04:  63%|██████▎   | 1927/3047 [14:27<07:20,  2.54it/s][A
epoch 1 iter 1927: train loss 0.31732. lr 4.635715e-04:  63%|██████▎   | 1927/3047 [14:27<07:20,  2.54it/s][A
epoch 1 iter 1927: train loss 0.31732. lr 4.635715e-04:  63%|██████▎   | 1928/3047 [14:27<07:41,  2.42it/s][A
epoch 1 iter 1928: train loss 0.31385. lr 4.634418e-04:  63%|██████▎   | 1928/3047 [14:28<07:41,  2.42it/s][A
epoch 1 iter 1928: train loss 0.31385. lr 4.634418e-04:  63%|██████▎   | 1929/3047 [14:28<07:48,  2.38it/s][A
epoch 1 iter 1929: train loss 0.30877. lr 4.633120e-04:  63%|██████▎   | 1929/3047 [14:28<07:48,  2.38it/s][A
epoch 1 iter 1929: train loss 0.30877. lr 4.633120e-04:  63%|██████▎   | 1930/3047 [14:28<07:31,  2.47it/s][A
epoch 1 iter 1930: train loss 0.31301. lr 4.631823e-04:  63%|██████▎   | 1930/3047 [14:28<07:31,  2.47it/s][A

epoch 1 iter 1962: train loss 0.30429. lr 4.590062e-04:  64%|██████▍   | 1963/3047 [14:43<08:16,  2.18it/s][A
epoch 1 iter 1963: train loss 0.30563. lr 4.588750e-04:  64%|██████▍   | 1963/3047 [14:44<08:16,  2.18it/s][A
epoch 1 iter 1963: train loss 0.30563. lr 4.588750e-04:  64%|██████▍   | 1964/3047 [14:44<08:06,  2.23it/s][A
epoch 1 iter 1964: train loss 0.31181. lr 4.587438e-04:  64%|██████▍   | 1964/3047 [14:44<08:06,  2.23it/s][A
epoch 1 iter 1964: train loss 0.31181. lr 4.587438e-04:  64%|██████▍   | 1965/3047 [14:44<08:23,  2.15it/s][A
epoch 1 iter 1965: train loss 0.30888. lr 4.586125e-04:  64%|██████▍   | 1965/3047 [14:45<08:23,  2.15it/s][A
epoch 1 iter 1965: train loss 0.30888. lr 4.586125e-04:  65%|██████▍   | 1966/3047 [14:45<08:34,  2.10it/s][A
epoch 1 iter 1966: train loss 0.30992. lr 4.584812e-04:  65%|██████▍   | 1966/3047 [14:45<08:34,  2.10it/s][A
epoch 1 iter 1966: train loss 0.30992. lr 4.584812e-04:  65%|██████▍   | 1967/3047 [14:45<08:42,  2.07it/s][A

epoch 1 iter 1999: train loss 0.30148. lr 4.541238e-04:  66%|██████▌   | 1999/3047 [15:00<08:04,  2.16it/s][A
epoch 1 iter 1999: train loss 0.30148. lr 4.541238e-04:  66%|██████▌   | 2000/3047 [15:00<07:38,  2.28it/s][A
epoch 1 iter 2000: train loss 0.30582. lr 4.539911e-04:  66%|██████▌   | 2000/3047 [15:00<07:38,  2.28it/s][A
epoch 1 iter 2000: train loss 0.30582. lr 4.539911e-04:  66%|██████▌   | 2001/3047 [15:00<07:13,  2.41it/s][A
epoch 1 iter 2001: train loss 0.30051. lr 4.538583e-04:  66%|██████▌   | 2001/3047 [15:01<07:13,  2.41it/s][A
epoch 1 iter 2001: train loss 0.30051. lr 4.538583e-04:  66%|██████▌   | 2002/3047 [15:01<07:00,  2.48it/s][A
epoch 1 iter 2002: train loss 0.29983. lr 4.537255e-04:  66%|██████▌   | 2002/3047 [15:01<07:00,  2.48it/s][A
epoch 1 iter 2002: train loss 0.29983. lr 4.537255e-04:  66%|██████▌   | 2003/3047 [15:01<07:16,  2.39it/s][A
epoch 1 iter 2003: train loss 0.29106. lr 4.535926e-04:  66%|██████▌   | 2003/3047 [15:02<07:16,  2.39it/s][A

epoch 1 iter 2035: train loss 0.29159. lr 4.493195e-04:  67%|██████▋   | 2036/3047 [15:15<07:47,  2.16it/s][A
epoch 1 iter 2036: train loss 0.29507. lr 4.491853e-04:  67%|██████▋   | 2036/3047 [15:16<07:47,  2.16it/s][A
epoch 1 iter 2036: train loss 0.29507. lr 4.491853e-04:  67%|██████▋   | 2037/3047 [15:16<07:44,  2.17it/s][A
epoch 1 iter 2037: train loss 0.29574. lr 4.490511e-04:  67%|██████▋   | 2037/3047 [15:16<07:44,  2.17it/s][A
epoch 1 iter 2037: train loss 0.29574. lr 4.490511e-04:  67%|██████▋   | 2038/3047 [15:16<07:42,  2.18it/s][A
epoch 1 iter 2038: train loss 0.29197. lr 4.489168e-04:  67%|██████▋   | 2038/3047 [15:17<07:42,  2.18it/s][A
epoch 1 iter 2038: train loss 0.29197. lr 4.489168e-04:  67%|██████▋   | 2039/3047 [15:17<07:20,  2.29it/s][A
epoch 1 iter 2039: train loss 0.29070. lr 4.487825e-04:  67%|██████▋   | 2039/3047 [15:17<07:20,  2.29it/s][A
epoch 1 iter 2039: train loss 0.29070. lr 4.487825e-04:  67%|██████▋   | 2040/3047 [15:17<06:54,  2.43it/s][A

epoch 1 iter 2072: train loss 0.28472. lr 4.443281e-04:  68%|██████▊   | 2072/3047 [15:32<07:04,  2.29it/s][A
epoch 1 iter 2072: train loss 0.28472. lr 4.443281e-04:  68%|██████▊   | 2073/3047 [15:32<07:01,  2.31it/s][A
epoch 1 iter 2073: train loss 0.29000. lr 4.441925e-04:  68%|██████▊   | 2073/3047 [15:33<07:01,  2.31it/s][A
epoch 1 iter 2073: train loss 0.29000. lr 4.441925e-04:  68%|██████▊   | 2074/3047 [15:33<07:00,  2.32it/s][A
epoch 1 iter 2074: train loss 0.29042. lr 4.440568e-04:  68%|██████▊   | 2074/3047 [15:33<07:00,  2.32it/s][A
epoch 1 iter 2074: train loss 0.29042. lr 4.440568e-04:  68%|██████▊   | 2075/3047 [15:33<06:58,  2.32it/s][A
epoch 1 iter 2075: train loss 0.29361. lr 4.439211e-04:  68%|██████▊   | 2075/3047 [15:34<06:58,  2.32it/s][A
epoch 1 iter 2075: train loss 0.29361. lr 4.439211e-04:  68%|██████▊   | 2076/3047 [15:34<06:55,  2.34it/s][A
epoch 1 iter 2076: train loss 0.29152. lr 4.437853e-04:  68%|██████▊   | 2076/3047 [15:34<06:55,  2.34it/s][A

epoch 1 iter 2108: train loss 0.28141. lr 4.394212e-04:  69%|██████▉   | 2109/3047 [15:49<06:47,  2.30it/s][A
epoch 1 iter 2109: train loss 0.28472. lr 4.392842e-04:  69%|██████▉   | 2109/3047 [15:50<06:47,  2.30it/s][A
epoch 1 iter 2109: train loss 0.28472. lr 4.392842e-04:  69%|██████▉   | 2110/3047 [15:50<06:42,  2.33it/s][A
epoch 1 iter 2110: train loss 0.28126. lr 4.391472e-04:  69%|██████▉   | 2110/3047 [15:50<06:42,  2.33it/s][A
epoch 1 iter 2110: train loss 0.28126. lr 4.391472e-04:  69%|██████▉   | 2111/3047 [15:50<07:16,  2.14it/s][A
epoch 1 iter 2111: train loss 0.28203. lr 4.390101e-04:  69%|██████▉   | 2111/3047 [15:51<07:16,  2.14it/s][A
epoch 1 iter 2111: train loss 0.28203. lr 4.390101e-04:  69%|██████▉   | 2112/3047 [15:51<07:21,  2.12it/s][A
epoch 1 iter 2112: train loss 0.28414. lr 4.388730e-04:  69%|██████▉   | 2112/3047 [15:51<07:21,  2.12it/s][A
epoch 1 iter 2112: train loss 0.28414. lr 4.388730e-04:  69%|██████▉   | 2113/3047 [15:51<07:13,  2.15it/s][A

epoch 1 iter 2145: train loss 0.28518. lr 4.343279e-04:  70%|███████   | 2145/3047 [16:05<06:00,  2.50it/s][A
epoch 1 iter 2145: train loss 0.28518. lr 4.343279e-04:  70%|███████   | 2146/3047 [16:05<06:16,  2.39it/s][A
epoch 1 iter 2146: train loss 0.27826. lr 4.341896e-04:  70%|███████   | 2146/3047 [16:05<06:16,  2.39it/s][A
epoch 1 iter 2146: train loss 0.27826. lr 4.341896e-04:  70%|███████   | 2147/3047 [16:05<06:54,  2.17it/s][A
epoch 1 iter 2147: train loss 0.27059. lr 4.340512e-04:  70%|███████   | 2147/3047 [16:06<06:54,  2.17it/s][A
epoch 1 iter 2147: train loss 0.27059. lr 4.340512e-04:  70%|███████   | 2148/3047 [16:06<07:26,  2.01it/s][A
epoch 1 iter 2148: train loss 0.28387. lr 4.339128e-04:  70%|███████   | 2148/3047 [16:07<07:26,  2.01it/s][A
epoch 1 iter 2148: train loss 0.28387. lr 4.339128e-04:  71%|███████   | 2149/3047 [16:07<07:48,  1.92it/s][A
epoch 1 iter 2149: train loss 0.27380. lr 4.337743e-04:  71%|███████   | 2149/3047 [16:07<07:48,  1.92it/s][A

epoch 1 iter 2181: train loss 0.27618. lr 4.293253e-04:  72%|███████▏  | 2182/3047 [16:21<06:18,  2.29it/s][A
epoch 1 iter 2182: train loss 0.26793. lr 4.291857e-04:  72%|███████▏  | 2182/3047 [16:21<06:18,  2.29it/s][A
epoch 1 iter 2182: train loss 0.26793. lr 4.291857e-04:  72%|███████▏  | 2183/3047 [16:21<06:16,  2.29it/s][A
epoch 1 iter 2183: train loss 0.27461. lr 4.290461e-04:  72%|███████▏  | 2183/3047 [16:22<06:16,  2.29it/s][A
epoch 1 iter 2183: train loss 0.27461. lr 4.290461e-04:  72%|███████▏  | 2184/3047 [16:22<06:15,  2.30it/s][A
epoch 1 iter 2184: train loss 0.27648. lr 4.289064e-04:  72%|███████▏  | 2184/3047 [16:22<06:15,  2.30it/s][A
epoch 1 iter 2184: train loss 0.27648. lr 4.289064e-04:  72%|███████▏  | 2185/3047 [16:22<06:12,  2.31it/s][A
epoch 1 iter 2185: train loss 0.26731. lr 4.287667e-04:  72%|███████▏  | 2185/3047 [16:23<06:12,  2.31it/s][A
epoch 1 iter 2185: train loss 0.26731. lr 4.287667e-04:  72%|███████▏  | 2186/3047 [16:23<06:11,  2.32it/s][A

epoch 1 iter 2218: train loss 0.26818. lr 4.241374e-04:  73%|███████▎  | 2218/3047 [16:38<06:56,  1.99it/s][A
epoch 1 iter 2218: train loss 0.26818. lr 4.241374e-04:  73%|███████▎  | 2219/3047 [16:38<06:55,  1.99it/s][A
epoch 1 iter 2219: train loss 0.26446. lr 4.239965e-04:  73%|███████▎  | 2219/3047 [16:38<06:55,  1.99it/s][A
epoch 1 iter 2219: train loss 0.26446. lr 4.239965e-04:  73%|███████▎  | 2220/3047 [16:38<06:54,  2.00it/s][A
epoch 1 iter 2220: train loss 0.26584. lr 4.238556e-04:  73%|███████▎  | 2220/3047 [16:39<06:54,  2.00it/s][A
epoch 1 iter 2220: train loss 0.26584. lr 4.238556e-04:  73%|███████▎  | 2221/3047 [16:39<06:53,  2.00it/s][A
epoch 1 iter 2221: train loss 0.26405. lr 4.237147e-04:  73%|███████▎  | 2221/3047 [16:39<06:53,  2.00it/s][A
epoch 1 iter 2221: train loss 0.26405. lr 4.237147e-04:  73%|███████▎  | 2222/3047 [16:39<06:36,  2.08it/s][A
epoch 1 iter 2222: train loss 0.26318. lr 4.235738e-04:  73%|███████▎  | 2222/3047 [16:40<06:36,  2.08it/s][A

epoch 1 iter 2254: train loss 0.26489. lr 4.190463e-04:  74%|███████▍  | 2255/3047 [16:54<06:45,  1.95it/s][A
epoch 1 iter 2255: train loss 0.26263. lr 4.189042e-04:  74%|███████▍  | 2255/3047 [16:55<06:45,  1.95it/s][A
epoch 1 iter 2255: train loss 0.26263. lr 4.189042e-04:  74%|███████▍  | 2256/3047 [16:55<06:34,  2.00it/s][A
epoch 1 iter 2256: train loss 0.26727. lr 4.187622e-04:  74%|███████▍  | 2256/3047 [16:55<06:34,  2.00it/s][A
epoch 1 iter 2256: train loss 0.26727. lr 4.187622e-04:  74%|███████▍  | 2257/3047 [16:55<06:27,  2.04it/s][A
epoch 1 iter 2257: train loss 0.26527. lr 4.186201e-04:  74%|███████▍  | 2257/3047 [16:56<06:27,  2.04it/s][A
epoch 1 iter 2257: train loss 0.26527. lr 4.186201e-04:  74%|███████▍  | 2258/3047 [16:56<06:22,  2.06it/s][A
epoch 1 iter 2258: train loss 0.26099. lr 4.184780e-04:  74%|███████▍  | 2258/3047 [16:56<06:22,  2.06it/s][A
epoch 1 iter 2258: train loss 0.26099. lr 4.184780e-04:  74%|███████▍  | 2259/3047 [16:56<06:20,  2.07it/s][A

epoch 1 iter 2291: train loss 0.25882. lr 4.137710e-04:  75%|███████▌  | 2291/3047 [17:11<05:22,  2.35it/s][A
epoch 1 iter 2291: train loss 0.25882. lr 4.137710e-04:  75%|███████▌  | 2292/3047 [17:11<05:21,  2.35it/s][A
epoch 1 iter 2292: train loss 0.25154. lr 4.136278e-04:  75%|███████▌  | 2292/3047 [17:11<05:21,  2.35it/s][A
epoch 1 iter 2292: train loss 0.25154. lr 4.136278e-04:  75%|███████▌  | 2293/3047 [17:11<05:19,  2.36it/s][A
epoch 1 iter 2293: train loss 0.26442. lr 4.134846e-04:  75%|███████▌  | 2293/3047 [17:11<05:19,  2.36it/s][A
epoch 1 iter 2293: train loss 0.26442. lr 4.134846e-04:  75%|███████▌  | 2294/3047 [17:11<05:18,  2.36it/s][A
epoch 1 iter 2294: train loss 0.25980. lr 4.133414e-04:  75%|███████▌  | 2294/3047 [17:12<05:18,  2.36it/s][A
epoch 1 iter 2294: train loss 0.25980. lr 4.133414e-04:  75%|███████▌  | 2295/3047 [17:12<05:18,  2.36it/s][A
epoch 1 iter 2295: train loss 0.25801. lr 4.131982e-04:  75%|███████▌  | 2295/3047 [17:12<05:18,  2.36it/s][A

epoch 1 iter 2327: train loss 0.25762. lr 4.085985e-04:  76%|███████▋  | 2328/3047 [17:26<05:15,  2.28it/s][A
epoch 1 iter 2328: train loss 0.25051. lr 4.084543e-04:  76%|███████▋  | 2328/3047 [17:27<05:15,  2.28it/s][A
epoch 1 iter 2328: train loss 0.25051. lr 4.084543e-04:  76%|███████▋  | 2329/3047 [17:27<05:09,  2.32it/s][A
epoch 1 iter 2329: train loss 0.26132. lr 4.083100e-04:  76%|███████▋  | 2329/3047 [17:27<05:09,  2.32it/s][A
epoch 1 iter 2329: train loss 0.26132. lr 4.083100e-04:  76%|███████▋  | 2330/3047 [17:27<05:05,  2.35it/s][A
epoch 1 iter 2330: train loss 0.25087. lr 4.081657e-04:  76%|███████▋  | 2330/3047 [17:28<05:05,  2.35it/s][A
epoch 1 iter 2330: train loss 0.25087. lr 4.081657e-04:  77%|███████▋  | 2331/3047 [17:28<05:03,  2.36it/s][A
epoch 1 iter 2331: train loss 0.25170. lr 4.080214e-04:  77%|███████▋  | 2331/3047 [17:28<05:03,  2.36it/s][A
epoch 1 iter 2331: train loss 0.25170. lr 4.080214e-04:  77%|███████▋  | 2332/3047 [17:28<05:01,  2.37it/s][A

epoch 1 iter 2364: train loss 0.24819. lr 4.032434e-04:  78%|███████▊  | 2364/3047 [17:43<04:52,  2.34it/s][A
epoch 1 iter 2364: train loss 0.24819. lr 4.032434e-04:  78%|███████▊  | 2365/3047 [17:43<04:51,  2.34it/s][A
epoch 1 iter 2365: train loss 0.25243. lr 4.030981e-04:  78%|███████▊  | 2365/3047 [17:43<04:51,  2.34it/s][A
epoch 1 iter 2365: train loss 0.25243. lr 4.030981e-04:  78%|███████▊  | 2366/3047 [17:43<04:49,  2.35it/s][A
epoch 1 iter 2366: train loss 0.24856. lr 4.029528e-04:  78%|███████▊  | 2366/3047 [17:43<04:49,  2.35it/s][A
epoch 1 iter 2366: train loss 0.24856. lr 4.029528e-04:  78%|███████▊  | 2367/3047 [17:43<04:45,  2.38it/s][A
epoch 1 iter 2367: train loss 0.24881. lr 4.028075e-04:  78%|███████▊  | 2367/3047 [17:44<04:45,  2.38it/s][A
epoch 1 iter 2367: train loss 0.24881. lr 4.028075e-04:  78%|███████▊  | 2368/3047 [17:44<04:41,  2.42it/s][A
epoch 1 iter 2368: train loss 0.24765. lr 4.026621e-04:  78%|███████▊  | 2368/3047 [17:44<04:41,  2.42it/s][A

epoch 1 iter 2400: train loss 0.24998. lr 3.979969e-04:  79%|███████▉  | 2401/3047 [17:59<05:36,  1.92it/s][A
epoch 1 iter 2401: train loss 0.24438. lr 3.978506e-04:  79%|███████▉  | 2401/3047 [18:00<05:36,  1.92it/s][A
epoch 1 iter 2401: train loss 0.24438. lr 3.978506e-04:  79%|███████▉  | 2402/3047 [18:00<05:26,  1.97it/s][A
epoch 1 iter 2402: train loss 0.24204. lr 3.977044e-04:  79%|███████▉  | 2402/3047 [18:00<05:26,  1.97it/s][A
epoch 1 iter 2402: train loss 0.24204. lr 3.977044e-04:  79%|███████▉  | 2403/3047 [18:00<05:20,  2.01it/s][A
epoch 1 iter 2403: train loss 0.24390. lr 3.975581e-04:  79%|███████▉  | 2403/3047 [18:01<05:20,  2.01it/s][A
epoch 1 iter 2403: train loss 0.24390. lr 3.975581e-04:  79%|███████▉  | 2404/3047 [18:01<05:15,  2.04it/s][A
epoch 1 iter 2404: train loss 0.24526. lr 3.974118e-04:  79%|███████▉  | 2404/3047 [18:01<05:15,  2.04it/s][A
epoch 1 iter 2404: train loss 0.24526. lr 3.974118e-04:  79%|███████▉  | 2405/3047 [18:01<05:08,  2.08it/s][A

epoch 1 iter 2437: train loss 0.23955. lr 3.925695e-04:  80%|███████▉  | 2437/3047 [18:15<04:11,  2.43it/s][A
epoch 1 iter 2437: train loss 0.23955. lr 3.925695e-04:  80%|████████  | 2438/3047 [18:15<04:10,  2.43it/s][A
epoch 1 iter 2438: train loss 0.24583. lr 3.924223e-04:  80%|████████  | 2438/3047 [18:15<04:10,  2.43it/s][A
epoch 1 iter 2438: train loss 0.24583. lr 3.924223e-04:  80%|████████  | 2439/3047 [18:15<04:09,  2.43it/s][A
epoch 1 iter 2439: train loss 0.24290. lr 3.922751e-04:  80%|████████  | 2439/3047 [18:16<04:09,  2.43it/s][A
epoch 1 iter 2439: train loss 0.24290. lr 3.922751e-04:  80%|████████  | 2440/3047 [18:16<04:09,  2.44it/s][A
epoch 1 iter 2440: train loss 0.24438. lr 3.921279e-04:  80%|████████  | 2440/3047 [18:16<04:09,  2.44it/s][A
epoch 1 iter 2440: train loss 0.24438. lr 3.921279e-04:  80%|████████  | 2441/3047 [18:16<04:08,  2.44it/s][A
epoch 1 iter 2441: train loss 0.24106. lr 3.919807e-04:  80%|████████  | 2441/3047 [18:17<04:08,  2.44it/s][A

epoch 1 iter 2473: train loss 0.23842. lr 3.872564e-04:  81%|████████  | 2474/3047 [18:31<05:39,  1.69it/s][A
epoch 1 iter 2474: train loss 0.24436. lr 3.871084e-04:  81%|████████  | 2474/3047 [18:32<05:39,  1.69it/s][A
epoch 1 iter 2474: train loss 0.24436. lr 3.871084e-04:  81%|████████  | 2475/3047 [18:32<05:32,  1.72it/s][A
epoch 1 iter 2475: train loss 0.24079. lr 3.869603e-04:  81%|████████  | 2475/3047 [18:32<05:32,  1.72it/s][A
epoch 1 iter 2475: train loss 0.24079. lr 3.869603e-04:  81%|████████▏ | 2476/3047 [18:32<05:15,  1.81it/s][A
epoch 1 iter 2476: train loss 0.23523. lr 3.868122e-04:  81%|████████▏ | 2476/3047 [18:33<05:15,  1.81it/s][A
epoch 1 iter 2476: train loss 0.23523. lr 3.868122e-04:  81%|████████▏ | 2477/3047 [18:33<04:59,  1.90it/s][A
epoch 1 iter 2477: train loss 0.24077. lr 3.866641e-04:  81%|████████▏ | 2477/3047 [18:33<04:59,  1.90it/s][A
epoch 1 iter 2477: train loss 0.24077. lr 3.866641e-04:  81%|████████▏ | 2478/3047 [18:33<04:47,  1.98it/s][A

epoch 1 iter 2510: train loss 0.23418. lr 3.817644e-04:  82%|████████▏ | 2510/3047 [18:48<03:41,  2.43it/s][A
epoch 1 iter 2510: train loss 0.23418. lr 3.817644e-04:  82%|████████▏ | 2511/3047 [18:48<03:40,  2.43it/s][A
epoch 1 iter 2511: train loss 0.24246. lr 3.816155e-04:  82%|████████▏ | 2511/3047 [18:49<03:40,  2.43it/s][A
epoch 1 iter 2511: train loss 0.24246. lr 3.816155e-04:  82%|████████▏ | 2512/3047 [18:49<03:40,  2.43it/s][A
epoch 1 iter 2512: train loss 0.23320. lr 3.814667e-04:  82%|████████▏ | 2512/3047 [18:49<03:40,  2.43it/s][A
epoch 1 iter 2512: train loss 0.23320. lr 3.814667e-04:  82%|████████▏ | 2513/3047 [18:49<03:39,  2.43it/s][A
epoch 1 iter 2513: train loss 0.23153. lr 3.813178e-04:  82%|████████▏ | 2513/3047 [18:50<03:39,  2.43it/s][A
epoch 1 iter 2513: train loss 0.23153. lr 3.813178e-04:  83%|████████▎ | 2514/3047 [18:50<03:39,  2.43it/s][A
epoch 1 iter 2514: train loss 0.23551. lr 3.811689e-04:  83%|████████▎ | 2514/3047 [18:50<03:39,  2.43it/s][A

epoch 1 iter 2546: train loss 0.22976. lr 3.763923e-04:  84%|████████▎ | 2547/3047 [19:04<03:48,  2.19it/s][A
epoch 1 iter 2547: train loss 0.23366. lr 3.762427e-04:  84%|████████▎ | 2547/3047 [19:04<03:48,  2.19it/s][A
epoch 1 iter 2547: train loss 0.23366. lr 3.762427e-04:  84%|████████▎ | 2548/3047 [19:04<03:49,  2.17it/s][A
epoch 1 iter 2548: train loss 0.23507. lr 3.760930e-04:  84%|████████▎ | 2548/3047 [19:05<03:49,  2.17it/s][A
epoch 1 iter 2548: train loss 0.23507. lr 3.760930e-04:  84%|████████▎ | 2549/3047 [19:05<03:44,  2.21it/s][A
epoch 1 iter 2549: train loss 0.23112. lr 3.759434e-04:  84%|████████▎ | 2549/3047 [19:05<03:44,  2.21it/s][A
epoch 1 iter 2549: train loss 0.23112. lr 3.759434e-04:  84%|████████▎ | 2550/3047 [19:05<03:41,  2.24it/s][A
epoch 1 iter 2550: train loss 0.23518. lr 3.757937e-04:  84%|████████▎ | 2550/3047 [19:06<03:41,  2.24it/s][A
epoch 1 iter 2550: train loss 0.23518. lr 3.757937e-04:  84%|████████▎ | 2551/3047 [19:06<03:39,  2.26it/s][A

epoch 1 iter 2583: train loss 0.22865. lr 3.708435e-04:  85%|████████▍ | 2583/3047 [19:19<03:42,  2.09it/s][A
epoch 1 iter 2583: train loss 0.22865. lr 3.708435e-04:  85%|████████▍ | 2584/3047 [19:19<03:32,  2.18it/s][A
epoch 1 iter 2584: train loss 0.22890. lr 3.706932e-04:  85%|████████▍ | 2584/3047 [19:20<03:32,  2.18it/s][A
epoch 1 iter 2584: train loss 0.22890. lr 3.706932e-04:  85%|████████▍ | 2585/3047 [19:20<03:24,  2.26it/s][A
epoch 1 iter 2585: train loss 0.23499. lr 3.705428e-04:  85%|████████▍ | 2585/3047 [19:20<03:24,  2.26it/s][A
epoch 1 iter 2585: train loss 0.23499. lr 3.705428e-04:  85%|████████▍ | 2586/3047 [19:20<03:18,  2.32it/s][A
epoch 1 iter 2586: train loss 0.22811. lr 3.703924e-04:  85%|████████▍ | 2586/3047 [19:21<03:18,  2.32it/s][A
epoch 1 iter 2586: train loss 0.22811. lr 3.703924e-04:  85%|████████▍ | 2587/3047 [19:21<03:14,  2.36it/s][A
epoch 1 iter 2587: train loss 0.22721. lr 3.702420e-04:  85%|████████▍ | 2587/3047 [19:22<03:14,  2.36it/s][A

epoch 1 iter 2619: train loss 0.22554. lr 3.654199e-04:  86%|████████▌ | 2620/3047 [19:38<02:51,  2.49it/s][A
epoch 1 iter 2620: train loss 0.22451. lr 3.652689e-04:  86%|████████▌ | 2620/3047 [19:38<02:51,  2.49it/s][A
epoch 1 iter 2620: train loss 0.22451. lr 3.652689e-04:  86%|████████▌ | 2621/3047 [19:38<02:53,  2.46it/s][A
epoch 1 iter 2621: train loss 0.23564. lr 3.651179e-04:  86%|████████▌ | 2621/3047 [19:38<02:53,  2.46it/s][A
epoch 1 iter 2621: train loss 0.23564. lr 3.651179e-04:  86%|████████▌ | 2622/3047 [19:38<02:58,  2.38it/s][A
epoch 1 iter 2622: train loss 0.22730. lr 3.649669e-04:  86%|████████▌ | 2622/3047 [19:39<02:58,  2.38it/s][A
epoch 1 iter 2622: train loss 0.22730. lr 3.649669e-04:  86%|████████▌ | 2623/3047 [19:39<02:56,  2.41it/s][A
epoch 1 iter 2623: train loss 0.22164. lr 3.648159e-04:  86%|████████▌ | 2623/3047 [19:39<02:56,  2.41it/s][A
epoch 1 iter 2623: train loss 0.22164. lr 3.648159e-04:  86%|████████▌ | 2624/3047 [19:39<02:55,  2.42it/s][A

epoch 1 iter 2656: train loss 0.22275. lr 3.598222e-04:  87%|████████▋ | 2656/3047 [19:54<02:54,  2.24it/s][A
epoch 1 iter 2656: train loss 0.22275. lr 3.598222e-04:  87%|████████▋ | 2657/3047 [19:54<02:53,  2.25it/s][A
epoch 1 iter 2657: train loss 0.22228. lr 3.596706e-04:  87%|████████▋ | 2657/3047 [19:54<02:53,  2.25it/s][A
epoch 1 iter 2657: train loss 0.22228. lr 3.596706e-04:  87%|████████▋ | 2658/3047 [19:54<02:51,  2.27it/s][A
epoch 1 iter 2658: train loss 0.22414. lr 3.595190e-04:  87%|████████▋ | 2658/3047 [19:55<02:51,  2.27it/s][A
epoch 1 iter 2658: train loss 0.22414. lr 3.595190e-04:  87%|████████▋ | 2659/3047 [19:55<02:50,  2.28it/s][A
epoch 1 iter 2659: train loss 0.22388. lr 3.593674e-04:  87%|████████▋ | 2659/3047 [19:55<02:50,  2.28it/s][A
epoch 1 iter 2659: train loss 0.22388. lr 3.593674e-04:  87%|████████▋ | 2660/3047 [19:55<02:42,  2.39it/s][A
epoch 1 iter 2660: train loss 0.22186. lr 3.592157e-04:  87%|████████▋ | 2660/3047 [19:55<02:42,  2.39it/s][A

epoch 1 iter 2692: train loss 0.22045. lr 3.543549e-04:  88%|████████▊ | 2693/3047 [20:10<02:38,  2.23it/s][A
epoch 1 iter 2693: train loss 0.22295. lr 3.542027e-04:  88%|████████▊ | 2693/3047 [20:10<02:38,  2.23it/s][A
epoch 1 iter 2693: train loss 0.22295. lr 3.542027e-04:  88%|████████▊ | 2694/3047 [20:10<02:33,  2.30it/s][A
epoch 1 iter 2694: train loss 0.21685. lr 3.540506e-04:  88%|████████▊ | 2694/3047 [20:11<02:33,  2.30it/s][A
epoch 1 iter 2694: train loss 0.21685. lr 3.540506e-04:  88%|████████▊ | 2695/3047 [20:11<02:28,  2.36it/s][A
epoch 1 iter 2695: train loss 0.21895. lr 3.538984e-04:  88%|████████▊ | 2695/3047 [20:11<02:28,  2.36it/s][A
epoch 1 iter 2695: train loss 0.21895. lr 3.538984e-04:  88%|████████▊ | 2696/3047 [20:11<02:30,  2.33it/s][A
epoch 1 iter 2696: train loss 0.22047. lr 3.537462e-04:  88%|████████▊ | 2696/3047 [20:11<02:30,  2.33it/s][A
epoch 1 iter 2696: train loss 0.22047. lr 3.537462e-04:  89%|████████▊ | 2697/3047 [20:11<02:26,  2.38it/s][A

epoch 1 iter 2729: train loss 0.21780. lr 3.487162e-04:  90%|████████▉ | 2729/3047 [20:27<02:11,  2.41it/s][A
epoch 1 iter 2729: train loss 0.21780. lr 3.487162e-04:  90%|████████▉ | 2730/3047 [20:27<02:10,  2.43it/s][A
epoch 1 iter 2730: train loss 0.21093. lr 3.485635e-04:  90%|████████▉ | 2730/3047 [20:27<02:10,  2.43it/s][A
epoch 1 iter 2730: train loss 0.21093. lr 3.485635e-04:  90%|████████▉ | 2731/3047 [20:27<02:09,  2.45it/s][A
epoch 1 iter 2731: train loss 0.21112. lr 3.484108e-04:  90%|████████▉ | 2731/3047 [20:28<02:09,  2.45it/s][A
epoch 1 iter 2731: train loss 0.21112. lr 3.484108e-04:  90%|████████▉ | 2732/3047 [20:28<02:13,  2.36it/s][A
epoch 1 iter 2732: train loss 0.22073. lr 3.482582e-04:  90%|████████▉ | 2732/3047 [20:28<02:13,  2.36it/s][A
epoch 1 iter 2732: train loss 0.22073. lr 3.482582e-04:  90%|████████▉ | 2733/3047 [20:28<02:20,  2.24it/s][A
epoch 1 iter 2733: train loss 0.21538. lr 3.481055e-04:  90%|████████▉ | 2733/3047 [20:29<02:20,  2.24it/s][A

epoch 1 iter 2765: train loss 0.21690. lr 3.432128e-04:  91%|█████████ | 2766/3047 [20:44<02:13,  2.10it/s][A
epoch 1 iter 2766: train loss 0.21644. lr 3.430597e-04:  91%|█████████ | 2766/3047 [20:44<02:13,  2.10it/s][A
epoch 1 iter 2766: train loss 0.21644. lr 3.430597e-04:  91%|█████████ | 2767/3047 [20:44<02:14,  2.09it/s][A
epoch 1 iter 2767: train loss 0.21557. lr 3.429066e-04:  91%|█████████ | 2767/3047 [20:45<02:14,  2.09it/s][A
epoch 1 iter 2767: train loss 0.21557. lr 3.429066e-04:  91%|█████████ | 2768/3047 [20:45<02:14,  2.08it/s][A
epoch 1 iter 2768: train loss 0.21162. lr 3.427535e-04:  91%|█████████ | 2768/3047 [20:45<02:14,  2.08it/s][A
epoch 1 iter 2768: train loss 0.21162. lr 3.427535e-04:  91%|█████████ | 2769/3047 [20:45<02:11,  2.11it/s][A
epoch 1 iter 2769: train loss 0.21550. lr 3.426004e-04:  91%|█████████ | 2769/3047 [20:46<02:11,  2.11it/s][A
epoch 1 iter 2769: train loss 0.21550. lr 3.426004e-04:  91%|█████████ | 2770/3047 [20:46<02:13,  2.08it/s][A

epoch 1 iter 2802: train loss 0.21387. lr 3.375411e-04:  92%|█████████▏| 2802/3047 [20:59<01:39,  2.45it/s][A
epoch 1 iter 2802: train loss 0.21387. lr 3.375411e-04:  92%|█████████▏| 2803/3047 [20:59<01:39,  2.46it/s][A
epoch 1 iter 2803: train loss 0.20996. lr 3.373876e-04:  92%|█████████▏| 2803/3047 [21:00<01:39,  2.46it/s][A
epoch 1 iter 2803: train loss 0.20996. lr 3.373876e-04:  92%|█████████▏| 2804/3047 [21:00<01:38,  2.47it/s][A
epoch 1 iter 2804: train loss 0.21380. lr 3.372341e-04:  92%|█████████▏| 2804/3047 [21:01<01:38,  2.47it/s][A
epoch 1 iter 2804: train loss 0.21380. lr 3.372341e-04:  92%|█████████▏| 2805/3047 [21:01<02:45,  1.46it/s][A
epoch 1 iter 2805: train loss 0.21400. lr 3.370806e-04:  92%|█████████▏| 2805/3047 [21:02<02:45,  1.46it/s][A
epoch 1 iter 2805: train loss 0.21400. lr 3.370806e-04:  92%|█████████▏| 2806/3047 [21:02<02:37,  1.53it/s][A
epoch 1 iter 2806: train loss 0.21435. lr 3.369271e-04:  92%|█████████▏| 2806/3047 [21:02<02:37,  1.53it/s][A

epoch 1 iter 2838: train loss 0.21025. lr 3.320095e-04:  93%|█████████▎| 2839/3047 [21:17<01:36,  2.15it/s][A
epoch 1 iter 2839: train loss 0.20704. lr 3.318557e-04:  93%|█████████▎| 2839/3047 [21:18<01:36,  2.15it/s][A
epoch 1 iter 2839: train loss 0.20704. lr 3.318557e-04:  93%|█████████▎| 2840/3047 [21:18<01:34,  2.19it/s][A
epoch 1 iter 2840: train loss 0.21335. lr 3.317019e-04:  93%|█████████▎| 2840/3047 [21:18<01:34,  2.19it/s][A
epoch 1 iter 2840: train loss 0.21335. lr 3.317019e-04:  93%|█████████▎| 2841/3047 [21:18<01:32,  2.23it/s][A
epoch 1 iter 2841: train loss 0.20747. lr 3.315480e-04:  93%|█████████▎| 2841/3047 [21:18<01:32,  2.23it/s][A
epoch 1 iter 2841: train loss 0.20747. lr 3.315480e-04:  93%|█████████▎| 2842/3047 [21:18<01:30,  2.27it/s][A
epoch 1 iter 2842: train loss 0.20767. lr 3.313942e-04:  93%|█████████▎| 2842/3047 [21:19<01:30,  2.27it/s][A
epoch 1 iter 2842: train loss 0.20767. lr 3.313942e-04:  93%|█████████▎| 2843/3047 [21:19<01:30,  2.25it/s][A

epoch 1 iter 2875: train loss 0.20695. lr 3.263128e-04:  94%|█████████▍| 2875/3047 [21:32<01:10,  2.45it/s][A
epoch 1 iter 2875: train loss 0.20695. lr 3.263128e-04:  94%|█████████▍| 2876/3047 [21:32<01:10,  2.41it/s][A
epoch 1 iter 2876: train loss 0.20846. lr 3.261587e-04:  94%|█████████▍| 2876/3047 [21:33<01:10,  2.41it/s][A
epoch 1 iter 2876: train loss 0.20846. lr 3.261587e-04:  94%|█████████▍| 2877/3047 [21:33<01:11,  2.39it/s][A
epoch 1 iter 2877: train loss 0.20648. lr 3.260046e-04:  94%|█████████▍| 2877/3047 [21:33<01:11,  2.39it/s][A
epoch 1 iter 2877: train loss 0.20648. lr 3.260046e-04:  94%|█████████▍| 2878/3047 [21:33<01:11,  2.38it/s][A
epoch 1 iter 2878: train loss 0.20710. lr 3.258505e-04:  94%|█████████▍| 2878/3047 [21:34<01:11,  2.38it/s][A
epoch 1 iter 2878: train loss 0.20710. lr 3.258505e-04:  94%|█████████▍| 2879/3047 [21:34<01:10,  2.38it/s][A
epoch 1 iter 2879: train loss 0.20527. lr 3.256964e-04:  94%|█████████▍| 2879/3047 [21:34<01:10,  2.38it/s][A

epoch 1 iter 2911: train loss 0.20270. lr 3.207609e-04:  96%|█████████▌| 2912/3047 [21:49<01:06,  2.04it/s][A
epoch 1 iter 2912: train loss 0.20410. lr 3.206066e-04:  96%|█████████▌| 2912/3047 [21:50<01:06,  2.04it/s][A
epoch 1 iter 2912: train loss 0.20410. lr 3.206066e-04:  96%|█████████▌| 2913/3047 [21:50<01:04,  2.07it/s][A
epoch 1 iter 2913: train loss 0.20711. lr 3.204522e-04:  96%|█████████▌| 2913/3047 [21:50<01:04,  2.07it/s][A
epoch 1 iter 2913: train loss 0.20711. lr 3.204522e-04:  96%|█████████▌| 2914/3047 [21:50<01:03,  2.10it/s][A
epoch 1 iter 2914: train loss 0.21079. lr 3.202979e-04:  96%|█████████▌| 2914/3047 [21:50<01:03,  2.10it/s][A
epoch 1 iter 2914: train loss 0.21079. lr 3.202979e-04:  96%|█████████▌| 2915/3047 [21:50<01:01,  2.14it/s][A
epoch 1 iter 2915: train loss 0.20636. lr 3.201435e-04:  96%|█████████▌| 2915/3047 [21:51<01:01,  2.14it/s][A
epoch 1 iter 2915: train loss 0.20636. lr 3.201435e-04:  96%|█████████▌| 2916/3047 [21:51<01:00,  2.18it/s][A

epoch 1 iter 2948: train loss 0.20620. lr 3.150473e-04:  97%|█████████▋| 2948/3047 [22:05<00:41,  2.40it/s][A
epoch 1 iter 2948: train loss 0.20620. lr 3.150473e-04:  97%|█████████▋| 2949/3047 [22:05<00:40,  2.42it/s][A
epoch 1 iter 2949: train loss 0.20463. lr 3.148928e-04:  97%|█████████▋| 2949/3047 [22:06<00:40,  2.42it/s][A
epoch 1 iter 2949: train loss 0.20463. lr 3.148928e-04:  97%|█████████▋| 2950/3047 [22:06<00:44,  2.19it/s][A
epoch 1 iter 2950: train loss 0.20330. lr 3.147383e-04:  97%|█████████▋| 2950/3047 [22:06<00:44,  2.19it/s][A
epoch 1 iter 2950: train loss 0.20330. lr 3.147383e-04:  97%|█████████▋| 2951/3047 [22:06<00:45,  2.12it/s][A
epoch 1 iter 2951: train loss 0.20223. lr 3.145838e-04:  97%|█████████▋| 2951/3047 [22:07<00:45,  2.12it/s][A
epoch 1 iter 2951: train loss 0.20223. lr 3.145838e-04:  97%|█████████▋| 2952/3047 [22:07<00:44,  2.14it/s][A
epoch 1 iter 2952: train loss 0.19990. lr 3.144292e-04:  97%|█████████▋| 2952/3047 [22:07<00:44,  2.14it/s][A

epoch 1 iter 2984: train loss 0.20061. lr 3.094828e-04:  98%|█████████▊| 2985/3047 [22:22<00:26,  2.35it/s][A
epoch 1 iter 2985: train loss 0.20310. lr 3.093282e-04:  98%|█████████▊| 2985/3047 [22:22<00:26,  2.35it/s][A
epoch 1 iter 2985: train loss 0.20310. lr 3.093282e-04:  98%|█████████▊| 2986/3047 [22:22<00:25,  2.43it/s][A
epoch 1 iter 2986: train loss 0.20159. lr 3.091736e-04:  98%|█████████▊| 2986/3047 [22:22<00:25,  2.43it/s][A
epoch 1 iter 2986: train loss 0.20159. lr 3.091736e-04:  98%|█████████▊| 2987/3047 [22:23<00:23,  2.50it/s][A
epoch 1 iter 2987: train loss 0.19828. lr 3.090190e-04:  98%|█████████▊| 2987/3047 [22:23<00:23,  2.50it/s][A
epoch 1 iter 2987: train loss 0.19828. lr 3.090190e-04:  98%|█████████▊| 2988/3047 [22:23<00:23,  2.50it/s][A
epoch 1 iter 2988: train loss 0.19995. lr 3.088643e-04:  98%|█████████▊| 2988/3047 [22:23<00:23,  2.50it/s][A
epoch 1 iter 2988: train loss 0.19995. lr 3.088643e-04:  98%|█████████▊| 2989/3047 [22:23<00:23,  2.47it/s][A

epoch 1 iter 3021: train loss 0.19955. lr 3.037604e-04:  99%|█████████▉| 3021/3047 [22:39<00:11,  2.23it/s][A
epoch 1 iter 3021: train loss 0.19955. lr 3.037604e-04:  99%|█████████▉| 3022/3047 [22:39<00:11,  2.26it/s][A
epoch 1 iter 3022: train loss 0.19781. lr 3.036057e-04:  99%|█████████▉| 3022/3047 [22:40<00:11,  2.26it/s][A
epoch 1 iter 3022: train loss 0.19781. lr 3.036057e-04:  99%|█████████▉| 3023/3047 [22:40<00:10,  2.29it/s][A
epoch 1 iter 3023: train loss 0.19542. lr 3.034511e-04:  99%|█████████▉| 3023/3047 [22:40<00:10,  2.29it/s][A
epoch 1 iter 3023: train loss 0.19542. lr 3.034511e-04:  99%|█████████▉| 3024/3047 [22:40<00:09,  2.34it/s][A
epoch 1 iter 3024: train loss 0.19487. lr 3.032964e-04:  99%|█████████▉| 3024/3047 [22:41<00:09,  2.34it/s][A
epoch 1 iter 3024: train loss 0.19487. lr 3.032964e-04:  99%|█████████▉| 3025/3047 [22:41<00:09,  2.38it/s][A
epoch 1 iter 3025: train loss 0.20445. lr 3.031417e-04:  99%|█████████▉| 3025/3047 [22:41<00:09,  2.38it/s][A

epoch 2 iter 11: train loss 0.19476. lr 2.981557e-04:   0%|          | 11/3047 [00:06<30:22,  1.67it/s][A
epoch 2 iter 11: train loss 0.19476. lr 2.981557e-04:   0%|          | 12/3047 [00:06<29:17,  1.73it/s][A
epoch 2 iter 12: train loss 0.20062. lr 2.980010e-04:   0%|          | 12/3047 [00:06<29:17,  1.73it/s][A
epoch 2 iter 12: train loss 0.20062. lr 2.980010e-04:   0%|          | 13/3047 [00:06<28:32,  1.77it/s][A
epoch 2 iter 13: train loss 0.19895. lr 2.978463e-04:   0%|          | 13/3047 [00:07<28:32,  1.77it/s][A
epoch 2 iter 13: train loss 0.19895. lr 2.978463e-04:   0%|          | 14/3047 [00:07<27:28,  1.84it/s][A
epoch 2 iter 14: train loss 0.19861. lr 2.976916e-04:   0%|          | 14/3047 [00:07<27:28,  1.84it/s][A
epoch 2 iter 14: train loss 0.19861. lr 2.976916e-04:   0%|          | 15/3047 [00:07<26:31,  1.90it/s][A
epoch 2 iter 15: train loss 0.19555. lr 2.975369e-04:   0%|          | 15/3047 [00:08<26:31,  1.90it/s][A

epoch 2 iter 49: train loss 0.19459. lr 2.922780e-04:   2%|▏         | 49/3047 [00:23<19:49,  2.52it/s][A
epoch 2 iter 49: train loss 0.19459. lr 2.922780e-04:   2%|▏         | 50/3047 [00:23<19:02,  2.62it/s][A
epoch 2 iter 50: train loss 0.19248. lr 2.921233e-04:   2%|▏         | 50/3047 [00:23<19:02,  2.62it/s][A
epoch 2 iter 50: train loss 0.19248. lr 2.921233e-04:   2%|▏         | 51/3047 [00:23<18:47,  2.66it/s][A
epoch 2 iter 51: train loss 0.19328. lr 2.919687e-04:   2%|▏         | 51/3047 [00:24<18:47,  2.66it/s][A
epoch 2 iter 51: train loss 0.19328. lr 2.919687e-04:   2%|▏         | 52/3047 [00:24<18:54,  2.64it/s][A
epoch 2 iter 52: train loss 0.18761. lr 2.918140e-04:   2%|▏         | 52/3047 [00:24<18:54,  2.64it/s][A
epoch 2 iter 52: train loss 0.18761. lr 2.918140e-04:   2%|▏         | 53/3047 [00:24<19:11,  2.60it/s][A
epoch 2 iter 53: train loss 0.19289. lr 2.916594e-04:   2%|▏         | 53/3047 [00:25<19:11,  2.60it/s][A

epoch 2 iter 87: train loss 0.19270. lr 2.864032e-04:   3%|▎         | 87/3047 [00:41<21:02,  2.34it/s][A
epoch 2 iter 87: train loss 0.19270. lr 2.864032e-04:   3%|▎         | 88/3047 [00:41<20:07,  2.45it/s][A
epoch 2 iter 88: train loss 0.19228. lr 2.862486e-04:   3%|▎         | 88/3047 [00:41<20:07,  2.45it/s][A
epoch 2 iter 88: train loss 0.19228. lr 2.862486e-04:   3%|▎         | 89/3047 [00:41<19:48,  2.49it/s][A
epoch 2 iter 89: train loss 0.19349. lr 2.860941e-04:   3%|▎         | 89/3047 [00:41<19:48,  2.49it/s][A
epoch 2 iter 89: train loss 0.19349. lr 2.860941e-04:   3%|▎         | 90/3047 [00:41<19:44,  2.50it/s][A
epoch 2 iter 90: train loss 0.19069. lr 2.859396e-04:   3%|▎         | 90/3047 [00:42<19:44,  2.50it/s][A
epoch 2 iter 90: train loss 0.19069. lr 2.859396e-04:   3%|▎         | 91/3047 [00:42<20:18,  2.43it/s][A
epoch 2 iter 91: train loss 0.19594. lr 2.857850e-04:   3%|▎         | 91/3047 [00:42<20:18,  2.43it/s][A

epoch 2 iter 124: train loss 0.18561. lr 2.806880e-04:   4%|▍         | 125/3047 [00:57<20:47,  2.34it/s][A
epoch 2 iter 125: train loss 0.19127. lr 2.805336e-04:   4%|▍         | 125/3047 [00:57<20:47,  2.34it/s][A
epoch 2 iter 125: train loss 0.19127. lr 2.805336e-04:   4%|▍         | 126/3047 [00:57<20:55,  2.33it/s][A
epoch 2 iter 126: train loss 0.19253. lr 2.803792e-04:   4%|▍         | 126/3047 [00:57<20:55,  2.33it/s][A
epoch 2 iter 126: train loss 0.19253. lr 2.803792e-04:   4%|▍         | 127/3047 [00:57<20:08,  2.42it/s][A
epoch 2 iter 127: train loss 0.18920. lr 2.802249e-04:   4%|▍         | 127/3047 [00:58<20:08,  2.42it/s][A
epoch 2 iter 127: train loss 0.18920. lr 2.802249e-04:   4%|▍         | 128/3047 [00:58<19:51,  2.45it/s][A
epoch 2 iter 128: train loss 0.18910. lr 2.800705e-04:   4%|▍         | 128/3047 [00:58<19:51,  2.45it/s][A
epoch 2 iter 128: train loss 0.18910. lr 2.800705e-04:   4%|▍         | 129/3047 [00:58<20:05,  2.42it/s][A

epoch 2 iter 162: train loss 0.18901. lr 2.748257e-04:   5%|▌         | 162/3047 [01:13<22:42,  2.12it/s][A
epoch 2 iter 162: train loss 0.18901. lr 2.748257e-04:   5%|▌         | 163/3047 [01:13<22:23,  2.15it/s][A
epoch 2 iter 163: train loss 0.19474. lr 2.746715e-04:   5%|▌         | 163/3047 [01:14<22:23,  2.15it/s][A
epoch 2 iter 163: train loss 0.19474. lr 2.746715e-04:   5%|▌         | 164/3047 [01:14<22:05,  2.18it/s][A
epoch 2 iter 164: train loss 0.18669. lr 2.745174e-04:   5%|▌         | 164/3047 [01:14<22:05,  2.18it/s][A
epoch 2 iter 164: train loss 0.18669. lr 2.745174e-04:   5%|▌         | 165/3047 [01:14<22:12,  2.16it/s][A
epoch 2 iter 165: train loss 0.18736. lr 2.743632e-04:   5%|▌         | 165/3047 [01:15<22:12,  2.16it/s][A
epoch 2 iter 165: train loss 0.18736. lr 2.743632e-04:   5%|▌         | 166/3047 [01:15<21:21,  2.25it/s][A
epoch 2 iter 166: train loss 0.18880. lr 2.742091e-04:   5%|▌         | 166/3047 [01:15<21:21,  2.25it/s][A

epoch 2 iter 199: train loss 0.18522. lr 2.691269e-04:   7%|▋         | 200/3047 [01:30<19:29,  2.44it/s][A
epoch 2 iter 200: train loss 0.18419. lr 2.689730e-04:   7%|▋         | 200/3047 [01:31<19:29,  2.44it/s][A
epoch 2 iter 200: train loss 0.18419. lr 2.689730e-04:   7%|▋         | 201/3047 [01:31<20:06,  2.36it/s][A
epoch 2 iter 201: train loss 0.18548. lr 2.688192e-04:   7%|▋         | 201/3047 [01:31<20:06,  2.36it/s][A
epoch 2 iter 201: train loss 0.18548. lr 2.688192e-04:   7%|▋         | 202/3047 [01:31<20:33,  2.31it/s][A
epoch 2 iter 202: train loss 0.18737. lr 2.686653e-04:   7%|▋         | 202/3047 [01:32<20:33,  2.31it/s][A
epoch 2 iter 202: train loss 0.18737. lr 2.686653e-04:   7%|▋         | 203/3047 [01:32<19:53,  2.38it/s][A
epoch 2 iter 203: train loss 0.18201. lr 2.685115e-04:   7%|▋         | 203/3047 [01:32<19:53,  2.38it/s][A
epoch 2 iter 203: train loss 0.18201. lr 2.685115e-04:   7%|▋         | 204/3047 [01:32<19:24,  2.44it/s][A

epoch 2 iter 237: train loss 0.18488. lr 2.632858e-04:   8%|▊         | 237/3047 [01:48<22:26,  2.09it/s][A
epoch 2 iter 237: train loss 0.18488. lr 2.632858e-04:   8%|▊         | 238/3047 [01:48<22:09,  2.11it/s][A
epoch 2 iter 238: train loss 0.18600. lr 2.631323e-04:   8%|▊         | 238/3047 [01:49<22:09,  2.11it/s][A
epoch 2 iter 238: train loss 0.18600. lr 2.631323e-04:   8%|▊         | 239/3047 [01:49<22:35,  2.07it/s][A
epoch 2 iter 239: train loss 0.18866. lr 2.629788e-04:   8%|▊         | 239/3047 [01:49<22:35,  2.07it/s][A
epoch 2 iter 239: train loss 0.18866. lr 2.629788e-04:   8%|▊         | 240/3047 [01:49<22:04,  2.12it/s][A
epoch 2 iter 240: train loss 0.18629. lr 2.628253e-04:   8%|▊         | 240/3047 [01:50<22:04,  2.12it/s][A
epoch 2 iter 240: train loss 0.18629. lr 2.628253e-04:   8%|▊         | 241/3047 [01:50<21:42,  2.15it/s][A
epoch 2 iter 241: train loss 0.18674. lr 2.626718e-04:   8%|▊         | 241/3047 [01:50<21:42,  2.15it/s][A

epoch 2 iter 274: train loss 0.18399. lr 2.576120e-04:   9%|▉         | 275/3047 [02:05<20:40,  2.23it/s][A
epoch 2 iter 275: train loss 0.18428. lr 2.574588e-04:   9%|▉         | 275/3047 [02:05<20:40,  2.23it/s][A
epoch 2 iter 275: train loss 0.18428. lr 2.574588e-04:   9%|▉         | 276/3047 [02:05<21:29,  2.15it/s][A
epoch 2 iter 276: train loss 0.18419. lr 2.573057e-04:   9%|▉         | 276/3047 [02:06<21:29,  2.15it/s][A
epoch 2 iter 276: train loss 0.18419. lr 2.573057e-04:   9%|▉         | 277/3047 [02:06<22:01,  2.10it/s][A
epoch 2 iter 277: train loss 0.18359. lr 2.571526e-04:   9%|▉         | 277/3047 [02:06<22:01,  2.10it/s][A
epoch 2 iter 277: train loss 0.18359. lr 2.571526e-04:   9%|▉         | 278/3047 [02:06<22:25,  2.06it/s][A
epoch 2 iter 278: train loss 0.18835. lr 2.569995e-04:   9%|▉         | 278/3047 [02:07<22:25,  2.06it/s][A
epoch 2 iter 278: train loss 0.18835. lr 2.569995e-04:   9%|▉         | 279/3047 [02:07<22:41,  2.03it/s][A

epoch 2 iter 312: train loss 0.18007. lr 2.518009e-04:  10%|█         | 312/3047 [02:22<20:30,  2.22it/s][A
epoch 2 iter 312: train loss 0.18007. lr 2.518009e-04:  10%|█         | 313/3047 [02:22<21:01,  2.17it/s][A
epoch 2 iter 313: train loss 0.18103. lr 2.516482e-04:  10%|█         | 313/3047 [02:22<21:01,  2.17it/s][A
epoch 2 iter 313: train loss 0.18103. lr 2.516482e-04:  10%|█         | 314/3047 [02:22<19:55,  2.29it/s][A
epoch 2 iter 314: train loss 0.18009. lr 2.514955e-04:  10%|█         | 314/3047 [02:23<19:55,  2.29it/s][A
epoch 2 iter 314: train loss 0.18009. lr 2.514955e-04:  10%|█         | 315/3047 [02:23<18:40,  2.44it/s][A
epoch 2 iter 315: train loss 0.18210. lr 2.513429e-04:  10%|█         | 315/3047 [02:23<18:40,  2.44it/s][A
epoch 2 iter 315: train loss 0.18210. lr 2.513429e-04:  10%|█         | 316/3047 [02:23<18:01,  2.53it/s][A
epoch 2 iter 316: train loss 0.18304. lr 2.511902e-04:  10%|█         | 316/3047 [02:23<18:01,  2.53it/s][A

epoch 2 iter 349: train loss 0.18003. lr 2.461605e-04:  11%|█▏        | 350/3047 [02:38<19:54,  2.26it/s][A
epoch 2 iter 350: train loss 0.18175. lr 2.460083e-04:  11%|█▏        | 350/3047 [02:38<19:54,  2.26it/s][A
epoch 2 iter 350: train loss 0.18175. lr 2.460083e-04:  12%|█▏        | 351/3047 [02:38<19:14,  2.34it/s][A
epoch 2 iter 351: train loss 0.17755. lr 2.458561e-04:  12%|█▏        | 351/3047 [02:38<19:14,  2.34it/s][A
epoch 2 iter 351: train loss 0.17755. lr 2.458561e-04:  12%|█▏        | 352/3047 [02:38<18:08,  2.48it/s][A
epoch 2 iter 352: train loss 0.17807. lr 2.457040e-04:  12%|█▏        | 352/3047 [02:39<18:08,  2.48it/s][A
epoch 2 iter 352: train loss 0.17807. lr 2.457040e-04:  12%|█▏        | 353/3047 [02:39<17:47,  2.52it/s][A
epoch 2 iter 353: train loss 0.17791. lr 2.455518e-04:  12%|█▏        | 353/3047 [02:39<17:47,  2.52it/s][A
epoch 2 iter 353: train loss 0.17791. lr 2.455518e-04:  12%|█▏        | 354/3047 [02:39<18:16,  2.46it/s][A

epoch 2 iter 387: train loss 0.17777. lr 2.403880e-04:  13%|█▎        | 387/3047 [02:55<21:13,  2.09it/s][A
epoch 2 iter 387: train loss 0.17777. lr 2.403880e-04:  13%|█▎        | 388/3047 [02:55<20:40,  2.14it/s][A
epoch 2 iter 388: train loss 0.17882. lr 2.402364e-04:  13%|█▎        | 388/3047 [02:56<20:40,  2.14it/s][A
epoch 2 iter 388: train loss 0.17882. lr 2.402364e-04:  13%|█▎        | 389/3047 [02:56<20:19,  2.18it/s][A
epoch 2 iter 389: train loss 0.18218. lr 2.400848e-04:  13%|█▎        | 389/3047 [02:56<20:19,  2.18it/s][A
epoch 2 iter 389: train loss 0.18218. lr 2.400848e-04:  13%|█▎        | 390/3047 [02:56<20:07,  2.20it/s][A
epoch 2 iter 390: train loss 0.17787. lr 2.399333e-04:  13%|█▎        | 390/3047 [02:57<20:07,  2.20it/s][A
epoch 2 iter 390: train loss 0.17787. lr 2.399333e-04:  13%|█▎        | 391/3047 [02:57<20:02,  2.21it/s][A
epoch 2 iter 391: train loss 0.17846. lr 2.397817e-04:  13%|█▎        | 391/3047 [02:57<20:02,  2.21it/s][A

epoch 2 iter 424: train loss 0.17735. lr 2.347895e-04:  14%|█▍        | 425/3047 [03:11<18:14,  2.40it/s][A
epoch 2 iter 425: train loss 0.17427. lr 2.346385e-04:  14%|█▍        | 425/3047 [03:12<18:14,  2.40it/s][A
epoch 2 iter 425: train loss 0.17427. lr 2.346385e-04:  14%|█▍        | 426/3047 [03:12<18:20,  2.38it/s][A
epoch 2 iter 426: train loss 0.17818. lr 2.344875e-04:  14%|█▍        | 426/3047 [03:12<18:20,  2.38it/s][A
epoch 2 iter 426: train loss 0.17818. lr 2.344875e-04:  14%|█▍        | 427/3047 [03:12<17:37,  2.48it/s][A
epoch 2 iter 427: train loss 0.17816. lr 2.343365e-04:  14%|█▍        | 427/3047 [03:12<17:37,  2.48it/s][A
epoch 2 iter 427: train loss 0.17816. lr 2.343365e-04:  14%|█▍        | 428/3047 [03:12<17:17,  2.52it/s][A
epoch 2 iter 428: train loss 0.17587. lr 2.341856e-04:  14%|█▍        | 428/3047 [03:13<17:17,  2.52it/s][A
epoch 2 iter 428: train loss 0.17587. lr 2.341856e-04:  14%|█▍        | 429/3047 [03:13<17:31,  2.49it/s][A

epoch 2 iter 462: train loss 0.17691. lr 2.290643e-04:  15%|█▌        | 462/3047 [03:29<19:47,  2.18it/s][A
epoch 2 iter 462: train loss 0.17691. lr 2.290643e-04:  15%|█▌        | 463/3047 [03:29<19:00,  2.27it/s][A
epoch 2 iter 463: train loss 0.17394. lr 2.289140e-04:  15%|█▌        | 463/3047 [03:29<19:00,  2.27it/s][A
epoch 2 iter 463: train loss 0.17394. lr 2.289140e-04:  15%|█▌        | 464/3047 [03:29<18:26,  2.33it/s][A
epoch 2 iter 464: train loss 0.17559. lr 2.287637e-04:  15%|█▌        | 464/3047 [03:30<18:26,  2.33it/s][A
epoch 2 iter 464: train loss 0.17559. lr 2.287637e-04:  15%|█▌        | 465/3047 [03:30<18:01,  2.39it/s][A
epoch 2 iter 465: train loss 0.17924. lr 2.286135e-04:  15%|█▌        | 465/3047 [03:30<18:01,  2.39it/s][A
epoch 2 iter 465: train loss 0.17924. lr 2.286135e-04:  15%|█▌        | 466/3047 [03:30<17:08,  2.51it/s][A
epoch 2 iter 466: train loss 0.17279. lr 2.284632e-04:  15%|█▌        | 466/3047 [03:30<17:08,  2.51it/s][A

epoch 2 iter 499: train loss 0.17190. lr 2.235160e-04:  16%|█▋        | 500/3047 [03:46<18:58,  2.24it/s][A
epoch 2 iter 500: train loss 0.17363. lr 2.233664e-04:  16%|█▋        | 500/3047 [03:46<18:58,  2.24it/s][A
epoch 2 iter 500: train loss 0.17363. lr 2.233664e-04:  16%|█▋        | 501/3047 [03:46<18:46,  2.26it/s][A
epoch 2 iter 501: train loss 0.17308. lr 2.232169e-04:  16%|█▋        | 501/3047 [03:46<18:46,  2.26it/s][A
epoch 2 iter 501: train loss 0.17308. lr 2.232169e-04:  16%|█▋        | 502/3047 [03:46<18:39,  2.27it/s][A
epoch 2 iter 502: train loss 0.17488. lr 2.230673e-04:  16%|█▋        | 502/3047 [03:47<18:39,  2.27it/s][A
epoch 2 iter 502: train loss 0.17488. lr 2.230673e-04:  17%|█▋        | 503/3047 [03:47<19:00,  2.23it/s][A
epoch 2 iter 503: train loss 0.16915. lr 2.229178e-04:  17%|█▋        | 503/3047 [03:47<19:00,  2.23it/s][A
epoch 2 iter 503: train loss 0.16915. lr 2.229178e-04:  17%|█▋        | 504/3047 [03:47<19:07,  2.22it/s][A

epoch 2 iter 537: train loss 0.17154. lr 2.178467e-04:  18%|█▊        | 537/3047 [04:04<19:14,  2.17it/s][A
epoch 2 iter 537: train loss 0.17154. lr 2.178467e-04:  18%|█▊        | 538/3047 [04:04<18:56,  2.21it/s][A
epoch 2 iter 538: train loss 0.17311. lr 2.176979e-04:  18%|█▊        | 538/3047 [04:04<18:56,  2.21it/s][A
epoch 2 iter 538: train loss 0.17311. lr 2.176979e-04:  18%|█▊        | 539/3047 [04:04<18:36,  2.25it/s][A
epoch 2 iter 539: train loss 0.17449. lr 2.175492e-04:  18%|█▊        | 539/3047 [04:05<18:36,  2.25it/s][A
epoch 2 iter 539: train loss 0.17449. lr 2.175492e-04:  18%|█▊        | 540/3047 [04:05<18:25,  2.27it/s][A
epoch 2 iter 540: train loss 0.17008. lr 2.174004e-04:  18%|█▊        | 540/3047 [04:05<18:25,  2.27it/s][A
epoch 2 iter 540: train loss 0.17008. lr 2.174004e-04:  18%|█▊        | 541/3047 [04:05<18:20,  2.28it/s][A
epoch 2 iter 541: train loss 0.16995. lr 2.172517e-04:  18%|█▊        | 541/3047 [04:05<18:20,  2.28it/s][A

epoch 2 iter 574: train loss 0.17008. lr 2.123569e-04:  19%|█▉        | 575/3047 [04:19<16:15,  2.53it/s][A
epoch 2 iter 575: train loss 0.17242. lr 2.122090e-04:  19%|█▉        | 575/3047 [04:20<16:15,  2.53it/s][A
epoch 2 iter 575: train loss 0.17242. lr 2.122090e-04:  19%|█▉        | 576/3047 [04:20<16:53,  2.44it/s][A
epoch 2 iter 576: train loss 0.17079. lr 2.120611e-04:  19%|█▉        | 576/3047 [04:20<16:53,  2.44it/s][A
epoch 2 iter 576: train loss 0.17079. lr 2.120611e-04:  19%|█▉        | 577/3047 [04:20<17:44,  2.32it/s][A
epoch 2 iter 577: train loss 0.17000. lr 2.119132e-04:  19%|█▉        | 577/3047 [04:21<17:44,  2.32it/s][A
epoch 2 iter 577: train loss 0.17000. lr 2.119132e-04:  19%|█▉        | 578/3047 [04:21<18:43,  2.20it/s][A
epoch 2 iter 578: train loss 0.17089. lr 2.117653e-04:  19%|█▉        | 578/3047 [04:21<18:43,  2.20it/s][A
epoch 2 iter 578: train loss 0.17089. lr 2.117653e-04:  19%|█▉        | 579/3047 [04:21<19:01,  2.16it/s][A

epoch 2 iter 612: train loss 0.16948. lr 2.067520e-04:  20%|██        | 612/3047 [04:36<16:42,  2.43it/s][A
epoch 2 iter 612: train loss 0.16948. lr 2.067520e-04:  20%|██        | 613/3047 [04:36<16:32,  2.45it/s][A
epoch 2 iter 613: train loss 0.16916. lr 2.066049e-04:  20%|██        | 613/3047 [04:37<16:32,  2.45it/s][A
epoch 2 iter 613: train loss 0.16916. lr 2.066049e-04:  20%|██        | 614/3047 [04:37<17:48,  2.28it/s][A
epoch 2 iter 614: train loss 0.17075. lr 2.064579e-04:  20%|██        | 614/3047 [04:37<17:48,  2.28it/s][A
epoch 2 iter 614: train loss 0.17075. lr 2.064579e-04:  20%|██        | 615/3047 [04:37<17:17,  2.35it/s][A
epoch 2 iter 615: train loss 0.16607. lr 2.063110e-04:  20%|██        | 615/3047 [04:38<17:17,  2.35it/s][A
epoch 2 iter 615: train loss 0.16607. lr 2.063110e-04:  20%|██        | 616/3047 [04:38<16:54,  2.40it/s][A
epoch 2 iter 616: train loss 0.16719. lr 2.061640e-04:  20%|██        | 616/3047 [04:38<16:54,  2.40it/s][A

epoch 2 iter 649: train loss 0.17058. lr 2.013289e-04:  21%|██▏       | 650/3047 [04:55<19:03,  2.10it/s][A
epoch 2 iter 650: train loss 0.16843. lr 2.011828e-04:  21%|██▏       | 650/3047 [04:55<19:03,  2.10it/s][A
epoch 2 iter 650: train loss 0.16843. lr 2.011828e-04:  21%|██▏       | 651/3047 [04:55<17:31,  2.28it/s][A
epoch 2 iter 651: train loss 0.16845. lr 2.010368e-04:  21%|██▏       | 651/3047 [04:55<17:31,  2.28it/s][A
epoch 2 iter 651: train loss 0.16845. lr 2.010368e-04:  21%|██▏       | 652/3047 [04:55<16:56,  2.36it/s][A
epoch 2 iter 652: train loss 0.16826. lr 2.008907e-04:  21%|██▏       | 652/3047 [04:56<16:56,  2.36it/s][A
epoch 2 iter 652: train loss 0.16826. lr 2.008907e-04:  21%|██▏       | 653/3047 [04:56<16:28,  2.42it/s][A
epoch 2 iter 653: train loss 0.16872. lr 2.007447e-04:  21%|██▏       | 653/3047 [04:56<16:28,  2.42it/s][A
epoch 2 iter 653: train loss 0.16872. lr 2.007447e-04:  21%|██▏       | 654/3047 [04:56<16:30,  2.42it/s][A

epoch 2 iter 687: train loss 0.16490. lr 1.957967e-04:  23%|██▎       | 687/3047 [05:11<19:14,  2.04it/s][A
epoch 2 iter 687: train loss 0.16490. lr 1.957967e-04:  23%|██▎       | 688/3047 [05:11<19:25,  2.02it/s][A
epoch 2 iter 688: train loss 0.16747. lr 1.956516e-04:  23%|██▎       | 688/3047 [05:11<19:25,  2.02it/s][A
epoch 2 iter 688: train loss 0.16747. lr 1.956516e-04:  23%|██▎       | 689/3047 [05:11<19:33,  2.01it/s][A
epoch 2 iter 689: train loss 0.16990. lr 1.955066e-04:  23%|██▎       | 689/3047 [05:12<19:33,  2.01it/s][A
epoch 2 iter 689: train loss 0.16990. lr 1.955066e-04:  23%|██▎       | 690/3047 [05:12<19:41,  1.99it/s][A
epoch 2 iter 690: train loss 0.16787. lr 1.953616e-04:  23%|██▎       | 690/3047 [05:12<19:41,  1.99it/s][A
epoch 2 iter 690: train loss 0.16787. lr 1.953616e-04:  23%|██▎       | 691/3047 [05:12<19:18,  2.03it/s][A
epoch 2 iter 691: train loss 0.16588. lr 1.952166e-04:  23%|██▎       | 691/3047 [05:13<19:18,  2.03it/s][A

epoch 2 iter 724: train loss 0.16324. lr 1.904485e-04:  24%|██▍       | 725/3047 [05:27<16:00,  2.42it/s][A
epoch 2 iter 725: train loss 0.16220. lr 1.903045e-04:  24%|██▍       | 725/3047 [05:28<16:00,  2.42it/s][A
epoch 2 iter 725: train loss 0.16220. lr 1.903045e-04:  24%|██▍       | 726/3047 [05:28<16:13,  2.38it/s][A
epoch 2 iter 726: train loss 0.16707. lr 1.901605e-04:  24%|██▍       | 726/3047 [05:28<16:13,  2.38it/s][A
epoch 2 iter 726: train loss 0.16707. lr 1.901605e-04:  24%|██▍       | 727/3047 [05:28<15:57,  2.42it/s][A
epoch 2 iter 727: train loss 0.16573. lr 1.900165e-04:  24%|██▍       | 727/3047 [05:28<15:57,  2.42it/s][A
epoch 2 iter 727: train loss 0.16573. lr 1.900165e-04:  24%|██▍       | 728/3047 [05:28<15:14,  2.54it/s][A
epoch 2 iter 728: train loss 0.16541. lr 1.898726e-04:  24%|██▍       | 728/3047 [05:29<15:14,  2.54it/s][A
epoch 2 iter 728: train loss 0.16541. lr 1.898726e-04:  24%|██▍       | 729/3047 [05:29<15:00,  2.57it/s][A

epoch 2 iter 762: train loss 0.16360. lr 1.849972e-04:  25%|██▌       | 762/3047 [05:45<16:26,  2.32it/s][A
epoch 2 iter 762: train loss 0.16360. lr 1.849972e-04:  25%|██▌       | 763/3047 [05:45<16:19,  2.33it/s][A
epoch 2 iter 763: train loss 0.16088. lr 1.848544e-04:  25%|██▌       | 763/3047 [05:45<16:19,  2.33it/s][A
epoch 2 iter 763: train loss 0.16088. lr 1.848544e-04:  25%|██▌       | 764/3047 [05:45<16:17,  2.34it/s][A
epoch 2 iter 764: train loss 0.16348. lr 1.847115e-04:  25%|██▌       | 764/3047 [05:46<16:17,  2.34it/s][A
epoch 2 iter 764: train loss 0.16348. lr 1.847115e-04:  25%|██▌       | 765/3047 [05:46<16:12,  2.35it/s][A
epoch 2 iter 765: train loss 0.16147. lr 1.845687e-04:  25%|██▌       | 765/3047 [05:46<16:12,  2.35it/s][A
epoch 2 iter 765: train loss 0.16147. lr 1.845687e-04:  25%|██▌       | 766/3047 [05:46<16:14,  2.34it/s][A
epoch 2 iter 766: train loss 0.16353. lr 1.844259e-04:  25%|██▌       | 766/3047 [05:47<16:14,  2.34it/s][A

epoch 2 iter 799: train loss 0.16244. lr 1.797319e-04:  26%|██▋       | 800/3047 [06:02<15:31,  2.41it/s][A
epoch 2 iter 800: train loss 0.16127. lr 1.795901e-04:  26%|██▋       | 800/3047 [06:02<15:31,  2.41it/s][A
epoch 2 iter 800: train loss 0.16127. lr 1.795901e-04:  26%|██▋       | 801/3047 [06:02<15:11,  2.46it/s][A
epoch 2 iter 801: train loss 0.16539. lr 1.794485e-04:  26%|██▋       | 801/3047 [06:03<15:11,  2.46it/s][A
epoch 2 iter 801: train loss 0.16539. lr 1.794485e-04:  26%|██▋       | 802/3047 [06:03<15:03,  2.49it/s][A
epoch 2 iter 802: train loss 0.16284. lr 1.793068e-04:  26%|██▋       | 802/3047 [06:03<15:03,  2.49it/s][A
epoch 2 iter 802: train loss 0.16284. lr 1.793068e-04:  26%|██▋       | 803/3047 [06:03<14:53,  2.51it/s][A
epoch 2 iter 803: train loss 0.16042. lr 1.791652e-04:  26%|██▋       | 803/3047 [06:04<14:53,  2.51it/s][A
epoch 2 iter 803: train loss 0.16042. lr 1.791652e-04:  26%|██▋       | 804/3047 [06:04<14:51,  2.52it/s][A

epoch 2 iter 837: train loss 0.16097. lr 1.743698e-04:  27%|██▋       | 837/3047 [06:20<19:19,  1.91it/s][A
epoch 2 iter 837: train loss 0.16097. lr 1.743698e-04:  28%|██▊       | 838/3047 [06:20<18:40,  1.97it/s][A
epoch 2 iter 838: train loss 0.16068. lr 1.742293e-04:  28%|██▊       | 838/3047 [06:21<18:40,  1.97it/s][A
epoch 2 iter 838: train loss 0.16068. lr 1.742293e-04:  28%|██▊       | 839/3047 [06:21<18:10,  2.02it/s][A
epoch 2 iter 839: train loss 0.16182. lr 1.740889e-04:  28%|██▊       | 839/3047 [06:21<18:10,  2.02it/s][A
epoch 2 iter 839: train loss 0.16182. lr 1.740889e-04:  28%|██▊       | 840/3047 [06:21<17:40,  2.08it/s][A
epoch 2 iter 840: train loss 0.16374. lr 1.739485e-04:  28%|██▊       | 840/3047 [06:22<17:40,  2.08it/s][A
epoch 2 iter 840: train loss 0.16374. lr 1.739485e-04:  28%|██▊       | 841/3047 [06:22<17:10,  2.14it/s][A
epoch 2 iter 841: train loss 0.15517. lr 1.738081e-04:  28%|██▊       | 841/3047 [06:22<17:10,  2.14it/s][A

epoch 2 iter 874: train loss 0.16181. lr 1.691951e-04:  29%|██▊       | 875/3047 [06:36<14:36,  2.48it/s][A
epoch 2 iter 875: train loss 0.15893. lr 1.690559e-04:  29%|██▊       | 875/3047 [06:37<14:36,  2.48it/s][A
epoch 2 iter 875: train loss 0.15893. lr 1.690559e-04:  29%|██▊       | 876/3047 [06:37<14:37,  2.47it/s][A
epoch 2 iter 876: train loss 0.16129. lr 1.689168e-04:  29%|██▊       | 876/3047 [06:37<14:37,  2.47it/s][A
epoch 2 iter 876: train loss 0.16129. lr 1.689168e-04:  29%|██▉       | 877/3047 [06:37<14:35,  2.48it/s][A
epoch 2 iter 877: train loss 0.15890. lr 1.687776e-04:  29%|██▉       | 877/3047 [06:37<14:35,  2.48it/s][A
epoch 2 iter 877: train loss 0.15890. lr 1.687776e-04:  29%|██▉       | 878/3047 [06:37<14:14,  2.54it/s][A
epoch 2 iter 878: train loss 0.16311. lr 1.686385e-04:  29%|██▉       | 878/3047 [06:38<14:14,  2.54it/s][A
epoch 2 iter 878: train loss 0.16311. lr 1.686385e-04:  29%|██▉       | 879/3047 [06:38<13:52,  2.60it/s][A

epoch 2 iter 912: train loss 0.15954. lr 1.639302e-04:  30%|██▉       | 912/3047 [06:54<17:29,  2.03it/s][A
epoch 2 iter 912: train loss 0.15954. lr 1.639302e-04:  30%|██▉       | 913/3047 [06:54<17:44,  2.00it/s][A
epoch 2 iter 913: train loss 0.15975. lr 1.637924e-04:  30%|██▉       | 913/3047 [06:54<17:44,  2.00it/s][A
epoch 2 iter 913: train loss 0.15975. lr 1.637924e-04:  30%|██▉       | 914/3047 [06:54<17:17,  2.06it/s][A
epoch 2 iter 914: train loss 0.16095. lr 1.636545e-04:  30%|██▉       | 914/3047 [06:55<17:17,  2.06it/s][A
epoch 2 iter 914: train loss 0.16095. lr 1.636545e-04:  30%|███       | 915/3047 [06:55<16:44,  2.12it/s][A
epoch 2 iter 915: train loss 0.16145. lr 1.635168e-04:  30%|███       | 915/3047 [06:55<16:44,  2.12it/s][A
epoch 2 iter 915: train loss 0.16145. lr 1.635168e-04:  30%|███       | 916/3047 [06:55<16:19,  2.18it/s][A
epoch 2 iter 916: train loss 0.16068. lr 1.633790e-04:  30%|███       | 916/3047 [06:56<16:19,  2.18it/s][A

epoch 2 iter 949: train loss 0.15927. lr 1.588540e-04:  31%|███       | 950/3047 [07:10<13:17,  2.63it/s][A
epoch 2 iter 950: train loss 0.15825. lr 1.587175e-04:  31%|███       | 950/3047 [07:11<13:17,  2.63it/s][A
epoch 2 iter 950: train loss 0.15825. lr 1.587175e-04:  31%|███       | 951/3047 [07:11<13:06,  2.67it/s][A
epoch 2 iter 951: train loss 0.15819. lr 1.585811e-04:  31%|███       | 951/3047 [07:11<13:06,  2.67it/s][A
epoch 2 iter 951: train loss 0.15819. lr 1.585811e-04:  31%|███       | 952/3047 [07:11<13:10,  2.65it/s][A
epoch 2 iter 952: train loss 0.15807. lr 1.584447e-04:  31%|███       | 952/3047 [07:11<13:10,  2.65it/s][A
epoch 2 iter 952: train loss 0.15807. lr 1.584447e-04:  31%|███▏      | 953/3047 [07:11<13:17,  2.62it/s][A
epoch 2 iter 953: train loss 0.15933. lr 1.583083e-04:  31%|███▏      | 953/3047 [07:12<13:17,  2.62it/s][A
epoch 2 iter 953: train loss 0.15933. lr 1.583083e-04:  31%|███▏      | 954/3047 [07:12<13:28,  2.59it/s][A

epoch 2 iter 987: train loss 0.15745. lr 1.536941e-04:  32%|███▏      | 987/3047 [07:27<13:33,  2.53it/s][A
epoch 2 iter 987: train loss 0.15745. lr 1.536941e-04:  32%|███▏      | 988/3047 [07:27<13:45,  2.50it/s][A
epoch 2 iter 988: train loss 0.16077. lr 1.535591e-04:  32%|███▏      | 988/3047 [07:27<13:45,  2.50it/s][A
epoch 2 iter 988: train loss 0.16077. lr 1.535591e-04:  32%|███▏      | 989/3047 [07:27<13:52,  2.47it/s][A
epoch 2 iter 989: train loss 0.15771. lr 1.534241e-04:  32%|███▏      | 989/3047 [07:28<13:52,  2.47it/s][A
epoch 2 iter 989: train loss 0.15771. lr 1.534241e-04:  32%|███▏      | 990/3047 [07:28<13:56,  2.46it/s][A
epoch 2 iter 990: train loss 0.15586. lr 1.532891e-04:  32%|███▏      | 990/3047 [07:28<13:56,  2.46it/s][A
epoch 2 iter 990: train loss 0.15586. lr 1.532891e-04:  33%|███▎      | 991/3047 [07:28<14:24,  2.38it/s][A
epoch 2 iter 991: train loss 0.15883. lr 1.531542e-04:  33%|███▎      | 991/3047 [07:29<14:24,  2.38it/s][A

epoch 2 iter 1024: train loss 0.15717. lr 1.487240e-04:  34%|███▎      | 1024/3047 [07:43<16:17,  2.07it/s][A
epoch 2 iter 1024: train loss 0.15717. lr 1.487240e-04:  34%|███▎      | 1025/3047 [07:43<16:04,  2.10it/s][A
epoch 2 iter 1025: train loss 0.15615. lr 1.485904e-04:  34%|███▎      | 1025/3047 [07:43<16:04,  2.10it/s][A
epoch 2 iter 1025: train loss 0.15615. lr 1.485904e-04:  34%|███▎      | 1026/3047 [07:43<16:04,  2.10it/s][A
epoch 2 iter 1026: train loss 0.15623. lr 1.484569e-04:  34%|███▎      | 1026/3047 [07:44<16:04,  2.10it/s][A
epoch 2 iter 1026: train loss 0.15623. lr 1.484569e-04:  34%|███▎      | 1027/3047 [07:44<16:02,  2.10it/s][A
epoch 2 iter 1027: train loss 0.15476. lr 1.483234e-04:  34%|███▎      | 1027/3047 [07:44<16:02,  2.10it/s][A
epoch 2 iter 1027: train loss 0.15476. lr 1.483234e-04:  34%|███▎      | 1028/3047 [07:44<15:59,  2.10it/s][A
epoch 2 iter 1028: train loss 0.15725. lr 1.481900e-04:  34%|███▎      | 1028/3047 [07:45<15:59,  2.10it/s][A

epoch 2 iter 1060: train loss 0.15385. lr 1.439410e-04:  35%|███▍      | 1061/3047 [07:59<13:20,  2.48it/s][A
epoch 2 iter 1061: train loss 0.15819. lr 1.438089e-04:  35%|███▍      | 1061/3047 [08:00<13:20,  2.48it/s][A
epoch 2 iter 1061: train loss 0.15819. lr 1.438089e-04:  35%|███▍      | 1062/3047 [08:00<13:25,  2.47it/s][A
epoch 2 iter 1062: train loss 0.15492. lr 1.436769e-04:  35%|███▍      | 1062/3047 [08:00<13:25,  2.47it/s][A
epoch 2 iter 1062: train loss 0.15492. lr 1.436769e-04:  35%|███▍      | 1063/3047 [08:00<13:36,  2.43it/s][A
epoch 2 iter 1063: train loss 0.15583. lr 1.435449e-04:  35%|███▍      | 1063/3047 [08:00<13:36,  2.43it/s][A
epoch 2 iter 1063: train loss 0.15583. lr 1.435449e-04:  35%|███▍      | 1064/3047 [08:00<13:42,  2.41it/s][A
epoch 2 iter 1064: train loss 0.15568. lr 1.434129e-04:  35%|███▍      | 1064/3047 [08:01<13:42,  2.41it/s][A
epoch 2 iter 1064: train loss 0.15568. lr 1.434129e-04:  35%|███▍      | 1065/3047 [08:01<14:09,  2.33it/s][A

epoch 2 iter 1097: train loss 0.15390. lr 1.390813e-04:  36%|███▌      | 1097/3047 [08:15<13:56,  2.33it/s][A
epoch 2 iter 1097: train loss 0.15390. lr 1.390813e-04:  36%|███▌      | 1098/3047 [08:15<14:14,  2.28it/s][A
epoch 2 iter 1098: train loss 0.15431. lr 1.389507e-04:  36%|███▌      | 1098/3047 [08:16<14:14,  2.28it/s][A
epoch 2 iter 1098: train loss 0.15431. lr 1.389507e-04:  36%|███▌      | 1099/3047 [08:16<14:26,  2.25it/s][A
epoch 2 iter 1099: train loss 0.15470. lr 1.388202e-04:  36%|███▌      | 1099/3047 [08:16<14:26,  2.25it/s][A
epoch 2 iter 1099: train loss 0.15470. lr 1.388202e-04:  36%|███▌      | 1100/3047 [08:16<14:26,  2.25it/s][A
epoch 2 iter 1100: train loss 0.15473. lr 1.386898e-04:  36%|███▌      | 1100/3047 [08:17<14:26,  2.25it/s][A
epoch 2 iter 1100: train loss 0.15473. lr 1.386898e-04:  36%|███▌      | 1101/3047 [08:17<14:28,  2.24it/s][A
epoch 2 iter 1101: train loss 0.15592. lr 1.385594e-04:  36%|███▌      | 1101/3047 [08:17<14:28,  2.24it/s][A

epoch 2 iter 1133: train loss 0.15482. lr 1.344091e-04:  37%|███▋      | 1134/3047 [08:32<13:22,  2.38it/s][A
epoch 2 iter 1134: train loss 0.15584. lr 1.342801e-04:  37%|███▋      | 1134/3047 [08:32<13:22,  2.38it/s][A
epoch 2 iter 1134: train loss 0.15584. lr 1.342801e-04:  37%|███▋      | 1135/3047 [08:32<13:10,  2.42it/s][A
epoch 2 iter 1135: train loss 0.15269. lr 1.341512e-04:  37%|███▋      | 1135/3047 [08:33<13:10,  2.42it/s][A
epoch 2 iter 1135: train loss 0.15269. lr 1.341512e-04:  37%|███▋      | 1136/3047 [08:33<13:00,  2.45it/s][A
epoch 2 iter 1136: train loss 0.15507. lr 1.340223e-04:  37%|███▋      | 1136/3047 [08:33<13:00,  2.45it/s][A
epoch 2 iter 1136: train loss 0.15507. lr 1.340223e-04:  37%|███▋      | 1137/3047 [08:33<12:35,  2.53it/s][A
epoch 2 iter 1137: train loss 0.15478. lr 1.338934e-04:  37%|███▋      | 1137/3047 [08:34<12:35,  2.53it/s][A
epoch 2 iter 1137: train loss 0.15478. lr 1.338934e-04:  37%|███▋      | 1138/3047 [08:34<12:17,  2.59it/s][A

epoch 2 iter 1170: train loss 0.15103. lr 1.296665e-04:  38%|███▊      | 1170/3047 [08:47<12:51,  2.43it/s][A
epoch 2 iter 1170: train loss 0.15103. lr 1.296665e-04:  38%|███▊      | 1171/3047 [08:47<12:44,  2.45it/s][A
epoch 2 iter 1171: train loss 0.15144. lr 1.295392e-04:  38%|███▊      | 1171/3047 [08:48<12:44,  2.45it/s][A
epoch 2 iter 1171: train loss 0.15144. lr 1.295392e-04:  38%|███▊      | 1172/3047 [08:48<12:40,  2.46it/s][A
epoch 2 iter 1172: train loss 0.15270. lr 1.294119e-04:  38%|███▊      | 1172/3047 [08:48<12:40,  2.46it/s][A
epoch 2 iter 1172: train loss 0.15270. lr 1.294119e-04:  38%|███▊      | 1173/3047 [08:48<12:46,  2.45it/s][A
epoch 2 iter 1173: train loss 0.15439. lr 1.292847e-04:  38%|███▊      | 1173/3047 [08:49<12:46,  2.45it/s][A
epoch 2 iter 1173: train loss 0.15439. lr 1.292847e-04:  39%|███▊      | 1174/3047 [08:49<12:51,  2.43it/s][A
epoch 2 iter 1174: train loss 0.15234. lr 1.291575e-04:  39%|███▊      | 1174/3047 [08:49<12:51,  2.43it/s][A

epoch 2 iter 1206: train loss 0.15241. lr 1.251117e-04:  40%|███▉      | 1207/3047 [09:02<13:04,  2.34it/s][A
epoch 2 iter 1207: train loss 0.14808. lr 1.249860e-04:  40%|███▉      | 1207/3047 [09:03<13:04,  2.34it/s][A
epoch 2 iter 1207: train loss 0.14808. lr 1.249860e-04:  40%|███▉      | 1208/3047 [09:03<12:58,  2.36it/s][A
epoch 2 iter 1208: train loss 0.15179. lr 1.248604e-04:  40%|███▉      | 1208/3047 [09:03<12:58,  2.36it/s][A
epoch 2 iter 1208: train loss 0.15179. lr 1.248604e-04:  40%|███▉      | 1209/3047 [09:03<12:59,  2.36it/s][A
epoch 2 iter 1209: train loss 0.14993. lr 1.247348e-04:  40%|███▉      | 1209/3047 [09:04<12:59,  2.36it/s][A
epoch 2 iter 1209: train loss 0.14993. lr 1.247348e-04:  40%|███▉      | 1210/3047 [09:04<13:04,  2.34it/s][A
epoch 2 iter 1210: train loss 0.15122. lr 1.246093e-04:  40%|███▉      | 1210/3047 [09:04<13:04,  2.34it/s][A
epoch 2 iter 1210: train loss 0.15122. lr 1.246093e-04:  40%|███▉      | 1211/3047 [09:04<13:06,  2.33it/s][A

epoch 2 iter 1243: train loss 0.15251. lr 1.204932e-04:  41%|████      | 1243/3047 [09:20<12:26,  2.42it/s][A
epoch 2 iter 1243: train loss 0.15251. lr 1.204932e-04:  41%|████      | 1244/3047 [09:20<12:17,  2.44it/s][A
epoch 2 iter 1244: train loss 0.15074. lr 1.203692e-04:  41%|████      | 1244/3047 [09:20<12:17,  2.44it/s][A
epoch 2 iter 1244: train loss 0.15074. lr 1.203692e-04:  41%|████      | 1245/3047 [09:20<12:14,  2.45it/s][A
epoch 2 iter 1245: train loss 0.15018. lr 1.202454e-04:  41%|████      | 1245/3047 [09:21<12:14,  2.45it/s][A
epoch 2 iter 1245: train loss 0.15018. lr 1.202454e-04:  41%|████      | 1246/3047 [09:21<12:14,  2.45it/s][A
epoch 2 iter 1246: train loss 0.15198. lr 1.201215e-04:  41%|████      | 1246/3047 [09:21<12:14,  2.45it/s][A
epoch 2 iter 1246: train loss 0.15198. lr 1.201215e-04:  41%|████      | 1247/3047 [09:21<12:12,  2.46it/s][A
epoch 2 iter 1247: train loss 0.15159. lr 1.199977e-04:  41%|████      | 1247/3047 [09:21<12:12,  2.46it/s][A

epoch 2 iter 1279: train loss 0.15168. lr 1.160621e-04:  42%|████▏     | 1280/3047 [09:35<12:51,  2.29it/s][A
epoch 2 iter 1280: train loss 0.14978. lr 1.159400e-04:  42%|████▏     | 1280/3047 [09:36<12:51,  2.29it/s][A
epoch 2 iter 1280: train loss 0.14978. lr 1.159400e-04:  42%|████▏     | 1281/3047 [09:36<12:55,  2.28it/s][A
epoch 2 iter 1281: train loss 0.14952. lr 1.158178e-04:  42%|████▏     | 1281/3047 [09:36<12:55,  2.28it/s][A
epoch 2 iter 1281: train loss 0.14952. lr 1.158178e-04:  42%|████▏     | 1282/3047 [09:36<13:02,  2.26it/s][A
epoch 2 iter 1282: train loss 0.15112. lr 1.156957e-04:  42%|████▏     | 1282/3047 [09:37<13:02,  2.26it/s][A
epoch 2 iter 1282: train loss 0.15112. lr 1.156957e-04:  42%|████▏     | 1283/3047 [09:37<13:04,  2.25it/s][A
epoch 2 iter 1283: train loss 0.15161. lr 1.155737e-04:  42%|████▏     | 1283/3047 [09:37<13:04,  2.25it/s][A
epoch 2 iter 1283: train loss 0.15161. lr 1.155737e-04:  42%|████▏     | 1284/3047 [09:37<12:57,  2.27it/s][A

epoch 2 iter 1316: train loss 0.14971. lr 1.115741e-04:  43%|████▎     | 1316/3047 [09:51<12:04,  2.39it/s][A
epoch 2 iter 1316: train loss 0.14971. lr 1.115741e-04:  43%|████▎     | 1317/3047 [09:51<13:07,  2.20it/s][A
epoch 2 iter 1317: train loss 0.14992. lr 1.114538e-04:  43%|████▎     | 1317/3047 [09:52<13:07,  2.20it/s][A
epoch 2 iter 1317: train loss 0.14992. lr 1.114538e-04:  43%|████▎     | 1318/3047 [09:52<13:50,  2.08it/s][A
epoch 2 iter 1318: train loss 0.14985. lr 1.113335e-04:  43%|████▎     | 1318/3047 [09:52<13:50,  2.08it/s][A
epoch 2 iter 1318: train loss 0.14985. lr 1.113335e-04:  43%|████▎     | 1319/3047 [09:52<13:57,  2.06it/s][A
epoch 2 iter 1319: train loss 0.14702. lr 1.112132e-04:  43%|████▎     | 1319/3047 [09:53<13:57,  2.06it/s][A
epoch 2 iter 1319: train loss 0.14702. lr 1.112132e-04:  43%|████▎     | 1320/3047 [09:53<14:01,  2.05it/s][A
epoch 2 iter 1320: train loss 0.14743. lr 1.110930e-04:  43%|████▎     | 1320/3047 [09:53<14:01,  2.05it/s][A

epoch 2 iter 1352: train loss 0.14876. lr 1.072732e-04:  44%|████▍     | 1353/3047 [10:08<13:14,  2.13it/s][A
epoch 2 iter 1353: train loss 0.14951. lr 1.071547e-04:  44%|████▍     | 1353/3047 [10:09<13:14,  2.13it/s][A
epoch 2 iter 1353: train loss 0.14951. lr 1.071547e-04:  44%|████▍     | 1354/3047 [10:09<12:38,  2.23it/s][A
epoch 2 iter 1354: train loss 0.14875. lr 1.070362e-04:  44%|████▍     | 1354/3047 [10:09<12:38,  2.23it/s][A
epoch 2 iter 1354: train loss 0.14875. lr 1.070362e-04:  44%|████▍     | 1355/3047 [10:09<12:13,  2.31it/s][A
epoch 2 iter 1355: train loss 0.14867. lr 1.069178e-04:  44%|████▍     | 1355/3047 [10:10<12:13,  2.31it/s][A
epoch 2 iter 1355: train loss 0.14867. lr 1.069178e-04:  45%|████▍     | 1356/3047 [10:10<11:34,  2.44it/s][A
epoch 2 iter 1356: train loss 0.14853. lr 1.067994e-04:  45%|████▍     | 1356/3047 [10:10<11:34,  2.44it/s][A
epoch 2 iter 1356: train loss 0.14853. lr 1.067994e-04:  45%|████▍     | 1357/3047 [10:10<11:15,  2.50it/s][A

epoch 2 iter 1389: train loss 0.14849. lr 1.029220e-04:  46%|████▌     | 1389/3047 [10:27<15:36,  1.77it/s][A
epoch 2 iter 1389: train loss 0.14849. lr 1.029220e-04:  46%|████▌     | 1390/3047 [10:27<15:01,  1.84it/s][A
epoch 2 iter 1390: train loss 0.14554. lr 1.028054e-04:  46%|████▌     | 1390/3047 [10:27<15:01,  1.84it/s][A
epoch 2 iter 1390: train loss 0.14554. lr 1.028054e-04:  46%|████▌     | 1391/3047 [10:27<14:33,  1.90it/s][A
epoch 2 iter 1391: train loss 0.14519. lr 1.026889e-04:  46%|████▌     | 1391/3047 [10:28<14:33,  1.90it/s][A
epoch 2 iter 1391: train loss 0.14519. lr 1.026889e-04:  46%|████▌     | 1392/3047 [10:28<14:09,  1.95it/s][A
epoch 2 iter 1392: train loss 0.14602. lr 1.025724e-04:  46%|████▌     | 1392/3047 [10:28<14:09,  1.95it/s][A
epoch 2 iter 1392: train loss 0.14602. lr 1.025724e-04:  46%|████▌     | 1393/3047 [10:28<13:36,  2.03it/s][A
epoch 2 iter 1393: train loss 0.14376. lr 1.024559e-04:  46%|████▌     | 1393/3047 [10:29<13:36,  2.03it/s][A

epoch 2 iter 1425: train loss 0.14660. lr 9.875731e-05:  47%|████▋     | 1426/3047 [10:42<11:31,  2.34it/s][A
epoch 2 iter 1426: train loss 0.14685. lr 9.864261e-05:  47%|████▋     | 1426/3047 [10:42<11:31,  2.34it/s][A
epoch 2 iter 1426: train loss 0.14685. lr 9.864261e-05:  47%|████▋     | 1427/3047 [10:43<11:26,  2.36it/s][A
epoch 2 iter 1427: train loss 0.14679. lr 9.852796e-05:  47%|████▋     | 1427/3047 [10:43<11:26,  2.36it/s][A
epoch 2 iter 1427: train loss 0.14679. lr 9.852796e-05:  47%|████▋     | 1428/3047 [10:43<11:26,  2.36it/s][A
epoch 2 iter 1428: train loss 0.14715. lr 9.841337e-05:  47%|████▋     | 1428/3047 [10:43<11:26,  2.36it/s][A
epoch 2 iter 1428: train loss 0.14715. lr 9.841337e-05:  47%|████▋     | 1429/3047 [10:43<11:26,  2.36it/s][A
epoch 2 iter 1429: train loss 0.14645. lr 9.829882e-05:  47%|████▋     | 1429/3047 [10:44<11:26,  2.36it/s][A
epoch 2 iter 1429: train loss 0.14645. lr 9.829882e-05:  47%|████▋     | 1430/3047 [10:44<11:27,  2.35it/s][A

epoch 2 iter 1462: train loss 0.14458. lr 9.454919e-05:  48%|████▊     | 1462/3047 [10:58<11:47,  2.24it/s][A
epoch 2 iter 1462: train loss 0.14458. lr 9.454919e-05:  48%|████▊     | 1463/3047 [10:58<11:41,  2.26it/s][A
epoch 2 iter 1463: train loss 0.14660. lr 9.443649e-05:  48%|████▊     | 1463/3047 [10:59<11:41,  2.26it/s][A
epoch 2 iter 1463: train loss 0.14660. lr 9.443649e-05:  48%|████▊     | 1464/3047 [10:59<15:42,  1.68it/s][A
epoch 2 iter 1464: train loss 0.14307. lr 9.432384e-05:  48%|████▊     | 1464/3047 [11:00<15:42,  1.68it/s][A
epoch 2 iter 1464: train loss 0.14307. lr 9.432384e-05:  48%|████▊     | 1465/3047 [11:00<15:41,  1.68it/s][A
epoch 2 iter 1465: train loss 0.14673. lr 9.421125e-05:  48%|████▊     | 1465/3047 [11:01<15:41,  1.68it/s][A
epoch 2 iter 1465: train loss 0.14673. lr 9.421125e-05:  48%|████▊     | 1466/3047 [11:01<15:19,  1.72it/s][A
epoch 2 iter 1466: train loss 0.14526. lr 9.409871e-05:  48%|████▊     | 1466/3047 [11:01<15:19,  1.72it/s][A

epoch 2 iter 1498: train loss 0.14461. lr 9.052657e-05:  49%|████▉     | 1499/3047 [11:16<11:19,  2.28it/s][A
epoch 2 iter 1499: train loss 0.14647. lr 9.041586e-05:  49%|████▉     | 1499/3047 [11:16<11:19,  2.28it/s][A
epoch 2 iter 1499: train loss 0.14647. lr 9.041586e-05:  49%|████▉     | 1500/3047 [11:16<10:51,  2.38it/s][A
epoch 2 iter 1500: train loss 0.14369. lr 9.030520e-05:  49%|████▉     | 1500/3047 [11:17<10:51,  2.38it/s][A
epoch 2 iter 1500: train loss 0.14369. lr 9.030520e-05:  49%|████▉     | 1501/3047 [11:17<10:21,  2.49it/s][A
epoch 2 iter 1501: train loss 0.14540. lr 9.019459e-05:  49%|████▉     | 1501/3047 [11:17<10:21,  2.49it/s][A
epoch 2 iter 1501: train loss 0.14540. lr 9.019459e-05:  49%|████▉     | 1502/3047 [11:17<10:06,  2.55it/s][A
epoch 2 iter 1502: train loss 0.14461. lr 9.008405e-05:  49%|████▉     | 1502/3047 [11:18<10:06,  2.55it/s][A
epoch 2 iter 1502: train loss 0.14461. lr 9.008405e-05:  49%|████▉     | 1503/3047 [11:18<10:04,  2.56it/s][A

epoch 2 iter 1586: train loss 0.14351. lr 8.100015e-05:  52%|█████▏    | 1587/3047 [11:54<10:15,  2.37it/s][A
epoch 2 iter 1587: train loss 0.14376. lr 8.089445e-05:  52%|█████▏    | 1587/3047 [11:55<10:15,  2.37it/s][A
epoch 2 iter 1587: train loss 0.14376. lr 8.089445e-05:  52%|█████▏    | 1588/3047 [11:55<10:12,  2.38it/s][A
epoch 2 iter 1588: train loss 0.14592. lr 8.078881e-05:  52%|█████▏    | 1588/3047 [11:55<10:12,  2.38it/s][A
epoch 2 iter 1588: train loss 0.14592. lr 8.078881e-05:  52%|█████▏    | 1589/3047 [11:55<10:07,  2.40it/s][A
epoch 2 iter 1589: train loss 0.14244. lr 8.068323e-05:  52%|█████▏    | 1589/3047 [11:56<10:07,  2.40it/s][A
epoch 2 iter 1589: train loss 0.14244. lr 8.068323e-05:  52%|█████▏    | 1590/3047 [11:56<09:58,  2.43it/s][A
epoch 2 iter 1590: train loss 0.14690. lr 8.057770e-05:  52%|█████▏    | 1590/3047 [11:56<09:58,  2.43it/s][A
epoch 2 iter 1590: train loss 0.14690. lr 8.057770e-05:  52%|█████▏    | 1591/3047 [11:56<09:53,  2.46it/s][A

epoch 2 iter 1623: train loss 0.14375. lr 7.712827e-05:  53%|█████▎    | 1623/3047 [12:11<10:02,  2.37it/s][A
epoch 2 iter 1623: train loss 0.14375. lr 7.712827e-05:  53%|█████▎    | 1624/3047 [12:11<10:01,  2.36it/s][A
epoch 2 iter 1624: train loss 0.14225. lr 7.702475e-05:  53%|█████▎    | 1624/3047 [12:12<10:01,  2.36it/s][A
epoch 2 iter 1624: train loss 0.14225. lr 7.702475e-05:  53%|█████▎    | 1625/3047 [12:12<10:01,  2.36it/s][A
epoch 2 iter 1625: train loss 0.14312. lr 7.692128e-05:  53%|█████▎    | 1625/3047 [12:12<10:01,  2.36it/s][A
epoch 2 iter 1625: train loss 0.14312. lr 7.692128e-05:  53%|█████▎    | 1626/3047 [12:12<09:28,  2.50it/s][A
epoch 2 iter 1626: train loss 0.14425. lr 7.681787e-05:  53%|█████▎    | 1626/3047 [12:12<09:28,  2.50it/s][A
epoch 2 iter 1626: train loss 0.14425. lr 7.681787e-05:  53%|█████▎    | 1627/3047 [12:12<09:10,  2.58it/s][A
epoch 2 iter 1627: train loss 0.14498. lr 7.671452e-05:  53%|█████▎    | 1627/3047 [12:13<09:10,  2.58it/s][A

epoch 2 iter 1659: train loss 0.14418. lr 7.343889e-05:  54%|█████▍    | 1660/3047 [12:28<09:49,  2.35it/s][A
epoch 2 iter 1660: train loss 0.14243. lr 7.333752e-05:  54%|█████▍    | 1660/3047 [12:29<09:49,  2.35it/s][A
epoch 2 iter 1660: train loss 0.14243. lr 7.333752e-05:  55%|█████▍    | 1661/3047 [12:29<09:45,  2.37it/s][A
epoch 2 iter 1661: train loss 0.14596. lr 7.323621e-05:  55%|█████▍    | 1661/3047 [12:29<09:45,  2.37it/s][A
epoch 2 iter 1661: train loss 0.14596. lr 7.323621e-05:  55%|█████▍    | 1662/3047 [12:29<09:44,  2.37it/s][A
epoch 2 iter 1662: train loss 0.14275. lr 7.313495e-05:  55%|█████▍    | 1662/3047 [12:30<09:44,  2.37it/s][A
epoch 2 iter 1662: train loss 0.14275. lr 7.313495e-05:  55%|█████▍    | 1663/3047 [12:30<09:42,  2.38it/s][A
epoch 2 iter 1663: train loss 0.14310. lr 7.303376e-05:  55%|█████▍    | 1663/3047 [12:30<09:42,  2.38it/s][A
epoch 2 iter 1663: train loss 0.14310. lr 7.303376e-05:  55%|█████▍    | 1664/3047 [12:30<09:40,  2.38it/s][A

epoch 2 iter 1696: train loss 0.14114. lr 6.972841e-05:  56%|█████▌    | 1696/3047 [12:45<10:10,  2.21it/s][A
epoch 2 iter 1696: train loss 0.14114. lr 6.972841e-05:  56%|█████▌    | 1697/3047 [12:45<10:07,  2.22it/s][A
epoch 2 iter 1697: train loss 0.14492. lr 6.962928e-05:  56%|█████▌    | 1697/3047 [12:45<10:07,  2.22it/s][A
epoch 2 iter 1697: train loss 0.14492. lr 6.962928e-05:  56%|█████▌    | 1698/3047 [12:45<10:05,  2.23it/s][A
epoch 2 iter 1698: train loss 0.14120. lr 6.953021e-05:  56%|█████▌    | 1698/3047 [12:46<10:05,  2.23it/s][A
epoch 2 iter 1698: train loss 0.14120. lr 6.953021e-05:  56%|█████▌    | 1699/3047 [12:46<10:06,  2.22it/s][A
epoch 2 iter 1699: train loss 0.14151. lr 6.943121e-05:  56%|█████▌    | 1699/3047 [12:46<10:06,  2.22it/s][A
epoch 2 iter 1699: train loss 0.14151. lr 6.943121e-05:  56%|█████▌    | 1700/3047 [12:46<09:47,  2.29it/s][A
epoch 2 iter 1700: train loss 0.14158. lr 6.933227e-05:  56%|█████▌    | 1700/3047 [12:46<09:47,  2.29it/s][A

epoch 2 iter 1732: train loss 0.14254. lr 6.619864e-05:  57%|█████▋    | 1733/3047 [13:00<09:10,  2.39it/s][A
epoch 2 iter 1733: train loss 0.14022. lr 6.610174e-05:  57%|█████▋    | 1733/3047 [13:01<09:10,  2.39it/s][A
epoch 2 iter 1733: train loss 0.14022. lr 6.610174e-05:  57%|█████▋    | 1734/3047 [13:01<09:08,  2.39it/s][A
epoch 2 iter 1734: train loss 0.14011. lr 6.600490e-05:  57%|█████▋    | 1734/3047 [13:01<09:08,  2.39it/s][A
epoch 2 iter 1734: train loss 0.14011. lr 6.600490e-05:  57%|█████▋    | 1735/3047 [13:01<09:06,  2.40it/s][A
epoch 2 iter 1735: train loss 0.14077. lr 6.590812e-05:  57%|█████▋    | 1735/3047 [13:02<09:06,  2.40it/s][A
epoch 2 iter 1735: train loss 0.14077. lr 6.590812e-05:  57%|█████▋    | 1736/3047 [13:02<09:04,  2.41it/s][A
epoch 2 iter 1736: train loss 0.14220. lr 6.581140e-05:  57%|█████▋    | 1736/3047 [13:02<09:04,  2.41it/s][A
epoch 2 iter 1736: train loss 0.14220. lr 6.581140e-05:  57%|█████▋    | 1737/3047 [13:02<09:05,  2.40it/s][A

epoch 2 iter 1769: train loss 0.13863. lr 6.265480e-05:  58%|█████▊    | 1769/3047 [13:18<11:45,  1.81it/s][A
epoch 2 iter 1769: train loss 0.13863. lr 6.265480e-05:  58%|█████▊    | 1770/3047 [13:18<11:43,  1.82it/s][A
epoch 2 iter 1770: train loss 0.14127. lr 6.256022e-05:  58%|█████▊    | 1770/3047 [13:18<11:43,  1.82it/s][A
epoch 2 iter 1770: train loss 0.14127. lr 6.256022e-05:  58%|█████▊    | 1771/3047 [13:18<11:41,  1.82it/s][A
epoch 2 iter 1771: train loss 0.14017. lr 6.246570e-05:  58%|█████▊    | 1771/3047 [13:19<11:41,  1.82it/s][A
epoch 2 iter 1771: train loss 0.14017. lr 6.246570e-05:  58%|█████▊    | 1772/3047 [13:19<11:35,  1.83it/s][A
epoch 2 iter 1772: train loss 0.14339. lr 6.237124e-05:  58%|█████▊    | 1772/3047 [13:19<11:35,  1.83it/s][A
epoch 2 iter 1772: train loss 0.14339. lr 6.237124e-05:  58%|█████▊    | 1773/3047 [13:19<11:19,  1.87it/s][A
epoch 2 iter 1773: train loss 0.13813. lr 6.227684e-05:  58%|█████▊    | 1773/3047 [13:20<11:19,  1.87it/s][A

epoch 2 iter 1805: train loss 0.14126. lr 6.000000e-05:  59%|█████▉    | 1806/3047 [13:33<08:31,  2.42it/s][A
epoch 2 iter 1806: train loss 0.14176. lr 6.000000e-05:  59%|█████▉    | 1806/3047 [13:34<08:31,  2.42it/s][A
epoch 2 iter 1806: train loss 0.14176. lr 6.000000e-05:  59%|█████▉    | 1807/3047 [13:34<08:29,  2.43it/s][A
epoch 2 iter 1807: train loss 0.13957. lr 6.000000e-05:  59%|█████▉    | 1807/3047 [13:34<08:29,  2.43it/s][A
epoch 2 iter 1807: train loss 0.13957. lr 6.000000e-05:  59%|█████▉    | 1808/3047 [13:34<08:27,  2.44it/s][A
epoch 2 iter 1808: train loss 0.13782. lr 6.000000e-05:  59%|█████▉    | 1808/3047 [13:34<08:27,  2.44it/s][A
epoch 2 iter 1808: train loss 0.13782. lr 6.000000e-05:  59%|█████▉    | 1809/3047 [13:34<08:26,  2.44it/s][A
epoch 2 iter 1809: train loss 0.13933. lr 6.000000e-05:  59%|█████▉    | 1809/3047 [13:35<08:26,  2.44it/s][A
epoch 2 iter 1809: train loss 0.13933. lr 6.000000e-05:  59%|█████▉    | 1810/3047 [13:35<08:27,  2.44it/s][A

epoch 2 iter 1842: train loss 0.13987. lr 6.000000e-05:  60%|██████    | 1842/3047 [13:50<08:17,  2.42it/s][A
epoch 2 iter 1842: train loss 0.13987. lr 6.000000e-05:  60%|██████    | 1843/3047 [13:50<14:50,  1.35it/s][A
epoch 2 iter 1843: train loss 0.13987. lr 6.000000e-05:  60%|██████    | 1843/3047 [13:50<14:50,  1.35it/s][A
epoch 2 iter 1843: train loss 0.13987. lr 6.000000e-05:  61%|██████    | 1844/3047 [13:50<14:18,  1.40it/s][A
epoch 2 iter 1844: train loss 0.14003. lr 6.000000e-05:  61%|██████    | 1844/3047 [13:51<14:18,  1.40it/s][A
epoch 2 iter 1844: train loss 0.14003. lr 6.000000e-05:  61%|██████    | 1845/3047 [13:51<13:55,  1.44it/s][A
epoch 2 iter 1845: train loss 0.14159. lr 6.000000e-05:  61%|██████    | 1845/3047 [13:52<13:55,  1.44it/s][A
epoch 2 iter 1845: train loss 0.14159. lr 6.000000e-05:  61%|██████    | 1846/3047 [13:52<13:40,  1.46it/s][A
epoch 2 iter 1846: train loss 0.13936. lr 6.000000e-05:  61%|██████    | 1846/3047 [13:52<13:40,  1.46it/s][A

epoch 2 iter 1878: train loss 0.14104. lr 6.000000e-05:  62%|██████▏   | 1879/3047 [14:06<08:06,  2.40it/s][A
epoch 2 iter 1879: train loss 0.14023. lr 6.000000e-05:  62%|██████▏   | 1879/3047 [14:06<08:06,  2.40it/s][A
epoch 2 iter 1879: train loss 0.14023. lr 6.000000e-05:  62%|██████▏   | 1880/3047 [14:06<08:05,  2.40it/s][A
epoch 2 iter 1880: train loss 0.13710. lr 6.000000e-05:  62%|██████▏   | 1880/3047 [14:07<08:05,  2.40it/s][A
epoch 2 iter 1880: train loss 0.13710. lr 6.000000e-05:  62%|██████▏   | 1881/3047 [14:07<08:04,  2.40it/s][A
epoch 2 iter 1881: train loss 0.13783. lr 6.000000e-05:  62%|██████▏   | 1881/3047 [14:07<08:04,  2.40it/s][A
epoch 2 iter 1881: train loss 0.13783. lr 6.000000e-05:  62%|██████▏   | 1882/3047 [14:07<07:54,  2.46it/s][A
epoch 2 iter 1882: train loss 0.13794. lr 6.000000e-05:  62%|██████▏   | 1882/3047 [14:07<07:54,  2.46it/s][A
epoch 2 iter 1882: train loss 0.13794. lr 6.000000e-05:  62%|██████▏   | 1883/3047 [14:07<07:56,  2.45it/s][A

epoch 2 iter 1915: train loss 0.13906. lr 6.000000e-05:  63%|██████▎   | 1915/3047 [14:23<08:36,  2.19it/s][A
epoch 2 iter 1915: train loss 0.13906. lr 6.000000e-05:  63%|██████▎   | 1916/3047 [14:23<08:46,  2.15it/s][A
epoch 2 iter 1916: train loss 0.13889. lr 6.000000e-05:  63%|██████▎   | 1916/3047 [14:23<08:46,  2.15it/s][A
epoch 2 iter 1916: train loss 0.13889. lr 6.000000e-05:  63%|██████▎   | 1917/3047 [14:23<08:17,  2.27it/s][A
epoch 2 iter 1917: train loss 0.14005. lr 6.000000e-05:  63%|██████▎   | 1917/3047 [14:24<08:17,  2.27it/s][A
epoch 2 iter 1917: train loss 0.14005. lr 6.000000e-05:  63%|██████▎   | 1918/3047 [14:24<08:01,  2.34it/s][A
epoch 2 iter 1918: train loss 0.13806. lr 6.000000e-05:  63%|██████▎   | 1918/3047 [14:24<08:01,  2.34it/s][A
epoch 2 iter 1918: train loss 0.13806. lr 6.000000e-05:  63%|██████▎   | 1919/3047 [14:24<08:24,  2.24it/s][A
epoch 2 iter 1919: train loss 0.14019. lr 6.000000e-05:  63%|██████▎   | 1919/3047 [14:25<08:24,  2.24it/s][A

epoch 2 iter 1951: train loss 0.13689. lr 6.000000e-05:  64%|██████▍   | 1952/3047 [14:39<08:59,  2.03it/s][A
epoch 2 iter 1952: train loss 0.13950. lr 6.000000e-05:  64%|██████▍   | 1952/3047 [14:40<08:59,  2.03it/s][A
epoch 2 iter 1952: train loss 0.13950. lr 6.000000e-05:  64%|██████▍   | 1953/3047 [14:40<09:16,  1.97it/s][A
epoch 2 iter 1953: train loss 0.13635. lr 6.000000e-05:  64%|██████▍   | 1953/3047 [14:40<09:16,  1.97it/s][A
epoch 2 iter 1953: train loss 0.13635. lr 6.000000e-05:  64%|██████▍   | 1954/3047 [14:40<09:16,  1.96it/s][A
epoch 2 iter 1954: train loss 0.13849. lr 6.000000e-05:  64%|██████▍   | 1954/3047 [14:41<09:16,  1.96it/s][A
epoch 2 iter 1954: train loss 0.13849. lr 6.000000e-05:  64%|██████▍   | 1955/3047 [14:41<09:13,  1.97it/s][A
epoch 2 iter 1955: train loss 0.14067. lr 6.000000e-05:  64%|██████▍   | 1955/3047 [14:41<09:13,  1.97it/s][A
epoch 2 iter 1955: train loss 0.14067. lr 6.000000e-05:  64%|██████▍   | 1956/3047 [14:41<09:13,  1.97it/s][A

epoch 2 iter 1988: train loss 0.13998. lr 6.000000e-05:  65%|██████▌   | 1988/3047 [14:56<07:08,  2.47it/s][A
epoch 2 iter 1988: train loss 0.13998. lr 6.000000e-05:  65%|██████▌   | 1989/3047 [14:56<06:57,  2.53it/s][A
epoch 2 iter 1989: train loss 0.13530. lr 6.000000e-05:  65%|██████▌   | 1989/3047 [14:57<06:57,  2.53it/s][A
epoch 2 iter 1989: train loss 0.13530. lr 6.000000e-05:  65%|██████▌   | 1990/3047 [14:57<06:56,  2.54it/s][A
epoch 2 iter 1990: train loss 0.13886. lr 6.000000e-05:  65%|██████▌   | 1990/3047 [14:57<06:56,  2.54it/s][A
epoch 2 iter 1990: train loss 0.13886. lr 6.000000e-05:  65%|██████▌   | 1991/3047 [14:57<07:00,  2.51it/s][A
epoch 2 iter 1991: train loss 0.14058. lr 6.000000e-05:  65%|██████▌   | 1991/3047 [14:57<07:00,  2.51it/s][A
epoch 2 iter 1991: train loss 0.14058. lr 6.000000e-05:  65%|██████▌   | 1992/3047 [14:57<07:11,  2.45it/s][A
epoch 2 iter 1992: train loss 0.14014. lr 6.000000e-05:  65%|██████▌   | 1992/3047 [14:58<07:11,  2.45it/s][A

epoch 2 iter 2024: train loss 0.13957. lr 6.000000e-05:  66%|██████▋   | 2025/3047 [15:12<07:20,  2.32it/s][A
epoch 2 iter 2025: train loss 0.13805. lr 6.000000e-05:  66%|██████▋   | 2025/3047 [15:12<07:20,  2.32it/s][A
epoch 2 iter 2025: train loss 0.13805. lr 6.000000e-05:  66%|██████▋   | 2026/3047 [15:12<07:19,  2.32it/s][A
epoch 2 iter 2026: train loss 0.13899. lr 6.000000e-05:  66%|██████▋   | 2026/3047 [15:13<07:19,  2.32it/s][A
epoch 2 iter 2026: train loss 0.13899. lr 6.000000e-05:  67%|██████▋   | 2027/3047 [15:13<07:08,  2.38it/s][A
epoch 2 iter 2027: train loss 0.13882. lr 6.000000e-05:  67%|██████▋   | 2027/3047 [15:13<07:08,  2.38it/s][A
epoch 2 iter 2027: train loss 0.13882. lr 6.000000e-05:  67%|██████▋   | 2028/3047 [15:13<06:48,  2.49it/s][A
epoch 2 iter 2028: train loss 0.14075. lr 6.000000e-05:  67%|██████▋   | 2028/3047 [15:13<06:48,  2.49it/s][A
epoch 2 iter 2028: train loss 0.14075. lr 6.000000e-05:  67%|██████▋   | 2029/3047 [15:13<06:38,  2.55it/s][A

epoch 2 iter 2061: train loss 0.13995. lr 6.000000e-05:  68%|██████▊   | 2061/3047 [15:28<07:32,  2.18it/s][A
epoch 2 iter 2061: train loss 0.13995. lr 6.000000e-05:  68%|██████▊   | 2062/3047 [15:28<07:07,  2.30it/s][A
epoch 2 iter 2062: train loss 0.13836. lr 6.000000e-05:  68%|██████▊   | 2062/3047 [15:29<07:07,  2.30it/s][A
epoch 2 iter 2062: train loss 0.13836. lr 6.000000e-05:  68%|██████▊   | 2063/3047 [15:29<06:48,  2.41it/s][A
epoch 2 iter 2063: train loss 0.13854. lr 6.000000e-05:  68%|██████▊   | 2063/3047 [15:29<06:48,  2.41it/s][A
epoch 2 iter 2063: train loss 0.13854. lr 6.000000e-05:  68%|██████▊   | 2064/3047 [15:29<06:38,  2.46it/s][A
epoch 2 iter 2064: train loss 0.13749. lr 6.000000e-05:  68%|██████▊   | 2064/3047 [15:30<06:38,  2.46it/s][A
epoch 2 iter 2064: train loss 0.13749. lr 6.000000e-05:  68%|██████▊   | 2065/3047 [15:30<07:04,  2.31it/s][A
epoch 2 iter 2065: train loss 0.13689. lr 6.000000e-05:  68%|██████▊   | 2065/3047 [15:30<07:04,  2.31it/s][A

epoch 2 iter 2097: train loss 0.13514. lr 6.000000e-05:  69%|██████▉   | 2098/3047 [15:44<07:11,  2.20it/s][A
epoch 2 iter 2098: train loss 0.13887. lr 6.000000e-05:  69%|██████▉   | 2098/3047 [15:45<07:11,  2.20it/s][A
epoch 2 iter 2098: train loss 0.13887. lr 6.000000e-05:  69%|██████▉   | 2099/3047 [15:45<09:08,  1.73it/s][A
epoch 2 iter 2099: train loss 0.13939. lr 6.000000e-05:  69%|██████▉   | 2099/3047 [15:46<09:08,  1.73it/s][A
epoch 2 iter 2099: train loss 0.13939. lr 6.000000e-05:  69%|██████▉   | 2100/3047 [15:46<09:30,  1.66it/s][A
epoch 2 iter 2100: train loss 0.13699. lr 6.000000e-05:  69%|██████▉   | 2100/3047 [15:47<09:30,  1.66it/s][A
epoch 2 iter 2100: train loss 0.13699. lr 6.000000e-05:  69%|██████▉   | 2101/3047 [15:47<09:32,  1.65it/s][A
epoch 2 iter 2101: train loss 0.14018. lr 6.000000e-05:  69%|██████▉   | 2101/3047 [15:47<09:32,  1.65it/s][A
epoch 2 iter 2101: train loss 0.14018. lr 6.000000e-05:  69%|██████▉   | 2102/3047 [15:47<09:14,  1.70it/s][A

epoch 2 iter 2134: train loss 0.13643. lr 6.000000e-05:  70%|███████   | 2134/3047 [16:02<06:44,  2.26it/s][A
epoch 2 iter 2134: train loss 0.13643. lr 6.000000e-05:  70%|███████   | 2135/3047 [16:02<06:41,  2.27it/s][A
epoch 2 iter 2135: train loss 0.13953. lr 6.000000e-05:  70%|███████   | 2135/3047 [16:02<06:41,  2.27it/s][A
epoch 2 iter 2135: train loss 0.13953. lr 6.000000e-05:  70%|███████   | 2136/3047 [16:02<06:39,  2.28it/s][A
epoch 2 iter 2136: train loss 0.13927. lr 6.000000e-05:  70%|███████   | 2136/3047 [16:03<06:39,  2.28it/s][A
epoch 2 iter 2136: train loss 0.13927. lr 6.000000e-05:  70%|███████   | 2137/3047 [16:03<06:38,  2.28it/s][A
epoch 2 iter 2137: train loss 0.13760. lr 6.000000e-05:  70%|███████   | 2137/3047 [16:03<06:38,  2.28it/s][A
epoch 2 iter 2137: train loss 0.13760. lr 6.000000e-05:  70%|███████   | 2138/3047 [16:03<06:38,  2.28it/s][A
epoch 2 iter 2138: train loss 0.13837. lr 6.000000e-05:  70%|███████   | 2138/3047 [16:04<06:38,  2.28it/s][A

epoch 2 iter 2170: train loss 0.14059. lr 6.000000e-05:  71%|███████▏  | 2171/3047 [16:18<06:59,  2.09it/s][A
epoch 2 iter 2171: train loss 0.13912. lr 6.000000e-05:  71%|███████▏  | 2171/3047 [16:18<06:59,  2.09it/s][A
epoch 2 iter 2171: train loss 0.13912. lr 6.000000e-05:  71%|███████▏  | 2172/3047 [16:18<06:27,  2.26it/s][A
epoch 2 iter 2172: train loss 0.13802. lr 6.000000e-05:  71%|███████▏  | 2172/3047 [16:18<06:27,  2.26it/s][A
epoch 2 iter 2172: train loss 0.13802. lr 6.000000e-05:  71%|███████▏  | 2173/3047 [16:18<06:10,  2.36it/s][A
epoch 2 iter 2173: train loss 0.13606. lr 6.000000e-05:  71%|███████▏  | 2173/3047 [16:19<06:10,  2.36it/s][A
epoch 2 iter 2173: train loss 0.13606. lr 6.000000e-05:  71%|███████▏  | 2174/3047 [16:19<06:06,  2.38it/s][A
epoch 2 iter 2174: train loss 0.13609. lr 6.000000e-05:  71%|███████▏  | 2174/3047 [16:19<06:06,  2.38it/s][A
epoch 2 iter 2174: train loss 0.13609. lr 6.000000e-05:  71%|███████▏  | 2175/3047 [16:19<06:06,  2.38it/s][A

epoch 2 iter 2207: train loss 0.13518. lr 6.000000e-05:  72%|███████▏  | 2207/3047 [16:33<06:37,  2.11it/s][A
epoch 2 iter 2207: train loss 0.13518. lr 6.000000e-05:  72%|███████▏  | 2208/3047 [16:33<06:40,  2.09it/s][A
epoch 2 iter 2208: train loss 0.14043. lr 6.000000e-05:  72%|███████▏  | 2208/3047 [16:34<06:40,  2.09it/s][A
epoch 2 iter 2208: train loss 0.14043. lr 6.000000e-05:  72%|███████▏  | 2209/3047 [16:34<06:41,  2.09it/s][A
epoch 2 iter 2209: train loss 0.13772. lr 6.000000e-05:  72%|███████▏  | 2209/3047 [16:34<06:41,  2.09it/s][A
epoch 2 iter 2209: train loss 0.13772. lr 6.000000e-05:  73%|███████▎  | 2210/3047 [16:34<06:43,  2.07it/s][A
epoch 2 iter 2210: train loss 0.13771. lr 6.000000e-05:  73%|███████▎  | 2210/3047 [16:35<06:43,  2.07it/s][A
epoch 2 iter 2210: train loss 0.13771. lr 6.000000e-05:  73%|███████▎  | 2211/3047 [16:35<06:37,  2.11it/s][A
epoch 2 iter 2211: train loss 0.13730. lr 6.000000e-05:  73%|███████▎  | 2211/3047 [16:35<06:37,  2.11it/s][A

epoch 2 iter 2243: train loss 0.13641. lr 6.000000e-05:  74%|███████▎  | 2244/3047 [16:49<07:18,  1.83it/s][A
epoch 2 iter 2244: train loss 0.13638. lr 6.000000e-05:  74%|███████▎  | 2244/3047 [16:50<07:18,  1.83it/s][A
epoch 2 iter 2244: train loss 0.13638. lr 6.000000e-05:  74%|███████▎  | 2245/3047 [16:50<07:31,  1.78it/s][A
epoch 2 iter 2245: train loss 0.13679. lr 6.000000e-05:  74%|███████▎  | 2245/3047 [16:51<07:31,  1.78it/s][A
epoch 2 iter 2245: train loss 0.13679. lr 6.000000e-05:  74%|███████▎  | 2246/3047 [16:51<07:32,  1.77it/s][A
epoch 2 iter 2246: train loss 0.13724. lr 6.000000e-05:  74%|███████▎  | 2246/3047 [16:51<07:32,  1.77it/s][A
epoch 2 iter 2246: train loss 0.13724. lr 6.000000e-05:  74%|███████▎  | 2247/3047 [16:51<07:21,  1.81it/s][A
epoch 2 iter 2247: train loss 0.13842. lr 6.000000e-05:  74%|███████▎  | 2247/3047 [16:52<07:21,  1.81it/s][A
epoch 2 iter 2247: train loss 0.13842. lr 6.000000e-05:  74%|███████▍  | 2248/3047 [16:52<07:14,  1.84it/s][A

epoch 2 iter 2280: train loss 0.13873. lr 6.000000e-05:  75%|███████▍  | 2280/3047 [17:06<05:12,  2.45it/s][A
epoch 2 iter 2280: train loss 0.13873. lr 6.000000e-05:  75%|███████▍  | 2281/3047 [17:06<05:13,  2.44it/s][A
epoch 2 iter 2281: train loss 0.13895. lr 6.000000e-05:  75%|███████▍  | 2281/3047 [17:06<05:13,  2.44it/s][A
epoch 2 iter 2281: train loss 0.13895. lr 6.000000e-05:  75%|███████▍  | 2282/3047 [17:06<05:13,  2.44it/s][A
epoch 2 iter 2282: train loss 0.13861. lr 6.000000e-05:  75%|███████▍  | 2282/3047 [17:07<05:13,  2.44it/s][A
epoch 2 iter 2282: train loss 0.13861. lr 6.000000e-05:  75%|███████▍  | 2283/3047 [17:07<05:13,  2.43it/s][A
epoch 2 iter 2283: train loss 0.13546. lr 6.000000e-05:  75%|███████▍  | 2283/3047 [17:07<05:13,  2.43it/s][A
epoch 2 iter 2283: train loss 0.13546. lr 6.000000e-05:  75%|███████▍  | 2284/3047 [17:07<05:13,  2.43it/s][A
epoch 2 iter 2284: train loss 0.13714. lr 6.000000e-05:  75%|███████▍  | 2284/3047 [17:08<05:13,  2.43it/s][A

epoch 2 iter 2316: train loss 0.13802. lr 6.000000e-05:  76%|███████▌  | 2317/3047 [17:22<05:33,  2.19it/s][A
epoch 2 iter 2317: train loss 0.13587. lr 6.000000e-05:  76%|███████▌  | 2317/3047 [17:23<05:33,  2.19it/s][A
epoch 2 iter 2317: train loss 0.13587. lr 6.000000e-05:  76%|███████▌  | 2318/3047 [17:23<05:29,  2.21it/s][A
epoch 2 iter 2318: train loss 0.13672. lr 6.000000e-05:  76%|███████▌  | 2318/3047 [17:23<05:29,  2.21it/s][A
epoch 2 iter 2318: train loss 0.13672. lr 6.000000e-05:  76%|███████▌  | 2319/3047 [17:23<05:35,  2.17it/s][A
epoch 2 iter 2319: train loss 0.13763. lr 6.000000e-05:  76%|███████▌  | 2319/3047 [17:24<05:35,  2.17it/s][A
epoch 2 iter 2319: train loss 0.13763. lr 6.000000e-05:  76%|███████▌  | 2320/3047 [17:24<05:28,  2.21it/s][A
epoch 2 iter 2320: train loss 0.13758. lr 6.000000e-05:  76%|███████▌  | 2320/3047 [17:24<05:28,  2.21it/s][A
epoch 2 iter 2320: train loss 0.13758. lr 6.000000e-05:  76%|███████▌  | 2321/3047 [17:24<05:30,  2.19it/s][A

epoch 2 iter 2353: train loss 0.13682. lr 6.000000e-05:  77%|███████▋  | 2353/3047 [17:39<05:34,  2.08it/s][A
epoch 2 iter 2353: train loss 0.13682. lr 6.000000e-05:  77%|███████▋  | 2354/3047 [17:39<05:27,  2.11it/s][A
epoch 2 iter 2354: train loss 0.13618. lr 6.000000e-05:  77%|███████▋  | 2354/3047 [17:39<05:27,  2.11it/s][A
epoch 2 iter 2354: train loss 0.13618. lr 6.000000e-05:  77%|███████▋  | 2355/3047 [17:39<05:24,  2.13it/s][A
epoch 2 iter 2355: train loss 0.13816. lr 6.000000e-05:  77%|███████▋  | 2355/3047 [17:39<05:24,  2.13it/s][A
epoch 2 iter 2355: train loss 0.13816. lr 6.000000e-05:  77%|███████▋  | 2356/3047 [17:39<05:20,  2.16it/s][A
epoch 2 iter 2356: train loss 0.13889. lr 6.000000e-05:  77%|███████▋  | 2356/3047 [17:40<05:20,  2.16it/s][A
epoch 2 iter 2356: train loss 0.13889. lr 6.000000e-05:  77%|███████▋  | 2357/3047 [17:40<05:18,  2.16it/s][A
epoch 2 iter 2357: train loss 0.13776. lr 6.000000e-05:  77%|███████▋  | 2357/3047 [17:40<05:18,  2.16it/s][A

epoch 2 iter 2389: train loss 0.13716. lr 6.000000e-05:  78%|███████▊  | 2390/3047 [17:54<04:43,  2.32it/s][A
epoch 2 iter 2390: train loss 0.13429. lr 6.000000e-05:  78%|███████▊  | 2390/3047 [17:54<04:43,  2.32it/s][A
epoch 2 iter 2390: train loss 0.13429. lr 6.000000e-05:  78%|███████▊  | 2391/3047 [17:54<04:58,  2.20it/s][A
epoch 2 iter 2391: train loss 0.13659. lr 6.000000e-05:  78%|███████▊  | 2391/3047 [17:55<04:58,  2.20it/s][A
epoch 2 iter 2391: train loss 0.13659. lr 6.000000e-05:  79%|███████▊  | 2392/3047 [17:55<05:17,  2.06it/s][A
epoch 2 iter 2392: train loss 0.13761. lr 6.000000e-05:  79%|███████▊  | 2392/3047 [17:55<05:17,  2.06it/s][A
epoch 2 iter 2392: train loss 0.13761. lr 6.000000e-05:  79%|███████▊  | 2393/3047 [17:55<05:15,  2.07it/s][A
epoch 2 iter 2393: train loss 0.13773. lr 6.000000e-05:  79%|███████▊  | 2393/3047 [17:56<05:15,  2.07it/s][A
epoch 2 iter 2393: train loss 0.13773. lr 6.000000e-05:  79%|███████▊  | 2394/3047 [17:56<05:13,  2.08it/s][A

epoch 2 iter 2426: train loss 0.13550. lr 6.000000e-05:  80%|███████▉  | 2426/3047 [18:11<04:16,  2.42it/s][A
epoch 2 iter 2426: train loss 0.13550. lr 6.000000e-05:  80%|███████▉  | 2427/3047 [18:11<04:17,  2.41it/s][A
epoch 2 iter 2427: train loss 0.13481. lr 6.000000e-05:  80%|███████▉  | 2427/3047 [18:11<04:17,  2.41it/s][A
epoch 2 iter 2427: train loss 0.13481. lr 6.000000e-05:  80%|███████▉  | 2428/3047 [18:11<04:17,  2.40it/s][A
epoch 2 iter 2428: train loss 0.13737. lr 6.000000e-05:  80%|███████▉  | 2428/3047 [18:12<04:17,  2.40it/s][A
epoch 2 iter 2428: train loss 0.13737. lr 6.000000e-05:  80%|███████▉  | 2429/3047 [18:12<04:18,  2.39it/s][A
epoch 2 iter 2429: train loss 0.13791. lr 6.000000e-05:  80%|███████▉  | 2429/3047 [18:12<04:18,  2.39it/s][A
epoch 2 iter 2429: train loss 0.13791. lr 6.000000e-05:  80%|███████▉  | 2430/3047 [18:12<04:19,  2.38it/s][A
epoch 2 iter 2430: train loss 0.13705. lr 6.000000e-05:  80%|███████▉  | 2430/3047 [18:13<04:19,  2.38it/s][A

epoch 2 iter 2462: train loss 0.13762. lr 6.000000e-05:  81%|████████  | 2463/3047 [18:26<04:02,  2.41it/s][A
epoch 2 iter 2463: train loss 0.13725. lr 6.000000e-05:  81%|████████  | 2463/3047 [18:27<04:02,  2.41it/s][A
epoch 2 iter 2463: train loss 0.13725. lr 6.000000e-05:  81%|████████  | 2464/3047 [18:27<04:04,  2.38it/s][A
epoch 2 iter 2464: train loss 0.13874. lr 6.000000e-05:  81%|████████  | 2464/3047 [18:27<04:04,  2.38it/s][A
epoch 2 iter 2464: train loss 0.13874. lr 6.000000e-05:  81%|████████  | 2465/3047 [18:27<04:27,  2.18it/s][A
epoch 2 iter 2465: train loss 0.13874. lr 6.000000e-05:  81%|████████  | 2465/3047 [18:28<04:27,  2.18it/s][A
epoch 2 iter 2465: train loss 0.13874. lr 6.000000e-05:  81%|████████  | 2466/3047 [18:28<04:53,  1.98it/s][A
epoch 2 iter 2466: train loss 0.13720. lr 6.000000e-05:  81%|████████  | 2466/3047 [18:29<04:53,  1.98it/s][A
epoch 2 iter 2466: train loss 0.13720. lr 6.000000e-05:  81%|████████  | 2467/3047 [18:29<05:05,  1.90it/s][A

epoch 2 iter 2499: train loss 0.13889. lr 6.000000e-05:  82%|████████▏ | 2499/3047 [18:44<05:00,  1.83it/s][A
epoch 2 iter 2499: train loss 0.13889. lr 6.000000e-05:  82%|████████▏ | 2500/3047 [18:44<04:47,  1.90it/s][A
epoch 2 iter 2500: train loss 0.13557. lr 6.000000e-05:  82%|████████▏ | 2500/3047 [18:45<04:47,  1.90it/s][A
epoch 2 iter 2500: train loss 0.13557. lr 6.000000e-05:  82%|████████▏ | 2501/3047 [18:45<04:39,  1.95it/s][A
epoch 2 iter 2501: train loss 0.13634. lr 6.000000e-05:  82%|████████▏ | 2501/3047 [18:45<04:39,  1.95it/s][A
epoch 2 iter 2501: train loss 0.13634. lr 6.000000e-05:  82%|████████▏ | 2502/3047 [18:45<04:33,  1.99it/s][A
epoch 2 iter 2502: train loss 0.13697. lr 6.000000e-05:  82%|████████▏ | 2502/3047 [18:45<04:33,  1.99it/s][A
epoch 2 iter 2502: train loss 0.13697. lr 6.000000e-05:  82%|████████▏ | 2503/3047 [18:45<04:27,  2.03it/s][A
epoch 2 iter 2503: train loss 0.13803. lr 6.000000e-05:  82%|████████▏ | 2503/3047 [18:46<04:27,  2.03it/s][A

epoch 2 iter 2535: train loss 0.13752. lr 6.000000e-05:  83%|████████▎ | 2536/3047 [19:00<03:35,  2.37it/s][A
epoch 2 iter 2536: train loss 0.13687. lr 6.000000e-05:  83%|████████▎ | 2536/3047 [19:00<03:35,  2.37it/s][A
epoch 2 iter 2536: train loss 0.13687. lr 6.000000e-05:  83%|████████▎ | 2537/3047 [19:00<03:32,  2.40it/s][A
epoch 2 iter 2537: train loss 0.13745. lr 6.000000e-05:  83%|████████▎ | 2537/3047 [19:00<03:32,  2.40it/s][A
epoch 2 iter 2537: train loss 0.13745. lr 6.000000e-05:  83%|████████▎ | 2538/3047 [19:00<03:31,  2.41it/s][A
epoch 2 iter 2538: train loss 0.13809. lr 6.000000e-05:  83%|████████▎ | 2538/3047 [19:01<03:31,  2.41it/s][A
epoch 2 iter 2538: train loss 0.13809. lr 6.000000e-05:  83%|████████▎ | 2539/3047 [19:01<03:31,  2.40it/s][A
epoch 2 iter 2539: train loss 0.13865. lr 6.000000e-05:  83%|████████▎ | 2539/3047 [19:01<03:31,  2.40it/s][A
epoch 2 iter 2539: train loss 0.13865. lr 6.000000e-05:  83%|████████▎ | 2540/3047 [19:01<03:36,  2.34it/s][A

epoch 2 iter 2572: train loss 0.13808. lr 6.000000e-05:  84%|████████▍ | 2572/3047 [19:15<03:40,  2.15it/s][A
epoch 2 iter 2572: train loss 0.13808. lr 6.000000e-05:  84%|████████▍ | 2573/3047 [19:15<03:39,  2.16it/s][A
epoch 2 iter 2573: train loss 0.13791. lr 6.000000e-05:  84%|████████▍ | 2573/3047 [19:16<03:39,  2.16it/s][A
epoch 2 iter 2573: train loss 0.13791. lr 6.000000e-05:  84%|████████▍ | 2574/3047 [19:16<03:39,  2.16it/s][A
epoch 2 iter 2574: train loss 0.13570. lr 6.000000e-05:  84%|████████▍ | 2574/3047 [19:16<03:39,  2.16it/s][A
epoch 2 iter 2574: train loss 0.13570. lr 6.000000e-05:  85%|████████▍ | 2575/3047 [19:16<03:27,  2.28it/s][A
epoch 2 iter 2575: train loss 0.13857. lr 6.000000e-05:  85%|████████▍ | 2575/3047 [19:17<03:27,  2.28it/s][A
epoch 2 iter 2575: train loss 0.13857. lr 6.000000e-05:  85%|████████▍ | 2576/3047 [19:17<03:18,  2.37it/s][A
epoch 2 iter 2576: train loss 0.13798. lr 6.000000e-05:  85%|████████▍ | 2576/3047 [19:17<03:18,  2.37it/s][A

epoch 2 iter 2608: train loss 0.13598. lr 6.000000e-05:  86%|████████▌ | 2609/3047 [19:32<03:54,  1.87it/s][A
epoch 2 iter 2609: train loss 0.13519. lr 6.000000e-05:  86%|████████▌ | 2609/3047 [19:32<03:54,  1.87it/s][A
epoch 2 iter 2609: train loss 0.13519. lr 6.000000e-05:  86%|████████▌ | 2610/3047 [19:32<03:52,  1.88it/s][A
epoch 2 iter 2610: train loss 0.13867. lr 6.000000e-05:  86%|████████▌ | 2610/3047 [19:33<03:52,  1.88it/s][A
epoch 2 iter 2610: train loss 0.13867. lr 6.000000e-05:  86%|████████▌ | 2611/3047 [19:33<03:50,  1.89it/s][A
epoch 2 iter 2611: train loss 0.13825. lr 6.000000e-05:  86%|████████▌ | 2611/3047 [19:33<03:50,  1.89it/s][A
epoch 2 iter 2611: train loss 0.13825. lr 6.000000e-05:  86%|████████▌ | 2612/3047 [19:33<03:44,  1.93it/s][A
epoch 2 iter 2612: train loss 0.13585. lr 6.000000e-05:  86%|████████▌ | 2612/3047 [19:34<03:44,  1.93it/s][A
epoch 2 iter 2612: train loss 0.13585. lr 6.000000e-05:  86%|████████▌ | 2613/3047 [19:34<03:40,  1.96it/s][A

epoch 2 iter 2645: train loss 0.13763. lr 6.000000e-05:  87%|████████▋ | 2645/3047 [19:49<03:00,  2.23it/s][A
epoch 2 iter 2645: train loss 0.13763. lr 6.000000e-05:  87%|████████▋ | 2646/3047 [19:49<02:58,  2.25it/s][A
epoch 2 iter 2646: train loss 0.13544. lr 6.000000e-05:  87%|████████▋ | 2646/3047 [19:49<02:58,  2.25it/s][A
epoch 2 iter 2646: train loss 0.13544. lr 6.000000e-05:  87%|████████▋ | 2647/3047 [19:49<02:50,  2.35it/s][A
epoch 2 iter 2647: train loss 0.13729. lr 6.000000e-05:  87%|████████▋ | 2647/3047 [19:49<02:50,  2.35it/s][A
epoch 2 iter 2647: train loss 0.13729. lr 6.000000e-05:  87%|████████▋ | 2648/3047 [19:49<02:42,  2.45it/s][A
epoch 2 iter 2648: train loss 0.13696. lr 6.000000e-05:  87%|████████▋ | 2648/3047 [19:50<02:42,  2.45it/s][A
epoch 2 iter 2648: train loss 0.13696. lr 6.000000e-05:  87%|████████▋ | 2649/3047 [19:50<02:41,  2.46it/s][A
epoch 2 iter 2649: train loss 0.13712. lr 6.000000e-05:  87%|████████▋ | 2649/3047 [19:50<02:41,  2.46it/s][A

epoch 2 iter 2681: train loss 0.13636. lr 6.000000e-05:  88%|████████▊ | 2682/3047 [20:05<03:33,  1.71it/s][A
epoch 2 iter 2682: train loss 0.13414. lr 6.000000e-05:  88%|████████▊ | 2682/3047 [20:05<03:33,  1.71it/s][A
epoch 2 iter 2682: train loss 0.13414. lr 6.000000e-05:  88%|████████▊ | 2683/3047 [20:05<03:33,  1.71it/s][A
epoch 2 iter 2683: train loss 0.13719. lr 6.000000e-05:  88%|████████▊ | 2683/3047 [20:06<03:33,  1.71it/s][A
epoch 2 iter 2683: train loss 0.13719. lr 6.000000e-05:  88%|████████▊ | 2684/3047 [20:06<03:31,  1.72it/s][A
epoch 2 iter 2684: train loss 0.13607. lr 6.000000e-05:  88%|████████▊ | 2684/3047 [20:07<03:31,  1.72it/s][A
epoch 2 iter 2684: train loss 0.13607. lr 6.000000e-05:  88%|████████▊ | 2685/3047 [20:07<03:25,  1.76it/s][A
epoch 2 iter 2685: train loss 0.13519. lr 6.000000e-05:  88%|████████▊ | 2685/3047 [20:07<03:25,  1.76it/s][A
epoch 2 iter 2685: train loss 0.13519. lr 6.000000e-05:  88%|████████▊ | 2686/3047 [20:07<03:20,  1.80it/s][A

epoch 2 iter 2718: train loss 0.13685. lr 6.000000e-05:  89%|████████▉ | 2718/3047 [20:21<02:11,  2.50it/s][A
epoch 2 iter 2718: train loss 0.13685. lr 6.000000e-05:  89%|████████▉ | 2719/3047 [20:21<02:11,  2.49it/s][A
epoch 2 iter 2719: train loss 0.13569. lr 6.000000e-05:  89%|████████▉ | 2719/3047 [20:22<02:11,  2.49it/s][A
epoch 2 iter 2719: train loss 0.13569. lr 6.000000e-05:  89%|████████▉ | 2720/3047 [20:22<02:12,  2.47it/s][A
epoch 2 iter 2720: train loss 0.13731. lr 6.000000e-05:  89%|████████▉ | 2720/3047 [20:22<02:12,  2.47it/s][A
epoch 2 iter 2720: train loss 0.13731. lr 6.000000e-05:  89%|████████▉ | 2721/3047 [20:22<02:13,  2.45it/s][A
epoch 2 iter 2721: train loss 0.13526. lr 6.000000e-05:  89%|████████▉ | 2721/3047 [20:22<02:13,  2.45it/s][A
epoch 2 iter 2721: train loss 0.13526. lr 6.000000e-05:  89%|████████▉ | 2722/3047 [20:22<02:13,  2.43it/s][A
epoch 2 iter 2722: train loss 0.13504. lr 6.000000e-05:  89%|████████▉ | 2722/3047 [20:23<02:13,  2.43it/s][A

epoch 2 iter 2754: train loss 0.13497. lr 6.000000e-05:  90%|█████████ | 2755/3047 [20:37<02:13,  2.18it/s][A
epoch 2 iter 2755: train loss 0.13667. lr 6.000000e-05:  90%|█████████ | 2755/3047 [20:38<02:13,  2.18it/s][A
epoch 2 iter 2755: train loss 0.13667. lr 6.000000e-05:  90%|█████████ | 2756/3047 [20:38<02:10,  2.23it/s][A
epoch 2 iter 2756: train loss 0.13501. lr 6.000000e-05:  90%|█████████ | 2756/3047 [20:38<02:10,  2.23it/s][A
epoch 2 iter 2756: train loss 0.13501. lr 6.000000e-05:  90%|█████████ | 2757/3047 [20:38<02:07,  2.27it/s][A
epoch 2 iter 2757: train loss 0.13539. lr 6.000000e-05:  90%|█████████ | 2757/3047 [20:39<02:07,  2.27it/s][A
epoch 2 iter 2757: train loss 0.13539. lr 6.000000e-05:  91%|█████████ | 2758/3047 [20:39<02:05,  2.31it/s][A
epoch 2 iter 2758: train loss 0.13658. lr 6.000000e-05:  91%|█████████ | 2758/3047 [20:39<02:05,  2.31it/s][A
epoch 2 iter 2758: train loss 0.13658. lr 6.000000e-05:  91%|█████████ | 2759/3047 [20:39<02:02,  2.35it/s][A

epoch 2 iter 2791: train loss 0.13514. lr 6.000000e-05:  92%|█████████▏| 2791/3047 [20:55<01:56,  2.20it/s][A
epoch 2 iter 2791: train loss 0.13514. lr 6.000000e-05:  92%|█████████▏| 2792/3047 [20:55<01:48,  2.36it/s][A
epoch 2 iter 2792: train loss 0.13415. lr 6.000000e-05:  92%|█████████▏| 2792/3047 [20:55<01:48,  2.36it/s][A
epoch 2 iter 2792: train loss 0.13415. lr 6.000000e-05:  92%|█████████▏| 2793/3047 [20:55<01:43,  2.44it/s][A
epoch 2 iter 2793: train loss 0.13750. lr 6.000000e-05:  92%|█████████▏| 2793/3047 [20:56<01:43,  2.44it/s][A
epoch 2 iter 2793: train loss 0.13750. lr 6.000000e-05:  92%|█████████▏| 2794/3047 [20:56<01:44,  2.43it/s][A
epoch 2 iter 2794: train loss 0.13624. lr 6.000000e-05:  92%|█████████▏| 2794/3047 [20:56<01:44,  2.43it/s][A
epoch 2 iter 2794: train loss 0.13624. lr 6.000000e-05:  92%|█████████▏| 2795/3047 [20:56<01:46,  2.36it/s][A
epoch 2 iter 2795: train loss 0.13480. lr 6.000000e-05:  92%|█████████▏| 2795/3047 [20:57<01:46,  2.36it/s][A

epoch 2 iter 2827: train loss 0.13711. lr 6.000000e-05:  93%|█████████▎| 2828/3047 [21:11<01:29,  2.45it/s][A
epoch 2 iter 2828: train loss 0.13576. lr 6.000000e-05:  93%|█████████▎| 2828/3047 [21:11<01:29,  2.45it/s][A
epoch 2 iter 2828: train loss 0.13576. lr 6.000000e-05:  93%|█████████▎| 2829/3047 [21:11<01:27,  2.50it/s][A
epoch 2 iter 2829: train loss 0.13640. lr 6.000000e-05:  93%|█████████▎| 2829/3047 [21:11<01:27,  2.50it/s][A
epoch 2 iter 2829: train loss 0.13640. lr 6.000000e-05:  93%|█████████▎| 2830/3047 [21:11<01:28,  2.46it/s][A
epoch 2 iter 2830: train loss 0.13652. lr 6.000000e-05:  93%|█████████▎| 2830/3047 [21:12<01:28,  2.46it/s][A
epoch 2 iter 2830: train loss 0.13652. lr 6.000000e-05:  93%|█████████▎| 2831/3047 [21:12<01:29,  2.40it/s][A
epoch 2 iter 2831: train loss 0.13570. lr 6.000000e-05:  93%|█████████▎| 2831/3047 [21:12<01:29,  2.40it/s][A
epoch 2 iter 2831: train loss 0.13570. lr 6.000000e-05:  93%|█████████▎| 2832/3047 [21:12<01:30,  2.38it/s][A

epoch 2 iter 2864: train loss 0.13746. lr 6.000000e-05:  94%|█████████▍| 2864/3047 [21:28<01:47,  1.70it/s][A
epoch 2 iter 2864: train loss 0.13746. lr 6.000000e-05:  94%|█████████▍| 2865/3047 [21:28<01:41,  1.79it/s][A
epoch 2 iter 2865: train loss 0.13641. lr 6.000000e-05:  94%|█████████▍| 2865/3047 [21:28<01:41,  1.79it/s][A
epoch 2 iter 2865: train loss 0.13641. lr 6.000000e-05:  94%|█████████▍| 2866/3047 [21:28<01:36,  1.87it/s][A
epoch 2 iter 2866: train loss 0.13639. lr 6.000000e-05:  94%|█████████▍| 2866/3047 [21:29<01:36,  1.87it/s][A
epoch 2 iter 2866: train loss 0.13639. lr 6.000000e-05:  94%|█████████▍| 2867/3047 [21:29<01:33,  1.93it/s][A
epoch 2 iter 2867: train loss 0.13497. lr 6.000000e-05:  94%|█████████▍| 2867/3047 [21:29<01:33,  1.93it/s][A
epoch 2 iter 2867: train loss 0.13497. lr 6.000000e-05:  94%|█████████▍| 2868/3047 [21:29<01:30,  1.98it/s][A
epoch 2 iter 2868: train loss 0.13664. lr 6.000000e-05:  94%|█████████▍| 2868/3047 [21:30<01:30,  1.98it/s][A

epoch 2 iter 2900: train loss 0.13534. lr 6.000000e-05:  95%|█████████▌| 2901/3047 [21:43<00:59,  2.47it/s][A
epoch 2 iter 2901: train loss 0.13574. lr 6.000000e-05:  95%|█████████▌| 2901/3047 [21:43<00:59,  2.47it/s][A
epoch 2 iter 2901: train loss 0.13574. lr 6.000000e-05:  95%|█████████▌| 2902/3047 [21:43<00:59,  2.43it/s][A
epoch 2 iter 2902: train loss 0.13712. lr 6.000000e-05:  95%|█████████▌| 2902/3047 [21:44<00:59,  2.43it/s][A
epoch 2 iter 2902: train loss 0.13712. lr 6.000000e-05:  95%|█████████▌| 2903/3047 [21:44<01:00,  2.38it/s][A
epoch 2 iter 2903: train loss 0.13336. lr 6.000000e-05:  95%|█████████▌| 2903/3047 [21:44<01:00,  2.38it/s][A
epoch 2 iter 2903: train loss 0.13336. lr 6.000000e-05:  95%|█████████▌| 2904/3047 [21:44<01:02,  2.30it/s][A
epoch 2 iter 2904: train loss 0.13664. lr 6.000000e-05:  95%|█████████▌| 2904/3047 [21:45<01:02,  2.30it/s][A
epoch 2 iter 2904: train loss 0.13664. lr 6.000000e-05:  95%|█████████▌| 2905/3047 [21:45<01:03,  2.25it/s][A

epoch 2 iter 2937: train loss 0.13475. lr 6.000000e-05:  96%|█████████▋| 2937/3047 [22:00<00:51,  2.14it/s][A
epoch 2 iter 2937: train loss 0.13475. lr 6.000000e-05:  96%|█████████▋| 2938/3047 [22:00<00:51,  2.13it/s][A
epoch 2 iter 2938: train loss 0.13524. lr 6.000000e-05:  96%|█████████▋| 2938/3047 [22:01<00:51,  2.13it/s][A
epoch 2 iter 2938: train loss 0.13524. lr 6.000000e-05:  96%|█████████▋| 2939/3047 [22:01<00:52,  2.06it/s][A
epoch 2 iter 2939: train loss 0.13779. lr 6.000000e-05:  96%|█████████▋| 2939/3047 [22:01<00:52,  2.06it/s][A
epoch 2 iter 2939: train loss 0.13779. lr 6.000000e-05:  96%|█████████▋| 2940/3047 [22:01<00:52,  2.02it/s][A
epoch 2 iter 2940: train loss 0.13676. lr 6.000000e-05:  96%|█████████▋| 2940/3047 [22:02<00:52,  2.02it/s][A
epoch 2 iter 2940: train loss 0.13676. lr 6.000000e-05:  97%|█████████▋| 2941/3047 [22:02<00:52,  2.04it/s][A
epoch 2 iter 2941: train loss 0.13452. lr 6.000000e-05:  97%|█████████▋| 2941/3047 [22:02<00:52,  2.04it/s][A

epoch 2 iter 2973: train loss 0.13673. lr 6.000000e-05:  98%|█████████▊| 2974/3047 [22:16<00:30,  2.42it/s][A
epoch 2 iter 2974: train loss 0.13456. lr 6.000000e-05:  98%|█████████▊| 2974/3047 [22:17<00:30,  2.42it/s][A
epoch 2 iter 2974: train loss 0.13456. lr 6.000000e-05:  98%|█████████▊| 2975/3047 [22:17<00:29,  2.43it/s][A
epoch 2 iter 2975: train loss 0.13463. lr 6.000000e-05:  98%|█████████▊| 2975/3047 [22:17<00:29,  2.43it/s][A
epoch 2 iter 2975: train loss 0.13463. lr 6.000000e-05:  98%|█████████▊| 2976/3047 [22:17<00:29,  2.44it/s][A
epoch 2 iter 2976: train loss 0.13511. lr 6.000000e-05:  98%|█████████▊| 2976/3047 [22:17<00:29,  2.44it/s][A
epoch 2 iter 2976: train loss 0.13511. lr 6.000000e-05:  98%|█████████▊| 2977/3047 [22:17<00:28,  2.44it/s][A
epoch 2 iter 2977: train loss 0.13391. lr 6.000000e-05:  98%|█████████▊| 2977/3047 [22:18<00:28,  2.44it/s][A
epoch 2 iter 2977: train loss 0.13391. lr 6.000000e-05:  98%|█████████▊| 2978/3047 [22:18<00:28,  2.45it/s][A

epoch 2 iter 3010: train loss 0.13666. lr 6.000000e-05:  99%|█████████▉| 3010/3047 [22:33<00:14,  2.48it/s][A
epoch 2 iter 3010: train loss 0.13666. lr 6.000000e-05:  99%|█████████▉| 3011/3047 [22:33<00:14,  2.48it/s][A
epoch 2 iter 3011: train loss 0.13444. lr 6.000000e-05:  99%|█████████▉| 3011/3047 [22:34<00:14,  2.48it/s][A
epoch 2 iter 3011: train loss 0.13444. lr 6.000000e-05:  99%|█████████▉| 3012/3047 [22:34<00:14,  2.41it/s][A
epoch 2 iter 3012: train loss 0.13295. lr 6.000000e-05:  99%|█████████▉| 3012/3047 [22:34<00:14,  2.41it/s][A
epoch 2 iter 3012: train loss 0.13295. lr 6.000000e-05:  99%|█████████▉| 3013/3047 [22:34<00:13,  2.43it/s][A
epoch 2 iter 3013: train loss 0.13669. lr 6.000000e-05:  99%|█████████▉| 3013/3047 [22:35<00:13,  2.43it/s][A
epoch 2 iter 3013: train loss 0.13669. lr 6.000000e-05:  99%|█████████▉| 3014/3047 [22:35<00:13,  2.44it/s][A
epoch 2 iter 3014: train loss 0.13608. lr 6.000000e-05:  99%|█████████▉| 3014/3047 [22:35<00:13,  2.44it/s][A

epoch 2 iter 3046: train loss 0.13600. lr 6.000000e-05: 100%|██████████| 3047/3047 [22:49<00:00,  2.23it/s][A
 50%|█████     | 5/10 [2:56:21<2:54:54, 2098.95s/it]

data has 348382 characters, 164 unique.



  0%|          | 0/681 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.23528. lr 5.999994e-04:   0%|          | 0/681 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 5.23528. lr 5.999994e-04:   0%|          | 1/681 [00:00<05:23,  2.10it/s][A
epoch 1 iter 1: train loss 4.07270. lr 5.999973e-04:   0%|          | 1/681 [00:00<05:23,  2.10it/s][A
epoch 1 iter 1: train loss 4.07270. lr 5.999973e-04:   0%|          | 2/681 [00:00<05:09,  2.19it/s][A
epoch 1 iter 2: train loss 4.15823. lr 5.999935e-04:   0%|          | 2/681 [00:01<05:09,  2.19it/s][A
epoch 1 iter 2: train loss 4.15823. lr 5.999935e-04:   0%|          | 3/681 [00:01<04:58,  2.27it/s][A
epoch 1 iter 3: train loss 3.54828. lr 5.999882e-04:   0%|          | 3/681 [00:01<04:58,  2.27it/s][A
epoch 1 iter 3: train loss 3.54828. lr 5.999882e-04:   1%|          | 4/681 [00:01<04:49,  2.34it/s][A
epoch 1 iter 4: train loss 3.48173. lr 5.999812e-04:   1%|          | 4/681 [00:02<04:49,  2.34it/s][A

epoch 1 iter 38: train loss 2.79699. lr 5.987935e-04:   6%|▌         | 39/681 [00:16<04:24,  2.43it/s][A
epoch 1 iter 39: train loss 2.78437. lr 5.987306e-04:   6%|▌         | 39/681 [00:16<04:24,  2.43it/s][A
epoch 1 iter 39: train loss 2.78437. lr 5.987306e-04:   6%|▌         | 40/681 [00:16<04:12,  2.54it/s][A
epoch 1 iter 40: train loss 2.79481. lr 5.986661e-04:   6%|▌         | 40/681 [00:17<04:12,  2.54it/s][A
epoch 1 iter 40: train loss 2.79481. lr 5.986661e-04:   6%|▌         | 41/681 [00:17<04:09,  2.57it/s][A
epoch 1 iter 41: train loss 2.77832. lr 5.986001e-04:   6%|▌         | 41/681 [00:17<04:09,  2.57it/s][A
epoch 1 iter 41: train loss 2.77832. lr 5.986001e-04:   6%|▌         | 42/681 [00:17<04:38,  2.29it/s][A
epoch 1 iter 42: train loss 2.77748. lr 5.985324e-04:   6%|▌         | 42/681 [00:18<04:38,  2.29it/s][A
epoch 1 iter 42: train loss 2.77748. lr 5.985324e-04:   6%|▋         | 43/681 [00:18<04:57,  2.14it/s][A
epoch 1 iter 43: train loss 2.76923. lr 5.9846

epoch 1 iter 77: train loss 2.64508. lr 5.951643e-04:  11%|█▏        | 77/681 [00:33<04:34,  2.20it/s][A
epoch 1 iter 77: train loss 2.64508. lr 5.951643e-04:  11%|█▏        | 78/681 [00:33<05:02,  2.00it/s][A
epoch 1 iter 78: train loss 2.64243. lr 5.950396e-04:  11%|█▏        | 78/681 [00:34<05:02,  2.00it/s][A
epoch 1 iter 78: train loss 2.64243. lr 5.950396e-04:  12%|█▏        | 79/681 [00:34<05:11,  1.93it/s][A
epoch 1 iter 79: train loss 2.63411. lr 5.949134e-04:  12%|█▏        | 79/681 [00:34<05:11,  1.93it/s][A
epoch 1 iter 79: train loss 2.63411. lr 5.949134e-04:  12%|█▏        | 80/681 [00:34<04:54,  2.04it/s][A
epoch 1 iter 80: train loss 2.63955. lr 5.947855e-04:  12%|█▏        | 80/681 [00:35<04:54,  2.04it/s][A
epoch 1 iter 80: train loss 2.63955. lr 5.947855e-04:  12%|█▏        | 81/681 [00:35<04:41,  2.13it/s][A
epoch 1 iter 81: train loss 2.62766. lr 5.946561e-04:  12%|█▏        | 81/681 [00:35<04:41,  2.13it/s][A
epoch 1 iter 81: train loss 2.62766. lr 5.9465

epoch 1 iter 115: train loss 2.57165. lr 5.893259e-04:  17%|█▋        | 115/681 [00:50<04:00,  2.36it/s][A
epoch 1 iter 115: train loss 2.57165. lr 5.893259e-04:  17%|█▋        | 116/681 [00:50<03:50,  2.45it/s][A
epoch 1 iter 116: train loss 2.56547. lr 5.891419e-04:  17%|█▋        | 116/681 [00:50<03:50,  2.45it/s][A
epoch 1 iter 116: train loss 2.56547. lr 5.891419e-04:  17%|█▋        | 117/681 [00:50<03:51,  2.43it/s][A
epoch 1 iter 117: train loss 2.57026. lr 5.889564e-04:  17%|█▋        | 117/681 [00:51<03:51,  2.43it/s][A
epoch 1 iter 117: train loss 2.57026. lr 5.889564e-04:  17%|█▋        | 118/681 [00:51<03:58,  2.36it/s][A
epoch 1 iter 118: train loss 2.55656. lr 5.887694e-04:  17%|█▋        | 118/681 [00:51<03:58,  2.36it/s][A
epoch 1 iter 118: train loss 2.55656. lr 5.887694e-04:  17%|█▋        | 119/681 [00:51<04:01,  2.33it/s][A
epoch 1 iter 119: train loss 2.56003. lr 5.885808e-04:  17%|█▋        | 119/681 [00:52<04:01,  2.33it/s][A

epoch 1 iter 152: train loss 2.51400. lr 5.815005e-04:  22%|██▏       | 153/681 [01:08<04:41,  1.88it/s][A
epoch 1 iter 153: train loss 2.52414. lr 5.812602e-04:  22%|██▏       | 153/681 [01:08<04:41,  1.88it/s][A
epoch 1 iter 153: train loss 2.52414. lr 5.812602e-04:  23%|██▎       | 154/681 [01:08<04:35,  1.91it/s][A
epoch 1 iter 154: train loss 2.52282. lr 5.810184e-04:  23%|██▎       | 154/681 [01:09<04:35,  1.91it/s][A
epoch 1 iter 154: train loss 2.52282. lr 5.810184e-04:  23%|██▎       | 155/681 [01:09<04:26,  1.97it/s][A
epoch 1 iter 155: train loss 2.51252. lr 5.807751e-04:  23%|██▎       | 155/681 [01:09<04:26,  1.97it/s][A
epoch 1 iter 155: train loss 2.51252. lr 5.807751e-04:  23%|██▎       | 156/681 [01:09<04:24,  1.99it/s][A
epoch 1 iter 156: train loss 2.52457. lr 5.805303e-04:  23%|██▎       | 156/681 [01:10<04:24,  1.99it/s][A
epoch 1 iter 156: train loss 2.52457. lr 5.805303e-04:  23%|██▎       | 157/681 [01:10<04:13,  2.07it/s][A
epoch 1 iter 157: train loss

epoch 1 iter 190: train loss 2.47103. lr 5.713258e-04:  28%|██▊       | 190/681 [01:25<03:32,  2.31it/s][A
epoch 1 iter 190: train loss 2.47103. lr 5.713258e-04:  28%|██▊       | 191/681 [01:25<03:30,  2.33it/s][A
epoch 1 iter 191: train loss 2.45244. lr 5.710294e-04:  28%|██▊       | 191/681 [01:25<03:30,  2.33it/s][A
epoch 1 iter 191: train loss 2.45244. lr 5.710294e-04:  28%|██▊       | 192/681 [01:25<03:29,  2.34it/s][A
epoch 1 iter 192: train loss 2.45025. lr 5.707317e-04:  28%|██▊       | 192/681 [01:26<03:29,  2.34it/s][A
epoch 1 iter 192: train loss 2.45025. lr 5.707317e-04:  28%|██▊       | 193/681 [01:26<03:17,  2.47it/s][A
epoch 1 iter 193: train loss 2.44366. lr 5.704324e-04:  28%|██▊       | 193/681 [01:26<03:17,  2.47it/s][A
epoch 1 iter 193: train loss 2.44366. lr 5.704324e-04:  28%|██▊       | 194/681 [01:26<03:12,  2.52it/s][A
epoch 1 iter 194: train loss 2.44987. lr 5.701318e-04:  28%|██▊       | 194/681 [01:26<03:12,  2.52it/s][A

epoch 1 iter 227: train loss 2.36087. lr 5.594111e-04:  33%|███▎      | 228/681 [01:40<03:12,  2.35it/s][A
epoch 1 iter 228: train loss 2.35352. lr 5.590624e-04:  33%|███▎      | 228/681 [01:41<03:12,  2.35it/s][A
epoch 1 iter 228: train loss 2.35352. lr 5.590624e-04:  34%|███▎      | 229/681 [01:41<03:17,  2.29it/s][A
epoch 1 iter 229: train loss 2.35679. lr 5.587123e-04:  34%|███▎      | 229/681 [01:41<03:17,  2.29it/s][A
epoch 1 iter 229: train loss 2.35679. lr 5.587123e-04:  34%|███▍      | 230/681 [01:41<03:21,  2.23it/s][A
epoch 1 iter 230: train loss 2.34457. lr 5.583608e-04:  34%|███▍      | 230/681 [01:41<03:21,  2.23it/s][A
epoch 1 iter 230: train loss 2.34457. lr 5.583608e-04:  34%|███▍      | 231/681 [01:41<03:21,  2.23it/s][A
epoch 1 iter 231: train loss 2.34110. lr 5.580079e-04:  34%|███▍      | 231/681 [01:42<03:21,  2.23it/s][A
epoch 1 iter 231: train loss 2.34110. lr 5.580079e-04:  34%|███▍      | 232/681 [01:42<03:20,  2.24it/s][A
epoch 1 iter 232: train loss

epoch 1 iter 265: train loss 2.22867. lr 5.452046e-04:  39%|███▉      | 265/681 [01:57<02:50,  2.45it/s][A
epoch 1 iter 265: train loss 2.22867. lr 5.452046e-04:  39%|███▉      | 266/681 [01:57<02:50,  2.43it/s][A
epoch 1 iter 266: train loss 2.24192. lr 5.448047e-04:  39%|███▉      | 266/681 [01:58<02:50,  2.43it/s][A
epoch 1 iter 266: train loss 2.24192. lr 5.448047e-04:  39%|███▉      | 267/681 [01:58<02:49,  2.44it/s][A
epoch 1 iter 267: train loss 2.23371. lr 5.444036e-04:  39%|███▉      | 267/681 [01:58<02:49,  2.44it/s][A
epoch 1 iter 267: train loss 2.23371. lr 5.444036e-04:  39%|███▉      | 268/681 [01:58<02:49,  2.44it/s][A
epoch 1 iter 268: train loss 2.24632. lr 5.440011e-04:  39%|███▉      | 268/681 [01:59<02:49,  2.44it/s][A
epoch 1 iter 268: train loss 2.24632. lr 5.440011e-04:  40%|███▉      | 269/681 [01:59<02:48,  2.44it/s][A
epoch 1 iter 269: train loss 2.23572. lr 5.435973e-04:  40%|███▉      | 269/681 [01:59<02:48,  2.44it/s][A

epoch 1 iter 302: train loss 2.12014. lr 5.295572e-04:  44%|████▍     | 303/681 [02:14<02:39,  2.37it/s][A
epoch 1 iter 303: train loss 2.12564. lr 5.291105e-04:  44%|████▍     | 303/681 [02:14<02:39,  2.37it/s][A
epoch 1 iter 303: train loss 2.12564. lr 5.291105e-04:  45%|████▍     | 304/681 [02:14<02:38,  2.37it/s][A
epoch 1 iter 304: train loss 2.11904. lr 5.286626e-04:  45%|████▍     | 304/681 [02:15<02:38,  2.37it/s][A
epoch 1 iter 304: train loss 2.11904. lr 5.286626e-04:  45%|████▍     | 305/681 [02:15<02:37,  2.39it/s][A
epoch 1 iter 305: train loss 2.12017. lr 5.282134e-04:  45%|████▍     | 305/681 [02:15<02:37,  2.39it/s][A
epoch 1 iter 305: train loss 2.12017. lr 5.282134e-04:  45%|████▍     | 306/681 [02:15<02:37,  2.38it/s][A
epoch 1 iter 306: train loss 2.10630. lr 5.277631e-04:  45%|████▍     | 306/681 [02:15<02:37,  2.38it/s][A
epoch 1 iter 306: train loss 2.10630. lr 5.277631e-04:  45%|████▌     | 307/681 [02:15<02:31,  2.48it/s][A
epoch 1 iter 307: train loss

epoch 1 iter 340: train loss 2.00149. lr 5.117441e-04:  50%|████▉     | 340/681 [02:30<02:46,  2.05it/s][A
epoch 1 iter 340: train loss 2.00149. lr 5.117441e-04:  50%|█████     | 341/681 [02:30<02:45,  2.06it/s][A
epoch 1 iter 341: train loss 2.00183. lr 5.112527e-04:  50%|█████     | 341/681 [02:31<02:45,  2.06it/s][A
epoch 1 iter 341: train loss 2.00183. lr 5.112527e-04:  50%|█████     | 342/681 [02:31<02:43,  2.07it/s][A
epoch 1 iter 342: train loss 1.97940. lr 5.107602e-04:  50%|█████     | 342/681 [02:31<02:43,  2.07it/s][A
epoch 1 iter 342: train loss 1.97940. lr 5.107602e-04:  50%|█████     | 343/681 [02:31<02:43,  2.07it/s][A
epoch 1 iter 343: train loss 2.00864. lr 5.102665e-04:  50%|█████     | 343/681 [02:32<02:43,  2.07it/s][A
epoch 1 iter 343: train loss 2.00864. lr 5.102665e-04:  51%|█████     | 344/681 [02:32<02:42,  2.07it/s][A
epoch 1 iter 344: train loss 1.99351. lr 5.097717e-04:  51%|█████     | 344/681 [02:32<02:42,  2.07it/s][A

epoch 1 iter 377: train loss 1.90374. lr 4.928324e-04:  56%|█████▌    | 378/681 [02:47<02:18,  2.19it/s][A
epoch 1 iter 378: train loss 1.90076. lr 4.923011e-04:  56%|█████▌    | 378/681 [02:48<02:18,  2.19it/s][A
epoch 1 iter 378: train loss 1.90076. lr 4.923011e-04:  56%|█████▌    | 379/681 [02:48<02:14,  2.25it/s][A
epoch 1 iter 379: train loss 1.89434. lr 4.917687e-04:  56%|█████▌    | 379/681 [02:48<02:14,  2.25it/s][A
epoch 1 iter 379: train loss 1.89434. lr 4.917687e-04:  56%|█████▌    | 380/681 [02:48<02:12,  2.27it/s][A
epoch 1 iter 380: train loss 1.87896. lr 4.912354e-04:  56%|█████▌    | 380/681 [02:49<02:12,  2.27it/s][A
epoch 1 iter 380: train loss 1.87896. lr 4.912354e-04:  56%|█████▌    | 381/681 [02:49<02:09,  2.31it/s][A
epoch 1 iter 381: train loss 1.88177. lr 4.907010e-04:  56%|█████▌    | 381/681 [02:49<02:09,  2.31it/s][A
epoch 1 iter 381: train loss 1.88177. lr 4.907010e-04:  56%|█████▌    | 382/681 [02:49<02:07,  2.35it/s][A
epoch 1 iter 382: train loss

epoch 1 iter 415: train loss 1.79950. lr 4.719459e-04:  61%|██████    | 415/681 [03:04<01:49,  2.43it/s][A
epoch 1 iter 415: train loss 1.79950. lr 4.719459e-04:  61%|██████    | 416/681 [03:04<01:50,  2.40it/s][A
epoch 1 iter 416: train loss 1.81406. lr 4.713777e-04:  61%|██████    | 416/681 [03:04<01:50,  2.40it/s][A
epoch 1 iter 416: train loss 1.81406. lr 4.713777e-04:  61%|██████    | 417/681 [03:04<01:51,  2.38it/s][A
epoch 1 iter 417: train loss 1.79005. lr 4.708085e-04:  61%|██████    | 417/681 [03:04<01:51,  2.38it/s][A
epoch 1 iter 417: train loss 1.79005. lr 4.708085e-04:  61%|██████▏   | 418/681 [03:04<01:51,  2.37it/s][A
epoch 1 iter 418: train loss 1.78953. lr 4.702384e-04:  61%|██████▏   | 418/681 [03:05<01:51,  2.37it/s][A
epoch 1 iter 418: train loss 1.78953. lr 4.702384e-04:  62%|██████▏   | 419/681 [03:05<01:51,  2.36it/s][A
epoch 1 iter 419: train loss 1.79326. lr 4.696675e-04:  62%|██████▏   | 419/681 [03:05<01:51,  2.36it/s][A

epoch 1 iter 452: train loss 1.71811. lr 4.503358e-04:  67%|██████▋   | 453/681 [03:19<01:48,  2.10it/s]
epoch 1 iter 453: train loss 1.71781. lr 4.497358e-04:  67%|██████▋   | 453/681 [03:20<01:48,  2.10it/s]
epoch 1 iter 453: train loss 1.71781. lr 4.497358e-04:  67%|██████▋   | 454/681 [03:20<02:24,  1.58it/s]
epoch 1 iter 454: train loss 1.70590. lr 4.491350e-04:  67%|██████▋   | 454/681 [03:21<02:24,  1.58it/s]
epoch 1 iter 454: train loss 1.70590. lr 4.491350e-04:  67%|██████▋   | 455/681 [03:21<02:21,  1.59it/s]
epoch 1 iter 455: train loss 1.70204. lr 4.485334e-04:  67%|██████▋   | 455/681 [03:22<02:21,  1.59it/s]
epoch 1 iter 455: train loss 1.70204. lr 4.485334e-04:  67%|██████▋   | 456/681 [03:22<02:15,  1.66it/s]
epoch 1 iter 456: train loss 1.69364. lr 4.479310e-04:  67%|██████▋   | 456/681 [03:22<02:15,  1.66it/s]
epoch 1 iter 456: train loss 1.69364. lr 4.479310e-04:  67%|██████▋   | 457/681 [03:22<02:11,  1.71it/s]
epoch 1 iter 457: train loss

epoch 1 iter 490: train loss 1.62902. lr 4.270012e-04:  72%|███████▏  | 490/681 [03:37<01:17,  2.46it/s]
epoch 1 iter 490: train loss 1.62902. lr 4.270012e-04:  72%|███████▏  | 491/681 [03:37<01:16,  2.49it/s]
epoch 1 iter 491: train loss 1.63485. lr 4.263731e-04:  72%|███████▏  | 491/681 [03:37<01:16,  2.49it/s]
epoch 1 iter 491: train loss 1.63485. lr 4.263731e-04:  72%|███████▏  | 492/681 [03:37<01:15,  2.51it/s]
epoch 1 iter 492: train loss 1.62309. lr 4.257443e-04:  72%|███████▏  | 492/681 [03:38<01:15,  2.51it/s]
epoch 1 iter 492: train loss 1.62309. lr 4.257443e-04:  72%|███████▏  | 493/681 [03:38<01:13,  2.56it/s]
epoch 1 iter 493: train loss 1.62143. lr 4.251149e-04:  72%|███████▏  | 493/681 [03:38<01:13,  2.56it/s]
epoch 1 iter 493: train loss 1.62143. lr 4.251149e-04:  73%|███████▎  | 494/681 [03:38<01:13,  2.56it/s]
epoch 1 iter 494: train loss 1.61629. lr 4.244848e-04:  73%|███████▎  | 494/681 [03:38<01:13,  2.56it/s]
epoch 1 iter 494: train loss

epoch 1 iter 527: train loss 1.54455. lr 4.033396e-04:  78%|███████▊  | 528/681 [03:52<01:04,  2.38it/s]
epoch 1 iter 528: train loss 1.54583. lr 4.026888e-04:  78%|███████▊  | 528/681 [03:53<01:04,  2.38it/s]
epoch 1 iter 528: train loss 1.54583. lr 4.026888e-04:  78%|███████▊  | 529/681 [03:53<01:06,  2.30it/s]
epoch 1 iter 529: train loss 1.54672. lr 4.020375e-04:  78%|███████▊  | 529/681 [03:53<01:06,  2.30it/s]
epoch 1 iter 529: train loss 1.54672. lr 4.020375e-04:  78%|███████▊  | 530/681 [03:53<01:03,  2.37it/s]
epoch 1 iter 530: train loss 1.54091. lr 4.013857e-04:  78%|███████▊  | 530/681 [03:53<01:03,  2.37it/s]
epoch 1 iter 530: train loss 1.54091. lr 4.013857e-04:  78%|███████▊  | 531/681 [03:53<01:02,  2.42it/s]
epoch 1 iter 531: train loss 1.55808. lr 4.007333e-04:  78%|███████▊  | 531/681 [03:54<01:02,  2.42it/s]
epoch 1 iter 531: train loss 1.55808. lr 4.007333e-04:  78%|███████▊  | 532/681 [03:54<01:03,  2.36it/s]
epoch 1 iter 532: train loss

epoch 1 iter 565: train loss 1.47430. lr 3.782551e-04:  83%|████████▎ | 565/681 [04:11<00:53,  2.18it/s]
epoch 1 iter 565: train loss 1.47430. lr 3.782551e-04:  83%|████████▎ | 566/681 [04:11<00:52,  2.19it/s]
epoch 1 iter 566: train loss 1.49516. lr 3.775860e-04:  83%|████████▎ | 566/681 [04:11<00:52,  2.19it/s]
epoch 1 iter 566: train loss 1.49516. lr 3.775860e-04:  83%|████████▎ | 567/681 [04:11<00:51,  2.20it/s]
epoch 1 iter 567: train loss 1.48078. lr 3.769165e-04:  83%|████████▎ | 567/681 [04:11<00:51,  2.20it/s]
epoch 1 iter 567: train loss 1.48078. lr 3.769165e-04:  83%|████████▎ | 568/681 [04:11<00:50,  2.22it/s]
epoch 1 iter 568: train loss 1.47965. lr 3.762466e-04:  83%|████████▎ | 568/681 [04:12<00:50,  2.22it/s]
epoch 1 iter 568: train loss 1.47965. lr 3.762466e-04:  84%|████████▎ | 569/681 [04:12<00:50,  2.21it/s]
epoch 1 iter 569: train loss 1.46787. lr 3.755762e-04:  84%|████████▎ | 569/681 [04:12<00:50,  2.21it/s]
epoch 1 iter 569: train loss

epoch 1 iter 602: train loss 1.41995. lr 3.532503e-04:  89%|████████▊ | 603/681 [04:27<00:34,  2.26it/s]
epoch 1 iter 603: train loss 1.42045. lr 3.525683e-04:  89%|████████▊ | 603/681 [04:27<00:34,  2.26it/s]
epoch 1 iter 603: train loss 1.42045. lr 3.525683e-04:  89%|████████▊ | 604/681 [04:27<00:33,  2.27it/s]
epoch 1 iter 604: train loss 1.40227. lr 3.518859e-04:  89%|████████▊ | 604/681 [04:28<00:33,  2.27it/s]
epoch 1 iter 604: train loss 1.40227. lr 3.518859e-04:  89%|████████▉ | 605/681 [04:28<00:33,  2.27it/s]
epoch 1 iter 605: train loss 1.41094. lr 3.512034e-04:  89%|████████▉ | 605/681 [04:28<00:33,  2.27it/s]
epoch 1 iter 605: train loss 1.41094. lr 3.512034e-04:  89%|████████▉ | 606/681 [04:28<00:32,  2.28it/s]
epoch 1 iter 606: train loss 1.40605. lr 3.505205e-04:  89%|████████▉ | 606/681 [04:28<00:32,  2.28it/s]
epoch 1 iter 606: train loss 1.40605. lr 3.505205e-04:  89%|████████▉ | 607/681 [04:28<00:31,  2.37it/s]
epoch 1 iter 607: train loss

epoch 1 iter 640: train loss 1.34538. lr 3.271668e-04:  94%|█████████▍| 640/681 [04:43<00:19,  2.05it/s]
epoch 1 iter 640: train loss 1.34538. lr 3.271668e-04:  94%|█████████▍| 641/681 [04:43<00:19,  2.03it/s]
epoch 1 iter 641: train loss 1.35363. lr 3.264767e-04:  94%|█████████▍| 641/681 [04:43<00:19,  2.03it/s]
epoch 1 iter 641: train loss 1.35363. lr 3.264767e-04:  94%|█████████▍| 642/681 [04:43<00:19,  2.02it/s]
epoch 1 iter 642: train loss 1.35444. lr 3.257864e-04:  94%|█████████▍| 642/681 [04:44<00:19,  2.02it/s]
epoch 1 iter 642: train loss 1.35444. lr 3.257864e-04:  94%|█████████▍| 643/681 [04:44<00:19,  1.97it/s]
epoch 1 iter 643: train loss 1.33767. lr 3.250961e-04:  94%|█████████▍| 643/681 [04:44<00:19,  1.97it/s]
epoch 1 iter 643: train loss 1.33767. lr 3.250961e-04:  95%|█████████▍| 644/681 [04:44<00:18,  2.01it/s]
epoch 1 iter 644: train loss 1.32678. lr 3.244055e-04:  95%|█████████▍| 644/681 [04:45<00:18,  2.01it/s]
epoch 1 iter 644: train loss

epoch 1 iter 677: train loss 1.27727. lr 3.015671e-04: 100%|█████████▉| 678/681 [05:00<00:01,  2.57it/s]
epoch 1 iter 678: train loss 1.26762. lr 3.008742e-04: 100%|█████████▉| 678/681 [05:00<00:01,  2.57it/s]
epoch 1 iter 678: train loss 1.26762. lr 3.008742e-04: 100%|█████████▉| 679/681 [05:00<00:00,  2.55it/s]
epoch 1 iter 679: train loss 1.27348. lr 3.001813e-04: 100%|█████████▉| 679/681 [05:01<00:00,  2.55it/s]
epoch 1 iter 679: train loss 1.27348. lr 3.001813e-04: 100%|█████████▉| 680/681 [05:01<00:00,  2.50it/s]
epoch 1 iter 680: train loss 1.28421. lr 3.000541e-04: 100%|█████████▉| 680/681 [05:01<00:00,  2.50it/s]
epoch 1 iter 680: train loss 1.28421. lr 3.000541e-04: 100%|██████████| 681/681 [05:01<00:00,  2.26it/s]

  0%|          | 0/681 [00:00<?, ?it/s]
epoch 2 iter 0: train loss 1.27188. lr 2.993612e-04:   0%|          | 0/681 [00:00<?, ?it/s]
epoch 2 iter 0: train loss 1.27188. lr 2.993612e-04:   0%|          | 1/681 [00:00<05:38,  2.01it/s]


epoch 2 iter 35: train loss 1.20290. lr 2.751386e-04:   5%|▌         | 35/681 [00:15<04:36,  2.34it/s]
epoch 2 iter 35: train loss 1.20290. lr 2.751386e-04:   5%|▌         | 36/681 [00:15<04:37,  2.32it/s]
epoch 2 iter 36: train loss 1.20280. lr 2.744482e-04:   5%|▌         | 36/681 [00:16<04:37,  2.32it/s]
epoch 2 iter 36: train loss 1.20280. lr 2.744482e-04:   5%|▌         | 37/681 [00:16<04:33,  2.35it/s]
epoch 2 iter 37: train loss 1.21785. lr 2.737579e-04:   5%|▌         | 37/681 [00:16<04:33,  2.35it/s]
epoch 2 iter 37: train loss 1.21785. lr 2.737579e-04:   6%|▌         | 38/681 [00:16<04:32,  2.36it/s]
epoch 2 iter 38: train loss 1.21175. lr 2.730677e-04:   6%|▌         | 38/681 [00:16<04:32,  2.36it/s]
epoch 2 iter 38: train loss 1.21175. lr 2.730677e-04:   6%|▌         | 39/681 [00:16<04:31,  2.36it/s]
epoch 2 iter 39: train loss 1.20261. lr 2.723777e-04:   6%|▌         | 39/681 [00:17<04:31,  2.36it/s]
epoch 2 iter 39: train loss 1.20261. lr 2.7237