## Train a character-level GPT on some text data

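The inputs here are plain text files. The character-level recipe chops them into individual characters and trains a GPT on them, so you could call it a char-transformer instead of a char-rnn, even if that doesn't roll off the tongue quite as well. The cells below follow the same recipe, but the vocabulary is built from the subword tokens produced by a pretrained tokenizer rather than from raw characters, and one model is trained to predict the next token for each text file in the data directory.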

In [2]:
from allennlp.data.token_indexers import TokenIndexer, PretrainedTransformerIndexer
from allennlp.data.tokenizers import Token, Tokenizer, PretrainedTransformerTokenizer

import nltk
#nltk.download('punkt')
import numpy as np
from os import listdir
from os.path import join as pathjoin
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from minGPT.mingpt.model import GPT, GPTConfig
from minGPT.mingpt.trainer import Trainer, TrainerConfig
from minGPT.mingpt.utils import sample, set_seed
# make deterministic
set_seed(42)

In [3]:
DATA_DIR = '/home/mlepekhin/data'
MODELS_DIR = '/home/mlepekhin/models'
transformer_model = 'DeepPavlov/rubert-base-cased'

In [4]:
import math
from torch.utils.data import Dataset


def detokenize(tokens):
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')

class BPEDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) tokens from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every token to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
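
The shifting described in the docstring of `__getitem__` can be checked on its own "hello" example. This is a minimal sketch in plain Python lists rather than tensors; `make_example` is a hypothetical helper that mirrors the slicing above, not part of the notebook.

```python
def make_example(data, stoi, idx, block_size):
    """Hypothetical helper: same slicing as BPEDataset.__getitem__, without torch."""
    chunk = data[idx:idx + block_size + 1]
    dix = [stoi[s] for s in chunk]
    return dix[:-1], dix[1:]  # inputs and targets, shifted by one position

data = list("hello")
symbols = sorted(set(data))
stoi = {s: i for i, s in enumerate(symbols)}
itos = {i: s for s, i in stoi.items()}

x, y = make_example(data, stoi, idx=0, block_size=4)

# Position t of y is the target for the prefix x[:t + 1].
pairs = [("".join(itos[i] for i in x[:t + 1]), itos[y[t]]) for t in range(len(x))]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```

With `block_size = 4` a single example carries four prediction targets, so a batch of B examples trains B*4 predictions in one forward pass.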


In [5]:
block_size = 128
tokenizer = PretrainedTransformerTokenizer(transformer_model)
#indexer = PretrainedTransformerIndexer(transformer_model)
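
The `detokenize` helper defined earlier assumes WordPiece-style output, where a `##` prefix marks a piece that continues the previous word and the first and last tokens are special markers. The sketch below reproduces that joining logic on a made-up token list (the tokens are illustrative, not real output of the tokenizer above); `join_wordpieces` is a hypothetical name.

```python
def join_wordpieces(tokens):
    """Hypothetical stand-in for detokenize: drop the first and last
    (special) tokens, then glue '##' continuation pieces to the previous piece."""
    return " ".join(str(t) for t in tokens[1:-1]).replace(" ##", "")

# Illustrative tokens only; a real tokenizer may split the words differently.
example_tokens = ["[CLS]", "un", "##believ", "##able", "story", "[SEP]"]
restored = join_wordpieces(example_tokens)
print(restored)
```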

In [10]:
def train_gpt_generator(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512,
                        max_epochs=2, batch_size=256):
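    # Read the corpus and split it into sentences before tokenizing.
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # Tokenize each sentence, dropping the first and last (special) tokens.
    tokens = [str(token) for sent in text_sentences for token in tokenizer.tokenize(sent)[1:-1]]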
    train_dataset = BPEDataset(tokens, block_size) 
    print('dataset is read')
    
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    print('model is built')
    tconf = TrainerConfig(
        max_epochs=max_epochs, batch_size=batch_size, learning_rate=6e-4,
        lr_decay=True, warmup_tokens=batch_size*20, final_tokens=max_epochs*len(train_dataset)*block_size,
        num_workers=4
    )
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), state_dict_file)
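
The `lr_decay`, `warmup_tokens` and `final_tokens` settings control how the learning rate changes as tokens are processed. The sketch below is an assumption about how the minGPT trainer applies them (linear warmup, then cosine decay with a floor at 10% of the base rate), not a copy of its code; `lr_at` and its parameters are hypothetical names. The sizes are the ones used in this notebook: 187706 tokens in the first file, block size 128, batch size 256, two epochs.

```python
import math

def lr_at(tokens_seen, base_lr, warmup_tokens, final_tokens, min_mult=0.1):
    """Hypothetical helper: linear warmup followed by cosine decay."""
    if tokens_seen < warmup_tokens:
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(min_mult, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return base_lr * mult

block_size, batch_size, max_epochs = 128, 256, 2
n_examples = 187706 - block_size            # len(train_dataset)
warmup_tokens = batch_size * 20
final_tokens = max_epochs * n_examples * block_size
tokens_per_batch = batch_size * block_size  # targets processed per full batch

# Learning rate after the first and second batches.
lr_first = lr_at(1 * tokens_per_batch, 6e-4, warmup_tokens, final_tokens)
lr_second = lr_at(2 * tokens_per_batch, 6e-4, warmup_tokens, final_tokens)
print(f"{lr_first:e}")
print(f"{lr_second:e}")
```

Under these assumptions the value after the first batch agrees with the first learning rate in the training log, 5.999995e-04. Because the warmup here covers only 5120 tokens, less than one batch, the schedule is already in its decay phase at the first logged step.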

In [11]:
GENRE_DATA_DIR = '/home/mlepekhin/data/genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_bpe'
LANG = 'ru_lemma'
!ls /home/mlepekhin/models/mini_gpt_bpe

en  ru	ru_lemma


In [12]:
#train_gpt_generator(
#        pathjoin(GENRE_DATA_DIR, LANG, 'A1.txt'),
#        pathjoin(GPT_MODELS_DIR, LANG, 'A1')
#)

In [None]:
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    label = train_text_file[:-4]
    train_gpt_generator(
        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
        pathjoin(GPT_MODELS_DIR, LANG, label)
    )

  0%|          | 0/10 [00:00<?, ?it/s]

data has 187706 characters, 11591 unique.
dataset is read
model is built




epoch 1 iter 0: train loss 9.48513. lr 5.999995e-04:   0%|          | 0/733 [00:07<?, ?it/s][A
epoch 1 iter 0: train loss 9.48513. lr 5.999995e-04:   0%|          | 1/733 [00:07<1:28:55,  7.29s/it][A
epoch 1 iter 1: train loss 8.67772. lr 5.999977e-04:   0%|          | 1/733 [00:07<1:28:55,  7.29s/it][A
epoch 1 iter 1: train loss 8.67772. lr 5.999977e-04:   0%|          | 2/733 [00:07<1:03:26,  5.21s/it][A
epoch 1 iter 2: train loss 8.19510. lr 5.999944e-04:   0%|          | 2/733 [00:07<1:03:26,  5.21s/it][A
epoch 1 iter 2: train loss 8.19510. lr 5.999944e-04:   0%|          | 3/733 [00:07<45:32,  3.74s/it]  [A
epoch 1 iter 3: train loss 7.90512. lr 5.999898e-04:   0%|          | 3/733 [00:08<45:32,  3.74s/it][A
epoch 1 iter 3: train loss 7.90512. lr 5.999898e-04:   1%|          | 4/733 [00:08<33:01,  2.72s/it][A
epoch 1 iter 4: train loss 7.66695. lr 5.999838e-04:   1%|          | 4/733 [00:08<33:01,  2.72s/it][A
epoch 1 iter 4: train loss 7.66695. lr 5.999838e-04:   1%|  

epoch 1 iter 37: train loss 6.08536. lr 5.990131e-04:   5%|▌         | 37/733 [00:19<04:14,  2.73it/s][A
epoch 1 iter 37: train loss 6.08536. lr 5.990131e-04:   5%|▌         | 38/733 [00:19<04:05,  2.83it/s][A
epoch 1 iter 38: train loss 6.03918. lr 5.989602e-04:   5%|▌         | 38/733 [00:19<04:05,  2.83it/s][A
epoch 1 iter 38: train loss 6.03918. lr 5.989602e-04:   5%|▌         | 39/733 [00:19<03:58,  2.91it/s][A
epoch 1 iter 39: train loss 6.07850. lr 5.989061e-04:   5%|▌         | 39/733 [00:20<03:58,  2.91it/s][A
epoch 1 iter 39: train loss 6.07850. lr 5.989061e-04:   5%|▌         | 40/733 [00:20<03:54,  2.96it/s][A
epoch 1 iter 40: train loss 6.12726. lr 5.988505e-04:   5%|▌         | 40/733 [00:20<03:54,  2.96it/s][A
epoch 1 iter 40: train loss 6.12726. lr 5.988505e-04:   6%|▌         | 41/733 [00:20<03:50,  3.00it/s][A
epoch 1 iter 41: train loss 6.03930. lr 5.987935e-04:   6%|▌         | 41/733 [00:20<03:50,  3.00it/s][A
epoch 1 iter 41: train loss 6.03930. lr 5.9879

epoch 1 iter 75: train loss 5.65926. lr 5.960425e-04:  10%|█         | 76/733 [00:31<03:29,  3.14it/s][A
epoch 1 iter 76: train loss 5.60277. lr 5.959377e-04:  10%|█         | 76/733 [00:32<03:29,  3.14it/s][A
epoch 1 iter 76: train loss 5.60277. lr 5.959377e-04:  11%|█         | 77/733 [00:32<03:51,  2.83it/s][A
epoch 1 iter 77: train loss 5.60534. lr 5.958315e-04:  11%|█         | 77/733 [00:32<03:51,  2.83it/s][A
epoch 1 iter 77: train loss 5.60534. lr 5.958315e-04:  11%|█         | 78/733 [00:32<03:44,  2.91it/s][A
epoch 1 iter 78: train loss 5.56525. lr 5.957240e-04:  11%|█         | 78/733 [00:32<03:44,  2.91it/s][A
epoch 1 iter 78: train loss 5.56525. lr 5.957240e-04:  11%|█         | 79/733 [00:32<03:39,  2.98it/s][A
epoch 1 iter 79: train loss 5.58142. lr 5.956151e-04:  11%|█         | 79/733 [00:33<03:39,  2.98it/s][A
epoch 1 iter 79: train loss 5.58142. lr 5.956151e-04:  11%|█         | 80/733 [00:33<03:35,  3.03it/s][A
epoch 1 iter 80: train loss 5.56352. lr 5.9550

epoch 1 iter 113: train loss 5.19435. lr 5.911080e-04:  16%|█▌        | 114/733 [00:44<03:27,  2.98it/s][A
epoch 1 iter 114: train loss 5.19096. lr 5.909519e-04:  16%|█▌        | 114/733 [00:44<03:27,  2.98it/s][A
epoch 1 iter 114: train loss 5.19096. lr 5.909519e-04:  16%|█▌        | 115/733 [00:44<03:24,  3.02it/s][A
epoch 1 iter 115: train loss 5.22027. lr 5.907944e-04:  16%|█▌        | 115/733 [00:44<03:24,  3.02it/s][A
epoch 1 iter 115: train loss 5.22027. lr 5.907944e-04:  16%|█▌        | 116/733 [00:44<03:21,  3.06it/s][A
epoch 1 iter 116: train loss 5.23598. lr 5.906356e-04:  16%|█▌        | 116/733 [00:45<03:21,  3.06it/s][A
epoch 1 iter 116: train loss 5.23598. lr 5.906356e-04:  16%|█▌        | 117/733 [00:45<03:19,  3.08it/s][A
epoch 1 iter 117: train loss 5.14283. lr 5.904755e-04:  16%|█▌        | 117/733 [00:45<03:19,  3.08it/s][A
epoch 1 iter 117: train loss 5.14283. lr 5.904755e-04:  16%|█▌        | 118/733 [00:45<03:20,  3.06it/s][A
epoch 1 iter 118: train loss

epoch 1 iter 151: train loss 4.89619. lr 5.842422e-04:  21%|██        | 151/733 [00:56<03:06,  3.12it/s][A
epoch 1 iter 151: train loss 4.89619. lr 5.842422e-04:  21%|██        | 152/733 [00:56<03:05,  3.12it/s][A
epoch 1 iter 152: train loss 4.88140. lr 5.840359e-04:  21%|██        | 152/733 [00:56<03:05,  3.12it/s][A
epoch 1 iter 152: train loss 4.88140. lr 5.840359e-04:  21%|██        | 153/733 [00:56<03:05,  3.12it/s][A
epoch 1 iter 153: train loss 4.91698. lr 5.838282e-04:  21%|██        | 153/733 [00:57<03:05,  3.12it/s][A
epoch 1 iter 153: train loss 4.91698. lr 5.838282e-04:  21%|██        | 154/733 [00:57<03:05,  3.12it/s][A
epoch 1 iter 154: train loss 4.92154. lr 5.836192e-04:  21%|██        | 154/733 [00:57<03:05,  3.12it/s][A
epoch 1 iter 154: train loss 4.92154. lr 5.836192e-04:  21%|██        | 155/733 [00:57<03:04,  3.13it/s][A
epoch 1 iter 155: train loss 4.87520. lr 5.834089e-04:  21%|██        | 155/733 [00:57<03:04,  3.13it/s][A
epoch 1 iter 155: train loss

epoch 1 iter 188: train loss 4.64506. lr 5.757448e-04:  26%|██▌       | 189/733 [01:08<02:54,  3.11it/s][A
epoch 1 iter 189: train loss 4.62857. lr 5.754908e-04:  26%|██▌       | 189/733 [01:08<02:54,  3.11it/s][A
epoch 1 iter 189: train loss 4.62857. lr 5.754908e-04:  26%|██▌       | 190/733 [01:08<02:54,  3.11it/s][A
epoch 1 iter 190: train loss 4.62191. lr 5.752356e-04:  26%|██▌       | 190/733 [01:09<02:54,  3.11it/s][A
epoch 1 iter 190: train loss 4.62191. lr 5.752356e-04:  26%|██▌       | 191/733 [01:09<02:54,  3.11it/s][A
epoch 1 iter 191: train loss 4.65019. lr 5.749791e-04:  26%|██▌       | 191/733 [01:09<02:54,  3.11it/s][A
epoch 1 iter 191: train loss 4.65019. lr 5.749791e-04:  26%|██▌       | 192/733 [01:09<02:53,  3.12it/s][A
epoch 1 iter 192: train loss 4.60762. lr 5.747213e-04:  26%|██▌       | 192/733 [01:09<02:53,  3.12it/s][A
epoch 1 iter 192: train loss 4.60762. lr 5.747213e-04:  26%|██▋       | 193/733 [01:09<02:52,  3.12it/s][A
epoch 1 iter 193: train loss

epoch 1 iter 226: train loss 4.37511. lr 5.652131e-04:  31%|███       | 226/733 [01:20<02:42,  3.12it/s][A
epoch 1 iter 226: train loss 4.37511. lr 5.652131e-04:  31%|███       | 227/733 [01:20<02:42,  3.12it/s][A
epoch 1 iter 227: train loss 4.37317. lr 5.649118e-04:  31%|███       | 227/733 [01:21<02:42,  3.12it/s][A
epoch 1 iter 227: train loss 4.37317. lr 5.649118e-04:  31%|███       | 228/733 [01:21<02:41,  3.12it/s][A
epoch 1 iter 228: train loss 4.36979. lr 5.646094e-04:  31%|███       | 228/733 [01:21<02:41,  3.12it/s][A
epoch 1 iter 228: train loss 4.36979. lr 5.646094e-04:  31%|███       | 229/733 [01:21<02:41,  3.12it/s][A
epoch 1 iter 229: train loss 4.36388. lr 5.643057e-04:  31%|███       | 229/733 [01:21<02:41,  3.12it/s][A
epoch 1 iter 229: train loss 4.36388. lr 5.643057e-04:  31%|███▏      | 230/733 [01:21<02:41,  3.12it/s][A
epoch 1 iter 230: train loss 4.30422. lr 5.640008e-04:  31%|███▏      | 230/733 [01:22<02:41,  3.12it/s][A
epoch 1 iter 230: train loss

epoch 1 iter 263: train loss 4.06998. lr 5.532672e-04:  36%|███▌      | 264/733 [01:32<02:30,  3.11it/s][A
epoch 1 iter 264: train loss 4.06795. lr 5.529219e-04:  36%|███▌      | 264/733 [01:33<02:30,  3.11it/s][A
epoch 1 iter 264: train loss 4.06795. lr 5.529219e-04:  36%|███▌      | 265/733 [01:33<02:30,  3.11it/s][A
epoch 1 iter 265: train loss 4.07715. lr 5.525754e-04:  36%|███▌      | 265/733 [01:33<02:30,  3.11it/s][A
epoch 1 iter 265: train loss 4.07715. lr 5.525754e-04:  36%|███▋      | 266/733 [01:33<02:30,  3.11it/s][A
epoch 1 iter 266: train loss 4.07689. lr 5.522278e-04:  36%|███▋      | 266/733 [01:33<02:30,  3.11it/s][A
epoch 1 iter 266: train loss 4.07689. lr 5.522278e-04:  36%|███▋      | 267/733 [01:33<02:29,  3.11it/s][A
epoch 1 iter 267: train loss 4.06563. lr 5.518790e-04:  36%|███▋      | 267/733 [01:34<02:29,  3.11it/s][A
epoch 1 iter 267: train loss 4.06563. lr 5.518790e-04:  37%|███▋      | 268/733 [01:34<02:42,  2.87it/s][A
epoch 1 iter 268: train loss

epoch 1 iter 301: train loss 3.79345. lr 5.393412e-04:  41%|████      | 301/733 [01:45<02:21,  3.06it/s][A
epoch 1 iter 301: train loss 3.79345. lr 5.393412e-04:  41%|████      | 302/733 [01:45<02:20,  3.08it/s][A
epoch 1 iter 302: train loss 3.78204. lr 5.389529e-04:  41%|████      | 302/733 [01:45<02:20,  3.08it/s][A
epoch 1 iter 302: train loss 3.78204. lr 5.389529e-04:  41%|████▏     | 303/733 [01:45<02:19,  3.09it/s][A
epoch 1 iter 303: train loss 3.80581. lr 5.385634e-04:  41%|████▏     | 303/733 [01:45<02:19,  3.09it/s][A
epoch 1 iter 303: train loss 3.80581. lr 5.385634e-04:  41%|████▏     | 304/733 [01:45<02:18,  3.09it/s][A
epoch 1 iter 304: train loss 3.74852. lr 5.381729e-04:  41%|████▏     | 304/733 [01:46<02:18,  3.09it/s][A
epoch 1 iter 304: train loss 3.74852. lr 5.381729e-04:  42%|████▏     | 305/733 [01:46<02:18,  3.10it/s][A
epoch 1 iter 305: train loss 3.74933. lr 5.377812e-04:  42%|████▏     | 305/733 [01:46<02:18,  3.10it/s][A
epoch 1 iter 305: train loss

epoch 1 iter 338: train loss 3.45575. lr 5.242551e-04:  46%|████▌     | 339/733 [01:57<02:06,  3.11it/s][A
epoch 1 iter 339: train loss 3.47641. lr 5.238274e-04:  46%|████▌     | 339/733 [01:57<02:06,  3.11it/s][A
epoch 1 iter 339: train loss 3.47641. lr 5.238274e-04:  46%|████▋     | 340/733 [01:57<02:06,  3.11it/s][A
epoch 1 iter 340: train loss 3.43544. lr 5.233986e-04:  46%|████▋     | 340/733 [01:57<02:06,  3.11it/s][A
epoch 1 iter 340: train loss 3.43544. lr 5.233986e-04:  47%|████▋     | 341/733 [01:57<02:05,  3.11it/s][A
epoch 1 iter 341: train loss 3.42182. lr 5.229688e-04:  47%|████▋     | 341/733 [01:58<02:05,  3.11it/s][A
epoch 1 iter 341: train loss 3.42182. lr 5.229688e-04:  47%|████▋     | 342/733 [01:58<02:05,  3.11it/s][A
epoch 1 iter 342: train loss 3.39919. lr 5.225379e-04:  47%|████▋     | 342/733 [01:58<02:05,  3.11it/s][A
epoch 1 iter 342: train loss 3.39919. lr 5.225379e-04:  47%|████▋     | 343/733 [01:58<02:05,  3.11it/s][A
epoch 1 iter 343: train loss

epoch 1 iter 376: train loss 3.12033. lr 5.072941e-04:  51%|█████▏    | 376/733 [02:09<01:55,  3.09it/s][A
epoch 1 iter 376: train loss 3.12033. lr 5.072941e-04:  51%|█████▏    | 377/733 [02:09<01:55,  3.09it/s][A
epoch 1 iter 377: train loss 3.09934. lr 5.068287e-04:  51%|█████▏    | 377/733 [02:09<01:55,  3.09it/s][A
epoch 1 iter 377: train loss 3.09934. lr 5.068287e-04:  52%|█████▏    | 378/733 [02:09<01:54,  3.09it/s][A
epoch 1 iter 378: train loss 3.07774. lr 5.063623e-04:  52%|█████▏    | 378/733 [02:10<01:54,  3.09it/s][A
epoch 1 iter 378: train loss 3.07774. lr 5.063623e-04:  52%|█████▏    | 379/733 [02:10<01:54,  3.09it/s][A
epoch 1 iter 379: train loss 3.08982. lr 5.058950e-04:  52%|█████▏    | 379/733 [02:10<01:54,  3.09it/s][A
epoch 1 iter 379: train loss 3.08982. lr 5.058950e-04:  52%|█████▏    | 380/733 [02:10<02:03,  2.86it/s][A
epoch 1 iter 380: train loss 3.05270. lr 5.054267e-04:  52%|█████▏    | 380/733 [02:10<02:03,  2.86it/s][A
epoch 1 iter 380: train loss

epoch 1 iter 413: train loss 2.79682. lr 4.894570e-04:  56%|█████▋    | 414/733 [02:21<01:44,  3.05it/s][A
epoch 1 iter 414: train loss 2.75651. lr 4.889579e-04:  56%|█████▋    | 414/733 [02:21<01:44,  3.05it/s][A
epoch 1 iter 414: train loss 2.75651. lr 4.889579e-04:  57%|█████▋    | 415/733 [02:21<01:43,  3.06it/s][A
epoch 1 iter 415: train loss 2.72984. lr 4.884579e-04:  57%|█████▋    | 415/733 [02:22<01:43,  3.06it/s][A
epoch 1 iter 415: train loss 2.72984. lr 4.884579e-04:  57%|█████▋    | 416/733 [02:22<01:43,  3.07it/s][A
epoch 1 iter 416: train loss 2.72190. lr 4.879570e-04:  57%|█████▋    | 416/733 [02:22<01:43,  3.07it/s][A
epoch 1 iter 416: train loss 2.72190. lr 4.879570e-04:  57%|█████▋    | 417/733 [02:22<01:42,  3.07it/s][A
epoch 1 iter 417: train loss 2.73809. lr 4.874552e-04:  57%|█████▋    | 417/733 [02:22<01:42,  3.07it/s][A
epoch 1 iter 417: train loss 2.73809. lr 4.874552e-04:  57%|█████▋    | 418/733 [02:22<01:42,  3.08it/s][A
epoch 1 iter 418: train loss

epoch 1 iter 451: train loss 2.36839. lr 4.698986e-04:  62%|██████▏   | 451/733 [02:34<01:31,  3.09it/s][A
epoch 1 iter 451: train loss 2.36839. lr 4.698986e-04:  62%|██████▏   | 452/733 [02:34<01:31,  3.08it/s][A
epoch 1 iter 452: train loss 2.36283. lr 4.693681e-04:  62%|██████▏   | 452/733 [02:34<01:31,  3.08it/s][A
epoch 1 iter 452: train loss 2.36283. lr 4.693681e-04:  62%|██████▏   | 453/733 [02:34<01:30,  3.08it/s][A
epoch 1 iter 453: train loss 2.40508. lr 4.688368e-04:  62%|██████▏   | 453/733 [02:34<01:30,  3.08it/s][A
epoch 1 iter 453: train loss 2.40508. lr 4.688368e-04:  62%|██████▏   | 454/733 [02:34<01:30,  3.08it/s][A
epoch 1 iter 454: train loss 2.33738. lr 4.683048e-04:  62%|██████▏   | 454/733 [02:35<01:30,  3.08it/s][A
epoch 1 iter 454: train loss 2.33738. lr 4.683048e-04:  62%|██████▏   | 455/733 [02:35<01:30,  3.09it/s][A
epoch 1 iter 455: train loss 2.34707. lr 4.677719e-04:  62%|██████▏   | 455/733 [02:35<01:30,  3.09it/s][A
epoch 1 iter 455: train loss

epoch 1 iter 488: train loss 2.04230. lr 4.497707e-04:  67%|██████▋   | 489/733 [02:46<01:19,  3.08it/s][A
epoch 1 iter 489: train loss 2.02790. lr 4.492131e-04:  67%|██████▋   | 489/733 [02:46<01:19,  3.08it/s][A
epoch 1 iter 489: train loss 2.02790. lr 4.492131e-04:  67%|██████▋   | 490/733 [02:46<01:18,  3.08it/s][A
epoch 1 iter 490: train loss 2.02374. lr 4.486548e-04:  67%|██████▋   | 490/733 [02:46<01:18,  3.08it/s][A
epoch 1 iter 490: train loss 2.02374. lr 4.486548e-04:  67%|██████▋   | 491/733 [02:46<01:18,  3.08it/s][A
epoch 1 iter 491: train loss 2.01704. lr 4.480957e-04:  67%|██████▋   | 491/733 [02:47<01:18,  3.08it/s][A
epoch 1 iter 491: train loss 2.01704. lr 4.480957e-04:  67%|██████▋   | 492/733 [02:47<01:24,  2.84it/s][A
epoch 1 iter 492: train loss 2.00750. lr 4.475360e-04:  67%|██████▋   | 492/733 [02:47<01:24,  2.84it/s][A
epoch 1 iter 492: train loss 2.00750. lr 4.475360e-04:  67%|██████▋   | 493/733 [02:47<01:22,  2.90it/s][A
epoch 1 iter 493: train loss

epoch 1 iter 526: train loss 1.71367. lr 4.281196e-04:  72%|███████▏  | 526/733 [02:58<01:08,  3.04it/s][A
epoch 1 iter 526: train loss 1.71367. lr 4.281196e-04:  72%|███████▏  | 527/733 [02:58<01:07,  3.05it/s][A
epoch 1 iter 527: train loss 1.66383. lr 4.275377e-04:  72%|███████▏  | 527/733 [02:58<01:07,  3.05it/s][A
epoch 1 iter 527: train loss 1.66383. lr 4.275377e-04:  72%|███████▏  | 528/733 [02:58<01:07,  3.06it/s][A
epoch 1 iter 528: train loss 1.68711. lr 4.269552e-04:  72%|███████▏  | 528/733 [02:59<01:07,  3.06it/s][A
epoch 1 iter 528: train loss 1.68711. lr 4.269552e-04:  72%|███████▏  | 529/733 [02:59<01:06,  3.06it/s][A
epoch 1 iter 529: train loss 1.70097. lr 4.263722e-04:  72%|███████▏  | 529/733 [02:59<01:06,  3.06it/s][A
epoch 1 iter 529: train loss 1.70097. lr 4.263722e-04:  72%|███████▏  | 530/733 [02:59<01:06,  3.07it/s][A
epoch 1 iter 530: train loss 1.61772. lr 4.257885e-04:  72%|███████▏  | 530/733 [02:59<01:06,  3.07it/s][A
epoch 1 iter 530: train loss

epoch 1 iter 563: train loss 1.36493. lr 4.062203e-04:  77%|███████▋  | 564/733 [03:10<00:55,  3.02it/s][A
epoch 1 iter 564: train loss 1.34994. lr 4.056185e-04:  77%|███████▋  | 564/733 [03:11<00:55,  3.02it/s][A
epoch 1 iter 564: train loss 1.34994. lr 4.056185e-04:  77%|███████▋  | 565/733 [03:11<00:55,  3.02it/s][A
epoch 1 iter 565: train loss 1.37866. lr 4.050162e-04:  77%|███████▋  | 565/733 [03:11<00:55,  3.02it/s][A
epoch 1 iter 565: train loss 1.37866. lr 4.050162e-04:  77%|███████▋  | 566/733 [03:11<00:55,  3.01it/s][A
epoch 1 iter 566: train loss 1.34510. lr 4.044135e-04:  77%|███████▋  | 566/733 [03:11<00:55,  3.01it/s][A
epoch 1 iter 566: train loss 1.34510. lr 4.044135e-04:  77%|███████▋  | 567/733 [03:11<00:55,  3.01it/s][A
epoch 1 iter 567: train loss 1.37483. lr 4.038102e-04:  77%|███████▋  | 567/733 [03:12<00:55,  3.01it/s][A
epoch 1 iter 567: train loss 1.37483. lr 4.038102e-04:  77%|███████▋  | 568/733 [03:12<00:54,  3.00it/s][A
epoch 1 iter 568: train loss

epoch 1 iter 601: train loss 1.07245. lr 3.830350e-04:  82%|████████▏ | 601/733 [03:23<00:44,  2.99it/s][A
epoch 1 iter 601: train loss 1.07245. lr 3.830350e-04:  82%|████████▏ | 602/733 [03:23<00:43,  2.99it/s][A
epoch 1 iter 602: train loss 1.09041. lr 3.824167e-04:  82%|████████▏ | 602/733 [03:23<00:43,  2.99it/s][A
epoch 1 iter 602: train loss 1.09041. lr 3.824167e-04:  82%|████████▏ | 603/733 [03:23<00:43,  2.99it/s][A
epoch 1 iter 603: train loss 1.11812. lr 3.817981e-04:  82%|████████▏ | 603/733 [03:24<00:43,  2.99it/s][A
epoch 1 iter 603: train loss 1.11812. lr 3.817981e-04:  82%|████████▏ | 604/733 [03:24<00:46,  2.77it/s][A
epoch 1 iter 604: train loss 1.09448. lr 3.811790e-04:  82%|████████▏ | 604/733 [03:24<00:46,  2.77it/s][A
epoch 1 iter 604: train loss 1.09448. lr 3.811790e-04:  83%|████████▎ | 605/733 [03:24<00:45,  2.83it/s][A
epoch 1 iter 605: train loss 1.07012. lr 3.805597e-04:  83%|████████▎ | 605/733 [03:25<00:45,  2.83it/s][A
epoch 1 iter 605: train loss

epoch 1 iter 638: train loss 0.90532. lr 3.599292e-04:  87%|████████▋ | 639/733 [03:36<00:31,  2.97it/s][A
epoch 1 iter 639: train loss 0.89651. lr 3.592988e-04:  87%|████████▋ | 639/733 [03:36<00:31,  2.97it/s][A
epoch 1 iter 639: train loss 0.89651. lr 3.592988e-04:  87%|████████▋ | 640/733 [03:36<00:31,  2.98it/s][A
epoch 1 iter 640: train loss 0.91301. lr 3.586682e-04:  87%|████████▋ | 640/733 [03:36<00:31,  2.98it/s][A
epoch 1 iter 640: train loss 0.91301. lr 3.586682e-04:  87%|████████▋ | 641/733 [03:36<00:30,  2.98it/s][A
epoch 1 iter 641: train loss 0.88989. lr 3.580373e-04:  87%|████████▋ | 641/733 [03:37<00:30,  2.98it/s][A
epoch 1 iter 641: train loss 0.88989. lr 3.580373e-04:  88%|████████▊ | 642/733 [03:37<00:30,  2.98it/s][A
epoch 1 iter 642: train loss 0.90080. lr 3.574061e-04:  88%|████████▊ | 642/733 [03:37<00:30,  2.98it/s][A
epoch 1 iter 642: train loss 0.90080. lr 3.574061e-04:  88%|████████▊ | 643/733 [03:37<00:30,  2.99it/s][A
epoch 1 iter 643: train loss

epoch 1 iter 676: train loss 0.74249. lr 3.358080e-04:  92%|█████████▏| 676/733 [03:48<00:19,  2.98it/s][A
epoch 1 iter 676: train loss 0.74249. lr 3.358080e-04:  92%|█████████▏| 677/733 [03:48<00:18,  2.98it/s][A
epoch 1 iter 677: train loss 0.76029. lr 3.351693e-04:  92%|█████████▏| 677/733 [03:49<00:18,  2.98it/s][A
epoch 1 iter 677: train loss 0.76029. lr 3.351693e-04:  92%|█████████▏| 678/733 [03:49<00:18,  2.98it/s][A
epoch 1 iter 678: train loss 0.74400. lr 3.345304e-04:  92%|█████████▏| 678/733 [03:49<00:18,  2.98it/s][A
epoch 1 iter 678: train loss 0.74400. lr 3.345304e-04:  93%|█████████▎| 679/733 [03:49<00:18,  2.98it/s][A
epoch 1 iter 679: train loss 0.72920. lr 3.338914e-04:  93%|█████████▎| 679/733 [03:49<00:18,  2.98it/s][A
epoch 1 iter 679: train loss 0.72920. lr 3.338914e-04:  93%|█████████▎| 680/733 [03:49<00:17,  2.98it/s][A
epoch 1 iter 680: train loss 0.72200. lr 3.332523e-04:  93%|█████████▎| 680/733 [03:50<00:17,  2.98it/s][A
epoch 1 iter 680: train loss

epoch 1 iter 713: train loss 0.62919. lr 3.120919e-04:  97%|█████████▋| 714/733 [04:01<00:06,  2.97it/s][A
epoch 1 iter 714: train loss 0.61401. lr 3.114492e-04:  97%|█████████▋| 714/733 [04:01<00:06,  2.97it/s][A
epoch 1 iter 714: train loss 0.61401. lr 3.114492e-04:  98%|█████████▊| 715/733 [04:01<00:06,  2.97it/s][A
epoch 1 iter 715: train loss 0.62663. lr 3.108064e-04:  98%|█████████▊| 715/733 [04:02<00:06,  2.97it/s][A
epoch 1 iter 715: train loss 0.62663. lr 3.108064e-04:  98%|█████████▊| 716/733 [04:02<00:06,  2.71it/s][A
epoch 1 iter 716: train loss 0.62983. lr 3.101636e-04:  98%|█████████▊| 716/733 [04:02<00:06,  2.71it/s][A
epoch 1 iter 716: train loss 0.62983. lr 3.101636e-04:  98%|█████████▊| 717/733 [04:02<00:05,  2.79it/s][A
epoch 1 iter 717: train loss 0.62331. lr 3.095208e-04:  98%|█████████▊| 717/733 [04:02<00:05,  2.79it/s][A
epoch 1 iter 717: train loss 0.62331. lr 3.095208e-04:  98%|█████████▊| 718/733 [04:02<00:05,  2.84it/s][A
epoch 1 iter 718: train loss

epoch 2 iter 18: train loss 0.53672. lr 2.878328e-04:   3%|▎         | 19/733 [00:06<03:56,  3.01it/s][A
epoch 2 iter 19: train loss 0.54758. lr 2.871902e-04:   3%|▎         | 19/733 [00:06<03:56,  3.01it/s][A
epoch 2 iter 19: train loss 0.54758. lr 2.871902e-04:   3%|▎         | 20/733 [00:06<03:56,  3.02it/s][A
epoch 2 iter 20: train loss 0.52554. lr 2.865476e-04:   3%|▎         | 20/733 [00:07<03:56,  3.02it/s][A
epoch 2 iter 20: train loss 0.52554. lr 2.865476e-04:   3%|▎         | 21/733 [00:07<03:55,  3.02it/s][A
epoch 2 iter 21: train loss 0.53720. lr 2.859051e-04:   3%|▎         | 21/733 [00:07<03:55,  3.02it/s][A
epoch 2 iter 21: train loss 0.53720. lr 2.859051e-04:   3%|▎         | 22/733 [00:07<03:54,  3.03it/s][A
epoch 2 iter 22: train loss 0.54324. lr 2.852626e-04:   3%|▎         | 22/733 [00:07<03:54,  3.03it/s][A
epoch 2 iter 22: train loss 0.54324. lr 2.852626e-04:   3%|▎         | 23/733 [00:07<03:54,  3.03it/s][A
epoch 2 iter 23: train loss 0.53462. lr 2.8462

epoch 2 iter 57: train loss 0.49229. lr 2.628404e-04:   8%|▊         | 57/733 [00:19<03:44,  3.02it/s][A
epoch 2 iter 57: train loss 0.49229. lr 2.628404e-04:   8%|▊         | 58/733 [00:19<03:43,  3.02it/s][A
epoch 2 iter 58: train loss 0.48909. lr 2.622022e-04:   8%|▊         | 58/733 [00:19<03:43,  3.02it/s][A
epoch 2 iter 58: train loss 0.48909. lr 2.622022e-04:   8%|▊         | 59/733 [00:19<03:44,  3.01it/s][A
epoch 2 iter 59: train loss 0.48234. lr 2.615642e-04:   8%|▊         | 59/733 [00:20<03:44,  3.01it/s][A
epoch 2 iter 59: train loss 0.48234. lr 2.615642e-04:   8%|▊         | 60/733 [00:20<03:43,  3.01it/s][A
epoch 2 iter 60: train loss 0.47391. lr 2.609264e-04:   8%|▊         | 60/733 [00:20<03:43,  3.01it/s][A
epoch 2 iter 60: train loss 0.47391. lr 2.609264e-04:   8%|▊         | 61/733 [00:20<03:42,  3.01it/s][A
epoch 2 iter 61: train loss 0.46352. lr 2.602888e-04:   8%|▊         | 61/733 [00:20<03:42,  3.01it/s][A
epoch 2 iter 61: train loss 0.46352. lr 2.6028

epoch 2 iter 95: train loss 0.43123. lr 2.387371e-04:  13%|█▎        | 96/733 [00:32<03:46,  2.82it/s][A
epoch 2 iter 96: train loss 0.44474. lr 2.381076e-04:  13%|█▎        | 96/733 [00:32<03:46,  2.82it/s][A
epoch 2 iter 96: train loss 0.44474. lr 2.381076e-04:  13%|█▎        | 97/733 [00:32<03:42,  2.85it/s][A
epoch 2 iter 97: train loss 0.44978. lr 2.374784e-04:  13%|█▎        | 97/733 [00:32<03:42,  2.85it/s][A
epoch 2 iter 97: train loss 0.44978. lr 2.374784e-04:  13%|█▎        | 98/733 [00:32<03:39,  2.89it/s][A
epoch 2 iter 98: train loss 0.43440. lr 2.368495e-04:  13%|█▎        | 98/733 [00:33<03:39,  2.89it/s][A
epoch 2 iter 98: train loss 0.43440. lr 2.368495e-04:  14%|█▎        | 99/733 [00:33<03:37,  2.91it/s][A
epoch 2 iter 99: train loss 0.42679. lr 2.362208e-04:  14%|█▎        | 99/733 [00:33<03:37,  2.91it/s][A
epoch 2 iter 99: train loss 0.42679. lr 2.362208e-04:  14%|█▎        | 100/733 [00:33<03:36,  2.92it/s][A
epoch 2 iter 100: train loss 0.43894. lr 2.35

epoch 2 iter 133: train loss 0.39830. lr 2.150403e-04:  18%|█▊        | 134/733 [00:45<03:24,  2.92it/s][A
epoch 2 iter 134: train loss 0.40199. lr 2.144236e-04:  18%|█▊        | 134/733 [00:45<03:24,  2.92it/s][A
epoch 2 iter 134: train loss 0.40199. lr 2.144236e-04:  18%|█▊        | 135/733 [00:45<03:25,  2.91it/s][A
epoch 2 iter 135: train loss 0.39485. lr 2.138073e-04:  18%|█▊        | 135/733 [00:45<03:25,  2.91it/s][A
epoch 2 iter 135: train loss 0.39485. lr 2.138073e-04:  19%|█▊        | 136/733 [00:45<03:23,  2.93it/s][A
epoch 2 iter 136: train loss 0.41100. lr 2.131914e-04:  19%|█▊        | 136/733 [00:46<03:23,  2.93it/s][A
epoch 2 iter 136: train loss 0.41100. lr 2.131914e-04:  19%|█▊        | 137/733 [00:46<03:23,  2.92it/s][A
epoch 2 iter 137: train loss 0.38765. lr 2.125760e-04:  19%|█▊        | 137/733 [00:46<03:23,  2.92it/s][A
epoch 2 iter 137: train loss 0.38765. lr 2.125760e-04:  19%|█▉        | 138/733 [00:46<03:24,  2.92it/s][A
epoch 2 iter 138: train loss

epoch 2 iter 171: train loss 0.36630. lr 1.919070e-04:  23%|██▎       | 171/733 [00:58<03:14,  2.89it/s][A
epoch 2 iter 171: train loss 0.36630. lr 1.919070e-04:  23%|██▎       | 172/733 [00:58<03:14,  2.89it/s][A
epoch 2 iter 172: train loss 0.36949. lr 1.913073e-04:  23%|██▎       | 172/733 [00:58<03:14,  2.89it/s][A
epoch 2 iter 172: train loss 0.36949. lr 1.913073e-04:  24%|██▎       | 173/733 [00:58<03:13,  2.90it/s][A
epoch 2 iter 173: train loss 0.36906. lr 1.907080e-04:  24%|██▎       | 173/733 [00:59<03:13,  2.90it/s][A
epoch 2 iter 173: train loss 0.36906. lr 1.907080e-04:  24%|██▎       | 174/733 [00:59<03:13,  2.88it/s][A
epoch 2 iter 174: train loss 0.36689. lr 1.901093e-04:  24%|██▎       | 174/733 [00:59<03:13,  2.88it/s][A
epoch 2 iter 174: train loss 0.36689. lr 1.901093e-04:  24%|██▍       | 175/733 [00:59<03:13,  2.88it/s][A
epoch 2 iter 175: train loss 0.36129. lr 1.895111e-04:  24%|██▍       | 175/733 [00:59<03:13,  2.88it/s][A
epoch 2 iter 175: train loss

epoch 2 iter 208: train loss 0.35017. lr 1.700704e-04:  29%|██▊       | 209/733 [01:11<03:07,  2.79it/s][A
epoch 2 iter 209: train loss 0.34626. lr 1.694909e-04:  29%|██▊       | 209/733 [01:11<03:07,  2.79it/s][A
epoch 2 iter 209: train loss 0.34626. lr 1.694909e-04:  29%|██▊       | 210/733 [01:11<03:05,  2.81it/s][A
epoch 2 iter 210: train loss 0.35563. lr 1.689121e-04:  29%|██▊       | 210/733 [01:12<03:05,  2.81it/s][A
epoch 2 iter 210: train loss 0.35563. lr 1.689121e-04:  29%|██▉       | 211/733 [01:12<03:03,  2.84it/s][A
epoch 2 iter 211: train loss 0.34221. lr 1.683338e-04:  29%|██▉       | 211/733 [01:12<03:03,  2.84it/s][A
epoch 2 iter 211: train loss 0.34221. lr 1.683338e-04:  29%|██▉       | 212/733 [01:12<03:03,  2.84it/s][A
epoch 2 iter 212: train loss 0.35249. lr 1.677562e-04:  29%|██▉       | 212/733 [01:12<03:03,  2.84it/s][A
epoch 2 iter 212: train loss 0.35249. lr 1.677562e-04:  29%|██▉       | 213/733 [01:12<03:02,  2.85it/s][A
epoch 2 iter 213: train loss

epoch 2 iter 246: train loss 0.32839. lr 1.484954e-04:  34%|███▎      | 246/733 [01:24<02:51,  2.84it/s][A
epoch 2 iter 246: train loss 0.32839. lr 1.484954e-04:  34%|███▎      | 247/733 [01:24<02:51,  2.83it/s][A
epoch 2 iter 247: train loss 0.33104. lr 1.479406e-04:  34%|███▎      | 247/733 [01:25<02:51,  2.83it/s][A
epoch 2 iter 247: train loss 0.33104. lr 1.479406e-04:  34%|███▍      | 248/733 [01:25<02:50,  2.84it/s][A
epoch 2 iter 248: train loss 0.32453. lr 1.473865e-04:  34%|███▍      | 248/733 [01:25<02:50,  2.84it/s][A
epoch 2 iter 248: train loss 0.32453. lr 1.473865e-04:  34%|███▍      | 249/733 [01:25<02:50,  2.83it/s][A
epoch 2 iter 249: train loss 0.32015. lr 1.468331e-04:  34%|███▍      | 249/733 [01:25<02:50,  2.83it/s][A
epoch 2 iter 249: train loss 0.32015. lr 1.468331e-04:  34%|███▍      | 250/733 [01:25<02:49,  2.84it/s][A
epoch 2 iter 250: train loss 0.32989. lr 1.462804e-04:  34%|███▍      | 250/733 [01:26<02:49,  2.84it/s][A

epoch 2 iter 283: train loss 0.31161. lr 1.284527e-04:  39%|███▊      | 284/733 [01:37<02:41,  2.78it/s][A
epoch 2 iter 284: train loss 0.31489. lr 1.279255e-04:  39%|███▊      | 284/733 [01:38<02:41,  2.78it/s][A
epoch 2 iter 284: train loss 0.31489. lr 1.279255e-04:  39%|███▉      | 285/733 [01:38<02:42,  2.76it/s][A
epoch 2 iter 285: train loss 0.31892. lr 1.273990e-04:  39%|███▉      | 285/733 [01:38<02:42,  2.76it/s][A
epoch 2 iter 285: train loss 0.31892. lr 1.273990e-04:  39%|███▉      | 286/733 [01:38<02:42,  2.76it/s][A
epoch 2 iter 286: train loss 0.30826. lr 1.268733e-04:  39%|███▉      | 286/733 [01:38<02:42,  2.76it/s][A
epoch 2 iter 286: train loss 0.30826. lr 1.268733e-04:  39%|███▉      | 287/733 [01:38<02:41,  2.76it/s][A
epoch 2 iter 287: train loss 0.31082. lr 1.263484e-04:  39%|███▉      | 287/733 [01:39<02:41,  2.76it/s][A
epoch 2 iter 287: train loss 0.31082. lr 1.263484e-04:  39%|███▉      | 288/733 [01:39<02:38,  2.81it/s][A

epoch 2 iter 321: train loss 0.31046. lr 1.089927e-04:  44%|████▍     | 321/733 [01:51<02:32,  2.70it/s][A
epoch 2 iter 321: train loss 0.31046. lr 1.089927e-04:  44%|████▍     | 322/733 [01:51<02:28,  2.76it/s][A
epoch 2 iter 322: train loss 0.30004. lr 1.084971e-04:  44%|████▍     | 322/733 [01:51<02:28,  2.76it/s][A
epoch 2 iter 322: train loss 0.30004. lr 1.084971e-04:  44%|████▍     | 323/733 [01:51<02:26,  2.79it/s][A
epoch 2 iter 323: train loss 0.30154. lr 1.080024e-04:  44%|████▍     | 323/733 [01:52<02:26,  2.79it/s][A
epoch 2 iter 323: train loss 0.30154. lr 1.080024e-04:  44%|████▍     | 324/733 [01:52<02:26,  2.80it/s][A
epoch 2 iter 324: train loss 0.30734. lr 1.075087e-04:  44%|████▍     | 324/733 [01:52<02:26,  2.80it/s][A
epoch 2 iter 324: train loss 0.30734. lr 1.075087e-04:  44%|████▍     | 325/733 [01:52<02:25,  2.81it/s][A
epoch 2 iter 325: train loss 0.30342. lr 1.070158e-04:  44%|████▍     | 325/733 [01:52<02:25,  2.81it/s][A

epoch 2 iter 358: train loss 0.28419. lr 9.126120e-05:  49%|████▉     | 359/733 [02:04<02:12,  2.81it/s][A
epoch 2 iter 359: train loss 0.29406. lr 9.079971e-05:  49%|████▉     | 359/733 [02:05<02:12,  2.81it/s][A
epoch 2 iter 359: train loss 0.29406. lr 9.079971e-05:  49%|████▉     | 360/733 [02:05<02:12,  2.82it/s][A
epoch 2 iter 360: train loss 0.29924. lr 9.033918e-05:  49%|████▉     | 360/733 [02:05<02:12,  2.82it/s][A
epoch 2 iter 360: train loss 0.29924. lr 9.033918e-05:  49%|████▉     | 361/733 [02:05<02:11,  2.83it/s][A
epoch 2 iter 361: train loss 0.29521. lr 8.987962e-05:  49%|████▉     | 361/733 [02:05<02:11,  2.83it/s][A
epoch 2 iter 361: train loss 0.29521. lr 8.987962e-05:  49%|████▉     | 362/733 [02:05<02:11,  2.82it/s][A
epoch 2 iter 362: train loss 0.29497. lr 8.942102e-05:  49%|████▉     | 362/733 [02:06<02:11,  2.82it/s][A
epoch 2 iter 362: train loss 0.29497. lr 8.942102e-05:  50%|████▉     | 363/733 [02:06<02:10,  2.84it/s][A

epoch 2 iter 396: train loss 0.27817. lr 7.441810e-05:  54%|█████▍    | 396/733 [02:18<02:11,  2.56it/s][A
epoch 2 iter 396: train loss 0.27817. lr 7.441810e-05:  54%|█████▍    | 397/733 [02:18<02:11,  2.55it/s][A
epoch 2 iter 397: train loss 0.27876. lr 7.399460e-05:  54%|█████▍    | 397/733 [02:19<02:11,  2.55it/s][A
epoch 2 iter 397: train loss 0.27876. lr 7.399460e-05:  54%|█████▍    | 398/733 [02:19<02:10,  2.56it/s][A
epoch 2 iter 398: train loss 0.27693. lr 7.357214e-05:  54%|█████▍    | 398/733 [02:19<02:10,  2.56it/s][A
epoch 2 iter 398: train loss 0.27693. lr 7.357214e-05:  54%|█████▍    | 399/733 [02:19<02:09,  2.58it/s][A
epoch 2 iter 399: train loss 0.27880. lr 7.315073e-05:  54%|█████▍    | 399/733 [02:19<02:09,  2.58it/s][A
epoch 2 iter 399: train loss 0.27880. lr 7.315073e-05:  55%|█████▍    | 400/733 [02:19<02:06,  2.63it/s][A
epoch 2 iter 400: train loss 0.27956. lr 7.273035e-05:  55%|█████▍    | 400/733 [02:20<02:06,  2.63it/s][A

epoch 2 iter 433: train loss 0.27731. lr 6.000000e-05:  59%|█████▉    | 434/733 [02:32<01:47,  2.77it/s][A
epoch 2 iter 434: train loss 0.27742. lr 6.000000e-05:  59%|█████▉    | 434/733 [02:33<01:47,  2.77it/s][A
epoch 2 iter 434: train loss 0.27742. lr 6.000000e-05:  59%|█████▉    | 435/733 [02:33<01:47,  2.78it/s][A
epoch 2 iter 435: train loss 0.27716. lr 6.000000e-05:  59%|█████▉    | 435/733 [02:33<01:47,  2.78it/s][A
epoch 2 iter 435: train loss 0.27716. lr 6.000000e-05:  59%|█████▉    | 436/733 [02:33<01:46,  2.78it/s][A
epoch 2 iter 436: train loss 0.26362. lr 6.000000e-05:  59%|█████▉    | 436/733 [02:33<01:46,  2.78it/s][A
epoch 2 iter 436: train loss 0.26362. lr 6.000000e-05:  60%|█████▉    | 437/733 [02:33<01:45,  2.81it/s][A
epoch 2 iter 437: train loss 0.26765. lr 6.000000e-05:  60%|█████▉    | 437/733 [02:34<01:45,  2.81it/s][A
epoch 2 iter 437: train loss 0.26765. lr 6.000000e-05:  60%|█████▉    | 438/733 [02:34<01:45,  2.80it/s][A

epoch 2 iter 471: train loss 0.27218. lr 6.000000e-05:  64%|██████▍   | 471/733 [02:46<01:34,  2.77it/s][A
epoch 2 iter 471: train loss 0.27218. lr 6.000000e-05:  64%|██████▍   | 472/733 [02:46<01:34,  2.75it/s][A
epoch 2 iter 472: train loss 0.26467. lr 6.000000e-05:  64%|██████▍   | 472/733 [02:46<01:34,  2.75it/s][A
epoch 2 iter 472: train loss 0.26467. lr 6.000000e-05:  65%|██████▍   | 473/733 [02:46<01:34,  2.75it/s][A
epoch 2 iter 473: train loss 0.27466. lr 6.000000e-05:  65%|██████▍   | 473/733 [02:47<01:34,  2.75it/s][A
epoch 2 iter 473: train loss 0.27466. lr 6.000000e-05:  65%|██████▍   | 474/733 [02:47<01:34,  2.74it/s][A
epoch 2 iter 474: train loss 0.25770. lr 6.000000e-05:  65%|██████▍   | 474/733 [02:47<01:34,  2.74it/s][A
epoch 2 iter 474: train loss 0.25770. lr 6.000000e-05:  65%|██████▍   | 475/733 [02:47<01:33,  2.76it/s][A
epoch 2 iter 475: train loss 0.26782. lr 6.000000e-05:  65%|██████▍   | 475/733 [02:47<01:33,  2.76it/s][A

epoch 2 iter 508: train loss 0.27176. lr 6.000000e-05:  69%|██████▉   | 509/733 [03:00<01:21,  2.73it/s][A
epoch 2 iter 509: train loss 0.26183. lr 6.000000e-05:  69%|██████▉   | 509/733 [03:00<01:21,  2.73it/s][A
epoch 2 iter 509: train loss 0.26183. lr 6.000000e-05:  70%|██████▉   | 510/733 [03:00<01:24,  2.64it/s][A
epoch 2 iter 510: train loss 0.26573. lr 6.000000e-05:  70%|██████▉   | 510/733 [03:00<01:24,  2.64it/s][A
epoch 2 iter 510: train loss 0.26573. lr 6.000000e-05:  70%|██████▉   | 511/733 [03:00<01:26,  2.58it/s][A
epoch 2 iter 511: train loss 0.26236. lr 6.000000e-05:  70%|██████▉   | 511/733 [03:01<01:26,  2.58it/s][A
epoch 2 iter 511: train loss 0.26236. lr 6.000000e-05:  70%|██████▉   | 512/733 [03:01<01:27,  2.53it/s][A
epoch 2 iter 512: train loss 0.26406. lr 6.000000e-05:  70%|██████▉   | 512/733 [03:01<01:27,  2.53it/s][A
epoch 2 iter 512: train loss 0.26406. lr 6.000000e-05:  70%|██████▉   | 513/733 [03:01<01:27,  2.53it/s][A

epoch 2 iter 546: train loss 0.25698. lr 6.000000e-05:  74%|███████▍  | 546/733 [03:14<01:09,  2.70it/s][A
epoch 2 iter 546: train loss 0.25698. lr 6.000000e-05:  75%|███████▍  | 547/733 [03:14<01:08,  2.72it/s][A
epoch 2 iter 547: train loss 0.25673. lr 6.000000e-05:  75%|███████▍  | 547/733 [03:14<01:08,  2.72it/s][A
epoch 2 iter 547: train loss 0.25673. lr 6.000000e-05:  75%|███████▍  | 548/733 [03:14<01:07,  2.74it/s][A
epoch 2 iter 548: train loss 0.25007. lr 6.000000e-05:  75%|███████▍  | 548/733 [03:15<01:07,  2.74it/s][A
epoch 2 iter 548: train loss 0.25007. lr 6.000000e-05:  75%|███████▍  | 549/733 [03:15<01:06,  2.77it/s][A
epoch 2 iter 549: train loss 0.25906. lr 6.000000e-05:  75%|███████▍  | 549/733 [03:15<01:06,  2.77it/s][A
epoch 2 iter 549: train loss 0.25906. lr 6.000000e-05:  75%|███████▌  | 550/733 [03:15<01:05,  2.78it/s][A
epoch 2 iter 550: train loss 0.24864. lr 6.000000e-05:  75%|███████▌  | 550/733 [03:15<01:05,  2.78it/s][A

epoch 2 iter 583: train loss 0.24556. lr 6.000000e-05:  80%|███████▉  | 584/733 [03:28<00:52,  2.82it/s][A
epoch 2 iter 584: train loss 0.26220. lr 6.000000e-05:  80%|███████▉  | 584/733 [03:29<00:52,  2.82it/s][A
epoch 2 iter 584: train loss 0.26220. lr 6.000000e-05:  80%|███████▉  | 585/733 [03:29<00:52,  2.80it/s][A
epoch 2 iter 585: train loss 0.24874. lr 6.000000e-05:  80%|███████▉  | 585/733 [03:29<00:52,  2.80it/s][A
epoch 2 iter 585: train loss 0.24874. lr 6.000000e-05:  80%|███████▉  | 586/733 [03:29<00:52,  2.81it/s][A
epoch 2 iter 586: train loss 0.25058. lr 6.000000e-05:  80%|███████▉  | 586/733 [03:29<00:52,  2.81it/s][A
epoch 2 iter 586: train loss 0.25058. lr 6.000000e-05:  80%|████████  | 587/733 [03:29<00:51,  2.82it/s][A
epoch 2 iter 587: train loss 0.25003. lr 6.000000e-05:  80%|████████  | 587/733 [03:30<00:51,  2.82it/s][A
epoch 2 iter 587: train loss 0.25003. lr 6.000000e-05:  80%|████████  | 588/733 [03:30<00:51,  2.80it/s][A

epoch 2 iter 621: train loss 0.24468. lr 6.000000e-05:  85%|████████▍ | 621/733 [03:43<00:45,  2.46it/s][A
epoch 2 iter 621: train loss 0.24468. lr 6.000000e-05:  85%|████████▍ | 622/733 [03:43<00:43,  2.58it/s][A
epoch 2 iter 622: train loss 0.24611. lr 6.000000e-05:  85%|████████▍ | 622/733 [03:43<00:43,  2.58it/s][A
epoch 2 iter 622: train loss 0.24611. lr 6.000000e-05:  85%|████████▍ | 623/733 [03:43<00:41,  2.66it/s][A
epoch 2 iter 623: train loss 0.24085. lr 6.000000e-05:  85%|████████▍ | 623/733 [03:44<00:41,  2.66it/s][A
epoch 2 iter 623: train loss 0.24085. lr 6.000000e-05:  85%|████████▌ | 624/733 [03:44<00:40,  2.72it/s][A
epoch 2 iter 624: train loss 0.25313. lr 6.000000e-05:  85%|████████▌ | 624/733 [03:44<00:40,  2.72it/s][A
epoch 2 iter 624: train loss 0.25313. lr 6.000000e-05:  85%|████████▌ | 625/733 [03:44<00:39,  2.74it/s][A
epoch 2 iter 625: train loss 0.24208. lr 6.000000e-05:  85%|████████▌ | 625/733 [03:44<00:39,  2.74it/s][A

epoch 2 iter 658: train loss 0.24501. lr 6.000000e-05:  90%|████████▉ | 659/733 [03:56<00:26,  2.78it/s][A
epoch 2 iter 659: train loss 0.24502. lr 6.000000e-05:  90%|████████▉ | 659/733 [03:57<00:26,  2.78it/s][A
epoch 2 iter 659: train loss 0.24502. lr 6.000000e-05:  90%|█████████ | 660/733 [03:57<00:26,  2.79it/s][A
epoch 2 iter 660: train loss 0.24080. lr 6.000000e-05:  90%|█████████ | 660/733 [03:57<00:26,  2.79it/s][A
epoch 2 iter 660: train loss 0.24080. lr 6.000000e-05:  90%|█████████ | 661/733 [03:57<00:25,  2.81it/s][A
epoch 2 iter 661: train loss 0.25208. lr 6.000000e-05:  90%|█████████ | 661/733 [03:57<00:25,  2.81it/s][A
epoch 2 iter 661: train loss 0.25208. lr 6.000000e-05:  90%|█████████ | 662/733 [03:57<00:25,  2.82it/s][A
epoch 2 iter 662: train loss 0.24755. lr 6.000000e-05:  90%|█████████ | 662/733 [03:58<00:25,  2.82it/s][A
epoch 2 iter 662: train loss 0.24755. lr 6.000000e-05:  90%|█████████ | 663/733 [03:58<00:24,  2.83it/s][A

epoch 2 iter 696: train loss 0.24171. lr 6.000000e-05:  95%|█████████▍| 696/733 [04:10<00:13,  2.75it/s][A
epoch 2 iter 696: train loss 0.24171. lr 6.000000e-05:  95%|█████████▌| 697/733 [04:10<00:12,  2.77it/s][A
epoch 2 iter 697: train loss 0.24711. lr 6.000000e-05:  95%|█████████▌| 697/733 [04:11<00:12,  2.77it/s][A
epoch 2 iter 697: train loss 0.24711. lr 6.000000e-05:  95%|█████████▌| 698/733 [04:11<00:12,  2.79it/s][A
epoch 2 iter 698: train loss 0.23781. lr 6.000000e-05:  95%|█████████▌| 698/733 [04:11<00:12,  2.79it/s][A
epoch 2 iter 698: train loss 0.23781. lr 6.000000e-05:  95%|█████████▌| 699/733 [04:11<00:12,  2.79it/s][A
epoch 2 iter 699: train loss 0.24924. lr 6.000000e-05:  95%|█████████▌| 699/733 [04:11<00:12,  2.79it/s][A
epoch 2 iter 699: train loss 0.24924. lr 6.000000e-05:  95%|█████████▌| 700/733 [04:11<00:11,  2.79it/s][A
epoch 2 iter 700: train loss 0.24313. lr 6.000000e-05:  95%|█████████▌| 700/733 [04:12<00:11,  2.79it/s][A

data has 132836 characters, 7481 unique.
dataset is read
model is built



  0%|          | 0/519 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 9.00724. lr 5.999990e-04:   0%|          | 0/519 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 9.00724. lr 5.999990e-04:   0%|          | 1/519 [00:00<03:24,  2.54it/s][A
epoch 1 iter 1: train loss 8.27243. lr 5.999953e-04:   0%|          | 1/519 [00:00<03:24,  2.54it/s][A
epoch 1 iter 1: train loss 8.27243. lr 5.999953e-04:   0%|          | 2/519 [00:00<03:07,  2.76it/s][A
epoch 1 iter 2: train loss 7.83216. lr 5.999889e-04:   0%|          | 2/519 [00:00<03:07,  2.76it/s][A
epoch 1 iter 2: train loss 7.83216. lr 5.999889e-04:   1%|          | 3/519 [00:00<02:55,  2.93it/s][A
epoch 1 iter 3: train loss 7.59455. lr 5.999796e-04:   1%|          | 3/519 [00:01<02:55,  2.93it/s][A
epoch 1 iter 3: train loss 7.59455. lr 5.999796e-04:   1%|          | 4/519 [00:01<02:48,  3.06it/s][A
epoch 1 iter 4: train loss 7.36303. lr 5.999677e-04:   1%|          | 4/519 [00:01<02:48,  3.06it/s][A

epoch 1 iter 38: train loss 5.90087. lr 5.979237e-04:   8%|▊         | 39/519 [00:11<02:24,  3.31it/s][A
epoch 1 iter 39: train loss 5.88384. lr 5.978156e-04:   8%|▊         | 39/519 [00:12<02:24,  3.31it/s][A
epoch 1 iter 39: train loss 5.88384. lr 5.978156e-04:   8%|▊         | 40/519 [00:12<02:23,  3.35it/s][A
epoch 1 iter 40: train loss 5.90375. lr 5.977047e-04:   8%|▊         | 40/519 [00:12<02:23,  3.35it/s][A
epoch 1 iter 40: train loss 5.90375. lr 5.977047e-04:   8%|▊         | 41/519 [00:12<02:23,  3.32it/s][A
epoch 1 iter 41: train loss 5.89780. lr 5.975911e-04:   8%|▊         | 41/519 [00:12<02:23,  3.32it/s][A
epoch 1 iter 41: train loss 5.89780. lr 5.975911e-04:   8%|▊         | 42/519 [00:12<02:23,  3.32it/s][A
epoch 1 iter 42: train loss 5.80916. lr 5.974747e-04:   8%|▊         | 42/519 [00:12<02:23,  3.32it/s][A
epoch 1 iter 42: train loss 5.80916. lr 5.974747e-04:   8%|▊         | 43/519 [00:12<02:23,  3.32it/s][A
epoch 1 iter 43: train loss 5.77009. lr 5.9735

epoch 1 iter 77: train loss 4.91445. lr 5.916904e-04:  15%|█▍        | 77/519 [00:23<02:14,  3.28it/s][A
epoch 1 iter 77: train loss 4.91445. lr 5.916904e-04:  15%|█▌        | 78/519 [00:23<02:14,  3.28it/s][A
epoch 1 iter 78: train loss 4.84954. lr 5.914766e-04:  15%|█▌        | 78/519 [00:23<02:14,  3.28it/s][A
epoch 1 iter 78: train loss 4.84954. lr 5.914766e-04:  15%|█▌        | 79/519 [00:23<02:13,  3.28it/s][A
epoch 1 iter 79: train loss 4.80717. lr 5.912600e-04:  15%|█▌        | 79/519 [00:24<02:13,  3.28it/s][A
epoch 1 iter 79: train loss 4.80717. lr 5.912600e-04:  15%|█▌        | 80/519 [00:24<02:13,  3.29it/s][A
epoch 1 iter 80: train loss 4.79208. lr 5.910408e-04:  15%|█▌        | 80/519 [00:24<02:13,  3.29it/s][A
epoch 1 iter 80: train loss 4.79208. lr 5.910408e-04:  16%|█▌        | 81/519 [00:24<02:13,  3.28it/s][A
epoch 1 iter 81: train loss 4.74013. lr 5.908190e-04:  16%|█▌        | 81/519 [00:24<02:13,  3.28it/s][A

epoch 1 iter 115: train loss 4.17481. lr 5.817010e-04:  22%|██▏       | 115/519 [00:35<02:03,  3.27it/s][A
epoch 1 iter 115: train loss 4.17481. lr 5.817010e-04:  22%|██▏       | 116/519 [00:35<02:03,  3.27it/s][A
epoch 1 iter 116: train loss 4.20839. lr 5.813870e-04:  22%|██▏       | 116/519 [00:35<02:03,  3.27it/s][A
epoch 1 iter 116: train loss 4.20839. lr 5.813870e-04:  23%|██▎       | 117/519 [00:35<02:02,  3.27it/s][A
epoch 1 iter 117: train loss 4.16306. lr 5.810705e-04:  23%|██▎       | 117/519 [00:36<02:02,  3.27it/s][A
epoch 1 iter 117: train loss 4.16306. lr 5.810705e-04:  23%|██▎       | 118/519 [00:36<02:02,  3.28it/s][A
epoch 1 iter 118: train loss 4.21013. lr 5.807513e-04:  23%|██▎       | 118/519 [00:36<02:02,  3.28it/s][A
epoch 1 iter 118: train loss 4.21013. lr 5.807513e-04:  23%|██▎       | 119/519 [00:36<02:01,  3.29it/s][A
epoch 1 iter 119: train loss 4.16049. lr 5.804296e-04:  23%|██▎       | 119/519 [00:36<02:01,  3.29it/s][A

epoch 1 iter 152: train loss 3.64268. lr 5.683871e-04:  29%|██▉       | 153/519 [00:46<01:54,  3.21it/s][A
epoch 1 iter 153: train loss 3.60862. lr 5.679797e-04:  29%|██▉       | 153/519 [00:47<01:54,  3.21it/s][A
epoch 1 iter 153: train loss 3.60862. lr 5.679797e-04:  30%|██▉       | 154/519 [00:47<01:52,  3.24it/s][A
epoch 1 iter 154: train loss 3.63709. lr 5.675697e-04:  30%|██▉       | 154/519 [00:47<01:52,  3.24it/s][A
epoch 1 iter 154: train loss 3.63709. lr 5.675697e-04:  30%|██▉       | 155/519 [00:47<01:52,  3.25it/s][A
epoch 1 iter 155: train loss 3.51663. lr 5.671573e-04:  30%|██▉       | 155/519 [00:47<01:52,  3.25it/s][A
epoch 1 iter 155: train loss 3.51663. lr 5.671573e-04:  30%|███       | 156/519 [00:47<01:51,  3.26it/s][A
epoch 1 iter 156: train loss 3.52096. lr 5.667425e-04:  30%|███       | 156/519 [00:48<01:51,  3.26it/s][A
epoch 1 iter 156: train loss 3.52096. lr 5.667425e-04:  30%|███       | 157/519 [00:48<01:51,  3.26it/s][A

epoch 1 iter 190: train loss 3.02283. lr 5.512064e-04:  37%|███▋      | 190/519 [00:58<01:40,  3.27it/s][A
epoch 1 iter 190: train loss 3.02283. lr 5.512064e-04:  37%|███▋      | 191/519 [00:58<01:40,  3.27it/s][A
epoch 1 iter 191: train loss 2.99663. lr 5.507082e-04:  37%|███▋      | 191/519 [00:59<01:40,  3.27it/s][A
epoch 1 iter 191: train loss 2.99663. lr 5.507082e-04:  37%|███▋      | 192/519 [00:59<01:40,  3.26it/s][A
epoch 1 iter 192: train loss 2.95822. lr 5.502077e-04:  37%|███▋      | 192/519 [00:59<01:40,  3.26it/s][A
epoch 1 iter 192: train loss 2.95822. lr 5.502077e-04:  37%|███▋      | 193/519 [00:59<01:40,  3.25it/s][A
epoch 1 iter 193: train loss 2.93416. lr 5.497050e-04:  37%|███▋      | 193/519 [00:59<01:40,  3.25it/s][A
epoch 1 iter 193: train loss 2.93416. lr 5.497050e-04:  37%|███▋      | 194/519 [00:59<01:40,  3.24it/s][A
epoch 1 iter 194: train loss 2.95521. lr 5.491999e-04:  37%|███▋      | 194/519 [00:59<01:40,  3.24it/s][A

epoch 1 iter 227: train loss 2.45607. lr 5.312778e-04:  44%|████▍     | 228/519 [01:10<01:45,  2.75it/s][A
epoch 1 iter 228: train loss 2.45500. lr 5.306977e-04:  44%|████▍     | 228/519 [01:10<01:45,  2.75it/s][A
epoch 1 iter 228: train loss 2.45500. lr 5.306977e-04:  44%|████▍     | 229/519 [01:10<01:43,  2.79it/s][A
epoch 1 iter 229: train loss 2.35604. lr 5.301154e-04:  44%|████▍     | 229/519 [01:11<01:43,  2.79it/s][A
epoch 1 iter 229: train loss 2.35604. lr 5.301154e-04:  44%|████▍     | 230/519 [01:11<01:43,  2.80it/s][A
epoch 1 iter 230: train loss 2.40034. lr 5.295310e-04:  44%|████▍     | 230/519 [01:11<01:43,  2.80it/s][A
epoch 1 iter 230: train loss 2.40034. lr 5.295310e-04:  45%|████▍     | 231/519 [01:11<01:41,  2.83it/s][A
epoch 1 iter 231: train loss 2.41566. lr 5.289445e-04:  45%|████▍     | 231/519 [01:11<01:41,  2.83it/s][A
epoch 1 iter 231: train loss 2.41566. lr 5.289445e-04:  45%|████▍     | 232/519 [01:11<01:39,  2.88it/s][A

epoch 1 iter 265: train loss 1.93514. lr 5.077895e-04:  51%|█████     | 265/519 [01:22<01:17,  3.26it/s][A
epoch 1 iter 265: train loss 1.93514. lr 5.077895e-04:  51%|█████▏    | 266/519 [01:22<01:17,  3.28it/s][A
epoch 1 iter 266: train loss 1.92966. lr 5.071327e-04:  51%|█████▏    | 266/519 [01:22<01:17,  3.28it/s][A
epoch 1 iter 266: train loss 1.92966. lr 5.071327e-04:  51%|█████▏    | 267/519 [01:22<01:16,  3.27it/s][A
epoch 1 iter 267: train loss 1.87804. lr 5.064741e-04:  51%|█████▏    | 267/519 [01:23<01:16,  3.27it/s][A
epoch 1 iter 267: train loss 1.87804. lr 5.064741e-04:  52%|█████▏    | 268/519 [01:23<01:16,  3.27it/s][A
epoch 1 iter 268: train loss 1.85908. lr 5.058135e-04:  52%|█████▏    | 268/519 [01:23<01:16,  3.27it/s][A
epoch 1 iter 268: train loss 1.85908. lr 5.058135e-04:  52%|█████▏    | 269/519 [01:23<01:16,  3.26it/s][A
epoch 1 iter 269: train loss 1.87205. lr 5.051511e-04:  52%|█████▏    | 269/519 [01:23<01:16,  3.26it/s][A

epoch 1 iter 302: train loss 1.40430. lr 4.822713e-04:  58%|█████▊    | 303/519 [01:34<01:10,  3.06it/s][A
epoch 1 iter 303: train loss 1.40775. lr 4.815484e-04:  58%|█████▊    | 303/519 [01:34<01:10,  3.06it/s][A
epoch 1 iter 303: train loss 1.40775. lr 4.815484e-04:  59%|█████▊    | 304/519 [01:34<01:09,  3.10it/s][A
epoch 1 iter 304: train loss 1.40764. lr 4.808237e-04:  59%|█████▊    | 304/519 [01:34<01:09,  3.10it/s][A
epoch 1 iter 304: train loss 1.40764. lr 4.808237e-04:  59%|█████▉    | 305/519 [01:34<01:08,  3.13it/s][A
epoch 1 iter 305: train loss 1.43514. lr 4.800974e-04:  59%|█████▉    | 305/519 [01:35<01:08,  3.13it/s][A
epoch 1 iter 305: train loss 1.43514. lr 4.800974e-04:  59%|█████▉    | 306/519 [01:35<01:07,  3.15it/s][A
epoch 1 iter 306: train loss 1.41231. lr 4.793695e-04:  59%|█████▉    | 306/519 [01:35<01:07,  3.15it/s][A
epoch 1 iter 306: train loss 1.41231. lr 4.793695e-04:  59%|█████▉    | 307/519 [01:35<01:06,  3.18it/s][A

epoch 1 iter 340: train loss 1.05236. lr 4.536837e-04:  66%|██████▌   | 340/519 [01:46<01:03,  2.80it/s][A
epoch 1 iter 340: train loss 1.05236. lr 4.536837e-04:  66%|██████▌   | 341/519 [01:46<01:02,  2.86it/s][A
epoch 1 iter 341: train loss 1.02113. lr 4.529022e-04:  66%|██████▌   | 341/519 [01:46<01:02,  2.86it/s][A
epoch 1 iter 341: train loss 1.02113. lr 4.529022e-04:  66%|██████▌   | 342/519 [01:46<01:04,  2.75it/s][A
epoch 1 iter 342: train loss 1.01518. lr 4.521192e-04:  66%|██████▌   | 342/519 [01:47<01:04,  2.75it/s][A
epoch 1 iter 342: train loss 1.01518. lr 4.521192e-04:  66%|██████▌   | 343/519 [01:47<01:00,  2.93it/s][A
epoch 1 iter 343: train loss 1.01962. lr 4.513349e-04:  66%|██████▌   | 343/519 [01:47<01:00,  2.93it/s][A
epoch 1 iter 343: train loss 1.01962. lr 4.513349e-04:  66%|██████▋   | 344/519 [01:47<00:57,  3.04it/s][A
epoch 1 iter 344: train loss 0.98892. lr 4.505492e-04:  66%|██████▋   | 344/519 [01:47<00:57,  3.04it/s][A

epoch 1 iter 377: train loss 0.76528. lr 4.238887e-04:  73%|███████▎  | 378/519 [01:58<00:48,  2.90it/s][A
epoch 1 iter 378: train loss 0.75450. lr 4.230601e-04:  73%|███████▎  | 378/519 [01:58<00:48,  2.90it/s][A
epoch 1 iter 378: train loss 0.75450. lr 4.230601e-04:  73%|███████▎  | 379/519 [01:58<00:47,  2.95it/s][A
epoch 1 iter 379: train loss 0.75241. lr 4.222303e-04:  73%|███████▎  | 379/519 [01:58<00:47,  2.95it/s][A
epoch 1 iter 379: train loss 0.75241. lr 4.222303e-04:  73%|███████▎  | 380/519 [01:58<00:46,  3.00it/s][A
epoch 1 iter 380: train loss 0.74520. lr 4.213995e-04:  73%|███████▎  | 380/519 [01:59<00:46,  3.00it/s][A
epoch 1 iter 380: train loss 0.74520. lr 4.213995e-04:  73%|███████▎  | 381/519 [01:59<00:45,  3.05it/s][A
epoch 1 iter 381: train loss 0.76368. lr 4.205675e-04:  73%|███████▎  | 381/519 [01:59<00:45,  3.05it/s][A
epoch 1 iter 381: train loss 0.76368. lr 4.205675e-04:  74%|███████▎  | 382/519 [01:59<00:44,  3.09it/s][A

epoch 1 iter 415: train loss 0.56588. lr 3.916723e-04:  80%|███████▉  | 415/519 [02:09<00:31,  3.25it/s][A
epoch 1 iter 415: train loss 0.56588. lr 3.916723e-04:  80%|████████  | 416/519 [02:09<00:31,  3.25it/s][A
epoch 1 iter 416: train loss 0.58249. lr 3.908062e-04:  80%|████████  | 416/519 [02:10<00:31,  3.25it/s][A
epoch 1 iter 416: train loss 0.58249. lr 3.908062e-04:  80%|████████  | 417/519 [02:10<00:31,  3.26it/s][A
epoch 1 iter 417: train loss 0.56425. lr 3.899392e-04:  80%|████████  | 417/519 [02:10<00:31,  3.26it/s][A
epoch 1 iter 417: train loss 0.56425. lr 3.899392e-04:  81%|████████  | 418/519 [02:10<00:30,  3.26it/s][A
epoch 1 iter 418: train loss 0.56716. lr 3.890715e-04:  81%|████████  | 418/519 [02:10<00:30,  3.26it/s][A
epoch 1 iter 418: train loss 0.56716. lr 3.890715e-04:  81%|████████  | 419/519 [02:10<00:30,  3.26it/s][A
epoch 1 iter 419: train loss 0.56724. lr 3.882029e-04:  81%|████████  | 419/519 [02:11<00:30,  3.26it/s][A

epoch 1 iter 452: train loss 0.47762. lr 3.591331e-04:  87%|████████▋ | 453/519 [02:22<00:20,  3.27it/s][A
epoch 1 iter 453: train loss 0.47122. lr 3.582414e-04:  87%|████████▋ | 453/519 [02:22<00:20,  3.27it/s][A
epoch 1 iter 453: train loss 0.47122. lr 3.582414e-04:  87%|████████▋ | 454/519 [02:22<00:19,  3.30it/s][A
epoch 1 iter 454: train loss 0.46054. lr 3.573493e-04:  87%|████████▋ | 454/519 [02:22<00:19,  3.30it/s][A
epoch 1 iter 454: train loss 0.46054. lr 3.573493e-04:  88%|████████▊ | 455/519 [02:22<00:19,  3.29it/s][A
epoch 1 iter 455: train loss 0.45366. lr 3.564566e-04:  88%|████████▊ | 455/519 [02:22<00:19,  3.29it/s][A
epoch 1 iter 455: train loss 0.45366. lr 3.564566e-04:  88%|████████▊ | 456/519 [02:22<00:19,  3.30it/s][A
epoch 1 iter 456: train loss 0.45417. lr 3.555634e-04:  88%|████████▊ | 456/519 [02:23<00:19,  3.30it/s][A
epoch 1 iter 456: train loss 0.45417. lr 3.555634e-04:  88%|████████▊ | 457/519 [02:23<00:18,  3.29it/s][A

epoch 1 iter 490: train loss 0.39433. lr 3.249452e-04:  94%|█████████▍| 490/519 [02:33<00:08,  3.34it/s][A
epoch 1 iter 490: train loss 0.39433. lr 3.249452e-04:  95%|█████████▍| 491/519 [02:33<00:08,  3.34it/s][A
epoch 1 iter 491: train loss 0.38631. lr 3.240390e-04:  95%|█████████▍| 491/519 [02:34<00:08,  3.34it/s][A
epoch 1 iter 491: train loss 0.38631. lr 3.240390e-04:  95%|█████████▍| 492/519 [02:34<00:08,  3.34it/s][A
epoch 1 iter 492: train loss 0.40124. lr 3.231327e-04:  95%|█████████▍| 492/519 [02:34<00:08,  3.34it/s][A
epoch 1 iter 492: train loss 0.40124. lr 3.231327e-04:  95%|█████████▍| 493/519 [02:34<00:07,  3.31it/s][A
epoch 1 iter 493: train loss 0.39082. lr 3.222261e-04:  95%|█████████▍| 493/519 [02:34<00:07,  3.31it/s][A
epoch 1 iter 493: train loss 0.39082. lr 3.222261e-04:  95%|█████████▌| 494/519 [02:34<00:07,  3.31it/s][A
epoch 1 iter 494: train loss 0.37188. lr 3.213193e-04:  95%|█████████▌| 494/519 [02:34<00:07,  3.31it/s][A

epoch 2 iter 9: train loss 0.34244. lr 2.909806e-04:   2%|▏         | 9/519 [00:03<02:40,  3.17it/s][A
epoch 2 iter 9: train loss 0.34244. lr 2.909806e-04:   2%|▏         | 10/519 [00:03<02:40,  3.17it/s][A
epoch 2 iter 10: train loss 0.33854. lr 2.900719e-04:   2%|▏         | 10/519 [00:03<02:40,  3.17it/s][A
epoch 2 iter 10: train loss 0.33854. lr 2.900719e-04:   2%|▏         | 11/519 [00:03<02:39,  3.18it/s][A
epoch 2 iter 11: train loss 0.32341. lr 2.891632e-04:   2%|▏         | 11/519 [00:04<02:39,  3.18it/s][A
epoch 2 iter 11: train loss 0.32341. lr 2.891632e-04:   2%|▏         | 12/519 [00:04<03:08,  2.69it/s][A
epoch 2 iter 12: train loss 0.33871. lr 2.882547e-04:   2%|▏         | 12/519 [00:04<03:08,  2.69it/s][A
epoch 2 iter 12: train loss 0.33871. lr 2.882547e-04:   3%|▎         | 13/519 [00:04<03:01,  2.79it/s][A
epoch 2 iter 13: train loss 0.32079. lr 2.873463e-04:   3%|▎         | 13/519 [00:04<03:01,  2.79it/s][A

epoch 2 iter 47: train loss 0.29758. lr 2.565834e-04:   9%|▉         | 48/519 [00:15<02:25,  3.25it/s][A
epoch 2 iter 48: train loss 0.29621. lr 2.556840e-04:   9%|▉         | 48/519 [00:15<02:25,  3.25it/s][A
epoch 2 iter 48: train loss 0.29621. lr 2.556840e-04:   9%|▉         | 49/519 [00:15<02:26,  3.21it/s][A
epoch 2 iter 49: train loss 0.29492. lr 2.547850e-04:   9%|▉         | 49/519 [00:15<02:26,  3.21it/s][A
epoch 2 iter 49: train loss 0.29492. lr 2.547850e-04:  10%|▉         | 50/519 [00:15<02:26,  3.21it/s][A
epoch 2 iter 50: train loss 0.28874. lr 2.538864e-04:  10%|▉         | 50/519 [00:15<02:26,  3.21it/s][A
epoch 2 iter 50: train loss 0.28874. lr 2.538864e-04:  10%|▉         | 51/519 [00:15<02:26,  3.19it/s][A
epoch 2 iter 51: train loss 0.30145. lr 2.529883e-04:  10%|▉         | 51/519 [00:16<02:26,  3.19it/s][A
epoch 2 iter 51: train loss 0.30145. lr 2.529883e-04:  10%|█         | 52/519 [00:16<02:45,  2.82it/s][A
epoch 2 iter 52: train loss 0.30415. lr 2.5209

epoch 2 iter 86: train loss 0.27341. lr 2.218833e-04:  17%|█▋        | 86/519 [00:28<02:31,  2.85it/s][A
epoch 2 iter 86: train loss 0.27341. lr 2.218833e-04:  17%|█▋        | 87/519 [00:28<02:25,  2.97it/s][A
epoch 2 iter 87: train loss 0.27275. lr 2.210058e-04:  17%|█▋        | 87/519 [00:28<02:25,  2.97it/s][A
epoch 2 iter 87: train loss 0.27275. lr 2.210058e-04:  17%|█▋        | 88/519 [00:28<02:21,  3.06it/s][A
epoch 2 iter 88: train loss 0.26738. lr 2.201291e-04:  17%|█▋        | 88/519 [00:29<02:21,  3.06it/s][A
epoch 2 iter 88: train loss 0.26738. lr 2.201291e-04:  17%|█▋        | 89/519 [00:29<02:18,  3.11it/s][A
epoch 2 iter 89: train loss 0.27460. lr 2.192531e-04:  17%|█▋        | 89/519 [00:29<02:18,  3.11it/s][A
epoch 2 iter 89: train loss 0.27460. lr 2.192531e-04:  17%|█▋        | 90/519 [00:29<02:15,  3.17it/s][A
epoch 2 iter 90: train loss 0.26903. lr 2.183778e-04:  17%|█▋        | 90/519 [00:29<02:15,  3.17it/s][A

epoch 2 iter 124: train loss 0.24579. lr 1.891174e-04:  24%|██▍       | 124/519 [00:40<02:00,  3.27it/s][A
epoch 2 iter 124: train loss 0.24579. lr 1.891174e-04:  24%|██▍       | 125/519 [00:40<02:00,  3.28it/s][A
epoch 2 iter 125: train loss 0.24561. lr 1.882731e-04:  24%|██▍       | 125/519 [00:40<02:00,  3.28it/s][A
epoch 2 iter 125: train loss 0.24561. lr 1.882731e-04:  24%|██▍       | 126/519 [00:40<01:59,  3.28it/s][A
epoch 2 iter 126: train loss 0.25942. lr 1.874298e-04:  24%|██▍       | 126/519 [00:40<01:59,  3.28it/s][A
epoch 2 iter 126: train loss 0.25942. lr 1.874298e-04:  24%|██▍       | 127/519 [00:40<01:59,  3.29it/s][A
epoch 2 iter 127: train loss 0.24918. lr 1.865876e-04:  24%|██▍       | 127/519 [00:41<01:59,  3.29it/s][A
epoch 2 iter 127: train loss 0.24918. lr 1.865876e-04:  25%|██▍       | 128/519 [00:41<01:58,  3.30it/s][A
epoch 2 iter 128: train loss 0.24289. lr 1.857464e-04:  25%|██▍       | 128/519 [00:41<01:58,  3.30it/s][A

epoch 2 iter 161: train loss 0.23529. lr 1.586217e-04
epoch 2 iter 162: train loss 0.23324. lr 1.578205e-04
epoch 2 iter 163: train loss 0.22738. lr 1.570205e-04
epoch 2 iter 164: train loss 0.22718. lr 1.562219e-04
epoch 2 iter 165: train loss 0.22867. lr 1.554246e-04

epoch 2 iter 199: train loss 0.21503. lr 1.291537e-04
epoch 2 iter 200: train loss 0.22245. lr 1.284071e-04
epoch 2 iter 201: train loss 0.22145. lr 1.276621e-04
epoch 2 iter 202: train loss 0.23141. lr 1.269187e-04
epoch 2 iter 203: train loss 0.22332. lr 1.261769e-04

epoch 2 iter 236: train loss 0.21515. lr 1.026327e-04
epoch 2 iter 237: train loss 0.21119. lr 1.019489e-04
epoch 2 iter 238: train loss 0.20787. lr 1.012670e-04
epoch 2 iter 239: train loss 0.20832. lr 1.005868e-04
epoch 2 iter 240: train loss 0.19704. lr 9.990846e-05

epoch 2 iter 274: train loss 0.20077. lr 7.797840e-05
epoch 2 iter 275: train loss 0.20524. lr 7.736797e-05
epoch 2 iter 276: train loss 0.20642. lr 7.675959e-05
epoch 2 iter 277: train loss 0.19761. lr 7.615326e-05
epoch 2 iter 278: train loss 0.20530. lr 7.554898e-05

epoch 2 iter 311: train loss 0.19816. lr 6.000000e-05
epoch 2 iter 312: train loss 0.19788. lr 6.000000e-05
epoch 2 iter 313: train loss 0.19156. lr 6.000000e-05
epoch 2 iter 314: train loss 0.19287. lr 6.000000e-05
epoch 2 iter 315: train loss 0.19449. lr 6.000000e-05

epoch 2 iter 349: train loss 0.18288. lr 6.000000e-05
epoch 2 iter 350: train loss 0.18792. lr 6.000000e-05
epoch 2 iter 351: train loss 0.18715. lr 6.000000e-05
epoch 2 iter 352: train loss 0.18583. lr 6.000000e-05
epoch 2 iter 353: train loss 0.18564. lr 6.000000e-05

epoch 2 iter 386: train loss 0.18616. lr 6.000000e-05
epoch 2 iter 387: train loss 0.18512. lr 6.000000e-05
epoch 2 iter 388: train loss 0.18816. lr 6.000000e-05
epoch 2 iter 389: train loss 0.18672. lr 6.000000e-05
epoch 2 iter 390: train loss 0.18339. lr 6.000000e-05

epoch 2 iter 424: train loss 0.17647. lr 6.000000e-05
epoch 2 iter 425: train loss 0.18598. lr 6.000000e-05
epoch 2 iter 426: train loss 0.17723. lr 6.000000e-05
epoch 2 iter 427: train loss 0.18638. lr 6.000000e-05
epoch 2 iter 428: train loss 0.17630. lr 6.000000e-05

epoch 2 iter 461: train loss 0.17483. lr 6.000000e-05
epoch 2 iter 462: train loss 0.18160. lr 6.000000e-05
epoch 2 iter 463: train loss 0.18075. lr 6.000000e-05
epoch 2 iter 464: train loss 0.18077. lr 6.000000e-05
epoch 2 iter 465: train loss 0.17482. lr 6.000000e-05

epoch 2 iter 499: train loss 0.18102. lr 6.000000e-05
epoch 2 iter 500: train loss 0.17681. lr 6.000000e-05
epoch 2 iter 501: train loss 0.17754. lr 6.000000e-05
epoch 2 iter 502: train loss 0.16552. lr 6.000000e-05
epoch 2 iter 503: train loss 0.17769. lr 6.000000e-05

data has 599211 characters, 19797 unique.
dataset is read
model is built



epoch 1 iter 0: train loss 9.99100. lr 6.000000e-04
epoch 1 iter 1: train loss 9.13702. lr 5.999998e-04
epoch 1 iter 2: train loss 8.62509. lr 5.999995e-04
epoch 1 iter 3: train loss 8.31508. lr 5.999990e-04
epoch 1 iter 4: train loss 8.08883. lr 5.999984e-04

epoch 1 iter 38: train loss 6.29030. lr 5.998980e-04
epoch 1 iter 39: train loss 6.24377. lr 5.998927e-04
epoch 1 iter 40: train loss 6.30326. lr 5.998873e-04
epoch 1 iter 41: train loss 6.24488. lr 5.998817e-04
epoch 1 iter 42: train loss 6.13125. lr 5.998759e-04

epoch 1 iter 76: train loss 5.77974. lr 5.996010e-04
epoch 1 iter 77: train loss 5.78054. lr 5.995905e-04
epoch 1 iter 78: train loss 5.74079. lr 5.995800e-04
epoch 1 iter 79: train loss 5.72184. lr 5.995692e-04
epoch 1 iter 80: train loss 5.77848. lr 5.995584e-04

epoch 1 iter 114: train loss 5.35993. lr 5.991090e-04
epoch 1 iter 115: train loss 5.34781. lr 5.990934e-04
epoch 1 iter 116: train loss 5.38877. lr 5.990777e-04
epoch 1 iter 117: train loss 5.32939. lr 5.990619e-04
epoch 1 iter 118: train loss 5.37115. lr 5.990459e-04

epoch 1 iter 151: train loss 5.03829. lr 5.984430e-04
epoch 1 iter 152: train loss 5.04974. lr 5.984225e-04
epoch 1 iter 153: train loss 4.95542. lr 5.984018e-04
epoch 1 iter 154: train loss 4.97487. lr 5.983809e-04
epoch 1 iter 155: train loss 5.04144. lr 5.983600e-04

epoch 1 iter 189: train loss 4.81103. lr 5.975674e-04
epoch 1 iter 190: train loss 4.77670. lr 5.975417e-04
epoch 1 iter 191: train loss 4.76790. lr 5.975159e-04
epoch 1 iter 192: train loss 4.74592. lr 5.974900e-04
epoch 1 iter 193: train loss 4.79670. lr 5.974640e-04

epoch 1 iter 226: train loss 4.59118. lr 5.965288e-04
epoch 1 iter 227: train loss 4.50604. lr 5.964982e-04
epoch 1 iter 228: train loss 4.56193. lr 5.964674e-04
epoch 1 iter 229: train loss 4.50286. lr 5.964366e-04
epoch 1 iter 230: train loss 4.46797. lr 5.964055e-04

epoch 1 iter 264: train loss 4.31420. lr 5.952717e-04
epoch 1 iter 265: train loss 4.25489. lr 5.952361e-04
epoch 1 iter 266: train loss 4.25756. lr 5.952003e-04
epoch 1 iter 267: train loss 4.25249. lr 5.951643e-04
epoch 1 iter 268: train loss 4.26132. lr 5.951282e-04

epoch 1 iter 301: train loss 4.07526. lr 5.938632e-04
epoch 1 iter 302: train loss 4.09238. lr 5.938226e-04
epoch 1 iter 303: train loss 4.02837. lr 5.937819e-04
epoch 1 iter 304: train loss 4.05418. lr 5.937410e-04
epoch 1 iter 305: train loss 4.05858. lr 5.937000e-04

epoch 1 iter 339: train loss 3.92173. lr 5.922279e-04
epoch 1 iter 340: train loss 3.88412. lr 5.921823e-04
epoch 1 iter 341: train loss 3.85308. lr 5.921365e-04
epoch 1 iter 342: train loss 3.90708. lr 5.920906e-04
epoch 1 iter 343: train loss 3.84283. lr 5.920446e-04

epoch 1 iter 376: train loss 3.66043. lr 5.904529e-04
epoch 1 iter 377: train loss 3.63158. lr 5.904024e-04
epoch 1 iter 378: train loss 3.61731. lr 5.903518e-04
epoch 1 iter 379: train loss 3.63955. lr 5.903011e-04
epoch 1 iter 380: train loss 3.70386. lr 5.902503e-04

epoch 1 iter 414: train loss 3.40669. lr 5.884435e-04
epoch 1 iter 415: train loss 3.44591. lr 5.883881e-04
epoch 1 iter 416: train loss 3.44387. lr 5.883325e-04
epoch 1 iter 417: train loss 3.46670. lr 5.882768e-04
epoch 1 iter 418: train loss 3.46522. lr 5.882210e-04

epoch 1 iter 451: train loss 3.29038. lr 5.863066e-04
epoch 1 iter 452: train loss 3.19059. lr 5.862464e-04
epoch 1 iter 453: train loss 3.22845. lr 5.861861e-04
epoch 1 iter 454: train loss 3.19540. lr 5.861256e-04
epoch 1 iter 455: train loss 3.21138. lr 5.860650e-04

epoch 1 iter 489: train loss 3.07571. lr 5.839282e-04
epoch 1 iter 490: train loss 3.10846. lr 5.838631e-04
epoch 1 iter 491: train loss 3.04112. lr 5.837979e-04
epoch 1 iter 492: train loss 3.06873. lr 5.837325e-04
epoch 1 iter 493: train loss 2.99301. lr 5.836671e-04

epoch 1 iter 526: train loss 2.91554. lr 5.814348e-04
epoch 1 iter 527: train loss 2.80548. lr 5.813650e-04
epoch 1 iter 528: train loss 2.88089. lr 5.812951e-04
epoch 1 iter 529: train loss 2.84575. lr 5.812251e-04
epoch 1 iter 530: train loss 2.83897. lr 5.811549e-04

epoch 1 iter 564: train loss 2.68218. lr 5.786934e-04
epoch 1 iter 565: train loss 2.66943. lr 5.786188e-04
epoch 1 iter 566: train loss 2.62038. lr 5.785441e-04
epoch 1 iter 567: train loss 2.59815. lr 5.784693e-04
epoch 1 iter 568: train loss 2.62613. lr 5.783943e-04

epoch 1 iter 601: train loss 2.48215. lr 5.758499e-04
epoch 1 iter 602: train loss 2.42671. lr 5.757707e-04
epoch 1 iter 603: train loss 2.42536. lr 5.756913e-04
epoch 1 iter 604: train loss 2.45094. lr 5.756119e-04
epoch 1 iter 605: train loss 2.41392. lr 5.755323e-04

epoch 1 iter 639: train loss 2.29212. lr 5.727525e-04
epoch 1 iter 640: train loss 2.23175. lr 5.726685e-04
epoch 1 iter 641: train loss 2.20754. lr 5.725845e-04
epoch 1 iter 642: train loss 2.21875. lr 5.725003e-04
epoch 1 iter 643: train loss 2.28848. lr 5.724160e-04

epoch 1 iter 676: train loss 2.10087. lr 5.695660e-04
epoch 1 iter 677: train loss 2.06532. lr 5.694775e-04
epoch 1 iter 678: train loss 2.08461. lr 5.693890e-04
epoch 1 iter 679: train loss 2.05458. lr 5.693003e-04
epoch 1 iter 680: train loss 2.10828. lr 5.692115e-04

epoch 1 iter 714: train loss 1.85994. lr 5.661203e-04
epoch 1 iter 715: train loss 1.90067. lr 5.660273e-04
epoch 1 iter 716: train loss 1.91302. lr 5.659342e-04
epoch 1 iter 717: train loss 1.83010. lr 5.658409e-04
epoch 1 iter 718: train loss 1.91676. lr 5.657475e-04

epoch 1 iter 751: train loss 1.74919. lr 5.625990e-04
epoch 1 iter 752: train loss 1.72837. lr 5.625015e-04
epoch 1 iter 753: train loss 1.69740. lr 5.624040e-04
epoch 1 iter 754: train loss 1.74743. lr 5.623063e-04
epoch 1 iter 755: train loss 1.70206. lr 5.622085e-04

epoch 1 iter 789: train loss 1.52675. lr 5.588139e-04
epoch 1 iter 790: train loss 1.56523. lr 5.587120e-04
epoch 1 iter 791: train loss 1.54701. lr 5.586100e-04
epoch 1 iter 792: train loss 1.53320. lr 5.585078e-04
epoch 1 iter 793: train loss 1.54315. lr 5.584056e-04

epoch 1 iter 826: train loss 1.42250. lr 5.549665e-04
epoch 1 iter 827: train loss 1.37199. lr 5.548604e-04
epoch 1 iter 828: train loss 1.42445. lr 5.547541e-04
epoch 1 iter 829: train loss 1.40874. lr 5.546477e-04
epoch 1 iter 830: train loss 1.34089. lr 5.545411e-04

epoch 2 iter 590: train loss 0.19487. lr 1.840983e-04:  25%|██▌       | 590/2341 [04:29<13:44,  2.12it/s]
epoch 2 iter 591: train loss 0.19056. lr 1.839126e-04:  25%|██▌       | 591/2341 [04:29<13:37,  2.14it/s]
epoch 2 iter 592: train loss 0.19097. lr 1.837269e-04:  25%|██▌       | 592/2341 [04:29<13:23,  2.18it/s]
epoch 2 iter 593: train loss 0.18563. lr 1.835413e-04:  25%|██▌       | 593/2341 [04:30<13:13,  2.20it/s]
epoch 2 iter 594: train loss 0.18518. lr 1.833558e-04:  25%|██▌       | 594/2341 [04:30<13:10,  2.21it/s]

epoch 2 iter 627: train loss 0.19193. lr 1.772623e-04:  27%|██▋       | 628/2341 [04:46<12:53,  2.22it/s]
epoch 2 iter 628: train loss 0.18057. lr 1.770786e-04:  27%|██▋       | 628/2341 [04:46<12:53,  2.22it/s]
epoch 2 iter 629: train loss 0.18812. lr 1.768949e-04:  27%|██▋       | 629/2341 [04:46<12:44,  2.24it/s]
epoch 2 iter 630: train loss 0.18895. lr 1.767113e-04:  27%|██▋       | 630/2341 [04:47<12:44,  2.24it/s]
epoch 2 iter 631: train loss 0.18143. lr 1.765278e-04:  27%|██▋       | 631/2341 [04:47<12:45,  2.23it/s]

epoch 2 iter 665: train loss 0.17903. lr 1.703205e-04:  28%|██▊       | 665/2341 [05:03<13:08,  2.13it/s]
epoch 2 iter 666: train loss 0.18436. lr 1.701389e-04:  28%|██▊       | 666/2341 [05:03<13:03,  2.14it/s]
epoch 2 iter 667: train loss 0.18984. lr 1.699574e-04:  28%|██▊       | 667/2341 [05:04<12:56,  2.15it/s]
epoch 2 iter 668: train loss 0.18096. lr 1.697760e-04:  29%|██▊       | 668/2341 [05:04<12:52,  2.17it/s]
epoch 2 iter 669: train loss 0.18504. lr 1.695946e-04:  29%|██▊       | 669/2341 [05:05<12:47,  2.18it/s]

epoch 2 iter 702: train loss 0.18406. lr 1.636423e-04:  30%|███       | 703/2341 [05:20<12:13,  2.23it/s]
epoch 2 iter 703: train loss 0.18227. lr 1.634630e-04:  30%|███       | 703/2341 [05:20<12:13,  2.23it/s]
epoch 2 iter 704: train loss 0.18718. lr 1.632837e-04:  30%|███       | 704/2341 [05:21<12:12,  2.24it/s]
epoch 2 iter 705: train loss 0.17259. lr 1.631045e-04:  30%|███       | 705/2341 [05:21<12:12,  2.23it/s]
epoch 2 iter 706: train loss 0.19014. lr 1.629253e-04:  30%|███       | 706/2341 [05:22<12:09,  2.24it/s]

epoch 2 iter 740: train loss 0.17969. lr 1.568712e-04:  32%|███▏      | 740/2341 [05:37<12:36,  2.12it/s]
epoch 2 iter 741: train loss 0.18416. lr 1.566943e-04:  32%|███▏      | 741/2341 [05:38<12:41,  2.10it/s]
epoch 2 iter 742: train loss 0.17503. lr 1.565174e-04:  32%|███▏      | 742/2341 [05:38<12:41,  2.10it/s]
epoch 2 iter 743: train loss 0.17809. lr 1.563406e-04:  32%|███▏      | 743/2341 [05:39<12:36,  2.11it/s]
epoch 2 iter 744: train loss 0.16845. lr 1.561638e-04:  32%|███▏      | 744/2341 [05:39<12:34,  2.12it/s]

In [None]:
print(1)