## Train a character-level GPT on some text data

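The inputs here are simple text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare, which we'll get it to predict at the character level. Note that the cells below actually feed the model subword (WordPiece) tokens produced by the `bert-base-cased` tokenizer rather than raw characters, so each "character" in the dataset class is really a token.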
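As a rough illustration of the character-level setup described above, the sketch below builds a vocabulary from a short string and round-trips it through integer ids. The sample string and the helper names `encode` and `decode` are illustrative and not part of the notebook.

```python
# Minimal sketch of character-level encoding (sample text and helper names are illustrative).
text = "to be or not to be"

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character


def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(itos[i] for i in ids)


ids = encode("to be")
print(len(chars), ids, decode(ids))
```

The same two dictionaries work unchanged when the items are subword tokens instead of characters, which is how the dataset class below is used.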

In [1]:
from allennlp.data.token_indexers import TokenIndexer, PretrainedTransformerIndexer
from allennlp.data.tokenizers import Token, Tokenizer, PretrainedTransformerTokenizer

import nltk
#nltk.download('punkt')
import numpy as np
from os import listdir
from os.path import join as pathjoin
import torch
import torch.nn as nn
from torch.nn import functional as F
import tqdm

from minGPT.mingpt.model import GPT, GPTConfig
from minGPT.mingpt.trainer import Trainer, TrainerConfig
from minGPT.mingpt.utils import sample, set_seed

# make deterministic
set_seed(42)

In [2]:
DATA_DIR = '/home/mlepekhin/data'
MODELS_DIR = '/home/mlepekhin/models'
transformer_model = 'bert-base-cased'

In [3]:
import math
from torch.utils.data import Dataset


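def detokenize(tokens):
    # drop the first and last tokens (presumably the tokenizer's special tokens)
    # and glue '##' word-piece continuations back onto the preceding piece
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')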

class BPEDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
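
The input/target shift described in the docstring above can be checked by hand. This is a small sketch using plain lists instead of tensors; `make_example` is a hypothetical helper that mirrors the slicing in `__getitem__`, and the batch size of 256 is just the default used later in the notebook.

```python
# Sketch of the x/y shift from the docstring, with plain lists instead of tensors.
def make_example(items, idx, block_size):
    """Hypothetical helper mirroring __getitem__: take block_size + 1 items and shift by one."""
    chunk = items[idx:idx + block_size + 1]
    return chunk[:-1], chunk[1:]


block_size = 4
x, y = make_example(list("hello"), 0, block_size)

# Each prefix of x is asked to predict the item of y at the same position.
for t in range(block_size):
    print(''.join(x[:t + 1]), '->', y[t])

# Train/test asymmetry: one training pass scores B*T positions at once,
# while generation needs one forward pass per new item.
B, T = 256, block_size
predictions_per_training_pass = B * T
forward_passes_to_generate_T_items = T
```

With `block_size = 4` the loop prints the four prefix/target pairs listed in the docstring.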


In [4]:
block_size = 128
tokenizer = PretrainedTransformerTokenizer(transformer_model)
#indexer = PretrainedTransformerIndexer(transformer_model)
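
To see what `detokenize` does with word-piece output, here is a small sketch on a hand-written token list. The tokens are made up for illustration (they were not produced by running the tokenizer), and the function is copied from the cell above so the snippet runs on its own.

```python
# Copy of detokenize from the cell above, so this snippet is self-contained.
def detokenize(tokens):
    return ' '.join([str(x) for x in tokens[1:-1]]).replace(' ##', '')


# Hand-written example in the style of BERT word pieces (an assumption, not real tokenizer output):
# special tokens at both ends, and '##' marking a continuation of the previous piece.
tokens = ['[CLS]', 'token', '##ization', 'works', '[SEP]']

print(detokenize(tokens))
```

Note that the function always drops the first and last items, so it should only be applied to sequences that still carry both special tokens.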

In [5]:
def train_gpt_generator(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512,
                        max_epochs=2, batch_size=256):
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # tokenize each sentence, drop its first and last (special) tokens, and flatten into one list
    tokens = [str(token) for sent in text_sentences for token in tokenizer.tokenize(sent)[1:-1]]
    train_dataset = BPEDataset(tokens, block_size) 
    
    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    tconf = TrainerConfig(
        max_epochs=max_epochs, batch_size=batch_size, learning_rate=6e-4,
        lr_decay=True, warmup_tokens=batch_size*20, final_tokens=max_epochs*len(train_dataset)*block_size,
        num_workers=2
    )
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), state_dict_file)
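
The `warmup_tokens` and `final_tokens` settings control a learning-rate schedule. The sketch below is an assumption about how the minGPT trainer behaves with `lr_decay=True` (linear warmup, then cosine decay with a floor at 10% of the base rate); it is not code from this notebook, and `lr_at` is a hypothetical name. It does appear to reproduce the `lr` values printed in the training output for the first file (191478 tokens).

```python
import math


def lr_at(tokens_seen, learning_rate, warmup_tokens, final_tokens):
    """Assumed schedule: linear warmup, then cosine decay floored at 10% of the base rate."""
    if tokens_seen < warmup_tokens:
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * mult


batch_size, block_size, max_epochs = 256, 128, 2
dataset_len = 191478 - block_size                 # number of training windows for the first file
warmup_tokens = batch_size * 20
final_tokens = max_epochs * dataset_len * block_size

tokens_per_batch = batch_size * block_size
lr_first = lr_at(1 * tokens_per_batch, 6e-4, warmup_tokens, final_tokens)
lr_second = lr_at(2 * tokens_per_batch, 6e-4, warmup_tokens, final_tokens)
lr_final = lr_at(final_tokens, 6e-4, warmup_tokens, final_tokens)

print('%e %e %e' % (lr_first, lr_second, lr_final))
```

Under this assumption the warmup is over after the very first batch, since one batch already contains more tokens than `warmup_tokens`, and the rate bottoms out at 6e-05 well before the end of the second epoch.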

In [6]:
GENRE_DATA_DIR = '/home/mlepekhin/data/genre'
GPT_MODELS_DIR = '/home/mlepekhin/models/mini_gpt_bpe/'
LANG = 'en'

In [7]:
#train_gpt_generator(
#        pathjoin(GENRE_DATA_DIR, LANG, 'A1.txt'),
#        pathjoin(GPT_MODELS_DIR, LANG, 'A1')
#)

In [None]:
for train_text_file in tqdm.tqdm(listdir(pathjoin(GENRE_DATA_DIR, LANG))):
    label = train_text_file[:-4]
    train_gpt_generator(
        pathjoin(GENRE_DATA_DIR, LANG, train_text_file),
        pathjoin(GPT_MODELS_DIR, LANG, label)
    )
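
The loop above derives each model label by cutting the last four characters off the file name, which assumes every entry in the directory ends in `.txt`. A slightly more defensive variant might look like the sketch below; `label_for` and `text_files` are hypothetical helpers, and every file name except `A1.txt` is made up.

```python
from os.path import splitext


def label_for(filename):
    """Strip the extension, whatever its length."""
    return splitext(filename)[0]


def text_files(filenames):
    """Keep only .txt entries, in a stable order."""
    return sorted(name for name in filenames if name.endswith('.txt'))


# Made-up directory listing for illustration.
names = ['A12.txt', 'notes.md', 'A1.txt']
labels = [label_for(name) for name in text_files(names)]
print(labels)
```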

  0%|          | 0/11 [00:00<?, ?it/s]

data has 191478 characters, 11294 unique.
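
Two quick sanity checks on the numbers in this output, as a sketch. It assumes the trainer keeps the final partial batch, and it compares the first reported loss against the cross-entropy of a uniform guess over the vocabulary, which is a common baseline for an untrained model.

```python
import math

n_tokens, vocab_size = 191478, 11294      # from the "data has ..." line above
block_size, batch_size = 128, 256         # values used in this notebook

# One training example per starting position, as in __len__.
n_examples = n_tokens - block_size

# Assumes the last partial batch is kept rather than dropped.
batches_per_epoch = math.ceil(n_examples / batch_size)

# Cross-entropy of guessing uniformly over the vocabulary.
uniform_loss = math.log(vocab_size)

print(batches_per_epoch, round(uniform_loss, 3))
```

The batch count matches the 748 steps shown in the progress bar, and the uniform baseline of about 9.33 is close to the loss of 9.44 reported at the first iteration.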




epoch 1 iter 0: train loss 9.44247. lr 5.999995e-04:   0%|          | 0/748 [00:07<?, ?it/s]
epoch 1 iter 75: train loss 5.67482. lr 5.961967e-04:  10%|█         | 76/748 [00:31<03:33,  3.14it/s]
epoch 1 iter 151: train loss 4.95283. lr 5.848522e-04:  20%|██        | 151/748 [00:55<03:10,  3.13it/s]
epoch 1 iter 226: train loss 4.38829. lr 5.665454e-04:  30%|███       | 226/748 [01:20<02:47,  3.11it/s]
epoch 1 iter 301: train loss 3.66477. lr 5.416293e-04:  40%|████      | 301/748 [01:44<02:29,  2.98it/s]
epoch 1 iter 376: train loss 2.89786. lr 5.107219e-04:  50%|█████     | 376/748 [02:09<02:00,  3.09it/s]
epoch 1 iter 451: train loss 2.21627. lr 4.745895e-04:  60%|██████    | 451/748 [02:33<01:38,  3.01it/s]
epoch 1 iter 526: train loss 1.54856. lr 4.341281e-04:  70%|███████   | 526/748 [02:58<01:15,  2.92it/s]
epoch 1 iter 601: train loss 1.05628. lr 3.903408e-04:  80%|████████  | 601/748 [03:23<00:48,  3.01it/s]
epoch 1 iter 676: train loss 0.74688. lr 3.443135e-04:  90%|█████████ | 676/748 [03:49<00:24,  2.99it/s]
epoch 2 iter 3: train loss 0.58079. lr 2.975272e-04:   0%|          | 3/748 [00:01<04:55,  2.52it/s]
epoch 2 iter 80: train loss 0.45640. lr 2.492229e-04:  11%|█         | 80/748 [00:28<03:51,  2.89it/s]
epoch 2 iter 155: train loss 0.41183. lr 2.034385e-04:  21%|██        | 156/748 [00:54<03:28,  2.84it/s]
epoch 2 iter 230: train loss 0.35548. lr 1.600485e-04:  31%|███       | 231/748 [01:22<03:07,  2.76it/s]
epoch 2 iter 305: train loss 0.31921. lr 1.201287e-04:  41%|████      | 306/748 [01:50<02:37,  2.80it/s]
epoch 2 iter 380: train loss 0.31380. lr 8.466892e-05:  51%|█████     | 381/748 [02:18<02:09,  2.84it/s]
epoch 2 iter 455: train loss 0.29097. lr 6.000000e-05:  61%|██████    | 456/748 [02:47<01:46,  2.74it/s]
epoch 2 iter 530: train loss 0.28902. lr 6.000000e-05:  71%|███████   | 531/748 [03:16<01:29,  2.43it/s]
epoch 2 iter 605: train loss 0.27988. lr 6.000000e-05:  81%|████████  | 606/748 [03:46<00:58,  2.43it/s]
epoch 2 iter 680: train loss 0.27268. lr 6.000000e-05:  91%|█████████ | 681/748 [04:16<00:24,  2.78it/s]

data has 147526 characters, 9734 unique.



  0%|          | 0/576 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 9.25172. lr 5.999992e-04:   0%|          | 0/576 [00:00<?, ?it/s][A
epoch 1 iter 0: train loss 9.25172. lr 5.999992e-04:   0%|          | 1/576 [00:00<03:45,  2.55it/s][A
epoch 1 iter 1: train loss 8.61469. lr 5.999962e-04:   0%|          | 1/576 [00:00<03:45,  2.55it/s][A
epoch 1 iter 1: train loss 8.61469. lr 5.999962e-04:   0%|          | 2/576 [00:00<03:33,  2.69it/s][A
epoch 1 iter 2: train loss 8.20389. lr 5.999910e-04:   0%|          | 2/576 [00:01<03:33,  2.69it/s][A
epoch 1 iter 2: train loss 8.20389. lr 5.999910e-04:   1%|          | 3/576 [00:01<03:23,  2.81it/s][A
epoch 1 iter 3: train loss 7.94861. lr 5.999835e-04:   1%|          | 3/576 [00:01<03:23,  2.81it/s][A
epoch 1 iter 3: train loss 7.94861. lr 5.999835e-04:   1%|          | 4/576 [00:01<03:17,  2.90it/s][A
epoch 1 iter 4: train loss 7.71431. lr 5.999738e-04:   1%|          | 4/576 [00:01<03:17,  2.90it/s][A
epoch 1 iter 4: train loss 7

epoch 1 iter 77: train loss 5.55541. lr 5.932584e-04:  13%|█▎        | 77/576 [00:27<03:07,  2.66it/s][A
epoch 1 iter 77: train loss 5.55541. lr 5.932584e-04:  14%|█▎        | 78/576 [00:27<03:00,  2.77it/s][A
epoch 1 iter 78: train loss 5.42305. lr 5.930848e-04:  14%|█▎        | 78/576 [00:28<03:00,  2.77it/s][A
epoch 1 iter 78: train loss 5.42305. lr 5.930848e-04:  14%|█▎        | 79/576 [00:28<02:57,  2.80it/s][A
epoch 1 iter 79: train loss 5.42363. lr 5.929090e-04:  14%|█▎        | 79/576 [00:28<02:57,  2.80it/s][A
epoch 1 iter 79: train loss 5.42363. lr 5.929090e-04:  14%|█▍        | 80/576 [00:28<02:57,  2.80it/s][A
epoch 1 iter 80: train loss 5.47816. lr 5.927310e-04:  14%|█▍        | 80/576 [00:28<02:57,  2.80it/s][A
epoch 1 iter 80: train loss 5.47816. lr 5.927310e-04:  14%|█▍        | 81/576 [00:28<02:54,  2.84it/s][A
epoch 1 iter 81: train loss 5.44646. lr 5.925508e-04:  14%|█▍        | 81/576 [00:29<02:54,  2.84it/s][A
epoch 1 iter 81: train loss 5.44646. lr 5.9255

epoch 1 iter 152: train loss 4.31840. lr 5.742879e-04:  27%|██▋       | 153/576 [00:53<02:29,  2.82it/s][A
epoch 1 iter 153: train loss 4.21359. lr 5.739553e-04:  27%|██▋       | 153/576 [00:53<02:29,  2.82it/s][A
epoch 1 iter 153: train loss 4.21359. lr 5.739553e-04:  27%|██▋       | 154/576 [00:53<02:26,  2.89it/s][A
epoch 1 iter 154: train loss 4.18393. lr 5.736207e-04:  27%|██▋       | 154/576 [00:53<02:26,  2.89it/s][A
epoch 1 iter 154: train loss 4.18393. lr 5.736207e-04:  27%|██▋       | 155/576 [00:53<02:23,  2.93it/s][A
epoch 1 iter 155: train loss 4.29894. lr 5.732840e-04:  27%|██▋       | 155/576 [00:54<02:23,  2.93it/s][A
epoch 1 iter 155: train loss 4.29894. lr 5.732840e-04:  27%|██▋       | 156/576 [00:54<02:22,  2.94it/s][A
epoch 1 iter 156: train loss 4.25957. lr 5.729454e-04:  27%|██▋       | 156/576 [00:54<02:22,  2.94it/s][A
epoch 1 iter 156: train loss 4.25957. lr 5.729454e-04:  27%|██▋       | 157/576 [00:54<02:22,  2.94it/s][A
epoch 1 iter 157: train loss

epoch 1 iter 227: train loss 3.15613. lr 5.438710e-04:  40%|███▉      | 228/576 [01:19<01:57,  2.96it/s][A
epoch 1 iter 228: train loss 3.15901. lr 5.433934e-04:  40%|███▉      | 228/576 [01:20<01:57,  2.96it/s][A
epoch 1 iter 228: train loss 3.15901. lr 5.433934e-04:  40%|███▉      | 229/576 [01:20<01:57,  2.96it/s][A
epoch 1 iter 229: train loss 3.10654. lr 5.429139e-04:  40%|███▉      | 229/576 [01:20<01:57,  2.96it/s][A
epoch 1 iter 229: train loss 3.10654. lr 5.429139e-04:  40%|███▉      | 230/576 [01:20<01:56,  2.96it/s][A
epoch 1 iter 230: train loss 3.06695. lr 5.424327e-04:  40%|███▉      | 230/576 [01:20<01:56,  2.96it/s][A
epoch 1 iter 230: train loss 3.06695. lr 5.424327e-04:  40%|████      | 231/576 [01:21<01:56,  2.96it/s][A
epoch 1 iter 231: train loss 3.06685. lr 5.419496e-04:  40%|████      | 231/576 [01:21<01:56,  2.96it/s][A
epoch 1 iter 231: train loss 3.06685. lr 5.419496e-04:  40%|████      | 232/576 [01:21<01:55,  2.98it/s][A
epoch 1 iter 232: train loss

epoch 1 iter 302: train loss 1.94318. lr 5.032771e-04:  53%|█████▎    | 303/576 [01:46<01:31,  2.99it/s][A
epoch 1 iter 303: train loss 1.95753. lr 5.026743e-04:  53%|█████▎    | 303/576 [01:46<01:31,  2.99it/s][A
epoch 1 iter 303: train loss 1.95753. lr 5.026743e-04:  53%|█████▎    | 304/576 [01:46<01:39,  2.74it/s][A
epoch 1 iter 304: train loss 1.93860. lr 5.020700e-04:  53%|█████▎    | 304/576 [01:46<01:39,  2.74it/s][A
epoch 1 iter 304: train loss 1.93860. lr 5.020700e-04:  53%|█████▎    | 305/576 [01:46<01:37,  2.79it/s][A
epoch 1 iter 305: train loss 1.92013. lr 5.014643e-04:  53%|█████▎    | 305/576 [01:47<01:37,  2.79it/s][A
epoch 1 iter 305: train loss 1.92013. lr 5.014643e-04:  53%|█████▎    | 306/576 [01:47<01:34,  2.87it/s][A
epoch 1 iter 306: train loss 1.92533. lr 5.008570e-04:  53%|█████▎    | 306/576 [01:47<01:34,  2.87it/s][A
epoch 1 iter 306: train loss 1.92533. lr 5.008570e-04:  53%|█████▎    | 307/576 [01:47<01:32,  2.92it/s][A
epoch 1 iter 307: train loss

epoch 1 iter 377: train loss 1.08276. lr 4.542001e-04:  66%|██████▌   | 378/576 [02:12<01:10,  2.82it/s][A
epoch 1 iter 378: train loss 1.08629. lr 4.534974e-04:  66%|██████▌   | 378/576 [02:13<01:10,  2.82it/s][A
epoch 1 iter 378: train loss 1.08629. lr 4.534974e-04:  66%|██████▌   | 379/576 [02:13<01:10,  2.81it/s][A
epoch 1 iter 379: train loss 1.06695. lr 4.527936e-04:  66%|██████▌   | 379/576 [02:13<01:10,  2.81it/s][A
epoch 1 iter 379: train loss 1.06695. lr 4.527936e-04:  66%|██████▌   | 380/576 [02:13<01:10,  2.79it/s][A
epoch 1 iter 380: train loss 1.07601. lr 4.520885e-04:  66%|██████▌   | 380/576 [02:13<01:10,  2.79it/s][A
epoch 1 iter 380: train loss 1.07601. lr 4.520885e-04:  66%|██████▌   | 381/576 [02:13<01:09,  2.80it/s][A
epoch 1 iter 381: train loss 1.06962. lr 4.513824e-04:  66%|██████▌   | 381/576 [02:14<01:09,  2.80it/s][A
epoch 1 iter 381: train loss 1.06962. lr 4.513824e-04:  66%|██████▋   | 382/576 [02:14<01:09,  2.80it/s][A
epoch 1 iter 382: train loss

epoch 1 iter 452: train loss 0.64048. lr 3.986883e-04:  79%|███████▊  | 453/576 [02:38<00:41,  2.94it/s]
epoch 1 iter 453: train loss 0.63221. lr 3.979149e-04:  79%|███████▉  | 454/576 [02:39<00:41,  2.97it/s]
epoch 1 iter 454: train loss 0.64790. lr 3.971408e-04:  79%|███████▉  | 455/576 [02:39<00:40,  2.97it/s]
epoch 1 iter 455: train loss 0.62374. lr 3.963660e-04:  79%|███████▉  | 456/576 [02:39<00:40,  2.93it/s]
epoch 1 iter 456: train loss 0.63577. lr 3.955904e-04:  79%|███████▉  | 457/576 [02:40<00:41,  2.90it/s]
epoch 1 iter 457: train loss

epoch 1 iter 527: train loss 0.45718. lr 3.390580e-04:  92%|█████████▏| 528/576 [03:04<00:16,  2.93it/s]
epoch 1 iter 528: train loss 0.45224. lr 3.382463e-04:  92%|█████████▏| 529/576 [03:05<00:16,  2.93it/s]
epoch 1 iter 529: train loss 0.44932. lr 3.374342e-04:  92%|█████████▏| 530/576 [03:05<00:15,  2.96it/s]
epoch 1 iter 530: train loss 0.42956. lr 3.366219e-04:  92%|█████████▏| 531/576 [03:05<00:15,  2.97it/s]
epoch 1 iter 531: train loss 0.44901. lr 3.358094e-04:  92%|█████████▏| 532/576 [03:06<00:14,  2.94it/s]
epoch 1 iter 532: train loss

epoch 2 iter 27: train loss 0.34521. lr 2.771665e-04:   5%|▍         | 27/576 [00:09<03:09,  2.89it/s]
epoch 2 iter 28: train loss 0.33811. lr 2.763504e-04:   5%|▍         | 28/576 [00:10<03:06,  2.94it/s]
epoch 2 iter 29: train loss 0.32758. lr 2.755345e-04:   5%|▌         | 29/576 [00:10<03:03,  2.97it/s]
epoch 2 iter 30: train loss 0.34593. lr 2.747187e-04:   5%|▌         | 30/576 [00:10<03:02,  2.99it/s]
epoch 2 iter 31: train loss 0.33763. lr 2.739032e-04:   5%|▌         | 31/576 [00:11<03:04,  2.95it/s]
epoch 2 iter 31: train loss 0.33763. lr 2.7390

epoch 2 iter 104: train loss 0.29452. lr 2.152839e-04:  18%|█▊        | 104/576 [00:37<02:41,  2.92it/s]
epoch 2 iter 105: train loss 0.28698. lr 2.144990e-04:  18%|█▊        | 105/576 [00:37<02:40,  2.94it/s]
epoch 2 iter 106: train loss 0.28327. lr 2.137147e-04:  18%|█▊        | 106/576 [00:37<02:39,  2.94it/s]
epoch 2 iter 107: train loss 0.29071. lr 2.129310e-04:  19%|█▊        | 107/576 [00:38<02:39,  2.93it/s]
epoch 2 iter 108: train loss 0.28342. lr 2.121480e-04:  19%|█▉        | 108/576 [00:38<02:40,  2.92it/s]
epoch 2 iter 108: train loss

epoch 2 iter 179: train loss 0.25923. lr 1.585686e-04:  31%|███       | 179/576 [01:02<02:16,  2.91it/s]
epoch 2 iter 180: train loss 0.24819. lr 1.578473e-04:  31%|███▏      | 180/576 [01:03<02:14,  2.94it/s]
epoch 2 iter 181: train loss 0.25468. lr 1.571270e-04:  31%|███▏      | 181/576 [01:03<02:15,  2.92it/s]
epoch 2 iter 182: train loss 0.25688. lr 1.564078e-04:  32%|███▏      | 182/576 [01:04<02:15,  2.90it/s]
epoch 2 iter 183: train loss 0.24506. lr 1.556896e-04:  32%|███▏      | 183/576 [01:04<02:16,  2.89it/s]
epoch 2 iter 183: train loss

epoch 2 iter 254: train loss 0.23109. lr 1.077555e-04:  44%|████▍     | 254/576 [01:29<01:59,  2.69it/s]
epoch 2 iter 255: train loss 0.22427. lr 1.071278e-04:  44%|████▍     | 255/576 [01:29<01:59,  2.69it/s]
epoch 2 iter 256: train loss 0.23541. lr 1.065015e-04:  44%|████▍     | 256/576 [01:29<01:57,  2.73it/s]
epoch 2 iter 257: train loss 0.23359. lr 1.058767e-04:  45%|████▍     | 257/576 [01:30<01:55,  2.77it/s]
epoch 2 iter 258: train loss 0.22765. lr 1.052534e-04:  45%|████▍     | 258/576 [01:30<01:52,  2.83it/s]
epoch 2 iter 258: train loss

epoch 2 iter 329: train loss 0.21087. lr 6.496491e-05:  57%|█████▋    | 329/576 [01:55<01:27,  2.81it/s]
epoch 2 iter 330: train loss 0.20775. lr 6.445709e-05:  57%|█████▋    | 330/576 [01:55<01:27,  2.81it/s]
epoch 2 iter 331: train loss 0.21160. lr 6.395102e-05:  57%|█████▋    | 331/576 [01:55<01:27,  2.81it/s]
epoch 2 iter 332: train loss 0.21565. lr 6.344672e-05:  58%|█████▊    | 332/576 [01:56<01:25,  2.84it/s]
epoch 2 iter 333: train loss 0.21317. lr 6.294417e-05:  58%|█████▊    | 333/576 [01:56<01:24,  2.86it/s]
epoch 2 iter 333: train loss

epoch 2 iter 404: train loss 0.20078. lr 6.000000e-05:  70%|███████   | 404/576 [02:21<00:59,  2.90it/s]
epoch 2 iter 405: train loss 0.20964. lr 6.000000e-05:  70%|███████   | 405/576 [02:21<00:58,  2.92it/s]
epoch 2 iter 406: train loss 0.20667. lr 6.000000e-05:  70%|███████   | 406/576 [02:22<00:58,  2.92it/s]
epoch 2 iter 407: train loss 0.20016. lr 6.000000e-05:  71%|███████   | 407/576 [02:22<00:58,  2.91it/s]
epoch 2 iter 408: train loss 0.20279. lr 6.000000e-05:  71%|███████   | 408/576 [02:22<00:57,  2.94it/s]
epoch 2 iter 408: train loss

epoch 2 iter 479: train loss 0.19915. lr 6.000000e-05:  83%|████████▎ | 479/576 [02:48<00:33,  2.91it/s]
epoch 2 iter 480: train loss 0.19715. lr 6.000000e-05:  83%|████████▎ | 480/576 [02:49<00:33,  2.90it/s]
epoch 2 iter 481: train loss 0.20515. lr 6.000000e-05:  84%|████████▎ | 481/576 [02:49<00:32,  2.91it/s]
epoch 2 iter 482: train loss 0.19582. lr 6.000000e-05:  84%|████████▎ | 482/576 [02:49<00:32,  2.92it/s]
epoch 2 iter 483: train loss 0.20191. lr 6.000000e-05:  84%|████████▍ | 483/576 [02:50<00:31,  2.93it/s]
epoch 2 iter 483: train loss

epoch 2 iter 554: train loss 0.18266. lr 6.000000e-05:  96%|█████████▌| 554/576 [03:15<00:07,  3.02it/s]
epoch 2 iter 555: train loss 0.20294. lr 6.000000e-05:  96%|█████████▋| 555/576 [03:15<00:06,  3.03it/s]
epoch 2 iter 556: train loss 0.19134. lr 6.000000e-05:  97%|█████████▋| 556/576 [03:15<00:06,  3.02it/s]
epoch 2 iter 557: train loss 0.19827. lr 6.000000e-05:  97%|█████████▋| 557/576 [03:16<00:06,  2.79it/s]
epoch 2 iter 558: train loss 0.19834. lr 6.000000e-05:  97%|█████████▋| 558/576 [03:16<00:06,  2.88it/s]
epoch 2 iter 558: train loss

data has 367292 characters, 12438 unique.



epoch 1 iter 0: train loss 9.53736. lr 5.999999e-04:   0%|          | 0/1435 [00:00<?, ?it/s]
epoch 1 iter 1: train loss 8.91019. lr 5.999994e-04:   0%|          | 1/1435 [00:00<10:26,  2.29it/s]
epoch 1 iter 2: train loss 8.45934. lr 5.999985e-04:   0%|          | 2/1435 [00:01<09:53,  2.42it/s]
epoch 1 iter 3: train loss 8.20132. lr 5.999973e-04:   0%|          | 3/1435 [00:01<09:28,  2.52it/s]
epoch 1 iter 4: train loss 7.98937. lr 5.999958e-04:   0%|          | 4/1435 [00:01<09:09,  2.60it/s]
epoch 1 iter 4: tr

epoch 1 iter 76: train loss 5.71013. lr 5.989381e-04:   5%|▌         | 76/1435 [00:27<08:05,  2.80it/s]
epoch 1 iter 77: train loss 5.70231. lr 5.989103e-04:   5%|▌         | 77/1435 [00:27<08:06,  2.79it/s]
epoch 1 iter 78: train loss 5.65947. lr 5.988821e-04:   5%|▌         | 78/1435 [00:28<08:03,  2.81it/s]
epoch 1 iter 79: train loss 5.61117. lr 5.988536e-04:   6%|▌         | 79/1435 [00:28<08:05,  2.79it/s]
epoch 1 iter 80: train loss 5.60696. lr 5.988247e-04:   6%|▌         | 80/1435 [00:29<08:05,  2.79it/s]
epoch 1 iter 80: train loss 5.60696. 

epoch 1 iter 151: train loss 4.66087. lr 5.958607e-04:  11%|█         | 152/1435 [00:55<07:57,  2.69it/s]
epoch 1 iter 152: train loss 4.67128. lr 5.958061e-04:  11%|█         | 153/1435 [00:55<07:56,  2.69it/s]
epoch 1 iter 153: train loss 4.68996. lr 5.957512e-04:  11%|█         | 154/1435 [00:56<07:52,  2.71it/s]
epoch 1 iter 154: train loss 4.57544. lr 5.956959e-04:  11%|█         | 155/1435 [00:56<08:00,  2.66it/s]
epoch 1 iter 155: train loss 4.65181. lr 5.956402e-04:  11%|█         | 156/1435 [00:56<08:04,  2.64it/s]
epoch 1 iter 156: t

epoch 1 iter 540: train loss 1.46220. lr 5.488854e-04:  38%|███▊      | 541/1435 [03:25<05:32,  2.69it/s]
epoch 1 iter 541: train loss 1.47640. lr 5.487018e-04:  38%|███▊      | 542/1435 [03:25<05:31,  2.70it/s]
epoch 1 iter 542: train loss 1.48271. lr 5.485179e-04:  38%|███▊      | 543/1435 [03:26<05:31,  2.69it/s]
epoch 1 iter 543: train loss 1.49015. lr 5.483337e-04:  38%|███▊      | 544/1435 [03:26<05:31,  2.69it/s]
epoch 1 iter 544: train loss 1.49925. lr 5.481492e-04:  38%|███▊      | 545/1435 [03:26<05:28,  2.71it/s]
epoch 1 iter 545: t

epoch 1 iter 615: train loss 1.11214. lr 5.343023e-04:  43%|████▎     | 616/1435 [03:53<05:05,  2.68it/s]
epoch 1 iter 616: train loss 1.10631. lr 5.340970e-04:  43%|████▎     | 617/1435 [03:54<05:02,  2.70it/s]
epoch 1 iter 617: train loss 1.14468. lr 5.338913e-04:  43%|████▎     | 618/1435 [03:54<05:01,  2.71it/s]
epoch 1 iter 618: train loss 1.10768. lr 5.336854e-04:  43%|████▎     | 619/1435 [03:54<05:01,  2.70it/s]
epoch 1 iter 619: train loss 1.09473. lr 5.334792e-04:  43%|████▎     | 620/1435 [03:55<05:01,  2.70it/s]
epoch 1 iter 620: t

epoch 1 iter 690: train loss 0.85225. lr 5.181390e-04:  48%|████▊     | 691/1435 [04:23<04:34,  2.71it/s]
epoch 1 iter 691: train loss 0.82484. lr 5.179133e-04:  48%|████▊     | 692/1435 [04:23<04:34,  2.71it/s]
epoch 1 iter 692: train loss 0.88460. lr 5.176873e-04:  48%|████▊     | 693/1435 [04:24<04:32,  2.72it/s]
epoch 1 iter 693: train loss 0.84370. lr 5.174611e-04:  48%|████▊     | 694/1435 [04:24<04:33,  2.71it/s]
epoch 1 iter 694: train loss 0.82915. lr 5.172346e-04:  48%|████▊     | 695/1435 [04:25<04:35,  2.69it/s]
epoch 1 iter 695: t

epoch 1 iter 765: train loss 0.67812. lr 5.005046e-04:  53%|█████▎    | 766/1435 [04:52<04:05,  2.72it/s]
epoch 1 iter 766: train loss 0.67780. lr 5.002600e-04:  53%|█████▎    | 767/1435 [04:52<04:05,  2.72it/s]
epoch 1 iter 767: train loss 0.68071. lr 5.000152e-04:  54%|█████▎    | 768/1435 [04:53<04:03,  2.74it/s]
epoch 1 iter 768: train loss 0.65187. lr 4.997702e-04:  54%|█████▎    | 769/1435 [04:53<04:03,  2.74it/s]
epoch 1 iter 769: train loss 0.66333. lr 4.995250e-04:  54%|█████▎    | 770/1435 [04:53<04:03,  2.74it/s]
epoch 1 iter 770: t

epoch 1 iter 840: train loss 0.56560. lr 4.815179e-04:  59%|█████▊    | 841/1435 [05:21<03:41,  2.68it/s]
epoch 1 iter 841: train loss 0.56615. lr 4.812562e-04:  59%|█████▊    | 842/1435 [05:21<05:10,  1.91it/s]
epoch 1 iter 842: train loss 0.55683. lr 4.809942e-04:  59%|█████▊    | 843/1435 [05:22<05:10,  1.91it/s]
epoch 1 iter 843: train loss 0.56438. lr 4.807321e-04:  59%|█████▉    | 844/1435 [05:22<04:59,  1.97it/s]
epoch 1 iter 844: train loss 0.55518. lr 4.804697e-04:  59%|█████▉    | 845/1435 [05:23<04:46,  2.06it/s]
epoch 1 iter 845: t

epoch 1 iter 915: train loss 0.50157. lr 4.613070e-04:  64%|██████▍   | 916/1435 [05:50<03:12,  2.70it/s]
epoch 1 iter 916: train loss 0.48282. lr 4.610299e-04:  64%|██████▍   | 917/1435 [05:50<03:11,  2.71it/s]
epoch 1 iter 917: train loss 0.49853. lr 4.607525e-04:  64%|██████▍   | 918/1435 [05:51<03:09,  2.73it/s]
epoch 1 iter 918: train loss 0.48567. lr 4.604750e-04:  64%|██████▍   | 919/1435 [05:51<03:08,  2.74it/s]
epoch 1 iter 919: train loss 0.49468. lr 4.601973e-04:  64%|██████▍   | 920/1435 [05:51<03:07,  2.75it/s]
epoch 1 iter 920: t

epoch 1 iter 990: train loss 0.44193. lr 4.400083e-04:  69%|██████▉   | 991/1435 [06:19<02:40,  2.76it/s]
epoch 1 iter 991: train loss 0.42428. lr 4.397176e-04:  69%|██████▉   | 992/1435 [06:19<02:40,  2.76it/s]
epoch 1 iter 992: train loss 0.43249. lr 4.394267e-04:  69%|██████▉   | 993/1435 [06:20<02:40,  2.75it/s]
epoch 1 iter 993: train loss 0.43991. lr 4.391357e-04:  69%|██████▉   | 994/1435 [06:20<02:39,  2.76it/s]
epoch 1 iter 994: train loss 0.43621. lr 4.388445e-04:  69%|██████▉   | 995/1435 [06:20<02:39,  2.76it/s]
epoch 1 iter 995: t

epoch 2 iter 98: train loss 1.27421. lr 8.164658e-05:  51%|█████▏    | 98/191 [00:33<00:28,  3.21it/s]
epoch 2 iter 99: train loss 1.23493. lr 7.995874e-05:  52%|█████▏    | 99/191 [00:34<00:28,  3.22it/s]
epoch 2 iter 100: train loss 1.24038. lr 7.828584e-05:  52%|█████▏    | 100/191 [00:34<00:28,  3.24it/s]
epoch 2 iter 101: train loss 1.23331. lr 7.662800e-05:  53%|█████▎    | 101/191 [00:34<00:27,  3.23it/s]
epoch 2 iter 102: train loss 1.22927. lr 7.498533e-05:  53%|█████▎    | 102/191 [00:34<00:27,  3.23it/s]
epoch 2 iter 102: train loss 1.2292

epoch 2 iter 173: train loss 0.94898. lr 6.000000e-05:  91%|█████████ | 173/191 [00:59<00:05,  3.11it/s]
epoch 2 iter 174: train loss 0.92488. lr 6.000000e-05:  91%|█████████ | 174/191 [00:59<00:05,  3.12it/s]
epoch 2 iter 175: train loss 0.94149. lr 6.000000e-05:  92%|█████████▏| 175/191 [00:59<00:05,  3.15it/s]
epoch 2 iter 176: train loss 0.93587. lr 6.000000e-05:  92%|█████████▏| 176/191 [01:00<00:04,  3.23it/s]
epoch 2 iter 177: train loss 0.94328. lr 6.000000e-05:  93%|█████████▎| 177/191 [01:00<00:04,  3.03it/s]
epoch 2 iter 177: train loss

data has 122065 characters, 6548 unique.



epoch 1 iter 0: train loss 8.88134. lr 5.999988e-04:   0%|          | 0/477 [00:00<?, ?it/s]
epoch 1 iter 1: train loss 8.13636. lr 5.999945e-04:   0%|          | 1/477 [00:00<03:07,  2.53it/s]
epoch 1 iter 2: train loss 7.64243. lr 5.999868e-04:   0%|          | 2/477 [00:00<02:50,  2.79it/s]
epoch 1 iter 3: train loss 7.40486. lr 5.999759e-04:   1%|          | 3/477 [00:01<02:38,  2.99it/s]
epoch 1 iter 4: train loss 7.21916. lr 5.999617e-04:   1%|          | 4/477 [00:01<02:31,  3.13it/s]
epoch 1 iter 4: train loss 7

epoch 1 iter 77: train loss 4.89511. lr 5.901657e-04:  16%|█▌        | 77/477 [00:23<01:56,  3.44it/s]
epoch 1 iter 78: train loss 4.86587. lr 5.899129e-04:  16%|█▋        | 78/477 [00:23<01:54,  3.48it/s]
epoch 1 iter 79: train loss 4.88180. lr 5.896569e-04:  17%|█▋        | 79/477 [00:24<01:55,  3.46it/s]
epoch 1 iter 80: train loss 4.82581. lr 5.893977e-04:  17%|█▋        | 80/477 [00:24<01:55,  3.43it/s]
epoch 1 iter 81: train loss 4.86592. lr 5.891354e-04:  17%|█▋        | 81/477 [00:24<01:56,  3.39it/s]
epoch 1 iter 81: train loss 4.86592. lr 5.8913

epoch 1 iter 152: train loss 3.59981. lr 5.626784e-04:  32%|███▏      | 153/477 [00:47<01:39,  3.26it/s]
epoch 1 iter 153: train loss 3.60229. lr 5.621990e-04:  32%|███▏      | 154/477 [00:47<01:41,  3.20it/s]
epoch 1 iter 154: train loss 3.64588. lr 5.617167e-04:  32%|███▏      | 155/477 [00:47<01:41,  3.17it/s]
epoch 1 iter 155: train loss 3.58981. lr 5.612316e-04:  33%|███▎      | 156/477 [00:48<01:41,  3.17it/s]
epoch 1 iter 156: train loss 3.49901. lr 5.607437e-04:  33%|███▎      | 157/477 [00:48<01:40,  3.17it/s]
epoch 1 iter 157: train loss

epoch 1 iter 227: train loss 2.57797. lr 5.191984e-04:  48%|████▊     | 228/477 [01:12<01:17,  3.21it/s]
epoch 1 iter 228: train loss 2.60181. lr 5.185216e-04:  48%|████▊     | 229/477 [01:12<01:16,  3.23it/s]
epoch 1 iter 229: train loss 2.58686. lr 5.178425e-04:  48%|████▊     | 230/477 [01:12<01:15,  3.26it/s]
epoch 1 iter 230: train loss 2.52745. lr 5.171609e-04:  48%|████▊     | 231/477 [01:13<01:15,  3.27it/s]
epoch 1 iter 231: train loss 2.53997. lr 5.164771e-04:  49%|████▊     | 232/477 [01:13<01:14,  3.29it/s]
epoch 1 iter 232: train loss

epoch 1 iter 302: train loss 1.57039. lr 4.623728e-04:  64%|██████▎   | 303/477 [01:36<00:53,  3.25it/s]
epoch 1 iter 303: train loss 1.57270. lr 4.615399e-04:  64%|██████▎   | 304/477 [01:36<00:53,  3.25it/s]
epoch 1 iter 304: train loss 1.58247. lr 4.607052e-04:  64%|██████▍   | 305/477 [01:36<00:52,  3.26it/s]
epoch 1 iter 305: train loss 1.58875. lr 4.598688e-04:  64%|██████▍   | 306/477 [01:37<00:52,  3.26it/s]
epoch 1 iter 306: train loss 1.55085. lr 4.590306e-04:  64%|██████▍   | 307/477 [01:37<00:52,  3.21it/s]
epoch 1 iter 307: train loss

epoch 2 iter 76: train loss 0.62015. lr 1.909076e-04:  24%|██▎       | 77/325 [00:26<01:23,  2.97it/s]
epoch 2 iter 77: train loss 0.62351. lr 1.895567e-04:  24%|██▍       | 78/325 [00:26<01:22,  2.99it/s]
epoch 2 iter 78: train loss 0.63226. lr 1.882085e-04:  24%|██▍       | 79/325 [00:26<01:21,  3.02it/s]
epoch 2 iter 79: train loss 0.61181. lr 1.868628e-04:  25%|██▍       | 80/325 [00:26<01:21,  3.02it/s]
epoch 2 iter 80: train loss 0.61118. lr 1.855198e-04:  25%|██▍       | 81/325 [00:27<01:20,  3.02it/s]
epoch 2 iter 81: train loss 0.61612. lr 1.8417

epoch 2 iter 152: train loss 0.42867. lr 9.773854e-05:  47%|████▋     | 152/325 [00:51<00:58,  2.97it/s]
epoch 2 iter 153: train loss 0.42569. lr 9.666891e-05:  47%|████▋     | 153/325 [00:52<00:57,  3.01it/s]
epoch 2 iter 154: train loss 0.43441. lr 9.560405e-05:  47%|████▋     | 154/325 [00:52<01:02,  2.72it/s]
epoch 2 iter 155: train loss 0.43126. lr 9.454396e-05:  48%|████▊     | 155/325 [00:52<01:04,  2.62it/s]
epoch 2 iter 156: train loss 0.43796. lr 9.348869e-05:  48%|████▊     | 156/325 [00:53<01:04,  2.62it/s]
epoch 2 iter 156: train loss

epoch 2 iter 227: train loss 0.37709. lr 6.000000e-05:  70%|██████▉   | 227/325 [01:16<00:32,  2.99it/s]
epoch 2 iter 228: train loss 0.36985. lr 6.000000e-05:  70%|███████   | 228/325 [01:17<00:32,  3.01it/s]
epoch 2 iter 229: train loss 0.37809. lr 6.000000e-05:  70%|███████   | 229/325 [01:17<00:31,  3.01it/s]
epoch 2 iter 230: train loss 0.37230. lr 6.000000e-05:  71%|███████   | 230/325 [01:17<00:31,  3.01it/s]
epoch 2 iter 231: train loss 0.37344. lr 6.000000e-05:  71%|███████   | 231/325 [01:18<00:31,  3.02it/s]
epoch 2 iter 231: train loss

epoch 2 iter 302: train loss 0.33071. lr 6.000000e-05:  93%|█████████▎| 302/325 [01:42<00:07,  3.05it/s]
epoch 2 iter 303: train loss 0.33022. lr 6.000000e-05:  93%|█████████▎| 303/325 [01:42<00:07,  3.05it/s]
epoch 2 iter 304: train loss 0.33438. lr 6.000000e-05:  94%|█████████▎| 304/325 [01:42<00:06,  3.05it/s]
epoch 2 iter 305: train loss 0.32593. lr 6.000000e-05:  94%|█████████▍| 305/325 [01:43<00:06,  3.06it/s]
epoch 2 iter 306: train loss 0.32386. lr 6.000000e-05:  94%|█████████▍| 306/325 [01:43<00:06,  3.06it/s]
epoch 2 iter 306: train loss

data has 116782 characters, 11574 unique.



epoch 1 iter 0: train loss 9.44420. lr 5.999987e-04:   0%|          | 0/456 [00:00<?, ?it/s]
epoch 1 iter 1: train loss 8.88046. lr 5.999939e-04:   0%|          | 1/456 [00:00<03:33,  2.13it/s]
epoch 1 iter 2: train loss 8.54068. lr 5.999856e-04:   0%|          | 2/456 [00:01<03:16,  2.32it/s]
epoch 1 iter 3: train loss 8.30041. lr 5.999737e-04:   1%|          | 3/456 [00:01<03:03,  2.46it/s]
epoch 1 iter 4: train loss 8.11558. lr 5.999582e-04:   1%|          | 4/456 [00:01<02:55,  2.58it/s]
epoch 1 iter 4: train loss 8

epoch 1 iter 77: train loss 6.16136. lr 5.892601e-04:  17%|█▋        | 77/456 [00:27<02:14,  2.82it/s]
epoch 1 iter 78: train loss 6.12464. lr 5.889841e-04:  17%|█▋        | 78/456 [00:28<02:13,  2.83it/s]
epoch 1 iter 79: train loss 6.13463. lr 5.887047e-04:  17%|█▋        | 79/456 [00:28<02:12,  2.84it/s]
epoch 1 iter 80: train loss 6.16486. lr 5.884218e-04:  18%|█▊        | 80/456 [00:28<02:11,  2.85it/s]
epoch 1 iter 81: train loss 6.14626. lr 5.881355e-04:  18%|█▊        | 81/456 [00:29<02:11,  2.85it/s]
epoch 1 iter 81: train loss 6.14626. lr 5.8813

epoch 1 iter 152: train loss 4.66103. lr 5.593011e-04:  34%|███▎      | 153/456 [00:54<01:45,  2.87it/s]
epoch 1 iter 153: train loss 4.60306. lr 5.587794e-04:  34%|███▍      | 154/456 [00:54<01:45,  2.86it/s]
epoch 1 iter 154: train loss 4.62501. lr 5.582546e-04:  34%|███▍      | 155/456 [00:54<01:45,  2.84it/s]
epoch 1 iter 155: train loss 4.63603. lr 5.577267e-04:  34%|███▍      | 156/456 [00:55<01:45,  2.84it/s]
epoch 1 iter 156: train loss 4.54290. lr 5.571958e-04:  34%|███▍      | 157/456 [00:55<01:45,  2.82it/s]

epoch 1 iter 1413: train loss 0.41575. lr 4.008158e-04:  78%|███████▊  | 1414/1809 [09:57<02:43,  2.42it/s]
epoch 1 iter 1414: train loss 0.42455. lr 4.005704e-04:  78%|███████▊  | 1415/1809 [09:57<02:42,  2.42it/s]
epoch 1 iter 1415: train loss 0.42462. lr 4.003248e-04:  78%|███████▊  | 1416/1809 [09:58<02:42,  2.42it/s]
epoch 1 iter 1416: train loss 0.40993. lr 4.000792e-04:  78%|███████▊  | 1417/1809 [09:58<02:41,  2.43it/s]
epoch 1 iter 1417: train loss 0.42032. lr 3.998335e-04:  78%|███████▊  | 1417/1809 [09:59<02:41,  2.43it/s]

epoch 1 iter 1486: train loss 0.39411. lr 3.827091e-04:  82%|████████▏ | 1487/1809 [10:28<02:15,  2.38it/s]
epoch 1 iter 1487: train loss 0.40237. lr 3.824585e-04:  82%|████████▏ | 1488/1809 [10:29<02:14,  2.39it/s]
epoch 1 iter 1488: train loss 0.38446. lr 3.822080e-04:  82%|████████▏ | 1489/1809 [10:29<02:12,  2.41it/s]
epoch 1 iter 1489: train loss 0.38830. lr 3.819573e-04:  82%|████████▏ | 1490/1809 [10:29<02:11,  2.42it/s]
epoch 1 iter 1490: train loss 0.38500. lr 3.817066e-04:  82%|████████▏ | 1490/1809 [10:30<02:11,  2.42it/s]

epoch 1 iter 1559: train loss 0.36194. lr 3.642698e-04:  86%|████████▌ | 1560/1809 [10:59<01:41,  2.45it/s]
epoch 1 iter 1560: train loss 0.37678. lr 3.640153e-04:  86%|████████▋ | 1561/1809 [10:59<01:41,  2.45it/s]
epoch 1 iter 1561: train loss 0.35411. lr 3.637607e-04:  86%|████████▋ | 1562/1809 [11:00<01:40,  2.45it/s]
epoch 1 iter 1562: train loss 0.36685. lr 3.635060e-04:  86%|████████▋ | 1563/1809 [11:00<01:40,  2.45it/s]
epoch 1 iter 1563: train loss 0.36864. lr 3.632513e-04:  86%|████████▋ | 1563/1809 [11:01<01:40,  2.45it/s]

epoch 1 iter 1632: train loss 0.35076. lr 3.455723e-04:  90%|█████████ | 1633/1809 [11:31<01:17,  2.28it/s]
epoch 1 iter 1633: train loss 0.35165. lr 3.453147e-04:  90%|█████████ | 1634/1809 [11:31<01:16,  2.30it/s]
epoch 1 iter 1634: train loss 0.34699. lr 3.450571e-04:  90%|█████████ | 1635/1809 [11:32<01:15,  2.32it/s]
epoch 1 iter 1635: train loss 0.34820. lr 3.447995e-04:  90%|█████████ | 1636/1809 [11:32<01:14,  2.31it/s]
epoch 1 iter 1636: train loss 0.33591. lr 3.445418e-04:  90%|█████████ | 1636/1809 [11:33<01:14,  2.31it/s]

epoch 1 iter 1705: train loss 0.32999. lr 3.266916e-04:  94%|█████████▍| 1706/1809 [12:03<00:42,  2.40it/s]
epoch 1 iter 1706: train loss 0.33479. lr 3.264320e-04:  94%|█████████▍| 1707/1809 [12:03<00:42,  2.42it/s]
epoch 1 iter 1707: train loss 0.32587. lr 3.261725e-04:  94%|█████████▍| 1708/1809 [12:03<00:41,  2.44it/s]
epoch 1 iter 1708: train loss 0.32873. lr 3.259129e-04:  94%|█████████▍| 1709/1809 [12:04<00:40,  2.45it/s]
epoch 1 iter 1709: train loss 0.33226. lr 3.256533e-04:  94%|█████████▍| 1709/1809 [12:04<00:40,  2.45it/s]

epoch 1 iter 1778: train loss 0.30426. lr 3.077036e-04:  98%|█████████▊| 1779/1809 [12:35<00:14,  2.14it/s]
epoch 1 iter 1779: train loss 0.31622. lr 3.074431e-04:  98%|█████████▊| 1780/1809 [12:35<00:13,  2.14it/s]
epoch 1 iter 1780: train loss 0.30579. lr 3.071826e-04:  98%|█████████▊| 1781/1809 [12:36<00:13,  2.15it/s]
epoch 1 iter 1781: train loss 0.31585. lr 3.069221e-04:  99%|█████████▊| 1782/1809 [12:36<00:12,  2.17it/s]
epoch 1 iter 1782: train loss 0.31181. lr 3.066616e-04:  99%|█████████▊| 1782/1809 [12:37<00:12,  2.17it/s]

epoch 2 iter 44: train loss 0.29226. lr 2.882971e-04:   2%|▏         | 45/1809 [00:20<13:11,  2.23it/s]
epoch 2 iter 45: train loss 0.29438. lr 2.880368e-04:   3%|▎         | 46/1809 [00:21<12:58,  2.26it/s]
epoch 2 iter 46: train loss 0.28516. lr 2.877764e-04:   3%|▎         | 47/1809 [00:21<12:47,  2.29it/s]
epoch 2 iter 47: train loss 0.28867. lr 2.875160e-04:   3%|▎         | 48/1809 [00:21<12:39,  2.32it/s]
epoch 2 iter 48: train loss 0.30106. lr 2.872557e-04:   3%|▎         | 48/1809 [00:22<12:39,  2.32it/s]

epoch 2 iter 755: train loss 0.21379. lr 1.168722e-04:  42%|████▏     | 756/1809 [05:24<08:51,  1.98it/s]
epoch 2 iter 756: train loss 0.20147. lr 1.166658e-04:  42%|████▏     | 757/1809 [05:25<09:01,  1.94it/s]
epoch 2 iter 757: train loss 0.20597. lr 1.164596e-04:  42%|████▏     | 758/1809 [05:25<08:58,  1.95it/s]
epoch 2 iter 758: train loss 0.19599. lr 1.162536e-04:  42%|████▏     | 759/1809 [05:26<08:47,  1.99it/s]
epoch 2 iter 759: train loss 0.20322. lr 1.160477e-04:  42%|████▏     | 760/1809 [05:26<08:36,  2.03it/s]

epoch 2 iter 830: train loss 0.20477. lr 1.017916e-04:  46%|████▌     | 831/1809 [05:56<06:52,  2.37it/s]
epoch 2 iter 831: train loss 0.20145. lr 1.015960e-04:  46%|████▌     | 832/1809 [05:57<06:45,  2.41it/s]
epoch 2 iter 832: train loss 0.20569. lr 1.014007e-04:  46%|████▌     | 833/1809 [05:57<06:41,  2.43it/s]
epoch 2 iter 833: train loss 0.19556. lr 1.012054e-04:  46%|████▌     | 834/1809 [05:57<06:37,  2.45it/s]
epoch 2 iter 834: train loss 0.19972. lr 1.010103e-04:  46%|████▌     | 835/1809 [05:58<06:40,  2.43it/s]

epoch 2 iter 905: train loss 0.19727. lr 8.755187e-05:  50%|█████     | 906/1809 [06:27<06:05,  2.47it/s]
epoch 2 iter 906: train loss 0.19326. lr 8.736796e-05:  50%|█████     | 907/1809 [06:28<06:59,  2.15it/s]
epoch 2 iter 907: train loss 0.19675. lr 8.718422e-05:  50%|█████     | 908/1809 [06:29<06:57,  2.16it/s]
epoch 2 iter 908: train loss 0.19572. lr 8.700064e-05:  50%|█████     | 909/1809 [06:29<06:40,  2.25it/s]
epoch 2 iter 909: train loss 0.20160. lr 8.681721e-05:  50%|█████     | 910/1809 [06:29<06:27,  2.32it/s]

epoch 2 iter 980: train loss 0.19813. lr 7.421346e-05:  54%|█████▍    | 981/1809 [07:02<05:45,  2.40it/s]
epoch 2 iter 981: train loss 0.19366. lr 7.404196e-05:  54%|█████▍    | 982/1809 [07:02<05:42,  2.42it/s]
epoch 2 iter 982: train loss 0.19096. lr 7.387063e-05:  54%|█████▍    | 983/1809 [07:03<05:39,  2.44it/s]
epoch 2 iter 983: train loss 0.19663. lr 7.369948e-05:  54%|█████▍    | 984/1809 [07:03<05:36,  2.45it/s]
epoch 2 iter 984: train loss 0.19106. lr 7.352850e-05:  54%|█████▍    | 985/1809 [07:04<05:34,  2.46it/s]

epoch 2 iter 1054: train loss 0.18889. lr 6.199147e-05:  58%|█████▊    | 1055/1809 [07:34<05:30,  2.28it/s]
epoch 2 iter 1055: train loss 0.19061. lr 6.183293e-05:  58%|█████▊    | 1056/1809 [07:34<05:29,  2.29it/s]
epoch 2 iter 1056: train loss 0.18876. lr 6.167457e-05:  58%|█████▊    | 1057/1809 [07:35<05:27,  2.30it/s]
epoch 2 iter 1057: train loss 0.18733. lr 6.151639e-05:  58%|█████▊    | 1058/1809 [07:35<05:22,  2.33it/s]
epoch 2 iter 1058: train loss 0.18842. lr 6.135839e-05:  58%|█████▊    | 1058/1809 [07:36<05:22,  2.33it/s]

epoch 2 iter 1127: train loss 0.18840. lr 6.000000e-05:  62%|██████▏   | 1128/1809 [08:05<04:50,  2.34it/s]
epoch 2 iter 1128: train loss 0.18313. lr 6.000000e-05:  62%|██████▏   | 1129/1809 [08:06<04:50,  2.34it/s]
epoch 2 iter 1129: train loss 0.18753. lr 6.000000e-05:  62%|██████▏   | 1130/1809 [08:06<04:49,  2.34it/s]
epoch 2 iter 1130: train loss 0.18704. lr 6.000000e-05:  63%|██████▎   | 1131/1809 [08:06<04:44,  2.38it/s]
epoch 2 iter 1131: train loss 0.18047. lr 6.000000e-05:  63%|██████▎   | 1131/1809 [08:07<04:44,  2.38it/s]

epoch 2 iter 1200: train loss 0.18283. lr 6.000000e-05:  66%|██████▋   | 1201/1809 [08:37<04:08,  2.45it/s]
epoch 2 iter 1201: train loss 0.19291. lr 6.000000e-05:  66%|██████▋   | 1202/1809 [08:37<04:07,  2.45it/s]
epoch 2 iter 1202: train loss 0.18439. lr 6.000000e-05:  67%|██████▋   | 1203/1809 [08:38<04:08,  2.44it/s]
epoch 2 iter 1203: train loss 0.18425. lr 6.000000e-05:  67%|██████▋   | 1204/1809 [08:38<04:08,  2.43it/s]
epoch 2 iter 1204: train loss 0.17679. lr 6.000000e-05:  67%|██████▋   | 1204/1809 [08:39<04:08,  2.43it/s]

epoch 2 iter 1273: train loss 0.18695. lr 6.000000e-05:  70%|███████   | 1274/1809 [09:08<03:38,  2.45it/s]
epoch 2 iter 1274: train loss 0.18717. lr 6.000000e-05:  70%|███████   | 1275/1809 [09:08<03:36,  2.46it/s]
epoch 2 iter 1275: train loss 0.18473. lr 6.000000e-05:  71%|███████   | 1276/1809 [09:09<03:52,  2.29it/s]
epoch 2 iter 1276: train loss 0.19076. lr 6.000000e-05:  71%|███████   | 1277/1809 [09:09<04:05,  2.17it/s]
epoch 2 iter 1277: train loss 0.17778. lr 6.000000e-05:  71%|███████   | 1277/1809 [09:10<04:05,  2.17it/s]

epoch 2 iter 1346: train loss 0.19148. lr 6.000000e-05:  74%|███████▍  | 1347/1809 [09:40<03:12,  2.39it/s]
epoch 2 iter 1347: train loss 0.17827. lr 6.000000e-05:  75%|███████▍  | 1348/1809 [09:40<03:23,  2.27it/s]
epoch 2 iter 1348: train loss 0.18529. lr 6.000000e-05:  75%|███████▍  | 1349/1809 [09:41<03:19,  2.30it/s]
epoch 2 iter 1349: train loss 0.18419. lr 6.000000e-05:  75%|███████▍  | 1350/1809 [09:41<03:16,  2.33it/s]
epoch 2 iter 1350: train loss 0.18017. lr 6.000000e-05:  75%|███████▍  | 1350/1809 [09:42<03:16,  2.33it/s]

epoch 2 iter 1419: train loss 0.18098. lr 6.000000e-05:  78%|███████▊  | 1420/1809 [10:11<02:45,  2.35it/s]
epoch 2 iter 1420: train loss 0.18224. lr 6.000000e-05:  79%|███████▊  | 1421/1809 [10:12<02:45,  2.34it/s]
epoch 2 iter 1421: train loss 0.17945. lr 6.000000e-05:  79%|███████▊  | 1422/1809 [10:12<02:46,  2.32it/s]
epoch 2 iter 1422: train loss 0.17669. lr 6.000000e-05:  79%|███████▊  | 1423/1809 [10:13<02:47,  2.30it/s]
epoch 2 iter 1423: train loss 0.18442. lr 6.000000e-05:  79%|███████▊  | 1423/1809 [10:13<02:47,  2.30it/s]

epoch 2 iter 1492: train loss 0.18012. lr 6.000000e-05:  83%|████████▎ | 1493/1809 [10:43<02:17,  2.30it/s]
epoch 2 iter 1493: train loss 0.17668. lr 6.000000e-05:  83%|████████▎ | 1494/1809 [10:43<02:17,  2.30it/s]
epoch 2 iter 1494: train loss 0.17767. lr 6.000000e-05:  83%|████████▎ | 1495/1809 [10:44<02:15,  2.32it/s]
epoch 2 iter 1495: train loss 0.17651. lr 6.000000e-05:  83%|████████▎ | 1496/1809 [10:44<02:12,  2.36it/s]
epoch 2 iter 1496: train loss 0.17732. lr 6.000000e-05:  83%|████████▎ | 1496/1809 [10:45<02:12,  2.36it/s]

epoch 2 iter 1565: train loss 0.17835. lr 6.000000e-05:  87%|████████▋ | 1566/1809 [11:15<01:53,  2.14it/s]
epoch 2 iter 1566: train loss 0.17961. lr 6.000000e-05:  87%|████████▋ | 1567/1809 [11:15<01:50,  2.19it/s]
epoch 2 iter 1567: train loss 0.18267. lr 6.000000e-05:  87%|████████▋ | 1568/1809 [11:16<01:48,  2.22it/s]
epoch 2 iter 1568: train loss 0.17874. lr 6.000000e-05:  87%|████████▋ | 1569/1809 [11:16<01:46,  2.24it/s]
epoch 2 iter 1569: train loss 0.17833. lr 6.000000e-05:  87%|████████▋ | 1569/1809 [11:16<01:46,  2.24it/s]

epoch 2 iter 1638: train loss 0.17921. lr 6.000000e-05:  91%|█████████ | 1639/1809 [11:46<01:09,  2.46it/s]
epoch 2 iter 1639: train loss 0.17981. lr 6.000000e-05:  91%|█████████ | 1640/1809 [11:47<01:08,  2.47it/s]
epoch 2 iter 1640: train loss 0.17452. lr 6.000000e-05:  91%|█████████ | 1641/1809 [11:48<01:31,  1.83it/s]
epoch 2 iter 1641: train loss 0.17843. lr 6.000000e-05:  91%|█████████ | 1642/1809 [11:48<01:33,  1.79it/s]
epoch 2 iter 1642: train loss 0.17997. lr 6.000000e-05:  91%|█████████ | 1642/1809 [11:49<01:33,  1.79it/s]

epoch 2 iter 1711: train loss 0.17580. lr 6.000000e-05:  95%|█████████▍| 1712/1809 [12:20<00:40,  2.38it/s]
epoch 2 iter 1712: train loss 0.17207. lr 6.000000e-05:  95%|█████████▍| 1713/1809 [12:20<00:39,  2.41it/s]
epoch 2 iter 1713: train loss 0.16865. lr 6.000000e-05:  95%|█████████▍| 1714/1809 [12:21<00:39,  2.43it/s]
epoch 2 iter 1714: train loss 0.17817. lr 6.000000e-05:  95%|█████████▍| 1715/1809 [12:21<00:39,  2.39it/s]
epoch 2 iter 1715: train loss 0.18207. lr 6.000000e-05:  95%|█████████▍| 1715/1809 [12:21<00:39,  2.39it/s]

epoch 2 iter 1784: train loss 0.18000. lr 6.000000e-05:  99%|█████████▊| 1785/1809 [12:51<00:09,  2.44it/s]
epoch 2 iter 1785: train loss 0.17540. lr 6.000000e-05:  99%|█████████▊| 1786/1809 [12:52<00:09,  2.45it/s]
epoch 2 iter 1786: train loss 0.17413. lr 6.000000e-05:  99%|█████████▉| 1787/1809 [12:52<00:08,  2.46it/s]
epoch 2 iter 1787: train loss 0.17865. lr 6.000000e-05:  99%|█████████▉| 1788/1809 [12:53<00:08,  2.38it/s]
epoch 2 iter 1788: train loss 0.17855. lr 6.000000e-05:  99%|█████████▉| 1788/1809 [12:53<00:08,  2.38it/s]

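The lr column in the run above starts close to 6e-04 and then sits at exactly 6.000000e-05 for the last third of epoch 2. That pattern is what a cosine decay clamped at 10% of the peak would produce. The sketch below is an illustration under that assumption: the function name, its defaults (peak of 6e-4, `final_tokens`) and the extra clamp on `progress` are hypothetical and are not copied from the trainer used in this notebook.

```python
import math

def cosine_lr(tokens_seen, peak_lr=6e-4, warmup_tokens=0,
              final_tokens=1_000_000, floor=0.1):
    """Cosine learning-rate decay with linear warmup and a lower bound.

    All argument names and defaults are illustrative assumptions.
    """
    if tokens_seen < warmup_tokens:
        # linear warmup from 0 up to the peak learning rate
        mult = tokens_seen / max(1, warmup_tokens)
    else:
        progress = (tokens_seen - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        # clamp so the rate cannot rise again past the end of the schedule
        # (an addition made for this sketch)
        progress = min(1.0, progress)
        # half a cosine period from 1 down to 0, never below `floor`
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return peak_lr * mult

# The multiplier reaches the 0.1 floor once progress passes roughly 0.8,
# after which the rate stays flat at peak_lr * floor.
for frac in (0.0, 0.5, 0.8, 1.0):
    print(frac, cosine_lr(int(frac * 1_000_000)))
```

With a floor of 0.1 the rate stops changing once about 80% of the schedule has elapsed, which roughly matches where the logged value freezes in epoch 2 of 2.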
data has 218385 characters, 16623 unique.



epoch 1 iter 0: train loss 9.82300. lr 5.999996e-04:   0%|          | 1/853 [00:00<07:01,  2.02it/s]
epoch 1 iter 1: train loss 9.20596. lr 5.999983e-04:   0%|          | 2/853 [00:00<06:36,  2.15it/s]
epoch 1 iter 2: train loss 8.81947. lr 5.999959e-04:   0%|          | 3/853 [00:01<06:18,  2.25it/s]
epoch 1 iter 3: train loss 8.67595. lr 5.999925e-04:   0%|          | 4/853 [00:01<06:07,  2.31it/s]
epoch 1 iter 4: train loss 8.37344. lr 5.999881e-04:   0%|          | 4/853 [00:02<06:07,  2.31it/s]

epoch 1 iter 77: train loss 5.85716. lr 5.969192e-04:   9%|▉         | 78/853 [00:32<05:15,  2.46it/s]
epoch 1 iter 78: train loss 5.82457. lr 5.968397e-04:   9%|▉         | 79/853 [00:32<05:16,  2.45it/s]
epoch 1 iter 79: train loss 5.84702. lr 5.967592e-04:   9%|▉         | 80/853 [00:32<05:19,  2.42it/s]
epoch 1 iter 80: train loss 5.84464. lr 5.966777e-04:   9%|▉         | 81/853 [00:33<05:17,  2.43it/s]
epoch 1 iter 81: train loss 5.78319. lr 5.965951e-04:   9%|▉         | 81/853 [00:33<05:17,  2.43it/s]

epoch 1 iter 152: train loss 4.74106. lr 5.881811e-04:  18%|█▊        | 153/853 [01:03<04:45,  2.45it/s]
epoch 1 iter 153: train loss 4.67961. lr 5.880270e-04:  18%|█▊        | 154/853 [01:03<04:44,  2.45it/s]
epoch 1 iter 154: train loss 4.74189. lr 5.878719e-04:  18%|█▊        | 155/853 [01:03<04:43,  2.46it/s]
epoch 1 iter 155: train loss 4.76004. lr 5.877158e-04:  18%|█▊        | 156/853 [01:04<04:42,  2.47it/s]
epoch 1 iter 156: train loss 4.62053. lr 5.875588e-04:  18%|█▊        | 157/853 [01:04<05:07,  2.26it/s]

epoch 1 iter 227: train loss 3.41137. lr 5.739481e-04:  27%|██▋       | 228/853 [01:34<04:16,  2.44it/s]
epoch 1 iter 228: train loss 3.54570. lr 5.737223e-04:  27%|██▋       | 229/853 [01:34<04:16,  2.43it/s]
epoch 1 iter 229: train loss 3.33743. lr 5.734956e-04:  27%|██▋       | 230/853 [01:35<04:15,  2.44it/s]
epoch 1 iter 230: train loss 3.32224. lr 5.732679e-04:  27%|██▋       | 231/853 [01:35<04:15,  2.44it/s]
epoch 1 iter 231: train loss 3.39159. lr 5.730394e-04:  27%|██▋       | 232/853 [01:36<04:18,  2.40it/s]

epoch 1 iter 302: train loss 2.24055. lr 5.544915e-04:  36%|███▌      | 303/853 [02:06<03:47,  2.42it/s]
epoch 1 iter 303: train loss 2.25090. lr 5.541984e-04:  36%|███▌      | 304/853 [02:07<03:45,  2.43it/s]
epoch 1 iter 304: train loss 2.23993. lr 5.539044e-04:  36%|███▌      | 305/853 [02:07<03:44,  2.44it/s]
epoch 1 iter 305: train loss 2.20106. lr 5.536095e-04:  36%|███▌      | 306/853 [02:07<03:43,  2.45it/s]
epoch 1 iter 306: train loss 2.21969. lr 5.533138e-04:  36%|███▌      | 307/853 [02:08<03:42,  2.45it/s]

epoch 1 iter 377: train loss 1.38811. lr 5.301824e-04:  44%|████▍     | 378/853 [02:38<03:29,  2.27it/s]
epoch 1 iter 378: train loss 1.39569. lr 5.298275e-04:  44%|████▍     | 379/853 [02:38<03:26,  2.30it/s]
epoch 1 iter 379: train loss 1.27486. lr 5.294718e-04:  45%|████▍     | 380/853 [02:38<03:23,  2.33it/s]
epoch 1 iter 380: train loss 1.29384. lr 5.291154e-04:  45%|████▍     | 381/853 [02:39<03:19,  2.36it/s]
epoch 1 iter 381: train loss 1.35344. lr 5.287581e-04:  45%|████▍     | 382/853 [02:39<03:17,  2.39it/s]

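A quick sanity check on the first logged loss of each run: a model that spreads probability uniformly over V distinct tokens has a cross-entropy of ln(V). The sketch below compares that baseline with the iter 0 losses printed above, taking V from the "data has ... unique" lines. The helper name is hypothetical.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` tokens."""
    return math.log(vocab_size)

# (vocabulary size, loss at "epoch 1 iter 0") pairs copied from the output above
runs = [(11574, 9.44420), (16623, 9.82300)]

for vocab_size, first_loss in runs:
    baseline = uniform_baseline_loss(vocab_size)
    print(f"V={vocab_size}: ln(V)={baseline:.3f}, first logged loss={first_loss:.3f}")
```

Both runs start about 0.1 nats above ln(V), which is consistent with a freshly initialised model predicting close to uniformly over the vocabulary.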
In [15]:
from minGPT.mingpt.utils import sample

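def generate_dataset(train_text_file, state_dict_file, n_layer=8, n_head=8, n_embd=512,
                     texts_count=1, text_len=100):
    """Yield `texts_count` texts sampled from the GPT stored in `state_dict_file`.

    Uses the notebook-level `tokenizer` and `block_size`; n_layer, n_head and n_embd
    must match the values the checkpoint was trained with.
    """
    # Rebuild the vocabulary from the training text so token ids match the checkpoint.
    with open(train_text_file, 'r') as f:
        text_sentences = nltk.tokenize.sent_tokenize(f.read())
    # Drop the first and last token of every sentence (special tokens) and flatten to strings.
    tokens = [str(token) for sent in text_sentences for token in tokenizer.tokenize(sent)[1:-1]]
    train_dataset = BPEDataset(tokens, block_size)

    mconf = GPTConfig(
        train_dataset.vocab_size, train_dataset.block_size,
        n_layer=n_layer, n_head=n_head, n_embd=n_embd
    )
    model = GPT(mconf)
    model.load_state_dict(torch.load(state_dict_file))
    print("model is loaded")
    # The Trainer is used here only for device placement (trainer.device); nothing is trained.
    tconf = TrainerConfig(num_workers=1)
    trainer = Trainer(model, train_dataset, None, tconf)

    for text_id in range(texts_count):
        # Seed generation with a single token drawn uniformly from the vocabulary.
        context = [train_dataset.itos[np.random.randint(train_dataset.vocab_size)]]
        x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
        # Sample text_len further tokens, keeping only the 10 most likely candidates at each step.
        y = sample(model, x, text_len, temperature=1.0, sample=True, top_k=10)[0]
        # Join the tokens and glue WordPiece continuations ('##...') back onto the preceding piece.
        completion = ' '.join([train_dataset.itos[int(i)] for i in y]).replace(' ##', '')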
        yield completion
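
The call to `sample` above passes `temperature=1.0` and `top_k=10`. As a library-free illustration of what those two settings mean for a single step, here is a minimal sketch; the helper name and its plain-list interface are assumptions made for this example, and it is not the minGPT implementation, which works on torch tensors.

```python
import math
import random

def top_k_sample(logits, k=10, temperature=1.0, rng=random):
    """Pick one index from `logits` using temperature scaling and top-k filtering."""
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    # keep only the indices of the k largest scaled logits
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # softmax weights over the survivors (shifted by the max for numerical stability)
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    # draw one index in proportion to its weight
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(top_k_sample(logits, k=1))  # with k=1 this is always the argmax, index 1
print(top_k_sample(logits, k=2))  # index 1 or 3, the two largest logits
```

Restricting each step to the 10 most likely tokens keeps rare, low-probability tokens out of the generated text while still leaving some randomness.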

In [16]:
for x in generate_dataset(pathjoin(GENRE_DATA_DIR, LANG, 'A1.txt'), pathjoin(GPT_MODELS_DIR, LANG, 'A1')):
    print(x)

data has 463101 characters, 16955 unique.
model is loaded
In fact , a new global classroom and most of these students have by based on higher energy and materials . And I ' m pleased to say that far from global warming and mis is that I am I the United States of America and American mercenaries , we must also do something about how we can do to have cities and geopolitical considerations can not over the world . But I will ever give the way to an Iraqi / Iran about anything - Nonproliferation and geopolitical issues
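
The spacing in the sample above (for example "I ' m" and " , ") follows from how tokens are joined back into text: every token is separated by a space, and only WordPiece continuation pieces, which start with `##`, are glued onto the previous piece. A minimal sketch of that joining step follows; the function name and the example token list are hypothetical, and the real tokenizer may split these words differently.

```python
def join_wordpieces(tokens):
    """Join tokens with spaces, then attach '##' continuation pieces to the previous piece."""
    return ' '.join(tokens).replace(' ##', '')

# hypothetical token sequence, chosen only to show both behaviours
pieces = ['I', "'", 'm', 'pleased', ',', 'geo', '##political']
print(join_wordpieces(pieces))  # I ' m pleased , geopolitical
```

Punctuation stays separated by spaces because it is emitted as ordinary tokens rather than as `##` continuations.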


