# **Homework 7 - Bert (Question Answering)**

If you have any questions, feel free to email us at mlta-2022-spring@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1H5ZONrb2LMOCixLY7D5_5-7LkIaXO6AGEaV2mRdTOMY/edit?usp=sharing)　Kaggle: [Link](https://www.kaggle.com/c/ml2022spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine tune a pretrained model on downstream task using transformers

- Todo
    - Fine tune a pretrained chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (tesla t4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2.5hrs
  

## Download Dataset

In [4]:
# # Download link 1
# !gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip

# # Download Link 2 (if the above link fails) 
# # !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip

# # Download Link 3 (if the above link fails) 
# # !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip

# !unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Fri Apr 22 21:25:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:65:00.0  On |                  N/A |
|  0%   51C    P2    61W / 275W |    323MiB / 11177MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [5]:
# You are allowed to change version of transformers or use other toolkits
# !pip install transformers==4.5.0

## Import Packages

In [6]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast
from tqdm.auto import tqdm
import transformers


device = "cuda" if torch.cuda.is_available() else "cpu"


# Fix random seed for reproducibility
def same_seeds(seed):
	  torch.manual_seed(seed)
	  if torch.cuda.is_available():
		    torch.cuda.manual_seed(seed)
		    torch.cuda.manual_seed_all(seed)
	  np.random.seed(seed)
	  random.seed(seed)
	  torch.backends.cudnn.benchmark = False
	  torch.backends.cudnn.deterministic = True


same_seeds(0)


In [7]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)	
fp16_training = False

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

## Load Model and Tokenizer




 

In [8]:
# model = BertForQuestionAnswering.from_pretrained("uer/roberta-base-chinese-extractive-qa").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("uer/roberta-base-chinese-extractive-qa")

model = BertForQuestionAnswering.from_pretrained("hfl/chinese-macbert-large").to(device)
tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-macbert-large")

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

Some weights of the model checkpoint at hfl/chinese-macbert-large were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the

## Read Data

- Training set: 31690 QA pairs
- Dev set: 4131  QA pairs
- Test set: 4957  QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indexs in paragraphs
  - A paragraph may be used by several questions 

In [9]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

In [10]:
# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in datset __getitem__ 

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message as tokenized sequences will be futher processed in datset __getitem__ before passing to model

## Dataset and Dataloader

In [11]:
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 40
        self.max_paragraph_len = 200
        # self.max_paragraph_len = 150

        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 50
        # self.doc_stride = 150

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # A single window is obtained by slicing the portion of paragraph containing the answer
            mid = int((answer_start_token + answer_end_token) // (2+random.uniform(-1, 1)))
            paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))
            paragraph_end = paragraph_start + self.max_paragraph_len
            
            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]		
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start
            
            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 4

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)

## Function for Evaluation

In [12]:
def index_tokenize(tokens, start, end):
    char_count, new_start, new_end = 0, 512, 512
    start_flag = False  
    # print("start:", start)
    # print("end:", end)
    for i, token in enumerate(tokens):
        # print("i:",i, " token:", token)
        if token == '[UNK]' or token == '[CLS]' or token == '[SEP]':
            if i == start:
                new_start = char_count
                # print("new_start1:", new_start)

            if i == end:
                new_end = char_count
                return new_start, new_end
                
            char_count += 1
            # print("char_count1:",char_count)
        else:
            for ch in token:
                if i == start and start_flag == False:
                    new_start = char_count
                    start_flag = True
                    # print("new_start2:", new_start)

                if i == end:
                    new_end = char_count
                    return new_start, new_end

                if ch != '#':
                    char_count += 1
                # print("char_count2:",char_count)

    return start, end              

In [13]:
def evaluate(data, output, doc_stride=150, paragraph=None, paragraph_tokenized=None):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    entire_start_index = 0
    entire_end_index = 0

    # print(data)
    for k in range(num_of_windows):
        # print('window', k)
        # Obtain answer by choosing the most probable start position / end position
        mask = data[1][0][k].bool() & data[2][0][k].bool() # get document, token_type_ids & attention_mask
        mask = mask.to(device)

        masked_output_start = torch.masked_select(output.start_logits[k], mask)[:-1] # last one is [SEP]
        start_prob, start_index = torch.max(masked_output_start, dim=0)

        masked_output_start = torch.masked_select(output.end_logits[k], mask)[:-1] # last one is [SEP]
        end_prob, end_index = torch.max(masked_output_start, dim=0)
        
        # Probability of answer is calculated as sum of start_prob and end_prob
        prob = start_prob + end_prob
        masked_data = torch.masked_select(data[0][0][k].to(device), mask)[:-1]
        
        # Replace answer if calculated probability is larger than previous windows
        if (prob > max_prob) and (end_index - start_index <= 20) and (end_index > start_index):
            max_prob = prob
            entire_start_index = start_index.item() + doc_stride * k
            entire_end_index = end_index.item() + doc_stride * k
            # print("entire_start_index", entire_start_index)
            # print("entire_end_index", entire_end_index)
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(masked_data[start_index : end_index + 1])
    # 若 [UNK] 在 prediction，使用原始的 paragraph
    if '[UNK]' in answer:
        print('found [UNK] in prediction.')
        print('original pred:', answer)

        new_start, new_end = index_tokenize(tokens=paragraph_tokenized, start=entire_start_index, end=entire_end_index)
        # print("new_start",new_start)
        # print("new_end", new_end)

        answer = paragraph[new_start:new_end+1]
        print('final prediction',answer)

    # Remove spaces in answer (e.g. "大 金" --> "大金")
    return answer.replace(' ','')

## Training

In [15]:
num_epoch = 5
validation = True
logging_step = 100
learning_rate = 1e-5
doc_stride = 50
# doc_stride = 150
accum_iter = 4


optimizer = AdamW(model.parameters(), lr=learning_rate)

num_training_steps = len(train_loader) * num_epoch

# scheudler = transformers.get_polynomial_decay_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps, lr_end = 1e-07, power = 1.0, last_epoch = -1)
scheudler = transformers.get_cosine_with_hard_restarts_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = num_training_steps)

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 

model.train()

print("Start Training ...")

for epoch in range(num_epoch):
    step = 1
    train_loss = train_acc = 0
    
    for idx, data in enumerate(tqdm(train_loader)):	
        # Load all data into GPU
        data = [i.to(device) for i in data]
        
        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)
        
        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
        train_loss += output.loss
        output.loss = output.loss / accum_iter
        
        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()

        
        if ((idx + 1) % accum_iter == 0) or (idx + 1 == len(train_loader)):
            optimizer.step()
            scheudler.step()
            optimizer.zero_grad()
        step += 1

        ##### TODO: Apply linear learning rate decay #####
        
        
        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
            train_loss = train_acc = 0

    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                dev_acc += evaluate(data, output, doc_stride, dev_paragraphs[dev_questions[i]['paragraph_id']], 
                    dev_paragraphs_tokenized[dev_questions[i]['paragraph_id']].tokens) == dev_questions[i]["answer_text"]
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

# Save a model and its configuration file to the directory 「saved_model」 
# i.e. there are two files under the direcory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print("Saving Model ...")
model_save_dir = "saved_model" 
model.save_pretrained(model_save_dir)

Start Training ...


  1%|          | 99/7923 [00:35<54:33,  2.39it/s]

Epoch 1 | Step 100 | loss = 4.672, acc = 0.037


  3%|▎         | 199/7923 [01:12<52:36,  2.45it/s]

Epoch 1 | Step 200 | loss = 2.603, acc = 0.257


  4%|▍         | 299/7923 [01:48<51:00,  2.49it/s]

Epoch 1 | Step 300 | loss = 1.755, acc = 0.477


  5%|▌         | 399/7923 [02:24<50:06,  2.50it/s]

Epoch 1 | Step 400 | loss = 1.283, acc = 0.582


  6%|▋         | 499/7923 [03:00<49:32,  2.50it/s]

Epoch 1 | Step 500 | loss = 1.277, acc = 0.560


  8%|▊         | 599/7923 [03:36<49:09,  2.48it/s]

Epoch 1 | Step 600 | loss = 1.165, acc = 0.610


  9%|▉         | 699/7923 [04:12<48:50,  2.47it/s]

Epoch 1 | Step 700 | loss = 1.266, acc = 0.613


 10%|█         | 799/7923 [04:48<48:33,  2.45it/s]

Epoch 1 | Step 800 | loss = 1.179, acc = 0.607


 11%|█▏        | 899/7923 [05:24<45:52,  2.55it/s]

Epoch 1 | Step 900 | loss = 1.013, acc = 0.640


 13%|█▎        | 999/7923 [06:00<46:50,  2.46it/s]

Epoch 1 | Step 1000 | loss = 0.958, acc = 0.650


 14%|█▍        | 1099/7923 [06:35<44:36,  2.55it/s]

Epoch 1 | Step 1100 | loss = 0.991, acc = 0.670


 15%|█▌        | 1199/7923 [07:11<45:00,  2.49it/s]

Epoch 1 | Step 1200 | loss = 1.101, acc = 0.625


 16%|█▋        | 1299/7923 [07:46<42:20,  2.61it/s]

Epoch 1 | Step 1300 | loss = 0.963, acc = 0.660


 18%|█▊        | 1399/7923 [08:21<43:58,  2.47it/s]

Epoch 1 | Step 1400 | loss = 0.948, acc = 0.655


 19%|█▉        | 1499/7923 [08:57<43:08,  2.48it/s]

Epoch 1 | Step 1500 | loss = 0.881, acc = 0.680


 20%|██        | 1599/7923 [09:33<43:14,  2.44it/s]

Epoch 1 | Step 1600 | loss = 0.967, acc = 0.645


 21%|██▏       | 1699/7923 [10:09<40:32,  2.56it/s]

Epoch 1 | Step 1700 | loss = 0.879, acc = 0.670


 23%|██▎       | 1799/7923 [10:45<40:28,  2.52it/s]

Epoch 1 | Step 1800 | loss = 1.062, acc = 0.635


 24%|██▍       | 1899/7923 [11:21<41:15,  2.43it/s]

Epoch 1 | Step 1900 | loss = 0.749, acc = 0.710


 25%|██▌       | 1999/7923 [11:57<39:17,  2.51it/s]

Epoch 1 | Step 2000 | loss = 0.803, acc = 0.735


 26%|██▋       | 2099/7923 [12:34<38:12,  2.54it/s]

Epoch 1 | Step 2100 | loss = 0.862, acc = 0.680


 28%|██▊       | 2199/7923 [13:10<38:45,  2.46it/s]

Epoch 1 | Step 2200 | loss = 0.942, acc = 0.688


 29%|██▉       | 2299/7923 [13:47<37:25,  2.50it/s]

Epoch 1 | Step 2300 | loss = 1.010, acc = 0.665


 30%|███       | 2399/7923 [14:23<37:01,  2.49it/s]

Epoch 1 | Step 2400 | loss = 0.817, acc = 0.707


 32%|███▏      | 2499/7923 [14:59<36:41,  2.46it/s]

Epoch 1 | Step 2500 | loss = 0.809, acc = 0.697


 33%|███▎      | 2599/7923 [15:35<35:54,  2.47it/s]

Epoch 1 | Step 2600 | loss = 0.992, acc = 0.662


 34%|███▍      | 2699/7923 [16:11<34:43,  2.51it/s]

Epoch 1 | Step 2700 | loss = 0.790, acc = 0.722


 35%|███▌      | 2799/7923 [16:47<33:02,  2.58it/s]

Epoch 1 | Step 2800 | loss = 0.881, acc = 0.730


 37%|███▋      | 2899/7923 [17:22<33:06,  2.53it/s]

Epoch 1 | Step 2900 | loss = 0.820, acc = 0.722


 38%|███▊      | 2999/7923 [17:58<32:51,  2.50it/s]

Epoch 1 | Step 3000 | loss = 0.687, acc = 0.712


 39%|███▉      | 3099/7923 [18:33<31:06,  2.58it/s]

Epoch 1 | Step 3100 | loss = 0.827, acc = 0.665


 40%|████      | 3199/7923 [19:09<30:29,  2.58it/s]

Epoch 1 | Step 3200 | loss = 0.730, acc = 0.735


 42%|████▏     | 3299/7923 [19:45<31:45,  2.43it/s]

Epoch 1 | Step 3300 | loss = 1.028, acc = 0.680


 43%|████▎     | 3399/7923 [20:21<30:54,  2.44it/s]

Epoch 1 | Step 3400 | loss = 0.788, acc = 0.705


 44%|████▍     | 3499/7923 [20:57<29:18,  2.52it/s]

Epoch 1 | Step 3500 | loss = 0.762, acc = 0.757


 45%|████▌     | 3599/7923 [21:33<28:31,  2.53it/s]

Epoch 1 | Step 3600 | loss = 0.802, acc = 0.695


 47%|████▋     | 3699/7923 [22:10<29:02,  2.42it/s]

Epoch 1 | Step 3700 | loss = 0.687, acc = 0.770


 48%|████▊     | 3799/7923 [22:46<28:11,  2.44it/s]

Epoch 1 | Step 3800 | loss = 0.880, acc = 0.665


 49%|████▉     | 3899/7923 [23:22<26:08,  2.57it/s]

Epoch 1 | Step 3900 | loss = 0.722, acc = 0.710


 50%|█████     | 3999/7923 [23:59<26:07,  2.50it/s]

Epoch 1 | Step 4000 | loss = 0.785, acc = 0.717


 52%|█████▏    | 4099/7923 [24:35<25:22,  2.51it/s]

Epoch 1 | Step 4100 | loss = 0.823, acc = 0.722


 53%|█████▎    | 4199/7923 [25:11<25:47,  2.41it/s]

Epoch 1 | Step 4200 | loss = 0.568, acc = 0.785


 54%|█████▍    | 4299/7923 [25:47<24:05,  2.51it/s]

Epoch 1 | Step 4300 | loss = 0.636, acc = 0.770


 56%|█████▌    | 4399/7923 [26:22<23:52,  2.46it/s]

Epoch 1 | Step 4400 | loss = 0.681, acc = 0.730


 57%|█████▋    | 4499/7923 [26:58<22:35,  2.53it/s]

Epoch 1 | Step 4500 | loss = 0.765, acc = 0.727


 58%|█████▊    | 4599/7923 [27:33<21:57,  2.52it/s]

Epoch 1 | Step 4600 | loss = 0.796, acc = 0.722


 59%|█████▉    | 4699/7923 [28:08<21:26,  2.51it/s]

Epoch 1 | Step 4700 | loss = 0.608, acc = 0.748


 61%|██████    | 4799/7923 [28:44<20:47,  2.51it/s]

Epoch 1 | Step 4800 | loss = 0.771, acc = 0.700


 62%|██████▏   | 4899/7923 [29:19<19:44,  2.55it/s]

Epoch 1 | Step 4900 | loss = 0.776, acc = 0.720


 63%|██████▎   | 4999/7923 [29:55<19:02,  2.56it/s]

Epoch 1 | Step 5000 | loss = 0.738, acc = 0.740


 64%|██████▍   | 5099/7923 [30:32<19:28,  2.42it/s]

Epoch 1 | Step 5100 | loss = 0.651, acc = 0.717


 66%|██████▌   | 5199/7923 [31:08<18:16,  2.48it/s]

Epoch 1 | Step 5200 | loss = 0.725, acc = 0.720


 67%|██████▋   | 5299/7923 [31:44<17:23,  2.52it/s]

Epoch 1 | Step 5300 | loss = 0.808, acc = 0.707


 68%|██████▊   | 5399/7923 [32:20<17:05,  2.46it/s]

Epoch 1 | Step 5400 | loss = 0.585, acc = 0.785


 69%|██████▉   | 5499/7923 [32:57<16:05,  2.51it/s]

Epoch 1 | Step 5500 | loss = 0.854, acc = 0.697


 71%|███████   | 5599/7923 [33:33<15:28,  2.50it/s]

Epoch 1 | Step 5600 | loss = 0.708, acc = 0.745


 72%|███████▏  | 5699/7923 [34:09<14:34,  2.54it/s]

Epoch 1 | Step 5700 | loss = 0.716, acc = 0.750


 73%|███████▎  | 5799/7923 [34:45<14:23,  2.46it/s]

Epoch 1 | Step 5800 | loss = 0.673, acc = 0.735


 74%|███████▍  | 5899/7923 [35:21<13:10,  2.56it/s]

Epoch 1 | Step 5900 | loss = 0.780, acc = 0.738


 76%|███████▌  | 5999/7923 [35:57<13:05,  2.45it/s]

Epoch 1 | Step 6000 | loss = 0.718, acc = 0.707


 77%|███████▋  | 6099/7923 [36:33<11:41,  2.60it/s]

Epoch 1 | Step 6100 | loss = 0.752, acc = 0.725


 78%|███████▊  | 6199/7923 [37:09<11:35,  2.48it/s]

Epoch 1 | Step 6200 | loss = 0.707, acc = 0.738


 80%|███████▉  | 6299/7923 [37:44<10:33,  2.57it/s]

Epoch 1 | Step 6300 | loss = 0.698, acc = 0.727


 81%|████████  | 6399/7923 [38:20<10:08,  2.50it/s]

Epoch 1 | Step 6400 | loss = 0.678, acc = 0.738


 82%|████████▏ | 6499/7923 [38:55<09:26,  2.51it/s]

Epoch 1 | Step 6500 | loss = 0.630, acc = 0.743


 83%|████████▎ | 6599/7923 [39:31<08:50,  2.50it/s]

Epoch 1 | Step 6600 | loss = 0.551, acc = 0.770


 85%|████████▍ | 6699/7923 [40:07<08:04,  2.52it/s]

Epoch 1 | Step 6700 | loss = 0.715, acc = 0.745


 86%|████████▌ | 6799/7923 [40:43<07:27,  2.51it/s]

Epoch 1 | Step 6800 | loss = 0.606, acc = 0.745


 87%|████████▋ | 6899/7923 [41:19<06:51,  2.49it/s]

Epoch 1 | Step 6900 | loss = 0.716, acc = 0.730


 88%|████████▊ | 6999/7923 [41:55<06:22,  2.41it/s]

Epoch 1 | Step 7000 | loss = 0.720, acc = 0.740


 90%|████████▉ | 7099/7923 [42:32<05:30,  2.49it/s]

Epoch 1 | Step 7100 | loss = 0.706, acc = 0.732


 91%|█████████ | 7199/7923 [43:08<05:00,  2.41it/s]

Epoch 1 | Step 7200 | loss = 0.592, acc = 0.750


 92%|█████████▏| 7299/7923 [43:45<04:12,  2.48it/s]

Epoch 1 | Step 7300 | loss = 0.696, acc = 0.722


 93%|█████████▎| 7399/7923 [44:21<03:35,  2.44it/s]

Epoch 1 | Step 7400 | loss = 0.661, acc = 0.782


 95%|█████████▍| 7499/7923 [44:57<02:45,  2.57it/s]

Epoch 1 | Step 7500 | loss = 0.631, acc = 0.732


 96%|█████████▌| 7599/7923 [45:33<02:11,  2.46it/s]

Epoch 1 | Step 7600 | loss = 0.712, acc = 0.735


 97%|█████████▋| 7699/7923 [46:09<01:29,  2.52it/s]

Epoch 1 | Step 7700 | loss = 0.644, acc = 0.748


 98%|█████████▊| 7799/7923 [46:45<00:50,  2.47it/s]

Epoch 1 | Step 7800 | loss = 0.596, acc = 0.727


100%|█████████▉| 7899/7923 [47:20<00:09,  2.54it/s]

Epoch 1 | Step 7900 | loss = 0.619, acc = 0.760


100%|██████████| 7923/7923 [47:29<00:00,  2.78it/s]


Evaluating Dev Set ...


  6%|▌         | 228/4131 [00:59<17:20,  3.75it/s]

found [UNK] in prediction.
original pred: 李 [UNK]
final prediction 李杲


  9%|▉         | 390/4131 [01:42<16:52,  3.70it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 10%|▉         | 393/4131 [01:43<15:06,  4.12it/s]

found [UNK] in prediction.
original pred: [UNK] 崎 八 幡 宮
final prediction 筥崎八幡宮


 12%|█▏        | 504/4131 [02:13<15:00,  4.03it/s]

found [UNK] in prediction.
original pred: 與 慕 容 [UNK] 雙 方 不 和
final prediction 與慕容廆雙方不和


 17%|█▋        | 700/4131 [03:05<14:59,  3.82it/s]

found [UNK] in prediction.
original pred: 對 日 本 報 紙 的 無 恥 造 謠 誣 [UNK] ， 進 行 了 有 力 駁 斥
final prediction 對日本報紙的無恥造謠誣衊，進行了有力駁斥


 20%|██        | 830/4131 [03:40<12:30,  4.40it/s]

found [UNK] in prediction.
original pred: 木 骨 [UNK]
final prediction 木骨閭


 20%|██        | 842/4131 [03:43<12:51,  4.26it/s]

found [UNK] in prediction.
original pred: 杜 恆 - [UNK] 因 論 題
final prediction 杜恆-蒯因論題


 24%|██▍       | 984/4131 [04:22<14:07,  3.72it/s]

found [UNK] in prediction.
original pred: 青 翁 三 足 [UNK]
final prediction 青翁三足缶


 34%|███▎      | 1384/4131 [06:09<11:05,  4.13it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 34%|███▍      | 1401/4131 [06:14<13:14,  3.44it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 37%|███▋      | 1512/4131 [06:44<11:41,  3.73it/s]

found [UNK] in prediction.
original pred: 免 再 次 爆 發 內 [UNK]
final prediction 免再次爆發內訌


 45%|████▍     | 1846/4131 [08:13<08:36,  4.42it/s]

found [UNK] in prediction.
original pred: [UNK] 船
final prediction 艚船


 53%|█████▎    | 2205/4131 [09:47<08:03,  3.98it/s]

found [UNK] in prediction.
original pred: 朱 載 [UNK]
final prediction 朱載堉


 55%|█████▍    | 2262/4131 [10:02<08:28,  3.68it/s]

found [UNK] in prediction.
original pred: 陳 梅 坪 那 裏 學 習 訓 [UNK] 學
final prediction 佛山陳梅坪那裏學習訓


 67%|██████▋   | 2752/4131 [12:09<05:50,  3.93it/s]

found [UNK] in prediction.
original pred: w. v. [UNK] 因
final prediction W.V.蒯因


 71%|███████   | 2937/4131 [12:59<06:53,  2.88it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 77%|███████▋  | 3168/4131 [13:58<02:56,  5.46it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 90%|█████████ | 3718/4131 [16:01<01:14,  5.56it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 96%|█████████▋| 3982/4131 [16:59<00:33,  4.41it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 學習訓詁學


100%|██████████| 4131/4131 [17:35<00:00,  3.91it/s]


Validation | Epoch 1 | acc = 0.788


  1%|          | 99/7923 [00:35<51:45,  2.52it/s]

Epoch 2 | Step 100 | loss = 0.556, acc = 0.772


  3%|▎         | 199/7923 [01:12<51:14,  2.51it/s]

Epoch 2 | Step 200 | loss = 0.529, acc = 0.775


  4%|▍         | 299/7923 [01:48<50:48,  2.50it/s]

Epoch 2 | Step 300 | loss = 0.550, acc = 0.775


  5%|▌         | 399/7923 [02:24<48:46,  2.57it/s]

Epoch 2 | Step 400 | loss = 0.571, acc = 0.772


  6%|▋         | 499/7923 [02:59<48:36,  2.55it/s]

Epoch 2 | Step 500 | loss = 0.663, acc = 0.790


  8%|▊         | 599/7923 [03:35<48:08,  2.54it/s]

Epoch 2 | Step 600 | loss = 0.574, acc = 0.780


  9%|▉         | 699/7923 [04:10<48:56,  2.46it/s]

Epoch 2 | Step 700 | loss = 0.468, acc = 0.827


 10%|█         | 799/7923 [04:47<47:37,  2.49it/s]

Epoch 2 | Step 800 | loss = 0.471, acc = 0.808


 11%|█▏        | 899/7923 [05:23<46:52,  2.50it/s]

Epoch 2 | Step 900 | loss = 0.493, acc = 0.822


 13%|█▎        | 999/7923 [06:00<44:24,  2.60it/s]

Epoch 2 | Step 1000 | loss = 0.504, acc = 0.782


 14%|█▍        | 1099/7923 [06:36<45:03,  2.52it/s]

Epoch 2 | Step 1100 | loss = 0.466, acc = 0.797


 15%|█▌        | 1199/7923 [07:12<44:40,  2.51it/s]

Epoch 2 | Step 1200 | loss = 0.425, acc = 0.822


 16%|█▋        | 1299/7923 [07:48<44:03,  2.51it/s]

Epoch 2 | Step 1300 | loss = 0.550, acc = 0.790


 18%|█▊        | 1399/7923 [08:24<43:13,  2.52it/s]

Epoch 2 | Step 1400 | loss = 0.566, acc = 0.787


 19%|█▉        | 1499/7923 [08:59<42:18,  2.53it/s]

Epoch 2 | Step 1500 | loss = 0.510, acc = 0.797


 20%|██        | 1599/7923 [09:34<42:26,  2.48it/s]

Epoch 2 | Step 1600 | loss = 0.469, acc = 0.792


 21%|██▏       | 1699/7923 [10:10<42:17,  2.45it/s]

Epoch 2 | Step 1700 | loss = 0.547, acc = 0.785


 23%|██▎       | 1799/7923 [10:46<40:25,  2.53it/s]

Epoch 2 | Step 1800 | loss = 0.581, acc = 0.767


 24%|██▍       | 1899/7923 [11:22<41:00,  2.45it/s]

Epoch 2 | Step 1900 | loss = 0.542, acc = 0.790


 25%|██▌       | 1999/7923 [11:59<39:48,  2.48it/s]

Epoch 2 | Step 2000 | loss = 0.550, acc = 0.795


 26%|██▋       | 2099/7923 [12:35<40:11,  2.42it/s]

Epoch 2 | Step 2100 | loss = 0.497, acc = 0.817


 28%|██▊       | 2199/7923 [13:11<38:19,  2.49it/s]

Epoch 2 | Step 2200 | loss = 0.477, acc = 0.790


 29%|██▉       | 2299/7923 [13:47<37:42,  2.49it/s]

Epoch 2 | Step 2300 | loss = 0.448, acc = 0.835


 30%|███       | 2399/7923 [14:23<36:12,  2.54it/s]

Epoch 2 | Step 2400 | loss = 0.504, acc = 0.808


 32%|███▏      | 2499/7923 [14:59<36:20,  2.49it/s]

Epoch 2 | Step 2500 | loss = 0.477, acc = 0.812


 33%|███▎      | 2599/7923 [15:35<35:31,  2.50it/s]

Epoch 2 | Step 2600 | loss = 0.570, acc = 0.772


 34%|███▍      | 2699/7923 [16:11<34:33,  2.52it/s]

Epoch 2 | Step 2700 | loss = 0.555, acc = 0.790


 35%|███▌      | 2799/7923 [16:46<34:06,  2.50it/s]

Epoch 2 | Step 2800 | loss = 0.482, acc = 0.817


 37%|███▋      | 2899/7923 [17:22<32:40,  2.56it/s]

Epoch 2 | Step 2900 | loss = 0.418, acc = 0.822


 38%|███▊      | 2999/7923 [17:58<32:43,  2.51it/s]

Epoch 2 | Step 3000 | loss = 0.536, acc = 0.785


 39%|███▉      | 3099/7923 [18:34<33:34,  2.39it/s]

Epoch 2 | Step 3100 | loss = 0.538, acc = 0.792


 40%|████      | 3199/7923 [19:11<32:18,  2.44it/s]

Epoch 2 | Step 3200 | loss = 0.541, acc = 0.787


 42%|████▏     | 3299/7923 [19:47<31:14,  2.47it/s]

Epoch 2 | Step 3300 | loss = 0.533, acc = 0.803


 43%|████▎     | 3399/7923 [20:23<30:04,  2.51it/s]

Epoch 2 | Step 3400 | loss = 0.479, acc = 0.803


 44%|████▍     | 3499/7923 [20:59<29:02,  2.54it/s]

Epoch 2 | Step 3500 | loss = 0.413, acc = 0.837


 45%|████▌     | 3599/7923 [21:36<28:32,  2.53it/s]

Epoch 2 | Step 3600 | loss = 0.443, acc = 0.852


 47%|████▋     | 3699/7923 [22:11<28:29,  2.47it/s]

Epoch 2 | Step 3700 | loss = 0.500, acc = 0.803


 48%|████▊     | 3799/7923 [22:47<27:23,  2.51it/s]

Epoch 2 | Step 3800 | loss = 0.470, acc = 0.790


 49%|████▉     | 3899/7923 [23:23<26:25,  2.54it/s]

Epoch 2 | Step 3900 | loss = 0.460, acc = 0.832


 50%|█████     | 3999/7923 [23:58<26:06,  2.50it/s]

Epoch 2 | Step 4000 | loss = 0.510, acc = 0.797


 52%|█████▏    | 4099/7923 [24:34<25:27,  2.50it/s]

Epoch 2 | Step 4100 | loss = 0.501, acc = 0.790


 53%|█████▎    | 4199/7923 [25:09<26:05,  2.38it/s]

Epoch 2 | Step 4200 | loss = 0.393, acc = 0.803


 54%|█████▍    | 4299/7923 [25:46<24:16,  2.49it/s]

Epoch 2 | Step 4300 | loss = 0.515, acc = 0.790


 56%|█████▌    | 4399/7923 [26:22<23:16,  2.52it/s]

Epoch 2 | Step 4400 | loss = 0.459, acc = 0.817


 57%|█████▋    | 4499/7923 [26:58<23:35,  2.42it/s]

Epoch 2 | Step 4500 | loss = 0.425, acc = 0.817


 58%|█████▊    | 4599/7923 [27:35<21:46,  2.54it/s]

Epoch 2 | Step 4600 | loss = 0.515, acc = 0.815


 59%|█████▉    | 4699/7923 [28:11<22:04,  2.43it/s]

Epoch 2 | Step 4700 | loss = 0.410, acc = 0.842


 61%|██████    | 4799/7923 [28:47<20:38,  2.52it/s]

Epoch 2 | Step 4800 | loss = 0.438, acc = 0.842


 62%|██████▏   | 4899/7923 [29:23<20:00,  2.52it/s]

Epoch 2 | Step 4900 | loss = 0.526, acc = 0.785


 63%|██████▎   | 4999/7923 [29:59<19:48,  2.46it/s]

Epoch 2 | Step 5000 | loss = 0.633, acc = 0.790


 64%|██████▍   | 5099/7923 [30:36<18:36,  2.53it/s]

Epoch 2 | Step 5100 | loss = 0.548, acc = 0.785


 66%|██████▌   | 5199/7923 [31:12<18:19,  2.48it/s]

Epoch 2 | Step 5200 | loss = 0.556, acc = 0.792


 67%|██████▋   | 5299/7923 [31:47<17:24,  2.51it/s]

Epoch 2 | Step 5300 | loss = 0.408, acc = 0.835


 68%|██████▊   | 5399/7923 [32:23<16:21,  2.57it/s]

Epoch 2 | Step 5400 | loss = 0.476, acc = 0.832


 69%|██████▉   | 5499/7923 [32:58<16:23,  2.47it/s]

Epoch 2 | Step 5500 | loss = 0.578, acc = 0.795


 71%|███████   | 5599/7923 [33:34<15:10,  2.55it/s]

Epoch 2 | Step 5600 | loss = 0.488, acc = 0.822


 72%|███████▏  | 5699/7923 [34:09<14:41,  2.52it/s]

Epoch 2 | Step 5700 | loss = 0.463, acc = 0.762


 73%|███████▎  | 5799/7923 [34:45<14:18,  2.47it/s]

Epoch 2 | Step 5800 | loss = 0.527, acc = 0.805


 74%|███████▍  | 5899/7923 [35:22<13:33,  2.49it/s]

Epoch 2 | Step 5900 | loss = 0.507, acc = 0.792


 76%|███████▌  | 5999/7923 [35:58<13:00,  2.46it/s]

Epoch 2 | Step 6000 | loss = 0.562, acc = 0.782


 77%|███████▋  | 6099/7923 [36:34<12:05,  2.52it/s]

Epoch 2 | Step 6100 | loss = 0.524, acc = 0.787


 78%|███████▊  | 6199/7923 [37:10<11:48,  2.43it/s]

Epoch 2 | Step 6200 | loss = 0.449, acc = 0.800


 80%|███████▉  | 6299/7923 [37:47<10:46,  2.51it/s]

Epoch 2 | Step 6300 | loss = 0.448, acc = 0.815


 81%|████████  | 6399/7923 [38:23<10:18,  2.47it/s]

Epoch 2 | Step 6400 | loss = 0.626, acc = 0.762


 82%|████████▏ | 6499/7923 [38:59<09:56,  2.39it/s]

Epoch 2 | Step 6500 | loss = 0.384, acc = 0.852


 83%|████████▎ | 6599/7923 [39:35<08:45,  2.52it/s]

Epoch 2 | Step 6600 | loss = 0.608, acc = 0.770


 85%|████████▍ | 6699/7923 [40:11<08:15,  2.47it/s]

Epoch 2 | Step 6700 | loss = 0.590, acc = 0.770


 86%|████████▌ | 6799/7923 [40:47<07:30,  2.49it/s]

Epoch 2 | Step 6800 | loss = 0.498, acc = 0.817


 87%|████████▋ | 6899/7923 [41:23<06:56,  2.46it/s]

Epoch 2 | Step 6900 | loss = 0.450, acc = 0.790


 88%|████████▊ | 6999/7923 [41:59<06:00,  2.56it/s]

Epoch 2 | Step 7000 | loss = 0.427, acc = 0.842


 90%|████████▉ | 7099/7923 [42:34<05:25,  2.53it/s]

Epoch 2 | Step 7100 | loss = 0.511, acc = 0.810


 91%|█████████ | 7199/7923 [43:09<04:46,  2.53it/s]

Epoch 2 | Step 7200 | loss = 0.476, acc = 0.795


 92%|█████████▏| 7299/7923 [43:45<04:06,  2.53it/s]

Epoch 2 | Step 7300 | loss = 0.518, acc = 0.787


 93%|█████████▎| 7399/7923 [44:21<03:32,  2.47it/s]

Epoch 2 | Step 7400 | loss = 0.372, acc = 0.815


 95%|█████████▍| 7499/7923 [44:58<02:49,  2.50it/s]

Epoch 2 | Step 7500 | loss = 0.375, acc = 0.837


 96%|█████████▌| 7599/7923 [45:34<02:09,  2.51it/s]

Epoch 2 | Step 7600 | loss = 0.590, acc = 0.765


 97%|█████████▋| 7699/7923 [46:10<01:28,  2.53it/s]

Epoch 2 | Step 7700 | loss = 0.423, acc = 0.817


 98%|█████████▊| 7799/7923 [46:47<00:50,  2.46it/s]

Epoch 2 | Step 7800 | loss = 0.456, acc = 0.827


100%|█████████▉| 7899/7923 [47:23<00:09,  2.47it/s]

Epoch 2 | Step 7900 | loss = 0.450, acc = 0.832


100%|██████████| 7923/7923 [47:32<00:00,  2.78it/s]


Evaluating Dev Set ...


  6%|▌         | 228/4131 [01:01<18:09,  3.58it/s]

found [UNK] in prediction.
original pred: 李 [UNK]
final prediction 李杲


  9%|▉         | 390/4131 [01:45<16:41,  3.74it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 10%|▉         | 393/4131 [01:46<14:56,  4.17it/s]

found [UNK] in prediction.
original pred: [UNK] 崎 八 幡 宮
final prediction 筥崎八幡宮


 12%|█▏        | 504/4131 [02:15<14:30,  4.17it/s]

found [UNK] in prediction.
original pred: 與 慕 容 [UNK] 雙 方 不 和
final prediction 與慕容廆雙方不和


 20%|██        | 830/4131 [03:40<12:41,  4.34it/s]

found [UNK] in prediction.
original pred: 木 骨 [UNK]
final prediction 木骨閭


 20%|██        | 842/4131 [03:43<12:31,  4.38it/s]

found [UNK] in prediction.
original pred: 杜 恆 - [UNK] 因 論 題
final prediction 杜恆-蒯因論題


 24%|██▍       | 984/4131 [04:20<13:40,  3.83it/s]

found [UNK] in prediction.
original pred: 青 翁 三 足 [UNK]
final prediction 青翁三足缶


 34%|███▎      | 1384/4131 [06:06<11:20,  4.04it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 34%|███▍      | 1401/4131 [06:11<13:16,  3.43it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 37%|███▋      | 1512/4131 [06:41<11:30,  3.79it/s]

found [UNK] in prediction.
original pred: 免 再 次 爆 發 內 [UNK]
final prediction 免再次爆發內訌


 45%|████▍     | 1846/4131 [08:12<08:59,  4.24it/s]

found [UNK] in prediction.
original pred: [UNK] 船
final prediction 艚船


 53%|█████▎    | 2205/4131 [09:47<08:23,  3.83it/s]

found [UNK] in prediction.
original pred: 朱 載 [UNK]
final prediction 朱載堉


 55%|█████▍    | 2263/4131 [10:03<07:31,  4.14it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 那裏學習訓


 64%|██████▍   | 2662/4131 [11:47<06:34,  3.72it/s]

found [UNK] in prediction.
original pred: 李 端 [UNK]
final prediction 李端棻


 67%|██████▋   | 2752/4131 [12:09<05:26,  4.23it/s]

found [UNK] in prediction.
original pred: w. v. [UNK] 因
final prediction W.V.蒯因


 71%|███████   | 2937/4131 [12:58<06:43,  2.96it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 77%|███████▋  | 3168/4131 [13:56<02:57,  5.43it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 90%|████████▉ | 3717/4131 [16:02<01:13,  5.60it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 96%|█████████▋| 3982/4131 [17:01<00:33,  4.47it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 學習訓詁學


100%|██████████| 4131/4131 [17:37<00:00,  3.91it/s]


Validation | Epoch 2 | acc = 0.798


  1%|          | 99/7923 [00:35<54:28,  2.39it/s]

Epoch 3 | Step 100 | loss = 0.259, acc = 0.875


  3%|▎         | 199/7923 [01:11<51:04,  2.52it/s]

Epoch 3 | Step 200 | loss = 0.416, acc = 0.827


  4%|▍         | 299/7923 [01:47<51:58,  2.44it/s]

Epoch 3 | Step 300 | loss = 0.391, acc = 0.840


  5%|▌         | 399/7923 [02:23<48:01,  2.61it/s]

Epoch 3 | Step 400 | loss = 0.367, acc = 0.850


  6%|▋         | 499/7923 [02:59<48:21,  2.56it/s]

Epoch 3 | Step 500 | loss = 0.420, acc = 0.868


  8%|▊         | 599/7923 [03:35<47:19,  2.58it/s]

Epoch 3 | Step 600 | loss = 0.362, acc = 0.868


  9%|▉         | 699/7923 [04:10<48:13,  2.50it/s]

Epoch 3 | Step 700 | loss = 0.331, acc = 0.830


 10%|█         | 799/7923 [04:45<46:54,  2.53it/s]

Epoch 3 | Step 800 | loss = 0.366, acc = 0.852


 11%|█▏        | 899/7923 [05:21<46:36,  2.51it/s]

Epoch 3 | Step 900 | loss = 0.355, acc = 0.855


 13%|█▎        | 999/7923 [05:57<45:29,  2.54it/s]

Epoch 3 | Step 1000 | loss = 0.249, acc = 0.885


 14%|█▍        | 1099/7923 [06:34<45:21,  2.51it/s]

Epoch 3 | Step 1100 | loss = 0.406, acc = 0.852


 15%|█▌        | 1199/7923 [07:10<45:25,  2.47it/s]

Epoch 3 | Step 1200 | loss = 0.313, acc = 0.857


 16%|█▋        | 1299/7923 [07:47<44:13,  2.50it/s]

Epoch 3 | Step 1300 | loss = 0.386, acc = 0.842


 18%|█▊        | 1399/7923 [08:23<44:32,  2.44it/s]

Epoch 3 | Step 1400 | loss = 0.379, acc = 0.855


 19%|█▉        | 1499/7923 [08:59<42:57,  2.49it/s]

Epoch 3 | Step 1500 | loss = 0.397, acc = 0.880


 20%|██        | 1599/7923 [09:35<43:02,  2.45it/s]

Epoch 3 | Step 1600 | loss = 0.366, acc = 0.873


 21%|██▏       | 1699/7923 [10:11<40:49,  2.54it/s]

Epoch 3 | Step 1700 | loss = 0.291, acc = 0.857


 23%|██▎       | 1799/7923 [10:47<40:48,  2.50it/s]

Epoch 3 | Step 1800 | loss = 0.292, acc = 0.857


 24%|██▍       | 1899/7923 [11:23<39:42,  2.53it/s]

Epoch 3 | Step 1900 | loss = 0.406, acc = 0.855


 25%|██▌       | 1999/7923 [11:59<37:58,  2.60it/s]

Epoch 3 | Step 2000 | loss = 0.311, acc = 0.862


 26%|██▋       | 2099/7923 [12:34<38:16,  2.54it/s]

Epoch 3 | Step 2100 | loss = 0.397, acc = 0.852


 28%|██▊       | 2199/7923 [13:10<38:54,  2.45it/s]

Epoch 3 | Step 2200 | loss = 0.386, acc = 0.852


 29%|██▉       | 2299/7923 [13:45<37:16,  2.51it/s]

Epoch 3 | Step 2300 | loss = 0.337, acc = 0.860


 30%|███       | 2399/7923 [14:21<36:16,  2.54it/s]

Epoch 3 | Step 2400 | loss = 0.320, acc = 0.882


 32%|███▏      | 2499/7923 [14:57<36:25,  2.48it/s]

Epoch 3 | Step 2500 | loss = 0.339, acc = 0.860


 33%|███▎      | 2599/7923 [15:34<36:15,  2.45it/s]

Epoch 3 | Step 2600 | loss = 0.275, acc = 0.865


 34%|███▍      | 2699/7923 [16:10<35:27,  2.46it/s]

Epoch 3 | Step 2700 | loss = 0.392, acc = 0.845


 35%|███▌      | 2799/7923 [16:47<35:56,  2.38it/s]

Epoch 3 | Step 2800 | loss = 0.379, acc = 0.850


 37%|███▋      | 2899/7923 [17:22<33:09,  2.53it/s]

Epoch 3 | Step 2900 | loss = 0.283, acc = 0.880


 38%|███▊      | 2999/7923 [17:59<32:43,  2.51it/s]

Epoch 3 | Step 3000 | loss = 0.312, acc = 0.887


 39%|███▉      | 3099/7923 [18:35<32:24,  2.48it/s]

Epoch 3 | Step 3100 | loss = 0.414, acc = 0.835


 40%|████      | 3199/7923 [19:11<31:07,  2.53it/s]

Epoch 3 | Step 3200 | loss = 0.298, acc = 0.875


 42%|████▏     | 3299/7923 [19:47<30:11,  2.55it/s]

Epoch 3 | Step 3300 | loss = 0.341, acc = 0.837


 43%|████▎     | 3399/7923 [20:23<30:31,  2.47it/s]

Epoch 3 | Step 3400 | loss = 0.255, acc = 0.875


 44%|████▍     | 3499/7923 [20:59<28:46,  2.56it/s]

Epoch 3 | Step 3500 | loss = 0.355, acc = 0.852


 45%|████▌     | 3599/7923 [21:34<27:51,  2.59it/s]

Epoch 3 | Step 3600 | loss = 0.374, acc = 0.845


 47%|████▋     | 3699/7923 [22:09<26:55,  2.61it/s]

Epoch 3 | Step 3700 | loss = 0.351, acc = 0.837


 48%|████▊     | 3799/7923 [22:45<27:26,  2.50it/s]

Epoch 3 | Step 3800 | loss = 0.321, acc = 0.852


 49%|████▉     | 3899/7923 [23:21<27:43,  2.42it/s]

Epoch 3 | Step 3900 | loss = 0.277, acc = 0.873


 50%|█████     | 3999/7923 [23:57<26:18,  2.49it/s]

Epoch 3 | Step 4000 | loss = 0.417, acc = 0.837


 52%|█████▏    | 4099/7923 [24:34<26:04,  2.44it/s]

Epoch 3 | Step 4100 | loss = 0.312, acc = 0.852


 53%|█████▎    | 4199/7923 [25:10<25:05,  2.47it/s]

Epoch 3 | Step 4200 | loss = 0.544, acc = 0.812


 54%|█████▍    | 4299/7923 [25:46<24:33,  2.46it/s]

Epoch 3 | Step 4300 | loss = 0.354, acc = 0.873


 56%|█████▌    | 4399/7923 [26:23<23:49,  2.47it/s]

Epoch 3 | Step 4400 | loss = 0.337, acc = 0.882


 57%|█████▋    | 4499/7923 [26:59<22:53,  2.49it/s]

Epoch 3 | Step 4500 | loss = 0.246, acc = 0.870


 58%|█████▊    | 4599/7923 [27:35<22:31,  2.46it/s]

Epoch 3 | Step 4600 | loss = 0.371, acc = 0.857


 59%|█████▉    | 4699/7923 [28:11<20:46,  2.59it/s]

Epoch 3 | Step 4700 | loss = 0.341, acc = 0.837


 61%|██████    | 4799/7923 [28:47<20:39,  2.52it/s]

Epoch 3 | Step 4800 | loss = 0.414, acc = 0.842


 62%|██████▏   | 4899/7923 [29:22<20:06,  2.51it/s]

Epoch 3 | Step 4900 | loss = 0.338, acc = 0.873


 63%|██████▎   | 4999/7923 [29:58<19:21,  2.52it/s]

Epoch 3 | Step 5000 | loss = 0.263, acc = 0.875


 64%|██████▍   | 5099/7923 [30:33<18:30,  2.54it/s]

Epoch 3 | Step 5100 | loss = 0.326, acc = 0.875


 66%|██████▌   | 5199/7923 [31:08<17:58,  2.53it/s]

Epoch 3 | Step 5200 | loss = 0.457, acc = 0.840


 67%|██████▋   | 5299/7923 [31:44<17:35,  2.49it/s]

Epoch 3 | Step 5300 | loss = 0.386, acc = 0.847


 68%|██████▊   | 5399/7923 [32:20<16:50,  2.50it/s]

Epoch 3 | Step 5400 | loss = 0.359, acc = 0.855


 69%|██████▉   | 5499/7923 [32:57<16:07,  2.51it/s]

Epoch 3 | Step 5500 | loss = 0.392, acc = 0.840


 71%|███████   | 5599/7923 [33:33<15:48,  2.45it/s]

Epoch 3 | Step 5600 | loss = 0.353, acc = 0.865


 72%|███████▏  | 5699/7923 [34:10<14:49,  2.50it/s]

Epoch 3 | Step 5700 | loss = 0.281, acc = 0.900


 73%|███████▎  | 5799/7923 [34:46<14:11,  2.50it/s]

Epoch 3 | Step 5800 | loss = 0.353, acc = 0.850


 74%|███████▍  | 5899/7923 [35:22<13:58,  2.41it/s]

Epoch 3 | Step 5900 | loss = 0.342, acc = 0.847


 76%|███████▌  | 5999/7923 [35:59<12:35,  2.55it/s]

Epoch 3 | Step 6000 | loss = 0.355, acc = 0.847


 77%|███████▋  | 6099/7923 [36:35<12:16,  2.48it/s]

Epoch 3 | Step 6100 | loss = 0.351, acc = 0.868


 78%|███████▊  | 6199/7923 [37:11<11:27,  2.51it/s]

Epoch 3 | Step 6200 | loss = 0.306, acc = 0.857


 80%|███████▉  | 6299/7923 [37:46<10:50,  2.49it/s]

Epoch 3 | Step 6300 | loss = 0.262, acc = 0.873


 81%|████████  | 6399/7923 [38:22<09:55,  2.56it/s]

Epoch 3 | Step 6400 | loss = 0.374, acc = 0.840


 82%|████████▏ | 6499/7923 [38:58<09:20,  2.54it/s]

Epoch 3 | Step 6500 | loss = 0.415, acc = 0.847


 83%|████████▎ | 6599/7923 [39:33<08:46,  2.51it/s]

Epoch 3 | Step 6600 | loss = 0.254, acc = 0.882


 85%|████████▍ | 6699/7923 [40:09<08:13,  2.48it/s]

Epoch 3 | Step 6700 | loss = 0.358, acc = 0.847


 86%|████████▌ | 6799/7923 [40:44<07:24,  2.53it/s]

Epoch 3 | Step 6800 | loss = 0.301, acc = 0.860


 87%|████████▋ | 6899/7923 [41:21<06:54,  2.47it/s]

Epoch 3 | Step 6900 | loss = 0.402, acc = 0.825


 88%|████████▊ | 6999/7923 [41:57<06:11,  2.49it/s]

Epoch 3 | Step 7000 | loss = 0.420, acc = 0.832


 90%|████████▉ | 7099/7923 [42:33<05:35,  2.46it/s]

Epoch 3 | Step 7100 | loss = 0.239, acc = 0.885


 91%|█████████ | 7199/7923 [43:10<04:55,  2.45it/s]

Epoch 3 | Step 7200 | loss = 0.191, acc = 0.890


 92%|█████████▏| 7299/7923 [43:46<04:17,  2.42it/s]

Epoch 3 | Step 7300 | loss = 0.325, acc = 0.855


 93%|█████████▎| 7399/7923 [44:22<03:29,  2.50it/s]

Epoch 3 | Step 7400 | loss = 0.394, acc = 0.847


 95%|█████████▍| 7499/7923 [44:58<02:53,  2.44it/s]

Epoch 3 | Step 7500 | loss = 0.279, acc = 0.890


 96%|█████████▌| 7599/7923 [45:34<02:09,  2.51it/s]

Epoch 3 | Step 7600 | loss = 0.282, acc = 0.877


 97%|█████████▋| 7699/7923 [46:11<01:29,  2.50it/s]

Epoch 3 | Step 7700 | loss = 0.292, acc = 0.855


 98%|█████████▊| 7799/7923 [46:47<00:51,  2.43it/s]

Epoch 3 | Step 7800 | loss = 0.243, acc = 0.902


100%|█████████▉| 7899/7923 [47:23<00:09,  2.49it/s]

Epoch 3 | Step 7900 | loss = 0.368, acc = 0.862


100%|██████████| 7923/7923 [47:31<00:00,  2.78it/s]


Evaluating Dev Set ...


  6%|▌         | 228/4131 [00:59<17:18,  3.76it/s]

found [UNK] in prediction.
original pred: 李 [UNK]
final prediction 李杲


  9%|▉         | 390/4131 [01:42<16:37,  3.75it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 10%|▉         | 393/4131 [01:43<14:40,  4.24it/s]

found [UNK] in prediction.
original pred: [UNK] 崎 八 幡 宮
final prediction 筥崎八幡宮


 20%|██        | 829/4131 [03:40<14:28,  3.80it/s]

found [UNK] in prediction.
original pred: 木 骨 [UNK]
final prediction 木骨閭


 20%|██        | 842/4131 [03:43<13:06,  4.18it/s]

found [UNK] in prediction.
original pred: 杜 恆 - [UNK] 因 論 題
final prediction 杜恆-蒯因論題


 24%|██▍       | 984/4131 [04:22<13:36,  3.86it/s]

found [UNK] in prediction.
original pred: 青 翁 三 足 [UNK]
final prediction 青翁三足缶


 34%|███▎      | 1384/4131 [06:09<11:20,  4.04it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 34%|███▍      | 1401/4131 [06:14<13:03,  3.49it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 37%|███▋      | 1512/4131 [06:44<11:33,  3.78it/s]

found [UNK] in prediction.
original pred: 免 再 次 爆 發 內 [UNK]
final prediction 免再次爆發內訌


 45%|████▍     | 1846/4131 [08:13<07:50,  4.85it/s]

found [UNK] in prediction.
original pred: [UNK] 船
final prediction 艚船


 53%|█████▎    | 2205/4131 [09:46<08:08,  3.95it/s]

found [UNK] in prediction.
original pred: 朱 載 [UNK]
final prediction 朱載堉


 55%|█████▍    | 2263/4131 [10:01<07:37,  4.09it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 那裏學習訓


 58%|█████▊    | 2400/4131 [10:37<06:41,  4.31it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK] 的 禁 殺 之 旨
final prediction 朱允炆的禁殺之旨


 67%|██████▋   | 2752/4131 [12:09<05:44,  4.01it/s]

found [UNK] in prediction.
original pred: w. v. [UNK] 因
final prediction W.V.蒯因


 71%|███████   | 2937/4131 [13:00<06:42,  2.97it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 77%|███████▋  | 3168/4131 [13:58<02:56,  5.47it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 96%|█████████▋| 3982/4131 [17:00<00:32,  4.58it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 學習訓詁學


100%|██████████| 4131/4131 [17:34<00:00,  3.92it/s]


Validation | Epoch 3 | acc = 0.794


  1%|          | 99/7923 [00:34<51:23,  2.54it/s]

Epoch 4 | Step 100 | loss = 0.188, acc = 0.890


  3%|▎         | 199/7923 [01:10<50:10,  2.57it/s]

Epoch 4 | Step 200 | loss = 0.250, acc = 0.892


  4%|▍         | 299/7923 [01:45<50:47,  2.50it/s]

Epoch 4 | Step 300 | loss = 0.197, acc = 0.917


  5%|▌         | 399/7923 [02:22<49:33,  2.53it/s]

Epoch 4 | Step 400 | loss = 0.302, acc = 0.887


  6%|▋         | 499/7923 [02:58<49:01,  2.52it/s]

Epoch 4 | Step 500 | loss = 0.256, acc = 0.910


  8%|▊         | 599/7923 [03:34<50:22,  2.42it/s]

Epoch 4 | Step 600 | loss = 0.312, acc = 0.877


  9%|▉         | 699/7923 [04:10<48:12,  2.50it/s]

Epoch 4 | Step 700 | loss = 0.306, acc = 0.880


 10%|█         | 799/7923 [04:46<47:15,  2.51it/s]

Epoch 4 | Step 800 | loss = 0.284, acc = 0.892


 11%|█▏        | 899/7923 [05:22<46:53,  2.50it/s]

Epoch 4 | Step 900 | loss = 0.303, acc = 0.870


 13%|█▎        | 999/7923 [05:59<47:34,  2.43it/s]

Epoch 4 | Step 1000 | loss = 0.344, acc = 0.880


 14%|█▍        | 1099/7923 [06:35<46:29,  2.45it/s]

Epoch 4 | Step 1100 | loss = 0.181, acc = 0.920


 15%|█▌        | 1199/7923 [07:11<45:33,  2.46it/s]

Epoch 4 | Step 1200 | loss = 0.249, acc = 0.902


 16%|█▋        | 1299/7923 [07:46<43:34,  2.53it/s]

Epoch 4 | Step 1300 | loss = 0.223, acc = 0.882


 18%|█▊        | 1399/7923 [08:22<43:32,  2.50it/s]

Epoch 4 | Step 1400 | loss = 0.244, acc = 0.912


 19%|█▉        | 1499/7923 [08:58<41:45,  2.56it/s]

Epoch 4 | Step 1500 | loss = 0.235, acc = 0.880


 20%|██        | 1599/7923 [09:33<40:34,  2.60it/s]

Epoch 4 | Step 1600 | loss = 0.315, acc = 0.892


 21%|██▏       | 1699/7923 [10:08<40:32,  2.56it/s]

Epoch 4 | Step 1700 | loss = 0.227, acc = 0.882


 23%|██▎       | 1799/7923 [10:44<40:58,  2.49it/s]

Epoch 4 | Step 1800 | loss = 0.222, acc = 0.902


 24%|██▍       | 1899/7923 [11:20<39:28,  2.54it/s]

Epoch 4 | Step 1900 | loss = 0.192, acc = 0.920


 25%|██▌       | 1999/7923 [11:57<40:22,  2.45it/s]

Epoch 4 | Step 2000 | loss = 0.229, acc = 0.907


 26%|██▋       | 2099/7923 [12:33<39:04,  2.48it/s]

Epoch 4 | Step 2100 | loss = 0.272, acc = 0.885


 28%|██▊       | 2199/7923 [13:09<38:43,  2.46it/s]

Epoch 4 | Step 2200 | loss = 0.161, acc = 0.935


 29%|██▉       | 2299/7923 [13:46<37:51,  2.48it/s]

Epoch 4 | Step 2300 | loss = 0.325, acc = 0.887


 30%|███       | 2399/7923 [14:22<38:11,  2.41it/s]

Epoch 4 | Step 2400 | loss = 0.312, acc = 0.882


 32%|███▏      | 2499/7923 [14:58<36:23,  2.48it/s]

Epoch 4 | Step 2500 | loss = 0.276, acc = 0.880


 33%|███▎      | 2599/7923 [15:34<36:01,  2.46it/s]

Epoch 4 | Step 2600 | loss = 0.281, acc = 0.907


 34%|███▍      | 2699/7923 [16:10<35:06,  2.48it/s]

Epoch 4 | Step 2700 | loss = 0.308, acc = 0.875


 35%|███▌      | 2799/7923 [16:46<34:21,  2.49it/s]

Epoch 4 | Step 2800 | loss = 0.282, acc = 0.892


 37%|███▋      | 2899/7923 [17:21<33:27,  2.50it/s]

Epoch 4 | Step 2900 | loss = 0.289, acc = 0.862


 38%|███▊      | 2999/7923 [17:56<31:50,  2.58it/s]

Epoch 4 | Step 3000 | loss = 0.265, acc = 0.895


 39%|███▉      | 3099/7923 [18:32<31:23,  2.56it/s]

Epoch 4 | Step 3100 | loss = 0.231, acc = 0.905


 40%|████      | 3199/7923 [19:07<31:47,  2.48it/s]

Epoch 4 | Step 3200 | loss = 0.280, acc = 0.882


 42%|████▏     | 3299/7923 [19:43<31:12,  2.47it/s]

Epoch 4 | Step 3300 | loss = 0.242, acc = 0.900


 43%|████▎     | 3399/7923 [20:19<29:23,  2.56it/s]

Epoch 4 | Step 3400 | loss = 0.220, acc = 0.902


 44%|████▍     | 3499/7923 [20:56<29:32,  2.50it/s]

Epoch 4 | Step 3500 | loss = 0.242, acc = 0.900


 45%|████▌     | 3599/7923 [21:32<29:05,  2.48it/s]

Epoch 4 | Step 3600 | loss = 0.176, acc = 0.902


 47%|████▋     | 3699/7923 [22:08<29:32,  2.38it/s]

Epoch 4 | Step 3700 | loss = 0.245, acc = 0.895


 48%|████▊     | 3799/7923 [22:45<27:45,  2.48it/s]

Epoch 4 | Step 3800 | loss = 0.292, acc = 0.880


 49%|████▉     | 3899/7923 [23:21<27:00,  2.48it/s]

Epoch 4 | Step 3900 | loss = 0.261, acc = 0.880


 50%|█████     | 3999/7923 [23:57<25:04,  2.61it/s]

Epoch 4 | Step 4000 | loss = 0.202, acc = 0.895


 52%|█████▏    | 4099/7923 [24:33<25:52,  2.46it/s]

Epoch 4 | Step 4100 | loss = 0.175, acc = 0.912


 53%|█████▎    | 4199/7923 [25:08<25:20,  2.45it/s]

Epoch 4 | Step 4200 | loss = 0.365, acc = 0.862


 54%|█████▍    | 4299/7923 [25:44<23:33,  2.56it/s]

Epoch 4 | Step 4300 | loss = 0.234, acc = 0.902


 56%|█████▌    | 4399/7923 [26:20<22:47,  2.58it/s]

Epoch 4 | Step 4400 | loss = 0.286, acc = 0.897


 57%|█████▋    | 4499/7923 [26:55<22:11,  2.57it/s]

Epoch 4 | Step 4500 | loss = 0.279, acc = 0.882


 58%|█████▊    | 4599/7923 [27:30<21:29,  2.58it/s]

Epoch 4 | Step 4600 | loss = 0.261, acc = 0.882


 59%|█████▉    | 4699/7923 [28:06<21:18,  2.52it/s]

Epoch 4 | Step 4700 | loss = 0.314, acc = 0.912


 61%|██████    | 4799/7923 [28:42<21:15,  2.45it/s]

Epoch 4 | Step 4800 | loss = 0.216, acc = 0.890


 62%|██████▏   | 4899/7923 [29:18<20:37,  2.44it/s]

Epoch 4 | Step 4900 | loss = 0.166, acc = 0.922


 63%|██████▎   | 4999/7923 [29:55<19:26,  2.51it/s]

Epoch 4 | Step 5000 | loss = 0.223, acc = 0.895


 64%|██████▍   | 5099/7923 [30:31<18:41,  2.52it/s]

Epoch 4 | Step 5100 | loss = 0.267, acc = 0.900


 66%|██████▌   | 5199/7923 [31:07<17:58,  2.52it/s]

Epoch 4 | Step 5200 | loss = 0.240, acc = 0.897


 67%|██████▋   | 5299/7923 [31:43<17:11,  2.54it/s]

Epoch 4 | Step 5300 | loss = 0.241, acc = 0.902


 68%|██████▊   | 5399/7923 [32:20<16:53,  2.49it/s]

Epoch 4 | Step 5400 | loss = 0.266, acc = 0.897


 69%|██████▉   | 5499/7923 [32:56<16:23,  2.47it/s]

Epoch 4 | Step 5500 | loss = 0.338, acc = 0.877


 71%|███████   | 5599/7923 [33:32<15:24,  2.51it/s]

Epoch 4 | Step 5600 | loss = 0.314, acc = 0.868


 72%|███████▏  | 5699/7923 [34:08<15:02,  2.46it/s]

Epoch 4 | Step 5700 | loss = 0.306, acc = 0.885


 73%|███████▎  | 5799/7923 [34:43<13:59,  2.53it/s]

Epoch 4 | Step 5800 | loss = 0.244, acc = 0.912


 74%|███████▍  | 5899/7923 [35:19<13:32,  2.49it/s]

Epoch 4 | Step 5900 | loss = 0.261, acc = 0.892


 76%|███████▌  | 5999/7923 [35:54<12:41,  2.53it/s]

Epoch 4 | Step 6000 | loss = 0.247, acc = 0.885


 77%|███████▋  | 6099/7923 [36:30<12:10,  2.50it/s]

Epoch 4 | Step 6100 | loss = 0.299, acc = 0.860


 78%|███████▊  | 6199/7923 [37:05<11:20,  2.53it/s]

Epoch 4 | Step 6200 | loss = 0.297, acc = 0.892


 80%|███████▉  | 6299/7923 [37:41<10:56,  2.48it/s]

Epoch 4 | Step 6300 | loss = 0.323, acc = 0.877


 81%|████████  | 6399/7923 [38:17<10:07,  2.51it/s]

Epoch 4 | Step 6400 | loss = 0.217, acc = 0.900


 82%|████████▏ | 6499/7923 [38:52<09:24,  2.52it/s]

Epoch 4 | Step 6500 | loss = 0.287, acc = 0.900


 83%|████████▎ | 6599/7923 [39:28<08:37,  2.56it/s]

Epoch 4 | Step 6600 | loss = 0.247, acc = 0.907


 85%|████████▍ | 6699/7923 [40:04<08:14,  2.48it/s]

Epoch 4 | Step 6700 | loss = 0.312, acc = 0.868


 86%|████████▌ | 6799/7923 [40:41<07:28,  2.51it/s]

Epoch 4 | Step 6800 | loss = 0.313, acc = 0.892


 87%|████████▋ | 6899/7923 [41:17<06:51,  2.49it/s]

Epoch 4 | Step 6900 | loss = 0.284, acc = 0.895


 88%|████████▊ | 6999/7923 [41:53<06:19,  2.43it/s]

Epoch 4 | Step 7000 | loss = 0.340, acc = 0.852


 90%|████████▉ | 7099/7923 [42:30<05:41,  2.41it/s]

Epoch 4 | Step 7100 | loss = 0.250, acc = 0.880


 91%|█████████ | 7199/7923 [43:06<04:51,  2.49it/s]

Epoch 4 | Step 7200 | loss = 0.211, acc = 0.895


 92%|█████████▏| 7299/7923 [43:42<04:19,  2.41it/s]

Epoch 4 | Step 7300 | loss = 0.213, acc = 0.905


 93%|█████████▎| 7399/7923 [44:18<03:26,  2.53it/s]

Epoch 4 | Step 7400 | loss = 0.277, acc = 0.892


 95%|█████████▍| 7499/7923 [44:54<02:47,  2.53it/s]

Epoch 4 | Step 7500 | loss = 0.277, acc = 0.880


 96%|█████████▌| 7599/7923 [45:30<02:10,  2.48it/s]

Epoch 4 | Step 7600 | loss = 0.233, acc = 0.907


 97%|█████████▋| 7699/7923 [46:06<01:29,  2.49it/s]

Epoch 4 | Step 7700 | loss = 0.291, acc = 0.882


 98%|█████████▊| 7799/7923 [46:42<00:49,  2.52it/s]

Epoch 4 | Step 7800 | loss = 0.266, acc = 0.895


100%|█████████▉| 7899/7923 [47:18<00:09,  2.47it/s]

Epoch 4 | Step 7900 | loss = 0.211, acc = 0.912


100%|██████████| 7923/7923 [47:26<00:00,  2.78it/s]


Evaluating Dev Set ...


  6%|▌         | 228/4131 [01:00<17:43,  3.67it/s]

found [UNK] in prediction.
original pred: 李 [UNK]
final prediction 李杲


  9%|▉         | 390/4131 [01:43<16:27,  3.79it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 10%|▉         | 393/4131 [01:43<14:50,  4.20it/s]

found [UNK] in prediction.
original pred: [UNK] 崎 八 幡 宮
final prediction 筥崎八幡宮


 12%|█▏        | 504/4131 [02:12<14:21,  4.21it/s]

found [UNK] in prediction.
original pred: 與 慕 容 [UNK]
final prediction 與慕容廆


 17%|█▋        | 700/4131 [03:04<14:42,  3.89it/s]

found [UNK] in prediction.
original pred: 對 日 本 報 紙 的 無 恥 造 謠 誣 [UNK] ， 進 行 了 有 力 駁 斥
final prediction 對日本報紙的無恥造謠誣衊，進行了有力駁斥


 20%|██        | 842/4131 [03:41<12:43,  4.31it/s]

found [UNK] in prediction.
original pred: 杜 恆 - [UNK] 因 論 題
final prediction 杜恆-蒯因論題


 24%|██▍       | 984/4131 [04:20<13:49,  3.80it/s]

found [UNK] in prediction.
original pred: 青 翁 三 足 [UNK]
final prediction 青翁三足缶


 34%|███▎      | 1384/4131 [06:07<11:08,  4.11it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 34%|███▍      | 1401/4131 [06:12<12:59,  3.50it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 37%|███▋      | 1512/4131 [06:42<11:28,  3.80it/s]

found [UNK] in prediction.
original pred: 免 再 次 爆 發 內 [UNK]
final prediction 免再次爆發內訌


 45%|████▍     | 1846/4131 [08:12<08:33,  4.45it/s]

found [UNK] in prediction.
original pred: [UNK] 船
final prediction 艚船


 53%|█████▎    | 2205/4131 [09:47<08:16,  3.88it/s]

found [UNK] in prediction.
original pred: 朱 載 [UNK]
final prediction 朱載堉


 55%|█████▍    | 2263/4131 [10:02<07:32,  4.12it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 那裏學習訓


 58%|█████▊    | 2400/4131 [10:38<06:40,  4.33it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK] 的 禁 殺 之 旨
final prediction 朱允炆的禁殺之旨


 64%|██████▍   | 2662/4131 [11:46<06:37,  3.70it/s]

found [UNK] in prediction.
original pred: 李 端 [UNK]
final prediction 李端棻


 67%|██████▋   | 2752/4131 [12:08<05:37,  4.09it/s]

found [UNK] in prediction.
original pred: w. v. [UNK] 因
final prediction W.V.蒯因


 71%|███████   | 2937/4131 [12:57<06:37,  3.01it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 77%|███████▋  | 3168/4131 [13:55<02:54,  5.51it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 96%|█████████▋| 3982/4131 [17:00<00:32,  4.55it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 學習訓詁學


100%|██████████| 4131/4131 [17:35<00:00,  3.91it/s]


Validation | Epoch 4 | acc = 0.793


  1%|          | 99/7923 [00:35<51:30,  2.53it/s]

Epoch 5 | Step 100 | loss = 0.266, acc = 0.897


  3%|▎         | 199/7923 [01:11<51:07,  2.52it/s]

Epoch 5 | Step 200 | loss = 0.203, acc = 0.912


  4%|▍         | 299/7923 [01:47<50:47,  2.50it/s]

Epoch 5 | Step 300 | loss = 0.268, acc = 0.900


  5%|▌         | 399/7923 [02:22<48:19,  2.59it/s]

Epoch 5 | Step 400 | loss = 0.191, acc = 0.922


  6%|▋         | 499/7923 [02:58<48:03,  2.57it/s]

Epoch 5 | Step 500 | loss = 0.220, acc = 0.917


  8%|▊         | 599/7923 [03:33<46:42,  2.61it/s]

Epoch 5 | Step 600 | loss = 0.195, acc = 0.925


  9%|▉         | 699/7923 [04:08<47:22,  2.54it/s]

Epoch 5 | Step 700 | loss = 0.190, acc = 0.917


 10%|█         | 799/7923 [04:44<48:23,  2.45it/s]

Epoch 5 | Step 800 | loss = 0.239, acc = 0.895


 11%|█▏        | 899/7923 [05:20<47:26,  2.47it/s]

Epoch 5 | Step 900 | loss = 0.234, acc = 0.912


 13%|█▎        | 999/7923 [05:56<47:08,  2.45it/s]

Epoch 5 | Step 1000 | loss = 0.250, acc = 0.912


 14%|█▍        | 1099/7923 [06:33<44:24,  2.56it/s]

Epoch 5 | Step 1100 | loss = 0.227, acc = 0.915


 15%|█▌        | 1199/7923 [07:09<43:55,  2.55it/s]

Epoch 5 | Step 1200 | loss = 0.138, acc = 0.925


 16%|█▋        | 1299/7923 [07:45<43:44,  2.52it/s]

Epoch 5 | Step 1300 | loss = 0.182, acc = 0.930


 18%|█▊        | 1399/7923 [08:21<44:08,  2.46it/s]

Epoch 5 | Step 1400 | loss = 0.186, acc = 0.942


 19%|█▉        | 1499/7923 [08:57<42:12,  2.54it/s]

Epoch 5 | Step 1500 | loss = 0.193, acc = 0.935


 20%|██        | 1599/7923 [09:33<41:51,  2.52it/s]

Epoch 5 | Step 1600 | loss = 0.210, acc = 0.912


 21%|██▏       | 1699/7923 [10:09<40:28,  2.56it/s]

Epoch 5 | Step 1700 | loss = 0.135, acc = 0.938


 23%|██▎       | 1799/7923 [10:44<41:24,  2.46it/s]

Epoch 5 | Step 1800 | loss = 0.186, acc = 0.902


 24%|██▍       | 1899/7923 [11:20<40:25,  2.48it/s]

Epoch 5 | Step 1900 | loss = 0.273, acc = 0.915


 25%|██▌       | 1999/7923 [11:55<40:18,  2.45it/s]

Epoch 5 | Step 2000 | loss = 0.180, acc = 0.925


 26%|██▋       | 2099/7923 [12:31<39:07,  2.48it/s]

Epoch 5 | Step 2100 | loss = 0.163, acc = 0.930


 28%|██▊       | 2199/7923 [13:07<39:31,  2.41it/s]

Epoch 5 | Step 2200 | loss = 0.198, acc = 0.933


 29%|██▉       | 2299/7923 [13:43<38:17,  2.45it/s]

Epoch 5 | Step 2300 | loss = 0.196, acc = 0.930


 30%|███       | 2399/7923 [14:19<35:27,  2.60it/s]

Epoch 5 | Step 2400 | loss = 0.247, acc = 0.917


 32%|███▏      | 2499/7923 [14:56<35:56,  2.51it/s]

Epoch 5 | Step 2500 | loss = 0.204, acc = 0.902


 33%|███▎      | 2599/7923 [15:32<35:18,  2.51it/s]

Epoch 5 | Step 2600 | loss = 0.244, acc = 0.910


 34%|███▍      | 2699/7923 [16:08<34:34,  2.52it/s]

Epoch 5 | Step 2700 | loss = 0.225, acc = 0.905


 35%|███▌      | 2799/7923 [16:44<34:05,  2.51it/s]

Epoch 5 | Step 2800 | loss = 0.215, acc = 0.912


 37%|███▋      | 2899/7923 [17:19<32:43,  2.56it/s]

Epoch 5 | Step 2900 | loss = 0.240, acc = 0.907


 38%|███▊      | 2999/7923 [17:55<32:24,  2.53it/s]

Epoch 5 | Step 3000 | loss = 0.181, acc = 0.933


 39%|███▉      | 3099/7923 [18:31<32:16,  2.49it/s]

Epoch 5 | Step 3100 | loss = 0.219, acc = 0.917


 40%|████      | 3199/7923 [19:07<32:11,  2.45it/s]

Epoch 5 | Step 3200 | loss = 0.322, acc = 0.882


 42%|████▏     | 3299/7923 [19:42<29:58,  2.57it/s]

Epoch 5 | Step 3300 | loss = 0.202, acc = 0.910


 43%|████▎     | 3399/7923 [20:17<30:01,  2.51it/s]

Epoch 5 | Step 3400 | loss = 0.232, acc = 0.905


 44%|████▍     | 3499/7923 [20:53<29:10,  2.53it/s]

Epoch 5 | Step 3500 | loss = 0.252, acc = 0.912


 45%|████▌     | 3599/7923 [21:29<29:06,  2.48it/s]

Epoch 5 | Step 3600 | loss = 0.227, acc = 0.897


 47%|████▋     | 3699/7923 [22:05<28:24,  2.48it/s]

Epoch 5 | Step 3700 | loss = 0.194, acc = 0.900


 48%|████▊     | 3799/7923 [22:41<27:57,  2.46it/s]

Epoch 5 | Step 3800 | loss = 0.199, acc = 0.912


 49%|████▉     | 3899/7923 [23:18<27:28,  2.44it/s]

Epoch 5 | Step 3900 | loss = 0.225, acc = 0.910


 50%|█████     | 3999/7923 [23:54<26:31,  2.47it/s]

Epoch 5 | Step 4000 | loss = 0.200, acc = 0.892


 52%|█████▏    | 4099/7923 [24:30<25:24,  2.51it/s]

Epoch 5 | Step 4100 | loss = 0.238, acc = 0.922


 53%|█████▎    | 4199/7923 [25:06<25:12,  2.46it/s]

Epoch 5 | Step 4200 | loss = 0.205, acc = 0.905


 54%|█████▍    | 4299/7923 [25:43<24:30,  2.46it/s]

Epoch 5 | Step 4300 | loss = 0.173, acc = 0.935


 56%|█████▌    | 4399/7923 [26:18<23:30,  2.50it/s]

Epoch 5 | Step 4400 | loss = 0.129, acc = 0.920


 57%|█████▋    | 4499/7923 [26:54<22:21,  2.55it/s]

Epoch 5 | Step 4500 | loss = 0.177, acc = 0.930


 58%|█████▊    | 4599/7923 [27:29<21:22,  2.59it/s]

Epoch 5 | Step 4600 | loss = 0.174, acc = 0.922


 59%|█████▉    | 4699/7923 [28:05<21:08,  2.54it/s]

Epoch 5 | Step 4700 | loss = 0.276, acc = 0.897


 61%|██████    | 4799/7923 [28:40<20:29,  2.54it/s]

Epoch 5 | Step 4800 | loss = 0.149, acc = 0.933


 62%|██████▏   | 4899/7923 [29:16<19:56,  2.53it/s]

Epoch 5 | Step 4900 | loss = 0.229, acc = 0.910


 63%|██████▎   | 4999/7923 [29:52<19:23,  2.51it/s]

Epoch 5 | Step 5000 | loss = 0.216, acc = 0.910


 64%|██████▍   | 5099/7923 [30:28<19:35,  2.40it/s]

Epoch 5 | Step 5100 | loss = 0.207, acc = 0.915


 66%|██████▌   | 5199/7923 [31:04<18:20,  2.48it/s]

Epoch 5 | Step 5200 | loss = 0.199, acc = 0.912


 67%|██████▋   | 5299/7923 [31:40<17:37,  2.48it/s]

Epoch 5 | Step 5300 | loss = 0.246, acc = 0.907


 68%|██████▊   | 5399/7923 [32:16<16:43,  2.52it/s]

Epoch 5 | Step 5400 | loss = 0.161, acc = 0.922


 69%|██████▉   | 5499/7923 [32:53<15:50,  2.55it/s]

Epoch 5 | Step 5500 | loss = 0.213, acc = 0.912


 71%|███████   | 5599/7923 [33:29<15:28,  2.50it/s]

Epoch 5 | Step 5600 | loss = 0.243, acc = 0.895


 72%|███████▏  | 5699/7923 [34:05<14:45,  2.51it/s]

Epoch 5 | Step 5700 | loss = 0.230, acc = 0.910


 73%|███████▎  | 5799/7923 [34:41<14:00,  2.53it/s]

Epoch 5 | Step 5800 | loss = 0.307, acc = 0.873


 74%|███████▍  | 5899/7923 [35:17<13:40,  2.47it/s]

Epoch 5 | Step 5900 | loss = 0.176, acc = 0.917


 76%|███████▌  | 5999/7923 [35:52<12:38,  2.54it/s]

Epoch 5 | Step 6000 | loss = 0.264, acc = 0.900


 77%|███████▋  | 6099/7923 [36:28<11:42,  2.60it/s]

Epoch 5 | Step 6100 | loss = 0.187, acc = 0.927


 78%|███████▊  | 6199/7923 [37:03<11:06,  2.59it/s]

Epoch 5 | Step 6200 | loss = 0.213, acc = 0.915


 80%|███████▉  | 6299/7923 [37:39<10:44,  2.52it/s]

Epoch 5 | Step 6300 | loss = 0.187, acc = 0.907


 81%|████████  | 6399/7923 [38:14<10:01,  2.53it/s]

Epoch 5 | Step 6400 | loss = 0.261, acc = 0.895


 82%|████████▏ | 6499/7923 [38:50<09:31,  2.49it/s]

Epoch 5 | Step 6500 | loss = 0.223, acc = 0.900


 83%|████████▎ | 6599/7923 [39:26<08:54,  2.48it/s]

Epoch 5 | Step 6600 | loss = 0.210, acc = 0.920


 85%|████████▍ | 6699/7923 [40:03<08:10,  2.49it/s]

Epoch 5 | Step 6700 | loss = 0.196, acc = 0.917


 86%|████████▌ | 6799/7923 [40:39<07:38,  2.45it/s]

Epoch 5 | Step 6800 | loss = 0.205, acc = 0.917


 87%|████████▋ | 6899/7923 [41:15<06:49,  2.50it/s]

Epoch 5 | Step 6900 | loss = 0.244, acc = 0.892


 88%|████████▊ | 6999/7923 [41:51<06:13,  2.47it/s]

Epoch 5 | Step 7000 | loss = 0.225, acc = 0.912


 90%|████████▉ | 7099/7923 [42:26<05:33,  2.47it/s]

Epoch 5 | Step 7100 | loss = 0.251, acc = 0.910


 91%|█████████ | 7199/7923 [43:02<04:45,  2.53it/s]

Epoch 5 | Step 7200 | loss = 0.297, acc = 0.890


 92%|█████████▏| 7299/7923 [43:37<04:01,  2.58it/s]

Epoch 5 | Step 7300 | loss = 0.224, acc = 0.895


 93%|█████████▎| 7399/7923 [44:13<03:26,  2.54it/s]

Epoch 5 | Step 7400 | loss = 0.129, acc = 0.915


 95%|█████████▍| 7499/7923 [44:49<02:51,  2.47it/s]

Epoch 5 | Step 7500 | loss = 0.245, acc = 0.907


 96%|█████████▌| 7599/7923 [45:25<02:10,  2.49it/s]

Epoch 5 | Step 7600 | loss = 0.278, acc = 0.875


 97%|█████████▋| 7699/7923 [46:01<01:32,  2.43it/s]

Epoch 5 | Step 7700 | loss = 0.181, acc = 0.922


 98%|█████████▊| 7799/7923 [46:37<00:49,  2.49it/s]

Epoch 5 | Step 7800 | loss = 0.195, acc = 0.907


100%|█████████▉| 7899/7923 [47:14<00:09,  2.42it/s]

Epoch 5 | Step 7900 | loss = 0.241, acc = 0.912


100%|██████████| 7923/7923 [47:22<00:00,  2.79it/s]


Evaluating Dev Set ...


  6%|▌         | 228/4131 [01:00<17:40,  3.68it/s]

found [UNK] in prediction.
original pred: 李 [UNK]
final prediction 李杲


  9%|▉         | 390/4131 [01:43<16:22,  3.81it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 10%|▉         | 393/4131 [01:44<14:31,  4.29it/s]

found [UNK] in prediction.
original pred: [UNK] 崎 八 幡 宮
final prediction 筥崎八幡宮


 12%|█▏        | 504/4131 [02:13<14:40,  4.12it/s]

found [UNK] in prediction.
original pred: 與 慕 容 [UNK] 雙 方 不 和
final prediction 與慕容廆雙方不和


 17%|█▋        | 700/4131 [03:04<13:29,  4.24it/s]

found [UNK] in prediction.
original pred: 對 日 本 報 紙 的 無 恥 造 謠 誣 [UNK] ， 進 行 了 有 力 駁 斥
final prediction 對日本報紙的無恥造謠誣衊，進行了有力駁斥


 20%|██        | 842/4131 [03:41<12:58,  4.23it/s]

found [UNK] in prediction.
original pred: 杜 恆 - [UNK] 因 論 題
final prediction 杜恆-蒯因論題


 24%|██▍       | 984/4131 [04:19<13:47,  3.80it/s]

found [UNK] in prediction.
original pred: 青 翁 三 足 [UNK]
final prediction 青翁三足缶


 34%|███▎      | 1384/4131 [06:06<10:58,  4.17it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 34%|███▍      | 1401/4131 [06:11<12:58,  3.50it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK]
final prediction 朱允炆


 35%|███▍      | 1443/4131 [06:23<13:08,  3.41it/s]

found [UNK] in prediction.
original pred: 劉 [UNK]
final prediction 劉炟


 37%|███▋      | 1512/4131 [06:41<11:44,  3.72it/s]

found [UNK] in prediction.
original pred: 為 免 再 次 爆 發 內 [UNK]
final prediction 為免再次爆發內訌


 45%|████▍     | 1846/4131 [08:11<08:21,  4.56it/s]

found [UNK] in prediction.
original pred: [UNK] 船
final prediction 艚船


 53%|█████▎    | 2205/4131 [09:45<08:07,  3.95it/s]

found [UNK] in prediction.
original pred: 朱 載 [UNK]
final prediction 朱載堉


 55%|█████▍    | 2263/4131 [10:00<07:52,  3.96it/s]

found [UNK] in prediction.
original pred: 學 習 訓 [UNK] 學
final prediction 那裏學習訓


 58%|█████▊    | 2400/4131 [10:36<06:34,  4.39it/s]

found [UNK] in prediction.
original pred: 朱 允 [UNK] 的 禁 殺 之 旨
final prediction 朱允炆的禁殺之旨


 64%|██████▍   | 2662/4131 [11:44<06:40,  3.67it/s]

found [UNK] in prediction.
original pred: 李 端 [UNK]
final prediction 李端棻


 67%|██████▋   | 2752/4131 [12:08<05:38,  4.08it/s]

found [UNK] in prediction.
original pred: w. v. [UNK] 因
final prediction W.V.蒯因


 71%|███████   | 2937/4131 [12:58<06:47,  2.93it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 77%|███████▋  | 3168/4131 [13:57<02:59,  5.37it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 蛞蝓


 90%|█████████ | 3718/4131 [16:02<01:15,  5.50it/s]

found [UNK] in prediction.
original pred: 拓 跋 [UNK]
final prediction 拓跋燾


 96%|█████████▋| 3982/4131 [16:59<00:32,  4.54it/s]

found [UNK] in prediction.
original pred: 到 霍 山 陳 梅 婷 那 裏 學 習 訓 [UNK] 學
final prediction 到霍山陳梅婷那裏學習訓詁學


100%|██████████| 4131/4131 [17:33<00:00,  3.92it/s]


Validation | Epoch 5 | acc = 0.798
Saving Model ...


## Testing

In [16]:
print("Evaluating Test Set ...")

result = []

model.eval()
with torch.no_grad():
    for i, data in enumerate(tqdm(test_loader)):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output, doc_stride, test_paragraphs[test_questions[i]['paragraph_id']],
                               test_paragraphs_tokenized[test_questions[i]['paragraph_id']].tokens))

result_file = "result0422.csv"
with open(result_file, 'w') as f:	
	  f.write("ID,Answer\n")
	  for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
		    f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")

Evaluating Test Set ...


  5%|▌         | 252/4957 [01:05<19:36,  4.00it/s]

found [UNK] in prediction.
original pred: 溥 [UNK]
final prediction 溥儁


  7%|▋         | 341/4957 [01:27<22:40,  3.39it/s]

found [UNK] in prediction.
original pred: [UNK] 人 國 ； 公 元 前 28 年 ， 滅 北 沃 沮
final prediction 荇人國；公元前28年，滅北沃沮


 10%|▉         | 492/4957 [02:05<14:07,  5.27it/s]

found [UNK] in prediction.
original pred: [UNK] [UNK]
final prediction 濊貊


 11%|█▏        | 564/4957 [02:23<14:50,  4.93it/s]

found [UNK] in prediction.
original pred: 馬 [UNK]
final prediction 馬馼


 13%|█▎        | 636/4957 [02:43<19:42,  3.65it/s]

found [UNK] in prediction.
original pred: 東 晉 常 [UNK]
final prediction ，東晉常


 19%|█▉        | 939/4957 [04:02<15:05,  4.44it/s]

found [UNK] in prediction.
original pred: [UNK] 稻
final prediction 秈稻


 20%|██        | 993/4957 [04:16<14:12,  4.65it/s]

found [UNK] in prediction.
original pred: 白 [UNK] 紀 滅 絕 事 件
final prediction 白堊紀滅絕事件


 29%|██▊       | 1424/4957 [06:07<13:05,  4.50it/s]

found [UNK] in prediction.
original pred: 抗 佝 [UNK] 病
final prediction 抗佝僂病


 31%|███       | 1535/4957 [06:36<14:19,  3.98it/s]

found [UNK] in prediction.
original pred: 杭 州 [UNK] 橋 機 場
final prediction 襲杭州筧橋機


 31%|███▏      | 1558/4957 [06:42<12:29,  4.53it/s]

found [UNK] in prediction.
original pred: 蔡 [UNK]
final prediction 蔡鍔


 34%|███▍      | 1681/4957 [07:13<13:09,  4.15it/s]

found [UNK] in prediction.
original pred: 丁 [UNK]
final prediction 丁旿


 35%|███▌      | 1751/4957 [07:32<16:04,  3.32it/s]

found [UNK] in prediction.
original pred: 隋 [UNK] 帝
final prediction 隋煬帝


 39%|███▊      | 1914/4957 [08:14<15:39,  3.24it/s]

found [UNK] in prediction.
original pred: 胡 季 [UNK]
final prediction 。其中


 41%|████      | 2043/4957 [08:48<11:01,  4.41it/s]

found [UNK] in prediction.
original pred: 其 英 文 縮 寫 首 字 母 為 「 [UNK] · ㄎㄟ · ㄨㄞ 」
final prediction 其英文縮寫首字母為「ㄟㄙ·ㄎㄟ·ㄨㄞ


 48%|████▊     | 2400/4957 [10:18<10:28,  4.07it/s]

found [UNK] in prediction.
original pred: 梁 [UNK]
final prediction 梁鵠


 51%|█████     | 2514/4957 [10:48<08:51,  4.60it/s]

found [UNK] in prediction.
original pred: [UNK] 靼 海 峽
final prediction 韃靼海峽


 53%|█████▎    | 2621/4957 [11:16<09:30,  4.09it/s]

found [UNK] in prediction.
original pred: 白 [UNK] 紀 末 滅 絕 事 件
final prediction 白堊紀末滅絕事件


 56%|█████▌    | 2769/4957 [11:55<08:50,  4.12it/s]

found [UNK] in prediction.
original pred: 侏 [UNK] 紀
final prediction 侏儸紀


 61%|██████    | 3027/4957 [13:02<09:26,  3.41it/s]

found [UNK] in prediction.
original pred: 克 里 米 亞 [UNK] 靼 人
final prediction 克里米亞韃靼人


 68%|██████▊   | 3370/4957 [14:32<06:12,  4.26it/s]

found [UNK] in prediction.
original pred: 白 [UNK] 紀 中 期
final prediction 白堊紀中期


 73%|███████▎  | 3604/4957 [15:26<04:03,  5.56it/s]

found [UNK] in prediction.
original pred: 白 [UNK] 紀
final prediction 白堊紀


 73%|███████▎  | 3608/4957 [15:27<04:05,  5.49it/s]

found [UNK] in prediction.
original pred: 白 [UNK] 紀
final prediction 白堊紀


100%|██████████| 4957/4957 [21:09<00:00,  3.91it/s]

Completed! Result is in result0422.csv



