<a href="https://colab.research.google.com/github/Offliners/writeup/blob/main/HW7/homework7_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework 7 - Bert (Question Answering)**

If you have any questions, feel free to email us at ntu-ml-2021spring-ta@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1aQoWogAQo_xVJvMQMrGaYiWzuyfO0QyLLAhiMwFyS2w)　Kaggle: [Link](https://www.kaggle.com/c/ml2021-spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine tune a pretrained model on downstream task using transformers

- Todo
    - Fine tune a pretrained chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (tesla t4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2hrs
  

## Download Dataset

In [1]:
# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Tue May 18 16:04:29 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Download link 1
!gdown --id '1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1' --output hw7_data.zip

# Download Link 2 (if the above link fails) 
# !gdown --id '1pOu3FdPdvzielUZyggeD7KDnVy9iW1uC' --output hw7_data.zip

!unzip -o hw7_data.zip

Downloading...
From: https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1
To: /content/hw7_data.zip
0.00B [00:00, ?B/s]7.71MB [00:00, 67.9MB/s]
Archive:  hw7_data.zip
  inflating: hw7_dev.json            
  inflating: hw7_test.json           
  inflating: hw7_train.json          


## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [3]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

Collecting transformers==4.5.0
[?25l  Downloading https://files.pythonhosted.org/packages/81/91/61d69d58a1af1bd81d9ca9d62c90a6de3ab80d77f27c5df65d9a2c1f5626/transformers-4.5.0-py3-none-any.whl (2.1MB)
[K     |████████████████████████████████| 2.2MB 8.7MB/s 
[?25hCollecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
[K     |████████████████████████████████| 901kB 46.9MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 50.4MB/s 
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.0


## Import Packages

In [4]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset, ConcatDataset
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast
from transformers import AutoTokenizer, AutoModel
from tqdm.auto import tqdm
from transformers import get_linear_schedule_with_warmup
from random import randint

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
	  torch.manual_seed(seed)
	  if torch.cuda.is_available():
		    torch.cuda.manual_seed(seed)
		    torch.cuda.manual_seed_all(seed)
	  np.random.seed(seed)
	  random.seed(seed)
	  torch.backends.cudnn.benchmark = False
	  torch.backends.cudnn.deterministic = True
same_seeds(0)

In [5]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)	
fp16_training = True

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

Collecting accelerate==0.2.0
[?25l  Downloading https://files.pythonhosted.org/packages/60/c6/6f08def78c19e328335236ec283a7c70e73913d1ed6f653ce2101bfad139/accelerate-0.2.0-py3-none-any.whl (47kB)
[K     |███████                         | 10kB 11.2MB/s eta 0:00:01[K     |█████████████▉                  | 20kB 1.5MB/s eta 0:00:01[K     |████████████████████▉           | 30kB 2.2MB/s eta 0:00:01[K     |███████████████████████████▊    | 40kB 2.9MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 2.5MB/s 
Collecting pyaml>=20.4.0
  Downloading https://files.pythonhosted.org/packages/15/c4/1310a054d33abc318426a956e7d6df0df76a6ddfa9c66f6310274fb75d42/pyaml-20.4.0-py2.py3-none-any.whl
Installing collected packages: pyaml, accelerate
Successfully installed accelerate-0.2.0 pyaml-20.4.0


## Load Model and Tokenizer




 

In [6]:
# model = BertForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
  
tokenizer = AutoTokenizer.from_pretrained("nyust-eb210/braslab-bert-drcd-384")

model = AutoModelForQuestionAnswering.from_pretrained("nyust-eb210/braslab-bert-drcd-384")

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=804.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=109540.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=286.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=406799727.0, style=ProgressStyle(descri…




## Read Data

- Training set: 26935 QA pairs
- Dev set: 3523  QA pairs
- Test set: 3492  QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indexs in paragraphs
  - A paragraph may be used by several questions 

In [7]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

In [8]:
# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in datset __getitem__ 

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message as tokenized sequences will be futher processed in datset __getitem__ before passing to model

Token indices sequence length is longer than the specified maximum sequence length for this model (570 > 512). Running this sequence through the model will result in indexing errors


## Dataset and Dataloader

In [13]:
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 100
        self.max_paragraph_len = 384
        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 128

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # A single window is obtained by slicing the portion of paragraph containing the answer
            # mid = (answer_start_token + answer_end_token) // 2
            # paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))
            if answer_end_token - answer_start_token <= 0:
                paragraph_start = 0
            else:
                pos = randint(answer_start_token, answer_end_token)
                paragraph_start = max(0, min(pos - int(self.max_paragraph_len * (pos - answer_start_token) / (answer_end_token - answer_start_token)), len(tokenized_paragraph) - self.max_paragraph_len))
            paragraph_end = paragraph_start + self.max_paragraph_len

            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]		
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start

            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 12

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
# temp_set = ConcatDataset([train_set, dev_set])
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)

## Training

In [14]:
num_epoch = 3
validation = False
logging_step = 100
learning_rate = 3e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_loader) * num_epoch
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 

model.train()

print("Start Training ...")

for epoch in range(num_epoch):
    step = 1
    train_loss = train_acc = 0
    
    for data in tqdm(train_loader):	
        # Load all data into GPU
        data = [i.to(device) for i in data]
        
        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)
        
        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
        train_loss += output.loss
        
        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()
        
        optimizer.step()
        optimizer.zero_grad()
        step += 1

        ##### TODO: Apply linear learning rate decay #####
        scheduler.step()

        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
            train_loss = train_acc = 0

    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                dev_acc += evaluate(data, output) == dev_questions[i]["answer_text"]
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

# Save a model and its configuration file to the directory 「saved_model」 
# i.e. there are two files under the direcory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print("Saving Model ...")
model_save_dir = "saved_model" 
model.save_pretrained(model_save_dir)

Start Training ...


HBox(children=(FloatProgress(value=0.0, max=2245.0), HTML(value='')))



Epoch 1 | Step 100 | loss = 1.487, acc = 0.495
Epoch 1 | Step 200 | loss = 0.843, acc = 0.655
Epoch 1 | Step 300 | loss = 0.809, acc = 0.694
Epoch 1 | Step 400 | loss = 0.750, acc = 0.687
Epoch 1 | Step 500 | loss = 0.676, acc = 0.718
Epoch 1 | Step 600 | loss = 0.698, acc = 0.701
Epoch 1 | Step 700 | loss = 0.671, acc = 0.716
Epoch 1 | Step 800 | loss = 0.669, acc = 0.739
Epoch 1 | Step 900 | loss = 0.609, acc = 0.743
Epoch 1 | Step 1000 | loss = 0.602, acc = 0.731
Epoch 1 | Step 1100 | loss = 0.589, acc = 0.747
Epoch 1 | Step 1200 | loss = 0.559, acc = 0.759
Epoch 1 | Step 1300 | loss = 0.615, acc = 0.739
Epoch 1 | Step 1400 | loss = 0.561, acc = 0.767
Epoch 1 | Step 1500 | loss = 0.573, acc = 0.751
Epoch 1 | Step 1600 | loss = 0.557, acc = 0.756
Epoch 1 | Step 1700 | loss = 0.570, acc = 0.757
Epoch 1 | Step 1800 | loss = 0.592, acc = 0.765
Epoch 1 | Step 1900 | loss = 0.570, acc = 0.765
Epoch 1 | Step 2000 | loss = 0.549, acc = 0.777
Epoch 1 | Step 2100 | loss = 0.540, acc = 0.768
E

HBox(children=(FloatProgress(value=0.0, max=2245.0), HTML(value='')))

Epoch 2 | Step 100 | loss = 0.285, acc = 0.855
Epoch 2 | Step 200 | loss = 0.329, acc = 0.847
Epoch 2 | Step 300 | loss = 0.273, acc = 0.861
Epoch 2 | Step 400 | loss = 0.309, acc = 0.855
Epoch 2 | Step 500 | loss = 0.308, acc = 0.857
Epoch 2 | Step 600 | loss = 0.326, acc = 0.833
Epoch 2 | Step 700 | loss = 0.303, acc = 0.842
Epoch 2 | Step 800 | loss = 0.298, acc = 0.854
Epoch 2 | Step 900 | loss = 0.312, acc = 0.869
Epoch 2 | Step 1000 | loss = 0.315, acc = 0.849
Epoch 2 | Step 1100 | loss = 0.280, acc = 0.852
Epoch 2 | Step 1200 | loss = 0.280, acc = 0.857
Epoch 2 | Step 1300 | loss = 0.312, acc = 0.852
Epoch 2 | Step 1400 | loss = 0.236, acc = 0.873
Epoch 2 | Step 1500 | loss = 0.293, acc = 0.852
Epoch 2 | Step 1600 | loss = 0.300, acc = 0.850
Epoch 2 | Step 1700 | loss = 0.298, acc = 0.857
Epoch 2 | Step 1800 | loss = 0.266, acc = 0.868
Epoch 2 | Step 1900 | loss = 0.262, acc = 0.873
Epoch 2 | Step 2000 | loss = 0.280, acc = 0.850
Epoch 2 | Step 2100 | loss = 0.281, acc = 0.857
E

HBox(children=(FloatProgress(value=0.0, max=2245.0), HTML(value='')))

Epoch 3 | Step 100 | loss = 0.137, acc = 0.909
Epoch 3 | Step 200 | loss = 0.158, acc = 0.914
Epoch 3 | Step 300 | loss = 0.170, acc = 0.913
Epoch 3 | Step 400 | loss = 0.157, acc = 0.918
Epoch 3 | Step 500 | loss = 0.159, acc = 0.904
Epoch 3 | Step 600 | loss = 0.164, acc = 0.912
Epoch 3 | Step 700 | loss = 0.168, acc = 0.912
Epoch 3 | Step 800 | loss = 0.168, acc = 0.907
Epoch 3 | Step 900 | loss = 0.150, acc = 0.923
Epoch 3 | Step 1000 | loss = 0.156, acc = 0.920
Epoch 3 | Step 1100 | loss = 0.115, acc = 0.930
Epoch 3 | Step 1200 | loss = 0.146, acc = 0.922
Epoch 3 | Step 1300 | loss = 0.148, acc = 0.927
Epoch 3 | Step 1400 | loss = 0.170, acc = 0.901
Epoch 3 | Step 1500 | loss = 0.171, acc = 0.907
Epoch 3 | Step 1600 | loss = 0.180, acc = 0.914
Epoch 3 | Step 1700 | loss = 0.156, acc = 0.926
Epoch 3 | Step 1800 | loss = 0.155, acc = 0.915
Epoch 3 | Step 1900 | loss = 0.127, acc = 0.917
Epoch 3 | Step 2000 | loss = 0.144, acc = 0.912
Epoch 3 | Step 2100 | loss = 0.153, acc = 0.922
E

## Function for Evaluation

In [15]:
def evaluate(data, output, index):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    
    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)
        
        # Probability of answer is calculated as sum of start_prob and end_prob
        prob = start_prob + end_prob
        
        # Replace answer if calculated probability is larger than previous windows
        if prob > max_prob:
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
            if answer.find('[UNK]') != -1:
                countUNK = answer.count('[UNK]')
                flag = 0
                print(f'ID {index} Before : {answer}')
                while answer.find('[UNK]') != -1:
                    window_len = end_index - start_index + 10
                    start = 0
                    end = window_len
                    if answer.find('[UNK]') > 1:
                        target1 = answer[answer.find('[UNK]') - 2]
                    else:
                        target1 = -1
                    if len(answer) - (answer.find('[UNK]') + len('[UNK]')) > 0:
                        target2 = answer[answer.find('[UNK]') + len('[UNK]') + 1]
                    else:
                        target2 = -1
                    count = 0
                    for i in range(len(test_paragraphs[test_questions[index]['paragraph_id']])):
                        if end > len(test_paragraphs[test_questions[index]['paragraph_id']]):
                            final = len(test_paragraphs[test_questions[index]['paragraph_id']])
                        else:
                            final = end + i + 1
                        para = test_paragraphs[test_questions[index]['paragraph_id']][start+i: final]
                        if target1 != -1 and target2 != -1:
                            if para.find(target1) != -1 and para.find(target2) != -1:
                                target = para[para.find(target1) + 1: para.find(target2)]
                                print(f'Target : {target}')
                                answer = answer.replace('[UNK]', target)
                                print(f'ID {index} After  : {answer}')
                                count += 1
                        elif target1 != -1 and target2 == -1:
                            if para.find(target1) != -1 and para.find(target1) != len(para) - 1:
                                target = para[para.find(target1) + 1]
                                print(f'Target : {target}')
                                answer = answer.replace('[UNK]', target)
                                print(f'ID {index} After  : {answer}')
                                count += 1
                        elif target1 == -1 and target2 != -1:
                            if para.find(target2) != -1 and para.find(target2) != 0:
                                target = para[para.find(target2) - 1]
                                print(f'Target After: {target}')
                                answer = answer.replace('[UNK]', target)
                                print(f'ID {index} After  : {answer}')
                                count += 1
                        else:
                            answer = answer.replace('[UNK]', para[answer.find('[UNK]')])
                            flag = 1
                            break
                        if count >= countUNK:
                            break
                        if i == len(test_paragraphs[test_questions[index]['paragraph_id']]) - 1:
                            flag = 1
                    if flag == 1:
                      break
                        
    
    # Remove spaces in answer (e.g. "大 金" --> "大金")
    return answer.replace(' ','')

In [16]:
test_paragraphs[test_questions[250]['paragraph_id']]

'1900年1月27日上海電報局總辦經元善領銜通電要求光緒皇帝「力疾臨御，勿存退位之思」；簽名者有葉瀚、馬裕藻、章炳麟、汪貽年、丁惠康、沈藎，唐才常、經亨頤、蔡元培、黃炎培等1231人；經元善等1231人同時發表《布告各省公啟》，要求各省共同力爭，「如朝廷不理，則請我諸工商通行罷市集議」。各國公使認為立儲事件影響中國形勢穩定，隨之提出警告，拒絕入賀。慈禧太后對列強怨恨甚深，載漪等人對西方列強及光緒帝更為仇恨。歷史學家唐德剛支持宮廷權力鬥爭是義和團運動激化的其中一個原因的觀點。唐德剛將惇親王載濂、端郡王載漪、輔國公載瀾、莊親王載勛四名同族兄弟比作四人幫，將剛毅比作林彪，將義和團比喻為紅衛兵。載字輩四名同族兄弟、剛毅及其一幫扶助義和團的大臣如趙舒翹、毓賢、董福祥等，利用義和團的民間力量及慈禧太后對洋人又怕又恨的心態，排斥光緒帝等帝黨和打擊洋人勢力。在多次御前會議上，他們當眾羞辱光緒帝及主和大臣，溥儁甚至直斥光緒為二毛子。'

In [17]:
# Postprocessing

def postprocessing(result, index):
    if result:
        if result[0] == '，' or result[0] == '。':
            print(f'ID {index} Before : {result}')
            result = result[1:]
            print(f'ID {index} After  : {result}')

        if result[0] == '《' and result[-1] != '》' and result.find('》') == -1:
            print(f'ID {index} Before : {result}')
            result = result + '》'
            print(f'ID {index} After  : {result}')
        
        if result[0] != '《' and result[-1] == '》' and result.find('《') == -1:
            print(f'ID {index} Before : {result}')
            result = '《' + result  
            print(f'ID {index} After  : {result}')
        
    return result

## Testing

In [18]:
print("Evaluating Test Set ...")

result = []

model.eval()
with torch.no_grad():
    for i, data in enumerate(tqdm(test_loader)):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output, i))

result_file = "result.csv"
with open(result_file, 'w') as f:	
	  f.write("ID,Answer\n")
	  for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
                result[i] = postprocessing(result[i], i)
                f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")

Evaluating Test Set ...


HBox(children=(FloatProgress(value=0.0, max=3493.0), HTML(value='')))

ID 20 Before : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 [UNK] 、 [UNK] 、 [UNK] [UNK] 或 [UNK] ， 因 此 在 漢 語 中 愛 爾 蘭 語
Target : Irish
ID 20 After  : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 Irish 、 Irish 、 Irish Irish 或 Irish ， 因 此 在 漢 語 中 愛 爾 蘭 語
Target : Irish
ID 20 After  : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 Irish 、 Irish 、 Irish Irish 或 Irish ， 因 此 在 漢 語 中 愛 爾 蘭 語
Target : Irish
ID 20 After  : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 Irish 、 Irish 、 Irish Irish 或 Irish ， 因 此 在 漢 語 中 愛 爾 蘭 語
Target : Irish
ID 20 After  : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 Irish 、 Irish 、 Irish Irish 或 Irish ， 因 此 在 漢 語 中 愛 爾 蘭 語
Target : Irish
ID 20 After  : 愛 爾 蘭 語 ， 在 英 語 中 也 稱 為 Irish 、 Irish 、 Irish Irish 或 Irish ， 因 此 在 漢 語 中 愛 爾 蘭 語
ID 49 Before : 自 大 型 購 物 中 心 [UNK] 開 幕 後
Target : MegaBox
ID 49 After  : 自 大 型 購 物 中 心 MegaBox 開 幕 後
ID 250 Before : 溥 [UNK]
Target : 儁
ID 250 After  : 溥 儁
ID 250 Before : 溥 [UNK]
Target : 儁
ID 250 After  : 溥 儁
ID 332 Before : 目 前 沒 有 觀 察 到 任 何 語 言 純 [UNK] 以 力 道 來 區 分 不 同 輔 音
Target : 綷
ID 332 After  : 目 前 沒 有 觀 察 到 任 何 語 言 純 綷 以 力 道 來 區 分 不

In [19]:
from google.colab import files
files.download("result.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>