<a href="https://colab.research.google.com/github/Offliners/writeup/blob/main/HW7/homework7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework 7 - BERT (Question Answering)**

If you have any questions, feel free to email us at ntu-ml-2021spring-ta@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1aQoWogAQo_xVJvMQMrGaYiWzuyfO0QyLLAhiMwFyS2w)　Kaggle: [Link](https://www.kaggle.com/c/ml2021-spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8 mins
    - Medium: 8 mins
    - Strong: 25 mins
    - Boss: 2 hrs
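
The "linear learning rate decay" item above can be pictured with a small sketch. It assumes the usual shape of a linear schedule with warmup (the learning rate multiplier rises linearly from 0 to 1 during warmup, then falls linearly to 0 at the last step). The function name `linear_schedule_factor` is hypothetical and is not part of transformers; the notebook itself relies on `get_linear_schedule_with_warmup`.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 at the last training step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5      # same base learning rate as the training cell
total_steps = 1000  # illustrative value, not the notebook's real step count
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_schedule_factor(step, 100, total_steps))
```

With a warmup of 100 steps, the learning rate peaks at step 100 and is back to half of its peak at step 550 in this toy setting.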
  

## Download Dataset

In [1]:
# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Mon May 17 08:25:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Download link 1
!gdown --id '1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1' --output hw7_data.zip

# Download Link 2 (if the above link fails) 
# !gdown --id '1pOu3FdPdvzielUZyggeD7KDnVy9iW1uC' --output hw7_data.zip

!unzip -o hw7_data.zip

Downloading...
From: https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1
To: /content/hw7_data.zip
0.00B [00:00, ?B/s]7.71MB [00:00, 67.7MB/s]
Archive:  hw7_data.zip
  inflating: hw7_dev.json            
  inflating: hw7_test.json           
  inflating: hw7_train.json          


## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [3]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

Collecting transformers==4.5.0
  Downloading https://files.pythonhosted.org/packages/81/91/61d69d58a1af1bd81d9ca9d62c90a6de3ab80d77f27c5df65d9a2c1f5626/transformers-4.5.0-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.0


## Import Packages

In [4]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast
from transformers import AutoTokenizer, AutoModel
from tqdm.auto import tqdm
from transformers import get_linear_schedule_with_warmup
from random import randint
from google.colab import files

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

In [5]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)	
fp16_training = True

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

Collecting accelerate==0.2.0
  Downloading https://files.pythonhosted.org/packages/60/c6/6f08def78c19e328335236ec283a7c70e73913d1ed6f653ce2101bfad139/accelerate-0.2.0-py3-none-any.whl (47kB)
Collecting pyaml>=20.4.0
  Downloading https://files.pythonhosted.org/packages/15/c4/1310a054d33abc318426a956e7d6df0df76a6ddfa9c66f6310274fb75d42/pyaml-20.4.0-py2.py3-none-any.whl
Installing collected packages: pyaml, accelerate
Successfully installed accelerate-0.2.0 pyaml-20.4.0


## Load Model and Tokenizer




 

In [6]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("nyust-eb210/braslab-bert-drcd-384")
model = AutoModelForQuestionAnswering.from_pretrained("nyust-eb210/braslab-bert-drcd-384")

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)


## Read Data

- Training set: 26935 QA pairs
- Dev set: 3523 QA pairs
- Test set: 3492 QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indices in paragraphs
  - A paragraph may be used by several questions 
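
As a rough illustration of the layout described above, the sketch below builds a toy record and recovers the answer from its character offsets. The contents are made up, the helper name `answer_from_offsets` is hypothetical, and it assumes `answer_end` is the index of the last answer character (inclusive), which is how the training code treats it.

```python
# Toy data in the same layout as hw7_*.json (contents are made up for illustration).
data = {
    "questions": [
        {"id": 0, "paragraph_id": 1, "question_text": "Where is NTU?",
         "answer_text": "Taipei", "answer_start": 10, "answer_end": 15},
    ],
    "paragraphs": [
        "An unrelated paragraph.",
        "NTU is in Taipei, Taiwan.",
    ],
}


def answer_from_offsets(question, paragraphs):
    # paragraph_id is an index into the list of paragraphs
    paragraph = paragraphs[question["paragraph_id"]]
    # answer_end is assumed to be inclusive, hence the + 1
    return paragraph[question["answer_start"]: question["answer_end"] + 1]


question = data["questions"][0]
print(answer_from_offsets(question, data["paragraphs"]) == question["answer_text"])
```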

In [7]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

In [8]:
pass

## Dataset and Dataloader

In [9]:
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 100
        self.max_paragraph_len = 384
        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 128

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

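            # A single window is obtained by slicing a portion of the paragraph that contains the answer.
            # The window start is drawn at random among the positions that keep the whole answer inside,
            # so the answer does not always sit at the same place in the window.
            latest_start = max(0, min(answer_start_token, len(tokenized_paragraph) - self.max_paragraph_len))
            earliest_start = max(0, min(answer_end_token - self.max_paragraph_len + 1, latest_start))
            paragraph_start = randint(earliest_start, latest_start)
            paragraph_end = paragraph_start + self.max_paragraph_len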

            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]		
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start

            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask
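
To see what `doc_stride` does at validation and test time, here is a minimal sketch that lists the token ranges of the windows produced by the loop `range(0, len(tokenized_paragraph), self.doc_stride)` combined with the slice `[i : i + self.max_paragraph_len]`. The helper `window_bounds` is hypothetical and only mirrors that slicing; it does not touch the tokenizer or the model.

```python
def window_bounds(paragraph_len, max_paragraph_len=384, doc_stride=128):
    """Return the (start, end) token offsets of each evaluation window."""
    return [
        (start, min(start + max_paragraph_len, paragraph_len))
        for start in range(0, paragraph_len, doc_stride)
    ]


print(window_bounds(500))
# [(0, 384), (128, 500), (256, 500), (384, 500)]
```

A smaller `doc_stride` yields more, heavily overlapping windows (slower inference, but an answer is less likely to be cut by a window boundary), while a stride larger than `max_paragraph_len` would leave gaps.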

## Function for Evaluation

In [10]:
def evaluate(data, output):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    
    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)
        
        # Probability of answer is calculated as sum of start_prob and end_prob
        prob = start_prob + end_prob
        
        # Replace answer if calculated probability is larger than previous windows
        if prob > max_prob:
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
    
    # Remove spaces in answer (e.g. "大 金" --> "大金")
    return answer.replace(' ','')
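
One possible direction for the postprocessing TODO: taking the argmax of the start and end logits independently can give an end position that lies before the start, which decodes to an empty answer. The sketch below searches only spans with `start <= end` and a bounded length. It works on plain Python lists rather than tensors, the name `best_span` and the limit `max_answer_len=30` are assumptions for illustration, and it is not the notebook's own `evaluate`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring span with start <= end."""
    best = (float("-inf"), 0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best


start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [4.0, 0.2, 3.0, 1.0]
# Independent argmax would pick start=3 and end=0, an invalid span.
print(best_span(start_logits, end_logits))
```

In the notebook's setting the same search would be run per window, and positions that fall in the question or padding part of the input could be skipped as well.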

In [11]:
# Postprocessing

def postprocessing(result, index):
    # Heuristic clean-up of the predicted answer for test question `index`
    if result:
        paragraph = test_paragraphs[test_questions[index]['paragraph_id']]
        unk = '[UNK]'
        pos = result.find(unk)

        # Try to recover characters that the tokenizer mapped to [UNK] from the paragraph text
        if pos != -1:
            print(f'ID {index} Before : {result}')
            prefix = result[:pos]
            suffix = result[pos + len(unk) + 1:]
            if prefix != '' and suffix != '':
                # Take the paragraph text between the prefix and the suffix
                start = paragraph.find(prefix) + len(prefix)
                end = paragraph.find(suffix)
                target = paragraph[start: end - 1]

                if len(target) < 10:
                    print(f'Target: {target}')
                else:
                    # Fall back to the character that follows the one before [UNK]
                    start = paragraph.find(result[pos - 1])
                    target = paragraph[start + 1]
            else:
                start = paragraph.find(result[pos - 1])
                target = paragraph[start + 1]
            result = result.replace(unk, target)
            print(f'ID {index} After  : {result}')

        # Strip a leading comma or full stop
        if result and result[0] in ('，', '。'):
            print(f'ID {index} Before : {result}')
            result = result[1:]
            print(f'ID {index} After  : {result}')

        # Close an unbalanced opening title mark
        if result and result[0] == '《' and '》' not in result:
            print(f'ID {index} Before : {result}')
            result = result + '》'
            print(f'ID {index} After  : {result}')

        # Open an unbalanced closing title mark
        if result and result[-1] == '》' and '《' not in result:
            print(f'ID {index} Before : {result}')
            result = '《' + result
            print(f'ID {index} After  : {result}')

    return result

## Training

In [12]:
num_epoch = 3
validation = True
logging_step = 100
learning_rate = 3e-5
ensemble = 3

import gc
print("Start Training ...")

for num in range(ensemble):
    same_seeds(num)
    print(f"Train {num} :")
    # Tokenize questions and paragraphs separately
    # 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in dataset __getitem__

    train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
    dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
    test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

    train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
    dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
    test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

    # You can safely ignore the warning message as tokenized sequences will be further processed in dataset __getitem__ before passing to model

    train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
    dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
    test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

    train_batch_size = 12

    # Note: Do NOT change batch size of dev_loader / test_loader !
    # Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
    train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
    dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
    test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)

    optimizer = AdamW(model.parameters(), lr=learning_rate)
    total_steps = len(train_loader) * num_epoch
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=total_steps)

    if fp16_training:
        model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 
    
    model.train()
    for epoch in range(num_epoch):
        step = 1
        train_loss = train_acc = 0
        
        for data in tqdm(train_loader):	
            # Load all data into GPU
            data = [i.to(device) for i in data]
            
            # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
            # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
            output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

            # Choose the most probable start position / end position
            start_index = torch.argmax(output.start_logits, dim=1)
            end_index = torch.argmax(output.end_logits, dim=1)
            
            # Prediction is correct only if both start_index and end_index are correct
            train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
            train_loss += output.loss
            
            if fp16_training:
                accelerator.backward(output.loss)
            else:
                output.loss.backward()
            
            optimizer.step()
            optimizer.zero_grad()
            step += 1

            ##### TODO: Apply linear learning rate decay #####
            scheduler.step()

            # Print training loss and accuracy over past logging step
            if step % logging_step == 0:
                print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
                train_loss = train_acc = 0

        if validation:
            print("Evaluating Dev Set ...")
            model.eval()
            with torch.no_grad():
                dev_acc = 0
                for i, data in enumerate(tqdm(dev_loader)):
                    output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                          attention_mask=data[2].squeeze(dim=0).to(device))
                    # prediction is correct only if answer text exactly matches
                    dev_acc += evaluate(data, output) == dev_questions[i]["answer_text"]
                print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
            model.train()
        gc.collect()

    # Save a model and its configuration file to the directory 「saved_model」
    # i.e. there are two files under the directory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
    # Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
    print("Saving Model ...")
    model_save_dir = "saved_model" 
    model.save_pretrained(model_save_dir)

    print("Evaluating Test Set ...")

    result = []

    model.eval()
    with torch.no_grad():
        for data in tqdm(test_loader):
            output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                          attention_mask=data[2].squeeze(dim=0).to(device))
            result.append(evaluate(data, output))

    result_file = f"result{num}.csv"
    with open(result_file, 'w') as f:
        f.write("ID,Answer\n")
        for i, test_question in enumerate(test_questions):
            # Replace commas in answers with empty strings (since csv is separated by comma)
            # Answers in kaggle are processed in the same way
            result[i] = postprocessing(result[i], i)
            f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

    print(f"Completed! Result is in {result_file}")

Start Training ...
Train 0 :


Token indices sequence length is longer than the specified maximum sequence length for this model (570 > 512). Running this sequence through the model will result in indexing errors


Epoch 1 | Step 100 | loss = 2.561, acc = 0.315
Epoch 1 | Step 200 | loss = 1.260, acc = 0.582
Epoch 1 | Step 300 | loss = 1.087, acc = 0.626
Epoch 1 | Step 400 | loss = 0.917, acc = 0.656
Epoch 1 | Step 500 | loss = 0.951, acc = 0.647
Epoch 1 | Step 600 | loss = 0.877, acc = 0.678
Epoch 1 | Step 700 | loss = 0.876, acc = 0.677
Epoch 1 | Step 800 | loss = 0.833, acc = 0.679
Epoch 1 | Step 900 | loss = 0.852, acc = 0.682
Epoch 1 | Step 1000 | loss = 0.878, acc = 0.677
Epoch 1 | Step 1100 | loss = 0.862, acc = 0.687
Epoch 1 | Step 1200 | loss = 0.878, acc = 0.664
Epoch 1 | Step 1300 | loss = 0.854, acc = 0.692
Epoch 1 | Step 1400 | loss = 0.811, acc = 0.710
Epoch 1 | Step 1500 | loss = 0.807, acc = 0.716
Epoch 1 | Step 1600 | loss = 0.922, acc = 0.683
Epoch 1 | Step 1700 | loss = 0.714, acc = 0.728
Epoch 1 | Step 1800 | loss = 0.773, acc = 0.712
Epoch 1 | Step 1900 | loss = 0.803, acc = 0.696
Epoch 1 | Step 2000 | loss = 0.755, acc = 0.716
Epoch 1 | Step 2100 | loss = 0.861, acc = 0.702
E

Validation | Epoch 1 | acc = 0.778


Epoch 2 | Step 100 | loss = 0.452, acc = 0.798
Epoch 2 | Step 200 | loss = 0.467, acc = 0.802
Epoch 2 | Step 300 | loss = 0.471, acc = 0.796
Epoch 2 | Step 400 | loss = 0.545, acc = 0.786
Epoch 2 | Step 500 | loss = 0.444, acc = 0.819
Epoch 2 | Step 600 | loss = 0.459, acc = 0.805
Epoch 2 | Step 700 | loss = 0.498, acc = 0.788
Epoch 2 | Step 800 | loss = 0.476, acc = 0.799
Epoch 2 | Step 900 | loss = 0.507, acc = 0.799
Epoch 2 | Step 1000 | loss = 0.477, acc = 0.814
Epoch 2 | Step 1100 | loss = 0.459, acc = 0.795
Epoch 2 | Step 1200 | loss = 0.490, acc = 0.805
Epoch 2 | Step 1300 | loss = 0.525, acc = 0.792
Epoch 2 | Step 1400 | loss = 0.392, acc = 0.818
Epoch 2 | Step 1500 | loss = 0.427, acc = 0.800
Epoch 2 | Step 1600 | loss = 0.456, acc = 0.813
Epoch 2 | Step 1700 | loss = 0.434, acc = 0.807
Epoch 2 | Step 1800 | loss = 0.503, acc = 0.795
Epoch 2 | Step 1900 | loss = 0.484, acc = 0.798
Epoch 2 | Step 2000 | loss = 0.485, acc = 0.783
Epoch 2 | Step 2100 | loss = 0.537, acc = 0.793
E

Validation | Epoch 2 | acc = 0.791


Epoch 3 | Step 100 | loss = 0.333, acc = 0.855
Epoch 3 | Step 200 | loss = 0.351, acc = 0.832
Epoch 3 | Step 300 | loss = 0.286, acc = 0.873
Epoch 3 | Step 400 | loss = 0.260, acc = 0.874
Epoch 3 | Step 500 | loss = 0.338, acc = 0.867
Epoch 3 | Step 600 | loss = 0.285, acc = 0.878
Epoch 3 | Step 700 | loss = 0.269, acc = 0.868
Epoch 3 | Step 800 | loss = 0.296, acc = 0.866
Epoch 3 | Step 900 | loss = 0.303, acc = 0.854
Epoch 3 | Step 1000 | loss = 0.294, acc = 0.862
Epoch 3 | Step 1100 | loss = 0.336, acc = 0.851
Epoch 3 | Step 1200 | loss = 0.364, acc = 0.853
Epoch 3 | Step 1300 | loss = 0.342, acc = 0.843
Epoch 3 | Step 1400 | loss = 0.245, acc = 0.872
Epoch 3 | Step 1500 | loss = 0.305, acc = 0.852
Epoch 3 | Step 1600 | loss = 0.323, acc = 0.877
Epoch 3 | Step 1700 | loss = 0.286, acc = 0.868
Epoch 3 | Step 1800 | loss = 0.336, acc = 0.852
Epoch 3 | Step 1900 | loss = 0.288, acc = 0.848
Epoch 3 | Step 2000 | loss = 0.286, acc = 0.867
Epoch 3 | Step 2100 | loss = 0.325, acc = 0.864
E

Validation | Epoch 3 | acc = 0.793
Saving Model ...
Evaluating Test Set ...


ID 49 Before : 自大型購物中心[UNK]開幕
Target: MegaBox
ID 49 After  : 自大型購物中心MegaBox開幕
ID 250 Before : 溥[UNK]
ID 250 After  : 溥儁
ID 332 Before : 目前沒有觀察到任何語言純[UNK]以力道來區分不同輔音
Target: 綷
ID 332 After  : 目前沒有觀察到任何語言純綷以力道來區分不同輔音
ID 340 Before : [UNK]人國
ID 340 After  : 志人國
ID 467 Before : 《全唐詩
ID 467 After  : 《全唐詩》
ID 490 Before : [UNK][UNK]
ID 490 After  : 高高
ID 563 Before : 馬[UNK]
ID 563 After  : 馬馼
ID 635 Before : 東晉常[UNK]
ID 635 After  : 東晉常璩
ID 938 Before : [UNK]稻
ID 938 After  : 米稻
ID 991 Before : 白[UNK]紀滅絕事件
ID 991 After  : 白堊紀滅絕事件
ID 1212 Before : [UNK]州縣實際解散
ID 1212 After  : ，州縣實際解散
ID 1212 Before : ，州縣實際解散
ID 1212 After  : 州縣實際解散
ID 1423 Before : 抗佝[UNK]病
ID 1423 After  : 抗佝僂病
ID 1494 Before : [UNK]以下的聲音對基底膜的影響
ID 1494 After  : 。以下的聲音對基底膜的影響
ID 1494 Before : 。以下的聲音對基底膜的影響
ID 1494 After  : 以下的聲音對基底膜的影響
ID 1502 Before : [UNK]戰鬥機
ID 1502 After  : ，戰鬥機
ID 1502 Before : ，戰鬥機
ID 1502 After  : 戰鬥機
ID 1534 Before : 杭州[UNK]橋機場
Target: 筧
ID 1534 After  : 杭州筧橋機場
ID 1557 Before : 蔡[UNK]
ID 1557 After  

Epoch 1 | Step 100 | loss = 0.300, acc = 0.851
Epoch 1 | Step 200 | loss = 0.272, acc = 0.860
Epoch 1 | Step 300 | loss = 0.335, acc = 0.837
Epoch 1 | Step 400 | loss = 0.419, acc = 0.805
Epoch 1 | Step 500 | loss = 0.358, acc = 0.834
Epoch 1 | Step 600 | loss = 0.430, acc = 0.833
Epoch 1 | Step 700 | loss = 0.400, acc = 0.831
Epoch 1 | Step 800 | loss = 0.490, acc = 0.803
Epoch 1 | Step 900 | loss = 0.419, acc = 0.822
Epoch 1 | Step 1000 | loss = 0.467, acc = 0.815
Epoch 1 | Step 1100 | loss = 0.380, acc = 0.825
Epoch 1 | Step 1200 | loss = 0.406, acc = 0.819
Epoch 1 | Step 1300 | loss = 0.438, acc = 0.832
Epoch 1 | Step 1400 | loss = 0.384, acc = 0.826
Epoch 1 | Step 1500 | loss = 0.384, acc = 0.852
Epoch 1 | Step 1600 | loss = 0.420, acc = 0.815
Epoch 1 | Step 1700 | loss = 0.390, acc = 0.845
Epoch 1 | Step 1800 | loss = 0.508, acc = 0.810
Epoch 1 | Step 1900 | loss = 0.445, acc = 0.815
Epoch 1 | Step 2000 | loss = 0.448, acc = 0.821
Epoch 1 | Step 2100 | loss = 0.441, acc = 0.820
E

Validation | Epoch 1 | acc = 0.785


Epoch 2 | Step 100 | loss = 0.205, acc = 0.889
Epoch 2 | Step 200 | loss = 0.309, acc = 0.875
Epoch 2 | Step 300 | loss = 0.303, acc = 0.861
Epoch 2 | Step 400 | loss = 0.263, acc = 0.885
Epoch 2 | Step 500 | loss = 0.317, acc = 0.860
Epoch 2 | Step 600 | loss = 0.335, acc = 0.857
Epoch 2 | Step 700 | loss = 0.325, acc = 0.853
Epoch 2 | Step 800 | loss = 0.289, acc = 0.871
Epoch 2 | Step 900 | loss = 0.299, acc = 0.861
Epoch 2 | Step 1000 | loss = 0.294, acc = 0.873
Epoch 2 | Step 1100 | loss = 0.279, acc = 0.875
Epoch 2 | Step 1200 | loss = 0.284, acc = 0.861
Epoch 2 | Step 1300 | loss = 0.340, acc = 0.863
Epoch 2 | Step 1400 | loss = 0.304, acc = 0.873
Epoch 2 | Step 1500 | loss = 0.289, acc = 0.865
Epoch 2 | Step 1600 | loss = 0.322, acc = 0.841
Epoch 2 | Step 1700 | loss = 0.270, acc = 0.866
Epoch 2 | Step 1800 | loss = 0.282, acc = 0.862
Epoch 2 | Step 1900 | loss = 0.322, acc = 0.853
Epoch 2 | Step 2000 | loss = 0.333, acc = 0.843
Epoch 2 | Step 2100 | loss = 0.297, acc = 0.865
E

Validation | Epoch 2 | acc = 0.789


Epoch 3 | Step 100 | loss = 0.239, acc = 0.875
Epoch 3 | Step 200 | loss = 0.228, acc = 0.901
Epoch 3 | Step 300 | loss = 0.245, acc = 0.879
Epoch 3 | Step 400 | loss = 0.210, acc = 0.897
Epoch 3 | Step 500 | loss = 0.227, acc = 0.889
Epoch 3 | Step 600 | loss = 0.237, acc = 0.897
Epoch 3 | Step 700 | loss = 0.247, acc = 0.889
Epoch 3 | Step 800 | loss = 0.236, acc = 0.890
Epoch 3 | Step 900 | loss = 0.264, acc = 0.885
Epoch 3 | Step 1000 | loss = 0.210, acc = 0.897
Epoch 3 | Step 1100 | loss = 0.318, acc = 0.870
Epoch 3 | Step 1200 | loss = 0.236, acc = 0.900
Epoch 3 | Step 1300 | loss = 0.180, acc = 0.902
Epoch 3 | Step 1400 | loss = 0.186, acc = 0.913
Epoch 3 | Step 1500 | loss = 0.252, acc = 0.889
Epoch 3 | Step 1600 | loss = 0.257, acc = 0.887
Epoch 3 | Step 1700 | loss = 0.224, acc = 0.887
Epoch 3 | Step 1800 | loss = 0.225, acc = 0.890
Epoch 3 | Step 1900 | loss = 0.236, acc = 0.894
Epoch 3 | Step 2000 | loss = 0.181, acc = 0.908
Epoch 3 | Step 2100 | loss = 0.261, acc = 0.882
E

Validation | Epoch 3 | acc = 0.799
Saving Model ...
Evaluating Test Set ...


ID 49 Before : 大型購物中心[UNK]開幕
Target: MegaBox
ID 49 After  : 大型購物中心MegaBox開幕
ID 250 Before : 溥[UNK]
ID 250 After  : 溥儁
ID 332 Before : 目前沒有觀察到任何語言純[UNK]以力道來區分不同輔音
Target: 綷
ID 332 After  : 目前沒有觀察到任何語言純綷以力道來區分不同輔音
ID 371 Before : 為了建立通往西方的道路，早在1209年[UNK]1210年就讓新疆東部的畏兀兒與伊犁河谷的哈剌魯先後歸順。當金朝遷都並將要滅亡之際，中亞新興大國花剌子模在沙阿摩訶末時期崛起，該國訛答剌地方大臣海兒汗亦納勒術前後兩次屠殺蒙古商隊並侮辱蒙古使臣
Target: —
ID 371 After  : 為了建立通往西方的道路，早在1209年—1210年就讓新疆東部的畏兀兒與伊犁河谷的哈剌魯先後歸順。當金朝遷都並將要滅亡之際，中亞新興大國花剌子模在沙阿摩訶末時期崛起，該國訛答剌地方大臣海兒汗亦納勒術前後兩次屠殺蒙古商隊並侮辱蒙古使臣
ID 490 Before : [UNK][UNK]
ID 490 After  : 高高
ID 635 Before : 常[UNK]
ID 635 After  : 常璩
ID 938 Before : [UNK]稻
ID 938 After  : 米稻
ID 991 Before : 白[UNK]紀滅絕事件
ID 991 After  : 白堊紀滅絕事件
ID 1423 Before : 抗佝[UNK]病
ID 1423 After  : 抗佝僂病
ID 1494 Before : [UNK]以下的聲音對基底膜的影響
ID 1494 After  : 。以下的聲音對基底膜的影響
ID 1494 Before : 。以下的聲音對基底膜的影響
ID 1494 After  : 以下的聲音對基底膜的影響
ID 1502 Before : [UNK]戰鬥機
ID 1502 After  : ，戰鬥機
ID 1502 Before : ，戰鬥機
ID 1502 After  : 戰鬥機
ID 1533 Before : 九廣鐵路的落馬洲支線穿過[UNK]原濕地
Target: 塱
ID 1533 Aft

Epoch 1 | Step 100 | loss = 0.253, acc = 0.873
Epoch 1 | Step 200 | loss = 0.328, acc = 0.878
Epoch 1 | Step 300 | loss = 0.283, acc = 0.869
Epoch 1 | Step 400 | loss = 0.285, acc = 0.848
Epoch 1 | Step 500 | loss = 0.307, acc = 0.855
Epoch 1 | Step 600 | loss = 0.345, acc = 0.861
Epoch 1 | Step 700 | loss = 0.281, acc = 0.867
Epoch 1 | Step 800 | loss = 0.346, acc = 0.850
Epoch 1 | Step 900 | loss = 0.336, acc = 0.860
Epoch 1 | Step 1000 | loss = 0.295, acc = 0.866
Epoch 1 | Step 1100 | loss = 0.279, acc = 0.872
Epoch 1 | Step 1200 | loss = 0.375, acc = 0.840
Epoch 1 | Step 1300 | loss = 0.345, acc = 0.853
Epoch 1 | Step 1400 | loss = 0.320, acc = 0.856
Epoch 1 | Step 1500 | loss = 0.305, acc = 0.861
Epoch 1 | Step 1600 | loss = 0.326, acc = 0.854
Epoch 1 | Step 1700 | loss = 0.283, acc = 0.871
Epoch 1 | Step 1800 | loss = 0.290, acc = 0.864
Epoch 1 | Step 1900 | loss = 0.249, acc = 0.861
Epoch 1 | Step 2000 | loss = 0.286, acc = 0.875
Epoch 1 | Step 2100 | loss = 0.339, acc = 0.858

Validation | Epoch 1 | acc = 0.785


Epoch 2 | Step 100 | loss = 0.277, acc = 0.859
Epoch 2 | Step 200 | loss = 0.318, acc = 0.859
Epoch 2 | Step 300 | loss = 0.254, acc = 0.873
Epoch 2 | Step 400 | loss = 0.320, acc = 0.870
Epoch 2 | Step 500 | loss = 0.294, acc = 0.874
Epoch 2 | Step 600 | loss = 0.214, acc = 0.890
Epoch 2 | Step 700 | loss = 0.228, acc = 0.887
Epoch 2 | Step 800 | loss = 0.211, acc = 0.898
Epoch 2 | Step 900 | loss = 0.294, acc = 0.874
Epoch 2 | Step 1000 | loss = 0.260, acc = 0.876
Epoch 2 | Step 1100 | loss = 0.217, acc = 0.908
Epoch 2 | Step 1200 | loss = 0.246, acc = 0.883
Epoch 2 | Step 1300 | loss = 0.257, acc = 0.883
Epoch 2 | Step 1400 | loss = 0.257, acc = 0.885
Epoch 2 | Step 1500 | loss = 0.239, acc = 0.880
Epoch 2 | Step 1600 | loss = 0.291, acc = 0.875
Epoch 2 | Step 1700 | loss = 0.255, acc = 0.882
Epoch 2 | Step 1800 | loss = 0.228, acc = 0.889
Epoch 2 | Step 1900 | loss = 0.221, acc = 0.887
Epoch 2 | Step 2000 | loss = 0.229, acc = 0.885
Epoch 2 | Step 2100 | loss = 0.280, acc = 0.866

Validation | Epoch 2 | acc = 0.786


Epoch 3 | Step 100 | loss = 0.223, acc = 0.896
Epoch 3 | Step 200 | loss = 0.199, acc = 0.888
Epoch 3 | Step 300 | loss = 0.198, acc = 0.902
Epoch 3 | Step 400 | loss = 0.162, acc = 0.916
Epoch 3 | Step 500 | loss = 0.163, acc = 0.909
Epoch 3 | Step 600 | loss = 0.178, acc = 0.896
Epoch 3 | Step 700 | loss = 0.216, acc = 0.892
Epoch 3 | Step 800 | loss = 0.193, acc = 0.894
Epoch 3 | Step 900 | loss = 0.247, acc = 0.895
Epoch 3 | Step 1000 | loss = 0.255, acc = 0.900
Epoch 3 | Step 1100 | loss = 0.245, acc = 0.897
Epoch 3 | Step 1200 | loss = 0.223, acc = 0.897
Epoch 3 | Step 1300 | loss = 0.207, acc = 0.900
Epoch 3 | Step 1400 | loss = 0.197, acc = 0.907
Epoch 3 | Step 1500 | loss = 0.196, acc = 0.911
Epoch 3 | Step 1600 | loss = 0.171, acc = 0.903
Epoch 3 | Step 1700 | loss = 0.200, acc = 0.909
Epoch 3 | Step 1800 | loss = 0.242, acc = 0.890
Epoch 3 | Step 1900 | loss = 0.206, acc = 0.898
Epoch 3 | Step 2000 | loss = 0.166, acc = 0.912
Epoch 3 | Step 2100 | loss = 0.212, acc = 0.891

Validation | Epoch 3 | acc = 0.795
Saving Model ...
Evaluating Test Set ...



ID 49 Before : 大型購物中心[UNK]開幕
Target: MegaBox
ID 49 After  : 大型購物中心MegaBox開幕
ID 250 Before : 溥[UNK]
ID 250 After  : 溥儁
ID 261 Before : 商業及人均[UNK]都不如東南亞的大城市，所以此時南洋的福建人
Target: GDP
ID 261 After  : 商業及人均GDP都不如東南亞的大城市，所以此時南洋的福建人
ID 332 Before : 目前沒有觀察到任何語言純[UNK]以力道來區分不同輔音
Target: 綷
ID 332 After  : 目前沒有觀察到任何語言純綷以力道來區分不同輔音
ID 340 Before : [UNK]人國
ID 340 After  : 志人國
ID 563 Before : 馬[UNK]
ID 563 After  : 馬馼
ID 635 Before : 常[UNK]
ID 635 After  : 常璩
ID 938 Before : [UNK]稻
ID 938 After  : 米稻
ID 991 Before : 白[UNK]紀滅絕事件
ID 991 After  : 白堊紀滅絕事件
ID 1071 Before : 《金雲翹傳
ID 1071 After  : 《金雲翹傳》
ID 1212 Before : [UNK]州縣實際解散
ID 1212 After  : ，州縣實際解散
ID 1212 Before : ，州縣實際解散
ID 1212 After  : 州縣實際解散
ID 1285 Before : 二人對弈的戰略棋盤遊戲，也是世界上最流行的遊戲之一。世界各地數以百萬計的人在家中、俱樂部中、網路上以通訊西洋棋或比賽形式對弈。西洋棋的棋盤由64個黑白相間的八乘八網格組成。每位玩家開局時各有16個棋子：一國王、一后、兩城堡、兩騎士、兩主教和八士兵，各具不同功能與走法。棋手行棋目標是將對方的國王處在不可避免的威脅之下以將死對方，也可以通過對方自知無望、主動認輸而獲勝，另有相當多的情況可導致和局。遊戲過程分三個階段：開局、中局、西洋棋殘局，共有1043至1050種棋局變化。西洋棋棋子多用木或塑膠製成，也有用石材製作；較為精美的石頭、玻璃或金屬製棋子常用作裝飾擺設。西洋棋一般被認為源

In [14]:
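import csv
from collections import Counter

from google.colab import files  # Colab-only helper used for the download below

ensemble = 3  # number of prediction files: result0.csv ... result{ensemble-1}.csv

# Load every prediction file; each row is [ID, Answer], with a header in row 0.
guess = []
for j in range(ensemble):
    with open(f'result{j}.csv', 'r', newline='') as f:
        guess.append(list(csv.reader(f)))

# Majority vote per question; on a tie, the answer from the earliest file wins.
with open('result.csv', 'w') as f:
    f.write("ID,Answer\n")
    for i in range(1, len(guess[0])):  # skip the header row
        votes = [guess[j][i][1] for j in range(ensemble)]
        best = Counter(votes).most_common(1)[0][0]
        f.write('{},{}\n'.format(i - 1, best))

files.download('result.csv')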

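
The ensemble cell above picks, for each question, the answer that most of the prediction files agree on. The same voting rule can be sketched as a standalone function so that its behavior is easy to check on small inputs. `majority_vote` is a hypothetical helper name, the demo answers are made up, and breaking ties in favor of the answer seen first is an assumption rather than something the notebook specifies.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Return the most common answer per question.

    predictions[j][i] is model j's answer to question i (assumed layout).
    """
    if not predictions:
        return []
    n_questions = len(predictions[0])
    if any(len(p) != n_questions for p in predictions):
        raise ValueError("all models must answer the same number of questions")
    voted = []
    for answers in zip(*predictions):
        # most_common keeps first-seen order among equal counts,
        # so a tie goes to the earliest model in the list.
        voted.append(Counter(answers).most_common(1)[0][0])
    return voted


# Made-up answers from three models for three questions.
demo = [
    ["MegaBox", "溥儁", "白堊紀"],
    ["MegaBox", "溥", "白堊紀滅絕事件"],
    ["大型購物中心", "溥儁", "白堊紀"],
]
print(majority_vote(demo))
```

With an odd number of models a full three-way tie is still possible, in which case the first model's answer is kept; ordering the models from strongest to weakest validation accuracy makes that fallback a sensible default.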

## **Reference**
Source: Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW07/HW07.ipynb)