**Demo for the Deep Learning course**

- 19C11033 – Nguyễn Hùng Phú

- 19C12006 – Phạm Trần Quốc Vương

- 19C11017 – Diêu Tiến Đạt

- 19C11027 – Phạm Quốc Huy

- 19C11022 – Khưu Minh Huệ

# PREPARATION

In [9]:
!pip install transformers -q

# Question Answering with a Fine-Tuned BERT


<a href="https://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/#part-1-how-bert-is-applied-to-question-answering"> *Reference: Chris McCormick* </a>

- The task is to find the answer to a given question within a passage of text.

## Part 1: How BERT is applied to Question Answering

### BERT Input Format

The input to BERT consists of two groups of tokens, which we will call segment A and segment B. The two segments are separated by a `[SEP]` token, and a `[CLS]` token is added at the start for classification tasks.

For question answering, the segment A tokens are the question and the segment B tokens are the reference text (the passage that contains the answer).


<img src="../static/bert_input.png" width="600" />
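The packing described above can be sketched in plain Python. This is only an illustration: `build_pair_input` is a hypothetical helper, and the token lists are made up rather than produced by the real WordPiece tokenizer. The segment IDs correspond to the `token_type_ids` used later in this notebook.

```python
def build_pair_input(question_tokens, context_tokens):
    """Pack a question and a reference text into one BERT-style input."""
    # [CLS] question [SEP] context [SEP]
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment A (id 0) covers [CLS], the question, and the first [SEP];
    # segment B (id 1) covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


# Hypothetical, hand-written tokens for illustration only.
tokens, segment_ids = build_pair_input(["who", "scored", "?"], ["linh", "scored", "."])
for token, segment in zip(tokens, segment_ids):
    print(f"{token:8} {segment}")
```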


### Start & End Token Classifiers

- On top of BERT's final transformer layer sit two special sets of weights: a start vector and an end vector.
- Each output token embedding is multiplied (dot product) by the start/end weights, and the resulting scores pass through a softmax. The token with the highest probability marks the start/end position of the answer in the reference text.

<img style="display:inline" src="../static/bert_start_token.png"  width="500"/>
<img style="display:inline" src="../static/bert_end_token.png"  width="400"/>
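The two bullets above can be illustrated with a small pure-Python sketch. All numbers are toy values chosen for the example, and `position_probabilities` is a hypothetical helper; in the real model the output vectors have 1024 dimensions and the weights are learned during fine-tuning.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_probabilities(token_vectors, weights):
    """Dot each token's output vector with `weights`, then softmax over tokens."""
    scores = [sum(w * x for w, x in zip(weights, vec)) for vec in token_vectors]
    return softmax(scores)


# Toy values (hypothetical): 4 tokens with 2-dimensional output vectors.
token_vectors = [[0.1, 0.2], [2.0, 0.1], [0.3, 1.5], [0.2, 0.3]]
start_weights = [1.0, 0.0]  # stands in for the learned start vector
end_weights = [0.0, 1.0]    # stands in for the learned end vector

start_probs = position_probabilities(token_vectors, start_weights)
end_probs = position_probabilities(token_vectors, end_weights)

# The most probable positions mark the answer span.
answer_start = start_probs.index(max(start_probs))
answer_end = end_probs.index(max(end_probs))
print(answer_start, answer_end)  # prints "1 2"
```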

## Part 2: Example Code

- This demo uses the Hugging Face `transformers` library with the pretrained model "bert-large-uncased-whole-word-masking-finetuned-squad".
- The pretrained model can be downloaded directly in code, or downloaded once and saved with `save_pretrained()` so that later runs load it locally.
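One possible way to implement the "download once, then run locally" option is sketched below. The helper name `load_qa_model` and the `local_dir` default are assumptions made for this example; only `from_pretrained()` and `save_pretrained()` come from the library. The sketch also assumes that an existing directory holds a complete saved model.

```python
import os

MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"


def load_qa_model(local_dir="./" + MODEL_NAME):
    """Load the model and tokenizer from `local_dir`, downloading once if needed."""
    # Imported here so the heavy dependency is only required when the
    # function is actually called.
    from transformers import BertForQuestionAnswering, BertTokenizer

    have_local = os.path.isdir(local_dir)
    # Use the local copy if it exists, otherwise fetch by model name.
    source = local_dir if have_local else MODEL_NAME

    model = BertForQuestionAnswering.from_pretrained(source)
    tokenizer = BertTokenizer.from_pretrained(source)

    if not have_local:
        # Save both parts so that later runs can load from disk.
        model.save_pretrained(local_dir)
        tokenizer.save_pretrained(local_dir)
    return model, tokenizer
```

Calling `load_qa_model()` the first time downloads the weights (over 1 GB for this model); subsequent calls read from `local_dir`.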

In [1]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# Load the model and tokenizer from the same local directory
# (previously saved with save_pretrained()).
MODEL_DIR = './bert-large-uncased-whole-word-masking-finetuned-squad/'
modelQA = BertForQuestionAnswering.from_pretrained(MODEL_DIR)
tokenizerQA = BertTokenizer.from_pretrained(MODEL_DIR)

In [2]:
# BERT output layer: a simple linear layer that maps the output sequence from
# (batch_size, seq_len, hidden_size) to (batch_size, seq_len, 2). The last dimension is split to
# get the start and end logits. During training, the cross-entropy loss is computed against the true start and end positions.
# Source code: https://huggingface.co/transformers/_modules/transformers/models/bert/modeling_bert.html#BertForQuestionAnswering
modelQA

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-12,
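
The output-layer comment in the cell above mentions splitting the two logits per token and computing a cross-entropy loss. A minimal pure-Python sketch of that training objective follows. The helper names and toy logits are hypothetical, and averaging the start and end losses reflects my understanding of the library's behavior rather than something stated in this notebook.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of `target_index` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def qa_loss(token_logits, start_position, end_position):
    """`token_logits` holds one [start_logit, end_logit] pair per token."""
    # Split the last dimension into separate start and end logit sequences.
    start_logits = [pair[0] for pair in token_logits]
    end_logits = [pair[1] for pair in token_logits]
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    # Assumption: the two losses are averaged into a single training loss.
    return (start_loss + end_loss) / 2


# Toy logits for a 3-token sequence; the true answer spans tokens 1..2.
toy_logits = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
print(qa_loss(toy_logits, start_position=1, end_position=2))
```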

In [3]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns them as a single string.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    encoding = tokenizerQA.encode_plus(question, answer_text)

    input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
    
    # ======== Evaluate ========
    # Run our example through the model. Gradients are not needed for inference.
    with torch.no_grad():
        outputs = modelQA(torch.tensor([input_ids]),
                          token_type_ids=torch.tensor([token_type_ids]),
                          return_dict=True)

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores,
    # converted to plain Python ints for list indexing.
    answer_start = torch.argmax(start_scores).item()
    answer_end = torch.argmax(end_scores).item()

    # Get the string versions of the input tokens.
    tokens = tokenizerQA.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    return answer
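
The function above picks the start and end positions independently, so the predicted end can fall before the start; in that case the loop does not run and only the start token is returned. One common alternative, sketched here under my own assumptions, is to search for the best-scoring pair with `start <= end`. The helper `best_span` and its `max_answer_len` parameter are hypothetical and not part of the original demo.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy scores: independent argmax would give start=3, end=0 (an invalid span).
start_scores = [0.1, 2.0, 0.3, 4.0]
end_scores = [4.0, 0.2, 3.0, 0.1]
print(best_span(start_scores, end_scores))  # prints "(1, 2)"
```

To use this with `answer_question`, the logits would first need to be converted to plain lists, for example with `outputs.start_logits[0].tolist()`.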

### Document 1

In [12]:
import textwrap

# Wrap text to 80 characters.
wrapper = textwrap.TextWrapper(width=80) 

document1 = "Before the game, Vietnam were leading group G with 17 points, two more than UAE. The Golden Dragons only needed a draw to secure the top spot. However, UAE showed dominance and led 3-0 after 50 minutes. But Vietnam didn`t give up. In the last six minutes, Nguyen Tien Linh and Tran Minh Vuong scored to make it 2-3, which is also the final result.'When some players were out of stamina, we made some substitution to improve on pace and offense. And we succeeded,' Tuan added."
print(wrapper.fill(document1))

Before the game, Vietnam were leading group G with 17 points, two more than UAE.
The Golden Dragons only needed a draw to secure the top spot. However, UAE
showed dominance and led 3-0 after 50 minutes. But Vietnam didn`t give up. In
the last six minutes, Nguyen Tien Linh and Tran Minh Vuong scored to make it
2-3, which is also the final result.'When some players were out of stamina, we
made some substitution to improve on pace and offense. And we succeeded,' Tuan
added.


In [13]:
question1 = "Who scored for Vietnam?"
print(f'{question1} --> {answer_question(question1, document1)}')

question2 = "How many points Vietnam takes?"
print(f'{question2} --> {answer_question(question2, document1)}')

Who scored for Vietnam? --> nguyen tien linh and tran minh vuong
How many points Vietnam takes? --> 17


### Document 2

In [14]:
import textwrap

# Wrap text to 80 characters.
wrapper = textwrap.TextWrapper(width=80) 

document2 = "Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English.Differently to other BERT models, this model was trained with a new technique: Whole Word Masking. In this case, all of the tokens corresponding to a word are masked at once. The overall masking rate remains the same."
print(wrapper.fill(document2))

Pretrained model on English language using a masked language modeling (MLM)
objective. It was introduced in this paper and first released in this
repository. This model is uncased: it does not make a difference between english
and English.Differently to other BERT models, this model was trained with a new
technique: Whole Word Masking. In this case, all of the tokens corresponding to
a word are masked at once. The overall masking rate remains the same.


In [15]:
question1 = "MLM is stand for?"
print(f'{question1} --> {answer_question(question1, document2)}')

question2 = "Which difference with bert?"
print(f'{question2} --> {answer_question(question2, document2)}')

MLM is stand for? --> masked language modeling
Which difference with bert? --> whole word masking


### Document 3

In [16]:
document3 = "Vietnam recorded 116 local Covid-19 cases Thursday night, including more than half in HCMC, raising the nation's total of the day to 279."
print(wrapper.fill(document3))

Vietnam recorded 116 local Covid-19 cases Thursday night, including more than
half in HCMC, raising the nation's total of the day to 279.


In [17]:
question1 = "When Vietnam recored COvid-19 cases?"
print(f'{question1} --> {answer_question(question1, document3)}')

When Vietnam recored COvid-19 cases? --> thursday night
