# **Goal**:
### Use the SQuAD 2.0 dataset and facebook/bart-base to train a question-answering model that produces answers

# Install packages

In [1]:
!pip install transformers datasets accelerate

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting accelerate
  Downloading accelerate-0.6.2-py3-none-any.whl (65 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)

# Check the assigned GPU

In [2]:
!nvidia-smi

Thu Apr 14 09:25:07 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Mount Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# Data download

In [4]:
%cd /content/drive/Shareddrives/中興大學/讀書會/DataSet/SQuAD

/content/drive/Shareddrives/中興大學/讀書會/DataSet/SQuAD


In [5]:
%ls

dev-v2.0.json  model/  train-v2.0.json


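### SQuAD data format
The Stanford Question Answering Dataset (SQuAD) is a reading-comprehension dataset compiled at Stanford University.
It contains more than 100,000 CQA (context, question, answer) pairs collected from Wikipedia.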


For more information please refer to Paper: https://arxiv.org/abs/1606.05250

### Data format

- version : <String> : dataset version
- data : <Array>
  - title : <String> : article title
  - id : <String> : article ID
  - paragraphs : <Array>
    - id : <String> : articleID_paragraphID
    - context : <String> : paragraph text
    - qas : <Array>
      - question : <String> : question text
      - id : <String> : articleID_paragraphID_questionID
      - is_impossible : <Boolean> : true means the question is unanswerable, false means it is answerable
      - answers : <Array>
        - answer_start : <int> : character offset of text within context
        - text : <String> : answer text
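
The nesting above can be walked with three loops. The following is a minimal sketch on a tiny hand-written sample rather than the real file; the sample content, the `q1`/`q2` ids, and the helper name `iter_questions` are made up for illustration, and `is_impossible` is assumed to be a JSON boolean as in the SQuAD 2.0 files.

```python
# Hand-written sample that mimics the SQuAD 2.0 nesting (not real data).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "context": "Normandy is a region in France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "In what country is Normandy located?",
                            "is_impossible": False,
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                        {
                            "id": "q2",
                            "question": "Who founded Normandy?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_questions(dataset):
    """Yield (title, context, qa) for every question in the nested structure."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield article["title"], paragraph["context"], qa


rows = [
    (qa["id"], qa["is_impossible"], [a["text"] for a in qa["answers"]])
    for _, _, qa in iter_questions(sample)
]
print(rows)
```

Unanswerable questions carry an empty `answers` list, which is the case the notebook's `read_data` handles with a placeholder answer.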


In [6]:
import json
from pprint import pprint

# Load the SQuAD 2.0 dev set and show the first paragraph of the first article
with open('dev-v2.0.json', encoding='utf-8') as file:
  dev_data = json.load(file)

pprint(dev_data['data'][0]['paragraphs'][0])

{'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: '
            'Normanni) were the people who in the 10th and 11th centuries gave '
            'their name to Normandy, a region in France. They were descended '
            'from Norse ("Norman" comes from "Norseman") raiders and pirates '
            'from Denmark, Iceland and Norway who, under their leader Rollo, '
            'agreed to swear fealty to King Charles III of West Francia. '
            'Through generations of assimilation and mixing with the native '
            'Frankish and Roman-Gaulish populations, their descendants would '
            'gradually merge with the Carolingian-based cultures of West '
            'Francia. The distinct cultural and ethnic identity of the Normans '
            'emerged initially in the first half of the 10th century, and it '
            'continued to evolve over the succeeding centuries.',
 'qas': [{'answers': [{'answer_start': 159, 'text': 'France'},
              

# Evaluate the model (compute exact match and F1 score)

In [7]:
from transformers import BartTokenizerFast, BartForQuestionAnswering, AutoConfig, default_data_collator
from torch.utils.data import DataLoader
from accelerate import Accelerator
from tqdm.auto import tqdm
import json

In [8]:
%cd /content/drive/Shareddrives/中興大學/讀書會/DataSet/SQuAD
%ls

/content/drive/Shareddrives/中興大學/讀書會/DataSet/SQuAD
dev-v2.0.json  model/  train-v2.0.json


# Load the model and test data

In [9]:
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
config = AutoConfig.from_pretrained("./model/epoch_0/config.json")
# Load the fine-tuned weights from the checkpoint directory (it holds config.json and pytorch_model.bin)
model = BartForQuestionAnswering.from_pretrained("./model/epoch_0", config=config)


## Inspect the model architecture

In [10]:
print(model)

BartForQuestionAnswering(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, ele

# Read the data

In [11]:
from pathlib import Path

def read_data(path, limit=None):
    """Flatten a SQuAD-style JSON file into parallel lists of contexts, questions, and answers."""
    path = Path(path)
    with open(path, 'rb') as f:
        data_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    # Placeholder answer for unanswerable questions (empty text at position 0)
    unanswers = {'text': '', 'answer_start': 0}
    for group in data_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                answer_list = qa['answers']
                if len(answer_list) == 0:
                    contexts.append(context)
                    questions.append(question)
                    # Append a copy so later in-place edits do not share one dict
                    answers.append(dict(unanswers))
                else:
                    # One example per gold answer
                    for answer in answer_list:
                        contexts.append(context)
                        questions.append(question)
                        answers.append(answer)
                    # Stop once more than `limit` examples have been collected
                    if limit is not None and len(contexts) > limit:
                        return contexts, questions, answers

    return contexts, questions, answers

In [12]:
eval_contexts, eval_questions, eval_answers = read_data('dev-v2.0.json',2000)

In [13]:
print("1st eval context = ",eval_contexts[0])
print("1st eval question = ",eval_questions[0])
print("1st eval answers = ",eval_answers[0])

1st eval context =  The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
1st eval question =  In what country is Normandy located?
1st eval answers =  {'text': 'France', 'answer_start': 159}


### Add the end position of each answer

In [14]:
def add_end_idx(answers):
    for answer in answers:
        gold_text = answer['text']
        start_idx = answer['answer_start']
        if gold_text == '':
          end_idx = 0
        else:
          end_idx = start_idx + len(gold_text) # Find end character index of answer in context
        answer['answer_end'] = end_idx

add_end_idx(eval_answers)

In [15]:
print("1st eval answers = ",eval_answers[0])

1st eval answers =  {'text': 'France', 'answer_start': 159, 'answer_end': 165}
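
As a quick sanity check on what `answer_end` means: it is an exclusive character index, so slicing the context from `answer_start` to `answer_end` should return the answer text. The snippet below is a small sketch on a made-up sentence (not taken from the dataset), and `span_end` is a hypothetical helper that mirrors the arithmetic in `add_end_idx`.

```python
def span_end(answer):
    """Exclusive end index of the answer, or 0 for the empty placeholder answer."""
    if answer["text"] == "":
        return 0
    return answer["answer_start"] + len(answer["text"])


# Made-up example; the start offset is looked up rather than hard-coded.
context = "Normandy is a region in France."
answer = {"text": "France", "answer_start": context.index("France")}
answer["answer_end"] = span_end(answer)

# The half-open slice [answer_start, answer_end) recovers the answer text.
recovered = context[answer["answer_start"]:answer["answer_end"]]
print(recovered)

# The placeholder for unanswerable questions keeps both indices at 0.
no_answer = {"text": "", "answer_start": 0}
no_answer["answer_end"] = span_end(no_answer)
```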


# Tokenize the data

In [16]:
eval_encodings = tokenizer(eval_contexts, eval_questions, truncation=True, padding=True)

In [17]:
eval_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

### Add the answer start and end token positions

In [18]:
def add_token_positions(encodings,answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    if answers[i]['answer_start'] == 0 and answers[i]['answer_end'] == 0:
      start_positions.append(0)
      end_positions.append(0)
    else:
      start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
      end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
    # if None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length
    if end_positions[-1] is None:
      end_positions[-1] = tokenizer.model_max_length
  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

In [19]:
add_token_positions(eval_encodings, eval_answers)

In [20]:
eval_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
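
To make the character-to-token step easier to follow, here is a toy stand-in that does not use the real tokenizer. It splits on whitespace instead of BART's subword vocabulary, and `toy_offsets` and `toy_char_to_token` are hypothetical helpers written only for this illustration. It also hints at why the notebook looks up `answer_end - 1`: the exclusive end index can land on a character that belongs to no token.

```python
import re


def toy_offsets(text):
    """(start, end) character span of every whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def toy_char_to_token(offsets, char_idx):
    """Index of the token covering char_idx, or None if no token covers it."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None


context = "Normandy is a region in France."
offsets = toy_offsets(context)

answer_text = "region"
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end index

start_token = toy_char_to_token(offsets, answer_start)
end_token = toy_char_to_token(offsets, answer_end - 1)  # last character of the answer
past_end = toy_char_to_token(offsets, answer_end)       # the space after "region"
print(start_token, end_token, past_end)
```

In the notebook, a `None` result is replaced by `tokenizer.model_max_length` to mark answers that were truncated away.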

In [21]:
print(eval_answers[0]['text'])

France


In [22]:
print("input_ids\n", eval_encodings['input_ids'][0])
print("input_ids to tokens\n",tokenizer.convert_ids_to_tokens(eval_encodings['input_ids'][0]))
print("input_ids_decode\n", tokenizer.decode(eval_encodings['input_ids'][0]))
print("attention_mask\n", eval_encodings['attention_mask'][0])

print("start_positions\n", eval_encodings['start_positions'][0])
print("end_positions\n", eval_encodings['end_positions'][0])

input_ids
 [0, 133, 20336, 1253, 36, 487, 16803, 35, 234, 2126, 119, 8771, 131, 1515, 35, 20336, 8771, 131, 5862, 35, 20336, 28867, 43, 58, 5, 82, 54, 11, 5, 158, 212, 8, 365, 212, 11505, 851, 49, 766, 7, 37741, 6, 10, 976, 11, 1470, 4, 252, 58, 22306, 31, 41498, 6697, 487, 16803, 113, 606, 31, 22, 487, 27209, 397, 8070, 10369, 268, 8, 34941, 31, 10060, 6, 14605, 8, 8683, 54, 6, 223, 49, 884, 13065, 139, 6, 1507, 7, 24909, 10668, 12107, 7, 1745, 3163, 6395, 9, 580, 17932, 493, 4, 6278, 6808, 9, 8446, 43616, 8, 17793, 19, 5, 3763, 3848, 1173, 8, 7733, 12, 534, 6695, 1173, 9883, 6, 49, 29285, 74, 9097, 19388, 19, 5, 9347, 154, 811, 12, 805, 13426, 9, 580, 17932, 493, 4, 20, 11693, 4106, 8, 7289, 3599, 9, 5, 20336, 1253, 4373, 3225, 11, 5, 78, 457, 9, 5, 158, 212, 3220, 6, 8, 24, 1143, 7, 14842, 81, 5, 27544, 11505, 4, 2, 2, 1121, 99, 247, 16, 37741, 2034, 116, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

# Define the Dataset and convert the encodings to tensors

In [23]:
import torch
class Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings

  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

  def __len__(self):
    return len(self.encodings.input_ids)

In [24]:
eval_dataset = Dataset(eval_encodings)

In [25]:
eval_batch_size = 2     # set the batch size
data_collator = default_data_collator

eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator, batch_size=eval_batch_size)

# Initialize the accelerator. We will let the accelerator handle device placement for us in this example.
accelerator = Accelerator()

# Prepare everything with our `accelerator`.
model, eval_dataloader = accelerator.prepare(
    model, eval_dataloader
)

In [26]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Evaluation

In [27]:
print("***** Running eval *****")
model.eval()
pre = []     # predictions, later passed to the squad_v2 metric
index = 0    # running example id, matching the position in eval_answers

for step, batch in enumerate(tqdm(eval_dataloader, desc="Eval Iteration")):
  # Gradients are not needed during evaluation
  with torch.no_grad():
    outputs = model(**batch)

  # Most likely start and end token index for each example in the batch
  start_pos = torch.argmax(outputs.start_logits, dim=-1).tolist()
  end_pos = torch.argmax(outputs.end_logits, dim=-1).tolist()

  # Slice the predicted token span out of the input and decode it back to text
  pred_answer_input_ids = [ids[start : end + 1] for ids, start, end in zip(batch['input_ids'].tolist(), start_pos, end_pos)]
  pred_answer = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in pred_answer_input_ids]

  # Highest softmax probability of the start and end distributions
  start_prob = torch.softmax(outputs.start_logits, dim=-1).max(dim=-1).values.tolist()
  end_prob = torch.softmax(outputs.end_logits, dim=-1).max(dim=-1).values.tolist()
  # 1 - mean peak probability; used later as no_answer_probability
  confidence = [1 - (start_c + end_c) / 2 for start_c, end_c in zip(start_prob, end_prob)]

  for text, score in zip(pred_answer, confidence):
    pre.append({'prediction_text': text, 'confidence': score, 'id': str(index)})
    index += 1


***** Running eval *****


Eval Iteration:   0%|          | 0/1004 [00:00<?, ?it/s]
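
The span selection and the score stored as `confidence` can be reproduced without a model. Below is a pure-Python sketch on made-up logits for a three-token input; `softmax` and `predict_span` are hypothetical helper names, and treating one minus the mean peak probability as a no-answer score is this notebook's heuristic, not something defined by SQuAD.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_span(start_logits, end_logits):
    """Return (start index, end index, no-answer score) for one example."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    start_prob = max(softmax(start_logits))
    end_prob = max(softmax(end_logits))
    # Notebook heuristic: the less peaked the two distributions, the higher the score
    no_answer_score = 1 - (start_prob + end_prob) / 2
    return start, end, no_answer_score


# Made-up logits, not model output
start, end, score = predict_span([0.1, 4.0, 0.2], [0.0, 0.3, 5.0])
print(start, end, round(score, 4))
```

Because the start and end positions are chosen independently, nothing prevents `end` from being smaller than `start`, in which case the sliced span is empty and the prediction text is an empty string.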

In [28]:
ref = []
for i in range(len(eval_answers)):
  start = eval_answers[i]['answer_start']
  text = eval_answers[i]['text']
  ref.append({'answers':{'answer_start':[start],'text':[text]},'id':str(i)})

In [29]:
for data in pre:
  data['no_answer_probability'] = data['confidence']
  data.pop('confidence',None)

In [30]:
origin = [dict(p) for p in pre]   # keep a copy of the predictions before prediction_text is cleaned up

In [31]:
for data in pre:
  text = data['prediction_text']
  if len(text) > 1 and text[0] == ' ':
    text = text[1:]
    data['prediction_text'] = text

In [32]:
print(ref[0])
print(pre[0])

{'answers': {'answer_start': [159], 'text': ['France']}, 'id': '0'}
{'prediction_text': 'France', 'id': '0', 'no_answer_probability': 0.07417917251586914}


In [33]:
import datasets

squad_metric = datasets.load_metric("squad_v2")
results = squad_metric.compute(predictions=pre, references=ref)
print(results)


{'exact': 49.65139442231076, 'f1': 56.35701078471165, 'total': 2008, 'HasAns_exact': 49.65139442231076, 'HasAns_f1': 56.35701078471165, 'HasAns_total': 2008, 'best_exact': 49.65139442231076, 'best_exact_thresh': 0.740144670009613, 'best_f1': 56.357010784711655, 'best_f1_thresh': 0.740144670009613}
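
For intuition about the `exact` and `f1` numbers, the sketch below follows the usual SQuAD scoring convention: answers are lowercased, stripped of punctuation and English articles, then compared as strings (exact match) or as bags of tokens (F1). It is an illustration written for this notebook, not the code of the `squad_v2` metric itself, and the function names are assumptions.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, F1 is 1 only when both are empty
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("France", "France."))
print(f1_score("the 1050s", "in the 1050s"))
print(f1_score("", "in the 1050s"))
```

Under this convention an empty prediction for an answerable question scores 0 on both measures.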


In [None]:
for i in range(87,100):
  print(i)
  print('Context:',eval_contexts[i])
  print('Question:',eval_questions[i])
  print('Answer:',ref[i]['answers']['text'][0])
  print('Prediction:',pre[i]['prediction_text'])
  print('-------------\n')

87
Context: One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s. By then however, there were already Norman mercenaries serving as far away as Trebizond and Georgia. They were based at Malatya and Edessa, under the Byzantine duke of Antioch, Isaac Komnenos. In the 1060s, Robert Crispin led the Normans of Edessa against the Turks. Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population, but he was stopped by the Byzantine general Alexius Komnenos.
Question: When did Herve serve as a Byzantine general?
Answer: in the 1050s
Prediction: 
-------------

88
Context: One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s. By then however, there were already Norman mercenaries serving as far away as Trebizond and Georgia. They were based at Malatya and Edessa, under the Byzantine duke of Antioch, Isaac Komnenos. In the 1060s, Robert Crispin led the Normans of

# Inference

In [None]:
# **Write the prediction function**
def QA_model(model, context, question):
  # Tokenize the (context, question) pair and move the tensors to the device
  encoded_input = tokenizer(context, question, return_tensors="pt").to(device)
  with torch.no_grad():
    outputs = model(**encoded_input)

  # Most likely start and end token positions
  start = torch.argmax(outputs.start_logits).item()
  end = torch.argmax(outputs.end_logits).item()

  # Decode the predicted token span back to text
  answer = encoded_input.input_ids.tolist()[0][start : end + 1]
  answer = tokenizer.decode(answer, skip_special_tokens=True, clean_up_tokenization_spaces=False)

  # Drop the single leading space that BART's tokenizer attaches to most words
  if answer.startswith(' '):
    answer = answer[1:]

  # Confidence: mean of the highest start and end softmax probabilities
  start_prob = torch.max(torch.softmax(outputs.start_logits, dim=-1)).item()
  end_prob = torch.max(torch.softmax(outputs.end_logits, dim=-1)).item()
  confidence = (start_prob + end_prob) / 2

  return answer, confidence

In [None]:
test_context = 'Hank ate pizza at 6 p.m. yesterday.'
test_question = 'When did Hank eat pizza yesterday?'
answer , confidence = QA_model(model,test_context,test_question)
print("Context = ",test_context)
print("Question = ",test_question)
print("Answer = ",answer)
print("Confidence = ",confidence)

Context =  Hank ate pizza at 6 p.m. yesterday.
Question =  When did Hank eat pizza yesterday?
Answer =   6 p.m.
Confidence =  0.7435054779052734
