# Lecture 9) Running Closed-book Question Answering

## Natural Questions 

- [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) 
- [Natural Questions from huggingface](https://huggingface.co/datasets/nq_open)
- [Original NQ Dataset from huggingface](https://huggingface.co/datasets/natural_questions)

Natural Questions is a dataset frequently used for open-domain QA. In this exercise, we run closed-book QA using only the questions and answers from the NQ dataset.


## Objective
In this exercise, we load BART, fine-tune it to predict short answers, and check the results. We also load a T5 model that was already trained on NQ and evaluate it.

## Requirements

In [1]:
%%bash
# install packages
pip install tqdm==4.64.1 -q
pip install datasets==2.12.0 -q
pip install transformers==4.24.0 -q
pip install sentencepiece==0.1.97 -q
pip install apache_beam > /dev/null 2>&1 # for trivia or nq datasets, just in case


## Loading the data

In [2]:
import os
from tqdm.auto import tqdm, trange
import argparse
import random
import numpy as np

from datasets import load_dataset, Dataset

In [3]:
dataset = load_dataset("nq_open")

Downloading builder script:   0%|          | 0.00/6.60k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/5.14k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.63k [00:00<?, ?B/s]

Downloading and preparing dataset nq_open/nq_open to /root/.cache/huggingface/datasets/nq_open/nq_open/2.0.0/75b7e191dc38a0f99f451a2cc0dc969fee2965238051d6f03989ff66ea1f39a5...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/126k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.61M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87925 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3610 [00:00<?, ? examples/s]

Dataset nq_open downloaded and prepared to /root/.cache/huggingface/datasets/nq_open/nq_open/2.0.0/75b7e191dc38a0f99f451a2cc0dc969fee2965238051d6f03989ff66ea1f39a5. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
import re
def nq_preprocessor(ex):
  def normalize_text(text):
    """Lowercase and remove quotes from a string."""
    text = text.lower()
    text = re.sub("'(.*)'", r"\1", text)
    return text

  def to_inputs_and_targets(ex):
    """Map {"question": ..., "answer": ...}->{"inputs": ..., "targets": ...}."""
    return {
        "inputs": normalize_text(ex["question"]),
        "targets": normalize_text(random.choice(ex["answer"])),
    }
  return to_inputs_and_targets(ex)
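To see what the preprocessing does to a single record, here is a minimal sketch that applies the same normalization to a made-up example. The record below is hypothetical (it is not taken from NQ), and because `random.choice` picks one of the reference answers, the target can differ between runs unless the seed is fixed.

```python
import random
import re


def normalize_text(text):
    """Lowercase the text and drop one pair of enclosing single quotes."""
    text = text.lower()
    # Greedy match: removes the first and the last single quote in the string.
    return re.sub("'(.*)'", r"\1", text)


# Hypothetical record in the nq_open layout: one question, a list of answers.
example = {"question": "Who wrote 'Hamlet'", "answer": ["William Shakespeare", "Shakespeare"]}

random.seed(2023)  # fix the seed so the sampled answer is reproducible
pair = {
    "inputs": normalize_text(example["question"]),
    "targets": normalize_text(random.choice(example["answer"])),
}
print(pair["inputs"])  # who wrote hamlet
```

A string with a single apostrophe (for example `it's`) is left unchanged, since the pattern needs two quote characters to match.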

In [5]:
import multiprocessing as mp
cpus = mp.cpu_count()

remove_columns = ['question', "answer"]
train_ds = dataset["train"].map(nq_preprocessor, num_proc=cpus).remove_columns(remove_columns)
valid_ds = dataset["validation"].map(nq_preprocessor, num_proc=cpus).remove_columns(remove_columns)

Map (num_proc=2):   0%|          | 0/87925 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/3610 [00:00<?, ? examples/s]

In [6]:
# Choose small samples as our dataset 
sample_idx = np.random.choice(range(len(train_ds)), 4) 
training_dataset = train_ds[sample_idx]
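Note that `np.random.choice` samples with replacement by default, so the four indices may contain duplicates, and the selection changes between runs when no seed has been set beforehand. A minimal sketch of reproducible sampling without replacement, using only the standard library (the helper name `sample_indices` is hypothetical):

```python
import random


def sample_indices(num_examples, k, seed=2023):
    """Return k distinct indices in [0, num_examples), reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return sorted(rng.sample(range(num_examples), k))


idx = sample_indices(87925, 4)  # 87925 is the nq_open train split size reported when loading
print(idx)
```

Assuming the dataset object accepts a list of integer indices, as it does for the NumPy array in the cell, `train_ds[idx]` would then return the same four examples on every run.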

## Training

In [7]:
import torch
import torch.nn.functional as F

from transformers import (AutoTokenizer, 
                          AutoModelForSeq2SeqLM, 
                          AdamW,
                          TrainingArguments, 
                          get_linear_schedule_with_warmup)

In [8]:
args = TrainingArguments(
    output_dir="seq2seq_models/bart_nq",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    gradient_accumulation_steps=2
)
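These settings interact: the batch size and the accumulation steps together decide how many optimizer updates actually happen. The arithmetic can be sketched as follows, assuming a single device and a dataloader that keeps the last partial batch; the function name `num_optimizer_updates` is hypothetical.

```python
import math


def num_optimizer_updates(num_examples, batch_size, accumulation_steps, epochs):
    """Count optimizer updates when gradients are accumulated over several batches."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    updates_per_epoch = batches_per_epoch // accumulation_steps
    return updates_per_epoch * epochs


# Values from this notebook: 4 sampled examples and the TrainingArguments of this cell.
effective_batch = 2 * 2  # per_device_train_batch_size * gradient_accumulation_steps
total_updates = num_optimizer_updates(num_examples=4, batch_size=2, accumulation_steps=2, epochs=2)
print(effective_batch, total_updates)  # 4 2
```

With only four training examples, the whole run amounts to two weight updates, which is worth keeping in mind when reading the predictions later on.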

In [9]:
args.device

device(type='cuda', index=0)

In [10]:
# load pre-trained model on cuda (if available)
model_checkpoint = "facebook/bart-large"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(args.device)

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

In [11]:
!nvidia-smi

Mon May 15 13:53:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P0    29W /  70W |   2427MiB / 15360MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [12]:
torch.manual_seed(2023)
torch.cuda.manual_seed(2023)
np.random.seed(2023)
random.seed(2023)

In [13]:
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
max_len = 128 # to reduce memory per sample! 
token_dict = dict(padding="max_length",
                  max_length=max_len,
                  truncation=True,
                  return_tensors="pt")

q_seqs = tokenizer(training_dataset['inputs'], **token_dict)
a_seqs = tokenizer(training_dataset['targets'], **token_dict)
train_dataset = TensorDataset(q_seqs['input_ids'], q_seqs['attention_mask'],
                              a_seqs['input_ids'], a_seqs['attention_mask'])
train_dataloader = DataLoader(train_dataset, batch_size=args.per_device_eval_batch_size)

In [14]:
# Optimizer
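# BART names its layer norms "*_layer_norm" and "layernorm_embedding" (BERT uses "LayerNorm")
no_decay = ['bias', 'layer_norm.weight', 'layernorm_embedding.weight']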
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
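The grouping matches parameter names by substring, so the patterns in `no_decay` have to fit the naming scheme of the model in use. The sketch below uses a few illustrative names (not read from a real checkpoint) to show that a BERT-style pattern such as `LayerNorm.weight` does not match a name like `final_layer_norm.weight`, which to the best of our knowledge is how BART names its layer norms. The helper `split_by_weight_decay` is hypothetical.

```python
def split_by_weight_decay(param_names, no_decay):
    """Split parameter names into (decayed, not_decayed) by substring match."""
    decayed = [n for n in param_names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in param_names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# Illustrative names only; the real list comes from model.named_parameters().
names = [
    "model.encoder.layers.0.fc1.weight",
    "model.encoder.layers.0.fc1.bias",
    "model.encoder.layers.0.final_layer_norm.weight",
    "bert.encoder.layer.0.output.LayerNorm.weight",
]

decayed, not_decayed = split_by_weight_decay(names, ["bias", "LayerNorm.weight"])
print(decayed)      # the BART-style layer norm weight ends up in the decayed group
print(not_decayed)
```

Printing `[n for n, _ in model.named_parameters()]` for the loaded model is a quick way to confirm which patterns apply.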



In [15]:
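def train(args, dataset, model, optimizer):
    # Dataloader
    train_sampler = RandomSampler(dataset)
    
    train_dataloader = DataLoader(dataset, batch_size=args.per_device_train_batch_size,
                                  sampler=train_sampler, )

    # Number of optimizer updates: one per `gradient_accumulation_steps` batches
    t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)

    # Start training
    global_step = 0

    model.zero_grad()

    train_iterator = trange(int(args.num_train_epochs), desc="Epoch")

    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration")

        for step, batch in enumerate(epoch_iterator):
            model.train()

            q_ids, q_mask, a_ids, a_mask = batch
            # Build the labels: every answer token except the first one (BOS);
            # padded positions are set to -100 so that the loss ignores them
            lm_labels = a_ids[:, 1:].contiguous().clone()
            lm_labels[a_mask[:, 1:].contiguous() == 0] = -100

            # decoder_input_ids could be omitted (the model then derives them by
            # shifting the labels to the right); here they are passed explicitly
            model_inputs = {
                "input_ids": q_ids.to(args.device),
                "attention_mask": q_mask.to(args.device),
                "decoder_input_ids": a_ids[:, :-1].contiguous().to(args.device),
                "labels": lm_labels.to(args.device),
            }

            outputs = model(**model_inputs)
            # outputs[0] is the scalar LM loss; scale it so accumulated gradients are averaged
            loss = outputs[0] / args.gradient_accumulation_steps

            loss.backward()

            # Update the weights only once every `gradient_accumulation_steps` batches,
            # so the number of scheduler steps matches t_total
            if (step + 1) % args.gradient_accumulation_steps == 0:
                optimizer.step()
                scheduler.step()  # scheduler that adjusts the learning rate
                model.zero_grad()
                global_step += 1

        # save the model once per epoch (matches save_strategy="epoch")
        model.save_pretrained(args.output_dir)

    return model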
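The label construction inside the training loop is ordinary teacher forcing: the decoder sees the answer without its last position and is asked to predict the answer without its first position, with padding hidden from the loss. A minimal sketch on plain Python lists (one sequence instead of a batch of tensors); the helper name and the word ids 100 and 200 are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def shift_for_teacher_forcing(answer_ids, answer_mask):
    """Build decoder inputs and labels from one padded answer sequence."""
    decoder_input_ids = answer_ids[:-1]  # everything except the last position
    labels = [
        tok if keep == 1 else IGNORE_INDEX  # hide padding from the loss
        for tok, keep in zip(answer_ids[1:], answer_mask[1:])
    ]
    return decoder_input_ids, labels


# Toy ids using BART's special tokens: 0 = <s>, 2 = </s>, 1 = <pad>.
answer_ids = [0, 100, 200, 2, 1, 1]
answer_mask = [1, 1, 1, 1, 0, 0]

decoder_input_ids, labels = shift_for_teacher_forcing(answer_ids, answer_mask)
print(decoder_input_ids)  # [0, 100, 200, 2, 1]
print(labels)             # [100, 200, 2, -100, -100]
```

Each label is the token that follows the decoder input at the same position, which is why the two sequences have equal length.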

In [16]:
model = train(args, train_dataset, model, optimizer)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]
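The scheduler created in `train` scales the learning rate linearly. The sketch below reproduces the multiplier as we understand `get_linear_schedule_with_warmup` (linear warmup followed by linear decay to zero); the function name `linear_schedule_multiplier` is ours. It illustrates why `num_training_steps` should equal the number of times `scheduler.step()` is actually called.

```python
def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
# With warmup_steps=0 and total_steps=2 the rate is already zero at step 2,
# so any optimizer update after that point changes nothing.
for step in range(4):
    print(step, base_lr * linear_schedule_multiplier(step, warmup_steps=0, total_steps=2))
```

In this notebook `warmup_steps` keeps its default of 0, so the rate starts at `learning_rate` and only decays.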

## Checking the trained model's output with `.generate`

In [17]:
def qa_s2s_generate(
    model_inputs,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=2,
    min_len=1,
    max_len=64,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
):
    # n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    n_beams = num_beams
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs[0],
        attention_mask=model_inputs[1],
        min_length=min_len,
        max_new_tokens=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]

In [18]:
def generate_answer(model, tokenizer, dataloader=None):
  if dataloader is None:
    # Choose small samples as our dataset 
    sample_idx = np.random.choice(range(len(valid_ds)), 4)
    validation_dataset = valid_ds[sample_idx]

    input_dict = tokenizer(validation_dataset['inputs'], **token_dict)
    target_dict = tokenizer(validation_dataset['targets'], **token_dict)

    # The targets are not needed for generation; they are included only to compare against the model output
    valid_dataset = TensorDataset(input_dict['input_ids'], input_dict['attention_mask'],  target_dict['input_ids'])
    dataloader = DataLoader(valid_dataset, batch_size=args.per_device_eval_batch_size)
  
  for step, batch in enumerate(dataloader):
      model.eval()
      
      if torch.cuda.is_available():
          batch = tuple(t.cuda() for t in batch)

      inputs = [input.strip() for input in tokenizer.batch_decode(batch[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)]
      targets = [target.strip() for target in tokenizer.batch_decode(batch[2], skip_special_tokens=True, clean_up_tokenization_spaces=True)]
      
      results = qa_s2s_generate(model_inputs=batch,
                                qa_s2s_model=model,
                                qa_s2s_tokenizer=tokenizer,
                                num_beams=3)

      for inp, tgt, pred in zip(inputs, targets, results):
          print("Input:", inp)
          print("Target:", tgt)
          print("Prediction:", pred)
          print()

In [19]:
# Check how well the model does on the training data
generate_answer(model=model,
                tokenizer=tokenizer,
                dataloader=train_dataloader)

Input: who does jim end up with in the office
Target: receptionist pam beesly
Prediction: who does jim end up with in the office

Input: who plays alex in the big bang theory
Target: margo cathleen harshman
Prediction: who plays alex in the big bang theory

Input: the effect of french revolution on english literature
Target: romanticism
Prediction: the effect of french revolution on english literature

Input: who plays the character jesus in the walking dead
Target: tom payne
Prediction: who plays the character jesus in the walking dead



In [20]:
# Check how well the model does on the validation data
generate_answer(model=model,
                tokenizer=tokenizer)

Input: in the song i drive your truck who is he talking about
Target: his brother
Prediction: in the song i drive your truck who is he talking about

Input: mount everest is part of what mountain range
Target: himalayas
Prediction: mount everest is part of what mountain range

Input: which president of the united states was a boy scout
Target: gerald ford
Prediction: which president of the united states was a boy scout

Input: when was the first documented case of tool mark identification
Target: 1835
Prediction: when was the first documented case of tool mark identification



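## Testing with a pre-trained model

Did the results look good? BART itself was never trained on a question answering task, so unless it is properly fine-tuned first, it does not produce good answers; in the runs above, after training on only four examples, it simply repeats the question. This time, let's load a T5 model trained on NQ and see how well it does on the validation set.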




In [21]:
# load pre-trained model on cuda (if available)
model_checkpoint = "google/t5-large-ssm-nq"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(args.device) # T5ForConditionalGeneration

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

In [22]:
generate_answer(model=model,
                tokenizer=tokenizer)

Input: the first vice president of india who become the president letter was
Target: sarvepalli radhakrishnan
Prediction: The first President of India

Input: how many votes to approve supreme court justice
Target: a simple majority
Prediction: A simple majority vote

Input: who wrote you must have been a beautiful baby
Target: harry warren
Prediction: Billy Preston

Input: which is the site of the light dependent reactions of photosynthesis
Target: the thylakoid membranes
Prediction: on the thylakoid membrane
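Reading predictions by eye only goes so far. One common way to score short answers is normalized exact match together with token-level F1; the sketch below is a minimal version of that idea under our own normalization choices (lowercasing, dropping punctuation and English articles), not the metric used by the dataset authors. All function names are ours.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, target):
    return float(normalize_answer(prediction) == normalize_answer(target))


def token_f1(prediction, target):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(target).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# (prediction, target) pairs copied from the T5 output shown here.
pairs = [
    ("The first President of India", "sarvepalli radhakrishnan"),
    ("A simple majority vote", "a simple majority"),
    ("Billy Preston", "harry warren"),
    ("on the thylakoid membrane", "the thylakoid membranes"),
]
for pred, tgt in pairs:
    print(f"EM={exact_match(pred, tgt):.0f}  F1={token_f1(pred, tgt):.2f}  {pred!r} vs {tgt!r}")
```

Since each NQ question can have several reference answers and the preprocessing keeps only one at random, a fuller evaluation would take the best score over all references for a question.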



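### **Content License**

<font color='red'><b>**WARNING**</b></font> : **The intellectual property rights to this educational content belong to the NAVER Connect Foundation. Leaking this content externally or modifying it by any means is strictly prohibited.** It may, however, be used for non-profit educational and research activities only, subject to the foundation's permission. Violations may incur liability under the applicable laws. Model license: MIT License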

