# 4. WH-Question

In Chapter 3 we generated True/False questions with BERT and GPT. In Chapter 4 we generate WH-questions with a T5 model. The Cambridge Dictionary defines a [WH-question](https://dictionary.cambridge.org/ko/%EC%82%AC%EC%A0%84/%EC%98%81%EC%96%B4/wh-question) as a question that begins with what, when, where, who, whom, which, whose, why, or how. In this chapter we focus on multiple-choice WH-questions that ask about information contained in a passage. The code in this chapter is adapted from AMontgomerie's [question_generator](https://github.com/AMontgomerie/question_generator).
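
As a quick illustration of the definition above, the following minimal sketch checks whether a sentence opens with one of the listed WH-words. The helper name `is_wh_question` is hypothetical and not part of the chapter's code, and the check is purely lexical.

```python
# Hypothetical helper: a purely lexical check based on the WH-word list above.
WH_WORDS = ("what", "when", "where", "who", "whom", "which", "whose", "why", "how")


def is_wh_question(sentence: str) -> bool:
    """Return True if the sentence ends with '?' and its first word is a WH-word."""
    stripped = sentence.strip()
    words = stripped.split()
    if not words or not stripped.endswith("?"):
        return False
    # Normalize the first word: lowercase it and drop surrounding punctuation.
    first_word = words[0].lower().strip(",.!?\"'")
    return first_word in WH_WORDS


print(is_wh_question("What is genetic risk?"))
print(is_wh_question("Is genetic risk inherited?"))
```

A yes/no question such as the second example is not a WH-question under this definition, even though it ends with a question mark.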


Section 4.1 loads the essay data used as model input. Section 4.2 defines a class that generates questions and a class that evaluates question/answer pairs. Finally, Section 4.3 feeds real essay data into these classes to generate WH-questions.

## 4.1 Downloading the Dataset

We load the dataset saved in Section 2.?. The essays were corrected by the preprocessing in Chapter 2 and are stored in the Pseudo Lab GitHub repository as `.pkl` files. They serve as the passages from which the WH-questions are generated.

In [1]:
import pickle

# Path to the pickled list of corrected essays; adjust it to your own environment.
file_name = "C:/Users/a/OneDrive - 고려대학교/PseudoLab/tutorial/articles/corrected_essays.pkl"
with open(file_name, "rb") as open_file:
    loaded_list = pickle.load(open_file)

In [2]:
print(loaded_list[0])

Keeping the Secret of Genetic Testing What is genetic risk? Genetic risk refers  to your chance of inheriting a disorder or disease. People get certain diseases because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results indicate that you have gene changes associated with an increased risk of heart disease, it does not mean that you definitely will develop heart disease. The opposite is also true. If your genetic results show that you do not have changes associated with an increased risk of heart disease, it is still possible that you will develop heart disease. However, for some rare diseases, people who have certain gene changes are guaranteed to develop the disease. When we are diagnosed with a certain genetic disease, are we supposed to disclose this result to our relatives? My answer is no. On the one hand, we do not want this potential danger to have frightening effects on our families' lat

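Next we import the packages used in Chapter 4. Depending on your environment, some packages may be missing or installed at a different version. Use the commands below to install them or to match the versions.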

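`random` generates random numbers, and `numpy` is used for numerical computation. `json` reads and writes JSON-formatted data, and `re` provides regular expressions. `torch` is the deep learning framework used to run the models. `spacy` simplifies natural language processing, and `en_core_web_sm` is one of the English language models that spaCy distributes. Finally, `transformers` bundles a wide range of tokenizers and models.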

In [3]:
# requirements
# !pip install transformers==4.1.1
# !pip install spacy==2.3.1
# !pip install sentencepiece==0.1.94
# !python -m spacy download en_core_web_sm
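
Before installing anything, it can help to see which versions are already present. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `installed_version` and `compare_versions` are hypothetical, and the pinned versions are copied from the requirements cell above.

```python
from importlib import metadata

# Versions pinned in the requirements cell above.
PINNED = {"transformers": "4.1.1", "spacy": "2.3.1", "sentencepiece": "0.1.94"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(pinned):
    """Map each package name to (installed, pinned, matches)."""
    report = {}
    for package, wanted in pinned.items():
        found = installed_version(package)
        report[package] = (found, wanted, found == wanted)
    return report


for package, (found, wanted, ok) in compare_versions(PINNED).items():
    status = "OK" if ok else "differs"
    print(f"{package}: installed={found}, pinned={wanted} ({status})")
```

A package reported as `installed=None` is not installed at all, in which case the corresponding `pip install` line above applies.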

In [4]:
import random
import numpy as np
import json
import re
import torch
import spacy
import en_core_web_sm
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
)

Before modeling, we check whether a GPU is available in the current environment. If `cuda` is printed, a GPU can be used. If no GPU is available, the code also runs on the CPU.

In [5]:
print('cuda' if torch.cuda.is_available() else 'cpu')

cuda


## 4.2 Defining the Question Generation Classes

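We define a QuestionGenerator class to generate questions. It holds a tokenizer and a Seq2SeqLM model. We also define a QAEvaluator class, which holds a SequenceClassification model used to evaluate the generated question/answer pairs. The functions defined in these classes are called in the hierarchy listed here, and their order and roles are shown in Figure 4-1.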

- generate  
    - generate_qg_inputs  
        - _split_text  
        - _prepare_qg_inputs_MC  
            - _get_MC_answers  
    - generate_questions_from_inputs  
        - _generate_question  
            - _encode_qg_input  
    - qa_evaluator.encode_qa_pairs  
        - qa_evaluator._encode_qa  
    - qa_evaluator.get_scores  
        -  qa_evaluator._evaluate_qa  
    - _get_ranked_qa_pairs  
        -  _make_dict  

![](https://github.com/Pseudo-Lab/Tutorial-Book/blob/master/book/pics/NLP-ch4img01.png?raw=true)
 - Figure 4-1. Order and roles of the functions defined in each class
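
The first two stages of this pipeline (splitting the text and building the model input) can be sketched without loading any model. This is a simplified stand-in, not the class code itself: `split_sentences` and `build_qg_input` are hypothetical names, the answer entity is hard-coded instead of coming from spaCy, and sentence order is preserved here, whereas `_split_text` returns an unordered, de-duplicated list.

```python
import re

ANSWER_TOKEN = "<answer>"
CONTEXT_TOKEN = "<context>"


def split_sentences(text):
    """Split on '.', '!' or '?' with the same regex idea as `_split_text`."""
    return [s.strip(" ") for s in re.findall(r".*?[.!?]", text)]


def build_qg_input(answer, sentence):
    """Format one model input the way `_prepare_qg_inputs_MC` does."""
    return "{} {} {} {}".format(ANSWER_TOKEN, answer, CONTEXT_TOKEN, sentence)


text = "I arrived in Singapore one year ago. Now I have over 500 friends on Facebook."
sentences = split_sentences(text)

# The entity is hard-coded here; the chapter's code gets it from spaCy's `doc.ents`.
print(build_qg_input("Singapore", sentences[0]))
# <answer> Singapore <context> I arrived in Singapore one year ago.
```

Each extracted entity yields one such input string, so a sentence with several entities produces several candidate questions.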

The tokenizers and models used in each class are pretrained models provided through [huggingface](https://huggingface.co/models). The AutoModel API automatically identifies and builds the model architecture that matches the pretrained weights being loaded. For example, given the path of a T5 checkpoint, it builds the T5 architecture and loads the pretrained weights into it. In Section 4.2, the QuestionGenerator class loads ["iarfmoose/t5-base-question-generator"](https://huggingface.co/iarfmoose/t5-base-question-generator), and the QAEvaluator class loads ["iarfmoose/bert-base-cased-qa-evaluator"](https://huggingface.co/iarfmoose/bert-base-cased-qa-evaluator).

In [6]:
class QuestionGenerator:
    def __init__(self, model_dir=None):

        QG_PRETRAINED = "iarfmoose/t5-base-question-generator"
        self.ANSWER_TOKEN = "<answer>"
        self.CONTEXT_TOKEN = "<context>"
        self.SEQ_LENGTH = 512

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.qg_tokenizer = AutoTokenizer.from_pretrained(QG_PRETRAINED, use_fast=False)
        self.qg_model = AutoModelForSeq2SeqLM.from_pretrained(QG_PRETRAINED)
        self.qg_model.to(self.device)

        self.qa_evaluator = QAEvaluator(model_dir)

    def generate(
        self, article, num_questions=None, answer_style="multiple_choice"
    ):

        print("Generating questions...\n")

        qg_inputs, qg_answers = self.generate_qg_inputs(article, answer_style)
        generated_questions = self.generate_questions_from_inputs(qg_inputs)

        message = "{} questions don't match {} answers".format(
            len(generated_questions), len(qg_answers)
        )
        assert len(generated_questions) == len(qg_answers), message

        print("Evaluating QA pairs...\n")

        encoded_qa_pairs = self.qa_evaluator.encode_qa_pairs(
            generated_questions, qg_answers
        )
        scores = self.qa_evaluator.get_scores(encoded_qa_pairs)
        if num_questions:
            qa_list = self._get_ranked_qa_pairs(
                generated_questions, qg_answers, scores, num_questions
            )
        else:
            qa_list = self._get_ranked_qa_pairs(
                generated_questions, qg_answers, scores
            )

        return qa_list

    def generate_qg_inputs(self, text, answer_style):

        VALID_ANSWER_STYLES = ["multiple_choice"]

        if answer_style not in VALID_ANSWER_STYLES:
            raise ValueError(
                "Invalid answer style {}. Please choose from {}".format(
                    answer_style, VALID_ANSWER_STYLES
                )
            )

        inputs = []
        answers = []

        if answer_style == "multiple_choice": 
            sentences = self._split_text(text)
            prepped_inputs, prepped_answers = self._prepare_qg_inputs_MC(sentences)
            inputs.extend(prepped_inputs)
            answers.extend(prepped_answers)

        return inputs, answers

    def generate_questions_from_inputs(self, qg_inputs):
        generated_questions = []

        for qg_input in qg_inputs:
            question = self._generate_question(qg_input)
            generated_questions.append(question)

        return generated_questions

    def _split_text(self, text):
        # Grab every run of text that ends in '.', '!' or '?'.
        sentences = re.findall(r".*?[.!?]", text)

        # The earlier version also appended a word-count-filtered copy of `sentences`
        # to itself; the set() below removed those duplicates again, so that step
        # had no effect and is omitted here.
        return list(set(s.strip(" ") for s in sentences))

    def _prepare_qg_inputs_MC(self, sentences):

        spacy_nlp = en_core_web_sm.load()
        docs = list(spacy_nlp.pipe(sentences, disable=["parser"]))
        inputs_from_text = []
        answers_from_text = []

        for i in range(len(sentences)):
            entities = docs[i].ents
            if entities:
                for entity in entities:
                    qg_input = "{} {} {} {}".format(
                        self.ANSWER_TOKEN, entity, self.CONTEXT_TOKEN, sentences[i]
                    )
                    answers = self._get_MC_answers(entity, docs)
                    inputs_from_text.append(qg_input)
                    answers_from_text.append(answers)

        return inputs_from_text, answers_from_text

    def _get_MC_answers(self, correct_answer, docs):

        entities = []
        for doc in docs:
            entities.extend([{"text": e.text, "label_": e.label_} for e in doc.ents])

        entities_json = [json.dumps(kv) for kv in entities]
        pool = set(entities_json)
        num_choices = (
            min(4, len(pool)) - 1
        )

        final_choices = []
        correct_label = correct_answer.label_
        final_choices.append({"answer": correct_answer.text, "correct": True})
        pool.remove(
            json.dumps({"text": correct_answer.text, "label_": correct_answer.label_})
        )

        matches = [e for e in pool if correct_label in e]

        if len(matches) < num_choices:
            choices = matches
            pool = pool.difference(set(choices))
            # random.sample() no longer accepts a set (Python 3.11+), so sample from a list
            choices.extend(random.sample(list(pool), num_choices - len(choices)))
        else:
            choices = random.sample(matches, num_choices)

        choices = [json.loads(s) for s in choices]
        for choice in choices:
            final_choices.append({"answer": choice["text"], "correct": False})
        random.shuffle(final_choices)
        return final_choices

    def _generate_question(self, qg_input):
        self.qg_model.eval()
        encoded_input = self._encode_qg_input(qg_input)
        with torch.no_grad():
            output = self.qg_model.generate(input_ids=encoded_input["input_ids"])
        question = self.qg_tokenizer.decode(output[0], skip_special_tokens=True)
        return question

    def _encode_qg_input(self, qg_input):
        return self.qg_tokenizer(
            qg_input,
            padding='max_length',
            max_length=self.SEQ_LENGTH,
            truncation=True,
            return_tensors="pt",
        ).to(self.device)

    def _get_ranked_qa_pairs(
        self, generated_questions, qg_answers, scores, num_questions=10
    ):
        if num_questions > len(scores):
            num_questions = len(scores)
            print(
                "\nWas only able to generate {} questions. For more questions, please input a longer text.".format(
                    num_questions
                )
            )

        qa_list = []
        for i in range(num_questions):
            index = scores[i]
            qa = self._make_dict(
                generated_questions[index].split("?")[0] + "?", qg_answers[index]
            )
            qa_list.append(qa)
        return qa_list

    def _make_dict(self, question, answer):
        qa = {}
        qa["question"] = question
        qa["answer"] = answer
        return qa

In [7]:
class QAEvaluator:
    def __init__(self, model_dir=None):

        QAE_PRETRAINED = "iarfmoose/bert-base-cased-qa-evaluator"
        self.SEQ_LENGTH = 512

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.qae_tokenizer = AutoTokenizer.from_pretrained(QAE_PRETRAINED)
        self.qae_model = AutoModelForSequenceClassification.from_pretrained(
            QAE_PRETRAINED
        )
        self.qae_model.to(self.device)

    def encode_qa_pairs(self, questions, answers):
        encoded_pairs = []
        for i in range(len(questions)):
            encoded_qa = self._encode_qa(questions[i], answers[i])
            encoded_pairs.append(encoded_qa.to(self.device))
        return encoded_pairs

    def get_scores(self, encoded_qa_pairs):
        scores = {}
        self.qae_model.eval()
        with torch.no_grad():
            for i in range(len(encoded_qa_pairs)):
                scores[i] = self._evaluate_qa(encoded_qa_pairs[i])

        return [
            k for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)
        ]

    def _encode_qa(self, question, answer):
        if type(answer) is list:
            for a in answer:
                if a["correct"]:
                    correct_answer = a["answer"]
        else:
            correct_answer = answer
        return self.qae_tokenizer(
            text=question,
            text_pair=correct_answer,
            padding="max_length",
            max_length=self.SEQ_LENGTH,
            truncation=True,
            return_tensors="pt",
        )

    def _evaluate_qa(self, encoded_qa_pair):
        output = self.qae_model(**encoded_qa_pair)
        return output[0][0][1]

To print the generated questions in a readable form, we define the `print_qa` function.

In [8]:
def print_qa(qa_list, show_answers=True):
    for i in range(len(qa_list)):
        space = " " * int(np.where(i < 9, 3, 4)) 

        print("{}) Q: {}".format(i + 1, qa_list[i]["question"]))

        answer = qa_list[i]["answer"]

        if type(answer) is list:

            if show_answers:
                print(
                    "{}A: 1.".format(space),
                    answer[0]["answer"],
                    np.where(answer[0]["correct"], "(correct)", ""),
                )
                for j in range(1, len(answer)):
                    print(
                        "{}{}.".format(space + "   ", j + 1),
                        answer[j]["answer"],
                        np.where(answer[j]["correct"] == True, "(correct)", ""),
                    )

            else:
                print("{}A: 1.".format(space), answer[0]["answer"])
                for j in range(1, len(answer)):
                    print("{}{}.".format(space + "   ", j + 1), answer[j]["answer"])
            print("")

        else:
            if show_answers:
                print("{}A:".format(space), answer, "\n")

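To summarize: named entities (`ents`) are extracted from each passage and combined into multiple-choice options (one correct answer plus distractors). A question is generated from the correct answer and the sentence that contains it. Each question/answer pair is then scored for how well it fits together, and the highest-scoring pairs make up the final question sheet.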
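
The option-building and ranking steps in this summary can be sketched in plain Python. This is an illustrative simplification under stated assumptions, not the class code: `pick_choices` and `rank_by_score` are hypothetical names, entities are hand-written `(text, label)` tuples instead of spaCy spans, and a fixed seed is used only so the sketch is reproducible.

```python
import random


def pick_choices(correct, entities, max_choices=4, seed=0):
    """Build a shuffled multiple-choice list from (text, label) entity tuples.

    Distractors with the same label as the correct answer are preferred;
    remaining slots are filled from the other entities.
    """
    rng = random.Random(seed)  # seeded only to make this sketch reproducible
    pool = sorted(set(entities) - {correct})
    num_distractors = min(max_choices - 1, len(pool))
    same_label = [e for e in pool if e[1] == correct[1]]
    others = [e for e in pool if e[1] != correct[1]]
    if len(same_label) >= num_distractors:
        distractors = rng.sample(same_label, num_distractors)
    else:
        extra = rng.sample(others, num_distractors - len(same_label))
        distractors = same_label + extra
    choices = [{"answer": correct[0], "correct": True}]
    choices += [{"answer": text, "correct": False} for text, _ in distractors]
    rng.shuffle(choices)
    return choices


def rank_by_score(scores):
    """Return item indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hand-written entities; in the chapter they come from spaCy's `doc.ents`.
entities = [
    ("Singapore", "GPE"),
    ("China", "GPE"),
    ("Facebook", "ORG"),
    ("one year ago", "DATE"),
]
choices = pick_choices(("Singapore", "GPE"), entities)
print(choices)
print(rank_by_score([0.2, 1.5, -0.3]))
```

Comparing labels directly, as done here, is stricter than the substring test on JSON strings used in `_get_MC_answers`, which can also match an entity whose text happens to contain the label.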

## 4.3 Generating WH-Questions

We instantiate the class defined above to build the `qg` object.

In [9]:
qg = QuestionGenerator()

Some weights of the model checkpoint at iarfmoose/t5-base-question-generator were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


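`generate`, the main function of `qg`, generates the questions and returns them as a list, which we store in `qa_list`. The `num_questions` parameter sets the number of questions to generate. The `print_qa` function then prints the final question sheet.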
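
Each element of the returned list is a dictionary with a `"question"` string and an `"answer"` list of choices, as built by `_make_dict` and `_get_MC_answers`. The sketch below works on a hand-written sample in that shape (the contents are made up, not model output) and shows one way to pull out an answer key; `correct_answer_of` and `answer_key` are hypothetical helper names.

```python
# Hand-written sample in the shape returned by `generate`; the contents are made up.
sample_qa_list = [
    {
        "question": "Where did the author arrive one year ago?",
        "answer": [
            {"answer": "China", "correct": False},
            {"answer": "Singapore", "correct": True},
            {"answer": "Facebook", "correct": False},
        ],
    }
]


def correct_answer_of(qa):
    """Return the text of the choice flagged as correct, or None if there is none."""
    for choice in qa["answer"]:
        if choice["correct"]:
            return choice["answer"]
    return None


def answer_key(qa_list):
    """Map 1-based question numbers to the 1-based position of the correct choice."""
    key = {}
    for number, qa in enumerate(qa_list, start=1):
        for position, choice in enumerate(qa["answer"], start=1):
            if choice["correct"]:
                key[number] = position
    return key


print(correct_answer_of(sample_qa_list[0]))  # Singapore
print(answer_key(sample_qa_list))  # {1: 2}
```

Keeping the key separate makes it possible to hand out the question sheet with `print_qa(qa_list, show_answers=False)` and grade it afterwards.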

In [39]:
article = loaded_list[0]
print(article)

qa_list = qg.generate(
    article, 
    num_questions=5, 
    answer_style='multiple_choice'
)
print_qa(qa_list)

Keeping the Secret of Genetic Testing What is genetic risk? Genetic risk refers  to your chance of inheriting a disorder or disease. People get certain diseases because of genetic changes. How much a genetic change tells us about your chance of developing a disorder is not always clear. If your genetic results indicate that you have gene changes associated with an increased risk of heart disease, it does not mean that you definitely will develop heart disease. The opposite is also true. If your genetic results show that you do not have changes associated with an increased risk of heart disease, it is still possible that you will develop heart disease. However, for some rare diseases, people who have certain gene changes are guaranteed to develop the disease. When we are diagnosed with a certain genetic disease, are we supposed to disclose this result to our relatives? My answer is no. On the one hand, we do not want this potential danger to have frightening effects on our families' lat

In [40]:
article2 = loaded_list[15]
print(article2)

qa_list = qg.generate(
    article2, 
    num_questions=5, 
    answer_style='multiple_choice'
)
print_qa(qa_list)

Nowadays,  studies in the genetic field are going further and further due to  high-velocity technological advancement. Not long ago, the human genome project, sponsored by several large multinational corporations (MNCs), was started. , in the field of genetic testing, which enables people to be aware of the risk of a possible genetic disorder, as the technology is very mature already, an even more popular topic arises. While most people agree that each individual has his own right to decide whether to undergo a genetic test or not, they have different stands regarding whether it is the patient's responsibility  to share the result with his or her relatives. In my opinion, the patients who are carriers of some known genetic risk should share the truth with their family members. Admittedly, it is a social obligation to be responsible for the life of other people. For the whole society, it is a duty. Take H1N1 as an example. All the people in Singapore,  including students at school, trav

In [41]:
article3 = loaded_list[25]
print(article3)

qa_list = qg.generate(
    article3, 
    num_questions=5, 
    answer_style='multiple_choice'
)
print_qa(qa_list)

The Impact Of Social Media Social media play a vitally important role in our lives today. It is almost impossible for us to keep away from it. I started using  Facebook one year ago,when I had just arrived in Singapore. And now I have over 500 friends on Facebook . We know each others' status, changes and so on through  social media. For  example, if we change our contact numbers, we will not send an SMS to each otherindividually any more. Instead, we will post a  and tag our friends to inform them of this kind of change. And if you are not on it, you will just get lost. In certain respects, social media have brought us quite a lot of convenience. It is becoming even more convenient than talking on the phone, since it is free. We discuss everything we like and we have more personal space.Our friends also get more access to  what we are doing and what we like. And also, we get a chance to talk to public figures like celebrities and know their life better. In China, a lot of government o

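We generated final question sheets for several passages. The questions produced by qg_model appear reasonably well-formed grammatically. However, the distractors are simply entities drawn from the other sentences of the passage, which limits their quality. We expect more meaningful questions for passages that are more informational in nature.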

In this chapter we generated WH-questions with pretrained models provided through Hugging Face. If you have a QA dataset such as SQuAD, try fine-tuning on the domain you care about to produce more specialized questions.
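
As a starting point for such fine-tuning, the sketch below converts SQuAD-style JSON into (source, target) text pairs. It assumes the training input should mirror the `<answer> ... <context> ...` string used at inference time in this chapter; whether the pretrained checkpoint was trained in exactly this way is not confirmed here. The function name `squad_to_pairs` and the toy record are hypothetical.

```python
ANSWER_TOKEN = "<answer>"
CONTEXT_TOKEN = "<context>"


def squad_to_pairs(squad):
    """Turn SQuAD-style nested JSON into (source, target) text pairs.

    source: "<answer> {answer} <context> {context}", target: the question.
    """
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if not qa["answers"]:  # e.g. unanswerable questions in SQuAD 2.0
                    continue
                answer = qa["answers"][0]["text"]
                source = "{} {} {} {}".format(
                    ANSWER_TOKEN, answer, CONTEXT_TOKEN, context
                )
                pairs.append((source, qa["question"]))
    return pairs


# Tiny hand-written record in the SQuAD layout; not taken from the real dataset.
toy_squad = {
    "data": [
        {
            "title": "Toy",
            "paragraphs": [
                {
                    "context": "Genetic risk refers to your chance of inheriting a disorder or disease.",
                    "qas": [
                        {
                            "id": "toy-1",
                            "question": "What does genetic risk refer to?",
                            "answers": [
                                {
                                    "text": "your chance of inheriting a disorder or disease",
                                    "answer_start": 23,
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

pairs = squad_to_pairs(toy_squad)
print(pairs[0][0])
print(pairs[0][1])
```

The resulting pairs could then be tokenized and passed to a sequence-to-sequence training loop, with the source as encoder input and the question as the decoder target.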

## 4.4 References

- [Question_Generator](https://github.com/AMontgomerie/question_generator/blob/master/questiongenerator.py)
- [huggingface](https://huggingface.co/models)