# QG - A Question Generating NLP Model

This notebook shows how to load and use the `QGAR` model.

Please read the [README](./README.md) before continuing!

---

**Table of Contents**

0. [Dependencies](#0-dependencies)
1. [Load QG](#1-load-qg)
2. [Train QG](#2-train-qg)
3. [Run QG](#3-run-qg)
4. [Answer Generator](#4-answer-generator)
5. [Output to Anki](#5-output-to-anki)

---

## 0. Dependencies

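The project depends on the packages installed in the cell below; the PyTorch wheels come from the CUDA 11.8 (`cu118`) index.

Install them into your virtual environment before running the rest of the notebook.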

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q
%pip install transformers -q
%pip install datasets -q
%pip install wandb -q

## 1. Load QG

To load our question generation model, we use the `QG` class from the `models.qg` module.

The underlying `model` and `tokenizer` are then available through the instance's `_model` and `_tokenizer` fields.

In [None]:
%load_ext autoreload
%autoreload 2

from models.qg import QG

qg = QG("t5-small", "t5-small")
model = qg._model
tokenizer = qg._tokenizer

## 2. Train QG

To train `QG`, we first parse the `settings.json` file to get the `TrainingArguments` and `DataTrainingArguments`.

We then call `train` on the `QG` instance, passing both argument objects and the path to a local `wandb_token.txt` file. This trains the model and pushes it to the `The Coorporation` organization on the Hugging Face Hub.

In [None]:
%load_ext autoreload
%autoreload 2

from main import get_local_file
from models.qg import QG
from parsing.settings_parser import parse_settings

qg = QG("t5-small", "t5-small")

_, data_args, train_args = parse_settings()
qg.train(train_args, data_args, get_local_file("wandb_token.txt"))
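
The project's `parse_settings` is not shown in this notebook, so the following is only a minimal sketch of the general idea: routing the keys of one flat JSON settings object into separate argument dataclasses. `DataArgs`, `TrainArgs`, `split_settings`, and every field name and value here are hypothetical stand-ins, not the real `DataTrainingArguments`, `TrainingArguments`, or the contents of `settings.json`.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class DataArgs:
    """Hypothetical stand-in for DataTrainingArguments."""
    dataset_name: str = "placeholder-dataset"
    max_source_length: int = 512


@dataclass
class TrainArgs:
    """Hypothetical stand-in for TrainingArguments."""
    output_dir: str = "out"
    num_train_epochs: float = 1.0
    learning_rate: float = 1e-4


def split_settings(raw, *arg_classes):
    """Give each key of a flat settings dict to the dataclass that declares it."""
    remaining = dict(raw)
    parsed = []
    for cls in arg_classes:
        names = {f.name for f in fields(cls)}
        kwargs = {k: remaining.pop(k) for k in list(remaining) if k in names}
        parsed.append(cls(**kwargs))
    if remaining:
        # Fail loudly on typos instead of silently ignoring them
        raise ValueError(f"Unknown settings keys: {sorted(remaining)}")
    return tuple(parsed)


# Placeholder JSON text; the real settings.json is not reproduced here
raw = json.loads(
    '{"dataset_name": "my-dataset", "output_dir": "qg-out", "num_train_epochs": 3}'
)
data_args, train_args = split_settings(raw, DataArgs, TrainArgs)
print(data_args)
print(train_args)
```

Keys that are absent from the JSON fall back to the dataclass defaults, so a settings file only needs to list the values that differ from them.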

## 3. Run QG

To run `QG` on an example `context`, we call the instance directly, which invokes the class's `__call__` method.

The call returns a dictionary containing the `context` and a list of `questions` generated for it.

In [None]:
%load_ext autoreload
%autoreload 2

from models.qg import QG
import json

qg = QG("t5-small", "t5-small")
context = "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

questions = qg(context)
print(json.dumps(questions, indent=4))
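
To illustrate the `__call__` convention and the shape of the returned dictionary without loading a model, here is a small sketch. `CallableQG` and `stub_generate` are hypothetical names; the stub produces templated placeholder questions and does not reflect what the real `QG` class generates.

```python
import json


class CallableQG:
    """Hypothetical stand-in for QG that wraps any question-generating function."""

    def __init__(self, generate):
        # generate: callable taking a context string and returning a list of questions
        self._generate = generate

    def __call__(self, context):
        # Defining __call__ lets an instance be used like a function: qg_sketch(context)
        return {"context": context, "questions": self._generate(context)}


def stub_generate(context):
    # Placeholder logic only: one templated question per sentence, no model involved
    sentences = [s for s in context.split(". ") if s]
    return [f"What does sentence {i} say?" for i, _ in enumerate(sentences, 1)]


qg_sketch = CallableQG(stub_generate)
result = qg_sketch("First fact. Second fact.")
print(json.dumps(result, indent=4))
```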

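The same steps can also be run by hand with `transformers`, without the `QG` wrapper. The cells below load the fine-tuned `the-coorporation/t5-small-qg` checkpoint, register the `<sep>` token, and generate questions for the same `context`.

In [None]: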
from transformers import AutoTokenizer, T5ForConditionalGeneration
import torch

_MODEL_MAX_LENGTH = 512

device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("the-coorporation/t5-small-qg").to(device)
tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=_MODEL_MAX_LENGTH)

tokenizer.add_tokens(['<sep>'])
model.resize_token_embeddings(len(tokenizer))

In [None]:
context = "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

generator_args = {
    "max_length": 256,
    "num_beams": 4,
    "length_penalty": 1.5,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
}

input_string = "generate questions: " + context + " </s>"

# Encode input string
inputs = tokenizer.encode(
    input_string, 
    add_special_tokens=True,
    truncation=True,
    return_tensors="pt", 
).to(device)

# Let the model generate questions from the encoded input
result = model.generate(inputs, **generator_args)

# Decode the questions generated by the model
questions = tokenizer.decode(result[0], skip_special_tokens=True)
print(questions)

# Split the decoded string on the separator token and drop empty fragments
question_list = [question.strip() for question in questions.split("<sep>") if question.strip()]

# Rebuild the dictionary shape returned by `qg(context)`, which section 4 expects
questions = {
    "context": context,
    "questions": question_list,
}

questions

---

## 4. Answer Generator

To pair each generated question with an answer, we run a `question-answering` pipeline (`distilbert-base-cased-distilled-squad`) over the same `context`.

Only answers with a confidence `score` above `0.5` are kept.

In [None]:
%load_ext autoreload

%autoreload 2

from transformers import pipeline
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

question_answers = []

print(questions["questions"])

for question in questions["questions"]:
    print(question)
    result = question_answerer(question=question, context=context)
    if result["score"] > 0.5:
        question_answers.append({ "question": question, "answer": result["answer"] })
        # question_answers.append({ "question": question, "answer": result["answer"], "score": result["score"] })

print(question_answers)
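
The filtering step can be factored into a small function that accepts any answerer callable, which makes the threshold easy to adjust and test without downloading a model. This is a sketch, not code from the project: `answer_questions`, `min_score`, and `fake_answerer` are hypothetical names, and the stub's scores are hard-coded for illustration.

```python
def answer_questions(questions, context, answerer, min_score=0.5):
    """Return question/answer pairs whose confidence score exceeds min_score."""
    pairs = []
    for question in questions:
        # Same call shape as the pipeline: keyword arguments in, dict with
        # "score" and "answer" keys out
        result = answerer(question=question, context=context)
        if result["score"] > min_score:
            pairs.append({"question": question, "answer": result["answer"]})
    return pairs


def fake_answerer(*, question, context):
    # Stub standing in for the question-answering pipeline; scores are made up
    score = 0.9 if "set" in question else 0.2
    return {"score": score, "answer": "in the past"}


demo_pairs = answer_questions(
    ["When is it set?", "Who wrote it?"],
    "It is set in the past.",
    fake_answerer,
)
print(demo_pairs)
```

Passing `question_answerer` from the cell above in place of `fake_answerer` should behave the same way, since it is called with the same keyword arguments.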

## 5. Output to Anki

Finally, we write the question/answer pairs to a headerless CSV file for import into Anki, with the question as the front of each card and the answer as the back.

In [None]:
# Output format: Front, Back
# Front: Question
# Back: Answer
import pandas as pd

df = pd.DataFrame(question_answers, columns=["question", "answer"])
df.to_csv("anki-output.csv", index=False, header=False)
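
If pandas is not otherwise needed, the same two-column file can be produced with the standard library. The sketch below writes to an in-memory buffer so the exact output is visible; `write_anki_rows` is a hypothetical helper name and the sample pair is a placeholder, not model output.

```python
import csv
import io


def write_anki_rows(pairs, handle):
    """Write one 'question,answer' row per pair, with no header row."""
    writer = csv.writer(handle)
    for pair in pairs:
        # Column 1 is the card front, column 2 the card back
        writer.writerow([pair["question"], pair["answer"]])


buffer = io.StringIO()
write_anki_rows(
    [{"question": "When is it set?", "answer": "In the past, long ago"}],
    buffer,
)
print(buffer.getvalue())
```

Fields that contain a comma are quoted automatically, so an answer such as the one above still lands in a single column. To write to disk instead, open the target file with `newline=""` and pass the file handle in place of `buffer`.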