# QGAR - A Flashcard Generating NLP Model

This notebook shows how to load and use the `QGAR` model.

Please read the [README](./README.md) before continuing!

**Table of Contents:**
1. [Load QG](#1-load-qg)
2. [Download and Preprocess SQuAD Dataset](#2-download-and-preprocess-squad-dataset)
3. [Run QG](#3-run-qg)
4. [Train QG](#4-train-qg)
5. [Answer Generator](#5-answer-generator)
6. [Output to Anki](#6-output-to-anki)

<br>

---

<br>

## 0. Used Libraries

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
%pip install transformers
%pip install datasets
%pip install wandb

## 1. Load QG
First, we must load the `QG` model and tokenizer.

In [None]:
%load_ext autoreload

%autoreload 2

from models.qg import QG

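# Both arguments name the "t5-small" checkpoint.
qg = QG("t5-small", "t5-small")
# The underscore-prefixed attributes are private; they are read here
# only so that later cells can reuse them.
model = qg._model
tokenizer = qg._tokenizer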

## 2. Download and Preprocess SQuAD Dataset

Next, we download and preprocess the modified `SQuAD` dataset, adding a separator token (`<sep>`) and an end-of-sequence token (`</s>`) to each entry.
The preprocessed data is split into two datasets, `training` and `validation`, which are saved in the `data` directory in `PyTorch` format as:
* [training_data.pt](./data/training_data.pt)
* [validation_data.pt](./data/validation_data.pt)
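
To make the token layout concrete, here is a minimal sketch of how one entry might be formatted before tokenization. The function `format_entry` and the field names `source` and `target` are hypothetical and are not taken from `SquadPreprocessor`; the only things assumed from the text are the `<sep>` separator and the end-of-sequence token (`</s>` in T5's vocabulary).

```python
SEP_TOKEN = "<sep>"  # separator placed between questions
EOS_TOKEN = "</s>"   # T5 end-of-sequence token


def format_entry(context, questions):
    """Build one source/target text pair (hypothetical field names)."""
    # Join all questions for this context with the separator token.
    target = f" {SEP_TOKEN} ".join(q.strip() for q in questions)
    return {
        "source": f"{context.strip()} {EOS_TOKEN}",
        "target": f"{target} {EOS_TOKEN}",
    }


entry = format_entry(
    "Paris is the capital of France.",
    ["What is the capital of France?", "Which country is Paris in?"],
)
print(entry["source"])
print(entry["target"])
```

The real preprocessor may lay out its entries differently; the sketch only illustrates where the two special tokens could sit.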

In [None]:
%load_ext autoreload
%autoreload 2

from preprocessing.squad_preprocessor import SquadPreprocessor

preprocessor = SquadPreprocessor(tokenizer)
train, validation = preprocessor.preprocess("the-coorporation/the_squad_v2", "data")

## 3. Run QG
With the `QG` instance loaded in step 1, we can simply pass a context to the model to generate questions.
The result is printed as JSON, with the generated questions under the `questions` key.

In [None]:
%load_ext autoreload
%autoreload 2

import json

context = "Historical Fiction is one of those sub-genres of literature that takes many forms. Its most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

questions = qg(context)
print(json.dumps(questions, indent=4))

## 4. Train QG
To train `QG`, we first parse `settings.json` to get the model, data, and training arguments.

We then call `train`, which trains the model and pushes it to `The Coorporation`'s Hugging Face Hub.
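
The `settings_parser` module itself is not shown in this notebook. As a rough illustration of the idea, the sketch below splits a JSON document into three argument groups using plain dataclasses. Everything in it is an assumption: the function name `parse_settings_sketch`, the dataclasses, the JSON keys (`model`, `data`, `training`), and the default values are hypothetical and may not match the real `settings.json`.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelArgs:  # hypothetical argument group
    model_name: str = "t5-small"


@dataclass
class DataArgs:  # hypothetical argument group
    data_dir: str = "data"


@dataclass
class TrainArgs:  # hypothetical argument group
    output_dir: str = "output"
    num_train_epochs: int = 1


def parse_settings_sketch(text):
    """Split a JSON settings string into three argument objects."""
    raw = json.loads(text)
    # Missing sections fall back to the dataclass defaults.
    return (
        ModelArgs(**raw.get("model", {})),
        DataArgs(**raw.get("data", {})),
        TrainArgs(**raw.get("training", {})),
    )


example_settings = '{"model": {"model_name": "t5-small"}, "training": {"num_train_epochs": 3}}'
sketch_model, sketch_data, sketch_train = parse_settings_sketch(example_settings)
print(sketch_model, sketch_data, sketch_train)
```

Keeping the three groups separate mirrors the tuple that `parse_settings()` returns in the cell below, so each group can be passed on independently.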

In [None]:
%load_ext autoreload
%autoreload 2

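import os

from settings_parser import parse_settings

# Weights & Biases project that the training run is logged to
os.environ['WANDB_PROJECT'] = 't5-qg'

# Parse settings.json into model, data, and training arguments
model_args, data_args, train_args = parse_settings()
qg.train(train_args, data_args)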

<br>

---

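## 5. Answer Generator

To turn the generated questions into flashcards, we answer each one with a pretrained extractive question-answering model, using the same context, and keep only the answers whose confidence score is above 0.5.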

In [None]:
%load_ext autoreload

%autoreload 2

from transformers import pipeline

# Extractive QA model that locates each answer inside the context
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question_answers = []

for question in questions["questions"]:
    print(question)
    result = question_answerer(question=question, context=context)
    # Keep only answers the model is reasonably confident about
    if result["score"] > 0.5:
        question_answers.append({"question": question, "answer": result["answer"]})

print(question_answers)
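
The filtering step can be tried without downloading a model. The sketch below wraps the same loop in a function and swaps the pipeline for a stub. The names `keep_confident` and `fake_answerer` are hypothetical, and the stub's scores are invented purely for illustration; the only thing assumed about the real pipeline is that each result is a dictionary with `score` and `answer` keys.

```python
def keep_confident(questions, context, answer_fn, threshold=0.5):
    """Return question/answer pairs whose score exceeds the threshold."""
    pairs = []
    for question in questions:
        result = answer_fn(question=question, context=context)
        if result["score"] > threshold:
            pairs.append({"question": question, "answer": result["answer"]})
    return pairs


def fake_answerer(question, context):
    """Stub standing in for the QA pipeline; the scores are made up."""
    made_up_scores = {"When is it set?": 0.9, "Who wrote it?": 0.1}
    return {"score": made_up_scores[question], "answer": "in the past"}


pairs = keep_confident(
    ["When is it set?", "Who wrote it?"],
    "Historical Fiction is set in the past.",
    fake_answerer,
)
print(pairs)
```

Exposing the threshold as a parameter makes it easy to trade the number of flashcards against answer quality.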

## 6. Output to Anki

In [None]:
# Output format: Front, Back
# Front: Question
# Back: Answer
import pandas as pd

df = pd.DataFrame(question_answers, columns=["question", "answer"])
df.to_csv("anki-output.csv", index=False, header=False)
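
To see what such a two-column file looks like, the sketch below writes a couple of invented rows with the standard `csv` module instead of `pandas`. The sample rows are made up for illustration. Like `pandas` with its default settings, `csv.writer` quotes a field only when needed, for example when an answer contains a comma.

```python
import csv
import io

# Made-up rows in the same shape as `question_answers`
rows = [
    {"question": "When is Historical Fiction set?", "answer": "in the past"},
    {
        "question": "What must conform to the norms of the day?",
        "answer": "every element, without exception",
    },
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # Column order is Front (question), then Back (answer)
    writer.writerow([row["question"], row["answer"]])

csv_text = buffer.getvalue()
print(csv_text)
```

The second answer is wrapped in double quotes because it contains a comma, so it still counts as a single field when the file is read back.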