# QGAR - A Flashcard Generating NLP Model

This notebook shows how to load and use the `QGAR` model.

Please read the [README](./README.md) before continuing!

**Table of Contents:**

0. [Dependencies](#0-dependencies)
1. [Load QG](#1-load-qg)
2. [Download and Preprocess SQuAD Dataset](#2-download-and-preprocess-squad-dataset)
3. [Run QG](#3-run-qg)
4. [Train QG](#4-train-qg)
5. [Answer Generator](#5-answer-generator)
6. [Output to Anki](#6-output-to-anki)

<br>

---

<br>

## 0. Dependencies

The project uses the following dependencies.

Make sure to install them in your virtual environment.

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q
%pip install transformers -q
%pip install datasets -q
%pip install wandb -q

## 1. Load QG

To load our `Question Generator` model, we use the `QG` class from the `models.qg` module.

From the `QG` instance, we can access the loaded `model` and `tokenizer` through its `_model` and `_tokenizer` fields.

In [None]:
%load_ext autoreload
%autoreload 2

from models.qg import QG

qg = QG("t5-small", "t5-small")
model = qg._model
tokenizer = qg._tokenizer

## 2. Download and Preprocess SQuAD Dataset

If we want to train the `QG` model, we must download and preprocess our modified `SQuAD 2.0` dataset from `Hugging Face`.

This is done with the `SquadPreprocessor` class from the `preprocessing.squad_preprocessor` module.

The `SquadPreprocessor` downloads the modified `SQuAD` dataset, adds separator (`<sep>`) and end-of-sequence (`</s>`) tokens, and encodes both the `contexts` and the `questions` of each entry in the dataset.

The preprocessed dataset is then split into two datasets, `training` and `validation`. The sets can be saved in the specified `save_dir` directory in `PyTorch` format under the names:
* `training_data.pt`
* `validation_data.pt`
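
As a rough illustration of the token layout described above, the sketch below joins the questions of one entry with `<sep>` and terminates the sequence with `</s>`. The helper names `build_source` and `build_target` and the exact spacing are assumptions made for this example; the actual `SquadPreprocessor` may format entries differently.

```python
SEP_TOKEN = "<sep>"  # separator between questions
EOS_TOKEN = "</s>"   # T5 end-of-sequence token


def build_source(context):
    """Hypothetical helper: terminate the context with the EOS token."""
    return f"{context.strip()} {EOS_TOKEN}"


def build_target(questions):
    """Hypothetical helper: join questions with <sep> and terminate with EOS."""
    joined = f" {SEP_TOKEN} ".join(q.strip() for q in questions)
    return f"{joined} {EOS_TOKEN}"


example_context = "Historical Fiction is set in the past."
example_questions = ["What is historical fiction?", "When is it set?"]

source = build_source(example_context)
target = build_target(example_questions)
print(source)
print(target)
```

The resulting strings would then be passed through the tokenizer to obtain the encoded inputs and labels.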

In [None]:
%load_ext autoreload
%autoreload 2

from preprocessing.squad_preprocessor import SquadPreprocessor

preprocessor = SquadPreprocessor(tokenizer)
train, validation = preprocessor.preprocess("the-coorporation/the_squad_v2", "data")

## 3. Run QG

To run the `QG` model on some example `context`, we call the instance directly, which invokes the `__call__` method of the class.

This will return a dictionary containing the `context` and a list of `questions` generated for the provided `context`.

In [None]:
%load_ext autoreload
%autoreload 2

import json

qg = QG("t5-small", "t5-small")
context = "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

questions = qg(context)
print(json.dumps(questions, indent=4))
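
Before passing the result on, it can help to tidy the list of questions. The sketch below works on a hand-written dictionary with the same `context` and `questions` keys as described above; the helper `clean_questions` is hypothetical, and whether the real output ever contains empty or duplicate entries is an assumption.

```python
# Hand-written stand-in for the dictionary returned by the model.
example_output = {
    "context": "Historical Fiction is set in the past.",
    "questions": [
        "What is Historical Fiction?",
        "   ",
        "What is Historical Fiction?",
        "When is it set? ",
    ],
}


def clean_questions(output):
    """Strip whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for question in output["questions"]:
        question = question.strip()
        if question and question not in seen:
            seen.add(question)
            cleaned.append(question)
    return cleaned


cleaned = clean_questions(example_output)
print(cleaned)
```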

## 4. Train QG

To train `QG`, we first parse the `settings.json` file to get the `TrainingArguments` and `DataTrainingArguments`.

We can then call `train` on the `QG` instance, which will train the model and push it to `The Coorporation`'s Hugging Face Hub.

In [None]:
%load_ext autoreload
%autoreload 2

from main import get_local_file
from parsing.settings_parser import parse_settings

qg = QG("t5-small", "t5-small")

_, data_args, train_args = parse_settings()
qg.train(train_args, data_args, get_local_file("wandb_token.txt"))
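
To give a feel for what splitting one settings file into several argument groups can look like, here is a minimal standard-library sketch. The dataclasses `DataArgs` and `TrainArgs`, their fields, and the JSON keys are all hypothetical stand-ins; the project's real `parse_settings` and the layout of `settings.json` may differ.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class DataArgs:
    """Hypothetical stand-in for DataTrainingArguments."""
    training_file_path: str = "data/training_data.pt"
    validation_file_path: str = "data/validation_data.pt"


@dataclass
class TrainArgs:
    """Hypothetical stand-in for TrainingArguments."""
    output_dir: str = "output"
    num_train_epochs: int = 1
    learning_rate: float = 1e-4


def pick(cls, raw):
    """Build `cls` from the keys in `raw` that match its field names."""
    names = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in raw.items() if key in names})


# Inline JSON instead of a file so the sketch is self-contained.
settings_text = """
{
    "output_dir": "qg-model",
    "num_train_epochs": 3,
    "training_file_path": "data/training_data.pt"
}
"""

raw = json.loads(settings_text)
example_data_args = pick(DataArgs, raw)
example_train_args = pick(TrainArgs, raw)
print(example_data_args)
print(example_train_args)
```

Keys missing from the JSON fall back to the dataclass defaults, so a settings file only needs to list the values that differ.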

<br>

---

## 5. Answer Generator

To generate answers, we run each generated question together with the `context` through a Hugging Face `question-answering` pipeline and keep only the answers with a score above `0.5`.

In [None]:
%load_ext autoreload
%autoreload 2

from transformers import pipeline

# Extractive question-answering model used to find each answer in the context.
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question_answers = []

print(questions["questions"])

for question in questions["questions"]:
    print(question)
    result = question_answerer(question=question, context=context)
    # Keep only answers the model is reasonably confident about.
    if result["score"] > 0.5:
        question_answers.append({"question": question, "answer": result["answer"]})
        # Optionally store the score as well:
        # question_answers.append({"question": question, "answer": result["answer"], "score": result["score"]})

print(question_answers)
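
The filtering step can also be tried without downloading a model. The sketch below wraps the same threshold logic in a function and feeds it a stub answerer; `filter_answers`, `fake_answerer`, and the canned scores are made up for illustration, and only the `score` and `answer` keys mirror what the pipeline returns.

```python
def filter_answers(questions, context, answerer, min_score=0.5):
    """Keep question/answer pairs whose score is strictly above min_score."""
    pairs = []
    for question in questions:
        result = answerer(question=question, context=context)
        if result["score"] > min_score:
            pairs.append({"question": question, "answer": result["answer"]})
    return pairs


def fake_answerer(question, context):
    """Stub that returns canned results in place of the real pipeline."""
    canned = {
        "When is it set?": {"score": 0.9, "answer": "in the past"},
        "Who wrote it?": {"score": 0.1, "answer": "Historical Fiction"},
    }
    return canned[question]


kept = filter_answers(
    ["When is it set?", "Who wrote it?"],
    "Historical Fiction is set in the past.",
    fake_answerer,
)
print(kept)
```

Raising `min_score` trades fewer flashcards for more reliable answers; the value `0.5` used in this notebook is one possible choice.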

## 6. Output to Anki

In [None]:
# Output format: Front, Back
# Front: Question
# Back: Answer
import pandas as pd

df = pd.DataFrame(question_answers, columns=["question", "answer"])
df.to_csv("anki-output.csv", index=False, header=False)
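
To preview what the exported rows look like, the sketch below writes the same two-column layout with the standard `csv` module into an in-memory buffer. The sample row is made up, and how the columns map to the Front and Back fields depends on the note type chosen when importing into Anki.

```python
import csv
import io

# Made-up sample row in the same shape as `question_answers`.
sample_rows = [
    {"question": "When is it set, roughly?", "answer": "in the past"},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in sample_rows:
    # Column 1 is the question (Front), column 2 is the answer (Back).
    writer.writerow([row["question"], row["answer"]])

csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain a comma are wrapped in double quotes, so a question with commas still lands in a single column.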