# QGAR - A Flashcard-Generating NLP Model

This notebook shows how to load and use the `QGAR` model.

Please read the [README](./readme.md) before continuing!

**Table of Contents:**
1. [Load QGAR](#1-load-qgar)
2. [Download and Preprocess SQuAD Dataset](#2-download-and-preprocess-squad-dataset)
3. [Run QGAR](#3-run-qgar)
4. [Train QGAR](#4-train-qgar)
5. [Answer Generator](#5-answer-generator)

<br>

---

<br>

## 1. Load QGAR
First, we must load the `QGAR` model and tokenizer.

In [5]:
import torch
from huggingface_hub import login
from main import load_model_and_tokenizer, get_hg_token

login(get_hg_token(), add_to_git_credential=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer = load_model_and_tokenizer("the-coorporation/t5-qgar", device)

Token is valid.
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/philiphyltoft/.cache/huggingface/token
Login successful


## 2. Download and Preprocess SQuAD Dataset

First, we download and preprocess the modified `SQuAD` dataset, adding separator (`<sep>`) and end-of-sequence (`</s>`) tokens to each entry.
The preprocessed data is split into two datasets, `training` and `validation`, which are saved in the `data` directory in `PyTorch` format as:
* [training_data.pt](./data/training_data.pt)
* [validation_data.pt](./data/validation_data.pt)
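
The formatting step described above can be sketched in plain Python. This is only an illustration, not the `Preprocessor` implementation: it assumes that all questions sharing a context are joined with `<sep>` into one target string, that `</s>` is appended to both source and target, and that the split is a simple cut. The helpers `build_examples` and `split_examples`, the field names `source` and `target`, and the split fraction are hypothetical.

```python
from collections import defaultdict

SEP_TOKEN = "<sep>"  # separator between questions
EOS_TOKEN = "</s>"   # T5 end-of-sequence token


def build_examples(rows):
    """Group questions by context and join them into one target string (assumed format)."""
    questions_by_context = defaultdict(list)
    for row in rows:
        questions_by_context[row["context"]].append(row["question"].strip())

    examples = []
    for context, questions in questions_by_context.items():
        examples.append({
            "source": f"{context} {EOS_TOKEN}",
            "target": f" {SEP_TOKEN} ".join(questions) + f" {EOS_TOKEN}",
        })
    return examples


def split_examples(examples, validation_fraction=0.1):
    """Cut the list into a training part and a validation part."""
    cut = int(len(examples) * (1 - validation_fraction))
    return examples[:cut], examples[cut:]


# Tiny hand-written rows standing in for SQuAD entries.
rows = [
    {"context": "Paris is the capital of France.",
     "question": "What is the capital of France?"},
    {"context": "Paris is the capital of France.",
     "question": "Which country is Paris the capital of?"},
    {"context": "Water boils at 100 degrees Celsius.",
     "question": "At what temperature does water boil?"},
]

examples = build_examples(rows)
training, validation = split_examples(examples, validation_fraction=0.5)
print(examples[0]["target"])
```

In the real pipeline the examples would additionally be tokenized and saved with `torch.save`; that part is omitted here to keep the sketch dependency-free.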

In [61]:
from preprocessing.preprocessor import Preprocessor

preprocessor = Preprocessor(model, tokenizer)
preprocessor.preprocess_dataset()

Downloading SQuAD dataset...
Found cached dataset squad_processor (/Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95)
100%|██████████| 2/2 [00:00<00:00, 323.41it/s]
Download complete.
Preprocessing SQuAD dataset...
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-bd6bdb3f24c6a457.arrow
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-b8d3c00b5355c76b.arrow
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-170928d4e07c8d49.arrow
Loading cached processed dataset at /Users/philiphyl

## 3. Run QGAR
Next, we import `run_model` from `main`, which sets up the `QGAR` pipeline for us.
We can then pass a context to the model to generate questions.

In [17]:
%load_ext autoreload

%autoreload 2

from main import run_model

input_text = "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

questions = run_model("the-coorporation/t5-qgar", device, input_text)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Input: 'Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types.'
Output new: 
[' What is one of the sub-genres of literature that takes many forms?', 'What is the most important feature of Historical Fiction?', 'Where is Historical Fiction set in the past?', '</s']
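
The raw output above contains a leading space in the first question and a stray `'</s'` fragment. A minimal post-processing sketch, assuming a complete question is any stripped entry that ends with `?` and is longer than one character; `clean_questions` is a hypothetical helper, not part of `main`:

```python
def clean_questions(raw_questions):
    """Strip whitespace and keep only entries that look like complete questions."""
    cleaned = []
    for text in raw_questions:
        text = text.strip()
        # Drop empty strings, a bare "?" and token fragments such as "</s".
        if len(text) > 1 and text.endswith("?"):
            cleaned.append(text)
    return cleaned


# Raw output copied from the cell above.
raw_output = [
    " What is one of the sub-genres of literature that takes many forms?",
    "What is the most important feature of Historical Fiction?",
    "Where is Historical Fiction set in the past?",
    "</s",
]

cleaned_questions = clean_questions(raw_output)
print(cleaned_questions)
```

Filtering on the trailing `?` is a deliberately simple heuristic; it would also discard valid prompts phrased as statements, so a stricter or looser rule may suit other inputs better.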


## 4. Train QGAR
To train `QGAR`, we first parse `settings.json` to get the training arguments.

We then create an instance of `QGARTrainer` and call `train`, which trains the model and pushes it to `The Coorporation`'s Hugging Face Hub.

In [64]:
from main import parse_settings
from training.trainer import QGARTrainer

train_file = "./data/training_data.pt"
validation_file = "./data/validation_data.pt"

model_args, data_args, train_args = parse_settings()

trainer = QGARTrainer(model, train_file, validation_file, train_args)
trainer.train()

TypeError: ModelArguments.__init__() missing 1 required positional argument: 'model_type'
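
The `TypeError` above is the message Python raises when a class is constructed without one of its required arguments, here `model_type`. A minimal sketch of how that can happen and how to surface it earlier, assuming `ModelArguments` is a dataclass filled from the keys of `settings.json` (an assumption; the field `model_name_or_path`, the value `"t5"`, and the helper `build_model_arguments` are hypothetical):

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ModelArguments:
    # Hypothetical fields; only `model_type` is known from the error message.
    model_name_or_path: str
    model_type: str


def build_model_arguments(settings):
    """Validate the settings dict before constructing the dataclass."""
    known = {field.name for field in fields(ModelArguments)}
    missing = sorted(known - set(settings))
    if missing:
        raise ValueError(f"settings.json is missing required keys: {missing}")
    # Ignore keys that belong to other argument groups.
    return ModelArguments(**{key: value for key, value in settings.items() if key in known})


# A settings fragment without `model_type` reproduces the same kind of error.
incomplete = json.loads('{"model_name_or_path": "the-coorporation/t5-qgar"}')
try:
    ModelArguments(**incomplete)
except TypeError as error:
    print(error)

# Adding the missing key lets construction succeed.
complete = dict(incomplete, model_type="t5")
model_args = build_model_arguments(complete)
print(model_args)
```

If the real `ModelArguments` follows this pattern, adding a `model_type` entry to `settings.json` (or giving the field a default) would likely resolve the error.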

<br>

---

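## 5. Answer Generator
Finally, we answer each generated question with an extractive question-answering pipeline, using the original input text as context.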

In [10]:
%load_ext autoreload

%autoreload 2

from transformers import pipeline

# Extractive QA model that answers each generated question from the original context.
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question_answers = []

print(questions)

for question in questions:
    # The pipeline returns a dict containing the answer span and a confidence score.
    result = question_answerer(question=question, context=input_text)
    question_answers.append({"question": question, "answer": result["answer"], "score": result["score"]})

print(question_answers)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
['What is one of the sub-genres of literature that takes many forms?', 'What is the most important feature of Historical Fiction?', 'Where is Historical Fiction set in the past?', '?']
[{'question': 'What is one of the sub-genres of literature that takes many forms?', 'answer': 'Historical Fiction', 'score': 0.9590075016021729}, {'question': 'What is the most important feature of Historical Fiction?', 'answer': "it's set in the past", 'score': 0.6126183271408081}, {'question': 'Where is Historical Fiction set in the past?', 'answer': 'every element of the story conforming to the norms of the day', 'score': 0.1702776700258255}, {'question': '?', 'answer': 'a look at its origins, and some popular types', 'score': 0.15543927252292633}]
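
In the output above, the weakest pairs, such as the bare `'?'` question, also have the lowest scores. One possible way to turn the results into flashcards is to keep only confident pairs. A sketch, assuming a hypothetical `select_flashcards` helper and an arbitrary threshold of `0.5`; the sample scores are rounded from the output above:

```python
def select_flashcards(question_answers, min_score=0.5):
    """Keep question/answer pairs whose QA score reaches the threshold."""
    return [
        {"front": qa["question"], "back": qa["answer"]}
        for qa in question_answers
        # Skip low-confidence answers and degenerate questions such as "?".
        if qa["score"] >= min_score and len(qa["question"].strip()) > 1
    ]


# Pairs copied from the output above, with scores rounded to four decimals.
sample_question_answers = [
    {"question": "What is one of the sub-genres of literature that takes many forms?",
     "answer": "Historical Fiction", "score": 0.9590},
    {"question": "What is the most important feature of Historical Fiction?",
     "answer": "it's set in the past", "score": 0.6126},
    {"question": "Where is Historical Fiction set in the past?",
     "answer": "every element of the story conforming to the norms of the day", "score": 0.1703},
    {"question": "?",
     "answer": "a look at its origins, and some popular types", "score": 0.1554},
]

flashcards = select_flashcards(sample_question_answers)
for card in flashcards:
    print(card["front"], "->", card["back"])
```

The threshold trades coverage for quality: a lower value keeps more cards but lets through pairs like the third one, whose answer does not match its question.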
