# QGAR - A Flashcard Generating NLP Model

This notebook shows how to load and use the `QGAR` model.

Please read the [README](./readme.md) before continuing!

**Table of Contents:**
1. [Load QGAR](#load-qgar)
2. [Download and Preprocess SQuAD Dataset](#download-and-preprocess-squad-dataset)
3. [Run QGAR](#run-qgar)
4. [Train QGAR](#4-train-qgar)

<br>

---

<br>

## 0. Used Libraries

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
%pip install transformers
%pip install datasets
%pip install wandb

## 1. Load QG
First, we must load the `QG` model and tokenizer.

In [2]:
from models.qg import QG

qg = QG("the-coorporation/t5-qgar", "t5-small")
model = qg._model
tokenizer = qg._tokenizer



## 2. Download and Preprocess SQuAD Dataset

Next, we download and preprocess the modified `SQuAD` dataset, adding separator (`<sep>`) and end-of-sequence (`</s>`) tokens to each entry.
The preprocessed data is split into two datasets, `training` and `validation`, which are saved in the `data` directory in `PyTorch` format under the names:
* [training_data.pt](./data/training_data.pt)
* [validation_data.pt](./data/validation_data.pt)
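
The token formatting described above can be sketched roughly as follows. This is not the `SquadPreprocessor` implementation: the helper names `format_input` and `format_target`, and the choice to place `<sep>` after every question, are assumptions made for illustration only.

```python
SEP_TOKEN = "<sep>"  # separator token named in the text above
EOS_TOKEN = "</s>"   # end-of-sequence token used by T5 tokenizers


def format_input(context):
    """Hypothetical helper: close the input context with the EOS token."""
    return f"{context} {EOS_TOKEN}"


def format_target(questions):
    """Hypothetical helper: join all questions for one context into one string.

    Each question is followed by <sep>, and the whole sequence ends with </s>.
    """
    return " ".join(f"{q} {SEP_TOKEN}" for q in questions) + f" {EOS_TOKEN}"


example_questions = ["What is SQuAD?", "Who created it?"]
print(format_input("SQuAD is a reading comprehension dataset."))
print(format_target(example_questions))
```

Grouping every question for a context into a single target string is one way to let a sequence-to-sequence model emit several questions per context; the real preprocessor may format entries differently.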

In [42]:
%load_ext autoreload
%autoreload 2

from preprocessing.preprocessor import SquadPreprocessor

preprocessor = SquadPreprocessor(model, tokenizer)
preprocessor.preprocess_dataset()

Downloading SQuAD dataset...
Found cached dataset squad_processor (/Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95)
100%|██████████| 2/2 [00:00<00:00, 299.58it/s]
Download complete.
Preprocessing SQuAD dataset...
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-92c17265b08f1edc.arrow
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-387cbdf968909eb4.arrow
Loading cached processed dataset at /Users/philiphyltoft/.cache/huggingface/datasets/squad_processor/plain_text/1.0.0/173b8305efd9aeaed82e2f74eb48fff367a5b6036cbf7fca6cd0deb4d4bb4f95/cache-b9172de254d7c7c7.arrow

## 3. Run QG
With `QG` loaded, generating questions takes a single call.
We pass a context string to `qg`, which returns the context together with the list of generated questions.

In [3]:
%load_ext autoreload
%autoreload 2

import json

context = "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types."

questions = qg(context)
print(json.dumps(questions, indent=4))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
{
    "context": "Historical Fiction is one of those sub-genres of literature that takes many forms. It's most important feature, though, is that it's set in the past, with every element of the story conforming to the norms of the day. Here's how we define Historical Fiction, a look at its origins, and some popular types.",
    "questions": [
        "What is one of the sub-genres of literature that takes many forms?",
        "What is the most important feature of Historical Fiction?",
        "Where is Historical Fiction set in the past?"
    ]
}


## 4. Train QG
To train `QG`, we first parse `settings.json` to get the training arguments.

We then call `train`, which trains the model and pushes it to `The Coorporation`'s Hugging Face Hub.
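
As a rough illustration of the parsing step, the sketch below reads a JSON settings file and returns three argument groups. It is not the `parse_settings` function from `main`: the name `parse_settings_sketch`, the file layout, and every key and value in the example are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def parse_settings_sketch(path):
    """Read a settings file and return (model_args, data_args, train_args).

    Assumed layout: one top-level JSON object per argument group.
    """
    with open(path, encoding="utf-8") as f:
        settings = json.load(f)
    return settings["model_args"], settings["data_args"], settings["train_args"]


# Example settings; all keys and values below are made up for illustration.
example = {
    "model_args": {"pretrained_model": "t5-small"},
    "data_args": {
        "training_file": "data/training_data.pt",
        "validation_file": "data/validation_data.pt",
    },
    "train_args": {"num_train_epochs": 3, "learning_rate": 1e-4},
}

# Write the example to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    settings_path = Path(tmp) / "settings.json"
    settings_path.write_text(json.dumps(example), encoding="utf-8")
    model_args, data_args, train_args = parse_settings_sketch(settings_path)

print(train_args)
```

Keeping the three groups separate mirrors the `model_args, data_args, train_args` tuple used in the cell below, so each group can be handed to the component that needs it.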

In [3]:
# %load_ext autoreload
# %autoreload 2

from main import parse_settings, get_wandb_token

# %env WANDB_PROJECT=t5-qg

model_args, data_args, train_args = parse_settings()
qg.train(train_args, data_args, get_wandb_token())

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
env: WANDB_PROJECT=t5-qg


Cloning https://huggingface.co/the-coorporation/t5-qg into local empty directory.


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

<br>

---

## 5. Answer Generator

In [49]:
%load_ext autoreload

%autoreload 2

from transformers import pipeline

# Extractive question-answering model used to answer the generated questions.
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question_answers = []

print(questions["questions"])

for question in questions["questions"]:
    print(question)
    # Answer each question against the context the questions were generated from.
    result = question_answerer(question=question, context=context)
    # Keep only answers the model is reasonably confident about.
    if result["score"] > 0.5:
        question_answers.append({"question": question, "answer": result["answer"]})

print(question_answers)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
['What is one of the sub-genres of literature that takes many forms?', 'What is the most important feature of Historical Fiction?', 'Where is Historical Fiction set in the past?']
What is one of the sub-genres of literature that takes many forms?
What is the most important feature of Historical Fiction?
Where is Historical Fiction set in the past?
[{'question': 'What is one of the sub-genres of literature that takes many forms?', 'answer': 'Historical Fiction'}, {'question': 'What is the most important feature of Historical Fiction?', 'answer': "it's set in the past"}]
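
The score filter used in this cell can also be written as a small reusable function, sketched below. The names `filter_confident_answers` and `fake_answerer` are hypothetical, and the stand-in answerer returns made-up scores so that the example runs without downloading a model.

```python
def filter_confident_answers(questions, context, answerer, min_score=0.5):
    """Return question/answer pairs whose score exceeds min_score.

    `answerer` is any callable that accepts question= and context= and
    returns a dict with "answer" and "score" keys.
    """
    pairs = []
    for question in questions:
        result = answerer(question=question, context=context)
        if result["score"] > min_score:
            pairs.append({"question": question, "answer": result["answer"]})
    return pairs


def fake_answerer(question, context):
    """Stand-in for a real QA pipeline; the scores are invented."""
    score = 0.9 if question.startswith("What") else 0.2
    return {"answer": context.split(".")[0], "score": score}


pairs = filter_confident_answers(
    ["What is set in the past?", "Where is it set?"],
    "Historical Fiction. It is set in the past.",
    fake_answerer,
)
print(pairs)
```

Passing the answerer in as an argument makes the threshold easy to tune and lets the filter be tested without loading a model.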


## 6. Output to Anki

In [61]:
# Output format: Front, Back
# Front: Question
# Back: Answer
import pandas as pd

df = pd.DataFrame(question_answers, columns=["question", "answer"])
df.to_csv("anki-output.csv", index=False, header=False)
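
For comparison, the same headerless two-column file can be produced with Python's standard `csv` module. This is a sketch rather than a replacement for the cell above: the function name `to_anki_csv` is hypothetical, the second example card is made up, and it assumes Anki's text import is configured to use a comma as the field separator.

```python
import csv
import io


def to_anki_csv(question_answers):
    """Serialize question/answer dicts as headerless two-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for pair in question_answers:
        # Column 1 is the card front (question), column 2 the back (answer).
        writer.writerow([pair["question"], pair["answer"]])
    return buffer.getvalue()


cards = [
    {"question": "What is the most important feature of Historical Fiction?",
     "answer": "it's set in the past"},
    # Made-up card showing that fields containing commas are quoted.
    {"question": "Name two forms, separated by a comma.", "answer": "novels, plays"},
]
print(to_anki_csv(cards))
```

Fields that contain a comma are wrapped in double quotes by the writer, so a comma inside an answer does not split the card into extra columns.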

## 7. Evaluation

In [6]:
%pip install evaluate -q
%pip install scikit-learn -q

In [5]:
from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")
data = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")

Downloading builder script: 100%|██████████| 4.31k/4.31k [00:00<00:00, 1.80MB/s]
Downloading metadata: 100%|██████████| 2.17k/2.17k [00:00<00:00, 878kB/s]
Downloading readme: 100%|██████████| 7.59k/7.59k [00:00<00:00, 2.29MB/s]


Downloading and preparing dataset imdb/plain_text to /Users/philiphyltoft/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data: 100%|██████████| 84.1M/84.1M [02:44<00:00, 513kB/s]   
                                                                                              

Dataset imdb downloaded and prepared to /Users/philiphyltoft/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


Using the latest cached version of the module from /Users/philiphyltoft/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Thu Feb 23 13:33:23 2023) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.


ModuleNotFoundError: No module named 'sklearn'