## Transfer Learning and Model Research
### Deciding on an Approach
Various approaches can be used for transfer learning. One of the first ideas to explore is how transfer learning itself is done. Papers on this topic, such as [An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models](https://doi.org/10.48550/arXiv.1902.10547), explain the core of transfer learning and also the parts that you can play around with. Below I explore how transfer learning is actually accomplished.
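
To make the idea concrete, here is a minimal sketch of one common form of transfer learning: freezing a pretrained body and training only a new task head. It is not the method from the paper linked above, and `pretrained_body`, `task_head`, and the random data are hypothetical stand-ins; a real project would load an actual pretrained language model instead.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained network body
pretrained_body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
# New task-specific head, trained from scratch
task_head = nn.Linear(16, 2)

# Freeze the pretrained weights so only the head is updated
for param in pretrained_body.parameters():
    param.requires_grad = False

model = nn.Sequential(pretrained_body, task_head)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# Random toy batch: 4 examples, 8 features, 2 classes
inputs = torch.randn(4, 8)
labels = torch.tensor([0, 1, 0, 1])

body_before = pretrained_body[0].weight.clone()
head_before = task_head.weight.clone()

# One training step
loss = nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()

body_unchanged = torch.equal(body_before, pretrained_body[0].weight)
head_changed = not torch.equal(head_before, task_head.weight)
print("body unchanged:", body_unchanged, "| head changed:", head_changed)
```

Which layers to freeze, and for how long, is one of the parts that can be played around with.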

#### Similar Projects
One way to get started is to check out others' projects. Here are several that seem quite similar to my goal:
* [Simplifying Paragraph-level Question Generation via Transformer Language Models](https://paperswithcode.com/paper/transformer-based-end-to-end-question) uses GPT-2 Small as a base, then trains on top of it to produce natural questions. This one is particularly good because it generates **extractive** questions and answers, which is exactly what PBE competitions are all about.
* [Learning to Ask: Neural Question Generation for Reading Comprehension](https://paperswithcode.com/paper/learning-to-ask-neural-question-generation) is trained to ask questions. It avoids the rule-based approach and performs better on generating natural and complex questions than other approaches. This seems very promising.
* [Neural Question Generation from Text: A Preliminary Study](https://paperswithcode.com/paper/neural-question-generation-from-text-a) uses [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) to generate "fluent and diverse questions."
* [Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus](https://paperswithcode.com/paper/asking-questions-the-human-way-scalable) presents a seemingly successful question and answer generation project. However, running it looks difficult and there may be "clues" required to generate good questions.
* [A BERT Baseline for Natural Questions](https://paperswithcode.com/dataset/natural-questions) uses a natural question dataset. Unfortunately, it seems more interested in answering questions than in asking them.
* [ChatDoctor](http://www.yunxiangli.top/ChatDoctor/). This AI was trained on 220K conversations between doctors and patients to learn to converse and follow instructions. It was built on the back of the Large Language Model Meta AI (LLaMA).
* [Generating Natural Questions About an Image](https://paperswithcode.com/paper/generating-natural-questions-about-an-image) is a very cool paper with a cool dataset. However, it is not what I need for text AQG.
* [Unified Language Model Pre-training for Natural Language Understanding and Generation](https://paperswithcode.com/paper/unified-language-model-pre-training-for) claims to do question generation with the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset. It doesn't seem quite right, though.

#### Promising datasets
* [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) has questions, answers, and usually, a context paragraph.
* The Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)) contains questions, answers, and usually, a context paragraph. This data could be used in an "inverted" way to generate questions.
* [harvestingQA](https://github.com/xinyadu/harvestingQA/tree/master/dataset) is a question-answer dataset in SQuAD format.
* [SciQ](https://paperswithcode.com/dataset/sciq) also contains questions, answers, and a supporting paragraph.
* My own dataset of PBE questions.
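
The "inverted" use of SQuAD mentioned above can be sketched roughly as follows: instead of mapping a context and question to an answer, each record becomes a training pair that maps an answer and context to the question. The nested keys follow the published SQuAD JSON layout; the sample text, the `invert_squad` helper, and the `answer: ... context: ...` input format are assumptions made for illustration.

```python
# One record in SQuAD JSON layout; the text itself is made up for illustration
squad_sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Joshua sent two men from Acacia Grove to spy secretly.",
            "qas": [{
                "id": "example-1",
                "question": "How many men did Joshua send to spy?",
                "answers": [{"text": "two men", "answer_start": 12}],
            }],
        }],
    }]
}

def invert_squad(squad):
    """Yield (source, target) pairs: answer plus context in, question out."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # SQuAD 2.0 unanswerable questions have no answers; skip them
                if not qa["answers"]:
                    continue
                answer = qa["answers"][0]["text"]
                # Assumed input format; other prompt layouts would work too
                source = f"answer: {answer} context: {context}"
                yield source, qa["question"]

pairs = list(invert_squad(squad_sample))
for source, target in pairs:
    print(source, "->", target)
```

The same function should apply to harvestingQA, since it is distributed in SQuAD format.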

#### Promising models
* [potsawee/t5-large-generation-squad-QuestionAnswer](https://huggingface.co/potsawee/t5-large-generation-squad-QuestionAnswer) generates questions but is not ideal.
* [patil-suraj/question_generation](https://github.com/patil-suraj/question_generation) seems like a state-of-the-art model, but it will probably require some coercing to get the questions and answers out of it.
* [abhitopia/question-answer-generation](https://huggingface.co/abhitopia/question-answer-generation) looks like a good model, but requires question generation.

#### Other things I have checked out
* [QuestGen](https://app.questgen.ai/) markets itself as a quiz question generator, but it **does not** generate good questions for PBE. The questions tend to be way too abstract for PBE competitions.

## Setup

### Packages and Imports

First I install PyTorch. It's weird that I have to do this to satisfy the Jupyter notebook, as the package is already installed in my local Python environment, but this is what it takes.

In [None]:
!pip install torch

This is a list of generally useful libraries for my project. I prefer to import them all at once so I can see them all in one place and avoid circular dependencies.

In [None]:
import json
import csv
import pandas as pd
# load transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, T5Config, T5ForConditionalGeneration, T5Tokenizer

### Importing data
Next I will load my input data from `nkjv.json` and store it as a CSV file for easier access in the future.

In [None]:
# open the JSON file and load the data
with open('nkjv.json') as f:
  data = json.load(f)

# open the CSV file for writing
with open('nkjv.csv', 'w', newline='') as f:
  writer = csv.writer(f)

  # Write the header row
  writer.writerow(['Book', 'ChapterNumber', 'VerseNumber', 'Verse'])

  # Loop through the data and write each row to the CSV file
  for book in data:
    book_name = book['name']
    for chapter_num, chapter in enumerate(book['chapters'], 1):
      for verse in chapter['verses']:
        verse_num = verse['num']
        verse_text = verse['text']
        writer.writerow([book_name, chapter_num, verse_num, verse_text])
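
The conversion above relies on a particular JSON layout. The sketch below spells out that layout on a tiny in-memory sample and runs the same flattening logic against it. The structure is inferred from the keys the loop reads; the sample text and the `flatten` helper are hypothetical.

```python
import csv
import io

# Tiny stand-in for nkjv.json; the verse text here is placeholder data
sample = [
    {
        "name": "Joshua",
        "chapters": [
            {"verses": [{"num": 1, "text": "First verse text."},
                        {"num": 2, "text": "Second verse, with a comma."}]},
            {"verses": [{"num": 1, "text": "Next chapter text."}]},
        ],
    }
]

def flatten(books):
    """Yield one (book, chapter, verse, text) row per verse."""
    for book in books:
        # Chapters carry no explicit number, so number them by position
        for chapter_num, chapter in enumerate(book["chapters"], 1):
            for verse in chapter["verses"]:
                yield book["name"], chapter_num, verse["num"], verse["text"]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Book", "ChapterNumber", "VerseNumber", "Verse"])
writer.writerows(flatten(sample))

# Read the CSV text back to confirm it round-trips
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
for row in rows:
    print(row)
```

Note that `csv.writer` quotes verse text containing commas, so the verses survive the round trip intact.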


Now I load the CSV and then get started with some question generation.

In [None]:
nkjv = pd.read_csv('nkjv.csv')

joshua = nkjv[nkjv['Book'] == 'Joshua']
joshua1 = joshua[joshua['ChapterNumber'] == 1]
joshua2 = joshua[joshua['ChapterNumber'] == 2]
joshua3 = joshua[joshua['ChapterNumber'] == 3]
joshua4 = joshua[joshua['ChapterNumber'] == 4]

nkjv.head()
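
The repeated chapter filters could also be written as one small helper. This is only a sketch: `get_chapter` is a hypothetical name, and the demo frame uses placeholder text in place of the real `nkjv.csv` contents.

```python
import pandas as pd

def get_chapter(df, book, chapter):
    """Return the rows for one chapter, using the column names from nkjv.csv."""
    mask = (df["Book"] == book) & (df["ChapterNumber"] == chapter)
    return df[mask]

# Placeholder rows standing in for the real data
demo = pd.DataFrame({
    "Book": ["Joshua", "Joshua", "Judges"],
    "ChapterNumber": [1, 2, 1],
    "VerseNumber": [1, 1, 1],
    "Verse": ["placeholder a", "placeholder b", "placeholder c"],
})

joshua2_demo = get_chapter(demo, "Joshua", 2)
print(joshua2_demo)
```

With the real data, `get_chapter(nkjv, 'Joshua', 2)` would select the same rows as `joshua2`.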

## Using Transformers and PyTorch with Existing Models
I start by trying out the approach from the paper "[Simplifying Paragraph-level Question Generation via Transformer Language Models](https://paperswithcode.com/paper/transformer-based-end-to-end-question)" with its Hugging Face packages, as well as the T5 Hugging Face packages for [Question Generation](https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap) and [Q&A Generation](https://huggingface.co/potsawee/t5-large-generation-squad-QuestionAnswer).



To begin, I define a few different functions that can generate questions.

In [None]:
# potsawee_T5 is a model taken from https://huggingface.co/potsawee/t5-large-generation-squad-QuestionAnswer
potsawee_tokenizer = AutoTokenizer.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")
potsawee_model = AutoModelForSeq2SeqLM.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")

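def potsawee_T5(text):
  # Tokenize the context and generate a "question <sep> answer" sequence
  inputs = potsawee_tokenizer(text, return_tensors="pt")
  outputs = potsawee_model.generate(**inputs, max_length=100)
  # Keep special tokens when decoding so the separator survives, then strip padding and end-of-sequence markers
  question_answer = potsawee_tokenizer.decode(outputs[0], skip_special_tokens=False)
  question_answer = question_answer.replace(potsawee_tokenizer.pad_token, "").replace(potsawee_tokenizer.eos_token, "")
  # Split on the first separator only, so callers can always unpack exactly two values
  question, _, answer = question_answer.partition(potsawee_tokenizer.sep_token)
  return question.strip(), answer.strip()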
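
The post-processing step in `potsawee_T5` can be tried out without loading the model. The sketch below assumes the special tokens are the strings `<pad>`, `</s>`, and `<sep>` (in the notebook they are read from `potsawee_tokenizer`), and the decoded string is a made-up example rather than real model output. It uses `str.partition`, which always yields exactly a question and an answer, even when the separator is missing.

```python
# Assumed special-token strings; the notebook reads these from the tokenizer
PAD_TOKEN = "<pad>"
EOS_TOKEN = "</s>"
SEP_TOKEN = "<sep>"

def split_question_answer(decoded):
    """Strip padding and end markers, then split into (question, answer)."""
    cleaned = decoded.replace(PAD_TOKEN, "").replace(EOS_TOKEN, "")
    # partition returns (before, separator, after) and never raises
    question, _, answer = cleaned.partition(SEP_TOKEN)
    return question.strip(), answer.strip()

# Hypothetical decoded output, for illustration only
raw = "<pad> Who sent the spies?<sep> Joshua</s>"
question, answer = split_question_answer(raw)
print("Q:", question)
print("A:", answer)
```

If the model ever omits the separator, the answer comes back as an empty string instead of breaking the `question, answer = ...` unpacking.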


In [None]:
# allenai_t5 model from https://huggingface.co/allenai/t5-small-squad2-question-generation
model_name = "allenai/t5-small-squad2-question-generation"
allenai_t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
allenai_t5_model = T5ForConditionalGeneration.from_pretrained(model_name)

def allenai_t5(input_string, **generator_args):
    input_ids = allenai_t5_tokenizer.encode(input_string, return_tensors="pt")
    res = allenai_t5_model.generate(input_ids, **generator_args)
    output = allenai_t5_tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)
    return output


allenai_t5("shrouds herself in white and walks penitentially disguised as brotherly love through factories and parliaments; offers help, but desires power;")
allenai_t5("He thanked all fellow bloggers and organizations that showed support.")
allenai_t5("Races are held between April and December at the Veliefendi Hippodrome near Bakerky, 15 km (9 miles) west of Istanbul.")


In [None]:
# Loop through every verse in Joshua 2
print("_____________potsawee_T5 model_____________")
for index, row in joshua2.iterrows():
  num = row['VerseNumber']
  context = row['Verse']
  print(f"Joshua 2:{num}: {context}")
  question, answer = potsawee_T5(context)

  print("Q:", question)
  print("A:", answer)

print("_____________allenai_T5 model_____________")
for index, row in joshua2.iterrows():
  num = row['VerseNumber']
  context = row['Verse']
  print(f"Joshua 2:{num}: {context}")
  questions = allenai_t5(context)
  for question in questions:
    print("Q:", question)


As the output above shows, the generated questions are not ideal. However, no other models seem better. The `potsawee_T5` model seems to create more relevant questions than `allenai_t5`, and `allenai_t5` does not produce answers, so I will try to work with `potsawee_T5`. See [`transferLearningModel.ipynb`](transferLearningModel.ipynb) for the creation of a new model and for results.