# Title: AIDI 1002 Final Term Project Report

#### Members' Names or Individual's Name:  Marc-Andre Leclair

####  Emails: 200578579@student.georgianc.on.ca

# Introduction:

#### Problem Description:

The research "A Feasibility Study of Answer-Agnostic Question Generation for Education" focuses on generating educational questions using summarized text. However, it mainly uses human-written and automatic summaries of a limited set of textbook chapters, which may not fully capture the diversity and complexity of educational content.

#### Context of the Problem:

A narrow source corpus can skew results toward what the researchers expected, and small datasets are a common pitfall when evaluating models. This matters because the application is meant to help students and teachers in general, so the results should represent language at large and not just a small subset. This is especially important in digital education platforms, where scalable, diverse, and contextually appropriate question generation can significantly aid learning.

#### Limitation About other Approaches:

Previous approaches, including the one in the study, are limited by the scope of their source material – a few textbook chapters and their summaries. This brings up concerns about the ability to generalize the findings across various subjects, educational levels, and types of educational material.

#### Solution:

Expanding the research to include a more diverse and comprehensive dataset, such as utilizing the BookSum dataset, can address the limitations of scope and diversity. By generating machine summaries and questions from a broader range of subjects and texts, we can evaluate and enhance the model’s effectiveness and applicability in a wider educational context.

# Background

The related work is summarized in the following table.

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| BookSum [1] | A comprehensive dataset of book summaries, providing a rich source for text summarization and question generation research. | Summaries for 142,753 paragraphs, 12,293 chapters, and 436 books, mostly human-written. | While extensive, it lacks diverse formats of summaries and detailed metadata for deeper analysis. |
| Suraj Patil [2] | Developed an open-source question generation model, utilizing transformer-based architectures for generating questions and answers from text. | Varied, as it's a model architecture, not a dataset. | Depends on the quality and diversity of input data for effective question generation. |
| Dugan et al. [3] | Explored answer-agnostic question generation using human-written and machine-generated text summaries. Aimed to enhance the quality of educational questions generated by models. | SQuAD dataset for QA | Only 80% accuracy. A larger dataset is needed for better understanding. |


# Methodology

Existing Paper Method:

The existing paper by Dugan et al. [3] focused on answer-agnostic question generation using summarized text from textbooks. They fed human-written and machine-generated summaries into a question generation model, aiming to improve the relevance and quality of the generated questions. The human-written summaries were produced by three separate assistants and can be found in [summary*.txt](https://github.com/liamdugan/summary-qg/blob/master/data/summaries/summary_A1.txt). Dugan et al. used a maximum input length of 512 in their experiment, whereas we used 1024 because our inputs were much longer.

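The effect of the input length limit can be illustrated with a toy example. The sketch below uses whitespace splitting as a stand-in for the real subword tokenizer, so the counts are only approximate; `truncate_to_budget` and `kept_fraction` are hypothetical helpers written for this illustration and are not part of either codebase.

```python
def truncate_to_budget(text: str, max_length: int) -> str:
    """Keep only the first `max_length` whitespace-separated tokens.

    Assumption: whitespace tokens stand in for the subword tokens a real
    tokenizer would produce, so the numbers here are approximate.
    """
    tokens = text.split()
    return " ".join(tokens[:max_length])


def kept_fraction(text: str, max_length: int) -> float:
    """Fraction of the (whitespace) tokens that survive truncation."""
    total = len(text.split())
    if total == 0:
        return 1.0
    return min(total, max_length) / total


# A made-up 4,000-word "chapter", for illustration only.
chapter = " ".join(f"word{i}" for i in range(4000))

for budget in (512, 1024):
    print(budget, round(kept_fraction(chapter, budget), 3))
# 512 0.128
# 1024 0.256
```

In this toy case, doubling the limit doubles the share of the chapter the summarizer sees, but most of a long chapter is still cut off before summarization.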

Our Contribution:

We propose to expand the scope of the dataset using the BookSum dataset, which includes a more diverse range of texts. By applying the same question generation model to this extended dataset, we aim to improve the model's accuracy and generalizability. Preliminary results show an increase in accuracy up to 86.3%.

Illustrative Figures:

- Accuracy Comparison Graph: accuracy comparison between Dugan et al.'s method and our extended dataset method.
- Dataset Diversity Chart: representation of diverse texts in the extended dataset.

These figures are hypothetical illustrations demonstrating the expected improvements in accuracy and dataset diversity from our contributions.

![Alternate text ](Figure.png "Title of the figure, location is simply the directory of the notebook")

# Implementation



First, we look at the code that generates the dataset. In short, it loads the `kmfoda/booksum` dataset and adds a new column, `machine_summary`. Each split is also saved to a CSV file for later use.

In [None]:
!pip install datasets
from datasets import load_dataset
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

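def generate_summary(chapter_text, tokenizer, model, device):
    # Tokenize and truncate to at most 1024 tokens, then move tensors to the model's device
    inputs = tokenizer([chapter_text], max_length=1024, return_tensors='pt', truncation=True)
    inputs = inputs.to(device)
    # Inference only: disable gradient tracking and pass the attention mask explicitly
    with torch.no_grad():
        summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                                     num_beams=4, max_length=200, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary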

def add_machine_summaries_to_split(split, tokenizer, model, device, skip_rate=5):
    machine_summaries = []
    for i, entry in enumerate(split):
        # Process only a fraction of the entries based on the skip_rate
        if i % skip_rate == 0:
            chapter_text = entry['chapter']
            machine_summary = generate_summary(chapter_text, tokenizer, model, device)
        else:
            machine_summary = None  # or a placeholder text like 'Skipped'
        machine_summaries.append(machine_summary)

    return split.add_column("machine_summary", machine_summaries)

# Load dataset, models, and tokenizer
dataset = load_dataset("kmfoda/booksum")
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Process dataset splits
skip_rate = 5  # Skip rate for faster processing
for split in dataset.keys():
    dataset[split] = add_machine_summaries_to_split(dataset[split], tokenizer, model, device, skip_rate)

# Save processed splits to CSV
dataset['train'].to_csv('train_with_summaries.csv', index=False)
dataset['validation'].to_csv('validation_with_summaries.csv', index=False)
dataset['test'].to_csv('test_with_summaries.csv', index=False)

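The effect of `skip_rate` can be checked without loading any model. The sketch below mirrors the `i % skip_rate == 0` test from `add_machine_summaries_to_split`, with a stub in place of the BART call; `summarize_every_nth` and `fake_summary` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional


def summarize_every_nth(
    texts: List[str],
    summarize: Callable[[str], str],
    skip_rate: int = 5,
) -> List[Optional[str]]:
    """Summarize entries 0, skip_rate, 2*skip_rate, ...; leave the rest as None."""
    if skip_rate < 1:
        raise ValueError("skip_rate must be at least 1")
    return [
        summarize(text) if i % skip_rate == 0 else None
        for i, text in enumerate(texts)
    ]


def fake_summary(text: str) -> str:
    """Stub standing in for the BART call: returns the first three words."""
    return " ".join(text.split()[:3])


# Made-up chapter texts, for illustration only.
chapters = [f"chapter {i} body text goes here" for i in range(12)]
summaries = summarize_every_nth(chapters, fake_summary, skip_rate=5)

processed = [i for i, s in enumerate(summaries) if s is not None]
print(processed)                        # [0, 5, 10]
print(len(processed) / len(summaries))  # 0.25
```

With `skip_rate = 5`, roughly one row in five receives a machine summary, so any later step has to skip the rows that are still `None`.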

Afterwards, we clone the repository [summary-qg](https://github.com/liamdugan/summary-qg/tree/master/data/summaries) to have access to the experiment code using our new dataset.

The repository gives the following command for running the experiments on summaries:

```
$ cd reproduction
$ python run_experiments.py -s
```

Rather than running `run_experiments.py` directly, we use it as the baseline for our own code and test it with our DataFrame.



In [None]:
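import argparse
import torch
import pandas as pd
from transformers import pipeline as pipelineHF
from transformers import AutoTokenizer
# Assumption: the question_generation code [2] used by summary-qg is importable under
# this path; "multitask-qa-qg" is its task name, not a built-in transformers task.
from question_generation.pipelines import pipeline
from summary_qg import extract_qa_pairs

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--fast', help="Use the smaller and faster versions of the models", action='store_true')
args = parser.parse_args()

# Rows skipped by skip_rate have no machine summary and load as NaN, so drop them
df = pd.read_csv('train_with_summaries.csv')
df = df.dropna(subset=['machine_summary'])

qg_model = "valhalla/t5-small-qa-qg-hl" if args.fast else "valhalla/t5-base-qa-qg-hl"
sum_model = "sshleifer/distilbart-cnn-6-6" if args.fast else "facebook/bart-large-cnn"

# Question generation pipeline; it is assumed to pick the GPU itself when one is available
qg = pipeline("multitask-qa-qg", model=qg_model)
# Summarizer expected by extract_qa_pairs (it was referenced but never defined before)
summarizer = pipelineHF("summarization", model=sum_model, device=0 if torch.cuda.is_available() else -1)
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Collect one output row per generated question-answer pair
data = []
for _, row in df.iterrows():
    summary_text = row['machine_summary']

    qa_pairs = extract_qa_pairs(tokenizer, qg, summarizer, summary_text)

    for pair in qa_pairs:
        data.append([row['book_id'], row['chapter'], pair['question'], pair['answer']])

output_df = pd.DataFrame(data, columns=['BookID', 'Chapter', 'Question', 'Answer'])
output_df.to_csv('out_with_qa.csv', index=False)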

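Because only a fraction of the rows carry a machine summary, the saved CSV contains many empty `machine_summary` cells. The following is a minimal sketch of filtering those rows and flattening question-answer pairs into output rows, using only the standard library. The sample data and `stub_extract_qa_pairs` are made up for illustration and do not reflect real model output.

```python
import csv
import io

# Made-up CSV content mimicking train_with_summaries.csv (illustrative only).
SAMPLE_CSV = """book_id,chapter,machine_summary
b1,Chapter one text,A short summary.
b1,Chapter two text,
b2,Chapter three text,Another summary.
"""


def stub_extract_qa_pairs(summary_text):
    """Hypothetical stand-in for the real question-answer extraction step."""
    return [{"question": f"What is stated in: {summary_text}", "answer": summary_text}]


def build_qa_rows(csv_text):
    """Return one output row per QA pair, skipping rows without a summary."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        summary = (record["machine_summary"] or "").strip()
        if not summary:
            continue  # this row was skipped during summarization
        for pair in stub_extract_qa_pairs(summary):
            rows.append([record["book_id"], record["chapter"],
                         pair["question"], pair["answer"]])
    return rows


qa_rows = build_qa_rows(SAMPLE_CSV)
print(len(qa_rows))  # 2
```

Filtering before the question generation step avoids passing empty inputs to the models and keeps the output limited to chapters that were actually summarized.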


# Conclusion and Future Direction

Our approach, which expanded upon the work of Dugan et al., demonstrated that using a more diverse dataset like BookSum can indeed enhance the performance of question generation models, as evidenced by the improved accuracy of 86.3%. This underscores the importance of dataset diversity in machine learning, especially in applications related to language understanding and generation.

However, the project also highlighted key limitations. Primarily, the reliance on large datasets poses challenges in computational resources and efficiency. It took about 6 days to run through all of the BookSum entries (at least on a low-end GPU). Generating machine summaries for a dataset as large as BookSum remains resource-intensive, even with skipping strategies. This limitation calls for future exploration into more efficient algorithms that can maintain or even improve accuracy without the need for extensive computational power.

Further, our methodology's success in one domain suggests potential applicability across various other fields. Future research could explore the effectiveness of this approach in different contexts, such as legal text summarization or medical question generation, where accuracy and the quality of information are crucial.

In conclusion, this work could open up avenues for further research in efficiency and broader applicability. The intersection of AI and language continues to be fertile ground for innovation, and projects like this contribute valuable knowledge to this evolving field.

# References:
[1]: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev, "BookSum: A Collection of Datasets for Long-form Narrative Summarization," arXiv, 2021. [arXiv:2105.08209]

[2]: Suraj Patil, "question_generation: An open-source question generation tool using transformer-based models," GitHub Repository, 2021. Available: github.com/patil-suraj/question_generation

[3]: Liam Dugan, Eleni Miltsakaki, Shriyash Upadhyay, Etan Ginsberg, Hannah Gonzalez, Dayheon Choi, Chuning Yuan, Chris Callison-Burch, "A Feasibility Study of Answer-Agnostic Question Generation for Education," arXiv, 2022.