We present OpenBART, a natural language processing model based on BART that generates relevant open questions from input paragraphs. It was made for a Language Technology Practical course.
The dependencies for the model can be installed automatically by downloading requirements.txt and running:
pip install -r requirements.txt
The main way to use the model is through generate.py, which generates an open question from a user-supplied paragraph. There are two ways to do this; both require the model folder model-OpenBART to be present in the working directory.
The first way to run generate is through the command line, by executing the following command:
python3 generate.py
The second way to run generate is by importing it, as shown below:
import generate
input_string = "This is an example input paragraph."
generate.generate_question(input_string)
It is possible to run these with the model present in a different folder. Additionally, using the import method, one can specify whether data should be preprocessed:
python3 generate.py path/to/folder_that_contains_model_folder
folder = "path/to/folder_that_contains_model_folder"
preprocess = True
generate.generate_question(input_string, folder, preprocess)
In this section, we describe the methods that produced the model and the evaluation scores, and how to reproduce them. First, to install all relevant packages and dependencies, download requirements.txt and run the following:
pip install -r requirements.txt
Preprocessing involves two files: main.py, which is executed, and prepdata.py, which it imports. main.py takes a single split (train, test, validation1 or validation2) of rexarski/eli5-category on huggingface.co and preprocesses it by running it through an NER tagger and a keyword extractor.
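The actual preprocessing lives in prepdata.py and is not reproduced here. As a rough illustration of the keyword-extraction step, the sketch below uses a simple frequency heuristic; the function name, stopword list, and heuristic are assumptions for illustration, not the real implementation (which uses an NER tagger and a dedicated keyword extractor):

```python
from collections import Counter
import re

# Hypothetical sketch of a keyword-extraction step. The real prepdata.py
# uses an NER tagger and a dedicated Keyword Extractor; this frequency-based
# heuristic only illustrates the idea of pulling salient terms from a paragraph.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "it", "into"}

def extract_keywords(paragraph: str, top_k: int = 5) -> list[str]:
    """Return the top_k most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

text = ("The solar panel converts sunlight into electricity, "
        "and the panel stores electricity in a battery.")
print(extract_keywords(text, top_k=3))  # → ['panel', 'electricity', 'solar']
```

Keywords extracted this way can then be fed to the question generator alongside the paragraph itself.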
python3 main.py split [path/to/save_folder]
Model training is done using train_model.py. This script is more flexible and offers more room for customisation. It is called as follows:
python3 train_model.py save_folder_name
and takes the following arguments:
-m --model Destination name for checkpoints, results and final model
-t --tokenizer Source name of tokenizer to use
-d --dataset Source name of the dataset to train on
-p --path Path to source & destination folders
-e --epochs Number of epochs to train the model
-l --learningrate Learning rate of the model
-b --batchsize Batch sizes of the model
-c --cpu Use CPU instead of GPU
-q --checkpoint Continue from specified checkpoint
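The flag list above can be wired up with argparse. The sketch below mirrors the listed options; the default values shown are placeholders, not necessarily the ones train_model.py actually instantiates:

```python
import argparse

# Hypothetical sketch of the argument parser behind train_model.py.
# Flag names follow the list above; defaults here are placeholders.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fine-tune OpenBART")
    parser.add_argument("save_folder_name",
                        help="Folder to write checkpoints and the final model")
    parser.add_argument("-m", "--model",
                        help="Destination name for checkpoints, results and final model")
    parser.add_argument("-t", "--tokenizer", help="Source name of the tokenizer to use")
    parser.add_argument("-d", "--dataset", help="Source name of the dataset to train on")
    parser.add_argument("-p", "--path", help="Path to source & destination folders")
    parser.add_argument("-e", "--epochs", type=int, default=1,
                        help="Number of epochs to train the model")
    parser.add_argument("-l", "--learningrate", type=float, default=2e-5,
                        help="Learning rate of the model")
    parser.add_argument("-b", "--batchsize", type=int, default=8,
                        help="Batch size of the model")
    parser.add_argument("-c", "--cpu", action="store_true",
                        help="Use CPU instead of GPU")
    parser.add_argument("-q", "--checkpoint",
                        help="Continue from the specified checkpoint")
    return parser

# Parse the same invocation used to train the final model (see below).
args = build_parser().parse_args(
    ["four-epochs", "--epochs", "4", "--batchsize", "8", "--learningrate", "2e-5"])
print(args.save_folder_name, args.epochs)  # → four-epochs 4
```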
To train the final model, this script was executed with the following parameters:
python3 train_model.py "four-epochs" --epochs 4 --batchsize 8 --learningrate 2e-5
Parameters that were omitted took the default values instantiated by the script. To reproduce these results on a different machine, the --path argument must be adjusted to match that machine's folder layout.
This code outputs a model (with the three most recent checkpoints and the best checkpoint) in --path/save_folder_name
Model evaluation involves two files. The first is generate.py, which we covered in Model Use. The second is evaluate_model.py, a script that takes 100 items from the validation2 split of the dataset and compares the model-generated questions with the dataset questions. The evaluation metrics are BERTScore and BLEURT. Usage is as follows:
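The overall shape of such an evaluation loop can be pictured as below. This is only an illustration: the stand-in similarity function uses the standard library's SequenceMatcher, whereas evaluate_model.py itself uses BERTScore and BLEURT, which require their own packages and models:

```python
from difflib import SequenceMatcher

# Hypothetical sketch of the evaluation loop in evaluate_model.py.
# SequenceMatcher stands in for BERTScore/BLEURT, which need extra packages.
def similarity(generated: str, reference: str) -> float:
    """Crude string-overlap score in [0, 1]."""
    return SequenceMatcher(None, generated, reference).ratio()

def evaluate(pairs: list[tuple[str, str]]) -> float:
    """Average similarity over (generated, reference) question pairs."""
    scores = [similarity(g, r) for g, r in pairs]
    return sum(scores) / len(scores)

# In the real script, the generated questions come from the model and the
# references are the questions stored in the validation2 split.
pairs = [
    ("Why is the sky blue?", "Why does the sky look blue?"),
    ("How do plants make food?", "How do plants produce their food?"),
]
print(round(evaluate(pairs), 2))
```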
python3 evaluate_model.py "four-epochs"