OpenBART: Generating Open Questions

We present OpenBART, a Natural Language Processing model based on BART that generates relevant open questions from input paragraphs. It was made for a Language Technology Practical course.

Dependencies

The model's dependencies can be installed from references.txt:

pip install -r references.txt

Model Use

The main way to use the model is through generate.py, which generates an open question from a user-supplied paragraph. There are two ways to do this; both require the model folder model-OpenBART to be present in the working directory.

The first way to run generate is through the command line, by executing the following command:

python3 generate.py

The second way to run generate is by importing it, as shown below:

import generate

input_string = "This is an example input paragraph."
generate.generate_question(input_string)

It is also possible to run these with the model stored in a different folder. Additionally, when using the import method, one can specify whether the input should be preprocessed:

python3 generate.py path/to/folder_that_contains_model_folder

folder = "path/to/folder_that_contains_model_folder"
preprocess = True
generate.generate_question(input_string, folder, preprocess)

Reproducing Results

In this section, we will describe the methods that led to the model & the evaluation scores, and how to reproduce them. Firstly, to install all relevant packages and dependencies, download requirements.txt and run the following:

pip install -r requirements.txt

Preprocessing Data

Preprocessing involves two files: main.py, which is executed, and prepdata.py, which main.py imports. main.py takes a single split (train, test, validation1 or validation2) of rexarski/eli5-category on huggingface.co and preprocesses it by running it through an NER tagger and a keyword extractor:

python3 main.py split (path/to/save_folder)
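The actual tagging and extraction live in prepdata.py; as a rough illustration of the idea (keywords extracted from the paragraph and prepended as extra context for the model), here is a minimal sketch. The function names, the toy frequency-based extractor, and the tagging format are illustrative assumptions, not the repository's real interface:

```python
import re
from collections import Counter

# Tiny stopword list for illustration; the real pipeline presumably uses a
# proper NER tagger and keyword extractor (see prepdata.py).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "that", "it"}

def extract_keywords(text: str, top_k: int = 3) -> list[str]:
    """Toy frequency-based keyword extraction (stand-in for the real extractor)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

def preprocess_paragraph(text: str) -> str:
    """Prepend extracted keywords to the paragraph as extra model context.

    The format (keywords joined by ' | ' before a separator token) is an
    assumption for illustration only.
    """
    keywords = extract_keywords(text)
    return " | ".join(keywords) + " </s> " + text
```

The point is the shape of the transformation, not the extractor itself: each dataset item is rewritten so that salient terms precede the paragraph text.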

Model Training

Model training is done using train_model.py. This script is more flexible and offers more customisation. It is called as follows:

python3 train_model.py save_folder_name

and takes the following arguments:

-m  --model         Destination name for checkpoints, results and final model
-t  --tokenizer     Source name of the tokenizer to use
-d  --dataset       Source name of the dataset to train on
-p  --path          Path to source & destination folders
-e  --epochs        Number of epochs to train the model
-l  --learningrate  Learning rate of the model
-b  --batchsize     Batch size of the model
-c  --cpu           Use CPU instead of GPU
-q  --checkpoint    Continue from specified checkpoint
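The flag table above could be wired up with argparse roughly as follows. This is an assumption about how train_model.py defines its interface; the defaults shown are placeholders taken from the example invocation below, not necessarily the script's actual values:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented train_model.py flags (defaults are placeholders)."""
    p = argparse.ArgumentParser(description="Train the OpenBART model")
    p.add_argument("save_folder_name",
                   help="Folder to save checkpoints, results and the final model in")
    p.add_argument("-m", "--model",
                   help="Destination name for checkpoints, results and final model")
    p.add_argument("-t", "--tokenizer", help="Source name of the tokenizer to use")
    p.add_argument("-d", "--dataset", help="Source name of the dataset to train on")
    p.add_argument("-p", "--path", help="Path to source & destination folders")
    p.add_argument("-e", "--epochs", type=int, default=4,
                   help="Number of epochs to train the model")
    p.add_argument("-l", "--learningrate", type=float, default=2e-5,
                   help="Learning rate of the model")
    p.add_argument("-b", "--batchsize", type=int, default=8,
                   help="Batch size of the model")
    p.add_argument("-c", "--cpu", action="store_true",
                   help="Use CPU instead of GPU")
    p.add_argument("-q", "--checkpoint",
                   help="Continue from specified checkpoint")
    return p
```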

To train the final model, this script was executed with the following parameters:

python3 train_model.py "four-epochs" --epochs 4 --batchsize 8 --learningrate 2e-5

Parameters that are omitted fall back to default values set in the script. To reproduce these results on a different machine, the path must be altered to fit that machine's layout.

This script outputs a model (along with the three most recent checkpoints and the best checkpoint) in --path/save_folder_name.

Model Evaluation

Model evaluation involves two files, one of which is generate.py, covered in Model Use. The second is evaluate_model.py, a script that takes 100 items from the validation2 split of the dataset and compares the model-generated questions with the dataset questions. The evaluation metrics are BERTScore and BLEURT. Usage is as follows:

python3 evaluate_model.py "four-epochs"
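BERTScore and BLEURT both require large pretrained scoring models, so the sketch below substitutes a toy token-overlap F1 purely to make the comparison loop clear. The function names and the scoring stand-in are illustrative assumptions, not the script's real code:

```python
def token_f1(candidate: str, reference: str) -> float:
    """Toy token-overlap F1, standing in for BERTScore/BLEURT (which need
    large pretrained scoring models)."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(generated: list[str], references: list[str]) -> float:
    """Average score of model-generated questions against dataset questions,
    mirroring the per-item comparison evaluate_model.py performs over 100
    validation2 items."""
    assert len(generated) == len(references)
    scores = [token_f1(g, r) for g, r in zip(generated, references)]
    return sum(scores) / len(scores)
```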
