This repo details the work done for the module project of NUS CS4248 Natural Language Processing.
This codebase trains a language model to extract a list of actionable points from a given email, using an original email dataset created through data inversion.
All finetuned models are available on the Hugging Face Hub at the following links:
Bloom-560m finetuned on InstructGPT3 data:
https://huggingface.co/pinxi/bloom-560m-igpt3
Bloom-560m finetuned on Bloom data:
https://huggingface.co/pinxi/bloom-560m-bloom
Bloom-1b7 finetuned on InstructGPT3 data:
https://huggingface.co/pinxi/bloom-1b7-igpt3
Bloom-1b7 finetuned on Bloom data:
https://huggingface.co/pinxi/bloom-1b7-bloom
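As a minimal sketch, one of these checkpoints could be loaded with the transformers library as shown below. The prompt format used during finetuning is not documented in this README, so passing the raw email text as the prompt is an assumption, and the helper name extract_actionable_points is hypothetical.

```python
MODEL_ID = "pinxi/bloom-560m-igpt3"  # any of the four checkpoints listed above


def extract_actionable_points(email: str, model_id: str = MODEL_ID,
                              max_new_tokens: int = 128) -> str:
    """Generate a continuation for `email` with a finetuned Bloom checkpoint.

    Assumption: the raw email is used as the prompt; the real prompt
    template may differ.
    """
    # Imported lazily so this module can be imported without transformers/torch.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(email, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


# Example (downloads the checkpoint on first use):
# print(extract_actionable_points("Hi team, please send the report by Friday."))
```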
Install the project dependencies.
pip install -r requirements.txt
Add your API keys in settings.py or through environment variables.
We generate the original email dataset by prompting another pretrained language model with self-crafted actionable and non-actionable points to write an email. The datapoints are then inverted to create an email-to-actionable points dataset.
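The inversion step can be illustrated with a small sketch. The field names (text, actionable, email, actionable_points) and the fake_generate_email stand-in are hypothetical; the actual schema and prompting logic live in data_generation/data_generator.py and may differ.

```python
from typing import Callable, Dict, List

Point = Dict[str, object]  # hypothetical shape: {"text": str, "actionable": bool}


def invert_datapoint(points: List[Point], email: str) -> Dict[str, object]:
    """Turn a (points -> email) generation record into an
    (email -> actionable points) training record."""
    actionable = [p["text"] for p in points if p["actionable"]]
    return {"email": email, "actionable_points": actionable}


def build_dataset(seed_points: List[List[Point]],
                  generate_email: Callable[[List[Point]], str]) -> List[Dict[str, object]]:
    """Generate one email per list of seed points, then invert each record."""
    dataset = []
    for points in seed_points:
        email = generate_email(points)  # in the project, a pretrained LM writes this
        dataset.append(invert_datapoint(points, email))
    return dataset


def fake_generate_email(points: List[Point]) -> str:
    """Stand-in for the language model call, used only for illustration."""
    return "Hi, " + " ".join(str(p["text"]) for p in points)
```

Non-actionable points still appear in the generated email but are left out of the target list, which is what lets the finetuned model learn to filter them.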
The data generation script handles all possible ways to generate data:
python data_generation/data_generator.py
Our datasets can be found in the data directory:
# data generated by InstructGPT3
gpt_generated_data.jsonl
# data generated by Bloom
bloom_generated_data.jsonl
# handwritten dataset for evaluation
handwritten_data.jsonl
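A minimal sketch for reading these files follows. It assumes only that each non-empty line holds one JSON object, as the .jsonl extension suggests; the record fields are not documented here, so the loader does not interpret them. The helper name read_jsonl is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Example:
# records = list(read_jsonl("data/handwritten_data.jsonl"))
# print(len(records), records[0].keys())
```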
Finetuning was done from a Jupyter notebook:
finetuning/bloom_finetune.ipynb
The DeepSpeed config we used for finetuning can be found and modified in:
finetuning/ds_config_zero2.json
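If you prefer to adjust the config programmatically rather than by hand, a sketch like the following could work. The helper override_ds_config is hypothetical, and which keys the shipped config actually contains is an assumption; gradient_accumulation_steps is used below only as an example of a standard DeepSpeed setting.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Union


def override_ds_config(path: Union[str, Path], overrides: Dict,
                       out_path: Optional[Union[str, Path]] = None) -> Dict:
    """Load a DeepSpeed JSON config, apply (possibly nested) overrides, and save it.

    Writes to `out_path` if given, otherwise overwrites `path`.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))

    def merge(target: Dict, updates: Dict) -> None:
        for key, value in updates.items():
            # Recurse into nested sections so sibling keys are preserved.
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                merge(target[key], value)
            else:
                target[key] = value

    merge(config, overrides)
    Path(out_path or path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example (the output filename is hypothetical):
# override_ds_config("finetuning/ds_config_zero2.json",
#                    {"gradient_accumulation_steps": 4},
#                    out_path="finetuning/ds_config_custom.json")
```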
Evaluation was done from a Jupyter notebook:
finetuning/bloom_loss.ipynb
Tan Pinxi, Tan Xi Zhe, Tan Ming Ann, Lim Yu Yang, Ng Boon Hong