This repository contains the data and models for the paper: Retrieval augmentation of large language models for lay language generation. An earlier version of the paper appeared on arXiv as CELLS: A Parallel Corpus for Biomedical Lay Language Generation.
01/18/2024 The paper was accepted by the Journal of Biomedical Informatics. Check the latest version, with GPT-4 and Llama-2 results, here: Retrieval augmentation of large language models for lay language generation
05/01/2023 Updated the wiki_dict.json file! Click here to download it.
04/24/2023 Uploaded metadata for the CELLS dataset. Titles and journal names are now available!
Datasets can be found in "./data". The "xxx.source" files contain the scientific text, while the "xxx.target" files contain the corresponding plain language text. Follow the instructions here to construct the PubMed dataset for BART pre-training.
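For quick inspection, the paired files can be read side by side. A minimal sketch, assuming a split named "train.source"/"train.target" directly under the data directory and one example per line (the exact file names depend on the dataset variant you download):

```python
from pathlib import Path

# Assumed layout: line i of *.source is paired with line i of *.target.
data_dir = Path("./data")
sources = (data_dir / "train.source").read_text(encoding="utf-8").splitlines()
targets = (data_dir / "train.target").read_text(encoding="utf-8").splitlines()
assert len(sources) == len(targets)

for src, tgt in zip(sources[:3], targets[:3]):
    print("SCIENTIFIC:", src[:120])
    print("PLAIN:     ", tgt[:120])
    print()
```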
Datasets for different applications (details can be found in Section 3.1.2, Dataset applications, of the paper):
- CELLS: The paragraph-paired data of scientific abstracts and plain language summaries for the lay language generation task.
- BELLS: The paragraph segment-paired data for background explanations.
- SELLS: The sentence-level paired data for simplification.
- Validated dataset: Randomly sampled data manually annotated for background explanations and simplification.
For the BART model, we use the Fairseq BART implementation. Download the BART model pre-trained on the CNN/DM dataset from here.
Follow the instructions here to fine-tune the BART model on CELLS data. The hyperparameters for fine-tuning BART on plain language generation, simplification, and background explanation can be found in "./model/BART/".
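Once a checkpoint is fine-tuned, it can be loaded through Fairseq's Python API for generation. A minimal sketch; the checkpoint directory, binarized data path, and beam-search settings below are placeholders (the beam settings mirror Fairseq's CNN/DM summarization example, not necessarily the paper's):

```python
from fairseq.models.bart import BARTModel

# Placeholder paths: point at your fine-tuned checkpoint and binarized CELLS data.
bart = BARTModel.from_pretrained(
    "checkpoints/cells_bart",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cells-bin",
)
bart.eval()

abstracts = ["<scientific abstract goes here>"]
# Beam-search settings taken from the Fairseq CNN/DM example.
summaries = bart.sample(abstracts, beam=4, lenpen=2.0, max_len_b=140, no_repeat_ngram_size=3)
print(summaries[0])
```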
Definition-based explanation retrieval with UMLS: run "./preprocess/UMLS/umls_ner.py" first to identify named entities in the text, then run "./preprocess/UMLS/run_add_umls.sh" to insert UMLS definitions after the identified entities.
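The scripts above handle this end to end; conceptually, the step links each biomedical entity to a UMLS concept and splices its definition into the text right after the mention. A minimal sketch of that idea using scispaCy's UMLS entity linker, which is an assumption for illustration only (the repository's scripts may use a different NER tool or UMLS interface):

```python
import spacy
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" pipe

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
linker = nlp.get_pipe("scispacy_linker")

text = "Patients with myocardial infarction were enrolled in the trial."
augmented = text
for ent in nlp(text).ents:
    if ent._.kb_ents:                                   # candidate UMLS concepts, best first
        cui, _score = ent._.kb_ents[0]
        definition = linker.kb.cui_to_entity[cui].definition
        if definition:
            # Insert the definition right after the entity mention.
            augmented = augmented.replace(ent.text, f"{ent.text} ({definition})", 1)
print(augmented)
```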
Definition-based explanation retrieval with Wikipedia: run "./preprocess/Wiki/run_keywords.sh" first to extract the most important words in the text, then run "./preprocess/Wiki/run_add_wiki.sh" to insert Wikipedia definitions after the keywords.
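A minimal sketch of the same idea for the Wikipedia route, assuming KeyBERT for keyword extraction (as in the paper) and the `wikipedia` package for fetching a one-sentence definition; the repository's scripts and the wiki_dict.json file above may handle the lookups differently:

```python
from keybert import KeyBERT
import wikipedia

text = "CRISPR-Cas9 enables precise genome editing in mammalian cells."

# Extract the most salient keyphrases from the abstract.
keywords = KeyBERT().extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=3)

definitions = {}
for phrase, _score in keywords:
    try:
        # Use the first sentence of the Wikipedia article as a lay definition.
        definitions[phrase] = wikipedia.summary(phrase, sentences=1, auto_suggest=False)
    except Exception:
        continue  # skip ambiguous or missing pages

# Combine the source with the retrieved definitions to form the augmented input.
augmented = text + " " + " ".join(f"{k}: {v}" for k, v in definitions.items())
print(augmented)
```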
For the retrieval-augmented BART model, we use the Hugging Face implementation. Follow the instructions here to fine-tune it on CELLS data. The hyperparameters for RAG can be found in "./model/RAG/".
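A minimal sketch of how a retrieval-augmented input can be assembled and passed to a Hugging Face BART model for generation (fine-tuning itself follows the linked instructions); the model name and length settings are illustrative placeholders:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")  # or your fine-tuned checkpoint

# Retrieval augmentation at the input level: concatenate the abstract with the
# definitions retrieved in the previous step (UMLS or Wikipedia).
abstract = "<scientific abstract>"
definitions = "<retrieved definitions>"
inputs = tokenizer(abstract + " " + definitions, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(**inputs, num_beams=4, max_length=150)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```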
To evaluate the performance of LLMs in generating background explanations or plain language summaries, we utilized Llama 2 (Llama-2-70B-chat) and GPT-4 (accessed in September 2023). We explored two prompts:
- "Summarize in plain language: input"
- "Summarize in plain language, providing necessary explanations: input"
To further assess the impact of the retrieval-augmented approach on LLMs, we established two settings for input:
- The source alone
- The source combined with Wikipedia definitions for keywords identified using KeyBERT
The generation process was configured with a maximum length of 150 tokens. All other parameters were set to their default values.
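For reference, a minimal sketch of the GPT-4 setting with the second prompt and the retrieval-augmented input. The paper accessed GPT-4 in September 2023, so the client version used there differs; the snippet below uses the current OpenAI Python SDK purely as an illustration:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

source = "<scientific abstract>"
definitions = "<Wikipedia definitions for KeyBERT keywords>"  # drop this for the source-only setting

response = client.chat.completions.create(
    model="gpt-4",
    max_tokens=150,  # maximum generation length used in the paper
    messages=[{
        "role": "user",
        "content": f"Summarize in plain language, providing necessary explanations: {source} {definitions}",
    }],
)
print(response.choices[0].message.content)
```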
Download the model checkpoints for the different tasks here.