This directory contains the data, model checkpoints, and code for our paper "Section-level Simplification of Biomedical Abstracts".
The original English Cochrane abstracts and plain language summaries (PLS) were derived from the Cochrane Database of Systematic Reviews (CDSR) by Devaraj et al. (2021). We copied their data into data/cochrane. We also copied the manually and automatically aligned sentence pairs extracted by Joseph et al. (2023) from their repository into the data/multicochrane directory.
We placed our newly created Cochrane-sections dataset in the data/cochrane-sections directory.
We provide the LLM-generated labels for the test set within the subfolders of data/classifications.
Lastly, we provide our manual annotations for the test set under data/annotations.
We provide the checkpoint for the neural CRF alignment model that was first trained by Jiang et al. (2020) and then fine-tuned and shared by Joseph et al. (2023) here. It leverages the BERT model that Jiang et al. trained on Wiki-manual and shared here.
We also share the checkpoints of our trained section classification models under classifiers.
Firstly, the script load_data.py can be used to (1) load the sentence-tokenized abstracts and PLS and (2) determine for each PLS sentence whether it is aligned to an abstract sentence and, if so, which abstract section that sentence belongs to. The resulting triples (pls_sent_id, abs_sent_id, abs_sect_id) are saved under alignments.
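The triple format can be illustrated with a minimal sketch. It is not the code in load_data.py: the function name `build_triples`, the dictionary-based alignment input, and the use of `None` for unaligned PLS sentences are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# (pls_sent_id, abs_sent_id, abs_sect_id); the last two are None when unaligned.
Triple = Tuple[int, Optional[int], Optional[int]]


def build_triples(
    num_pls_sents: int,
    alignments: Dict[int, int],
    abs_sections: List[int],
) -> List[Triple]:
    """Hypothetical helper: attach an abstract section id to each PLS sentence.

    alignments maps a PLS sentence index to the index of its aligned
    abstract sentence; abs_sections[i] is the section id of abstract
    sentence i. Both input formats are assumptions, not the repository's.
    """
    triples: List[Triple] = []
    for pls_id in range(num_pls_sents):
        abs_id = alignments.get(pls_id)  # None if the sentence is unaligned
        sect_id = abs_sections[abs_id] if abs_id is not None else None
        triples.append((pls_id, abs_id, sect_id))
    return triples


# Three PLS sentences; sentences 0 and 2 are aligned, sentence 1 is not.
example = build_triples(3, {0: 2, 2: 0}, [0, 0, 1])
print(example)  # [(0, 2, 1), (1, None, None), (2, 0, 0)]
```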
The script alignment.py can be used to generate automatic alignments for the test set and evaluate them against the manual alignments. The generated triples are then saved under alignments. We copied the required code from Jiang et al. to the aligner directory.
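As a rough illustration of what evaluating automatic alignments against manual ones can look like, the sketch below computes precision, recall, and F1 over sets of (pls_sent_id, abs_sent_id) pairs. Whether alignment.py reports exactly these metrics is an assumption, and the function name `alignment_prf` is hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (pls_sent_id, abs_sent_id)


def alignment_prf(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Hypothetical helper: pair-level precision, recall, and F1."""
    true_pos = len(predicted & gold)  # pairs found by both aligner and annotators
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One of two predicted pairs matches one of two manually aligned pairs.
p, r, f1 = alignment_prf({(0, 2), (1, 1)}, {(0, 2), (2, 0)})
print(p, r, f1)  # 0.5 0.5 0.5
```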
The script prepare.py contains our code to prepare for the classification step by embedding the source sentences and target labels.
The script classifier.py contains our implementation of the section classifier and the code used for training it.
The script classification.py contains our code for predicting section header labels with a trained classifier. The generated predictions are saved under classifications.
The script two_step_method.py can be used to determine the label of each sentence within a PLS using our two-step method, based on the alignment and classification results.
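One plausible way to combine the two result types is sketched below. The combination rule is an assumption, not taken from two_step_method.py: the section of the aligned abstract sentence is used when an alignment exists, and the classifier prediction is used otherwise. The function name `two_step_labels` and the example label strings are hypothetical.

```python
from typing import List, Optional


def two_step_labels(
    aligned_sections: List[Optional[str]],
    classifier_labels: List[str],
) -> List[str]:
    """Hypothetical helper: alignment label first, classifier label as fallback.

    aligned_sections[i] is the section of the abstract sentence aligned to
    PLS sentence i, or None if that sentence has no alignment.
    classifier_labels[i] is the classifier's prediction for PLS sentence i.
    """
    if len(aligned_sections) != len(classifier_labels):
        raise ValueError("Both inputs must cover the same PLS sentences.")
    return [
        aligned if aligned is not None else predicted
        for aligned, predicted in zip(aligned_sections, classifier_labels)
    ]


# Sentence 1 has no alignment, so its label comes from the classifier.
labels = two_step_labels(
    ["BACKGROUND", None, "RESULTS"],
    ["METHODS", "METHODS", "CONCLUSIONS"],
)
print(labels)  # ['BACKGROUND', 'METHODS', 'RESULTS']
```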
The script create_dataset.py can be used to create a split of the Cochrane-sections dataset based on annotated (test) or predicted (train/val/auto) labels. This split is then saved under data/cochrane-sections.
Lastly, the script analysis.py can be used to visualize results by generating a table, bar plot, and confusion matrix, as seen in our paper.