JanB100/cochrane-sections

Cochrane-sections

This repository contains the data, pretrained models, and code for our paper "Section-level Simplification of Biomedical Abstracts".

Data

The original English Cochrane abstracts and plain language summaries (PLS) were derived from the Cochrane Database of Systematic Reviews (CDSR) by Devaraj et al. (2021). We copied their data into the data/cochrane directory. We also copied the manually and automatically aligned sentence pairs extracted by Joseph et al. (2023) from their repository into the data/multicochrane directory.

We placed our newly created Cochrane-sections dataset in the data/cochrane-sections directory.

We provide the LLM-generated labels for the test set within the subfolders of data/classifications.

Lastly, we provide our manual annotations for the test set under data/annotations.

Pretrained models

We provide the checkpoint for the neural CRF alignment model that was first trained by Jiang et al. (2020) and then fine-tuned and shared by Joseph et al. (2023) here. It leverages the BERT model that Jiang et al. trained on Wiki-manual and shared here.

We also share the checkpoints of our trained section classification models under classifiers.

Code

Alignment

First, the script load_data.py can be used to (1) load the sentence-tokenized abstracts and PLS and (2) determine for each PLS sentence whether it is aligned to an abstract sentence and, if so, which abstract section that sentence belongs to. The resulting triples (pls_sent_id, abs_sent_id, abs_sect_id) are saved under alignments.
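To illustrate the triple format, here is a minimal sketch of how such triples could be assembled. It is not the code in load_data.py: the function name `build_triples`, the input shapes, and the use of `None` for unaligned PLS sentences are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    """One (pls_sent_id, abs_sent_id, abs_sect_id) triple."""
    pls_sent_id: int
    abs_sent_id: Optional[int]  # None marks an unaligned PLS sentence (assumed convention)
    abs_sect_id: Optional[str]


def build_triples(
    aligned_pairs: Iterable[Tuple[int, int]],
    sentence_to_section: Dict[int, str],
    num_pls_sentences: int,
) -> List[Alignment]:
    """Hypothetical helper: attach the abstract section to every aligned pair."""
    # Group the aligned abstract sentences by PLS sentence.
    by_pls: Dict[int, List[int]] = {}
    for pls_id, abs_id in aligned_pairs:
        by_pls.setdefault(pls_id, []).append(abs_id)

    triples: List[Alignment] = []
    for pls_id in range(num_pls_sentences):
        abs_ids = by_pls.get(pls_id, [])
        if not abs_ids:
            # No abstract sentence is aligned to this PLS sentence.
            triples.append(Alignment(pls_id, None, None))
            continue
        for abs_id in sorted(abs_ids):
            triples.append(Alignment(pls_id, abs_id, sentence_to_section[abs_id]))
    return triples
```

For example, `build_triples([(0, 2), (2, 5)], {2: "s1", 5: "s3"}, 3)` yields one triple per aligned pair plus an unaligned entry for PLS sentence 1. The section identifiers `"s1"` and `"s3"` are placeholders, not the labels used in the dataset.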

The script alignment.py can be used to generate automatic alignments for the test set and evaluate them against the manual alignments. The generated triples are then saved under alignments. We copied the required code from Jiang et al. to the aligner directory.
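As an illustration of what evaluating automatic alignments against manual ones can look like, the sketch below computes precision, recall, and F1 over sets of sentence pairs. Which metrics alignment.py actually reports is not stated here, so the choice of metrics and the function name `alignment_prf` are assumptions.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (pls_sent_id, abs_sent_id)


def alignment_prf(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Hypothetical helper: set-based precision, recall, and F1 of predicted pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Treating alignments as sets makes the score independent of the order in which pairs are produced and ignores duplicate predictions.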

Classification

The script prepare.py contains our code to prepare for the classification step by embedding the source sentences and target labels.

The script classifier.py contains our implementation of the section classifier and the code used for training it.

The script classification.py contains our code for predicting section header labels with a trained classifier. The generated predictions are saved under classifications.

Two-step method

The script two_step_method.py can be used to determine the label of each sentence within a PLS using our two-step method, based on the alignment and classification results.
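How the two results are combined is described in the paper rather than here. As a rough sketch of one plausible combination, the example below uses the section of the aligned abstract sentences when an alignment exists and falls back to the classifier prediction otherwise. This fallback order, the majority vote, the function name `two_step_label`, and the section names are all assumptions, not a description of two_step_method.py.

```python
from collections import Counter
from typing import Sequence


def two_step_label(aligned_sections: Sequence[str], classifier_label: str) -> str:
    """Hypothetical combination: alignment first, classifier as fallback."""
    if aligned_sections:
        # Step 1 (assumed): take the most frequent section among the aligned
        # abstract sentences; ties go to the section seen first.
        return Counter(aligned_sections).most_common(1)[0][0]
    # Step 2 (assumed): no alignment available, so use the classifier output.
    return classifier_label
```

For instance, a PLS sentence aligned to abstract sentences from `["METHODS", "RESULTS", "RESULTS"]` would be labeled `RESULTS` under this sketch, while an unaligned sentence would keep its classifier label.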

Dataset creation & analysis

The script create_dataset.py can be used to create a split of the Cochrane-sections dataset based on annotated (test) or predicted (train/val/auto) labels. This split is then saved under data/cochrane-sections.

Lastly, the script analysis.py can be used to visualize the results by generating a table, a bar plot, and a confusion matrix as seen in our paper.
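For readers unfamiliar with the confusion matrix mentioned above, the following is a small dependency-free sketch of how one can be computed from annotated and predicted section labels. It is not taken from analysis.py. The function name `confusion_matrix` and the label values in the usage note are placeholders.

```python
from typing import List, Sequence


def confusion_matrix(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> List[List[int]]:
    """Hypothetical helper: rows are gold labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for gold_label, pred_label in zip(gold, pred):
        matrix[index[gold_label]][index[pred_label]] += 1
    return matrix
```

With `labels = ["A", "B"]`, gold labels `["A", "A", "B"]`, and predictions `["A", "B", "B"]`, the diagonal counts the two correct predictions and the off-diagonal cell in the first row counts the one `A` sentence that was predicted as `B`.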
