PubMed 200k RCT dataset
The PubMed 200k RCT dataset is described in Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.
PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.
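As a rough illustration of the task setup described above, the sketch below models one abstract as an ordered list of (role, sentence) pairs. The names `Role`, `LabeledAbstract`, `toy_abstract`, and `roles_in_order` are hypothetical and not part of the dataset, the sentences are made up for illustration, and nothing here reflects the on-disk file format.

```python
from enum import Enum
from typing import List, Tuple


class Role(Enum):
    """The five sentence classes named in the dataset description."""
    BACKGROUND = "background"
    OBJECTIVE = "objective"
    METHOD = "method"
    RESULT = "result"
    CONCLUSION = "conclusion"


# Hypothetical in-memory representation: one abstract is an ordered
# sequence of (role, sentence) pairs, so sentence order is preserved.
LabeledAbstract = List[Tuple[Role, str]]

# Made-up sentences, for illustration only (not taken from the dataset).
toy_abstract: LabeledAbstract = [
    (Role.BACKGROUND, "Condition X is common in adults."),
    (Role.OBJECTIVE, "We assessed whether drug Y improves symptoms."),
    (Role.METHOD, "Participants were randomized to drug Y or placebo."),
    (Role.RESULT, "Symptoms improved more in the drug Y group."),
    (Role.CONCLUSION, "Drug Y may be an option for condition X."),
]


def roles_in_order(abstract: LabeledAbstract) -> List[str]:
    """Return the sequence of role names, in sentence order."""
    return [role.value for role, _ in abstract]


print(roles_in_order(toy_abstract))
```

Keeping the sentences in order matters because a sequential classifier can use the labels and text of neighboring sentences, not just the sentence being classified.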
Some miscellaneous information:
- PubMed 20k is a subset of PubMed 200k. I.e., any abstract present in PubMed 20k is also present in PubMed 200k.
- PubMed_200k_RCT is the same as PubMed_200k_RCT_numbers_replaced_with_at_sign, except that in the latter all numbers had been replaced by @. (same for
- Since the GitHub file size limit is 100 MiB, we had to compress PubMed_200k_RCT_numbers_replaced_with_at_sign\train.zip. To uncompress train.7z, you may use 7-Zip on Windows, Keka on Mac OS X, or p7zip on Linux.
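The number replacement mentioned in the list above can be approximated with the minimal sketch below. The exact rule used to build the dataset is not stated here, so the sketch assumes that each run of ASCII digits becomes a single @. The function name `replace_numbers_with_at_sign` and the example sentence are hypothetical.

```python
import re

# Assumption: a "number" is a maximal run of ASCII digits. Decimal points,
# signs, and separators are left untouched, so "3.5" becomes "@.@".
_DIGIT_RUN = re.compile(r"[0-9]+")


def replace_numbers_with_at_sign(sentence: str) -> str:
    """Return `sentence` with every run of digits replaced by '@'."""
    return _DIGIT_RUN.sub("@", sentence)


# Made-up sentence, for illustration only.
example = "A total of 125 patients were randomized 1:1 to 25 mg or placebo."
print(replace_numbers_with_at_sign(example))
# prints: A total of @ patients were randomized @:@ to @ mg or placebo.
```

If the goal is to match the released files exactly, it is safer to compare the output against a few sentences from both directories than to rely on this rule.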
You are most welcome to share with us your analyses or work using this dataset!
Projects using the PubMed 200k RCT dataset
- Titipat Achakulvisut, Chandra Bhagavatula, Daniel E Acuna, Konrad P Kording. Claim Extraction for Scientific Publications. 2018