Home

Introduction

Creating a dataset of open-access PLOS ONE papers for research on citation context prediction and analysis.

Dataset description

Source of Data

PMC (PubMed Central) provides several ways to obtain the data in its corpus, and bulk download is one of them. It lets us easily download the collection of data that we need. The data is available in the JATS (Journal Article Tag Suite) XML format, where each part of an article is a distinct XML element. For example, a citation marker in the text has its own tag and an id that links it to the related reference string in the reference list.
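The link between a citation marker and its reference string can be sketched in a few lines of Python. This is a minimal illustration, not the extraction code used for this dataset. It assumes the common JATS convention that an in-text marker is an `xref` element with `ref-type="bibr"` whose `rid` attribute matches the `id` of a `ref` element in the reference list. The sample fragment and the function name `link_citations` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written, JATS-like fragment used only for illustration
# (placeholder text, not taken from the corpus).
SAMPLE = """<article>
  <body>
    <p>Earlier work <xref ref-type="bibr" rid="B1">[1]</xref> reported this,
    and it was later confirmed <xref ref-type="bibr" rid="B2">[2]</xref>.</p>
  </body>
  <back>
    <ref-list>
      <ref id="B1"><mixed-citation>First reference string.</mixed-citation></ref>
      <ref id="B2"><mixed-citation>Second reference string.</mixed-citation></ref>
    </ref-list>
  </back>
</article>"""


def link_citations(xml_text):
    """Return (marker text, reference id, reference string) for each citation marker."""
    root = ET.fromstring(xml_text)

    # Map each reference id to its whitespace-normalised text.
    references = {
        ref.get("id"): " ".join("".join(ref.itertext()).split())
        for ref in root.iter("ref")
    }

    links = []
    for xref in root.iter("xref"):
        # Assumption: bibliographic citation markers carry ref-type="bibr".
        if xref.get("ref-type") == "bibr":
            rid = xref.get("rid")
            links.append((xref.text, rid, references.get(rid)))
    return links


links = link_citations(SAMPLE)
for marker, rid, reference in links:
    print(marker, rid, reference)
```

Real articles may use namespaces, nested markup inside the marker, or markers that point to several references at once, so a production parser would need to handle more cases than this sketch does.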

The PLOS ONE corpus can be found at the following link: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

and in the following collection: articles.O-Z.xml.tar.gz

We downloaded the corpus in December 2017 (/home/ghavimbm/git_projects/PloS_data/PLoS_One). It contains 184545 papers in XML format.
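A count like the one above can be checked with a short script. The following is a sketch under stated assumptions: the helper name `count_article_files` is hypothetical, and the set of file extensions is a guess (article files in the extracted archive may end in `.xml` or `.nxml`), so adjust it to match the files on disk.

```python
from pathlib import Path


def count_article_files(corpus_dir, suffixes=(".xml", ".nxml")):
    """Count files under corpus_dir (recursively) whose extension is in suffixes."""
    root = Path(corpus_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # The suffix list is an assumption; pass a different tuple if needed.
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

For example, `count_article_files("/home/ghavimbm/git_projects/PloS_data/PLoS_One")` would be expected to match the number of papers reported here, provided the directory holds only the extracted article files.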

More information about the PubMed Central bulk download is available here: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

We decided to use the PLOS ONE corpus because we were interested in analyzing social science papers, and this corpus contains papers published in interdisciplinary social science fields.

