ottowg edited this page Jun 11, 2018 · 7 revisions

Introduction

Creating a dataset of openly accessible PLOS ONE papers for research on citation context prediction and analysis.

Dataset description

Source of Data

PMC (PubMed Central) offers several ways to access the data in its corpus. Bulk download is one of them, and it made it easy to download the collection we needed. The data is available in the JATS (Journal Article Tag Suite) XML format, in which each part of an article is a distinct XML element. For example, a citation marker in the text has its own tag and an id that links it to the corresponding reference string in the reference list.
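The linkage between citation markers and reference strings can be sketched as follows. This is a minimal illustration, not the pipeline used for this dataset: the sample document is made up, and the function name `link_citations` is hypothetical. It relies on the JATS convention that an in-text citation is an `xref` element with `ref-type="bibr"` whose `rid` attribute points at the `id` of a `ref` element in the reference list.

```python
import xml.etree.ElementTree as ET

# Made-up miniature JATS article, for illustration only.
SAMPLE_JATS = """<article>
  <body>
    <p>Prior work <xref ref-type="bibr" rid="ref1">[1]</xref> and
       <xref ref-type="bibr" rid="ref2">[2]</xref> studied this; see also
       <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>Author A. First title. 2010.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>Author B. Second title. 2012.</mixed-citation></ref>
    </ref-list>
  </back>
</article>"""


def link_citations(xml_text):
    """Return (marker text, reference id, reference string) for each citation marker."""
    root = ET.fromstring(xml_text)

    # Map each reference id to its whitespace-normalized reference string.
    refs = {
        ref.get("id"): " ".join("".join(ref.itertext()).split())
        for ref in root.iter("ref")
    }

    links = []
    for xref in root.iter("xref"):
        # Skip cross-references to figures, tables, etc.
        if xref.get("ref-type") != "bibr":
            continue
        marker = "".join(xref.itertext()).strip()
        # rid may hold several space-separated ids.
        for rid in (xref.get("rid") or "").split():
            links.append((marker, rid, refs.get(rid)))
    return links


pairs = link_citations(SAMPLE_JATS)
for marker, rid, reference in pairs:
    print(marker, rid, reference)
```

Real PMC files are larger and use longer ids, but the same `rid` to `id` lookup applies.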

The PLOS ONE corpus is available from the PMC FTP server: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

It is part of the following collection: articles.O-Z.xml.tar.gz

We downloaded the corpus in December 2017 (local copy: /home/ghavimbm/git_projects/PloS_data/PLoS_One). It contains 184545 papers in XML format.
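A paper count like the one above can be checked directly against the downloaded archive. The following is a rough sketch under stated assumptions: the function name `count_journal_articles` is hypothetical, the journal directory name `PLoS_One` is taken from the local path above, and the assumption that article files inside the archive end in `.nxml` or `.xml` and sit under a directory named after the journal should be verified against the actual archive.

```python
import tarfile


def count_journal_articles(archive_path, journal_dir="PLoS_One",
                           suffixes=(".nxml", ".xml")):
    """Count article files under `journal_dir` without extracting the archive."""
    count = 0
    # "r|gz" reads the gzip-compressed tar as a stream, one member at a time.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            directories = member.name.split("/")[:-1]
            # Assumed layout: <journal_dir>/<article file>
            if journal_dir in directories and member.name.endswith(suffixes):
                count += 1
    return count
```

Calling `count_journal_articles("articles.O-Z.xml.tar.gz")` would then return the number of matching files. Streaming avoids unpacking the whole collection just to count one journal.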

More information about the PMC bulk download is available here: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

We chose the PLOS ONE corpus because we were interested in analyzing social science papers, and this corpus contains papers published in interdisciplinary social science fields.

Links

Schema

Size of data

Papers used

References used

Sentences from Papers

Citation Contexts from Papers