unarXive
Access
- Data Set on Zenodo: full / permissively licensed subset
- Data Sample
- ML Data on Hugging Face: citation recommendation / IMRaD classification
Documentation
- Papers
  - Scientometrics (2020)
  - JCDL 2023 (preprint)
- Data Format
- Usage
- Development
- Cite
Data
unarXive contains
- 1.9 M structured paper full-texts, containing
  - 63 M references (28 M linked to OpenAlex)
  - 134 M in-text citation markers (65 M linked)
  - 9 M figure captions
  - 2 M table captions
  - 742 M pieces of mathematical notation preserved as LaTeX
Comprehensive documentation of the data format can be found here.
You can find a data sample here.
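For a first look at a downloaded file, the sketch below lists which top-level keys occur in its records. It assumes the data is stored as JSON Lines (one JSON object per line); that is an assumption to verify against the data format documentation, and the file name in the usage comment is hypothetical. No field names are assumed.

```python
import json
from collections import Counter
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file.

    Assumption: the file is JSON Lines, one record per line.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def top_level_keys(path: str) -> Counter:
    """Count how often each top-level key appears across all records."""
    keys = Counter()
    for record in iter_jsonl(path):
        keys.update(record.keys())
    return keys


# Hypothetical usage; replace the path with a file from the data sample:
# print(top_level_keys("sample.jsonl").most_common())
```

Reading line by line keeps memory use flat, which matters for a data set of this size.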
Usage
Hugging Face Datasets
To use unarXive for citation recommendation or IMRaD classification, you can load our Hugging Face datasets directly.
For example, for citation recommendation:
from datasets import load_dataset
citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label') # assign target label column
citrec_data = citrec_data.remove_columns('_id') # remove sample ID column
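Before training on the loaded data, it can help to check how the target labels are distributed. The helper below is a minimal sketch, not part of unarXive or the `datasets` library: `label_distribution` is a hypothetical name, and it only assumes that each row is a mapping with a `label` entry, as in the snippet above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def label_distribution(
    rows: Iterable[Mapping[str, Any]], label_key: str = "label"
) -> Dict[Any, float]:
    """Return the relative frequency of each label value found in `rows`."""
    counts = Counter(row[label_key] for row in rows)
    total = sum(counts.values())
    if total == 0:
        # No rows at all: return an empty distribution instead of dividing by zero.
        return {}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Toy rows standing in for real samples; real rows carry more columns.
    toy_rows = [{"label": 0}, {"label": 1}, {"label": 1}, {"label": 1}]
    print(label_distribution(toy_rows))
```

With the dataset from above, a call such as `label_distribution(citrec_data['train'])` should work, assuming the dataset exposes a split named `train`; iterating over a split yields one dictionary per sample.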
Development
For instructions on how to re-create or extend unarXive, see src/.
Versions
- Current release (1991–2022): see the Access section above
- Previous releases (old format):
Development Status
See issues.
Cite as
@article{Saier2020unarXive,
author = {Saier, Tarek and F{\"{a}}rber, Michael},
title = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
journal = {Scientometrics},
year = {2020},
volume = {125},
number = {3},
pages = {3085--3108},
month = dec,
issn = {1588-2861},
doi = {10.1007/s11192-020-03382-z}
}