
unarXive

Access

Documentation

Data

[Figure: unarXive schema]

unarXive contains

  • 1.9 M structured paper full-texts, containing
    • 63 M references (28 M linked to OpenAlex)
    • 134 M in-text citation markers (65 M linked)
    • 9 M figure captions
    • 2 M table captions
    • 742 M pieces of mathematical notation preserved as LaTeX

Comprehensive documentation of the data format can be found here.

You can find a data sample here.
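For a first look at a downloaded sample, a small script along the following lines may help. It is a minimal sketch that assumes the sample is stored as JSON Lines (one JSON object per line); that layout and the helper name `summarize_jsonl` are assumptions made for illustration, and the data format documentation remains the authoritative reference. The script deliberately does not rely on any particular field names and only reports which top-level keys occur.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count the records and top-level keys in a JSON Lines file.

    Assumes one JSON object per line (an assumption of this sketch,
    not something guaranteed by this README).
    """
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                # tally how often each top-level key appears
                key_counts.update(record.keys())
    return n_records, key_counts
```

Calling `summarize_jsonl` with the path to a local sample file returns the number of records and a `Counter` of top-level keys, which can then be compared against the documented schema.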

Usage

Hugging Face Datasets

If you want to use unarXive for citation recommendation or IMRaD classification, you can use our Hugging Face datasets directly.

For example, in the case of citation recommendation:

from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')         # remove sample ID column
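To make the two preprocessing steps above more concrete, the following plain-Python sketch mimics their effect on a few toy rows. It is a stand-in for illustration only, not the `datasets` library's implementation: the helpers `class_encode` and `remove_column`, the `text` column, and the toy values are hypothetical, and assigning integer IDs in sorted label order is an assumption of this sketch.

```python
def class_encode(rows, column):
    """Replace string labels in `column` with integer class IDs.

    IDs follow the sorted order of the distinct label names
    (an assumption of this sketch).
    """
    names = sorted({str(row[column]) for row in rows})
    name_to_id = {name: i for i, name in enumerate(names)}
    encoded = [{**row, column: name_to_id[str(row[column])]} for row in rows]
    return encoded, names


def remove_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


# Hypothetical toy rows; only the '_id' and 'label' column names
# come from the snippet above.
toy_rows = [
    {"_id": "s1", "text": "first citation context", "label": "paper_b"},
    {"_id": "s2", "text": "second citation context", "label": "paper_a"},
    {"_id": "s3", "text": "third citation context", "label": "paper_b"},
]

encoded_rows, label_names = class_encode(toy_rows, "label")
clean_rows = remove_column(encoded_rows, "_id")
```

After these steps each row holds only the model input and an integer target, and `label_names` maps each integer back to the original label.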

Development

For instructions on how to re-create or extend unarXive, see src/.

Versions

Development Status

See issues.

Cite as

@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}

About

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
