

Code for generating a data set for citation-based tasks using arXiv submissions.

Data Sample

You can find a small sample of the data set in doc/unarXive_sample.tar.bz2. (The procedure used to generate the sample is documented in unarXive_sample/paper_centered_sample/README within the archive, which also contains the code used for sampling.)

(re)creating unarXive


  • software
    • Tralics (Ubuntu: # apt install tralics)
    • latexpand (Ubuntu: # apt install texlive-extra-utils)
    • Neural ParsCit
  • data


  • create virtual environment: $ python3 -m venv venv
  • activate virtual environment: $ source venv/bin/activate
  • install requirements: $ pip install -r requirements.txt
  • in
    • adjust line mag_db_uri = 'postgresql+psycopg2://XXX:YYY@localhost:5432/MAG'
    • adjust line doi_headers = { [...] working on XXX; mailto: XXX [...] }
    • depending on your arXiv title lookup DB, adjust line aid_db_uri = 'sqlite:///aid_title.db'
  • run Neural ParsCit web server (instructions)
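
Before running the pipeline, it can help to confirm that the placeholder values in the lines above were actually replaced. The following is a minimal sketch using only the Python standard library. The helper name `has_placeholders` is hypothetical and not part of the repository, and the sketch assumes that `XXX` and `YYY` are the only placeholder tokens used for credentials.

```python
from urllib.parse import urlsplit

# Placeholder tokens as they appear in the template lines above.
PLACEHOLDERS = {"XXX", "YYY"}


def has_placeholders(db_uri: str) -> bool:
    """Return True if the URI still carries placeholder credentials.

    Hypothetical helper for illustration; not part of the repository.
    """
    parts = urlsplit(db_uri)
    # username/password are None when the URI has no credentials (e.g. SQLite).
    return parts.username in PLACEHOLDERS or parts.password in PLACEHOLDERS


if __name__ == "__main__":
    print(has_placeholders("postgresql+psycopg2://XXX:YYY@localhost:5432/MAG"))
    print(has_placeholders("sqlite:///aid_title.db"))
```

The SQLite URI has no credentials, so only the PostgreSQL URI can trigger the check.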


  1. Extract plain texts and reference items with: (or +
  2. Match reference items with:
  3. Clean txt output with:
  4. Extend ID mappings
    • Create mapping file with: (see note in docstring)
    • Extend IDs with
  5. Extract citation contexts with: (see $ -h for usage details)
$ source venv/bin/activate
$ python3 /tmp/arxiv-sources /tmp/arxiv-txt
$ python3 path /tmp/arxiv-txt 10
$ python3 /tmp/arxiv-txt
$ psql MAG
MAG=> \copy (select * from paperurls where sourceurl like '') to 'mag_id_2_arxiv_url.csv' with csv
$ python3
$ python3 /tmp/arxiv-txt/refs.db
$ python3 /tmp/arxiv-txt \
    --output_file context_sample.csv \
    --sample_size 100 \
    --context_margin_unit s \
    --context_margin_pre 2 \
    --context_margin_post 0
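
To illustrate what the margin options in the last command express, here is a minimal sketch of cutting a context window around a citation marker. It assumes that the unit `s` stands for sentences and that the margins count sentences before and after the citing sentence. The function `citation_context`, the marker string `CIT1`, and the naive sentence splitter are all hypothetical and are not the repository's implementation.

```python
import re


def citation_context(text, marker, pre=2, post=0):
    """Return the citing sentence plus `pre` sentences before and `post` after.

    Hypothetical illustration; the naive regex splitter is an assumption.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            start = max(0, i - pre)
            end = min(len(sentences), i + post + 1)
            return " ".join(sentences[start:end])
    return None  # marker not found


if __name__ == "__main__":
    doc = "First. Second. Third as shown in CIT1. Fourth."
    print(citation_context(doc, "CIT1", pre=2, post=0))
```

A real implementation would need a more robust sentence splitter, since abbreviations and decimal numbers break the simple regex used here.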

Evaluation of citation quality and coverage

  • For the manual evaluation of reference resolution, which we performed on a sample of 300 matchings, see doc/matching_evaluation/.
  • For the manual evaluation of citation coverage (compared to the MAG), which we performed on a sample of 300 citations, see doc/coverage_evaluation/.

Cite as

  author        = {Tarek Saier and
                   Michael F{\"{a}}rber},
  title         = {unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata},
  journal       = {Scientometrics},
  year          = {2020},
  month         = mar,
  doi           = {10.1007/s11192-020-03382-z}

