

The unarXive data set contains

  • full text papers in plain text (papers/)
  • a database with bibliographic interlinkings (papers/refs.db)
  • pre-extracted citation-contexts (contexts/extracted_contexts.csv)
  • and a script for extracting citation-contexts (code/)

Data Sample

You can find a small sample of the data set in doc/unarXive_sample.tar.bz2. (The generation procedure of the sample is documented in unarXive_sample/paper_centered_sample/README within the archive; the code used for sampling is provided as well.)

Usage examples

Citation contexts

Load the pre-extracted citation contexts into a pandas data frame.

import csv
import pandas as pd

# read in unarXive citation contexts
df_contexts = pd.read_csv(
    'contexts/extracted_contexts.csv',
    names=[
        'cited_mag_id',
        'adjacent_citations_mag_ids',
        'citing_mag_id',
        'cited_arxiv_id',
        'adjacent_citations_arxiv_ids',
        'citing_arxiv_id',
        'citation_context',
    ],
    sep='\u241E',
    engine='python',
    quoting=csv.QUOTE_NONE,
)
# adjacent_*_ids values are separated by \u241F
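A common follow-up step is splitting the multi-valued adjacent_*_ids columns on their \u241F in-field separator. A minimal sketch, assuming the column names documented for extracted_contexts.csv (the sample values below are made up):

```python
import pandas as pd

# example rows mimicking the extracted_contexts.csv layout; multi-valued
# adjacent_*_ids fields use \u241F as an in-field separator
df_contexts = pd.DataFrame({
    'citing_arxiv_id': ['1808.07074'],
    'adjacent_citations_arxiv_ids': ['1705.06031\u241F1602.01183'],
})

# split the multi-valued column into Python lists
df_contexts['adjacent_citations_arxiv_ids'] = (
    df_contexts['adjacent_citations_arxiv_ids'].str.split('\u241F')
)
print(df_contexts.loc[0, 'adjacent_citations_arxiv_ids'])
# ['1705.06031', '1602.01183']
```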

References database

Get the citation counts of the most cited computer science papers.

$ sqlite3 refs.db
sqlite> select
            bibitem.cited_arxiv_id,
            count(distinct bibitem.citing_mag_id) as num_citations
        from
            bibitem
        join
            arxivmetadata
        on
            bibitem.cited_arxiv_id = arxivmetadata.arxiv_id
        where
            arxivmetadata.discipline = 'cs'
        group by
            bibitem.cited_arxiv_id
        order by
            num_citations desc;

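The same query can also be run programmatically with Python's sqlite3 module. A minimal sketch against an in-memory stand-in for refs.db — only the two tables and columns used above are modeled, and the inserted rows are made up; the real refs.db schema has more columns:

```python
import sqlite3

# tiny in-memory stand-in for refs.db with just the columns the
# citation-count query needs
con = sqlite3.connect(':memory:')
con.executescript('''
    create table bibitem (citing_mag_id text, cited_arxiv_id text);
    create table arxivmetadata (arxiv_id text, discipline text);
''')
con.executemany('insert into bibitem values (?, ?)', [
    ('m1', '1808.07074'), ('m2', '1808.07074'), ('m1', '1705.06031'),
])
con.executemany('insert into arxivmetadata values (?, ?)', [
    ('1808.07074', 'cs'), ('1705.06031', 'physics'),
])

# citation counts of the most cited computer science papers
rows = con.execute('''
    select bibitem.cited_arxiv_id,
           count(distinct bibitem.citing_mag_id) as num_citations
    from bibitem
    join arxivmetadata on bibitem.cited_arxiv_id = arxivmetadata.arxiv_id
    where arxivmetadata.discipline = 'cs'
    group by bibitem.cited_arxiv_id
    order by num_citations desc
''').fetchall()
print(rows)  # [('1808.07074', 2)]
```
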
Paper full texts

Extract citation contexts including identifiers of the citing and cited document.

See code/ in the data set.
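For orientation, locating citation markers in a full-text file can be sketched as below — this assumes citation markers of the form {{cite:<id>}} in the unarXive plain-text files; the sample sentence, window size, and function name are made up for illustration and are not the data set's own extraction code:

```python
import re

# assumed marker format: {{cite:<hex id>}}
CITE_PAT = re.compile(r'\{\{cite:([0-9a-f\-]+)\}\}')

def citation_contexts(text, margin=40):
    """Yield (cited_id, surrounding text window) for each citation marker."""
    for m in CITE_PAT.finditer(text):
        start = max(0, m.start() - margin)
        end = min(len(text), m.end() + margin)
        yield m.group(1), text[start:end]

text = 'Transformers {{cite:0a1b2c3d}} outperform RNNs on this task.'
for cited_id, context in citation_contexts(text):
    print(cited_id, context)
```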

(re)creating unarXive

Generating a data set for citation-based tasks from arXiv submissions.


  • software
    • Tralics (Ubuntu: # apt install tralics)
    • latexpand (Ubuntu: # apt install texlive-extra-utils)
    • Neural ParsCit
  • data


  • create virtual environment: $ python3 -m venv venv
  • activate virtual environment: $ source venv/bin/activate
  • install requirements: $ pip install -r requirements.txt
  • in the configuration
    • adjust line mag_db_uri = 'postgresql+psycopg2://XXX:YYY@localhost:5432/MAG'
    • adjust line doi_headers = { [...] working on XXX; mailto: XXX [...] }
    • depending on your arXiv title lookup DB, adjust line aid_db_uri = 'sqlite:///aid_title.db'
  • run Neural ParsCit web server (instructions)


  1. Extract plain texts and reference items
  2. Match reference items
  3. Clean txt output
  4. Extend ID mappings
    • Create mapping file (see note in docstring)
    • Extend IDs
  5. Extract citation contexts (run the script with -h for usage details)
$ source venv/bin/activate
$ python3 /tmp/arxiv-sources /tmp/arxiv-txt
$ python3 path /tmp/arxiv-txt 10
$ python3 /tmp/arxiv-txt
$ psql MAG
MAG=> \copy (select * from paperurls where sourceurl like '%arxiv.org%') to 'mag_id_2_arxiv_url.csv' with csv
$ python3
$ python3 /tmp/arxiv-txt/refs.db
$ python3 /tmp/arxiv-txt \
    --output_file context_sample.csv \
    --sample_size 100 \
    --context_margin_unit s \
    --context_margin_pre 2 \
    --context_margin_post 0
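The \copy step above dumps MAG paper URLs to mag_id_2_arxiv_url.csv; turning such URLs into arXiv IDs can be sketched as follows. The regex and the helper name are assumptions for illustration, not the pipeline's own code:

```python
import re

# hypothetical helper: extract an arXiv ID from a MAG source URL,
# handling abs/ and pdf/ URLs, version suffixes, and old-style IDs
ARXIV_URL_PAT = re.compile(
    r'arxiv\.org/(?:abs|pdf)/([\w.\-/]+?)(?:v\d+)?(?:\.pdf)?$'
)

def arxiv_id_from_url(url):
    m = ARXIV_URL_PAT.search(url)
    return m.group(1) if m else None

print(arxiv_id_from_url('https://arxiv.org/abs/1808.07074v2'))
# 1808.07074
print(arxiv_id_from_url('http://arxiv.org/pdf/cs/9901002.pdf'))
# cs/9901002
```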

Evaluation of citation quality and coverage

  • For a manual evaluation of the reference resolution, performed on a sample of 300 matchings, see doc/matching_evaluation/.
  • For a manual evaluation of citation coverage (compared to the MAG), performed on a sample of 300 citations, see doc/coverage_evaluation/.

Cite as

@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}