# Custom arXiv corpus for Annif
## Subject vocabulary

To create the subject vocabulary we first need the categories that are used in the arXiv articles. The categories can be browsed on the [Category Taxonomy page](https://arxiv.org/category_taxonomy). They are also defined in a Python source file in [arXiv's GitHub repository](https://github.com/arXiv/arxiv-base/blob/develop/arxiv/taxonomy/definitions.py), which can be easily downloaded.

To download the file we use the command-line tool `curl`. To run a (terminal) command from within a Jupyter Notebook cell, the command is prefixed with `!`:

In [12]:
! curl https://raw.githubusercontent.com/arXiv/arxiv-base/develop/arxiv/taxonomy/definitions.py > definitions.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 89696  100 89696    0     0   222k      0 --:--:-- --:--:-- --:--:--  222k
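
If `curl` is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download` is an assumption made here for illustration and is not part of the original notebook.

```python
import shutil
import urllib.request


def download(url, destination):
    """Fetch `url` and save the response body to the file `destination`."""
    with urllib.request.urlopen(url) as response, \
            open(destination, 'wb') as out_file:
        # Stream the response to disk without loading it all into memory
        shutil.copyfileobj(response, out_file)
    return destination
```

Calling `download()` with the raw GitHub URL used in the `curl` command and `'definitions.py'` as the destination should produce the same file.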


Now we have the `definitions.py` file, where the categories are defined as a Python dictionary, and the dictionary can be conveniently imported into this Notebook:

In [13]:
from definitions import CATEGORIES


# Check out three of the categories:
list(CATEGORIES.items())[0:3]

[('acc-phys',
  {'name': 'Accelerator Physics',
   'in_archive': 'acc-phys',
   'is_active': False,
   'is_general': False}),
 ('adap-org',
  {'name': 'Adaptation, Noise, and Self-Organizing Systems',
   'in_archive': 'adap-org',
   'is_active': False,
   'is_general': False}),
 ('alg-geom',
  {'name': 'Algebraic Geometry',
   'in_archive': 'alg-geom',
   'is_active': False,
   'is_general': False})]

The categories are identified with codes like `acc-phys`, `adap-org` and `alg-geom`, and they also have more human-readable names. Next we use the categories to construct a [subject vocabulary](https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats) in TSV format.

In an Annif subject vocabulary the subjects are identified by URIs. As the arXiv taxonomy does not use URIs, we create (dummy) URIs by prepending `https://arxiv.org/list/` to each category code, which makes the URIs act as URLs to pages on the arXiv.org website. In the vocabulary file the URIs are enclosed in angle brackets (`<>`). For the labels of the subjects we use the category names.

[Some categories](https://github.com/arXiv/arxiv-base/blob/163c1f9ddb60a017b21ff03190662ea171884530/arxiv/taxonomy/definitions.py#L2100-L2147) are apparently used for testing purposes and some have been deprecated (they have the field `'is_active': False`), so we omit both groups from our vocabulary.
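
To see how many categories each rule affects, the entries can be tallied by status. This is only a sketch: `sample_categories` is a tiny hand-made stand-in for the real `CATEGORIES` dictionary (the `test.example` entry is invented), and the helper name `category_status` is an assumption for illustration.

```python
from collections import Counter

# Hypothetical stand-in for CATEGORIES; only the field names
# ('name', 'in_archive', 'is_active') follow the real dictionary.
sample_categories = {
    'acc-phys': {'name': 'Accelerator Physics',
                 'in_archive': 'acc-phys', 'is_active': False},
    'astro-ph.GA': {'name': 'Astrophysics of Galaxies',
                    'in_archive': 'astro-ph', 'is_active': True},
    'test.example': {'name': 'Invented test category',
                     'in_archive': 'test', 'is_active': True},
}


def category_status(data):
    """Classify a category entry as 'test', 'deprecated' or 'kept'."""
    if data['in_archive'] == 'test':
        return 'test'
    if not data['is_active']:
        return 'deprecated'
    return 'kept'


status_counts = Counter(
    category_status(data) for data in sample_categories.values())
print(status_counts)
```

Run against the real `CATEGORIES`, the `'kept'` count should equal the number of lines written to the vocabulary file, because the checks are applied in the same order.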

The vocabulary is saved to the file `arxiv-vocab.tsv`.

In [59]:
URI_BASE = 'https://arxiv.org/list/'

with open('arxiv-vocab.tsv', 'w', encoding='utf-8') as vocab_file:
    for category, data in CATEGORIES.items():
        if data['in_archive'] == 'test':
            continue  # Skip categories belonging to the 'test' archive
        if not data['is_active']:
            continue  # Skip deprecated categories
        print('<' + URI_BASE + category + '>\t' + data['name'], file=vocab_file) 


# We can check the first 10 lines of the file with the `head` command:
! head arxiv-vocab.tsv
# And count the lines in the file with the `wc -l` command:
! wc -l < arxiv-vocab.tsv 

<https://arxiv.org/list/astro-ph.CO>	Cosmology and Nongalactic Astrophysics
<https://arxiv.org/list/astro-ph.EP>	Earth and Planetary Astrophysics
<https://arxiv.org/list/astro-ph.GA>	Astrophysics of Galaxies
<https://arxiv.org/list/astro-ph.HE>	High Energy Astrophysical Phenomena
<https://arxiv.org/list/astro-ph.IM>	Instrumentation and Methods for Astrophysics
<https://arxiv.org/list/astro-ph.SR>	Solar and Stellar Astrophysics
<https://arxiv.org/list/cond-mat.dis-nn>	Disordered Systems and Neural Networks
<https://arxiv.org/list/cond-mat.mes-hall>	Mesoscale and Nanoscale Physics
<https://arxiv.org/list/cond-mat.mtrl-sci>	Materials Science
<https://arxiv.org/list/cond-mat.other>	Other Condensed Matter
155


We have 155 subjects in our arXiv vocabulary.
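
As an optional sanity check, the vocabulary lines can be parsed back and their shape verified. The sketch below assumes the two-column layout written above (URI in angle brackets, a tab, then the label); the helper name `parse_vocab_line` is hypothetical, and the two sample lines are copied from the `head` output.

```python
def parse_vocab_line(line):
    """Split one vocabulary line into (uri, label), checking its shape."""
    # Unpacking raises ValueError unless there are exactly two columns
    uri, label = line.rstrip('\n').split('\t')
    if not (uri.startswith('<') and uri.endswith('>')):
        raise ValueError('URI is not wrapped in angle brackets: ' + uri)
    if not label:
        raise ValueError('Empty label for ' + uri)
    return uri[1:-1], label


# Two lines copied from the output above
sample_lines = [
    '<https://arxiv.org/list/astro-ph.CO>\tCosmology and Nongalactic Astrophysics\n',
    '<https://arxiv.org/list/astro-ph.EP>\tEarth and Planetary Astrophysics\n',
]
vocab = dict(parse_vocab_line(line) for line in sample_lines)
print(vocab['https://arxiv.org/list/astro-ph.CO'])
```

To check the real file, the same function can be applied to each line of an opened `arxiv-vocab.tsv`; if `len(vocab)` matches the line count, there are no duplicate URIs.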


## Document corpus

The actual document corpus still needs to be constructed. This tutorial comes with a sample (100k articles) of an arXiv dataset that is distributed on [Kaggle](https://www.kaggle.com/Cornell-University/arxiv) (you can freely download the full dataset of 1.7M+ articles if you wish).

Our dataset is compressed with `gzip`, so the first step is to unpack it:

In [64]:
! gzip -d arxiv-metadata-oai-snapshot-100k-articles.json.gz

Now we have the metadata in the file `arxiv-metadata-oai-snapshot-100k-articles.json`. Take a look at the file by printing the first two lines of it:

In [66]:
! head -n 2 arxiv-metadata-oai-snapshot-100k-articles.json

{"id":"0704.0001","submitter":"Pavel Nadolsky","authors":"C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan","title":"Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies","comments":"37 pages, 15 figures; published version","journal-ref":"Phys.Rev.D76:013009,2007","doi":"10.1103/PhysRevD.76.013009","report-no":"ANL-HEP-PR-07-12","categories":"hep-ph","license":null,"abstract":"  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predic

The file is in [newline-delimited JSON](http://ndjson.org/) format: each line has one JSON object representing the metadata of an article. Each article will become a document in our corpus.

The JSON object for the article metadata has several fields (`id`, `submitter`, `authors` etc.), but we are interested in three of them:
- `categories` contains the category codes, which will be mapped to our vocabulary subjects and used as the (gold-standard) subjects of a document,
- `title` and `abstract` are used to form the text content of a document.
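
As a minimal sketch of how these fields can be read, the snippet below parses two shortened, made-up records laid out like the real file (one JSON object per line). The IDs, titles and abstracts are invented for illustration; real records carry more fields and much longer abstracts.

```python
import io
import json

# Two invented records in newline-delimited JSON; io.StringIO stands in
# for an opened metadata file.
sample_ndjson = io.StringIO(
    '{"id": "0000.0001", "title": "First\\n  title", '
    '"abstract": "  First abstract.\\n", "categories": "hep-ph"}\n'
    '{"id": "0000.0002", "title": "Second title", '
    '"abstract": "  Second abstract.\\n", "categories": "math.CO cs.CG"}\n'
)

# Each line is parsed independently into a Python dictionary
records = [json.loads(line) for line in sample_ndjson]
for record in records:
    # 'categories' is a single string of space-separated category codes
    print(record['id'], record['categories'].split())
```

Note that the titles and abstracts keep their embedded newline characters after parsing, so they still need cleaning before being written to a TSV file.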

First we construct a Python dictionary, which will be used to map each category code to its URI:

In [67]:
cat2uri = {}

for category in CATEGORIES.keys():
    cat2uri[category] = '<' + URI_BASE + category + '>'

# For example the 'bayes-an' category code is mapped to the '<https://arxiv.org/list/bayes-an>' URI:
cat2uri['bayes-an']  

'<https://arxiv.org/list/bayes-an>'

The JSON objects can be parsed in Python using the `loads()` function of the `json` library, so we import the library.

As there are newline characters (`\n`) in the titles and abstracts, we also define a simple function that replaces them, and every other run of whitespace characters, with a single space (tab characters in particular would break the TSV format of the corpus).

In [70]:
import json


def cleanup(text):
    return " ".join(text.split())

# For example:
cleanup('...chromodynamics is\npresented for...')

'...chromodynamics is presented for...'

Now we are ready to create the corpus!


We use the [Short text document corpus format](https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats#short-text-document-corpus-tsv-file), which, like the subject vocabulary, is a TSV file: the first column contains the text of the document, and the second column contains a whitespace-separated list of subject URIs for the document.

The text of each document is formed by concatenating the title and abstract. If an article has a category code that is missing from the `cat2uri` mapping (which covers every category in `CATEGORIES`, including the deprecated ones), a warning is printed (`Unknown category: XXX`) and that article is omitted from the corpus.

In [71]:
# Open file for reading the input JSON metadata file:
with open('arxiv-metadata-oai-snapshot-100k-articles.json', 'r', encoding='utf-8') as input_metadata_file:
    # Open file for writing the output TSV corpus file:
    with open('arxiv-corpus.tsv', 'w', encoding='utf-8') as output_corpus_file:
        # Loop line-by-line:
        for line in input_metadata_file:
            article_metadata = json.loads(line)
            # Join title and abstract with a space so the words never run together
            text = cleanup(
                article_metadata['title'] + ' ' + article_metadata['abstract'])
            try:
                uris = [cat2uri[cat] for cat in article_metadata['categories'].split()]
            except KeyError:
                print('Unknown category: ' + article_metadata['categories'])
                continue
            print(text + '\t' + ' '.join(uris), file=output_corpus_file)

In [72]:
# Check what a few of the documents in the corpus look like:
! head -n 3 arxiv-corpus.tsv

Calculation of prompt diphoton production cross sections at Tevatron and LHC energies A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity

In [73]:
# And how many documents there are in the corpus:
! wc -l < arxiv-corpus.tsv

100000
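
Beyond counting lines, it can be useful to see how the subjects are distributed over the documents. The sketch below assumes the two-column corpus layout described above; the helper name `count_subjects` is hypothetical and the sample rows are made up for illustration.

```python
from collections import Counter


def count_subjects(corpus_lines):
    """Count how many documents carry each subject URI."""
    counts = Counter()
    for line in corpus_lines:
        # Each row is: document text, a tab, then space-separated URIs
        text, uris = line.rstrip('\n').split('\t')
        counts.update(uris.split())
    return counts


# Invented rows in the corpus layout
sample_rows = [
    'First title First abstract.\t<https://arxiv.org/list/hep-ph>\n',
    'Second title Second abstract.\t'
    '<https://arxiv.org/list/math.CO> <https://arxiv.org/list/cs.CG>\n',
    'Third title Third abstract.\t<https://arxiv.org/list/hep-ph>\n',
]
subject_counts = count_subjects(sample_rows)
print(subject_counts.most_common(1))
```

For the real corpus, an opened `arxiv-corpus.tsv` file object can be passed to `count_subjects()` in place of `sample_rows`.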
