# Custom arXiv corpus for Annif
## Subject vocabulary

To create the subject vocabulary we first need the categories that are used in the arXiv articles. The categories can be browsed on the [Category Taxonomy page](https://arxiv.org/category_taxonomy). They are also defined in a Python source file in [arXiv's GitHub repository](https://github.com/arXiv/arxiv-base/blob/develop/arxiv/taxonomy/definitions.py), which can be easily downloaded.

To download the file we use the command-line tool `curl`. To run a (terminal) command from within a Jupyter Notebook cell, the command is prefixed with `!`:

In [1]:
# Download the file as it was at a specific commit:
# https://github.com/arXiv/arxiv-base/commit/8b5f5c404ebd5aa0c11db6b4db967d7290c33948
# (the URL is quoted so that the shell does not treat `?` as a wildcard)
! curl -H "Accept: application/vnd.github.VERSION.raw" "https://api.github.com/repos/arXiv/arxiv-base/contents/arxiv/taxonomy/definitions.py?ref=8b5f5c404ebd5aa0c11db6b4db967d7290c33948" > definitions.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90754  100 90754    0     0   223k      0 --:--:-- --:--:-- --:--:--  223k


Now we have the `definitions.py` file, where the categories are defined as a Python dictionary, and the dictionary can be conveniently imported into this Notebook:

In [2]:
from definitions import CATEGORIES


# Check out three of the categories:
list(CATEGORIES.items())[0:3]

[('acc-phys',
  {'name': 'Accelerator Physics',
   'in_archive': 'acc-phys',
   'is_active': False,
   'is_general': False}),
 ('adap-org',
  {'name': 'Adaptation, Noise, and Self-Organizing Systems',
   'in_archive': 'adap-org',
   'is_active': False,
   'is_general': False}),
 ('alg-geom',
  {'name': 'Algebraic Geometry',
   'in_archive': 'alg-geom',
   'is_active': False,
   'is_general': False})]

The categories are identified by codes like `acc-phys`, `adap-org` and `alg-geom`, and they also have more human-readable names. Next we use the categories to construct a [subject vocabulary](https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats) in TSV format.

In an Annif subject vocabulary the subjects are identified by URIs. As the arXiv taxonomy does not use URIs, we create (dummy) URIs by prepending each category code with `https://arxiv.org/list/`, which makes the URIs act as URLs to pages on the arXiv.org website. In the Annif vocabulary format the URIs are surrounded by angle brackets (`<>`). For the labels of the subjects we use the category names.

[Some categories](https://github.com/arXiv/arxiv-base/blob/163c1f9ddb60a017b21ff03190662ea171884530/arxiv/taxonomy/definitions.py#L2100-L2147) are apparently used for testing purposes, and some have been deprecated (they have the field `'is_active': False`), so we omit both kinds from our vocabulary.

The vocabulary is saved to the file `arxiv-vocab.tsv`.

In [3]:
URI_BASE = 'https://arxiv.org/list/'

with open('arxiv-vocab.tsv', 'w', encoding='utf-8') as vocab_file:
    for category, data in CATEGORIES.items():
        if data['in_archive'] == 'test':
            continue  # Skip categories belonging to the 'test' archive
        if not data['is_active']:
            continue  # Skip deprecated categories
        print('<' + URI_BASE + category + '>\t' + data['name'], file=vocab_file)


# We can check the first 10 lines of the file with the `head` command:
! head arxiv-vocab.tsv
# And count the lines in the file with the `wc -l` command:
! wc -l < arxiv-vocab.tsv

<https://arxiv.org/list/astro-ph.CO>	Cosmology and Nongalactic Astrophysics
<https://arxiv.org/list/astro-ph.EP>	Earth and Planetary Astrophysics
<https://arxiv.org/list/astro-ph.GA>	Astrophysics of Galaxies
<https://arxiv.org/list/astro-ph.HE>	High Energy Astrophysical Phenomena
<https://arxiv.org/list/astro-ph.IM>	Instrumentation and Methods for Astrophysics
<https://arxiv.org/list/astro-ph.SR>	Solar and Stellar Astrophysics
<https://arxiv.org/list/cond-mat.dis-nn>	Disordered Systems and Neural Networks
<https://arxiv.org/list/cond-mat.mes-hall>	Mesoscale and Nanoscale Physics
<https://arxiv.org/list/cond-mat.mtrl-sci>	Materials Science
<https://arxiv.org/list/cond-mat.other>	Other Condensed Matter
155


Now we have a subject vocabulary with 155 subjects for arXiv articles.
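
As a quick sanity check, a vocabulary line can be parsed back into its URI and label. The following is only a minimal sketch: the helper name `parse_vocab_line` is hypothetical (it is not part of Annif), and the sample line is copied from the output above.

```python
def parse_vocab_line(line):
    """Split one vocabulary line (URI in angle brackets, a tab, then the label).

    Hypothetical helper for illustration only; not part of Annif.
    """
    uri_field, label = line.rstrip('\n').split('\t', 1)
    if not (uri_field.startswith('<') and uri_field.endswith('>')):
        raise ValueError('URI is not wrapped in angle brackets: ' + uri_field)
    # Strip the surrounding angle brackets to get the bare URI
    return uri_field[1:-1], label


sample_line = '<https://arxiv.org/list/astro-ph.GA>\tAstrophysics of Galaxies\n'
uri, label = parse_vocab_line(sample_line)
print(uri)
print(label)
```

Running the same helper over every line of `arxiv-vocab.tsv` would raise an error on any malformed line.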


## Document corpus

The actual document corpus still needs to be constructed. This tutorial comes with a sample (100k random articles) of an arXiv dataset that is distributed on [Kaggle](https://www.kaggle.com/Cornell-University/arxiv) (you can freely download the full data-set of 1.7M+ articles if you wish).

Our data-set is compressed with `gzip`, so the first step is to unpack it:

In [4]:
! gzip -dk arxiv-metadata-oai-snapshot-100k-articles.json.gz

Now we have the metadata in the file `arxiv-metadata-oai-snapshot-100k-articles.json`. Take a look at the file by printing the first two lines of it:

In [5]:
! head -n 2 arxiv-metadata-oai-snapshot-100k-articles.json

{"id":"0711.2512","submitter":"Mark Hertzberg","authors":"Mark P. Hertzberg (MIT), Shamit Kachru (Stanford), Washington Taylor\n  (MIT), Max Tegmark (MIT)","title":"Inflationary Constraints on Type IIA String Theory","comments":"22 pages, 1 figure; v3: Updated to match version published in JHEP,\n  references added","journal-ref":"JHEP 0712:095,2007","doi":"10.1088/1126-6708/2007/12/095","report-no":"MIT-CTP-3905, SLAC-PUB-12999","categories":"hep-th astro-ph gr-qc","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","abstract":"  We prove that inflation is forbidden in the most well understood class of\nsemi-realistic type IIA string compactifications: Calabi-Yau compactifications\nwith only standard NS-NS 3-form flux, R-R fluxes, D6-branes and O6-planes at\nlarge volume and small string coupling. With these ingredients, the first\nslow-roll parameter satisfies epsilon >= 27/13 whenever V > 0, ruling out both\ninflation (including brane/anti-brane inflation) and de Sitter 

The file is in [newline-delimited JSON](http://ndjson.org/) format: each line has one JSON object representing the metadata of an article. Each article will become a document in our corpus.

The JSON object for the article metadata has several fields (`id`, `submitter`, `authors` etc.), but we are interested in three of them:
- `categories` contains the category codes, which will be mapped to our vocabulary subjects and used as the (gold-standard) subjects of a document,
- `title` and `abstract` are used to form the text content of a document.
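
To illustrate the format, the sketch below parses two shortened, made-up records held in a string instead of a file. Only the field names are taken from the real metadata; the values are invented for the example.

```python
import json

# Two made-up records in newline-delimited JSON (illustration only).
ndjson_text = (
    '{"id": "0000.0001", "title": "A title", '
    '"abstract": "  An abstract.\\n", "categories": "hep-th gr-qc"}\n'
    '{"id": "0000.0002", "title": "Another title", '
    '"abstract": "  More text.\\n", "categories": "cs.CL"}\n'
)

# Each line is a complete JSON object, so the lines are parsed one by one
records = [json.loads(line) for line in ndjson_text.splitlines()]

for record in records:
    # The category codes are separated by spaces within a single string
    print(record['categories'].split(), record['title'])
```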

First we construct a Python dictionary, which will be used to map category codes to URIs:

In [6]:
cat2uri = {}

for category in CATEGORIES.keys():
    cat2uri[category] = '<' + URI_BASE + category + '>'

# For example, the 'bayes-an' category code is mapped to the '<https://arxiv.org/list/bayes-an>' URI:
cat2uri['bayes-an']

'<https://arxiv.org/list/bayes-an>'

The JSON objects can be parsed with Python using the `loads()` function of the `json` library, so we import the library.

As there are newline characters (`\n`) in the titles and abstracts we also define a simple function for replacing them and all other whitespace characters with spaces (tabulator characters would cause problems in the corpus).

In [7]:
import json


def cleanup(text):
    return " ".join(text.split())

# For example:
cleanup('...chromodynamics is\npresented for...')

'...chromodynamics is presented for...'
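
The same function also takes care of tabs and repeated spaces, because `str.split()` without arguments splits on any run of whitespace. A small check with a made-up input string:

```python
def cleanup(text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops leading/trailing whitespace
    return " ".join(text.split())


messy = '  A title\twith a tab\nand   several\r\nline breaks  '
cleaned = cleanup(messy)
print(repr(cleaned))
```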

We aim to form not a single corpus, but three separate data-sets:
- train set (for training models; 60-70 % of documents)
- validation set (for optimizing model hyperparameters; 15-20 % of documents)
- test set (for evaluating model performance; 15-20 % of documents)

We could just split the articles randomly into these sets, but as the "creation time" of the articles is available in the metadata, we perform the split based on it: the oldest articles go to the train set, the newest to the test set, and the articles in between to the validation set. This puts the evaluation results on more realistic grounds, as the evaluation setting mimics real usage: the documents used for evaluation were created only after the model had been trained (including hyperparameter optimization) and deployed.

To decide which points in time to use as the data-set boundaries, it is useful to know the whole timespan of the creation dates of the articles. We use regular expressions and the `re` library to find the four-digit part (`'\d\d\d\d'`), which is the creation year, in the `created` field of the first version of each article (a datetime string looking e.g. like `"Tue, 27 Nov 2007 17:21:48 GMT"`). The years are stored in a set, and then the first and last years are printed out:

In [8]:
import re

years = set()

with open('arxiv-metadata-oai-snapshot-100k-articles.json', 'r', encoding='utf-8') as input_metadata_file:
    for line in input_metadata_file:
        article_metadata = json.loads(line)
        datetime_str = article_metadata['versions'][0]['created']
        # Raw string, so that `\d` is not treated as a string escape sequence
        year = re.findall(r'\d\d\d\d', datetime_str)[0]
        years.add(year)

print('Earliest: ' + min(years))
print('Latest: ' + max(years))

Earliest: 1990
Latest: 2021
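
As an alternative to the regular expression, the datetime string could be parsed with the standard library. This is a sketch that assumes every `created` value follows the same email-style date format as the example string `"Tue, 27 Nov 2007 17:21:48 GMT"`, which has not been verified for the whole data-set.

```python
from email.utils import parsedate_to_datetime

# Example string from the text above
created = 'Tue, 27 Nov 2007 17:21:48 GMT'

# Parse the email-style date string into a datetime object
created_dt = parsedate_to_datetime(created)
print(created_dt.year)
```

Parsing the full date would also allow splitting on month or day boundaries instead of whole years.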


The articles we have were created in 1990-2021. At first we can just try some years, e.g. 2010 and 2015, as the boundaries between the three sets, and then refine the boundaries to get the desired ratios of documents for the sets (it turns out that the years 2015 and 2018 produce approximately 60-20-20 ratios).
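
One way to find suitable boundary years without rewriting the corpus files on each try is to count the articles per year and compute the ratios for candidate boundaries. The sketch below is an illustration only: `split_ratios` is a hypothetical helper and the per-year counts are made up. With the real data, the counts could be collected in the loop above by converting each year with `int()`.

```python
from collections import Counter


def split_ratios(year_counts, max_train_year, max_validation_year):
    """Return (train, validation, test) fractions for the given boundary years."""
    total = sum(year_counts.values())
    train = sum(n for y, n in year_counts.items() if y <= max_train_year)
    validation = sum(n for y, n in year_counts.items()
                     if max_train_year < y <= max_validation_year)
    test = total - train - validation
    return train / total, validation / total, test / total


# Made-up per-year article counts, for illustration only
toy_counts = Counter({2013: 30, 2014: 30, 2016: 10, 2018: 10, 2019: 12, 2021: 8})

print(split_ratios(toy_counts, 2015, 2018))
print(split_ratios(toy_counts, 2013, 2016))
```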


Now we are ready to create the corpus!


We use the [Short text document corpus format](https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats#short-text-document-corpus-tsv-file). Like the subject vocabulary, it is a TSV file, but here the first column contains the text of the document, and the second column contains a whitespace-separated list of subject URIs for the document.

The text of each document is formed by concatenating the title and abstract. If an article has a category code that is missing from the `cat2uri` mapping, a warning is printed out (`Unknown category: XXX`) and that article is omitted from the data-sets.
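
The per-article logic described above can be tried out on a single record first. The following is a minimal sketch under stated assumptions: `make_corpus_line` is a hypothetical helper, and both the record and the small mapping are made up for the example.

```python
def make_corpus_line(title, abstract, categories, cat2uri):
    """Build one corpus line: the document text, a tab, then the subject URIs.

    Raises KeyError if any category code is missing from the mapping.
    """
    # Normalize all whitespace so that no tabs or newlines end up in the text
    text = ' '.join((title + ' ' + abstract).split())
    uris = [cat2uri[cat] for cat in categories.split()]
    return text + '\t' + ' '.join(uris)


# Made-up mapping and record, for illustration only
toy_cat2uri = {
    'hep-th': '<https://arxiv.org/list/hep-th>',
    'gr-qc': '<https://arxiv.org/list/gr-qc>',
}
corpus_line = make_corpus_line('A title', '  An abstract.\n', 'hep-th gr-qc', toy_cat2uri)
print(repr(corpus_line))
```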

In [9]:
max_train_year = 2015  # refined after a few tries
max_validation_year = 2018  # refined after a few tries


# Open the JSON metadata file for reading:
with open('arxiv-metadata-oai-snapshot-100k-articles.json', 'r', encoding='utf-8') as input_metadata_file:
    # Open the TSV data-sets files for writing:
    with \
        open('arxiv-train.tsv', 'w', encoding='utf-8') as output_train_file, \
        open('arxiv-validate.tsv', 'w', encoding='utf-8') as output_validate_file, \
        open('arxiv-test.tsv', 'w', encoding='utf-8') as output_test_file:
        # Loop line-by-line:
        for line in input_metadata_file:
            article_metadata = json.loads(line)
            # Join with a space so the title and abstract never run together
            text = cleanup(
                article_metadata['title'] + ' ' + article_metadata['abstract'])

            # Get the URIs from arXiv categories:
            try:
                uris = [cat2uri[cat] for cat in article_metadata['categories'].split()]
            except KeyError:
                print('Unknown category: ' + article_metadata['categories'])
                continue

            # Get the creation year of article:
            datetime_str = article_metadata['versions'][0]['created']
            # Raw string, so that `\d` is not treated as a string escape sequence
            year = int(re.findall(r'\d\d\d\d', datetime_str)[0])

            # Print to train, validate or test set file:
            if year <= max_train_year:
                print(text + '\t' + ' '.join(uris), file=output_train_file)
            elif max_train_year < year <= max_validation_year:
                print(text + '\t' + ' '.join(uris), file=output_validate_file)
            else:
                print(text + '\t' + ' '.join(uris), file=output_test_file)

In [10]:
# Check how many documents there are in each corpus file, and refine the data-set time boundaries if needed in the cell above:
! wc -l arxiv-train.tsv
! wc -l arxiv-validate.tsv
! wc -l arxiv-test.tsv

59140 arxiv-train.tsv
20097 arxiv-validate.tsv
20763 arxiv-test.tsv
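
The line counts above can be turned into ratios to confirm that the split is close to the 60-20-20 target. This small sketch simply copies the numbers from the `wc -l` output:

```python
# Line counts copied from the `wc -l` output above
counts = {'train': 59140, 'validate': 20097, 'test': 20763}

total = sum(counts.values())
print('total:', total)
for name, count in counts.items():
    # Print each set's share of all documents as a percentage
    print(f'{name}: {count / total:.1%}')
```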


In [11]:
# See what a few of the documents in the train set look like:
! head -n 3 arxiv-train.tsv

Inflationary Constraints on Type IIA String Theory We prove that inflation is forbidden in the most well understood class of semi-realistic type IIA string compactifications: Calabi-Yau compactifications with only standard NS-NS 3-form flux, R-R fluxes, D6-branes and O6-planes at large volume and small string coupling. With these ingredients, the first slow-roll parameter satisfies epsilon >= 27/13 whenever V > 0, ruling out both inflation (including brane/anti-brane inflation) and de Sitter vacua in this limit. Our proof is based on the dependence of the 4-dimensional potential on the volume and dilaton moduli in the presence of fluxes and branes. We also describe broader classes of IIA models which may include cosmologies with inflation and/or de Sitter vacua. The inclusion of extra ingredients, such as NS 5-branes and geometric or non-geometric NS-NS fluxes, evades the assumptions used in deriving the no-go theorem. We focus on NS 5-branes and outline how such ingredients may prove 