# Compare INSPIRE record numbers obtained from INSPIRE and HEPData

This Jupyter notebook checks whether the INSPIRE record numbers obtained from [INSPIRE](https://inspirehep.net/) agree with those obtained from [HEPData](https://www.hepdata.net/).

Discrepancies usually occur because an INSPIRE record has changed its record number.

In [1]:
import requests
from time import sleep
import math

## Get records from INSPIRE

Get the record numbers of all INSPIRE records that have a HEPData record attached, using the [INSPIRE API](https://github.com/inspirehep/rest-api-doc).

First define the query and get the number of matching INSPIRE records.

In [2]:
query = 'external_system_identifiers.schema:HEPData'

In [3]:
url = 'https://inspirehep.net/api/literature'
payload = {'q': query, 'size': 1, 'fields': 'control_number'}
response = requests.get(url, params=payload)
num_results = response.json()['hits']['total']
print('num_results = {}'.format(num_results))

num_results = 9924
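
The notebook reads two parts of the JSON response: `hits.total` for the result count and, for each entry of `hits.hits`, `metadata.control_number` for the record number. A minimal sketch of that access pattern follows; the sample payload and the helper `extract_control_numbers` are hypothetical and only mirror the keys used in this notebook.

```python
# Hypothetical, trimmed response body; real responses carry more fields.
sample_response = {
    'hits': {
        'total': 2,
        'hits': [
            {'metadata': {'control_number': 111}},
            {'metadata': {'control_number': 222}},
        ],
    }
}


def extract_control_numbers(body):
    """Return the control numbers found in a search response body (hypothetical helper)."""
    return [int(hit['metadata']['control_number']) for hit in body['hits']['hits']]


print(sample_response['hits']['total'])          # 2
print(extract_control_numbers(sample_response))  # [111, 222]
```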


Now get the IDs of the INSPIRE records in chunks of 1000.

In [4]:
inspire_ids = []
size = 1000  # number of records requested per page
pages = math.ceil(num_results / size)
print('pages = {}'.format(pages))
for page in range(1, pages + 1):
    print('page = {}'.format(page))
    payload = {'q': query, 'size': size, 'fields': 'control_number', 'page': page, 'sort': 'mostrecent'}
    response = requests.get(url, params=payload)
    data = response.json()['hits']['hits']
    for hit in data:
        inspire_ids.append(int(hit['metadata']['control_number']))
    sleep(1)  # pause between successive requests
inspire_ids.sort()
print('num_results = {}'.format(len(inspire_ids)))

pages = 10
page = 1
page = 2
page = 3
page = 4
page = 5
page = 6
page = 7
page = 8
page = 9
page = 10
num_results = 9924
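
The number of pages comes from a ceiling division of the result count by the page size. As a small sketch of that arithmetic, with a hypothetical helper `num_pages` that is not part of the notebook:

```python
import math


def num_pages(num_results, size):
    """Return how many pages of `size` records cover `num_results` records."""
    if size <= 0:
        raise ValueError('size must be positive')
    return math.ceil(num_results / size)


print(num_pages(9924, 1000))  # 10, matching the output above
```

The last page is only partly filled: with 9924 results and 1000 records per page, it holds the remaining 924 records.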


## Get records from HEPData

Get a list of the INSPIRE record numbers of all HEPData records using the PostgreSQL database.

In [5]:
url = 'https://hepdata.net/search/ids'
payload = {'inspire_ids': 'true'}
response = requests.get(url, params=payload)
hepdata_ids = response.json()
hepdata_ids.sort()
print(len(hepdata_ids))

9924


Get a list of the INSPIRE record numbers of all HEPData records using the OpenSearch index.

In [6]:
payload = {'inspire_ids': 'true', 'use_es': 'true'}
response = requests.get(url, params=payload)
hepdata_ids_es = response.json()
hepdata_ids_es.sort()
print(len(hepdata_ids_es))

9924


Check that the two methods return the same results.

In [7]:
hepdata_ids == hepdata_ids_es

True

## Compare lists of record numbers and identify differences

In [8]:
inspire_ids == hepdata_ids

True

Print the HEPData records whose INSPIRE record number is missing from the INSPIRE list.

In [9]:
for hepdata_id in hepdata_ids:
    if hepdata_id not in inspire_ids:
        print('The HEPData record https://hepdata.net/record/ins{}'.format(hepdata_id))
        print('is not linked from https://inspirehep.net/record/{}'.format(hepdata_id))
        print()

Print the INSPIRE records whose record number is missing from the HEPData list.

In [10]:
for inspire_id in inspire_ids:
    if inspire_id not in hepdata_ids:
        print('https://inspirehep.net/record/{}'.format(inspire_id))