# Import data from zbMath
Data in [zbMath Open](https://www.zbmath.org/) can be accessed through the [zbMath Open OAI-PMH](https://oai.zbmath.org/) service, which implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Schubotz and Teschke, 2021]. The service is open and subject to certain [terms and conditions](https://oai.zbmath.org/static/terms-and-conditions.html).

**Contrary to the documentation, the API always returns XML, not JSON.**

In [1]:
# API URLs
API_URL='https://oai.zbmath.org/v1/' # the API endpoint
# API namespaces
OAI_NS = 'http://www.openarchives.org/OAI/2.0/' # the OAI namespace
OAI_ZB_PREVIEW_NS = 'https://zbmath.org/OAI/2.0/oai_zb_preview/'
ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'
# text shown in zbMath Open when there's a license conflict
CONFLICT_TXT = 'zbMATH Open Web Interface contents unavailable due to conflicting licenses.'
# which tags to keep
TAGS = ['author', 'document_title', 'source', 'classifications', 'keywords', 'doi', 'publication_year']

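## Get a list of publication IDs
The following example retrieves the first 100 IDs of publications from 2021. Note that in OAI-PMH the `from` and `until` arguments select on the record datestamp, which is not necessarily the publication year.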

In [2]:
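import os
import requests

date_from = '2021-01-01T00%3A00%3A00Z'
date_until = '2022-01-01T00%3A00%3A00Z'
REQUEST_URL="{}?verb=ListIdentifiers&from={}&until={}&metadataPrefix=oai_zb_preview".format(
    API_URL, date_from, date_until
)

# get data from API
headers = {'accept': 'text/xml'} # ask for XML (the API returns XML regardless)
# pass the headers by keyword: the second positional argument of requests.get is params
all_ids_xml = requests.get(REQUEST_URL, headers=headers)
all_ids_xml.raise_for_status() # fail early on HTTP errors

# save raw data in local file
os.makedirs('data', exist_ok=True) # make sure the target directory exists
with open('data/all_ids.xml', 'w', encoding='utf-8') as f:
    f.write(all_ids_xml.text)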
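
The dates in the request above are percent-encoded by hand (`%3A` stands for `:`). As a minimal sketch, the same URL could be built with `urllib.parse.urlencode` from the standard library, which handles the encoding. The helper name `build_list_identifiers_url` is hypothetical and not part of this notebook.

```python
from urllib.parse import urlencode

API_URL = 'https://oai.zbmath.org/v1/'  # same endpoint as in the notebook


def build_list_identifiers_url(date_from, date_until, prefix='oai_zb_preview'):
    """Hypothetical helper: build a ListIdentifiers URL from unencoded dates."""
    query = urlencode({
        'verb': 'ListIdentifiers',
        'from': date_from,
        'until': date_until,
        'metadataPrefix': prefix,
    })
    return '{}?{}'.format(API_URL, query)


# the colons in the timestamps are encoded as %3A automatically
url = build_list_identifiers_url('2021-01-01T00:00:00Z', '2022-01-01T00:00:00Z')
print(url)
```

Writing the dates unencoded keeps them readable and avoids mistakes when the date range changes.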

Parse the response into an XML tree, and put the result into a pandas data frame.

In [3]:
import xml.etree.ElementTree as ET
import pandas as pd

def ns(tag_name, namespace=OAI_NS):
    """
    Returns a fully qualified tag name in ElementTree's '{namespace}tag' notation.
    @param tag_name local name of the tag
    @param namespace URL of a namespace (OAI_NS is default)
    """
    return '{{{}}}{}'.format(namespace, tag_name)

# parse the tree, get a list of identifiers
tree = ET.parse('data/all_ids.xml')
list_ids = tree.getroot().find(ns('ListIdentifiers'))
entries = list_ids.findall(ns('header'))

# put identifiers in a pandas dataframe
# (DataFrame.append was removed in pandas 2.0, so collect the values first)
entry_ids = [entry.find(ns('identifier')).text for entry in entries]
entries_df = pd.DataFrame({'id': entry_ids})

print('The data has', entries_df.shape[0], 'entries.')

The data has 100 entries.
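
The OAI-PMH protocol delivers long lists in batches: when more results are available, the `ListIdentifiers` element carries a `resumptionToken` that can be sent back in a follow-up request. The sketch below shows how the identifiers and the token could be read from a response. The sample XML, its identifiers, and the token are made up for illustration, and the notebook does not show whether the zbMath service returned a token for the request above.

```python
import xml.etree.ElementTree as ET

NSMAP = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Made-up sample response: identifiers and token are placeholders, not real data
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example:1</identifier></header>
    <header><identifier>oai:example:2</identifier></header>
    <resumptionToken>sample-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def extract_ids_and_token(xml_text):
    """Return (list of identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    list_ids = root.find('oai:ListIdentifiers', NSMAP)
    ids = [
        header.findtext('oai:identifier', namespaces=NSMAP)
        for header in list_ids.findall('oai:header', NSMAP)
    ]
    # findtext returns the default (None) when the element is missing
    token = list_ids.findtext('oai:resumptionToken', default=None, namespaces=NSMAP)
    return ids, token


ids, token = extract_ids_and_token(SAMPLE)
print(ids, token)
```

A harvesting loop would repeat the request with `verb=ListIdentifiers&resumptionToken=...` until no token (or an empty one) comes back.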


## Get the details of each publication
Call the API for each id returned by the previous call and get the corresponding bibliographic details.

In [4]:

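def request_details(entry_id):
    """Sends a get request to zbMath, returns the details of a publication."""

    request_url = "{}?verb=GetRecord&identifier={}&metadataPrefix=oai_zb_preview".format(API_URL, entry_id)
    # pass the headers by keyword: the second positional argument of requests.get is params
    response = requests.get(request_url, headers=headers)
    response.raise_for_status() # fail early on HTTP errors
    tree = ET.fromstring(response.text)
    return tree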
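
In OAI-PMH, protocol-level problems such as an unknown identifier are reported inside the XML body as an `error` element with a `code` attribute, so a response can parse correctly and still contain no record. The following sketch checks for that case before any further parsing. The helper name `find_oai_error` and the sample response are hypothetical, and it is an assumption that the zbMath service reports errors exactly this way.

```python
import xml.etree.ElementTree as ET

OAI_NS = 'http://www.openarchives.org/OAI/2.0/'

# Made-up error response, shaped after the OAI-PMH specification
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


def find_oai_error(root):
    """Return (code, message) if the response carries an OAI-PMH error, else None."""
    error = root.find('{%s}error' % OAI_NS)
    if error is None:
        return None
    return error.get('code'), (error.text or '').strip()


result = find_oai_error(ET.fromstring(SAMPLE_ERROR))
print(result)
```

Calling such a check on the tree returned by `request_details` would allow skipping a failed entry instead of failing later on a missing `GetRecord` element.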

For each publication, the following details are kept: 'id' (the zbMath identifier) and the tags listed in `TAGS`: 'author', 'document_title', 'source', 'classifications', 'keywords', 'doi', 'publication_year'.

To do: map these details to an ontology

Some entries are not compatible with the licensing terms of zbMath Open; the affected fields contain a placeholder text instead of the actual value. Such entries are discarded, but only if the conflict affects one of the tags being imported.

In [5]:
def parse_details(xml_element, verbose=False):
    """
    Parse bibliographic entry details from XML Element.
    """
    new_entry = {}
    record = xml_element.find(ns('GetRecord')).find(ns('record'))
    # zbMath identifier
    zb_id = record.find(ns('header')).find(ns('identifier')).text 
    new_entry['id'] = zb_id
    # read tags
    zb_preview = record.find(ns('metadata')).find(ns('zbmath', OAI_ZB_PREVIEW_NS))
    for tag in TAGS:
        value = zb_preview.find(ns(tag, ZBMATH_NS))
        if value is not None:
            if len(value):
                # element has children; a child without text counts as an empty string
                texts = [child.text or '' for child in value]
                text = ';'.join(texts) # multiple values are rendered as a semicolon-separated string
            else:
                # element content is a simple text
                text = value.text
                
            if text == CONFLICT_TXT:
                # License conflict
                if verbose:
                    print('Licensing conflict for id "{}" tag "{}"'.format(zb_id, tag))
                return None
            
            new_entry[tag] = text
    return new_entry
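
To make the filtering rule concrete, here is a small self-contained sketch that applies the same logic to a made-up metadata fragment: a conflict in a tag that is not imported does not discard the entry, while a conflict in an imported tag does. The helper names `tag_text` and `has_conflict`, the child element names, and all values in the sample are illustrative assumptions, not real zbMath data.

```python
import xml.etree.ElementTree as ET

ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'
CONFLICT_TXT = 'zbMATH Open Web Interface contents unavailable due to conflicting licenses.'

# Made-up fragment: the 'source' tag carries the license conflict placeholder
SAMPLE = """<zbmath xmlns:zbmath="https://zbmath.org/zbmath/elements/1.0/">
  <zbmath:document_title>A sample title</zbmath:document_title>
  <zbmath:keywords>
    <zbmath:keyword>graphs</zbmath:keyword>
    <zbmath:keyword>trees</zbmath:keyword>
  </zbmath:keywords>
  <zbmath:source>{conflict}</zbmath:source>
</zbmath>""".format(conflict=CONFLICT_TXT)


def tag_text(parent, tag):
    """Return the text of a tag (child texts joined with ';'), or None if absent."""
    value = parent.find('{%s}%s' % (ZBMATH_NS, tag))
    if value is None:
        return None
    if len(value):
        return ';'.join(child.text or '' for child in value)
    return value.text


def has_conflict(parent, tags):
    """True if any of the given tags contains the license conflict placeholder."""
    return any(tag_text(parent, tag) == CONFLICT_TXT for tag in tags)


root = ET.fromstring(SAMPLE)
print(tag_text(root, 'keywords'))
print(has_conflict(root, ['document_title', 'keywords']))  # 'source' is not imported
print(has_conflict(root, ['document_title', 'source']))    # 'source' is imported
```

Narrowing `TAGS` to the fields that are really needed therefore reduces the number of entries lost to licensing conflicts.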

Loop through all entries, get the details of each one, put the result into a pandas data frame.

In [6]:
# loop through all entries
all_details = []
counter = 0
for _,current_entry in entries_df.iterrows():
    response = request_details(current_entry.id)
    details = parse_details(response)
    if details:
        all_details.append(details)

    # print progress info
    counter += 1
    if counter % 10 == 0:
        print('Processed {}/{} entries'.format(counter, len(entries_df)))

# convert to data frame
details_df = pd.DataFrame(all_details)
if 'id' not in details_df.columns:
    print("Problem reading zbMath id's. No data?")
else:
    details_df.set_index('id', inplace=True)

print('Imported {} entries (discarded {} for licensing conflicts)'.format(len(details_df), len(entries_df) - len(details_df) ))

Processed 10/100 entries
Processed 20/100 entries
Processed 30/100 entries
Processed 40/100 entries
Processed 50/100 entries
Processed 60/100 entries
Processed 70/100 entries
Processed 80/100 entries
Processed 90/100 entries
Processed 100/100 entries
Imported 46 entries (discarded 54 for licensing conflicts)


Save the data for later use.

In [7]:
details_df.to_csv('data/details.csv')
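
Because multi-valued tags are stored as semicolon-separated strings, they have to be split again when the file is read back. The sketch below does this with the standard library `csv` module on a made-up sample in the same layout. The helper name `load_details`, the sample rows, and the choice of multi-valued columns are assumptions for illustration.

```python
import csv
import io

# Made-up CSV content in the layout written above (values are placeholders)
SAMPLE_CSV = 'id,author,keywords\noai:example:1,"Doe, J.;Roe, R.",graphs;trees\n'

# assumption: these columns hold ';'-joined values
MULTI_VALUED = ('author', 'keywords')


def load_details(handle):
    """Read rows from a CSV handle and split multi-valued columns into lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in MULTI_VALUED:
            if row.get(column):
                row[column] = row[column].split(';')
        rows.append(row)
    return rows


rows = load_details(io.StringIO(SAMPLE_CSV))
print(rows[0]['author'])
print(rows[0]['keywords'])
```

For the real file, the handle would come from `open('data/details.csv', newline='', encoding='utf-8')`. Values that themselves contain a semicolon cannot be recovered unambiguously with this scheme.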

## References
M. Schubotz and O. Teschke, zbMATH Open: Towards standardized machine interfaces to expose bibliographic metadata. EMS Magazine 119, 50–53 (2021). https://euromathsoc.org/magazine/articles/mag-12

