# Access data from zbMath
Data in [zbMath Open](https://www.zbmath.org/) can be accessed through the [zbMath Open OAI-PMH](https://oai.zbmath.org/) service, which implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Schubotz and Teschke, 2021]. The service is open and subject to certain [terms and conditions](https://oai.zbmath.org/static/terms-and-conditions.html).

In [1]:
%run zbMath_common.ipynb
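
The shared notebook `zbMath_common.ipynb` is not reproduced here; it provides the names `LIST_IDENTIFIERS`, `GET_RECORD`, `ns` and `parse_record` used in the cells that follow. As a rough orientation, the first three might look like the sketch below. The endpoint path and the metadata prefix are assumptions to be checked against the service documentation; only the namespace URI is fixed by the OAI-PMH 2.0 specification.

```python
# Hypothetical stand-ins for helpers defined in zbMath_common.ipynb.
# BASE_URL and METADATA_PREFIX are assumptions, not taken from this notebook.
BASE_URL = "https://oai.zbmath.org/v1/"
METADATA_PREFIX = "oai_zb_preview"

LIST_IDENTIFIERS = "{}?verb=ListIdentifiers&metadataPrefix={}".format(BASE_URL, METADATA_PREFIX)
GET_RECORD = "{}?verb=GetRecord&metadataPrefix={}".format(BASE_URL, METADATA_PREFIX)

# Namespace URI defined by the OAI-PMH 2.0 specification.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"

def ns(tag):
    """Qualify a tag name with the OAI-PMH namespace for ElementTree lookups."""
    return "{" + OAI_NS + "}" + tag

print(ns("header"))
```

ElementTree stores tags in the `{namespace}tag` form, which is why every `find` and `findall` call in this notebook wraps the tag name in `ns(...)`.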

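## Get a list of publication IDs
The following example gets the first 100 IDs returned for the year 2021. Note that in OAI-PMH the `from` and `until` arguments filter on the record datestamp, which need not coincide with the publication year.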

In [2]:
import requests

date_from = '2021-01-01T00%3A00%3A00Z'
date_until = '2022-01-01T00%3A00%3A00Z'
REQUEST_URL="{}&from={}&until={}".format(LIST_IDENTIFIERS, date_from, date_until)

# get data from API
headers = {'accept': 'text/xml'}
# pass headers by keyword: the second positional argument of requests.get is `params`
all_ids_xml = requests.get(REQUEST_URL, headers=headers)
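
The timestamps above are percent-encoded by hand (`%3A` stands for `:`). A small sketch of how the same query string could be produced from plain ISO 8601 timestamps with the standard library; the variable names `params` and `query` are illustrative only.

```python
from urllib.parse import urlencode

# Unencoded timestamps for the OAI-PMH `from` and `until` arguments.
params = {"from": "2021-01-01T00:00:00Z", "until": "2022-01-01T00:00:00Z"}

# urlencode percent-encodes the colons, giving the same text as the manual version.
query = urlencode(params)
print(query)  # from=2021-01-01T00%3A00%3A00Z&until=2022-01-01T00%3A00%3A00Z
```

The resulting string can be appended to `LIST_IDENTIFIERS` with an `&`, which avoids encoding mistakes when the date range changes.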

Save the raw XML data in a local file.

In [None]:
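import os
os.makedirs('data', exist_ok=True)  # create the target directory if needed
with open('data/all_ids.xml', 'w', encoding='utf-8') as f:
    f.write(all_ids_xml.text)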

Parse the response into an XML tree, and put the result into a pandas data frame.

In [3]:
import xml.etree.ElementTree as ET
import pandas as pd

#parse the tree, get a list of identifiers
tree = ET.parse('data/all_ids.xml')
list_ids = tree.getroot().find(ns('ListIdentifiers'))
entries = list_ids.findall(ns('header'))

# put identifiers in a pandas dataframe
# (DataFrame.append was removed in pandas 2.0, so build the column in one go)
entry_ids = [entry.find(ns('identifier')).text for entry in entries]
entries_df = pd.DataFrame({'id': entry_ids})

print('The data has', entries_df.shape[0], 'entries.')

The data has 100 entries.
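
The request above returns a single page of identifiers. In OAI-PMH, a partial list carries a `resumptionToken` element that the client sends back to fetch the next page. The sketch below shows how that token could be read from a response; the function name `get_resumption_token` and the sample XML are made up for illustration, and whether the zbMath service always pages at 100 entries is an assumption based on the output above.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def get_resumption_token(xml_text):
    """Return the resumption token of a ListIdentifiers response, or None."""
    root = ET.fromstring(xml_text)
    token = root.find("{0}ListIdentifiers/{0}resumptionToken".format(OAI_NS))
    # an absent or empty token means the list is complete
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()

# Made-up sample response, not real zbMath data.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <resumptionToken>abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(get_resumption_token(sample))  # abc123
```

According to the protocol, a follow-up request passes only `verb=ListIdentifiers` and `resumptionToken=<token>`, without repeating `from`, `until` or `metadataPrefix`.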


## Get the details of each publication
Call the API for each id returned by the previous call and get the corresponding bibliographic details.

In [4]:

def request_details(entry_id):
    """Sends a get request to zbMath, returns the details of a publication."""

    REQUEST_URL="{}&identifier={}".format(GET_RECORD, entry_id)
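    # `headers` is the dict defined in the first request cell; pass it by keyword,
    # since the second positional argument of requests.get is `params`
    response = requests.get(REQUEST_URL, headers=headers)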
    tree = ET.fromstring(response.text)
    return tree
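
A request for a single record can fail at the protocol level even when the HTTP call succeeds: OAI-PMH reports such problems as `error` elements with a `code` attribute inside the response. The sketch below shows one way to detect them in the tree returned by `request_details`; the helper name `oai_errors` and the sample response are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def oai_errors(tree):
    """Return (code, message) pairs for every OAI-PMH <error> element in a response."""
    return [(err.get("code"), (err.text or "").strip())
            for err in tree.findall(OAI_NS + "error")]

# Made-up sample response, not real zbMath output.
sample = ET.fromstring(
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="idDoesNotExist">No matching identifier</error>'
    '</OAI-PMH>'
)

print(oai_errors(sample))  # [('idDoesNotExist', 'No matching identifier')]
```

Checking for errors before looking up `GetRecord` would avoid an `AttributeError` on `None` when a record cannot be served.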

For each publication, these details will be kept: 'document_id', 'author', 'document_title', 'source', 'classifications', 'keywords', 'doi'.
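
The function `parse_record` that extracts these details lives in `zbMath_common.ipynb` and is not shown here. The sketch below is not that function; it only illustrates the general idea of collecting selected fields from a record and discarding the record when a field is marked as unavailable. The name `parse_record_sketch`, the flat sample XML, and the marker text are all hypothetical, and the real metadata layout may differ.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def parse_record_sketch(record, fields, unavailable_marker):
    """Collect the text of the wanted fields; return None if any is marked unavailable."""
    details = {}
    for element in record.iter():
        name = local_name(element.tag)
        text = (element.text or "").strip()
        if name not in fields or not text:
            continue
        if unavailable_marker in text:
            return None
        # repeated fields (e.g. several keywords) are joined with ';'
        details[name] = details[name] + ";" + text if name in details else text
    return details

# Made-up record with a simplified, flat structure.
sample = ET.fromstring(
    "<record><metadata>"
    "<document_title>An example title</document_title>"
    "<keywords>graphs</keywords><keywords>trees</keywords>"
    "</metadata></record>"
)

print(parse_record_sketch(sample, {"document_title", "keywords"}, "UNAVAILABLE"))
```

Returning `None` for a discarded record matches how the loop below treats the result of `parse_record`: only truthy results are kept.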

To do: map these details to an ontology

Loop through all entries, get the details of each one, and put the result into a pandas data frame.

In [5]:
# loop through all entries
all_details = []
counter = 0
for _,current_entry in entries_df.iterrows():
    xml_element = request_details(current_entry.id)
    record = xml_element.find(ns('GetRecord')).find(ns('record'))
    details = parse_record(record)
    if details:
        all_details.append(details)

    # print progress info
    counter += 1
    if counter % 10 == 0:
        print('Processed {}/{} entries'.format(counter, len(entries_df)))

# convert to data frame
details_df = pd.DataFrame(all_details)
if 'id' not in details_df.columns:
    print("Problem reading zbMath id's. No data?")
else:
    details_df.set_index('id', inplace=True)

print('Imported {} entries (discarded {} for licensing conflicts)'.format(len(details_df), len(entries_df) - len(details_df) ))

Processed 10/100 entries
Processed 20/100 entries
Processed 30/100 entries
Processed 40/100 entries
Processed 50/100 entries
Processed 60/100 entries
Processed 70/100 entries
Processed 80/100 entries
Processed 90/100 entries
Processed 100/100 entries
Imported 46 entries (discarded 54 for licensing conflicts)


### Cleanup
* drop incomplete entries: those with no author, no author IDs, or no title
* remove duplicate entries: those with the same DOI, or with the same authors, title, and year

In [6]:
# drop entries without an author, author ids, or a title
idx = details_df.author.isna() | details_df.author_ids.isna() | details_df.document_title.isna()
details_df = details_df[~idx]
print('{} incomplete entries were removed'.format(idx.sum()))

11 incomplete entries were removed


In [7]:
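# a missing DOI must not count as a duplicate: duplicated() treats NaN values as equal
idx = (details_df.doi.notna() & details_df.duplicated(subset='doi')) | (details_df.duplicated(subset=['author', 'document_title', 'publication_year']))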
details_df = details_df[~idx]
print('{} duplicated entries were removed'.format(idx.sum()))

0 duplicated entries were removed
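
One detail worth checking in any duplicate test on `doi`: pandas' `duplicated` treats missing values as equal to each other, so two unrelated entries that both lack a DOI would be flagged as duplicates. The toy example below, with made-up DOIs and titles, contrasts the plain test with one that requires the DOI to be present.

```python
import pandas as pd

# Made-up data: rows 0 and 3 share a DOI, rows 1 and 2 have none.
toy = pd.DataFrame({
    "doi": ["10.1000/a", None, None, "10.1000/a"],
    "document_title": ["A", "B", "C", "A"],
})

# duplicated() treats missing values as equal, so the second missing DOI is flagged
naive = toy.duplicated(subset="doi")

# requiring a non-missing DOI keeps rows that merely lack one
guarded = toy.doi.notna() & toy.duplicated(subset="doi")

print(naive.tolist())
print(guarded.tolist())
```

Only the last row is a genuine DOI duplicate; the plain test would also drop row 2, which simply has no DOI.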


Save the data for later use.

In [8]:
#details_df.to_csv('data/details.csv')

## References
M. Schubotz and O. Teschke, zbMATH Open: Towards standardized machine interfaces to expose bibliographic metadata. EMS Magazine 119, 50–53 (2021). https://euromathsoc.org/magazine/articles/mag-12