# Harvest records from zbMath
This notebook does essentially the same as [Access data from zbMath](access_data_zbMath.ipynb), but uses the ListRecords endpoint to harvest data *en masse*. This is both more efficient, as fewer HTTP requests are sent, and easier to program.

Data in [zbMath Open](https://www.zbmath.org/) can be accessed through the [zbMath Open OAI-PMH](https://oai.zbmath.org/) service, which implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Schubotz and Teschke, 2021]. The service is open and subject to certain [terms and conditions](https://oai.zbmath.org/static/terms-and-conditions.html).

**Contrary to the documentation, the API always returns XML, not JSON.**

In [7]:
# API URLs
API_URL='https://oai.zbmath.org/v1/' # base URL of the API
LIST_IDENTIFIERS="{}?verb=ListIdentifiers&metadataPrefix=oai_zb_preview".format(API_URL) # ListIdentifiers endpoint
LIST_RECORDS="{}?verb=ListRecords&metadataPrefix=oai_zb_preview".format(API_URL) # ListRecords endpoint
GET_RECORD="{}?verb=GetRecord&metadataPrefix=oai_zb_preview".format(API_URL) # GetRecord endpoint

# API namespaces
OAI_NS = 'http://www.openarchives.org/OAI/2.0/' # the OAI namespace
OAI_ZB_PREVIEW_NS = 'https://zbmath.org/OAI/2.0/oai_zb_preview/'
ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'

# text shown in zbMath Open when there's a license conflict
CONFLICT_TXT = 'zbMATH Open Web Interface contents unavailable due to conflicting licenses.'
# which tags to keep
TAGS = ['author', 'author_ids', 'document_title', 'source', 'classifications', 'keywords', 'doi', 'publication_year']

## Get a list of records

In [8]:
import requests

date_from = '2021-01-01T00%3A00%3A00Z'
date_until = '2022-01-01T00%3A00%3A00Z'
REQUEST_URL="{}&from={}&until={}".format(LIST_RECORDS, date_from, date_until)

# get data from API
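headers = {'accept': 'text/xml'} # request XML (the API returns XML either way)
# headers must be passed by keyword: the second positional argument of requests.get is params
all_records_xml = requests.get(REQUEST_URL, headers=headers)
all_records_xml.raise_for_status() # stop here on an HTTP error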

# save raw data in local file
with open('data/all_records.xml', 'w', encoding='utf-8') as f:
    f.write(all_records_xml.text)
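
OAI-PMH lets a service split a large `ListRecords` result into pages and hand back a `resumptionToken` for the next one. The request above fetches only the first response, so the following is a minimal sketch of how paging could be followed, assuming the zbMath service implements the standard flow control. The name `harvest_all` and the `fetch` callable are hypothetical, and the two sample pages are made up so that the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

OAI_NS = 'http://www.openarchives.org/OAI/2.0/'

def harvest_all(fetch, first_params):
    """
    Collect <record> elements across all pages of a ListRecords result.
    fetch: callable taking a dict of query parameters, returning XML text
    first_params: query parameters of the first request
    """
    records = []
    params = dict(first_params)
    while True:
        root = ET.fromstring(fetch(params))
        list_records = root.find('{%s}ListRecords' % OAI_NS)
        if list_records is None:
            break  # e.g. an <error> response instead of a record list
        records.extend(list_records.findall('{%s}record' % OAI_NS))
        token = list_records.find('{%s}resumptionToken' % OAI_NS)
        if token is None or not (token.text or '').strip():
            break  # a missing or empty token marks the last page
        # follow-up requests carry only the verb and the token
        params = {'verb': 'ListRecords', 'resumptionToken': token.text.strip()}
    return records

# made-up pages standing in for the service
PAGE_1 = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          '<record/><record/><resumptionToken>page2</resumptionToken>'
          '</ListRecords></OAI-PMH>')
PAGE_2 = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          '<record/><resumptionToken/>'
          '</ListRecords></OAI-PMH>')

calls = []  # parameters of every request, kept for inspection

def fake_fetch(params):
    calls.append(params)
    return PAGE_2 if 'resumptionToken' in params else PAGE_1

records = harvest_all(fake_fetch, {'verb': 'ListRecords',
                                   'metadataPrefix': 'oai_zb_preview'})
print(len(records))  # 3
```

Against the real service, `fetch` could be a small function that returns `requests.get(API_URL, params=params).text`. Passing the dates through `params` would also let `requests` handle the URL encoding of the colons.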

Parse the response into an XML tree and extract the list of records. The records are put into a pandas data frame further below.

In [11]:
import xml.etree.ElementTree as ET
import pandas as pd

def ns(tag_name, namespace=OAI_NS):
    """
    Returns a fully qualified tag name.
    @param namespace URL of a namespace|None (OAI_NS is default)
    """
    return '{{{}}}{}'.format(namespace, tag_name)

# parse the tree, get a list of records
tree = ET.parse('data/all_records.xml')
list_records = tree.getroot().find(ns('ListRecords'))
records = list_records.findall(ns('record'))
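
To illustrate why the `ns` helper is needed, here is a small self-contained sketch on a made-up response. `ElementTree` stores namespaced tags as `{namespace}tag`, so a bare tag name does not match. The sample identifier is hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = 'http://www.openarchives.org/OAI/2.0/'

def ns(tag_name, namespace=OAI_NS):
    """Returns a fully qualified tag name."""
    return '{{{}}}{}'.format(namespace, tag_name)

# made-up response with a single record
SAMPLE = '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:zbmath.org:0000001</identifier></header></record>
  </ListRecords>
</OAI-PMH>'''

root = ET.fromstring(SAMPLE)
print(ns('record'))              # {http://www.openarchives.org/OAI/2.0/}record
print(root.find('ListRecords'))  # None, the bare name does not match
found = root.find(ns('ListRecords')).findall(ns('record'))
print(len(found))                # 1
```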

Define a function to parse a record's XML into a Python dict.

In [18]:
def parse_record(xml_record, verbose=False):
    """
    Parse bibliographic record details from XML Element.
    """
    new_entry = {}
    # zbMath identifier
    zb_id = xml_record.find(ns('header')).find(ns('identifier')).text 
    new_entry['id'] = zb_id
    # read tags
    zb_preview = xml_record.find(ns('metadata')).find(ns('zbmath', OAI_ZB_PREVIEW_NS))
    for tag in TAGS:
        value = zb_preview.find(ns(tag, ZBMATH_NS))
        if value is not None:
            if len(value):
                # element has children; skip children without text
                texts = [child.text for child in value if child.text]
                text = ';'.join(texts) # multiple values are rendered as a semicolon-separated string
            else:
                # element content is a simple text
                text = value.text
                
            if text == CONFLICT_TXT:
                # License conflict
                if verbose:
                    print('Licensing conflict for id "{}" tag "{}"'.format(zb_id, tag))
                return None
            
            new_entry[tag] = text
    return new_entry
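
The two branches of the function (simple text versus child elements joined with semicolons) can be tried on a small made-up fragment. This is only a sketch: the helper name `element_text` is hypothetical, and the sample XML is a simplified assumption about the record layout, not a real zbMath record.

```python
import xml.etree.ElementTree as ET

ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'

def element_text(element):
    """Return the element's text, or its children's texts joined with ';'."""
    if len(element):
        # element has children: join their texts, skipping empty ones
        return ';'.join(child.text for child in element if child.text)
    return element.text

# made-up fragment with one simple and one multi-valued field
SAMPLE = '''<zbmath:zbmath xmlns:zbmath="https://zbmath.org/zbmath/elements/1.0/">
  <zbmath:publication_year>2021</zbmath:publication_year>
  <zbmath:keywords>
    <zbmath:keyword>graph theory</zbmath:keyword>
    <zbmath:keyword>matchings</zbmath:keyword>
  </zbmath:keywords>
</zbmath:zbmath>'''

preview = ET.fromstring(SAMPLE)
year = element_text(preview.find('{%s}publication_year' % ZBMATH_NS))
keywords = element_text(preview.find('{%s}keywords' % ZBMATH_NS))
print(year)      # 2021
print(keywords)  # graph theory;matchings
```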

Parse all records in the data set and put them into a pandas data frame.

In [23]:
# loop through all entries
all_details = []
for record in records:
    details = parse_record(record)
    if details:
        all_details.append(details)

# convert to data frame
records_df = pd.DataFrame(all_details)
if 'id' not in records_df.columns:
    print("Problem reading zbMath id's. No data?")
else:
    records_df.set_index('id', inplace=True)

print('Imported {} entries (discarded {} for licensing conflicts)'.format(len(records_df), len(records) - len(records_df) ))

Imported 70 entries (discarded 30 for licensing conflicts)


Save the data for later use.

In [24]:
records_df.to_csv('data/records.csv')
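
When the file is read back later, the semicolon-separated fields can be turned into lists again. The following is a minimal sketch on made-up CSV text standing in for `data/records.csv`. The helper `split_values` and the column name `keyword_list` are hypothetical.

```python
import io
import pandas as pd

# made-up stand-in for data/records.csv
CSV_TEXT = '''id,document_title,keywords
oai:zbmath.org:1,First title,graph theory;matchings
oai:zbmath.org:2,Second title,
'''

df = pd.read_csv(io.StringIO(CSV_TEXT), index_col='id')

def split_values(value):
    """Split a semicolon-separated string into a list; missing values give []."""
    return value.split(';') if isinstance(value, str) else []

df['keyword_list'] = df['keywords'].apply(split_values)
print(df['keyword_list'].tolist())  # [['graph theory', 'matchings'], []]
```

Note that this split is only reliable as long as the individual values contain no semicolons themselves.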