# Harvest records from zbMath
This notebook does essentially the same as [Access data from zbMath](access_data_zbMath.ipynb), but uses the ListRecords endpoint to harvest data *en masse*. This is both more efficient, because fewer HTTP requests are sent, and easier to program.

Data in [zbMath Open](https://www.zbmath.org/) can be accessed through the [zbMath Open OAI-PMH](https://oai.zbmath.org/) service, which implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [Schubotz and Teschke, 2021]. The service is open and subject to certain [terms and conditions](https://oai.zbmath.org/static/terms-and-conditions.html).

In [36]:
%run zbMath_common.ipynb

## Get a list of records

In [37]:
import requests

#date_from = '2021-01-01T00%3A00%3A00Z'
#date_until = '2022-01-01T00%3A00%3A00Z'
#REQUEST_URL="{}&from={}&until={}".format(LIST_RECORDS, date_from, date_until)
REQUEST_URL="{}".format(LIST_RECORDS_PREVIEW_ENDPOINT)

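# get data from the API
# headers must be passed as a keyword: the second positional argument of requests.get() is `params`
headers = {'accept': 'text/xml'} # the service may ignore this header
all_records_xml = requests.get(REQUEST_URL, headers=headers)
all_records_xml.raise_for_status() # stop early on an HTTP error

# save raw data in a local file (the data/ directory must already exist)
with open('data/all_records.xml', 'w', encoding='utf-8') as f:
    f.write(all_records_xml.text)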
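
The commented-out lines in the cell above hint at restricting the harvest to a date range. A minimal sketch of how such a request URL could be assembled, assuming the base URL `https://oai.zbmath.org/v1` (taken from the Sickle call further down) and the standard OAI-PMH `from` and `until` arguments. `build_list_records_url` is a hypothetical helper, not something defined in `zbMath_common.ipynb`, and the exact URL layout expected by the service is an assumption.

```python
from urllib.parse import urlencode

# Assumption: same base URL as the one passed to Sickle later in this notebook.
BASE_URL = "https://oai.zbmath.org/v1"


def build_list_records_url(date_from=None, date_until=None,
                           metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH ListRecords URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if date_from:
        params["from"] = date_from
    if date_until:
        params["until"] = date_until
    # urlencode percent-encodes the colons in the timestamps (':' becomes '%3A')
    return "{}?{}".format(base_url, urlencode(params))


url = build_list_records_url("2021-01-01T00:00:00Z", "2022-01-01T00:00:00Z")
print(url)
```

Letting `urlencode` do the escaping avoids hand-writing `%3A` in the timestamps, as the commented-out lines do.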

Parse the response into an XML tree and find all the `record` XML elements.

In [38]:
import xml.etree.ElementTree as ET
import pandas as pd

# parse the tree and get the list of records
tree = ET.parse('data/all_records.xml')
# the <ListRecords> element wraps all <record> elements of the response
list_records = tree.getroot().find(ns('ListRecords'))
records = list_records.findall(ns('record'))
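
The `ns` helper comes from `zbMath_common.ipynb` and is not shown here. As a rough illustration of what namespace-qualified lookups involve, the sketch below parses a small hand-written OAI-PMH response. The helper `oai_tag`, the sample identifiers, and the token value are all made up for the example; only the namespace URI is the standard OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"  # standard OAI-PMH 2.0 namespace


def oai_tag(name):
    """Return the '{namespace}name' form that ElementTree uses for lookups."""
    return "{{{}}}{}".format(OAI_NS, name)


# Hand-written sample response; identifiers and token are hypothetical.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)
list_records = root.find(oai_tag("ListRecords"))
sample_records = list_records.findall(oai_tag("record"))

# Pull the identifier out of each record header
identifiers = [
    rec.find(oai_tag("header")).find(oai_tag("identifier")).text
    for rec in sample_records
]

# OAI-PMH signals that more pages are available through a resumptionToken element
token_el = list_records.find(oai_tag("resumptionToken"))
token = token_el.text if token_el is not None else None

print(identifiers, token)
```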

Parse all records in the data set and put them into a pandas data frame.

In [39]:
# loop through all entries
all_details = []
for record in records:
    details = parse_record(record, verbose=True)
    if details:
        all_details.append(details)

# convert to data frame
records_df = pd.DataFrame(all_details)
if 'id' not in records_df.columns:
    print("Problem reading zbMath ids. No data?")
else:
    records_df.set_index('id', inplace=True)

print('Imported {} entries (discarded {} for licensing conflicts)'.format(len(records_df), len(records) - len(records_df) ))

Licensing conflict for id "oai:zbmath.org:5346225" tag "author"
Licensing conflict for id "oai:zbmath.org:5346226" tag "author"
Licensing conflict for id "oai:zbmath.org:5346253" tag "author"
Imported 97 entries (discarded 3 for licensing conflicts)
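
The log above reports records that were dropped because of licensing conflicts. `parse_record` lives in `zbMath_common.ipynb` and is not shown, so the following is only a guess at how such a check could look. It assumes that affected fields carry the placeholder text used in the filter cell further down; `conflicting_fields` and `split_by_license` are hypothetical names, and the sample records are invented.

```python
# Placeholder text copied from the filter cell in this notebook.
LICENSE_PLACEHOLDER = (
    "zbMATH Open Web Interface contents unavailable due to conflicting licenses."
)


def conflicting_fields(details):
    """Return the names of all fields whose value is the licensing placeholder."""
    return [key for key, value in details.items() if value == LICENSE_PLACEHOLDER]


def split_by_license(all_details):
    """Split parsed records into (kept, discarded) lists."""
    kept, discarded = [], []
    for details in all_details:
        if conflicting_fields(details):
            discarded.append(details)
        else:
            kept.append(details)
    return kept, discarded


# Invented sample records, for illustration only
sample = [
    {"id": "oai:example:1", "author": "A. Author"},
    {"id": "oai:example:2", "author": LICENSE_PLACEHOLDER},
]
kept, discarded = split_by_license(sample)
print("Imported {} entries (discarded {})".format(len(kept), len(discarded)))
```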


Save the data for later use.

In [40]:
#records_df.to_csv('data/records.csv')

In [41]:
pd.set_option('display.max_rows', None)
records_df[records_df["author"] == 'zbMATH Open Web Interface contents unavailable due to conflicting licenses.']
#records_df

| id | author | document_title | source | classifications | language | keywords | doi | publication_year | serial | author_ids | links |
|---|---|---|---|---|---|---|---|---|---|---|---|


Get a full data dump using the Sickle OAI-PMH client.

In [46]:
from sickle import Sickle
sickle = Sickle('https://oai.zbmath.org/v1')
records = sickle.ListRecords(metadataPrefix='oai_dc')

In [47]:
# append the raw XML of every harvested record to a local file
with open('dc.txt', 'a', encoding='utf-8') as f:
    for record in records:
        f.write(record.raw + '\n')

KeyboardInterrupt: 
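
The full dump above had to be interrupted by hand. One way to keep a trial run bounded is to cap the number of records written. The sketch below assumes only that each harvested item exposes a `.raw` string, as the Sickle records in the cell above do; `write_raw_records` and its `limit` parameter are hypothetical.

```python
from itertools import islice


def write_raw_records(records, path, limit=None):
    """Append the raw XML of up to `limit` records to `path`.

    `records` can be any iterable of objects with a `.raw` attribute.
    With limit=None the whole iterable is consumed.
    Returns the number of records written.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as f:
        for record in islice(records, limit):
            f.write(record.raw + "\n")
            count += 1
    return count
```

With the `records` iterator from the cell above, a call such as `write_raw_records(records, 'dc.txt', limit=1000)` would stop after 1000 records instead of running until interrupted.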

In [7]:
from habanero import Crossref
cr = Crossref()
# note: `ids` expects DOIs; the string below holds MSC classification codes, so the lookup fails
cr.works(ids = '35-01;35Jxx;35Kxx;35R05')

HTTPError: 404 Client Error: Not Found for url: https://api.crossref.org/works/35-01;35Jxx;35Kxx;35R05
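
The 404 above comes from passing classification codes where Crossref expects DOIs. A small pre-check can catch this before a request is sent. This is a sketch under the assumption that a simple pattern (prefix `10.`, a 4 to 9 digit registrant code, a slash, then a suffix) is good enough for screening; `looks_like_doi` and `split_dois` are hypothetical helpers, and the pattern does not validate that a DOI actually exists.

```python
import re

# Loose screening pattern for DOIs; an assumption, not an official validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(value):
    """Return True if `value` has the general shape of a DOI."""
    return bool(DOI_PATTERN.match(value.strip()))


def split_dois(candidates):
    """Split candidate strings into (dois, others)."""
    dois, others = [], []
    for candidate in candidates:
        (dois if looks_like_doi(candidate) else others).append(candidate)
    return dois, others


# The string from the failing call holds MSC codes, so nothing passes the check
dois, others = split_dois("35-01;35Jxx;35Kxx;35R05".split(";"))
print(dois, others)
```

Only the entries in `dois` would then be worth passing to `cr.works(ids=...)`, for example values taken from the `doi` column of `records_df`.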

In [None]:
print(response)