# Filter papers by software
There are millions of references to papers in the zbMATH database. For now, we only need those related to the mathematical software listed in data/swMATH-initial.csv, which has been imported into the MaRDI-Portal.

In [1]:
import pandas as pd

# load the list of swMATH software
software_df = pd.read_csv('data/swMATH-initial.csv')
softwares = software_df['Len'].tolist()

In [2]:
# API URLs
API_URL='https://oai.zbmath.org/v1/' # base URL of the API
FILTER = '{}helper/filter?metadataPrefix=oai_zb_preview'.format(API_URL)

# API namespaces
OAI_NS = 'http://www.openarchives.org/OAI/2.0/' # the OAI namespace
OAI_ZB_PREVIEW_NS = 'https://zbmath.org/OAI/2.0/oai_zb_preview/'
ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'

# text shown in zbMATH Open when there's a license conflict
CONFLICT_TXT = 'zbMATH Open Web Interface contents unavailable due to conflicting licenses.'
# which tags to keep
TAGS = ['author', 'author_ids', 'document_title', 'source', 'classifications', 'keywords', 'doi', 'publication_year']

## Get a list of records related to a single software
Use the helper/filter endpoint to get a list of papers related to a particular piece of software.

**Note: this does not seem to work when the result set is very large. A question about this has been sent to OAI support.**

In [3]:
import requests

software = 'Gfan'
REQUEST_URL="{}&filter=software:{}".format(FILTER, software)

# get data from API
headers = {'accept': 'text/xml'} # request XML explicitly
# headers must be passed by keyword; the second positional argument of requests.get is `params`
all_records_xml = requests.get(REQUEST_URL, headers=headers)
if all_records_xml.status_code == 200:
    # save raw data in local file
    with open('data/software_records_{}.xml'.format(software), 'w', encoding='utf-8') as f:
        f.write(all_records_xml.text)
else: 
    print(all_records_xml.reason)
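
Software names in the swMATH list may contain spaces or other characters that are not safe in a query string. The following is a minimal sketch, assuming the `filter=software:<name>` syntax used above also accepts percent-encoded names (this has not been checked against the zbMATH API). `build_filter_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import quote

API_URL = 'https://oai.zbmath.org/v1/'  # base URL of the API
FILTER = '{}helper/filter?metadataPrefix=oai_zb_preview'.format(API_URL)

def build_filter_url(software):
    """Hypothetical helper: build the helper/filter URL for one software name."""
    # percent-encode the name so that spaces, '+' or '&' cannot break the query string
    return '{}&filter=software:{}'.format(FILTER, quote(software, safe=''))

print(build_filter_url('Gfan'))
```

For a plain name such as `Gfan` the result is identical to `REQUEST_URL` above; only names with special characters are changed.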

Define a helper that builds fully qualified (namespace-prefixed) tag names, as ElementTree expects them

In [4]:
def ns(tag_name, namespace=OAI_NS):
    """
    Returns a fully qualified tag name in ElementTree's '{namespace}tag' notation.
    @param tag_name local name of the tag
    @param namespace URL of a namespace (OAI_NS is default)
    """
    return '{{{}}}{}'.format(namespace, tag_name)
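
A short illustration of what the helper returns. The function and constants are repeated here only so that the snippet runs on its own.

```python
OAI_NS = 'http://www.openarchives.org/OAI/2.0/'
ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'

def ns(tag_name, namespace=OAI_NS):
    # doubled braces produce literal '{' and '}' around the namespace
    return '{{{}}}{}'.format(namespace, tag_name)

print(ns('record'))          # {http://www.openarchives.org/OAI/2.0/}record
print(ns('doi', ZBMATH_NS))  # {https://zbmath.org/zbmath/elements/1.0/}doi
```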

Define a function to parse a record's XML into a Python dict

In [5]:
import xml.etree.ElementTree as ET

def parse_record(xml_record, verbose=False):
    """
    Parse bibliographic record details from XML Element.
    @returns dict
    """
    new_entry = {}
    # zbMath identifier
    zb_id = xml_record.find(ns('header')).find(ns('identifier')).text 
    new_entry['id'] = zb_id
    # read tags
    zb_preview = xml_record.find(ns('metadata')).find(ns('zbmath', OAI_ZB_PREVIEW_NS))
    for tag in TAGS:
        value = zb_preview.find(ns(tag, ZBMATH_NS))
        if value is not None:
            if len(value):
                # element has children
                # multiple values are rendered as a semicolon-separated string;
                # empty children (text is None) are rendered as empty strings
                text = ';'.join(child.text or '' for child in value)
            else:
                # element content is simple text
                text = value.text
                
            if text == CONFLICT_TXT:
                # License conflict
                if verbose:
                    print('Licensing conflict for id "{}" tag "{}"'.format(zb_id, tag))
                return None
            
            new_entry[tag] = text
    return new_entry
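
To make the expected record layout concrete, here is a sketch that applies the same children-versus-text logic to a hand-written record. The sample XML is synthetic: the identifier, the field values and the child tag name `keyword` are assumptions for illustration, not copied from a real zbMATH response.

```python
import xml.etree.ElementTree as ET

OAI_NS = 'http://www.openarchives.org/OAI/2.0/'
OAI_ZB_PREVIEW_NS = 'https://zbmath.org/OAI/2.0/oai_zb_preview/'
ZBMATH_NS = 'https://zbmath.org/zbmath/elements/1.0/'

# Synthetic record for illustration only (all values are made up)
SAMPLE = '''
<record xmlns="http://www.openarchives.org/OAI/2.0/"
        xmlns:oai_zb_preview="https://zbmath.org/OAI/2.0/oai_zb_preview/"
        xmlns:zbmath="https://zbmath.org/zbmath/elements/1.0/">
  <header><identifier>oai:zbmath.org:0000000</identifier></header>
  <metadata>
    <oai_zb_preview:zbmath>
      <zbmath:document_title>An example title</zbmath:document_title>
      <zbmath:keywords>
        <zbmath:keyword>tropical geometry</zbmath:keyword>
        <zbmath:keyword>Groebner fans</zbmath:keyword>
      </zbmath:keywords>
    </oai_zb_preview:zbmath>
  </metadata>
</record>
'''.strip()

def ns(tag_name, namespace=OAI_NS):
    return '{{{}}}{}'.format(namespace, tag_name)

record = ET.fromstring(SAMPLE)
preview = record.find(ns('metadata')).find(ns('zbmath', OAI_ZB_PREVIEW_NS))
entry = {'id': record.find(ns('header')).find(ns('identifier')).text}
for tag in ['document_title', 'keywords', 'doi']:
    value = preview.find(ns(tag, ZBMATH_NS))
    if value is None:
        continue  # tag not present in this record (here: 'doi')
    if len(value):
        # element has children: join their texts with semicolons
        entry[tag] = ';'.join(child.text or '' for child in value)
    else:
        # element holds plain text
        entry[tag] = value.text

print(entry)
```

Tags that are absent from a record simply do not appear in the resulting dict, which pandas later turns into missing values.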

Parse all records in the data set and put them into a pandas DataFrame

In [8]:
import pandas as pd

# parse the tree, get a list of records
tree = ET.parse('data/software_records_Gfan.xml')
list_records = tree.getroot().find(ns('ListRecords'))
records = list_records.findall(ns('record'))

# loop through all entries
all_details = []
for record in records:
    details = parse_record(record)
    if details:
        all_details.append(details)

# convert to data frame
records_df = pd.DataFrame(all_details)
if 'id' not in records_df.columns:
    print("Problem reading zbMATH ids. No data?")
else:
    records_df.set_index('id', inplace=True)

print('Imported {} entries (discarded {} for licensing conflicts)'.format(
    len(records_df), len(records) - len(records_df)))

Imported 61 entries (discarded 39 for licensing conflicts)


Save the data for later use

In [7]:
records_df.to_csv('data/software_records_Gfan.csv')
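
When the CSV is read back later, multi-valued fields are still semicolon-separated strings. The sketch below shows one way to split them into lists again. It uses only the standard library and a synthetic stand-in for the file, so the rows and the `split_values` helper are assumptions for illustration.

```python
import csv
import io

# Synthetic stand-in for data/software_records_Gfan.csv; all rows are made up.
csv_text = (
    'id,author,keywords\n'
    'oai:zbmath.org:1,"Doe, J.;Roe, R.",tropical geometry;Groebner fans\n'
    'oai:zbmath.org:2,"Poe, P.",\n'
)

def split_values(text):
    """Turn a semicolon-separated string back into a list (empty string -> empty list)."""
    return text.split(';') if text else []

records = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    records[row['id']] = {
        'author': split_values(row['author']),
        'keywords': split_values(row['keywords']),
    }

print(records['oai:zbmath.org:1']['author'])
```

Note that this round trip is lossy if a single value itself contains a semicolon; a different separator or a list-preserving format such as JSON would avoid that.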