Credentials for the Zenodo API take the form of a personal access token, which you can create once you have a Zenodo account. For full details on how to do this, consult the authentication section of the API documentation [here](https://developers.zenodo.org/#authentication). The token is then passed as a parameter when making an HTTP request. I store my token as a variable called 'token' in a separate .py file and load that file into Python using the import statement below. You can also create the variable directly in your code, but make sure to remove the token itself before sharing your code online; otherwise others will be able to access the API using your credentials.

In [19]:
import zenodo_credentials
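
If you would rather not keep a credentials file next to your code, the token can also be read from an environment variable. The following is a minimal sketch of that alternative; the variable name `ZENODO_TOKEN` and the function `load_token` are made up for this example and are not something Zenodo prescribes.

```python
import os


def load_token(var_name="ZENODO_TOKEN"):
    """Return the access token stored in the environment variable var_name.

    ZENODO_TOKEN is a hypothetical name chosen for this example.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token
```

The returned value can be passed wherever `zenodo_credentials.token` is used, and the token never appears in the code you share.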

In [20]:
import pandas as pd
import numpy as np
import requests
import re
from datetime import datetime

This function takes two arguments: the credentials for the API and the date from which you want the resources. It looks a little scary, but the majority of it is merely parsing the data into a tabular format that matches the one we are using in the WP dataset. The function queries the Zenodo API for resources from both the ``rda`` and ``rda-related`` communities. A helper function defined inside it queries the API and then writes sheets of formatted tabular data. These sheets are named in line with their corresponding sheets in the WP dataset. The output of the function is therefore four CSV files: ``group_resource.csv``, ``resource.csv``, ``individual.csv`` and ``individual_resource.csv``. This data can then be manually copied into the Google doc or uploaded into an SQL database.
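
To make the date argument concrete, here is a small sketch of the filtering step on its own. The two sample records are invented and only mimic the shape of the `hits` list returned by the API; `filter_by_date` is a name introduced for this illustration.

```python
from datetime import datetime


def filter_by_date(records, date):
    """Keep records published on or after date (format YYYY-MM-DD)."""
    cutoff = datetime.strptime(date, "%Y-%m-%d")
    return [
        rec for rec in records
        if datetime.strptime(rec["metadata"]["publication_date"], "%Y-%m-%d") >= cutoff
    ]


# Made-up records that mimic the shape of r.json()['hits']['hits'].
sample = [
    {"metadata": {"title": "Old report", "publication_date": "2022-06-01"}},
    {"metadata": {"title": "New report", "publication_date": "2023-02-15"}},
]

recent = filter_by_date(sample, "2022-12-31")
print([rec["metadata"]["title"] for rec in recent])  # ['New report']
```

The comparison is inclusive, so a record published exactly on the given date is kept.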

In [41]:
def zenodo_scraper(credentials, date):
    def helper(community, writetype, header):
        r = requests.get("https://zenodo.org/api/records", params = {"access_token": credentials, "communities": community,
                                                                     "sort": "mostrecent"})
        
        data = r.json()['hits']['hits']
        data = [res for res in data if datetime.strptime(res['metadata']['publication_date'], "%Y-%m-%d") >= datetime.strptime(date, "%Y-%m-%d")]
        if len(data) == 0:
            return
        data = pd.json_normalize(data)

        data['Ready'] = np.nan
        data['URI_Status'] = data['links.self'].apply(lambda x: requests.head(x).status_code)
        data['URI2'] = np.nan
        data['URI2_Status'] = np.nan
        data['PID_LOD_Type'] = "DOI"
        data['type'] = data['metadata.resource_type.title'].apply(lambda x: f"publication-{x}".lower())
        data['dc_type'] = data['metadata.resource_type.title'].apply(lambda x: f"info:eu-repo/semantics/{x}".lower())
        data['dc_description'] = data['metadata.description'].apply(lambda x: re.sub(r"<[^>]*>|\n", " ", x).strip())
        data['dc_description'] = data['dc_description'].apply(lambda x: re.sub(" +", " ", x))

        data.rename(columns = {'metadata.title': 'Title', 'metadata.publication_date': 'dc_date', 
                               'metadata.language': 'dc_language', 'links.self_html': 'URI', 'links.doi': 'PID_LOD'}, inplace = True)

        data = data[['Title', 'Ready', 'URI', 'URI_Status', 'URI2', 'URI2_Status', 'PID_LOD_Type', 'PID_LOD', 'dc_date', 'dc_description', 
              'dc_language', 'type', 'dc_type']]

        contributors = [x.get('metadata', {}).get('contributors') for x in r.json()['hits']['hits']] 
        creators = [x.get('metadata', {}).get('creators') for x in r.json()['hits']['hits']] 
        titles = [x.get('title', {}) for x in r.json()['hits']['hits']]

        for x, y in zip(creators, titles):
            for z in x:
                z['title'] = y

        for x, y in zip(contributors, titles):
            if x is None:
                continue
            for z in x:
                z['title'] = y
        contributors = [i for i in contributors if i is not None]

        creats = [x for z in creators for x in z]
        contribs = [x for z in contributors for x in z]
        creats = pd.DataFrame(creats)
        contribs = pd.DataFrame(contribs)
        contribs.drop(columns = 'type', inplace = True)
        individual = pd.concat([creats, contribs], ignore_index=True, sort=False)
        individual = individual.loc[~individual['name'].str.contains(r"WG$|IG$|Working Group|Interest Group|Research Data Alliance", 
                                                                na = False)]
        individual = individual[['name', 'orcid']]
        individual.to_csv('individual.csv', mode = writetype, header = header, index = False)

        creat_grps = creats.loc[creats.name.str.contains(r"WG$|IG$|Working Group|Interest Group", na = False)]
        contribs_grps = contribs.loc[contribs.name.str.contains(r"WG$|IG$|Working Group|Interest Group", na = False)]
        groups = pd.concat([creat_grps, contribs_grps], ignore_index=True, sort=False)
        groups = groups[['name', 'title']]

        wgs = groups.loc[groups.name.str.contains(r"WG$|Working Group", na = False)]
        wgs.rename(columns = {'name': 'WorkingGroupString', 'title': 'Title'}, inplace = True)
        igs = groups.loc[groups.name.str.contains(r"IG$|Interest Group", na = False)]
        igs.rename(columns = {'name': 'InterestGroupString', 'title': 'Title'}, inplace = True)

        data = data.merge(wgs, how = 'left', on = 'Title')
        data = data.merge(igs, how = 'left', on = 'Title')
        data.to_csv('resource.csv', mode = writetype, header = header, index = False)

        creats['Relation_UUID'] = "rda_graph:DB99C7E3"
        creats['relation'] = "isAuthor"
        creats['UUID_Resource'] = np.nan
        contribs['Relation_UUID'] = "rda_graph:488C40F9"
        contribs['relation'] = "isContributor"
        contribs['UUID_Resource'] = np.nan

        creats.rename(columns = {'name': '(Individual)', 'title': '(Resource)'}, inplace = True)
        contribs.rename(columns = {'name': '(Individual)', 'title': '(Resource)'}, inplace = True)
        creats = creats[['(Individual)', 'Relation_UUID', 'relation', 'UUID_Resource', '(Resource)']]
        contribs = contribs[['(Individual)', 'Relation_UUID', 'relation', 'UUID_Resource', '(Resource)']]

        individual_resource = pd.concat([creats, contribs], ignore_index=True, sort=False)
        individual_resource.loc[~individual_resource['(Individual)'].str.contains(r"WG$|IG$|Working Group|Interest Group|Research Data Alliance", 
                                                                na = False)].to_csv('individual_resource.csv', mode = writetype, 
                                                                                    header = header, index = False)

        creat_grps['Relation_UUID'] = "rda_graph:7D9E4FD2"
        creat_grps['relation'] = 'isCreator'
        contribs_grps['Relation_UUID'] = "rda_graph:488C40F9"
        contribs_grps['relation'] = 'isContributor'
        creat_grps['UUID_Resource'] = np.nan
        contribs_grps['UUID_Resource'] = np.nan
        creat_grps.rename(columns = {'name': '(Title_Group)', 'title': '(Title_Resource)'}, inplace = True)
        contribs_grps.rename(columns = {'name': '(Title_Group)', 'title': '(Title_Resource)'}, inplace = True)
        creat_grps = creat_grps[['(Title_Group)', 'Relation_UUID', 'relation', 'UUID_Resource', '(Title_Resource)']]
        contribs_grps = contribs_grps[['(Title_Group)', 'Relation_UUID', 'relation', 'UUID_Resource', '(Title_Resource)']]
        group_resource = pd.concat([creat_grps, contribs_grps], ignore_index = True, sort = False)
        group_resource.to_csv("group_resource.csv", mode = writetype, header = header, index = False)

    communities = ["rda", "rda-related"]
    writetypes = ["w", "a"]
    headers = [True, False]

    for x, y, z in zip(communities, writetypes, headers):
        helper(x, y, z)
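
Two of the text-processing steps inside the helper are easier to follow in isolation: cleaning the HTML out of the description, and deciding whether a creator or contributor name is a group rather than a person. The sketch below reuses the same regular expressions as the function, but the names `clean_description`, `GROUP_PATTERN` and `is_group` are introduced here for illustration, and the example strings are invented.

```python
import re


def clean_description(text):
    """Replace HTML tags and newlines with spaces, then collapse repeated spaces."""
    no_tags = re.sub(r"<[^>]*>|\n", " ", text).strip()
    return re.sub(" +", " ", no_tags)


# Same pattern the helper uses to pick out working and interest groups.
GROUP_PATTERN = re.compile(r"WG$|IG$|Working Group|Interest Group")


def is_group(name):
    """Return True when a name ends in WG/IG or contains a group phrase."""
    return bool(GROUP_PATTERN.search(name))


# Invented inputs, for illustration only.
print(clean_description("<p>Hello\n<b>world</b></p>"))  # Hello world
print(is_group("Data Citation WG"))                      # True
print(is_group("Jane Doe"))                              # False
```

Note that `WG$` and `IG$` only match at the end of a name, so a person whose name merely contains those letters elsewhere is not treated as a group.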

The function emits a number of pandas warnings when it runs. These are warnings rather than errors: they are raised because the function modifies a filtered subset of a data frame in place, and they can be ignored.
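
If you would prefer not to see these messages, one common approach is to avoid `inplace = True` on a filtered frame and assign the result of `rename` instead. This is only a sketch of that pattern on made-up data, not a change to the function above; the `groups` frame here is a stand-in with invented rows.

```python
import pandas as pd

# Made-up rows standing in for the `groups` frame built inside the helper.
groups = pd.DataFrame({
    "name": ["Data Citation WG", "Libraries for Research Data IG"],
    "title": ["Resource A", "Resource B"],
})

# rename() without inplace=True returns a new DataFrame, so nothing is
# modified on a slice of `groups` and pandas has nothing to warn about.
wgs = groups.loc[groups["name"].str.contains(r"WG$|Working Group", na=False)].rename(
    columns={"name": "WorkingGroupString", "title": "Title"}
)
print(list(wgs.columns))
```

Calling `.copy()` on the filtered frame before modifying it is another way to get the same effect.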

In [42]:
zenodo_scraper(zenodo_credentials.token, "2022-12-31")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  wgs.rename(columns = {'name': 'WorkingGroupString', 'title': 'Title'}, inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  igs.rename(columns = {'name': 'InterestGroupString', 'title': 'Title'}, inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  wgs.rename(columns = {'name': 'WorkingGroupString', 'title': 'Title'}, inplace = True)
