# Zenodo API (Example for TU Delft)
This Zenodo API query is taken from Chris Erdmann's example given as part of Library Carpentry (https://github.com/LibraryCarpentry/awesome-jupyter-glam/blob/master/notebooks-guides/Zenodo_API_Notebook.ipynb). 

This notebook demonstrates how to query the Zenodo REST API. First, we create a few Python helper functions and classes that make querying Zenodo easier. Don't worry about the details:

In [13]:
import requests

def raise_on_error(res):
    """Helper to check response for errors."""
    if res.status_code != 200:
        data = res.json()
        raise Exception('[{status}] {message}'.format(**data))

        
class ResultWrapper(object):
    """Helper to work with search results"""
    def __init__(self, session, response):
        raise_on_error(response)
        self._session = session
        self.response = response
        
    def has_json(self):
        """Check if content type is JSON."""
        # Tolerate a missing header or a trailing "; charset=..." parameter.
        return self.response.headers.get('Content-Type', '').startswith('application/json')
            
    @property
    def data(self):
        """Get JSON data."""
        return self.response.json() if self.has_json() else {}
    
    @property
    def aggregations(self):
        """Get aggregations."""
        return self.response.json()['aggregations']  if self.has_json() else {}
    
    @property
    def total(self):
        """Get total number of hits."""
        return self.response.json()['hits']['total'] if self.has_json() else None
        
    @property
    def pages(self):
        """Helper to fetch all result pages."""
        yield self.response
        next_url = self.response.links.get('next', {}).get('url')
        while next_url:
            res_page = self._session.get(next_url)
            raise_on_error(res_page)
            yield res_page
            next_url = res_page.links.get('next', {}).get('url')
    
    @property
    def hits(self):
        """Helper to iterate over each hit."""
        if not self.has_json():
            return None
        for res in self.pages:
            for h in res.json()['hits']['hits']:
                yield h
                
class ZenodoClient(object):
    """Simple Zenodo API Client"""

    def __init__(self, accept=None, token=None):
        self._accept = accept
        self._token = token
        self._session = None
        self._endpoint = 'https://www.zenodo.org/api/'
        self._endpoint_search = '{}records/'.format(self._endpoint)
        self._endpoint_styles = '{}csl/styles'.format(self._endpoint)
        
    @property
    def session(self):
        """Create a session for making HTTP requests to the API."""
        if self._session is None:
            self._session = requests.Session()    
            # Construct headers
            headers = {
                'Accept': self._accept or 'application/vnd.zenodo.v1+json',
                'Accept-Charset': 'utf-8',
            }
            if self._token:
                headers['Authorization'] = 'Bearer {}'.format(self._token)            
            self._session.headers.update(headers)
        return self._session

    def search(self, query=None, size=None, sort=None, **filters):
        """Search Zenodo"""
        params = {'q': query or '', 'sort': sort or 'bestmatch', 'size': size or 100}
        if filters:
            for f, vals in filters.items():
                params[f] = vals
        return ResultWrapper(self.session, self.session.get(self._endpoint_search, params=params))
    
    def record(self, record_id, accept=None, **params):
        """Retrieve a single record."""
        url = '{base}{recid}'.format(base=self._endpoint_search, recid=record_id)
        params = params or {}
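        # Copy the session headers so that a per-request Accept override
        # does not leak into later requests made with the same session.
        headers = dict(self.session.headers)
        if accept:
            headers['Accept'] = accept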
            
        res = self.session.get(url, headers=headers, params=params)
        raise_on_error(res)
        return res
    
    def csl_styles(self):
        """Retrieve the available CSL citation styles."""
        res = self.session.get(self._endpoint_styles)
        raise_on_error(res)
        return res
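
The `pages` helper above follows the `next` link that `requests` exposes through `response.links`, which is parsed from the HTTP `Link` response header. As a rough illustration of that mechanism, here is a minimal standard-library sketch. The function name `parse_next_link` and the example URLs are hypothetical, and the parser is deliberately simplified: it assumes the URLs themselves contain no commas.

```python
import re

def parse_next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    # Simplification: split on commas, so URLs containing a literal comma
    # would be handled incorrectly.
    for part in link_header.split(','):
        match = re.match(r'\s*<([^>]*)>\s*;(.*)', part)
        if match and re.search(r'\brel\s*=\s*"?next"?\s*(;|$)', match.group(2)):
            return match.group(1)
    return None

# Hypothetical header value, shaped like a paginated API response.
header = ('<https://example.org/api/records/?page=1>; rel="prev", '
          '<https://example.org/api/records/?page=3>; rel="next"')
next_url = parse_next_link(header)
print(next_url)
```

When no `next` link is present the function returns `None`, which is the condition that ends the `while next_url:` loop in `pages`.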
        

## Initialize the API client and make a query
Using the helper classes we just created, we can now create a Zenodo API client:

In [14]:
api = ZenodoClient()

### Simple search

Using the API client, we can now execute queries against the Zenodo search API:

In [15]:
result = api.search('Delft')
result.total

1413
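
To see what such a search call sends over the wire, the sketch below rebuilds the request URL with the standard library only. It assumes the same endpoint and default parameters as `ZenodoClient.search()`; the function name `build_search_url` is hypothetical, and the encoding only approximates what `requests` does internally.

```python
from urllib.parse import urlencode

# Assumed base URL: the records endpoint used by ZenodoClient above.
SEARCH_ENDPOINT = 'https://www.zenodo.org/api/records/'

def build_search_url(query=None, size=None, sort=None, **filters):
    """Approximate the URL requested by ZenodoClient.search()."""
    params = {'q': query or '', 'sort': sort or 'bestmatch', 'size': size or 100}
    params.update(filters)
    # doseq=True expands list values into repeated query parameters.
    return '{}?{}'.format(SEARCH_ENDPOINT, urlencode(params, doseq=True))

print(build_search_url('Delft'))
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON that the helper classes wrap.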

### Phrase search

The query string is exactly what you would type in the Zenodo search box. The underlying search engine on Zenodo is Elasticsearch, which has a powerful query syntax. Above, we searched for a single term. Wrapping the query in double quotes turns it into a phrase search; for a one-word query the number of hits stays the same:

In [16]:
result = api.search('"Delft"')
result.total

1413

### Field search

In [17]:
tud_query = '+creators.affiliation:("TUD" "TU Delft" "Delft University of Technology") +doi:10.5281*'
result = api.search(tud_query)
result.total

783

Notice the query string. Here are a couple of points:

* **Field search**: We are now searching on two specific fields - ``creators.affiliation`` and ``doi``.
* **Required fields**: The small ``+`` in front of each field indicates that the field must be present (i.e. both fields are required since they have a plus in front).
* **Field multi-term**: On ``creators.affiliation`` we search for ``TUD`` or ``TU Delft`` or ``Delft University of Technology``.
* **Wildcard**: On the ``doi`` field we search for any string starting with ``10.5281`` (the Zenodo DOI prefix).

The full query syntax is documented [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax).
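
The pieces described above can also be assembled programmatically, which helps when the list of affiliation spellings grows. This is a minimal sketch: the function name `build_affiliation_query` is hypothetical, and it assumes the affiliation names contain no double quotes, since no escaping is done.

```python
def build_affiliation_query(affiliations, doi_prefix=None):
    """Build a query requiring one of several affiliation phrases."""
    # Each name becomes a quoted phrase; phrases inside the parentheses
    # are alternatives for the creators.affiliation field.
    phrases = ' '.join('"{}"'.format(name) for name in affiliations)
    query = '+creators.affiliation:({})'.format(phrases)
    if doi_prefix:
        # Trailing * is the wildcard: any DOI starting with the prefix.
        query += ' +doi:{}*'.format(doi_prefix)
    return query

tud_names = ['TUD', 'TU Delft', 'Delft University of Technology']
built_query = build_affiliation_query(tud_names, doi_prefix='10.5281')
print(built_query)
```

The printed string is identical to the hand-written `tud_query` used in the field search.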

## Data inspection
Let's now inspect the author affiliations. First, let's get all the records and create a list of all the author affiliations:

In [18]:
# Retrieve all records (be careful with large result sets!)
records = list(result.hits)

# Build a list of author affiliations
affiliations = []
for record in records:
    for creator in record['metadata']['creators']:
        affiliations.append(creator.get('affiliation'))

Next, let's count all the affiliations:

In [19]:
# Create a Pandas data frame
import pandas as pd
df = pd.DataFrame(data=affiliations, columns=['affiliation'])
df.groupby('affiliation').size().reset_index(name='count').sort_values(by=['count'], ascending=False).head()

| | affiliation | count |
|---|---|---|
| 134 | Delft University of Technology | 642 |
| 556 | TU Delft | 333 |
| 428 | Netherlands eScience Center | 56 |
| 708 | University of Zurich | 45 |
| 488 | REE | 26 |
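
If pandas is not available, the same tally can be sketched with `collections.Counter` from the standard library. The list `sample_affiliations` below is hypothetical data shaped like the `affiliations` list built earlier, with `None` standing for a creator who has no affiliation; such entries are skipped explicitly, mirroring how `groupby` leaves out missing values by default.

```python
from collections import Counter

# Hypothetical sample shaped like the `affiliations` list built above.
sample_affiliations = [
    'TU Delft',
    'Delft University of Technology',
    None,
    'TU Delft',
    'Delft University of Technology',
    'Delft University of Technology',
]

# Count every affiliation, skipping creators without one.
counts = Counter(a for a in sample_affiliations if a is not None)

# most_common(n) returns the n largest counts in descending order.
for name, count in counts.most_common(2):
    print('{}: {}'.format(name, count))
```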


## Export a bibliography
Zenodo supports several output metadata formats besides JSON. For search results, Zenodo can produce the following metadata formats:

* BibTeX: ``application/x-bibtex``
* DublinCore: ``application/x-dc+xml``
* DataCite: ``application/x-datacite+xml``
* MARC21: ``application/marcxml+xml``

For individual records the following additional formats can be produced:

* JSON-LD: ``application/ld+json``
* Citation Style Language (CSL): ``application/vnd.citationstyles.csl+json``
* Citation text: ``text/x-bibliography``
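
The media types listed above can be kept in a small lookup table so that a typo in an `Accept` value is caught before a request is sent. This is only a sketch: the short keys such as `'bibtex'` and the function name `accept_header` are hypothetical names chosen for illustration, not part of the Zenodo API.

```python
# Formats available for search results (media types from the list above).
SEARCH_FORMATS = {
    'bibtex': 'application/x-bibtex',
    'dublincore': 'application/x-dc+xml',
    'datacite': 'application/x-datacite+xml',
    'marc21': 'application/marcxml+xml',
}

# Additional formats available for individual records only.
RECORD_ONLY_FORMATS = {
    'json-ld': 'application/ld+json',
    'csl': 'application/vnd.citationstyles.csl+json',
    'citation': 'text/x-bibliography',
}

def accept_header(fmt, single_record=False):
    """Return the media type for a format name, or raise ValueError."""
    formats = dict(SEARCH_FORMATS)
    if single_record:
        formats.update(RECORD_ONLY_FORMATS)
    try:
        return formats[fmt]
    except KeyError:
        raise ValueError('Format {!r} is not available here'.format(fmt))

print(accept_header('bibtex'))
```

The returned string is what would be passed as the `accept` argument of `ZenodoClient`.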

For instance, we can export a BibTeX bibliography of all records in the above search query and write it to a file:

In [20]:
# Create an API client (specifying the desired output format)
api = ZenodoClient(accept='application/x-bibtex')

# Execute query and write result to "bibliography.bib"
result = api.search(tud_query)
with open('bibliography.bib', 'wb') as fp:
    # Iterate over all the results and write them to the file.
    for page in result.pages:
        fp.write(page.content)

## Aggregations
Want to know about types, open/closed access, file types and keywords?

In [21]:
api = ZenodoClient()
result = api.search(tud_query)
for b in result.aggregations['type']['buckets']:
    print('{key}: {doc_count}'.format(**b))


publication: 261
presentation: 142
software: 126
dataset: 125
image: 86
poster: 21
other: 15
lesson: 4
video: 3
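
Each bucket is a dictionary with at least a `key` and a `doc_count`, so turning the counts into shares takes only a few lines. The sketch below copies the bucket values from the output above into a hypothetical list named `type_buckets` instead of calling the API; real buckets may carry further fields that are not reproduced here.

```python
# Buckets copied from the output above; only the two fields that the
# loop uses are reproduced.
type_buckets = [
    {'key': 'publication', 'doc_count': 261},
    {'key': 'presentation', 'doc_count': 142},
    {'key': 'software', 'doc_count': 126},
    {'key': 'dataset', 'doc_count': 125},
    {'key': 'image', 'doc_count': 86},
    {'key': 'poster', 'doc_count': 21},
    {'key': 'other', 'doc_count': 15},
    {'key': 'lesson', 'doc_count': 4},
    {'key': 'video', 'doc_count': 3},
]

# Total number of records across all resource types.
total = sum(b['doc_count'] for b in type_buckets)

# Percentage share of each resource type.
shares = {b['key']: 100.0 * b['doc_count'] / total for b in type_buckets}
for key, share in shares.items():
    print('{}: {:.1f}%'.format(key, share))
```

The bucket counts add up to 783, the same number of hits reported for the field search.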


In [22]:
for b in result.aggregations['access_right']['buckets']:
    print('{key}: {doc_count}'.format(**b))

open: 774
closed: 4
restricted: 4
embargoed: 1


In [23]:
for b in result.aggregations['keywords']['buckets']:
    print('{key}: {doc_count}'.format(**b))

Photonics: 87
Bookmark: 81
Light: 79
outreach: 79
science communication: 78
data stewardship: 42
RDM: 38
TU Delft: 31
open science: 27
LED: 23


In [24]:
for b in result.aggregations['file_type']['buckets']:
    print('{key}: {doc_count}'.format(**b))

pdf: 384
zip: 190
pptx: 99
png: 84
txt: 19
xlsx: 15
csv: 12
gz: 12
md: 9
py: 7
