## Introduction

The Crossref REST API is entirely based on URLs and is [documented extensively](https://api.crossref.org). This means that, in theory, you can get all the data you want using a normal browser. For example, you might want to see the latest DOI records in the Crossref system. You can see them at the following URL:

[`https://api.crossref.org/works`](https://api.crossref.org/works)


This means the REST API is pretty easy to use with basic low-level HTTP libraries (e.g. Python's `requests`), but for this tutorial we are going to use a [higher-level Python library](https://github.com/fabiobatalha/crossrefapi) developed by Fabio Batalha C. Santos at [SciELO](http://www.scielo.org).
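To make the "everything is a URL" point concrete, here is a minimal sketch that builds `/works` URLs by hand using only the standard library. The helper name `build_works_url` is made up for this illustration; `rows` and `query` are parameters described in the API documentation. Nothing is sent over the network here, so you can paste the printed URL into a browser yourself.

```python
from urllib.parse import urlencode

API_BASE = "https://api.crossref.org"


def build_works_url(**params):
    """Return a /works URL with the given query parameters appended.

    Hypothetical helper for illustration only; it builds a string and
    does not contact the API.
    """
    url = f"{API_BASE}/works"
    if params:
        url += "?" + urlencode(params)
    return url


# The plain route, and the same route limited to five rows.
print(build_works_url())
print(build_works_url(rows=5))
```

The library used in the rest of this tutorial builds URLs of this kind for you.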

The examples here are in Python 3. Sorry, but you're going to have to make the move sometime ;)

To install this library, run:

`pip install crossrefapi`

Then, import the library and get ready to look at some `works` data.

In [None]:
# If viewing in pineapple notebook, uncomment the next two lines and then run the cell.
#import pineapple
#%require crossrefapi

In [None]:
# If viewing in Jupyter notebook, run this cell to install the library.
!pip install crossrefapi

In [None]:
from crossref.restful import Works
works = Works()

## Working with "works"

Let's start by looking briefly at "works". The route refers to items identified by a DOI in the index. These can be articles, books, components, etc.

**TIP:** Crossref does not use "works" in the [FRBR](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records) sense of the word. In Crossref parlance, a "work" is just a thing identified by a DOI. In practice, Crossref DOIs are used as citation identifiers. So, in FRBR terms, this means that a Crossref DOI tends to refer to one _expression_, which might include multiple _manifestations_. So, for example, the ePub, HTML and PDF versions of an article will share a Crossref DOI because the differences between them should not affect the interpretation or crediting of the content. In short, they can be cited interchangeably. The same is true of the "accepted manuscript" and the "version of record" of that accepted manuscript.

To start querying information about works, we import the `Works` class and create an instance to use as a convenient shortcut.

In [None]:
from crossref.restful import Works
works = Works()

Now we are ready to ask our first question: how many Crossref DOI records are indexed by the API?

In [None]:
works.count()

Note that above I said "How many Crossref DOIs". There are several other [DOI registration agencies](https://www.doi.org/registration_agencies.html) (RAs). Crossref is by far the largest DOI RA, and the other RAs tend to specialize in orthogonal areas (e.g. music and video, local-language translations of publications, etc.), but it is important to note that this API will not work with non-Crossref DOIs (though [DataCite](https://www.datacite.org/), another RA, provides a very similar API).

**TIP:** Not all DOIs are Crossref DOIs. If you are having trouble using a DOI with Crossref's API, check to see if it is a Crossref DOI.

So the next obvious question is, how do I tell if a DOI is a Crossref DOI?

In [None]:
works.agency('10.1590/0102-311x00133115')

In [None]:
works.agency('10.6084/m9.figshare.1314859.v1')

In [None]:
works.agency('10.5240/B1FA-0EEC-C316-3316-3A73-L')
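If you want to turn the answer into a simple yes/no check, a small helper like the one below can do it. This is a sketch under an assumption: that the record returned by `works.agency(...)` is a dictionary with an `agency` entry containing an `id` such as `'crossref'`. The two sample records are hand-written stand-ins for real responses, and `is_crossref_doi` is a made-up name.

```python
def is_crossref_doi(agency_record):
    """Return True if the agency record names Crossref as the registration agency.

    Assumes a record shaped like {'DOI': ..., 'agency': {'id': ..., 'label': ...}}.
    """
    agency = (agency_record or {}).get("agency", {})
    return agency.get("id") == "crossref"


# Hand-written sample records (assumed shape, not fetched from the API).
sample_crossref = {
    "DOI": "10.1590/0102-311x00133115",
    "agency": {"id": "crossref", "label": "Crossref"},
}
sample_other = {
    "DOI": "10.6084/m9.figshare.1314859.v1",
    "agency": {"id": "datacite", "label": "DataCite"},
}

print(is_crossref_doi(sample_crossref))
print(is_crossref_doi(sample_other))
```

In a live session you would pass the result of `works.agency(doi)` instead of the sample dictionaries.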

OK, so assuming that we are using a Crossref DOI, how do we get the metadata for it?

In [None]:
record = works.doi('10.7554/eLife.09561')
record

This is basically a huge JSON object, so you can retrieve individual elements from it. Here is the publisher:

In [None]:
record['publisher']

And here is the license for the "version of record":

In [None]:
next((item for item in record['license'] if item["content-version"] == "vor"))['URL']

Um... That was complicated. What does 'vor' mean?

**TIP:** Publishers sometimes record information for multiple versions of the content identified by a DOI. These versions should be interchangeable from the point of view of citation, but sometimes one version has more "features" than another. For example, it might be typeset or have references linked, etc. The two versions might also have different licenses and different URLs. The terminology publishers use for identifying versions comes from the [NISO standard called JAV (Journal Article Versions)](http://www.niso.org/publications/rp/RP-8-2008.pdf) and, although this terminology is [sometimes problematic](https://f1000research.com/articles/6-608/v1), you should be aware of it. In particular, you will see two terms used in Crossref metadata:

- `VOR` = Version of Record
- `AM` = Accepted Manuscript



Now that we know what 'vor' means, let's get the link to the full text of the version of record:

In [None]:
next((item for item in record['link'] if item["content-version"] == "vor"))['URL']
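One caveat with the `next(...)` expressions above: they raise `StopIteration` when a record has no entry for the requested version, and a `KeyError` when the record has no `license` or `link` element at all. Here is a more forgiving sketch. The helper name `url_for_version` and the sample record (including its `example.org` URLs) are invented for illustration.

```python
def url_for_version(items, version):
    """Return the URL of the first item whose content-version matches, else None."""
    for item in items or []:
        if item.get("content-version") == version:
            return item.get("URL")
    return None


# Hand-written sample shaped like the 'link' element of a works record.
sample_record = {
    "link": [
        {"URL": "https://example.org/article.am.pdf", "content-version": "am"},
        {"URL": "https://example.org/article.vor.pdf", "content-version": "vor"},
    ]
}

print(url_for_version(sample_record.get("link"), "vor"))
print(url_for_version(sample_record.get("license"), "vor"))  # no 'license' element here
```

With a real record you would call, for example, `url_for_version(record.get('link'), 'vor')`.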

The above has given us a brief overview of how to get a record, and elements of a record, identified by a Crossref DOI. Obviously, the goal is to do this in bulk, that is, to select and process records for multiple Crossref DOIs. Before we do that, it is helpful to familiarise yourself with some of the other "routes" supported by the REST API. This is because more advanced usage of the API typically involves combining information from several routes.

## Members

Crossref is a membership organization. DOI records are registered and managed by those members. It is often very useful to break down Crossref DOI records by member. But first let's find out a little bit more about members.

First we import the `Members` class and set up a useful shortcut.

In [None]:
from crossref.restful import Members
members = Members()

How many members does Crossref have?

In [None]:
members.count()

Let's look at a particular member, Hindawi:

In [None]:
pub = next(iter(members.query('Hindawi')))
pub

**TIP:** Many people make the mistake of thinking that a "DOI prefix" can be used to identify the member responsible for a Crossref DOI. This is not true. DOI prefixes merely serve as a namespace from which a member can create new DOIs without worrying about collisions. But, once created, Crossref DOIs are often transferred between publishers, and so a Crossref member will often be responsible for DOIs with a variety of prefixes. So, for example, Hindawi (above) is responsible for several prefixes:

In [None]:
prefixes = [p['value'] for p in pub['prefix']]
prefixes
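For reference, the prefix is simply the part of a DOI before the first `/`. The sketch below extracts it and groups a few DOIs by prefix. `doi_prefix` and `group_by_prefix` are made-up helper names, and the grouping only tells you which namespace a DOI was created in, not which member is responsible for it today.

```python
from collections import defaultdict


def doi_prefix(doi):
    """Return the prefix of a DOI, i.e. the part before the first '/'."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return prefix


def group_by_prefix(dois):
    """Group DOIs by prefix (a namespace, not a member identifier)."""
    groups = defaultdict(list)
    for doi in dois:
        groups[doi_prefix(doi)].append(doi)
    return dict(groups)


# DOIs that appear elsewhere in this tutorial.
sample_dois = [
    "10.7554/eLife.09561",
    "10.1590/0102-311x00133115",
    "10.6084/m9.figshare.1314859.v1",
]
print(group_by_prefix(sample_dois))
```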

**TIP:** The most accurate way to refer to a particular Crossref member and *all* their prefixes is through the member's `id`.

So let's look at eLife.

In [None]:
pub = next(iter(members.query('eLife')))
pub

eLife's Crossref member ID can be accessed as follows:

In [None]:
pub_id = pub['id']
pub_id

Now we can use this ID to specifically refer to eLife. For example:

In [None]:
pub = members.member(pub_id)
pub

Let's see how many DOIs eLife has registered by year:

In [None]:
dois_by_year = pub['breakdowns']['dois-by-issued-year']
dois_by_year

Cool, now let's look at some of the publisher data in more friendly formats. We are going to use the pandas library for summarising and visualising the data.

In [None]:
import pandas as pd
%matplotlib inline

First let's see the publications by year in a nice, sorted table:

In [None]:
f = pd.DataFrame(dois_by_year)
f.columns = ['year','dois']
dois_sorted_by_year = f.sort_values(['year','dois'])
dois_sorted_by_year

Maybe look at this in a graph?

In [None]:
dois_sorted_by_year.plot.bar(x='year',y='dois')

We can pull all of this together to look at other publishers. Try changing the publisher name in the code below to something else:

In [None]:
publisher_name = 'PLOS'
pub_id = next(iter(members.query(publisher_name)))['id']
pub = members.member(pub_id)
dois_by_year = pub['breakdowns']['dois-by-issued-year']
f = pd.DataFrame(dois_by_year)
f.columns = ['year','dois']
dois_sorted_by_year = f.sort_values(['year','dois'])
dois_sorted_by_year.plot.bar(x='year',y='dois',figsize=(50, 7))

A publisher record also contains a useful summary of the member's metadata and the Crossref services that they participate in.

Let's look at what percentage of their metadata includes certain information:

In [None]:
coverage = [[key,float(pub['coverage'][key])*100] for key in pub['coverage'].keys()]
coverage

In [None]:
f = pd.DataFrame(coverage)
f.columns = ['metadata','coverage']
f.plot.barh(x='metadata',y='coverage')
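If you prefer a ranked list to a chart, the same conversion can be done in plain Python. This is a sketch with hand-written sample values: the keys are only illustrative of the kind of entries found in a member's `coverage` element, and `coverage_percentages` is a made-up helper name.

```python
def coverage_percentages(coverage):
    """Convert coverage fractions (0 to 1) to percentages, highest first."""
    rows = [(key, round(float(value) * 100, 1)) for key, value in coverage.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hand-written sample values (not real data for any member).
sample_coverage = {
    "orcids-current": 0.25,
    "references-current": 1.0,
    "abstracts-current": 0.5,
}

for name, percent in coverage_percentages(sample_coverage):
    print(f"{name:20s} {percent:5.1f}%")
```

With a real member record you would pass `pub['coverage']` instead of the sample dictionary.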

Now let's see what Crossref services they participate in:

In [None]:
participation = [[key,pub['flags'][key]] for key in pub['flags'].keys()]
f = pd.DataFrame(participation)
f.columns = ['service','participation']
f

## Fun with facets

### ORCID support

In [None]:
from crossref.restful import Works
works = Works()
r = works.filter(has_orcid='true').facet('publisher-name',10)
orcid_support = [[key,r['publisher-name']['values'][key]] for key in r['publisher-name']['values'].keys()]
f = pd.DataFrame(orcid_support)
f.columns = ['publisher','orcids']
f.plot.barh(x='publisher',y='orcids')
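The cell above relies on the facet result being a nested dictionary: facet name, then `values`, then a mapping from each value to its count. Assuming that shape, the sketch below ranks the values without pandas. The publisher names and counts are invented, and `top_facet_values` is a made-up helper name.

```python
def top_facet_values(facet_result, facet_name, n=5):
    """Return the n (value, count) pairs with the highest counts for a facet."""
    values = facet_result[facet_name]["values"]
    ranked = sorted(values.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hand-written sample shaped like the result used in the cell above.
sample_facets = {
    "publisher-name": {
        "values": {
            "Publisher A": 120,
            "Publisher B": 450,
            "Publisher C": 75,
        }
    }
}

print(top_facet_values(sample_facets, "publisher-name", n=2))
```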

### Zika publications

In [None]:
r = works.query(title='Zika').facet('publisher-name',10)
zika_publications = [[key,r['publisher-name']['values'][key]] for key in r['publisher-name']['values'].keys()]
f = pd.DataFrame(zika_publications)
f.columns = ['publisher','publications']
f.plot.barh(x='publisher',y='publications')

## Some other resources

### Types

In [None]:
from crossref.restful import Types
# Use `t` rather than `type` so the built-in `type` is not shadowed.
types = [t['label'] for t in Types().all()]
types.sort()
types

### Journals

In [None]:
from crossref.restful import Journals
journals = Journals()
journals.count()


In [None]:
journal = journals.journal('0028-0836')
journal

## A slight digression to discuss testing and debugging queries

**TIP:** One of the cool things about the library we are using is that you can easily see the REST API URL that it generates for any query you make. To do this, you simply ask for the `url` of the query in question. So, for example, here is the API call generated by a query for "zika":

In [None]:
from crossref.restful import Works
works = Works()
works.query('zika').url
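Once you have such a URL, it can help to break it into its route and parameters to check that the query is what you intended. Here is a minimal sketch using the standard library. The URL below is a hand-written example of the general form, not output captured from the library, and `describe_query_url` is a made-up helper name.

```python
from urllib.parse import parse_qs, urlsplit


def describe_query_url(url):
    """Split an API URL into its route and a dict of query parameters."""
    parts = urlsplit(url)
    return parts.path, parse_qs(parts.query)


# Hand-written example URL (assumed form).
route, params = describe_query_url("https://api.crossref.org/works?query=zika&rows=5")
print(route)
print(params)
```

Note that `parse_qs` returns each parameter value as a list, because a parameter can appear more than once in a URL.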

## Using samples for testing and to save time

In [None]:
from crossref.restful import Works
works = Works()
zika_sample = [work for work in works.query('zika').sample(10)]
zika_sample

## Some Jisc Examples

### Notify institutions of co-authored works

Works records published online _today_ (or later) that have affiliation data:

In [None]:
import datetime
works=Works()
today = datetime.date.today().isoformat()
works_with_affiliations = [w for w in works.filter(from_online_pub_date=today, has_affiliation='true')]
print(len(works_with_affiliations))
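Depending on when you run the cell above, "today" may return few or no records. A slightly wider window is easy to compute with the standard library. This is a sketch; `iso_date_days_ago` is a made-up helper name, and the resulting string is meant to be passed to the same `from_online_pub_date` filter used above.

```python
import datetime


def iso_date_days_ago(days, today=None):
    """Return the ISO-format date `days` days before `today` (default: the current date)."""
    today = today or datetime.date.today()
    return (today - datetime.timedelta(days=days)).isoformat()


# A fixed reference date keeps this example reproducible.
reference = datetime.date(2017, 1, 8)
print(iso_date_days_ago(7, today=reference))  # 2017-01-01
```

For example, `works.filter(from_online_pub_date=iso_date_days_ago(7), has_affiliation='true')` would cover roughly the last week.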


Let's look for publications with a publication date of today and a particular affiliation.

In [None]:
import datetime
affiliation="Harvard"
today = datetime.date.today().isoformat()
recent_affiliation_pubs = works.filter(from_online_pub_date=today).query(affiliation=affiliation)
recent_affiliation_pubs.count()

Let's look at the first record...

In [None]:
next(iter(recent_affiliation_pubs))

#### A digression on organizational identifiers...


### Integrate funding data into funder policy service

In [None]:
from crossref.restful import Funders
funders=Funders()
funder_name="NIH"
funder_id=next(iter(funders.query(funder_name)))['id']
funder_id

In [None]:
import datetime
works=Works()
today = datetime.date.today().isoformat()
works_with_funding_data = [w for w in works.filter(from_online_pub_date='2017-01-01', funder=funder_id)]
print(len(works_with_funding_data))

In [None]:
next(iter(works_with_funding_data))
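Collecting every matching record into a list, as in the cell above, can take a long time for a large funder. If you only want a look at the first few records, you can stop early with `itertools.islice`. This is a sketch: `take` is a made-up helper name, and a plain generator stands in for the query so that the example runs without network access.

```python
from itertools import islice


def take(iterable, limit):
    """Return at most `limit` items from an iterable without consuming the rest."""
    return list(islice(iterable, limit))


def fake_results():
    """Stand-in for a large query result (hypothetical data)."""
    for number in range(1_000_000):
        yield {"record": number}


first_three = take(fake_results(), 3)
print(first_three)
```

Since the query objects in this tutorial are iterated in the same way, something like `take(works.filter(funder=funder_id), 20)` should stop after the first 20 records.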
