# Analysis using Crossref API

In this section, we will learn to use the Crossref API to search for and analyze scientific articles.

## 0. Installing Packages

First, we will install the packages we need to explore the Crossref API. In a Jupyter Notebook, the command `!pip install ...` downloads and installs a package into the notebook's environment.

We then use the command `import (package name)` to use our new package in our code. 
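
Before installing, it can be handy to check whether a package is already available. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and is not part of the original notebook.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    # find_spec returns None when no importable top-level module has this name
    return importlib.util.find_spec(package_name) is not None

for name in ["json", "habanero"]:
    if is_installed(name):
        print(name, "-> available")
    else:
        print(name, "-> missing, try: pip install " + name)
```

If a package is reported as missing, running the install cell below should make it importable.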

In [1]:
# just run this cell

!pip install habanero
#To use Crossref API in Python, we need to import the habanero package
import habanero
from habanero import Crossref
from collections import Counter # for easy counting
import ast # for string to dictionary conversion
import pandas as pd # for data manipulation
import numpy as np # for array manipulation
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline 



## 1. Analysis using Crossref API
In the sections below, we will walk through the basics of Crossref.

Crossref data can be accessed from Python through packages such as `habanero`, `crossrefapi`, and `crossref-commons`. In this notebook, we'll focus on the functionality of the **habanero** package, which we installed above.

## 1.1 Exploring a Crossref query
In the cells below, we walk through querying Crossref and exploring the data it returns. We first create a `Crossref` object and assign it to the variable `cr`.

In [2]:
cr = Crossref() # create a crossref object

The main function we will use is `cr.works()`, which takes a search term through its `query` argument. As an example, we'll use the search term "permafrost".

In order to save this output, we'll assign it to the variable name `permafrost`.

In [3]:
# query for the term "permafrost"
permafrost = cr.works(query = "permafrost")

If we inspect `permafrost`, we can see that it is a dictionary.

In [4]:
type(permafrost)

dict

In the cell below, try creating your own query for a different search term. Make sure to save it to a variable! For now, limit your search term to a single word.

In [5]:
# Example: your_query_name = cr.works(query = "yourquery")


## 1.2 Keys, Indexes, Metadata
A dictionary is a type of data structure that is indexed by keys. A dictionary contains key-value pairs, and we can access the values by calling on the keys. In our `permafrost` dictionary, we can inspect the keys and take a look at the values that it contains.

In [6]:
list(permafrost.keys())

['status', 'message-type', 'message-version', 'message']
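
As a minimal sketch of how key-based access works, the example below uses a small hand-written dictionary shaped like the top level of the response. The values are illustrative only and are not real API output.

```python
# A hand-written dictionary shaped like the top level of the response
# (illustrative values only, not real API output)
response = {
    "status": "ok",
    "message-type": "work-list",
    "message": {"total-results": 3},
}

print(list(response.keys()))        # the keys, in insertion order
print(response["status"])           # index by key to get the value
print(response.get("missing-key"))  # .get returns None instead of raising KeyError
```

Square-bracket indexing raises a `KeyError` for a key that does not exist, while `.get` returns `None`, which is convenient when a field may be absent.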

Above, we can see that there are 4 different keys in our `permafrost` dictionary. We will focus on the values for the message key. In the cell below, we are accessing the values by *indexing* into the dictionary by the given keys.

In [7]:
permafrost['message']

{'facets': {},
 'total-results': 8200,
 'items': [{'indexed': {'date-parts': [[2020, 4, 15]],
    'date-time': '2020-04-15T02:44:23Z',
    'timestamp': 1586918663309},
   'reference-count': 18,
   'publisher': 'Wiley',
   'issue': '2',
   'license': [{'URL': 'http://doi.wiley.com/10.1002/tdm_license_1.1',
     'start': {'date-parts': [[2015, 9, 1]],
      'date-time': '2015-09-01T00:00:00Z',
      'timestamp': 1441065600000},
     'delay-in-days': 1911,
     'content-version': 'tdm'}],
   'content-domain': {'domain': [], 'crossmark-restriction': False},
   'short-container-title': ['Permafrost Periglac. Process.'],
   'DOI': '10.1002/ppp.695',
   'type': 'journal-article',
   'created': {'date-parts': [[2010, 6, 8]],
    'date-time': '2010-06-08T11:53:41Z',
    'timestamp': 1275998021000},
   'page': '215-218',
   'source': 'Crossref',
   'is-referenced-by-count': 5,
   'title': ['Report from the international permafrost association: the IPY permafrost legacy'],
   'prefix': '10.1002',

This dictionary is nested, meaning that we can have keys that lead to values which are more dictionaries. It seems like the `message` contains most of the information we're interested in. Below, take a look at the keys of the `message` component of the `permafrost` dictionary.

In [8]:
list(permafrost['message'].keys()) # keys of the permafrost message dictionary

['facets', 'total-results', 'items', 'items-per-page', 'query']

Just as we did before, we can inspect what information is contained for different keys of the dictionary. Let's focus on total results first.

In [9]:
# This tells us the total number of results from our query
permafrost['message']['total-results']

8200

In [10]:
permafrost['message']['items-per-page'] # tells us how many items per page 

20

In [11]:
permafrost['message']['query'] # details about our query

{'start-index': 0, 'search-terms': 'permafrost'}
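
The lookups above chain one key after another. When a key might be missing, a small helper can make that chaining safer. This is only a sketch: `dig` is a hypothetical helper name, and the dictionary is a hand-written stand-in that reuses the values printed above.

```python
def dig(data, *keys, default=None):
    """Follow `keys` one level at a time; return `default` if any step is missing."""
    current = data
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

# Hand-written stand-in for a response (values copied from the outputs above)
response = {
    "message": {
        "total-results": 8200,
        "items-per-page": 20,
        "query": {"start-index": 0, "search-terms": "permafrost"},
    }
}

print(dig(response, "message", "query", "search-terms"))
print(dig(response, "message", "no-such-key", default="not present"))
```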

In [12]:
permafrost['message']['items'] # the items of our query

[{'indexed': {'date-parts': [[2020, 4, 15]],
   'date-time': '2020-04-15T02:44:23Z',
   'timestamp': 1586918663309},
  'reference-count': 18,
  'publisher': 'Wiley',
  'issue': '2',
  'license': [{'URL': 'http://doi.wiley.com/10.1002/tdm_license_1.1',
    'start': {'date-parts': [[2015, 9, 1]],
     'date-time': '2015-09-01T00:00:00Z',
     'timestamp': 1441065600000},
    'delay-in-days': 1911,
    'content-version': 'tdm'}],
  'content-domain': {'domain': [], 'crossmark-restriction': False},
  'short-container-title': ['Permafrost Periglac. Process.'],
  'DOI': '10.1002/ppp.695',
  'type': 'journal-article',
  'created': {'date-parts': [[2010, 6, 8]],
   'date-time': '2010-06-08T11:53:41Z',
   'timestamp': 1275998021000},
  'page': '215-218',
  'source': 'Crossref',
  'is-referenced-by-count': 5,
  'title': ['Report from the international permafrost association: the IPY permafrost legacy'],
  'prefix': '10.1002',
  'volume': '21',
  'author': [{'given': 'Jerry',
    'family': 'Brown'

Above, we can see that `items` contains most of the information about our permafrost query. It is a list of results, which we can check with the `type` function we used earlier. Since we are only looking at the first page, the list has only 20 items.

In [40]:
type(permafrost['message']['items'])

list

In [14]:
len(permafrost['message']['items'])

20
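
Because `items` is a list of dictionaries, we can loop over it and pull a few fields out of each record. The sketch below uses two hand-trimmed records; the second one is hypothetical and lacks a title, to show how a missing field can be handled. The helper name `first_title` is also an assumption.

```python
# Two hand-written records shaped like entries of permafrost['message']['items']
# (fields trimmed; the second record is hypothetical and has no title)
items = [
    {"DOI": "10.1002/ppp.695", "type": "journal-article",
     "title": ["Report from the international permafrost association: the IPY permafrost legacy"]},
    {"DOI": "10.0000/example", "type": "journal"},
]

def first_title(item):
    """Return the first title of a record, or None when the field is absent or empty."""
    titles = item.get("title") or []
    return titles[0] if titles else None

# One (DOI, type, title) tuple per record
summary = [(item["DOI"], item["type"], first_title(item)) for item in items]
for row in summary:
    print(row)
```

Note that `title` is stored as a list, so we take its first element rather than using the value directly.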

## 1.3 Creating Tables
When our data exists in dictionaries, it's a little hard to explore and manipulate. In order to tackle this, we'll create a dataframe of the information so that we can access it more easily.

In [12]:
df_permafrost = pd.DataFrame(permafrost['message']['items'])
df_permafrost.head() # show the first 5 rows

Unnamed: 0,indexed,reference-count,publisher,issue,license,content-domain,short-container-title,DOI,type,created,...,subject,publisher-location,isbn-type,published-print,update-policy,ISBN,assertion,short-title,subtitle,archive
0,"{'date-parts': [[2020, 4, 15]], 'date-time': '...",18,Wiley,2.0,[{'URL': 'http://doi.wiley.com/10.1002/tdm_lic...,"{'domain': [], 'crossmark-restriction': False}",[Permafrost Periglac. Process.],10.1002/ppp.695,journal-article,"{'date-parts': [[2010, 6, 8]], 'date-time': '2...",...,[Earth-Surface Processes],,,,,,,,,
1,"{'date-parts': [[2020, 6, 9]], 'date-time': '2...",308,Springer International Publishing,,"[{'URL': 'http://www.springer.com/tdm', 'start...","{'domain': ['link.springer.com'], 'crossmark-r...",,10.1007/978-3-030-31379-1_5,book-chapter,"{'date-parts': [[2020, 1, 1]], 'date-time': '2...",...,,Cham,"[{'value': '9783030313784', 'type': 'print'}, ...",{'date-parts': [[2020]]},http://dx.doi.org/10.1007/springer_crossmark_p...,"[9783030313784, 9783030313791]","[{'value': '2 January 2020', 'order': 1, 'name...",,,
2,"{'date-parts': [[2020, 4, 10]], 'date-time': '...",0,Wiley,,,"{'domain': [], 'crossmark-restriction': False}",,10.1002/(issn)1099-1530,journal,"{'date-parts': [[2006, 3, 21]], 'date-time': '...",...,[Earth-Surface Processes],,,,,,,[Permafrost Periglac. Process.],,
3,"{'date-parts': [[2020, 4, 8]], 'date-time': '2...",47,Wiley,4.0,[{'URL': 'http://doi.wiley.com/10.1002/tdm_lic...,"{'domain': [], 'crossmark-restriction': False}",[Permafrost Periglac. Process.],10.1002/ppp.464,journal-article,"{'date-parts': [[2003, 12, 19]], 'date-time': ...",...,[Earth-Surface Processes],,,"{'date-parts': [[2003, 10]]}",,,,,,
4,"{'date-parts': [[2020, 4, 15]], 'date-time': '...",2,Wiley,4.0,[{'URL': 'http://doi.wiley.com/10.1002/tdm_lic...,"{'domain': [], 'crossmark-restriction': False}",[Permafrost Periglac. Process.],10.1002/ppp.711,journal-article,"{'date-parts': [[2010, 12, 30]], 'date-time': ...",...,[Earth-Surface Processes],,,"{'date-parts': [[2010, 10]]}",,,,,,
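
A `DataFrame` built from a list of dictionaries gets one column per key, and a record that lacks a key receives `NaN` in that column, which explains the empty cells in the table. The sketch below illustrates this with two hand-trimmed records (values taken from the output above) and then selects a few columns to make the table easier to read.

```python
import pandas as pd

# Two hand-trimmed records; the second one has no 'issue' key
items = [
    {"DOI": "10.1002/ppp.695", "publisher": "Wiley",
     "type": "journal-article", "issue": "2"},
    {"DOI": "10.1007/978-3-030-31379-1_5",
     "publisher": "Springer International Publishing", "type": "book-chapter"},
]

df = pd.DataFrame(items)

# Keep only a few columns of interest
subset = df[["DOI", "type", "publisher"]]
print(subset)

# True marks the rows where 'issue' was missing from the record
print(df["issue"].isna().tolist())
```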


There are a bunch of columns in our table, and we can't see all of them by scrolling. Below, we can look at a list of the columns instead.

In [13]:
df_permafrost.columns

Index(['indexed', 'reference-count', 'publisher', 'issue', 'license',
       'content-domain', 'short-container-title', 'DOI', 'type', 'created',
       'page', 'source', 'is-referenced-by-count', 'title', 'prefix', 'volume',
       'author', 'member', 'published-online', 'reference', 'container-title',
       'language', 'link', 'deposited', 'score', 'issued', 'references-count',
       'journal-issue', 'URL', 'relation', 'ISSN', 'issn-type', 'subject',
       'publisher-location', 'isbn-type', 'published-print', 'update-policy',
       'ISBN', 'assertion', 'short-title', 'subtitle', 'archive'],
      dtype='object')

In [52]:
df_permafrost.shape[0]

20

Container-title seems to stand in for Journal Title. Let's look at what's in the container title column:

In [104]:
journal_titles = df_permafrost['container-title']
journal_titles

0     [Permafrost and Periglacial Processes]
1                       [Thawing Permafrost]
2                                        NaN
3     [Permafrost and Periglacial Processes]
4     [Permafrost and Periglacial Processes]
5     [Permafrost and Periglacial Processes]
6     [Permafrost and Periglacial Processes]
7     [Permafrost and Periglacial Processes]
8     [Permafrost and Periglacial Processes]
9     [Permafrost and Periglacial Processes]
10    [Permafrost and Periglacial Processes]
11    [Permafrost and Periglacial Processes]
12    [Permafrost and Periglacial Processes]
13    [Permafrost and Periglacial Processes]
14    [Permafrost and Periglacial Processes]
15    [Permafrost and Periglacial Processes]
16    [Permafrost and Periglacial Processes]
17    [Permafrost and Periglacial Processes]
18    [Permafrost and Periglacial Processes]
19    [Permafrost and Periglacial Processes]
Name: container-title, dtype: object
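
Each entry in this column is either a list of titles or `NaN`, so counting journals takes a little care. As a sketch, the example below counts titles with `Counter` (imported at the top of the notebook) on a short hand-written stand-in for the column rather than on the real data.

```python
from collections import Counter

# Stand-in for the container-title column: lists of titles, or NaN when missing
container_titles = [
    ["Permafrost and Periglacial Processes"],
    ["Thawing Permafrost"],
    float("nan"),
    ["Permafrost and Periglacial Processes"],
]

# Keep the first title of each record; skip NaN and empty lists
first_titles = [
    titles[0] for titles in container_titles
    if isinstance(titles, list) and titles
]
title_counts = Counter(first_titles)

for title, count in title_counts.most_common():
    print(count, title)
```

The `isinstance` check matters because `NaN` is a float and cannot be indexed like a list.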

## 1.4 Data Retrieval

We want to be able to retrieve more data from Crossref. A single Crossref query can return up to 1,000 results, and since our query has 8,200 total results (the `total-results` value we saw earlier), we would need 9 queries to retrieve them all. Remember that at the beginning we only had 20 results, because we only grabbed the first page. The Crossref API permits fetching results from multiple pages, so by setting `cursor` to `*` and `cursor_max` to 1000 we can grab 1,000 results in one call. Retrieving all of the results would take a long time, so for the purposes of this demonstration we only use 1,000. If you have more time, you could retrieve more results, but be aware that it will take a while.

In [20]:
# this cell will take a while to run
cr_permafrost = cr.works(query="permafrost", cursor = "*", cursor_max = 1000, progress_bar = True)

100%|██████████| 49/49 [00:53<00:00,  1.10s/it]


We can check that we have 1,000 results, and indeed we do.

In [21]:
sum([len(k['message']['items']) for k in cr_permafrost])

1000
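
The arithmetic behind these counts is a ceiling division: the number of batches needed is the total divided by the batch size, rounded up. A minimal sketch, using the `total-results` value of 8,200 shown earlier (the helper name `batches_needed` is hypothetical):

```python
import math

def batches_needed(total_results, batch_size):
    """Number of batches of size `batch_size` needed to cover `total_results`."""
    return math.ceil(total_results / batch_size)

print(batches_needed(8200, 1000))  # → 9  (full result set, 1,000 results per batch)
print(batches_needed(1000, 20))    # → 50 (1,000 results in pages of 20)
```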

Remember before, when you created your own search query? Here, we'll go through a few steps to get more results for your query. First, we'll check how many total results your query has. Run the cell below to find out.

In [20]:
# Uncomment the line below, replacing `query` with the variable name you used for your own query
#query['message']['total-results']

If there are fewer than 1000 results, then set `cursor_max` to the number of results. If there are more than 1000 results, set `cursor_max` to 1000 so that the code won't take too long to run. Be sure to fill in `query` with the same search term you used earlier.

In [None]:
#Your example
# cr_query = cr.works(query="queryname", cursor = "*", cursor_max = 1000, progress_bar = True)
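
The rule for picking `cursor_max` can be written as a one-line function. This is only a sketch of the rule described above; the name `choose_cursor_max` and the `cap` parameter are assumptions, not part of habanero.

```python
def choose_cursor_max(total_results, cap=1000):
    """Fetch everything when the result set is small, otherwise stop at `cap`."""
    return min(total_results, cap)

print(choose_cursor_max(350))   # → 350
print(choose_cursor_max(8200))  # → 1000
```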

In order to work with the individual results of our permafrost query, we need to extract them from `cr_permafrost`. `cr_permafrost` is a list of pages with 20 results per page, so it has 50 pages of 20 results each, giving us our 1,000 results. To get a list in which each element is one result, we need to do some data manipulation.

In [22]:
#Confirm the number of pages
len(cr_permafrost)

50

In the cell below, we use two list comprehensions to get the results from our query. The first one creates a list of length 50 that holds only the items of each page instead of the entire page dictionary. The second one flattens these nested lists so that we have a single list of 1,000 elements, where each element is one result.

In [53]:
permafrost_items = [k['message']['items'] for k in cr_permafrost] # get items for all pages
permafrost_items = [item for sublist in permafrost_items for item in sublist] # restructure list
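
To see the two steps on a small scale, the sketch below applies them to three tiny hand-written pages that stand in for `cr_permafrost`. It also shows `itertools.chain.from_iterable` as an alternative way to flatten, which is not used in the original notebook.

```python
from itertools import chain

# Stand-in for cr_permafrost: a list of page dictionaries (3 tiny pages here)
pages = [
    {"message": {"items": [{"DOI": "a"}, {"DOI": "b"}]}},
    {"message": {"items": [{"DOI": "c"}]}},
    {"message": {"items": []}},
]

per_page = [page["message"]["items"] for page in pages]     # one list per page
flat = [item for sublist in per_page for item in sublist]   # one flat list
flat_alt = list(chain.from_iterable(per_page))              # same result

print(len(per_page), len(flat))
print(flat == flat_alt)
```

The nested comprehension reads left to right like two nested `for` loops: the outer loop walks over pages, the inner loop over the items on each page.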

We'll do the same for your query results so that you can have some fun plotting later in the notebook. Uncomment and run the cell below.

In [None]:
#query_items = [k['message']['items'] for k in cr_query] # get items for all pages
#query_items = [item for sublist in query_items for item in sublist] # restructure list

## 1.5 More on Searches 
You may be curious about the `cr.works` function that we've been using to get our data. This is the function that processes your topical "search". There are different arguments, and we've seen how changing these arguments can help us get more search results than just the first 20. 

Earlier, we asked you to limit your search term to a single word. This is not a hard restriction; we did it for simplicity. You can query terms that are more specific and include more words. Try it out below! Keep in mind that the more specific your search query, the fewer results you may see. Feel free to play around with it.

In [None]:
query2 = cr.works(query = 'labrador retriever')
query2['message']['total-results']
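
For intuition about what a multi-word search looks like on the wire, the sketch below encodes a search term into a URL for the `/works` endpoint of the Crossref REST API, which accepts `query` and `rows` parameters. It does not contact the API, and it is not necessarily how habanero builds its requests; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.crossref.org/works"

def build_search_url(search_term, rows=20):
    """Build a search URL; spaces in the term are encoded as '+'."""
    params = {"query": search_term, "rows": rows}
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("labrador retriever"))
```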

# Bibliography <a id = '4'></a>
- Paul Oldham - Adapted CrossRef R guide to Python. https://poldham.github.io/abs/crossref.html#introduction

---

Notebook originally developed by: Keilyn Yuzuki, Anjali Unnithan

This chapter originated as a Data Science Module: http://data.berkeley.edu/education/modules