# Exercise 2 - Working with DOI lists and Article Level Metrics

In this exercise, we will cover how to collect DOIs and Article Level Metrics (ALM) through an API, along with some related analysis.

## Table of Contents

- [Part A: Collecting DOIs](#Part-A:-Collecting-DOIs)
- [Part B: Collecting and analysing ALM Data](#Part-B:-Collecting-and-analysing-ALM-Data)

Load the required packages before the start of the exercise.

In [19]:
%load_ext autoreload
%autoreload 2

import os
# Replace the placeholder below with your own PLOS API key
os.environ['PLOS_API_KEY'] = 'user api key'

import sys
sys.path.append("../modules/orcid-python")
sys.path.append("../modules/pyalm")


import requests
import time
import orcid
import pyalm.pyalm as pyalm
sys.path.append("./modules/pyalm/pyalm")
import utilities.plossearch as search


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Part A: Collecting DOIs

Back to [Table of Contents](#Table-of-Contents).

The first part of this exercise shows how to collect DOIs from a different source, a publisher API. We use the PLOS Search API as an example because, as discussed in class, the PLOS Lagotto instance has the most information on article level metrics.

We will first show an example of using the provided API wrapper and then you will use this to gather Article Level Metrics information on some authors from Caltech.

In [20]:
# Initiate and populate a query object
query = search.Request('author_affiliate:"California Institute of Technology"')

# Initiate the actual API call and get some results
response = query.get()
response

{u'response': {u'docs': [{u'doi': [u'10.1371/journal.pone.0026543']},
   {u'doi': [u'10.1371/journal.pone.0007757']},
   {u'doi': [u'10.1371/journal.pone.0008793']},
   {u'doi': [u'10.1371/journal.ppat.1001225']},
   {u'doi': [u'10.1371/journal.pone.0022201']},
   {u'doi': [u'10.1371/journal.pgen.0020117']},
   {u'doi': [u'10.1371/journal.pone.0035934']},
   {u'doi': [u'10.1371/journal.pone.0000749']},
   {u'doi': [u'10.1371/journal.pone.0000787']},
   {u'doi': [u'10.1371/journal.pcbi.1000349']},
   {u'doi': [u'10.1371/journal.pbio.1000444']},
   {u'doi': [u'10.1371/journal.pbio.1001153']},
   {u'doi': [u'10.1371/journal.pone.0046473']},
   {u'doi': [u'10.1371/journal.pone.0012353']},
   {u'doi': [u'10.1371/journal.pone.0029172']},
   {u'doi': [u'10.1371/journal.pone.0021074']},
   {u'doi': [u'10.1371/journal.pone.0133682']},
   {u'doi': [u'10.1371/journal.pbio.0040112']},
   {u'doi': [u'10.1371/journal.pgen.0030069']},
   {u'doi': [u'10.1371/journal.pcbi.1000354']},
   {u'doi': [u'10.

This gives 220 DOIs found at PLOS that match the affiliation term "California Institute of Technology". You might want to change the search term to see if there are other articles, perhaps listed under Caltech or other variations of the name.

This search uses the same terms as the Advanced Search form on the PLOS website (http://www.plosone.org/search/advanced?noSearchFlag), so you can use that form to construct a more advanced search and then pass it to the function above. For instance, a more complex search for Caltech might look like this:

In [21]:
# Initiate and populate a query object
query = search.Request("""
    author_affiliate:"California Institute of Technology"
    OR
    author_affiliate:"Caltech"
                       """)

# Initiate the actual API call and get some results
caltech = query.get()
len(caltech['response']['docs'])

226
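
The OR-joined query above can also be assembled from a list of name variants, which makes it easier to try other spellings of an affiliation. The following is a minimal sketch: `build_or_query` is a hypothetical helper, not part of the `plossearch` module, and it assumes the values contain no double quotes.

```python
def build_or_query(field, values):
    """Join several values for one search field with OR."""
    clauses = ['{0}:"{1}"'.format(field, value) for value in values]
    return " OR ".join(clauses)


variants = ["California Institute of Technology", "Caltech"]
query_string = build_or_query("author_affiliate", variants)
print(query_string)
```

The resulting string could then be passed to `search.Request(query_string)` in place of the hand-written query.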

<div class="alert alert-success">
Construct a search that looks for papers from Martin Karplus, Robert Grubbs or Eric Betzig. You should retrieve two articles.
</div>

In [22]:
# Initiate and populate a query object
query = search.Request("""
    author:"Eric Betzig"
    OR
    author:"Robert Grubbs"
    OR
    author:"Martin Karplus"
                       """)

# Initiate the actual API call and get some results
response = query.get()
len(response['response']['docs'])

2

In [23]:
assert len(response['response']['docs']) == 2

## Part B: Collecting and analysing ALM Data

Back to [Table of Contents](#Table-of-Contents).

<div class="alert alert-success">
Based on the example notebooks, obtain Article Level Metrics data on these two articles from the PLOS ALM API. Note that the ALM API wrapper accepts either a single DOI or a list of DOIs. You will need to construct a list of the two DOIs to pass to the function. Obtain the number of Europe PubMed Central citations for each article.
</div>

In [24]:
# Need to configure the API URL as per the notebook example
pyalm.config.APIS = { 'plos' : {'url': 'http://alm.plos.org/api/v5/articles'},
                      'det'  : {'url' : 'http://det.labs.crossref.org/api/v5/articles'}
                    }

In [25]:
response.get('response').get('docs')

dois = [doc.get('doi')[0] for doc in response.get('response').get('docs')]

print(dois)

#pyalm.get_alm(dois, info='detail', instance='plos')

[u'10.1371/journal.pbio.1000137', u'10.1371/journal.pbio.0040144']
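
The list comprehension above assumes every document carries a non-empty `doi` field. As a hedged sketch, the extraction can be made tolerant of documents without one; `extract_dois` is a hypothetical helper, and the sample dictionary only mimics the shape of the search response printed earlier.

```python
def extract_dois(search_response):
    """Return the first DOI of each document in a search response."""
    docs = search_response.get("response", {}).get("docs", [])
    dois = []
    for doc in docs:
        doi_values = doc.get("doi") or []
        if doi_values:  # skip documents with no DOI field
            dois.append(doi_values[0])
    return dois


# Sample data shaped like the response printed above.
sample_response = {
    "response": {
        "docs": [
            {"doi": ["10.1371/journal.pbio.1000137"]},
            {"doi": ["10.1371/journal.pbio.0040144"]},
            {},  # a document without a DOI, to show that it is skipped
        ]
    }
}
print(extract_dois(sample_response))
```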


In [26]:
# Create a list of DOIs from the response above. You could either create a new list or use a list comprehension
dois = [doc.get('doi')[0] for doc in response.get('response').get('docs')]
plos_alm = pyalm.get_alm(dois, info='detail', instance='plos')

type(plos_alm)  # a list here, not a dict

list

In [28]:
# Get the title and number of EuPMC citations for each article. Create a list of tuples called cites as follows
# [('title1', citations), ('title2', citations)]
# plos_alm is a list of article objects, so iterate over it directly
cites = []
for article in plos_alm:
    cites.append((article.title, article.sources['pmceurope'].metrics.total))

print(cites)

In [29]:
assert cites[0] == (u'Self-Organization of the <i>Escherichia coli</i> Chemotaxis Network Imaged with Super-Resolution Light Microscopy', 108)
assert cites[1] == (u'A Src-Like Inactive Conformation in the Abl Tyrosine Kinase Domain', 90)

<div class="alert alert-success">
For the first 50 articles returned by the California Institute of Technology search above, output the number of Europe PMC citations, Facebook posts and tweets. It may take some time for the API to return results for 50 articles.
</div>

In [30]:
# Create a list of the first 50 DOIs and get the ALMs from PLOS API
caldois = [doc['doi'][0] for doc in caltech['response']['docs']][0:50]
cal_alm = pyalm.get_alm(caldois, info='detail', instance='plos')
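
If a single request for 50 DOIs is slow or times out, one option is to split the list into smaller batches and request each batch separately. The sketch below only shows the batching step; `chunked` is a hypothetical helper, the batch size of 20 is an arbitrary choice, and the identifiers are placeholders rather than real DOIs.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder identifiers standing in for the 50 DOIs in `caldois`.
example_dois = ["doi-{0}".format(i) for i in range(50)]
batches = list(chunked(example_dois, 20))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `pyalm.get_alm(batch, info='detail', instance='plos')` and the returned lists concatenated, assuming the wrapper returns a list for a list of DOIs as it did for `plos_alm`.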

In [31]:
# Construct a list of tuples called `results` as above with each of the elements required plus the title
# cal_alm is a list of article objects, so iterate over it directly
results = []
for article in cal_alm:
    results.append((article.title,
                    article.sources['pmceurope'].metrics.total,
                    article.sources['facebook'].metrics.total,
                    article.sources['twitter'].metrics.total))

results

In [None]:
assert len(results) == 50

<div class="alert alert-success">
Find articles with at least some tweets and obtain the names of the accounts that posted them. Identify whether any account tweeted about more than one article. Note that some accounts might tweet about the same article twice; we are only interested in cases where the same account tweets about more than one article. Make a list called `common_tweeters` that contains the account handles for anyone who tweeted about more than one article.
</div>

In [13]:
# Probably easier to iterate through the article objects than to cross-reference the list we just built,
# but either will work. cal_alm is a list of article objects, so iterate over it directly.
tweeted = []
for article in cal_alm:
    if article.sources['twitter'].metrics.total != 0:
        tweeted.append(article)
    
len(tweeted)

NameError: name 'cal_alm' is not defined

In [14]:
assert len(tweeted) > 10

AssertionError: 

In [15]:
# Now obtain all the account names. Look at the example notebook for how to get this information.
unique_accounts = set()
for article in tweeted:
    for tweet in article.sources['twitter'].events:
        unique_accounts.add(tweet['event']['user'])
        
len(unique_accounts)


0

In [16]:
# Now check whether an account occurs tweeting more than one article. This requires a little care and attention. 
common_tweeters = []

for account in unique_accounts:
    count = 0
    for article in tweeted:
        tweeters = [tweet['event']['user'] for tweet in article.sources['twitter'].events]
        if account in tweeters:
            count+=1
            
    if count > 1:
        common_tweeters.append(account)

len(common_tweeters)

0

In [17]:
common_tweeters

[]
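
The nested loops above rescan every article for every account. An alternative, sketched here on made-up data, records the set of articles each account tweeted about in a single pass, so repeat tweets about the same article are ignored automatically. `find_common_tweeters` and the sample handles are hypothetical; with real ALM data the handles would come from `tweet['event']['user']` as in the cells above.

```python
from collections import defaultdict


def find_common_tweeters(tweets_by_article):
    """Return accounts that tweeted about more than one article.

    `tweets_by_article` maps an article identifier to a list of account
    handles, one per tweet (duplicates allowed).
    """
    articles_by_account = defaultdict(set)
    for article_id, handles in tweets_by_article.items():
        for handle in handles:
            # A set ignores repeat tweets about the same article.
            articles_by_account[handle].add(article_id)
    return sorted(
        handle
        for handle, articles in articles_by_account.items()
        if len(articles) > 1
    )


# Made-up handles and article ids, for illustration only.
sample = {
    "article-1": ["alice", "alice", "bob"],
    "article-2": ["bob", "carol"],
    "article-3": ["carol", "bob"],
}
print(find_common_tweeters(sample))
```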

In [18]:
assert common_tweeters != []

AssertionError: 