# Exercise 2 - Working with DOI lists and Article Level Metrics

In this exercise, we cover how to collect DOIs and Article Level Metrics (ALM) through APIs, along with some related analysis.

## Table of Contents

- [Part A: Collecting DOIs](#Part-A:-Collecting-DOIs)
- [Part B: Collecting and analysing ALM Data](#Part-B:-Collecting-and-analysing-ALM-Data)

Load the required packages before the start of the exercise.

In [None]:
%load_ext autoreload
%autoreload 2

import os
#os.environ['PLOS_API_KEY'] = 'user api key'

import sys
sys.path.append("../modules/orcid-python")
sys.path.append("../modules/pyalm")


import requests
import time
import orcid
import pyalm.pyalm as pyalm
sys.path.append("./modules/pyalm/pyalm")
import utilities.plossearch as search


## Part A: Collecting DOIs

Back to [Table of Contents](#Table-of-Contents).

The first part of this exercise shows how to collect DOIs from a different source: a publisher API. We use the PLOS Search API as an example because, as discussed in class, the PLOS Lagotto instance has the most information on article level metrics.

We will first show an example of using the provided API wrapper, and then you will use it to gather Article Level Metrics information on some authors from Caltech.

In [None]:
# Initiate and populate a query object
query = search.Request('author_affiliate:"California Institute of Technology"')

# Initiate the actual API call and get some results
response = query.get()
response

This search finds 220 DOIs at PLOS that match the affiliation term "California Institute of Technology". You might want to change the search term to see whether other articles turn up, perhaps listed under Caltech or other variations of the name.
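
The total number of matches and the number of documents actually returned are not necessarily the same thing. The sketch below uses a hand-written mock in the Solr JSON layout, where `numFound` is the total and `docs` holds only the current page. That the wrapper passes this raw structure through is an assumption (the cells in this notebook rely only on `response['response']['docs']`), and the DOIs shown are placeholders, not real search results.

```python
# Mock of a Solr-style search response (hypothetical values, not real API output).
mock_response = {
    "response": {
        "numFound": 220,  # total number of matches reported by the server
        "start": 0,       # offset of the first returned document
        "docs": [         # only the current page of results
            {"doi": ["10.1371/example.0000001"]},  # placeholder DOI
            {"doi": ["10.1371/example.0000002"]},  # placeholder DOI
        ],
    }
}

total_matches = mock_response["response"]["numFound"]
page_size = len(mock_response["response"]["docs"])

print("Total matches:", total_matches)
print("Documents on this page:", page_size)
```

If the real response has this shape, comparing the two numbers is a quick way to check whether a query returned everything or only a first page.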

This search accepts the same terms as the Advanced Search form on the PLOS website (http://www.plosone.org/search/advanced?noSearchFlag), so you can use that form to construct a more advanced search and then pass it to the function above. For instance, a more complex search for Caltech might look like this:

In [None]:
# Initiate and populate a query object
query = search.Request("""
    author_affiliate:"California Institute of Technology"
    OR
    author_affiliate:"Caltech"
                       """)

# Initiate the actual API call and get some results
caltech = query.get()

# Count the number of documents returned.
len( caltech['response']['docs'] )
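
An OR-joined query like the one above can also be assembled from a list of values rather than typed by hand. This is a minimal sketch: `build_or_query` is a hypothetical helper, not part of the provided wrapper, and it only builds the query string.

```python
def build_or_query(field, values):
    """Join one quoted field:"value" clause per value with OR."""
    clauses = ['{0}:"{1}"'.format(field, value) for value in values]
    return " OR ".join(clauses)


# Build the same two-clause affiliation query used in the cell above.
query_string = build_or_query(
    "author_affiliate",
    ["California Institute of Technology", "Caltech"],
)
print(query_string)
```

The resulting string could then be passed to `search.Request(...)` in the same way as the literal query above. Whether the wrapper treats this single-line form identically to the multi-line one is an assumption worth checking.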

<div class="alert alert-success">
Construct a search that looks for papers by any of the authors Martin Karplus, Robert Grubbs, or Eric Betzig. Store the results of your call to `query.get()` in a variable named `response`. Your query should retrieve two articles.
</div>

In [None]:
### BEGIN SOLUTION

# Initiate and populate a query object
query = search.Request("""
    author:"Eric Betzig"
    OR
    author:"Robert Grubbs"
    OR
    author:"Martin Karplus"
                       """)

# Initiate the actual API call and get some results
response = query.get()

### END SOLUTION

# Count documents returned.
print( "Document count: " + str( len( response['response']['docs'] ) ) )

# Output documents returned.
print( "Result: " + str( response['response']['docs'] ) )

In [None]:
assert len( response['response']['docs'] ) == 2

In [None]:
# We can also check the DOIs of the two items.
doi_list = []
for current_item in response['response']['docs']:
    
    current_doi = current_item[ 'doi' ]
    doi_list.append( current_doi )
    
#-- END loop over documents. --#

# get DOI 1 and 2
doi_1 = doi_list[ 0 ][ 0 ]
doi_2 = doi_list[ 1 ][ 0 ]

# are they right?
assert doi_1.lower() == '10.1371/journal.pbio.1000137'
assert doi_2.lower() == '10.1371/journal.pbio.0040144'
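
The double index in `doi_list[ 0 ][ 0 ]` is needed because each document stores its `doi` field as a list. The sketch below shows that structure on a small mock and skips records that lack a DOI. `extract_dois` is a hypothetical helper, and the third record is an invented edge case; the search API may never return such a record.

```python
# Mock documents in the same layout as response['response']['docs'].
mock_docs = [
    {"doi": ["10.1371/journal.pbio.1000137"]},
    {"doi": ["10.1371/journal.pbio.0040144"]},
    {"title": "record without a DOI"},  # hypothetical edge case
]


def extract_dois(docs):
    """Return the first DOI of every document that has one."""
    found_dois = []
    for doc in docs:
        doi_values = doc.get("doi") or []  # missing or empty field -> []
        if doi_values:
            found_dois.append(doi_values[0])
    return found_dois


found = extract_dois(mock_docs)
print(found)
```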

## Part B: Collecting and analysing ALM Data

Back to [Table of Contents](#Table-of-Contents).

<div class="alert alert-success">
Based on the example notebooks, obtain Article Level Metrics data on these two articles from the PLOS ALM API (`pyalm`). Note that the ALM API wrapper accepts a list of DOIs as well as a single DOI. You will need to construct a list of the two DOIs (stored in the variable `dois`) to pass to the `pyalm.get_alm` function. Then, obtain the number of Europe PubMed Central citations for each article.
</div>

In [None]:
# Need to configure the API URL as per the notebook example
pyalm.config.APIS = { 
    'plos' : {'url': 'http://alm.plos.org/api/v5/articles'},
    'det'  : {'url' : 'http://det.labs.crossref.org/api/v5/articles'}
}

In [None]:
# Make a list of the DOI values present in the response from part A.
dois = []

### BEGIN SOLUTION

# traditional list creation
for current_item in response['response']['docs']:
    
    current_doi = current_item[ 'doi' ][ 0 ]
    dois.append( current_doi )
    
#-- END loop over documents. --#

# OR - list comprehension
#dois = [ doc.get('doi')[0] for doc in response.get('response').get('docs') ]

### END SOLUTION

print( "Found " + str( len( dois ) ) + " DOIs: " + str( dois ) )

In [None]:
assert len( dois ) == 2

In [None]:
# get DOI 1 and 2
doi_1 = dois[ 0 ]
doi_2 = dois[ 1 ]

# are they right?
assert doi_1.lower() == '10.1371/journal.pbio.1000137'
assert doi_2.lower() == '10.1371/journal.pbio.0040144'

In [None]:
# Use the pyalm.get_alm() method to retrieve PLOS Article Level Metrics for the
#    two articles.  Store the results in a variable named "plos_alm".

### BEGIN SOLUTION
plos_alm = pyalm.get_alm(dois, info='detail', instance='plos')
### END SOLUTION

print( plos_alm )

In [None]:
assert isinstance( plos_alm, dict ) # should be dict

In [None]:
# From the PLOS ALM data, get the title and number of EuPMC citations for
#    each article - create a list of tuples called "cites" as follows:
#    [('title1', citations), ('title2', citations)]

cites = []

### BEGIN SOLUTION

for article in plos_alm['articles']:

    cites.append( ( article.title, article.sources['pmceurope'].metrics.total ) )
    
### END SOLUTION

# print out cites
print( "cites per article: " + str( cites ) )

In [None]:
assert cites[0] == (u'Self-Organization of the <i>Escherichia coli</i> Chemotaxis Network Imaged with Super-Resolution Light Microscopy', 108)
assert cites[1] == (u'A Src-Like Inactive Conformation in the Abl Tyrosine Kinase Domain', 91)
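
The chained lookup `article.sources['pmceurope'].metrics.total` raises a `KeyError` if an article has no entry for a given source. The sketch below wraps the lookup so that a missing source yields a default instead. `source_total` is a hypothetical helper, and the mock object only imitates the attributes used in the cells above. That real `pyalm` articles expose `sources` as a dict with a `.get` method is an assumption.

```python
from types import SimpleNamespace


def source_total(article, source_name, default=0):
    """Return article.sources[source_name].metrics.total, or default if absent."""
    source = article.sources.get(source_name)
    if source is None:
        return default
    return getattr(source.metrics, "total", default)


# Hypothetical stand-in for a pyalm article object; title and count are made up.
mock_article = SimpleNamespace(
    title="Example article",
    sources={
        "pmceurope": SimpleNamespace(metrics=SimpleNamespace(total=12)),
    },
)

print(source_total(mock_article, "pmceurope"))  # source present
print(source_total(mock_article, "twitter"))    # source missing, falls back to 0
```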

<div class="alert alert-success">
    For the first 50 articles returned by the California Institute of Technology affiliation search above (stored in the variable `caltech`), retrieve the PLOS Article Level Metrics for each article, then output the number of Europe PMC citations, Facebook posts, and tweets per article. It may take some time for the API to return results for 50 articles.
</div>

In [None]:
# Create a list of the first 50 DOIs stored in variable "caldois".
caldois = []

### BEGIN SOLUTION
caldois = [doc['doi'][0] for doc in caltech['response']['docs']][0:50]
### END SOLUTION

print( "caldois list:" )
doi_counter = 0
for current_doi in caldois:
    
    doi_counter += 1
    print( "- DOI " + str( doi_counter ) + ": " + current_doi )
    
#-- END loop over DOIs. --#

In [None]:
assert len( caldois ) == 50

In [None]:
# get the ALMs from PLOS API for these articles and store them in "cal_alm".

### BEGIN SOLUTION
cal_alm = pyalm.get_alm(caldois, info='detail', instance='plos')
### END SOLUTION

print( "Got back " + str( len( cal_alm[ 'articles' ] ) ) + " ALM records." )

In [None]:
assert isinstance( cal_alm, dict ) # should be dict

In [None]:
# Construct a list of tuples in a variable named "results" formatted as follows:
# [
#     ('title1','pmceurope1','facebook1','twitter1'),
#     ('title2','pmceurope2','facebook2','twitter2'),
#     ...
# ]
results = []

### BEGIN SOLUTION
for article in cal_alm['articles']:
    results.append((article.title, 
                   article.sources['pmceurope'].metrics.total,
                   article.sources['facebook'].metrics.total,
                   article.sources['twitter'].metrics.total))
### END SOLUTION

print( "results list:" )
result_counter = 0
for current_result in results:
    
    result_counter += 1
    print( "- result " + str( result_counter ) + ": " + str( current_result ) )
    
#-- END loop over results. --#

In [None]:
assert len(results) == 50

<div class="alert alert-success">
Make a list called `common_tweeters` that contains the Twitter account handles of everyone who tweeted about more than one article. To do this, first find articles with one or more tweets and store them in a list named `tweeted`. Then, make a set of the unique account names of the users who tweeted about one or more articles, and store this set in a variable named `unique_accounts`. Finally, identify the accounts that tweeted about more than one article, store the list of their Twitter handles in the variable `common_tweeters`, then output the common tweeters.
<br />
<br />
Note that some accounts might tweet about the same article twice. We are only interested in cases where the same account is tweeting about more than one article. 
</div>

In [None]:
# Find articles that were tweeted at least once.  Store them in list "tweeted".
tweeted = []

### BEGIN SOLUTION
for article in cal_alm['articles']:
    if article.sources['twitter'].metrics.total != 0:
        tweeted.append(article)
### END SOLUTION

print( "Tweeted articles: " + str( len( tweeted ) ) )

In [None]:
assert len(tweeted) > 10

In [None]:
# Now obtain all the account names. Look at the example notebook
#    "3. Working in practice.ipynb" for how to get this information.
unique_accounts = set()

### BEGIN SOLUTION
for article in tweeted:
    for tweet in article.sources['twitter'].events:
        unique_accounts.add(tweet['event']['user'])
        
### END SOLUTION

print( "Unique Account Count: " + str( len(unique_accounts) ) )

In [None]:
# Now check whether each twitter account tweeted about more than one article.
#    This requires a little care and attention.  Store the handle of each twitter
#    user who tweeted about more than one article in a list in "common_tweeters".
common_tweeters = []

### BEGIN SOLUTION
for account in unique_accounts:
    count = 0
    for article in tweeted:
        tweeters = [tweet['event']['user'] for tweet in article.sources['twitter'].events]
        if account in tweeters:
            count+=1
            
    if count > 1:
        common_tweeters.append(account)
### END SOLUTION

print( "Found " + str( len( common_tweeters ) ) + " common tweeters:" )
print( common_tweeters )

In [None]:
assert common_tweeters != []
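
The nested loops above rebuild the list of tweeters for every account. One alternative is to map each account to the set of articles it tweeted about, so that repeated tweets about the same article collapse automatically. The sketch below runs on invented data: the article identifiers and handles are hypothetical and stand in for the `events` of each article's Twitter source.

```python
from collections import defaultdict

# Hypothetical data: article identifier -> handle of the author of each tweet.
mock_tweets = {
    "article-1": ["alice", "alice", "bob"],  # alice tweeted this article twice
    "article-2": ["bob", "carol"],
    "article-3": ["carol", "bob"],
}

# Map each account to the set of distinct articles it tweeted about.
articles_by_account = defaultdict(set)
for article_id, handles in mock_tweets.items():
    for handle in handles:
        articles_by_account[handle].add(article_id)

# Keep only accounts that appear under more than one article.
common = sorted(
    handle
    for handle, article_ids in articles_by_account.items()
    if len(article_ids) > 1
)
print(common)
```

Because each account maps to a set, an account that tweets the same article twice (like `alice` here) still counts as having tweeted about one article only.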