### Setup - Initialize packages

In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../modules/orcid-python")
sys.path.append("../modules/pyalm")
import time
import orcid
import pyalm.pyalm as pyalm

## Issues in practice - gathering and integrating data from multiple sources

It is common to work across multiple data sources to gather information. A very common pattern is to search in one location to create a list of identifiers and then use those identifiers to query another API. In the ORCID example above we created a list of DOIs from a single ORCID profile. We could use those DOIs to obtain further information from the Crossref API and other sources. This models a common path for analysis of research outputs: identifying a corpus and then seeking information on its performance.

In this example we will build on the ORCID and Crossref examples to collect a set of work identifiers from an ORCID profile and use a range of APIs to identify additional metadata as well as information on the performance of those articles. In addition to the ORCID API we will use the PLOS Lagotto API. Lagotto is the software that was built to support the Article Level Metrics program at PLOS, the Open Access publisher, and its API provides information on various metrics of PLOS articles. A range of other publishers and service providers, including Crossref, also provide an instance of this API, meaning the same tools can be used to collect information on articles from a range of sources.

## The Lagotto API

The module `pyalm` is a wrapper for the Lagotto API, which is served from a range of hosts. In this example we will work with instances run by PLOS and by Crossref (the `det` instance). We first need to provide the URLs for these instances to our wrapper. Then we can obtain some information for a single DOI to see what the returned data looks like.

In [2]:
pyalm.config.APIS = { 'plos' : {'url': 'http://alm.plos.org/api/v5/articles'},
                      'det'  : {'url' : 'http://det.labs.crossref.org/api/v5/articles'}
                    }

In [3]:
det_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677', info='detail', instance='det')
det_alm_test

HTTPError: 404 Client Error: Not Found for url: http://eventdata.crossref.org/api/v5/articles?info=detail&ids=10.1371%2Fjournal.pbio.1001677

The library returns a python dictionary containing two elements. The `articles` key contains the actual data and the `meta` key includes general information on the results of the interaction with the API. In this case it has returned one page of results containing one object (because we only asked about one DOI). If we want to collect a lot of data this information helps in the process of paging through results. It is common for APIs to impose some limit on the number of results returned so as to ensure performance. By default the Lagotto API has a limit of 50 results.

The `articles` key holds a list of `ArticleALM` objects as its value. Each `ArticleALM` object has a set of internal attributes that contain information on each of the metrics that PLOS collects. These are derived from various data providers and are called 'sources'. Each can be accessed by name from a dictionary called `sources`. The `iterkeys()` method (available on Python 2 dictionaries) provides an iterator that lets us loop over the set of keys in a dictionary. Within the source object there is a range of information that we will dig into.

In [4]:
article = det_alm_test.get('articles')[0]
article.title

NameError: name 'det_alm_test' is not defined

In [None]:
for source in article.sources.iterkeys():
    print source, article.sources[source].metrics.total
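
The loop above relies on `iterkeys()` and the `print` statement, both of which are specific to Python 2. A minimal sketch of the same access pattern that runs on Python 2 and 3 is shown here. It uses a plain dictionary of hypothetical totals as a stand-in for `article.sources`; the source names and numbers are invented for illustration and are not API output.

```python
# Hypothetical stand-in for article.sources: source name -> total count.
# The names and numbers are illustrative only, not real API output.
example_totals = {"wikipedia": 3, "crossref": 0, "twitter": 0}

# Iterating over a dictionary yields its keys in both Python 2 and 3,
# so no call to iterkeys() is needed. sorted() gives a stable order.
lines = []
for source in sorted(example_totals):
    lines.append("{0} {1}".format(source, example_totals[source]))

print("\n".join(lines))
```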

The DET service only has a record of citations to this article from Wikipedia. As we will see below the PLOS service returns more results. This is because some of the sources are not yet being queried by DET.  

Because this is a PLOS paper we can also query the PLOS Lagotto instance for the same article.

In [None]:
plos_alm_test = pyalm.get_alm('10.1371/journal.pbio.1001677', info='detail', instance='plos')
article_plos = plos_alm_test.get('articles')[0]
article_plos.title

In [None]:
for source in article_plos.sources.iterkeys():
    print source, article_plos.sources[source].metrics.total

The PLOS instance provides a greater range of information and, in many cases, also reports larger numbers than the DET instance. For those sources that are provided by both API instances we can compare the results returned.

In [None]:
for source in article.sources.iterkeys():
    print source, article.sources[source].metrics.total, article_plos.sources[source].metrics.total

The PLOS Lagotto instance is both collecting more information and has a wider range of information sources. Comparing the results from the PLOS and DET instances illustrates the issues of coverage and completeness discussed previously. The data may be sparse for a variety of reasons and it is important to have a clear idea of the strengths and weaknesses of a particular data source or aggregator. In this case the DET instance is returning entries for some sources for which it does not have data.
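
When two aggregators are compared it can help to separate the sources they share from those that only one of them reports, since indexing one dictionary with a key that exists only in the other would raise a `KeyError`. The following is a sketch under stated assumptions: the two dictionaries are hypothetical stand-ins for the per-source totals of each instance, and every number in them is invented.

```python
# Hypothetical per-source totals for one article; all numbers are invented.
det_totals = {"wikipedia": 1, "crossref": 0, "pmceurope": 0}
plos_totals = {"wikipedia": 2, "crossref": 5, "twitter": 7}

# Set operations on the keys separate shared sources from exclusive ones.
shared = sorted(set(det_totals) & set(plos_totals))
only_det = sorted(set(det_totals) - set(plos_totals))
only_plos = sorted(set(plos_totals) - set(det_totals))

# Compare counts only where both instances report the source.
comparison = [(s, det_totals[s], plos_totals[s]) for s in shared]
for source, det_count, plos_count in comparison:
    print("{0}: det={1} plos={2}".format(source, det_count, plos_count))
print("only in det: {0}".format(only_det))
print("only in plos: {0}".format(only_plos))
```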

We can dig deeper into the events themselves that the metrics.total count aggregates. The API wrapper collects these into an event object within the source object. These contain the JSON returned from the API in most cases. For instance the Crossref source is a list of JSON objects containing information on an article that cites our article of interest. The first citation event in the list is a citation from the journal JASIST by Du et al.

In [None]:
article_plos.sources['crossref'].events[0]

Another source in the PLOS data is Twitter. In the case of the twitter events (individual tweets) this provides the text of the tweet, user ids, user names, url of the tweet and the date. We can see from the length of the events list that there are at least 130 tweets that link to this article. 

In [None]:
len(article_plos.sources['twitter'].events)

Again, noting the issues of coverage, scope and completeness, it is important to consider the limitations of this data. The count represents the results returned by searching the Twitter API for the DOI or URL of the article. Other tweets that discuss the article may not include a link, and the Twitter search API also has limitations that can lead to incomplete results. The number must therefore be seen as a lower bound.

We can look more closely at data on the first tweet on the list. Bear in mind that the order of the list is not necessarily special. This is not the first tweet about this article chronologically. 

In [None]:
article_plos.sources['twitter'].events[0]

We could use the twitter API to understand more about this person. For instance we could look at their twitter followers and followees or analyse the text of their tweets for topic modelling. Much work on social media interactions is done with this kind of data, using forms of network and text analysis described elsewhere in the book.

A different approach is to integrate this data with information from another source. We might be interested for instance in whether the author of this tweet is a researcher, or whether they have authored research papers. One thing we could do is search the ORCID API to see if there are any ORCID profiles that link to this Twitter handle.

In [None]:
twitter_search = orcid.search("catmacOA")

for result in twitter_search:
    print unicode(result)
    print result.researcher_urls

So the person with this Twitter handle seems to have an ORCID profile. That means we can also use ORCID to gather more information on their outputs. Perhaps they have authored work which is relevant to our article?

In [None]:
cm = orcid.get("0000-0001-9623-2225")
for pub in cm.publications[0:5]:
    print pub.title

From this analysis we can show that this tweet is actually from one of the authors of the article.

To make this process easier we can write a convenience function to go from a twitter user handle to try and find an ORCID for that person. There are a lot of ways this could be improved. One suggestion which is described in the Parameters of the function but not implemented is to change the function to return `True` or `False` so that the function could be used in a call to the `filter` python function.

In [None]:
def twitter2orcid(twitter_handle, twitter_users_name = None, resp = 'orcid', search_depth = 10):
    """
    Take a twitter handle or user name and return an ORCID (or True/False for use as a filter)
    
    Parameters
    ----------
    twitter_handle : str or unicode
                     A twitter handle
    twitter_users_name : str or unicode
                         A twitter user's full name
    resp : str
           'orcid' or 'filter' If 'orcid' (default) return an ORCID if successful, else None, if 'filter' return 
           True/False for use as a filter function.
    search_depth : int
                   The number of returned results to test before giving up and declaring no match
                   
    Returns
    -------
    out : str or bool
          Depending on the setting of resp returns either an ORCID (as returned by the ORCID search API) or True/False
    """
    
    search = orcid.search(twitter_handle)
    s = [r for r in search]
    orc = None
    i = 0
    while i < search_depth and orc == None and i < len(s):
        arr = [('twitter.com' in website.url) for website in s[i].researcher_urls]
        if True in arr:
            index = arr.index(True)
            url = s[i].researcher_urls[index].url
            if url.lower().endswith(twitter_handle.lower()):
                orc = s[i].orcid
                return orc
        
        i+=1
    
    return None
    

Let's do a quick test of the function

In [None]:
twitter2orcid('catmacOA')
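
The `filter` variant mentioned above is described in the docstring but not implemented. One possible way to get the same effect without changing `twitter2orcid` is to wrap any handle-to-ORCID lookup in a function that returns `True` or `False`. This is only a sketch: `make_orcid_filter` and `fake_lookup` are hypothetical names, and the lookup table is a placeholder so that the example runs without calling the ORCID API.

```python
def make_orcid_filter(lookup):
    """Wrap a handle -> ORCID lookup so that it returns True/False."""
    def has_orcid(handle):
        return lookup(handle) is not None
    return has_orcid


# Hypothetical stand-in for twitter2orcid so the sketch runs offline.
# The handle and the identifier are placeholders, not real records.
_known = {"example_handle": "0000-0000-0000-0000"}


def fake_lookup(handle):
    return _known.get(handle.lower())


handles = ["example_handle", "another_handle"]
# list() is needed on Python 3, where filter returns an iterator.
with_orcid = list(filter(make_orcid_filter(fake_lookup), handles))
print(with_orcid)
```

In the notebook, `make_orcid_filter(twitter2orcid)` could be passed to `filter` in the same way, bearing in mind that each call would trigger a search against the ORCID API.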

## Working with a corpus

In this case we will continue as previously to collect a set of works from a single ORCID profile. This collection could just as easily be a date range, or subject search at a range of other APIs. The target is to obtain a set of identifiers (in this case DOIs) that can be used to precisely query other data sources. This is a general pattern which reflects the issues of scope and source discussed above. The choice of how to construct a corpus to analyse will strongly affect the results and the conclusions that can be drawn.

In [None]:
# As previously, collect the set of DOIs available from an ORCID profile
cn = orcid.get("0000-0002-0068-716X")
exids = []
for pub in cn.publications:
    if pub.external_ids:
        exids = exids + pub.external_ids

DOIs = [exid.id for exid in exids if exid.type == "DOI"]
        
len(DOIs)

We have recovered 66 DOIs from the ORCID profile. Note that this isn't an identifier for every work (not all of them have DOIs). This illustrates an important point about data integration. In practice it is generally not worth the effort of attempting to integrate data on objects unless they have a unique identifier or key that can be used in multiple data sources. Hence the focus on DOIs and ORCIDs in these examples. Even in our search of the ORCID API for profiles that are associated with a Twitter account we used the Twitter handle as a unique ID to search on. 
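
The role of a shared identifier as a key can be sketched with a few placeholder records. The records, titles and DOIs below are hypothetical. The sketch keeps only works that carry a DOI and lower-cases the DOI before using it as a key, since DOIs are case-insensitive.

```python
# Hypothetical work records; titles and DOIs are placeholders.
works = [
    {"title": "Work A", "doi": "10.1234/Example.1"},
    {"title": "Work B", "doi": None},
    {"title": "Work C"},
    {"title": "Work A (duplicate)", "doi": "10.1234/example.1"},
]

keyed = {}    # normalised DOI -> first record seen with that DOI
dropped = []  # titles of works that cannot be keyed
for work in works:
    doi = work.get("doi")
    if doi:
        # setdefault keeps the first record and ignores later duplicates
        keyed.setdefault(doi.lower(), work)
    else:
        dropped.append(work["title"])

print("{0} keyed, {1} dropped".format(len(keyed), len(dropped)))
```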

While it is possible to work with names or titles of works and to disambiguate them, it is substantially more difficult. Other chapters deal with issues of data cleaning and disambiguation. Much work has been done on this basis but increasingly you will see that the first step in any analysis is to simply discard objects without a unique ID that can be used across data sources.

We can obtain data for these from the DET API. As is common with many APIs there is a limit to how many queries can be simultaneously run, in this case 50, so we divide our query into batches.  

In [None]:
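# Split into batches of at most 50; slicing from 50 onwards keeps every DOI
batches = [DOIs[i:i + 50] for i in range(0, len(DOIs), 50)]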
det_alms = []
for batch in batches:
    alms_response = pyalm.get_alm(batch, info="detail", instance="det")
    det_alms.extend(alms_response.get('articles'))

len(det_alms)
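
Hand-written slice boundaries are easy to get wrong: in Python an end index of `-1` excludes the last item, and starting the second slice at 51 would skip the item at index 50. A small generic helper avoids this. The sketch below uses a hypothetical `chunks` function and placeholder identifiers rather than real DOIs.

```python
def chunks(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder identifiers standing in for a list of 66 DOIs.
example_ids = ["id-{0}".format(n) for n in range(66)]
batches_demo = list(chunks(example_ids, 50))
batch_sizes = [len(batch) for batch in batches_demo]
print(batch_sizes)
```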

The DET API only provides information on a subset of Crossref DOIs. Data population has focussed on more recently published articles so only 24 responses are received in this case for the 66 DOIs we queried on. A good exercise would be to look at which of the DOIs are found and which are not. Let us see how much interesting data is available in the subset of DOIs for which we have data.

In [None]:
for r in [d for d in det_alms if d.sources['wikipedia'].metrics.total != 0]:
    print r.title
    print '     ', r.sources['pmceurope'].metrics.total, 'pmceurope citations'
    print '     ', r.sources['wikipedia'].metrics.total, 'wikipedia citations'
    

As discussed above this shows that the DET instance, while it provides information on a greater number of DOIs, has less complete data at this stage. Only four of the 24 responses have wikipedia references. You can change the code to look at the full set of 24 which shows very sparse data. The PLOS Lagotto instance provides more data but only on PLOS articles. However it does provide data on all the PLOS articles, going back earlier than the set returned by the DET instance. We can collect the set of articles from the profile published by PLOS.

In [None]:
plos_dois = []
for doi in DOIs:
    if doi.startswith('10.1371'): #This is quick and dirty, better would be to check Crossref API for publisher
        plos_dois.append(doi)

len(plos_dois)

In [None]:
plos_alms = pyalm.get_alm(plos_dois, info='detail', instance='plos').get('articles')

In [None]:
for article in plos_alms:
    print article.title
    print '     ', article.sources['crossref'].metrics.total, 'crossref citations'
    print '     ', article.sources['twitter'].metrics.total, 'tweets'

    

From the previous examples we know that we can obtain information on citing articles and tweets associated with these 66 articles. From that initial corpus we now have a collection of up to 86 related articles (cited and citing), a few hundred tweets that refer to (some of) those articles and perhaps 500 people if we include authors of both articles and tweets. Note how for each of these links our query is limited so we have a subset of all the related objects and agents. At this stage we probably have duplicate articles (one article might cite multiple in our set of seven) and duplicate people (authors in common between articles and authors who are also tweeting).

This data could be used for network analysis, to build up a new corpus of articles (by following the citation links) or to analyse the links between authors and those tweeting about the articles. We won't pursue an in-depth analysis here, but will gather the relevant objects, de-duplicate them as far as possible and count how many we have in preparation for future analysis.

In [None]:
# Collect all the citing DOIs and author names from citing articles
citing_dois = []
citing_authors = []
for article in plos_alms:
    for cite in article.sources['crossref'].events:
        citing_dois.append(cite['event']['doi'])
        citing_authors.extend(cite['event_csl']['author']) # Use 'extend' because the element is a list
print '\nBefore de-duplication'
print len(citing_dois), 'dois'
print len(citing_authors), 'citing authors'


# Easiest way to de-duplicate is to convert to a python set
citing_dois = set(citing_dois)
citing_authors = set([author['given'] + author['family'] for author in citing_authors])
print '\nAfter de-duplication'
print len(citing_dois), 'dois'
print len(citing_authors), 'citing authors'
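
De-duplicating people by name depends on how the name key is built. The sketch below is one possible approach, not the method used above: it assumes CSL-style author dictionaries in which `given` may be missing and organisational authors may carry a `literal` field instead, and it normalises case and whitespace before comparison. The function name `author_key` and the example authors are hypothetical.

```python
def author_key(author):
    """Build a de-duplication key from a CSL-style author dictionary."""
    if "literal" in author:
        name = author["literal"]
    else:
        # .get() tolerates records that lack a given or family name
        parts = (author.get("given"), author.get("family"))
        name = " ".join(part for part in parts if part)
    # Lower-case and collapse runs of whitespace
    return " ".join(name.lower().split())


# Invented authors illustrating the cases handled above.
example_authors = [
    {"given": "Ada", "family": "Example"},
    {"given": "ada", "family": "Example "},
    {"family": "Example"},
    {"literal": "Example Consortium"},
]
unique_authors = set(author_key(a) for a in example_authors)
print(len(unique_authors))
```

Matching on normalised names is still not true disambiguation: two different people can share a name, and one person can appear under several spellings.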



In [None]:
# Collect all the tweets, usernames and check for any ORCIDs we can find.
tweet_urls = set()
twitter_handles = set()

for article in plos_alms:
    for tweet in article.sources['twitter'].events:
        tweet_urls.add(tweet['event_url'])
        twitter_handles.add(tweet['event']['user'])
        
# No need to explicitly de-duplicate because we created sets directly in this case
print len(tweet_urls), 'tweets'
print len(twitter_handles), 'twitter users'

It could be interesting to look at which twitter users interact most with the articles associated with this ORCID profile. To do that we would need to create a list rather than a set and then count the number of duplicates in the list. The code could be easily modified to do this. Another useful exercise would be to search ORCID for profiles corresponding to citing authors. The best way to do this would be to obtain ORCIDs associated with each of the citing articles. However, because ORCID data is sparse and incomplete there are two limitations here. First, the author may not have an ORCID. Second, the article may not be explicitly linked to the author's profile. Try searching ORCID for the DOIs associated with each of the citing articles.
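
The counting of twitter users suggested above can be sketched with `collections.Counter` from the standard library. The events below are hypothetical and only mimic the `tweet['event']['user']` shape used in the previous cell; the handles are placeholders.

```python
from collections import Counter

# Hypothetical tweet events; handles are placeholders.
example_events = [
    {"event": {"user": "user_a"}},
    {"event": {"user": "user_b"}},
    {"event": {"user": "user_a"}},
    {"event": {"user": "user_c"}},
    {"event": {"user": "user_a"}},
]

# Counter tallies how often each handle occurs across all events.
handle_counts = Counter(tweet["event"]["user"] for tweet in example_events)
top = handle_counts.most_common(1)
print(top)
```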

In this case we will look to see how many of the twitter handles discussing these articles are associated with an ORCID profile we can discover. This in turn could lead to more profiles and more cycles of analysis to build up a network of researchers interacting through citation and on twitter. Note we have inserted a delay between calls. This is because we are making a larger number of API calls (one for each Twitter handle). It is considered polite to keep the pace at which calls are made to an API to a reasonable level. The ORCID API does not post suggested limits at the moment but delaying for a second between calls is reasonable.

In [None]:
tweet_orcids = []
for handle in twitter_handles:
    orc = twitter2orcid(handle)
    if orc:
        tweet_orcids.append(orc)
    time.sleep(1) # wait one second between each call to the ORCID API

print len(tweet_orcids)
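
The polite-delay pattern can be factored into a small helper so that the pause is applied consistently. This is a sketch with a hypothetical name, `throttled_map`. The `sleep` argument is injectable only so that the demonstration can record the pauses instead of actually waiting.

```python
import time


def throttled_map(func, items, delay=1.0, sleep=time.sleep):
    """Call func on each item, pausing `delay` seconds between calls."""
    results = []
    for position, item in enumerate(items):
        if position > 0:
            sleep(delay)  # no pause is needed before the first call
        results.append(func(item))
    return results


# Demonstration with a recording stub instead of a real sleep.
pauses = []
lengths = throttled_map(len, ["a", "bb", "ccc"], delay=1.0, sleep=pauses.append)
print("{0} {1}".format(lengths, pauses))
```

In the notebook this could be called as `throttled_map(twitter2orcid, twitter_handles)`, followed by dropping the `None` results.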
    

In this case we have identified twelve ORCID profiles we can positively link to tweets about this set of articles. This is a substantial underestimate of the likely number of ORCIDs associated with these tweets, because relatively few ORCID profiles have Twitter accounts registered as part of the profile. To gain a broader picture a search and matching strategy would need to be applied. Nonetheless for these twelve we can look closer into the profiles.

The first step is to obtain the actual profile information for each of the twelve ORCIDs we have found. Note that at the moment what we have is the ORCIDs themselves, not the retrieved profiles.

In [None]:
orcs = []
for id in tweet_orcids:
    orcs.append(orcid.get(id))

With the profiles retrieved we can then take a look at who they are, and check that we do in fact have sensible twitter handles associated with them. We could use this to build up the network of related authors and Twitter users for further analysis.

In [None]:
for orc in orcs:
    i = [('twitter.com' in website.url) for website in orc.researcher_urls].index(True)
    twitter_url = orc.researcher_urls[i].url
    print orc.given_name, orc.family_name, orc.orcid, twitter_url
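
The call to `.index(True)` above raises a `ValueError` if a profile has no Twitter URL. That should not happen for profiles found through `twitter2orcid`, but it could for profiles obtained in other ways. A more defensive sketch follows. It works on plain URL strings, the helper names are hypothetical and the URLs are placeholders.

```python
def first_twitter_url(urls):
    """Return the first URL containing 'twitter.com', or None if absent."""
    return next((url for url in urls if "twitter.com" in url), None)


def handle_from_url(url):
    """Take the last path segment of a profile URL as the handle."""
    return url.rstrip("/").rsplit("/", 1)[-1]


# Placeholder URLs standing in for website.url values on a profile.
example_urls = ["https://example.org/home", "https://twitter.com/example_handle/"]
found = first_twitter_url(example_urls)
missing = first_twitter_url(["https://example.org/home"])
print(handle_from_url(found))
```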