# How reusable is software mentioned in Open Access papers? An empirical study using code-cite

_Neil Chue Hong, Robin Long, Martin O’Reilly, Naomi Penfold, Isla Staden, Alexander Struck, Shoaib Sufi, Matthew Upson, Andrew Walker, Kirstie Whitaker_

RSE18, Birmingham - 4th September 2018

https://github.com/softwaresaved/code-cite

DOI: 10.5281/zenodo.1209095

### Notes on this notebook

This notebook is the presentation for RSE18. To convert to slides and present, use the command:

`jupyter nbconvert RSE18CodeCitePresentation.ipynb --to slides --post serve`

## Abstract

Software is increasingly referenced in publications [1] because it has been used to produce the results being described, and because journals and funders are requiring code to be shared to improve reproducibility, encourage reuse and reduce duplication. This software may have been written to enable the work described in the publication, may be being cited to credit the original authors, or the main function of the publication could be to describe the software.

However, it is hard both to identify software that is referenced in publications and to assess its reusability. To address this, we mined [2] the full text of papers available from Europe PMC to identify links to software repositories (here, “github.com”). We investigated link persistence and queried the software repository to extract attributes including licence information, documentation and update frequency, from which we inferred the likely reusability and sustainability of the software.

Our results show that there are clear differences in the reusability of software referenced in the research literature.

[1] Howison and Bullard (2015) https://doi.org/10.1002/asi.23538

[2] Watson et al (2018). http://doi.org/10.5281/zenodo.1209311

# About this talk

* Why do we want to know if code is reusable?
* Why is it hard to find code in papers?
* What is code-cite and how does it work?
* What do we find when we analyse Open Journals?
* What could we do next?

# Why do we want to know if code is reusable?

* Open science has increased the number of researchers making their outputs available, but sharing code does not by itself let other researchers benefit from it
    * As more code is shared, it potentially becomes harder for people to identify the code they should be using
* Knowing this lets researchers and policy makers see how the presence and quality of links to data and software in publications change over time, so that they can identify emergent behaviour

# Why is it hard to find code in papers?

* Howison and Bullard, _Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature_ DOI: [10.1002/asi.23538](https://doi.org/10.1002/asi.23538)
  * There's no standard way to identify it
  * Only between 31% and 43% of mentions involve formal citations
  * Informal mentions are very common, even in high impact factor journals

![Types of software mentions](img/softwarereferences.png)

# But software is increasingly mentioned

![References to GitHub](img/termoccurencegithub.png)

_References to GitHub repositories in papers indexed in Europe PMC over time_

# And software is being formally registered

![Software DOIs registered at Datacite](img/dois-for-software.png)

_DOI Registrations for Software, Fenner et al, https://doi.org/10.5438/1nmy-9902_

# How does _code-cite_ work?

   * Originated from a [Springer Nature Hackday](https://www.springernature.com/gb/researchers/campaigns/sn-hack-day) in November 2017

   * Continued at the [Collaborations Workshop Hackday](https://www.software.ac.uk/cw18/) in March 2018

   * __Goal__: enable analysis of a corpus of papers for citation links into repositories which may hold research data and/or software

# What does this mean in practice?

* Documented how to use ContentMine tools to narrow down a corpus of papers
* Provided libraries and Jupyter notebooks to search for software in a corpus (by looking for links to public source code repositories)
* Provided libraries and notebooks to analyse characteristics of the repositories identified in the corpus (e.g. number of contributors, how frequently updated, reusability)
* All openly available and licensed so anyone can reuse and contribute

# Using _getpapers_ to narrow a corpus

* [ContentMine](http://contentmine.org/) provide [tools](https://github.com/contentmine) to help extract information from the academic literature 
* We use the [getpapers](https://github.com/ContentMine/getpapers) tool to get metadata, fulltexts and fulltext URLs of papers matching a search query
   * In this example, we use ContentMine's default Europe PMC corpus because it provides the full text of the papers (requested with the -x flag), which will be useful later
* This technique can also be used to identify references to code repositories (e.g. GitHub or BitBucket) or to deposits in digital repositories (e.g. Zenodo or Figshare)
   * However, additional API calls are required to determine whether a link to a digital repository deposit points to software or to some other research object
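
As a minimal sketch of that extra step for Zenodo: the record number can be read from a Zenodo DOI, and the record's metadata can then be inspected for its resource type. The helper names are hypothetical, and the JSON shape checked by `is_software` is an assumption about what the repository API returns, so it should be verified against the API documentation before use.

```python
import re

# Zenodo DOIs have the form 10.5281/zenodo.<record number>
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.(\d+)")


def zenodo_record_id(text):
    """Return the record number from a Zenodo DOI or DOI URL, or None."""
    match = ZENODO_DOI.search(text)
    return match.group(1) if match else None


def is_software(record):
    """True if a record's metadata declares the resource type 'software'.

    Assumes (unverified) that the record JSON has the shape
    {"metadata": {"resource_type": {"type": ...}}}.
    """
    resource_type = record.get("metadata", {}).get("resource_type", {})
    return resource_type.get("type") == "software"
```

For example, `zenodo_record_id("DOI: 10.5281/zenodo.1209095")` gives the record number that would be looked up through the API.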

# How many references to software are there?

* Other work has shown that the majority of references to software source code repositories via a URL in publications are to GitHub-hosted repositories
* [For this analysis](https://github.com/softwaresaved/code-cite/blob/master/notebooks/getpapers.md) we use the search term 'github.com', since we are looking for GitHub URLs
   * Searching for 'github' alone retrieves 12,823 papers (out of over 2.1 million) from Europe PMC using getpapers
   * Searching for 'github.com' retrieves 11,377 papers; it excludes mentions of GitHub that are not links, and github.io links, which are more likely to be landing pages than software repositories
   
`getpapers --query 'github.com' -o data -x`

* _code-cite_ provides some libraries to structure and process this information


In [None]:
import json
import process_eupmc

# Directory containing the data
data_dir = '../data'

# File containing the list of matching papers
matching_papers = data_dir + '/' + 'eupmc_fulltext_html_urls.txt'

# File for the output
output_jsonfile = data_dir + '/' + 'dict_of_papers.json'

# Get the list of subdirectories dumped by ContentMine
paper_ids = process_eupmc.get_pmcids(matching_papers)

# Process the papers and extract all the references to GitHub urls
papers_info = process_eupmc.process_papers(paper_ids, data_dir)

dict_of_papers = {}

for p in papers_info:
    paper_dict = {}
    paper_dict['pmcid'] = p.pmcid
    paper_dict['pub_date'] = p.pub_date
    paper_dict['github'] = p.references['github']
    dict_of_papers[str(p.doi)] = paper_dict

with open(output_jsonfile, 'w') as outfile:  
    json.dump(dict_of_papers, outfile)
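
To illustrate the kind of matching involved, here is a minimal sketch of pulling GitHub repository references out of a paper's text with a regular expression. It is not the code-cite implementation: the pattern and the function name `github_repos` are assumptions made for this example.

```python
import re

# Matches github.com/<owner>/<repo> with an optional scheme and "www.".
# The lookbehind rejects longer host names such as gist.github.com, and
# github.io links never match because "github.com/" is required.
GITHUB_REPO = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.)?github\.com/"
    r"([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def github_repos(text):
    """Return the unique 'owner/repo' names linked in text, in order of appearance."""
    repos = []
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")        # drop a sentence-ending full stop
        if repo.endswith(".git"):      # clone URLs point at the same repo
            repo = repo[:-4]
        name = owner + "/" + repo
        if repo and name not in repos:
            repos.append(name)
    return repos
```

Links to blobs or issues collapse to the repository that contains them, and a bare link to the github.com front page is ignored because it has no owner or repository part.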

# Is software updated after the paper is published?

We can look at the references to GitHub URLs in papers available in the OA corpus from Europe PMC and identify how many times the GitHub repositories have been updated since the paper referencing them was published.

In [None]:
from collections import Counter
from datetime import datetime

from github import Github  # PyGithub

# Setup assumed from earlier in the notebook, repeated so the cell runs alone
verbose = False
g = Github()  # unauthenticated; pass an access token to raise the rate limit
number_of_updates = Counter()
frequency_of_updates = Counter()

for p in papers_info:

    repos = []
    # Remove references to the main github.com site
    # and treat references to blobs / issues as references to the repo
    for gh_url in p.references['github']:
        words = gh_url.split('/')
        if len(words) > 4:  # https://github.com/<owner>/<repo>[/...]
            reponame = words[3] + '/' + words[4]
            if reponame not in repos:
                repos.append(reponame)

    # Only commits made after the publication date are counted
    since = datetime.strptime(p.pub_date, '%Y-%m-%d')
    days = max((datetime.now() - since).days, 1)  # avoid division by zero

    for repo in repos:
        if verbose: print("Processing: ", repo)
        code = g.get_repo(repo)
        num_commits = 0
        for commit in code.get_commits():  # listed newest first
            # Drop any timezone so the comparison with `since` is valid
            if commit.commit.author.date.replace(tzinfo=None) <= since:
                break
            num_commits += 1
        if verbose: print("Number of commits since publication: ", num_commits)
        commit_freq = num_commits / days
        if verbose: print("Commit frequency: ", commit_freq, "commits/day since publication")
        number_of_updates[num_commits] += 1
        # The magic number 100 is a placeholder until the correct bins are known
        frequency_of_updates[int(100 * commit_freq)] += 1
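
The counting and binning steps can be tried without calling the GitHub API. The sketch below works on a plain list of commit dates; the function names and the example repository are hypothetical, and the bin width of 1/100 commits per day simply mirrors the placeholder value used in the cell above.

```python
from collections import Counter
from datetime import datetime


def commits_since(commit_dates, published):
    """Count the commits made strictly after the publication date."""
    return sum(1 for date in commit_dates if date > published)


def frequency_bin(num_commits, days, bins_per_unit=100):
    """Index of the histogram bin for num_commits / days commits per day.

    Each bin is 1 / bins_per_unit commits per day wide. Integer arithmetic
    avoids floating-point rounding at the bin edges.
    """
    return (bins_per_unit * num_commits) // max(days, 1)


# Hypothetical repository: two commits after the paper, two before it.
published = datetime(2018, 1, 1)
now = datetime(2018, 4, 11)  # 100 days after publication
commit_dates = [datetime(2018, 3, 1), datetime(2018, 2, 1),
                datetime(2017, 12, 1), datetime(2017, 6, 1)]

num_commits = commits_since(commit_dates, published)
days = (now - published).days

number_of_updates = Counter()
frequency_of_updates = Counter()
number_of_updates[num_commits] += 1
frequency_of_updates[frequency_bin(num_commits, days)] += 1
```

Here two commits in 100 days is 0.02 commits per day, which falls in bin 2. Using `Counter` means bins that have not been seen yet start at zero without any setup.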

# Is software updated after the paper is published?

![Commits per day since publication](img/commitsperdaysincepublication.png)



_Note that at present we do __not__ distinguish between URLs referencing software created by the paper's authors and software merely used by them, nor do we identify which software was created as a result of the work in the paper._

* This example also runs on a relatively small sample of about 100 papers, randomly selected from 2017-18

# How reusable is the software?

* Some basic properties arguably improve reusability
   * Does the URL in the paper actually resolve?
   * Does it have a LICENCE?
   * Does it have a README?

# Does the URL resolve?

We can check whether the URLs in a paper that reference software still resolve.

![Do the URLs to software still resolve?](img/stillresolve.png)

Why don't they still resolve?
* Typos in URL: 2 repos
* Wrong path in URL: 1 repo (3 URLs)
* Repository reorganised: 1 repo - published February 2018
* Repository renamed: 1 repo (7 URLs) - published July 2017
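
One way such a check could be written with only the Python standard library is sketched below. This is an assumption about the approach rather than the code behind the chart above, and the names `http_status` and `resolves` are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the final HTTP status for url, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 for a mistyped or deleted repository
    except (ValueError, OSError):
        return None        # malformed URL, DNS failure, timeout


def resolves(url, fetch=http_status):
    """True if the URL answers with a 2xx status once redirects are followed."""
    status = fetch(url)
    return status is not None and 200 <= status < 300
```

The `fetch` argument exists so that the decision logic can be tested with a stand-in function instead of real network requests. A URL that resolves may still point somewhere other than the repository the authors intended, so this check alone does not catch wrong paths that lead to another valid page.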

# Does it have a LICENCE and/or README?

![Does software referenced in papers have a licence or readme?](img/licenceandreadme.png)

* Have not checked if README is more than just the name of the repo
* Some "software" is documentation
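
A minimal sketch of how these checks could be made from the names of the files at the top level of a repository is shown below. The file-name patterns are assumptions chosen for illustration and real repositories vary more than this. As noted above, it only tests for presence, not for whether the file says anything useful.

```python
import re

# Hypothetical file-name patterns, matched case-insensitively with an
# optional extension such as .md or .txt.
PATTERNS = {
    "licence": re.compile(r"^(licen[cs]e|copying)(\.\w+)?$", re.IGNORECASE),
    "readme": re.compile(r"^readme(\.\w+)?$", re.IGNORECASE),
    "contributing": re.compile(r"^contributing(\.\w+)?$", re.IGNORECASE),
}


def reusability_files(filenames):
    """Report which kinds of reusability file appear in a list of file names."""
    return {
        kind: any(pattern.match(name) for name in filenames)
        for kind, pattern in PATTERNS.items()
    }
```

With PyGithub, the names could be collected as `[f.name for f in repo.get_contents("")]`. The same function also covers the contributor file check, since `contributing` is just another pattern.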

# Does it have a process for contributions?

![Does software referenced in publications have a contributor file?](img/contributorfile.png)

# What have we learned?

* Software is increasingly mentioned in papers, if we look for GitHub URLs
* A lot of software is put onto GitHub and never updated
   * But generally this software is still accessible
* Most software referenced in papers has a README and LICENCE for minimal reusability
   * But there's still a lot which has neither (or no licence)

# What can we (all) do next?

* Extend this study to:
   * Look at the whole Europe PMC corpus (donate your GitHub tokens!)
   * Look at other corpora and see if there are domain-based trends
* Compare and contrast to:
   * See if policy changes have had an effect (e.g. Science pre-/post-2011)
* Understand how to:
   * Look at where software is mentioned (e.g. in methodology, in footnotes, in references)
   * Identify software used by the authors versus written by them

   
## _What would you do next?_