# Get code references from papers

This notebook:

 * Uses getpapers externally to download the full text of papers in EuPMC that contain the term 'github'
 * Text-mines the full text of each paper and extracts occurrences of GitHub URLs
 * Outputs a data structure of the form: {paper_DOI: {"pmcid": ..., "pub_date": ..., "github": ["http://github.com/blah/blah", ...]}, ...}

## Initial setup

In [1]:
import json
import process_eupmc

## File locations

In [3]:
# Directory containing the data
data_dir = '../data'

# File containing the list of matching papers
matching_papers = data_dir + '/' + 'eupmc_fulltext_html_urls.txt'

# File for the output
output_jsonfile = data_dir + '/' + 'dict_of_papers.json'

## Use getpapers to download fulltext of papers

We currently do this outside of the notebook, and assume that the files are available locally.

The command we are using is:

>getpapers --query 'github' -x --limit 100 -o data

This queries EuPMC for papers containing the term 'github' and saves the full text of the first 100 matching papers (the -x flag requests full-text XML) into the directory 'data'.
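
If the download step were moved into the notebook, one option is to call getpapers through Python's subprocess module. The following is a minimal sketch, not part of the current workflow: the helper names build_getpapers_command and run_getpapers are hypothetical, and only the flags from the command quoted above are used.

```python
import shlex
import shutil
import subprocess


def build_getpapers_command(query="github", limit=100, outdir="data"):
    """Build the argument list for the getpapers call quoted above."""
    return ["getpapers", "--query", query, "-x", "--limit", str(limit), "-o", outdir]


def run_getpapers(command):
    """Run getpapers if it is installed; return True if it was run."""
    if shutil.which(command[0]) is None:
        print("getpapers not found on PATH; run the command manually")
        return False
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(command, check=True)
    return True


command = build_getpapers_command()
# Show the command line that would be executed, without running it
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems if the query is later changed to something containing spaces. Calling run_getpapers(command) would perform the actual download.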

## Textmine each paper

In [4]:
# Get the list of subdirectories dumped by ContentMine
paper_ids = process_eupmc.get_pmcids(matching_papers)

In [5]:
# Process the papers and extract all references to GitHub and Zenodo URLs
papers_info = process_eupmc.process_papers(paper_ids, data_dir)
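
The internals of process_eupmc are not shown in this notebook. As an illustration only, URL extraction of this kind can be sketched with a regular expression. The pattern and the helper find_github_urls below are assumptions, not the module's actual implementation.

```python
import re

# Assumed pattern: http(s)://github.com followed by any run of characters
# that are not whitespace, quotes, angle brackets or closing brackets.
GITHUB_URL_RE = re.compile(r"https?://github\.com\b[^\s\"'<>)\]]*", re.IGNORECASE)


def find_github_urls(text):
    """Return GitHub URLs in order of first appearance, without exact duplicates."""
    seen = set()
    urls = []
    for match in GITHUB_URL_RE.finditer(text):
        url = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = (
    "Code is at https://github.com/YeatmanLab/AFQ-Browser. "
    "See also (https://github.com/mrdoob/three.js/) and "
    "https://github.com/YeatmanLab/AFQ-Browser again."
)
print(find_github_urls(sample))
```

This sketch only removes exact duplicates; variants that differ by case or by a trailing slash are kept as separate entries.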

## Output data structure

In [6]:
dict_of_papers = {}

In [7]:
for p in papers_info:
    paper_dict = {}
    paper_dict['pmcid'] = p.pmcid
    paper_dict['pub_date'] = p.pub_date
    paper_dict['github'] = p.references['github']
    dict_of_papers[str(p.doi)] = paper_dict    
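
The sample output printed later in this notebook contains the bare "https://github.com", the same repository with and without a trailing slash, and owner names in different cases. If a list of distinct repositories is wanted, the URLs could be reduced to owner/repo keys before storing them. A minimal sketch, assuming that comparing owner and repository names case-insensitively is acceptable; the helpers repo_key and unique_repos are hypothetical and not part of process_eupmc.

```python
from urllib.parse import urlparse


def repo_key(url):
    """Return 'owner/repo' in lower case, or None if the URL names no repository."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        return None  # e.g. the bare "https://github.com"
    return "/".join(parts[:2]).lower()


def unique_repos(urls):
    """Collapse URL variants to a sorted list of distinct 'owner/repo' keys."""
    return sorted({key for key in map(repo_key, urls) if key is not None})


urls = [
    "https://github.com",
    "https://github.com/YeatmanLab/AFQ-Browser",
    "https://github.com/YeatmanLab/AFQ-Browser/",
    "https://github.com/yeatmanlab/AFQ-Browser_data",
    "https://github.com/YeatmanLab/AFQ-Browser_data/blob/master/AFQ-Browser_ALSexample/Figure6.ipynb",
]
print(unique_repos(urls))
```

Links to individual files collapse onto their repository here, so this is only suitable when the file-level detail is not needed.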
     

In [8]:
with open(output_jsonfile, 'w') as outfile:  
    json.dump(dict_of_papers, outfile)
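
To check that the file can be consumed downstream, it can be loaded again and summarised. The sketch below is self-contained: it writes a one-paper excerpt in the same shape as the notebook's output to a temporary directory instead of touching the real output file, then counts the GitHub URLs per DOI.

```python
import json
import os
import tempfile

# One-paper excerpt in the same shape as the notebook's output
example = {
    "10.1038/s41467-018-03297-7": {
        "pmcid": "PMC5838108",
        "pub_date": "2018-03-01",
        "github": ["https://github.com/YeatmanLab/AFQ-Browser"],
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dict_of_papers.json")
    with open(path, "w") as outfile:
        json.dump(example, outfile)
    with open(path) as infile:
        loaded = json.load(infile)

# Number of GitHub URLs recorded for each DOI
url_counts = {doi: len(info["github"]) for doi, info in loaded.items()}
print(url_counts)
```

With the real data, replacing path with output_jsonfile and skipping the write step would give the same summary for the full dictionary.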


### Visually inspect the result

In [9]:
print(json.dumps(dict_of_papers, sort_keys=True, indent=4))

{
    "10.1038/s41467-018-03297-7": {
        "github": [
            "https://github.com/YeatmanLab/AFQ-Browser_data",
            "https://github.com/YeatmanLab/AFQ-Browser_data/blob/master/AFQ-Browser_ALSexample/Reproducing-Sarica2017-Figure3.ipynb",
            "https://github.com/YeatmanLab/AFQ-Browser_data/blob/master/AFQ-Browser_ALSexample/Figure6.ipynb",
            "https://github.com",
            "https://github.com/YeatmanLab/AFQ-Browser",
            "https://github.com/YeatmanLab/AFQ-Browser/",
            "https://github.com/yeatmanlab/AFQ-Browser_data",
            "https://github.com/yeatmanlab/AFQBrowser-demo/",
            "https://github.com/yeatmanlab/AFQ-Browser-MSexample/",
            "https://github.com/yeatmanlab/Sarica_2017",
            "https://github.com/mrdoob/three.js/"
        ],
        "pmcid": "PMC5838108",
        "pub_date": "2018-03-01"
    },
    "10.1038/s41598-017-18257-2": {
        "github": [
            "https://github.com/ChrisMaherLab/INT