# Get Code References from papers

This notebook:

 * Uses getpapers externally to download the full text of papers in EuPMC that mention 'github'
 * Text-mines the full text of each paper and extracts occurrences of GitHub URLs
 * Outputs a data structure of the form: {paper_DOI: {"github": ["https://github.com/blah/blah", ...]}, ...}
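
As a minimal sketch of that output shape, the snippet below builds one entry by hand and walks it. The DOI and repository URL are hypothetical placeholders, not results from the notebook.

```python
import json

# Hypothetical DOI and repository URL, for illustration only
example_output = {
    "10.1234/example.doi": {
        "github": [
            "https://github.com/example-user/example-repo",
        ]
    }
}

# Each DOI maps to a dictionary whose 'github' key holds a list of URLs
for doi, repos in example_output.items():
    for url in repos["github"]:
        print(doi, url)

# The structure serialises directly to JSON
serialised = json.dumps(example_output, sort_keys=True, indent=4)
```

Keeping the URLs under a named key such as `github` leaves room for adding other hosting services alongside it later.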

In [101]:
import json
from lxml import etree
import re

## Use getpapers to download fulltext of papers

We currently do this outside of the notebook, and assume that the files are available locally.

The command we are using is:

>getpapers --query 'github' -x --limit 100 -o data

This queries EuPMC for papers containing the term 'github' and downloads the full-text XML (`-x`) of the first 100 matching papers into the directory 'data'.

## Textmine each paper

### File locations

In [110]:
# Directory containing the data
data_dir = '../data'

# File containing the list of matching papers
matching_papers = data_dir + '/' + 'eupmc_fulltext_html_urls.txt'

# Name of the ContentMine results file in each paper subdirectory
contentmine_results = 'eupmc_result.json'

# Name of the ContentMine full-text XML dump in each paper subdirectory
fulltext_xml = 'fulltext.xml'
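
The file names above imply one subdirectory per paper, each holding a metadata JSON file and a full-text XML file. A small sketch of how that layout could be checked before mining is shown below. The helper names `paper_files` and `missing_files` are hypothetical and not part of the notebook.

```python
import os

def paper_files(data_dir, paper_dir,
                results_name='eupmc_result.json',
                fulltext_name='fulltext.xml'):
    """Return the (metadata JSON, full-text XML) paths for one paper."""
    base = os.path.join(data_dir, paper_dir)
    return os.path.join(base, results_name), os.path.join(base, fulltext_name)

def missing_files(data_dir, paper_dirs):
    """List every expected file that is not present on disk."""
    missing = []
    for paper_dir in paper_dirs:
        for path in paper_files(data_dir, paper_dir):
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Calling `missing_files(data_dir, papers)` once the list of paper subdirectories is known would report which files are absent by path, which is more informative than a generic error message.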

In [111]:
# Object for building the JSON output file containing the dictionary of papers and URLs to repositories

dict_of_papers = {}

In [112]:
# Get the list of subdirectories dumped by ContentMine

papers = []

# For each line in matching_papers, strip the start of the line (http://europepmc.org/articles/)

with open(matching_papers, 'r') as f:
    for line in f:
        terms = line.split("/")
        papers.append(terms[-1].rstrip())
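
To illustrate the step above, here is a sketch of the same last-path-segment extraction as a standalone function. It assumes each line is a URL of the form http://europepmc.org/articles/&lt;id&gt;. The identifier in the usage note is hypothetical, and skipping blank lines and trailing slashes is an addition that the cell above does not perform.

```python
def paper_id_from_url(line):
    """Return the last path segment of a URL line (the paper subdirectory name)."""
    return line.strip().rstrip('/').split('/')[-1]

def read_paper_ids(lines):
    """Collect one identifier per non-blank line."""
    ids = []
    for line in lines:
        if line.strip():  # ignore empty lines, e.g. a trailing newline at end of file
            ids.append(paper_id_from_url(line))
    return ids
```

For example, `paper_id_from_url("http://europepmc.org/articles/PMC0000000\n")` gives `"PMC0000000"`.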


In [113]:
# For each paper

for paper_dir in papers:

    # Read in the JSON file and get the DOI
    
    filename = data_dir + '/' + paper_dir + '/' + contentmine_results
    
    try:
        with open(filename, 'r') as f:
            paper_json = json.load(f)
            # Get the DOI
            paper_doi = paper_json['doi'][0]
    except IOError:
        print("Error: File does not appear to exist.")
        # Without a DOI there is no key for this paper, so skip it
        continue
    
    # Read in the XML full text and mine for the github URLs

    fulltext_file = data_dir + '/' + paper_dir + '/' + fulltext_xml
 
    gh_dict = {}
    gh_urls = []

    try:
        with open(fulltext_file, 'r') as f:
            data = f.read()
            # Capture each URL up to (but not including) its closing double quote
            urls = re.findall(r'(https?://[^\s"]+)(?=")', data)
            for url in urls:
                if re.match(r'https?://github\.com', url):
                    gh_urls.append(url)
    except IOError:
        print("Error: File does not appear to exist.")

    gh_dict['github'] = gh_urls

    dict_of_papers[str(paper_doi)] = gh_dict    
        

Error: File does not appear to exist.
Error: File does not appear to exist.
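
The URL-mining step can also be sketched as a standalone function, which makes it easier to test on small strings. This is an alternative under stated assumptions, not the notebook's exact method. It matches GitHub URLs anywhere in the text (not only before a double quote), trims trailing punctuation, and drops duplicates while preserving order. The names `GITHUB_URL` and `extract_github_urls` are hypothetical.

```python
import re

# Match a GitHub URL up to the first whitespace, quote, or angle bracket
GITHUB_URL = re.compile(r'https?://(?:www\.)?github\.com/[^\s"\'<>]+')

def extract_github_urls(text):
    """Return GitHub URLs in order of first appearance, without duplicates."""
    # Trim punctuation that commonly trails a URL at the end of a sentence
    urls = (match.rstrip('.,;)') for match in GITHUB_URL.findall(text))
    # dict.fromkeys keeps insertion order, so this de-duplicates stably
    return list(dict.fromkeys(urls))
```

De-duplicating here would prevent a repository that a paper cites more than once from appearing several times in that paper's list. Trimming a trailing `)` is a heuristic and would truncate the rare URL that genuinely ends in one.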


## Output data structure

In [114]:
with open('dict_of_papers.json', 'w') as outfile:  
    json.dump(dict_of_papers, outfile)

print(json.dumps(dict_of_papers, sort_keys=True, indent=4))

{
    "10.1002/1873-3468.12684": {
        "github": [
            "https://github.com/epierson9/ZIFA",
            "https://github.com/dgrun/RaceID",
            "https://github.com/JohnReid/DeLorean",
            "https://github.com/kieranrcampbell/pseudogp",
            "https://github.com/ManuSetty/wishbone",
            "https://github.com/Teichlab/scrnatb",
            "https://github.com/ManuSetty/wishbone",
            "https://github.com/SheffieldML/GPclust",
            "https://github.com/RGLab/MAST",
            "https://github.com/tallulandrews/M3Drop",
            "https://github.com/kdkorthauer/scDD",
            "https://github.com/catavallejos/BASiCS",
            "https://github.com/PMBio/scLVM",
            "https://github.com/brentp/combat.py/blob/master/R-combat.R",
            "https://github.com/lengning/OEFinder",
            "https://github.com/drisso/RUVSeq"
        ]
    },
    "10.1002/pmic.201700244": {
        "github": [
            "https://github.com/PR