This tool finds code cited in PubMed abstracts by searching them for links to popular code-hosting sites such as GitHub and Bitbucket (full list below).
It returns PMIDs and URLs where code is presumed to be.
It will check for valid URLs.
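As a rough illustration of the idea, and not the tool's actual implementation, the sketch below pulls URLs out of an abstract and keeps those that point at a known code host. The host list, the regular expression, and the function name `find_code_urls` are assumptions made for this example. It only parses the URLs and does not check that they resolve.

```python
import re
from urllib.parse import urlparse

# Hypothetical host list for illustration; the tool's real list may differ.
CODE_HOSTS = ("github.com", "bitbucket.org")

# Match http(s) URLs up to the next whitespace, quote, or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def find_code_urls(abstract, hosts=CODE_HOSTS):
    """Return URLs in `abstract` whose host is in `hosts` (or a subdomain)."""
    found = []
    for match in URL_PATTERN.finditer(abstract):
        url = match.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # skip strings that cannot be parsed as URLs
        if any(host == h or host.endswith("." + h) for h in hosts):
            found.append(url)
    return found


# Made-up abstract text, used only to exercise the function.
abstract = (
    "Source code is available at https://github.com/example/tool. "
    "Data are hosted at http://example.org/data."
)
print(find_code_urls(abstract))
```

Matching on the parsed hostname rather than on a substring avoids false hits such as `notgithub.com`.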
Current list of repositories
- please feel free to suggest additional repos in issues!
In its current form, it will pull a maximum of 100,000 abstracts.
The tool should work in both Python 2 and Python 3. Using a virtualenv is strongly recommended.
$ pip install -r requirements.txt
You may need to install the Python development headers (Python.h). Consult your distribution for details. On Ubuntu, this is likely:
$ apt-get install -y libpython2.7-dev # or libpython3.5-dev
On yum-based distributions such as CentOS or RHEL, this is likely:
$ yum install -y python-devel
Download the NLTK data files the tool requires ('punkt'):
$ python -m nltk.downloader punkt
By default this will end up in
$ python find_code_urls.py <OUTPUT_CSV_FILENAME> <YOUR_EMAIL>
Writes a TSV (tab-separated) file with one record per URL found in PubMed articles. Columns: pmid, article title, url
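To work with the output, something like the following sketch might be enough. It assumes three tab-separated columns in the order listed above, and it skips a header row if one is present, since the description does not say whether the file has one. The function name `urls_by_pmid` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def urls_by_pmid(tsv_file):
    """Group URLs by PMID from rows of (pmid, article title, url)."""
    grouped = defaultdict(list)
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) != 3 or row[0].strip().lower() == "pmid":
            continue  # skip malformed rows and a header row, if one exists
        pmid, _title, url = row
        grouped[pmid].append(url)
    return dict(grouped)


# Made-up sample rows standing in for a real output file.
sample = io.StringIO(
    "111\tAn example article\thttps://github.com/example/a\n"
    "111\tAn example article\thttps://bitbucket.org/example/b\n"
    "222\tAnother example\thttps://github.com/example/c\n"
)
print(urls_by_pmid(sample))
```

For a real run, pass an open file instead of the sample, for example `open(path, newline="")` in Python 3.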