PubMed 2019 annual baseline downloader and namespacer

Project

The main purpose of this project is to download, unzip and analyse the PubMed annual baseline for 2019 (December 2018).

The goal of the project is to create a file that lists every PubMed namespace together with its cardinality.
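Building such a file starts with reading author names out of the baseline's gzipped XML files. The following is a minimal sketch of that step, not the project's actual code: the function name `iter_author_names` is hypothetical, and it relies only on the `Author`, `LastName` and `ForeName` elements of the PubMed XML format. Authors without a last name (for example collective names) are skipped.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_author_names(xml_stream):
    """Yield (last_name, fore_name) pairs from a PubMed XML stream.

    Hypothetical helper: parses incrementally so that a large
    baseline file never has to be held in memory as a whole.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "Author":
            last = elem.findtext("LastName")
            fore = elem.findtext("ForeName")
            if last:  # skip collective names and empty entries
                yield last, fore or ""
        elif elem.tag == "PubmedArticle":
            elem.clear()  # release the finished article's subtree


# Tiny in-memory stand-in for one gzipped baseline file.
sample = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><Article><AuthorList>
<Author><LastName>Doe</LastName><ForeName>John</ForeName><Initials>J</Initials></Author>
<Author><CollectiveName>Some Consortium</CollectiveName></Author>
</AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

with gzip.open(io.BytesIO(gzip.compress(sample))) as stream:
    names = list(iter_author_names(stream))
```

For a real file, the in-memory sample would be replaced by `gzip.open(path)` on one of the downloaded archives.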

The database is downloaded from the following link: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
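As an illustration only, the download could be sketched with Python's standard `ftplib` as shown here. The helper names `is_archive` and `download_baseline`, the `.part` suffix and the skip-if-present behaviour are assumptions made for this sketch and are not taken from the project's scripts. Note that the remote directory may serve a different baseline year than 2019, depending on when it is accessed.

```python
from ftplib import FTP
from pathlib import Path

HOST = "ftp.ncbi.nlm.nih.gov"
REMOTE_DIR = "/pubmed/baseline"


def is_archive(name):
    """Return True for gzipped XML archives, False for checksums and other files."""
    return name.endswith(".xml.gz")


def download_baseline(target_dir):
    """Download every .xml.gz archive from the baseline directory (hypothetical helper)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(REMOTE_DIR)
        for name in sorted(n for n in ftp.nlst() if is_archive(n)):
            destination = target / name
            if destination.exists():
                continue  # already downloaded in an earlier run
            partial = target / (name + ".part")
            with open(partial, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
            partial.rename(destination)  # only complete files get the final name
```

Calling `download_baseline("database")` would then fetch the archives into a local `database` folder; the folder name is only an example.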

Namespace

A namespace is the combination of an author's last name and the initial of their forename (e.g. John Doe -> (Doe, J)). The file listing every namespace with its cardinality is mainly intended to serve as a feature for an Author Name Disambiguation (AND) classifier.
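The definition above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the project's implementation: the function names are hypothetical, and cardinality is assumed here to mean the number of author entries that map to the same namespace.

```python
from collections import Counter


def to_namespace(last_name, fore_name):
    """Map an author name to its namespace: (last name, forename initial)."""
    fore = fore_name.strip()
    initial = fore[0].upper() if fore else ""
    return (last_name.strip(), initial)


def count_namespaces(authors):
    """Count how many (last_name, fore_name) entries fall into each namespace."""
    return Counter(to_namespace(last, fore) for last, fore in authors)


# Made-up author entries, for illustration only.
authors = [("Doe", "John"), ("Doe", "Jane"), ("Smith", "Anna")]
counts = count_namespaces(authors)
```

In this example John Doe and Jane Doe share the namespace (Doe, J), so its cardinality is 2, while (Smith, A) has cardinality 1.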

Information about a namespace's ambiguity (its cardinality) can help a classifier correctly label a pair of scientific articles as 'belonging to the same author' or 'not belonging to the same author'.

Author Name Disambiguation (AND)

The Author Name Disambiguation project can be found at the following link: https://github.com/BrianPulfer/AuthorNameDisambiguation

Conditions

To see PubMed data download conditions, check out the following link: https://www.nlm.nih.gov/databases/download/terms_and_conditions.html

System requirements

macOS or Linux operating system. Python 3 or later. Internet connection and 300 GB of free disk space (if downloading the database).

Set-up

Create a new virtual environment (for PyCharm users: Preferences... -> Project -> Project Interpreter -> show all -> new). Install all the requirements listed in the 'requirements.txt' file.
