Python Miner: Big Data Publications
This repository contains the scripts that implement part of the methods described in the publications: "". The scripts handle data fetching, preparation, and visualisation; classification is implemented in R and can be found in the R-contrast-pub repository. The scripts cover the following research steps:
Searching PubMed and PubMed Central for articles matching a specific query (esearch), fetching the article data (efetch), and storing it in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
Searching for matching PubMed and PubMed Central articles based on journal and publication date range (esearch), fetching the article data (efetch), and storing it in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
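The fetch-and-store step can be sketched as below. The helper names, table schema, and column choices are illustrative assumptions, not the repository's actual code; only the E-utilities operations (esearch, efetch) come from the description above.

```python
import sqlite3
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=100):
    # Build an NCBI E-utilities esearch request URL (hypothetical
    # helper; the repository's own fetch code is not shown here).
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def store_article(conn, pmid, title, abstract, doi, doctype, issn):
    # Store one fetched article in SQLite (assumed schema).
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles
           (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT,
            doi TEXT, doctype TEXT, issn TEXT)"""
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?, ?)",
        (pmid, title, abstract, doi, doctype, issn),
    )
```

The esearch call returns PMIDs, which are then passed to efetch to retrieve the full records; the sketch only builds the request URL so it stays runnable offline.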
Unwanted articles are removed from the database by the following criteria:
- They have an empty abstract;
- Their doctype is defined in the `EXCLUDED_DOCTYPES` variable in the config;
- Their journal ISSN is defined in the `EXCLUDED_JOURNALS` variable in the config;
- They are a duplicate, based on their title (with all symbols removed, regex:
- They are a duplicate, based on their DOI.
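The removal criteria can be sketched against the SQLite database as follows. The schema, the excluded values, and the title-normalisation regex are assumptions for illustration; the real lists live in the config.

```python
import re
import sqlite3

# Illustrative values only — the real sets come from the config.
EXCLUDED_DOCTYPES = {"Editorial", "Letter"}
EXCLUDED_JOURNALS = {"0000-0000"}

SCHEMA = """CREATE TABLE IF NOT EXISTS articles
            (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT,
             doi TEXT, doctype TEXT, issn TEXT)"""

def normalise_title(title):
    # Lowercase and drop every non-alphanumeric character so that
    # near-identical titles compare equal (assumed regex; the original
    # pattern is not shown in this README).
    return re.sub(r"[^a-z0-9]", "", title.lower())

def remove_unwanted(conn):
    # Criterion 1: empty abstract.
    conn.execute("DELETE FROM articles WHERE abstract IS NULL OR abstract = ''")
    # Criteria 2 and 3: excluded doctypes and journal ISSNs.
    for doctype in EXCLUDED_DOCTYPES:
        conn.execute("DELETE FROM articles WHERE doctype = ?", (doctype,))
    for issn in EXCLUDED_JOURNALS:
        conn.execute("DELETE FROM articles WHERE issn = ?", (issn,))
    # Criteria 4 and 5: deduplicate on normalised title and on DOI,
    # keeping the first row seen.
    seen_titles, seen_dois = set(), set()
    for pmid, title, doi in conn.execute(
            "SELECT pmid, title, doi FROM articles ORDER BY pmid").fetchall():
        key = normalise_title(title or "")
        if key in seen_titles or (doi and doi in seen_dois):
            conn.execute("DELETE FROM articles WHERE pmid = ?", (pmid,))
            continue
        seen_titles.add(key)
        if doi:
            seen_dois.add(doi)
```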
Articles in the database are cleaned by performing the following steps:
- Removing special characters from article titles and abstracts (script)
- Tokenizing the titles and abstracts
- Removing stopwords from the tokenized titles and abstracts (script)
- (Optional) Stemming the tokenized titles and abstracts
- Removing very short and very long tokens, which are unlikely to be real words (script)
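The cleaning steps above can be sketched as a single pipeline. The stopword set, the length bounds, and the crude suffix stemmer are placeholders; the repository presumably uses a fuller stopword list and a proper stemmer.

```python
import re

# Tiny illustrative stopword set — the scripts use a full list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is"}
MIN_LEN, MAX_LEN = 3, 25  # assumed bounds for "very short/very long" tokens

def clean_text(text, stem=False):
    # 1. Remove special characters.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # 2. Tokenize on whitespace.
    tokens = text.lower().split()
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. (Optional) naive suffix stripping, standing in for a real
    #    stemmer such as the Porter algorithm.
    if stem:
        tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    # 5. Drop implausibly short or long tokens.
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]
```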
The initial and matching corpora are retrieved. A predetermined number of datasets is created by combining the complete initial corpus with a random sample from the matching corpus. Each dataset is then vectorized and turned into a feature matrix. Lastly, the matrix and the original dataset are stored as a pickle object.
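A minimal sketch of the dataset-building step, under assumptions: the function names are hypothetical, the labels (1 for the initial corpus, 0 for the matched sample) are a guess at the intended contrast, and a plain bag-of-words count stands in for whatever vectoriser the repository actually uses. Repeating the call with different seeds yields the predetermined number of datasets.

```python
import pickle
import random

def build_dataset(initial, matching, seed=0):
    # Pair the full initial corpus with an equally sized random sample
    # from the matching corpus.
    rng = random.Random(seed)
    sample = rng.sample(matching, k=min(len(initial), len(matching)))
    docs = [(doc, 1) for doc in initial] + [(doc, 0) for doc in sample]
    # Bag-of-words vectorisation into a document-by-term count matrix.
    vocab = sorted({tok for doc, _ in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc, _ in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            row[index[tok]] += 1
        matrix.append(row)
    return docs, vocab, matrix

def save_dataset(path, docs, vocab, matrix):
    # Store the matrix together with the original dataset as one pickle.
    with open(path, "wb") as fh:
        pickle.dump({"docs": docs, "vocab": vocab, "matrix": matrix}, fh)
```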
The following scripts were used for various supporting tasks, such as analysing datasets, gathering metadata, and creating figures.
- baseline_data.py fetches some baseline metadata about the initial and matching corpora such as word counts and document counts.
- word_distribution.py fetches word distribution metadata over the documents in the initial and matching corpora.
- doc_word_freqs.py and docs_per_year.py create Matplotlib figures for the word-to-document frequency and the number of documents per publication year, respectively.
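The statistic behind doc_word_freqs.py — for each word, the number of documents it appears in — can be sketched with the standard library; the function name is an assumption, and the actual script reads from the database and plots the result with Matplotlib.

```python
from collections import Counter

def doc_frequencies(docs):
    # Count, for each word, the number of documents that contain it
    # (each document contributes at most once per word).
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df
```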