# Python Miner: Big Data Publications

This repository contains the scripts that implement part of the methods described in the publication: "". The scripts handle data fetching, preparation, and visualisation; classification is implemented in R and can be found in the R-contrast-pub repository. The scripts cover the following research steps:

## Get initial corpus

PubMed and PubMed Central are searched for articles matching a specific query (esearch), and the data for each article is fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
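For illustration, here is a minimal sketch of the esearch/efetch round trip using Biopython's Entrez module; whether this repository uses Biopython is an assumption, and the query, e-mail address, and retmax value are placeholders.

```python
# Sketch of the esearch/efetch round trip with Biopython (assumed dependency;
# the repository's own scripts may issue the E-utilities calls differently).
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

def fetch_corpus(query: str, db: str = "pubmed", retmax: int = 1000):
    """Search a database (esearch), then fetch the matching records (efetch)."""
    handle = Entrez.esearch(db=db, term=query, retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    handle = Entrez.efetch(db=db, id=",".join(ids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records  # parsed records, ready to be written to SQLite
```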

## Get matching corpus

Matching PubMed and PubMed Central articles are searched for based on journal and publication date range (esearch), and the data for each article is fetched (efetch) and stored in a SQLite database. After the fetch, unwanted articles are removed and the remaining articles are cleaned.
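A matching query of this kind can be expressed with standard PubMed field tags. The helper below is hypothetical and only illustrates the query shape; the journal name and date window are placeholders.

```python
# Hypothetical helper illustrating a journal + date-range query using
# standard PubMed field tags; not a script from this repository.
def build_match_query(journal: str, year: int, window: int = 1) -> str:
    """Match articles from the same journal within +/- `window` years."""
    start, end = year - window, year + window
    return (
        f'"{journal}"[Journal] AND '
        f'("{start}/01/01"[PDAT] : "{end}/12/31"[PDAT])'
    )

# build_match_query("PLoS ONE", 2016) ->
# '"PLoS ONE"[Journal] AND ("2015/01/01"[PDAT] : "2017/12/31"[PDAT])'
```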

## Remove articles

Unwanted articles are removed from the database according to the following criteria (a sketch of the duplicate check follows the list):

  1. They have an empty abstract;
  2. Their doctype is listed in the EXCLUDED_DOCTYPES variable in the config;
  3. Their journal ISSN is listed in the EXCLUDED_JOURNALS variable in the config;
  4. They are a duplicate, based on their title (with all symbols removed, regex: [^a-z]);
  5. They are a duplicate, based on their DOI.
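As a sketch of criterion 4, duplicates can be detected by normalising each title with the [^a-z] regex and comparing the results; the table and column names below are assumptions about the SQLite schema, not the repository's actual layout.

```python
# Sketch of the title-based duplicate removal (criterion 4); the `articles`
# table and its columns are assumed names, not the repository's real schema.
import re
import sqlite3

def remove_title_duplicates(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    rows = cur.execute("SELECT id, title FROM articles").fetchall()

    seen, duplicates = set(), []
    for article_id, title in rows:
        key = re.sub(r"[^a-z]", "", (title or "").lower())  # symbols removed
        if key in seen:
            duplicates.append((article_id,))
        seen.add(key)

    cur.executemany("DELETE FROM articles WHERE id = ?", duplicates)
    conn.commit()
    conn.close()
```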

## Cleaning articles

Articles in the database are cleaned by performing the following steps (a sketch of the full pipeline follows the list):

  1. Removing special characters from article titles and abstracts (script)
  2. Tokenizing the titles and abstracts
  3. Removing stopwords from the tokenized titles and abstracts (script)
  4. (Optional) Stemming the tokenized titles and abstracts
  5. Removing very short and very long tokens, which are unlikely to be real words (script)
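The five steps above could look roughly like this with NLTK; whether the repository uses NLTK is an assumption, and the token-length bounds are placeholders.

```python
# Sketch of the cleaning pipeline (steps 1-5) using NLTK, an assumed
# dependency; requires nltk.download("punkt") and nltk.download("stopwords").
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_text(text: str, stem: bool = False,
               min_len: int = 2, max_len: int = 30) -> list[str]:
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # 1. special characters
    tokens = word_tokenize(text)                          # 2. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # 3. stopwords
    if stem:                                              # 4. optional stemming
        tokens = [STEMMER.stem(t) for t in tokens]
    return [t for t in tokens                             # 5. implausible lengths
            if min_len <= len(t) <= max_len]              #    (bounds assumed)
```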

## Preparing datasets

The initial and matching corpora are retrieved. A predetermined number of datasets is created by combining the complete initial corpus with a random sample from the matching corpus. Each dataset is then vectorised and turned into a feature matrix. Lastly, the matrix and the original dataset are stored as a pickle object.
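A minimal sketch of one such dataset build, assuming a scikit-learn vectoriser and a 1:1 sample size; the function name, labels, and pickle layout are illustrative, not the repository's actual format.

```python
# Sketch of building, vectorising, and pickling one dataset; the sample size,
# labels, and pickle layout are assumptions, not the repository's format.
import pickle
import random

from sklearn.feature_extraction.text import CountVectorizer

def prepare_dataset(initial_docs, matching_docs, out_path, seed=0):
    rng = random.Random(seed)  # fixed seed so datasets are reproducible
    sample = rng.sample(matching_docs, len(initial_docs))  # 1:1 match (assumed)

    docs = initial_docs + sample
    labels = [1] * len(initial_docs) + [0] * len(sample)

    matrix = CountVectorizer().fit_transform(docs)  # sparse feature matrix

    with open(out_path, "wb") as fh:
        pickle.dump({"matrix": matrix, "docs": docs, "labels": labels}, fh)
```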

## Other

The following scripts were used for various supporting tasks during the research, such as analysing datasets, gathering metadata, and creating figures:

  1. baseline_data.py fetches baseline metadata about the initial and matching corpora, such as word counts and document counts.
  2. word_distribution.py fetches word-distribution metadata over the documents in the initial and matching corpora.
  3. doc_word_freqs.py and docs_per_year.py create Matplotlib figures for the word-to-document frequency and the number of documents per publication year, respectively.