Disease Clustering from Literature Based on Minimal Training Data
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.



Disease clustering from phenotypic literature data through Document Understanding, Comprehension and Knowledge

(doi: )

What's the problem ?

Typically, SNPs are studied in terms of a "one disease - one SNP" relationship. This results in researchers and clinicians with deep knowledge of a disease but often incomplete knowledge of all potentially relevant SNPs.

Why should we solve it ?

Knowledge of a larger set of potentially relevant SNPs to a collection of phenotypes would allow finding a novel set of relevant publications.

What is ClusterDuck ?

ClusterDuck is a tool to automatically identify genetically-relevant publications and returns relevant

How to use ClusterDuck ?



  • Install python packages required: pip3 install -r requirements.txt

  • Download the pubmed database and required data from nltk: python3 setup.py


  1. Use easy-to-start command line tool ClusterDuck.py

    python3 ClusterDuck.py "Autistic behavior" "Restrictive behavior" "Impaired social interactions" "Poor eye contact" "Impaired ability to form peer relationships" "No social interaction" "Impaired use of nonverbal behaviors" "Lack of peer relationships" "Stereotypy"
  2. A case study

    python3 generate_csv.py

  3. Train Topic Models

    After you have corpora, you can run the following function in train_lda.py to obtain topic models:

    lda1, lda2 = train_ldas(corpus1, corpus2, n_topics=N_TOPICS, alpha=ALPHA, eta=ETA)

    where N_TOPICS, ALPHA and ETA parameterize both topic models.

Test Suite

python3 ./dc/test_utils.py

ClusterDuck Workflow

Workflow Pipeline


Set of phenotypic terms from HPO ontology.


  • A 'phenotypic' corpus of literature is extracted from PubMed using the user-input HPO phenotypic terms.
  • All SNPs mentioned in the 'phenotypic clusters are idenfified.
  • PubMed is queried using the phenotypically-relevant SNPs to extract a second 'phenotypic + genetic' corpus.
  • Topic modeling is run on each corpus separately.
  • Topic distributions are compared to discover new genetically-inspired and relevant topics.


A list of novel genetically-related topics to the initial phenotypic input.

Planned Features

  • Synonyms search from user-input HPO provides a synonym list for each of their controlled vocabulary terms. This can be incorporated as a preprocessor with the user input to allow
  • Make use of hierarchy HPO is an ontology of terms and user-input terms are likely to have sub- and super-class terms.
  • Filtering different types of research articles Optionally add a [PT] query filter to the PubMed query to limit the types of publications returned.
  • Use of EMR-type data to build corpus as oppose to PubMed An EMR-based corpus is more likely to be associated with diseases (especially to ICD terms) than a PubMed-based corpus.


  • Jennifer Dong
  • Larry Gray
  • Joseph Halstead
  • Yi Hsiao
  • Wayne Pereanu
  • Neelay Trivedi
  • Nathan Wan
  • Donghui Wu