Disease clustering from phenotypic literature data through Document Understanding, Comprehension and Knowledge
What's the problem ?
Typically, SNPs are studied in terms of a "one disease - one SNP" relationship. This results in researchers and clinicians with deep knowledge of a disease but often incomplete knowledge of all potentially relevant SNPs.
Why should we solve it ?
Knowing a larger set of SNPs that are potentially relevant to a collection of phenotypes would make it possible to find a novel set of relevant publications.
What is ClusterDuck ?
ClusterDuck is a tool to automatically identify genetically-relevant publications and returns relevant
How to use ClusterDuck ?
- Python 3
Install the required Python packages:
pip3 install -r requirements.txt
Download the PubMed database and the required data from NLTK:
Use the easy-to-start command-line tool:
python3 ClusterDuck.py "Autistic behavior" "Restrictive behavior" "Impaired social interactions" "Poor eye contact" "Impaired ability to form peer relationships" "No social interaction" "Impaired use of nonverbal behaviors" "Lack of peer relationships" "Stereotypy"
A case study
Train Topic Models
After you have corpora, you can run the following function in train_lda.py to obtain topic models:
lda1, lda2 = train_ldas(corpus1, corpus2, n_topics=N_TOPICS, alpha=ALPHA, eta=ETA)
N_TOPICS, ALPHA, and ETA parameterize both topic models.
A set of phenotypic terms from the Human Phenotype Ontology (HPO).
- A 'phenotypic' corpus of literature is extracted from PubMed using the user-input HPO phenotypic terms.
- All SNPs mentioned in the 'phenotypic' clusters are identified.
- PubMed is queried using the phenotypically-relevant SNPs to extract a second 'phenotypic + genetic' corpus.
- Topic modeling is run on each corpus separately.
- Topic distributions are compared to discover new genetically-inspired and relevant topics.
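The SNP identification and topic comparison steps above can be sketched in plain Python. This is a minimal illustration, not ClusterDuck's actual implementation: it assumes SNPs appear in abstracts as dbSNP-style IDs (`rs` followed by digits), that each topic is available as a word-to-probability dict, and that a topic counts as "new" when its Jensen-Shannon divergence to every topic of the first model exceeds a threshold. The function names and the `threshold` value are hypothetical.

```python
import math
import re

# Assumption: SNPs are written as dbSNP reference IDs such as "rs12345".
RSID_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def extract_rsids(abstracts):
    """Return the sorted, lower-cased set of rsIDs found in an iterable of strings."""
    found = set()
    for text in abstracts:
        found.update(match.lower() for match in RSID_PATTERN.findall(text))
    return sorted(found)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word -> probability dicts."""
    total = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # probability under the mixture distribution
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total


def novel_topics(topics1, topics2, threshold=0.5):
    """Indices of topics in topics2 with no close counterpart in topics1.

    A topic is kept when its smallest divergence to any topic in topics1
    is still larger than `threshold` (a hypothetical cut-off).
    """
    novel = []
    for index, topic in enumerate(topics2):
        closest = min(
            (js_divergence(topic, other) for other in topics1),
            default=float("inf"),
        )
        if closest > threshold:
            novel.append(index)
    return novel
```

With base-2 logarithms the divergence lies between 0 (identical topics) and 1 (no shared words), so the threshold can be read as a fraction of the maximum possible distance.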
A list of novel, genetically-related topics relevant to the initial phenotypic input.
- Synonym search from user input: HPO provides a synonym list for each of its controlled vocabulary terms. This can be incorporated as a preprocessor with the user input to allow
- Make use of hierarchy: HPO is an ontology of terms, and user-input terms are likely to have sub- and super-class terms.
- Filtering different types of research articles: Optionally add a [PT] query filter to the PubMed query to limit the types of publications returned.
- Use of EMR-type data to build the corpus as opposed to PubMed: An EMR-based corpus is more likely to be associated with diseases (especially with ICD terms) than a PubMed-based corpus.
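As a rough illustration of the synonym and publication-type ideas above, the sketch below builds a PubMed search string from phenotype terms. It is not part of ClusterDuck: the function name `build_pubmed_query` is hypothetical, the synonym dict is placeholder data rather than real HPO output, and combining terms with `OR` is an assumption about how the query might be formed. The `[PT]` suffix is PubMed's publication type field tag.

```python
def build_pubmed_query(terms, synonyms=None, publication_types=None):
    """Build a PubMed search string from phenotype terms.

    synonyms: optional dict mapping a term to a list of synonyms
        (placeholder data; a real list would come from HPO).
    publication_types: optional list such as ["Review"]; each entry is
        tagged with PubMed's [PT] field tag.
    """
    synonyms = synonyms or {}
    groups = []
    for term in terms:
        # Each term is OR-ed with its synonyms inside one parenthesised group.
        variants = [term] + list(synonyms.get(term, []))
        quoted = " OR ".join(f'"{variant}"' for variant in variants)
        groups.append(f"({quoted})")
    # Assumption: any of the phenotype terms may match.
    query = " OR ".join(groups)
    if publication_types:
        pt_filter = " OR ".join(f'"{pt}"[PT]' for pt in publication_types)
        query = f"({query}) AND ({pt_filter})"
    return query


if __name__ == "__main__":
    print(build_pubmed_query(
        ["Poor eye contact", "Stereotypy"],
        synonyms={"Stereotypy": ["Stereotypic behavior"]},  # placeholder synonym
        publication_types=["Review"],
    ))
```

Keeping the filter as a separate `AND` clause means the same phenotype query can be reused with or without a publication-type restriction.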
- Jennifer Dong
- Larry Gray
- Joseph Halstead
- Yi Hsiao
- Wayne Pereanu
- Neelay Trivedi
- Nathan Wan
- Donghui Wu