Skip to content


Folders and files

Last commit message
Last commit date

Latest commit



65 Commits

Repository files navigation

Classifying Diseases by Clustering Phenotypes using Automated Text Analysis

A versatile tool to classify diseases using cluster analysis of published phenotypic data

What does PubClass do?

PubClass is a pipeline that will take your PubMed query and output clusters of phenotypes and/or genes based on their co-occurrence in the published literature.

Why should you use PubClass?

The ability to subclassify a disease can directly affect diagnosis and treatment options for a patient. PubClass allows you to determine if a disease of interest can be further classified based on phenotypic data from the entire corpus of published data.


Autism is a brain-based disorder that affects how information is processed. Its cause is not known. Genetic factors are thought to be responsible for 70-90% of cases. There are also related environmental factors.

Autism is a spectrum disorder, which means that patients with it can have a wide range of conditions.

Right now, there is a split in the field. There are some researchers who believe that there is a single common cause among all of the types of autism. Another camp hypothesizes that there could be many completely different types of problems that can result in a patient getting diagnosed with autism.


Cluster analysis has been used by a number of researchers to try to determine if there are distinct subtypes of autism, or just one with varying conditions.

We are taking the approach of doing a literature-wise analysis by processing all abstracts from autism-related publications. We will try cluster analysis using diseases, genes, or topics mentioned in the abstracts. Scientific analysis of the resulting clusters from all related literature may help shed light on the etiology of autism.



  1. INPUT: Collect user-supplied query e.g. "autism" and additional clustering parameters.

2a. Extract abstracts and keywords from the result of NCBI's PubMed query.

2b. Load dictionary to tokenize abstracts for clustering and cluster-metadata analysis.

  1. Perform text-processing (maybe just stemming and tokenization) to extract segments/tokens.

  2. OUTPUT: Output disease-specific clusters based on co-occurrence of topics/tokens/segments.

  3. Analysis of output clusters.

Autism Clustering Results




Papers that have used cluster analysis

Kienle X, Freiberger V, Greulich H, Blank R. Autism Spectrum disorder and DSM-5: Spectrum or cluster? Prax Kinderpsychol Kinderpsychiatr. 2015;64(6):412–28.

Hrdlicka M, Dudova I, Beranova I, Lisy J, Belsan T, Neuwirth J, Komarek V, Faladova L, Havlovicova M, Sedlacek Z, et al. Subtypes of autism by cluster analysis based on structural MRI data. Eur Child Adolesc Psychiatry. 2005;14(3):138–44.

Bitsika V, Sharpley CF, Orapeleng S. An exploratory analysis of the use of cognitive, adaptive and behavioural indices for cluster analysis of ASD subgroups. J Intellect Disabil Res. 2008;52(11):973–85.

Ji NY, Capone GT, Kaufmann WE. Autism spectrum disorder in Down syndrome: cluster analysis of aberrant behaviour checklist data supports diagnosis. J Intellect Disabil Res. 2011;55(11):1064–77.

Cuccaro ML, Tuchman RF, Hamilton KL, Wright HH, Abramson RK, Haines JL, Gilbert JR, Pericak-Vance M. Exploring the relationship between autism spectrum disorder and epilepsy using latent class cluster analysis. J Autism Dev Disord. 2012;42(8):1630–41.

Palmer CJ, Paton B, Enticott PG, Hohwy J. ‘Subtypes’ in the presentation of autistic traits in the general adult population. J Autism Dev Disord. 2015;45(5):1291–301.

Kitazoe N, Fujita N, Izumoto Y, Terada SI, Hatakenaka Y. Whether the autism Spectrum quotient consists of two different subgroups? Cluster analysis of the autism Spectrum quotient in general population. Autism. 2017;21(3):323–32.

Tanaka S, Oi M, Fujino H, Kikuchi M, Yoshimura Y, Miura Y, Tsujii M, Ohoka H. Characteristics of communication among Japanese children with autism spectrum disorder: a cluster analysis using the Children's communication Checklist-2. Clin Linguist Phon. 2017;31(3):234–49.

People / Team

Wayne Sebastian Pereanu

Syed Hussain Ather

Jifan Gao

Kyle Lemoi

Jack Sklar

Olaitan I. Awe




A versatile tool to classify diseases using cluster analysis of published phenotypic data







No releases published


No packages published