Open
Labels: enhancement (New feature or request)
Description
Tasks
- construct identity distributions and derive taxon significance from them
- hierarchical clustering of binary (presence/absence) matrices, i.e. contig ORFs x NCBI taxid, with the height cutoff determined from taxon ID conservation, i.e. novel taxon assignment
- evaluation on a Burkholderia metagenome
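To make the clustering task concrete, here is a minimal sketch of clustering ORFs by the taxids present in their top hits and cutting the tree at a fixed height. Everything specific in it is an assumption for illustration: the Jaccard distance, the average linkage, the function names `jaccard_distance` and `cluster_orfs`, the `height_cutoff` value, and the toy taxid sets. The issue does not fix any of these, and the cutoff is meant to come from taxon ID conservation rather than being hard-coded.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_orfs(orf_taxids: dict, height_cutoff: float) -> list:
    """Average-linkage agglomerative clustering of ORFs.

    orf_taxids maps an ORF name to the set of taxids seen in its top hits,
    i.e. one row of the binary presence/absence matrix.
    Merging stops once the closest pair of clusters is farther apart than
    height_cutoff, which is equivalent to cutting the tree at that height.
    """
    rows = {orf: frozenset(taxids) for orf, taxids in orf_taxids.items()}
    clusters = [[orf] for orf in sorted(rows)]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        dists = [jaccard_distance(rows[a], rows[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > height_cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy input: taxid sets are illustrative, not real DIAMOND output.
hits = {
    "orf_1": {562, 561, 543},
    "orf_2": {562, 561},
    "orf_3": {1279, 1280},
    "orf_4": {1279, 1280, 90964},
}
clusters = cluster_orfs(hits, height_cutoff=0.5)
print(sorted(sorted(c) for c in clusters))
```

The explicit loop is quadratic per merge and only meant to show the idea; on real contigs a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` followed by `fcluster` would likely be the better choice.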
Required Datasets:
- Identity Distributions of IMG High Quality genomes: Examine the sequence
divergence of homologous proteins amongst high quality bacterial
and archaeal genomes downloaded from IMG, constructing identity distributions
across different genera, families, orders, etc.
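One possible shape for the identity distributions is sketched below: homolog pairs are grouped by the lowest taxonomic rank the two source genomes share, and the empirical fraction of pairs at or above a given identity is read off each group. The input format, the function names `build_identity_distributions` and `fraction_at_or_above`, and all numbers are assumptions made up for illustration; they are not IMG data.

```python
from bisect import bisect_left
from collections import defaultdict


def build_identity_distributions(pairs):
    """Group percent identities by the lowest rank shared by the two genomes.

    pairs is an iterable of (shared_rank, percent_identity) tuples.
    Returns a dict mapping each rank to a sorted list of identities.
    """
    grouped = defaultdict(list)
    for rank, identity in pairs:
        grouped[rank].append(identity)
    return {rank: sorted(values) for rank, values in grouped.items()}


def fraction_at_or_above(sorted_identities, threshold):
    """Empirical fraction of identities that are >= threshold."""
    if not sorted_identities:
        return 0.0
    first = bisect_left(sorted_identities, threshold)
    return (len(sorted_identities) - first) / len(sorted_identities)


# Made-up example values, for illustration only.
pairs = [
    ("genus", 92.0), ("genus", 88.5), ("genus", 95.1),
    ("family", 71.0), ("family", 80.2), ("family", 65.4),
    ("order", 55.0), ("order", 61.3),
]
distributions = build_identity_distributions(pairs)
genus_at_90 = fraction_at_or_above(distributions["genus"], 90.0)
family_at_90 = fraction_at_or_above(distributions["family"], 90.0)
print(genus_at_90, family_at_90)
```

Comparing such fractions across ranks at a given identity level is one way to judge how likely a query and its hit are to belong to the same genus, family, or order.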
Expectations and Approach
- top hits share identity approaching 100% across the whole length of both the query and hit sequences
  - assign a specific species with high confidence.
- a number of hits share ≥95% nucleotide identity with the query (suggesting the query is the same species as the hits)
  - assign the LCA of all hits within this identity threshold.
- closest hit to an ORF shares <95% nucleotide identity (the ORF could belong to a novel taxon)
  - scale the specificity of assignment based on the distance of the closest hits from the query sequence.
  - use identity distributions (see above) to determine the significance of hits at specific identity levels, to derive the likelihood of queries and subjects being in the same genus or family, for instance.
  - invoke placeholder new taxonomic identifiers, with appropriate ranks based on the identity to closest hits (see above).
  - hierarchical clustering of binary matrices describing the presence of NCBI taxonomy IDs within the top DIAMOND BLASTP hits for ORFs within the contig. Clusters derived from this process are expected to share a significant number of taxonomy IDs, so taxonomy ID conservation will be used to determine the height cutoff for trimming the hierarchical clustering tree in order to obtain novel taxonomy clusters.
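The three cases above can be sketched as a single decision function. This is an illustration under stated assumptions, not the planned implementation: hits are assumed to be pre-filtered to full-length alignments, lineages are given as root-first lists of `(rank, name)`, the 99% "approaching 100%" cutoff is a guess, and the `RANK_CUTOFFS` table is a hypothetical stand-in for values that would be derived from the identity distributions. Creating placeholder taxonomic identifiers is left out.

```python
# Hypothetical identity -> rank cutoffs; placeholder values only. In the
# proposed approach these would come from the identity distributions.
RANK_CUTOFFS = [(85.0, "genus"), (70.0, "family"), (50.0, "order")]


def lowest_common_ancestor(lineages):
    """Return the deepest (rank, name) node shared by all lineages.

    Each lineage is a list of (rank, name) tuples ordered root first.
    Returns None if the lineages share no node.
    """
    common = None
    for level in zip(*lineages):
        if all(node == level[0] for node in level):
            common = level[0]
        else:
            break
    return common


def ancestor_at_rank(lineage, rank):
    """Return the (rank, name) node of a lineage at the given rank, if any."""
    for node in lineage:
        if node[0] == rank:
            return node
    return None


def assign_orf(hits, near_identical=99.0, same_species=95.0):
    """Assign a taxon to an ORF from (percent_identity, lineage) hits."""
    if not hits:
        return None
    best_identity, best_lineage = max(hits, key=lambda hit: hit[0])
    # Case 1: near-identical top hit, assign its species directly.
    if best_identity >= near_identical:
        return best_lineage[-1]
    # Case 2: several hits at or above the species threshold, assign their LCA.
    close = [lineage for identity, lineage in hits if identity >= same_species]
    if close:
        return lowest_common_ancestor(close)
    # Case 3: possible novel taxon, coarsen the rank as identity drops.
    for cutoff, rank in RANK_CUTOFFS:
        if best_identity >= cutoff:
            return ancestor_at_rank(best_lineage, rank)
    return None


e_coli = [("family", "Enterobacteriaceae"), ("genus", "Escherichia"),
          ("species", "Escherichia coli")]
e_fergusonii = [("family", "Enterobacteriaceae"), ("genus", "Escherichia"),
                ("species", "Escherichia fergusonii")]

print(assign_orf([(99.6, e_coli)]))
print(assign_orf([(96.2, e_coli), (95.4, e_fergusonii)]))
print(assign_orf([(78.0, e_coli)]))
```

Keeping the cutoffs in one table makes it straightforward to swap the placeholder values for ones estimated from the identity distributions.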
Evaluation Datasets:
- Beetle metagenome with several related uncultured symbionts, one of which possesses a horizontally acquired secondary metabolite biosynthetic gene cluster.