Incorporating divergence and database structure into taxonomic inference #11

@evanroyrees

Description

Tasks

  • construct identity distributions and derive taxon significance from them
  • hierarchically cluster binary (presence/absence) matrices, i.e. contig ORFs x NCBI taxid, and determine the height cutoff from taxon ID conservation, i.e. novel taxon assignment
  • evaluate on the Burkholderia metagenome
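
A minimal sketch of the clustering task, under stated assumptions: ORFs are compared by the Jaccard distance between their sets of taxids, clusters are merged by average linkage, and merging stops at a fixed height cutoff. The distance, the linkage, the cutoff value, the function names, and the ORF names and taxids are all illustrative and not taken from this issue; how the cutoff would be derived from taxon ID conservation is left open.

```python
from __future__ import annotations

from itertools import combinations


def jaccard_distance(a: set[int], b: set[int]) -> float:
    """1 - |a & b| / |a | b| for two sets of taxids (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_orfs(orf_taxids: dict[str, set[int]], height_cutoff: float) -> list[set[str]]:
    """Average-linkage agglomerative clustering of ORFs, stopped at height_cutoff.

    Each ORF is a row of the presence/absence matrix, stored here as the set
    of taxids (columns) that are present for that ORF.
    """
    clusters: list[set[str]] = [{orf} for orf in orf_taxids]

    def average_distance(c1: set[str], c2: set[str]) -> float:
        dists = [jaccard_distance(orf_taxids[a], orf_taxids[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), dist = min(
            (
                ((i, j), average_distance(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > height_cutoff:
            break  # every remaining merge would sit above the cutoff
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Made-up example: five ORFs on one contig and the taxids found in their top hits.
example_orfs = {
    "orf_1": {1, 2, 3},
    "orf_2": {1, 2, 3, 4},
    "orf_3": {1, 2},
    "orf_4": {10, 11},
    "orf_5": {10, 11, 12},
}
example_clusters = cluster_orfs(example_orfs, height_cutoff=0.5)
```

For real data, `scipy.cluster.hierarchy` (`linkage` followed by `fcluster`) would likely be a more practical choice than this pure-Python loop.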

Required Datasets:

  • Identity distributions of IMG high-quality genomes: Examine the sequence
    divergence of homologous proteins among high-quality bacterial
    and archaeal genomes downloaded from IMG, constructing identity distributions
    across different genera, families, orders, etc.
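
One way such identity distributions could be built and queried is sketched below. It assumes each homolog comparison has already been reduced to a pair of (percent identity, most specific rank shared by the two genomes); the 5% bin width, the function names, and the example values are hypothetical. The "likelihood" here is only the fraction of observed pairs in an identity bin that share each rank, so it inherits whatever sampling bias the genome set has.

```python
from collections import Counter, defaultdict

BIN_WIDTH = 5  # percent identity per histogram bin (arbitrary choice)


def identity_bin(identity):
    """Lower edge of the bin containing `identity`; 100% falls in the top bin."""
    return min(int(identity // BIN_WIDTH) * BIN_WIDTH, 100 - BIN_WIDTH)


def build_distributions(pairs):
    """Build one identity histogram per rank.

    pairs: iterable of (percent_identity, shared_rank) tuples.
    Returns {rank: Counter({bin_lower_edge: count})}.
    """
    hist = defaultdict(Counter)
    for identity, rank in pairs:
        hist[rank][identity_bin(identity)] += 1
    return hist


def rank_likelihoods(hist, identity):
    """Fraction of observed pairs in this identity bin that share each rank."""
    b = identity_bin(identity)
    counts = {rank: counter[b] for rank, counter in hist.items() if counter[b]}
    total = sum(counts.values())
    return {rank: n / total for rank, n in counts.items()} if total else {}


# Made-up comparisons, for illustration only.
example_pairs = [
    (98.0, "species"), (96.5, "species"), (92.0, "species"),
    (91.0, "genus"), (93.0, "genus"), (88.0, "genus"),
    (78.0, "family"), (74.0, "family"),
]
example_hist = build_distributions(example_pairs)
example_likelihoods = rank_likelihoods(example_hist, 92.5)
```

With real data the bins would need enough pairs per rank to be meaningful, and a smoothed density might be preferable to a raw histogram.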

Expectations and Approach

  • top hits share identity approaching 100% across the whole length of both the query and hit
    sequences
    • assign a specific species with high confidence.
  • a number of hits share at least 95% nucleotide identity with the query (suggesting the query is the same species as the hits)
    • assign the LCA of all hits within this identity threshold.
  • closest hit to an ORF shares <95% nucleotide identity (the query could belong to a novel taxon)
    • scale the specificity of the assignment according to the distance of the closest hits from the query sequence.
      • use identity distributions (see above) to determine the significance of hits at
        specific identity levels, to derive the likelihood of queries and subjects being in
        the same genus or family, for instance.
        • create placeholder taxonomic identifiers for novel taxa, with ranks chosen according to
          the identity to the closest hits (see above).
          • hierarchically cluster binary matrices
            describing the presence of NCBI taxonomy IDs among the top DIAMOND BLASTP hits for the ORFs in
            a contig. Clusters derived from this process are expected to share a significant number of
            taxonomy IDs, so taxonomy ID conservation will be used to determine the height cutoff for cutting
            the hierarchical clustering tree to obtain novel taxonomy clusters.
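
The three cases above can be sketched as a small decision function. This is an illustration under assumptions, not the planned implementation: the `Hit` fields, the 99% cutoff standing in for "approaching 100%", and the example lineages are hypothetical, and the novel-taxon case is only flagged, since ranking it depends on the identity distributions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hit:
    identity: float           # percent identity to the query
    query_cov: float          # percent of the query covered by the alignment
    subject_cov: float        # percent of the hit covered by the alignment
    lineage: tuple[str, ...]  # taxon names ordered from root to species


def lca(lineages: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Longest shared root-to-tip prefix of the given lineages."""
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return tuple(shared)


def assign(hits: list[Hit], species_identity: float = 95.0, near_exact: float = 99.0):
    """Return (case, lineage) following the three cases in the list above."""
    if not hits:
        return ("unclassified", ())
    best = max(hits, key=lambda h: h.identity)
    # Case 1: near-identical over the full length of both query and hit.
    if min(best.identity, best.query_cov, best.subject_cov) >= near_exact:
        return ("species", best.lineage)
    # Case 2: LCA of every hit at or above the species-level identity threshold.
    close = [h.lineage for h in hits if h.identity >= species_identity]
    if close:
        return ("lca", lca(close))
    # Case 3: possible novel taxon; the rank would come from identity distributions.
    return ("novel", ())


# Hypothetical lineages, shortened for illustration.
sp_a = ("Bacteria", "Burkholderiaceae", "Burkholderia", "sp. A")
sp_b = ("Bacteria", "Burkholderiaceae", "Burkholderia", "sp. B")

exact_case = assign([Hit(99.6, 100.0, 100.0, sp_a)])
lca_case = assign([Hit(97.0, 90.0, 95.0, sp_a), Hit(96.0, 92.0, 90.0, sp_b)])
novel_case = assign([Hit(80.0, 85.0, 85.0, sp_a)])
```

Returning the case label alongside the lineage keeps the novel-taxon branch explicit, so a later step can decide which rank the placeholder identifier should receive.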

Evaluation Datasets:

Beetle metagenome with several related uncultured symbionts, one of which possesses a horizontally acquired secondary metabolite biosynthetic gene cluster.

Metadata

Labels

enhancement: New feature or request
