Taxon flags

Jonathan A Rees edited this page Feb 21, 2017 · 15 revisions

The last column of the taxonomy.tsv file in the interim taxonomy file format is "flags". Each flags entry is a comma-separated list of flags (markers). Flags are usually generated during taxonomy synthesis and are used to decide whether a taxon should be suppressed in downstream processing. For example, a not_otu flag means that the name may not correspond to anything taxon-like, so it may be desirable to suppress the name.
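As a minimal sketch of reading such an entry, the snippet below splits one flags field into a set of flag names. The function name `parse_flags` is hypothetical, and the sketch assumes the field has already been extracted from its row; it does not assume anything about how columns are delimited in taxonomy.tsv.

```python
from typing import FrozenSet


def parse_flags(field: str) -> FrozenSet[str]:
    """Split a comma-separated flags entry into a set of flag names.

    Hypothetical helper: surrounding whitespace is stripped and empty
    pieces are dropped, so an empty field yields an empty set.
    """
    return frozenset(part.strip() for part in field.split(",") if part.strip())


flags = parse_flags("not_otu,environmental_inherited")
print("not_otu" in flags)  # True
print(parse_flags(""))     # frozenset()
```

A set is used here because the descriptions only ever ask whether a flag is present, not in what order the flags appear.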

The possible values in that field are:

  • 'Incertae sedis'-like flags
    • incertae_sedis - in source taxonomy, was a member of an "incertae sedis" container (also "unallocated", "unclassified", "mitosporic")
    • incertae_sedis_inherited - descends from a node flagged incertae_sedis
    • major_rank_conflict - in source taxonomy, there is a gap (a skipped Linnaean rank) between the node's rank and its parent's rank, while a sibling shows no such gap. For example: a genus in an order that has a sibling that is a family. This flag is only applied to certain sources, e.g. GBIF, which happen to represent "incertae sedis" in this way. Does not apply to NCBI. Processed the same as incertae_sedis.
    • major_rank_conflict_inherited - descends from a node flagged major_rank_conflict
    • unplaced (new in OTT 2.9) - equivalent to incertae_sedis. The node's parent is inconsistent with OTT, i.e. does not fit into the hierarchy, so the node has been made a child of the MRCA of the children of the inconsistent taxon
    • unplaced_inherited - descends from a node flagged unplaced
    • environmental - child of a node whose name contains the strings "environmental samples" or "mycorrhizal samples". Equivalent to incertae_sedis
    • environmental_inherited - descends from a node flagged "environmental"
    • sibling_higher - has a sibling with a higher rank, where major_rank_conflict does not apply. For example: a subfamily with a sibling that's a family. Similar to major_rank_conflict, but treatment as incertae sedis is not definitely warranted. Currently this only serves as a warning to a human browsing the taxonomy; it has no effect on assembly.
    • inconsistent (new in OTT 2.9) - a placeholder or "tombstone" for a taxon that has been removed due to its being inconsistent with higher priority taxa (judged to be not a clade). Does not have children, and can generally be ignored.
    • merged (new in OTT 2.9) - similar to inconsistent, but the children were directly placed in a larger taxon
  • Other flags

    • barren - there are only higher taxa at and below this node, no species or unranked tips
    • extinct - node is annotated as extinct (usually but not always by IRMNG)
    • extinct_inherited - descends from a node flagged extinct.
    • hidden - marked hidden due to Open Tree curatorial decision (e.g. microbes from GBIF)
    • hidden_inherited - descends from a node flagged hidden
    • hybrid - taxon name contains "hybrid" or " x " indicating that it is a hybrid. Also, any node descended from such a node.
    • infraspecific - descends from a node with rank "species"
    • not_otu - the name suggests that this is not a taxon. Keywords interpreted this way include "uncultured", "unclassified", "unidentified", "unknown", "metagenome", "other sequences", "artificial", "libraries", "transposons", and a few others. Also "sp." when at the end of a name. Also, any node descended from such a node. This flag is applied to NCBI taxa but not to SILVA taxa.
    • viral - the taxon name suggests that it has something to do with viruses. Also, any node descended from such a node.
    • was_container - this node used to be a container pseudo-taxon (incertae sedis, environmental samples, etc.) but its children have all been flagged and moved to the node's parent
  • Deprecated flags (occur in old versions of OTT but not in current ones)

    • major_rank_conflict_direct - superseded by was_container
    • unclassified - this is NCBI's way of saying incertae sedis
    • unclassified_inherited - descends from a node flagged unclassified
    • sibling_lower (deprecated as of OTT 2.9)
    • tattered (deprecated as of OTT 2.9 in favor of was_container)
    • tattered_inherited (deprecated as of OTT 2.9 in favor of unplaced and unplaced_inherited)
    • edited - the taxon has been subject to an ad hoc edit ("patch")
    • forced_visible - not currently used
    • extinct_direct - superseded by was_container
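The 'incertae sedis'-like group above can be sketched as a simple membership test. This is an illustration only, not the smasher or taxomachine logic: the names `INCERTAE_SEDIS_LIKE`, `base_flag`, and `is_incertae_sedis_like` are hypothetical, the set contains only the flags described above as equivalent to or processed the same as incertae_sedis, and treating an `_inherited` flag like its direct form is an assumption made for the example.

```python
# Flags described above as equivalent to, or processed the same as,
# incertae_sedis. sibling_higher, inconsistent and merged are left out
# because their descriptions do not state that equivalence.
INCERTAE_SEDIS_LIKE = frozenset(
    {"incertae_sedis", "major_rank_conflict", "unplaced", "environmental"}
)
INHERITED_SUFFIX = "_inherited"


def base_flag(flag: str) -> str:
    """Strip a trailing '_inherited' so both forms of a flag compare equal."""
    if flag.endswith(INHERITED_SUFFIX):
        return flag[: -len(INHERITED_SUFFIX)]
    return flag


def is_incertae_sedis_like(flags_field: str) -> bool:
    """Return True if any flag in the entry belongs to the group (assumption:
    inherited variants count the same as direct ones)."""
    flags = (part.strip() for part in flags_field.split(","))
    return any(base_flag(flag) in INCERTAE_SEDIS_LIKE for flag in flags if flag)


print(is_incertae_sedis_like("unplaced_inherited,barren"))  # True
print(is_incertae_sedis_like("sibling_higher,extinct"))     # False
```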

For more detail see the taxomachine source code and the smasher source code.

Synthesis (treemachine and future methods) and taxomachine are guided by the presence of these flags; each has its own list of flags that it uses as criteria for deciding whether to include an OTT entity in processing. For taxomachine, the flags affect which names are offered via the TNRS. For synthesis, the flags determine whether a node is to be included in the tree.

Flags leading to taxa being unavailable for TNRS

Taxon flags influence the behavior of the taxonomic name resolution services. If a taxon has any of the following flags, it is suppressed for TNRS purposes (i.e. not offered in TNRS results):

* not_otu
* environmental
* environmental_inherited
* viral
* hidden
* hidden_inherited
* was_container
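The rule above (any one of the listed flags suppresses the taxon) can be sketched as a set intersection test. The names `TNRS_SUPPRESSING_FLAGS` and `is_suppressed_for_tnrs` are hypothetical and are not taken from taxomachine; the set simply copies the list above.

```python
# Copied from the list above; not read from the taxomachine source.
TNRS_SUPPRESSING_FLAGS = frozenset(
    {
        "not_otu",
        "environmental",
        "environmental_inherited",
        "viral",
        "hidden",
        "hidden_inherited",
        "was_container",
    }
)


def is_suppressed_for_tnrs(flags_field: str) -> bool:
    """Return True if the comma-separated flags entry contains any
    flag from the TNRS suppression list."""
    flags = {part.strip() for part in flags_field.split(",") if part.strip()}
    return not flags.isdisjoint(TNRS_SUPPRESSING_FLAGS)


print(is_suppressed_for_tnrs("viral,barren"))    # True
print(is_suppressed_for_tnrs("extinct,barren"))  # False
```

Note that a taxon with no flags at all is never suppressed under this rule, since the empty set shares no members with the list.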

Flags leading to taxa being suppressed from the synthetic tree

WRITE ME