Merged
6 changes: 3 additions & 3 deletions doc/method/abstract.md
@@ -8,8 +8,8 @@ Taxonomy and nomenclature data are critical for any project that synthesizes
biodiversity data, as most biodiveristy data sets use taxonomic names to
identify taxa. Open Tree of Life is one such project, synthesizing sets of
published phylogenetic trees into comprehensive supertrees. No single published
taxonomy met the taxonomic and nomenclatural needs of the project. We therefore
describe here a system for reproducibly combining several source taxonomies into
a synthetic taxonomy, and discuss the challenges of taxonomic and nomenclatural
taxonomy met the taxonomic and nomenclatural needs of the project.
Here we describe a system for reproducibly combining several source taxonomies into
a synthetic taxonomy, and we discuss the challenges of taxonomic and nomenclatural
synthesis for downstream biodiversity projects.

34 changes: 19 additions & 15 deletions doc/method/introduction.md
@@ -15,13 +15,12 @@ synthesis of ten different source taxonomies with different strengths.
The synthesis process is repeatable so that updates to source
taxonomies can be incorporated easily.

Like other biodiversity projects, Open Tree aggregates and reasons
over information about taxa. Information about taxa is typically
Information about taxa is typically
expressed in databases and files in terms of taxon names or
'name-strings' [cite darwin core?]. To combine data sets it is
necessary to be able to determine name equivalence: whether or not an
occurrence of a name-string in one data source refers to the same
thing (taxon) as a given name-string occurrence in another. Solving
taxon as a given name-string occurrence in another. Solving
this equivalence problem requires detecting equivalence when the
name-strings are different (synonym detection), as well as
distinguishing occurrences that only coincidentally have the same
@@ -31,11 +30,11 @@ name-string (homonym detection).

The Open Tree of Life project consists of a set of tools for

1. synthesizing phylogenetic supertrees from a corpus of
1. synthesizing phylogenetic supertrees from a corpus of
phylogenetic tree inputs
(source trees)
2. matching groupings in supertrees with higher taxa (such as Mammalia)
3. supplementing supertrees with taxa obtained only from
3. supplementing supertrees with taxa obtained only from
taxonomy

The outcome is one or more summary trees combining phylogenetic and
@@ -76,12 +75,13 @@ evolutionary biology community, rather than as a one-off study.
Following are all five requirements:

1. *OTU coverage:* The reference taxonomy should have a taxon for
every OTU that has the potential to occur in more than one study.
every OTU that has the potential to occur in more than one study,
over the intended scope of all cellular organisms.
1. *Phylogenetically informed classification:* Higher taxa should be
provided with as much resolution and phylogenetic fidelity as is
reasonable. Ranks and nomenclatural structure should not be
required (since many well-established groups do not have proper
Linnaean names or ranks) and groups at odds with phylogenetic
reasonable. Ranks and nomenclatural structure should not be
required (since many well-established groups do not have proper
Linnaean names or ranks) and groups at odds with phylogenetic
understanding (such as Protozoa) should be avoided.
1. *Taxonomic coverage:* The taxonomy should cover as many as possible of
the species
@@ -92,7 +92,7 @@ Following are all five requirements:
are constantly being added to the literature.
The taxonomy needs to be updated with new information on an ongoing basis.
1. *Open data:* The taxonomy must be available to anyone for unrestricted use.
Users should not have to ask permission to copy and use the taxonomy,
Users should not have to ask permission to copy and use the taxonomy,
nor should they be bound by terms of use that interfere with further reuse.

[would this be a place to 'highlight' transparency as theme or goal? -
@@ -102,7 +102,7 @@ No single available taxonomic source meets all five requirements. The
NCBI taxonomy has good coverage of OTUs, provides a rich source of
phyogenetically informed higher taxa, and is open, but its taxonomic
coverage is limited to taxa that have sequence data in GenBank (about
360455 species having standard binomial names). Traditional all-life
360,000 species having standard binomial names at the time of this writing). Traditional all-life
taxonomies such as Catalogue of Life, IRMNG, and GBIF meet the
taxonomic coverage requirement, but miss many OTUs from our input
trees, and their higher-level taxonomies are often not as
@@ -113,7 +113,11 @@ taxonomy with a traditional broad taxonomy that is also open.
These requirements cannot be met in an absolute sense; each is a 'best
effort' requirement subject to availability of project resources.

Note that the Open Tree Taxonomy is *not* supposed to be 1) a
reference for nomenclature (we can link to other sources for that); 2)
a well-formed or complete taxonomic hypothesis; or 3) a place to
deposit curated taxonomic information.
Note that the Open Tree Taxonomy is *not* supposed to be a
reference for nomenclature; it links to other sources for nomenclatural and other information.
Nor is it a place to deposit curated taxonomic information.
The taxonomy has not been vetted in detail, as that would be beyond
the capacity and focus of the Open Tree project.
It is known to contain many taxon duplications and technical artifacts.
Tolerating these shortcomings is a necessary tradeoff in
attempting to meet the above requirements.
64 changes: 47 additions & 17 deletions doc/method/method-details.md
@@ -47,6 +47,14 @@ incorrect tree but to downstream curation errors in OTU matching
annotation (information about one copy not propagating to the other)
and to loss of unification opportunities in phylogeny synthesis.

As described above, source taxonomies are processed (aligned and
merged) in priority order. For each source taxonomy, ad hoc
adjustments are applied before automatic alignments. For automatic
alignment, alignments closest to the tips of the source taxonomy are
found in a first pass, and all others in a second pass. The two-pass
structure permits first-pass alignments to be used during the second
pass (see Overlap, below).
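
The two-pass structure described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function `align_two_pass`, the two callbacks, and the `ws:` workspace identifiers are hypothetical, and "closest to the tips" is simplified here to mean the tips themselves.

```python
from typing import Callable, Dict, List, Optional


def align_two_pass(
    nodes: List[str],
    children: Dict[str, List[str]],
    align_first: Callable[[str], Optional[str]],
    align_second: Callable[[str, Dict[str, str]], Optional[str]],
) -> Dict[str, str]:
    """Align tips first, then all other nodes (assumed simplification).

    The second-pass callback receives the alignments found so far, so it
    can consult first-pass results.
    """
    alignment: Dict[str, str] = {}
    tips = [n for n in nodes if not children.get(n)]
    others = [n for n in nodes if children.get(n)]

    # First pass: nodes with no children.
    for n in tips:
        image = align_first(n)
        if image is not None:
            alignment[n] = image

    # Second pass: remaining nodes, with access to first-pass alignments.
    for n in others:
        image = align_second(n, alignment)
        if image is not None:
            alignment[n] = image
    return alignment


# Toy source taxonomy: one genus with one species.
children = {"Peranema": ["Peranema cryptocercum"]}
nodes = ["Peranema", "Peranema cryptocercum"]


def by_name(n: str) -> Optional[str]:
    # Hypothetical first-pass rule: align to the workspace node of the same name.
    return "ws:" + n


def by_aligned_child(n: str, done: Dict[str, str]) -> Optional[str]:
    # Hypothetical second-pass rule: align only if some child is already aligned.
    return "ws:" + n if any(c in done for c in children[n]) else None


result = align_two_pass(nodes, children, by_name, by_aligned_child)
```

In this toy run the genus is aligned only because its species was aligned in the first pass, which is the dependency the two-pass ordering is meant to allow.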

### Ad hoc alignment adjustments

Automated alignment is preceded by ad hoc 'adjustments' that address
@@ -93,13 +101,15 @@ every source node that also has that name-string.

The purpose of the alignment phase is to choose a single correct
candidate for each source node, or to reject all candidates if none is
correct.
correct. For 97% of source nodes, there are no candidates or only one
candidate, and selection is fairly simple, but the remaining nodes
require special treatment.

Example: There are two nodes named _Aporia lemoulti_ in the GBIF
backbone taxonomy; one is a plant and the other is an insect. (One of
backbone taxonomy; one is a plant and the other is an insect. One of
these two is an erroneous duplication, but the automated system has to
be able to cope with this situation because we don't have the
resources to correct all source taxonomy errors!) It is necessary to
resources to correct all source taxonomy errors. It is necessary to
choose the right candidate for the IRMNG node with name _Aporia
lemoulti_. Consequences of incorrect placement might include putting
siblings of IRMNG _Aporia lemoulti_ in the wrong kingdom as well.
@@ -149,15 +159,17 @@ heuristics are as follows:
the family-rank ancestor node of n' is the same as the name-string of the
family-rank ancestor node of n.

(Example: _Hyphodontia quercina_ irmng:11021089
aligns with Hyphodontia quercina in Index Fungorum [if:298799],
not Malacodon candidus [if:505193]. [Not a great example because
a later heuristic would have gotten it.] The synonymy is via GBIF.)
(Example: Source node _Plasmodiophora diplantherae_ from Index
Fungorum, in Protozoa, has one workspace candidate derived from
NCBI and another from WoRMS. Because the source node and the NCBI
candidate both claim to be in a taxon with name 'Phytomyxea', while the
WoRMS candidate has no applicable lineage in common, the NCBI
candidate is chosen.)

The details are complicated because (a) every pair of nodes have
at least _some_ of their lineage in common, and (b) genus names do not
provide any information when comparing species nodes with the same
name-string, so we can't always just look at the parent taxon. The exact
name-string, so for species we can't just look at the parent taxon. The exact
rule used is the following:

Define the 'quasiparent name' of n, q(n), to be the
@@ -168,26 +180,41 @@ heuristics are as follows:
an ancestor of n', or q(n') is the name-string of an ancestor of n,
then prefer n to candidates that lack these properties.

[MTH: this section is clear, but it is not clear to the reader what
order nodes in the source are aligned. That seems to make a difference here.
JAR: there is no order dependence, because the
heuristic is comparing names, not checking for nodes alignment.
I think that is implied by the detailed description, but
I've tried to make the example text reinforce this fact.]

1. **Overlap**: Prefer to align n' to n if they are higher level groupings that overlap.
Stated a bit more carefully: if n' has a descendant aligned to
a descendant of n.

(Example: need example. From OTT 2.10: if n' = Scyphocoronis, Millotia is preferred
to Scyphocoronis. - seems to be gone from OTT 2.11)
(Example: Source node _Peranema_ from GBIF has two candidates from NCBI.
One candidate shares descendant _Peranema cryptocercum_ with the source taxon,
while the other shares no descendants with the source taxon.
The source is therefore aligned to the one with the shared descendant.)

1. **Proximity** [opposite of "separation"; not a great name]:
Suppose the separation taxonomy includes A and B,
with B contained in A.
If node n' is in B, then prefer candidates that are in B to those that are in A but not in B.
Prefer candidates n with the property that
the smallest separation taxon containing the source node n'
is also the smallest separation taxon containing a candidate n.

(Example: for IRMNG _Macbrideola indica_, prefer _Macbrideola coprophila_
to _Utharomyces epallocaulus_. [get more info])
(Example: for source node Heterocheilidae in IRMNG (a nematode family) whose smallest
separation ancestor is Metazoa, prefer
the NCBI Taxonomy candidate with smallest separation ancestor
Metazoa (also a nematode family) to the one with smallest separation
ancestor Diptera (a fly family).)

1. **Same name-string**: Prefer candidates whose primary name-string
is the same as the primary name-string of n'.

(Example: candidate _Zabelia tyaihyoni_ preferred to candidate _Zabelia mosanensis_ for
n' = GBIF _Zabelia tyaihyoni_.)
(Example: For source node n' = GBIF _Zabelia tyaihyoni_,
candidate _Zabelia tyaihyoni_ from NCBI is preferred to candidate
_Zabelia mosanensis_, also from NCBI. NCBI _Z. mosanensis_ is a
candidate for n' because GBIF declares that _Z. mosanensis_ is a synonym
for GBIF _Z. tyaihyoni_.)

If there is a single candidate that is not rejected by any heuristic,
it is aligned to that candidate.
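
As an illustration of the Overlap and Proximity heuristics above, here is a minimal Python sketch built on the Peranema and Heterocheilidae examples. The `Node` class, the helper names, and the `prefer` rule (keep the candidates that pass a test, and treat the heuristic as uninformative if none pass) are assumptions made for this sketch, not the project's code. It also assumes that a separation taxon counts as containing itself.

```python
from typing import Callable, Dict, Iterator, List, Optional, Set


class Node:
    """Hypothetical taxon record: a name-string and a parent link."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node strictly below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)


def overlaps(source: Node, candidate: Node, alignment: Dict[Node, Node]) -> bool:
    """Overlap: some descendant of `source` is aligned to a descendant of `candidate`."""
    below = set(descendants(candidate))
    return any(alignment.get(d) in below for d in descendants(source))


def smallest_separation_taxon(node: Node, separation_names: Set[str]) -> Optional[str]:
    """Name of the nearest ancestor-or-self found in the separation taxonomy."""
    current: Optional[Node] = node
    while current is not None:
        if current.name in separation_names:
            return current.name
        current = current.parent
    return None


def prefer(candidates: List[Node], test: Callable[[Node], bool]) -> List[Node]:
    """Keep candidates passing `test`; if none pass, keep them all (assumed rule)."""
    passing = [c for c in candidates if test(c)]
    return passing if passing else list(candidates)


# Overlap: GBIF Peranema has two NCBI candidates; only one shares a descendant.
gbif_peranema = Node("Peranema")
gbif_species = Node("Peranema cryptocercum", gbif_peranema)
ncbi_a = Node("Peranema")
ncbi_species = Node("Peranema cryptocercum", ncbi_a)
ncbi_b = Node("Peranema")
alignment = {gbif_species: ncbi_species}  # found earlier, closer to the tips
overlap_choice = prefer(
    [ncbi_a, ncbi_b], lambda c: overlaps(gbif_peranema, c, alignment)
)

# Proximity: the toy separation taxonomy has Diptera contained in Metazoa.
separation = {"Metazoa", "Diptera"}
irmng_family = Node("Heterocheilidae", Node("Metazoa"))
ws_metazoa = Node("Metazoa")
ws_diptera = Node("Diptera", ws_metazoa)
nematode_family = Node("Heterocheilidae", ws_metazoa)
fly_family = Node("Heterocheilidae", ws_diptera)
target = smallest_separation_taxon(irmng_family, separation)
proximity_choice = prefer(
    [nematode_family, fly_family],
    lambda c: smallest_separation_taxon(c, separation) == target,
)
```

Here `overlap_choice` keeps only the NCBI candidate that shares _Peranema cryptocercum_, and `proximity_choice` keeps only the candidate whose smallest separation ancestor is Metazoa. The toy trees flatten the real lineages to the few nodes the examples mention.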
@@ -320,6 +347,9 @@ up while merging taxonomies. [tbd: turn the newicks into a multi-part figure]
[not such a great technical term: 'absorption' - but we need a term. The code currently
says 'merged' and that would be way too confusing]

[MTH: wouldn't the previous answer: ((a,b)x,(c,d)y,?e)z also mean
that e is a proper child of z, but is just uncertain wrt x and y?]

1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)

If the S' topology is incompatible with the S topology,
15 changes: 7 additions & 8 deletions doc/method/method-intro.md
@@ -8,8 +8,8 @@ necessary. However, it is not clear how to meet the the ongoing update
requirement under this approach. As the source taxonomies change, we would like
for the combined taxonomy to contain only information derived from the latest
versions of the sources, without residual information from previous versions. Many
changes to the sources are corrections, and we do not want to hang on to or even
be influenced by information known to be incorrect.
changes to the sources are corrections, and we do not want to rely on
information known to be incorrect.

Rather than maintain a database of taxonomic information, we instead developed a
process assembling a taxonomy from two or more taxonomic sources. With a
@@ -27,7 +27,7 @@ Open Tree reference taxonomy version 2.11.
* workspace = data structure for creation of the reference
taxonomy
* node = a taxon record, either from a source taxonomy or the workspace.
records primary name-string, provenance,
Records primary name-string, provenance,
parent node, optional rank, optional annotations
* parent (node) = the nearest enclosing node within a given node's taxonomy
* tip = a node that is not the parent of any node
@@ -39,11 +39,10 @@ Open Tree reference taxonomy version 2.11.
when recorded in a given taxonomy.
* primary = the non-synonym name-string of a node, as opposed to one of the synonyms.
* image (of a node n') = the workspace node corresponding to n'
* _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is in B
but is not known to be outside of A's non-_incertae-sedis_ children. That is,
if we had more information, it might turn out that B is a
member of one of the other children of A.

* _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is a child of B
but is not known to be outside of B's non-_incertae-sedis_ children. That is,
if we had more information, it might turn out that A is a
member of one of the other children of B.

## Assembly overview

4 changes: 4 additions & 0 deletions doc/method/results.md
@@ -32,6 +32,10 @@ There are 8043 name-strings in the taxonomy for which there are
multiple nodes. By comparison, there are only 1440 in GBIF. Many of
the homonyms are artifacts of the alignment method, especially the
rule that says genera that do not share species are presumed disjoint.
[MTH: I missed this rule, which one was it?
JAR: big TBD - this kludge has to be described - I didn't even notice
it until recently. Perhaps it can even be
removed so that it doesn't need to be described.]

## Evaluating the taxonomy relative to requirements
