From 3b9565debec0d854dc8cd4bf56859b4e89fc2b37 Mon Sep 17 00:00:00 2001 From: "Mark T. Holder" Date: Fri, 3 Feb 2017 11:22:18 -0600 Subject: [PATCH 1/8] comments --- doc/method/abstract.md | 6 +++--- doc/method/introduction.md | 10 +++++----- doc/method/method-details.md | 14 ++++++++++++-- doc/method/method-intro.md | 10 ++++++---- doc/method/results.md | 2 ++ 5 files changed, 28 insertions(+), 14 deletions(-) diff --git a/doc/method/abstract.md b/doc/method/abstract.md index bd7b24e..a618804 100644 --- a/doc/method/abstract.md +++ b/doc/method/abstract.md @@ -8,8 +8,8 @@ Taxonomy and nomenclature data are critical for any project that synthesizes biodiversity data, as most biodiveristy data sets use taxonomic names to identify taxa. Open Tree of Life is one such project, synthesizing sets of published phylogenetic trees into comprehensive supertrees. No single published -taxonomy met the taxonomic and nomenclatural needs of the project. We therefore -describe here a system for reproducibly combining several source taxonomies into -a synthetic taxonomy, and discuss the challenges of taxonomic and nomenclatural +taxonomy met the taxonomic and nomenclatural needs of the project. +Here we describe a system for reproducibly combining several source taxonomies into +a synthetic taxonomy, and we discuss the challenges of taxonomic and nomenclatural synthesis for downstream biodiversity projects. diff --git a/doc/method/introduction.md b/doc/method/introduction.md index 9f02c87..e146404 100644 --- a/doc/method/introduction.md +++ b/doc/method/introduction.md @@ -15,13 +15,12 @@ synthesis of ten different source taxonomies with different strengths. The synthesis process is repeatable so that updates to source taxonomies can be incorporated easily. -Like other biodiversity projects, Open Tree aggregates and reasons -over information about taxa. Information about taxa is typically +Information about taxa is typically expressed in databases and files in terms of taxon names or 'name-strings' [cite darwin core?]. To combine data sets it is necessary to be able to determine name equivalence: whether or not an occurrence of a name-string in one data source refers to the same -thing (taxon) as a given name-string occurrence in another. Solving +taxon as ag iven name-string occurrence in another. Solving this equivalence problem requires detecting equivalence when the name-strings are different (synonym detection), as well as distinguishing occurrences that only coincidentally have the same @@ -102,7 +101,7 @@ No single available taxonomic source meets all five requirements. The NCBI taxonomy has good coverage of OTUs, provides a rich source of phyogenetically informed higher taxa, and is open, but its taxonomic coverage is limited to taxa that have sequence data in GenBank (about -360455 species having standard binomial names). Traditional all-life +360,455 species having standard binomial names). Traditional all-life taxonomies such as Catalogue of Life, IRMNG, and GBIF meet the taxonomic coverage requirement, but miss many OTUs from our input trees, and their higher-level taxonomies are often not as @@ -115,5 +114,6 @@ effort' requirement subject to availability of project resources. Note that the Open Tree Taxonomy is *not* supposed to be 1) a reference for nomenclature (we can link to other sources for that); 2) -a well-formed or complete taxonomic hypothesis; or 3) a place to +a well-formed or complete taxonomic hypothesis; [MTH: need some clarification +on point 2, as it sounds like a contradiction of requirements 2 and 3 above] or 3) a place to deposit curated taxonomic information. diff --git a/doc/method/method-details.md b/doc/method/method-details.md index 100635f..9a92b27 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -96,10 +96,10 @@ candidate for each source node, or to reject all candidates if none is correct. Example: There are two nodes named _Aporia lemoulti_ in the GBIF -backbone taxonomy; one is a plant and the other is an insect. (One of +backbone taxonomy; one is a plant and the other is an insect. One of these two is an erroneous duplication, but the automated system has to be able to cope with this situation because we don't have the -resources to correct all source taxonomy errors!) It is necessary to +resources to correct all source taxonomy errors. It is necessary to choose the right candidate for the IRMNG node with name _Aporia lemoulti_. Consequences of incorrect placement might include putting siblings of IRMNG _Aporia lemoulti_ in the wrong kingdom as well. @@ -168,6 +168,9 @@ heuristics are as follows: an ancestor of n', or q(n') is the name-string of an ancestor of n, then prefer n to candidates that lack these properties. + [MTH: this section is clear, but it is not clear to the reader what + order nodes in the source are aligned. That seems to make a difference here.] + 1. **Overlap**: Prefer to align n' to n if they are higher level groupings that overlap. Stated a bit more carefully: if n' has a descendant aligned to a descendant of n. @@ -175,6 +178,8 @@ heuristics are as follows: (Example: need example. From OTT 2.10: if n' = Scyphocoronis, Millotia is preferred to Scyphocoronis. - seems to be gone from OTT 2.11) + [MTH: this example is not helpful.] + 1. **Proximity** [opposite of "separation"; not a great name]: Suppose the separation taxonomy includes A and B, with B contained in A. @@ -189,6 +194,8 @@ heuristics are as follows: (Example: candidate _Zabelia tyaihyoni_ preferred to candidate _Zabelia mosanensis_ for n' = GBIF _Zabelia tyaihyoni_.) + [MTH: is there a synonym in this example? seems obvious as stated.] + If there is a single candidate that is not rejected by any heuristic, it is aligned to that candidate. @@ -320,6 +327,9 @@ up while merging taxonomies. [tbd: turn the newicks into a multi-part figure] [not such a great technical term: 'absorption' - but we need a term. The code currently says 'merged' and that would be way too confusing] + [MTH: wouldn't the previous answer: ((a,b)x,(c,d)y,?e)z also mean + that e is a proper child of z, but is just uncertain wrt x and y?] + 1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e) If the S' topology is incompatible with the S topology, diff --git a/doc/method/method-intro.md b/doc/method/method-intro.md index 929dcea..fad4f35 100644 --- a/doc/method/method-intro.md +++ b/doc/method/method-intro.md @@ -8,8 +8,8 @@ necessary. However, it is not clear how to meet the the ongoing update requirement under this approach. As the source taxonomies change, we would like for the combined taxonomy to contain only information derived from the latest versions of the sources, without residual information from previous versions. Many -changes to the sources are corrections, and we do not want to hang on to or even -be influenced by information known to be incorrect. +changes to the sources are corrections, and we do not want to rely on +information known to be incorrect. Rather than maintain a database of taxonomic information, we instead developed a process assembling a taxonomy from two or more taxonomic sources. With a @@ -27,7 +27,7 @@ Open Tree reference taxonomy version 2.11. * workspace = data structure for creation of the reference taxonomy * node = a taxon record, either from a source taxonomy or the workspace. - records primary name-string, provenance, + Records primary name-string, provenance, parent node, optional rank, optional annotations * parent (node) = the nearest enclosing node within a given node's taxonomy * tip = a node that is not the parent of any node @@ -43,7 +43,9 @@ Open Tree reference taxonomy version 2.11. but is not known to be outside of A's non-_incertae-sedis_ children. That is, if we had more information, it might turn out that B is a member of one of the other children of A. - + [MTH: I found this confusing. Is B the taxon that is flagged as _incertae sedis_? + It seems odd to say "A is _incertae sedis_ in taxon B" if B is the + _incertae sedis_ taxon (and B is in A).] ## Assembly overview diff --git a/doc/method/results.md b/doc/method/results.md index 75c8608..faff8e6 100644 --- a/doc/method/results.md +++ b/doc/method/results.md @@ -33,6 +33,8 @@ multiple nodes. By comparison, there are only 1440 in GBIF. Many of the homonyms are artifacts of the alignment method, especially the rule that says genera that do not share species are presumed disjoint. +[MTH: I missed this rule, which one was it?] + ## Evaluating the taxonomy relative to requirements The introduction sets out requirements for an Open Tree taxonomy. From 4337405d79d6481371943d2c06f154c42185aced Mon Sep 17 00:00:00 2001 From: Karen Cranston Date: Mon, 6 Feb 2017 09:30:11 -0500 Subject: [PATCH 2/8] typo in intro --- doc/method/introduction.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/method/introduction.md b/doc/method/introduction.md index e146404..f6c7515 100644 --- a/doc/method/introduction.md +++ b/doc/method/introduction.md @@ -20,7 +20,7 @@ expressed in databases and files in terms of taxon names or 'name-strings' [cite darwin core?]. To combine data sets it is necessary to be able to determine name equivalence: whether or not an occurrence of a name-string in one data source refers to the same -taxon as ag iven name-string occurrence in another. Solving +taxon as a given name-string occurrence in another. Solving this equivalence problem requires detecting equivalence when the name-strings are different (synonym detection), as well as distinguishing occurrences that only coincidentally have the same @@ -30,11 +30,11 @@ name-string (homonym detection). The Open Tree of Life project consists of a set of tools for -1. synthesizing phylogenetic supertrees from a corpus of +1. synthesizing phylogenetic supertrees from a corpus of phylogenetic tree inputs (source trees) 2. matching groupings in supertrees with higher taxa (such as Mammalia) -3. supplementing supertrees with taxa obtained only from +3. supplementing supertrees with taxa obtained only from taxonomy The outcome is one or more summary trees combining phylogenetic and @@ -78,9 +78,9 @@ Following are all five requirements: every OTU that has the potential to occur in more than one study. 1. *Phylogenetically informed classification:* Higher taxa should be provided with as much resolution and phylogenetic fidelity as is - reasonable. Ranks and nomenclatural structure should not be - required (since many well-established groups do not have proper - Linnaean names or ranks) and groups at odds with phylogenetic + reasonable. Ranks and nomenclatural structure should not be + required (since many well-established groups do not have proper + Linnaean names or ranks) and groups at odds with phylogenetic understanding (such as Protozoa) should be avoided. 1. *Taxonomic coverage:* The taxonomy should cover as many as possible of the species @@ -91,7 +91,7 @@ Following are all five requirements: are constantly being added to the literature. The taxonomy needs to be updated with new information on an ongoing basis. 1. *Open data:* The taxonomy must be available to anyone for unrestricted use. - Users should not have to ask permission to copy and use the taxonomy, + Users should not have to ask permission to copy and use the taxonomy, nor should they be bound by terms of use that interfere with further reuse. [would this be a place to 'highlight' transparency as theme or goal? - From f88f137f8ede083561d9dd932fd0b7ee1c74d42b Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 16:22:55 -0500 Subject: [PATCH 3/8] address a couple of the MTH comments --- doc/method/introduction.md | 18 +++++++++++------- doc/method/method-details.md | 12 +++++++++++- 2 files changed, 22 insertions(+), 8 deletions(-) diff --git a/doc/method/introduction.md b/doc/method/introduction.md index f6c7515..fe3a6dc 100644 --- a/doc/method/introduction.md +++ b/doc/method/introduction.md @@ -75,7 +75,8 @@ evolutionary biology community, rather than as a one-off study. Following are all five requirements: 1. *OTU coverage:* The reference taxonomy should have a taxon for - every OTU that has the potential to occur in more than one study. + every OTU that has the potential to occur in more than one study, + over the intended scope of all cellular organisms. 1. *Phylogenetically informed classification:* Higher taxa should be provided with as much resolution and phylogenetic fidelity as is reasonable. Ranks and nomenclatural structure should not be @@ -101,7 +102,7 @@ No single available taxonomic source meets all five requirements. The NCBI taxonomy has good coverage of OTUs, provides a rich source of phyogenetically informed higher taxa, and is open, but its taxonomic coverage is limited to taxa that have sequence data in GenBank (about -360,455 species having standard binomial names). Traditional all-life +360,000 species having standard binomial names at the time of this writing). Traditional all-life taxonomies such as Catalogue of Life, IRMNG, and GBIF meet the taxonomic coverage requirement, but miss many OTUs from our input trees, and their higher-level taxonomies are often not as @@ -112,8 +113,11 @@ taxonomy with a traditional broad taxonomy that is also open. These requirements cannot be met in an absolute sense; each is a 'best effort' requirement subject to availability of project resources. -Note that the Open Tree Taxonomy is *not* supposed to be 1) a -reference for nomenclature (we can link to other sources for that); 2) -a well-formed or complete taxonomic hypothesis; [MTH: need some clarification -on point 2, as it sounds like a contradiction of requirements 2 and 3 above] or 3) a place to -deposit curated taxonomic information. +Note that the Open Tree Taxonomy is *not* supposed to be a +reference for nomenclature; it links to other sources for nomenclatural and other information. +Nor is it a place to deposit curated taxonomic information. +The taxonomy has not been vetted in detail, as that would be beyond +the capacity and focus of the Open Tree project. +It is known to contain many taxon duplications and technical artifacts. +Tolerating these shortcomings is a necessary tradeoff in +attempting to meet the above requirements. diff --git a/doc/method/method-details.md b/doc/method/method-details.md index 9a92b27..241d52d 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -47,6 +47,14 @@ incorrect tree but to downstream curation errors in OTU matching annotation (information about one copy not propagating to the other) and to loss of unification opportunities in phylogeny synthesis. +As described above, source taxonomies are processed (aligned and +merged) in priority order. For each source taxonomy, ad hoc +adjustments are applied before automatic alignments. For automatic +alignment, alignments closest to the tips of the source taxonomy are +found in a first pass, and all others in a second pass. The two-pass +structure permits first-pass alignments to be used during the second +pass (see Overlap, below). + ### Ad hoc alignment adjustments Automated alignment is preceded by ad hoc 'adjustments' that address @@ -93,7 +101,9 @@ every source node that also has that name-string. The purpose of the alignment phase is to choose a single correct candidate for each source node, or to reject all candidates if none is -correct. +correct. For 97% of source nodes, there is only one candidate, and +selection is fairly simple, but the remaining nodes require special +treatment. Example: There are two nodes named _Aporia lemoulti_ in the GBIF backbone taxonomy; one is a plant and the other is an insect. One of From 99af5b1e6adfaa431175e95867cae1dd221f9c91 Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 16:32:16 -0500 Subject: [PATCH 4/8] new lineage example --- doc/method/method-details.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/doc/method/method-details.md b/doc/method/method-details.md index 241d52d..a36c7d8 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -101,9 +101,9 @@ every source node that also has that name-string. The purpose of the alignment phase is to choose a single correct candidate for each source node, or to reject all candidates if none is -correct. For 97% of source nodes, there is only one candidate, and -selection is fairly simple, but the remaining nodes require special -treatment. +correct. For 97% of source nodes, there are no candidates or only one +candidate, and selection is fairly simple, but the remaining nodes +require special treatment. Example: There are two nodes named _Aporia lemoulti_ in the GBIF backbone taxonomy; one is a plant and the other is an insect. One of @@ -159,15 +159,16 @@ heuristics are as follows: the family-rank ancestor node of n' is the same as the name-string of the family-rank ancestor node of n. - (Example: _Hyphodontia quercina_ irmng:11021089 - aligns with Hyphodontia quercina in Index Fungorum [if:298799], - not Malacodon candidus [if:505193]. [Not a great example because - a later heuristic would have gotten it.] The synonymy is via GBIF.) + (Example: Source node _Plasmodiophora diplantherae_ from Index + Fungorum, in Protozoa, has one workspace candidate derived from + NCBI and another from WoRMS. Because the source node and the NCBI + candidate are both in Phytomyxea, while the WoRMS candidate has no + applicable lineage in common, the NCBI candidate is chosen.) The details are complicated because (a) every pair of nodes have at least _some_ of their lineage in common, and (b) genus names do not provide any information when comparing species nodes with the same - name-string, so we can't always just look at the parent taxon. The exact + name-string, so for species we can't just look at the parent taxon. The exact rule used is the following: Define the 'quasiparent name' of n, q(n), to be the From 31f4e057a3192c5084488a791f619ff1fdfbd90e Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 17:10:04 -0500 Subject: [PATCH 5/8] new overlap example --- doc/method/method-details.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/method/method-details.md b/doc/method/method-details.md index a36c7d8..a3996f4 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -186,10 +186,10 @@ heuristics are as follows: Stated a bit more carefully: if n' has a descendant aligned to a descendant of n. - (Example: need example. From OTT 2.10: if n' = Scyphocoronis, Millotia is preferred - to Scyphocoronis. - seems to be gone from OTT 2.11) - - [MTH: this example is not helpful.] + (Example: Source node _Peranema_ from GBIF has two candidates from NCBI. + One candidate shares descendant _Peranema cryptocercum_ with the source taxon, + while the other shares no descendants with the source taxon. + The source is therefore aligned to the one with the shared descendant.) 1. **Proximity** [opposite of "separation"; not a great name]: Suppose the separation taxonomy includes A and B, From f87c77f913a946b03c22f95e76527b05e93d6ad3 Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 18:38:52 -0500 Subject: [PATCH 6/8] improving the rest of the heuristics examples --- doc/method/method-details.md | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/doc/method/method-details.md b/doc/method/method-details.md index a3996f4..f24ac93 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -162,8 +162,9 @@ heuristics are as follows: (Example: Source node _Plasmodiophora diplantherae_ from Index Fungorum, in Protozoa, has one workspace candidate derived from NCBI and another from WoRMS. Because the source node and the NCBI - candidate are both in Phytomyxea, while the WoRMS candidate has no - applicable lineage in common, the NCBI candidate is chosen.) + candidate both claim to be in a taxon with name 'Phytomyxea', while the + WoRMS candidate has no applicable lineage in common, the NCBI + candidate is chosen.) The details are complicated because (a) every pair of nodes have at least _some_ of their lineage in common, and (b) genus names do not @@ -180,7 +181,11 @@ heuristics are as follows: then prefer n to candidates that lack these properties. [MTH: this section is clear, but it is not clear to the reader what - order nodes in the source are aligned. That seems to make a difference here.] + order nodes in the source are aligned. That seems to make a difference here. + JAR: there is no order dependence, because the + heuristic is comparing names, not checking for nodes alignment. + I think that is implied by the detailed description, but + I've tried to make the example text reinforce this fact.] 1. **Overlap**: Prefer to align n' to n if they are higher level groupings that overlap. Stated a bit more carefully: if n' has a descendant aligned to @@ -192,20 +197,24 @@ heuristics are as follows: The source is therefore aligned to the one with the shared descendant.) 1. **Proximity** [opposite of "separation"; not a great name]: - Suppose the separation taxonomy includes A and B, - with B contained in A. - If node n' is in B, then prefer candidates that are in B to those that are in A but not in B. + Prefer candidates n with the property that + the smallest separation taxon containing the source node n' + is also the smallest separation taxon containing a candidate n. - (Example: for IRMNG _Macbrideola indica_, prefer _Macbrideola coprophila_ - to _Utharomyces epallocaulus_. [get more info]) + (Example: for source node Heterocheilidae in IRMNG (a nematode family) whose smallest + separation ancestor is Metazoa, prefer + the NCBI Taxonomy candidate with smallest separation ancestor + Metazoa (also a nematode family) to the one with smallest separation + ancestor Diptera (a fly family).) 1. **Same name-string**: Prefer candidates whose primary name-string is the same as the primary name-string of n'. - (Example: candidate _Zabelia tyaihyoni_ preferred to candidate _Zabelia mosanensis_ for - n' = GBIF _Zabelia tyaihyoni_.) - - [MTH: is there a synonym in this example? seems obvious as stated.] + (Example: For source node n' = GBIF _Zabelia tyaihyoni_, + candidate _Zabelia tyaihyoni_ from NCBI is preferred to candidate + _Zabelia mosanensis_, also from NCBI. NCBI _Z. mosanensis_ is a + candidate for n' because GBIF declares that Z. mosanensis is a synonym + for GBIF _Z. tyaihyoni_.) If there is a single candidate that is not rejected by any heuristic, it is aligned to that candidate. From 96f5bdc2a7401486376dcd0c1dd8fe1c53c24a58 Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 18:44:03 -0500 Subject: [PATCH 7/8] finish up MTH comment processing --- doc/method/method-intro.md | 11 ++++------- doc/method/results.md | 6 ++++-- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/doc/method/method-intro.md b/doc/method/method-intro.md index fad4f35..9771a27 100644 --- a/doc/method/method-intro.md +++ b/doc/method/method-intro.md @@ -39,13 +39,10 @@ Open Tree reference taxonomy version 2.11. when recorded in a given taxonomy. * primary = the non-synonym name-string of a node, as opposed to one of the synonyms. * image (of a node n') = the workspace node corresponding to n' - * _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is in B - but is not known to be outside of A's non-_incertae-sedis_ children. That is, - if we had more information, it might turn out that B is a - member of one of the other children of A. - [MTH: I found this confusing. Is B the taxon that is flagged as _incertae sedis_? - It seems odd to say "A is _incertae sedis_ in taxon B" if B is the - _incertae sedis_ taxon (and B is in A).] + * _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is a child of B + but is not known to be outside of B's non-_incertae-sedis_ children. That is, + if we had more information, it might turn out that A is a + member of one of the other children of B. ## Assembly overview diff --git a/doc/method/results.md b/doc/method/results.md index faff8e6..21c7bbf 100644 --- a/doc/method/results.md +++ b/doc/method/results.md @@ -32,8 +32,10 @@ There are 8043 name-strings in the taxonomy for which there are multiple nodes. By comparison, there are only 1440 in GBIF. Many of the homonyms are artifacts of the alignment method, especially the rule that says genera that do not share species are presumed disjoint. - -[MTH: I missed this rule, which one was it?] +[MTH: I missed this rule, which one was it? +JAR: big TBD - this kludge has to be described - I didn't even notice +it until recently. Perhaps it can even be +removed so that it doesn't need to be described.] ## Evaluating the taxonomy relative to requirements From 42f3a288e4e201a8209041c880b161cbb968b147 Mon Sep 17 00:00:00 2001 From: Jonathan A Rees Date: Tue, 7 Feb 2017 18:46:53 -0500 Subject: [PATCH 8/8] missing italics --- doc/method/method-details.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/method/method-details.md b/doc/method/method-details.md index f24ac93..6e05cd9 100644 --- a/doc/method/method-details.md +++ b/doc/method/method-details.md @@ -213,7 +213,7 @@ heuristics are as follows: (Example: For source node n' = GBIF _Zabelia tyaihyoni_, candidate _Zabelia tyaihyoni_ from NCBI is preferred to candidate _Zabelia mosanensis_, also from NCBI. NCBI _Z. mosanensis_ is a - candidate for n' because GBIF declares that Z. mosanensis is a synonym + candidate for n' because GBIF declares that _Z. mosanensis_ is a synonym for GBIF _Z. tyaihyoni_.) If there is a single candidate that is not rejected by any heuristic,