OpenTreeOfLife · kcranston · Feb 8, 2017 · Feb 3, 2017 · Feb 6, 2017 · Feb 7, 2017
diff --git a/doc/method/abstract.md b/doc/method/abstract.md
@@ -8,8 +8,8 @@ Taxonomy and nomenclature data are critical for any project that synthesizes
 biodiversity data, as most biodiveristy data sets use taxonomic names to
 identify taxa. Open Tree of Life is one such project, synthesizing sets of
 published phylogenetic trees into comprehensive supertrees. No single published
-taxonomy met the taxonomic and nomenclatural needs of the project. We therefore
-describe here a system for reproducibly combining several source taxonomies into
-a synthetic taxonomy, and discuss the challenges of taxonomic and nomenclatural
+taxonomy met the taxonomic and nomenclatural needs of the project.
+Here we describe a system for reproducibly combining several source taxonomies into
+a synthetic taxonomy, and we discuss the challenges of taxonomic and nomenclatural
 synthesis for downstream biodiversity projects.
 
diff --git a/doc/method/introduction.md b/doc/method/introduction.md
@@ -15,13 +15,12 @@ synthesis of ten different source taxonomies with different strengths.
 The synthesis process is repeatable so that updates to source
 taxonomies can be incorporated easily.
 
-Like other biodiversity projects, Open Tree aggregates and reasons
-over information about taxa.  Information about taxa is typically
+Information about taxa is typically
 expressed in databases and files in terms of taxon names or
 'name-strings' [cite darwin core?].  To combine data sets it is
 necessary to be able to determine name equivalence: whether or not an
 occurrence of a name-string in one data source refers to the same
-thing (taxon) as a given name-string occurrence in another.  Solving
+taxon as a given name-string occurrence in another.  Solving
 this equivalence problem requires detecting equivalence when the
 name-strings are different (synonym detection), as well as
 distinguishing occurrences that only coincidentally have the same
@@ -31,11 +30,11 @@ name-string (homonym detection).
 
 The Open Tree of Life project consists of a set of tools for
 
-1. synthesizing phylogenetic supertrees from a corpus of 
+1. synthesizing phylogenetic supertrees from a corpus of
    phylogenetic tree inputs
    (source trees)
 2. matching groupings in supertrees with higher taxa (such as Mammalia)
-3. supplementing supertrees with taxa obtained only from 
+3. supplementing supertrees with taxa obtained only from
    taxonomy
 
 The outcome is one or more summary trees combining phylogenetic and
@@ -76,12 +75,13 @@ evolutionary biology community, rather than as a one-off study.
 Following are all five requirements:
 
  1. *OTU coverage:* The reference taxonomy should have a taxon for
-    every OTU that has the potential to occur in more than one study.
+    every OTU that has the potential to occur in more than one study,
+    over the intended scope of all cellular organisms.
  1. *Phylogenetically informed classification:* Higher taxa should be
     provided with as much resolution and phylogenetic fidelity as is
-    reasonable.  Ranks and nomenclatural structure should not be 
-    required (since many well-established groups do not have proper 
-    Linnaean names or ranks) and groups at odds with phylogenetic 
+    reasonable.  Ranks and nomenclatural structure should not be
+    required (since many well-established groups do not have proper
+    Linnaean names or ranks) and groups at odds with phylogenetic
     understanding (such as Protozoa) should be avoided.
  1. *Taxonomic coverage:* The taxonomy should cover as many as possible of
     the species
@@ -92,7 +92,7 @@ Following are all five requirements:
     are constantly being added to the literature.
     The taxonomy needs to be updated with new information on an ongoing basis.
  1. *Open data:* The taxonomy must be available to anyone for unrestricted use.
-    Users should not have to ask permission to copy and use the taxonomy, 
+    Users should not have to ask permission to copy and use the taxonomy,
     nor should they be bound by terms of use that interfere with further reuse.
 
 [would this be a place to 'highlight' transparency as theme or goal? -
@@ -102,7 +102,7 @@ No single available taxonomic source meets all five requirements.  The
 NCBI taxonomy has good coverage of OTUs, provides a rich source of
 phyogenetically informed higher taxa, and is open, but its taxonomic
 coverage is limited to taxa that have sequence data in GenBank (about
-360455 species having standard binomial names).  Traditional all-life
+360,000 species having standard binomial names at the time of this writing).  Traditional all-life
 taxonomies such as Catalogue of Life, IRMNG, and GBIF meet the
 taxonomic coverage requirement, but miss many OTUs from our input
 trees, and their higher-level taxonomies are often not as
@@ -113,7 +113,11 @@ taxonomy with a traditional broad taxonomy that is also open.
 These requirements cannot be met in an absolute sense; each is a 'best
 effort' requirement subject to availability of project resources.
 
-Note that the Open Tree Taxonomy is *not* supposed to be 1) a
-reference for nomenclature (we can link to other sources for that); 2)
-a well-formed or complete taxonomic hypothesis; or 3) a place to
-deposit curated taxonomic information.
+Note that the Open Tree Taxonomy is *not* supposed to be a
+reference for nomenclature; it links to other sources for nomenclatural and other information.
+Nor is it a place to deposit curated taxonomic information.
+The taxonomy has not been vetted in detail, as that would be beyond
+the capacity and focus of the Open Tree project.
+It is known to contain many taxon duplications and technical artifacts.
+Tolerating these shortcomings is a necessary tradeoff in
+attempting to meet the above requirements.
diff --git a/doc/method/method-details.md b/doc/method/method-details.md
@@ -47,6 +47,14 @@ incorrect tree but to downstream curation errors in OTU matching
 annotation (information about one copy not propagating to the other)
 and to loss of unification opportunities in phylogeny synthesis.
 
+As described above, source taxonomies are processed (aligned and
+merged) in priority order.  For each source taxonomy, ad hoc
+adjustments are applied before automatic alignments.  For automatic
+alignment, alignments closest to the tips of the source taxonomy are
+found in a first pass, and all others in a second pass.  The two-pass
+structure permits first-pass alignments to be used during the second
+pass (see Overlap, below).
+
 ### Ad hoc alignment adjustments
 
 Automated alignment is preceded by ad hoc 'adjustments' that address
@@ -93,13 +101,15 @@ every source node that also has that name-string.
 
 The purpose of the alignment phase is to choose a single correct
 candidate for each source node, or to reject all candidates if none is
-correct.
+correct.  For 97% of source nodes, there are no candidates or only one
+candidate, and selection is fairly simple, but the remaining nodes
+require special treatment.
 
 Example: There are two nodes named _Aporia lemoulti_ in the GBIF
-backbone taxonomy; one is a plant and the other is an insect.  (One of
+backbone taxonomy; one is a plant and the other is an insect.  One of
 these two is an erroneous duplication, but the automated system has to
 be able to cope with this situation because we don't have the
-resources to correct all source taxonomy errors!)  It is necessary to
+resources to correct all source taxonomy errors.  It is necessary to
 choose the right candidate for the IRMNG node with name _Aporia
 lemoulti_.  Consequences of incorrect placement might include putting
 siblings of IRMNG _Aporia lemoulti_ in the wrong kingdom as well.
@@ -149,15 +159,17 @@ heuristics are as follows:
     the family-rank ancestor node of n' is the same as the name-string of the
     family-rank ancestor node of n.
 
-    (Example: _Hyphodontia quercina_ irmng:11021089
-    aligns with Hyphodontia quercina in Index Fungorum [if:298799],
-    not Malacodon candidus [if:505193].  [Not a great example because
-    a later heuristic would have gotten it.]  The synonymy is via GBIF.)
+    (Example: Source node _Plasmodiophora diplantherae_ from Index
+    Fungorum, in Protozoa, has one workspace candidate derived from
+    NCBI and another from WoRMS.  Because the source node and the NCBI
+    candidate both claim to be in a taxon with name 'Phytomyxea', while the
+    WoRMS candidate has no applicable lineage in common, the NCBI
+    candidate is chosen.)
 
     The details are complicated because (a) every pair of nodes have
     at least _some_ of their lineage in common, and (b) genus names do not
     provide any information when comparing species nodes with the same
-    name-string, so we can't always just look at the parent taxon.  The exact 
+    name-string, so for species we can't just look at the parent taxon.  The exact 
     rule used is the following:
 
     Define the 'quasiparent name' of n, q(n), to be the
@@ -168,26 +180,41 @@ heuristics are as follows:
     an ancestor of n', or q(n') is the name-string of an ancestor of n, 
     then prefer n to candidates that lack these properties.
 
+    [MTH: this section is clear, but it is not clear to the reader in what
+    order nodes in the source are aligned. That seems to make a difference here.
+    JAR: there is no order dependence, because the
+    heuristic is comparing names, not checking for node alignment.
+    I think that is implied by the detailed description, but
+    I've tried to make the example text reinforce this fact.]
+
  1. **Overlap**: Prefer to align n' to n if they are higher level groupings that overlap.
     Stated a bit more carefully: if n' has a descendant aligned to 
     a descendant of n.  
 
-    (Example: need example. From OTT 2.10: if n' = Scyphocoronis, Millotia is preferred 
-    to Scyphocoronis. - seems to be gone from OTT 2.11)
+    (Example: Source node _Peranema_ from GBIF has two candidates from NCBI.
+    One candidate shares descendant _Peranema cryptocercum_ with the source taxon,
+    while the other shares no descendants with the source taxon.
+    The source is therefore aligned to the one with the shared descendant.)
 
  1. **Proximity** [opposite of "separation"; not a great name]:
-    Suppose the separation taxonomy includes A and B, 
-    with B contained in A.
-    If node n' is in B, then prefer candidates that are in B to those that are in A but not in B.
+    Prefer candidates n with the property that
+    the smallest separation taxon containing the source node n'
+    is also the smallest separation taxon containing n.
 
-    (Example: for IRMNG _Macbrideola indica_, prefer _Macbrideola coprophila_
-    to _Utharomyces epallocaulus_.  [get more info])
+    (Example: for source node Heterocheilidae in IRMNG (a nematode family) whose smallest 
+    separation ancestor is Metazoa, prefer
+    the NCBI Taxonomy candidate with smallest separation ancestor
+    Metazoa (also a nematode family) to the one with smallest separation 
+    ancestor Diptera (a fly family).)
 
  1. **Same name-string**: Prefer candidates whose primary name-string
     is the same as the primary name-string of n'.
 
-    (Example: candidate _Zabelia tyaihyoni_ preferred to candidate _Zabelia mosanensis_ for
-    n' = GBIF _Zabelia tyaihyoni_.)
+    (Example: For source node n' = GBIF _Zabelia tyaihyoni_,
+    candidate _Zabelia tyaihyoni_ from NCBI is preferred to candidate 
+    _Zabelia mosanensis_, also from NCBI.  NCBI _Z. mosanensis_ is a 
+    candidate for n' because GBIF declares that _Z. mosanensis_ is a synonym
+    for GBIF _Z. tyaihyoni_.)
 
 If there is a single candidate that is not rejected by any heuristic,
 it is aligned to that candidate.
@@ -320,6 +347,9 @@ up while merging taxonomies.  [tbd: turn the newicks into a multi-part figure]
    [not such a great technical term: 'absorption' - but we need a term. The code currently
    says 'merged' and that would be way too confusing]
 
+   [MTH: wouldn't the previous answer: ((a,b)x,(c,d)y,?e)z also mean
+   that e is a proper child of z, but is just uncertain wrt x and y?]
+
 1. ((a,b)x,(c,d)y)z + ((a,c,e)u,(b,d)v)z = ((a,b)x,(c,d)y,?e)
 
    If the S' topology is incompatible with the S topology,

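The candidate-selection heuristics touched in method-details.md (Overlap, Proximity, Same name-string) can be illustrated with a minimal sketch. This is not the project's implementation: the `Node` class, its fields, and the `choose` function are hypothetical, and the rule for combining heuristics (apply them in the listed order, narrowing to the preferred candidates whenever a heuristic prefers at least one) is an assumption. The Lineage heuristic is left out because its full definition is not visible in this diff.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a taxon record; not the project's real class."""
    name: str                    # primary name-string
    source: str                  # provenance, e.g. "gbif" or "ncbi"
    separation_taxon: str = ""   # smallest separation taxon containing the node
    # For a candidate: names of workspace nodes below it.
    # For a source node: names of workspace nodes its descendants aligned to.
    descendants: FrozenSet[str] = field(default_factory=frozenset)


def overlap(src: Node, cand: Node) -> bool:
    # Some descendant of the source node is aligned to a descendant of the candidate.
    return bool(src.descendants & cand.descendants)


def proximity(src: Node, cand: Node) -> bool:
    # Both nodes have the same smallest separation taxon.
    return src.separation_taxon == cand.separation_taxon


def same_name(src: Node, cand: Node) -> bool:
    # The primary name-strings are identical.
    return src.name == cand.name


# Order follows the order in which the document lists these heuristics.
HEURISTICS = [overlap, proximity, same_name]


def choose(src: Node, candidates: List[Node]) -> Optional[Node]:
    """Return the single surviving candidate, or None if the choice is ambiguous."""
    remaining = list(candidates)
    for heuristic in HEURISTICS:
        if len(remaining) <= 1:
            break
        preferred = [c for c in remaining if heuristic(src, c)]
        if preferred:  # assumption: a heuristic that prefers nothing is skipped
            remaining = preferred
    return remaining[0] if len(remaining) == 1 else None
```

With the _Zabelia_ example from the patch, `choose` passes over Overlap and Proximity (neither separates the two NCBI candidates in this toy encoding) and settles on the candidate whose primary name-string matches the GBIF source node.
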
diff --git a/doc/method/method-intro.md b/doc/method/method-intro.md
@@ -8,8 +8,8 @@ necessary.  However, it is not clear how to meet the the ongoing update
 requirement under this approach.  As the source taxonomies change, we would like
 for the combined taxonomy to contain only information derived from the latest
 versions of the sources, without residual information from previous versions.  Many
-changes to the sources are corrections, and we do not want to hang on to or even
-be influenced by information known to be incorrect.  
+changes to the sources are corrections, and we do not want to rely on
+information known to be incorrect.
 
 Rather than maintain a database of taxonomic information, we instead developed a
 process assembling a taxonomy from two or more taxonomic sources.  With a
@@ -27,7 +27,7 @@ Open Tree reference taxonomy version 2.11.
   * workspace = data structure for creation of the reference
     taxonomy
   * node = a taxon record, either from a source taxonomy or the workspace.
-    records primary name-string, provenance,
+    Records primary name-string, provenance,
     parent node, optional rank, optional annotations
   * parent (node) = the nearest enclosing node within a given node's taxonomy
   * tip = a node that is not the parent of any node
@@ -39,11 +39,10 @@ Open Tree reference taxonomy version 2.11.
     when recorded in a given taxonomy.
   * primary = the non-synonym name-string of a node, as opposed to one of the synonyms.
   * image (of a node n') = the workspace node corresponding to n'
-  * _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is in B
-    but is not known to be outside of A's non-_incertae-sedis_ children.  That is,
-    if we had more information, it might turn out that B is a
-    member of one of the other children of A.
-
+  * _incertae sedis_: taxon A is _incertae sedis_ in taxon B if A is a child of B
+    but is not known to be outside of B's non-_incertae-sedis_ children.  That is,
+    if we had more information, it might turn out that A is a
+    member of one of the other children of B.
 
 ## Assembly overview
 

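The glossary entries in method-intro.md (node, parent, tip, _incertae sedis_) can be read as a small data structure. The sketch below is only an illustration under assumed names: the `Node` class, the `incertae_sedis` flag, and `possible_homes` are hypothetical and do not come from the Open Tree code. It encodes the corrected definition: if A is _incertae sedis_ in B, then A is a child of B that might, given more information, belong inside one of B's other (non-_incertae-sedis_) children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare nodes by identity; parent/child links form cycles
class Node:
    """Hypothetical taxon record following the glossary's description of a node."""
    name: str                              # primary name-string
    provenance: str                        # where the record came from
    parent: Optional["Node"] = None        # nearest enclosing node
    rank: Optional[str] = None             # optional rank
    annotations: Dict[str, str] = field(default_factory=dict)
    incertae_sedis: bool = False           # assumed flag, set on the child
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def is_tip(self) -> bool:
        # A tip is a node that is not the parent of any node.
        return not self.children


def possible_homes(a: Node) -> List[Node]:
    """Siblings that an incertae sedis node might turn out to be a member of."""
    if a.parent is None or not a.incertae_sedis:
        return []
    return [c for c in a.parent.children
            if c is not a and not c.incertae_sedis]


# Mirrors the newick example ((a,b)x,(c,d)y,?e)z from method-details.md.
z = Node("z", "example")
x = z.add_child(Node("x", "example"))
y = z.add_child(Node("y", "example"))
e = z.add_child(Node("e", "example", incertae_sedis=True))
```

Here `possible_homes(e)` returns `x` and `y`: `e` is a child of `z`, but it is not known to lie outside `x` or `y`.
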
diff --git a/doc/method/results.md b/doc/method/results.md
@@ -32,6 +32,10 @@ There are 8043 name-strings in the taxonomy for which there are
 multiple nodes.  By comparison, there are only 1440 in GBIF. Many of
 the homonyms are artifacts of the alignment method, especially the
 rule that says genera that do not share species are presumed disjoint.
+[MTH: I missed this rule; which one was it?
+JAR: big TBD - this kludge has to be described - I didn't even notice
+it until recently.  Perhaps it can even be
+removed so that it doesn't need to be described.]
 
 ## Evaluating the taxonomy relative to requirements