111 groupings in synthetic tree that are not supported by any input source tree #156
Comments
Just to be clear, tree6165 does support that (Aspidocarya + Parabaena) + There were some others that were pointed out in the previous emails that I On Fri, Jan 30, 2015 at 8:08 AM, Mark T. Holder notifications@github.com
|
But that source tree (6165) has Calycocarpum as sister to a group containing Aspidocarya, Parabaena, Tinomiscium, Tinospora but also Orthogynium. In the synthetic tree, Orthogynium attaches well outside of this group (https://tree.opentreeoflife.org/opentree/argus/otol.draft.22@3573300/Orthogynium). Which is why my |
OK. This one looks to me like it is likely a non monophyly thing. I will On Fri, Jan 30, 2015 at 9:17 AM, Mark T. Holder notifications@github.com
|
I think Orthogynium is a monotypic taxon. |
Yeah, I don't mean within that group I mean within the Menispermoideae On Fri, Jan 30, 2015 at 11:04 AM, Mark T. Holder notifications@github.com
|
Or within the Menispermaceae rather On Fri, Jan 30, 2015 at 11:25 AM, Stephen Smith blackrim@gmail.com wrote:
|
I don't understand what you mean by "it is likely a non monophyly thing." Menispermaceae is not a tip label in any of the input trees, so I don't understand how its being non-monophyletic is different from other cases of conflict between different sources of phylogenetic information. Could you or @chinchliff or @josephwb confirm that the 3 numbered points that I list above are a correct characterization of the synthesis procedure? I suppose I should add another statement: if that is not the case (or any of my previous 3 statements are incorrect) then the 111 groupings reported here may just be a wart of the procedure and not a bug. Edit: markdown caused my #4 to show up as 1. Fixed. |
Mark, I want to add something here. For our input (taxonomy + 484 other I have already thoroughly studied some of your identified nodes (or On Fri, Jan 30, 2015 at 7:08 AM, Mark T. Holder notifications@github.com
|
Ruchi, you are correct that we have the full taxonomy, but it is highly unresolved. You can easily expand the case that I gave earlier. Consider:
from 3 inputs:
I think that your code would say that the My code would call it "unsupported". |
My 87 nodes include this case too. I am counting all those nodes that have In your example, (C,D) group will get irrelevant from first two input trees On Fri, Jan 30, 2015 at 1:26 PM, Mark T. Holder notifications@github.com
|
Ah. I see. thanks for clarifying. But I think that our codes would diverge on:
from 4 inputs:
My code would still call the |
Wait...but your initial definition of "unsupported" doesn't approve of Unsupported: When I say that a group/node/edge in the synthetic tree is between the synthetic tree and the set of inputs would not change. So (C,D) is not an "unsupported" group by your definition. Since if we On Fri, Jan 30, 2015 at 2:14 PM, Mark T. Holder notifications@github.com
|
good point. I should have said that the RF distance stays the same or decreases. So the unresolved form of the synthetic tree is at least as good as the resolved form when there is an "unsupported" node. Sorry for the confusion. |
I hadn't been thinking clearly about unresolved inputs when I wrote this issue report. By "unsupported" I mean that if we collapse the edge, the RF distance from the restricted synthetic tree to each of the source trees is unchanged or decreases. My code doesn't calculate the total RF. It just tries to find (for every edge in the synthetic tree) at least 1 input tree that supports the edge. If collapsing the edge causes the RF to any of the input trees to increase, then it calls the edge supported. Sorry again for misstating this earlier. |
I think I understand it now. It's different from my count. I declare My analysis should have the subset of Mark's nodes. I also think that these On Fri, Jan 30, 2015 at 2:54 PM, Mark T. Holder notifications@github.com
|
Background
Issue #78 started because @ruchiherself's code identified cases in which a grouping in the synthetic tree conflicted with every tree in the input set. The definition of conflict is discussed in the "Conflict between trees and taxonomies" section of the supplemental material.
I started pursuing this using code that applies a slightly different criterion for flagging groups. I think the flagged groups are indicative of bugs in treemachine, of our failure to capture the inputs precisely enough (such that the inputs actually differ from what was fed into treemachine), or of bugs in the checking tools.
This issue separates discussion of the problematic cases detected by the definition that I am using from the cases that Ruchi's code flags.
"unsupported"
UPDATE: I've revised this because Ruchi pointed out that I was not being consistent. The original text is now at the bottom of this post.
When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the Robinson-Foulds symmetric distance (RF) between the synthetic tree and each of the input trees will not increase.
For each input tree t in the set of input trees T (which includes the taxonomic tree), let r(S, t) be the RF distance between t and the synthetic tree S restricted to the leaf set of t.
If Y is the synthetic tree with some edge y collapsed, then we say that y is supported if
r(Y, t) > r(S, t) for at least one t in T. If no such t exists, y is unsupported.
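The definition can be sketched in a few lines of Python. This is only an illustration and not the NCL-based tool: trees are written as nested tuples of tip labels, each edge is identified by the leaf set below it, and the function names (`clusters`, `restrict`, `rf`, `is_supported`) as well as the two input trees are hypothetical, chosen for the example.

```python
def clusters(tree):
    """Return (leaf set of the tree, set of leaf sets of its internal nodes)."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a tip label
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    return walk(tree), found


def restrict(cluster_set, keep):
    """Prune every cluster to the leaves in keep and drop the trivial ones."""
    pruned = {c & keep for c in cluster_set}
    return {c for c in pruned if 1 < len(c) < len(keep)}


def rf(synth_clusters, input_tree):
    """r(S, t): RF distance between t and S restricted to the leaves of t."""
    keep, input_clusters = clusters(input_tree)
    return len(restrict(synth_clusters, keep) ^ restrict(input_clusters, keep))


def is_supported(synth_tree, edge, input_trees):
    """True if collapsing the edge raises r for at least one input tree."""
    _, s = clusters(synth_tree)
    y = s - {frozenset(edge)}  # collapsing an edge removes its cluster
    return any(rf(y, t) > rf(s, t) for t in input_trees)


synthetic = ("A", ("B", ("C", "D")))
inputs = [("A", ("B", "C")), ("A", ("B", "D"))]  # hypothetical input trees

print(is_supported(synthetic, {"C", "D"}, inputs))       # False
print(is_supported(synthetic, {"B", "C", "D"}, inputs))  # True
```

With these toy inputs, neither tree samples both C and D, so collapsing (C, D) changes no restricted distance, while collapsing (B, C, D) loses a grouping that each input contains.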
Software
I have written 2 tools to help find these cases:
checktaxonnodes checks all named nodes in the synthetic tree against their definition in OTT.
findunsupportededges looks for internal nodes in the synthetic tree that:
These are in the examples subtree of NCL. I forked NCL to the Open Tree group to make it easier for any of us to modify it.
I've posted the contents of the standard output stream and the standard error stream.
findunsupportededges found 111 groupings that are unsupported.
checktaxonnodes found 22 problems - those are reported on issue #154.
Differences from what Ruchi's code is calculating
Under the Wilkinson terminology (if I'm understanding it correctly) if we had the synthetic tree of:
from two inputs:
then I think the clade (C, D) would be considered irrelevant on both trees. Ruchi's code is reporting conflicting cases, so this would not be reported.
Under the "unsupported" definition that I am using, this grouping would be considered unsupported because the tree with the polytomy:
(A,(B,C,D))
fits the inputs just as well. Intuitively there is no information in the inputs indicating that C is closer to D than it is to B, so it seems like we should be returning the polytomy.
This difference in evaluation explains why my software classifies this group to be unsupported, while Ruchi's code considers pg_2644_6164.tre to support it. The source tree does indeed have a grouping of (Aspidocarya + Parabaena) + (Tinomiscium + Tinospora). But if we back up one node in the synthetic tree, we see that the sister group is Calycocarpum. Calycocarpum is not sampled in pg_2644_6164.tre. So, according to that source tree, there is no reason that you could not have any resolution of the 3-way polytomy: (Calycocarpum, (Aspidocarya, Parabaena), (Tinomiscium, Tinospora))
I think that Ruchi's list will be a subset of my list because of cases like this one. And this does not imply a bug in either - just different classification schemes.
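The Calycocarpum case can be reproduced on a toy tree. The sketch below is an assumption-laden illustration, not the findunsupportededges implementation: the topologies are drastically simplified, the "Outgroup" tip and the second input tree are hypothetical, and the helper names are invented for the example. It scans every internal edge of the synthetic tree and reports those that no input supports.

```python
def walk(node, found):
    """Append the leaf set of every internal node to found; return the node's leaves."""
    if isinstance(node, str):  # a tip label
        return frozenset([node])
    leaves = frozenset().union(*(walk(child, found) for child in node))
    found.append(leaves)
    return leaves


def restricted(cluster_list, keep):
    """Nontrivial clusters that remain after pruning to the leaves in keep."""
    pruned = {c & keep for c in cluster_list}
    return {c for c in pruned if 1 < len(c) < len(keep)}


def unsupported_edges(synth, input_trees):
    """Edges (as leaf sets) whose collapse raises the RF distance to no input."""
    synth_clusters = []
    all_leaves = walk(synth, synth_clusters)
    result = []
    for edge in synth_clusters:
        if edge == all_leaves:  # the root is not an edge
            continue
        collapsed = [c for c in synth_clusters if c != edge]
        supported = False
        for t in input_trees:
            t_clusters = []
            keep = walk(t, t_clusters)
            target = restricted(t_clusters, keep)
            before = len(restricted(synth_clusters, keep) ^ target)
            after = len(restricted(collapsed, keep) ^ target)
            if after > before:
                supported = True
                break
        if not supported:
            result.append(edge)
    return result


four = (("Aspidocarya", "Parabaena"), ("Tinomiscium", "Tinospora"))
synthetic = ("Outgroup", ("Calycocarpum", four))
source = ("Outgroup", four)  # simplified stand-in: Calycocarpum not sampled
placing = ("Outgroup", ("Calycocarpum", "Aspidocarya"))  # hypothetical input

for edge in unsupported_edges(synthetic, [source, placing]):
    print(sorted(edge))  # ['Aspidocarya', 'Parabaena', 'Tinomiscium', 'Tinospora']
```

The four-genus grouping is flagged even though the first input contains it: once the synthetic tree is pruned to that input's leaves, the parent node (which adds only Calycocarpum) yields the same leaf set, so collapsing the edge leaves the restricted tree unchanged.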
Why I think this is a problem
All supertree methods have some quirks, so the presence of a few groupings that are not intuitive is not a problem per se. But I think these groups indicate that there is a bug in synthesis.
It could be that I am just misunderstanding the TAG procedure. If that is the case, I would appreciate someone correcting me. I thought that a valid description of the synthesis procedure would be:
1. Add inputs to the TAG one at a time.
2. For each node in an input tree t_i we create a set of edges to a LICA node. These nodes may include other taxa (because of other input trees). Crucially:
A. This is the only operation that adds edges to the graph.
B. The parent node of the edge will always be the MRCA of a larger set of leaves than the child node, even when restricted to the leaf set of t_i.
C. Thus, t_i will support any edge that is created by its introduction into the TAG.
D. Thus, every edge in the TAG will be supported by at least one input.
3. The synthesis operation only decides which edges to "trace" to make a tree. It does not create new edges.
If all of that is correct, then every edge/grouping in the synthetic tree should be supported by at least one input. So my checktaxonnode and findunsupportednodes programs should also report no problems.
Updated: typo in the first word of the description fixed. Doh!
Original incorrect definition of unsupported
just for the record. here is the text that was originally above...
When I say that a group/node/edge in the synthetic tree is "unsupported" in this thread, I mean: If we were to collapse this group into its parent, then the total Robinson-Foulds symmetric distance (RF) between the synthetic tree and the set of inputs would not change.
We can calculate the total RF distance for the synthetic tree S as follows:
For each input tree t, in the set of input trees T (which includes the taxonomic tree):
Then the total RF distance R(S,T) is simply the sum of r(S,t)