Skip to content
This repository has been archived by the owner on Sep 22, 2019. It is now read-only.

synthesis issue: only using mrca taxa from an edge ("relationship" in neo4j lingo) rather than node #157

Closed
mtholder opened this issue Jan 31, 2015 · 24 comments

Comments

@mtholder
Copy link
Member

I just pushed https://github.com/OpenTreeOfLife/treemachine/tree/issue-157 with some very simple examples to help debug #156

But I'm not getting the tree that one would hope for when I run:

$ run-simpleprob-example-diff-order-synth.sh

and look a the result in simpleprob-diff-order-out/synth.tre

The inputs have not conflict and there is no evidence of non-monophyly of the taxa. So I would hope that the synthesis would return the tree in simpleprob/expected.tre

There is a (very) good chance that I'm just running treemachine incorrectly. If @blackrim @josephwb or @chinchliff could take a look at run-simpleprob-example-diff-order-synth.sh I would really appreciate it.

I pulled and built jade and ot-base before pulling and building treemachine.

@mtholder
Copy link
Member Author

or I could have messed up my inputs (in simpleprob dir). But I don't see the error...

@mtholder
Copy link
Member Author

To explain this test case...

The only thing that the taxonomy says is that there is a genus E. The taxonomic tree is (A,B,C,D,(E1,E2)E)

The first tree is (((A,B),C),D) this is input tree1.tre and it is the tree that I (tried to) give highest priority to in run-simpleprob-example-diff-order-synth.sh. Clearly this tree says nothing about E

The second tree (which is in tree2.tre) is ((A,E1),B). The only information here is that E is closer to A than it either is to B.

Taken together the inputs imply the supertree ((((A,E1),B),C),D) With the expansion based on taxonomy, one would hope that the synthetic tree would be ((((A,(E1,E2)),B),C),D). This tree is in expected.tre.

What is returned is ((E2,E1)E,D,(C,(B,A))) this is synth.tre in the output of the script.

Hopefully, I am just invoking treemachine incorrectly.

@mtholder
Copy link
Member Author

Outputs of each treemachine step are stored in simpleprob-diff-order for the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the odd tree is returned because the synthesis starts at the root node. Based on that it decides that the 3 branches emanating from the root should be:

  1. D this makes sense

  2. the mrca node 11 which includes both STREECHILDOF from both tree1 and tree2. This node is the ancestor of A, B, C, and E. E is included because there appears to be a bumping of E1 in tree2 up to E. I base that on:

    attempting to remap tips to deepest exemplified taxa
    E1 was remapped to 'E' (id=7)

    from simpleprob-diff-order/addnewick-tree2.log.

  3. the node representing E. This is odd to me. This is the acceptance of relationship 17 during synthesis:

    testing rel 17 for conflicts
    E has 2 terminal descendants
    ++rel 17 passed, it will be added

    in the synthesize.log. It would seem to me that the acceptance of rel 33 (leading from node 11) would prevent the attachement of E here. because there is taxonomic overlap with the MRCA set in node 11 and E. Perhaps this is a result of the fact that the descendant E is added to the node by the second tree (not the tree that created the node). I suspect that it is because the synthesis traversal is breadth first and somehow not paying attention to the fact that one of the nodes that was just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one would like to see in the final tree) is rejected as a dauther of node 10, it appears that node 13 is rejected because synthesis has just moments earlier added A as a child of node 10:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
        conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10 edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1 could be satisfied by coming in from node 13 or coming in directly. So tree2 and tree1 say compatible things about the descendants of node 10. One would hope that while recognizing that tree1 should win when it conflicts with tree2, the synthesis would recognize that there is not conflict here -- at least in cases like this one when the high ranking edge is trivial.

To summarize:

  1. I don't see why we get "++rel 17 passed, it will be added"

  2. I don't understand how we can select node 11 to be included, but then select a set of edges for its children that omit some of the leaves listed in its MRCA. I suspect that this is what is causing issue 111 groupings in synthetic tree that are not supported by any input source tree #156. It is a violation of what I listed as principle Incomplete installation instructions #4:

    "Incomplete installation instructions #4. when an edge is chosen in the synthesis all of its descendant tips will remain below the edge - there won't be cherry picked into other groups. "

    I thought was true of synthesis - but it apparently isn't.

  3. I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a silly error in my inputs.

update "added A as a child of node 11" correct to "added A as a child of node 10" . thanks @kcranston

@blackrim
Copy link
Member

I think it would be good if Cody also piped in because he knows a good
amount about the synth algorithm. Nevertheless, I don't disagree that your
example presents an interesting problem for synth. Basically coming out of
the root there are these potential nodes for synth

11 - from tree1
13 - from tree2
and then the individual taxonomy ones
as you mention, it goes breadth first and so E is added here
also, it checks the relationship taxa not the node (that is why E is not
seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am going
to mess around a little on this branch and see if adding the node instead
of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton of
time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder notifications@github.com
wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for
the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the
odd tree is returned because the synthesis starts at the root node. Based
on that it decides that the 3 branches emanating from the root should be:

D this makes sense
2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and
tree2. This node is the ancestor of A, B, C, and E. E is included
because there appears to be a bumping of E1 in tree2 up to E. I base
that on:

attempting to remap tips to deepest exemplified taxa
E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log.
3.

the node representing E. This is odd to me. This is the acceptance of
relationship 17 during synthesis:

testing rel 17 for conflicts
E has 2 terminal descendants
++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel
33 (leading from node 11) would prevent the attachement of E here.
because there is taxonomic overlap with the MRCA set in node 11 and E.
Perhaps this is a result of the fact that the descendant E is added to
the node by the second tree (not the tree that created the node). I suspect
that it is because the synthesis traversal is breadth first and somehow not
paying attention to the fact that one of the nodes that was just accepted
conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one
would like to see in the final tree) is rejected as a dauther of node 10,
it appears that node 13 is rejected because synthesis has just moments
earlier added A as a child of node 11:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10
edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf
A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1
could be satisfied by coming in from node 13 or coming in directly. So
tree2 and tree1 say compatible things about the descendants of node 10. One
would hope that while recognizing that tree1 should win when it conflicts
with tree2, the synthesis would recognize that there is not conflict here
-- at least in cases like this one when the high ranking edge is trivial.

To summarize:

I don't see why we get "++rel 17 passed, it will be added"
2.

I don't understand how we can select node 11 to be included, but then
select a set of edges for its children that omit some of the leaves listed
in its MRCA. I suspect that this is what is causing issue #156
#156. It is a
violation of what I listed as principle #4
#4:

"#4 #4. when an
edge is chosen in the synthesis all of its descendant tips will remain
below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't.
3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a
silly error in my inputs.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@blackrim
Copy link
Member

Confirmed that the reason E is still added is because of the relationsihp
mrca being checked (from tree 1 ) instead of the node. That isn't
necessarily an error and we would need to bring Cody in on this (he wrote
that code : RankResolutionMethod and the other resolution methods). when
you relax that however, it doesn't give the tree you want anyway because it
gets to node 10 and goes with tree 1 which has just a sister to b instead
of going to node 13 which is a sister to e. you want going to b then node
13.

I am going to draw this out more but I see your point. It is however
following the paths, it is just that an improved method would perhaps
follow your route. this would simply be another resolutionmethod and could
perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith blackrim@gmail.com wrote:

I think it would be good if Cody also piped in because he knows a good
amount about the synth algorithm. Nevertheless, I don't disagree that your
example presents an interesting problem for synth. Basically coming out of
the root there are these potential nodes for synth

11 - from tree1
13 - from tree2
and then the individual taxonomy ones
as you mention, it goes breadth first and so E is added here
also, it checks the relationship taxa not the node (that is why E is not
seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am
going to mess around a little on this branch and see if adding the node
instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton
of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder <notifications@github.com

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order for
the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the
odd tree is returned because the synthesis starts at the root node. Based
on that it decides that the 3 branches emanating from the root should be:

D this makes sense
2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and
tree2. This node is the ancestor of A, B, C, and E. E is included
because there appears to be a bumping of E1 in tree2 up to E. I base
that on:

attempting to remap tips to deepest exemplified taxa
E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log.
3.

the node representing E. This is odd to me. This is the acceptance of
relationship 17 during synthesis:

testing rel 17 for conflicts
E has 2 terminal descendants
++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel
33 (leading from node 11) would prevent the attachement of E here.
because there is taxonomic overlap with the MRCA set in node 11 and E.
Perhaps this is a result of the fact that the descendant E is added
to the node by the second tree (not the tree that created the node). I
suspect that it is because the synthesis traversal is breadth first and
somehow not paying attention to the fact that one of the nodes that was
just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one
would like to see in the final tree) is rejected as a dauther of node 10,
it appears that node 13 is rejected because synthesis has just moments
earlier added A as a child of node 11:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10
edge.

One one level this make sense. node 13 and 10 have a leaf in common (leaf
A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial. tree1
could be satisfied by coming in from node 13 or coming in directly. So
tree2 and tree1 say compatible things about the descendants of node 10. One
would hope that while recognizing that tree1 should win when it conflicts
with tree2, the synthesis would recognize that there is not conflict here
-- at least in cases like this one when the high ranking edge is trivial.

To summarize:

I don't see why we get "++rel 17 passed, it will be added"
2.

I don't understand how we can select node 11 to be included, but then
select a set of edges for its children that omit some of the leaves listed
in its MRCA. I suspect that this is what is causing issue #156
#156. It is a
violation of what I listed as principle #4
#4:

"#4 #4. when an
edge is chosen in the synthesis all of its descendant tips will remain
below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't.
3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a
silly error in my inputs.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@chinchliff
Copy link
Member

Hi Mark. I agree, those sound like some issues. The current synthesis
method is very simplistic and clearly has some room for improvement. Some
of the problems you mentioned may have solutions within the context of the
current method, but it might make more sense to try and think about
designing a better way to do synthesis overall.

The code is designed to be modular so we can easily swap out a different
method, basically for this exact reason. I am sure there are better
algorithms than the current one. This would be something good to talk
about. We played with a method a while ago, for instance, that attempted to
calculate best paths (instead of just making simple decisions at nodes) but
ran into scalability problems and dropped the ball on that one.

I can look more into this example but it sounds like he tween you and
Stephen this example seems pretty well worked out?

On Saturday, January 31, 2015, Stephen Smith notifications@github.com
wrote:

Confirmed that the reason E is still added is because of the relationsihp
mrca being checked (from tree 1 ) instead of the node. That isn't
necessarily an error and we would need to bring Cody in on this (he wrote
that code : RankResolutionMethod and the other resolution methods). when
you relax that however, it doesn't give the tree you want anyway because
it
gets to node 10 and goes with tree 1 which has just a sister to b instead
of going to node 13 which is a sister to e. you want going to b then node
13.

I am going to draw this out more but I see your point. It is however
following the paths, it is just that an improved method would perhaps
follow your route. this would simply be another resolutionmethod and could
perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith <blackrim@gmail.com
javascript:_e(%7B%7D,'cvml','blackrim@gmail.com');> wrote:

I think it would be good if Cody also piped in because he knows a good
amount about the synth algorithm. Nevertheless, I don't disagree that
your
example presents an interesting problem for synth. Basically coming out
of
the root there are these potential nodes for synth

11 - from tree1
13 - from tree2
and then the individual taxonomy ones
as you mention, it goes breadth first and so E is added here
also, it checks the relationship taxa not the node (that is why E is not
seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am
going to mess around a little on this branch and see if adding the node
instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a ton
of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder <
notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order
for
the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like the
odd tree is returned because the synthesis starts at the root node.
Based
on that it decides that the 3 branches emanating from the root should
be:

D this makes sense
2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and
tree2. This node is the ancestor of A, B, C, and E. E is included
because there appears to be a bumping of E1 in tree2 up to E. I base
that on:

attempting to remap tips to deepest exemplified taxa
E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log.
3.

the node representing E. This is odd to me. This is the acceptance of
relationship 17 during synthesis:

testing rel 17 for conflicts
E has 2 terminal descendants
++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel
33 (leading from node 11) would prevent the attachement of E here.
because there is taxonomic overlap with the MRCA set in node 11 and E.
Perhaps this is a result of the fact that the descendant E is added
to the node by the second tree (not the tree that created the node). I
suspect that it is because the synthesis traversal is breadth first and
somehow not paying attention to the fact that one of the nodes that was
just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which one
would like to see in the final tree) is rejected as a dauther of node
10,
it appears that node 13 is rejected because synthesis has just moments
earlier added A as a child of node 11:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node 10
edge.

One one level this make sense. node 13 and 10 have a leaf in common
(leaf
A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial.
tree1
could be satisfied by coming in from node 13 or coming in directly. So
tree2 and tree1 say compatible things about the descendants of node 10.
One
would hope that while recognizing that tree1 should win when it
conflicts
with tree2, the synthesis would recognize that there is not conflict
here
-- at least in cases like this one when the high ranking edge is
trivial.

To summarize:

I don't see why we get "++rel 17 passed, it will be added"
2.

I don't understand how we can select node 11 to be included, but then
select a set of edges for its children that omit some of the leaves
listed
in its MRCA. I suspect that this is what is causing issue #156
#156. It is a
violation of what I listed as principle #4
#4:

"#4 #4. when an
edge is chosen in the synthesis all of its descendant tips will remain
below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't.
3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have a
silly error in my inputs.


Reply to this email directly or view it on GitHub
<
https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365>

.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@blackrim
Copy link
Member

Yeah, I agree. I think this sounds like a new conflictresolution (the
folder in the code) method that is simply less simplistic. We will need to
work out what the rules are though (things like more taxa subtending
doesn't work because of taxonomy nodes, perhaps more resolution subtending
along with rank). Right now the rank only is just very very simple.

On Sat, Jan 31, 2015 at 1:41 PM, Cody Hinchliff notifications@github.com
wrote:

Hi Mark. I agree, those sound like some issues. The current synthesis
method is very simplistic and clearly has some room for improvement. Some
of the problems you mentioned may have solutions within the context of the
current method, but it might make more sense to try and think about
designing a better way to do synthesis overall.

The code is designed to be modular so we can easily swap out a different
method, basically for this exact reason. I am sure there are better
algorithms than the current one. This would be something good to talk
about. We played with a method a while ago, for instance, that attempted
to
calculate best paths (instead of just making simple decisions at nodes)
but
ran into scalability problems and dropped the ball on that one.

I can look more into this example but it sounds like he tween you and
Stephen this example seems pretty well worked out?

On Saturday, January 31, 2015, Stephen Smith notifications@github.com
wrote:

Confirmed that the reason E is still added is because of the
relationsihp
mrca being checked (from tree 1 ) instead of the node. That isn't
necessarily an error and we would need to bring Cody in on this (he
wrote
that code : RankResolutionMethod and the other resolution methods). when
you relax that however, it doesn't give the tree you want anyway because
it
gets to node 10 and goes with tree 1 which has just a sister to b
instead
of going to node 13 which is a sister to e. you want going to b then
node
13.

I am going to draw this out more but I see your point. It is however
following the paths, it is just that an improved method would perhaps
follow your route. this would simply be another resolutionmethod and
could
perhaps be mocked up if we could get the rules down.

On Sat, Jan 31, 2015 at 12:24 PM, Stephen Smith <blackrim@gmail.com
javascript:_e(%7B%7D,'cvml','blackrim@gmail.com');> wrote:

I think it would be good if Cody also piped in because he knows a good
amount about the synth algorithm. Nevertheless, I don't disagree that
your
example presents an interesting problem for synth. Basically coming
out
of
the root there are these potential nodes for synth

11 - from tree1
13 - from tree2
and then the individual taxonomy ones
as you mention, it goes breadth first and so E is added here
also, it checks the relationship taxa not the node (that is why E is
not
seen as a conflict when it adds rel 15)

I am pretty sure that is what is resulting in the tree as it is. I am
going to mess around a little on this branch and see if adding the
node
instead of relationships mrcas does in fact solve that.

apologies though as it has been 1) a long time and 2) i don't have a
ton
of time right now

more in a software email

On Sat, Jan 31, 2015 at 11:47 AM, Mark T. Holder <
notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');

wrote:

Outputs of each treemachine step are stored in simpleprob-diff-order
for
the run-simpleprob-example-diff-order-synth.sh.

From looking at simpleprob-diff-order/synthesize.log it looks like
the
odd tree is returned because the synthesis starts at the root node.
Based
on that it decides that the 3 branches emanating from the root should
be:

D this makes sense
2.

the mrca node 11 which includes both STREECHILDOF from both tree1 and
tree2. This node is the ancestor of A, B, C, and E. E is included
because there appears to be a bumping of E1 in tree2 up to E. I base
that on:

attempting to remap tips to deepest exemplified taxa
E1 was remapped to 'E' (id=7)

from simpleprob-diff-order/addnewick-tree2.log.
3.

the node representing E. This is odd to me. This is the acceptance of
relationship 17 during synthesis:

testing rel 17 for conflicts
E has 2 terminal descendants
++rel 17 passed, it will be added

in the synthesize.log. It would seem to me that the acceptance of rel
33 (leading from node 11) would prevent the attachement of E here.
because there is taxonomic overlap with the MRCA set in node 11 and
E.
Perhaps this is a result of the fact that the descendant E is added
to the node by the second tree (not the tree that created the node).
I
suspect that it is because the synthesis traversal is breadth first
and
somehow not paying attention to the fact that one of the nodes that
was
just accepted conflicts with the placement of E at the same level.

This gives us: (D,(A,B,C),E) at the root - which is unsatisfying.

Another oddity is that when node 13 (the mrca of A, B, and E which
one
would like to see in the final tree) is rejected as a dauther of node
10,
it appears that node 13 is rejected because synthesis has just
moments
earlier added A as a child of node 11:

testing rel 45 for conflicts
Node[13] has 2 terminal descendants
conflict found! offending rel=23

rel 45 is the 13->11 edge and rel 23 is the node 3 (leaf A) -> node
10
edge.

One one level this make sense. node 13 and 10 have a leaf in common
(leaf
A). and rel 23 is coming from the tree with higher rank.

On the other hand, the connection of A to this subtree is trivial.
tree1
could be satisfied by coming in from node 13 or coming in directly.
So
tree2 and tree1 say compatible things about the descendants of node
10.
One
would hope that while recognizing that tree1 should win when it
conflicts
with tree2, the synthesis would recognize that there is not conflict
here
-- at least in cases like this one when the high ranking edge is
trivial.

To summarize:

I don't see why we get "++rel 17 passed, it will be added"
2.

I don't understand how we can select node 11 to be included, but then
select a set of edges for its children that omit some of the leaves
listed
in its MRCA. I suspect that this is what is causing issue #156
#156. It is a
violation of what I listed as principle #4
#4:

"#4 #4. when
an
edge is chosen in the synthesis all of its descendant tips will
remain
below the edge - there won't be cherry picked into other groups. "

I thought was true of synthesis - but it apparently isn't.
3.

I don't see why rel 23 should lead to the rejection of rel 45.

Many apologies if I'm just invoking treemachine incorrectly, or have
a
silly error in my inputs.


Reply to this email directly or view it on GitHub
<

https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72325365>

.


Reply to this email directly or view it on GitHub
<
https://github.com/OpenTreeOfLife/treemachine/issues/157#issuecomment-72327726>

.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@kcranston
Copy link
Member

So, it seems that synthesis is working differently that at least some of us expect. I would have also expected ((((A,(E1,E2)),B),C),D) as the synthetic tree. It is probably ok that it works differently, but we need to make sure 1. we can clearly articulate what is happening and 2. that we aren't making statement in the paper or in the properties pane of the web application (i.e. tree X 'supports' this edge) that conflict with what is actually happening.

I would like to understand why we aren't getting the expected tree. Can we get a few diagrams of what is going on?

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

Here is the diagram of what is going on (attached). It is basically
traversing the blue relationships. It understands that the green are not in
conflict, that isn't a problem. It just prefers the blue. This isn't an
error in that sense. It is traversing the preferred tree in order of
ranking. It could just be doing some of this smarter (as we have mentioned
a number of times, improving synth methods is a goal). We have discussed
solutions back when we were doing the developments, but as mentioned, it
got sidetracked because of all the other issues that were coming to a head.

My lab has plans on discussing this next week and would love to have ideas
if folks have any (basically rules). As Cody mentioned, implementing new
methods of synth is quite easy (we have a few already in there) and so we
could improve quickly if folks have ideas.

Take care

On Sat, Jan 31, 2015 at 4:07 PM, Karen Cranston notifications@github.com
wrote:

So, it seems that synthesis is working differently that at least some of
us expect. I would have also expected ((((A,(E1,E2)),B),C),D) as the
synthetic tree. It is probably ok that it works differently, but we need to
make sure 1. we can clearly articulate what is happening and 2. that we
aren't making statement in the paper or in the properties pane of the web
application (i.e. tree X 'supports' this edge) that conflict with what is
actually happening.

I would like to understand why we aren't getting the expected tree. Can we
get a few diagrams of what is going on?


Reply to this email directly or view it on GitHub
#157 (comment)
.

@kcranston
Copy link
Member

I don't see the attachment?

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

Hm, here it is again.

On Sat, Jan 31, 2015 at 9:49 PM, Karen Cranston notifications@github.com
wrote:

I don't see the attachment?


Reply to this email directly or view it on GitHub
#157 (comment)
.

@kcranston
Copy link
Member

I don't think it works to reply to the email notification with an attached file. You can drag and drop an image via GitHub, though.

@kcranston
Copy link
Member

Pasting in the file from @blackrim
example_tm

If ((E2,E1)E,D,(C,(B,A))) is the tree that results when the order of preference is blue (tree1) > green (tree2) > black (taxonomy), then I think we need to re-think how we are explaining the method in the paper. Would the edge from E to the root show in the browser as supported by taxonomy (which would be strange, given that we have tree2 with E) or would it show as supported by tree2 (which would be wrong, as tree2 doesn't have that relationship)?

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

If would show as taxonomy. Not sure why the that would be strange though.
On Jan 31, 2015 10:47 PM, "Karen Cranston" notifications@github.com wrote:

Pasting in the file from @blackrim https://github.com/blackrim
[image: example_tm]
https://cloud.githubusercontent.com/assets/312034/5990892/7e9f10e6-a997-11e4-83c4-f10aa639cc3a.png

If ((E2,E1)E,D,(C,(B,A))) is the tree that results when the order of
preference is blue (tree1) > green (tree2) > black (taxonomy), then I think
we need to re-think how we are explaining the method in the paper. Would
the edge from E to the root show in the browser as supported by taxonomy
(which would be strange, given that we have tree2 with E) or would it show
as supported by tree2 (which would be wrong, as tree2 doesn't have that
relationship)?


Reply to this email directly or view it on GitHub
#157 (comment)
.

@mtholder mtholder changed the title help need running a simple example... synthesis issue: only using mrca taxa from an edge ("relationship" in neo4j lingo) rather than node Feb 1, 2015
@mtholder
Copy link
Member Author

mtholder commented Feb 1, 2015

housekeeping notes: I changed the name of the issue to be more informative. I also just pushed a run-reltaxa-example.sh to the https://github.com/OpenTreeOfLife/treemachine/tree/issue-157 branch. It is the example from this thread pruning off E1, E2, and D from the examples (in tree2, E is used instead of E1). I think that we should still discuss the run-simpleprob-example-diff-order-synth.sh in this thread. but the new example may be useful if you're looking for the simplest case of this phenomenon.

@mtholder
Copy link
Member Author

mtholder commented Feb 1, 2015

@blackrim can you tell me what tool you use to make that graph? - it is very helpful (fortunately I could steal my daughter's colored markers to make one of my own, but it was tedious.) Thanks also for pointing me to the RankResoloutionMethod. It would have taken me a long time to find that.

The TAG paper says: "At each node, the procedure examines the subtending nodes, and determines if any of them conflict. For synthesis, downstream conflict is determined by comparing the LICAs for each child. If the LICAs from nodes subtending the current node overlap, then these descendant subgraphs define incompatible subtrees, and are said to be in conflict..." (and then it goes on to point to some other notions of conflict that you discuss in the paper which are based on neighbor count)

So I found the statement that "it checks the relationship taxa not the node" above quite surprising. I'm not sure that I know what "relationship taxa" means. I'm assuming that it means the "the taxa in the descendant set of the child node that were present in the edge."

If that is correct, then I think that this is a serious issue because it seems to undermine the justification for believing that we can resolve downstream conflicts by looking at edges entering a node. I thought that the advantage of the TAG was to highlight were conflict was and let you decide what sorts of conflict could be dealt with later in the algorithm.

I'd like to be able to think that the STREECHILDOF edge from node11 to the root implies that, if we select node11 to be part of the synthetic tree, then all of it descendants will be found under node 11. If we use the only the "relationship taxa", then we allow a lower ranking source (the taxonomy in this case) to "win" because its preference accepted deeper in the tree.

This makes it inaccurate to characterize our synthesis as: detecting conflict and resolving those conflicts by preferring higher ranked sources. And it becomes really hard to explain to any biologist how the synthesis works - and how to make the tree better.

I think that I have a sense of some of the downsides to using the node MRCA (as I would prefer to do) instead of the "relationship taxa". I'll try to come up with a simple example of those so that we can discuss them.

@mtholder
Copy link
Member Author

mtholder commented Feb 1, 2015

So, I commented out the filtering down to just the list of descendants from the relationship. I think this works at one stage, but that E is not added to the tree because of #158

E is subsequently grafted back on at the root of the tree. I haven't looked for the code that does that. The output from the synthesis is:

Results of conflict resolution:
(result reporting feature not yet implemented)
have to add 1
taxaleft: 1 working with E

I think that adding a bit of logic to my pseudocode in #158 may be able get around this. Basically, we need to make sure that we attach all of the descendants of a node somewhere below that node.

@mtholder
Copy link
Member Author

mtholder commented Feb 1, 2015

I should have mentioned that I tested this change on the RankNodeDesResolutionMethod branch that I just pushed.

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

Cool, well check out in a bit. Also, just noting here something I
mentioned in the other issue. We constructed a branch and bound ( though
really mostly branch) synth that also is discussed in the paper and deals
with this issue nicely. It simply doesn't scale well with the many
millions nodes. It is in the code and can be switched on however for the
big tree perhaps your solution will work better.
On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com wrote:

I should have mentioned that I tested this change on the
RankNodeDesResolutionMethod branch that I just pushed.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

The branch (and bound) method that I mentioned that addresses this problem
is in there both in the conflict resolution folder and explorer in graph
synthesis function. I am tracking down the main so I can do your example
and will push that. However it uses very old code and doesn't seem to have
been brought up to some of the newer classes that Cody wrote. I don't
think it is worth it to do that and instead probably best to write some new
ones including of the pseudo code you had (if you haven't already done that)
On Feb 1, 2015 6:55 AM, "Stephen Smith" blackrim@gmail.com wrote:

Cool, well check out in a bit. Also, just noting here something I
mentioned in the other issue. We constructed a branch and bound ( though
really mostly branch) synth that also is discussed in the paper and deals
with this issue nicely. It simply doesn't scale well with the many
millions nodes. It is in the code and can be switched on however for the
big tree perhaps your solution will work better.
On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com wrote:

I should have mentioned that I tested this change on the
RankNodeDesResolutionMethod branch that I just pushed.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@blackrim
Copy link
Member

blackrim commented Feb 1, 2015

yeah, so something was broken with the graphsynthesis method when things
were moved to the new system that isn't letting me do the branch and bound
without the justtrees bit (that was the bit in the paper). i will see if i
can fix that real quick but it won't address doing this at a larger scale.

On Sun, Feb 1, 2015 at 7:41 AM, Stephen Smith blackrim@gmail.com wrote:

The branch (and bound) method that I mentioned that addresses this problem
is in there both in the conflict resolution folder and explorer in graph
synthesis function. I am tracking down the main so I can do your example
and will push that. However it uses very old code and doesn't seem to have
been brought up to some of the newer classes that Cody wrote. I don't
think it is worth it to do that and instead probably best to write some new
ones including of the pseudo code you had (if you haven't already done that)
On Feb 1, 2015 6:55 AM, "Stephen Smith" blackrim@gmail.com wrote:

Cool, well check out in a bit. Also, just noting here something I
mentioned in the other issue. We constructed a branch and bound ( though
really mostly branch) synth that also is discussed in the paper and deals
with this issue nicely. It simply doesn't scale well with the many
millions nodes. It is in the code and can be switched on however for the
big tree perhaps your solution will work better.
On Feb 1, 2015 6:16 AM, "Mark T. Holder" notifications@github.com
wrote:

I should have mentioned that I tested this change on the
RankNodeDesResolutionMethod branch that I just pushed.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@mtholder
Copy link
Member Author

mtholder commented Feb 2, 2015

some more formalized discussion of the effects of this "relationship taxa" conflict detection are at https://github.com/OpenTreeOfLife/treemachine/blob/nonsense-1/nonsense/sted_support_theorem.md

@mtholder
Copy link
Member Author

mtholder commented Feb 3, 2015

As hoped, It looks like 54fcad1 also fixes this issue.
Still need to write more tests to make sure this isn't breaking other logic.

@chinchliff
Copy link
Member

The recent edits to the loading procedure and synthesis seem to solve this problem as well. Here is the tag, and the results of the 'simpleprob' test (which I moved into test-synth dir on rootward-synth branch).

FAILING_TREEMACHINE_TEST=simpleprob ./run_synth_tests.sh

# [snipped]

((((A,(E1,E2)),B),C),D);
clade extracted: E1, E2
clade extracted: A, B, C, E1, E2
clade extracted: A, B, C, D, E1, E2
clade extracted: A, E1, E2
clade extracted: A, B, E1, E2
(((((E2_ottE2,E1_ottE1)E_ottE,A_ottA),B_ottB),C_ottC),D_ottD)life_ott805080;
clade extracted: E1, E2
clade extracted: A, B, C, E1, E2
clade extracted: A, B, C, D, E1, E2
clade extracted: A, E1, E2
clade extracted: A, B, E1, E2
SUCCESS! recovered expected tree.
Failed 0 out of 1 tests(s).

image

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants