Problem w/ branch len estimate with closely related leaves #134

Open
phiweger opened this issue Sep 7, 2020 · 12 comments

@phiweger

phiweger commented Sep 7, 2020

In the TreeTime .nexus output I get a huge negative branch length on an internal node, offset by similarly large positive branch lengths on its child leaves:

...
((7ab5cd3e-524b-4f3e-9951-04d783bcef78:28113.25709,5abe8078-fdb6-4e90-9075-314bc4238f48:28113.15326)NODE_0000024:-28112.91947,(dd518f5c-d48a-464d-bff3-4becb51ae5d5:0.00000,0e2e43b1-45fc-4a32-b584-d4db7b91e86b:0.00000)
...

Is this a bug or some numerical instability? How could I avoid this?

Thanks a lot!

@phiweger
Author

phiweger commented Sep 8, 2020

Further testing gives me the impression that (1) this does not always occur given the same input and (2) it only occurs when I add the --confidence flag to treetime.

@rneher
Member

rneher commented Sep 10, 2020

Is this run on a tree with four leaves, some of them with identical dates/branch lengths? If so, it is likely a numerical instability when trying to invert a singular matrix.
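
To illustrate the kind of instability suggested here, the following is a minimal sketch in plain Python. It is not TreeTime's code: the 2x2 matrix and the `invert_2x2` helper are hypothetical, with two almost identical rows standing in for leaves that carry nearly the same information. As the rows converge, the determinant approaches zero and the inverse contains huge entries of opposite sign that nearly cancel, a pattern similar to the branch lengths reported above.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d)) via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("matrix is exactly singular")
    return ((d / det, -b / det), (-c / det, a / det))


# Hypothetical example: the second row differs from the first by eps only.
for eps in (1e-1, 1e-4, 1e-8):
    inv = invert_2x2(((1.0, 1.0), (1.0, 1.0 + eps)))
    # The entries grow like 1/eps and alternate in sign.
    print(f"eps={eps:g}  inverse row 0 = ({inv[0][0]:.6g}, {inv[0][1]:.6g})")
```

The smaller `eps` gets, the more the result depends on rounding in the last few digits of the input, which would also be consistent with results that change between runs.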

@phiweger
Author

Yes, this is a larger tree (20+ leaves), but 3 of the leaves are identical in their SNV alignment while their dates differ. Is there a way around this instability besides manually clipping the corresponding branch values to 0? The dates should help resolve polytomies, right?
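
For reference, the manual clipping mentioned here could look roughly like the sketch below. It is a workaround under stated assumptions, not a TreeTime feature: `clip_negative_branch_lengths` is a hypothetical helper, and the regular expression assumes that every `:<number>` in the string is a branch length (no colons inside quoted labels or bracketed comments).

```python
import re

# Matches ":<number>" as used for branch lengths in Newick strings.
BRANCH_LENGTH = re.compile(r":(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def clip_negative_branch_lengths(newick, floor=0.0):
    """Return (clipped_newick, n_clipped); lengths below `floor` are set to `floor`."""
    n_clipped = 0

    def _clip(match):
        nonlocal n_clipped
        if float(match.group(1)) < floor:
            n_clipped += 1
            return f":{floor:.5f}"
        return match.group(0)

    return BRANCH_LENGTH.sub(_clip, newick), n_clipped


# Toy tree modelled on the reported pattern (labels shortened for readability).
tree = "((A:28113.25709,B:28113.15326)NODE_0000024:-28112.91947,(C:0.00000,D:0.00000));"
clipped, n = clip_negative_branch_lengths(tree)
print(n, clipped)
```

Clipping only the negative value leaves the large positive lengths on the child leaves in place, so the parent and child values would have to be corrected together for the tree to stay consistent.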

@rneher
Member

rneher commented Sep 22, 2020

Could you send me these data? I can't quite explain why this might happen, and it would be good to fix.

@phiweger
Author

which data do you need? the alignment, dates, undated tree -- anything else?

@rneher
Member

rneher commented Oct 7, 2020

yes, those are what I would need.

@ktmeaton
Contributor

ktmeaton commented Apr 2, 2021

I think I might be having a similar error (if not I can open a new issue). When estimating date confidences using the marginal likelihood, some nodes will sporadically have very large intervals:

image

Rather than having intervals in the range of 100s of years, these nodes have confidence intervals of +100,000 years. These large intervals are somewhat random, in that rerunning the analyses moves them around. Any thoughts on why this might be occurring and if there's a solution?

@rneher
Member

rneher commented Apr 11, 2021

yes, this looks like there is a problem. My hunch is that there is some numerical accuracy problem.

@ktmeaton
Contributor

I was thinking numerical accuracy too. This is a large phylogeny with many small branches (1e-8). Would there be any value in rescaling the branch lengths beforehand (e.g. multiplying them all by 1e4)?
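
The rescaling idea can be sketched as a preprocessing step on the Newick string, as below. `rescale_branch_lengths` is a hypothetical helper, not part of TreeTime, and it assumes that every `:<number>` in the string is a branch length.

```python
import re

# Matches ":<number>" as used for branch lengths in Newick strings.
BRANCH_LENGTH = re.compile(r":(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def rescale_branch_lengths(newick, factor):
    """Multiply every branch length in a Newick string by `factor`."""
    def _scale(match):
        return f":{float(match.group(1)) * factor:.10g}"

    return BRANCH_LENGTH.sub(_scale, newick)


# Toy tree with branches on the 1e-8 scale mentioned above.
print(rescale_branch_lengths("((A:1e-8,B:2e-8):3e-8,C:0.0001);", 1e4))
```

Rescaling changes the units of the tree, so any clock rate passed in or read out would need to be scaled by the same factor. Whether this actually improves the numerics in TreeTime is not established in this thread.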

@rneher
Member

rneher commented Apr 14, 2021

I suppose this is a large genome? Does this use a SNP-only alignment, or a VCF file? TreeTime carries around an internal scale, one_mutation = 1/L (L being the length of the genome). One could try to trick it into assuming the genome is shorter. But I am not sure I understand your application well enough.
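
As a rough sketch of the scale described here (plain arithmetic, not TreeTime internals): with one_mutation = 1/L, a branch length in substitutions per site can be expressed in units of single mutations by dividing by one_mutation. The helper names and genome lengths below are illustrative assumptions.

```python
def one_mutation(genome_length):
    """Branch length (subs/site) that corresponds to one mutation on a genome of length L."""
    return 1.0 / genome_length


def branch_in_mutations(branch_length, genome_length):
    """Express a branch length as a multiple of one_mutation."""
    return branch_length / one_mutation(genome_length)


# Illustrative genome lengths (assumptions, not taken from the thread).
for L in (30_000, 4_600_000):
    print(L, one_mutation(L), branch_in_mutations(1e-8, L))
```

For both lengths, a branch of 1e-8 is far below one mutation, and assuming a shorter genome makes one_mutation larger still.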

@m-a-martin

m-a-martin commented Jun 22, 2021

> I think I might be having a similar error (if not I can open a new issue). When estimating date confidences using the marginal likelihood, some nodes will sporadically have very large intervals:
>
> image
>
> Rather than having intervals in the range of 100s of years, these nodes have confidence intervals of +100,000 years. These large intervals are somewhat random, in that rerunning the analyses moves them around. Any thoughts on why this might be occurring and if there's a solution?

I am having the same (or a similar) issue on a SARS-CoV-2 dataset with roughly 5000 sequences using the flags below; however, it also occurs without the --covariation or --branch-length-mode flags:

--tree ml_clean.nwk --dates clean_metadata.tsv --aln aln_clean.fasta --clock-filter 4 --reroot EPI_ISL_402125 --covariation --coalescent skyline --clock-rate 0.001 --clock-std-dev 0.0005 --branch-length-mode joint --confidence --keep-polytomies

I'm using a full alignment. The problem is random and rerunning on the same dataset can generate reasonable confidence intervals, but it happens often enough that it is an issue. Using TreeTime v. 0.80 on Python v3.9. I've attached the treetime output as well as the ML tree and a list of accession numbers (can't share alignment because GISAID data).

for_github.zip

@rneher
Member

rneher commented Sep 28, 2021

Sorry, I just started to pick this up again. All the numbers in the dates.tsv file look sensible, and these should be the same as in the graph, with the exception of those labeled as problematic branches, which are masked in dates.tsv but not in the graph. My hunch is that these long bars are essentially undefined confidence intervals of branches that don't follow the clock closely enough for us to rely on this estimate. I'll add a line to exclude these from the graph.
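
Until such a change is available, a user-side filter could look like the sketch below. This is not the change described above: `split_by_interval_width` is a hypothetical helper, the width threshold is arbitrary, and it selects by interval width rather than by how well a branch follows the clock. The input is assumed to be a plain mapping from node name to (lower, upper) numeric dates.

```python
def split_by_interval_width(intervals, max_width_years=1000.0):
    """Partition {node: (lower, upper)} into (kept, excluded) by interval width."""
    kept, excluded = {}, {}
    for node, (lower, upper) in intervals.items():
        target = kept if (upper - lower) <= max_width_years else excluded
        target[node] = (lower, upper)
    return kept, excluded


# Made-up numbers for illustration only.
example = {
    "NODE_0000001": (2019.8, 2020.1),
    "NODE_0000002": (-98000.0, 2020.3),
}
kept, excluded = split_by_interval_width(example)
print(sorted(kept), sorted(excluded))
```

Only the nodes in `kept` would then be passed on to plotting, with the excluded ones reported separately rather than silently dropped.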
