Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree distance between two samples #2627

Open
hyanwong opened this issue Nov 3, 2022 · 9 comments · May be fixed by #2771
Open

Tree distance between two samples #2627

hyanwong opened this issue Nov 3, 2022 · 9 comments · May be fixed by #2771

Comments

@hyanwong
Copy link
Member

hyanwong commented Nov 3, 2022

When writing the tskit introduction for phylogeneticists, I was reminded of something that has caught people out before: it is relatively common to want to know the length of the path in a tree connecting two samples (or indeed multiple pairs of samples). This can be done using ts.diversity(mode="branch", windows="trees") or ts.divergence(mode="branch", windows="trees"), but that's not terribly obvious to a phylogeneticist. For this rather simple measure, is it perhaps worth having a function that is essentially an alias to one of these methods?

I guess something like ts.path_length(sample_a, sample_B) or (if we want a Tree version), tree.path_length(A, B). The docs could then say this is a simple alias for the divergence method with specific settings, which would give people an easier way in to finding out about the power of the mode=branch stats methods.

Perhaps we don't want to add to an already weightily API, though. Thoughts e.g. @benjeffery ? Better suggestions?

@jeromekelleher
Copy link
Member

How about this?

I don't think we want a tree sequence version, it's too complicated to define.

@benjeffery
Copy link
Member

Tree.path_length does the job well already as @jeromekelleher suggests.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 3, 2022

Gah, I didn't even know about that. Thanks, and sorry for the noise.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 3, 2022

I think maybe a note in the existing path_length docs that you can do this on multiple trees by using ts.diversity(mode="branch", windows="trees") might be in order, through?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 3, 2022

Oh, hang on a second, path_length(u, v) (confusingly IMO) gives the count of number of edges between two samples, not, as I would expect from the name, the sum of "branch lengths" (i.e. ts.diversity([u, v], mode="branch", windows="trees")[0] for a one-tree-sequence). So I'm reopening.

Should there be an option to return the actual length, rather than the edge count?

@hyanwong hyanwong reopened this Nov 3, 2022
@jeromekelleher
Copy link
Member

jeromekelleher commented Nov 3, 2022

That's why it was called "path_length" (something we wanted for the balance metrics), which was the number of steps you take in the path from one to the other.

The other is very easy to calculate using the tmrca of the two nodes.

I guess you could call this branch_distance or something?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 3, 2022

Yes, path_distance or "branch_distance", I guess. I'll look up what it's called in the phylogenetic literature.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 3, 2022

I think the fact that I misunderstood the method, and that you misunderstood what I meant, is a good argument for implementing something like branch_distance, so that we can put a "see also" note to that function in the path_length docstring.

@jeromekelleher
Copy link
Member

SGTM. Could even have something like Tree.distance_between(u, v)?

@Billyzhang1229 Billyzhang1229 linked a pull request Jun 22, 2023 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants