Read Nexus tree or tree/matrix files #89

Open
curtislisle opened this issue Jan 14, 2015 · 7 comments

Comments

@curtislisle
Collaborator

Nexus files are widely used in phylogenetics. Instead of maintaining our own parsers, we should adopt mature parsers where they exist. The parser linked below reads Nexus and Newick files into R more reliably than ape, and it is built on NCL (the NEXUS Class Library).

http://francoismichonneau.net/2014/12/rncl/

@curtislisle
Collaborator Author

The attached ZIP contains a simple tree in Nexus format and a corresponding character matrix. Arbor needs to be able to read files of this type, since many existing packages write their output in this format.

nexus_example_data.zip

@curtislisle
Collaborator Author

I know we have simple Nexus tree reading, but this format is complex. There is a very complete C++ implementation of the NEXUS spec at the link below. Maybe we can use it to parse into our intermediate tree representation:

https://github.com/mtholder/ncl

@curtislisle
Collaborator Author

Flow currently assumes that any file with a Nexus extension contains trees. This is not correct. A Nexus file can (and often does) contain trees, matrices, or both, and a single file can store multiple trees and multiple matrices. Reading Nexus successfully is fairly critical for widespread adoption of Arbor.
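To illustrate the point, here is a minimal Python sketch that surveys what a Nexus file contains instead of assuming it holds trees. It is not a real parser and not Flow's code: the function names and the `EXAMPLE` text are made up for illustration, only DATA and CHARACTERS blocks are counted as matrices, and quoted labels containing `;` or `[` are not handled. A mature library such as NCL would be needed for anything beyond a quick check.

```python
import re

# "BEGIN <name>; ... END;" (or ENDBLOCK;), matched case-insensitively.
BLOCK_RE = re.compile(
    r"\bbegin\s+(\w+)\s*;(.*?)\bend(?:block)?\s*;",
    re.IGNORECASE | re.DOTALL,
)
# Innermost [bracketed] comment; applied repeatedly to handle nesting.
COMMENT_RE = re.compile(r"\[[^\[\]]*\]")


def strip_comments(text):
    """Remove [bracketed] comments, including nested ones."""
    previous = None
    while previous != text:
        previous = text
        text = COMMENT_RE.sub("", text)
    return text


def count_statements(body, keyword_pattern):
    """Count ';'-terminated statements whose first word matches the pattern."""
    starts_with = re.compile(r"\s*(?:" + keyword_pattern + r")\b", re.IGNORECASE)
    return sum(1 for stmt in body.split(";") if starts_with.match(stmt))


def summarize_nexus(text):
    """Return block names in file order plus rough tree and matrix counts."""
    blocks = [(name.upper(), body)
              for name, body in BLOCK_RE.findall(strip_comments(text))]
    trees = sum(count_statements(body, "u?tree")
                for name, body in blocks if name == "TREES")
    matrices = sum(count_statements(body, "matrix")
                   for name, body in blocks if name in ("DATA", "CHARACTERS"))
    return {"blocks": [name for name, _ in blocks],
            "trees": trees,
            "matrices": matrices}


# Illustrative file with one matrix and two trees (made-up data).
EXAMPLE = """#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN CHARACTERS;
  DIMENSIONS NCHAR=2;
  FORMAT DATATYPE=STANDARD;
  MATRIX
    A 01
    B 11
    C 00
  ;
END;
BEGIN TREES;
  TREE t1 = [&R] ((A,B),C);
  TREE t2 = [&R] (A,(B,C));
END;
"""

if __name__ == "__main__":
    print(summarize_nexus(EXAMPLE))
```

A check like this could decide whether an uploaded file is treated as a tree, a table, or a combined type, though the exact dispatch would depend on how Flow registers types.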

@jeffbaumes
Member

I can take a look. It seems that a new "trees_tables" type is appropriate for Nexus files.

@curtislisle
Collaborator Author

Thanks. This isn’t urgent, but I’d like to work on this over the next few weeks/months.

On Aug 16, 2016, at 5:45 AM, Jeffrey Baumes notifications@github.com wrote:

I can take a look. It seems that a new "trees_tables" type is appropriate for Nexus files.



@jeffbaumes
Member

It is clear that a nexus file can contain zero, one, or more trees, and that it can contain zero or one matrices. What is not clear is whether a file can contain more than one matrix (or whether this ever happens in practice). This paper (http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf) seems to document the nexus format better than anything else I've seen. To do this right, we should assemble a collection of nexus files of all shapes and sizes and test against every one of them to ensure they are all supported.

If there can be any number of trees or tables, a few workflows might make sense. I prefer a trees_tables type that has a nexus format and one or more in-memory formats, depending on the library used to read it (such as ape or ncl). Flow could then provide standard "Select Nexus Tree" and "Select Nexus Matrix" analyses that take a trees_tables (nexus file) and a selector (tree/table index or name) as input and output a single tree/table in one of the supported formats. So a workflow may look something like this:

img_0897
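The selector idea described above could be sketched roughly as follows in Python. Everything here is hypothetical: `TreesTables`, `select_nexus_tree`, and `select_nexus_matrix` are illustrative names rather than existing Flow or Arbor APIs, trees are stood in for by Newick strings, and tables by plain dictionaries of taxon to character string. The real in-memory formats would depend on the reader library (ape or ncl).

```python
from dataclasses import dataclass, field
from typing import Dict, Union

Selector = Union[int, str]  # position in file order, or the item's name


@dataclass
class TreesTables:
    """Hypothetical container for everything read from one nexus file."""
    trees: Dict[str, str] = field(default_factory=dict)               # name -> Newick
    tables: Dict[str, Dict[str, str]] = field(default_factory=dict)   # name -> rows


def _select(items, selector, kind):
    """Look an item up by name (str) or by position in insertion order (int)."""
    if isinstance(selector, str):
        if selector not in items:
            raise LookupError(f"no {kind} named {selector!r}")
        return items[selector]
    names = list(items)  # dicts keep insertion order in Python 3.7+
    if not 0 <= selector < len(names):
        raise LookupError(
            f"{kind} index {selector} out of range (file has {len(names)})")
    return items[names[selector]]


def select_nexus_tree(data: TreesTables, selector: Selector = 0) -> str:
    """Sketch of a 'Select Nexus Tree' step: one tree out of a trees_tables."""
    return _select(data.trees, selector, "tree")


def select_nexus_matrix(data: TreesTables, selector: Selector = 0) -> Dict[str, str]:
    """Sketch of a 'Select Nexus Matrix' step: one table out of a trees_tables."""
    return _select(data.tables, selector, "table")


# Made-up content standing in for a parsed nexus file.
example = TreesTables(
    trees={"t1": "((A,B),C);", "t2": "(A,(B,C));"},
    tables={"characters": {"A": "01", "B": "11", "C": "00"}},
)

if __name__ == "__main__":
    print(select_nexus_tree(example, "t2"))
    print(select_nexus_matrix(example, 0))
```

Accepting either an index or a name keeps the selector usable both for files with labeled trees and for files where only the order is known. Raising an error on a missing item, rather than returning nothing, would make a mis-specified workflow fail early.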

@curtislisle
Collaborator Author

I agree with this approach of having the combined format and selector steps in a workflow. I am working with David Maddison this week. I'll ask him for samples and for how many trees / matrices are allowed per file.

On Aug 16, 2016, at 10:17 AM, Jeffrey Baumes notifications@github.com wrote:

It is clear that there can be zero, one, or more trees in a nexus file, and it is clear that there can be zero or one matrices. What is not clear is whether there can be more than one matrix (or if in practice this ever happens). This paper seems to document the nexus format better than anything else I've seen http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf. To do this right we should have a collection of nexus files of all shapes and sizes and perform testing on all of them to ensure they are all supported.

If there can be any number of trees or tables, a few workflows might make sense. I prefer a trees_tables type which has nexus format and one or more in-memory formats based on the library used to read it (such as ape or ncl). There could then be standard "Select Nexus Tree" and "Select Nexus Matrix" analyses in Flow which input a trees_tables (nexus file) and a selector (tree/table index or name) and output a single tree/table in one of the supported formats. So a workflow may look something like this:


