[FEATURE REQUEST] Make tabular output files tab-delimited #235
Comments
Sean's the person who can comment on your feature request, though my take is that if we decided to implement it, we'd want to do so as an optional format to avoid breaking code that works with the current format.
Looking at the anvi'o GitHub page, it looks like a tool I should look into more, but I see that most of your code is written in Python. I've had good luck parsing tblout and domtblout files in Python by using the split() function to turn each line of text into a list of fields, via something like fields = line.split(). Split seems to deal fine with variable-length runs of spaces, such that I haven't seen any problems with it, though I haven't done a vast amount of work with these output file formats.
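For readers following along, a minimal sketch of that split() approach might look like the following. The sample lines are made up and truncated to a few columns, and `parse_naive` is a hypothetical helper name, not something from HMMER or anvi'o.

```python
def parse_naive(lines):
    """Split each data line on runs of whitespace; skip blank and '#' comment lines."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # split() with no arguments treats any run of spaces as one separator
        rows.append(line.split())
    return rows


# Made-up, truncated sample lines with uneven runs of spaces
sample = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "seq1   -   modelA   -   1.2e-30  100.5  0.1",
    "seq2        -   modelA   -   3.4e-05   20.1  0.0",
]
rows = parse_naive(sample)
```

This works as long as no field contains a space; a free-text field with internal spaces would be broken into several list items.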
Could you comment on how you're trying to parse these output files?
…-nick
On April 1, 2021 at 21:08 GMT, Iva Veseli ***@***.***> wrote:
Hello!
I was wondering if you would be willing to change the tabular output formats (--tblout and --domtblout) to be tab-delimited rather than space-delimited. Right now fields in these output files are separated by a variable number of spaces in each line, which aligns the columns nicely and looks very pretty, but is difficult to parse in downstream code.
[image](https://user-images.githubusercontent.com/10360586/113354178-634bdb00-9304-11eb-9414-522a7a0b08de.png)
Converting these runs of spaces to a single tab between each field would not look as good, but it would make it so much easier for programmers who are working with these output files downstream.
For additional context, I am one of the current developers of [anvi'o](https://github.com/merenlab/anvio). We are big fans of your work and have been relying on HMMER in a variety of contexts in anvi'o. Sadly, our current parsing capabilities for HMMER tabular output are rather inadequate and we've lately been encountering more circumstances in which it just fails (for instance, [here is one example](merenlab/anvio#1564)). Of course we are hoping to fix these parsing issues on our end, but we thought it might be both easier and useful to the wider community to request this change from you all, since there must be plenty of other groups who are working directly with HMMER output and could benefit from the convenience of tab-delimited output files.
If this change seems reasonable to you, I'd be happy to implement it and open a PR. I just wanted to open a discussion with you all first to see if it is something you would even consider altering in your codebase. :)
Thanks for your consideration!
Iva
Thanks for your very prompt reply, @npcarter! That is indeed how we currently parse the tblout files. Here's the relevant code snippet in case you are interested:
Our issues stem from the fact that we currently try to remove the description column prior to splitting because that column can internally include spaces (so resulting files could have different numbers of fields in each line, which generally breaks things downstream). It doesn't work in all cases because the clip index may not be the same in each line. So clearly that is not such a good move on our part, and I am working on fixing it :) But I thought perhaps it would be easier to fix it at the source, if you all thought it was a good idea. I am looking forward to hearing Sean's opinion on it.
That sounds just fine to me :)
Thank you very much for bringing this up, @ivagljiva, and thank you @npcarter for listening! I'm posting this message mainly to make sure GitHub will keep me in the loop, but I thought I could comment on this:
Thanks for being considerate of backwards compatibility. Perhaps adding a flag to let the user explicitly ask for a simpler format could be a way to avoid any snafu. But of course Sean and others will know the best course of action. We are big fans of your work, and thank you again.
In Python,
for example, will split

Other languages and packages generally have equivalently easy ways to parse whitespace-delimited lines into fields + remaining free text.

I have strong reasons to prefer whitespace-delimited and column-aligned tabular output files. It is extremely important to make outputs both parsable and easily human readable. If outputs aren't human readable (and aren't checked by humans), analyses are artifact prone. Our tabular output formats are designed to be easily downsampled, sorted, filtered, and examined by hand, using basic command-line tools, not just parsed in automated pipelines.

That said, we plan to provide tab-delimited output options in HMMER4, since many people have requested this - even though I think you're all horribly wrong. I think the right solution is to know how to parse whitespace-delimited files, which we do routinely and easily.
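As an illustration of the "fields + remaining free text" idea, here is a minimal sketch using `str.split` with `maxsplit`. It is not necessarily the snippet from the comment above. It assumes 18 whitespace-delimited columns precede the free-text description in `--tblout` output (check this against the HMMER user guide for your version); the helper name and the sample line are made up.

```python
# Assumption: 18 fixed columns come before the description in --tblout.
N_FIXED_TBLOUT = 18


def parse_tblout_line(line, n_fixed=N_FIXED_TBLOUT):
    """Return (fixed_fields, description) for one non-comment data line."""
    # Limiting the number of splits leaves the rest of the line, internal
    # spaces included, as a single final element.
    parts = line.strip().split(maxsplit=n_fixed)
    fixed = parts[:n_fixed]
    description = parts[n_fixed] if len(parts) > n_fixed else ""
    return fixed, description


# Made-up example line: 18 fields, then a description containing spaces
fields = ["seq1", "-", "modelA", "-",
          "1.2e-30", "100.5", "0.1",
          "2.3e-30", "99.9", "0.1",
          "1.0", "1", "0", "0", "1", "1", "1", "1"]
line = "   ".join(fields) + "  putative kinase, family 2\n"

fixed, description = parse_tblout_line(line)
```

Because the split count is fixed, every line yields the same number of fields no matter how many spaces the description contains, so no per-line clip index is needed.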
Thanks for the response, @cryptogenomicon.
I will share my 2 cents on this point only because you mentioned that we are "all horribly wrong". I thought it would have been almost rude not to bite.

I do agree that any program that operates on user data should make its outputs accessible to human reading. This is quite parallel to our philosophy: we want our users to have everything they need to scrutinize every result anvi'o generates without trouble. But there are better ways to achieve that than generating whitespace-delimited files so things look aligned to the human eye.

Like many software tools, anvi'o reports TAB-delimited files by default. But we also have global flags for human-readable output. For instance, any program that produces TAB-delimited files as an output can also report its output as markdown-formatted content if the user simply includes the flag
If the purpose is to give the user a means to scrutinize things easily, this output is of course much more useful and readable than any whitespace-delimited output for certain media. For instance, if the target medium supports proper handling of markdown tables, others looking at the output can even sort or filter the output by column and so on. In addition to the on-the-fly markdown conversion, the user can display the same output in their terminal as an ASCII table if they wish to; here is a screenshot from my terminal:

Not to mention that any basic command-line tool will work as easily with TAB-delimited output files as it does with files whose columns are separated by an arbitrary number of whitespace characters. We are happy to improve these options when/if anyone makes a reasonable request. HMMER could indeed offer an option to produce its outputs as TAB-delimited files, but I understand that technically feasible solutions can't stand against the power of personal preferences.
This is just one solution. But I don't see how it is the right one that makes everyone else horribly wrong. As the mighty upstream you can of course tell us to go fly a kite and we would do it. But the right solution is to write versatile code that can accommodate versatile needs without forcing downstream users or programmers to deal with personal preferences whenever possible. Which many do routinely and easily.

Best wishes,
This issue was closed, but the feature request was not addressed. A tab-delimited format with unique field names that could be easily read into a pandas dataframe would be greatly appreciated by many. Users shouldn't need to write custom parsers in Python.
Say your output is "test.out". It's not ideal, but these shell commands may help, worked for me:
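For anyone who would rather do the conversion in Python, here is a rough sketch of turning a whitespace-delimited `--tblout` file into a tab-delimited one with a header row. It is an alternative approach, not the shell commands referred to above. The column names are my own illustrative labels (HMMER does not define them), and the 18-fixed-columns-plus-description layout is an assumption to verify for your HMMER version.

```python
import csv
import io

# Illustrative, made-up column labels: 18 fixed columns plus a description.
TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "best_dom_evalue", "best_dom_score", "best_dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]


def tblout_to_tsv(src, dst):
    """Copy data lines from src to dst as tab-delimited rows with a header."""
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(TBLOUT_COLUMNS)
    n_fixed = len(TBLOUT_COLUMNS) - 1
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment/header lines
        parts = line.split(maxsplit=n_fixed)
        # Pad short lines so every row has the same number of columns
        parts += [""] * (len(TBLOUT_COLUMNS) - len(parts))
        writer.writerow(parts)


# Made-up input standing in for the contents of "test.out"
src = io.StringIO(
    "# target name   accession  query name ...\n"
    "seq1  -  modelA  -  1.2e-30 100.5 0.1  2.3e-30 99.9 0.1"
    "  1.0 1 0 0 1 1 1 1  putative kinase, family 2\n"
)
dst = io.StringIO()
tblout_to_tsv(src, dst)
tsv_lines = dst.getvalue().splitlines()
```

With real files, `src` and `dst` would be opened with `open("test.out")` and `open("test.tsv", "w", newline="")`. The result should then load with `pandas.read_csv(path, sep="\t")`.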