Make sure to catch interesting performance measures for a publication #121
Something along the lines of: I think this could be useful when we run the entire alpha library through the Beta pipeline (N=5500) to get some nice distributions to report in a paper.
In my opinion these metrics should be saved not only for publication, but also to help us optimize the pipeline. With these metrics saved, we can see whether performance has increased or declined during development, determine which changes improve or degrade performance, etc.
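One way to make the saved metrics comparable across development versions is to append each run as a JSON line tagged with the pipeline version and a timestamp. This is only a minimal sketch of that idea; the file name, version string, and metric keys below are hypothetical, not part of the actual pipeline.

```python
import json
import time
from pathlib import Path


def record_metrics(log_path: Path, pipeline_version: str, metrics: dict) -> None:
    """Append one run's metrics as a JSON line, tagged with version and timestamp."""
    entry = {"version": pipeline_version, "timestamp": time.time(), **metrics}
    with log_path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")


def load_history(log_path: Path) -> list[dict]:
    """Load all recorded runs so versions can be compared programmatically."""
    with log_path.open() as fh:
        return [json.loads(line) for line in fh]


# Hypothetical example: record one run, then inspect the history.
log = Path("metrics_history.jsonl")  # assumed location in the pipeline output
record_metrics(log, "0.3.1", {"n_sumstats": 5500, "runtime_s": 812.4})
history = load_history(log)
```

An append-only JSON-lines file keeps every historical run readable with standard tools, so a later change in performance can be traced to the version that introduced it.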
I don't think the size of the raw dataset compared to the cleaned dataset is very informative (given that non-informative columns will be removed, etc.), but it can't hurt to register it anyway.
Yeah, maybe that would only be interesting if we introduced some sort of compression of the information, so that you'd see the compression ratio.
As the inventory grows it will be good to invest in better data compression (in time); otherwise we will be paying for a ton of storage on genomeDK.
True about the storage. Maybe we can add it to ibp-pipeline-db to compress the raw data as much as it can. I'm not sure if I read this guide correctly, or whether gzip -9 really is close to the best trade-off when considering memory footprint and decompression time?
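The trade-off between gzip levels can be checked empirically before committing to one. Below is a minimal sketch that compares compression ratio and timing across levels 1, 6, and 9 on synthetic tab-separated text; the data is made up here, and in practice you would read a real sumstats file instead.

```python
import gzip
import time

# Synthetic sumstats-like TSV text (hypothetical columns: id, chrom, pos, p-value).
rows = "\n".join(
    f"rs{i}\t{i % 23}\t{i * 137 % 100000}\t{(i % 997) / 997:.4f}"
    for i in range(20000)
)
data = rows.encode()

for level in (1, 6, 9):
    t0 = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    compress_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    restored = gzip.decompress(compressed)
    decompress_s = time.perf_counter() - t0
    assert restored == data  # round-trip is lossless at every level

    ratio = len(data) / len(compressed)
    print(
        f"level={level} ratio={ratio:.2f}x "
        f"compress={compress_s * 1000:.1f}ms decompress={decompress_s * 1000:.1f}ms"
    )
```

Running this on representative files would show directly whether level 9's extra compression time buys a meaningful reduction in stored bytes; gzip decompression speed is largely independent of the level used to compress.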
For the other stats discussion, many of your suggestions are already more or less covered. Maybe we could Zoom one day and I can show you what is available. Maybe we should relocate it so you can find it more easily :)
Sure, let's Zoom next week 😀
Yes, I don't think we will come up with a better compression algorithm than gzip for text files.
In the quote above I think Joeri was referring to other means of data compression (other than using gzip on sumstats files), such as creating a binary format for sumstats files or using an RDBMS.
That's great! From my point of view, it's important that those stats are stored somewhere, so that you can compare stats from the latest version against historical versions, and that they are stored in a format that lets you do that programmatically.
Jesper and I had a meeting about this, and the pipeline already produces metrics files with raw data for all the cases we discussed in this issue. So in order to solve this issue we should implement:
The default Nextflow report already presents a lot of numbers, so maybe move those to the other statistics in the output folder.