Make sure to catch interesting performance measures for a publication #121

Open
pappewaio opened this issue Jan 28, 2021 · 11 comments

@pappewaio
Contributor

There are already a lot of numbers presented in the default Nextflow report, so maybe we should move those to the other statistics in the output folder.

@joejeroe

Something along the lines of:
Total time spent:
Time spent on process A:
Time spent on process B:
Time spent on process C:
etc.
Maximum memory use:
Maximum memory use per process:

I think this could be useful when we run the entire alpha library through the Beta pipeline (N=5500) to get some nice distributions to report in a paper.
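
A minimal sketch of how such a summary could be computed, assuming a tab-separated table with one row per task. The column names `process`, `realtime_ms`, and `peak_rss_bytes` are hypothetical placeholders, not the pipeline's actual output, and the "total time" here is the sum of task run times rather than wall-clock time.

```python
import csv
import io
from collections import defaultdict


def summarize_trace(tsv_text):
    """Aggregate run time and peak memory per process from a TSV table.

    Assumed (hypothetical) columns: process, realtime_ms, peak_rss_bytes.
    """
    time_per_process = defaultdict(int)
    max_mem_per_process = defaultdict(int)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        name = row["process"]
        # Sum run time over all tasks that belong to the same process.
        time_per_process[name] += int(row["realtime_ms"])
        # Keep the largest peak memory seen for this process.
        max_mem_per_process[name] = max(
            max_mem_per_process[name], int(row["peak_rss_bytes"])
        )
    return {
        "total_time_ms": sum(time_per_process.values()),
        "time_ms_per_process": dict(time_per_process),
        "max_memory_bytes": max(max_mem_per_process.values(), default=0),
        "max_memory_bytes_per_process": dict(max_mem_per_process),
    }


# Made-up example rows, only to show the call.
example = (
    "process\trealtime_ms\tpeak_rss_bytes\n"
    "A\t1200\t1000000\n"
    "B\t300\t5000000\n"
    "A\t800\t2000000\n"
)
summary = summarize_trace(example)
print(summary)
```

Running this over all N=5500 inputs and collecting one summary per run would give the per-process distributions mentioned above.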

@rzetterberg
Contributor

  • Size of raw sumstat
  • Size of cleaned sumstat
  • Lines in raw sumstat
  • Lines in cleaned sumstat
  • Number of lines processed in total
  • Number of lines processed per process
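
As a rough sketch, the size and line-count metrics in this list could be collected with a small helper like the one below. The file names and contents are made up for illustration, and the helper assumes sumstats are plain text or gzip-compressed text.

```python
import gzip
import os
import tempfile


def file_stats(path):
    """Return size on disk (bytes) and line count; .gz files are read transparently."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return {"bytes": os.path.getsize(path), "lines": n_lines}


# Tiny made-up files, only to show the call.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw_sumstat.txt")
    cleaned = os.path.join(tmp, "cleaned_sumstat.txt.gz")
    with open(raw, "w") as f:
        f.write("SNP\tP\tEXTRA\nrs1\t0.5\tx\nrs2\t0.1\ty\n")
    with gzip.open(cleaned, "wt") as f:
        f.write("SNP\tP\nrs1\t0.5\n")
    raw_stats = file_stats(raw)
    cleaned_stats = file_stats(cleaned)

print(raw_stats, cleaned_stats)
```

Note that for a `.gz` file the reported size is the compressed size on disk, while the line count refers to the decompressed content.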

@rzetterberg
Contributor

In my opinion these metrics should be saved not only for publication, but also to help us optimize the pipeline.

With these metrics saved, we can see whether performance has improved or declined during development, determine which changes increase or decrease performance, etc.

@joejeroe

I don't think the size of the raw dataset compared to the cleaned dataset is very informative (given that non-informative columns will be removed etc), but it can't hurt to have it registered anyway.

@rzetterberg
Contributor

Yeah, maybe that would only be interesting if we would introduce some sort of compression of the information, so that you'll see the compression ratio.

@joejeroe

As the inventory grows it will be good to invest in some better data compression (in time), else we will be paying for a ton of storage on genomeDK.

@pappewaio pappewaio added this to To do in version_v1.x.x-beta Feb 1, 2021
@pappewaio
Contributor Author

True about the storage. Maybe we can add it to ibp-pipeline-db to compress the raw data as much as it can. I am not sure whether I read this guide correctly, or whether gzip-9 compression is actually sort of best when considering memory footprint and decompression time?
https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
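
One way to check this on our own data rather than relying on a generic benchmark is sketched below with Python's standard `gzip` module. The sample text is synthetic and only stands in for a sumstats file. The sketch measures compression ratio and timings only, not memory footprint.

```python
import gzip
import time


def benchmark_gzip(data, levels=(1, 6, 9)):
    """Compress `data` at several gzip levels; report ratio and timings."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = gzip.compress(data, compresslevel=level)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = gzip.decompress(compressed)
        decompress_s = time.perf_counter() - start

        assert restored == data  # the round trip must be lossless
        results.append({
            "level": level,
            "ratio": len(data) / len(compressed),
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        })
    return results


# Synthetic, repetitive text standing in for a sumstats file (made-up content).
sample = b"".join(
    b"rs%d\t1\t%d\tA\tG\t0.0123\t0.0045\t0.5\n" % (i, i * 100)
    for i in range(20000)
)
results = benchmark_gzip(sample)
for r in results:
    print(r)
```

Timings on a synthetic sample this small are noisy, so a real comparison would use a few actual raw files and repeated runs.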

@pappewaio
Contributor Author

For the other stats discussion, much of your suggestions are sort of covered already. Maybe we could zoom one day, and I can show you what is available. Maybe we should relocate it so you can find it more easily :)

@joejeroe

joejeroe commented Feb 5, 2021

Sure let's zoom next week😀

@rzetterberg
Contributor

> gzip-9 compression is actually sort of best when considering memory footprint and decompression time?

Yes, I don't think we will come up with a better compression algorithm than gzip for text files.

> As the inventory grows it will be good to invest in some better data compression

In the quote above I think Joeri was referring to means of data compression other than using gzip on sumstats files, such as creating a binary format for sumstats files or using an RDBMS.

> For the other stats discussion, much of your suggestions are sort of covered already.

That's great! From my point of view, it's important that those stats are stored somewhere, so that you can compare stats from the latest version against historical versions, and that they are stored in a format that lets you do that programmatically.
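
To illustrate what "programmatically" could look like, here is a small sketch that assumes each pipeline version leaves behind one JSON record of its metrics. The record layout, metric names, version numbers, and values are all hypothetical.

```python
import json


def compare_metrics(old, new):
    """Return the relative change, (new - old) / old, for metrics present in both runs."""
    changes = {}
    for key in old.keys() & new.keys():
        if old[key]:  # skip zero baselines to avoid division by zero
            changes[key] = (new[key] - old[key]) / old[key]
    return changes


# Hypothetical records, one JSON object per pipeline version.
history = [
    json.loads('{"version": "1.0.0", "metrics": {"total_time_s": 200, "max_memory_mb": 800}}'),
    json.loads('{"version": "1.1.0", "metrics": {"total_time_s": 150, "max_memory_mb": 1000}}'),
]
changes = compare_metrics(history[0]["metrics"], history[1]["metrics"])
print(changes)
```

A negative value means the metric went down between versions, so in this made-up example the run got faster but used more memory.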

@rzetterberg
Contributor

Jesper and I had a meeting about this, and the pipeline already produces metrics files with raw data for all the cases we discussed in this issue.

So in order to solve this issue we should implement:

  • Copying these files for each run to a central place
  • Improving the stepwise file with better step names
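
The first point could be sketched roughly as follows. The directory layout, the `run_id` naming, the `*.tsv` pattern, and the file names are assumptions for illustration, not the pipeline's actual conventions.

```python
import shutil
import tempfile
from pathlib import Path


def archive_metrics(run_dir, central_dir, run_id, pattern="*.tsv"):
    """Copy files matching `pattern` from a run's output into central_dir/run_id/."""
    target = Path(central_dir) / run_id
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(run_dir).glob(pattern)):
        shutil.copy2(src, target / src.name)  # copy2 also preserves timestamps
        copied.append(src.name)
    return copied


# Made-up run directory, only to show the call.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_output"
    run_dir.mkdir()
    (run_dir / "stepwise.tsv").write_text("step\tlines\n")
    (run_dir / "notes.txt").write_text("not a metrics file\n")
    copied = archive_metrics(run_dir, Path(tmp) / "central", run_id="run_0001")
    archived_exists = (Path(tmp) / "central" / "run_0001" / "stepwise.tsv").exists()

print(copied, archived_exists)
```

Keeping one subdirectory per run makes it straightforward to compare metrics across runs later.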

@pappewaio pappewaio removed this from To do in version_v1.x.x-beta May 6, 2021

3 participants