Make sure to catch interesting performance measures for a publication #121

pappewaio · 2021-01-28T20:22:33Z

There are already a lot of numbers presented in the default Nextflow report, so maybe move that to the other statistics in the output folder.

joejeroe · 2021-01-28T20:52:41Z

Something along the lines of:
Total time spent:
Time spent on process A:
Time spent on process B:
Time spent on process C:
etc.
Maximum memory use:
Maximum memory use per process:

I think this could be useful when we run the entire alpha library through the Beta pipeline (N=5500) to get some nice distributions to report in a paper.
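A minimal sketch of how such a summary could be computed, assuming a tab-separated trace table with one row per task and columns named `process`, `realtime` (milliseconds) and `peak_rss` (bytes). The column names are modelled on Nextflow's trace fields, but the exact layout and units the pipeline writes are assumptions here, and the example rows are made up. Note that summing task run times gives total compute time, not wall-clock time, when tasks run in parallel.

```python
import csv
import io
from collections import defaultdict


def summarise_trace(tsv_text):
    """Aggregate run time and peak memory per process from a tab-separated trace table."""
    total_ms = defaultdict(int)
    max_rss = defaultdict(int)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        process = row["process"]
        total_ms[process] += int(row["realtime"])  # assumed to be milliseconds
        max_rss[process] = max(max_rss[process], int(row["peak_rss"]))  # assumed bytes
    return {
        "total_ms": sum(total_ms.values()),
        "ms_per_process": dict(total_ms),
        "max_rss": max(max_rss.values(), default=0),
        "max_rss_per_process": dict(max_rss),
    }


# Made-up example rows, not real pipeline output.
example = (
    "process\trealtime\tpeak_rss\n"
    "A\t1200\t50000000\n"
    "A\t800\t70000000\n"
    "B\t3000\t20000000\n"
)
summary = summarise_trace(example)
print(summary)
```

Running this over the trace of every sumstat in the library would give one summary per run, which is the raw material for the distributions mentioned above.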

rzetterberg · 2021-01-29T12:43:04Z

Size of raw sumstat
Size of cleaned sumstat
Lines in raw sumstat
Lines in cleaned sumstat
Number of lines processed in total
Number of lines processed per process
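As a rough illustration, the size and line counts could be collected with something like the sketch below. The function name `file_stats` and the file names are hypothetical, and it assumes sumstat files are plain text or gzip-compressed text (detected by a `.gz` suffix).

```python
import gzip
import os
import tempfile


def file_stats(path):
    """Return size on disk in bytes and number of lines; .gz files are read decompressed."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return {"bytes": os.path.getsize(path), "lines": n_lines}


# Demo on throwaway files; names and contents are made up.
tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "raw_sumstat.txt")
clean_path = os.path.join(tmpdir, "cleaned_sumstat.txt.gz")
with open(raw_path, "w") as fh:
    fh.write("SNP\tP\tEXTRA\nrs1\t0.5\tx\nrs2\t0.1\ty\n")
with gzip.open(clean_path, "wt") as fh:
    fh.write("SNP\tP\nrs1\t0.5\n")

raw_stats = file_stats(raw_path)
clean_stats = file_stats(clean_path)
print(raw_stats, clean_stats)
```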

rzetterberg · 2021-01-29T12:45:52Z

In my opinion these metrics should be saved not only for publication, but also to help us optimize the pipeline.

With these metrics saved, we can see whether performance has increased or declined during development, and determine which changes increase or decrease performance.

joejeroe · 2021-01-29T12:49:48Z

I don't think the size of the raw dataset compared to the cleaned dataset is very informative (given that non-informative columns will be removed etc), but it can't hurt to have it registered anyway.

rzetterberg · 2021-01-29T13:05:02Z

Yeah, maybe that would only be interesting if we would introduce some sort of compression of the information, so that you'll see the compression ratio.

joejeroe · 2021-01-29T13:08:56Z

As the inventory grows it will be good to invest in some better data compression (in time), else we will be paying for a ton of storage on genomeDK.

pappewaio · 2021-02-05T21:17:37Z

True about the storage. Maybe we can add it to ibp-pipeline-db to compress the raw data as much as it can. I am not sure if I read this guide correctly or if gzip-9 compression is actually sort of best when considering memory footprint and decompression time?
https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
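One way to check this on our own data rather than relying on a general benchmark is a small comparison like the sketch below. It uses only Python's standard `gzip`, `bz2` and `lzma` modules; the sample text is synthetic and only loosely sumstat-like, so the numbers it prints say nothing about real files until it is pointed at one. It measures compression ratio and decompression time, not memory footprint.

```python
import bz2
import gzip
import lzma
import time

# Each entry maps a label to a (compress, decompress) pair.
CODECS = {
    "gzip-9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
    "bzip2-9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Return compression ratio and decompression time in seconds per codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        assert restored == data  # round trip must be lossless
        results[name] = {
            "ratio": len(data) / len(packed),
            "decompress_s": elapsed,
        }
    return results


# Synthetic, made-up rows standing in for a sumstat file.
rows = [
    "rs%d\t%d\t%d\tA\tG\t0.%04d\n" % (i, 1 + i % 22, 1000 + i, i % 10000)
    for i in range(20000)
]
sample = "".join(rows).encode()
results = benchmark(sample)
for name, stats in results.items():
    print(name, round(stats["ratio"], 2), round(stats["decompress_s"], 4))
```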

pappewaio · 2021-02-05T21:21:03Z

For the other stats discussion, many of your suggestions are sort of covered already. Maybe we could zoom one day, and I can show you what is available. Maybe we should relocate it so you can find it more easily :)

joejeroe · 2021-02-05T21:25:14Z

Sure, let's zoom next week 😀

rzetterberg · 2021-02-08T10:10:34Z

> gzip-9 compression is actually sort of best when considering memory footprint and decompression time?

Yes, I don't think we will come up with a better compression algorithm than gzip for text files.

> As the inventory grows it will be good to invest in some better data compression

In the quote above I think Joeri was referring to other means of data compression (other than using gzip on sumstats-files), such as creating a binary format for sumstats-files or using an RDBMS.

> For the other stats discussion, many of your suggestions are sort of covered already.

That's great! From my point of view, it's important that those stats are stored somewhere, so that you can compare stats from the latest version against historical versions, and that they are stored in a format that lets you do that programmatically.
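To make "programmatically" concrete, here is a minimal sketch under the assumption that each run appends one JSON record to a history file (JSON Lines). The file name, the field names (`version`, `total_seconds`, `max_rss_bytes`) and the values are all hypothetical; the pipeline's actual metrics files may look quite different.

```python
import json
import os
import tempfile


def append_run(history_path, record):
    """Append one run's metrics as a single JSON line."""
    with open(history_path, "a") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def compare_last_two(history_path, metric):
    """Return the change in `metric` between the two most recent runs, or None."""
    with open(history_path) as fh:
        runs = [json.loads(line) for line in fh if line.strip()]
    if len(runs) < 2:
        return None
    previous, latest = runs[-2], runs[-1]
    return {
        "previous": previous["version"],
        "latest": latest["version"],
        "delta": latest[metric] - previous[metric],
    }


# Made-up records for two pipeline versions.
history = os.path.join(tempfile.mkdtemp(), "metrics_history.jsonl")
append_run(history, {"version": "v1.0.0", "total_seconds": 120.0, "max_rss_bytes": 900})
append_run(history, {"version": "v1.1.0", "total_seconds": 95.0, "max_rss_bytes": 950})
diff = compare_last_two(history, "total_seconds")
print(diff)
```

An append-only text file keeps the history easy to diff and to load into other tools; a database would work just as well if the volume grows.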

rzetterberg · 2021-02-11T14:07:33Z

Jesper and I had a meeting about this, and the pipeline already produces metrics files with raw data for all the cases we discussed in this issue.

So in order to solve this issue we should implement:

Copying these files for each run to a central place
Improving the stepwise file with better step names
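The first item could look roughly like the sketch below. The function name `archive_metrics`, the run identifier and the file names are placeholders for illustration; it assumes the metrics files of a run sit directly in one directory and that the central place is an ordinary directory with one subfolder per run.

```python
import os
import shutil
import tempfile


def archive_metrics(run_dir, central_dir, run_id):
    """Copy every regular file in run_dir to central_dir/run_id and return the names copied."""
    target = os.path.join(central_dir, run_id)
    os.makedirs(target, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(run_dir)):
        source = os.path.join(run_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(target, name))  # copy2 keeps timestamps
            copied.append(name)
    return copied


# Demo with temporary directories and placeholder metrics files.
run_dir = tempfile.mkdtemp()
central_dir = tempfile.mkdtemp()
for name in ("trace.txt", "stepwise.txt"):
    with open(os.path.join(run_dir, name), "w") as fh:
        fh.write("placeholder\n")
copied = archive_metrics(run_dir, central_dir, "run_0001")
print(copied)
```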

pappewaio added this to To do in version_v1.x.x-beta Feb 1, 2021

pappewaio removed this from To do in version_v1.x.x-beta May 6, 2021

rzetterberg added the feature label Oct 29, 2021
