Add these notes to a FAQ or similar #300

pappewaio · 2022-02-08T12:46:31Z

This is an interesting discussion:

Before I answer the specific questions, it is important to remember that the pipeline's fundamental philosophy is to keep RAM usage low by only reading line by line. One reason for this is the experience that R scripts do not scale to large GWAS files: you never know how much memory you need to give them, and just reading the data into R takes a really long time. Imagine reading the whole of dbSNP into R, ~600 million rows; that is just not feasible. For cleansumstats you should always be fine with <1GB. The only reason to add more RAM is to make the sorting steps faster, as they are the only real bottlenecks in the pipeline.
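To illustrate the line-by-line idea, here is a minimal Python sketch (not the pipeline's actual code). The column layout, the `iter_rows` and `count_significant` names, and the threshold are assumptions made up for the example. Only one row is held in memory at a time, so the memory footprint does not grow with the file size.

```python
import io
from typing import Iterator, TextIO


def iter_rows(handle: TextIO, sep: str = "\t") -> Iterator[list[str]]:
    """Yield one parsed row at a time; the file is never loaded as a whole."""
    for line in handle:
        yield line.rstrip("\n").split(sep)


def count_significant(handle: TextIO, p_col: int = 2, threshold: float = 5e-8) -> int:
    """Count rows whose p-value (hypothetical column index) is below the threshold."""
    rows = iter_rows(handle)
    next(rows, None)  # skip the header line
    return sum(1 for row in rows if float(row[p_col]) < threshold)


# Tiny in-memory stand-in for a large summary statistics file (made-up data).
demo = io.StringIO("SNP\tBETA\tP\nrs1\t0.10\t1e-9\nrs2\t-0.02\t0.3\nrs3\t0.05\t4e-8\n")
print(count_significant(demo))
```

The same function works unchanged on a real file handle from `open(...)`, and memory use stays flat no matter how many rows the file has.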

RE 1) Yes, I can just add the option for the user to specify /tmp and /workdir in cleansumstats.sh. Right now the system /tmp is mounted as /tmp automatically, but not every system has its tmp space there, so a user option might be warranted. The reason why it has to be specified in cleansumstats.sh, and not only in nextflow.config, is that the system scratch has to be mounted before running the image.

But maybe it is better to clean up the workdir on completion by default, and if you want to check the intermediate files you specify the --dev flag.
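The proposed default could behave roughly like the Python sketch below. This is only an illustration of the idea, not how cleansumstats.sh does or will do it; `work_directory` and its `dev` parameter are hypothetical names standing in for the workdir handling and the --dev flag.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def work_directory(dev: bool = False) -> Iterator[Path]:
    """Create a scratch directory and remove it on successful completion.

    With dev=True the directory is kept so intermediate files can be inspected.
    If the body raises, the cleanup line is never reached, so the files of a
    failed run are also kept.
    """
    path = Path(tempfile.mkdtemp(prefix="workdir_"))
    yield path
    if not dev:
        shutil.rmtree(path)


with work_directory() as workdir:
    (workdir / "intermediate.txt").write_text("rs1\n")
print(workdir.exists())  # the directory is gone after a normal run
```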

RE 2) I think most answers can be found in RE 1). The files in the workdir are not only your failed runs.

RE 3) Yes, I don't think the -resume flag should be used except in special debug cases or when prototyping. That said, -resume is quite safe to use: it checks whether any of the source files for the previous steps have been modified, and if so, that step will be re-run. But for -resume to be useful, you have to intend to modify one of the intermediate steps, and I don't think that is a good use case 😄
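The principle behind this kind of resume check can be sketched in a few lines of Python. This is a simplified illustration and not Nextflow's implementation; `StepCache` and `fingerprint` are hypothetical names. A step is re-run only when the fingerprint of its input differs from one that has been seen before.

```python
import hashlib
from typing import Callable


def fingerprint(data: bytes) -> str:
    """Content hash used as a stand-in for 'has this input been modified?'."""
    return hashlib.sha256(data).hexdigest()


class StepCache:
    """Remember step results keyed by step name and input fingerprint."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str], bytes] = {}
        self.runs = 0  # how many times a step was actually executed

    def run(self, name: str, input_data: bytes, func: Callable[[bytes], bytes]) -> bytes:
        key = (name, fingerprint(input_data))
        if key not in self._results:  # unseen or modified input: execute the step
            self.runs += 1
            self._results[key] = func(input_data)
        return self._results[key]


cache = StepCache()
cache.run("upper", b"rs1\n", bytes.upper)  # first run: executed
cache.run("upper", b"rs1\n", bytes.upper)  # unchanged input: result reused
cache.run("upper", b"rs2\n", bytes.upper)  # modified input: executed again
print(cache.runs)
```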

RE 4) Failures are not covered in the docs, but we should add some scenarios to the FAQ (which doesn't exist yet). Your comments here will make a great start.

Yes, there is plenty of room to streamline things. This can be a long discussion though, because there are a lot of things to take into consideration, but my thoughts are briefly this: One problem when you put everything you want to do in one long streaming pipe system is that it can be very difficult to modify it or to test that it does the correct thing. So my philosophy has been to make sure I quickly understand what I am looking at when I need to update a process/function or fix a bug. Apart from the most crucial optimizations, I am saving the rest until after I have converted the code to DSL-2, because that opens up new doors when it comes to code readability, which in turn will make it possible to skip many intermediate files more efficiently.

The good thing with skipping intermediate files is that both disk usage and run time will improve, at least if there are enough CPUs to take care of the multiple threads of a stream process. In one of the first versions of the pipeline I tried to stream across the different nextflow processes, but that doesn't actually work, because nextflow wants every process to be distributable to any other compute node, run in isolation on that node, and have its output sent back to the main workflow. Otherwise, it would be possible to stream almost everything in cleansumstats, with only the sorting steps requiring all rows to be visited every time.
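The difference between streamable steps and the sorting steps can be shown with a small Python sketch. The step names and the row layout (chromosome, position, allele) are made up for the example and are not taken from the pipeline. The first two steps pass rows along one at a time without any intermediate file, while the sort has to consume every row before it can emit the first one.

```python
from typing import Iterable, Iterator

Row = tuple[str, int, str]  # hypothetical layout: chromosome, position, allele


def drop_missing(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming step: forward rows that contain no 'NA' field."""
    for row in rows:
        if "NA" not in row:
            yield row


def uppercase_alleles(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming step: normalise the allele column, one row at a time."""
    for chrom, pos, allele in rows:
        yield chrom, pos, allele.upper()


def sort_by_position(rows: Iterable[Row]) -> list[Row]:
    """Non-streaming step: sorting must see all rows before returning any."""
    return sorted(rows, key=lambda r: (r[0], r[1]))


raw = [("2", 500, "a"), ("1", 900, "NA"), ("1", 100, "g")]
streamed = uppercase_alleles(drop_missing(iter(raw)))  # nothing is computed yet
result = sort_by_position(streamed)  # the whole stream is pulled through here
print(result)
```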

Please ask any follow-up questions you might have. I can later summarise our discussion and put it in the docs.

Originally posted by @pappewaio in #256 (comment)
