
disk usage & re-run previously failed pipeline via cleansumstats.sh #256

Closed
ofrei opened this issue Dec 12, 2021 · 5 comments

ofrei (Collaborator) commented Dec 12, 2021

Hi,
I've triggered cleansumstats on ~300 sumstats files in parallel on our SLURM cluster. About 70 of the files succeeded, but quite a few failed with "Disk quota exceeded":
[image]

This is somewhat unlikely, as there is ~14 TB of free disk space available in our project:
[image]

However disk space indeed might be an issue as my cleansumstats/tmp/fake-home/work folder has grown huge, 4.4 TB as of now:
[image]

I have a few questions:

(1) Is there a way for me to reconfigure fake-home pointing it to the scratch area of my SLURM jobs ($SCRATCH)?

(2) With the default configuration that places nextflow's files under fake-home, can we have an option of cleaning those intermediate files upon successful completion? Or are they cleaned automatically, and are all the files I have left behind by failed runs?

(3) Upon failure, is there a way to resume from where it left off? I know nextflow can resume a previously halted execution - the question is whether I can resume by calling cleansumstats.sh, which I think currently starts from scratch every time. I mention it here for discussion - I'm not convinced myself that "resume" is a user-friendly feature for cleansumstats, as it's not clear how to use it for someone who doesn't really understand the internal sequence of commands within the pipeline; e.g. upon changing anything in the meta-data I'd rather re-run from scratch. The same goes for odd technical errors like "disk quota exceeded" - I don't know if it's safe to resume. As such, the files left behind aren't useful for resuming the pipeline. They can be useful for investigating why the pipeline failed, but it's good to know that I need to clean them up afterwards, once I've investigated the crashes.

(4) Are the previous three questions covered somewhere in the documentation, e.g. the readme file?

ofrei (Collaborator, Author) commented Dec 12, 2021

(5) Is there a way to make the pipeline less hungry in terms of disk usage, and perhaps disk IO? It's not a problem in my case, as we have plenty of disk space (especially if we use $SCRATCH) and my cluster has a very efficient file system. But this might be a problem on systems where storage is something like Synology NAS drives - they are quite good, but can still quickly become an IO bottleneck.

One thing to consider as a long-term enhancement is whether it's possible to chain several heavy IO operations using shell pipes, instead of treating them as separate nextflow steps. I assume nextflow will always use the disk to pass data between tasks? I'm quite new to this, so I'm interested in discussing it to learn how things work.

Just to illustrate with an example (not part of cleansumstats.sh - just something from my own code), I assume that the following does everything "on the fly", more efficiently than creating four intermediate files (two for zcat, one for cut, and one more for paste):

paste <(zcat cleaned_GRCh37.gz) \
      <(cut -f 5- <(zcat cleaned_GRCh38.gz ) ) | \
gzip > cleaned_GRCh37.sumstats.gz

ofrei (Collaborator, Author) commented Dec 12, 2021

I figured out my "Disk quota exceeded" - it is not caused by running out of disk space, but by a limit on the number of files created in my project. There are ~114,000 files under the fake-home folder. After removing those files I can run the pipeline again.
[image]
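A quick way to check for this kind of file-count (inode) quota problem is to count the files under the work directory. A self-contained sketch (demonstrated on a throwaway directory; for cleansumstats the path to check would be cleansumstats/tmp/fake-home/work):

```shell
# Sketch: count files under a work directory to check against a
# file-count (inode) quota. A temp directory stands in for
# cleansumstats/tmp/fake-home/work so the snippet is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a" "$workdir/b" "$workdir/c"
count=$(find "$workdir" -type f | wc -l)
echo "$count"  # → 3
rm -rf "$workdir"   # intermediate files can be removed after a successful run
```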

But can we discuss the other questions, as the pipeline's disk usage is fairly extensive?

pappewaio (Contributor) commented:

This is an interesting discussion:

Before I answer the specific questions, it is important to remember that the pipeline's fundamental philosophy is to reduce RAM usage by only reading line by line. One reason for that is the experience that R scripts are not scalable for large GWAS files: you never know how much memory you need to give them, and the time it takes just to read the data into R is really long, which is quite disruptive. Imagine reading the whole dbSNP into R, ~600 million rows - that is just not feasible. For cleansumstats you should always be fine with <1 GB. The only reason to add more RAM is to make the sorting steps faster; they are the only real bottleneck in the optimization of the pipeline.

RE 1) Yes, I can add an option for the user to specify /tmp and /workdir in cleansumstats.sh. Right now /tmp automatically mounts the system /tmp, but not every system has its tmp space there, so a user option might be warranted. The reason it has to be specified in cleansumstats.sh, and not only in nextflow.config, is that the system scratch has to be mounted before running the image.

But maybe it's better to clean up the workdir on completion by default, and if you want to inspect the intermediate files you specify the --dev flag.
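A minimal sketch of how such a mount could work, assuming a Singularity image (the image name and paths are illustrative, not actual cleansumstats options; only Singularity's real -B bind flag is assumed):

```shell
# Hypothetical sketch (not an actual cleansumstats option): bind a SLURM
# scratch area over the container's /tmp via Singularity's -B flag, so
# the mount is in place before the image runs. Image name is illustrative.
SCRATCH="${SCRATCH:-/tmp}"   # SLURM typically exports $SCRATCH per job
cmd="singularity run -B ${SCRATCH}:/tmp cleansumstats.sif"
echo "$cmd"
```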

RE 2) I think most answers can be found in RE 1). The files in the workdir are not only from your failed runs.

RE 3) Yes, I don't think the -resume flag should be used except in special debug cases or for prototyping. That said, -resume is quite safe to use: it checks whether any of the source files for the previous steps have been modified, and if so, that step will be re-run. But for -resume to be useful, you have to have an intent to modify some of the intermediate steps, and I don't think that is a good use case 😄

RE 4) Failures are not covered in the docs, but we should add some scenarios to the FAQ (which doesn't exist yet). Your comments here will make a great start.

RE 5) Yes, there is plenty of room to streamline things. This can be a long discussion, though, because there is a lot to take into consideration, but briefly my thoughts are these: one problem with putting everything you want to do into one long streaming pipe is that it can be very difficult to modify, or to test whether it does the correct thing. So my philosophy has been to make sure I quickly understand what I am looking at when I need to update a process/function or fix a bug. Except for the most crucial optimizations, I'm saving the others for after I have converted the code to DSL-2, because that opens new doors when it comes to code readability, which in turn will make it possible to skip many intermediate files more efficiently.

The good thing about skipping intermediate files is that both disk usage and time performance improve, at least if there are enough CPUs to take care of the multiple threads of a streaming process. In one of the first versions of the pipeline I tried to stream across the different nextflow processes, but that doesn't actually work, because nextflow wants every process to be distributable to any other compute node, run in isolation there, and then sent back to the main workflow. Otherwise, it would be possible to stream almost everything in cleansumstats, with only the sorting steps requiring all rows to be visited every time.
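To make the sorting point concrete: line-by-line filters stream with constant memory, while sort must see all rows before it can emit the first output line. A toy sketch (illustrative data, not cleansumstats code):

```shell
# Toy sketch (not cleansumstats code): awk streams one line at a time
# with constant memory; sort must buffer (or spill) all surviving rows
# before producing any output.
out=$(printf 'chr2\t5\nchr1\t3\nchr1\t9\n' \
  | awk -F'\t' '$2 > 4' \
  | sort -k1,1 -k2,2n \
  | cut -f1 | paste -sd' ' -)
echo "$out"  # → chr1 chr2
```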

Please ask any follow-up questions you might have. I can later summarise our discussion and put in the docs.

ofrei (Collaborator, Author) commented Dec 22, 2021

  1. I'd love an option for pointing temp to a custom location - I'll use it to mount the $SCRATCH area of my SLURM jobs; I'm not sure if the system's /tmp on my cluster is large enough (it could perhaps be as low as a few MB). I also like the idea of cleaning up, and of using the --dev flag if someone needs to investigate failures.
  2. ok
  3. Agreed - users should run the pipeline from scratch on failure; I think this is reasonable.
  4. I didn't mean documenting error codes - the best source for that info is the GitHub issues page :) But it would be good to mention the fake-home folder and the ability to customize it, to describe the clean-up behaviour (automatic by the pipeline vs. to be done by the user), and to clarify that even though it is built on nextflow, it is not intended for 'resume on failure' usage.
  5. Good to know that my thinking about streaming via pipes (|) is reasonable, but that there are important design considerations as well as practical limitations of nextflow. IO performance isn't a problem in my case - I'm really happy with our cluster's IO (both in terms of throughput and small IO). I think a good start is to mention this in the documentation, i.e. a resource usage section estimating the typical disk space and number of temporary files created for an average summary stats file. The rationale for limiting memory could also go there. Btw, is there a way to control how much memory the sort is allowed to use? Or will it by default use whatever is available?

pappewaio (Contributor) commented Dec 23, 2021

Great, it seems we agree on most stuff here.

> 5. Btw, is there a way to control how much memory the sort is allowed to use? Or will it by default use whatever is available?

When cleaning, sort doesn't need much memory; it will use the default --buffer-size, which is calculated on the fly. For many of the sorts in the cleaning, adding --buffer-size won't have an effect, as sort is used within a pipe ("|") and only a few MB are being used. So far no one has experienced problems with the default settings. It might be worth adding parallelisation and --buffer-size as options, though - that would likely make things run much faster, provided more CPUs are available.

I have added sort --buffer-size=20G for the sorting during the dbSNP-specific preparation; I think that was because adding parallelisation broke the default buffer calculation. It should be added as an option in nextflow.config, and the behaviour should be explained somewhere.

--buffer-size is explained well in this Stack Overflow post:
https://stackoverflow.com/questions/37514283/gnu-sort-default-buffer-size
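As a concrete illustration (values are illustrative; in the pipeline only the dbSNP preparation sets --buffer-size=20G), GNU sort's memory and thread usage can be bounded explicitly:

```shell
# Illustrative: cap GNU sort's in-memory buffer and thread count.
# When input exceeds --buffer-size, sort spills sorted runs to its temp
# directory and merges them, keeping memory bounded at the cost of extra
# disk IO (sort -T /path/to/scratch redirects those spill files).
out=$(printf '3\n1\n2\n' | sort --buffer-size=1M --parallel=2 -n | paste -sd' ' -)
echo "$out"  # → 1 2 3
```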

@pappewaio pappewaio added this to To do in Release-v1.3.x via automation Feb 4, 2022
@pappewaio pappewaio moved this from To do to In progress in Release-v1.3.x Feb 8, 2022
pappewaio added a commit that referenced this issue Feb 8, 2022
pappewaio added a commit that referenced this issue Feb 8, 2022
Release-v1.3.x automation moved this from In progress to Done Feb 8, 2022