disk usage & re-run previously failed pipeline via cleansumstats.sh #256
Comments
(5) Is there a way to make the pipeline less hungry when it comes to disk usage, and perhaps disk IO? It's not a problem in my case, as we have plenty of disk space. One thing to consider as a long-term enhancement is whether several heavy IO operations could be chained with shell pipes instead of being treated as separate Nextflow steps. I assume Nextflow will always use disk to interoperate between tasks? I'm quite new to this, so I'm interested in discussing it to learn how things work. Just to illustrate with an example (which is not part of cleansumstats.sh, just something from my own code), I assume that a single pipeline of commands does everything "on the fly", more efficiently than making four intermediate files.
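A minimal sketch of that kind of chaining (the file names and the filter are hypothetical, not taken from cleansumstats): decompress, filter and sort run in one pipe, so no intermediate file touches disk and the stages overlap on separate CPUs.

```shell
# Hypothetical input: a two-column table with some missing values.
printf 'rs1\t7\nrs2\tNA\nrs3\t3\n' | gzip > input.gz

# Decompress, filter and sort in a single pipe: nothing intermediate
# is written to disk, and the stages run concurrently.
zcat input.gz \
  | awk -F'\t' '$2 != "NA"' \
  | sort -k2,2n \
  | gzip > filtered_sorted.gz

zcat filtered_sorted.gz | wc -l   # 2 rows survive the NA filter
```

Note that `sort` is the one stage that cannot truly stream: it has to see every row before it can emit the first one, which is exactly the bottleneck discussed below.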
This is an interesting discussion. Before I answer the specific questions, it is important to remember that the pipeline's fundamental philosophy is to reduce RAM usage by only reading line by line. One reason for that is the experience that R scripts don't scale for large GWAS files: you never know how much memory you need to give them, and the time it takes just to read the data into R is really long, which is quite disruptive. Imagine reading the whole dbSNP into R, ~600 million rows; that is just not feasible. For cleansumstats you should always be fine with <1 GB. The only reason to add more RAM is to make the sorting steps faster; they are the only real bottlenecks in the optimization of the pipeline.

RE 1) Yes, I can just add an option for the user to specify /tmp and /workdir. But maybe it is better to clean up the workdir on completion by default, and if you want to inspect the intermediate files you specify the --dev flag.

RE 2) I think most answers can be found in RE 1). The files in the workdir are not only from your failed runs.

RE 3) Yes, I don't think the -resume flag should be used except in special debug cases or prototyping. That said, -resume is quite safe to use: it checks whether any of the source files for the previous steps have been modified, and if so that step will be re-run. But for -resume to be useful, you have to intend to modify some of the intermediate steps, and I don't think that is a good use case 😄

RE 4) Failures are not covered in the docs, but we should add some scenarios to the FAQ (which doesn't exist yet). Your comments here will make a great start.
The good thing about skipping intermediate files is that both disk usage and time performance improve, at least if there are enough CPUs to serve the multiple threads of a streamed process. In one of the first versions of the pipeline I tried to stream across the different Nextflow processes, but that doesn't actually work: Nextflow wants every process to be distributable to any compute node, to run in isolation on that node, and then to send its results back to the main workflow. Otherwise, almost everything in cleansumstats could be streamed, with only the sorting steps requiring all rows to be visited every time. Please ask any follow-up questions you might have. I can later summarise our discussion and put it in the docs.
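On the workdir cleanup mentioned in RE 1): as far as I know, stock Nextflow already ships a switch for this, so one option would be to expose it rather than reimplement it. A sketch of the standard setting (not something cleansumstats currently sets):

```groovy
// nextflow.config: on successful completion of a run, delete all
// files in the work directory automatically.
cleanup = true
```

For runs that have already finished, `nextflow clean -f` (run from the launch directory) removes the work files of previous executions; `-n` does a dry run first.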
Great, it seems we agree on most stuff here.
When cleaning, sort doesn't need to use much memory, and it will use the default buffer size. I have added
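For reference, GNU sort exposes both knobs discussed here on the command line; a self-contained sketch (the file names are made up, and `$SCRATCH` is assumed to point at node-local scratch when set):

```shell
# Hypothetical two-column input.
printf 'rs2\tchr2\nrs1\tchr1\n' > variants.tsv

# Cap sort's in-memory buffer, and put its spill files on local
# scratch (falling back to /tmp) instead of the shared filesystem.
sort --buffer-size=1G --temporary-directory="${SCRATCH:-/tmp}" \
    -k1,1 variants.tsv > variants.sorted.tsv

head -n1 variants.sorted.tsv   # rs1 comes first after sorting
```

A larger `--buffer-size` means fewer temporary spill files and merge passes, which is why giving sort more RAM is the main speed lever in the pipeline.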
Hi,
I've triggered cleansumstats on ~300 sumstats files, in parallel on our SLURM cluster. About 70 of the files succeeded, but quite a few failed with `Disk quota exceeded`.

This is somewhat unlikely, as there is ~14 TB of free disk space available in our project. However, disk space might indeed be the issue, as my `cleansumstats/tmp/fake-home/work` folder has grown huge, 4.4 TB as of now.

I have a few questions:

(1) Is there a way for me to reconfigure `fake-home`, pointing it to the scratch area of my SLURM jobs (`$SCRATCH`)?

(2) With the default configuration that places Nextflow's files under `fake-home`, can we have an option to clean those intermediates upon successful completion? Or are they cleaned automatically, and all the files I have were left behind by failed runs?

(3) Upon failure, is there a way to resume from where it left off? I know Nextflow can resume a previously halted execution; the question is whether I can resume via calling `cleansumstats.sh`. I think it now starts from scratch every time. I mention it here for discussion; I'm not myself convinced that "resume" is a user-friendly feature for cleansumstats, as it's not clear how to use it for someone who doesn't really understand the internal sequence of commands within the pipeline. E.g. upon changing anything in the metadata I'd rather re-run from scratch. The same goes for weird technical errors like "disk quota exceeded": I don't know if it's safe to resume. As such, the files left behind aren't useful for resuming the pipeline. They can be useful for investigating why the pipeline failed, but it's good to know that I need to clean them up after I've investigated the crashes.

(4) Are the previous three questions covered somewhere in the documentation, e.g. the README file?