Can't run `main.nf --help`: "DAG file already exists"; Optimal cluster resource config; "Cannot connect to the Docker daemon" #18

I've had success with the installation up to this point. Nextflow-edge was installed by simply downloading and running the script, since we have the newer Java 19 on the system.

```
nextflow run main.nf --download_db -profile docker
```

was successful, but then

```
nextflow run main.nf --help
```

was not, as you can see below. I tried adding `-profile docker` too, since I think I'm supposed to use that in every command now, right? The instructions were unclear. Either way, the error was the same.

Comments
Hi @AGI-chandler, thanks for opening the issue. I've never seen that error, and I suspect it's a behaviour change in the new version of Nextflow. Can you try downgrading Nextflow to version 22.04 and see if this still happens? And yes, if you're using …
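(A note for readers: one way to try an older release without reinstalling is the `NXF_VER` variable understood by the `nextflow` launcher; the exact patch level below is an assumption.)

```
# Ask the launcher to fetch and run a specific Nextflow release
# (22.04.5 is just an example patch level)
NXF_VER=22.04.5 nextflow run main.nf --help
```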
@AGI-chandler A simple workaround is to delete or rename this report/results folder, and re-run your subsequent command with the … parameter. @proteinosome …
Thanks @kevlim83, yes indeed that's what the error suggests. However, with my version of Nextflow (22.04), I actually can't reproduce it. The Nextflow output structure is that the results folder contains symlinks into the work folder. There's a way to make real copies instead of symlinks, but you end up with duplicates in the "work" folder. Of course I could also make it delete the work folder after everything is completed, and that's something I may implement in the future, but at the beta stage right now I prefer to keep it that way for troubleshooting purposes. Finally, the "report" folder contains timings and resource usage of the steps, not actual results. I hear you, though, and will think about how best to make it more streamlined.
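(For context, the symlink-vs-copy behaviour comes from Nextflow's `publishDir` directive; a minimal toy sketch, with a hypothetical `params.outdir` and output name, not the pipeline's actual code:)

```
process EXAMPLE {
    // 'symlink' links results back into work/ (the behaviour described
    // above); 'copy' writes real files but duplicates work/ contents.
    publishDir "${params.outdir}/results", mode: 'copy'  // or 'symlink'

    output:
    path "out.txt"

    script:
    """
    echo done > out.txt
    """
}
```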
Thanks @proteinosome and @kevlim83. @proteinosome, that old version of Nextflow only supports up to Java 17, but we have 19 installed with …. It sounds like this folder creation is not such a big deal and I can delete it. After doing that, the …? I will move on to testing then, with our Slurm. I updated the …, but what else should be updated? Suppose we want to utilize our entire cluster, which has 5 nodes totaling 1152 CPUs and 4.5 TB of memory. Should I update any of the other settings? Also, this is a new and interesting way to run something... we have some real data to run on after testing is complete. Thanks
Hi @AGI-chandler, gotcha. If that's not an issue for you then it should be fine. I'll make a note in the next release about this. To your question, you have to specify the path of ….

As for maximizing the cores, Nextflow handles the submission of each job according to the workflow, so e.g. if you have 99 samples, it'll submit 99 concurrent demultiplexing jobs. Some of the steps that make use of merged channels (or inputs, if that's more intuitive for you) will only run as a single job, and you can set the CPUs for those depending on how many CPUs can be allocated to any one job on your cluster. E.g. the VSEARCH CPUs can be set with …. In …, these are the default CPUs used by the different processes: …

Hope that's clear; let me know if you run into any other trouble. Thanks!
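(A sketch of what such an invocation could look like. `--vsearch_cpu` and `--dada2_cpu` match options I believe the pipeline's `--help` lists, but the values and file names are placeholders, and the `slurm` profile is an assumption; some setups instead set `process.executor = 'slurm'` in `nextflow.config`.)

```
# Hypothetical run on a Slurm cluster; verify parameter names
# against `nextflow run main.nf --help` for your version
nextflow run main.nf \
    --input samples.tsv --metadata metadata.tsv \
    --vsearch_cpu 64 --dada2_cpu 64 \
    -profile slurm
```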
Well, we don't have that many samples! At least for our first run, only 15, since we had trouble with 1. I guess I'm still not clear on how the jobs for different steps are executed. For example, with SMRT Link I think there are basically 2 settings, nprocs and chunks or something, and we have nprocs=16 and chunks=72. So when a CCS job is run, for example, it'll get broken into 72 chunks (which are the subjobs that are submitted to Slurm), each one using 16 CPUs, which uses all the resources possible. The config here is different, though... when you say "how much CPU can be allocated to any job on your cluster", are you talking about a single job submitted to Slurm? In that case I believe it's 256, since that's the max CPUs per node. So would …? Likewise, I'm still not clear whether I should increase …. Sorry for all the questions! I'll probably disappear though once we get it tuned right... Thanks
Hi @AGI-chandler, there's no "chunking" implemented in this pipeline, only a number of CPUs for each job. Nextflow works in a way that job distribution is on a per-job basis. E.g. if you have 2 samples, and they both need to be demultiplexed, then Nextflow will submit 2 jobs in parallel to demultiplex each of these two samples. Say you then have a next job that requires merging the 2 demultiplexed samples: it will have to wait until the demux jobs have finished, then it'll submit a new job that takes the outputs of those two demux jobs as input. The number of CPUs for each job is controlled via the config file or a command-line parameter. When you specify ….

And yes, since there's no chunking, you will not be able to maximize your cluster CPUs. To your question on whether using 128 CPUs would make more sense or not, it probably doesn't help, because the VSEARCH and DADA2 steps only run a single job at any time. So even if you submit a set of 384 samples, when it reaches the DADA2 step there will only be one job analyzing all 384 samples. This is something I'm working to optimize, but even then you should be able to analyze most data within a reasonable time with the default CPU allocation. On the GitHub main page you will see some benchmark timings I put down for your reference.

Yes, scaling the CPUs by the same factor should work well. And adjust the memory accordingly. I usually like 4 GB per CPU, but I really have not tested whether lower would be fine. Just give it a shot with a test run and see what timing you get, then maybe double everything to see if it helps? Honestly, most steps are very, very fast, so I usually only adjust …
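(To make the per-job model concrete, here is a toy Nextflow DSL2 sketch, not the pipeline's actual code: one DEMUX task is submitted per sample, while the merged step runs as a single job whose `cpus` directive is the only tuning knob.)

```
// Toy example only: Nextflow submits one DEMUX task per sample in the
// channel, then a single MERGE job once all DEMUX tasks have finished.
nextflow.enable.dsl = 2

process DEMUX {
    cpus 4                      // per-job CPU request sent to Slurm

    input:
    val sample

    output:
    path "${sample}.demux.txt"

    script:
    """
    echo ${sample} > ${sample}.demux.txt
    """
}

process MERGE {
    cpus 64                     // one big job for the merged step

    input:
    path demuxed

    output:
    path "merged.txt"

    script:
    """
    cat ${demuxed} > merged.txt
    """
}

workflow {
    samples = Channel.of('s1', 's2')
    demuxed = DEMUX(samples)
    MERGE(demuxed.collect())    // waits for all per-sample jobs
}
```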
Ok thanks, well, might as well try maximizing the usage as much as possible. I'll try with this in …

and with ….

Unfortunately I can't get the test completed due to Docker errors. First I did not have Docker installed on the compute nodes. Then I did not have the docker-rootless scripts installed on the compute nodes. But I'm still getting this error: …

I attached the full output of the above command. This time it suggests checking `.command.out` (it's empty), but in the past it has suggested checking `.command.run` or `.command.sh` too (are these just generic suggestions?). Either way, I checked all the `.command*` files and didn't notice anything yet that might resolve this... I've been trying to get it to run for a couple of hours now. I'll keep working on it, but maybe you might have some insights?
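(Some generic checks for "Cannot connect to the Docker daemon" under a rootless install, assuming the standard rootless socket location; paths may differ on your nodes:)

```
# Is the per-user daemon running on this node?
systemctl --user status docker

# Rootless Docker listens on the user's runtime socket,
# not /var/run/docker.sock
echo $XDG_RUNTIME_DIR
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

# Sanity check that the client can now reach the daemon
docker info
```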
Well, I'm still stumped and have been working on this most of the day. Unfortunately it's pretty important that we get this working ASAP. Does dockerd need to be running on the compute nodes? This seems to cause some interference, because I couldn't even …
Hi @AGI-chandler, yes, …
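(For a root-installed daemon across a cluster, the gist is just enabling the service on every node; `pdsh` and the node list here are placeholders for whatever parallel shell your site uses:)

```
# Enable and start dockerd on each compute node
pdsh -w node[01-05] 'sudo systemctl enable --now docker'

# Verify from inside a Slurm allocation that the daemon is reachable
srun -N1 docker info
```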
Ok thanks, that's part of the problem. I figured out how to run …
@proteinosome Here are my latest observations. So far, I'm under the impression …
Docker swarm mode? Maybe that's what we need. Looking into this...
Docker swarm doesn't seem to be right either... it seems like a replacement for Slurm, so not right for this application. I thought Docker was a new and improved way of running apps, but it seems to just complicate things in our case. I went ahead and just removed …, and after getting Conda set up and configured on the compute nodes... this worked!
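(The command that worked was stripped from the extract; a hedged sketch of the Conda route is below. The input file names are placeholders, and whether Conda is the default or needs an explicit profile depends on the pipeline version, so verify against the pipeline's `nextflow.config`.)

```
# Run without Docker and let Nextflow manage Conda environments
# (some versions of pb-16S-nf expose this as a profile; check
# nextflow.config to be sure)
nextflow run main.nf --input samples.tsv --metadata metadata.tsv -profile conda
```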
I did get these warnings, though: … Which log file is it talking about? I couldn't find any...
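(On the missing log: Nextflow writes a hidden `.nextflow.log` in the directory you launch from, rotating older runs to numbered suffixes; a quick way to find the warning's source:)

```
# The launcher log is hidden in the launch directory (rotated per run)
ls -la .nextflow.log*

# Look at the most recent entries, including any warning details
tail -n 50 .nextflow.log
```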
Now to test with our first real data...
Hi @AGI-chandler, we run the Docker daemon on all nodes of two internal Slurm clusters. All I did was install Docker on all nodes and get the daemon running on all of them, and that's about it. Docker and Singularity are both very popular, and many HPC systems run one or the other. As for the warning about the execution report and timeline, I suspect it again has something to do with a behaviour change in a newer version of Nextflow. In the run directory there's usually a hidden …
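(If those warnings concern report/timeline/DAG files that already exist, recent Nextflow releases refuse to overwrite them by default; as far as I know these `overwrite` options were added around Nextflow 22.10, so treat this `nextflow.config` sketch as version-dependent:)

```
// In nextflow.config: let re-runs overwrite earlier trace artifacts
report.overwrite   = true
timeline.overwrite = true
dag.overwrite      = true
```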
I see... then it must be a bug/limitation of the Docker rootless installation. I'm not sure why that installation method was chosen, since it's been some years now... but I think it was so users could run Docker apps themselves, since I can't be involved every time a user wants to run an app. I guess with your app this is a different use of Docker; I'm not yet aware of all the different ways Docker can be used. Supposing I could uninstall Docker rootless and go back to standard root-mode Docker, what are the advantages of using that over the default Conda execution of pb-16S-nf that I've got working? Thanks, yes, I see the hidden …
You might be right. The rootless installation had some limitations when I tried it a long time ago, so I gave up and just installed it with root. I think on some HPC systems there's always concern about the security of letting users run Docker installed via root, so that might be a reason why your cluster has rootless installed? Either way, to your question: no, there's no difference in output between the …
The pipeline was successful with our first set of real data! 🎉 🥳
Will close the issue then.
Congrats! That takes really long for just 15 samples, though. Are these environmental samples? Do you know on average how many reads there are per sample (these should be in the final HTML report in the results folder)? Those are the ones that usually cause very long run times. I am wondering if using 256 CPUs somehow caused one of the steps to use too much memory and it started swapping. If you have some time, perhaps try the same set of samples with 32/64 CPUs and see if that changes anything.
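(A sketch of that re-test; `-resume` is standard Nextflow and reuses cached task results, though lowering the CPU settings will re-run the steps they affect. Parameter and file names are carried over from the earlier sketch, not verified against your setup:)

```
# Re-run the same samples with a smaller per-job CPU allocation
nextflow run main.nf \
    --input samples.tsv --metadata metadata.tsv \
    --vsearch_cpu 32 --dada2_cpu 32 \
    -resume
```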
Hello, the data we ran was from a single SMRT Cell 8M that yielded 5.7 million reads (8.3 Gb). There were 15 samples from 4 different projects, with 3 or 4 samples each. Two projects were from environmental samples (soil and some kind of swamp, I think); the other two may have been human microbiomes, but I am not sure.