
Can't run main.nf --help: "DAG file already exists"; Optimal cluster resource config; "Cannot connect to the Docker daemon" #18

Closed · AGI-chandler opened this issue Dec 7, 2022 · 21 comments
Assignees: proteinosome
Labels: bug (Something isn't working)

@AGI-chandler

I've had success with the installation up to this point. Nextflow-edge was installed by simply downloading and running the script, since we have the newer Java 19 on the system. nextflow run main.nf --download_db -profile docker was successful but then nextflow run main.nf --help was not, as you can see below. I tried adding -profile docker too, since I think I'm supposed to use that in every command now, right? The instructions were unclear. Either way, the error was the same.

$ nextflow run main.nf --download_db -profile docker
N E X T F L O W  ~  version 22.11.1-edge
Launching `main.nf` [determined_joliot] DSL2 - revision: 6990708c9f
No input file given to --input!

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 0
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 0
  Minimum number of reads required to keep any ASV: 0 
  Taxonomy sequence database for VSEARCH: ~/.local/src/pb-16S-nf/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: ~/.local/src/pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: ~/.local/src/pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: ~/.local/src/pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: ~/.local/src/pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 16
  Number of threads specified for DADA2: 8
  Number of threads specified for VSEARCH: 8
  Script location for HTML report generation: ~/.local/src/pb-16S-nf/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: true
  Version of Nextflow pipeline: 0.4

executor >  Local (1)
[00/9a39f6] process > pb16S:download_db [100%] 1 of 1 ✔
Completed at: 07-Dec-2022 13:23:21
Duration    : 2m 53s
CPU hours   : (a few seconds)
Succeeded   : 1

$ nextflow run main.nf --help                
N E X T F L O W  ~  version 22.11.1-edge
Launching `main.nf` [lethal_gutenberg] DSL2 - revision: 6990708c9f
DAG file already exists: ~/.local/src/pb-16S-nf/report_results/dag.html -- enable `dag.overwrite` in your config file to overwrite existing DAG files

$ nextflow run main.nf --help -profile docker
N E X T F L O W  ~  version 22.11.1-edge
Launching `main.nf` [awesome_brattain] DSL2 - revision: 6990708c9f
DAG file already exists: ~/.local/src/pb-16S-nf/report_results/dag.html -- enable `dag.overwrite` in your config file to overwrite existing DAG files

$ 
@proteinosome
Collaborator

Hi @AGI-chandler , thanks for opening the issue. I've never seen that issue, and I suspect it's a behaviour change in the new version of Nextflow. Can you try downgrading Nextflow to version 22.04 and see if this still happens?

And yes, if you're using Docker, -profile docker should be in the Nextflow command. It only really matters for nextflow run, though. I'll make it clearer in the documentation in the future. Thanks.

@proteinosome self-assigned this on Dec 7, 2022
@proteinosome added the bug (Something isn't working) label on Dec 7, 2022
@kevlim83

kevlim83 commented Dec 8, 2022

@AGI-chandler
Have a look at the nextflow.config file. I think this is because every time the Nextflow pipeline is called, the program will try to create a new folder named report_results. And since you had run the pipeline the first time with the --download_db parameter, this folder already exists.

A simple workaround is to delete or rename this report_results folder, and your subsequent command with the --help parameter should work.
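For example (using the path from the error message above; the .bak name is just a suggestion):

  $ mv ~/.local/src/pb-16S-nf/report_results ~/.local/src/pb-16S-nf/report_results.bak
  $ nextflow run main.nf --help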

@proteinosome
It seems the pipeline creates a few output folders: one that stores the status of the Nextflow runs, another directory that stores the actual results, and a work folder that stores the intermediates. I think there is a possibility to streamline this output process.

@proteinosome
Collaborator

Thanks @kevlim83, yes indeed that's what the error suggests. However, with my version of Nextflow (22.04), I actually can't reproduce it. The nextflow run main.nf --help command on my end does not create a report folder so I think it's a recent change in behaviour in Nextflow. Hence my suggestion for @AGI-chandler to try an older version.

Nextflow's output structure is that the results folder contains symlinks into the work folder. There's a way to make real copies instead of symlinks, but you end up with duplicates in the "work" folder. Of course I can also make it delete the work folder after everything is completed, and that's something I may implement in the future, but at the beta stage right now I prefer to keep it that way for troubleshooting purposes. Finally, the "report" folder contains timings and resource usage of the steps, not actual results.

I hear you, though, and will think about how best to make it more streamlined.

@AGI-chandler
Author

Thanks @proteinosome and @kevlim83

@proteinosome that old version of Nextflow only supports up to Java 17, but we have 19 installed with the java-latest-openjdk-headless-fastdebug package, so I don't really want to roll back so many packages.

It sounds like this folder creation is not such a big deal and I can delete it. After doing that, --help runs fine. I could also set this dag.overwrite option that is suggested; does that go in nextflow.config inside the dag settings:

dag {
  enabled = true
  file = "report_$params.outdir/dag.html"
  overwrite = true
}

?

I will move on to testing then, with our Slurm cluster. I updated the process settings:

process {
  executor = 'slurm'
  queue = 'defq'
}

but what else should be updated? Suppose we want to utilize our entire cluster, which has 5 nodes totaling 1152 CPUs and 4.5 TB of memory. Should I update any of the other settings?

Also, this is a new and interesting way to run something... we have some real data to run after testing is complete. nextflow is already in my PATH, so I can run that from our project directory, for example. Does Nextflow have a search path too, so it could find your main.nf when I run nextflow run main.nf ...? Or do I need to specify something like nextflow run ~/.local/git/pb-16S-nf/main.nf ...? Or do I need to run nextflow in the pb-16S-nf git dir and then specify the other options with our project dir, e.g.: nextflow run main.nf --input <projDir>/sample.tsv --metadata <projDir>/metadata.tsv -profile docker --outdir <projDir>/results?

Thanks


@proteinosome
Collaborator

Hi @AGI-chandler, gotcha. If that's not an issue for you then it should be fine. I'll make a note about this in the next release.

To your question: you have to specify the path of main.nf, so nextflow run ~/.local/git/pb-16S-nf/main.nf ... is the correct way to use it. You don't need to run the workflow inside the repo directory, just run it in your project directory. By default Nextflow uses the nextflow.config file in the repo directory, but if you copy that and put it in your project directory, it'll override the one in the repo directory. See the priority of the config here: https://www.nextflow.io/docs/latest/config.html

As for maximizing the cores, Nextflow handles the submission of each job according to the workflow, so e.g. if you have 99 samples, it'll submit 99 concurrent demultiplexing jobs. Some of the steps that make use of merged channels (or inputs, if that's more intuitive for you) will only run as a single job, and you can set the CPUs for that depending on how many CPUs can be allocated to any one job on your cluster. E.g. the VSEARCH CPUs can be set with --vsearch_cpu (see help). Right now, the DADA2 denoise step (--dada2_cpu) is the bottleneck step since it uses all the samples for denoising to maximize sensitivity, so that is where I would advise using as much CPU as you can. The other steps are pretty quick actually. I'm working on a version update that allows you to group the samples into similar types and denoise each group separately to speed it up, but that might take a while.
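For illustration only (the paths, CPU counts and <projDir> placeholders here are examples, not recommendations), a run that raises the CPUs for those two steps might look like:

  $ nextflow run ~/.local/git/pb-16S-nf/main.nf \
      --input <projDir>/sample.tsv --metadata <projDir>/metadata.tsv \
      --dada2_cpu 64 --vsearch_cpu 32 \
      -profile docker --outdir <projDir>/results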

In nextflow.config you can also find a block

  process {
    withLabel: cpu_def {
      cpus = 4
      memory = 16.GB
    }

    withLabel: cpu8 {
      cpus = 8
      memory = 32.GB
    }

    withLabel: cpu32 {
      cpus = 32
      memory = 128.GB
    }
  }

These are the default CPUs used by the different processes specified in main.nf, and you can increase them by changing e.g. cpus = 32 to cpus = 64 (don't change the label name, as main.nf refers to the processes via the label).

Hope that's clear, let me know if you run into any other trouble. Thanks!

@AGI-chandler
Author

AGI-chandler commented Dec 9, 2022

Well, we don't have that many samples! At least for our first run, only 15, since we had trouble with 1. I guess I'm still not clear on how the jobs for different steps are executed. For example, with SMRT Link I think there are basically 2 settings, nprocs and chunks or something, and we have nprocs=16 and chunks=72. So when a CCS job is run, for example, it'll get broken into 72 chunks (which are the subjobs that are submitted to Slurm), each one using 16 CPUs, which uses all the resources possible.

The config here is different though... when you say "how much CPU can be allocated to any job on your cluster", are you talking about a single job submitted to Slurm? In that case I believe it's 256, since that's the max CPUs per node. So would --vsearch_cpu 256 make sense then? And the same with --dada2_cpu 256? But if only 1 job is running, then that means the other 1152-256=896 CPUs are not being used? Or do multiple VSEARCH and DADA2 subjobs get launched? If each one uses 256 CPUs, then only 4 subjobs could run together. Since we limit our 5th node to 128 CPUs, none of those would get used. So would --vsearch_cpu 128 and --dada2_cpu 128 make more sense then? Then 9 subjobs could run together and all our resources would be used.

Likewise, I'm still not clear if I should increase the cpu32 label to cpus = 256 or more or less, but I assume I'll scale cpu_def and cpu8 by the same factor, meaning cpus = 32 and cpus = 64, respectively? I'm guessing memory should also be adjusted accordingly?

Sorry for all the questions! I'll probably disappear though once we get it tuned right... Thanks

@proteinosome
Collaborator

Hi @AGI-chandler, there's no "chunking" implemented in this pipeline, only the number of CPUs for each job.

Nextflow works in a way that job distribution is on a per-job basis. E.g. if you have 2 samples, and they both need to be demultiplexed, then Nextflow will submit 2 jobs in parallel to demultiplex each of those two samples. Say you now have a next job that requires merging the 2 demultiplexed samples; it will have to wait until the demux jobs have finished, then it'll submit a new job that takes the output of those two demux jobs.

The number of CPUs for each job is controlled via the config file or a command-line parameter. When you specify --dada2_cpu 256, the DADA2 job will request 256 CPUs. All the other jobs are controlled by the number of CPUs set in the label section I mentioned, so some of the jobs will use 8 CPUs by default, some 16, and some 32.

And yes, since there's no chunking, you will not be able to maximize your cluster's CPUs. To your question on whether using 128 CPUs would make more sense or not: it probably doesn't help, because the VSEARCH and DADA2 steps only run a single job at any time. So even if you submit a set of 384 samples, when it reaches the DADA2 step there will only be one job analyzing all 384 samples. This is something I'm working to optimize, but even then you should be able to analyze most data within a reasonable time with the default CPU allocation. On the GitHub main page you will see some benchmark timings I put down for your reference.

Yes, scaling the CPUs by the same factor should work well. And adjust the memory accordingly. I usually like 4 GB per CPU, but I really have not tested whether lower would be fine.

Just give it a shot with a test run and see what timing you get, then maybe double everything to see if it helps? Honestly, most steps are very, very fast, so I usually only adjust --dada2_cpu and leave the rest at default. That is the bottleneck step.

@AGI-chandler
Author

OK thanks, well, might as well try to maximize the usage as much as possible. I'll try with this in nextflow.config:

process {
  withLabel: cpu_def {
    cpus = 32
    memory = 128.GB
  }

  withLabel: cpu8 {
    cpus = 64
    memory = 256.GB
  }

  withLabel: cpu32 {
    cpus = 256
    memory = 1024.GB
  }
}

and with 256 CPUs for the 3 command-line options:
$ nextflow run main.nf --input test_data/test_sample.tsv --metadata test_data/test_metadata.tsv --dada2_cpu 256 --vsearch_cpu 256 --cutadapt_cpu 256 -profile docker --outdir results

Unfortunately I can't get the test to complete due to Docker errors. First I did not have Docker installed on the compute nodes. Then I did not have the docker-rootless scripts installed on the compute nodes.

But I'm still getting this error: docker: Cannot connect to the Docker daemon at unix:///run/user/10063/docker.sock. Is the docker daemon running?. YES, isn't that what the rootless scripts are for? systemctl --user status docker returns green marks and active (running) on the head node and all compute nodes! Plus the main system dockerd is running on the head node and all compute nodes!

I attached the full output of the above command. This time it suggests checking .command.out (it's empty), but in the past it has suggested checking .command.run or .command.sh too (are these just random suggestions?). Either way, I checked all the .command* files and didn't notice anything yet that might resolve this... I've been trying to get it to run for a couple of hours now. I'll keep working on it, but maybe you might have some insights?

nextflow.cmd+output.txt
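In case it's useful, one way to confirm whether each compute node can reach a Docker daemon from inside a Slurm job (the node names below are only placeholders) would be something like:

  $ for node in node01 node02 node03 node04 node05; do
      srun -N1 -n1 -w "$node" docker info --format '{{.ServerVersion}}' \
        && echo "$node: daemon reachable" || echo "$node: cannot reach daemon"
    done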

@AGI-chandler changed the title from Can't run main.nf --help: DAG file already exists to Can't run main.nf --help: "DAG file already exists"; Optimal cluster resource config; "Cannot connect to the Docker daemon" on Dec 12, 2022
@AGI-chandler
Author

Well, I'm still stumped and have been working on this most of the day. Unfortunately it's pretty important that we get this working ASAP.

Does dockerd need to be running on the compute nodes? This seems to cause some interference, because I couldn't even docker run hello-world from the head node until I ran systemctl --user restart docker. Then hello-world works, but pb-16S-nf is still failing, this time on the pb16S:QC_fastq command, output below. Does that mean pb16S:QC_fastq and others are being run on a compute node? Maybe there are some special Docker config steps I need to take so they can all run together on the cluster?

Error executing process > 'pb16S:QC_fastq (1)'

Caused by:
  Process `pb16S:QC_fastq (1)` terminated with an error exit status (125)

Command executed:

  seqkit fx2tab -j 64 -q --gc -l -H -n -i test_1000_reads.fastq.gz |    csvtk mutate2 -C '%' -t -n sample -e '"test_data"' > test_data.seqkit.readstats.tsv
  seqkit stats -T -j 64 -a test_1000_reads.fastq.gz |    csvtk mutate2 -C '%' -t -n sample -e '"test_data"' > test_data.seqkit.summarystats.tsv
  seqkit seq -j 64 --min-qual 20 test_1000_reads.fastq.gz --out-file test_data.filterQ20.fastq.gz
  echo -e "test_data	"$PWD"/test_data.filterQ20.fastq.gz" >> test_data_filtered.tsv

Command exit status:
  125

Command output:
  (empty)

Command error:
  docker: Cannot connect to the Docker daemon at unix:///run/user/10063/docker.sock. Is the docker daemon running?.
  See 'docker run --help'.

Work dir:
    ~/.local/src/pb-16S-nf/work/31/6a3d74907f3ba8e291049c340ff490

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

@proteinosome
Collaborator

Hi @AGI-chandler, yes, dockerd needs to run on the compute nodes, because Nextflow uses Docker to run the jobs on the compute nodes.
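One rootless-specific thing that may be worth checking (this is an assumption on my part, not something confirmed on your setup): the per-user systemd instance that owns /run/user/<uid>/docker.sock normally only exists while that user has a login session, so Slurm batch jobs may not see it unless lingering is enabled for the user on each node. Roughly:

  $ loginctl show-user "$USER" --property=Linger   # expect Linger=yes on every node
  $ sudo loginctl enable-linger "$USER"            # enable it where it is not
  $ echo "$DOCKER_HOST"                            # should point at the rootless socket, e.g. unix:///run/user/10063/docker.sock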

@AGI-chandler
Author

Ok thanks, that's part of the problem.  I figured out how to run dockerd in the foreground and for some reason it starts fine on the head node and 1 compute node but not the 3 remaining compute nodes.  Strange because the compute nodes are all booted from a single image... so I'll continue to troubleshoot that and possibly post on the docker forums if I get stuck.  Thanks for your patience!

@AGI-chandler
Author

@proteinosome Here are my latest observations. So far, I'm under the impression dockerd can't run in a clustered environment, but this is what I'll be looking into next... Have you ever run this software on a cluster before? Maybe we should be using one of the other installation options? Like just the normal way, without Docker or Singularity?

@AGI-chandler
Author

Docker swarm mode? Maybe that's what we need. Looking into this...

@AGI-chandler
Author

Docker swarm doesn't seem to be right either... it seems like a replacement for Slurm, so not right for this application.

I thought Docker was a new and improved way of running apps, but it seems to just complicate things in our case.

I went ahead and just removed -profile docker from the command and let it use conda...

...and after getting conda set up and configured on the compute nodes... this worked!
✅ Pipeline Completed with the test data.

Duration    : 9m 46s
CPU hours   : 24.0
Succeeded   : 20

I did get these warnings, though. Which log file is it talking about? I couldn't find any...

WARN: Failed to render execution report -- see the log file for details
WARN: Failed to render execution timeline -- see the log file for details

Now to test with our first real data...

@proteinosome
Collaborator

Hi @AGI-chandler, we have the Docker daemon running on all nodes of two internal Slurm clusters. All I did was install Docker on all nodes and get the daemon running on all of them, and that's about it. Docker and Singularity are both very popular and many HPC sites run one or the other.

As for the warnings about the execution report and timeline, I suspect it again has something to do with a behaviour change in the newer version of Nextflow. In the run directory there's usually a hidden .nextflow.log; maybe look into that and see if you can find any error messages. Nonetheless, those are just post-run reports and will not affect the results, so I wouldn't worry too much about it.

@AGI-chandler
Author

I see... then it must be a bug/limitation of the rootless Docker installation. I'm not sure why that installation method was chosen since it's been some years now... but I think it was so users could run Docker apps themselves, since I can't be involved every time a user wants to run an app. I guess with your app, this is a different use of Docker. I'm not yet aware of all the different ways Docker can be used. Supposing I could uninstall rootless Docker and go back to standard Docker root mode, what are the advantages of using that over the default conda execution of pb-16S-nf that I've got working?

Thanks, yes, I see the hidden .nextflow.log... the warnings were due to the reports already existing, probably from previous failed runs, so nothing to worry about.
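For future runs, the same overwrite switch mentioned in the earlier DAG error appears to exist for the other reports as well; a sketch for nextflow.config (assuming a recent Nextflow version supports these options):

  report {
    overwrite = true
  }
  timeline {
    overwrite = true
  }
  dag {
    overwrite = true
  }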

@proteinosome
Collaborator

You might be right. The rootless installation had some limitations when I tried it a long time ago, so I gave up and just installed it with root. I think on some HPC systems there's a concern about the security of letting users run Docker installed via root, so that might be the reason your cluster has the rootless version installed?

Either way, to your question: no, there's no difference in output between Docker mode and conda mode. In fact, I provided these different options precisely because different users run into different issues. I've had many users who ran into issues with Conda (see the other open issue, for example), and users like you who have difficulty getting Docker mode to run. The Docker container contains the same environment and software versions provisioned from the Conda environment yml file, so you can choose whichever works for you.

@AGI-chandler
Author

The pipeline was successful with our first set of real data! 🎉 🥳

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 15
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 1
  Minimum number of reads required to keep any ASV: 5 
  Taxonomy sequence database for VSEARCH: pb-16S-nf/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: null
  Number of threads specified for cutadapt: 256
  Number of threads specified for DADA2: 256
  Number of threads specified for VSEARCH: 256
  Script location for HTML report generation: pb-16S-nf/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: false
  Version of Nextflow pipeline: 0.4
Time         1d 6h 20m 21s
CPU-Hours    6,478.5

Will close the issue then.

@proteinosome
Collaborator

Congrats! That takes really long for just 15 samples, though. Are these environmental samples? Do you know on average how many reads there are per sample (these should be in the final HTML report in the results folder)? Those are the things that usually cause very long run times. I am wondering if using 256 CPUs somehow caused one of the steps to use too much memory and it started swapping. If you have some time, perhaps try the same set of samples with 32/64 CPUs and see if that changes anything.

@AGI-chandler
Author

Not sure what you mean by "environmental samples" but I'll find out.

As far as the reads go, here are the stats:
[chart: Reads per Barcode / Barcode Frequency]

To be honest, I had the pipeline running in the foreground for about 4 hours, I want to say, and then my desktop session crashed, which disconnected me from the server and killed the pipeline... it was running the DADA2 step; I'm not sure if that gets resumed in the middle or if it started from the beginning of that step again, but that might somewhat explain the long run time.
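For what it's worth, running it detached and with -resume (as the error tip earlier in the thread suggests) should protect against a dropped session; a minimal sketch, with the actual options elided:

  $ nohup nextflow run ~/.local/git/pb-16S-nf/main.nf <options as before> -resume > nextflow_run.log 2>&1 &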

The results don't indicate much memory was used for that step, there is 1 TiB available... dada2_denoise median physical memory usage=32G, median Virtual Memory Usage=307G

@dcopetti

Hello,
@AGI-chandler and I are working on this dataset; thank you both for getting the pipeline to run on our cluster!

The data we ran was from a single SMRT Cell 8M, which yielded 5.7 million reads (8.3 Gb). There were 15 samples from 4 different projects, with 3 or 4 samples each. Two projects were from environmental samples (soil and some kind of swamp, I think); the other two may have been human microbiomes, but I am not sure.
Do you think that when we have e.g. 96 samples in one SMRT Cell (so fewer reads per sample) the DADA2 step will take less time?
This run was just to test the infrastructure and software and to see the type of output. Next we will run the 4 projects separately.
Thanks,
Dario
