Hi,
I had to restart canu jobs a few times. Canu remembers the initial configuration at each step: the partitioning happens only once per pipeline step, and shell scripts are written to disk and re-run until all partitioned tasks complete. Each such task is bound to some number of CPU cores. Re-running a failed job on different hardware re-runs the same jobs, and canu sometimes decides to re-run the tasks with fewer CPU cores. But since the shell scripts were not re-created on disk (nor edited, e.g. with sed, to adjust the number of CPU cores), this can overload a machine. So canu inadvertently ran tasks expecting 14 cores to be used, but they kept using 64, because that was the configuration from the first run, hardcoded in the pre-existing shell scripts. In brief, I had to use the minThreads and maxThreads parameters to work around this while respecting the old number and size of partitions, just forcing them to use fewer CPUs. That was a long intro. If canu had used sed to edit the number of threads in the pre-existing shell scripts, that would have been enough for my situation; I have enough memory.
Now, I finally got to the oea step. It turns out these tasks are only single-threaded, and my minThreads=63 prevented more tasks from being executed in parallel. I realized that after a while, killed canu, and re-ran it without minThreads. Now 382 jobs are executed, each taking a single core (of 504 available), and I am happy.
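The arithmetic behind this (a sketch of how the concurrency figures in the tables below appear to be derived, not canu's actual code) is simply available CPUs divided by CPUs reserved per job:

```shell
# Sketch of the apparent rule: concurrent jobs = floor(total CPUs / CPUs
# reserved per job). Numbers taken from the logs below; not canu's code.
cpus_total=504

# With minThreads=63, each single-threaded oea job still reserves 63 CPUs.
echo "minThreads=63:  $(( cpus_total / 63 )) concurrent oea jobs"

# Without minThreads, each oea job reserves only the 1 CPU it can use.
echo "no minThreads: $(( cpus_total / 1 )) concurrent oea jobs"
```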
I foresee several ways to cope with this:

1. Print a warning message once oea starts, saying that minThreads > 1 and that this step could be sped up, since there are many partitions but each can use just a single core; forcing minThreads > 1 does not help here. Since it is very inconvenient to stop jobs, wait in the queue again, etc., I propose solution 2 below, or failing that, solution 3.
2. Alternatively, ignore minThreads in this step; it makes no sense there.
3. Actually, I think minThreads could probably be re-evaluated at each step, and if it is larger than what canu would pick on its own, it should just be ignored.
How re-executions evolved in my case:
canu useGrid=false minThreads=256 maxThreads=504 executiveMemory=32 executiveThreads=16
-- Detected 504 CPUs and 8000 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 504 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 256 CPUs x 1 job 64.000 GB 256 CPUs (k-mer counting)
-- Local: hap 16.000 GB 256 CPUs x 1 job 16.000 GB 256 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 256 CPUs x 1 job 64.000 GB 256 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 256 CPUs x 1 job 24.000 GB 256 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 256 CPUs x 1 job 24.000 GB 256 CPUs (overlap detection)
-- Local: cor -.--- GB 256 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 256 CPUs x 1 job 4.000 GB 256 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 256 CPUs x 1 job 32.000 GB 256 CPUs (overlap store sorting)
-- Local: red 64.000 GB 256 CPUs x 1 job 64.000 GB 256 CPUs (read error detection)
-- Local: oea 8.000 GB 256 CPUs x 1 job 8.000 GB 256 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 256 CPUs x 1 job 1024.000 GB 256 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 256 CPUs x - jobs -.--- GB - CPUs (consensus)
canu useGrid=false maxThreads=504
-- Detected 504 CPUs and 10000 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 504 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 8 CPUs x 63 jobs 4032.000 GB 504 CPUs (k-mer counting)
-- Local: hap 16.000 GB 63 CPUs x 8 jobs 128.000 GB 504 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 14 CPUs x 36 jobs 2304.000 GB 504 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: cor -.--- GB 4 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 1 CPU x 504 jobs 2016.000 GB 504 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 1 CPU x 312 jobs 9984.000 GB 312 CPUs (overlap store sorting)
-- Local: red 64.000 GB 9 CPUs x 56 jobs 3584.000 GB 504 CPUs (read error detection)
-- Local: oea 8.000 GB 1 CPU x 504 jobs 4032.000 GB 504 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 8 CPUs x - jobs -.--- GB - CPUs (consensus)
canu useGrid=false minThreads=64 maxThreads=504
-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 504 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (k-mer counting)
-- Local: hap 16.000 GB 64 CPUs x 7 jobs 112.000 GB 448 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 64 CPUs x 7 jobs 168.000 GB 448 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 64 CPUs x 7 jobs 168.000 GB 448 CPUs (overlap detection)
-- Local: cor -.--- GB 64 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 64 CPUs x 7 jobs 28.000 GB 448 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 64 CPUs x 7 jobs 224.000 GB 448 CPUs (overlap store sorting)
-- Local: red 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (read error detection)
-- Local: oea 8.000 GB 64 CPUs x 7 jobs 56.000 GB 448 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 64 CPUs x - jobs -.--- GB - CPUs (consensus)
canu useGrid=false minThreads=64 maxThreads=512
-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- WARNING: maxThreads=512 has no effect when only 504 CPUs present.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 512 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (k-mer counting)
-- Local: hap 16.000 GB 64 CPUs x 7 jobs 112.000 GB 448 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 64 CPUs x 7 jobs 168.000 GB 448 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 64 CPUs x 7 jobs 168.000 GB 448 CPUs (overlap detection)
-- Local: cor -.--- GB 64 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 64 CPUs x 7 jobs 28.000 GB 448 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 64 CPUs x 7 jobs 224.000 GB 448 CPUs (overlap store sorting)
-- Local: red 64.000 GB 64 CPUs x 7 jobs 448.000 GB 448 CPUs (read error detection)
-- Local: oea 8.000 GB 64 CPUs x 7 jobs 56.000 GB 448 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 64 CPUs x - jobs -.--- GB - CPUs (consensus)
canu useGrid=false minThreads=63 maxThreads=504
-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 504 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 63 CPUs x 8 jobs 512.000 GB 504 CPUs (k-mer counting)
-- Local: hap 16.000 GB 63 CPUs x 8 jobs 128.000 GB 504 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 63 CPUs x 8 jobs 512.000 GB 504 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 63 CPUs x 8 jobs 192.000 GB 504 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 63 CPUs x 8 jobs 192.000 GB 504 CPUs (overlap detection)
-- Local: cor -.--- GB 63 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 63 CPUs x 8 jobs 32.000 GB 504 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 63 CPUs x 8 jobs 256.000 GB 504 CPUs (overlap store sorting)
-- Local: red 64.000 GB 63 CPUs x 8 jobs 512.000 GB 504 CPUs (read error detection)
-- Local: oea 8.000 GB 63 CPUs x 8 jobs 64.000 GB 504 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 63 CPUs x - jobs -.--- GB - CPUs (consensus)
...
-- Running jobs. First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Tue Dec 7 08:09:45 2021 with 1149320.296 GB free disk space (352 processes; 8 concurrently)
canu useGrid=false maxThreads=504 ...
-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
-- 504 CPUs (maxThreads option).
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 64.000 GB 8 CPUs x 63 jobs 4032.000 GB 504 CPUs (k-mer counting)
-- Local: hap 16.000 GB 63 CPUs x 8 jobs 128.000 GB 504 CPUs (read-to-haplotype assignment)
-- Local: cormhap 64.000 GB 14 CPUs x 36 jobs 2304.000 GB 504 CPUs (overlap detection with mhap)
-- Local: obtovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: utgovl 24.000 GB 14 CPUs x 36 jobs 864.000 GB 504 CPUs (overlap detection)
-- Local: cor -.--- GB 4 CPUs x - jobs -.--- GB - CPUs (read correction)
-- Local: ovb 4.000 GB 1 CPU x 504 jobs 2016.000 GB 504 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 1 CPU x 314 jobs 10048.000 GB 314 CPUs (overlap store sorting)
-- Local: red 64.000 GB 9 CPUs x 56 jobs 3584.000 GB 504 CPUs (read error detection)
-- Local: oea 8.000 GB 1 CPU x 504 jobs 4032.000 GB 504 CPUs (overlap error adjustment)
-- Local: bat 1024.000 GB 64 CPUs x 1 job 1024.000 GB 64 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 8 CPUs x - jobs -.--- GB - CPUs (consensus)
...
-- Running jobs. First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Wed Dec 8 11:44:06 2021 with 1142226.585 GB free disk space (352 processes; 504 concurrently)
The issue is that options 2/3 wouldn't work in some cluster settings. We've often seen configurations that allocate a fixed amount of memory per core, so jobs like ovs or oea in canu, which use a good bit of memory but are single-threaded, would overload the memory on a machine. This is why we allow the user to set more than one thread for a single-core job; minThreads is a convenient override to ensure all jobs request at least that many CPUs (and thus that much memory). On local nodes, 2/3 would be OK, but for the most part canu is happy to let you shoot yourself in the foot like this.
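To illustrate the fixed memory-per-core situation described above (the numbers here are illustrative, not from any particular cluster):

```shell
# Sketch of the fixed-GB-per-core scheduling problem: a single-threaded,
# memory-heavy job must over-request cores to get enough memory.
# All values are invented for illustration.
gb_per_core=4       # the cluster grants 4 GB of memory per allocated core
job_mem_gb=32       # e.g. an ovs job wants 32 GB but uses only 1 thread

# Requesting 1 core would grant only 4 GB, so the job must request enough
# cores to cover its memory need, even though the extra cores sit idle.
cores_needed=$(( (job_mem_gb + gb_per_core - 1) / gb_per_core ))
echo "request $cores_needed cores to get ${job_mem_gb} GB"
```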
Hi Serge, OK, thank you for the explanation of why that is. Then I assume that at least printing the warning and pointing to this thread could save me, or somebody else, from wasting CPU cores next time. Although I would still prefer 2/3, plus an option to enforce the workaround for broken setups.