oea should warn if minThreads is set > 1 #2062

Open
mmokrejs opened this issue Dec 8, 2021 · 2 comments

Comments

@mmokrejs

mmokrejs commented Dec 8, 2021

Hi,
I had to restart canu jobs a few times. canu remembers the initial configuration of each step, because partitioning happens only once per pipeline step and writes shell scripts to disk, which are then run until all partitioned tasks complete. Each such task is bound to a certain number of CPU cores. Re-running a failed job on different hardware re-runs the same tasks, while canu sometimes decides to re-run them with fewer CPU cores. But because the shell scripts are not re-created on disk (and sed is not used to adjust the number of CPU cores in them), this can overload the machine. So canu inadvertently ran tasks expecting them to use 14 cores, but they kept using 64, as that was the configuration from the first run and was hardcoded in the pre-existing shell scripts. In brief, I had to use the minThreads and maxThreads parameters to work around this while respecting the old number and size of partitions, just forcing them to use fewer CPUs. That was a long intro. If canu had used sed to edit the number of threads in the pre-existing shell scripts, that would have been enough for my situation. I have plenty of memory.

Now I finally got to the oea step. It turns out the tasks are single-threaded, and my minThreads=63 prevents more tasks from running in parallel. I realized that after a while, killed canu, and re-ran it without minThreads. Now 382 jobs are executed, each taking a single core (of 504 available), and I am happy.

I see several ways to deal with this:

  1. Print a warning when oea starts if minThreads is > 1, saying that this step can be sped up by running more partitions at once, since each takes just a single core. Forcing minThreads > 1 does not help here. Since it is very inconvenient to stop jobs and wait in the queue again, I would prefer solution 2 below, or else solution 3.
  2. Alternatively, ignore minThreads in this step; it makes no sense there.
  3. Actually, I think minThreads could be re-evaluated at each step, and if it is larger than what canu would pick on its own, it should simply be ignored.
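
Options 1 and 3 above could be sketched roughly as follows. This is a hypothetical Python illustration, not canu's actual code: the function name `resolve_threads`, its parameters, and the wording of the warnings are all assumptions made for the example.

```python
from typing import Optional, Tuple


def resolve_threads(step: str,
                    auto_threads: int,
                    min_threads: Optional[int],
                    ignore_oversized: bool = False) -> Tuple[int, Optional[str]]:
    """Return (threads to use, optional warning text) for one pipeline step.

    step             -- step tag, e.g. "oea" (used only in the message)
    auto_threads     -- what the pipeline would pick on its own for this step
    min_threads      -- the user's minThreads value, or None if not set
    ignore_oversized -- hypothetical switch: True models option 3 (ignore
                        minThreads when it exceeds the automatic choice),
                        False models option 1 (obey it, but warn)
    """
    # minThreads is a floor: if the automatic choice already meets it,
    # nothing changes and nothing needs to be said.
    if min_threads is None or min_threads <= auto_threads:
        return auto_threads, None

    if ignore_oversized:
        msg = (f"-- WARNING: ignoring minThreads={min_threads} for '{step}'; "
               f"using {auto_threads} thread(s) per job.")
        return auto_threads, msg

    msg = (f"-- WARNING: minThreads={min_threads} forces '{step}' jobs to reserve "
           f"{min_threads} threads, but each job would use only {auto_threads}.")
    return min_threads, msg


if __name__ == "__main__":
    # oea is single-threaded, so minThreads=63 only reduces concurrency.
    print(resolve_threads("oea", auto_threads=1, min_threads=63))
    print(resolve_threads("oea", auto_threads=1, min_threads=63, ignore_oversized=True))
    # bat would pick 64 threads on its own, so minThreads=63 changes nothing.
    print(resolve_threads("bat", auto_threads=64, min_threads=63))
```

The `ignore_oversized` switch stands in for the choice between warning only and ignoring the setting; whether canu should expose such a switch is exactly what this issue is asking.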

How the re-executions evolved in my case:

canu useGrid=false minThreads=256 maxThreads=504 executiveMemory=32 executiveThreads=16 

-- Detected 504 CPUs and 8000 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    504 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB  256 CPUs x   1 job     64.000 GB 256 CPUs  (k-mer counting)
-- Local: hap       16.000 GB  256 CPUs x   1 job     16.000 GB 256 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB  256 CPUs x   1 job     64.000 GB 256 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB  256 CPUs x   1 job     24.000 GB 256 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB  256 CPUs x   1 job     24.000 GB 256 CPUs  (overlap detection)
-- Local: cor        -.--- GB  256 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB  256 CPUs x   1 job      4.000 GB 256 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB  256 CPUs x   1 job     32.000 GB 256 CPUs  (overlap store sorting)
-- Local: red       64.000 GB  256 CPUs x   1 job     64.000 GB 256 CPUs  (read error detection)
-- Local: oea        8.000 GB  256 CPUs x   1 job      8.000 GB 256 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB  256 CPUs x   1 job   1024.000 GB 256 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB  256 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

canu useGrid=false maxThreads=504

-- Detected 504 CPUs and 10000 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    504 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  63 jobs  4032.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs   128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   14 CPUs x  36 jobs  2304.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: cor        -.--- GB    4 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 504 jobs  2016.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB    1 CPU  x 312 jobs  9984.000 GB 312 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    9 CPUs x  56 jobs  3584.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 504 jobs  4032.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

canu useGrid=false minThreads=64 maxThreads=504

-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    504 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   64 CPUs x   7 jobs   112.000 GB 448 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   64 CPUs x   7 jobs   168.000 GB 448 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   64 CPUs x   7 jobs   168.000 GB 448 CPUs  (overlap detection)
-- Local: cor        -.--- GB   64 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB   64 CPUs x   7 jobs    28.000 GB 448 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB   64 CPUs x   7 jobs   224.000 GB 448 CPUs  (overlap store sorting)
-- Local: red       64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (read error detection)
-- Local: oea        8.000 GB   64 CPUs x   7 jobs    56.000 GB 448 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB   64 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

canu useGrid=false minThreads=64 maxThreads=512

-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- WARNING: maxThreads=512 has no effect when only 504 CPUs present.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    512 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   64 CPUs x   7 jobs   112.000 GB 448 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   64 CPUs x   7 jobs   168.000 GB 448 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   64 CPUs x   7 jobs   168.000 GB 448 CPUs  (overlap detection)
-- Local: cor        -.--- GB   64 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB   64 CPUs x   7 jobs    28.000 GB 448 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB   64 CPUs x   7 jobs   224.000 GB 448 CPUs  (overlap store sorting)
-- Local: red       64.000 GB   64 CPUs x   7 jobs   448.000 GB 448 CPUs  (read error detection)
-- Local: oea        8.000 GB   64 CPUs x   7 jobs    56.000 GB 448 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB   64 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

canu useGrid=false minThreads=63 maxThreads=504

-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    504 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB   63 CPUs x   8 jobs   512.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs   128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   63 CPUs x   8 jobs   512.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   63 CPUs x   8 jobs   192.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   63 CPUs x   8 jobs   192.000 GB 504 CPUs  (overlap detection)
-- Local: cor        -.--- GB   63 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB   63 CPUs x   8 jobs    32.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB   63 CPUs x   8 jobs   256.000 GB 504 CPUs  (overlap store sorting)
-- Local: red       64.000 GB   63 CPUs x   8 jobs   512.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB   63 CPUs x   8 jobs    64.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB   63 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

...

-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Tue Dec  7 08:09:45 2021 with 1149320.296 GB free disk space (352 processes; 8 concurrently)

canu useGrid=false maxThreads=504 ...

-- Detected 504 CPUs and 10074 gigabytes of memory on the local machine.
--
-- Local machine mode enabled; grid support not detected or not allowed.
--
-- Job limits:
--    504 CPUs              (maxThreads option).
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  63 jobs  4032.000 GB 504 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   63 CPUs x   8 jobs   128.000 GB 504 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   64.000 GB   14 CPUs x  36 jobs  2304.000 GB 504 CPUs  (overlap detection with mhap)
-- Local: obtovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   14 CPUs x  36 jobs   864.000 GB 504 CPUs  (overlap detection)
-- Local: cor        -.--- GB    4 CPUs x   - jobs     -.--- GB   - CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 504 jobs  2016.000 GB 504 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB    1 CPU  x 314 jobs  10048.000 GB 314 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    9 CPUs x  56 jobs  3584.000 GB 504 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 504 jobs  4032.000 GB 504 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

...

-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Wed Dec  8 11:44:06 2021 with 1142226.585 GB free disk space (352 processes; 504 concurrently)
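
The job counts in most rows of these tables are consistent with a simple rule: the number of concurrent jobs is the smaller of (CPUs divided by threads per job) and (memory divided by memory per job), both rounded down. The Python sketch below only reproduces that arithmetic from the numbers printed above. It is inferred from the logs, not taken from canu's source, and the function name `concurrency` is made up for the illustration. It does not model the bat row, which shows a single job in every run.

```python
def concurrency(total_cpus: int, total_mem_gb: int,
                threads_per_job: int, mem_per_job_gb: int) -> int:
    """Concurrent jobs that fit, limited by both CPUs and memory."""
    by_cpu = total_cpus // threads_per_job
    by_mem = total_mem_gb // mem_per_job_gb
    return min(by_cpu, by_mem)


if __name__ == "__main__":
    # oea with minThreads=63: CPU-limited to 8 concurrent jobs.
    print(concurrency(504, 10074, threads_per_job=63, mem_per_job_gb=8))   # 8
    # oea without minThreads: one core per job, 504 concurrent jobs.
    print(concurrency(504, 10074, threads_per_job=1, mem_per_job_gb=8))    # 504
    # ovs is memory-limited: 32 GB per job on 10000 GB and on 10074 GB.
    print(concurrency(504, 10000, threads_per_job=1, mem_per_job_gb=32))   # 312
    print(concurrency(504, 10074, threads_per_job=1, mem_per_job_gb=32))   # 314
```

With minThreads=63 each oea job reserves 63 cores while using one, so only 8 of the 352 processes run at a time, which matches the "8 concurrently" line in the earlier log.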

@skoren
Member

skoren commented Dec 19, 2021

The issue is that options 2/3 wouldn't work on some cluster setups. We've often seen configs that allocate a fixed amount of memory per core, so jobs like ovs or oea in Canu, which use a good bit of memory but are single-threaded, would overload the memory on a machine. This is why we allow the user to set more than 1 thread for a single-core job; minThreads is a convenient override to ensure all jobs request at least that many CPUs/memory. On local nodes 2/3 would be OK, but for the most part Canu is happy to let you shoot yourself in the foot like this.
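
To make the fixed GB/core case concrete, here is a small Python sketch of the arithmetic. The 4 GB/core figure and the function name `cores_for_memory` are assumptions chosen for the example, not values from any particular cluster or from Canu itself; the per-job memory figures are the ones in the tables above.

```python
import math


def cores_for_memory(job_mem_gb: float, gb_per_core: float, job_threads: int = 1) -> int:
    """Cores a job must request so that its memory allotment covers its needs.

    On a scheduler that grants a fixed amount of memory per requested core,
    a single-threaded job may have to ask for extra cores purely to be
    given enough memory.
    """
    return max(job_threads, math.ceil(job_mem_gb / gb_per_core))


if __name__ == "__main__":
    # Hypothetical cluster granting 4 GB per core.
    print(cores_for_memory(32, gb_per_core=4))   # ovs, 32 GB per job
    print(cores_for_memory(8, gb_per_core=4))    # oea, 8 GB per job
```

On such a cluster a thread floor above 1 is what keeps single-threaded jobs within their memory allotment, which is why ignoring minThreads unconditionally would not be safe there.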

@mmokrejs
Author

Hi Serge, OK, thank you for explaining why that is. Then I assume that at least printing the warning and pointing to this thread could keep me or somebody else from wasting CPU cores next time. I would still prefer 2/3, though, with an option to enforce the workaround for broken setups.
