
[utils] Rewrite slurm.pl from scratch #4314

Draft · wants to merge 1 commit into master
Conversation

@kkm000 (Contributor) commented Oct 30, 2020

The new version invokes sbatch with the batch file passed on stdin, and waits for the script to complete not by polling flag files, but by passing the --wait switch to sbatch (which still polls internally, but more efficiently, over an already established RPC channel to the active controller). In this mode, sbatch prints the new job ID immediately upon submission, and then later exits with return code 0 on success, or prints an error message and exits with a non-zero code.
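As a rough illustration of this submission pattern, here is a minimal Perl sketch; the subroutine name and structure are made up for this example, this is not the actual code in the PR:

use strict;
use warnings;

# Submit a batch script on sbatch's stdin and block until the job finishes.
# sbatch prints the job ID to stdout right away; with --wait it exits 0 on
# success, or non-zero (with an error message) otherwise.
sub submit_and_wait {
  my ($batch_script, @sbatch_opts) = @_;
  open(my $sb, '|-', 'sbatch', '--wait', @sbatch_opts)
      or die "slurm.pl: cannot exec sbatch: $!";
  print $sb $batch_script;
  close($sb);              # blocks until sbatch (and thus the job) completes
  return $? >> 8;          # sbatch's exit code: 0 on success
}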

Slurm's sbatch command has a hardcoded polling schedule for waiting. Specifically, it first checks whether the job has completed after 2s, then increases the wait interval by 2s every poll cycle until it maxes out at 10s (i.e., the wait times are 2, 4, 6, 8, 10, 10, ... seconds). With a 10s steady-state interval, a job finishing at a random moment waits about 5s extra on average before sbatch notices, so even very short Kaldi batches often incur extra delays of that order. The following patch reduces the poll interval to 1s without growth:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is unlikely to break in the future, given it's a 2-line change. In any case, please open an issue in the https://github.com/burrmill/burrmill repository if your Slurm source does not patch cleanly.

You do not need administrative access to the cluster; you can simply build your own version of sbatch and place it on the PATH. Internally it uses only Slurm RPC calls and has few dependencies. The Slurm RPC library's .so must already be present (use ldd $(which sbatch) to locate it); build against it with headers from the Slurm repository.

Whether to discuss this with the cluster IT admin is, of course, your decision. Your arguments are that (a) Kaldi jobs are sometimes extremely short, on the order of 10 seconds, while it's not uncommon in the HPC world to submit multinode MPI jobs running for days, and (b) the old flag-file polling put a heavier load on the system at any poll rate: Slurm RPCs are incomparably more efficient than placing flag files on an NFS or other shared FS.


The configuration I used:

command sbatch --no-kill --hint=compute_bound --export=PATH,TIME_STYLE

option debug=0
option debug=1
option debug=2 -v

# This is not very helpful, as the BurrMill setup does not schedule memory.
option mem=0
option mem=* --mem-per-cpu=$0

option num_threads=1
option num_threads=* --cpus-per-task=$0

# For memory-gobbling tasks, like large ARPA models or HCLG composition.
option whole_nodes=* --exclusive --nodes=$0

option gpu=0
# Hack: --cpus-per-task should be 2*$0, but slurm.pl cannot do arithmetic.
# All our nodes have 1 GPU each anyway.
option gpu=1 --partition=gpu --cpus-per-task=2 --gres=cuda:p100:1

I defined a GRES resource named 'cuda' with subtypes 'p100', 'v100', 'T4', etc., in a single partition named 'gpu'. The CPU nodes were shaped with 10 vCPUs (i.e., 5 jobs at a time) and 12.5 GB of RAM -- pretty small, but enough unless used for composing a large HCLG. This is what --whole_nodes is for: give a whole node to such tasks. They have since added C2 machines, but these could not be custom-configured last time I checked. Take the smallest that can handle the HCLG-style stuff, either 4 or 8 vCPUs. These do not offer much benefit on FST tasks, but are real monsters on matrix ops.
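For reference, the '$0' in the 'option name=* ...' lines above is substituted with the value given on the command line, following the same convention as queue.pl. A minimal sketch of that expansion (the subroutine is illustrative, not the PR's actual parser):

use strict;
use warnings;

# Expand an 'option foo=* ...' template: every '$0' is replaced with the
# value the caller passed, e.g. '--mem 4G' with template '--mem-per-cpu=$0'
# yields '--mem-per-cpu=4G'.
sub expand_option {
  my ($template, $value) = @_;
  (my $expanded = $template) =~ s/\$0/$value/g;
  return $expanded;
}

print expand_option('--mem-per-cpu=$0', '4G'), "\n";    # --mem-per-cpu=4G
print expand_option('--cpus-per-task=$0', '4'), "\n";   # --cpus-per-task=4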

A Contributor commented on this part of the diff:

if (exists $cli_options{"config"}) {
  $config = $cli_options{"config"};

In the original slurm.pl, $config is pre-set to "conf/slurm.conf", which results in the config being read automatically if it exists. That is also the default behavior of queue.pl, so it might be preferable to stick with that behavior here too...
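A sketch of the queue.pl-style default being suggested here (not actual code from either script; it assumes the %cli_options hash visible in the hunk above):

# Pre-set the default path; --config still overrides it, and the file is
# only read if it actually exists, so clusters without conf/slurm.conf
# keep working with the built-in defaults.
my $config = "conf/slurm.conf";
$config = $cli_options{"config"} if exists $cli_options{"config"};
if (-f $config) {
  # parse per-cluster 'command' and 'option' lines from $config here
} else {
  # fall back to the hardcoded default configuration
}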

The same Contributor commented on another part of the diff:

my $max_jobs_run;
my $gpu = 0;

my $config = "conf/slurm.conf";

see this....

@kkm000 (Author) replied:

Will fix, thanks for spotting! What's your opinion on the default hardcoded configuration in the file: should we keep it, or drop it and always require a config file?

My take is that no hardcoded defaults can be sensible in any way: tuning a distributed configuration is always required, in my experience. I'd rather ship a default config file. IMO, the mere act of copying the file into conf/ will draw attention to it. I'll also comment it generously.

Also, I want to make a pinned post to kaldi-help a few days before merging -- like, nearly right now. Stuff can break, and we don't know how many Slurm users are out there.

@jtrmal (Contributor) commented Nov 2, 2020 via email

@kkm000 (Author) commented Nov 2, 2020

I do not know what clsp is.

Yes, I want to release the thing ASAP, so please try to track it down. How does it not propagate? What do you see? I need to know the symptoms of the problem too, not only your hypothesis. You kinda started in the middle :)

@jtrmal (Contributor) commented Nov 2, 2020 via email

@kkm000 (Author) commented Nov 2, 2020

I'll enable more logging. Currently, I can see details only for jobs completed within the last 30 minutes; I do not even flush logs to disk.

@kkm000 (Author) commented Nov 2, 2020

And since I do not know what clsp is, I obviously do not know what the clsp cluster is. Is this cluster used by many Kaldi users? Enough, that is, to justify hardcoding a config for it?

@jtrmal (Contributor) commented Nov 2, 2020 via email

@stale (bot) commented Jan 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment (even a simple 'ping' is enough) to keep it open for another 60 days if you believe it should remain open.

The stale bot added the 'stale' label on Jan 1, 2021. @kkm000 added the 'stale-exclude' label and removed 'stale' on Mar 14, 2021, and self-assigned this PR on Sep 23, 2021.
@danpovey (Contributor) commented

Remember that we recently merged #4675; you might want to incorporate any needed changes from there.

@kkm000 (Author) commented Jan 30, 2022

I'm sitting on it because of internal Slurm changes. The script submits a job with 'sbatch ... --wait', which prints the job ID to stdout and waits until the job is complete. The old behavior in Slurm 19.x was to wait 2s, 4s, 6s, ... up to a maximum of 20s, then check for completion every 20s. Our jobs are very short compared to general HPC use, so I have patched sbatch to poll for completion every 1s, which is sensible for a cluster with essentially one user.

But Slurm suffers from internal locking problems: the lock scopes are too extensive. They are improving that over time, but in v20.x and on, the wait strategy changed: the wait interval is 30s from the start. That's too much. I would not care, as I can always patch my own sbatch, but I am not certain that is best for a general release.

There are alternatives. First, completion can be polled with scontrol: unless Slurm is set up to purge completed-job info from memory immediately (which is hardly the case in any setup), this returns the job's exit code. Another option, good for v20+, is to use the new REST API for both batch submission and completion polling; but that works only for v20 and on.

It looks like we'll end up with two slurm.pl's: one for v19 and earlier, which may limp through the default growing wait delay fast enough for very short jobs (this is what we have now), and another for the newer Slurms with the long wait from the start, but with an API. I dunno. Maybe polling with scontrol is the way to do it in both cases?
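For concreteness, the scontrol-based alternative could look roughly like this (a sketch only, not code from this PR; the subroutine name is made up, and it assumes the JobState= and ExitCode= fields that 'scontrol show job' prints, plus a job ID obtained at submission time):

use strict;
use warnings;

# Poll the controller until the job reaches a terminal state, then return
# an exit code. This only works while the controller still keeps the
# finished job's record in memory (i.e., before MinJobAge expires).
sub wait_for_job {
  my ($job_id, $poll_secs) = @_;
  while (1) {
    my $info = `scontrol -o show job $job_id 2>/dev/null`;
    if ($info =~ /JobState=(\S+)/) {
      my $state = $1;
      if ($state =~ /^(COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|PREEMPTED)/) {
        return 0 if $state eq 'COMPLETED';
        # ExitCode is printed as "code:signal"; report a non-zero failure.
        my ($rc) = $info =~ /ExitCode=(\d+):/;
        return (defined $rc && $rc != 0) ? $rc : 1;
      }
    }
    sleep $poll_secs;
  }
}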

Another thing: polling itself is problematic, the more so the older the controller version. They seriously took on the lock contention issues only from v20.

@danpovey, @jtrmal, what do you think?

@danpovey (Contributor) commented
I suppose this isn't critical if the existing script works OK. Personally, I think the main thing is just having something that runs; easy-to-install may be better than perfect for most users.

@kkm000 (Author) commented Jan 31, 2022

@danpovey, you're probably right. I'm trying to overengineer it so that it Slurms better than Slurm itself does; probably no point. If it's a 30s minimum, so be it. This version is much cleaner than the one currently on master -- good enough. Let me get back to it next week and release.

Labels: stale-exclude (Stale bot ignore this issue)
Projects: none yet
Linked issues: none yet
Participants: 3