
[utils] Rewrite slurm.pl from scratch #4314

Draft · wants to merge 1 commit into master
Conversation

@kkm000 (Contributor) commented Oct 30, 2020

The new version invokes sbatch with the batch file passed on stdin, and waits for the script to complete not by polling flag files, but by passing the --wait switch to sbatch (which still polls internally, but more efficiently, over an already established RPC channel to the active controller). In this mode, sbatch prints the new job ID immediately upon submission, and then later exits with return code 0 on success, or prints an error message and exits with a non-zero code.
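As a rough illustration of this submission pattern, here is a minimal Perl sketch; the subroutine name and structure are made up for this example, this is not the actual code in the PR:

use strict;
use warnings;

# Submit a batch script on sbatch's stdin and block until the job finishes.
# sbatch prints the job ID to stdout right away; with --wait it exits 0 on
# success, or non-zero (with an error message) otherwise.
sub submit_and_wait {
  my ($batch_script, @sbatch_opts) = @_;
  open(my $sb, '|-', 'sbatch', '--wait', @sbatch_opts)
      or die "slurm.pl: cannot exec sbatch: $!";
  print $sb $batch_script;
  close($sb);              # blocks until sbatch (and thus the job) completes
  return $? >> 8;          # sbatch's exit code: 0 on success
}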

Slurm's sbatch command has a hardcoded polling schedule for waiting. Specifically, it first checks whether the job has completed after 2s, then increases the wait interval by 2s every poll cycle until it maxes out at 10s (i.e., the wait times are 2, 4, 6, 8, 10, 10, ... seconds). With a 10s steady-state interval, a job finishing at a random moment waits about 5s extra on average before sbatch notices, so even very short Kaldi batches often incur extra delays of that order. The following patch reduces the poll interval to 1s without growth:

https://github.com/burrmill/burrmill/blob/v0.5-beta.2/lib/build/slurm/sbatch.19.patch

It has been tested to apply cleanly up to Slurm v20.1, and is unlikely to break in the future, given it's a 2-line change. In any case, please open an issue in the https://github.com/burrmill/burrmill repository if your Slurm source does not patch cleanly.

You do not need administrative access to the cluster; you can simply build your own version of sbatch and place it on the PATH. Internally it uses only Slurm RPC calls and has few dependencies. The Slurm RPC library's .so must already be present (use ldd $(which sbatch) to locate it); build against it with headers from the Slurm repository.

Whether to discuss this with the cluster IT admin is, of course, your decision. Your arguments are that (a) Kaldi jobs are sometimes extremely short, on the order of 10 seconds, while it's not uncommon in the HPC world to submit multinode MPI jobs running for days, and (b) the old flag-file polling put a heavier load on the system at any poll rate: Slurm RPCs are incomparably more efficient than placing flag files on an NFS or other shared FS.


The configuration I used:

command sbatch --no-kill --hint=compute_bound --export=PATH,TIME_STYLE

option debug=0
option debug=1
option debug=2 -v

# This is not very helpful, as the BurrMill setup does not schedule memory.
option mem=0
option mem=* --mem-per-cpu=$0

option num_threads=1
option num_threads=* --cpus-per-task=$0

# For memory-gobbling tasks, like large ARPA models or HCLG composition.
option whole_nodes=* --exclusive --nodes=$0

option gpu=0
# Hack: --cpus-per-task should be 2*$0, but slurm.pl cannot do arithmetic.
# All our nodes have 1 GPU each anyway.
option gpu=1 --partition=gpu --cpus-per-task=2 --gres=cuda:p100:1

I defined a GRES resource named 'cuda' with subtypes 'p100', 'v100', 'T4', etc., in a single partition named 'gpu'. The CPU nodes were shaped with 10 vCPUs (i.e., 5 jobs at a time) and 12.5 GB of RAM -- pretty small, but enough unless used for composing a large HCLG. This is what --whole_nodes is for: give a whole node to such tasks. They have since added C2 machines, but these could not be custom-configured last time I checked. Take the smallest that can handle the HCLG-style stuff, either 4 or 8 vCPUs. These do not offer much benefit on FST tasks, but are real monsters on matrix ops.
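For reference, the '$0' in the 'option name=* ...' lines above is substituted with the value given on the command line, following the same convention as queue.pl. A minimal sketch of that expansion (the subroutine is illustrative, not the PR's actual parser):

use strict;
use warnings;

# Expand an 'option foo=* ...' template: every '$0' is replaced with the
# value the caller passed, e.g. '--mem 4G' with template '--mem-per-cpu=$0'
# yields '--mem-per-cpu=4G'.
sub expand_option {
  my ($template, $value) = @_;
  (my $expanded = $template) =~ s/\$0/$value/g;
  return $expanded;
}

print expand_option('--mem-per-cpu=$0', '4G'), "\n";    # --mem-per-cpu=4G
print expand_option('--cpus-per-task=$0', '4'), "\n";   # --cpus-per-task=4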

A Contributor commented on this part of the diff:

if (exists $cli_options{"config"}) {
  $config = $cli_options{"config"};

In the original slurm.pl, $config is pre-set to "conf/slurm.conf", which results in the config being read automatically if it exists. That is also the default behavior of queue.pl, so it might be preferable to stick with that behavior here too...
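A sketch of the queue.pl-style default being suggested here (not actual code from either script; it assumes the %cli_options hash visible in the hunk above):

# Pre-set the default path; --config still overrides it, and the file is
# only read if it actually exists, so clusters without conf/slurm.conf
# keep working with the built-in defaults.
my $config = "conf/slurm.conf";
$config = $cli_options{"config"} if exists $cli_options{"config"};
if (-f $config) {
  # parse per-cluster 'command' and 'option' lines from $config here
} else {
  # fall back to the hardcoded default configuration
}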

The same Contributor commented on another part of the diff:

my $max_jobs_run;
my $gpu = 0;

my $config = "conf/slurm.conf";

see this....

@kkm000 (Author) replied:

Will fix, thanks for spotting! What's your opinion on the default hardcoded configuration in the file: should we keep it, or drop it and always require a config file?

My take is that no hardcoded defaults can be sensible in any way: tuning a distributed configuration is always required, in my experience. I'd rather ship a default config file. IMO, the mere act of copying the file into conf/ will draw attention to it. I'll also comment it generously.

Also, I want to make a pinned post to kaldi-help a few days before merging -- like, nearly right now. Stuff can break, and we don't know how many Slurm users are out there.

@jtrmal (Contributor) commented Nov 2, 2020 via email

@kkm000 (Author) commented Nov 2, 2020

I do not know what clsp is.

Yes, I want to release the thing ASAP, so please try to track it down. How does it not propagate? What do you see? I need to know the symptoms of the problem too, not only your hypothesis. You kinda started in the middle :)

@jtrmal (Contributor) commented Nov 2, 2020 via email

@kkm000 (Author) commented Nov 2, 2020

I'll enable more logging. Currently, I can see details only for jobs completed within the last 30 minutes; I do not even flush logs to disk.

@kkm000 (Author) commented Nov 2, 2020

And since I do not know what clsp is, I obviously do not know what the clsp cluster is. Is this cluster used by many Kaldi users? Enough, that is, to justify hardcoding a config for it?

@jtrmal (Contributor) commented Nov 2, 2020 via email

@stale (bot) commented Jan 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment (even a simple 'ping' is enough) to keep it open for another 60 days if you believe it should remain open.

The stale bot added the 'stale' label on Jan 1, 2021. @kkm000 added the 'stale-exclude' label and removed 'stale' on Mar 14, 2021, and self-assigned this PR on Sep 23, 2021.
@danpovey (Contributor) commented

Remember that we recently merged #4675; you might want to incorporate any needed changes from there.

@kkm000 (Author) commented Jan 30, 2022

I'm sitting on it because of internal Slurm changes. The script submits a job with 'sbatch ... --wait', which prints the job ID to stdout and waits until the job is complete. The old behavior in Slurm 19.x was to wait 2s, 4s, 6s, ... up to a maximum of 20s, then check for completion every 20s. Our jobs are very short compared to general HPC use, so I have patched sbatch to poll for completion every 1s, which is sensible for a cluster with essentially one user.

But Slurm suffers from internal locking problems: the lock scopes are too extensive. They are improving that over time, but in v20.x and on, the wait strategy changed: the wait interval is 30s from the start. That's too much. I would not care, as I can always patch my own sbatch, but I am not certain that is best for a general release.

There are alternatives. First, completion can be polled with scontrol: unless Slurm is set up to purge completed-job info from memory immediately (which is hardly the case in any setup), this returns the job's exit code. Another option, good for v20+, is to use the new REST API for both batch submission and completion polling; but that works only for v20 and on.

It looks like we'll end up with two slurm.pl's: one for v19 and earlier, which may limp through the default growing wait delay fast enough for very short jobs (this is what we have now), and another for the newer Slurms with the long wait from the start, but with an API. I dunno. Maybe polling with scontrol is the way to do it in both cases?
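For concreteness, the scontrol-based alternative could look roughly like this (a sketch only, not code from this PR; the subroutine name is made up, and it assumes the JobState= and ExitCode= fields that 'scontrol show job' prints, plus a job ID obtained at submission time):

use strict;
use warnings;

# Poll the controller until the job reaches a terminal state, then return
# an exit code. This only works while the controller still keeps the
# finished job's record in memory (i.e., before MinJobAge expires).
sub wait_for_job {
  my ($job_id, $poll_secs) = @_;
  while (1) {
    my $info = `scontrol -o show job $job_id 2>/dev/null`;
    if ($info =~ /JobState=(\S+)/) {
      my $state = $1;
      if ($state =~ /^(COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|PREEMPTED)/) {
        return 0 if $state eq 'COMPLETED';
        # ExitCode is printed as "code:signal"; report a non-zero failure.
        my ($rc) = $info =~ /ExitCode=(\d+):/;
        return (defined $rc && $rc != 0) ? $rc : 1;
      }
    }
    sleep $poll_secs;
  }
}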

Another thing: polling itself is problematic, the more so the older the controller version. They seriously took on the lock contention issues only from v20.

@danpovey, @jtrmal, what do you think?

@danpovey (Contributor) commented
I suppose this isn't critical if the existing script works OK. Personally, I think the main thing is just having something that runs; easy-to-install may be better than perfect for most users.

@kkm000 (Author) commented Jan 31, 2022

@danpovey, you're probably right. I'm trying to overengineer it so that it Slurms better than Slurm itself does; probably no point. If it's a 30s minimum, so be it. This version is much cleaner than the one currently on master -- good enough. Let me get back to it next week and release.

Labels: stale-exclude (Stale bot ignore this issue)
Projects: none yet
Linked issues: none yet
Participants: 3