Slow autoscale with Array jobs #9

Closed
jarvist opened this issue Apr 11, 2019 · 0 comments

jarvist commented Apr 11, 2019

Autoscaling after submitting a Slurm array job results in the cluster spinning up very slowly.
This appears to be because, even when an arbitrarily large number of array tasks is specified:

  • only a single extra node is requested,
  • this then spins up,
  • once it has spun up, the next array job starts, and only then is a new node requested.

Because an instance takes ~5-10 minutes to spin up, this severely limits autoscaling.
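
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes nodes really are requested strictly one at a time, and it ignores scheduler latency and job runtime; the node count of 20 is a hypothetical figure, not something measured on the cluster.

```shell
#!/bin/sh
# Rough estimate only: assumes strictly serial node requests and ignores
# scheduler latency and the runtime of the jobs themselves.
nodes=20          # hypothetical number of nodes the array needs
spinup_min=5      # lower bound on spin-up time from the report (minutes)
spinup_max=10     # upper bound on spin-up time from the report (minutes)

# Serial spin-up scales with the number of nodes.
serial_min=$((nodes * spinup_min))
serial_max=$((nodes * spinup_max))

echo "one node at a time: ${serial_min}-${serial_max} minutes"
echo "all nodes at once:  ${spinup_min}-${spinup_max} minutes"
```

Under those assumptions, a 20-node array takes on the order of hours to reach full size, compared with a single spin-up period if all nodes were requested together.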

A workaround is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the array job in the queue, it still causes the cluster to spin up many nodes at once, to which the array jobs are then dispatched.

e.g. to request 100 CPUs' worth of nodes (100 tasks, at the default of one CPU per task):
echo '#!/bin/sh' | sbatch -n 100
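
The workaround could be wrapped into a small script that sizes the dummy job to match the array. This is only a sketch: `ARRAY_SIZE`, `CPUS_PER_TASK`, `DRY_RUN`, the `placeholder_cpus` and `run` helpers, and the `job.sh` script name are all hypothetical, and it assumes every array task should be able to run at once. It defaults to a dry run that prints the `sbatch` commands instead of submitting them.

```shell
#!/bin/sh
# Sketch: submit an array job plus a dummy job sized to match it.
# ARRAY_SIZE, CPUS_PER_TASK and job.sh are hypothetical placeholders.
ARRAY_SIZE=${ARRAY_SIZE:-100}
CPUS_PER_TASK=${CPUS_PER_TASK:-1}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, 0 = actually submit

# Total CPUs the array would occupy if every task ran at the same time.
placeholder_cpus() {
    echo $(( $1 * $2 ))
}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

total=$(placeholder_cpus "$ARRAY_SIZE" "$CPUS_PER_TASK")

# The real work: one array task per index.
run sbatch --array="1-${ARRAY_SIZE}" --cpus-per-task="$CPUS_PER_TASK" job.sh

# The dummy job: an empty script read from stdin, requesting the same CPU total.
echo '#!/bin/sh' | run sbatch -n "$total"
```

Running it with `DRY_RUN=0` would submit both jobs; the dummy job does nothing once it starts, so its only purpose is to make the pending resource request visible to the autoscaler.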
