Slow autoscale with Array jobs #9

jarvist · 2019-04-11T11:16:40Z

Autoscaling after submitting a Slurm array job results in the cluster spinning up very slowly.
This appears to be because, even with an arbitrarily large number of array jobs specified:

only a single extra node is requested,
this then spins up,
once it has spun up, the next array job starts, and only then is a new node requested.

Because each instance takes ~5-10 minutes to spin up, this severely limits autoscaling.
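
To put a rough number on this, here is a back-of-the-envelope sketch. The node count of 20 is a hypothetical figure picked for illustration; only the 5-10 minute spin-up time comes from the observation above.

```shell
#!/bin/sh
# Rough estimate of how long it takes until the cluster is fully scaled.
# NODES is a hypothetical value; the 5-10 minute range is the observed spin-up time.
NODES=20
MIN_SPINUP=5
MAX_SPINUP=10

# Nodes requested one at a time: each request waits for the previous node to come up.
serial_min=$((NODES * MIN_SPINUP))
serial_max=$((NODES * MAX_SPINUP))

echo "one node at a time: ${serial_min}-${serial_max} minutes"
echo "all nodes at once:  ${MIN_SPINUP}-${MAX_SPINUP} minutes"
```

Under these assumptions, serial scaling takes hours where a single batched request would take minutes.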

A workaround is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the array job in the queue, it still makes the cluster spin up many nodes at once, and the array jobs are then dispatched to them.

E.g. to request 100 CPUs' worth of nodes:
echo '#!/bin/sh' | sbatch -n 100
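
A slightly more general sketch of the same workaround, sizing the dummy job from the array job. `ARRAY_TASKS` and `CPUS_PER_TASK` are hypothetical placeholders to adjust for your own job, and it assumes one CPU per `-n` task (the Slurm default). If `sbatch` is not on the PATH, it only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical values: set these to match your own array job.
ARRAY_TASKS=100     # number of elements in the array job
CPUS_PER_TASK=1     # CPUs each element needs

# Total CPUs the dummy job should ask for, so that enough nodes are requested at once.
total_cpus=$((ARRAY_TASKS * CPUS_PER_TASK))

if command -v sbatch >/dev/null 2>&1; then
    # Submit an empty script on stdin, requesting total_cpus tasks.
    echo '#!/bin/sh' | sbatch -n "$total_cpus"
else
    # No Slurm here: show the command instead of running it.
    echo "would run: echo '#!/bin/sh' | sbatch -n $total_cpus"
fi
```

The dummy job does nothing when it finally runs; its only purpose is to sit in the queue so the autoscaler sees the full demand.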

basnijholt mentioned this issue Apr 26, 2019

Autoscaling doesn't work for pending jobs #11

Open

carogonzalezzapata assigned anhoward Aug 3, 2021

aditigaur4 closed this as completed May 30, 2024
