Autoscaling after submitting a Slurm array job results in the cluster spinning up very slowly.
This appears to be because, even with an arbitrarily large number of array tasks specified:
- only a single extra node is requested;
- that node then spins up;
- once it is up, the next array task starts, and only then is another node requested.

Since an instance takes roughly 5-10 minutes to spin up, this severely limits autoscaling.
A workaround is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the Array job in the queue, it still results in the cluster spinning up many nodes at once, which the array jobs are then dispatched to.
e.g. to request 100 CPUs' worth of nodes:

```
echo '#!/bin/sh' | sbatch -n 100
```
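Putting the two submissions together, the workaround might look like this. This is a hedged sketch: the array script name `array_task.sh`, the task range, and the per-task CPU count are placeholders, and the commands assume a working Slurm installation.

```shell
#!/bin/sh
# Sketch of the workaround described above (names are hypothetical).

# Submit the array job: 100 tasks, 1 CPU each. On its own, this makes
# the autoscaler request only one extra node at a time.
sbatch --array=0-99 -n 1 array_task.sh

# Dummy non-array job requesting 100 CPUs. Even though it queues behind
# the array job, it prompts the autoscaler to spin up many nodes at
# once, and the array tasks are then dispatched onto them.
echo '#!/bin/sh' | sbatch -n 100
```

Once the array tasks have been dispatched, the dummy job runs (and exits almost immediately, since its script body is empty) or can be cancelled with `scancel`.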