This check monitors Slurm through the Datadog Agent.
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.
The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including sinfo, squeue, sacct, sdiag, and sshare. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.
On worker nodes, the check can also collect a metric reported by scontrol. This metric carries job details that are not available from slurmctld, such as the PID(s) of each job.
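The execute-and-parse approach described above can be sketched as follows. This is a minimal illustration, not the integration's actual implementation: the `parse_sinfo` helper, the pipe-delimited output format, and the sample output are all assumptions chosen for the example. The sample mimics what `sinfo -h -o "%P|%a|%D|%T"` (partition, availability, node count, node state) might print.

```python
from collections import Counter

# Hypothetical sample of `sinfo -h -o "%P|%a|%D|%T"` output:
# partition | availability | node count | node state
SAMPLE_SINFO_OUTPUT = """\
debug*|up|2|idle
batch|up|6|allocated
batch|up|1|down
"""


def parse_sinfo(output):
    """Return node counts keyed by (partition, state)."""
    counts = Counter()
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) != 4:
            continue  # skip lines that do not match the expected layout
        partition, _avail, nodes, state = fields
        # sinfo marks the default partition with a trailing asterisk
        partition = partition.rstrip("*")
        try:
            counts[(partition, state)] += int(nodes)
        except ValueError:
            continue  # node count was not a number; ignore the line
    return dict(counts)


if __name__ == "__main__":
    parsed = parse_sinfo(SAMPLE_SINFO_OUTPUT)
    for (partition, state), nodes in sorted(parsed.items()):
        print(f"partition={partition} state={state} nodes={nodes}")
```

On a host with Slurm installed, the sample string could be replaced with the captured standard output of the real command, for example through `subprocess.run([...], capture_output=True, text=True)`. The resulting counts would then be submitted as gauges tagged by partition and state.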
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.
Note: This check was tested on Slurm version 21.08.0.
The Slurm check is included in the Datadog Agent package. No additional installation is needed on your server.
- Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.
- Edit the slurm.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Slurm data. See the sample slurm.d/conf.yaml for all available configuration options.
init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, squeue, sacct, sdiag, and sshare.
    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure these parameters to select which data the integration collects.

    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether or not to collect GPU statistics using sinfo when Slurm is configured to use GPUs.
    #
    collect_gpu_stats: true

    ## @param sinfo_collection_level - integer - optional - default: 1
    ## The level of detail to collect from the sinfo command. The default is 1. Available options are 1, 2,
    ## and 3. Level 1 collects data only for partitions. Level 2 collects data from individual nodes. Level 3
    ## also collects data from individual nodes but is more verbose and includes data such as CPU and
    ## memory usage as reported by the OS, as well as additional tags.
    #
    sinfo_collection_level: 1
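Before restarting the Agent, it can help to confirm that every binary the configuration relies on is actually executable. The sketch below is a hypothetical preflight script, not part of the integration: the helper names `required_binaries` and `find_unusable` are made up for illustration, and the option-to-binary mapping is inferred from the option descriptions above.

```python
import os

# Assumed mapping from each collect_* option to the binary it relies on.
# collect_gpu_stats is served by sinfo, per the option's description.
OPTION_TO_BINARY = {
    "collect_sinfo_stats": "sinfo",
    "collect_sdiag_stats": "sdiag",
    "collect_squeue_stats": "squeue",
    "collect_sacct_stats": "sacct",
    "collect_sshare_stats": "sshare",
    "collect_gpu_stats": "sinfo",
}

# Defaults as listed in the configuration comments.
DEFAULTS = {
    "collect_sinfo_stats": True,
    "collect_sdiag_stats": True,
    "collect_squeue_stats": True,
    "collect_sacct_stats": True,
    "collect_sshare_stats": True,
    "collect_gpu_stats": False,
}


def required_binaries(instance):
    """Return the sorted binary names needed for the enabled options."""
    needed = set()
    for option, binary in OPTION_TO_BINARY.items():
        if instance.get(option, DEFAULTS[option]):
            needed.add(binary)
    return sorted(needed)


def find_unusable(binaries_dir, instance):
    """Return paths of required binaries that are missing or not
    executable by the user running this script."""
    problems = []
    for name in required_binaries(instance):
        path = os.path.join(binaries_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            problems.append(path)
    return problems


if __name__ == "__main__":
    # Example instance: everything at its default except sacct collection.
    unusable = find_unusable("/usr/bin/", {"collect_sacct_stats": False})
    print("unusable binaries:", unusable or "none")
```

Because `os.access` evaluates permissions for the user running the script, the result is only meaningful when the script runs as the dd-agent user.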
Run the Agent's status subcommand and look for slurm under the Checks section.
See metadata.csv for a list of metrics provided by this integration.
The Slurm integration does not include any events.
Need help? Contact Datadog support.