slurmd-pyxis:25.11-ubuntu24.04 appears to lack an NVML-enabled Slurm build, so AutoDetect=nvml reports 0 GPUs #14

@janekmichalik

Description

Summary

When using:

ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04
Slurm 25.11.4
gres.conf with AutoDetect=nvml

the worker nodes report 0 GPUs to Slurm and get drained as invalid, even though the container can see all GPUs and NVML is present.

This looks like the Slurm build inside the image was compiled without NVML support (HAVE_NVML), so AutoDetect=nvml cannot work.
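One way to test this hypothesis directly: Slurm only builds its NVML GPU plugin (`gpu_nvml.so`) when NVML development headers were found at configure time, so its absence from the plugin directory would support the "built without NVML" theory. The sketch below checks for it; the directory patterns are assumptions for a Debian/Ubuntu multiarch layout and may need adjusting for this image.

```python
# Sketch: look for Slurm's NVML GPU plugin in the usual plugin
# directories. Slurm builds gpu_nvml.so only when NVML headers were
# present at configure time, so an empty result is a strong hint of
# a non-NVML build. The patterns are assumptions for Debian/Ubuntu
# multiarch layouts.
import glob

def find_nvml_plugin(patterns=("/usr/lib/*/slurm/gpu_nvml.so",
                               "/usr/lib64/slurm/gpu_nvml.so")):
    hits = []
    for pattern in patterns:
        hits.extend(glob.glob(pattern))
    return hits

print(find_nvml_plugin())  # [] would be consistent with a non-NVML build
```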

Environment

Image: ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04

Slurm version:

slurmd -V → slurm 25.11.4
scontrol -V → slurm 25.11.4

Arch: aarch64
OS in worker: Ubuntu 24.04-based image
GPUs: 4 NVIDIA GPUs visible in the worker container

Config

Relevant config:

configFiles:
  gres.conf: |
    AutoDetect=nvml

nodesets:
  slinky:
    slurmd:
      resources:
        limits:
          nvidia.com/gpu: 4
    extraConfMap:
      Gres: "gpu:4"

Observed behavior

Slurm drains the nodes as invalid.

sinfo -R:

REASON               USER      TIMESTAMP           NODELIST
gres/gpu count repor slurm     2026-04-14T06:39:53 slinky-0
gres/gpu count repor slurm     2026-04-14T06:39:54 slinky-1

scontrol show node slinky-0:

Gres=gpu:4
State=IDLE+DRAIN+DYNAMIC_NORM+INVALID_REG
CfgTRES=cpu=80,mem=204420M,billing=80
Reason=gres/gpu count reported lower than configured (0 < 4)

Worker log:

[2026-04-14T06:39:53] We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
[2026-04-14T06:39:53] warning: Ignoring file-less GPU gpu:(null) from final GRES list

What works inside the worker container

The worker container can see the GPUs:

nvidia-smi -L

shows 4 GPUs.

NVML libraries are present:

find /usr /lib -name 'libnvidia-ml.so*' | sort

output:

/usr/lib/aarch64-linux-gnu/libnvidia-ml.so
/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1
/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.590.48.01

ldconfig -p | grep -i nvidia-ml:

libnvidia-ml.so.1 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libnvidia-ml.so

Symlinks also exist correctly under both /usr/lib/aarch64-linux-gnu and /lib/aarch64-linux-gnu.
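To rule out a dynamic-loader problem independently of slurmd, here is a small probe (a sketch; `libnvidia-ml.so.1` is the soname the driver packages above ship). If this load succeeds in the same container where slurmd fails, the problem is at build time, not runtime.

```python
# Sketch: ask the dynamic loader to resolve NVML at runtime,
# independent of how slurmd itself was compiled. A successful load
# here, combined with slurmd's "weren't able to find that lib when
# Slurm was configured" error, points at build time rather than
# runtime visibility.
import ctypes

def probe_nvml(libname="libnvidia-ml.so.1"):
    try:
        return ctypes.CDLL(libname)
    except OSError:
        return None

lib = probe_nvml()
print("NVML loadable at runtime:", lib is not None)
```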

Installed packages include:

ii  nvslurm-plugin-pyxis             0.23.0-1
ii  slurm-smd                        25.11.4-1
ii  slurm-smd-slurmd                 25.11.4-1

Why I think this is a build/package issue

This does not look like a runtime NVML visibility problem, because:

nvidia-smi works in the worker container
libnvidia-ml.so is present
libnvidia-ml.so is in the linker cache

The exact log line:

We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.

suggests Slurm itself was built without NVML support, so AutoDetect=nvml cannot work regardless of runtime library presence.

Also, strings /usr/sbin/slurmd | grep -i nvml returns nothing.
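For environments without binutils, a rough Python equivalent of that check (a sketch; it scans the binary's raw bytes rather than extracting printable-string runs, which is good enough for a yes/no signal):

```python
# Sketch: case-insensitive search for "nvml" in a binary's raw bytes,
# roughly equivalent to `strings /usr/sbin/slurmd | grep -i nvml`.
# An NVML-enabled slurmd would normally embed NVML symbol or library
# names; no match at all is consistent with a non-NVML build.
import re

def mentions_nvml(path):
    with open(path, "rb") as f:
        return re.search(rb"nvml", f.read(), re.IGNORECASE) is not None
```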

Expected behavior

With:

gres.conf: |
  AutoDetect=nvml

and:

Gres: "gpu:4"

I would expect slurmd to detect the 4 GPUs via NVML and register the node with:

Gres=gpu:4
CfgTRES=...gres/gpu=4

without draining the node.

Actual behavior

slurmd reports 0 GPUs to Slurm, and the node is drained with:

Reason=gres/gpu count reported lower than configured (0 < 4)

Question

Can you confirm whether ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04 / the underlying slurm-smd packages were built without NVML support?

If yes, would it be possible to publish an image/package variant with NVML-enabled Slurm so AutoDetect=nvml works?
