Release v1.12.0

github-actions released this 29 Jun 12:46
v1.12.0
4ac11c6

This release prevents the DCGM connectivity error that fired on every new node during GPU Operator bootstrapping, adds labeler-written device-count labels so downstream consumers can detect nodes reporting fewer GPUs or NICs than expected, and adds an opt-in Magic SysRq reboot path for the generic bare-metal provider. It also includes reliability fixes for the labeler, node-drainer, and the PostgreSQL store.

Major New Features

Prevent DCGM Connectivity Errors on Node Bootstrapping (#1425, #1423)

On a freshly launched node, the gpu-health-monitor pod could become ready before the GPU Operator's nvidia-dcgm pod finished its init-container startup sequence, producing a GpuDcgmConnectivityFailure unhealthy condition (DCGM_CONNECTIVITY_ERROR, CONTACT_SUPPORT) that only cleared minutes later once DCGM came up. The gpu-health-monitor is no longer scheduled until the nvidia-dcgm pod on the node is ready, so a normal node bootstrap no longer emits a false connectivity error. Node-deletion teardown behavior is unchanged.
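The gating condition described above can be illustrated with a small predicate over Pod objects in their Kubernetes JSON form. This is only a sketch of the idea, not the mechanism the release uses to delay scheduling: the function names and the app=nvidia-dcgm label selector are assumptions made for illustration.

```python
def pod_is_ready(pod):
    """True if the Pod reports a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def dcgm_ready_on_node(pods, node_name, dcgm_label=("app", "nvidia-dcgm")):
    """True if a ready DCGM pod is running on node_name.

    `dcgm_label` is an assumed selector; adjust it to match the cluster.
    """
    key, value = dcgm_label
    for pod in pods:
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        if pod.get("metadata", {}).get("labels", {}).get(key) != value:
            continue
        if pod_is_ready(pod):
            return True
    return False
```

A monitor gated on this predicate would stay idle while nvidia-dcgm is still running its init containers, which is the window in which the false connectivity error used to appear.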

Expected Device-Count Labels from Labeler (#1395)

The labeler can now write normalized current and expected device-count labels (e.g. nvsentinel.dgxc.nvidia.com/gpu.count.current / .expected) onto nodes, giving downstream modules a signal for detecting nodes that advertise fewer devices than their peers. The current count is derived from a configurable CEL expression that supports both device-plugin/GFD-style node labels and DRA ResourceSlice advertisements. The expected count is either learned from peers in the same grouping-label partition or pinned via per-class overrides. The feature is configured per device class (GPU, NIC) in a TOML ConfigMap and is disabled by default. See ADR-043 for the design.
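A downstream consumer of these labels might look roughly like the sketch below. It assumes the abbreviated ".expected" form expands to nvsentinel.dgxc.nvidia.com/gpu.count.expected, and the "most common count among peers" rule in learn_expected is an illustrative guess, not the labeler's documented algorithm. All function names are hypothetical.

```python
from collections import Counter

# Assumed full label keys, expanded from the abbreviated form in the notes.
CURRENT = "nvsentinel.dgxc.nvidia.com/gpu.count.current"
EXPECTED = "nvsentinel.dgxc.nvidia.com/gpu.count.expected"


def learn_expected(peer_counts):
    """Pick the most common count among peers (assumed rule for illustration)."""
    if not peer_counts:
        return None
    return Counter(peer_counts).most_common(1)[0][0]


def nodes_below_expected(nodes):
    """Return sorted names of nodes whose current count is below expected.

    `nodes` maps node name -> label dict. Label values are strings, as in
    Kubernetes. Nodes missing either label, or with non-numeric values,
    are skipped.
    """
    flagged = []
    for name, labels in nodes.items():
        try:
            current = int(labels[CURRENT])
            expected = int(labels[EXPECTED])
        except (KeyError, ValueError):
            continue
        if current < expected:
            flagged.append(name)
    return sorted(flagged)
```

Comparing the two labels on the node itself keeps the consumer simple: it does not need to know whether the expected value was learned from peers or pinned by an override.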

Opt-In SysRq Reboot for the Generic Bare-Metal Provider (#1418)

The generic bare-metal janitor provider now supports an opt-in Linux Magic SysRq reboot mode (janitor-provider.csp.generic.useSysrqReboot=true), which reboots a node by writing b to /proc/sysrq-trigger from a privileged Job rather than using the default chroot-based reboot. This is useful on hosts where the chroot path is unreliable. The existing chroot-based reboot remains the default, so existing deployments are unaffected.
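For readers unfamiliar with Magic SysRq, the core of the mechanism is a single write to the trigger file. The sketch below shows only that write. It is not the janitor provider's Job, the function name is hypothetical, and on a real host the call requires root and reboots the machine immediately without syncing or unmounting disks.

```python
def sysrq_reboot(trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq 'b' command to the trigger file.

    On a real host this reboots immediately, without syncing or
    unmounting filesystems. `trigger_path` is a parameter only so the
    write can be exercised against an ordinary file.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("b")
```

Because the 'b' command skips the normal shutdown sequence, it can succeed on hosts where a userspace reboot path hangs, at the cost of losing any unsynced data.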

Bug Fixes & Reliability

  • Lazily initialize ResourceSlice informers in the labeler (#1422): The labeler eagerly started a DRA ResourceSlice informer even when device-count detection used only the device-plugin method, spamming "failed to list *v1.ResourceSlice: the server could not find the requested resource" errors on clusters without the resource.k8s.io API. The informer is now initialized lazily, only when a class actually requires ResourceSlice data, and numeric strings are normalized to numbers during count evaluation. Follow-up to the device-count feature (#1395).
  • Node-drainer ignores stale AlreadyQuarantined events (#1419, #1415): A stale AlreadyQuarantined event re-enqueued via a later change-stream update — after the node had already been unquarantined and its quarantineHealthEvent annotation removed — was treated as "not already drained" and fell through to normal drain evaluation. That marked the stale event Succeeded and mutated the node-state label (triggering an invalid none -> draining transition) despite there being no active quarantine context. The already-drained check now handles a missing annotation on a stale AlreadyQuarantined event correctly instead of proceeding to drain.
  • Fix PostgreSQL UpdateDocument placeholder collision (#1391): In the direct PostgreSQL store, UpdateDocument did not bind SET parameters before WHERE parameters, so combined update+filter statements could apply parameters in the wrong order. WHERE placeholders are now shifted past the update args (the rewrite is regex-based, so multi-digit placeholders such as $10 are not rewritten incorrectly), and the statement is executed with update args followed by filter args.
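The placeholder shift in the PostgreSQL fix can be sketched as follows. This is a minimal illustration of the technique, assuming both clauses are numbered independently from $1. The function names are hypothetical, and the actual store's code may differ.

```python
import re

# Matches a whole positional placeholder such as $1 or $10.
_PLACEHOLDER = re.compile(r"\$(\d+)")


def shift_placeholders(where_clause, offset):
    """Renumber every $n in where_clause to $(n + offset)."""
    return _PLACEHOLDER.sub(
        lambda m: f"${int(m.group(1)) + offset}", where_clause
    )


def build_update(table, set_clause, update_args, where_clause, filter_args):
    """Combine SET and WHERE clauses that were each numbered from $1.

    WHERE placeholders are shifted past the update args, and the
    argument list is update args followed by filter args.
    """
    shifted = shift_placeholders(where_clause, len(update_args))
    sql = f"UPDATE {table} SET {set_clause} WHERE {shifted}"
    return sql, [*update_args, *filter_args]
```

Matching the full digit run matters: a plain string replacement of "$1" with "$3" would also turn "$10" into "$30". Note that this sketch would also rewrite a $n sequence that appears inside a quoted SQL string literal, so it assumes clauses that contain only real placeholders.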

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.12.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.11.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.12.0 \
  --namespace nvsentinel \
  --reuse-values