Release v1.12.0
This release prevents the DCGM connectivity error that fired on every new node during GPU Operator bootstrapping, adds device-count labels from the labeler so downstream consumers can detect nodes reporting fewer GPUs or NICs than expected, adds an opt-in Magic SysRq reboot path for the generic bare-metal provider, and includes reliability fixes for the labeler, node-drainer, and the PostgreSQL store.
Major New Features
Prevent DCGM Connectivity Errors on Node Bootstrapping (#1425, #1423)
On a freshly launched node, the gpu-health-monitor pod could become ready before the GPU Operator's nvidia-dcgm pod finished its init-container startup sequence, producing a GpuDcgmConnectivityFailure unhealthy condition (DCGM_CONNECTIVITY_ERROR, CONTACT_SUPPORT) that only cleared minutes later once DCGM came up. The gpu-health-monitor is no longer scheduled until the nvidia-dcgm pod on the node is ready, so a normal node bootstrap no longer emits a false connectivity error. Node-deletion teardown behavior is unchanged.
Expected Device-Count Labels from Labeler (#1395)
The labeler can now write normalized current and expected device-count labels (e.g. nvsentinel.dgxc.nvidia.com/gpu.count.current / .expected) onto nodes, giving downstream modules a signal for detecting nodes that advertise fewer devices than their peers. Current count is derived from a configurable CEL expression supporting both device-plugin/GFD-style node labels and DRA ResourceSlice advertisements; expected count is either learned from peers in the same grouping-label partition or pinned via per-class overrides. The feature is configured per device class (GPU, NIC) in a TOML ConfigMap and is disabled by default. See ADR-043 for the design.
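To illustrate how a downstream consumer might reason about the current and expected counts, here is a minimal Python sketch. It is not the labeler's implementation: the release notes do not say how the expected count is learned from peers, so the sketch assumes the most common current count within a partition. The node dictionary shape, the function names, and the `pinned` parameter (standing in for a per-class override) are all hypothetical.

```python
from collections import Counter, defaultdict


def expected_by_partition(nodes, grouping_label, pinned=None):
    """Return {partition value: expected device count} for one device class.

    ASSUMPTION: "learned from peers" is modeled as the most common current
    count in the partition; the real labeler may use a different rule.
    `pinned` models a per-class override and, when set, wins everywhere.
    """
    partitions = defaultdict(list)
    for node in nodes:
        key = node["labels"].get(grouping_label, "")
        partitions[key].append(node["current"])

    expected = {}
    for key, counts in partitions.items():
        if pinned is not None:
            expected[key] = pinned
        else:
            expected[key] = Counter(counts).most_common(1)[0][0]
    return expected


def underreporting(nodes, grouping_label, pinned=None):
    """Return names of nodes whose current count is below the expected count."""
    expected = expected_by_partition(nodes, grouping_label, pinned)
    return [
        node["name"]
        for node in nodes
        if node["current"] < expected[node["labels"].get(grouping_label, "")]
    ]


# Hypothetical sample data: two partitions keyed by a "pool" label.
SAMPLE_NODES = [
    {"name": "n1", "labels": {"pool": "a"}, "current": 8},
    {"name": "n2", "labels": {"pool": "a"}, "current": 8},
    {"name": "n3", "labels": {"pool": "a"}, "current": 4},
    {"name": "n4", "labels": {"pool": "b"}, "current": 4},
    {"name": "n5", "labels": {"pool": "b"}, "current": 4},
]
```

With the sample data, only n3 falls below its partition's learned value; pinning the class to 8 would also flag both nodes in pool b, which shows why an override is useful when a whole partition is degraded.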
Opt-In SysRq Reboot for the Generic Bare-Metal Provider (#1418)
The generic bare-metal janitor provider now supports an opt-in Linux Magic SysRq reboot mode (janitor-provider.csp.generic.useSysrqReboot=true), which reboots a node by writing b to /proc/sysrq-trigger from a privileged Job rather than using the default chroot-based reboot. This is useful on hosts where the chroot path is unreliable. The existing chroot-based reboot remains the default, so existing deployments are unaffected.
Bug Fixes & Reliability
- Lazily initialize ResourceSlice informers in the labeler (#1422): The labeler eagerly started a DRA ResourceSlice informer even when device-count detection used only the device-plugin method, spamming "failed to list *v1.ResourceSlice: the server could not find the requested resource" errors on clusters without the resource.k8s.io API. The informer is now initialized lazily, only when a class actually requires ResourceSlice data, and string digits are normalized to numbers during count evaluation. Follow-up to the device-count feature (#1395).
- Node-drainer ignores stale AlreadyQuarantined events (#1419, #1415): A stale AlreadyQuarantined event re-enqueued via a later change-stream update (after the node had already been unquarantined and its quarantineHealthEvent annotation removed) was treated as "not already drained" and fell through to normal drain evaluation. That marked the stale event Succeeded and mutated the node-state label (triggering an invalid none -> draining transition) despite there being no active quarantine context. The already-drained check now handles a missing annotation on a stale AlreadyQuarantined event correctly instead of proceeding to drain.
- Fixed PostgreSQL UpdateDocument placeholder collision (#1391): In the direct PostgreSQL store, UpdateDocument did not bind SET parameters before WHERE parameters, so combined update+filter statements could apply parameters in the wrong order. WHERE placeholders are now shifted after the update args (regex-based, so multi-digit placeholders such as $10 are not rewritten incorrectly) and executed with update args followed by filter args.
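The placeholder shift described in the UpdateDocument fix can be sketched in a few lines of Python. This is an illustration of the general technique only, not the project's code: the function names, the sample table, and the SQL are hypothetical. It shows why a regex that captures the whole number is safer than plain string replacement, which would turn $10 into $30 when rewriting $1 to $3.

```python
import re

# Matches a PostgreSQL positional placeholder and captures its full number,
# so "$10" is seen as 10 rather than as "$1" followed by "0".
_PLACEHOLDER = re.compile(r"\$(\d+)")


def shift_placeholders(where_clause, offset):
    """Renumber every $N in where_clause to $(N + offset)."""
    return _PLACEHOLDER.sub(
        lambda m: f"${int(m.group(1)) + offset}", where_clause
    )


def build_update(table, set_clause, update_args, where_clause, filter_args):
    """Combine SET and WHERE fragments that were each numbered from $1.

    The WHERE placeholders are shifted past the update args, and the final
    argument list is update args followed by filter args.
    """
    shifted = shift_placeholders(where_clause, len(update_args))
    sql = f"UPDATE {table} SET {set_clause} WHERE {shifted}"
    return sql, [*update_args, *filter_args]


# Hypothetical example: two SET parameters, one WHERE parameter.
SQL, ARGS = build_update(
    "documents",
    "data = $1, updated_at = $2",
    ["{}", "2024-01-01"],
    "id = $1",
    ["abc"],
)
```

Here the WHERE clause's $1 becomes $3, so it binds to the third argument ("abc") instead of colliding with the first SET parameter.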
Acknowledgments
This release includes contributions from:
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.12.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.11.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.12.0 \
--namespace nvsentinel \
--reuse-values