Release v1.4.0
This release introduces the NIC Health Monitor (alpha) with link state detection for InfiniBand and Ethernet/RoCE, GPU reset reboot fallback for failed resets, automatic TTL-based cleanup of janitor CRs, DCGM 4.5 support, Topograph topology label propagation, preflight resilience improvements, and a major security/infrastructure refresh (Debian 13, Go 1.26, CVE remediation).
Major New Features
NIC Health Monitor (Alpha) (#1211)
Status: Alpha. APIs, Helm values, and event schemas may change in future releases. Not recommended for production use yet — feedback and bug reports welcome.
The NIC Health Monitor (introduced as a scaffold in v1.2.0) now has working link state detection, automatic NIC classification, and a deployable Helm chart.
- State Detection Logic (#1229): Implements `InfiniBandStateCheck` and `EthernetStateCheck` that poll sysfs for port state changes and emit health events when ports go DOWN, recover, or disappear. Monitors logical state (ACTIVE/DOWN), physical state (LinkUp/Disabled), and device disappearance. Persistent state survives pod restarts via a hostPath-backed JSON file with boot-ID comparison to detect host reboots and re-emit healthy baselines. SR-IOV Virtual Functions are automatically excluded.
- Automatic NIC Classification: NICs are classified as Compute, Storage, or Management using NUMA locality plus topology matrix. Management NICs (default route) are excluded from monitoring to avoid false `REPLACE_VM` remediations. First-poll suppression avoids false FATAL events for expected-down ports using card homogeneity analysis.
- Topology Collection in Metadata Collector (#1220): The metadata collector now runs `nvidia-smi topo -m` once at startup, parses the GPU-NIC relationship matrix, and publishes the raw matrix and per-GPU NUMA affinity in `gpu_metadata.json`. Handles multiple real-world output formats (DGX H100, OCI H100, GB200). The NIC monitor reads this JSON for classification, so it has no `nvidia-smi` runtime dependency.
- Helm Chart & Tilt Tests (#1237): DaemonSet-based deployment with configurable processing strategy, polling interval, sysfs paths, and NIC inclusion/exclusion regex. Comprehensive end-to-end tests covering detection, persistence, boot-ID handling, VF exclusion, multi-port faults, and recovery. Chart is disabled by default; opt in via Helm values.
- GB200 Topology Support (#1221): Design refinements for zero-configuration NIC role classification on GB200 platforms.
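The state detection and boot-ID behavior described above can be illustrated with a minimal Python sketch. It is not the NVSentinel implementation: the function names, the event names, and the shape of the persisted record are hypothetical. It assumes the usual Linux sysfs format for InfiniBand port state (for example `4: ACTIVE` in `/sys/class/infiniband/<dev>/ports/<n>/state`) and a boot ID such as the one exposed at `/proc/sys/kernel/random/boot_id`. First-poll suppression and NIC classification are not modeled.

```python
from typing import Optional


def parse_sysfs_state(raw: str) -> str:
    """Turn a sysfs state string such as '4: ACTIVE' into 'ACTIVE'."""
    text = raw.strip()
    _, _, name = text.partition(":")
    # Fall back to the whole string if there is no 'N: NAME' prefix.
    return name.strip() or text


def decide_event(previous: Optional[dict], boot_id: str,
                 state: Optional[str]) -> Optional[str]:
    """Compare the persisted record with the current poll.

    `previous` is the hypothetical persisted record {'boot_id', 'state'};
    `state` is None when the port is no longer present in sysfs.
    Returns a hypothetical event name, or None when nothing changed.
    """
    if previous is None or previous["boot_id"] != boot_id:
        # First run or host reboot: old state is stale, re-emit a baseline.
        return "baseline"
    if previous["state"] == state:
        return None
    if state is None:
        return "disappeared"
    if state == "ACTIVE":
        return "recovered"
    return "down"
```

Keeping the boot ID next to the port state is what lets a restarted pod tell "same boot, state unchanged" apart from "host rebooted, previous state no longer meaningful".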
GPU Reset Reboot Fallback (#1240)
When a GPU reset fails, NVSentinel now automatically falls back to a node reboot instead of leaving the GPU unrecovered.
- The syslog-health-monitor now emits an unhealthy event with a `RESTART_VM` recommended action when it detects a failed GPU reset (in addition to the existing healthy event for successful resets). The resulting reboot clears both the original XID and the failed-reset event, and the node is automatically uncordoned.
- New `writeSyslogEvent` option in the `gpuResetController` Janitor config (default: `true`). Set to `false` to suppress syslog event writes for GPU resets triggered outside of NVSentinel; this is useful for debugging failed resets without triggering a reboot.
Janitor TTL-Based CR Cleanup (#1206, #370)
Implements ADR-037 to prevent unbounded growth of completed maintenance CRs.
- New TTL controller in janitor automatically deletes completed `RebootNode`, `TerminateNode`, and `GPUReset` CRs after their configured TTL expires.
- Per-CR TTL is set via the `nvsentinel.nvidia.com/ttl` annotation (e.g., `"30s"`, `"24h"`, `"7d"`).
- Eliminates manual CR cleanup and prevents etcd bloat in long-running clusters.
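As a rough illustration of how a TTL annotation value could be interpreted, the Python sketch below parses the three example forms and checks expiry. The helper names are hypothetical, the accepted unit set (`s`, `m`, `h`, `d`) is an assumption beyond the three examples, and measuring the TTL from the CR's completion time is also an assumption; ADR-037 is the authority on the real rules.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; only "s", "h", and "d" appear in the release notes.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
_TTL_RE = re.compile(r"^(\d+)([smhd])$")


def parse_ttl(value: str) -> timedelta:
    """Parse a TTL such as '30s', '24h', or '7d' into a timedelta."""
    match = _TTL_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported TTL value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_expired(completed_at: datetime, ttl: str, now: datetime) -> bool:
    """True once `now` has reached completion time plus the TTL."""
    return now >= completed_at + parse_ttl(ttl)


# Usage: a CR completed at midnight UTC with a 24h TTL, checked 25 hours later.
completed = datetime(2026, 1, 1, tzinfo=timezone.utc)
expired = is_expired(completed, "24h", completed + timedelta(hours=25))
```

A day suffix needs explicit handling in a sketch like this because common duration parsers, including Go's `time.ParseDuration`, stop at hours.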
DCGM 4.5 Support (#1239)
Updated gpu-health-monitor to DCGM 4.5.2, adding support for new error codes and diagnostic features introduced in DCGM 4.5. Backward compatible with DCGM 4.x deployments.
Topograph Topology Labels (#1226)
The metadata-augmentor's default allowedLabels now includes the four topology labels written by NVIDIA/topograph:
- `network.topology.nvidia.com/accelerator`: NVLink domain (clique) ID
- `network.topology.nvidia.com/leaf`, `/spine`, `/core`: switch hierarchy levels
When Topograph is deployed, NVSentinel automatically propagates these labels into health event metadata, enabling downstream consumers (CEL rules, remediation CRs, dashboards, blast-radius analysis) to reason about topological locality without operator-side configuration. On clusters without Topograph, the labels are absent and the augmentor simply skips them — no behavior change.
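The allowlist behavior can be pictured with a small Python sketch. The function name and the sample label values are hypothetical and the real augmentor may differ; the point is only that allowed labels are copied when present and silently skipped when absent.

```python
# The four Topograph label keys named in this release.
TOPOLOGY_LABELS = (
    "network.topology.nvidia.com/accelerator",
    "network.topology.nvidia.com/leaf",
    "network.topology.nvidia.com/spine",
    "network.topology.nvidia.com/core",
)


def propagate_labels(node_labels: dict, allowed=TOPOLOGY_LABELS) -> dict:
    """Copy only allowlisted labels that actually exist on the node."""
    return {key: node_labels[key] for key in allowed if key in node_labels}


# Hypothetical node with Topograph labels plus an unrelated label.
with_topograph = {
    "network.topology.nvidia.com/accelerator": "nvl-domain-1",
    "network.topology.nvidia.com/leaf": "leaf-07",
    "kubernetes.io/hostname": "node-a",
}
event_metadata = propagate_labels(with_topograph)

# On a cluster without Topograph the result is simply empty.
no_topograph = propagate_labels({"kubernetes.io/hostname": "node-b"})
```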
Bug Fixes & Reliability
- Preflight STORE_ONLY Resilience and Stale Gang ConfigMap Pruning (#1210): Two fixes for preflight rescheduling loops. (1) When a preflight DCGM container crashes mid-diagnostic (e.g., OOMKill), the next preflight container on the same node now calls `dcgmStopDiagnostic()` before `RunDiagnostic()` to clear zombie diagnostics. The `STORE_ONLY` gate is fixed across all three preflight containers (dcgm-diag, nccl-loopback, nccl-allreduce) so observability-mode failures correctly exit 0 instead of triggering reschedules. (2) Stale gang ConfigMap entries from rescheduled pods (which get new names) are now pruned, preventing accumulation of zombie peer entries that broke gang validation.
- HEA XID Burst Grouping for PostgreSQL (#1222, #1191): Fixed health-events-analyzer XID burst detection grouping key for the PostgreSQL code path from `node` to `(node, GPU_UUID)`, matching MongoDB's per-GPU aggregation. Multi-GPU nodes now get correct per-GPU burst statistics.
- Metadata Collector Kubelet Default (#1235): Fixed `kubeletHost` default to use `status.hostIP` via the Downward API instead of `localhost`. The `localhost` default assumed kubelet bound to `0.0.0.0`, causing CrashLoopBackOff on clusters where kubelet binds to the node's primary IP only (common in many distributions).
- PSMDB ArgoCD Drift (#1230, #1231): Fixed perpetual ArgoCD `OutOfSync` on `PerconaServerMongoDB` CR by conditionally rendering `metadata.finalizers` only when non-empty. Kubernetes was stripping the empty array, causing target/live drift.
- GPU Reset UAT PCI ID Normalization (#1216): Fixed PCI ID normalization in the GPU reset UAT to handle PCI IDs where the first 8 hex digits are not all zero, preventing invalid PCI IDs from being injected into syslog events.
- Pre-initialized XID Metrics (#1209, #1196): syslog-health-monitor now pre-initializes all XID counter/error metrics with zero values for compatibility with Google Managed Prometheus and other systems that require all label series to be present at scrape time.
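To see why the grouping key in the XID burst fix matters, the Python sketch below counts the same events per node and per `(node, GPU_UUID)`. The event fields and sample values are hypothetical and this is not the analyzer's query; it only contrasts the two keys.

```python
from collections import Counter


def burst_counts(events, per_gpu: bool = True) -> Counter:
    """Count XID events per (node, gpu_uuid) or, as before the fix, per node."""
    if per_gpu:
        return Counter((e["node"], e["gpu_uuid"]) for e in events)
    return Counter(e["node"] for e in events)


# Hypothetical events: three from one GPU, one from another, same node.
events = [
    {"node": "node-a", "gpu_uuid": "GPU-1"},
    {"node": "node-a", "gpu_uuid": "GPU-1"},
    {"node": "node-a", "gpu_uuid": "GPU-1"},
    {"node": "node-a", "gpu_uuid": "GPU-2"},
]

by_node = burst_counts(events, per_gpu=False)
by_gpu = burst_counts(events, per_gpu=True)
```

With a node-level key, all four events land in one bucket, so a burst threshold could be crossed by unrelated GPUs; the per-GPU key keeps each device's count separate.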
Security & Infrastructure
- Debian 13 Base Image (#1217): Standardized container base images from Debian 12 (bookworm, EOL June 2026) to Debian 13 (trixie) across all components.
- Go 1.26 Toolchain (#1218): Bumped Go to 1.26.2 across all modules with toolchain pinning for reproducible builds.
- CVE Remediation (#1194): Fixed multiple CVEs across components, including `go.opentelemetry.io/otel` (CVE-2026-39882, CVE-2026-24051), Python 3.13 base image upgrades (CVE-2026-4519, CVE-2025-13836), and OpenSSL updates (CVE-2026-2673, CVE-2026-28388).
- Hardened CI (#1219): All GitHub Actions pinned to commit SHAs. Updated `golang.org/x/net` to 0.53.0, fixing CVE-2026-27141, CVE-2025-22872, CVE-2025-22870, CVE-2025-58190, CVE-2025-47911, CVE-2024-45338, BDSA-2026-6099.
Documentation
- Audit Logging Guide (#1227): User-facing documentation for NVSentinel's audit logging — file-based logs for HTTP write operations, logged fields, storage/rotation behavior, verification and collection steps, troubleshooting, and security considerations around request-body capture and hostPath exposure. Includes Helm values examples and updates to the Observability navigation.
Acknowledgments
This release includes contributions from:
- @KaivalyaMDabhadkar
- @XRFXLP
- @natherz97
- @yysindi
- @resker
- @shaq918
- @rupalis-nv
- @pdmack
- @erezzarum
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributors @resker, @shaq918, and @erezzarum.
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.4.0 \
--namespace nvsentinel \
--create-namespace

To upgrade from v1.3.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.4.0 \
--namespace nvsentinel \
--reuse-values