Skip to content

Release v1.4.0

Choose a tag to compare

@github-actions github-actions released this 04 May 11:33
· 102 commits to main since this release
v1.4.0
f2bf38f

Release v1.4.0

This release introduces the NIC Health Monitor (alpha) with link state detection for InfiniBand and Ethernet/RoCE, GPU reset reboot fallback for failed resets, automatic TTL-based cleanup of janitor CRs, DCGM 4.5 support, Topograph topology label propagation, preflight resilience improvements, and a major security/infrastructure refresh (Debian 13, Go 1.26, CVE remediation).

Major New Features

NIC Health Monitor (Alpha) (#1211)

Status: Alpha. APIs, Helm values, and event schemas may change in future releases. Not recommended for production use yet — feedback and bug reports welcome.

The NIC Health Monitor (introduced as a scaffold in v1.2.0) now has working link state detection, automatic NIC classification, and a deployable Helm chart.

  • State Detection Logic (#1229): Implements InfiniBandStateCheck and EthernetStateCheck that poll sysfs for port state changes and emit health events when ports go DOWN, recover, or disappear. Monitors logical state (ACTIVE/DOWN), physical state (LinkUp/Disabled), and device disappearance. Persistent state survives pod restarts via a hostPath-backed JSON file with boot-ID comparison to detect host reboots and re-emit healthy baselines. SR-IOV Virtual Functions are automatically excluded.
  • Automatic NIC Classification: NICs are classified as Compute, Storage, or Management using NUMA locality plus topology matrix. Management NICs (default route) are excluded from monitoring to avoid false REPLACE_VM remediations. First-poll suppression avoids false FATAL events for expected-down ports using card homogeneity analysis.
  • Topology Collection in Metadata Collector (#1220): The metadata collector now runs nvidia-smi topo -m once at startup, parses the GPU-NIC relationship matrix, and publishes the raw matrix and per-GPU NUMA affinity in gpu_metadata.json. Handles multiple real-world output formats (DGX H100, OCI H100, GB200). The NIC monitor reads this JSON for classification — no nvidia-smi runtime dependency.
  • Helm Chart & Tilt Tests (#1237): DaemonSet-based deployment with configurable processing strategy, polling interval, sysfs paths, and NIC inclusion/exclusion regex. Comprehensive end-to-end tests covering detection, persistence, boot-ID handling, VF exclusion, multi-port faults, and recovery. Chart is disabled by default — opt in via Helm values.
  • GB200 Topology Support (#1221): Design refinements for zero-configuration NIC role classification on GB200 platforms.

GPU Reset Reboot Fallback (#1240)

When a GPU reset fails, NVSentinel now automatically falls back to a node reboot instead of leaving the GPU unrecovered.

  • The syslog-health-monitor now emits an unhealthy event with a RESTART_VM recommended action when it detects a failed GPU reset (in addition to the existing healthy event for successful resets). The resulting reboot clears both the original XID and the failed-reset event, and the node is automatically uncordoned.
  • New writeSyslogEvent option in the gpuResetController Janitor config (default: true). Set to false to suppress syslog event writes for GPU resets triggered outside of NVSentinel — useful for debugging failed resets without triggering a reboot.

Janitor TTL-Based CR Cleanup (#1206, #370)

Implements ADR-037 to prevent unbounded growth of completed maintenance CRs.

  • New TTL controller in janitor automatically deletes completed RebootNode, TerminateNode, and GPUReset CRs after their configured TTL expires.
  • Per-CR TTL is set via the nvsentinel.nvidia.com/ttl annotation (e.g., "30s", "24h", "7d").
  • Eliminates manual CR cleanup and prevents etcd bloat in long-running clusters.

DCGM 4.5 Support (#1239)

Updated gpu-health-monitor to DCGM 4.5.2, adding support for new error codes and diagnostic features introduced in DCGM 4.5. Backward compatible with DCGM 4.x deployments.

Topograph Topology Labels (#1226)

The metadata-augmentor's default allowedLabels now includes the four topology labels written by NVIDIA/topograph:

  • network.topology.nvidia.com/accelerator — NVLink domain (clique) ID
  • network.topology.nvidia.com/leaf / /spine / /core — switch hierarchy levels

When Topograph is deployed, NVSentinel automatically propagates these labels into health event metadata, enabling downstream consumers (CEL rules, remediation CRs, dashboards, blast-radius analysis) to reason about topological locality without operator-side configuration. On clusters without Topograph, the labels are absent and the augmentor simply skips them — no behavior change.

Bug Fixes & Reliability

  • Preflight STORE_ONLY Resilience and Stale Gang ConfigMap Pruning (#1210): Two fixes for preflight rescheduling loops. (1) When a preflight DCGM container crashes mid-diagnostic (e.g., OOMKill), the next preflight container on the same node now calls dcgmStopDiagnostic() before RunDiagnostic() to clear zombie diagnostics. The STORE_ONLY gate is fixed across all three preflight containers (dcgm-diag, nccl-loopback, nccl-allreduce) so observability-mode failures correctly exit 0 instead of triggering reschedules. (2) Stale gang ConfigMap entries from rescheduled pods (which get new names) are now pruned, preventing accumulation of zombie peer entries that broke gang validation.
  • HEA XID Burst Grouping for PostgreSQL (#1222, #1191): Fixed health-events-analyzer XID burst detection grouping key for the PostgreSQL code path from node(node, GPU_UUID), matching MongoDB's per-GPU aggregation. Multi-GPU nodes now get correct per-GPU burst statistics.
  • Metadata Collector Kubelet Default (#1235): Fixed kubeletHost default to use status.hostIP via the Downward API instead of localhost. The localhost default assumed kubelet bound to 0.0.0.0, causing CrashLoopBackOff on clusters where kubelet binds to the node's primary IP only (common in many distributions).
  • PSMDB ArgoCD Drift (#1230, #1231): Fixed perpetual ArgoCD OutOfSync on PerconaServerMongoDB CR by conditionally rendering metadata.finalizers only when non-empty. Kubernetes was stripping the empty array, causing target/live drift.
  • GPU Reset UAT PCI ID Normalization (#1216): Fixed PCI ID normalization in the GPU reset UAT to handle PCI IDs where the first 8 hex digits are not all zero, preventing invalid PCI IDs from being injected into syslog events.
  • Pre-initialized XID Metrics (#1209, #1196): syslog-health-monitor now pre-initializes all XID counter/error metrics with zero values for compatibility with Google Managed Prometheus and other systems that require all label series to be present at scrape time.

Security & Infrastructure

Documentation

  • Audit Logging Guide (#1227): User-facing documentation for NVSentinel's audit logging — file-based logs for HTTP write operations, logged fields, storage/rotation behavior, verification and collection steps, troubleshooting, and security considerations around request-body capture and hostPath exposure. Includes Helm values examples and updates to the Observability navigation.

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributors @resker, @shaq918, and @erezzarum.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.4.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.3.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.4.0 \
  --namespace nvsentinel \
  --reuse-values