Release v1.5.0

@github-actions github-actions released this 11 May 12:30
· 87 commits to main since this release
v1.5.0
489b522

This release significantly expands the NIC Health Monitor (alpha) with NIC driver syslog detection and counter-based degradation checks for InfiniBand and Ethernet/RoCE, fixes the GPU reset workflow to inject devices via HostPath volumes for more reliable resets, adds a versioned docs dropdown with content pinning, and ships a critical fix for faultRemediated decoding on wrapped BoolValue records.

Major New Features

NIC Health Monitor Expansion (Alpha)

Status: Alpha. APIs, Helm values, and event schemas may change in future releases. Not yet recommended for production use; feedback and bug reports are welcome.

Building on the link-state detection introduced in v1.4.0, the NIC Health Monitor now adds NIC driver syslog detection, counter-based degradation checks, and end-to-end test coverage.

  • NIC Driver Syslog Detection (#1257): New SysLogsNICDriverError check in the syslog-health-monitor detects mlx5_core NIC driver/firmware errors from kernel logs. Ships 8 default patterns (3 Fatal: cmd_exec_timeout, health_poll_failed, unrecoverable_err; 5 Non-Fatal: netdev_watchdog, pci_power_insufficient, port_module_high_temp, access_reg_failed, module_unplugged) verified against upstream Linux kernel source. Fatal patterns publish a node condition with a REPLACE_VM recommended action; non-fatal patterns emit Kubernetes events with no remediation. Operator-configurable via a TOML ConfigMap with BDF extraction for resolving the affected NIC entity. Includes Prometheus metrics for driver-error events.
  • Counter-Based Degradation Detection (#1248): Adds counter polling for InfiniBand and Ethernet/RoCE NICs on a dedicated 1s loop, with latching, reset/recovery, and both delta and velocity (per-second / per-minute) evaluation. New counterDetection configuration block supports per-counter thresholds, velocity units, fatality flags, and recommended actions. Fatal counter breaches (e.g., link_downed, excessive_buffer_overrun_errors) publish node conditions; non-fatal counters (e.g., symbol_error, roce_slow_restart, carrier_changes) emit Kubernetes events. Latched breaches persist until the underlying counter resets.
  • Configuration Hardening (#1271): NIC counter configuration is now restricted to a hardcoded allowlist of supported counter name/path pairs (thresholds and severity remain operator-tunable). Startup validation now rejects unsupported or mismatched counter selections.
  • Tilt End-to-End Tests (#1249, #1260): Comprehensive Tilt e2e coverage for both counter detection (TestNICCounterIBDegradation, TestNICCounterEthernetDegradation, TestNICCounterBelowThreshold, TestNICCounterBootIDClearsBreachState) and syslog NIC driver detection (TestSyslogHealthMonitorNICDriverDetection), exercising fatal/non-fatal paths, multi-device faults, threshold validation, latch/clear behavior, and boot-ID recovery. AWS/GCP integration and e2e workflow timeouts were raised from 60 to 75 minutes to accommodate the new coverage.
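
The counter-detection behavior described above (delta and velocity evaluation, latching, and clearing on counter reset) can be sketched roughly as follows. This is an illustrative Python model, not the monitor's actual implementation: the CounterRule and CounterState types, the evaluate function, and the strict "greater than threshold" comparison are all assumptions made for the example. Only the counter names come from the release notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterRule:
    """Hypothetical per-counter rule; field names are illustrative only."""
    name: str
    threshold: float
    mode: str                    # "delta" or "velocity"
    fatal: bool
    window_seconds: float = 1.0  # velocity unit: 1.0 = per second, 60.0 = per minute


@dataclass
class CounterState:
    last_value: Optional[int] = None
    last_time: Optional[float] = None
    latched: bool = False


def evaluate(rule: CounterRule, state: CounterState, value: int, now: float) -> str:
    """Return "ok", "event" (non-fatal breach) or "condition" (fatal breach)."""
    if state.last_value is None or state.last_time is None:
        # First sample only establishes a baseline.
        state.last_value, state.last_time = value, now
        return "ok"

    if value < state.last_value:
        # The counter went backwards, so treat it as a reset:
        # clear the latch and start again from the new baseline.
        state.latched = False
        state.last_value, state.last_time = value, now
        return "ok"

    delta = value - state.last_value
    elapsed = now - state.last_time
    state.last_value, state.last_time = value, now

    if rule.mode == "delta":
        observed = float(delta)
    else:
        # Velocity: increments normalised to the configured unit.
        observed = delta / elapsed * rule.window_seconds if elapsed > 0 else 0.0

    if observed > rule.threshold:  # assumption: strict comparison
        state.latched = True

    if not state.latched:
        return "ok"
    # A latched breach keeps reporting until the counter resets.
    return "condition" if rule.fatal else "event"
```

In this model a fatal rule such as link_downed with a threshold of 0 reports "condition" on the first increment and keeps reporting it on later polls, even if the counter stops moving, until the raw value drops back below the last sample.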

GPU Reset HostPath Architecture (#1243)

Fixed the GPU reset workflow to inject GPU devices via HostPath volumes instead of NVIDIA_VISIBLE_DEVICES, eliminating a class of failures where the privileged reset pod could not have GPUs injected on nodes with reset-pending XIDs (nvidia-container-cli.real: detection error: nvml error: gpu requires reset).

  • The reset job now mounts /run/nvidia/driver and /sys via HostPath and invokes nvidia-smi through chroot /run/nvidia/driver. NVIDIA_VISIBLE_DEVICES=void disables nvidia-container-toolkit injection for the reset container.
  • Manual persistence-mode toggling is removed — nvidia-smi --gpu-reset now handles persistence mode automatically because /run/nvidia-persistenced is available through the driver mount.
  • The gpu-feature-discovery pod is now evicted in addition to nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter, since GFD opens NVML handles in a loop that can block resets.
  • HostNetwork=true is no longer required for the reset workflow.
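
As a rough illustration of the points above, the following Python sketch builds a pod spec fragment with the HostPath mounts, the NVIDIA_VISIBLE_DEVICES=void setting, and the chroot invocation. It is not the actual reset job manifest: the function name, container name, image argument, the use of -i to select a GPU, and the choice to mount each host path at the same path inside the container are assumptions for the example.

```python
from typing import Any, Dict


def build_reset_pod_spec(node_name: str, gpu_uuid: str, image: str) -> Dict[str, Any]:
    """Build a hypothetical pod spec fragment for a HostPath-based GPU reset."""
    return {
        "nodeName": node_name,
        "restartPolicy": "Never",
        # Note: no hostNetwork key; the release notes say it is no longer required.
        "containers": [
            {
                "name": "gpu-reset",  # illustrative name
                "image": image,       # caller-supplied; not taken from the release
                "securityContext": {"privileged": True},
                # "void" disables nvidia-container-toolkit device injection.
                "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "void"}],
                # Run nvidia-smi from the driver root mounted from the host.
                "command": [
                    "chroot", "/run/nvidia/driver",
                    "nvidia-smi", "--gpu-reset", "-i", gpu_uuid,
                ],
                # Assumption: container mount paths mirror the host paths.
                "volumeMounts": [
                    {"name": "nvidia-driver", "mountPath": "/run/nvidia/driver"},
                    {"name": "sys", "mountPath": "/sys"},
                ],
            }
        ],
        "volumes": [
            {"name": "nvidia-driver", "hostPath": {"path": "/run/nvidia/driver"}},
            {"name": "sys", "hostPath": {"path": "/sys"}},
        ],
    }
```

Because the devices arrive through the host filesystem rather than through the container toolkit, the pod can start even when NVML reports that the GPU requires a reset.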

Versioned Docs Dropdown (#1263, #1262)

The Fern docs site now ships a version dropdown with true content pinning for frozen releases.

  • Replaces the single dev version with Latest plus frozen entries (v1.2.0, v1.3.0, v1.4.0).
  • Frozen versions serve docs extracted from their git tags via git archive at publish time — users no longer see drift between docs and the release they're running.
  • The Latest display-name is stamped with the current GitHub release tag during CI.
  • Publish workflow gains a concurrency group, set -o pipefail, and step-summary URLs.
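
The tag-based extraction described above could look something like the Python sketch below. It is only an approximation of the idea, not the publish workflow itself: the function names, the docs directory, and the destination path are hypothetical, and the real workflow is a CI script rather than Python. The explicit exit-code check on the git archive process plays the role that set -o pipefail plays in a shell pipeline.

```python
import subprocess
from typing import List, Tuple


def archive_commands(tag: str, docs_path: str, dest: str) -> Tuple[List[str], List[str]]:
    """Return the producer and consumer argv lists for `git archive | tar -x`."""
    archive = ["git", "archive", "--format=tar", tag, docs_path]
    extract = ["tar", "-x", "-C", dest]
    return archive, extract


def extract_docs(tag: str, docs_path: str, dest: str) -> None:
    """Extract docs_path as it existed at tag into dest (dest must already exist)."""
    archive, extract = archive_commands(tag, docs_path, dest)
    producer = subprocess.Popen(archive, stdout=subprocess.PIPE)
    subprocess.run(extract, stdin=producer.stdout, check=True)
    if producer.stdout is not None:
        producer.stdout.close()
    # Fail if git archive itself failed, even when tar exited cleanly.
    if producer.wait() != 0:
        raise subprocess.CalledProcessError(producer.returncode, archive)
```

A frozen entry would then be populated with a call such as extract_docs("v1.4.0", "fern", "/tmp/docs-v1.4.0"), where both paths are placeholders.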

Bug Fixes & Reliability

  • faultRemediated Decoding for Wrapped BoolValue Records (#1255): Fixed health event status decoding when healtheventstatus.faultremediated is stored as a protobuf BoolValue document ({"value": false}) while the datastore model exposes it as *bool. Adds JSON/BSON compatibility for both wrapped and plain boolean shapes so legacy datastore records and proto/change-stream payloads both decode safely; writers continue emitting the wrapped shape expected by proto/change-stream consumers. Normalizes legacy plain-boolean values before proto unmarshalling in the shared event parser — fixes decoding on the node-drainer query path against MongoDB and PostgreSQL.
  • Fern Docs CI Hardening (#1251, #1250): Closed a regex gap in the MDX safety check that let bare <img> tags slip through, replaced the hardcoded fern-api@4.42.1 pin with a dynamic lookup against fern/fern.config.json, and added the production custom domain (docs.nvidia.com/nvsentinel) and canonical-host metadata.
  • macOS Dev Environment Setup (#1126, #1125): make dev-env-setup now works end-to-end on macOS (Apple Silicon). Replaces wget (not installed by default on macOS) with curl, installs Go via Homebrew when the manual tar extraction would fail, installs missing tools (addlicense, protoc-gen-go, golangci-lint, gotestsum, gocover-cobertura) at the pinned versions from .versions.yaml, adds GOPATH/bin to PATH, and uses pipx for Poetry to avoid PEP 668 externally-managed-environment errors.
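
To make the faultRemediated fix above more concrete, here is a minimal Python sketch of tolerant decoding for the two shapes, the wrapped {"value": false} document and a plain boolean. The project's real code is not shown in these notes, so the function names are hypothetical, and treating a wrapper with no "value" key as false is an assumption based on protobuf default-value semantics.

```python
from typing import Any, Dict, Optional


def decode_fault_remediated(raw: Any) -> Optional[bool]:
    """Decode a faultremediated field stored as a wrapped or plain boolean."""
    if raw is None:
        return None                  # field absent: maps to a nil *bool
    if isinstance(raw, bool):
        return raw                   # legacy plain-boolean record
    if isinstance(raw, dict):
        # Wrapped BoolValue shape; assumption: a missing key means false.
        value = raw.get("value", False)
        if isinstance(value, bool):
            return value
    raise TypeError(f"unsupported faultremediated shape: {raw!r}")


def encode_fault_remediated(value: Optional[bool]) -> Optional[Dict[str, bool]]:
    """Writers keep emitting the wrapped shape expected by proto consumers."""
    return None if value is None else {"value": value}
```

Decoding accepts both shapes while encoding always produces the wrapped one, so a legacy record that is read and written back ends up in the wrapped form.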

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.4.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.0 \
  --namespace nvsentinel \
  --reuse-values