Release v1.5.0
This release significantly expands the NIC Health Monitor (alpha) with NIC driver syslog detection and counter-based degradation checks for InfiniBand and Ethernet/RoCE, fixes the GPU reset workflow to inject devices via HostPath volumes for more reliable resets, adds a versioned docs dropdown with content pinning, and ships a critical fix for faultRemediated decoding on wrapped BoolValue records.
Major New Features
NIC Health Monitor Expansion (Alpha)
Status: Alpha. APIs, Helm values, and event schemas may change in future releases. Not recommended for production use yet — feedback and bug reports welcome.
Building on the link-state detection introduced in v1.4.0, the NIC Health Monitor now adds NIC driver syslog detection, counter-based degradation checks, and end-to-end test coverage.
- NIC Driver Syslog Detection (#1257): New SysLogsNICDriverError check in the syslog-health-monitor detects mlx5_core NIC driver/firmware errors from kernel logs. Ships 8 default patterns (3 Fatal: cmd_exec_timeout, health_poll_failed, unrecoverable_err; 5 Non-Fatal: netdev_watchdog, pci_power_insufficient, port_module_high_temp, access_reg_failed, module_unplugged) verified against upstream Linux kernel source. Fatal patterns publish a node condition with a REPLACE_VM recommended action; non-fatal patterns emit Kubernetes events with no remediation. Operator-configurable via a TOML ConfigMap with BDF extraction for resolving the affected NIC entity. Includes Prometheus metrics for driver-error events.
- Counter-Based Degradation Detection (#1248): Adds counter polling for InfiniBand and Ethernet/RoCE NICs on a dedicated 1s loop, with latching, reset/recovery, and both delta and velocity (per-second / per-minute) evaluation. New counterDetection configuration block supports per-counter thresholds, velocity units, fatality flags, and recommended actions. Fatal counter breaches (e.g., link_downed, excessive_buffer_overrun_errors) publish node conditions; non-fatal counters (e.g., symbol_error, roce_slow_restart, carrier_changes) emit Kubernetes events. Latched breaches persist until the underlying counter resets.
- Configuration Hardening (#1271): NIC counter configuration is now restricted to a hardcoded allowlist of supported counter name/path pairs (thresholds and severity remain operator-tunable). Startup validation now rejects unsupported or mismatched counter selections.
- Tilt End-to-End Tests (#1249, #1260): Comprehensive Tilt e2e coverage for both counter detection (TestNICCounterIBDegradation, TestNICCounterEthernetDegradation, TestNICCounterBelowThreshold, TestNICCounterBootIDClearsBreachState) and syslog NIC driver detection (TestSyslogHealthMonitorNICDriverDetection), exercising fatal/non-fatal paths, multi-device faults, threshold validation, latch/clear behavior, and boot-ID recovery. AWS/GCP integration and e2e workflow timeouts were bumped from 60 to 75 minutes to accommodate the new coverage.
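
To make the delta/velocity evaluation and latching behavior concrete, here is a minimal Python sketch of how a single counter could be tracked across polls. It is not the monitor's implementation: the CounterRule and CounterTracker names, their fields, and the choice to treat a decreasing counter value as a reset are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterRule:
    """Hypothetical per-counter rule; field names are illustrative only."""
    name: str
    threshold: float          # breach when the measured value exceeds this
    mode: str                 # "delta" or "velocity"
    per_seconds: float = 1.0  # velocity unit: 1.0 = per second, 60.0 = per minute
    fatal: bool = False


class CounterTracker:
    """Tracks one counter across polls and latches a breach once seen."""

    def __init__(self, rule: CounterRule):
        self.rule = rule
        self.last_value: Optional[int] = None
        self.last_time: Optional[float] = None
        self.latched = False

    def observe(self, value: int, now: float) -> bool:
        """Record a sample and return whether a breach is currently latched."""
        if self.last_value is None:
            # The first sample only establishes a baseline.
            self.last_value, self.last_time = value, now
            return self.latched

        if value < self.last_value:
            # Assumption: a counter that goes backwards was reset, so clear the latch.
            self.latched = False
            self.last_value, self.last_time = value, now
            return self.latched

        delta = value - self.last_value
        elapsed = now - self.last_time
        if self.rule.mode == "delta":
            measured = float(delta)
        else:
            # Scale the rate to the configured unit (per second or per minute).
            measured = delta / elapsed * self.rule.per_seconds if elapsed > 0 else 0.0

        if measured > self.rule.threshold:
            self.latched = True
        self.last_value, self.last_time = value, now
        return self.latched
```

With a per-minute velocity rule and a threshold of 10, a single increment observed over one second corresponds to 60 per minute and latches the breach; later polls with no further increments keep it latched until the counter value drops.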
GPU Reset HostPath Architecture (#1243)
Fixed the GPU reset workflow to inject GPU devices via HostPath volumes instead of NVIDIA_VISIBLE_DEVICES, eliminating a class of failures where the privileged reset pod could not have GPUs injected on nodes with reset-pending XIDs (nvidia-container-cli.real: detection error: nvml error: gpu requires reset).
- The reset job now mounts /run/nvidia/driver and /sys via HostPath and invokes nvidia-smi through chroot /run/nvidia/driver. NVIDIA_VISIBLE_DEVICES=void disables nvidia-container-toolkit injection for the reset container.
- Manual persistence-mode toggling is removed: nvidia-smi --gpu-reset now handles persistence mode automatically because /run/nvidia-persistenced is available through the driver mount.
- The gpu-feature-discovery pod is now evicted in addition to nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter, since GFD opens NVML handles in a loop that can block resets.
- HostNetwork=true is no longer required for the reset workflow.
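
The layout described above can be sketched as a pod spec built in Python. This is an illustration under stated assumptions, not the manifest shipped in the chart: the container and volume names, the in-container mount paths, and the use of -i to select the GPU are hypothetical choices; only the HostPath sources, the chroot invocation, and NVIDIA_VISIBLE_DEVICES=void come from the notes above.

```python
def build_reset_pod_spec(image: str, gpu_id: str) -> dict:
    """Return a pod spec dict mirroring the HostPath-based reset layout.

    Names such as "gpu-reset", "driver-root", and "sysfs" are hypothetical.
    """
    return {
        "restartPolicy": "Never",
        "containers": [{
            "name": "gpu-reset",
            "image": image,
            "securityContext": {"privileged": True},
            # "void" stops nvidia-container-toolkit from injecting GPUs.
            "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "void"}],
            # Run the host driver's nvidia-smi from inside the driver root.
            "command": [
                "chroot", "/run/nvidia/driver",
                "nvidia-smi", "--gpu-reset", "-i", gpu_id,
            ],
            "volumeMounts": [
                # Assumed mount paths: same path inside the container as on the host.
                {"name": "driver-root", "mountPath": "/run/nvidia/driver"},
                {"name": "sysfs", "mountPath": "/sys"},
            ],
        }],
        "volumes": [
            {"name": "driver-root", "hostPath": {"path": "/run/nvidia/driver"}},
            {"name": "sysfs", "hostPath": {"path": "/sys"}},
        ],
    }
```

Because the GPU is reached through the host's driver root rather than toolkit injection, the pod can start even when NVML reports that the GPU requires a reset.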
Versioned Docs Dropdown (#1263, #1262)
The Fern docs site now ships a version dropdown with true content pinning for frozen releases.
- Replaces the single dev version with Latest plus frozen entries (v1.2.0, v1.3.0, v1.4.0).
- Frozen versions serve docs extracted from their git tags via git archive at publish time, so users no longer see drift between docs and the release they're running.
- The Latest display-name is stamped with the current GitHub release tag during CI.
- Publish workflow gains a concurrency group, set -o pipefail, and step-summary URLs.
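
As a rough illustration of pinning content to a tag, the Python sketch below builds a git archive command that exports one directory from a release tag under a per-version prefix. The helper name, the docs directory, and the output prefix are hypothetical; the actual publish workflow may invoke git archive differently.

```python
from typing import List


def git_archive_command(tag: str, docs_path: str, prefix: str) -> List[str]:
    """Build a `git archive` invocation that exports docs_path as of `tag`.

    The archive is written to stdout as a tar stream, with every entry
    placed under `prefix/`. All argument values here are illustrative.
    """
    return [
        "git", "archive",
        "--format=tar",
        f"--prefix={prefix}/",
        tag,
        docs_path,
    ]
```

For example, git_archive_command("v1.4.0", "docs", "versions/v1.4.0") yields a command whose output can be piped to tar -x, so the frozen entry serves the files exactly as they existed at that tag.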
Bug Fixes & Reliability
- faultRemediated Decoding for Wrapped BoolValue Records (#1255): Fixed health event status decoding when healtheventstatus.faultremediated is stored as a protobuf BoolValue document ({"value": false}) while the datastore model exposes it as *bool. Adds JSON/BSON compatibility for both wrapped and plain boolean shapes so legacy datastore records and proto/change-stream payloads both decode safely; writers continue emitting the wrapped shape expected by proto/change-stream consumers. Normalizes legacy plain-boolean values before proto unmarshalling in the shared event parser, which fixes decoding on the node-drainer query path against MongoDB and PostgreSQL.
- Fern Docs CI Hardening (#1251, #1250): Closed a regex gap in the MDX safety check that let bare <img> tags slip through, replaced the hardcoded fern-api@4.42.1 pin with a dynamic lookup against fern/fern.config.json, and added the production custom domain (docs.nvidia.com/nvsentinel) and canonical-host metadata.
- macOS Dev Environment Setup (#1126, #1125): make dev-env-setup now works end-to-end on macOS (Apple Silicon). Replaces wget (not installed by default on macOS) with curl, installs Go via Homebrew when the manual tar extraction would fail, installs missing tools (addlicense, protoc-gen-go, golangci-lint, gotestsum, gocover-cobertura) at the pinned versions from .versions.yaml, adds GOPATH/bin to PATH, and uses pipx for Poetry to avoid PEP 668 externally-managed-environment errors.
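
The dual-shape handling in the faultRemediated fix can be illustrated with a small Python sketch. The real change lives in the project's own JSON/BSON decoders; the function names here are hypothetical, and treating a wrapped document with no value key as false is an assumption based on the proto default.

```python
from typing import Any, Optional


def decode_fault_remediated(raw: Any) -> Optional[bool]:
    """Accept a wrapped BoolValue document, a plain boolean, or a missing value."""
    if raw is None:
        return None  # field unset, comparable to a nil *bool
    if isinstance(raw, bool):
        return raw  # legacy plain-boolean shape
    if isinstance(raw, dict):
        # Wrapped shape, e.g. {"value": false}; a missing key is assumed
        # to mean the proto default of false.
        value = raw.get("value", False)
        if isinstance(value, bool):
            return value
    raise ValueError(f"unsupported faultremediated shape: {raw!r}")


def encode_fault_remediated(value: Optional[bool]) -> Optional[dict]:
    """Writers keep emitting the wrapped shape expected by proto consumers."""
    return None if value is None else {"value": value}
```

Decoding tolerates both shapes while encoding stays on the wrapped one, so older plain-boolean records remain readable without changing what downstream consumers receive.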
Acknowledgments
This release includes contributions from:
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.5.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.4.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.5.0 \
--namespace nvsentinel \
--reuse-values