Release v1.3.0
This release implements custom remediation actions for operator extensibility, completes OpenTelemetry tracing across the janitor, fault-remediation, and node-drainer modules, adds a demo showcasing custom health monitors, fixes syslog health monitor sidecar readiness issues, and includes Helm configuration improvements.
Major New Features
Custom Remediation Actions (#1141)
Operators can now define custom remediation actions beyond NVSentinel's built-in set (QUARANTINE_NODE, REBOOT_NODE, etc.), enabling support for domain-specific remediation like disk replacement, memory reclamation, or custom infrastructure actions.
- Implementation (#1145): Custom actions are defined via Helm values with:
  - Kubernetes resource kind and scope (Cluster or Namespaced)
  - Resource template with variable interpolation ({{ .NodeName }}, {{ .HealthEventID }}, etc.)
  - Completion condition type to mark when remediation is done
  - Equivalence group for action de-duplication
  - Fault-remediation validates the action is defined before creating CRs, preventing silent failures
- Demo (#1195): A complete example showcasing NVSentinel extensibility. Includes a custom health monitor that detects memory pressure and recommends a RECLAIM_MEMORY action. A controller responds by deleting memory-hogging pods. Demonstrates end-to-end flow from custom health check through custom remediation.
OpenTelemetry Tracing Expansion
Completing the v1.2.0 distributed tracing initiative:
- Janitor Module (#1190): Traces reboot/terminate operations with spans showing node reboot lifecycle, ready state transitions, and timing of CSP API calls.
- Fault-Remediation Module (#1171): Traces CR creation, remediation action validation, and webhook failures with full context in the remediation span tree.
- Node-Drainer Module (#1146): Traces pod eviction, drain status updates, and node recovery scenarios with cancellation tracking.
- Documentation (#1200): User guide for OTEL tracing configuration, trace export setup, and span interpretation.
Bug Fixes & Reliability
- Syslog Health Monitor Sidecar Readiness (#1170, #1168): Fixed a race where the XID analyzer sidecar was slow to start, causing XIDs to be silently dropped. Now uses both application-level readiness gates (TCP dial loop before polling) and native Kubernetes sidecar container ordering with startup probes (K8s 1.29+). Prevents missed faults when sidecars take longer than expected to become ready.
- Fault-Quarantine Test Flakiness (#1208): Fixed unit test intermittency due to unordered map iteration. Validation now iterates actions in sorted order for deterministic error messages.
- Preflight Helm Fallbacks (#1197): Fixed preflight deployment pod settings to respect global Helm values for image pull secrets, node selectors, affinity, and tolerations when chart-specific values are unset. Enables centralized pod configuration across all deployments while preserving local overrides.
Acknowledgments
This release includes contributions from:
- @tanishagoyal2
- @XRFXLP
- @ntaber
- @pdmack
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, and feedback!
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.3.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.2.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.3.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the documentation.