Release v1.3.0

@github-actions github-actions released this 20 Apr 11:11
· 127 commits to main since this release
v1.3.0
fef921c

This release implements custom remediation actions for operator extensibility, completes OpenTelemetry tracing across janitor and fault-remediation modules, adds a demo showcasing custom health monitors, fixes syslog health monitor sidecar readiness issues, and includes Helm configuration improvements.

Major New Features

Custom Remediation Actions (#1141)

Operators can now define custom remediation actions beyond NVSentinel's built-in set (QUARANTINE_NODE, REBOOT_NODE, etc.), enabling support for domain-specific remediation like disk replacement, memory reclamation, or custom infrastructure actions.

  • Implementation (#1145): Custom actions are defined via Helm values with:

    • Kubernetes resource kind and scope (Cluster or Namespaced)
    • Resource template with variable interpolation ({{ .NodeName }}, {{ .HealthEventID }}, etc.)
    • Completion condition type to mark when remediation is done
    • Equivalence group for action de-duplication

    Fault-remediation validates that an action is defined before creating CRs, which prevents silent failures.
  • Demo (#1195): A complete example of NVSentinel extensibility. It includes a custom health monitor that detects memory pressure and recommends a RECLAIM_MEMORY action; a controller responds by deleting memory-hogging pods. The demo shows the end-to-end flow from custom health check through custom remediation.
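
As a rough illustration of the fields listed above, the following Python sketch models a custom action definition, checks that the action is defined before anything is created, and fills in the `{{ .NodeName }}` and `{{ .HealthEventID }}` placeholders. It is not NVSentinel's implementation: the `CUSTOM_ACTIONS` dictionary, its key names, the `MemoryReclaim` kind, and the `render` and `build_resource` helpers are all hypothetical, and the regex substitution is only a simplified stand-in for real template rendering.

```python
import re

# Hypothetical in-memory stand-in for custom actions configured via Helm values.
# The key names and the MemoryReclaim kind are illustrative assumptions.
CUSTOM_ACTIONS = {
    "RECLAIM_MEMORY": {
        "kind": "MemoryReclaim",
        "scope": "Cluster",
        "completion_condition": "Reclaimed",
        "equivalence_group": "memory",
        "template": {
            "metadata": {"name": "reclaim-{{ .NodeName }}-{{ .HealthEventID }}"},
            "spec": {"nodeName": "{{ .NodeName }}"},
        },
    },
}

# Matches placeholders of the form {{ .Name }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(value, variables):
    """Recursively substitute {{ .Name }} placeholders in strings, dicts, and lists.

    An unknown variable name raises KeyError rather than rendering silently.
    """
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: render(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, variables) for v in value]
    return value


def build_resource(action, variables, actions=CUSTOM_ACTIONS):
    """Validate that the action is defined, then render its resource template."""
    if action not in actions:
        # Fail loudly instead of creating nothing and reporting success.
        raise ValueError(f"remediation action {action!r} is not defined")
    spec = actions[action]
    resource = render(spec["template"], variables)
    resource["kind"] = spec["kind"]
    return resource
```

Calling `build_resource("RECLAIM_MEMORY", {"NodeName": "node-a", "HealthEventID": "evt-42"})` yields a resource named `reclaim-node-a-evt-42`, while an undefined action name raises `ValueError` before any resource is built.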

OpenTelemetry Tracing Expansion

Completing the v1.2.0 distributed tracing initiative:

  • Janitor Module (#1190): Traces reboot/terminate operations with spans showing node reboot lifecycle, ready state transitions, and timing of CSP API calls.
  • Fault-Remediation Module (#1171): Traces CR creation, remediation action validation, and webhook failures with full context in the remediation span tree.
  • Node-Drainer Module (#1146): Traces pod eviction, drain status updates, and node recovery scenarios with cancellation tracking.
  • Documentation (#1200): User guide for OTEL tracing configuration, trace export setup, and span interpretation.

Bug Fixes & Reliability

  • Syslog Health Monitor Sidecar Readiness (#1170, #1168): Fixed a race where the XID analyzer sidecar was slow to start, causing XIDs to be silently dropped. The monitor now uses both application-level readiness gates (a TCP dial loop before polling) and native Kubernetes sidecar container ordering with startup probes (K8s 1.29+). This prevents missed faults when sidecars take longer than expected to become ready.
  • Fault-Quarantine Test Flakiness (#1208): Fixed unit test intermittency due to unordered map iteration. Validation now iterates actions in sorted order for deterministic error messages.
  • Preflight Helm Fallbacks (#1197): Fixed preflight deployment pod settings to respect global Helm values for image pull secrets, node selectors, affinity, and tolerations when chart-specific values are unset. Enables centralized pod configuration across all deployments while preserving local overrides.
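
The application-level readiness gate mentioned in the syslog health monitor fix (a TCP dial loop before polling) follows a common pattern, sketched below in Python. This is not the monitor's actual code: the `wait_for_tcp` helper, its parameters, and the default timings are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the sidecar is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the sidecar is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would gate its polling loop on the result, for example `if wait_for_tcp("127.0.0.1", sidecar_port): start_polling()`, where `sidecar_port` and `start_polling` are placeholders. Dialing before polling means events are only read once something is listening to analyze them.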

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.3.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.2.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.3.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the documentation.