Release v1.8.0

Released by @github-actions on 01 Jun 13:20 (tag v1.8.0, commit fc43d07)

This release replaces node-drainer's FIFO worker queue with a two-lane priority queue, so a single noisy node can no longer starve drains on other nodes. It also adds a drainGPUPods flag to scope eviction to GPU-requesting workloads, makes drain and quarantine overrides configurable from the kubernetes-object-monitor, fills in missing recommended actions for newer XIDs, and remediates several CVEs across container images and the Go toolchain.

Major New Features

Priority Queue for Node-Drainer (#1341)

Replaced node-drainer's ready-FIFO ordering with a two-lane priority queue layered under the existing Kubernetes rate-limiting workqueue. Each node that has not yet reached draining gets one high-priority representative event; additional queued work for the same node stays low-priority, so a flood of events grouped on one node cannot block later nodes. Queue priority state is held in memory and follows successful node label transitions: setting draining marks the node as draining, while unquarantine or terminal drain labels clear it. Retry, drain action evaluation, and health-event lifecycle semantics are unchanged. A new Prometheus counter, node_drainer_queue_items_assigned_total{priority, reason}, tracks assignment decisions.
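The lane assignment described above can be pictured with a small Python sketch. It is an illustration only, not the node-drainer code (which is layered under the Kubernetes workqueue): the class TwoLaneQueue and its methods are hypothetical names, and releasing the high-priority slot when the representative is dequeued is an assumption of this sketch.

```python
from collections import deque
from typing import Deque, Optional, Set, Tuple

HIGH, LOW = "high", "low"


class TwoLaneQueue:
    """Illustrative two-lane queue (hypothetical, not the node-drainer implementation)."""

    def __init__(self) -> None:
        self._high: Deque[Tuple[str, str]] = deque()
        self._low: Deque[Tuple[str, str]] = deque()
        # Nodes that currently have a high-priority representative queued.
        self._represented: Set[str] = set()
        # Nodes whose draining label has been set successfully.
        self._draining: Set[str] = set()

    def add(self, node: str, event: str) -> str:
        """Queue an event and return the lane it was assigned to."""
        if node not in self._draining and node not in self._represented:
            self._represented.add(node)
            self._high.append((node, event))
            return HIGH
        # Extra work for an already represented or draining node waits in the low lane.
        self._low.append((node, event))
        return LOW

    def get(self) -> Optional[Tuple[str, str]]:
        """Pop the next item, always serving the high lane first."""
        if self._high:
            node, event = self._high.popleft()
            # Assumption of this sketch: the slot frees up once the representative is taken.
            self._represented.discard(node)
            return node, event
        if self._low:
            return self._low.popleft()
        return None

    def mark_draining(self, node: str) -> None:
        """Record that the draining label was applied to the node."""
        self._draining.add(node)

    def clear(self, node: str) -> None:
        """Record an unquarantine or terminal drain label for the node."""
        self._draining.discard(node)
```

With this shape, three queued events for one node followed by a single event for a second node are served as first node, second node, then the first node's remaining events, which is the starvation case the release note describes.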

drainGPUPods Filter (#1310, #1264)

New Helm flag node-drainer.drainGPUPods (default false) restricts pod eviction during fault remediation to workloads that request GPU resources (nvidia.com/gpu or nvidia.com/pgpu). When enabled, CPU-only pods (logging agents, monitoring sidecars, infrastructure DaemonSets) stay running on the node, while GPU workloads, the ones actually blocked by the GPU fault, are evicted. The filter inspects both regular containers and init containers. Because the flag defaults to false, existing deployments behave as before.
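As a rough illustration of the filter, the Python sketch below checks a pod manifest (parsed into a dict) for the two GPU resource names in both containers and initContainers. The helper names requests_gpu and pods_to_evict are hypothetical, and looking at limits as well as requests is an assumption of this sketch; the actual node-drainer logic may differ.

```python
from typing import Any, Dict, List

# Resource names taken from the release note.
GPU_RESOURCES = ("nvidia.com/gpu", "nvidia.com/pgpu")


def requests_gpu(pod: Dict[str, Any]) -> bool:
    """Return True if any container or init container names a GPU resource."""
    spec = pod.get("spec") or {}
    containers = list(spec.get("containers") or []) + list(spec.get("initContainers") or [])
    for container in containers:
        resources = container.get("resources") or {}
        # Assumption: treat a GPU entry under either requests or limits as a GPU request.
        for section in ("requests", "limits"):
            amounts = resources.get(section) or {}
            if any(name in amounts for name in GPU_RESOURCES):
                return True
    return False


def pods_to_evict(pods: List[Dict[str, Any]], drain_gpu_pods: bool) -> List[Dict[str, Any]]:
    """Select pods for eviction; with the flag off, every pod is a candidate."""
    if not drain_gpu_pods:
        return list(pods)
    return [pod for pod in pods if requests_gpu(pod)]
```

Calling pods_to_evict(pods, drain_gpu_pods=True) on a node's pod list would leave a CPU-only logging agent in place and return only the GPU workloads.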

Drain & Quarantine Overrides from Kubernetes Object Monitor (#1342)

drainOverrides and quarantineOverrides are now configurable on health events emitted by kubernetes-object-monitor policies, matching the support that already existed in other monitors. Cluster operators can declare per-policy overrides directly in the TOML/YAML config:

healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node is not ready"
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - NODE_NOT_READY
  quarantineOverrides:
    force: true                # or skip: true; do not set both
  drainOverrides:
    skip: true                 # or force: true; do not set both

force and skip are mutually exclusive per override block; the chart validates this at template time. This unlocks scenarios like "cordon the node but do not evict pods" (the example tested in the PR) without requiring a separate health monitor.
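The mutual-exclusion rule can also be checked before a config ever reaches the chart. The following is a minimal Python sketch, assuming the healthEvent block has already been parsed into a dict shaped like the example above; validate_overrides is a hypothetical helper and does not reproduce the chart's own template-time validation.

```python
from typing import Any, Dict, List

# Override block names taken from the release note.
OVERRIDE_BLOCKS = ("quarantineOverrides", "drainOverrides")


def validate_overrides(health_event: Dict[str, Any]) -> List[str]:
    """Return one error message per override block that sets both force and skip."""
    errors: List[str] = []
    for block in OVERRIDE_BLOCKS:
        overrides = health_event.get(block) or {}
        if overrides.get("force") and overrides.get("skip"):
            errors.append(f"{block}: force and skip are mutually exclusive")
    return errors


# "Cordon the node but do not evict pods": force quarantine, skip drain.
cordon_only = {
    "quarantineOverrides": {"force": True},
    "drainOverrides": {"skip": True},
}

# Invalid: both keys set in the same block.
conflicting = {"drainOverrides": {"force": True, "skip": True}}
```

Here validate_overrides(cordon_only) returns an empty list, while validate_overrides(conflicting) reports the drainOverrides block.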

Bug Fixes & Reliability

  • Missing XID Recommended Actions (#1343): Filled in recommended actions for XIDs that were listed in the XID analyzer catalog but missing from the gpu-health-monitor mapping. An additional GPU recovery scenario now triggers COMPONENT_RESET, and fabric-related failures now trigger RESTART_VM. Bringing the mapping in line with the catalog prevents these XIDs from being silently classified as NONE/CONTACT_SUPPORT.
  • Preflight Build Platform Arg + FQ CEL for Preflight (#1352): Fixed a missing --platform argument in the preflight-checks Docker build/publish targets that caused multi-platform image operations to silently produce single-platform artifacts. Also added a new fault-quarantine CEL policy that cordons nodes when preflight agents emit fatal health events (respecting existing node-exclusion settings), so preflight failures now flow through the same cordon path as other monitors.

Security & Infrastructure

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @coderuhaan2004.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.7.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --reuse-values