Release v1.8.0

This release replaces node-drainer's FIFO worker queue with a two-lane priority queue so that a single noisy node can no longer starve drains on other nodes. It also adds a drainGPUPods flag that scopes eviction to GPU-requesting workloads, makes drain and quarantine overrides configurable from the kubernetes-object-monitor, fills in missing recommended actions for newer XIDs, and remediates several CVEs across container images and the Go toolchain.

Major New Features

Priority Queue for Node-Drainer (#1341)

Replaced node-drainer's ready-FIFO ordering with a two-lane priority queue layered under the existing Kubernetes rate-limiting workqueue. Events for nodes that have not yet reached draining get one high-priority representative; additional queued work for the same node stays low-priority to prevent grouped floods from blocking later nodes. Queue priority state is in-memory and follows successful node label transitions — setting draining marks the node as draining, while unquarantine or terminal drain labels clear it. Retry, drain action evaluation, and health-event lifecycle semantics are unchanged. A new Prometheus counter node_drainer_queue_items_assigned_total{priority, reason} tracks assignment decisions.
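The lane assignment described above can be illustrated with a small sketch. This is not node-drainer's code (which is built on the Kubernetes workqueue); the class name `TwoLaneQueue`, its methods, and the rule that a node regains a high-priority slot once its representative is dequeued are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Set, Tuple

Item = Tuple[str, str]  # (node name, event id)


@dataclass
class TwoLaneQueue:
    """Hypothetical two-lane queue; an illustration, not the real implementation."""

    high: Deque[Item] = field(default_factory=deque)
    low: Deque[Item] = field(default_factory=deque)
    draining: Set[str] = field(default_factory=set)     # nodes already labeled draining
    represented: Set[str] = field(default_factory=set)  # nodes with a queued high-lane item

    def add(self, node: str, event: str) -> str:
        """Queue an event and return the lane it was assigned to."""
        if node not in self.draining and node not in self.represented:
            # First queued item for a node that has not reached draining.
            self.represented.add(node)
            self.high.append((node, event))
            return "high"
        # Extra work for the same node, or a node that is already draining.
        self.low.append((node, event))
        return "low"

    def get(self) -> Optional[Item]:
        """Pop from the high lane first, then fall back to the low lane."""
        if self.high:
            node, event = self.high.popleft()
            self.represented.discard(node)
            return node, event
        if self.low:
            return self.low.popleft()
        return None

    def mark_draining(self, node: str) -> None:
        """Call after the draining label was set successfully."""
        self.draining.add(node)

    def clear(self, node: str) -> None:
        """Call after unquarantine or a terminal drain label."""
        self.draining.discard(node)


q = TwoLaneQueue()
lanes = [q.add("node-a", f"a{i}") for i in range(3)] + [q.add("node-b", "b0")]
order = []
while (item := q.get()) is not None:
    order.append(item[1])
```

In this sketch, the flood of three events for node-a occupies only one high-priority slot, so node-b's first event is served before node-a's remaining work.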

`drainGPUPods` Filter (#1310, #1264)

New Helm flag node-drainer.drainGPUPods (default false) restricts pod eviction during fault remediation to workloads that request GPU resources (nvidia.com/gpu or nvidia.com/pgpu). When enabled, CPU-only pods (logging agents, monitoring sidecars, infrastructure DaemonSets) stay running on the node, while GPU workloads — the ones actually blocked by the GPU fault — are evicted. The filter inspects both regular containers and init containers. Default behavior is unchanged so existing deployments are unaffected.
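A minimal sketch of the filtering rule, assuming pods are represented as plain dictionaries shaped like Kubernetes pod manifests. The function names `requests_gpu` and `pods_to_evict` are hypothetical, and checking `limits` in addition to `requests` is an assumption rather than confirmed behavior of node-drainer.

```python
from typing import Any, Dict, Iterable, List

# Resource names taken from the release notes.
GPU_RESOURCES = ("nvidia.com/gpu", "nvidia.com/pgpu")


def requests_gpu(pod: Dict[str, Any]) -> bool:
    """Return True if any regular or init container asks for a GPU resource."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        resources = container.get("resources") or {}
        # Assumption: a GPU named under either requests or limits counts.
        for section in ("requests", "limits"):
            for name, quantity in (resources.get(section) or {}).items():
                if name in GPU_RESOURCES and str(quantity) not in ("", "0"):
                    return True
    return False


def pods_to_evict(pods: Iterable[Dict[str, Any]], drain_gpu_pods: bool) -> List[Dict[str, Any]]:
    """With the flag off, every candidate pod is kept; with it on, only GPU pods."""
    pods = list(pods)
    if not drain_gpu_pods:
        return pods
    return [pod for pod in pods if requests_gpu(pod)]


logging_agent = {
    "metadata": {"name": "logging-agent"},
    "spec": {"containers": [{"resources": {"requests": {"cpu": "100m"}}}]},
}
trainer = {
    "metadata": {"name": "trainer"},
    "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": 1}}}]},
}
init_only = {
    "metadata": {"name": "init-gpu"},
    "spec": {
        "containers": [{"resources": {}}],
        "initContainers": [{"resources": {"requests": {"nvidia.com/pgpu": "1"}}}],
    },
}

selected = [p["metadata"]["name"] for p in pods_to_evict([logging_agent, trainer, init_only], True)]
```

Here `selected` contains only the two pods that name a GPU resource, while the CPU-only logging agent is left running.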

Drain & Quarantine Overrides from Kubernetes Object Monitor (#1342)

drainOverrides and quarantineOverrides are now configurable on health events emitted by kubernetes-object-monitor policies, matching the support that already existed in other monitors. Cluster operators can declare per-policy overrides directly in the TOML/YAML config:

```yaml
healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node is not ready"
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - NODE_NOT_READY
  quarantineOverrides:
    force: true                # or skip: true; do not set both
  drainOverrides:
    skip: true                 # or force: true; do not set both
```

force and skip are mutually exclusive per override block; the chart validates this at template time. This unlocks scenarios like "cordon the node but do not evict pods" (the example tested in the PR) without requiring a separate health monitor.
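The chart performs this check itself at template time. For readers who want to lint policy files before running Helm, the same rule can be sketched in a few lines. The function `validate_overrides` is hypothetical and not part of the project; it assumes the healthEvent block has already been parsed into a dictionary.

```python
from typing import Any, Mapping


def validate_overrides(health_event: Mapping[str, Any]) -> None:
    """Raise ValueError if any override block sets both force and skip."""
    for block in ("quarantineOverrides", "drainOverrides"):
        overrides = health_event.get(block) or {}
        if overrides.get("force") and overrides.get("skip"):
            raise ValueError(f"{block}: force and skip are mutually exclusive")


# "Cordon the node but do not evict pods", mirroring the example above.
cordon_only = {
    "quarantineOverrides": {"force": True},
    "drainOverrides": {"skip": True},
}
validate_overrides(cordon_only)  # passes silently
```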

Bug Fixes & Reliability

Missing XID Recommended Actions (#1343): Filled in recommended actions for XIDs that were listed in the XID analyzer catalog but missing from the gpu-health-monitor mapping. The additions cover another GPU recovery scenario, which now triggers COMPONENT_RESET, and fabric-related failures, which now trigger RESTART_VM. Bringing the mapping in line with the catalog prevents these XIDs from being silently classified as NONE/CONTACT_SUPPORT.
Preflight Build Platform Arg + FQ CEL for Preflight (#1352): Fixed a missing --platform argument in the preflight-checks Docker build/publish targets that caused multi-platform image operations to silently produce single-platform artifacts. Also added a new fault-quarantine CEL policy so nodes are cordoned when preflight agents emit fatal health events (respecting existing node-exclusion settings) — preflight failures now flow through the same cordon path as other monitors.

Security & Infrastructure

Go Toolchain 1.26.3 (#1346): Bumped Go from 1.26.2 → 1.26.3, remediating CVE-2026-39820, CVE-2026-42499, CVE-2026-42501, CVE-2026-33814, CVE-2026-39836, and CVE-2026-33811.
Image-Level CVE Remediation (#1340):
- preflight-nccl-loopback and preflight-nccl-allreduce: PyTorch base image nvcr.io/nvidia/pytorch:26.03-py3 → 26.04-py3 (pillow 12.1.1 → 12.2.0 fixing GHSA-whj4-6x5x-4v2j and GHSA-pwv6-vv43-88gr; onnx 1.18.0 → 1.21.0 fixing GHSA-q56x-g2fj-4rj6, GHSA-hqmj-h5c6-369m, GHSA-538c-55jv-c5g9, GHSA-3r9x-f23j-gc73). Unused uv/uvx binaries removed to eliminate the embedded vulnerable rustls-webpki (GHSA-82j2-j2ch-gfr8).
- log-collector: kubectl v1.34.1 → v1.34.8 (picks up github.com/moby/spdystream v0.5.1, fixing GHSA-pc3f-x583-g7j2).
- file-server-cleanup: base image python:3.13-alpine → python:3.14-alpine (fixes CVE-2026-7210 in expat, CVE-2026-6100 in Python decompression modules, CVE-2026-4786 in webbrowser).
- gpu-health-monitor and preflight-dcgm-diag: removed unused gnupg package, eliminating CVE-2025-68973 (gnupg2).

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @coderuhaan2004.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

```bash
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --create-namespace
```

To upgrade from v1.7.0:

```bash
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --reuse-values
```
