Skip to content

Release v1.6.0

Choose a tag to compare

@github-actions github-actions released this 19 May 06:37
· 57 commits to main since this release
v1.6.0
f9e2daa

Release v1.6.0

This release adds configurable XID cancellation in the syslog health monitor, repeated-NIC analyzer rules for non-fatal degradation signals, a producer-side gate that prevents stale health events from skewing fault-quarantine metrics during platform-connector outages, identity-aware node-condition compaction that fixes stuck conditions on long entity names, and the v1.5.1 fault-remediation cold-start fix for users upgrading directly from v1.5.0.

Major New Features

XID Cancellation in Syslog Health Monitor (#1270)

The syslog-health-monitor can now be configured with cancellation rules that suppress related XID events when a source XID is observed. Rules are declared in a TOML ConfigMap by source/target error code:

cancellations:
  - name: SysLogsXIDError
    enabled: true
    rules:
      - onErrorCode: "162"
        cancelErrorCodes: ["163"]

When a source XID fires, the monitor emits a synthetic healthy event that clears matching target XIDs from the node condition. The platform-connector and fault-quarantine resolve health events by errorCode when present (falling back to entities-impacted otherwise), so the existing resolution semantics for non-XID checks are unaffected. A new Prometheus metric counts emitted cancellations by check / source / target error code.

Repeated NIC Analyzer Rules (#1272)

Two new Health Events Analyzer rules escalate repeated non-fatal NIC signals:

  • RepeatedNICDriverError: escalates selected non-fatal SysLogsNICDriverError patterns when the same pattern repeats 3 times on a node within 1 hour. Noisy diagnostic-only signals like access_reg_failed are excluded from escalation.
  • RepeatedNICDegradation: escalates non-fatal NIC degradation events when the same NIC + NICPort sees 3 degradation events within 1 hour.

Both rules escalate to CONTACT_SUPPORT rather than REPLACE_VM — deterministic NIC failures still use first-event REPLACE_VM, while repeated diagnostic/degradation signals are surfaced for human triage. Aggregation is scoped to the same NIC + NICPort so events on different ports do not aggregate incorrectly.

Bug Fixes & Reliability

  • Platform-Connector Outage Gating (#1259): When platform-connector restarted (graceful redeploy, OOM, helm upgrade), every health monitor on the node held in-flight events in its retry loop with the original GeneratedTimestamp. When platform-connector returned, those stale events landed at fault-quarantine and were misattributed as multi-minute fault_quarantine_node_quarantine_duration_seconds histogram entries, even when fault-quarantine actually cordoned in ≤100 ms. Each monitor now stat-checks the platform-connector Unix socket before every gRPC send; if the socket is missing the send is skipped (no buffering, no cache mutation) and the next polling cycle re-emits the event with a fresh timestamp. Recovery is bounded by the polling cadence regardless of how long the outage lasts. A shared publisher in commons/pkg/healthpub consolidates the gate, retry policy, and Prometheus counters across all Go monitors; the Python gpu-health-monitor gets the same gate inline. Also fixes a related bug where syslog-health-monitor.handleBootIDChange persisted the new BootID before delivering post-reboot healthy events — any send failure left those events permanently lost. BootID is now persisted only after every healthy event has been delivered, and a pendingPostRebootBootIDClear is retried at the top of every poll cycle.

  • Node Condition Cleanup for Truncated Entity Messages (#1304): Fixed an issue where entities with long values (e.g., v1/Pod:prod/61f345d08c9a432a-134a464884734f90) would be byte-truncated mid-token by the platform-connector's per-message compaction, leaving subsequent healthy events unable to clear the condition (the exact-substring cleanup lookup never matched the truncated form). compactMessageField is rewritten to parse the structured identity prefix (ErrorCode + entity tokens) and only truncate the trailing diagnostic free-text — identity tokens are never byte-truncated. A backward-compatible entityMatchesMessage helper falls back to prefix matching when there is evidence of truncation (token ends in ... or is the last token with no Recommended Action=), so nodes already carrying truncated conditions from older releases can also be cleared.

  • Fault-Quarantine Empty-Annotation Handling (#1309): Fixed a bug where fault-quarantine treated quarantineHealthEvent: "[]" as an active quarantine. When fault-quarantine processed a healthy event that cleared the last entity from a quarantined node, it wrote the annotation as an empty JSON array before performUncordon() removed the key entirely. If fault-quarantine restarted or hit a conflict before the key was removed, the next fatal event for that node followed the handleAlreadyQuarantinedNode path — appending the event without cordoning. Adds a shared annotation.IsEmptyValue() helper that treats "", whitespace, and "[]" as absent, used by hasExistingQuarantine() and the related test helpers. The same PR also hardens NIC E2E teardown to restart the NIC monitor before deleting the fake sysfs tree, eliminating a burst of false "device disappeared" fatal events that contaminated downstream tests.

  • NIC Fatal Events Cordon Nodes (#1288): Updated fault-quarantine rules so fatal syslog-health-monitor events for the NIC component class now cordon nodes (previously only GPU did), and added a new ruleset for fatal nic-health-monitor events. E2E coverage was extended to assert that fatal NIC events cordon and that recovery uncordons. Also prevents node-drainer from marking drain status terminal when a node-state label update fails, so the event can be retried instead of leaving DB and node state inconsistent.

  • Syslog HM Metadata Cache Retry (#1302, #1287): Fixed an issue where the syslog-health-monitor cached the metadata-collector output regardless of parse success — a failed parse poisoned the cache and prevented later retries from picking up valid metadata. Metadata is now cached only after a successful parse; failed parses are retried on the next lookup. Fixes a class of PCI-to-GPU-UUID resolution failures that persisted even after the metadata file appeared on disk.

  • Fault-Remediation Cold-Start Replay (#1281): Brings the v1.5.1 hotfix forward for users upgrading directly from v1.5.0. Cold-start replay for intentionally-skipped events without a terminal remediation status could create duplicate RebootNode CRs after fault-quarantine had uncordoned the node, and unsupported actions like CONTACT_SUPPORT could re-apply the remediation-failed label on every fault-remediation restart. The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation (faultremediated=true on UnQuarantined, faultremediated=false on Cancelled and unsupported actions), and shares the cold-start eligibility query between cold-start and cleanup paths. See v1.5.1 release notes for the full description.

  • Fern Docs Preview Build (#1285, #1284): Frozen version content (v1.2.0, v1.3.0, v1.4.0) is now included in PR preview builds via the same git archive loop used by the publish workflow, so previews match the production docs site. Preview-comment generation now streams fern generate output via tee instead of capturing it into a variable that swallowed errors.

CI / Docs Publishing

  • Auto-Registered Versioned Docs (#1290, #1292, #1293, #1291): Fern docs publish is now triggered on release tag push instead of docs/v* tags. On each release tag the workflow auto-registers the new version in the Fern dropdown via yq and prunes the list to Latest + the 3 most recent releases. Pre-release tags (containing -) still trigger publish but skip registration and pruning. Registry changes are persisted via peter-evans/create-pull-request after publish succeeds, with a pre-stamp backup so transient Latest · vX.Y.Z entries are never committed. Frozen-version checkout is hardened with glob guards, MDX brace/angle-bracket escaping, and consistent git show-ref tag verification across both publish and preview-build workflows.

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.5.0 or v1.5.1:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.6.0 \
  --namespace nvsentinel \
  --reuse-values