Release v1.6.0
This release adds configurable XID cancellation in the syslog health monitor, repeated-NIC analyzer rules for non-fatal degradation signals, a producer-side gate that prevents stale health events from skewing fault-quarantine metrics during platform-connector outages, identity-aware node-condition compaction that fixes stuck conditions on long entity names, and the v1.5.1 fault-remediation cold-start fix for users upgrading directly from v1.5.0.
Major New Features
XID Cancellation in Syslog Health Monitor (#1270)
The syslog-health-monitor can now be configured with cancellation rules that suppress related XID events when a source XID is observed. Rules are declared in a TOML ConfigMap by source/target error code:
cancellations:
  - name: SysLogsXIDError
    enabled: true
    rules:
      - onErrorCode: "162"
        cancelErrorCodes: ["163"]
When a source XID fires, the monitor emits a synthetic healthy event that clears matching target XIDs from the node condition. The platform-connector and fault-quarantine resolve health events by errorCode when present (falling back to entities-impacted otherwise), so the existing resolution semantics for non-XID checks are unaffected. A new Prometheus metric counts emitted cancellations by check / source / target error code.
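As a rough illustration of the rule semantics described above, the following Python sketch resolves which target codes a source XID would cancel. The key names mirror the config snippet; the in-memory rule list and the function name targets_to_cancel are hypothetical and not taken from the monitor's code.

```python
from typing import Dict, List

# Hypothetical in-memory form of the rules shown above. The key names mirror
# the config snippet; how the monitor actually loads them is not shown here.
CANCELLATIONS = [
    {
        "name": "SysLogsXIDError",
        "enabled": True,
        "rules": [{"onErrorCode": "162", "cancelErrorCodes": ["163"]}],
    }
]


def targets_to_cancel(check_name: str, source_code: str,
                      cancellations: List[Dict]) -> List[str]:
    """Return the target error codes to clear when source_code fires."""
    targets: List[str] = []
    for check in cancellations:
        # Skip other checks and checks whose cancellation is disabled.
        if check.get("name") != check_name or not check.get("enabled", False):
            continue
        for rule in check.get("rules", []):
            if rule.get("onErrorCode") != source_code:
                continue
            for code in rule.get("cancelErrorCodes", []):
                if code not in targets:  # keep order, drop duplicates
                    targets.append(code)
    return targets
```

With the rules above, a source XID 162 yields the target list ["163"], while XID 163 itself cancels nothing.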
Repeated NIC Analyzer Rules (#1272)
Two new Health Events Analyzer rules escalate repeated non-fatal NIC signals:
- RepeatedNICDriverError: escalates selected non-fatal SysLogsNICDriverError patterns when the same pattern repeats 3 times on a node within 1 hour. Noisy diagnostic-only signals like access_reg_failed are excluded from escalation.
- RepeatedNICDegradation: escalates non-fatal NIC degradation events when the same NIC + NICPort sees 3 degradation events within 1 hour.
Both rules escalate to CONTACT_SUPPORT rather than REPLACE_VM — deterministic NIC failures still use first-event REPLACE_VM, while repeated diagnostic/degradation signals are surfaced for human triage. Aggregation is scoped to the same NIC + NICPort so events on different ports do not aggregate incorrectly.
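The "3 events within 1 hour, scoped to the same NIC + NICPort" behaviour can be modelled with a small sliding-window counter. This is only a sketch of the semantics, not the analyzer's actual rule engine; the class name RepeatedEventTracker and the device names used with it are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

WINDOW_SECONDS = 3600  # 1 hour, per the rule description
THRESHOLD = 3          # 3 events inside the window


class RepeatedEventTracker:
    """Counts events per (node, nic, port) key inside a sliding time window."""

    def __init__(self, window: float = WINDOW_SECONDS,
                 threshold: int = THRESHOLD) -> None:
        self.window = window
        self.threshold = threshold
        self._events: Dict[Tuple[str, str, str], Deque[float]] = defaultdict(deque)

    def observe(self, node: str, nic: str, port: str, timestamp: float) -> bool:
        """Record one event; return True when the key reaches the threshold."""
        times = self._events[(node, nic, port)]
        times.append(timestamp)
        # Drop events that fell out of the window ending at this timestamp.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

Because the key includes the port, three events spread across different ports of one NIC never reach the threshold together.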
Bug Fixes & Reliability
- Platform-Connector Outage Gating (#1259): When platform-connector restarted (graceful redeploy, OOM, helm upgrade), every health monitor on the node held in-flight events in its retry loop with the original GeneratedTimestamp. When platform-connector returned, those stale events landed at fault-quarantine and were misattributed as multi-minute fault_quarantine_node_quarantine_duration_seconds histogram entries, even when fault-quarantine actually cordoned in ≤100 ms. Each monitor now stat-checks the platform-connector Unix socket before every gRPC send; if the socket is missing the send is skipped (no buffering, no cache mutation) and the next polling cycle re-emits the event with a fresh timestamp. Recovery is bounded by the polling cadence regardless of how long the outage lasts. A shared publisher in commons/pkg/healthpub consolidates the gate, retry policy, and Prometheus counters across all Go monitors; the Python gpu-health-monitor gets the same gate inline. Also fixes a related bug where syslog-health-monitor.handleBootIDChange persisted the new BootID before delivering post-reboot healthy events, so any send failure left those events permanently lost. BootID is now persisted only after every healthy event has been delivered, and a pendingPostRebootBootIDClear is retried at the top of every poll cycle.
- Node Condition Cleanup for Truncated Entity Messages (#1304): Fixed an issue where entities with long values (e.g., v1/Pod:prod/61f345d08c9a432a-134a464884734f90) would be byte-truncated mid-token by the platform-connector's per-message compaction, leaving subsequent healthy events unable to clear the condition (the exact-substring cleanup lookup never matched the truncated form). compactMessageField is rewritten to parse the structured identity prefix (ErrorCode + entity tokens) and only truncate the trailing diagnostic free-text; identity tokens are never byte-truncated. A backward-compatible entityMatchesMessage helper falls back to prefix matching when there is evidence of truncation (token ends in ... or is the last token with no Recommended Action=), so nodes already carrying truncated conditions from older releases can also be cleared.
- Fault-Quarantine Empty-Annotation Handling (#1309): Fixed a bug where fault-quarantine treated quarantineHealthEvent: "[]" as an active quarantine. When fault-quarantine processed a healthy event that cleared the last entity from a quarantined node, it wrote the annotation as an empty JSON array before performUncordon() removed the key entirely. If fault-quarantine restarted or hit a conflict before the key was removed, the next fatal event for that node followed the handleAlreadyQuarantinedNode path, appending the event without cordoning. Adds a shared annotation.IsEmptyValue() helper that treats "", whitespace, and "[]" as absent, used by hasExistingQuarantine() and the related test helpers. The same PR also hardens NIC E2E teardown to restart the NIC monitor before deleting the fake sysfs tree, eliminating a burst of false "device disappeared" fatal events that contaminated downstream tests.
- NIC Fatal Events Cordon Nodes (#1288): Updated fault-quarantine rules so fatal syslog-health-monitor events for the NIC component class now cordon nodes (previously only GPU did), and added a new ruleset for fatal nic-health-monitor events. E2E coverage was extended to assert that fatal NIC events cordon and that recovery uncordons. Also prevents node-drainer from marking drain status terminal when a node-state label update fails, so the event can be retried instead of leaving DB and node state inconsistent.
- Syslog HM Metadata Cache Retry (#1302, #1287): Fixed an issue where the syslog-health-monitor cached the metadata-collector output regardless of parse success: a failed parse poisoned the cache and prevented later retries from picking up valid metadata. Metadata is now cached only after a successful parse; failed parses are retried on the next lookup. Fixes a class of PCI-to-GPU-UUID resolution failures that persisted even after the metadata file appeared on disk.
- Fault-Remediation Cold-Start Replay (#1281): Brings the v1.5.1 hotfix forward for users upgrading directly from v1.5.0. Cold-start replay for intentionally-skipped events without a terminal remediation status could create duplicate RebootNode CRs after fault-quarantine had uncordoned the node, and unsupported actions like CONTACT_SUPPORT could re-apply the remediation-failed label on every fault-remediation restart. The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation (faultremediated=true on UnQuarantined, faultremediated=false on Cancelled and unsupported actions), and shares the cold-start eligibility query between cold-start and cleanup paths. See v1.5.1 release notes for the full description.
- Fern Docs Preview Build (#1285, #1284): Frozen version content (v1.2.0, v1.3.0, v1.4.0) is now included in PR preview builds via the same git archive loop used by the publish workflow, so previews match the production docs site. Preview-comment generation now streams fern generate output via tee instead of capturing it into a variable that swallowed errors.
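The socket gate from the Platform-Connector Outage Gating fix (#1259) in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in commons/pkg/healthpub or the gpu-health-monitor: the function names, the build_event callback, and the send callback are all hypothetical stand-ins for the real event builder and gRPC client.

```python
import os
import stat
import time
from typing import Callable, Dict


def socket_available(path: str) -> bool:
    """Return True only if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path or permission problem: treat the connector as down.
        return False


def publish_if_connected(socket_path: str,
                         build_event: Callable[[float], Dict],
                         send: Callable[[Dict], None]) -> bool:
    """Send a freshly timestamped event, or skip this cycle entirely.

    Nothing is buffered on a skip, so the next polling cycle builds the
    event again with a new timestamp instead of replaying a stale one.
    """
    if not socket_available(socket_path):
        return False
    send(build_event(time.time()))
    return True
```

The point of the design is that the timestamp is taken only once the socket check has passed, so an event delivered after an outage reflects the poll that sent it rather than the poll that first observed the fault.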
CI / Docs Publishing
- Auto-Registered Versioned Docs (#1290, #1292, #1293, #1291): Fern docs publish is now triggered on release tag push instead of docs/v* tags. On each release tag the workflow auto-registers the new version in the Fern dropdown via yq and prunes the list to Latest + the 3 most recent releases. Pre-release tags (containing -) still trigger publish but skip registration and pruning. Registry changes are persisted via peter-evans/create-pull-request after publish succeeds, with a pre-stamp backup so transient Latest · vX.Y.Z entries are never committed. Frozen-version checkout is hardened with glob guards, MDX brace/angle-bracket escaping, and consistent git show-ref tag verification across both publish and preview-build workflows.
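The register-and-prune behaviour can be pictured with the Python sketch below. The real workflow does this with yq inside a CI job; this is a swapped-in illustration of the logic only. The function names are hypothetical, the Latest entry is assumed to be maintained separately from the versioned list, and "most recent" is assumed to mean highest semantic version.

```python
import re
from typing import List, Tuple

RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def is_prerelease(tag: str) -> bool:
    """Pre-release tags are those containing a hyphen, e.g. v1.6.0-rc1."""
    return "-" in tag


def version_key(tag: str) -> Tuple[int, ...]:
    """Numeric sort key so that v1.10.0 orders after v1.9.0."""
    match = RELEASE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag}")
    return tuple(int(part) for part in match.groups())


def register_and_prune(versions: List[str], new_tag: str,
                       keep: int = 3) -> List[str]:
    """Add new_tag and keep the `keep` newest releases, newest first.

    Pre-release or malformed tags leave the registered list unchanged.
    """
    if is_prerelease(new_tag) or RELEASE_TAG.match(new_tag) is None:
        return list(versions)
    merged = {v for v in versions if RELEASE_TAG.match(v)} | {new_tag}
    return sorted(merged, key=version_key, reverse=True)[:keep]
```

For example, registering v1.6.0 against v1.5.1, v1.5.0, and v1.4.0 drops v1.4.0, whereas a tag such as v1.7.0-rc1 returns the list untouched.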
Acknowledgments
This release includes contributions from:
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.6.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.5.0 or v1.5.1:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.6.0 \
--namespace nvsentinel \
--reuse-values