Release v1.11.0

Released by github-actions on 22 Jun 12:58 · tag v1.11.0 · commit 7cf58f4

This release improves the accuracy of Mean Time To Repair (MTTR) reporting by letting dashboards distinguish automated remediations from events that require manual intervention. It also fixes a checkpoint-advancement bug in the fault-remediation reconciler that could silently drop live health events on cold start.

Major New Features

Recommended-action label on MTTR metrics (#1406)

The fault_quarantine_node_remediation_duration_excluding_drain_seconds MTTR histogram now carries a recommended_action label. Previously, nodes that required manual handling (e.g. a CONTACT_SUPPORT recommended action) could sit cordoned for hours before an operator acted, and that long idle time was bucketed alongside genuine automated remediations, inflating MTTR on Grafana dashboards. With the new label, dashboards can filter out CONTACT_SUPPORT and other manual events so MTTR reflects only automated remediations.
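The effect of the new label can be illustrated with a small calculation. The sketch below is not NVSentinel code: the sample durations and the AUTO_ACTION_A / AUTO_ACTION_B names are made up for illustration, and only CONTACT_SUPPORT comes from the release notes. It shows how a single long-idle manual event dominates a mean, and how filtering on the label removes it.

```python
from statistics import mean

# Hypothetical (recommended_action, duration_seconds) samples.
# The values and the AUTO_ACTION_* names are illustrative assumptions.
samples = [
    ("AUTO_ACTION_A", 300.0),
    ("AUTO_ACTION_B", 420.0),
    ("CONTACT_SUPPORT", 14400.0),  # node sat cordoned for hours awaiting an operator
]

# Actions treated as manual for this sketch.
MANUAL_ACTIONS = frozenset({"CONTACT_SUPPORT"})


def mttr_seconds(samples, exclude=frozenset()):
    """Mean duration over samples whose action is not in `exclude`."""
    durations = [d for action, d in samples if action not in exclude]
    return mean(durations) if durations else None


all_mttr = mttr_seconds(samples)                                # 5040.0
automated_mttr = mttr_seconds(samples, exclude=MANUAL_ACTIONS)  # 360.0
```

On a dashboard, the equivalent filter would presumably be a negative label matcher such as recommended_action!="CONTACT_SUPPORT" on the histogram series. The exact query depends on how the panel is built.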

Bug Fixes & Reliability

  • Fixed cold-start checkpoint advancement on document ID errors (#1411): Cold-start events are enqueued without resume tokens, but the document-ID error path in the fault-remediation reconciler called the watcher directly, where an empty token could resolve to the current MongoDB or PostgreSQL stream position and advance the checkpoint past events that had not yet been handled. Document-ID extraction failures are now routed through safeMarkProcessed, so cold-start events are no longer incorrectly marked processed and remediation events are no longer silently lost.
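The failure mode in #1411 can be sketched as follows. This is a toy model, not the reconciler's code: FakeWatcher and safe_mark_processed are hypothetical stand-ins, and the release notes do not describe what safeMarkProcessed does internally. The sketch assumes it declines to checkpoint when an event carries no resume token.

```python
class FakeWatcher:
    """Stand-in for a change-stream watcher (hypothetical, not the project's type)."""

    def __init__(self, stream_position):
        self.stream_position = stream_position
        self.checkpoint = None

    def mark_processed(self, resume_token):
        # Models the hazard from the notes: an empty token resolves to the
        # current stream position, moving the checkpoint past unhandled events.
        self.checkpoint = resume_token or self.stream_position


def safe_mark_processed(watcher, resume_token):
    """Assumed guard: skip checkpointing when there is no resume token."""
    if not resume_token:
        return False  # cold-start event; leave the checkpoint untouched
    watcher.mark_processed(resume_token)
    return True


# Direct call with an empty token advances the checkpoint (the bug).
buggy = FakeWatcher(stream_position="pos-42")
buggy.mark_processed("")

# Routing through the guard leaves the checkpoint alone (the fix, as assumed here).
fixed = FakeWatcher(stream_position="pos-42")
advanced = safe_mark_processed(fixed, "")
```

In this model, buggy.checkpoint ends up at the current stream position while fixed.checkpoint stays unset, so events between the old checkpoint and the stream head remain eligible for processing.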

Acknowledgments

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @fallintoplace.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.11.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.10.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.11.0 \
  --namespace nvsentinel \
  --reuse-values