Release v1.2.0
This release introduces OpenTelemetry distributed tracing across the remediation pipeline (behind feature flag), external managed MongoDB support (Atlas, DocumentDB, Cosmos DB), TLS certificate hot-reload, preflight framework enhancements (per-pod check selection, init container placement, objectSelector), a gRPC sink connector, OpenShift service-ca support, feature flag observability via Prometheus, and critical reliability fixes for MongoDB change streams, DCGM handle leaks, and cold-start event processing.
Major New Features
OpenTelemetry Distributed Tracing (#771) — Feature Flag
Note: This feature is behind a feature flag and is disabled by default. Enable via Helm values to opt in.
End-to-end distributed tracing across the NVSentinel remediation pipeline using OpenTelemetry. Operators can now trace the full control flow of a health event — from detection through quarantine, draining, and remediation — in a single trace view.
- Common Instrumentation (#1088): Core tracing setup with auto-instrumentation for HTTP (otelhttp), MongoDB (otelmongo), and PostgreSQL (otelsql) clients. Errors surface as red spans with exception type and message attributes. All async processes and MongoDB-dependent modules are covered.
- Platform Connector (#1087): Traces event transformation, HTTP/DB calls, and Kubernetes node event writes with per-operation spans.
- Fault Quarantine (#1117): Traces cordon/uncordon operations with span attributes for taints applied/removed, annotations, and labels.
- Event Exporter (#1122): Traces event insertion, backfill queries, and publish failures with recorded errors.
- Health Events Analyzer (#1134): Traces rule evaluation with per-rule span durations, pipeline query failures, and event publishing.
External Managed MongoDB Support (#376)
NVSentinel no longer requires an in-cluster MongoDB deployment. Lean GPU clusters can now offload the database to a managed service.
- Atlas, DocumentDB, Cosmos DB (#1010): First-class Helm support for external MongoDB-compatible services. Includes a Helm hook Job that prepares the external database (collections and indexes), per-provider value overlays (values-atlas-gcp.yaml, values-aws-docdb.yaml, values-cosmosdb-test.yaml), TLS cert volume helpers, and ConfigMap alignment for external connection strings.
- Secret-Based Credentials (#1076): MONGODB_URI is now loaded exclusively from a Kubernetes Secret when global.datastore.credentialsFromSecret.name is set. The datastore ConfigMap no longer contains the URI in this mode. Helm fails fast if the Secret name is missing or if global.datastore.uri is set for MongoDB, preventing accidental plaintext credential exposure.
TLS Certificate Rotation (#1030)
Implements ADR-029 for automatic TLS certificate rotation without pod restarts.
- Server-Side Hot-Reload: Replaces static tls.LoadX509KeyPair with certwatcher from controller-runtime in janitor-provider, enabling automatic cert pickup when cert-manager rotates certificates.
- Client-Side CA Bundle Rotation: Replaces the persistent gRPC connection (created once at startup) with a per-reconciliation dialProvider() that reads the CA bundle and SA token fresh from disk on each call.
- Helm: Adds a tls.secretName override to janitor-provider values for externally managed certificates.
Cold Start Handling in Fault Remediation (#912)
- Batched Cold-Start Processing (#1118): When fault-remediation or node-drainer restarts after a stale/missing resume token, the cold-start handler now processes missed events in bounded batches (default 1000) instead of loading all results into memory at once. This prevents OOM on clusters with large event histories. The new FindHealthEventsByQueryBatched method is implemented for both MongoDB (cursor-based) and PostgreSQL (LIMIT/OFFSET). Cold-start events in fault-remediation are enqueued into the controller-runtime workqueue, giving them the same requeue/retry semantics as live change-stream events.
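The batching idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's code: find_events_batched, process_cold_start, the in-memory events list, and the enqueue callback are hypothetical stand-ins for FindHealthEventsByQueryBatched, the datastore query, and the workqueue.

```python
from typing import Callable, Iterator, List, Sequence


def find_events_batched(events: Sequence[dict], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield events in bounded batches.

    Hypothetical stand-in for a cursor-based or LIMIT/OFFSET query; here the
    source is just an in-memory sequence sliced by offset.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, len(events), batch_size):
        # Only one slice of at most batch_size events is handed out at a time.
        yield list(events[offset:offset + batch_size])


def process_cold_start(events: Sequence[dict],
                       enqueue: Callable[[object], None],
                       batch_size: int = 1000) -> int:
    """Enqueue the ID of every missed event, one batch at a time."""
    total = 0
    for batch in find_events_batched(events, batch_size):
        for event in batch:
            enqueue(event["id"])
        total += len(batch)
    return total
```

With 2,500 missed events and the default batch size, the handler would see three batches of 1000, 1000, and 500 events rather than one 2,500-element result set.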
Feature Flag Observability (#836)
- Prometheus Metrics (#1094): All NVSentinel feature flags (circuit breaker, log collector, processing strategy, per-rule enabled/disabled state) are now exposed as Prometheus gauge metrics. Operators managing hundreds of clusters can build dashboards showing fleet-wide feature flag status without digging through ConfigMaps.
Preflight Framework Enhancements
- Per-Pod Check Selection (#1139): Pods can now select which preflight checks to run via annotation (nvsentinel.nvidia.com/preflight-checks: "preflight-dcgm-diag,preflight-nccl-loopback"). If the annotation is absent, all defaultEnabled: true checks run (preserving existing behavior). Gang validation ensures all pods in a gang have the same check list, failing fast before torchrun to prevent distributed deadlocks.
- Init Container Placement (#1132, #1131): Configurable initContainerPlacement ("prepend" or "append") for preflight init containers relative to existing init containers. Supports scheduling environments that inject their own init containers with ordering requirements.
- objectSelector Support (#1128): Optional objectSelector on the preflight MutatingWebhookConfiguration for pod-level opt-in filtering alongside namespace-level selection. Default is no objectSelector (unchanged behavior).
- imagePullSecrets Injection (#1142): When preflight init container images are hosted in a private registry, imagePullSecrets are now automatically injected into target pods. Also fixes a gang controller ConfigMap mismatch where the controller created an orphaned ConfigMap when the webhook fires before the scheduler annotates pods.
- Inline DCGM Config (#1140, #1137): Per ADR-035, the separate dcgm: config block is removed. DCGM-specific configuration (DCGM_HOSTENGINE_ADDR, DCGM_DIAG_LEVEL) is now defined as inline env vars on the preflight-dcgm-diag container, consistent with how NCCL checks are configured. See the migration table in the PR for upgrade guidance.
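The per-pod check selection and gang validation described in the first item of this list could look roughly like the Python sketch below. Only the annotation key comes from these notes; selected_checks, validate_gang, the order-sensitive comparison, and the treatment of an empty annotation value as "no checks" are assumptions made for illustration.

```python
from typing import Dict, List

ANNOTATION = "nvsentinel.nvidia.com/preflight-checks"


def selected_checks(annotations: Dict[str, str], default_enabled: List[str]) -> List[str]:
    """Return the checks a pod asks for, or the defaults if the annotation is absent."""
    raw = annotations.get(ANNOTATION)
    if raw is None:
        return list(default_enabled)
    # Assumption: a comma-separated list; blank entries are ignored.
    return [name.strip() for name in raw.split(",") if name.strip()]


def validate_gang(pods: List[Dict[str, str]], default_enabled: List[str]) -> List[str]:
    """Return the gang's shared check list, or raise ValueError on a mismatch.

    Each element of pods is one pod's annotation map.
    """
    if not pods:
        return []
    expected = selected_checks(pods[0], default_enabled)
    for annotations in pods[1:]:
        # Assumption: lists must match exactly, including order.
        if selected_checks(annotations, default_enabled) != expected:
            raise ValueError("gang members request different preflight checks")
    return expected
```

Rejecting a mismatched gang up front is what keeps one pod from waiting in a collective check that its peers never start.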
New Integrations
gRPC Sink Connector (#1113)
A new gRPC sink connector for platform-connectors that forwards HealthEvent protos to an external gRPC server. Reuses the existing PlatformConnector gRPC service definition (HealthEventOccurredV1). Includes its own ring buffer with independent retry/backoff, Prometheus metrics (send counter, duration histogram, retry counter), and Helm configuration. Disabled by default.
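As a rough picture of a ring buffer with its own retry/backoff, consider the Python sketch below. It is not the connector's implementation: RingBufferSink, the drop-oldest overflow policy, the exponential backoff schedule, and the use of ConnectionError as the retryable failure are all assumptions, and send is a hypothetical callable standing in for the gRPC call.

```python
import time
from collections import deque
from typing import Any, Callable


class RingBufferSink:
    """Bounded buffer that retries failed sends with exponential backoff."""

    def __init__(self, send: Callable[[Any], None], capacity: int = 1024,
                 max_retries: int = 3, base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send
        # A deque with maxlen silently drops the oldest entry on overflow.
        self._buf = deque(maxlen=capacity)
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep
        self.retry_count = 0  # analogous to a retry counter metric

    def push(self, event: Any) -> None:
        self._buf.append(event)

    def drain(self) -> int:
        """Send buffered events in order and return how many were delivered.

        An event that exhausts its retries stays at the head of the buffer.
        """
        sent = 0
        while self._buf:
            if not self._send_with_retry(self._buf[0]):
                break
            self._buf.popleft()
            sent += 1
        return sent

    def _send_with_retry(self, event: Any) -> bool:
        for attempt in range(self._max_retries + 1):
            try:
                self._send(event)
                return True
            except ConnectionError:
                if attempt < self._max_retries:
                    self.retry_count += 1
                    # Delay doubles on each retry: base, 2*base, 4*base, ...
                    self._sleep(self._base_delay * (2 ** attempt))
        return False
```

Because the buffer and retry state live inside the sink, a slow or unavailable external server in this sketch delays only this sink's deliveries.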
OpenShift Service-CA Support (#1077)
NVSentinel can now run on OpenShift Container Platform without requiring cert-manager. A new webhook.certProvider Helm value ("cert-manager" or "openshift-service-ca") is available on janitor and preflight subcharts. In OpenShift mode, cert-manager CRDs are not rendered; instead, Services get service.beta.openshift.io/serving-cert-secret-name annotations and webhook configurations get CA bundle injection annotations. No changes to Deployments or Go code required.
Bug Fixes & Reliability
- MongoDB Resume Token Recovery (#1092, #955): Fixed fault-remediation and node-drainer entering unrecoverable CrashLoopBackOff after upgrades on quiet clusters. Stale resume tokens that reference expired oplog positions are now automatically deleted, allowing the change stream to restart cleanly.
- DCGM Handle/Connection Leak (#1089, #1078): Fixed _cleanup_dcgm_resources() where Delete() failure prevented Shutdown() from running, leaking one TCP connection + one DCGM group per retry cycle. After 64 leaks (DCGM_MAX_NUM_GROUPS), DCGM returned "Max limit reached", triggering permanent GpuDcgmConnectivityFailure alerts. Split into independent try blocks with group rollback on partial init failure.
- DCGM Startup Race (#1034): Fixed gpu-health-monitor emitting spurious DCGM_CONNECTIVITY_ERROR events on startup when the nvidia-dcgm pod exists but isn't ready (e.g., during gpu-operator init container sequence or post-reboot).
- PostgreSQL Password Auth (#1143, #1090): Fixed newPostgreSQLCompatibleConfig not reading DATASTORE_PASSWORD, causing pq: password authentication failed for components using the ClientFactory code path (platform-connectors).
- Metadata Collector on Pre-R560 Drivers (#1147, #810): Fixed crash on nodes running driver 550 or older by gating chassis serial collection (nvmlDeviceGetPlatformInfo) on driver major version >= 560.
- GPUReset CR Deletion for Deleted Nodes (#1029): GPUReset CRs no longer get stuck in Terminating status when their corresponding node has been deleted. The restoreServices step gracefully handles missing node objects.
- Audit Logging in Janitor-Provider (#1075): Fixed missing audit log initialization in janitor-provider. CSP API calls (reboot signals to GCP/AWS/Azure) were silently dropped from the audit trail.
- Mongo Query State Transitions (#1166): Fixed node-drainer state transition from quarantined to drain-succeeded when no user workloads exist, and filter resolution issues in MongoDB queries.
- Preflight Chart Consistency (#1114): Aligned preflight Helm chart structure with other subcharts.
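The "independent try blocks" pattern from the DCGM handle/connection leak fix can be sketched as below. This is an illustration, not the actual _cleanup_dcgm_resources() body: the function name cleanup_dcgm_resources, its parameters, and the returned error list are hypothetical, and group and handle are assumed to be objects exposing the Delete() and Shutdown() methods named in the fix.

```python
import logging
from typing import Any, List, Optional

log = logging.getLogger("dcgm-cleanup-sketch")


def cleanup_dcgm_resources(group: Optional[Any], handle: Optional[Any]) -> List[Exception]:
    """Run both cleanup steps independently and return any errors raised."""
    errors: List[Exception] = []

    if group is not None:
        try:
            group.Delete()
        except Exception as exc:
            # A failed Delete() is recorded but must not skip Shutdown().
            log.warning("group delete failed: %s", exc)
            errors.append(exc)

    if handle is not None:
        try:
            handle.Shutdown()
        except Exception as exc:
            log.warning("handle shutdown failed: %s", exc)
            errors.append(exc)

    return errors
```

With a single shared try block, an exception from Delete() would jump past Shutdown(), which is the leak described in the fix; separating the two steps means the connection is still closed when group deletion fails.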
Performance Improvements
- Labeler Memory Optimization (#1093): Filtered irrelevant node update events (heartbeats, condition changes) from the labeler's watch handler. On ~1,000 node clusters, heap growth dropped from 66 MiB to 1 MiB (peak from 100 MiB to 9 MiB).
- Node Drainer Memory Optimization (#1121, #1133): Replaced full datastore.Event documents (~13 KB each) in the workqueue with lightweight ID references, fetching the full event lazily from DB only during processing.
- Metadata Collector DaemonSet (#1091): Aligned maxUnavailable for metadata collector with other NVSentinel daemonsets.
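The node-drainer change above (queue IDs, fetch lazily) can be illustrated with a small Python sketch. EventQueue, its methods, and the fetch callback are hypothetical names; skipping an event that no longer exists in the datastore is an assumption about behavior, not something stated in these notes.

```python
from collections import deque
from typing import Callable, Deque, Optional


class EventQueue:
    """Queue that stores only event IDs and loads the full document when processing."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]) -> None:
        self._ids: Deque[str] = deque()
        self._fetch = fetch  # looks up the full event by ID in the datastore

    def add(self, event: dict) -> None:
        # Keep just the ID; the multi-kilobyte document is not retained here.
        self._ids.append(event["id"])

    def process_next(self, handler: Callable[[dict], None]) -> bool:
        """Process one queued ID; return False when the queue is empty."""
        if not self._ids:
            return False
        event = self._fetch(self._ids.popleft())
        if event is not None:  # assumption: skip events deleted since enqueue
            handler(event)
        return True

    def __len__(self) -> int:
        return len(self._ids)
```

The trade-off is one extra datastore read per processed event in exchange for a queue whose memory footprint scales with ID size rather than document size.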
Configuration & Operations
- Kubelet Host Configuration (#1095): Metadata collector kubelet host is now configurable via KUBELET_HOST env var, supporting environments where the kubelet is not reachable at the default address.
- UAT Event Timeout (#1035): Event-waiting loops in UAT tests are now configurable via UAT_EVENT_TIMEOUT env var (default: 30s), fixing race conditions where events arrive after the default timeout.
- GPU Reset Scale Test (#1024): New configurable scale test for GPU reset (disabled by default). Also fixes two concurrency bugs discovered during scale testing: controller-level locking replaced with node-level locking to prevent deadlocks, and a race condition in competing GPUReset CRs for the same node.
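The difference between controller-level and node-level locking mentioned in the GPU reset item can be sketched in Python. NodeLocks and its methods are hypothetical and only illustrate the general per-key locking technique, not the controller's actual code.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class NodeLocks:
    """One lock per node name, so work on different nodes does not contend."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks: Dict[str, threading.Lock] = {}

    def _lock_for(self, node: str) -> threading.Lock:
        with self._guard:
            if node not in self._locks:
                self._locks[node] = threading.Lock()
            return self._locks[node]

    def try_acquire(self, node: str) -> bool:
        """Non-blocking acquire; False means another reset holds this node."""
        return self._lock_for(node).acquire(blocking=False)

    def release(self, node: str) -> None:
        self._lock_for(node).release()

    @contextmanager
    def hold(self, node: str) -> Iterator[None]:
        """Block until the node's lock is free, then hold it for the with-body."""
        lock = self._lock_for(node)
        with lock:
            yield
```

A single controller-wide lock would serialize every reset; with per-node locks, two GPUReset requests contend only when they target the same node.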
Documentation & Design
- Fern Documentation Site (#1123): Adopted Fern for publishing styled documentation at docs.nvidia.com, with CI workflows for build, preview, and publish.
- ADR-035: Inline DCGM Config (#1136): Documentation recommending inline env vars over the dcgm: block for DCGM preflight configuration.
- Preflight Documentation (#1115): User-facing documentation for the preflight framework.
Acknowledgments
This release includes contributions from:
- @deesharma24
- @drubinstein
- @fabiendupont
- @jonalee99
- @jschelling
- @jyotimahapatra
- @KaivalyaMDabhadkar
- @lalitadithya
- @natherz97
- @pdmack
- @rupalis
- @SMohanKumar
- @tanishagoyal2
- @XRFXLP
- @yuanchen97
- @yysindi
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.2.0 \
--namespace nvsentinel \
--create-namespace
To upgrade from v1.1.0:
helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version v1.2.0 \
--namespace nvsentinel \
--reuse-values
For detailed installation and configuration instructions, see the documentation.