Release v1.2.0

github-actions released this 13 Apr 11:04 · 151 commits to main since this release · v1.2.0 · 9e73e8f

This release introduces OpenTelemetry distributed tracing across the remediation pipeline (behind a feature flag), support for external managed MongoDB (Atlas, DocumentDB, Cosmos DB), TLS certificate hot-reload, preflight framework enhancements (per-pod check selection, init container placement, objectSelector), a gRPC sink connector, OpenShift service-ca support, and feature flag observability via Prometheus. It also includes critical reliability fixes for MongoDB change streams, DCGM handle leaks, and cold-start event processing.

Major New Features

OpenTelemetry Distributed Tracing (#771) — Feature Flag

Note: This feature is behind a feature flag and is disabled by default. Enable via Helm values to opt in.

End-to-end distributed tracing across the NVSentinel remediation pipeline using OpenTelemetry. Operators can now trace the full control flow of a health event — from detection through quarantine, draining, and remediation — in a single trace view.

  • Common Instrumentation (#1088): Core tracing setup with auto-instrumentation for HTTP (otelhttp), MongoDB (otelmongo), and PostgreSQL (otelsql) clients. Errors surface as red spans with exception type and message attributes. All async processes and MongoDB-dependent modules are covered.
  • Platform Connector (#1087): Traces event transformation, HTTP/DB calls, and Kubernetes node event writes with per-operation spans.
  • Fault Quarantine (#1117): Traces cordon/uncordon operations with span attributes for taints applied/removed, annotations, and labels.
  • Event Exporter (#1122): Traces event insertion, backfill queries, and publish failures with recorded errors.
  • Health Events Analyzer (#1134): Traces rule evaluation with per-rule span durations, pipeline query failures, and event publishing.

External Managed MongoDB Support (#376)

NVSentinel no longer requires an in-cluster MongoDB deployment. Lean GPU clusters can now offload the database to a managed service.

  • Atlas, DocumentDB, Cosmos DB (#1010): First-class Helm support for external MongoDB-compatible services. Includes a Helm hook Job that prepares the external database (collections and indexes), per-provider value overlays (values-atlas-gcp.yaml, values-aws-docdb.yaml, values-cosmosdb-test.yaml), TLS cert volume helpers, and ConfigMap alignment for external connection strings.
  • Secret-Based Credentials (#1076): MONGODB_URI is now loaded exclusively from a Kubernetes Secret when global.datastore.credentialsFromSecret.name is set. The datastore ConfigMap no longer contains the URI in this mode. Helm fails fast if the Secret name is missing or if global.datastore.uri is set for MongoDB, preventing accidental plaintext credential exposure.

TLS Certificate Rotation (#1030)

Implements ADR-029 for automatic TLS certificate rotation without pod restarts.

  • Server-Side Hot-Reload: Replaces static tls.LoadX509KeyPair with certwatcher from controller-runtime in janitor-provider, enabling automatic cert pickup when cert-manager rotates certificates.
  • Client-Side CA Bundle Rotation: Replaces the persistent gRPC connection (created once at startup) with a per-reconciliation dialProvider() call that reads the CA bundle and SA token fresh from disk each time.
  • Helm: Adds tls.secretName override to janitor-provider values for externally managed certificates.
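
The client-side behavior described above can be sketched in a few lines of Python. This is only an illustration, not the janitor code (which is Go): the name dial_provider mirrors dialProvider() from the notes, while the DialCredentials type, the file paths, and the return shape are assumptions. The point is that nothing is cached between calls, so a rotated file is visible on the next reconciliation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DialCredentials:
    """Credentials for one connection attempt (hypothetical type)."""
    ca_bundle: bytes
    sa_token: str


def dial_provider(ca_path: Path, token_path: Path) -> DialCredentials:
    """Read the CA bundle and service-account token fresh from disk.

    No state is kept between calls, so files rotated on disk are
    picked up by the very next call without a process restart.
    """
    return DialCredentials(
        ca_bundle=ca_path.read_bytes(),
        sa_token=token_path.read_text().strip(),
    )
```

The trade-off in this style is a small amount of file I/O per call in exchange for never holding stale credentials.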

Cold Start Handling in Fault Remediation (#912)

  • Batched Cold-Start Processing (#1118): When fault-remediation or node-drainer restarts with a stale or missing resume token, the cold-start handler now processes missed events in bounded batches (default 1000) instead of loading all results into memory at once. This prevents OOM on clusters with large event histories. The new FindHealthEventsByQueryBatched method is implemented for both MongoDB (cursor-based) and PostgreSQL (LIMIT/OFFSET). Cold-start events in fault-remediation are enqueued into the controller-runtime workqueue, giving them the same requeue/retry semantics as live change-stream events.
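
A minimal Python sketch of the bounded-batch idea follows. It is not the FindHealthEventsByQueryBatched implementation: the fetch_page callback is a hypothetical stand-in for a LIMIT/OFFSET query, and only the default batch size of 1000 comes from the notes above.

```python
from typing import Callable, Iterator, List, Sequence

DEFAULT_BATCH_SIZE = 1000  # default stated in the release notes


def find_events_batched(
    fetch_page: Callable[[int, int], Sequence[dict]],
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> Iterator[List[dict]]:
    """Yield events in batches of at most batch_size.

    fetch_page(offset, limit) is a hypothetical query callback; only one
    batch is held in memory at a time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        page = list(fetch_page(offset, batch_size))
        if not page:
            return  # no more results
        yield page
        if len(page) < batch_size:
            return  # short page means the result set is exhausted
        offset += len(page)
```

Because the function is a generator, the caller decides how quickly to consume batches, for example by enqueueing each event before requesting the next page.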

Feature Flag Observability (#836)

  • Prometheus Metrics (#1094): All NVSentinel feature flags (circuit breaker, log collector, processing strategy, per-rule enabled/disabled state) are now exposed as Prometheus gauge metrics. Operators managing hundreds of clusters can build dashboards showing fleet-wide feature flag status without digging through ConfigMaps.
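
To make the gauge idea concrete, the sketch below renders a set of boolean flags in the Prometheus text exposition format using only the Python standard library. The metric name feature_flag_enabled and the flag label are assumptions for illustration; the notes above do not state the names NVSentinel actually exports.

```python
from typing import Mapping

METRIC_NAME = "feature_flag_enabled"  # hypothetical metric name


def render_feature_flags(flags: Mapping[str, bool]) -> str:
    """Render flags as gauge samples: 1 for enabled, 0 for disabled."""
    lines = [
        f"# HELP {METRIC_NAME} Whether a feature flag is enabled (1) or disabled (0).",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    for name in sorted(flags):
        # Escape backslash, double quote, and newline in the label value.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        value = 1 if flags[name] else 0
        lines.append(f'{METRIC_NAME}{{flag="{escaped}"}} {value}')
    return "\n".join(lines) + "\n"
```

With one sample per flag, a fleet-wide dashboard can aggregate over a cluster label added at scrape time.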

Preflight Framework Enhancements

  • Per-Pod Check Selection (#1139): Pods can now select which preflight checks to run via annotation (nvsentinel.nvidia.com/preflight-checks: "preflight-dcgm-diag,preflight-nccl-loopback"). When the annotation is absent, all checks with defaultEnabled: true run, which preserves existing behavior. Gang validation ensures all pods in a gang have the same check list, failing fast before torchrun to prevent distributed deadlocks.
  • Init Container Placement (#1132, #1131): Configurable initContainerPlacement ("prepend" or "append") for preflight init containers relative to existing init containers. Supports scheduling environments that inject their own init containers with ordering requirements.
  • objectSelector Support (#1128): Optional objectSelector on the preflight MutatingWebhookConfiguration for pod-level opt-in filtering alongside namespace-level selection. Default is no objectSelector (unchanged behavior).
  • imagePullSecrets Injection (#1142): When preflight init container images are hosted in a private registry, imagePullSecrets are now automatically injected into target pods. Also fixes a gang controller ConfigMap mismatch where the controller created an orphaned ConfigMap when the webhook fires before the scheduler annotates pods.
  • Inline DCGM Config (#1140, #1137): Per ADR-035, the separate dcgm: config block is removed. DCGM-specific configuration (DCGM_HOSTENGINE_ADDR, DCGM_DIAG_LEVEL) is now defined as inline env vars on the preflight-dcgm-diag container, consistent with how NCCL checks are configured. See the migration table in the PR for upgrade guidance.
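
The per-pod selection and gang validation rules can be sketched as follows. Only the annotation key and the "absent means defaults" rule come from the notes above; the function names, the whitespace trimming, and the choice to compare check lists without regard to order are assumptions.

```python
from typing import Dict, Iterable, List

ANNOTATION = "nvsentinel.nvidia.com/preflight-checks"


def select_checks(annotations: Dict[str, str], default_enabled: Iterable[str]) -> List[str]:
    """Return the checks a pod asks for, or the defaults if it asks for none."""
    raw = annotations.get(ANNOTATION)
    if raw is None:
        return list(default_enabled)  # annotation absent: existing behavior
    return [name.strip() for name in raw.split(",") if name.strip()]


def validate_gang(check_lists: Iterable[List[str]]) -> None:
    """Raise ValueError unless every pod in the gang selects the same checks."""
    distinct = {tuple(sorted(checks)) for checks in check_lists}
    if len(distinct) > 1:
        raise ValueError(f"gang pods disagree on preflight checks: {sorted(distinct)}")
```

Rejecting a mismatched gang up front is cheaper than letting some ranks wait on a collective check that other ranks never start.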

New Integrations

gRPC Sink Connector (#1113)

A new gRPC sink connector for platform-connectors that forwards HealthEvent protos to an external gRPC server. Reuses the existing PlatformConnector gRPC service definition (HealthEventOccurredV1). Includes its own ring buffer with independent retry/backoff, Prometheus metrics (send counter, duration histogram, retry counter), and Helm configuration. Disabled by default.
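
As a rough illustration of a ring buffer paired with retry and backoff, here is a stdlib-only Python sketch. The overwrite-oldest policy, the exponential delays, and every name in it are assumptions; the notes above do not describe the connector's actual buffer semantics or backoff schedule.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class RingBuffer:
    """Fixed-capacity FIFO; when full, the oldest item is overwritten (assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[object] = deque(maxlen=capacity)

    def push(self, item: object) -> None:
        self._items.append(item)

    def pop(self) -> Optional[object]:
        return self._items.popleft() if self._items else None


def send_with_retry(
    send: Callable[[object], None],
    event: object,
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(event), retrying with exponential backoff.

    Returns the number of attempts used and re-raises the last error
    once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Injecting the sleep function keeps the retry logic testable without real delays.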

OpenShift Service-CA Support (#1077)

NVSentinel can now run on OpenShift Container Platform without requiring cert-manager. A new webhook.certProvider Helm value ("cert-manager" or "openshift-service-ca") is available on janitor and preflight subcharts. In OpenShift mode, cert-manager CRDs are not rendered; instead, Services get service.beta.openshift.io/serving-cert-secret-name annotations and webhook configurations get CA bundle injection annotations. No changes to Deployments or Go code required.

Bug Fixes & Reliability

  • MongoDB Resume Token Recovery (#1092, #955): Fixed fault-remediation and node-drainer entering unrecoverable CrashLoopBackOff after upgrades on quiet clusters. Stale resume tokens that reference expired oplog positions are now automatically deleted, allowing the change stream to restart cleanly.
  • DCGM Handle/Connection Leak (#1089, #1078): Fixed _cleanup_dcgm_resources(), where a Delete() failure prevented Shutdown() from running, leaking one TCP connection and one DCGM group per retry cycle. After 64 leaks (DCGM_MAX_NUM_GROUPS), DCGM returned "Max limit reached", triggering permanent GpuDcgmConnectivityFailure alerts. The cleanup is now split into independent try blocks, with group rollback on partial init failure.
  • DCGM Startup Race (#1034): Fixed gpu-health-monitor emitting spurious DCGM_CONNECTIVITY_ERROR events on startup when the nvidia-dcgm pod exists but isn't ready (e.g., during gpu-operator init container sequence or post-reboot).
  • PostgreSQL Password Auth (#1143, #1090): Fixed newPostgreSQLCompatibleConfig not reading DATASTORE_PASSWORD, causing pq: password authentication failed for components using the ClientFactory code path (platform-connectors).
  • Metadata Collector on Pre-R560 Drivers (#1147, #810): Fixed a crash on nodes running driver 550 or older by gating chassis serial collection (nvmlDeviceGetPlatformInfo) on driver major version >= 560.
  • GPUReset CR Deletion for Deleted Nodes (#1029): GPUReset CRs no longer get stuck in Terminating status when their corresponding node has been deleted. The restoreServices step gracefully handles missing node objects.
  • Audit Logging in Janitor-Provider (#1075): Fixed missing audit log initialization in janitor-provider. CSP API calls (reboot signals to GCP/AWS/Azure) were silently dropped from the audit trail.
  • Mongo Query State Transitions (#1166): Fixed node-drainer state transition from quarantined to drain-succeeded when no user workloads exist, and filter resolution issues in MongoDB queries.
  • Preflight Chart Consistency (#1114): Aligned preflight Helm chart structure with other subcharts.
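
The shape of the DCGM cleanup fix, with each step in its own try block, can be sketched like this. The delete_group and shutdown callables are hypothetical stand-ins for the Delete() and Shutdown() calls named above; this is not the gpu-health-monitor source.

```python
import logging
from typing import Callable, List

log = logging.getLogger("dcgm-cleanup-sketch")


def cleanup_dcgm_resources(
    delete_group: Callable[[], None],
    shutdown: Callable[[], None],
) -> List[Exception]:
    """Run both cleanup steps independently and return any errors.

    A failure while deleting the group no longer skips shutdown, so the
    connection is released even when group deletion fails.
    """
    errors: List[Exception] = []
    try:
        delete_group()
    except Exception as exc:  # keep going: shutdown must still run
        log.warning("group delete failed: %s", exc)
        errors.append(exc)
    try:
        shutdown()
    except Exception as exc:
        log.warning("shutdown failed: %s", exc)
        errors.append(exc)
    return errors
```

Returning the collected errors instead of raising lets the caller decide whether a partial cleanup should trigger a retry.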

Performance Improvements

  • Labeler Memory Optimization (#1093): Filtered irrelevant node update events (heartbeats, condition changes) from the labeler's watch handler. On clusters of ~1,000 nodes, heap growth dropped from 66 MiB to 1 MiB (peak from 100 MiB to 9 MiB).
  • Node Drainer Memory Optimization (#1121, #1133): Replaced full datastore.Event documents (~13 KB each) in the workqueue with lightweight ID references, fetching the full event lazily from DB only during processing.
  • Metadata Collector DaemonSet (#1091): Aligned maxUnavailable for metadata collector with other NVSentinel daemonsets.
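
The node-drainer change (queueing IDs rather than full documents) follows a common pattern, sketched below in Python. The EventQueue class and the fetch_event callback are hypothetical; the real workqueue is the controller-runtime one and the real lookup goes to the datastore.

```python
from collections import deque
from typing import Callable, Deque, Optional


class EventQueue:
    """Queue that stores only event IDs and loads the full event on demand."""

    def __init__(self, fetch_event: Callable[[str], Optional[dict]]) -> None:
        self._ids: Deque[str] = deque()
        self._fetch = fetch_event  # hypothetical datastore lookup by ID

    def enqueue(self, event_id: str) -> None:
        self._ids.append(event_id)  # a short string, not a multi-KB document

    def process_next(self, handler: Callable[[dict], None]) -> bool:
        """Process one queued ID; return False when the queue is empty."""
        if not self._ids:
            return False
        event_id = self._ids.popleft()
        event = self._fetch(event_id)  # lazy fetch, only at processing time
        if event is not None:  # the event may have been removed meanwhile
            handler(event)
        return True
```

A side effect of this design is that the handler always sees the latest stored version of the event rather than a snapshot taken at enqueue time.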

Configuration & Operations

  • Kubelet Host Configuration (#1095): Metadata collector kubelet host is now configurable via KUBELET_HOST env var, supporting environments where the kubelet is not reachable at the default address.
  • UAT Event Timeout (#1035): The timeout for event-waiting loops in UAT tests is now configurable via the UAT_EVENT_TIMEOUT env var (default: 30s), which addresses race conditions where events arrive after the default timeout.
  • GPU Reset Scale Test (#1024): New configurable scale test for GPU reset (disabled by default). Also fixes two concurrency bugs discovered during scale testing: controller-level locking was replaced with node-level locking to prevent deadlocks, and a race condition between competing GPUReset CRs for the same node was resolved.
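
Node-level locking, as opposed to one controller-wide lock, can be sketched with a lock registry keyed by node name. This is a generic Python illustration under assumed names (NodeLocks, hold), not the GPUReset controller code.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class NodeLocks:
    """One lock per node, so work on different nodes never contends."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the registry itself
        self._locks: Dict[str, threading.Lock] = {}

    def _lock_for(self, node: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(node, threading.Lock())

    @contextmanager
    def hold(self, node: str) -> Iterator[None]:
        """Serialize work for a single node; other nodes proceed in parallel."""
        lock = self._lock_for(node)
        with lock:
            yield
```

The registry guard is held only while looking up a lock, never while doing node work, which keeps the critical section for the shared map short.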

Documentation & Design

  • Fern Documentation Site (#1123): Adopted Fern for publishing styled documentation at docs.nvidia.com, with CI workflows for build, preview, and publish.
  • ADR-035: Inline DCGM Config (#1136): Documentation recommending inline env vars over the dcgm: block for DCGM preflight configuration.
  • Preflight Documentation (#1115): User-facing documentation for the preflight framework.

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.2.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.1.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.2.0 \
  --namespace nvsentinel \
  --reuse-values

For detailed installation and configuration instructions, see the documentation.