Skip to content

v0.6.0

Latest

Choose a tag to compare

@hexinw-nvidia hexinw-nvidia released this 14 May 21:54
· 52 commits to main since this release
Immutable release. Only release title and notes can be modified.
6c5f2a1

NVIDIA Resiliency Extension v0.6.0

Highlights

  • In-job restart

    • Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
    • Rendezvous protocol hardening — round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
    • Robust startup and shutdown — wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix terminate_mp_processes to cover failed workers (#270).
    • Hot-spare node support — closes the v0.5 spare-node gap. Hot-spare is always-on and works with --max-restarts (#226, #250, #266):
      • Simple mode (default, --ft-segment=None) for H100 / non-NVSwitch systems — first min_nodes from --nnodes=min:max become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
      • Segment-aware mode (--ft-segment=N) for NVSwitch systems (DGX H200, HGX B200) — uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires min_nodes % segment == 0.
      • Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
    • Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
    • External InJob control-plane (experimental) — embed ft_launcher orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
    • Section-timeout fixes — out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
    • --max-restarts now reflects job-level restart attempts (#211); ft_launcher runs with sensible defaults out of the box (#205, #271).
    • NUMA binding support in ft_launcher for optimized memory affinity (#209).
  • Health checks

    • NIC link-state health check (#230).
    • Distributed Storage health check (#239).
    • DCA integration for HealthCheck (#235).
    • Fail-count tracking in NodeHealthCheck (#244).
  • Checkpointing

    • CPU shared-memory D2H path (experimental) in FileSystemWriterAsync removes a redundant H2H copy and resolves the prior shm D2H race (#298).
    • PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via ConsistentDataIdentifier, and class-level metadata cache in CachedMetadataFileSystemReader (#273, #274, #275).
    • Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in TemporalAsyncCaller (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).
  • Fault attribution — productized as standalone services (experimental)

    The attribution module — including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration — remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.

    • NVRx Attribution Service (attrsvc) and NVRx Slurm Monitor Service (smonsvc) introduced as FastAPI-based standalone services (#242, #248).
    • ft_launcher-managed attrsvc for co-located deployment (#318); UDS endpoints for attrsvc/smonsvc (#315).
    • Attribution is now an optional package — install with pip install nvidia-resiliency-ext[attribution] (#305). Attribution internals refactored under a svc subpackage with a clear controller/runner boundary (#295, #313, #316).
    • PyTorch Flight Recorder (experimental) support in attrsvc (#283); FR ordering switched to window-based instead of PG description (#210, #216, #219).
    • LogSage (experimental) integrated as an attribution module — direct in-process API, configurable LLM model via env, LogSage v0.1.7 (#224, #249, #267, #289, #297, #308).
    • Triggers and outputs: last-cycle attribution trigger (#247), job-completion handling (#251), Slack bot notifications (#253).
    • MCP (Model Context Protocol) integration (experimental) exposes the attribution module as a tool for the NVIDIA resiliency agent (#215); nvrx-attr Claude skill bundle for the attribution workflow (#312).
  • Log aggregation & observability

    • Two-level gRPC log aggregation (N leaves + root) with auto-tier selection and end-to-end tests (#280, #307, #309).
    • Writer-thread + persistent-reader refactor (#254); pipe-based per-cycle log capture and split-log support (#225, #240).
    • NVRxCycleInfo protobuf exposes per-cycle metadata over the new gRPC interface (#258, #292).
    • GPU memory logger (#206).
  • Build, packaging & security

    • poetry-dynamic-versioning — wheel versions are derived from git tags (v0.6.0-rc1, v0.6.0, etc.) (#260).
    • New optional extras: [attribution] for the attribution service stack (LogSage, MCP, FastAPI, Slack), [dataflow] for nvdataflow integration.
    • Wheel security-scan cleanups and Bandit findings (ionice subprocess, FR module) (#219, #294, #306).
    • Torch 2.10 and langchain lock updates for CVE compliance.

Deprecations & Removals

  • Legacy dynamic rendezvous (v1) is deprecated; barrier-based rendezvous (v2) is the default. Plan to remove v1 in a future release (#282).
  • --ft-restart-policy is deprecated (#259). The min-healthy value has been removed; only any-failed remains (which is also the default), so the flag is effectively a no-op in v0.6. Migration: remove the flag from your launch scripts. No replacement is required. The flag itself will be removed in a future release.
  • ptl_resiliency package is deprecated and removed from the wheel (#282, #285). PyTorch Lightning users should pin v0.5.x or migrate to the underlying fault_tolerance / checkpointing / attribution.straggler APIs directly.
  • OneLogger integration removed (#257).
  • In-process restart is deprecated in v0.6 and will be removed in a future release. It is no longer the focus of the fast-restart solution. Migration: use in-job restart via ft_launcher (see the fault_tolerance usage guide). The existing nvidia_resiliency_ext.inprocess APIs continue to work in v0.6 but are no longer maintained for new feature development; bug fixes will be considered on a case-by-case basis.

Installation

# Core resiliency (fault tolerance, checkpointing, health checks)
pip install nvidia-resiliency-ext

# With attribution stack (Attribution Service, LogSage, MCP, Slack notifications)
pip install nvidia-resiliency-ext[attribution]

# With nvdataflow integration
pip install nvidia-resiliency-ext[dataflow]

Known Issues & Limitations

  • Ubuntu 22.04 / glibc < 2.39 users are advised to build from source — PyPI wheels target manylinux_2_39 and default to CUDA 13.
  • Python: wheels are published for 3.10, 3.11, 3.12.
  • Attribution, Flight Recorder analysis, LogSage, and MCP integration are experimental. APIs, CLI flags, and service contracts may change in subsequent releases.
  • External InJob control-plane is experimental and not yet QA-validated; APIs may change in subsequent releases.
  • CPU shared-memory D2H path in FileSystemWriterAsync is experimental (#298); enable only in non-production runs while we collect soak feedback.