Release v0.6.0 · NVIDIA/nvidia-resiliency-ext

NVIDIA Resiliency Extension v0.6.0

Highlights

In-job restart
- Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
- Rendezvous protocol hardening — round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
- Robust startup and shutdown — wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix terminate_mp_processes to cover failed workers (#270).
- Hot-spare node support — closes the v0.5 spare-node gap. Hot-spare is always-on and works with --max-restarts (#226, #250, #266):
  - Simple mode (default, --ft-segment=None) for H100 / non-NVSwitch systems — first min_nodes from --nnodes=min:max become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
  - Segment-aware mode (--ft-segment=N) for NVSwitch systems (DGX H200, HGX B200) — uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires min_nodes % segment == 0.
  - Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
- Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
- External InJob control-plane (experimental) — embed ft_launcher orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
- Section-timeout fixes — out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
- --max-restarts now reflects job-level restart attempts (#211); ft_launcher runs with sensible defaults out of the box (#205, #271).
- NUMA binding support in ft_launcher for optimized memory affinity (#209).
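
The hot-spare rules listed above (first min_nodes active, the rest standby, and min_nodes % segment == 0 in segment-aware mode) can be illustrated with a small sketch. This is not the ft_launcher implementation: the names NodeRole, parse_nnodes, check_segment, and assign_roles are hypothetical, the arrival-ordered node list is an assumption, and "reserved ranks" is modeled simply as standbys taking the ranks after the active nodes. ClusterUUID-based grouping is not modeled; only the divisibility check is.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeRole:
    node: str
    active: bool     # False means the node is a hot-spare standby
    group_rank: int  # assumption: standbys take the ranks after the active nodes


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse a 'min:max' node-count spec, e.g. '3:4' -> (3, 4)."""
    low, sep, high = spec.partition(":")
    min_nodes = int(low)
    max_nodes = int(high) if sep else min_nodes
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError(f"invalid node range: {spec!r}")
    return min_nodes, max_nodes


def check_segment(min_nodes: int, segment: Optional[int]) -> None:
    """Enforce the stated constraint: min_nodes % segment == 0."""
    if segment is None:  # simple mode, nothing to check
        return
    if segment <= 0 or min_nodes % segment != 0:
        raise ValueError(
            f"min_nodes={min_nodes} is not a multiple of segment={segment}"
        )


def assign_roles(
    nodes: List[str], nnodes: str, segment: Optional[int] = None
) -> List[NodeRole]:
    """Mark the first min_nodes nodes active and the remainder standby."""
    min_nodes, max_nodes = parse_nnodes(nnodes)
    check_segment(min_nodes, segment)
    if not min_nodes <= len(nodes) <= max_nodes:
        raise ValueError(f"{len(nodes)} nodes is outside {min_nodes}:{max_nodes}")
    return [
        NodeRole(node=name, active=index < min_nodes, group_rank=index)
        for index, name in enumerate(nodes)
    ]


if __name__ == "__main__":
    for role in assign_roles(["n0", "n1", "n2", "n3"], "3:4"):
        print(role)
```

With four nodes and a 3:4 range, the first three nodes come out active and the fourth is the standby. Passing segment=2 with the same range raises an error, because 3 is not a multiple of 2.
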
Health checks
- NIC link-state health check (#230).
- Distributed Storage health check (#239).
- DCA integration for HealthCheck (#235).
- Fail-count tracking in NodeHealthCheck (#244).
Checkpointing
- CPU shared-memory D2H path (experimental) in FileSystemWriterAsync removes a redundant H2H copy and resolves the prior shm D2H race (#298).
- PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via ConsistentDataIdentifier, and class-level metadata cache in CachedMetadataFileSystemReader (#273, #274, #275).
- Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in TemporalAsyncCaller (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).
Fault attribution — productized as standalone services (experimental)

The attribution module — including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration — remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.
- NVRx Attribution Service (attrsvc) and NVRx Slurm Monitor Service (smonsvc) introduced as FastAPI-based standalone services (#242, #248).
- ft_launcher-managed attrsvc for co-located deployment (#318); UDS endpoints for attrsvc/smonsvc (#315).
- Attribution is now an optional package — install with pip install nvidia-resiliency-ext[attribution] (#305). Attribution internals refactored under a svc subpackage with a clear controller/runner boundary (#295, #313, #316).
- PyTorch Flight Recorder (experimental) support in attrsvc (#283); FR ordering switched to window-based instead of PG description (#210, #216, #219).
- LogSage (experimental) integrated as an attribution module — direct in-process API, configurable LLM model via env, LogSage v0.1.7 (#224, #249, #267, #289, #297, #308).
- Triggers and outputs: last-cycle attribution trigger (#247), job-completion handling (#251), Slack bot notifications (#253).
- MCP (Model Context Protocol) integration (experimental) exposes the attribution module as a tool for the NVIDIA resiliency agent (#215); nvrx-attr Claude skill bundle for the attribution workflow (#312).
Log aggregation & observability
- Two-level gRPC log aggregation (N leaves + root) with auto-tier selection and end-to-end tests (#280, #307, #309).
- Writer-thread + persistent-reader refactor (#254); pipe-based per-cycle log capture and split-log support (#225, #240).
- NVRxCycleInfo protobuf exposes per-cycle metadata over the new gRPC interface (#258, #292).
- GPU memory logger (#206).
Build, packaging & security
- poetry-dynamic-versioning — wheel versions are derived from git tags (v0.6.0-rc1, v0.6.0, etc.) (#260).
- New optional extras: [attribution] for the attribution service stack (LogSage, MCP, FastAPI, Slack), [dataflow] for nvdataflow integration.
- Wheel security-scan cleanups and Bandit findings (ionice subprocess, FR module) (#219, #294, #306).
- Torch 2.10 and langchain lock updates for CVE compliance.

Deprecations & Removals

Legacy dynamic rendezvous (v1) is deprecated; barrier-based rendezvous (v2) is the default. v1 will be removed in a future release (#282).
--ft-restart-policy is deprecated (#259). The min-healthy value has been removed; only any-failed remains (which is also the default), so the flag is effectively a no-op in v0.6. Migration: remove the flag from your launch scripts. No replacement is required. The flag itself will be removed in a future release.
The ptl_resiliency package is deprecated and has been removed from the wheel (#282, #285). PyTorch Lightning users should pin v0.5.x or migrate to the underlying fault_tolerance / checkpointing / attribution.straggler APIs directly.
OneLogger integration has been removed (#257).
In-process restart is deprecated in v0.6 and will be removed in a future release. It is no longer the focus of the fast-restart solution. Migration: use in-job restart via ft_launcher (see the fault_tolerance usage guide). The existing nvidia_resiliency_ext.inprocess APIs continue to work in v0.6 but are no longer maintained for new feature development; bug fixes will be considered on a case-by-case basis.
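
For the --ft-restart-policy migration described above, launch scripts that build their ft_launcher command line programmatically could drop the flag with a small filter. The sketch below is only illustrative: strip_restart_policy is a hypothetical helper, and it assumes the flag appears either as --ft-restart-policy=VALUE or as --ft-restart-policy VALUE.

```python
from typing import List

DEPRECATED_FLAG = "--ft-restart-policy"


def strip_restart_policy(argv: List[str]) -> List[str]:
    """Return argv without the deprecated flag and its value, if any."""
    cleaned: List[str] = []
    skip_next = False
    for index, arg in enumerate(argv):
        if skip_next:
            # This argument is the value of the space-separated form.
            skip_next = False
            continue
        if arg == DEPRECATED_FLAG:
            # Space-separated form: also drop the following value,
            # unless the next token is itself another option.
            has_value = index + 1 < len(argv) and not argv[index + 1].startswith("-")
            skip_next = has_value
            continue
        if arg.startswith(DEPRECATED_FLAG + "="):
            # '=' form: flag and value are one token.
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    command = ["ft_launcher", "--nnodes=2:4", "--ft-restart-policy=any-failed", "train.py"]
    print(strip_restart_policy(command))
```

Since any-failed is both the only remaining value and the default, removing the flag should not change launcher behavior in v0.6.
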

Installation

# Core resiliency (fault tolerance, checkpointing, health checks)
pip install nvidia-resiliency-ext

# With attribution stack (Attribution Service, LogSage, MCP, Slack notifications)
pip install 'nvidia-resiliency-ext[attribution]'

# With nvdataflow integration
pip install 'nvidia-resiliency-ext[dataflow]'

Known Issues & Limitations

Users on Ubuntu 22.04, or on any system with glibc older than 2.39, are advised to build from source: PyPI wheels target manylinux_2_39 and default to CUDA 13.
Python: wheels are published for 3.10, 3.11, and 3.12.
Attribution, Flight Recorder analysis, LogSage, and MCP integration are experimental. APIs, CLI flags, and service contracts may change in subsequent releases.
External InJob control-plane is experimental and not yet QA-validated; APIs may change in subsequent releases.
CPU shared-memory D2H path in FileSystemWriterAsync is experimental (#298); enable only in non-production runs while we collect soak feedback.
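
To check ahead of time whether a host falls under the glibc and Python constraints above, a quick pre-install check along these lines may help. It is a minimal sketch using only the standard library; wheel_compatible and parse_version are hypothetical helpers, and the check does not cover the CUDA version.

```python
import platform
import sys
from typing import Sequence, Tuple

MIN_GLIBC = (2, 39)  # manylinux_2_39 wheels need glibc 2.39 or newer
SUPPORTED_PYTHONS = {(3, 10), (3, 11), (3, 12)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.35' into (2, 35); raises ValueError on unparsable input."""
    return tuple(int(part) for part in text.split(".")[:2])


def wheel_compatible(
    libc_name: str, libc_version: str, python_version: Sequence[int]
) -> bool:
    """Return True if the published wheel is expected to install on this host."""
    if libc_name != "glibc":
        return False
    try:
        glibc = parse_version(libc_version)
    except ValueError:
        return False
    return glibc >= MIN_GLIBC and tuple(python_version[:2]) in SUPPORTED_PYTHONS


if __name__ == "__main__":
    name, version = platform.libc_ver()
    if wheel_compatible(name, version, sys.version_info):
        print("PyPI wheel should be installable on this host")
    else:
        print(f"Build from source (libc={name or 'unknown'} {version}, "
              f"python={sys.version_info.major}.{sys.version_info.minor})")
```
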
