NVIDIA Resiliency Extension v0.6.0
Highlights
In-job restart
- Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
- Rendezvous protocol hardening — round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
- Robust startup and shutdown: wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix `terminate_mp_processes` to cover failed workers (#270).
- Hot-spare node support: closes the v0.5 spare-node gap. Hot-spare is always-on and works with `--max-restarts` (#226, #250, #266):
  - Simple mode (default, `--ft-segment=None`) for H100 / non-NVSwitch systems: the first `min_nodes` from `--nnodes=min:max` become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
  - Segment-aware mode (`--ft-segment=N`) for NVSwitch systems (DGX H200, HGX B200): uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires `min_nodes % segment == 0`.
  - Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
- Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
- External InJob control-plane (experimental): embed `ft_launcher` orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
- Section-timeout fixes: out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
- `--max-restarts` now reflects job-level restart attempts (#211); `ft_launcher` runs with sensible defaults out of the box (#205, #271).
- NUMA binding support in `ft_launcher` for optimized memory affinity (#209).
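The simple-mode hot-spare split and the segment-aware divisibility rule from the list above can be sketched as follows. This is an illustrative sketch only: `parse_nnodes`, `check_segment`, and `split_active_standby` are hypothetical helper names, not part of the `ft_launcher` API, and treating "first" as arrival order is an assumption made for the example.

```python
from typing import List, Optional, Tuple


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse a 'min:max' node-count spec; a bare 'N' is treated as min == max."""
    lo, _, hi = spec.partition(":")
    min_nodes = int(lo)
    max_nodes = int(hi) if hi else min_nodes
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError(f"invalid nnodes spec: {spec!r}")
    return min_nodes, max_nodes


def check_segment(min_nodes: int, segment: Optional[int]) -> None:
    """Enforce min_nodes % segment == 0 when segment-aware mode is requested."""
    if segment is None:  # simple mode: nothing to validate
        return
    if segment <= 0 or min_nodes % segment != 0:
        raise ValueError(
            f"min_nodes={min_nodes} must be a positive multiple of segment={segment}"
        )


def split_active_standby(
    arrived: List[str], min_nodes: int
) -> Tuple[List[str], List[str]]:
    """Simple mode: the first min_nodes nodes are active, the rest are standbys.

    Assumption: 'arrived' is already ordered the way the launcher orders nodes.
    """
    if len(arrived) < min_nodes:
        raise ValueError("not enough nodes to form the active set")
    return arrived[:min_nodes], arrived[min_nodes:]


min_nodes, max_nodes = parse_nnodes("4:6")
check_segment(min_nodes, segment=2)  # 4 % 2 == 0, so this passes
active, standby = split_active_standby(
    ["n0", "n1", "n2", "n3", "n4", "n5"], min_nodes
)
```

With `--nnodes=4:6` and six nodes present, the sketch yields four active nodes and two standbys; a segment size of 3 would be rejected because 4 is not a multiple of 3.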
Health checks
Checkpointing
- CPU shared-memory D2H path (experimental) in `FileSystemWriterAsync` removes a redundant H2H copy and resolves the prior shm D2H race (#298).
- PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via `ConsistentDataIdentifier`, and class-level metadata cache in `CachedMetadataFileSystemReader` (#273, #274, #275).
- Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in `TemporalAsyncCaller` (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).
Fault attribution — productized as standalone services (experimental)
The attribution module — including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration — remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.
- NVRx Attribution Service (`attrsvc`) and NVRx Slurm Monitor Service (`smonsvc`) introduced as FastAPI-based standalone services (#242, #248).
- `ft_launcher`-managed `attrsvc` for co-located deployment (#318); UDS endpoints for `attrsvc` / `smonsvc` (#315).
- Attribution is now an optional package: install with `pip install nvidia-resiliency-ext[attribution]` (#305). Attribution internals refactored under a `svc` subpackage with a clear controller/runner boundary (#295, #313, #316).
- PyTorch Flight Recorder (experimental) support in `attrsvc` (#283); FR ordering switched to window-based instead of PG description (#210, #216, #219).
- LogSage (experimental) integrated as an attribution module: direct in-process API, configurable LLM model via env, LogSage v0.1.7 (#224, #249, #267, #289, #297, #308).
- Triggers and outputs: last-cycle attribution trigger (#247), job-completion handling (#251), Slack bot notifications (#253).
- MCP (Model Context Protocol) integration (experimental) exposes the attribution module as a tool for the NVIDIA resiliency agent (#215); `nvrx-attr` Claude skill bundle for the attribution workflow (#312).
Log aggregation & observability
- Two-level gRPC log aggregation (N leaves + root) with auto-tier selection and end-to-end tests (#280, #307, #309).
- Writer-thread + persistent-reader refactor (#254); pipe-based per-cycle log capture and split-log support (#225, #240).
- `NVRxCycleInfo` protobuf exposes per-cycle metadata over the new gRPC interface (#258, #292).
- GPU memory logger (#206).
Build, packaging & security
- `poetry-dynamic-versioning`: wheel versions are derived from git tags (`v0.6.0-rc1`, `v0.6.0`, etc.) (#260).
- New optional extras: `[attribution]` for the attribution service stack (LogSage, MCP, FastAPI, Slack), `[dataflow]` for `nvdataflow` integration.
- Wheel security-scan cleanups and Bandit findings (`ionice` subprocess, FR module) (#219, #294, #306).
- Torch 2.10 and `langchain` lock updates for CVE compliance.
Deprecations & Removals
- Legacy dynamic rendezvous (v1) is deprecated; barrier-based rendezvous (v2) is the default. v1 is planned for removal in a future release (#282).
- `--ft-restart-policy` is deprecated (#259). The `min-healthy` value has been removed; only `any-failed` remains (which is also the default), so the flag is effectively a no-op in v0.6. Migration: remove the flag from your launch scripts. No replacement is required. The flag itself will be removed in a future release.
- `ptl_resiliency` package is deprecated and removed from the wheel (#282, #285). PyTorch Lightning users should pin v0.5.x or migrate to the underlying `fault_tolerance` / `checkpointing` / `attribution.straggler` APIs directly.
- OneLogger integration removed (#257).
- In-process restart is deprecated in v0.6 and will be removed in a future release. It is no longer the focus of the fast-restart solution. Migration: use in-job restart via `ft_launcher` (see the `fault_tolerance` usage guide). The existing `nvidia_resiliency_ext.inprocess` APIs continue to work in v0.6 but are no longer maintained for new feature development; bug fixes will be considered on a case-by-case basis.
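For launch wrappers that build the `ft_launcher` command line programmatically, the `--ft-restart-policy` migration above amounts to dropping the flag and its value. A minimal sketch, assuming the flag appears either as `--ft-restart-policy=VALUE` or as `--ft-restart-policy VALUE`; `strip_restart_policy` is a hypothetical helper, not something shipped with the package:

```python
from typing import List

DEPRECATED_FLAG = "--ft-restart-policy"


def strip_restart_policy(argv: List[str]) -> List[str]:
    """Return a copy of argv without the deprecated flag and its value."""
    cleaned: List[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith(DEPRECATED_FLAG + "="):
            # '--ft-restart-policy=any-failed' form: drop this one token.
            i += 1
        elif arg == DEPRECATED_FLAG:
            # '--ft-restart-policy any-failed' form: drop the flag, and the
            # next token too if it looks like a value rather than another flag.
            i += 1
            if i < len(argv) and not argv[i].startswith("-"):
                i += 1
        else:
            cleaned.append(arg)
            i += 1
    return cleaned


old_cmd = ["ft_launcher", "--ft-restart-policy", "any-failed",
           "--max-restarts", "3", "train.py"]
new_cmd = strip_restart_policy(old_cmd)
```

Because `any-failed` is the only remaining value and also the default, removing the flag should not change launcher behavior in v0.6.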
Installation
# Core resiliency (fault tolerance, checkpointing, health checks)
pip install nvidia-resiliency-ext
# With attribution stack (Attribution Service, LogSage, MCP, Slack notifications)
pip install nvidia-resiliency-ext[attribution]
# With nvdataflow integration
pip install nvidia-resiliency-ext[dataflow]
Known Issues & Limitations
- Ubuntu 22.04 / glibc < 2.39 users are advised to build from source: PyPI wheels target `manylinux_2_39` and default to CUDA 13.
- Python: wheels are published for 3.10, 3.11, 3.12.
- Attribution, Flight Recorder analysis, LogSage, and MCP integration are experimental. APIs, CLI flags, and service contracts may change in subsequent releases.
- External InJob control-plane is experimental and not yet QA-validated; APIs may change in subsequent releases.
- CPU shared-memory D2H path in `FileSystemWriterAsync` is experimental (#298); enable only in non-production runs while we collect soak feedback.
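The glibc and Python constraints listed above can be checked before running `pip install`. The following is a rough sketch using only the standard library; `wheel_compatible` and `parse_version` are hypothetical helper names, and the check deliberately ignores the CUDA version, which would need a separate probe.

```python
import platform
import sys
from typing import Tuple

# Values taken from the limitations listed above.
SUPPORTED_PYTHONS = {(3, 10), (3, 11), (3, 12)}
MIN_GLIBC = (2, 39)  # manylinux_2_39 requires glibc >= 2.39


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.35' into (2, 35); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def wheel_compatible(python: Tuple[int, int], libc: str, libc_version: str) -> bool:
    """Return True if a prebuilt wheel is expected to install on this system."""
    if python not in SUPPORTED_PYTHONS:
        return False
    if libc != "glibc":
        return False
    return parse_version(libc_version)[:2] >= MIN_GLIBC


if __name__ == "__main__":
    libc, version = platform.libc_ver()  # e.g. ('glibc', '2.35') on Ubuntu 22.04
    if wheel_compatible(tuple(sys.version_info[:2]), libc, version):
        print("Prebuilt wheel should be installable.")
    else:
        print("Consider building from source.")
```

On a system where `platform.libc_ver()` cannot detect the C library it returns empty strings, and the sketch conservatively reports that a source build may be needed.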