Port remaining 2.7.2 changes to main #4376

Closed

chesterxgchen wants to merge 42 commits into NVIDIA:main from chesterxgchen:port-2.7.2-main

Conversation

@chesterxgchen
Collaborator

Summary

Cherry-picks 35 net-new commits from the 2.7.2 release into main. The following 50 PRs were requested; 15 were already present in main (via earlier batch ports or equivalent changes) and are excluded.

Included (35 PRs — net new to main)

Security fixes:

Critical streaming / memory / deadlock fixes:

Other bug fixes / enhancements:

Docs / release notes:

Skipped (already in main via other PRs)

| 2.7 PR | Already in main |
|---|---|
| #3999 | Web-only change; #3965, #3989 already in main |
| #4076 | Empty (already present) |
| #4103 | Empty (already present) |
| #4119 | Empty (already present) |
| #4122 | Empty (already present) |
| #4172 | In main via #4328 |
| #4173 | In main via #4194 |
| #4250 | In main via #4327 |
| #4255 | Already present (empty diff) |
| #4259 | In main via #4327 |
| #4297 | In main via #4327 |
| #4306 | In main via #4329 |
| #4315 | In main via #4329 |
| #4320 | In main via #4346 |
| #4325 | In main via #4346 |

Note: #4319 was already created as PR #4375.

🤖 Generated with Claude Code

@chesterxgchen
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented Mar 29, 2026

Greptile Summary

This PR cherry-picks 35 net-new commits from the 2.7.2 release branch into main, covering security fixes (RCE, FilePipe TOCTOU, startup kit signing), critical streaming/memory/deadlock fixes, and miscellaneous bug fixes and documentation updates. Several previously-identified issues (PASS_THROUGH channel mismatch, missing ROUND_STARTED in FedAvg, ccwf_job.py TypeError, xgboost.core.DataSplitMode whitelist removal) have been addressed in follow-up commits.

Key changes:

Confidence Score: 4/5

Safe to merge after resolving the double ROUND_STARTED firing in FedAvg — all other previously flagged P1s are fixed.

One remaining P1 defect: ROUND_STARTED is now fired twice per FedAvg round — once explicitly in fedavg.run() and again inside base_model_controller.send_model() (added in this PR). All prior P1 issues were addressed in follow-up commits. The double-firing is a real behavioral regression that needs one of the two call sites removed before merge.

nvflare/app_common/workflows/fedavg.py (line 160) and nvflare/app_common/workflows/base_model_controller.py (line 157) — coordinate which layer owns the ROUND_STARTED event
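The double-fire can be demonstrated with a minimal sketch (illustrative names only, not the actual NVFlare event API): when both the workflow loop and `send_model()` fire the event, handlers observe `ROUND_STARTED` twice per round, and removing either call site restores the once-per-round contract.

```python
class EventRecorder:
    """Collects fired events so the per-round count can be inspected."""
    def __init__(self):
        self.events = []

    def fire(self, name):
        self.events.append(name)

def run_round(recorder, caller_fires_event=True, send_model_fires_event=True):
    """Sketch of one FedAvg round with both candidate ROUND_STARTED call sites."""
    if caller_fires_event:
        recorder.fire("ROUND_STARTED")    # explicit fire in fedavg.run()
    if send_model_fires_event:
        recorder.fire("ROUND_STARTED")    # implicit fire inside send_model()
    recorder.fire("BEFORE_AGGREGATION")
```

With both flags enabled (the state introduced by this PR) the event fires twice; disabling either one yields the expected single fire.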

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/app_common/workflows/fedavg.py | Refactors FedAvg run loop (removes tensor disk offload, restores ROUND_STARTED and CURRENT_ROUND propagation); double ROUND_STARTED firing introduced — sent once explicitly and again inside send_model() |
| nvflare/app_common/workflows/base_model_controller.py | Adds ROUND_STARTED event to send_model() and changes CURRENT_ROUND sticky flag to False in result callback; ROUND_STARTED addition causes double-fire when callers also fire it explicitly |
| nvflare/app_common/executors/task_exchanger.py | Removes pipe.clear() and isinstance guard from BEFORE_TASK_EXECUTION handler to fix deadlock when executing is in progress; safe change |
| nvflare/fuel/f3/streaming/blob_streamer.py | Simplifies blob_cb exception handling and softens buffer-overrun from exception to warning with partial copy; both changes intentional per 2.7 design |
| nvflare/fuel/utils/fobs/decomposers/via_downloader.py | Removes make_lazy_ref fast path in recompose(); PASS_THROUGH mode now uses _LazyBatchInfo sentinel instead |

Sequence Diagram

``` mermaid
sequenceDiagram
    participant FA as FedAvg.run()
    participant BMC as BaseModelController.send_model()
    participant EH as Event Handlers

    loop Each FL Round
        FA->>EH: fire ROUND_STARTED (1st explicit, line 160)
        FA->>BMC: send_model(targets, data, callback)
        BMC->>EH: fire ROUND_STARTED (2nd implicit, line 157) ⚠️
        BMC->>BMC: broadcast task to clients
        FA->>FA: wait for standing tasks
        FA->>EH: fire BEFORE_AGGREGATION
        FA->>FA: _get_aggregated_result()
        FA->>FA: update_model() / save_model()
    end
```

Reviews (4): Last reviewed commit: "Restore xgboost.core.DataSplitMode to fo..."

Comment thread nvflare/fuel/f3/streaming/blob_streamer.py
Comment thread nvflare/app_common/executors/client_api_launcher_executor.py
Comment thread nvflare/app_common/workflows/fedavg.py
Comment thread nvflare/app_common/ccwf/swarm_client_ctl.py
@chesterxgchen
Collaborator Author

/build

Comment thread nvflare/fuel/utils/fobs/builtin_decomposers.py
holgerroth and others added 12 commits March 28, 2026 21:11
Fixes # .

Fix broken Job CLI tutorial.
Adds missing model.py for Job Recipe tutorial.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

- Removed all TensorFlow dependencies from the CIFAR-10 example by
replacing the manual protobuf parsing with tbparse
- Consolidated requirements by moving plotting dependencies (matplotlib,
seaborn, pandas, tbparse) into the main requirements.txt and removing
the separate plot-requirements.txt

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
)

Based on the original work by @GeorgeWang-nv in PR NVIDIA#4166. Credit to
George for the retry mechanism design.
This PR just fixes some comments

Fixes swarming LLM streaming loss.

Add retry mechanism for streaming download on TIMEOUT.

**Changes from the original PR plus additional improvements:**

- **Retry with exponential backoff**: On `TIMEOUT`, retry up to
`max_retries` (default 3) times with exponential backoff (2s, 4s, 8s,
... capped at 60s) instead of fixed 2s sleep
- **Structured logging**: `[DOWNLOAD_RETRY]`, `[DOWNLOAD_RECOVERED]`,
`[DOWNLOAD_FAILED]` log tags with backoff duration for production
observability
- **State-safe retry**: Resend the same `current_state` on retry so the
producer re-generates the same chunk (data-safe by design)
- **Consecutive timeout counter reset**: Counter resets after any
successful response, allowing fresh retries on later chunks
- **Abort signal handling**: Abort checked before and after backoff
sleep to minimize delay on cancellation
- **Input validation**: `max_retries < 0` raises `ValueError`
- **Fresh request per iteration**: Build a new `Message` each loop
iteration to avoid re-encoding stale payloads
- **Comprehensive test suite**: 17 unit tests covering happy path,
retry/recovery, exhaustion, counter reset, state integrity, abort,
consumer errors, and input validation
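The retry behavior described above can be sketched as follows. This is an illustrative simplification, not the actual NVFlare download API: `fetch_chunk` is a hypothetical callable standing in for the producer request, and the backoff schedule (2s, 4s, 8s, ... capped) matches the description in the bullet list.

```python
import time

def download_with_retry(fetch_chunk, state, max_retries=3, base_backoff=2.0,
                        max_backoff=60.0, is_aborted=lambda: False):
    """Retry a chunk download on TIMEOUT with exponential backoff.

    fetch_chunk(state) -> (status, data); status is "TIMEOUT" or "OK".
    The SAME state is resent on retry so the producer regenerates the
    same chunk (data-safe by design).
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    consecutive_timeouts = 0
    while True:
        status, data = fetch_chunk(state)
        if status != "TIMEOUT":
            consecutive_timeouts = 0          # reset after any success
            return data
        consecutive_timeouts += 1
        if consecutive_timeouts > max_retries:
            raise TimeoutError(f"[DOWNLOAD_FAILED] gave up after {max_retries} retries")
        # exponential backoff: base, 2*base, 4*base, ... capped at max_backoff
        backoff = min(base_backoff * (2 ** (consecutive_timeouts - 1)), max_backoff)
        if is_aborted():                      # check abort before sleeping
            raise RuntimeError("download aborted")
        time.sleep(backoff)
        if is_aborted():                      # and again after, to minimize delay
            raise RuntimeError("download aborted")
```

In the real implementation the consecutive-timeout counter also resets between chunks, so later chunks get a fresh retry budget.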

- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: chiyuw <chiyuw@nvidia.com>
Properly make hello-numpy-cross-val use branching to match other
examples.

Fixes # .

Properly make hello-numpy-cross-val use branching to match other
examples and ensure that there will no longer be any errors due to
numpy_key.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Take changes from NVIDIA#4169 and
issues found by @GeorgeWang-nv

Fix streaming pool starvation

Fixes a deadlock when many blob receive callbacks run at once (e.g., a large
model with many download chunks). Previously, handle_blob_cb ran on the
stream pool and called blob_cb(future, ...) on the same thread; blob_cb
blocks on future.result() while _read_stream (which sets the future)
also runs on the same pool, so the pool could be exhausted and never run
_read_stream.

``` mermaid
sequenceDiagram
    autonumber
    participant DL as "Requester: download_object()"
    participant CS as "Requester Cell._send_one_request()"
    participant TX as "ByteStreamer.send()/TxTask"
    participant NET as "SFM/Network"
    participant CR as "Receiver CoreCell._process_request()"
    participant BR as "ByteReceiver._data_handler()"
    participant W1 as "stream_thread_pool Worker A"
    participant BH as "BlobHandler.handle_blob_cb()"
    participant W2 as "stream_thread_pool Worker B"
    participant AD as "Adapter.call()"
    participant HD as "DownloadService._handle_download()"

    DL->>CS: cell.send_request(...)
    CS->>TX: send_blob(request)
    TX->>NET: stream DATA frames
    NET->>CR: process_message() -> _process_request()
    CR->>BR: callback for STREAM_DATA_TOPIC
    BR->>W1: submit(_callback_wrapper)

    W1->>BH: _callback_wrapper -> handle_blob_cb(future, stream)
    BH->>W2: submit(_read_stream, blob_task)
    BH->>AD: blob_cb(future, ...)  (this is Adapter.call)
    AD->>AD: future.result()  (blocks)

    rect rgb(255, 235, 235)
    Note over W1,W2: Starvation risk: Worker A waits for future.result(), while _read_stream is queued on same pool.
    end

    W2->>W2: _read_stream() -> stream.read() loop
    W2->>AD: future.set_result(...)
    AD->>HD: self.cb(request) -> _handle_download() -> produce()
    AD->>TX: send_blob(reply)
    TX->>NET: stream reply frames
    NET->>CS: receiver side _process_reply() sets in_receiving
    CS->>DL: send_request() returns reply
```

- Separate callback pool: Run blob_cb on a dedicated
callback_thread_pool instead of on the stream pool. The stream pool only
schedules _read_stream and the callback; it no longer blocks on
future.result(), so it can always run _read_stream.
- Blocking -> non-blocking: handle_blob_cb now submits blob_cb to the
callback pool and returns immediately. Comments document this
intentional behavior change.
- Exception handling: Added _run_blob_cb so exceptions from blob_cb are
still logged and stream.task.stop(StreamError(...)) is called when
stream has a task (e.g. RxStream), matching the previous behavior of
ByteReceiver._callback_wrapper.

``` mermaid
sequenceDiagram
    autonumber
    participant DL as "Requester download_object"
    participant CS as "Requester Cell _send_one_request"
    participant TX as "ByteStreamer send TxTask"
    participant NET as "SFM Network"
    participant CR as "Receiver CoreCell _process_request"
    participant BR as "ByteReceiver _data_handler"
    participant W1 as "stream_thread_pool Worker A"
    participant BH as "BlobHandler handle_blob_cb"
    participant W2 as "stream_thread_pool Worker B"
    participant CB as "callback_thread_pool Worker"
    participant AD as "Adapter call"
    participant HD as "DownloadService _handle_download"

    DL->>CS: cell.send_request
    CS->>TX: send_blob request
    TX->>NET: stream DATA frames
    NET->>CR: process_message to _process_request
    CR->>BR: callback for STREAM_DATA_TOPIC
    BR->>W1: submit _callback_wrapper

    W1->>BH: _callback_wrapper to handle_blob_cb
    BH->>W2: submit _read_stream blob_task
    BH->>CB: submit _run_blob_cb
    Note over W1: W1 returns immediately

    W2->>W2: _read_stream stream.read loop
    W2->>W2: future set_result
    Note over W2,CB: future completes and unblocks CB
    CB->>AD: blob_cb Adapter.call
    AD->>AD: future result blocks on callback pool only
    AD->>HD: self.cb request _handle_download produce
    AD->>TX: send_blob reply
    TX->>NET: stream reply frames
    NET->>CS: _process_reply sets in_receiving
    CS->>DL: send_request returns reply
```

- nvflare/fuel/f3/streaming/stream_utils.py: Added
CALLBACK_THREAD_POOL_SIZE and callback_thread_pool (CheckedExecutor).
Shut down callback_thread_pool in stream_shutdown() before
stream_thread_pool.
- nvflare/fuel/f3/streaming/blob_streamer.py: In handle_blob_cb, submit
blob_cb via callback_thread_pool.submit(self._run_blob_cb, ...) instead
of calling it directly. Added _run_blob_cb to run blob_cb in a
try/except, log exceptions, and call stream.task.stop(...) when
applicable.
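The two-pool pattern above can be sketched in miniature (a simplified illustration with hypothetical function names, not the actual `blob_streamer.py` code): stream work stays on the stream pool, while the potentially blocking callback is submitted to a separate pool, so waiting on `future.result()` can never starve the workers that must complete the future.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Two dedicated pools, mirroring stream_thread_pool / callback_thread_pool
stream_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="stream")
callback_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="callback")

def read_stream(future: Future, chunks):
    # Runs on the stream pool; completes the future once all chunks are read.
    future.set_result(b"".join(chunks))

def run_blob_cb(future: Future, blob_cb):
    # Runs on the callback pool; blocking on future.result() here cannot
    # starve the stream pool that must run read_stream.
    try:
        blob_cb(future.result())
    except Exception as e:
        print(f"blob_cb failed: {e}")     # real code also stops the stream task

def handle_blob_cb(chunks, blob_cb):
    fut = Future()
    stream_pool.submit(read_stream, fut, chunks)     # stream work: stream pool
    callback_pool.submit(run_blob_cb, fut, blob_cb)  # blocking wait: callback pool
    # Returns immediately — no stream worker is held while waiting on fut.
```

Shutting down `callback_pool` before `stream_pool` (as the PR does in `stream_shutdown()`) ensures pending callbacks can still see their futures completed.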

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Do not hold the lock around produce_item. It is not needed, and this operation can be slow; we do not want to hold up everything else during this time.
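A minimal sketch of the pattern (illustrative, not the actual NVFlare code): the slow `produce_item` call runs outside the lock, and the result is written back with a compare-and-store so a concurrent producer's result is not clobbered.

```python
import threading

class ItemCache:
    """Build the item OUTSIDE the lock; compare-and-store it under the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_item(self, key, produce_item):
        with self._lock:                    # fast path: short critical section
            item = self._cache.get(key)
        if item is not None:
            return item
        item = produce_item(key)            # slow work: NO lock held here
        with self._lock:                    # compare-and-store: first writer wins
            existing = self._cache.get(key)
            if existing is not None:
                return existing
            self._cache[key] = item
            return item
```

The trade-off is that two threads may both produce the item for the same key; the duplicate is simply discarded, which is usually far cheaper than serializing all callers behind one slow production.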

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fix numpy cross val sticky property.

Fix numpy cross val sticky property so there is not a warning message.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
)

Fix recipes and job api model/initial_model confusion
Changes picked from NVIDIA#4177

### Description

Fix recipes and job api model/initial_model confusion

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Update to match fix in main in NVIDIA#4201 

### Description

Updates nvflare/app_common/workflows/base_model_controller.py to match
NVIDIA#4201 after further changes from greptile comments.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…ils (NVIDIA#4206)

Many thanks to @GeorgeWang-nv for finding the issues, performing the root-cause
analysis, and suggesting possible fixes.

- Large byte streams can freeze when a blocking socket send holds the
per-connection SFM send lock.
- While that lock is held, unrelated traffic on the same connection
(including progress and coordination frames) is serialized behind it.
- This creates **Head-of-Line (HOL) blocking** at connection level and
can manifest as long apparent stalls.

- Add a bounded send timeout in the socket send path so one blocked send
cannot hold the connection lock indefinitely.
- Why this way: minimally invasive, directly targets lock-hold source,
and fails deterministically instead of hanging.

- Track ACK forward progress and fail fast when ACK offset does not
advance for a configurable interval.
- Why this way: protects sender flow control from waiting too long on
non-progressing streams, independent of generic ACK wait.

- Monitor send stall duration in heartbeat monitor; optionally close the
stalled connection to recover.
- Add a consecutive-check threshold before close to reduce false
positives from transient jitter.
- Why this way: staged recovery lever (warn-only first, then auto-close)
balances reliability and operational safety.

- `streaming_send_timeout`
- `streaming_ack_progress_timeout`
- `streaming_ack_progress_check_interval`
- `sfm_send_stall_timeout`
- `sfm_close_stalled_connection`
- `sfm_send_stall_consecutive_checks`
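The ACK-progress watchdog described above can be sketched as a small state machine (names are illustrative, not the actual `byte_streamer` API): any forward movement of the ACK offset resets the clock, and a check fails fast once no progress has been seen for the configured interval.

```python
import time

class AckProgressWatchdog:
    """Fail fast when the ACK offset stops advancing for too long."""

    def __init__(self, progress_timeout: float):
        self.progress_timeout = progress_timeout
        self.last_offset = -1
        self.last_progress_time = time.monotonic()

    def on_ack(self, offset: int):
        if offset > self.last_offset:          # forward progress: reset the clock
            self.last_offset = offset
            self.last_progress_time = time.monotonic()

    def check(self):
        stalled = time.monotonic() - self.last_progress_time
        if stalled > self.progress_timeout:
            raise TimeoutError(f"no ACK progress for {stalled:.1f}s")
```

This is independent of the generic ACK wait: the generic wait bounds total time, while this watchdog bounds time *between* increments, catching streams that trickle ACKs but never truly progress.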

- Added focused unit tests for each mitigation path with both positive
and negative cases.

- Config application, partial-write success, non-writable timeout,
closed-send path, and closing-state suppression.

- ACK progress timestamp update, progress-allowed completion,
no-progress timeout, legacy ack_wait behavior, and fast-fail precedence
when progress timeout is small.

- Warn-only behavior, close-enabled recovery, close-disabled
non-recovery, heartbeat behavior when healthy, send-state reset on
success/exception, warning emission, no-warning in healthy flow, and
intermittent-stall false-alarm suppression via consecutive guard.

- `pytest` on the three new F3 test modules: **24 passed**.

- Added a new section in `docs/user_guide/timeout_troubleshooting.rst`:
  - Runtime defaults for new stall-related parameters.
- Recommended rollout: warn-only first, then auto-close only if repeated
stall warnings persist.
- Suggested values from discussion: `sfm_send_stall_timeout=75`,
`sfm_send_stall_consecutive_checks=3`.
- Log interpretation guidance and false-alarm suppression expectations.

- This change prefers deterministic failure/recovery over silent long
hangs.
- Potential side effect when auto-close is enabled: transient connection
resets under severe network instability; mitigated by conservative
timeout plus consecutive-check guard.
- Defaults remain conservative (`sfm_close_stalled_connection=false`) so
operators can observe first and enable recovery when needed.
- Scope is intentionally focused on the critical Head-of-Line (HOL)
mitigation path and operational guardrails.

- [x] `python3 -m pytest
tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py
tests/unit_test/fuel/f3/streaming/byte_streamer_ack_watchdog_test.py
tests/unit_test/fuel/f3/sfm/sfm_stall_monitor_test.py`
- [x] `python3 -m black <changed_python_files>`
- [x] `python3 -m isort <changed_python_files>`
- [x] `python3 -m flake8 <changed_python_files>`

- Close socket connections on `send_frame` timeout to prevent
frame-boundary desync after partial writes.
- Preserve specific send error codes in `send_frame` for common socket
exceptions:
- map timeout-like exceptions to `CommError.TIMEOUT` (and close
connection),
  - map closed-socket exceptions to `CommError.CLOSED`,
  - keep `CommError.ERROR` as fallback for unknown exceptions.
- Add positive and negative unit tests in `socket_conn_timeout_test.py`:
  - Positive: timeout after partial write closes the connection.
- Positive: `TimeoutError` -> `TIMEOUT`; `BrokenPipeError` -> `CLOSED`.
- Negative: successful partial-write sends do not close; non-timeout
`CommError.CLOSED` does not force close; unknown exceptions remain
`ERROR`.
- Document stall timing relationship in `timeout_troubleshooting.rst`,
including the practical close window formula and outer-timeout sizing
guideline.

- `python3 -m pytest
tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py -q`
- Result: `10 passed`
…ent, hierarchical startup stability [skip ci] (NVIDIA#4218)

> ⚠️ **Depends on NVIDIA#4209** — The *Hierarchical FL Startup Stability*
section documents changes introduced by PR NVIDIA#4209 (currently open).
**Merge PR NVIDIA#4209 into `2.7` before merging this PR.** All other sections
cover already-merged PRs.

---

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor
Copy at CJ (PR NVIDIA#4210) and client-side memory lifecycle management (PR
- **F3 Streaming Reliability and Performance** — new section covering
HOL stall mitigation (PR NVIDIA#4206), stream pool starvation fix (PR
fix (PR NVIDIA#4204), and lock contention reduction (PR NVIDIA#4174)
- **Hierarchical FL Startup Stability** — new section covering
deployment timeout classification, startup grace period, selective
client exclusion, and metadata hardening (PR NVIDIA#4209 — pending merge),
with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

| PR | Area | Status |
|---|---|---|
| NVIDIA#4171 / NVIDIA#4172 | Stream pool starvation fix | Merged |
| NVIDIA#4174 | Lock contention reduction | Merged |
| NVIDIA#4167 | Streaming download retry | Merged |
| NVIDIA#4204 | RxTask self-deadlock fix | Merged |
| NVIDIA#4206 | HOL stall mitigation | Merged |
| NVIDIA#4210 | Zero tensor copy at CJ | Merged |
| NVIDIA#4211 | Client-side memory management | Merged |
| NVIDIA#4209 | Hierarchical FL startup stability | **Open — merge before this
PR** |

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds
`LazyDownloadRef` placeholders instead of materializing full tensors,
eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` /
jemalloc purge / `torch.cuda.empty_cache()` injected after every
`flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.
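The client-side cleanup described above can be sketched as a single helper (a simplification under stated assumptions: glibc `malloc_trim` is only available on Linux, and the `torch` step is skipped when PyTorch is absent — this is not the actual NVFlare hook):

```python
import ctypes
import gc

def release_memory_after_send():
    """Best-effort memory release, as injected after every flare.send()."""
    gc.collect()                                   # drop unreachable Python objects
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)    # return freed heap pages to the OS
    except OSError:
        pass                                       # non-glibc platform: nothing to trim
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()               # release cached CUDA allocator blocks
    except ImportError:
        pass                                       # torch not installed
    return True
```

In the real feature the frequency is configurable via `client_memory_gc_rounds`, since trimming every round trades a small latency cost for a flatter resident-memory profile.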

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress
watchdog, and stall detection/recovery. Includes recommended environment
variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a
dedicated `callback_thread_pool`, keeping stream workers free for
concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3
attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock`
released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside
`self.lock`; compare-and-store for cache write. Reduces model-download
latency under high client concurrency.

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client
must be observed once before absence triggers dead-job notification.
Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.
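The startup grace period bullet can be made concrete with a small sketch (illustrative names, not the actual server-side detector): a client's absence only counts as "dead" after that client has been observed alive at least once, which debounces slow starters on HPC filesystems.

```python
class DeadClientDetector:
    """Dead-client detection with a startup grace period."""

    def __init__(self):
        self._seen = set()

    def on_heartbeat(self, client: str):
        self._seen.add(client)                 # client observed alive at least once

    def is_dead(self, client: str, absent: bool) -> bool:
        # Absence only triggers a dead-job notification for clients that
        # previously checked in; never-seen clients are still starting up.
        return absent and client in self._seen
```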

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR NVIDIA#4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e enhancements (NVIDIA#4225)

This PR addresses several related bugs and improvements in the Client
API subprocess path, swarm learning, and the recipe interface.
Address FLARE-2743 (bug/5921094) and (bug/5921708)

---

When pass-through tensor streaming is enabled on the CJ, param
converters (e.g. `NumpyToPTParamsConverter`) were running on the CJ
*before* lazy references were materialized. This caused `"Could not
infer dtype of LazyDownloadRef"` warnings and potential data corruption
on large-model jobs using pass-through transfer.

The `ExProcessClientAPI` (subprocess path) created a
`FlareAgentWithFLModel` with no converters attached. Format conversion
was happening on the CJ side in `LauncherExecutor.execute()`, which
triggered issue #1.

`SwarmClientConfig.learn_task_check_interval` was not forwarded to the
swarm controller, so any customization of that parameter was silently
ignored.

**Deploy map `{"targets": [...]}` dict form**
The deploy map validation only handled list-form and `{"sites": [...]}`
dict-form entries. The `{"targets": [...]}` form (used by some job
generators) raised an unhandled error.

`min_clients`, `launch_external_process`, and `command` could not be set
through the recipe interface, forcing users to construct jobs manually.

**Deprecation: `FlareAgentWithCellPipe`**
The class has been superseded by constructing a `CellPipe` and
`FlareAgent` directly, or using the higher-level Client API
(`nvflare.client`).

---

**Fix: `FlareAgentWithFLModel.shareable_to_task_data()`**
The task name was read via `shareable.get_header(FLContextKey.TASK_NAME,
"")`, but this header is never populated in the subprocess path — the
task name is carried as the pipe `Message.topic` and stored in
`self.current_task.task_name` by `FlareAgent._on_task()`. An absent
header caused `task_name` to fall back to `""`, making
`ParamsConverter.process()` silently skip conversion. Fixed to use
`self.current_task.task_name if self.current_task else ""`, consistent
with `task_result_to_shareable()`.

`to_nvflare_converter_id` set on launcher executor
`LauncherExecutor.execute()` still applied CJ-side converters if users
explicitly set `from_nvflare_converter_id` / `to_nvflare_converter_id`,
while the subprocess was also applying its own default converters —
resulting in double conversion. Fixed by removing the converter calls
from `LauncherExecutor.execute()` entirely. Since `LauncherExecutor` is
only ever subclassed by `ClientAPILauncherExecutor` (subprocess path),
this has no effect on the in-process path.

---

**Converter logic removed from CJ launcher executors; moved to the
subprocess agent side.**

Previously, `PTClientAPILauncherExecutor` and
`TFClientAPILauncherExecutor` set up format converters in their
`initialize()` methods, and `LauncherExecutor.execute()` applied them on
the CJ before dispatching to the subprocess. This was the root cause of
issue #1 — the CJ was touching tensor data (triggering dtype inference)
before `LazyDownloadRef` placeholders were materialized in the
subprocess.

The converter setup blocks have been **completely removed** from both
launcher executors. Converters for the subprocess path now live entirely
inside the subprocess, wired into `FlareAgentWithFLModel` and executing
after full payload materialization.

Key changes:
- `FlareAgentWithFLModel` gains optional `from_nvflare_converter` /
`to_nvflare_converter` params. A lightweight `_ConverterContext` stub
(duck-typing `FLContext.get_prop/set_prop`) allows
`ParamsConverter.process()` to work without a real `FLContext` in the
subprocess.
- `SERVER_EXPECTED_FORMAT` is added to `ConfigKey` and propagated in the
subprocess launch config so the agent knows what format the server
expects.
- A shared factory `converter_utils.create_default_params_converters()`
is introduced, replacing four separate copies of the same
format-conditional logic across PT and TF executors.
- In-process executors (`PTInProcessClientAPIExecutor`,
`TFInProcessClientAPIExecutor`) are **unchanged in behavior** — they
still apply converters on the CJ side, which is correct since there is
no separate subprocess.

---

An earlier iteration kept the converter setup in `initialize()` but
added a `ClientAPILauncherExecutor.execute()` override that nulled the
converters out before calling `super().execute()` and restored them
after. This was removed because it left converters alive on the object
but always bypassed — confusing dead code. The current approach is
explicit and honest: launcher executors do not set converters at all,
and the subprocess handles its own conversion autonomously.

---

| Area | Files |
|------|-------|
| Core bug fixes | `ccwf_job.py`, `fed_utils.py` |
| Converter framework | `converter_utils.py` (new), `config.py`,
`flare_agent_with_fl_model.py`, `ex_process/api.py`,
`client_api_launcher_executor.py` (base) |
| PT launcher executor — converter removed |
`pt/client_api_launcher_executor.py` |
| PT in-process executor — refactored to shared factory |
`pt/in_process_client_api_executor.py` |
| TF launcher executor — converter removed + `initialize()` deleted |
`tf/client_api_launcher_executor.py` |
| TF in-process executor — refactored to shared factory |
`tf/in_process_client_api_executor.py` |
| Recipe | `pt/recipes/swarm.py` |
| Docs | `3rd_party_integration.rst`, `3rd_party_trainer.py` |
| Tests | 4 new test files, 4 modified test files, 2 integration YAMLs |

---

- **TF/Keras converter path is not directly unit-tested** due to
TF/PyTorch dependency conflicts in a shared environment. The
`create_default_params_converters` Keras branch is structurally
identical to the PT branch (which is fully tested). A dedicated TF-only
test environment could add coverage in a future pass.
- **`_ConverterContext` is a duck-type shim.** If more `ParamsConverter`
use cases arise in subprocess contexts, a formal lightweight interface
(separate from `FLContext`) would be worth extracting.
- **Converter plugin (`from_nvflare_converter_id`)** for the subprocess
path is currently not wired — only the auto-created default converters
are passed to the agent. Explicit user-provided converter IDs are still
only supported for the in-process path. This can be addressed in a
follow-up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ZiyueXu77 and others added 26 commits March 28, 2026 21:11
Fixes FLARE-2760, bug/5949101, bug/5948918.

Add ProdEnv info to edge example readme (already have it in
docs/user_guide/data_scientist_guide/job_recipe.rst)

Add model_state check to buff_model_manager

Add timeout arg to buff_recipe

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
… validate hang, round_timeout (NVIDIA#4270)

Fixes four bugs in NVFlare 2.7.2 RC12 affecting Swarm Learning with
`launch_external_process=True`
Implementation references @GeorgeWang-nv's original PR
NVIDIA#4263:
- Bug 1: different implementation from PR NVIDIA#4263
- Bug 2: accepts the implementation approach from PR NVIDIA#4263, with
small changes
- Bug 3: uses a different approach
- Bug 4: adds a timeout to the Swarm API, consolidating the timeouts
from 4 to 2 and exposing them to the Swarm recipes
- Bug 5: restores `self._max_resends` (private convention); subprocess
logging fixed (see below)
- Bug 6: guards ConnManager against RuntimeError after shutdown; 4
regression tests added

**Additional (review feedback):**
- Rename `SimpleSwarmLearningRecipe` → `SwarmLearningRecipe` everywhere;
backward-compat alias kept
- Fix docs: add missing required `min_clients` arg to all
`SwarmLearningRecipe` examples (TypeError for users copying doc
examples)
- CSE inventory scan: take LAST "best" key, not first — prevents initial
checkpoint named "best_*" from shadowing trained best model
- Add `pipe_type` / `pipe_root_path` to `SwarmLearningRecipe` so users
can select `FilePipe` without dropping to low-level API

(`via_downloader.py`)

**Root cause**: `_create_downloader()` subscribed to msg_root deletion
to clean up download transactions. The msg_root is deleted as soon as
blob envelopes are delivered, but `blob_cb` fires asynchronously —
secondary tensor downloads were still in flight when
`delete_transaction()` removed the ref from `_ref_table`, causing `"no
ref found" FATAL_SYSTEM_ERROR` on large models (≥1B parameters).

**Fix**: Remove `subscribe_to_msg_root` from `_create_downloader()`
entirely. Transaction lifecycle is now managed solely by `_monitor_tx()`
in `download_service.py`, which polls `is_finished()` every 5s and
cleans up only after all chunk downloads complete.
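The polling-based lifecycle described above can be sketched as follows. This is a minimal stand-in, not the NVFlare implementation: `DownloadTx`, `monitor_tx`, and `chunk_done` are hypothetical names illustrating why cleanup that waits for `is_finished()` cannot race with in-flight chunk downloads.

```python
import threading
import time

class DownloadTx:
    """Minimal stand-in for a download transaction (hypothetical names)."""

    def __init__(self, num_chunks: int):
        self._pending = num_chunks
        self._lock = threading.Lock()

    def chunk_done(self):
        with self._lock:
            self._pending -= 1

    def is_finished(self) -> bool:
        with self._lock:
            return self._pending == 0

def monitor_tx(tx: DownloadTx, cleanup, poll_interval: float = 0.01):
    # Clean up only after every chunk download has completed -- the
    # transaction is never torn down while downloads are still in flight.
    while not tx.is_finished():
        time.sleep(poll_interval)
    cleanup(tx)

cleaned = []
tx = DownloadTx(num_chunks=2)
t = threading.Thread(target=monitor_tx, args=(tx, cleaned.append))
t.start()
tx.chunk_done()          # secondary tensor downloads finish asynchronously
tx.chunk_done()
t.join(timeout=2)
print(len(cleaned))  # 1 -- cleanup ran exactly once, after all chunks
```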

(`cse_client_ctl.py`)

**Root cause**: `_prepare_local_model()` called
`submit_model_executor.execute()` which, for
`ClientAPILauncherExecutor`, launches a fresh subprocess with no trained
model state. The training subprocess had already exited; the model was
already saved to disk by `PTFileModelPersistor`.

**Fix**: Try the persistor first. Inventory key scan prefers the LAST
key containing `"best"` when `model_name` contains `"best"` — taking the
last "best" key avoids mistaking an initial checkpoint named `"best_*"`
(added first by `PTFileModelPersistor`) for the trained best model.
`isinstance` guard replaces `assert isinstance` (safe under `python
-O`). Falls back to executor for in-process mode compatibility.

(`flare_agent.py` + `via_downloader.py`)

**Root cause**: `FlareAgent._do_submit_result()` unconditionally waited
1800s for `DOWNLOAD_COMPLETE_CB` after every result send when
`pass_through_on_send=True`. For validate results (metrics only, no
tensors), `_finalize_download_tx()` creates no download transaction and
the callback never fires — subprocess blocks indefinitely.

**Fix**: Thread-local `was_download_initiated()` flag, set by
`_finalize_download_tx()` only when downloadable objects exist. Agent
returns immediately if `False` (validate result). Thread-local is
required because task pipe and metric pipe share the same `CoreCell`
(same `site_name + token + mode` → same FQCN → same `_CellInfo` cache
entry → same `fobs_ctx`); a plain flag would be clobbered by concurrent
metric serialisation from a different thread.
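The thread-isolation argument can be demonstrated with `threading.local`. A sketch (function names mirror the PR description but the bodies are illustrative): a flag set on the task thread is invisible to the metric thread, so sharing one `CoreCell` cannot clobber the gate.

```python
import threading

_tls = threading.local()

def set_download_initiated():
    # In the real flow, called only when downloadable objects exist.
    _tls.download_initiated = True

def was_download_initiated() -> bool:
    return getattr(_tls, "download_initiated", False)

def clear_download_initiated():
    _tls.download_initiated = False

seen = {}

def task_thread():
    set_download_initiated()
    seen["task"] = was_download_initiated()

def metric_thread():
    # Concurrent metric serialisation on another thread sees its own
    # (unset) copy of the flag -- no clobbering.
    seen["metric"] = was_download_initiated()

t1 = threading.Thread(target=task_thread)
t1.start(); t1.join()
t2 = threading.Thread(target=metric_thread)
t2.start(); t2.join()
print(seen)  # {'task': True, 'metric': False}
```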

(`swarm.py`)

**Root cause**: `SwarmClientConfig` hardcodes
`learn_task_ack_timeout=10s` and `final_result_ack_timeout=10s`. For
large models (≥2 GB), P2P tensor streaming takes minutes — the ACK times
out before download completes.

**Fix**: Add `round_timeout: float = 3600` to `SwarmLearningRecipe`.
Wires both `learn_task_ack_timeout` and `final_result_ack_timeout`;
`learn_task_timeout` is intentionally left `None` (unbounded) to avoid
capping per-round training time on slow hardware or 70B+ models.

(`client_api_launcher_executor.py`)

**Root cause**: `ClientAPILauncherExecutor.__init__()` stored
`max_resends` as `self._max_resends` but `TaskExchanger` (the parent)
uses `self.max_resends`. `PipeHandler` reads `self.max_resends` and saw
`None` (the parent default) instead of the configured value.

**Fix**: Restore `self._max_resends` as the private attribute throughout
`ClientAPILauncherExecutor` (consistent with private-member convention);
the executor builds its own config dict explicitly so it does not rely
on the inherited attribute.

(`ex_process/api.py`)

**Problem**: The subprocess had no logging configuration —
`logger.info()` calls were silently dropped. Wrapping all stdout via
`logger.info()` in the parent caused double-prefixed entries in
`log.txt` for NVFlare-formatted log lines.

**Fix**: `_configure_subprocess_logging()` in `ex_process/api.py` loads
the site's `log_config.json` unchanged, giving the subprocess identical
loggers to the parent (both consoleHandler + file handlers). The
parent's `log_subprocess_output()` now calls `_route_subprocess_line()`
which strips ANSI codes and checks for a `YYYY-MM-DD HH:MM:SS`
timestamp:
- Formatted NVFlare log line → `print()` to terminal only (file handler
already wrote it to `log.txt`)
- Raw `print()` from user training script → `logger.info()` so it
reaches `log.txt`

Regression tests:
`test_log_subprocess_output_formatted_lines_not_double_logged` + 5
`TestRouteSubprocessLine` tests.
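The routing rule above amounts to: strip ANSI codes, then branch on whether the line starts with a `YYYY-MM-DD HH:MM:SS` timestamp. A self-contained sketch (the regexes and the `route_subprocess_line` helper are illustrative, not the NVFlare code):

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# A line is "already formatted" if it starts with an NVFlare-style timestamp.
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def route_subprocess_line(line: str, print_fn, log_fn):
    clean = ANSI_RE.sub("", line)
    if TS_RE.match(clean):
        print_fn(clean)   # file handler already wrote it; terminal only
    else:
        log_fn(clean)     # raw print() from user script -> into log.txt

printed, logged = [], []
route_subprocess_line(
    "\x1b[32m2026-03-28 21:11:00 - INFO - hello\x1b[0m",
    printed.append, logged.append)
route_subprocess_line("epoch 1 loss=0.42", printed.append, logged.append)
print(printed, logged)
```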

(`conn_manager.py`)

**Root cause**: `process_frame_task()` and `start_connector()` could
raise unhandled `RuntimeError` when called after the executor was shut
down, causing noisy tracebacks during job teardown.

**Fix**: Guard `conn_mgr_executor.submit()` with try/except
`RuntimeError` (log debug, skip). Add `if self.stopped: return` early
exit in `process_frame_task()`.

**Regression tests**: 4 new tests in
`test_conn_manager_shutdown_race.py`.
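The guard pattern is simple: `ThreadPoolExecutor.submit()` raises `RuntimeError` once the executor is shut down, and during teardown that should be logged and swallowed rather than propagated. A minimal sketch (`safe_submit` is a hypothetical helper, not the ConnManager code):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("conn_mgr")

def safe_submit(executor, fn, *args):
    # After executor.shutdown(), submit() raises RuntimeError; during job
    # teardown we log at debug level and drop the frame instead of raising.
    try:
        executor.submit(fn, *args)
        return True
    except RuntimeError:
        logger.debug("executor already shut down; frame dropped")
        return False

ex = ThreadPoolExecutor(max_workers=1)
ok_before = safe_submit(ex, lambda: None)
ex.shutdown(wait=True)
ok_after = safe_submit(ex, lambda: None)
print(ok_before, ok_after)  # True False
```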

`SwarmLearningRecipe` always created a `CellPipe`; users needing
`FilePipe` (restricted networks, third-party integrations) had to drop
to the low-level `CCWFJob` + `SwarmClientConfig` API.

**Change**: Add `pipe_type: str = "cell_pipe"` and `pipe_root_path:
Optional[str] = None`:
- `"cell_pipe"` (default): unchanged — zero-copy PASS_THROUGH
forwarding, ~1 GB RAM overhead
- `"file_pipe"`: `FilePipe` created with `root_path` defaulting to
`{WORKSPACE}/pipe` (resolved at runtime); custom absolute path validated
to exist at recipe construction time
- Warnings for `pipe_root_path` ignored with `cell_pipe`, and
`file_pipe` with `launch_external_process=False`
- `ScriptRunner.__init__` gains `task_pipe` parameter forwarded to
`BaseScriptRunner`

9 new tests in `TestSwarmLearningRecipePipeType`.

| File | Bug | Change |
|------|-----|--------|
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | 1 + 3 |
Remove `subscribe_to_msg_root`; add thread-local `_tls`,
`was_download_initiated()`, `clear_download_initiated()` |
| `nvflare/client/flare_agent.py` | 3 | Check `was_download_initiated()`
after `send_to_peer()`; return immediately for validate results |
| `nvflare/app_common/ccwf/cse_client_ctl.py` | 2 | Persistor-first
`_prepare_local_model()` with last-"best"-key inventory scan and
isinstance guard |
| `nvflare/app_opt/pt/recipes/swarm.py` | 4 + rename + pipe | Add
`round_timeout`, `pipe_type`, `pipe_root_path`; rename to
`SwarmLearningRecipe`; backward-compat alias kept |
| `nvflare/job_config/script_runner.py` | pipe | Add `task_pipe` param
to `ScriptRunner`, forwarded to `BaseScriptRunner` |
| `nvflare/app_common/ccwf/recipes/swarm.py` | rename | Export
`SwarmLearningRecipe` alongside `SimpleSwarmLearningRecipe` |
| `nvflare/app_opt/pt/recipes/__init__.py` | rename | Add
`SwarmLearningRecipe` to lazy imports and `__all__` |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | 5 |
Restore `self._max_resends` (private convention) |
| `nvflare/app_common/launchers/subprocess_launcher.py` | logging |
`_route_subprocess_line()`: formatted lines → `print()`, raw lines →
`logger.info()` |
| `nvflare/client/ex_process/api.py` | logging |
`_configure_subprocess_logging()`: apply site log config unchanged (same
handlers as parent) |
| `nvflare/fuel/f3/sfm/conn_manager.py` | 6 | Guard executor submit +
frame processing against post-shutdown RuntimeError |
| `docs/programming_guide/memory_management.rst` | docs | Add missing
`min_clients` to example |
| `docs/user_guide/data_scientist_guide/available_recipes.rst` | docs |
Add missing `min_clients`; rename to `SwarmLearningRecipe` |
| `docs/programming_guide/controllers/client_controlled_workflows.rst` |
docs | Add missing `min_clients` (x2) |
| `docs/programming_guide/timeouts.rst` | docs | Rename to
`SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/README.md` | 4 + rename |
Add `round_timeout` to config table; rename to `SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/job.py` | rename | Update
import to `SwarmLearningRecipe` |
| `tests/unit_test/fuel/utils/fobs/test_via_downloader_msg_root.py` | 1
| 4 new tests |
| `tests/unit_test/client/test_download_initiated_gating.py` | 3 | 7 new
tests |
| `tests/unit_test/app_common/ccwf/test_cse_persistor_fallback.py` | 2 |
10 new tests (incl. initial-ckpt-shadowing regression) |
| `tests/unit_test/fuel/f3/sfm/test_conn_manager_shutdown_race.py` | 6 |
4 new regression tests |
| `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` |
logging | 6 new tests for `_route_subprocess_line` and double-logging
prevention |
| `tests/unit_test/recipe/swarm_recipe_test.py` | rename + pipe |
Updated to `SwarmLearningRecipe`; 9 new
`TestSwarmLearningRecipePipeType` tests |
|
`tests/unit_test/app_common/executors/client_api_launcher_executor_test.py`
| 5 | Updated: `_max_resends` assertions |
| `tests/unit_test/recipe/component_config_verification_test.py` |
rename | Updated to `SwarmLearningRecipe` |

- [x] 250+ unit tests pass across all affected modules
- [x] Bug 1: `test_via_downloader_msg_root.py` — verifies
`subscribe_to_msg_root` never called, method removed,
`DOWNLOAD_COMPLETE_CB` unaffected
- [x] Bug 2: `test_cse_persistor_fallback.py` — verifies persistor-first
path, last-"best"-key preference, initial-ckpt-shadowing regression,
isinstance guard, all fallback paths
- [x] Bug 3: `test_download_initiated_gating.py` — verifies thread
isolation, validate returns immediately (<1s), train waits,
clear-before-send
- [x] Bug 4: covered by existing recipe and ccwf tests
- [x] Bug 5: `client_api_launcher_executor_test.py` updated
(`_max_resends` assertions)
- [x] Bug 6: `test_conn_manager_shutdown_race.py` — 4 tests:
executor-shutdown no-raise, stop() no-raise, stopped early-return,
Prefix.from_bytes not called
- [x] Logging: `subprocess_launcher_test.py` — double-logging
prevention, ANSI stripping, plain-line capture, partial-timestamp
detection
- [x] Pipe type: `swarm_recipe_test.py` — 9 tests: default cell_pipe,
file_pipe instance, workspace template, custom path, invalid type,
warnings, path validation
- [x] Rename: backward-compat `SimpleSwarmLearningRecipe` import test
passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
update example readme

update example readme, remove non-existent links


---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(large-model deadlock)

`big_model_4g` CI test reliably failed with:

```
RuntimeError: failed to download from site-X_active
```

after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download
retries).

**Root cause 1 — ordering bug in `_finalize_external_execution()`:**

The subprocess client API works by keeping the subprocess's CellNet
connection alive
after it sends its result, so the server can pull large tensors directly
from the
subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To
prevent the
subprocess from exiting before the download completes,
`_do_submit_result()` blocks
on a `download_done.wait(1800 s)` event that fires only after the server
has
finished downloading all tensors.

However, `_finalize_external_execution()` called `stop_task()`
**synchronously**,
which sends `SIGTERM` to the subprocess immediately after `execute()`
receives the
result — before `ClientRunner` had even sent `SubmitUpdate` to the
server, let alone
before the server had started downloading. This tore down the subprocess
cell,
causing every server download attempt to hit `"no connection to
site-X_active"` /
`"cannot forward req: no path"`.

The subprocess-side wait was therefore unreachable: the process was
killed
externally before it could block.

**Root cause 2 — `launch_once=True` subprocess exits after round 1:**

`_do_submit_result()` called `os._exit(0)` unconditionally at the end of
the
`CellPipe + pass_through_on_send` path. For `launch_once=False` (one
subprocess per
round) this is correct — the process should exit immediately after the
download so
the deferred-stop poller unblocks. But for `launch_once=True` (one
subprocess for
all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the
subprocess
was killed after round 1, leaving rounds 2–N unhandled and the job
hanging.

---

**1. Deferred `stop_task()` (`launcher_executor.py`)**

Instead of calling `stop_task()` synchronously,
`_finalize_external_execution()`
now starts a background thread (when `_stop_task_wait_timeout > 0`) that
polls
`check_run_status()` until the subprocess exits naturally (i.e. after
`download_done` fires and the subprocess unblocks), then calls
`stop_task()`.
`execute()` returns immediately, so `ClientRunner` can send
`SubmitUpdate` and the
server can connect to the still-alive subprocess cell to download
tensors.

**2. Round-boundary coordination (`_deferred_stop_event`)**

Without additional synchronization, the deferred thread from round N
could fire
*after* round N+1's `launch_task()` call. Since `SubprocessLauncher`
guards
`_start_external_process()` with `if self._process is None`, seeing a
not-yet-cleared
reference to the exited round-N process causes it to skip starting a new
subprocess.
Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with
`"External process has not called flare.init and run status becomes
success"`.

A `threading.Event` (`_deferred_stop_event`, initially set) coordinates
the two:

- **Cleared** in `_finalize_external_execution()` just before the
deferred thread starts.
- **Set** unconditionally in a `finally` block when the deferred thread
completes.
- `_initialize_external_execution()` **waits** on this event before
calling
  `launch_task()`, with a forced `stop_task()` fallback if it times out.

In practice the wait is 0–1 s (subprocess exits naturally after
download, well
before the server completes global aggregation and dispatches the next
round's task),
so there is no meaningful latency impact.
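The event handshake described in points 1 and 2 can be sketched as follows. This is an illustrative reduction (class and method names are hypothetical; `process_exited.wait()` stands in for the `check_run_status()` polling loop), showing why a stale round-N stop can never overlap round N+1's launch:

```python
import threading

class DeferredStopper:
    """Sketch of the deferred-stop coordination (hypothetical names)."""

    def __init__(self):
        self._deferred_stop_event = threading.Event()
        self._deferred_stop_event.set()  # initially set: no stop pending
        self.stopped = False

    def finalize(self, process_exited: threading.Event):
        # Instead of a synchronous stop_task(): clear the event, then wait
        # in the background for the subprocess to exit naturally.
        self._deferred_stop_event.clear()

        def poll():
            try:
                process_exited.wait()   # stands in for check_run_status()
                self.stopped = True     # stop_task()
            finally:
                self._deferred_stop_event.set()

        threading.Thread(target=poll, daemon=True).start()

    def initialize_next_round(self, timeout: float = 2.0) -> bool:
        # The next round's launch_task() waits here, so a stale process
        # reference from round N cannot block starting round N+1.
        return self._deferred_stop_event.wait(timeout)

s = DeferredStopper()
exited = threading.Event()
s.finalize(exited)
exited.set()                 # subprocess exits after the download completes
ready = s.initialize_next_round()
print(ready, s.stopped)  # True True
```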

**3. `ClientAPILauncherExecutor` opt-in
(`client_api_launcher_executor.py`)**

`_stop_task_wait_timeout` is set to `download_complete_timeout` (default
1800 s) in
`ClientAPILauncherExecutor.__init__()`, enabling the deferred path only
for the
subprocess client API where large-tensor downloads are expected. The
base
`LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no
change).

**4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`,
`ex_process/api.py`)**

`prepare_config_for_launch()` now writes `launch_once` (derived from
`launcher.needs_deferred_stop()`) into the subprocess config file. The
subprocess
reads it via `ClientConfig.get_launch_once()` and passes it to
`FlareAgent`.
`_do_submit_result()` branches on `_launch_once`:

| `launch_once` | Behaviour after download gate |
|---|---|
| `False` (one subprocess per round, e.g. `pt-client-api`) |
`os._exit(0)` called directly — deferred-stop poller unblocks
immediately (original behaviour preserved) |
| `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) |
`atexit.register(os._exit, 0)` registered once — subprocess continues to
next round; `os._exit` fires only when `main()` finally returns,
bypassing non-daemon CoreCell thread cleanup |
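The two exit behaviours in the table reduce to a small branch: exit now, or register `os._exit` with `atexit` so it only fires when `main()` returns. A sketch with the exit and register calls injected so it is safe to run (the helper name is hypothetical):

```python
import atexit

def exit_after_download(launch_once: bool, exit_fn, register_fn=atexit.register):
    # launch_once=False: one subprocess per round -> exit immediately so the
    # parent's deferred-stop poller unblocks.
    # launch_once=True: keep serving rounds; only arrange a hard exit for
    # when main() finally returns (bypassing non-daemon thread cleanup).
    if launch_once:
        register_fn(exit_fn, 0)
        return "registered"
    exit_fn(0)
    return "exited"

calls = []
mode_a = exit_after_download(
    False,
    lambda code: calls.append(("exit", code)),
    lambda fn, code: calls.append(("atexit", code)))
mode_b = exit_after_download(
    True,
    lambda code: calls.append(("exit", code)),
    lambda fn, code: calls.append(("atexit", code)))
print(mode_a, mode_b, calls)  # exited registered [('exit', 0), ('atexit', 0)]
```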

`_resolve_launch_once()` safely fetches the launcher even when
`self.launcher` is
still `None` at `initialize()` time (resolved directly from the engine
component
registry).

**5. Pipe handler identity guard and safe close (`task_exchanger.py`)**

Pipe handler creation is refactored into `_create_pipe_handler()`, which
binds a
per-handler status callback that checks `self.pipe_handler is _h` before
acting.
This prevents a late `PEER_GONE` from round N's (now-stale) handler from
stopping
round N+1's handler. `stop(close_pipe=False)` is used because
`CellPipe.close()`
is irreversible — closing it in the status callback would prevent the
next round
from communicating. An explicit `self.pipe.close()` is added in
`END_RUN` instead.
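The identity guard is a closure that captures its own handler and compares it to the current one before acting. A reduced sketch (class names here are stand-ins for `TaskExchanger` and its pipe handler):

```python
class Handler:
    def __init__(self):
        self.stopped = False
        self.on_status = None

    def stop(self):
        self.stopped = True

class Exchanger:
    def __init__(self):
        self.pipe_handler = None

    def _create_pipe_handler(self):
        h = Handler()

        def on_status(_h=h):
            # Act only if _h is still the current handler: a late PEER_GONE
            # from round N's stale handler must not stop round N+1's.
            if self.pipe_handler is _h:
                _h.stop()

        h.on_status = on_status
        self.pipe_handler = h
        return h

ex = Exchanger()
old = ex._create_pipe_handler()
new = ex._create_pipe_handler()   # next round replaces the handler
old.on_status()                   # stale callback: ignored
new.on_status()                   # current callback: stops its own handler
print(old.stopped, new.stopped)   # False True
```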

**6. Defensive logging and heartbeat error handling
(`pipe_handler.py`)**

`_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status
when a send
is suppressed. Peer-gone detection logs the elapsed time since the last
heartbeat.
Heartbeat sends are wrapped in `try/except` so a broken pipe sets
`asked_to_stop`
and breaks the heartbeat loop cleanly instead of propagating an
unhandled exception.
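The heartbeat error handling can be illustrated with a loop whose send is wrapped in `try/except` so a broken pipe flips the stop flag instead of escaping the thread (the loop and its state dict are illustrative, not the `PipeHandler` code):

```python
def heartbeat_loop(pipe_send, state, max_beats=10):
    # A broken pipe sets asked_to_stop and ends the loop cleanly instead of
    # propagating an unhandled exception out of the heartbeat thread.
    beats = 0
    while not state["asked_to_stop"] and beats < max_beats:
        try:
            pipe_send("HEARTBEAT")
            beats += 1
        except BrokenPipeError:
            state["asked_to_stop"] = True
    return beats

sent = []

def flaky_send(msg):
    if len(sent) >= 3:
        raise BrokenPipeError("peer gone")
    sent.append(msg)

state = {"asked_to_stop": False}
beats = heartbeat_loop(flaky_send, state)
print(beats, state["asked_to_stop"])  # 3 True
```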

**7. New unit tests (`test_download_complete_gating.py`,
`test_download_initiated_gating.py`)**

Two new test files covering the subprocess download-gating behaviour
(the
`download_done` wait, the validate-path fast return, and `launch_once`).
Both use `FlareAgent.__new__()` to construct minimal agent
stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker:

- An `autouse` `_no_os_exit` fixture patches
`nvflare.client.flare_agent.os._exit`
  to a no-op for every test in the file.
- `_make_agent()` sets `agent._launch_once = False` (the per-round path
that calls
  `os._exit` directly, making the fixture's patch the active guard).

---

| File | Change |
|------|--------|
| `nvflare/app_common/executors/launcher_executor.py` | Deferred
`stop_task()` background thread + `_deferred_stop_event` round-boundary
coordination |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Set
`_stop_task_wait_timeout = download_complete_timeout`; add
`_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config |
| `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()`
abstract method + idempotency/thread-safety note on `stop_task()` |
| `nvflare/app_common/launchers/subprocess_launcher.py` | Implement
`needs_deferred_stop()`; add info logging for process start/stop |
| `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe
handler creation into `_create_pipe_handler()` with identity-checking
callback; use `close_pipe=False` to prevent irreversible
`CellPipe.close()` |
| `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in
pipe status log message |
| `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send
failures, peer-gone elapsed time); heartbeat send error handling |
| `nvflare/client/config.py` | Add `LAUNCH_ONCE` config key +
`get_launch_once()` |
| `nvflare/client/flare_agent.py` | `launch_once`-aware
`_do_submit_result()`: direct `os._exit` vs `atexit` |
| `nvflare/client/ex_process/api.py` | Pass `launch_once` to
`FlareAgentWithFLModel` |
| `tests/unit_test/client/test_download_complete_gating.py` | **New** —
tests for DOWNLOAD_COMPLETE_CB registration, download-done wait,
timeout, status logging, and cleanup |
| `tests/unit_test/client/test_download_initiated_gating.py` | **New** —
tests for thread-local download-initiation detection (validate-path fast
return, no spurious 1800 s wait) |
|
`tests/unit_test/app_common/executors/client_api_launcher_executor_test.py`
| New tests for deferred stop, `_deferred_stop_event`,
`_stop_task_wait_timeout` |

- **Release notes**: large-model subprocess reliability section, 7 bug
fix entries, edge recipe improvements
- **available_recipes**: remove \`SimpleSwarmLearningRecipe\` alias, add
\`round_timeout\`, subprocess tuning notes
- **timeout_troubleshooting**: subprocess result timeout and Swarm P2P
transfer scenarios, new quick-reference rows
- **timeouts.rst**: \`max_resends\`, \`np/tensor_min_download_timeout\`,
\`round_timeout\` cross-refs, updated Swarm example
- **memory_management.rst**: client training process memory cleanup
section, \`NVFLARE_CLIENT_MEMORY_PROFILE\` env var
- **client_controlled_workflows.rst**: \`round_timeout\` in Swarm
examples, Client Dropout Tolerance section
- **industry_use_cases.rst**: FAITE, JP Morgan/BNY/RBC, ORNL OLCF, AV,
Holoscan, Data Federation Mesh; FLARE Day talk links added

- **examples/README.md**: remove Section 2 step-by-step examples
(notebooks no longer exist), fix CIFAR-10 paths (\`cifar10/\` →
\`cifar10/pt/\`), fix \`lr-newton-raphson\` → \`hello-world/hello-lr\`,
remove dead \`code-pre-install\` entry, renumber sections
- **examples/advanced/README.md**: fix CIFAR-10 paths, remove dead
\`random_forest\`, \`brats18\`, \`prostate\`, \`fl_hub\` entries
- **examples/hello-world/README.md**: remove broken \`hello-ccwf\` link
- **docs/examples/medical_image_analysis.rst**: fix
\`brats18\`/\`prostate\` paths (\`examples/advanced/\` → \`research/\`)
- **timeout_troubleshooting.rst**: fix LLM example
\`submit_task_result_timeout\` (1200 → 1800) to be ≥
\`submit_result_timeout\`; clarify note on setting both timeouts
consistently

- **gettingStarted.astro**: fix two broken \`getting_started/\` links —
Step 3 now points to \`job_api/pt/src/cifar10_fl.py\`, Step 4 to
readthedocs quickstart
- **series.astro**: fix \`getting_started\` → \`hello-world\`,
\`hello-fedavg\` → \`hello-pt\`, CIFAR-10 paths,
\`brats18\`/\`prostate\` → \`research/\`, \`xgboost_secure\` path
- **tutorials.astro**: add 17 missing examples — Job Recipe, Logging
Tutorial, Hello PyTorch Lightning, Hello PyTorch Lightning Eval, Hello
Differential Privacy, Hello Flower, Federated Logistic Regression with
Newton-Raphson, Split Learning, PSI, Tensor Streaming for LLMs, AMPLIFY
Protein Model, Confidential Computing Provision, Cross-Edge FL, BraTS18,
FedHCA2, FedOBD, Federated Prostate Segmentation

- [ ] Verify RST renders correctly
- [ ] Check all external links resolve
- [ ] Verify web tutorial catalog links are accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Summary of changes for the ScaffoldRecipe + Client API scaffold_c_diff
KeyError:

1. Defensive check in scaffold_aggregate_fn
File: nvflare/app_common/workflows/scaffold.py

Before using _result.meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF], the
code now checks that the key exists. If it is missing, it raises a
ValueError that:

- Names the client (if present in meta)
- States that `scaffold_c_diff` is required in `FLModel.meta`
- Explains that the client must use `PTScaffoldHelper`: `init(model)`,
`model_update()` during training, `terms_update()` after training, and
send `get_delta_controls()` in meta
- Points to `nvflare.app_opt.pt.scaffold.PTScaffoldHelper`

So instead of an opaque `KeyError: 'scaffold_c_diff'`, users get a clear
message and next steps.
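The defensive check amounts to validating the meta key before use and raising a `ValueError` with guidance. A hedged sketch, not the NVFlare code (the constant value and the `check_scaffold_meta` helper are illustrative):

```python
class AlgorithmConstants:
    SCAFFOLD_CTRL_DIFF = "scaffold_c_diff"  # assumed key name

def check_scaffold_meta(meta: dict):
    # Fail with actionable guidance instead of an opaque
    # KeyError: 'scaffold_c_diff' during aggregation.
    if AlgorithmConstants.SCAFFOLD_CTRL_DIFF not in meta:
        client = meta.get("client_name", "<unknown>")
        raise ValueError(
            f"missing '{AlgorithmConstants.SCAFFOLD_CTRL_DIFF}' in "
            f"FLModel.meta from client {client}: the client script must use "
            "PTScaffoldHelper and send get_delta_controls() in meta"
        )
    return meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF]

ok = check_scaffold_meta({"scaffold_c_diff": {"w": 0.5}})
try:
    check_scaffold_meta({"client_name": "site-1"})
    raised = False
except ValueError as e:
    raised = "site-1" in str(e)
print(ok, raised)  # {'w': 0.5} True
```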

2. Stronger docstring for ScaffoldRecipe
File: nvflare/app_opt/pt/recipes/scaffold.py

The Client script requirement section now explicitly states that:

- Scaffold is not like `FedAvgRecipe`: the client must use
`PTScaffoldHelper`
- The client must set `meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF] =
scaffold_helper.get_delta_controls()` in the returned `FLModel`
- A plain `flare.receive`/`flare.send` loop without `PTScaffoldHelper`
will cause aggregation to fail

…A#4295)

Fixes a Remote Code Execution (RCE) vulnerability in the FOBS
deserialization subsystem reported in NVBug #5968367.

Root cause: Packer.unpack() validated the decomposer name (__fobs_dc__)
against BUILTIN_DECOMPOSERS but never validated the attacker-controlled
type name (__fobs_type__) before passing it to load_class(), which calls
importlib.import_module(). An authenticated FL participant could craft a
malicious payload with a trusted builtin decomposer and an arbitrary
Python class path to achieve full RCE on the aggregation server.

Fix: Added a BUILTIN_TYPES allowlist in builtin_decomposers.py covering
all types legitimately used across the NVFlare codebase and examples.
The Packer.unpack() method now validates type_name against this
whitelist (and the runtime-registered _decomposers) before calling
load_class().

Public API: Added add_type_name_whitelist(*type_names) for users to
extend the whitelist with custom types at runtime.
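The essential fix is validating the attacker-controllable type name against the allowlist before any import happens. A minimal sketch of that pattern (the allowlist contents and the `load_class_checked` helper are illustrative, not the FOBS code):

```python
import importlib

BUILTIN_TYPES = {"builtins.dict", "collections.OrderedDict"}  # illustrative
_runtime_whitelist = set()

def add_type_name_whitelist(*type_names):
    # Public extension point, as described above.
    _runtime_whitelist.update(type_names)

def load_class_checked(type_name: str):
    # Validate the attacker-controllable type name BEFORE any import --
    # this is the step the vulnerable unpack path was missing.
    if type_name not in BUILTIN_TYPES and type_name not in _runtime_whitelist:
        raise ValueError(f"type {type_name!r} is not in the FOBS whitelist")
    module_name, _, class_name = type_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

ok = load_class_checked("collections.OrderedDict")
try:
    load_class_checked("os.system")   # crafted payload: rejected, never imported
    blocked = False
except ValueError:
    blocked = True
add_type_name_whitelist("fractions.Fraction")  # user extends at runtime
print(ok.__name__, blocked)  # OrderedDict True
```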

Security Impact
CVE severity: CVSS v3.1 8.8 High (AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H)
Attack vector: Authenticated FL participant sends crafted FOBS message
with __fobs_dc__ set to a trusted builtin decomposer and __fobs_type__
set to any importable Python class path
Impact: Full RCE on the NVFlare aggregation server


---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The startup and local folders of each startup kit were signed at
provisioning time, and at the verification stage NVFlare checks the
startup folder. In a CC environment, one more verification is done
before NVFlare starts, but that verification begins from the root folder
of the startup kit. This left a gap: the root folder and the transfer
folder were not signed. This PR changes the signing process to cover the
root folder and all subfolders.

…IA#4307)

Fix FilePipe heartbeat send timeout too short for heavy aggregation
workloads

### Summary
Increase FilePipe heartbeat send timeout from 5s to 600s in file_pipe.py

### Background
FilePipe implements heartbeats as file-based request-reply: the sender
creates a heartbeat file and waits up to N seconds for the receiver to
delete it. If the receiver doesn't read it within N seconds, the sender
deletes the file itself and the heartbeat is silently dropped.

The previous default of 5s was too short for nodes running as aggregator
in CCWF Swarm with large models (e.g. 8B parameters). During inter-round
aggregation, P2P model streaming (~15GB per client) causes GIL
contention and filesystem pressure that delays the receiver's
PipeHandler._read thread beyond 5s. Heartbeat files were deleted before
being read, causing _last_heartbeat_received_time to stall and
eventually triggering false PEER_GONE after heartbeat_timeout (30s).

### Why 600s is safe
_monitor_file returns immediately when the receiver deletes the file —
600s is only the ceiling if the receiver never reads it. Normal
operation is unaffected. Genuine peer death is detected independently by
PipeHandler.heartbeat_timeout on the receiver side, which is unrelated
to this send-side timeout.
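The file-based request-reply heartbeat can be sketched to show why the 600s ceiling is harmless: the sender returns as soon as the receiver deletes the file. This is an illustrative reduction, not the `FilePipe` implementation:

```python
import os
import tempfile
import threading
import time

def send_heartbeat(path: str, timeout: float) -> bool:
    # Create the heartbeat file, then wait for the receiver to delete it.
    # Returns immediately when the file disappears; timeout is only the
    # ceiling for a receiver that never reads it.
    with open(path, "w") as f:
        f.write("HB")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not os.path.exists(path):
            return True          # receiver read (deleted) the heartbeat
        time.sleep(0.01)
    os.remove(path)              # give up: heartbeat silently dropped
    return False

hb = os.path.join(tempfile.mkdtemp(), "heartbeat")

def slow_receiver():
    time.sleep(0.1)              # e.g. delayed by GIL / filesystem pressure
    os.remove(hb)

t = threading.Thread(target=slow_receiver)
t.start()
delivered = send_heartbeat(hb, timeout=5.0)  # generous ceiling, returns early
t.join()
print(delivered)  # True
```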

### Notes
This fix applies to FilePipe only. CellPipe (used in all current
production configs) is not affected — it sends heartbeats via
fire_and_forget with no send-side timeout.

### Future Enhancements
- Make the timeout configurable: expose heartbeat_send_timeout as a
FilePipe.__init__ parameter so users running extremely large models or
slow storage can tune it without touching source code.
- Enforce the invariant FilePipe.heartbeat_send_timeout ≥
PipeHandler.heartbeat_timeout: currently these two values are configured
independently and silently break each other if mismatched. The right fix
is for PipeHandler to pass its heartbeat_timeout down to the pipe at
start() time, making the relationship explicit and impossible to
misconfigure.


…DIA#4309)

## Summary

Adds a **receiver-side opt-in** mechanism so that ext-process Client Job
(CJ) cells automatically decode incoming model tensors as
`LazyDownloadRef` placeholders. The subprocess then downloads tensors
**directly from the source** (server in FedAvg, aggregator in Swarm) via
CellNet routing, completely bypassing materialisation inside the CJ.

Previously, the forward-path PASS_THROUGH was controlled by a
sender-side flag (`forward_pass_through` on `SwarmClientController`),
which required the sender to know the receiver's execution mode — unsafe
in mixed deployments and inapplicable to FedAvg. This PR replaces that
with a receiver-decides architecture: the CJ knows it is an ext-process
launcher and opts in locally.

## Problem

In ext-process mode, the CJ receives the full model from the
server/aggregator, materialises all tensors into memory, then
re-serialises and sends them to the subprocess. For large models (e.g.
Llama 8B at ~16 GB), this doubles peak CJ memory and adds unnecessary
serialisation latency.

The existing `forward_pass_through` flag on `SwarmClientController` was:
- **Redundant** for ext-process receivers (receiver-side opt-in handles
it)
- **Dangerous** for in-process receivers (forces LazyDownloadRef on
nodes that can't handle it)
- **Inapplicable** to FedAvg (no SwarmClientController involved)

## Solution

### Core mechanism (3 changes in `cell.py`)

- `Cell.__init__()`: add `self.decode_pass_through = False` flag
- `Adapter.call()`: check `self.cell.decode_pass_through` for incoming
REQUESTs (Swarm AUX path)
- `Cell._send_one_request()`: check `self.decode_pass_through` on reply
decode (FedAvg GET_TASK path)

### Trigger (1 change in `client_api_launcher_executor.py`)

- `ClientAPILauncherExecutor.initialize()`: set
`cell.decode_pass_through = True` when pipe is `CellPipe` (FilePipe has
no CellNet cell for subprocess downloads, so it stays off)

### Flag removal (changes in `swarm_client_ctl.py`)

- Remove `forward_pass_through` parameter, field, and sender-side header
stamping
- Add `_has_lazy_refs()` static method for data-driven detection
- Replace all flag-based guards with `_has_lazy_refs()` checks at 3
sites:
  - `_scatter()` local queue resolution
  - `_end_gather()` defensive panic check
  - `_do_learn()` GLOBAL_MODEL resolution
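The data-driven check can be sketched as follows (a minimal illustration assuming a `LazyDownloadRef` marker type; the real method is a static method on `SwarmClientController`):

```python
class LazyDownloadRef:
    """Stand-in for the placeholder marking a not-yet-downloaded tensor."""

def has_lazy_refs(obj) -> bool:
    # Mirrors the documented recursion: only dict/list/tuple containers
    # are walked, which is why params wrapped in an FLModel object would
    # be missed (see Known limitations).
    if isinstance(obj, LazyDownloadRef):
        return True
    if isinstance(obj, dict):
        return any(has_lazy_refs(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return any(has_lazy_refs(v) for v in obj)
    return False
```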

### Cleanup

- Remove dead `CellPipe` import from `task_exchanger.py`

## Data flow (FedAvg ext-process)

```
Server                CJ cell              Subprocess cell
  │                     │                      │
  │── model reply ──►   │                      │
  │                  decode with               │
  │                  PASS_THROUGH=True         │
  │                  → LazyDownloadRef         │
  │                     │                      │
  │                  CellPipe.send()           │
  │                  FOBS encode:              │
  │                  LazyDownloadRefDecomposer │
  │                  re-emits server datum     │
  │                     │── cell msg ────────► │
  │                     │                   Adapter.call()
  │                     │                   PASS_THROUGH=False
  │                     │                   ViaDownloaderDecomposer
  │                     │                   _download_from_remote_cell()
  │◄── download chunks ─┼──────────────────── │
  │── tensor bytes ────►┼─────────────────►   │
  │                     │                   real tensors in shareable
```

The CJ never materialises the full model. The subprocess downloads
directly from the server.

## Coverage matrix

| Scenario | CJ behavior | Subprocess behavior |
|---|---|---|
| FedAvg + CellPipe (ext-process) | `decode_pass_through=True` → LazyDownloadRef | Downloads from server via CellNet |
| FedAvg + FilePipe (ext-process) | `decode_pass_through=False` → real tensors | Receives real tensors via file |
| FedAvg in-process | `decode_pass_through=False` → real tensors | N/A (same process) |
| Swarm ext-process | `decode_pass_through=True` → LazyDownloadRef | Downloads from aggregator via CellNet |
| Swarm in-process | `decode_pass_through=False` → real tensors | N/A (same process) |
| Mixed deployment | Each CJ decides independently | No sender-side knowledge needed |

## Files changed

| File | Change |
|---|---|
| `nvflare/fuel/f3/cellnet/cell.py` | `decode_pass_through` flag + checks in `Adapter.call()` and `_send_one_request()` |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Set flag when pipe is CellPipe |
| `nvflare/app_common/ccwf/swarm_client_ctl.py` | Remove `forward_pass_through`, add `_has_lazy_refs()`, update 3 guard sites |
| `nvflare/app_common/executors/task_exchanger.py` | Remove dead `CellPipe` import |
| `tests/unit_test/app_common/ccwf/test_swarm_forward_memory_path.py` | Rewritten for data-driven detection tests |
| `tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_lazy_ref_local_aggr.py` | Remove `forward_pass_through` reference |



## Known limitations

- `_has_lazy_refs()` only recurses into `dict`/`list`/`tuple`. If an
`FLModel` object (via `FLModelDecomposer`) carries `LazyDownloadRef` in
its `.params` dict, the check will miss it. Current Swarm uses DXO
(stored as plain nested dicts), so this is not an issue today.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…acement mid-transaction (NVIDIA#4314)

### Problem

BEFORE_TASK_EXECUTION is a global event broadcast to all FLComponents.
In a CCWF Swarm aggregator node, receiving swarm_report_learn_result aux
tasks from other sites while the local subprocess is training fires this
event concurrently. This causes TaskExchanger.handle_event() to stop and
recreate the PipeHandler while execute() is blocked in its polling loop
— the old handler (that the subprocess is writing to) is orphaned, and
the polling loop reads from the new empty handler forever. Silent
deadlock, no error, no PEER_GONE, no timeout.

Affects all pipe types (FilePipe, CellPipe). Independent of the FilePipe
TOCTOU fix in NVIDIA#4296.

### Fix

- Add a `threading.Event` `_executing` guard flag: `handle_event(BEFORE_TASK_EXECUTION)` skips handler replacement when `_executing.is_set()`
- `TaskExchanger.execute()` uses an ownership pattern so `super().execute()` from `LauncherExecutor` does not prematurely clear the flag
- `LauncherExecutor.execute()` sets the flag at the very top, before `_initialize_external_execution()`, covering the up-to-60 s `_wait_external_setup()` window
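The guard and ownership pattern can be sketched like this (names are illustrative; the real code is split across `TaskExchanger` and `LauncherExecutor`):

```python
import threading

class ExecGuardSketch:
    def __init__(self):
        self._executing = threading.Event()
        self.handler_generation = 0  # counts PipeHandler recreations

    def handle_event(self, event_type: str) -> None:
        if event_type == "BEFORE_TASK_EXECUTION":
            if self._executing.is_set():
                return  # a task is mid-flight: keep the current handler
            self.handler_generation += 1  # safe to replace the handler

    def execute(self, task_fn):
        # Ownership pattern: only the call that set the flag clears it,
        # so a nested super().execute() cannot clear it prematurely.
        owner = not self._executing.is_set()
        self._executing.set()
        try:
            return task_fn()
        finally:
            if owner:
                self._executing.clear()
```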

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…skip ci] (NVIDIA#4316)

## Summary

- Replace verbose bug-fix enumeration in Memory Management, F3
Streaming, and Hierarchical FL Startup Stability sections with concise
feature-level summaries
- Spell out "Client Job (CJ)" on first use
- Internal fix details are preserved in the existing Bug Fixes section

## Test plan
- [ ] Verify RST renders correctly in Sphinx (no broken references)
- [ ] Review rendered output at readthedocs preview

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…A#4317)

Fixes scikit-learn examples rejecting all client contributions due to
missing CURRENT_ROUND in fl_ctx

### Root cause

FedAvg tracks the current round as a plain Python int (`self.current_round`) but never writes it into `fl_ctx`. Custom aggregators (e.g. `CollectAndAssembleModelAggregator`) read `CURRENT_ROUND` from their `fl_ctx` for round validation in `accept_model()` and `aggregate_model()`. Their `fl_ctx` is set once at START_RUN and never updated, so it always returns `None` for `CURRENT_ROUND`, causing all contributions to be discarded (`contribution_round=0 != None`) and leaving the global model empty, which crashes clients in round 1.

### Why not a full fl_ctx sync (`self.aggregator.fl_ctx = self.fl_ctx`)?

`BaseModelController._process_result` overwrites `self.fl_ctx` with a per-task context on every result callback. A one-time sync at run start would immediately become stale, and a per-call sync in `_aggregate_one_result` only covers `accept_model`, not `aggregate_model`. Copying the entire task `fl_ctx` to the aggregator would also expose unrelated per-task state.

### Fix

At the start of each round, set `CURRENT_ROUND` directly on the aggregator's own stable `fl_ctx` (initialized at START_RUN, never overwritten):

```python
if self.aggregator and self.aggregator.fl_ctx:
    self.aggregator.fl_ctx.set_prop(AppConstants.CURRENT_ROUND, self.current_round, ...)
```

This is minimal, targeted, and safe for aggregators that don't use `fl_ctx` at all (an `fl_ctx` guard is added).

Verified: sklearn-kmeans simulation with 3 clients × 5 rounds; all contributions accepted every round, Homogeneity=1.0 from round 1 onward, no warnings or errors.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…rtup+local for non-CC (NVIDIA#4318)

## Issue

Commit 0b74dee (NVIDIA#4301) changed `SignatureBuilder` to sign the entire startup kit root directory (`get_ws_dir`) for all participants. This introduced an unintended side effect:

During provisioning, `sign_folders()` recursively walks the root and writes `signature.json` into every subdirectory, including `transfer/`. When `nvflare poc prepare` subsequently calls `_prepare_jobs_dir()`, it finds `transfer/` non-empty and incorrectly prompts:

job directory at .../transfer is already exists, replace with new one ? (y/N)

This happens even on a completely clean workspace (after `nvflare poc clean`), because `transfer/` is polluted during provisioning before `_prepare_jobs_dir` runs.

## Root Cause

Root-level signing was introduced to support Confidential Computing (CVM) environments, where the full startup kit must be verified before CVM launch. However:

1. For non-CC cases, only `startup/` and `local/` are verified at runtime; signing the root adds no security value.
2. The signature validation code does not check `transfer/` or the root in non-CC mode, making the extra signing pure side effect.
3. Signing the root breaks `nvflare poc prepare` on clean workspaces.

## Fix

In `SignatureBuilder.build()`, check `PropKey.CC_ENABLED` per participant:
- CC participants: sign from the root (`get_ws_dir`), unchanged from NVIDIA#4301
- Non-CC participants: sign only `startup/` and `local/`, restoring pre-NVIDIA#4301 behavior
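Under the stated assumptions (the helper name below is hypothetical; the real logic lives in `SignatureBuilder.build()` and checks `PropKey.CC_ENABLED`), the gating reduces to:

```python
import os

def dirs_to_sign(ws_dir: str, cc_enabled: bool) -> list:
    """Pick the signing scope per participant, per the fix described above."""
    if cc_enabled:
        # CC: the whole startup kit root must be verifiable before CVM launch.
        return [ws_dir]
    # Non-CC: only startup/ and local/ are verified at runtime, so signing
    # anything else (e.g. transfer/) is pure side effect.
    return [os.path.join(ws_dir, "startup"), os.path.join(ws_dir, "local")]
```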

## Affected Files

| File | Change |
|------|--------|
| `nvflare/lighter/impl/signature.py` | Add `PropKey` import; gate root signing behind `CC_ENABLED` check |
| `tests/unit_test/lighter/test_signature_builder.py` | New: 5 unit tests covering CC, non-CC, mixed participants, and the `transfer/` regression |

## Types of Changes

- Non-breaking change
- New tests added

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…les [skip-ci] (NVIDIA#4323)

Polishes the Memory Management section of the 2.7.2 release notes and
adds
proper RST benchmark tables.

**File:** `docs/release_notes/flare_272.rst`

- Rewrote the opening paragraph to flow naturally and remove the awkward
"We enhance three features to help improvements memory manangement"
sentence
- Fixed ``--`` em-dash inconsistency in the **Server and Client Memory
Cleanup** heading
- Fixed admonition directive formatting (missing blank line after title
caused render failure)
- Rewrote the summary paragraph (lines 107–115) for clarity and correct
grammar:
removed "memory RSS ( Resident Set Size) grows", "avoid extra memory
usage as well as"
- Fixed typos: ``manangement``, ``In-Procss``, ``client agv``
- Clarified improvement percentages are reductions by adding ``−`` sign
(was bare ``51%``, ``49%``)

Replaced the four unformatted plain-text tables (lines 117–137) with
proper
RST ``list-table`` directives, covering:

- **FedAvg** (5 GB model, 4 clients) — in-process and subprocess modes
- **Swarm Learning** (2.5 GB model, 3 sites) — in-process and subprocess
modes

- [x] Non-breaking change (documentation only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…skip ci] (NVIDIA#4332)

## Summary
Update the 2.7.2 release notes to call out two security fixes.

## Changes

**File:** `docs/release_notes/flare_272.rst`
  
```
- Security fix: Addressed a remote code execution vulnerability in FOBS deserialization.
- Security fix: Addressed a path traversal vulnerability in FileRetriever.
```

## Types of Changes

- [x] Non-breaking change (documentation only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA#4334)

### Description

- **Bug 1 (pipe destroyed between rounds):** `_pipe_status_cb` called
`pipe_handler.stop()` with the default `close_pipe=True`, which invokes
`FilePipe.close()` and can remove pipe directories (including the shared
root if `_remove_root=True`). Fixed by using `stop(close_pipe=False)`,
matching the pattern already in `TaskExchanger`.

- **Bug 2 (silent no-op on round 2+):** `BEFORE_TASK_EXECUTION` called
`start()` on a stopped handler whose worker threads had already nulled
themselves out (`pipe_handler.py` L390/L412), making the call a silent
no-op. Fixed by recreating a fresh `PipeHandler` per task execution,
same as `TaskExchanger`.

- **Bug 3 (stale callback race):** `_pipe_status_cb` referenced
`self.pipe_handler` directly, so a late `PEER_GONE` from a previous
handler could stop the newly created one. Fixed by using a handler-bound
identity-check closure, matching `TaskExchanger._create_pipe_handler`.

- **Bug 4 (bad payload crashes reader thread):** `_pipe_msg_cb` logged
an error on non-DXO data but did not return, falling through to
`send_analytic_dxo()` which raises `TypeError`. Since the callback runs
on the pipe reader thread, this exception terminated the handler as a
`PEER_GONE` path. Fixed by adding an early `return` after the error log.
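The fix for Bug 3 can be sketched as a handler-bound identity-check closure (names are illustrative, following the `TaskExchanger._create_pipe_handler` pattern described above; the `close_pipe=False` call reflects Bug 1's fix):

```python
class HandlerSketch:
    """Stand-in for a PipeHandler."""
    def __init__(self):
        self.stopped = False
    def stop(self, close_pipe: bool = True) -> None:
        self.stopped = True

class ReceiverSketch:
    def __init__(self):
        self.pipe_handler = None

    def make_status_cb(self, handler: HandlerSketch):
        # The closure captures the handler it was created for; a late
        # callback from a replaced handler is ignored instead of stopping
        # the newly created one.
        def cb(status: str) -> None:
            if self.pipe_handler is not handler:
                return  # stale callback from a replaced handler
            if status == "PEER_GONE":
                handler.stop(close_pipe=False)
        return cb
```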


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
The ml-to-fl/pt/, ml-to-fl/tf/, ml-to-fl/np/ directories are empty so
the relative link checker fails. Replace the three dead links with a
single reference to the ml-to-fl/ directory itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Git does not track empty directories, so ml-to-fl/pt/, tf/, np/ don't
exist in CI and the relative link checker fails. Remove the section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA#3896)

Fixes # .

Apply changes from NVIDIA#3878

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

# Fix FedAvg recipe round handling and model-selector events

## Problem

When using the recipe-based FedAvg flow (BaseModelController / FedAvg),
IntimeModelSelector saw `CURRENT_ROUND` as `None` and discarded valid
contributions with "Current round is: None", then in round 2 reported
"Expected round for this aggregation: 1". The "new best validation
metric" path and `GLOBAL_BEST_MODEL_AVAILABLE` never ran because
`BEFORE_AGGREGATION` was not fired in the InTime aggregation path.

## Changes

- **BaseModelController**
(`nvflare/app_common/workflows/base_model_controller.py`):
- Fire **ROUND_STARTED** at the start of each round in
`broadcast_model()` (after `set_fl_context(data)`) so widgets (e.g.
IntimeModelSelector) reset per round.
- In **`_process_result()`**, set **CURRENT_ROUND** on the callback
`fl_ctx` from the task data before firing `BEFORE_CONTRIBUTION_ACCEPT`,
so selectors and aggregators see the correct round.

- **FedAvg** (`nvflare/app_common/workflows/fedavg.py`):
- Fire **BEFORE_AGGREGATION** after all client results are in and before
`_get_aggregated_result()`, so IntimeModelSelector runs
`_before_aggregate`, logs "new best validation metric" when applicable,
and can fire `GLOBAL_BEST_MODEL_AVAILABLE`.
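The resulting per-round sequence, sketched with stand-in stubs (event names come from the description above; the `fire_event`/`set_prop` signatures are simplified assumptions, not the real FLComponent API):

```python
class FLCtxStub:
    def __init__(self):
        self.props = {}
    def set_prop(self, key, value):
        self.props[key] = value

class ControllerSketch:
    def __init__(self):
        self.events = []  # records (event_name, round seen by widgets)
    def fire_event(self, name, fl_ctx):
        self.events.append((name, fl_ctx.props.get("CURRENT_ROUND")))

def run_round(ctl, fl_ctx, round_num, gather_results, aggregate):
    fl_ctx.set_prop("CURRENT_ROUND", round_num)
    ctl.fire_event("ROUND_STARTED", fl_ctx)       # widgets reset per round
    results = gather_results()                    # broadcast + collect
    ctl.fire_event("BEFORE_AGGREGATION", fl_ctx)  # model selector runs here
    return aggregate(results)
```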

## Result

Round and contribution matching work correctly for the recipe FedAvg
flow, and model selection / best-model events behave the same as in the
scatter-and-gather flow.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Restore ROUND_STARTED event and CURRENT_ROUND prop on fl_ctx in FedAvg
  round loop (removed by NVIDIA#4317); IntimeModelSelector and other components
  depend on this event firing each round.

- Revert PASS_THROUGH channel from get_pipe_channel_name() back to
  _CellChannel.SERVER_COMMAND in ClientAPILauncherExecutor (NVIDIA#4309
  introduced the regression; NVIDIA#4312 was incorrectly marked no-diff at
  assessment time before NVIDIA#4309 was cherry-picked).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This entry was accidentally dropped during cherry-pick. Without it, fobs.unpack()
raises ValueError when deserializing Shareables from the XGBoost
histogram_based_v2 controller, which serializes DataSplitMode directly into
every Shareable sent to clients.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chesterxgchen
Collaborator Author

/build

@chesterxgchen
Collaborator Author

@greptileai

@pcnudde pcnudde closed this Mar 30, 2026
