Port remaining 2.7.2 changes to main #4376

Closed

chesterxgchen wants to merge 42 commits into NVIDIA:main from chesterxgchen:port-2.7.2-main

Conversation

@chesterxgchen
Collaborator

Summary

Cherry-picks 35 net-new commits from the 2.7.2 release into main. The following 50 PRs were requested; 15 were already present in main (via earlier batch ports or equivalent changes) and are excluded.

Included (35 PRs — net new to main)

Security fixes:

Critical streaming / memory / deadlock fixes:

Other bug fixes / enhancements:

Docs / release notes:

Skipped (already in main via other PRs)

| 2.7 PR | Already in main |
|---|---|
| #3999 | Web-only change; #3965, #3989 already in main |
| #4076 | Empty (already present) |
| #4103 | Empty (already present) |
| #4119 | Empty (already present) |
| #4122 | Empty (already present) |
| #4172 | In main via #4328 |
| #4173 | In main via #4194 |
| #4250 | In main via #4327 |
| #4255 | Already present (empty diff) |
| #4259 | In main via #4327 |
| #4297 | In main via #4327 |
| #4306 | In main via #4329 |
| #4315 | In main via #4329 |
| #4320 | In main via #4346 |
| #4325 | In main via #4346 |

Note: #4319 was already created as PR #4375.

🤖 Generated with Claude Code

@chesterxgchen
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented Mar 29, 2026

Greptile Summary

This PR cherry-picks 35 net-new commits from the 2.7.2 release branch into main, covering security fixes (RCE, FilePipe TOCTOU, startup kit signing), critical streaming/memory/deadlock fixes, and miscellaneous bug fixes and documentation updates. Several previously-identified issues (PASS_THROUGH channel mismatch, missing ROUND_STARTED in FedAvg, ccwf_job.py TypeError, xgboost.core.DataSplitMode whitelist removal) have been addressed in follow-up commits.

Key changes:

Confidence Score: 4/5

Safe to merge after resolving the double ROUND_STARTED firing in FedAvg — all other previously flagged P1s are fixed.

One remaining P1 defect: ROUND_STARTED is now fired twice per FedAvg round — once explicitly in fedavg.run() and again inside base_model_controller.send_model() (added in this PR). All prior P1 issues were addressed in follow-up commits. The double-firing is a real behavioral regression that needs one of the two call sites removed before merge.

nvflare/app_common/workflows/fedavg.py (line 160) and nvflare/app_common/workflows/base_model_controller.py (line 157) — coordinate which layer owns the ROUND_STARTED event
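The double-fire can be demonstrated with a minimal sketch (illustrative names only, not the actual NVFlare event API): when both the workflow loop and `send_model()` fire the event, handlers observe `ROUND_STARTED` twice per round, and removing either call site restores the once-per-round contract.

```python
class EventRecorder:
    """Collects fired events so the per-round count can be inspected."""
    def __init__(self):
        self.events = []

    def fire(self, name):
        self.events.append(name)

def run_round(recorder, caller_fires_event=True, send_model_fires_event=True):
    """Sketch of one FedAvg round with both candidate ROUND_STARTED call sites."""
    if caller_fires_event:
        recorder.fire("ROUND_STARTED")    # explicit fire in fedavg.run()
    if send_model_fires_event:
        recorder.fire("ROUND_STARTED")    # implicit fire inside send_model()
    recorder.fire("BEFORE_AGGREGATION")
```

With both flags enabled (the state introduced by this PR) the event fires twice; disabling either one yields the expected single fire.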

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/app_common/workflows/fedavg.py | Refactors FedAvg run loop (removes tensor disk offload, restores ROUND_STARTED and CURRENT_ROUND propagation); double ROUND_STARTED firing introduced — sent once explicitly and again inside send_model() |
| nvflare/app_common/workflows/base_model_controller.py | Adds ROUND_STARTED event to send_model() and changes CURRENT_ROUND sticky flag to False in result callback; ROUND_STARTED addition causes double-fire when callers also fire it explicitly |
| nvflare/app_common/executors/task_exchanger.py | Removes pipe.clear() and isinstance guard from BEFORE_TASK_EXECUTION handler to fix deadlock when executing is in progress; safe change |
| nvflare/fuel/f3/streaming/blob_streamer.py | Simplifies blob_cb exception handling and softens buffer-overrun from exception to warning with partial copy; both changes intentional per 2.7 design |
| nvflare/fuel/utils/fobs/decomposers/via_downloader.py | Removes make_lazy_ref fast path in recompose(); PASS_THROUGH mode now uses _LazyBatchInfo sentinel instead |

Sequence Diagram

``` mermaid
sequenceDiagram
    participant FA as FedAvg.run()
    participant BMC as BaseModelController.send_model()
    participant EH as Event Handlers

    loop Each FL Round
        FA->>EH: fire ROUND_STARTED (1st explicit, line 160)
        FA->>BMC: send_model(targets, data, callback)
        BMC->>EH: fire ROUND_STARTED (2nd implicit, line 157) ⚠️
        BMC->>BMC: broadcast task to clients
        FA->>FA: wait for standing tasks
        FA->>EH: fire BEFORE_AGGREGATION
        FA->>FA: _get_aggregated_result()
        FA->>FA: update_model() / save_model()
    end
```

Reviews (4): Last reviewed commit: "Restore xgboost.core.DataSplitMode to fo..."

Comment thread nvflare/fuel/f3/streaming/blob_streamer.py
Comment thread nvflare/app_common/executors/client_api_launcher_executor.py
Comment thread nvflare/app_common/workflows/fedavg.py
Comment thread nvflare/app_common/ccwf/swarm_client_ctl.py
@chesterxgchen
Collaborator Author

/build

Comment thread nvflare/fuel/utils/fobs/builtin_decomposers.py
holgerroth and others added 12 commits March 28, 2026 21:11
Fixes # .

Fix broken Job CLI tutorial.
Adds missing model.py for Job Recipe tutorial.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

- Removed all TensorFlow dependencies from the CIFAR-10 example by
replacing the manual protobuf parsing with tbparse
- Consolidated requirements by moving plotting dependencies (matplotlib,
seaborn, pandas, tbparse) into the main requirements.txt and removing
the separate plot-requirements.txt

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
)

Based on the original work by @GeorgeWang-nv in PR NVIDIA#4166. Credit to
George for the retry mechanism design.
This PR just fixes some comments

Fixes swarming LLM streaming loss.

Add retry mechanism for streaming download on TIMEOUT.

**Changes from the original PR plus additional improvements:**

- **Retry with exponential backoff**: On `TIMEOUT`, retry up to
`max_retries` (default 3) times with exponential backoff (2s, 4s, 8s,
... capped at 60s) instead of fixed 2s sleep
- **Structured logging**: `[DOWNLOAD_RETRY]`, `[DOWNLOAD_RECOVERED]`,
`[DOWNLOAD_FAILED]` log tags with backoff duration for production
observability
- **State-safe retry**: Resend the same `current_state` on retry so the
producer re-generates the same chunk (data-safe by design)
- **Consecutive timeout counter reset**: Counter resets after any
successful response, allowing fresh retries on later chunks
- **Abort signal handling**: Abort checked before and after backoff
sleep to minimize delay on cancellation
- **Input validation**: `max_retries < 0` raises `ValueError`
- **Fresh request per iteration**: Build a new `Message` each loop
iteration to avoid re-encoding stale payloads
- **Comprehensive test suite**: 17 unit tests covering happy path,
retry/recovery, exhaustion, counter reset, state integrity, abort,
consumer errors, and input validation
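The retry behavior described above can be sketched as follows. This is an illustrative simplification, not the actual NVFlare download API: `fetch_chunk` is a hypothetical callable standing in for the producer request, and the backoff schedule (2s, 4s, 8s, ... capped) matches the description in the bullet list.

```python
import time

def download_with_retry(fetch_chunk, state, max_retries=3, base_backoff=2.0,
                        max_backoff=60.0, is_aborted=lambda: False):
    """Retry a chunk download on TIMEOUT with exponential backoff.

    fetch_chunk(state) -> (status, data); status is "TIMEOUT" or "OK".
    The SAME state is resent on retry so the producer regenerates the
    same chunk (data-safe by design).
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    consecutive_timeouts = 0
    while True:
        status, data = fetch_chunk(state)
        if status != "TIMEOUT":
            consecutive_timeouts = 0          # reset after any success
            return data
        consecutive_timeouts += 1
        if consecutive_timeouts > max_retries:
            raise TimeoutError(f"[DOWNLOAD_FAILED] gave up after {max_retries} retries")
        # exponential backoff: base, 2*base, 4*base, ... capped at max_backoff
        backoff = min(base_backoff * (2 ** (consecutive_timeouts - 1)), max_backoff)
        if is_aborted():                      # check abort before sleeping
            raise RuntimeError("download aborted")
        time.sleep(backoff)
        if is_aborted():                      # and again after, to minimize delay
            raise RuntimeError("download aborted")
```

In the real implementation the consecutive-timeout counter also resets between chunks, so later chunks get a fresh retry budget.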

- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: chiyuw <chiyuw@nvidia.com>
Properly make hello-numpy-cross-val use branching to match other
examples.

Fixes # .

Properly make hello-numpy-cross-val use branching to match other
examples and ensure that there will no longer be any errors due to
numpy_key.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Take changes from NVIDIA#4169 and
issues found by @GeorgeWang-nv

Fix streaming pool starvation

Fixes a deadlock when many blob receive callbacks run at once (e.g., a large
model with many download chunks). Previously, handle_blob_cb ran on the
stream pool and called blob_cb(future, ...) on the same thread; blob_cb
blocks on future.result() while _read_stream (which sets the future)
also runs on the same pool, so the pool could be exhausted and never run
_read_stream.

``` mermaid
sequenceDiagram
    autonumber
    participant DL as "Requester: download_object()"
    participant CS as "Requester Cell._send_one_request()"
    participant TX as "ByteStreamer.send()/TxTask"
    participant NET as "SFM/Network"
    participant CR as "Receiver CoreCell._process_request()"
    participant BR as "ByteReceiver._data_handler()"
    participant W1 as "stream_thread_pool Worker A"
    participant BH as "BlobHandler.handle_blob_cb()"
    participant W2 as "stream_thread_pool Worker B"
    participant AD as "Adapter.call()"
    participant HD as "DownloadService._handle_download()"

    DL->>CS: cell.send_request(...)
    CS->>TX: send_blob(request)
    TX->>NET: stream DATA frames
    NET->>CR: process_message() -> _process_request()
    CR->>BR: callback for STREAM_DATA_TOPIC
    BR->>W1: submit(_callback_wrapper)

    W1->>BH: _callback_wrapper -> handle_blob_cb(future, stream)
    BH->>W2: submit(_read_stream, blob_task)
    BH->>AD: blob_cb(future, ...)  (this is Adapter.call)
    AD->>AD: future.result()  (blocks)

    rect rgb(255, 235, 235)
    Note over W1,W2: Starvation risk: Worker A waits for future.result(), while _read_stream is queued on same pool.
    end

    W2->>W2: _read_stream() -> stream.read() loop
    W2->>AD: future.set_result(...)
    AD->>HD: self.cb(request) -> _handle_download() -> produce()
    AD->>TX: send_blob(reply)
    TX->>NET: stream reply frames
    NET->>CS: receiver side _process_reply() sets in_receiving
    CS->>DL: send_request() returns reply
```

- Separate callback pool: Run blob_cb on a dedicated
callback_thread_pool instead of on the stream pool. The stream pool only
schedules _read_stream and the callback; it no longer blocks on
future.result(), so it can always run _read_stream.
- Blocking -> non-blocking: handle_blob_cb now submits blob_cb to the
callback pool and returns immediately. Comments document this
intentional behavior change.
- Exception handling: Added _run_blob_cb so exceptions from blob_cb are
still logged and stream.task.stop(StreamError(...)) is called when
stream has a task (e.g. RxStream), matching the previous behavior of
ByteReceiver._callback_wrapper.

``` mermaid
sequenceDiagram
    autonumber
    participant DL as "Requester download_object"
    participant CS as "Requester Cell _send_one_request"
    participant TX as "ByteStreamer send TxTask"
    participant NET as "SFM Network"
    participant CR as "Receiver CoreCell _process_request"
    participant BR as "ByteReceiver _data_handler"
    participant W1 as "stream_thread_pool Worker A"
    participant BH as "BlobHandler handle_blob_cb"
    participant W2 as "stream_thread_pool Worker B"
    participant CB as "callback_thread_pool Worker"
    participant AD as "Adapter call"
    participant HD as "DownloadService _handle_download"

    DL->>CS: cell.send_request
    CS->>TX: send_blob request
    TX->>NET: stream DATA frames
    NET->>CR: process_message to _process_request
    CR->>BR: callback for STREAM_DATA_TOPIC
    BR->>W1: submit _callback_wrapper

    W1->>BH: _callback_wrapper to handle_blob_cb
    BH->>W2: submit _read_stream blob_task
    BH->>CB: submit _run_blob_cb
    Note over W1: W1 returns immediately

    W2->>W2: _read_stream stream.read loop
    W2->>W2: future set_result
    Note over W2,CB: future completes and unblocks CB
    CB->>AD: blob_cb Adapter.call
    AD->>AD: future result blocks on callback pool only
    AD->>HD: self.cb request _handle_download produce
    AD->>TX: send_blob reply
    TX->>NET: stream reply frames
    NET->>CS: _process_reply sets in_receiving
    CS->>DL: send_request returns reply
```

- nvflare/fuel/f3/streaming/stream_utils.py: Added
CALLBACK_THREAD_POOL_SIZE and callback_thread_pool (CheckedExecutor).
Shut down callback_thread_pool in stream_shutdown() before
stream_thread_pool.
- nvflare/fuel/f3/streaming/blob_streamer.py: In handle_blob_cb, submit
blob_cb via callback_thread_pool.submit(self._run_blob_cb, ...) instead
of calling it directly. Added _run_blob_cb to run blob_cb in a
try/except, log exceptions, and call stream.task.stop(...) when
applicable.
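The two-pool pattern above can be sketched in miniature (a simplified illustration with hypothetical function names, not the actual `blob_streamer.py` code): stream work stays on the stream pool, while the potentially blocking callback is submitted to a separate pool, so waiting on `future.result()` can never starve the workers that must complete the future.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Two dedicated pools, mirroring stream_thread_pool / callback_thread_pool
stream_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="stream")
callback_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="callback")

def read_stream(future: Future, chunks):
    # Runs on the stream pool; completes the future once all chunks are read.
    future.set_result(b"".join(chunks))

def run_blob_cb(future: Future, blob_cb):
    # Runs on the callback pool; blocking on future.result() here cannot
    # starve the stream pool that must run read_stream.
    try:
        blob_cb(future.result())
    except Exception as e:
        print(f"blob_cb failed: {e}")     # real code also stops the stream task

def handle_blob_cb(chunks, blob_cb):
    fut = Future()
    stream_pool.submit(read_stream, fut, chunks)     # stream work: stream pool
    callback_pool.submit(run_blob_cb, fut, blob_cb)  # blocking wait: callback pool
    # Returns immediately — no stream worker is held while waiting on fut.
```

Shutting down `callback_pool` before `stream_pool` (as the PR does in `stream_shutdown()`) ensures pending callbacks can still see their futures completed.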

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Do not hold the lock around produce_item. It is not needed, and this operation can be slow; we do not want to hold up everything else during this time.
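A minimal sketch of the pattern (illustrative, not the actual NVFlare code): the slow `produce_item` call runs outside the lock, and the result is written back with a compare-and-store so a concurrent producer's result is not clobbered.

```python
import threading

class ItemCache:
    """Build the item OUTSIDE the lock; compare-and-store it under the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_item(self, key, produce_item):
        with self._lock:                    # fast path: short critical section
            item = self._cache.get(key)
        if item is not None:
            return item
        item = produce_item(key)            # slow work: NO lock held here
        with self._lock:                    # compare-and-store: first writer wins
            existing = self._cache.get(key)
            if existing is not None:
                return existing
            self._cache[key] = item
            return item
```

The trade-off is that two threads may both produce the item for the same key; the duplicate is simply discarded, which is usually far cheaper than serializing all callers behind one slow production.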

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fix numpy cross val sticky property.

Fix numpy cross val sticky property so there is not a warning message.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
)

Fix recipes and job api model/initial_model confusion
Changes picked from NVIDIA#4177

### Description

Fix recipes and job api model/initial_model confusion

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Update to match fix in main in NVIDIA#4201 

### Description

Updates nvflare/app_common/workflows/base_model_controller.py to match
NVIDIA#4201 after further changes from greptile comments.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…ils (NVIDIA#4206)

Many thanks to @GeorgeWang-nv for finding the issues, performing the root-cause
analysis, and suggesting possible fixes.

- Large byte streams can freeze when a blocking socket send holds the
per-connection SFM send lock.
- While that lock is held, unrelated traffic on the same connection
(including progress and coordination frames) is serialized behind it.
- This creates **Head-of-Line (HOL) blocking** at connection level and
can manifest as long apparent stalls.

- Add a bounded send timeout in the socket send path so one blocked send
cannot hold the connection lock indefinitely.
- Why this way: minimally invasive, directly targets lock-hold source,
and fails deterministically instead of hanging.

- Track ACK forward progress and fail fast when ACK offset does not
advance for a configurable interval.
- Why this way: protects sender flow control from waiting too long on
non-progressing streams, independent of generic ACK wait.

- Monitor send stall duration in heartbeat monitor; optionally close the
stalled connection to recover.
- Add a consecutive-check threshold before close to reduce false
positives from transient jitter.
- Why this way: staged recovery lever (warn-only first, then auto-close)
balances reliability and operational safety.

- `streaming_send_timeout`
- `streaming_ack_progress_timeout`
- `streaming_ack_progress_check_interval`
- `sfm_send_stall_timeout`
- `sfm_close_stalled_connection`
- `sfm_send_stall_consecutive_checks`
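The ACK-progress watchdog described above can be sketched as a small state machine (names are illustrative, not the actual `byte_streamer` API): any forward movement of the ACK offset resets the clock, and a check fails fast once no progress has been seen for the configured interval.

```python
import time

class AckProgressWatchdog:
    """Fail fast when the ACK offset stops advancing for too long."""

    def __init__(self, progress_timeout: float):
        self.progress_timeout = progress_timeout
        self.last_offset = -1
        self.last_progress_time = time.monotonic()

    def on_ack(self, offset: int):
        if offset > self.last_offset:          # forward progress: reset the clock
            self.last_offset = offset
            self.last_progress_time = time.monotonic()

    def check(self):
        stalled = time.monotonic() - self.last_progress_time
        if stalled > self.progress_timeout:
            raise TimeoutError(f"no ACK progress for {stalled:.1f}s")
```

This is independent of the generic ACK wait: the generic wait bounds total time, while this watchdog bounds time *between* increments, catching streams that trickle ACKs but never truly progress.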

- Added focused unit tests for each mitigation path with both positive
and negative cases.

- Config application, partial-write success, non-writable timeout,
closed-send path, and closing-state suppression.

- ACK progress timestamp update, progress-allowed completion,
no-progress timeout, legacy ack_wait behavior, and fast-fail precedence
when progress timeout is small.

- Warn-only behavior, close-enabled recovery, close-disabled
non-recovery, heartbeat behavior when healthy, send-state reset on
success/exception, warning emission, no-warning in healthy flow, and
intermittent-stall false-alarm suppression via consecutive guard.

- `pytest` on the three new F3 test modules: **24 passed**.

- Added a new section in `docs/user_guide/timeout_troubleshooting.rst`:
  - Runtime defaults for new stall-related parameters.
- Recommended rollout: warn-only first, then auto-close only if repeated
stall warnings persist.
- Suggested values from discussion: `sfm_send_stall_timeout=75`,
`sfm_send_stall_consecutive_checks=3`.
- Log interpretation guidance and false-alarm suppression expectations.

- This change prefers deterministic failure/recovery over silent long
hangs.
- Potential side effect when auto-close is enabled: transient connection
resets under severe network instability; mitigated by conservative
timeout plus consecutive-check guard.
- Defaults remain conservative (`sfm_close_stalled_connection=false`) so
operators can observe first and enable recovery when needed.
- Scope is intentionally focused on the critical Head-of-Line (HOL)
mitigation path and operational guardrails.

- [x] `python3 -m pytest
tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py
tests/unit_test/fuel/f3/streaming/byte_streamer_ack_watchdog_test.py
tests/unit_test/fuel/f3/sfm/sfm_stall_monitor_test.py`
- [x] `python3 -m black <changed_python_files>`
- [x] `python3 -m isort <changed_python_files>`
- [x] `python3 -m flake8 <changed_python_files>`

- Close socket connections on `send_frame` timeout to prevent
frame-boundary desync after partial writes.
- Preserve specific send error codes in `send_frame` for common socket
exceptions:
- map timeout-like exceptions to `CommError.TIMEOUT` (and close
connection),
  - map closed-socket exceptions to `CommError.CLOSED`,
  - keep `CommError.ERROR` as fallback for unknown exceptions.
- Add positive and negative unit tests in `socket_conn_timeout_test.py`:
  - Positive: timeout after partial write closes the connection.
- Positive: `TimeoutError` -> `TIMEOUT`; `BrokenPipeError` -> `CLOSED`.
- Negative: successful partial-write sends do not close; non-timeout
`CommError.CLOSED` does not force close; unknown exceptions remain
`ERROR`.
- Document stall timing relationship in `timeout_troubleshooting.rst`,
including the practical close window formula and outer-timeout sizing
guideline.

- `python3 -m pytest
tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py -q`
- Result: `10 passed`
…ent, hierarchical startup stability [skip ci] (NVIDIA#4218)

> ⚠️ **Depends on NVIDIA#4209** — The *Hierarchical FL Startup Stability*
section documents changes introduced by PR NVIDIA#4209 (currently open).
**Merge PR NVIDIA#4209 into `2.7` before merging this PR.** All other sections
cover already-merged PRs.

---

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor
Copy at CJ (PR NVIDIA#4210) and client-side memory lifecycle management (PR
- **F3 Streaming Reliability and Performance** — new section covering
HOL stall mitigation (PR NVIDIA#4206), stream pool starvation fix (PR
fix (PR NVIDIA#4204), and lock contention reduction (PR NVIDIA#4174)
- **Hierarchical FL Startup Stability** — new section covering
deployment timeout classification, startup grace period, selective
client exclusion, and metadata hardening (PR NVIDIA#4209 — pending merge),
with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

| PR | Area | Status |
|---|---|---|
| NVIDIA#4171 / NVIDIA#4172 | Stream pool starvation fix | Merged |
| NVIDIA#4174 | Lock contention reduction | Merged |
| NVIDIA#4167 | Streaming download retry | Merged |
| NVIDIA#4204 | RxTask self-deadlock fix | Merged |
| NVIDIA#4206 | HOL stall mitigation | Merged |
| NVIDIA#4210 | Zero tensor copy at CJ | Merged |
| NVIDIA#4211 | Client-side memory management | Merged |
| NVIDIA#4209 | Hierarchical FL startup stability | **Open — merge before this
PR** |

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds
`LazyDownloadRef` placeholders instead of materializing full tensors,
eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` /
jemalloc purge / `torch.cuda.empty_cache()` injected after every
`flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.
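The client-side cleanup described above can be sketched as a single helper (a simplification under stated assumptions: glibc `malloc_trim` is only available on Linux, and the `torch` step is skipped when PyTorch is absent — this is not the actual NVFlare hook):

```python
import ctypes
import gc

def release_memory_after_send():
    """Best-effort memory release, as injected after every flare.send()."""
    gc.collect()                                   # drop unreachable Python objects
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)    # return freed heap pages to the OS
    except OSError:
        pass                                       # non-glibc platform: nothing to trim
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()               # release cached CUDA allocator blocks
    except ImportError:
        pass                                       # torch not installed
    return True
```

In the real feature the frequency is configurable via `client_memory_gc_rounds`, since trimming every round trades a small latency cost for a flatter resident-memory profile.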

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress
watchdog, and stall detection/recovery. Includes recommended environment
variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a
dedicated `callback_thread_pool`, keeping stream workers free for
concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3
attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock`
released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside
`self.lock`; compare-and-store for cache write. Reduces model-download
latency under high client concurrency.

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client
must be observed once before absence triggers dead-job notification.
Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.
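The startup grace period bullet can be made concrete with a small sketch (illustrative names, not the actual server-side detector): a client's absence only counts as "dead" after that client has been observed alive at least once, which debounces slow starters on HPC filesystems.

```python
class DeadClientDetector:
    """Dead-client detection with a startup grace period."""

    def __init__(self):
        self._seen = set()

    def on_heartbeat(self, client: str):
        self._seen.add(client)                 # client observed alive at least once

    def is_dead(self, client: str, absent: bool) -> bool:
        # Absence only triggers a dead-job notification for clients that
        # previously checked in; never-seen clients are still starting up.
        return absent and client in self._seen
```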

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR NVIDIA#4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e enhancements (NVIDIA#4225)

This PR addresses several related bugs and improvements in the Client
API subprocess path, swarm learning, and the recipe interface.
Address FLARE-2743 (bug/5921094) and (bug/5921708)

---

When pass-through tensor streaming is enabled on the CJ, param
converters (e.g. `NumpyToPTParamsConverter`) were running on the CJ
*before* lazy references were materialized. This caused `"Could not
infer dtype of LazyDownloadRef"` warnings and potential data corruption
on large-model jobs using pass-through transfer.

The `ExProcessClientAPI` (subprocess path) created a
`FlareAgentWithFLModel` with no converters attached. Format conversion
was happening on the CJ side in `LauncherExecutor.execute()`, which
triggered issue #1.

`SwarmClientConfig.learn_task_check_interval` was not forwarded to the
swarm controller, so any customization of that parameter was silently
ignored.

**Deploy map `{"targets": [...]}` dict form**
The deploy map validation only handled list-form and `{"sites": [...]}`
dict-form entries. The `{"targets": [...]}` form (used by some job
generators) raised an unhandled error.

`min_clients`, `launch_external_process`, and `command` could not be set
through the recipe interface, forcing users to construct jobs manually.

**Deprecation: `FlareAgentWithCellPipe`**
The class has been superseded by constructing a `CellPipe` and
`FlareAgent` directly, or using the higher-level Client API
(`nvflare.client`).

---

**Fix: `FlareAgentWithFLModel.shareable_to_task_data()`**
The task name was read via `shareable.get_header(FLContextKey.TASK_NAME,
"")`, but this header is never populated in the subprocess path — the
task name is carried as the pipe `Message.topic` and stored in
`self.current_task.task_name` by `FlareAgent._on_task()`. An absent
header caused `task_name` to fall back to `""`, making
`ParamsConverter.process()` silently skip conversion. Fixed to use
`self.current_task.task_name if self.current_task else ""`, consistent
with `task_result_to_shareable()`.

`to_nvflare_converter_id` set on launcher executor
`LauncherExecutor.execute()` still applied CJ-side converters if users
explicitly set `from_nvflare_converter_id` / `to_nvflare_converter_id`,
while the subprocess was also applying its own default converters —
resulting in double conversion. Fixed by removing the converter calls
from `LauncherExecutor.execute()` entirely. Since `LauncherExecutor` is
only ever subclassed by `ClientAPILauncherExecutor` (subprocess path),
this has no effect on the in-process path.

---

**Converter logic removed from CJ launcher executors; moved to the
subprocess agent side.**

Previously, `PTClientAPILauncherExecutor` and
`TFClientAPILauncherExecutor` set up format converters in their
`initialize()` methods, and `LauncherExecutor.execute()` applied them on
the CJ before dispatching to the subprocess. This was the root cause of
issue #1 — the CJ was touching tensor data (triggering dtype inference)
before `LazyDownloadRef` placeholders were materialized in the
subprocess.

The converter setup blocks have been **completely removed** from both
launcher executors. Converters for the subprocess path now live entirely
inside the subprocess, wired into `FlareAgentWithFLModel` and executing
after full payload materialization.

Key changes:
- `FlareAgentWithFLModel` gains optional `from_nvflare_converter` /
`to_nvflare_converter` params. A lightweight `_ConverterContext` stub
(duck-typing `FLContext.get_prop/set_prop`) allows
`ParamsConverter.process()` to work without a real `FLContext` in the
subprocess.
- `SERVER_EXPECTED_FORMAT` is added to `ConfigKey` and propagated in the
subprocess launch config so the agent knows what format the server
expects.
- A shared factory `converter_utils.create_default_params_converters()`
is introduced, replacing four separate copies of the same
format-conditional logic across PT and TF executors.
- In-process executors (`PTInProcessClientAPIExecutor`,
`TFInProcessClientAPIExecutor`) are **unchanged in behavior** — they
still apply converters on the CJ side, which is correct since there is
no separate subprocess.

---

An earlier iteration kept the converter setup in `initialize()` but
added a `ClientAPILauncherExecutor.execute()` override that nulled the
converters out before calling `super().execute()` and restored them
after. This was removed because it left converters alive on the object
but always bypassed — confusing dead code. The current approach is
explicit and honest: launcher executors do not set converters at all,
and the subprocess handles its own conversion autonomously.

---

| Area | Files |
|------|-------|
| Core bug fixes | `ccwf_job.py`, `fed_utils.py` |
| Converter framework | `converter_utils.py` (new), `config.py`,
`flare_agent_with_fl_model.py`, `ex_process/api.py`,
`client_api_launcher_executor.py` (base) |
| PT launcher executor — converter removed |
`pt/client_api_launcher_executor.py` |
| PT in-process executor — refactored to shared factory |
`pt/in_process_client_api_executor.py` |
| TF launcher executor — converter removed + `initialize()` deleted |
`tf/client_api_launcher_executor.py` |
| TF in-process executor — refactored to shared factory |
`tf/in_process_client_api_executor.py` |
| Recipe | `pt/recipes/swarm.py` |
| Docs | `3rd_party_integration.rst`, `3rd_party_trainer.py` |
| Tests | 4 new test files, 4 modified test files, 2 integration YAMLs |

---

- **TF/Keras converter path is not directly unit-tested** due to
TF/PyTorch dependency conflicts in a shared environment. The
`create_default_params_converters` Keras branch is structurally
identical to the PT branch (which is fully tested). A dedicated TF-only
test environment could add coverage in a future pass.
- **`_ConverterContext` is a duck-type shim.** If more `ParamsConverter`
use cases arise in subprocess contexts, a formal lightweight interface
(separate from `FLContext`) would be worth extracting.
- **Converter plugin (`from_nvflare_converter_id`)** for the subprocess
path is currently not wired — only the auto-created default converters
are passed to the agent. Explicit user-provided converter IDs are still
only supported for the in-process path. This can be addressed in a
follow-up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ZiyueXu77 and others added 26 commits March 28, 2026 21:11
Fixes FLARE-2760, bug/5949101, bug/5948918.

Add ProdEnv info to edge example readme (already have it in
docs/user_guide/data_scientist_guide/job_recipe.rst)

Add model_state check to buff_model_manager

Add timeout arg to buff_recipe

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
… validate hang, round_timeout (NVIDIA#4270)

Fixes four bugs in NVFlare 2.7.2 RC12 affecting Swarm Learning with
`launch_external_process=True`
Implementation references @GeorgeWang-nv's original PR
NVIDIA#4263:
- Bug 1: different implementation from PR NVIDIA#4263
- Bug 2: accepts the implementation approach from PR NVIDIA#4263, with
small changes
- Bug 3: uses a different approach
- Bug 4: adds a timeout to the Swarm API, consolidating the timeouts
from 4 to 2 and exposing them to the Swarm recipes
- Bug 5: restores `self._max_resends` (private convention); subprocess
logging fixed (see below)
- Bug 6: guards ConnManager against RuntimeError after shutdown; 4
regression tests added

**Additional (review feedback):**
- Rename `SimpleSwarmLearningRecipe` → `SwarmLearningRecipe` everywhere;
backward-compat alias kept
- Fix docs: add missing required `min_clients` arg to all
`SwarmLearningRecipe` examples (TypeError for users copying doc
examples)
- CSE inventory scan: take LAST "best" key, not first — prevents initial
checkpoint named "best_*" from shadowing trained best model
- Add `pipe_type` / `pipe_root_path` to `SwarmLearningRecipe` so users
can select `FilePipe` without dropping to low-level API

(`via_downloader.py`)

**Root cause**: `_create_downloader()` subscribed to msg_root deletion
to clean up download transactions. The msg_root is deleted as soon as
blob envelopes are delivered, but `blob_cb` fires asynchronously —
secondary tensor downloads were still in flight when
`delete_transaction()` removed the ref from `_ref_table`, causing `"no
ref found" FATAL_SYSTEM_ERROR` on large models (≥1B parameters).

**Fix**: Remove `subscribe_to_msg_root` from `_create_downloader()`
entirely. Transaction lifecycle is now managed solely by `_monitor_tx()`
in `download_service.py`, which polls `is_finished()` every 5s and
cleans up only after all chunk downloads complete.
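The polling-based lifecycle described above can be sketched as follows. This is a minimal stand-in, not the NVFlare implementation: `DownloadTx`, `monitor_tx`, and `chunk_done` are hypothetical names illustrating why cleanup that waits for `is_finished()` cannot race with in-flight chunk downloads.

```python
import threading
import time

class DownloadTx:
    """Minimal stand-in for a download transaction (hypothetical names)."""

    def __init__(self, num_chunks: int):
        self._pending = num_chunks
        self._lock = threading.Lock()

    def chunk_done(self):
        with self._lock:
            self._pending -= 1

    def is_finished(self) -> bool:
        with self._lock:
            return self._pending == 0

def monitor_tx(tx: DownloadTx, cleanup, poll_interval: float = 0.01):
    # Clean up only after every chunk download has completed -- the
    # transaction is never torn down while downloads are still in flight.
    while not tx.is_finished():
        time.sleep(poll_interval)
    cleanup(tx)

cleaned = []
tx = DownloadTx(num_chunks=2)
t = threading.Thread(target=monitor_tx, args=(tx, cleaned.append))
t.start()
tx.chunk_done()          # secondary tensor downloads finish asynchronously
tx.chunk_done()
t.join(timeout=2)
print(len(cleaned))  # 1 -- cleanup ran exactly once, after all chunks
```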

(`cse_client_ctl.py`)

**Root cause**: `_prepare_local_model()` called
`submit_model_executor.execute()` which, for
`ClientAPILauncherExecutor`, launches a fresh subprocess with no trained
model state. The training subprocess had already exited; the model was
already saved to disk by `PTFileModelPersistor`.

**Fix**: Try the persistor first. Inventory key scan prefers the LAST
key containing `"best"` when `model_name` contains `"best"` — taking the
last "best" key avoids mistaking an initial checkpoint named `"best_*"`
(added first by `PTFileModelPersistor`) for the trained best model.
`isinstance` guard replaces `assert isinstance` (safe under `python
-O`). Falls back to executor for in-process mode compatibility.

(`flare_agent.py` + `via_downloader.py`)

**Root cause**: `FlareAgent._do_submit_result()` unconditionally waited
1800s for `DOWNLOAD_COMPLETE_CB` after every result send when
`pass_through_on_send=True`. For validate results (metrics only, no
tensors), `_finalize_download_tx()` creates no download transaction and
the callback never fires — subprocess blocks indefinitely.

**Fix**: Thread-local `was_download_initiated()` flag, set by
`_finalize_download_tx()` only when downloadable objects exist. Agent
returns immediately if `False` (validate result). Thread-local is
required because task pipe and metric pipe share the same `CoreCell`
(same `site_name + token + mode` → same FQCN → same `_CellInfo` cache
entry → same `fobs_ctx`); a plain flag would be clobbered by concurrent
metric serialisation from a different thread.
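The thread-isolation argument can be demonstrated with `threading.local`. A sketch (function names mirror the PR description but the bodies are illustrative): a flag set on the task thread is invisible to the metric thread, so sharing one `CoreCell` cannot clobber the gate.

```python
import threading

_tls = threading.local()

def set_download_initiated():
    # In the real flow, called only when downloadable objects exist.
    _tls.download_initiated = True

def was_download_initiated() -> bool:
    return getattr(_tls, "download_initiated", False)

def clear_download_initiated():
    _tls.download_initiated = False

seen = {}

def task_thread():
    set_download_initiated()
    seen["task"] = was_download_initiated()

def metric_thread():
    # Concurrent metric serialisation on another thread sees its own
    # (unset) copy of the flag -- no clobbering.
    seen["metric"] = was_download_initiated()

t1 = threading.Thread(target=task_thread)
t1.start(); t1.join()
t2 = threading.Thread(target=metric_thread)
t2.start(); t2.join()
print(seen)  # {'task': True, 'metric': False}
```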

(`swarm.py`)

**Root cause**: `SwarmClientConfig` hardcodes
`learn_task_ack_timeout=10s` and `final_result_ack_timeout=10s`. For
large models (≥2 GB), P2P tensor streaming takes minutes — the ACK times
out before download completes.

**Fix**: Add `round_timeout: float = 3600` to `SwarmLearningRecipe`.
Wires both `learn_task_ack_timeout` and `final_result_ack_timeout`;
`learn_task_timeout` is intentionally left `None` (unbounded) to avoid
capping per-round training time on slow hardware or 70B+ models.

(`client_api_launcher_executor.py`)

**Root cause**: `ClientAPILauncherExecutor.__init__()` stored
`max_resends` as `self._max_resends` but `TaskExchanger` (the parent)
uses `self.max_resends`. `PipeHandler` reads `self.max_resends` and saw
`None` (the parent default) instead of the configured value.

**Fix**: Restore `self._max_resends` as the private attribute throughout
`ClientAPILauncherExecutor` (consistent with private-member convention);
the executor builds its own config dict explicitly so it does not rely
on the inherited attribute.

(`ex_process/api.py`)

**Problem**: The subprocess had no logging configuration —
`logger.info()` calls were silently dropped. Wrapping all stdout via
`logger.info()` in the parent caused double-prefixed entries in
`log.txt` for NVFlare-formatted log lines.

**Fix**: `_configure_subprocess_logging()` in `ex_process/api.py` loads
the site's `log_config.json` unchanged, giving the subprocess identical
loggers to the parent (both consoleHandler + file handlers). The
parent's `log_subprocess_output()` now calls `_route_subprocess_line()`
which strips ANSI codes and checks for a `YYYY-MM-DD HH:MM:SS`
timestamp:
- Formatted NVFlare log line → `print()` to terminal only (file handler
already wrote it to `log.txt`)
- Raw `print()` from user training script → `logger.info()` so it
reaches `log.txt`

Regression tests:
`test_log_subprocess_output_formatted_lines_not_double_logged` + 5
`TestRouteSubprocessLine` tests.
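The routing rule above amounts to: strip ANSI codes, then branch on whether the line starts with a `YYYY-MM-DD HH:MM:SS` timestamp. A self-contained sketch (the regexes and the `route_subprocess_line` helper are illustrative, not the NVFlare code):

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# A line is "already formatted" if it starts with an NVFlare-style timestamp.
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def route_subprocess_line(line: str, print_fn, log_fn):
    clean = ANSI_RE.sub("", line)
    if TS_RE.match(clean):
        print_fn(clean)   # file handler already wrote it; terminal only
    else:
        log_fn(clean)     # raw print() from user script -> into log.txt

printed, logged = [], []
route_subprocess_line(
    "\x1b[32m2026-03-28 21:11:00 - INFO - hello\x1b[0m",
    printed.append, logged.append)
route_subprocess_line("epoch 1 loss=0.42", printed.append, logged.append)
print(printed, logged)
```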

(`conn_manager.py`)

**Root cause**: `process_frame_task()` and `start_connector()` could
raise unhandled `RuntimeError` when called after the executor was shut
down, causing noisy tracebacks during job teardown.

**Fix**: Guard `conn_mgr_executor.submit()` with try/except
`RuntimeError` (log debug, skip). Add `if self.stopped: return` early
exit in `process_frame_task()`.

**Regression tests**: 4 new tests in
`test_conn_manager_shutdown_race.py`.
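The guard pattern is simple: `ThreadPoolExecutor.submit()` raises `RuntimeError` once the executor is shut down, and during teardown that should be logged and swallowed rather than propagated. A minimal sketch (`safe_submit` is a hypothetical helper, not the ConnManager code):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("conn_mgr")

def safe_submit(executor, fn, *args):
    # After executor.shutdown(), submit() raises RuntimeError; during job
    # teardown we log at debug level and drop the frame instead of raising.
    try:
        executor.submit(fn, *args)
        return True
    except RuntimeError:
        logger.debug("executor already shut down; frame dropped")
        return False

ex = ThreadPoolExecutor(max_workers=1)
ok_before = safe_submit(ex, lambda: None)
ex.shutdown(wait=True)
ok_after = safe_submit(ex, lambda: None)
print(ok_before, ok_after)  # True False
```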

`SwarmLearningRecipe` always created a `CellPipe`; users needing
`FilePipe` (restricted networks, third-party integrations) had to drop
to the low-level `CCWFJob` + `SwarmClientConfig` API.

**Change**: Add `pipe_type: str = "cell_pipe"` and `pipe_root_path:
Optional[str] = None`:
- `"cell_pipe"` (default): unchanged — zero-copy PASS_THROUGH
forwarding, ~1 GB RAM overhead
- `"file_pipe"`: `FilePipe` created with `root_path` defaulting to
`{WORKSPACE}/pipe` (resolved at runtime); custom absolute path validated
to exist at recipe construction time
- Warnings for `pipe_root_path` ignored with `cell_pipe`, and
`file_pipe` with `launch_external_process=False`
- `ScriptRunner.__init__` gains `task_pipe` parameter forwarded to
`BaseScriptRunner`

9 new tests in `TestSwarmLearningRecipePipeType`.

| File | Bug | Change |
|------|-----|--------|
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | 1 + 3 |
Remove `subscribe_to_msg_root`; add thread-local `_tls`,
`was_download_initiated()`, `clear_download_initiated()` |
| `nvflare/client/flare_agent.py` | 3 | Check `was_download_initiated()`
after `send_to_peer()`; return immediately for validate results |
| `nvflare/app_common/ccwf/cse_client_ctl.py` | 2 | Persistor-first
`_prepare_local_model()` with last-"best"-key inventory scan and
isinstance guard |
| `nvflare/app_opt/pt/recipes/swarm.py` | 4 + rename + pipe | Add
`round_timeout`, `pipe_type`, `pipe_root_path`; rename to
`SwarmLearningRecipe`; backward-compat alias kept |
| `nvflare/job_config/script_runner.py` | pipe | Add `task_pipe` param
to `ScriptRunner`, forwarded to `BaseScriptRunner` |
| `nvflare/app_common/ccwf/recipes/swarm.py` | rename | Export
`SwarmLearningRecipe` alongside `SimpleSwarmLearningRecipe` |
| `nvflare/app_opt/pt/recipes/__init__.py` | rename | Add
`SwarmLearningRecipe` to lazy imports and `__all__` |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | 5 |
Restore `self._max_resends` (private convention) |
| `nvflare/app_common/launchers/subprocess_launcher.py` | logging |
`_route_subprocess_line()`: formatted lines → `print()`, raw lines →
`logger.info()` |
| `nvflare/client/ex_process/api.py` | logging |
`_configure_subprocess_logging()`: apply site log config unchanged (same
handlers as parent) |
| `nvflare/fuel/f3/sfm/conn_manager.py` | 6 | Guard executor submit +
frame processing against post-shutdown RuntimeError |
| `docs/programming_guide/memory_management.rst` | docs | Add missing
`min_clients` to example |
| `docs/user_guide/data_scientist_guide/available_recipes.rst` | docs |
Add missing `min_clients`; rename to `SwarmLearningRecipe` |
| `docs/programming_guide/controllers/client_controlled_workflows.rst` |
docs | Add missing `min_clients` (x2) |
| `docs/programming_guide/timeouts.rst` | docs | Rename to
`SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/README.md` | 4 + rename |
Add `round_timeout` to config table; rename to `SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/job.py` | rename | Update
import to `SwarmLearningRecipe` |
| `tests/unit_test/fuel/utils/fobs/test_via_downloader_msg_root.py` | 1
| 4 new tests |
| `tests/unit_test/client/test_download_initiated_gating.py` | 3 | 7 new
tests |
| `tests/unit_test/app_common/ccwf/test_cse_persistor_fallback.py` | 2 |
10 new tests (incl. initial-ckpt-shadowing regression) |
| `tests/unit_test/fuel/f3/sfm/test_conn_manager_shutdown_race.py` | 6 |
4 new regression tests |
| `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` |
logging | 6 new tests for `_route_subprocess_line` and double-logging
prevention |
| `tests/unit_test/recipe/swarm_recipe_test.py` | rename + pipe |
Updated to `SwarmLearningRecipe`; 9 new
`TestSwarmLearningRecipePipeType` tests |
|
`tests/unit_test/app_common/executors/client_api_launcher_executor_test.py`
| 5 | Updated: `_max_resends` assertions |
| `tests/unit_test/recipe/component_config_verification_test.py` |
rename | Updated to `SwarmLearningRecipe` |

- [x] 250+ unit tests pass across all affected modules
- [x] Bug 1: `test_via_downloader_msg_root.py` — verifies
`subscribe_to_msg_root` never called, method removed,
`DOWNLOAD_COMPLETE_CB` unaffected
- [x] Bug 2: `test_cse_persistor_fallback.py` — verifies persistor-first
path, last-"best"-key preference, initial-ckpt-shadowing regression,
isinstance guard, all fallback paths
- [x] Bug 3: `test_download_initiated_gating.py` — verifies thread
isolation, validate returns immediately (<1s), train waits,
clear-before-send
- [x] Bug 4: covered by existing recipe and ccwf tests
- [x] Bug 5: `client_api_launcher_executor_test.py` updated
(`_max_resends` assertions)
- [x] Bug 6: `test_conn_manager_shutdown_race.py` — 4 tests:
executor-shutdown no-raise, stop() no-raise, stopped early-return,
Prefix.from_bytes not called
- [x] Logging: `subprocess_launcher_test.py` — double-logging
prevention, ANSI stripping, plain-line capture, partial-timestamp
detection
- [x] Pipe type: `swarm_recipe_test.py` — 9 tests: default cell_pipe,
file_pipe instance, workspace template, custom path, invalid type,
warnings, path validation
- [x] Rename: backward-compat `SimpleSwarmLearningRecipe` import test
passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
update example readme

update example readme, remove non-existent links


---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(large-model deadlock)

`big_model_4g` CI test reliably failed with:

```
RuntimeError: failed to download from site-X_active
```

after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download
retries).

**Root cause 1 — ordering bug in `_finalize_external_execution()`:**

The subprocess client API works by keeping the subprocess's CellNet
connection alive
after it sends its result, so the server can pull large tensors directly
from the
subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To
prevent the
subprocess from exiting before the download completes,
`_do_submit_result()` blocks
on a `download_done.wait(1800 s)` event that fires only after the server
has
finished downloading all tensors.

However, `_finalize_external_execution()` called `stop_task()`
**synchronously**,
which sends `SIGTERM` to the subprocess immediately after `execute()`
receives the
result — before `ClientRunner` had even sent `SubmitUpdate` to the
server, let alone
before the server had started downloading. This tore down the subprocess
cell,
causing every server download attempt to hit `"no connection to
site-X_active"` /
`"cannot forward req: no path"`.

The subprocess-side wait was therefore unreachable: the process was
killed
externally before it could block.

**Root cause 2 — `launch_once=True` subprocess exits after round 1:**

`_do_submit_result()` called `os._exit(0)` unconditionally at the end of
the
`CellPipe + pass_through_on_send` path. For `launch_once=False` (one
subprocess per
round) this is correct — the process should exit immediately after the
download so
the deferred-stop poller unblocks. But for `launch_once=True` (one
subprocess for
all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the
subprocess
was killed after round 1, leaving rounds 2–N unhandled and the job
hanging.

---

**1. Deferred `stop_task()` (`launcher_executor.py`)**

Instead of calling `stop_task()` synchronously,
`_finalize_external_execution()`
now starts a background thread (when `_stop_task_wait_timeout > 0`) that
polls
`check_run_status()` until the subprocess exits naturally (i.e. after
`download_done` fires and the subprocess unblocks), then calls
`stop_task()`.
`execute()` returns immediately, so `ClientRunner` can send
`SubmitUpdate` and the
server can connect to the still-alive subprocess cell to download
tensors.

**2. Round-boundary coordination (`_deferred_stop_event`)**

Without additional synchronization, the deferred thread from round N
could fire
*after* round N+1's `launch_task()` call. Since `SubprocessLauncher`
guards
`_start_external_process()` with `if self._process is None`, seeing a
not-yet-cleared
reference to the exited round-N process causes it to skip starting a new
subprocess.
Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with
`"External process has not called flare.init and run status becomes
success"`.

A `threading.Event` (`_deferred_stop_event`, initially set) coordinates
the two:

- **Cleared** in `_finalize_external_execution()` just before the
deferred thread starts.
- **Set** unconditionally in a `finally` block when the deferred thread
completes.
- `_initialize_external_execution()` **waits** on this event before
calling
  `launch_task()`, with a forced `stop_task()` fallback if it times out.

In practice the wait is 0–1 s (subprocess exits naturally after
download, well
before the server completes global aggregation and dispatches the next
round's task),
so there is no meaningful latency impact.
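The event handshake described in points 1 and 2 can be sketched as follows. This is an illustrative reduction (class and method names are hypothetical; `process_exited.wait()` stands in for the `check_run_status()` polling loop), showing why a stale round-N stop can never overlap round N+1's launch:

```python
import threading

class DeferredStopper:
    """Sketch of the deferred-stop coordination (hypothetical names)."""

    def __init__(self):
        self._deferred_stop_event = threading.Event()
        self._deferred_stop_event.set()  # initially set: no stop pending
        self.stopped = False

    def finalize(self, process_exited: threading.Event):
        # Instead of a synchronous stop_task(): clear the event, then wait
        # in the background for the subprocess to exit naturally.
        self._deferred_stop_event.clear()

        def poll():
            try:
                process_exited.wait()   # stands in for check_run_status()
                self.stopped = True     # stop_task()
            finally:
                self._deferred_stop_event.set()

        threading.Thread(target=poll, daemon=True).start()

    def initialize_next_round(self, timeout: float = 2.0) -> bool:
        # The next round's launch_task() waits here, so a stale process
        # reference from round N cannot block starting round N+1.
        return self._deferred_stop_event.wait(timeout)

s = DeferredStopper()
exited = threading.Event()
s.finalize(exited)
exited.set()                 # subprocess exits after the download completes
ready = s.initialize_next_round()
print(ready, s.stopped)  # True True
```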

**3. `ClientAPILauncherExecutor` opt-in
(`client_api_launcher_executor.py`)**

`_stop_task_wait_timeout` is set to `download_complete_timeout` (default
1800 s) in
`ClientAPILauncherExecutor.__init__()`, enabling the deferred path only
for the
subprocess client API where large-tensor downloads are expected. The
base
`LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no
change).

**4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`,
`ex_process/api.py`)**

`prepare_config_for_launch()` now writes `launch_once` (derived from
`launcher.needs_deferred_stop()`) into the subprocess config file. The
subprocess
reads it via `ClientConfig.get_launch_once()` and passes it to
`FlareAgent`.
`_do_submit_result()` branches on `_launch_once`:

| `launch_once` | Behaviour after download gate |
|---|---|
| `False` (one subprocess per round, e.g. `pt-client-api`) |
`os._exit(0)` called directly — deferred-stop poller unblocks
immediately (original behaviour preserved) |
| `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) |
`atexit.register(os._exit, 0)` registered once — subprocess continues to
next round; `os._exit` fires only when `main()` finally returns,
bypassing non-daemon CoreCell thread cleanup |
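The two exit behaviours in the table reduce to a small branch: exit now, or register `os._exit` with `atexit` so it only fires when `main()` returns. A sketch with the exit and register calls injected so it is safe to run (the helper name is hypothetical):

```python
import atexit

def exit_after_download(launch_once: bool, exit_fn, register_fn=atexit.register):
    # launch_once=False: one subprocess per round -> exit immediately so the
    # parent's deferred-stop poller unblocks.
    # launch_once=True: keep serving rounds; only arrange a hard exit for
    # when main() finally returns (bypassing non-daemon thread cleanup).
    if launch_once:
        register_fn(exit_fn, 0)
        return "registered"
    exit_fn(0)
    return "exited"

calls = []
mode_a = exit_after_download(
    False,
    lambda code: calls.append(("exit", code)),
    lambda fn, code: calls.append(("atexit", code)))
mode_b = exit_after_download(
    True,
    lambda code: calls.append(("exit", code)),
    lambda fn, code: calls.append(("atexit", code)))
print(mode_a, mode_b, calls)  # exited registered [('exit', 0), ('atexit', 0)]
```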

`_resolve_launch_once()` safely fetches the launcher even when
`self.launcher` is
still `None` at `initialize()` time (resolved directly from the engine
component
registry).

**5. Pipe handler identity guard and safe close (`task_exchanger.py`)**

Pipe handler creation is refactored into `_create_pipe_handler()`, which
binds a
per-handler status callback that checks `self.pipe_handler is _h` before
acting.
This prevents a late `PEER_GONE` from round N's (now-stale) handler from
stopping
round N+1's handler. `stop(close_pipe=False)` is used because
`CellPipe.close()`
is irreversible — closing it in the status callback would prevent the
next round
from communicating. An explicit `self.pipe.close()` is added in
`END_RUN` instead.
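The identity guard is a closure that captures its own handler and compares it to the current one before acting. A reduced sketch (class names here are stand-ins for `TaskExchanger` and its pipe handler):

```python
class Handler:
    def __init__(self):
        self.stopped = False
        self.on_status = None

    def stop(self):
        self.stopped = True

class Exchanger:
    def __init__(self):
        self.pipe_handler = None

    def _create_pipe_handler(self):
        h = Handler()

        def on_status(_h=h):
            # Act only if _h is still the current handler: a late PEER_GONE
            # from round N's stale handler must not stop round N+1's.
            if self.pipe_handler is _h:
                _h.stop()

        h.on_status = on_status
        self.pipe_handler = h
        return h

ex = Exchanger()
old = ex._create_pipe_handler()
new = ex._create_pipe_handler()   # next round replaces the handler
old.on_status()                   # stale callback: ignored
new.on_status()                   # current callback: stops its own handler
print(old.stopped, new.stopped)   # False True
```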

**6. Defensive logging and heartbeat error handling
(`pipe_handler.py`)**

`_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status
when a send
is suppressed. Peer-gone detection logs the elapsed time since the last
heartbeat.
Heartbeat sends are wrapped in `try/except` so a broken pipe sets
`asked_to_stop`
and breaks the heartbeat loop cleanly instead of propagating an
unhandled exception.
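The heartbeat error handling can be illustrated with a loop whose send is wrapped in `try/except` so a broken pipe flips the stop flag instead of escaping the thread (the loop and its state dict are illustrative, not the `PipeHandler` code):

```python
def heartbeat_loop(pipe_send, state, max_beats=10):
    # A broken pipe sets asked_to_stop and ends the loop cleanly instead of
    # propagating an unhandled exception out of the heartbeat thread.
    beats = 0
    while not state["asked_to_stop"] and beats < max_beats:
        try:
            pipe_send("HEARTBEAT")
            beats += 1
        except BrokenPipeError:
            state["asked_to_stop"] = True
    return beats

sent = []

def flaky_send(msg):
    if len(sent) >= 3:
        raise BrokenPipeError("peer gone")
    sent.append(msg)

state = {"asked_to_stop": False}
beats = heartbeat_loop(flaky_send, state)
print(beats, state["asked_to_stop"])  # 3 True
```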

**7. New unit tests (`test_download_complete_gating.py`,
`test_download_initiated_gating.py`)**

Two new test files covering the subprocess download-gating behaviour
(the
`download_done` wait, the validate-path fast return, and `launch_once`).
Both use `FlareAgent.__new__()` to construct minimal agent
stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker:

- An `autouse` `_no_os_exit` fixture patches
`nvflare.client.flare_agent.os._exit`
  to a no-op for every test in the file.
- `_make_agent()` sets `agent._launch_once = False` (the per-round path
that calls
  `os._exit` directly, making the fixture's patch the active guard).

---

| File | Change |
|------|--------|
| `nvflare/app_common/executors/launcher_executor.py` | Deferred
`stop_task()` background thread + `_deferred_stop_event` round-boundary
coordination |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Set
`_stop_task_wait_timeout = download_complete_timeout`; add
`_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config |
| `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()`
abstract method + idempotency/thread-safety note on `stop_task()` |
| `nvflare/app_common/launchers/subprocess_launcher.py` | Implement
`needs_deferred_stop()`; add info logging for process start/stop |
| `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe
handler creation into `_create_pipe_handler()` with identity-checking
callback; use `close_pipe=False` to prevent irreversible
`CellPipe.close()` |
| `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in
pipe status log message |
| `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send
failures, peer-gone elapsed time); heartbeat send error handling |
| `nvflare/client/config.py` | Add `LAUNCH_ONCE` config key +
`get_launch_once()` |
| `nvflare/client/flare_agent.py` | `launch_once`-aware
`_do_submit_result()`: direct `os._exit` vs `atexit` |
| `nvflare/client/ex_process/api.py` | Pass `launch_once` to
`FlareAgentWithFLModel` |
| `tests/unit_test/client/test_download_complete_gating.py` | **New** —
tests for DOWNLOAD_COMPLETE_CB registration, download-done wait,
timeout, status logging, and cleanup |
| `tests/unit_test/client/test_download_initiated_gating.py` | **New** —
tests for thread-local download-initiation detection (validate-path fast
return, no spurious 1800 s wait) |
|
`tests/unit_test/app_common/executors/client_api_launcher_executor_test.py`
| New tests for deferred stop, `_deferred_stop_event`,
`_stop_task_wait_timeout` |

- **Release notes**: large-model subprocess reliability section, 7 bug
fix entries, edge recipe improvements
- **available_recipes**: remove \`SimpleSwarmLearningRecipe\` alias, add
\`round_timeout\`, subprocess tuning notes
- **timeout_troubleshooting**: subprocess result timeout and Swarm P2P
transfer scenarios, new quick-reference rows
- **timeouts.rst**: \`max_resends\`, \`np/tensor_min_download_timeout\`,
\`round_timeout\` cross-refs, updated Swarm example
- **memory_management.rst**: client training process memory cleanup
section, \`NVFLARE_CLIENT_MEMORY_PROFILE\` env var
- **client_controlled_workflows.rst**: \`round_timeout\` in Swarm
examples, Client Dropout Tolerance section
- **industry_use_cases.rst**: FAITE, JP Morgan/BNY/RBC, ORNL OLCF, AV,
Holoscan, Data Federation Mesh; FLARE Day talk links added

- **examples/README.md**: remove Section 2 step-by-step examples
(notebooks no longer exist), fix CIFAR-10 paths (\`cifar10/\` →
\`cifar10/pt/\`), fix \`lr-newton-raphson\` → \`hello-world/hello-lr\`,
remove dead \`code-pre-install\` entry, renumber sections
- **examples/advanced/README.md**: fix CIFAR-10 paths, remove dead
\`random_forest\`, \`brats18\`, \`prostate\`, \`fl_hub\` entries
- **examples/hello-world/README.md**: remove broken \`hello-ccwf\` link
- **docs/examples/medical_image_analysis.rst**: fix
\`brats18\`/\`prostate\` paths (\`examples/advanced/\` → \`research/\`)
- **timeout_troubleshooting.rst**: fix LLM example
\`submit_task_result_timeout\` (1200 → 1800) to be ≥
\`submit_result_timeout\`; clarify note on setting both timeouts
consistently

- **gettingStarted.astro**: fix two broken \`getting_started/\` links —
Step 3 now points to \`job_api/pt/src/cifar10_fl.py\`, Step 4 to
readthedocs quickstart
- **series.astro**: fix \`getting_started\` → \`hello-world\`,
\`hello-fedavg\` → \`hello-pt\`, CIFAR-10 paths,
\`brats18\`/\`prostate\` → \`research/\`, \`xgboost_secure\` path
- **tutorials.astro**: add 17 missing examples — Job Recipe, Logging
Tutorial, Hello PyTorch Lightning, Hello PyTorch Lightning Eval, Hello
Differential Privacy, Hello Flower, Federated Logistic Regression with
Newton-Raphson, Split Learning, PSI, Tensor Streaming for LLMs, AMPLIFY
Protein Model, Confidential Computing Provision, Cross-Edge FL, BraTS18,
FedHCA2, FedOBD, Federated Prostate Segmentation

- [ ] Verify RST renders correctly
- [ ] Check all external links resolve
- [ ] Verify web tutorial catalog links are accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Summary of changes for the ScaffoldRecipe + Client API scaffold_c_diff
KeyError:

1. Defensive check in scaffold_aggregate_fn
File: nvflare/app_common/workflows/scaffold.py

Before using _result.meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF], the
code now checks that the key exists. If it is missing, it raises a
ValueError that:

- Names the client (if present in meta)
- States that `scaffold_c_diff` is required in `FLModel.meta`
- Explains that the client must use `PTScaffoldHelper`: `init(model)`,
`model_update()` during training, `terms_update()` after training, and
send `get_delta_controls()` in meta
- Points to `nvflare.app_opt.pt.scaffold.PTScaffoldHelper`

So instead of an opaque `KeyError: 'scaffold_c_diff'`, users get a clear
message and next steps.
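The defensive check amounts to validating the meta key before use and raising a `ValueError` with guidance. A hedged sketch, not the NVFlare code (the constant value and the `check_scaffold_meta` helper are illustrative):

```python
class AlgorithmConstants:
    SCAFFOLD_CTRL_DIFF = "scaffold_c_diff"  # assumed key name

def check_scaffold_meta(meta: dict):
    # Fail with actionable guidance instead of an opaque
    # KeyError: 'scaffold_c_diff' during aggregation.
    if AlgorithmConstants.SCAFFOLD_CTRL_DIFF not in meta:
        client = meta.get("client_name", "<unknown>")
        raise ValueError(
            f"missing '{AlgorithmConstants.SCAFFOLD_CTRL_DIFF}' in "
            f"FLModel.meta from client {client}: the client script must use "
            "PTScaffoldHelper and send get_delta_controls() in meta"
        )
    return meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF]

ok = check_scaffold_meta({"scaffold_c_diff": {"w": 0.5}})
try:
    check_scaffold_meta({"client_name": "site-1"})
    raised = False
except ValueError as e:
    raised = "site-1" in str(e)
print(ok, raised)  # {'w': 0.5} True
```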

2. Stronger docstring for ScaffoldRecipe
File: nvflare/app_opt/pt/recipes/scaffold.py

The Client script requirement section now explicitly states that:

- Scaffold is not like `FedAvgRecipe`: the client must use
`PTScaffoldHelper`
- The client must set `meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF] =
scaffold_helper.get_delta_controls()` in the returned `FLModel`
- A plain `flare.receive`/`flare.send` loop without `PTScaffoldHelper`
will cause aggregation to fail

…A#4295)

Fixes a Remote Code Execution (RCE) vulnerability in the FOBS
deserialization subsystem reported in NVBug #5968367.

Root cause: Packer.unpack() validated the decomposer name (__fobs_dc__)
against BUILTIN_DECOMPOSERS but never validated the attacker-controlled
type name (__fobs_type__) before passing it to load_class(), which calls
importlib.import_module(). An authenticated FL participant could craft a
malicious payload with a trusted builtin decomposer and an arbitrary
Python class path to achieve full RCE on the aggregation server.

Fix: Added a BUILTIN_TYPES allowlist in builtin_decomposers.py covering
all types legitimately used across the NVFlare codebase and examples.
The Packer.unpack() method now validates type_name against this
whitelist (and the runtime-registered _decomposers) before calling
load_class().

Public API: Added add_type_name_whitelist(*type_names) for users to
extend the whitelist with custom types at runtime.
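The essential fix is validating the attacker-controllable type name against the allowlist before any import happens. A minimal sketch of that pattern (the allowlist contents and the `load_class_checked` helper are illustrative, not the FOBS code):

```python
import importlib

BUILTIN_TYPES = {"builtins.dict", "collections.OrderedDict"}  # illustrative
_runtime_whitelist = set()

def add_type_name_whitelist(*type_names):
    # Public extension point, as described above.
    _runtime_whitelist.update(type_names)

def load_class_checked(type_name: str):
    # Validate the attacker-controllable type name BEFORE any import --
    # this is the step the vulnerable unpack path was missing.
    if type_name not in BUILTIN_TYPES and type_name not in _runtime_whitelist:
        raise ValueError(f"type {type_name!r} is not in the FOBS whitelist")
    module_name, _, class_name = type_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

ok = load_class_checked("collections.OrderedDict")
try:
    load_class_checked("os.system")   # crafted payload: rejected, never imported
    blocked = False
except ValueError:
    blocked = True
add_type_name_whitelist("fractions.Fraction")  # user extends at runtime
print(ok.__name__, blocked)  # OrderedDict True
```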

Security Impact
CVE severity: CVSS v3.1 8.8 High (AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H)
Attack vector: Authenticated FL participant sends crafted FOBS message
with __fobs_dc__ set to a trusted builtin decomposer and __fobs_type__
set to any importable Python class path
Impact: Full RCE on the NVFlare aggregation server


---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The startup and local folders of each startup kit were signed at
provisioning time, and at the verification stage NVFlare checks the
startup folder. In a CC environment, one more verification is done
before NVFlare starts, but that verification begins from the root folder
of the startup kit. This left a gap: the root folder and the transfer
folder were not signed. This PR changes the signing process to cover the
root folder and all subfolders.

…IA#4307)

Fix FilePipe heartbeat send timeout too short for heavy aggregation
workloads

### Summary
Increase FilePipe heartbeat send timeout from 5s to 600s in file_pipe.py

### Background
FilePipe implements heartbeats as file-based request-reply: the sender
creates a heartbeat file and waits up to N seconds for the receiver to
delete it. If the receiver doesn't read it within N seconds, the sender
deletes the file itself and the heartbeat is silently dropped.

The previous default of 5s was too short for nodes running as aggregator
in CCWF Swarm with large models (e.g. 8B parameters). During inter-round
aggregation, P2P model streaming (~15GB per client) causes GIL
contention and filesystem pressure that delays the receiver's
PipeHandler._read thread beyond 5s. Heartbeat files were deleted before
being read, causing _last_heartbeat_received_time to stall and
eventually triggering false PEER_GONE after heartbeat_timeout (30s).

### Why 600s is safe
_monitor_file returns immediately when the receiver deletes the file —
600s is only the ceiling if the receiver never reads it. Normal
operation is unaffected. Genuine peer death is detected independently by
PipeHandler.heartbeat_timeout on the receiver side, which is unrelated
to this send-side timeout.
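The file-based request-reply heartbeat can be sketched to show why the 600s ceiling is harmless: the sender returns as soon as the receiver deletes the file. This is an illustrative reduction, not the `FilePipe` implementation:

```python
import os
import tempfile
import threading
import time

def send_heartbeat(path: str, timeout: float) -> bool:
    # Create the heartbeat file, then wait for the receiver to delete it.
    # Returns immediately when the file disappears; timeout is only the
    # ceiling for a receiver that never reads it.
    with open(path, "w") as f:
        f.write("HB")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not os.path.exists(path):
            return True          # receiver read (deleted) the heartbeat
        time.sleep(0.01)
    os.remove(path)              # give up: heartbeat silently dropped
    return False

hb = os.path.join(tempfile.mkdtemp(), "heartbeat")

def slow_receiver():
    time.sleep(0.1)              # e.g. delayed by GIL / filesystem pressure
    os.remove(hb)

t = threading.Thread(target=slow_receiver)
t.start()
delivered = send_heartbeat(hb, timeout=5.0)  # generous ceiling, returns early
t.join()
print(delivered)  # True
```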

### Notes
This fix applies to FilePipe only. CellPipe (used in all current
production configs) is not affected — it sends heartbeats via
fire_and_forget with no send-side timeout.

### Future Enhancements
- Make the timeout configurable: expose heartbeat_send_timeout as a
FilePipe.__init__ parameter so users running extremely large models or
slow storage can tune it without touching source code.
- Enforce the invariant FilePipe.heartbeat_send_timeout ≥
PipeHandler.heartbeat_timeout: currently these two values are configured
independently and silently break each other if mismatched. The right fix
is for PipeHandler to pass its heartbeat_timeout down to the pipe at
start() time, making the relationship explicit and impossible to
misconfigure.


…DIA#4309)

## Summary

Adds a **receiver-side opt-in** mechanism so that ext-process Client Job
(CJ) cells automatically decode incoming model tensors as
`LazyDownloadRef` placeholders. The subprocess then downloads tensors
**directly from the source** (server in FedAvg, aggregator in Swarm) via
CellNet routing, completely bypassing materialisation inside the CJ.

Previously, the forward-path PASS_THROUGH was controlled by a
sender-side flag (`forward_pass_through` on `SwarmClientController`),
which required the sender to know the receiver's execution mode — unsafe
in mixed deployments and inapplicable to FedAvg. This PR replaces that
with a receiver-decides architecture: the CJ knows it is an ext-process
launcher and opts in locally.

## Problem

In ext-process mode, the CJ receives the full model from the
server/aggregator, materialises all tensors into memory, then
re-serialises and sends them to the subprocess. For large models (e.g.
Llama 8B at ~16 GB), this doubles peak CJ memory and adds unnecessary
serialisation latency.

The existing `forward_pass_through` flag on `SwarmClientController` was:
- **Redundant** for ext-process receivers (receiver-side opt-in handles
it)
- **Dangerous** for in-process receivers (forces LazyDownloadRef on
nodes that can't handle it)
- **Inapplicable** to FedAvg (no SwarmClientController involved)

## Solution

### Core mechanism (3 changes in `cell.py`)

- `Cell.__init__()`: add `self.decode_pass_through = False` flag
- `Adapter.call()`: check `self.cell.decode_pass_through` for incoming
REQUESTs (Swarm AUX path)
- `Cell._send_one_request()`: check `self.decode_pass_through` on reply
decode (FedAvg GET_TASK path)

### Trigger (1 change in `client_api_launcher_executor.py`)

- `ClientAPILauncherExecutor.initialize()`: set
`cell.decode_pass_through = True` when pipe is `CellPipe` (FilePipe has
no CellNet cell for subprocess downloads, so it stays off)

### Flag removal (changes in `swarm_client_ctl.py`)

- Remove `forward_pass_through` parameter, field, and sender-side header
stamping
- Add `_has_lazy_refs()` static method for data-driven detection
- Replace all flag-based guards with `_has_lazy_refs()` checks at 3
sites:
  - `_scatter()` local queue resolution
  - `_end_gather()` defensive panic check
  - `_do_learn()` GLOBAL_MODEL resolution
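The data-driven check can be sketched as follows (a minimal illustration assuming a `LazyDownloadRef` marker type; the real method is a static method on `SwarmClientController`):

```python
class LazyDownloadRef:
    """Stand-in for the placeholder marking a not-yet-downloaded tensor."""

def has_lazy_refs(obj) -> bool:
    # Mirrors the documented recursion: only dict/list/tuple containers
    # are walked, which is why params wrapped in an FLModel object would
    # be missed (see Known limitations).
    if isinstance(obj, LazyDownloadRef):
        return True
    if isinstance(obj, dict):
        return any(has_lazy_refs(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return any(has_lazy_refs(v) for v in obj)
    return False
```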

### Cleanup

- Remove dead `CellPipe` import from `task_exchanger.py`

## Data flow (FedAvg ext-process)

```
Server                CJ cell              Subprocess cell
  │                     │                      │
  │── model reply ──►   │                      │
  │                  decode with               │
  │                  PASS_THROUGH=True         │
  │                  → LazyDownloadRef         │
  │                     │                      │
  │                  CellPipe.send()           │
  │                  FOBS encode:              │
  │                  LazyDownloadRefDecomposer │
  │                  re-emits server datum     │
  │                     │── cell msg ────────► │
  │                     │                   Adapter.call()
  │                     │                   PASS_THROUGH=False
  │                     │                   ViaDownloaderDecomposer
  │                     │                   _download_from_remote_cell()
  │◄── download chunks ─┼──────────────────── │
  │── tensor bytes ────►┼─────────────────►   │
  │                     │                   real tensors in shareable
```

The CJ never materialises the full model. The subprocess downloads
directly from the server.

## Coverage matrix

| Scenario | CJ behavior | Subprocess behavior |
|---|---|---|
| FedAvg + CellPipe (ext-process) | `decode_pass_through=True` → LazyDownloadRef | Downloads from server via CellNet |
| FedAvg + FilePipe (ext-process) | `decode_pass_through=False` → real tensors | Receives real tensors via file |
| FedAvg in-process | `decode_pass_through=False` → real tensors | N/A (same process) |
| Swarm ext-process | `decode_pass_through=True` → LazyDownloadRef | Downloads from aggregator via CellNet |
| Swarm in-process | `decode_pass_through=False` → real tensors | N/A (same process) |
| Mixed deployment | Each CJ decides independently | No sender-side knowledge needed |

## Files changed

| File | Change |
|---|---|
| `nvflare/fuel/f3/cellnet/cell.py` | `decode_pass_through` flag + checks in `Adapter.call()` and `_send_one_request()` |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Set flag when pipe is CellPipe |
| `nvflare/app_common/ccwf/swarm_client_ctl.py` | Remove `forward_pass_through`, add `_has_lazy_refs()`, update 3 guard sites |
| `nvflare/app_common/executors/task_exchanger.py` | Remove dead `CellPipe` import |
| `tests/unit_test/app_common/ccwf/test_swarm_forward_memory_path.py` | Rewritten for data-driven detection tests |
| `tests/unit_test/app_common/ccwf/test_swarm_self_message_deadlock.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` | Remove `forward_pass_through` reference |
| `tests/unit_test/app_common/ccwf/test_lazy_ref_local_aggr.py` | Remove `forward_pass_through` reference |



## Known limitations

- `_has_lazy_refs()` only recurses into `dict`/`list`/`tuple`. If an
`FLModel` object (via `FLModelDecomposer`) carries `LazyDownloadRef` in
its `.params` dict, the check will miss it. Current Swarm uses DXO
(stored as plain nested dicts), so this is not an issue today.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…acement mid-transaction (NVIDIA#4314)

### Problem

BEFORE_TASK_EXECUTION is a global event broadcast to all FLComponents.
In a CCWF Swarm aggregator node, receiving swarm_report_learn_result aux
tasks from other sites while the local subprocess is training fires this
event concurrently. This causes TaskExchanger.handle_event() to stop and
recreate the PipeHandler while execute() is blocked in its polling loop
— the old handler (that the subprocess is writing to) is orphaned, and
the polling loop reads from the new empty handler forever. Silent
deadlock, no error, no PEER_GONE, no timeout.

Affects all pipe types (FilePipe, CellPipe). Independent of the FilePipe
TOCTOU fix in NVIDIA#4296.

### Fix

- Add a `threading.Event` `_executing` guard flag: `handle_event(BEFORE_TASK_EXECUTION)` skips handler replacement when `_executing.is_set()`
- `TaskExchanger.execute()` uses an ownership pattern so `super().execute()` from `LauncherExecutor` does not prematurely clear the flag
- `LauncherExecutor.execute()` sets the flag at the very top, before `_initialize_external_execution()`, covering the up-to-60 s `_wait_external_setup()` window
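The guard and ownership pattern can be sketched like this (names are illustrative; the real code is split across `TaskExchanger` and `LauncherExecutor`):

```python
import threading

class ExecGuardSketch:
    def __init__(self):
        self._executing = threading.Event()
        self.handler_generation = 0  # counts PipeHandler recreations

    def handle_event(self, event_type: str) -> None:
        if event_type == "BEFORE_TASK_EXECUTION":
            if self._executing.is_set():
                return  # a task is mid-flight: keep the current handler
            self.handler_generation += 1  # safe to replace the handler

    def execute(self, task_fn):
        # Ownership pattern: only the call that set the flag clears it,
        # so a nested super().execute() cannot clear it prematurely.
        owner = not self._executing.is_set()
        self._executing.set()
        try:
            return task_fn()
        finally:
            if owner:
                self._executing.clear()
```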

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…skip ci] (NVIDIA#4316)

## Summary

- Replace verbose bug-fix enumeration in Memory Management, F3
Streaming, and Hierarchical FL Startup Stability sections with concise
feature-level summaries
- Spell out "Client Job (CJ)" on first use
- Internal fix details are preserved in the existing Bug Fixes section

## Test plan
- [ ] Verify RST renders correctly in Sphinx (no broken references)
- [ ] Review rendered output at readthedocs preview

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…A#4317)

Fixes scikit-learn examples rejecting all client contributions due to
missing CURRENT_ROUND in fl_ctx

### Root cause

FedAvg tracks the current round as a plain Python int (`self.current_round`) but never writes it into `fl_ctx`. Custom aggregators (e.g. `CollectAndAssembleModelAggregator`) read `CURRENT_ROUND` from their `fl_ctx` for round validation in `accept_model()` and `aggregate_model()`. Their `fl_ctx` is set once at START_RUN and never updated, so it always returns `None` for `CURRENT_ROUND`, causing all contributions to be discarded (`contribution_round=0 != None`) and leaving the global model empty, which crashes clients in round 1.

### Why not a full fl_ctx sync (`self.aggregator.fl_ctx = self.fl_ctx`)?

`BaseModelController._process_result` overwrites `self.fl_ctx` with a per-task context on every result callback. A one-time sync at run start would immediately become stale, and a per-call sync in `_aggregate_one_result` only covers `accept_model`, not `aggregate_model`. Copying the entire task `fl_ctx` to the aggregator would also expose unrelated per-task state.

### Fix

At the start of each round, set `CURRENT_ROUND` directly on the aggregator's own stable `fl_ctx` (initialized at START_RUN, never overwritten):

```python
if self.aggregator and self.aggregator.fl_ctx:
    self.aggregator.fl_ctx.set_prop(AppConstants.CURRENT_ROUND, self.current_round, ...)
```

This is minimal, targeted, and safe for aggregators that don't use `fl_ctx` at all (an `fl_ctx` guard is added).

Verified: sklearn-kmeans simulation with 3 clients × 5 rounds; all contributions accepted every round, Homogeneity=1.0 from round 1 onward, no warnings or errors.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…rtup+local for non-CC (NVIDIA#4318)

## Issue

Commit 0b74dee (NVIDIA#4301) changed `SignatureBuilder` to sign the entire startup kit root directory (`get_ws_dir`) for all participants. This introduced an unintended side effect:

During provisioning, `sign_folders()` recursively walks the root and writes `signature.json` into every subdirectory, including `transfer/`. When `nvflare poc prepare` subsequently calls `_prepare_jobs_dir()`, it finds `transfer/` non-empty and incorrectly prompts:

job directory at .../transfer is already exists, replace with new one ? (y/N)

This happens even on a completely clean workspace (after `nvflare poc clean`), because `transfer/` is polluted during provisioning before `_prepare_jobs_dir` runs.

## Root Cause

Root-level signing was introduced to support Confidential Computing (CVM) environments, where the full startup kit must be verified before CVM launch. However:

1. For non-CC cases, only `startup/` and `local/` are verified at runtime; signing the root adds no security value.
2. The signature validation code does not check `transfer/` or the root in non-CC mode, making the extra signing pure side effect.
3. Signing the root breaks `nvflare poc prepare` on clean workspaces.

## Fix

In `SignatureBuilder.build()`, check `PropKey.CC_ENABLED` per participant:
- CC participants: sign from the root (`get_ws_dir`), unchanged from NVIDIA#4301
- Non-CC participants: sign only `startup/` and `local/`, restoring pre-NVIDIA#4301 behavior
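Under the stated assumptions (the helper name below is hypothetical; the real logic lives in `SignatureBuilder.build()` and checks `PropKey.CC_ENABLED`), the gating reduces to:

```python
import os

def dirs_to_sign(ws_dir: str, cc_enabled: bool) -> list:
    """Pick the signing scope per participant, per the fix described above."""
    if cc_enabled:
        # CC: the whole startup kit root must be verifiable before CVM launch.
        return [ws_dir]
    # Non-CC: only startup/ and local/ are verified at runtime, so signing
    # anything else (e.g. transfer/) is pure side effect.
    return [os.path.join(ws_dir, "startup"), os.path.join(ws_dir, "local")]
```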

## Affected Files

| File | Change |
|------|--------|
| `nvflare/lighter/impl/signature.py` | Add `PropKey` import; gate root signing behind `CC_ENABLED` check |
| `tests/unit_test/lighter/test_signature_builder.py` | New: 5 unit tests covering CC, non-CC, mixed participants, and the `transfer/` regression |

## Types of Changes

- Non-breaking change
- New tests added

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…les [skip-ci] (NVIDIA#4323)

Polishes the Memory Management section of the 2.7.2 release notes and
adds
proper RST benchmark tables.

**File:** `docs/release_notes/flare_272.rst`

- Rewrote the opening paragraph to flow naturally and remove the awkward
"We enhance three features to help improvements memory manangement"
sentence
- Fixed ``--`` em-dash inconsistency in the **Server and Client Memory
Cleanup** heading
- Fixed admonition directive formatting (missing blank line after title
caused render failure)
- Rewrote the summary paragraph (lines 107–115) for clarity and correct
grammar:
removed "memory RSS ( Resident Set Size) grows", "avoid extra memory
usage as well as"
- Fixed typos: ``manangement``, ``In-Procss``, ``client agv``
- Clarified improvement percentages are reductions by adding ``−`` sign
(was bare ``51%``, ``49%``)

Replaced the four unformatted plain-text tables (lines 117–137) with
proper
RST ``list-table`` directives, covering:

- **FedAvg** (5 GB model, 4 clients) — in-process and subprocess modes
- **Swarm Learning** (2.5 GB model, 3 sites) — in-process and subprocess
modes

- [x] Non-breaking change (documentation only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…skip ci] (NVIDIA#4332)

## Summary
Update the 2.7.2 release notes to call out two security fixes.

## Changes

**File:** `docs/release_notes/flare_272.rst`
  
```
- Security fix: Addressed a remote code execution vulnerability in FOBS deserialization.
- Security fix: Addressed a path traversal vulnerability in FileRetriever.
```

## Types of Changes

- [x] Non-breaking change (documentation only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA#4334)

### Description

- **Bug 1 (pipe destroyed between rounds):** `_pipe_status_cb` called
`pipe_handler.stop()` with the default `close_pipe=True`, which invokes
`FilePipe.close()` and can remove pipe directories (including the shared
root if `_remove_root=True`). Fixed by using `stop(close_pipe=False)`,
matching the pattern already in `TaskExchanger`.

- **Bug 2 (silent no-op on round 2+):** `BEFORE_TASK_EXECUTION` called
`start()` on a stopped handler whose worker threads had already nulled
themselves out (`pipe_handler.py` L390/L412), making the call a silent
no-op. Fixed by recreating a fresh `PipeHandler` per task execution,
same as `TaskExchanger`.

- **Bug 3 (stale callback race):** `_pipe_status_cb` referenced
`self.pipe_handler` directly, so a late `PEER_GONE` from a previous
handler could stop the newly created one. Fixed by using a handler-bound
identity-check closure, matching `TaskExchanger._create_pipe_handler`.

- **Bug 4 (bad payload crashes reader thread):** `_pipe_msg_cb` logged
an error on non-DXO data but did not return, falling through to
`send_analytic_dxo()` which raises `TypeError`. Since the callback runs
on the pipe reader thread, this exception terminated the handler as a
`PEER_GONE` path. Fixed by adding an early `return` after the error log.
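The fix for Bug 3 can be sketched as a handler-bound identity-check closure (names are illustrative, following the `TaskExchanger._create_pipe_handler` pattern described above; the `close_pipe=False` call reflects Bug 1's fix):

```python
class HandlerSketch:
    """Stand-in for a PipeHandler."""
    def __init__(self):
        self.stopped = False
    def stop(self, close_pipe: bool = True) -> None:
        self.stopped = True

class ReceiverSketch:
    def __init__(self):
        self.pipe_handler = None

    def make_status_cb(self, handler: HandlerSketch):
        # The closure captures the handler it was created for; a late
        # callback from a replaced handler is ignored instead of stopping
        # the newly created one.
        def cb(status: str) -> None:
            if self.pipe_handler is not handler:
                return  # stale callback from a replaced handler
            if status == "PEER_GONE":
                handler.stop(close_pipe=False)
        return cb
```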


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
The ml-to-fl/pt/, ml-to-fl/tf/, ml-to-fl/np/ directories are empty so
the relative link checker fails. Replace the three dead links with a
single reference to the ml-to-fl/ directory itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Git does not track empty directories, so ml-to-fl/pt/, tf/, np/ don't
exist in CI and the relative link checker fails. Remove the section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA#3896)

Fixes # .

Apply changes from NVIDIA#3878

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

# Fix FedAvg recipe round handling and model-selector events

## Problem

When using the recipe-based FedAvg flow (BaseModelController / FedAvg),
IntimeModelSelector saw `CURRENT_ROUND` as `None` and discarded valid
contributions with "Current round is: None", then in round 2 reported
"Expected round for this aggregation: 1". The "new best validation
metric" path and `GLOBAL_BEST_MODEL_AVAILABLE` never ran because
`BEFORE_AGGREGATION` was not fired in the InTime aggregation path.

## Changes

- **BaseModelController**
(`nvflare/app_common/workflows/base_model_controller.py`):
- Fire **ROUND_STARTED** at the start of each round in
`broadcast_model()` (after `set_fl_context(data)`) so widgets (e.g.
IntimeModelSelector) reset per round.
- In **`_process_result()`**, set **CURRENT_ROUND** on the callback
`fl_ctx` from the task data before firing `BEFORE_CONTRIBUTION_ACCEPT`,
so selectors and aggregators see the correct round.

- **FedAvg** (`nvflare/app_common/workflows/fedavg.py`):
- Fire **BEFORE_AGGREGATION** after all client results are in and before
`_get_aggregated_result()`, so IntimeModelSelector runs
`_before_aggregate`, logs "new best validation metric" when applicable,
and can fire `GLOBAL_BEST_MODEL_AVAILABLE`.
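The resulting per-round sequence, sketched with stand-in stubs (event names come from the description above; the `fire_event`/`set_prop` signatures are simplified assumptions, not the real FLComponent API):

```python
class FLCtxStub:
    def __init__(self):
        self.props = {}
    def set_prop(self, key, value):
        self.props[key] = value

class ControllerSketch:
    def __init__(self):
        self.events = []  # records (event_name, round seen by widgets)
    def fire_event(self, name, fl_ctx):
        self.events.append((name, fl_ctx.props.get("CURRENT_ROUND")))

def run_round(ctl, fl_ctx, round_num, gather_results, aggregate):
    fl_ctx.set_prop("CURRENT_ROUND", round_num)
    ctl.fire_event("ROUND_STARTED", fl_ctx)       # widgets reset per round
    results = gather_results()                    # broadcast + collect
    ctl.fire_event("BEFORE_AGGREGATION", fl_ctx)  # model selector runs here
    return aggregate(results)
```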

## Result

Round and contribution matching work correctly for the recipe FedAvg
flow, and model selection / best-model events behave the same as in the
scatter-and-gather flow.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Restore ROUND_STARTED event and CURRENT_ROUND prop on fl_ctx in FedAvg
  round loop (removed by NVIDIA#4317); IntimeModelSelector and other components
  depend on this event firing each round.

- Revert PASS_THROUGH channel from get_pipe_channel_name() back to
  _CellChannel.SERVER_COMMAND in ClientAPILauncherExecutor (NVIDIA#4309
  introduced the regression; NVIDIA#4312 was incorrectly marked no-diff at
  assessment time before NVIDIA#4309 was cherry-picked).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This entry was accidentally dropped during cherry-pick. Without it, fobs.unpack()
raises ValueError when deserializing Shareables from the XGBoost
histogram_based_v2 controller, which serializes DataSplitMode directly into
every Shareable sent to clients.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chesterxgchen
Collaborator Author

/build

@chesterxgchen
Collaborator Author

@greptileai

@pcnudde pcnudde closed this Mar 30, 2026
