[2.8] Narrow client failure reporting #4576

Merged: YuanTingHsieh merged 2 commits into NVIDIA:2.8 from YuanTingHsieh:codex/narrow-client-failure-report-2.8 on May 12, 2026

Conversation

@YuanTingHsieh (Collaborator) commented May 11, 2026

Summary

Narrow the client job-failure reporting path added for 2.8 so that the generic launcher code JobReturnCode.EXECUTION_ERROR is not promoted into an authoritative server-side job failure.

Why

client_api_qa can complete the FL workflow successfully and then hit local worker teardown noise that surfaces as a generic launcher execution error. Reporting that generic code to the server causes the server to fail an already-finished job.

This keeps the explicit failure cases used by the recent job-timeout status fix:

  • ProcessExitCode.EXCEPTION
  • ProcessExitCode.CONFIG_ERROR
  • ProcessExitCode.UNSAFE_COMPONENT
  • JobReturnCode.ABORTED

The K8s pending-timeout path still reports JobReturnCode.EXCEPTION, so it should continue to show FINISHED:EXECUTION_EXCEPTION rather than RUNNING.
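
The client-side half of this change can be pictured as a membership check against the set of reportable codes. The following is a minimal sketch, not the NVFlare implementation: the `ReturnCode` enum, its numeric values, the reason strings, and the `should_report` helper are all hypothetical stand-ins; only the name `REPORTABLE_JOB_FAILURES` and the list of retained codes come from this PR.

```python
from enum import IntEnum


class ReturnCode(IntEnum):
    """Stand-in for the codes named in this PR (numeric values are placeholders)."""

    SUCCESS = 0
    EXECUTION_ERROR = 1
    ABORTED = 9
    UNSAFE_COMPONENT = 101
    CONFIG_ERROR = 102
    EXCEPTION = 103


# Codes the client still reports to the server after this change.
# EXECUTION_ERROR is deliberately absent: it is too generic to be authoritative.
# The reason strings are illustrative only.
REPORTABLE_JOB_FAILURES = {
    ReturnCode.EXCEPTION: "exception",
    ReturnCode.CONFIG_ERROR: "config error",
    ReturnCode.UNSAFE_COMPONENT: "unsafe component",
    ReturnCode.ABORTED: "aborted",
}


def should_report(return_code: ReturnCode) -> bool:
    """Return True only for explicit failure codes the server should act on."""
    return return_code in REPORTABLE_JOB_FAILURES
```

Under these assumptions, a worker that exits with `EXECUTION_ERROR` during teardown produces no failure message, while the K8s pending-timeout path, which reports `EXCEPTION`, is still forwarded.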

@YuanTingHsieh force-pushed the codex/narrow-client-failure-report-2.8 branch from c75a717 to e596cb5 May 11, 2026 23:43
@YuanTingHsieh marked this pull request as ready for review May 11, 2026 23:43
Copilot AI review requested due to automatic review settings May 11, 2026 23:43
@YuanTingHsieh changed the title from "[codex] Narrow client failure reporting" to "[2.8] Narrow client failure reporting" May 11, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR narrows the client → server “job failure” reporting pathway introduced in 2.8 so that a generic launcher-level JobReturnCode.EXECUTION_ERROR reported by a client does not get promoted into an authoritative server-side job failure (avoiding failing jobs that already completed successfully server-side).

Changes:

  • Server: stop treating reported JobReturnCode.EXECUTION_ERROR as a run-failing signal in process_job_failure.
  • Client: stop reporting JobReturnCode.EXECUTION_ERROR as a “reportable job failure” to the server.
  • Tests: update and add unit tests to enforce the narrower reporting/handling contract.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| nvflare/private/fed/server/fed_server.py | Narrows process_job_failure to only fail runs for reported CONFIG_ERROR / EXCEPTION (no longer for EXECUTION_ERROR). |
| nvflare/private/fed/client/client_executor.py | Removes EXECUTION_ERROR from REPORTABLE_JOB_FAILURES, preventing it from being reported to the server. |
| tests/unit_test/private/fed/server/fed_server_test.py | Updates parametrized expectations and adds a test asserting EXECUTION_ERROR is ignored by the server. |
| tests/unit_test/private/fed/client/client_executor_test.py | Updates expected reportable failures and asserts EXECUTION_ERROR is non-reportable from the client. |
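
The server-side handling summarized above can be sketched as a two-branch dispatch. This is an illustrative approximation only: the `Code` enum, the `RecordingServer` class, and the argument order of `fail_run` / `stop_run` are assumptions made for the example (the call shapes loosely mirror the sequence diagram in this thread), not the actual `fed_server.py` code.

```python
from enum import Enum, auto


class Code(Enum):
    """Hypothetical stand-in for the return codes discussed in this PR."""

    EXECUTION_ERROR = auto()
    EXCEPTION = auto()
    CONFIG_ERROR = auto()
    UNSAFE_COMPONENT = auto()
    ABORTED = auto()


# After this change only explicit failures fail the run.
FAIL_RUN_CODES = (Code.EXCEPTION, Code.CONFIG_ERROR)
STOP_RUN_CODES = (Code.UNSAFE_COMPONENT, Code.ABORTED)


class RecordingServer:
    """Hypothetical server stub that records which action was taken."""

    def __init__(self):
        self.actions = []

    def fail_run(self, job_id, status):
        self.actions.append(("fail_run", job_id, status))

    def stop_run(self, job_id):
        self.actions.append(("stop_run", job_id))


def process_job_failure(server, job_id, code):
    """Dispatch a client-reported code; unlisted codes are ignored."""
    if code in FAIL_RUN_CODES:
        server.fail_run(job_id, Code.EXCEPTION)
    elif code in STOP_RUN_CODES:
        server.stop_run(job_id)
    # Anything else, including EXECUTION_ERROR, leaves the job status untouched.
```

Keeping the ignore case as the implicit fall-through means a legacy client that still sends `EXECUTION_ERROR` cannot change the status of a job the server already considers finished.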

Comment thread tests/unit_test/private/fed/client/client_executor_test.py
greptile-apps Bot (Contributor) commented May 11, 2026

Greptile Summary

This PR narrows the set of client job-failure codes that are promoted to an authoritative server-side job failure, fixing a false-failure scenario where client_api_qa worker teardown noise (surfaced as JobReturnCode.EXECUTION_ERROR) was causing the server to fail an already-completed job.

  • client_executor.py: Removes JobReturnCode.EXECUTION_ERROR from REPORTABLE_JOB_FAILURES, so clients no longer send a failure message to the server for generic launcher exit codes.
  • fed_server.py: Removes JobReturnCode.EXECUTION_ERROR from the fail_run branch in process_job_failure, ensuring even legacy clients that still send this code will have it silently ignored rather than triggering a job failure.
  • Tests: Updated parametrized and new explicit tests cleanly cover both the retained fail/abort paths and the newly ignored EXECUTION_ERROR path.
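
The test contract described in the last bullet can be illustrated with `unittest.mock`. This is a hedged sketch, not the tests in `fed_server_test.py`: `handle_reported_code`, the string codes, and the two `check_*` functions are hypothetical names chosen for the example.

```python
from unittest.mock import MagicMock

FAIL_RUN_CODES = {"EXCEPTION", "CONFIG_ERROR"}
STOP_RUN_CODES = {"UNSAFE_COMPONENT", "ABORTED"}


def handle_reported_code(server, job_id, code):
    """Hypothetical handler under test: fail, stop, or ignore a reported code."""
    if code in FAIL_RUN_CODES:
        server.fail_run(job_id, "EXCEPTION")
    elif code in STOP_RUN_CODES:
        server.stop_run(job_id)


def check_execution_error_is_ignored():
    """The newly ignored path: neither fail_run nor stop_run may be called."""
    server = MagicMock()
    handle_reported_code(server, "job-1", "EXECUTION_ERROR")
    server.fail_run.assert_not_called()
    server.stop_run.assert_not_called()


def check_retained_fail_codes():
    """The two remaining fail-run codes each trigger exactly one fail_run."""
    for code in sorted(FAIL_RUN_CODES):
        server = MagicMock()
        handle_reported_code(server, "job-1", code)
        server.fail_run.assert_called_once_with("job-1", "EXCEPTION")
        server.stop_run.assert_not_called()
```

Asserting on both mocks in the ignored case matters: a test that checked only `fail_run` would still pass if `EXECUTION_ERROR` were accidentally moved into the stop-run branch.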

Confidence Score: 5/5

Safe to merge — the change removes a single over-broad failure code from a two-layer guard (client reporting + server handling), the retained codes match the PR description, and the new test explicitly exercises the ignored path.

The change is small and surgical: one dict entry removed on the client, one tuple element removed on the server, and tests updated consistently. The K8s pending-timeout path continues to report JobReturnCode.EXCEPTION so that path is unaffected. No regressions are apparent.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvflare/private/fed/client/client_executor.py | Removes JobReturnCode.EXECUTION_ERROR from REPORTABLE_JOB_FAILURES; the retained set (EXCEPTION, UNSAFE_COMPONENT, CONFIG_ERROR, ABORTED) is consistent with the server-side handling. |
| nvflare/private/fed/server/fed_server.py | Removes JobReturnCode.EXECUTION_ERROR from the fail_run branch; the two-branch if/elif structure (fail vs. stop) is preserved and correctly aligned with the client-side reportable set. |
| tests/unit_test/private/fed/client/client_executor_test.py | Moves EXECUTION_ERROR into the non-failure parametrize set and updates the expected-failures dict to match production code. |
| tests/unit_test/private/fed/server/fed_server_test.py | Adds a new explicit test confirming EXECUTION_ERROR leaves fail_run and stop_run uncalled; existing parametrized test correctly trimmed to the two remaining fail-run codes. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server

    Note over Client: Job finishes successfully
    Note over Client: Worker teardown raises noise<br/>(EXECUTION_ERROR)

    alt Before this PR
        Client->>Server: process_job_failure(EXECUTION_ERROR)
        Server->>Server: fail_run(job_id, EXCEPTION)
        Note over Server: Job incorrectly marked FAILED
    else After this PR
        Note over Client: EXECUTION_ERROR not in REPORTABLE_JOB_FAILURES
        Note over Client: No message sent to server
        Note over Server: Job stays FINISHED (correct)
    end

    Note over Client: EXCEPTION / CONFIG_ERROR / UNSAFE_COMPONENT / ABORTED
    Client->>Server: process_job_failure(code)
    alt code in (EXCEPTION, CONFIG_ERROR)
        Server->>Server: fail_run(job_id, EXCEPTION)
    else code in (UNSAFE_COMPONENT, ABORTED)
        Server->>Server: stop_run(job_id)
    end

Reviews (2): Last reviewed commit: "Merge branch '2.8' into codex/narrow-cli..."

@YuanTingHsieh merged commit 510150c into NVIDIA:2.8 May 12, 2026 (24 checks passed)
@YuanTingHsieh deleted the codex/narrow-client-failure-report-2.8 branch May 12, 2026 20:32
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request May 13, 2026
(cherry picked from commit 510150c)
pcnudde pushed a commit that referenced this pull request May 13, 2026
## Summary

Port the selected 2.8 fixes back to `main` in 2.8 merge order:

- #4528 Add warnings for missing study data mappings
- #4538 Update deploy prepare launcher docs
- #4550 Align `Run.get_result()` with the `clean_up` parameter spelling
- #4561 Clarify `remove_client` token cleanup semantics
- #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager
- #4574 Fix Docker SJ workspace tmpfs permissions
- #4576 Narrow client failure reporting for generic launcher execution
errors
- #4583 Fix tracking recipe integration test

---------

Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>