[2.8] Narrow client failure reporting#4576
Conversation
c75a717 to
e596cb5
Compare
There was a problem hiding this comment.
Pull request overview
This PR narrows the client → server “job failure” reporting pathway introduced in 2.8 so that a generic launcher-level JobReturnCode.EXECUTION_ERROR reported by a client does not get promoted into an authoritative server-side job failure (avoiding failing jobs that already completed successfully server-side).
Changes:
- Server: stop treating reported
JobReturnCode.EXECUTION_ERRORas a run-failing signal inprocess_job_failure. - Client: stop reporting
JobReturnCode.EXECUTION_ERRORas a “reportable job failure” to the server. - Tests: update and add unit tests to enforce the narrower reporting/handling contract.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
nvflare/private/fed/server/fed_server.py |
Narrows process_job_failure to only fail runs for reported CONFIG_ERROR / EXCEPTION (no longer for EXECUTION_ERROR). |
nvflare/private/fed/client/client_executor.py |
Removes EXECUTION_ERROR from REPORTABLE_JOB_FAILURES, preventing it from being reported to the server. |
tests/unit_test/private/fed/server/fed_server_test.py |
Updates parametrized expectations and adds a test asserting EXECUTION_ERROR is ignored by the server. |
tests/unit_test/private/fed/client/client_executor_test.py |
Updates expected reportable failures and asserts EXECUTION_ERROR is non-reportable from the client. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR narrows the set of client job-failure codes that are promoted to an authoritative server-side job failure, fixing a false-failure scenario where
Confidence Score: 5/5Safe to merge — the change removes a single over-broad failure code from a two-layer guard (client reporting + server handling), the retained codes match the PR description, and the new test explicitly exercises the ignored path. The change is small and surgical: one dict entry removed on the client, one tuple element removed on the server, and tests updated consistently. The K8s pending-timeout path continues to report JobReturnCode.EXCEPTION so that path is unaffected. No regressions are apparent. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant Server
Note over Client: Job finishes successfully
Note over Client: Worker teardown raises noise<br/>(EXECUTION_ERROR)
alt Before this PR
Client->>Server: process_job_failure(EXECUTION_ERROR)
Server->>Server: fail_run(job_id, EXCEPTION)
Note over Server: Job incorrectly marked FAILED
else After this PR
Note over Client: EXECUTION_ERROR not in REPORTABLE_JOB_FAILURES
Note over Client: No message sent to server
Note over Server: Job stays FINISHED (correct)
end
Note over Client: EXCEPTION / CONFIG_ERROR / UNSAFE_COMPONENT / ABORTED
Client->>Server: process_job_failure(code)
alt code in (EXCEPTION, CONFIG_ERROR)
Server->>Server: fail_run(job_id, EXCEPTION)
else code in (UNSAFE_COMPONENT, ABORTED)
Server->>Server: stop_run(job_id)
end
Reviews (2): Last reviewed commit: "Merge branch '2.8' into codex/narrow-cli..." | Re-trigger Greptile |
## Summary Narrow the client job-failure reporting path added for 2.8 so generic launcher `JobReturnCode.EXECUTION_ERROR` is not promoted into an authoritative server-side job failure. ## Why `client_api_qa` can complete the FL workflow successfully and then hit local worker teardown noise that surfaces as a generic launcher execution error. Reporting that generic code to the server causes the server to fail an already-finished job. This keeps the explicit failure cases used by the recent job-timeout status fix: - `ProcessExitCode.EXCEPTION` - `ProcessExitCode.CONFIG_ERROR` - `ProcessExitCode.UNSAFE_COMPONENT` - `JobReturnCode.ABORTED` The K8s pending-timeout path still reports `JobReturnCode.EXCEPTION`, so it should continue to show `FINISHED:EXECUTION_EXCEPTION` rather than `RUNNING`. (cherry picked from commit 510150c)
## Summary Port the selected 2.8 fixes back to `main` in 2.8 merge order: - #4528 Add warnings for missing study data mappings - #4538 Update deploy prepare launcher docs - #4550 Align `Run.get_result()` with the `clean_up` parameter spelling - #4561 Clarify `remove_client` token cleanup semantics - #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager - #4574 Fix Docker SJ workspace tmpfs permissions - #4576 Narrow client failure reporting for generic launcher execution errors - #4583 Fix tracking recipe integration test --------- Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
Summary
Narrow the client job-failure reporting path added for 2.8 so generic launcher
JobReturnCode.EXECUTION_ERRORis not promoted into an authoritative server-side job failure.Why
client_api_qacan complete the FL workflow successfully and then hit local worker teardown noise that surfaces as a generic launcher execution error. Reporting that generic code to the server causes the server to fail an already-finished job.This keeps the explicit failure cases used by the recent job-timeout status fix:
ProcessExitCode.EXCEPTIONProcessExitCode.CONFIG_ERRORProcessExitCode.UNSAFE_COMPONENTJobReturnCode.ABORTEDThe K8s pending-timeout path still reports
JobReturnCode.EXCEPTION, so it should continue to showFINISHED:EXECUTION_EXCEPTIONrather thanRUNNING.