Add Client API Cell protocol vocabulary (client/cell/defs.py)#4856
Add Client API Cell protocol vocabulary (client/cell/defs.py)#4856YuanTingHsieh wants to merge 4 commits into
Conversation
…ives TE-1 (Wave 0, interface freeze #1) of the Client API Execution Modes program (see docs/design/client_api_execution_modes_plan.md on PR NVIDIA#4853). New pure-library package nvflare/client/cell consumed by the trainer engine (TE-3), external_process (EP-3), and attach (AT-2) tracks so the HELLO handshake is implemented once: - defs.py: the Client API control-protocol vocabulary - channel, PROTOCOL_VERSION, Topic (HELLO/HELLO_CHALLENGE/HELLO_PROOF/ HELLO_ACCEPTED, TASK_READY/TASK_ACCEPTED/TASK_FAILED, RESULT_READY/ RESULT_ACCEPTED/RESULT_REJECTED, LOG, HEARTBEAT, ABORT, SHUTDOWN, BYE, ERROR) and MsgKey payload keys. Topic values are "client_api."-prefixed to avoid colliding with ipc/defs.py values. - auth.py: session-token primitives per the design's attach auth contract - TokenScope (frozen, binds token to job/site/attach id/ target FQCN/trainer FQCN/rank policy/protocol version), generate_session_token/generate_nonce (secrets), token_digest (sha256; anything persisted stores only the digest), compute/verify_hello_proof (HMAC-SHA256 keyed by the token over a domain-tagged, length-prefixed serialization of nonce + full scope; constant-time compare), and SessionTokenManager (single-use nonces consumed on any attempt, attach_timeout expiry with injectable clock, single-session enforcement, invalidate; thread-safe; no cellnet/file-I/O deps). 37 unit tests: proof round-trip, tamper/replay/consumed-nonce/scope- mismatch/expiry/single-session/invalidation matrices, serialization non-ambiguity, and vocabulary uniqueness incl. non-collision with ipc. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new pure-library nvflare.client.cell package that centralizes the Client API control-protocol vocabulary (defs.py) and session-token based HELLO handshake/auth primitives (auth.py), along with unit tests that lock down the cross-track wire contract.
Changes:
- Added frozen protocol/channel/topic/message-key constants for the Client API control protocol (
nvflare/client/cell/defs.py). - Added session-token auth primitives (token scope, proof compute/verify, nonce issuance/bounding, attach-timeout + invalidation) in a Cell-independent library (
nvflare/client/cell/auth.py). - Added unit tests asserting exact wire strings and extensive proof/nonce/session-manager security properties (
tests/unit_test/client/cell/*).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| nvflare/client/cell/defs.py | Adds the frozen channel/topic/message-key vocabulary for the Client API control protocol. |
| nvflare/client/cell/auth.py | Adds session-token / HELLO proof primitives and an executor-side token+nonce manager. |
| nvflare/client/cell/init.py | Introduces the new nvflare.client.cell package. |
| tests/unit_test/client/cell/defs_test.py | Locks down exact wire values and collision avoidance with legacy IPC defs. |
| tests/unit_test/client/cell/auth_test.py | Comprehensive unit tests for proof integrity, replay resistance, expiry, invalidation, and bounded nonce behavior. |
| tests/unit_test/client/cell/init.py | Introduces the test package namespace for tests.unit_test.client.cell. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4856 +/- ##
==========================================
+ Coverage 56.20% 56.98% +0.77%
==========================================
Files 967 970 +3
Lines 91978 92302 +324
==========================================
+ Hits 51699 52595 +896
+ Misses 40279 39707 -572
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Greptile SummaryThis PR introduces
Confidence Score: 5/5Safe to merge; the change is a pure constants module with no logic, no I/O, and no imports from the rest of the codebase. All wire values are pinned by the test suite. The diff adds only frozen string/int constants and a matching test suite that performs exact-value equality checks. There is no executable logic, no mutation of shared state, and no coupling to Cell or cellnet at this layer. The collision test against the legacy ipc/defs.py channel and topic names is correct. The one discrepancy — the PR description describes auth.py primitives that are absent from the diff — is a documentation gap, not a defect in the code that is actually delivered. No files require special attention. The PR description overstates the scope (references auth.py which is not in the diff), but that does not affect the correctness of defs.py or its tests. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant T as Trainer Process
participant E as NVFlare Engine (TE-3/EP-3)
T->>E: Topic.HELLO (MsgKey.SESSION_ID, JOB_ID, SITE_NAME, TRAINER_FQCN, RANK, PROTOCOL_VERSION)
E-->>T: Topic.HELLO_CHALLENGE (MsgKey.NONCE)
T->>E: Topic.HELLO_PROOF (MsgKey.PROOF)
alt auth success
E-->>T: Topic.HELLO_ACCEPTED
else auth failure (bad proof / wrong scope / expired nonce)
E-->>T: Topic.HELLO_REJECTED (MsgKey.REASON)
end
loop Every training round
E->>T: Topic.TASK_READY (MsgKey.TASK_ID, TASK_NAME, MODEL, PARAMS)
T-->>E: Topic.TASK_ACCEPTED
T->>E: Topic.RESULT_READY (MsgKey.RESULT_ID, MANIFEST)
alt result OK
E-->>T: Topic.RESULT_ACCEPTED
else result rejected
E-->>T: Topic.RESULT_REJECTED (MsgKey.REASON)
end
end
alt normal teardown
E->>T: Topic.SHUTDOWN
T-->>E: Topic.BYE
else error / abort
E->>T: Topic.ABORT
T-->>E: Topic.ERROR (MsgKey.REASON)
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant T as Trainer Process
participant E as NVFlare Engine (TE-3/EP-3)
T->>E: Topic.HELLO (MsgKey.SESSION_ID, JOB_ID, SITE_NAME, TRAINER_FQCN, RANK, PROTOCOL_VERSION)
E-->>T: Topic.HELLO_CHALLENGE (MsgKey.NONCE)
T->>E: Topic.HELLO_PROOF (MsgKey.PROOF)
alt auth success
E-->>T: Topic.HELLO_ACCEPTED
else auth failure (bad proof / wrong scope / expired nonce)
E-->>T: Topic.HELLO_REJECTED (MsgKey.REASON)
end
loop Every training round
E->>T: Topic.TASK_READY (MsgKey.TASK_ID, TASK_NAME, MODEL, PARAMS)
T-->>E: Topic.TASK_ACCEPTED
T->>E: Topic.RESULT_READY (MsgKey.RESULT_ID, MANIFEST)
alt result OK
E-->>T: Topic.RESULT_ACCEPTED
else result rejected
E-->>T: Topic.RESULT_REJECTED (MsgKey.REASON)
end
end
alt normal teardown
E->>T: Topic.SHUTDOWN
T-->>E: Topic.BYE
else error / abort
E->>T: Topic.ABORT
T-->>E: Topic.ERROR (MsgKey.REASON)
end
Reviews (4): Last reviewed commit: "TE-1: reduce freeze to the protocol voca..." | Re-trigger Greptile |
…ttach Scope refinement (PR NVIDIA#4856): external_process — the primary mode, which shells out to torchrun/Deepspeed/Horovod/mpirun and talks only to rank 0 over a localhost connection NVFlare itself created — needs only a lightweight launch-token proof. The stateful executor-side SessionTokenManager (single-use nonce issuance, attach-window expiry, single-session enforcement, invalidation) is a challenge-response concern specific to attach mode, where NVFlare does not own the trainer process and the token is delivered out-of-band by an untrusted starter. Move SessionTokenManager to the attach backend (AT-2) so this freeze ships only what P0 (external_process) needs: the defs vocabulary and the stateless primitives (TokenScope, generate_session_token/generate_nonce, token_digest, compute_hello_proof/verify_hello_proof, combine_nonces). 30 tests remain (the manager's tests move with it). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…CTED PR NVIDIA#4856 review fixes: - verify_hello_proof rejects any proof whose length is not the fixed SHA-256 hex length (PROOF_HEX_LEN=64) up front, so an arbitrarily long attacker proof cannot force HMAC/compare work; also guards AttributeError so a non-str token/nonce (e.g. a None token from a config miss on the one-round path) returns False instead of raising. - generate_session_token / generate_nonce reject below a 128-bit entropy floor (_MIN_ENTROPY_BYTES=16) so a public primitive can't mint an empty or weak token/nonce (num_bytes=0). - add Topic.HELLO_REJECTED so a handshake auth refusal (bad proof, wrong scope, consumed/expired nonce, single-session) is a distinct semantic signal from the generic ERROR (protocol/transport fault) for TE-3/AT-2. Tests cover the entropy floors, wrong/huge proof lengths, non-str token/nonce, a 64-char non-ASCII proof reaching compare_digest, and the HELLO_REJECTED wire value. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trim auth.py (and its tests) out of the interface freeze. The token is three roles — rendezvous, anti-mixup, and authentication — and whether external_process needs a *secret* HMAC proof at all is a host-trust decision (single-tenant: OS isolation suffices; multi-tenant: needed) that belongs to EP-3, not to a frozen module shipped ahead of it. NVIDIA#4856 now freezes only what is undisputed and shared: the Cell control- protocol vocabulary (client/cell/defs.py — Topics, MsgKeys, CHANNEL, PROTOCOL_VERSION). The proof helpers (TokenScope, compute/verify_hello_proof, combine_nonces) move to EP-3 where the auth model is decided and first consumed; the generic generators (generate_session_token/nonce, token_digest) go to the fuel/sec consolidation (FLARE-3017); the stateful SessionTokenManager remains attach-only (AT-2). Plan updated to match. The removed code is preserved in this branch's history for EP-3 to draw on. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reflect the NVIDIA#4856 trim: TE-1 (interface freeze #1) is now the Cell control-protocol vocabulary only. The auth mechanism is not frozen ahead of its decision — the proof helpers land with EP-3 (which owns the host-trust / auth-strength call: rendezvous-only vs one-round HMAC), the generic token/nonce/digest generators go to FLARE-3017, and the stateful SessionTokenManager stays attach-only (AT-2). Updated the TE-1, EP-3, AT-2 rows and the interface-freeze note accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
New pure-library module
nvflare/client/cell/defs.py— the Client API Cell control-protocol vocabulary, consumed by the trainer-side Cell engine (TE-3), external_process (EP-3), and attach (AT-2) so the wire contract is defined once.CHANNEL("client_api"),PROTOCOL_VERSION.Topic— HELLO / HELLO_CHALLENGE / HELLO_PROOF / HELLO_ACCEPTED / HELLO_REJECTED, TASK_READY / TASK_ACCEPTED / TASK_FAILED, RESULT_READY / RESULT_ACCEPTED / RESULT_REJECTED, LOG, HEARTBEAT, ABORT, SHUTDOWN, BYE, ERROR (valuesclient_api.-prefixed to avoid colliding with the legacyflare_agenttopics).MsgKeypayload keys.Pure constants — no Cell/cellnet imports, no I/O. Applies to the out-of-process modes only; in_process uses DataBus and does not use this protocol.
Scope change (this PR was trimmed)
The auth mechanism (
auth.py— token generation + HMAC challenge-response proof) was removed from this freeze. Rationale: the token is three roles — rendezvous, anti-mixup, and authentication — and whether external_process needs a secret HMAC proof at all is a host-trust decision (single-tenant: OS isolation suffices; multi-tenant: needed) that belongs to EP-3, not to a module frozen ahead of it. Freezing the vocabulary (undisputed, shared) is right; freezing an auth mechanism before its model is decided is not.Where the auth pieces now live:
TokenScope,compute/verify_hello_proof,combine_nonces) → EP-3 (owns the auth-model decision, first consumer)generate_session_token/generate_nonce,token_digest) → FLARE-3017 (fuel/sec consolidation)SessionTokenManager→ AT-2 (attach-only)(The removed code is preserved in this branch's history for EP-3 to draw on.)
Program context
Client API Execution Modes (2.9) — TE-1, Wave 0, interface freeze #1 (Epic FLARE-2698 / Story FLARE-3011).
Design:
docs/design/client_api_execution_modes.md§ "Control Protocol".Unblocks: TE-3 (trainer engine), EP-3 (external_process HELLO + auth model), AT-2 (attach).
Testing
9 unit tests pin the exact frozen wire values (CHANNEL, every Topic string, every MsgKey) so a value rename fails CI — the strings are the cross-track contract. black/isort/flake8 clean.
🤖 Generated with Claude Code