Skip to content

Add Client API Cell protocol vocabulary (client/cell/defs.py)#4856

Open
YuanTingHsieh wants to merge 4 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-te1-protocol-auth
Open

Add Client API Cell protocol vocabulary (client/cell/defs.py)#4856
YuanTingHsieh wants to merge 4 commits into
NVIDIA:mainfrom
YuanTingHsieh:yuantingh/client-api-te1-protocol-auth

Conversation

@YuanTingHsieh

@YuanTingHsieh YuanTingHsieh commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

What

New pure-library module nvflare/client/cell/defs.py — the Client API Cell control-protocol vocabulary, consumed by the trainer-side Cell engine (TE-3), external_process (EP-3), and attach (AT-2) so the wire contract is defined once.

  • CHANNEL ("client_api"), PROTOCOL_VERSION.
  • Topic — HELLO / HELLO_CHALLENGE / HELLO_PROOF / HELLO_ACCEPTED / HELLO_REJECTED, TASK_READY / TASK_ACCEPTED / TASK_FAILED, RESULT_READY / RESULT_ACCEPTED / RESULT_REJECTED, LOG, HEARTBEAT, ABORT, SHUTDOWN, BYE, ERROR (values client_api.-prefixed to avoid colliding with the legacy flare_agent topics).
  • MsgKey payload keys.

Pure constants — no Cell/cellnet imports, no I/O. Applies to the out-of-process modes only; in_process uses DataBus and does not use this protocol.

Scope change (this PR was trimmed)

The auth mechanism (auth.py — token generation + HMAC challenge-response proof) was removed from this freeze. Rationale: the token is three roles — rendezvous, anti-mixup, and authentication — and whether external_process needs a secret HMAC proof at all is a host-trust decision (single-tenant: OS isolation suffices; multi-tenant: needed) that belongs to EP-3, not to a module frozen ahead of it. Freezing the vocabulary (undisputed, shared) is right; freezing an auth mechanism before its model is decided is not.

Where the auth pieces now live:

  • proof helpers (TokenScope, compute/verify_hello_proof, combine_nonces) → EP-3 (owns the auth-model decision, first consumer)
  • generic generators (generate_session_token/generate_nonce, token_digest) → FLARE-3017 (fuel/sec consolidation)
  • stateful SessionTokenManagerAT-2 (attach-only)

(The removed code is preserved in this branch's history for EP-3 to draw on.)

Program context

Client API Execution Modes (2.9) — TE-1, Wave 0, interface freeze #1 (Epic FLARE-2698 / Story FLARE-3011).
Design: docs/design/client_api_execution_modes.md § "Control Protocol".
Unblocks: TE-3 (trainer engine), EP-3 (external_process HELLO + auth model), AT-2 (attach).

Testing

9 unit tests pin the exact frozen wire values (CHANNEL, every Topic string, every MsgKey) so a value rename fails CI — the strings are the cross-track contract. black/isort/flake8 clean.

🤖 Generated with Claude Code

…ives

TE-1 (Wave 0, interface freeze #1) of the Client API Execution Modes
program (see docs/design/client_api_execution_modes_plan.md on PR NVIDIA#4853).

New pure-library package nvflare/client/cell consumed by the trainer
engine (TE-3), external_process (EP-3), and attach (AT-2) tracks so the
HELLO handshake is implemented once:

- defs.py: the Client API control-protocol vocabulary - channel,
  PROTOCOL_VERSION, Topic (HELLO/HELLO_CHALLENGE/HELLO_PROOF/
  HELLO_ACCEPTED, TASK_READY/TASK_ACCEPTED/TASK_FAILED, RESULT_READY/
  RESULT_ACCEPTED/RESULT_REJECTED, LOG, HEARTBEAT, ABORT, SHUTDOWN, BYE,
  ERROR) and MsgKey payload keys. Topic values are "client_api."-prefixed
  to avoid colliding with ipc/defs.py values.
- auth.py: session-token primitives per the design's attach auth
  contract - TokenScope (frozen, binds token to job/site/attach id/
  target FQCN/trainer FQCN/rank policy/protocol version),
  generate_session_token/generate_nonce (secrets), token_digest (sha256;
  anything persisted stores only the digest), compute/verify_hello_proof
  (HMAC-SHA256 keyed by the token over a domain-tagged, length-prefixed
  serialization of nonce + full scope; constant-time compare), and
  SessionTokenManager (single-use nonces consumed on any attempt,
  attach_timeout expiry with injectable clock, single-session
  enforcement, invalidate; thread-safe; no cellnet/file-I/O deps).

37 unit tests: proof round-trip, tamper/replay/consumed-nonce/scope-
mismatch/expiry/single-session/invalidation matrices, serialization
non-ambiguity, and vocabulary uniqueness incl. non-collision with ipc.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 16:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new pure-library nvflare.client.cell package that centralizes the Client API control-protocol vocabulary (defs.py) and session-token based HELLO handshake/auth primitives (auth.py), along with unit tests that lock down the cross-track wire contract.

Changes:

  • Added frozen protocol/channel/topic/message-key constants for the Client API control protocol (nvflare/client/cell/defs.py).
  • Added session-token auth primitives (token scope, proof compute/verify, nonce issuance/bounding, attach-timeout + invalidation) in a Cell-independent library (nvflare/client/cell/auth.py).
  • Added unit tests asserting exact wire strings and extensive proof/nonce/session-manager security properties (tests/unit_test/client/cell/*).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
nvflare/client/cell/defs.py Adds the frozen channel/topic/message-key vocabulary for the Client API control protocol.
nvflare/client/cell/auth.py Adds session-token / HELLO proof primitives and an executor-side token+nonce manager.
nvflare/client/cell/init.py Introduces the new nvflare.client.cell package.
tests/unit_test/client/cell/defs_test.py Locks down exact wire values and collision avoidance with legacy IPC defs.
tests/unit_test/client/cell/auth_test.py Comprehensive unit tests for proof integrity, replay resistance, expiry, invalidation, and bounded nonce behavior.
tests/unit_test/client/cell/init.py Introduces the test package namespace for tests.unit_test.client.cell.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nvflare/client/cell/auth.py Outdated
Comment thread nvflare/client/cell/auth.py Outdated
Comment thread nvflare/client/cell/auth.py Outdated
@codecov-commenter

codecov-commenter commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.98%. Comparing base (1cff2b3) to head (4bd2d09).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4856      +/-   ##
==========================================
+ Coverage   56.20%   56.98%   +0.77%     
==========================================
  Files         967      970       +3     
  Lines       91978    92302     +324     
==========================================
+ Hits        51699    52595     +896     
+ Misses      40279    39707     -572     
Flag Coverage Δ
unit-tests 56.98% <100.00%> (+0.77%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces nvflare/client/cell/ as a new pure-vocabulary package containing the frozen wire constants for the Client API control protocol: CHANNEL, PROTOCOL_VERSION, Topic (17 topics covering session setup, per-task lifecycle, and teardown), and MsgKey (19 payload keys). The test suite pins every exact wire value and verifies non-collision with the legacy flare_agent channel and TOPIC_* names from nvflare/client/ipc/defs.py.

  • defs.py is a dependency-free constants module; all Topic values are client_api.-prefixed and HELLO_REJECTED has been added (closing a previously noted asymmetry) with its wire value pinned in defs_test.py.
  • The PR description and title describe auth.py (stateless TokenScope, generate_session_token, compute/verify_hello_proof, etc.) as part of this PR and claim "30 unit tests", but auth.py is not in the diff and the test file contains ~9 tests scoped entirely to defs.py; the description appears to reflect an earlier, broader revision and was not updated after the scope trim.

Confidence Score: 5/5

Safe to merge; the change is a pure constants module with no logic, no I/O, and no imports from the rest of the codebase. All wire values are pinned by the test suite.

The diff adds only frozen string/int constants and a matching test suite that performs exact-value equality checks. There is no executable logic, no mutation of shared state, and no coupling to Cell or cellnet at this layer. The collision test against the legacy ipc/defs.py channel and topic names is correct. The one discrepancy — the PR description describes auth.py primitives that are absent from the diff — is a documentation gap, not a defect in the code that is actually delivered.

No files require special attention. The PR description overstates the scope (references auth.py which is not in the diff), but that does not affect the correctness of defs.py or its tests.

Important Files Changed

Filename Overview
nvflare/client/cell/defs.py New vocabulary module: frozen wire constants CHANNEL, PROTOCOL_VERSION, Topic (17 topics), and MsgKey (19 keys); pure constants, no I/O or imports.
tests/unit_test/client/cell/defs_test.py Comprehensive pin-tests covering exact wire values, uniqueness, legacy-collision checks, and presence of protocol-required keys; all topic and key constants are anchored.
nvflare/client/cell/init.py Empty package init (copyright only); no re-exports yet, consistent with the staged delivery plan.
tests/unit_test/client/cell/init.py Empty test package init; no issues.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant T as Trainer Process
    participant E as NVFlare Engine (TE-3/EP-3)

    T->>E: Topic.HELLO (MsgKey.SESSION_ID, JOB_ID, SITE_NAME, TRAINER_FQCN, RANK, PROTOCOL_VERSION)
    E-->>T: Topic.HELLO_CHALLENGE (MsgKey.NONCE)
    T->>E: Topic.HELLO_PROOF (MsgKey.PROOF)
    alt auth success
        E-->>T: Topic.HELLO_ACCEPTED
    else auth failure (bad proof / wrong scope / expired nonce)
        E-->>T: Topic.HELLO_REJECTED (MsgKey.REASON)
    end

    loop Every training round
        E->>T: Topic.TASK_READY (MsgKey.TASK_ID, TASK_NAME, MODEL, PARAMS)
        T-->>E: Topic.TASK_ACCEPTED
        T->>E: Topic.RESULT_READY (MsgKey.RESULT_ID, MANIFEST)
        alt result OK
            E-->>T: Topic.RESULT_ACCEPTED
        else result rejected
            E-->>T: Topic.RESULT_REJECTED (MsgKey.REASON)
        end
    end

    alt normal teardown
        E->>T: Topic.SHUTDOWN
        T-->>E: Topic.BYE
    else error / abort
        E->>T: Topic.ABORT
        T-->>E: Topic.ERROR (MsgKey.REASON)
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant T as Trainer Process
    participant E as NVFlare Engine (TE-3/EP-3)

    T->>E: Topic.HELLO (MsgKey.SESSION_ID, JOB_ID, SITE_NAME, TRAINER_FQCN, RANK, PROTOCOL_VERSION)
    E-->>T: Topic.HELLO_CHALLENGE (MsgKey.NONCE)
    T->>E: Topic.HELLO_PROOF (MsgKey.PROOF)
    alt auth success
        E-->>T: Topic.HELLO_ACCEPTED
    else auth failure (bad proof / wrong scope / expired nonce)
        E-->>T: Topic.HELLO_REJECTED (MsgKey.REASON)
    end

    loop Every training round
        E->>T: Topic.TASK_READY (MsgKey.TASK_ID, TASK_NAME, MODEL, PARAMS)
        T-->>E: Topic.TASK_ACCEPTED
        T->>E: Topic.RESULT_READY (MsgKey.RESULT_ID, MANIFEST)
        alt result OK
            E-->>T: Topic.RESULT_ACCEPTED
        else result rejected
            E-->>T: Topic.RESULT_REJECTED (MsgKey.REASON)
        end
    end

    alt normal teardown
        E->>T: Topic.SHUTDOWN
        T-->>E: Topic.BYE
    else error / abort
        E->>T: Topic.ABORT
        T-->>E: Topic.ERROR (MsgKey.REASON)
    end
Loading

Reviews (4): Last reviewed commit: "TE-1: reduce freeze to the protocol voca..." | Re-trigger Greptile

Comment thread nvflare/client/cell/auth.py Outdated
Comment thread nvflare/client/cell/defs.py
Comment thread nvflare/client/cell/auth.py Outdated
…ttach

Scope refinement (PR NVIDIA#4856): external_process — the primary mode, which shells
out to torchrun/Deepspeed/Horovod/mpirun and talks only to rank 0 over a
localhost connection NVFlare itself created — needs only a lightweight
launch-token proof. The stateful executor-side SessionTokenManager
(single-use nonce issuance, attach-window expiry, single-session enforcement,
invalidation) is a challenge-response concern specific to attach mode, where
NVFlare does not own the trainer process and the token is delivered
out-of-band by an untrusted starter.

Move SessionTokenManager to the attach backend (AT-2) so this freeze ships
only what P0 (external_process) needs: the defs vocabulary and the stateless
primitives (TokenScope, generate_session_token/generate_nonce, token_digest,
compute_hello_proof/verify_hello_proof, combine_nonces). 30 tests remain
(the manager's tests move with it).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@YuanTingHsieh YuanTingHsieh changed the title Add Client API Cell protocol vocabulary and session-token auth primitives Add Client API Cell protocol vocabulary and stateless session-token proof primitives Jul 2, 2026
YuanTingHsieh and others added 2 commits July 2, 2026 14:20
…CTED

PR NVIDIA#4856 review fixes:
- verify_hello_proof rejects any proof whose length is not the fixed SHA-256
  hex length (PROOF_HEX_LEN=64) up front, so an arbitrarily long attacker
  proof cannot force HMAC/compare work; also guards AttributeError so a
  non-str token/nonce (e.g. a None token from a config miss on the one-round
  path) returns False instead of raising.
- generate_session_token / generate_nonce reject below a 128-bit entropy
  floor (_MIN_ENTROPY_BYTES=16) so a public primitive can't mint an empty or
  weak token/nonce (num_bytes=0).
- add Topic.HELLO_REJECTED so a handshake auth refusal (bad proof, wrong
  scope, consumed/expired nonce, single-session) is a distinct semantic
  signal from the generic ERROR (protocol/transport fault) for TE-3/AT-2.

Tests cover the entropy floors, wrong/huge proof lengths, non-str token/nonce,
a 64-char non-ASCII proof reaching compare_digest, and the HELLO_REJECTED
wire value.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trim auth.py (and its tests) out of the interface freeze. The token is
three roles — rendezvous, anti-mixup, and authentication — and whether
external_process needs a *secret* HMAC proof at all is a host-trust
decision (single-tenant: OS isolation suffices; multi-tenant: needed)
that belongs to EP-3, not to a frozen module shipped ahead of it.

NVIDIA#4856 now freezes only what is undisputed and shared: the Cell control-
protocol vocabulary (client/cell/defs.py — Topics, MsgKeys, CHANNEL,
PROTOCOL_VERSION). The proof helpers (TokenScope, compute/verify_hello_proof,
combine_nonces) move to EP-3 where the auth model is decided and first
consumed; the generic generators (generate_session_token/nonce,
token_digest) go to the fuel/sec consolidation (FLARE-3017); the stateful
SessionTokenManager remains attach-only (AT-2). Plan updated to match.

The removed code is preserved in this branch's history for EP-3 to draw on.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
YuanTingHsieh added a commit to YuanTingHsieh/NVFlare that referenced this pull request Jul 2, 2026
Reflect the NVIDIA#4856 trim: TE-1 (interface freeze #1) is now the Cell
control-protocol vocabulary only. The auth mechanism is not frozen ahead
of its decision — the proof helpers land with EP-3 (which owns the
host-trust / auth-strength call: rendezvous-only vs one-round HMAC), the
generic token/nonce/digest generators go to FLARE-3017, and the stateful
SessionTokenManager stays attach-only (AT-2). Updated the TE-1, EP-3, AT-2
rows and the interface-freeze note accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@YuanTingHsieh YuanTingHsieh changed the title Add Client API Cell protocol vocabulary and stateless session-token proof primitives Add Client API Cell protocol vocabulary (client/cell/defs.py) Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants