Skip to content

Name CellPipe cells hierarchically for mTLS identity resolution#4801

Merged
nvidianz merged 7 commits into
NVIDIA:mainfrom
nvidianz:fix-cellpipe-alias-identity
Jun 15, 2026
Merged

Name CellPipe cells hierarchically for mTLS identity resolution#4801
nvidianz merged 7 commits into
NVIDIA:mainfrom
nvidianz:fix-cellpipe-alias-identity

Conversation

@nvidianz

@nvidianz nvidianz commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Fixes # .

Description

Direct CellPipe cells authenticated with the owning site's certificate but were named with underscore aliases such as site-1_<job-id>_active / site-1_<job-id>_passive, which don't fit FQCN semantics. The mTLS gate in conn_manager.update_endpoint rejected these connections (authenticated as 'site-1' ... requires identity 'site-1_<job>_active'), failing jobs with FINISHED:EXECUTION_EXCEPTION.

This PR fixes the problem at the naming level: CellPipe cells are now named hierarchically as <site>.<token>.<mode> (e.g. site-1.<job-id>.active), scoped under the cell they connect to (relay-1.site-1.<job-id>.active when joining via a relay). With the owning site as a leading FQCN segment, mTLS identity resolution and auth-token origin binding follow the normal FQCN hierarchy — no alias parsing needed.

Changes to support the new names:

  • cell_pipe.py: _cell_fqcn produces the hierarchical name; _build_cell picks root-style vs parent-style connection by the connect target (server root vs CP/relay) instead of by name shape, preserving the TLS settings for direct-to-server pipes.
  • core_cell.py: a client cell configured with only a root URL creates the backbone external connector regardless of generation; same-family endpoint resolution falls through to generic path resolution when the literal FQCN parent is not connected (pipe cells connect to an ancestor or the root instead).
  • identity.py / authenticator.py: the underscore-alias parsing is kept for backward compatibility with peers running older NVFlare versions, with comments marking it as legacy.

Notes for reviewers:

  • The legacy alias check still runs after the exact/prefix identity-map lookups, so explicitly configured identities keep their precedence.
  • Both peer pipes derive each other's FQCN with the same formula, and connecting via a relay or via the CP behind that relay now yields the same name, so the peers agree regardless of connect type.
  • New unit tests cover the name format (root/CP/relay), hierarchical and legacy identity resolution, handshake accept/spoof-reject, and endpoint resolution through a connected ancestor or the server root.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.

🤖 Generated with Claude Code

nvidianz and others added 2 commits June 11, 2026 20:22
…gate

Direct CellPipe cells use sibling FQCNs like site-1_<job>_active but
authenticate with the owning site's certificate. CellIdentityResolver
handled dotted child FQCNs but fell back to the alias itself for these
underscore aliases, so the mTLS gate in conn_manager rejected the
connection before the auth-token alias exception could apply, failing
jobs with FINISHED:EXECUTION_EXCEPTION.

Teach resolve() to recognize only the constrained alias form
<owner>_<runtime_id>_(active|passive) with a non-empty runtime_id
containing no '.' or '_' (mirroring _origin_matches_fqcn in
authenticator.py), and resolve the owner through the normal identity
maps. Right-anchored parsing keeps ambiguous forms like
site-a_x_<uuid>_active bound to site-a_x.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

Copy link
Copy Markdown
Collaborator Author

/build

@greptile-apps

This comment has been minimized.

Comment thread nvflare/fuel/f3/cellnet/identity.py Outdated
Comment thread nvflare/fuel/f3/cellnet/identity.py
nvidianz and others added 2 commits June 11, 2026 21:28
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mode>

CellPipe cells were named with underscore aliases (site-1_<job>_active)
that don't fit FQCN semantics: the mTLS identity gate and auth-token
binding needed special alias parsing to map them to the owning site.

Name them <site>.<token>.<mode> scoped under the cell they connect to
(site-1.<job>.active via CP or server root; relay-1.site-1.<job>.active
via a relay), so identity resolution and origin binding follow the
normal FQCN hierarchy. To support the new names:

- CellPipe picks root-style vs parent-style connection by the connect
  target (server root vs CP/relay) instead of by name shape, keeping
  TLS settings for direct-to-server pipes.
- A client cell configured with only a root URL creates the backbone
  external connector regardless of generation.
- Same-family endpoint resolution falls through to generic path
  resolution when the literal FQCN parent is not connected, since pipe
  cells connect to an ancestor (CP/relay) or the root instead.

The underscore-alias parsing in the identity resolver and authenticator
is kept for backward compatibility with peers running older versions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz nvidianz changed the title Fix mTLS identity resolution for direct CellPipe alias cells Name CellPipe cells hierarchically for mTLS identity resolution Jun 12, 2026
@nvidianz

Copy link
Copy Markdown
Collaborator Author

/build

1 similar comment
@nvidianz

Copy link
Copy Markdown
Collaborator Author

/build

@nvidianz nvidianz enabled auto-merge (squash) June 12, 2026 03:14
@nvidianz

Copy link
Copy Markdown
Collaborator Author

/build

@codecov-commenter

codecov-commenter commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.85%. Comparing base (2d0701e) to head (4c2853a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
nvflare/fuel/utils/pipe/cell_pipe.py 62.50% 3 Missing ⚠️
nvflare/fuel/f3/cellnet/core_cell.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4801      +/-   ##
==========================================
+ Coverage   55.78%   55.85%   +0.06%     
==========================================
  Files         967      967              
  Lines       91795    91810      +15     
==========================================
+ Hits        51212    51277      +65     
+ Misses      40583    40533      -50     
Flag Coverage Δ
unit-tests 55.85% <80.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@nvidianz nvidianz merged commit 1c21747 into NVIDIA:main Jun 15, 2026
17 checks passed
@nvidianz nvidianz deleted the fix-cellpipe-alias-identity branch June 15, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants