Conversation
Add _BLOCKED_MODULES frozenset and UnsafeClassError. load_class() now checks the top-level module against the blocklist before importing, blocking dangerous modules like subprocess, os, shutil, etc. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed ConnProps.SUBMITTER_ORG to ConnProps.SUBMITTER_ROLE for the role field in the submitter Person object construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The make_reply() call for AUTHENTICATION_ERROR was missing a return statement, causing the code to continue processing even after auth failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace empty nonce with server-generated one-time challenge: 1. Client sends _login_challenge → server generates UUID nonce, stores it 2. Client signs (CN + nonce) → sends to server with _cert_login 3. Server consumes the stored nonce (one-time), verifies signature 4. Challenge expires after 60 seconds 5. Empty nonce without a pending challenge is REJECTED (prevents replay) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion - get_session() now accepts and forwards id_asserter to decode_token() - Updated pre_command() caller in login.py to pass id_asserter, ensuring token signatures are verified on every admin command - Without this, decode_token() skips signature check and only looks up the session ID — a captured token works without valid signature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate the sender's identity via validate_client() before processing job failure reports, preventing unauthenticated clients from aborting arbitrary jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reject contributions with NaN/Inf tensor values and invalid weights (NaN, Inf, negative) to prevent model poisoning attacks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, heartbeat() would re-create a client record for unknown tokens, allowing revoked clients to bypass removal. Now it rejects unknown tokens and requires clients to re-register properly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use os.open() with 0o600 mode instead of plain open() for .key files in legacy provisioning, preventing world-readable private keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace random.randint (Mersenne Twister, non-cryptographic) with secrets.token_bytes (CSPRNG) for SNP attestation nonce generation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log a SECURITY WARNING when running unsigned jobs with signature verification disabled, making the security posture visible to admins. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add shutil.rmtree() cleanup after AppAuthzService.authorize() failure to prevent unauthorized job files from persisting on disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR applies a batch of Q2 2026 security and audit fixes across the admin login flow, app deployment, class loading, certificate provisioning, and server-side authentication. Most individual fixes are sound (CSPRNG nonce generation, private key file permissions,
Confidence Score: 3/5Not safe to merge — two new P1 defects were introduced, and two previously-flagged security issues remain unresolved in login.py. The PR fixes several real bugs (missing return on auth error, CSPRNG nonce, file permission hardening, role-field copy-paste, revoked-token rejection), but introduces a builtins bypass in the new class-loader blocklist and an incomplete cleanup path in app_deployer.py. Combined with the two carry-over issues in login.py flagged in prior review rounds, there are four substantive issues that need resolution before this security-focused PR is production-ready. nvflare/fuel/utils/class_loader.py (builtins bypass), nvflare/private/fed/utils/app_deployer.py (incomplete cleanup), nvflare/fuel/hci/server/login.py (debug nonce log + client-nonce fallback still unresolved).
|
| Filename | Overview |
|---|---|
| nvflare/fuel/utils/class_loader.py | Adds module blocklist for job-config class loading, but the builtins branch (no-dot paths) has no filtering — exec, eval, open etc. are reachable without restriction. |
| nvflare/private/fed/utils/app_deployer.py | Adds cleanup on auth/authz failure, but only removes app_path; run_dir with fl_app.txt and job metadata is left on disk. Also contains redundant inline import shutil (already at module level). |
| nvflare/fuel/hci/server/login.py | Implements challenge-response login skeleton, but the SEC004_DEBUG info log (line 125) still leaks the consumed nonce, the client-nonce fallback (lines 126–135) still enables replay, and _pending_challenges has no eviction for abandoned sessions. |
| nvflare/fuel/hci/server/authz.py | Fixes copy-paste bug where submitter role was being set to submitter org; correct field (SUBMITTER_ROLE) now used. |
| nvflare/lighter/impl/cert.py | Private key files now written with os.open and mode 0o600, preventing world-readable key material on disk. |
| nvflare/app_opt/confidential_computing/snp_authorizer.py | Nonce generation switched from random.randint to secrets.token_bytes — correct use of CSPRNG for security-sensitive material. |
| nvflare/private/fed/server/server_command_agent.py | Bug fix: missing return before make_reply on authentication error prevented early exit, allowing execution to continue after auth failure. |
| nvflare/private/fed/server/client_manager.py | Removes auto-reactivation of clients from unknown/revoked tokens; heartbeats from unrecognized tokens are now rejected unconditionally. |
| nvflare/app_common/aggregators/weighted_aggregation_helper.py | Adds NaN/Inf validation for tensor contributions and weight validation; exception handling in _check_finite silently swallows non-Non-finite errors. |
| nvflare/fuel/hci/client/api.py | Adds _ChallengeReplyProcessor and _request_login_challenge to implement the client-side step of the challenge-response protocol; flows correctly into _user_login. |
| nvflare/fuel/hci/server/sess.py | Passes id_asserter into Session.decode_token for cryptographic token verification; tightens bare except to except Exception. |
| nvflare/fuel/hci/proto.py | Adds LOGIN_CHALLENGE to InternalCommands and its commands list to support the new challenge-response protocol. |
| nvflare/private/fed/server/fed_server.py | Adds sender identity validation before processing process_job_failure, dropping unauthenticated reports early. |
| nvflare/private/fed/server/job_runner.py | Adds a security warning log when unsigned jobs are run with signature verification disabled — good operational visibility. |
Sequence Diagram
sequenceDiagram
participant C as Admin Client
participant S as HCI Server (LoginModule)
participant A as AppDeployer
participant CL as ClassLoader
C->>S: _login_challenge username
S-->>C: CHALLENGE uuid-nonce
Note over S: Stored in _pending_challenges[user]
C->>S: _cert_login username (headers: cert, signature, nonce)
S->>S: _consume_challenge(user) pops nonce one-time
S->>S: identity_verifier.verify_common_name(cert, nonce, sig)
alt verified
S-->>C: OK + session token
else not verified
S-->>C: REJECT
end
Note over A: App Deployment new cleanup logic
A->>A: unzip app to app_path
A->>A: write fl_app.txt and job_meta.json to run_dir
A->>A: AppAuthzService.authorize()
alt auth error or not authorized
A->>A: shutil.rmtree(app_path) run_dir metadata left behind
A-->>A: return error
else authorized
A-->>A: deploy success
end
Note over CL: Class Loading new blocklist
CL->>CL: load_class(class_path)
alt dot in class_path
CL->>CL: check top_module in _BLOCKED_MODULES
alt blocked
CL-->>CL: raise UnsafeClassError
else allowed
CL-->>CL: importlib.import_module returns class
end
else no dot builtins path
CL-->>CL: getattr(builtins, class_path) no blocklist check
end
Reviews (2): Last reviewed commit: "Merge branch 'main' into security/audit-..." | Re-trigger Greptile
| # Use the server-generated challenge nonce (one-time, consumed here). | ||
| # If no pending challenge, reject empty nonces to prevent replay. | ||
| login_nonce = self._consume_challenge(user_name) | ||
| self.logger.info(f"SEC004_DEBUG: user={user_name}, consumed_nonce='{login_nonce}', headers_keys={list(headers.keys())}") |
There was a problem hiding this comment.
Debug log leaks authentication nonce
SEC004_DEBUG is an INFO-level log statement that prints the server-generated nonce alongside the username. This is a debugging leftover that should be removed before merging — auth nonces logged at INFO will appear in production log files and can facilitate replay attacks.
| self.logger.info(f"SEC004_DEBUG: user={user_name}, consumed_nonce='{login_nonce}', headers_keys={list(headers.keys())}") |
There was a problem hiding this comment.
Agreed. This looks like a debugging leftover and should be removed before merge.
| if not login_nonce: | ||
| # No pending server challenge — check if client sent a non-empty nonce | ||
| client_nonce = headers.get("nonce", "") | ||
| if not client_nonce: | ||
| # Empty nonce = replay attempt or legacy client without challenge support. | ||
| # Reject to prevent signature replay attacks. | ||
| self.logger.warning(f"login rejected for {user_name}: empty nonce without server challenge") | ||
| conn.append_string("REJECT") | ||
| return | ||
| login_nonce = client_nonce |
There was a problem hiding this comment.
Challenge fallback re-enables replay attacks
When no server-issued challenge is pending for a user (e.g., challenge expired or was never requested), the code falls back to accepting whatever nonce the client provides (lines 128-135). An attacker who previously captured a valid (nonce, signature) pair can: (1) wait for the 60-second window to lapse, (2) submit a fresh login with the old nonce and old signature — the signature check will still pass because the nonce is attacker-controlled. This completely negates the purpose of the challenge-response mechanism.
If no server-issued challenge exists for the user, the login should be rejected unconditionally:
| if not login_nonce: | |
| # No pending server challenge — check if client sent a non-empty nonce | |
| client_nonce = headers.get("nonce", "") | |
| if not client_nonce: | |
| # Empty nonce = replay attempt or legacy client without challenge support. | |
| # Reject to prevent signature replay attacks. | |
| self.logger.warning(f"login rejected for {user_name}: empty nonce without server challenge") | |
| conn.append_string("REJECT") | |
| return | |
| login_nonce = client_nonce | |
| if not login_nonce: | |
| self.logger.warning(f"login rejected for {user_name}: no pending server challenge") | |
| conn.append_string("REJECT") | |
| return |
There was a problem hiding this comment.
Agreed. If there is no pending server-issued challenge, the login should be rejected rather than falling back to a client-supplied nonce. The original implementation choice here was driven by compatibility with the older admin protocol flow, but that is not a sufficient reason to preserve replay exposure.
| except (TypeError, ValueError) as e: | ||
| if 'Non-finite' in str(e): | ||
| raise |
There was a problem hiding this comment.
Silent exception swallowing hides conversion failures
The except (TypeError, ValueError) block re-raises only when the message contains 'Non-finite'. Any other ValueError or TypeError raised by np.asarray(v) or np.isfinite() (e.g., ambiguous truth value from a multi-element array, unsupported dtype) is silently discarded, allowing a potentially non-finite or un-checkable contribution through without detection. Per the project's custom instruction, prefer raising an explicit error instead of silent handling.
| except (TypeError, ValueError) as e: | |
| if 'Non-finite' in str(e): | |
| raise | |
| try: | |
| if hasattr(v, 'cpu'): | |
| arr = v.cpu().numpy() | |
| else: | |
| arr = np.asarray(v) | |
| if not np.all(np.isfinite(arr)): | |
| raise ValueError( | |
| f"Non-finite values (NaN/Inf) detected in key '{key}' " | |
| f"from contributor '{contributor_name}' at round {contribution_round}. " | |
| f"Contribution rejected." | |
| ) | |
| except (TypeError, ValueError) as e: | |
| raise ValueError( | |
| f"Failed to validate finiteness of key '{key}' from contributor " | |
| f"'{contributor_name}' at round {contribution_round}: {e}" | |
| ) from e |
Rule Used: When data_loader might be empty during loss comput... (source)
Learnt From
NVIDIA/NVFlare#4001
There was a problem hiding this comment.
We do not view this as a high-priority security issue. Rejecting NaN and Inf is reasonable as a hygiene improvement, but it does not materially address the broader model-poisoning problem. In practice, valid finite values can be more damaging than NaNs during an attack. If we keep this validation, explicit failure on uncheckable inputs is fine, but we would not position it as a meaningful security mitigation by itself.
| import shutil | ||
|
|
||
| if os.path.exists(app_path): | ||
| shutil.rmtree(app_path, ignore_errors=True) | ||
| return err | ||
|
|
||
| if not authorized: | ||
| import shutil | ||
|
|
||
| if os.path.exists(app_path): | ||
| shutil.rmtree(app_path, ignore_errors=True) | ||
| return "not authorized" |
There was a problem hiding this comment.
Redundant inline imports —
shutil is already at module level
shutil is imported at the top of the file (line 17). The two import shutil statements inside the function body are unnecessary.
| import shutil | |
| if os.path.exists(app_path): | |
| shutil.rmtree(app_path, ignore_errors=True) | |
| return err | |
| if not authorized: | |
| import shutil | |
| if os.path.exists(app_path): | |
| shutil.rmtree(app_path, ignore_errors=True) | |
| return "not authorized" | |
| if err: | |
| if os.path.exists(app_path): | |
| shutil.rmtree(app_path, ignore_errors=True) | |
| return err | |
| if not authorized: | |
| if os.path.exists(app_path): | |
| shutil.rmtree(app_path, ignore_errors=True) | |
| return "not authorized" |
There was a problem hiding this comment.
Agreed. This is just cleanup; the inline imports are unnecessary since shutil is already imported at module scope.
Response to Security Audit ReportThank you for the detailed review and for providing isolated fixes and reproductions. We reviewed the findings against the current codebase and our intended trust model. Below is our response to each item. Finding-by-Finding ResponseSEC-001: Unrestricted Class LoadingWe do not agree with the current framing or severity. In NVFlare, custom code execution is an inherent part of the job model. A user who is permitted to submit and run custom jobs is already effectively able to execute code remotely through the normal framework mechanisms. For that reason, we do not consider this a standalone critical privilege-escalation vulnerability. That said, we do agree there is room to improve policy control around restricted modules. The right direction is not a blanket prohibition on dynamic class loading, but a mechanism that allows certain jobs or trusted contexts to load from restricted modules while preventing unsafe usage in other contexts. SEC-002: AuthzFilter Role Field Uses Wrong ConstantAccepted. This is a real bug. It was already known internally and had been fixed in a previous PR, but that PR was later backed out. We agree this should be corrected. SEC-003: Missing
|
|
/build |
|
Tip: Greploop — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
| except UnsafeClassError: | ||
| raise |
There was a problem hiding this comment.
Builtins branch bypasses the module blocklist
The else branch (triggered when class_path has no .) calls getattr(builtins, class_path) with no filtering. A job config that specifies "exec", "eval", "open", "compile", or "__import__" as a component class will return those builtins directly. If the framework then instantiates the returned callable with attacker-controlled arguments, this becomes arbitrary code/file execution — exactly what _BLOCKED_MODULES is designed to prevent.
A simple fix is to maintain an explicit allowlist of safe builtins:
_SAFE_BUILTINS = frozenset({"int", "float", "str", "bool", "list", "dict", "tuple", "set", "bytes"})
# in load_class, else branch:
else:
if not allow_unsafe and class_path not in _SAFE_BUILTINS:
raise UnsafeClassError(
f"Loading builtin '{class_path}' is not allowed in job configs."
)
return getattr(builtins, class_path)| if err: | ||
| import shutil | ||
|
|
||
| if os.path.exists(app_path): | ||
| shutil.rmtree(app_path, ignore_errors=True) | ||
| return err | ||
|
|
||
| if not authorized: | ||
| import shutil | ||
|
|
||
| if os.path.exists(app_path): | ||
| shutil.rmtree(app_path, ignore_errors=True) | ||
| return "not authorized" |
There was a problem hiding this comment.
Incomplete cleanup leaves metadata files on disk
When authorization fails, only app_path (the unzipped app content) is removed. By this point, run_dir has already been populated with fl_app.txt (lines 59–60) and job_meta_file (lines 62–63), which contain the app name and job metadata including submitter name, org, and role. Removing only app_path leaves these files behind in run_dir.
The cleanup should target run_dir to fully remove all artifacts written before the authorization check:
if err:
if os.path.exists(run_dir):
shutil.rmtree(run_dir, ignore_errors=True)
return err
if not authorized:
if os.path.exists(run_dir):
shutil.rmtree(run_dir, ignore_errors=True)
return "not authorized"Note: shutil is already imported at module level (line 17), so the inline import shutil statements are also unnecessary.
Fixes # .
Description
A few sentences describing the changes proposed in this pull request.
Types of changes
./runtest.sh.