Skip to content

Security/audit fixes 2026 q2#4422

Closed
GeorgeWang-nv wants to merge 13 commits intoNVIDIA:mainfrom
GeorgeWang-nv:security/audit-fixes-2026Q2
Closed

Security/audit fixes 2026 q2#4422
GeorgeWang-nv wants to merge 13 commits intoNVIDIA:mainfrom
GeorgeWang-nv:security/audit-fixes-2026Q2

Conversation

@GeorgeWang-nv
Copy link
Copy Markdown
Contributor

Fixes # .

Description

A few sentences describing the changes proposed in this pull request.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

GeorgeWang-nv and others added 12 commits April 9, 2026 11:56
Add _BLOCKED_MODULES frozenset and UnsafeClassError. load_class() now
checks the top-level module against the blocklist before importing,
blocking dangerous modules like subprocess, os, shutil, etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed ConnProps.SUBMITTER_ORG to ConnProps.SUBMITTER_ROLE for the
role field in the submitter Person object construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The make_reply() call for AUTHENTICATION_ERROR was missing a return
statement, causing the code to continue processing even after auth failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace empty nonce with server-generated one-time challenge:
1. Client sends _login_challenge → server generates UUID nonce, stores it
2. Client signs (CN + nonce) → sends to server with _cert_login
3. Server consumes the stored nonce (one-time), verifies signature
4. Challenge expires after 60 seconds
5. Empty nonce without a pending challenge is REJECTED (prevents replay)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- get_session() now accepts and forwards id_asserter to decode_token()
- Updated pre_command() caller in login.py to pass id_asserter,
  ensuring token signatures are verified on every admin command
- Without this, decode_token() skips signature check and only looks up
  the session ID — a captured token works without valid signature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate the sender's identity via validate_client() before processing
job failure reports, preventing unauthenticated clients from aborting
arbitrary jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reject contributions with NaN/Inf tensor values and invalid weights
(NaN, Inf, negative) to prevent model poisoning attacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, heartbeat() would re-create a client record for unknown tokens,
allowing revoked clients to bypass removal. Now it rejects unknown tokens
and requires clients to re-register properly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use os.open() with 0o600 mode instead of plain open() for .key files
in legacy provisioning, preventing world-readable private keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace random.randint (Mersenne Twister, non-cryptographic) with
secrets.token_bytes (CSPRNG) for SNP attestation nonce generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log a SECURITY WARNING when running unsigned jobs with signature
verification disabled, making the security posture visible to admins.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add shutil.rmtree() cleanup after AppAuthzService.authorize() failure
to prevent unauthorized job files from persisting on disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR applies a batch of Q2 2026 security and audit fixes across the admin login flow, app deployment, class loading, certificate provisioning, and server-side authentication. Most individual fixes are sound (CSPRNG nonce generation, private key file permissions, return on auth error, revoked-token rejection, role-field copy-paste fix), but two new issues were introduced:

  • class_loader.py: The new _BLOCKED_MODULES guard covers dotted module paths but the else branch (no-dot paths) loads directly from builtins with no filtering, making exec, eval, open, compile, and __import__ accessible via job configs.
  • app_deployer.py: The new auth-failure cleanup removes app_path but leaves run_dir populated with fl_app.txt and job_meta.json (submitter metadata written before the auth check).

Confidence Score: 3/5

Not safe to merge — two new P1 defects were introduced, and two previously-flagged security issues remain unresolved in login.py.

The PR fixes several real bugs (missing return on auth error, CSPRNG nonce, file permission hardening, role-field copy-paste, revoked-token rejection), but introduces a builtins bypass in the new class-loader blocklist and an incomplete cleanup path in app_deployer.py. Combined with the two carry-over issues in login.py flagged in prior review rounds, there are four substantive issues that need resolution before this security-focused PR is production-ready.

nvflare/fuel/utils/class_loader.py (builtins bypass), nvflare/private/fed/utils/app_deployer.py (incomplete cleanup), nvflare/fuel/hci/server/login.py (debug nonce log + client-nonce fallback still unresolved).

Security Review

  • nvflare/fuel/utils/class_loader.py — builtins bypass: The _BLOCKED_MODULES blocklist only guards the dotted-module path. The else branch (getattr(builtins, class_path)) is unguarded, allowing job configs to reference exec, eval, open, compile, or __import__ as component classes without restriction.
  • nvflare/fuel/hci/server/login.py — debug log leaks consumed nonce (line 125): SEC004_DEBUG INFO-level log emits the server-generated nonce after it is consumed; this line was flagged in a prior review and has not been removed.
  • nvflare/fuel/hci/server/login.py — client-nonce fallback (lines 126–135): When no server challenge exists, a non-empty client-supplied nonce is still accepted, re-enabling signature replay; also flagged previously and not yet fixed.

Important Files Changed

Filename Overview
nvflare/fuel/utils/class_loader.py Adds module blocklist for job-config class loading, but the builtins branch (no-dot paths) has no filtering — exec, eval, open etc. are reachable without restriction.
nvflare/private/fed/utils/app_deployer.py Adds cleanup on auth/authz failure, but only removes app_path; run_dir with fl_app.txt and job metadata is left on disk. Also contains redundant inline import shutil (already at module level).
nvflare/fuel/hci/server/login.py Implements challenge-response login skeleton, but the SEC004_DEBUG info log (line 125) still leaks the consumed nonce, the client-nonce fallback (lines 126–135) still enables replay, and _pending_challenges has no eviction for abandoned sessions.
nvflare/fuel/hci/server/authz.py Fixes copy-paste bug where submitter role was being set to submitter org; correct field (SUBMITTER_ROLE) now used.
nvflare/lighter/impl/cert.py Private key files now written with os.open and mode 0o600, preventing world-readable key material on disk.
nvflare/app_opt/confidential_computing/snp_authorizer.py Nonce generation switched from random.randint to secrets.token_bytes — correct use of CSPRNG for security-sensitive material.
nvflare/private/fed/server/server_command_agent.py Bug fix: missing return before make_reply on authentication error prevented early exit, allowing execution to continue after auth failure.
nvflare/private/fed/server/client_manager.py Removes auto-reactivation of clients from unknown/revoked tokens; heartbeats from unrecognized tokens are now rejected unconditionally.
nvflare/app_common/aggregators/weighted_aggregation_helper.py Adds NaN/Inf validation for tensor contributions and weight validation; exception handling in _check_finite silently swallows non-Non-finite errors.
nvflare/fuel/hci/client/api.py Adds _ChallengeReplyProcessor and _request_login_challenge to implement the client-side step of the challenge-response protocol; flows correctly into _user_login.
nvflare/fuel/hci/server/sess.py Passes id_asserter into Session.decode_token for cryptographic token verification; tightens bare except to except Exception.
nvflare/fuel/hci/proto.py Adds LOGIN_CHALLENGE to InternalCommands and its commands list to support the new challenge-response protocol.
nvflare/private/fed/server/fed_server.py Adds sender identity validation before processing process_job_failure, dropping unauthenticated reports early.
nvflare/private/fed/server/job_runner.py Adds a security warning log when unsigned jobs are run with signature verification disabled — good operational visibility.

Sequence Diagram

sequenceDiagram
    participant C as Admin Client
    participant S as HCI Server (LoginModule)
    participant A as AppDeployer
    participant CL as ClassLoader

    C->>S: _login_challenge username
    S-->>C: CHALLENGE uuid-nonce
    Note over S: Stored in _pending_challenges[user]

    C->>S: _cert_login username (headers: cert, signature, nonce)
    S->>S: _consume_challenge(user) pops nonce one-time
    S->>S: identity_verifier.verify_common_name(cert, nonce, sig)
    alt verified
        S-->>C: OK + session token
    else not verified
        S-->>C: REJECT
    end

    Note over A: App Deployment new cleanup logic
    A->>A: unzip app to app_path
    A->>A: write fl_app.txt and job_meta.json to run_dir
    A->>A: AppAuthzService.authorize()
    alt auth error or not authorized
        A->>A: shutil.rmtree(app_path) run_dir metadata left behind
        A-->>A: return error
    else authorized
        A-->>A: deploy success
    end

    Note over CL: Class Loading new blocklist
    CL->>CL: load_class(class_path)
    alt dot in class_path
        CL->>CL: check top_module in _BLOCKED_MODULES
        alt blocked
            CL-->>CL: raise UnsafeClassError
        else allowed
            CL-->>CL: importlib.import_module returns class
        end
    else no dot builtins path
        CL-->>CL: getattr(builtins, class_path) no blocklist check
    end
Loading

Reviews (2): Last reviewed commit: "Merge branch 'main' into security/audit-..." | Re-trigger Greptile

# Use the server-generated challenge nonce (one-time, consumed here).
# If no pending challenge, reject empty nonces to prevent replay.
login_nonce = self._consume_challenge(user_name)
self.logger.info(f"SEC004_DEBUG: user={user_name}, consumed_nonce='{login_nonce}', headers_keys={list(headers.keys())}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security Debug log leaks authentication nonce

SEC004_DEBUG is an INFO-level log statement that prints the server-generated nonce alongside the username. This is a debugging leftover that should be removed before merging — auth nonces logged at INFO will appear in production log files and can facilitate replay attacks.

Suggested change
self.logger.info(f"SEC004_DEBUG: user={user_name}, consumed_nonce='{login_nonce}', headers_keys={list(headers.keys())}")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This looks like a debugging leftover and should be removed before merge.

Comment on lines +126 to +135
if not login_nonce:
# No pending server challenge — check if client sent a non-empty nonce
client_nonce = headers.get("nonce", "")
if not client_nonce:
# Empty nonce = replay attempt or legacy client without challenge support.
# Reject to prevent signature replay attacks.
self.logger.warning(f"login rejected for {user_name}: empty nonce without server challenge")
conn.append_string("REJECT")
return
login_nonce = client_nonce
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security Challenge fallback re-enables replay attacks

When no server-issued challenge is pending for a user (e.g., challenge expired or was never requested), the code falls back to accepting whatever nonce the client provides (lines 128-135). An attacker who previously captured a valid (nonce, signature) pair can: (1) wait for the 60-second window to lapse, (2) submit a fresh login with the old nonce and old signature — the signature check will still pass because the nonce is attacker-controlled. This completely negates the purpose of the challenge-response mechanism.

If no server-issued challenge exists for the user, the login should be rejected unconditionally:

Suggested change
if not login_nonce:
# No pending server challenge — check if client sent a non-empty nonce
client_nonce = headers.get("nonce", "")
if not client_nonce:
# Empty nonce = replay attempt or legacy client without challenge support.
# Reject to prevent signature replay attacks.
self.logger.warning(f"login rejected for {user_name}: empty nonce without server challenge")
conn.append_string("REJECT")
return
login_nonce = client_nonce
if not login_nonce:
self.logger.warning(f"login rejected for {user_name}: no pending server challenge")
conn.append_string("REJECT")
return

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. If there is no pending server-issued challenge, the login should be rejected rather than falling back to a client-supplied nonce. The original implementation choice here was driven by compatibility with the older admin protocol flow, but that is not a sufficient reason to preserve replay exposure.

Comment on lines +122 to +124
except (TypeError, ValueError) as e:
if 'Non-finite' in str(e):
raise
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent exception swallowing hides conversion failures

The except (TypeError, ValueError) block re-raises only when the message contains 'Non-finite'. Any other ValueError or TypeError raised by np.asarray(v) or np.isfinite() (e.g., ambiguous truth value from a multi-element array, unsupported dtype) is silently discarded, allowing a potentially non-finite or un-checkable contribution through without detection. Per the project's custom instruction, prefer raising an explicit error instead of silent handling.

Suggested change
except (TypeError, ValueError) as e:
if 'Non-finite' in str(e):
raise
try:
if hasattr(v, 'cpu'):
arr = v.cpu().numpy()
else:
arr = np.asarray(v)
if not np.all(np.isfinite(arr)):
raise ValueError(
f"Non-finite values (NaN/Inf) detected in key '{key}' "
f"from contributor '{contributor_name}' at round {contribution_round}. "
f"Contribution rejected."
)
except (TypeError, ValueError) as e:
raise ValueError(
f"Failed to validate finiteness of key '{key}' from contributor "
f"'{contributor_name}' at round {contribution_round}: {e}"
) from e

Rule Used: When data_loader might be empty during loss comput... (source)

Learnt From
NVIDIA/NVFlare#4001

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not view this as a high-priority security issue. Rejecting NaN and Inf is reasonable as a hygiene improvement, but it does not materially address the broader model-poisoning problem. In practice, valid finite values can be more damaging than NaNs during an attack. If we keep this validation, explicit failure on uncheckable inputs is fine, but we would not position it as a meaningful security mitigation by itself.

Comment on lines +76 to 87
import shutil

if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return err

if not authorized:
import shutil

if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return "not authorized"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant inline imports — shutil is already at module level

shutil is imported at the top of the file (line 17). The two import shutil statements inside the function body are unnecessary.

Suggested change
import shutil
if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return err
if not authorized:
import shutil
if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return "not authorized"
if err:
if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return err
if not authorized:
if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return "not authorized"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is just cleanup; the inline imports are unnecessary since shutil is already imported at module scope.

Copy link
Copy Markdown
Collaborator

nvidianz commented Apr 9, 2026

Response to Security Audit Report

Thank you for the detailed review and for providing isolated fixes and reproductions. We reviewed the findings against the current codebase and our intended trust model. Below is our response to each item.

Finding-by-Finding Response

SEC-001: Unrestricted Class Loading

We do not agree with the current framing or severity.

In NVFlare, custom code execution is an inherent part of the job model. A user who is permitted to submit and run custom jobs is already effectively able to execute code remotely through the normal framework mechanisms. For that reason, we do not consider this a standalone critical privilege-escalation vulnerability.

That said, we do agree there is room to improve policy control around restricted modules. The right direction is not a blanket prohibition on dynamic class loading, but a mechanism that allows certain jobs or trusted contexts to load from restricted modules while preventing unsafe usage in other contexts.

SEC-002: AuthzFilter Role Field Uses Wrong Constant

Accepted.

This is a real bug. It was already known internally and had been fixed in a previous PR, but that PR was later backed out. We agree this should be corrected.

SEC-003: Missing return in Authentication Check

Accepted.

This is a good finding. The missing return should be fixed.

SEC-004: Admin Login Replay Due to Empty Nonce

Accepted.

The current implementation was originally chosen to keep the message flow aligned with the older admin protocol, which did not use a challenge-response step. That historical reason does not outweigh the replay concern. The proposed challenge-response fix looks good.

SEC-005: Session Token Signature Verification Skipped

Accepted.

This is a good fix and should be applied.

SEC-006: Unauthenticated Job Abort

Partially accepted, lower priority.

We agree the current path should be hardened. However, this is lower priority for us than several other findings. Before adopting the proposed change, we need to run tests to confirm that the header information required by validate_client() is consistently present on this message path.

SEC-007: NaN Model Poisoning

Partially accepted, but we disagree with the effectiveness framing.

Model poisoning is a serious problem, but this specific fix only addresses one narrow form of malicious contribution. Rejecting NaN and Inf values is reasonable as a hygiene improvement, but it does not materially solve the broader poisoning problem. In practice, carefully chosen finite values can be more damaging than NaNs during an attack.

We therefore do not consider this a complete or high-confidence mitigation for adversarial clients.

SEC-008: Heartbeat Bypasses Token Revocation

Accepted.

This appears to be a good fix.

SEC-009: Private Keys Written Without Restrictive Permissions

Accepted, low priority.

This is a valid finding, but we consider it lower priority than the authentication and session-management issues.

SEC-010: Non-Cryptographic RNG for SNP Nonce

Accepted.

This is a good finding and an easy fix.

SEC-011: Job Signing Disabled in Non-PKI Deployments

By design.

Unsigned jobs in non-PKI environments are an intentional tradeoff for development and simulation workflows. We do not consider this a vulnerability. Adding a warning message to make the security posture explicit is reasonable.

SEC-012: ZIP Extracted Before Authorization Check

Partially accepted.

We agree that cleanup on authorization failure is a good improvement. However, we do not consider the current behavior to present a meaningful security risk. This is primarily a cleanup and robustness issue rather than a security vulnerability.

Summary

We appreciate the review and agree that several findings should be fixed, especially around authentication, session validation, and certain authorization paths. At the same time, we believe a few items are overstated in severity or are better categorized as design tradeoffs, hardening opportunities, or cleanup improvements rather than actionable security vulnerabilities.

In particular:

  • We accept: SEC-002, SEC-003, SEC-004, SEC-005, SEC-008, SEC-010
  • We accept with lower priority or caveats: SEC-006, SEC-009, SEC-012
  • We disagree with the current framing/severity: SEC-001, SEC-007
  • We consider by design: SEC-011

We are happy to continue discussing threat model assumptions and prioritize fixes accordingly.

@chesterxgchen
Copy link
Copy Markdown
Collaborator

/build

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 10, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Comment on lines +79 to +80
except UnsafeClassError:
raise
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 security Builtins branch bypasses the module blocklist

The else branch (triggered when class_path has no .) calls getattr(builtins, class_path) with no filtering. A job config that specifies "exec", "eval", "open", "compile", or "__import__" as a component class will return those builtins directly. If the framework then instantiates the returned callable with attacker-controlled arguments, this becomes arbitrary code/file execution — exactly what _BLOCKED_MODULES is designed to prevent.

A simple fix is to maintain an explicit allowlist of safe builtins:

_SAFE_BUILTINS = frozenset({"int", "float", "str", "bool", "list", "dict", "tuple", "set", "bytes"})

# in load_class, else branch:
else:
    if not allow_unsafe and class_path not in _SAFE_BUILTINS:
        raise UnsafeClassError(
            f"Loading builtin '{class_path}' is not allowed in job configs."
        )
    return getattr(builtins, class_path)

Comment on lines 75 to 87
if err:
import shutil

if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return err

if not authorized:
import shutil

if os.path.exists(app_path):
shutil.rmtree(app_path, ignore_errors=True)
return "not authorized"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Incomplete cleanup leaves metadata files on disk

When authorization fails, only app_path (the unzipped app content) is removed. By this point, run_dir has already been populated with fl_app.txt (lines 59–60) and job_meta_file (lines 62–63), which contain the app name and job metadata including submitter name, org, and role. Removing only app_path leaves these files behind in run_dir.

The cleanup should target run_dir to fully remove all artifacts written before the authorization check:

            if err:
                if os.path.exists(run_dir):
                    shutil.rmtree(run_dir, ignore_errors=True)
                return err

            if not authorized:
                if os.path.exists(run_dir):
                    shutil.rmtree(run_dir, ignore_errors=True)
                return "not authorized"

Note: shutil is already imported at module level (line 17), so the inline import shutil statements are also unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants