[https://nvbugs/5972776][fix] Pass IPC HMAC key through file descriptor #14378

Merged
yibinl-nvidia merged 3 commits into NVIDIA:main from yibinl-nvidia:fix/5972776-ipc-hmac-key-fd
May 28, 2026

Conversation

@yibinl-nvidia
Collaborator

@yibinl-nvidia yibinl-nvidia commented May 21, 2026

Summary by CodeRabbit

  • Improvements

    • Enhanced IPC process security by updating how authentication keys are transmitted between processes, replacing environment variable-based passing with a more secure file descriptor-based mechanism to reduce key exposure.
  • Tests

    • Added test coverage for the updated authentication key provisioning behavior.

Review Change Stack

Description

This prevents another process from stealing the HMAC key by reading it from an environment variable.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
@yibinl-nvidia yibinl-nvidia marked this pull request as ready for review May 21, 2026 22:37
@yibinl-nvidia yibinl-nvidia requested a review from a team as a code owner May 21, 2026 22:37
@yibinl-nvidia yibinl-nvidia requested a review from Superjomn May 21, 2026 22:37
@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough


This PR switches IPC HMAC key provisioning from environment variables to file descriptors across Python executor launch paths, disaggregated leader spawn, shell script launch, and test coverage. The core mechanism caches and reads keys from FDs, with coordinated updates in serve.py (disaggregated leader), the llmapi launch script, and corresponding test coverage.

Changes

IPC HMAC Key File Descriptor Mechanism

| Layer / File(s) | Summary |
|---|---|
| Core FD-based HMAC key utilities (tensorrt_llm/executor/utils.py) | Add TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY_FD enum member, module-level key caching with set_spawn_proxy_process_ipc_hmac_key() setter, FD reading helper that normalizes bytes to 32-byte key, and updated get_spawn_proxy_process_ipc_hmac_key_env() that prefers FD source, caches result, and asserts on missing key. |
| Disaggregated leader launch with FD-based key (tensorrt_llm/commands/serve.py) | Update imports, generate and cache HMAC key in _launch_disaggregated_leader, remove old env var assertion, create pipe and write key to it, pass read-end FD to child via pass_fds, and clean up pipe FDs in finally block. |
| Shell script wrapper and execution points (tensorrt_llm/llmapi/trtllm-llmapi-launch) | Introduce run_with_ipc_hmac_key wrapper that generates key at runtime and passes via FD using bash exec {fd}<<<...; wrap Rank0 task execution, MGMN leader-node stop, and leader-node start through wrapper. |
| Test coverage for FD-based HMAC key reading (tests/unittest/executor/test_launcher_envs.py) | Add test helpers for cache/env cleanup and pipe FD setup; test FD reading and env var removal, caching behavior across calls, and assertion when HMAC key source is missing. |
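
A minimal sketch of the pipe-based handoff summarized in the rows above, not the PR's actual code. It assumes a POSIX system, and the environment variable name `DEMO_IPC_HMAC_KEY_FD`, the helper `launch_child_with_key`, and the raw-bytes wire format are hypothetical choices made for illustration. The parent writes the key into a pipe, hands only the read end to the child through `pass_fds`, and tells the child which descriptor number to read; the key itself never appears in the environment.

```python
import os
import secrets
import subprocess
import sys

# Hypothetical name for this sketch only. The variable carries the FD number,
# never the key itself.
KEY_FD_ENV = "DEMO_IPC_HMAC_KEY_FD"

# Child side: pop the FD number from the environment, read the key until EOF,
# and close the descriptor. Printing the key is for demonstration only; a real
# child would keep it in memory.
CHILD_CODE = f"""
import os
fd = int(os.environ.pop({KEY_FD_ENV!r}))
with os.fdopen(fd, "rb") as f:
    key = f.read()
print(key.hex())
"""


def launch_child_with_key(key: bytes) -> str:
    """Spawn a child that receives `key` over an inherited pipe FD."""
    read_fd, write_fd = os.pipe()
    try:
        # A 32-byte key fits in the pipe buffer, so this write does not block.
        os.write(write_fd, key)
        # Close the write end before waiting on the child so its read() sees EOF.
        os.close(write_fd)
        write_fd = -1

        env = dict(os.environ, **{KEY_FD_ENV: str(read_fd)})
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            env=env,
            pass_fds=(read_fd,),  # keep only this FD open in the child
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    finally:
        # Always release the parent's copies of the pipe FDs.
        for fd in (read_fd, write_fd):
            if fd >= 0:
                os.close(fd)


key = secrets.token_bytes(32)
echoed = launch_child_with_key(key)
```

Unlike an environment variable, which other processes running as the same user can typically read from `/proc/<pid>/environ` on Linux, the pipe contents are consumed once by the intended child.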

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is incomplete. It explains the 'why' (prevent key theft) but lacks a detailed 'what' section, test coverage details, and most of the PR checklist items remain unchecked despite the author marking completion. | Expand description to detail the implementation changes across files, specifically list test cases added, and ensure all applicable PR checklist items are properly addressed and verified. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: passing the IPC HMAC key through a file descriptor instead of environment variables, directly addressing the security concern. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tensorrt_llm/executor/utils.py (2)

36-45: 💤 Low value

Potential UnicodeDecodeError when normalizing non-hex binary bytes.

If key is bytes with length ≠ 32 and contains non-ASCII characters, key.decode("ascii") on line 40 will raise UnicodeDecodeError. This could happen if the caller passes raw binary bytes that aren't hex-encoded.

Consider adding explicit handling or documenting that non-32-byte inputs must be ASCII hex strings:

🛡️ Proposed defensive fix
 def _normalize_spawn_proxy_process_ipc_hmac_key(key: str | bytes) -> bytes:
     if isinstance(key, bytes):
         if len(key) == 32:
             return key
-        key = key.decode("ascii")
+        try:
+            key = key.decode("ascii")
+        except UnicodeDecodeError as e:
+            raise ValueError(
+                "IPC HMAC key bytes must be exactly 32 bytes or ASCII hex-encoded"
+            ) from e

     key_bytes = bytes.fromhex(key)
     if len(key_bytes) != 32:
         raise ValueError("IPC HMAC key must be 32 bytes.")
     return key_bytes
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/utils.py` around lines 36 - 45, The function
_normalize_spawn_proxy_process_ipc_hmac_key can raise UnicodeDecodeError when
given non-32 raw bytes containing non-ASCII values; wrap the key.decode("ascii")
in a try/except UnicodeDecodeError and convert that into a clear ValueError
(e.g. "IPC HMAC key must be 32 bytes or an ASCII hex string") so callers get a
deterministic error; keep the existing bytes.fromhex flow and length check for
the decoded hex string and return the 32-byte result if valid.
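
For reference, the proposed defensive fix can be written out as a standalone function. This is a sketch reconstructed from the diff above, not the merged code; the name `normalize_ipc_hmac_key` is shortened for illustration, and `Union` is used in place of `str | bytes` so the snippet also runs on older Python versions.

```python
from typing import Union


def normalize_ipc_hmac_key(key: Union[str, bytes]) -> bytes:
    """Return a 32-byte key from raw bytes or an ASCII hex string."""
    if isinstance(key, bytes):
        # Exactly 32 raw bytes are accepted as-is.
        if len(key) == 32:
            return key
        # Anything else must be ASCII hex; turn a decode failure into a
        # deterministic ValueError instead of leaking UnicodeDecodeError.
        try:
            key = key.decode("ascii")
        except UnicodeDecodeError as e:
            raise ValueError(
                "IPC HMAC key bytes must be exactly 32 bytes or ASCII hex-encoded"
            ) from e

    # bytes.fromhex raises ValueError itself on non-hex characters.
    key_bytes = bytes.fromhex(key)
    if len(key_bytes) != 32:
        raise ValueError("IPC HMAC key must be 32 bytes.")
    return key_bytes


raw = bytes(range(32))
from_raw = normalize_ipc_hmac_key(raw)
from_hex = normalize_ipc_hmac_key(raw.hex())
```

With this shape, every invalid input surfaces as a `ValueError`, so callers only need to handle one exception type.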

87-89: ⚡ Quick win

Consider using ValueError instead of AssertionError for missing key.

AssertionError is conventionally reserved for programming errors caught during development. Note that -O strips only assert statements; an explicit raise AssertionError, as used here, still fires. For a runtime configuration error, a ValueError or custom exception communicates the intent more clearly.

♻️ Proposed fix
-    raise AssertionError(
+    raise ValueError(
         f"{LlmLauncherEnvs.TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY_FD} is not set. "
         "HMAC encryption is required for IPC communication.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/utils.py` around lines 87 - 89, Replace the runtime
check that raises AssertionError with a ValueError to signal a missing runtime
configuration; specifically, in tensorrt_llm/executor/utils.py update the
exception raised where LlmLauncherEnvs.TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY_FD
is validated (the block that currently raises AssertionError saying the HMAC key
FD is not set) to raise ValueError with the same descriptive message so the
error correctly represents a configuration/runtime error rather than a failed
internal invariant.
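
A small check of how `-O` interacts with the two styles may help here. The sketch below runs three one-line snippets under `python -O` and records their exit codes; the helper name `exit_code_under_optimize` is hypothetical. It illustrates that `-O` removes `assert` statements but leaves an explicit `raise` untouched, whichever exception class is raised.

```python
import subprocess
import sys


def exit_code_under_optimize(snippet: str) -> int:
    """Run `snippet` in a fresh interpreter with -O and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-O", "-c", snippet],
        capture_output=True,
    )
    return completed.returncode


# The assert statement is compiled away under -O, so the process exits cleanly.
assert_stmt_code = exit_code_under_optimize("assert False, 'key missing'")

# An explicit raise is ordinary code and still fails under -O.
explicit_assertion_code = exit_code_under_optimize(
    "raise AssertionError('key missing')"
)
explicit_value_error_code = exit_code_under_optimize(
    "raise ValueError('key missing')"
)
```

Under this reading, switching to `ValueError` is a question of signaling a configuration error to callers, not of keeping the check alive under `-O`.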

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f193e4b7-543c-44ca-93e9-c7ed36a338c0

📥 Commits

Reviewing files that changed from the base of the PR and between 3b8387c and 525f500.

📒 Files selected for processing (4)
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/executor/utils.py
  • tensorrt_llm/llmapi/trtllm-llmapi-launch
  • tests/unittest/executor/test_launcher_envs.py

@yibinl-nvidia
Collaborator Author

/bot run


@tensorrt-cicd
Collaborator

PR_Github #49788 [ run ] triggered by Bot. Commit: 525f500 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49788 [ run ] completed with state SUCCESS. Commit: 525f500
/LLM/main/L0_MergeRequest_PR pipeline #39379 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49848 [ run ] triggered by Bot. Commit: 525f500 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49848 [ run ] completed with state SUCCESS. Commit: 525f500
/LLM/main/L0_MergeRequest_PR pipeline #39432 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yibinl-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49905 [ run ] triggered by Bot. Commit: 525f500 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49905 [ run ] completed with state SUCCESS. Commit: 525f500
/LLM/main/L0_MergeRequest_PR pipeline #39484 completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

@Superjomn Superjomn left a comment


LGTM

@yibinl-nvidia yibinl-nvidia merged commit 50ca49f into NVIDIA:main May 28, 2026
13 of 14 checks passed

3 participants