[TRTLLM-10688][fix] fix cross-node rollout issues in verl #11924

Merged
hchings merged 8 commits into NVIDIA:main from hchings:verl_fix
Mar 20, 2026

Conversation

@hchings
Collaborator

@hchings hchings commented Mar 5, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced environment variable handling in distributed worker initialization by filtering node-local variables to prevent conflicts
    • Improved GPU worker isolation to prevent race conditions during parallel initialization with dedicated cache directories per worker

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
@hchings hchings changed the title [None][fix] fix ray worker raylet issue for multi-node in verl [None][fix] fix cross-node rollout issues in verl Mar 5, 2026
@hchings hchings changed the title [None][fix] fix cross-node rollout issues in verl [TRTLLM-10688][fix] fix cross-node rollout issues in verl Mar 5, 2026
@hchings hchings marked this pull request as ready for review March 10, 2026 06:38
@hchings hchings requested a review from a team as a code owner March 10, 2026 06:38
@hchings hchings requested a review from syuoni March 10, 2026 06:38
@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

These changes improve Ray worker environment isolation by filtering node-local environment variables in the executor and establishing per-worker DeepGemm JIT cache directories to prevent file conflicts among co-located workers.

Changes

  • Ray Worker Environment Setup (tensorrt_llm/executor/ray_executor.py, tensorrt_llm/executor/ray_gpu_worker.py): filters out node-local environment variables (RAY_RAYLET_PID, RAY_NODE_IP_ADDRESS) in create_workers while preserving other configuration; sets a per-worker DeepGemm JIT cache directory keyed on rank and GPU index to isolate cache files.
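The environment filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual create_workers implementation; the helper name and structure are assumptions, while the two variable names come from the summary:

```python
import os

# Node-local Ray variables named in the PR summary. Forwarding them to a
# worker scheduled on a different node would point it at the wrong raylet
# or node IP, so they are dropped; Ray repopulates them on each node.
NODE_LOCAL_VARS = {"RAY_RAYLET_PID", "RAY_NODE_IP_ADDRESS"}

def filtered_worker_env(env=None):
    """Return a copy of `env` (default: os.environ) without node-local vars."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in NODE_LOCAL_VARS}
```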

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check: ⚠️ Warning. The PR description is empty; only the template structure is present, with no explanation of what is being fixed or why. Resolution: complete the Description section by explaining the cross-node rollout issues and how the Ray worker environment filtering and DeepGemm cache changes address them, and add a Test Coverage section detailing the relevant tests.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The title clearly and specifically identifies the main change (fixing cross-node rollout issues in verl) with the proper ticket format [TRTLLM-10688] and type [fix].


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/executor/ray_gpu_worker.py`:
- Around line 62-63: The DG_JIT_CACHE_DIR value currently uses only rank and
self.gpu and can collide across separate Ray jobs; update the assignment in
ray_gpu_worker.py to generate a unique per-process/job directory (e.g., use
tempfile.mkdtemp or append a UUID/pid) and set os.environ["DG_JIT_CACHE_DIR"] to
that path; also add import tempfile to the stdlib imports (or import
uuid/os.getpid if using those) so each executor gets a non-colliding cache
directory (refer to the DG_JIT_CACHE_DIR assignment and use of rank and self.gpu
to locate the change).
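The reviewer's suggested fix can be sketched like this (an illustrative snippet only; the function name and prefix format are assumptions, with rank and gpu_index standing in for the rank and self.gpu values the comment refers to):

```python
import os
import tempfile

def set_worker_jit_cache_dir(rank: int, gpu_index: int) -> str:
    # tempfile.mkdtemp creates a fresh directory on every call, so two Ray
    # jobs landing on the same node cannot collide even with identical
    # rank/GPU pairs, unlike a path built from rank and GPU index alone.
    cache_dir = tempfile.mkdtemp(prefix=f"dg_jit_rank{rank}_gpu{gpu_index}_")
    os.environ["DG_JIT_CACHE_DIR"] = cache_dir
    return cache_dir
```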

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98ec96f6-e59c-48d5-9852-b5bbca2cefd7

📥 Commits

Reviewing files that changed from the base of the PR and between 460889f and a97905f.

📒 Files selected for processing (2)
  • tensorrt_llm/executor/ray_executor.py
  • tensorrt_llm/executor/ray_gpu_worker.py

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
@hchings
Collaborator Author

hchings commented Mar 10, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38398 [ run ] triggered by Bot. Commit: 77ec629 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38398 [ run ] completed with state SUCCESS. Commit: 77ec629
/LLM/main/L0_MergeRequest_PR pipeline #29761 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

hchings added 2 commits March 10, 2026 14:26
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Removed redundant line break in JIT cache directory setup.

Signed-off-by: Erin <14718778+hchings@users.noreply.github.com>
Collaborator

@Superjomn Superjomn left a comment


LGTM

@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38536 [ run ] triggered by Bot. Commit: 545ad83 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38536 [ run ] completed with state SUCCESS. Commit: 545ad83
/LLM/main/L0_MergeRequest_PR pipeline #29883 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38621 [ run ] triggered by Bot. Commit: 545ad83 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38621 [ run ] completed with state SUCCESS. Commit: 545ad83
/LLM/main/L0_MergeRequest_PR pipeline #29955 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings hchings enabled auto-merge (squash) March 11, 2026 20:46
@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38633 [ run ] triggered by Bot. Commit: 09bad4e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38633 [ run ] completed with state SUCCESS. Commit: 09bad4e
/LLM/main/L0_MergeRequest_PR pipeline #29965 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 12, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38784 [ run ] triggered by Bot. Commit: 09bad4e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38784 [ run ] completed with state SUCCESS. Commit: 09bad4e
/LLM/main/L0_MergeRequest_PR pipeline #30098 completed with status: 'SUCCESS'

CI Report

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 13, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38838 [ run ] triggered by Bot. Commit: 87e0bf9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38838 [ run ] completed with state SUCCESS. Commit: 87e0bf9
/LLM/main/L0_MergeRequest_PR pipeline #30148 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 13, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38906 [ run ] triggered by Bot. Commit: 87e0bf9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38906 [ run ] completed with state FAILURE. Commit: 87e0bf9
/LLM/main/L0_MergeRequest_PR pipeline #30215 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 16, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39130 [ run ] triggered by Bot. Commit: 93f1f8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39130 [ run ] completed with state FAILURE. Commit: 93f1f8b
/LLM/main/L0_MergeRequest_PR pipeline #30389 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 17, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39303 [ run ] triggered by Bot. Commit: b7c686f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39303 [ run ] completed with state SUCCESS. Commit: b7c686f
/LLM/main/L0_MergeRequest_PR pipeline #30553 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 19, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39556 [ run ] triggered by Bot. Commit: b7c686f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39556 [ run ] completed with state SUCCESS. Commit: b7c686f
/LLM/main/L0_MergeRequest_PR pipeline #30773 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@hchings hchings merged commit 4a8b6b8 into NVIDIA:main Mar 20, 2026
6 of 7 checks passed
@hchings hchings deleted the verl_fix branch March 20, 2026 01:16