[TRTLLM-10688][fix] fix cross-node rollout issues in verl #11924

Merged
hchings merged 8 commits into NVIDIA:main from hchings:verl_fix
Mar 20, 2026

Conversation

@hchings
Collaborator

@hchings hchings commented Mar 5, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced environment variable handling in distributed worker initialization by filtering node-local variables to prevent conflicts
    • Improved GPU worker isolation to prevent race conditions during parallel initialization with dedicated cache directories per worker

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
@hchings hchings changed the title [None][fix] fix ray worker raylet issue for multi-node in verl [None][fix] fix cross-node rollout issues in verl Mar 5, 2026
@hchings hchings changed the title [None][fix] fix cross-node rollout issues in verl [TRTLLM-10688][fix] fix cross-node rollout issues in verl Mar 5, 2026
@hchings hchings marked this pull request as ready for review March 10, 2026 06:38
@hchings hchings requested a review from a team as a code owner March 10, 2026 06:38
@hchings hchings requested a review from syuoni March 10, 2026 06:38
@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

These changes improve Ray worker environment isolation by filtering node-local environment variables in the executor and establishing per-worker DeepGemm JIT cache directories to prevent file conflicts among co-located workers.

Changes

  • Ray Worker Environment Setup (tensorrt_llm/executor/ray_executor.py, tensorrt_llm/executor/ray_gpu_worker.py): filters out node-local environment variables (RAY_RAYLET_PID, RAY_NODE_IP_ADDRESS) in create_workers while preserving other configuration; sets a per-worker DeepGemm JIT cache directory keyed on rank and GPU index to isolate cache files.
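The environment filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual create_workers implementation; the helper name and structure are assumptions, while the two variable names come from the summary:

```python
import os

# Node-local Ray variables named in the PR summary. Forwarding them to a
# worker scheduled on a different node would point it at the wrong raylet
# or node IP, so they are dropped; Ray repopulates them on each node.
NODE_LOCAL_VARS = {"RAY_RAYLET_PID", "RAY_NODE_IP_ADDRESS"}

def filtered_worker_env(env=None):
    """Return a copy of `env` (default: os.environ) without node-local vars."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in NODE_LOCAL_VARS}
```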

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check: ⚠️ Warning. The PR description is empty; only the template structure is present, with no explanation of what is being fixed or why. Resolution: complete the Description section by explaining the cross-node rollout issues and how the Ray worker environment filtering and DeepGemm cache changes address them, and add a Test Coverage section detailing the relevant tests.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The title clearly and specifically identifies the main change (fixing cross-node rollout issues in verl) with the proper ticket format [TRTLLM-10688] and type [fix].


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/executor/ray_gpu_worker.py`:
- Around line 62-63: The DG_JIT_CACHE_DIR value currently uses only rank and
self.gpu and can collide across separate Ray jobs; update the assignment in
ray_gpu_worker.py to generate a unique per-process/job directory (e.g., use
tempfile.mkdtemp or append a UUID/pid) and set os.environ["DG_JIT_CACHE_DIR"] to
that path; also add import tempfile to the stdlib imports (or import
uuid/os.getpid if using those) so each executor gets a non-colliding cache
directory (refer to the DG_JIT_CACHE_DIR assignment and use of rank and self.gpu
to locate the change).
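The reviewer's suggested fix can be sketched like this (an illustrative snippet only; the function name and prefix format are assumptions, with rank and gpu_index standing in for the rank and self.gpu values the comment refers to):

```python
import os
import tempfile

def set_worker_jit_cache_dir(rank: int, gpu_index: int) -> str:
    # tempfile.mkdtemp creates a fresh directory on every call, so two Ray
    # jobs landing on the same node cannot collide even with identical
    # rank/GPU pairs, unlike a path built from rank and GPU index alone.
    cache_dir = tempfile.mkdtemp(prefix=f"dg_jit_rank{rank}_gpu{gpu_index}_")
    os.environ["DG_JIT_CACHE_DIR"] = cache_dir
    return cache_dir
```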

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98ec96f6-e59c-48d5-9852-b5bbca2cefd7

📥 Commits

Reviewing files that changed from the base of the PR and between 460889f and a97905f.

📒 Files selected for processing (2)
  • tensorrt_llm/executor/ray_executor.py
  • tensorrt_llm/executor/ray_gpu_worker.py

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
@hchings
Collaborator Author

hchings commented Mar 10, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38398 [ run ] triggered by Bot. Commit: 77ec629 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38398 [ run ] completed with state SUCCESS. Commit: 77ec629
/LLM/main/L0_MergeRequest_PR pipeline #29761 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

hchings added 2 commits March 10, 2026 14:26
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Removed redundant line break in JIT cache directory setup.

Signed-off-by: Erin <14718778+hchings@users.noreply.github.com>
Collaborator

@Superjomn Superjomn left a comment


LGTM

@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38536 [ run ] triggered by Bot. Commit: 545ad83 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38536 [ run ] completed with state SUCCESS. Commit: 545ad83
/LLM/main/L0_MergeRequest_PR pipeline #29883 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38621 [ run ] triggered by Bot. Commit: 545ad83 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38621 [ run ] completed with state SUCCESS. Commit: 545ad83
/LLM/main/L0_MergeRequest_PR pipeline #29955 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings hchings enabled auto-merge (squash) March 11, 2026 20:46
@hchings
Collaborator Author

hchings commented Mar 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38633 [ run ] triggered by Bot. Commit: 09bad4e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38633 [ run ] completed with state SUCCESS. Commit: 09bad4e
/LLM/main/L0_MergeRequest_PR pipeline #29965 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 12, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38784 [ run ] triggered by Bot. Commit: 09bad4e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38784 [ run ] completed with state SUCCESS. Commit: 09bad4e
/LLM/main/L0_MergeRequest_PR pipeline #30098 completed with status: 'SUCCESS'

CI Report

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 13, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38838 [ run ] triggered by Bot. Commit: 87e0bf9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38838 [ run ] completed with state SUCCESS. Commit: 87e0bf9
/LLM/main/L0_MergeRequest_PR pipeline #30148 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 13, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38906 [ run ] triggered by Bot. Commit: 87e0bf9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38906 [ run ] completed with state FAILURE. Commit: 87e0bf9
/LLM/main/L0_MergeRequest_PR pipeline #30215 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 16, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39130 [ run ] triggered by Bot. Commit: 93f1f8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39130 [ run ] completed with state FAILURE. Commit: 93f1f8b
/LLM/main/L0_MergeRequest_PR pipeline #30389 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 17, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39303 [ run ] triggered by Bot. Commit: b7c686f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39303 [ run ] completed with state SUCCESS. Commit: b7c686f
/LLM/main/L0_MergeRequest_PR pipeline #30553 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@hchings
Collaborator Author

hchings commented Mar 19, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39556 [ run ] triggered by Bot. Commit: b7c686f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39556 [ run ] completed with state SUCCESS. Commit: b7c686f
/LLM/main/L0_MergeRequest_PR pipeline #30773 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@hchings hchings merged commit 4a8b6b8 into NVIDIA:main Mar 20, 2026
6 of 7 checks passed
@hchings hchings deleted the verl_fix branch March 20, 2026 01:16