[https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test by chenfeiz0326 · Pull Request #12575 · NVIDIA/TensorRT-LLM

chenfeiz0326 · 2026-03-29T15:30:43Z

Summary by CodeRabbit

Bug Fixes
- Added stabilization delays to resolve race conditions during container initialization, improving system reliability.
Tests
- Removed waivers for disaggregated server configuration tests, enabling stricter validation across multiple model variants.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 · 2026-03-29T15:31:59Z

/bot run --disable-fail-fast --post-merge

coderabbitai · 2026-03-29T15:35:11Z

📝 Walkthrough

Walkthrough

This PR makes three focused changes: adds explicit variable declaration in a Groovy test script to improve scoping, inserts timing delays in a shell script to mitigate race conditions during server initialization, and removes specific test waivers from the waiver list for disaggregated configurations.

Changes

Cohort / File(s)	Summary
Groovy Script Update `jenkins/L0_Test.groovy`	Changed `srunArgs` assignment to explicit `def` declaration within closure scope to prevent variable reuse/overwriting in outer scope.
Shell Script Race Condition Mitigation `jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh`	Added four 5-second delay statements to address Pyxis container namespace initialization race conditions: after each gen server launch, after each ctx server launch, after server loops complete, and after disagg server startup.
Test Waiver Cleanup `tests/integration/test_lists/waives.txt`	Removed 5 test waiver entries for `disagg_upload-gen_only` and `disagg_upload-e2e` configurations targeting GB200 with various model variants.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete, containing only the template structure without filling in the required Description, Test Coverage, or PR Checklist sections with actual details.	Fill in the Description section explaining the Pyxis race condition issue and the 5-second delay solution, add Test Coverage listing relevant test files (test_perf_sanity.py, waives.txt updates), and ensure all checklist items are reviewed.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically references the NVBugs issue and describes the main fix for Pyxis error in disaggregated performance tests.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh (1)

31-65: Replace hard-coded startup sleeps with configurable/readiness-based gating.

The added delays (Line 31, Line 47, Line 53, Line 65) reduce race frequency, but fixed sleep 5 is still brittle under variable cluster load and adds avoidable runtime. Prefer readiness polling with timeout; at minimum, make the delay configurable.

Suggested incremental improvement (configurable delay)

+PYXIS_INIT_DELAY_SEC="${PYXIS_INIT_DELAY_SEC:-5}"
+
 ...
-    sleep 5  # Wait for pyxis container namespace initialization to avoid race condition
+    sleep "${PYXIS_INIT_DELAY_SEC}"  # Wait for pyxis container namespace initialization to avoid race condition
 ...
-        sleep 5  # Wait for pyxis container namespace initialization to avoid race condition
+        sleep "${PYXIS_INIT_DELAY_SEC}"  # Wait for pyxis container namespace initialization to avoid race condition
 ...
-sleep 5  # Wait for pyxis container namespace initialization to avoid race condition
+sleep "${PYXIS_INIT_DELAY_SEC}"  # Wait for pyxis container namespace initialization to avoid race condition
 ...
-sleep 5  # Wait for pyxis container namespace initialization to avoid race condition
+sleep "${PYXIS_INIT_DELAY_SEC}"  # Wait for pyxis container namespace initialization to avoid race condition

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh` around lines 31 -
65, Replace the fixed "sleep 5" calls used before/after starting servers with a
configurable timeout and readiness-based polling: add a STARTUP_SLEEP_DEFAULT
env var (or reuse TRTLLM_STARTUP_SLEEP) to set a fallback delay, then implement
a loop that checks service readiness (e.g., poll the generated log files
$jobWorkspace/ctx_server_$i.log and $jobWorkspace/disagg_server.log for a
readiness string, or check srun task/process status) until the readiness
indicator appears or the configured timeout elapses; update the code paths
around the ctx server loop and the disagg server startup where
DISAGG_SERVING_TYPE, pytestCommand, srun and runScript are used to wait via
polling instead of unconditional sleep, and fall back to the configurable sleep
only if polling is unsupported.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh`:
- Around line 31-65: Replace the fixed "sleep 5" calls used before/after
starting servers with a configurable timeout and readiness-based polling: add a
STARTUP_SLEEP_DEFAULT env var (or reuse TRTLLM_STARTUP_SLEEP) to set a fallback
delay, then implement a loop that checks service readiness (e.g., poll the
generated log files $jobWorkspace/ctx_server_$i.log and
$jobWorkspace/disagg_server.log for a readiness string, or check srun
task/process status) until the readiness indicator appears or the configured
timeout elapses; update the code paths around the ctx server loop and the disagg
server startup where DISAGG_SERVING_TYPE, pytestCommand, srun and runScript are
used to wait via polling instead of unconditional sleep, and fall back to the
configurable sleep only if polling is unsupported.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3df89f44-b074-4d3a-955a-2d3d814aa962

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce0518 and cfb3044.

📒 Files selected for processing (3)

jenkins/L0_Test.groovy
jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

tensorrt-cicd · 2026-03-29T15:39:01Z

PR_Github #40600 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-03-29T23:55:57Z

PR_Github #40600 [ run ] completed with state SUCCESS. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #31641 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-03-30T01:39:02Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-30T01:44:36Z

PR_Github #40617 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-03-30T05:41:48Z

PR_Github #40617 [ run ] completed with state SUCCESS. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #31657 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-03-30T10:26:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-30T10:32:29Z

PR_Github #40721 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-03-30T18:54:26Z

PR_Github #40721 [ run ] completed with state SUCCESS. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #31747 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-03-31T03:53:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-31T03:59:05Z

PR_Github #40847 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-03-31T07:31:09Z

PR_Github #40847 [ run ] completed with state SUCCESS. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #31856 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-04-01T01:33:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T01:39:35Z

PR_Github #41058 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-04-01T03:24:09Z

PR_Github #41058 [ run ] completed with state FAILURE. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #32035 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-04-01T05:37:42Z

/bot run --disable-fail-fast

chenfeiz0326 · 2026-04-01T05:37:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T05:43:42Z

PR_Github #41119 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-04-01T05:43:56Z

PR_Github #41120 [ run ] triggered by Bot. Commit: cfb3044 Link to invocation

tensorrt-cicd · 2026-04-01T05:43:58Z

PR_Github #41119 [ run ] completed with state ABORTED. Commit: cfb3044

Link to invocation

tensorrt-cicd · 2026-04-01T09:39:39Z

PR_Github #41120 [ run ] completed with state SUCCESS. Commit: cfb3044
/LLM/main/L0_MergeRequest_PR pipeline #32092 completed with status: 'SUCCESS'

CI Report

Link to invocation

…IDIA#12575) Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster> Co-authored-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

update

cfb3044

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 requested review from a team as code owners March 29, 2026 15:30

chenfeiz0326 requested review from niukuo and zeroepoch March 29, 2026 15:30

github-actions bot assigned chenfeiz0326 Mar 29, 2026

coderabbitai bot reviewed Mar 29, 2026

View reviewed changes

chenfeiz0326 requested review from EmmaQiaoCh and chzblych March 30, 2026 01:37

chenfeiz0326 requested a review from tburt-nv March 30, 2026 10:22

ZhanruiSunCh approved these changes Mar 30, 2026

View reviewed changes

niukuo approved these changes Mar 30, 2026

View reviewed changes

chenfeiz0326 enabled auto-merge (squash) March 31, 2026 03:54

chenfeiz0326 merged commit 075e36a into NVIDIA:main Apr 1, 2026
9 of 10 checks passed

chenfeiz0326 changed the title ~~[https://nvbugs/5997090][fix] Fix Pyxis Error in Disagg Perf Test~~ [https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test Apr 1, 2026

Conversation

chenfeiz0326 commented Mar 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

chenfeiz0326 commented Mar 29, 2026

Uh oh!

coderabbitai bot commented Mar 29, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Mar 29, 2026

Uh oh!

tensorrt-cicd commented Mar 29, 2026

Uh oh!

chenfeiz0326 commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

chenfeiz0326 commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

chenfeiz0326 commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Mar 31, 2026

Uh oh!

chenfeiz0326 commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

chenfeiz0326 commented Apr 1, 2026

Uh oh!

chenfeiz0326 commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chenfeiz0326 commented Mar 29, 2026 •

edited by coderabbitai bot

Loading