[https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back by chenfeiz0326 · Pull Request #12159 · NVIDIA/TensorRT-LLM

chenfeiz0326 · 2026-03-12T11:01:30Z

Summary by CodeRabbit

Release Notes

New Features
- https://nvbugspro.nvidia.com/bug/5948539 has fixed gen_only hang issue. So add 3 disagg gen_only tests back.
Documentation
- Added comprehensive guide for adding and re-enabling perf sanity tests in CI pipelines

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

coderabbitai · 2026-03-12T11:05:06Z

📝 Walkthrough

Walkthrough

This PR expands GB200 multi-node performance sanity test coverage by increasing test counts in two existing SBSA configurations, introducing a new 3-node test variant, enabling previously commented performance tests in test database files, and adding documentation for the process.

Changes

Cohort / File(s)	Summary
CI Configuration `jenkins/L0_Test.groovy`	Updated multi-node SBSA test configurations: increased test counts for two existing entries (2→3, 3→4) and added new GB200 12 GPU, 3-node test configuration with testCount 1.
Test Database Files `tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml`, `tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml`	Enabled previously commented performance tests by removing comment markers, activating 8k1k and qwen3 test variants.
Documentation `jenkins/scripts/perf/README.md`, `tests/scripts/perf-sanity/README.md`	Added comprehensive documentation sections describing steps to add or re-enable perf sanity tests in CI, including configuration procedures, resource requirements, and worked examples.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request description is empty, containing only the repository template with no actual content in required sections.	Provide a concrete description explaining what changes were made, why they were made, and what tests validate the changes. Fill in the Description and Test Coverage sections with specific details.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check	✅ Passed	The title references adding 3 disagg gen_only tests back and matches the primary changes shown in test configuration files enabling previously commented tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan for PR comments

Generate coding plan

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (4)

tests/scripts/perf-sanity/README.md (4)

309-309: Remove extra spaces inside code spans.

The code spans have trailing spaces that trigger MD038 linting warnings.

📝 Suggested fix

-- If the test line already exists but is **commented out** (prefixed with `# `), remove the `# ` prefix.
+- If the test line already exists but is **commented out** (prefixed with `#`), remove the `#` prefix.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` at line 309, The Markdown line contains
code spans with trailing spaces (e.g., "`# `") that trigger MD038; edit the
README.md line so code spans have no inner trailing spaces (change
backtick-wrapped "`# `" to "`#`" and similarly remove any other extra spaces
inside backticks on that line), ensuring the visible text and meaning remain the
same but the code spans contain no leading/trailing spaces.

332-334: Add language identifier to fenced code block.

This stage naming convention block should have a language identifier.

📝 Suggested fix

-```
+```text
 GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
-```
+```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 332 - 334, The fenced code
block containing the stage name string
GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
should include a language identifier (e.g., add "text" after the opening triple
backticks) so the block becomes ```text ... ```, ensuring consistent markdown
rendering.

283-290: Add language identifiers to fenced code blocks.

These code blocks showing file path patterns would benefit from language identifiers (e.g., text) for better rendering and to satisfy linting rules.

📝 Suggested fix

-```
+```text
 l0_{gpu_type}_multi_nodes_perf_sanity_ctx{num_ctx}_node{nodes_per_ctx}_gpu{ctx_tp}_gen{num_gen}_node{nodes_per_gen}_gpu{gen_tp}.yml
-```
+```

 Example: for ctx_tp=1, gen_tp=8, 1 ctx worker, 1 gen worker on GB200:
-```
+```text
 l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8.yml
-```
+```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 283 - 290, Update the two
fenced code blocks that show filename patterns by adding a language identifier
(use "text") to each opening triple-backtick fence; specifically change the
block starting with "```" before "l0_{gpu_type}_multi_nodes_perf_sanity..." and
the block before "l0_gb200_multi_nodes_perf_sanity..." to "```text" so the
example filename patterns are rendered and linted correctly.

246-248: Add language identifier to fenced code block.

This code block shows a filename pattern and would benefit from a language identifier for better rendering.

📝 Suggested fix

-```
+```text
 {gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml
-```
+```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 246 - 248, The fenced code
block containing the filename pattern
"{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml"
should include a language identifier for better rendering; change the opening
fence from ``` to ```text so the block becomes a plain-text fenced block while
leaving the content unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/scripts/perf-sanity/README.md`:
- Line 309: The Markdown line contains code spans with trailing spaces (e.g.,
"`# `") that trigger MD038; edit the README.md line so code spans have no inner
trailing spaces (change backtick-wrapped "`# `" to "`#`" and similarly remove
any other extra spaces inside backticks on that line), ensuring the visible text
and meaning remain the same but the code spans contain no leading/trailing
spaces.
- Around line 332-334: The fenced code block containing the stage name string
GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
should include a language identifier (e.g., add "text" after the opening triple
backticks) so the block becomes ```text ... ```, ensuring consistent markdown
rendering.
- Around line 283-290: Update the two fenced code blocks that show filename
patterns by adding a language identifier (use "text") to each opening
triple-backtick fence; specifically change the block starting with "```" before
"l0_{gpu_type}_multi_nodes_perf_sanity..." and the block before
"l0_gb200_multi_nodes_perf_sanity..." to "```text" so the example filename
patterns are rendered and linted correctly.
- Around line 246-248: The fenced code block containing the filename pattern
"{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml"
should include a language identifier for better rendering; change the opening
fence from ``` to ```text so the block becomes a plain-text fenced block while
leaving the content unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ed514eb3-5211-4b5e-8443-cc8bfc1b1ec3

📥 Commits

Reviewing files that changed from the base of the PR and between adfc542 and 02dd8da.

📒 Files selected for processing (5)

jenkins/L0_Test.groovy
jenkins/scripts/perf/README.md
tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml
tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml
tests/scripts/perf-sanity/README.md

chenfeiz0326 · 2026-03-12T11:06:39Z

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1"

tensorrt-cicd · 2026-03-12T11:12:10Z

PR_Github #38715 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

tensorrt-cicd · 2026-03-12T13:01:47Z

PR_Github #38715 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30033 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

chienchunhung

Thanks for the PR!

pcastonguay · 2026-03-12T16:42:24Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-12T16:48:58Z

PR_Github #38757 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

tensorrt-cicd · 2026-03-12T23:28:23Z

PR_Github #38757 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30073 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chienchunhung · 2026-03-13T00:14:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-13T00:21:54Z

PR_Github #38795 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

tensorrt-cicd · 2026-03-13T01:48:31Z

PR_Github #38795 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30108 completed with status: 'SUCCESS'

CI Report

Link to invocation

ZhanruiSunCh

LGTM

update

02dd8da

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

chenfeiz0326 requested review from a team as code owners March 12, 2026 11:01

chenfeiz0326 requested review from ruodil and yuanjingx87 March 12, 2026 11:01

github-actions bot assigned chenfeiz0326 Mar 12, 2026

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

chienchunhung changed the title ~~[None][infra] Add 3 Disagg gen_only tests back~~ [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back Mar 12, 2026

chienchunhung self-requested a review March 12, 2026 16:32

chienchunhung approved these changes Mar 12, 2026

View reviewed changes

pcastonguay approved these changes Mar 12, 2026

View reviewed changes

chenfeiz0326 enabled auto-merge (squash) March 13, 2026 02:11

chenfeiz0326 requested a review from ZhanruiSunCh March 13, 2026 02:13

ZhanruiSunCh approved these changes Mar 13, 2026

View reviewed changes

chenfeiz0326 merged commit 94f9489 into NVIDIA:main Mar 13, 2026
11 of 14 checks passed

Conversation

chenfeiz0326 commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

chenfeiz0326 commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

chienchunhung left a comment

Choose a reason for hiding this comment

Uh oh!

pcastonguay commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

tensorrt-cicd commented Mar 12, 2026

Uh oh!

chienchunhung commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

tensorrt-cicd commented Mar 13, 2026

Uh oh!

ZhanruiSunCh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

chenfeiz0326 commented Mar 12, 2026 •

edited

Loading

coderabbitai bot commented Mar 12, 2026 •

edited

Loading