Skip to content

[https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back#12159

Merged
chenfeiz0326 merged 1 commit intoNVIDIA:mainfrom
chenfeiz0326:chenfez/unwaive-disagg-gen_only-tests-back
Mar 13, 2026
Merged

[https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back#12159
chenfeiz0326 merged 1 commit intoNVIDIA:mainfrom
chenfeiz0326:chenfez/unwaive-disagg-gen_only-tests-back

Conversation

@chenfeiz0326
Copy link
Collaborator

@chenfeiz0326 chenfeiz0326 commented Mar 12, 2026

Summary by CodeRabbit

Release Notes

  • New Features

  • Documentation

    • Added comprehensive guide for adding and re-enabling perf sanity tests in CI pipelines

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

This PR expands GB200 multi-node performance sanity test coverage by increasing test counts in two existing SBSA configurations, introducing a new 3-node test variant, enabling previously commented performance tests in test database files, and adding documentation for the process.

Changes

Cohort / File(s) Summary
CI Configuration
jenkins/L0_Test.groovy
Updated multi-node SBSA test configurations: increased test counts for two existing entries (2→3, 3→4) and added new GB200 12 GPU, 3-node test configuration with testCount 1.
Test Database Files
tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml, tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml
Enabled previously commented performance tests by removing comment markers, activating 8k1k and qwen3 test variants.
Documentation
jenkins/scripts/perf/README.md, tests/scripts/perf-sanity/README.md
Added comprehensive documentation sections describing steps to add or re-enable perf sanity tests in CI, including configuration procedures, resource requirements, and worked examples.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is empty, containing only the repository template with no actual content in required sections. Provide a concrete description explaining what changes were made, why they were made, and what tests validate the changes. Fill in the Description and Test Coverage sections with specific details.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title references adding 3 disagg gen_only tests back and matches the primary changes shown in test configuration files enabling previously commented tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan for PR comments
  • Generate coding plan

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
tests/scripts/perf-sanity/README.md (4)

309-309: Remove extra spaces inside code spans.

The code spans have trailing spaces that trigger MD038 linting warnings.

📝 Suggested fix
-- If the test line already exists but is **commented out** (prefixed with `# `), remove the `# ` prefix.
+- If the test line already exists but is **commented out** (prefixed with `#`), remove the `#` prefix.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` at line 309, The Markdown line contains
code spans with trailing spaces (e.g., "`# `") that trigger MD038; edit the
README.md line so code spans have no inner trailing spaces (change
backtick-wrapped "`# `" to "`#`" and similarly remove any other extra spaces
inside backticks on that line), ensuring the visible text and meaning remain the
same but the code spans contain no leading/trailing spaces.

332-334: Add language identifier to fenced code block.

This stage naming convention block should have a language identifier.

📝 Suggested fix
-```
+```text
 GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 332 - 334, The fenced code
block containing the stage name string
GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
should include a language identifier (e.g., add "text" after the opening triple
backticks) so the block becomes ```text ... ```, ensuring consistent markdown
rendering.

283-290: Add language identifiers to fenced code blocks.

These code blocks showing file path patterns would benefit from language identifiers (e.g., text) for better rendering and to satisfy linting rules.

📝 Suggested fix
-```
+```text
 l0_{gpu_type}_multi_nodes_perf_sanity_ctx{num_ctx}_node{nodes_per_ctx}_gpu{ctx_tp}_gen{num_gen}_node{nodes_per_gen}_gpu{gen_tp}.yml
-```
+```

 Example: for ctx_tp=1, gen_tp=8, 1 ctx worker, 1 gen worker on GB200:
-```
+```text
 l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node2_gpu8.yml
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 283 - 290, Update the two
fenced code blocks that show filename patterns by adding a language identifier
(use "text") to each opening triple-backtick fence; specifically change the
block starting with "```" before "l0_{gpu_type}_multi_nodes_perf_sanity..." and
the block before "l0_gb200_multi_nodes_perf_sanity..." to "```text" so the
example filename patterns are rendered and linted correctly.

246-248: Add language identifier to fenced code block.

This code block shows a filename pattern and would benefit from a language identifier for better rendering.

📝 Suggested fix
-```
+```text
 {gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` around lines 246 - 248, The fenced code
block containing the filename pattern
"{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml"
should include a language identifier for better rendering; change the opening
fence from ``` to ```text so the block becomes a plain-text fenced block while
leaving the content unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/scripts/perf-sanity/README.md`:
- Line 309: The Markdown line contains code spans with trailing spaces (e.g.,
"`# `") that trigger MD038; edit the README.md line so code spans have no inner
trailing spaces (change backtick-wrapped "`# `" to "`#`" and similarly remove
any other extra spaces inside backticks on that line), ensuring the visible text
and meaning remain the same but the code spans contain no leading/trailing
spaces.
- Around line 332-334: The fenced code block containing the stage name string
GB200-{gpuCount}_GPUs-{nodeCount}_Nodes-PyTorch-Disagg-PerfSanity-CTX{num_ctx}-NODE{nodes_per_ctx}-GPU{ctx_tp}-GEN{num_gen}-NODE{nodes_per_gen}-GPU{gen_tp}-Post-Merge
should include a language identifier (e.g., add "text" after the opening triple
backticks) so the block becomes ```text ... ```, ensuring consistent markdown
rendering.
- Around line 283-290: Update the two fenced code blocks that show filename
patterns by adding a language identifier (use "text") to each opening
triple-backtick fence; specifically change the block starting with "```" before
"l0_{gpu_type}_multi_nodes_perf_sanity..." and the block before
"l0_gb200_multi_nodes_perf_sanity..." to "```text" so the example filename
patterns are rendered and linted correctly.
- Around line 246-248: The fenced code block containing the filename pattern
"{gpu_type}_{model}-{precision}_{ISL}k{OSL}k_con{concurrency}_ctx{ctx_count}_tp{ctx_tp}_gen{gen_count}_{gen_parallelism}_eplb{N}_mtp{N}_ccb-{transport}.yaml"
should include a language identifier for better rendering; change the opening
fence from ``` to ```text so the block becomes a plain-text fenced block while
leaving the content unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ed514eb3-5211-4b5e-8443-cc8bfc1b1ec3

📥 Commits

Reviewing files that changed from the base of the PR and between adfc542 and 02dd8da.

📒 Files selected for processing (5)
  • jenkins/L0_Test.groovy
  • jenkins/scripts/perf/README.md
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu2.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml
  • tests/scripts/perf-sanity/README.md

@chenfeiz0326
Copy link
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38715 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38715 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30033 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@chienchunhung chienchunhung changed the title [None][infra] Add 3 Disagg gen_only tests back [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back Mar 12, 2026
@chienchunhung chienchunhung self-requested a review March 12, 2026 16:32
Copy link
Collaborator

@chienchunhung chienchunhung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@pcastonguay
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38757 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38757 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30073 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chienchunhung
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38795 [ run ] triggered by Bot. Commit: 02dd8da Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38795 [ run ] completed with state SUCCESS. Commit: 02dd8da
/LLM/main/L0_MergeRequest_PR pipeline #30108 completed with status: 'SUCCESS'

CI Report

Link to invocation

@chenfeiz0326 chenfeiz0326 enabled auto-merge (squash) March 13, 2026 02:11
Copy link
Collaborator

@ZhanruiSunCh ZhanruiSunCh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chenfeiz0326 chenfeiz0326 merged commit 94f9489 into NVIDIA:main Mar 13, 2026
11 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants