[https://nvbugs/6080024][fix] Fix CudaGraphConfig validation conflict from YAML deep merge by nvchenghaoz · Pull Request #13397 · NVIDIA/TensorRT-LLM

nvchenghaoz · 2026-04-23T22:14:35Z

Summary by CodeRabbit

Chores
- Updated CUDA graph configuration settings for model deployment.
Tests
- Modified test execution scheduling for GPU environments.
- Disabled a test case for a specific model configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

nvchenghaoz · 2026-04-23T22:16:48Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

coderabbitai · 2026-04-23T22:17:47Z

📝 Walkthrough

Walkthrough

Two configuration updates are made: CUDA graph batch size alignment in a model registry configuration file, and test execution scheduling adjustment with a specific test case disabled in integration test configurations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Model Configuration Updates `examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml` | Sets `cuda_graph_config.max_batch_size` to `256` to align with the top-level batch size configuration. |
| Test Configuration Updates `tests/integration/test_lists/test-db/l0_dgx_b200.yml` | Changes the autodeploy/mpi execution stage from `post_merge` to `pre_merge` for the 8-GPU B200 environment; disables the `TestModelRegistryAccuracy::test_autodeploy_from_registry` test for the deepseek-ai_DeepSeek-R1-0528-True variant. |
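
The kind of conflict named in the PR title can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not TensorRT-LLM code: the `deep_merge` and `validate` helpers, the default value `64`, and the rule that the nested `cuda_graph_config.max_batch_size` must equal the top-level `max_batch_size` are all hypothetical. The point is only that a deep merge keeps nested keys from the base config unless the model YAML overrides them explicitly.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on leaves."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def validate(config: dict) -> None:
    """Hypothetical rule: the nested CUDA graph batch size must match the top level."""
    top = config.get("max_batch_size")
    nested = config.get("cuda_graph_config", {}).get("max_batch_size")
    if top is not None and nested is not None and nested != top:
        raise ValueError(
            f"cuda_graph_config.max_batch_size={nested} conflicts with max_batch_size={top}"
        )


def is_valid(config: dict) -> bool:
    """Return True when `validate` accepts the config."""
    try:
        validate(config)
    except ValueError:
        return False
    return True


# Hypothetical shared defaults that every model config is merged onto.
base = {"max_batch_size": 64, "cuda_graph_config": {"max_batch_size": 64}}

# Model YAML that overrides only the top-level key: the nested 64 survives the merge.
model_before = {"max_batch_size": 256}

# Model YAML that also sets the nested key, mirroring the change described above.
model_after = {"max_batch_size": 256, "cuda_graph_config": {"max_batch_size": 256}}

print(is_valid(deep_merge(base, model_before)))
print(is_valid(deep_merge(base, model_after)))
```

Under these assumptions, overriding only the top-level key leaves a stale nested value behind, while setting both keys in the model YAML keeps the merged config consistent.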

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | PR description is completely unfilled; only the template structure remains with no actual content explaining the changes, rationale, or test coverage. | Fill in the Description section explaining the CudaGraphConfig validation conflict and why the YAML changes fix it. Add Test Coverage section listing affected tests. Complete the PR Checklist items. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly specifies the issue (CudaGraphConfig validation conflict from YAML deep merge) and the fix, directly corresponding to changes in the qwen3.5_moe_400b.yaml configuration file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/test_lists/test-db/l0_dgx_b200.yml`:
- Line 376: The test
TestModelRegistryAccuracy::test_autodeploy_from_registry[deepseek-ai_DeepSeek-R1-0528-True]
was silently commented out; either re-enable it in the appropriate test stage or
quarantine it explicitly: if re-enabling, remove the comment so the test runs;
if quarantining, replace the commented entry with a quarantine record that
includes required metadata (nvbug/id, owner, target_reenable_date, and a short
reason) so the omission is tracked and automatically reviewable.
- Line 371: The section header "AutoDeploy Post Merge 8 GPU tests" and the stage
key "stage: pre_merge" are inconsistent; update either the header text to
"AutoDeploy Pre Merge 8 GPU tests" or change the stage value to "post_merge" so
the section label and the stage key match (look for the header string
"AutoDeploy Post Merge 8 GPU tests" and the YAML key "stage" in that block to
make the edit).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0505814-4372-493c-81af-4ce5476d2bcd

📥 Commits

Reviewing files that changed from the base of the PR and between 79396d8 and 2968943.

📒 Files selected for processing (2)

examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
tests/integration/test_lists/test-db/l0_dgx_b200.yml

tensorrt-cicd · 2026-04-23T22:22:38Z

PR_Github #45251 [ run ] triggered by Bot. Commit: 2968943 Link to invocation

tensorrt-cicd · 2026-04-23T23:13:55Z

PR_Github #45251 [ run ] completed with state SUCCESS. Commit: 2968943
/LLM/main/L0_MergeRequest_PR pipeline #35510 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

nvchenghaoz · 2026-05-04T20:12:57Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

tensorrt-cicd · 2026-05-04T20:20:55Z

PR_Github #46684 [ run ] triggered by Bot. Commit: 3ba1341 Link to invocation

nvchenghaoz · 2026-05-05T00:05:29Z

/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1"

tensorrt-cicd · 2026-05-05T01:30:40Z

PR_Github #46684 [ run ] completed with state SUCCESS. Commit: 3ba1341
/LLM/main/L0_MergeRequest_PR pipeline #36723 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

nvchenghaoz · 2026-05-05T02:15:09Z

/bot skip --comment "post merge test fix"

suyoggupta · 2026-05-05T02:29:01Z

is this the only affected model? could we proactively apply this fix to other model yamls?

nvchenghaoz · 2026-05-05T16:33:07Z

/bot skip --comment "post merge test fix"

tensorrt-cicd · 2026-05-05T16:40:48Z

PR_Github #46834 [ skip ] triggered by Bot. Commit: 284f792 Link to invocation

tensorrt-cicd · 2026-05-05T16:55:47Z

PR_Github #46834 [ skip ] completed with state SUCCESS. Commit: 284f792
Skipping testing for commit 284f792

Link to invocation

… from YAML deep merge (NVIDIA#13397)

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

fix the test failure

2968943

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

nvchenghaoz requested a review from a team as a code owner April 23, 2026 22:14

nvchenghaoz requested a review from greg-kwasniewski1 April 23, 2026 22:14

github-actions Bot assigned nvchenghaoz Apr 23, 2026

nvchenghaoz commented Apr 23, 2026

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml Outdated

coderabbitai Bot reviewed Apr 23, 2026

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml Outdated

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml Outdated

Merge branch 'main' into chenghao/fix_nvbug_0423

3ba1341

Merge branch 'main' into chenghao/fix_nvbug_0423

cae6316

revert the changes

284f792

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

suyoggupta approved these changes May 5, 2026

nvchenghaoz merged commit 26fc8b4 into NVIDIA:main May 5, 2026
6 checks passed

3 participants