
[https://nvbugs/6080024][fix] Fix CudaGraphConfig validation conflict from YAML deep merge #13397

Merged
nvchenghaoz merged 4 commits into NVIDIA:main from nv-auto-deploy:chenghao/fix_nvbug_0423
May 5, 2026


Conversation

@nvchenghaoz
Collaborator

@nvchenghaoz nvchenghaoz commented Apr 23, 2026

Summary by CodeRabbit

  • Chores

    • Updated CUDA graph configuration settings for model deployment.
  • Tests

    • Modified test execution scheduling for GPU environments.
    • Disabled a test case for a specific model configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml Outdated
@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Two configuration updates are made: CUDA graph batch size alignment in a model registry configuration file, and test execution scheduling adjustment with a specific test case disabled in integration test configurations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model Configuration Updates: examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml | Sets cuda_graph_config.max_batch_size to 256 to align with the top-level batch size configuration. |
| Test Configuration Updates: tests/integration/test_lists/test-db/l0_dgx_b200.yml | Changes the autodeploy/mpi execution stage from post_merge to pre_merge for the 8-GPU B200 environment; disables the TestModelRegistryAccuracy::test_autodeploy_from_registry test for the deepseek-ai_DeepSeek-R1-0528-True variant. |
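
A minimal sketch of how a YAML deep merge can leave a nested batch size out of step with the top-level one, and why also pinning cuda_graph_config.max_batch_size in the model config avoids it. Everything here is an assumption for illustration: `deep_merge`, `check_batch_sizes`, the `shared_defaults` layer, and the value 64 are hypothetical and are not taken from TensorRT-LLM; only the key names and the value 256 come from the summary above. The real CudaGraphConfig validator may enforce a different rule.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on leaves."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key, so sibling keys that
            # only exist in the base layer survive untouched.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def check_batch_sizes(cfg: dict) -> None:
    """Hypothetical consistency rule, NOT the real CudaGraphConfig validator."""
    top = cfg.get("max_batch_size")
    nested = cfg.get("cuda_graph_config", {}).get("max_batch_size")
    if top is not None and nested is not None and top != nested:
        raise ValueError(
            f"cuda_graph_config.max_batch_size={nested} conflicts with "
            f"max_batch_size={top}"
        )


# Assumed shared defaults layer (illustrative values only).
shared_defaults = {"max_batch_size": 64, "cuda_graph_config": {"max_batch_size": 64}}

# A model config that overrides only the top-level value.
model_before = {"max_batch_size": 256}

# A model config that also sets the nested value, mirroring the described change.
model_after = {"max_batch_size": 256, "cuda_graph_config": {"max_batch_size": 256}}

merged_before = deep_merge(shared_defaults, model_before)
merged_after = deep_merge(shared_defaults, model_after)
```

Under these assumptions, `check_batch_sizes(merged_before)` raises because the nested value inherited from the defaults layer is still 64, while `check_batch_sizes(merged_after)` passes.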

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | PR description is completely unfilled; only the template structure remains with no actual content explaining the changes, rationale, or test coverage. | Fill in the Description section explaining the CudaGraphConfig validation conflict and why the YAML changes fix it. Add a Test Coverage section listing affected tests. Complete the PR Checklist items. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly specifies the issue (CudaGraphConfig validation conflict from YAML deep merge) and the fix, directly corresponding to changes in the qwen3.5_moe_400b.yaml configuration file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/test_lists/test-db/l0_dgx_b200.yml`:
- Line 376: The test
TestModelRegistryAccuracy::test_autodeploy_from_registry[deepseek-ai_DeepSeek-R1-0528-True]
was silently commented out; either re-enable it in the appropriate test stage or
quarantine it explicitly: if re-enabling, remove the comment so the test runs;
if quarantining, replace the commented entry with a quarantine record that
includes required metadata (nvbug/id, owner, target_reenable_date, and a short
reason) so the omission is tracked and automatically reviewable.
- Line 371: The section header "AutoDeploy Post Merge 8 GPU tests" and the stage
key "stage: pre_merge" are inconsistent; update either the header text to
"AutoDeploy Pre Merge 8 GPU tests" or change the stage value to "post_merge" so
the section label and the stage key match (look for the header string
"AutoDeploy Post Merge 8 GPU tests" and the YAML key "stage" in that block to
make the edit).
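
The header/stage mismatch described in the second finding could be caught mechanically. The following is a rough sketch of such a check, not an existing TensorRT-LLM tool: `find_stage_mismatches` is a hypothetical helper, and the layout of `sample` (a comment header followed by a `stage:` key under `terms:`) is an assumed shape for the test-db file rather than a copy of it.

```python
import re

# A comment line that mentions "Pre Merge" or "Post Merge" (also pre_merge / post-merge).
HEADER_RE = re.compile(r"#.*\b(Pre|Post)[ _-]?Merge\b", re.IGNORECASE)
# A YAML line of the form "stage: pre_merge" or "stage: post_merge".
STAGE_RE = re.compile(r"^\s*stage:\s*(pre_merge|post_merge)\s*$")


def find_stage_mismatches(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, stage implied by header, stage found) for each mismatch."""
    mismatches = []
    expected = None  # stage implied by the most recent section header
    for lineno, line in enumerate(text.splitlines(), start=1):
        header = HEADER_RE.search(line)
        if header:
            expected = header.group(1).lower() + "_merge"
            continue
        stage = STAGE_RE.match(line)
        if stage and expected is not None:
            if stage.group(1) != expected:
                mismatches.append((lineno, expected, stage.group(1)))
            expected = None  # each header is checked against one stage key
    return mismatches


# Assumed file shape, for illustration only.
sample = """\
# ------- AutoDeploy Post Merge 8 GPU tests -------
- condition:
    terms:
      stage: pre_merge
      backend: autodeploy
"""

result = find_stage_mismatches(sample)
```

For the assumed sample, `result` contains one entry pointing at the `stage:` line, which is the kind of inconsistency the review comment asks to resolve by editing either the header or the stage value.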

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c0505814-4372-493c-81af-4ce5476d2bcd

📥 Commits

Reviewing files that changed from the base of the PR and between 79396d8 and 2968943.

📒 Files selected for processing (2)
  • examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml

Comment thread tests/integration/test_lists/test-db/l0_dgx_b200.yml Outdated
@tensorrt-cicd
Collaborator

PR_Github #45251 [ run ] triggered by Bot. Commit: 2968943 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45251 [ run ] completed with state SUCCESS. Commit: 2968943
/LLM/main/L0_MergeRequest_PR pipeline #35510 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46684 [ run ] triggered by Bot. Commit: 3ba1341 Link to invocation

@nvchenghaoz
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46684 [ run ] completed with state SUCCESS. Commit: 3ba1341
/LLM/main/L0_MergeRequest_PR pipeline #36723 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
@nvchenghaoz
Collaborator Author

/bot skip --comment "post merge test fix"

@suyoggupta
Collaborator

Is this the only affected model? Could we proactively apply this fix to other model YAMLs?

@nvchenghaoz
Collaborator Author

/bot skip --comment "post merge test fix"

@tensorrt-cicd
Collaborator

PR_Github #46834 [ skip ] triggered by Bot. Commit: 284f792 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46834 [ skip ] completed with state SUCCESS. Commit: 284f792
Skipping testing for commit 284f792

Link to invocation

@nvchenghaoz nvchenghaoz merged commit 26fc8b4 into NVIDIA:main May 5, 2026
6 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
… from YAML deep merge (NVIDIA#13397)

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>

3 participants