[#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models #9317
Conversation
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
📝 Walkthrough: The changes add a new Nemotron-MOE model entry to the GSM8K and MMLU accuracy reference files and introduce BF16 and FP8 accuracy test methods for the model in the AutoDeploy accuracy test suite.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
164-167: Misleading comment: Nemotron-MOE is not an SSM model.

The comment states "SSMs do not support cache reuse", but Nemotron-MOE is a Mixture of Experts (MOE) model, not a State Space Model (SSM). This is evident from the 'ep' (expert parallelism) sharding dimension at line 180. The `enable_block_reuse: False` setting might still be correct for MOE models, but the comment is misleading.

Update the comment to reflect the actual reason for disabling cache reuse, or remove it if it was incorrectly inherited from TestNemotronH:

```diff
-        # SSMs do not support cache reuse.
+        # MOE models do not support cache reuse.
         "kv_cache_config": {
             "enable_block_reuse": False
         },
```
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
219-221: Verify the need for manual quant_config setting with pre-quantized models.

The comment indicates this manual setting is needed "to get the accuracy threshold", suggesting that AutoDeployLLM doesn't auto-detect quantization for pre-quantized models (the model path contains "FP8-KVFP8"). While this approach works, consider whether:
- AutoDeployLLM should auto-detect quantization from pre-quantized model metadata or path patterns
- This pattern will need to be repeated for every FP8 model test
This might be a broader design consideration for the AutoDeployLLM implementation.
If auto-detection is not feasible, consider documenting this pattern for future test authors or creating a helper method to reduce duplication.
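If a helper is the preferred route, it could look roughly like the sketch below. This is a hedged sketch only: the helper name and the assumption that the harness passes a plain kwargs dict to AutoDeployLLM (as the kv_cache_config snippet above suggests) are not taken from the PR itself.

```python
# Sketch of a possible shared helper; the name and kwargs layout are assumptions.
from tensorrt_llm.quantization.mode import QuantAlgo


def with_fp8_quant_config(default_kwargs: dict) -> dict:
    """Return a copy of the default AutoDeploy kwargs with FP8 quantization set.

    Intended so that tests for pre-quantized FP8 checkpoints pick up the FP8
    accuracy thresholds without each test repeating the same quant_config block.
    """
    kwargs = dict(default_kwargs)
    kwargs["quant_config"] = {
        "quant_algo": QuantAlgo.FP8,
        "kv_cache_quant_algo": QuantAlgo.FP8,
    }
    return kwargs
```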
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3)
tensorrt_llm/quantization/mode.py (1)
- QuantAlgo (23-47)
tests/integration/defs/accuracy/accuracy_core.py (2)
- evaluate (184-247)
- evaluate (868-878)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
- quant_config (342-345)
- quant_config (348-349)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
164-168: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry follows the established pattern with appropriate baseline (BF16) and FP8 quantized accuracy thresholds. The accuracy degradation from BF16 (88.249) to FP8 (86.884) is expected and reasonable for quantization.
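For readers without the diff open, an entry following the established pattern in gsm8k.yaml would look roughly like the sketch below. The model key is illustrative (the exact name is not visible in this review); the thresholds are the ones quoted above.

```yaml
# Illustrative layout only; the model key is an assumption.
Nemotron-MOE:
  - accuracy: 88.249          # BF16 baseline
  - quant_algo: FP8
    kv_cache_quant_algo: FP8
    accuracy: 86.884          # FP8 threshold
```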
tests/integration/defs/accuracy/references/mmlu.yaml (2)
273-277: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry is consistent with the gsm8k.yaml reference file and follows the established pattern for accuracy thresholds.
278-281: Verify the Phi-4-mini-instruct restructuring and missing kv_cache_quant_algo.

This change appears unrelated to the PR's stated objective of adding Nemotron-MOE tests. Additionally, the FP8 entry lacks `kv_cache_quant_algo`, which is inconsistent with similar entries in this file (e.g., microsoft/phi-4 at lines 321-323 has both `quant_algo: FP8` and `kv_cache_quant_algo: FP8`). Please clarify:
- Is this Phi-4-mini-instruct change intentional and part of this PR's scope?
- Should `kv_cache_quant_algo: FP8` be added to maintain consistency?

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (4)
21-21: LGTM! QuantAlgo import added correctly.

The import is necessary for manually configuring FP8 quantization in the test_fp8 method.
157-158: LGTM! Model paths defined correctly.

The BF16 and FP8 model paths follow the same pattern as TestNemotronH and align with the separate test methods for each quantization type.
201-210: LGTM! BF16 test method implemented correctly.

The test follows the established pattern and will validate against the BF16 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml.
213-226: LGTM! FP8 test method structured correctly.

The test follows the established pattern and will validate against the FP8 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml. The manual quant_config setting is explained by the comment, though it may warrant architectural review (see the separate comment above).
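As a rough illustration of the pattern being approved here, the FP8 test presumably looks something like the sketch below. Only AutoDeployLLM, QuantAlgo, the test_fp8 method, and the "FP8-KVFP8" path fragment are explicitly mentioned in this review; the class name, harness, task helpers, and model path in the sketch are assumptions, so treat it as a sketch rather than the PR's actual code.

```python
# Hedged sketch of the reviewed pattern; identifiers flagged below are assumptions.
# AutoDeployLLM, the accuracy harness, and the GSM8K/MMLU task helpers are assumed
# to come from the existing test module and accuracy_core referenced in this review.
from tensorrt_llm.quantization.mode import QuantAlgo


class TestNemotronMOE(LlmapiAccuracyTestHarness):  # class/harness names assumed
    MODEL_NAME = "Nemotron-MOE"  # illustrative
    MODEL_PATH_FP8 = "Nemotron-MOE-FP8-KVFP8"  # illustrative; real path not shown

    def test_fp8(self):
        kwargs = self.get_default_kwargs()  # assumed helper on the test class
        # Pre-quantized checkpoint: set quant_config manually so the accuracy
        # harness selects the FP8 thresholds from gsm8k.yaml / mmlu.yaml.
        kwargs["quant_config"] = {
            "quant_algo": QuantAlgo.FP8,
            "kv_cache_quant_algo": QuantAlgo.FP8,
        }
        with AutoDeployLLM(model=self.MODEL_PATH_FP8, **kwargs) as llm:
            GSM8K(self.MODEL_NAME).evaluate(llm)  # task classes assumed
            MMLU(self.MODEL_NAME).evaluate(llm)
```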
/bot run

PR_Github #25090 [ run ] triggered by Bot. Commit:

PR_Github #25090 [ run ] completed with state

/bot run

PR_Github #25101 [ run ] triggered by Bot. Commit:

PR_Github #25101 [ run ] completed with state

I think you need to enable the test as done for others here:
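Following that suggestion would presumably mean adding the new test IDs to the appropriate test list file, roughly like the sketch below. The list file and the exact test node IDs (class and method names) are assumptions rather than values taken from the PR.

```text
# e.g. in one of the tests/integration/test_lists/ files (exact file not shown here)
accuracy/test_llm_api_autodeploy.py::TestNemotronMOE::test_bf16
accuracy/test_llm_api_autodeploy.py::TestNemotronMOE::test_fp8
```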
@Wanli-Jiang added the models to the model hub. Since the model is now accessible to the pipeline, this PR adds BF16 and FP8 accuracy tests for the Nemotron MOE model and sets up the thresholds for the accuracy tests.
#9316
Summary by CodeRabbit: new BF16 and FP8 accuracy tests for the Nemotron MOE model under AutoDeploy, with the corresponding GSM8K and MMLU accuracy thresholds added to the reference files.