
Conversation

Collaborator

@nvchenghaoz nvchenghaoz commented Nov 19, 2025

@Wanli-Jiang added the models to the model hub, and since the model is now accessible to the pipeline, this PR adds BF16 and FP8 accuracy tests for the Nemotron MOE model and sets up the thresholds for those accuracy tests.

#9316

Summary by CodeRabbit

  • New Features
    • Added nvidia/Nemotron-MOE model support with BF16 and FP8 quantization options.
    • Published accuracy benchmarks for Nemotron-MOE on GSM8K and MMLU tasks.
    • Updated accuracy metrics for Phi-4-mini-instruct with FP8 quantization data.


Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Contributor

coderabbitai bot commented Nov 19, 2025

📝 Walkthrough

Walkthrough

The changes add a new model entry, nvidia/Nemotron-MOE, to the accuracy reference files with both unquantized (BF16) and FP8-quantized configurations. The existing microsoft/Phi-4-mini-instruct entry is restructured to nest its accuracy under an FP8 quantization config. The test suite is updated with new BF16 and FP8 test methods and corresponding model path attributes.

Changes

Cohort / File(s) Summary
Accuracy reference data
tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml
Added nvidia/Nemotron-MOE model entries with unquantized and FP8-quantized accuracy scores. Modified microsoft/Phi-4-mini-instruct entry in MMLU to nest accuracy under FP8 quantization config.
Test implementation
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
Imported QuantAlgo from tensorrt_llm.quantization. Added MODEL_PATH_BF16 and MODEL_PATH_FP8 class attributes to TestNemotronMOE. Replaced test_auto_dtype with separate test_bf16 and test_fp8 test methods.
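For orientation, the sketch below shows roughly what the restructured test class could look like. It is assembled from the walkthrough above, not copied from the diff: the harness classes, the AutoDeployLLM constructor, the model directories, and the way quant_config is assigned are all assumptions.

```python
# Hypothetical sketch of the restructured tests; class names, paths, and the
# quant_config assignment are assumptions based on the walkthrough, not the diff.
from tensorrt_llm.quantization import QuantAlgo

# Assumed to come from the existing accuracy-test harness in this file:
# from .accuracy_core import GSM8K, MMLU, LlmapiAccuracyTestHarness
# AutoDeployLLM is assumed to be the AutoDeploy LLM entry point already used here.


class TestNemotronMOE(LlmapiAccuracyTestHarness):
    MODEL_NAME = "nvidia/Nemotron-MOE"
    MODEL_PATH_BF16 = "<models-root>/Nemotron-MOE-BF16"       # placeholder path
    MODEL_PATH_FP8 = "<models-root>/Nemotron-MOE-FP8-KVFP8"   # placeholder path

    def test_bf16(self):
        # Unquantized run, scored against the baseline thresholds in gsm8k.yaml / mmlu.yaml.
        with AutoDeployLLM(model=self.MODEL_PATH_BF16, **self.get_default_kwargs()) as llm:
            GSM8K(self.MODEL_NAME).evaluate(llm)
            MMLU(self.MODEL_NAME).evaluate(llm)

    def test_fp8(self):
        with AutoDeployLLM(model=self.MODEL_PATH_FP8, **self.get_default_kwargs()) as llm:
            # The checkpoint is pre-quantized; quant_config is tagged explicitly so the
            # harness resolves the FP8 thresholds instead of the BF16 baseline.
            # (The exact attribute path is an assumption.)
            llm.args.quant_config.quant_algo = QuantAlgo.FP8
            GSM8K(self.MODEL_NAME).evaluate(llm)
            MMLU(self.MODEL_NAME).evaluate(llm)
```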

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review accuracy data additions to verify correctness of numeric values and YAML structure alignment
  • Verify that new test methods properly configure quantization algorithms and evaluate expected benchmarks

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding accuracy tests for Nemotron MOE models with the appropriate GitHub issue reference and feature type.
Description check | ✅ Passed | The description explains the context and purpose of the changes (adding BF16 and FP8 accuracy tests for the Nemotron MOE model), but lacks detail on implementation specifics and does not fully follow the template structure with distinct sections.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

164-167: Misleading comment: Nemotron-MOE is not an SSM model.

The comment states "SSMs do not support cache reuse", but Nemotron-MOE is a Mixture of Experts (MOE) model, not a State Space Model (SSM). This is evident from the 'ep' (expert parallelism) sharding dimension at line 180. The enable_block_reuse: False setting might still be correct for MOE models, but the comment is misleading.

Update the comment to reflect the actual reason for disabling cache reuse, or remove it if inherited from TestNemotronH incorrectly:

-            # SSMs do not support cache reuse.
+            # MOE models do not support cache reuse.
             "kv_cache_config": {
                 "enable_block_reuse": False
             },
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

219-221: Verify the need for manual quant_config setting with pre-quantized models.

The comment indicates this manual setting is needed "to get the accuracy threshold", suggesting that AutoDeployLLM doesn't auto-detect quantization for pre-quantized models (the model path contains "FP8-KVFP8"). While this approach works, consider whether:

  1. AutoDeployLLM should auto-detect quantization from pre-quantized model metadata or path patterns
  2. This pattern will need to be repeated for every FP8 model test

This might be a broader design consideration for the AutoDeployLLM implementation.

If auto-detection is not feasible, consider documenting this pattern for future test authors or creating a helper method to reduce duplication.
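If a helper is preferred, a minimal sketch could look like the following. The AutoDeployLLM constructor and the quant_config attribute path are assumptions; only the QuantAlgo import is taken from the diff summary.

```python
# Hypothetical helper; centralizes the FP8 tagging so each pre-quantized test does
# not repeat it. AutoDeployLLM and the quant_config attribute path are assumed.
from tensorrt_llm.quantization import QuantAlgo


def make_fp8_llm(model_path: str, **llm_kwargs):
    """Build an AutoDeployLLM for a pre-quantized FP8 checkpoint and mark it as FP8
    so the accuracy harness picks up the FP8 reference thresholds."""
    llm = AutoDeployLLM(model=model_path, **llm_kwargs)
    llm.args.quant_config.quant_algo = QuantAlgo.FP8
    return llm
```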

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49c45eb and 8efaec6.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-47)
tests/integration/defs/accuracy/accuracy_core.py (2)
  • evaluate (184-247)
  • evaluate (868-878)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • quant_config (342-345)
  • quant_config (348-349)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)

164-168: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry follows the established pattern with appropriate baseline (BF16) and FP8 quantized accuracy thresholds. The accuracy degradation from BF16 (88.249) to FP8 (86.884) is expected and reasonable for quantization.
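As a concrete illustration of that structure, the self-contained sketch below parses an entry shaped like the one described here and resolves the threshold by quantization algorithm. The exact YAML key names (accuracy, quant_algo) are assumptions modeled on similar entries mentioned in this review; the numbers are the ones cited above.

```python
# Minimal, self-contained illustration of the assumed reference-file structure and
# threshold lookup; key names are assumptions, numbers are the ones cited above.
from typing import Optional

import yaml  # PyYAML

REFERENCE = yaml.safe_load("""
nvidia/Nemotron-MOE:
  - accuracy: 88.249            # unquantized (BF16) baseline
  - quant_algo: FP8
    accuracy: 86.884            # FP8-quantized threshold
""")


def lookup_threshold(model: str, quant_algo: Optional[str] = None) -> float:
    """Return the reference accuracy matching the requested quantization (None = BF16)."""
    for entry in REFERENCE[model]:
        if entry.get("quant_algo") == quant_algo:
            return entry["accuracy"]
    raise KeyError(f"no reference entry for {model} with quant_algo={quant_algo}")


print(lookup_threshold("nvidia/Nemotron-MOE"))         # 88.249
print(lookup_threshold("nvidia/Nemotron-MOE", "FP8"))  # 86.884
```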

tests/integration/defs/accuracy/references/mmlu.yaml (2)

273-277: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry is consistent with the gsm8k.yaml reference file and follows the established pattern for accuracy thresholds.


278-281: Verify the Phi-4-mini-instruct restructuring and missing kv_cache_quant_algo.

This change appears unrelated to the PR's stated objective of adding Nemotron-MOE tests. Additionally, the FP8 entry lacks kv_cache_quant_algo, which is inconsistent with similar entries in this file (e.g., microsoft/phi-4 at lines 321-323 has both quant_algo: FP8 and kv_cache_quant_algo: FP8). Please clarify:

  1. Is this Phi-4-mini-instruct change intentional and part of this PR's scope?
  2. Should kv_cache_quant_algo: FP8 be added to maintain consistency?
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (4)

21-21: LGTM! QuantAlgo import added correctly.

The import is necessary for manually configuring FP8 quantization in the test_fp8 method.


157-158: LGTM! Model paths defined correctly.

The BF16 and FP8 model paths follow the same pattern as TestNemotronH and align with the separate test methods for each quantization type.


201-210: LGTM! BF16 test method implemented correctly.

The test follows the established pattern and will validate against the BF16 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml.


213-226: LGTM! FP8 test method structured correctly.

The test follows the established pattern and will validate against the FP8 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml. The manual quant_config setting is explained by the comment, though it may warrant architectural review (see separate comment).

@nvchenghaoz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25090 [ run ] triggered by Bot. Commit: 8efaec6

@tensorrt-cicd
Collaborator

PR_Github #25090 [ run ] completed with state SUCCESS. Commit: 8efaec6
/LLM/main/L0_MergeRequest_PR pipeline #18966 completed with status: 'FAILURE'

@nvchenghaoz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25101 [ run ] triggered by Bot. Commit: 8efaec6

@tensorrt-cicd
Collaborator

PR_Github #25101 [ run ] completed with state SUCCESS. Commit: 8efaec6
/LLM/main/L0_MergeRequest_PR pipeline #18976 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@github-project-automation github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Nov 20, 2025
@nvchenghaoz nvchenghaoz merged commit cd44f80 into NVIDIA:main Nov 20, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 20, 2025


Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Set up the Accuracy testing for Nemotron MOE models.
