[#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models #9317
Conversation
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
📝 Walkthrough: The changes add a new Nemotron-MOE model entry to the GSM8K and MMLU accuracy reference files and introduce BF16 and FP8 accuracy test methods for the model in the AutoDeploy accuracy test suite.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
164-167: Misleading comment: Nemotron-MOE is not an SSM model.

The comment states "SSMs do not support cache reuse", but Nemotron-MOE is a Mixture of Experts (MOE) model, not a State Space Model (SSM). This is evident from the 'ep' (expert parallelism) sharding dimension at line 180. The `enable_block_reuse: False` setting might still be correct for MOE models, but the comment is misleading.

Update the comment to reflect the actual reason for disabling cache reuse, or remove it if it was incorrectly inherited from TestNemotronH:

```diff
-        # SSMs do not support cache reuse.
+        # MOE models do not support cache reuse.
         "kv_cache_config": {
             "enable_block_reuse": False
         },
```
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
219-221: Verify the need for manual quant_config setting with pre-quantized models.

The comment indicates this manual setting is needed "to get the accuracy threshold", suggesting that AutoDeployLLM doesn't auto-detect quantization for pre-quantized models (the model path contains "FP8-KVFP8"). While this approach works, consider whether:
- AutoDeployLLM should auto-detect quantization from pre-quantized model metadata or path patterns
- This pattern will need to be repeated for every FP8 model test
This might be a broader design consideration for the AutoDeployLLM implementation.
If auto-detection is not feasible, consider documenting this pattern for future test authors or creating a helper method to reduce duplication.
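If a helper is the preferred route, it could look roughly like the sketch below. This is a hedged sketch only: the helper name and the assumption that the harness passes a plain kwargs dict to AutoDeployLLM (as the kv_cache_config snippet above suggests) are not taken from the PR itself.

```python
# Sketch of a possible shared helper; the name and kwargs layout are assumptions.
from tensorrt_llm.quantization.mode import QuantAlgo


def with_fp8_quant_config(default_kwargs: dict) -> dict:
    """Return a copy of the default AutoDeploy kwargs with FP8 quantization set.

    Intended so that tests for pre-quantized FP8 checkpoints pick up the FP8
    accuracy thresholds without each test repeating the same quant_config block.
    """
    kwargs = dict(default_kwargs)
    kwargs["quant_config"] = {
        "quant_algo": QuantAlgo.FP8,
        "kv_cache_quant_algo": QuantAlgo.FP8,
    }
    return kwargs
```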
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3)
tensorrt_llm/quantization/mode.py (1)
- QuantAlgo (23-47)
tests/integration/defs/accuracy/accuracy_core.py (2)
- evaluate (184-247)
- evaluate (868-878)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
- quant_config (342-345)
- quant_config (348-349)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
164-168: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry follows the established pattern with appropriate baseline (BF16) and FP8 quantized accuracy thresholds. The accuracy degradation from BF16 (88.249) to FP8 (86.884) is expected and reasonable for quantization.
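For readers without the diff open, an entry following the established pattern in gsm8k.yaml would look roughly like the sketch below. The model key is illustrative (the exact name is not visible in this review); the thresholds are the ones quoted above.

```yaml
# Illustrative layout only; the model key is an assumption.
Nemotron-MOE:
  - accuracy: 88.249          # BF16 baseline
  - quant_algo: FP8
    kv_cache_quant_algo: FP8
    accuracy: 86.884          # FP8 threshold
```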
tests/integration/defs/accuracy/references/mmlu.yaml (2)
273-277: LGTM! Nemotron-MOE accuracy thresholds added correctly.

The new model entry is consistent with the gsm8k.yaml reference file and follows the established pattern for accuracy thresholds.
278-281: Verify the Phi-4-mini-instruct restructuring and missing kv_cache_quant_algo.

This change appears unrelated to the PR's stated objective of adding Nemotron-MOE tests. Additionally, the FP8 entry lacks `kv_cache_quant_algo`, which is inconsistent with similar entries in this file (e.g., microsoft/phi-4 at lines 321-323 has both `quant_algo: FP8` and `kv_cache_quant_algo: FP8`). Please clarify:
- Is this Phi-4-mini-instruct change intentional and part of this PR's scope?
- Should `kv_cache_quant_algo: FP8` be added to maintain consistency?

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (4)
21-21: LGTM! QuantAlgo import added correctly.

The import is necessary for manually configuring FP8 quantization in the test_fp8 method.
157-158: LGTM! Model paths defined correctly.

The BF16 and FP8 model paths follow the same pattern as TestNemotronH and align with the separate test methods for each quantization type.
201-210: LGTM! BF16 test method implemented correctly.

The test follows the established pattern and will validate against the BF16 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml.
213-226: LGTM! FP8 test method structured correctly.

The test follows the established pattern and will validate against the FP8 accuracy thresholds defined in gsm8k.yaml and mmlu.yaml. The manual quant_config setting is explained by the comment, though it may warrant architectural review (see the separate comment above).
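As a rough illustration of the pattern being approved here, the FP8 test presumably looks something like the sketch below. Only AutoDeployLLM, QuantAlgo, the test_fp8 method, and the "FP8-KVFP8" path fragment are explicitly mentioned in this review; the class name, harness, task helpers, and model path in the sketch are assumptions, so treat it as a sketch rather than the PR's actual code.

```python
# Hedged sketch of the reviewed pattern; identifiers flagged below are assumptions.
# AutoDeployLLM, the accuracy harness, and the GSM8K/MMLU task helpers are assumed
# to come from the existing test module and accuracy_core referenced in this review.
from tensorrt_llm.quantization.mode import QuantAlgo


class TestNemotronMOE(LlmapiAccuracyTestHarness):  # class/harness names assumed
    MODEL_NAME = "Nemotron-MOE"  # illustrative
    MODEL_PATH_FP8 = "Nemotron-MOE-FP8-KVFP8"  # illustrative; real path not shown

    def test_fp8(self):
        kwargs = self.get_default_kwargs()  # assumed helper on the test class
        # Pre-quantized checkpoint: set quant_config manually so the accuracy
        # harness selects the FP8 thresholds from gsm8k.yaml / mmlu.yaml.
        kwargs["quant_config"] = {
            "quant_algo": QuantAlgo.FP8,
            "kv_cache_quant_algo": QuantAlgo.FP8,
        }
        with AutoDeployLLM(model=self.MODEL_PATH_FP8, **kwargs) as llm:
            GSM8K(self.MODEL_NAME).evaluate(llm)  # task classes assumed
            MMLU(self.MODEL_NAME).evaluate(llm)
```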
/bot run

PR_Github #25090 [ run ] triggered by Bot. Commit:

PR_Github #25090 [ run ] completed with state

/bot run

PR_Github #25101 [ run ] triggered by Bot. Commit:

PR_Github #25101 [ run ] completed with state

I think you need to enable the test as done for others here:
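Following that suggestion would presumably mean adding the new test IDs to the appropriate test list file, roughly like the sketch below. The list file and the exact test node IDs (class and method names) are assumptions rather than values taken from the PR.

```text
# e.g. in one of the tests/integration/test_lists/ files (exact file not shown here)
accuracy/test_llm_api_autodeploy.py::TestNemotronMOE::test_bf16
accuracy/test_llm_api_autodeploy.py::TestNemotronMOE::test_fp8
```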
@Wanli-Jiang added the models to the model hub. Since the model is now accessible to the pipeline, this PR adds BF16 and FP8 accuracy tests for the Nemotron MOE model and sets up the thresholds for the accuracy tests.
#9316
Summary by CodeRabbit: new BF16 and FP8 accuracy tests for the Nemotron MOE model under AutoDeploy, with the corresponding GSM8K and MMLU accuracy thresholds added to the reference files.