[#12954][fix] AutoDeploy: Fix Gemma4 MoE config (disable multi_stream_moe, lower free_gpu_memory_fraction)#12955
Conversation
Disable `multi_stream_moe` to fix an accuracy regression (GSM8K 80% → 90%) when combined with MLIR elementwise fusion and monolithic CUDA graphs. See NVIDIA#12954 for root-cause analysis and the proposed fix.

Lower `free_gpu_memory_fraction` from 0.8 to 0.4 to prevent OOM during piecewise CUDA graph capture on 80GB GPUs.

Verified scores on `google/gemma-4-26B-A4B-it` (bf16, single GPU):
- MMLU: 75%
- GSM8K: 90%

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
📝 Walkthrough: The PR updates a Gemma-4 MoE model configuration file by reducing the KV cache GPU memory fraction from 0.8 to 0.4, disabling the multi-stream MoE feature, and adding a TODO comment referencing a pending TensorRT-LLM issue.
Estimated code review effort: 🎯 1 (Trivial), ⏱️ ~2 minutes. Pre-merge checks: ✅ 3 passed.
/bot skip --"autodeploy config only change"
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

- `run`: Launch build/test pipelines. All previously running jobs will be killed.
- `kill`: Kill all running builds associated with the pull request.
- `skip`: Skip testing for the latest commit on the pull request.
- `reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot skip --comment "autodeploy config only change"
PR_Github #42822 [ skip ] triggered by Bot. Commit:

PR_Github #42822 [ skip ] completed with state
Summary
- Disable `multi_stream_moe` to fix an accuracy regression when combined with MLIR elementwise fusion and monolithic CUDA graphs (GSM8K 80% → 90%). Root cause tracked in #12954 ([None][fix] multi_stream_moe + MLIR accuracy regression in monolithic CUDA graph decode path).
- Lower `free_gpu_memory_fraction` from 0.8 to 0.4 to prevent OOM during piecewise CUDA graph capture on 80GB GPUs.

Config-only change; no code modifications.
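For orientation, the two settings above would land in the model's AutoDeploy YAML roughly as follows. This is a hedged sketch: the key names come from this PR's description, but the file path and the surrounding structure are assumptions, not taken from the diff.

```yaml
# Sketch of the Gemma-4 MoE AutoDeploy config after this PR.
# Only the two values are from the PR; nesting/placement is assumed.
kv_cache_config:
  free_gpu_memory_fraction: 0.4   # was 0.8; avoids OOM during piecewise CUDA graph capture
multi_stream_moe: false           # was true; fixes GSM8K regression with MLIR fusion + monolithic CUDA graphs
# TODO: re-enable multi_stream_moe once #12954 is resolved
```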
Verified scores
Model: `google/gemma-4-26B-A4B-it` (bf16, single H100 80GB)

Test plan
`TRTLLM_ACCURACY_NO_REFERENCE=1 pytest tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestGemma4MoE::test_bf16 -s -v`

🤖 Generated with Claude Code
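As a rough illustration of why lowering the fraction helps: the KV cache pool is sized as a fraction of the GPU memory left free after weights are loaded, so halving the fraction leaves more headroom for CUDA graph capture. The 80 GB total comes from the PR description; the ~52 GiB weight footprint (26B params × 2 bytes for bf16) and the sizing formula are back-of-the-envelope assumptions, not TensorRT-LLM's exact allocator logic.

```python
# Back-of-the-envelope sketch (assumed numbers) of how free_gpu_memory_fraction
# sizes the KV cache pool: the pool takes a fraction of the memory left free
# after model weights are loaded, and everything else must fit in the rest.
GIB = 1024**3

def kv_cache_budget(total_bytes: int, weights_bytes: int, fraction: float) -> int:
    """Bytes reserved for the KV cache from the post-weights free memory."""
    free = total_bytes - weights_bytes
    return int(free * fraction)

total = 80 * GIB         # H100 80GB, per the PR description
weights = 52 * GIB       # assumed bf16 footprint for a ~26B-parameter model
for frac in (0.8, 0.4):  # old vs. new free_gpu_memory_fraction
    budget = kv_cache_budget(total, weights, frac)
    headroom = total - weights - budget
    print(f"fraction={frac}: kv_cache={budget / GIB:.1f} GiB, "
          f"headroom={headroom / GIB:.1f} GiB")
```

With these assumed numbers, headroom for activations and CUDA graph capture grows from about 5.6 GiB at 0.8 to about 16.8 GiB at 0.4, which is consistent with the OOM disappearing after the change.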