Force vLLM non-gated MoE through Triton by mxinO · Pull Request #1572 · NVIDIA/Model-Optimizer

mxinO · 2026-05-30T01:25:04Z

What does this PR do?

Type of change: bug fix

Starting with v0.20.0, vLLM selects the FlashInfer CUTLASS unquantized MoE backend for Nano-style non-gated MoE layers. That fused backend hides the intermediate activation between the two expert GEMMs, so the w2 input quantizer cannot be inserted.

Solution:
Set the `moe_backend` parameter to `triton` so that vLLM is forced onto the Triton kernel.
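
To make the problem concrete, here is a toy sketch in plain Python (not vLLM or ModelOpt code) of a non-gated expert computed as `w2 @ act(w1 @ x)`. The names `fused_expert`, `decomposed_expert`, and `fake_quant`, the ReLU activation, and the rounding step are all illustrative assumptions. The only point is that a decomposed path exposes the hidden activation to a quantizer, while a fused call does not.

```python
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, standing in for one expert GEMM."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v: Vector) -> Vector:
    """Placeholder activation; the real model's activation may differ."""
    return [max(0.0, a) for a in v]


def fake_quant(v: Vector, step: float = 0.5) -> Vector:
    """Toy fake quantization: snap values to a uniform grid but keep floats."""
    return [round(a / step) * step for a in v]


def fused_expert(w1: Matrix, w2: Matrix, x: Vector) -> Vector:
    """Stand-in for a fused kernel: one opaque call with no hook point inside."""
    return matvec(w2, relu(matvec(w1, x)))


def decomposed_expert(
    w1: Matrix,
    w2: Matrix,
    x: Vector,
    w2_input_quantizer: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Stand-in for a decomposed path: the hidden activation is visible."""
    hidden = relu(matvec(w1, x))
    if w2_input_quantizer is not None:
        # This is the spot a w2 input quantizer needs to reach.
        hidden = w2_input_quantizer(hidden)
    return matvec(w2, hidden)


w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 1.0]]
x = [0.3, -1.0]

print(fused_expert(w1, w2, x))
print(decomposed_expert(w1, w2, x, w2_input_quantizer=fake_quant))
```

Without a quantizer the two functions compute the same result; they differ only in whether the intermediate value can be intercepted.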

Usage

Testing

Tested on Nano3

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Documentation
- Updated vLLM fakequant serve documentation with notes on moe-backend configuration defaults.
Improvements
- Enhanced launcher logic to conditionally set moe-backend defaults only when the feature is available, improving configuration flexibility across different vLLM installations.

copy-pr-bot · 2026-05-30T01:25:08Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai · 2026-05-30T01:25:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: edf4be5b-6ca7-49c8-aa85-5e081a8e540f

📥 Commits

Reviewing files that changed from the base of the PR and between 40a4dd3 and a34d446.

📒 Files selected for processing (2)

examples/vllm_serve/README.md
examples/vllm_serve/vllm_serve_fakequant.py

📝 Walkthrough


This PR updates the vLLM fakequant serve launcher to conditionally default to --moe-backend triton when the installed vLLM version supports it. The change adds a parser helper function, updates configuration logic to check for argument support before setting defaults, and documents the behavior for users.

Changes

Conditional MoE backend defaults

| Layer / File(s) | Summary |
| --- | --- |
| Conditional MoE backend defaults and documentation `examples/vllm_serve/vllm_serve_fakequant.py`, `examples/vllm_serve/README.md` | Added the `_parser_has_argument()` helper to detect whether a parser supports the `--moe-backend` argument; updated `main()` to set `moe_backend="triton"` only when the parser includes this argument; documented the behavior, explaining that a decomposed MoE backend is required for expert fakequant calibration visibility. |
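
A minimal sketch of the conditional default described in the walkthrough, using only the standard `argparse` module. The function names and the placeholder `auto` default are hypothetical, and the check relies on argparse's private `_actions` list; the PR's actual `_parser_has_argument()` may be implemented differently.

```python
import argparse


def parser_has_argument(parser: argparse.ArgumentParser, option: str) -> bool:
    """Return True if `option` (e.g. "--moe-backend") is registered on `parser`."""
    # `_actions` is a private argparse attribute; assumed stable enough for a sketch.
    return any(option in action.option_strings for action in parser._actions)


def default_moe_backend_to_triton(parser: argparse.ArgumentParser) -> bool:
    """Set the default only when the parser knows the option; report whether it did."""
    if not parser_has_argument(parser, "--moe-backend"):
        return False
    parser.set_defaults(moe_backend="triton")
    return True


# Stand-in for a parser from a vLLM build that exposes the option.
newer = argparse.ArgumentParser()
newer.add_argument("--moe-backend", default="auto")  # "auto" is a placeholder default
newer_applied = default_moe_backend_to_triton(newer)

# Stand-in for a parser from a build without the option.
older = argparse.ArgumentParser()
older_applied = default_moe_backend_to_triton(older)

print(newer_applied, newer.parse_args([]).moe_backend)
print(older_applied, hasattr(older.parse_args([]), "moe_backend"))
```

Because only the default changes, an explicit `--moe-backend` value on the command line still takes precedence, and parsers without the option are left untouched.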

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main change: forcing vLLM non-gated MoE through the Triton backend, which is the core of the pull request addressing the MoE fakequant issue. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR changes contain no security anti-patterns: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code, no eval/exec on untrusted input, no nosec comments, no new unsafe dependencies. |


github-actions · 2026-05-30T01:29:33Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-30 16:44 UTC

codecov · 2026-05-30T01:44:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.50%. Comparing base (5eba879) to head (a34d446).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1572      +/-   ##
==========================================
- Coverage   76.88%   76.50%   -0.38%     
==========================================
  Files         478      478              
  Lines       52209    54286    +2077     
==========================================
+ Hits        40140    41533    +1393     
- Misses      12069    12753     +684

| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | `41.64% <ø> (+8.77%)` | ⬆️ |
| unit | `53.60% <ø> (+0.08%)` | ⬆️ |


vLLM can select fused FlashInfer MoE backends whose expert GEMMs are not visible to ModelOpt fakequant hooks. Rebuilding vLLM's internal MoE kernel after model loading is fragile and can mismatch the weight layout that vLLM already selected during process_weights_after_loading.

Keep the ModelOpt plugin simple: accept decomposed Triton backends, leave disabled expert quantizers alone, and fail loudly for fused or unsupported MoE backends with guidance to configure vLLM with moe_backend='triton' before model construction. The vLLM fakequant serving example now defaults to that backend when the installed vLLM parser exposes the option.

Constraint: vLLM backend selection controls both kernel construction and weight post-processing, so changing only the runtime kernel inside ModelOpt is not robust.
Rejected: Reconstruct vLLM MoE kernels from oracle helper functions | too much low-level vLLM API coupling and can drift with vLLM internals.
Confidence: high
Scope-risk: narrow
Directive: Do not add ModelOpt-side vLLM MoE kernel reconstruction unless vLLM exposes a stable high-level API for post-load backend migration.
Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py modelopt/torch/quantization/plugins/vllm.py tests/unit/torch/quantization/plugins/test_vllm.py
Tested: env -u NODE_OPTIONS .venv/bin/pre-commit run --all-files --show-diff-on-failure
Not-tested: Focused pytest; current .venv fails before collection because tests/conftest.py imports requests, which is not installed.
Signed-off-by: Meng Xin <mxin@nvidia.com>

The vLLM plugin-level backend guard added maintenance cost and depended on backend details that are likely to churn. Keep the behavior change at the user-facing example launcher instead: when vLLM exposes moe_backend, the example defaults it to triton so expert fakequant runs through a decomposed path.

Rejected: Runtime plugin assertions for MoE backend selection | too much low-level coupling for this example-only use case
Confidence: high
Scope-risk: narrow
Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py
Tested: ruff check examples/vllm_serve/vllm_serve_fakequant.py
Tested: git diff --check on touched paths
Not-tested: Full vLLM serve smoke after removing plugin assertions
Signed-off-by: Meng Xin <mxin@nvidia.com>

mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch 2 times, most recently from 6387746 to 900ddd5 Compare May 30, 2026 01:30

mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch 3 times, most recently from e10c249 to 7819ae9 Compare May 30, 2026 02:56

mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch from 7819ae9 to 1a3d7c9 Compare May 30, 2026 03:00

mxinO changed the title ~~Calibrate vLLM non-gated MoE through Triton fakequant~~ Force vLLM non-gated MoE through Triton May 30, 2026

mxinO marked this pull request as ready for review May 30, 2026 04:50

mxinO requested a review from a team as a code owner May 30, 2026 04:50

mxinO requested review from kinjalpatel27, meenchen and realAsma May 30, 2026 04:50

coderabbitai Bot approved these changes May 30, 2026


mxinO enabled auto-merge (squash) May 30, 2026 04:55

meenchen approved these changes May 30, 2026


mxinO merged commit 7ae4ee7 into main May 30, 2026
40 checks passed

mxinO deleted the mxin/vllm-flashinfer-moe-fakequant branch May 30, 2026 16:44

mxinO merged 2 commits into main from mxin/vllm-flashinfer-moe-fakequant
