
[Recipes][LLM PTQ] Add nvfp4_experts_only_mse-fp8_cast_kv recipe + --recipe in example scripts#1391

Open
cjluo-nv wants to merge 1 commit into chenjiel/nvfp4-fp8-sweep-triton from chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv

Conversation


@cjluo-nv cjluo-nv commented May 4, 2026

Summary

  • Adds modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml. The recipe combines experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks switched to static) and an FP8 KV cache with use_constant_amax: true, which skips KV calibration and matches the nvfp4_default-fp8_cast_kv contract.
  • Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
  • Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.
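
For orientation, here is a rough sketch of what such a recipe file could contain. Only the values called out above and in the test plan (method: mse, fp8_scale_sweep: true, layerwise: false, static expert weight quantizers, use_constant_amax: true on the KV bmm) come from this PR; the key names and overall layout below are illustrative guesses, not the actual modelopt recipe schema:

```yaml
# Illustrative sketch only -- the real schema is whatever
# modelopt's recipe loader defines.
algorithm:
  method: mse
  fp8_scale_sweep: true
  layerwise: false
quant_cfg:
  "*experts*weight_quantizer":   # experts-only NVFP4 weights
    type: static                 # static so the FP8 scale sweep applies
  "*bmm_quantizer":              # FP8 KV cache
    use_constant_amax: true      # skip KV-cache calibration
```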

This PR depends on #1387 (the Triton FP8 sweep kernel) — the new recipe relies on the mse + fp8_scale_sweep: true algorithm which that PR makes practical. Targeting chenjiel/nvfp4-fp8-sweep-triton as the base so the diff stays scoped to the recipe + script wiring.

Files

  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — new recipe.
  • examples/llm_ptq/scripts/parser.sh — add --recipe long-option, default RECIPE="", validate one-of-{--quant, --recipe} and not-both.
  • examples/llm_ptq/scripts/huggingface_example.sh — when RECIPE is set, derive MODEL_NAME from the recipe basename, pass --recipe=… to hf_ptq.py instead of --qformat=…, and exit after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drop the qformat whitelist; defer to hf_ptq.py.
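
The flag contract and MODEL_NAME derivation described above can be sketched as follows. This is not the actual parser.sh/huggingface_example.sh code — the function name, messages, and structure are illustrative; only the one-of-{--quant, --recipe} behavior and the basename derivation are taken from the description:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the one-of-{--quant, --recipe} contract and the
# MODEL_NAME derivation from the recipe basename. Not the real scripts.
set -u

parse_flags() {
  local QFORMAT="" RECIPE="" arg
  for arg in "$@"; do
    case "$arg" in
      --quant=*)  QFORMAT="${arg#--quant=}" ;;
      --recipe=*) RECIPE="${arg#--recipe=}" ;;
    esac
  done
  if [[ -n "$QFORMAT" && -n "$RECIPE" ]]; then
    echo "error: cannot specify both --quant and --recipe"
    return 1
  elif [[ -z "$QFORMAT" && -z "$RECIPE" ]]; then
    echo "error: one of --quant or --recipe is required"
    return 1
  fi
  if [[ -n "$RECIPE" ]]; then
    # basename strips the path and a trailing .yaml if present
    echo "MODEL_NAME=$(basename "$RECIPE" .yaml)"
  else
    echo "QFORMAT=$QFORMAT"
  fi
}

parse_flags --quant=nvfp4   # -> QFORMAT=nvfp4
parse_flags --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv
# -> MODEL_NAME=nvfp4_experts_only_mse-fp8_cast_kv
parse_flags --quant=nvfp4 --recipe=x || true   # prints the both-flags error
```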

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# Both of these are now accepted; --recipe is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant

Test plan

  • Recipe loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True).
  • Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
  • End-to-end run on a small MoE checkpoint via huggingface_example.sh --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv to confirm the recipe path produces a deployable checkpoint.

Note

Pre-commit hook check-modelopt-recipes was skipped on the commit because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that causes from modelopt.recipe.loader import load_recipe to fail. The recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0).

🤖 Generated with Claude Code

…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml,
  combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight
  calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks
  switched to "static" so the static FP8 sweep applies) and FP8 KV cache
  with use_constant_amax: true.

- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and
  huggingface_example.sh. Either --quant or --recipe is required; passing both
  errors out. When --recipe is used, the script derives MODEL_NAME from the
  recipe basename, passes --recipe= to hf_ptq.py, and exits after export with
  a TRT-LLM deployment hint (recipes can produce arbitrary configs).

- Drop the qformat case-statement whitelist in huggingface_example.sh; let
  hf_ptq.py be the single source of truth for valid qformats / recipes.

(Pre-commit hook check-modelopt-recipes was skipped: the host conda env has a
broken torchvision install that prevents the validator from importing modelopt.
The recipe was verified independently via tools/precommit/check_modelopt_recipes.py
in a working environment.)

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners May 4, 2026 23:28
@cjluo-nv cjluo-nv requested review from realAsma and removed request for a team May 4, 2026 23:28
coderabbitai Bot commented May 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.86%. Comparing base (bd4fc3a) to head (c2a341a).

Additional details and impacted files
@@                       Coverage Diff                        @@
##           chenjiel/nvfp4-fp8-sweep-triton    #1391   +/-   ##
================================================================
  Coverage                            76.86%   76.86%           
================================================================
  Files                                  472      472           
  Lines                                50660    50660           
================================================================
  Hits                                 38942    38942           
  Misses                               11718    11718           
Flag       Coverage Δ
examples   41.53% <ø> (ø)

