
[Recipes][LLM PTQ] Add nvfp4_experts_only_mse-fp8_cast_kv recipe + --recipe in example scripts#1391

Open
cjluo-nv wants to merge 1 commit into chenjiel/nvfp4-fp8-sweep-triton from chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv

Conversation


@cjluo-nv cjluo-nv commented May 4, 2026

Summary

  • Adds modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml. The recipe combines experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks switched to static) and an FP8 KV cache with use_constant_amax: true, which skips KV calibration and matches the nvfp4_default-fp8_cast_kv contract.
  • Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
  • Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.
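
For orientation, here is a rough sketch of what such a recipe file could contain. Only the values called out above and in the test plan (method: mse, fp8_scale_sweep: true, layerwise: false, static expert weight quantizers, use_constant_amax: true on the KV bmm) come from this PR; the key names and overall layout below are illustrative guesses, not the actual modelopt recipe schema:

```yaml
# Illustrative sketch only -- the real schema is whatever
# modelopt's recipe loader defines.
algorithm:
  method: mse
  fp8_scale_sweep: true
  layerwise: false
quant_cfg:
  "*experts*weight_quantizer":   # experts-only NVFP4 weights
    type: static                 # static so the FP8 scale sweep applies
  "*bmm_quantizer":              # FP8 KV cache
    use_constant_amax: true      # skip KV-cache calibration
```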

This PR depends on #1387 (the Triton FP8 sweep kernel) — the new recipe relies on the mse + fp8_scale_sweep: true algorithm which that PR makes practical. Targeting chenjiel/nvfp4-fp8-sweep-triton as the base so the diff stays scoped to the recipe + script wiring.

Files

  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — new recipe.
  • examples/llm_ptq/scripts/parser.sh — add --recipe long-option, default RECIPE="", validate one-of-{--quant, --recipe} and not-both.
  • examples/llm_ptq/scripts/huggingface_example.sh — when RECIPE is set, derive MODEL_NAME from the recipe basename, pass --recipe=… to hf_ptq.py instead of --qformat=…, and exit after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drop the qformat whitelist; defer to hf_ptq.py.
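
The flag contract and MODEL_NAME derivation described above can be sketched as follows. This is not the actual parser.sh/huggingface_example.sh code — the function name, messages, and structure are illustrative; only the one-of-{--quant, --recipe} behavior and the basename derivation are taken from the description:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the one-of-{--quant, --recipe} contract and the
# MODEL_NAME derivation from the recipe basename. Not the real scripts.
set -u

parse_flags() {
  local QFORMAT="" RECIPE="" arg
  for arg in "$@"; do
    case "$arg" in
      --quant=*)  QFORMAT="${arg#--quant=}" ;;
      --recipe=*) RECIPE="${arg#--recipe=}" ;;
    esac
  done
  if [[ -n "$QFORMAT" && -n "$RECIPE" ]]; then
    echo "error: cannot specify both --quant and --recipe"
    return 1
  elif [[ -z "$QFORMAT" && -z "$RECIPE" ]]; then
    echo "error: one of --quant or --recipe is required"
    return 1
  fi
  if [[ -n "$RECIPE" ]]; then
    # basename strips the path and a trailing .yaml if present
    echo "MODEL_NAME=$(basename "$RECIPE" .yaml)"
  else
    echo "QFORMAT=$QFORMAT"
  fi
}

parse_flags --quant=nvfp4   # -> QFORMAT=nvfp4
parse_flags --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv
# -> MODEL_NAME=nvfp4_experts_only_mse-fp8_cast_kv
parse_flags --quant=nvfp4 --recipe=x || true   # prints the both-flags error
```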

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# Both of these are now accepted; --recipe is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant

Test plan

  • Recipe loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True).
  • Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
  • End-to-end run on a small MoE checkpoint via huggingface_example.sh --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv to confirm the recipe path produces a deployable checkpoint.

Note

Pre-commit hook check-modelopt-recipes was skipped on the commit because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that causes from modelopt.recipe.loader import load_recipe to fail. The recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0).

🤖 Generated with Claude Code

…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml,
  combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight
  calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks
  switched to "static" so the static FP8 sweep applies) and FP8 KV cache
  with use_constant_amax: true.

- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and
  huggingface_example.sh. Either --quant or --recipe is required; passing both
  errors out. When --recipe is used, the script derives MODEL_NAME from the
  recipe basename, passes --recipe= to hf_ptq.py, and exits after export with
  a TRT-LLM deployment hint (recipes can produce arbitrary configs).

- Drop the qformat case-statement whitelist in huggingface_example.sh; let
  hf_ptq.py be the single source of truth for valid qformats / recipes.

(Pre-commit hook check-modelopt-recipes was skipped: the host conda env has a
broken torchvision install that prevents the validator from importing modelopt.
The recipe was verified independently via tools/precommit/check_modelopt_recipes.py
in a working environment.)

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners May 4, 2026 23:28
@cjluo-nv cjluo-nv requested review from realAsma and removed request for a team May 4, 2026 23:28
coderabbitai Bot commented May 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.86%. Comparing base (bd4fc3a) to head (c2a341a).

Additional details and impacted files
@@                       Coverage Diff                        @@
##           chenjiel/nvfp4-fp8-sweep-triton    #1391   +/-   ##
================================================================
  Coverage                            76.86%   76.86%           
================================================================
  Files                                  472      472           
  Lines                                50660    50660           
================================================================
  Hits                                 38942    38942           
  Misses                               11718    11718           
Flag       Coverage Δ
examples   41.53% <ø> (ø)

