
[Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1385

Open
kevalmorabia97 wants to merge 18 commits into release/0.44.0 from
cherry-picks/release-0.44.0

Conversation

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented May 4, 2026

Cherry-picked PRs

Summary by CodeRabbit

  • New Features

    • New end-to-end Minitron pruning→distillation→FP8 quantization→vLLM tutorial and Nemotron-Nano-9B-v2 example
    • FP8 MHA quantization support for vision transformers
    • Windows CUDA‑13.x installation instructions and Megatron dataset pre-tokenization guide
    • Improved grouped MoE export and fused-experts handling; medusa mode enabled in speculative decoding
  • Bug Fixes

    • nvfp4_awq export fallback for MoE models
    • Enhanced Windows cuDNN discovery and expert calibration robustness
  • Documentation

    • Expanded pruning/distillation/quantization tutorials, ablations, and examples

kevalmorabia97 and others added 17 commits May 4, 2026 12:49
Adds `.claude/skills/release-cherry-pick/SKILL.md` — a Claude Code skill
for cherry-picking labeled PRs to a release branch.

Invoke with `/release-cherry-pick <version>`.

See this PR created with the skill:
#1350

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automated release cherry-pick workflow to streamline selecting
and applying multiple PRs into release branches.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…#1351)

PR #1289 (FP8 MHA quantization for ViT) was merged to `main` after
`0.44.0rc1` was tagged, so the rc1 wheel ships without the
`nn.LayerNorm` registration that the example's `_FP8_MHA_OVERRIDE` now
references — surfaced as nvbug 6114983 (`ValueError: parent_class
'nn.LayerNorm' not found in QuantModuleRegistry` when running
`torch_quant_to_onnx.py --quantize_mode=fp8`). PR #1289 is labeled
`cherry-pick-0.44.0` and will be cherry-picked to `release/0.44.0` for
the next rc, so the feature ships in 0.44 — this PR moves the
corresponding release-notes bullet from the `0.45 (Future)` section to
`0.44 (2026-05-xx)` to match.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…int (#1330)

### What does this PR do?

Type of change: Bug fix

Fixes
`https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/puzzletron/main.py`,
where a multi-GPU run caused only part of the file
`model.safetensors.index.json` to be written to disk.

### Usage

Does not apply.

### Testing

Follow [instructions, step
3](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron#compress-the-model)
- run with `--nproc_per_node 2`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a public checkpoint-saving entry that consolidates distributed
sharded model shards into a single filesystem checkpoint; retains direct
saving for single-process runs.

* **Refactor**
* Validation/evaluation tooling now uses the consolidated
checkpoint-saving flow when persisting realized model checkpoints during
runs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1354)

## Summary

- NVBug: [6103846](https://nvbugspro.nvidia.com/bug/6103846) —
`Qwen3-30B-A3B nvfp4_awq` quantization fails at export with
`AssertionError: Modules have different quantization formats`.
- Root cause: in `model_calib.awq_lite`, MoE experts that end up
disabled (NaN in act/weight scales, or no search-pass tokens) get
`max_calibrate`-d but no `pre_quant_scale`. `get_quantization_format`
then returns `nvfp4` for those experts while siblings stay `nvfp4_awq`.
`unified_export_hf.requantize_resmooth_fused_llm_layers` groups all 128
experts of each linear name (gate_proj/down_proj/up_proj) and calls
`preprocess_linear_fusion(..., resmooth_only=True)`, which asserts
uniform format → fires for any single mismatched expert.
- Fix: unify the disabled-expert paths in the awq_lite postprocess loop
so any expert with `is_enabled == False` (no cache hits, NaN scales, or
no search-pass tokens) receives `max_calibrate` + a neutral all-ones
`pre_quant_scale`, matching the existing behavior for `num_cache_steps
== 0`. Emit a warning so users notice that calibration coverage is
incomplete and accuracy may degrade.

## Test plan

- [x] `pytest tests/unit/torch/quantization/test_calib.py -k 'awq'` → 5
passed
- [x] End-to-end on `Qwen/Qwen3-30B-A3B` with `NVFP4_AWQ_LITE_CFG` and a
small calib set that leaves many experts uncalibrated:
- All 6144 gate_proj/up_proj/down_proj expert linears report `nvfp4_awq`
(no mismatch)
  - `export_hf_checkpoint` succeeds with no `AssertionError`
- The new "Forcing pre_quant_scale=1 ... may degrade accuracy" warning
fires for each affected expert
- [x] Re-run via `examples/llm_ptq/hf_ptq.py` with the bug-report CLI
(cnn_dailymail, batch_size=8, calib_size=64 — scaled down from 512 to
fit budget) on B200:
- 36 "the second time did not forward data through
..experts.X.{gate,up,down}_proj" warnings — i.e. the exact
bug-triggering condition from the original NVBug log naturally
reproduces
- 2058 "Forcing pre_quant_scale=1" warnings — fix path activates for
uncalibrated/disabled experts
  - 0 `AssertionError`s — export completes
- `Quantized model exported to: /tmp/test_plan_qwen3-30b-a3b-nvfp4_awq`
and post-PTQ generation works

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Previously, the regex required a dot after the integer layer number, but the
dot is not always present (e.g. when the layer index ends the submodule name).

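A minimal sketch of the fix; the pattern below is illustrative, not the distillation plugin's actual regex:

```python
import re

# Before: the pattern required a trailing dot after the layer index, so
# names like "decoder.layers.3" (index at the end) never matched.
old_pattern = re.compile(r"layers\.(\d+)\.")
# After: accept either a dot or end-of-string after the index.
new_pattern = re.compile(r"layers\.(\d+)(?:\.|$)")

assert old_pattern.search("decoder.layers.3.mlp")
assert old_pattern.search("decoder.layers.3") is None  # the old miss
assert new_pattern.search("decoder.layers.3").group(1) == "3"
```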


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved detection and handling of pipeline-parallel layer indices in
model names to correctly support layer identifiers positioned at the end
of submodule names, enhancing compatibility with various naming
conventions in distillation workflows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

The config
`examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/pruning/attn_pruning.yaml`
was missing keys required for attention pruning in the example
`examples/puzzletron/main.py`.


### Testing

In
`examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/Llama-3_1-8B.yaml`
change `ffn_pruning` to `attn_pruning`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated pruning configuration for improved KV-head pruning support,
including enhanced importance hook settings and attention output
handling for memory optimization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

- Enable EP (expert parallelism) import for HF-to-MCore conversion when using
the TE spec
- Fix a bug in the Mamba MoE config that didn't skip attention layers
properly in MCore (MCore uses different naming for attention layers than
HF)
- Add a getter for the quant config (used in MLM modelopt examples to read
quant cfg fields)

### Usage

```shell
# In Megatron-LM/examples/post_training/modelopt
MLM_EXTRA_ARGS="--export-default-te-spec --trust-remote-code --moe-router-dtype fp32" EP=4 HF_MODEL_CKPT=</path/to/hf> MLM_MODEL_SAVE=<save/path> ./convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Corrected expert-slice assignment so each expert-parallel rank loads
the proper expert slice.
* Improved detection of pipeline-parallel layer indices in submodule
names.

* **Improvements**
* Relaxed constraints between local and global expert counts for
grouped-local-expert imports.
* Added typed helpers for managing quantization configuration entries
and expanded quantizer disable patterns.
  * Exporter now accepts an additional hybrid model type when available.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…ributeError on custom configs (#1324)

### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

<!-- Details about the change. -->

- Summary: Running hf_ptq.py on stepfun-ai/Step-3.5-Flash (and any model
whose custom HF config doesn't assign use_cache) crashed in
get_max_batch_size() with AttributeError: 'Step3p5Config' object has no
attribute 'use_cache' before calibration could start.
- Extract the existing "disable KV cache during calibration" logic into
a _disable_use_cache(model) context manager, apply it to both
get_max_batch_size and _forward_loop. The CM sets config.use_cache =
False unconditionally (not only when the attribute exists) and restores
the prior value on exit if one was set.
- Behavior unchanged for normal configs; the NemotronH hybrid-cache
correctness guarantee from #1251 is preserved.


### Testing
<!-- Mention how have you tested your change if applicable. -->

Step-3.5-Flash PTQ now passes get_max_batch_size


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Improved memory handling during model evaluation and calibration by
consistently disabling KV cache for both single-batch probes and full
dataloader runs, simplifying and stabilizing inference flow and ensuring
cache state is managed reliably.

* **Tests**
* Added unit tests verifying cache-state handling across models with and
without cache settings, including correct restoration behavior even when
errors occur.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…(Qwen3.5-MoE VLM) (#1340)

### What does this PR do?

Type of change: Bug fix

Fixes a 4-bug cascade that caused silent PTQ failure on Qwen3.5-MoE VLMs
(Qwen3.6-35B-A3B): calibration appeared to succeed but produced token-salad
at inference. Root cause: HF's @use_experts_implementation dispatches expert
forward to torch._grouped_mm / torch.bmm, bypassing the F.linear hook that
captures activations — so gate_up_proj_input_quantizer /
down_proj_input_quantizer were never calibrated and no input_scale tensors
were emitted.

Changes:

- examples/llm_ptq/hf_ptq.py — force config._experts_implementation =
"eager" (recursing into text_config / vision_config / …) so per-expert
F.linear calls are visible to the calibration hook.
- modelopt/torch/quantization/conversion.py — normalize plural ModuleList
quantizer names (weight_quantizers.N → weight_quantizer) before fnmatch, so
wildcards like *mlp.experts*weight_quantizer match fused-expert quantizers.
- modelopt/torch/export/unified_export_hf.py — hoist the _QuantFusedExperts
export branch above the get_quantization_format() gate so
_export_fused_experts() runs even when the top-level format query returns
QUANTIZATION_NONE (happens for experts-only recipes).
- modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml —
layerwise: false (the VLM's nested layer structure breaks the layerwise
walker).

<!-- Details about the change. -->

### Usage

```shell
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.6-35B-A3B \
    --qformat nvfp4 \
    --kv_cache_qformat fp8 \
    --calib_size 512 \
    --export_path Qwen3.6-35B-A3B-NVFP4
```

### Testing
End-to-end PTQ → vLLM deploy → NEL eval on Qwen3.6-35B-A3B (256 experts ×
40 layers, 35B params):

- Hook-call diagnostic: 0 → 6720 per-expert F.linear calls during
calibration after the fix; 0 → 30720 input_scale tensors emitted in the
exported checkpoint.
- FP8 fused-MoE path still produces gibberish — separate follow-up (vLLM
per-expert weight_scale handling).
- vLLM full-FP8: the FlashInfer TRTLLM Fp8MoE loader doesn't stack the 256
per-expert scalar weight_scale tensors into a [num_experts] per-expert
vector — it ends up applying one expert's scale across all 256, so every
routed expert dequants with the wrong amplitude and the coherent token
stream collapses into multilingual gibberish.
- SGLang full-FP8: qwen3_5.py::_make_packed_weight_loader rejects with
AssertionError: Unexpected scalar for tuple shard load:
loaded_shard_id=(0,1,2), split_sizes=[1,1,1] — its packed loader has no
path for "N independent per-tensor source scalars combining into one
fused-shard parameter," so the fused QKV (or in_proj_qkvz) load is
structurally refused and the model never finishes loading.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Better fused-expert export flow, a plugin to force eager expert
execution during calibration/export, and a representative quantizer
discovery utility.

* **Bug Fixes**
* Reliable matching/discovery of per-expert indexed quantizers enabling
correct calibration and mixed-precision export; fixes for calibration in
nested decoder layouts.

* **Documentation**
  * Clarified PTQ config guidance on layerwise calibration.

* **Tests**
* Added fused-experts calibration, export, and name-normalization tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…n with Cuda13.x (#1368)

### What does this PR do?

Type of change: Documentation

<!-- Details about the change. -->
Update windows documentation for onnxruntime quantization with Cuda13.x

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?:  N/A <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!---
Mandatory -->
- Did you write any new necessary tests?: N/A <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated Windows installation guide with CUDA 13.x-specific setup
instructions for GPU-accelerated dependencies, including CuPy and ONNX
Runtime configuration with nightly builds.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

<!-- Details about the change. -->

The offline dflash regression test can be run on 1 or 2 GPUs. With 2 GPUs,
the total number of steps is half that of the 1-GPU run, so this PR relaxes
the failure threshold for the 2-GPU test.



Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Minor bug fix

- Put the quantization steps inside try-finally to ensure removal of temp
files on error in ONNX INT4 quantization.
- To avoid redundancy between the awq_lite() and awq_clip() methods, created
a utility _remove_augmented_onnx() for exception-safe removal of the
augmented ONNX file and its data file.

### Testing

- Locally performed ONNX INT4 awq-lite and awq-clip quantization with
Llama 1B model.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Improved reliability of the quantization pipeline by ensuring
temporary conversion artifacts are always removed, making cleanup more
robust.
* Consolidated handling of external-data companions and added safer
deletion behavior that logs failures instead of raising errors.
* Ensured consistent session teardown and forced memory collection to
reduce resource leakage and intermittent errors during model conversion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Remove a return statement from the code that checks the remote auto-tuning
config arguments, since the early return skipped adding the actual remote
tuning config to the trtexec command.

**Root cause**: The necessary flags do not get added to
`self._base_cmd.extend(trtexec_args)` when remote autotuning is enabled.

**Before fix**:
```
['trtexec', '--avgRuns=100', '--iterations=100', '--warmUp=50', '--stronglyTyped', \
    '--saveEngine=engine.trt', '--timingCacheFile=trtexec_timing.cache', \
    '--onnx=baseline.onnx']
```
**After fix**:
```
['trtexec', '--avgRuns=100', '--iterations=100', '--warmUp=50', '--stronglyTyped', \
    '--saveEngine=engine.trt', '--timingCacheFile=trtexec_timing.cache', \
    '--remoteAutoTuningConfig=$CONFIG', '--safe', '--skipInference', \
    '--onnx=baseline.onnx']
```
Notice that the remote autotuning and related flags are now included in
the `trtexec` command.
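The bug follows a common early-return pattern; the function and flag handling below are a hypothetical illustration, not the actual trtexec wrapper code:

```python
# Minimal illustration: a `return` placed after collecting the remote
# autotuning flags exited before they were appended to the base command.
def build_cmd(base_cmd, remote_config):
    trtexec_args = []
    if remote_config is not None:
        trtexec_args += [f"--remoteAutoTuningConfig={remote_config}",
                         "--safe", "--skipInference"]
        # BUG (before the fix): `return` here skipped the extend() below,
        # so the flags above never reached the final trtexec command.
    base_cmd.extend(trtexec_args)
    return base_cmd
```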

**Related PR**: #1259

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Bug Fixes
* Fixed an issue where remote autotuning configuration arguments were
not being properly included in benchmark commands, ensuring all remote
autotuning settings are now correctly applied during execution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…val + Quantize + vLLM deployment (#1325)

## Summary

End-to-end optimization walkthrough for Nemotron-Nano-9B-v2 showing how
ModelOpt techniques stack:

- **Pruning** — Minitron structured pruning 9B → 7B
- **Distillation** — Megatron-Bridge knowledge distillation up to 80B
tokens; near-parity with official 9B on MMLU Pro, GPQA, LCB, AIME, Math
500, IFEval, SciCode
- **Evaluation** — using nemo-evaluator
- **Quantization** — FP8 PTQ via `hf_ptq.py`; checkpoint deployable on
vLLM/TRT-LLM/SGLang with no extra flags (quantization auto-detected from
`config.json`)
- **vLLM Throughput** — BF16 vs FP8 benchmark on single H100

<img width="2085" height="1740" alt="image"
src="https://github.com/user-attachments/assets/8620a019-5c09-4a6b-a5d2-ca164aaa5d87"
/>

<img width="2085" height="810" alt="image"
src="https://github.com/user-attachments/assets/742c8035-f1fb-4394-b11b-0c6c3ac4e843"
/>

### Files changed

- `examples/pruning/minitron/README.md` — index page for Minitron
end-to-end tutorials
- `examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md` —
full repro doc with 6 sections: data prep, pruning, distillation,
evaluation, FP8 quantization, vLLM benchmarking
-
`examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`
— NeMo Evaluator config used for all benchmark numbers
- `examples/pruning/puzzletron/README.md` — index page for Puzzletron
distillation results
- `examples/pruning/puzzletron/Llama-3.1-8B-Instruct.md` — Puzzletron
distillation results (renamed from puzzletron.md)
- `examples/pruning/README.md` — updated Results section with direct
links to new locations
- `examples/megatron_bridge/README.md` — updated results link to point
to `examples/pruning/`
- `examples/puzzletron/README.md` — updated distillation results link
- `examples/dataset/MEGATRON_DATA_PREP.md` — tokenization commands for
all datasets used in the data blend

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Documentation

* **New end-to-end tutorial** for model optimization covering Minitron
pruning, knowledge distillation, FP8 quantization, and vLLM deployment
with reproducibility steps and benchmark results
* **Dataset preparation guide** with ready-to-run tokenization templates
for Nemotron HuggingFace datasets
* **Evaluation configuration** and results documentation including
ablation studies across multiple benchmarks
* **Updated navigation** across pruning, distillation, and dataset
examples to streamline user workflows

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

cuDNN 9.21 added a new DLL dependency, cudnn_engines_tensor_ir64_9.dll,
which ort.preload_dlls() on Windows has not been updated to load, so loading
cuDNN fails when only the nvidia-cudnn-cu12>9.20 package is installed.
This PR adds code to load any extra DLLs from the site-packages folder that
the preload function misses.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved Windows cuDNN detection and loading for ONNX Runtime with
CUDA by scanning installed cuDNN packages and attempting to load any
missing DLLs to reduce startup failures.
* Enhanced logging and diagnostics: preload output is now surfaced as
warnings and individual DLL load successes/failures are logged to aid
troubleshooting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Problem
When `training.mode == "medusa"` is used in `main.py`, the `data_module`
variable is never assigned because line 344 only covered `eagle3` and
`dflash` modes. This causes an `UnboundLocalError` when the trainer is
constructed with `**data_module`.

Fixes OMNIML-4147

## Fix
Add `"medusa"` to the `training_args.mode in ("eagle3", "dflash")`
condition so `data_module` is correctly populated for medusa training.
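The shape of the bug and the fix can be illustrated with a stripped-down version of the mode dispatch (names simplified from main.py; not the actual code):

```python
def build_data_module(mode: str) -> dict:
    """Return trainer kwargs for the given training mode (simplified sketch)."""
    data_module = None
    if mode in ("eagle3", "dflash", "medusa"):  # fix: "medusa" added to the tuple
        data_module = {"train_dataset": "...", "eval_dataset": "..."}
    if data_module is None:
        # Before the fix, mode == "medusa" reached trainer construction with
        # data_module unassigned, so Trainer(**data_module) raised
        # UnboundLocalError. Failing loudly here makes the gap explicit.
        raise ValueError(f"no data module for mode: {mode}")
    return data_module
```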

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed speculative decoding example to properly handle "medusa" mode
alongside existing "eagle3" and "dflash" modes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1371)

## Problem
When training with a model that has no `chat_template` in its tokenizer
(e.g. base Llama-3.2 models), `_post_process_chat_template()` crashes:
```
AttributeError: 'NoneType' object has no attribute 'replace'
```
The DeepSeek WAR at the top of `_post_process_chat_template` called
`.replace()` directly on `self.tokenizer.chat_template` without checking
for `None` first.

Fixes NVBug 6120958

## Fix
Add an early return when `chat_template is None`. The existing check at
line 164 (`if self.tokenizer.chat_template is None: raise ValueError`)
still provides a clear error message if no valid template is available
after post-processing.
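A minimal sketch of the guard, with a placeholder string rewrite standing in for the real DeepSeek workaround (the class and rewrite here are illustrative, not the actual transformers_dataset.py code):

```python
class ChatTemplateProcessor:
    """Illustrative stand-in for the real chat-template post-processing."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _post_process_chat_template(self):
        template = self.tokenizer.chat_template
        if template is None:
            # Fix: base models (e.g. Llama-3.2) ship no chat template; return
            # early instead of calling .replace() on None. A later check still
            # raises a clear ValueError if no valid template is available.
            return
        # Rewrites (like the DeepSeek workaround) only run on a real string;
        # "old" -> "new" is a placeholder for the actual substitution.
        self.tokenizer.chat_template = template.replace("old", "new")
```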

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Fixed a crash in chat template processing that occurred when a chat
template configuration was not set, improving system stability and
reliability during initialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 4, 2026 19:53
@kevalmorabia97 kevalmorabia97 requested review from yeyu-nvidia and removed request for a team May 4, 2026 19:53
@kevalmorabia97 kevalmorabia97 requested review from Edwardf0t1, cjluo-nv, jenchen13 and realAsma and removed request for a team May 4, 2026 19:53

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

Adds new documentation and examples for pruning/distillation/quantization/deployment, multiple fused-experts quantization/export changes, distributed checkpoint saving for Puzzletron with tests, ONNX quantization cleanup and Windows cuDNN loading, dataset/tokenization and cache-control updates, Megatron exporter/importer fixes, new regression/unit tests, a CI timeout tweak, and a release-cherry-pick workflow doc.

Changes

Documentation & Examples (Nemotron / Pruning / Puzzletron)

Layer / File(s) Summary
High-level docs
CHANGELOG.rst, examples/*/README.md
Added changelog entries and multiple README edits linking to new documents and updating references.
Megatron data prep guide
examples/dataset/MEGATRON_DATA_PREP.md, examples/dataset/README.md
New standalone Megatron pre-tokenization guide; README updated to link to it.
Minitron pruning tutorial & artefacts
examples/pruning/*, examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/*
Added end-to-end Minitron pruning+distillation+quantization+vLLM deployment README, ablations, evaluator YAML, and supporting README/link updates.
Puzzletron docs
examples/pruning/puzzletron/README.md, examples/puzzletron/README.md
Added Puzzletron pruning/distillation results README and cross-reference updates.

Fused-Experts Quantization & Export

Layer / File(s) Summary
Core utility addition
modelopt/torch/quantization/utils/core_utils.py, modelopt/torch/quantization/utils/__init__.py
Added representative_weight_quantizer() and exported it; updated weight_attr_names() to handle plural/fused-experts quantizer layouts.
Name normalization
modelopt/torch/quantization/conversion.py
Added regex normalization to collapse plural per-expert quantizer naming for wildcard matching.
Calibration robustness
modelopt/torch/quantization/model_calib.py, modelopt/torch/quantization/config.py
Defensive handling for AWQ-lite experts (disable/mask unexercised experts) and extended disabled-quantizer patterns for Mamba-MoE/self-attention names.
Plugin changes
modelopt/torch/quantization/plugins/huggingface.py
Added _QuantFusedExperts.fold_weight() to fold per-expert fake-quant and force_eager_experts_impl_on_the_fly() registered in plugins to force eager expert impl when needed.
Export & fake-quant
modelopt/torch/export/plugins/vllm_fakequant_hf.py, modelopt/torch/export/quant_utils.py, modelopt/torch/export/unified_export_hf.py
Broadened weight-quantizer state-key regex; added _fakequant_fused_experts_weights() and ensured fused-experts are exported/handled before generic quantized-module logic; export detection now uses representative_weight_quantizer().
Tests
tests/unit/torch/quantization/plugins/test_fused_experts.py
New unit tests covering eager-impl forcing, per-expert calibration amax checks, mixed-precision export enumeration, and quantizer-name normalization.
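The name-normalization step above can be sketched as follows. The exact regex in conversion.py may differ; this is a hypothetical illustration of collapsing plural per-expert quantizer names for wildcard matching:

```python
import re

# Fused-experts layers store one quantizer per expert under a plural name,
# e.g. "...gate_up_proj_weight_quantizers.3". Collapsing the per-expert index
# back to the singular form lets existing wildcard patterns such as
# "*weight_quantizer*" match both layouts.
_PLURAL_QUANTIZER_RE = re.compile(r"_quantizers\.\d+")


def normalize_quantizer_name(name: str) -> str:
    """Map plural per-expert quantizer names onto the singular convention."""
    return _PLURAL_QUANTIZER_RE.sub("_quantizer", name)
```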

Puzzletron Distributed Checkpoint Saving & Tests

Layer / File(s) Summary
Implementation
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
Added save_checkpoint_from_shards() to gather per-rank state dicts to rank 0, merge, save once, broadcast errors, and synchronize. Exported it via __all__.
Integration
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
Switched checkpoint save calls to save_checkpoint_from_shards.
Tests
tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py
New tests validating single-process and distributed sharded checkpoint saving, index/shard/config outputs, tied-embedding exclusion, and numeric tensor equality after reload.
Test import behavior
tests/gpu/torch/puzzletron/test_puzzletron.py
Removed module-level pytest import-or-skip guard for mip so module can load when optional dependency missing.
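The gather-and-merge step of save_checkpoint_from_shards() can be sketched with plain dicts standing in for per-rank state dicts (the real implementation gathers via torch.distributed and compares tensors with torch.equal; this is an illustrative sketch only):

```python
def merge_shards(gathered):
    """Merge per-rank state-dict shards, refusing to silently overwrite keys.

    `gathered` is a list of dicts, one per rank. Keys owned by exactly one
    rank are copied; keys appearing on multiple ranks must agree, otherwise
    we raise instead of letting a later rank's placeholder clobber a real
    value (the overwrite hazard of a plain dict.update() merge).
    """
    full_sd = {}
    for shard_sd in gathered:
        for key, value in shard_sd.items():
            if key not in full_sd:
                full_sd[key] = value
            elif full_sd[key] != value:  # torch.equal(...) for real tensors
                raise ValueError(f"conflicting values for {key!r} across ranks")
    return full_sd
```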

ONNX Quantization Cleanup & cuDNN DLL Loading

Layer / File(s) Summary
Cleanup helper
modelopt/onnx/quantization/int4.py
Added _remove_augmented_onnx() and refactored _quantize_awq_clip / _quantize_awq_lite to use try/finally with session cleanup, gc.collect(), and unified ONNX removal.
Windows cuDNN loading
modelopt/onnx/quantization/ort_utils.py
Added helpers to find nvidia-cudnn-cu12/cu13 site-packages DLL dir and explicitly load missing cudnn*.dll on Windows; adjusted preload output handling and core-DLL checks.

Dataset Utilities, Tokenization, and Speculative Decoding

Layer / File(s) Summary
Cache-control context
modelopt/torch/utils/dataset_utils.py, tests/unit/torch/utils/test_dataset_utils.py
Added _disable_use_cache() context manager; updated get_max_batch_size and _forward_loop to use it; added unit tests for behavior and restoration.
Tokenizer & streaming changes
modelopt/torch/utils/plugins/megatron_preprocess_data.py, examples/dataset/MEGATRON_DATA_PREP.md
Tokenizer apply_chat_template is now wrapped with try/except to log the failing input; removed the fallback to non-streaming when --hf_streaming is used; updated docs and examples.
Chat template guard
modelopt/torch/utils/plugins/transformers_dataset.py
Added early-return when tokenizer chat_template is None to avoid .replace on None.
Speculative decoding
examples/speculative_decoding/main.py
Construct data module for medusa mode by adding "medusa" to the mode condition.
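The `_disable_use_cache()` helper described above can be sketched as a context manager like the following (an illustrative sketch; the real helper in dataset_utils.py may handle more config attributes):

```python
from contextlib import contextmanager


@contextmanager
def disable_use_cache(model):
    """Temporarily set model.config.use_cache = False, restoring it on exit."""
    config = getattr(model, "config", None)
    had_attr = config is not None and hasattr(config, "use_cache")
    old = config.use_cache if had_attr else None
    if had_attr:
        config.use_cache = False
    try:
        yield model
    finally:
        if had_attr:
            # Restore the original value even if the forward loop raised.
            config.use_cache = old
```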

Megatron Importer & Exporter / Model Support

Layer / File(s) Summary
Exporter validation
modelopt/torch/export/unified_export_megatron.py
Added optional guarded import of HybridModel and allowed it in exporter input-type validation.
Grouped MoE import
modelopt/torch/export/plugins/megatron_importer.py
Compute initial global expert index via get_expert_model_parallel_rank() instead of hardcoded 0; assert divisibility and map local expert weights into weight{local_id} slots.
Descriptor mapping
modelopt/torch/puzzletron/anymodel/.../model_descriptor_factory.py
Adjusted auto-detection mapping from gpt_oss_20b → gpt_oss.
Replacement library robustness
modelopt/torch/puzzletron/replacement_library/build_replacement_library.py
Treat pandas NaN as missing checkpoint_dir when building block replacement paths.

Quantization Recipes & Misc

Layer / File(s) Summary
Recipe tweak
modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml
Set quantize.algorithm.layerwise: false with comment about VLM decoder nesting requirement.
TensorRT autotune arg
modelopt/onnx/quantization/autotune/benchmark.py
Ensure --skipInference present in trtexec args for remote autotuning and log TensorRT API version info.

Tests, Regression & CI

Layer / File(s) Summary
DFlash offline regression
tests/regression/torch/speculative/test_dflash_offline.py
New regression test that dumps hidden states for synthetic conversations, runs offline DFlash training, and asserts loss decreases and final loss < 5.0.
Dataset utils unit tests
tests/unit/torch/utils/test_dataset_utils.py
Added tests for _disable_use_cache and _forward_loop cache behavior and restoration.
CI workflow
.github/workflows/gpu_tests.yml
Increased GPU job timeout from 60 to 75 minutes.

Release Cherry-pick Skill doc

Layer / File(s) Summary
Skill doc
.claude/skills/release-cherry-pick/SKILL.md
New skill documenting an interactive workflow to cherry-pick merged PRs labeled cherry-pick-<version> onto release/<version>, create a cherry-picks/release-<version> branch and PR, and label cherry-picked PRs with cherry-pick-done.

Sequence Diagram(s)

(omitted — conditions for sequence diagram generation not met)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jenchen13
  • realAsma
  • cjluo-nv
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 60.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (5 passed)

  • Description Check ✅ — skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check ✅ — skipped; no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ — skipped; no linked issues were found for this pull request.
  • Security Anti-Patterns ✅ — no security anti-patterns found; all potentially dangerous functions have proper justifications or use secure patterns.
  • Title check ✅ — the title clearly identifies this as a cherry-pick PR combining multiple upstream PRs (#1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371) into release/0.44.0, matching the changeset content of bug fixes, features, and documentation updates across the codebase.


@github-actions

github-actions Bot commented May 4, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1385/

Built to branch gh-pages at 2026-05-04 20:34 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (1)

175-192: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Potential NameError when realizable_as_symlinks=True and skip_validation=True.

If both conditions are true, the model variable is never assigned (line 175-176 condition is False), but line 192 still attempts to use it, causing a NameError.

This appears to be pre-existing behavior, but since the line is modified, worth noting. Consider adding model to the condition or guarding line 192.

🛡️ One possible fix
         if args.save_models:
             checkpoint_dir = (
                 args.solutions_path.with_name(f"{args.solutions_path.stem}--checkpoints")
                 / f"solution_{i_solution}"
             )

             model_config.dtype = resolve_torch_dtype(getattr(args, "model_dtype", "torch.bfloat16"))
             Converter.copy_checkpoint_files(args.teacher_dir, checkpoint_dir)
             if realizable_as_symlinks:
                 if dist.is_master():
                     # TODO: Loo into internal Puzzleron code to see how to save as symlinks
                     # save_checkpoint_as_symlinks is currently not supported
                     pass
-            save_checkpoint_from_shards(model, checkpoint_dir, descriptor)
+            if not realizable_as_symlinks:
+                save_checkpoint_from_shards(model, checkpoint_dir, descriptor)

Or ensure model is always loaded when args.save_models is True:

-        if (args.save_models and not realizable_as_symlinks) or (not args.skip_validation):
+        if args.save_models or (not args.skip_validation):
             model = replacement_library.load_model(layer_replacements)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`
around lines 175 - 192, The code can raise NameError because model is only set
when (args.save_models and not realizable_as_symlinks) or (not
args.skip_validation) but save_checkpoint_from_shards(model, ...) always runs;
ensure model exists before use by either: move/recompute model assignment so
replacement_library.load_model(layer_replacements) runs whenever
args.save_models is True (e.g., include realizable_as_symlinks branch), or guard
the call to save_checkpoint_from_shards behind the same condition that sets
model (use the same boolean logic involving args.save_models,
realizable_as_symlinks, and args.skip_validation) so save_checkpoint_from_shards
only receives a valid model; update related model_config assignment
(model_config.dtype) to match the chosen approach.
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

632-649: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable and restore the plural fused-expert weight quantizers too.

The new fused-expert path strips *_weight_quantizers.<idx> from saved state, but Lines 632-649 still only disable singular *_weight_quantizer children. nn.ModuleList-backed fused-expert quantizers therefore remain enabled when modelopt_state() is captured, so the reload metadata can disagree with the fakequantized weights, and _check_all_weight_quantizers_disabled() will miss it for the same reason.

🧩 Suggested fix
         for _, module in model.named_modules():
             if isinstance(module, QuantModule):
                 for attr_name, quantizer in module.named_children():
-                    if not (attr_name.endswith("weight_quantizer") and quantizer.is_enabled):
-                        continue
-                    if isinstance(quantizer, SequentialQuantizer):
-                        quantizer.disable()
-                        for sub in quantizer:
-                            orig_rotate = sub._rotate
-                            if sub.rotate_is_enabled:
-                                sub._rotate = disable_rotate(sub)
-                            wqs_to_restore.append((sub, orig_rotate))
-                    elif isinstance(quantizer, TensorQuantizer):
-                        quantizer.disable()
-                        orig_rotate = quantizer._rotate
-                        if quantizer.rotate_is_enabled:
-                            quantizer._rotate = disable_rotate(quantizer)
-                        wqs_to_restore.append((quantizer, orig_rotate))
+                    quantizers: list[TensorQuantizer | SequentialQuantizer] = []
+                    if attr_name.endswith("weight_quantizers") and isinstance(quantizer, nn.ModuleList):
+                        quantizers = [
+                            q
+                            for q in quantizer
+                            if isinstance(q, (TensorQuantizer, SequentialQuantizer)) and q.is_enabled
+                        ]
+                    elif (
+                        attr_name.endswith("weight_quantizer")
+                        and isinstance(quantizer, (TensorQuantizer, SequentialQuantizer))
+                        and quantizer.is_enabled
+                    ):
+                        quantizers = [quantizer]
+
+                    for q in quantizers:
+                        if isinstance(q, SequentialQuantizer):
+                            q.disable()
+                            for sub in q:
+                                orig_rotate = sub._rotate
+                                if sub.rotate_is_enabled:
+                                    sub._rotate = disable_rotate(sub)
+                                wqs_to_restore.append((sub, orig_rotate))
+                        else:
+                            q.disable()
+                            orig_rotate = q._rotate
+                            if q.rotate_is_enabled:
+                                q._rotate = disable_rotate(q)
+                            wqs_to_restore.append((q, orig_rotate))

Please mirror the same plural/singular test in _check_all_weight_quantizers_disabled().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 632 - 649,
The loop that disables per-module weight quantizers only checks for singular
child names ending with "weight_quantizer" and misses plural fused-expert
ModuleList children (e.g., "weight_quantizers"), so update the disable logic in
the block that iterates model.named_modules() to also detect attr_name ending
with "weight_quantizers" (or the ModuleList-typed plural container), iterate its
elements, disable each element (handling SequentialQuantizer and TensorQuantizer
cases the same as the singular branch), call disable_rotate when
rotate_is_enabled, and append the (quantizer, orig_rotate) tuples to
wqs_to_restore; then mirror the same plural-vs-singular checks in
_check_all_weight_quantizers_disabled() so it verifies both single
"weight_quantizer" children and ModuleList "weight_quantizers" entries when
asserting all weight quantizers are disabled.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/ort_utils.py (1)

146-167: 💤 Low value

Consider adding a platform guard for defensive programming.

The function uses ctypes.windll (line 174) which only exists on Windows. While currently this function is only called within a Windows check (line 245-246), adding an early guard would prevent potential AttributeError if the function is ever called directly from elsewhere.

🛡️ Optional defensive guard
 def _load_extra_cudnn_dlls():
     """Load any cuDNN DLLs from site-packages that ORT's preload_dlls() missed.
     ...
     """
+    if platform.system() != "Windows":
+        return
+
     import ctypes
     import ctypes.wintypes
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/ort_utils.py` around lines 146 - 167, The
_load_extra_cudnn_dlls function should defensively return early on non-Windows
platforms to avoid AttributeError from using ctypes.windll; add a platform guard
at the top of _load_extra_cudnn_dlls (e.g., check sys.platform or os.name for
Windows) so the function exits immediately when not on Windows, while keeping
the existing _find_cudnn_bin_dir check and logging behavior unchanged.
tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py (1)

63-63: 💤 Low value

Minor: Redundant model instantiation.

get_tiny_llama() is called again just to access config.num_hidden_layers. Reuse the model from line 41 instead.

♻️ Suggested fix
-        assert cfg["num_hidden_layers"] == get_tiny_llama().config.num_hidden_layers
+        assert cfg["num_hidden_layers"] == model.config.num_hidden_layers
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py` at line 63,
The assertion redundantly calls get_tiny_llama() again to read
config.num_hidden_layers; instead reuse the already-instantiated model (variable
named model) created earlier (from get_tiny_llama()) and change the check to
compare cfg["num_hidden_layers"] with model.config.num_hidden_layers so you
remove the extra get_tiny_llama() call.
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml (1)

18-23: ⚡ Quick win

Avoid requiring unrelated API secrets in this template.

Line 18–23 and Line 110–113 ask for API/OpenAI credentials even though this workflow’s listed tasks rely on HF-backed evals. Prefer scoping extra secrets only to tasks that need them to reduce unnecessary secret exposure.

Proposed tightening
-#   # Set additional unused but required environment variables:
-#   export API_KEY=xxxxxx
-#   export INFERENCE_API_KEY=xxxxxx
-#   export OPENAI_CLIENT_ID=xxxxxx
-#   export OPENAI_CLIENT_SECRET=xxxxxx
+#   # If a specific task/provider requires extra credentials, add them per-task.
   env_vars:
     HF_TOKEN: HF_TOKEN
     HF_HOME: HF_HOME
     VLLM_CACHE_ROOT: VLLM_CACHE_ROOT
-    API_KEY: API_KEY
-    INFERENCE_API_KEY: INFERENCE_API_KEY
-    OPENAI_CLIENT_ID: OPENAI_CLIENT_ID
-    OPENAI_CLIENT_SECRET: OPENAI_CLIENT_SECRET

Also applies to: 106-113

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`
around lines 18 - 23, Remove the globally listed unrelated secret environment
variables (API_KEY, INFERENCE_API_KEY, OPENAI_CLIENT_ID, OPENAI_CLIENT_SECRET)
from the top-level template and instead scope them only to the specific tasks
that require them (or mark them as optional) so HF-backed evals don’t prompt for
unnecessary secrets; update the commented blocks around those env vars (the
comments containing API_KEY / INFERENCE_API_KEY / OPENAI_CLIENT_ID /
OPENAI_CLIENT_SECRET) and any duplicated occurrences later in the file to either
remove them or relocate them into the relevant task-level env sections.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/release-cherry-pick/SKILL.md:
- Around line 20-24: The gh api command in the SKILL.md snippet uses per_page=50
which will truncate results when more than 50 matching PRs exist; update the
command that constructs the search (the gh api "search/issues?q=...+per_page=50"
invocation) to handle pagination—either increase per_page to the GitHub max
(100) and/or add --paginate to the gh api call so all pages are returned, and
ensure the pipeline (the jq and sort usage) still consumes streamed results
correctly.
- Around line 28-35: The current text says "creating it from the remote if it
doesn't exist locally" but the shown git fetch/checkout will fail if
release/<VERSION> doesn't exist on the remote; update the Step 3 wording to
reflect that, or replace the commands with logic that first git fetch origin,
then test for origin/release/<VERSION> (e.g., using git rev-parse --verify
origin/release/<VERSION>) and if it exists run git checkout release/<VERSION>,
otherwise create the branch locally from main (git checkout -b release/<VERSION>
origin/main) and push it upstream (git push -u origin release/<VERSION>).

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md`:
- Line 146: Fix the user-facing typos in the README: change "atmost" to "at
most" and replace the shorthand "hparams" with "hyperparameters" (e.g., the
sentence "Only considering atmost 40% for width and 20% for depth pruning
hparams" should read "Only considering at most 40% for width and 20% for depth
pruning hyperparameters"); apply the same wording corrections to the equivalent
lines around Line 184 and Line 228 so all tutorial sentences use "at most" and
"hyperparameters" consistently.

In `@examples/pruning/minitron/README.md`:
- Line 3: The intro sentence in README.md has a missing space after the comma
("evaluation,and"); update the sentence in the README (the Minitron tutorial
intro line) to read "evaluation, and" by inserting a space after the comma so it
becomes "evaluation, and vLLM deployment."

In `@examples/pruning/README.md`:
- Around line 295-300: The "Data Composition" guidance mixes the spellings
"pre-training" and "pretraining"; pick one consistent spelling (e.g.,
"pretraining") and update every occurrence in this section including the entries
under "Data Composition" and the related lines noted (previously 295-300 and
307-310) so all user-facing text uses the same term; search for both
"pre-training" and "pretraining" in the README.md and replace them consistently
(references: the "Data Composition" row and the adjacent bullet lines).

In `@modelopt/onnx/quantization/int4.py`:
- Around line 1322-1324: The bug is that new_tensor.get() is called
unconditionally which fails when cupy is unavailable; update the conversion
before calling numpy_helper.from_array to use np.asnumpy(new_tensor) when
has_cupy is True and otherwise pass the numpy array directly (i.e., use the same
pattern as other conversions: conditionally call np.asnumpy on new_tensor before
numpy_helper.from_array), referencing the variables/new calls new_tensor,
has_cupy, np.asnumpy and numpy_helper.from_array so the fix is applied where
new_tensor is created and passed into numpy_helper.from_array.

In `@modelopt/torch/distill/plugins/megatron.py`:
- Line 175: The current line uses submodule_name.replace(f".{match.group(0)}",
f".{new_layer_idx}") which replaces every identical ".<idx>" token; instead,
only replace the specific match instance. Locate the code referencing
submodule_name, match, new_layer_idx and change it to perform a single-instance
replacement using the match span (e.g., compute start,end = match.span(0) and
build new_submodule_name = submodule_name[:start] + f".{new_layer_idx}" +
submodule_name[end:]) or use re.sub with count=1 anchored to the exact match;
ensure only the matched token is changed.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 653-659: The current check sends any module with
gate_up_proj_weight_quantizers to _export_fused_experts regardless of whether
quantization is actually enabled; update the condition to only call
_export_fused_experts when the module both exposes
gate_up_proj_weight_quantizers and is actually quantized (e.g., change the if to
require hasattr(sub_module, "gate_up_proj_weight_quantizers") and
get_quantization_format(sub_module) != QUANTIZATION_NONE), or alternatively
inspect the items in sub_module.gate_up_proj_weight_quantizers for an
enabled/active flag before entering the
fsdp2_aware_weight_update/_export_fused_experts path so non-quantized
_QuantFusedExperts are not rewritten. Ensure you adjust/remove the following
branches accordingly: the if that references gate_up_proj_weight_quantizers, the
subsequent elif using get_quantization_format, and keep calls to
_export_fused_experts and fsdp2_aware_weight_update only when quantization is
active.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 231-235: The merge loop using full_sd.update(shard_sd) can
silently overwrite real tensors with placeholder zeros from other ranks; change
the merge to iterate each shard_sd's items and for each key k: if k not in
full_sd set full_sd[k]=v, else verify torch.equal(full_sd[k], v) (and raise a
ValueError on mismatch) or skip if you prefer keeping the first-seen owner—apply
this logic where full_sd and gathered are used to build the combined state_dict
(the shard_sd aggregation loop).

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 522-537: The probe tensor target_input is only built once before
the retry loop, so after an OOM and reducing target_data_batch you must rebuild
target_input using sample_input_single_batch.expand(...) with the new
target_data_batch before calling infer_method; move the expand logic (or
recreate target_input) inside the while loop (referencing
sample_input_single_batch, target_data_batch, and infer_method) so each retry
tests the correctly sized probe batch under the
torch.set_grad_enabled(enable_grad) context.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 195-202: The exception handler in the try/except around
_Encoder.tokenizer.apply_chat_template currently prints the entire data payload
(variable data), which risks leaking sensitive content; replace the print of
json.dumps(data, ...) with a safe, minimal identifier and summary: log a stable
id if present (e.g., data.get("id") or data.get("row_id")), or compute and log a
short hash/fingerprint (e.g., SHA256 hex prefix) of the serialized row plus a
small sanitized field summary (e.g., length and first 100 chars of a sanitized
"text" field or list of keys), and include the exception message and context
(function name/tokenizer call) instead of the full payload in the raise/LOG call
in the except block that follows the call to
_Encoder.tokenizer.apply_chat_template.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py`:
- Around line 120-132: The test_distributed_save_creates_valid_checkpoint
currently only verifies keys; update it to also load the saved tensors
referenced by SAFE_WEIGHTS_INDEX_NAME/weight_map (and shards in
SAFETENSORS_SUBBLOCKS_DIR_NAME) and compare each tensor's data to the original
model from get_tiny_llama() (use the same key mapping) using an element-wise
comparison (e.g., torch.equal or torch.allclose with a tolerance) to ensure
values match—mirror the approach used in test_saved_weights_match_original and
perform comparisons for every entry in index["weight_map"] to detect silent
corruption from _distributed_save_worker.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 386-441: The test is flaky because routing depends on random model
init; seed the model and make the forward_loop drive tokens to every expert
deterministically: call torch.manual_seed(...) before creating _TinyMoEModel()
(so weights are deterministic) and replace the current forward_loop with one
that, for idx in range(NUM_EXPERTS), constructs inputs that target expert idx
(e.g., unique per-expert input vectors or one-hot-like patterns) and runs m(x)
so each expert in model.moe.experts is exercised at least once before
calibration assertions.

---

Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 632-649: The loop that disables per-module weight quantizers only
checks for singular child names ending with "weight_quantizer" and misses plural
fused-expert ModuleList children (e.g., "weight_quantizers"), so update the
disable logic in the block that iterates model.named_modules() to also detect
attr_name ending with "weight_quantizers" (or the ModuleList-typed plural
container), iterate its elements, disable each element (handling
SequentialQuantizer and TensorQuantizer cases the same as the singular branch),
call disable_rotate when rotate_is_enabled, and append the (quantizer,
orig_rotate) tuples to wqs_to_restore; then mirror the same plural-vs-singular
checks in _check_all_weight_quantizers_disabled() so it verifies both single
"weight_quantizer" children and ModuleList "weight_quantizers" entries when
asserting all weight quantizers are disabled.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`:
- Around line 175-192: The code can raise NameError because model is only set
when (args.save_models and not realizable_as_symlinks) or (not
args.skip_validation) but save_checkpoint_from_shards(model, ...) always runs;
ensure model exists before use by either: move/recompute model assignment so
replacement_library.load_model(layer_replacements) runs whenever
args.save_models is True (e.g., include realizable_as_symlinks branch), or guard
the call to save_checkpoint_from_shards behind the same condition that sets
model (use the same boolean logic involving args.save_models,
realizable_as_symlinks, and args.skip_validation) so save_checkpoint_from_shards
only receives a valid model; update related model_config assignment
(model_config.dtype) to match the chosen approach.
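The guard condition described above can be captured in a single predicate. This is a sketch of the suggested boolean logic only; the argument names mirror the `args` fields, and the downstream call sites are assumptions:

```python
def needs_loaded_model(save_models: bool, realizable_as_symlinks: bool,
                       skip_validation: bool) -> bool:
    """True when `model` must actually be materialized: either a
    non-symlink save needs real weights, or validation will run."""
    return (save_models and not realizable_as_symlinks) or not skip_validation

# save_checkpoint_from_shards(model, ...) should run only under the same
# predicate, otherwise `model` is unbound (the reported NameError):
assert needs_loaded_model(save_models=True, realizable_as_symlinks=False, skip_validation=True)
assert needs_loaded_model(save_models=False, realizable_as_symlinks=True, skip_validation=False)
assert not needs_loaded_model(save_models=False, realizable_as_symlinks=True, skip_validation=True)
```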

---

Nitpick comments:
In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`:
- Around line 18-23: Remove the globally listed unrelated secret environment
variables (API_KEY, INFERENCE_API_KEY, OPENAI_CLIENT_ID, OPENAI_CLIENT_SECRET)
from the top-level template and instead scope them only to the specific tasks
that require them (or mark them as optional) so HF-backed evals don’t prompt for
unnecessary secrets; update the commented blocks around those env vars (the
comments containing API_KEY / INFERENCE_API_KEY / OPENAI_CLIENT_ID /
OPENAI_CLIENT_SECRET) and any duplicated occurrences later in the file to either
remove them or relocate them into the relevant task-level env sections.

In `@modelopt/onnx/quantization/ort_utils.py`:
- Around line 146-167: The _load_extra_cudnn_dlls function should defensively
return early on non-Windows platforms to avoid AttributeError from using
ctypes.windll; add a platform guard at the top of _load_extra_cudnn_dlls (e.g.,
check sys.platform or os.name for Windows) so the function exits immediately
when not on Windows, while keeping the existing _find_cudnn_bin_dir check and
logging behavior unchanged.
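The early-return guard might look like this. It is a sketch: `find_cudnn_bin_dir` is passed in here instead of imported, and the DLL-loading details are illustrative rather than the module's actual code:

```python
import os
import sys

def load_extra_cudnn_dlls(find_cudnn_bin_dir):
    """Bail out before any ctypes.windll use on non-Windows platforms."""
    if sys.platform != "win32":  # equivalently: os.name != "nt"
        return []
    bin_dir = find_cudnn_bin_dir()
    if not bin_dir:
        return []
    import ctypes  # only reached on Windows, where ctypes.windll exists
    loaded = []
    for name in sorted(os.listdir(bin_dir)):
        if name.lower().endswith(".dll"):
            ctypes.windll.kernel32.LoadLibraryW(os.path.join(bin_dir, name))
            loaded.append(name)
    return loaded

# On Linux/macOS the finder is never even called:
calls = []
result = load_extra_cudnn_dlls(lambda: calls.append("called"))
```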

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py`:
- Line 63: The assertion redundantly calls get_tiny_llama() again to read
config.num_hidden_layers; instead reuse the already-instantiated model (variable
named model) created earlier (from get_tiny_llama()) and change the check to
compare cfg["num_hidden_layers"] with model.config.num_hidden_layers so you
remove the extra get_tiny_llama() call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c29df816-dd02-41f0-81cf-28e64d80103b

📥 Commits

Reviewing files that changed from the base of the PR and between b1ec471 and 1b2f029.

⛔ Files ignored due to path filters (1)
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/figures/learning_curves.png is excluded by !**/*.png
📒 Files selected for processing (44)
  • .claude/skills/release-cherry-pick/SKILL.md
  • CHANGELOG.rst
  • docs/source/getting_started/windows/_installation_standalone.rst
  • examples/dataset/MEGATRON_DATA_PREP.md
  • examples/dataset/README.md
  • examples/megatron_bridge/README.md
  • examples/pruning/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/ABLATIONS.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
  • examples/pruning/minitron/README.md
  • examples/pruning/puzzletron/Llama-3.1-8B-Instruct.md
  • examples/pruning/puzzletron/README.md
  • examples/puzzletron/README.md
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/pruning/attn_pruning.yaml
  • examples/speculative_decoding/main.py
  • modelopt/onnx/quantization/autotune/benchmark.py
  • modelopt/onnx/quantization/int4.py
  • modelopt/onnx/quantization/ort_utils.py
  • modelopt/torch/distill/plugins/megatron.py
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
  • modelopt/torch/puzzletron/replacement_library/build_replacement_library.py
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
  • modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/conversion.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/utils/__init__.py
  • modelopt/torch/quantization/utils/core_utils.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • modelopt/torch/utils/plugins/transformers_dataset.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml
  • tests/gpu/torch/puzzletron/test_puzzletron.py
  • tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py
  • tests/regression/torch/speculative/test_dflash_offline.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/utils/test_dataset_utils.py
💤 Files with no reviewable changes (2)
  • tests/gpu/torch/puzzletron/test_puzzletron.py
  • modelopt/onnx/quantization/autotune/benchmark.py

Comment on lines +20 to +24
```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle pagination for repositories with many pending cherry-picks.

The GitHub API query uses per_page=50, which limits results to 50 PRs. If more than 50 PRs are labeled cherry-pick-<VERSION> without cherry-pick-done, later PRs will be silently omitted from the cherry-pick batch.

📄 Suggested approaches

Option 1: Increase the page size to a safer limit:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \

Option 2: Add pagination using --paginate:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api --paginate "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
   --jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/release-cherry-pick/SKILL.md around lines 20 - 24, The gh api
command in the SKILL.md snippet uses per_page=50 which will truncate results
when more than 50 matching PRs exist; update the command that constructs the
search (the gh api "search/issues?q=...+per_page=50" invocation) to handle
pagination—either increase per_page to the GitHub max (100) and/or add
--paginate to the gh api call so all pages are returned, and ensure the pipeline
(the jq and sort usage) still consumes streamed results correctly.

Comment on lines +28 to +35
## Step 3 — Set up the release branch

Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:

```bash
git fetch origin release/<VERSION>
git checkout release/<VERSION>
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify release branch creation behavior.

The description states "creating it from the remote if it doesn't exist locally," but the provided commands will fail if the release/<VERSION> branch doesn't exist on the remote. The commands only check out an existing remote branch—they don't create it.

📝 Suggested clarification

Either update the description to reflect the actual behavior:

-Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:
+Check out the existing `release/<VERSION>` branch from the remote:

Or provide commands that handle both cases:

+If the release branch doesn't exist yet, create it from main:
+
+```bash
+git fetch origin
+if git rev-parse --verify origin/release/<VERSION> >/dev/null 2>&1; then
+  git checkout release/<VERSION>
+else
+  git checkout -b release/<VERSION> origin/main
+  git push -u origin release/<VERSION>
+fi
+```
+
+Otherwise, check out the existing branch:
+
 ```bash
 git fetch origin release/<VERSION>
 git checkout release/<VERSION>

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/release-cherry-pick/SKILL.md around lines 28 - 35, The
current text says "creating it from the remote if it doesn't exist locally" but
the shown git fetch/checkout will fail if release/<VERSION> doesn't exist on the
remote; update the Step 3 wording to reflect that, or replace the commands with
logic that first git fetch origin, then test for origin/release/<VERSION> (e.g.,
using git rev-parse --verify origin/release/<VERSION>) and if it exists run git
checkout release/<VERSION>, otherwise create the branch locally from main (git
checkout -b release/<VERSION> origin/main) and push it upstream (git push -u
origin release/<VERSION>).




Important pruning logs:

```text
Only considering atmost 40% for width and 20% for depth pruning hparams

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix small typos in user-facing instructions.

There are a few wording typos in key tutorial lines (Line 146, Line 184, Line 228) that are quick to clean up.

Suggested edits
-Only considering atmost 40% for width and 20% for depth pruning hparams
+Only considering at most 40% for width and 20% for depth pruning hparams
-> While we use 96 nodes here for faster training, you can also run with 1 node. If you dont want to do full distillation run, you can stop earlier and take intermediate checkpoints as well.
+> While we use 96 nodes here for faster training, you can also run with 1 node. If you don't want to do a full distillation run, you can stop earlier and take intermediate checkpoints as well.
-The eval config xin [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it.
+The eval config in [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it.

Also applies to: 184-184, 228-228

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md` at line 146,
Fix the user-facing typos in the README: change "atmost" to "at most" and
replace the shorthand "hparams" with "hyperparameters" (e.g., the sentence "Only
considering atmost 40% for width and 20% for depth pruning hparams" should read
"Only considering at most 40% for width and 20% for depth pruning
hyperparameters"); apply the same wording corrections to the equivalent lines
around Line 184 and Line 228 so all tutorial sentences use "at most" and
"hyperparameters" consistently.

@@ -0,0 +1,11 @@
# Minitron Pruning — End-to-End Tutorials

End-to-end tutorials for [Minitron](https://arxiv.org/abs/2407.14679) structured pruning followed by knowledge distillation, quantization, evaluation,and vLLM deployment.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the missing space in the intro sentence.

evaluation,and should read evaluation, and.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/README.md` at line 3, The intro sentence in
README.md has a missing space after the comma ("evaluation,and"); update the
sentence in the README (the Minitron tutorial intro line) to read "evaluation,
and" by inserting a space after the comma so it becomes "evaluation, and vLLM
deployment."

Comment on lines +295 to 300
| **Global Batch Size (GBS)** | same as the original training or 768 if unsure |
| **Micro Batch Size (MBS)** | As large as your GPU memory can accommodate |
| **Learning Rate (LR)** | 1e-4 → 1e-5 (linear decay) for 30-50% pruning<br>• More compression → higher LR<br>• Less compression → lower LR<br>• As model gets larger → reduce LR to avoid divergence |
| **Warmup Steps** | 100 |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens for best results. |
| **Data Composition** | • Standard models: 100% pre-training data<br>• Reasoning models: 70% reasoning data + 30% pre-training data |
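As a quick sanity check of the max-steps formula in the table above (the sequence length and global batch size below are illustrative numbers, not values prescribed by the tutorial):

```python
# Max Steps = num training tokens / (seq len × global batch size)
num_tokens = 100_000_000_000   # 100B tokens, upper end of the recommendation
seq_len, gbs = 8192, 768       # assumed sequence length and GBS
max_steps = num_tokens // (seq_len * gbs)
print(max_steps)  # 15894 training steps
```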

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use one spelling of “pretraining” in this section.

The edited guidance now mixes pre-training and pretraining, which reads like an edit artifact in user-facing docs.

Also applies to: 307-310

🧰 Tools
🪛 LanguageTool

[uncategorized] ~300-~300: Do not mix variants of the same word (‘pre-train’ and ‘pretrain’) within a single text.
Context: ...Composition** | • Standard models: 100% pre-training data
• Reasoning models: 70% reasoni...

(EN_WORD_COHERENCY)


[uncategorized] ~300-~300: Do not mix variants of the same word (‘pre-train’ and ‘pretrain’) within a single text.
Context: ...soning models: 70% reasoning data + 30% pre-training data | > [!TIP] > If you know the maxi...

(EN_WORD_COHERENCY)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/README.md` around lines 295 - 300, The "Data Composition"
guidance mixes the spellings "pre-training" and "pretraining"; pick one
consistent spelling (e.g., "pretraining") and update every occurrence in this
section including the entries under "Data Composition" and the related lines
noted (previously 295-300 and 307-310) so all user-facing text uses the same
term; search for both "pre-training" and "pretraining" in the README.md and
replace them consistently (references: the "Data Composition" row and the
adjacent bullet lines).

Comment on lines +231 to +235
full_sd: dict[str, torch.Tensor] = {}
for shard_sd in gathered:
if shard_sd is None:
continue
full_sd.update(shard_sd)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Merge logic is fragile with overlapping keys across ranks.

The full_sd.update(shard_sd) approach silently overwrites values when multiple ranks have the same key. If a rank provides a "placeholder" value (e.g., zeros) for keys it doesn't own, later ranks will overwrite earlier ranks' real values.

This is exactly what the distributed test does—each rank zeroes keys it doesn't own, so the final merged state will have zeros from the last rank overwriting real values from earlier ranks for those keys.

Consider either:

  1. Only updating keys that are missing from full_sd, or
  2. Documenting that ranks must provide disjoint key sets (no overlapping keys).
🛡️ Proposed fix to avoid overwriting existing values
             full_sd: dict[str, torch.Tensor] = {}
             for shard_sd in gathered:
                 if shard_sd is None:
                     continue
-                full_sd.update(shard_sd)
+                for k, v in shard_sd.items():
+                    if k not in full_sd:
+                        full_sd[k] = v

Alternatively, if overlapping keys with identical values are expected, you could add a check:

for k, v in shard_sd.items():
    if k in full_sd:
        if not torch.equal(full_sd[k], v):
            raise ValueError(f"Key {k} has conflicting values across ranks")
    else:
        full_sd[k] = v
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 231 -
235, The merge loop using full_sd.update(shard_sd) can silently overwrite real
tensors with placeholder zeros from other ranks; change the merge to iterate
each shard_sd's items and for each key k: if k not in full_sd set full_sd[k]=v,
else verify torch.equal(full_sd[k], v) (and raise a ValueError on mismatch) or
skip if you prefer keeping the first-seen owner—apply this logic where full_sd
and gathered are used to build the combined state_dict (the shard_sd aggregation
loop).

Comment on lines +522 to +537
target_input = sample_input_single_batch.expand(
[
target_data_batch if index == 0 else dim
for index, dim in enumerate(sample_input_single_batch.shape)
]
)
target_data_batch = 1
else:
target_data_batch = max(int(free_mem_before / mem_diff_per_data_batch), 1)
target_input = sample_input_single_batch.expand(
[
target_data_batch if index == 0 else dim
for index, dim in enumerate(sample_input_single_batch.shape)
]
)

# For some models on multi GPU, we observe the memory per batch is not a constant.
# So we just test the target batch size and make sure we do not go OOM.
while target_data_batch > 1:
with torch.set_grad_enabled(enable_grad):
try:
infer_method(target_input)
break
except torch.cuda.OutOfMemoryError:
target_data_batch = target_data_batch // 2
# For some models on multi GPU, we observe the memory per batch is not a constant.
# So we just test the target batch size and make sure we do not go OOM.
while target_data_batch > 1:
with torch.set_grad_enabled(enable_grad):
try:
infer_method(target_input)
break
except torch.cuda.OutOfMemoryError: # pragma: no cover - GPU OOM retry path
target_data_batch = target_data_batch // 2 # pragma: no cover

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rebuild the probe tensor after each OOM retry.

target_input is expanded once before the retry loop, so every retry still uses the original oversized batch. If the first estimate OOMs, this path just counts target_data_batch down to 1 without ever validating a smaller candidate, which can collapse auto-batching to 1 even when a larger batch would fit.

Suggested fix
-        target_input = sample_input_single_batch.expand(
-            [
-                target_data_batch if index == 0 else dim
-                for index, dim in enumerate(sample_input_single_batch.shape)
-            ]
-        )
-
         # For some models on multi GPU, we observe the memory per batch is not a constant.
         # So we just test the target batch size and make sure we do not go OOM.
         while target_data_batch > 1:
+            target_input = sample_input_single_batch.expand(
+                [
+                    target_data_batch if index == 0 else dim
+                    for index, dim in enumerate(sample_input_single_batch.shape)
+                ]
+            )
             with torch.set_grad_enabled(enable_grad):
                 try:
                     infer_method(target_input)
                     break
                 except torch.cuda.OutOfMemoryError:  # pragma: no cover - GPU OOM retry path
+                    torch.cuda.empty_cache()  # pragma: no cover
                     target_data_batch = target_data_batch // 2  # pragma: no cover
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/dataset_utils.py` around lines 522 - 537, The probe
tensor target_input is only built once before the retry loop, so after an OOM
and reducing target_data_batch you must rebuild target_input using
sample_input_single_batch.expand(...) with the new target_data_batch before
calling infer_method; move the expand logic (or recreate target_input) inside
the while loop (referencing sample_input_single_batch, target_data_batch, and
infer_method) so each retry tests the correctly sized probe batch under the
torch.set_grad_enabled(enable_grad) context.

Comment on lines +195 to +202
try:
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
except Exception as e:
print(
f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
flush=True,
)
raise

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid dumping the full failing sample into stdout.

Line 199 prints the entire JSON row when apply_chat_template() fails. On real SFT/pretraining data that can leak prompts, tool payloads, PII, or secrets into shared logs, and very large rows will also blow up log volume. Log a stable identifier / field summary instead of the full payload.

🔒 Suggested fix
                 try:
                     text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
                 except Exception as e:
+                    sample_id = data.get("id") or data.get("uuid") or data.get("conversation_id")
                     print(
-                        f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
+                        "apply_chat_template failed "
+                        f"for sample={sample_id!r}: {e}. "
+                        f"Available keys: {sorted(data.keys())}",
                         flush=True,
                     )
                     raise

As per coding guidelines, SECURITY.md says to treat all model/weight/config artifacts and data as untrusted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 195 -
202, The exception handler in the try/except around
_Encoder.tokenizer.apply_chat_template currently prints the entire data payload
(variable data), which risks leaking sensitive content; replace the print of
json.dumps(data, ...) with a safe, minimal identifier and summary: log a stable
id if present (e.g., data.get("id") or data.get("row_id")), or compute and log a
short hash/fingerprint (e.g., SHA256 hex prefix) of the serialized row plus a
small sanitized field summary (e.g., length and first 100 chars of a sanitized
"text" field or list of keys), and include the exception message and context
(function name/tokenizer call) instead of the full payload in the raise/LOG call
in the except block that follows the call to
_Encoder.tokenizer.apply_chat_template.

Comment on lines +120 to +132
def test_distributed_save_creates_valid_checkpoint(self, tmp_path):
spawn_multiprocess_job(2, partial(_distributed_save_worker, checkpoint_dir=tmp_path))

index_path = tmp_path / SAFE_WEIGHTS_INDEX_NAME
assert index_path.exists()
index = json.loads(index_path.read_text())

model = get_tiny_llama()
expected_keys = set(model.state_dict().keys())
assert set(index["weight_map"].keys()) == expected_keys

shard_files = list((tmp_path / SAFETENSORS_SUBBLOCKS_DIR_NAME).glob("*.safetensors"))
assert len(shard_files) > 0

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Multi-process test should verify saved tensor values, not just keys.

The test only checks that weight_map.keys() match expected keys, but doesn't verify the actual tensor values like test_saved_weights_match_original does. Given the merge logic concern in the main implementation, verifying tensor correctness would catch silent data corruption.

💚 Proposed enhancement to verify tensor values
         shard_files = list((tmp_path / SAFETENSORS_SUBBLOCKS_DIR_NAME).glob("*.safetensors"))
         assert len(shard_files) > 0
+
+        # Verify saved values match original model
+        reloaded_sd = {}
+        for shard in shard_files:
+            reloaded_sd.update(safe_load_file(str(shard)))
+
+        original_sd = {k: v.cpu() for k, v in model.state_dict().items()}
+        for key in expected_keys:
+            torch.testing.assert_close(reloaded_sd[key], original_sd[key])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py` around lines
120 - 132, The test_distributed_save_creates_valid_checkpoint currently only
verifies keys; update it to also load the saved tensors referenced by
SAFE_WEIGHTS_INDEX_NAME/weight_map (and shards in
SAFETENSORS_SUBBLOCKS_DIR_NAME) and compare each tensor's data to the original
model from get_tiny_llama() (use the same key mapping) using an element-wise
comparison (e.g., torch.equal or torch.allclose with a tolerance) to ensure
values match—mirror the approach used in test_saved_weights_match_original and
perform comparisons for every entry in index["weight_map"] to detect silent
corruption from _distributed_save_worker.

Comment on lines +386 to +441
def test_calibration_populates_all_expert_quantizers(self):
"""After PTQ, every input/weight quantizer on the fused-experts module has amax set."""
import modelopt.torch.quantization as mtq

model = _TinyMoEModel()
expert_type = type(model.moe.experts)
self._cleanup_registry(expert_type)

quant_cfg = {
"quant_cfg": [
{"quantizer_name": "*", "enable": False},
{
"quantizer_name": "*gate_up_proj_input_quantizer",
"cfg": {"num_bits": 8, "axis": None},
},
{
"quantizer_name": "*down_proj_input_quantizer",
"cfg": {"num_bits": 8, "axis": None},
},
{
"quantizer_name": "*gate_up_proj_weight_quantizer",
"cfg": {"num_bits": 8, "axis": 0},
},
{
"quantizer_name": "*down_proj_weight_quantizer",
"cfg": {"num_bits": 8, "axis": 0},
},
],
"algorithm": "max",
}

def forward_loop(m):
torch.manual_seed(0)
for _ in range(2):
x = torch.randn(1, 4, HIDDEN_DIM)
m(x)

mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

experts = model.moe.experts
assert experts.gate_up_proj_input_quantizer.amax is not None, (
"Shared gate_up_proj input quantizer was not calibrated — "
"F.linear hook likely bypassed by non-eager experts_implementation."
)
assert experts.down_proj_input_quantizer.amax is not None, (
"Shared down_proj input quantizer was not calibrated."
)
for idx in range(NUM_EXPERTS):
assert experts.gate_up_proj_weight_quantizers[idx].amax is not None, (
f"gate_up_proj_weight_quantizers[{idx}].amax is None — "
"plural ModuleList name normalization in _match_quantizer likely broken."
)
assert experts.down_proj_weight_quantizers[idx].amax is not None, (
f"down_proj_weight_quantizers[{idx}].amax is None."
)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make expert coverage deterministic in this calibration test.

This assertion requires every expert to receive tokens, but _TinyMoEModel() is created from unseeded random weights and the router path is data-dependent. That makes the test flaky: a different initialization can leave one or more experts untouched and keep their amax at None even when the fused-experts path is working correctly. Please force a deterministic routing pattern that exercises all experts before asserting on every per-expert quantizer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 386-441: the test is flaky because routing depends on random model initialization. Seed the model and make the forward_loop drive tokens to every expert deterministically: call torch.manual_seed(...) before creating _TinyMoEModel() (so weights are deterministic), and replace the current forward_loop with one that, for idx in range(NUM_EXPERTS), constructs inputs that target expert idx (e.g., unique per-expert input vectors or one-hot-like patterns) and runs m(x), so each expert in model.moe.experts is exercised at least once before the calibration assertions.
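
The fix described above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: TinyMoE is a hypothetical stand-in for the test's _TinyMoEModel, and pinning the gate to an identity-like matrix is one concrete way to make top-1 routing provably hit every expert.

```python
import torch
import torch.nn as nn

NUM_EXPERTS, HIDDEN = 4, 8

class TinyMoE(nn.Module):
    """Top-1 routed MoE; hypothetical stand-in for the test's _TinyMoEModel."""

    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS))
        self.hit = [0] * NUM_EXPERTS  # how many tokens each expert processed

    def forward(self, x):
        expert_idx = self.gate(x).argmax(dim=-1)  # data-dependent top-1 routing
        out = torch.zeros_like(x)
        for e in range(NUM_EXPERTS):
            mask = expert_idx == e
            if mask.any():
                self.hit[e] += int(mask.sum())
                out[mask] = self.experts[e](x[mask])
        return out

torch.manual_seed(0)  # deterministic weights, as the review suggests
m = TinyMoE()

# Pin the gate so routing is provably deterministic: with an identity-like
# gate weight, the one-hot input e_i yields logit 1 for expert i and 0 for
# the rest, so argmax selects expert i regardless of initialization.
with torch.no_grad():
    m.gate.weight.copy_(torch.eye(NUM_EXPERTS, HIDDEN))
    for e in range(NUM_EXPERTS):
        x = torch.zeros(1, HIDDEN)
        x[0, e] = 1.0  # targets expert e
        m(x)

assert all(h > 0 for h in m.hit)  # every expert saw at least one token
```

With this pattern, the per-expert amax assertions in the test can never fail merely because an unlucky initialization starved an expert of tokens.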

@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 64.52906% with 177 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.88%. Comparing base (b1ec471) to head (3720a7a).

Files with missing lines                                 Patch %   Lines
modelopt/onnx/quantization/int4.py                       67.42%    86 Missing ⚠️
modelopt/onnx/quantization/ort_utils.py                  15.09%    45 Missing ⚠️
...lopt/torch/puzzletron/tools/checkpoint_utils_hf.py    25.00%    18 Missing ⚠️
modelopt/torch/export/plugins/vllm_fakequant_hf.py       67.85%     9 Missing ⚠️
modelopt/torch/export/plugins/megatron_importer.py        0.00%     6 Missing ⚠️
...pt/torch/utils/plugins/megatron_preprocess_data.py     0.00%     5 Missing ⚠️
modelopt/torch/quantization/plugins/huggingface.py       90.00%     3 Missing ⚠️
modelopt/torch/distill/plugins/megatron.py                0.00%     2 Missing ⚠️
modelopt/torch/quantization/model_calib.py               60.00%     2 Missing ⚠️
...delopt/torch/utils/plugins/transformers_dataset.py    50.00%     1 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.44.0    #1385      +/-   ##
==================================================
+ Coverage           75.41%   76.88%   +1.46%     
==================================================
  Files                 463      463              
  Lines               50208    50387     +179     
==================================================
+ Hits                37865    38739     +874     
+ Misses              12343    11648     -695     
Flag        Coverage Δ
examples    41.58% <16.63%> (+0.87%) ⬆️
gpu         59.65% <59.31%> (+0.52%) ⬆️
unit        52.84% <51.30%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner May 4, 2026 20:31
@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh and removed request for a team, Edwardf0t1, cjluo-nv, realAsma and yeyu-nvidia May 4, 2026 20:32
@kevalmorabia97 kevalmorabia97 changed the title [Cherry-pick] PRs #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 [Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 May 4, 2026