to_hf.py: add --vllm flag for vLLM-ready checkpoint export by DongjiGao · Pull Request #15617 · NVIDIA-NeMo/NeMo

DongjiGao · 2026-04-16T18:21:49Z

Summary

- Add a `vllm=True` flag to to_hf.py that produces vLLM-ready checkpoints in one step
- prepare_for_vllm() runs after checkpoint consolidation (rank 0 only) and handles architecture detection, chat template generation, tokenizer saving, and generation config
- Depends on #15616, "Add to_jinja() to PromptFormatter for programmatic chat template generation" (provides the prompt formatters' to_jinja())

Changes

`examples/speechlm2/to_hf.py`

- HfExportConfig.vllm: bool = False (opt-in flag)
- _detect_vllm_architecture(): determines hybrid vs standard from pretrained_llm
- prepare_for_vllm(output_dir, model_cfg):
  1. Patches config.json with model_type: "nemo_speechlm" and detected architectures
  2. Calls PromptFormatter.resolve(prompt_format).to_jinja() for chat template
  3. Saves tokenizer with <|audio|> token and chat template
  4. Writes minimal generation_config.json (EOS only; inference params are task-specific)
  - Raises ValueError for missing prompt_format/pretrained_llm
  - Warns when overwriting existing files
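The config-related steps above (1 and 4) can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: the helper name `patch_config_for_vllm` and its parameters are hypothetical, and the tokenizer and chat-template steps are left out because they need HuggingFace transformers.

```python
import json
import logging
import tempfile
from pathlib import Path


def patch_config_for_vllm(output_dir: str, architecture: str, eos_token_id: int) -> None:
    """Hypothetical stand-in for the config.json / generation_config.json steps."""
    out = Path(output_dir)

    # Step 1: add model_type and architectures to the existing config.json.
    config_path = out / "config.json"
    config = json.loads(config_path.read_text())
    config["model_type"] = "nemo_speechlm"
    config["architectures"] = [architecture]
    config_path.write_text(json.dumps(config, indent=2))

    # Step 4: write a generation config that carries the EOS id only.
    gen_path = out / "generation_config.json"
    if gen_path.exists():
        logging.warning("Overwriting existing %s", gen_path)
    gen_path.write_text(json.dumps({"eos_token_id": eos_token_id}, indent=2))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text(json.dumps({"hidden_size": 8}))
        patch_config_for_vllm(tmp, "NeMoSpeechLMForConditionalGeneration", eos_token_id=2)
        print((Path(tmp) / "config.json").read_text())
```

Existing keys in config.json are preserved; only the two metadata fields are added or overwritten.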

Architecture mapping

| LLM backbone | Architecture name |
| --- | --- |
| Hybrid (NemotronH) | `NeMoSpeechLMHybridForConditionalGeneration` |
| Standard (Qwen3, etc.) | `NeMoSpeechLMForConditionalGeneration` |
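The mapping above could be implemented roughly as follows. The two returned names come from the table; the detection rule (a `NemotronH` prefix on any entry of the backbone's `architectures` list) and the function name `pick_vllm_architecture` are assumptions made for illustration, and the PR's `_detect_vllm_architecture()` may use a different test.

```python
HYBRID_ARCH = "NeMoSpeechLMHybridForConditionalGeneration"
STANDARD_ARCH = "NeMoSpeechLMForConditionalGeneration"


def pick_vllm_architecture(backbone_architectures: list[str]) -> str:
    """Map the backbone's HF architecture names to a vLLM plugin class name.

    Assumption: hybrid backbones are recognized by a "NemotronH" prefix.
    """
    if any(name.startswith("NemotronH") for name in backbone_architectures):
        return HYBRID_ARCH
    return STANDARD_ARCH


print(pick_vllm_architecture(["Qwen3ForCausalLM"]))
print(pick_vllm_architecture(["NemotronHForCausalLM"]))
```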

Usage

torchrun --nproc-per-node=8 --nnodes=4 \
    examples/speechlm2/to_hf.py \
    class_path=nemo.collections.speechlm2.models.salm_automodel.SALMAutomodel \
    ckpt_path=/path/to/checkpoint.ckpt \
    ckpt_config=/path/to/exp_config.yaml \
    output_dir=/path/to/hf_output \
    vllm=True

Test plan

- End-to-end: to_hf.py vllm=True on nano-v3-canary1bflash (32 GPUs, 5.5 min)
- Output verified: correct model_type, architectures, chat template (1149 chars), generation_config.json
- Eval: burst_eval_vllm.py on produced checkpoint; WER matches baseline (6.31 overall)
- CI: add "Run CICD" label

Made with Cursor

When vllm=True, prepare_for_vllm() runs after checkpoint consolidation:
- Detects hybrid vs standard architecture from pretrained_llm
- Patches config.json with model_type and architectures
- Generates chat template via PromptFormatter.to_jinja()
- Saves tokenizer with <|audio|> token and chat template
- Writes generation_config.json with eos_token_id only (inference params like temperature are task-specific)
- Raises explicit errors for missing prompt_format/pretrained_llm
- Warns when overwriting existing files

Depends on: prompt-formatters PR (to_jinja), vllm-plugin PR (architecture names)

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

pzelasko · 2026-04-16T18:52:38Z

+
+        llm_cfg = AutoConfig.from_pretrained(pretrained_llm, trust_remote_code=True)
+        archs = getattr(llm_cfg, "architectures", [])
+    except Exception:


Broad exception clause, what errors are you expecting here?
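One way to answer this comment is to catch only the errors the config load is expected to raise and to fail fast when `architectures` is missing. The sketch below is an assumption-laden illustration, not the PR's code: it takes the loader as a parameter so it runs without transformers, the name `read_backbone_architectures` is hypothetical, and the choice of `OSError` and `ValueError` reflects the errors `AutoConfig.from_pretrained` is commonly seen to raise for an unreachable checkpoint or an unrecognized model type.

```python
from types import SimpleNamespace
from typing import Any, Callable


def read_backbone_architectures(load_config: Callable[[str], Any], pretrained_llm: str) -> list[str]:
    """Load the backbone config and return its architectures, failing fast."""
    try:
        llm_cfg = load_config(pretrained_llm)
    except (OSError, ValueError) as err:
        # Narrow clause: anything else (e.g. a programming error) propagates unchanged.
        raise ValueError(f"Could not load HF config for {pretrained_llm!r}") from err

    archs = getattr(llm_cfg, "architectures", None)
    if not archs:
        raise ValueError(f"HF config for {pretrained_llm!r} lists no architectures")
    return list(archs)


# Stand-in loader for illustration; real use might pass
# lambda name: AutoConfig.from_pretrained(name, trust_remote_code=True).
fake_cfg = SimpleNamespace(architectures=["Qwen3ForCausalLM"])
print(read_backbone_architectures(lambda name: fake_cfg, "some/backbone"))
```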

pzelasko · 2026-04-16T18:53:40Z


+    # If True, patch the checkpoint for vLLM inference (add tokenizer, chat template,
+    # model_type/architectures, generation_config). Requires HuggingFace transformers.
+    vllm: bool = False


Remove this flag and always output converted checkpoints that work with both HF and vLLM engines

`to_hf.py` now always emits a checkpoint that can be served by vLLM's SpeechLM plugin: architecture/model_type fields in config.json, the backbone's canonical chat_template in tokenizer_config.json, audio placeholder registered on the tokenizer, and a minimal generation config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM loaders now share the same on-disk artifact (pzelasko's ask on the closed PR NVIDIA-NeMo#15617 thread).

Key bits:
- `_detect_vllm_architecture` inspects the backbone's HF config to pick the right vLLM plugin class. Fail-fast ValueError on missing `architectures` rather than silently defaulting to 'Std' (also addresses pzelasko's review comment about the broad except).
- `prepare_for_vllm` is invoked unconditionally after save, wrapped in `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning so legacy callers that never needed vLLM (e.g., NeMo SALM loading the same dir) still get a clean HF-only checkpoint.
- Tokenizer is re-saved from the backbone (brings its native chat_template along) + augmented with `<|audio|>`; extra_special_tokens is normalized to a dict so vLLM's AutoTokenizer can load it.
- For reasoning backbones (nemotron-nano-v3), the exported chat_template's `enable_thinking` default is flipped to False so vLLM's request-time render matches training-time render; otherwise vLLM silently prepends `<think>\n` to every assistant turn and WER regresses. Verified librispeech-pc WER 1.57 (== baseline) after this fix; without it WER regressed to 5.92.

hf_hub.py: setdefault `model_type` and `architectures` in `HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry the metadata vLLM / transformers need to identify them.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
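The wrapper behaviour described for `_try_prepare_for_vllm` (a `ValueError` becomes a warning, everything else propagates) can be sketched as below. The signature, the boolean return value, and the name `try_prepare` are assumptions for illustration; the commit message does not show the actual implementation.

```python
import logging
from typing import Any, Callable, Mapping


def try_prepare(
    prepare: Callable[[str, Mapping[str, Any]], None],
    output_dir: str,
    model_cfg: Mapping[str, Any],
) -> bool:
    """Run the vLLM preparation step; downgrade ValueError to a warning.

    Returns True when preparation succeeded, False when it was skipped.
    """
    try:
        prepare(output_dir, model_cfg)
    except ValueError as err:
        # The HF-only checkpoint already on disk stays usable.
        logging.warning("Skipping vLLM preparation for %s: %s", output_dir, err)
        return False
    return True
```

Catching only `ValueError` keeps unexpected failures (I/O errors, bugs) visible instead of silently producing a checkpoint that vLLM cannot load.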

…ith backbone-native chat_template (#15623)

* to_hf.py: unconditionally produce vLLM-ready HF checkpoints

  `to_hf.py` now always emits a checkpoint that can be served by vLLM's SpeechLM plugin: architecture/model_type fields in config.json, the backbone's canonical chat_template in tokenizer_config.json, audio placeholder registered on the tokenizer, and a minimal generation config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM loaders now share the same on-disk artifact (pzelasko's ask on the closed PR #15617 thread).

  Key bits:
  - `_detect_vllm_architecture` inspects the backbone's HF config to pick the right vLLM plugin class. Fail-fast ValueError on missing `architectures` rather than silently defaulting to 'Std' (also addresses pzelasko's review comment about the broad except).
  - `prepare_for_vllm` is invoked unconditionally after save, wrapped in `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning so legacy callers that never needed vLLM (e.g., NeMo SALM loading the same dir) still get a clean HF-only checkpoint.
  - Tokenizer is re-saved from the backbone (brings its native chat_template along) + augmented with `<|audio|>`; extra_special_tokens is normalized to a dict so vLLM's AutoTokenizer can load it.
  - For reasoning backbones (nemotron-nano-v3), the exported chat_template's `enable_thinking` default is flipped to False so vLLM's request-time render matches training-time render; otherwise vLLM silently prepends `<think>\n` to every assistant turn and WER regresses. Verified librispeech-pc WER 1.57 (== baseline) after this fix; without it WER regressed to 5.92.

  hf_hub.py: setdefault `model_type` and `architectures` in `HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry the metadata vLLM / transformers need to identify them.

  Made-with: Cursor
  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: add missing return-type hints per NeMo PR checklist

  Five top-level helpers (load_checkpoint, setup_distributed_from_config, consolidate_state_dict, save_hf_checkpoint, main) lacked return types. Uses Any for setup_distributed_from_config's AutomodelParallelStrategy return to avoid adding an import just for typing; concrete types everywhere else. Same fix previously applied on pr/vllm-plugin.

  Made-with: Cursor
  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Apply isort and black reformatting

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: rescue chat_template.jinja before deleting it

  Modern HuggingFace transformers (~4.42+) moves long chat_template strings out of tokenizer_config.json into a separate chat_template.jinja file to keep the JSON readable. Qwen3-1.7B's 4168-char template triggers this split; Nemotron-Nano's shorter template stays inline.

  The old code deleted chat_template.jinja before reading tokenizer_config.json, assuming the inline copy was always complete. For Qwen3 that meant the exported checkpoint shipped with an empty chat_template -- vLLM's apply_chat_template returned a prompt without the <|audio|> placeholder, which broke multimodal prompt replacement (Failed to apply prompt replacement for mm_items['audio'][0]).

  Now read chat_template.jinja, inline it into tokenizer_config.json when non-empty, and only then delete the file. Nemotron's inline-only path is unchanged because .jinja doesn't get written for small templates.

  Made-with: Cursor
  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: force tokenizer_class=PreTrainedTokenizerFast for vLLM compat

  Newer NeMo containers (e.g. nemo-25.11-pytorch2.9-automodel-03apr26) wrap AutoTokenizer.from_pretrained(trust_remote_code=True) in a NeMo-internal TokenizersBackend class. save_pretrained then writes 'tokenizer_class: TokenizersBackend' to tokenizer_config.json -- not in HF transformers' registry, so vLLM's AutoTokenizer.from_pretrained crashes at server load:

  ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

  The underlying tokenizer.json is a valid HF fast tokenizer regardless of which wrapper produced it; force the class name back to PreTrainedTokenizerFast so downstream HF-based loaders (including vLLM) can round-trip the config.

  Made-with: Cursor
  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training

  Qwen3's chat_template injects '<think>\n\n</think>\n\n' before assistant content when enable_thinking=False (the 'no reasoning' mode, which is what SpeechLM ASR wants). The old QwenPromptFormatter didn't include this prefix, so SpeechLM fine-tunes trained through it showed the model a turn shape that's different from Qwen3's pre-training convention.

  Bake NO_THINK_PREFIX into both INFERENCE_PREFIX and (transitively via INFERENCE_PREFIX) the assistant template, so future fine-tunes produce training data byte-identical to 'apply_chat_template(enable_thinking=False)'. Existing checkpoints are unaffected -- the change only kicks in the next time you retrain with prompt_format=qwen.

  Test: updated hardcoded expected strings to match Qwen3 jinja output for single-turn training and inference.

  Made-with: Cursor
  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: read audio_locator_tag from model config as SoT

  Removes the hardcoded _AUDIO_TOKEN constant and reads the audio placeholder from model_cfg["audio_locator_tag"], raising ValueError if missing. This ensures the exported config.json, added tokenizer symbols, and extra_special_tokens dict all reference the same source of truth as training, avoiding silent drift between train-time and inference-time audio tokens.

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>
  Made-with: Cursor

* NemotronNanoV3PromptFormatter: fix past-asst-no-think + test coverage

  Before this fix, the formatter only normalized past assistant turns that already contained <think>...</think> tags, so a past assistant turn without any think tags would emit as "<|im_start|>assistant\nTEST<|im_end|>\n" while the HF jinja template emits "<|im_start|>assistant\n<think></think>TEST<|im_end|>\n" (jinja unconditionally injects an empty think block for content lacking both tags). This caused a silent train/inference-template divergence for multi-turn dialogs without reasoning history.

  Fix step 3 in encode_dialog to handle all three cases symmetrically:
  - both tags present -> truncate to "<think></think>" + post-</think> content
  - neither tag present -> prepend "<think></think>"
  - only one tag present -> leave as-is (matches jinja)

  Also adds three tests to fill previously-missing coverage:
  - training multi-turn with past assistant missing think tags (regression test for the fix above)
  - inference multi-turn with enable_thinking=False
  - inference multi-turn with enable_thinking=True

  All 12 non-HF tests pass in the NeMo 25.11 container.

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>
  Made-with: Cursor

* to_hf.py: stop patching chat_template's enable_thinking default

  The previous in-place string-replace flipped Nemotron's ``enable_thinking`` default from True to False so that default vLLM inference (with no ``chat_template_kwargs``) would match SpeechLM training rendering. This approach is fragile (silently no-ops if upstream changes the template) and surprising for downstream consumers of the exported checkpoint.

  Serving callers should instead pass ``chat_template_kwargs={"enable_thinking": False}`` (or the OpenAI-API equivalent) at inference time to opt out of thinking. This keeps the exported chat_template byte-identical to the backbone's canonical HF template.

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>
  Made-with: Cursor

* tests: add unit tests for to_hf.py prepare_for_vllm

  Covers the behavior introduced / changed in this PR:
  * Error paths for missing pretrained_llm and missing audio_locator_tag
  * config.json patching (model_type, architectures, audio_locator_tag SoT)
  * Audio token registration (add_special_tokens called only when missing from the backbone vocab)
  * tokenizer_config.json normalization (dict-form extra_special_tokens, forced tokenizer_class=PreTrainedTokenizerFast)
  * chat_template.jinja rescue (inlined back into tokenizer_config.json and the separate .jinja file removed)
  * chat_template is byte-identical after prep (regression guard for the removal of the enable_thinking default-flip)
  * generation_config.json carries the tokenizer's eos_token_id

  The script lives under examples/ and is loaded via importlib; AutoTokenizer and _detect_vllm_architecture are patched so the tests run fully offline. 9 tests pass in 0.35s in the NeMo 25.11 container.

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>
  Made-with: Cursor

* Apply isort and black reformatting

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Revert "QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training"

  This reverts commit 0245881.

  Piotr raised a concern that baking ``NO_THINK_PREFIX`` into ``QwenPromptFormatter`` changes the turn shape seen by in-flight fine-tunes like canary-qwen-2.5b, which were trained on the prior (no-think-prefix) formatter output. Re-rendering the same data through the updated formatter would silently shift the prompt distribution and break those checkpoints.

  Reverting the bake keeps the ``qwen`` prompt format byte-identical to the version those checkpoints saw during training. Any future fine-tune that actually wants the ``<think></think>`` empty-reasoning prefix should use ``Qwen3PromptFormatter`` (already handles it via ``enable_thinking=False``) or explicitly include the prefix in training data, rather than flipping the default for all ``qwen`` consumers.

  Signed-off-by: Dongji Gao <dongjig@nvidia.com>
  Made-with: Cursor

---------

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
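The three-case rule from the NemotronNanoV3PromptFormatter fix in the commit message above can be sketched as a small pure function. This is an illustration only: the function name `normalize_past_assistant` is hypothetical, and splitting on the first `</think>` (rather than the last) is an assumption the commit message does not settle.

```python
EMPTY_THINK = "<think></think>"


def normalize_past_assistant(content: str) -> str:
    """Normalize the think block of a past assistant turn.

    - both tags present: keep an empty think block plus the text after </think>
    - neither tag present: prepend an empty think block
    - only one tag present: leave the content unchanged
    """
    has_open = "<think>" in content
    has_close = "</think>" in content
    if has_open and has_close:
        # Assumption: content after the first closing tag is the answer text.
        return EMPTY_THINK + content.split("</think>", 1)[1]
    if not has_open and not has_close:
        return EMPTY_THINK + content
    return content


print(normalize_past_assistant("TEST"))
print(normalize_past_assistant("<think>reasoning</think>TEST"))
```

Both calls print `<think></think>TEST`, which is the shape the commit message attributes to the HF jinja template for past assistant turns.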

pzelasko reviewed Apr 16, 2026

pzelasko deleted the branch NVIDIA-NeMo:speechlm2-with-nemo-automodel-merge April 17, 2026 14:08

pzelasko closed this Apr 17, 2026

DongjiGao mentioned this pull request Apr 18, 2026

to_hf.py + PromptFormatter: produce vLLM-ready SpeechLM checkpoints with backbone-native chat_template #15623

Merged

6 tasks

DongjiGao wants to merge 1 commit into NVIDIA-NeMo:speechlm2-with-nemo-automodel-merge from DongjiGao:pr/to-hf-vllm
