to_hf.py: add --vllm flag for vLLM-ready checkpoint export #15617
Closed
DongjiGao wants to merge 1 commit into
When vllm=True, prepare_for_vllm() runs after checkpoint consolidation:
- Detects hybrid vs standard architecture from pretrained_llm
- Patches config.json with model_type and architectures
- Generates chat template via PromptFormatter.to_jinja()
- Saves tokenizer with <|audio|> token and chat template
- Writes generation_config.json with eos_token_id only (inference params like temperature are task-specific)
- Raises explicit errors for missing prompt_format/pretrained_llm
- Warns when overwriting existing files

Depends on: prompt-formatters PR (to_jinja), vllm-plugin PR (architecture names)

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
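The `generation_config.json` step described above could look roughly like the sketch below. It is only an illustration: the helper name `write_minimal_generation_config` and the warning text are hypothetical and not taken from the PR. Only the file name, the EOS-only content, and the overwrite warning come from the commit message.

```python
import json
import logging
from pathlib import Path


def write_minimal_generation_config(output_dir, eos_token_id):
    """Write a generation_config.json that carries only eos_token_id.

    Hypothetical helper: sampling parameters such as temperature are left
    out on purpose because they are task-specific.
    """
    path = Path(output_dir) / "generation_config.json"
    if path.exists():
        # Mirror the "warns when overwriting existing files" behavior.
        logging.warning("Overwriting existing %s", path)
    path.write_text(json.dumps({"eos_token_id": eos_token_id}, indent=2), encoding="utf-8")
    return path
```

Keeping the file down to the EOS id means serving-time defaults stay under the caller's control instead of being baked into the checkpoint.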
pzelasko
reviewed
Apr 16, 2026
        llm_cfg = AutoConfig.from_pretrained(pretrained_llm, trust_remote_code=True)
        archs = getattr(llm_cfg, "architectures", [])
    except Exception:
Collaborator
Broad exception clause, what errors are you expecting here?
pzelasko
reviewed
Apr 16, 2026
    # If True, patch the checkpoint for vLLM inference (add tokenizer, chat template,
    # model_type/architectures, generation_config). Requires HuggingFace transformers.
    vllm: bool = False
Collaborator
Remove this flag and always output converted checkpoints that work with both HF and vLLM engines
DongjiGao
added a commit
to DongjiGao/NeMo
that referenced
this pull request
Apr 18, 2026
`to_hf.py` now always emits a checkpoint that can be served by vLLM's SpeechLM plugin: architecture/model_type fields in config.json, the backbone's canonical chat_template in tokenizer_config.json, audio placeholder registered on the tokenizer, and a minimal generation config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM loaders now share the same on-disk artifact (pzelasko's ask on the closed PR NVIDIA-NeMo#15617 thread).

Key bits:
- `_detect_vllm_architecture` inspects the backbone's HF config to pick the right vLLM plugin class. Fail-fast ValueError on missing `architectures` rather than silently defaulting to 'Std' (also addresses pzelasko's review comment about the broad except).
- `prepare_for_vllm` is invoked unconditionally after save, wrapped in `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning so legacy callers that never needed vLLM (e.g., NeMo SALM loading the same dir) still get a clean HF-only checkpoint.
- Tokenizer is re-saved from the backbone (brings its native chat_template along) + augmented with `<|audio|>`; extra_special_tokens is normalized to a dict so vLLM's AutoTokenizer can load it.
- For reasoning backbones (nemotron-nano-v3), the exported chat_template's `enable_thinking` default is flipped to False so vLLM's request-time render matches training-time render; otherwise vLLM silently prepends `<think>\n` to every assistant turn and WER regresses. Verified librispeech-pc WER 1.57 (== baseline) after this fix; without it WER regressed to 5.92.

hf_hub.py: setdefault `model_type` and `architectures` in `HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry the metadata vLLM / transformers need to identify them.

Made-with: Cursor
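The fail-fast detection and the warning-only wrapper described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the set `HYBRID_BACKBONE_ARCHS` and its single entry are placeholders (the thread does not say how hybrid backbones are recognized), and the function names drop the leading underscore to mark them as stand-ins. The two returned class names are the ones listed in the PR description.

```python
import logging
from types import SimpleNamespace

# Assumption: which backbone architectures count as "hybrid" is not stated in
# the thread; this set is a placeholder for illustration only.
HYBRID_BACKBONE_ARCHS = {"NemotronHForCausalLM"}


def detect_vllm_architecture(llm_cfg):
    """Pick a vLLM plugin class name from a backbone config object."""
    archs = getattr(llm_cfg, "architectures", None)
    if not archs:
        # Fail fast instead of silently defaulting to the standard class.
        raise ValueError("backbone config has no `architectures`; cannot pick a vLLM plugin class")
    if any(arch in HYBRID_BACKBONE_ARCHS for arch in archs):
        return "NeMoSpeechLMHybridForConditionalGeneration"
    return "NeMoSpeechLMForConditionalGeneration"


def try_prepare_for_vllm(prepare_fn, *args, **kwargs):
    """Run prepare_fn; downgrade a ValueError to a warning and report success."""
    try:
        prepare_fn(*args, **kwargs)
        return True
    except ValueError as err:
        logging.warning("Skipping vLLM preparation: %s", err)
        return False


# Usage with a stand-in config object (no network access needed).
cfg = SimpleNamespace(architectures=["NemotronHForCausalLM"])
print(detect_vllm_architecture(cfg))
```

Catching only `ValueError` in the wrapper keeps unexpected failures (I/O errors, import errors) visible, which is the concern raised in the review about the broad `except Exception`.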
6 tasks
DongjiGao
added a commit
to DongjiGao/NeMo
that referenced
this pull request
Apr 20, 2026
DongjiGao
added a commit
to DongjiGao/NeMo
that referenced
this pull request
Apr 23, 2026
DongjiGao
added a commit
to DongjiGao/NeMo
that referenced
this pull request
Apr 23, 2026
DongjiGao
added a commit
to DongjiGao/NeMo
that referenced
this pull request
Apr 23, 2026
pzelasko
pushed a commit
that referenced
this pull request
Apr 24, 2026
…ith backbone-native chat_template (#15623)

* to_hf.py: unconditionally produce vLLM-ready HF checkpoints

`to_hf.py` now always emits a checkpoint that can be served by vLLM's SpeechLM plugin: architecture/model_type fields in config.json, the backbone's canonical chat_template in tokenizer_config.json, audio placeholder registered on the tokenizer, and a minimal generation config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM loaders now share the same on-disk artifact (pzelasko's ask on the closed PR #15617 thread).

Key bits:
- `_detect_vllm_architecture` inspects the backbone's HF config to pick the right vLLM plugin class. Fail-fast ValueError on missing `architectures` rather than silently defaulting to 'Std' (also addresses pzelasko's review comment about the broad except).
- `prepare_for_vllm` is invoked unconditionally after save, wrapped in `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning so legacy callers that never needed vLLM (e.g., NeMo SALM loading the same dir) still get a clean HF-only checkpoint.
- Tokenizer is re-saved from the backbone (brings its native chat_template along) + augmented with `<|audio|>`; extra_special_tokens is normalized to a dict so vLLM's AutoTokenizer can load it.
- For reasoning backbones (nemotron-nano-v3), the exported chat_template's `enable_thinking` default is flipped to False so vLLM's request-time render matches training-time render; otherwise vLLM silently prepends `<think>\n` to every assistant turn and WER regresses. Verified librispeech-pc WER 1.57 (== baseline) after this fix; without it WER regressed to 5.92.

hf_hub.py: setdefault `model_type` and `architectures` in `HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry the metadata vLLM / transformers need to identify them.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: add missing return-type hints per NeMo PR checklist

Five top-level helpers (load_checkpoint, setup_distributed_from_config, consolidate_state_dict, save_hf_checkpoint, main) lacked return types. Uses Any for setup_distributed_from_config's AutomodelParallelStrategy return to avoid adding an import just for typing; concrete types everywhere else. Same fix previously applied on pr/vllm-plugin.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: rescue chat_template.jinja before deleting it

Modern HuggingFace transformers (~4.42+) moves long chat_template strings out of tokenizer_config.json into a separate chat_template.jinja file to keep the JSON readable. Qwen3-1.7B's 4168-char template triggers this split; Nemotron-Nano's shorter template stays inline.

The old code deleted chat_template.jinja before reading tokenizer_config.json, assuming the inline copy was always complete. For Qwen3 that meant the exported checkpoint shipped with an empty chat_template -- vLLM's apply_chat_template returned a prompt without the <|audio|> placeholder, which broke multimodal prompt replacement (Failed to apply prompt replacement for mm_items['audio'][0]).

Now read chat_template.jinja, inline it into tokenizer_config.json when non-empty, and only then delete the file. Nemotron's inline-only path is unchanged because .jinja doesn't get written for small templates.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: force tokenizer_class=PreTrainedTokenizerFast for vLLM compat

Newer NeMo containers (e.g. nemo-25.11-pytorch2.9-automodel-03apr26) wrap AutoTokenizer.from_pretrained(trust_remote_code=True) in a NeMo-internal TokenizersBackend class. save_pretrained then writes 'tokenizer_class: TokenizersBackend' to tokenizer_config.json -- not in HF transformers' registry, so vLLM's AutoTokenizer.from_pretrained crashes at server load:

    ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

The underlying tokenizer.json is a valid HF fast tokenizer regardless of which wrapper produced it; force the class name back to PreTrainedTokenizerFast so downstream HF-based loaders (including vLLM) can round-trip the config.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training

Qwen3's chat_template injects '<think>\n\n</think>\n\n' before assistant content when enable_thinking=False (the 'no reasoning' mode, which is what SpeechLM ASR wants). The old QwenPromptFormatter didn't include this prefix, so SpeechLM fine-tunes trained through it showed the model a turn shape that's different from Qwen3's pre-training convention.

Bake NO_THINK_PREFIX into both INFERENCE_PREFIX and (transitively via INFERENCE_PREFIX) the assistant template, so future fine-tunes produce training data byte-identical to 'apply_chat_template(enable_thinking=False)'. Existing checkpoints are unaffected -- the change only kicks in the next time you retrain with prompt_format=qwen.

Test: updated hardcoded expected strings to match Qwen3 jinja output for single-turn training and inference.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: read audio_locator_tag from model config as SoT

Removes the hardcoded _AUDIO_TOKEN constant and reads the audio placeholder from model_cfg["audio_locator_tag"], raising ValueError if missing. This ensures the exported config.json, added tokenizer symbols, and extra_special_tokens dict all reference the same source of truth as training, avoiding silent drift between train-time and inference-time audio tokens.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* NemotronNanoV3PromptFormatter: fix past-asst-no-think + test coverage

Before this fix, the formatter only normalized past assistant turns that already contained <think>...</think> tags, so a past assistant turn without any think tags would emit as "<|im_start|>assistant\nTEST<|im_end|>\n" while the HF jinja template emits "<|im_start|>assistant\n<think></think>TEST<|im_end|>\n" (jinja unconditionally injects an empty think block for content lacking both tags). This caused a silent train/inference-template divergence for multi-turn dialogs without reasoning history.

Fix step 3 in encode_dialog to handle all three cases symmetrically:
- both tags present -> truncate to "<think></think>" + post-</think> content
- neither tag present -> prepend "<think></think>"
- only one tag present -> leave as-is (matches jinja)

Also adds three tests to fill previously-missing coverage:
- training multi-turn with past assistant missing think tags (regression test for the fix above)
- inference multi-turn with enable_thinking=False
- inference multi-turn with enable_thinking=True

All 12 non-HF tests pass in the NeMo 25.11 container.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* to_hf.py: stop patching chat_template's enable_thinking default

The previous in-place string-replace flipped Nemotron's ``enable_thinking`` default from True to False so that default vLLM inference (with no ``chat_template_kwargs``) would match SpeechLM training rendering. This approach is fragile (silently no-ops if upstream changes the template) and surprising for downstream consumers of the exported checkpoint.

Serving callers should instead pass ``chat_template_kwargs={"enable_thinking": False}`` (or the OpenAI-API equivalent) at inference time to opt out of thinking. This keeps the exported chat_template byte-identical to the backbone's canonical HF template.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* tests: add unit tests for to_hf.py prepare_for_vllm

Covers the behavior introduced / changed in this PR:
* Error paths for missing pretrained_llm and missing audio_locator_tag
* config.json patching (model_type, architectures, audio_locator_tag SoT)
* Audio token registration (add_special_tokens called only when missing from the backbone vocab)
* tokenizer_config.json normalization (dict-form extra_special_tokens, forced tokenizer_class=PreTrainedTokenizerFast)
* chat_template.jinja rescue (inlined back into tokenizer_config.json and the separate .jinja file removed)
* chat_template is byte-identical after prep (regression guard for the removal of the enable_thinking default-flip)
* generation_config.json carries the tokenizer's eos_token_id

The script lives under examples/ and is loaded via importlib; AutoTokenizer and _detect_vllm_architecture are patched so the tests run fully offline. 9 tests pass in 0.35s in the NeMo 25.11 container.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* Apply isort and black reformatting

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Revert "QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training"

This reverts commit 0245881.

Piotr raised a concern that baking ``NO_THINK_PREFIX`` into ``QwenPromptFormatter`` changes the turn shape seen by in-flight fine-tunes like canary-qwen-2.5b, which were trained on the prior (no-think-prefix) formatter output. Re-rendering the same data through the updated formatter would silently shift the prompt distribution and break those checkpoints.

Reverting the bake keeps the ``qwen`` prompt format byte-identical to the version those checkpoints saw during training. Any future fine-tune that actually wants the ``<think></think>`` empty-reasoning prefix should use ``Qwen3PromptFormatter`` (already handles it via ``enable_thinking=False``) or explicitly include the prefix in training data, rather than flipping the default for all ``qwen`` consumers.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

---------

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
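The `chat_template.jinja` rescue and the `tokenizer_class` fix described in the commit message above amount to a small post-processing pass over the saved tokenizer files. The sketch below shows one way that pass might look. The function name `normalize_tokenizer_config` is hypothetical and the code is not copied from `to_hf.py`; only the file names, the "inline when non-empty, then delete" ordering, and the forced class name come from the commit message.

```python
import json
from pathlib import Path


def normalize_tokenizer_config(output_dir):
    """Inline chat_template.jinja and force an HF-registered tokenizer class."""
    out = Path(output_dir)
    cfg_path = out / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))

    # Read the separate template file BEFORE deleting it, and inline it only
    # when it actually has content.
    jinja_path = out / "chat_template.jinja"
    if jinja_path.exists():
        template = jinja_path.read_text(encoding="utf-8")
        if template.strip():
            cfg["chat_template"] = template
        jinja_path.unlink()

    # tokenizer.json is a valid fast tokenizer regardless of the wrapper that
    # saved it, so point the config at a class HF-based loaders know.
    cfg["tokenizer_class"] = "PreTrainedTokenizerFast"

    cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg
```

Doing the read before the delete is the whole point of the fix: with the order reversed, a template that lives only in the `.jinja` file is lost.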
Summary
- `vllm=True` flag to `to_hf.py` that produces vLLM-ready checkpoints in one step
- `prepare_for_vllm()` runs after checkpoint consolidation (rank 0 only) and handles architecture detection, chat template generation, tokenizer saving, and generation config
- Depends on the prompt-formatters PR (`to_jinja()`)

Changes

`examples/speechlm2/to_hf.py`
- `HfExportConfig.vllm: bool = False`: opt-in flag
- `_detect_vllm_architecture()`: determines hybrid vs standard from `pretrained_llm`
- `prepare_for_vllm(output_dir, model_cfg)`:
  - Patches `config.json` with `model_type: "nemo_speechlm"` and detected `architectures`
  - Uses `PromptFormatter.resolve(prompt_format).to_jinja()` for chat template
  - Saves tokenizer with `<|audio|>` token and chat template
  - Writes `generation_config.json` (EOS only; inference params are task-specific)
  - Raises `ValueError` for missing `prompt_format`/`pretrained_llm`

Architecture mapping
- Hybrid backbone: `NeMoSpeechLMHybridForConditionalGeneration`
- Standard backbone: `NeMoSpeechLMForConditionalGeneration`

Usage

    torchrun --nproc-per-node=8 --nnodes=4 \
        examples/speechlm2/to_hf.py \
        class_path=nemo.collections.speechlm2.models.salm_automodel.SALMAutomodel \
        ckpt_path=/path/to/checkpoint.ckpt \
        ckpt_config=/path/to/exp_config.yaml \
        output_dir=/path/to/hf_output \
        vllm=True

Test plan
- `to_hf.py vllm=True` on nano-v3-canary1bflash (32 GPUs, 5.5 min)
- `model_type`, `architectures`, chat template (1149 chars), `generation_config.json`
- `burst_eval_vllm.py` on produced checkpoint: WER matches baseline (6.31 overall)

Made with Cursor
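The manual checks in the test plan could be automated with a small script along these lines. This is a sketch under assumptions: `check_vllm_ready` is a hypothetical helper, and it assumes the chat template ends up in `tokenizer_config.json`, which the description implies but does not state outright. The expected `model_type` value is the one given in the Changes list.

```python
import json
from pathlib import Path


def _load_json(path):
    """Return parsed JSON, or None when the file does not exist."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None


def check_vllm_ready(output_dir):
    """Return a list of problems found in an exported checkpoint directory."""
    out = Path(output_dir)
    problems = []

    config = _load_json(out / "config.json")
    if config is None:
        problems.append("config.json is missing")
    else:
        if config.get("model_type") != "nemo_speechlm":
            problems.append("config.json: unexpected model_type")
        if not config.get("architectures"):
            problems.append("config.json: architectures is empty")

    tok_cfg = _load_json(out / "tokenizer_config.json")
    if tok_cfg is None or not tok_cfg.get("chat_template"):
        problems.append("tokenizer_config.json has no chat_template")

    gen_cfg = _load_json(out / "generation_config.json")
    if gen_cfg is None or "eos_token_id" not in gen_cfg:
        problems.append("generation_config.json has no eos_token_id")

    return problems
```

An empty return value means every field named in the test plan is present; it says nothing about WER, which still needs an evaluation run.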