
to_hf.py: add --vllm flag for vLLM-ready checkpoint export #15617

Closed
DongjiGao wants to merge 1 commit into NVIDIA-NeMo:speechlm2-with-nemo-automodel-merge from DongjiGao:pr/to-hf-vllm


@DongjiGao
Contributor

Summary

Changes

examples/speechlm2/to_hf.py

  • HfExportConfig.vllm: bool = False — opt-in flag
  • _detect_vllm_architecture() — determines hybrid vs standard from pretrained_llm
  • prepare_for_vllm(output_dir, model_cfg):
    1. Patches config.json with model_type: "nemo_speechlm" and detected architectures
    2. Calls PromptFormatter.resolve(prompt_format).to_jinja() for chat template
    3. Saves tokenizer with <|audio|> token and chat template
    4. Writes minimal generation_config.json (EOS only — inference params are task-specific)
  • Raises ValueError for missing prompt_format/pretrained_llm
  • Warns when overwriting existing files
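Step 1 of prepare_for_vllm can be pictured with a small sketch. This is not the code from the PR: the helper name `patch_config_for_vllm` is hypothetical, and it assumes config.json already exists in the output directory. Only the two keys named in the description are written; every other key is preserved.

```python
import json
from pathlib import Path


def patch_config_for_vllm(output_dir: str, architecture: str) -> dict:
    """Add the identification fields named in the PR description to config.json."""
    config_path = Path(output_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # model_type value is taken from the PR description; other keys stay untouched.
    config["model_type"] = "nemo_speechlm"
    config["architectures"] = [architecture]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```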

Architecture mapping

| LLM backbone | Architecture name |
|---|---|
| Hybrid (NemotronH) | NeMoSpeechLMHybridForConditionalGeneration |
| Standard (Qwen3, etc.) | NeMoSpeechLMForConditionalGeneration |
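The mapping above could be expressed roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and detecting a hybrid backbone by the substring "NemotronH" in the backbone's `architectures` entries is a guess at the rule, not something the description states. Raising on an empty list is also a choice made for this sketch.

```python
HYBRID_ARCH = "NeMoSpeechLMHybridForConditionalGeneration"
STANDARD_ARCH = "NeMoSpeechLMForConditionalGeneration"


def pick_vllm_architecture(backbone_archs: list[str]) -> str:
    """Map a backbone's `architectures` list to one of the two plugin class names."""
    if not backbone_archs:
        raise ValueError("backbone config lists no architectures")
    # Assumption: hybrid backbones carry "NemotronH" in their class name.
    if any("NemotronH" in name for name in backbone_archs):
        return HYBRID_ARCH
    return STANDARD_ARCH
```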

Usage

torchrun --nproc-per-node=8 --nnodes=4 \
    examples/speechlm2/to_hf.py \
    class_path=nemo.collections.speechlm2.models.salm_automodel.SALMAutomodel \
    ckpt_path=/path/to/checkpoint.ckpt \
    ckpt_config=/path/to/exp_config.yaml \
    output_dir=/path/to/hf_output \
    vllm=True

Test plan

  • End-to-end: to_hf.py vllm=True on nano-v3-canary1bflash (32 GPUs, 5.5 min)
  • Output verified: correct model_type, architectures, chat template (1149 chars), generation_config.json
  • Eval: burst_eval_vllm.py on produced checkpoint — WER matches baseline (6.31 overall)
  • CI: add "Run CICD" label

Made with Cursor

When vllm=True, prepare_for_vllm() runs after checkpoint consolidation:
- Detects hybrid vs standard architecture from pretrained_llm
- Patches config.json with model_type and architectures
- Generates chat template via PromptFormatter.to_jinja()
- Saves tokenizer with <|audio|> token and chat template
- Writes generation_config.json with eos_token_id only
  (inference params like temperature are task-specific)
- Raises explicit errors for missing prompt_format/pretrained_llm
- Warns when overwriting existing files
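As an illustration of the last three bullets, a minimal generation_config.json writer might look like this. The helper name is hypothetical and the overwrite warning uses the standard `warnings` module, which may differ from the logging used in the actual script.

```python
import json
import warnings
from pathlib import Path


def write_minimal_generation_config(output_dir: str, eos_token_id: int) -> Path:
    """Write a generation_config.json that records only the EOS token id."""
    path = Path(output_dir) / "generation_config.json"
    if path.exists():
        warnings.warn(f"Overwriting existing {path}")
    # Sampling parameters (temperature etc.) are deliberately left out.
    path.write_text(json.dumps({"eos_token_id": eos_token_id}, indent=2) + "\n")
    return path
```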

Depends on: prompt-formatters PR (to_jinja), vllm-plugin PR (architecture names)

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

llm_cfg = AutoConfig.from_pretrained(pretrained_llm, trust_remote_code=True)
archs = getattr(llm_cfg, "architectures", [])
except Exception:
Collaborator

Broad exception clause, what errors are you expecting here?
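One possible way to narrow the clause is sketched below. It assumes the failures of interest are OSError (path or repo cannot be resolved) and ValueError (config type not recognized); that choice is an assumption, not something confirmed in this thread. `load_config` is a hypothetical injection point standing in for `AutoConfig.from_pretrained(name, trust_remote_code=True)`, which keeps the sketch runnable without transformers installed.

```python
from typing import Any, Callable


def read_backbone_architectures(
    pretrained_llm: str, load_config: Callable[[str], Any]
) -> list[str]:
    """Return the backbone's architectures, failing loudly on expected errors."""
    try:
        llm_cfg = load_config(pretrained_llm)
    except (OSError, ValueError) as err:
        # Only the anticipated lookup failures are translated; anything
        # else (e.g. a programming error) propagates unchanged.
        raise ValueError(f"Could not load HF config for {pretrained_llm!r}") from err
    archs = getattr(llm_cfg, "architectures", None) or []
    if not archs:
        raise ValueError(f"HF config for {pretrained_llm!r} lists no architectures")
    return list(archs)
```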


# If True, patch the checkpoint for vLLM inference (add tokenizer, chat template,
# model_type/architectures, generation_config). Requires HuggingFace transformers.
vllm: bool = False
Collaborator

Remove this flag and always output converted checkpoints that work with both HF and vLLM engines

@pzelasko pzelasko deleted the branch NVIDIA-NeMo:speechlm2-with-nemo-automodel-merge April 17, 2026 14:08
@pzelasko pzelasko closed this Apr 17, 2026
DongjiGao added a commit to DongjiGao/NeMo that referenced this pull request Apr 18, 2026
`to_hf.py` now always emits a checkpoint that can be served by vLLM's
SpeechLM plugin: architecture/model_type fields in config.json, the
backbone's canonical chat_template in tokenizer_config.json, audio
placeholder registered on the tokenizer, and a minimal generation
config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM
loaders now share the same on-disk artifact (pzelasko's ask on the
closed PR NVIDIA-NeMo#15617 thread).

Key bits:
- `_detect_vllm_architecture` inspects the backbone's HF config to
  pick the right vLLM plugin class. Fail-fast ValueError on missing
  `architectures` rather than silently defaulting to 'Std' (also
  addresses pzelasko's review comment about the broad except).
- `prepare_for_vllm` is invoked unconditionally after save, wrapped in
  `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning
  so legacy callers that never needed vLLM (e.g., NeMo SALM loading
  the same dir) still get a clean HF-only checkpoint.
- Tokenizer is re-saved from the backbone (brings its native
  chat_template along) + augmented with `<|audio|>`; extra_special_tokens
  is normalized to a dict so vLLM's AutoTokenizer can load it.
- For reasoning backbones (nemotron-nano-v3), the exported
  chat_template's `enable_thinking` default is flipped to False so
  vLLM's request-time render matches training-time render; otherwise
  vLLM silently prepends `<think>\n` to every assistant turn and WER
  regresses. Verified librispeech-pc WER 1.57 (== baseline) after this
  fix; without it WER regressed to 5.92.
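The "downgrade a ValueError to a warning" behaviour described for `_try_prepare_for_vllm` can be sketched like this. The signature is hypothetical (the real helper presumably takes the output directory and model config); here the preparation step is passed in as a callable so the sketch stays self-contained.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def try_prepare_for_vllm(prepare: Callable[[], None]) -> bool:
    """Run the vLLM preparation step; return False instead of raising ValueError."""
    try:
        prepare()
    except ValueError as err:
        # The HF checkpoint written earlier stays usable; only vLLM metadata is missing.
        logger.warning("Skipping vLLM preparation, checkpoint stays HF-only: %s", err)
        return False
    return True
```

Only ValueError is downgraded in this sketch, so unexpected failures such as I/O errors still surface to the caller.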

hf_hub.py: setdefault `model_type` and `architectures` in
`HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry
the metadata vLLM / transformers need to identify them.
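The point of using setdefault is that values already present in the saved config win. A minimal sketch, with a hypothetical helper name and with the default values passed in as parameters because the commit message does not spell them out:

```python
def add_default_metadata(config: dict, model_type: str, architectures: list[str]) -> dict:
    """Fill in identification metadata without overwriting existing values."""
    config.setdefault("model_type", model_type)
    config.setdefault("architectures", list(architectures))
    return config
```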

Made-with: Cursor
DongjiGao added a commit to DongjiGao/NeMo that referenced this pull request Apr 20, 2026
DongjiGao added a commit to DongjiGao/NeMo that referenced this pull request Apr 23, 2026
DongjiGao added a commit to DongjiGao/NeMo that referenced this pull request Apr 23, 2026
DongjiGao added a commit to DongjiGao/NeMo that referenced this pull request Apr 23, 2026
pzelasko pushed a commit that referenced this pull request Apr 24, 2026
…ith backbone-native chat_template (#15623)

* to_hf.py: unconditionally produce vLLM-ready HF checkpoints

`to_hf.py` now always emits a checkpoint that can be served by vLLM's
SpeechLM plugin: architecture/model_type fields in config.json, the
backbone's canonical chat_template in tokenizer_config.json, audio
placeholder registered on the tokenizer, and a minimal generation
config. The previous `vllm: bool` opt-in flag is gone -- HF and vLLM
loaders now share the same on-disk artifact (pzelasko's ask on the
closed PR #15617 thread).

Key bits:
- `_detect_vllm_architecture` inspects the backbone's HF config to
  pick the right vLLM plugin class. Fail-fast ValueError on missing
  `architectures` rather than silently defaulting to 'Std' (also
  addresses pzelasko's review comment about the broad except).
- `prepare_for_vllm` is invoked unconditionally after save, wrapped in
  `_try_prepare_for_vllm` which downgrades a `ValueError` to a warning
  so legacy callers that never needed vLLM (e.g., NeMo SALM loading
  the same dir) still get a clean HF-only checkpoint.
- Tokenizer is re-saved from the backbone (brings its native
  chat_template along) + augmented with `<|audio|>`; extra_special_tokens
  is normalized to a dict so vLLM's AutoTokenizer can load it.
- For reasoning backbones (nemotron-nano-v3), the exported
  chat_template's `enable_thinking` default is flipped to False so
  vLLM's request-time render matches training-time render; otherwise
  vLLM silently prepends `<think>\n` to every assistant turn and WER
  regresses. Verified librispeech-pc WER 1.57 (== baseline) after this
  fix; without it WER regressed to 5.92.

hf_hub.py: setdefault `model_type` and `architectures` in
`HFHubMixin.save_pretrained` so NeMo-saved SpeechLM checkpoints carry
the metadata vLLM / transformers need to identify them.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: add missing return-type hints per NeMo PR checklist

Five top-level helpers (load_checkpoint, setup_distributed_from_config,
consolidate_state_dict, save_hf_checkpoint, main) lacked return types.
Uses Any for setup_distributed_from_config's AutomodelParallelStrategy
return to avoid adding an import just for typing; concrete types
everywhere else. Same fix previously applied on pr/vllm-plugin.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: rescue chat_template.jinja before deleting it

Modern HuggingFace transformers (~4.42+) moves long chat_template
strings out of tokenizer_config.json into a separate
chat_template.jinja file to keep the JSON readable. Qwen3-1.7B's
4168-char template triggers this split; Nemotron-Nano's shorter
template stays inline.

The old code deleted chat_template.jinja before reading
tokenizer_config.json, assuming the inline copy was always complete.
For Qwen3 that meant the exported checkpoint shipped with an empty
chat_template -- vLLM's apply_chat_template returned a prompt
without the <|audio|> placeholder, which broke multimodal prompt
replacement (Failed to apply prompt replacement for mm_items['audio'][0]).

Now read chat_template.jinja, inline it into tokenizer_config.json
when non-empty, and only then delete the file. Nemotron's inline-only
path is unchanged because .jinja doesn't get written for small
templates.
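The read-inline-then-delete order described above could look roughly like this. The helper name is hypothetical and the sketch works on plain files only; it assumes tokenizer_config.json is present whenever chat_template.jinja is.

```python
import json
from pathlib import Path


def inline_chat_template(output_dir: str) -> bool:
    """Move chat_template.jinja back into tokenizer_config.json; return True if inlined."""
    out = Path(output_dir)
    jinja_path = out / "chat_template.jinja"
    config_path = out / "tokenizer_config.json"
    if not jinja_path.exists():
        return False  # small templates stay inline, nothing to rescue
    template = jinja_path.read_text(encoding="utf-8")
    inlined = False
    if template.strip():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        config["chat_template"] = template
        config_path.write_text(
            json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
        )
        inlined = True
    # Delete only after the content has been read (and inlined when non-empty).
    jinja_path.unlink()
    return inlined
```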

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: force tokenizer_class=PreTrainedTokenizerFast for vLLM compat

Newer NeMo containers (e.g. nemo-25.11-pytorch2.9-automodel-03apr26)
wrap AutoTokenizer.from_pretrained(trust_remote_code=True) in a
NeMo-internal TokenizersBackend class. save_pretrained then writes
'tokenizer_class: TokenizersBackend' to tokenizer_config.json --
not in HF transformers' registry, so vLLM's
AutoTokenizer.from_pretrained crashes at server load:

  ValueError: Tokenizer class TokenizersBackend does not exist or
  is not currently imported.

The underlying tokenizer.json is a valid HF fast tokenizer regardless
of which wrapper produced it; force the class name back to
PreTrainedTokenizerFast so downstream HF-based loaders (including
vLLM) can round-trip the config.
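A sketch of the class-name fix, operating directly on the saved tokenizer_config.json. The helper name is hypothetical; returning the previous value is a convenience added here so a caller can log what was replaced.

```python
import json
from pathlib import Path
from typing import Optional


def force_fast_tokenizer_class(output_dir: str) -> Optional[str]:
    """Set tokenizer_class to PreTrainedTokenizerFast; return the previous value."""
    config_path = Path(output_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("tokenizer_class")
    # tokenizer.json itself is untouched; only the wrapper class name changes.
    config["tokenizer_class"] = "PreTrainedTokenizerFast"
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return previous
```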

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training

Qwen3's chat_template injects '<think>\n\n</think>\n\n' before assistant
content when enable_thinking=False (the 'no reasoning' mode, which is
what SpeechLM ASR wants). The old QwenPromptFormatter didn't include
this prefix, so SpeechLM fine-tunes trained through it showed the model
a turn shape that's different from Qwen3's pre-training convention.

Bake NO_THINK_PREFIX into both INFERENCE_PREFIX and (transitively via
INFERENCE_PREFIX) the assistant template, so future fine-tunes produce
training data byte-identical to 'apply_chat_template(enable_thinking=
False)'. Existing checkpoints are unaffected -- the change only kicks
in the next time you retrain with prompt_format=qwen.

Test: updated hardcoded expected strings to match Qwen3 jinja output
for single-turn training and inference.

Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* to_hf.py: read audio_locator_tag from model config as SoT

Removes the hardcoded _AUDIO_TOKEN constant and reads the audio placeholder
from model_cfg["audio_locator_tag"], raising ValueError if missing. This
ensures the exported config.json, added tokenizer symbols, and
extra_special_tokens dict all reference the same source of truth as training,
avoiding silent drift between train-time and inference-time audio tokens.
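The single-source-of-truth lookup can be sketched as below. The function name is hypothetical; the key name and the ValueError on a missing value come from the commit message, while treating an empty string as missing is an assumption of this sketch.

```python
from typing import Any, Mapping


def get_audio_locator_tag(model_cfg: Mapping[str, Any]) -> str:
    """Read the audio placeholder from the model config, failing if it is absent."""
    tag = model_cfg.get("audio_locator_tag")
    if not tag:
        raise ValueError("model config is missing 'audio_locator_tag'")
    return tag
```

Every consumer (config.json, the tokenizer's added tokens, extra_special_tokens) would then use the returned value rather than a module-level constant.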

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* NemotronNanoV3PromptFormatter: fix past-asst-no-think + test coverage

Before this fix, the formatter only normalized past assistant turns that
already contained <think>...</think> tags, so a past assistant turn without
any think tags would emit as "<|im_start|>assistant\nTEST<|im_end|>\n" while
the HF jinja template emits "<|im_start|>assistant\n<think></think>TEST<|im_end|>\n"
(jinja unconditionally injects an empty think block for content lacking both
tags). This caused a silent train/inference-template divergence for
multi-turn dialogs without reasoning history.

Fix step 3 in encode_dialog to handle all three cases symmetrically:
  - both tags present -> truncate to "<think></think>" + post-</think> content
  - neither tag present -> prepend "<think></think>"
  - only one tag present -> leave as-is (matches jinja)
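The three cases can be written out as a small standalone function. This is an illustration, not the formatter's code: the function name is hypothetical, and splitting at the first `</think>` (rather than the last) and leaving whitespace untouched are assumptions.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
EMPTY_THINK = THINK_OPEN + THINK_CLOSE


def normalize_past_assistant(content: str) -> str:
    """Normalize the reasoning block of a past assistant turn."""
    has_open = THINK_OPEN in content
    has_close = THINK_CLOSE in content
    if has_open and has_close:
        # Drop the reasoning text, keep whatever follows the closing tag.
        return EMPTY_THINK + content.split(THINK_CLOSE, 1)[1]
    if not has_open and not has_close:
        # No reasoning history: inject an empty think block.
        return EMPTY_THINK + content
    # Exactly one tag present: leave the content as-is.
    return content
```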

Also adds three tests to fill previously-missing coverage:
  - training multi-turn with past assistant missing think tags
    (regression test for the fix above)
  - inference multi-turn with enable_thinking=False
  - inference multi-turn with enable_thinking=True

All 12 non-HF tests pass in the NeMo 25.11 container.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* to_hf.py: stop patching chat_template's enable_thinking default

The previous in-place string-replace flipped Nemotron's
``enable_thinking`` default from True to False so that default vLLM
inference (with no ``chat_template_kwargs``) would match SpeechLM
training rendering. This approach is fragile (silently no-ops if
upstream changes the template) and surprising for downstream consumers
of the exported checkpoint.

Serving callers should instead pass ``chat_template_kwargs={"enable_thinking": False}``
(or the OpenAI-API equivalent) at inference time to opt out of thinking.
This keeps the exported chat_template byte-identical to the backbone's
canonical HF template.
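For callers of a vLLM OpenAI-compatible server, the opt-out might be passed as in the sketch below. It only builds the request body and sends nothing; the helper name and the model name in the usage note are hypothetical. With the `openai` Python client the same field would typically be supplied through `extra_body`.

```python
import json


def build_chat_request(model: str, messages: list[dict], enable_thinking: bool = False) -> dict:
    """Build a chat-completions request body that sets enable_thinking explicitly."""
    return {
        "model": model,
        "messages": messages,
        # Forwarded to the chat template at render time by the server.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def to_json(body: dict) -> str:
    """Serialize the request body for an HTTP POST."""
    return json.dumps(body)
```

For example, `build_chat_request("my-speechlm", [{"role": "user", "content": "Transcribe the audio."}])` yields a body with `enable_thinking` set to False.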

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* tests: add unit tests for to_hf.py prepare_for_vllm

Covers the behavior introduced / changed in this PR:
  * Error paths for missing pretrained_llm and missing audio_locator_tag
  * config.json patching (model_type, architectures, audio_locator_tag SoT)
  * Audio token registration (add_special_tokens called only when missing
    from the backbone vocab)
  * tokenizer_config.json normalization (dict-form extra_special_tokens,
    forced tokenizer_class=PreTrainedTokenizerFast)
  * chat_template.jinja rescue (inlined back into tokenizer_config.json
    and the separate .jinja file removed)
  * chat_template is byte-identical after prep (regression guard for the
    removal of the enable_thinking default-flip)
  * generation_config.json carries the tokenizer's eos_token_id

The script lives under examples/ and is loaded via importlib; AutoTokenizer
and _detect_vllm_architecture are patched so the tests run fully offline.
9 tests pass in 0.35s in the NeMo 25.11 container.
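Loading a script that is not part of an importable package can be done with importlib, roughly as follows. The helper name and the default module name are hypothetical; the path must end in `.py` for the loader to be inferred.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_script_as_module(path: str, name: str = "to_hf_under_test") -> ModuleType:
    """Import a standalone .py script by file path and return it as a module."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the script's top-level code, so it should be import-safe.
    spec.loader.exec_module(module)
    return module
```

Once loaded, network-dependent helpers can be replaced with `unittest.mock.patch.object(module, "_detect_vllm_architecture", return_value=...)` so the tests never reach the Hub.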

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

* Apply isort and black reformatting

Signed-off-by: Dongji Gao <dongjig@nvidia.com>

* Revert "QwenPromptFormatter: align with Qwen3 enable_thinking=False pre-training"

This reverts commit 0245881.

Piotr raised a concern that baking ``NO_THINK_PREFIX`` into
``QwenPromptFormatter`` changes the turn shape seen by in-flight fine-tunes
like canary-qwen-2.5b, which were trained on the prior (no-think-prefix)
formatter output. Re-rendering the same data through the updated formatter
would silently shift the prompt distribution and break those checkpoints.

Reverting the bake keeps the ``qwen`` prompt format byte-identical to the
version those checkpoints saw during training. Any future fine-tune that
actually wants the ``<think></think>`` empty-reasoning prefix should use
``Qwen3PromptFormatter`` (already handles it via ``enable_thinking=False``)
or explicitly include the prefix in training data, rather than flipping the
default for all ``qwen`` consumers.

Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor

---------

Signed-off-by: Dongji Gao <dongjig@nvidia.com>