Add vllm support for nemo speechlm by DongjiGao · Pull Request #15520 · NVIDIA-NeMo/NeMo

DongjiGao · 2026-03-18T19:19:33Z

What does this PR do?

Add a vLLM plugin that enables fast inference serving for NeMo Speech LM models (speech encoder + projection + LLM backbone) using vLLM's PagedAttention and continuous batching engine.

Collection: speechlm2

Change log

Add nemo/collections/speechlm2/vllm/nemotron_v3/ plugin package:
- config.py: NeMoSpeechLMConfig wrapping LLM backbone's text config with NeMo speech fields
- model.py: NeMoSpeechLMForConditionalGeneration combining NeMo AudioPerceptionModule with vLLM-native LLM
- __init__.py: plugin registration via vllm.general_plugins entry point
Add vllm.general_plugins entry point in pyproject.toml
Add unit tests in tests/collections/speechlm2/test_vllm_plugin.py (10 tests, including GPU forward pass)

Usage

import os
os.environ["VLLM_PLUGINS"] = "nemo_speechlm"

from vllm import LLM, SamplingParams
import soundfile as sf

llm = LLM(
    model="/path/to/nemo-speech-checkpoint",
    tokenizer="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    hf_overrides={
        "architectures": ["NeMoSpeechLMForConditionalGeneration"],
        "model_type": "nemo_speechlm",
    },
    trust_remote_code=True,
    dtype="bfloat16",
    enforce_eager=True,
    limit_mm_per_prompt={"audio": 1},
)

tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following: <|audio|>"}],
    tokenize=False, add_generation_prompt=True,
)

audio, sr = sf.read("audio.wav", dtype="float32")
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"audio": (audio, sr)}}],
    SamplingParams(max_tokens=512, temperature=0.0),
)
print(outputs[0].outputs[0].text)

Evaluation Results

Open ASR Leaderboard, Nemotron-Nano-v3 + Canary-v2 (30B MoE), A100-80GB. WER is computed at the corpus level with jiwer and the Whisper EnglishTextNormalizer (hf_leaderboard mode), the same method used for NeMo checkpoint evaluation.

WER Comparison (NeMo checkpoint vs vLLM, BS=32, 32 GPUs)

Dataset	Samples	NeMo WER (%)	vLLM WER (%)	Delta
librispeech_clean	2,620	1.91	1.92	+0.01
librispeech_other	2,939	3.53	3.51	-0.02
tedlium	1,155	3.70	3.64	-0.06
spgispeech	39,341	2.20	2.21	+0.01
voxpopuli	1,842	6.29	6.27	-0.02
gigaspeech	19,898	9.95	9.96	+0.01
earnings22	2,737	10.93	11.01	+0.08
ami	11,653	11.65	11.65	+0.00

All deltas are within 0.1% absolute WER. NeMo WER recomputed with the identical method matches the reported checkpoint summary exactly.
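
For readers unfamiliar with corpus-level WER: errors are pooled over all utterances before dividing by the total number of reference words, which differs from averaging per-utterance WER. The sketch below illustrates the difference in plain Python. It is not jiwer's implementation and applies no text normalization; the function names are illustrative only.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Total word errors divided by total reference words, pooled over utterances."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def mean_utterance_wer(refs: list[str], hyps: list[str]) -> float:
    """Unweighted mean of per-utterance WER, shown only for contrast."""
    rates = [edit_distance(r.split(), h.split()) / len(r.split()) for r, h in zip(refs, hyps)]
    return sum(rates) / len(rates)


refs = ["the cat sat on the mat", "hello"]
hyps = ["the cat sat on a mat", "yellow"]
print(f"corpus WER: {corpus_wer(refs, hyps):.4f}")                   # 2 errors / 7 words
print(f"mean utterance WER: {mean_utterance_wer(refs, hyps):.4f}")   # (1/6 + 1/1) / 2
```

Short utterances dominate the per-utterance mean, while the corpus-level figure weights every reference word equally.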

Throughput (WER / RTFx per batch size, 32 GPUs)

Dataset	BS=8	BS=16	BS=32
librispeech_clean	1.91% / 48x	1.93% / 107x	1.92% / 227x
librispeech_other	3.53% / 41x	3.53% / 92x	3.51% / 197x
tedlium	3.64% / 64x	3.64% / 136x	3.64% / 277x
spgispeech	2.21% / 12x	2.22% / 36x	2.21% / 98x
voxpopuli	6.32% / 67x	6.28% / 147x	6.27% / 306x
gigaspeech	9.95% / 14x	9.97% / 38x	9.96% / 98x
earnings22	11.03% / 46x	10.98% / 102x	11.01% / 215x
ami	11.62% / 12x	11.74% / 29x	11.65% / 66x

WER is consistent across batch sizes. BS=32 is recommended for maximum throughput.
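
RTFx (inverse real-time factor) is the total audio duration divided by the wall-clock time spent transcribing it, so higher is faster. A minimal sketch of that arithmetic follows. The function name and the sample numbers are illustrative, and how the timings in the table were aggregated across the 32 GPUs is not stated here, so this shows only the basic ratio.

```python
def rtfx(audio_seconds: list[float], wall_clock_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall-clock time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return sum(audio_seconds) / wall_clock_seconds


# Illustrative numbers only: 60 s of audio transcribed in 0.5 s of wall-clock time.
print(rtfx([10.0, 20.0, 30.0], 0.5))
```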

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Additional Information

Validated on full Open ASR Leaderboard (8 datasets, ~82K samples)
Tested with NemotronH-30B backbone (Nemotron-Nano-v3 + Canary-v2)
Requires vLLM >= 0.17.1 and NeMo toolkit with speechlm2 collection
Multi-GPU scaling via horizontal instances (ns generate --num_chunks N), not DP/EP
Related Skills backend PR: Add vLLM ASR backend for multimodal speech recognition Skills#1308
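
Horizontal scaling means each instance processes its own slice of the dataset independently. The sketch below shows one plausible way to split a manifest into N contiguous chunks. It is an assumption for illustration: `split_into_chunks` is a hypothetical helper, and the actual splitting done by `ns generate --num_chunks N` may differ.

```python
def split_into_chunks(items: list, num_chunks: int) -> list[list]:
    """Split items into num_chunks contiguous parts whose sizes differ by at most one."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical manifest of 10 sample IDs spread over 3 independent instances.
for idx, chunk in enumerate(split_into_chunks(list(range(10)), 3)):
    print(idx, chunk)
```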

Add vLLM plugin that enables fast inference for NeMo Speech LM models (encoder + projection + LLM) via vLLM's PagedAttention and continuous batching engine.

Files:
- nemo/collections/speechlm2/vllm/__init__.py: package marker
- nemo/collections/speechlm2/vllm/nemotron_v3/__init__.py: plugin registration (config, model, NemotronH patch)
- nemo/collections/speechlm2/vllm/nemotron_v3/config.py: NeMoSpeechLMConfig (wraps text_config from LLM backbone)
- nemo/collections/speechlm2/vllm/nemotron_v3/model.py: NeMoSpeechLMForConditionalGeneration (NeMo encoder + vLLM LLM)
- pyproject.toml: register vllm.general_plugins entry point

Validated on Open ASR Leaderboard (8 datasets, 82K samples): WER matches NeMo checkpoint within 0.1%.

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>

- Add Apache 2.0 license headers to all plugin files
- Add docstrings to all public classes and register()
- Remove unused imports (nullcontext, init_logger)
- Run black (line_length=119) and isort formatting

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>

Tests config creation, special token handling, plugin registration, and audio encoder forward pass with dummy audio. No model weights required; GPU needed only for the perception forward test. All 10 tests pass.

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>

DongjiGao · 2026-03-19T18:03:34Z

@pzelasko

nemo/collections/speechlm2/vllm/nemotron_v3/__init__.py

+            raise AttributeError(name)
+
+        NHConfigCls.__getattr__ = _patched_getattr
+    except Exception:


tests/collections/speechlm2/test_vllm_plugin.py

+    _HAS_CONFIG = False
+
+try:
+    import vllm  # noqa: F401


Dongji Gao added 3 commits March 17, 2026 17:24

DongjiGao changed the title ~~Vllm nemo speechlm~~ Add vllm support for nemo speechlm Mar 18, 2026

github-actions bot added the community-request label Mar 19, 2026

github-advanced-security bot found potential problems Mar 19, 2026

DongjiGao wants to merge 3 commits into NVIDIA-NeMo:main from DongjiGao:vllm-nemo-speechlm


2 participants
