Add vllm support for nemo speechlm#15520

Open
DongjiGao wants to merge 3 commits into NVIDIA-NeMo:main from DongjiGao:vllm-nemo-speechlm

Conversation


@DongjiGao DongjiGao commented Mar 18, 2026

What does this PR do?

Add a vLLM plugin that enables fast inference serving for NeMo Speech LM models (speech encoder + projection + LLM backbone) using vLLM's PagedAttention and continuous batching engine.

Collection: speechlm2

Change log

  • Add nemo/collections/speechlm2/vllm/nemotron_v3/ plugin package:
    • config.py: NeMoSpeechLMConfig wrapping the LLM backbone's text config with NeMo speech fields
    • model.py: NeMoSpeechLMForConditionalGeneration combining NeMo AudioPerceptionModule with vLLM-native LLM
    • __init__.py: plugin registration via vllm.general_plugins entry point
  • Add vllm.general_plugins entry point in pyproject.toml
  • Add unit tests in tests/collections/speechlm2/test_vllm_plugin.py (10 tests, including GPU forward pass)
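For reference, the vllm.general_plugins entry point in pyproject.toml can look like the sketch below; the exact target module path and function name are assumptions based on the file list above (register() is the documented plugin hook):

```toml
[project.entry-points."vllm.general_plugins"]
# vLLM discovers this entry point and calls register() at startup
# when VLLM_PLUGINS includes "nemo_speechlm".
nemo_speechlm = "nemo.collections.speechlm2.vllm.nemotron_v3:register"
```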

Usage

import os
# Must be set before importing vllm so that only this plugin is loaded.
os.environ["VLLM_PLUGINS"] = "nemo_speechlm"

from vllm import LLM, SamplingParams
import soundfile as sf

llm = LLM(
    model="/path/to/nemo-speech-checkpoint",
    tokenizer="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    hf_overrides={
        "architectures": ["NeMoSpeechLMForConditionalGeneration"],
        "model_type": "nemo_speechlm",
    },
    trust_remote_code=True,
    dtype="bfloat16",
    enforce_eager=True,
    limit_mm_per_prompt={"audio": 1},
)

tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following: <|audio|>"}],
    tokenize=False, add_generation_prompt=True,
)

audio, sr = sf.read("audio.wav", dtype="float32")
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"audio": (audio, sr)}}],
    SamplingParams(max_tokens=512, temperature=0.0),
)
print(outputs[0].outputs[0].text)

Evaluation Results

Open ASR Leaderboard, Nemotron-Nano-v3 + Canary-v2 (30B MoE), A100-80GB. WER computed at corpus level with jiwer + whisper EnglishTextNormalizer (hf_leaderboard mode), same method as NeMo checkpoint evaluation.

WER Comparison (NeMo checkpoint vs vLLM, BS=32, 32 GPUs)

| Dataset | Samples | NeMo WER (%) | vLLM WER (%) | Delta |
|---|---:|---:|---:|---:|
| librispeech_clean | 2,620 | 1.91 | 1.92 | +0.01 |
| librispeech_other | 2,939 | 3.53 | 3.51 | -0.02 |
| tedlium | 1,155 | 3.70 | 3.64 | -0.06 |
| spgispeech | 39,341 | 2.20 | 2.21 | +0.01 |
| voxpopuli | 1,842 | 6.29 | 6.27 | -0.02 |
| gigaspeech | 19,898 | 9.95 | 9.96 | +0.01 |
| earnings22 | 2,737 | 10.93 | 11.01 | +0.08 |
| ami | 11,653 | 11.65 | 11.65 | +0.00 |

All deltas are within 0.1% absolute WER. The NeMo WER, recomputed with the identical method, exactly matches the reported checkpoint summary.
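For context, corpus-level WER aggregates edit errors over all utterances before dividing by the total reference word count, rather than averaging per-utterance rates. The actual evaluation uses jiwer plus Whisper's EnglishTextNormalizer; the pure-Python sketch below illustrates only the aggregation:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]

def corpus_wer(refs, hyps):
    """Corpus-level WER (%): total edit errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words

# One substitution ("on" -> "in") over six reference words:
print(round(corpus_wer(["the cat sat", "on the mat"], ["the cat sat", "in the mat"]), 2))  # → 16.67
```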

Throughput (WER / RTFx per batch size, 32 GPUs)

| Dataset | BS=8 | BS=16 | BS=32 |
|---|---|---|---|
| librispeech_clean | 1.91% / 48x | 1.93% / 107x | 1.92% / 227x |
| librispeech_other | 3.53% / 41x | 3.53% / 92x | 3.51% / 197x |
| tedlium | 3.64% / 64x | 3.64% / 136x | 3.64% / 277x |
| spgispeech | 2.21% / 12x | 2.22% / 36x | 2.21% / 98x |
| voxpopuli | 6.32% / 67x | 6.28% / 147x | 6.27% / 306x |
| gigaspeech | 9.95% / 14x | 9.97% / 38x | 9.96% / 98x |
| earnings22 | 11.03% / 46x | 10.98% / 102x | 11.01% / 215x |
| ami | 11.62% / 12x | 11.74% / 29x | 11.65% / 66x |

WER is consistent across batch sizes. BS=32 recommended for maximum throughput.
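RTFx is the inverse real-time factor: seconds of audio transcribed per second of wall-clock time, so higher is faster. A minimal sketch (function and variable names are illustrative):

```python
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """Inverse real-time factor: audio seconds processed per wall-clock second."""
    return total_audio_seconds / wall_clock_seconds

# E.g., one hour of audio transcribed in ~16 s of wall-clock time:
print(round(rtfx(3600.0, 16.0)))  # → 225
```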

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Additional Information

  • Validated on full Open ASR Leaderboard (8 datasets, ~82K samples)
  • Tested with NemotronH-30B backbone (Nemotron-Nano-v3 + Canary-v2)
  • Requires vLLM >= 0.17.1 and NeMo toolkit with speechlm2 collection
  • Multi-GPU scaling via horizontal instances (ns generate --num_chunks N), not DP/EP
  • Related Skills backend PR: Add vLLM ASR backend for multimodal speech recognition Skills#1308
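Since vLLM is an optional dependency (see the import-guard pre-check above), the plugin's imports should be guarded so the package stays importable without vLLM installed. A typical pattern (the flag name is illustrative):

```python
# Guarded optional import: modules depending on vLLM check the flag
# instead of failing at import time when vLLM is absent.
try:
    import vllm  # noqa: F401
    HAVE_VLLM = True
except ImportError:
    HAVE_VLLM = False

print(type(HAVE_VLLM).__name__)  # → bool
```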

Dongji Gao added 3 commits March 17, 2026 17:24
Add vLLM plugin that enables fast inference for NeMo Speech LM models
(encoder + projection + LLM) via vLLM's PagedAttention and continuous
batching engine.

Files:
- nemo/collections/speechlm2/vllm/__init__.py: package marker
- nemo/collections/speechlm2/vllm/nemotron_v3/__init__.py: plugin
  registration (config, model, NemotronH patch)
- nemo/collections/speechlm2/vllm/nemotron_v3/config.py:
  NeMoSpeechLMConfig (wraps text_config from LLM backbone)
- nemo/collections/speechlm2/vllm/nemotron_v3/model.py:
  NeMoSpeechLMForConditionalGeneration (NeMo encoder + vLLM LLM)
- pyproject.toml: register vllm.general_plugins entry point

Validated on Open ASR Leaderboard (8 datasets, 82K samples):
WER matches NeMo checkpoint within 0.1%.

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
- Add Apache 2.0 license headers to all plugin files
- Add docstrings to all public classes and register()
- Remove unused imports (nullcontext, init_logger)
- Run black (line_length=119) and isort formatting

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
Tests config creation, special token handling, plugin registration,
and audio encoder forward pass with dummy audio. No model weights
required; GPU needed only for the perception forward test.

All 10 tests pass.

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
@DongjiGao DongjiGao changed the title from "Vllm nemo speechlm" to "Add vllm support for nemo speechlm" on Mar 18, 2026

@pzelasko

Code scanning / CodeQL — Empty except (Note), in the NemotronH config patch:

        raise AttributeError(name)

        NHConfigCls.__getattr__ = _patched_getattr
    except Exception:

"'except' clause does nothing but pass and there is no explanatory comment."

Code scanning / CodeQL — Unused import (Note, test):

    _HAS_CONFIG = False

    try:
        import vllm  # noqa: F401

"Import of 'vllm' is not used."
