Add vLLM ASR backend for multimodal speech recognition #1308
DongjiGao wants to merge 11 commits into NVIDIA-NeMo:main from
Conversation
Adds a new vllm_asr backend that uses vLLM with multimodal plugin
support for fast speech recognition inference:
- VLLMASRBackend with PagedAttention and continuous batching
- Supports Nemotron-Nano-v3 + Canary-v2 ASR model via plugin
- OpenAI-compatible /v1/chat/completions endpoint with audio input
- Strips <think></think> tags from NemotronH output
- Configurable prompt template, GPU memory, model length
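The `<think>`-tag stripping mentioned above can be sketched with a small regex helper. This is a hypothetical reconstruction; the backend's actual method name and edge-case handling may differ:

```python
import re

# Hypothetical helper mirroring the described behavior: NemotronH may emit
# reasoning wrapped in <think>...</think>; for ASR we keep only the transcript.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> spans, plus any unterminated trailing <think> block."""
    cleaned = _THINK_RE.sub("", text)
    # A truncated generation can leave an unclosed <think>; drop it too.
    cleaned = re.sub(r"<think>.*\Z", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()
```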
Usage: python -m nemo_skills.inference.server.serve_unified \
--backend vllm_asr \
--model /path/to/checkpoint \
--tokenizer nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Made-with: Cursor
- test_vllm_asr_librispeech.py: concurrent async client for evaluating ASR on Open ASR Leaderboard datasets via the unified server. Reports WER, RTFx, throughput, and latency. Saves results to JSON.
- run_all_eval.sh: runs all 8 Open ASR Leaderboard datasets end-to-end (starts server, evaluates, saves results, stops server)
Made-with: Cursor
Add <think> tag after assistant prompt to match the NemotronH chat template exactly. Verified on AMI: 11.64% WER matches checkpoint's 11.65% (was 11.44% without <think>). Made-with: Cursor
Remove experimental data_parallel/expert_parallel serve-mode code that proved impractical for the Nemotron-Nano MoE architecture due to synchronization overhead. The recommended scaling approach is horizontal: multiple independent single-GPU instances via ns generate --num_chunks N.
- Remove serve-mode subprocess, HTTP proxy, and concurrent request logic from vllm_asr_backend.py
- Add tensor_parallel_size support for multi-GPU sharding
- Add Prerequisites section documenting the nemotron_nano_asr vLLM plugin dependency, installation, and checkpoint requirements
- Add Multi-GPU scaling section with usage example and throughput benchmarks (32 GPUs, A100-80GB, Open ASR Leaderboard)
- Inline _generate_embedded into generate() for simplicity
- Register vllm_asr in __init__.py docstring and README.md
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
These scripts are cluster-specific and should not be in the repo:
- test_vllm_asr_librispeech.py
- run_all_eval.sh
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
📝 Walkthrough
Adds a new vLLM-backed NeMo Speech LM backend (audio→text) to the multimodal unified server, plus registry and README updates. Implements config, model loading, audio preprocessing, request validation, prompt construction, batch generation, and result postprocessing.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Backend as VLLMNeMoSpeechLMBackend
participant Validator as Validation
participant vLLM as vLLM Engine
participant NeMo as NeMo Speech LM
Client->>Backend: Submit GenerationRequest(s) with audio_bytes
Backend->>Validator: validate_request(request)
Validator-->>Backend: validation result
Backend->>Backend: _get_request_audio() / _audio_bytes_to_numpy()
Backend->>Backend: build prompt + inputs
Backend->>vLLM: generate(batch_inputs, SamplingParams)
vLLM->>NeMo: invoke NeMo plugin (audio + prompt)
NeMo-->>vLLM: generated text outputs
vLLM-->>Backend: batch generation results
Backend->>Backend: _strip_think_tags(), assemble GenerationResult
Backend-->>Client: return GenerationResult(s)
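The request flow in the diagram can be sketched as a batch loop. The types below are simplified stand-ins; the field names and the engine callable are assumptions based on the walkthrough, not the actual NeMo-Skills API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Stand-in request/result types (assumed shapes, not the real definitions).
@dataclass
class GenerationRequest:
    request_id: str
    audio_bytes: Optional[bytes] = None

@dataclass
class GenerationResult:
    request_id: str
    text: str = ""
    error: Optional[str] = None

def generate_batch(
    requests: List[GenerationRequest],
    run_engine: Callable[[List[bytes]], List[str]],
) -> List[GenerationResult]:
    """Validate each request, batch valid audio into one engine call, then
    scatter outputs back to their original positions."""
    results = [GenerationResult(r.request_id) for r in requests]
    valid_indices: List[int] = []
    batch_audio: List[bytes] = []
    for idx, req in enumerate(requests):
        if req.audio_bytes is None:
            results[idx].error = "backend requires audio input"
            continue
        valid_indices.append(idx)
        batch_audio.append(req.audio_bytes)
    if batch_audio:
        # One batched call stands in for vLLM's generate(batch_inputs, params).
        for idx, text in zip(valid_indices, run_engine(batch_audio)):
            results[idx].text = text
    return results
```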
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_asr_backend.py`:
- Around line 176-179: The code currently freezes SamplingParams at init
(self._sampling_params = SamplingParams(...)) and always uses the same prompt
template, ignoring per-request knobs; update the request handling so either (A)
you construct/override SamplingParams from per-request values (max_new_tokens,
temperature, top_p, top_k, seed) and pass that per-call into the vLLM inference
path (replace use of the fixed self._sampling_params and ensure request
text/system_prompt/user_prompt are merged into the prompt template) or (B)
explicitly reject requests that include any of those unsupported fields in
validate_request() by checking for presence of
max_new_tokens/temperature/top_p/top_k/seed/text/system_prompt/user_prompt and
raising a clear validation error; modify validate_request() and the code paths
that build prompts to reflect the chosen approach and ensure SamplingParams
(class SamplingParams and attribute self._sampling_params) is not silently
ignored.
- Around line 234-235: The current request loop catches a blanket Exception and
turns all failures into per-request errors; update the except clause around the
code that assigns results[idx] = GenerationResult(...) to only catch expected
input/validation errors (e.g., ValueError, TypeError, custom BadRequest/Input
exceptions your backend defines) and produce a GenerationResult for those cases
using req.request_id, while re-raising any other exceptions so the worker fails
visibly; refer to the request loop's except block, the results list assignment
(results[idx]), and the GenerationResult constructor when making this change.
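The narrowed handling could be sketched as below; the result container and decode callable are simplified stand-ins for the backend's actual types:

```python
def prepare_inputs(requests, decode_audio):
    """Decode each request's audio. Expected input errors become per-request
    failures, while anything unexpected propagates and fails the worker visibly."""
    results = [None] * len(requests)
    prepared = []
    for idx, req in enumerate(requests):
        try:
            prepared.append((idx, decode_audio(req)))
        except (ValueError, TypeError, OSError) as exc:
            # Expected input/validation errors -> per-request error result.
            results[idx] = {"request_id": idx, "error": str(exc)}
        # Any other exception (e.g. an engine bug) is deliberately not caught.
    return results, prepared
```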
- Around line 239-248: Batch latency is being amortized across requests by
dividing elapsed_ms by len(vllm_inputs); instead set each
GenerationResult.generation_time_ms to the full elapsed_ms (or to the actual
per-request duration if you have per-request timings) so each request reflects
the real end-to-end latency. Locate elapsed_ms, per_req_ms, vllm_inputs,
outputs, valid_indices and the GenerationResult construction in
vllm_asr_backend.py and remove the division by len(vllm_inputs) (or replace
per_req_ms with elapsed_ms) before passing into
GenerationResult.generation_time_ms so metrics are not underreported.
- Line 90: The tokenizer config defaults to an empty string which gets forwarded
by load_model() into LLM(...), causing vLLM to try resolving "" as a tokenizer;
change the tokenizer field default from "" to None (tokenizer: Optional[str] =
None) or modify load_model() so it only passes tokenizer=... to LLM when
tokenizer is not empty/None (check the tokenizer variable before adding the
kwarg) — update references in load_model and any callers to handle the None case
accordingly.
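The tokenizer fix above amounts to a conditional kwarg. `build_llm_kwargs` is an illustrative helper, not the backend's actual code; `LLM` refers to the vLLM constructor:

```python
from typing import Optional

def build_llm_kwargs(model: str, tokenizer: Optional[str] = None) -> dict:
    """Only forward tokenizer when it is actually set, so neither None nor ""
    is ever resolved as a tokenizer path by the engine."""
    kwargs = {"model": model}
    if tokenizer:  # skips both None and the empty string
        kwargs["tokenizer"] = tokenizer
    return kwargs
```

load_model() would then call LLM(**build_llm_kwargs(model, tokenizer)).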
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 248fac67-7edd-456f-b402-cd48cad56242
📒 Files selected for processing (3)
- recipes/multimodal/server/README.md
- recipes/multimodal/server/backends/__init__.py
- recipes/multimodal/server/backends/vllm_asr_backend.py
- Change tokenizer default from "" to None; only pass to LLM() when set
- Reject unsupported per-request overrides (temperature, max_tokens, etc.) in validate_request() instead of silently ignoring them
- Narrow exception catch from blanket Exception to (ValueError, TypeError, OSError) so unexpected errors propagate visibly
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
Address all review comments from pzelasko and CodeRabbit:
Rename:
- vllm_asr -> vllm_nemo_speechlm (backend name, file, class, registry)
- Broaden description to cover all speech-to-text tasks
Code fixes (CodeRabbit):
- tokenizer default "" -> None, only pass to LLM() when set
- Reject unsupported per-request overrides in validate_request()
- Narrow exception catch to (ValueError, TypeError, OSError)
- Report full batch elapsed_ms as generation_time_ms
Design changes (pzelasko):
- Auto-fetch chat template from tokenizer via apply_chat_template()
- User only provides task prompt (e.g. "Transcribe the following:")
- Hardcode vllm_plugins as class constant
- Update plugin docs to reference NeMo Speech dependency
- Throughput table in RTFx with full NeMo vs vLLM WER comparison
- WER verified with whisper EnglishTextNormalizer (hf_leaderboard mode)
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
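The auto-fetched chat template change can be sketched as below. The call shape (tokenize=False, add_generation_prompt=True) follows the Hugging Face apply_chat_template API; the tokenizer object is assumed to be whatever the backend loaded:

```python
def render_prompt(tokenizer, task_prompt: str) -> str:
    """Render the user's task prompt through the tokenizer's own chat template,
    so no hand-written template string is needed."""
    messages = [{"role": "user", "content": task_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```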
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 210-226: validate_request currently only rejects sampling
overrides and missing audio but silently ignores text fields and leaves the
multi-audio check in _get_request_audio; update validate_request to (1) reject
any per-request prompt fields (text, system_prompt, user_prompt) if they are
non-empty by returning an error like "vllm_nemo_speechlm backend does not accept
per-request prompt fields: text, system_prompt, user_prompt", and (2) move the
multiple-audio validation there by checking both audio_bytes and
audio_bytes_list and returning an error if more than one audio input is provided
(e.g., if audio_bytes is set and audio_bytes_list non-empty or audio_bytes_list
length > 1); remove or no-op the duplicate multi-audio check in
_get_request_audio and ensure generate() continues to use self._prompt_text
without reading those rejected fields.
📒 Files selected for processing (3)
- recipes/multimodal/server/README.md
- recipes/multimodal/server/backends/__init__.py
- recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/multimodal/server/backends/__init__.py
def validate_request(self, request: GenerationRequest) -> Optional[str]:
    has_audio = request.audio_bytes is not None or (
        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
    )
    if not has_audio:
        return "vllm_nemo_speechlm backend requires audio input"
    unsupported = {
        "max_new_tokens": request.max_new_tokens,
        "temperature": request.temperature,
        "top_p": request.top_p,
        "top_k": request.top_k,
        "seed": request.seed,
    }
    set_fields = [k for k, v in unsupported.items() if v is not None]
    if set_fields:
        return f"vllm_nemo_speechlm backend does not support per-request overrides: {', '.join(set_fields)}"
    return None
Extend validation to reject ignored text input fields.
The validate_request method now rejects per-request sampling overrides (addressing the past review), but the prompt-related fields text, system_prompt, and user_prompt are still accepted and silently ignored. The generate() method uses a fixed self._prompt_text and never reads these fields.
Additionally, move the multiple-audio check from _get_request_audio into validate_request so all validation errors are returned consistently via the validation path rather than raised as exceptions during generation.
As per coding guidelines, "code should fail if user specifies an unsupported argument."
Proposed fix
 def validate_request(self, request: GenerationRequest) -> Optional[str]:
     has_audio = request.audio_bytes is not None or (
         request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
     )
     if not has_audio:
         return "vllm_nemo_speechlm backend requires audio input"
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+    if request.text is not None:
+        return "vllm_nemo_speechlm backend does not support custom text input"
+    if request.system_prompt is not None:
+        return "vllm_nemo_speechlm backend does not support system_prompt"
+    if request.user_prompt is not None:
+        return "vllm_nemo_speechlm backend does not support user_prompt"
     unsupported = {
         "max_new_tokens": request.max_new_tokens,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 210 - 226, validate_request currently only rejects sampling overrides and
missing audio but silently ignores text fields and leaves the multi-audio check
in _get_request_audio; update validate_request to (1) reject any per-request
prompt fields (text, system_prompt, user_prompt) if they are non-empty by
returning an error like "vllm_nemo_speechlm backend does not accept per-request
prompt fields: text, system_prompt, user_prompt", and (2) move the
multiple-audio validation there by checking both audio_bytes and
audio_bytes_list and returning an error if more than one audio input is provided
(e.g., if audio_bytes is set and audio_bytes_list non-empty or audio_bytes_list
length > 1); remove or no-op the duplicate multi-audio check in
_get_request_audio and ensure generate() continues to use self._prompt_text
without reading those rejected fields.
The NeMo-Skills pipeline sends temperature/max_tokens per-request. Rejecting them breaks the pipeline. Instead, log a warning and use the backend's fixed SamplingParams.
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
213-229: ⚠️ Potential issue | 🟠 Major
Reject unsupported per-request fields instead of logging-and-ignoring them.
validate_request() currently accepts and ignores per-request knobs and prompt fields, and it also allows conflicting audio sources to pass into generation. This makes request behavior ambiguous.
💡 Proposed fix
 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    has_audio_bytes = request.audio_bytes is not None
+    has_audio_list = request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
+    if not (has_audio_bytes or has_audio_list):
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
-        "max_new_tokens": request.max_new_tokens,
-        "temperature": request.temperature,
-        "top_p": request.top_p,
-        "top_k": request.top_k,
-        "seed": request.seed,
-    }
-    set_fields = [k for k, v in ignored.items() if v is not None]
-    if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+
+    if has_audio_bytes and has_audio_list:
+        return "vllm_nemo_speechlm backend accepts either audio_bytes or audio_bytes_list, not both"
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported = [
+        k for k, v in {
+            "text": request.text,
+            "system_prompt": request.system_prompt,
+            "user_prompt": request.user_prompt,
+            "max_new_tokens": request.max_new_tokens,
+            "temperature": request.temperature,
+            "top_p": request.top_p,
+            "top_k": request.top_k,
+            "seed": request.seed,
+        }.items()
+        if v is not None
+    ]
+    if unsupported:
+        return f"vllm_nemo_speechlm backend does not support per-request fields: {', '.join(unsupported)}"
     return None
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 213 - 229, validate_request currently logs and ignores per-request knobs and allows ambiguous audio inputs; change it to reject unsupported fields and conflicting audio sources by returning an error string. In validate_request (GenerationRequest) check the per-request knobs currently in ignored (max_new_tokens, temperature, top_p, top_k, seed) and any prompt-related fields passed on the request, collect any that are set and return a descriptive error like "unsupported per-request fields: ..." instead of warning; also validate audio input so if both audio_bytes and audio_bytes_list are provided (or neither) return an error like "provide exactly one of audio_bytes or audio_bytes_list". Ensure callers expect Optional[str] error messages from validate_request rather than silent ignores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 201-205: The _get_request_audio function uses truthiness for
request.audio_bytes which treats empty bytes as missing; update it to use
explicit "is not None" checks to match validate_request() semantics: in
_get_request_audio, change the first condition to check "request.audio_bytes is
not None" before returning (and similarly ensure any checks of
request.audio_bytes_list presence use "is not None" if needed), while keeping
the existing handling for multiple entries (len(request.audio_bytes_list) > 1)
and referencing the _get_request_audio and validate_request functions to locate
where to make the change.
---
Duplicate comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 213-229: validate_request currently logs and ignores per-request
knobs and allows ambiguous audio inputs; change it to reject unsupported fields
and conflicting audio sources by returning an error string. In validate_request
(GenerationRequest) check the per-request knobs currently in ignored
(max_new_tokens, temperature, top_p, top_k, seed) and any prompt-related fields
passed on the request, collect any that are set and return a descriptive error
like "unsupported per-request fields: ..." instead of warning; also validate
audio input so if both audio_bytes and audio_bytes_list are provided (or
neither) return an error like "provide exactly one of audio_bytes or
audio_bytes_list". Ensure callers expect Optional[str] error messages from
validate_request rather than silent ignores.
📒 Files selected for processing (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
def _get_request_audio(self, request: GenerationRequest) -> bytes:
    if request.audio_bytes:
        return request.audio_bytes
    if request.audio_bytes_list:
        if len(request.audio_bytes_list) > 1:
Use explicit is not None checks for audio_bytes to match validation semantics.
validate_request() treats audio_bytes presence as is not None, but _get_request_audio() uses truthiness. This can produce inconsistent behavior for empty-byte payloads.
💡 Proposed fix
def _get_request_audio(self, request: GenerationRequest) -> bytes:
- if request.audio_bytes:
+ if request.audio_bytes is not None:
return request.audio_bytes
- if request.audio_bytes_list:
+ if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
if len(request.audio_bytes_list) > 1:
raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
        return request.audio_bytes_list[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 201 - 205, The _get_request_audio function uses truthiness for
request.audio_bytes which treats empty bytes as missing; update it to use
explicit "is not None" checks to match validate_request() semantics: in
_get_request_audio, change the first condition to check "request.audio_bytes is
not None" before returning (and similarly ensure any checks of
request.audio_bytes_list presence use "is not None" if needed), while keeping
the existing handling for multiple entries (len(request.audio_bytes_list) > 1)
and referencing the _get_request_audio and validate_request functions to locate
where to make the change.
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
208-239: ⚠️ Potential issue | 🟠 Major
Fail fast on unsupported request fields and normalize audio presence checks.
validate_request() still accepts and ignores per-request prompt/sampling fields, and multi-audio validation is still deferred to _get_request_audio(). Also, line 210 uses truthiness (if request.audio_bytes:), which is inconsistent with the is not None semantics at line 224.
Suggested fix
 def _get_request_audio(self, request: GenerationRequest) -> bytes:
     """Extract audio bytes from a request (supports audio_bytes or audio_bytes_list)."""
-    if request.audio_bytes:
+    if request.audio_bytes is not None:
         return request.audio_bytes
-    if request.audio_bytes_list:
-        if len(request.audio_bytes_list) > 1:
-            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
+        if len(request.audio_bytes_list) > 1:
+            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
         return request.audio_bytes_list[0]
     raise ValueError("Request must contain audio_bytes or audio_bytes_list")

 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    """Validate request has audio input. Logs warning for ignored per-request overrides."""
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    """Validate request has exactly one audio input and no unsupported per-request overrides."""
+    has_audio_bytes = request.audio_bytes is not None
+    num_audio_list = len(request.audio_bytes_list) if request.audio_bytes_list is not None else 0
+    has_audio_list = num_audio_list > 0
+
+    if not has_audio_bytes and not has_audio_list:
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
+    if (1 if has_audio_bytes else 0) + (1 if has_audio_list else 0) > 1 or num_audio_list > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported_prompt_fields = []
+    if request.text:
+        unsupported_prompt_fields.append("text")
+    if request.system_prompt:
+        unsupported_prompt_fields.append("system_prompt")
+    if request.user_prompt:
+        unsupported_prompt_fields.append("user_prompt")
+    if unsupported_prompt_fields:
+        return (
+            "vllm_nemo_speechlm backend does not accept per-request prompt fields: "
+            + ", ".join(unsupported_prompt_fields)
+        )
+
+    unsupported_sampling_fields = {
         "max_new_tokens": request.max_new_tokens,
         "temperature": request.temperature,
         "top_p": request.top_p,
         "top_k": request.top_k,
         "seed": request.seed,
     }
-    set_fields = [k for k, v in ignored.items() if v is not None]
+    set_fields = [k for k, v in unsupported_sampling_fields.items() if v is not None]
     if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+        return (
+            "vllm_nemo_speechlm backend does not support per-request sampling overrides: "
+            + ", ".join(set_fields)
+        )
     return None
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 208 - 239, In validate_request, perform strict validation: check audio presence using explicit "is not None" for request.audio_bytes and request.audio_bytes_list, error out if neither is provided, and also error if request.audio_bytes_list contains more than one entry (move the multi-audio check from _get_request_audio into validate_request). Additionally, fail fast when any unsupported per-request sampling/prompt overrides are set (max_new_tokens, temperature, top_p, top_k, seed) by returning an error string instead of just logging a warning. Keep _get_request_audio simple (use is not None checks) and assume validate_request has already enforced single-audio constraints.
🧹 Nitpick comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
103-110: Remove VLLMNeMoSpeechLMConfig.from_dict() override and reuse base parsing.
This method duplicates BackendConfig.from_dict() behavior and can drift over time. Reusing the inherited implementation keeps config parsing consistent and simpler.
As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 103 - 110, Remove the duplicated from_dict override on VLLMNeMoSpeechLMConfig so the class uses the inherited BackendConfig.from_dict implementation; locate the method VLLMNeMoSpeechLMConfig.from_dict and delete it (or replace its body with a single call to super().from_dict(d) if explicit forwarding is preferred) to avoid drift and ensure consistent config parsing.
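The suggested reuse works because a classmethod binds cls to the subclass it is called on. A minimal sketch — BackendConfig's real parsing may differ from this illustration:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class BackendConfig:
    # Illustrative base config; the real class has more fields.
    model: str = ""

    @classmethod
    def from_dict(cls, d: dict):
        # Reject unknown keys so config typos fail loudly.
        known = {f.name for f in fields(cls)}
        unknown = set(d) - known
        if unknown:
            raise ValueError(f"Unsupported config keys: {sorted(unknown)}")
        return cls(**d)

@dataclass
class VLLMNeMoSpeechLMConfig(BackendConfig):
    tokenizer: Optional[str] = None
    # No from_dict override needed: the inherited classmethod receives this
    # subclass as cls and sees its full field set via fields(cls).
```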
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 208-239: In validate_request, perform strict validation: check
audio presence using explicit "is not None" for request.audio_bytes and
request.audio_bytes_list, error out if neither is provided, and also error if
request.audio_bytes_list contains more than one entry (move the multi-audio
check from _get_request_audio into validate_request). Additionally, fail fast
when any unsupported per-request sampling/prompt overrides are set
(max_new_tokens, temperature, top_p, top_k, seed) by returning an error string
instead of just logging a warning. Keep _get_request_audio simple (use is not
None checks) and assume validate_request has already enforced single-audio
constraints.
---
Nitpick comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 103-110: Remove the duplicated from_dict override on
VLLMNeMoSpeechLMConfig so the class uses the inherited BackendConfig.from_dict
implementation; locate the method VLLMNeMoSpeechLMConfig.from_dict and delete it
(or replace its body with a single call to super().from_dict(d) if explicit
forwarding is preferred) to avoid drift and ensure consistent config parsing.
📒 Files selected for processing (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
Summary
Add a new vllm_asr backend to the unified inference server that performs ASR using vLLM with the nemotron_nano_asr multimodal plugin. This enables fast batched inference for speech LLM models (e.g. Nemotron-Nano-v3 + Canary-v2).
- VLLMASRBackend in recipes/multimodal/server/backends/vllm_asr_backend.py
- tensor_parallel_size for multi-GPU sharding
- ns generate --num_chunks N (recommended for MoE models)
Validated on the full Open ASR Leaderboard (8 datasets, ~82K samples). WER matches NeMo checkpoint evaluation within 0.1% across all datasets and batch sizes.
Test plan
Summary by CodeRabbit
New Features
Documentation