
Add vLLM ASR backend for multimodal speech recognition#1308

Open
DongjiGao wants to merge 11 commits into NVIDIA-NeMo:main from DongjiGao:vllm-asr-backend

Conversation


@DongjiGao DongjiGao commented Mar 15, 2026

Summary

Add a new vllm_asr backend to the unified inference server that performs ASR using vLLM with the nemotron_nano_asr multimodal plugin. This enables fast batched inference for speech LLM models (e.g. Nemotron-Nano-v3 + Canary-v2).

  • New VLLMASRBackend in recipes/multimodal/server/backends/vllm_asr_backend.py
  • Registered in backend registry and README
  • Supports tensor_parallel_size for multi-GPU sharding
  • Horizontal scaling via ns generate --num_chunks N (recommended for MoE models)

Validated on the full Open ASR Leaderboard (8 datasets, ~82K samples). WER matches NeMo checkpoint evaluation within 0.1% across all datasets and batch sizes.

Test plan

  • Verified on librispeech_clean/other, tedlium, spgispeech, voxpopuli, gigaspeech, earnings22, ami
  • Corpus-level WER matches NeMo checkpoint reference (e.g. librispeech_clean: 1.91%)
  • Tested batch sizes 8/16/32 with 32 GPUs
  • Ruff check and format pass

Summary by CodeRabbit

  • New Features

    • Added a vLLM NeMo Speech LM backend to enable audio-to-text inference with multimodal models; now available in the unified-server framework.
  • Documentation

    • Updated the available backends listing and README to include the new vLLM NeMo Speech LM backend.

DongjiGao and others added 6 commits March 10, 2026 23:06
Adds a new vllm_asr backend that uses vLLM with multimodal plugin
support for fast speech recognition inference:
- VLLMASRBackend with PagedAttention and continuous batching
- Supports Nemotron-Nano-v3 + Canary-v2 ASR model via plugin
- OpenAI-compatible /v1/chat/completions endpoint with audio input
- Strips <think></think> tags from NemotronH output
- Configurable prompt template, GPU memory, model length

Usage: python -m nemo_skills.inference.server.serve_unified \
    --backend vllm_asr \
    --model /path/to/checkpoint \
    --tokenizer nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Made-with: Cursor
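The <think>-tag stripping mentioned in the commit message above can be sketched with a regex; the helper name and exact pattern here are illustrative, not the backend's actual implementation:

```python
import re

# Illustrative sketch of removing NemotronH reasoning spans from output text;
# the backend's real helper (_strip_think_tags) may handle edge cases differently.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think_tags(text: str) -> str:
    """Remove '<think>...</think>' spans and trailing whitespace."""
    return THINK_RE.sub("", text).strip()
```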
- test_vllm_asr_librispeech.py: concurrent async client for evaluating
  ASR on Open ASR Leaderboard datasets via the unified server. Reports
  WER, RTFx, throughput, and latency. Saves results to JSON.
- run_all_eval.sh: runs all 8 Open ASR Leaderboard datasets end-to-end
  (starts server, evaluates, saves results, stops server)

Made-with: Cursor
Add <think> tag after assistant prompt to match the NemotronH
chat template exactly. Verified on AMI: 11.64% WER matches
checkpoint's 11.65% (was 11.44% without <think>).

Made-with: Cursor
Remove experimental data_parallel/expert_parallel serve-mode code
that proved impractical for the Nemotron-Nano MoE architecture due
to synchronization overhead. The recommended scaling approach is
horizontal: multiple independent single-GPU instances via
ns generate --num_chunks N.

- Remove serve-mode subprocess, HTTP proxy, and concurrent request
  logic from vllm_asr_backend.py
- Add tensor_parallel_size support for multi-GPU sharding
- Add Prerequisites section documenting the nemotron_nano_asr vLLM
  plugin dependency, installation, and checkpoint requirements
- Add Multi-GPU scaling section with usage example and throughput
  benchmarks (32 GPUs, A100-80GB, Open ASR Leaderboard)
- Inline _generate_embedded into generate() for simplicity
- Register vllm_asr in __init__.py docstring and README.md

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
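The registry entry mentioned above presumably follows a lazy-loading pattern along these lines; the module path and class name come from the PR, but the helper itself is a sketch, not the repo's actual code:

```python
import importlib

# Sketch of lazy backend registration; the real BACKEND_REGISTRY in
# backends/__init__.py may store its entries differently.
BACKEND_REGISTRY = {
    "vllm_nemo_speechlm": (
        "recipes.multimodal.server.backends.vllm_nemo_speechlm_backend",
        "VLLMNeMoSpeechLMBackend",
    ),
}

def load_backend_class(name: str, registry=BACKEND_REGISTRY):
    """Import the backend module only when the backend is first requested."""
    module_path, class_name = registry[name]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```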
These scripts are cluster-specific and should not be in the repo:
- test_vllm_asr_librispeech.py
- run_all_eval.sh

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>

coderabbitai bot commented Mar 15, 2026

📝 Walkthrough


Adds a new vLLM-backed NeMo Speech LM backend (audio→text) to the multimodal unified server, plus registry and README updates. Implements config, model loading, audio preprocessing, request validation, prompt construction, batch generation, and result postprocessing.

Changes

  • Documentation & Registry (recipes/multimodal/server/README.md, recipes/multimodal/server/backends/__init__.py): Added vllm_nemo_speechlm to the available backends list and to BACKEND_REGISTRY for lazy loading via VLLMNeMoSpeechLMBackend.
  • Backend Implementation (recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py): New backend module defining VLLMNeMoSpeechLMConfig and VLLMNeMoSpeechLMBackend: activates the vLLM NeMo plugin, constructs the LLM with hf_overrides and tokenizer, builds prompts, converts audio bytes to mono float32 NumPy arrays, validates requests, performs batch generation via vLLM, strips <think> tags, and assembles per-request GenerationResult objects with timing and debug info.
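The audio-preprocessing step (bytes to mono float32) can be sketched as follows; the function name and format support are assumptions, and the real backend may use soundfile/librosa and handle more formats than 16-bit PCM WAV:

```python
import io
import wave
import numpy as np

def audio_bytes_to_numpy(audio_bytes: bytes) -> np.ndarray:
    """Decode 16-bit PCM WAV bytes into a mono float32 array in [-1, 1].

    Sketch of the preprocessing described above, not the backend's
    exact implementation.
    """
    with wave.open(io.BytesIO(audio_bytes)) as wf:
        n_channels = wf.getnchannels()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = pcm.astype(np.float32) / 32768.0
    if n_channels > 1:
        samples = samples.reshape(-1, n_channels).mean(axis=1)  # downmix to mono
    return samples
```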

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Backend as VLLMNeMoSpeechLMBackend
    participant Validator as Validation
    participant vLLM as vLLM Engine
    participant NeMo as NeMo Speech LM

    Client->>Backend: Submit GenerationRequest(s) with audio_bytes
    Backend->>Validator: validate_request(request)
    Validator-->>Backend: validation result
    Backend->>Backend: _get_request_audio() / _audio_bytes_to_numpy()
    Backend->>Backend: build prompt + inputs
    Backend->>vLLM: generate(batch_inputs, SamplingParams)
    vLLM->>NeMo: invoke NeMo plugin (audio + prompt)
    NeMo-->>vLLM: generated text outputs
    vLLM-->>Backend: batch generation results
    Backend->>Backend: _strip_think_tags(), assemble GenerationResult
    Backend-->>Client: return GenerationResult(s)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add vLLM ASR backend for multimodal speech recognition' accurately describes the main changes: adding a new vLLM-based ASR backend with multimodal speech recognition capabilities to the recipes/multimodal/server/backends directory.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/multimodal/server/backends/vllm_asr_backend.py`:
- Around line 176-179: The code currently freezes SamplingParams at init
(self._sampling_params = SamplingParams(...)) and always uses the same prompt
template, ignoring per-request knobs; update the request handling so either (A)
you construct/override SamplingParams from per-request values (max_new_tokens,
temperature, top_p, top_k, seed) and pass that per-call into the vLLM inference
path (replace use of the fixed self._sampling_params and ensure request
text/system_prompt/user_prompt are merged into the prompt template) or (B)
explicitly reject requests that include any of those unsupported fields in
validate_request() by checking for presence of
max_new_tokens/temperature/top_p/top_k/seed/text/system_prompt/user_prompt and
raising a clear validation error; modify validate_request() and the code paths
that build prompts to reflect the chosen approach and ensure SamplingParams
(class SamplingParams and attribute self._sampling_params) is not silently
ignored.
- Around line 234-235: The current request loop catches a blanket Exception and
turns all failures into per-request errors; update the except clause around the
code that assigns results[idx] = GenerationResult(...) to only catch expected
input/validation errors (e.g., ValueError, TypeError, custom BadRequest/Input
exceptions your backend defines) and produce a GenerationResult for those cases
using req.request_id, while re-raising any other exceptions so the worker fails
visibly; refer to the request loop's except block, the results list assignment
(results[idx]), and the GenerationResult constructor when making this change.
- Around line 239-248: Batch latency is being amortized across requests by
dividing elapsed_ms by len(vllm_inputs); instead set each
GenerationResult.generation_time_ms to the full elapsed_ms (or to the actual
per-request duration if you have per-request timings) so each request reflects
the real end-to-end latency. Locate elapsed_ms, per_req_ms, vllm_inputs,
outputs, valid_indices and the GenerationResult construction in
vllm_asr_backend.py and remove the division by len(vllm_inputs) (or replace
per_req_ms with elapsed_ms) before passing into
GenerationResult.generation_time_ms so metrics are not underreported.
- Line 90: The tokenizer config defaults to an empty string which gets forwarded
by load_model() into LLM(...), causing vLLM to try resolving "" as a tokenizer;
change the tokenizer field default from "" to None (tokenizer: Optional[str] =
None) or modify load_model() so it only passes tokenizer=... to LLM when
tokenizer is not empty/None (check the tokenizer variable before adding the
kwarg) — update references in load_model and any callers to handle the None case
accordingly.
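Option (A) from the first comment above, building per-request sampling settings instead of reusing a frozen self._sampling_params, might look like the sketch below. The field names follow the review comment rather than the actual dataclasses, and the merged dict would be passed on as SamplingParams(**merged) in the vLLM call:

```python
# Sketch of per-request sampling override merging (option A above); field
# names are taken from the review comment, not necessarily the real classes.
SAMPLING_FIELDS = ("max_new_tokens", "temperature", "top_p", "top_k", "seed")

def merge_sampling_overrides(request_fields: dict, defaults: dict) -> dict:
    """Overlay non-None per-request values on the backend defaults."""
    merged = dict(defaults)
    for key in SAMPLING_FIELDS:
        value = request_fields.get(key)
        if value is not None:
            merged[key] = value
    return merged
```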

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 248fac67-7edd-456f-b402-cd48cad56242

📥 Commits

Reviewing files that changed from the base of the PR and between 86071c1 and 150b5b7.

📒 Files selected for processing (3)
  • recipes/multimodal/server/README.md
  • recipes/multimodal/server/backends/__init__.py
  • recipes/multimodal/server/backends/vllm_asr_backend.py

Dongji Gao added 2 commits March 16, 2026 16:03
- Change tokenizer default from "" to None; only pass to LLM() when set
- Reject unsupported per-request overrides (temperature, max_tokens, etc.)
  in validate_request() instead of silently ignoring them
- Narrow exception catch from blanket Exception to (ValueError, TypeError,
  OSError) so unexpected errors propagate visibly

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
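The first fix above (forwarding the tokenizer only when set) can be sketched as a small kwargs builder. tensor_parallel_size and tokenizer are real vllm.LLM constructor arguments, but the helper itself is illustrative:

```python
from typing import Optional

def build_llm_kwargs(model: str, tokenizer: Optional[str] = None,
                     tensor_parallel_size: int = 1) -> dict:
    """Assemble keyword arguments for vllm.LLM(...), omitting the tokenizer
    entirely when unset so vLLM never tries to resolve "" or None."""
    kwargs = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    if tokenizer:
        kwargs["tokenizer"] = tokenizer
    return kwargs
```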
Address all review comments from pzelasko and CodeRabbit:

Rename:
- vllm_asr -> vllm_nemo_speechlm (backend name, file, class, registry)
- Broaden description to cover all speech-to-text tasks

Code fixes (CodeRabbit):
- tokenizer default "" -> None, only pass to LLM() when set
- Reject unsupported per-request overrides in validate_request()
- Narrow exception catch to (ValueError, TypeError, OSError)
- Report full batch elapsed_ms as generation_time_ms

Design changes (pzelasko):
- Auto-fetch chat template from tokenizer via apply_chat_template()
- User only provides task prompt (e.g. "Transcribe the following:")
- Hardcode vllm_plugins as class constant
- Update plugin docs to reference NeMo Speech dependency
- Throughput table in RTFx with full NeMo vs vLLM WER comparison
- WER verified with whisper EnglishTextNormalizer (hf_leaderboard mode)

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
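For reference, the corpus-level WER in the comparison above is word-level Levenshtein distance over reference words. A minimal single-pair sketch follows; the actual leaderboard evaluation aggregates errors over the whole corpus and applies Whisper's EnglishTextNormalizer before scoring:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance pair via word-level edit distance.

    Minimal sketch for illustration only; real evaluation normalizes
    text first and aggregates over the corpus.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[len(hyp)] / max(len(ref), 1)
```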
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 210-226: validate_request currently only rejects sampling
overrides and missing audio but silently ignores text fields and leaves the
multi-audio check in _get_request_audio; update validate_request to (1) reject
any per-request prompt fields (text, system_prompt, user_prompt) if they are
non-empty by returning an error like "vllm_nemo_speechlm backend does not accept
per-request prompt fields: text, system_prompt, user_prompt", and (2) move the
multiple-audio validation there by checking both audio_bytes and
audio_bytes_list and returning an error if more than one audio input is provided
(e.g., if audio_bytes is set and audio_bytes_list non-empty or audio_bytes_list
length > 1); remove or no-op the duplicate multi-audio check in
_get_request_audio and ensure generate() continues to use self._prompt_text
without reading those rejected fields.


📥 Commits

Reviewing files that changed from the base of the PR and between 150b5b7 and a41eef0.

📒 Files selected for processing (3)
  • recipes/multimodal/server/README.md
  • recipes/multimodal/server/backends/__init__.py
  • recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • recipes/multimodal/server/backends/__init__.py

Comment on lines +210 to +226
```python
    def validate_request(self, request: GenerationRequest) -> Optional[str]:
        has_audio = request.audio_bytes is not None or (
            request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
        )
        if not has_audio:
            return "vllm_nemo_speechlm backend requires audio input"
        unsupported = {
            "max_new_tokens": request.max_new_tokens,
            "temperature": request.temperature,
            "top_p": request.top_p,
            "top_k": request.top_k,
            "seed": request.seed,
        }
        set_fields = [k for k, v in unsupported.items() if v is not None]
        if set_fields:
            return f"vllm_nemo_speechlm backend does not support per-request overrides: {', '.join(set_fields)}"
        return None
```
⚠️ Potential issue | 🟡 Minor

Extend validation to reject ignored text input fields.

The validate_request method now rejects per-request sampling overrides (addressing the past review), but the prompt-related fields text, system_prompt, and user_prompt are still accepted and silently ignored. The generate() method uses a fixed self._prompt_text and never reads these fields.

Additionally, move the multiple-audio check from _get_request_audio into validate_request so all validation errors are returned consistently via the validation path rather than raised as exceptions during generation.

As per coding guidelines, "code should fail if user specifies an unsupported argument."

Proposed fix
     def validate_request(self, request: GenerationRequest) -> Optional[str]:
         has_audio = request.audio_bytes is not None or (
             request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
         )
         if not has_audio:
             return "vllm_nemo_speechlm backend requires audio input"
+        if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+            return "vllm_nemo_speechlm backend currently supports one audio per request"
+        if request.text is not None:
+            return "vllm_nemo_speechlm backend does not support custom text input"
+        if request.system_prompt is not None:
+            return "vllm_nemo_speechlm backend does not support system_prompt"
+        if request.user_prompt is not None:
+            return "vllm_nemo_speechlm backend does not support user_prompt"
         unsupported = {
             "max_new_tokens": request.max_new_tokens,

The NeMo-Skills pipeline sends temperature/max_tokens per-request.
Rejecting them breaks the pipeline. Instead, log a warning and use
the backend's fixed SamplingParams.

Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)

213-229: ⚠️ Potential issue | 🟠 Major

Reject unsupported per-request fields instead of logging-and-ignoring them.

validate_request() currently accepts and ignores per-request knobs and prompt fields, and it also allows conflicting audio sources to pass into generation. This makes request behavior ambiguous.

💡 Proposed fix
 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    has_audio_bytes = request.audio_bytes is not None
+    has_audio_list = request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
+    if not (has_audio_bytes or has_audio_list):
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
-        "max_new_tokens": request.max_new_tokens,
-        "temperature": request.temperature,
-        "top_p": request.top_p,
-        "top_k": request.top_k,
-        "seed": request.seed,
-    }
-    set_fields = [k for k, v in ignored.items() if v is not None]
-    if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+
+    if has_audio_bytes and has_audio_list:
+        return "vllm_nemo_speechlm backend accepts either audio_bytes or audio_bytes_list, not both"
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported = [
+        k for k, v in {
+            "text": request.text,
+            "system_prompt": request.system_prompt,
+            "user_prompt": request.user_prompt,
+            "max_new_tokens": request.max_new_tokens,
+            "temperature": request.temperature,
+            "top_p": request.top_p,
+            "top_k": request.top_k,
+            "seed": request.seed,
+        }.items()
+        if v is not None
+    ]
+    if unsupported:
+        return f"vllm_nemo_speechlm backend does not support per-request fields: {', '.join(unsupported)}"
     return None

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
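A self-contained version of the behavior this diff proposes, using a stand-in dataclass in place of the repo's GenerationRequest, could read:

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-in for the unified server's GenerationRequest, for illustration only.
@dataclass
class RequestStub:
    audio_bytes: Optional[bytes] = None
    audio_bytes_list: Optional[List[bytes]] = None
    text: Optional[str] = None
    system_prompt: Optional[str] = None
    user_prompt: Optional[str] = None
    max_new_tokens: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    top_k: Optional[int] = None
    seed: Optional[int] = None

def validate_request(request) -> Optional[str]:
    """Return an error string for invalid requests, None when valid."""
    has_bytes = request.audio_bytes is not None
    has_list = request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
    if not (has_bytes or has_list):
        return "backend requires audio input"
    if has_bytes and has_list:
        return "provide either audio_bytes or audio_bytes_list, not both"
    if has_list and len(request.audio_bytes_list) > 1:
        return "backend currently supports one audio per request"
    unsupported = {
        "text": request.text,
        "system_prompt": request.system_prompt,
        "user_prompt": request.user_prompt,
        "max_new_tokens": request.max_new_tokens,
        "temperature": request.temperature,
        "top_p": request.top_p,
        "top_k": request.top_k,
        "seed": request.seed,
    }
    set_fields = [k for k, v in unsupported.items() if v is not None]
    if set_fields:
        return f"backend does not support per-request fields: {', '.join(set_fields)}"
    return None
```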

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 213 - 229, validate_request currently logs and ignores per-request knobs
and allows ambiguous audio inputs; change it to reject unsupported fields and
conflicting audio sources by returning an error string. In validate_request
(GenerationRequest) check the per-request knobs currently in ignored
(max_new_tokens, temperature, top_p, top_k, seed) and any prompt-related fields
passed on the request, collect any that are set and return a descriptive error
like "unsupported per-request fields: ..." instead of warning; also validate
audio input so if both audio_bytes and audio_bytes_list are provided (or
neither) return an error like "provide exactly one of audio_bytes or
audio_bytes_list". Ensure callers expect Optional[str] error messages from
validate_request rather than silent ignores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 201-205: The _get_request_audio function uses truthiness for
request.audio_bytes which treats empty bytes as missing; update it to use
explicit "is not None" checks to match validate_request() semantics: in
_get_request_audio, change the first condition to check "request.audio_bytes is
not None" before returning (and similarly ensure any checks of
request.audio_bytes_list presence use "is not None" if needed), while keeping
the existing handling for multiple entries (len(request.audio_bytes_list) > 1)
and referencing the _get_request_audio and validate_request functions to locate
where to make the change.



📥 Commits

Reviewing files that changed from the base of the PR and between a41eef0 and b5dde63.

📒 Files selected for processing (1)
  • recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py

Comment on lines +201 to +205
```python
    def _get_request_audio(self, request: GenerationRequest) -> bytes:
        if request.audio_bytes:
            return request.audio_bytes
        if request.audio_bytes_list:
            if len(request.audio_bytes_list) > 1:
```

⚠️ Potential issue | 🟡 Minor

Use explicit is not None checks for audio_bytes to match validation semantics.

validate_request() treats audio_bytes presence as is not None, but _get_request_audio() uses truthiness. This can produce inconsistent behavior for empty-byte payloads.

💡 Proposed fix
 def _get_request_audio(self, request: GenerationRequest) -> bytes:
-    if request.audio_bytes:
+    if request.audio_bytes is not None:
         return request.audio_bytes
-    if request.audio_bytes_list:
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
         if len(request.audio_bytes_list) > 1:
             raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
         return request.audio_bytes_list[0]

Dongji Gao and others added 2 commits March 20, 2026 11:43
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)

208-239: ⚠️ Potential issue | 🟠 Major

Fail fast on unsupported request fields and normalize audio presence checks.

validate_request() still accepts-and-ignores per-request prompt/sampling fields, and multi-audio validation is still deferred to _get_request_audio(). Also, Line 210 uses truthiness (if request.audio_bytes:), which is inconsistent with Line 224 semantics (is not None).

Suggested fix
 def _get_request_audio(self, request: GenerationRequest) -> bytes:
     """Extract audio bytes from a request (supports audio_bytes or audio_bytes_list)."""
-    if request.audio_bytes:
+    if request.audio_bytes is not None:
         return request.audio_bytes
-    if request.audio_bytes_list:
-        if len(request.audio_bytes_list) > 1:
-            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
+        if len(request.audio_bytes_list) > 1:
+            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
         return request.audio_bytes_list[0]
     raise ValueError("Request must contain audio_bytes or audio_bytes_list")

 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    """Validate request has audio input. Logs warning for ignored per-request overrides."""
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    """Validate request has exactly one audio input and no unsupported per-request overrides."""
+    has_audio_bytes = request.audio_bytes is not None
+    num_audio_list = len(request.audio_bytes_list) if request.audio_bytes_list is not None else 0
+    has_audio_list = num_audio_list > 0
+
+    if not has_audio_bytes and not has_audio_list:
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
+    if (1 if has_audio_bytes else 0) + (1 if has_audio_list else 0) > 1 or num_audio_list > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported_prompt_fields = []
+    if request.text:
+        unsupported_prompt_fields.append("text")
+    if request.system_prompt:
+        unsupported_prompt_fields.append("system_prompt")
+    if request.user_prompt:
+        unsupported_prompt_fields.append("user_prompt")
+    if unsupported_prompt_fields:
+        return (
+            "vllm_nemo_speechlm backend does not accept per-request prompt fields: "
+            + ", ".join(unsupported_prompt_fields)
+        )
+
+    unsupported_sampling_fields = {
         "max_new_tokens": request.max_new_tokens,
         "temperature": request.temperature,
         "top_p": request.top_p,
         "top_k": request.top_k,
         "seed": request.seed,
     }
-    set_fields = [k for k, v in ignored.items() if v is not None]
+    set_fields = [k for k, v in unsupported_sampling_fields.items() if v is not None]
     if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+        return (
+            "vllm_nemo_speechlm backend does not support per-request sampling overrides: "
+            + ", ".join(set_fields)
+        )
     return None

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 208 - 239, In validate_request, perform strict validation: check audio
presence using explicit "is not None" for request.audio_bytes and
request.audio_bytes_list, error out if neither is provided, and also error if
request.audio_bytes_list contains more than one entry (move the multi-audio
check from _get_request_audio into validate_request). Additionally, fail fast
when any unsupported per-request sampling/prompt overrides are set
(max_new_tokens, temperature, top_p, top_k, seed) by returning an error string
instead of just logging a warning. Keep _get_request_audio simple (use is not
None checks) and assume validate_request has already enforced single-audio
constraints.
🧹 Nitpick comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)

103-110: Remove VLLMNeMoSpeechLMConfig.from_dict() override and reuse base parsing.

This method duplicates BackendConfig.from_dict() behavior and can drift over time. Reusing the inherited implementation keeps config parsing consistent and simpler.

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 103 - 110, Remove the duplicated from_dict override on
VLLMNeMoSpeechLMConfig so the class uses the inherited BackendConfig.from_dict
implementation; locate the method VLLMNeMoSpeechLMConfig.from_dict and delete it
(or replace its body with a single call to super().from_dict(d) if explicit
forwarding is preferred) to avoid drift and ensure consistent config parsing.


📥 Commits

Reviewing files that changed from the base of the PR and between b5dde63 and 2320d09.

📒 Files selected for processing (1)
  • recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py

@DongjiGao DongjiGao requested a review from pzelasko March 20, 2026 19:55