Add vLLM ASR backend for multimodal speech recognition #1308
DongjiGao wants to merge 11 commits into NVIDIA-NeMo:main from
Conversation
Adds a new vllm_asr backend that uses vLLM with multimodal plugin
support for fast speech recognition inference:
- VLLMASRBackend with PagedAttention and continuous batching
- Supports Nemotron-Nano-v3 + Canary-v2 ASR model via plugin
- OpenAI-compatible /v1/chat/completions endpoint with audio input
- Strips <think></think> tags from NemotronH output
- Configurable prompt template, GPU memory, model length
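The `<think>`-tag stripping mentioned above can be sketched with a small regex helper. This is a hypothetical reconstruction; the backend's actual method name and edge-case handling may differ:

```python
import re

# Hypothetical helper mirroring the described behavior: NemotronH may emit
# reasoning wrapped in <think>...</think>; for ASR we keep only the transcript.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> spans, plus any unterminated trailing <think> block."""
    cleaned = _THINK_RE.sub("", text)
    # A truncated generation can leave an unclosed <think>; drop it too.
    cleaned = re.sub(r"<think>.*\Z", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()
```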
Usage: python -m nemo_skills.inference.server.serve_unified \
--backend vllm_asr \
--model /path/to/checkpoint \
--tokenizer nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Made-with: Cursor
- test_vllm_asr_librispeech.py: concurrent async client for evaluating ASR on Open ASR Leaderboard datasets via the unified server. Reports WER, RTFx, throughput, and latency. Saves results to JSON.
- run_all_eval.sh: runs all 8 Open ASR Leaderboard datasets end-to-end (starts server, evaluates, saves results, stops server)
Made-with: Cursor
Add <think> tag after assistant prompt to match the NemotronH chat template exactly. Verified on AMI: 11.64% WER matches checkpoint's 11.65% (was 11.44% without <think>). Made-with: Cursor
Remove experimental data_parallel/expert_parallel serve-mode code that proved impractical for the Nemotron-Nano MoE architecture due to synchronization overhead. The recommended scaling approach is horizontal: multiple independent single-GPU instances via ns generate --num_chunks N.
- Remove serve-mode subprocess, HTTP proxy, and concurrent request logic from vllm_asr_backend.py
- Add tensor_parallel_size support for multi-GPU sharding
- Add Prerequisites section documenting the nemotron_nano_asr vLLM plugin dependency, installation, and checkpoint requirements
- Add Multi-GPU scaling section with usage example and throughput benchmarks (32 GPUs, A100-80GB, Open ASR Leaderboard)
- Inline _generate_embedded into generate() for simplicity
- Register vllm_asr in __init__.py docstring and README.md
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
These scripts are cluster-specific and should not be in the repo:
- test_vllm_asr_librispeech.py
- run_all_eval.sh
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
📝 Walkthrough
Adds a new vLLM-backed NeMo Speech LM backend (audio→text) to the multimodal unified server, plus registry and README updates. Implements config, model loading, audio preprocessing, request validation, prompt construction, batch generation, and result postprocessing.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Backend as VLLMNeMoSpeechLMBackend
participant Validator as Validation
participant vLLM as vLLM Engine
participant NeMo as NeMo Speech LM
Client->>Backend: Submit GenerationRequest(s) with audio_bytes
Backend->>Validator: validate_request(request)
Validator-->>Backend: validation result
Backend->>Backend: _get_request_audio() / _audio_bytes_to_numpy()
Backend->>Backend: build prompt + inputs
Backend->>vLLM: generate(batch_inputs, SamplingParams)
vLLM->>NeMo: invoke NeMo plugin (audio + prompt)
NeMo-->>vLLM: generated text outputs
vLLM-->>Backend: batch generation results
Backend->>Backend: _strip_think_tags(), assemble GenerationResult
Backend-->>Client: return GenerationResult(s)
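The request flow in the diagram can be sketched as a batch loop. The types below are simplified stand-ins; the field names and the engine callable are assumptions based on the walkthrough, not the actual NeMo-Skills API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Stand-in request/result types (assumed shapes, not the real definitions).
@dataclass
class GenerationRequest:
    request_id: str
    audio_bytes: Optional[bytes] = None

@dataclass
class GenerationResult:
    request_id: str
    text: str = ""
    error: Optional[str] = None

def generate_batch(
    requests: List[GenerationRequest],
    run_engine: Callable[[List[bytes]], List[str]],
) -> List[GenerationResult]:
    """Validate each request, batch valid audio into one engine call, then
    scatter outputs back to their original positions."""
    results = [GenerationResult(r.request_id) for r in requests]
    valid_indices: List[int] = []
    batch_audio: List[bytes] = []
    for idx, req in enumerate(requests):
        if req.audio_bytes is None:
            results[idx].error = "backend requires audio input"
            continue
        valid_indices.append(idx)
        batch_audio.append(req.audio_bytes)
    if batch_audio:
        # One batched call stands in for vLLM's generate(batch_inputs, params).
        for idx, text in zip(valid_indices, run_engine(batch_audio)):
            results[idx].text = text
    return results
```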
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_asr_backend.py`:
- Around line 176-179: The code currently freezes SamplingParams at init
(self._sampling_params = SamplingParams(...)) and always uses the same prompt
template, ignoring per-request knobs; update the request handling so either (A)
you construct/override SamplingParams from per-request values (max_new_tokens,
temperature, top_p, top_k, seed) and pass that per-call into the vLLM inference
path (replace use of the fixed self._sampling_params and ensure request
text/system_prompt/user_prompt are merged into the prompt template) or (B)
explicitly reject requests that include any of those unsupported fields in
validate_request() by checking for presence of
max_new_tokens/temperature/top_p/top_k/seed/text/system_prompt/user_prompt and
raising a clear validation error; modify validate_request() and the code paths
that build prompts to reflect the chosen approach and ensure SamplingParams
(class SamplingParams and attribute self._sampling_params) is not silently
ignored.
- Around line 234-235: The current request loop catches a blanket Exception and
turns all failures into per-request errors; update the except clause around the
code that assigns results[idx] = GenerationResult(...) to only catch expected
input/validation errors (e.g., ValueError, TypeError, custom BadRequest/Input
exceptions your backend defines) and produce a GenerationResult for those cases
using req.request_id, while re-raising any other exceptions so the worker fails
visibly; refer to the request loop's except block, the results list assignment
(results[idx]), and the GenerationResult constructor when making this change.
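The narrowed handling could be sketched as below; the result container and decode callable are simplified stand-ins for the backend's actual types:

```python
def prepare_inputs(requests, decode_audio):
    """Decode each request's audio. Expected input errors become per-request
    failures, while anything unexpected propagates and fails the worker visibly."""
    results = [None] * len(requests)
    prepared = []
    for idx, req in enumerate(requests):
        try:
            prepared.append((idx, decode_audio(req)))
        except (ValueError, TypeError, OSError) as exc:
            # Expected input/validation errors -> per-request error result.
            results[idx] = {"request_id": idx, "error": str(exc)}
        # Any other exception (e.g. an engine bug) is deliberately not caught.
    return results, prepared
```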
- Around line 239-248: Batch latency is being amortized across requests by
dividing elapsed_ms by len(vllm_inputs); instead set each
GenerationResult.generation_time_ms to the full elapsed_ms (or to the actual
per-request duration if you have per-request timings) so each request reflects
the real end-to-end latency. Locate elapsed_ms, per_req_ms, vllm_inputs,
outputs, valid_indices and the GenerationResult construction in
vllm_asr_backend.py and remove the division by len(vllm_inputs) (or replace
per_req_ms with elapsed_ms) before passing into
GenerationResult.generation_time_ms so metrics are not underreported.
- Line 90: The tokenizer config defaults to an empty string which gets forwarded
by load_model() into LLM(...), causing vLLM to try resolving "" as a tokenizer;
change the tokenizer field default from "" to None (tokenizer: Optional[str] =
None) or modify load_model() so it only passes tokenizer=... to LLM when
tokenizer is not empty/None (check the tokenizer variable before adding the
kwarg) — update references in load_model and any callers to handle the None case
accordingly.
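The tokenizer fix above amounts to a conditional kwarg. `build_llm_kwargs` is an illustrative helper, not the backend's actual code; `LLM` refers to the vLLM constructor:

```python
from typing import Optional

def build_llm_kwargs(model: str, tokenizer: Optional[str] = None) -> dict:
    """Only forward tokenizer when it is actually set, so neither None nor ""
    is ever resolved as a tokenizer path by the engine."""
    kwargs = {"model": model}
    if tokenizer:  # skips both None and the empty string
        kwargs["tokenizer"] = tokenizer
    return kwargs
```

load_model() would then call LLM(**build_llm_kwargs(model, tokenizer)).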
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 248fac67-7edd-456f-b402-cd48cad56242
📒 Files selected for processing (3)
- recipes/multimodal/server/README.md
- recipes/multimodal/server/backends/__init__.py
- recipes/multimodal/server/backends/vllm_asr_backend.py
- Change tokenizer default from "" to None; only pass to LLM() when set
- Reject unsupported per-request overrides (temperature, max_tokens, etc.) in validate_request() instead of silently ignoring them
- Narrow exception catch from blanket Exception to (ValueError, TypeError, OSError) so unexpected errors propagate visibly
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
Address all review comments from pzelasko and CodeRabbit:
Rename:
- vllm_asr -> vllm_nemo_speechlm (backend name, file, class, registry)
- Broaden description to cover all speech-to-text tasks
Code fixes (CodeRabbit):
- tokenizer default "" -> None, only pass to LLM() when set
- Reject unsupported per-request overrides in validate_request()
- Narrow exception catch to (ValueError, TypeError, OSError)
- Report full batch elapsed_ms as generation_time_ms
Design changes (pzelasko):
- Auto-fetch chat template from tokenizer via apply_chat_template()
- User only provides task prompt (e.g. "Transcribe the following:")
- Hardcode vllm_plugins as class constant
- Update plugin docs to reference NeMo Speech dependency
- Throughput table in RTFx with full NeMo vs vLLM WER comparison
- WER verified with whisper EnglishTextNormalizer (hf_leaderboard mode)
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
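The auto-fetched chat template change can be sketched as below. The call shape (tokenize=False, add_generation_prompt=True) follows the Hugging Face apply_chat_template API; the tokenizer object is assumed to be whatever the backend loaded:

```python
def render_prompt(tokenizer, task_prompt: str) -> str:
    """Render the user's task prompt through the tokenizer's own chat template,
    so no hand-written template string is needed."""
    messages = [{"role": "user", "content": task_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```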
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 210-226: validate_request currently only rejects sampling
overrides and missing audio but silently ignores text fields and leaves the
multi-audio check in _get_request_audio; update validate_request to (1) reject
any per-request prompt fields (text, system_prompt, user_prompt) if they are
non-empty by returning an error like "vllm_nemo_speechlm backend does not accept
per-request prompt fields: text, system_prompt, user_prompt", and (2) move the
multiple-audio validation there by checking both audio_bytes and
audio_bytes_list and returning an error if more than one audio input is provided
(e.g., if audio_bytes is set and audio_bytes_list non-empty or audio_bytes_list
length > 1); remove or no-op the duplicate multi-audio check in
_get_request_audio and ensure generate() continues to use self._prompt_text
without reading those rejected fields.
📒 Files selected for processing (3)
- recipes/multimodal/server/README.md
- recipes/multimodal/server/backends/__init__.py
- recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/multimodal/server/backends/__init__.py
def validate_request(self, request: GenerationRequest) -> Optional[str]:
    has_audio = request.audio_bytes is not None or (
        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
    )
    if not has_audio:
        return "vllm_nemo_speechlm backend requires audio input"
    unsupported = {
        "max_new_tokens": request.max_new_tokens,
        "temperature": request.temperature,
        "top_p": request.top_p,
        "top_k": request.top_k,
        "seed": request.seed,
    }
    set_fields = [k for k, v in unsupported.items() if v is not None]
    if set_fields:
        return f"vllm_nemo_speechlm backend does not support per-request overrides: {', '.join(set_fields)}"
    return None
Extend validation to reject ignored text input fields.
The validate_request method now rejects per-request sampling overrides (addressing the past review), but the prompt-related fields text, system_prompt, and user_prompt are still accepted and silently ignored. The generate() method uses a fixed self._prompt_text and never reads these fields.
Additionally, move the multiple-audio check from _get_request_audio into validate_request so all validation errors are returned consistently via the validation path rather than raised as exceptions during generation.
As per coding guidelines, "code should fail if user specifies an unsupported argument."
Proposed fix
 def validate_request(self, request: GenerationRequest) -> Optional[str]:
     has_audio = request.audio_bytes is not None or (
         request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
     )
     if not has_audio:
         return "vllm_nemo_speechlm backend requires audio input"
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+    if request.text is not None:
+        return "vllm_nemo_speechlm backend does not support custom text input"
+    if request.system_prompt is not None:
+        return "vllm_nemo_speechlm backend does not support system_prompt"
+    if request.user_prompt is not None:
+        return "vllm_nemo_speechlm backend does not support user_prompt"
     unsupported = {
         "max_new_tokens": request.max_new_tokens,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 210 - 226, validate_request currently only rejects sampling overrides and
missing audio but silently ignores text fields and leaves the multi-audio check
in _get_request_audio; update validate_request to (1) reject any per-request
prompt fields (text, system_prompt, user_prompt) if they are non-empty by
returning an error like "vllm_nemo_speechlm backend does not accept per-request
prompt fields: text, system_prompt, user_prompt", and (2) move the
multiple-audio validation there by checking both audio_bytes and
audio_bytes_list and returning an error if more than one audio input is provided
(e.g., if audio_bytes is set and audio_bytes_list non-empty or audio_bytes_list
length > 1); remove or no-op the duplicate multi-audio check in
_get_request_audio and ensure generate() continues to use self._prompt_text
without reading those rejected fields.
The NeMo-Skills pipeline sends temperature/max_tokens per-request. Rejecting them breaks the pipeline. Instead, log a warning and use the backend's fixed SamplingParams.
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
Actionable comments posted: 1
♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
213-229: ⚠️ Potential issue | 🟠 Major
Reject unsupported per-request fields instead of logging-and-ignoring them.
validate_request() currently accepts and ignores per-request knobs and prompt fields, and it also allows conflicting audio sources to pass into generation. This makes request behavior ambiguous.
💡 Proposed fix
 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    has_audio_bytes = request.audio_bytes is not None
+    has_audio_list = request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
+    if not (has_audio_bytes or has_audio_list):
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
-        "max_new_tokens": request.max_new_tokens,
-        "temperature": request.temperature,
-        "top_p": request.top_p,
-        "top_k": request.top_k,
-        "seed": request.seed,
-    }
-    set_fields = [k for k, v in ignored.items() if v is not None]
-    if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+
+    if has_audio_bytes and has_audio_list:
+        return "vllm_nemo_speechlm backend accepts either audio_bytes or audio_bytes_list, not both"
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported = [
+        k for k, v in {
+            "text": request.text,
+            "system_prompt": request.system_prompt,
+            "user_prompt": request.user_prompt,
+            "max_new_tokens": request.max_new_tokens,
+            "temperature": request.temperature,
+            "top_p": request.top_p,
+            "top_k": request.top_k,
+            "seed": request.seed,
+        }.items()
+        if v is not None
+    ]
+    if unsupported:
+        return f"vllm_nemo_speechlm backend does not support per-request fields: {', '.join(unsupported)}"
     return None
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 213 - 229, validate_request currently logs and ignores per-request knobs and allows ambiguous audio inputs; change it to reject unsupported fields and conflicting audio sources by returning an error string. In validate_request (GenerationRequest) check the per-request knobs currently in ignored (max_new_tokens, temperature, top_p, top_k, seed) and any prompt-related fields passed on the request, collect any that are set and return a descriptive error like "unsupported per-request fields: ..." instead of warning; also validate audio input so if both audio_bytes and audio_bytes_list are provided (or neither) return an error like "provide exactly one of audio_bytes or audio_bytes_list". Ensure callers expect Optional[str] error messages from validate_request rather than silent ignores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 201-205: The _get_request_audio function uses truthiness for
request.audio_bytes which treats empty bytes as missing; update it to use
explicit "is not None" checks to match validate_request() semantics: in
_get_request_audio, change the first condition to check "request.audio_bytes is
not None" before returning (and similarly ensure any checks of
request.audio_bytes_list presence use "is not None" if needed), while keeping
the existing handling for multiple entries (len(request.audio_bytes_list) > 1)
and referencing the _get_request_audio and validate_request functions to locate
where to make the change.
---
Duplicate comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 213-229: validate_request currently logs and ignores per-request
knobs and allows ambiguous audio inputs; change it to reject unsupported fields
and conflicting audio sources by returning an error string. In validate_request
(GenerationRequest) check the per-request knobs currently in ignored
(max_new_tokens, temperature, top_p, top_k, seed) and any prompt-related fields
passed on the request, collect any that are set and return a descriptive error
like "unsupported per-request fields: ..." instead of warning; also validate
audio input so if both audio_bytes and audio_bytes_list are provided (or
neither) return an error like "provide exactly one of audio_bytes or
audio_bytes_list". Ensure callers expect Optional[str] error messages from
validate_request rather than silent ignores.
📒 Files selected for processing (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
def _get_request_audio(self, request: GenerationRequest) -> bytes:
    if request.audio_bytes:
        return request.audio_bytes
    if request.audio_bytes_list:
        if len(request.audio_bytes_list) > 1:
Use explicit is not None checks for audio_bytes to match validation semantics.
validate_request() treats audio_bytes presence as is not None, but _get_request_audio() uses truthiness. This can produce inconsistent behavior for empty-byte payloads.
💡 Proposed fix
def _get_request_audio(self, request: GenerationRequest) -> bytes:
- if request.audio_bytes:
+ if request.audio_bytes is not None:
return request.audio_bytes
- if request.audio_bytes_list:
+ if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
if len(request.audio_bytes_list) > 1:
raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
        return request.audio_bytes_list[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around
lines 201 - 205, The _get_request_audio function uses truthiness for
request.audio_bytes which treats empty bytes as missing; update it to use
explicit "is not None" checks to match validate_request() semantics: in
_get_request_audio, change the first condition to check "request.audio_bytes is
not None" before returning (and similarly ensure any checks of
request.audio_bytes_list presence use "is not None" if needed), while keeping
the existing handling for multiple entries (len(request.audio_bytes_list) > 1)
and referencing the _get_request_audio and validate_request functions to locate
where to make the change.
Signed-off-by: Dongji Gao <dongjig@draco-oci-login-03.cm.cluster>
♻️ Duplicate comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
208-239: ⚠️ Potential issue | 🟠 Major
Fail fast on unsupported request fields and normalize audio presence checks.
validate_request() still accepts and ignores per-request prompt/sampling fields, and multi-audio validation is still deferred to _get_request_audio(). Also, line 210 uses truthiness (if request.audio_bytes:), which is inconsistent with the is not None semantics at line 224.
Suggested fix
 def _get_request_audio(self, request: GenerationRequest) -> bytes:
     """Extract audio bytes from a request (supports audio_bytes or audio_bytes_list)."""
-    if request.audio_bytes:
+    if request.audio_bytes is not None:
         return request.audio_bytes
-    if request.audio_bytes_list:
-        if len(request.audio_bytes_list) > 1:
-            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
+    if request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0:
+        if len(request.audio_bytes_list) > 1:
+            raise ValueError("vllm_nemo_speechlm backend currently supports one audio per request.")
         return request.audio_bytes_list[0]
     raise ValueError("Request must contain audio_bytes or audio_bytes_list")

 def validate_request(self, request: GenerationRequest) -> Optional[str]:
-    """Validate request has audio input. Logs warning for ignored per-request overrides."""
-    has_audio = request.audio_bytes is not None or (
-        request.audio_bytes_list is not None and len(request.audio_bytes_list) > 0
-    )
-    if not has_audio:
+    """Validate request has exactly one audio input and no unsupported per-request overrides."""
+    has_audio_bytes = request.audio_bytes is not None
+    num_audio_list = len(request.audio_bytes_list) if request.audio_bytes_list is not None else 0
+    has_audio_list = num_audio_list > 0
+
+    if not has_audio_bytes and not has_audio_list:
         return "vllm_nemo_speechlm backend requires audio input"
-    ignored = {
+    if (1 if has_audio_bytes else 0) + (1 if has_audio_list else 0) > 1 or num_audio_list > 1:
+        return "vllm_nemo_speechlm backend currently supports one audio per request"
+
+    unsupported_prompt_fields = []
+    if request.text:
+        unsupported_prompt_fields.append("text")
+    if request.system_prompt:
+        unsupported_prompt_fields.append("system_prompt")
+    if request.user_prompt:
+        unsupported_prompt_fields.append("user_prompt")
+    if unsupported_prompt_fields:
+        return (
+            "vllm_nemo_speechlm backend does not accept per-request prompt fields: "
+            + ", ".join(unsupported_prompt_fields)
+        )
+
+    unsupported_sampling_fields = {
         "max_new_tokens": request.max_new_tokens,
         "temperature": request.temperature,
         "top_p": request.top_p,
         "top_k": request.top_k,
         "seed": request.seed,
     }
-    set_fields = [k for k, v in ignored.items() if v is not None]
+    set_fields = [k for k, v in unsupported_sampling_fields.items() if v is not None]
     if set_fields:
-        logger.warning("Ignoring per-request overrides (using backend defaults): %s", ", ".join(set_fields))
+        return (
+            "vllm_nemo_speechlm backend does not support per-request sampling overrides: "
+            + ", ".join(set_fields)
+        )
     return None
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 208 - 239, In validate_request, perform strict validation: check audio presence using explicit "is not None" for request.audio_bytes and request.audio_bytes_list, error out if neither is provided, and also error if request.audio_bytes_list contains more than one entry (move the multi-audio check from _get_request_audio into validate_request). Additionally, fail fast when any unsupported per-request sampling/prompt overrides are set (max_new_tokens, temperature, top_p, top_k, seed) by returning an error string instead of just logging a warning. Keep _get_request_audio simple (use is not None checks) and assume validate_request has already enforced single-audio constraints.
🧹 Nitpick comments (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py (1)
103-110: Remove VLLMNeMoSpeechLMConfig.from_dict() override and reuse base parsing.
This method duplicates BackendConfig.from_dict() behavior and can drift over time. Reusing the inherited implementation keeps config parsing consistent and simpler.
As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py` around lines 103 - 110, Remove the duplicated from_dict override on VLLMNeMoSpeechLMConfig so the class uses the inherited BackendConfig.from_dict implementation; locate the method VLLMNeMoSpeechLMConfig.from_dict and delete it (or replace its body with a single call to super().from_dict(d) if explicit forwarding is preferred) to avoid drift and ensure consistent config parsing.
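The suggested reuse works because a classmethod binds cls to the subclass it is called on. A minimal sketch — BackendConfig's real parsing may differ from this illustration:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class BackendConfig:
    # Illustrative base config; the real class has more fields.
    model: str = ""

    @classmethod
    def from_dict(cls, d: dict):
        # Reject unknown keys so config typos fail loudly.
        known = {f.name for f in fields(cls)}
        unknown = set(d) - known
        if unknown:
            raise ValueError(f"Unsupported config keys: {sorted(unknown)}")
        return cls(**d)

@dataclass
class VLLMNeMoSpeechLMConfig(BackendConfig):
    tokenizer: Optional[str] = None
    # No from_dict override needed: the inherited classmethod receives this
    # subclass as cls and sees its full field set via fields(cls).
```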
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 208-239: In validate_request, perform strict validation: check
audio presence using explicit "is not None" for request.audio_bytes and
request.audio_bytes_list, error out if neither is provided, and also error if
request.audio_bytes_list contains more than one entry (move the multi-audio
check from _get_request_audio into validate_request). Additionally, fail fast
when any unsupported per-request sampling/prompt overrides are set
(max_new_tokens, temperature, top_p, top_k, seed) by returning an error string
instead of just logging a warning. Keep _get_request_audio simple (use is not
None checks) and assume validate_request has already enforced single-audio
constraints.
---
Nitpick comments:
In `@recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py`:
- Around line 103-110: Remove the duplicated from_dict override on
VLLMNeMoSpeechLMConfig so the class uses the inherited BackendConfig.from_dict
implementation; locate the method VLLMNeMoSpeechLMConfig.from_dict and delete it
(or replace its body with a single call to super().from_dict(d) if explicit
forwarding is preferred) to avoid drift and ensure consistent config parsing.
📒 Files selected for processing (1)
recipes/multimodal/server/backends/vllm_nemo_speechlm_backend.py
Summary
Add a new vllm_asr backend to the unified inference server that performs ASR using vLLM with the nemotron_nano_asr multimodal plugin. This enables fast batched inference for speech LLM models (e.g. Nemotron-Nano-v3 + Canary-v2).
- VLLMASRBackend in recipes/multimodal/server/backends/vllm_asr_backend.py
- tensor_parallel_size for multi-GPU sharding
- ns generate --num_chunks N (recommended for MoE models)
Validated on the full Open ASR Leaderboard (8 datasets, ~82K samples). WER matches NeMo checkpoint evaluation within 0.1% across all datasets and batch sizes.
Test plan
Summary by CodeRabbit
New Features
Documentation