[TRTLLM-12429][tests] Add audio E2E test for nano v3 omni by 2ez4bz · Pull Request #13750 · NVIDIA/TensorRT-LLM

2ez4bz · 2026-05-05T05:52:25Z

Summary by CodeRabbit

New Features
- Audio Automatic Speech Recognition (ASR) evaluator for comprehensive speech-to-text model performance assessment
- Word Error Rate (WER) metrics with detailed per-sample accuracy tracking and normalized text comparisons
- Multimodal evaluation support enabling assessment of models with audio input handling
- Reference benchmarks and integration tests for speech recognition evaluation workflows

Description

This PR adds a new ASR evaluation harness and uses it with the
English test split of the VoxPopuli dataset for E2E testing
of Nano v3 Omni.

Test Coverage

See description.

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

venkywonka

🥇

moraxu

LGTM, assuming audio-specific methods from tensorrt_llm/evaluate/audio_asr.py are correct

2ez4bz · 2026-05-07T16:49:13Z

/bot run --disable-fail-fast

coderabbitai · 2026-05-07T16:49:18Z

📝 Walkthrough

This PR introduces audio ASR evaluation capability to TensorRT-LLM. It adds the AudioASREvaluator class that loads audio datasets, constructs multimodal prompts, runs async LLM generation, and computes corpus-level WER. Supporting changes generalize the test framework's hypothesis testing to use metric names instead of hardcoded accuracy labels, enabling VoxPopuli ASR evaluation alongside traditional accuracy tasks.

Changes

Audio ASR Evaluator Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Data Types `tensorrt_llm/evaluate/audio_asr.py` | `MultimodalASRSample` and `ASRWERSampleResult` NamedTuples, and `AudioASREvaluator` class signature with constructor and evaluation methods. |
| Sample Generation & Iteration `tensorrt_llm/evaluate/audio_asr.py` | Local HF dataset loading with parquet support, sample iteration with random seeding, audio media resolution, and transcript extraction from configurable columns. |
| Scoring & Evaluation Flow `tensorrt_llm/evaluate/audio_asr.py` | Async LLM generation, per-sample WER computation, corpus aggregation, worst-case logging, and execution timing instrumentation. |
| Multimodal Prompt Construction `tensorrt_llm/evaluate/audio_asr.py` | Model type resolution, audio data loading in multiple formats, ConversationMessage building, multimodal placeholder injection, system prompt inclusion, and chat-template rendering. |
| Helper Utilities `tensorrt_llm/evaluate/audio_asr.py` | Dataset-relative path resolution, HF audio materialization, text normalization, Levenshtein-based WER calculation, corpus aggregation, and worst-sample reporting. |
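
For readers unfamiliar with the metric, the Levenshtein-based WER and corpus aggregation summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in `audio_asr.py`: the function names (`normalize_text`, `word_edit_distance`, `corpus_wer`), the normalization rules, and the choice to report a percentage are all assumptions made for this example.

```python
import re
from typing import List, Sequence, Tuple


def normalize_text(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace.

    These normalization rules are assumed for illustration only.
    """
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    return cleaned.split()


def word_edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level Levenshtein distance computed with a rolling DP row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_wer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Total word errors over total reference words, as a percentage.

    Each pair is (reference transcript, model prediction).
    """
    total_errors = 0
    total_words = 0
    for reference, prediction in pairs:
        ref_words = normalize_text(reference)
        total_errors += word_edit_distance(ref_words, normalize_text(prediction))
        total_words += len(ref_words)
    return 100.0 * total_errors / total_words if total_words else 0.0
```

Aggregating errors and reference words across the whole corpus, rather than averaging per-sample WER, keeps short utterances from dominating the score; whether the PR weights samples exactly this way is not stated in the walkthrough.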

Test Framework & Integration

| Layer / File(s) | Summary |
| --- | --- |
| Package Exports `tensorrt_llm/evaluate/__init__.py` | `AudioASREvaluator` is imported and added to `__all__`. |
| Metric-Driven Testing `tests/integration/defs/accuracy/accuracy_core.py` | `HypothesisTestingParams` adds `metric_name` field; accuracy reporting is generalized to use dynamic metric labels; `AccuracyTask.METRIC_NAME = "accuracy"`; VoxPopuli is configured with `METRIC_NAME = "WER"` and switched to `AudioASREvaluator`. |
| Test Parameters & Methods `tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py` | `NANO_V3_OMNI_PARAMS` and `NANO_V3_OMNI_QUANTIZED_PARAMS` centralize quantization test matrix; `test_auto_dtype` is updated to use the parameterized list; new `test_voxpopuli_asr` method runs VoxPopuli with chunked prefill and disabled thinking. |
| Reference Baselines `tests/integration/defs/accuracy/references/voxpopuli.yaml` | Two NVIDIA Nemotron-3-Nano-Omni configurations with quantization settings and expected WER values. |
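
The metric-name generalization described in the table can be pictured with a small sketch. The classes below are hypothetical, simplified stand-ins: only the `METRIC_NAME` attribute and its two values come from the summary above, while the `report` method and its output format are invented for illustration and do not reflect the real `accuracy_core.py` API.

```python
class AccuracyTask:
    """Hypothetical, simplified stand-in for the framework's base task."""

    # Default label used when a task does not override it.
    METRIC_NAME = "accuracy"

    def report(self, measured: float, reference: float) -> str:
        # The label is read from the class attribute instead of being
        # hardcoded, so subclasses can report under a different metric name.
        return f"{self.METRIC_NAME}: measured={measured:.2f}, reference={reference:.2f}"


class VoxPopuli(AccuracyTask):
    """Hypothetical ASR task that reports word error rate."""

    METRIC_NAME = "WER"
```

With this shape, `VoxPopuli().report(5.0, 4.5)` would be labeled `WER` while other tasks keep the `accuracy` label without any per-task branching in the reporting code.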

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding audio E2E testing (via VoxPopuli ASR) for the Nano v3 Omni model. |
| Description check | ✅ Passed | The description covers the main purpose (ASR evaluation harness for E2E testing with VoxPopuli), but Test Coverage section lacks specific test details and references only 'See description'. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tensorrt_llm/evaluate/audio_asr.py (2)

452-452: ⚡ Quick win

Add strict=True to zip() for safety.

Per Python best practices (and ruff B905), using strict=True ensures that mismatched list lengths raise an error rather than silently truncating.

Suggested fix

-    for sample_id, prediction, reference in zip(sample_ids, predictions, references):
+    for sample_id, prediction, reference in zip(sample_ids, predictions, references, strict=True):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/evaluate/audio_asr.py` at line 452, The for-loop that iterates
with zip(sample_ids, predictions, references) should use zip(..., strict=True)
to make length mismatches raise an error; update the loop in audio_asr.py (the
for sample_id, prediction, reference in zip(...) line) to for sample_id,
prediction, reference in zip(sample_ids, predictions, references, strict=True)
so any differing lengths of sample_ids, predictions, or references fail fast
(ensure the project runs on Python 3.10+ or add an explicit length check if
older Python support is required).

305-305: ⚡ Quick win

Rename input to avoid shadowing the Python builtin.

The variable name input shadows Python's built-in function, which can cause confusion and potential bugs if the builtin is needed later.

Suggested fix

-        input = {"prompt": prompt}
+        request_input = {"prompt": prompt}
         multi_modal_data, _ = mm_data_tracker.retrieve_all_sync()
         if multi_modal_data:
-            input["multi_modal_data"] = multi_modal_data
-        return input
+            request_input["multi_modal_data"] = multi_modal_data
+        return request_input

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/evaluate/audio_asr.py` at line 305, The variable named input
shadows Python's builtin; rename it (for example to input_payload or
prompt_input) wherever it's defined and used in the audio ASR evaluation code
(the assignment currently written as input = {"prompt": prompt}) and update all
subsequent references in the same scope to the new name so the builtin is no
longer shadowed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py`:
- Around line 584-625: Add the new integration test test_voxpopuli_asr into the
llm_function_core.txt QA list so it runs as part of the same E2E accuracy suite:
open llm_function_core.txt and add an entry for the test using the same format
as other entries from test_llm_api_pytorch_multimodal.py (e.g.
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py::test_voxpopuli_asr);
keep the `@pytest.mark.skip_less_device_memory(80000)` decorator as-is so the test
is skipped on insufficient hardware.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1c3f0a7-c55a-4ee5-b134-0f004d04a916

📥 Commits

Reviewing files that changed from the base of the PR and between 2f79ebb and 66dd700.

📒 Files selected for processing (5)

tensorrt_llm/evaluate/__init__.py
tensorrt_llm/evaluate/audio_asr.py
tests/integration/defs/accuracy/accuracy_core.py
tests/integration/defs/accuracy/references/voxpopuli.yaml
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

tensorrt-cicd · 2026-05-07T16:56:02Z

PR_Github #47226 [ run ] triggered by Bot. Commit: 66dd700 Link to invocation

This PR adds a new ASR evaluation harness, which is used with the Vox Populi dataset's English test split for E2E testing against Nano v3 Omni. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

2ez4bz · 2026-05-07T19:43:37Z

/bot run

tensorrt-cicd · 2026-05-07T19:49:53Z

PR_Github #47242 [ run ] triggered by Bot. Commit: 4a9dac6 Link to invocation

tensorrt-cicd · 2026-05-07T23:41:52Z

PR_Github #47242 [ run ] completed with state SUCCESS. Commit: 4a9dac6
/LLM/main/L0_MergeRequest_PR pipeline #37192 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

2ez4bz · 2026-05-08T03:51:50Z

/bot run

tensorrt-cicd · 2026-05-08T03:57:59Z

PR_Github #47305 [ run ] triggered by Bot. Commit: 4a9dac6 Link to invocation

tensorrt-cicd · 2026-05-08T07:10:54Z

PR_Github #47305 [ run ] completed with state SUCCESS. Commit: 4a9dac6
/LLM/main/L0_MergeRequest_PR pipeline #37246 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned 2ez4bz May 5, 2026

venkywonka approved these changes May 5, 2026

moraxu approved these changes May 5, 2026

Comment thread tensorrt_llm/evaluate/audio_asr.py

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

2ez4bz commented May 6, 2026

2ez4bz force-pushed the nano-omni-audio-tests branch 2 times, most recently from b531204 to 66dd700 Compare May 7, 2026 16:44

2ez4bz marked this pull request as ready for review May 7, 2026 16:45

2ez4bz requested review from a team as code owners May 7, 2026 16:45

2ez4bz requested review from Wanli-Jiang and yechank-nvidia May 7, 2026 16:45

coderabbitai Bot reviewed May 7, 2026

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py Outdated

[TRTLLM-12429][tests] Add audio E2E test for nano v3 omni

4a9dac6

This PR adds a new ASR evaluation harness, which is used with the Vox Populi dataset's English test split for E2E testing against Nano v3 Omni. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

2ez4bz force-pushed the nano-omni-audio-tests branch from 66dd700 to 4a9dac6 Compare May 7, 2026 19:38

2ez4bz enabled auto-merge (squash) May 7, 2026 19:43

tburt-nv approved these changes May 8, 2026

2ez4bz merged commit c87f002 into NVIDIA:main May 8, 2026
7 checks passed

5 participants
