[TRTLLM-12429][tests] Add audio E2E test for nano v3 omni #13750

Merged
2ez4bz merged 1 commit into NVIDIA:main from 2ez4bz:nano-omni-audio-tests
May 8, 2026

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented May 5, 2026

Summary by CodeRabbit

  • New Features
    • Audio Automatic Speech Recognition (ASR) evaluator for comprehensive speech-to-text model performance assessment
    • Word Error Rate (WER) metrics with detailed per-sample accuracy tracking and normalized text comparisons
    • Multimodal evaluation support enabling assessment of models with audio input handling
    • Reference benchmarks and integration tests for speech recognition evaluation workflows

Description

This PR adds a new ASR evaluation harness and uses it with the English
test split of the VoxPopuli dataset for E2E testing against Nano v3 Omni.

Test Coverage

See description.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Collaborator

@venkywonka venkywonka left a comment

🥇

Collaborator

@moraxu moraxu left a comment

LGTM, assuming audio-specific methods from tensorrt_llm/evaluate/audio_asr.py are correct

Comment thread tensorrt_llm/evaluate/audio_asr.py
Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
Comment thread tensorrt_llm/evaluate/audio_asr.py
Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
Comment thread tensorrt_llm/evaluate/audio_asr.py
Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
Comment thread tensorrt_llm/evaluate/audio_asr.py
Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
@2ez4bz 2ez4bz force-pushed the nano-omni-audio-tests branch 2 times, most recently from b531204 to 66dd700 Compare May 7, 2026 16:44
@2ez4bz 2ez4bz marked this pull request as ready for review May 7, 2026 16:45
@2ez4bz 2ez4bz requested review from a team as code owners May 7, 2026 16:45
@2ez4bz
Collaborator Author

2ez4bz commented May 7, 2026

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Walkthrough

This PR introduces audio ASR evaluation capability to TensorRT-LLM. It adds the AudioASREvaluator class that loads audio datasets, constructs multimodal prompts, runs async LLM generation, and computes corpus-level WER. Supporting changes generalize the test framework's hypothesis testing to use metric names instead of hardcoded accuracy labels, enabling VoxPopuli ASR evaluation alongside traditional accuracy tasks.

Changes

Audio ASR Evaluator Implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Types | `tensorrt_llm/evaluate/audio_asr.py` | `MultimodalASRSample` and `ASRWERSampleResult` NamedTuples, and `AudioASREvaluator` class signature with constructor and evaluation methods. |
| Sample Generation & Iteration | `tensorrt_llm/evaluate/audio_asr.py` | Local HF dataset loading with parquet support, sample iteration with random seeding, audio media resolution, and transcript extraction from configurable columns. |
| Scoring & Evaluation Flow | `tensorrt_llm/evaluate/audio_asr.py` | Async LLM generation, per-sample WER computation, corpus aggregation, worst-case logging, and execution timing instrumentation. |
| Multimodal Prompt Construction | `tensorrt_llm/evaluate/audio_asr.py` | Model type resolution, audio data loading in multiple formats, `ConversationMessage` building, multimodal placeholder injection, system prompt inclusion, and chat-template rendering. |
| Helper Utilities | `tensorrt_llm/evaluate/audio_asr.py` | Dataset-relative path resolution, HF audio materialization, text normalization, Levenshtein-based WER calculation, corpus aggregation, and worst-sample reporting. |
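
For readers unfamiliar with the metric, the following is a minimal sketch of Levenshtein-based, corpus-level WER as described in the summary above. It is not the code from `audio_asr.py`: the function names (`normalize_text`, `word_edit_distance`, `corpus_wer`) and the normalization rules (lowercasing, stripping punctuation except apostrophes) are assumptions chosen for illustration, and the PR's actual normalizer may differ.

```python
import re
from typing import Iterable, List, Sequence, Tuple


def normalize_text(text: str) -> List[str]:
    """Hypothetical normalizer: lowercase, drop punctuation, split on whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_ref_words = 0
    for reference, prediction in pairs:
        ref_words = normalize_text(reference)
        hyp_words = normalize_text(prediction)
        total_errors += word_edit_distance(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("corpus contains no reference words")
    return total_errors / total_ref_words
```

Note that corpus-level aggregation weights each sample by its reference length, so it generally differs from the mean of per-sample WER values: a short utterance with one error moves the corpus score less than it moves its own per-sample score.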

Test Framework & Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Package Exports | `tensorrt_llm/evaluate/__init__.py` | `AudioASREvaluator` is imported and added to `__all__`. |
| Metric-Driven Testing | `tests/integration/defs/accuracy/accuracy_core.py` | `HypothesisTestingParams` adds `metric_name` field; accuracy reporting is generalized to use dynamic metric labels; `AccuracyTask.METRIC_NAME = "accuracy"`; `VoxPopuli` is configured with `METRIC_NAME = "WER"` and switched to `AudioASREvaluator`. |
| Test Parameters & Methods | `tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py` | `NANO_V3_OMNI_PARAMS` and `NANO_V3_OMNI_QUANTIZED_PARAMS` centralize quantization test matrix; `test_auto_dtype` is updated to use the parameterized list; new `test_voxpopuli_asr` method runs VoxPopuli with chunked prefill and disabled thinking. |
| Reference Baselines | `tests/integration/defs/accuracy/references/voxpopuli.yaml` | Two NVIDIA Nemotron-3-Nano-Omni configurations with quantization settings and expected WER values. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding audio E2E testing (via VoxPopuli ASR) for the Nano v3 Omni model. |
| Description check | ✅ Passed | The description covers the main purpose (ASR evaluation harness for E2E testing with VoxPopuli), but Test Coverage section lacks specific test details and references only 'See description'. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/evaluate/audio_asr.py (2)

452-452: ⚡ Quick win

Add strict=True to zip() for safety.

Per Python best practices (and ruff B905), using strict=True ensures that mismatched list lengths raise an error rather than silently truncating.

Suggested fix
-    for sample_id, prediction, reference in zip(sample_ids, predictions, references):
+    for sample_id, prediction, reference in zip(sample_ids, predictions, references, strict=True):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/evaluate/audio_asr.py` at line 452, The for-loop that iterates
with zip(sample_ids, predictions, references) should use zip(..., strict=True)
to make length mismatches raise an error; update the loop in audio_asr.py (the
for sample_id, prediction, reference in zip(...) line) to for sample_id,
prediction, reference in zip(sample_ids, predictions, references, strict=True)
so any differing lengths of sample_ids, predictions, or references fail fast
(ensure the project runs on Python 3.10+ or add an explicit length check if
older Python support is required).

305-305: ⚡ Quick win

Rename input to avoid shadowing the Python builtin.

The variable name input shadows Python's built-in function, which can cause confusion and potential bugs if the builtin is needed later.

Suggested fix
-        input = {"prompt": prompt}
+        request_input = {"prompt": prompt}
         multi_modal_data, _ = mm_data_tracker.retrieve_all_sync()
         if multi_modal_data:
-            input["multi_modal_data"] = multi_modal_data
-        return input
+            request_input["multi_modal_data"] = multi_modal_data
+        return request_input
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/evaluate/audio_asr.py` at line 305, The variable named input
shadows Python's builtin; rename it (for example to input_payload or
prompt_input) wherever it's defined and used in the audio ASR evaluation code
(the assignment currently written as input = {"prompt": prompt}) and update all
subsequent references in the same scope to the new name so the builtin is no
longer shadowed.
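
As a brief illustration of why the shadowing matters, the sketch below uses hypothetical helper names (`shadowed_call_fails`, `build_request`) that are not taken from the PR. It shows the failure mode and a renamed variant along the lines of the suggested fix.

```python
from typing import Any, Dict, Optional


def shadowed_call_fails() -> bool:
    """Show that a local named `input` hides the builtin inside this scope."""
    input = {"prompt": "x"}  # shadows builtins.input within this function
    try:
        input("Continue? ")  # the name now refers to a dict, not the builtin
    except TypeError:
        return True
    return False


def build_request(prompt: str,
                  multi_modal_data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper using a non-shadowing name, as the review suggests."""
    request_input: Dict[str, Any] = {"prompt": prompt}
    if multi_modal_data:
        request_input["multi_modal_data"] = multi_modal_data
    return request_input
```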
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py`:
- Around line 584-625: Add the new integration test test_voxpopuli_asr into the
llm_function_core.txt QA list so it runs as part of the same E2E accuracy suite:
open llm_function_core.txt and add an entry for the test using the same format
as other entries from test_llm_api_pytorch_multimodal.py (e.g.
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py::test_voxpopuli_asr);
keep the `@pytest.mark.skip_less_device_memory`(80000) decorator as-is so the test
is skipped on insufficient hardware.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1c3f0a7-c55a-4ee5-b134-0f004d04a916

📥 Commits

Reviewing files that changed from the base of the PR and between 2f79ebb and 66dd700.

📒 Files selected for processing (5)
  • tensorrt_llm/evaluate/__init__.py
  • tensorrt_llm/evaluate/audio_asr.py
  • tests/integration/defs/accuracy/accuracy_core.py
  • tests/integration/defs/accuracy/references/voxpopuli.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py Outdated
@tensorrt-cicd
Collaborator

PR_Github #47226 [ run ] triggered by Bot. Commit: 66dd700 Link to invocation

This PR adds a new ASR evaluation harness, which is used
with the Vox Populi dataset's English test split for E2E testing
against Nano v3 Omni.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the nano-omni-audio-tests branch from 66dd700 to 4a9dac6 Compare May 7, 2026 19:38
@2ez4bz
Collaborator Author

2ez4bz commented May 7, 2026

/bot run

@2ez4bz 2ez4bz enabled auto-merge (squash) May 7, 2026 19:43
@tensorrt-cicd
Collaborator

PR_Github #47242 [ run ] triggered by Bot. Commit: 4a9dac6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47242 [ run ] completed with state SUCCESS. Commit: 4a9dac6
/LLM/main/L0_MergeRequest_PR pipeline #37192 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented May 8, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47305 [ run ] triggered by Bot. Commit: 4a9dac6 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47305 [ run ] completed with state SUCCESS. Commit: 4a9dac6
/LLM/main/L0_MergeRequest_PR pipeline #37246 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit c87f002 into NVIDIA:main May 8, 2026
7 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026