[TRTLLM-12430][tests] Add video E2E test for nano v3 omni by 2ez4bz · Pull Request #13883 · NVIDIA/TensorRT-LLM

2ez4bz · 2026-05-08T04:53:01Z

Summary by CodeRabbit

Release Notes

New Features
- Audio automatic speech recognition (ASR) evaluation now available
- Video question-answering (QA) evaluation now available
- Evaluation framework enhanced to support custom metrics beyond accuracy
Tests
- Added reference benchmarks for ASR and video QA evaluations
- Expanded multimodal test support to evaluate multiple tasks sequentially in one test run

Description

This PR adds a new video MCQ (multiple-choice question) evaluation
harness and uses it to evaluate Nano v3 Omni on a subset of the short
videos in the Video-MME dataset.
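
For readers unfamiliar with MCQ scoring, here is a minimal sketch of how such a harness might turn free-form model responses into an accuracy number. It is not the code from this PR: the regex, `extract_choice`, and `mcq_accuracy` are hypothetical names, and the sketch assumes four options labelled A to D.

```python
import re
from typing import List, Optional

# Assumption: options are labelled A-D. The pattern matches a standalone
# capital letter, optionally wrapped in parentheses, e.g. "B" or "(B)".
_LETTER_RE = re.compile(r"\(?\b([A-D])\b\)?")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in the response, or None."""
    match = _LETTER_RE.search(response.strip())
    return match.group(1) if match else None


def mcq_accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the reference."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy over zero samples")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A pattern this simple will misread a response that begins with the article "A", so a real harness would likely constrain the prompt or use a stricter pattern.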

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-08T04:56:01Z

📝 Walkthrough

This PR introduces AudioASREvaluator and VideoMME evaluators for automatic speech recognition and video question-answering evaluation, generalizes the accuracy test framework to support multiple metrics (WER, accuracy) with configurable higher_is_better direction, and integrates both evaluators into the multimodal test suite alongside the existing MMMU task.
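
The direction flag matters because WER improves as it falls while accuracy improves as it rises. The sketch below illustrates a direction-aware pass/fail check. `MetricSpec` and `passes_threshold` are hypothetical names; only `metric_name` and `higher_is_better` come from the walkthrough, and the derivation of the threshold from the hypothesis-testing parameters is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical container mirroring the two fields named in the walkthrough."""
    metric_name: str = "accuracy"
    higher_is_better: bool = True


def passes_threshold(value: float, threshold: float, spec: MetricSpec) -> bool:
    """Compare a measured value to a threshold in the metric's good direction."""
    if spec.higher_is_better:
        return value >= threshold  # e.g. accuracy must not drop below the bar
    return value <= threshold      # e.g. WER must not rise above the bar


accuracy = MetricSpec("accuracy", higher_is_better=True)
wer = MetricSpec("WER", higher_is_better=False)
```

With these specs, an accuracy of 71.0 passes a threshold of 70.0, while a WER of 7.0 fails a threshold of 6.5.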

Changes

Audio ASR and Video QA Evaluators with Metric-Aware Testing

| Layer / File(s) | Summary |
|---|---|
| Module Exports `tensorrt_llm/evaluate/__init__.py` | AudioASREvaluator is imported from audio_asr module and added to `__all__` exports; copyright header updated to 2025–2026. |
| AudioASREvaluator Implementation `tensorrt_llm/evaluate/audio_asr.py` | New 515-line evaluator loads HuggingFace datasets with audio, normalizes transcripts, builds multimodal chat prompts, submits async requests to LLM, and computes corpus-level WER via token-level Levenshtein distance with per-sample auditing and worst-sample reporting. |
| VideoMME Video QA Evaluator `tests/integration/defs/accuracy/video_mme.py` | New 326-line evaluator loads video QA annotations with video paths and multiple-choice questions, builds multimodal prompts with cached video frames, extracts predicted letters via regex, and computes accuracy with mismatch logging. |
| Accuracy Framework Generalization `tests/integration/defs/accuracy/accuracy_core.py` | HypothesisTestingParams gains configurable `metric_name` and `higher_is_better` fields; AccuracyTask introduces `METRIC_NAME` class attribute; VoxPopuli and VideoMME tasks are added with WER and accuracy metrics respectively; hypothesis testing reports and assertions now reference the configured metric instead of hardcoded accuracy. |
| Test Integration and Multi-Task Loop `tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py` | test_auto_dtype is refactored to parameterize and iterate over task-spec tuples (task class, sampling params, evaluator kwargs), enabling sequential evaluation of MMMU, VoxPopuli, and VideoMME against the same initialized LLM; removes single-task MMMU-only flow. |
| Reference Baselines `tests/integration/defs/accuracy/references/videomme.yaml`, `tests/integration/defs/accuracy/references/voxpopuli.yaml` | New YAML files define guardrail baseline values for Video-MME (accuracy, 300 samples) and VoxPopuli (WER 6.2%, 1842 samples) across Nemotron model variants with quantization configurations. |
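
As an illustration of the "corpus-level WER via token-level Levenshtein distance" mentioned for the ASR evaluator, here is a minimal sketch. It is not the evaluator's code: `edit_distance` and `corpus_wer` are hypothetical names, tokenization is assumed to be plain whitespace splitting, and transcript normalization is left out.

```python
from typing import List, Sequence


def edit_distance(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_tokens) + 1))  # distances from an empty reference
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total edit errors divided by total reference tokens across the corpus."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_errors = 0
    total_ref_tokens = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_ref_tokens += len(ref_tokens)
    if total_ref_tokens == 0:
        raise ValueError("references contain no tokens")
    return total_errors / total_ref_tokens
```

Pooling errors and reference lengths over the whole corpus, rather than averaging per-sample WER, keeps short utterances from dominating the result.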

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description provides a brief explanation of the video MCQ evaluation harness but lacks details on what specifically changed and how it works. | Expand the description to clarify the implementation details, list the files added/modified, explain the VideoMME evaluator's role, and provide specific examples of test coverage changes. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly relates to the main change of adding video E2E tests for the Nano v3 Omni model, which aligns with the PR's objectives and file changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/integration/defs/accuracy/video_mme.py (1)
314-326: 💤 Low value

Duplicated _get_model_type function.

This function duplicates _get_model_context in tensorrt_llm/evaluate/audio_asr.py, except that the latter also returns model_dir. Consider extracting the shared logic into a utility in tensorrt_llm/evaluate/interface.py or another common module to avoid the duplication.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/accuracy/video_mme.py` around lines 314 - 326, The
function _get_model_type is duplicated (also present as _get_model_context in
tensorrt_llm/evaluate/audio_asr.py); extract the shared logic into a single
utility (e.g., add get_model_type or get_model_info in
tensorrt_llm/evaluate/interface.py) that accepts the LLM object, reads
config.json, and returns model_type (and optionally model_dir or both to match
_get_model_context). Update callers in
tests/integration/defs/accuracy/video_mme.py and
tensorrt_llm/evaluate/audio_asr.py to import and use the new shared function and
remove the duplicate local implementations.
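
One possible shape for such a shared helper is sketched below. This thread does not show how either function obtains the model directory from the LLM object, so the sketch takes the directory path directly. The name `read_model_info` and its signature are hypothetical, and the only assumption about `config.json` is that it has a top-level `model_type` key.

```python
import json
from pathlib import Path
from typing import Tuple, Union


def read_model_info(model_dir: Union[str, Path]) -> Tuple[str, Path]:
    """Read config.json under model_dir and return (model_type, model_dir).

    Hypothetical helper: callers that only need the model type can ignore
    the second element, which covers both duplicated functions.
    """
    model_dir = Path(model_dir)
    with open(model_dir / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    return config["model_type"], model_dir
```

Returning both values lets one function serve the caller that needs only the model type and the caller that also needs the directory.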

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/evaluate/audio_asr.py`:
- Line 452: The zip over sample_ids, predictions, references in the loop
(variables sample_id, prediction, reference) should use strict=True to avoid
silent truncation on mismatched lengths; update the for statement in the
function handling evaluation/iteration to call zip(sample_ids, predictions,
references, strict=True) so a ValueError is raised when lengths differ, matching
the other zip usage in this module.
- Line 305: The local assignment input = {"prompt": prompt} in
tensorrt_llm/evaluate/audio_asr.py shadows Python's built-in input; rename this
variable (for example to payload, input_payload, or prompt_input) and update all
subsequent references within the same scope to use the new name (preserving the
dictionary structure and keys so downstream code expecting {"prompt": prompt}
continues to work).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 01cf6dc8-be7e-40f4-a020-dee43b45133b

📥 Commits

Reviewing files that changed from the base of the PR and between 6e069b6 and f1790c5.

📒 Files selected for processing (7)

tensorrt_llm/evaluate/__init__.py
tensorrt_llm/evaluate/audio_asr.py
tests/integration/defs/accuracy/accuracy_core.py
tests/integration/defs/accuracy/references/videomme.yaml
tests/integration/defs/accuracy/references/voxpopuli.yaml
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
tests/integration/defs/accuracy/video_mme.py

2ez4bz · 2026-05-08T21:13:37Z

/bot run

tensorrt-cicd · 2026-05-08T21:19:12Z

PR_Github #47443 [ run ] triggered by Bot. Commit: a454f35 Link to invocation

tensorrt-cicd · 2026-05-08T22:55:41Z

PR_Github #47443 [ run ] completed with state FAILURE. Commit: a454f35
/LLM/main/L0_MergeRequest_PR pipeline #37366 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

2ez4bz · 2026-05-11T17:17:25Z

/bot run

tensorrt-cicd · 2026-05-11T17:23:34Z

PR_Github #47783 [ run ] triggered by Bot. Commit: 4343e3c Link to invocation

tensorrt-cicd · 2026-05-11T22:24:06Z

PR_Github #47783 [ run ] completed with state SUCCESS. Commit: 4343e3c
/LLM/main/L0_MergeRequest_PR pipeline #37675 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

2ez4bz · 2026-05-11T23:09:31Z

/bot run

tensorrt-cicd · 2026-05-11T23:15:14Z

PR_Github #47800 [ run ] triggered by Bot. Commit: 4343e3c Link to invocation

tensorrt-cicd · 2026-05-12T00:08:01Z

PR_Github #47800 [ run ] completed with state SUCCESS. Commit: 4343e3c
/LLM/main/L0_MergeRequest_PR pipeline #37691 completed with status: 'SUCCESS'

CI Report

Link to invocation

2ez4bz requested review from a team as code owners May 8, 2026 04:53

2ez4bz requested review from Wanli-Jiang and moraxu May 8, 2026 04:53

github-actions Bot assigned 2ez4bz May 8, 2026

coderabbitai Bot reviewed May 8, 2026

Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated

Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated

yechank-nvidia reviewed May 8, 2026

Comment thread tests/integration/defs/accuracy/accuracy_core.py

Comment thread tests/integration/defs/accuracy/accuracy_core.py

moraxu approved these changes May 8, 2026

xinhe-nv approved these changes May 8, 2026

tburt-nv approved these changes May 8, 2026

2ez4bz force-pushed the dev-omni-video-tests branch from f1790c5 to a454f35 Compare May 8, 2026 21:12

2ez4bz enabled auto-merge (squash) May 8, 2026 21:13

[TRTLLM-12430][tests] Add video E2E test for nano v3 omni

4343e3c

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

2ez4bz force-pushed the dev-omni-video-tests branch from a454f35 to 4343e3c Compare May 11, 2026 17:16

2ez4bz merged commit 61eac2e into NVIDIA:main May 12, 2026
6 checks passed
