[TRTLLM-12430][tests] Add video E2E test for nano v3 omni #13883

Merged
2ez4bz merged 1 commit into NVIDIA:main from 2ez4bz:dev-omni-video-tests
May 12, 2026

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented May 8, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Audio automatic speech recognition (ASR) evaluation now available
    • Video question-answering (QA) evaluation now available
    • Evaluation framework enhanced to support custom metrics beyond accuracy
  • Tests

    • Added reference benchmarks for ASR and video QA evaluations
    • Expanded multimodal test support to evaluate multiple tasks concurrently

Description

This PR adds a new video MCQ evaluation harness, which is used
with a subset of the Video-MME dataset's short videos against
Nano v3 Omni.
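
For readers unfamiliar with multiple-choice (MCQ) scoring, the following is a minimal sketch of how such a harness can turn free-form model responses into an accuracy figure. It is not the code added in this PR: the regex, the helper names `extract_letter` and `accuracy`, and the assumption of four options labeled A to D are all hypothetical.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: a standalone capital letter A-D, optionally in parentheses.
_LETTER_RE = re.compile(r"\(?\b([A-D])\b\)?")


def extract_letter(response: str) -> Optional[str]:
    """Return the first standalone option letter in the response, if any."""
    match = _LETTER_RE.search(response.strip())
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold answer."""
    if not answers:
        raise ValueError("accuracy is undefined for an empty answer list")
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    correct = sum(extract_letter(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A response with no recognizable letter counts as wrong rather than being dropped. Note that this simple pattern would also match the article "A" at the start of a sentence, so a real harness would likely need a stricter prompt or pattern.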

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz requested review from a team as code owners May 8, 2026 04:53
@2ez4bz 2ez4bz requested review from Wanli-Jiang and moraxu May 8, 2026 04:53
@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces AudioASREvaluator and VideoMME evaluators for automatic speech recognition and video question-answering evaluation, generalizes the accuracy test framework to support multiple metrics (WER, accuracy) with configurable higher_is_better direction, and integrates both evaluators into the multimodal test suite alongside the existing MMMU task.
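
To illustrate why a `higher_is_better` direction is needed once WER sits next to accuracy, here is a small sketch of a direction-aware pass/fail check. The field names `metric_name` and `higher_is_better` come from the walkthrough above; the `MetricSpec` class, its methods, and the numbers in the usage note are hypothetical, and the hypothesis-testing logic that derives the threshold in `accuracy_core.py` is not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical container for a metric's name and preferred direction."""

    metric_name: str = "accuracy"
    higher_is_better: bool = True

    def passes(self, observed: float, threshold: float) -> bool:
        # Accuracy-like metrics must reach the threshold from below;
        # error-like metrics such as WER must stay at or under it.
        if self.higher_is_better:
            return observed >= threshold
        return observed <= threshold

    def describe(self, observed: float, threshold: float) -> str:
        op = ">=" if self.higher_is_better else "<="
        return (f"{self.metric_name}: observed {observed:.2f}, "
                f"required {op} {threshold:.2f}")
```

With this shape, `MetricSpec("WER", higher_is_better=False).passes(5.9, 6.5)` succeeds while the same observed value would fail an accuracy-style check against the same threshold, which is the asymmetry a hardcoded "accuracy" comparison cannot express.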

Changes

Audio ASR and Video QA Evaluators with Metric-Aware Testing

| Layer | File(s) | Summary |
|---|---|---|
| Module Exports | `tensorrt_llm/evaluate/__init__.py` | AudioASREvaluator is imported from audio_asr module and added to __all__ exports; copyright header updated to 2025–2026. |
| AudioASREvaluator Implementation | `tensorrt_llm/evaluate/audio_asr.py` | New 515-line evaluator loads HuggingFace datasets with audio, normalizes transcripts, builds multimodal chat prompts, submits async requests to LLM, and computes corpus-level WER via token-level Levenshtein distance with per-sample auditing and worst-sample reporting. |
| VideoMME Video QA Evaluator | `tests/integration/defs/accuracy/video_mme.py` | New 326-line evaluator loads video QA annotations with video paths and multiple-choice questions, builds multimodal prompts with cached video frames, extracts predicted letters via regex, and computes accuracy with mismatch logging. |
| Accuracy Framework Generalization | `tests/integration/defs/accuracy/accuracy_core.py` | HypothesisTestingParams gains configurable metric_name and higher_is_better fields; AccuracyTask introduces METRIC_NAME class attribute; VoxPopuli and VideoMME tasks are added with WER and accuracy metrics respectively; hypothesis testing reports and assertions now reference the configured metric instead of hardcoded accuracy. |
| Test Integration and Multi-Task Loop | `tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py` | test_auto_dtype is refactored to parameterize and iterate over task-spec tuples (task class, sampling params, evaluator kwargs), enabling sequential evaluation of MMMU, VoxPopuli, and VideoMME against the same initialized LLM; removes single-task MMMU-only flow. |
| Reference Baselines | `tests/integration/defs/accuracy/references/videomme.yaml`, `tests/integration/defs/accuracy/references/voxpopuli.yaml` | New YAML files define guardrail baseline values for Video-MME (accuracy, 300 samples) and VoxPopuli (WER 6.2%, 1842 samples) across Nemotron model variants with quantization configurations. |
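
The AudioASREvaluator entry above mentions corpus-level WER computed from a token-level Levenshtein distance. The sketch below shows that metric in isolation; it is not the code in `audio_asr.py`. The function names are hypothetical, and lowercasing plus whitespace splitting stands in for whatever transcript normalization the evaluator actually applies.

```python
from typing import List, Sequence


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Total token errors divided by total reference tokens."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_tokens = ref.lower().split()  # simplified normalization (assumption)
        hyp_tokens = hyp.lower().split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_words += len(ref_tokens)
    if total_words == 0:
        raise ValueError("WER is undefined when the references contain no words")
    return total_errors / total_words
```

Summing errors and reference lengths over the whole corpus before dividing weights long utterances more heavily than averaging per-sample WER would, which is the usual reason to report the corpus-level figure.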

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description provides a brief explanation of the video MCQ evaluation harness but lacks details on what specifically changed and how it works. | Expand the description to clarify the implementation details, list the files added/modified, explain the VideoMME evaluator's role, and provide specific examples of test coverage changes. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly relates to the main change of adding video E2E tests for the Nano v3 Omni model, which aligns with the PR's objectives and file changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/integration/defs/accuracy/video_mme.py (1)

314-326: 💤 Low value

Duplicated _get_model_type function.

This function is identical to _get_model_context in tensorrt_llm/evaluate/audio_asr.py (except the latter also returns model_dir). Consider extracting this to a shared utility in tensorrt_llm/evaluate/interface.py or a common module to avoid duplication.
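
One possible shape for the shared utility suggested here is sketched below. It assumes the caller already knows the checkpoint directory; how that directory is obtained from the LLM object is not shown in this thread, so the helper takes a path instead. The name `get_model_info` is hypothetical.

```python
import json
from pathlib import Path
from typing import Tuple, Union


def get_model_info(model_dir: Union[str, Path]) -> Tuple[str, Path]:
    """Read config.json under model_dir and return (model_type, model_dir).

    Returning both values lets a caller that only needs model_type ignore
    the second element, while a caller that also needs the directory can
    keep it.
    """
    model_dir = Path(model_dir)
    with (model_dir / "config.json").open(encoding="utf-8") as f:
        config = json.load(f)
    model_type = config.get("model_type")
    if model_type is None:
        raise KeyError(f"'model_type' missing from {model_dir / 'config.json'}")
    return model_type, model_dir
```

A caller that only wants the type could write `model_type, _ = get_model_info(path)`, so both existing helpers could delegate to the same function.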

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/accuracy/video_mme.py` around lines 314 - 326, The
function _get_model_type is duplicated (also present as _get_model_context in
tensorrt_llm/evaluate/audio_asr.py); extract the shared logic into a single
utility (e.g., add get_model_type or get_model_info in
tensorrt_llm/evaluate/interface.py) that accepts the LLM object, reads
config.json, and returns model_type (and optionally model_dir or both to match
_get_model_context). Update callers in
tests/integration/defs/accuracy/video_mme.py and
tensorrt_llm/evaluate/audio_asr.py to import and use the new shared function and
remove the duplicate local implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/evaluate/audio_asr.py`:
- Line 452: The zip over sample_ids, predictions, references in the loop
(variables sample_id, prediction, reference) should use strict=True to avoid
silent truncation on mismatched lengths; update the for statement in the
function handling evaluation/iteration to call zip(sample_ids, predictions,
references, strict=True) so a ValueError is raised when lengths differ, matching
the other zip usage in this module.
- Line 305: The local assignment input = {"prompt": prompt} in
tensorrt_llm/evaluate/audio_asr.py shadows Python's built-in input; rename this
variable (for example to payload, input_payload, or prompt_input) and update all
subsequent references within the same scope to use the new name (preserving the
dictionary structure and keys so downstream code expecting {"prompt": prompt}
continues to work).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 01cf6dc8-be7e-40f4-a020-dee43b45133b

📥 Commits

Reviewing files that changed from the base of the PR and between 6e069b6 and f1790c5.

📒 Files selected for processing (7)
  • tensorrt_llm/evaluate/__init__.py
  • tensorrt_llm/evaluate/audio_asr.py
  • tests/integration/defs/accuracy/accuracy_core.py
  • tests/integration/defs/accuracy/references/videomme.yaml
  • tests/integration/defs/accuracy/references/voxpopuli.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
  • tests/integration/defs/accuracy/video_mme.py

Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
Comment thread tensorrt_llm/evaluate/audio_asr.py Outdated
Comment thread tests/integration/defs/accuracy/accuracy_core.py
Comment thread tests/integration/defs/accuracy/accuracy_core.py
@2ez4bz 2ez4bz force-pushed the dev-omni-video-tests branch from f1790c5 to a454f35 Compare May 8, 2026 21:12
@2ez4bz
Collaborator Author

2ez4bz commented May 8, 2026

/bot run

@2ez4bz 2ez4bz enabled auto-merge (squash) May 8, 2026 21:13
@tensorrt-cicd
Collaborator

PR_Github #47443 [ run ] triggered by Bot. Commit: a454f35 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47443 [ run ] completed with state FAILURE. Commit: a454f35
/LLM/main/L0_MergeRequest_PR pipeline #37366 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-omni-video-tests branch from a454f35 to 4343e3c Compare May 11, 2026 17:16
@2ez4bz
Collaborator Author

2ez4bz commented May 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47783 [ run ] triggered by Bot. Commit: 4343e3c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47783 [ run ] completed with state SUCCESS. Commit: 4343e3c
/LLM/main/L0_MergeRequest_PR pipeline #37675 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented May 11, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47800 [ run ] triggered by Bot. Commit: 4343e3c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47800 [ run ] completed with state SUCCESS. Commit: 4343e3c
/LLM/main/L0_MergeRequest_PR pipeline #37691 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit 61eac2e into NVIDIA:main May 12, 2026
6 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
This PR adds a new video MCQ evaluation harness, which is used
with a subset of the Video-MME dataset's short videos against
Nano v3 Omni.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
6 participants