[TRTLLM-11319][feat] VisualGen public output API + bench timing decomposition #13635

Merged: zhenhuaw-me merged 9 commits into NVIDIA:main from zhenhuaw-me:update-api-step4 on May 8, 2026
Conversation

@zhenhuaw-me
Member

@zhenhuaw-me zhenhuaw-me commented Apr 30, 2026

Description

Reworks the public output story of tensorrt_llm.visual_gen so callers get a coherent, request-aware return type with measurable engine-side timing — and reworks the offline bench so its latency and engine generation metrics line up with what the serving path does.

Public output API

  • VisualGenOutput replaces the raw MediaOutput as the user-facing return type. Carries image/video/audio tensors plus request_id, error, frame_rate / audio_sample_rate, and engine-side VisualGenMetrics.
  • VisualGenResult is a single Future-like awaitable handle that resolves to VisualGenOutput (single prompt) or List[VisualGenOutput] (batch). Batched generate(List[str]) fans out into per-item outputs with per-item error semantics.
  • VisualGenOutput.save(path) is the single user-facing way to persist a generated tensor to disk (image PNG/JPG/WEBP, video MP4/AVI).
  • Public exceptions: dropped VisualGenError / VisualGenParamsError in favor of standard RuntimeError / ValueError / NotImplementedError. Aligns with vLLM/SGLang convention where diffusion engines raise standard exception types and callers catch by built-in type, not by package-specific wrappers.
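A minimal sketch of how a caller might handle the per-item error semantics of a batched result. `SketchOutput` is a hypothetical stand-in for `VisualGenOutput` that models only `request_id`, `error`, and `video`; the field types and the `partition_batch` helper are assumptions for illustration and are not part of this PR.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SketchOutput:
    """Hypothetical stand-in for VisualGenOutput (not the real class)."""
    request_id: int
    error: Optional[str] = None
    video: Optional[object] = None  # a tensor in the real type; opaque here


def partition_batch(
    outputs: List[SketchOutput],
) -> Tuple[List[SketchOutput], List[SketchOutput]]:
    """Split a batch into successes and failures by inspecting each item's error."""
    ok = [o for o in outputs if o.error is None]
    failed = [o for o in outputs if o.error is not None]
    return ok, failed


# One failing item does not invalidate its siblings in the batch.
batch = [
    SketchOutput(request_id=0, video="frames-0"),
    SketchOutput(request_id=1, error="out of memory"),
    SketchOutput(request_id=2, video="frames-2"),
]
ok, failed = partition_batch(batch)
```

Checking `error is None` rather than truthiness keeps an empty-string error from being mistaken for success.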

Internal types and contracts

  • Internal MediaOutput renamed to PipelineOutput and gains three CUDA-event timing phases (pre_denoise / denoise / post_denoise) plus an executor-measured generation field on the wire response.
  • Tightened batched-shape contract: image is always (B, H, W, C), video (B, T, H, W, C), audio (B, channels, T_audio). Asserts in split_visual_gen_output make a violating pipeline fail loudly rather than silently corrupt per-item outputs by indexing along the wrong dim. LTX-2's decode_audio no longer unconditionally .squeeze(0)s, restoring the contract for B=1.
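The batched-shape contract can be illustrated with a rank check that runs before any per-item split. This is a sketch under stated assumptions: `split_batched` and `EXPECTED_RANK` are hypothetical names, and the function works on plain shape tuples rather than tensors, so it is not the actual `split_visual_gen_output` implementation.

```python
from typing import List, Sequence, Tuple

# Expected rank per modality, taken from the contract described above:
# image (B, H, W, C), video (B, T, H, W, C), audio (B, channels, T_audio).
EXPECTED_RANK = {"image": 4, "video": 5, "audio": 3}


def split_batched(kind: str, shape: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return the per-item shapes, failing loudly when the rank is wrong."""
    expected = EXPECTED_RANK[kind]
    if len(shape) != expected:
        raise AssertionError(
            f"{kind} must have rank {expected} with a leading batch dim, "
            f"got shape {tuple(shape)}"
        )
    batch, *rest = shape
    return [tuple(rest) for _ in range(batch)]


video_items = split_batched("video", (2, 8, 64, 64, 3))

# An audio tensor whose batch dim was squeezed away for B=1 has rank 2,
# so it is rejected instead of being indexed along the channel dim.
try:
    split_batched("audio", (2, 16000))
    squeezed_rejected = False
except AssertionError:
    squeezed_rejected = True
```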

Module boundaries

  • Tensor-encoding code extracted from tensorrt_llm/serve/media_storage.py into a new tensorrt_llm/media/encoding.py module of free functions; MediaStorage keeps its file-storage role under serve/.
  • The eight video-generation route handlers in openai_server.py are extracted into a sibling mixin (tensorrt_llm/serve/openai_video_routes.py) so openai_server.py stays under the per-file line budget. Behavior is strictly preserved.

Engine reliability

  • Fixed a timeout-leak in DiffusionRemoteClient: previously, a request that timed out in aresult() could have its late-arriving response pinned in completed_responses for the process lifetime, leaking a full PipelineOutput (including video tensors) per timeout. New abandon_request_id flow uses an abandoned-id set so _store_response drops late responses and abandon_request_id pops cached ones; both orderings handled under the same async lock.
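The two race orderings can be sketched with a small stand-in class. `SketchClient` is hypothetical and models only the response cache, the abandoned-id set, and the shared async lock; the attribute name `_abandoned` and the choice to forget an abandoned id once its late response has been dropped are assumptions, not details confirmed by this PR.

```python
import asyncio
from typing import Any, Dict, Set


class SketchClient:
    """Hypothetical stand-in for the response cache in DiffusionRemoteClient."""

    def __init__(self) -> None:
        self.completed_responses: Dict[int, Any] = {}
        self._abandoned: Set[int] = set()
        self._lock = asyncio.Lock()

    async def _store_response(self, request_id: int, response: Any) -> None:
        async with self._lock:
            if request_id in self._abandoned:
                # The waiter already timed out: drop the late response
                # and forget the id so the set stays bounded (assumption).
                self._abandoned.discard(request_id)
                return
            self.completed_responses[request_id] = response

    async def abandon_request_id(self, request_id: int) -> None:
        async with self._lock:
            if request_id in self.completed_responses:
                # The response arrived first: pop the cached entry.
                del self.completed_responses[request_id]
            else:
                # The response has not arrived: mark it to be dropped later.
                self._abandoned.add(request_id)


async def demo() -> SketchClient:
    client = SketchClient()
    # Ordering A: the timeout fires first, then the response arrives late.
    await client.abandon_request_id(1)
    await client._store_response(1, "late video tensor")
    # Ordering B: the response is cached first, then the waiter gives up.
    await client._store_response(2, "cached video tensor")
    await client.abandon_request_id(2)
    return client


client = asyncio.run(demo())
```

In both orderings the cache ends up empty, which is the property the leak fix is after.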

Server endpoints

  • /v1/images/edits short-circuits to HTTP 501 NotImplementedError. Reason: no in-tree pipeline implements image editing today (Flux/Flux2 are gen-only; Wan/LTX-2 produce video). The handler used to dispatch through the generator and 500 on a downstream None check; 501 is the honest answer until an edit-capable pipeline lands. Route stays registered so re-enabling is a one-line change.

Benchmark (tensorrt_llm/bench/benchmark/visual_gen.py)

  • Per-run persisted media dir: _run_benchmark wraps each run in one tempfile.TemporaryDirectory and saves each result into it. Mirrors the server's persisted-write pattern (no per-request mkstemp / unlink overhead in the timing window).
  • ffmpeg-optional fallback via resolve_video_format("auto"), matching how the server treats output_format="auto". The bench no longer hard-fails on hosts without ffmpeg.
  • Timing units: dropped _ms suffix and store wall-clock seconds end-to-end. CudaPhaseTimer.fill() divides Event.elapsed_time by 1000 once at the boundary so every downstream type carries seconds.
  • Renamed the engine-side metric from pipeline (a class-name leak) to generation (describes what the timer measures, not how). Threaded through VisualGenMetrics, DiffusionResponse, server log lines, bench per-request and aggregate types.
  • VisualGenBenchmarkMetrics gains mean / median / std / min / max / percentiles_generation alongside the existing latency aggregates. The latency >= generation gap is the encode + persist + IPC overhead the bench measures around the engine, and the report exposes that headroom.
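A rough sketch of the unit handling and the latency-versus-generation gap described above. The sample numbers are made up for illustration, `summarize` is a hypothetical helper, and whether the bench reports a population or a sample standard deviation is not stated here, so the use of `pstdev` is an assumption.

```python
import statistics
from typing import Dict, List


def summarize(samples_s: List[float]) -> Dict[str, float]:
    """Aggregate per-request timings that are already expressed in seconds."""
    return {
        "mean": statistics.mean(samples_s),
        "median": statistics.median(samples_s),
        "std": statistics.pstdev(samples_s),  # population std (assumption)
        "min": min(samples_s),
        "max": max(samples_s),
    }


# CUDA events report elapsed time in milliseconds; convert once at the
# boundary so every downstream value carries seconds.
generation_ms = [1200.0, 1300.0, 1250.0]  # illustrative values only
generation_s = [ms / 1000.0 for ms in generation_ms]

# Wall-clock latency measured around the engine, already in seconds.
latency_s = [1.35, 1.45, 1.40]  # illustrative values only

# The gap is the encode + persist + IPC overhead around the engine.
overhead_s = [lat - gen for lat, gen in zip(latency_s, generation_s)]

gen_stats = summarize(generation_s)
overhead_stats = summarize(overhead_s)
```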

API stability

YAML coverage for the VisualGen public surface is explicitly deferred until the API exits prototype status (all classes are @set_api_status("prototype")).

Test Coverage

  • tests/unittest/visual_gen/test_output.py: VisualGenOutput dataclass, VisualGenMetrics, to_visual_gen_output / split_visual_gen_output success and error paths, batch fan-out, VisualGenOutput.save() routing, VisualGenResult sync result(), async aresult(), __await__, timeout-leak fix (5 new tests covering abandon_request_id's two race orderings via a minimal DiffusionRemoteClient stub).
  • tests/unittest/_torch/visual_gen/test_visual_gen_params.py — request-validation surface migrated to plain ValueError.
  • tests/unittest/_torch/visual_gen/test_trtllm_serve_endpoints.py — FastAPI mock tests; TestImageEdit reduced to a 501 assertion plus the existing pydantic-validation 400 test.
  • tests/unittest/media/test_encoding.py — PNG round-trip, video_to_bytes AVI bytes, resolve_video_format dispatch, ffmpeg fallback paths.
  • tests/integration/defs/examples/test_visual_gen.py — Wan T2V and Flux end-to-end public-output contract checks (single + batch + async), including error-path semantics. Field-name assertions updated for the timing rename.
  • tests/integration/defs/visual_gen/test_visual_gen_benchmark.py — bench JSON-key assertions extended with mean_generation.

Summary by CodeRabbit

  • New Features

    • Added performance metrics (pipeline latency, denoising timing) to generation outputs via VisualGenMetrics
    • Introduced VisualGenOutput.save() method for easy output persistence
    • Added OpenAI-compatible video generation endpoints
  • Improvements

    • Refactored output handling for cleaner result objects with error tracking
    • Consolidated media encoding functionality into dedicated utilities
  • Documentation

    • Updated all visual generation examples to use new output API

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@zhenhuaw-me zhenhuaw-me self-assigned this Apr 30, 2026
Comment thread examples/visual_gen/models/wan_t2v.py Outdated
Comment thread examples/visual_gen/visual_gen_ltx2.py Outdated
Comment thread tensorrt_llm/media/encoding.py
Comment thread tensorrt_llm/media/encoding.py Outdated
Comment thread tensorrt_llm/serve/media_storage.py Outdated
Comment thread tensorrt_llm/visual_gen/output.py Outdated
Comment thread tensorrt_llm/visual_gen/visual_gen.py Outdated
Comment thread tests/integration/defs/examples/test_visual_gen.py Outdated
Comment thread tests/integration/defs/examples/test_visual_gen.py Outdated
Comment thread tests/integration/defs/examples/test_visual_gen.py Outdated
@zhenhuaw-me zhenhuaw-me force-pushed the update-api-step4 branch 2 times, most recently from 3d639b8 to 96c319e Compare April 30, 2026 03:58
Comment thread tensorrt_llm/visual_gen/output.py Outdated
@zhenhuaw-me zhenhuaw-me marked this pull request as ready for review April 30, 2026 04:07
@zhenhuaw-me zhenhuaw-me requested review from a team as code owners April 30, 2026 04:07
@zhenhuaw-me
Member Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Walkthrough

The changes refactor the visual generation output architecture, replacing MediaOutput with a new internal PipelineOutput type enhanced with CUDA timing instrumentation via CudaPhaseTimer, introducing public VisualGenOutput and VisualGenMetrics types as the primary user-facing API, migrating media encoding logic from MediaStorage into a new dedicated tensorrt_llm.media.encoding module, updating all example scripts to use output.save(), extracting OpenAI video routes into a reusable _VideoRoutesMixin, and extending test coverage for the new output and encoding APIs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Example Output Persistence: examples/visual_gen/models/wan_t2v.py, examples/visual_gen/quickstart_example.py, examples/visual_gen/visual_gen_flux.py, examples/visual_gen/visual_gen_ltx2.py, examples/visual_gen/visual_gen_wan_i2v.py, examples/visual_gen/visual_gen_wan_t2v.py | Removed MediaStorage imports and replaced MediaStorage.save_*() calls with output.save(args.output_path) for all generated visual content (images and videos). |
| Public Output API: tensorrt_llm/visual_gen/output.py | Added VisualGenMetrics and VisualGenOutput dataclasses as public result containers with timing metadata and a save() method for persisting generated media. Replaced MediaOutput with these new types. |
| Internal Pipeline Output: tensorrt_llm/_torch/visual_gen/output.py | Introduced PipelineOutput internal dataclass replacing MediaOutput with frame rate and audio sample rate metadata. Added CudaPhaseTimer helper for measuring CUDA-event-based phase timings (pre-denoise, denoise, post-denoise). Added conversion utilities to_visual_gen_output() and split_visual_gen_output() for transforming responses into public output types. |
| Internal Output Exports: tensorrt_llm/_torch/visual_gen/__init__.py, tensorrt_llm/visual_gen/__init__.py | Updated module exports: replaced MediaOutput with PipelineOutput and VisualGenMetrics/VisualGenOutput re-exports in __all__ lists. |
| Main Package Exports: tensorrt_llm/__init__.py | Added VisualGenMetrics and VisualGenOutput to root-level exports for public API access. |
| Pipeline Implementations: tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py, tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py, tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py, tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py, tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py, tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py | Replaced MediaOutput return type with PipelineOutput wrapped in CudaPhaseTimer. Added CUDA phase markers (mark_pre_start, mark_denoise_start, mark_post_start, mark_end) around denoising and decoding stages to instrument per-phase latencies. Updated docstrings to reflect PipelineOutput. |
| Executor Response Type: tensorrt_llm/_torch/visual_gen/executor.py | Updated DiffusionResponse dataclass: changed output field from MediaOutput to PipelineOutput, added pipeline_ms field. Instrumented request processing with host-side wall-clock timing around pipeline.infer() using time.perf_counter(). |
| Main VisualGen API: tensorrt_llm/visual_gen/visual_gen.py | Updated VisualGenResult to be awaitable and resolve to VisualGenOutput or List[VisualGenOutput]. Changed VisualGen.generate() return type from MediaOutput to Union[VisualGenOutput, List[VisualGenOutput]]. Added input validation and batch-size tracking. Updated error semantics: single prompts raise on error, batch items return errors within output objects. |
| Media Encoding Module: tensorrt_llm/media/__init__.py, tensorrt_llm/media/encoding.py | Created new tensorrt_llm.media.encoding module with public tensor-based encoding functions: save_image(), image_to_bytes(), save_video(), video_to_bytes(). Supports both ffmpeg-based MP4 encoding and pure-Python MJPEG/AVI fallback. Includes optional audio muxing and format resolution logic. |
| MediaStorage Reduction: tensorrt_llm/serve/media_storage.py | Removed all media encoding/persistence methods (save_image, convert_image_to_bytes, save_video, convert_video_to_bytes, internal helpers, and ffmpeg handling). Left only documentation redirecting to new tensorrt_llm.media.encoding module. |
| OpenAI Server Video Routes: tensorrt_llm/serve/openai_server.py, tensorrt_llm/serve/openai_video_routes.py | Created _VideoRoutesMixin providing eight OpenAI-compatible video endpoints (sync/async generation, job listing/retrieval/deletion, content serving, metadata access). Refactored OpenAIServer to inherit from mixin. Switched image encoding from MediaStorage to tensorrt_llm.media.encoding. Added end-to-end timing and metrics extraction from VisualGenOutput.metrics. |
| Benchmark Metrics: tensorrt_llm/bench/benchmark/visual_gen.py, tensorrt_llm/bench/benchmark/visual_gen_utils.py | Added _save_to_tempfile() helper to include encoding latency in end-to-end measurements. Updated VisualGenRequestOutput with pipeline_ms and denoise_ms fields. Refactored result capture to extract metrics from result.metrics. |
| Test Files: tests/integration/defs/examples/test_visual_gen.py, tests/unittest/_torch/visual_gen/test_media_storage.py, tests/unittest/_torch/visual_gen/test_trtllm_serve_endpoints.py, tests/unittest/dynamo/test_imports.py, tests/unittest/media/test_encoding.py, tests/unittest/visual_gen/test_output.py | Deleted obsolete MediaStorage unit tests. Added comprehensive new tests for VisualGenOutput, VisualGenMetrics, PipelineOutput, media encoding functions, and VisualGenResult awaitability. Updated endpoint tests to mock VisualGenOutput instead of MediaOutput. Updated import validation test to expect PipelineOutput export. Updated integration tests to use output.save() and validate output metadata. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant VisualGen
    participant Executor
    participant Pipeline
    participant CudaPhaseTimer
    participant Encoder

    Client->>VisualGen: generate(inputs)
    VisualGen->>Executor: submit request (prompt, params)
    Executor->>Pipeline: create CudaPhaseTimer
    Pipeline->>CudaPhaseTimer: mark_pre_start()
    Pipeline->>Pipeline: initialization
    Pipeline->>CudaPhaseTimer: mark_denoise_start()
    Pipeline->>Pipeline: denoising loop
    Pipeline->>CudaPhaseTimer: mark_post_start()
    Pipeline->>Pipeline: decode video/image
    Pipeline->>CudaPhaseTimer: mark_end()
    CudaPhaseTimer->>Pipeline: fill(PipelineOutput)
    Pipeline->>Executor: return PipelineOutput with metrics
    Executor->>Executor: measure pipeline_ms (host-side)
    Executor->>VisualGen: DiffusionResponse(output=PipelineOutput, pipeline_ms)
    VisualGen->>VisualGen: convert to VisualGenOutput
    VisualGen->>Client: return VisualGenOutput
    Client->>Encoder: output.save(path)
    Encoder->>Encoder: encode image/video
    Encoder->>Encoder: write to file
    Encoder->>Client: return path

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.03% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately summarizes the main change: introducing a public VisualGenOutput API and benchmark timing decomposition for the visual generation feature. |
| Description check | ✅ Passed | PR description is comprehensive with clear sections on public API changes, internal types, module boundaries, engine reliability improvements, server endpoint changes, and benchmark updates. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (3)
tests/integration/defs/examples/test_visual_gen.py (1)

482-490: ⚡ Quick win

Assert audio_sample_rate on the two-stage LTX-2 path too.

The single-stage path checks that rate propagation worked, but this one doesn't. If two-stage output stops populating audio_sample_rate, output.save(...) can still pass by falling back to its default, so this regression would go uncaught.

Suggested test addition
         assert output.error is None
         assert output.video is not None
         assert output.frame_rate == LTX2_T2V_FRAME_RATE
+        assert output.audio_sample_rate is not None and output.audio_sample_rate > 0
         assert output.metrics is not None
         assert output.metrics.pipeline_ms > 0
         assert output.metrics.denoise_ms > 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/examples/test_visual_gen.py` around lines 482 - 490,
Add an assertion that the two-stage LTX-2 output populates audio_sample_rate:
after calling visual_gen.generate(...) and before output.save(...), assert that
output.audio_sample_rate equals the expected constant (e.g.
LTX2_T2V_AUDIO_SAMPLE_RATE) so that regressions where audio_sample_rate is
missing are caught; locate this in the same block that calls visual_gen.generate
and uses output.save to ensure parity with the single-stage test.
tests/unittest/media/test_encoding.py (2)

94-116: ⚡ Quick win

Make the AVI fallback tests deterministic.

These cases currently depend on the host having no ffmpeg. On builders where ffmpeg is present, save_video(..., format="avi") can take the ffmpeg path instead, so the test stops validating the pure-Python fallback it documents. Patch ffmpeg availability to False and reset the cached encoder in the test to pin the intended branch.

As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/media/test_encoding.py` around lines 94 - 116, Tests that
exercise the pure-Python AVI fallback must force the non-ffmpeg branch; in the
two tests calling video_to_bytes and save_video (symbols: video_to_bytes,
save_video, _dummy_video) use monkeypatch to set whatever ffmpeg-availability
flag/function your video module exposes to False (e.g., patch a function like
ffmpeg_available() to return False or set a module-level FFMPEG_AVAILABLE =
False), then clear/reset the module encoder cache before calling the functions
(remove or reinitialize the cached encoder object/attribute used by the
pure-Python path so the fallback is chosen, e.g., delattr(video_module,
"<cached_encoder_name>") or call the provided reset function) so the tests
deterministically exercise the pure-Python AVI encoder.

1-24: QA list updates look unnecessary here.

These additions stay in tests/unittest/, so I don't see a corresponding need for changes under tests/integration/test_lists/qa/ in this PR.

As per coding guidelines, "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/media/test_encoding.py` around lines 1 - 24, This PR only
modifies the unit test for the module tensorrt_llm.media.encoding (the test that
exercises image_to_bytes/save_image/video_to_bytes/resolve_video_format), so
explicitly state in the PR description or a commit message that QA list updates
under the integration QA lists are unnecessary for this change; update the PR
text to say "QA list updates unnecessary — change limited to unit tests" (or
similar), and remove or revert any unrelated changes made to the integration QA
list if present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/visual_gen/visual_gen_ltx2.py`:
- Line 416: The CLI help text for the --output_path argument is out of sync with
the actual save call (output.save(args.output_path)); find the
parser.add_argument call that defines '--output_path' and update its help string
to accurately list the formats output.save supports (replace the old ".gif/.png
video outputs" text with the correct supported file types/containers), so the
help matches the behavior of output.save and args.output_path.

In `@tensorrt_llm/_torch/visual_gen/output.py`:
- Around line 148-149: The current truthiness check on resp.error_msg allows
empty-string errors to be treated as success and lead to dereferencing
resp.output; change both checks (the one around VisualGenOutput creation and the
other at lines handling resp.output) to explicitly test for None (e.g., if
resp.error_msg is not None) so any non-None error_msg — including empty string —
short-circuits and returns VisualGenOutput(request_id=resp.request_id,
error=resp.error_msg) instead of accessing resp.output.

In `@tensorrt_llm/bench/benchmark/visual_gen_utils.py`:
- Around line 36-47: The new timing fields pipeline_ms and denoise_ms on the
visual generation result are not being propagated into the JSON/console outputs;
update calculate_metrics to aggregate these per-request fields (e.g.,
sum/count/min/max/percentiles as appropriate) alongside e2e_latency and then
modify build_visual_gen_result_dict to include the aggregated pipeline_ms and
denoise_ms values in the returned dict and any per-request entries; locate uses
of VisualGenOutput.metrics and the result dataclass fields (pipeline_ms,
denoise_ms, success, e2e_latency) to read the engine-side timings and ensure
they are serialized into the final metrics JSON/console output.

In `@tensorrt_llm/serve/openai_video_routes.py`:
- Around line 98-103: The except blocks currently map all non-ValueError
exceptions to the default (client) error response; update each generic except
Exception handler (the ones that log traceback and call
create_error_response(str(e))) to return an internal server error (500) instead
of the default 400 by invoking create_error_response with an explicit 500 status
(or using an existing create_internal_error_response helper if present).
Concretely, in the handlers around the create_error_response usages (e.g.,
inside the functions that catch ValueError and Exception, and at the other noted
sites) keep the ValueError branch as-is but change the generic Exception branch
to logger.error(traceback.format_exc()) followed by return
self.create_error_response(str(e), status=500) (or equivalent API) so
server-side failures are classified as InternalServerError. Ensure you apply the
same change in the other occurrences mentioned (the other except Exception
blocks).
- Around line 196-202: video_gen_tasks is never cleaned up and delete_video()
does not cancel running jobs; update the flow so tasks are removed and cancelled
when finished or deleted: when creating the task for _generate_video_background,
attach an add_done_callback that removes the entry from self.video_gen_tasks
(and logs exceptions from task.result()); modify delete_video() to look up the
task in self.video_gen_tasks, if present cancel it (task.cancel()), await or
suppress asyncio.CancelledError as appropriate, and then delete the key so a
deleted job cannot keep running or leak memory; apply the same pattern to other
places that spawn background generation tasks referenced in the file.
- Around line 195-217: Persist the VideoJob to VIDEO_STORE before scheduling the
background task: construct the VideoJob (VideoJob(...)) and call await
VIDEO_STORE.upsert(video_id, video_job) first, then create and assign the
background task to self.video_gen_tasks[video_id] which calls
self._generate_video_background(...); this ensures _generate_video_background
can always find the job by id (video_id) even if it completes quickly. Return
the JSONResponse after the upsert and task scheduling.

In `@tensorrt_llm/visual_gen/visual_gen.py`:
- Around line 599-609: The single-request timeout path currently sets
self._finished before populating self._resolved, so the first raise
VisualGenError is not reflected for subsequent aresult()/result() calls; change
both timeout branches (the block handling response is None and the similar block
at the other location) to first set self._resolved =
[VisualGenOutput(request_id=self.request_id, error="Generation timed out")] (or
a single VisualGenOutput for non-batch), then set self._finished = True, and
finally raise VisualGenError("Generation timed out") so that later calls return
the stored error state; update references in the response handling logic that
use self._batch_size, self._resolved, self._finished, VisualGenOutput, and
VisualGenError.

---

Nitpick comments:
In `@tests/integration/defs/examples/test_visual_gen.py`:
- Around line 482-490: Add an assertion that the two-stage LTX-2 output
populates audio_sample_rate: after calling visual_gen.generate(...) and before
output.save(...), assert that output.audio_sample_rate equals the expected
constant (e.g. LTX2_T2V_AUDIO_SAMPLE_RATE) so that regressions where
audio_sample_rate is missing are caught; locate this in the same block that
calls visual_gen.generate and uses output.save to ensure parity with the
single-stage test.

In `@tests/unittest/media/test_encoding.py`:
- Around line 94-116: Tests that exercise the pure-Python AVI fallback must
force the non-ffmpeg branch; in the two tests calling video_to_bytes and
save_video (symbols: video_to_bytes, save_video, _dummy_video) use monkeypatch
to set whatever ffmpeg-availability flag/function your video module exposes to
False (e.g., patch a function like ffmpeg_available() to return False or set a
module-level FFMPEG_AVAILABLE = False), then clear/reset the module encoder
cache before calling the functions (remove or reinitialize the cached encoder
object/attribute used by the pure-Python path so the fallback is chosen, e.g.,
delattr(video_module, "<cached_encoder_name>") or call the provided reset
function) so the tests deterministically exercise the pure-Python AVI encoder.
- Around line 1-24: This PR only modifies the unit test for the module
tensorrt_llm.media.encoding (the test that exercises
image_to_bytes/save_image/video_to_bytes/resolve_video_format), so explicitly
state in the PR description or a commit message that QA list updates under the
integration QA lists are unnecessary for this change; update the PR text to say
"QA list updates unnecessary — change limited to unit tests" (or similar), and
remove or revert any unrelated changes made to the integration QA list if
present.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34454c9b-615c-41d0-8b29-1ba1656ba7fb

📥 Commits

Reviewing files that changed from the base of the PR and between 37fc0e3 and 13a9b96.

📒 Files selected for processing (34)
  • examples/visual_gen/models/wan_t2v.py
  • examples/visual_gen/quickstart_example.py
  • examples/visual_gen/visual_gen_flux.py
  • examples/visual_gen/visual_gen_ltx2.py
  • examples/visual_gen/visual_gen_wan_i2v.py
  • examples/visual_gen/visual_gen_wan_t2v.py
  • tensorrt_llm/__init__.py
  • tensorrt_llm/_torch/visual_gen/__init__.py
  • tensorrt_llm/_torch/visual_gen/executor.py
  • tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py
  • tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.py
  • tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py
  • tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py
  • tensorrt_llm/_torch/visual_gen/output.py
  • tensorrt_llm/bench/benchmark/visual_gen.py
  • tensorrt_llm/bench/benchmark/visual_gen_utils.py
  • tensorrt_llm/media/__init__.py
  • tensorrt_llm/media/encoding.py
  • tensorrt_llm/serve/media_storage.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/openai_video_routes.py
  • tensorrt_llm/visual_gen/__init__.py
  • tensorrt_llm/visual_gen/output.py
  • tensorrt_llm/visual_gen/visual_gen.py
  • tests/integration/defs/examples/test_visual_gen.py
  • tests/unittest/_torch/visual_gen/test_media_storage.py
  • tests/unittest/_torch/visual_gen/test_trtllm_serve_endpoints.py
  • tests/unittest/dynamo/test_imports.py
  • tests/unittest/media/__init__.py
  • tests/unittest/media/test_encoding.py
  • tests/unittest/visual_gen/__init__.py
  • tests/unittest/visual_gen/test_output.py
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/visual_gen/test_media_storage.py

Comment thread examples/visual_gen/visual_gen_ltx2.py
Comment thread tensorrt_llm/_torch/visual_gen/output.py Outdated
Comment thread tensorrt_llm/bench/benchmark/visual_gen_utils.py Outdated
Comment thread tensorrt_llm/serve/openai_video_routes.py Outdated
Comment thread tensorrt_llm/serve/openai_video_routes.py Outdated
Comment thread tensorrt_llm/serve/openai_video_routes.py Outdated
Comment thread tensorrt_llm/visual_gen/visual_gen.py Outdated
zhenhuaw-me added a commit to zhenhuaw-me/TensorRT-LLM that referenced this pull request Apr 30, 2026
- DiffusionResponse.error_msg checks switch from truthy to is-not-None
  in to_visual_gen_output and split_visual_gen_output so an empty-string
  error message still routes through the failure branch instead of
  dereferencing resp.output.
- VisualGenResult.aresult timeout path now persists the resolved error
  state on self._resolved before flipping self._finished, so subsequent
  aresult/result calls replay the same VisualGenError instead of
  silently returning None via the fast path. Batch and single-prompt
  branches both route through _resolved_value.
- Video routes (sync gen, async gen, list, get-metadata, get-content,
  delete) now classify generic exceptions as InternalServerError/500
  instead of leaving them at the create_error_response default of 400,
  while ValueError stays at 400. test_sync_video_failure updated to
  match.
- openai_video_generation_async persists the queued VideoJob to
  VIDEO_STORE before scheduling the background task, closing the race
  where a fast-completing task could fail to find the job and leave it
  stuck at queued.
- Background video tasks now register a done callback that drops the
  entry from self.video_gen_tasks (with an is-identity guard) and logs
  unexpected exceptions, bounding memory growth. delete_video cancels
  any in-flight task before removing the file and store entry so a
  deleted job cannot recreate output afterwards.
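
The persist-before-schedule ordering and the done-callback cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the server code: `VIDEO_STORE` and `video_gen_tasks` are modelled here as plain dicts, and `_generate`, `_make_done_callback`, and `submit` are hypothetical names.

```python
import asyncio

# Hypothetical stand-ins for the server's job store and task registry.
VIDEO_STORE: dict = {}
video_gen_tasks: dict = {}


async def _generate(job_id: str) -> None:
    # The job must already be in the store when the task first runs.
    job = VIDEO_STORE[job_id]
    job["status"] = "completed"


def _make_done_callback(job_id: str):
    def _on_done(task: asyncio.Task) -> None:
        # Identity guard: only drop the entry if it still points at this
        # task, so a newer task registered under the same id is left alone.
        if video_gen_tasks.get(job_id) is task:
            del video_gen_tasks[job_id]
        # Log unexpected failures instead of letting them vanish.
        if not task.cancelled() and task.exception() is not None:
            print(f"video job {job_id} failed: {task.exception()!r}")

    return _on_done


async def submit(job_id: str) -> asyncio.Task:
    # 1. Persist the queued job first; 2. only then schedule the task.
    VIDEO_STORE[job_id] = {"status": "queued"}
    task = asyncio.create_task(_generate(job_id))
    video_gen_tasks[job_id] = task
    task.add_done_callback(_make_done_callback(job_id))
    return task


async def main() -> None:
    task = await submit("job-1")
    await task
    await asyncio.sleep(0)  # give the done callback a chance to run


asyncio.run(main())
```

Writing the store entry before `asyncio.create_task` means even a task that finishes immediately finds its job, and the callback keeps the registry from growing with finished tasks.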

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
@zhenhuaw-me
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46326 [ run ] triggered by Bot. Commit: a3cf840 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46326 [ run ] completed with state SUCCESS. Commit: a3cf840
/LLM/main/L0_MergeRequest_PR pipeline #36422 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Comment thread tensorrt_llm/serve/openai_server.py Outdated
Comment thread tensorrt_llm/serve/openai_server.py
Comment thread tensorrt_llm/bench/benchmark/visual_gen.py Outdated
Comment thread tensorrt_llm/visual_gen/visual_gen.py
Comment thread tensorrt_llm/_torch/visual_gen/output.py
Comment thread tensorrt_llm/serve/openai_server.py Outdated
Comment thread tensorrt_llm/visual_gen/output.py Outdated
Comment thread tensorrt_llm/visual_gen/output.py
@tensorrt-cicd
Collaborator

PR_Github #46925 [ run ] completed with state SUCCESS. Commit: 45f1b61
/LLM/main/L0_MergeRequest_PR pipeline #36931 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zhenhuaw-me added a commit to zhenhuaw-me/TensorRT-LLM that referenced this pull request May 7, 2026
Fix the duplicate-timeout bug in VisualGenResult.result() that QiJune
flagged: the outer future.result(timeout=timeout) could raise
concurrent.futures.TimeoutError before the inner aresult timeout branch
finished cleanup, leaking late-arriving responses into
completed_responses.

VisualGenResult.result()
- Drop outer timeout argument; only the inner aresult(timeout=timeout)
  enforces the wait. The inner branch already does abandon_request_id
  cleanup and resolves to an error output before returning.
- Add a comment naming the constraint so the asymmetry is not lost on
  future readers.

Verified
- 37 / 37 unit tests in tests/unittest/visual_gen/test_output.py pass,
  including the existing test_aresult_timeout_invokes_abandon_request_id
  pair that covers the cleanup contract.
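
A rough sketch of the single-timeout arrangement, assuming the synchronous wrapper hands the coroutine to a background event loop via `asyncio.run_coroutine_threadsafe`. The `aresult` and `result` functions here are simplified stand-ins, and the `abandoned` list only simulates the cleanup step.

```python
import asyncio
import threading

abandoned = []  # records which request ids were cleaned up


async def aresult(request_id: str, timeout: float) -> str:
    """Inner wait: owns the timeout and the cleanup that follows it."""
    try:
        # Stand-in for waiting on an engine response that never arrives.
        await asyncio.wait_for(asyncio.sleep(3600), timeout=timeout)
        return "ok"
    except asyncio.TimeoutError:
        await asyncio.sleep(0.05)  # simulate cleanup that takes some time
        abandoned.append(request_id)
        return f"error: request {request_id} timed out"


def result(loop: asyncio.AbstractEventLoop, request_id: str, timeout: float) -> str:
    future = asyncio.run_coroutine_threadsafe(aresult(request_id, timeout), loop)
    # Deliberately no timeout here: a second, outer timeout could fire
    # before the inner branch has finished its cleanup.
    return future.result()


loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()
outcome = result(loop, "req-1", timeout=0.05)
loop.call_soon_threadsafe(loop.stop)
thread.join()
loop.close()
```

With only the inner timeout in force, the caller always observes the resolved error output after cleanup has run, never a bare `concurrent.futures.TimeoutError` raced against it.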

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…ike handle, batch fan-out, encoding extraction

Replaces the raw MediaOutput return type with a request-aware public
wrapper VisualGenOutput carrying image/video/audio tensors plus
request_id, error, frame_rate/audio_sample_rate, and engine-side
VisualGenMetrics. Promotes VisualGenResult to a single Future-like
awaitable handle that resolves to VisualGenOutput (single prompt) or
List[VisualGenOutput] (batch). Fans out batched generate(List[str])
into per-item outputs with per-item error semantics. Adds
VisualGenOutput.save(path) as the single user-facing way to write
generated media to disk.

Extracts tensor-encoding code out of tensorrt_llm/serve/media_storage.py
into a new tensorrt_llm/media/encoding.py module of free functions,
while keeping MediaStorage in serve/ for its file-storage role.
Internally renames MediaOutput -> PipelineOutput with three CUDA-event
timing phases (pre_denoise_ms / denoise_ms / post_denoise_ms) plus an
executor-measured pipeline_ms on the wire response. Updates every
in-tree caller (examples, bench, serve, tests).

The eight video-generation route handlers in openai_server.py are
extracted into a sibling _VideoRoutesMixin
(tensorrt_llm/serve/openai_video_routes.py) so openai_server.py stays
under the per-file line budget. Behavior is strictly preserved: mixin
methods reach back through self for instance state.

API-stability YAML coverage for the VisualGen surface is explicitly
deferred until the public API exits prototype status.

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
- Drop VisualGenError and VisualGenParamsError in favor of standard
  ValueError / RuntimeError / NotImplementedError, matching vLLM/SGLang
  conventions. Removed exports from tensorrt_llm/__init__.py and
  tensorrt_llm/visual_gen/__init__.py. Save errors map to:
  errored output -> RuntimeError, missing frame_rate / no media ->
  ValueError, audio-only -> NotImplementedError. _validate_request
  raises plain ValueError. Tests updated to match.
- Fix timeout leak in DiffusionRemoteClient. The aresult timeout branch
  now schedules abandon_request_id on the executor's event loop. New
  abandon flow uses an _abandoned_request_ids set so _store_response
  drops late responses for abandoned ids and abandon_request_id pops
  responses that arrived between the await timeout and the abandon
  call; both orderings handled under the same async lock. Added five
  unit tests covering the abandon contract.
- Tighten PipelineOutput shape contract to always-batched: image
  (B, H, W, C), video (B, T, H, W, C), audio (B, channels, T_audio).
  split_visual_gen_output now asserts the leading dim for all three
  modalities. Removed the unconditional .squeeze(0) in
  decode_audio that produced shape-inconsistent audio for B=1 vs B>1.
- Image-edit endpoint /v1/images/edits now short-circuits to 501
  NotImplemented because no in-tree pipeline implements image edit
  (Flux/Flux2 are gen-only, Wan/LTX-2 produce video). Trimmed
  TestImageEdit happy-path tests to a single 501 assertion plus the
  pre-existing pydantic-validation 400 test.
- Replace lazy-import comment in visual_gen/output.py: the import-cycle
  reason is gone with VisualGenError, only the encoding-stack lazy
  rationale remains.
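
The abandon contract in the second bullet can be illustrated with a small sketch. `ResponseStore` is a hypothetical class; only the `_abandoned_request_ids` set, the `completed_responses` mapping, and the two method roles come from the commit message. Forgetting an abandoned id once its late response has been dropped is an assumption made here to keep the set bounded.

```python
import asyncio


class ResponseStore:
    """Hypothetical model of the abandon flow; both orderings share one lock."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.completed_responses = {}
        self._abandoned_request_ids = set()

    async def store_response(self, request_id: int, response: str) -> None:
        async with self._lock:
            if request_id in self._abandoned_request_ids:
                # Late response for an abandoned request: drop it.
                self._abandoned_request_ids.discard(request_id)
                return
            self.completed_responses[request_id] = response

    async def abandon_request_id(self, request_id: int) -> None:
        async with self._lock:
            if request_id in self.completed_responses:
                # The response arrived between the timeout and this call.
                del self.completed_responses[request_id]
                return
            self._abandoned_request_ids.add(request_id)


async def demo() -> ResponseStore:
    store = ResponseStore()
    # Ordering 1: abandon first, the response arrives late.
    await store.abandon_request_id(1)
    await store.store_response(1, "late")
    # Ordering 2: the response lands first, then the abandon call.
    await store.store_response(2, "early")
    await store.abandon_request_id(2)
    # A request that was never abandoned is unaffected.
    await store.store_response(3, "kept")
    return store


store = asyncio.run(demo())
```

Either ordering leaves nothing behind for the abandoned ids, which is the property the timeout-leak fix is after.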

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…seconds

Bench Option A: align the offline VisualGen bench with what the serving
path actually does inside its e2e timing window, and rework the timing
field names so they mean what they say.

Bench encoding/persist scope (chang-l review feedback #2)
- _run_benchmark wraps each run in a single tempfile.TemporaryDirectory
  and threads media_dir into both _run_sequential and _run_concurrent.
- The per-request save now goes to media_dir/{img|vid}_{idx}{ext}; no
  per-request mkstemp/unlink. The directory is wiped wholesale when the
  context exits, mirroring the server's persisted-write pattern.
- Video extension comes from resolve_video_format("auto") so the bench
  falls back to .avi (pure-Python) on hosts without ffmpeg, matching
  how the server handles output_format="auto" instead of hard-failing.
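
As an illustration of the per-run media directory and the "auto" extension choice, a minimal sketch follows. `pick_video_extension` is a hypothetical stand-in for `resolve_video_format("auto")`; using `ffmpeg` on `PATH` as the availability signal is an assumption, and the written bytes are placeholders rather than encoded media.

```python
import shutil
import tempfile
from pathlib import Path


def pick_video_extension() -> str:
    # Assumption: an ffmpeg binary on PATH means MP4 encoding is available;
    # otherwise fall back to AVI.
    return ".mp4" if shutil.which("ffmpeg") else ".avi"


def run_benchmark(num_requests: int):
    # One temporary directory per run; every request writes into it.
    with tempfile.TemporaryDirectory() as media_dir:
        ext = pick_video_extension()
        for idx in range(num_requests):
            path = Path(media_dir) / f"vid_{idx}{ext}"
            path.write_bytes(b"\x00")  # placeholder for the encoded output
        written = sorted(p.name for p in Path(media_dir).iterdir())
    # The directory and all files are removed when the context exits.
    return written, not Path(media_dir).exists()


names, cleaned = run_benchmark(3)
```

Cleaning up once per run keeps per-request temp-file churn out of the timed window.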

Timing-metric rename (drop _ms suffix, store seconds)
- VisualGenMetrics: pipeline_ms / pre_denoise_ms / denoise_ms /
  post_denoise_ms -> pipeline / pre_denoise / denoise / post_denoise.
  Wall-clock seconds throughout.
- PipelineOutput: same rename. CudaPhaseTimer.fill() divides
  Event.elapsed_time by 1000 once at the boundary so every downstream
  type carries seconds.
- DiffusionResponse.pipeline_ms -> pipeline; executor populates
  perf_counter() delta directly (no * 1000.0).
- VisualGenRequestOutput (bench): e2e_latency / pipeline_ms / denoise_ms
  -> latency / pipeline / denoise.
- VisualGenBenchmarkMetrics: *_e2e_latency_ms -> *_latency.
  calculate_metrics no longer rescales to ms.
- print_visual_gen_results shows "Latency (s)" with :.4f precision and
  drops the "E2E" qualifier.
- build_visual_gen_result_dict JSON keys: mean_e2e_latency_ms ->
  mean_latency, e2e_latencies -> latencies, etc.
- Server log lines (image-gen, sync video, async video): e2e_ms /
  pipeline_ms / denoise_ms -> latency / pipeline / denoise with :.3f
  seconds formatting.
- Tests in unittest/visual_gen/test_output.py and the integration
  tests under tests/integration/defs/{examples,visual_gen}/ updated to
  match. 192 / 192 unit tests passing.

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…gate to bench

Decouple the public timing metric name from the internal class/method
("pipeline.infer()") and surface the engine-side floor in the benchmark
report alongside the externally observed latency.

Rename
- VisualGenMetrics.pipeline -> generation. The new name describes what
  the timer measures from the user's perspective (time spent producing
  the output), not which code path we measured. Field-level docstrings
  added so per-field semantics sit next to each declaration; the class
  docstring keeps the cross-cutting CUDA-event methodology note.
- DiffusionResponse.pipeline -> generation; executor populator uses
  generation_start / generation locals.
- VisualGenRequestOutput.pipeline -> generation (bench per-request).
- to_visual_gen_output and split_visual_gen_output populate the renamed
  field.
- Server log lines (image-gen, sync video, async video):
  pipeline=... -> generation=...

Bench aggregate
- VisualGenBenchmarkMetrics gains
  mean_generation / median_generation / std_generation /
  min_generation / max_generation / percentiles_generation alongside
  the existing latency aggregates. The relationship
  latency >= generation should hold per request; the gap is the
  encode + persist + IPC overhead the bench measures around the
  engine, and the report exposes that headroom.
- calculate_metrics collects only non-zero generation samples so
  backends that don't supply per-phase metrics don't pull the mean
  toward zero and obscure the latency comparison.
- print_visual_gen_results adds a "Generation" section mirroring the
  "Latency" section.
- build_visual_gen_result_dict adds the corresponding JSON keys plus a
  per-request "generations" array.
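
A small sketch of the aggregation rule, under stated assumptions: the `summarize` helper and the sample numbers are made up for illustration, and population standard deviation is an arbitrary choice here, not necessarily what `calculate_metrics` uses.

```python
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    # Aggregate one metric; all values are in seconds.
    return {
        "count": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
    }


# (latency, generation) per request; 0.0 means the backend reported no
# engine-side timing for that request.
requests = [(1.25, 1.0), (1.75, 1.5), (0.5, 0.0)]

latencies = [lat for lat, _ in requests]
# Only non-zero generation samples take part in the aggregate.
generations = [gen for _, gen in requests if gen > 0.0]

latency_stats = summarize(latencies)
generation_stats = summarize(generations)

# Per-request headroom between the observed latency and the engine floor.
overheads = [lat - gen for lat, gen in requests if gen > 0.0]
```

Including the zero sample would drag the generation mean down and make the gap to latency look larger than it is.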

Tests
- tests/unittest/visual_gen/test_output.py: field-name assertions, kwarg
  arguments, attribute accesses, and the function name
  test_sub_phase_sum_bounded_by_pipeline -> ..._by_generation updated.
- tests/integration/defs/examples/test_visual_gen.py:
  output.metrics.pipeline -> output.metrics.generation.
- tests/integration/defs/visual_gen/test_visual_gen_benchmark.py:
  added mean_generation JSON-key assertions next to mean_latency.

192 / 192 unit tests passing.

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…elapsed_time

Without the sync, cudaEventElapsedTime raises cudaErrorNotReady whenever
the pipeline returns the output tensor while it is still resident on the
GPU — the common case at fill() time. Caught by the e2e Flux text-to-image
tests, which all returned 400 with "Both events must be completed before
calculating elapsed time." Syncing on _end also covers the three earlier
events because they record on the same default stream.

Updated the corresponding methodology docstrings on both the public
VisualGenMetrics and the internal CudaPhaseTimer to match the actual
behavior.

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
Round 3 of @JunyiXu-nv's review on PR NVIDIA#13635.

Drop empty MediaStorage stub
- Remove tensorrt_llm/serve/media_storage.py: encoding moved to
  tensorrt_llm/media/encoding.py earlier in this PR, leaving only an
  empty MediaStorage class. No production code referenced it.
- Drop the two regression tests in tests/unittest/visual_gen/test_output.py
  that pinned the class to "no encoding methods" — moot once the file is gone.

Switch route call sites to VisualGenOutput.save
- openai_video_routes.py (sync + background) and openai_server.py URL-mode
  image now call output.save(...) instead of save_video/save_image directly.
  This exercises the public API the PR added and keeps the audio/sample-rate
  default chain inside one place.

Fix stale test-list entry
- tests/integration/test_lists/test-db/l0_a10.yml referenced the deleted
  unittest/_torch/visual_gen/test_media_storage.py. Replace with the two
  test files this PR adds: unittest/visual_gen/test_output.py and
  unittest/media/test_encoding.py.

Test:
- tests/unittest/media/test_encoding.py + tests/unittest/visual_gen/test_output.py:
  51 / 51 passing.

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
@zhenhuaw-me
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47085 [ run ] triggered by Bot. Commit: e148156 Link to invocation

Collaborator

@QiJune QiJune left a comment

LGTM

@tensorrt-cicd
Collaborator

PR_Github #47085 [ run ] completed with state SUCCESS. Commit: e148156
/LLM/main/L0_MergeRequest_PR pipeline #37056 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…_server refactor

This PR's refactor moved save_image out of tensorrt_llm.serve.openai_server
into tensorrt_llm.media.encoding. The two TestImageGeneration regression
guards in test_trtllm_serve_endpoints.py were still patching
"tensorrt_llm.serve.openai_server.save_image", so unittest.mock.patch raised
AttributeError on entry, failing both
test_image_generation_b64_no_save_image_no_disk_write and
test_image_generation_b64_with_4d_batch_pipeline_output.

Patch the source module instead. This also tightens the regression guard:
it catches a future caller that pulls save_image in via output.save() (the
URL path), not just one re-importing it back into openai_server's namespace.
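
The failure mode and the fix can be reproduced with throwaway modules. In this sketch `demo_encoding` and `demo_server` are hypothetical stand-ins for the real modules, and `handle_b64_request` is a placeholder for the b64 image path.

```python
import sys
import types
from unittest import mock

# "demo_encoding" defines save_image; "demo_server" no longer has that name.
encoding = types.ModuleType("demo_encoding")
encoding.save_image = lambda path: f"wrote {path}"
server = types.ModuleType("demo_server")
sys.modules["demo_encoding"] = encoding
sys.modules["demo_server"] = server


def handle_b64_request() -> str:
    # The b64 response path is expected never to write to disk.
    return "b64-payload"


# Patching a name the target module does not have fails on entry.
try:
    with mock.patch("demo_server.save_image"):
        pass
    stale_target_error = None
except AttributeError as exc:
    stale_target_error = exc

# Patching the module that defines the function works, and the guard
# can assert that the b64 path never reached it.
with mock.patch("demo_encoding.save_image") as fake_save:
    handle_b64_request()
    fake_save.assert_not_called()
    calls = fake_save.call_count
```

One caveat: patching the defining module only intercepts callers that look the function up through that module at call time (for example via a lazy import); a caller that bound the name earlier with `from ... import` would still hold the original.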

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
@zhenhuaw-me
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47218 [ run ] triggered by Bot. Commit: 2499db9 Link to invocation

Collaborator

@brb-nv brb-nv left a comment

LGTM.

@tensorrt-cicd
Collaborator

PR_Github #47218 [ run ] completed with state SUCCESS. Commit: 2499db9
/LLM/main/L0_MergeRequest_PR pipeline #37174 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zhenhuaw-me
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #47311 [ run ] triggered by Bot. Commit: 2499db9 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47311 [ run ] completed with state SUCCESS. Commit: 2499db9
/LLM/main/L0_MergeRequest_PR pipeline #37252 completed with status: 'SUCCESS'

CI Report

Link to invocation

@zhenhuaw-me zhenhuaw-me merged commit 8e07cea into NVIDIA:main May 8, 2026
6 checks passed
@zhenhuaw-me zhenhuaw-me deleted the update-api-step4 branch May 8, 2026 08:17
JunyiXu-nv added a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request May 11, 2026
Port the batch-inference support from the original PR onto the new
media/encoding + openai_video_routes layout introduced by NVIDIA#13635:

- tensorrt_llm/media/encoding.py: add save_images() and save_videos()
  free functions plus a shared _resolve_batch_paths() helper. Both
  accept either a path prefix or an explicit List[str] of per-item paths.
- tensorrt_llm/serve/openai_video_routes.py: sync and async video
  endpoints now call save_videos(). The async background task records
  every saved path on VideoJob.output_paths; delete_video() removes all
  of them. The sync endpoint still returns only the first file (OpenAI
  Videos API has no multi-file response yet).
- tensorrt_llm/serve/openai_protocol.py: add VideoJob.output_paths for
  the multi-output case.
- tensorrt_llm/serve/visual_gen_utils.py: map VideoGenerationRequest.n
  to VisualGenParams.num_images_per_prompt (already done for image
  requests on main).
- tests: cover save_images/save_videos in tests/unittest/media/test_encoding.py
  and add n=2 batch tests for sync and async video endpoints.
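
The "path prefix or explicit list" behaviour could look something like the sketch below. The function name mirrors `_resolve_batch_paths` from the commit message, but the signature and the `{prefix}_{index}{ext}` naming scheme are assumptions made for illustration.

```python
from typing import List, Union


def resolve_batch_paths(paths: Union[str, List[str]], count: int, ext: str) -> List[str]:
    """Hypothetical helper: expand a prefix or validate an explicit list."""
    if isinstance(paths, str):
        # A single string is treated as a prefix shared by all items.
        return [f"{paths}_{i}{ext}" for i in range(count)]
    if len(paths) != count:
        raise ValueError(f"expected {count} paths for the batch, got {len(paths)}")
    return list(paths)


from_prefix = resolve_batch_paths("out/img", 2, ".png")
explicit = resolve_batch_paths(["a.png", "b.png"], 2, ".png")
```

Rejecting a length mismatch up front avoids silently dropping or overwriting items when the batch size and the supplied paths disagree.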

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
…position (NVIDIA#13635)

Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>