[TRTLLM-11319][feat] VisualGen public output API + bench timing decomposition#13635
Conversation
3d639b8 to
96c319e
Compare
96c319e to
13a9b96
Compare
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughThe changes refactor the visual generation output architecture, replacing Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant VisualGen
participant Executor
participant Pipeline
participant CudaPhaseTimer
participant Encoder
Client->>VisualGen: generate(inputs)
VisualGen->>Executor: submit request (prompt, params)
Executor->>Pipeline: create CudaPhaseTimer
Pipeline->>CudaPhaseTimer: mark_pre_start()
Pipeline->>Pipeline: initialization
Pipeline->>CudaPhaseTimer: mark_denoise_start()
Pipeline->>Pipeline: denoising loop
Pipeline->>CudaPhaseTimer: mark_post_start()
Pipeline->>Pipeline: decode video/image
Pipeline->>CudaPhaseTimer: mark_end()
CudaPhaseTimer->>Pipeline: fill(PipelineOutput)
Pipeline->>Executor: return PipelineOutput with metrics
Executor->>Executor: measure pipeline_ms (host-side)
Executor->>VisualGen: DiffusionResponse(output=PipelineOutput, pipeline_ms)
VisualGen->>VisualGen: convert to VisualGenOutput
VisualGen->>Client: return VisualGenOutput
Client->>Encoder: output.save(path)
Encoder->>Encoder: encode image/video
Encoder->>Encoder: write to file
Encoder->>Client: return path
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (3)
tests/integration/defs/examples/test_visual_gen.py (1)
482-490: ⚡ Quick winAssert
audio_sample_rateon the two-stage LTX-2 path too.The single-stage path checks that rate propagation worked, but this one doesn't. If two-stage output stops populating
audio_sample_rate,output.save(...)can still pass by falling back to its default, so this regression would go uncaught.Suggested test addition
assert output.error is None assert output.video is not None assert output.frame_rate == LTX2_T2V_FRAME_RATE + assert output.audio_sample_rate is not None and output.audio_sample_rate > 0 assert output.metrics is not None assert output.metrics.pipeline_ms > 0 assert output.metrics.denoise_ms > 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/examples/test_visual_gen.py` around lines 482 - 490, Add an assertion that the two-stage LTX-2 output populates audio_sample_rate: after calling visual_gen.generate(...) and before output.save(...), assert that output.audio_sample_rate equals the expected constant (e.g. LTX2_T2V_AUDIO_SAMPLE_RATE) so that regressions where audio_sample_rate is missing are caught; locate this in the same block that calls visual_gen.generate and uses output.save to ensure parity with the single-stage test.tests/unittest/media/test_encoding.py (2)
94-116: ⚡ Quick winMake the AVI fallback tests deterministic.
These cases currently depend on the host having no ffmpeg. On builders where ffmpeg is present,
save_video(..., format="avi")can take the ffmpeg path instead, so the test stops validating the pure-Python fallback it documents. Patch ffmpeg availability toFalseand reset the cached encoder in the test to pin the intended branch.As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/media/test_encoding.py` around lines 94 - 116, Tests that exercise the pure-Python AVI fallback must force the non-ffmpeg branch; in the two tests calling video_to_bytes and save_video (symbols: video_to_bytes, save_video, _dummy_video) use monkeypatch to set whatever ffmpeg-availability flag/function your video module exposes to False (e.g., patch a function like ffmpeg_available() to return False or set a module-level FFMPEG_AVAILABLE = False), then clear/reset the module encoder cache before calling the functions (remove or reinitialize the cached encoder object/attribute used by the pure-Python path so the fallback is chosen, e.g., delattr(video_module, "<cached_encoder_name>") or call the provided reset function) so the tests deterministically exercise the pure-Python AVI encoder.
1-24: QA list updates look unnecessary here.These additions stay in
tests/unittest/, so I don't see a corresponding need for changes undertests/integration/test_lists/qa/in this PR.As per coding guidelines, "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/media/test_encoding.py` around lines 1 - 24, This PR only modifies the unit test for the module tensorrt_llm.media.encoding (the test that exercises image_to_bytes/save_image/video_to_bytes/resolve_video_format), so explicitly state in the PR description or a commit message that QA list updates under the integration QA lists are unnecessary for this change; update the PR text to say "QA list updates unnecessary — change limited to unit tests" (or similar), and remove or revert any unrelated changes made to the integration QA list if present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/visual_gen/visual_gen_ltx2.py`:
- Line 416: The CLI help text for the --output_path argument is out of sync with
the actual save call (output.save(args.output_path)); find the
parser.add_argument call that defines '--output_path' and update its help string
to accurately list the formats output.save supports (replace the old ".gif/.png
video outputs" text with the correct supported file types/containers), so the
help matches the behavior of output.save and args.output_path.
In `@tensorrt_llm/_torch/visual_gen/output.py`:
- Around line 148-149: The current truthiness check on resp.error_msg allows
empty-string errors to be treated as success and lead to dereferencing
resp.output; change both checks (the one around VisualGenOutput creation and the
other at lines handling resp.output) to explicitly test for None (e.g., if
resp.error_msg is not None) so any non-None error_msg — including empty string —
short-circuits and returns VisualGenOutput(request_id=resp.request_id,
error=resp.error_msg) instead of accessing resp.output.
In `@tensorrt_llm/bench/benchmark/visual_gen_utils.py`:
- Around line 36-47: The new timing fields pipeline_ms and denoise_ms on the
visual generation result are not being propagated into the JSON/console outputs;
update calculate_metrics to aggregate these per-request fields (e.g.,
sum/count/min/max/percentiles as appropriate) alongside e2e_latency and then
modify build_visual_gen_result_dict to include the aggregated pipeline_ms and
denoise_ms values in the returned dict and any per-request entries; locate uses
of VisualGenOutput.metrics and the result dataclass fields (pipeline_ms,
denoise_ms, success, e2e_latency) to read the engine-side timings and ensure
they are serialized into the final metrics JSON/console output.
In `@tensorrt_llm/serve/openai_video_routes.py`:
- Around line 98-103: The except blocks currently map all non-ValueError
exceptions to the default (client) error response; update each generic except
Exception handler (the ones that log traceback and call
create_error_response(str(e))) to return an internal server error (500) instead
of the default 400 by invoking create_error_response with an explicit 500 status
(or using an existing create_internal_error_response helper if present).
Concretely, in the handlers around the create_error_response usages (e.g.,
inside the functions that catch ValueError and Exception, and at the other noted
sites) keep the ValueError branch as-is but change the generic Exception branch
to logger.error(traceback.format_exc()) followed by return
self.create_error_response(str(e), status=500) (or equivalent API) so
server-side failures are classified as InternalServerError. Ensure you apply the
same change in the other occurrences mentioned (the other except Exception
blocks).
- Around line 196-202: video_gen_tasks is never cleaned up and delete_video()
does not cancel running jobs; update the flow so tasks are removed and cancelled
when finished or deleted: when creating the task for _generate_video_background,
attach an add_done_callback that removes the entry from self.video_gen_tasks
(and logs exceptions from task.result()); modify delete_video() to look up the
task in self.video_gen_tasks, if present cancel it (task.cancel()), await or
suppress asyncio.CancelledError as appropriate, and then delete the key so a
deleted job cannot keep running or leak memory; apply the same pattern to other
places that spawn background generation tasks referenced in the file.
- Around line 195-217: Persist the VideoJob to VIDEO_STORE before scheduling the
background task: construct the VideoJob (VideoJob(...)) and call await
VIDEO_STORE.upsert(video_id, video_job) first, then create and assign the
background task to self.video_gen_tasks[video_id] which calls
self._generate_video_background(...); this ensures _generate_video_background
can always find the job by id (video_id) even if it completes quickly. Return
the JSONResponse after the upsert and task scheduling.
In `@tensorrt_llm/visual_gen/visual_gen.py`:
- Around line 599-609: The single-request timeout path currently sets
self._finished before populating self._resolved, so the first raise
VisualGenError is not reflected for subsequent aresult()/result() calls; change
both timeout branches (the block handling response is None and the similar block
at the other location) to first set self._resolved =
[VisualGenOutput(request_id=self.request_id, error="Generation timed out")] (or
a single VisualGenOutput for non-batch), then set self._finished = True, and
finally raise VisualGenError("Generation timed out") so that later calls return
the stored error state; update references in the response handling logic that
use self._batch_size, self._resolved, self._finished, VisualGenOutput, and
VisualGenError.
---
Nitpick comments:
In `@tests/integration/defs/examples/test_visual_gen.py`:
- Around line 482-490: Add an assertion that the two-stage LTX-2 output
populates audio_sample_rate: after calling visual_gen.generate(...) and before
output.save(...), assert that output.audio_sample_rate equals the expected
constant (e.g. LTX2_T2V_AUDIO_SAMPLE_RATE) so that regressions where
audio_sample_rate is missing are caught; locate this in the same block that
calls visual_gen.generate and uses output.save to ensure parity with the
single-stage test.
In `@tests/unittest/media/test_encoding.py`:
- Around line 94-116: Tests that exercise the pure-Python AVI fallback must
force the non-ffmpeg branch; in the two tests calling video_to_bytes and
save_video (symbols: video_to_bytes, save_video, _dummy_video) use monkeypatch
to set whatever ffmpeg-availability flag/function your video module exposes to
False (e.g., patch a function like ffmpeg_available() to return False or set a
module-level FFMPEG_AVAILABLE = False), then clear/reset the module encoder
cache before calling the functions (remove or reinitialize the cached encoder
object/attribute used by the pure-Python path so the fallback is chosen, e.g.,
delattr(video_module, "<cached_encoder_name>") or call the provided reset
function) so the tests deterministically exercise the pure-Python AVI encoder.
- Around line 1-24: This PR only modifies the unit test for the module
tensorrt_llm.media.encoding (the test that exercises
image_to_bytes/save_image/video_to_bytes/resolve_video_format), so explicitly
state in the PR description or a commit message that QA list updates under the
integration QA lists are unnecessary for this change; update the PR text to say
"QA list updates unnecessary — change limited to unit tests" (or similar), and
remove or revert any unrelated changes made to the integration QA list if
present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 34454c9b-615c-41d0-8b29-1ba1656ba7fb
📒 Files selected for processing (34)
examples/visual_gen/models/wan_t2v.pyexamples/visual_gen/quickstart_example.pyexamples/visual_gen/visual_gen_flux.pyexamples/visual_gen/visual_gen_ltx2.pyexamples/visual_gen/visual_gen_wan_i2v.pyexamples/visual_gen/visual_gen_wan_t2v.pytensorrt_llm/__init__.pytensorrt_llm/_torch/visual_gen/__init__.pytensorrt_llm/_torch/visual_gen/executor.pytensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.pytensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.pytensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.pytensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2_two_stages.pytensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.pytensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.pytensorrt_llm/_torch/visual_gen/output.pytensorrt_llm/bench/benchmark/visual_gen.pytensorrt_llm/bench/benchmark/visual_gen_utils.pytensorrt_llm/media/__init__.pytensorrt_llm/media/encoding.pytensorrt_llm/serve/media_storage.pytensorrt_llm/serve/openai_server.pytensorrt_llm/serve/openai_video_routes.pytensorrt_llm/visual_gen/__init__.pytensorrt_llm/visual_gen/output.pytensorrt_llm/visual_gen/visual_gen.pytests/integration/defs/examples/test_visual_gen.pytests/unittest/_torch/visual_gen/test_media_storage.pytests/unittest/_torch/visual_gen/test_trtllm_serve_endpoints.pytests/unittest/dynamo/test_imports.pytests/unittest/media/__init__.pytests/unittest/media/test_encoding.pytests/unittest/visual_gen/__init__.pytests/unittest/visual_gen/test_output.py
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/visual_gen/test_media_storage.py
- DiffusionResponse.error_msg checks switch from truthy to is-not-None in to_visual_gen_output and split_visual_gen_output so an empty-string error message still routes through the failure branch instead of dereferencing resp.output. - VisualGenResult.aresult timeout path now persists the resolved error state on self._resolved before flipping self._finished, so subsequent aresult/result calls replay the same VisualGenError instead of silently returning None via the fast path. Batch and single-prompt branches both route through _resolved_value. - Video routes (sync gen, async gen, list, get-metadata, get-content, delete) now classify generic exceptions as InternalServerError/500 instead of leaving them at the create_error_response default of 400, while ValueError stays at 400. test_sync_video_failure updated to match. - openai_video_generation_async persists the queued VideoJob to VIDEO_STORE before scheduling the background task, closing the race where a fast-completing task could fail to find the job and leave it stuck at queued. - Background video tasks now register a done callback that drops the entry from self.video_gen_tasks (with an is-identity guard) and logs unexpected exceptions, bounding memory growth. delete_video cancels any in-flight task before removing the file and store entry so a deleted job cannot recreate output afterwards. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #46326 [ run ] triggered by Bot. Commit: |
|
PR_Github #46326 [ run ] completed with state
|
|
PR_Github #46925 [ run ] completed with state
|
Fix the duplicate-timeout bug in VisualGenResult.result() that QiJune flagged: the outer future.result(timeout=timeout) could raise concurrent.futures.TimeoutError before the inner aresult timeout branch finished cleanup, leaking late-arriving responses into completed_responses. VisualGenResult.result() - Drop outer timeout argument; only the inner aresult(timeout=timeout) enforces the wait. The inner branch already does abandon_request_id cleanup and resolves to an error output before returning. - Add a comment naming the constraint so the asymmetry is not lost on future readers. Verified - 37 / 37 unit tests in tests/unittest/visual_gen/test_output.py pass, including the existing test_aresult_timeout_invokes_abandon_request_id pair that covers the cleanup contract. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…ike handle, batch fan-out, encoding extraction Replaces the raw MediaOutput return type with a request-aware public wrapper VisualGenOutput carrying image/video/audio tensors plus request_id, error, frame_rate/audio_sample_rate, and engine-side VisualGenMetrics. Promotes VisualGenResult to a single Future-like awaitable handle that resolves to VisualGenOutput (single prompt) or List[VisualGenOutput] (batch). Fans out batched generate(List[str]) into per-item outputs with per-item error semantics. Adds VisualGenOutput.save(path) as the single user-facing way to write generated media to disk. Extracts tensor-encoding code out of tensorrt_llm/serve/media_storage.py into a new tensorrt_llm/media/encoding.py module of free functions, while keeping MediaStorage in serve/ for its file-storage role. Internally renames MediaOutput -> PipelineOutput with three CUDA-event timing phases (pre_denoise_ms / denoise_ms / post_denoise_ms) plus an executor-measured pipeline_ms on the wire response. Updates every in-tree caller (examples, bench, serve, tests). The eight video-generation route handlers in openai_server.py are extracted into a sibling _VideoRoutesMixin (tensorrt_llm/serve/openai_video_routes.py) so openai_server.py stays under the per-file line budget. Behavior is strictly preserved: mixin methods reach back through self for instance state. API-stability YAML coverage for the VisualGen surface is explicitly deferred until the public API exits prototype status. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
- DiffusionResponse.error_msg checks switch from truthy to is-not-None in to_visual_gen_output and split_visual_gen_output so an empty-string error message still routes through the failure branch instead of dereferencing resp.output. - VisualGenResult.aresult timeout path now persists the resolved error state on self._resolved before flipping self._finished, so subsequent aresult/result calls replay the same VisualGenError instead of silently returning None via the fast path. Batch and single-prompt branches both route through _resolved_value. - Video routes (sync gen, async gen, list, get-metadata, get-content, delete) now classify generic exceptions as InternalServerError/500 instead of leaving them at the create_error_response default of 400, while ValueError stays at 400. test_sync_video_failure updated to match. - openai_video_generation_async persists the queued VideoJob to VIDEO_STORE before scheduling the background task, closing the race where a fast-completing task could fail to find the job and leave it stuck at queued. - Background video tasks now register a done callback that drops the entry from self.video_gen_tasks (with an is-identity guard) and logs unexpected exceptions, bounding memory growth. delete_video cancels any in-flight task before removing the file and store entry so a deleted job cannot recreate output afterwards. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
- Drop VisualGenError and VisualGenParamsError in favor of standard ValueError / RuntimeError / NotImplementedError, matching vLLM/SGLang conventions. Removed exports from tensorrt_llm/__init__.py and tensorrt_llm/visual_gen/__init__.py. Save errors map to: errored output -> RuntimeError, missing frame_rate / no media -> ValueError, audio-only -> NotImplementedError. _validate_request raises plain ValueError. Tests updated to match. - Fix timeout leak in DiffusionRemoteClient. The aresult timeout branch now schedules abandon_request_id on the executor's event loop. New abandon flow uses an _abandoned_request_ids set so _store_response drops late responses for abandoned ids and abandon_request_id pops responses that arrived between the await timeout and the abandon call; both orderings handled under the same async lock. Added five unit tests covering the abandon contract. - Tighten PipelineOutput shape contract to always-batched: image (B, H, W, C), video (B, T, H, W, C), audio (B, channels, T_audio). split_visual_gen_output now asserts the leading dim for all three modalities. Removed the unconditional .squeeze(0) in decode_audio that produced shape-inconsistent audio for B=1 vs B>1. - Image-edit endpoint /v1/images/edits now short-circuits to 501 NotImplemented because no in-tree pipeline implements image edit (Flux/Flux2 are gen-only, Wan/LTX-2 produce video). Trimmed TestImageEdit happy-path tests to a single 501 assertion plus the pre-existing pydantic-validation 400 test. - Replace lazy-import comment in visual_gen/output.py: the import-cycle reason is gone with VisualGenError, only the encoding-stack lazy rationale remains. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…seconds Bench Option A: align the offline VisualGen bench with what the serving path actually does inside its e2e timing window, and rework the timing field names so they mean what they say. Bench encoding/persist scope (chang-l review feedback #2) - _run_benchmark wraps each run in a single tempfile.TemporaryDirectory and threads media_dir into both _run_sequential and _run_concurrent. - The per-request save now goes to media_dir/{img|vid}_{idx}{ext}; no per-request mkstemp/unlink. The directory is wiped wholesale when the context exits, mirroring the server's persisted-write pattern. - Video extension comes from resolve_video_format("auto") so the bench falls back to .avi (pure-Python) on hosts without ffmpeg, matching how the server handles output_format="auto" instead of hard-failing. Timing-metric rename (drop _ms suffix, store seconds) - VisualGenMetrics: pipeline_ms / pre_denoise_ms / denoise_ms / post_denoise_ms -> pipeline / pre_denoise / denoise / post_denoise. Wall-clock seconds throughout. - PipelineOutput: same rename. CudaPhaseTimer.fill() divides Event.elapsed_time by 1000 once at the boundary so every downstream type carries seconds. - DiffusionResponse.pipeline_ms -> pipeline; executor populates perf_counter() delta directly (no * 1000.0). - VisualGenRequestOutput (bench): e2e_latency / pipeline_ms / denoise_ms -> latency / pipeline / denoise. - VisualGenBenchmarkMetrics: *_e2e_latency_ms -> *_latency. calculate_metrics no longer rescales to ms. - print_visual_gen_results shows "Latency (s)" with :.4f precision and drops the "E2E" qualifier. - build_visual_gen_result_dict JSON keys: mean_e2e_latency_ms -> mean_latency, e2e_latencies -> latencies, etc. - Server log lines (image-gen, sync video, async video): e2e_ms / pipeline_ms / denoise_ms -> latency / pipeline / denoise with :.3f seconds formatting. - Tests in unittest/visual_gen/test_output.py and the integration tests under tests/integration/defs/{examples,visual_gen}/ updated to match. 192 / 192 unit tests passing. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…gate to bench
Decouple the public timing metric name from the internal class/method
("pipeline.infer()") and surface the engine-side floor in the benchmark
report alongside the externally observed latency.
Rename
- VisualGenMetrics.pipeline -> generation. The new name describes what
the timer measures from the user's perspective (time spent producing
the output), not which code path we measured. Field-level docstrings
added so per-field semantics sit next to each declaration; the class
docstring keeps the cross-cutting CUDA-event methodology note.
- DiffusionResponse.pipeline -> generation; executor populator uses
generation_start / generation locals.
- VisualGenRequestOutput.pipeline -> generation (bench per-request).
- to_visual_gen_output and split_visual_gen_output populate the renamed
field.
- Server log lines (image-gen, sync video, async video):
pipeline=... -> generation=...
Bench aggregate
- VisualGenBenchmarkMetrics gains
mean_generation / median_generation / std_generation /
min_generation / max_generation / percentiles_generation alongside
the existing latency aggregates. The relationship
latency >= generation should hold per request; the gap is the
encode + persist + IPC overhead the bench measures around the
engine, and the report exposes that headroom.
- calculate_metrics collects only non-zero generation samples so
backends that don't supply per-phase metrics don't pull the mean
toward zero and obscure the latency comparison.
- print_visual_gen_results adds a "Generation" section mirroring the
"Latency" section.
- build_visual_gen_result_dict adds the corresponding JSON keys plus a
per-request "generations" array.
Tests
- tests/unittest/visual_gen/test_output.py: field-name assertions, kwarg
arguments, attribute accesses, and the function name
test_sub_phase_sum_bounded_by_pipeline -> ..._by_generation updated.
- tests/integration/defs/examples/test_visual_gen.py:
output.metrics.pipeline -> output.metrics.generation.
- tests/integration/defs/visual_gen/test_visual_gen_benchmark.py:
added mean_generation JSON-key assertions next to mean_latency.
192 / 192 unit tests passing.
Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
…elapsed_time Without the sync, cudaEventElapsedTime raises cudaErrorNotReady whenever the pipeline returns the output tensor while it is still resident on the GPU — the common case at fill() time. Caught by the e2e Flux text-to-image tests, which all returned 400 with "Both events must be completed before calculating elapsed time." Syncing on _end also covers the three earlier events because they record on the same default stream. Updated the corresponding methodology docstrings on both the public VisualGenMetrics and the internal CudaPhaseTimer to match the actual behavior. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
Round 3 of @JunyiXu-nv's review on PR NVIDIA#13635. Drop empty MediaStorage stub - Remove tensorrt_llm/serve/media_storage.py: encoding moved to tensorrt_llm/media/encoding.py earlier in this PR, leaving only an empty MediaStorage class. No production code referenced it. - Drop the two regression tests in tests/unittest/visual_gen/test_output.py that pinned the class to "no encoding methods" — moot once the file is gone. Switch route call sites to VisualGenOutput.save - openai_video_routes.py (sync + background) and openai_server.py URL-mode image now call output.save(...) instead of save_video/save_image directly. This exercises the public API the PR added and keeps the audio/sample-rate default chain inside one place. Fix stale test-list entry - tests/integration/test_lists/test-db/l0_a10.yml referenced the deleted unittest/_torch/visual_gen/test_media_storage.py. Replace with the two test files this PR adds: unittest/visual_gen/test_output.py and unittest/media/test_encoding.py. Test: - tests/unittest/media/test_encoding.py + tests/unittest/visual_gen/test_output.py: 51 / 51 passing. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
Fix the duplicate-timeout bug in VisualGenResult.result() that QiJune flagged: the outer future.result(timeout=timeout) could raise concurrent.futures.TimeoutError before the inner aresult timeout branch finished cleanup, leaking late-arriving responses into completed_responses. VisualGenResult.result() - Drop outer timeout argument; only the inner aresult(timeout=timeout) enforces the wait. The inner branch already does abandon_request_id cleanup and resolves to an error output before returning. - Add a comment naming the constraint so the asymmetry is not lost on future readers. Verified - 37 / 37 unit tests in tests/unittest/visual_gen/test_output.py pass, including the existing test_aresult_timeout_invokes_abandon_request_id pair that covers the cleanup contract. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
44c14b0 to
e148156
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #47085 [ run ] triggered by Bot. Commit: |
|
PR_Github #47085 [ run ] completed with state
|
…_server refactor This PR's refactor moved save_image out of tensorrt_llm.serve.openai_server into tensorrt_llm.media.encoding. The two TestImageGeneration regression guards in test_trtllm_serve_endpoints.py were still patching "tensorrt_llm.serve.openai_server.save_image", so unittest.mock.patch raised AttributeError on entry, failing both test_image_generation_b64_no_save_image_no_disk_write and test_image_generation_b64_with_4d_batch_pipeline_output. Patch the source module instead. This also tightens the regression guard: it catches a future caller that pulls save_image in via output.save() (the URL path), not just one re-importing it back into openai_server's namespace. Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #47218 [ run ] triggered by Bot. Commit: |
|
PR_Github #47218 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47311 [ run ] triggered by Bot. Commit: |
|
PR_Github #47311 [ run ] completed with state |
Port the batch-inference support from the original PR onto the new media/encoding + openai_video_routes layout introduced by NVIDIA#13635: - tensorrt_llm/media/encoding.py: add save_images() and save_videos() free functions plus a shared _resolve_batch_paths() helper. Both accept either a path prefix or an explicit List[str] of per-item paths. - tensorrt_llm/serve/openai_video_routes.py: sync and async video endpoints now call save_videos(). The async background task records every saved path on VideoJob.output_paths; delete_video() removes all of them. The sync endpoint still returns only the first file (OpenAI Videos API has no multi-file response yet). - tensorrt_llm/serve/openai_protocol.py: add VideoJob.output_paths for the multi-output case. - tensorrt_llm/serve/visual_gen_utils.py: map VideoGenerationRequest.n to VisualGenParams.num_images_per_prompt (already done for image requests on main). - tests: cover save_images/save_videos in tests/unittest/media/test_encoding.py and add n=2 batch tests for sync and async video endpoints. Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Port the batch-inference support from the original PR onto the new media/encoding + openai_video_routes layout introduced by NVIDIA#13635: - tensorrt_llm/media/encoding.py: add save_images() and save_videos() free functions plus a shared _resolve_batch_paths() helper. Both accept either a path prefix or an explicit List[str] of per-item paths. - tensorrt_llm/serve/openai_video_routes.py: sync and async video endpoints now call save_videos(). The async background task records every saved path on VideoJob.output_paths; delete_video() removes all of them. The sync endpoint still returns only the first file (OpenAI Videos API has no multi-file response yet). - tensorrt_llm/serve/openai_protocol.py: add VideoJob.output_paths for the multi-output case. - tensorrt_llm/serve/visual_gen_utils.py: map VideoGenerationRequest.n to VisualGenParams.num_images_per_prompt (already done for image requests on main). - tests: cover save_images/save_videos in tests/unittest/media/test_encoding.py and add n=2 batch tests for sync and async video endpoints. Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Port the batch-inference support from the original PR onto the new media/encoding + openai_video_routes layout introduced by NVIDIA#13635: - tensorrt_llm/media/encoding.py: add save_images() and save_videos() free functions plus a shared _resolve_batch_paths() helper. Both accept either a path prefix or an explicit List[str] of per-item paths. - tensorrt_llm/serve/openai_video_routes.py: sync and async video endpoints now call save_videos(). The async background task records every saved path on VideoJob.output_paths; delete_video() removes all of them. The sync endpoint still returns only the first file (OpenAI Videos API has no multi-file response yet). - tensorrt_llm/serve/openai_protocol.py: add VideoJob.output_paths for the multi-output case. - tensorrt_llm/serve/visual_gen_utils.py: map VideoGenerationRequest.n to VisualGenParams.num_images_per_prompt (already done for image requests on main). - tests: cover save_images/save_videos in tests/unittest/media/test_encoding.py and add n=2 batch tests for sync and async video endpoints. Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…position (NVIDIA#13635) Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>
Description
Reworks the public output story of
tensorrt_llm.visual_genso callers get a coherent, request-aware return type with measurable engine-side timing — and reworks the offline bench so itslatencyand enginegenerationmetrics line up with what the serving path does.Public output API
VisualGenOutputreplaces the rawMediaOutputas the user-facing return type. Carries image/video/audio tensors plusrequest_id,error,frame_rate/audio_sample_rate, and engine-sideVisualGenMetrics.VisualGenResultis a single Future-like awaitable handle that resolves toVisualGenOutput(single prompt) orList[VisualGenOutput](batch). Batchedgenerate(List[str])fans out into per-item outputs with per-item error semantics.VisualGenOutput.save(path)is the single user-facing way to persist a generated tensor to disk (image PNG/JPG/WEBP, video MP4/AVI).VisualGenError/VisualGenParamsErrorin favor of standardRuntimeError/ValueError/NotImplementedError. Aligns with vLLM/SGLang convention where diffusion engines raise standard exception types and callers catch by built-in type, not by package-specific wrappers.Internal types and contracts
MediaOutputrenamed toPipelineOutputand gains three CUDA-event timing phases (pre_denoise/denoise/post_denoise) plus an executor-measuredgenerationfield on the wire response.(B, H, W, C), video(B, T, H, W, C), audio(B, channels, T_audio). Asserts insplit_visual_gen_outputmake a violating pipeline fail loudly rather than silently corrupt per-item outputs by indexing along the wrong dim. LTX-2'sdecode_audiono longer unconditionally.squeeze(0)s, restoring the contract forB=1.Module boundaries
tensorrt_llm/serve/media_storage.pyinto a newtensorrt_llm/media/encoding.pymodule of free functions;MediaStoragekeeps its file-storage role underserve/.openai_server.pyare extracted into a sibling mixin (tensorrt_llm/serve/openai_video_routes.py) soopenai_server.pystays under the per-file line budget. Behavior is strictly preserved.Engine reliability
DiffusionRemoteClient: previously, a request that timed out inaresult()could have its late-arriving response pinned incompleted_responsesfor the process lifetime, leaking a fullPipelineOutput(including video tensors) per timeout. Newabandon_request_idflow uses an abandoned-id set so_store_responsedrops late responses andabandon_request_idpops cached ones; both orderings handled under the same async lock.Server endpoints
/v1/images/editsshort-circuits to HTTP 501NotImplementedError. Reason: no in-tree pipeline implements image editing today (Flux/Flux2 are gen-only; Wan/LTX-2 produce video). The handler used to dispatch through the generator and 500 on a downstreamNonecheck; 501 is the honest answer until an edit-capable pipeline lands. Route stays registered so re-enabling is a one-line change.Benchmark (
tensorrt_llm/bench/benchmark/visual_gen.py)_run_benchmarkwraps each run in onetempfile.TemporaryDirectoryand saves each result into it. Mirrors the server's persisted-write pattern (no per-requestmkstemp/unlinkoverhead in the timing window).resolve_video_format("auto"), matching how the server treatsoutput_format="auto". The bench no longer hard-fails on hosts without ffmpeg._mssuffix and store wall-clock seconds end-to-end.CudaPhaseTimer.fill()dividesEvent.elapsed_timeby 1000 once at the boundary so every downstream type carries seconds.pipeline(a class-name leak) togeneration(describes what the timer measures, not how). Threaded throughVisualGenMetrics,DiffusionResponse, server log lines, bench per-request and aggregate types.VisualGenBenchmarkMetricsgainsmean / median / std / min / max / percentiles_generationalongside the existing latency aggregates. Thelatency >= generationgap is the encode + persist + IPC overhead the bench measures around the engine, and the report exposes that headroom.API stability
YAML coverage for the VisualGen public surface is explicitly deferred until the API exits prototype status (all classes are
@set_api_status("prototype")).Test Coverage
tests/unittest/visual_gen/test_output.py—VisualGenOutputdataclass,VisualGenMetrics,to_visual_gen_output/split_visual_gen_outputsuccess and error paths, batch fan-out,VisualGenOutput.save()routing,VisualGenResultsyncresult(), asyncaresult(),__await__, timeout-leak fix (5 new tests coveringabandon_request_id's two race orderings via a minimalDiffusionRemoteClientstub).tests/unittest/_torch/visual_gen/test_visual_gen_params.py— request-validation surface migrated to plainValueError.tests/unittest/_torch/visual_gen/test_trtllm_serve_endpoints.py— FastAPI mock tests;TestImageEditreduced to a 501 assertion plus the existing pydantic-validation 400 test.tests/unittest/media/test_encoding.py— PNG round-trip,video_to_bytesAVI bytes,resolve_video_formatdispatch, ffmpeg fallback paths.tests/integration/defs/examples/test_visual_gen.py— Wan T2V and Flux end-to-end public-output contract checks (single + batch + async), including error-path semantics. Field-name assertions updated for the timing rename.tests/integration/defs/visual_gen/test_visual_gen_benchmark.py— bench JSON-key assertions extended withmean_generation.Summary by CodeRabbit
New Features
VisualGenMetricsVisualGenOutput.save()method for easy output persistenceImprovements
Documentation
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.