perf(hy_worldplay): longer video bench + HY model card by wenqingw-nv · Pull Request #231 · NVIDIA/flashdreams

wenqingw-nv · 2026-05-30T11:19:28Z

Follow-up to #155 for #203: GB300 perf re-bench, the hy_worldplay model card, and two num_chunk≥4 fixes the larger GPU surfaced.

Changes

_action.py: unwrap the torch.compile OptimizedModule in prefill_memory_kv_cache; run memory-engaged chunks eager under CUDA graphs (the data-dependent memory torch.cat can't replay in a graph captured pre-memory). Both reachable only at num_chunk≥4.
bench.sh: default HY_VENDOR_SDPA=1 (vendor on cuDNN SDPA, matching native). run.sh: add torchvision to the vendor deps.
models/hy_worldplay.rst (LingBot-style) + perf-0530.md + native sample videos; registered in the toctree; clean under sphinx-build -W.

Bench

Native vs wan/generate.py: num_chunk=8 / pose=w-31 / seed=0 / 704×1280, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA + torch.compile, single GB300. Post-warmup medians (chunks 5–7):

| stage | native | vendor | speedup |
|---|---|---|---|
| DiT (diffuse) | 632 ms | 1206 ms | 1.91× |
| VAE decode | 383 ms | 372 ms | ~parity |
| DiT + VAE / chunk | 1015 ms | 1578 ms | 1.55× |
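As a quick sanity check on the table above, the sketch below recomputes the per-chunk totals and speedups from the reported medians. It assumes "speedup" means vendor time divided by native time; the names `MEDIANS_MS`, `speedup`, and `per_chunk_total` are illustrative and not part of the bench scripts.

```python
# Post-warmup medians in milliseconds, copied from the table above.
MEDIANS_MS = {
    "DiT (diffuse)": {"native": 632, "vendor": 1206},
    "VAE decode": {"native": 383, "vendor": 372},
}


def speedup(native_ms: float, vendor_ms: float) -> float:
    """Vendor time over native time; values above 1 mean native is faster."""
    return vendor_ms / native_ms


def per_chunk_total(leg: str) -> int:
    """Sum the DiT and VAE medians for one leg ('native' or 'vendor')."""
    return sum(stage[leg] for stage in MEDIANS_MS.values())


native_total = per_chunk_total("native")
vendor_total = per_chunk_total("vendor")
dit_speedup = round(speedup(632, 1206), 2)
total_speedup = round(speedup(native_total, vendor_total), 2)

print(native_total, vendor_total, dit_speedup, total_speedup)
```

The DiT + VAE row is the sum of the two stage medians, so the 1.55× figure follows directly from the stage rows.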

Per data_local/ input:

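| image | DiT native / vendor (ms) | VAE native / vendor (ms) | DiT+VAE ratio | mean `\|Δ\|` (/255) |
|---|---|---|---|---|
| `1.png` | 630 / 1208 | 382 / 374 | 1.56× | 40.8 |
| `2.png` | 632 / 1210 | 382 / 374 | 1.56× | 31.1 |
| `6.jpeg` | 628 / 1208 | 381 / 374 | 1.57× | 21.1 |
| `cat_surf.jpg` | 632 / 1207 | 381 / 375 | 1.56× | 51.8 |
| median | 631 / 1208 | 382 / 374 | 1.56× | — |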

Native ~1.91× on DiT, parity on VAE, input-independent (native DiT 628–632 ms across inputs). Parity |Δ| is cumulative bf16 AR drift (chunk-0 ≈14 matches #155's 12.91, ramps to ~49 by chunk-7, no jump at memory engagement); highest on the off-aspect cat_surf (625×350 upscaled).

Native vs vendor video pairs

| image | native | vendor |
|---|---|---|
| `1.png` | _{hy-worldplay-wan-i2v-5b.mp4} | _{hy-worldplay-wan-i2v-5b.mp4} |
| `2.png` | _{hy-worldplay-wan-i2v-5b.mp4} | _{hy-worldplay-wan-i2v-5b.mp4} |
| `6.jpeg` | _{hy-worldplay-wan-i2v-5b.mp4} | _{hy-worldplay-wan-i2v-5b.mp4} |
| `cat_surf.jpg` | _{hy-worldplay-wan-i2v-5b.mp4} | _{hy-worldplay-wan-i2v-5b.mp4} |

Transport branch (not a PR): carries HANDOFF.md + tasks.md so a fresh agent on the GB300 has the perf-MR steps, verified status of the four follow-up PRs (#222/#223/#224/#227), and the gotchas. Delete before any real PR off this branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Audit fixes: add #227 to the PR table; mark all four PRs green + auto-merge armed (pending review); reframe the two "known bugs" as fixed (#227 base-load, #224 DiT 404) since they no longer block; correct the "always --ckpt-path" rationale (you need distilled weights for real output, not "base path is broken"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

prefill_memory_kv_cache asserted isinstance(self.network, HyWorldPlayWanDiTNetwork), but with compile_network=True the parent Wan transformer reassigns self.network to a torch.compile OptimizedModule, so the assert fired the first time the memory-prefill path engaged. That path only runs once enough chunks accumulate (~num_chunk>=4), so it was never reachable on the 44 GiB card that capped the bench at num_chunk=2; the GB300 num_chunk=8 run surfaced it. Unwrap _orig_mod before narrowing the type. The prefill is a separate eager KV-fill pass; the diffuse forward stays on the compiled _network_call. CPU smoke (16 tests) green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
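The unwrap described above can be sketched without torch. `torch.compile` returns a wrapper that keeps the original module on `_orig_mod`, so a `getattr` with a default handles both the compiled and the uncompiled case. `Network` and `CompiledWrapper` below are hypothetical stand-ins for `HyWorldPlayWanDiTNetwork` and torch's `OptimizedModule`; the actual change in `_action.py` may differ.

```python
from typing import Any


def unwrap_compiled(module: Any) -> Any:
    """Return the wrapped module if `module` carries `_orig_mod`, else `module` itself."""
    return getattr(module, "_orig_mod", module)


class Network:
    """Hypothetical stand-in for HyWorldPlayWanDiTNetwork."""


class CompiledWrapper:
    """Hypothetical stand-in for the wrapper that torch.compile returns."""

    def __init__(self, orig_mod: Any) -> None:
        self._orig_mod = orig_mod


plain = Network()
wrapped = CompiledWrapper(plain)

# A direct isinstance check on the wrapper fails, which is the assert that fired.
direct_check = isinstance(wrapped, Network)

# Unwrapping first lets the type narrowing succeed in both cases.
unwrapped_check = isinstance(unwrap_compiled(wrapped), Network)
plain_passthrough = unwrap_compiled(plain) is plain

print(direct_check, unwrapped_check, plain_passthrough)
```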

Two fixes that surfaced while provisioning and running the matched num_chunk=8 bench on a GB300:
- bench.sh now defaults HY_VENDOR_SDPA=1, so the vendor leg is forced onto cuDNN F.scaled_dot_product_attention (the kernel the native leg uses) instead of its as-shipped sageattention INT8/FP8 path. This is the apples-to-apples attention-backend match the perf reviewers required; set HY_VENDOR_SDPA=0 to bench vendor as-shipped.
- run.sh adds torchvision==0.26.* to the vendor heavy-deps install. Upstream HEAD's hyvideo HunyuanVideo-1.5 pipeline import now pulls in torchvision; pinned to match the torch==2.11.* sub-venv. (run.sh's uv sync prunes anything not in pyproject, so it must be reinstalled after the sync, alongside the other vendor-only deps.)
- uv.lock: re-lock picking up flashdreams dropping torchvision + lowering the torch floor to >=2.9 (the reason the vendor dep above is now needed explicitly).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the model-card chart data and a work-in-progress model-card page draft from the matched re-bench on a single GB300 (num_chunk=8, 704x1280, seed 0, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA + torch.compile):
- perf-0530.md: per-chunk steady-state DiT+VAE-decode total -- official 1578 ms vs flashdreams 1015 ms (~1.55x). DiT alone is 632 vs 1206 ms (1.91x); VAE decode is at parity. The flashdreams number is conservative -- its CUDA-graph fast path is disabled in this run (a num_chunk>=4 graph/memory-prefill bug, tracked separately), so production is faster still.
- hy_worldplay.rst.draft: page mirrors lingbot_world.rst. Kept as a .draft (not picked up by sphinx) until the curated sample MP4s are generated and uploaded; the perf footnote + chart wiring are final.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Marks the num_chunk=8 perf re-bench DONE, records the matched medians (native DiT 1.91x, VAE parity, DiT+VAE/chunk 1.55x), the expectation-reversal (DiT wins, not VAE), the two num_chunk>=4 bugs (compiled-network isinstance fixed; CUDA-graph/prefill illegal access open), and the long-rollout parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

With use_cuda_graph=True (production default), the rollout crashed with cudaErrorIllegalAddress at the first reconstituted-context chunk (num_chunk>=4). The per-block attention prepends the prefilled memory K/V via a data-dependent torch.cat (hy_worldplay._camera), lengthening the attention sequence once memory engages. A graph captured on the pre-memory path (shorter sequence, no cat) cannot be replayed against the longer post-memory sequence.

Flag memory-engaged steps in predict_flow and override _select_network to route them onto the wrapper's eager drain path; pre-memory chunks keep the base filling/steady CUDA-graph dispatch. Verified num_chunk=8 end-to-end with graphs ON: 8 chunks, no fault, valid 125-frame mp4.

The steady-state (post-warmup) chunks are memory-engaged and therefore already run eager, so the reported steady-state medians are unchanged by this fix and are not "conservative" -- CUDA graphs only accelerate the discarded warmup chunks. Corrected the model-card perf footnote accordingly. (Graph-accelerating the memory-engaged steady state would need fixed-size in-place memory KV buffers; tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
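The routing rule in this commit can be sketched as a small dispatcher. This is a torch-free illustration only: `StepState`, `select_path`, and `FIRST_MEMORY_CHUNK` are hypothetical names, and the exact chunk at which memory engages is an assumption made for the example, not taken from the code.

```python
from dataclasses import dataclass

# Assumption for illustration only: memory engages from this 0-based chunk index.
FIRST_MEMORY_CHUNK = 3


@dataclass
class StepState:
    chunk_index: int
    memory_engaged: bool  # True once prefilled memory K/V is prepended


def select_path(state: StepState, use_cuda_graph: bool) -> str:
    """Pick the execution path for one step.

    Memory-engaged steps run eager because their attention sequence is longer
    than the one the graph was captured with; other steps keep graph dispatch.
    """
    if use_cuda_graph and not state.memory_engaged:
        return "cuda_graph"
    return "eager"


states = [StepState(i, i >= FIRST_MEMORY_CHUNK) for i in range(8)]
paths = [select_path(s, use_cuda_graph=True) for s in states]
paths_no_graph = [select_path(s, use_cuda_graph=False) for s in states]

print(paths)
```

Under this assumption, chunks 5 to 7 (the post-warmup window used for the medians) all take the eager path, which is consistent with the note that the reported medians do not depend on CUDA graphs.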

Bug 2 (num_chunk>=4 cudaErrorIllegalAddress) fixed in eb121aa; record that the reported steady-state perf medians are production-config and graph-independent (memory-engaged chunks run eager either way), not conservative. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

num_chunk=8 mean |Δ| 29/255 (vs 15.65 at num_chunk=2) is cumulative bf16 autoregressive drift, not a bug: per-frame |Δ| ramps smoothly (chunk-0 ~14 reproducing the README's 12.91, to ~49 by chunk-7) with no discontinuity at memory engagement. Higher mean only reflects averaging in more later high-drift chunks. Memory-frame selection verified identical native-vs-vendor for w-31. Notes the unseeded FOV point cloud as a latent risk for non-forward poses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
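To illustrate why a longer rollout raises the mean without any single discontinuity, here is a minimal sketch. It assumes `|Δ|` is the mean absolute pixel difference on a 0 to 255 scale between matching native and vendor frames; the pixel values are toy numbers invented for the example, not measurements, and the function names are hypothetical.

```python
from statistics import mean


def frame_mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length flat pixel sequences."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return mean(abs(x - y) for x, y in zip(a, b))


def rollout_mean_abs_diff(native_frames, vendor_frames):
    """Return the per-frame differences and their overall mean."""
    per_frame = [
        frame_mean_abs_diff(n, v) for n, v in zip(native_frames, vendor_frames)
    ]
    return per_frame, mean(per_frame)


# Toy frames: the vendor output drifts a little further from native each frame.
native = [[10, 20, 30, 40]] * 3
vendor = [[11, 21, 31, 41], [12, 22, 32, 42], [14, 24, 34, 44]]

per_frame, overall = rollout_mean_abs_diff(native, vendor)
early_only = mean(per_frame[:2])

print(per_frame, overall, early_only)
```

Averaging over only the early frames gives a lower figure than averaging over all of them, which mirrors the num_chunk=2 versus num_chunk=8 comparison in the commit message.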

All 5 data_local images benched native-vs-vendor (num_chunk=8). Perf input-independent (DiT 628-633 ms native / ~1208 vendor -> ~1.91x; VAE parity; DiT+VAE 1.56x), corroborating perf-0530.md across 6 inputs. Parity 21-52/255 (benign AR drift, highest on off-aspect inputs). Notes samples generated + the out-of-repo artifact location (in-repo outputs/ gets wiped by the shared CI-runner box). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Moved the data_local batch artifacts from /home/nvidia/hy_bench_out into tests/parity_check/outputs/<stem>/; update the paths in tasks.md accordingly (still gitignored, with the CI-wipe caveat noted). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Promote the model-card draft to docs/source/models/hy_worldplay.rst (LingBot style: hero, install, running, variants list-table, native sample grid, perf chart). Wire the gallery to our GB300-generated native rollouts (num_chunk=8, 704x1280): hero = 6.jpeg, gallery = 2.png / 1.png / cat_surf.jpg, web-transcoded into docs/source/_static/videos/hy_worldplay/ (3.4 MB total). Register the page in models/index.rst and the index.rst "Model cards" toctree. Builds clean under `sphinx-build -W` (warnings-as-errors, matching CI). The local _static videos can be swapped for research.nvidia.com-hosted URLs later to match LingBot's external-hosting convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-30T11:19:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Fourth gallery clip from the native num_chunk=8 rollout on data_local/10.png (web-transcoded, 484 KB). Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Trim the intro, --ckpt-path/--pose paragraph (drop the full pose-token enumeration; it's in --help), variant description, sample caption, and perf footnote. Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… clips Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…y-perf-handoff

tasks.md is an internal HY-WorldPlay handoff status note, not upstream material; untrack + gitignore it so it stays out of the PR diff while remaining on disk for local tracking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Untrack HANDOFF.md and gitignore it (alongside tasks.md/comments.md) so the internal handoff notes stay on disk for local tracking but out of the PR #231 diff / upstream. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenqingw-nv · 2026-05-30T12:25:45Z

/ok to test 1e7a2a3

liruilong940607 · 2026-06-01T15:45:31Z

The visual results in the native column are choppy, which suggests something is wrong with the current implementation. This needs to be revisited and fixed.

Let's not add this to the public-facing doc until we are sure it is correctly implemented.

liruilong940607 · 2026-06-01T15:46:47Z

Let's not check videos into this repo. You can send me the videos on Slack and I will host them.

liruilong940607 · 2026-06-01T15:56:54Z

+        # Run the reconstituted-context prefill once at the first
+        # denoising step of each chunk past the first; the
+        # ``prefill_completed_for_chunk`` latch suppresses re-runs on
+        # subsequent scheduler steps within the chunk.


This slipped through my previous review. Why would we need prefill_memory_kv? The BlockKVCache in flashdreams should naturally work with CUDAGraph.

I traced it and it seems to lead to HyWorldPlayMemoryKVCache, which stored roped K and roped V. What's the reason we can't use BlockKVCache for that? And what's the reason it can't be compatible with cudagraph like other models?

HyWorldPlayMemoryKVCache is separate because memory isn't a rolling window; it's a per-chunk re-selection of non-contiguous frames, so BlockKVCache doesn't map directly. The non-graphability is just a stopgap: since the selected set is always exactly 16, a fixed-size in-place buffer is graph-compatible. This path also causes the choppy output: the memory keys sit at collapsed RoPE positions [0..16) regardless of frame identity, so on the first window slide a frame's encoding shifts and the scene jolts.

I'll rework it into a fixed-size, consistently-positioned in-place buffer to fix both, and numerically diff it against vendor's prefill first. Which do you think we should do: gather K/V from retained history, or re-prefill each chunk (vendor re-prefills)? @liruilong940607
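A rough sketch of the fixed-size, in-place bookkeeping proposed above, written without torch. Everything here is hypothetical: `FixedSlotMemory`, its fields, and the `fetch` callback are illustrative names, plain Python objects stand in for K/V tensors, and how RoPE positions would be assigned in the actual rework is not covered.

```python
class FixedSlotMemory:
    """Hypothetical fixed-capacity store: the slot count never changes and
    updates overwrite slots in place, so buffer shapes stay constant."""

    def __init__(self, capacity: int = 16) -> None:
        self.capacity = capacity
        self.frame_ids = [None] * capacity  # which frame occupies each slot
        self.values = [None] * capacity     # stand-in for per-slot K/V data

    def update(self, selected_ids, fetch) -> None:
        """Install a new selection while retained frames keep their slot."""
        selected = list(selected_ids)
        if len(selected) != self.capacity or len(set(selected)) != self.capacity:
            raise ValueError("selection must contain exactly `capacity` unique ids")
        wanted = set(selected)
        # Slots whose frame dropped out of the selection (or was never set) are free.
        free = [i for i, fid in enumerate(self.frame_ids) if fid not in wanted]
        present = {fid for fid in self.frame_ids if fid is not None}
        new_ids = [fid for fid in selected if fid not in present]
        for slot, fid in zip(free, new_ids):
            self.frame_ids[slot] = fid
            self.values[slot] = fetch(fid)


# Small capacity for the demo; the thread above says the real set size is 16.
mem = FixedSlotMemory(capacity=4)
buffer_identity = id(mem.values)

mem.update([0, 1, 2, 3], fetch=lambda fid: f"kv{fid}")
mem.update([1, 2, 3, 7], fetch=lambda fid: f"kv{fid}")

print(mem.frame_ids, mem.values)
```

In the demo, frame 7 replaces frame 0 in slot 0 while frames 1 to 3 stay where they were, and the backing list is never reallocated. The same idea with preallocated tensors and in-place copies is what would keep shapes static for graph replay.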

…sted URLs Per review on #231: don't check videos into the repo. Untrack the five _static sample mp4s, gitignore the dir, and repoint the model-card <source>s at research.nvidia.com-hosted URLs (LingBot convention). The clips stay on disk locally to hand off for hosting. Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenqingw-nv and others added 13 commits May 30, 2026 00:36

docs: drop n/a mgpu row from tasks.md perf/docs table

114136f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: record perf/docs MR as NVIDIA#231 in tasks.md

9a29e7f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenqingw-nv changed the title ~~perf(hy_worldplay): GB300 num_chunk=8 re-bench + model card + num_chunk≥4 fixes~~ perf(hy_worldplay): longer video bench + HY model card May 30, 2026

wenqingw-nv and others added 4 commits May 30, 2026 11:46

docs(hy_worldplay): add 10.png native sample to the model-card gallery

c178c70

Fourth gallery clip from the native num_chunk=8 rollout on data_local/10.png (web-transcoded, 484 KB). Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(hy_worldplay): swap hero (now 1.png) and gallery-02 (now 6.jpeg)…

3b4f804

… clips Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'upstream/main' into wenqing/hy-worldpla…

62f8a6a

…y-perf-handoff

wenqingw-nv mentioned this pull request May 30, 2026

HY-WorldPlay WAN-5B I2V — follow-ups from PR #155 #203

Open

7 tasks

wenqingw-nv and others added 3 commits May 30, 2026 12:20

docs: update tasks.md — PR NVIDIA#231 synced + NVIDIA#203 updated; fi…

df2ea1e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wenqingw-nv enabled auto-merge May 30, 2026 12:38

liruilong940607 reviewed Jun 1, 2026

View reviewed changes

wenqingw-nv marked this pull request as draft June 2, 2026 05:31

auto-merge was automatically disabled June 2, 2026 05:31
Pull request was converted to draft

wenqingw-nv wants to merge 22 commits into NVIDIA:main from wenqingw-nv:wenqing/hy-worldplay-perf-handoff

2 participants
