perf(hy_worldplay): longer video bench + HY model card #231
wenqingw-nv wants to merge 22 commits into
Transport branch (not a PR): carries HANDOFF.md + tasks.md so a fresh agent on the GB300 has the perf-MR steps, verified status of the four follow-up PRs (#222/#223/#224/#227), and the gotchas. Delete before any real PR off this branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit fixes: add #227 to the PR table; mark all four PRs green + auto-merge armed (pending review); reframe the two "known bugs" as fixed (#227 base-load, #224 DiT 404) since they no longer block; correct the "always --ckpt-path" rationale (you need distilled weights for real output, not "base path is broken"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
prefill_memory_kv_cache asserted isinstance(self.network, HyWorldPlayWanDiTNetwork), but with compile_network=True the parent Wan transformer reassigns self.network to a torch.compile OptimizedModule, so the assert fired the first time the memory-prefill path engaged. That path only runs once enough chunks accumulate (~num_chunk>=4), so it was never reachable on the 44 GiB card that capped the bench at num_chunk=2; the GB300 num_chunk=8 run surfaced it. Unwrap _orig_mod before narrowing the type. The prefill is a separate eager KV-fill pass; the diffuse forward stays on the compiled _network_call. CPU smoke (16 tests) green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
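A minimal sketch of the unwrap described above, assuming only that a torch.compile-wrapped module exposes the original eager module as `_orig_mod`. The classes are torch-free stand-ins so the snippet runs anywhere, and `unwrap_compiled` is a hypothetical helper name, not necessarily what `_action.py` uses.

```python
class HyWorldPlayWanDiTNetwork:
    """Stand-in for the real DiT network class (illustration only)."""


class FakeOptimizedModule:
    """Stand-in for torch.compile's wrapper: keeps the eager module in _orig_mod."""

    def __init__(self, orig_mod):
        self._orig_mod = orig_mod


def unwrap_compiled(module):
    """Return the eager module behind a compiled wrapper, or the module itself."""
    return getattr(module, "_orig_mod", module)


eager = HyWorldPlayWanDiTNetwork()
compiled = FakeOptimizedModule(eager)

# An isinstance check on the wrapper fails; on the unwrapped module it passes.
print(isinstance(compiled, HyWorldPlayWanDiTNetwork))
print(isinstance(unwrap_compiled(compiled), HyWorldPlayWanDiTNetwork))
```

Because the helper falls back to the module itself, the same narrowing works whether or not compile_network is enabled.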
Two fixes that surfaced while provisioning and running the matched num_chunk=8 bench on a GB300:
- bench.sh now defaults HY_VENDOR_SDPA=1, so the vendor leg is forced onto cuDNN F.scaled_dot_product_attention (the kernel the native leg uses) instead of its as-shipped sageattention INT8/FP8 path. This is the apples-to-apples attention-backend match the perf reviewers required; set HY_VENDOR_SDPA=0 to bench vendor as-shipped.
- run.sh adds torchvision==0.26.* to the vendor heavy-deps install. Upstream HEAD's hyvideo HunyuanVideo-1.5 pipeline import now pulls in torchvision; pinned to match the torch==2.11.* sub-venv. (run.sh's uv sync prunes anything not in pyproject, so it must be reinstalled after the sync, alongside the other vendor-only deps.)
- uv.lock: re-lock picking up flashdreams dropping torchvision + lowering the torch floor to >=2.9 (the reason the vendor dep above is now needed explicitly).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the model-card chart data and a work-in-progress model-card page draft from the matched re-bench on a single GB300 (num_chunk=8, 704x1280, seed 0, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA + torch.compile):
- perf-0530.md: per-chunk steady-state DiT+VAE-decode total -- official 1578 ms vs flashdreams 1015 ms (~1.55x). DiT alone is 1206 vs 632 ms (1.91x); VAE decode is at parity. The flashdreams number is conservative -- its CUDA-graph fast path is disabled in this run (a num_chunk>=4 graph/memory-prefill bug, tracked separately), so production is faster still.
- hy_worldplay.rst.draft: page mirrors lingbot_world.rst. Kept as a .draft (not picked up by sphinx) until the curated sample MP4s are generated and uploaded; the perf footnote + chart wiring are final.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
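The steady-state figures can be reproduced arithmetically. The sketch below assumes the bench takes a median over the chunks left after the warmup discard; the per-chunk timing list is hypothetical, and only the four medians (1578, 1015, 1206, 632 ms) come from this thread.

```python
from statistics import median


def steady_state_median(per_chunk_ms, warmup_discard):
    """Median over the chunks that remain after dropping the first warmup_discard."""
    kept = per_chunk_ms[warmup_discard:]
    if not kept:
        raise ValueError("warmup_discard leaves no chunks to measure")
    return median(kept)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Hypothetical per-chunk timings (ms) for an 8-chunk run; only chunks 5-7 count.
timings = [2400.0, 1300.0, 1100.0, 1050.0, 1020.0, 1016.0, 1015.0, 1013.0]
print(steady_state_median(timings, warmup_discard=5))  # 1015.0

# Ratios from the medians quoted in this thread (official / flashdreams).
print(round(speedup(1578, 1015), 2))  # 1.55 (DiT+VAE per chunk)
print(round(speedup(1206, 632), 2))   # 1.91 (DiT alone)
```

With num_chunk=8 and warmup-discard 5, only three chunks feed each median, so a median rather than a mean keeps one slow chunk from skewing the result.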
Marks the num_chunk=8 perf re-bench DONE, records the matched medians (native DiT 1.91x, VAE parity, DiT+VAE/chunk 1.55x), the expectation-reversal (DiT wins, not VAE), the two num_chunk>=4 bugs (compiled-network isinstance fixed; CUDA-graph/prefill illegal access open), and the long-rollout parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With use_cuda_graph=True (production default), the rollout crashed with cudaErrorIllegalAddress at the first reconstituted-context chunk (num_chunk>=4). The per-block attention prepends the prefilled memory K/V via a data-dependent torch.cat (hy_worldplay._camera), lengthening the attention sequence once memory engages. A graph captured on the pre-memory path (shorter sequence, no cat) cannot be replayed against the longer post-memory sequence. Flag memory-engaged steps in predict_flow and override _select_network to route them onto the wrapper's eager drain path; pre-memory chunks keep the base filling/steady CUDA-graph dispatch. Verified num_chunk=8 end-to-end with graphs ON: 8 chunks, no fault, valid 125-frame mp4. The steady-state (post-warmup) chunks are memory-engaged and therefore already run eager, so the reported steady-state medians are unchanged by this fix and are not "conservative" -- CUDA graphs only accelerate the discarded warmup chunks. Corrected the model-card perf footnote accordingly. (Graph-accelerating the memory-engaged steady state would need fixed-size in-place memory KV buffers; tracked separately.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
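The routing rule can be sketched in a few lines. This is an illustration only: `StepState`, `select_path`, and the chunk index at which memory engages are assumptions, not the names or thresholds used by `predict_flow` and `_select_network`.

```python
from dataclasses import dataclass


@dataclass
class StepState:
    chunk_index: int
    memory_engaged: bool  # True once prefilled memory K/V is concatenated into attention


def select_path(state, use_cuda_graph):
    """Pick the execution path for one denoising step (hypothetical names)."""
    if not use_cuda_graph:
        return "eager"
    if state.memory_engaged:
        # The attention sequence is longer than at capture time,
        # so a graph captured pre-memory cannot be replayed.
        return "eager"
    return "cuda_graph"


# Assumption for illustration: memory engages from chunk index 3 onward.
MEMORY_START = 3
paths = [
    select_path(StepState(i, memory_engaged=i >= MEMORY_START), use_cuda_graph=True)
    for i in range(8)
]
print(paths)
```

Under this rule every post-warmup chunk in an 8-chunk run lands on the eager path, which is why the steady-state medians do not depend on whether graphs are enabled.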
Bug 2 (num_chunk>=4 cudaErrorIllegalAddress) fixed in eb121aa; record that the reported steady-state perf medians are production-config and graph-independent (memory-engaged chunks run eager either way), not conservative. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
num_chunk=8 mean |Δ| 29/255 (vs 15.65 at num_chunk=2) is cumulative bf16 autoregressive drift, not a bug: per-frame |Δ| ramps smoothly (chunk-0 ~14 reproducing the README's 12.91, to ~49 by chunk-7) with no discontinuity at memory engagement. Higher mean only reflects averaging in more later high-drift chunks. Memory-frame selection verified identical native-vs-vendor for w-31. Notes the unseeded FOV point cloud as a latent risk for non-forward poses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
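The averaging effect can be illustrated with an idealized ramp. The sketch assumes per-chunk drift rises in a straight line from 14 to 49 over 8 chunks; the real curve is only described as smooth, so the outputs (16.5 and 31.5) approximate rather than reproduce the measured 15.65 and 29.

```python
def linear_ramp(start, end, n):
    """n evenly spaced values from start to end inclusive."""
    if n == 1:
        return [float(start)]
    step = (end - start) / (n - 1)
    return [start + step * i for i in range(n)]


def mean(values):
    return sum(values) / len(values)


# Illustrative per-chunk drift: a straight line from ~14 (chunk 0) to ~49 (chunk 7).
per_chunk = linear_ramp(14.0, 49.0, 8)

print(round(mean(per_chunk[:2]), 2))  # 16.5 (what a num_chunk=2 run would average)
print(round(mean(per_chunk), 2))      # 31.5 (what a num_chunk=8 run would average)

# Largest chunk-to-chunk jump: a smooth ramp has no outlier step.
jumps = [b - a for a, b in zip(per_chunk, per_chunk[1:])]
print(round(max(jumps), 2))           # 5.0
```

The same per-chunk curve yields a mean roughly twice as high over 8 chunks as over 2, without any single step standing out, which is the pattern the commit describes.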
All 5 data_local images benched native-vs-vendor (num_chunk=8). Perf input-independent (DiT 628-633 ms native / ~1208 vendor -> ~1.91x; VAE parity; DiT+VAE 1.56x), corroborating perf-0530.md across 6 inputs. Parity 21-52/255 (benign AR drift, highest on off-aspect inputs). Notes samples generated + the out-of-repo artifact location (in-repo outputs/ gets wiped by the shared CI-runner box). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moved the data_local batch artifacts from /home/nvidia/hy_bench_out into tests/parity_check/outputs/<stem>/; update the paths in tasks.md accordingly (still gitignored, with the CI-wipe caveat noted). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Promote the model-card draft to docs/source/models/hy_worldplay.rst (LingBot style: hero, install, running, variants list-table, native sample grid, perf chart). Wire the gallery to our GB300-generated native rollouts (num_chunk=8, 704x1280): hero = 6.jpeg, gallery = 2.png / 1.png / cat_surf.jpg, web-transcoded into docs/source/_static/videos/hy_worldplay/ (3.4 MB total). Register the page in models/index.rst and the index.rst "Model cards" toctree. Builds clean under `sphinx-build -W` (warnings-as-errors, matching CI). The local _static videos can be swapped for research.nvidia.com-hosted URLs later to match LingBot's external-hosting convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fourth gallery clip from the native num_chunk=8 rollout on data_local/10.png (web-transcoded, 484 KB). Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the intro, --ckpt-path/--pose paragraph (drop the full pose-token enumeration; it's in --help), variant description, sample caption, and perf footnote. Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… clips Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tasks.md is an internal HY-WorldPlay handoff status note, not upstream material; untrack + gitignore it so it stays out of the PR diff while remaining on disk for local tracking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Untrack HANDOFF.md and gitignore it (alongside tasks.md/comments.md) so the internal handoff notes stay on disk for local tracking but out of the PR #231 diff / upstream. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/ok to test 1e7a2a3
The visual results in the native column are choppy -- this seems to indicate something is wrong with the current implementation. It needs a revisit and a fix. Let's not add this to the public-facing doc until we are sure it is correctly implemented.
Let's not check videos into this repo. You can send me the videos on Slack and I will host them.
# Run the reconstituted-context prefill once at the first
# denoising step of each chunk past the first; the
# ``prefill_completed_for_chunk`` latch suppresses re-runs on
# subsequent scheduler steps within the chunk.
This slipped through my previous review. Why would we need prefill_memory_kv? The BlockKVCache in flashdreams should naturally work with CUDAGraph.
I traced it and it seems to lead to HyWorldPlayMemoryKVCache, which stored roped K and roped V. What's the reason we can't use BlockKVCache for that? And what's the reason it can't be compatible with cudagraph like other models?
HyWorldPlayMemoryKVCache is separate because memory isn't a rolling window; it's a per-chunk re-selection of non-contiguous frames, so BlockKVCache doesn't map directly. The non-graphability is just a stopgap: since the selected set is always exactly 16, a fixed-size in-place buffer is graph-compatible. This path also causes the choppy output: the memory keys sit at collapsed RoPE positions [0..16) regardless of identity, so on the first window slide a frame's encoding shifts and the scene jolts.
I'll rework it into a fixed-size, consistently-positioned in-place buffer to fix both, and numerically diff vs vendor's prefill first. Which do you think we should do: gather K/V from retained history, or re-prefill each chunk (vendor re-prefills)? @liruilong940607
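A toy illustration of the position collapse, under stated assumptions: the selections below are contiguous for simplicity (the real memory set is non-contiguous), and `identity_positions` is only one possible reading of "consistently-positioned", not the planned design.

```python
def collapsed_positions(selected_frames):
    """Position = rank within the selected set, i.e. always 0..len-1."""
    return {frame: rank for rank, frame in enumerate(sorted(selected_frames))}


def identity_positions(selected_frames):
    """Position tied to the frame's own index, stable across re-selection."""
    return {frame: frame for frame in selected_frames}


# Toy selections of 16 memory frames before and after the window slides by one.
before = list(range(0, 16))
after = list(range(1, 17))

frame = 5  # retained in both selections
print(collapsed_positions(before)[frame], collapsed_positions(after)[frame])  # 5 4
print(identity_positions(before)[frame], identity_positions(after)[frame])    # 5 5
```

With rank-based positions the retained frame moves from position 5 to 4 when the window slides, so its rotary encoding changes even though its content does not; a position tied to frame identity stays fixed.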
…sted URLs Per review on #231: don't check videos into the repo. Untrack the five _static sample mp4s, gitignore the dir, and repoint the model-card <source>s at research.nvidia.com-hosted URLs (LingBot convention). The clips stay on disk locally to hand off for hosting. Builds clean under sphinx-build -W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request was converted to draft
Follow-up to #155 for #203: GB300 perf re-bench, the hy_worldplay model card, and two num_chunk≥4 fixes the larger GPU surfaced.
Changes
- _action.py: unwrap the torch.compile OptimizedModule in prefill_memory_kv_cache; run memory-engaged chunks eager under CUDA graphs (the data-dependent memory torch.cat can't replay in a graph captured pre-memory). Both reachable only at num_chunk≥4.
- bench.sh: default HY_VENDOR_SDPA=1 (vendor on cuDNN SDPA, matching native).
- run.sh: add torchvision to the vendor deps.
- models/hy_worldplay.rst (LingBot-style) + perf-0530.md + native sample videos; registered in the toctree; clean under sphinx-build -W.
Bench
Native vs wan/generate.py: num_chunk=8 / pose=w-31 / seed=0 / 704×1280, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA + torch.compile, single GB300. Post-warmup medians (chunks 5–7):
Per data_local/ input (|Δ|): 1.png, 2.png, 6.jpeg, cat_surf.jpg
Native ~1.91× on DiT, parity on VAE, input-independent (native DiT 628–632 ms across inputs). Parity |Δ| is cumulative bf16 AR drift (chunk-0 ≈14 matches #155's 12.91, ramps to ~49 by chunk-7, no jump at memory engagement); highest on the off-aspect cat_surf (625×350 upscaled).
Native vs vendor video pairs
1.png: hy-worldplay-wan-i2v-5b.mp4 / hy-worldplay-wan-i2v-5b.mp4
2.png: hy-worldplay-wan-i2v-5b.mp4 / hy-worldplay-wan-i2v-5b.mp4
6.jpeg: hy-worldplay-wan-i2v-5b.mp4 / hy-worldplay-wan-i2v-5b.mp4
cat_surf.jpg: hy-worldplay-wan-i2v-5b.mp4 / hy-worldplay-wan-i2v-5b.mp4