perf(hy_worldplay): longer video bench + HY model card #231

Draft
wenqingw-nv wants to merge 22 commits into NVIDIA:main from wenqingw-nv:wenqing/hy-worldplay-perf-handoff
Collaborator

@wenqingw-nv wenqingw-nv commented May 30, 2026

Follow-up to #155 for #203: GB300 perf re-bench, the hy_worldplay model card, and two num_chunk≥4 fixes the larger GPU surfaced.

Changes

  • _action.py: unwrap the torch.compile OptimizedModule in prefill_memory_kv_cache; run memory-engaged chunks eager under CUDA graphs (the data-dependent memory torch.cat can't replay in a graph captured pre-memory). Both reachable only at num_chunk≥4.
  • bench.sh: default HY_VENDOR_SDPA=1 (vendor on cuDNN SDPA, matching native). run.sh: add torchvision to the vendor deps.
  • models/hy_worldplay.rst (LingBot-style) + perf-0530.md + native sample videos; registered in the toctree; clean under sphinx-build -W.

Bench

Native vs wan/generate.py: num_chunk=8 / pose=w-31 / seed=0 / 704×1280, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA + torch.compile, single GB300. Post-warmup medians (chunks 5–7):

| stage | native | vendor | speedup |
|---|---|---|---|
| DiT (diffuse) | 632 ms | 1206 ms | 1.91× |
| VAE decode | 383 ms | 372 ms | ~parity |
| DiT + VAE / chunk | 1015 ms | 1578 ms | 1.55× |
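
The speedup column can be re-derived from the medians. The sketch below is only a sanity check on the arithmetic: it assumes "speedup" means vendor time divided by native time, and the helper names are illustrative, not part of the bench scripts.

```python
# Post-warmup medians in milliseconds, copied from the table above.
MEDIANS_MS = {
    "DiT (diffuse)": {"native": 632, "vendor": 1206},
    "VAE decode": {"native": 383, "vendor": 372},
}


def speedup(native_ms: float, vendor_ms: float) -> float:
    """Return how many times faster the native leg is than the vendor leg."""
    return vendor_ms / native_ms


def per_chunk_total(leg: str) -> int:
    """Sum the DiT and VAE medians of one leg to get its per-chunk total."""
    return sum(stage[leg] for stage in MEDIANS_MS.values())


if __name__ == "__main__":
    for name, stage in MEDIANS_MS.items():
        print(f"{name}: {speedup(stage['native'], stage['vendor']):.2f}x")
    native_total = per_chunk_total("native")
    vendor_total = per_chunk_total("vendor")
    print(
        f"DiT + VAE / chunk: {native_total} ms vs {vendor_total} ms, "
        f"{speedup(native_total, vendor_total):.2f}x"
    )
```

The per-chunk totals are the sums of the two stage medians, which is why the combined speedup is lower than the DiT-only figure: the VAE stage contributes time to both legs without contributing any speedup.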

Per data_local/ input:

| image | DiT nat/ven (ms) | VAE nat/ven (ms) | ratio | mean \|Δ\| |
|---|---|---|---|---|
| 1.png | 630 / 1208 | 382 / 374 | 1.56× | 40.8 |
| 2.png | 632 / 1210 | 382 / 374 | 1.56× | 31.1 |
| 6.jpeg | 628 / 1208 | 381 / 374 | 1.57× | 21.1 |
| cat_surf.jpg | 632 / 1207 | 381 / 375 | 1.56× | 51.8 |
| median | 631 / 1208 | 382 / 374 | 1.56× | |
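
As a cross-check on the per-input table, the sketch below recomputes the ratio column and the DiT medians. It assumes the ratio is the vendor DiT+VAE time over the native DiT+VAE time, which is consistent with the listed values; the names are illustrative only.

```python
from statistics import median

# (native DiT, vendor DiT, native VAE, vendor VAE) in ms, from the table above.
ROWS = {
    "1.png": (630, 1208, 382, 374),
    "2.png": (632, 1210, 382, 374),
    "6.jpeg": (628, 1208, 381, 374),
    "cat_surf.jpg": (632, 1207, 381, 375),
}


def chunk_ratio(dit_nat: int, dit_ven: int, vae_nat: int, vae_ven: int) -> float:
    """Vendor DiT+VAE time divided by native DiT+VAE time for one input."""
    return (dit_ven + vae_ven) / (dit_nat + vae_nat)


# Per-input ratio, rounded to two decimals as in the table.
ratios = {name: round(chunk_ratio(*row), 2) for name, row in ROWS.items()}

# Column medians for the DiT stage across the four inputs.
dit_native_median = median(row[0] for row in ROWS.values())
dit_vendor_median = median(row[1] for row in ROWS.values())

if __name__ == "__main__":
    print(ratios)
    print(dit_native_median, dit_vendor_median)
```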

Native ~1.91× on DiT, parity on VAE, input-independent (native DiT 628–632 ms across inputs). Parity |Δ| is cumulative bf16 AR drift (chunk-0 ≈14 matches #155's 12.91, ramps to ~49 by chunk-7, no jump at memory engagement); highest on the off-aspect cat_surf (625×350 upscaled).
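
The PR does not spell out how mean |Δ| is computed. A minimal sketch, assuming it is the mean absolute per-pixel difference between the native and vendor frames on a 0-255 scale, could look like this. The function name and the toy pixel values are illustrative, not taken from the parity-check code.

```python
from typing import Sequence


def mean_abs_diff(frame_a: Sequence[int], frame_b: Sequence[int]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames.

    Frames are flat sequences of 8-bit values, so the result lies in [0, 255].
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    if not frame_a:
        raise ValueError("frames must not be empty")
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / len(frame_a)


if __name__ == "__main__":
    # Toy 4-pixel frames, purely for illustration.
    native = [0, 10, 255, 100]
    vendor = [4, 10, 251, 100]
    print(mean_abs_diff(native, vendor))
```

Under this reading, averaging the metric over a longer rollout pulls in more late, high-drift frames, which raises the overall mean even when no single chunk misbehaves.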

Native vs vendor video pairs

| image | native | vendor |
|---|---|---|
| 1.png | hy-worldplay-wan-i2v-5b.mp4 | hy-worldplay-wan-i2v-5b.mp4 |
| 2.png | hy-worldplay-wan-i2v-5b.mp4 | hy-worldplay-wan-i2v-5b.mp4 |
| 6.jpeg | hy-worldplay-wan-i2v-5b.mp4 | hy-worldplay-wan-i2v-5b.mp4 |
| cat_surf.jpg | hy-worldplay-wan-i2v-5b.mp4 | hy-worldplay-wan-i2v-5b.mp4 |

wenqingw-nv and others added 13 commits May 30, 2026 00:36
Transport branch (not a PR): carries HANDOFF.md + tasks.md so a fresh
agent on the GB300 has the perf-MR steps, verified status of the four
follow-up PRs (#222/#223/#224/#227), and the gotchas. Delete before any
real PR off this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit fixes: add #227 to the PR table; mark all four PRs green + auto-
merge armed (pending review); reframe the two "known bugs" as fixed
(#227 base-load, #224 DiT 404) since they no longer block; correct the
"always --ckpt-path" rationale (you need distilled weights for real
output, not "base path is broken").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
prefill_memory_kv_cache asserted isinstance(self.network,
HyWorldPlayWanDiTNetwork), but with compile_network=True the parent
Wan transformer reassigns self.network to a torch.compile
OptimizedModule, so the assert fired the first time the memory-prefill
path engaged. That path only runs once enough chunks accumulate
(~num_chunk>=4), so it was never reachable on the 44 GiB card that
capped the bench at num_chunk=2; the GB300 num_chunk=8 run surfaced it.

Unwrap _orig_mod before narrowing the type. The prefill is a separate
eager KV-fill pass; the diffuse forward stays on the compiled
_network_call. CPU smoke (16 tests) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
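
The unwrap described in the commit above can be sketched as follows. This is a minimal illustration, not the code in _action.py: it relies only on the `_orig_mod` attribute that the commit names, and the `Fake*` classes are hypothetical stand-ins so the sketch runs without torch installed.

```python
from typing import Any


def unwrap_compiled(module: Any) -> Any:
    """Return the original module if `module` is a torch.compile wrapper.

    The compiled wrapper keeps the wrapped module under `_orig_mod`; any
    object without that attribute is returned unchanged.
    """
    return getattr(module, "_orig_mod", module)


class FakeNetwork:
    """Hypothetical stand-in for the real DiT network class."""


class FakeOptimizedModule:
    """Hypothetical stand-in for the torch.compile wrapper."""

    def __init__(self, orig: Any) -> None:
        self._orig_mod = orig


if __name__ == "__main__":
    network = FakeNetwork()
    compiled = FakeOptimizedModule(network)
    # A type check on the wrapper fails, but it passes after unwrapping.
    print(isinstance(compiled, FakeNetwork))
    print(isinstance(unwrap_compiled(compiled), FakeNetwork))
```

Unwrapping before the type check keeps the assertion meaningful with and without compile_network=True, since the helper is a no-op on an uncompiled module.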
Two fixes that surfaced provisioning + running the matched num_chunk=8
bench on a GB300:

- bench.sh now defaults HY_VENDOR_SDPA=1, so the vendor leg is forced
  onto cuDNN F.scaled_dot_product_attention (the kernel the native leg
  uses) instead of its as-shipped sageattention INT8/FP8 path. This is
  the apples-to-apples attention-backend match the perf reviewers
  required; set HY_VENDOR_SDPA=0 to bench vendor as-shipped.
- run.sh adds torchvision==0.26.* to the vendor heavy-deps install.
  Upstream HEAD's hyvideo HunyuanVideo-1.5 pipeline import now pulls in
  torchvision; pinned to match the torch==2.11.* sub-venv. (run.sh's
  uv sync prunes anything not in pyproject, so it must be reinstalled
  after the sync, alongside the other vendor-only deps.)
- uv.lock: re-lock picking up flashdreams dropping torchvision +
  lowering the torch floor to >=2.9 (the reason the vendor dep above is
  now needed explicitly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
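
For readers who consume the flag from Python rather than from bench.sh, the default-on behavior described above might be mirrored like this. It is a sketch under assumptions: the function name is hypothetical, and treating every value other than "0" as enabled is a guess, since the commit only documents the values 1 and 0.

```python
import os
from typing import Mapping, Optional


def vendor_sdpa_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless HY_VENDOR_SDPA is explicitly set to "0".

    An unset variable counts as "1", matching the new bench.sh default.
    """
    env = os.environ if environ is None else environ
    return env.get("HY_VENDOR_SDPA", "1") != "0"


if __name__ == "__main__":
    print(vendor_sdpa_enabled({}))
    print(vendor_sdpa_enabled({"HY_VENDOR_SDPA": "0"}))
```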
Adds the model-card chart data and a work-in-progress model-card page
draft from the matched re-bench on a single GB300 (num_chunk=8,
704x1280, seed 0, warmup-discard 5, DiT+VAE scope, both legs cuDNN SDPA
+ torch.compile):

- perf-0530.md: per-chunk steady-state DiT+VAE-decode total --
  official 1578 ms vs flashdreams 1015 ms (~1.55x). DiT alone is
  632 vs 1206 ms (1.91x); VAE decode is at parity. The flashdreams
  number is conservative -- its CUDA-graph fast path is disabled in
  this run (a num_chunk>=4 graph/memory-prefill bug, tracked
  separately), so production is faster still.
- hy_worldplay.rst.draft: page mirrors lingbot_world.rst. Kept as a
  .draft (not picked up by sphinx) until the curated sample MP4s are
  generated and uploaded; the perf footnote + chart wiring are final.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Marks the num_chunk=8 perf re-bench DONE, records the matched medians
(native DiT 1.91x, VAE parity, DiT+VAE/chunk 1.55x), the
expectation-reversal (DiT wins, not VAE), the two num_chunk>=4 bugs
(compiled-network isinstance fixed; CUDA-graph/prefill illegal access
open), and the long-rollout parity drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With use_cuda_graph=True (production default), the rollout crashed with
cudaErrorIllegalAddress at the first reconstituted-context chunk
(num_chunk>=4). The per-block attention prepends the prefilled memory
K/V via a data-dependent torch.cat (hy_worldplay._camera), lengthening
the attention sequence once memory engages. A graph captured on the
pre-memory path (shorter sequence, no cat) cannot be replayed against
the longer post-memory sequence.

Flag memory-engaged steps in predict_flow and override _select_network
to route them onto the wrapper's eager drain path; pre-memory chunks
keep the base filling/steady CUDA-graph dispatch. Verified num_chunk=8
end-to-end with graphs ON: 8 chunks, no fault, valid 125-frame mp4.

The steady-state (post-warmup) chunks are memory-engaged and therefore
already run eager, so the reported steady-state medians are unchanged
by this fix and are not "conservative" -- CUDA graphs only accelerate
the discarded warmup chunks. Corrected the model-card perf footnote
accordingly. (Graph-accelerating the memory-engaged steady state would
need fixed-size in-place memory KV buffers; tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
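
The routing described in the commit above boils down to a per-step dispatch. The sketch below illustrates the idea only: `Step`, `select_network`, `route_rollout`, and the `first_memory_chunk` parameter are hypothetical names, and the real `_select_network` override and `predict_flow` flagging are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    chunk_index: int
    memory_engaged: bool  # True once prefilled memory K/V is prepended


def graphed_forward(step: Step) -> str:
    """Stand-in for replaying a CUDA graph captured on the pre-memory path."""
    return "graph"


def eager_forward(step: Step) -> str:
    """Stand-in for the eager path, which tolerates a longer sequence."""
    return "eager"


def select_network(step: Step) -> Callable[[Step], str]:
    """Route memory-engaged steps to eager, all other steps to the graph."""
    return eager_forward if step.memory_engaged else graphed_forward


def route_rollout(num_chunk: int, first_memory_chunk: int) -> List[str]:
    """Return the path taken by each chunk of a rollout."""
    routes = []
    for chunk_index in range(num_chunk):
        step = Step(chunk_index, memory_engaged=chunk_index >= first_memory_chunk)
        routes.append(select_network(step)(step))
    return routes


if __name__ == "__main__":
    # The engagement chunk is a parameter because the PR does not pin it down.
    print(route_rollout(num_chunk=4, first_memory_chunk=2))
```

The design point is that a captured graph is tied to the sequence length seen at capture time, so any step whose attention sequence has grown must bypass replay rather than reuse the shorter graph.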
Bug 2 (num_chunk>=4 cudaErrorIllegalAddress) fixed in eb121aa; record
that the reported steady-state perf medians are production-config and
graph-independent (memory-engaged chunks run eager either way), not
conservative.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
num_chunk=8 mean |Δ| 29/255 (vs 15.65 at num_chunk=2) is cumulative
bf16 autoregressive drift, not a bug: per-frame |Δ| ramps smoothly
(chunk-0 ~14 reproducing the README's 12.91, to ~49 by chunk-7) with
no discontinuity at memory engagement. Higher mean only reflects
averaging in more later high-drift chunks. Memory-frame selection
verified identical native-vs-vendor for w-31. Notes the unseeded FOV
point cloud as a latent risk for non-forward poses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All 5 data_local images benched native-vs-vendor (num_chunk=8). Perf
input-independent (DiT 628-633 ms native / ~1208 vendor -> ~1.91x; VAE
parity; DiT+VAE 1.56x), corroborating perf-0530.md across 6 inputs.
Parity 21-52/255 (benign AR drift, highest on off-aspect inputs).
Notes samples generated + the out-of-repo artifact location (in-repo
outputs/ gets wiped by the shared CI-runner box).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moved the data_local batch artifacts from /home/nvidia/hy_bench_out
into tests/parity_check/outputs/<stem>/; update the paths in tasks.md
accordingly (still gitignored, with the CI-wipe caveat noted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Promote the model-card draft to docs/source/models/hy_worldplay.rst
(LingBot style: hero, install, running, variants list-table, native
sample grid, perf chart). Wire the gallery to our GB300-generated
native rollouts (num_chunk=8, 704x1280): hero = 6.jpeg, gallery =
2.png / 1.png / cat_surf.jpg, web-transcoded into
docs/source/_static/videos/hy_worldplay/ (3.4 MB total). Register the
page in models/index.rst and the index.rst "Model cards" toctree.

Builds clean under `sphinx-build -W` (warnings-as-errors, matching CI).
The local _static videos can be swapped for research.nvidia.com-hosted
URLs later to match LingBot's external-hosting convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented May 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wenqingw-nv wenqingw-nv changed the title perf(hy_worldplay): GB300 num_chunk=8 re-bench + model card + num_chunk≥4 fixes perf(hy_worldplay): longer video bench + HY model card May 30, 2026
wenqingw-nv and others added 4 commits May 30, 2026 11:46
Fourth gallery clip from the native num_chunk=8 rollout on
data_local/10.png (web-transcoded, 484 KB). Builds clean under
sphinx-build -W.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the intro, --ckpt-path/--pose paragraph (drop the full pose-token
enumeration; it's in --help), variant description, sample caption, and
perf footnote. Builds clean under sphinx-build -W.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… clips

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wenqingw-nv and others added 3 commits May 30, 2026 12:20
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tasks.md is an internal HY-WorldPlay handoff status note, not upstream
material; untrack + gitignore it so it stays out of the PR diff while
remaining on disk for local tracking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Untrack HANDOFF.md and gitignore it (alongside tasks.md/comments.md) so
the internal handoff notes stay on disk for local tracking but out of
the PR #231 diff / upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wenqingw-nv
Collaborator Author

/ok to test 1e7a2a3

@wenqingw-nv wenqingw-nv enabled auto-merge May 30, 2026 12:38
@liruilong940607
Collaborator

The visual results in the native column are choppy, which suggests something is wrong with the current implementation. This needs to be revisited and fixed.

Let's not add this to the public-facing doc until we are sure it is correctly implemented.

Collaborator


Let's not check videos into this repo. You can send me the videos on Slack and I will host them.

Comment on lines +971 to +974
# Run the reconstituted-context prefill once at the first
# denoising step of each chunk past the first; the
# ``prefill_completed_for_chunk`` latch suppresses re-runs on
# subsequent scheduler steps within the chunk.
Collaborator


This slipped through my previous review. Why would we need prefill_memory_kv? The BlockKVCache in flashdreams should naturally work with CUDAGraph.

I traced it and it seems to lead to HyWorldPlayMemoryKVCache, which stores roped K and roped V. What's the reason we can't use BlockKVCache for that? And what's the reason it can't be compatible with CUDA graphs like other models?

Collaborator Author

@wenqingw-nv wenqingw-nv Jun 2, 2026


HyWorldPlayMemoryKVCache is separate because memory isn't a rolling window: it's a per-chunk re-selection of non-contiguous frames, so BlockKVCache doesn't map onto it directly. The non-graphability is just a stopgap. Since the selected set is always exactly 16, a fixed-size in-place buffer would be graph-compatible. This path also causes the choppy output: the memory keys sit at collapsed RoPE positions [0..16) regardless of identity, so on the first window slide a frame's encoding shifts and the scene jolts.

I'll rework it into a fixed-size, consistently-positioned in-place buffer to fix both, and numerically diff it against vendor's prefill first. Which do you think we should do: gather K/V from retained history, or re-prefill each chunk (vendor re-prefills)? @liruilong940607
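
To make the collapsed-position issue in the reply above concrete, here is a toy sketch. It uses a 4-frame selection instead of 16 for readability, and every name in it is hypothetical. The identity-keyed assignment is only one possible reading of "consistently-positioned", not the planned implementation.

```python
from typing import Callable, Dict, List, Sequence

Assign = Callable[[Sequence[int]], Dict[int, int]]


def collapsed_positions(selected: Sequence[int]) -> Dict[int, int]:
    """Assign positions 0..len-1 by order in the selection, ignoring identity."""
    return {frame: slot for slot, frame in enumerate(selected)}


def identity_positions(selected: Sequence[int]) -> Dict[int, int]:
    """Assign each frame a position derived from its own index."""
    return {frame: frame for frame in selected}


def shifted_frames(
    before: Sequence[int], after: Sequence[int], assign: Assign
) -> List[int]:
    """Frames kept across a re-selection whose assigned position changed."""
    pos_before = assign(before)
    pos_after = assign(after)
    return sorted(
        frame
        for frame in pos_before
        if frame in pos_after and pos_before[frame] != pos_after[frame]
    )


if __name__ == "__main__":
    before = [0, 2, 5, 9]   # memory frames selected for one chunk
    after = [2, 5, 9, 12]   # selection after the window slides by one frame
    print(shifted_frames(before, after, collapsed_positions))
    print(shifted_frames(before, after, identity_positions))
```

With collapsed positions, every retained frame changes position as soon as the oldest frame drops out, so its rotary encoding changes even though its content did not. Keying positions to frame identity leaves retained frames untouched.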

…sted URLs

Per review on #231: don't check videos into the repo. Untrack the five
_static sample mp4s, gitignore the dir, and repoint the model-card
<source>s at research.nvidia.com-hosted URLs (LingBot convention). The
clips stay on disk locally to hand off for hosting. Builds clean under
sphinx-build -W.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wenqingw-nv wenqingw-nv marked this pull request as draft June 2, 2026 05:31
auto-merge was automatically disabled June 2, 2026 05:31

Pull request was converted to draft


2 participants